A Guided Tour of Artificial Intelligence Research, Vol. 2 (ISBN 9783030061661, 9783030061678)



English, 529 pages, 2020


Table of contents:

General Presentation of the Guided Tour of Artificial Intelligence Research
Contents
Preface: AI Algorithms
Foreword: Algorithms for Artificial Intelligence
1 Introduction
2 From Graphs of Subproblems to State Graphs
3 Sliding-Puzzles, a Fertile Challenge for Heuristically Ordered Search
4 At the Beginning Was A*
4.1 Guiding Search With Heuristic Estimates
4.2 Conditions Under Which A* Stops When Discovering a Minimal Path
4.4 Comparison Between Algorithms A* When Their Heuristics h Are Underestimating and Monotone
6 Bidirectional Heuristically Ordered Search A*
7 Relaxations of A*: Sub-admissible Algorithms
8 Fortunately IDA* Came
9 Inventing Heuristics as Improvements of Those Already Known
10 Combining the Heuristic Estimates Obtained for Subproblems
11.1 State Graphs: We Can Be Less Demanding
11.3 States to be Developed: Other Informed Choices
12 Conclusion
References
1 Introduction
2.1 Genetic Algorithms
2.2 Local Search
3.1 Greedy Randomised Algorithms
3.2 Estimation of Distribution Algorithms
3.3 Ant Colony Optimisation
4.1 Memetic Algorithms
5 Intensification Versus Diversification
6.1 Satisfiability of Boolean Formulas
6.2 Constraint Satisfaction Problems
7 Discussions
References
1 Introduction
2 First-Order Logic
3.1 Transformation to Clausal Form
3.2 Unification
3.3 The Superposition Calculus
3.4 Redundancy Elimination
3.5 Implementation Techniques
3.6 Termination: Some Decidable Classes
3.7 Proof by Instantiation
3.9 Model Building
4.1 Propositional Tableaux
4.2 Tableaux for First Order Logic
4.3 Free Variable Tableaux
5.1 An Implicit Tableaux Calculus
5.2 An Explicit Tableaux Calculus
6.1 Induction
6.2 Logical Frameworks or Proof Assistants
7 Conclusion
References
1 Introduction
2.1 From Logic to Logic Programming
2.2 Logic Programming with Horn Clauses
2.3 The Prolog Language
3 Constraint Logic Programming
3.1 The CLP Scheme and CLP(R)
3.2 CLP(FD)
3.3 Writing Your Own Solver with CLP
3.4 Some CLP Systems
4 Answer Set Programming
4.1 Theoretical Foundations
4.2 ASP and (More or Less) Traditional Logic
4.3 Knowledge Representation and Problem Resolution in ASP
4.4 ASP Solvers
4.5 Discussion
References
1 Introduction
2 Reasoning in Propositional Logic
2.2 Reasoning is Proving a Semantic Consequence
2.3 Simplifications and Normalized Rewritings of Sub-formulas
2.4 Reasoning by Calculus: The Resolution Rule
3.1 SAT from a Theoretical Point of View
3.2 SAT from a Practical Point of View
4.1 Incomplete Algorithms
4.2 Complete, Systematic Algorithms
4.3 Preprocessing of Formulas
4.5 On the Limits and Challenges for Sat Solvers
5.1 Principles of KC
5.2 The “Beginnings” of KC: Prime Implicates and Implicants
5.4 Abstraction of KC Principles
5.5 Examples of KC Applications: ATMS and Model-Based Diagnosis
6 Quantified Boolean Formulas
6.1 Solving QBF in Practice
7 Conclusion
References
1 Introduction
2 Definitions
3 Chronological Backtracking
4 Constraint Propagation
4.1 Consistency on One Constraint at a Time
4.2 Strong Consistencies
5 Polynomial Cases
6 Solution Synthesis and Decompositions
7 Improving Chronological Backtracking
7.1 Look Back
7.2 Look Ahead
7.3 Variable and Value Ordering Heuristics
8.1 Non-standard Backtracking Search
8.2 Large Neighborhood Search
9 Global Constraints
10 Conclusion
References
1 Introduction
2 Valued Constraint Networks
2.1 Valuation Structure
2.2 Cost Function Networks
2.3 Operations on the Cost Functions
2.4 Links with Other Approaches
3 Dynamic Programming and Variable Elimination
3.1 Partial Variable Elimination or “Mini-Buckets”
4 Search for Optimal Solutions
5 Propagation of Valued Constraints
6 Complexity and Tractable Classes
7 Solvers and Applications
8 Conclusion
References
Belief Graphical Models for Uncertainty Representation and Reasoning
1 Introduction
2.1 Probability Theory
2.2 Conditional Independence
2.4 Graphical Encoding of Independence Relations
3 Probabilistic Graphical Models
3.1 Bayesian Networks
4.1 Main Reasoning Tasks
5.1 Parameter Learning
5.2 Structure Learning
5.4 Classification
6.1 Influence Diagrams
6.2 Dynamic Bayesian Networks
6.3 Credal Networks
6.4 Markov Networks
7 Non Probabilistic Belief Graphical Models
7.1 Possibilistic Graphical Models
7.2 Kappa Networks
8.1 Main Application Domains
8.3 Software Platforms for Modeling and Reasoning with Probabilistic Graphical Models
9 Conclusion
References
1 Introduction
2 Probabilistic Logics: The Laws of Thought?
3 Bayesian Networks and Their Diagrammatic Relational Extensions
4 Probabilistic Logic Programming
5 Probabilistic Logic, Again; and Probabilistic Description Logics
6 Markov Random Fields: Undirected Graphs
7 Probabilistic Programming
8 Inference and Learning: Some Brief Words
References
1 Introduction
2 Classical Planning
2.1 Propositional STRIPS Planning Framework
2.2 A Language for the Description of Planning Problems: PDDL
2.3 Structural Analysis of Problems in Classical Planning
2.4 Main Algorithms and Planners
3.1 The Markov Decision Processes Framework
3.2 Intensional Representation of MDPs
3.3 Algorithms and Planners
4.1 Partially Observed Markov Decision Processes
4.2 Markov Decision Processes and Learning
5 Conclusion
References
1 Introduction
2.2 Alpha-Beta
2.3 Transposition Table
2.4 Iterative Deepening
2.6 Other Alpha-Beta Enhancements
3.1 Monte Carlo Evaluation
3.2 Monte Carlo Tree Search
4.1 A*
4.2 Monte Carlo
5.1 Endgame Tablebases
5.2 Pattern Databases
6 AI in Video Games
6.1 Transitioning from Classical Games to Video Games
6.2 AI in the Game Industry: Scripting and Evolution
6.3 Research Directions: Adaptive AI and Planning
6.4 New Research Directions for Video Game AI
References
1 Introduction
2 Classical Scenarios for Machine Learning
2.1 The Outputs of Learning
2.2 The Inputs of Learning
3.1 Three Questions that Shape the Design of Learning Algorithms
3.2 Unsupervised Learning
3.3 Supervised Learning
3.4 The Evaluation of Induction Results
3.5 Types of Algorithms
4 Clustering
4.1 Optimization Criteria and Exact Methods
4.2 K-Means and Prototype-Based Approaches
4.3 Generative Learning for Clustering
4.4 Density-Based Clustering
4.5 Spectral Clustering, Non Negative Matrix Factorization
4.6 Hierarchical Clustering
4.7 Conceptual Clustering
4.8 Clustering Validation
5 Linear Models and Their Generalizations
6.1 The Multi-layer Perceptron
6.2 Deep Learning: The Dream of the Old AI Realized?
6.3 Convolutional Neural Networks
6.4 The Generative Adversarial Networks
6.5 Deep Neural Networks and New Interrogations on Generalization
7 Concept Learning: Structuring the Search Space
7.1 Hypothesis Space Structured by a Generality Relation
7.2 Four Illustrations
8 Probabilistic Models
9 Learning and Change of Representation
10.1 Semi-supervised Learning
10.2 Active Learning
10.3 Online Learning
10.4 Transfer Learning
10.5 Learning to Rank
10.6 Learning Recommendations
11 Conclusion
References
1 Introduction
2.1 Context, Concepts and the Concept Lattice
2.2 Rules and Implications
2.4 The Stability Measure
3.1 Introduction
3.2 Interval Pattern Structures
3.3 Projections and Representation Context for Pattern Structures
4.1 Relational Concept Analysis
4.2 Graph-FCA
5 Triadic Concept Analysis
6 Applications
References
Constrained Clustering: Current and New Trends
1 Introduction
2.1 Cluster Analysis
2.2 User Constraints
3.1 k-Means
3.3 Spectral Graph Theory
4.1 Overview
4.2 Integer Linear Programming
4.3 Constraint Programming
5.1 Ensemble Clustering
5.2 Collaborative Clustering
6.1 Interactive and Incremental Constrained Clustering
6.2 Beyond Constraints: Exploratory Data Analysis
7 Conclusions
References
1 Operations Research
3 The Common Fight of OR and AI against Complexity
4 Conclusion
References
Index

Pierre Marquis, Odile Papini, Henri Prade (Editors)

A Guided Tour of Artificial Intelligence Research 2: AI Algorithms

A Guided Tour of Artificial Intelligence Research

Pierre Marquis, Odile Papini, Henri Prade



Editors

A Guided Tour of Artificial Intelligence Research Volume II: AI Algorithms


Editors

Pierre Marquis
CRIL-CNRS, Université d’Artois and Institut Universitaire de France
Lens, France

Odile Papini
Aix Marseille Université, Université de Toulon, CNRS, LIS
Marseille, France

Henri Prade
IRIT, CNRS and Université Paul Sabatier
Toulouse, France

ISBN 978-3-030-06166-1
ISBN 978-3-030-06167-8 (eBook)
https://doi.org/10.1007/978-3-030-06167-8

© Springer Nature Switzerland AG 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

General Presentation of the Guided Tour of Artificial Intelligence Research

Artificial Intelligence (AI) is more than sixty years old. It has a singular position in the vast fields of computer science and engineering. Though AI is nowadays largely acknowledged for various developments and a number of impressive applications, its scientific methods, contributions, and tools remain unknown to a large extent, even in the computer science community. Notwithstanding introductory monographs, there exists no treatise offering a detailed, up-to-date, yet organized overview of the whole range of AI research. This is why it was important to review the achievements and take stock of recent AI work at the international level. This is the main goal of this Guided Tour of Artificial Intelligence Research.

This set of books is a fully revised and substantially expanded version of a panorama of AI research previously published in French (by Cépaduès, Toulouse, France, in 2014), with a number of entirely new or renewed chapters. For such a huge enterprise, we have largely benefited from the support and expertise of the French AI research community, as well as of colleagues from other countries. We heartily thank all the contributors for their commitment and work, without which this quite special venture would not have come to an end. Each chapter is written by one or several specialists of the area considered.

This treatise is organized into three volumes: the first volume gathers twenty-three chapters dealing with the foundations of knowledge representation and reasoning formalization, including decision and learning; the second volume offers an algorithm-oriented view of AI, in fourteen chapters; the third volume, in sixteen chapters, proposes overviews of a large number of research fields that are related to AI at the methodological or at the applicative level.


Although each chapter can be read independently from the others, many cross-references between chapters, together with a global index, facilitate a nonlinear reading of the volumes. In any case, we hope that readers will enjoy browsing the proposed surveys, and that some chapters will pique their curiosity and stimulate their creativity.

July 2018

Pierre Marquis Odile Papini Henri Prade

Contents

Heuristically Ordered Search in State Graphs
Henri Farreny

Meta-heuristics and Artificial Intelligence
Jin-Kao Hao and Christine Solnon

Automated Deduction
Thierry Boy de la Tour, Ricardo Caferra, Nicola Olivetti, Nicolas Peltier and Camilla Schwind

Logic Programming
Arnaud Lallouet, Yves Moinard, Pascal Nicolas and Igor Stéphan

Reasoning with Propositional Logic: From SAT Solvers to Knowledge Compilation
Laurent Simon

Constraint Reasoning
Christian Bessiere

Valued Constraint Satisfaction Problems
Martin C. Cooper, Simon de Givry and Thomas Schiex

Belief Graphical Models for Uncertainty Representation and Reasoning
Salem Benferhat, Philippe Leray and Karim Tabia

Languages for Probabilistic Modeling Over Structured and Relational Domains
Fabio Gagliardi Cozman

Planning in Artificial Intelligence
Régis Sabbadin, Florent Teichteil-Königsbuch and Vincent Vidal

Artificial Intelligence for Games
Bruno Bouzy, Tristan Cazenave, Vincent Corruble and Olivier Teytaud

Designing Algorithms for Machine Learning and Data Mining
Antoine Cornuéjols and Christel Vrain

Formal Concept Analysis: From Knowledge Discovery to Knowledge Processing
Sébastien Ferré, Marianne Huchard, Mehdi Kaytoue, Sergei O. Kuznetsov and Amedeo Napoli

Constrained Clustering: Current and New Trends
Pierre Gançarski, Thi-Bich-Hanh Dao, Bruno Crémilleux, Germain Forestier and Thomas Lampert

Afterword—Artificial Intelligence and Operations Research
Philippe Chrétienne

Index

Preface: AI Algorithms

The simulation of “intelligent” behaviors, which is the main purpose of AI, is a complex task. It requires the identification of a number of concepts and processes on which such behaviors are anchored. These concepts (beliefs, preferences, actions, etc.) and processes (learning, inference, decision, etc.) need to be modeled. It is thus mandatory to define models that are suited to the reality one wants to deal with (in particular, those models must take into account, as much as possible, the available pieces of information) and that offer “good properties” (from a normative or from a descriptive point of view). The concepts, once modeled, must be represented. This requires defining and studying representation languages (a piece of information pertaining to a given model having, in general, several possible representations). The choice of a suitable representation language typically depends on several criteria (expressiveness and succinctness, among others). Processes must also be modeled to render a computer simulation feasible from the representations at hand. This involves the design, study, and evaluation of dedicated algorithms.

This volume of the guided tour of AI research focuses on this last issue and aims to present, in fourteen chapters, the main families of algorithms developed or used in AI to learn, to infer, and to decide. The first two chapters (Chapters “Heuristically Ordered Search in State Graphs” and “Meta-heuristics and Artificial Intelligence”) deal with heuristic search and meta-heuristics, two families of approaches forming the backbone of many AI algorithms.

The next three chapters are about algorithms that process logic-based representations; they present, respectively, the computational problems encountered in automated deduction (Chapter “Automated Deduction”), those of logic programming (Chapter “Logic Programming”), already evoked in the foreword of this volume, and finally those of classical propositional logic (including the famous SAT problem, which plays a key role in complexity theory). The next three chapters are focused on algorithms suited to graph-based representations: standard constraint networks (Chapter “Constraint Reasoning”), valued constraint networks (Chapter “Valued Constraint Satisfaction Problems”), and probabilistic and non-probabilistic graphical models (Chapter “Belief Graphical Models for Uncertainty Representation and Reasoning”). Chapter “Languages for Probabilistic Modeling Over Structured and Relational Domains” considers other extensions of Bayes nets (including probabilistic relational models and Markov random fields) and settings which combine both logic-based representations and probabilities.

The last five chapters, like the first two, present AI algorithms which are not grounded on specific representation frameworks, but have been developed to simulate specific processes (or, more generally, specific families of such processes), such as planning (Chapter “Planning in Artificial Intelligence”), playing (Chapter “Artificial Intelligence for Games”), various forms of learning (Chapter “Designing Algorithms for Machine Learning and Data Mining”), discovering knowledge from data in the setting of formal concept analysis (Chapter “Formal Concept Analysis: From Knowledge Discovery to Knowledge Processing”), and clustering pieces of data (Chapter “Constrained Clustering: Current and New Trends”). The volume ends with an afterword linking and comparing how some optimization issues are addressed in operations research (OR) and in AI, and how the two approaches can complement each other.

Lens, France
Marseille, France
Toulouse, France

Pierre Marquis
Odile Papini
Henri Prade

Foreword: Algorithms for Artificial Intelligence

When was the term Artificial Intelligence created? John McCarthy, at Stanford University, wanted to organize a conference on Complex Information Processing, which included learning, planning, chess playing, robot control, automated theorem proving, automated translation, speech recognition, etc. But the name of the conference was not appealing enough, he told me, so he proposed this much flashier term in order to obtain funding. That was summer 1956. In fifty years a lot has changed, because computers have made much progress. From 1971 to 2013, the number of transistors within a microprocessor increased from thousands to billions. The speed and memory size of computers have practically doubled every year. The date 1971 is precisely when Nils Nilsson, in his book Problem-Solving Methods in Artificial Intelligence, outlined various tree searches in games. Now, with computers that are a million times faster, heuristic search algorithms take on their full significance. For example, Tomas Rokicki, Herbert Kociemba et al. showed in 2010 that it is possible to solve the Rubik’s Cube in at most 20 moves; this result was obtained by an exhaustive search over the 43,252,003,274,489,856,000 possible positions. Game algorithms are the subject of Chapters “Heuristically Ordered Search in State Graphs” and “Artificial Intelligence for Games” of this volume.

This speed and this capacity have also benefited everything related, directly or indirectly, to computational logic: current Prolog is a powerful programming language, and propositional provers manipulate statements with 10^6 propositions. It is automated theorem proving, as developed by Alan Robinson, that will serve as a guideline to show what has been developed in each chapter by the authors of this volume. We use formulas like

∀x (Brother(Paul, x) → ∃y (Egal(y, Mother(Paul)) ∧ Egal(y, Mother(x))))

which are made of atomic formulas, connectives and quantifiers, the atomic formulas themselves being predicates over terms built from functional symbols and variables. This is first-order logic. It is always possible, after introducing Skolem functions to eliminate the existential quantifications, to reduce to the clause form

L1 ∨ … ∨ Ln

where each Li is an atomic formula or the negation of such a formula; it is implied that the only permitted quantification is universal. In 1965, Alan Robinson, in his article A Machine-Oriented Logic Based on the Resolution Principle, showed that in this case only one inference rule is needed:

from A ∨ L1 ∨ … ∨ Lm
and from ¬A ∨ Lm+1 ∨ … ∨ Lm+n
deduce L1 ∨ … ∨ Lm ∨ Lm+1 ∨ … ∨ Lm+n.

In fact, this is only true if the literals do not contain variables. In general, we must first unify the atomic formulas, computing a substitution, and apply it to the result. At the same time, he also noted that in this case we can limit ourselves to Herbrand interpretations: a variable is interpreted as a term without variables, and functional symbols are interpreted by themselves. Automated theorem proving has developed significantly: for example, William McCune, with Otter (one of many provers), found in 1998 the proof that Robbins algebras are Boolean algebras. I found this result at the beginning of Chapter “Automated Deduction” of this volume. The remainder of that chapter is devoted to non-classical logics and to the tableaux method. This method, introduced by Raymond Smullyan and Jaakko Hintikka, works directly on the original logical formula, with all its quantifiers, while also using unification.

An important restriction is to use only clauses with one and only one literal without negation,

A0 ∨ ¬A1 ∨ … ∨ ¬An, that is A0 ← A1 ∧ … ∧ An,

where the Ai are atomic formulas. As shown by Robert Kowalski and Maarten van Emden, such a set of formulas has a minimal Herbrand model (predicates true as rarely as possible) in which it is possible to compute; see the beginning of Chapter “Logic Programming” of this volume. So assertions do not play the same role as negations. The Prolog language was able to develop thanks to this concept, but with non-declarative additions such as the predicates cut and setof, and input and output, without which it would not be a real programming language. By the way, we want to pay tribute to David Warren, whose compiler truly established this programming language. Extensions of unification, better seen as constraint solving, allowed me to deal with linear arithmetic over the reals in Prolog III. They also helped Mehmet Dincbas, Pascal Van Hentenryck, and Helmut Simonis to deal with finite domains in CHIP. Meanwhile, Joxan Jaffar and Jean-Louis Lassez introduced the concept of Constraint Logic Programming, which gives a theoretical framework for all this. A constraint-solving formalism over finite domains, the CSP, is particularly studied in Chapters “Constraint Reasoning” and “Valued Constraint Satisfaction Problems” of this volume. We are not far from Operations Research; see the afterword.

A weaker restriction is to use only clauses of the form

A ← A1 ∧ … ∧ Am ∧ ¬Am+1 ∧ … ∧ ¬An,

where the negation is interpreted as an extension of negation by failure: the stable models of Michael Gelfond and Vladimir Lifschitz. This is amply developed in the second part of Chapter “Logic Programming” of this volume. In particular, we see the relationship with circumscription, by which John McCarthy introduced an additional set of formulas to give such formulas a usual logical sense. Another restriction is that each clause be of the form

Q1 ∨ … ∨ Qn,

each Qi being either a predicate with zero arguments (a propositional variable) or its negation. This is the SAT problem, which is the finite-domain problem par excellence; it has many applications in industry. It is treated by the excellent Chapter “Reasoning with Propositional Logic: From SAT Solvers to Knowledge Compilation” of this volume. We should also mention the solving of NP-hard combinatorial problems in Chapter “Meta-heuristics and Artificial Intelligence” of this volume.

Chapters “Belief Graphical Models for Uncertainty Representation and Reasoning”, “Planning in Artificial Intelligence”, and “Designing Algorithms for Machine Learning and Data Mining” of this volume relate to uncertainty, planning, and learning. These themes are vital. Uncertainty is important because our data are often approximate; planning, to provide a series of actions to be performed, such as driving an autonomous vehicle. Finally, automatic learning is still cruelly missing: if machines could learn by themselves, it would be a major advance, and we would not have much left to do…

Alain Colmerauer (1941–2017)
Formerly with LIS-CNRS, Institut Universitaire de France, Aix-Marseille Université, Marseille, France
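The resolution rule recalled in the foreword above can be made concrete in a few lines of code. The following is an editorial sketch, not part of Colmerauer's text: it implements the propositional (variable-free) case of the rule, exactly the case for which, as noted above, no unification is needed. The clause encoding (frozensets of string literals, with '~' marking negation) and the three-clause example are invented for illustration.

```python
from itertools import product

def resolve(c1, c2):
    """All resolvents of two clauses (frozensets of string literals).

    From A v L1 v ... v Lm and ~A v Lm+1 v ... v Lm+n,
    deduce L1 v ... v Lm v Lm+1 v ... v Lm+n.
    """
    resolvents = []
    for lit in c1:
        neg = lit[1:] if lit.startswith('~') else '~' + lit
        if neg in c2:
            resolvents.append((c1 - {lit}) | (c2 - {neg}))
    return resolvents

# Refutation by saturation: try to derive the empty clause from an
# unsatisfiable set, here {A, A -> B, ~B} in clause form.
clauses = {frozenset({'A'}), frozenset({'~A', 'B'}), frozenset({'~B'})}
derived = set(clauses)
while True:
    new = set()
    for c1, c2 in product(derived, repeat=2):
        for r in resolve(c1, c2):
            new.add(frozenset(r))
    if frozenset() in new:
        print("empty clause derived: the set is unsatisfiable")
        break
    if new <= derived:
        print("saturated without the empty clause")
        break
    derived |= new
```

Refutation proceeds by saturation: the rule is applied to all pairs of clauses until either the empty clause appears (unsatisfiability) or no new clause can be derived. First-order resolution adds exactly what the foreword describes: unification of the atomic formulas before applying the same rule.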

Heuristically Ordered Search in State Graphs

Henri Farreny

Abstract Heuristic search in state graphs is one of the oldest topics of artificial intelligence. Initially, the theoretical and practical interest of this research area was demonstrated by applying it to solving games such as sliding puzzles. We first present the definitions and properties of the widely used algorithms A* and IDA*, which are able to find solutions of minimum length, when the length of a solution is simply defined as the sum of the costs of its components. The gradual construction of a minimal solution is guided by evaluation functions that can take into account pieces of knowledge coming from empirical data; when these functions satisfy certain relations, the discovery of a minimal solution is ensured. Several relaxations of these algorithms, which produce solutions of ‘almost minimal’ length or which are less computationally demanding, are discussed. Some extensions, based on other notions of the length of a solution and of evaluation functions, are also recalled.

1 Introduction

In the early 1970s, the techniques of problem solving by heuristic search in graphs appeared as one of the main branches of the young discipline that was then asserting itself under the name of artificial intelligence (AI). The reference book of the period 1970–1980, “Problem-Solving Methods in Artificial Intelligence” (Nilsson 1971), testifies that this topic has been important since the beginning of AI. This importance is also confirmed by the tables of contents of the first volumes of the series “Machine Intelligence” (since 1967), the programs of the first worldwide “IJCAI” conferences (International Joint Conference on Artificial Intelligence, since 1969) and the summaries of the first issues of the journal “Artificial Intelligence” (since 1970). The subject remained prominent in most of the books that have accompanied the development of AI courses (Winston 1984; Nilsson 1980, 1998; Barr et al. 1982; Rich 1983; Shirai and Tsujii 1984; Charniak and McDermott 1985; Laurière 1986; Farreny and Ghallab 1987; Farreny 1987; Shapiro 1987; Russell and Norvig 2003). Several books focusing on heuristic search for problem solving have also been published (Pearl 1990; Kanal and Kumar 1988; Farreny 1995).

H. Farreny (B): Previously with IRIT-CNRS, Université Paul Sabatier (now retired), Toulouse, France. e-mail: [email protected]
© Springer Nature Switzerland AG 2020. P. Marquis et al. (eds.), A Guided Tour of Artificial Intelligence Research, https://doi.org/10.1007/978-3-030-06167-8_1

2 From Graphs of Subproblems to State Graphs Let us sketch the principle of heuristic search in graphs. We assume known the notions of graph, node (i.e., vertex), arc and path (in so-called directed graphs), chain (in so-called undirected graphs), and subgraph of a graph. A node N is a son of a node M in a directed graph as soon as there exists an arc (M, N) from M to N; in an undirected graph, as soon as there exists an edge {M, N}. If N is a son of M, then M is a father of N, and vice versa. A general process to attack a problem is to try to break it down into subproblems, then to break those down in turn, and so on, until only primitive problems remain, i.e., problems whose solution is considered immediately accessible. The possible decompositions can be combined into a graph of the subproblems (also named and/or graph1), such as the fragment presented in Fig. 1. The part of this graph that includes the p + 2 nodes PB, D1, SPB1,1, . . . , SPB1,p and the p + 1 arcs U1, V1,1, . . . , V1,p expresses a possible way to attack the problem represented by the node PB: this problem can be solved if we solve the conjunction of subproblems represented by the p nodes SPB1,1 to SPB1,p. Similarly, the problem can be solved if we solve the conjunction of subproblems represented by the nodes SPBi,1 to SPBi,q for i = 2, 3, . . . , n. In the end, Fig. 1 shows n ways to attack the problem PB by decomposition. The nodes PB and SPBi,j are called OR nodes, while the nodes such as Di are called AND nodes. Di is one of the n sons of PB; SPBi,1, . . . , SPBi,q are all the sons of Di. Each node without a son represents an elementary problem. The explicit construction of the graph of the subproblems is avoided or excluded for reasons of cost (space and time). Solving the problem represented by PB consists in building a subgraph SG of the graph G of the subproblems such that:

1. SG includes PB,
2. each OR node of SG that does not represent an elementary problem retains only one of its sons in G and the arc oriented to this son,
3. each AND node of SG retains all the sons it has in G and the arcs that join it to each son.

The search for such a subgraph can be guided by knowledge of more or less empirical origin. This approach is called heuristically ordered search2 in a graph of subproblems. The corresponding algorithms differ according to the way they order

1 The and/or graphs are special cases of hyper-graphs (Berge 1970).
2 Heuristic comes from Greek heuriskein, which conveys the ideas of finding, and of what is useful for discovery.

Heuristically Ordered Search in State Graphs


Fig. 1 Fragment of a graph of subproblems. We can solve the initial problem PB by solving any one of the n conjunctions of subproblems: [SPB1,1 and . . . and SPB1,p] or . . . [SPBi,1 and . . . and SPBi,q] or . . . [SPBn,1 and . . . and SPBn,r]. In turn, each SPBi,j can be broken down as long as it is not an elementary problem

Fig. 2 Fragment of a state graph. We can solve the problem PB by solving one or the other of the subproblems SPB1 or . . . or SPBi or . . . or SPBn, then iterating

the expansion and the exploration of the graph of the subproblems (and according to their resulting effectiveness). Usually, one considers the special case where any decomposition i applicable to a problem PB produces 2 subproblems SPBi,1 and SPBi,2 with, systematically, SPBi,1 being elementary. We can then adopt a simpler representation (see Fig. 2): the arc Ui joins PB directly to SPBi,2 (now denoted SPBi); the node Di and the arcs Vi,1 and Vi,2 disappear: Ui represents the immediate resolution of SPBi,1. The new representation expresses that solving PB is transformed into solving SPB1 or SPB2 or . . . or SPBn. Iterating on each subproblem that is not an elementary problem, we get a state graph.3 The explicit construction of the state graph is avoided or excluded for reasons of cost (space and time). Solving the initial problem consists in building a path from PB to an elementary problem, i.e., a sequence of nodes beginning with PB and terminating with an elementary problem, such that each node, except the first, is a son of its predecessor in the sequence. This approach is called heuristically ordered search in a state graph.

3 Representable by a simple graph rather than a hypergraph (Berge 1970).
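To fix ideas, the reduced representation can be sketched in a few lines of code. The toy graph, names and helper functions below are illustrative assumptions, not from the chapter: the state graph is given implicitly by a successor function, and solving means building a path from the initial problem to an elementary one; a plain depth-first search suffices when minimality is not required.

```python
# A minimal sketch: an implicit state graph given by a successor function, and
# a depth-first search that builds a path from the initial problem PB to some
# elementary (primitive) problem. All names here are illustrative.

def find_path(start, successors, is_elementary):
    """Return a list of states from `start` to an elementary state, or None."""
    stack = [(start, [start])]
    visited = {start}
    while stack:
        state, path = stack.pop()
        if is_elementary(state):
            return path
        for son in successors(state):
            if son not in visited:
                visited.add(son)
                stack.append((son, path + [son]))
    return None

# Toy state graph: each problem reduces to simpler ones; 'E' is elementary.
graph = {'PB': ['SPB1', 'SPB2'], 'SPB1': ['SPB3'], 'SPB2': ['E'], 'SPB3': []}
print(find_path('PB', lambda s: graph.get(s, []), lambda s: s == 'E'))
# → ['PB', 'SPB2', 'E']
```

The rest of the chapter is precisely about replacing this blind exploration with a heuristically ordered one that returns minimal paths.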


H. Farreny

Formally, search in graphs of subproblems can be transformed into search in state graphs. In practice, specific procedures have been developed. Here, we consider only heuristically ordered search in state graphs. On heuristically ordered search in graphs of subproblems, we can recommend: Pearl (1990), Farreny and Ghallab (1987), Russell (1992), Russell and Norvig (2003). On the comparison of heuristic search in graphs with dynamic programming and branch-and-bound approaches, we refer the reader to Ibaraki (1978), Kumar and Kanal (1988). The results and resources of heuristically ordered search are often ignored in manuals dedicated to graph theory, discrete algorithmics or operational research, despite obvious proximities with these disciplines. Most textbooks oriented to AI teaching present search in state graphs by reproducing a part of the pioneering work in Nilsson (1971). The statement of the properties of the algorithms is often over-constrained and the proofs incomplete. The AI courses that universities and engineering schools provide often introduce heuristically ordered search; this attention is mainly due to the progress realized in solving by computer (rather than by logical analysis) an old but resistant problem: the fifteen-puzzle (the most popular sliding-puzzle).4

3 Sliding-Puzzles, a Fertile Challenge for Heuristically Ordered Search Sliding-puzzles are very popular games. In the most widespread material form (see Fig. 3), 15 rectangular tiles of the same size are set in a rectangular frame that can store 16 juxtaposed tiles, on 4 lines and 4 columns (16 cells); one of the 16 cells is empty; each tile is identified by an exclusive mark (here the numbers from 1 to 15). The only moves allowed are those which slide into the empty cell one of the 2, 3 or 4 contiguous tiles. With n lines and n columns, the game is named (n2 − 1)-puzzle. The objective is to move the tiles one by one in order to transform an initial state (e.g., in Fig. 3: state s1 or state s2) into a fixed goal state (e.g., in Fig. 3: state t), by a minimal number of moves. NB: usually, we are only interested in finding a minimal solution. Producing just any solution (by computer or human exploration) is much less thorny. Producing a "good" solution is still another problem. For instance, the initial state s1 can be transformed by 9 moves into the goal state t and it is impossible with fewer than 9 moves; 80 moves are sufficient, and necessary,

4 Other puzzles provide a field of experimentation and demonstration, e.g., the Rubik's cube. Classic problems of graph theory or operational research have been revisited: optimal itinerary calculus, travelling salesman problem. Among other application fields, let us mention: automatic theorem-proving (Kowalski 1970; Chang and Lee 1973; Minker et al. 1973; Dubois et al. 1987), image processing (Martelli 1976), path finding in robotics (Chatila 1982; Gouzènes 1984; Lirov 1987), automatic generation of natural language (Farreny et al. 1984), spelling correctors (Pérennou et al. 1986), breakdown diagnosis (Gonella 1989), dietetics (Buisson 2008).

Fig. 3 From s1 to t: at least 9 moves. From s2 to t: at least 80 moves. (The figure displays the board configurations of the initial states s1 and s2 and of the goal state t.)

to transform the initial state s2 into the goal state t. This kind of sliding-puzzle is named fifteen-puzzle or (4 × 4)-puzzle. To each state of a fifteen-puzzle we can associate a node of a graph; to each possible move we can associate an edge joining two nodes. The state graph has 16! nodes (i.e., 20 922 789 888 000 ≈ 2 · 1013). Playing is equivalent to searching for a chain in the state graph (in this context of study we demand a minimum number of edges) between the nodes representing the initial state and the goal state. By 1879, a resolvability criterion easy to test5 had been established (Johnson and Storey 1879), through which it appears that the state graph of the fifteen-puzzle is made of two connected components6 with 16!/2 nodes each. Therefore, if we draw the initial state at random we have exactly 1 chance over 2 to fall on a non-solvable problem, and we can easily test whether this is the case. But, when we know that a problem is solvable, there is no formula to obtain a minimal chain or even the length of such a chain. To determine a minimal chain between two nodes (here: the initial state and the goal state) we could consider using one of the classical algorithms of graph theory7; but this type of algorithm assumes that the entire state graph is pre-built in memory; even today, common computers cannot simultaneously represent in main memory 16!/2 nodes for the (4 × 4)-puzzle, or, worse, 25!/2 nodes for the (5 × 5)-puzzle. Other algorithms, taught in graph theory or in algorithmics, are able to gradually build subgraphs of the state graph, called graphs of search, until they discern, in one of them, a minimal chain joining the initial node and the goal node.
So, the breadth-first algorithm, from the initial state s1, generates four son states; since none of them is the goal t, the algorithm produces and examines the sons of each of them, i.e., the set SR2 of the states reachable from s1 in no fewer than 2 moves; repeatedly, the algorithm produces the set SRi of the states reachable from s1 in no fewer than i moves, for i = 3, then 4, . . . , then 8, without finding the goal state t; by building SR9, the algorithm inevitably meets the goal state t; the mode of progression (s1, then SR1, then SR2, . . . , then SR8, then SR9) ensures that the chain discovered from s1 to t is minimal in number of moves.

5 If we associate 0 to the empty cell and if we represent each state by the sequence of the numbers of the 16 cells (for instance: from top to bottom and from left to right), then a problem can be solved if and only if the representation of the goal state is an even permutation of the representation of the initial state.
6 Two nodes of a graph G satisfy the relationship of connectivity if and only if they are connected by at least one chain. It is an equivalence relationship that allows us to distinguish equivalence classes in the set of the nodes of G. If there is only one equivalence class, G is a connected graph. The subgraph of G consisting of all nodes of an equivalence class and all edges of G that connect two nodes of the class is called a connected component of G.
7 For instance, the Moore-Dijkstra algorithm (Dijkstra 1959; Gondran and Minoux 1979).
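The resolvability criterion is indeed easy to program. The sketch below is illustrative (names are assumptions); it implements the commonly stated form of the parity test, which also compares the parity of the empty cell's displacement between the two states, a term that vanishes when both states place the empty cell in the same position.

```python
# Sketch of the resolvability test. States are tuples of 16 values read from
# top to bottom and left to right, with 0 marking the empty cell (footnote 5's
# encoding). Commonly stated form: the permutation taking `initial` to `goal`
# must have the same parity as the empty cell's row + column displacement.

def solvable(initial, goal, width=4):
    pos_in_goal = {v: i for i, v in enumerate(goal)}
    perm = [pos_in_goal[v] for v in initial]          # target cell of each value
    inversions = sum(1 for i in range(len(perm))
                       for j in range(i + 1, len(perm)) if perm[i] > perm[j])
    bi, bg = initial.index(0), goal.index(0)          # empty-cell positions
    blank_dist = abs(bi // width - bg // width) + abs(bi % width - bg % width)
    return inversions % 2 == blank_dist % 2

goal = tuple(range(16))                    # empty cell top-left, tiles 1..15
one_move = (1, 0) + tuple(range(2, 16))    # empty cell slid one step: solvable
two_swapped = tuple(range(14)) + (15, 14)  # tiles 14 and 15 exchanged
print(solvable(one_move, goal), solvable(two_swapped, goal))   # → True False
```

The `two_swapped` instance is the classic unreachable configuration: it lies in the other connected component of the state graph.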


Since Ratner and Warmuth (1990), we know that finding a minimal solution for an arbitrary (n2 − 1)-puzzle problem is NP-hard. However, in practice, we do not face the (n2 − 1)-puzzle for arbitrary n: whatever the complexity that weighs on the general problem, one may wish to solve a particular case or a restricted family of cases with fixed n. Many researchers have described and tested more or less effective techniques to solve problems of the (n2 − 1)-puzzle: first (3 × 3)-puzzles, then (4 × 4)-puzzles, rarely (5 × 5)-puzzles and sometimes other types of sliding-puzzles (different forms and sizes of pieces). In this way, theoretical and algorithmic bases have emerged to formalize and experiment with heuristically ordered search, which may be useful in other areas of application.

4 At the Beginning Was A∗ In 1968, Hart et al. (1968)8 proposed a model of algorithm commonly designated as algorithm A∗. This model, coded in Algorithm 1 below, may be applied to positively arc-valued state graphs9: to each arc (m, n), which represents the transformation of state m into state n, is associated a positive cost c(m, n); the length of a path is measured by the sum of the costs of the arcs10 that compose it; the algorithm seeks to build a path of minimal length11 from the initial state s (supposed unique) to the goal node t. We can also consider a set of goal states, defined explicitly or by a group of criteria. The algorithm builds a series of graphs of search GS0, GS1, . . . , GSx, . . . , which are subgraphs of the state graph. We shall see that each GSx is an arborescence with s as a root. In a directed graph, a node r is a root if and only if: 1. r has no father, 2. and for any node n different from r, there is a path from r to n (obviously, if a root exists it is unique). A graph is an arborescence if and only if: 1. it includes a root r, 2. and any node n different from r has exactly one father. GS0 includes the initial state s only. Schematically: GSx is derived from GSx−1 by developing a node of GSx−1, i.e., selecting a node m of GSx−1 then generating

8 See also Nilsson (1971, 1980, 1998).
9 We reason here in the setting of directed graphs. The model can be applied to undirected graphs: substitute edge for arc, chain for path. The state graph of (n2 − 1)-puzzles can be represented by an undirected graph as well as by a directed graph (each edge of the undirected graph corresponds to 2 opposite arcs of the directed graph).
10 In (n2 − 1)-puzzles, the cost of an arc (or edge) is commonly considered equal to 1.
11 In (n2 − 1)-puzzles, the objective is commonly to minimize the number of moves from s to t.


all the sons12 of m; at the end of the xth development, GSx is substituted for GSx−1 in main memory. The specificity of A∗, compared to the breadth-first algorithm,13 lies in the mode of selection of m described further. Under conditions that we shall specify, A∗ stops after d developments, discovering in GSd (the graph of search at the end of the dth development) a minimal path14 C from s to t. That is to say: the length of any other path from s to t in the state graph is greater or equal. The resolution is considered all the more effective as GSd is "smaller" (saving nodes and arcs). The behavior of the algorithm would be perfect if GSd were reduced to a path from s to t. The works of Nilsson and colleagues showed that with the same computing resources, some problems can be solved efficiently by exploiting empirical knowledge,15 while we cannot solve them (or the task is harder) with algorithms that ignore these pieces of knowledge. These algorithms are called uninformed algorithms or blind search algorithms, because they do not use particular information about the problem area. For example, the breadth-first algorithm proceeds in the same way for solving a (3 × 3)-puzzle or finding an itinerary on a road map; it only exploits the relationships of contiguity between cells of the (3 × 3)-puzzle or between cities on the map, but does not take into account the specific metrics that can be applied to the board of a puzzle or to the geographical space; it ignores the heuristic estimates that may result from such metrics. We now comment on the code of A∗ (a simpler scheme is presented in section "Monte Carlo Evaluation" of chapter "Artificial Intelligence for Games" of this Volume, partly dedicated to games and heuristic search).

4.1 Guiding Search With Heuristic Estimates Each main cycle of the algorithm A∗ (lines 3–21 in Algorithm 1) deals with the development of one state, say m (selected in line 5). The algorithm maintains a front of search which at the start includes only s (line 2). The choice of the node of front which will undergo the (x + 1)th development exploits an evaluation function f; at the end of the xth development, to each node q of front is associated a value fx(q); in order to proceed to the (x + 1)th development, A∗ chooses a node p in front whose fx(p)

12 It is supposed that each state has only a finite number of sons, e.g., 4 at most for the (n2 − 1)-puzzles.
13 And compared to its generalization called the uniform-cost algorithm. Among the nodes never developed before, the breadth-first algorithm favors the development of those nodes reachable from s by a path whose number of arcs is minimal; the uniform-cost algorithm prefers the nodes reachable from s by a path whose sum of arc costs is minimal. While the breadth-first algorithm guarantees to find a path from s to t whose number of arcs is minimal (if such a path exists), the uniform-cost algorithm guarantees to find a path from s to t whose sum of arc costs is minimal (if such a path exists).
14 We shall see that GSd is an arborescence, so C will be the single path joining s to t in GSd.
15 That is to say: coming from experience or observation (without previous theory).

8

H. Farreny

is "better" in the sense that: fx(p) = min{ fx(q) | q ∈ front }. This principle of choice is named

best first (here, "best" means "minimal"). By definition, A∗ uses evaluation functions of the form fx(n) = gx(n) + h(n), where gx(n) is the length of the path16 that joins s to n in GSx, while h(n) is a heuristic estimate with positive or zero values attached to n (line 4): we say that h is static to emphasize that, whatever n, the value h(n) depends only on n, not on the rank of development.17 In the code of the algorithm, whatever n, the successive values of gx(n) are stored in gnoted(n). In practice, it is convenient to program A∗ (and other heuristic search algorithms) in languages suitable for managing lists; e.g., gnoted(n) can be implemented as an element of a list of features attached to state n. At the start, A∗ associates the value gnoted = 0 to s. Next, the manner of updating gnoted (lines 15, 17, 20) guarantees that, at the end of the xth development, gnoted(n) is the minimal length of the paths from s to n discovered during the developments of ranks 1, 2, . . . , x. In this way, whatever n and whatever z > y, gz(n) ≤ gy(n), i.e., whatever n, gx(n) decreases when the rank of development x increases. If g∗(n) denotes the length of a minimal path from s to n in the state graph, we observe that, whatever x, gx(n) ≥ g∗(n). So, gx(n) is an overestimate of the minimal length of the paths from s to n, which decreases when x increases. In the particular case where h is the zero function, the algorithm A∗ becomes the above-mentioned uniform-cost algorithm. This algorithm is so named because it operates as if the estimate of the minimal distance from any state of the current front to the goal t were uniformly a constant, say K; with this hypothesis, for the (x + 1)th development, it is natural to choose the node n which minimizes gx(n) + K, thus the node n which minimizes gx(n). Nilsson and colleagues noticed that in some areas they could operate with better discernment if they replaced the constant K by a function h(n) of the state n.
In their first approaches, h is static. Thus, for experiments on (3 × 3)-puzzles or (4 × 4)-puzzles, Nilsson and colleagues employed as heuristic estimate h(n) of a state n the number W(n) of tiles that are not in the place assigned to them in the goal state t; for instance, in Fig. 3, W(s1) = 6, W(s2) = 15. Later, they used the Manhattan distance P(n) from state n to state t, which we define now; for each tile j, columns(j) designates the absolute value of the difference between the ranks of the columns of j in n and in t; lines(j) designates the absolute value of the difference between the ranks of the lines of j in n and in t; P(n) is calculated as the sum of columns(j) and lines(j) over all the tiles j; for instance, in Fig. 3, P(s1) = 7, P(s2) = 58. Let h∗(n) be the length of a minimal path18 from n to t in the state graph; it is easy to verify that, for any n:

16 For any x, GSx is an arborescence with root s, so there exists a single path from s to n in GSx.
17 Alternatively, we may wish to exploit knowledge about the current state n not reducible to a function of n only (for example, knowledge also depending on the progress of the ongoing search).
18 If it exists; else we let h∗(n) = ∞.
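In code, with states encoded as in footnote 5 (a tuple read left to right and top to bottom, 0 for the empty cell), the two heuristics are a few lines each. The function names and the example state below are illustrative assumptions, not from the chapter.

```python
def W(state, goal):
    """Number of tiles (empty cell excluded) not in their goal cell."""
    return sum(1 for a, b in zip(state, goal) if a != 0 and a != b)

def P(state, goal, width=4):
    """Manhattan distance: sum of line and column offsets over all tiles."""
    goal_pos = {v: divmod(i, width) for i, v in enumerate(goal)}
    total = 0
    for i, v in enumerate(state):
        if v != 0:
            line, col = divmod(i, width)
            gl, gc = goal_pos[v]
            total += abs(line - gl) + abs(col - gc)
    return total

goal = tuple(range(16))                  # empty cell top-left, tiles 1..15
state = (0, 2, 1) + tuple(range(3, 16))  # tiles 1 and 2 exchanged
print(W(state, goal), P(state, goal))    # → 2 2
```

On any state, W(state, goal) ≤ P(state, goal): every misplaced tile contributes exactly 1 to W but at least 1 to P.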


W(n) ≤ P(n) ≤ h∗(n). Thus W and P are underestimates of the minimal distance from n to t.

Algorithm 1: Heuristically ordered search, type A∗ (Hart, Nilsson and Raphaël)
 1 begin
 2   front ← {s}; reserve ← ∅; gnoted(s) ← 0
 3   repeat
 4     best ← all nodes p of front such that gnoted(p) + h(p) is minimal in front
 5     if there exists a goal node, say m, in best, then
 6       repeat
 7         write m; m ← father(m)
 8       until m = s
 9       write "s"; stop
10     m ← any element of best
11     front ← front − {m}; reserve ← reserve + {m}
12     for each son n of m do
13       gnew ← gnoted(m) + c(m, n)
14       if n ∈ front and gnew < gnoted(n) then
15         father(n) ← m; gnoted(n) ← gnew
16       if n ∈ reserve and gnew < gnoted(n) then
17         father(n) ← m; gnoted(n) ← gnew
18         front ← front + {n}; reserve ← reserve − {n}
19       if n ∉ reserve and n ∉ front then
20         father(n) ← m; gnoted(n) ← gnew; front ← front + {n}
21   until front = ∅


The algorithm stops either when front becomes empty (line 21), or when one of the nodes of front with the best evaluation satisfies the definition of the goal states (line 5).
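Algorithm 1 transcribes almost line for line into executable form. The sketch below is an illustrative Python rendering; the example graph, its arc costs and the heuristic are assumptions for demonstration, not from the chapter.

```python
def a_star(s, goals, sons, c, h):
    """Transcription of Algorithm 1: returns a minimal path from s to a goal
    node, or None if the front empties first."""
    front, reserve = {s}, set()
    g, father = {s: 0}, {}                     # g plays the role of gnoted(.)
    while front:
        best_f = min(g[p] + h(p) for p in front)
        best = [p for p in front if g[p] + h(p) == best_f]
        hits = [m for m in best if m in goals]
        if hits:                               # lines 5-9: rebuild via father(.)
            m, path = hits[0], []
            while m != s:
                path.append(m)
                m = father[m]
            return [s] + path[::-1]
        m = best[0]                            # line 10: any element of best
        front.remove(m)
        reserve.add(m)                         # line 11
        for n in sons(m):                      # lines 12-20
            g_new = g[m] + c(m, n)
            if n in front and g_new < g[n]:
                father[n], g[n] = m, g_new
            elif n in reserve and g_new < g[n]:
                father[n], g[n] = m, g_new
                reserve.remove(n)              # n returns into front
                front.add(n)
            elif n not in front and n not in reserve:
                father[n], g[n] = m, g_new
                front.add(n)
    return None

# Illustrative weighted graph and an underestimating heuristic.
arcs = {'s': {'a': 1, 'b': 4}, 'a': {'b': 2, 't': 5}, 'b': {'t': 1}, 't': {}}
h = {'s': 3, 'a': 3, 'b': 1, 't': 0}
path = a_star('s', {'t'}, lambda m: arcs[m], lambda m, n: arcs[m][n], h.get)
print(path)                                    # → ['s', 'a', 'b', 't']
```

With h identically zero, the same function behaves as the uniform-cost algorithm mentioned above.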

4.2 Conditions Under Which A∗ Stops When Discovering a Minimal Path In the frame of arc-valued graphs of search,19 h∗(n) denotes the length of a minimal path from n to the set of goal nodes; if such a minimal path does not exist, we put h∗(n) = ∞; h∗(s) measures the length of a minimal solution, i.e., a minimal path from the initial state s to a goal node. Under the name of admissibility theorem, Nilsson and colleagues established (Hart et al. 1968; Nilsson 1971, 1980) that algorithm A∗ applied to an arc-valued state graph G necessarily stops finding a minimal path from the initial state to a goal node when:

1. G includes at least one path from the initial state to the goal state,
2. and each state of G owns a finite number of sons,
3. and there exists δ > 0 such that for any arc (m, n) of G: c(m, n) ≥ δ,
4. and for any state n of G: h(n) ≥ 0,
5. and for any state n of G: h∗(n) ≥ h(n).

We shall say that the functions h which satisfy condition 5 (on G) are "underestimating" (on G). For the state graphs of the (n2 − 1)-puzzles, conditions 2 and 3 are obviously satisfied. Condition 1 is satisfied if and only if the goal state and the initial state belong to the same connected component; we observed before that the state graph of any (n2 − 1)-puzzle owns two connected components and that it is easy to test whether two states belong to the same component. The heuristics W and P satisfy condition 5; it is also the case if the heuristic h is identically zero; in this case, we obtain a specific admissibility theorem for the uniform-cost algorithm. Thus, by aggregating a term h(n), which is a lower bound of the distance from n to the goal (we may say: an underestimate of the forward cost), and a term gx(n), which measures the shortest path known so far to reach n (we may say: an overestimate of the backward cost), the evaluation function leads the algorithm to find a minimal path.

19 We have introduced h∗ before, but in the specific context of the state graphs of the (n2 − 1)-puzzles.


4.3 Comparison Between Algorithms A∗ When Their Heuristics h Are Underestimating Consider h1 and h2, two heuristics defined for a same state graph G and underestimating on G; h2 is said to be better informed than h1 on G if and only if: for any state n of G which is not a goal, h1(n) < h2(n). Note: the inequality is strict. For instance, for the (n2 − 1)-puzzles, we can partially compare the functions W, P and 0; clearly, the functions W and P are better informed than the zero function; but P is not formally better informed than W: certainly W ≤ P, but there can exist n satisfying W(n) = P(n) without being a goal state. Under the name of optimality theorem,20 Nilsson and colleagues established (Hart et al. 1968; Nilsson 1971, 1980) that if h1 and h2 are heuristics defined for the same state graph G and underestimating on G, such that h2 is better informed than h1 on G, then all the states developed by A∗ using h2 are also developed by A∗ using h1. Note: this result does not guarantee that the algorithm A∗ using h1 executes more developments than the algorithm A∗ using h2, because the same state could be developed more times with h2 than with h1.

4.4 Comparison Between Algorithms A∗ When Their Heuristics h Are Underestimating and Monotone A heuristic h, defined for a state graph G, is said to be monotone on G if and only if: for any state m of G and any son n of m, h(m) ≤ c(m, n) + h(n). This triangular inequality expresses a kind of coherence: the estimate of the shortest path from m to the goals, h(m), is less than or equal to the estimate of the shortest path from n to the goals, h(n), increased by the cost of the arc (m, n). The zero heuristic is monotone on any graph whose arc values are positive or zero. For the (n2 − 1)-puzzles, it is easily verified that the heuristics W and P are monotone. It is easy to demonstrate that if A∗ exploits a monotone heuristic, then no node can be developed twice. This result allows us to refine the optimality theorem when h2 is monotone: the algorithm A∗ which uses h1 executes more developments (or the same number) than the algorithm A∗ which uses h2. For instance, in the context of the (n2 − 1)-puzzles, the algorithm A∗ which uses the zero heuristic (i.e., the uniform-cost algorithm) develops a subset SS0 of the nodes of the state graph, each only once, because the zero function is monotone; the algorithm A∗ which uses the heuristic P (monotone too) develops a subset SSP of the nodes of the state graph, each only once, with SSP ⊆ SS0; in practice, it can be observed that algorithm A∗ with heuristic P often consumes much less memory and time than the uniform-cost algorithm does. Although P is not (strictly) better informed than W, it often consumes less memory and time than W.

20 A better name might be: comparison theorem.
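The monotonicity of P can also be checked numerically. The sketch below is illustrative (it uses a (3 × 3)-puzzle so the exploration stays small): it verifies h(m) ≤ c(m, n) + h(n), with c = 1 for every move, over the first levels of the state graph around the goal.

```python
def P(state, goal, width=3):
    """Manhattan heuristic on a (width x width) sliding-puzzle state."""
    goal_pos = {v: divmod(i, width) for i, v in enumerate(goal)}
    return sum(abs(r - goal_pos[v][0]) + abs(c - goal_pos[v][1])
               for i, v in enumerate(state) if v != 0
               for r, c in (divmod(i, width),))

def sons(state, width=3):
    """All states reachable in one move (slide a tile into the empty cell)."""
    b = state.index(0)
    r, c = divmod(b, width)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        if 0 <= r + dr < width and 0 <= c + dc < width:
            j = (r + dr) * width + (c + dc)
            t = list(state)
            t[b], t[j] = t[j], t[b]
            yield tuple(t)

goal = tuple(range(9))
frontier, seen = [goal], {goal}
for _ in range(4):                   # a few breadth-first levels suffice here
    nxt = []
    for m in frontier:
        for n in sons(m):
            assert P(m, goal) <= 1 + P(n, goal)    # the triangular inequality
            if n not in seen:
                seen.add(n)
                nxt.append(n)
    frontier = nxt
print("monotonicity of P verified on", len(seen), "states")
```

The check passes on every arc examined: a move displaces exactly one tile by one cell, so P changes by exactly ±1 along any arc of cost 1.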


5 Variants in the Programming of A∗ During the 15 years that followed its invention, the algorithm A∗ inspired many variants of coding and implementation intended to improve its performance in terms of space and speed of calculation. The algorithms B (Martelli 1977), C (Bagchi and Mahanti 1983) and B′ (Mero 1984) exploit evaluation functions of the same format as A∗ (with underestimating heuristics h) and lead to admissibility theorems stated in the same terms as for A∗. Despite ingenious contributions, there was no breakthrough in the ability to solve reference problems (especially (n2 − 1)-puzzles) of significant size: A∗ and its substitutes share the defect of maintaining in memory all the states generated. For example, when A∗, using P, solves the problem of Fig. 3, although it is a problem that needs only 9 moves to be solved, the algorithm runs 18 developments and keeps the 45 states generated. In the end, front includes 27 states and reserve 18. These numbers are related to the following mode of implementation: to develop any state we first considered the son obtained by moving the empty cell up, then left, then right, then down. To avoid saturation of the main memory, Chakrabarti and his colleagues designed MA∗ (Memory-bounded A∗), an algorithm A∗ which adapts itself to the available memory space (Chakrabarti et al. 1989); when memory runs out, MA∗ frees the space occupied by the states that have the worst evaluations, but retains some features of the cut branches so that it can restore them if necessary to continue the search; MA∗ presents the great advantage of never blocking the search, but the running time may become prohibitive.

6 Bidirectional Heuristically Ordered Search A∗ To search for a minimal path in a state graph G from a state s to a state t, the idea of controlling the nested executions21 of two algorithms A∗ appeared early. One of them works from s to t in the graph G; the other works from t to s in the reverse graph obtained by reversing the direction of all the arcs of G. A module supervises the allocation of cycles of expansion22 to one or the other of the two A∗; it also watches for the occurrence of an intersection between the two current graphs of search. Pohl was the main pioneer of this approach, with, mainly, his algorithm BHPA: Bi-directional Heuristic Path Algorithm (Pohl 1971); the algorithm A∗ which works from s to t (respectively from t to s) exploits an underestimating and monotone heuristic denoted sh (respectively th) and maintains a front of search sfront (respectively tfront); for any n of sfront (respectively tfront) the heuristic estimate h(n) equals sh(n) (respectively th(n)). From a theoretical point of view, Pohl shows23 that

21 The processor, supposed unique, allocates disjoint periods of operation to the two A∗.
22 A first idea is to give control, at each development cycle, to the algorithm which owns a node with the better evaluation in its front.
23 Pohl supposes that the arc costs are positive, and consequently so are the functions sh and th.


BHPA finds a minimal path when it stops; from a practical point of view, he observes that the convergence between the two sides of the search can lead to very delayed stops. De Champeaux and Sint proposed the algorithm BHFFA: Bi-directional Heuristic Front-to-Front Algorithm (de Champeaux and Sint 1977) as an improvement of BHPA. BHFFA exploits a heuristic E that estimates, each time it is needed, the distance between a state p of the current sfront and a state q of the current tfront. For any n of sfront, the algorithm A∗ which operates from s to t calculates the heuristic estimate h(n) as the minimum, evaluated over all the states y of the current tfront, of E(n, y) + tg(y), where tg(y) is the length of the minimal path known so far from t to y; note: h(n) is not calculated as E(n, t). The working of the algorithm A∗ which searches in the other direction is symmetrical. From a theoretical point of view, De Champeaux and Sint show24 that BHFFA stops finding a minimal path between s and t. From a practical point of view, the authors consider that the main defect of BHPA is corrected: the states where the two graphs of search meet are now roughly equidistant from s and t, but they emphasize that the time needed to calculate the heuristic estimates is very high. Kaindl and Khorsand developed a bidirectional version of a simplified version of MA∗ (see above) that cannot be blocked by lack of memory space (Kaindl and Khorsand 1994). Bidirectional search (or multidirectional search, if there are several goal nodes) could experience a boost in modern bi- or multi-processor environments.

7 Relaxations of A∗: Sub-admissible Algorithms We can consider relaxing some constraints attached to Nilsson's A∗ algorithm, for several reasons. Some heuristics can inspire confidence without being strictly underestimating, or without proof that they are. For many applications, it is more important to find a pretty good solution without too much effort (space and time consumed) than to obtain an optimal solution at all costs. The very notion of optimality is relative: it refers to a model of the real problem, which is more or less schematic; what is the point of searching for a path of certainly minimal length if the costs of the arcs that represent the changes of states (and even the distinction between states) are just approximations of the real problem? For example, a route estimated minimal on a road map is not necessarily minimal on the ground, and vice versa; in addition, a path less short than the minimal path can be advantageous according to another criterion. While following the algorithmic code of A∗, Harris suggested exploiting heuristics h which do not satisfy the relationship h ≤ h∗, but the relaxed relationship h ≤ h∗ + e, where e is a positive constant (Harris 1974). Respecting the first 4 conditions of the statement of the admissibility theorem, we can easily prove that the algorithm finds a path from the initial state s to a goal state, of length

24 Respecting the 5 conditions introduced in the statement of the admissibility of A∗ algorithms.


less than or equal to h∗(s) + e. The algorithm is said to be sub-admissible and the solution sub-minimal. Pearl and Kim have retained the constraint h ≤ h∗ for the heuristics h, but have proposed to soften the selection mode of the state to develop (Pearl and Kim 1982). This relaxation allows the introduction, in the choice of the state to develop, of a criterion related to the application, which comes to complement the estimates of path lengths given by the evaluation function. To start the (x + 1)th development, they accept the choice, in the current front, of any state p satisfying: fx(p) ≤ (1 + ε) · min{ fx(q) | q ∈ front }, where ε is a

positive constant. By respecting the 4 first conditions of the statement of the admissibility theorem, we can easily prove that the algorithm25 finds out a path from the initial state s up to a goal state, of length less than or equal to (1 + ε)h∗ (s). It is another form of sub-admissibility of the algorithm and sub-minimality of the solution. Ghallab and Allard proposed the same selection mode of the state to develop than Pearl and Kim, but admitting heuristics h that satisfy: h ≤ (1 + α)h∗ where α is a positive constant (Ghallab and Allard 1983). By respecting the 4 first conditions of the statement of the admissibility theorem, we can easily prove that the algorithm26 finds out a path from the initial state s up to a goal state, of length less than or equal to(1 + α)(1 + ε)h∗ (s). It is also a form of sub-admissibility of the algorithm and sub-minimality of the solution. We shall show that the perspective opened by Harris, Pearl and Kim, Ghallab and Allard may be significantly extended.
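The relaxed selection rule of Pearl and Kim can be sketched as follows. This is an illustrative Python sketch, not the authors' code; the function names, the graph representation and the tie-breaking criterion are our own placeholders.

```python
def astar_eps(start, goal, neighbors, h, eps=0.0, tie_break=None):
    """Best-first search with Pearl and Kim's relaxed selection rule:
    any open node p with f(p) <= (1 + eps) * min f may be developed, so
    (with an underestimating h) the returned path costs at most
    (1 + eps) times the optimum.  With eps = 0 this is plain A*."""
    g, parent = {start: 0}, {start: None}
    open_set, closed = {start}, set()
    while open_set:
        f = {n: g[n] + h(n) for n in open_set}
        fmin = min(f.values())
        # the "focal" candidates allowed by the relaxed rule
        focal = sorted(n for n in open_set if f[n] <= (1 + eps) * fmin)
        n = min(focal, key=tie_break) if tie_break else focal[0]
        if n == goal:                       # reconstruct the found path
            path = []
            while n is not None:
                path.append(n)
                n = parent[n]
            return path[::-1], g[goal]
        open_set.remove(n)
        closed.add(n)
        for m, cost in neighbors(n):
            if m not in g or g[n] + cost < g[m]:
                g[m], parent[m] = g[n] + cost, n
                open_set.add(m)
                closed.discard(m)
    return None, float('inf')
```

The `tie_break` argument is where an application-related criterion can enter the choice, which is precisely the point of the relaxation.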

8 Fortunately IDA∗ Came

Algorithm A∗ and all the successors mentioned so far have been tested on (n² − 1)-puzzles. But until 1985, none of them had demonstrated the capability to solve all the problems of (3 × 3)-puzzles. At the time, Korf reported that, having chosen at random 100 solvable problems of (4 × 4)-puzzles27 and attempted to apply A∗, with P, on a DEC 2060, he managed to solve none of them: in all 100 cases, despite careful programming, the computer memory was saturated after creating about 30 000 states. This is what led him to design the algorithm named IDA∗: Iterative-Deepening A∗ (Korf 1985a, b). IDA∗ is mentioned in paragraph 3.3 of chapter “Artificial Intelligence for Games” of this volume. IDA∗ uses evaluation functions of the same form as those used by A∗: fx(n) = gx(n) + h(n), where gx(n) measures the minimal path known from s to n at the end of the development of rank x, while h(n) is a static and underestimating heuristic estimate. But IDA∗ no longer follows the expansion strategy of A∗.

25 This algorithm can be denoted A∗ε; when ε = 0, we recognize algorithm A∗.
26 Denoted Aε by the authors; we prefer Aαε; when α = 0, the algorithm is A∗ε; it is A∗ if α = ε = 0.
27 Listed in Korf (1985a). Denote them K4,1 to K4,100. The goal node is standard: empty cell at the top left, then tiles 1 to 15 from left to right and from top to bottom: see state t in Fig. 3.

Heuristically Ordered Search in State Graphs


For A∗, developing a state means producing and evaluating all its sons; at each cycle, A∗ favors the development of a state with the smallest evaluation among all those of the current front; the successive search graphs are arborescences. For IDA∗, developing a state means producing and evaluating only one of its sons, a different one at each new demand of development; IDA∗ favors the development of the last produced state, on condition that its evaluation does not exceed a certain threshold; thus IDA∗ behaves locally like a depth-first algorithm, as indicated by the expression iterative deepening in the acronym IDA∗; the successive search graphs are paths (i.e., particular arborescences); we can consider that for IDA∗ the front of A∗ is reduced to the final extremity of the current search path. A code for IDA∗ is proposed in Algorithm 2 below.

Algorithm 2: Heuristically ordered search, type IDA∗ of Korf
1  begin
2    front ← ((s 1)); T ← h(s); next-T ← ∞
3    repeat
4      if front = ∅ then
5        front ← ((s 1)); T ← next-T; next-T ← ∞
6      m ← first element of the head of front; p ← second element of this head
7      if m is a goal node then
8        Edit-reversed-path; stop
9      generate the first son n of m, with rank ≥ p, not present in front; let q be its rank
10     if n exists and f(n) ≤ T then
11       replace (m p) by (m q + 1) in head of front; put (n 1) as new head of front
12     if n exists and f(n) > T then
13       next-T ← min(next-T, f(n)); replace (m p) by (m q + 1) in head of front
14     if n does not exist then
15       delete the head of front
16   until front = ∅ and next-T = ∞

Algorithm 3: Edit-reversed-path
17 begin
18   repeat
19     m ← first element of the head of front; write m; delete the head of front
20   until front = ∅

At the start (line 2) the value h(s) is assigned to the threshold T. The search front, represented by front, is managed as a list of couples; initially it contains only the couple (s 1). IDA∗ progresses by cycles (loop between lines 3–16), seeking to

Fig. 4 For this problem (two 3 × 3 states s and t), IDA∗ only needs 18 states in memory against 290 for A∗
partially develop the state contained in the first couple of front; every couple is of the form (n k), where n is a state while k is the filiation rank of the next son28 of n to generate. The first couple of front being (s 1), the first partial development concerns s: it generates its first son. In line 6, the first couple of front is represented by (m p); if m is a goal node then IDA∗ edits the found path (lines 8 and 17–20), else IDA∗ tries to generate the kth son of m, provided it is not already in the current front, else the (k + 1)th, etc. (line 9). If a new son n is found and if its evaluation f(n) is less than or equal to the current threshold T (line 10), IDA∗ continues the development in depth, under n. If f(n) is greater than T, IDA∗ goes back to the first previous state that allows the production of a new son and again attempts a depth expansion. If this is not possible, IDA∗ launches a new cycle from s, but with a new threshold, computed as the minimum of the evaluations that exceeded the threshold in force so far. IDA∗ stops if the state to develop is the goal state. For finite state graphs, under the hypotheses on evaluation functions and heuristics proposed by Nilsson for A∗, Korf showed that IDA∗ is admissible and develops only states developed by A∗. These constraints can be widely relaxed, see Farreny (1995, 1997a, b, 1999). The huge advantage of IDA∗ over A∗ is that at any moment it keeps in memory only the states of a path joining s to the current deepest state, a path whose length cannot exceed h∗(s); if there is a d > 0 such that all arc values are greater than or equal to d, then the number of states of this path cannot exceed h∗(s)/d; thus the space complexity of IDA∗ is linear. Note that for the (n² − 1)-puzzles, we can take d = 1. For example, to find the minimal solution (18 moves) of the problem shown in Fig. 4, A∗ keeps in memory a search graph that counts up to 290 states while IDA∗ maintains only a search path that cannot count more than 18 states. In return, each time the threshold is increased, IDA∗ undertakes a new cycle of exploration in depth, over which it redoes all or part of the work done in the previous cycle (all the work if the new cycle is not the last). In 1985, thanks to IDA∗ with h = P, Korf determined a minimal solution for the problems K4,1 to K4,100; the average length of the minimal solutions is about 53. The resolution on the DEC 2060 consumed about half an hour per problem. In 1993, thanks to IDA∗ with h = P, Reinefeld could process29 all the problems of (3 × 3)-puzzles

28 It is supposed that, for any state, the number of sons is finite. To each son of a state is assigned a rank of generation.
29 On a Sun Sparc Station.
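The behavior just described — depth-first probes bounded by a threshold T, then a restart with next-T — can be rendered as a compact recursive sketch. This is an illustrative Python version on the (3 × 3)-puzzle with h = P (the Manhattan distance), not Korf's original code; states are tuples with 0 marking the blank.

```python
def manhattan(state, goal):
    """Heuristic P: sum of tile-wise Manhattan distances (blank excluded)."""
    n = int(len(state) ** 0.5)
    pos = {v: (i // n, i % n) for i, v in enumerate(goal)}
    return sum(abs(i // n - pos[v][0]) + abs(i % n - pos[v][1])
               for i, v in enumerate(state) if v != 0)

def neighbors(state):
    """States reachable in one move (slide a tile into the blank)."""
    n = int(len(state) ** 0.5)
    b = state.index(0)
    r, c = divmod(b, n)
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < n and 0 <= nc < n:
            s = list(state)
            j = nr * n + nc
            s[b], s[j] = s[j], s[b]
            yield tuple(s)

def ida_star(start, goal):
    """Iterative-deepening A*: depth-first probes bounded by a threshold T
    on f = g + h; T then grows to next-T, the smallest f that exceeded it."""
    threshold = manhattan(start, goal)
    path = [start]                    # the only structure kept in memory
    def probe(g, threshold):
        state = path[-1]
        f = g + manhattan(state, goal)
        if f > threshold:
            return f                  # report the exceeding value for next-T
        if state == goal:
            return True
        next_t = float('inf')
        for s in neighbors(state):
            if s in path:             # avoid cycling along the current path
                continue
            path.append(s)
            r = probe(g + 1, threshold)
            if r is True:
                return True
            next_t = min(next_t, r)
            path.pop()
        return next_t
    while True:
        r = probe(0, threshold)
        if r is True:
            return path
        if r == float('inf'):
            return None               # no goal reachable
        threshold = r
```

Note how only the current path is stored, which is exactly the linear-space property discussed above.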

Fig. 5 Problem K5,1 (two 5 × 5 states s and t): find a minimal path from s to t. Solution: 100 moves
whose goal is the one said to be standard: the empty cell at the top left, tiles 1 to 8 arranged in order from left to right and from top to bottom (Reinefeld 1993). For each of the 9!/2 solvable problems he produced solutions of minimal length (in fact, not only one but all of them). With the computer resources he had, he could not do this work using the zero heuristic or the heuristic W. Through this investigation, he established that the longest of the minimal solutions counts 31 moves; observing that the minimal solutions for the problems with non-standard goals cannot be longer, he concluded that the diameter of the graph of (3 × 3)-puzzles is 31 (the diameter of a graph is the maximum number of edges of the minimal chains between all pairs of nodes). One of the flaws of IDA∗, as a counterpart of its extreme simplicity in space, is that it does not use the available memory capacity, while it repeats many calculations. IDA∗ has inspired a new wave of algorithms, including MREC (Sen and Bagchi 1989), DFA∗ (Rao et al. 1991), RBFS (Korf 1993), IE (Russell 1992), BIDA∗ (Manzini 1995), DBIDA∗ and BDBIDA∗ (Eckerle and Schuierer 1995). Improvements in performance, sometimes very important (BIDA∗ especially), were obtained on the (4 × 4)-puzzle, with P as heuristic. However, when it comes to (5 × 5)-puzzle problems, IDA∗ and its epigones (always guided by h = P) appeared to be far too slow. It should be noted that the next step in the mastery of the (n² − 1)-puzzles was accomplished not by replacing IDA∗, but by seeking to surpass the heuristic P, as explained below.

9 Inventing Heuristics as Improvements of Those Already Known

Korf and Taylor (1996) experimented with IDA∗ and a new heuristic, thanks to which they managed to find minimal solutions for 9 problems of the (5 × 5)-puzzle among a series of 10 randomly drawn ones, noted below K5,1 to K5,10 (the goal state is standard). They used a Sun Ultra Sparc station, 70 times faster than the DEC 2060. The problem that was solved most quickly (shown in Fig. 5) requires 100 moves; the one solved most slowly requires 113 moves; the resolution of the 10th was interrupted after 1 month of calculation. The average length of the 9 minimal solutions is about 101.

Fig. 6 Example of computation of LC and C; t is the goal state; s1 and s2 are two initial states
We now present a simplified version, denoted hK, of the heuristic actually used by Korf; it is defined by hK = P + LC + C, where P is the Manhattan distance. Here are the principles of calculation of the LC and C components. LC (Linear-Conflict heuristic) is described in Hansson et al. (1992), but the idea is already present in Gardner (1979). If, in a state p, 2 tiles are in the row (line or column) that is intended for them in t, but in reverse order, one of the 2 tiles will have to leave the row to allow the move of the other, and then come back. Note that these two extra moves are not counted when evaluating P (they weigh nothing for P). For example, in Fig. 6, state s1 presents several conflicts in line 2 and column 2; naturally we must manage the relations between conflicts so as not to overstate their cost. C (Corner-Tiles heuristic) is described in Korf (1997). In state s2 of Fig. 6, tile 4 occupies a corner that is not its destination, but tile 1 (neighbor of the corner) is in its place; in order to allow tile 4 to leave the corner and tile 2 to arrive there, tile 1 must move; this will cost at least 2 moves (to go and to return). Korf counts in the same manner for 3 of the 4 corners: he does not account for the top-left corner (to avoid complex tests). On an (n × n)-puzzle with n > 3, this scheme can weigh up to 12 moves; only 6 moves on a (5 × 5)-puzzle, because a neighbor of a corner cannot intervene twice. In addition, if we take into account the contribution of LC, we must avoid counting additional moves for corner-neighbor tiles already involved in a linear conflict; for instance, for the state s2, tile 7 is involved in a conflict on line 3. With some caution about these interferences, we can define LC and C more precisely, thus hK = P + LC + C, so that hK is assured to be underestimating. In addition, for any state p, P(p) ≤ hK(p). In the strict sense of Nilsson and colleagues, hK is not better informed than P: we cannot invoke the optimality theorem. Nevertheless, the experiments confirm the intuitive expectation: generally, hK performs fewer developments and, despite the extra cost of evaluation, consumes less time.
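A common way to implement the LC component admissibly is sketched below. This is an illustrative Python sketch: the longest-increasing-subsequence formulation is one standard realization of the idea of Hansson et al. (each line keeps its largest in-order subset of well-placed tiles, and every other such tile is charged 2 extra moves); the corner term C is omitted.

```python
from bisect import bisect_left

def manhattan(state, goal):
    """Heuristic P: sum of tile-wise Manhattan distances (0 is the blank)."""
    n = int(len(state) ** 0.5)
    pos = {v: divmod(i, n) for i, v in enumerate(goal)}
    r = lambda i: divmod(i, n)
    return sum(abs(r(i)[0] - pos[v][0]) + abs(r(i)[1] - pos[v][1])
               for i, v in enumerate(state) if v != 0)

def linear_conflict(state, goal):
    """LC bonus: for each row and column, tiles already in their goal line
    but pairwise out of order force at least 2 extra moves per evicted tile."""
    n = int(len(state) ** 0.5)
    pos = {v: divmod(i, n) for i, v in enumerate(goal)}
    extra = 0
    for line in range(n):
        # goal columns of tiles sitting in row `line` whose goal row is `line`
        row = [pos[state[line * n + c]][1] for c in range(n)
               if state[line * n + c] != 0
               and pos[state[line * n + c]][0] == line]
        # goal rows of tiles sitting in column `line` whose goal column is `line`
        col = [pos[state[r * n + line]][0] for r in range(n)
               if state[r * n + line] != 0
               and pos[state[r * n + line]][1] == line]
        for seq in (row, col):
            lis = []                      # longest increasing subsequence
            for x in seq:
                j = bisect_left(lis, x)
                if j == len(lis):
                    lis.append(x)
                else:
                    lis[j] = x
            extra += 2 * (len(seq) - len(lis))
    return extra
```

The improved heuristic is then `manhattan(s, t) + linear_conflict(s, t)`, which remains underestimating.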

10 Combining the Heuristic Estimates Obtained for Subproblems

In order to obtain heuristic values greater than P, but still underestimating, we can apply the technique of pattern databases (Culberson and Schaeffer 1998), later improved under the name disjoint pattern databases (Korf 2000; Korf and Felner 2002; Felner et al. 2004). This technique is mentioned in Sect. 4.2 of chapter “Artificial Intelligence for Games” of this volume (about games and heuristic search).

Fig. 7 Problem Q: find a minimal path from s2 to t. Problem Q1−5: find a minimal path from s2-relaxed to t-relaxed
We shall describe its principle by examining one of the problems represented in Figs. 3 and 7: find a minimal path from s2 to t. Let us call this problem Q. We reported previously that solving Q requires 80 moves. Let us consider the relaxation of Q named Q1−5 in Fig. 7: it amounts to moving only the tiles 1 to 5, from their positions in s2 (or in s2-relaxed) to those they occupy in t (or in t-relaxed), while neglecting the differences of identity of the 10 other tiles. We have P(s2) = 58. Let us note P1−5(s2) the part of P(s2) that concerns the moves of the tiles 1 to 5; we have P1−5(s2) = 20. Similarly we define Q6−10 and Q11−15, and the estimates P6−10(s2) and P11−15(s2). We have P6−10(s2) = 14, P11−15(s2) = 24, and P(s2) = P1−5(s2) + P6−10(s2) + P11−15(s2) = 58. Since 10 tiles are interchangeable, the state graph of Q1−5 has 10! times fewer states than the state graph of Q, that is to say N = 2 882 880 states; Q1−5 is thus considerably easier to solve than Q. As we reported before, in 1985 Korf had saturated the memory of his DEC 2060 with only 30 000 states. Thirty years later, many personal computers can store 10 000 times more states in main memory. This allows other ways to proceed. Suppose that Q1−5 is solved30; let us note h1−5(s2) the number of moves, in the minimal path found between s2-relaxed and t-relaxed, that concern only the tiles 1 to 5; necessarily h1−5(s2) ≥ P1−5(s2). Similarly we determine h6−10(s2) and h11−15(s2). We define h(s2) = h1−5(s2) + h6−10(s2) + h11−15(s2); obviously h(s2) ≥ P(s2). Because each move shifts only one tile, h is clearly an underestimate of h∗, but greater than or equal to P. According to the previous analysis, calculating h(s2) could need 3 auxiliary resolutions: those of the subproblems Q1−5, Q6−10 and Q11−15. It is wiser to proceed as follows.
We apply the breadth-first algorithm to Q1−5, but starting from t-relaxed; when the state s2-relaxed appears we can save h1−5(s2); in fact, whenever a new state n is created it is recorded along with the value h1−5(n); we continue until all relaxed states related to the tiles 1 to 5 have been generated. In this way we obtain a base of heuristic data B1−5; we constitute B6−10 and B11−15 similarly. B1−5, B6−10 and B11−15 are built and installed in memory prior to any treatment of the (4 × 4)-puzzle; 3N entries are needed, i.e., 8 648 640 in total, which is commonly acceptable today. Thereafter, consulting these databases is equivalent to exploiting the underestimating heuristic h.

30 We could resort to IDA∗, with P or, better, P + LC + C.
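The backward breadth-first construction described above can be sketched as follows. This is an illustrative Python sketch of the plain, non-additive variant, in which every move counts (so the table is underestimating on its own); the additive "disjoint" variant would count only pattern-tile moves. Function names and the abstract-state encoding are our own choices.

```python
from collections import deque

def build_pdb(goal, pattern, n):
    """Backward breadth-first construction of a (plain) pattern database:
    starting from the goal, enumerate every placement of the blank and of
    the pattern tiles, recording the number of moves needed to reach it.
    An abstract state is the tuple (blank cell, cell of each pattern tile);
    all other tiles are indistinguishable "don't care" tiles."""
    start = (goal.index(0),) + tuple(goal.index(t) for t in pattern)
    dist = {start: 0}
    frontier = deque([start])
    while frontier:
        a = frontier.popleft()
        blank, tiles = a[0], a[1:]
        r, c = divmod(blank, n)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < n and 0 <= nc < n):
                continue
            j = nr * n + nc                 # blank swaps with cell j
            new_tiles = tuple(blank if t == j else t for t in tiles)
            b = (j,) + new_tiles
            if b not in dist:
                dist[b] = dist[a] + 1
                frontier.append(b)
    return dist

def pdb_lookup(dist, state, pattern):
    """Consulting the database: map a concrete state to its abstract key."""
    return dist[(state.index(0),) + tuple(state.index(t) for t in pattern)]
```

Because the database is built once from the goal, every later lookup costs a single dictionary access, exactly as described for B1−5, B6−10 and B11−15.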


Using IDA∗ with the technique of pattern databases (less powerful31 than that of the disjoint pattern databases presented above), Korf was able to solve 10 randomly drawn problems of Rubik's Cube; he found (minimal) solutions whose lengths were between 16 and 18 moves (Korf 1997).

11 Formalizing in Order to Open Other Application Fields and Other Solving Methods

Algorithms A∗ and IDA∗, presented in detail above, as well as those briefly characterized (A∗ε, Aαε),32 can be seen as special cases of a vast family defined and analyzed in Farreny (1995) under the name of ρ algorithms. Five ways of generalization were followed to identify these algorithms. We now introduce, very succinctly, four of them.

11.1 State Graphs: We Can Be Less Demanding

In many presentations dealing with heuristically ordered search, the constraints imposed on state graphs to ensure that the algorithms stop while finding a minimal solution (a property named admissibility) or a sub-minimal one (a property named sub-admissibility) are excessive. For example, in order to guarantee that A∗ stops by discovering a minimal solution, it is not necessary to satisfy conditions 3 and 4 demanded in the classic statement of the theorem of admissibility presented in Nilsson (1971, 1980, 1998), Barr et al. (1982), Rich (1983), Winston (1984), Pearl (1990), Shirai and Tsujii (1984), Laurière (1986), Farreny and Ghallab (1987), Shapiro (1987), Russell and Norvig (2003). The values of the arcs can be real numbers, provided that: (a) for every real M, beyond a certain number of arcs, the length of any elementary path33 that begins at s is greater than M; and (b) there exists no circuit34 whose length is strictly negative.

31 This technique uses h = max(h1−5, h6−10, h11−15) rather than h = h1−5 + h6−10 + h11−15.
32 As well as several of the mentioned algorithms (B, C, D, …) and others not evoked so far, such as BF∗ (Pearl 1990), A∗∗ (Dechter and Pearl 1985, 1988), SDW (Köll and Kaindl 1992).
33 A path is elementary if it does not include two occurrences of the same node.
34 A circuit is a path whose final node is identical to the initial node.

Heuristically Ordered Search in State Graphs

21

11.2 Length of a Path: We Can Get off the Beaten Track

Very generally, the length of a path is defined as the sum of the costs of the arcs that make it up; let us denote this kind of length Ladd. However, Pearl (1990) has envisaged a path length calculated as the maximum of the costs of its arcs; Yager (1986), as well as Dubois et al. (1987), have calculated the length of a path as the minimum of the costs of its arcs. Gonella (1989) has combined the costs of arcs by multiplication. Heuristically ordered search can find new fields of application by substituting for Ladd a much more general definition of path length. The operation + and the set R = ]−∞, +∞[, involved in the definition of Ladd, can be replaced by any binary operation Θ and any subset V of R, provided that V and Θ form a monoid. Remember that (V, Θ) is a monoid if and only if V is closed for Θ, and Θ is associative and admits a neutral element in V. Let c be a function which associates a cost c(u) in R to each arc u. We define as a length associated with the monoid (V, Θ) and with the function c any function L that respects the following constraints: 1. for any arc u: L(u) = c(u); 2. L(empty sequence of arcs) = e, the neutral element of (V, Θ); 3. for any sequences of arcs S and S′: L(concatenation of S and S′) = Θ(L(S), L(S′)). Taking V = R, or R+, or Q, or Q+, or Z, or N, and Θ = +, we find the common forms of Ladd. With some subsets V of R, we can choose Θ(x, y) = x·y, or min(x, y), or max(x, y), or √(x² + y²), etc.; for other examples, see Farreny (1995). The theorems of admissibility and sub-admissibility presented previously remain valid when a length of this general type is substituted for Ladd.
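A best-first search parameterized by such a monoid can be sketched as follows. This is an illustrative Python sketch under our own names and assumptions: with (R, +) it is Dijkstra's algorithm, with (R+, max) it finds bottleneck-minimal paths; correctness of this simple version additionally requires that lengths never shrink along a path, i.e., Θ(x, y) ≥ x.

```python
import heapq

def monoid_search(start, goal, neighbors, combine, neutral):
    """Best-first search where path length is defined by a monoid (V, Θ):
    L(empty path) = neutral and L(S.u) = combine(L(S), c(u))."""
    frontier = [(neutral, start)]          # (generalized length, node)
    best = {}                              # settled nodes and their lengths
    while frontier:
        length, node = heapq.heappop(frontier)
        if node in best:                   # already settled with a better length
            continue
        best[node] = length
        if node == goal:
            return length
        for succ, cost in neighbors(node):
            if succ not in best:
                heapq.heappush(frontier, (combine(length, cost), succ))
    return None
```

Notice that the two monoids can prefer different paths on the same graph: the additive length favors a small total, the max length favors a small worst arc.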

11.3 States to be Developed: Other Informed Choices

We saw that A∗ chooses the next state to develop following the principle of the best first. The Pearl-Kim and Ghallab-Allard algorithms allow a choice among the nodes of the front whose evaluation does not exceed (1 + ε) times the minimal evaluation, while ensuring that the length of the found path will not exceed (1 + ε) times the length of a minimal path. We can consider a much more general process: to engage the xth development, we accept choosing any state p in the front such that: fx−1(p) ≤ E(min{fx−1(q) : q ∈ front}).

We can establish that, if the function E satisfies some conditions (Farreny 1995, 1999), including being non-decreasing, then the length (possibly generalized as previously) of the found path is less than or equal to E(L), where L is the length of a minimal path. Taking for E the identity function I, we recognize the procedure of choice followed by A∗, as well as the assurance that the length of the found path is L.


Taking for E the function (1 + ε)I, we recognize the procedure of choice followed by the algorithms of Pearl-Kim and Ghallab-Allard, as well as the assurance that the length of the found path is less than or equal to (1 + ε)L. But we can also take, for example, E = (1 + ε)I + e or E = √(I × (I + e)).

11.4 Heuristics: Some Other Relaxations Can Preserve Some Guarantees

It was noted that A∗ uses underestimating and static heuristics h. It was also noted that Harris' algorithm accepts a static heuristic less than or equal to h∗ + e, and that the gap between the length of the found path and the length of a minimal path does not exceed e. It was finally noted that the Ghallab-Allard algorithm accepts a static heuristic that does not exceed h∗ by more than α %, and that the length of the found path does not exceed the length of a minimal path by more than α %. We can establish a more general theorem of sub-admissibility concerning a broad family of functions that are neither underestimating nor static heuristics, for which there is a relationship of the form h ≤ F, where F satisfies some conditions (Farreny 1995, 1997a, b, 1999); this theorem gives an upper bound for the length of the found path; by applying it to the algorithms of Harris and Ghallab-Allard, we find their properties again as special cases. Some other resolution techniques, which take more liberties with the objectives of admissibility and sub-admissibility, are set out in chapter “Meta-Heuristics and Artificial Intelligence” of this volume on meta-heuristics.

12 Conclusion

Everyone is familiar with the use of heuristics, for instance for travelling without an accurate representation of the environment; our behavior (more or less instinctive or thoughtful) seems to result from taking into account several pieces of information: direction of the goal, estimate of the distance, history of our own movements so far, etc. Going straight to the goal is a banal heuristic, fallible but useful. What can we expect by using it? How can we use it better? In 1968, a first formal framework was proposed. A∗ includes an algorithm and a theory that justifies its admissibility. The properties of underestimating heuristics, better informed heuristics and monotone heuristics explain why the heuristics P and W are more effective than the zero heuristic. In practice, A∗ with P manages to solve (3 × 3)-puzzles that the breadth-first algorithm could not handle for lack of memory. In 1985, IDA∗ constituted an algorithmic advance, sustained by the previous formal framework: because it is admissible, like A∗, but much more efficient at economizing memory, IDA∗ with P manages to solve all the problems of (3 × 3)-puzzles


(the most difficult requires 31 moves); it also manages to solve 100 randomly drawn (4 × 4)-puzzles (the minimal solutions found require an average of about 53 moves). Despite the evolution of the hardware, solving randomly drawn (5 × 5)-puzzles remained out of reach. In 1996, by substituting for P a new underestimating heuristic (greater than or equal to P, though not always strictly greater), IDA∗ was able to solve 9 out of 10 randomly drawn (5 × 5)-puzzles (the average length of the minimal solutions is about 101). In 1997, the combination of analytical arguments (using an overestimating heuristic, i.e., a heuristic greater than or equal to h∗, to quickly rule out all problems solvable in fewer than 80 moves) and the use of a powerful network of parallel computers led to the proof that the diameter of the state graph of the (4 × 4)-puzzles is either 80 or 81 (Brungger 1997). To our knowledge, until today, the diameter of the state graph of the (5 × 5)-puzzles remains undetermined (but it is at least 113). In 1997, by exploiting pattern databases, IDA∗ was able to solve 10 randomly drawn problems of Rubik's Cube (the minimal solutions found consist of between 16 and 18 moves). In 2010, thanks to the mobilization of colossal computing means, it was proven that the diameter of the state graph of the Rubik's Cube (classical type: 6 faces of 9 cubes, where each action on a face counts for one move) is only 20 (Rokicki et al. 2010; Delahaye 2011). To improve our capacity to solve such problems, computing power certainly plays a major role; it remains to combine the increasing power of the hardware with new ideas about evaluation functions, algorithms and heuristics. A number of advances for more realistic application areas (where arcs are not uniformly valued), based on the techniques presented so far or on methods that remain to be invented, are still expected.

References

Bagchi A, Mahanti A (1983) Search algorithms under different kinds of heuristics – a comparative study. J ACM 30(1):1–21
Barr A, Feigenbaum E, Cohen P (1982) The handbook of artificial intelligence. Morgan Kaufman, Los Altos CA
Berge C (1970) Graphes et Hypergraphes. Dunod, Paris
Brungger A (1997) Solving hard combinatorial optimization problems in parallel: two case studies. PhD thesis, ETH Zurich
Buisson J-C (2008) Nutri-educ, a nutrition software application for balancing meals, using fuzzy arithmetic and heuristic search algorithms. Artif Intell Med 42(3):213–227
Chakrabarti PP, Ghose S, Acharya A, de Sarkar SC (1989) Heuristic search in restricted memory. Artif Intell 41(2):197–222
Chang CL, Lee RCT (1973) Symbolic logic and mechanical theorem proving. Academic Press, New York
Charniak E, McDermott D (1985) Introduction to artificial intelligence. Addison-Wesley, Reading
Chatila R (1982) Path planning and environment learning in a mobile robot system. In: Proceedings ECAI-82, Orsay, pp 211–215
Culberson JC, Schaeffer J (1998) Pattern databases. Comput Intell 14(4):318–334
de Champeaux D, Sint L (1977) An improved bidirectional heuristic search algorithm. J ACM 24(2):177–191


Dechter R, Pearl J (1985) Generalized best first search strategies and the optimality of A*. J ACM 32(3):505–536
Dechter R, Pearl J (1988) The optimality of A∗. In: Kanal L, Kumar V (eds) Search in artificial intelligence. Springer, Berlin, pp 166–199
Delahaye JP (2011) Le Rubik's Cube : pas plus de 20 mouvements! Pour la Science 400:98–103
Dijkstra EW (1959) A note on two problems in connexion with graphs. Numer Math 1:269–271
Dubois D, Lang J, Prade H (1987) Theorem proving under uncertainty. A possibility theory-based approach. In: Proceedings of the 10th international joint conference on artificial intelligence, Milan, Italy, pp 984–986
Eckerle J, Schuierer S (1995) Efficient memory-limited graph search. Lect Notes Comput Sci 981:101–112
Farreny H (1987) Exercices programmés d'intelligence artificielle. Méthode + programmes. Masson
Farreny H (1995) Recherche Heuristiquement Ordonnée dans les graphes d'états : algorithmes et propriétés. Manuels informatiques Masson, Masson
Farreny H (1997a) Recherche heuristiquement ordonnée : généralisations compatibles avec la complétude et l'admissibilité. Technique et Science Informatiques 16(7):925–953
Farreny H (1997b) Recherche Heuristiquement Ordonnée dans les graphes d'états : élargissement des cas admissibilité et sous-admissibilité. Revue d'Intelligence Artificielle 11(4):407–448
Farreny H (1999) Completeness and admissibility for general heuristic search algorithms – a theoretical study: basic concepts and proofs. J Heuristics 5(3):353–376
Farreny H, Ghallab M (1987) Éléments d'intelligence artificielle. Hermès
Farreny H, Piquet-Gauthier S, Prade H (1984) Méthode de recherche ordonnée pour la désignation d'objets en génération de phrases. In: Proceedings of 6th international congress cybernetics and systems, pp 647–652
Felner A, Korf RE, Hanan S (2004) Additive pattern database heuristics. J Artif Intell Res (JAIR) 22:279–318
Gardner M (1979) Jeux mathématiques du “Scientific American”. CEDIC
Ghallab M, Allard DG (1983) Aε – an efficient near admissible heuristic search algorithm. In: Bundy A (ed) Proceedings of the 8th international joint conference on artificial intelligence, Karlsruhe, FRG. William Kaufmann, pp 789–791
Gondran M, Minoux M (1979) Graphes et algorithmes. Eyrolles, Paris
Gonella R (1989) Diagnostic de pannes sur avions : mise en œuvre d'un raisonnement révisable. PhD thesis, ENSAE, Toulouse, France
Gouzènes L (1984) Strategies for solving collision-free trajectories problems for mobile and manipulator robots. Int J Robot Res 3(4):51–65
Hansson O, Mayer A, Yung M (1992) Criticizing solutions to relaxed models yields powerful admissible heuristics. Inf Sci 63(3):207–227
Harris LR (1974) The heuristic search under conditions of error. Artif Intell 5(3):217–234
Hart PE, Nilsson NJ, Raphael B (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE Trans Syst Sci Cybern SSC-4(2):100–107
Ibaraki T (1978) Branch-and-bound procedure and state-space representation of combinatorial optimization problems. Inf Control 36(1):1–27
Johnson WW, Storey WE (1879) Notes on the “15” puzzle. Am J Math 2:397–404
Kaindl H, Khorsand A (1994) Memory-bounded bidirectional search. In: Proceedings of the twelfth national conference on artificial intelligence (AAAI-94). AAAI Press, Seattle, Washington, pp 1359–1364
Kanal LN, Kumar V (eds) (1988) Search in artificial intelligence. Springer, Berlin
Köll AL, Kaindl H (1992) A new approach to dynamic weighting. In: Neumann B (ed) Proceedings of the 10th European conference on artificial intelligence, Vienna, Austria. Wiley, pp 16–17
Korf RE (1985a) Depth-first iterative deepening: an optimal admissible tree search. Artif Intell 27(1):97–109


Korf RE (1985b) Iterative-deepening A*: an optimal admissible tree search. In: Proceedings of the ninth international joint conference on artificial intelligence (IJCAI-85), Los Angeles, California. Morgan Kaufmann, pp 1034–1036
Korf RE (1993) Linear-space best-first search. Artif Intell 62(1):41–78
Korf RE (1997) Finding optimal solutions to Rubik's Cube using pattern databases. In: Proceedings of the 14th national conference on artificial intelligence (AAAI-97), pp 700–705
Korf RE (2000) Recent progress in the design and analysis of admissible heuristic functions. In: Proceedings of the 17th national conference on artificial intelligence (AAAI 2000), Austin, Texas, USA. AAAI / MIT Press, pp 1165–1170
Korf RE, Felner A (2002) Disjoint pattern database heuristics. Artif Intell 134(1–2):9–22
Korf RE, Taylor LA (1996) Finding optimal solutions to the twenty-four puzzle. In: Proceedings of the 13th national conference on artificial intelligence (AAAI-96), Portland, Oregon, USA. AAAI Press / The MIT Press, pp 1202–1207
Kowalski R (1970) Search strategies for theorem proving. In: Meltzer B, Michie D (eds) Machine intelligence, vol 5. American Elsevier, pp 181–201
Kumar V, Kanal LN (1988) The CDP: a unifying formulation for heuristic search, dynamic programming, and branch-and-bound. In: Kanal LN, Kumar V (eds) Search in artificial intelligence. Springer, Berlin, pp 1–27
Laurière J-L (1986) Intelligence Artificielle : résolution de problèmes par l'homme et la machine. Eyrolles, Paris
Lirov Y (1987) A locally optimal fast obstacle-avoiding path-planning algorithm. Math Model 9(1):63–68
Manzini G (1995) BIDA*: an improved perimeter search algorithm. Artif Intell 75(2):347–360
Martelli A (1976) Application of heuristic search methods to edge and contour detection. Commun ACM 19(2):73–83
Martelli A (1977) On the complexity of admissible search algorithms. Artif Intell 8(1):1–13
Mero L (1984) A heuristic search algorithm with modifiable estimate. Artif Intell 23(1):13–27
Minker J, Fishman DH, McSkimin JR (1973) The Q* algorithm – a search strategy for a deductive question-answering system. Artif Intell 4(3–4):225–243
Nilsson NJ (1971) Problem solving methods in artificial intelligence. McGraw-Hill, New York
Nilsson NJ (1980) Principles of artificial intelligence. Tioga, Palo Alto. French translation: Principes d'Intelligence Artificielle, Cepadues, 1988
Nilsson NJ (1998) Artificial intelligence: a new synthesis. Morgan Kaufmann, San Francisco
Pearl J (1984) Heuristics – intelligent search strategies for computer problem solving. Addison-Wesley series in artificial intelligence. Addison-Wesley. French translation: Heuristiques – Stratégies de recherche intelligentes pour la résolution de problèmes par ordinateur, Cepadues, 1990
Pearl J, Kim JH (1982) Studies in semi-admissible heuristics. IEEE Trans Pattern Anal Mach Intell 4(4):392–399
Pérennou G, Lahens F, Daubèze P, De Calmes M (1986) Rôle et structure du lexique dans le correcteur vortex. In: Séminaire GRECO/GALF Lexiques et traitement automatique des langages. Université Paul Sabatier, Toulouse
Pohl I (1971) Bi-directional search. In: Meltzer B, Michie D (eds) Machine intelligence, vol 6. Edinburgh University Press, Edinburgh, Scotland, pp 127–140
Rao NV, Kumar V, Korf RE (1991) Depth-first versus best-first search. In: Dean TL, McKeown K (eds) Proceedings of the 9th national conference on artificial intelligence, Anaheim, CA, USA, vol 1. AAAI Press / The MIT Press, pp 434–440
Ratner D, Warmuth M (1990) The (n² − 1)-puzzle and related relocation problems. J Symb Comput 10:111–137
Reinefeld A (1993) Complete solution of the eight-puzzle and the benefit of node ordering in IDA∗. In: Proceedings of the 13th international joint conference on artificial intelligence, Chambéry, France, pp 248–253
Rich E (1983) Artificial intelligence. McGraw-Hill, New York

26

H. Farreny

Rokicki T, Kociemba H, Davidson M, Dethridge J (2010) God’s number is 20. http://www.cube20. org/ Russell S (1992) Efficient memory-bounded search methods. In: Neumann B (ed) Proceedings of the 10th European conference on artificial intelligence, Vienna, Austria. Wiley, pp 1–5 Russell SJ, Norvig P (2003) Artificial intelligence: a modern approach, 2nd edn. Prentice Hall, Upper Saddle River, NJ Sen AK, Bagchi A (1989) Fast recursive formulations for best-first search that allow controlled use of memory. In: Sridharan NS (ed) Proceedings of the 11th international joint conference on artificial intelligence, Detroit, MI, USA. Morgan Kaufmann, pp 297–302 Shapiro S (1987) Encyclopedia of AI. Wiley-Interscience, Hoboken Shirai Y, Tsujii J (1984) Artificial intelligence: concepts, techniques, and applications. Wiley series in computing. Wiley, New Jersey Winston PH (1984) Artificial intelligence. Addison-Wesley, Reading Yager RR (1986) Paths of least resistance in possibilistic production systems. Fuzzy Sets Syst 19(2):121–132

Meta-heuristics and Artificial Intelligence
Jin-Kao Hao and Christine Solnon

Abstract Meta-heuristics are generic search methods that are used to solve challenging combinatorial problems. We describe these methods and highlight their common features and differences by grouping them into two main kinds of approaches: perturbative meta-heuristics, which build new combinations by modifying existing combinations (e.g., genetic algorithms and local search), and constructive meta-heuristics, which generate new combinations in an incremental way by using a stochastic model (e.g., estimation of distribution algorithms and ant colony optimisation). These approaches may be hybridised, and we describe some classical hybrid schemes. We also introduce the notions of diversification (exploration) and intensification (exploitation), which are shared by all these meta-heuristics: diversification aims at ensuring a good sampling of the search space and, therefore, at reducing the risk of overlooking a promising sub-area that actually contains high-quality solutions, whereas intensification aims at locating the best combinations within a limited region of the search space. Finally, we describe two applications of meta-heuristics to typical artificial intelligence problems: satisfiability of Boolean formulas and constraint satisfaction problems.

J.-K. Hao, LERIA, Université d’Angers, 2 Boulevard Lavoisier, 49045 Angers, France. e-mail: [email protected]
C. Solnon, LIRIS-CNRS, INSA de Lyon, 20 Avenue Albert Einstein, 69621 Villeurbanne, France. e-mail: [email protected]
© Springer Nature Switzerland AG 2020. P. Marquis et al. (eds.), A Guided Tour of Artificial Intelligence Research, https://doi.org/10.1007/978-3-030-06167-8_2

1 Introduction

Meta-heuristics are generic methods that may be used to solve complex and challenging combinatorial search problems. These problems are challenging because solving them involves examining a huge (usually exponential) number of combinations. Everyone has encountered this combinatorial explosion phenomenon, which turns an apparently very simple problem into a tricky brain-teaser as soon as the size of the problem increases. This is the case, for example, with tiling puzzles such as pentominoes: when the number of tiles is small enough, these problems are rather easily solved by a systematic review of all possible combinations; however, when the number of tiles is increased only slightly, the number of combinations to review grows so drastically that the problem can no longer be solved by simple enumeration and, for larger instances, even the most powerful computer cannot enumerate all combinations within a reasonable amount of time. The challenge clearly goes beyond puzzles: the same combinatorial explosion occurs in many industrial problems such as scheduling activities, planning production, or packing objects of different volumes into a finite number of bins. Hence, it is most important to design intelligent algorithms that are actually able to solve these hard combinatorial problems in a reasonable amount of time. There are three main approaches for tackling combinatorial problems. Exact approaches explore the space of combinations (i.e., candidate solutions) in a systematic way until either a solution is found or inconsistency is proven. In order to (try to) restrain combinatorial explosion, these approaches structure the set of all combinations as a tree and use pruning techniques (to reduce the search space) and ordering heuristics (to define the order in which it is explored). These approaches are able to solve many problems; however, pruning techniques and ordering heuristics cannot always restrain combinatorial explosion, and some problem instances cannot be solved by these approaches within a reasonable amount of time. Heuristic approaches get round combinatorial explosion by deliberately ignoring some combinations.
As a consequence, these approaches may miss the optimal solution and, of course, they cannot prove the optimality of the combination they found, even if it is actually optimal. As a counterpart, their time complexity usually is polynomial. Approximation approaches aim to find approximate solutions with provable guarantees on the distance of the achieved solution to the optimum: if an algorithm can find a solution within a factor α of the optimum for every instance of the given problem, it is an α-approximation algorithm.

Organisation of the Chapter

There are mainly two kinds of heuristic approaches. Perturbative heuristic approaches (described in Sect. 2) build new combinations by modifying existing combinations; constructive heuristic approaches (described in Sect. 3) generate new combinations in an incremental way by using a (stochastic) model. These approaches may be hybridised, and we describe in Sect. 4 some classical hybrid schemes. Then, we introduce in Sect. 5 the notions of diversification (exploration) and intensification (exploitation), which are shared by all these heuristic approaches: diversification aims at ensuring a good sampling of the search space and, therefore, at reducing the risk of ignoring a sub-area which actually contains a solution, whereas intensification aims at guiding the search towards the best combinations within a limited region of the search space. Finally, we describe in Sect. 6 two applications

of meta-heuristics to typical artificial intelligence problems: satisfiability of Boolean formulas (SAT), and constraint satisfaction problems (CSPs).

Notations

In this chapter, we assume that the search problem to be solved is defined by a pair (S, f) such that S is a set of candidate combinations and f : S → R is an objective function that associates a numerical value with every combination of S. Solving such a problem involves finding the combination s* ∈ S that optimises (maximises or minimises) f. We more particularly illustrate the different meta-heuristics introduced in this chapter on the travelling salesman problem (TSP): given a set V of cities and a function d : V × V → R such that, for each pair of cities {i, j} ⊆ V, d(i, j) is the distance between i and j, the goal is to find the shortest route that passes through each city of V exactly once. For this problem, the set S of candidate combinations is the set of all circular permutations of V, and the objective function f to be minimised is the sum of the distances between every pair of consecutive cities in the permutation.
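To make these notations concrete, the TSP objective function f can be sketched in a few lines of Python (the four-city instance and the Euclidean distance used here are purely illustrative):

```python
import math

# Hypothetical instance: four cities placed on the corners of a unit square.
coords = {"a": (0, 0), "b": (0, 1), "c": (1, 1), "d": (1, 0)}

def dist(i, j):
    """Euclidean distance d(i, j) between two cities."""
    (x1, y1), (x2, y2) = coords[i], coords[j]
    return math.hypot(x1 - x2, y1 - y2)

def tour_length(tour):
    """Objective function f: total length of the circular permutation `tour`."""
    return sum(dist(tour[k], tour[(k + 1) % len(tour)]) for k in range(len(tour)))

print(tour_length(["a", "b", "c", "d"]))  # perimeter of the square: 4.0
```

A combination here is a list of city labels read as a circular tour, matching the circular-permutation view of S above.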

2 Perturbative Meta-heuristics

Perturbative approaches explore the combination space S by iteratively perturbing combinations: starting from one or more initial combinations (that can be obtained by any means, often randomly or greedily), the idea is to iteratively generate new combinations by modifying some previously generated combinations. These approaches are said to be instance-based in Zlochin et al. (2004). The best known perturbative approaches are genetic algorithms, described in Sect. 2.1, and local search, described in Sect. 2.2.

2.1 Genetic Algorithms

Genetic algorithms (Holland 1975; Goldberg 1989; Eiben and Smith 2003) draw their inspiration from the evolutionary process of biological organisms in nature and, more particularly, from three main mechanisms which allow them to better fit their environment:

• Natural selection implies that individuals that are well fitted to their environment usually have a better chance of surviving and, therefore, reproducing.
• Reproduction by cross-over implies that an individual inherits its features from its two parents, in such a way that two well-fitted individuals tend to generate new individuals that are also well fitted and, hopefully, better fitted.
• Mutation implies that some features may randomly appear or disappear, thus allowing nature to introduce new abilities that are spread to the next generations thanks to natural selection and cross-over if these new abilities better fit the individual to its environment.

Genetic algorithms combine these three mechanisms to define a meta-heuristic for solving combinatorial optimisation problems. The idea is to evolve a population of combinations (by applying selection, cross-over and mutation) in order to find better fitted combinations, where the fitness of a combination is assessed with respect to the objective function to optimise. Algorithm 1 describes this basic principle, the main steps of which are briefly described in the next paragraphs.

Algorithm 1: Genetic Algorithm
  Initialise the population
  while stopping criteria not reached do
    Select combinations from the population
    Create new combinations by recombination and mutation
    Update the population
  return the best combination that ever belonged to the population

Initialisation of the population: In most cases, the initial population is randomly generated with respect to a uniform distribution in order to ensure a good diversity of the combinations.

Selection: This step involves choosing the combinations of the population that will be used to generate new combinations (by recombination and mutation). Selection procedures are usually stochastic and designed in such a way that the selection of the best combinations is favoured, while leaving worse combinations a small chance of being selected. There exist many different ways to implement this selection step. For example, tournament selection consists in randomly selecting a few combinations in the population and keeping the best one (or randomly selecting one with a probability proportional to the objective function). Selection may also consider other criteria, such as diversity.

Recombination (cross-over): This step aims at generating new combinations from the selected combinations. The goal is to lead the search towards a new zone of the space where better combinations may be found. To this aim, the recombination operator should be well fitted to the function to be optimised and able to transmit good properties of the parents to the new combination. Moreover, recombination should make it possible to create diversified children. From a diversification/intensification point of view, recombination has a strategic diversification role, with a long-term goal of intensification.

Mutation: Mutation aims at slightly modifying the combinations obtained after cross-over. It is usually implemented by randomly selecting combination components and randomly choosing new values for these components.

Example 1 For the TSP, a simple recombination consists in copying a sub-sequence of the first parent, and completing the permutation by sequencing the cities that are missing in the order they occur in the second parent.
A classical mutation operator consists in randomly choosing some cities and exchanging their positions.
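A minimal sketch of these two operators for the TSP, under the assumption that tours are represented as Python lists of city labels (the function names are ours):

```python
def recombine(p1, p2, start, end):
    """Example 1's recombination: copy p1[start:end], then append the
    cities missing from the child in the order they occur in p2."""
    child = p1[start:end]
    child += [city for city in p2 if city not in child]
    return child

def mutate(tour, i, j):
    """Example 1's mutation: exchange the cities at positions i and j."""
    tour = tour[:]  # copy, so the parent is left untouched
    tour[i], tour[j] = tour[j], tour[i]
    return tour

p1 = ["a", "b", "c", "d", "e"]
p2 = ["e", "d", "c", "b", "a"]
print(recombine(p1, p2, 1, 3))  # ['b', 'c', 'e', 'd', 'a']
print(mutate(p1, 0, 4))         # ['e', 'b', 'c', 'd', 'a']
```

Note that both operators preserve the permutation property: every child is again a valid tour, which is exactly what makes them suitable for this search space.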

Population updating step: This step aims at replacing some combinations of the population by some of the new combinations (those generated by applying the recombination and mutation operators) in order to create the next generation population. The update policy is essential to maintain an appropriate level of diversity in the population, to prevent the search process from premature convergence, and to allow the algorithm to discover new promising areas of the search space. Hence, decisions are often taken with respect to criteria related to quality and diversity. For example, a well known quality-based update rule consists in replacing the worst combination of the population, while a diversity-based rule consists in replacing old combinations by similar new combinations, with respect to some given similarity measure (Lü and Hao 2010; Porumbel et al. 2010). Other criteria, such as age, may also be considered.

Stopping criteria: The evolution process is iterated, from generation to generation, until it has found a solution whose quality reaches some given bound, or until a fixed number of generations or a CPU-time limit has been reached. One may also use diversity indicators, such as the resampling rate or the average pairwise distance, to restart a new search when the population becomes too uniform.

2.2 Local Search

Local Search (LS) explores the search space by iteratively modifying a combination: starting from an initial combination, it iteratively moves from the current combination to a neighbour combination, obtained by applying some transformation to it (Hoos and Stützle 2005). LS may be viewed as a very particular case of a GA whose population is composed of only one combination. Algorithm 2 describes this basic principle, the main steps of which are briefly described in the next paragraphs.

Algorithm 2: Local Search (LS)
  Generate an initial combination s ∈ S
  while stopping criteria not reached do
    Choose s′ ∈ n(s)
    s ← s′
  return the best combination encountered during the search

Neighbourhood function: The LS algorithm is parameterised by a neighbourhood function n : S → P(S) which defines the set of combinations n(s) that may be obtained by applying some transformation operators to a given combination s ∈ S. One may consider different kinds of transformation operators such as, for example, changing the value of one variable, or swapping the values of two variables. Each transformation operator induces a different neighbourhood, the size of which may vary; hence, the choice of the transformation operators has a strong influence on the solution process. A strongly desirable property of the transformation operator is that it must allow the search to reach the optimal combination from any initial combination. In other words, the directed graph which associates a vertex with each combination of S, and an edge (s_i, s_j) with each pair of combinations such that s_j ∈ n(s_i), must contain a path from any of its vertices to the vertex associated with the optimal combination.

Example 2 For the TSP, the 2-opt operator consists in deleting two edges, and replacing them by two new edges that reconnect the two paths created by the edge deletion. More generally, the k-opt operator consists in deleting k mutually disjoint edges and re-assembling the different sub-paths created by these deletions by adding k new edges in such a way that a complete tour is reconstituted. The larger k, the larger the neighbourhood size.

Generation of the initial combination: The search is started from a combination which is often randomly generated. The initial combination may also be generated with a greedy approach such as those introduced in Sect. 3.1. When local search is hybridised with another meta-heuristic such as, for example, genetic algorithms or ant colony optimisation, the initial combination may be the result of another search process.

Choice of a neighbour: At each iteration of LS, one has to choose a combination s′ in the neighbourhood of the current combination s and substitute s with s′ (this is called a move). There exist many different strategies for choosing a neighbour. For example, greedy strategies always choose better (or at least as good) neighbours. In particular, the best improvement greedy strategy scans the whole neighbourhood and selects the best neighbour, that is, the one which most improves the objective function (Selman et al. 1992), whereas the first improvement greedy strategy selects the first neighbour which improves the objective function. These greedy strategies may be compared to hill climbers that always choose a rising path.
This kind of strategy usually allows the search to quickly improve the initial combination. However, once the search has reached a locally optimal combination (that is, a combination whose neighbours all have worse objective function values), it becomes stuck on it. To escape from these local optima, one may consider different alternative strategies offered by meta-heuristics, such as:

• random walk (Selman et al. 1994), which with a very small probability (controlled by a parameter) randomly selects a neighbour;
• simulated annealing (Aarts and Korst 1989), which may select neighbours of worse quality, with a probability that decreases with time;
• tabu search (Glover and Laguna 1993), which prevents the search from cycling on a small subset of combinations around local optima by memorising the last moves in a tabu list and forbidding the inverse of these tabu moves.

Repetition of local search: Local search may be repeated several times, starting from different initial combinations. These initial combinations may be randomly and independently generated, as proposed in multi-start local search. They may also be obtained by perturbing a combination generated during the previous local search process, as proposed in iterated local search (Lourenco et al. 2002) and breakout local search (Benlic and Hao 2013a, b). We may also perform several local searches in parallel, starting from different initial combinations, and evenly redistributing the current combinations by removing the worst ones and duplicating the best ones, as proposed in go with the winner (Aldous and Vazirani 1994).

Local search with multiple neighbourhoods: A typical local search algorithm usually relies on a single neighbourhood to explore the search space. However, in a number of settings, several neighbourhoods can be jointly employed to reinforce the search capacity of local search. Variable neighbourhood search is a well-known example, which employs the greedy strategy and a strict neighbourhood transition rule to examine a set of nested neighbourhoods of increasing sizes. Each time a local optimum is reached within the current neighbourhood, the search switches to the next (larger) neighbourhood, and switches back to the smallest neighbourhood once an improving solution is found (Hansen and Mladenovic 2001). Other local search algorithms using more flexible neighbourhood transition rules can be found in Goëffon et al. (2008), Ma and Hao (2017), Wu et al. (2012).
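The simulated-annealing acceptance rule mentioned above can be sketched as follows (a minimal sketch for minimisation; the function name and signature are ours, not taken from Aarts and Korst 1989):

```python
import math
import random

def accept(delta, temperature, rng=random.random):
    """Metropolis rule: always accept an improving or equal move (delta <= 0);
    accept a worsening move with probability exp(-delta / temperature),
    which shrinks as the temperature is lowered over time."""
    return delta <= 0 or rng() < math.exp(-delta / temperature)
```

At high temperature almost any neighbour is accepted (diversification); as the temperature approaches zero the rule degenerates into strict hill climbing (intensification), so the cooling schedule directly controls the balance discussed in Sect. 5.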

3 Constructive Meta-heuristics

Constructive approaches build one or more combinations in an incremental way: starting from an empty combination, they iteratively add combination components until obtaining a complete combination. These approaches are said to be model-based in Zlochin et al. (2004), as they use a model (which is often stochastic) to choose, at each iteration, the next component to be added to the partial combination. There exist different strategies for choosing the components to be added at each iteration, the best known being greedy randomised strategies, described in Sect. 3.1, estimation of distribution algorithms, described in Sect. 3.2, and ant colony optimisation, described in Sect. 3.3.

3.1 Greedy Randomised Algorithms

A greedy algorithm builds a combination in an incremental way: it starts from an empty combination and incrementally completes it by adding components to it. At each step, the component to be added is chosen in a greedy way, that is, one chooses the component which maximises some problem-dependent heuristic function which locally evaluates the interest of adding the component with respect to the objective function. A greedy algorithm usually has a very low time complexity, as it never backtracks to a previous choice. The quality of the final combination depends on the heuristic.

Example 3 A greedy algorithm for the TSP may be defined as follows: starting from an initial city which is randomly chosen, one chooses, at each iteration, the closest city that has not yet been visited, until all cities have been visited.

Greedy randomised algorithms deliberately introduce a slight amount of randomness into greedy algorithms in order to diversify the constructed combinations. In this case, greedy randomised constructions are iteratively performed until some stopping criterion is reached, and the best constructed combination is returned. To introduce randomness in the construction, one may randomly choose the next component within the k best ones, or within the set of components whose quality is bounded by a given ratio with respect to the best component (Feo and Resende 1989). Another possibility is to select the next component with respect to probabilities which are defined proportionally to component qualities (Jagota and Sanchis 2001).

Example 4 For the TSP, if the last visited city is i, and if C contains the set of cities that have not yet been visited, we may define the probability of selecting a city j ∈ C by p(j) = [1/d(i,j)]^β / Σ_{k∈C} [1/d(i,k)]^β. The parameter β allows one to tune the level of greediness/randomisation: if β = 0, then all cities in C have the same probability of being selected; the higher β, the higher the probability of selecting cities that are close to i.
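Example 4's selection rule is easy to state in code (a sketch; `d` is any distance function, and the names are ours):

```python
def selection_probabilities(i, candidates, d, beta):
    """p(j) = (1/d(i,j))**beta / sum over k in candidates of (1/d(i,k))**beta."""
    weights = {j: (1.0 / d(i, j)) ** beta for j in candidates}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

d = lambda i, j: abs(i - j)  # hypothetical one-dimensional "cities"
print(selection_probabilities(0, [1, 2], d, beta=0))  # uniform: {1: 0.5, 2: 0.5}
print(selection_probabilities(0, [1, 2], d, beta=1))  # biased towards the closer city
```

With beta = 0 the construction is purely random; letting beta grow recovers the deterministic greedy algorithm of Example 3 in the limit.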

3.2 Estimation of Distribution Algorithms

Estimation of Distribution Algorithms (EDAs) are greedy randomised algorithms (Larranaga and Lozano 2001): at each iteration, a set of combinations is generated according to the greedy randomised principle described in the previous section. However, EDAs exploit previously computed combinations to bias the construction of new combinations. Algorithm 3 describes this basic principle, the main steps of which are briefly described in the next paragraphs.

Algorithm 3: Estimation of Distribution Algorithm (EDA)
  Generate an initial population of combinations P ⊆ S
  while stopping criteria not reached do
    Use P to construct a probabilistic model M
    Use M to generate new combinations
    Update P with respect to these new combinations
  return the best combination that has been built during the search process

Generation of the initial population: In most cases, the initial population is randomly generated with respect to a uniform distribution, and only the best constructed combinations are kept in the population.

Construction of a probabilistic model: Different kinds of probabilistic models may be considered. The simplest one, called PBIL (Baluja 1994), is based on the probability distribution of each combination component, independently of the other components. In this case, one computes for each component its occurrence frequency in the population, and one defines the probability of selecting this component proportionally to its frequency. Other models may take into account dependency relationships between components by using Bayesian networks (Pelikan et al. 1999). In this case, the dependency relationships between components are modelled by edges in a graph, and conditional probability distributions are associated with these edges. Such models usually allow the search to build better combinations, but they are also more expensive to compute.

Example 5 For the TSP, if the last visited city is i, and if C contains the set of cities that have not yet been visited, we may define the probability of selecting a city j ∈ C by p(j) = freq_P(i,j) / Σ_{k∈C} freq_P(i,k), where freq_P(i,j) is the number of combinations of P that use edge (i,j). Hence, the more the population uses edge (i,j), the higher the probability of selecting j.

Generation of new combinations: New combinations are built in a greedy randomised way, using the probabilistic model to choose components.

Update of the population: In most cases, only the best combinations are kept in the population for the next iteration of the search process, whether they belong to the current population or to the set of newly generated combinations. However, it is also possible to keep lower quality combinations in order to maintain a good diversity in the population, thus ensuring a good sampling of the search space.
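Example 5's frequency-based model can be sketched as follows (the helper names are ours; a real EDA would add a smoothing term so that edges never seen in the population keep a non-zero probability):

```python
def edge_frequencies(population):
    """Count, over a population of tours, how often each directed edge
    (i, j) of consecutive cities occurs; this dictionary is the model."""
    freq = {}
    for tour in population:
        for k in range(len(tour)):
            edge = (tour[k], tour[(k + 1) % len(tour)])
            freq[edge] = freq.get(edge, 0) + 1
    return freq

def edge_probabilities(i, candidates, freq):
    """p(j) = freq_P(i,j) / sum over unvisited k of freq_P(i,k)."""
    total = sum(freq.get((i, k), 0) for k in candidates)
    return {j: freq.get((i, j), 0) / total for j in candidates}

population = [["a", "b", "c"], ["a", "b", "c"], ["a", "c", "b"]]
freq = edge_frequencies(population)
print(edge_probabilities("a", ["b", "c"], freq))  # "b" is twice as likely as "c"
```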

3.3 Ant Colony Optimisation

There exists a strong similarity between Ant Colony Optimisation (ACO) and EDAs (Zlochin et al. 2004). Both approaches use a probabilistic model to build new combinations, and in both approaches this probabilistic model evolves during the search process with respect to previously built combinations, in an iterative learning process. The originality and main contribution of ACO is that it borrows features from the collective behaviour of ants to update the probabilistic model. Indeed, the probability of choosing a component depends on a quantity of pheromone which represents the past experience of the colony with respect to the choice of this component. This quantity of pheromone evolves through two combined mechanisms. The first is a pheromone laying step: pheromone trails associated with the best combinations are reinforced in order to increase the probability of selecting these components. The second is pheromone evaporation: pheromone trails are uniformly and progressively decreased in order to progressively forget older experience. Algorithm 4 describes this basic principle, the main steps of which are briefly described in the next paragraphs.

Algorithm 4: Ant Colony Optimisation
  Initialise pheromone trails to τ0
  while stopping conditions are not reached do
    Each ant builds a combination
    Pheromone trails are updated
  return the best combination

Pheromone trails: Pheromone is used to bias selection probabilities when building combinations in a greedy randomised way. A key point lies in the choice of the components on which pheromone is laid. At the beginning of the search, all pheromone trails are initialised to a given value τ0.

Example 6 For the TSP, a trail τ_ij is associated with each pair of cities (i, j). This trail represents the past experience of the colony with respect to visiting i and j consecutively.

Construction of a combination by an ant: At each cycle of an ACO algorithm, each ant builds a combination according to the greedy randomised principle introduced in Sect. 3.1. Starting from an empty combination (or a combination that contains a first combination component), at each iteration the ant selects a new component to be added to the combination, until the combination is complete. The next combination component is selected with respect to a probabilistic transition rule: given a partial combination X, and a set C of combination components that may be added to X, the ant selects a component i ∈ C with probability:

    p_X(i) = [τ_X(i)]^α · [η_X(i)]^β / Σ_{j∈C} ([τ_X(j)]^α · [η_X(j)]^β)        (1)

where τ_X(i) (resp. η_X(i)) is the pheromone (resp. heuristic) factor associated with component i, given the partial combination X (the definition of this factor is problem-dependent), and α and β are two parameters used to balance the relative influence of the pheromone and heuristic factors in the transition probability. In particular, if α = 0 then the pheromone factor does not influence the selection and the algorithm behaves like a pure greedy randomised algorithm. On the contrary, if β = 0 then transition probabilities only depend on pheromone trails.

Example 7 For the TSP, the pheromone factor τ_X(i) is defined by the quantity of pheromone τ_ki laid between the last city k visited in X and the candidate city i. The heuristic factor is inversely proportional to the distance between the last city visited in X and the candidate city i.

Pheromone updating step: Once each ant has built a combination, pheromone trails are updated. First, they are decreased by multiplying each trail by a factor (1 − ρ), where ρ ∈ [0; 1] is the evaporation rate. Then, some combinations are “rewarded”, by laying pheromone on their components. There exist different strategies for selecting the combinations to be rewarded: we may reward all combinations built during the current cycle, only the best combinations of the current cycle, or the best combination found since the beginning of the search. These different strategies influence search intensification and diversification. In general, the quantity of pheromone added is proportional to the quality of the rewarded combination; it is added on the pheromone trails associated with the rewarded combination. To prevent the algorithm from premature convergence, pheromone trails may be bounded between two bounds, τmin and τmax, as proposed in the MAX-MIN Ant System (Stützle and Hoos 2000).

Example 8 For the TSP, we add pheromone on each trail τ_ij such that cities i and j have been consecutively visited in the rewarded combination.
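Equation (1) and the evaporation/reward scheme can be sketched as follows (dictionaries `tau` and `eta` map components to their pheromone and heuristic factors; the function names are ours):

```python
def transition_probabilities(tau, eta, candidates, alpha, beta):
    """Eq. (1): p_X(i) proportional to [tau(i)]^alpha * [eta(i)]^beta."""
    weights = {i: (tau[i] ** alpha) * (eta[i] ** beta) for i in candidates}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}

def update_pheromone(tau, rewarded, reward, rho):
    """Evaporation then reward: every trail is multiplied by (1 - rho),
    and the trails of the rewarded components receive extra pheromone."""
    for c in tau:
        tau[c] *= 1.0 - rho
    for c in rewarded:
        tau[c] += reward
```

With alpha = 0 the transition rule reduces to the greedy randomised selection of Sect. 3.1, as noted above; with beta = 0 only the learned pheromone matters.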

4 Hybrid Meta-heuristics

The different meta-heuristics presented in Sects. 2 and 3 may be combined to define new meta-heuristics. Two classical examples of such hybridisations are described in Sects. 4.1 and 4.2.

4.1 Memetic Algorithms

Memetic algorithms combine population-based approaches (evolutionary algorithms) with local search (Moscato 1999; Neri et al. 2012). This hybridisation aims at combining the diversification abilities of population-based approaches with the intensification abilities of local search (Hao 2012). A memetic algorithm may be viewed as a genetic algorithm (as described in Algorithm 1) extended with a local search process. As in genetic algorithms, a population of combinations is used to sample the search space, and a recombination operator is used to create new combinations from combinations of the population. Selection and replacement mechanisms are used to determine the combinations that are recombined and those that are eliminated. However, the mutation operator of genetic algorithms is replaced by a local search process, which may be viewed as a guided macro-mutation process. The goal of the local search step is to improve the quality of the generated combinations; it mainly aims at intensifying the search by exploiting search paths determined by the considered neighbourhood operators. Like recombination, local search is a key component of memetic approaches.


J.-K. Hao and C. Solnon

4.2 Hybridisation Between Perturbative and Constructive Approaches

Perturbative and constructive approaches may be hybridised in a straightforward way: at each iteration, one or more combinations are built according to a greedy randomised principle (possibly guided by EDA or ACO), and then some of these combinations are improved by a local search process. This hybridisation is called GRASP (Greedy Randomised Adaptive Search Procedure) (Resende and Ribeiro 2003). A key point is the choice of the neighbourhood operator and of the strategy used to select moves for the local search. The goal is to find a compromise between the time spent by the local search to improve combinations and the quality of these improvements. Typically, one chooses a simple greedy local search, which improves combinations until reaching a local optimum. Note that the best performing EDA and ACO algorithms usually include this kind of hybridisation with local search.
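The GRASP principle reduces to a very short loop. The sketch below assumes minimisation and takes the randomised constructor and the local search as caller-supplied functions; it is a schematic illustration of the scheme, not the full procedure of Resende and Ribeiro.

```python
def grasp(construct_greedy_randomised, local_search, cost, iterations=50):
    """GRASP skeleton: repeat (randomised greedy construction, then
    local search) and keep the best local optimum found."""
    best = None
    for _ in range(iterations):
        # Construction phase, then improvement phase.
        candidate = local_search(construct_greedy_randomised())
        if best is None or cost(candidate) < cost(best):
            best = candidate
    return best
```

Because each iteration is independent, the construction phase keeps diversifying the starting points while the local search intensifies around each of them.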

5 Intensification Versus Diversification

For all meta-heuristics described in this chapter, a key point is their capability to find a suitable balance between two dual goals:
• intensification (also called exploitation), which aims at guiding the search towards the most promising areas of the search space, that is, around the best combinations found so far;
• diversification (also called exploration), which aims at allowing the search to move away from local optima and discover new areas, which may contain better combinations.
The way the search is intensified/diversified depends on the considered meta-heuristic and is achieved by tuning parameters. For perturbative approaches, intensification is achieved by favouring the best neighbours: for genetic algorithms, elitist selection and replacement strategies favour the reproduction of the best combinations; for local search, greedy strategies favour the selection of the best neighbours. Diversification of perturbative approaches is usually achieved by introducing a mild amount of randomness: for genetic algorithms, diversification is mainly achieved by mutation; for local search, diversification is achieved by allowing the search to select “bad” neighbours with small probabilities. For constructive approaches, intensification is achieved by favouring, at each step of the construction, the selection of components that belong to the best combinations built so far. Diversification is achieved by introducing randomness in the selection procedures, thus allowing worse components to be chosen (with small probabilities).
In general, the more the search is intensified, the quicker it converges towards rather good combinations. However, if one over-intensifies the search, the algorithm
may stagnate around local optima, concentrating all the search effort on a small sub-area without being able to escape from this attracting sub-area to explore other areas. On the contrary, if one over-diversifies the search, so that it behaves like a random search, the algorithm may spend most of its time exploring poor quality combinations. It is worth mentioning here that the right balance between intensification and diversification clearly depends on the CPU time the user is willing to spend on the solution process: the shorter the CPU time limit, the more the search should be intensified to quickly converge towards solutions. The right balance between intensification and diversification also highly depends on the instance to be solved or, more precisely, on the topology of its search landscape. In particular, if there is a good correlation between the quality of a locally optimal combination and its distance to the closest optimal combination (as in massif central landscapes), then the best results are usually obtained with a strong intensification of the search. On the contrary, if the search landscape contains a lot of local optima that are uniformly distributed in the search space independently of their quality, then the best results are usually obtained with a strong diversification of the search. Different approaches have been proposed to dynamically adapt parameters that balance intensification and diversification during the search process. This is usually referred to as reactive search (Battiti et al. 2008). For example, the reactive tabu search approach proposed in Battiti and Protasi (2001) dynamically adapts the length of the tabu list by increasing it when combinations are recomputed (thus indicating that the search is circling around a local optimum), and decreasing it when the search has not recomputed combinations for a while. Also, the IDwalk local search of Neveu et al. (2004) dynamically adapts the number of neighbours considered at each move.
More generally, machine learning techniques may be used to automatically learn how to dynamically adapt parameters during the search (Battiti and Brunato 2015). Other approaches have studied how to automatically search for the best static parameter settings, either for a given class of instances (parameter configuration), or for each new instance to solve (parameter selection) (Hoos et al. 2017).

6 Applications in Artificial Intelligence

Meta-heuristics have been applied successfully to solve a very large number of classical NP-hard problems and practical applications in highly varied areas. In this section, we discuss their application to two central problems in artificial intelligence: satisfiability of Boolean formulas (sat) and satisfaction of constraints (csp).

6.1 Satisfiability of Boolean Formulas

sat is one of the central problems in artificial intelligence. For a Boolean formula, sat involves determining a model, namely an assignment of a Boolean value (true
or false) to each variable such that the valuation of the formula is true. For practical reasons, it is often assumed that the Boolean formula under consideration is given in clausal form (cnf), even if, from a general point of view, this is not a necessary condition to apply a meta-heuristic. Note that sat per se is not an optimisation problem and does not have an explicit objective function to optimise. Since meta-heuristics are generally conceived to solve optimisation problems, they consider a more general problem, maxsat, whose goal is to find the valuation maximising the number of satisfied clauses. In this context, the result of a meta-heuristic algorithm on a maxsat instance can be of two kinds: either the returned assignment satisfies all the clauses of the formula, in which case a solution (model) is found for the given sat instance, or it does not satisfy all the clauses, in which case we cannot know whether the given instance is satisfiable or not. The search space of a sat instance is naturally given by the set of possible assignments of Boolean values to the set of variables. Thus, for an instance with n variables, the size of the search space is 2^n. The objective function (also called evaluation function) counts the number of satisfied clauses. This function introduces a total order on the combinations of the search space. We can also consider the dual objective (or evaluation) function, counting the number of falsified clauses and corresponding to a penalty function to minimise: each falsified clause has a penalty weight equal to 1, and a combination with an evaluation of 0 is a solution (model). This function can be further fine-tuned by a dynamic penalty mechanism or extended with information other than the number of falsified clauses. A lot of work has been done during the last decades on the practical solving of sat.
Various competitions on sat solvers regularly organised by the scientific community continually boost research activities and the assessment and comparison of sat algorithms. This research has resulted in a very large number of contributions that improve the performance and robustness of sat algorithms, especially stochastic local search (sls).
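The dual evaluation function described above (number of falsified clauses, 0 meaning a model) can be written directly from the definitions; the DIMACS-style clause encoding used below (a positive integer for a variable, a negative one for its negation) is a common convention, assumed here for illustration.

```python
def falsified_clauses(clauses, assignment):
    """Dual maxsat evaluation function: number of falsified clauses.

    clauses: list of clauses, each a list of signed literals in DIMACS
    style, e.g. [1, -2] means (x1 or not x2).
    assignment: dict mapping a variable index to True/False.
    A result of 0 means the assignment is a model of the formula.
    """
    count = 0
    for clause in clauses:
        # A clause is satisfied as soon as one of its literals is true.
        satisfied = any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        if not satisfied:
            count += 1
    return count
```

This function induces the total order on the 2^n assignments that the sls algorithms of the next section exploit.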

6.1.1 Stochastic Local Search

sls algorithms generally consider the following simple neighbourhood function: two combinations are neighbours if their Hamming distance is exactly 1. The transition from one combination to another is conveniently achieved by flipping a variable. For a formula with n variables, the neighbourhood has a size of n (since each of the n variables can be flipped to obtain a neighbour). This neighbourhood can be shrunk by limiting the choice of the modified variable to those that appear in at least one falsified clause. This reduced neighbourhood is often used in sls algorithms because, in addition to its reduced size, it makes the search more focused (recall that the goal is to satisfy all the clauses). sls algorithms differ essentially in the techniques used to find a suitable compromise between (1) an exploration of the search space sufficiently broad to reduce the risk of stagnation in a non-promising region and (2) the exploitation of available information to discover a solution in the region currently under examination. In order
to classify these different algorithms, we can consider on the one hand the way the constraints are exploited, and on the other hand the way the search history is used (Bailleux and Hao 2008).
Exploitation of instance structures. The structure of the given problem instance is induced by its clauses. We can distinguish three levels of exploitation of this structure. The search can be simply guided by the objective function, so that only the number of clauses falsified by the current assignment is taken into account. A typical example is the popular gsat algorithm (Selman et al. 1992) that, at each iteration, randomly selects a neighbouring assignment from those minimising the number of falsified clauses. This greedy descent strategy is an aggressive exploitation technique that can easily be trapped in local optima. To work around this problem, gsat uses a simple diversification technique based on restarting the search from a new initial configuration after a fixed number of iterations. Several variants of gsat such as csat and tsat (Gent and Walsh 1993) and simulated annealing (Spears 1996) are based on the same principle of minimising the number of falsified clauses. The search can also be guided by conflicts, so that falsified clauses are explicitly taken into account. This approach is particularly useful when the neighbourhood no longer contains a combination improving the objective value. In order to better guide the search, other information obtained via, for example, an analysis of falsified clauses can be used. The archetypal example of this type of algorithm is walksat (McAllester et al. 1997) that, at each iteration, randomly selects one of the clauses falsified by the current assignment. A heuristic is then used to choose one of the variables of this clause, whose value will be flipped to obtain the new assignment.
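One walksat-style iteration can be sketched as follows. This is a simplified illustration of the scheme, not the exact walksat/skc heuristic: in particular, the `breaks` score below simply re-counts all falsified clauses after a trial flip, whereas the original algorithm uses a more refined (and incrementally maintained) break count.

```python
import random

def walksat_step(clauses, assignment, noise=0.5):
    """One walksat-style move (sketch): pick a falsified clause at random,
    then flip either a random variable of it (with probability `noise`)
    or the variable whose flip leaves the fewest falsified clauses."""
    def n_falsified(a):
        return sum(not any(a[abs(l)] == (l > 0) for l in c) for c in clauses)

    falsified = [c for c in clauses
                 if not any(assignment[abs(l)] == (l > 0) for l in c)]
    if not falsified:
        return assignment  # already a model, nothing to repair
    clause = random.choice(falsified)

    def breaks(var):
        # Falsified-clause count after tentatively flipping var.
        trial = dict(assignment)
        trial[var] = not trial[var]
        return n_falsified(trial)

    if random.random() < noise:
        var = abs(random.choice(clause))          # random walk step
    else:
        var = min((abs(l) for l in clause), key=breaks)  # greedy step
    assignment[var] = not assignment[var]
    return assignment
```

Restricting the choice to variables of a falsified clause is exactly the reduced neighbourhood discussed above: every move has a chance of repairing the selected conflict.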
The gwsat algorithm (gsat with random walk) (Selman and Kautz 1994) also uses a random walk guided by falsified clauses: at each iteration, it modifies a variable belonging to a falsified clause with a probability p (called noise) and uses the descent strategy of gsat with probability 1 − p. Finally, the exploitation of deductive constraints makes it possible to use deductive rules to modify the current combination. This is the case for unitwalk (Hirsch 2005), which uses unit resolution to determine the value of some variables of the current assignment from the values of other variables fixed by local search. Another type of deductive approach is used in non-cnf adapt novelty (Pham et al. 2007), which analyses the formula to be processed to search for dependency links between variables.
Exploitation of search history. To classify sls approaches for sat, we can also consider the way the actions performed since the beginning of the search, and possibly their effects, are exploited. There are three levels of exploitation of this history. For Markov algorithms (without memory), the choice of each new combination depends only on the current combination. The algorithms gsat, grsat, walksat/skc, as well as the general simulated annealing algorithm, are typical examples of sls algorithms without memory. These algorithms have the advantage of minimising
the processing time necessary for each iteration, yet they have been outperformed in practice during the last decades by algorithms with memory. For algorithms with short-term memory, the choice of each new combination takes into account a history of all or part of the changes of the variable values. sat solvers based on tabu search, such as walksat/tabu (McAllester et al. 1997) or tsat (Mazure et al. 1997), are typical examples. Some algorithms like walksat, novelty, rnovelty, and g2wsat (Li and Huang 2005) also integrate aging information in the choice criterion. Typically, this criterion is used to favour, amongst several candidate variables, the least recently modified one. For algorithms with long-term memory, the choice of each new combination depends on the choices made since the start of the search and their effects, in particular in terms of satisfied and falsified clauses. The idea is to use a learning mechanism to take advantage of failures and accelerate the search. In practice, the history of falsified clauses is exploited by associating weights with clauses, the objective being to diversify the search by forcing it to take new directions. A first example is weightedgsat (Selman and Kautz 1993) where, after each change of a variable, the weights of falsified clauses are incremented. The score used by the search process is simply the sum of the weights of the falsified clauses. As the weight of frequently falsified clauses increases, the process tends to favour the modification of variables belonging to these clauses, until the evolution of the weights of the other clauses influences the search again. Other examples include dlm (Discrete Lagrangian Method) (Shang and Wah 1998), ddfw (Divide and Distribute Fixed Weights) (Ishtaiwi et al. 2005), paws (Pure Additive Weighting Scheme) (Thornton et al. 2004), esg (Exponentiated Subgradient Algorithm) (Schuurmans et al. 2001), saps (Scaling and Probabilistic Smoothing) (Hutter et al. 2002) and wv (weighted variables) (Prestwich 2005).
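The clause-weighting idea behind weightedgsat reduces to two small functions: a weighted score, and a weight-bumping step performed after each flip. The sketch below is an illustration of the principle only; the exact update and smoothing policies of the cited solvers differ.

```python
def weighted_score(clauses, weights, assignment):
    """weightedgsat-style score: sum of the weights of falsified clauses."""
    return sum(w for c, w in zip(clauses, weights)
               if not any(assignment[abs(l)] == (l > 0) for l in c))

def bump_weights(clauses, weights, assignment):
    """Long-term memory: after a flip, increment the weight of every
    clause that the current assignment still falsifies."""
    for i, c in enumerate(clauses):
        if not any(assignment[abs(l)] == (l > 0) for l in c):
            weights[i] += 1
    return weights
```

Because frequently falsified clauses accumulate weight, their variables become increasingly attractive to flip, which pushes the search into new directions, exactly the diversification effect described above.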

6.1.2 Population-Based Algorithms

Genetic algorithms (GAs) have been repeatedly applied to sat (Jong and Spears 1989; Young and Reel 1990; Crawford and Auton 1993; Hao and Dorne 1994). Like local search algorithms, these GAs represent a candidate combination as an assignment of values 0/1 to the set of variables. The first GAs handled general formulas, not limited to cnf form. Unfortunately, the results obtained by these algorithms are generally disappointing when they are applied to sat benchmarks. The first GA dealing with cnf formulas is presented in Fleurent and Ferland (1996). The originality of this algorithm is a specific cross-over operator that tries to exploit the semantics of the two parent assignments: given two parents p1 and p2, one examines the clauses that are satisfied by one parent but falsified by the other, and the values of all the variables of such a clause are passed directly from the satisfying parent to the child. The memetic algorithm integrating this cross-over operator and a tabu search procedure yielded very interesting results during the second DIMACS implementation challenge.


Another representative hybrid genetic algorithm is gasat (Lardeux et al. 2006). Like the algorithm of Fleurent and Ferland (1996), gasat attaches preponderant importance to the design of a semantic cross-over operator. A new class of cross-overs is introduced, aimed at correcting the “errors” of the parents and combining their good characteristics. For example, if a clause is false in two good-quality parents, we can force this clause to become true in the child by flipping a variable of the clause. The intuition is that such a clause is difficult to satisfy otherwise, so it is preferable to satisfy it by force. Similarly, recombination mechanisms are developed to exploit the semantics associated with a clause that is made simultaneously true by both parents. With this type of cross-over and a tabu search algorithm, gasat is very competitive on some categories of sat benchmarks. FlipGA (Marchiori and Rossi 1999) is a genetic algorithm that relies on a standard uniform cross-over generating, by an equiprobable mixture of the values of both parents, a child that is further improved by local search. Other GAs are presented in Eiben and van der Hauw (1997), Gottlieb and Voss (1998), Rossi et al. (2000), but they are in fact local search algorithms since the population is reduced to a single assignment. Their interest lies in the techniques used to refine the basic evaluation function (the number of falsified clauses) by a dynamic adjustment during the search process. An experimental comparison of these genetic algorithms for sat can be found in Gottlieb et al. (2002). However, their practical interest remains to be demonstrated since they have rarely been directly confronted with state-of-the-art sls algorithms. Note that in these population-based algorithms, local search is often an essential component to reinforce intensification capabilities.
The cross-over’s role may differ depending on whether it is completely random (Marchiori and Rossi 1999), in which case it is used essentially to diversify the search, or based on the semantics of the problem (Fleurent and Ferland 1996; Lardeux et al. 2006), in which case it both diversifies and intensifies the search.
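The “error-correcting” idea can be sketched as follows. This is only an illustration in the spirit of gasat, not its actual operator: the child is initialised by a uniform mixture of the parents, and clauses falsified by both parents are then forced to become true by flipping one of their variables.

```python
import random

def corrective_crossover(p1, p2, clauses):
    """Sketch of a semantic cross-over: clauses falsified by BOTH parents
    are forced true in the child (p1, p2: dicts variable -> bool;
    clauses: lists of signed literals, DIMACS style)."""
    def sat(clause, a):
        return any(a[abs(l)] == (l > 0) for l in clause)
    # Start from a uniform mixture of the two parents.
    child = {v: random.choice((p1[v], p2[v])) for v in p1}
    for clause in clauses:
        if not sat(clause, p1) and not sat(clause, p2):
            # Flip one variable of the clause so its literal becomes true.
            lit = random.choice(clause)
            child[abs(lit)] = lit > 0
    return child
```

A clause falsified by two good parents is presumed hard to satisfy by chance, so it is satisfied by force, which is precisely the intuition stated above.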

6.2 Constraint Satisfaction Problems

A constraint satisfaction problem (CSP) is a combinatorial problem modelled as a set of constraints defined over a set of variables, each variable taking its values in a given domain. As for sat, one generally considers the optimisation problem MaxCSP, whose objective is to find a complete assignment (assigning a domain value to each variable) that maximises the number of satisfied constraints. The search space is thus defined by the set of all possible complete assignments, while the objective function to be maximised counts the number of constraints satisfied by a complete assignment.


6.2.1 Genetic Algorithms

In its simplest form, a genetic algorithm for CSPs uses a population of complete assignments that are recombined by simple cross-over, with a mutation operator consisting of changing the value of a variable. An experimental comparison of eleven genetic algorithms for solving binary CSPs is presented in Craenen et al. (2003). It is shown that the three best algorithms (Heuristics GA version 3, Stepwise Adaptation of Weights and Glass-Box) have equivalent performance and are significantly better than the other eight algorithms. However, these best genetic algorithms are clearly not competitive, either with exact approaches based on tree search, or with other heuristic approaches such as local search or ant colony optimisation (van Hemert and Solnon 2004). Other genetic algorithms have been proposed for particular CSPs. These specific algorithms exploit knowledge of the constraints of the problem to be solved in order to define better cross-over and mutation operators, leading to better results. We can cite in particular Zinflou et al. (2007), which obtains competitive results for a car sequencing problem.

6.2.2 Local Search

There is a great deal of work on CSP solving using local search techniques, and this approach generally yields excellent results. These local search algorithms for CSPs differ essentially in the neighbourhood and the selection strategy considered.
Neighbourhood. Given a candidate assignment, a move operation typically modifies the value of one variable, so the neighbourhood of an assignment is composed of all assignments that can be obtained by changing the value of one variable in this assignment. Depending on the nature of the constraints, it is possible to consider other neighbourhoods, such as the neighbourhood induced by swap moves (which exchange the values of two variables), typically considered when there is a permutation constraint (which enforces a given set of variables to be assigned a permutation of a given set of values). Some studies consider neighbourhoods not only between complete assignments, but also between partial assignments. In particular, the decision repair approach of Jussien and Lhomme (2002) combines filtering techniques such as those described in chapter “Constraint Reasoning” of this Volume with local search on the space of partial assignments. Given a current partial assignment A, if filtering detects an inconsistency then the neighbourhood of A is the set of assignments resulting from the removal of a “variable/value” couple from A, otherwise it is the set of assignments resulting from the addition of a “variable/value” couple to A.
Selection strategies. Many strategies for choosing the next move have been proposed. The min-conflict strategy (Minton et al. 1992) randomly selects a variable involved in at least one violated constraint and chooses for this variable the value that minimises the number of constraint violations. This greedy strategy, which is famous for having found solutions to the N-queens problem for a million queens,
tends to be easily trapped in local optima. A simple and classical way to get the min-conflict strategy out of local minima is to combine it with the random walk strategy (Wallace 1996). Other strategies for choosing the variable/value pair are studied in Hao and Dorne (1996). Local search using tabu search (TabuCSP) (Galinier and Hao 1997) obtains excellent results for binary CSPs. Starting from an initial assignment, the idea is to choose, at each iteration, the non-tabu move that most increases the number of satisfied constraints. A move consists in changing the value of a conflicting variable (i.e., a variable involved in at least one unsatisfied constraint). Each time a move is performed, it is forbidden to select it again during the next k iterations (the move is said to be tabu, k being the tabu tenure). However, a move leading to a solution better than any solution discovered so far is always performed, even if it is currently tabu. This selection criterion is called aspiration in the terminology of tabu search. CBLS (Constraint-Based Local Search) (Hentenryck and Michel 2005) adapts ideas of constraint programming to local search and allows one to quickly design local search algorithms for solving CSPs. In particular, it introduces the concept of incremental variable, allowing an incremental evaluation of neighbourhoods, and uses invariants that are stated declaratively to achieve this. In Björdal et al. (2015), a CBLS backend is described for the MiniZinc CSP modelling language. Other generic CSP solving systems based on local search are presented in Davenport et al. (1994), Nonobe and Ibaraki (1998), Codognet and Diaz (2001), Galinier and Hao (2004).
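One min-conflict iteration can be sketched as follows; the `conflicts` function (number of violations involving a variable under a tentative value) is caller-supplied, and the tie-breaking by `min` is an assumption of this sketch rather than part of the original strategy.

```python
import random

def min_conflicts_step(variables, domains, conflicts, assignment):
    """One min-conflict move: pick a random conflicting variable, then give
    it the domain value that minimises its number of violations.

    conflicts(var, value, assignment) -> number of constraint violations
    involving var if it were assigned value (caller-supplied)."""
    conflicting = [v for v in variables
                   if conflicts(v, assignment[v], assignment) > 0]
    if not conflicting:
        return assignment  # no violated constraint: a solution is found
    var = random.choice(conflicting)
    assignment[var] = min(domains[var],
                          key=lambda val: conflicts(var, val, assignment))
    return assignment
```

Restricting the choice to conflicting variables is the CSP counterpart of the reduced sat neighbourhood of Sect. 6.1: every move has a chance of repairing a violation.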

6.2.3 Greedy Construction Algorithms

We can construct a complete assignment for a CSP according to the following greedy principle: starting from an empty assignment, we iteratively select an unassigned variable and a value for this variable, until all variables have received a value. To choose a variable and a value at each step, we can use the ordering heuristics employed by the exact approaches described in chapter “Constraint Reasoning” of this Volume. A well-known example of this principle is the DSATUR algorithm (Brélaz 1979) for the graph coloring problem, which is a particular CSP. DSATUR constructs a coloring by choosing, at each iteration, the uncoloured vertex having the largest number of neighbours colored with distinct colors (the most saturated vertex). Ties are broken by choosing the vertex of highest degree. The selected vertex is then colored with the smallest possible color.
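The DSATUR rule translates directly into code. The sketch below is a straightforward (non-optimised) implementation of the principle just described: saturation first, degree as tie-breaker, smallest available color.

```python
def dsatur(adj):
    """DSATUR greedy coloring. adj: dict vertex -> set of neighbours.
    Returns a dict vertex -> color (colors are 0, 1, 2, ...)."""
    colors = {}
    while len(colors) < len(adj):
        def saturation(v):
            # Number of distinct colors among already-coloured neighbours.
            return len({colors[u] for u in adj[v] if u in colors})
        uncolored = [v for v in adj if v not in colors]
        # Most saturated vertex first; ties broken by highest degree.
        v = max(uncolored, key=lambda v: (saturation(v), len(adj[v])))
        used = {colors[u] for u in adj[v] if u in colors}
        # Smallest color not used by a coloured neighbour.
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors
```

On a triangle this needs three colors; on a path of three vertices it starts with the middle vertex (highest degree) and two colors suffice.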

6.2.4 Ant Colony Optimisation

Ant colony optimisation has been applied to CSPs in Solnon (2002, 2010). The idea is to construct complete assignments according to a randomised greedy principle: starting from an empty assignment, one selects at each iteration an unassigned variable and a value to be assigned to this variable, until all variables are assigned. The main contribution of ACO is to provide a value selection heuristic: the value is chosen according
to a probability that depends on a heuristic factor (inversely proportional to the number of new violations introduced by the value) and a pheromone factor that reflects the past experience of the colony with the use of this value. The pheromone structure is generic and can be used to solve any CSP. It associates a pheromone trail with each variable xi and each value vi that can be assigned to xi: intuitively, this trail represents the colony's past experience of assigning value vi to variable xi. Other pheromone structures have been proposed for solving particular CSPs, such as the car sequencing problem (Solnon 2008) or the multidimensional knapsack problem (Alaya et al. 2007). Ant colony optimisation has also been integrated into generic constraint solving systems such as IBM/Ilog Solver and CP Optimizer (Khichane et al. 2008, 2010). These generic systems make it possible to use high-level languages to describe the problem to be solved in a declarative way, while its solution is automatically handled by a built-in ACO algorithm.

7 Discussions

Meta-heuristics have been used successfully to solve many difficult combinatorial search problems, and these approaches often obtain very good results in various implementation competitions, either for solving real problems such as car sequencing, timetabling and nurse rostering, or well-known NP-hard problems such as the sat problem and many graph problems (e.g., graph coloring).
The variety of meta-heuristics naturally raises the question of how to choose the most suitable meta-heuristic for a given problem. Obviously, this question is complex and comes close to a fundamental quest in artificial intelligence, namely automatic problem solving. We only discuss some elements of an answer here. In particular, the choice of a meta-heuristic depends on the relevance of its basic mechanisms for the problem considered, that is, their ability to generate good combinations:
• For GAs, the recombination operator should be able to identify interesting patterns that can be assembled and recombined (Hao 2012). For example, for graph coloring, an interesting pattern is a group of vertices of the same color shared by good solutions; this type of information allowed the design of very successful cross-over operators with two parents (Dorne and Hao 1998; Galinier and Hao 1999) or with several parents (Galinier et al. 2008; Malaguti et al. 2008; Lü and Hao 2010; Porumbel et al. 2010).
• For local search, the neighbourhood must favour the construction of better combinations. For example, for the sat and csp problems, the neighbourhood centred on conflicting variables (see Sect. 6.2) is relevant because it favours the elimination of conflicts.
• For ACO, the pheromone structure must be able to guide the search towards better combinations. For example, for csp, the pheromone structure associating a pheromone trail with each variable/value couple is relevant because it makes it possible to learn which values should be assigned to each variable.
Another crucial point is the time complexity of the elementary operators of the considered meta-heuristic (recombination and mutation for GAs, moves for local search, addition of a solution component for constructive approaches, …). This complexity depends on the data structures used to represent the combinations. Thus, the choice of a meta-heuristic depends on the existence of data structures that allow a fast update of the evaluation function after each application of these elementary operators. Other important elements are shared by all meta-heuristics. In particular, the evaluation function (which may be the same as or different from the objective function of the problem) is a key element because it measures the relevance of a combination. For example, for csp, the evaluation function can simply count the number of unsatisfied constraints, but in this case the relative importance of each constraint violation is not recognised. An interesting refinement is to introduce into the evaluation function a penalty term quantifying the degree of violation of a constraint (Galinier and Hao 2004). This function can be further refined, as in the case of the sat problem (see Sect. 6.1), by using weights that are dynamically adjusted according to the history of violations of each constraint. Meta-heuristic algorithms often have a number of parameters that have a significant influence on their efficiency.
Thus, we can consider the development of a meta-heuristic algorithm as a configuration problem: the goal is to choose the best building blocks to combine (recombination or mutation operators, neighbourhoods, move selection strategies, heuristic factors, etc.) as well as the best parameter setting (mutation rate, evaporation rate, noise, tabu tenure, the weights of pheromone structures and heuristic factors, etc.). A promising approach to this configuration problem consists in using automatic configuration algorithms to design the algorithm best suited to a set of given instances (Bezerra et al. 2016; Hutter et al. 2017; Xu et al. 2010). More generally, a popular way to boost the performance of meta-heuristic based algorithms is to hybridise different and complementary approaches (e.g., memetic algorithms). Continuing the effort of creating improved hybrid methods, an interesting trend is to combine artificial intelligence techniques (e.g., machine learning and data mining) with meta-heuristics (Battiti and Brunato 2015; de Lima Martins et al. 2016; Raschip et al. 2015; Santos et al. 2008; Toffolo et al. 2018; Zhou et al. 2018). In such a combined approach, artificial intelligence techniques are typically used to help design informed algorithmic components, dynamically select suitable search strategies, or identify useful patterns from high-quality solutions. We terminate this chapter with a word of caution. In the last years, the community has witnessed the appearance of a great number of metaphor-based or nature-inspired meta-heuristics. These “novel” methods are often proposed by recasting a natural phenomenon or species in terms of a search method, without a real justification or
understanding of the underlying search strategies. The proliferation of these fancy methods in some sense pollutes research in the area of meta-heuristics and makes it difficult to identify the methods that are genuinely useful. Fortunately, the dangers related to this proliferation are beginning to be recognised by the research community, as justly analysed in Sörensen (2013).

8 Conclusion

In this chapter, we have presented a panorama of meta-heuristics, a class of general methods for solving difficult combinatorial search problems. Even if these methods provide no guarantee on the distance between the achieved solution and the optimum of the given problem, they have the advantage of being applicable to virtually any difficult search problem. Nevertheless, to obtain an effective search algorithm, it is critical to adapt the general search strategies offered by these methods to the problem at hand. In particular, the targeted problem needs to be understood in depth to identify problem-specific knowledge, which can be incorporated into the search components of the algorithm. The design goal is to build an effective algorithm that ensures a balanced exploitation and exploration of the search space. It is equally important to apply a lean design principle in order to avoid redundant or superficial algorithmic components. To sum up, meta-heuristics represent an important enrichment of the arsenal of existing search methods and constitute a valuable alternative for tackling hard combinatorial problems.

References

Aarts EH, Korst JH (1989) Simulated annealing and Boltzmann machines: a stochastic approach to combinatorial optimization and neural computing. Wiley, Chichester
Alaya I, Solnon C, Ghedira K (2007) Optimisation par colonies de fourmis pour le problème du sac-à-dos multi-dimensionnel. Tech et Sci Inform (TSI) 26(3–4):271–390
Aldous D, Vazirani UV (1994) “Go with the winners” algorithms. In: 35th annual symposium on foundations of computer science, pp 492–501
Bailleux O, Hao JK (2008) Algorithmes de recherche stochastiques. In: Saïs L (ed) Problème SAT: progrès et défis (Chap. 5). Hermès - Lavoisier
Baluja S (1994) Population-based incremental learning: a method for integrating genetic search based function optimization and competitive learning. Technical report, Carnegie Mellon University, Pittsburgh, PA, USA
Battiti R, Brunato M (2015) The LION way: machine learning plus intelligent optimization. University of Trento, LIONlab, Italy
Battiti R, Protasi M (2001) Reactive local search for the maximum clique problem. Algorithmica 29(4):610–637
Battiti R, Brunato M, Mascia F (2008) Reactive search and intelligent optimization. Springer, Berlin
Benlic U, Hao JK (2013a) Breakout local search for the quadratic assignment problem. Appl Math Comput 219(9):4800–4815


Benlic U, Hao JK (2013b) Breakout local search for the vertex separator problem. In: Proceedings of IJCAI-2013, pp 461–467
Bezerra LCT, López-Ibáñez M, Stützle T (2016) Automatic component-wise design of multiobjective evolutionary algorithms. IEEE Trans Evol Comput 20(3):403–417
Björdal G, Monette JN, Flener P, Pearson J (2015) A constraint-based local search backend for MiniZinc. Constraints 20(3):325–345
Brélaz D (1979) New methods to color the vertices of a graph. Commun ACM 22(4):251–256
Codognet P, Diaz D (2001) Yet another local search method for constraint solving. In: International symposium on stochastic algorithms: foundations and applications (SAGA). LNCS, vol 2264. Springer, Berlin, pp 342–344
Craenen BG, Eiben A, van Hemert JI (2003) Comparing evolutionary algorithms on binary constraint satisfaction problems. IEEE Trans Evol Comput 7(5):424–444
Crawford JM, Auton L (1993) Experimental results on the cross-over point in satisfiability problems. In: Proceedings of the national conference on artificial intelligence, pp 22–28
Davenport A, Tsang E, Wang CJ, Zhu K (1994) GENET: a connectionist architecture for solving constraint satisfaction problems by iterative improvement. In: Proceedings of AAAI-1994, vol 1. AAAI, pp 325–330
de Lima Martins S, Rosseti I, Plastino A (2016) Data mining in stochastic local search. In: Handbook of heuristics. Springer, Berlin, pp 1–49
Dorne R, Hao JK (1998) A new genetic local search algorithm for graph coloring. In: 5th international conference on parallel problem solving from nature (PPSN). LNCS, vol 1498. Springer, Berlin, pp 745–754
Eiben A, van der Hauw J (1997) Solving 3-SAT with adaptive genetic algorithms. In: Proceedings of the fourth IEEE conference on evolutionary computation. IEEE Press, pp 81–86
Eiben AE, Smith JE (2003) Introduction to evolutionary computing. Springer, Berlin
Feo TA, Resende MG (1989) A probabilistic heuristic for a computationally difficult set covering problem. Oper Res Lett 8:67–71
Fleurent C, Ferland JA (1996) Object-oriented implementation of heuristic search methods for graph coloring, maximum clique, and satisfiability. DIMACS Ser Discret Math Theor Comput Sci 26:619–652
Galinier P, Hao JK (1997) Tabu search for maximal constraint satisfaction problems. In: International conference on principles and practice of constraint programming (CP). LNCS, vol 1330. Springer, Berlin, pp 196–208
Galinier P, Hao JK (1999) Hybrid evolutionary algorithms for graph coloring. J Comb Optim 3(4):379–397
Galinier P, Hao JK (2004) A general approach for constraint solving by local search. J Math Model Algorithms 3(1):73–88
Galinier P, Hertz A, Zufferey N (2008) An adaptive memory algorithm for the k-coloring problem. Discret Appl Math 156(2):267–279
Gent IP, Walsh T (1993) Towards an understanding of hill-climbing procedures for SAT. In: Proceedings of AAAI-93, pp 28–33
Glover F, Laguna M (1993) Tabu search. In: Reeves C (ed) Modern heuristic techniques for combinatorial problems. Blackwell Scientific Publishing, Oxford, pp 70–141
Goëffon A, Richer J, Hao JK (2008) Progressive tree neighborhood applied to the maximum parsimony problem. IEEE/ACM Trans Comput Biol Bioinform 5(1):136–145
Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Boston
Gottlieb J, Voss N (1998) In: 5th international conference on parallel problem solving from nature (PPSN). LNCS, vol 1498. Springer, Berlin, pp 755–764
Gottlieb J, Marchiori E, Rossi C (2002) Evolutionary algorithms for the satisfiability problem. Evol Comput 10:35–50
Hansen P, Mladenovic N (2001) Variable neighborhood search: principles and applications. Eur J Oper Res 130(3):449–467


Hao JK (2012) Memetic algorithms in discrete optimization. In: Handbook of memetic algorithms. Studies in computational intelligence, vol 379. Springer, Berlin, pp 73–94
Hao JK, Dorne R (1994) An empirical comparison of two evolutionary methods for satisfiability problems. In: Proceedings of the IEEE international conference on evolutionary computation. IEEE Press, pp 450–455
Hao JK, Dorne R (1996) In: Principles and practice of constraint programming (CP). LNCS, vol 1118. Springer, Berlin, pp 194–208
Hentenryck PV, Michel L (2005) Constraint-based local search. MIT Press, Cambridge
Hirsch EA (2005) UnitWalk: a new SAT solver that uses local search guided by unit clause elimination. Ann Math Artif Intell 24(1–4):91–111
Holland JH (1975) Adaptation in natural and artificial systems. University of Michigan Press
Hoos HH, Stützle T (2005) Stochastic local search: foundations and applications. Morgan Kaufmann, San Francisco
Hoos HH, Neumann F, Trautmann H (2017) Automated algorithm selection and configuration (Dagstuhl seminar 16412). Dagstuhl Rep 6(10):33–74
Hutter F, Tompkins DAD, Hoos HH (2002) Scaling and probabilistic smoothing: efficient dynamic local search for SAT. In: Proceedings of CP 2002, principles and practice of constraint programming. LNCS. Springer, Berlin, pp 233–248
Hutter F, Lindauer M, Balint A, Bayless S, Hoos HH, Leyton-Brown K (2017) The configurable SAT solver challenge (CSSC). Artif Intell 243:1–25
Ishtaiwi A, Thornton J, Sattar A, Pham DN (2005) Neighbourhood clause weight redistribution in local search for SAT. In: Proceedings of CP 2005, pp 772–776
Jagota A, Sanchis LA (2001) Adaptive, restart, randomized greedy heuristics for maximum clique. J Heuristics 7(6):565–585
Jong KD, Spears W (1989) Using genetic algorithms to solve NP-complete problems. In: International conference on genetic algorithms (ICGA'89), Fairfax, Virginia, pp 124–132
Jussien N, Lhomme O (2002) Local search with constraint propagation and conflict-based heuristics. Artif Intell 139(1):21–45
Khichane M, Albert P, Solnon C (2008) Integration of ACO in a constraint programming language. In: 6th international conference on ant colony optimization and swarm intelligence (ANTS'08). LNCS, vol 5217. Springer, Berlin, pp 84–95
Khichane M, Albert P, Solnon C (2010) Strong integration of ant colony optimization with constraint programming optimization. In: 7th international conference on integration of artificial intelligence and operations research techniques in constraint programming (CPAIOR). LNCS, vol 6140. Springer, Berlin, pp 232–245
Lardeux F, Saubion F, Hao JK (2006) GASAT: a genetic local search algorithm for the satisfiability problem. Evol Comput 14(2):223–253
Larranaga P, Lozano JA (2001) Estimation of distribution algorithms: a new tool for evolutionary computation. Kluwer Academic Publishers, Boston
Li CM, Huang WQ (2005) Diversification and determinism in local search for satisfiability. In: Proceedings of SAT. LNCS, vol 3569. Springer, Berlin, pp 158–172
Lourenco HR, Martin O, Stützle T (2002) Iterated local search. In: Handbook of metaheuristics. Kluwer Academic Publishers, Boston, pp 321–353
Lü Z, Hao JK (2010) A memetic algorithm for graph coloring. Eur J Oper Res 200(1):235–244
Ma F, Hao JK (2017) A multiple search operator heuristic for the max-k-cut problem. Ann Oper Res 248(1–2):365–403
Malaguti E, Monaci M, Toth P (2008) A metaheuristic approach for the vertex coloring problem. INFORMS J Comput 20(2):302–316
Marchiori E, Rossi C (1999) A flipping genetic algorithm for hard 3-SAT problems. In: Proceedings of the genetic and evolutionary computation conference, vol 1, pp 393–400
Mazure B, Sais L, Grégoire E (1997) Tabu search for SAT. In: Proceedings of AAAI-97, pp 281–285


McAllester D, Selman B, Kautz H (1997) Evidence for invariants in local search. In: Proceedings of AAAI-97, pp 321–326
Minton S, Johnston MD, Philips AB, Laird P (1992) Minimizing conflicts: a heuristic repair method for constraint satisfaction and scheduling problems. Artif Intell 58:161–205
Moscato P (1999) Memetic algorithms: a short introduction. In: Corne D, Dorigo M, Glover F (eds) New ideas in optimization. McGraw-Hill, Maidenhead, pp 219–234
Neri F, Cotta C, Moscato P (eds) (2012) Handbook of memetic algorithms. Studies in computational intelligence, vol 379. Springer, Berlin
Neveu B, Trombettoni G, Glover F (2004) ID Walk: a candidate list strategy with a simple diversification device. In: Principles and practice of constraint programming (CP). LNCS, vol 3258. Springer, Berlin, pp 423–437
Nonobe K, Ibaraki T (1998) A tabu search approach to the constraint satisfaction problem as a general problem solver. Eur J Oper Res 106:599–623
Pelikan M, Goldberg DE, Cantú-Paz E (1999) BOA: the Bayesian optimization algorithm. Vol I. Morgan Kaufmann, San Francisco, pp 525–532
Pham DN, Thornton J, Sattar A (2007) Building structure into local search for SAT. In: Proceedings of IJCAI 2007, pp 2359–2364
Porumbel DC, Hao JK, Kuntz P (2010) An evolutionary approach with diversity guarantee and well-informed grouping recombination for graph coloring. Comput Oper Res 37(10):1822–1832
Prestwich S (2005) Random walk with continuously smoothed variable weights. In: Proceedings of the eighth international conference on theory and applications of satisfiability testing (SAT 2005), pp 203–215
Raschip M, Croitoru C, Stoffel K (2015) Using association rules to guide evolutionary search in solving constraint satisfaction. In: Congress on evolutionary computation (CEC-2015), pp 744–750
Resende MG, Ribeiro CC (2003) Greedy randomized adaptive search procedures. In: Handbook of metaheuristics. Kluwer Academic Publishers, Boston, pp 219–249
Rossi C, Marchiori E, Kok JN (2000) An adaptive evolutionary algorithm for the satisfiability problem. In: Carroll J et al (eds) Proceedings of the ACM symposium on applied computing. ACM, New York, pp 463–469
Santos LF, Martins SL, Plastino A (2008) Applications of the DM-GRASP heuristic: a survey. Int Trans Oper Res 15(4):387–416
Schuurmans D, Southey F, Holte R (2001) The exponential subgradient algorithm for heuristic Boolean programming. In: Proceedings of AAAI 2001, pp 334–341
Selman B, Kautz H (1993) Domain-independent extensions to GSAT: solving large structured satisfiability problems. In: Proceedings of IJCAI-93, pp 290–295
Selman B, Kautz H (1994) Noise strategies for improving local search. In: Proceedings of AAAI-94, pp 337–343
Selman B, Levesque H, Mitchell D (1992) A new method for solving hard satisfiability problems. In: 10th national conference on artificial intelligence (AAAI), pp 440–446
Selman B, Kautz HA, Cohen B (1994) Noise strategies for improving local search. In: Proceedings of the 12th national conference on artificial intelligence. AAAI Press/The MIT Press, Menlo Park, pp 337–343
Shang Y, Wah BW (1998) A discrete Lagrangian-based global search method for solving satisfiability problems. J Glob Optim 12:61–100
Solnon C (2002) Ants can solve constraint satisfaction problems. IEEE Trans Evol Comput 6(4):347–357
Solnon C (2008) Combining two pheromone structures for solving the car sequencing problem with ant colony optimization. Eur J Oper Res 191:1043–1055
Solnon C (2010) Constraint programming with ant colony optimization. Wiley, New York
Sörensen K (2013) Metaheuristics—the metaphor exposed. Int Trans Oper Res, pp 1–16
Spears WM (1996) Simulated annealing for hard satisfiability problems. DIMACS Ser Discret Math Theor Comput Sci 26:533–558


Stützle T, Hoos HH (2000) MAX-MIN ant system. Futur Gener Comput Syst 16:889–914 (special issue on ant algorithms)
Thornton J, Pham DN, Bain S, Ferreira VJ (2004) Additive versus multiplicative clause weighting for SAT. In: Proceedings of AAAI 2004, pp 191–196
Toffolo TA, Christiaens J, Malderen SV, Wauters T, Berghe GV (2018) Stochastic local search with learning automaton for the swap-body vehicle routing problem. Comput Oper Res 89:68–81
van Hemert JI, Solnon C (2004) A study into ant colony optimization, evolutionary computation and constraint programming on binary constraint satisfaction problems. In: Evolutionary computation in combinatorial optimization (EvoCOP 2004). LNCS, vol 3004. Springer, Berlin, pp 114–123
Wallace RJ (1996) In: Principles and practice of constraint programming (CP). LNCS, vol 1118. Springer, Berlin, pp 308–322
Wu Q, Hao JK, Glover F (2012) Multi-neighborhood tabu search for the maximum weight clique problem. Ann Oper Res 196(1):611–634
Xu L, Hoos HH, Leyton-Brown K (2010) Hydra: automatically configuring algorithms for portfolio-based selection. In: 24th AAAI conference on artificial intelligence, pp 210–216
Young R, Reel A (1990) A hybrid genetic algorithm for a logic problem. In: Proceedings of the 9th European conference on artificial intelligence, Stockholm, Sweden, pp 744–746
Zhou Y, Duval B, Hao JK (2018) Improving probability learning based local search for graph coloring. Appl Soft Comput 65:542–553
Zinflou A, Gagné C, Gravel M (2007) Crossover operators for the car sequencing problem. In: 7th European conference on evolutionary computation in combinatorial optimization (EvoCOP). LNCS, vol 4446. Springer, Berlin, pp 229–239
Zlochin M, Birattari M, Meuleau N, Dorigo M (2004) Model-based search for combinatorial optimization: a critical survey. Ann Oper Res 131:373–395

Automated Deduction

Thierry Boy de la Tour, Ricardo Caferra, Nicola Olivetti, Nicolas Peltier and Camilla Schwind

Abstract After a brief history and motivation of Automated Deduction (AD), the necessary notions of first-order logic are defined, and different aspects of the Superposition calculus and of semantic tableaux are presented. The latter method can be suitably adapted to non-classical logics, as illustrated by a few examples. The issues connected to the incompleteness of mathematics are then addressed. For lack of space, a number of issues and recent developments are only alluded to, either by name or through references to a succinct bibliography.

T. Boy de la Tour (B) · R. Caferra · N. Peltier: Université Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France. e-mail: [email protected] (T. Boy de la Tour), [email protected] (R. Caferra), [email protected] (N. Peltier)
N. Olivetti · C. Schwind: Aix Marseille Université, Université de Toulon, CNRS, LIS, Marseille, France. e-mail: [email protected] (N. Olivetti), [email protected] (C. Schwind)
© Springer Nature Switzerland AG 2020. P. Marquis et al. (eds.), A Guided Tour of Artificial Intelligence Research, https://doi.org/10.1007/978-3-030-06167-8_3

1 Introduction

Automated Deduction studies how computers can help humans discover and write formal proofs. It can be seen as the crossroad of two intellectual traditions: one investigating human reasoning and originating in India and Ancient Greece, namely logic; the other building tools to provide help in a number of administrative or commercial tasks of a mathematical nature, like the abacus. The emergence of electronic computers resulted in a new research area concerned with the design of software devoted to (mostly deductive) reasoning, which is clearly connected to Artificial Intelligence. The first computer program recognized as a practical application of AI was a theorem prover, the Logic Theory Machine by Newell, Shaw and Simon, written in 1956. This closeness between AI and AD, both half a century old, was anticipated in the following assertions by two great mathematicians and pioneers:

The fundamental idea of my proof theory is none other than to describe the activity of our understanding, to make a protocol of the rules according to which our thinking actually proceeds (D. Hilbert).

First I wished to construct a formalism that comes as close as possible to actual reasoning. Thus arose a “calculus of natural deduction” (G. Gentzen).

The first computer program published in AD seems to have been a decision procedure for Presburger arithmetic by M. Davis, illustrating the focus on termination of the earlier attempts at automating reasoning. More ambitious endeavors related to first-order logic soon exposed the limits inflicted on theorem provers by combinatorial explosion (Siekmann and Wrightson 1983). During the seventies two different approaches were competing: one trying to mimic the problem-solving strategies of mathematicians (Bledsoe and Loveland 1984), the other aiming at the implementation of general-purpose proof systems. A number of interesting theorems, especially in analysis and topology, were demonstrated thanks to the first approach, yet the second one soon took the lead, thus postponing the study of the powerful techniques used in the first approach, since they were too difficult to formalize. Recent thoughts on these two approaches can be found in Sloman (2008) and Barendregt and Wiedijk (2005). Theoretical progress on proof procedures, the increase in computing power and the development of practical tools (theorem provers, benchmarks of theorems (Sutcliffe 2017), system competitions, etc.) have led to a constant increase in performance. A few pages cannot provide an exhaustive view of such a vast research area. The current chapter only glimpses the most common methods used in the field and pinpoints a number of subjects that appear as the most important ones.

2 First-Order Logic

Propositional logic is a very powerful tool for expressing and checking properties of finite systems (see chapter “Reasoning with Propositional Logic: From SAT Solvers to Knowledge Compilation” of this volume). Expressing properties of infinite collections of objects requires more expressive languages such as first-order logic (FOL), at the cost of decidability; but FOL is still semi-decidable, i.e., there is an algorithm that recognizes in finite time all valid formulas (and therefore cannot similarly recognize all non-valid formulas). A signature Ω = {V, F, P} consists of a denumerable set V of variables and two at most denumerable sets F of function symbols and P of predicate symbols. To every symbol in F ∪ P is associated a natural number called its arity. The set of terms built on {V, F} is the smallest set Σ(V, F) including V and such that, if f ∈ F has arity n and t1, ..., tn ∈ Σ(V, F), then f(t1, ..., tn) ∈ Σ(V, F). If n = 0 the term is called a constant and written f (without parentheses). If no variable occurs in t, it is said to be closed or ground.

The language of FOL, whose elements are formulas, is the smallest set LΩ such that:
• if p ∈ P is a predicate symbol of arity n and t1, ..., tn are terms, then p(t1, ..., tn) ∈ LΩ (if n = 0 the formula is written p); if t and t′ are terms, then t ≈ t′ ∈ LΩ; these formulas are called atomic;
• if φ, ψ ∈ LΩ and x ∈ V, then ¬φ, (φ ∧ ψ), (φ ∨ ψ), (φ ⇒ ψ), ∀x φ and ∃x φ belong to LΩ.

A literal is either an atomic formula or a negated atomic formula. A literal and its negation are called complementary. Any literal ¬ t ≈ t′ is written t ≉ t′. P0 is the set of all p ∈ P of arity zero; they are called propositional symbols. Propositional logic is the fragment of FOL obtained with V = F = ∅ and P = P0. In practice many parentheses are omitted and formulas are read according to the usual priority among connectives (unary connectives ¬, ∀ and ∃ come first, followed by ∧ > ∨ > ⇒). A variable x is bound if it occurs in the scope of a quantifier ∀x or ∃x. Since bound variables can be renamed, we assume that in any formula any bound variable is quantified only once, and that variables that are not bound, called free, are distinct from bound variables. A formula that has no free variable is closed.

The definition of the semantics of FOL follows. Let D be a non-empty set; an interpretation I with domain D associates to every f ∈ F with arity n a function from D^n to D and to every p ∈ P with arity n a relation in D^n. The semantic object associated to a symbol is denoted by exponentiation, e.g., f^I denotes the function associated to the symbol f in the interpretation I. A valuation σ of a set of variables is a function mapping every variable in the set to an element of D.
A pair (I, σ) associates, to every term t ∈ Σ(V, F) such that σ is a valuation on the variables of t, a value in D written [t](I,σ) and defined inductively as follows: for all x ∈ V let [x](I,σ) = σ(x), and for all t1, ..., tn ∈ Σ(V, F) and f ∈ F with arity n, let [f(t1, ..., tn)](I,σ) = f^I([t1](I,σ), ..., [tn](I,σ)). Similarly, if σ is a valuation of the free variables of a formula φ, then (I, σ) associates to φ a truth value t or f, written [φ](I,σ) and inductively defined as follows:

1. [p(t1, ..., tn)](I,σ) = t iff ⟨[t1](I,σ), ..., [tn](I,σ)⟩ ∈ p^I;
2. [t ≈ t′](I,σ) = t iff [t](I,σ) = [t′](I,σ);
3. [¬φ](I,σ) = t iff [φ](I,σ) = f;
4. [φ ∧ ψ](I,σ) = t iff [φ](I,σ) = t and [ψ](I,σ) = t;
5. [φ ∨ ψ](I,σ) = t iff [φ](I,σ) = t or [ψ](I,σ) = t;
6. [φ ⇒ ψ](I,σ) = t iff [φ](I,σ) = f or [ψ](I,σ) = t;
7. [∃x φ](I,σ) = t iff there is an a ∈ D such that [φ](I,σ∪{x→a}) = t;
8. [∀x φ](I,σ) = t iff for all a ∈ D, [φ](I,σ∪{x→a}) = t.
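Over a finite domain, the inductive definitions above can be executed directly. The following sketch evaluates terms and formulas under an interpretation given as Python functions and relations; the tuple encoding of terms and formulas is an illustrative choice, not the chapter's notation:

```python
# A minimal evaluator for the semantics above, over a finite domain.
# Representation (an illustrative choice):
#   term:    a string (variable) or ("f", t1, ..., tn) for f(t1, ..., tn)
#   formula: ("p", t1, ..., tn), ("not", F), ("and", F, G), ("or", F, G),
#            ("implies", F, G), ("forall", x, F), ("exists", x, F)

def eval_term(t, interp, val):
    """[t](I, sigma): value of term t under interpretation and valuation."""
    if isinstance(t, str):                       # a variable
        return val[t]
    f, *args = t
    return interp["funs"][f](*(eval_term(a, interp, val) for a in args))

def eval_formula(phi, interp, val):
    """[phi](I, sigma): truth value of formula phi."""
    op, *rest = phi
    if op == "not":
        return not eval_formula(rest[0], interp, val)
    if op == "and":
        return eval_formula(rest[0], interp, val) and eval_formula(rest[1], interp, val)
    if op == "or":
        return eval_formula(rest[0], interp, val) or eval_formula(rest[1], interp, val)
    if op == "implies":
        return (not eval_formula(rest[0], interp, val)) or eval_formula(rest[1], interp, val)
    if op in ("forall", "exists"):
        x, body = rest
        results = (eval_formula(body, interp, {**val, x: a}) for a in interp["domain"])
        return all(results) if op == "forall" else any(results)
    # otherwise: an atom p(t1, ..., tn)
    return tuple(eval_term(a, interp, val) for a in rest) in interp["preds"][op]
```

The quantifier cases enumerate the whole domain, which is exactly why this only works for finite interpretations.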


The satisfaction relation between interpretations and formulas is then defined as follows: I |= φ iff for every valuation σ of the free variables of φ, [φ](I,σ) = t holds. The formula φ is then satisfied by I, and I is a model of φ; in the opposite case (there is a valuation σ such that [φ](I,σ) = f), I is a counter-model of φ. If φ is closed then [φ](I) stands for [φ](I,∅). A Herbrand model is an interpretation I with domain¹ Σ(∅, F) and such that for all f ∈ F and t1, ..., tn ∈ Σ(∅, F), f^I(t1, ..., tn) = f(t1, ..., tn). These models play an important role in AD, in particular thanks to the Herbrand theorem, which allows one to reduce the unsatisfiability of any universal formula (of the form ∀x1 ··· ∀xn φ where φ has no quantifier) to the unsatisfiability of a finite set (which remains to be found) of closed instances² of φ (and of axioms of ≈), which is a propositional question.

3 The Resolution Method

The Resolution rule, due to J. A. Robinson (together with the paramodulation rule devoted to equality), has been the most common proof procedure in the field since their publication in the sixties. Its success comes from its simplicity, obtained thanks to the use of a normal form for formulas and to a fundamental algorithm called unification, both introduced in the next sections. Precise definitions and the resulting concepts and results can easily be found in an extensive literature on the subject (good starting points are Wos et al. 1992; Wos 1988; Robinson and Voronkov 2001).

3.1 Transformation to Clausal Form

One of the most elementary forms of reasoning consists in performing (Boolean) algebraic computations, using a number of properties like the commutativity and associativity of ∧ and ∨, or the duality laws:

¬(φ ∧ ψ) = ¬φ ∨ ¬ψ    ¬(φ ∨ ψ) = ¬φ ∧ ¬ψ    ¬¬φ = φ
¬∀x φ = ∃x ¬φ    ¬∃x φ = ∀x ¬φ    (1)

Such properties are often used implicitly in common sense reasoning, for instance to simplify mathematical problems. Mechanizing this kind of reasoning is not difficult: applying the identities (1) systematically from left to right (and expressing implication and equivalence by means of ∧, ∨ and ¬) allows one to move all negations towards the atomic formulas; this yields the negation normal form. Quantifiers are moved in the opposite direction by:

¹ At least one constant is assumed to exist in F, so that Σ(∅, F) ≠ ∅.
² See Sect. 3.2.


∀x φ ∧ ∀y ψ = ∀x (φ ∧ ψ[y/x])    φ ∧ Qx ψ = Qx (φ ∧ ψ)
∃x φ ∨ ∃y ψ = ∃x (φ ∨ ψ[y/x])    φ ∨ Qx ψ = Qx (φ ∨ ψ)    (2)

(ψ[y/x] is the formula ψ in which y has been replaced by x, and Q is any quantifier) and thus obtain a prenex formula, where quantifiers precede all other connectives. Finally, the distributivity law φ ∨ (ψ1 ∧ ψ2) = (φ ∨ ψ1) ∧ (φ ∨ ψ2) yields a conjunctive normal form. More precisely:

• A clause is a formula C of the form L1 ∨ ··· ∨ Ln where n ≥ 0 and the Li's are literals. If n = 0 we write C = □; this is the empty clause.
• A conjunctive normal form (or CNF) is a prenex formula where the part that follows the quantifiers has the form C1 ∧ ··· ∧ Cm where m ≥ 1 and the Ci's are clauses.
• A clausal form is a CNF with only universal quantifiers. It can be considered as a set of clauses, and similarly a clause is often considered as a set of literals.

Hence, to obtain a clausal form from ψ, existential quantifiers still have to be eliminated; this is done by skolemization, the process of replacing every subformula ∃y φ of ψ by φ[y/f(x1, ..., xn)], where x1, ..., xn are the free variables of ∃y φ and f is a new function symbol, one that does not occur in ψ. This however is not an algebraic computation, since the resulting formula is not equivalent to ψ: the value y = f(x1, ..., xn) is generally correct only for those interpretations of f that are compatible with φ; hence only the satisfiability of the formula is preserved by this transformation. Skolemization may be performed before or after applying the rules (2), in order to minimize the arity or the number of Skolem functions. A technique similar to skolemization consists in renaming a number of disjuncts inside ψ, i.e., replacing ψ[φ ∨ φ′] by ψ[p(x1, ..., xn) ∨ φ′] ∧ ∀x1 ... ∀xn (φ ∨ ¬p(x1, ..., xn)), where x1, ..., xn are the free variables of φ and p is a new predicate symbol (as for skolemization, ψ is assumed to be in negation normal form). The subformula φ ∨ φ′ of ψ is thus replaced by p(x1, ..., xn) ∨ φ′; if φ′ is a conjunction this avoids the duplication of φ through the distributivity law, and a clausal form of quadratic complexity can be obtained in this way (Robinson and Voronkov 2001).
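On the propositional skeleton (quantifiers and skolemization left aside), the NNF and distribution steps of this section can be sketched as follows; the tuple encoding of formulas is an illustrative assumption:

```python
# NNF + distribution on propositional formulas.
# Formulas: ("var", name), ("not", F), ("and", F, G), ("or", F, G).

def nnf(phi):
    """Push negations down to the atoms using the duality laws (1)."""
    op = phi[0]
    if op == "not":
        sub = phi[1]
        if sub[0] == "var":
            return phi                                  # negated atom: a literal
        if sub[0] == "not":
            return nnf(sub[1])                          # double negation
        dual = "or" if sub[0] == "and" else "and"       # De Morgan
        return (dual, nnf(("not", sub[1])), nnf(("not", sub[2])))
    if op in ("and", "or"):
        return (op, nnf(phi[1]), nnf(phi[2]))
    return phi                                          # a variable

def cnf(phi):
    """Distribute 'or' over 'and'; input must be in NNF. Returns a set of
    clauses, each clause a frozenset of literals like 'p' or '-p'."""
    op = phi[0]
    if op == "var":
        return {frozenset([phi[1]])}
    if op == "not":
        return {frozenset(["-" + phi[1][1]])}
    if op == "and":
        return cnf(phi[1]) | cnf(phi[2])
    # op == "or": pairwise unions of the clauses of both sides
    return {c | d for c in cnf(phi[1]) for d in cnf(phi[2])}
```

The pairwise unions in the `or` case are exactly the source of the exponential blow-up that the renaming technique described above avoids.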

3.2 Unification

A substitution is a function σ from V to Σ(V, F); it is ground if σ(x) is ground for all x ∈ V. A substitution can be extended to a function that maps every term t ∈ Σ(V, F) to the term σ(t) obtained by replacing in t any variable x by its image σ(x); σ(t) (often written tσ) is an instance of t. It is therefore possible to compose any substitutions σ and σ′; the result σ′ ∘ σ (often written σσ′) is said to be an instance of σ. Substitutions σ can similarly be extended to formulas φ, and then φσ is an instance of φ.


Algorithm 1: A unification algorithm
Data: S = {t1 = s1, ..., tn = sn}
Result: either the mgu σ of S, or ⊥ (no solution)
σ ← ∅  /* ∅ represents the identity function on V */
while S ≠ ∅ do
  Choose t = s in S
  S ← S \ {t = s}
  if t ≠ s then
    if t = f(t1, ..., tn) and s = g(s1, ..., sm) then
      if f ≠ g then return ⊥
      else S ← S ∪ {ti = si | i ∈ [1..n]}
    else if t is a variable then
      if t occurs in s then return ⊥
      else σ ← σ{t → s}; S ← S{t → s}
    else  /* s is a variable */
      if s occurs in t then return ⊥
      else σ ← σ{s → t}; S ← S{s → t}
return σ

Given a set of n equations between terms {ti = ti′ | 1 ≤ i ≤ n}, the unification problem is to find, if there is one, a substitution σ (called a unifier) such that σ(ti) = σ(ti′) for all 1 ≤ i ≤ n. If such a set has a unifier then it also has a most general unifier (mgu), in the sense that every unifier is an instance of this mgu. Algorithm 1 computes an mgu with exponential worst-case complexity, but this can be reduced to quadratic complexity by means of structure sharing (Robinson and Voronkov 2001).
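Algorithm 1 can be rendered directly in Python (terms encoded as nested tuples, a convention chosen here for illustration; a string is a variable, a tuple ("f", t1, ..., tn) is a compound term or constant). Like the algorithm above, this eager-substitution version has exponential worst-case complexity:

```python
def apply_subst(t, sigma):
    """Apply substitution sigma to term t."""
    if isinstance(t, str):
        return sigma.get(t, t)
    return (t[0],) + tuple(apply_subst(a, sigma) for a in t[1:])

def occurs(x, t):
    """Occur-check: does variable x occur in term t?"""
    return t == x if isinstance(t, str) else any(occurs(x, a) for a in t[1:])

def unify(equations):
    """Return a most general unifier of a list of (t, s) pairs, or None."""
    sigma = {}
    work = list(equations)
    while work:
        t, s = work.pop()
        t, s = apply_subst(t, sigma), apply_subst(s, sigma)
        if t == s:
            continue                                  # trivial equation
        if isinstance(t, str) or isinstance(s, str):
            x, u = (t, s) if isinstance(t, str) else (s, t)
            if occurs(x, u):
                return None                           # occur-check failure
            # compose the new binding into sigma
            sigma = {y: apply_subst(v, {x: u}) for y, v in sigma.items()}
            sigma[x] = u
        elif t[0] != s[0] or len(t) != len(s):
            return None                               # symbol clash
        else:
            work.extend(zip(t[1:], s[1:]))            # decompose f(...) = f(...)
    return sigma
```

Quasi-linear implementations avoid the eager substitutions by sharing term structure, as noted above.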

3.3 The Superposition Calculus

Instead of presenting the Resolution calculus (Leitsch 1997) in its original form, we prefer to present what can be considered as the most advanced and successful proof procedure for first-order logic with equality: the Superposition calculus (Robinson and Voronkov 2001). It is an extension of the Resolution method and can also be considered as a generalization of the Knuth–Bendix completion algorithm to arbitrary first-order clauses. The Superposition calculus applies to formulas in clausal form and uses an inference rule, called paramodulation, specifically tailored to reason on equalities. This rule allows one, given an equation t ≈ s, to replace any occurrence of the term t inside a clause C by s. This replacement can be performed at any depth in the clause; for instance, from a ≈ b and f(g(a), c) ≈ d one deduces f(g(b), c) ≈ d. This idea extends to non-unit clauses, in which case the replacement is performed under some particular context, corresponding to literals added to the consequent: from t ≈ s ∨ D and C, the clause C[t/s] ∨ D can be inferred, where C[t/s] is obtained from C by replacing t by s. The rule also extends to non-ground clauses, using unification to compute the most general instances of the clauses on which the paramodulation rule can be applied (the term t above is unified with a subterm of C). In order to prune the search space, several additional restrictions are considered: in particular, no replacement inside variables is allowed, and the replacement is performed only if the term s is “simpler” than t (according to some fixed ordering ≻). Similarly, only terms occurring in the maximal side of an equation can be replaced. For instance, if f(a) ≻ b ≻ a, then the rule is not applicable to the clauses f(a) ≈ b and b ≈ a. Indeed, b cannot be replaced in the right-hand side of the first equation because it is not maximal (f(a) ≻ b), and the term a in f(a) cannot be replaced by b since b ≻ a. Furthermore, the rule is only applied to some specific literals occurring in the parent clauses; these literals (which are either the maximal literals in the clause or arbitrarily chosen negative literals) are called selected. More formally, the complete set of rules is depicted in Fig. 1.

Paramodulation
  C1: (t[w] ⋈ s) ∨ D,  C2: (u ≈ v) ∨ D′  →  (t[w/v] ⋈ s ∨ D ∨ D′)σ
  if σ = mgu(u, w), vσ ⊁ uσ, sσ ⊁ tσ, w is not a variable, (t[w] ⋈ s)σ ∈ sel(C1σ) and (u ≈ v)σ ∈ sel(C2σ).

Reflection
  C: (t ≉ s) ∨ D  →  Dσ
  if σ = mgu(t, s) and (t ≉ s)σ ∈ sel(Cσ).

Equational Factorization
  C: (t ≈ s) ∨ (u ≈ v) ∨ D  →  (s ≉ v ∨ t ≈ s ∨ D)σ
  if σ = mgu(t, u), sσ ⊁ tσ, vσ ⊁ uσ and (t ≈ s)σ ∈ sel(Cσ).

Fig. 1 The superposition calculus SP_sel

The rules are parameterized by an order ≻ on terms and a selection function sel mapping any clause C to a set of literals in C. By convention, the symbol ⋈ is used to denote either ≈ or ≉. The Reflection rule applied to a unit clause t ≉ t infers the empty clause □. The Equational Factorization rule fuses two literals t ≈ s and t ≈ v if s and v are equal. Decomposition rules can also be added; for instance, a clause of the form C ∨ D, where C and D share no variable, can be replaced by two clauses C ∨ p and D ∨ ¬p, where p denotes a new propositional variable. In order to avoid extra rules for handling non-equational predicate symbols, we assume that every non-equational atom P(t1, ..., tn) is encoded as an equation P(t1, ..., tn) ≈ t (assuming that t is a constant and P is a function symbol). The Superposition calculus simulates the Resolution calculus when applied to a set of non-equational clauses encoded in this way. The Superposition calculus is sound, in the sense that all inferred clauses are logical consequences of their premises (hence if the empty clause □ is deducible from S then S is unsatisfiable). It is also refutationally complete, in the sense that □ can be inferred from every unsatisfiable clause set, provided ≻ and sel satisfy the following properties:


T. Boy de la Tour et al.

(P1) The order ≻ is well-founded (i.e., every non-empty set of terms has a minimal element), closed under substitution (i.e., t ≻ s ⇒ tσ ≻ sσ) and under contextual embedding (i.e., t ≻ s ⇒ u[t] ≻ u[s]). An order satisfying these conditions is called a reduction order.

(P2) For every clause C, either sel(C) contains a negative literal, or sel(C) contains all the maximal literals in C (≻ is extended to literals and clauses by using the multiset extension; t ≃ s and t ≄ s are ordered as {{t}, {s}} and {{t, s}} respectively).

Variants of the Superposition calculus exist: for instance, Basic Superposition forbids paramodulation inside terms introduced by previous unifiers, and Strict Superposition replaces the Equational Factorization rule by standard Factorization. The latter calculus is however incomplete if the tautology elimination rule is used (see the next section).

3.4 Redundancy Elimination

The Superposition calculus infers a huge number of clauses (usually infinitely many). In order to restrict the search space, criteria have been defined to detect and remove useless clauses.

Definition 1 A clause C is redundant w.r.t. a set of clauses S (for a given order ≻) if for every ground substitution σ there exist clauses C1, ..., Cn ∈ S and ground substitutions θ1, ..., θn such that ∀i ∈ [1..n], Cσ ≻ Ciθi and {C1θ1, ..., Cnθn} |= Cσ. A set of clauses S is saturated if every clause deducible from S by the rules of SPsel (in one step) is redundant w.r.t. S.

The next theorem states the completeness of the Superposition calculus in the presence of redundancy elimination rules.

Theorem 1 (Bachmair and Ganzinger) Assume that ≻ and sel satisfy the properties (P1) and (P2) of Sect. 3.3. For every set of clauses S, if S is saturated and does not contain □ then S is satisfiable.

The result is proven by constructing an interpretation I validating S. The model is built on the set of ground terms, by induction on the order ≻, interpreting (ground) clauses of the form f(t1, ..., tn) ≃ s ∨ C, where [C](I) = f, [ti](I) = ti and f(t1, ..., tn) ≻ s, as rewrite rules fixing the interpretation of f(t1, ..., tn) to [s](I). A slightly more general redundancy criterion can be defined by considering inferences instead of clauses. In practice, the redundancy relation is undecidable and restricted criteria must be used instead:

Automated Deduction


Definition 2
• (Tautologies) A clause C is called a tautology if it contains two complementary literals or a literal of the form t ≃ t.
• (Subsumption) A clause C subsumes a clause D iff there exists a substitution σ such that Cσ ⊆ D.
• (Equational Simplification) A clause C is called a reduction of a clause D in a set of clauses S iff there exists a clause (t ≃ s) ∨ α ∈ S and a substitution σ such that tσ occurs in D, C = D[tσ/sσ], tσ ≻ sσ, (t ≃ s)σ ≻ ασ and ασ ⊆ C.

It is easy to check that any clause that is a tautology, or that is subsumed by a clause in S, or that possesses a reduction in S is necessarily redundant w.r.t. S (the converse is not true).
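Subsumption as in Definition 2 can be decided by a backtracking matcher that searches for a substitution σ with Cσ ⊆ D. The following sketch uses our own toy encoding (variables are uppercase strings, literals are (sign, predicate, args) triples, and D is assumed variable-free); it is not taken from any prover.

```python
def match_term(pat, t, s):
    """One-way matching: extend substitution s so that pat instantiates to t."""
    if isinstance(pat, str) and pat.isupper():        # variable
        if pat in s:
            return s if s[pat] == t else None
        return {**s, pat: t}
    if isinstance(pat, str):                          # constant
        return s if pat == t else None
    if isinstance(t, tuple) and pat[0] == t[0] and len(pat) == len(t):
        for p, u in zip(pat[1:], t[1:]):
            s = match_term(p, u, s)
            if s is None:
                return None
        return s
    return None

def subsumes(c, d, s=None):
    """True iff some substitution sigma satisfies c*sigma included in d."""
    if not c:
        return True
    (sign, pred, args), rest = c[0], c[1:]
    for sign2, pred2, args2 in d:
        if sign == sign2 and pred == pred2:
            s2 = match_term(('.', *args), ('.', *args2), s or {})
            if s2 is not None and subsumes(rest, d, s2):
                return True
    return False

# p(X) v q(X, b) subsumes p(a) v q(a, b) v r(c) with {X -> a}:
c = [(True, 'p', ('X',)), (True, 'q', ('X', 'b'))]
d = [(True, 'p', ('a',)), (True, 'q', ('a', 'b')), (True, 'r', ('c',))]
print(subsumes(c, d))                        # → True
print(subsumes([(True, 'p', ('b',))], d))    # → False
```

This naive search is exponential in the worst case; practical provers combine it with the indexing techniques of Sect. 3.5.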

3.5 Implementation Techniques

Without efficient implementation techniques, difficult problems could not be handled and automated theorem provers would not be as widely used as they are nowadays. We shall only mention two of these techniques, namely structure sharing and term indexing.

Structure sharing has been used since the very beginning for implementing automated theorem provers; before that, it was used in the implementation of programming languages. In the pioneering work of Boyer and Moore (published in 1972), deduced clauses are not constructed explicitly and substitutions are not applied, but are kept instead as contexts. The required space is then independent of the size of the clauses. The literals, the clauses and the context are represented as lists. For instance, the list

(q(y, y), 2), (p(x, y), 2), (¬p(x, f(y, z)), 4)    (3)

(natural numbers denote indices used to rename variables) in the context {x2 → x3, y2 → f(x4, y4), y4 → x3, z4 → f(x2, y2)} denotes the clause q(f(x4, x3), f(x4, x3)) ∨ p(x3, f(x4, x3)) ∨ ¬p(x4, f(x3, f(x3, f(x4, x3)))). The literals of Clause (3) are obtained from the premises of the inference: it is thus sufficient to keep pointers to these clauses, together with some information allowing one to proceed with the search for a refutation. Structure sharing can also be integrated at the logical level, in order to obtain more efficient calculi and to reduce the length of proofs.
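The expansion above can be replayed mechanically. The following sketch uses our own encoding (not Boyer and Moore's actual data structures): literals are kept unexpanded, and indexed variables are dereferenced through the context only on demand.

```python
# Variables are pairs (name, renaming-index); applications are tuples
# (functor, arg1, ..., argk). The context binds indexed variables to terms.

context = {('x', 2): ('x', 3), ('y', 2): ('f', ('x', 4), ('y', 4)),
           ('y', 4): ('x', 3), ('z', 4): ('f', ('x', 2), ('y', 2))}

def is_var(t):
    return len(t) == 2 and isinstance(t[1], int)

def attach(t, idx):
    """Tag the plain variables of a stored literal with a renaming index."""
    if isinstance(t, str):
        return (t, idx)
    return (t[0],) + tuple(attach(a, idx) for a in t[1:])

def expand(t):
    """Fully dereference a term through the context (done only on demand)."""
    if is_var(t):
        return expand(context[t]) if t in context else t
    return (t[0],) + tuple(expand(a) for a in t[1:])

# first literal of Clause (3): (q(y, y), 2)
print(expand(attach(('q', 'y', 'y'), 2)))
# → ('q', ('f', ('x', 4), ('x', 3)), ('f', ('x', 4), ('x', 3))),
#   i.e. q(f(x4, x3), f(x4, x3)), as claimed in the text
```

The stored literal occupies constant space; only the final expansion (which a real prover avoids whenever possible) has the size of the full clause.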


Indexing is necessary for efficiently applying redundancy pruning rules such as "forward" subsumption (which discards newly generated clauses if they are subsumed by previous clauses) and equational simplification. The idea is to construct a data structure called an index to access efficiently the terms occurring in the considered clauses (without having to go through the entire set of clauses). This technique is also intensively used in other domains such as logic programming. For instance, the set of unit clauses:

1. p(e(x,x))
2. p(e(x,e(x,e(x,x))))
3. p(e(x,e(x,e(y,y))))
4. p(e(x,e(y,e(y,x))))
5. p(e(e(x,y),e(y,x)))
6. p(e(e(e(x,y),x),y))

is represented by the discrimination tree shown in the original figure: a trie sharing the common prefixes of the literals, with one leaf per clause.
This indexing technique (due to W. McCune) has two interesting properties: each branch in the tree represents exactly one literal, and no node has more than one child with the same label (structure sharing is thus optimal). There exist several other indexing techniques that cannot be described here in detail (Robinson and Voronkov 2001).

These techniques (and many others) have been used to implement very powerful automated provers based on Resolution and Superposition, with which striking results have been obtained in mathematics (either in a purely automated way or with little assistance from the user), in particular proofs of conjectures. We mention for instance the proof of the independence of axioms in the ternary boolean algebra (with a 3-argument operation), and proofs of some results in Robbins algebras, in group theory, in equivalential calculus, in combinatory logic, etc. Details are available on the website www.mcs.anl.gov/research/projects/AR/new_results and in Wos et al. (1992). Automated theorem provers are also intensively used by proof assistants. We mention, among the many systems available on the web, some of the most well-known and efficient ones (in alphabetical order): E, Prover9 (the successor of OTTER), SPASS and Vampire (which won the CASC competition in the main division for many years). All these systems are based on the Resolution and Superposition calculi.
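A toy version of such an index can be sketched as a trie over the preorder symbol string of each literal. This is our own simplification (variable names are taken literally, assuming clauses use normalized variable naming; real discrimination trees support retrieval up to instantiation and generalization):

```python
def symbols(t):
    """Preorder symbol list of a term, e.g. p(e(x,x)) -> ['p','e','x','x']."""
    if isinstance(t, str):
        return [t]
    return [t[0]] + [s for a in t[1:] for s in symbols(a)]

def insert(index, term, value):
    node = index
    for s in symbols(term):
        node = node.setdefault(s, {})   # shared prefixes reuse existing nodes
    node['$'] = value                   # leaf marker: the clause number

def lookup(index, term):
    node = index
    for s in symbols(term):
        if s not in node:
            return None
        node = node[s]
    return node.get('$')

clauses = {1: ('p', ('e', 'x', 'x')),
           2: ('p', ('e', 'x', ('e', 'x', ('e', 'x', 'x')))),
           3: ('p', ('e', 'x', ('e', 'x', ('e', 'y', 'y')))),
           4: ('p', ('e', 'x', ('e', 'y', ('e', 'y', 'x')))),
           5: ('p', ('e', ('e', 'x', 'y'), ('e', 'y', 'x'))),
           6: ('p', ('e', ('e', ('e', 'x', 'y'), 'x'), 'y'))}
index = {}
for num, lit in clauses.items():
    insert(index, lit, num)
print(lookup(index, ('p', ('e', ('e', 'x', 'y'), ('e', 'y', 'x')))))  # → 5
```

Each root-to-leaf path spells out exactly one literal, and identical prefixes are stored once, which is the sharing property mentioned above.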


3.6 Termination: Some Decidable Classes

FOL is undecidable: no algorithm can decide in finite time whether a formula is valid (resp. unsatisfiable) or not. It is therefore natural to look for fragments for which this problem is decidable, and in particular for which the Superposition calculus terminates. In several cases, termination can be ensured by suitably choosing the selection function and the reduction order (Robinson and Voronkov 2001). For instance, termination can be ensured for the monadic class with equality. A formula is monadic if it contains no function symbol of arity greater than 0 and no predicate symbol of arity greater than 1. In this case, it can be shown that the terms occurring in the clausal form of the formula satisfy some regularity properties that are preserved by the inference rules and that bound both the length and the depth of the clauses (up to condensing). This entails that the set of deducible clauses is finite (up to a renaming of variables). Similar results have been established for the Ackermann class, i.e., the set of formulas of the form ∃x1, ..., xn ∀y ∃z1, ..., zm φ where φ contains no function symbol.

Some termination results have also been obtained for more specific theories, for instance those used in program verification, such as the theory of arrays, lists and other similar data structures. These results allow one to use the Superposition calculus as a decision procedure for testing the satisfiability of ground formulas modulo these theories (this is called the SMT problem, for "Satisfiability Modulo Theories"). Resolution procedures have also been employed to prove the decidability of fragments used to check the correctness of cryptographic protocols. It is also natural to consider fragments of FOL obtained by translation from decidable logics.
For instance, there exists a Superposition-based decision procedure for the guarded fragment, a subclass of FOL in which numerous non-classical logics can be naturally embedded. A Resolution-based decision procedure has been proposed for the 2-variable fragment. Decision procedures for description logics are also based on refinements of SPsel. The above-mentioned results are essentially based on the choice of an adequate reduction ordering. Termination results can also be established by choosing the selection function. For instance, the case in which sel(C) always contains one negative literal (if such a literal exists) deserves to be considered. These strategies are called "positive" because one of the premises is always positive. It is then possible to replace the Resolution rule by a macro-inference rule, called Hyper-resolution, generating only positive clauses by iterated sequences of Resolution steps. Techniques have been devised to prove that positive Resolution (or Hyper-resolution) terminates on specific fragments. They have been used to establish the decidability of various subclasses, for instance PVD (a generalization of DATALOG). Termination results have also been obtained for the class BU, which is obtained by translating many modal or description logics into first-order logic. Some of these results extend to equational clauses.


p(t1, ..., tn) ∨ C        ¬p(s1, ..., sn) ∨ D
―――――――――――――――――――――――――――――――――
(p(t1, ..., tn) ∨ C)σ,  (¬p(s1, ..., sn) ∨ D)σ

where σ is the most general unifier of (t1, ..., tn) and (s1, ..., sn).

Fig. 2 The Hyperlinking rule

3.7 Proof by Instantiation

The Herbrand theorem reduces the satisfiability test for a set of first-order clauses S to that of a set of ground instances of S. This result suggests a straightforward semi-decision procedure for FOL, consisting in enumerating the ground instances of the considered clauses and using a SAT-solver for testing satisfiability. In practice, this naive procedure is very inefficient. However, techniques exist to reduce the number of generated clauses, and reasonably efficient proof procedures have been designed based on this principle. These techniques benefit from the efficiency of modern SAT-solvers (or SMT-solvers, for reasoning modulo theories). They are usually based on instantiation rules similar to the Hyperlinking rule originally devised by D. Plaisted (cf. Fig. 2). Generated clauses are instances of one of the parent clauses. We denote by hl(S) the set of clauses that can be inferred from S by the Hyperlinking rule (in any number of steps). To get a propositional clause set, it suffices to instantiate all the remaining variables by some fixed constant ⊥; the obtained set is denoted by hl(S)⊥. Since this set is infinite it cannot be explicitly generated, but finite subsets can be computed and fed to a SAT-solver. If one of these sets is unsatisfiable then S is also unsatisfiable. The obtained algorithm is complete for formulas not containing ≃: S is unsatisfiable iff hl(S)⊥ is unsatisfiable, and by compactness, if S is unsatisfiable and if the Hyperlinking rule is applied in a fair way then an unsatisfiable subset of hl(S)⊥ is eventually obtained. There exist several refinements of this technique: for instance, extensions have been developed to handle equational formulas or formulas interpreted modulo background theories such as arithmetic, or to combine instantiation and saturation-based proof procedures.
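The ground-and-refute loop can be illustrated in miniature. In the sketch below (ours, not Plaisted's procedure), the clauses are function-free, all ground instances over a finite constant set are generated at once, and a brute-force propositional check stands in for the SAT-solver:

```python
from itertools import product

def ground(clause, subst):
    """Apply a variable-to-constant substitution to one clause."""
    return frozenset((sign, pred, tuple(subst.get(a, a) for a in args))
                     for sign, pred, args in clause)

def instances(clauses, constants):
    """All ground instances over the given constants (variables: uppercase)."""
    out = set()
    for c in clauses:
        vs = sorted({a for _, _, args in c for a in args if a[0].isupper()})
        for combo in product(constants, repeat=len(vs)):
            out.add(ground(c, dict(zip(vs, combo))))
    return out

def prop_sat(ground_clauses):
    """Brute-force propositional satisfiability (a SAT-solver stand-in)."""
    atoms = sorted({(p, a) for cl in ground_clauses for _, p, a in cl})
    for bits in product([False, True], repeat=len(atoms)):
        val = dict(zip(atoms, bits))
        if all(any(val[(p, a)] == sign for sign, p, a in cl)
               for cl in ground_clauses):
            return True
    return False

# S = { p(X) v q(X),  not p(a),  not q(a) } is unsatisfiable:
S = [[(True, 'p', ('X',)), (True, 'q', ('X',))],
     [(False, 'p', ('a',))],
     [(False, 'q', ('a',))]]
print(prop_sat(instances(S, ['a'])))   # → False, hence S is unsatisfiable
```

Real instantiation-based provers generate instances incrementally and fairly instead of exhaustively, which is exactly what rules such as Hyperlinking organize.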
Some other approaches have been proposed, combining the Hyperlinking rule with existing proof procedures for propositional logic, such as the tableaux method (see Sect. 4) or Davis and Putnam’s procedure. Instead of keeping the generation of instances independent of the propositional satisfiability test (viewed as a “black box”), the two procedures are fused and interleaved. The advantage is that the generated instances may be specific to a particular branch in the search space. Other approaches use the partial model returned by the SAT-solver to guide the generation of new instances.


3.8 Equational Unification

In order to handle axiomatic theories in the Resolution or Superposition calculus, it suffices to append to the initial clause set the clauses corresponding to the axioms of the theory. If this theory is equational, equations must thus be added. However, this approach is not elegant (usually, one does not explicitly refer to the axioms of arithmetic for proving that (x + y) + z = (z + y) + x) and it is often inefficient (the added equations generate many new consequents). It is thus generally simpler to build the theory into the unification algorithm itself. This is called equational unification (Robinson and Voronkov 2001). For instance, if f denotes an associative and commutative function, the unification problem f(a, f(b, z)) = f(x, y) has the following solutions: {x → a, y → f(b, z)}, {x → b, y → f(a, z)}, {x → z, y → f(a, b)}. One can easily check that these solutions are pairwise incomparable: it is not possible to obtain one of these unifiers from another one by substitution. In general, no unique most general unifier exists, and the set of most general unifiers may even be infinite (it can sometimes be represented finitely by using ad hoc techniques). Numerous authors have proposed to encode axioms in the unification algorithm. This is possible only if the unification problem is decidable for the considered theory. This is the case for instance for term schematizations, which are formal languages allowing one to denote iterated schemata of terms such as f^n(0) (where n is a variable ranging over the natural numbers).
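The three AC-unifiers above can be checked mechanically: modulo associativity and commutativity, an f-term is characterized by the multiset of its non-f arguments. The encoding below is our own (binary f-applications as tuples, constants and variables as strings):

```python
from collections import Counter

def subst(t, s):
    """Apply a substitution to a term."""
    if isinstance(t, str):
        return s.get(t, t)
    return ('f', subst(t[1], s), subst(t[2], s))

def args_mod_ac(t):
    """Multiset of non-f arguments: the AC normal form of an f-term."""
    if isinstance(t, str):
        return Counter([t])
    return args_mod_ac(t[1]) + args_mod_ac(t[2])

lhs = ('f', 'a', ('f', 'b', 'z'))      # f(a, f(b, z))
rhs = ('f', 'x', 'y')                  # f(x, y)
unifiers = [{'x': 'a', 'y': ('f', 'b', 'z')},
            {'x': 'b', 'y': ('f', 'a', 'z')},
            {'x': 'z', 'y': ('f', 'a', 'b')}]
for s in unifiers:
    assert args_mod_ac(subst(lhs, s)) == args_mod_ac(subst(rhs, s))
print("all three substitutions unify the terms modulo AC")
```

Checking a candidate unifier modulo AC is thus easy; the hard part, which this sketch deliberately avoids, is enumerating a complete set of such unifiers.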

3.9 Model Building

Model building is the problem dual to the search for a refutation (Caferra et al. 2004). The goal is, given a formula φ, to construct an interpretation satisfying φ (i.e., a counter-model of ¬φ). This allows one to prove, for instance, that a theory is consistent, or to detect errors in logical specifications. Several techniques have been devised to construct models automatically (see, e.g., Reynolds et al. 2016 for a recent approach). Two kinds of approaches can be distinguished.

Enumeration Methods

The existence of a finite model is a semi-decidable problem. If the cardinality of the domain is finite, quantifications can be replaced by conjunctions and disjunctions over the elements of the domain, which allows one to transform the considered problem into a propositional clause set that can be handled by any SAT-solver. The system FINDER is one of the first concrete realizations, and this technique is also used by the program MACE2. Refinements have been proposed to reduce the number of generated propositional clauses, which increases very quickly with the cardinality


of the domain. Another idea consists in translating the problem into a set of first-order clauses with no function symbols (the so-called Bernays–Schoenfinkel class), for which decision procedures exist. Algorithm 2 performs a direct enumeration of the models, which avoids any translation. To simplify, we assume that the only predicate symbol is the equality predicate. This procedure is used by systems such as SEM. Finite model builders have been used to settle conjectures in mathematics (mainly about the existence of finite structures satisfying given sets of axioms).

Algorithm 2: Enumeration-Based Finite Model Search

Notations: The set of cells on the domain D (written C_D) is the set of expressions of the form f(e1, ..., ek) where f is a symbol of arity k and (e1, ..., ek) ∈ D^k. A partial interpretation on D is a partial function from C_D to D. A partial interpretation associates some terms or formulas with a value (the value of the other terms or formulas is undefined).

FMod(S, n, D, I) =
  Input: a set of clauses S, a natural number n > 0 and a partial interpretation I on a domain D with |D| ≤ n
  Output: either an extension of I satisfying S, or f
  I ← propagation(S, n, D, I)    /* value propagation */
  if I = f or I is total then return I
  else
    let c ∈ C_D be a cell undefined in I
    forall v ∈ D do
      J ← FMod(S, n, D, I ∪ {c → v})
      if J ≠ f then return J
    if |D| = n then return f
    else    /* add a new element to the domain */
      let a be a new element, a ∉ D
      return FMod(S, n, D ∪ {a}, I ∪ {c → a})

propagation(S, n, D, I) =
  Input: a set of clauses S, a natural number n > 0 and a partial interpretation I on D with |D| ≤ n
  Output: either an extension of I, or f
  if there exist C ∈ S and σ : V → D such that [C](I,σ) = f then
    return f    /* contradiction detected */
  else if there exist C ∨ f(t1, ..., tn) ≃ s in S and σ : V → D such that [C](I,σ) = f and t1, ..., tn, s have a value in (I, σ) but not f([t1](I,σ), ..., [tn](I,σ)) then
    return propagation(S, n, D, I ∪ {f([t1](I,σ), ..., [tn](I,σ)) → [s](I,σ)})
  else return I
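The core idea of Algorithm 2 can be demonstrated with an even cruder sketch of our own: no value propagation, just brute-force enumeration of all interpretations over domains of increasing size. The example axioms are hypothetical (f(f(x)) = x and f(a) ≠ a, for one unary function f and one constant a):

```python
from itertools import product

def find_model(max_size):
    """Smallest model of f(f(x)) = x and f(a) != a, up to the given size."""
    for n in range(1, max_size + 1):
        dom = range(n)
        # a function table for f (one value per element) and a value for a
        for f_tab, a in product(product(dom, repeat=n), dom):
            f = lambda x: f_tab[x]
            if all(f(f(x)) == x for x in dom) and f(a) != a:
                return n, f_tab, a
    return None

n, f_tab, a = find_model(3)
print(n, f_tab, a)   # → 2 (1, 0) 0: the 2-element domain where f swaps 0 and 1
```

Value propagation and incremental domain growth, as in Algorithm 2, prune this search drastically: here every cell assignment is fixed before any clause is checked.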

Model Building and Deduction

Model building algorithms based on refinements of saturation-based procedures (i.e., using Resolution or Superposition) have also been devised. If a saturated set does


not contain the empty clause then it is necessarily satisfiable. However, the model cannot in general be explicitly constructed because its domain is infinite (it is the set of ground terms). For some fragments, a computable representation of this model can be obtained, which, to be useful for applications, should at least allow one to evaluate ground atoms (in some cases non-ground formulas can also be evaluated). Such algorithms have been proposed for various proof strategies, for instance for positive Resolution. Calculi devoted to the simultaneous search for refutations and models have also been proposed.

4 Semantic Tableaux

The semantic tableaux method is the oldest proof procedure in Automated Deduction; it was devised in the 1950s by Beth and Hintikka and further developed by Smullyan and Fitting in the 1960s and 1970s (Fitting 1996). The intuitive idea consists in trying to construct, in a systematic way, a model of the considered set of formulas by decomposing them according to their syntactic structure. If all the attempts at constructing a model fail, then the formula is unsatisfiable; otherwise, every successful attempt produces a model of the formula. These proof procedures, which use only (variants of) formulas occurring in the initial candidate theorem, are called analytic. The tableaux procedure is based purely on the semantics and requires no transformation into a specific normal form. For this reason, it can easily be adapted to non-classical logics for which useful normal forms are hard to obtain. Lastly, tableaux calculi also have a theoretical interest, since they are closely related to Gentzen's sequent calculi (Buss 1998), which form the basis of proof theory.

4.1 Propositional Tableaux

A tableau is a tree whose nodes are labeled by formulas (or ×). The initial tree has a unique node labeled by the formula to be tested for satisfiability; then expansion rules allow one to extend the tree. Using a widespread notation originally introduced by Smullyan, we classify these rules according to the kind of expansion they perform (i.e., the number of children added to the considered branch): either extension (α formulas) or branching (β formulas). These two kinds of formulas are presented in Fig. 3a and the expansion rules are depicted in Fig. 3b. In order to simplify notations, we often identify a formula with its type (α or β), and we denote by αi or βi the corresponding components. The symbol × is used to mark unsatisfiable branches.

Definition 3 The set of tableaux for a set of formulas S is inductively defined as follows. Base: every tree with a unique node labeled by φ ∈ S is a tableau for S. Inductive steps: if T is a tableau for S and B is a branch in T, then:


Fig. 3 Propositional tableaux rules
Fig. 4 Tableau for proving the validity of (p ∨ q) ⇒ ((p ⇒ q) ⇒ q)

[Fig. 4 depicts the closed tableau: the root ¬((p ∨ q) ⇒ ((p ⇒ q) ⇒ q)) is decomposed into p ∨ q, ¬((p ⇒ q) ⇒ q), p ⇒ q and ¬q; branching on p ∨ q and then on p ⇒ q yields three branches, each closed by ×.]
(i) if φ ∈ S, the tree obtained from T by extending B with a new node labeled by φ is a tableau for S;
(ii) if α ∈ B, the tree obtained from T by extending B with a node n labeled by α1 and (if necessary) with a direct successor n′ of n labeled by α2 is a tableau for S;
(iii) if β ∈ B, the tree obtained from T by extending B with two nodes n1 and n2 labeled respectively by β1 and β2 is a tableau for S.
If S = {φ}, T is a tableau for φ. The rule (r×) applies only to literals p and ¬p occurring in the same branch B.

Definition 4 Given a tableau T, a branch B in T is closed if it contains ×; otherwise it is open. A tableau is closed if all its branches are closed, and open otherwise. An open branch B is saturated if for every formula φ in B, B contains at least one of the components of the (unique) rule applicable to φ.

Unsuccessful model construction attempts for the negation of the formula (p ∨ q) ⇒ ((p ⇒ q) ⇒ q) are depicted in Fig. 4. The procedure is sound and complete: a propositional formula φ is valid if and only if there exists a closed tableau for ¬φ. It is also easy to see that the procedure always terminates and produces either a closed tableau or a tableau containing a saturated open branch. Consequently, soundness and completeness are usually stated


as follows: the formula φ is satisfiable if and only if there exists a tableau for φ containing a saturated and open branch.

For soundness, it suffices to prove that the expansion rules preserve the satisfiability of the formulas in the considered branch: if a branch is satisfiable³ then, for every applicable expansion rule, one of the obtained branches is also satisfiable. Hence it cannot be closed, since all closed branches are unsatisfiable. We deduce that the final tableau necessarily contains an open (saturated) branch. For completeness, it can be shown that if a tableau contains a saturated open branch B then there exists an interpretation I that validates all the formulas in B (and in particular the root formula). The interpretation I is constructed as follows: [p](I) = t if p ∈ B and [p](I) = f otherwise. It can easily be proven that if φ ∈ B then [φ](I) = t.

The above formalization is not unique: an equivalent calculus can be defined by considering nodes labeled by sets of formulas. Then the expansion rules can be defined locally, without referring to the branches. We provide below such a formalization of the propositional rules, where Γ denotes a set of formulas:

(rα)  Γ, α / Γ, α1, α2        (rβ)  Γ, β / Γ, β1 | Γ, β2        (r×)  Γ, p, ¬p / ×        (4)

In this formalization, a tableau for Γ is a tree whose root is labeled by Γ, constructed by applying the above rules. This formulation is very close to that of the sequent calculus without cut (Gentzen's calculus). Sequent calculi are the most widely used systems in proof theory, and there exist precise connections between tableaux and sequent calculi.
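The set-based formulation translates almost literally into a recursive program. The following is a minimal sketch of our own (ad hoc formula encoding; branches are frozensets of formulas, the α/β classification is inlined):

```python
def neg(f):
    return f[1] if f[0] == 'not' else ('not', f)

def satisfiable(branch):
    """Tableau-style satisfiability test for a frozenset of formulas.
    Formulas: ('atom', p), ('not', f), ('and', f, g), ('or', f, g),
    ('imp', f, g)."""
    for f in branch:
        rest = branch - {f}
        if f[0] == 'and':                                   # alpha
            return satisfiable(rest | {f[1], f[2]})
        if f[0] == 'or':                                    # beta
            return satisfiable(rest | {f[1]}) or satisfiable(rest | {f[2]})
        if f[0] == 'imp':                                   # beta
            return satisfiable(rest | {neg(f[1])}) or satisfiable(rest | {f[2]})
        if f[0] == 'not' and f[1][0] != 'atom':
            g = f[1]
            if g[0] == 'not':                               # alpha
                return satisfiable(rest | {g[1]})
            if g[0] == 'and':                               # beta
                return satisfiable(rest | {neg(g[1])}) or satisfiable(rest | {neg(g[2])})
            if g[0] == 'or':                                # alpha
                return satisfiable(rest | {neg(g[1]), neg(g[2])})
            if g[0] == 'imp':                               # alpha
                return satisfiable(rest | {g[1], neg(g[2])})
    # only literals left: the branch is open iff no complementary pair (r x)
    return not any(neg(f) in branch for f in branch)

def valid(f):
    return not satisfiable(frozenset([neg(f)]))

p, q = ('atom', 'p'), ('atom', 'q')
phi = ('imp', ('or', p, q), ('imp', ('imp', p, q), q))
print(valid(phi))              # → True  (the formula of Fig. 4)
print(valid(('imp', p, q)))    # → False
```

Each α rule shrinks the branch deterministically, each β rule explores two sub-branches, and the closure test (r×) is applied once only literals remain, so the program mirrors Definitions 3 and 4 directly.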

4.2 Tableaux for First Order Logic

In order to extend the tableaux procedure to first-order logic, rules must be added to handle quantifiers (see Fig. 5). Using again Smullyan's notations, we distinguish between the γ and δ rules. The γ rule extends a branch with a ground instance of a universally quantified formula occurring in the branch. The δ rule instantiates an existentially quantified variable with a new constant. Concerning the γ rule, two remarks are needed. First, if no ground term occurs in the branch, then the rule itself introduces a new constant symbol (i.e., t is a new constant). More importantly, the γ rule can (and must) be applied repeatedly on the same formula with different ground terms. The example in Fig. 6a illustrates these two aspects. The tableau of Fig. 6b provides a counter-example for the formula ∀x (p(x) ∨ q(x)) ⇒ (∀x p(x) ∨ ∀x q(x)). It can be observed that the open (saturated) branch provides a counter-model I of the formula on the domain {a, b}, defined by p^I = {b}, q^I = {a}.

³ I.e., if there exists an interpretation satisfying every formula in the branch.


Fig. 5 Tableaux rules for FOL: (a) γ formulas, where t is a ground term occurring in the branch, or a new constant if no such term exists; (b) δ formulas, where c is a new constant not occurring in the branch; (c) the γ and δ rules

[Fig. 6 shows two examples of tableaux for FOL: (a) a closed tableau for ¬∃x (p(x) ⇒ ∀z p(z)), built by applying the γ rule twice (once with the new constant a, then with the constant b introduced by the δ rule); (b) an open tableau for ¬(∀x (p(x) ∨ q(x)) ⇒ (∀x p(x) ∨ ∀x q(x))), whose saturated open branch contains q(a) and p(b).]

Fig. 6 Examples of tableaux for FOL

In propositional logic, the construction of a tableau always terminates: after a finite number of expansion steps a tableau is obtained where all branches are either saturated or closed. This is obviously not the case for first-order logic, which is not decidable. To see this, it suffices to construct a tableau for the formula ∀x ∃y p(x, y). Since there may exist infinitely many ground terms,4 a saturated open branch may

⁴ Either because the signature contains a function symbol of arity strictly greater than 0, or because new constants are repeatedly created by the δ rule.


be infinite. Yet, it is possible to prove that the procedure is sound and complete: a first-order formula φ is valid if and only if there exists a closed tableau for ¬φ. For soundness, as in propositional logic, it suffices to show that the rules preserve satisfiability. The completeness proof is more complex: an interpretation validating all the formulas in a saturated and open (in general infinite) branch must be constructed. For this purpose, a model satisfying the set of formulas occurring in the branch B is defined as follows: p^I = {⟨t1, ..., tn⟩ | p(t1, ..., tn) ∈ B} for every n ∈ ℕ and for every predicate symbol p of arity n. It is easy to check that if φ ∈ B then I |= φ (assuming that all the minimal terms not occurring in the branch are interpreted as an arbitrarily chosen constant of the branch). This proves that φ is satisfiable iff there exists a (usually infinite) tableau for φ containing at least one open, saturated (possibly infinite) branch.

To complete the completeness proof, it remains to provide a strategy (i.e., an algorithm) ensuring that a saturated open branch can be generated. A systematic strategy is required, using a fairness criterion for applying the rules: no formula or term can be "infinitely delayed". Such a strategy works because the rules are non-destructive: every application of the expansion rules to a tableau T produces a tableau that contains T as a subtree. Consequently, the application of a rule cannot prevent the construction of the closed tableau if it exists. The calculus fulfills a property called confluence: any tableau for an unsatisfiable formula can be expanded into a closed tableau.

4.3 Free Variable Tableaux

In practice, the blind instantiation of universal quantifiers is inefficient. Some adaptations must be applied to the expansion rules and to the tableau construction procedure in order to ensure better efficiency while preserving soundness and completeness. More precisely, the γ rule usually replaces the considered quantified variable with a new free variable, treated rigidly (i.e., it denotes the same term in all branches). These free variables are eventually instantiated by unifying complementary literals in a branch (cf. Sect. 3.2), which entails that the branch can be closed. Since the free variables are rigid, the obtained most general unifier must be applied to the entire tableau. Note that, since formulas can now contain free variables, the δ rule must be adapted as well: the constant c must be replaced by a complex term f(x1, ..., xn) containing all the free variables occurring in the considered existential formula.

Definition 5 A free variable tableau for a set of formulas S is a tree generated (in the sense of Definition 3) by the α and β rules and by the following rules:

1. (rγ′)  γ / γ1(x′), where x′ is a variable not occurring in the tableau. γ1(x′) is called a new instance of γ.

Fig. 7 Free variable tableau for ¬∃x (p(x) ⇒ ∀z p(z))

¬∃x (p(x) ⇒ ∀z p(z))
¬(p(x′) ⇒ ∀z p(z))
p(x′)
¬∀z p(z)
¬p(b)
×  {x′ → b}

2. (rδ⁺)  δ / δ1(f(x1, ..., xn)), where x1, ..., xn are the variables freely occurring in δ and f is a function symbol not occurring in the branch.
3. Rule (Subst): if a branch contains two literals p(t1, ..., tn) and ¬p(s1, ..., sn), and if σ is a most general unifier of (t1, ..., tn) and (s1, ..., sn), then the substitution σ is applied to T (hence T is replaced by Tσ) and × is added to the branch.

A closed free variable tableau for the same formula as in Fig. 6 is depicted in Fig. 7. It can be proven that the procedure is sound. However, providing an algorithm for generating the closed tableau (if it exists) is much more difficult than for the original method. The above systematic strategy does not work anymore, since the closure rule (Subst) is destructive: consequently, a wrong application of the rule can hinder the construction of a closed tableau. Consider for instance the tableau corresponding to the set of formulas {∀x (p(x) ∨ q(x)), ¬p(a), ¬p(b), ¬q(b)}. The γ and β rules generate two branches containing p(x′) and q(x′) respectively. If the wrong choice is made by unifying p(x′) with p(a), the first branch can be closed, but no closed tableau can be obtained, unless a new variable x′′ is created by applying the γ rule again on the first formula, generating again two branches containing p(x′′) and q(x′′). If all these variables x′, x′′, ... are repeatedly unified with a, then the tableau will never be closed. For this reason, usual algorithms for free variable tableaux either try to construct all possible tableaux in parallel (using a systematic exploration of the set of possible tableaux by "iterative deepening") or delay the application of the rule (Subst) until a unifier closing all branches simultaneously is found (Fitting 1996). Note however that free variable tableaux are still confluent. It is possible to devise more efficient calculi by assuming that the considered formula is in clausal form (see Sect. 3.1).
These approaches use a single expansion rule that combines n-ary branching with the introduction of free variables (combining the rules β and γ). This variant is called clausal tableaux. A further refinement (called connection tableaux) imposes that one of the literals introduced by the expansion rule be unified with the complement of the label of the parent node. However, this calculus is not confluent and thus requires backtracking to former tableaux.
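Closing branches with the rule (Subst) relies on syntactic unification (Sect. 3.2), which can be sketched in a few lines. The encoding below is our own (variables as ('var', name), applications as (functor, args...), no functor named 'var'); the returned substitution is triangular, i.e., bindings may mention variables bound elsewhere.

```python
def walk(t, subst):
    """Follow variable bindings until a non-bound term is reached."""
    while t[0] == 'var' and t[1] in subst:
        t = subst[t[1]]
    return t

def occurs(v, t, subst):
    t = walk(t, subst)
    return t == v or (t[0] != 'var' and any(occurs(v, a, subst) for a in t[1:]))

def mgu(eqs):
    """Most general unifier of a list of term pairs, or None on failure."""
    subst, eqs = {}, list(eqs)
    while eqs:
        s, t = eqs.pop()
        s, t = walk(s, subst), walk(t, subst)
        if s == t:
            continue
        if s[0] == 'var':
            if occurs(s, t, subst):
                return None                      # occur check
            subst[s[1]] = t
        elif t[0] == 'var':
            eqs.append((t, s))                   # orient: variable first
        elif s[0] == t[0] and len(s) == len(t):
            eqs.extend(zip(s[1:], t[1:]))        # decompose
        else:
            return None                          # symbol clash
    return subst

x, y = ('var', 'x'), ('var', 'y')
print(mgu([(('p', x, ('f', y)), ('p', ('g', y), ('f', ('a',))))]))
# → {'y': ('a',), 'x': ('g', ('var', 'y'))}, i.e. x → g(y), y → a
print(mgu([(x, ('f', x))]))      # → None (occur check failure)
```

The occur check shown here is exactly what prevents closing a branch containing p(x′) and ¬p(f(x′)) in a free variable tableau.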


5 Non-Classical Logics

In this part we refer to chapter "Knowledge Representation: Modalities, Conditionals, and Nonmonotonic Reasoning" of Volume 1, which introduces modal logics and the Kripke semantics of possible worlds. A modal language is built on a set P0 of propositional variables. The most popular methods in AD are Resolution methods, which require a notion of clausal form. Such normal forms are easily obtained in classical logics (both propositional and first-order), but this is not so in modal or conditional logics where, in general, no clausal form exists. From this observation, two approaches to AD in modal logics can be considered: a direct method that dispenses with clausal forms, or an indirect method using a classical theorem prover through a translation to classical logic that names possible worlds with explicit variables. However, many modal logics are decidable, and the translation method may not yield a decision procedure. Furthermore, some modal logics may be more expressive than first-order logic and can only be translated into higher-order logic, for instance the logic S4.3.1 defined by the axioms K, T, 4 and "Dum" (expressing the absence of "non-final clusters"). Translation methods to first-order logic are not suitable for these modal logics.

Methods based on semantic tableaux, which do not assume any normal form on the input formula, belong to the direct methods. They have become the most popular AD methods in modal logics. We distinguish two types of calculi based on semantic tableaux: implicit and explicit tableaux calculi. In an implicit tableaux calculus, the rules encode implicitly the transition from one possible world to another (Goré 1999). An explicit tableaux calculus uses formulas decorated with the name of some world (a label); the calculus thus allows one to prove that a formula is (or is not) satisfiable in the world x that labels the formula. The tableaux rules can then be easily and naturally expressed. Many modal logics are defined by their axioms.
Each axiom semantically characterizes a property of the accessibility relation R. For instance, to mention the most familiar examples, axiom T enforces reflexivity of relation R, axiom 4 corresponds to transitivity and axiom B corresponds to symmetry. In an explicit tableaux calculus these properties can be directly expressed, while in implicit calculi they must be encoded indirectly. The usual notions on modal tableaux are analogous to the corresponding notions in classical tableaux: closed and saturated branches and tableaux.
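These correspondences between axioms and properties of R are easy to check mechanically on finite models. The following Python sketch (the formula encoding and model layout are ours, chosen for illustration) evaluates a modal formula at a world of a finite Kripke model:

```python
# A Kripke model is a triple (S, R, v): S a set of worlds, R ⊆ S × S the
# accessibility relation, v mapping each world to the set of variables true there.
def holds(model, w, f):
    S, R, v = model
    if f[0] == "var": return f[1] in v[w]
    if f[0] == "not": return not holds(model, w, f[1])
    if f[0] == "imp": return (not holds(model, w, f[1])) or holds(model, w, f[2])
    if f[0] == "box": return all(holds(model, u, f[1]) for (x, u) in R if x == w)
    if f[0] == "dia": return any(holds(model, u, f[1]) for (x, u) in R if x == w)
    raise ValueError(f[0])

p = ("var", "p")
axiom_T = ("imp", ("box", p), p)         # □p ⇒ p
# On a reflexive model, axiom T holds at every world:
M = ({0, 1}, {(0, 0), (1, 1), (0, 1)}, {0: {"p"}, 1: set()})
print(all(holds(M, w, axiom_T) for w in M[0]))   # True
# Dropping reflexivity can falsify it:
N = ({0, 1}, {(0, 1)}, {0: set(), 1: {"p"}})
print(holds(N, 0, axiom_T))                      # False
```

The second model shows that T is not valid on arbitrary frames: world 0 is not related to itself, so □p can hold at 0 while p fails there.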

5.1 An Implicit Tableaux Calculus

We first describe an implicit calculus for the modal logics K, KT, K4 and S4 = KT4, based on sets of formulas. Each of these tableaux calculi contains the usual propositional rules (see Sect. 4.1), the rule (K) and any combination of the rules (T) and (4) (including none) of Fig. 8.


T. Boy de la Tour et al.

Fig. 8 Implicit tableaux rules for K, T, 4

Fig. 9 C K-tableau for ¬(□( p ⇒ q) ∧ ♦ p ⇒ ♦q):

¬(□(p ⇒ q) ∧ ♦p ⇒ ♦q)
□(p ⇒ q), ♦p, ¬♦q
p ⇒ q, ¬q, p
p ⇒ q, ¬q, p, ¬p  ×    |    p ⇒ q, ¬q, p, q  ×

Γ is a finite set of formulas, φ is a formula, Γ□ = {□ψ : □ψ ∈ Γ} and Γ′ = {ψ : □ψ ∈ Γ}. Each rule implicitly uses the semantics of the modal operator. If, in a set Γ of formulas, a modal formula ♦φ, i.e., ¬□¬φ, is true, then there is a possible world in which φ is true, as well as all the formulas ψ such that □ψ was true. This world is built by the conclusion of rule (K). Rule (T) implements the reflexivity of the accessibility relation, i.e., the fact that every world is a possible world for itself: the formula φ is simply added to any world that contains □φ. Rule (4) implements the property of transitivity: when a new possible world is introduced, it contains all the formulas of the form □ψ that were contained in the “current” world. A tableau for a set of formulas is built as described in Sect. 4, with the rules for sets. We call C X-tableau any tableau for logic X. Figure 9 depicts a C K-tableau for the formula ¬(□( p ⇒ q) ∧ ♦ p ⇒ ♦q). A set of formulas Γ is called C K-inconsistent (resp. C T-inconsistent, …, C S4-inconsistent) if some C K-tableau (resp. C T-tableau, …, C S4-tableau) for Γ is closed; it is consistent iff every tableau for Γ is open. From any saturated open tableau for Γ a model of Γ can be built. A formula φ is therefore a theorem of K (resp. T, …, S4) iff there is a closed C K-tableau (resp. C T-tableau, …, C S4-tableau) for ¬φ.
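This implicit calculus yields a small decision procedure for K. The sketch below is our own encoding (the names `nnf`, `k_sat`, `k_valid` are illustrative): formulas are first put in negation normal form, so that propositional rules act on ∧/∨ and rule (K) acts on the remaining literals and modal formulas; it confirms that the formula of Fig. 9 is a theorem of K:

```python
def nnf(f):
    """Negation normal form over and/or/box(□)/dia(♦): push ¬ down to variables."""
    if f[0] == "not":
        g = f[1]
        if g[0] == "var": return f
        if g[0] == "not": return nnf(g[1])
        if g[0] == "and": return ("or", nnf(("not", g[1])), nnf(("not", g[2])))
        if g[0] == "or":  return ("and", nnf(("not", g[1])), nnf(("not", g[2])))
        if g[0] == "box": return ("dia", nnf(("not", g[1])))
        if g[0] == "dia": return ("box", nnf(("not", g[1])))
    if f[0] in ("and", "or"): return (f[0], nnf(f[1]), nnf(f[2]))
    if f[0] in ("box", "dia"): return (f[0], nnf(f[1]))
    return f  # a variable

def k_sat(fs):
    """Tableau test: is this set of NNF formulas K-satisfiable?"""
    fs = frozenset(fs)
    if any(f[0] == "var" and ("not", f) in fs for f in fs):
        return False                       # closed branch: p and ¬p
    for f in fs:                           # propositional saturation
        if f[0] == "and":
            return k_sat(fs - {f} | {f[1], f[2]})
        if f[0] == "or":
            return k_sat(fs - {f} | {f[1]}) or k_sat(fs - {f} | {f[2]})
    unboxed = {f[1] for f in fs if f[0] == "box"}
    # rule (K): each ♦φ requires a world satisfying {φ} ∪ {ψ : □ψ ∈ Γ}
    return all(k_sat({f[1]} | unboxed) for f in fs if f[0] == "dia")

def k_valid(f):
    return not k_sat({nnf(("not", f))})

p, q = ("var", "p"), ("var", "q")
implies = lambda a, b: ("or", ("not", a), b)
phi = implies(("and", ("box", implies(p, q)), ("dia", p)), ("dia", q))
print(k_valid(phi))            # True: the formula of Fig. 9 is a theorem of K
print(k_valid(("box", p)))     # False
```

Termination needs no loop check here: each application of rule (K) strictly decreases the modal depth of the formulas involved.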

5.2 An Explicit Tableaux Calculus

We now introduce tableaux calculi in which the accessibility relation is explicitly represented. One can “derive” a rule for a formula schema from the semantics of the considered logic. Consider the modal formula □φ. Given a modal interpretation M = (S, R, v) and x ∈ S, we have M, x |= □φ iff for all y ∈ S, if x R y then M, y |= φ. We may therefore deduce from M, x |= □φ and x R y that M, y |= φ; this is rule (T□) in Fig. 10, which contains the rules for the initial modal logic K. Their premises are labelled formulas x : φ and transition formulas x R y.

Fig. 10 Explicit tableaux rules for K:

(T□)  x : □φ, x R y  /  y : φ
(F□)  x : ¬□φ  /  x R y, y : ¬φ   (*)

(*) y is a new label in the branch.

Fig. 11 Explicit tableau for ¬(□( p ⇒ q) ∧ ♦ p ⇒ ♦q):

x : ¬(□(p ⇒ q) ∧ ♦p ⇒ ♦q)
(i) x : □(p ⇒ q)   (ii) x : ♦p   (iii) x : ¬♦q
(iv) x R y0, y0 : p        from (ii) by (F□)
(v) y0 : ¬q                from (iii) and (iv) by (T□)
y0 : p ⇒ q                 from (i) and (iv) by (T□)
y0 : ¬p  × (iv)    |    y0 : q  × (v)

In Fig. 11 can be found a labelled tableau for the formula ¬(□( p ⇒ q) ∧ ♦ p ⇒ ♦q). Closed branches and tableaux are defined as in classical logic, but on labelled formulas (and similarly for saturated branches, by adding the rules of modal tableaux). To prove soundness and completeness of these tableaux calculi we associate an interpretation to any open and saturated branch. Given a branch B we define the set of labels of B as B_E = {x | x : φ ∈ B for some φ} (note that if x R y ∈ B then x ∈ B_E and y ∈ B_E, according to the rules), and P(B) = { p ∈ P0 | x : p ∈ B} is the set of atomic formulas that have an occurrence in B. We say that an interpretation M = (S, R, v) is associated to a branch B under a function f : B_E −→ S if for every transition formula x R y in B, f (x) R f (y) holds. A branch B is satisfied by an interpretation M = (S, R, v) under f iff M is associated to B under f and, for every labelled formula x : φ in B, M, f (x) |= φ holds. A tableau is satisfiable if it admits a branch satisfied by some interpretation under some function. The soundness of the calculus can then be proven easily: if T is a satisfiable tableau then any tableau obtained from T by application of a tableau rule is satisfiable. To prove completeness a canonical interpretation M_C = (S, R, v) is built from any open and saturated branch B, with S = B_E, R = {(s, s′) | s R s′ ∈ B} and v(s, p) = true iff s : p ∈ B, for all p ∈ P(B). M_C can be shown to satisfy all the formulas in B. Standard extensions of modal logic K are characterized by additional axioms and by semantic properties of the accessibility relation. Table 1 shows the tableaux rules for the axioms T, 4 and B, which stem directly from the corresponding semantic property.


Table 1 Axioms and tableaux rules for T, 4, B

Axiom             Property of R   Tableau rule
T: □φ ⇒ φ         Reflexive       x : φ  /  x R x
4: □φ ⇒ □□φ       Transitive      x R y, y R z  /  x R z
B: ¬□¬□φ ⇒ φ      Symmetric       x R y  /  y R x

In the presence of rules (4) and (B) the calculus does not terminate. For instance the tableau for the formula ¬(□♦ p ⇒ ♦□ p) generates an infinite sequence of possible worlds. However, loops can be detected, since every newly generated label refers to a recurring finite set of formulas; hence a terminating calculus can be obtained by introducing loop checking. Termination can also be ensured by applying more sophisticated strategies (Massacci 2000). The idea of using labels can already be found in the works of Kripke, who devised a tableaux calculus for modal logics (Kripke 1963; Fitting 1983). Schütte (1978) introduced labelled trees where the labels are finite sequences of natural numbers. Such an encoding of labels can also be found in Massacci (2000), which provides a systematic and modular tableaux calculus for a wide variety of logics. In this calculus, the descendants of a node labelled σ are labelled σ.1, σ.2, …, which allows one to directly “deduce” the accessibility relation between two nodes by taking care, if needed, of the properties of transitivity, symmetry, etc.
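The loop check can be sketched on the implicit calculus for K4 (our encoding again; rule (4) copies the □-formulas to every new world, and a world whose saturated set of formulas already occurred on the path is closed off by looping back to that ancestor — the usual argument on transitive frames justifies declaring such a branch satisfiable):

```python
def nnf(f):
    """Negation normal form over and/or/box(□)/dia(♦)."""
    if f[0] == "not":
        g = f[1]
        if g[0] == "var": return f
        if g[0] == "not": return nnf(g[1])
        if g[0] == "and": return ("or", nnf(("not", g[1])), nnf(("not", g[2])))
        if g[0] == "or":  return ("and", nnf(("not", g[1])), nnf(("not", g[2])))
        if g[0] == "box": return ("dia", nnf(("not", g[1])))
        if g[0] == "dia": return ("box", nnf(("not", g[1])))
    if f[0] in ("and", "or"): return (f[0], nnf(f[1]), nnf(f[2]))
    if f[0] in ("box", "dia"): return (f[0], nnf(f[1]))
    return f

def k4_sat(fs, path=()):
    """K4 tableau: rule (4) copies □ψ to successors; an identical ancestor loops."""
    fs = frozenset(fs)
    if any(f[0] == "var" and ("not", f) in fs for f in fs):
        return False
    for f in fs:
        if f[0] == "and":
            return k4_sat(fs - {f} | {f[1], f[2]}, path)
        if f[0] == "or":
            return k4_sat(fs - {f} | {f[1]}, path) or k4_sat(fs - {f} | {f[2]}, path)
    if fs in path:              # this world repeats an ancestor: close the loop
        return True
    boxes = {f for f in fs if f[0] == "box"}
    unboxed = {f[1] for f in boxes}
    return all(k4_sat({f[1]} | unboxed | boxes, path + (fs,))
               for f in fs if f[0] == "dia")

p = ("var", "p")
ax4 = ("or", ("not", ("box", p)), ("box", ("box", p)))   # □p ⇒ □□p
print(k4_sat({nnf(("not", ax4))}))                       # False: axiom 4 is K4-valid
print(k4_sat({("dia", p), ("box", ("dia", p))}))         # True, found via the loop check
```

Without the `fs in path` test, the second query would generate an infinite chain of identical worlds, since rule (4) keeps re-copying □♦p.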

6 Dealing with Incompleteness

The logics presented in the previous sections are semi-decidable: they admit complete inference systems, which constitute an essential ingredient in their mechanization. It is however known that they only cover a limited range of mathematical reasoning; they lack the essential ingredient of mathematical induction. But Gödel has shown that arithmetic, with its induction principle on natural numbers, is incomplete. AD can only alleviate this inconvenience by using methods very different from those presented above, which we can only briefly evoke (see also chapter “Theoretical Computer Science: Computability, Decidability and Logic” of Volume 3 on the fundamental interaction between logic and computability).

6.1 Induction

Mathematical induction relies on an induction principle, i.e., a theorem stating that, for every property P, if ∀y (χ (P, y) ⇒ P(y)) holds then P(y) holds for all y. The formula χ (P, y) is the induction hypothesis and y is the induction variable.


Mathematical induction consists in proving a formula φ (with free variable y) from the induction hypothesis χ (φ, y), and then deducing from the induction principle that ∀y φ is a theorem. An induction principle obviously depends on the set E in which y varies, and especially on the formula χ (φ, y). There are two general forms of induction principles, according to whether the formula χ (φ, y) is universal or existential. The existential form is called structural induction; it depends on a set C of constructors, where a constructor is a function from E^n to E for some n ∈ ℕ. The induction hypothesis χ (φ, y) is then

∃n ∈ ℕ, ∃ f ∈ C, ∃x1 , . . . , xn ∈ E, y = f (x1 , . . . , xn ) ∧ φ[y/x1 ] ∧ · · · ∧ φ[y/xn ].

It is easy to show that ∀y ∈ E, (χ (φ, y) ⇒ φ) implies ∀y ∈ E, φ if E is the set of elements that are finitely generated by the constructors; this is the principle of structural induction. It is then natural to denote the elements of E as closed terms in the signature F = C (with the same arities). But in FOL it is generally not possible to exclude that an interpretation contains elements that are not denoted by any closed term; hence structural induction does not hold in FOL, since it would generalize to all elements properties that are true only on E (hence true only in the Herbrand interpretations built on F ). This is why FOL is not suitable for reasoning on most mathematical objects (as found in computer science). Well-founded induction uses a binary relation R on a set E with no infinite sequence (xi )i∈ℕ such that ∀i ∈ ℕ, xi+1 R xi (in other words, all descending paths through R are finite; this generalizes well-founded orders). This allows for a universal induction hypothesis: ∀x ∈ E, x R y ⇒ φ[y/x]. In the theory of ordinals R is the membership relation ∈, and the well-founded induction principle is then called transfinite induction. Another well-known principle of well-founded induction is the induction based on the strict ordering of natural numbers. It is easy to see that Peano induction is both a principle of structural induction (with 0 and the successor function s(n) = n + 1 as constructors) and of well-founded induction (with x R y iff y = s(x)); this is due to the fact that from any n there is a unique descending path through this relation to 0. Gödel's first incompleteness theorem thus shows that both forms of induction principles lead to incomplete theories. This is reflected in proof theory by the fact that some inductive theorems can only be obtained through the use of some lemma that cannot be expressed in the language of the theorem (such an “external” lemma, or cut formula, cannot generally be eliminated from proofs as it can in FOL).
For example, some theorems need to be generalized so that an inductive proof may be found. Also the induction variable must be carefully chosen when the conjecture has several universally quantified variables.
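The need for generalization can be seen on the classic accumulator example. In the following Lean 4 sketch (the names `sumAcc`, `sumAcc_eq` and the particular `simp` invocations are ours, given for illustration), proving `sumAcc 0 l = sum l` directly by induction on `l` fails, because the induction hypothesis fixes the accumulator to 0 while the recursive call uses `acc + x`; the statement generalized over the accumulator goes through:

```lean
def sum : List Nat → Nat
  | []      => 0
  | x :: xs => x + sum xs

def sumAcc (acc : Nat) : List Nat → Nat
  | []      => acc
  | x :: xs => sumAcc (acc + x) xs

-- The generalized statement: universally quantified over the accumulator.
theorem sumAcc_eq (l : List Nat) : ∀ acc, sumAcc acc l = acc + sum l := by
  induction l with
  | nil => intro acc; simp [sumAcc, sum]
  | cons x xs ih => intro acc; simp [sumAcc, sum, ih, Nat.add_assoc]

-- The original conjecture follows as an instance.
theorem sumAcc_sum (l : List Nat) : sumAcc 0 l = sum l := by
  simp [sumAcc_eq]
```

The lemma `sumAcc_eq` is exactly the kind of “external” generalization that an automated induction prover must invent, or ask the user for.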


This is why systems of automated reasoning by induction remain open to human interaction. The most famous one is probably the Boyer and Moore theorem prover (Nqthm, then ACL2); but other systems like RRL, INKA or Oyster/Clam have also been used in numerous works aiming at mechanizing the construction of induction schemas and inductive generalizations, or simply at proposing an induction variable (Robinson and Voronkov 2001; Berardi and Tatsuta 2017). Other techniques, inspired by the Knuth–Bendix completion method for rewriting systems, allow a simpler mechanization of induction in special theories or for special formulas. They use sets of terms with a number of completeness properties (the “cover sets” and “test sets”, as in the systems RRL and SPIKE). In some cases special axioms allow one to prove conjectures by consistency, i.e., by showing that the conjecture is not contradictory with the axioms. It is then possible to use a FOL theorem prover, provided it can establish the absence of contradiction by saturation, in order to prove inductive properties (properties that are valid in the class of Herbrand interpretations); this is inductionless induction, see Robinson and Voronkov (2001).

6.2 Logical Frameworks or Proof Assistants

One can see computers as assistant mathematicians from their very beginning: even today numerical computation is still a major activity of computers around the world. But for more elaborate problems computation turns symbolic, and its mechanization requires techniques from AI. Symbolic computation does not share the nice algorithmic properties of numeric computation; complexity and computability issues arise. A natural approach is to mechanize standard algebra, for instance to compute with polynomials or algebraic fractions. Computer algebra systems were first developed for physics, where the system Schoonschip contributed to scientific developments that earned its author a Nobel prize. Nowadays these systems are widely used in mathematics. However, these systems are increasingly complex and present a number of drawbacks: algorithms are used whose correctness is seldom proven, and some methods are only correct under special hypotheses but can be applied ad libitum. Such problems become crucial when one needs to guarantee the correctness of the results, or more generally when a proof is required. It is then necessary to design computer systems that are powerful enough to develop a substantial amount of mathematics; such systems are called proof assistants, or logical frameworks, since they rely on logical languages in which mathematical texts can be expressed in a more or less direct way, including definitions, theorems and proofs (Robinson and Voronkov 2001; de Moura et al. 2015). In these systems a user has the possibility to develop and verify mathematical proofs. The verification relies on a fairly simple kernel that basically encodes a formal system (axioms and inference rules). When every proof developed in such a system is checked by this simple kernel, the system is said to satisfy the de Bruijn Criterion. The correctness of theorems thus only relies on that of the kernel.


This reduction seems indispensable in mathematics, in the sense that some proofs are simply too long and involved to be checked by a sufficient number of competent referees. This seems to be the case for the proof of Fermat's Last Theorem by Wiles, or the classification of the finite simple groups: most mathematicians are reduced to believing that such results are firmly established. Some even consider that mathematical texts that are too long for a human to read and check should not be accepted as proofs. In particular the first, computer-assisted proof of the four color theorem, partly based on a mechanized enumeration of a very long list of particular cases, can hardly be considered as intuitive. But before granting computers the ability to solve these problems one must first design a logical framework powerful enough to express such proofs and simple enough to make them intuitive, and then provide the user with the tools necessary for developing proofs longer than he could read. The first problem has been solved thanks to the investigations on the foundations of mathematics. It has been known since the beginning of the 20th century that set theory is suitable to formalize mathematics, though sometimes at the cost of elegance. Type theory is another candidate, often closer to the practice of mathematics (and computer science). The latter was chosen in the first logical framework, the system Automath developed by N. de Bruijn from 1968. But despite its name this system did not offer the possibility to mechanize the construction of proofs. Two systems more specialized in program verification appeared in the years 1971–72: Nqthm (the Boyer and Moore prover, later known as ACL2) and Milner's LCF. The former is extensively mechanized; the latter offers automation capabilities through the means of tactics. These programs constitute inverted inference rules: applied to a conjecture as conclusion, they yield a list of premises as new conjectures.
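The tactic mechanism can be sketched in a few lines of Python (this is our illustrative rendering of the LCF idea, not Milner's actual ML interface): a tactic maps a goal to the list of subgoals left to prove, and signals non-applicability with an exception:

```python
class Fail(Exception):
    """Raised when a tactic does not apply to the goal."""

# A goal is a formula (nested tuples); a tactic maps a goal to the list of
# subgoals left to prove, i.e., the premises of an inverted inference rule.
def split_and(goal):
    """To prove A ∧ B, prove A and prove B."""
    if goal[0] == "and":
        return [goal[1], goal[2]]
    raise Fail

# Tacticals combine tactics into more complex tactics.
def then_(t1, t2):
    """Apply t1, then t2 to every subgoal it produced."""
    return lambda goal: [h for g in t1(goal) for h in t2(g)]

def orelse(t1, t2):
    """Try t1; on failure fall back to t2."""
    def tac(goal):
        try:
            return t1(goal)
        except Fail:
            return t2(goal)
    return tac

def repeat(t):
    """Apply t repeatedly until it fails; failure leaves the goal unchanged."""
    def tac(goal):
        try:
            subgoals = t(goal)
        except Fail:
            return [goal]
        return [h for g in subgoals for h in tac(g)]
    return tac

goal = ("and", ("and", ("var", "p"), ("var", "q")), ("var", "r"))
print(repeat(split_and)(goal))   # [('var', 'p'), ('var', 'q'), ('var', 'r')]
```

The conjecture is proved when the list of remaining subgoals is empty; in LCF itself, ML's typing guarantees that only values of type thm can ever be produced this way.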
In case of failure an exception handling mechanism enables backtracking to the initial state. The advantage of tactics is that they can be combined in order to build more and more complex tactics, by means of operators called tacticals. For instance one tactical repeatedly applies a given tactic to the resulting conjectures until a failure occurs. And it is precisely to allow the user to write his own tactics that Milner designed the ML language (for MetaLanguage), where typing ensures that only proven formulas have type thm. Another pioneering proof assistant is the Mizar system, developed by Trybulec from 1973. It is based on a set theory in classical logic, but offers typing possibilities close to the usual practice of mathematics. The aim of this system is to automatically check given texts (or articles) written in a language as close as possible to standard mathematics, hence readable by a human being. The author of an article is asked to fill in intermediate details (using feedback from the system) until Mizar achieves its verification. More recent logical frameworks also offer the possibility to develop proofs in constructive mathematics (by rejecting the law of excluded middle and hence classical logic), which is particularly attractive to computer scientists since it is possible to extract from a constructive proof of a theorem of the form ∀x1 . . . xn ∃y p(x1 , . . . , xn , y) a functional program f such that ∀x1 . . . xn p(x1 , . . . , xn ,


f (x1 , . . . , xn )). This program can then be run on given data within the system. This reflects the traditional computational nature of mathematics (one may think of Euclid's gcd algorithm), and it seems essential that the notion of computation be integrated into a system in which mathematics is being developed. Different logical frameworks allow different ways of performing numerical or symbolic computations, with more or less control and conciseness, according to the way equational reasoning is represented. Among the recent systems, the best known is PVS. It provides the user with many features (decision procedures, SMT solvers, etc.), partly due to the fact that it does not fulfill the de Bruijn Criterion. The Isabelle system does, but formal proofs (very lengthy when computations are involved) are not stored, and hence no program can be extracted from them. The Coq system stores proofs as typed lambda-terms (in the calculus of inductive constructions), as well as the script (the user's input) that led to their construction from the initial conjecture. A more detailed comparison of these systems and some others (NuPRL, Agda, HOL, Lego…) can be found in Barendregt and Wiedijk (2005). These systems contain more or less extended libraries of standard mathematical results, built at the cost of long and laborious efforts. These results are available for proving conjectures, and experience shows that proving is accelerated when adequate libraries are available. This is one obvious superiority of proof assistants over automated theorem provers, whose mathematical “culture” is usually reduced to their inference rules and strategies. Proof assistants can therefore already be praised for achieving the complete formalization of a number of fundamental mathematical results, including some very elaborate ones (such as Gödel's first incompleteness theorem, or the refutational completeness of proof procedures, Blanchette et al. 2018).
One striking result is the proof of the four color theorem, obtained by Georges Gonthier with the Coq system. This formal proof was obtained partly thanks to standard methods of automated deduction. It is of course still too long to be read by a mathematician, but the de Bruijn Criterion at least lifts any remaining doubt about its validity.

7 Conclusion

Despite these obvious accomplishments, further advances are required for proof assistants to become tools as useful to scientists and engineers as computer algebra systems are. They are often demanding on the user and offer little intuition. Unless extremely meticulous, a mathematician has no need for the excessive formalism necessary for obtaining proofs in a logical framework. In particular the readability and conciseness of the information provided by the user and the system to each other must be improved, to allow a fruitful interaction to take place. This requires better interfaces, translation capabilities between different formalisms and a higher degree of mechanization, so that the proof assistant does not freeze on what appears obvious to the user (unless the obvious turns out to be


wrong). Until these conflicting requirements are met, it seems reasonable to develop proof assistants toward program verification. Apart from the obvious economic benefits, the proofs found in this area are often mathematically simpler, and yet tedious enough to welcome a proof assistant's help.

References

Barendregt H, Wiedijk F (2005) The challenge of computer mathematics. Philos Trans R Soc A 363:2351–2375
Berardi S, Tatsuta M (2017) Equivalence of inductive definitions and cyclic proofs under arithmetic. In: 32nd annual ACM/IEEE symposium on logic in computer science, LICS 2017, Reykjavik, Iceland, 20–23 June 2017. IEEE Computer Society, pp 1–12
Blanchette JC, Fleury M, Lammich P, Weidenbach C (2018) A verified SAT solver framework with learn, forget, restart, and incrementality. J Autom Reason 61(1–4):333–365
Bledsoe WW, Loveland DW (eds) (1984) Automated theorem proving: after 25 years. Contemporary mathematics, vol 29. American Mathematical Society, Providence
Buss SR (1998) Handbook of proof theory. Studies in logic and the foundations of mathematics. Elsevier Science Publisher, New York
Caferra R, Leitsch A, Peltier N (2004) Automated model building. Applied logic, vol 31. Springer (Kluwer), Berlin
de Moura LM, Kong S, Avigad J, van Doorn F, von Raumer J (2015) The Lean theorem prover (system description). In: Automated deduction - CADE-25 - 25th international conference on automated deduction, Berlin, Germany, 1–7 August 2015, Proceedings, pp 378–388
Fitting MC (1983) Proof methods for modal and intuitionistic logics. Synthese library, vol 169. D. Reidel, Dordrecht
Fitting MC (1996) First-order logic and automated theorem proving, 2nd edn. Springer, New York
Goré R (1999) Tableau methods for modal and temporal logics. In: Handbook of tableau methods. Kluwer Academic Publishers, Dordrecht, pp 297–396
Kripke SA (1963) Semantical analysis of modal logic I, normal propositional calculi. Zeitschr f math Logik u Grundl d Math 9:67–96
Leitsch A (1997) The resolution calculus. Texts in theoretical computer science. Springer, Berlin
Massacci F (2000) Single step tableaux for modal logics. J Autom Reason 24(3):319–364
Reynolds A, Blanchette JC, Cruanes S, Tinelli C (2016) Model finding for recursive functions in SMT. In: Automated reasoning - 8th international joint conference, IJCAR 2016, Coimbra, Portugal, June 27–July 2 2016, Proceedings, pp 133–151
Robinson JA, Voronkov A (2001) Handbook of automated reasoning (2 volumes). Elsevier and MIT Press, Amsterdam and Cambridge
Schütte P (1978) Vollständige Systeme modaler und intuitionistischer Logik. Springer, Berlin
Siekmann J, Wrightson G (eds) (1983) Automation of reasoning. Classical papers on computational logic 1957–1967 and 1967–1970, vol 1, 2. Springer, Berlin
Sloman A (2008) The well-designed young mathematician. Artif Intell 172(18):2015–2034
Sutcliffe G (2017) The TPTP problem library and associated infrastructure - from CNF to TH0, TPTP v6.4.0. J Autom Reason 59(4):483–502
Wos L (1988) Automated reasoning: 33 basic research problems. Prentice Hall, Englewood Cliffs
Wos L, Overbeek R, Lusk E, Boyle J (1992) Automated reasoning: introduction and applications. McGraw-Hill, New York

Logic Programming

Arnaud Lallouet, Yves Moinard, Pascal Nicolas and Igor Stéphan

Abstract This chapter presents the family of logic programming languages, in which computation is viewed as deduction in a logical formalism. We first present the foundation of logic programming with Horn clauses, illustrated by the Prolog language. From this first concept numerous extensions were born; here we describe two of them in detail: constraint logic programming, which allows a more elegant treatment of domains other than finite terms, and Answer Set Programming, which gives a better treatment of negation and appears to be an effective implementation of nonmonotonic reasoning.

1 Introduction

Logic Programming,1 Constraint Logic Programming and Non-monotonic Logic Programming are different paradigms that have supported Artificial Intelligence throughout its history. Today, under very close or even identical syntaxes, we find two families of logic programming. The first one, usually called “logic programming”, is represented by Prolog and is based on proof theory, so it is best understood through its operational semantics: programming prevails over logic. In order to allow the use of different domains, in particular numerical domains, logic programming has

1 To be precise, logic programming refers more to programming in logic, as many programming activities require the use of logic.

A. Lallouet (B) Huawei Technologies Ltd, French Research Center, Boulogne-Billancourt, France e-mail: [email protected]; [email protected] Normandie Univ, UNICAEN, ENSICAEN, CNRS, GREYC, Caen, France Y. Moinard INRIA Bretagne Atlantique - IRISA, Rennes, France e-mail: [email protected] P. Nicolas · I. Stéphan LERIA - Université d’Angers, Angers, France e-mail: [email protected] © Springer Nature Switzerland AG 2020 P. Marquis et al. (eds.), A Guided Tour of Artificial Intelligence Research, https://doi.org/10.1007/978-3-030-06167-8_4



evolved into constraint logic programming. In parallel, Answer Set Programming has emerged as a non-monotonic logic programming paradigm. This family of languages enjoys a purely declarative semantics based on model theory: logic prevails over programming. In the following sections, we present the theoretical foundations of these three approaches, software systems that allow one to use them in practice, and the fields of AI concerned with these logic languages and their extensions. Section 2 presents an overview of logic programming and its theoretical foundations, Sect. 3 presents some aspects of constraint logic programming, and Sect. 4 presents recent advances in the ASP field.

2 Logic Programming

The invention of the logic programming paradigm (Giannesini et al. 1985; Colmerauer et al. 1973; Kowalski 1974; Colmerauer 1983) was a true revolution in the field of Artificial Intelligence: formal logic was no longer only an effective way to represent knowledge, but also an efficient way to compute. We first present this paradigm in its general framework, then we focus on its most popular instance, logic programming with Horn clauses, and finally on its implementation, the Prolog language.

2.1 From Logic to Logic Programming

The logic programming paradigm considers computation as deduction in a logical formalism. This paradigm is to be contrasted with imperative programming, which considers computation as the modification of a global state by instructions, and with functional programming, which considers computation as the result of the evaluation of a function. A logic programming language consists of a language of data, a language of programs and a language of queries. The last two are greater or lesser fragments of a logical formalism. Logic programming languages are declarative languages (so-called “fifth generation” languages): the program declares the structure of the problem but does not describe an operational method to solve it. The logical formalism should not only be able to represent knowledge, but must also be powerful enough to provide a way to define all computable functions. In order to compute, a theorem prover restricted to the selected fragments is needed, and it must contain a deduction principle. The set of hypotheses of the theorem is specified in the language of programs (together with the language of data), while the formula to be deduced from the hypotheses is described in the language of queries (together with the language of data). A logic program expresses the knowledge of a domain through formulas whose variables are implicitly universally quantified, while


the variables of the queries are implicitly existentially quantified. An expected answer for such a program is a set of instances for the existentially quantified variables. Since programming may be seen as the addition of a “logic” component, a “data structure” component and a “control” component, this latter component disappears if the paradigm is carried to the extreme, i.e., if one uses a theorem prover whose internal workings are operationally opaque. A first element that distinguishes an ordinary theorem prover from the mechanism underlying the execution of a logic programming language is the following: a theorem prover decides whether or not the theorem is true, while the mechanism computes, from the same knowledge, the answers for the existential variables of the queries. Moreover, the execution mechanism of the program, being based on deduction, gives a procedural semantics to the logic programming paradigm. Logic programming languages are relational languages: computation is defined in terms of relations, as in database languages. Hence, conversely to functions, which have inputs and one output, the arguments of a relation are reversible: they are not statically input or output, but dynamically either one or the other or both. A consequence of the relational nature of logic programming languages is that they are non-deterministic: the answer to a query is not necessarily unique, and (a sufficiently large part of) the set of answers is computed. Logic programming languages are symbolic languages: the data language consists of symbols with no semantics of their own, whose signification is only given by the programmer (unlike numerical values). Hence, computation is completely syntactic, and the elements of the data language are never semantically related but only syntactically (except for specific added domains, such as the integers).

2.2 Logic Programming with Horn Clauses

Logic programming with Horn clauses is no doubt the most popular and widespread instance of the logic programming paradigm (Lloyd 1987). The data language is the language of terms, built inductively on a set of function symbols (function symbols of arity zero are constant symbols) and a set of variables: every variable and every constant symbol is a term, and if t1 , . . . , tn are terms and f is a function symbol of arity n then f (t1 , . . . , tn ) is a term. Program and query languages rely on atoms, literals and clauses, which are built on a set of predicate symbols: if p is a predicate of arity n and t1 , . . . , tn are terms, then p(t1 , . . . , tn ) is an atom. An atom (resp. its negation) is a positive (resp. negative) literal. A universally quantified disjunction of literals is a clause. A clause with exactly one positive literal is a definite clause. A definite clause that contains only one literal is a fact. A definite clause that contains some negative literals is denoted A ← A1 , . . . , An, with A1 , . . . , An the atoms of the negative literals and A the only positive literal of the clause. The language of programs is defined as the set of definite clauses; the semantics of a set of definite clauses is the conjunction of these clauses. The language of queries is the set of existentially quantified conjunctions of positive literals.


In formal logic, a model is an interpretation of the symbols of a formula such that the formula is true. An interpretation is a model of a program if it is simultaneously a model of every clause of the program. The deduction mechanism underlying logic programming with Horn clauses is logical consequence: a query is a logical consequence of a program if every interpretation which is a model of the program is also a model of the query. In logic, proving that a formula is a logical consequence of a set of formulas is equivalent to proving that the conjunction of the formulas of the set, plus the negation of the formula to be deduced, is unsatisfiable, i.e., has no model. The negation of an existentially quantified conjunction of positive literals is equivalent to a clause consisting only of negative literals. Hence, adding to the program the negation of the query leads to a set of clauses each of which contains at most one positive literal, which is the definition of a Horn clause and justifies the name “logic programming with Horn clauses”. Proving that a set of formulas has no model at all is in most cases an insurmountable task. But this is not so for the very special case of a set of clauses, since it is then sufficient to prove that there is no Herbrand model. A Herbrand interpretation is a very simple interpretation which makes the bridge between syntax and semantics: constant and function symbols are interpreted by themselves. The Herbrand universe of a program is the set of terms which may be built without variables (adding a constant symbol if there is none). The Herbrand base of a program is the set of atoms which may be built with the elements of the Herbrand universe and the predicate symbols. The Herbrand base is necessarily a model of a definite program, since such a program contains only definite clauses. A Herbrand interpretation is simply a subset of the Herbrand base.
The intersection of all the Herbrand models is itself a model, the least Herbrand model, and it is exactly the subset of the Herbrand base which is a logical consequence of the program. Hence, the declarative semantics of a definite program is twofold: it is the least Herbrand model as the intersection of all the models, and it is the logical consequence of the program. The set of Herbrand interpretations ordered by set inclusion is a complete lattice for every definite program. Over this structure a procedural semantics, known as "forward chaining", may be defined as the fixpoint of a consequence operator TP from the set of interpretations to itself. For a definite program P, this function TP is defined as follows: if I is a Herbrand interpretation containing A1, . . . , An and if A ← A1, . . . , An is a variable-free instance of a definite clause of P, then A is in TP(I). This function is not only increasing but also continuous: least upper bounds are preserved. The least fixpoint of this function for a program P coincides with the least Herbrand model of P. In order to describe SLD resolution, the other procedural semantics, known as "backward chaining", we have to define unification, which is the only parameter-passing mechanism of logic programming with Horn clauses. A unifier of two atoms is a substitution, i.e. a function from variables to terms, which makes the two atoms equal. If two terms are unifiable then there necessarily exists a most general unifier, unique up to variable renaming, i.e. a substitution which instantiates the two terms as little as possible. SLD resolution applies nondeterministically, within an SLD derivation, the following SLD inference rule: if Q1, . . . , Qm−1, Qm, Qm+1, . . . , Qr is a conjunction of atoms and A ← A1, . . . , An is (a copy with coherent renaming of variables of) a definite clause such that θ is the most general unifier of Qm and A, then θ(Q1, . . . , Qm−1, A1, . . . , An, Qm+1, . . . , Qr) is a resolvent w.r.t. this definite clause and the selected atom Qm. SLD resolution (Loveland 1968; Luckham 1968), for Selective Linear Definite clause resolution, is a linear restriction of Robinson's resolution (Robinson 1965) for definite clauses, with a selection function over the literals of the resolvent. An SLD derivation starting from a query taken as the first resolvent leads to a failure if at least one of the atoms cannot be eliminated, and to a success if all the atoms are eliminated, generating an empty resolvent. The computed answer is the composition of the unifiers applied to the initial query. Once the strategy for selecting the atom in the resolvent is fixed, the set of all possible SLD derivations for a query may be seen as a tree. The search strategy specifies the traversal method for such an SLD tree. If the search strategy is fair, i.e. the SLD resolution rule will be applied, whenever possible, to every resolvent of the tree, then the selection strategy may be arbitrary. The set of elements of the Herbrand base of a definite program P for which there exists an SLD derivation leading to a success coincides with the least Herbrand model of P. Hence, the procedural semantics of a definite program is also twofold and coincides with the two declarative semantics, offering every program a double approach to programming. Moreover, the selection of the atom in the resolvent and the selection of the clause to unify with that atom lead to two kinds of parallelism.
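As a minimal illustration of the forward-chaining semantics (a Python sketch, not part of the chapter; propositional clauses are assumed to be encoded as pairs of a head atom and a set of body atoms), the consequence operator TP and its least fixpoint can be computed as follows:

```python
def tp(program, interp):
    # One application of the immediate-consequence operator T_P:
    # derive the head of every clause whose body is contained in interp.
    return {head for head, body in program if body <= interp}

def least_herbrand_model(program):
    # Iterate T_P from the empty interpretation; by continuity of T_P
    # this reaches its least fixpoint, the least Herbrand model.
    interp = set()
    while tp(program, interp) != interp:
        interp = tp(program, interp)
    return interp
```

For instance, from the facts and rules p., q ← p., r ← p, q. and s ← t., the iteration derives {p, q, r} and never s.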

2.3 The Prolog Language The language Prolog (for "Programmation en logique" in French) is an instance of the logic programming with Horn clauses paradigm that is more programming than logic. Logic programming with Horn clauses is a theoretical tool whose non-determinism leads to a potentially infinite number of SLD derivations to manage in parallel. Only a fair search strategy, such as breadth-first search, guarantees the completeness of SLD resolution, but it necessarily leads to overly costly mechanisms. Hence, in Prolog, the search strategy is depth-first. This strategy is implemented with a backtrack stack which stores the choice points keeping track of the branches of the SLD tree not yet explored. Through this efficiency-motivated choice, the completeness of the language as a theorem prover is lost. Another source of inefficiency is the occurs check in the unification algorithm, which prevents a variable from unifying with a term containing it. This test is necessary for the soundness of the SLD inference rule. It requires a traversal of the term, which is optional in Prolog and usually not activated. Through this second efficiency-motivated choice, the soundness of the language as a theorem prover is lost. Although logic programming with Horn clauses is Turing complete, SLD resolution has been "extended" to offer more pragmatism, more expressivity and more control. The purely syntactic model of logic programming is particularly restrictive when it comes to manipulating numbers, which are then usually represented as Peano numbers (based on a constant symbol "zero" and a function symbol "successor" of arity one). Such a linear representation entails a growth in algorithmic complexity which penalizes the algorithms. For pragmatism and efficiency, a built-in evaluator of functional arithmetic expressions has been integrated into Prolog, using integers (and floating point numbers).2 Since those integers are not inductively defined, the properties given by the induction theorem are not preserved. Moreover, since this evaluator is functional, the relational nature of the language and its reversibility are lost for every program using this feature. The will to overcome these limits while preserving the logical nature of the language is one of the motivations for the introduction of constraints (cf. Sect. 3). Regarding expressivity, the major extension is the possibility to introduce negative pieces of information. Since all the clauses of a program are definite, this is impossible without adding a new inference rule. The most popular is the negation by failure rule under the closed world assumption: if an atom without variables is not a logical consequence of the program, then its negation is deduced and considered true. This negation is different from classical negation, which would not be deducible in most cases. Without leaving the logic, the primary means of control are the order of the atoms in the clauses and the order of the clauses in the program. An extra-logical control feature, the cut, allows one to prune branches of the SLD tree. While this mechanism is very useful for pruning branches that are infinite, lead to failures, or lead to already computed successes, it is also dangerous, since it might prune branches leading to not yet computed successes and, hence, modify the semantics of the cut-free program. Another feature of logic programming, present from the beginning, is meta-programming, i.e. taking Prolog programs as data. The built-in predicate call executes its argument as if that term appeared textually in place of the atom.
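The unification mechanism and the occurs check discussed above can be sketched as follows (a hypothetical Python encoding, not Prolog's actual implementation: capitalized strings stand for variables and compound terms are tuples):

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def walk(t, subst):
    # Dereference a term through the bindings accumulated so far.
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def occurs(v, t, subst):
    # The occurs check: does variable v appear inside term t?
    t = walk(t, subst)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, a, subst) for a in t[1:])

def unify(t1, t2, subst=None):
    # Return a most general unifier extending subst, or None on failure.
    subst = dict(subst or {})
    t1, t2 = walk(t1, subst), walk(t2, subst)
    if t1 == t2:
        return subst
    if is_var(t1):
        if occurs(t1, t2, subst):   # the test Prolog usually deactivates
            return None
        subst[t1] = t2
        return subst
    if is_var(t2):
        return unify(t2, t1, subst)
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        for a, b in zip(t1[1:], t2[1:]):
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None                     # clash of functors or arities
```

With the occurs check active, unifying X with f(X) fails, which is exactly the test that Prolog skips for efficiency.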
The clause base is managed thanks to the predicates assert and retract, which allow one respectively to add and delete clauses, while access to the clauses of the program is possible through the predicate clause. Because variables and meta-variables are mixed in Prolog, the authors of the logic language Gödel (Hill and Lloyd 1994) have promoted a stricter vision of meta-programming in which object-language variables are represented by constant symbols. Prolog achieved a wide distribution, which led to its standardization in 1995 (Deransart et al. 1996). This standard had to settle the question of syntax (at the beginning, the so-called "de Marseille" syntax was in competition with the so-called "Edinburgh" syntax) and also to fix the semantics. Efficiency was also an important research topic, and it is only with Warren's abstract machine (Warren 1983), which allows the compilation of programs, that Prolog gained some visibility in industry. However, for Prolog considered as a programming language, prototyping was favored over software engineering, particularly through executable specifications. This did not favor the development of applications in production and did not allow the language to establish itself widely. The absence of typing and the non-classical control make it difficult for programmers used to imperative languages to finely control their code. A module system only appeared in 2000, when the language was already in decline. Prolog raised great hopes in parallelism (de Kergommeaux and Codognet 1994) due to its two sources of non-determinism (choice of the atom and choice of the clause), which seemed well suited to extracting large independent computations. The reader may find in Colmerauer and Roussel (1996) a passionate and fascinating narration of the invention of Prolog by its authors.

2 Predicate is.

2.4 Beyond Logic Programming with Horn Clauses Prolog is a Turing-complete programming language. Owing to its features, Prolog is suited to symbolic domains, for example natural language processing, where logic programming was born. Since Prolog is non-deterministic, it is also suited to the domains of automata and compilation. Since Prolog execution is based on a deduction mechanism, it is also suited to systems manipulating knowledge bases. Since control may be ignored in a first approach and typing is somewhat dynamic, Prolog is also suited to prototyping and executable specification. Accomplished implementations of Prolog integrate many modules which allow one to use Prolog as a glue language between constraint solvers, internet applications, graphical interfaces, and so on, but which also allow Prolog to cooperate with other programming languages, most of the time imperative ones. Most extensions of logic programming with Horn clauses do not aim to modify its foundations. Some other extensions are more drastic and aim either to change the logic fragment under consideration, as for example λ-Prolog, which replaces terms by λ-terms and definite clauses by hereditary Harrop formulas (Nadathur and Miller 1998); or to integrate other programming paradigms, as for example the language Mercury (Somogyi et al. 1996) (logic and functional) or the language Oz (Van Roy et al. 2003) (logic, functional, constraints, objects and concurrency); or to incorporate rewriting mechanisms, as for example the CHR language (for Constraint Handling Rules) (Frühwirth 1998); or to incorporate semantic domains through constraint resolution. The last two aspects are treated in Sect. 3. Finally, from contact with nonmonotonic logic was born the huge domain of nonmonotonic logic programming, which is treated in Sect. 4.

3 Constraint Logic Programming Constraint Logic Programming (or CLP) emerged from the need to handle domains other than the finite terms of classical logic programming. Indeed, a language like Prolog reaches the ideal of logic programming only for definite clauses over the set of finite terms, and only if the occurs check is activated. Other domains, especially numerical ones, are imperfectly handled by the functional arithmetic defined by is (see Fig. 1). The original motivation was to provide a better treatment of negation through the difference constraint, although the semantics of ASP has since proved to be a better answer (see Sect. 4).

Fig. 1 Comparison between Prolog and CLP: Prolog code for Fibonacci (left) versus CLP code for Fibonacci (right)

The first conceptual step leading to CLP is the intuition of replacing unification by the solving of a set of equations over a domain, in Prolog II (Colmerauer 1983). The domain is still that of terms, but infinite ones (and incidentally this provides a theoretical justification for omitting the occurs check), and the difference constraint is added to term equality. This vision allows one to see Prolog as an instance of CLP in which the only constraint is equality over finite terms. It led on one side to what is called the "CLP scheme" (Jaffar and Lassez 1987), in which elementary constraints are solved in their respective domains, and to the elaboration of a heterogeneous tree domain called π4 which was the foundation of Prolog IV (Benhamou et al. 1996). On the other side, thanks to the integration of CSP (constraint satisfaction problem) techniques (Waltz 1975), the development of the CHIP language (Constraint Handling in Prolog) (Dincbas et al. 1988) led to a wider diffusion of constraint programming on finite domains and bridged the gap with operations research. It is interesting to start a description of CLP with a short comparison with Prolog. Figure 1 depicts a Prolog versus a CLP program to compute Fibonacci numbers. It clearly shows that computing on numerical domains is a direct extension of computing on terms. In particular, the Prolog version uses the predicate is, which is not invertible. Hence, only a call to the fib predicate with its first parameter instantiated will not lead to a failure. The CLP version does not suffer from this problem3 and returns N = 7 when called as fib(N,13).

3 However, this program does not terminate if it is called with an invalid second argument, like fib(N,4), because it will try to generate increasing values for F1 and F2. It is simple to correct it by noticing that in fib(N,F), we have F ≥ N.

3.1 The CLP Scheme and CLP(R) In CLP, a constraint is above all a syntactic object having a fixed interpretation on a particular domain, like X ≥ 2 or X = Y + Z. Predicates fall into two categories: constraint predicates, whose interpretation is fixed, and program predicates, which are treated as in Prolog. A CLP clause is therefore of the form H ← c, B1, . . . , Bn where c is a conjunction of constraints. Program clauses are handled by the CSLD


resolution, which is similar to SLD resolution except that constraints are added to a constraint store. This store is simplified according to an axiomatic theory, which is correct with respect to the domain (and complete for satisfiability), until reaching a solved form in which the satisfiability of the formula is obvious. For Prolog, this theory is Clark's equational theory (CET), enhanced by the closed world assumption (CWA). The solved form is a set of equations of the form X = t where X is a variable and t a term. Most of these equations are produced by parameter passing, although they can be stated explicitly with the equality predicate. Along a CSLD derivation, the constraints and equalities which stem from each clause are added to the constraints of the goal and incrementally simplified by an adequate algorithm. This preserves and extends the logical semantics of logic programming. In particular, computed answers are logical consequences of the program in all models of the interpretation structure of the constraints. Particular instances of CLP are Prolog III (Colmerauer 1990) and CLP(R) (Jaffar et al. 1992), in which constraints are interpreted as relations on real numbers. In these systems, linear equations and inequations are handled separately by the Gauss-Jordan elimination algorithm and a variant of the simplex algorithm. An interesting aspect is that non-linear constraints are accepted by the language while not being handled by the solver – which is thus incomplete. Instead, they are delayed in the hope that they become linear after the instantiation of some variables. In Prolog IV (Benhamou et al. 1996), constraints on real numbers are solved using interval propagation on a tree structure with heterogeneous leaves called π4, following the seminal work of BNR-Prolog (Older and Vellino 1990). In the original CLP approach, the solver was considered as a black box which could decide the satisfiability of the constraints. Subsequent efforts have managed to turn this black box into a glass box.

3.2 CLP(FD) Finite domain constraints have by nature numerous applications and open up to CLP the field of combinatorial optimization. For complexity reasons (solving these constraints is NP-complete), only an incomplete satisfiability check is performed. Instead, constraints are simply propagated as in CSP (see Waltz 1975; Rossi et al. 2006 and chapter "Constraint Reasoning" of this volume) and the search for a solution is done by a labeling predicate which enumerates the values in the variables' domains. The satisfiability of the constraint store is tested by a local consistency algorithm which uses an explicit representation of the domains, maintaining either the possible values or their bounds. The result of this algorithm is incomplete: if the output is false, then the store is unsatisfiable. But often the output is simply unknown, meaning that the deduction is incomplete. However, the solver does not remain inactive even in this case, because it reduces the domains of the variables by removing locally inconsistent values. Consequently, it is tempting to use Prolog as a CSP generator, as could be done with any other language. Once the constraints are defined, the user starts a search with the labeling predicate. The combination of the forward chaining of constraints and the backward chaining of Prolog simplifies the writing of complex search strategies. Through its multiple parameters, the labeling predicate allows one to specify a heuristic for choosing the next variable to enumerate, a heuristic for ordering the values of its domain, and a preference criterion for the solution, obtained by minimizing or maximizing a variable.
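The interplay between propagation and labeling can be sketched as follows (a toy Python model, not an actual CLP(FD) system; domains are sets of integers, `propagate` is any domain-reduction function, and the variable-choice heuristic is first-fail):

```python
def labeling(domains, propagate):
    # Depth-first value enumeration interleaved with propagation,
    # in the style of a CLP(FD) labeling predicate.
    domains = propagate(domains)
    if domains is None or any(not d for d in domains.values()):
        return None                      # failure: an empty domain
    if all(len(d) == 1 for d in domains.values()):
        return {v: min(d) for v, d in domains.items()}
    var = min((v for v, d in domains.items() if len(d) > 1),
              key=lambda v: len(domains[v]))   # first-fail heuristic
    for val in sorted(domains[var]):           # ascending value order
        sol = labeling({**domains, var: {val}}, propagate)
        if sol is not None:
            return sol                   # first solution found
    return None

def sum_is_5(domains):
    # Toy propagator enforcing X + Y = 5 by removing unsupported values.
    x, y = domains["X"], domains["Y"]
    return {"X": {a for a in x if any(a + b == 5 for b in y)},
            "Y": {b for b in y if any(a + b == 5 for a in x)}}
```

Each labeling step fixes one variable and re-propagates, exactly the interleaving of forward chaining (propagation) and backward chaining (enumeration with backtracking) described above.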

3.3 Writing Your Own Solver with CLP Using CLP has multiple advantages, in particular concerning the customization of search algorithms and the definition of constraint propagation algorithms. Indeed, writing such code for imperative-language libraries requires a good knowledge of the solver architecture and its implementation. In contrast, CLP solvers often propose a mechanism for accessing the domain representation, which allows one to use Prolog-based search in place of the labeling predicate. For example, the Sicstus Prolog system (Carlsson et al. 1997) proposes the fd_min and fd_max predicates to access the current bounds of a finite domain variable. It is then easy to use these values to perform branching. Another possibility given to the user is to define their own constraints in a language having a natural integration in Prolog: the indexicals language (Hentenryck et al. 1998). An indexical is a constraint of the form X in r where X is a variable and r a set expression in which the domains of other variables can be accessed. Here is an example defining the propagation of the constraint X = Y + Z under bounds consistency:

X in min(Y)+min(Z)..max(Y)+max(Z)
Y in min(X)-max(Z)..max(X)-min(Z)
Z in min(X)-max(Y)..max(X)-min(Y)
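Operationally, the three indexicals above describe a bounds-propagation fixpoint, which can be sketched as follows (a Python illustration with assumed conventions: each domain is an inclusive pair of bounds):

```python
def intersect(dom, lo, hi):
    # "X in lo..hi": intersect the current domain with the indexical range.
    return (max(dom[0], lo), min(dom[1], hi))

def prop_sum(dx, dy, dz):
    # Fixpoint of the three indexicals for X = Y + Z under bounds
    # consistency; returns None when a domain becomes empty.
    while True:
        nx = intersect(dx, dy[0] + dz[0], dy[1] + dz[1])
        ny = intersect(dy, dx[0] - dz[1], dx[1] - dz[0])
        nz = intersect(dz, dx[0] - dy[1], dx[1] - dy[0])
        if any(lo > hi for lo, hi in (nx, ny, nz)):
            return None          # empty domain: the store is unsatisfiable
        if (nx, ny, nz) == (dx, dy, dz):
            return nx, ny, nz    # fixpoint reached
        dx, dy, dz = nx, ny, nz
```

After execution, each new domain is the intersection of the old domain with the set denoted by the right-hand side, as in the operational semantics of indexicals described below.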

The clp(FD) system (now Gnu-Prolog) (Diaz and Codognet 2001) and Sicstus Prolog (Carlsson et al. 1997) implement this language. Indexicals are awakened when a variable in the right-hand side expression is modified and are thus tightly integrated into the solver's propagation loop. They have an obvious logical semantics (X is included in r), but also an operational one since, after execution, the new domain of variable X is equal to the intersection of the old domain of X and the set represented by r. Indexicals are well tailored for implementing simple constraints, and can be executed efficiently. Constraint Handling Rules (or CHR) (Frühwirth 1994, 2009) extend the possibilities for writing a solver even further by allowing more powerful rules with multiple atoms in the head. CHR rules are applied in forward chaining on a store of atoms and constraints. A rule is chosen and fired when it is applicable, and this mechanism is performed until no more rules can fire. CHR rules allow one to rewrite the constraint store while maintaining compatibility with Prolog terms. There are three types of CHR rules, where H, H', C and B are sets of atoms:


– propagation rules: H ==> C | B.
– simplification rules: H <=> C | B.
– simpagation rules: H \ H' <=> C | B.

The meaning of these rules is the following. If the atoms of the head H are in the store, and if the guard C is entailed, a propagation rule adds the atoms of the body B to the store. A simplification rule replaces the head by the body, and a simpagation rule performs a combination of the two by replacing only the part H' of the head. This mechanism is clearly a syntactic rewriting. As in Prolog, unification is used for parameter passing between the constraints in the store and the head of the rule. It is useful to note that the Prolog integration is bidirectional, because the guard C can be a Prolog predicate whose evaluation is required. In order to avoid a trivial non-termination, a propagation rule cannot be applied twice on the same atoms. Finally, a rule can be named using the prefix name @. The correspondence between constraints in the store and rules with multiple heads is one of the beauties of the formalism, but also a source of inefficiency. The system keeps a notion of active constraint, for which a rule is sought for application. If such a rule is found, the remainder of the head is then sought (such atoms are said to be passive). As a consequence, the great expressivity of CHR can be difficult to handle. It is better to start the design of a solver with simplification rules with one head. Here is how to define a solver for the ≤ constraint:

reflexivity   @ leq(X,X) <=> true.
antisymmetry  @ leq(X,Y), leq(Y,X) <=> X = Y.
idempotence   @ leq(X,Y) \ leq(X,Y) <=> true.
transitivity  @ leq(X,Y), leq(Y,Z) ==> leq(X,Z).

For example, the goal (leq(X,Y),leq(Y,Z),leq(Z,X)) is rewritten after an application of transitivity into (leq(X,Y),leq(Y,Z),leq(Z,X),leq(X,Z)). By the antisymmetry rule, we get (leq(X,Y),leq(Y,Z),X=Z). Then equality is propagated to give (leq(Z,Y),leq(Y,Z),X=Z) and, by another application of antisymmetry, the result is (Y=Z,X=Z).
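The behaviour of this solver can be emulated naively over variable names (a Python sketch of the rewriting, not a real CHR engine; it ignores guards and Prolog unification):

```python
def leq_solve(atoms):
    # Emulates the leq solver: reflexivity and idempotence discard
    # trivial and duplicate atoms, transitivity closes the store, and
    # antisymmetry turns symmetric pairs into equalities.
    store = {(x, y) for x, y in atoms if x != y}   # reflexivity, idempotence
    while True:                                    # transitivity to fixpoint
        derived = {(x, w) for x, y in store for z, w in store
                   if y == z and x != w}
        if derived <= store:
            break
        store |= derived
    equalities = {frozenset(p) for p in store if p[::-1] in store}
    remaining = {p for p in store if frozenset(p) not in equalities}
    return remaining, equalities
```

On the goal of the example, leq(X,Y),leq(Y,Z),leq(Z,X), the closure makes every pair symmetric, so all atoms collapse into equalities, matching the final store (Y=Z,X=Z).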

3.4 Some CLP Systems The CLP family of languages improves over Prolog by allowing constraint solving on multiple domains. But while constraint solving itself is a very active research domain with numerous industrial applications, CLP languages are neglected despite their immense qualities. Lack of training of students, difficulty of writing efficient code and of integrating it into an application: the causes are multiple and sometimes difficult to pinpoint. However, high quality systems do exist. Instead of trying to be exhaustive (which is quite impossible), we choose to cite a few systems which have particular advantages or present a historical interest:


• Gnu-Prolog (http://www.gprolog.org/): originating from the clp(fd) system, Gnu-Prolog introduced the glass-box approach for finite domain constraints thanks to indexicals (Diaz and Codognet 2001). It compiles Prolog code to an optimized version of the WAM. It is a robust and efficient system. • Sicstus Prolog (http://www.sics.se/sicstus/): a commercial system which is probably the most complete Prolog compiler and one of the most efficient (Carlsson et al. 1997). It has numerous libraries that allow one to interface Prolog code with a production environment (interfaces with C, Java, ODBC, .NET, tcl/tk, HTML, XML, Zinc, etc.), a number of predefined data structures (graphs, AVL trees, sets, multisets, arrays, etc.) and an object-oriented extension. It also includes an efficient solver for finite domain constraints with original global constraints, a solver on R, on Q, and on booleans, and an integration of CHR. • Eclipse (http://eclipseclp.org/): this system comes from CHIP (Dincbas et al. 1988), the first system introducing finite domain constraints. It includes original libraries which sometimes have no equivalent elsewhere (set constraints, a linear solver interface, the Propia library to perform constraint-like inference on Prolog predicates, etc.). • Prolog IV (http://prolog-heritage.org/): no longer maintained, but the last version is available for download. This language, created by the PrologIA company and the Marseille CS Lab, has more than 120 predicates for constraints between reals, integers or lists (Benhamou et al. 1996). It has been used for teaching and for some applications for the Air Liquide company. • Other systems: there are many more systems besides the aforementioned seminal ones.
One can cite Yap-Prolog (http://www.dcc.fc.up.pt/~vsc/Yap/), which is parallelizable, Swi-Prolog (http://www.swi-prolog.org/), Jekejeke Prolog (http://www.jekejeke.ch/), written in Java and embeddable in an Android application, XSB Prolog (http://xsb.sourceforge.net/), and B-Prolog (http://www.probp.com/), which extends indexicals with rules called action rules. In particular, Swi-Prolog (Wielemaker et al. 2012) has an active development of libraries. Research on the implementation of CLP incorporates mechanisms that have proven useful for Prolog, like tabled execution (Arias and Carro 2015). An extension to semirings, in order to model different valuation systems for predicates, exists for CLP (Bistarelli et al. 2001). A few attempts have also been made to introduce constraints into Answer Set Programming (Drescher and Walsh 2010).

4 Answer Set Programming Answer set programming (ASP) (Niemelä 1999; Lifschitz 2002; Gelfond and Leone 2002; Baral 2003; Lifschitz 2008b; Eiter et al. 2009) is a declarative formalism with a logic-program syntax, default negation and a nonmonotonic semantics, which appeared in the 1990s (Gelfond and Lifschitz 1988, 1991). A special track at the ICLP conference in 2008 (de la Banda and Pontelli 2008) celebrated its 20th anniversary. Nowadays, thanks to the availability of efficient solvers, ASP appears as an effective implementation of nonmonotonic reasoning as it was theorized by Reiter in default logic (Reiter 1980). But, well beyond nonmonotonic reasoning, ASP is also a well-suited formalism for many domains of knowledge representation in Artificial Intelligence (common sense reasoning, web semantics, causal reasoning, . . . ) and for the resolution of combinatorial problems (planning, graph theory problems, configuration, bioinformatics, . . . ). This latter aspect, certainly the most promising for ASP, is based on the idea that a problem is encoded as a logic program whose models correspond to the solutions of the problem. Then, following the example of SAT (see chapter "Reasoning with Propositional Logic: from SAT Solvers to Knowledge Compilation" of this volume), an ASP solver is used to compute those models, thereby obtaining the solutions of the initial problem.

4.1 Theoretical Foundations The ASP paradigm has its roots in the stable model semantics (Gelfond and Lifschitz 1988) for normal logic programs. These are sets of rules defined as follows:

(c ← a1 , . . . , an , not b1 , . . . , not bm .)    (1)
where n ≥ 0, m ≥ 0 and c, a1 , . . . , an , b1 , . . . , bm are propositional atoms, and the symbol not denotes the default negation of Reiter's default logic (Reiter 1980). For such a rule r, head(r) = c is the head, body+(r) = {a1 , . . . , an } the positive body and body−(r) = {b1 , . . . , bm } the negative body. The intuitive meaning of such a rule is: "if the positive body is verified and if no element of the negative body is verified, then the head is verified". If n = m = 0, the rule is a fact, written simply (c ← .) or (c.). Formally, a set X is said to be closed under a definite program P (i.e. a program without default negation, cf. Sect. 2) if for every rule r in P, body+(r) ⊆ X implies head(r) ∈ X. The smallest set closed under P is denoted Cn(P) and is a Herbrand model (cf. Sect. 2). Cn(P) is the least fixpoint of the consequence operator TP : 2X −→ 2X, where X is the set of atoms of the program P and TP(A) = {b | b ← a1 , . . . , an . ∈ P and for all i = 1, . . . , n, ai ∈ A}. TP allows one to build the sequence

TP^0 = ∅,    TP^(i+1) = TP(TP^i) for all i ≥ 0,

such that ∪i≥0 TP^i = Cn(P). For a logic program P and a set of atoms X, the reduct of P by X is the following program


P^X = {head(r) ← body+(r). | r ∈ P, body−(r) ∩ X = ∅}

Since this program is definite, it has a unique least model Cn(P^X). A stable model of P is an atom set S such that S = Cn(P^S). The following examples illustrate that a normal logic program may have no, one or many stable models.

Example 1
• P1 = {a ← b., b ← a.} has one stable model, the empty set, while it has two classical models: ∅ and {a, b}.
• P2 = {a ← not b., b ← not a.} has two stable models, {a} and {b}.
• P3 = {a ← ., b ← a, not d., d ← b.} has no stable model; it is inconsistent.
• P4 = {c ← a, not b., a ← .} has one stable model {a, c}, and P4 ∪ {b ← .} has one stable model {a, b}.

Clearly, the direction of the rule is fundamental and should not be confused with the material implication of classical logic. Program P1 shows that every stable model is a classical model in which every atom is justified (or supported) by a rule. Program P2 shows that a fundamental feature of ASP is the ability to represent alternative or exclusive situations. This possibility is exploited in the representation of combinatorial problems. Program P4 shows the nonmonotonic nature of the ASP formalism: adding b to the program loses the conclusion c. From a theoretical point of view, deciding whether a normal logic program has a stable model is an NP-complete problem. Membership in this complexity class explains the interest in ASP for the resolution of many combinatorial problems, as we shall see in Sect. 4.3. Very often, problem modeling leads to writing rules with variables. For example, the famous "all birds fly but the ostriches" is represented by the rule (fly(X) ← bird(X), not ostrich(X).), which should be considered as a general schema covering the set of all concrete possible instances of this rule. This brings us back to the propositional case. It amounts to considering the universal closure of the rules and making the closed world assumption.
The first restriction is usual in logic, where a quantifier-free formula must be understood as its universal closure over its free variables. The second restriction considers only the Herbrand universe: only constant symbols and terms built over function symbols and constant symbols may be assigned to variables. The simplest case allows only constant symbols. Present-day systems allow function symbols more and more often, with various restrictions to limit the domain. The class of extended logic programs (Gelfond and Lifschitz 1991) allows strong negation in order to ease knowledge representation. For example, the rules (cross ← not train.) and (cross ← ¬train.), even though they are syntactically close, do not have the same semantics. The first one expresses that we cross if there is no information about the arrival of a train, while the second expresses that we cross if we have shown that no train will arrive. From a formal point of view, if we are not interested in so-called contradictory models, an extended logic program may be rewritten into a normal logic program as follows: each literal ¬x is replaced by a new atom nx, and a rule (false ← x, nx, not false.) is added. This last rule acts as a constraint, which is written as a rule without head (← x, nx.), since it forbids x and nx from being in the same model. Hence, the stable model semantics applies and only consistent models are produced. The class of disjunctive logic programs (Gelfond and Lifschitz 1991) allows rules of the form:

(c1 ; . . . ; cp ← a1 , . . . , an , not b1 , . . . , not bm .)    (2)

From a semantic point of view, for such a program P, we keep the same notion of reduct as in the non-disjunctive case by considering that for such a rule r we have head(r) = {c1 , . . . , cp }. We then call an answer set of P any atom set X that is a minimal model of P^X.

Example 2 P5 = {b; c ← a., a ← .} has two answer sets, {a, b} and {a, c}. {a, b, c} is not an answer set since it is not minimal.

The introduction of a minimality condition into the definition implies that deciding whether a disjunctive logic program has an answer set is Σ₂ᵖ-complete. Hence, the main practical advantage of disjunctive logic programming is not the increased expressivity but the fact that very often programs are easier and more natural to write. Of course, systems allowing disjunctions try to be as efficient as systems that forbid them, so that the main programming difficulties are avoided. Finally, the term answer set is used to name different types of stable models. Interested readers may find in Lifschitz (2008a) twelve different ways to define the notion of stable model.
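Both the Gelfond–Lifschitz construction for normal programs and the minimality condition for disjunctive ones can be checked by brute force on small examples such as P2 and P5 above (a Python sketch with assumed rule encodings, practical only for tiny programs):

```python
from itertools import combinations

def subsets(atoms):
    # All subsets of a set of atoms, by increasing size.
    atoms = sorted(atoms)
    return [set(c) for r in range(len(atoms) + 1)
            for c in combinations(atoms, r)]

def cn(definite):
    # Least model Cn of a non-disjunctive definite program [(head, body)].
    model = set()
    while True:
        step = {h for h, body in definite if body <= model}
        if step == model:
            return model
        model = step

def reduct(program, x):
    # Reduct P^X: drop every rule whose negative body meets X,
    # then erase the remaining negative bodies.
    return [(h, pos) for h, pos, neg in program if not (neg & x)]

def stable_models(program):
    # Normal programs: rules are (head_atom, positive_body, negative_body);
    # S is stable iff S = Cn(P^S).
    atoms = set().union(*({h} | p | n for h, p, n in program))
    return [x for x in subsets(atoms) if cn(reduct(program, x)) == x]

def answer_sets(program):
    # Disjunctive programs: rules are (head_atom_set, pos_body, neg_body);
    # X is an answer set iff X is a minimal model of the reduct P^X.
    def is_model(rules, x):
        return all(not (pos <= x) or (heads & x) for heads, pos in rules)
    atoms = set().union(*(h | p | n for h, p, n in program))
    return [x for x in subsets(atoms)
            if is_model(reduct(program, x), x)
            and not any(is_model(reduct(program, x), y)
                        for y in subsets(x) if y != x)]
```

On P2 the checker finds the two stable models {a} and {b}, and on P5 the two answer sets {a, b} and {a, c}, as in Examples 1 and 2.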

4.2 ASP and (More or Less) Traditional Logic

From the very beginning, logic programming has been associated with nonmonotonic formalisms (see chapter “Knowledge Representation: Modalities, Conditionals, and Nonmonotonic Reasoning” of Volume 1), which allow one to express easily rules with exceptions. Following the presentation of Ferraris et al. (2007), nonmonotonic formalisms may be classified in two categories:
1. Formalisms expressed in terms of “translation” (“translational formalisms”), such as completion or circumscription: one adds formulas to the initial formulas. For completion one adds some contrapositives, and for circumscription one adds formulas which “minimize” some predicate extensions.

98

A. Lallouet et al.

2. Formalisms expressed in terms of “fixpoint”, such as default logic (Reiter 1980) and autoepistemic logic (Moore 1985).

From the very beginning, ASP was given a fixpoint definition (Gelfond and Lifschitz 1988), and its relation with autoepistemic logic and default logic was identified very early; the definition in Sect. 4.1 is a fixpoint definition. But nowadays we know how to express ASP in terms of “translation”. This gives us definitions more in accordance with traditional logic, even if the “←” connector still plays a specific role. We give here a short description of these two types of methods.

Relations with the Logic of “Here-and-There”

An alternative definition is based on a logic first introduced by Heyting in 1930 (see Mancosu 1998 for an English translation) and brought up to date in Cabalar and Ferraris (2007): the “here-and-there” logic. This logic may be defined as a propositional modal logic over a universe with two worlds (“here” and “there”), where interpretations are pairs (X, Y) of atom sets satisfying X ⊆ Y (atoms of X are true “here” and those of Y are true “there”). A formula F of the here-and-there logic is a combination of (propositional) atoms connected by the connectors ⊥, ∨, ∧ and → (falsity, disjunction, conjunction, material implication). The connectors ¬, ⊤ and ≡ are defined (as in traditional logic) by ¬F =def F → ⊥, ⊤ =def ¬⊥ and F ≡ G =def (F → G) ∧ (G → F). (X, Y) ⊨ F denotes that an interpretation (X, Y) satisfies a formula F; one can use without ambiguity the same symbol as in classical logic since this interpretation is a pair of classical interpretations. The recursive definition of this notion of satisfaction is similar to that of classical logic except for the implication:

1. For every atom a, (X, Y) ⊨ a if a ∈ X.
2. (X, Y) ⊨ (F ∧ G) if (X, Y) ⊨ F and (X, Y) ⊨ G.
3. (X, Y) ⊨ (F ∨ G) if (X, Y) ⊨ F or (X, Y) ⊨ G.
4. (X, Y) ⊨ (F → G) if (X, Y) ⊨ F implies (with the usual meaning) (X, Y) ⊨ G, and Y ⊨ F → G (“→” being the classical implication in this last formula).

Intuitively, the atoms of X are considered to be true and those not in Y are considered to be false. We have (X, Y) ⊨ ¬F if and only if Y ⊨ ¬F. An interpretation (X, X) is called total, and we have the following equivalence with classical satisfaction: (X, X) ⊨ F if and only if X ⊨ F. We define a nested expression as a formula without implication and a nested rule as a formula (B → H) where B and H are nested expressions. Hence we have a head H and a body B as usual in logic programming, but here one may use conjunction and disjunction in both. Instead of (H ← B) as traditionally written in logic programming, we write (B → H) as in classical logic, where each “not” is replaced by “¬”. We do not allow strong negation, but this is not a real restriction. A nested rule may be translated into a set of disjunctive rules (2), called in what follows non-nested rules. For example, ((b ∨ (c ∧ ¬d)) → (a ∧ (e ∨ p))) is considered as the set of non-nested rules {(b → a), ((c ∧ ¬d) → a), (b → (e ∨ p)), ((c ∧ ¬d) → (e ∨ p))}.
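The satisfaction clauses above translate directly into a recursive checker. A small sketch (the tuple encoding of formulas is ours), which also lets one test the stated facts that (X, Y) ⊨ ¬F iff Y ⊨ ¬F, and that total interpretations collapse to classical satisfaction:

```python
from itertools import combinations

BOT = ('bot',)

def neg(f):                        # ¬F is defined as F → ⊥
    return ('imp', f, BOT)

def sat_classical(f, y):
    op = f[0]
    if op == 'bot':  return False
    if op == 'atom': return f[1] in y
    if op == 'and':  return sat_classical(f[1], y) and sat_classical(f[2], y)
    if op == 'or':   return sat_classical(f[1], y) or sat_classical(f[2], y)
    return (not sat_classical(f[1], y)) or sat_classical(f[2], y)   # imp

def sat_ht(f, x, y):               # (X, Y) |= F, with X ⊆ Y
    op = f[0]
    if op == 'bot':  return False
    if op == 'atom': return f[1] in x
    if op == 'and':  return sat_ht(f[1], x, y) and sat_ht(f[2], x, y)
    if op == 'or':   return sat_ht(f[1], x, y) or sat_ht(f[2], x, y)
    # implication: the "here" condition plus the classical "there" one
    return ((not sat_ht(f[1], x, y)) or sat_ht(f[2], x, y)) \
           and sat_classical(f, y)

def pairs(atoms):                  # all (X, Y) with X ⊆ Y ⊆ atoms
    atoms = sorted(atoms)
    for ry in range(len(atoms) + 1):
        for cy in combinations(atoms, ry):
            y = frozenset(cy)
            for rx in range(len(cy) + 1):
                for cx in combinations(cy, rx):
                    yield frozenset(cx), y
```

For F = a → b over atoms {a, b}, every pair (X, Y) satisfies ¬F exactly when Y does classically.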


Hence, the notion of answer set for nested rules is well defined via the usual notion. We will see that we can extend the notion of nested rule to allow nested implications. A non-nested rule is thus a formula

(L_{k+1} ∧ . . . ∧ L_n) → (L_1 ∨ . . . ∨ L_k)   (0 ≤ k ≤ n)   (3)

As usual, an empty conjunction (n = k) is treated as the formula ⊤ and an empty disjunction as the formula ⊥. There is a strong correspondence between ASP and here-and-there logic (Cabalar and Ferraris 2007). But one can obtain an even better correspondence with a later refinement: equilibrium logic (Pearce 1996). Equilibrium logic keeps from here-and-there logic only the stable models (called equilibrium models), which are, for a formula F, the models Y of F such that for all X ⊆ Y, (X, Y) ⊨ F if and only if X = Y. Since this definition depends only on Y, we are left with classical interpretations. Pearce has shown that an atom set Y is an answer set of a logic program with nested rules if and only if it is an equilibrium model of that program. Since equilibrium logic allows one to write propositional formulas without restriction, one can also allow nestings of formulas with implications: any propositional formula may be considered as a logic program. One can wonder, as in Cabalar and Ferraris (2007), whether this extension is useful for the programmer. But it may well be that programmers come to consider as natural, and with a clear meaning, rules such as rg1 : ((p ← q) ∨ r) and rg2 : (p ← (q ← r)). In any case we obtain a quite natural way to generalize the notion of logic program to this kind of rules. Equilibrium logic shows that rg1 and rg2 are equivalent (in the strong sense defined hereafter) respectively to the following sets of nested rules: rg1 : {(p ∨ r) ← q, (¬q ∨ r) ← ¬p}; rg2 : {p ← ¬r, p ← q, (p ∨ ¬q ∨ r) ←}. This yields a new definition of ASP, linked to an alternative logic born at the beginning of the twentieth century, and this definition naturally extends the notion of rule in logic programming.
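Equilibrium models can be enumerated directly from the definition. A self-contained sketch (encoding ours), applied to the program {a., b ∨ c ← a.} written as a single formula:

```python
from itertools import combinations

BOT = ('bot',)

def neg(f):
    return ('imp', f, BOT)

def sat(f, y):                     # classical satisfaction
    op = f[0]
    if op == 'bot':  return False
    if op == 'atom': return f[1] in y
    if op == 'and':  return sat(f[1], y) and sat(f[2], y)
    if op == 'or':   return sat(f[1], y) or sat(f[2], y)
    return (not sat(f[1], y)) or sat(f[2], y)

def ht(f, x, y):                   # here-and-there satisfaction
    op = f[0]
    if op == 'bot':  return False
    if op == 'atom': return f[1] in x
    if op == 'and':  return ht(f[1], x, y) and ht(f[2], x, y)
    if op == 'or':   return ht(f[1], x, y) or ht(f[2], x, y)
    return ((not ht(f[1], x, y)) or ht(f[2], x, y)) and sat(f, y)

def subsets(s):
    s = sorted(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def equilibrium_models(f, atoms):
    # Y is an equilibrium model iff (Y,Y) |= F and no X ⊂ Y has (X,Y) |= F
    return [y for y in subsets(atoms)
            if ht(f, y, y) and not any(ht(f, x, y) for x in subsets(y) if x < y)]

a, b, c = ('atom', 'a'), ('atom', 'b'), ('atom', 'c')
prog = ('and', a, ('imp', a, ('or', b, c)))   # { a.   b ; c <- a. }
```

Its equilibrium models are {a, b} and {a, c}, i.e. exactly the answer sets of the program; the same function also lets one compare rg2 with its non-nested translation.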
The notion of equivalence in equilibrium logic corresponds to strong equivalence between logic programs: not only do the programs have the same answer sets, but if they are extended with any arbitrary program, the resulting programs still have the same answer sets. It is the most relevant notion of equivalence between logic programs (even if Ferraris et al. 2010 mentions an even stronger one). This correspondence with equilibrium logic allows one to prove that any logic program is strongly equivalent to a non-nested program, that is to say a set (a conjunction) of rules of type (3) (clauses). Thus, one can consider that for these very general logic programs, sets of non-nested rules play the role that conjunctive normal form plays in classical logic (Cabalar and Ferraris 2007). This shows also that one can, without loss of expressiveness, restrict oneself to traditional non-nested programs (with the caveat that, if the original vocabulary is maintained, there exists no polynomial algorithm which performs this conversion in every case). Finally, Cabalar and Ferraris (2007) also define a notion


which would correspond to the analog in logic programming of a disjunctive normal form.

Relations with Circumscription

Circumscription, introduced by McCarthy more than thirty years ago (McCarthy 1980, 1986), is a rigorous logical formalization of the natural notion of “minimal model”, intended for example to replace automatically the “if” of a definition by an “if and only if”. In this sense, it extends the “completion of predicates” (Clark 1977) to arbitrary formulas. Circumscription consists in adding a set of formulas (or a second-order formula) which depends on the formulas of the given theory. It is a quite natural extension of classical logic which allows nonmonotonic reasoning: a result inferred from a data set may no longer hold when the set is extended. This kind of behavior is essential for a satisfactory translation of rules with exceptions, since it allows one to learn progressively new exceptions without challenging the already known theory. Connections between circumscription and ASP are well known: there is an abundant literature on this subject (see for example Yahya and Henschen 1985; Gelfond et al. 1986; Lifschitz 1989; Yuan and You 1993; Sakama and Inoue 1993). We limit our discussion in the following to the connections described in Ferraris et al. (2010). Since circumscription concerns predicate calculus, we handle atoms with variables. A logic program is a set of first-order formulas ϕ without free variables, since implicitly universally closed (∀x1, . . . , xn φ where the variables of φ are the xi). A more traditional program (a finite set of rules (2), without “true negation” but with first-order atoms) is translated into a first-order logic formula as follows:
1. Replace the “,” by ∧, the “;” by ∨ and the not by ¬.
2. Replace (Head ← Body) by the classical implication (Body → Head).
3. Take the conjunction of the universal closures of these formulas.
So the program {p(a) ← ., q(b) ← ., r(x) ← p(x), not q(x).} is translated into the formula (p(a) ∧ q(b) ∧ ∀x((p(x) ∧ ¬q(x)) → r(x))). Furthermore, ¬F is considered as an abbreviation of (F → ⊥). Then a constraint (← p(x), not q(x).) corresponds to the formula (∀x¬(p(x) ∧ ¬q(x))). The circumscription described here is a simplified version where all the predicates are circumscribed. We briefly recall the definitions in this case. If p = (p1, . . . , pn) and q = (q1, . . . , qn), where the pi and qi are predicate symbols (every pi having the same arity as qi), we write p ≤ q if for all i ∈ {1, . . . , n} we have ∀x(pi(x) → qi(x)), where x is the list of the variables of pi, and p < q if p ≤ q and q ≰ p. For a formula F, Circ[F] denotes the second-order formula (F ∧ ¬∃q((q < p) ∧ F(q))). Here p is the list of predicate symbols in F, q is a corresponding n-tuple of predicate variables, and F(q) denotes the formula F in which every occurrence of an atom pi(t1, . . . , tl) is replaced by qi(t1, . . . , tl). So we only keep the models of F where the predicate extensions are minimal.
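In the propositional case, circumscribing all symbols amounts to keeping only the ⊆-minimal models, which is easy to check by enumeration. A sketch with an illustrative formula of our own:

```python
from itertools import combinations

def models(f, atoms):
    """All models (frozensets of true atoms) of a formula given as a
    Python predicate over a set of true atoms."""
    atoms = sorted(atoms)
    return [frozenset(c) for r in range(len(atoms) + 1)
            for c in combinations(atoms, r) if f(frozenset(c))]

def circumscribe(f, atoms):
    # keep only the models with ⊆-minimal sets of true atoms
    ms = models(f, atoms)
    return [m for m in ms if not any(m2 < m for m2 in ms)]

# F = a ∧ ((a ∧ ¬b) → c): "a holds, and c holds unless b does"
F = lambda m: 'a' in m and (not ('a' in m and 'b' not in m) or 'c' in m)
```

Circ[F] keeps {a, b} and {a, c}, discarding the non-minimal model {a, b, c}.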


The contribution of Ferraris et al. (2010) is to show that a slight modification of this definition yields a notion of stable model: we replace F(q) (substitution of the original pi by qi) by F*(q), defined as follows:

1. pi(t1, . . . , tm)* =def qi(t1, . . . , tm); (t1 = t2)* =def (t1 = t2); ⊥* =def ⊥;
2. (F ∧ G)* =def (F* ∧ G*);
3. (F ∨ G)* =def (F* ∨ G*);
4. (F → G)* =def ((F* → G*) ∧ (F → G));
5. (∀xF)* =def (∀xF*); (∃xF)* =def (∃xF*).

The only visible difference with the definition of F(q) is (here again) the implication. Of course, there is (here again) also a difference for the negation “¬_”, since it is translated as “_ → ⊥”: in particular, if F1 is the propositional symbol (predicate symbol of arity 0) a, F2 is ¬a and F3 is ¬¬a, then we have F1*(q) = q [= F1(q)], F2*(q) = ¬q ∧ ¬a [while F2(q) = ¬q] and F3*(q) = ¬(¬q ∧ ¬a) [while F3(q) = ¬¬q ≡ F1(q)]. This gives a second way to define, in “pure logic”, the notion of stable model. Our presentation of the here-and-there method was restricted to the propositional case, but only to simplify this brief presentation. Actually, both methods allow one to define “logic programs” (with predicate symbols of any arity) in a way more general than the classical one, and give them a purely logical syntax and semantics. One benefit of these extensions, while staying in the logic programming context, is that one can incorporate the usual add-ons of classical logic programming. These add-ons are in fact essential as soon as one wants to deal with “real” problems in ASP, and they are present in every ASP system that effectively works. Encompassing all these add-ons in the theoretical foundations is an interesting advance. The other benefit is that it helps to understand the logical meaning of answer sets.
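For the propositional examples just given, the translation F ↦ F*(q) can be implemented syntactically and checked by truth tables. A sketch (encoding ours; the single symbol a is substituted by q, and we apply clauses 1–4 literally, so F3* carries an extra classical conjunct not shown in the simplified form above):

```python
from itertools import product

BOT = ('bot',)
A, Q = ('atom', 'a'), ('atom', 'q')

def neg(f):
    return ('imp', f, BOT)

def star(f):
    """F*(q) per the clauses above, for one circumscribed symbol a -> q."""
    op = f[0]
    if op == 'atom': return Q if f == A else f
    if op == 'bot':  return BOT
    if op in ('and', 'or'):
        return (op, star(f[1]), star(f[2]))
    # implication: (F -> G)* = (F* -> G*) /\ (F -> G)
    return ('and', ('imp', star(f[1]), star(f[2])), f)

def ev(f, w):                      # classical evaluation, w: atom -> 0/1
    op = f[0]
    if op == 'bot':  return False
    if op == 'atom': return bool(w[f[1]])
    if op == 'and':  return ev(f[1], w) and ev(f[2], w)
    if op == 'or':   return ev(f[1], w) or ev(f[2], w)
    return (not ev(f[1], w)) or ev(f[2], w)

worlds = [dict(zip(('a', 'q'), bits)) for bits in product((0, 1), repeat=2)]
```

One checks that (¬a)* is equivalent to ¬q ∧ ¬a, while (¬¬a)* differs from a* = q: the translation does not collapse double negation.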

4.3 Knowledge Representation and Problem Resolution in ASP

Over the years, and thanks to the availability of many solvers (see Sect. 4.4), ASP has shown its great flexibility for representing and solving many problems of artificial intelligence. Even if the borderlines are not completely drawn, we distinguish two kinds of problems: those which are essentially combinatorial and those concerned with knowledge representation for commonsense reasoning. Since we cannot quote exhaustively all the applications based on ASP, we invite the interested reader to refer to the proceedings of the conferences (Calimeri et al. 2015; Vos et al. 2015) and those of the previous years.

ASP to Solve Combinatorial Problems

Niemelä (1999) presents in detail the ability of ASP to represent and solve combinatorial problems derived from graph theory, planning (see chapter “Languages for Probabilistic Modeling over Structured and Relational Domains” of this volume),


logic puzzles, games like Sudoku (see chapter “Artificial Intelligence for Games” of this volume), etc. In particular, problems derived from bioinformatics (see chapter “Artificial Intelligence and Bioinformatics” of Volume 3) seem to be well suited to resolution by ASP, as has already been the case for some of them (Erdem et al. 2009; Gebser et al. 2010, 2011b). From a theoretical point of view, any NP problem can be polynomially encoded into a normal logic program, and any Σ₂^p problem can be polynomially encoded into a disjunctive logic program. Problems of higher complexity in the polynomial hierarchy can be polynomially encoded into a first-order ASP program (see for example Stéphan et al. 2009). But, since the resolution of a first-order ASP program by a solver requires a grounding phase leading to a propositional program (see Sect. 4.4), the result is an exponential increase of the size of the generated propositional program. From a practical point of view, representing combinatorial problems in ASP amounts to writing normal logic programs with three kinds of rules:
• some rules to describe or enumerate the data,
• some guess rules to describe the search space and generate all the possible answers, i.e. all the potential solutions to the problem,
• some check rules to describe the constraints and delete the sets that cannot be solutions.
Example 3 illustrates these kinds of rules.

Example 3 3-coloring of a graph

DATA // n vertices of the undirected graph and the edges
v(1) ← . … v(n) ← .
e(1, 4) ← . …
GUESS // a vertex is red if it is neither green nor blue
red(X) ← v(X), not green(X), not blue(X).
// a vertex is green if it is neither red nor blue
green(X) ← v(X), not red(X), not blue(X).
// a vertex is blue if it is neither red nor green
blue(X) ← v(X), not red(X), not green(X).
CHECK // two neighbours cannot have the same color
← e(X, Y), red(X), red(Y).
← e(X, Y), green(X), green(Y).
← e(X, Y), blue(X), blue(Y).
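The guess/check reading of Example 3 can be mimicked in ordinary code: generate every coloring (guess), discard those violating a constraint (check). A brute-force sketch, not an ASP solver:

```python
from itertools import product

def three_colorings(vertices, edges):
    solutions = []
    # GUESS: every total assignment of one of three colors per vertex
    for colors in product(('red', 'green', 'blue'), repeat=len(vertices)):
        assignment = dict(zip(vertices, colors))
        # CHECK: no edge may join two vertices of the same color
        if all(assignment[u] != assignment[v] for u, v in edges):
            solutions.append(assignment)
    return solutions
```

A triangle admits 3! = 6 colorings, while the complete graph K4 admits none.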

Here again we use constraints; hence no atom set which satisfies the body of a constraint will be accepted as an answer set of the program. ASP is characterized by a declarative programming style: one writes rules which generate the search space and adds constraints which delete the sets that do not respect the constraints of the initial problem. Depending on the solver, one can use some extensions or coding facilities (most of them introduced in Niemelä et al. (1999)), among which:


• arithmetic expressions (X + Y < Z, Z = Y mod 2, . . .);
• conditional literals ({p(X) : q(X)} represents the enumeration {. . . , p(ai), . . . } of the atoms p(ai) whose ai also satisfies q(ai));
• cardinality constraints (Kmin {. . . , li, . . . } Kmax represents the sets containing between Kmin and Kmax literals among the li), a simplified version being the choice rule;
• extensions with weighted literals (Pmin {. . . , li = pi, . . . } Pmax; in this case the sum of the weights pi of the literals li in the answer set must be between Pmin and Pmax);
• aggregates (count, min, max, sum, . . . ) to compute a numerical value from a set.

A credal network CN = ⟨G, K⟩ is a probabilistic graphical model where:
• G = ⟨V, E⟩ is a directed acyclic graph (DAG) encoding conditional independence relationships, where V = {A1, A2, . . . , An} is the set of variables of interest (Di denotes the domain of variable Ai) and E is the set of edges of G;
• K = {K1, K2, . . . , Kn} is a collection of local credal sets, each Ki being associated with the variable Ai in the context of its parents pa(Ai).
Such credal networks are called separately specified credal networks, as the only constraints on probabilities are those specified in local tables for each variable in the context of its parents. Note that in practice, in local tables, one can either specify a set of

Belief Graphical Models for Uncertainty Representation and Reasoning

Fig. 7 Example of an interval-based credal network over four variables A, B, F and S. The DAG has edges B → A, F → A and F → S; the local interval tables are:

p(B):      p(B=T) ∈ [.05, .15]   p(B=F) ∈ [.85, .95]
p(F):      p(F=T) ∈ [.01, .05]   p(F=F) ∈ [.95, .99]
p(A|B,F):  B=T, F=T: A=T ∈ [.95, 1],   A=F ∈ [0, .05]
           B=T, F=F: A=T ∈ [.85, .95], A=F ∈ [.05, .15]
           B=F, F=T: A=T ∈ [.8, .9],   A=F ∈ [.1, .2]
           B=F, F=F: A=T ∈ [.95, .95], A=F ∈ [.05, .05]
p(S|F):    F=T: S=T ∈ [.7, .9],  S=F ∈ [.1, .3]
           F=F: S=T ∈ [.2, .4],  S=F ∈ [.6, .8]
extreme points characterizing the credal set, as in the JavaBayes software, or directly local interval-based probability distributions, as shown in the example of Fig. 7. A credal network CN is often seen as a set of Bayesian networks BNs, each encoding a joint probability distribution. In this case, each BN has exactly the same structure as the CN (hence they encode the same conditional independence relations). Regarding the parameters, for each variable Ai and ∀ai ∈ Di, pBN(ai | pa(ai)) ∈ Ki(ai | pa(ai)). Reasoning with CNs amounts to answering the same queries as in Bayesian networks. In CNs, one can for instance compute posterior probabilities given an evidence. For MPE and MAP queries, different criteria may be used to characterize the optimal instantiations of the query variables given an evidence (Antonucci and Campos 2011). For instance, in credal network classifiers, a class is selected if it is not dominated by any other class (Zaffalon 2002). Without surprise, inference in credal networks is harder than in Bayesian networks, since inference in CNs considers sets of probability measures (Mauá et al. 2014).
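Viewing a CN as a set of BNs suggests a naive inference scheme for interval tables: a marginal such as p(A = T) is multilinear in the table entries, so its bounds over an interval-specified credal set are attained at interval endpoints. A sketch on a hypothetical two-node net B → A (the numbers are invented for illustration, not taken from Fig. 7):

```python
from itertools import product

def marginal_bounds(p_b, p_a_given_b, p_a_given_not_b):
    """Bounds on p(A=T) for the net B -> A; each argument is an
    interval (lo, hi).  Enumerating endpoint combinations is exact
    here because the objective is multilinear in the parameters."""
    values = [pb * pat + (1 - pb) * paf
              for pb, pat, paf in product(p_b, p_a_given_b, p_a_given_not_b)]
    return min(values), max(values)
```

With p(B=T) ∈ [.05, .15], p(A=T|B=T) ∈ [.9, 1] and p(A=T|B=F) ∈ [.1, .2], the bounds on p(A=T) come out as approximately [0.14, 0.32].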

6.4 Markov Networks

Markov networks (Pearl 1988a; Lauritzen 1996), also known as Markov Random Fields (MRFs) or simply undirected graphical models, are undirected probabilistic graphical models widely used in applications like computer vision (Wang et al. 2013). Undirected graphs can encode dependency relationships that are somehow undirected or symmetrical, making them useful in particular for modeling problems where the probabilistic interactions among the variables have no natural direction.

(JavaBayes: http://www.cs.cmu.edu/~javabayes/Home/)

Moreover, Markov networks can

230

S. Benferhat et al.

encode some independence statements that DAG structures fail to encode, like the famous misconception problem (Koller and Friedman 2009). At the representation level, Markov networks depart from Bayesian networks by the use of undirected links in the graph and the use of potential functions (or factors) associated with maximal cliques (subsets of variables) instead of local CPTs associated with individual variables. A potential function θc associated with a clique c can be any non-negative function on the domain of c (the Cartesian product of the domains of the variables involved in c). Formally,

Definition 7 (Markov network) A Markov network MN = ⟨G, Θ⟩ is specified by:
(i) A graphical component G consisting of an undirected graph where vertices represent the variables and edges represent direct dependence relationships between variables. Intuitively, any variable Ai is independent of any other variable Aj given all of Ai's immediate neighbors. Generally, the graph G is represented as a clique tree to allow parametrization.
(ii) A numerical component Θ allowing to weight the uncertainty relative to each clique ci ∈ C using local potential functions.

A clique is a fully connected subset of nodes in the graph, and it is used to factorize the joint probability distribution over the set of variables as a product of potential functions associated with cliques. The joint probability distribution encoded by a MN is factored as follows:

p(a1 . . . an) = (1/Z) ∏_{c∈C} θc(c[a1 . . . an]),   (11)

where Z is a normalization constant, θc denotes the potential of clique c, and θc(c[a1 . . . an]) is the potential of the configuration of the variables involved in c. Inference in Markov networks can be performed by algorithms based on clique trees, such as the junction tree algorithm (Lauritzen and Spiegelhalter 1990). Note finally that there exist probabilistic graphical models mixing both directed and undirected edges, called chain graphs (Lauritzen and Wermuth 1989). The next section provides an overview of two belief graphical models based on alternative uncertainty theories: possibility theory and ranking functions.
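Equation (11) is easy to exercise on a toy chain A – B – C with two cliques and invented potentials:

```python
from itertools import product

# potentials of the two maximal cliques {A, B} and {B, C}; values invented
theta_ab = {(a, b): 2.0 if a == b else 1.0 for a in (0, 1) for b in (0, 1)}
theta_bc = {(b, c): 3.0 if b == c else 1.0 for b in (0, 1) for c in (0, 1)}

def joint_distribution():
    # product of clique potentials, then global normalization by Z
    unnorm = {(a, b, c): theta_ab[(a, b)] * theta_bc[(b, c)]
              for a, b, c in product((0, 1), repeat=3)}
    z = sum(unnorm.values())
    return {world: v / z for world, v in unnorm.items()}
```

Here Z = 24 and, for instance, p(0, 0, 0) = (2 · 3)/24 = 0.25.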

7 Non Probabilistic Belief Graphical Models

To overcome the limitations of classical probability theory, many alternative uncertainty frameworks have been developed, essentially since the sixties. Such theories, often generalizing probability theory, allow one to model and reason with different forms of uncertain information, such as qualitative information, imprecise knowledge and so on. However, as in the probabilistic case, in order to use such settings in real


world applications, many issues have to be solved, such as the compactness of the representation, the ease of elicitation from an expert, learning from empirical data, the computational efficiency of the reasoning tasks, etc.

7.1 Possibilistic Graphical Models

Like Bayesian networks, which compactly encode joint probability distributions, possibilistic networks (Fonck 1997; Gebhardt and Kruse 1996) aim to compactly encode joint possibility distributions. The latter are an alternative uncertainty representation particularly suited to handling incomplete or qualitative information.

7.1.1

Possibility Theory

Possibility theory (Zadeh 1999; Dubois and Prade 1988; Giles 1982) is a well-known uncertainty theory. It is based on the concept of a possibility distribution π which associates every state ω ∈ Ω with a degree in the interval [0, 1], expressing partial knowledge over the world. The degree π(ω) represents the degree of compatibility (or consistency) of the interpretation ω with the available knowledge. By convention, π(ω) = 1 means that ω is fully consistent with the available knowledge, while π(ω) = 0 means that ω is impossible. π(ω) > π(ω′) simply means that ω is more compatible than ω′. As in probabilistic models, independence relations are fundamental as they allow one to factorize joint possibility distributions. Such relations are also heavily exploited by inference algorithms to efficiently answer queries. The concepts of event and variable independence are closely related to that of possibilistic conditioning. There are different views of the possibilistic scale [0, 1] used to assess the uncertainty; different interpretations result in different conjunction operators used to perform the conditioning task (e.g. product, min, Łukasiewicz t-norm). Two major definitions of possibilistic conditioning are however used in the literature. The first one, called product-based conditioning (also known as the possibilistic Dempster rule of conditioning (Shafer 1976)), stems from a quantitative view of the possibilistic scale. This semantics views a possibility distribution as a special plausibility function in the context of Dempster–Shafer theory; more precisely, a possibility distribution π corresponds to a consonant (nested) plausibility function. Hence, the underlying conditioning meets Dempster's rule of conditioning and is formally defined as follows (it is assumed that Π(φ) > 0):

π(w |p φ) = π(w)/Π(φ) if w ∈ φ;  0 otherwise.   (12)


In the qualitative setting, the possibilistic scale is ordinal and only the relative order of events matters. Accordingly, a min-based conditioning operator is proposed in (Dubois and Prade 1990):

π(w |m φ) = 1 if π(w) = Π(φ) and w ∈ φ;  π(w) if π(w) < Π(φ) and w ∈ φ;  0 otherwise.   (13)

While there are many similarities between the quantitative possibilistic and the probabilistic frameworks, the qualitative one is significantly different. The main definitions of the concept of independence in a possibilistic setting are:

• No-interactivity: this concept, proposed in (Zadeh 1975), can be stated as follows:

Definition 8 (No-interactivity) Let X, Y and Z be three disjoint sets of variables with domains DX, DY and DZ respectively. X is said to not interact with Y conditionally on Z, denoted X ⊥ Y | Z, iff ∀xi ∈ DX, yj ∈ DY, zk ∈ DZ, Π(X = xi, Y = yj | Z = zk) = min(Π(X = xi | Z = zk), Π(Y = yj | Z = zk)).

• Conditional independence: proposed in (Fonck 1997), this definition of independence can be stated as follows:

Definition 9 (Conditional independence) Let X, Y and Z be three disjoint sets of variables with domains DX, DY and DZ respectively. X is said to be independent of Y conditionally on Z iff ∀xi ∈ DX, yj ∈ DY, zk ∈ DZ, Π(X = xi | Y = yj, Z = zk) = Π(X = xi | Z = zk) and Π(Y = yj | X = xi, Z = zk) = Π(Y = yj | Z = zk).
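Equations (12) and (13) can be compared directly on a small possibility distribution (values invented):

```python
def Pi(pi, event):
    # possibility of an event: max over its worlds
    return max(pi[w] for w in event)

def cond_product(pi, event):
    """Product-based conditioning, Eq. (12); assumes Pi(event) > 0."""
    p = Pi(pi, event)
    return {w: pi[w] / p if w in event else 0.0 for w in pi}

def cond_min(pi, event):
    """Min-based conditioning, Eq. (13)."""
    p = Pi(pi, event)
    return {w: (1.0 if pi[w] == p else pi[w]) if w in event else 0.0
            for w in pi}

pi = {'w1': 1.0, 'w2': 0.6, 'w3': 0.3}
```

Conditioning on {w2, w3}: the product rule rescales every degree (w3 gets 0.5), while the min rule only promotes the best world to 1 and leaves w3 at 0.3.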

Note that in Definition 9, the statement Π(X = xi | Y = yj, Z = zk) = Π(X = xi | Z = zk) does not imply Π(Y = yj | X = xi, Z = zk) = Π(Y = yj | Z = zk) in a min-based possibilistic setting. The conditional independence relations of Definition 9 are graphoids (Fonck 1997). Note also that the conditional independence relations of Definition 9 are stronger than the no-interactivity relations of Definition 8: conditional independence implies no-interactivity, but the converse is not guaranteed.

7.1.2

Possibilistic Networks

A possibilistic network PN = ⟨G, Θ⟩ is specified by:
(i) A graphical component G consisting of a directed acyclic graph (DAG) where vertices represent the variables and edges encode conditional independence relationships between variables.
(ii) A numerical component Θ allowing to weight the uncertainty relative to each variable using local possibility tables. The possibilistic component consists in

Fig. 8 Example of a possibilistic network over four Boolean variables. The DAG has edges A → C, A → D and B → D; the local possibility tables are:

π(A):      A=T: 1    A=F: .4
π(B):      B=T: .1   B=F: 1
π(C|A):    A=T: C=T .3, C=F 1
           A=F: C=T .2, C=F 1
π(D|A,B):  A=T, B=T: D=T .4, D=F 1
           A=F, B=T: D=T .2, D=F 1
           A=T, B=F: D=T 1,  D=F 1
           A=F, B=F: D=T 1,  D=F .1
a set of local possibility tables θi = π(Ai | pa(Ai)) for each variable Ai in the context of its parents pa(Ai) in the network PN. Note that all the local possibility distributions θi must be normalized, namely ∀i = 1 . . . n, for each parent context pa(ai), max_{ai∈Di} π(ai | pa(ai)) = 1.

Example 3 Figure 8 gives an example of a possibilistic network over four Boolean variables A, B, C and D.

In the possibilistic setting, the joint possibility distribution is factorized using the following possibilistic counterpart of the chain rule:

π(a1, a2, . . . , an) = ⊗_{i=1}^{n} π(ai | pa(ai)),   (14)

where ⊗ denotes the product or the min operator, depending on the quantitative or qualitative interpretation of the possibilistic scale (Dubois and Prade 1988). Most works dealing with inference in PNs are more or less direct adaptations of probabilistic network inference algorithms. For instance, algorithms like variable elimination, message passing, or the junction tree algorithm have been directly adapted to PNs. In (Benferhat et al. 2002), PNs are encoded in the form of possibilistic logic bases (the two representations are semantically equivalent and encode the same possibility distribution) and inference can be achieved using possibilistic logic inference rules and mechanisms. PNs can also be seen as approximate models of some imprecise probabilistic models: in (Benferhat et al. 2015b), an approach based on probability-possibility transformations is proposed to perform approximate MAP inference in credal networks, where MAP inference is very hard (Mauá et al. 2014). Like probabilistic graphical models, possibilistic ones either model the subjective knowledge of an agent (for example, the authors in (Dubois et al. 2017) use possibilistic networks to encode expert knowledge for a human geography problem), or represent knowledge learnt from empirical data, or a combination of both. Learning PNs from data amounts to deriving the structure and the local possibility tables of each variable from a dataset. Learning PNs makes


sense within quantitative interpretations of possibility distributions and is suitable especially for learning with imprecise data, scarce datasets, or datasets with missing values (Tabia 2016). As for Bayesian networks, two main approaches are used for possibilistic network structure learning:
(i) Constraint-based methods: the principle is to detect conditional independence relations I by performing a set of tests on the training dataset, then to find a DAG that satisfies I seen as a set of constraints. A constraint-based possibilistic network structure learning algorithm called POSSCAUSE is proposed in Sangesa et al. (1998). This algorithm is based on a similarity measure between possibility distributions to check conditional independences. The main disadvantage of constraint-based methods is that the search space is very large even for a small number of variables.
(ii) Score-based methods: they rely on heuristics that start with a completely disconnected (or completely connected) DAG. At each iteration, the heuristic adds (or removes) an arc and evaluates the quality of the new DAGs with respect to the training dataset; the best DAG at each iteration is selected using a score function. The key issues of score-based methods are the scoring functions and the heuristics used to search the DAG space. For the heuristics, one can reuse those defined for Bayesian networks (e.g. the K2 algorithm, simulated annealing, etc.). The score functions, however, are assumed to assess how well a given structure captures the independence relations in the training sample; examples of possibility-theory-based scoring functions are possibilistic network non-specificity (Borgelt and Kruse 2003) and specificity gain (Sangesa et al. 1998).
Parameter learning is needed to fill the local tables once the structure is learnt from data or elicited by an expert.
For possibilistic networks, parameter learning from data basically consists in deriving local conditional possibility distributions from data. There are two main approaches (Haddad et al. 2015):
(i) Transformation-based approaches: first learn probability distributions from data, then transform them into possibilistic ones using probability-possibility transformations (Benferhat et al. 2015a).
(ii) Possibilistic approaches: such approaches stem from quantitative interpretations of possibility distributions; for instance, a possibility distribution may be viewed as the contour function of a consonant belief function (Shafer 1976).
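The min-based chain rule (14) is easy to replay on the network of Fig. 8 (the local tables are transcribed from the figure):

```python
from itertools import product

T, F = True, False
pi_A = {T: 1.0, F: 0.4}
pi_B = {T: 0.1, F: 1.0}
pi_C = {(T, T): 0.3, (F, T): 1.0, (T, F): 0.2, (F, F): 1.0}  # key: (C, A)
pi_D = {(T, T, T): 0.4, (F, T, T): 1.0, (T, F, T): 0.2, (F, F, T): 1.0,
        (T, T, F): 1.0, (F, T, F): 1.0, (T, F, F): 1.0,
        (F, F, F): 0.1}                                      # key: (D, A, B)

def joint(a, b, c, d):
    # qualitative chain rule: minimum of the local degrees
    return min(pi_A[a], pi_B[b], pi_C[(c, a)], pi_D[(d, a, b)])
```

The joint is normalized (the world A=T, B=F, C=F, D=T gets degree 1), and for example π(A=F, B=T, C=T, D=F) = min(.4, .1, .2, 1) = .1.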

7.2 Kappa Networks

Kappa networks, also known as OCF-based networks, are belief graphical models based on ranking functions, also called ordinal conditional functions (OCFs) (Spohn 1988).

7.2.1

Ranking Functions

Ranking functions provide an ordinal setting that has been successfully used for modeling the revision of agents' beliefs (Darwiche and Pearl 1996). OCFs are very useful for representing uncertainty, and several works point out their relevance for representing agents' beliefs and for defining belief change operators that update the current beliefs in the light of new information (Ma and Liu 2008). OCF-based networks (Halpern 2001) are graphical models expressing beliefs using OCF ranking functions. The graphical component allows an easy and compact representation of the influence or independence relationships existing between the domain variables, while OCFs allow an easy quantification of belief strengths. OCF-based networks are less demanding than probabilistic networks (where exact probability degrees are needed): in OCF-based networks, belief strengths, called degrees of surprise, may be regarded as order-of-magnitude probability estimates, which makes the elicitation of agents' beliefs easier. An OCF (also called a ranking or kappa function), denoted κ, is a mapping from the universe of discourse Ω to a set of ordinals (here, assumed to be the set of non-negative integers extended with ∞). κ(wi) is called a disbelief degree (or degree of surprise). By convention, κ(wi) = 0 means that wi is not surprising and corresponds to a normal state of affairs, while κ(wi) = ∞ denotes an implausible event. The relation κ(wi) < κ(wj) means that wi is more plausible than wj. The function κ is normalized if there exists at least one interpretation w ∈ Ω such that κ(w) = 0. The disbelief degree κ(φ) of an arbitrary event φ ⊆ Ω is defined as follows:

κ(φ) = min{κ(wi) : wi ∈ φ}. (15)
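These definitions are straightforward to operationalize. The following sketch (the universe and the kappa values are invented for illustration) computes the disbelief degree of an event, Eq. (15), and the conditioning operation defined next, Eq. (16):

```python
INF = float("inf")

# A kappa function over a small universe of worlds (illustrative values).
kappa = {"w1": 0, "w2": 1, "w3": 2, "w4": INF}

def k_event(kappa, phi):
    """Disbelief degree of an event phi (a set of worlds), Eq. (15)."""
    return min(kappa[w] for w in phi)

def k_cond(kappa, phi):
    """Conditioning on phi, Eq. (16): shift the worlds of phi so that the
    most plausible one gets degree 0; worlds outside phi become implausible."""
    k_phi = k_event(kappa, phi)
    assert k_phi != INF, "cannot condition on an implausible event"
    return {w: (k - k_phi if w in phi else INF) for w, k in kappa.items()}

phi = {"w2", "w3"}
print(k_event(kappa, phi))  # 1
print(k_cond(kappa, phi))   # w2 -> 0, w3 -> 1, w1 and w4 -> inf
```

Conditioning renormalizes the restriction of κ to φ, so the resulting function always has a world of degree 0.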

Conditioning is defined in this setting as follows (it is assumed that κ(φ) ≠ ∞):

κ(wi | φ) = κ(wi) − κ(φ) if wi ∈ φ; ∞ otherwise. (16)

7.2.2 OCF-Based Networks

A Kappa network shares the same graphical component as a Bayesian network and differs only in the use of local conditional OCFs instead of conditional probability tables. Namely, the numerical component of a Kappa network Θ = {κ(Ai | pa(Ai)), i = 1 ... n} consists of a set of local kappa functions, one for each node Ai in the context of its parents pa(Ai), as shown in the following example.

Example 1 Figure 9 shows a Kappa network over four Boolean variables A, M, N and P. The joint kappa function over the set of variables A1, ..., An encoded by a Kappa network is factorized as follows:

Fig. 9 Example of a Kappa network: a DAG in which N is the parent of A and P, and A and P are the parents of M, together with the local kappa tables:

κ(N):      N=F: 6    N=T: 0
κ(A|N):    N=F: A=F 0, A=T 0     N=T: A=F 10, A=T 0
κ(P|N):    N=F: P=F 0, P=T 20    N=T: P=F 10, P=T 0
κ(M|A,P):  A=F,P=F: M=F 0, M=T 30    A=F,P=T: M=F 0, M=T 3
           A=T,P=F: M=F 0, M=T 0     A=T,P=T: M=F 0, M=T 100

κ(a1 . . . an) = κ(a1 | pa(a1)) + · · · + κ(an | pa(an)).
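The additive chain rule (the kappa analogue of the product rule for Bayesian networks, consistent with the subtractive conditioning of Eq. (16)) makes the joint degree of a complete assignment easy to compute. The following sketch uses the two-node fragment N → A, with values as read here from Fig. 9:

```python
# Chain rule for Kappa networks: the joint disbelief degree of a complete
# assignment adds up the local conditional degrees, just as a Bayesian
# network multiplies local conditional probabilities.
k_N = {False: 6, True: 0}                            # kappa(N)
k_A_given_N = {(False, False): 0, (True, False): 0,  # kappa(A | N),
               (False, True): 10, (True, True): 0}   # keyed by (A, N)

def joint(n, a):
    """kappa(n, a) = kappa(n) + kappa(a | n)."""
    return k_N[n] + k_A_given_N[(a, n)]

# The most normal worlds get degree 0; surprising ones a positive degree.
print(joint(True, True))    # 0 + 0  = 0
print(joint(False, True))   # 6 + 0  = 6
print(joint(True, False))   # 0 + 10 = 10
```

Note how the joint degree stays an order-of-magnitude quantity: the most plausible complete assignments receive degree 0, and less plausible ones receive the sum of the surprises accumulated along the parents.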

Many issues still have to be addressed for OCF-networks. For instance, parametrizing an OCF-network has recently been studied in (Eichhorn and Kern-Isberner 2015), and (Eichhorn et al. 2016) studies the relationships between OCF-networks and CP-networks (graphical models of conditional preferences). As mentioned earlier, belief graphical models have been studied in most uncertainty frameworks in order to provide compact representations and efficient analysis and reasoning tools. In the context of evidence theory, evidential networks (Simon et al. 2008) are graphical models based on Dempster-Shafer theory. Belief graphical models have also been studied in the framework of Valuation-Based Systems (VBS for short) (Shenoy 1992, 1993b). VBS are designed to represent and reason with uncertain information in expert systems. They can capture several uncertainty settings, including propositional calculus, probability theory, evidence theory, ranking functions and possibility theory. In the VBS setting, the main concepts used to encode uncertain information are those of variables and valuations, where each valuation encodes the knowledge about a subset of variables. The graphical representation of a VBS is a valuation network (VN). A VN does not rely on a DAG structure; it is based on algebraic properties of the marginalization, conditioning and merging operations for propagating the information associated with the graph valuations. The graphical component of a VN consists of vertices corresponding to variables, nodes corresponding to valuations, edges representing domains of valuations or tails of domains of conditionals, and arcs denoting the heads of domains of conditionals. A VN provides a decomposition of a joint valuation; the latter is obtained by combining the local valuations.


8 Applications

8.1 Main Application Domains

Belief graphical models have been widely adopted and used in various fields. Many common tasks encountered in real-world applications can be addressed by belief graphical models; examples of such tasks are classification, annotation, diagnosis and troubleshooting, sensitivity analysis, explanation, planning, forecasting, control and decision making, to name a few. Belief graphical models are successfully used in computer vision (Wang et al. 2013), fraud detection and computer security (Ye et al. 2004; An et al. 2006), risk analysis (Weber et al. 2012), diagnosis and assistance in medical decision making (Long 1989), forensic analysis (Biedermann and Taroni 2012), information retrieval (de Cristo et al. 2003), detection of military targets (Antonucci et al. 2009), bioinformatics (Mourad et al. 2011), pattern recognition (Zaarour et al. 2004), spam detection (as in the SpamAssassin system), detection of computer intrusions, etc. The reasons for this success are manifold. In particular, belief graphical models are suitable for knowledge representation and for reasoning and decision-making tasks during the operational phase of a system. The modular and intuitive nature of graphical models makes them efficient tools for representing uncertain and complex knowledge. Moreover, the ease of modeling with such models, the possibility to learn them automatically from data, and the effectiveness of inference are some of the most important benefits they provide. Among the first applications based on probabilistic graphical models, operational for several years now, is the VISTA project (Horvitz and Barry 1995) of the US space agency NASA, which selects, from the thousands of information pieces available in real time, only those that could be relevant to display on the consoles of the different operators.
In the field of automatic navigation of submarines, the Lockheed Martin UUV (Martin 1996) is an intelligent system for controlling an autonomous underwater vehicle, developed by Hugin for Lockheed. In the field of consumer software, the Microsoft Lumiere project, initiated in 1993, aims to anticipate the needs and problems of software users (Clippy, the Microsoft Office assistant, is the most popular product of this project). In the medical field, the PathFinder/Intellipath system (Heckerman et al. 1992) is a Bayesian expert system that assists in identifying anomalies in samples of lymph tissue. In recent years, there has been a growing use of graphical models in computer vision (denoising, segmentation, pose estimation, tracking, etc.), automatic speech recognition, human-machine interaction, finance and risk management, bioinformatics, environmental modeling and management, medical applications, etc. For instance, Bayesian networks are used in many diagnostic systems. Typically, in the medical area, the model is built by medical experts and is basically used to perform inferences regarding the potential causes/diseases/hypotheses corresponding to the observed symptoms. In other domains, such as mechanical or electrical systems, Bayesian network-based diagnosis systems are also built by experts and used for troubleshooting. In bioinformatics, BN graphs are learnt from data

S. Benferhat et al.

and they are regarded as knowledge extraction tools. Several publications and books present applications of belief graphical models and case studies on many real-world problems. For example, in (Pourret et al. 2008) the reader can find practical cases in areas such as diagnosis and assistance in medical decision making, forensics, etc. We give below some examples of applications of these models in the field of computer security.

8.2 Applications in Computer Security

In computer security (which refers to the detection and prevention of any action that could affect the availability, integrity or confidentiality of information and services), several problems have been modeled using belief graphical models and solutions have been implemented. One of the first projects that used a Bayesian network in intrusion detection (Kumar and Spafford 1994) proposed to model the dependencies between several anomaly measures on various aspects of the activity of a computer system (such as the number of running processes, number of connections, CPU time, etc.). eBayes (Valdes and Skinner 2000), one of the components of the anomaly-based intrusion detection system EMERALD (Porras and Neumann 1997), uses a naive Bayesian network. In eBayes, the root node represents the class of TCP sessions while the attributes (such as the number of different IP addresses, number of unique ports, etc.) describe these sessions. During the detection phase, the attributes of the session to be analyzed are extracted and used by the Bayesian classifier to determine the most probable class for this session among the classes Normal and Abnormal, corresponding to normal and abnormal sessions respectively. Among the systems that use a graphical model to associate an anomaly score with an audit event, the best-known example is SPADE (Staniford et al. 2002), a plugin developed by Silicon Defense. SPADE is part of SPICE, which contains a second module for alert correlation. Installed on the intrusion detection system Snort,4 it can detect some anomalies due to port scans by analyzing the headers of TCP SYN packets and incoming UDP packets.
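The classification step of a system like eBayes can be sketched with a tiny naive Bayes classifier. The attribute names and all the probability values below are invented for illustration, not taken from eBayes itself:

```python
import math

# Hand-set parameters of a naive Bayes session classifier (illustrative):
# P(class) and P(attribute value | class).
priors = {"Normal": 0.95, "Abnormal": 0.05}
likelihoods = {
    ("ports", "high"): {"Normal": 0.01, "Abnormal": 0.60},
    ("ports", "low"):  {"Normal": 0.99, "Abnormal": 0.40},
    ("ips", "many"):   {"Normal": 0.05, "Abnormal": 0.50},
    ("ips", "few"):    {"Normal": 0.95, "Abnormal": 0.50},
}

def classify(session):
    """Most probable class: argmax_c log P(c) + sum_i log P(a_i | c)."""
    scores = {}
    for c, prior in priors.items():
        scores[c] = math.log(prior) + sum(
            math.log(likelihoods[attr][c]) for attr in session)
    return max(scores, key=scores.get)

print(classify([("ports", "high"), ("ips", "many")]))  # Abnormal
print(classify([("ports", "low"), ("ips", "few")]))    # Normal
```

Working in log space avoids numerical underflow when sessions are described by many attributes, which is why intrusion detection classifiers typically sum log-likelihoods rather than multiply probabilities.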

8.3 Software Platforms for Modeling and Reasoning with Probabilistic Graphical Models

Regarding platforms and software tools, there are several products. One of the key actors in the field of platforms and applications for probabilistic graphical models, Hugin5 is probably in the lead. This software editor and consultancy develops general platforms and solutions in many fields such as medicine, finance, industry, etc. The other platform having

4 www.snort.org.
5 http://www.hugin.com/.


imposed its name over the last two decades is Netica, from the Norsys6 company. Netica offers a complete platform for modeling and reasoning with Bayesian networks and influence diagrams. It also offers several libraries and programming interfaces for using graphical models from other applications. Analytica7 is another platform offering the same kind of solutions. Other platforms specialize in certain types of applications and tasks, like AgenaRisk8, which offers solutions for risk analysis. OpenMarkov9 is a software tool for modeling and reasoning with Bayesian networks, influence diagrams and factored Markov models. There are also toolkits for some environments, such as the BN toolbox10 for Matlab and JavaBayes11 for Java. We may also mention other toolkits for Bayesian networks such as MensXMachina,12 Causal Explorer13 and PMTK.14

9 Conclusion

Belief graphical models are compact and powerful tools for representing and reasoning with complex and uncertain information. They comprise a set of principled and well-established formalisms for learning, modeling and reasoning under uncertainty. For modeling, they offer the advantage of being intuitive and modular, and come in several variants suitable for modeling different types of dependencies (conditional, causal, sequential, etc.). For inference, they are effective and fit multiple tasks such as classification, diagnosis, explanation and planning (see chapter “Planning in Artificial Intelligence” of this Volume for the use of dynamic Bayesian networks in planning). Belief graphical models can be built by an expert or learnt automatically from data. Building a graphical model with an expert is made easy by the fact that the elicitation process first performs a qualitative step which deals only with the variables of interest and their relationships. Secondly, the expert quantifies the relationships locally (for each variable in the context of its parents), which greatly facilitates the modeling and elicitation work. A graphical model can be interpreted by an expert, in particular for validation purposes, and can be used to support communication between multiple experts. In addition, several uncertainty frameworks can be used for the quantitative component and for inference on the built model. In the presence of

6 http://www.norsys.com/. 7 http://www.lumina.com/. 8 http://www.agenarisk.com/. 9 http://www.openmarkov.org/. 10 http://code.google.com/p/bnt/. 11 http://www.cs.cmu.edu/~javabayes/Home/. 12 http://www.mensxmachina.org/software/pgm-toolbox/. 13 http://www.dsl-lab.org/causal_explorer/index.html. 14 https://github.com/probml/pmtk3.


empirical data for the problem to be modeled, there are several learning techniques that can automatically build a model from these data. Since the seminal works on probabilistic expert systems, the literature on graphical models has been abundant, but several issues are still the topic of intense work in several artificial intelligence communities. Indeed, belief graphical models often appear among the main topics of the most prestigious conferences in AI, and several issues of scientific journals have been dedicated to them. The best indicator of the maturity of these formalisms and of their interest is undoubtedly their use in many sensitive applications, ranging from computer security to medical and military applications.

References

Akaike H (1970) Statistical predictor identification. Ann Inst Stat Math 22:203–217 An X, Jutla D, Cercone N (2006) Privacy intrusion detection using dynamic Bayesian networks. In: ICEC 2006: proceedings of the 8th international conference on electronic commerce. ACM, New York, pp 208–215. https://doi.org/10.1145/1151454.1151493 Antonucci A, Campos CPd (2011) Decision making by credal nets. In: Proceedings of the 2011 third international conference on intelligent human-machine systems and cybernetics IHMSC ’11, vol 01. IEEE Computer Society, Washington, pp 201–204. https://doi.org/10.1109/IHMSC. 2011.55 Antonucci A, Brühlmann R, Piatti A, Zaffalon M (2009) Credal networks for military identification problems. Int J Approx Reason 50(4):666–679. https://doi.org/10.1016/j.ijar.2009.01.005, http:// www.sciencedirect.com/science/article/pii/S0888613X09000206 Arnborg S, Corneil DG, Proskurowski A (1987) Complexity of finding embeddings in a k-tree. SIAM J Algebraic Discret Methods 8(2):277–284 Auliac C, d’Alché-Buc F, Frouin V (2007) Learning transcriptional regulatory networks with evolutionary algorithms enhanced with niching. In: Masulli F, Mitra S, Pasi G (eds) Applications of fuzzy sets theory, vol 4578. Lecture notes in computer science. Springer, Berlin, pp 612–619 Auvray V, Wehenkel L (2002) On the construction of the inclusion boundary neighbourhood for markov equivalence classes of Bayesian network structures. In: Darwiche A, Friedman N (eds) Proceedings of the 18th conference on uncertainty in artificial intelligence (UAI-02). Morgan Kaufmann Publishers, pp 26–35 Bart A, Koriche F, Lagniez J, Marquis P (2016) An improved CNF encoding scheme for probabilistic inference. In: ECAI 2016 - 22nd European conference on artificial intelligence, 29 Aug–2 Sept 2016, The Hague, The Netherlands - Including prestigious applications of artificial intelligence (PAIS 2016), pp 613–621.
https://doi.org/10.3233/978-1-61499-672-9-613 Ben Amor N, Benferhat S (2005) Graphoid properties of qualitative possibilistic independence. Int J Uncertain Fuzziness Knowledge-Based 13:59–96 Ben Yaghlane B, Mellouli K (2008) Inference in directed evidential networks based on the transferable belief model. Int J Approx Reason 48:399–418 Benferhat S, Smaoui S (2007) Hybrid possibilistic networks. Int J Approx Reason 44(3):224–243 Benferhat S, Dubois D, Garcia L, Prade H (2002) On the transformation between possibilistic logic bases and possibilistic causal networks. Int J Approx Reason 29(2):135– 173. https://doi.org/10.1016/S0888-613X(01)00061-5, http://www.sciencedirect.com/science/ article/pii/S0888613X01000615 Benferhat S, Levray A, Tabia K (2015a) On the analysis of probability-possibility transformations: changing operations and graphical models. In: ECSQARU 2015, Compiegne, France, 15–17 July Benferhat S, Levray A, Tabia K (2015b) Probability-possibility transformations: application to credal networks. In: Scalable uncertainty management - 9th international conference, SUM 2015,


Québec City, QC, Canada, 16–18 Sept 2015. Proceedings, pp 203–219. https://doi.org/10.1007/ 978-3-319-23540-0_14 Biedermann A, Taroni F (2012) Bayesian networks for evaluating forensic DNA profiling evidence: A review and guide to literature. Forensic Sci Int: Genet 6(2):147–157. https://doi.org/10.1016/ j.fsigen.2011.06.009, http://www.sciencedirect.com/science/article/pii/S1872497311001359 Borgelt C, Kruse R (2003) Learning possibilistic graphical models from data. IEEE Trans Fuzzy Syst 11(2):159–172 Bouckaert RR (1993) Probabilistic network construction using the minimum description length principle. Lecture Notes Comput Sci 747:41–48. http://citeseer.nj.nec.com/bouckaert93probabilistic. html Buntine W (1991) Theory refinement on Bayesian networks. In: D’Ambrosio B, Smets P, Bonissone P (eds) Proceedings of the 7th conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers, San Mateo, pp 52–60 Chavira M, Darwiche A (2005) Compiling Bayesian networks with local structure. In: Proceedings of the 19th international joint conference on artificial intelligence (IJCAI), pp 1306–1312 Chavira M, Darwiche A, Jaeger M (2006) Compiling relational Bayesian networks for exact inference. Int J Approx Reason 42(1–2):4–20. https://doi.org/10.1016/j.ijar.2005.10.001 Chickering D (1995) A transformational characterization of equivalent Bayesian network structures. In: Besnard P, Hanks S (eds) Proceedings of the 11th conference on uncertainty in artificial intelligence (UAI’95). Morgan Kaufmann Publishers, San Francisco, pp 87–98 Chickering DM (2002) Optimal structure identification with greedy search. J Mach Learn Res 3:507–554 Chickering D, Heckerman D (1996) Efficient Approximation for the Marginal Likelihood of Incomplete Data given a Bayesian Network. In: UAI’96. Morgan Kaufmann, pp 158–168 Chickering D, Geiger D, Heckerman D (1994) Learning Bayesian networks is NP-hard. 
Technical Report MSR-TR-94-17, Microsoft Research Technical Report Chickering D, Geiger D, Heckerman D (1995) Learning Bayesian networks: search methods and experimental results. In: Proceedings of fifth conference on artificial intelligence and statistics, pp 112–128 Chow C, Liu C (1968) Approximating discrete probability distributions with dependence trees. IEEE Trans Inf Theory 14(3):462–467 Cooper GF (1990) Computational complexity of probabilistic inference using Bayesian belief networks. Artif Intell 42:393–405 Cooper G, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks from data. Mach Learn 9:309–347 Cozman FG (2000) Credal networks. Artif Intell 120(2):199–233. https://doi.org/10.1016/S00043702(00)00029-1, http://www.sciencedirect.com/science/article/pii/S0004370200000291 Daly R, Shen Q, Aitken S (2011) Learning Bayesian networks: approaches and issues. The Knowl Eng Rev 26:99–157 Darwiche A (2009) Modeling and reasoning with Bayesian networks, 1st edn. Cambridge University Press, New York Darwiche A, Pearl J (1996) On the logic of iterated belief revision. Artif Intell 89:1–29 de Cristo MAP, Calado PP, de Lourdes da Silveira M, Silva I, Muntz R, Ribeiro-Neto B, (2003) Bayesian belief networks for ir. Int J Approx Reason 34(2):163–179. https://doi.org/10.1016/j. ijar.2003.07.006, http://www.sciencedirect.com/science/article/pii/S0888613X03000902 de Campos CP (2011) New complexity results for map in Bayesian networks. In: IJCAI 2011, proceedings of the 22nd international joint conference on artificial intelligence, Barcelona, Catalonia, Spain, pp 2100–2106 Delaplace A, Brouard T, Cardot H (2007) Two evolutionary methods for learning Bayesian network structures. In: Wang Y, Cheung Ym, Liu H (eds) Computational intelligence and security. Lecture notes in computer science, vol 4456. Springer, Berlin, pp 288–297 Dubois D, Prade H (1988) Possibility theory: an approach to computerized processing of uncertainty. 
Plenum Press, New York


Dubois D, Prade H (1990) The logical view of conditioning and its application to possibility and evidence theories. Int J Approx Reason 4(1):23–46. https://doi.org/10.1016/0888-613X(90)90007O Dubois D, Fusco G, Prade H, Tettamanzi AG (2017) Uncertain logical gates in possibilistic networks: theory and application to human geography. Int J Approx Reason 82:101– 118. https://doi.org/10.1016/j.ijar.2016.11.009, http://www.sciencedirect.com/science/article/ pii/S0888613X1630233X Eichhorn C, Kern-Isberner G (2015) Using inductive reasoning for completing ocf-networks. J Appl Logic 13(4):605–627. https://doi.org/10.1016/j.jal.2015.03.006 Eichhorn C, Fey M, Kern-Isberner G (2016) Cp- and ocf-networks - a comparison. Fuzzy Sets Syst 298(C):109–127. https://doi.org/10.1016/j.fss.2016.04.006 Fiot C, Saptawati GAP, Laurent A, Teisseire M (2008) Learning Bayesian network structure from incomplete data without any assumption. In: Proceedings of the 13th international conference on database systems for advanced applications, DASFAA’08. Springer, Berlin, pp 408–423. http:// dl.acm.org/citation.cfm?id=1802514.1802554 Fonck P (1994) Réseaux d’inférence pour le raisonnement possibiliste. PhD thesis, Université de Liège, Faculté des Sciences Fonck P (1997) A comparative study of possibilistic conditional independence and lack of interaction. Int J Approx Reason 16:149–171 Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Mach Learn 29(2– 3):131–163 Gebhardt J, Kruse R (1996) Learning possibilistic networks from data. In: Proceedings of 5th international workshop on artificial intelligence and statistics, Fort Lauderdale, pp 233–244 Geiger D, Verma T, Pearl J (1989) d-separation: From theorems to algorithms. In: Proceedings of the fifth conference on uncertainty in artificial intelligence (UAI’89). Elsevier Science Publishing Company Inc., New York, pp 139–148 Geiger D, Verma TS, Pearl J (1990) Identifying independence in Bayesian networks. 
Networks 20:507–534 Greiner R, Su X, Shen B, Zhou W (2002) Structural extension to logistic regression: discriminative parameter learning of belief net classifiers. In: Proceedings of the eighteenth annual national conference on artificial intelligence, AAAI-02, pp 167–173 Giles R (1982) Foundation for a possibility theory. In: Fuzzy information and decision processes, pp 83–195 Grossman D, Domingos P (2004) Learning Bayesian network classifiers by maximizing conditional likelihood. In: ICML2004. ACM Press, pp 361–368 Haddad M, Leray P, Amor NB (2015) Learning possibilistic networks from data: a survey. In: 2015 conference of the international fuzzy systems association and the european society for fuzzy logic and technology (IFSA-EUSFLAT-15), Gijón, Spain, 30 June 2015 Halpern JY (2001) Conditional plausibility measures and Bayesian networks. J Artif Int Res 14(1):359–389 Heckerman D (1998) A tutorial on learning with Bayesian network. In: Jordan MI (ed) Learning in graphical models. Kluwer Academic Publishers, Boston Heckerman DE, Horvitz EJ, Nathwani BN (1992) Toward normative expert systems: Part i. the pathfinder project. Methods Inf Med 31(2):90–105 Heckerman D, Geiger D, Chickering M (1994) Learning Bayesian networks: the combination of knowledge and statistical data. In: de Mantaras RL, Poole D (eds) Proceedings of the 10th conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers, San Francisco, pp 293–301 Henrion M (1986) Propagating uncertainty in Bayesian networks by probabilistic logic sampling. Uncertainty in artificial intelligence 2 annual conference on uncertainty in artificial intelligence (UAI-86). Elsevier Science, Amsterdam, pp 149–163


Horvitz E, Barry M (1995) Display of information for time-critical decision making. In: In Proceedings of the eleventh conference on uncertainty in artificial intelligence. Morgan Kaufmann, pp 296–305 Howard RA, Matheson JE (1984) Influence diagrams. Princ Appl Decis Anal 2:720–761 Jensen FV (1996) Introduction to Bayesien networks. UCL Press, University college, London Jordan MI, Ghahramani Z, Jaakkola TS, Saul LK (1999) An introduction to variational methods for graphical models. Mach Learn 37(2):183–233. https://doi.org/10.1023/A:1007665907178 Keogh E, Pazzani M (1999) Learning augmented Bayesian classifiers: a comparison of distributionbased and classification-based approaches. In: Proceedings of the seventh international workshop on artificial intelligence and statistics, pp 225–230 Kimmig A, Van den Broeck G, De Raedt L (2016) Algebraic model counting. Int J Appl Logic. http://web.cs.ucla.edu/~guyvdb/papers/KimmigJAL16.pdf Koivisto M (2006) Advances in exact Bayesian structure discovery in Bayesian networks. In: Proceedings of the 22nd conference on uncertainty in artificial intelligence (UAI 2006), pp 241–248 Koivisto M, Sood K (2004) Exact Bayesian structure discovery in Bayesian networks. J Mach Learn 5:549–573 Koller D, Friedman N (2009) Probabilistic graphical models - principles and techniques. MIT Press. http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11886 Kumar S, Spafford EH (1994) An application of pattern matching in intrusion detection. Technical Report CSD–TR–94–013, Department of Computer Sciences, Purdue University, West Lafayette Larrañaga P, Poza Y, Yurramendi Y, Murga R, Kuijpers C (1996) Structure learning of Bayesian networks by genetic algorithms: a performance analysis of control parameters. IEEE Trans Pattern Anal Mach Intell 18(9):912–926 Lauritzen SL (1996) Graphical models. Oxford statistical science series. Clarendon Press, Oxford. 
http://opac.inria.fr/record=b1079282 (autre tirage : 1998) Lauritzen SL, Spiegelhalter DJ (1990) Local computations with probabilities on graphical structures and their application to expert systems. Readings in uncertain reasoning. Morgan Kaufmann Publishers Inc, San Francisco, pp 415–448. http://dl.acm.org/citation.cfm?id=84628.85343 Lauritzen SL, Spiegelhalter DJ (1988) Local computations with probabilities on graphical structures and their application to expert systems. J R Stat Soc 50:157–224 Lauritzen SL, Wermuth N (1989) Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann Stat 17(1):31–57. https://doi.org/10.1214/aos/ 1176347003 Levi I (1980) The enterprise of knowledge: an essay on knowledge, credal probability, and chance/Isaac Levi. MIT Press, Cambridge Long W (1989) Medical diagnosis using a probabilistic causal network. Appl Artif Intell 3:367– 383. https://doi.org/10.1080/08839518908949932, http://portal.acm.org/citation.cfm?id=68613. 68627 Ma J, Liu W (2008) A general model for epistemic state revision using plausibility measures. In: 2008 conference on ECAI, pp 356–360 Malone BM, Yuan C, Hansen EA, Bridges S (2011) Improving the scalability of optimal Bayesian network learning with external-memory frontier breadth-first branch and bound search. In: Cozman FG, Pfeffer A (eds) UAI 2011, proceedings of the twenty-seventh conference on uncertainty in artificial intelligence, Barcelona, Spain, 14–17 July 2011. AUAI Press, pp 479–488 Martin L (1996) Autonomous control logic to guide unmanned underwater vehicle. Technical Report, Lockheed Martin Mauá D, de Campos CP, Benavoli A, Antonucci A (2014) Probabilistic inference in credal networks: New complexity results. J Artif Intell Res (JAIR) 50:603–637 Meek C (1995) Causal inference and causal explanation with background knowledge. 
In: Proceedings of 11th conference on uncertainty in artificial intelligence, pp 403–418 Mourad R, Sinoquet C, Leray P (2011) A hierarchical Bayesian network approach for linkage disequilibrium modeling and data-dimensionality reduction prior to genome-wide association studies. BMC Bioinf 12:16


Murphy KP (2002) Dynamic Bayesian networks: representation, inference and learning. PhD thesis, aAI3082340 Murphy KP, Weiss Y, Jordan MI (1999) Loopy belief propagation for approximate inference: an empirical study. In: Proceedings of the fifteenth conference on uncertainty in artificial intelligence, UAI’99. Morgan Kaufmann Publishers Inc., San Francisco, pp 467–475. http://dl.acm. org/citation.cfm?id=2073796.2073849 Muruzabal J, Cotta C (2007) A study on the evolution of Bayesian network graph structures. Advances in probabilistic graphical models, vol 214. Studies in fuzziness and soft computing. Springer, Berlin, pp 193–213 Parviainen P, Koivisto M (2009) Exact structure discovery in Bayesian networks with less space. In: Bilmes J, Ng AY (eds) UAI 2009, Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, Montreal, QC, Canada, 18–21 June 2009. AUAI Press, pp 436–443 Pearl J (1982) Reverend Bayes on inference engines: a distributed hierarchical approach. In: Proceedings of the American association of artificial intelligence national conference on AI, Pittsburgh, PA, pp 133–136 Pearl J (1986) Fusion, propagation, and structuring in belief networks. Artif Intell 29(3):241–288. https://doi.org/10.1016/0004-3702(86)90072-X Pearl J (1988a) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmman, San Francisco (California) Pearl J (1988b) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers Inc., San Francisco Pearl J (2000) Causality: models, reasoning, and inference. Cambridge University Press, New York Pearl J, Verma TS (1991) A theory of inferred causation. In: Allen JF, Fikes R, Sandewall E (eds) Proceeding of the second international conference on knowledge representation and reasoning (KR’91). 
Morgan Kaufmann, San Mateo, California, pp 441–452 Peña JM, Nilsson R, Björkegren J, Tegnér J (2007) Towards scalable and data efficient learning of markov boundaries. Int J Approx Reason 45(2):211–232 Pernkopf F, Bilmes J (2005) Discriminative versus generative parameter and structure learning of Bayesian network classifiers. In: Proceedings of the 22nd international conference on machine learning, ICML ’05. ACM, New York, pp 657–664. https://doi.org/10.1145/1102351.1102434 Porras PA, Neumann PG (1997) EMERALD: Event monitoring enabling responses to anomalous live disturbances. In: Proceedings of the 20th national information systems security conference. NIST, National Institute of Standards and Technology/National Computer Security Center, Baltimore, Maryland, USA, pp 353–365 Pourret O, Naim P, Marcot B (2008) Bayesian networks: a practical guide to applications. Wiley, New York Raiffa H (1968) Decision analysis. Addison-Welsley Publishing Company, Toronto Ramoni M, Sebastiani P (1998) Parameter estimation in Bayesian networks from incomplete databases. Intell Data Anal 2(2):139–160. http://dl.acm.org/citation.cfm?id=2639323.2639329 Robinson RW (1977) Counting unlabeled acyclic digraphs. In: Little CHC (ed) Combinatorial mathematics V, vol 622. Lecture notes in mathematics. Springer, Berlin, pp 28–43 Rodrigues De Morais S, Aussem A (2008) A novel scalable and data efficient feature subset selection algorithm. In: Proceedings of the European conference on machine learning and knowledge discovery in databases - part II, ECML PKDD ’08. Springer, Berlin, pp 298–312 Sangesa R, Cabs J, Corts U (1998) Possibilistic conditional independence: a similarity-based measure and its application to causal network learning. Int J Approx Reason 18(1):145–167. https:// doi.org/10.1016/S0888-613X(98)00012-7 Schwartz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464 Shachter RD (1986) Evaluating influence diagrams. 
Oper Res 34:871–882 Shachter RD, Bhattacharjya D (2010) Solving Influence diagrams: exact algorithms. Wiley Inc. https://doi.org/10.1002/9780470400531.eorms0808 Shafer G (1976) A mathematical theory of evidence. Princeton University Press, Princeton Shenoy P (1989) A valuation-based language for expert systems. Int J Approx Reason 3(5):341–383


Shenoy PP (1992) Valuation-based systems: a framework for managing uncertainty in expert systems. Fuzzy logic for the management of uncertainty. Wiley, New York, pp 83–104. http://dl. acm.org/citation.cfm?id=133602.133611 Shenoy P (1993a) Valuation networks and conditional independence. In: UAI, pp 191–199 Shenoy PP (1993b) Valuation networks and conditional independence. In: Heckerman D, Mamdani A (eds) Uncertainty in artificial intelligence, vol 93. Morgan Kaufmann, San Mateo, pp 191–199 Simon C, Weber P, Evsukoff A (2008) Bayesian networks inference algorithm to implement dempster shafer theory in reliability analysis. Reliab Eng Syst Safety 93(7):950– 963. https://doi.org/10.1016/j.ress.2007.03.012, http://www.sciencedirect.com/science/article/ pii/S0951832007001068 (Bayesian Networks in Dependability) Spirtes P, Glymour C, Scheines R (1993) Causation, prediction, and search. Springer, Berlin Spirtes R, Glymour C, Scheines R (2000) Causation, prediction, and search. MIT Press, Cambridge Spohn W (1988) Ordinal conditional functions: a dynamic theory of epistemic states. In: Causation in decision, belief change, and statistics, vol II. Kluwer Academic Publishers, pp 105–134 Staniford S, Hoagland JA, McAlerney JM (2002) Practical automated detection of stealthy portscans. J Comput Secur 10(1–2):105–136 Tabia K (2016) Possibilistic graphical models for uncertainty modeling. In: Proceedings of Scalable uncertainty management - 10th international conference, SUM 2016, Nice, France, 21–23 September 2016, pp 33–48. https://doi.org/10.1007/978-3-319-45856-4_3 Tsamardinos I, Aliferis CF, Statnikov A (2003) Time and sample efficient discovery of markov blankets and direct causal relations. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’03. ACM, New York, pp 673–678 Tsamardinos I, Brown L, Aliferis C (2006) The max-min hill-climbing Bayesian network structure learning algorithm. 
Mach Learn 65(1):31–78 Valdes A, Skinner K (2000) Adaptive, model-based monitoring for cyber attack detection. In: Recent advances in intrusion detection, pp 80–92 Vlasselaer J, Meert W, Van den Broeck G, De Raedt L (2016) Exploiting local and repeated structure in dynamic Bayesian networks. Artif Intell 232(C):43–53. https://doi.org/10.1016/j.artint.2015. 12.001 Walley P (2000) Towards a unified theory of imprecise probability. Int J Approx Reason 24(23):125– 148 Wang T, Yang J (2010) A heuristic method for learning Bayesian networks using discrete particle swarm optimization. Knowl Info Syst 24:269–281 Wang C, Komodakis N, Paragios N (2013) Markov random field modeling, inference & learning in computer vision & image understanding: a survey. Comput Vis Image Underst 117(11):1610– 1627. https://doi.org/10.1016/j.cviu.2013.07.004 Weber P, Medina-Oliva G, Simon C, Iung B (2012) Overview on Bayesian networks applications for dependability, risk analysis and maintenance areas. Eng Appl Artif Intell 25(4):671–682, https://doi.org/10.1016/j.engappai.2010.06.002, http://www.sciencedirect.com/ science/article/pii/S095219761000117X( special section: dependable system modelling and analysis) Xu H, Smets P (1994) Evidential reasoning with conditional belief functions. In: et al DH (ed) UAI’94, pp 598–606 Ye D, Huiqiang W, Yonggang P (2004) A hidden Markov models-based anomaly intrusion detection method. In: 2004 WCICA 2004 fifth world congress on 5 intelligent control and automation. http:// ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1342334 Zaarour I, Heutte L, Leray P, Labiche J, Eter B, Mellier D (2004) Clustering and Bayesian network approaches for discovering handwriting strategies of primary school children. Int J Pattern Recognit Artif Intell 18(7):1233–1251 Zadeh LA (1975) The concept of a linguistic variable and its application to approximate reasoning. Inf Sci 9:43–80 Zadeh LA (1999) Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst 100:9–34

246

S. Benferhat et al.

Zaffalon M (2002) The naive credal classifier. J Stat Plann Inference 105(1):5– 21. https://doi.org/10.1016/S0378-3758(01)00201-4, http://www.sciencedirect.com/science/ article/pii/S0378375801002014 (imprecise probability models and their applications) Zhang N, Poole D (1994) A simple approach to Bayesian network computations. In: Proceedings of the tenth Canadian conference on artificial intelligence

Languages for Probabilistic Modeling Over Structured and Relational Domains

Fabio Gagliardi Cozman

Abstract. In this chapter we survey languages that specify probability distributions using graphs, predicates, quantifiers, fixed-point operators, recursion, and other logical and programming constructs. Many of these languages have roots both in probabilistic logic and in the desire to enhance Bayesian networks and Markov random fields. We examine their origins and comment on various proposals up to recent developments in probabilistic programming.

1 Introduction

Diversity is a mark of research in artificial intelligence (AI). From its inception, the field has exercised freedom in pursuing formalisms to handle monotonic, nonmonotonic, uncertain, and fuzzy inferences. Some of these formalisms have seen cycles of approval and disapproval; for instance, probability theory was taken, in 1969, to be "epistemologically inadequate" by leading figures in AI (McCarthy and Hayes 1969). At that time there was skepticism about combinations of probability and logic, even though such a combination had been under study for more than a century.

A turning point in the debate on the adequacy of probability theory for AI was Judea Pearl's development of Bayesian networks (Pearl 1988). From there many other models surfaced, based on the notion that we must be able to specify probability distributions in a modular fashion through independence relations (Sadeghi and Lauritzen 2014). In spite of their flexibility, Bayesian networks are "propositional" in the sense that random variables are not parameterized, and one cannot quantify over them. For instance, if you have 1000 individuals in a city, and for each of them you are interested in three random variables (say education, income, and age), then you must explicitly specify 3000 random variables and their independence relations. There have been many efforts to extend graphical models so as to allow one to encode repetitive patterns, using logical variables, quantification, recursion, loops, and the like. There are specification languages based on database schemas, on first-order logic, on logic programming, on functional programming, even on procedural programming. Often these languages employ techniques from seminal probabilistic logics.

The purpose of this chapter is to review some of these languages, starting with a brief review of probabilistic logic concepts, and then moving to relational variants of Bayesian networks and to probabilistic programming. In Sect. 2 we present some foundational results from probabilistic logic, so as to fix useful terminology concerning syntax and semantics. In Sect. 3 we look at several formalisms that mix, using graphs, Bayesian networks and relational modeling. Section 4 is devoted to probabilistic logic programming; in Sect. 5 we go through languages inspired by various logics, and in Sect. 6 we examine Markov random fields. In Sect. 7 we consider the emerging field of probabilistic programming. Finally, in Sect. 8 we very briefly mention some inference and learning techniques.

F. G. Cozman, University of São Paulo, São Paulo, Brazil. e-mail: [email protected]
© Springer Nature Switzerland AG 2020. P. Marquis et al. (eds.), A Guided Tour of Artificial Intelligence Research, https://doi.org/10.1007/978-3-030-06167-8_9

2 Probabilistic Logics: The Laws of Thought?

Boole's book on The Laws of Thought is aptly sub-titled "on which are founded the mathematical theories of logic and probabilities" (Boole 1958). Starting from that effort, many other thinkers examined the combination of first-order logic and probabilities (Gaifman 1964; Gaifman and Snir 1982; Hoover 1978; Keisler 1985; Scott and Krauss 1966). A mix of logical and probabilistic reasoning was also central to de Finetti's concept of probability (Coletti and Scozzafava 2002; de Finetti 1964). Nilsson (1986) later rediscovered some of these ideas in the context of AI, in particular emphasizing linear programming methods that had been touched upon before (Bruno and Gilio 1980; Hailperin 1976). At the beginning of the nineties, Nilsson's probabilistic logic and its extensions were rather popular amongst AI researchers (Hansen and Jaumard 1996); in particular, probabilistic first-order logic received sustained attention (Bacchus 1990; Fagin et al. 1990; Halpern 2003). That feverish work perhaps convinced some that the laws of thought had indeed been nailed down.

We now review some concepts used in probabilistic logics, as they are relevant to the remainder of this survey. Propositional logic consists of formulas containing propositions A1, A2, ..., and the Boolean operators ∧ (conjunction), ∨ (disjunction), ¬ (negation), → (implication) and ↔ (equivalence); further details are discussed in chapter "Reasoning with Propositional Logic - From SAT Solvers to Knowledge Compilation" of this Volume. First-order logic consists of formulas containing predicates and functions, plus the Boolean operators, logical variables and existential/universal quantifiers (further details are discussed in chapter "Automated Deduction" of this Volume). Each predicate r, and each function f, is associated with a nonnegative integer, its arity. A predicate of arity zero is treated as a proposition. A function of arity zero is a constant. A term is either a logical variable or a constant. A predicate of arity k, followed by k terms (usually in parentheses), is an atom. For instance, if baker is a predicate of arity 1, then both baker(x) and baker(John) are atoms.


The syntax of a minimal propositional probabilistic logic is rather simple: we can have any propositional formula ϕ, and moreover any probabilistic assessment P(φ) = α, where φ is a propositional formula and α is a number in [0, 1]. For instance, both A1 ∧ ¬A2 and P(¬A3 ∨ A4 ∨ ¬A5) = 1/2 are well-formed formulas. Conditional probabilistic assessments are often allowed; that is, P(φ|θ) = α, where φ and θ are propositional formulas.

Example 1. Suppose A1 means "Tweety is a penguin", A2 means "Tweety is a bird", and A3 means "Tweety flies". Then

A1 → A2,   A1 → ¬A3,   P(A3|A2) = 0.95

is a set of well-formed formulas.
The usual semantics associated with this propositional syntax is as follows. Suppose we have propositions A1, A2, ..., An. There are 2^n truth assignments to these propositions (each proposition assigned true or false). To simplify the terminology, we refer to a truth assignment as an interpretation. Propositional probabilistic logic focuses on probability measures over interpretations, where the sample space Ω is the set of 2^n interpretations. Recall that such a measure P can be specified by associating each element of Ω with a nonnegative number in [0, 1], guaranteeing that these numbers add up to one (as discussed in chapter "Representations of Uncertainty in Artificial Intelligence: Probability and Possibility" of Volume 1). A probabilistic assessment P(φ) = α, for some α ∈ [0, 1], is interpreted as a constraint; namely, the constraint that the probability of the set of interpretations satisfying φ is exactly α. A conditional probability assessment P(A1|A2) = α is usually read as the constraint P(A1 ∧ A2) = αP(A2).

Example 2. Consider a (rather simplified) version of Boole's challenge problem (Hailperin 1996): we have P(A1) = α1 and P(A2) = α2, and moreover P(A3|A1) = β1 and P(A3|A2) = β2; finally we know that A3 → (A1 ∨ A2). What are the possible values for P(A3)?
There are three propositions A1, A2, A3; an interpretation can be encoded as a triple (a1 a2 a3) where ai = 1 when Ai is true, and ai = 0 otherwise. There are 8 interpretations that might be possible, but interpretation (001) is impossible because A3 → (A1 ∨ A2) holds. Each interpretation is to have a probability; we denote by pj the probability of the jth interpretation, where we order lexicographically the triple (a1 a2 a3) as if it were a binary number. Thus the constraint P(A1) = α1 means p4 + p5 + p6 + p7 = α1, and the constraint P(A3|A2) = β2 means p3 + p7 = β2(p2 + p3 + p6 + p7). By minimizing/maximizing P(A3) = p3 + p5 + p7, subject to these constraints and pj ≥ 0 for all j and Σj pj = 1, we obtain P(A3) ∈ [L, U], where

L = max(α1β1, α2β2),
U = min(α1β1 + α2β2, α1β1 + (1 − α1), α2β2 + (1 − α2)),

provided 0 ≤ L ≤ U ≤ 1 (otherwise the whole problem is inconsistent).
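These closed-form bounds can be checked numerically. The following Python sketch (helper names are ours, not from the chapter) samples random probability distributions over the seven feasible interpretations and verifies that P(A3) always lies in [L, U]:

```python
import random

def bounds(a1b1, a2b2, a1, a2):
    # Closed-form interval [L, U] for P(A3), with a1b1 = alpha1*beta1 = P(A3 & A1)
    # and a2b2 = alpha2*beta2 = P(A3 & A2), as in Example 2.
    L = max(a1b1, a2b2)
    U = min(a1b1 + a2b2, a1b1 + (1 - a1), a2b2 + (1 - a2))
    return L, U

random.seed(0)
for _ in range(1000):
    # Random distribution over interpretations (a1 a2 a3), read as binary numbers;
    # interpretation (001) is ruled out by A3 -> (A1 v A2).
    w = [random.random() for _ in range(8)]
    w[1] = 0.0
    s = sum(w)
    p = [x / s for x in w]
    a1 = p[4] + p[5] + p[6] + p[7]   # P(A1)
    a2 = p[2] + p[3] + p[6] + p[7]   # P(A2)
    a1b1 = p[5] + p[7]               # P(A3 & A1)
    a2b2 = p[3] + p[7]               # P(A3 & A2)
    p_a3 = p[3] + p[5] + p[7]        # P(A3)
    L, U = bounds(a1b1, a2b2, a1, a2)
    assert L - 1e-9 <= p_a3 <= U + 1e-9
```

A linear-programming solver would compute the same interval directly; the sampling check above only confirms that no feasible distribution escapes it.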



To use de Finetti's terminology, a set of formulas and assessments that is satisfied by at least one probability measure is said to be coherent (Coletti and Scozzafava 2002). Given a coherent set of assessments, reasoning should only inform us about the least commitment conclusions that are necessarily correct.

We can of course contemplate probabilistic assessments P(φ) = α where φ is a formula of first-order logic and α is a number in [0, 1].¹ The semantics of first-order formulas is given by a domain and an interpretation. A domain D, in this technical sense, is just a set (and should not be taken as the sort of "structured domain knowledge" alluded to in the title of this chapter). An interpretation is a mapping from each predicate of arity k to a relation in D^k, and from each function of arity k to a function from D^k to D (Enderton 1972). Each probabilistic assessment P(φ) = α, where φ is a first-order formula, is interpreted as a constraint on the probability measures over the set of interpretations for a fixed domain. If the domain is infinite, the set of interpretations is uncountable. In this overview we can bypass difficulties with uncountable sets, but still present the main points, by assuming all domains to be finite, and moreover by assuming that no functions, other than constants, are present.

Example 3. Consider predicates penguin, bird, and flies, all of arity 1, and predicate friends, of arity 2. The intended meaning is that penguin(Tweety) indicates that Tweety is a penguin, and likewise for bird, while flies(Tweety) indicates that Tweety flies, and friends(Tweety, Skipper) indicates that Tweety and Skipper are friends. Suppose the domain is D = {Tweety, Skipper, Tux}. An interpretation might assign both Tweety and Tux to penguin, only the pair (Tweety, Skipper) to friends, and so on.
Suppose P(∀x : ∃y : friends(x, y)) = 0.01. This assigns probability 0.01 to the set of interpretations where any element of the domain has a friend. Another possible assessment is ∀x : P(penguin(x)) = 0.03; note that here the quantifier is "outside" of the probability.
For a domain with N elements, we have 2^N possible interpretations for penguin, and 2^(N²) interpretations for friends; the total number of possible interpretations for the predicates is 2^(3N+N²).

We refer to the semantics just outlined as an interpretation-based semantics, because probabilities are assigned to sets of interpretations. There is also a domain-based semantics, where we assume that probability measures are defined over the domain. This may be useful to capture some common scenarios. For instance, consider: "The probability that some citizen is a computer scientist is α". We might wish to interpret this through a probability measure over the set of citizens (the domain); the constraint is that the set of computer scientists gets probability α. Domain-based semantics, and even combinations of interpretation- and domain-based semantics, have been investigated for a while (Bacchus 1990; Fagin et al. 1990; Hoover 1978; Keisler 1985); however, interpretation-based semantics are more commonly adopted.

First-order probabilistic logic has high representational power, but very high complexity (Abadi and Halpern 1994). Another drawback is inferential vacuity: given a set of formulas and assessments, usually all that can be said about the probability of some other formula is that it lies in a large interval, say between 0.01 and 0.99. This happens because there are exponentially many interpretations, and a few assessments do not impose enough constraints on probability values. Finally, there is yet another problem: it has been observed that first-order logic itself is not sufficient to express recursive concepts or default assumptions, and tools for knowledge representation often resort to special constructs (Baader and Nutt 2002; Baral 2003). Hence it does not seem that the machinery of first-order probabilistic logic, however elegant, can exhaust the laws of thought, after all.

¹ We might be even more general by introducing "probabilistic quantifiers", say by writing P≥α φ to mean P(φ) ≥ α. We could then nest P≥α φ within other formulas (Halpern 2003). We avoid this generality here.
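The count 2^(3N+N²) of Example 3 can be verified by brute-force enumeration over a small domain. A Python sketch (helper names are ours), assuming the three unary predicates and the binary predicate friends:

```python
from itertools import product

def ground_atoms(domain):
    # All ground atoms for the unary predicates penguin, bird, flies
    # and the binary predicate friends over the given domain.
    unary = [(p, (a,)) for p in ("penguin", "bird", "flies") for a in domain]
    binary = [("friends", (a, b)) for a in domain for b in domain]
    return unary + binary

def count_interpretations(domain):
    # An interpretation is one truth assignment to every ground atom.
    return sum(1 for _ in product((False, True), repeat=len(ground_atoms(domain))))

n = 2                                  # keep the enumeration small
domain = ["Tweety", "Skipper"]
assert count_interpretations(domain) == 2 ** (3 * n + n * n)   # 2^10 = 1024
```

The exponential growth of this set is exactly what makes inferential vacuity a concern: each individual assessment constrains only a thin slice of it.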

3 Bayesian Networks and Their Diagrammatic Relational Extensions

Bayesian networks offer a pleasant way to visualize independence relations, and an efficient way to encode a probability distribution; as such, they have been widely applied within AI and in many other fields (Darwiche 2009; Koller and Friedman 2009; Pourret et al. 2008). To fix terminology, here is a definition (see also chapter "Belief Graphical Models for Uncertainty Representation and Reasoning" of this Volume). A Bayesian network consists of a pair: there is a directed acyclic graph G, where each node is a random variable Xi, and a probability distribution P over the random variables, so that P satisfies the Markov condition with respect to G: each Xi is independent of its nondescendants (in G) given its parents (in G) (Pearl 1988). Even though we can have discrete and continuous random variables in a Bayesian network, in this survey we simplify matters by focusing on binary random variables.

When we have a finite set of binary random variables, the Markov condition implies a factorization of the joint probability distribution; for any configuration {X1 = x1, ..., Xn = xn},

P(X1 = x1, ..., Xn = xn) = ∏_{i=1}^{n} P(Xi = xi | pa(Xi) = πi),

where pa(Xi) denotes the parents of Xi, and πi is the projection of {X1 = x1, ..., Xn = xn} on pa(Xi). Often each P(Xi = xi | pa(Xi) = πi) is described by a local conditional probability table.

[Figure 1: JohnIsDedicated and CourseIsHard are root nodes, both parents of JohnFailsCourse. Tables: P(JohnIsDedicated = 1) = 0.6; P(CourseIsHard = 1) = 0.4; p(1|a, b) = 0.5, 0.8, 0.1, 0.4 for (a, b) = (0, 0), (0, 1), (1, 0), (1, 1).]

Fig. 1 The Bayesian network for the "propositional" version of the University World, where p(1|a, b) denotes P(JohnFailsCourse = 1 | JohnIsDedicated = a, CourseIsHard = b)
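The factorization can be exercised directly on the numbers of Fig. 1. A Python sketch (variable names are ours) computes the joint distribution from the three local tables and marginalizes out the parents:

```python
from itertools import product

# CPT entries from Fig. 1: D = JohnIsDedicated, H = CourseIsHard, F = JohnFailsCourse.
p_D = {1: 0.6, 0: 0.4}
p_H = {1: 0.4, 0: 0.6}
p_F1 = {(0, 0): 0.5, (0, 1): 0.8, (1, 0): 0.1, (1, 1): 0.4}  # P(F=1 | D=a, H=b)

def joint(d, h, f):
    # Markov-condition factorization: P(D, H, F) = P(D) P(H) P(F | D, H).
    pf = p_F1[(d, h)]
    return p_D[d] * p_H[h] * (pf if f == 1 else 1.0 - pf)

# Sanity check: the eight joint entries add up to one.
assert abs(sum(joint(d, h, f) for d, h, f in product((0, 1), repeat=3)) - 1.0) < 1e-12

# Marginal probability that John fails the course, summing out D and H.
p_fail = sum(joint(d, h, 1) for d, h in product((0, 1), repeat=2))   # about 0.38
```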

Consider, as an example, a simplified version of the ubiquitous "University World" (Getoor et al. 2007). We have random variables JohnIsDedicated, CourseIsHard, and JohnFailsCourse, each one with values 0 and 1. Figure 1 depicts a Bayesian network, where JohnIsDedicated and CourseIsHard are independent, and where both directly affect JohnFailsCourse.

Bayesian networks do encode structured domain knowledge through their independence relations. However, domain knowledge may come with much more structured patterns. For example, a University World usually has many students and many courses, and a very repetitive structure. We might say: for any pair (x, y), where x is a student and y is a course, the probability that x fails given that she is dedicated and the course is easy is 0.1. Figure 2 depicts a Bayesian network with various students and courses, where probabilities are obtained by repeating the conditional probability tables in Fig. 1.

The question then arises as to how we should specify such "very structured" Bayesian networks. It makes sense to import some tools from first-order logic. For instance, we clearly have predicates, such as the predicate fails, that can be grounded by replacing logical variables by elements of appropriate domains (thus we obtain the grounded predicate fails(Tina, Physics), and so on). However, here the "grounded predicates" that appear in a graph are not just propositions, but rather random variables. A symbol such as fails must be understood with a dual purpose: it can be viewed as a predicate, or as a function that yields a random variable for each pair of elements of the domain. And a symbol such as fails(Tina, Physics) also has a dual purpose: it can be viewed as a grounded predicate, or as a random variable that yields 1 when the grounded predicate is true, and 0 otherwise. We adopt the following terminology (Poole 2003). A parvariable (for parameterized random variable) is a function that yields a random variable (its grounding) when its parameters are substituted for elements of the domain. The number of parameters of a parvariable is its arity.

To specify a parameterized Bayesian network, we use parvariables instead of random variables. Of course, when we have parvariables we must adapt our conditional probability tables accordingly, as they depend on the values of parvariables. For example, for the Bayesian network in Fig. 2 we might have

P(isDedicated(x) = 1) = 0.6,


[Figure 2: root nodes isDedicated(Tina), isDedicated(Mary), isDedicated(John), isHard(Physics), isHard(Chemistry), isHard(Math); child nodes fails(Tina, Physics), fails(Mary, Chemistry), fails(John, Chemistry), fails(John, Math), each fails(x, y) having parents isDedicated(x) and isHard(y).]

Fig. 2 The Bayesian network for the University World with three students and three courses

meaning P(isDedicated(a)) = 0.6 for each student a. Also, we might write

P(fails(x, y) = 1 | isDedicated(x) = 0, isHard(y) = 0) = 0.5,
P(fails(x, y) = 1 | isDedicated(x) = 0, isHard(y) = 1) = 0.8,

and so on, assessments that are imposed on every pair (x, y). Each "parameterized table" specifying a conditional probability table for each substitution of logical variables is called a parfactor. Thus for the Bayesian network in Fig. 2 we need only three parfactors.

Possibly the most popular diagrammatic scheme to specify parvariables and parfactors is offered by plate models. A plate consists of a set of parvariables that share a domain (that is, the parvariables are all indexed by elements of a domain). A plate model for the University World is presented in Fig. 3 (left); plates are usually depicted as rectangles enclosing the related parvariables. The main constraint imposed on plates is that the domains used to index the parents of a parvariable must also be used to index the parvariable. Thus in Fig. 3 (left) we see that fails appears in the intersection of two plates, each one of them associated with a domain: fails is indexed by both domains.

Plate models appeared within the BUGS project (Gilks et al. 1993; Lunn et al. 2009) around 1992. At that time other template languages were discussed under the general banner of "knowledge-based model construction" (Goldman and Charniak 1990; Horsch and Poole 1990; Wellman et al. 1992). Plates were promptly adopted in machine learning (Buntine 1994) and in many statistical applications (Lunn et al. 2012). The BUGS package was innovative not only in proposing plates but also in introducing a textual language in which to specify large Bayesian networks through plates. Figure 3 (right) shows a plate model rendered by WinBUGS (a version of BUGS); this plate model can be specified textually by using a loop to describe the plate, and by introducing a statement for each parvariable in the plate model:


Fig. 3 Left: Plate model for the University World. Right: A plate model rendered in WinBUGS (described in the WinBUGS manual)

model {
  for (i in 1 : N) {
    theta[i] ~ dgamma(alpha, beta)
    lambda[i] ...

[...]

... BooleanDistrib(0.80),
[true, true] -> BooleanDistrib(0.90)
};

The second line associates a Poisson distribution with the number of houses. A language where domain size is not necessarily fixed is sometimes called an open-universe language (Russell 2015). Such a language must deal with a number of challenges; for instance, the need to consider infinitely many parents for a random variable (Milch et al. 2005a). The flexibility of Blog has met the needs of varied practical applications (Russell 2015).

It is only natural to think that even more powerful specification languages can be built by adding probabilistic constructs to existing programming languages. An early language that adopted this strategy is CES, where probabilistic constructs are added to C (Thrun 2000); that effort later led to the PTP language, whose syntax augments CAML (Park et al. 2008). A similar strategy appeared in the community interested in planning: existing languages, sometimes based on logic programming, have been coupled with probabilities; two important examples are Probabilistic PDDL (Younes and Littman 2004) and RDDL (Sanner 2011). The latter languages have been used extensively in demonstrations and in competitions, and have been influential in using actions with deterministic and with uncertain effects to obtain decision making with temporal effects.

A rather influential language that adds probabilistic constructs to a functional programming language (in this case, Scheme) is Church (Goodman et al. 2008). Even though the goal of Church was to study cognition, the language is heavily featured; for instance, we can use the "flip" construct, plus conjunction and disjunction, to define conditional probability tables as follows:

(define flu (flip 0.1))
(define cold (flip 0.2))
(define fever (or (and cold (flip 0.3)) (and flu (flip 0.5))))

and we can even use recursion to define a geometric distribution:

(define (geometric p)
  (if (flip p)
      0
      (+ 1 (geometric p))))


A descendant of Church is WebPPL; here the probabilistic constructs are added to JavaScript. For instance, a conditional probability table is written as follows:

var flu = flip(0.1);
var cold = flip(0.2);
var fever = ((cold && flip(0.3)) || (flu && flip(0.5)));

Many other probabilistic programming languages have been proposed by adding probabilistic constructs to programming languages from procedural to functional persuasions (Gordon et al. 2014b; Kordjamshidi et al. 2015; Narayanan et al. 2016; Mansinghka and Radul 2014; McCallum et al. 2009; Paige and Wood 2014; Pfeffer 2016; Wood et al. 2014). These languages offer at least "flip"-like commands, and some offer declarative constructs that mimic plate models. For instance, in Hakaru (Narayanan et al. 2016) one can specify a Latent Dirichlet Allocation model (Blei et al. 2003) in a few lines of code, for instance specifying a plate as follows:

phi ...

[...]

... r > α, AB improves its maximal value so far, and α is updated with r. If r ≤ α, AB continues by calling AB on the next child node. When all child nodes have been visited or cut, AB returns α. On an adversarial node, AB proceeds analogously: if r ≤ α, AB stops and returns α. This is called an α cut. If α < r < β, the opponent improves what he could get so far and β is updated with r. If r ≥ β, AB continues. When all the child nodes have been visited, AB returns β.

If Alpha-Beta, launched with initial values α and β, returns value r, then the following results are guaranteed. If α < r < β then r is the minimax value. If α = r then the minimax value is smaller than α. If r = β then the minimax value is greater than β. The size of the memory used by Alpha-Beta is linear in d.

Alpha-Beta efficiency depends a lot on the order in which AB explores the child nodes of a given node. This order is often given by domain knowledge or search heuristics. Exploring the best node first brings about many cuts and shortens the search. So as to know the minimax value of the root node of a tree with depth d and branching factor b, Knuth and Moore (1975) shows that AB explores a number of nodes greater than approximately 2b^(d/2), and this number can be reached when the order of exploration is sufficiently good. T being the number of nodes explored by Minimax, this means that AB explores approximately 2√T nodes in good cases. Practically, Alpha-Beta is used in its NegaMax version. NegaMax does not explicitly distinguish friendly nodes and opponent nodes, but recursively calls −NegaMax with −β and −α as parameters.
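As an illustration, here is a minimal NegaMax formulation of Alpha-Beta in Python (a sketch under our own conventions: a tree is a nested list whose leaves are evaluations from the point of view of the player to move at that leaf):

```python
def negamax(node, alpha, beta):
    # NegaMax form of Alpha-Beta: one player's gain is the other's loss,
    # so the recursive call swaps and negates the window.
    children = node if isinstance(node, list) else []
    if not children:
        return node                      # leaf: evaluation for the side to move
    for child in children:
        r = -negamax(child, -beta, -alpha)
        if r > alpha:
            alpha = r                    # the side to move improves its guarantee
        if alpha >= beta:
            break                        # cut: the opponent avoids this node
    return alpha

# Depth-2 tree: the root player can guarantee 3.
assert negamax([[3, 5], [2, 9]], float("-inf"), float("inf")) == 3
```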

2.3 Transposition Table

In practice, AB can be enhanced with a transposition table (TT) which stores all the results of the searches starting from a given position. Its interest lies in the fact that two different nodes in the search tree may correspond to the same position. This situation happens very often. The simplest case is when two sequences of moves A, B, C and C, B, A lead to the same position. With a TT, searching twice on the same position is avoided. After each search starting on a given node, AB stores the result in the TT, and each time a node is visited, AB looks into the TT to see whether there is an existing result corresponding to the position of the node.


B. Bouzy et al.

To represent a game position with an index, the first idea is to map the position to a nonnegative integer smaller than |E|, the size of the game state space. |E| can be huge: for 9 × 9 Go, |E| ≈ 3^81 ≈ 10^40 ≈ 2^133, and this first idea is not realistic on current computers. The position must be mapped to a smaller number, at the risk of observing collisions: two different positions with the same index (type 1 collision). Zobrist designed a coding scheme enabling a program to use indexes of positions on current computers (32-bit or 64-bit numbers) with an arbitrarily small probability of collision. For each couple (property, value) of the position, a random number (a 32-bit or a 64-bit number) is set up offline, once and for all. Online, the position, which can be defined by the set of all its couples, has a Zobrist value which equals the XOR of the random numbers of all its couples. When a move is played, the Zobrist value is incrementally updated by XORing the Zobrist value of the position with the random numbers of the couples (property, value) that have changed. Zobrist has shown that, for a search with a fixed number of nodes, a type 1 collision has a probability of happening that can be made arbitrarily small provided that the number of bits of the random numbers is sufficient (Zobrist 1990). This mechanism is called Zobrist hashing.

In practice, the size of the TT is fixed, say 2^L (with L = 20 or L = 30). The index of the position is made up of the first L bits of the Zobrist value of the position. Collisions happen in the TT when two positions have the same index (type 2 collision), which is frequent. To avoid type 2 collisions, the Zobrist value of the position is stored along with its search result. When reading an entry in the TT, the tree search checks that the Zobrist value of the position equals the Zobrist value of the entry before the search result contained in the entry is used. The first Chess programs used TT (Greenblatt et al. 1967). Nowadays, Zobrist hashing is used for many games.
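A Python sketch of Zobrist hashing (the board sizes and the encoding of a position as a dict from cell to piece are our assumptions):

```python
import random

random.seed(42)
NUM_CELLS = 64        # e.g. the 64 squares of a Chess board
NUM_PIECES = 12       # e.g. 6 piece kinds per side

# Offline, once and for all: one 64-bit random number per (cell, piece) couple.
ZOBRIST = [[random.getrandbits(64) for _ in range(NUM_PIECES)]
           for _ in range(NUM_CELLS)]

def full_hash(position):
    # position: dict mapping cell -> piece; the Zobrist value is the XOR
    # of the random numbers of all (cell, piece) couples of the position.
    h = 0
    for cell, piece in position.items():
        h ^= ZOBRIST[cell][piece]
    return h

def toggle(h, cell, piece):
    # Incremental update: XOR toggles a couple in or out, so the same call
    # removes a piece if present and puts it back if absent.
    return h ^ ZOBRIST[cell][piece]

def tt_index(h, L=20):
    # The TT index is made of the first L bits of the Zobrist value.
    return h >> (64 - L)

# Playing a move = two incremental updates: piece 3 leaves cell 0, lands on cell 5.
h = full_hash({0: 3, 10: 7})
assert toggle(toggle(h, 0, 3), 5, 3) == full_hash({5: 3, 10: 7})
```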

2.4 Iterative Deepening

AB is a depth-first search algorithm. If the optimal solution is short and situated below the second child of the root node, AB explores all the nodes situated below the first child before entering the second one; it may spend useless time below the first child before finding the optimal solution below the second. To prevent this problem, Iterative Deepening (ID) calls AB with a fixed depth 1, then 2, and so on, iteratively, while computing time is available (Korf 1985a). ID is anytime, and it finds the shortest solution. ID can be used with a TT: a subsequent, deeper search can use the results of a previous, shallower search. In particular, if the best move of a shallow search is stored in the TT, a subsequent deeper search can try this move first to produce cuts (Slate and Atkin 1977).
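The ID loop can be sketched as follows in Python (our own conventions: a depth-limited NegaMax over an explicit tree, without the TT reuse that makes ID effective in practice):

```python
import time

def iterative_deepening(root, children, evaluate, time_budget, max_depth=10):
    # Anytime search: run depth-limited NegaMax with depth 1, 2, ... while
    # computing time is available; keep the result of the last completed search.

    def search(node, depth, alpha, beta):
        kids = children(node)
        if depth == 0 or not kids:
            return evaluate(node)        # heuristic value for the side to move
        for child in kids:
            r = -search(child, depth - 1, -beta, -alpha)
            if r > alpha:
                alpha = r
            if alpha >= beta:
                break                    # cut
        return alpha

    deadline = time.monotonic() + time_budget
    best = None
    for depth in range(1, max_depth + 1):
        if time.monotonic() >= deadline:
            break
        best = search(root, depth, float("-inf"), float("inf"))
    return best

# Tiny game tree: node -> children, plus a heuristic value for every node,
# given from the point of view of the side to move at that node.
tree = {"a": ["b", "c"], "b": [], "c": []}
value = {"a": 0, "b": -3, "c": -7}
assert iterative_deepening("a", tree.__getitem__, value.__getitem__, 1.0) == 7
```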

Artificial Intelligence for Games


2.5 MTD(f)

Rather than launching AB with α = −∞ and β = +∞, AB can be launched with any values provided that α < β. Let v be the AB value of the root. It can be shown that iff v ≤ α, then AB returns α. Similarly, iff v ≥ β, then AB returns β. Iff α < v < β, then AB returns v. The minimal-window idea is to set β = α + 1. The corresponding search produces many cuts and its computing time is significantly smaller than that of the search launched with α = −∞ and β = +∞. If AB returns α + 1, then α + 1 is a lower bound of v: v ≥ α + 1. If AB returns α, then α is an upper bound of v: v ≤ α. Memory-enhanced Test Driver (MTD) names a class of algorithms (Plaat et al. 1996) using the minimal-window principle. MTD(f) is the simplest and the most efficient one. MTD(f) iteratively calls AB with α = γ and β = γ + 1. The initial value of γ can be a random value or the result of a previous MTD(f) search. At each iteration, if AB returns γ, then v ≤ γ and γ is decremented; otherwise, v ≥ γ + 1 and γ is incremented. After a finite number of iterations v is known and the best move can be read from the TT. At the expense of using a TT and ID, MTD(f) is a significant enhancement of AB, used in current Chess programs.
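The γ-stepping loop can be sketched as follows in Python (a self-contained toy version, without the TT and ID that make the real MTD(f) efficient; `ab` is a fail-hard Alpha-Beta over nested-list trees with integer leaves, our own conventions):

```python
def ab(node, alpha, beta):
    # Fail-hard Alpha-Beta in NegaMax form: the returned value is clamped to
    # [alpha, beta], so with a window (gamma, gamma + 1) it returns exactly
    # gamma (meaning v <= gamma) or gamma + 1 (meaning v >= gamma + 1).
    children = node if isinstance(node, list) else []
    if not children:
        return max(alpha, min(beta, node))
    for child in children:
        r = -ab(child, -beta, -alpha)
        if r > alpha:
            alpha = r
        if alpha >= beta:
            return beta
    return alpha

def mtdf(root, gamma=0):
    # Repeated minimal-window searches; gamma steps by one as in the text,
    # keeping lower and upper bounds on v until they meet.
    lower, upper = float("-inf"), float("inf")
    while lower < upper:
        r = ab(root, gamma, gamma + 1)
        if r == gamma:            # fail low: v <= gamma
            upper = gamma
            gamma -= 1
        else:                     # r == gamma + 1, hence v >= gamma + 1
            lower = gamma + 1
            gamma += 1
    return lower

assert mtdf([[3, 5], [2, 9]]) == 3
```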

2.6 Other Alpha-Beta Enhancements

Other AB enhancements exist. First, Principal Variation Search (PVS) assumes that the nodes are already well ordered by the knowledge-based move generator; consequently, PVS is designed to check this order (Pearl 1980a, b). After searching the first child node, PVS searches each remaining child with a minimal window so as to check that α cannot be surpassed. If this is the case, the computing time is low; otherwise, a normal AB search is launched on the node, which costs a second search. Secondly, the null move heuristic (Donninger 1993) launches a shallow search assuming the first move is a null move, which gives a first value to α at a low cost. Thirdly, the history heuristic (Schaeffer 1989) assesses the moves in terms of the number of cuts they produce, and stores the results in a table; the moves with a good assessment in the table are tried first later on. This heuristic assumes that the moves can be the same from one position to another; in Chess, a move can be identified by its kind of piece, its origin and its destination. Fourthly, quiescence search (Beal 1990) keeps searching while the position is not quiet, i.e. while at least one urgent move exists (for instance capturing a piece in Chess). Rivest (1988) studies the back-up formula. Finally, Junghanns (1998) is an overview of Alpha-Beta.

318

B. Bouzy et al.

2.7 Best First Search Other algorithms improve Minimax by exploring the best moves first, the depth of the search not being fixed beforehand. Proof Number Search (PNS) (Allis et al. 1994) is useful in an AND-OR tree. PNS computes the proof number of a node: the number of nodes to explore under this node so as to prove its value. PNS explores first the node with the lowest proof number. Best-First Search (Korf and Chickering 1994) calls the evaluation function for all child nodes and explores the best node first. SSS* (Stockman 1979) explores all nodes in parallel, as A* would do with a specific heuristic. B* (Berliner 1979) uses an optimistic evaluation and a pessimistic one. B* searches so as to prove that the pessimistic value of the best node is better than the optimistic value of the second best node. McAllester (1988) and Schaeffer (1990) define conspiracy nodes. A conspiracy node is a leaf node whose evaluation influences the Minimax value of the root. The conspiracy nodes are searched first.

3 Monte Carlo Search A major change occurred recently in the game of Go. In 1998, Martin Mueller (amateur 6 Dan) won against Many Faces of Go, the best program at that time, with an astronomical handicap of 29 stones (Müller 2002). In 2008, MoGo, from the French Monte Carlo Go school, won with a decent 9 stones handicap against Kim Myungwang, 8 Dan pro. Later on, programs won with reduced handicaps, and then AlphaGo (Silver et al. 2017b), from Google Deepmind, won without handicap against the very best professional players. In Silver et al. (2017a) this was reproduced for several games without using any human knowledge, and Tian et al. (2018) released an open source version also beating the best professional players. These successes came from algorithmic improvements. The same Monte Carlo techniques were used in active learning (Rolet et al. 2009a), in optimization of grammars (De Mesmay et al. 2009), and in non-linear optimization (Rolet et al. 2009b). Simultaneously, related algorithms were used in planning. In the case of AlphaGo, the Monte Carlo method was combined with deep networks (chapter “Designing Algorithms for Machine Learning and Data Mining” of the present volume).

3.1 Monte Carlo Evaluation The first use of simulated annealing for ranking a list of moves goes back to Bruegmann (1993). The state of the art was then alpha-beta pruning; it works quite well for Checkers or Chess, but it needs a decent and fast evaluation function. An evaluation function is a mapping from a board position to an evaluation of the value of this position for each of the players.

Artificial Intelligence for Games

319

Typically, a human expert can write a good evaluation function for Checkers or Chess, whereas none exists for Go. Bruegmann (1993) proposed a workaround: randomly simulate many games so as to approximate a winning probability (see Algorithm 1); Monte Carlo Go was born.

Algorithm 1 Evaluation of a position p by the Monte Carlo method via n simulations.
Input: a position p, a number n of simulations.
for i ∈ {1, . . . , n} do
  Let p′ = p.
  while p′ is not a final state do
    c = random move among the legal moves at p′
    p′ = transition(p′, c)
  end while
  if p′ is a win for black then ri = 1 else ri = 0 end if
end for
Return (1/n) Σ_{i=1}^{n} ri (estimated probability of a win for black).
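Algorithm 1 can be sketched in Python. The toy last-stone-wins game below (the functions `legal_moves`, `transition`, `is_final`, `black_wins`) is invented purely to make the example runnable; the evaluator itself is game-independent:

```python
import random

def monte_carlo_eval(p, n, legal_moves, transition, is_final, black_wins):
    """Algorithm 1: estimate black's winning probability with n random playouts."""
    wins = 0
    for _ in range(n):
        s = p
        while not is_final(s):
            c = random.choice(legal_moves(s))
            s = transition(s, c)
        if black_wins(s):
            wins += 1
    return wins / n

# Toy game: a pile of stones, players alternately take 1 or 2 stones;
# whoever takes the last stone wins. A state is (stones_left, player_to_move).
def legal_moves(s):
    stones, _ = s
    return [m for m in (1, 2) if m <= stones]

def transition(s, m):
    stones, player = s
    return (stones - m, "white" if player == "black" else "black")

def is_final(s):
    return s[0] == 0

def black_wins(s):
    # in a final state, the player to move did NOT take the last stone
    return s[1] == "white"
```

Calling `monte_carlo_eval((5, "black"), 200, ...)` returns an estimate in [0, 1] of black's chances under random play.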

While the Monte Carlo method is old (we usually trace it back to the Manhattan project, i.e. the project for the construction of the nuclear bomb in the United States during World War II, but it had also been proposed much earlier as an original method for approximating π), its use in games was then new.

3.2 Monte Carlo Tree Search The technique produced convincing results; it was further developed in Bouzy and Cazenave (2001), Bouzy and Helmstetter (2003) and improved by combination with search and with knowledge (Bouzy 2005; Cazenave and Helmstetter 2005). Nonetheless, the real “take off” of the performance was its combination with an incremental tree building. The resulting algorithm, termed Monte Carlo Tree Search (Coulom 2006; Chaslot et al. 2006), is presented in Algorithm 2. The structure T is a set of situations, progressively augmented by adding, at each random simulation, the first situation of this simulation which had not yet been stored in it; this structure stores, for each node, a number of wins for black and a number of wins for white. The default policy does not have to be purely random: the early successes of MoGo were due to the use of a sophisticated default policy (Gelly et al. 2006). Combinations with tactical solvers (solving local situations) have been tried without clear success. This technique is currently applied in all strong programs in the game of Go, and in many games:


Algorithm 2 Evaluation of a position p by the Monte Carlo Tree Search technique with n simulations. Notations: ¬white = black, ¬black = white; transition(p, c) is the situation reached when the player to play in position p chooses move c.
Input: p a position, n a number of simulations.
T ← empty structure.
for i ∈ {1, . . . , n} do                              // Do a complete game
  Let p′ = p, q = ∅, game = ∅.
  while p′ is not a final state do
    if p′ is in T then                                 // Algorithm called “bandit”
      j = player to play in p′
      for each legal move c in p′ do                   // Compute the score of each move
        p″ = transition(p′, c)
        Score(c) = banditFormula(T(¬j, p″), T(j, p″), T(j, p″) + T(¬j, p″))
      end for
      c = legal move in p′ maximizing Score(c)
    else
      if q = ∅ then                                    // We have not yet found the state to be added
        q = p′
      end if
      c = random move among the legal moves in p′      // default policy
    end if
    game ← game + p′
    p′ = transition(p′, c)
  end while
  Add q to T if q ≠ ∅
  if p′ is a win for black then ri = 1 else ri = 0 end if
  for p″ in game do
    T(ri, p″) = T(ri, p″) + 1                          // Increase T(ri, p″)
  end for
end for
Output: (1/n) Σ_{i=1}^{n} ri.

• Monte Carlo (not MCTS): the Scrabble world champion was beaten by MC (Sheppard 2002)
• General Game Playing (Cadiaplayer, world champion, Finnsson and Björnsson 2008a)
• Hex (MCTS world champion, Arneson et al. 2010)
• Havannah (Teytaud and Teytaud 2009)
• Arimaa (a game built specifically to be hard for computers, Kozelek 2009)
• Nogo (Chou et al. 2011)
• Fortress positions in Chess (folklore claim)
• Hide and seek “Scotland Yard” (Nijssen and Winands 2012)
• Chinese Checkers, Focus, Rolit, Blokus (Nijssen and Winands 2013)
• Amazons, Breakthrough, M. Jack, Chinese Military Chess, real-time video games (Ms Pac-Man), Total War: Rome, Poker, Skat, Magic: The Gathering, Settlers of Catan, 7 Wonders...
Monte Carlo is used in various puzzles: SameGame (Schadd et al. 2008); Morpion Solitaire (state of the art by nested-rollout MCTS, Rosin 2011; Cazenave et al. 2016); Samurai Sudoku (Finley 2016); and in applications far from games: operations research (Chang et al. 2005, sometimes claimed to be an early variant of MCTS); Linear Temporal Logic problems, including car driving (Paxton et al. 2017); the traveling salesman problem with time windows (nested rollout method; Cazenave and Teytaud 2012); unit commitment (an early combination of neural nets and MCTS, Couetoux 2013); continuous uncertain industrial problems (Couetoux 2013); sailing (Kocsis and Szepesvari 2006). A particularly interesting point is so-called “general game playing” (GGP, Pitrat 1968); in these competitions, the program has to read the rules (in a given format, usually the “game description language”), and then play. The best GGP programs use MCTS (Finnsson and Björnsson 2008b). Algorithm 2 does not specify what the “bandit” formula is. A classical variant, though not the most widely used in the case of Go, is UCT (Upper Confidence Tree, Kocsis and Szepesvari 2006), as follows:

banditFormula(w, l, n) = w/(w + l) + sqrt(K log(n)/(w + l))    (1)
(where K is an empirically chosen constant, n is the number of simulations at the considered situation, w the number of wins for the considered move, and l the number of losses; n = w + l for games without draws). The first term, w/(w + l), called exploitation, is in favor of moves which have a high success rate; the second term is in favor of moves which have not been much explored (w + l is small) and is therefore called the exploration term. Formula (1) is not properly defined for w + l = 0; it is frequent to specify banditFormula(w, l, n) = F when w + l = 0, for a given constant F (Gelly et al. 2006). Different modifications of this formula have been proposed. The RAVE formula (Rapid Action Value Estimates), based on so-called AMAF (All Moves As First) values, is as follows:

banditFormula(w, l, w′, l′) = α(w + l) w/(w + l) + (1 − α(w + l)) w′/(w′ + l′).

We use the same w and l as in UCT, and we also use, in the bandit formula for a given situation:
• w′ the number of wins in which the considered move c has been played by the player to play in the considered situation before being played by the other player, even if this move was not played in the considered situation.
• l′ the number of losses in which the considered move c has been played by the player to play in the considered situation before being played by the other player,
even if this move was not played in the considered situation.
α(.) is a function converging to 1, for example α(n) = n/(n + 50):
• if w + l is much larger than 1, we use these exact values and their ratio; whereas when w and l are too small for the ratio w/(w + l) to be meaningful, the AMAF values w′ and l′ dominate;
• then, as the number of simulations increases, we move to w/(w + l), which is asymptotically better (less biased) than w′/(w′ + l′).
A second important modification (Coulom 2007; Lee et al. 2009; Chaslot et al. 2008) consists in using heuristics tuned on databases; a simple method for using a heuristic h(p′, c), typically estimating the frequency at which a move c is played in situation p′, given the configuration around move c, is:

banditFormula(w, l, p′, c) = w/(w + l) + K h(p′, c)/(w + l)

for some empirically tuned constant K. Strong results can be obtained by combining these different approaches (Lee et al. 2009). After the wide success of deep neural networks for various tasks, in particular pattern recognition tasks, neural networks have been used for estimating h(p′, c) (such a network is called a critic network), with p′ the entire board (Silver et al. 2016). Deep neural networks were also used for generating the so-called default policy in Algorithm 2 (such a network, actually playing moves, is called an actor network). The training can be done in different manners; a possibility is as follows:
• learn the actor network move = random(board) by the reinforce method (Williams 1992), i.e. by self-play (the network plays against itself and applies gradient updates to its weights);
• learn the critic network h(p′, c) by classical supervised learning on the situations met in self-play games.
Using this method, combined with MCTS, AlphaGo won a long uninterrupted series of games against the best professional players in the world (Silver et al. 2017b).
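The bandit formulas above can be written down directly. This sketch assumes the convention of returning a large constant F for unexplored moves, and the mixing constant 50 from the example α(n) = n/(n + 50); both are illustrative choices, not canonical values:

```python
import math

F = 1e9  # value given to unexplored moves (a common convention)

def uct(w, l, n, K=1.0):
    """Formula (1): exploitation term plus exploration term."""
    if w + l == 0:
        return F
    return w / (w + l) + math.sqrt(K * math.log(n) / (w + l))

def alpha(n, k=50):
    """RAVE mixing weight: tends to 1 as the move accumulates real simulations."""
    return n / (n + k)

def rave(w, l, w_amaf, l_amaf):
    """Blend the exact statistics (w, l) with the AMAF statistics (w', l')."""
    if w_amaf + l_amaf == 0:
        return uct(w, l, max(w + l, 1))
    a = alpha(w + l)
    exact = w / (w + l) if w + l > 0 else 0.5
    amaf = w_amaf / (w_amaf + l_amaf)
    return a * exact + (1 - a) * amaf
```

With no real simulations (`w + l = 0`), `rave` falls back entirely on the AMAF ratio; as simulations accumulate, the weight shifts to the exact, less biased statistics.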
MCTS has the following advantages:
• scaling: the program becomes stronger as the computational power increases;
• a very low need for human expertise: the algorithm presented in Algorithm 2 is independent of the game; even the heuristic h(., .) might be tuned automatically on databases, though human expertise can help.
Due to the nice scaling properties of MCTS, parallelization has been applied (Cazenave and Jouandeau 2007; Gelly et al. 2008); however, the results, although numerically good in the sense that MCTS running on dozens of CPUs does outperform the single-CPU version, keep the same limitations; it looks like the performance against humans does not increase as much as suggested by the performance against the non-parallel version of the code. The parallel code remains, at least when no special trick and no deep network is applied, unable to evaluate so-called “capturing races” (also known as “semeais”); the program tends to believe that semeais are won with probability 50% whereas humans clearly know that it is a win for, e.g., black with probability 100%. An improved version of AlphaGo Zero named AlphaZero has been proposed (Silver et al. 2017a). Apart from the game of Go, it has been applied to Chess and Shogi. After a few hours of training it was able to defeat the best computer Chess and Shogi players, Stockfish and Elmo. For the game of Go it has surpassed AlphaGo Zero (Silver et al. 2017b) and it is considered a more generic version of AlphaGo Zero.

4 Puzzles Puzzles are one-player problems where we search for a sequence of moves that solves the problem. Algorithms can optimize either the number of moves or the score of the solution.

4.1 A* The A* algorithm (cf. chapter “Heuristically Ordered Search in State Graphs” of this volume) (Hart et al. 1968) makes it possible to find solutions with a minimal number of moves to various puzzles. Examples of such puzzles are the Rubik’s Cube (Korf 1997), the 9-puzzle, and Sokoban (Junghanns and Schaeffer 2001). A* is also used in video games (Cazenave 2006a; Bulitko et al. 2008; Sturtevant et al. 2009). For each puzzle addressed by A*, an admissible heuristic has to be defined. This heuristic is computed for every state of the search. A heuristic is admissible when it never gives a value greater than the true number of moves required to reach the solution from the evaluated state.

4.1.1 The Manhattan Heuristic

The most widely used heuristic is the Manhattan heuristic. Its principle is to compute very rapidly the solution of a simplified problem and to use the cost of this solution as a lower bound of the real cost. It calculates, for each piece, the cost of moving it to its goal without taking into account the interactions with the other pieces. The heuristic is the sum of these calculations over all the pieces. For example, in the 9-puzzle the heuristic counts for each tile the number of moves needed to bring it to its goal location as if there were no other tile. For the Rubik’s Cube, the same calculation is done for every cube; however, each move of the Rubik’s Cube moves eight cubes, so the sum has to be divided by eight in order to be
admissible. For finding optimal paths on video game maps, the Manhattan heuristic calculates the distance as if there were no obstacles.
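A sketch of the Manhattan heuristic for the 16-puzzle; the flat-tuple state representation (0 standing for the blank, tile k having goal cell k − 1) is an assumption of the example:

```python
def manhattan(state, size=4):
    """Sum over the tiles of the Manhattan distance to each tile's goal cell.
    `state` lists, row by row, the tile in each cell; 0 denotes the blank."""
    total = 0
    for index, tile in enumerate(state):
        if tile == 0:
            continue  # the blank does not count
        goal = tile - 1
        total += abs(index // size - goal // size) + abs(index % size - goal % size)
    return total
```

The value is 0 exactly on the goal state, and never exceeds the true number of moves, since every move changes the position of a single tile by one cell.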

4.1.2 Tree Search

A* develops at each search step the state that has the lowest estimated cost. The cost of a path is the cost already paid to reach the state plus the lower bound on the remaining cost to reach the goal. This ensures that when a solution is found it has a minimal cost (all other states that could be developed have a greater associated cost). In games, the cost of a state is often the number of moves required to reach the goal.

Algorithm 3 Search of a minimal cost solution with A*
Input: a position p.
Open ← {p}; Closed ← {}
g[p] = 0
h[p] = estimated cost from p
f[p] = g[p] + h[p]
while Open ≠ {} do
  pos = position in Open with the smallest f
  if pos is the goal then
    return the path to pos
  end if
  remove pos from Open
  add pos to Closed
  for each legal move c of pos do
    pos′ = transition(pos, c)
    g′ = g[pos] + cost of c
    if pos′ is not in Closed then
      if pos′ is not already in Open with g[pos′] ≤ g′ then
        g[pos′] = g′
        h[pos′] = estimated cost from pos′
        f[pos′] = g[pos′] + h[pos′]
        add pos′ to Open
      end if
    end if
  end for
end while
return failure
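Algorithm 3 can be sketched as a generic Python function. The grid map and wall set below are invented for illustration, and this version simply re-pushes improved states onto the heap rather than updating Open entries in place:

```python
import heapq

def astar(start, is_goal, successors, h):
    """Generic A*: `successors(s)` yields (next_state, move_cost) pairs and
    `h` is an admissible estimate of the remaining cost."""
    open_heap = [(h(start), 0, start, [start])]
    best_g = {start: 0}
    while open_heap:
        f, g, state, path = heapq.heappop(open_heap)
        if is_goal(state):
            return path
        if g > best_g.get(state, float("inf")):
            continue  # outdated heap entry
        for nxt, cost in successors(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(open_heap, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None  # no solution

# Example: shortest path on a small 4x4 grid map with a wall.
WALLS = {(1, 1), (1, 2), (1, 3)}

def successors(cell):
    x, y = cell
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nxt = (x + dx, y + dy)
        if 0 <= nxt[0] < 4 and 0 <= nxt[1] < 4 and nxt not in WALLS:
            yield nxt, 1
```

With the Manhattan heuristic `lambda c: abs(c[0] - 3) + abs(c[1] - 3)`, the call `astar((0, 0), lambda c: c == (3, 3), successors, h)` returns a minimal path around the wall.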

4.2 Monte Carlo The recent success of Monte Carlo methods for games has made them interesting candidates for solving puzzles. Monte Carlo methods are suited to puzzles lacking a good heuristic to guide the search. This is the case for puzzles such as Morpion Solitaire or
SameGame. For these two games, as well as for Sudoku and Kakuro, a nested Monte Carlo search gives good results (Cazenave 2009). The principle of this algorithm is to play random games at the lowest level and to choose, at upper levels, the move that resulted in the best score of a playout of the underlying level (for example, each move of a playout of level one is chosen according to the score of a random game starting with this move). Moreover, the methods described in the previous section on Monte Carlo Search are general and can be applied to puzzles.
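A simplified sketch of nested Monte Carlo search, without the best-sequence memorization of the full algorithm of Cazenave (2009); the toy row-picking puzzle (pick one number per row, maximize the sum) is invented for the example:

```python
import math
import random

def playout(state, legal, transition, is_final, score):
    """Level 0: finish the game with uniformly random moves."""
    while not is_final(state):
        state = transition(state, random.choice(legal(state)))
    return score(state)

def nested(state, level, legal, transition, is_final, score):
    """At each step, evaluate every move with a level-1 search, play the best."""
    if level == 0:
        return playout(state, legal, transition, is_final, score)
    while not is_final(state):
        best_score, best_move = -math.inf, None
        for move in legal(state):
            s = nested(transition(state, move), level - 1,
                       legal, transition, is_final, score)
            if s > best_score:
                best_score, best_move = s, move
        state = transition(state, best_move)
    return score(state)

# Toy scoring puzzle: a state is (row_index, running_sum).
ROWS = [[1, 5], [2, 9], [4, 3]]
def legal(s):          return ROWS[s[0]]
def transition(s, m):  return (s[0] + 1, s[1] + m)
def is_final(s):       return s[0] == len(ROWS)
def score(s):          return s[1]
```

On this tiny problem a level-2 search recovers the optimal sum 5 + 9 + 4 = 18; on hard puzzles the levels trade exhaustiveness for guided sampling.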

4.3 Further Readings For some puzzles it is possible to use a depth-first version of A* named Iterative Deepening A* (IDA*) (Korf 1985b), which makes it possible to run A* with very little memory. Kendall et al. (2008) is a nice survey of NP-complete puzzles.

5 Retrograde Analysis Retrograde analysis makes it possible to precompute a solution for each element of a subset of the states of a game. It has been used to optimally solve a few games. We first present its application to endgames of two-player games, then to puzzles.

5.1 Endgame Tablebases The principle underlying an endgame tablebase is to calculate the exact value of some endgame states. For example, in Chess it is possible to calculate the exact value of every state containing five pieces or less. Ken Thompson calculated all values for six-piece endgames (Thompson 1996). Retrograde analysis is the algorithm used to calculate the score of endgame states. The principle is to start by enumerating all won states. Then, for each won state, it undoes a White move and a Black move and verifies with a depth-two search the status of the new state. It thereby finds new won states, and continues this process as long as new won states are found. Endgame tablebases are used by Chess programs, as they can play some endgames instantly and better than any human. They have changed Chess theory for certain won states that human players thought to be draws (notably the King-Bishop-Bishop versus King-Knight endgame). Retrograde analysis can also be used in other games. Chinook solved Checkers using endgame tablebases and search algorithms (Schaeffer et al. 2007). Another popular game completely solved by retrograde analysis is Awari: there exists
a program that can play perfectly and instantly all the Awari states (Romein and Bal 2003).
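The backward-labelling idea can be sketched on a toy subtraction game (take 1 to 3 stones, taking the last stone wins); this game and the simple WIN/LOSS labels are an illustrative simplification of the Chess case, where values are propagated by undoing moves from the known terminal states:

```python
from collections import deque

def retrograde_subtraction_game(n_max, takes=(1, 2, 3)):
    """Label every state (number of stones, player to move) as WIN or LOSS,
    working backward from the terminal state as retrograde analysis does."""
    value = {0: "LOSS"}  # no stone left: the player to move has already lost
    # number of successors of each state whose value is still unknown
    remaining = {s: len([t for t in takes if t <= s]) for s in range(1, n_max + 1)}
    queue = deque([0])
    while queue:
        s = queue.popleft()
        for t in takes:            # undo a move: enumerate predecessors of s
            p = s + t
            if p > n_max or p in value:
                continue
            if value[s] == "LOSS":
                value[p] = "WIN"   # p can move to a lost state
                queue.append(p)
            else:
                remaining[p] -= 1
                if remaining[p] == 0:
                    value[p] = "LOSS"  # every move from p reaches a won state
                    queue.append(p)
    return value
```

The result matches the known theory of this game: positions where the number of stones is a multiple of 4 are lost for the player to move.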

5.2 Pattern Databases In order to improve A* or IDA* on a given problem, it is natural to try improving the admissible heuristic. Improving the admissible heuristic amounts to making it find greater values for an equivalent computing time. If the heuristic finds greater values, A* will develop fewer states in order to find the solution, as it will cut some paths earlier. A nice way to improve the admissible heuristic is to precompute the solutions of numerous configurations of a problem simpler than the original one and to use the precomputed values as admissible heuristics. For example, in the 16-puzzle the Manhattan heuristic considers each tile as independent of the others. If some interactions between tiles are taken into account, the heuristic is improved. All states containing some predefined tiles are solved taking into account the interactions between these tiles. For instance, all configurations of the first eight tiles can be calculated, removing the other tiles and replacing them with empty tiles (Culberson and Schaeffer 1998). A retrograde analysis algorithm close to the one used in Chess computes the minimal number of moves required to solve each configuration of the first eight tiles. Using this pattern database is fast, as it consists in finding a precomputed number in a table, with the index corresponding to the configuration of the first eight tiles of the state to evaluate. Pattern databases can be used for problems other than the 16-puzzle. It is possible, for example, to precompute all the combinations of the eight corner cubes of the Rubik’s Cube, and then to use them as an admissible heuristic (Korf 1997). The 16-puzzle and the Rubik’s Cube are two examples among the many problems whose solving can be sped up with pattern databases, enabling much faster solving than with the Manhattan heuristic alone (a thousand times faster for the 16-puzzle, and required to solve the Rubik’s Cube).
In order to solve the Rubik’s Cube faster, it is possible to combine pattern databases, for example by computing a database for the eight corner cubes and another one for six out of the twelve border cubes. The evaluation of a position is then the maximum of the two values found in the pattern databases. Moves of the Rubik’s Cube move both corner and border cubes, so it is not possible to add the two heuristics. For other problems such as the 16-puzzle this is not the case: if two databases containing disjoint sets of tiles are available, the two values can be added and still give an admissible result, since no move counted in the first database is also counted in the other database (tiles of one database are not in the other database) (Felner et al. 2004). For some problems such as the four-peg Towers of Hanoi, it is valuable to compress pattern databases so that they fit in memory. Compression stores only one value for a set of patterns (the minimum number of moves over all the patterns of the set) (Felner et al. 2007).
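The two ways of combining databases can be made explicit. The lookup tables below are hypothetical stand-ins for real precomputed databases, which would map the projection of a state onto each pattern to a stored move count:

```python
def h_max(heuristics, state):
    """Admissible for any databases: take the best single lower bound
    (needed for the Rubik's Cube, where one move affects both patterns)."""
    return max(h(state) for h in heuristics)

def h_sum(heuristics, state):
    """Admissible only when no move is counted by two databases, as with
    disjoint tile sets in the 16-puzzle."""
    return sum(h(state) for h in heuristics)

# Hypothetical precomputed lookups, keyed directly by state for illustration.
pdb_a = lambda state: {"s0": 7, "s1": 3}[state]
pdb_b = lambda state: {"s0": 5, "s1": 6}[state]
```

For the same state, the sum gives a tighter bound than the max, which is why disjoint (additive) databases are preferred whenever the problem allows them.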


Precomputing to improve the admissible heuristic and accelerate search is not limited to puzzles. For the shortest path problem on a game map, it is possible to precompute the distances from a given point to all other points. The triangle inequality can then be used to compute an admissible heuristic between any two points (Cazenave 2006a). It is also possible to compute pattern databases for two-player games. For the game of Go, all living shapes that fit within a given rectangle can be precomputed and used to accelerate life and death search (Cazenave 2003).
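A sketch of this precomputation on a small grid map; the landmark position and grid size are arbitrary choices for the example. By the triangle inequality, |d(L, a) − d(L, b)| never overestimates d(a, b), so the derived heuristic is admissible:

```python
from collections import deque

def bfs_distances(source, neighbors):
    """Exact distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors(u):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def landmark_heuristic(dist_from_landmark):
    """Admissible estimate of d(a, b) from precomputed landmark distances."""
    def h(a, b):
        return abs(dist_from_landmark[a] - dist_from_landmark[b])
    return h

# Example: a 5x5 open grid map, landmark L at a corner.
def neighbors(cell):
    x, y = cell
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= x + dx < 5 and 0 <= y + dy < 5:
            yield (x + dx, y + dy)

dist_L = bfs_distances((0, 0), neighbors)
h = landmark_heuristic(dist_L)
```

In practice several landmarks are used and the maximum of the resulting bounds is taken, which tightens the estimate at the cost of more precomputation.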

5.3 Further Topics This section has been devoted to fully observable games; extensions exist for non-observable applications (Cazenave 2006b; Rolet et al. 2009a). In some games, the modeling of the opponent is critical (for example in Poker, Maîtrepierre et al. 2008).

6 AI in Video Games Besides major AI contributions to the area of classical games, new types of games have emerged over the last three decades that have called for new developments in the field of AI. Video games first distinguished themselves from classical games by relying heavily on graphics and reflex action (instead of analysis and reasoning). Yet these new video games have given more and more importance to AI, not only to provide artificial opponents to human players, but also to animate the virtual characters that inhabit the complex virtual worlds of some of the more recent games, so as to make them credible and entertaining. A strong and active community has therefore developed in this area over recent years, involving both industry and academia, with specialized books, conferences supported by major academic societies (e.g., IEEE with Computational Intelligence in Games, AAAI with Artificial Intelligence and Interactive Digital Entertainment), and journals such as the IEEE Transactions on Computational Intelligence and AI in Games (TCIAIG). Many AI researchers have turned to the area of video games as a rich problem area (Laird 2002). Indeed, video games constitute excellent platforms for experimental work in AI, and that is true of a wide range of game types, including historically FPS (first-person shooters), RTS (real-time strategy), RPG (role-playing games) or even adventure games, each one bringing its own research problems (Corruble and Ramalho 2009). One limitation for research-oriented work in this area was for a long time the limited accessibility of open platforms, commercial games being typically closed to open investigation by outside researchers. The situation has improved recently, with open platforms becoming more available, often as a result of partnerships between academia and industry. Alternatively, a few projects have recently tackled AI game-playing with commercial versions of games, e.g. simulating mouse
clicks to communicate AI moves (Madeira and Corruble 2009), up to the point where the game state is acquired by a video camera monitoring the screen as human players would. In parallel, more and more game platforms are being released in the framework of competitions or challenges to the research community. Lastly, we see in this section how the video game AI domain, beyond its role as an experimental platform for testing and challenging AI techniques, also contributes its own research questions, which bring about the enrichment and renewal of the field of AI as a whole.

6.1 Transitioning from Classical Games to Video Games From an AI perspective, some video game genres can be considered as extensions of classical games, while other genres introduce fundamentally new problems. In the first category, one can place modern strategy games which, similarly to classical games, stage a conflict or competition between two or more sides, each representing an army, civilization or faction, in a context that can be historical or imaginary. Typical examples are Age of Empires (Microsoft), the Total War series (Creative Assembly), which combines strategy and tactical combat at various historical periods, Sid Meier’s Civilization, or Paradox Interactive grand strategy games such as Europa Universalis or Victoria. Innovations in this group are significant and go beyond visual immersion. Besides moving units on a map in a 2D or 3D environment, as one can find in classical games and many wargames, players are challenged to manage an economy (resource collection, production, budget, ...), diplomacy (alliances, ...) or even research and innovation policies. These multiple levels of simulation, from the most tactical (involving moving units on a map) to the most strategic (with long-term policies), and their complex interactions, bring about new levels of complexity, where the middle- or long-term impact of decisions is extremely difficult to predict. While these strategy games can be played in a turn-based fashion or in real time, from a complexity perspective their main innovation in comparison with classical games is their high degree of parallelism: be it for a wargame or a grand strategy game, all units can potentially receive independent orders at any moment or game turn. As a result, the combined set of possible decisions or actions at a given point in time becomes hard to enumerate and even harder to evaluate, as its size grows exponentially with the number of units.
Thus, the traditional AI approach to games based on tree search usually becomes impractical. This initial remark on the complexity of modern games might go some way toward explaining a phenomenon that could seem surprising at first: AI in video games has until recently used relatively few results or techniques coming from AI research on classical games, and more generally from academic AI. Yet the video game industry has been one of the first areas to adopt the notion of AI, to the point of making it a commercial argument. The game industry, and players, refer to game AI mainly as the part of the game that manages the automated behavior of the player’s opponents or of the non-player characters that populate the game environment. The
issue of whether this game AI actually uses techniques coming from AI research is often, maybe justifiably, seen as secondary. As we are about to see in the next section, a large proportion of video games use techniques that can seem rather basic from an AI research perspective, but that show real strengths from the point of view of game designers, while still allowing for some degree of refinement. We will also see how more recent work, coming both from academia and industry, has initiated a move from what is described as traditional, scripted game AI toward a more advanced one. Furthermore, this overview of game AI must go beyond these basic questions and address other important ones. A key topic for game AI is the notion of non-player characters (NPCs), the creatures that populate the game world, especially in adventure and role-playing games, but which can also be relevant in other popular genres such as simulation, sports games, or first-person shooters (FPS). NPCs extend the notion of opponent (they can indeed represent allies or be neutral to the player character), and are related to the notion of an actor (that must follow the instructions of a director) or of an autonomous agent that must act, react, or interact in a credible manner with the story and the player. In this area, much of the recent research on autonomous intelligent agents finds an application and a field of experimentation that is particularly exciting (Corruble and Ramalho 2009), but we will see in the following that one has to enrich, redefine, and sometimes even strengthen some of the goals of classical research on rational agents.

6.2 AI in the Game Industry: Scripting and Evolution Modern video games should be seen as interactive media which borrow much from a movie culture where an author imagines a story and directs actors. They add to this the key dimension of interactivity: the evolution of the story is strongly impacted by the actions of the player, who might as a result guide it in one direction or another. This complex intertwining between the levels of story and individual actions is studied in a new research domain known as interactive narrative (Perlin 2005; Natkin 2004). The roots of video games in a movie culture go some way towards explaining the reservations held by some (now rare) game designers towards the notion of an Artificial Intelligence leading to autonomous agents: in their minds, NPCs are seen mainly as actors; they must behave by following indications from the game designer, and help in steering the story towards one of the paths they anticipated. In that vein, some of the most scripted games are referred to as roller-coasters: they are designed precisely to make the player live a planned sequence of emotions where intense moments are followed by relaxing episodes, and so on. Roller-coaster games are often opposed to sandbox games, where the player is the one building the story that emerges from their interactions with a rich game environment. Games in the first category most often have what is called a scripted approach to AI. Some behaviors or action sequences are triggered when specific conditions on the game state (e.g. the position of the player character, or a specific point in the story) are met. NPCs designed this way exhibit the behaviors expected by the designer at the
time they expect them, and are therefore a suitable solution to many needs of the video game industry. Nonetheless, this approach has its own shortcomings. Development costs can be significant, as all relevant game situations must be taken into account at the time of design, which in turn implies some level of constraint on the behaviors of the player... which has generated some level of player frustration that is not conducive to immersion and enjoyment. Furthermore, it tends to lead to a rather closed or less dynamic game environment: the same situation leads to the same behaviors. This can lead to some form of player boredom, and also decreases the replay value of the game. Another consequence is that the player can exploit this tendency to repeat the same actions and game the game: by easily predicting the game AI's reaction to their own moves, the player acquires an advantage that quickly ruins the game challenge. At a more technical level, several approaches have been used to model the behaviors of NPCs. Finite state machines (FSM) have long been the basic tool of the scripted AI approach. The limitations of FSM in terms of representation have led to some improvements with hierarchical finite state machines (HFSM) (Harel 1987), which allow the modeling of generalized states and state transitions, hence factoring some aspects of similar states. Further, behaviour trees (Flórez-Puga et al. 2008) aim at maximizing the level of factoring between states and make the logic of transitions between behaviors more explicit; they have thus become the leading AI technique in the video game industry over the last decade. In order to implement scripted behavior, powerful script languages have been developed over the years, several of them motivated by the game industry. One of them is Lua (Ierusalimschy et al. 2007), used in successful games such as World of Warcraft.
These languages have been well received because they offer a good balance between speed of execution and development that remains outside of the game engine (the computation-intensive part that deals with graphics, etc.). This lets studios work on the game AI in the last months of game development, refine it after release, or in some cases let players contribute to its improvement via modding.
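The behavior-tree idea mentioned above can be made concrete in a few lines. The following is a minimal sketch of the two classic composite nodes; the guard scenario and all names are invented for illustration, and real engines add running states, decorators, and per-frame ticking on top of this skeleton:

```python
SUCCESS, FAILURE = "success", "failure"

class Action:
    """Leaf node: wraps a function from game state to True/False."""
    def __init__(self, fn):
        self.fn = fn
    def tick(self, state):
        return SUCCESS if self.fn(state) else FAILURE

class Sequence:
    """Composite node: succeeds only if every child succeeds, in order."""
    def __init__(self, *children):
        self.children = children
    def tick(self, state):
        for child in self.children:
            if child.tick(state) == FAILURE:
                return FAILURE
        return SUCCESS

class Selector:
    """Composite node: tries children in order, stops at the first success."""
    def __init__(self, *children):
        self.children = children
    def tick(self, state):
        for child in self.children:
            if child.tick(state) == SUCCESS:
                return SUCCESS
        return FAILURE

def do(action_name):
    """Leaf behavior that records the chosen action and succeeds."""
    def run(state):
        state["action"] = action_name
        return True
    return Action(run)

# A guard NPC: attack if an enemy is visible, otherwise patrol.
tree = Selector(
    Sequence(Action(lambda s: s["enemy_visible"]), do("attack")),
    do("patrol"),
)

state = {"enemy_visible": False}
tree.tick(state)
print(state["action"])  # prints "patrol"
```

The factoring benefit appears as soon as subtrees are reused: the same `Selector`/`Sequence` subtree can serve as a child of several higher-level behaviors, which is exactly what flat FSMs make awkward.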

6.3 Research Directions: Adaptive AI and Planning

In order to overcome the limitations of scripted approaches to game AI listed above, some attempts have been made to combine the strong points of scripted AI, well regarded in the video game industry, with the adaptive abilities studied by AI researchers, especially in the area of Machine Learning. An important example is Dynamic Scripting (Spronck et al. 2006). In its initial version, this approach learns from experience a score for each hand-defined script. This learning stage then allows the selection of the script that is most relevant to the current situation and player. Later versions of Dynamic Scripting introduced a reinforcement learning stage that modifies the scripts themselves. It is therefore a real machine learning approach to game AI, though it leaves open the possibility of using a base of pre-defined scripts as input.
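The score-and-select stage of the initial version can be sketched as follows. This is our own toy simplification, not Spronck et al.'s exact formulation: the update rule, learning rate, and script names are illustrative assumptions.

```python
import random

class DynamicScripting:
    """Toy score-and-select: each hand-written script keeps a weight;
    weights rise after wins and drop after losses, and the next script
    is drawn with probability proportional to its weight."""
    def __init__(self, script_names, learning_rate=0.2, w_min=0.1):
        self.weights = {name: 1.0 for name in script_names}
        self.lr = learning_rate
        self.w_min = w_min      # floor keeps every script selectable

    def select(self):
        names = list(self.weights)
        total = sum(self.weights.values())
        return random.choices(names, [self.weights[n] / total for n in names])[0]

    def update(self, name, won):
        delta = self.lr if won else -self.lr
        self.weights[name] = max(self.w_min, self.weights[name] + delta)

ds = DynamicScripting(["rush", "turtle", "harass"])
for _ in range(50):                      # pretend only "rush" ever wins
    script = ds.select()
    ds.update(script, won=(script == "rush"))
# "rush" now carries most of the weight and tends to be selected most often.
```

The weight floor plays the role of exploration: even a script that keeps losing retains a small chance of being tried again, so the system can adapt if the player changes strategy.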

Mostly in academic circles, some researchers have for some years proposed to design game AI with a radically different approach, placing learning techniques at its heart, with a triple objective: avoid the somewhat frozen behaviors of scripted approaches; offer the possibility of an adaptive game AI (one that adapts naturally to the player and their play style, since it is by playing that it develops or adapts its strategy); and, from a practical development point of view, offer an alternative to the classical approach, which implies the programming and debugging of numerous scripts by several programmers (some game studios hire dozens of AI script programmers in their final production phase). Resistance against this type of learning approach in the game industry has several explanations: current learning techniques on complex problems tend to converge slowly, and induced behaviors are difficult to predict, so it is hard to offer an a priori guarantee about the acceptability of the results by the designer and ultimately by the player. On the other hand, the game industry recognizes that adaptive methods based on learning are certainly promising, as expressed plainly in the enthusiastic preface of Rabin (2002), which qualifies learning as the Next Big Thing for game AI before addressing various aspects and projects of learning game AI in ten chapters of the book. By nature, most video games can be modeled as an agent or a system of agents interacting with their environment. These agents often receive an evaluation of their actions, for example through the evolution of a game score. Video games therefore appear open to AI approaches based on machine learning. They are however domains where state and action spaces can be much larger than in more classical games, and they constitute a good motivation to improve these methods, for example using approaches for factoring the underlying Markov Decision Processes (Degris et al.
2009), or using a hierarchical decomposition for learning on adapted representations (Madeira and Corruble 2009), which led to viable solutions for FPS games and for strategy games such as historical wargames, respectively. In this last case in particular, the hierarchical decomposition of the decision and learning processes, and the automated adaptation of representations by abstraction mechanisms, proved necessary due to the high level of parallelism that makes most wargames analogous to actual multi-agent simulations (Corruble and Ramalho 2009). Besides these approaches using learning intensively, one should not neglect approaches using some form of planning. The research presented in Degris et al. (2009) is an interesting example of work combining reinforcement learning with a model itself learned from experience. Other approaches using sophisticated planning techniques such as Hierarchical Task Networks (HTN) have become an important research direction (Hoang et al. 2005), inspiring the game AI of several commercial games such as Total War (Creative Assembly), Airborne Assault (Panther Games), or Armed Assault (Bohemia Interactive).
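The core idea of HTN planning, recursively decomposing compound tasks into primitive actions via methods, can be sketched in a few lines. The wargame domain below is entirely invented for illustration; real HTN planners also handle preconditions, world state, and backtracking over alternative methods:

```python
# Methods map a compound task to one or more ordered lists of subtasks;
# any task without a method is treated as a primitive action.
methods = {
    "win_battle":   [["secure_flank", "assault"]],
    "secure_flank": [["scout", "deploy_infantry"]],
    "assault":      [["artillery_barrage", "advance"]],
}

def plan(task):
    """Depth-first decomposition using the first applicable method
    (no backtracking in this toy version)."""
    if task not in methods:          # primitive action: emit as-is
        return [task]
    result = []
    for sub in methods[task][0]:
        result.extend(plan(sub))
    return result

print(plan("win_battle"))
# prints ['scout', 'deploy_infantry', 'artillery_barrage', 'advance']
```

The appeal for strategy games is visible even in this sketch: designers author strategic knowledge at the level of abstract tasks, while the planner produces the concrete action sequence.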

6.4 New Research Directions for Video Game AI

We have introduced in the previous pages the state of the art and some research directions for video game AI seen as an extension of AI research on classical games. The underlying hypothesis was that the aim is to design a game AI that plays better, that is to say whose performance level approaches, reaches, or even surpasses that of a human player. This objective, so obvious that it is often left unstated, is seriously challenged by the video game domain. Indeed, though until recently the main challenge for the AI researcher was to develop a challenging opponent for the human player, whether in classical games (chess, Go, ...) or modern ones (strategy games), this is no longer the case for many games where machines can now easily beat human players. The challenge for AI research is then somewhat different: the goal is no longer to reach human level, but to propose an opponent, or a game companion, with whom the human player will enjoy the confrontation. This touches on complex issues related to the notions of enjoyment and entertainment, at the heart of gaming, but largely ignored by science and technology until recently. A significant amount of work, at the intersection of AI, psychology and the social sciences, addresses the definition and measurement of player satisfaction and entertainment. Theories coming from aesthetics, literature and cinema on one hand, and experimental work looking at player activity and physiological parameters (heart rate, etc.) to evaluate the level of interest on the other, are two approaches that are sometimes combined. In turn, this can guide the design of the game AI so that the behaviors produced lead to the satisfaction or enjoyment of the player. One specific example of such work is the issue of game balancing, or more accurately of dynamic difficulty adjustment.
How can one proceed so that the game AI plays at the right level whatever its human opponent, and at any time along that player's personal evolution? An important component in this area is the flow theory from psychology (Csikszentmihalyi 1975, 1990), which relates well-being and satisfaction to a good balance between competence level and task difficulty. In particular, Andrade et al. (2006) have proposed an approach to dynamic difficulty adjustment using a method derived from traditional reinforcement learning in which the selected action is not necessarily the one with the highest utility, depending on the estimated difficulty perceived by the player. While learning to play well, agents also learn to adapt their level of play (their performance) by selecting sub-optimal actions when these are better suited to the level of their opponent, the human player. Work on difficulty adjustment is made easier by the availability of objective measures, such as the game score. Other game dimensions are more difficult to evaluate but have a strong impact on the player's perceptions and his/her immersion in a story. Character believability is a good example, especially for adventure and role-playing games, as they usually assume complex interactions (dialogues, negotiation, etc.) between NPCs and PCs. An entire research domain has developed around this question, which attracts interest beyond games, extending to all areas of virtual characters. In this domain, one has to go beyond the goal of having NPCs behave rationally with a high level of performance. To be credible or

believable, what is important is that they have a recognizable personality influencing their actions in the long term, and that they react emotionally in a credible manner to events in their environment and to interactions with other characters. Game NPCs have therefore become an important application area for affective computing (Rickel et al. 2002). Ochs et al. (2009), for example, propose a computational model allowing the simulation of the dynamics of emotions and social relations, taking into account the personality of NPCs. The few research directions outlined above give an idea of the rich scene that game AI research has become over the last decade or so. This list is far from exhaustive. By moving from the simple role of opponent to that of NPC, game AI has extended its domain, and it is now invoked in other areas of game development. Can it contribute more centrally to game design (by crafting stories, for example)? What about stage direction, camera placement and so on? Or game music that depends on game state and player tastes? All these questions currently constitute new frontiers both for AI research and for the game industry.
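Returning to dynamic difficulty adjustment, the idea of deliberately selecting a sub-optimal action matched to the player, in the spirit of Andrade et al. (2006), can be sketched as follows. This is a toy simplification of ours, not their exact method; the action names and utility values are invented:

```python
def select_action(q_values, target_level):
    """Difficulty-adjusted greedy selection: instead of the action with
    the highest estimated utility, pick the action whose utility is
    closest to a target matched to the player's estimated skill.
    target_level in [0, 1]: 0 = weakest play, 1 = full strength."""
    lo, hi = min(q_values.values()), max(q_values.values())
    target = lo + target_level * (hi - lo)
    return min(q_values, key=lambda a: abs(q_values[a] - target))

# Invented utility estimates for a fighting-game NPC.
q = {"feint": 0.2, "block": 0.5, "counter": 0.9}
print(select_action(q, target_level=1.0))   # prints "counter"
print(select_action(q, target_level=0.4))   # prints "block"
```

In a full system, `target_level` would itself be estimated online from signals such as the score difference, so the NPC tracks the player's progression rather than a fixed difficulty setting.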

7 Conclusion

In this chapter we have presented some classical algorithms for programming games: alpha-beta and its improvements for zero-sum two-player games, and A* for puzzles. We have also presented more recent approaches such as Monte-Carlo algorithms and applications of AI to video games. As we have seen, AI in games includes many algorithms and raises many research questions that are relevant beyond the game domain. The most active research areas of recent years include Monte-Carlo methods and video game AI.

References Allis L (1994) Searching for solutions in games and artificial intelligence. PhD thesis, Vrije Universitat Amsterdam Allis L, van der Meulen M, van den Herik H (1994) Proof-number search. Artif Intell 66:91–124 Anantharaman T, Campbell M, Hsu F (1989) Singular extensions: adding selectivity to brute force searching. Artif Intell 43(1):99–109 Andrade G, Ramalho G, Gomes AS, Corruble V (2006) Dynamic game balancing: an evaluation of user satisfaction. In: AAAI conference on artificial intelligence and interactive digital entertainment, pp 3–8 Arneson B, Hayward RB, Henderson P (2010) Monte Carlo tree search in hex. IEEE Trans Comput Intell AI Games 2(4):251–258 Beal D (1990) A generalised quiescence search algorithm. Artif Intell 43:85–98 Berliner H (1979) The B* tree search algorithm: a best-first proof procedure. Artif Intell 12:23–40 Bouzy B (2005) Associating domain-dependent knowledge and Monte Carlo approaches within a Go program. Inf Sci 175(4):247–257 Bouzy B, Cazenave T (2001) Computer Go: an AI oriented survey. Artif Intell 132(1):39–103

Bouzy B, Helmstetter B (2003) Monte-Carlo Go developments. In: ACG, vol 263 of IFIP. Kluwer, pp 159–174 Bruegmann B (1993) Monte Carlo Go. Unpublished Bulitko V, Lustrek M, Schaeffer J, Bjornsson Y, Sigmundarson S (2008) Dynamic control in realtime heuristic search. J Artif Intell Res 32(1):419–452 Campbell M, Marsland T (1983) A comparison of minimax tree search algorithms. Artif Intell 20:347–367 Campbell M, Hoane A, Hsu F-H (2002) Deep blue. Artif Intell 134:57–83 Cazenave T (2003) Metarules to improve tactical Go knowledge. Inf Sci 154(3–4):173–188 Cazenave T (2006a) Optimizations of data structures, heuristics and algorithms for path-finding on maps. In: CIG, pp 27–33 Cazenave T (2006b) A phantom-Go program. Advances in computer games 2005. Lecture notes in computer science, vol. 4250. Springer, Berlin, pp 120–125 Cazenave T (2009) Nested Monte Carlo search. In: IJCAI 2009, Pasadena, USA, pp 456–461 Cazenave T, Helmstetter B (2005) Combining tactical search and Monte-Carlo in the game of Go. IEEE CIG 2005:171–175 Cazenave T, Jouandeau N (2007) On the parallelization of UCT. In: Proceedings of CGW07, pp 93–101 Cazenave T, Teytaud F (2012) Application of the nested rollout policy adaptation algorithm to the traveling salesman problem with time windows. In: Learning and intelligent optimization, pp 42–54 Cazenave T, Saffidine A, Schofield MJ, Thielscher M (2016) Nested Monte Carlo search for twoplayer games. In: Proceedings of the thirtieth AAAI conference on artificial intelligence, Phoenix, Arizona, USA, 12–17 Feb 2016, pp 687–693 Chang HS, Fu MC, Hu J, Marcus SI (2005) An adaptive sampling algorithm for solving Markov decision processes. Oper Res 53(1):126–139 Chaslot G, Saito J, Bouzy B, Uiterwijk JWHM, van den Herik HJ (2006) Monte-Carlo strategies for computer Go. 
In: Schobbens P-Y, Vanhoof W, Schwanen G (eds) Proceedings of the 18th BeNeLux conference on artificial intelligence, Namur, Belgium, pp 83–91 Chaslot G, Winands M, Uiterwijk J, van den Herik H, Bouzy B (2008) Progressive strategies for Monte-Carlo tree search. New Math Nat Comput 4(3):343–357 Chou C-W, Teytaud O, Yen S-J (2011) Revisiting Monte Carlo tree search on a normal form game: Nogo. In: European conference on the applications of evolutionary computation. Springer, Berlin, pp 73–82 Corruble V, Ramalho G (2009) Jeux vidéo et Systèmes Multi-Agents. IC2 Series. Hermès Lavoisier, pp 235–264. ISBN: 978-2-7462-1785-0 Couetoux A (2013) Monte Carlo tree search for continuous and stochastic sequential decision making problems. PhD thesis, Université Paris Sud-Paris XI Coulom R (2006) Efficient selectivity and backup operators in Monte-Carlo tree search. In: Ciancarini P, van den Herik HJ (eds) Proceedings of the 5th international conference on computers and games, Turin, Italy Coulom R (2007) Computing Elo ratings of move patterns in the game of Go. ICGA J 4(30):198–208 Csikszentmihalyi M (1975) Beyond boredom and anxiety. Jossey-Bass San Francisco Csikszentmihalyi M (1990) Flow: the psychology of optimal experience. Harper and Row, New York Culberson JC, Schaeffer J (1998) Pattern databases. Comput Intell 4(14):318–334 De Mesmay F, Rimmel A, Voronenko Y, Püschel M (2009) Bandit-based optimization on graphs with application to library performance tuning. In: ICML, Montréal Canada Degris T, Sigaud O, Wuillemin P (2009) Apprentissage par renforcement factorisé pour le comportement de personnages non joueurs. Revue d’Intelligence Artificielle 23(2):221–251 Donninger C (1993) Null move and deep search: selective search heuristics for obtuse chess programs. ICCA J 16(3):137–143

Felner A, Korf RE, Hanan S (2004) Additive pattern database heuristics. J Artif Intell Res (JAIR) 22:279–318 Felner A, Korf RE, Meshulam R, Holte RC (2007) Compressed pattern databases. J Artif Intell Res (JAIR) 30:213–247 Finley L (2016) Nested Monte Carlo tree search as applied to Samurai Sudoku Finnsson H, Björnsson Y (2008a) Simulation-based approach to general game playing. AAAI 8:259–264 Finnsson H, Björnsson Y (2008b) Simulation-based approach to general game playing. In: AAAI, pp 259–264 Flórez-Puga G, Gómez-Martın M, Dıaz-Agudo B, González-Calero P (2008) Dynamic expansion of behaviour trees. In: AAAI conference on artificial intelligence and interactive digital entertainment, pp 36–41 Gelly S, Wang Y, Munos R, Teytaud O (2006) Modification of UCT with patterns in Monte Carlo Go. Rapport de recherche INRIA RR-6062 Gelly S, Hoock JB, Rimmel A, Teytaud O, Kalemkarian Y (2008) The parallelization of Monte-Carlo planning. In: Proceedings of the international conference on informatics in control, automation and robotics (ICINCO 2008), pp 198–203. To appear Greenblatt R, Eastlake D, Crocker S (1967) The Greenblatt chess program. In: Fall joint computing conference, vol 31. ACM, New York, pp 801–810 Harel D (1987) Statecharts: a visual formalism for complex systems Hart P, Nilsson N, Raphael B (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE Trans Syst Sci Cybernet 4(2):100–107 Hoang H, Lee-Urban S, Muñoz-Avila H (2005) Hierarchical plan representations for encoding strategic game AI. In: Proceedings of artificial intelligence and interactive digital entertainment conference (AIIDE-05) Ierusalimschy R, de Figueiredo LH, Celes W (2007) The evolution of LUA. In: HOPL III: proceedings of the third ACM SIGPLAN conference on history of programming languages, ACM, New York, NY, USA, pp 2-1-2-26 Junghanns A (1998) Are there practical alternatives to alpha-beta? 
ICCA J 21(1):14–32 Junghanns A, Schaeffer J (2001) Sokoban: enhancing general single-agent search methods using domain knowledge. Artif Intell 129(1–2):219–251 Kendall G, Parkes A, Spoerer K (2008) A survey of NP-complete puzzles. ICGA J 31(1):13–34 Knuth D, Moore R (1975) An analysis of alpha-beta pruning. Artif Intell 6:293–326 Kocsis L, Szepesvari C (2006) Bandit-based Monte-Carlo planning. In: ECML’06, pp 282–293 Korf R (1985a) Depth-first iterative deepening: an optimal admissible tree search. Artif Intell 27:97– 109 Korf RE (1985b) Depth-first iterative-deepening: an optimal admissible tree search. Artif Intell 27(1):97–109 Korf RE (1997) Finding optimal solutions to Rubik’s cube using pattern databases. In: AAAI-97, pp 700–705 Korf R, Chickering D (1994) Best-first search. Artif Intell 84:299–337 Kozelek T (2009) Methods of MCTS and the game arimaa Laird JE (2002) Research in human-level AI using computer games. Commun ACM 45(1):32–35 Lee C-S, Wang M-H, Chaslot G, Hoock J-B, Rimmel A, Teytaud O, Tsai S-R, Hsu S-C, Hong T-P (2009) The computational intelligence of MOGO revealed in Taiwan’s computer Go tournaments. In: IEEE transactions on computational intelligence and AI in games Madeira C, Corruble V (2009) Strada: une approche adaptative pour les jeux de stratégie modernes. Revue d’Intelligence Artificielle 23(2):293–326 Maîtrepierre R, Mary J, Munos R (2008) Adaptative play in texas hold’em poker. In: European conference on artificial intelligence-ECAI Marsland T (1986) A review of game-tree pruning. ICCA J 9(1):3–19 McAllester D (1988) Conspiracy numbers for min-max search. Artif Intell 35:287–310 Müller M (2002) Computer Go. Artif Intell 134(1–2):145–179

Natkin S (2004) Jeux vidéo et médias du XXIe siècle: quels modèles pour les nouveaux loisirs numériques? Vuibert Nijssen P, Winands MH (2012) Monte Carlo tree search for the hide-and-seek game Scotland yard. IEEE Trans Comput Intell AI Games 4(4):282–294 Nijssen J, Winands MH (2013) Search policies in multi-player games 1. Icga J 36(1):3–21 Ochs M, Sabouret N, Corruble V (2009) Simulation de la dynamique des emotions et des relations sociales de personnages virtuels. Revue d’Intelligence Artificielle 23(2):327–358 Paxton C, Raman V, Hager GD, Kobilarov M (2017) Combining neural networks and tree search for task and motion planning in challenging environments. arXiv:1703.07887 Pearl J (1980a) Asymptotic properties of minimax trees and game-searching procedures. Artif Intell 14:113–138 Pearl J (1980b) SCOUT: a simple game-searching algorithm with proven optimal properties. In: Proceedings of the first annual national conference on artificial intelligence, pp 143–145 Perlin K (2005) Toward interactive narrative. In: International conference on virtual storytelling. Springer, Berlin, pp 135–147 Pitrat J (1968) Realization of a general game-playing program. IFIP Congr 2:1570–1574 Plaat A, Schaeffer J, Pils W, de Bruin A (1996) Best-first fixed depth minimax algorithms. Artif Intell 87:255–293 Rabin S (2002) AI game programming wisdom. Charles River Media, USA Rickel J, Marsella S, Gratch J, Hill R, Traum D, Swartout W (2002) Toward a new generation of virtual humans for interactive experiences. In: IEEE intelligent systems, pp 32–38 Rivest R (1988) Game-tree searching by min-max approximation. Artif Intell 34(1):77–96 Rolet P, Sebag M, Teytaud O (2009a) Optimal active learning through billiards and upper confidence trees in continuous domains. In: Proceedings of the ECML conference Rolet P, Sebag M, Teytaud O (2009b) Optimal robust expensive optimization is tractable. 
In: Gecco 2009, ACM, Montréal Canada, p 8 Romein JW, Bal HE (2003) Solving awari with parallel retrograde analysis. IEEE Comput 36(10):26–33 Rosin CD (2011) Nested rollout policy adaptation for Monte Carlo tree search. In: Ijcai, pp 649–654 Schadd MP, Winands MH, Van Den Herik HJ, Chaslot GM-B, Uiterwijk JW (2008) Single-player Monte Carlo tree search. In: International conference on computers and games. Springer, Berlin, pp 1–12 Schaeffer J (1989) The history heuristic and alpha-beta search enhancements in practice. IEEE Trans Pattern Anal Mach Intell 11(11):1203–1212 Schaeffer J (1990) Conspiracy numbers. Artif Intell 43:67–84 Schaeffer J, van den Herik J (2002) Games, computers, and artificial intelligence. Artif Intell 134:1–7 Schaeffer J, Burch N, Bjornsson Y, Kishimoto A, Muller M, Lake R, Lu P, Sutphen S (2007) Checkers is solved. In: Science Shannon C (1950) Programming a computer to play Chess. Philos Mag 41:256–275 Sheppard B (2002) World-championship-caliber scrabble scrabble® is a registered trademark. All intellectual property rights in and to the game are owned in the USA by Hasbro Inc. in Canada by Hasbro Canada Corporation, and throughout the rest of the world by JW Spear & Sons Limited of Maidenhead, Berkshire, England, a subsidiary of Mattel Inc. Artif Intell 134(1-2):241–275 Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M et al (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489 Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, Sifre L, Kumaran D, Graepel T et al (2017a) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv:1712.01815 Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A et al (2017b) Mastering the game of go without human knowledge. Nature 550(7676):354

Slate D, Atkin L (1977) Chess 4.5 - the Northwestern University chess program. In: Frey P (ed) Chess skill in man and machine. Springer, Berlin, pp 82–118 Spronck P, Ponsen M, Sprinkhuizen-Kuyper I, Postma E (2006) Adaptive game AI with dynamic scripting. Mach Learn 63(3):217–248 Stockman G (1979) A minimax algorithm better than alpha-beta? Artif Intell 12:179–196 Sturtevant NR, Felner A, Barrer M, Schaeffer J, Burch N (2009) Memory-based heuristics for explicit state spaces. In: IJCAI, pp 609–614 Teytaud F, Teytaud O (2009) Creating an upper-confidence-tree program for Havannah. In: Advances in computer games. Springer, Berlin, pp 65–74 Thompson K (1996) 6-piece endgames. ICCA J 19(4):215–226 Tian Y, Ma J, Gong Q, Sengupta S, Chen Z, Zitnick CL (2018) ELF OpenGo. https://github.com/pytorch/ELF von Neumann J, Morgenstern O (1944) Theory of games and economic behavior. Princeton University Press, Princeton Williams RJ (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 8(3–4):229–256 Zobrist A (1990) A new hashing method with application for game playing. ICCA J 13(2):69–73

Designing Algorithms for Machine Learning and Data Mining Antoine Cornuéjols and Christel Vrain

Abstract Designing Machine Learning algorithms requires answering three main questions. First, what is the space H of hypotheses or models of the data that the algorithm considers? Second, what is the inductive criterion used to assess the merit of a hypothesis given the data? Third, given the space H and the inductive criterion, how is the exploration of H carried out in order to find as good a hypothesis as possible? Any learning algorithm can be analyzed along these three questions. This chapter focuses primarily on unsupervised learning on the one hand, and supervised learning on the other. For each, the foremost problems are described as well as the main existing approaches. In particular, the interplay between the structure that can be endowed over the hypothesis space and the optimisation techniques that can consequently be used is underlined. We cover especially the major existing methods for clustering: prototype-based, generative-based, density-based, spectral-based, hierarchical, and conceptual, and we review the available validation techniques. For supervised learning, the generative and discriminative approaches are contrasted, and a wide variety of linear methods, in which we include Support Vector Machines and Boosting, are presented. Multi-layer neural networks and deep learning methods are discussed. Some additional methods are illustrated, and we describe other learning problems including semi-supervised learning, active learning, online learning, transfer learning, learning to rank, learning recommendations, and identifying causal relationships. We conclude this survey by suggesting new directions for research.

A. Cornuéjols (B) UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, 75005 Paris, France e-mail: [email protected]
C. Vrain LIFO, EA 4022, University of Orléans, 45067 Orléans, France e-mail: [email protected]
© Springer Nature Switzerland AG 2020 P. Marquis et al. (eds.), A Guided Tour of Artificial Intelligence Research, https://doi.org/10.1007/978-3-030-06167-8_12

1 Introduction

Machine Learning is the science of, on one hand, discovering the fundamental laws that govern the act of learning and, on the other hand, designing machines that learn from experience, in the same way as physics is both the science of uncovering the

laws of the universe and of providing knowledge to make, in a very broad sense, machines. Of course, "understanding" and "making" are tightly intertwined, in that progress in one aspect generally benefits the other. But a Machine Learning scientist may feel more comfortable and more interested at one end of the spectrum that goes from theorizing about Machine Learning to building Machine Learning systems. Machine Learning is tied to data science because it is fundamentally the science of induction, which tries to uncover general laws and relationships from a set of data. However, it is as interested in understanding how it is possible to learn from very few examples, as when you learnt how to avoid a "fork" in chess from a single experience, as in how to make sense of large amounts of data. Thus, "big data" is not synonymous with Machine Learning. In this chapter, we choose not to dwell upon the problems and techniques associated with gathering data and carrying out all the necessary preprocessing phases. We will mostly assume that this has been done in such a way that looking for patterns in the data will not be too compromised by imperfections of the data at hand. Of course, any practitioner of Machine Learning knows that this is a huge assumption and that the corresponding work is of paramount importance. Before looking at what a science of designing learning algorithms can be, it is interesting to consider basic classical Machine Learning scenarios.

2 Classical Scenarios for Machine Learning

A learning scenario is defined by the exchanges between the learner and its environment. Usually, this goes hand in hand with the target task given to the system. In supervised learning, the learner receives from the environment a set of examples S = {(xi , yi )}1≤i≤m , each composed of a set of explanatory or input variables xi and of output variable(s) yi , whose value must be predicted when the explanatory variables are observed. The goal for the learner is to be able to make predictions about the output values given input values. For example, the learner may receive data about patients registered in a hospital, in the form of pairs (measures made on the patient, diagnosis), and aims at being able to give a correct diagnosis for newly arriving patients on whom measurements are available. By contrast, the objective of unsupervised learning is not to make predictions from input values to output values, but to reveal possible hidden structures in the data set, S = {x1 , . . . , xm }, or to detect correlations between the variables. While these putative structures or regularities may sometimes be extrapolated to other data collections, this is not the primary goal of unsupervised learning. A third type of learning, of growing importance, is reinforcement learning (see chapter "Reinforcement Learning" of Volume 1). There, the learner acts in the environment and therefore must be able to decide on the action to take in each successive state encountered in its peregrinations. The difficulty is that the learner receives reinforcement signals, positive or negative, only from time to time, sometimes long after the action that triggered them has been performed. It is therefore not easy to determine which actions are the best in each possible state. The goal of the learner is to maximize the

cumulative reinforcement over time, even though the credit assignment problem is hard to solve. Reinforcement learning is at the core of the famous AlphaGo system that beat one of the strongest Go players in the world in March 2016, and is now believed to far outclass any human player (see chapter "Artificial Intelligence for Games" of this volume). One important distinction is between descriptive learning and predictive learning. Descriptive learning aims at finding regularities in the data in the hope that they may help to better understand the phenomenon under study. For instance, descriptive learning may uncover different groups in a population of customers, which may in turn help to understand their behavior and suggest different marketing strategies. Descriptive learning is strongly linked to unsupervised learning. Predictive learning is concerned with finding rules that allow one to make a prediction when given a new instance. Predictive learning is therefore inherently extrapolative. Its justification lies in making predictions for new instances, while descriptive learning is turned towards the examples at hand, and not, at least directly, towards telling something about new instances. Predictive learning is tightly associated with supervised learning. Sometimes a third type of learning, called prescriptive learning, is mentioned. The goal there is to extract information about the levers or control actions that would allow one to alter the course of some phenomenon (e.g. climatic change, or the diet of the general population). Generally, gaining control means that causal factors have been identified. And this is not the same as being able to predict some events based on the occurrence of some other ones, which might be done by discovering correlations between events. Therefore, special techniques and specific forms of knowledge have to be called upon to confront this challenge.
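The supervised scenario described above, learning a decision procedure from a sample of (input, output) pairs, can be made concrete with the simplest possible learner, one-nearest-neighbour. The toy "hospital" data below are invented for illustration:

```python
def learn_1nn(sample):
    """Supervised learning in miniature: from S = [(x_i, y_i), ...] build
    a decision procedure h that returns the label of the closest seen x
    (one-dimensional inputs, absolute-value distance)."""
    def h(x):
        nearest = min(sample, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return h

# Toy data: (body temperature, diagnosis)
S = [(36.8, "healthy"), (37.0, "healthy"), (39.2, "ill"), (40.1, "ill")]
h = learn_1nn(S)
print(h(36.5))  # prints "healthy"
print(h(39.5))  # prints "ill"
```

Even this toy example displays the three ingredients discussed in this chapter: a hypothesis space (all functions representable by a stored sample plus a distance), an implicit inductive criterion (agree with the closest example), and a search procedure (here trivial, since "learning" is just memorizing S).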

2.1 The Outputs of Learning

It is useful to clarify what the output of learning is, since it can change depending on the application. A learning algorithm A can be seen as a machine that takes as input a data set S and produces as output either a model (loosely speaking) of the world M or a decision procedure h, formally A : S → M or h. In this last case, the decision procedure h is able to associate an output y ∈ Y to any input x ∈ X , so we have h : x ∈ X → y ∈ Y . We thus naturally speak of different outputs, either a model M or a function h, without using different words; and the decision function h itself outputs a prediction y given any input x. The right interpretation of the word output is provided by the context, and the reader should always be careful about the intended meaning. In the remainder of this section, output will mean the output of the learning algorithm A . One important distinction is between the generative and the discriminative models or decision functions. In the generative approach, one tries to learn a (parametric) probability distribution pX over the input space X . If learning a precise enough probability distribution is successful, it becomes possible in principle to generate further examples x ∈ X


A. Cornuéjols and C. Vrain

whose distribution is indistinguishable from the true underlying distribution. The learnt distribution pX can be used as a model of the data in the unsupervised regime, or as the basis of a decision function using the maximum a posteriori criterion (see Sect. 3.3 below). Some say that this makes the generative approach “explicative”. This is only true insofar as a distribution function provides an explanation; not everyone would agree on this. The discriminative approach does not try to learn a model that allows the generation of more examples. It contents itself with providing either means of deciding, in the supervised mode, or means of expressing some regularities in the data set, in the unsupervised mode. These regularities or decision functions can be expressed as logical rules, graphs, neural networks, etc. While they do not allow the generation of new examples, they can nonetheless be much more interpretable than probability distributions. Furthermore, as Vapnik, a giant in this field, famously said: “If you are limited to a restricted amount of information, do not solve the particular problem you need by solving a more general problem” (Vapnik 1995, p. 169). In other words, if you need to make predictions or to summarize a data set, it may not be wise to look first for a generative model, which generally requires large quantities of data. Now, a decision function might provide a yes or no answer when someone would like an associated confidence score. The generative techniques are generally able to provide this uncertainty level rather naturally, while the discriminative techniques must be adapted, often through some heuristic means.
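To make the generative route concrete, here is a minimal sketch, not taken from the chapter: class priors p(y) and one-dimensional Gaussian class-conditionals p(x|y) are estimated from a toy sample, and prediction uses the maximum a posteriori criterion. All data and names are illustrative.

```python
import math

# Toy 1-D training data: (x, label) pairs (illustrative only).
data = [(1.0, 0), (1.2, 0), (0.8, 0), (3.0, 1), (3.2, 1), (2.9, 1)]

def fit_gaussian(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, max(var, 1e-9)  # guard against zero variance

# Estimate the prior p(y) and the class-conditional p(x|y) per class.
classes = sorted({y for _, y in data})
priors, params = {}, {}
for c in classes:
    xs = [x for x, y in data if y == c]
    priors[c] = len(xs) / len(data)
    params[c] = fit_gaussian(xs)

def log_pdf(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(x):
    # Maximum a posteriori: argmax over y of p(y) * p(x|y)
    return max(classes, key=lambda c: math.log(priors[c]) + log_pdf(x, *params[c]))
```

Because the class-conditional densities are explicit, the same model could also generate artificial examples, which is exactly what a purely discriminative decision function cannot do.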

2.2 The Inputs of Learning

According to the discussion about the possible outputs of learning, the inputs can be either a whole data set S, or a particular instance x for which one wants a prediction y. Here, we list some possible descriptions that the elements x ∈ X can take depending on the application domain.
1. Vectorial. This is generally the case when the data is taken from a relational database. The descriptors are then considered as dimensions of an input space, which is often treated as a vectorial space. In addition, when a distance is defined, we get a normed vectorial space: this is very convenient, since many mathematical techniques have been devised for such spaces.
2. Non vectorial. This is the case, for instance, when the number of internal elements of an example is not fixed: genomes or documents have an undefined number of elements (e.g. nucleotides or words). It is then more difficult to define proper distances, but in most cases adapted distances have been defined.
3. Structured data. In this case, one can exploit the internal structure of the data points, thus adding new structural features. It is possible to further distinguish:

Designing Algorithms for Machine Learning and Data Mining


• Sequential data. In sequential data, the description of a data point is composed of ordered elements: the individual measurements cannot be exchanged without changing the information content. A time series, for instance, can be characterized by some trend or some periodic variations.
• Spatial data. Spatial data exhibit a spatial structure, expressing dependencies between adjacent or distant elements of the description.
• Graphical data. Some data, like social networks, are best described as graphs, with directed or undirected, weighted or unweighted edges.
• Relational data. More generally, complex structures, like molecules or textual documents, may need to be described using formalisms close to first-order logic.
Some data, like videos, combine several descriptions, for instance being both sequentially and spatially organized. It must be emphasized that finding the appropriate description of the data is very often a tricky task, which requires skilled experts. This must be done before further data-preparation techniques are put to work, such as identifying the principal components or selecting the most informative descriptors.

3 Designing Learning Algorithms

3.1 Three Questions that Shape the Design of Learning Algorithms

Although predictive and descriptive algorithms have different goals and different success criteria, the overall approach to their design is similar, and it can be cast as the answer to three main questions.
1- What type of regularities is of interest for the expert in the data? In unsupervised learning, this question is paramount since the goal of descriptive learning is to uncover structures in the data. This question seems less central in supervised learning, where the first concern is to make predictions and, it could be said, whatever the means. However, even in the supervised setting, the type of decision function that one is ready to consider in order to make predictions determines the type of algorithm that is appropriate.
2- What is the performance criterion that one wants to optimize? In the early days of machine learning, algorithms were mostly designed in an algorithmic way: the algorithm had to fulfil a function and, if it did, preferably with clever procedures, all was well. The approach nowadays is different. One starts by specifying a performance criterion, which evaluates the quality of the model or of the hypothesis learned. In supervised learning, the criterion takes into account the fit to the training data plus a component that expresses how much the hypothesis satisfies some prior bias. In unsupervised learning, the criterion conveys how much the structure discovered in the data matches the kind of regularities one is expecting.


3- How to organize the search in the space of possible structures? Learning is viewed as a search procedure in a space of possible structures, given a performance criterion. Once a space of possibilities and a performance measure have been decided upon, it is time to devise an algorithm that is able to search this space of possibilities, called the search space, efficiently, in order to find an element with the best, or at least a good, value of the performance measure. This is where computer science comes to the fore. In the remainder of this section, we stay at a general level of description; details about possible answers to these questions are given in later sections.

3.2 Unsupervised Learning

3.2.1 The Problems of Unsupervised Learning

Unsupervised learning works on a dataset described in an input space X and aims at understanding the underlying structure of the data and of the representation space w.r.t. this data. Two kinds of problems are considered: the first one, clustering, aims at finding an organization of the data into classes, whereas the second one aims at finding dependencies, such as correlations, between variables.
• Given a dataset, unsupervised classification, also called clustering, aims at finding a set of compact and well separated clusters, which means that objects in the same cluster are similar (their pairwise distance is low) and objects in different clusters are dissimilar (their pairwise distance is large). This set of clusters can form a partition of the data (partitioning problem) or it can be organized into a hierarchy (hierarchical clustering). The problem is usually modeled by an optimization criterion specifying properties the output must satisfy. For instance, in partitioning problems, the most used optimization criterion, popularized by the k-means algorithm, is the minimization of the sum of the squared distances of the points to the center of the cluster they belong to. Given m objects x_1, …, x_m, usually in R^d, let {C_1, …, C_k} denote the k clusters and let μ_j denote the centroid of the cluster C_j; the criterion is written

$$\sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - \mu_j \|_2^2 .$$
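As a concrete illustration, here is a minimal sketch of Lloyd's k-means heuristic, which locally minimizes this criterion; the data is a toy 1-D example and all names are illustrative.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate an assignment step (each point to
    its nearest centroid) and an update step (each centroid to the mean
    of its cluster), a heuristic that decreases the within-cluster sum
    of squared distances. 1-D points for brevity."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:                      # assignment step
            j = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            clusters[j].append(x)
        for j, cl in enumerate(clusters):     # update step
            if cl:
                centroids[j] = sum(cl) / len(cl)
    return sorted(centroids)

# Two well-separated groups, around 0 and around 10.
pts = [0.1, -0.2, 0.0, 9.9, 10.1, 10.0]
centers = kmeans(pts, 2)
```

Each iteration can only decrease the criterion, but the final partition depends on the random initialization, which is exactly the local-optimum behavior discussed below.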

This domain has received a lot of attention, and the framework has been extended in several directions. For instance, requiring a partition of the data can be too restrictive. As an example, in document classification, a text can be labelled with several topics and thus could be classified into several clusters. This leads to soft clustering (in opposition to hard clustering), including fuzzy clustering where a point belongs to a cluster with a degree of membership, or overlapping clustering where an object can belong to several clusters. Fuzzy-c-means (Dunn 1973; Bezdek 1981) for instance is an extension of k-means for fuzzy clustering. Data can also be
described by several representations (for instance, texts and images for a webpage), thus leading to multi-view clustering, which aims at finding a consensus clustering between the different views of the data. Classic clustering methods are usually heuristic and search for a local optimum, and different local optima may exist. Depending on the representation of the data, on the choice of the dissimilarity measure, on the chosen method and on the parameter setting, many different results can be produced. Which one is the correct partition, the one the expert expects? To overcome this problem, two directions have been taken. The first one, initiated in (Wagstaff and Cardie 2000), integrates user knowledge, written as constraints, into the clustering process. Constraints can be put on pairs of points: a must-link constraint between two points requires these two points to be in the same cluster, whereas a cannot-link constraint between two points requires them to be in different clusters. This has led to a new field, called Constrained Clustering, that is presented in chapter “Constrained Clustering: Current and New Trends” of this volume. The second solution is to generate many partitions and then to combine them into a hopefully more robust partition; this research direction is called cluster ensembles (Vega-Pons and Ruiz-Shulcloper 2011). Another domain related to clustering is bi-clustering, also called co-clustering: given a matrix M, bi-clustering aims at finding simultaneously a clustering of rows and of columns. Its goal is therefore to identify blocks, or biclusters, composed of rows and columns and satisfying a given property: for instance, the elements in a block are constant, or the elements in each row of the block are constant. See for instance (Busygin et al. 2008; Madeira and Oliveira 2004) for an introduction to co-clustering.
• Mining interesting patterns was introduced in the 90s by (Agrawal and Srikant 1994) and has attracted growing interest since then.
The initial problem was to find association rules, modeling a dependency (A_1 ∧ ⋯ ∧ A_p) → (B_1 ∧ ⋯ ∧ B_q) between two sets of variables. The interest of the dependency was measured by two criteria: the support, defined as the proportion of the population satisfying both sets of variables, and the confidence, an estimation of P(B_1 ∧ ⋯ ∧ B_q | A_1 ∧ ⋯ ∧ A_p), measured by the proportion of the population satisfying the conclusion of the rule among those satisfying the conditions. Unsupervised learning has to face two important problems: controlling the complexity of the algorithms and ensuring the validation of the results. Indeed, the number of partitions of a set of m elements is given by the Bell number B_m and, when the number k of clusters is fixed, by the Stirling number of the second kind S(m, k).¹ When mining frequent patterns, the complexity is linked to the size of the search space (in the case of a Boolean dataset, the size of the search space is 2^d, where d is the number of Boolean attributes) and to the size of the dataset (computing the frequency of a pattern requires considering all points of the dataset). This explains why many works in pattern mining have tried to reduce the complexity by pruning the search space and/or changing the representation of the database. From the perspective of finding an actual solution, clustering is often defined as an optimization problem, e.g. finding the best partition given an optimization criterion, whereas pattern mining is an enumeration problem, e.g. finding all the patterns satisfying some given constraints. Another important difficulty for unsupervised learning is the problem of validation. It is well known that the notion of confidence for association rules can be misleading: it only measures the probability of the conclusion of the rule given the condition (P(B|A) for a rule A → B) but it does not measure the correlation between A and B, and it is possible to have a rule A → B with a high confidence despite a negative correlation between A and B. Therefore, different measures have been introduced for assessing the interest of a rule (see for instance Han et al. 2011). Moreover, this domain has to confront the large number of patterns that can be generated, which makes an expert evaluation difficult. Regarding clustering, the problem is aggravated: how can we validate an organization of data into classes, when, in addition, it can depend on the point of view? Classically, two kinds of evaluation are performed: either relying on an internal criterion or evaluating the classification against a ground truth.

¹ $S(m, k) = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} j^m$.
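The support and confidence measures, and the pitfall that a high-confidence rule can still hide a negative correlation, can be made concrete with a few lines of code; the transactions below are a toy illustration, not data from the chapter.

```python
# Transactions as sets of items (toy, illustrative data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # Proportion of transactions containing all items of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Estimate of P(consequent | antecedent).
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    # > 1: positive correlation; < 1: negative correlation, possibly
    # despite a high confidence (the pitfall mentioned in the text).
    return confidence(antecedent, consequent) / support(consequent)
```

On this toy data the rule {bread} → {milk} has confidence 2/3 yet a lift below 1: buying bread is actually negatively correlated with buying milk.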

3.2.2 Approaches to Unsupervised Learning

Pattern mining and clustering are two distinct tasks, each with its own family of methods: the first works at the attribute level whereas the second works at the data level. Nevertheless, they interact in the domain of conceptual clustering, which addresses the problem of finding an organization of data into concepts, where a concept is defined by two components: the extent of the concept, a subset of observations belonging to this concept, and the intent of the concept, a subset of attributes satisfied by the elements of the concept.
Distance-based/similarity-based clustering. In distance-based clustering, the notion of dissimilarity between pairs of points is fundamental. When points are described by real features, the Euclidean distance (||.||_2) is generally considered, but when dealing with more complex data (texts, images) the identity (d(x, y) = 0 if and only if x = y) and triangle inequality properties can be difficult to enforce, and therefore a dissimilarity measure must be defined. Many clustering tasks are defined as optimization problems. But because of the complexity of most optimization criteria, there exist only a few exact methods, either exploring the search space by branch-and-bound strategies or based on declarative frameworks such as Integer Linear Programming or Constraint Programming (see Sect. 4.1). Methods are mainly heuristic and search for a local optimum. The heuristic nature of the methods depends on several factors:
• Initialization of the method: some methods, such as k-means or k-medoids, start from a random initialization and search for a local optimum that depends on the initial choice.

Designing Algorithms for Machine Learning and Data Mining

347

• Search strategy: usually, the optimization of the criterion relies on a gradient descent procedure, which is prone to converge to a local optimum.
Another family of clustering methods is based not on an optimization criterion but on the notion of density (see Sect. 4.4). This is illustrated by DBSCAN (Ester et al. 1996), which relies on the notion of core points: a core point is a point whose neighborhood is dense. The important notions are then the neighborhood, defined as a ball of a given radius around the point, and the density, specified by a minimum number of points in the neighborhood. Relying on core points, dense regions can be built. One advantage of such approaches is that they allow finding clusters of various shapes.
Spectral clustering (see Sect. 4.5) is the most recent family of methods. It is based on a similarity graph, where each point is represented by a vertex and edges between vertices are weighted by the similarity between the points. The graph is not fully connected: for instance, edges can be removed when the similarity between their endpoints is less than a given threshold. The unnormalized graph Laplacian L of the graph has an interesting property: the multiplicity of the eigenvalue 0 of L is equal to the number of connected components in the similarity graph. This property leads to different algorithms. Classically, the eigenvectors associated with the k smallest eigenvalues of the Laplacian are computed, inducing a change in the data representation, and a k-means procedure is then applied on the new representation. It has been shown that there are tight links between spectral clustering, Non-negative Matrix Factorization, kernel k-means and some variants of min-cut problems (see Sect. 4.5). Finally, generative models that aim at modeling the underlying distribution p(x) of the data can be applied.
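The Laplacian property just mentioned can be checked numerically on a small hand-built similarity graph; this sketch assumes NumPy is available, and the graph is purely illustrative.

```python
import numpy as np

# Similarity graph with two obvious components: {0, 1, 2} and {3, 4}.
# W[i, j] is the edge weight between vertices i and j (0 = no edge).
W = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # unnormalized graph Laplacian

eigvals = np.linalg.eigvalsh(L)
n_zero = int(np.sum(np.abs(eigvals) < 1e-8))
# n_zero equals the number of connected components of the graph (2 here).
```

In a spectral clustering pipeline, the eigenvectors associated with the smallest eigenvalues would then serve as the new representation fed to k-means.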
For clustering, the data are modeled by a mixture of Gaussians and the parameters are learned, usually by maximizing the log-likelihood of the data under the assumption that the examples are i.i.d. The EM algorithm is the most widely used algorithm in this context.
Conceptual clustering. Conceptual clustering was first introduced in (Michalski 1980). The aim is to learn concepts, where a concept is defined by a set of objects (the extent of the concept) and a description (the intent of the concept). It is well illustrated by the system Cobweb (Fisher 1987), which incrementally builds a hierarchy of concepts where a concept C is described by the quantities P(X_i = v|C) for each feature X_i and each possible value v this feature can take. Interest in conceptual clustering has been revived over the last decade by the emergence of pattern mining. Indeed, an important notion in itemset mining is that of closed itemsets, where a closed itemset is a set of items that forms a concept, as defined in Formal Concept Analysis (see Sect. 7.1.4). Once interesting closed itemsets, and therefore concepts, have been found, it becomes natural to study the organization of (some of) these concepts into a structure, leading to a partition of the data or to a hierarchical organization of the data.
Pattern mining. The problem introduced in (Agrawal and Srikant 1994) was mining association rules in the context of transactional databases: given a predefined set of items I, a transaction is defined by a subset of items, called an itemset. A
transactional database can also be represented by means of |I| Boolean features, each feature X_i representing the presence (X_i = 1) or absence (X_i = 0) of the corresponding item in the transaction. In this context a pattern is a conjunction of items, represented by an itemset. Mining association rules is divided into two steps: first mining the frequent patterns, i.e. those with a support greater than a predefined threshold, and then building the association rules from these frequent patterns. The first step is the most time-consuming, with a complexity depending on the number of features and on the number of observations: in the Boolean case, the size of the search space is 2^d, where d is the number of Boolean features, and the evaluation of the patterns (e.g. computing their supports) requires going through the entire database. Many algorithms, as for instance Eclat (Zaki 2000), FP-Growth (Han et al. 2000) or LCM (Uno et al. 2004), have been developed to improve the efficiency of pattern mining, relying on different representations of the database and/or on different search strategies. The closedness of an itemset is an important property, since the set of frequent closed itemsets forms a condensed representation of the set of frequent itemsets, requiring less memory to store. Some methods find all the frequent patterns; others search only for maximal frequent itemsets or only for closed frequent itemsets (Bastide et al. 2000). Pattern mining has also been developed for handling more complex databases containing structured data such as sequences or graphs.
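The levelwise pruning idea underlying these frequent-pattern miners can be sketched in a few lines; this is an Apriori-style illustration (not the Eclat, FP-Growth or LCM algorithms cited above), on a toy transactional database.

```python
from itertools import combinations

transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]

def frequent_itemsets(transactions, min_support):
    """Levelwise (Apriori-style) search exploiting anti-monotonicity:
    a (k+1)-itemset can be frequent only if all its k-subsets are."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    freq, level = {}, [frozenset([i]) for i in items]
    while level:
        # Count the supports of the current candidates in one pass.
        counts = {c: sum(c <= t for t in transactions) for c in level}
        kept = {c: s / n for c, s in counts.items() if s / n >= min_support}
        freq.update(kept)
        # Generate next-level candidates whose k-subsets are all frequent.
        keys = sorted(kept, key=sorted)
        level = []
        for a, b in combinations(keys, 2):
            cand = a | b
            if len(cand) == len(a) + 1 and cand not in level \
               and all(cand - {x} in kept for x in cand):
                level.append(cand)
    return freq
```

With min_support = 0.5 on this database, all singletons and pairs are frequent but the triple {a, b, c} is pruned, illustrating how anti-monotonicity cuts the 2^d search space.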

3.3 Supervised Learning

Supervised learning aims at finding prediction rules from an input space X (the descriptions of examples, or situations of the world) to an output space Y (the decisions to be made). The goal is to make predictions about the output values given input values, and this is done through the learning of a decision procedure h : X −→ Y using a training sample S = {(x_1, y_1), …, (x_m, y_m)}. Both the input space and the output space can take various forms according to the task at hand. Classically, the input space often resulted from extractions from a relational database, therefore taking the form of vectorial descriptions (e.g. age, gender, revenue, number of dependents, profession, …). Recently, non vectorial input spaces have become fashionable, like texts, images, videos, genomes, and so on. Similarly, the output space may vary from binary labels (e.g. likes, dislikes), to a finite set of categories (e.g. the set of possible diseases), or to the set of real numbers. When the output space is a finite set of elements, the learning task is known as classification, while when it is infinite, e.g. the set of real numbers, it is called regression. In the case of multivariate supervised learning, the output space is the Cartesian product of several spaces (e.g. one may want to predict both the profession and the age of a customer). Some learning tasks involve structured outputs. For instance, one may want to infer the structure of a molecule given a set of physical and chemical measurements, or to infer the grammatical structure of an input sentence.

3.3.1 The Problems of Supervised Learning

In the following, we focus on the classification task and do not deal directly with regression. A supervised learning algorithm A takes as input the learning set S and a space of decision functions or hypotheses H, and it must output a decision function h : X −→ Y. The search for a good hypothesis h can be done directly by an exploration of the space H; this is called the discriminative approach. Or it can be done indirectly by first inferring a joint probability distribution p_XY over X × Y, by estimating both p_Y and p_X|Y, then computing the posterior p(y|x) for all possible values of y ∈ Y and choosing the most probable one. This approach is known as generative since, in principle, it is possible to generate artificial data points x_i for all classes y ∈ Y using the estimated distribution p_X|Y. This is not possible if one has learned a decision function like, for instance, a logical rule or a hyperplane separating two classes in the input space X. The generative viewpoint lies at the core of Bayesian learning, while the discriminative one is central to the machine learning approach. We adopt the latter perspective in the following.

The Inductive Criterion

The learning algorithm A must solve the following problem: given a training sample S = {(x_1, y_1), …, (x_m, y_m)} and an hypothesis space H, what is the optimal hypothesis h ∈ H? Under the assumption of a stationary environment, the best hypothesis is the one that minimizes the expected loss over future examples. This expected loss, also called the true risk, is written:

$$R(h) = \int_{X \times Y} \ell(h(x), y)\, p_{XY}(x, y)\, dx\, dy$$

where ℓ(h(x), y) measures the cost of predicting h(x) instead of the true label y, while p_XY is the joint probability distribution governing the world and the labeling process. A best hypothesis is thus h* = argmin_{h∈H} R(h). However, the underlying distribution p_XY is unknown, and it is therefore not possible to estimate R(h) and thus to determine h*. Short of being able to compute the true risk of any h ∈ H, it is necessary to resort to a proxy for R(h); this proxy is called the inductive criterion. One of the best known inductive criteria is the empirical risk, which replaces the expected loss by an empirical measure: the mean loss computed on the training set,

$$\hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i).$$

Searching for the best hypothesis using ĥ = argmin_{h∈H} R̂(h) is called the Empirical Risk Minimization (ERM) principle.
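A tiny worked instance of the ERM principle, not from the chapter: the hypothesis space is the set of 1-D threshold classifiers, the loss is the 0-1 loss, and ĥ is the threshold with the lowest mean loss on the training sample.

```python
# Empirical risk minimization over a tiny hypothesis space:
# threshold classifiers h_t(x) = 1 if x >= t else 0 (illustrative).
S = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1)]

def empirical_risk(t, sample):
    # Mean 0-1 loss of the threshold classifier h_t on the sample.
    return sum((x >= t) != bool(y) for x, y in sample) / len(sample)

# Candidate thresholds: one midpoint between each pair of sorted inputs.
xs = sorted(x for x, _ in S)
candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
t_hat = min(candidates, key=lambda t: empirical_risk(t, S))
```

On this sample, the chosen threshold separates the two labels with zero empirical risk; whether that transfers to a low true risk is exactly the question the theory below addresses.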


The grounds for using an inductive criterion such as the ERM, and the guarantees it offers for finding an hypothesis whose true risk is not too far from that of the best hypothesis h*, are the object of statistical learning theory (see chapter “Statistical Computational Learning” of Volume 1). Under the assumption that the data points are identically and independently distributed, the theory shows that the ERM principle must be altered by incorporating a bias on the hypotheses to be considered by the learner. Indeed, if no such bias is imposed, it is always possible to find hypotheses that have a low empirical risk (they fit the training data very well) while their true risk is high, meaning that they behave badly over instances that do not belong to the training set. This is called overfitting. The hypothesis space H must consequently be limited in its capacity to accommodate any target concept, or, alternatively, the empirical risk must be “regularized” by the addition of a term that imposes a cost on hypotheses that are not well behaved according to some prior preference. For instance, one may suppose that the target concept obeys some smoothness over the input space, which can be translated into penalizing functions h with high values of their second derivative. Another way of limiting the space of hypotheses considered by the learner is to prevent it from searching the whole space H by stopping the search process early. This is, for instance, the role of the “early stopping rule” known in artificial neural networks. The challenge in supervised induction is therefore to identify an hypothesis space rich enough that a good hypothesis may be found (no underfitting) but constrained enough that overfitting can be controlled. Sophisticated learning algorithms are able to automatically adjust the “capacity” of the hypothesis space in order to balance these two competing factors optimally.
Once the inductive criterion is set, it remains to explore the hypothesis space in order to find an hypothesis that optimizes the inductive criterion as well as possible.

Controlling the Search in the Hypothesis Space and the Optimization Strategy

Finding the hypothesis (or hypotheses) that optimizes the inductive criterion can be done analytically only in very specific cases. Typically, learning algorithms implement methods that update estimates of the solution via an iterative process. These processes include optimization techniques, solving systems of linear equations, and searching lattice-like state spaces. Usually, learning algorithms require large amounts of numerical and other operations, and it is therefore of foremost concern to control the computational and space complexities of these processes, as well as to ensure that the approximations of real numbers stay accurate. When the hypothesis space is the space of vectors of real numbers R^n (n ∈ N), as is the case for artificial neural networks for instance, gradient-based methods are the method of choice for optimizing a numerical criterion. When the hypothesis space is discrete, for instance when it is a space of logical expressions, it is important to find operators between expressions that render the exploration of the hypothesis space amenable to classical artificial intelligence search techniques, and in particular to efficient pruning of the search space. This is central to the version space learning algorithm (see Mitchell 1982, 1997) and to the search for typical patterns in databases (see Agrawal and Srikant 1994; Aggarwal 2015).
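The combination of a regularized inductive criterion with gradient-based search can be sketched in its simplest form: gradient descent on a ridge-style regularized least-squares criterion for a 1-D linear model. The data, step size and penalty weight are all illustrative choices.

```python
# Gradient descent on a regularized least-squares criterion
# J(w) = (1/m) * sum_i (w*x_i - y_i)^2 + lam * w^2   (1-D, illustrative)
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # roughly y = 2x
lam, lr = 0.01, 0.05
w = 0.0
for _ in range(500):
    # dJ/dw = (2/m) * sum_i (w*x_i - y_i)*x_i + 2*lam*w
    grad = (2 / len(data)) * sum((w * x - y) * x for x, y in data) + 2 * lam * w
    w -= lr * grad
# w converges to the least-squares slope, slightly shrunk by the penalty.
```

The penalty term plays the role of the prior bias discussed above: it pulls the solution towards small parameter values, trading a little empirical fit for better-behaved hypotheses.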

3.3.2 Approaches to Supervised Learning

There exist a wide variety of supervised learning methods, with new methods, or at least variations of methods, invented almost daily. It is nonetheless possible to group these methods into broad categories.
Parametric Methods
It often happens that the expert knows in advance precisely the type of regularities he/she is looking for in the data. For instance, one may want to fit a set of data points with linear regression. In this case, the learning problem is generally to estimate the coefficients, or parameters, of the model. E.g., in a linear regression in R^n, $h(x) = \sum_{i=0}^{n} w_i x^{(i)}$, where the x^{(i)} are the coordinates of the input x, and the n + 1 parameters w_i (0 ≤ i ≤ n) must be estimated. Parametric methods include most of the generative models (which hypothesize some probability distributions over the data), linear and generalized linear models, and simple neural network models.
Non Parametric Methods
The difference between parametric and non parametric methods is not as clear cut as the terms would suggest. The distinction is between families of models that are constrained by having a limited number of parameters and those which are so flexible that they can approximate almost any posterior probabilities or decision functions. This is typically the case of learning systems that learn a number of prototypes not fixed a priori and use a nearest neighbors technique to decide the class of a new input. These systems are ready to consider any number of prototypes as long as it allows them to fit the data well. The Support Vector Machines (SVM) fall into this category since they adjust the number of support vectors (learning examples) in order to fit the data. Deep neural networks are in between parametric and non parametric methods: they often have a fixed number of parameters, but this number is usually so large that the system has the ability to learn any type of dependency between the input and the output.
Other non parametric methods include decision tree learners, and more generally all learning systems that partition the input space into a set of “boxes” whose structure depends on the learning data. Systems that learn logical descriptions of the data, often in the form of a collection of rules, are equally non parametric, and may be seen as variants of techniques that partition the input space into a set of categories. Finally, ensemble learning methods, such as bagging, boosting and random forests, are also non parametric since the number of base hypotheses that are combined to form the final hypothesis is determined during learning. It is worth noting that methods that use regularized inductive criteria or that adjust the hypothesis space to the learning task at hand (like the Structural Risk Minimization (SRM) principle of Vapnik) may be seen as lying at the intersection of parametric and non parametric methods. They are parametric because they impose a constrained form on the learning hypotheses, but they adjust the number of non-null parameters as a function of the learning problem.
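The nearest-neighbors technique mentioned above is the prototypical non parametric method: the "model" is simply the training sample itself, and its complexity grows with the data. A minimal sketch on toy 1-D data (all names illustrative):

```python
from collections import Counter

# The training sample doubles as the model (non parametric).
train = [(0.0, "A"), (0.5, "A"), (1.0, "A"), (5.0, "B"), (5.5, "B")]

def knn_predict(x, k=3):
    # Take the k training points closest to x and vote on the label.
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]
```

No parameter vector is estimated; adding training points directly changes the decision function, which is exactly what distinguishes this family from parametric methods.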


Changes of Representations in Supervised Learning Many learning methods rely at their core on some change of representation of the input space, so that the structure of the data or the decision function becomes easy to discover. These changes of representation may be obtained through a preprocessing step, when e.g. a Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) or Non Negative Matrix Factorization (NMF) are performed. They can also result from the learning process itself, like when using MultiLayer Perceptrons and deep neural networks, where the first layers adapt their weights (parameters) so that the top layer of neurons can implement a linear decision function. Likewise, regularized techniques, like LASSO, that select the descriptive features that play a role in the final hypothesis, can be considered as methods for changing the representation of the input space.
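One of the preprocessing steps mentioned above, PCA, can be sketched via the singular value decomposition of the centered data; this assumes NumPy is available, and the data is illustrative.

```python
import numpy as np

# Change of representation by PCA: project centered data onto the
# top principal component, computed via SVD (illustrative data).
X = np.array([[2.0, 0.1], [4.0, -0.1], [6.0, 0.2], [8.0, -0.2]])
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
top_component = Vt[0]          # direction of maximal variance
Z = Xc @ top_component         # new 1-D representation of each example
```

Here almost all of the variance lies along the first axis, so the one-dimensional representation Z retains nearly all the information while simplifying any subsequent learning step.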

3.4 The Evaluation of Induction Results

Most methods in inductive learning entail building a model of the relationships between the attributes and the outcome (supervised learning) or between the attributes themselves (unsupervised learning), with a term that penalizes the complexity of the model. In order to select the best model, and therefore the optimal complexity parameter and the best meta-parameters of the method (e.g. the architecture of the neural network), which ensure the best predictive performance without overfitting, one has to be able to evaluate the model’s performance. The biggest hurdle in evaluating a learned model is that one does not know the future events that the system will have to process. One has therefore to rely on the known (training) instances and on some a priori assumptions about the relationship between the training data and the future environment in which the system will have to perform. One such assumption is that the world is stationary. Learning entails the optimization of two different sets of parameters: the parameters that define one specific hypothesis within the model, i.e. the class of possible hypotheses (e.g. the weights of a neural network whose architecture is given), and the parameters, a.k.a. meta-parameters, that control the model (e.g. the architecture of the neural network, the learning step and other such choices that govern the optimization process). In order for these two optimization problems to be properly carried out, it is essential that the training data used for these two optimization tasks be different, to obviate the risk of obtaining optimistic and biased results.
Thus, ideally, the available data is split into three different subsets: the training set, used to set the parameters of the hypothesis; the validation set, used both to evaluate the generalization performance of the hypothesis and to learn the meta-parameters in order to optimize the model; and the test set, used just once, at the end of the whole learning procedure, in order to estimate the true value of the final hypothesis. When data is abundant, it is possible to reserve a significant part of the sample for each of the three subsets. Often, however, data is scarce and methods such as

Designing Algorithms for Machine Learning and Data Mining

cross-validation must be used, which, in essence, repeats training and testing on different subsets of the training sample in order to compensate for the lack of data.

Evaluating the performance of a learned model differs between supervised and unsupervised learning. We first look at evaluation in supervised learning. In supervised learning, the estimation of the error rate in generalization is sometimes insufficient as a way to evaluate the value of the learned hypothesis. Thus, for example, it may be useful to estimate more precisely the rates of false positives and false negatives, or precision and recall. Confusion matrices are then a useful tool. It is even possible to obtain curves of the evolution of the learning performance when some meta-parameters are varied. The ROC curve (Receiver Operating Characteristic, a name motivated by the development of radar during the Second World War) is the best-known example. One typically seeks to maximize the area under the ROC curve, which characterizes the discriminating power of the classification method employed. The book (Japkowicz 2011) is a good source of information about evaluation methods for supervised classification.

In unsupervised learning, the goal is not to predict the value of the output for some new input, but to uncover structures of interest in the data. Therefore, the error in generalization is replaced by other criteria that assess the extent to which the structures discovered in the data fit the expectations set by the user. For instance, if one wishes to uncover subgroups or clusters in the data, performance criteria will be based on measures of the compactness of each cluster together with a measure of the dissimilarity between clusters. One can then choose, for instance, the number of clusters that maximizes the chosen criterion. However, in contrast with supervised learning, where the ground truth is known on the training data, the estimation criteria in unsupervised learning are very dependent on a priori assumptions about the type of patterns present in the world, and they can easily give very misleading results if these assumptions do not hold.
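The data-splitting and cross-validation procedures discussed above can be sketched as follows (a minimal illustration: the nearest-centroid stand-in learner, the number of folds and the toy data are all arbitrary choices of this sketch, not prescribed by the text):

```python
import numpy as np

def nearest_centroid_fit(X, y):
    # Hypothetical stand-in learner: one centroid per class
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(model, X):
    classes = np.array(sorted(model))
    centroids = np.stack([model[c] for c in classes])
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

def cross_val_accuracy(X, y, k=5, seed=0):
    # k-fold cross-validation: repeated training/testing on disjoint subsets
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = nearest_centroid_fit(X[train], y[train])
        scores.append((nearest_centroid_predict(model, X[test]) == y[test]).mean())
    return float(np.mean(scores))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
acc = cross_val_accuracy(X, y)
```

Averaging the per-fold accuracies gives an estimate of the generalization performance that uses every instance for testing exactly once.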

3.5 Types of Algorithms

The presentation of learning algorithms can be organized along several dimensions:
1. The optimization method used by the learner, for instance gradient-based, divide-and-conquer, and so on. As an illustration, many recent learning algorithms have been motivated by the allure of convex optimization.
2. The types of hypotheses that are considered for describing the world or making predictions, for instance linear models, non-linear models, partitions of the input space, etc.
3. The type of structure of the hypothesis space, for instance vector spaces, Galois lattices, and so on.
Obviously, the optimization method and the structure of the hypothesis space are closely interrelated, while the type of hypotheses considered determines, in large part, the structure of the hypothesis space.

A. Cornuéjols and C. Vrain

A specialist of Machine Learning tries to solve inductive problems, that is, to discover general patterns or models from data by processes that can be automatized. The first questions to ask are: what are the available data? What is the task? How does one expect to measure the quality or performance of the learned regularities? What kinds of regularities are of interest? Only after these questions have answers does the problem of actually searching the space of possible regularities arise. Of course, Machine Learning is also concerned with feasibility issues, all the more so as the data becomes increasingly massive and the regularities considered become more complex. Therefore, the space and time requirements of the computations involved in learning are also a factor in choosing or devising a learning method. This is where Machine Learning blends with Computer Science. However, while a specialist of Machine Learning has to be aware of the computational demands of various types of algorithms, this need not be his or her foremost forte, in the same way that a specialist in Artificial Intelligence is centrally interested in knowledge representation and automatic reasoning, and less centrally, even though this is important, in computational issues. In accordance with these considerations, the rest of the chapter is structured around the families of regularities that current learning algorithms are able to uncover. For each of these types, however, algorithmic and computational issues will be addressed.

4 Clustering

Clustering aims at finding the underlying organization of a dataset S = {x_1, …, x_m}. An observation is usually represented by the values it takes on a set of descriptors {X_1, …, X_d}, thus defining the input space X. Most methods require the definition of either a dissimilarity measure, denoted by dis, between pairs of objects, or a similarity (also called affinity) measure, denoted by sim. Clustering has attracted significant attention since the beginning of exploratory data analysis, and many methods have been developed. In this chapter we focus on partitioning methods, which aim at building a partition of S (see Sects. 4.1–4.5), and hierarchical methods, which aim at organizing the data in a hierarchical structure (see Sects. 4.6 and 4.7). The methods differ according to the way clustering is modeled. Prototype-based methods (Sect. 4.2) seek representative points of the clusters, density-based methods (Sect. 4.4) assume that a cluster is built from connected dense regions, whereas generative methods (Sect. 4.3) assume that the data has been generated by a mixture of Gaussians. Spectral clustering (Sect. 4.5) relies on a similarity measure and on the construction of a graph reflecting the similarity links between objects. We present in this section a few well-known methods that are representative of the various existing algorithms. More complete overviews of clustering and analyses of clustering methods can be found in (Bishop 2006; Hastie et al. 2009). Let us also mention that outlier detection is a domain close to clustering that we do not address in this chapter. An outlier is defined in (Hawkins 1980) as an observation which deviates so much from the other observations as to arouse suspicions that it

Table 1  Criteria on a cluster C with centroid μ: homogeneity (left) / separation (right)

Homogeneity of C, to be minimized:
• diameter (diam): max_{o_i,o_j∈C} dis(o_i, o_j)
• radius (r): min_{o_i∈C} max_{o_j∈C} dis(o_i, o_j)
• star (st): min_{o_i∈C} Σ_{o_j∈C} dis(o_i, o_j)
• normalized star: st(C)/|C|
• clique (cl): Σ_{o_i,o_j∈C} dis(o_i, o_j)
• normalized clique: cl(C)/(|C| × (|C| − 1))
• sum of squares (ss): Σ_{o_i∈C} ||o_i − μ||₂²
• variance (var)^b: Σ_{o_i∈C} ||o_i − μ||₂² / |C|

Separation of C, to be maximized:
• split^a: min_{o_i∈C, o_j∉C} dis(o_i, o_j)
• cut: Σ_{o_i∈C} Σ_{o_j∉C} dis(o_i, o_j)
• ratio_cut: cut(C)/|C|
• normalized_cut^c: cut(C)/(|C| × (n − |C|))

^a Or margin
^b Also called the error sum of squares, for instance in Ward (1963)
^c Given a weighted similarity graph, the normalized cut is defined differently, by cut(C)/vol(C), with vol(C) = Σ_{o_i∈C} d_i

was generated by a different mechanism. Some clustering methods, as for instance density-based methods, allow the detection of outliers during the clustering process, whereas other methods, such as k-means, need the outliers to be detected before the clustering process. An overview of outlier detection methods can be found in (Aggarwal 2015). Defining a metric between objects is fundamental in clustering. We discuss in Sect. 5 the kernel trick, which allows one to embed the data in a higher-dimensional space without making the new representation explicit. Many clustering methods relying on metrics have been "kernelized".

4.1 Optimization Criteria and Exact Methods

Clustering looks for a partition of the data made of compact and well-separated clusters. Several criteria have been defined for assessing the quality of a single cluster. A large list is given in (Hansen and Jaumard 1997), and some examples are given in Table 1, assuming a dissimilarity measure dis between pairs of objects, or the Euclidean distance ||.||₂. Once criteria have been defined for assessing the quality of a cluster, different objective functions can be defined, specifying the properties of the expected output partition C = (C_1, …, C_k): for instance the minimization of the maximal diameter (defined by max_{j=1}^k diam(C_j)) or the minimization of the within-clusters sum of squares (WCSS, defined in Eq. 1), which both rely on the notion of compactness, or the maximization of the minimal margin (defined by min_{j=1}^k split(C_j)), which emphasizes the notion of separation. The first problem is polynomial only for k = 2, the second one is NP-hard, and the third one is polynomial. Another criterion is the maximization of the cut (Σ_{j=1}^k cut(C_j)), which can be solved efficiently for k = 2 but leads to unbalanced clusters; this is why the normalized cut or the ratio cut are usually

preferred. In the case where a similarity matrix between points is given (see Sect. 4.5), the criterion becomes the minimization of the cut or of the ratio cut. Because of the complexity of the problem, there are few exact methods. (Hansen and Delattre 1978) presents a method based on graph coloring for the minimization of the maximum diameter. Branch-and-bound algorithms (see for instance Brusco and Stahl 2005) have been proposed for different criteria. Clustering has also been modeled in Integer Linear Programming (Rao 1969; Klein and Aronson 1991; du Merle et al. 1999), and there is a stream of recent research on using declarative frameworks (Integer Linear Programming, Constraint Programming, SAT) for clustering, as for instance (Aloise et al. 2012; Dao et al. 2017), and for constrained clustering (see chapter "Constrained Clustering: Current and New Trends" of this volume).

4.2 K-Means and Prototype-Based Approaches

The best-known algorithm for clustering is k-means (Lloyd 1982; Forgy 1965). The main ideas are that a cluster can be characterized by its centroid (the most central point in the cluster) and that an observation x is assigned to the closest cluster, as measured by the distance between x and the centroid of that cluster. The number k of classes must be provided a priori. First, k observations μ_1^0, …, μ_k^0 are chosen as initial seeds, and then the algorithm alternates two phases until convergence (i.e. until the partition no longer changes): it first assigns each observation to the cluster whose centroid is the closest, and then computes the new centroid of each cluster from the observations assigned to it. Let μ_1^t, …, μ_k^t denote the centroids at iteration t; the main scheme of the algorithm is given in Algorithm 1.

Algorithm 1: k-means algorithm
Initialization: choose k observations μ_1^0, …, μ_k^0 as initial seeds; t ← 0;
repeat
  For each point x_i, i = 1, …, m, assign x_i to the cluster C_l with l = argmin_{j∈1…k} ||x_i − μ_j^t||₂;
  For each j, j = 1, …, k, compute the new centroid: μ_j^{t+1} = (1/|C_j|) Σ_{x∈C_j} x;
  t ← t + 1;
until Convergence;
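Algorithm 1 can be sketched in a few lines of NumPy (a minimal illustration; the random seeding from the data, the iteration cap and the toy two-blob data are arbitrary choices of this sketch):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: k observations chosen at random as seeds
    mu = X[rng.choice(len(X), size=k, replace=False)]
    prev = None
    for _ in range(n_iter):
        # Assignment step: each point goes to the cluster of the closest centroid
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        if prev is not None and np.array_equal(assign, prev):
            break  # convergence: the partition no longer changes
        prev = assign
        # Update step: recompute each centroid as the mean of its cluster
        for j in range(k):
            if np.any(assign == j):
                mu[j] = X[assign == j].mean(axis=0)
    return assign, mu

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-5, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
assign, mu = kmeans(X, 2)
```

On well-separated blobs such as these, the alternation converges in a handful of iterations; on harder data the result depends on the initial seeds, as discussed below Eq. 1.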

As already stated, clustering is often viewed as an optimization problem specifying the quality of the desired partition. In this case, the optimization criterion is called the within-clusters sum of squares and is defined by

WCSS = Σ_{j=1}^k Σ_{x∈C_j} ||x − μ_j||₂²    (1)

where ||.||₂ denotes the Euclidean distance. It is equivalent to

WCSS = Σ_{j=1}^k Σ_{u,v∈C_j} ||u − v||₂² / |C_j|

The complexity of the algorithm is O(mkt), where m is the number of observations, k the number of clusters and t the number of iterations. Let us note two important points. First, to be applicable, the number of classes must be given and computing the mean of a subset of observations must be feasible, which means that all the features must be numeric. K-means can be adapted to handle non-numeric attributes, for instance by replacing the centroid by the most central observation (the observation minimizing the star, i.e. defined by argmin_{o_i∈C} Σ_{o_j∈C} dis(o_i, o_j)). In the case of non-numeric attributes, however, the most widely used method consists in looking for k observations, called medoids, that are the most representative of the clusters, as for instance in the system PAM (Kaufman and Rousseeuw 1990): the search space is the space of all k-subsets of the observations; the search considers pairs composed of a medoid and a non-medoid and analyzes whether a swap between them would improve WCSS. It has been further extended to improve its efficiency (see for instance Ng and Han 1994; Park and Jun 2009). The second point is that the optimization criterion is not convex, and therefore the initialization step is fundamental: the results depend heavily on it. Several methods have been proposed for the initialization, as for instance k-means++ (Arthur and Vassilvitskii 2007).

K-means is an iterative alternate optimization method. Looking at Eq. 1, it can be seen that it depends on two parameter sets: the assignment of points to clusters and the centroids of the clusters. Let us define a boolean variable X_ij, i = 1, …, m, j = 1, …, k, which is true when the observation x_i belongs to cluster C_j. We have, for all i, Σ_{j=1}^k X_ij = 1, that is, each point is assigned to one and only one cluster. Then Eq. 1 can be written:

WCSS = Σ_{i=1}^m Σ_{j=1}^k X_ij ||x_i − μ_j||₂²    (2)

This form highlights the two sets of parameters: X_ij, i = 1, …, m, j = 1, …, k (assignment of points to clusters) and μ_j, j = 1, …, k (determination of the centroids of the clusters); the two alternate steps correspond to (1) fixing μ_j, j = 1, …, k, in Eq. 2 and optimizing X_ij, i = 1, …, m, j = 1, …, k, and (2) fixing X_ij and optimizing μ_j, j = 1, …, k. For solving (1), we can notice that the terms involving X_ij are independent from those involving X_{i′j}, i′ ≠ i, and can be optimized independently, thus giving X_il = 1 iff l = argmin_{j∈1…k} ||x_i − μ_j||₂. The variables X_ij being fixed, finding the best μ_j, j = 1, …, k, is a quadratic optimization problem that can be solved by setting the derivatives with respect to μ_j to 0, leading to μ_j = Σ_{i=1}^m X_ij x_i / Σ_{i=1}^m X_ij. For more details, see (Bishop 2006).

K-means relies on the Euclidean distance and makes the underlying assumption that the different clusters are linearly separable. As many clustering algorithms, it has

been extended to kernel k-means, thus mapping points into a higher-dimensional space using a non-linear function Φ. Kernel k-means relies on the fact that ||u − v||₂² = ⟨u, u⟩ − 2⟨u, v⟩ + ⟨v, v⟩ and that the dot product can be replaced by a kernel function. Compared to k-means, the first part of the iteration, i.e. the assignment of a point x to a cluster, has to be changed: argmin_{j∈1…k} ||Φ(x) − μ_j||, with μ_j = (1/|C_j|) Σ_{u∈C_j} Φ(u). Replacing μ_j by its value and developing the norm using the dot product allows one to express ||Φ(x) − μ_j||² in terms of the kernel, thus avoiding the computation of Φ(x).

Self-organizing maps (SOM) are also based on prototypes. The simplest approach, called vector quantization (VQ), follows the same philosophy as k-means, in the sense that k prototypes are initially chosen at random in the input space, each point is assigned to the cluster of its closest prototype and the prototypes are updated. Nevertheless, at each iteration of VQ a single point x is chosen at random and the closest prototype is updated so as to move closer to x. Let m_j^t be the prototype of cluster j at iteration t. If x is assigned to C_j, then its prototype is updated by the formula (ε is a fixed parameter):

m_j^{t+1} ← m_j^t + ε(x − m_j^t)

thus m_j^{t+1} − x = (1 − ε)(m_j^t − x). The self-organizing map (Kohonen 1997) adds a structure (often a grid) on the prototypes, and at each iteration the closest prototype and its neighbors are updated. The prototypes and their structure can be seen as a neural network, but learning can be qualified as competitive since at each iteration only the winning prototype and its neighbors are updated.
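The VQ update rule above can be sketched as follows (a minimal illustration; the prototypes, the sampled point and the value of ε are arbitrary choices of this sketch):

```python
import numpy as np

def vq_update(prototypes, x, eps=0.1):
    # Find the winning (closest) prototype and move it a fraction eps toward x
    j = ((prototypes - x) ** 2).sum(axis=1).argmin()
    prototypes = prototypes.copy()
    prototypes[j] = prototypes[j] + eps * (x - prototypes[j])
    return prototypes, j

protos = np.array([[0.0, 0.0], [10.0, 10.0]])
new_protos, winner = vq_update(protos, np.array([1.0, 1.0]), eps=0.5)
```

With ε = 0.5 the winner moves halfway toward x, consistent with the contraction m_j^{t+1} − x = (1 − ε)(m_j^t − x); the losing prototype is left untouched, which is what makes the learning competitive.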

4.3 Generative Learning for Clustering

In Sect. 2.1, we mentioned that there are two approaches in Machine Learning: generative versus discriminative. Generative methods aim at learning a (parametric) probability distribution p_X over the input space X. On the other hand, clustering as an exploratory process aims at discovering the underlying structure of the data, and this structure can naturally be modeled by a probability distribution p_X. This leads to the use of generative methods for clustering. The model that is chosen is usually a linear combination of k Gaussian densities (intuitively, one Gaussian density per cluster):

p(x) = Σ_{j=1}^k π_j N(x|μ_j, Σ_j)    (3)

with, for all j,

π_j ≥ 0 and Σ_{j=1}^k π_j = 1    (4)

The coefficients π_j are called the mixing coefficients and can be interpreted as the prior probability for a point to be generated from the jth cluster; N(x|μ_j, Σ_j) is the density of the point given that it is drawn from the jth component.

This can also be formulated by introducing a k-dimensional binary random variable z = (z_1, …, z_k), intuitively representing the assignment of a point to a cluster (z_j = 1 if x is assigned to cluster j). Therefore, only one z_j is equal to 1 and the other ones are equal to 0 (Σ_{j=1}^k z_j = 1). The marginal distribution of z is given by p(z_j = 1) = π_j, and the conditional probability of x given z by p(x|z_j = 1) = N(x|μ_j, Σ_j); since p(x) = Σ_j p(z_j = 1) p(x|z_j = 1), we fall back on Eq. 3. The k-dimensional variable z is called a latent variable. It can be seen as a set of hidden variables: observations are described in the space (X, Z) of dimension d + k, where the hidden variables in Z give the assignment of points to clusters.

Given the observations S, learning a generative model for S is classically achieved by maximizing the log-likelihood of the data, defined as ln(p(S|π, μ, Σ)), where π, μ, Σ denote respectively π = (π_1, …, π_k), μ = (μ_1, …, μ_k) and Σ = (Σ_1, …, Σ_k). Supposing that the examples have been drawn independently from the distribution, we get

ln(p(S|π, μ, Σ)) = Σ_{i=1}^m ln( Σ_{j=1}^k π_j N(x_i|μ_j, Σ_j) )    (5)

Maximizing the log-likelihood, in other words learning the coefficients (π_j, μ_j, Σ_j), j = 1, …, k, is usually achieved by the Expectation-Maximization (EM) algorithm. Roughly, after having initialized the parameters to learn, EM iterates two phases: an expectation phase, computing the probability of the assignment of a point to a cluster given that point (p(z_j = 1|x_i)), and a maximization phase that, given the assignments of points to clusters, optimizes the parameters (π_j, μ_j, Σ_j) so as to maximize the likelihood. The algorithm for learning a mixture of Gaussians is given in Algorithm 2 (Bishop 2006).

Algorithm 2: EM algorithm for learning Gaussian mixtures (Bishop 2006)
Initialize, for all j in 1, …, k, the means μ_j, the covariances Σ_j and the mixing coefficients π_j;
Evaluate the log-likelihood using Eq. 5;
repeat
  E step: Evaluate the responsibilities γ(z_ij) = p(z_j = 1|x_i):
    γ(z_ij) = π_j N(x_i|μ_j, Σ_j) / Σ_{l=1}^k π_l N(x_i|μ_l, Σ_l);
  M step: Re-estimate the parameters, with m_j = Σ_{i=1}^m γ(z_ij):
    μ_j^new = (1/m_j) Σ_{i=1}^m γ(z_ij) x_i
    Σ_j^new = (1/m_j) Σ_{i=1}^m γ(z_ij)(x_i − μ_j^new)(x_i − μ_j^new)^t
    π_j^new = m_j / m;
  Evaluate the log-likelihood using Eq. 5;
until Convergence;
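Algorithm 2 can be sketched in NumPy as follows (a minimal illustration: the initialization from random data points with identity covariances, the fixed number of iterations instead of a convergence test, and the small ridge added to the covariances for numerical stability are pragmatic choices of this sketch, not part of the algorithm as stated):

```python
import numpy as np

def gaussian_pdf(X, mu, cov):
    # Multivariate normal density N(x | mu, cov), evaluated row-wise
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return norm * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff))

def em_gmm(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    m, d = X.shape
    mu = X[rng.choice(m, size=k, replace=False)].copy()
    cov = np.stack([np.eye(d) for _ in range(k)])
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_ij) = p(z_j = 1 | x_i)
        dens = np.stack([pi[j] * gaussian_pdf(X, mu[j], cov[j]) for j in range(k)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate the parameters from the soft assignments
        mj = gamma.sum(axis=0)
        pi = mj / m
        for j in range(k):
            mu[j] = (gamma[:, j, None] * X).sum(axis=0) / mj[j]
            diff = X - mu[j]
            cov[j] = (gamma[:, j, None, None] * np.einsum('ij,ik->ijk', diff, diff)).sum(axis=0) / mj[j]
            cov[j] += 1e-6 * np.eye(d)  # numerical safeguard against singular covariances
    return pi, mu, cov

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-4, 1, (150, 2)), rng.normal(4, 1, (150, 2))])
pi, mu, cov = em_gmm(X, 2)
```

On such well-separated data the learned means land near the two blob centers, and the mixing coefficients sum to 1 by construction of the M step.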

360

A. Cornuéjols and C. Vrain

The structure of this algorithm is similar to that of the k-means algorithm (Algorithm 1). Let us note that in k-means a point is assigned to a single cluster (hard assignment), whereas with this approach a point is assigned to each cluster with some probability (soft assignment). A more detailed presentation of EM can be found in (Bishop 2006).

4.4 Density-Based Clustering

Density-based methods are based on the assumption that clusters correspond to high-density subspaces. The first method was proposed in the system DBSCAN (Ester et al. 1996): it relies on the search for core points, that is, points with a dense neighborhood. The notion of density is based on two parameters: the radius ε, defining the neighborhood of a point, and a density threshold MinPts, defining the minimum number of points that must be present in the neighborhood of a point for the density around this point to be considered high. More formally, the ε-neighborhood of a point u is defined by Nε(u) = {v ∈ S | dis(u, v) ≤ ε}, and a core point is a point u satisfying |Nε(u)| ≥ MinPts. A point v is linked to a core point u when there exists a sequence of core points u_1, …, u_n such that u_1 = u, for all i, i = 2, …, n, u_i ∈ Nε(u_{i−1}), and v ∈ Nε(u_n). A dense cluster is a maximal set of connected objects, that is, of objects that are linked to a same core point. It is thus composed of core points (the internal points of the cluster) and non-core points (the border points of the cluster). The method is described in Algorithms 3 and 4: roughly, for each point that has not yet been visited (marked Uncl) and that is a core point, a new cluster is built, and this cluster is iteratively expanded by considering the core points in its neighborhood (Ester et al. 1996).

Algorithm 3: Outline of the DBSCAN algorithm
Data: SetOfPoints, ε, MinPts
ClustId ← first(ClusterId);  // first cluster label
for each point x do
  if Mark(x) = Uncl then  // x unclassified
    if Expand(SetOfPoints, x, ClustId, ε, MinPts) then
      ClustId ← next(ClusterId);  // new cluster label
    end
  end
end

Algorithm 4: Expand(SetOfPoints, x, ClustId, ε, MinPts)
N ← Nε(x);
if |N| < MinPts then  // not a core point
  Mark(x) ← Noise and Return False;  // the mark may change later
else
  for all points z in N, Mark(z) ← ClustId;  // init the cluster with Nε(x)
  Delete x from N;
  while N ≠ ∅ do
    Let s be the first element of N;
    if |Nε(s)| ≥ MinPts then  // expand the cluster
      for all points z in Nε(s) do
        if Mark(z) = Uncl then add z to N;
        if Mark(z) = Uncl or Noise then Mark(z) ← ClustId;
      end
    end
    Delete s from N;
  end
  Return True;
end
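Algorithms 3 and 4 can be condensed into the following NumPy sketch (a minimal illustration that materializes the full pairwise distance matrix, hence O(m²) memory, with no spatial index; the toy data with an isolated point is an arbitrary choice, and the label −1 denotes noise):

```python
import numpy as np
from collections import deque

def dbscan(X, eps, min_pts):
    m = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(m)]
    core = np.array([len(n) >= min_pts for n in neighbors])
    labels = np.full(m, -1)  # -1 = unclassified / noise
    cluster = 0
    for i in range(m):
        if labels[i] != -1 or not core[i]:
            continue
        # Grow a new cluster from this core point by breadth-first expansion
        queue = deque([i])
        labels[i] = cluster
        while queue:
            p = queue.popleft()
            if not core[p]:
                continue  # border points belong to the cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    queue.append(q)
        cluster += 1
    return labels

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (60, 2)),
               rng.normal(6, 0.3, (60, 2)),
               np.array([[3.0, 3.0]])])
labels = dbscan(X, eps=1.0, min_pts=5)
```

The two dense blobs come out as two clusters, while the isolated point at (3, 3), which has no ε-neighbors, is expected to remain labeled as noise; this is the outlier-detection behavior mentioned below.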

The average complexity of this algorithm is O(n log n) with adapted data structures, such as R*-trees. Density-based clustering allows the detection of outliers, defined in DBSCAN as points that are not connected to any core point. The second important property is that, by merging neighborhoods, density-based clustering can find clusters of arbitrary shape. On the other hand, the drawback is that the output is highly dependent on the parameters ε and MinPts, although heuristics have been proposed to set them. Moreover, the algorithm cannot handle cases where the density varies from one cluster to another. Some extensions, as for instance OPTICS (Ankerst et al. 1999), have been designed to handle clusters of differing densities.

4.5 Spectral Clustering, Non-negative Matrix Factorization

A clear and self-contained introduction to spectral clustering can be found in von Luxburg (2007). Spectral clustering takes as input a similarity graph G = (S, E): the nodes S of the graph are the observations and E is a set of weighted edges, where the weight w_ij, w_ij ≥ 0, between two nodes x_i and x_j represents the similarity between x_i and x_j (w_ij = 0 when there is no edge between x_i and x_j or when the similarity is null). When the input is a pairwise dissimilarity or a distance dis between pairs of points, a function transforming the distance into a similarity must first be applied;

for instance one can use a Gaussian similarity function: sim(x_i, x_j) = exp(−||x_i − x_j||₂² / (2σ²)), where σ is a parameter. Several methods have been proposed to build the graph. All nodes can be connected and the edges weighted by the similarity. It is also possible to use an unweighted graph, connecting the points whose distance is less than a given parameter ε. In the following, Sim = (sim_ij)_{i,j=1…m} denotes a similarity matrix with sim_ij the similarity between x_i and x_j, W = (w_ij)_{i,j=1…m} denotes the weight matrix, and D the degree matrix: D is a diagonal matrix whose diagonal terms are the degrees d_i of the points, defined by d_i = Σ_{j=1}^m w_ij, i = 1, …, m. Given this graph, a Laplacian matrix is built. The Laplacian matrix is particularly interesting since the multiplicity of the eigenvalue 0 of the Laplacian matrix is equal to the number of connected components of the graph. If the multiplicity is equal to k, then the partition of the observations into k clusters is naturally given by the k connected components of the graph. More precisely, several graph Laplacians can be built: the unnormalized graph Laplacian L, or the normalized graph Laplacians L_sym and L_rw, defined by:

L = D − W,  L_sym = D^{−1/2} L D^{−1/2},  L_rw = D^{−1} L.

Laplacians have important properties. For instance, considering L, we have:

∀x = (x_1, …, x_m)^t ∈ R^m,  x^t L x = (1/2) Σ_{i,j=1}^m w_ij (x_i − x_j)²

The definition of L and this property allow one to show that L is a symmetric, positive semi-definite matrix; it has m non-negative, real-valued eigenvalues 0 ≤ λ_1 ≤ … ≤ λ_m, and 0 is the smallest eigenvalue, with as eigenvector the vector 1 whose components are all equal to 1. This leads to two kinds of approaches:
• Spectral clustering, which, given the graph Laplacian L, computes the first k eigenvectors u_1, …, u_k of L. Considering U = [u_1, …, u_k], each element u_ij, i = 1, …, m, j = 1, …, k, can be seen as the degree of membership of x_i in C_j. K-means is then applied on the rows of U to obtain a partition. See Algorithm 5.
• Structured graph learning (Nie et al. 2014), which aims at modifying the similarity matrix Sim so that the multiplicity of 0 in its graph Laplacian is equal to k. This leads to the following minimization problem, where a regularization term is introduced to avoid a trivial solution (||.||_F denotes the Frobenius norm):

min_Sim Σ_{i,j=1}^m ||x_i − x_j||₂² sim_ij + μ ||Sim||_F
s.t. Sim·1 = 1, sim_ij ≥ 0, rank(L_Sim) = m − k.

Algorithm 5: Unnormalized spectral clustering (von Luxburg 2007)
Input: a similarity matrix Sim = (sim_ij)_{i,j=1…m} ∈ R^{m×m}, a number k of classes
Output: k clusters C_1, …, C_k
Compute the similarity graph G = (S, E);
Compute the unnormalized Laplacian L, L ∈ R^{m×m};
Compute the first k eigenvectors u_1, …, u_k of L;
Let U = [u_1, …, u_k] be the matrix whose columns are the u_i (U ∈ R^{m×k});
Let z_i, i = 1, …, m, be the vectors corresponding to the rows of U;
Cluster the points (z_i)_{i=1,…,m} into k clusters A_1, …, A_k;
Return, for all j, j = 1, …, k, C_j = {x_i | z_i ∈ A_j}
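Algorithm 5 can be sketched as follows (a minimal NumPy illustration; the Gaussian similarity on a fully connected graph, the farthest-point seeding of the final k-means step and the toy data are all arbitrary choices of this sketch):

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0):
    m = len(X)
    # Fully connected similarity graph with Gaussian weights
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W                       # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)  # eigenvectors sorted by increasing eigenvalue
    U = eigvecs[:, :k]              # rows z_i of the spectral embedding
    # Tiny k-means on the rows of U, seeded by a farthest-point heuristic
    centers = [0]
    for _ in range(1, k):
        dist = ((U[:, None, :] - U[centers][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers.append(int(dist.argmax()))
    mu = U[centers].copy()
    for _ in range(100):
        assign = ((U[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                mu[j] = U[assign == j].mean(axis=0)
    return assign

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-3, 0.4, (40, 2)), rng.normal(3, 0.4, (40, 2))])
labels = spectral_clustering(X, 2)
```

With two well-separated blobs the graph is nearly disconnected, so the rows of U collapse onto two distinct points, which the final k-means step separates easily.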

Spectral clustering can be seen as a relaxation of the problem of finding a partition C = (C_1, …, C_k) minimizing the ratio cut of C, defined by ratioCut(C) = Σ_{j=1}^k cut(C_j)/|C_j|. In fact, it can be shown that ratioCut(C) can be rewritten by introducing an m × k matrix H defined by h_ij = 1/√|C_j| if x_i ∈ C_j and h_ij = 0 if x_i ∉ C_j:

ratioCut(C) = Trace(H^t L H)

H satisfies H^t H = I. Each row of H contains a single non-zero value, denoting the cluster the point belongs to, whereas each column of H is an indicator vector representing the composition of the corresponding cluster. The relaxation is performed by allowing H to take arbitrary real values. The minimization problem

minimize_{H∈R^{m×k}} Trace(H^t L H) under the condition H^t H = I

has as solution the matrix U = [u_1, …, u_k] composed of the first k eigenvectors of L (Rayleigh-Ritz theorem). U has to be transformed to model a partition, and this is achieved by applying k-means. Other relations between clustering frameworks have been established: Dhillon et al. (2004) shows a relation between a weighted version of kernel k-means, spectral clustering as defined in Ng et al. (2001) and the normalized cut; Ding and He (2005) has shown the relation between spectral clustering, kernel k-means and Non-negative Matrix Factorization of W. Let us recall that Non-negative Matrix Factorization of W consists in finding a matrix H ∈ R_+^{n×k} minimizing ||W − H H^t||_F, where ||.||_F denotes the Frobenius norm.

4.6 Hierarchical Clustering

Searching for an organization of the data into a single partition requires choosing the level of granularity, defined by the number of clusters in the partition. Hierarchical clustering solves this problem by looking for a sequence of nested partitions.

Hierarchical clustering (Ward 1963; Johnson 1967) aims at building a sequence of nested partitions: the finest one is the partition P = {{x_1}, …, {x_m}}, where each point forms a cluster reduced to this point, and the coarsest one is Q = {{x_1, …, x_m}}, composed of a single class containing all the points. There exist two families of algorithms. Divisive algorithms start with the partition composed of a single class and iteratively divide a cluster into two or more smaller clusters, thus obtaining finer partitions. On the other hand, agglomerative algorithms start with the finest partition P, composed of m clusters, each reduced to a single point; the two closest clusters are merged and the process is iterated until the partition Q, composed of a single cluster, is obtained. A sketch of the agglomerative algorithm is given in Algorithm 6.

Algorithm 6: Agglomerative hierarchical clustering
Compute dis(x_i, x_j) for all pairs of points x_i and x_j in S;
Let Π = {{x} | x ∈ S};
Let D be a dendrogram with m nodes at height 0, one node for each element of Π;
while |Π| > 1 do
  Choose two clusters C_u and C_v in Π such that dis(C_u, C_v) is minimal;
  Remove C_u and C_v from Π and add C_u ∪ C_v;
  Add a new node to the dendrogram, labeled by C_u ∪ C_v, at height dis(C_u, C_v);
  Compute dis(C_u ∪ C_v, C_w) for all C_w ∈ Π;
end
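Algorithm 6, instantiated with the single-linkage dissimilarity, can be sketched as follows (a minimal, deliberately naive O(m³)-per-merge illustration that stops once a target number of clusters is reached instead of building the full dendrogram; the toy data is an arbitrary choice):

```python
import numpy as np

def single_linkage(X, n_clusters):
    m = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = [{i} for i in range(m)]  # finest partition: one point per cluster
    while len(clusters) > n_clusters:
        # Find the two closest clusters under single linkage (minimum split)
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dab = min(d[i, j] for i in clusters[a] for j in clusters[b])
                if dab < best[2]:
                    best = (a, b, dab)
        a, b, _ = best
        clusters[a] = clusters[a] | clusters[b]  # merge the two closest clusters
        del clusters[b]
    labels = np.empty(m, dtype=int)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.2, (15, 2)), rng.normal(5, 0.2, (15, 2))])
labels = single_linkage(X, 2)
```

The Lance-Williams formula mentioned below avoids recomputing inter-cluster dissimilarities from scratch at each merge, which this naive sketch does not exploit.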

This algorithm is parameterized by the choice of the dissimilarity between two clusters. For instance, (Ward 1963) proposes to optimize the variance, defined in Table 1. The best-known strategies for defining a dissimilarity between two clusters are:
• single linkage (nearest-neighbor strategy): dis(C_u, C_v) = min{dis(u, v) | u ∈ C_u, v ∈ C_v} (the split between C_u and C_v)
• average linkage: dis(C_u, C_v) = mean{dis(u, v) | u ∈ C_u, v ∈ C_v}
• complete linkage (furthest-neighbor strategy): dis(C_u, C_v) = max{dis(u, v) | u ∈ C_u, v ∈ C_v}
Single linkage suffers from the chain effect (Johnson 1967): it merges clusters with a minimal split, but it can iteratively lead to clusters with large diameters. On the other hand, complete linkage aims at each step at minimizing the diameter of the resulting clusters, thus finding homogeneous clusters, but quite similar objects may be put in different clusters in order to keep the diameters small. Average linkage tends to balance both effects. (Lance and Williams 1967) reviews different criteria used in hierarchical clustering and proposes a general formula allowing one to update dis(C_u ∪ C_v, C_w) from dis(C_u, C_v), dis(C_u, C_w) and dis(C_v, C_w).

Designing Algorithms for Machine Learning and Data Mining


4.7 Conceptual Clustering

Conceptual clustering was introduced in (Michalski 1980; Michalski and Stepp 1983). The main idea of conceptual clustering is that, in order to be interesting, a cluster must be a concept, where a concept can be defined by characteristic properties (properties satisfied by all the elements of the concept) and discriminant properties (properties satisfied only by elements of the concept).²

The system Cobweb, proposed in (Fisher 1987), is original in the sense that it learns probabilistic concepts organized into a hierarchy and it is incremental: examples are processed sequentially and the hierarchy is updated for each new example. More precisely, each concept C is described by the probability P(C) of the concept (estimated by the proportion of observations in this concept w.r.t. the total number of observations) and, for each attribute X and each value v, by P(X = v|C) (estimated by the proportion of observations in the concept satisfying X = v).

Given a hierarchy of concepts and a new observation x, the aim is to update the hierarchy to take this new observation into account. The new observation is first inserted at the root of the hierarchy and then iteratively integrated at the different levels of the tree. The algorithm relies on four main operators; the last two aim at repairing bad choices that may have been made because of the sequential processing of the observations:

• Creating a new concept: when, at a given level, x seems too different from the existing concepts, a new concept reduced to this observation is created and the process stops.
• Integrating x in an existing concept C: when this concept is composed of a single observation, x is added to C (thus giving a concept with 2 observations) and two leaves are created, one for each observation. Otherwise, the probabilities of C are updated and the process goes on, considering the sons of C as the new level.
• Merging two concepts: when x seems close to two concepts, the two concepts are merged, x is integrated in the resulting concept and the process goes on.
• Splitting a concept: the concept is removed and its descendants are put at the current level.

The choice of the best operator relies on a criterion, called category utility, for evaluating the quality of a partition. It aims at maximizing both the probability that two objects in the same category have values in common and the probability that objects in different categories have different property values. The sum is taken across all categories C_j, all features X_i and all feature values v_il:

Σ_j Σ_i Σ_l P(X_i = v_il) P(X_i = v_il | C_j) P(C_j | X_i = v_il)

² If P is a property and C is a concept, a property is characteristic if C → P and discriminant if P → C.


A. Cornuéjols and C. Vrain

• P(X_i = v_il | C_j) is called predictability. It is the probability that an object has the value v_il for feature X_i given that the object belongs to category C_j.
• P(C_j | X_i = v_il) is called predictiveness. It is the probability with which an object belongs to the category C_j given that it has the value v_il for feature X_i.
• P(X_i = v_il) serves as a weight: frequent features have a stronger influence.
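To make the criterion concrete, the triple sum over predictability, predictiveness and weight can be estimated from counts. The following sketch is our own illustration (the function name and the counting-based probability estimates are assumptions, not from the text); it takes a dataset of attribute tuples and a candidate partition given as lists of indices:

```python
from collections import Counter

def category_utility_sum(partition, data):
    """Sum over concepts C_j, attributes X_i and values v of
    P(X_i=v) * P(X_i=v | C_j) * P(C_j | X_i=v), estimated from counts."""
    m, n_attr = len(data), len(data[0])
    total = 0.0
    for i in range(n_attr):
        value_counts = Counter(x[i] for x in data)          # for P(X_i = v)
        for members in partition:
            cluster_counts = Counter(data[k][i] for k in members)
            for v, c in cluster_counts.items():
                p_v = value_counts[v] / m                   # weight P(X_i = v)
                p_v_given_c = c / len(members)              # predictability
                p_c_given_v = c / value_counts[v]           # predictiveness
                total += p_v * p_v_given_c * p_c_given_v
    return total
```

On a toy dataset where each cluster has its own attribute values, a partition matching the values scores higher than lumping everything into one concept.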

4.8 Clustering Validation

Validation of a clustering process is a difficult task since clustering is by nature exploratory, aiming at understanding the underlying structure of the data, which is unknown. Thus, contrary to supervised learning, we have no ground truth for assessing the quality of the result. Several approaches have been developed:

• deviation from a null hypothesis reflecting the absence of structure (Jain and Dubes 1988): for instance, samples can be randomly generated and the result is compared to the output computed on real data;
• comparison of the result with some prior information on the expected results. This information can be formulated in terms of the expected structure of the clustering, such as getting compact and/or well separated clusters, or in terms of an expected partition. This method is mostly used for assessing the quality of a new clustering method, by running it on supervised benchmarks in which the true partition is given by the labels. It can also be used when only partial information is given;
• stability measures that study the change in the results, either when the parameters of the clustering algorithm (number of clusters, initial seeds, …) vary or when the data is slightly modified.

In all methods, performance indexes must be defined, either measuring the quality of the clustering by itself (internal indexes) or comparing the result with other clusterings, obtained from a ground truth, on randomly generated data, or with different settings of the parameters (external indexes). There are also relative indexes that compare the results of several clusterings.

4.8.1 Internal Indexes

Such indexes allow measuring the intrinsic quality of the partition. There are many indexes (a list can be found in Halkidi et al. 2002). They tend to integrate into a single measure the compactness of the clusters and their separation: the former must be minimized whereas the latter must be maximized, and usually they are aggregated using a ratio. Some criteria for assessing the compactness of a cluster or its separation from other clusters are given in Table 1; this list is far from exhaustive. For example, the Davies–Bouldin index is defined by


DB = (1/k) Σ_{i=1}^{k} max_{j, j≠i} (δ_i + δ_j) / Δ_ij

In this expression, δ_i is the average distance of the objects of cluster i to its centroid μ_i; it measures the dispersion of cluster i (to be minimized for compactness). Δ_ij is the distance between the centroid of cluster i and the centroid of cluster j (dis(μ_i, μ_j)); it measures the dissimilarity between the two clusters (to be maximized for cluster separation). The term (δ_i + δ_j)/Δ_ij represents a kind of similarity between clusters C_i and C_j. Clustering aims at finding dissimilar clusters, and therefore the similarity of a cluster with the other ones must be small. The Davies–Bouldin index averages the similarity of each cluster with its most similar one; the quality is higher when this index is small.
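The index is straightforward to compute from the data and a labeling. The sketch below is our own illustration (Euclidean distances and the function name are assumptions, not from the text):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: average over clusters of the worst-case
    (delta_i + delta_j) / Delta_ij ratio; lower is better."""
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    # delta_i: average distance of cluster i's objects to its centroid
    delta = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                      for k, c in zip(ks, centroids)])
    score = 0.0
    for i in range(len(ks)):
        # Delta_ij: distance between the centroids of clusters i and j
        ratios = [(delta[i] + delta[j])
                  / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(len(ks)) if j != i]
        score += max(ratios)
    return score / len(ks)
```

Two tight, well-separated clusters give a small value, as expected.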

4.8.2 External Indexes

We suppose that the result of the clustering is a partition C = (C1, . . . , Ck) and that we already know a partition P = (P1, . . . , Pk), called the reference partition. External indexes compare the situation of pairs of points (x, y) in each partition: do they belong to the same cluster in both partitions? Do they belong to different clusters in both partitions? More precisely, comparing the two partitions can involve the following numbers:

• a: number of pairs of points belonging to the same cluster in both partitions
• b: number of pairs of points belonging to the same cluster in C but not in P
• c: number of pairs of points belonging to the same cluster in P but not in C
• d: number of pairs of points belonging to different clusters in both partitions.

This leads to the definition of the Rand index:

RI = (a + d) / (n(n − 1)/2)

where a + d represents the number of agreements between C and P: when the partitions are identical, the Rand index is equal to 1. The Adjusted Rand Index (ARI) is usually preferred. It corrects the Rand index by comparing it to its expected value: when ARI = 0, the learned partition is no better than a random partition, whereas when ARI = 1, the two partitions are identical.

5 Linear Models and Their Generalizations

We now turn to various hypothesis spaces, which can be used either in the unsupervised context or in the supervised one. Most of the following presentation, however, is set in the supervised context. We start with the simplest models: the linear ones.


When the input space X is viewed as a vector space, for instance R^d, where d is the dimension of this space, it becomes natural to think of regularities in the data as geometric structures in this input space. The simplest such structures are linear, like lines or (hyper)planes. Their appeal for inductive learning comes from their simplicity, which helps in understanding or interpreting what they represent, and from the property that the associated inductive criteria are often convex and therefore lead to efficient ways of approximating their global optimum. Additionally, because these models have limited expressive power, they are not prone to overfitting and are stable to small variations of the training data. Typically, the expressions considered for representing regularities are of the form:

h(x) = f( Σ_{i=1}^{n} w_i g_i(x) + w_0 )

1. When the g_i(·) are the projections on the ith coordinate of X, and f is the identity function, we have the classical linear regression model.
2. When f is the sign function, we have a binary linear classification model, where Σ_{i=1}^{n} w_i g_i(x) + w_0 = 0 is the equation of the separating hyperplane.
3. Mixture models in the generative approach are also often expressed as linear combinations of simple density distributions, like mixtures of Gaussians: p(x) = Σ_{i=1}^{n} π_i p_i(x|θ_i).
4. When the functions g_i(·) are themselves non-linear transformations of the input space, we get the so-called generalized linear models.
5. Even though the Support Vector Machine is a non-parametric method for classification (meaning that its number of parameters depends on the training data), it can also be cast as a linear model of the form:

   h(x) = Σ_{i=1}^{n} α_i y_i κ(x, x_i)

   where the kernel function κ measures a "similarity" between instances in the input space and can be seen as a special case of the functions g_i(·), with each g_i(x) = κ(x_i, x).
6. Some ensemble learning methods, such as boosting for binary classification, are equally members of the linear model family. In these methods, the hypotheses generally take the form:

   H(x) = sign( Σ_{i=1}^{n} α_i h_i(x) )

   where the number n of base (or "weak") hypotheses controls the expressive power of the hypothesis, and hence the risk of overfitting the data.


The first problem to solve in learning these models is to choose the "dictionary" of functions g_i(·). The second one is to estimate the parameters w_i (1 ≤ i ≤ n). The choice of the dictionary is either trivially done using classical base functions, like splines for linear regression, or is the result of the expert's insights. When there exists a large amount of training data, methods for "dictionary learning" can be used. The design of these methods, however, remains largely a research problem because the search for a good hypothesis within a hypothesis space is now compounded by the problem of constructing the hypothesis space itself. Dictionary learning methods have been mostly used in the context of vision systems and scene analysis (Qiu et al. 2012; Rubinstein et al. 2010; Tosic and Frossard 2011).

The estimation of the parameters demands first that a performance criterion be defined, so that it can be optimized by controlling their values. In regression, the problem is to learn a function h : X → R from a set of examples (x_i, y_i), 1 ≤ i ≤ m. The differences between the actual and estimated function values on the training examples are called residuals: ε_i = y_i − h(x_i). The least-squares method adopts as the empirical risk the sum of the squared residuals, Σ_{i=1}^{m} ε_i², and, according to the ERM principle, the best hypothesis ĥ is the one minimizing this criterion. One justification for this criterion is to consider that the true target function is indeed linear, but that the observed y_i values are contaminated with Gaussian noise.

The problem of optimizing the empirical risk can be solved by computing a closed-form solution. The inversion of the matrix X⊤X is needed, where X denotes the m-by-d data matrix containing the m instances in rows, described by d features in columns. Unfortunately, this can be prohibitive in high-dimensional feature spaces and can be sensitive to small variations of the training data.

This is why iterative gradient descent methods are usually preferred. It is also in order to control this instability, and the overfitting behavior it can denote, that regularization is called for. The idea is to add penalties on the parameter values. In shrinkage, the penalty is on the square norm of the weight vector w:

w* = argmin_w { (y − Xw)⊤(y − Xw) + λ ||w||² }

This favors weights that are on average small in magnitude. In Lasso (least absolute shrinkage and selection operator), the penalty is on the absolute values of the weights:

w* = argmin_w { (y − Xw)⊤(y − Xw) + λ ||w||₁ }

Lasso uses what is called the L1 norm and favors sparse solutions: solutions with zero values for as many weights as possible while still trying to fit the data.
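The closed-form least-squares solution and its shrinkage variant fit in a few lines of numpy (an illustrative sketch; we use `np.linalg.solve` on X⊤X + λI rather than an explicit matrix inversion):

```python
import numpy as np

def ridge_fit(X, y, lam=0.0):
    """Closed-form regularized least squares: w = (X'X + lam*I)^{-1} X'y.
    With lam = 0 this is ordinary least squares; lam > 0 is shrinkage."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

On noiseless data generated by y = 2x, ordinary least squares recovers the weight exactly, while shrinkage pulls it toward zero, illustrating the bias introduced by the penalty.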


We now look at two simple, yet still widely used, discriminative methods: logistic regression and the perceptron. Logistic regression assumes that the decision function is linear and that the distance of an example from the decision boundary in the input space is indicative of the probability that the example belongs to the class associated with this side of the boundary. In fact, the method assumes that the histogram of these distances follows a normal distribution. If d(x) is the distance of x to the decision boundary, we have:

p̂( class(x) = '+' | d(x) ) = exp(w·x − w0) / (1 + exp(w·x − w0)) = 1 / (1 + exp(−(w·x − w0)))

Because the model is based on generative assumptions (i.e. the model is able to generate data sets, in contrast to, say, decision functions), one can associate a likelihood function to the training data:

L(w, w0) = Π_i P(y_i | x_i) = Π_i p̂(x_i)^{y_i} (1 − p̂(x_i))^{(1 − y_i)}
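Maximizing this likelihood (equivalently, minimizing the negative log-likelihood) can be done by gradient descent. A minimal sketch, under our own conventions (the bias w0 is folded into the weight vector via an appended constant column, and y is encoded as 0/1 as in the likelihood above):

```python
import numpy as np

def logistic_fit(X, y, lr=0.1, steps=2000):
    """Gradient descent on the negative log-likelihood of logistic regression.
    The gradient w.r.t. w is X'(p - y) (with the bias folded into w)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append constant bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))       # predicted P(y = 1 | x)
        w -= lr * Xb.T @ (p - y) / len(y)       # negative log-likelihood gradient
    return w
```

After fitting on a toy 1-D problem, the predicted probabilities increase monotonically with the input, as expected from a linear decision function.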

We then want to maximize the log-likelihood with respect to the parameters, which means that all partial derivatives must be zero:

∇_w L(w, w0) = 0,   ∂L(w, w0)/∂w0 = 0

The corresponding weight parameters w and w0 can be obtained through a gradient descent procedure applied to the negative log-likelihood.

Logistic regression is based on the assumption that the distances from the decision function in the input space X follow a normal distribution. If this assumption is not valid, it is possible that a linear separation exists between the two classes but that logistic regression outputs a decision function that does not separate them properly. By contrast, the perceptron algorithm guarantees that if a linear separation exists between the classes, it will output a linear decision function making no errors on the training set. The perceptron considers the training examples one at a time, and updates its weight vector every time the current hypothesis h_t misclassifies the current example x_i, according to the following equation:

w_{t+1} = w_t + η y_i x_i     (6)

The algorithm may cycle several times through the training set. Learning stops when no training example is misclassified. The perceptron algorithm is simple to implement and is guaranteed to converge in a finite number of steps if the classes are linearly separable.
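The update rule of Eq. 6 gives a very compact implementation. The sketch below uses our own conventions (labels in {−1, +1}, bias folded into the weight vector, and a safety cap on the number of epochs):

```python
import numpy as np

def perceptron_fit(X, y, eta=1.0, max_epochs=100):
    """Perceptron: cycle through the examples and apply w <- w + eta*y*x
    on each mistake; stop when a full epoch makes no error."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # bias folded into the weights
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:              # misclassified (or on the boundary)
                w += eta * yi * xi
                errors += 1
        if errors == 0:                         # converged: all examples correct
            break
    return w
```

For linearly separable data the loop terminates with a separating hyperplane; for non-separable data the epoch cap prevents infinite cycling.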


Fig. 1 A complex decision function in the input space can be made linear in a feature space with an appropriate mapping

All methods described above are limited to finding linear models within the input space. One way to circumvent this serious limitation is to change the input space. This is what the so-called generalized linear models do, by using sets of non-linear basis functions g_i defined over X. In the new description space spanned by the functions g_i, a linear separation can thus correspond to a non-linear separation in the input space X (Fig. 1).

Aside from its limited expressive power, the perceptron has another unsatisfactory property: it outputs a linear decision function as soon as it has found one that separates the training data. However, intuitively, some linear separations can be better than others, and it would be preferable to output the best one(s) rather than the first one discovered. This realization is the basis of methods that attempt to maximize the "margin" between examples of different classes (Fig. 2). Call the margin of an example x with respect to a linear decision function defined by its weight vector w the distance of this example from the decision function: w · x. We then want all the training examples to lie on the good side of the learned decision function (i.e. all the positive instances on one side, and the negative ones on the other side), and the smallest margin over the positive and the negative training examples to be as large as possible (see Fig. 2). In this way, the decision boundary is the most robust to small changes of the training points, and accordingly it can be expected to be the best separator for new unseen data drawn from the same distribution as the training set. This leads to a quadratic constrained optimization problem:

w* = argmin_w (1/2) ||w||²   subject to   y_i (w · x_i) ≥ 1, 1 ≤ i ≤ m

This optimization problem is usually solved using the method of Lagrange multipliers. Ultimately, it can be shown that the solution only depends on the set SS of so-called support vectors, the training examples nearest to the decision boundary, i.e. with the smallest margin. Each support vector x_i is associated with


Fig. 2 A linear decision function between two classes of data points can be defined by the support vectors that are on the margin. Here only three points suffice to determine the decision function

a weight α_i. The decision function then becomes:

h(x) = sign( Σ_{x_i ∈ SS} α_i y_i x_i · x )

where x_i · x is the dot product of x_i and x in the input space X. The dot product can be considered as a measure of resemblance between x_i and x. Accordingly, the SVM can be seen as a kind of nearest-neighbor classifier, where a new example x is labeled by comparison to selected neighbors, the support vectors x_i, weighted by the coefficients α_i.

However, even though SVMs output decision functions that are likely to be better than the ones produced by perceptrons, they would still be limited to producing linear boundaries in the input space. A fundamental realization by Vapnik and his coworkers (Cortes and Vapnik 1995) has changed the expressive power of SVMs, and indeed of many linear methods, such as linear regression, Principal Component Analysis, Kalman filters, and so on.

Indeed, an essential problem of example-based methods is the choice of an appropriate measure of similarity. Any chess player is aware that two game situations identical but for the position of a pawn may actually lead to completely different outcomes of the game. The same can happen when comparing various objects such as molecules, texts, images, etc. In every case, the choice of the right similarity measure is crucial for methods that decide on the basis of similarities to known examples. While mathematics provides us with many measures suited to vector spaces, for example in the form of distances, the problem is much more open when the data involve symbolic descriptors and/or are defined in non-vector spaces, such as texts. This is why a significant part of current machine learning contributions is the definition and


testing of new similarity measures appropriate for particular data types: sequences, XML files, structured data, and so on. The invention and popularization of SVMs by Vapnik and his co-workers have been very influential in reviving and renewing the problem of the design of appropriate similarity functions, as can be seen especially with the rise of the so-called kernel methods (Schölkopf and Smola 2002).

If one wants to adapt linear methods to learn non-linear decision boundaries, a very simple idea comes to mind: transform the data non-linearly to a new feature space in which linear classification can be applied. The problem, of course, is to find a suitable change of representation, from the input space to an appropriate feature space. We will see that one approach is to learn such a change of representation in a multi-layer neural network (see Sect. 6). But one remarkable thing is that, in many cases, the feature space does not have to be explicitly defined. This is the core of the so-called "kernel trick".

Suppose that you must compare two points in R²: u = (u1, u2) and v = (v1, v2). One way to measure their distance is through their dot product u · v = u1 v1 + u2 v2. Suppose now that you decide to consider the mapping (x, y) → (x², y², √2 xy), which translates points in R² into points in R³. Then the points u and v are respectively mapped to u' = (u1², u2², √2 u1 u2) and v' = (v1², v2², √2 v1 v2). The dot product of these two vectors is:

u' · v' = u1² v1² + u2² v2² + 2 u1 u2 v1 v2 = (u1 v1 + u2 v2)² = (u · v)²

In other words, by squaring the dot product in the input space, we obtain the dot product in the new 3-dimensional space without having to actually compute it explicitly. And one can see on this example that the mapping is non-linear. A function that computes the dot product in a feature space directly from the vectors in the input space is called a kernel. In the above example, the kernel is κ(u, v) = (u · v)².
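The identity above can be checked numerically. A small sketch (`phi` and `poly_kernel` are our names for the mapping and the kernel of the example):

```python
import numpy as np

def phi(p):
    """Explicit feature map (x, y) -> (x^2, y^2, sqrt(2)*x*y) into R^3."""
    x, y = p
    return np.array([x * x, y * y, np.sqrt(2) * x * y])

def poly_kernel(u, v):
    """Kernel computing the R^3 dot product directly in R^2."""
    return np.dot(u, v) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, 0.5])
# phi(u) . phi(v) and (u . v)^2 agree, without ever building phi explicitly
```

Here u · v = 4, so both sides equal 16: the kernel evaluates the feature-space dot product at the cost of an input-space one.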
Many kernel functions are known, and because the family of kernel functions is closed under some simple operations, it is easy to define new ones as desired. In order to use the kernel trick to transform a linear method into a non-linear one, the original linear method must be entirely expressible in terms of dot products in the input space. Many methods can be expressed that way, using a "dual form". This is for instance the case of the perceptron, which can thus be transformed into a kernel perceptron able to separate data that are not linearly separable in the input space. A kernel perceptron that maximizes the margin between the decision function and the nearest examples of each class is called a Support Vector Machine. The general form of the decision function of Support Vector Machines is:

h(x) = sign( Σ_{i=1}^{m} α_i y_i κ(x, x_i) )

where m is the number of training data. However, it suffices to use the support vectors (the data points nearest to the decision function) in the sum, which most


often drastically reduces the number of comparisons to be made through the kernel function. It is worth stressing that kernel functions have been devised for discrete structures, such as trees, graphs and logical formulae, thus extending the range of geometrical models to non-numerical data. A good introductory book is (Shawe-Taylor and Cristianini 2004).

The invention of the boosting algorithm, another class of generalized linear classifiers, is due to Schapire and Freund in the early 1990s (Schapire and Freund 2012). It stemmed from a theoretical question about the possibility of learning "strong hypotheses", ones that perform as well as possible on test data, using only "weak hypotheses" that may perform barely above random guessing on training data.

Algorithm 7: The boosting algorithm
  Input: training data set S; number of combined hypotheses T; weak learning algorithm A
  Initialize the distribution on the training set: w_{i,1} = 1/|S| for all x_i ∈ S;
  for t = 1, . . . , T do
    Run A on S with weights w_{i,t} to produce a hypothesis h_t;
    Calculate the weighted error ε_t of h_t on the weighted data set;
    if ε_t ≥ 1/2 then set T ← t − 1 and break;
    α_t ← (1/2) log₂((1 − ε_t)/ε_t)  (confidence for the hypothesis h_t);
    w_{i,t+1} ← w_{i,t}/(2 ε_t) for instances x_i misclassified at time t;
    w_{j,t+1} ← w_{j,t}/(2(1 − ε_t)) for instances x_j correctly classified at time t;
  end
  Output: the final combined hypothesis H : X → Y:

  H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )     (7)

The idea is to find a set of basis decision functions h_i : X → {−1, +1} such that, used collectively in a weighted linear combination, they can provide a high-performing decision function H(x). How to find such a set of basis functions or weak hypotheses? The general principle behind boosting and other ensemble learning methods is to promote diversity in the basis functions so as to eliminate accidental fluctuations and combine their strengths (Zhou 2012). The way boosting does this is by modifying the training set between each stage of the iterative algorithm. At each stage, a weak hypothesis h_t is learnt, and the training set is subtly modified so that h_t does not perform better than random guessing on the new data set. This way, at the next stage, one is guaranteed that if a better-than-random hypothesis can be learnt,


it has used information not used by h_t and therefore brings new leverage on the rule to classify the data points.

In boosting, the training set is modified at each stage by changing the weights of the examples. Specifically, at the start the weights of all training examples are uniformly set to 1/|S|. At the next step, half of the total weight is assigned to the misclassified examples, and the other half to the correctly classified ones. Since the sum of the current weights of the misclassified examples is the error rate ε_t, the weights of the misclassified examples must be multiplied by 1/(2ε_t); likewise, the weights of the correctly classified examples must be multiplied by 1/(2(1 − ε_t)). That way, at stage t + 1, the total weight of the misclassified examples equals the total weight of the other examples, both being 1/2. In each successive round, the same operation is carried out. Finally, a confidence is computed for each learned weak hypothesis h_t. An analysis of the minimization of the empirical risk (with an exponential loss function) leads to associating α_t = (1/2) log₂((1 − ε_t)/ε_t) to each h_t. The basic algorithm is given in Algorithm 7.

From Eq. 7, it can be seen that boosting realizes a linear classification in a feature space made of the values of the learned weak hypotheses (see Fig. 3). It can therefore be considered as a generalized linear method which learns its own feature map.

While not strictly linear in a geometrical sense, another method, the random forest, belongs to the family of ensemble learning methods, like boosting (Breiman 2001). There, the idea is to foster diversity in the basis hypotheses by changing both the training data and the hypothesis space at each round of the algorithm. The training set is changed at each time step using a bootstrap strategy, by sampling with replacement m training examples (not necessarily all different) from the initial training set S of size m. The hypothesis space is made of decision trees, and in order to accentuate diversity, a subset of the attributes is randomly drawn at each step, forcing the individual "weak" decision trees to be different from each other. Random forests have been the winning method in numerous competitions on data analysis, and while they are less publicized since the advent of deep neural networks, they remain a method of choice for their simplicity and their propensity to reach a good level of performance.
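A compact sketch of Algorithm 7, using threshold "stumps" as weak hypotheses on one-dimensional data (the stump weak learner is our own illustrative choice; the algorithm itself is agnostic to the choice of A):

```python
import numpy as np

def adaboost(X, y, T=10):
    """Boosting with threshold stumps on a 1-D input, using the reweighting
    of Algorithm 7: misclassified examples scaled by 1/(2*eps),
    correctly classified ones by 1/(2*(1 - eps))."""
    m = len(y)
    w = np.full(m, 1.0 / m)
    ensemble = []                              # list of (alpha, threshold, sign)
    for _ in range(T):
        # weak learner A: the stump with lowest weighted error
        best = None
        for thr in X[:, 0]:
            for s in (1, -1):
                pred = np.where(X[:, 0] > thr, s, -s)
                eps = w[pred != y].sum()
                if best is None or eps < best[0]:
                    best = (eps, thr, s, pred)
        eps, thr, s, pred = best
        if eps == 0:                           # perfect stump: keep it and stop
            ensemble.append((1.0, thr, s))
            break
        if eps >= 0.5:                         # no better than random: stop
            break
        alpha = 0.5 * np.log2((1 - eps) / eps)
        ensemble.append((alpha, thr, s))
        w = np.where(pred != y, w / (2 * eps), w / (2 * (1 - eps)))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(a * np.where(X[:, 0] > thr, s, -s) for a, thr, s in ensemble)
    return np.sign(score)
```

Note how the single `np.where` line implements the whole reweighting step described above.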


Fig. 3 Boosting in effect maps the data points into a feature space with T dimensions corresponding to the values of the weak hypotheses h_t(·). In this new feature space the final hypothesis H(x) = sign(Σ_{t=1}^{T} α_t h_t(x)) corresponds to a linear classifier


Linear methods are said to be shallow since they combine only one "layer" of descriptors, which supposedly limits their expressive power. Methods that change the representation of the data, and especially deep neural methods, are motivated by overcoming this limitation.

6 Multi-layer Neural Networks and Deep Learning

The preceding section has introduced us to linear models and ways to learn them. Among these models, the perceptron was introduced by Frank Rosenblatt in the late 1950s with the idea that it was a good model for learning perceptual tasks such as speech and character recognition. The learning algorithm (see Eq. 6) was general, simple and yet efficient. However, it was limited to learning decision functions linear with respect to the input space. This limitation was forcibly put forward in the book Perceptrons by Marvin Minsky and Seymour Papert in 1969 (see Minsky and Papert 1988), and this quickly translated into a "winter period" for connectionism.

Actually, Minsky and Papert acknowledged in their book that layers of interconnected neurons between the input and the last linear decision function could perform some sort of representation change that would allow the system to solve non-linear decision tasks, but they did not see how it would be possible to disentangle the effects of modifications of the various weights attached to the connections, and thus to learn these weights. They believed that these first layers would have to be hand-coded, a daunting task for all but the simplest perceptual problems. For more than ten years, no significant activity occurred in the study of artificial neural networks, notwithstanding the work of some courageous scientists during this period.

In 1982, a change came from a field foreign to Artificial Intelligence. John Hopfield, a solid-state physicist, noticed a resemblance between the problem of finding the state of lowest energy in a spin glass and the problem of pattern recognition in neural networks (Hopfield 1982). In each case, there are "basins of attraction", and changing the weights of the connections between atoms or neurons can alter the basins of attraction, which translates into modifying the ideal patterns corresponding to imperfect or noisy inputs.

For the first time, a non-linear function from the input space to the output one was shown to be learnable with a neural network. This spurred the development of other neural models, noticeably the "Boltzmann machine", unfortunately slow and impractical, and the multi-layer perceptron.

6.1 The Multi-layer Perceptron

In multi-layer perceptrons, the signal is fed to an input layer made of as many neurons (see Fig. 4) as there are descriptors or dimensions in the input space, and is then propagated through a number of "hidden layers" up to a final output layer, which computes the output corresponding to the pattern presented to the input layer.


Fig. 4 Model of a formal neuron


Fig. 5 The back-prop algorithm illustrated. Here the desired output u is compared to the output produced by the network y, and the error is back-propagated in the network using local equations


The hidden layers are in charge of translating the input signal in such a way that the last, output, layer can learn a linear decision function that solves the learning problem. The hidden layers thus act as a non-linear mapping from the input space to the output one. The question is how to learn this mapping. It happens that the solution to this problem was found the very same year the book Perceptrons was published, by Arthur Bryson and Yu-Chi Ho, and then again in 1974 by Paul Werbos in his PhD thesis. However, it is only in 1986 that the now famous "back-propagation algorithm" was widely recognized as the solution to the learning of hidden layers of neurons and credited to David Rumelhart, Geoff Hinton and Ronald Williams (Rumelhart et al. 1987), while Yann Le Cun, independently, in France, homed in on the same algorithm (Le Cun 1986) (see Fig. 5). To discover the learning algorithm, the neurons had to be seen no longer as logical gates, with a {0, 1} or {False, True} output, but as continuous functions of their inputs: the famous 'S'-shaped logistic function or the hyperbolic tangent. This allowed the computation of the gradient of the error signal (the difference between the desired output and the computed one) with respect to each connection in the network, and thus the computation of the direction of change for each weight in order to reduce this prediction error.
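For one hidden layer, the back-propagation step sketched in Fig. 5 amounts to a few lines of numpy. This is an illustrative single gradient step with logistic units (biases are omitted for brevity, and the squared error between the output y and the desired output u is the loss being descended):

```python
import numpy as np

def mlp_forward_backward(x, u, W1, W2, eta=0.1):
    """One step of back-propagation for a 1-hidden-layer network with
    logistic units: forward pass, output error, back-propagated deltas,
    gradient update of both weight matrices (updated in place)."""
    g = lambda a: 1.0 / (1.0 + np.exp(-a))       # logistic activation
    z1 = g(W1 @ x)                               # hidden layer activities
    y = g(W2 @ z1)                               # network output
    delta2 = (y - u) * y * (1 - y)               # uses g'(a) = g(a)(1 - g(a))
    delta1 = (W2.T @ delta2) * z1 * (1 - z1)     # error back-propagated to layer 1
    W2 -= eta * np.outer(delta2, z1)
    W1 -= eta * np.outer(delta1, x)
    return y, W1, W2
```

Each step moves the weights along the negative gradient of the squared error, so repeating it on an example reduces the prediction error on that example.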


A. Cornuéjols and C. Vrain

The invention of the back-propagation algorithm arrived at a timely moment in the history of Artificial Intelligence when, on the one hand, it was felt that expert systems were decidedly difficult to code because of the knowledge bottleneck and, on the other hand, symbolic learning systems started to show brittleness when fed with noisy data, something that did not bother neural networks. Terry Sejnowski and Charles Rosenberg showed to stunned audiences how the system NETtalk could learn to correctly pronounce phonemes according to the context, reproducing the same stages in learning as exhibited by children, while other scientists successfully applied the new technique to problems of speech recognition, prediction of stock market prices or hand-written character recognition. In fact, almost a decade before the DARPA Grand Challenge on autonomous vehicles, a team at Carnegie Mellon successfully trained a multilayer perceptron to drive a car, or, more modestly, to appropriately turn the steering wheel by detecting where the road was heading in video images recorded in real time. This car was able to steer itself almost 97% of the time when driving from the East coast of the United States to the West coast in 1997 (see https://www.theverge.com/2016/11/27/13752344/alvinn-self-driving-car-1989-cmu-navlab). However, rather rapidly, in the mid-90s, multilayer perceptrons appeared limited in their capacity to learn complex supervised learning tasks. When hidden layers were added in order to learn a better mapping from the input space to the output one, the back-propagated error signal would rapidly become too spread out to induce significant and well-informed changes in the weights, and learning would not succeed. Again, connectionism subsided and yielded to new learning techniques such as Support Vector Machines and Boosting.

6.2 Deep Learning: The Dream of the Old AI Realized?

Very early, researchers in artificial intelligence realized that one crucial key to successful automatic reasoning and problem solving was how knowledge was represented. Various types of logics were invented in the 60s and 70s; Saul Amarel, in a famous paper (Amarel 1968), showed that a problem becomes trivial if a good representation of it is designed; and Levesque and Brachman in the 80s strikingly exhibited the tradeoff between representation expressiveness and reasoning efficiency (Levesque and Brachman 1987), demonstrating the importance of the former. Likewise, it was found that the performance of many machine learning methods is heavily dependent on the choice of representation they use for describing their inputs. In fact, most of the work done in machine learning involves the careful design of feature extractors, and generally of preprocessing steps, in order to get a suitable redescription of the inputs so that effective machine learning can take place. The task can be complex when vastly different inputs must be associated with the same output, while subtle differences might entail a difference in the output. This is often the case in image recognition, where variations in position, orientation or illumination of an object should be irrelevant, while minute differences in the shape,
or texture can be significant. The same goes for speech recognition, where the learning system should be insensitive to pitch or accent variations while being attentive to small differences in prosody, for instance. Until 2006, and the first demonstrations of successful “deep neural networks”, it was believed that the clever changes of representation necessary in some application areas could only be hand-engineered. In other words, the dream of AI, automatically finding good representations of inputs and/or knowledge, could only be realized for “simple” problems, as was done, in part, by multilayer perceptrons. While it was known that representations using hierarchical levels of descriptors could be much more expressive than shallow representations given the same number of features, it was also believed that learning such multiple levels of raw features and more abstract descriptors was beyond the reach of algorithmic learning from data. All that changed in 2006, when a few research groups showed that iteratively stacking unsupervised representation learning algorithms could yield multiple levels of representation, with higher-level features being progressively more abstract descriptions of the data. These first successes made enough of an impression to rekindle interest in artificial neural nets. This interest spectacularly manifested itself in the unexpectedly large audience of the “Deep Learning Workshop: Foundations and Future Directions” organized alongside the NIPS-2007 conference. Since then, there has been an increasing number of scientific gatherings on this subject: notably one workshop every year at the NIPS and ICML conferences, the two major conferences in machine learning, and a new specialized conference created in 2013: ICLR, the International Conference on Learning Representations. But then, what was the novelty with respect to the previous multilayer perceptrons?
It was not a revolution, but there were three significant new ingredients that changed the realm of possibilities. The first was the realization that it helps learning tremendously if the weights of the connections are not initialized randomly but in a cleverer way. The new idea was to train each layer one by one with unsupervised learning, and then to finish with a round of supervised learning using the standard back-propagation technique. In the early days of deep learning, this was done by variants of the Boltzmann machines invented in the 1980s; later it was carried out with autoencoders, which learn to associate each input with itself but through a limited bandwidth represented by a constrained hidden layer of neurons. That way, enormous volumes of unsupervised training data could be put to use to learn the initial weights of the network and, more importantly, to learn meaningful, hierarchically organized descriptors of the data. The second ingredient was the use of vastly more computing power than a decade earlier. To learn the millions of weights typical of the new deep networks, classical CPUs, even with parallelism, were not enough: the computing power of Graphics Processing Units (GPUs) had to be harnessed. This was quickly realized and led to new record-breaking achievements in machine learning challenges. The possibility of using vastly larger sets of training data, with correspondingly much faster computation, led to new horizons. Indeed, the use of very large data sets is instrumental in avoiding overfitting.
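The unsupervised pre-training ingredient can be sketched with a deliberately simplified stand-in: a linear autoencoder with tied weights, trained by gradient descent on its reconstruction error (real systems of the era used stacked Boltzmann machines or non-linear autoencoders; the sizes and learning rate here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear autoencoder with tied weights: encode 8-dimensional inputs into a
# 3-dimensional code, then reconstruct. No labels are used (unsupervised).
X = rng.normal(size=(100, 8))
W = rng.normal(0.0, 0.1, (8, 3))   # encoder weights; the decoder uses W.T
lr = 0.01

losses = []
for _ in range(500):
    E = X @ W @ W.T - X                              # reconstruction error
    losses.append(float(np.mean(E ** 2)))
    # gradient of mean squared reconstruction error w.r.t. the tied weights
    grad = 2.0 * (X.T @ E @ W + E.T @ X @ W) / X.size
    W -= lr * grad
```

The reconstruction loss decreases without any supervision, and the learned columns of W form a compressed description of the data that could then initialize one layer of a deeper network.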


Table 2 Some usual activation functions in artificial neurons. The Heaviside activation function was used in the perceptron

Name                   Equation                           Derivative
Heaviside              f(x) = 0 if x < 0; 1 if x ≥ 0      f′(x) = 0 if x ≠ 0; undefined at x = 0
Logistic (sigmoid)     f(x) = 1 / (1 + e^(−x))            f′(x) = f(x)(1 − f(x))
ReLU                   f(x) = 0 if x < 0; x if x ≥ 0      f′(x) = 0 if x < 0; 1 if x ≥ 0

However, a third ingredient was also of considerable importance, related to the analysis of the inefficiencies of the back-propagation algorithm when the number of hidden layers exceeds a small value, like 2 or 3. This analysis led to two major findings. First, the non-linear activation function of the neurons has a big impact on performance: the classical ‘S’ shaped ones, like the sigmoid function, easily result in vanishing gradients and thus in very slow or inexistent learning. It was found that the rectified linear unit (ReLU), which is simply the half-wave rectifier g(z) = max(z, 0), albeit not differentiable at 0, leads to a far more efficient back-propagation of the error signal (see Table 2). The second finding was that, to better channel the error signal, it is possible to randomly “paralyse” a proportion of the neurons during each back-propagation learning step. This trick, called “dropout”, not only improved the back-propagation of the error signal in deep networks, but also proved to be related to the bagging technique, which enlists an ensemble of classifiers in order to reduce the variance of the learned models, and therefore to increase learning performance. A good book about deep learning is (Goodfellow et al. 2016).
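Both findings can be illustrated in a few lines (a schematic sketch; the layer count and the dropout rate p are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def dsigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def drelu(z):
    return (z > 0).astype(float)

# Back-propagation multiplies the error signal by g'(a) at every layer.
# Sigmoid: g'(a) <= 0.25, so across 10 layers the best-case factor is 0.25**10.
# ReLU: g'(a) = 1 on the active side, so the signal passes through unchanged.
sig_factor = float(dsigmoid(np.zeros(10)).prod())   # (0.25)**10, about 1e-6
relu_factor = float(drelu(np.ones(10)).prod())      # exactly 1.0

# Dropout: silence a random fraction p of the hidden units at each training
# step ("inverted" variant: surviving activations are rescaled by 1/(1-p)).
p = 0.5
h = rng.normal(size=(4, 100))                 # a batch of hidden activations
mask = (rng.random(h.shape) >= p) / (1.0 - p)
h_drop = h * mask
```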

6.3 Convolutional Neural Networks

While the early revival of research in neural networks, in 2006 and later, was largely spurred by the realization that iterative construction of deep neural networks was
possible by using iterative unsupervised learning of each individual layer of neurons, another competing neural architecture soon regained attention. This architecture was the Convolutional Neural Network, on which Yann Le Cun had been working as early as 1990 (Le Cun et al. 1990, 1995). Deep neural networks in general exploit the idea that many natural signals are hierarchical and compositional: higher-level features are composed of lower-level ones. For instance, in images, low-level detectors may attend to edges and local contrasts, those being combined into motifs at higher levels, motifs that are then combined into parts, and parts finally into objects. Similar hierarchies exist in speech or in texts and documents. Whereas the deep architectures obtained by iteratively applying unsupervised learning trust the successive layers to encode such hierarchical descriptors, convolutional neural networks are designed to realize invariance operations, through convolutions, that lead to such hierarchies. Four key ideas are drawn upon in ConvNets: local connections, shared weights, pooling and the use of many layers. The first few stages of a typical ConvNet are composed of two types of layers: convolutional layers and pooling layers (see Fig. 6). The input to a ConvNet must be thought of as an array (one-dimensional or multi-dimensional). The units in a convolutional layer see only a local patch of the input array: either the raw input in the first layer, or an array from the previous layer in later layers. Each unit is connected to its input patch through a set of weights called a filter bank. The result of this local weighted sum is then passed through the activation function, usually a ReLU. Units in convolutional layers are organized in several feature maps, and all units in the same feature map share the same filter bank (aka weight sharing).
This way, each feature map specializes into the detection of some local motif in the input, and these motifs can be detected anywhere in the input. Mathematically, this amounts to a discrete convolution of the input. Then, the role of the pooling layer is to merge

Fig. 6 A convolutional network taking as input an image of a dog, transformed into RGB input arrays. Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up, with ever more combined features as the signal goes up toward the last layer, where a probability is assigned to each class of object (image taken from Le Cun et al. 2015)


local motifs into one. A typical pooling unit may compute the maximum of the outputs of the local detectors realized by the convolutional units, thus claiming that some motif has been found anywhere in the array if this maximum is high enough. This encodes a detector invariant under translation. Pooling units may also be used to reduce the dimension of the input, or to create invariance to small shifts, rotations or distortions. Typically, several stages of convolutions, non-linearity and pooling are stacked, followed by further convolutional and fully-connected layers. The back-propagation algorithm, possibly with dropout and other optimization tricks, is used to train all the weights in all the filter banks. Recent convolutional neural networks have more than 10 layers of ReLUs and hundreds of millions, sometimes billions, of weights. A ConvNet takes an input expressed as an array of numbers and returns the probability that the input, or some part of it, belongs to a particular class of objects or patterns. ConvNets were applied to speech recognition, with time-delay neural networks, and to hand-written character recognition in the early 1990s. More recently, ConvNets have yielded impressive levels of performance, sometimes surpassing human performance, in the detection, segmentation and recognition of objects and regions in images: traffic sign recognition, the segmentation of biological images, the detection of faces in photos, and so on. These networks have also been instrumental in the success of the AlphaGo program, which now routinely beats all human champions at the game of Go (Silver et al. 2016, 2017). Other areas where ConvNets are gaining importance are speech recognition and natural language understanding.
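A bare-bones sketch of the two layer types (single channel, unit stride, and a fixed hand-chosen filter instead of a learned one):

```python
import numpy as np

def conv2d(img, kernel):
    # 'valid' discrete convolution (strictly, cross-correlation, as in most
    # deep learning libraries): each output unit sees one local input patch.
    H, W = img.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    # Non-overlapping max pooling: keeps the strongest local response,
    # giving a small amount of translation invariance.
    H, W = fmap.shape
    trimmed = fmap[:H - H % size, :W - W % size]
    return trimmed.reshape(H // size, size, W // size, size).max(axis=(1, 3))

# A vertical-edge filter applied to an image with one vertical edge.
img = np.zeros((6, 6))
img[:, 3:] = 1.0
edge = np.array([[-1.0, 1.0]])              # responds to a left-to-right jump
fmap = np.maximum(conv2d(img, edge), 0.0)   # convolution followed by ReLU
pooled = max_pool(fmap)
```

The feature map responds only where the vertical edge lies, and pooling keeps that response while halving the resolution.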

6.4 The Generative Adversarial Networks

The preceding sections have dealt mainly with supervised learning: learning to associate inputs with outputs from a training sample of (input, output) pairs. In 2014, Ian Goodfellow and co-workers (Goodfellow et al. 2014) presented an intriguing and seductive idea by which deep neural networks could be trained to generate synthetic examples indistinguishable from the training examples. For instance, given a set of images of apartment interiors, a network can be trained to generate other interiors that are quite plausible for apartments; or a network could be trained to generate synthetic paintings in the style of Van Gogh. The idea is the following. In GANs, two neural networks are trained in an adversarial way (see Fig. 7). One neural network, G, is the generator network: it produces synthetic examples using a combination of latent, or noise, variables as input, and its goal is to produce examples that are impossible to distinguish from the examples in the training set. For this, it learns how to combine the latent input variables through its layers of neurons. The other neural network, D, is the discriminator: its task is to recognize whether an input comes from the training set or from the generator G. The two networks evolve jointly: G tries to fool D, while D learns to discriminate the training inputs from the generated ones. Learning is successful when D is no longer able to do so.
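The adversarial loop can be sketched on a toy one-dimensional problem (a hypothetical miniature: a linear generator and a logistic discriminator, far from the deep networks used in practice; all sizes and rates are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Generator G(z) = a*z + b maps noise z ~ N(0,1) to synthetic samples.
# Discriminator D(x) = sigmoid(w*x + c) scores "x comes from the data".
# Real data: x ~ N(4, 1); G should end up producing samples centred near 4.
a, b = 1.0, 0.0          # generator parameters
w, c = 0.0, 0.0          # discriminator parameters
lr, batch = 0.05, 64

for _ in range(2000):
    x_real = rng.normal(4.0, 1.0, batch)
    z = rng.normal(0.0, 1.0, batch)
    x_fake = a * z + b

    # D step: gradient ascent on log D(real) + log(1 - D(fake))
    d_real = sigmoid(w * x_real + c)
    d_fake = sigmoid(w * x_fake + c)
    w += lr * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake))
    c += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # G step: gradient ascent on log D(fake) (non-saturating generator loss)
    d_fake = sigmoid(w * x_fake + c)
    a += lr * np.mean((1 - d_fake) * w * z)
    b += lr * np.mean((1 - d_fake) * w)

fake_mean = float(np.mean(a * rng.normal(0.0, 1.0, 10000) + b))
```

After training, the mean of the generated samples should have moved from 0 toward the data mean of 4: G produces samples that D can, on average, no longer separate from the data.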


Fig. 7 A typical GAN network

6.5 Deep Neural Networks and New Interrogations on Generalization

A recent paper (Zhang et al. 2016) has drawn attention to a glaring lack of understanding of the reasons for the success of deep neural networks. Recall that supervised learning is typically done by searching for the hypothesis function that optimizes as well as possible the empirical risk, which is the cost of using a hypothesis on the training set. In order for this empirical risk minimization principle to be sound, one has to constrain the hypothesis space H considered by the learner. The tighter the constraint, the tighter the link between the empirical risk and the real risk, and the better the guarantee that the hypothesis optimizing the empirical risk is also a good hypothesis with regard to the real risk. If no such constraint on the hypothesis space exists, then there is no evidence whatsoever that a good performance on the training set entails a good performance in generalization. One way to measure the constraint, or capacity, of the hypothesis space is to measure to what extent one can find a hypothesis in H that fits any training set. If, for any training set, with arbitrary inputs and arbitrary outputs, one can find a hypothesis with low, or null, empirical risk, then there is no guaranteed link between the measured empirical risk and the real one. But this is exactly what happens with deep neural networks. It has been shown in several studies that typical deep neural networks, used successfully for image recognition, can indeed be trained to have a quasi-null empirical risk on any training set. Specifically, these neural nets could learn perfectly images where the pixels and the outputs were set randomly. For all purposes, there was no limit to the way they could fit any training set. According to the common wisdom acquired from statistical learning theory, they should severely overfit any training set. Why, then, are they so successful on “natural” images?
A number of papers have been hastily published since this finding. At the time of this writing, they offer only what seem to be partial explanations. One observation is that learning does not proceed in the same way when memorizing, by heart, a set of “random” data as when learning a set of “reasonable” data. Further studies are needed, and they are expected to bring new advances in the understanding of what makes induction successful. More information about deep neural networks can be found in Le Cun et al. (2015), Goodfellow et al. (2016).
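The phenomenon is not specific to neural networks: any learner with enough capacity can fit arbitrary labels. A minimal stand-in (1-nearest-neighbour in place of a deep network) makes the point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random "images" with random labels: by construction, there is nothing
# to generalize, yet a high-capacity learner fits the training set exactly.
X = rng.normal(size=(200, 50))
y = rng.integers(0, 10, size=200)

def one_nn_predict(Xtr, ytr, Xte):
    # 1-nearest-neighbour: a pure memorizer, standing in for an
    # over-parameterized network that can fit any labelling.
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=-1)
    return ytr[d.argmin(axis=1)]

train_error = float(np.mean(one_nn_predict(X, y, X) != y))   # exactly 0
X_new = rng.normal(size=(200, 50))
y_new = rng.integers(0, 10, size=200)
test_error = float(np.mean(one_nn_predict(X, y, X_new) != y_new))  # ~ chance
```

The training error is exactly zero although the labels are pure noise, while the error on fresh data stays at chance level: a null empirical risk alone guarantees nothing about the real risk.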


7 Concept Learning: Structuring the Search Space

Concept learning is deeply rooted in Artificial Intelligence, dating from the time when knowledge appeared to be the key to solving many AI problems. The idea was then to learn concepts defined by their intent, and thus meaningful models easily understandable by experts of the domain. Logic (propositional or first-order) was therefore the most used representation formalism. Nevertheless, modeling the world requires dealing with uncertainty, and thus studying formalisms beyond classical logic: fuzzy logic, probabilistic models, … Learning a representation of the world then becomes much more difficult, since two learning problems must be faced: learning the structure (a set of rules, dependencies between variables, …) and learning the parameters (probabilities, fuzzy functions, …). We focus in this section on learning the structure in classical logic; probabilistic learning will be addressed in Sect. 8. The structure depends on the input data and on the learning task. Data can be given in a table, expressing the values taken by the objects on attributes or features, or it can be more complex, with a set of objects described by attributes and linked by relations, such as a relational database or a social network. Data can also be sequential or structured in a graph. Depending on the data types, the outputs vary from sets of patterns (conjunctions of properties, rules, …), often written in logic, to graphical models (automata, hidden Markov models, Bayesian networks, conditional random fields, …). Before going further, let us specify some basic notions about the different representation languages used in this section. We focus mainly on rule learning. A rule is composed of two parts: the left-hand side expresses the conditions that must be satisfied for the rule to be applied, and the right-hand side, or conclusion, specifies what becomes true when the conditions are met.
Several languages can be used to express rules: propositional logic, composed only of propositional symbols, as in the rule vertebrate ∧ tetrapod ∧ winged → bird, or the attribute-value representation, as in temperature ≥ 37.5 → fever. First-order logic allows one to express more complex relations between objects, such as father(X, Y), father(Y, Z) → grandFather(X, Z). Sometimes, when the concept to be learned is implicit (for example, learning the concept mammal), the conclusion of the rule is omitted and only the conjunction of properties defining the concept is given. More complex expressions can be written, like general clauses, which allow one to specify alternatives in the conclusion or to introduce negation. Learning knowledge expressed in first-order logic is the field of study of Inductive Logic Programming (Raedt 2008; Dzeroski and Lavrac 2001). Another representation formalism is the family of graphical models (Koller and Friedman 2009): they are based on graphs representing dependencies between variables. We distinguish mainly Bayesian networks, directed graphs associating with each node the conditional probability of this node given its parents, and Markov networks, undirected graphs in which a variable is independent of the others given its neighbors. A research stream called Statistical Relational Learning (Raedt et al. 2008; Getoor and Taskar 2007) aims at coupling Inductive Logic Programming with probabilistic approaches.


Finally, grammatical inference (de la Higuera 2010 or Miclet 1990) aims at learning grammars or languages from data; the family of models considered is often that of automata, possibly probabilistic ones. Let us notice that often we do not look for a single hypothesis but for a set of hypotheses, for instance a disjunction of hypotheses. Consider learning a concept from positive and negative examples. It may be unrealistic to look for a single rule covering the positive examples and rejecting the negative ones, since this would assume that all positive examples follow the same pattern: we then look for a set of rules. However, extending the model to a set of hypotheses reintroduces the necessary compromise between the inductive criterion and the complexity of the hypothesis space. Indeed, the disjunction of the positive examples, x_1 ∨ · · · ∨ x_m, represents a set of hypotheses that covers all positive examples and rejects all negative ones, under the condition that the negative examples differ from the positive ones. We then have an arbitrarily complex hypothesis, varying according to the learning examples, and with a null empirical error. This is learning by heart, involving no generalization. In order to solve this problem, constraints on the hypotheses must be introduced, that is, regularization, as mentioned in Sect. 3.3.

7.1 Hypothesis Space Structured by a Generality Relation

7.1.1 Generalization and Coverage

When the hypothesis space H is no longer parameterized, the question is how to perform an informed exploration of the hypothesis space. The notion of a hypothesis space structured by a generality relation was first introduced for concept learning, where the aim is to learn a definition of a concept in the presence of positive and negative examples of this concept. A hypothesis describes a part of the space X, and we search for a hypothesis that covers the positive examples and excludes the negative ones. Two hypotheses can then be compared according to the subspace of X they describe, or according to the observations they cover. We define the coverage of a hypothesis as the set of observations of X satisfied, or covered, by this hypothesis. A hypothesis is more general than another one, denoted by h_1 ≥ h_2, if the coverage of the former contains the coverage of the latter (h_1 ≥ h_2 iff coverage(h_2) ⊆ coverage(h_1)). The inclusion relation defined on X thus induces a generality relation on H, as illustrated in Fig. 8, which is a partial preorder. It is not antisymmetric, since two different hypotheses may cover the same subspace of X, but it can be transformed into an order relation by defining two hypotheses as equivalent when they cover the same subspace of X and by considering the quotient set of H with respect to this equivalence relation. Thus only one representative hypothesis among all the equivalent ones has to be considered. The coverage relation is fundamental for induction because it satisfies an important property: when a hypothesis is generalized (resp. specialized), its coverage becomes larger (resp. smaller). Indeed, when learning a concept, an inconsistent hypothesis covering negative examples must be specialized to reject these negative examples, whereas an incomplete hypothesis not covering all known positive examples has to be generalized in order to cover them. In other words, an inconsistent hypothesis, when generalized, remains inconsistent, whereas an incomplete hypothesis, when specialized, remains incomplete. The induction process is therefore guided by the notion of coverage, which allows the definition of quantitative criteria, such as the number of covered positive examples and the number of covered negative examples, useful for guiding the search and for pruning the search space.

Fig. 8 The inclusion relation in X induces a generalization relation in H. It is a partial preorder: h_1 and h_2 are not comparable, but both are more specific than h_3

7.1.2 Generalization in Logic

This notion of generalization is defined in terms of the input space X, used to describe the observations, while learning is performed by exploring the hypothesis space H. From the point of view of search, it is more efficient to define operators working directly in the set of hypotheses H but respecting the generality relation defined on H in terms of the inclusion relation in X. Let us notice that in some cases the representation language of the examples is a subset of H, thus making the computation of the coverage easier. To illustrate this point, let us consider a dataset describing animals, inspired by the zoo dataset of the UCI Machine Learning Repository.3 An observation is described by some boolean properties (presence of feathers, for instance) that can be true or false. In propositional logic, it can be represented by a conjunction of literals, where a literal is either a property or the negation of a property. Given a set of animals, it may be interesting to find the properties they have in common. This set of properties can still be expressed by a conjunction of literals: X and H share the same representation space, a conjunction of literals or, equivalently, a set of literals. In this context, a hypothesis is more general than another when it contains fewer properties (h_1 ≥ h_2 if h_1 ⊆ h_2). It is easy to show that if h_1 ⊆ h_2, then coverage(h_2) ⊆ coverage(h_1). The first definition is more interesting than the latter, since comparisons are made in H and do not involve X. Thus, while defining a generality relation is quite easy in propositional logic, it becomes more problematic in predicate logic. Indeed, in first-order logic, a natural definition would be the logical implication between two formulas, but this problem is not decidable, and this is why Plotkin (1970) introduced the notion of θ-subsumption between clauses: given two clauses A and B, A is more general than B if there exists a substitution θ such that A.θ ⊆ B.4 Let us consider 4 people, John (jo), Peter (pe), Ann (an) and Mary (ma), and the two rules:

father(X, Y), father(Y, Z) → grandFather(X, Z)
father(jo, pe), father(pe, ma), mother(an, ma) → grandFather(jo, ma)

They can be transformed into clauses:

¬father(X, Y) ∨ ¬father(Y, Z) ∨ grandFather(X, Z)
¬father(jo, pe) ∨ ¬father(pe, ma) ∨ ¬mother(an, ma) ∨ grandFather(jo, ma)

3 http://archive.ics.uci.edu/ml/datasets/zoo.

Consider these clauses as sets of literals. If we instantiate X by jo, Y by pe and Z by ma, the first instantiated clause is included in the second; the first is therefore considered more general w.r.t. θ-subsumption. The θ-subsumption test involves potentially costly comparisons between hypotheses: computing the lgg of 2 clauses under θ-subsumption is O(n^2), where n is the size of the clauses, and computing the lgg of s clauses is O(n^s). In this example, we have made the assumption that examples are described in the same representation space as the hypotheses: an example is a fully instantiated clause and a hypothesis is a clause. This is called the single representation trick, and it generally makes the coverage test simpler. We could also have given a set of atoms describing many people, including John, Peter, Mary and Ann, and the relations between them; the subsumption test would then have been even more costly, since many comparisons would have to be performed with irrelevant information. The definition of θ-subsumption can be extended to take domain knowledge into account, for example that a father or a mother are parents, but then reasoning mechanisms must be introduced at the price of complexity. It is important to realize that the complexity of the coverage test is often a determining factor for choosing the representation space X and the hypothesis space H. The representation language chosen for the hypothesis space is thus essential for determining a generality relation allowing an efficient exploration of the hypothesis space. Among the possible order relations, a special interest has been put on the relations that form a lattice: in this case, given two hypotheses h_i and h_j, there exist a least upper bound and a greatest lower bound. They are called the least general generalization, lgg(h_i, h_j), and the least specific specialization, lss(h_i, h_j).
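The θ-subsumption test can be sketched directly (using, for brevity, only the positive body atoms of the grandfather example, and a brute-force search over substitutions):

```python
from itertools import product

# Literals as (predicate, (term, ...)) tuples; variables are capitalized
# strings, constants lowercase. A clause is a frozenset of literals.
def is_var(term):
    return term[0].isupper()

def substitute(literal, theta):
    pred, args = literal
    return (pred, tuple(theta.get(t, t) for t in args))

def theta_subsumes(A, B):
    """A theta-subsumes B iff some substitution theta makes A.theta a subset of B."""
    vars_a = sorted({t for _, args in A for t in args if is_var(t)})
    terms_b = sorted({t for _, args in B for t in args})
    # Brute-force enumeration of substitutions: exponential in the number
    # of variables, reflecting the cost mentioned in the text.
    for values in product(terms_b, repeat=len(vars_a)):
        theta = dict(zip(vars_a, values))
        if all(substitute(lit, theta) in B for lit in A):
            return True
    return False

general = frozenset({("father", ("X", "Y")), ("father", ("Y", "Z"))})
specific = frozenset({("father", ("jo", "pe")), ("father", ("pe", "ma")),
                      ("mother", ("an", "ma"))})
```

With θ = {X: jo, Y: pe, Z: ma}, every literal of the general clause appears in the specific one, so the general clause θ-subsumes the specific clause but not conversely.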
θ-subsumption induces a lattice on the space of clauses, whereas this is not true for logical implication.4 These notions are easily extended to more than 2 hypotheses, leading to lgg(h_i, h_j, h_k, ...) and lss(h_i, h_j, h_k, ...). Finally, the lattice is assumed to be bounded: there exists a hypothesis, denoted by ⊤, that is more general than all the others, and a hypothesis, denoted by ⊥, that is more specific than all the others.

4 A clause is a disjunction of literals, assimilated in this definition to a set of literals.

7.1.3 Exploration of the Search Space

The generality relation allows one to structure the hypothesis space, and generalization/specialization operators can be defined to explore the search space. Let us consider a quasi-ordering5 ≥ on H (h_2 ≥ h_1 if h_2 is more general than h_1). A downward refinement operator, or specialization operator, is a function ρ from H to 2^H such that for all h in H, ρ(h) ⊆ {h′ ∈ H | h ≥ h′}, whereas an upward refinement operator, or generalization operator, is a function ρ such that for all h in H, ρ(h) ⊆ {h′ ∈ H | h′ ≥ h}. Let us consider the zoo example and assume that the hypothesis space is composed of all the conjunctions of literals in propositional logic. A hypothesis could be: hair ∧ milk ∧ four_legs. It covers all the animals satisfying these properties, for instance a bear or a cat. Consider now two simple operators that consist in adding/removing a literal to/from a hypothesis. Removing a literal, for instance, produces the hypothesis hair ∧ milk, which covers more instances and is therefore more general. Removing a literal (resp. adding a literal) thus yields a more general (resp. more specific) hypothesis. Refinement operators coupled with a coverage measure can be used to guide the exploration of the hypothesis space efficiently. Indeed, if a hypothesis does not cover all the positive examples, a generalization operator must be used to produce more general hypotheses covering the missed positive examples, whereas, conversely, if negative examples are covered, specialization operators must be applied. This leads to different strategies for exploring the hypothesis space. The generalization/specialization operators depend on the description language of the hypotheses. If they are simple enough in propositional logic, matters become more complicated in the presence of domain knowledge or in more complex languages such as first-order logic.
In this case, different generalization operators can be defined, such as deleting a literal, turning a constant into a variable, or turning two occurrences of the same variable into distinct variables. To be useful, a refinement operator must satisfy some properties: it must be locally finite (it generates a finite and computable number of refinements), proper (each element of ρ(h) is different from h) and complete (for a generalization operator, this means that for all h and h′, if h′ > h then h′, or a hypothesis equivalent to h′, can be reached by applying ρ a finite number of times to h; the definition is similar for a specialization operator). Such an operator is called ideal. See van der Laag

⁵A reflexive and transitive relation.
Designing Algorithms for Machine Learning and Data Mining


and Nienhuys-Cheng (1998) for a discussion on the existence of ideal refinement operators.

7.1.4 Formal Concept Analysis

As we have seen, there is a duality between the observation space X and the hypothesis space H. This duality is formalized by the notion of Galois connection, which forms the basis of Formal Concept Analysis (Ganter et al. 1998, 2005). Let us consider a set 𝒪 of objects (animals in the zoo dataset), a set 𝒜 of propositional descriptors and a binary relation r on 𝒪 × 𝒜 such that r(o, p) is true when object o satisfies property p. The connection between the objects and the properties is expressed by two operators, denoted here by f and g. The first one corresponds to the notion of coverage: given a set of descriptors A, A ⊆ 𝒜, f(A) returns the extent of A, i.e., the set of objects satisfying A. The second one corresponds to the notion of generalization: given a set of objects O, O ⊆ 𝒪, g(O) returns the intent of O, i.e., the set of descriptors that are true for all objects of O.

Let us consider again the animals, described by only 4 properties (hair, milk, four_legs, domestic), and consider only 3 objects (a bear, a cat and a dolphin):

name     hair  milk  four_legs  domestic
bear     T     T     T          F
cat      T     T     T          T
dolphin  F     T     F          F

The extent of the descriptor milk contains the three animals, whereas the extent of hair and milk contains two animals, namely the bear and the cat. The intent of the first two animals, i.e., the set of descriptors they have in common, is composed of hair, milk and four_legs.

These two operators form a Galois connection satisfying the following properties: (i) f and g are anti-monotonous (if A1 ⊆ A2 then f(A2) ⊆ f(A1), and if O1 ⊆ O2 then g(O2) ⊆ g(O1)); (ii) any set of descriptors is included in the intent of its extent (for any A of 𝒜, A ⊆ g(f(A))) and any set of objects is included in the extent of its intent (for any O of 𝒪, O ⊆ f(g(O))); (iii) f(A) = f(g(f(A))) and g(O) = g(f(g(O))). The operator g ∘ f has received particular attention in the field of frequent pattern mining and is called the closure of a set of descriptors.
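The two operators can be sketched directly on the small context above (a Python illustration with made-up function names; `context` encodes the table):

```python
# Sketch of the two Galois operators f (extent) and g (intent) on the
# small animal context from the text.

context = {
    "bear":    {"hair", "milk", "four_legs"},
    "cat":     {"hair", "milk", "four_legs", "domestic"},
    "dolphin": {"milk"},
}

def extent(A):
    """f: the objects satisfying every descriptor in A."""
    return {o for o, props in context.items() if A <= props}

def intent(O):
    """g: the descriptors shared by every object in O."""
    return set.intersection(*(context[o] for o in O)) if O else set()

assert extent({"milk"}) == {"bear", "cat", "dolphin"}
assert extent({"hair", "milk"}) == {"bear", "cat"}
assert intent({"bear", "cat"}) == {"hair", "milk", "four_legs"}

# Closure g∘f: {hair, milk} is not closed; its closure adds four_legs.
assert intent(extent({"hair", "milk"})) == {"hair", "milk", "four_legs"}
```

The last assertion illustrates the closure operator g ∘ f: the pair ({bear, cat}, {hair, milk, four_legs}) is indeed a concept of this context.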
A concept is then defined as a pair (O, A) of objects and descriptors such that O is the extent of A and A is the intent of O. In the previous example, the pair ({bear, cat}, {hair, milk, four_legs}) is a concept. Let us notice that if (O, A) is a concept then A is a closed set (A = g( f (A))) and the set of closed descriptors with ⊆ is a lattice. This notion is particularly useful in frequent itemset mining since the set of frequent closed itemsets is a condensed representation of the set of frequent itemsets.


A. Cornuéjols and C. Vrain

For more details on Formal Concept Analysis, see chapter “Formal Concept Analysis: From Knowledge Discovery to Knowledge Processing” of this volume.

7.1.5 Learning the Definition of a Concept

name      hair  four_legs  domestic  class
bear      T     T          F         mammal
cat       T     T          T         mammal
dolphin   F     F          F         mammal
honeybee  T     F          F         insect
moth      T     F          F         insect

Let us suppose now that we want to learn a definition of the concept mammal, given the three positive examples (bear, cat, dolphin) and the two negative examples (honeybee, moth), with only three attributes, namely hair, four_legs and domestic. The generalization of the three positive examples is empty: they share no common properties. In fact these three examples do not belong to the same concept and must be split into two groups for common properties to emerge.

To learn a concept from positive and negative examples, the most common method is to learn a hypothesis covering some positive examples and then to iterate the process on the remaining examples: this is a divide and conquer strategy. Alternative methods exist, for instance first separating the positive examples into classes (unsupervised classification) and then learning a rule for each class. This leads to the problem of constructing a rule covering positive examples. Several strategies can be applied: a greedy approach, which builds the rule iteratively by adding conditions, as in the Foil system (Quinlan 1996), or a strategy driven by examples, as in the Progol system (Muggleton 1995). Progol starts from a positive example, constructs a rule covering this example and as many other positive examples as possible while remaining consistent, and iterates with the positive examples not yet covered by a rule. Other approaches search exhaustively for all the rules with sufficient support and confidence (one finds here the problem of association rule mining) and construct a classifier from these rules (see for instance Liu et al. 1998; Li et al. 2001). For example, to classify a new example, each rule applicable to this example votes for its class with a weight depending on its confidence.
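The divide and conquer loop described above can be illustrated as follows (a hedged Python sketch, not the actual Foil or Progol algorithm: the scoring heuristic, the attribute encoding and the stopping behavior are our own simplifications, and the sketch assumes the two classes are separable):

```python
# Sequential covering sketch: greedily grow a rule that rejects all
# negatives, remove the positives it covers, and repeat.

def covers(rule, x):
    """A rule is a dict of required attribute values."""
    return all(x.get(a) == v for a, v in rule.items())

def learn_rule(positives, negatives, attributes):
    """Greedily add the (attribute, value) condition with the best
    positive-minus-negative coverage until no negative is covered."""
    rule = {}
    while any(covers(rule, n) for n in negatives):
        best = max(
            ((a, v) for a in attributes if a not in rule for v in (True, False)),
            key=lambda av: sum(covers({**rule, av[0]: av[1]}, p) for p in positives)
                         - sum(covers({**rule, av[0]: av[1]}, n) for n in negatives))
        rule[best[0]] = best[1]
    return rule

def sequential_covering(positives, negatives, attributes):
    """Learn a rule, discard the positives it covers, iterate."""
    rules, remaining = [], list(positives)
    while remaining:
        rule = learn_rule(remaining, negatives, attributes)
        rules.append(rule)
        remaining = [p for p in remaining if not covers(rule, p)]
    return rules

positives = [
    {"hair": True,  "four_legs": True,  "domestic": False},  # bear
    {"hair": True,  "four_legs": True,  "domestic": True},   # cat
    {"hair": False, "four_legs": False, "domestic": False},  # dolphin
]
negatives = [
    {"hair": True, "four_legs": False, "domestic": False},   # honeybee
    {"hair": True, "four_legs": False, "domestic": False},   # moth
]
rules = sequential_covering(positives, negatives, ["hair", "four_legs", "domestic"])
assert all(any(covers(r, p) for r in rules) for p in positives)
assert not any(covers(r, n) for r in rules for n in negatives)
```

On the mammal/insect table above, this produces two rules (roughly "four_legs" and "not hair"), matching the remark that the three positive examples must be split into groups before common properties emerge.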

7.1.6 Extensions

An area in which learning is based on a generality relation in the hypothesis space is the induction of languages or grammars from examples of sequences. In grammatical inference, hypotheses are generally represented as automata. Most of the work has focused on the induction of regular languages, which correspond to finite-state automata. It turns out that this representation naturally leads to an operator associated with a generality relation between automata. Without entering into formal details, given a finite-state automaton, if we merge two of its states in a correct way, we obtain a new automaton that accepts at least all the sequences accepted by the first one; it is therefore more general. Using this generalization operator, most methods of grammatical inference start from a particular automaton accepting exactly the positive sequences and generalize it by a succession of merging operations, stopping either when a negative sequence is covered or when a stopping criterion on the current automaton is satisfied. de la Higuera (2010) describes the techniques of grammatical inference thoroughly.

Sometimes data result from inaccurate or approximate measurements, and the stated precision of attribute values may be unrealistic. In this case, it may be advantageous to change the representation to account for a lower accuracy, and possibly to obtain more comprehensible concept descriptions. The rough set formalism (Suraj 2004) provides a tool for approximate reasoning, applicable in particular to the selection of informative attributes and to the search for classification or decision rules. Without going into details, it makes it possible to seek a redescription of the example space taking into account the equivalence relations induced by the descriptors on the available examples. Given this new granularity of description, the concepts can then be described in terms of lower and upper approximations. This leads to new definitions of the coverage notion and of the generality relation between concepts.

7.2 Four Illustrations

7.2.1 Version Space and the Candidate Elimination Algorithm

In the late 1970s, Tom Mitchell showed the interest of generalization for concept learning. Given a set of positive and negative examples, the version space (Mitchell 1982, 1997) is defined as the set of all the hypotheses consistent with the known data, that is to say, covering all the positive examples (completeness) and no negative example (consistency). The empirical risk is therefore null. Under certain conditions, the set of hypotheses consistent with the learning data is bounded by two sets in the generalization lattice defined on H: the S-set is the set of the most specific hypotheses covering the positive examples and rejecting the negative ones, whereas the G-set is the set of the maximally general hypotheses consistent with the learning data. Mitchell then proposed an incremental algorithm, called the candidate elimination algorithm: it considers the learning examples sequentially and, at each presentation of a new example, updates the two frontiers accordingly. It can be seen as a bidirectional breadth-first search, updating the S-set and the G-set incrementally after each new example: an element of the S-set that does not cover a new positive example is generalized if possible (i.e., without covering a negative example), whereas an element of the G-set that covers a new negative example is specialized if possible.


Inconsistent hypotheses are removed. Assuming that the language of hypotheses is perfectly chosen and that the data are sufficient and not noisy, the algorithm can in principle converge towards a unique hypothesis, the target concept. An excellent description of this algorithm can be found in Chap. 2 of Mitchell (1997). While this technique, at the basis of many algorithms for learning logical expressions as well as finite-state automata, experienced a decline in interest with the advent of more numerical methods in Machine Learning, it has enjoyed renewed interest in the domain of Data Mining, which tends to explore more discrete structures.

7.2.2 Induction of Decision Trees

The induction of decision trees is certainly the best known case of learning models with a variable structure. However, contrary to the approaches described above, the learning algorithm does not rely on a generality relation between models; it uses a greedy strategy that constructs an increasingly complex tree, corresponding to an iterative division of the space X of observations. A decision tree is composed of internal nodes and leaves: a test on a descriptor is associated with each internal node, for instance size > 1.70 m, while a class label is associated with each leaf. The number of edges starting from an internal node is the number of possible answers to the test (e.g., yes/no). The test differs according to the type of attribute. For a binary attribute, it has two outcomes; for a qualitative attribute, there can be as many branches as there are values in the attribute's domain, or the test can be made binary by splitting the domain into two subsets. For a quantitative attribute, it takes the form A < θ, and the difficulty is to find the best threshold θ.

Given a new observation to classify, the test at the root of the tree is applied to that observation and, according to the answer, the corresponding branch towards one of the subtrees is followed, until reaching a leaf, which gives the predicted label. It is noteworthy that a decision tree can be expressed as a set of classification rules: each path from the root to a leaf expresses a set of conditions, and the conclusion of the rule is given by the label of the leaf. Figure 9 gives the top of a decision tree built on the zoo dataset.

A decision tree is thus the symbolic expression of a partition of the input space. Only axis-parallel partitions can be obtained, insofar as the tests on numerical variables are generally of the form X ≥ θ. Each subspace in the partition is then labelled, usually by the majority class of the observations in this subspace.
When building a decision tree, each new test performed on a node refines the partition by splitting the corresponding subspace. Learning consists in finding such a partition. The global inductive criterion is replaced by a local criterion, which optimizes the homogeneity of the nodes of the partition, where homogeneity is measured in terms of the proportions of observations of each class. The most widely used criteria are certainly the information gain, based on the notion of entropy and used in C5.0 (Quinlan 1993; Kotsiantis 2007), and the Gini index, used in Cart (Breiman et al. 1984).

Fig. 9 The top of a decision tree on the zoo dataset (internal nodes test Milk, Feathers and Backbone; leaves include Mammal and Bird)

Techniques for pruning the built tree are then applied in order to avoid overfitting. This algorithm has a reduced complexity: O(m · d · log(m)), where m is the size of the learning sample and d the number of descriptive attributes. Moreover, a decision tree is generally easy to interpret (although this depends on its size). This is a typical example of a divide and conquer algorithm with a greedy exploration.
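As an illustration of the local criterion, the following Python sketch (our own simplified version, not the actual Cart implementation; the data and attribute name are made up) selects the best threshold for a numerical test A < θ by minimizing the weighted Gini impurity of the two children:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    m = len(labels)
    return 1.0 - sum((c / m) ** 2 for c in Counter(labels).values())

def best_binary_split(xs, ys, feature):
    """Best threshold for the test `feature < t`, chosen by the
    weighted Gini impurity of the two children (as in Cart)."""
    best = None
    values = sorted({x[feature] for x in xs})
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold between two observed values
        left = [y for x, y in zip(xs, ys) if x[feature] < t]
        right = [y for x, y in zip(xs, ys) if x[feature] >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, t)
    return best  # (weighted impurity, threshold)

xs = [{"size": s} for s in (1.5, 1.6, 1.75, 1.8)]
ys = ["A", "A", "B", "B"]
score, threshold = best_binary_split(xs, ys, "size")
assert score == 0.0 and 1.6 < threshold < 1.75
```

Here the split separating the two classes perfectly gets a weighted impurity of 0; a full tree inducer would apply this selection recursively at each node.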

7.2.3 Inductive Logic Programming

Inductive Logic Programming (ILP) has received particular attention since the 1980s. Initially studied for the synthesis of logic programs from examples, it has gradually been oriented towards learning knowledge from relational data, thus extending the classical framework of data described in vector spaces and making it possible to take relations in the data into account. When ILP was developed for the synthesis of logic programs, a key problem was learning recursive or mutually recursive concepts, whereas learning knowledge requires taking quantitative and uncertain information into account. One interest of ILP is the possibility of integrating domain knowledge, which allows more interesting concept definitions to be obtained. For example, if one wishes to learn the concept of grandfather from examples of persons linked by father and mother relationships, introducing the concept of parent allows a more concise definition of the concept. The definition of θ-subsumption given in Sect. 7.1.2 must then be modified accordingly. One of the major challenges ILP has to face is the complexity due to the size of the search space and to the coverage test. To overcome this problem, syntactic or semantic biases have been introduced, allowing the search space to be reduced.


Another idea is to take advantage of the work carried out in propositional learning. In this context, a rather commonly used technique, called propositionalization, consists in transforming the first-order learning problem into a propositional or attribute-value problem (see an example in Zelezný and Lavrac 2006). The difficulty then lies in the construction of relevant features reflecting the relational character of the data while minimizing the loss of information.

The best known ILP systems are Foil (Quinlan 1996) and Progol (Muggleton 1995). As already mentioned, Foil iteratively builds rules covering positive examples and rejecting the negative ones. For building a rule, it adopts a top-down strategy: the most general clause (a rule without conditions) is successively refined so as to reject all negative examples. It relies on a greedy strategy guided by a heuristic close to the information gain, which takes into account the number of instantiations that cover positive, respectively negative, examples. The quality of each refinement is measured and the best one is chosen, without backtracking. This strategy suffers from a well-known problem: it may be necessary to add refinements corresponding to functional dependencies (for example, introducing a new object in relation with the target object), which are needed to construct a consistent clause but have a null information gain (they are true for all positive and all negative examples). Progol differs in the way clauses are built: the construction of a clause is driven by a positive example. More precisely, Progol chooses a positive example and constructs the most specific clause covering this example (this step is called saturation); search is then performed in the generalization space of this saturated clause. The results produced by Progol depend on the order in which positive examples are processed. This approach is also implemented in the Aleph system (see http://www.cs.ox.ac.uk/activities/machinelearning/Aleph/aleph).
During the 1990s, it was shown that a large class of constraint satisfaction problems exhibits a phase transition phenomenon, namely a sudden variation in the probability of finding a solution when the parameters of the problem vary. It was also noted that finding a hypothesis covering (in terms of θ-subsumption) the positive examples but no negative ones can be reduced to a constraint satisfaction problem. It was then shown empirically that a phase transition phenomenon may indeed appear in ILP. This discovery has been extended to certain types of problems in grammatical inference. Saitta et al. (2011) is devoted to the study of phase transitions in Machine Learning.

Finally, it should be noted that while supervised classification has long been the main task studied in ILP, there is also work on searching for frequent patterns in relational databases, or on subgroup discovery. Important references include Lavrac and Dzeroski (1994), Dzeroski and Lavrac (2001), Raedt (2008) and Fürnkranz et al. (2012). Inductive Logic Programming is still an active research field, in recent years mainly within Statistical Relational Learning (Koller and Friedman 2009; Raedt et al. 2008). The advent of deep learning has launched a new research stream aiming at encoding relational features into neural networks, as shown in the latest International Conferences on Inductive Logic Programming (Lachiche and Vrain 2018).

7.2.4 Mining Frequent Patterns and Association Rules

Searching for frequent patterns is a very important Data Mining problem, which has received a lot of attention over the past twenty years. One of its applications is the discovery of interesting association rules, where an association rule is a rule I → J, with I and J two disjoint patterns. The notion of pattern is fundamental: it depends on the input space X, and more precisely on the representation language of X. A pattern may be a set of items, called an itemset, a list of items when considering sequential data, or a more complex structure, such as a graph. A pattern is said to be frequent if it occurs in the database a number of times greater than a given threshold (called the minimal support threshold).

As in the version space, it is quite natural to rely on a generality relation between patterns. A pattern is more general than another when it occurs in more examples of the database; this is again the coverage notion defined in Sect. 7.1. When a pattern is a conjunction (or a set) of elementary expressions taken from a fixed vocabulary V (V can be a set of Boolean features, a set of (attribute, value) pairs, …), the generalization (resp. specialization) operator is the removal (resp. addition) of a term in the pattern. The pattern space is then ordered by the inclusion relation, which models the generality relation (h1 ≥ h2 if h1 ⊆ h2), and it has a lattice structure. This ordering induces an anti-monotonicity property, which is fundamental for pruning the search space: a specialization of a non-frequent pattern cannot be frequent; in other terms, when a pattern is not frequent it is useless to explore its specializations, since they will be non-frequent too. This observation is the basis of the Apriori algorithm (Agrawal and Srikant 1994), the first and certainly the best-known algorithm for mining frequent itemsets and association rules.
It is decomposed into two steps: mining the frequent itemsets, and then building from these itemsets association rules, which are frequent by construction. The confidence of the association rules can then be computed from the frequencies of the itemsets. The first step is the most difficult one, since the search space has size 2^d (with d = |V|) and computing the frequency of itemsets requires going through the entire database. Apriori performs a breadth-first search in the lattice, first considering 1-itemsets, then 2-itemsets and so on, where an l-itemset is an itemset of size l. At each level, the support of all l-itemsets is computed through a single pass over the database. To prune the search space, the anti-monotonicity property is used: when an itemset is not frequent, its successors can be discarded. The complexity depends on the size of the database and on the number of passes over it. It also depends on the threshold: the lower the minimum support threshold, the less pruning. This complexity becomes prohibitive on large data with a low minimum support threshold, and therefore new algorithms have been developed, based either on new search strategies (for instance partitioning the database, or sampling), or on new representations of the database, as for instance in FP-Growth (Han et al. 2004) or in LCM (Uno et al. 2004).

Another source of complexity is the number of frequent itemsets that are generated. Condensed representations have been studied; they are usually based on closed


patterns, where a pattern is closed if the observations that satisfy this pattern share only the elements of this pattern. This corresponds to the definition of concepts underlying Formal Concept Analysis, described in Sect. 7.1.4. We can notice that the support of an itemset can be defined as the cardinality of its extent (the number of elements of the database that satisfy it), and we have interesting properties stating, for instance, that two patterns that have the same closure have the same support, or that if a pattern is included in another pattern and has the same support, then these two patterns have the same closure. These properties allow one to show that the set of closed patterns with their supports is a condensed representation of the set of all itemsets with their supports, and therefore only the closed patterns have to be stored in memory (see for instance Zaki 2000; Bastide et al. 2000).

Itemset mining has been extended to deal with structured data, such as relational data, sequential data or graphs. All these structures are discrete. A remaining difficult problem is pattern mining in the context of numeric data, since learning expressions such as (age > 60) ∧ (HDL-cholesterol > 1.65 mmol/L) requires learning the thresholds (here 60 and 1.65) in a context where the structure is not fixed and has to be learned simultaneously.
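The level-wise Apriori search described above can be sketched as follows (an illustrative Python implementation; real systems avoid recomputing supports by using dedicated data structures, and the transactions below are made up):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Breadth-first mining of frequent itemsets with anti-monotone
    pruning: a candidate is kept only if all of its subsets of size
    l are themselves frequent at the previous level."""
    def support(s):
        return sum(1 for t in transactions if s <= t)

    # Level 1: frequent single items.
    level = {frozenset({i}) for t in transactions for i in t}
    level = {s for s in level if support(s) >= min_support}
    frequent = {s: support(s) for s in level}
    l = 1
    while level:
        # Candidate generation: join frequent l-itemsets, then prune
        # every candidate having an infrequent l-subset.
        candidates = {a | b for a in level for b in level if len(a | b) == l + 1}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level for sub in combinations(c, l))}
        level = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in level})
        l += 1
    return frequent

transactions = [frozenset(t) for t in
                [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
freq = apriori(transactions, min_support=3)
assert freq[frozenset({"a"})] == 4
assert freq[frozenset({"a", "b"})] == 3
```

With a minimum support of 3, the itemset {a, b, c} (support 2) is eliminated at level 3, and the anti-monotonicity check would have discarded any of its extensions without counting them.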

8 Probabilistic Models

So far we have mainly addressed the geometric and symbolic views of Machine Learning. There is another important paradigm in Machine Learning, based on a probabilistic modeling of data. It is called generative in the sense that it aims at inducing the underlying distribution of the data; given that distribution, it is then possible to generate new samples following the same law. We have addressed it in the context of clustering in Sect. 4.3, but it is also widely used in supervised learning with the naive Bayesian classifier. Generative and discriminative approaches differ: as stated in Sutton and McCallum (2012), a discriminative model is given by the conditional distribution p(y|x), whereas a generative model is given by p(x, y) = p(y|x) × p(x).

Let us consider a sample S = {(x1, y1), . . . , (xm, ym)}, where xi is described by d features Xl, l = 1 . . . d, with domains Dl, and yi belongs to a discrete set Y (the set of labels or classes). The features Xl, l = 1 . . . d, and the feature Y corresponding to the class of the observations can be seen as random variables. Let X denote the set of features Xl. Given a new observation x = (x1, . . . , xd), the Bayesian classifier assigns to x the class y in Y that maximizes p(Y = y|X = x). Thus we have:

h : x ∈ X → argmax_{y∈Y} p(Y = y|X = x)    (8)

Using Bayes' theorem, p(Y = y|X = x) can be written p(X = x|Y = y) p(Y = y) / p(X = x). Since the denominator is constant given x, it can be dropped from the argument of the argmax, thus leading to the equivalent formulation


h : x ∈ X → argmax_{y∈Y} p(X = x|Y = y) p(Y = y)    (9)

Learning consists in inferring an estimation of the probabilities from the data. When the input space X is described by d discrete features Xl, l = 1 . . . d, with domains Dl, this means learning p(X1 = x1, . . . , Xd = xd|Y = y) for all possible tuples and p(Y = y) for all y, leading to (|D1| × · · · × |Dd| + 1) × |Y| probabilities to estimate. This would require many observations to obtain reliable estimates. An assumption is needed to make the problem feasible: the features are assumed to be independent conditionally on the class, which means:

p(X1 = x1, . . . , Xd = xd|Y = y) = p(X1 = x1|Y = y) × · · · × p(Xd = xd|Y = y)

This leads to the definition of the naive Bayesian classifier:

h : x = (x1, . . . , xd) → argmax_{y∈Y} Π_{l=1}^{d} p(Xl = xl|Y = y) × p(Y = y)    (10)
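Equation (10) translates almost directly into code. The following Python sketch (illustrative only: it uses raw relative frequencies with no smoothing, so unseen feature values get probability zero, and the toy data is made up) estimates the probabilities and classifies by maximizing the product:

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Estimate p(Y=y) and p(X_l = v | Y = y) by relative frequencies
    (no smoothing, to keep the sketch short)."""
    m = len(y)
    prior = {c: n / m for c, n in Counter(y).items()}
    cond = defaultdict(lambda: defaultdict(Counter))
    for xi, yi in zip(X, y):
        for l, v in enumerate(xi):
            cond[yi][l][v] += 1
    likelihood = {c: {l: {v: n / (prior[c] * m) for v, n in counts.items()}
                      for l, counts in feats.items()}
                  for c, feats in cond.items()}
    return prior, likelihood

def predict(prior, likelihood, x):
    """Eq. (10): argmax over classes of the prior times the product of
    per-feature conditional probabilities."""
    def score(c):
        p = prior[c]
        for l, v in enumerate(x):
            p *= likelihood[c][l].get(v, 0.0)
        return p
    return max(prior, key=score)

# Toy discrete data: two features, two classes.
X = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 0)]
y = ["pos", "pos", "neg", "neg", "pos"]
prior, likelihood = train_naive_bayes(X, y)
assert predict(prior, likelihood, (1, 0)) == "pos"
assert predict(prior, likelihood, (0, 1)) == "neg"
```

In practice, Laplace smoothing is added to the frequency estimates and the products are computed as sums of logarithms to avoid underflow.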

Nevertheless this assumption is often too simple to accurately model the data. More complex models, grouped under the term graphical models (Koller and Friedman 2009), have been introduced to take into account more complex dependencies between variables. Dependencies between variables are represented by a graph whose nodes are the variables and whose edges model the dependencies. There exist two main families of models: Bayesian networks, which are directed acyclic graphs associating with each node the conditional probability of this node given its parents, and Conditional Random Fields (CRFs), which are undirected graphs in which a variable is independent from the other ones given its neighbors. The naive Bayesian classifier is a special case of a Bayesian network, as illustrated by Fig. 10.

A third model, the Hidden Markov Model (HMM), is also frequently used for sequential data: a HMM is defined by a directed graph, where each node represents a state and can emit a symbol (taken from a predefined set of observable symbols). Two families of probabilities are associated with each state: the probabilities of emitting a symbol s in this state (P(s|n)) and the probabilities of moving to state n′ from state n (P(n′|n)). A HMM models a Markovian process: the state in which the process is at time t depends only on the state reached at time t − 1 (P(qt = n′|qt−1 = n)); it also assumes that the symbol emitted at time t depends only on the state the automaton

Fig. 10 Dependency between variables for the naive Bayesian classifier (the class Y is the single parent of each feature X1, . . . , Xd)


has reached at time t. Let us note that a HMM can be modeled by a CRF: indeed, the probability distribution in a CRF is often represented by a product of factors defined on subsets of variables, and the probabilities P(n′|n) and P(s|n) are easily converted into factors. CRFs, by means of factors, allow the definition of more complex dependencies between variables, and this explains why they are now often preferred to HMMs in natural language processing. An introduction to CRFs and a comparison with HMMs can be found in Sutton and McCallum (2012).

Learning graphical models can be decomposed into two subproblems: either the structure is already known and the problem is to learn the parameters, that is, the probability distributions; or the structure is not known and the problem is to learn both the structure and the parameters (see Koller and Friedman 2009, and Pearl 1988). The first problem is usually addressed by methods such as maximum likelihood or maximum a posteriori (MAP) estimation. Different methods have been developed for learning the structure. For more information, the reader should refer to chapter “Belief Graphical Models for Uncertainty Representation and Reasoning” of this volume, devoted to graphical models. Statistical Relational Learning (Raedt et al. 2008; Getoor and Taskar 2007) is an active research stream that aims at linking Inductive Logic Programming and probabilistic models.

9 Learning and Change of Representation

The motivation for changing the input space X is to make the search for regularities or patterns more straightforward. Changes can be obtained through unsupervised learning, or through supervised learning guided by the predictive task to solve. Unsupervised learning is often used to estimate a density in the input space, to cluster the data into groups, to find a manifold near which most of the data lies, or to denoise the data in some way. The overall principle is generally to find a representation of the training data that is as simple as possible while preserving as much information about the examples as possible. Of course, the notion of “simplicity” is multifarious. The most common ways of defining it are: lower-dimensional representations, sparse representations, and independent representations.

When looking for lower-dimensional representations, we are interested in finding smaller representations that keep as much useful information as possible about the data. This is advantageous because it tends to remove redundant information and generally allows for more efficient processing. There are several ways to achieve this. The most straightforward is to perform feature selection. Another one is to change the representation space and to project the data into a lower-dimensional space.

An altogether different idea is to use a high-dimensional representation space, but to make sure that each piece of data is expressed using as few of these dimensions, or descriptors, as possible. This is called a sparse representation. The idea is that each input should be expressible using only a few “words” from a large dictionary. These dictionaries are sometimes called overcomplete representations, a term coming from earlier studies of the visual system (Olshausen and Field 1996). Independent representations seek to identify the sources of variation, or latent variables, underlying the data distribution, so that these variables are statistically independent in some sense.

10 Other Learning Problems

10.1 Semi-supervised Learning

Learning to make predictions, that is, to associate an input to an output, requires a training set of labelled inputs of the form (xi, yi), 1 ≤ i ≤ m. The larger the training set, the better the final prediction function produced by the learning algorithm. Unfortunately, obtaining labels for a large set of inputs is often costly. Think of patients in a hospital: determining the right pathology from the symptoms exhibited by a patient requires a lot of expertise and often costly medical examinations. However, it is easy to get a large data set comprising the descriptions of patients and their symptoms, without a diagnosis. Should we ignore this (potentially large) unsupervised data set? Examining Fig. 11 suggests that we should not. Ignoring the unlabelled examples would lead a linear SVM to learn the decision function of Fig. 11 (left). But this decision function seems clearly inadequate in view of the unlabelled data points in Fig. 11 (right). Indeed, it seems reasonable to believe that similarly labelled data points lie in “clumps”, and that a decision function should go through low-density regions of the input space. If we accept this assumption as a prior bias, then it becomes possible to use unlabelled data in order to improve learning. This is the basis of semi-supervised learning (Chapelle et al. 2009).

Fig. 11 Given a few labelled data points, a linear SVM would find the decision function on the left. When unlabelled data points are added, it is tempting to change the decision function to better reflect the low density region between the apparent two clouds of points (right)


Semi-supervised learning is based on the assumptions that the unlabelled points are drawn from the same distribution as the labelled ones, and that the decision function lies in a low-density region. If either of these assumptions is erroneous, then semi-supervised learning can deteriorate the learning performance compared to learning from the labelled data points alone.

10.2 Active Learning

The learning scenarios we have presented so far suppose that the learner passively receives the training observations. This hardly corresponds to learning in natural species, and in particular in humans, who are seekers of information and new sensations; and it seems wasteful when, sometimes, a few well-chosen observations could bring as much information as a lot of random ones. Why not, then, design learning algorithms that actively select the examples that seem the most informative? This is called active learning.

Suppose that inputs are one-dimensional real values in the range [0, 100], and that you know that their label is decided with respect to a threshold: all data points of the same class (‘+’ or ‘−’) lie on one side of the threshold (see Fig. 12). If you can ask for the class of any data point in [0, 100], then you start by asking for the class of the point ‘50’ and of the point ‘0’. If they are of the same class, you should now concentrate on the interval (50, 100] and ask for the class of the point ‘75’; otherwise, concentrate on the interval (0, 50) and test the point ‘25’. By systematically halving the interval at each question, you can be ε-close to the threshold with O(log2 1/ε) questions, whereas you would need O(1/ε) random questions to reach the same precision on the threshold. In this example, active learning brings an exponential gain over a passive learning scenario. But is this representative of what can be expected from active learning? In fact, four questions arise:

1. Is it possible to learn with active learning concept classes that cannot be learned with passive learning?
2. What is the expected gain, if any, in terms of number of training examples?
3. How to select the best (most informative) training examples?
4. How to evaluate learning if the assumption of equal input distribution in learning and in testing is no longer valid, while it is the foundation of the statistical theory of learning?
The answer to question (1) is that the class of concepts learnable in the active learning scenario is the same as in the passive scenario. For question (2), we have

Fig. 12 Active learning on a one-dimensional input space, with the target function defined by h_w

Designing Algorithms for Machine Learning and Data Mining


exhibited, in the example, a case with an exponential gain. However, there exist cases with no gain at all. On average, active learning is expected to provide an advantage in terms of the number of training examples. This advantage, however, should be weighed against the computational cost: searching for the most informative examples can be expensive. This relates to question (3). There exist numerous heuristics to select the best examples. They all rely on some prior assumptions about the target concept, and they differ in how they measure the informativeness of the examples. Question (4) itself requires that further assumptions be made about the environment. There are thus different results for various theoretical settings (see chapter “Statistical Computational Learning”).

10.3 Online Learning

In the batch learning scenario, the learner is supposed to have access to all the training data at once, and therefore to be able to examine it at will in order to extract decision rules or correlations. This is not so in online learning, where the data arrives sequentially and the learner must make decisions at each time step. This is what happens when learning from data streams (Gama 2010). In most cases, the learner cannot store all the past data and must therefore compress it, usually with loss. Consequently, the learner must both be able to adapt its hypothesis or model of the world iteratively, after each new example or piece of information, and be able to decide what to forget about the past data. Often, the learner discards each new example as soon as it has been used to compute the new hypothesis.

Online learning can be used in stationary environments, when data arrives sequentially or when the data set is so large that the learner must cope with it in piecemeal fashion. It can also be applied in non-stationary environments, which complicates things since, in this case, the examples no longer have the same importance, depending on their recency. Because there can no longer be a notion of expectation, since the environment may change with time, the empirical risk minimization principle or the maximization of likelihood principle can no longer serve as the basis of induction.

In the classical "batch scenario", one tries several values of the hyperparameters that govern the hypothesis space being explored, and for each of them records the best hypothesis, the one minimizing:

$$\hat{h} \;=\; \operatorname*{argmin}_{h \in \mathcal{H}} R_{\mathrm{Reg}}(h) \;=\; \frac{1}{m} \sum_{i=1}^{m} \ell\bigl(h(\mathbf{x}_i), y_i\bigr) \;+\; \Omega_{\mathrm{hyperparameters}}(\mathcal{H})$$

where Ω_hyperparameters(H) penalizes the hypothesis space according to some prior bias encoded by the hyperparameters. The best hyperparameters are found using the validation set (see Sect. 3.4).
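This selection loop can be sketched with ridge regression, whose regularization strength plays the role of the hyperparameter. All data and parameter values below are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D regression data: y = 2x + noise
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

def fit_ridge(X, y, lam):
    """Minimise (1/m)||Xw - y||^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * len(y) * np.eye(d), X.T @ y)

# For each hyperparameter value, keep the best hypothesis on the
# training set, then compare hypotheses on the held-out validation set.
val_err = {
    lam: float(np.mean((X_va @ fit_ridge(X_tr, y_tr, lam) - y_va) ** 2))
    for lam in (1e-4, 1e-2, 1.0, 100.0)
}
best_lam = min(val_err, key=val_err.get)
```

Here the strong linear signal means a small regularization weight should win on validation error; in noisier, higher-dimensional settings a larger penalty would be selected instead.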


A. Cornuéjols and C. Vrain

In online learning, it is no longer the best hypothesis space that is looked for, but rather the best adaptive algorithm, the one that best modifies the current hypothesis after each new arriving piece of information. Let us call this algorithm A_adapt. Then the best adaptive algorithm is the one with the best performance on the last T inputs, if T is assumed to be a relevant window size:

$$\mathcal{A}^{*}_{\mathrm{adapt}} \;=\; \operatorname*{argmin}_{\mathcal{A}_{\mathrm{adapt}} \in \mathscr{A}} \; \frac{1}{T} \sum_{t=1}^{T} \ell\bigl(h_t(\mathbf{x}_t), y_t\bigr)$$

In fact, A*_adapt is found using several "typical" sequences of length T that are supposed to be representative of the sequences the learning system will encounter. Numerous heuristic online learning systems are based on this general principle.

But how does one get by if no regularity is assumed about the sequence of arriving data? Can one still devise a learning algorithm that copes with any sequence of data whatsoever, even one ruled by an adversary who tries to maximize the error rate? And, if yes, can one still guarantee something about the performance of such an online learning system? This is the province of online learning theory. Now, the performance criterion is called the regret. What we try to achieve is a learner that is competitive with the best fixed predictor h ∈ H. The regret measures how "sorry" the learner is, in retrospect, not to have used the predictions of the best hypothesis h* in H:

$$R_T \;=\; \sum_{t=1}^{T} \ell(\hat{y}_t, y_t) \;-\; \min_{h \in \mathcal{H}} \sum_{t=1}^{T} \ell\bigl(h(\mathbf{x}_t), y_t\bigr)$$

where ŷ_t is the guess of the online learner for the input x_t. Surprisingly, it turns out to be possible to devise learning algorithms with guarantees, meaning bounds on the regret, whatever the sequence fed to the learner. One good reference about these algorithms is (Cesa-Bianchi and Lugosi 2006).
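A concrete instance with such a guarantee is the exponentially weighted average forecaster (Hedge): for losses in [0, 1] over N experts, choosing η = √(8 ln N / T) bounds the regret by √((T/2) ln N) for any loss sequence. A minimal sketch, with random losses invented for the demonstration:

```python
import numpy as np

def hedge(losses, eta):
    """Exponentially weighted average forecaster over N experts.

    losses: T x N array of per-round expert losses in [0, 1].
    Returns the learner's cumulative (expected) loss.
    """
    weights = np.ones(losses.shape[1])
    total = 0.0
    for round_losses in losses:
        p = weights / weights.sum()             # distribution over experts
        total += p @ round_losses               # learner's expected loss
        weights *= np.exp(-eta * round_losses)  # downweight bad experts
    return total

rng = np.random.default_rng(1)
T, N = 1000, 10
losses = rng.uniform(size=(T, N))
losses[:, 3] *= 0.5                             # expert 3 is better on average
eta = np.sqrt(8 * np.log(N) / T)
regret = hedge(losses, eta) - losses.sum(axis=0).min()
```

Note that the bound is worst-case: it holds even if the losses are chosen adversarially, round by round, after seeing the learner's weights.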

10.4 Transfer Learning

In the classical supervised learning setting, one seeks to learn a good decision function h from the input space X to the output space Y using a training set S = {(x_i, y_i)}_{1≤i≤m}. The basis of the inductive step is to assume that the training data and future test data are governed by the same distribution P_XY. Often, however, the distributions are different. This may happen, for instance, when learning to recognize spam using data from a specific user and trying to adapt the learned rule to another user. The resulting learning problem is called domain adaptation. A further step is taken when one wishes to profit from a solved learning task in order to
facilitate a different learning task, possibly defined on another input space X. For instance, a system that knows how to recognize poppy fields in satellite images might learn more efficiently to recognize cancerous cells in biopsy images than a system that must learn this from scratch. This is known as transfer learning (Pan and Yang 2010).

Formally, in transfer learning, there is a source domain D_S defined as the product of a source input space and a source output space: X_S × Y_S. The source information can come either through a source training set S_S = {(x_i^S, y_i^S)}_{1≤i≤m} or through a decision function h_S : X_S → Y_S, with or without a training set. If only the decision function h_S is available, this is called hypothesis transfer learning. Similarly, the target domain D_T is defined as the product of a target input space and a target output space: X_T × Y_T. Often, it is assumed that the available target training data S_T = {(x_i^T, y_i^T)}_{1≤i≤m} is too limited to allow a learning algorithm to yield a good decision function h_T : X_T → Y_T. In some scenarios, the target data is assumed to be unlabeled: S_T = {x_i^T}_{1≤i≤m}.

Two questions then arise. First, can knowledge of the source hypothesis h_S help in learning a better decision function in D_T than would be possible with the training set S_T alone? Second, if yes, how can this be achieved? Transfer learning is becoming a hot topic in machine learning, both because a growing number of applications could benefit from it, and because it demands new theoretical developments adapted to non-stationary environments.
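One simple way h_S can help, when X_S = X_T, is biased regularization: shrink the target hypothesis toward the source weights rather than toward zero. This is only a sketch of that idea with linear hypotheses; all data and parameter values are invented.

```python
import numpy as np

def biased_ridge(X, y, w_ref, lam):
    """Ridge regression penalising ||w - w_ref||^2 instead of ||w||^2:
    minimises ||Xw - y||^2 + lam * ||w - w_ref||^2 (closed form)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d),
                           X.T @ y + lam * w_ref)

rng = np.random.default_rng(2)
w_true = np.array([1.0, -2.0])    # target-task parameters
w_src = np.array([0.9, -1.8])     # source hypothesis, close to target
X = rng.normal(size=(5, 2))       # very small target training set
y = X @ w_true + 0.05 * rng.normal(size=5)

w_transfer = biased_ridge(X, y, w_src, lam=10.0)      # hypothesis transfer
w_scratch = biased_ridge(X, y, np.zeros(2), lam=10.0) # standard ridge
err_transfer = np.linalg.norm(w_transfer - w_true)
err_scratch = np.linalg.norm(w_scratch - w_true)
```

With only five target examples, shrinking toward a nearby source hypothesis yields a much smaller parameter error than shrinking toward zero; the benefit vanishes, or reverses, when the source and target tasks are dissimilar (negative transfer).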

10.5 Learning to Rank

Learning to rank (Li 2011) is related to descriptive learning in that the goal is not to make predictions about new unknown instances, but to order the set of available data according to some "relevance". It is, however, related to supervised learning in that there is usually supervised information in the training data, for instance that one instance is preferred to another. A typical application is the ordering of the results of a search engine according to their relevance to a query and, possibly, to a user.

One approach is to first define a loss function that measures the difference between the true ranking of a set of instances and the one produced by the system, and then to apply the empirical risk minimization (ERM) principle to find a good ranking function on the training sets (several sets for which the true ranking is known). For instance, linear predictors for ranking can be used. In this technique, assuming that X ⊂ R^d, for any vector w ∈ R^d a ranking function can be defined as:



$$h_{\mathbf{w}}(\mathbf{x}_1, \ldots, \mathbf{x}_r) \;=\; \bigl(\langle \mathbf{w}, \mathbf{x}_1\rangle, \ldots, \langle \mathbf{w}, \mathbf{x}_r\rangle\bigr)$$

The elements x_i (1 ≤ i ≤ r) can then be ordered according to the values ⟨w, x_i⟩.
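Scoring-and-sorting with a fixed w then reduces to a dot product and an argsort. The weight vector and item features here are hand-picked for illustration; in practice w is learned by ERM on training sets whose true ranking is known.

```python
import numpy as np

w = np.array([0.7, -0.2, 0.1])   # illustrative, not learned
items = np.array([
    [1.0, 0.0, 2.0],             # x1
    [0.2, 1.0, 0.0],             # x2
    [2.0, 0.5, 1.0],             # x3
])
scores = items @ w               # <w, x_i> for each item
ranking = np.argsort(-scores)    # item indices, most relevant first
# → ranking is [2, 0, 1]: x3 first (score 1.4), then x1 (0.9), then x2 (-0.06)
```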


There are other learning algorithms, some of which are based on learning binary classifiers that take two instances x_i and x_j and return '+' if the first argument x_i should be ranked before x_j, and '−' otherwise.

10.6 Learning Recommendations

Learning to make recommendations is somewhat related to learning to rank, but aims at extrapolating relevance to new instances, not already seen by the user. For instance, a system can learn to recommend movies that should interest someone, based on their viewing history and the appreciations they reported.

One way to do this is to describe the items to be rated (e.g. movies), and thus recommended or not, in terms of attributes, and then to learn to associate the resulting descriptions with appreciations, or grades. This is called content-based recommendation. One problem is to find relevant attributes. Another is that such recommending systems can only use the past history of each customer or user: information is not shared among users.

Another approach is to suppose that if another user has expressed ratings close to mine for a set of items that we both rated, then it is likely that I would rate other items, which I have not yet seen, similarly to this user. In this way, it is possible to capitalize on the vast data sets of preferences expressed by the whole community of users. The idea is to compute the similarity (or dissimilarity) with other users based on sets of items rated in common, and then to extrapolate the way I would rate new items from these similarities and the ratings these other users gave to the new items. This is called collaborative filtering.

Among the many techniques dedicated to collaborative filtering, the currently prominent one operates by matrix completion. The idea is to consider the matrix R defined over users × items, with each element R(i, j) containing the rating given by user i to item j. When no rating has been given, the element contains 0. This matrix is usually very sparse (few values R(i, j) ≠ 0), since users have generally rated only a few tens or at most hundreds of items.
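The similarity-based scheme can be sketched as follows. The tiny rating matrix is invented, and cosine similarity on co-rated items is one common choice among many.

```python
import numpy as np

# User x item rating matrix; 0 marks an unrated item (toy data).
R = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [1., 0., 5., 4.]])

def predict(u, j):
    """Predict user u's rating of item j as a similarity-weighted
    average of other users' ratings of j (user-based CF sketch)."""
    num, den = 0.0, 0.0
    for v in range(R.shape[0]):
        if v == u or R[v, j] == 0:
            continue
        both = (R[u] != 0) & (R[v] != 0)   # items rated by both users
        if not both.any():
            continue
        sim = R[u, both] @ R[v, both] / (
            np.linalg.norm(R[u, both]) * np.linalg.norm(R[v, both]))
        num += sim * R[v, j]
        den += abs(sim)
    return num / den if den else 0.0
```

For instance, `predict(0, 2)` blends the ratings that users 1 and 2 gave to item 2, weighted by how similarly they rated the items they share with user 0.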
The goal is then to complete this matrix, filling in the missing values. A range of techniques can be used for this. The most classical relies on the Singular Value Decomposition (SVD), which, in a way, exploits the fact that the columns (or the rows) of this matrix are not independent. While these techniques are rather powerful, they nonetheless give results that are somewhat less than satisfactory. This is due to several factors: the missing values are not randomly distributed, as SVD would require; and performance measures such as the root mean square error (RMSE) give the same importance to all entries R(i, j), whereas users care about high ratings and, in fact, not about accurate ratings but rather about the relative ratings given to the most interesting items. Furthermore, recommendations are highly context
dependent: for instance, the same item can be quite interesting one day, and not the day after because a similar item has been purchased. For all these reasons, new approaches for recommendation systems are to be expected in the years to come (see for instance Jannach et al. 2016).
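Completion by truncated SVD can be sketched as a simplified hard-impute scheme: alternate between a rank-k approximation and re-imposing the observed ratings. The toy matrix and rank are invented for the illustration.

```python
import numpy as np

def complete_by_svd(R, rank, n_iter=100):
    """Naive low-rank matrix completion.  0 marks a missing entry,
    as in the text; observed ratings are kept fixed at each step."""
    observed = R != 0
    X = np.where(observed, R, R[observed].mean())  # init missing with mean
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # rank-k approximation
        X[observed] = R[observed]                   # keep known ratings
    return X

# Toy user x item rating matrix with some missing entries
R = np.array([[5., 4., 0.],
              [0., 2., 1.],
              [5., 0., 2.]])
R_hat = complete_by_svd(R, rank=1)
```

The missing entries of `R_hat` are the model's predicted ratings; production systems add per-user and per-item bias terms and fit only the observed entries, precisely to avoid the uniform-weighting problem discussed above.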

10.7 Identifying Causality Relationships

It might be easy to come up with correlations such as: people who eat ice cream wear swimming suits. But should we rely on a rule that says "to make people eat ice cream, make them wear swimming suits"? Clearly, correlations are not causal relationships, and believing they are can lead to disastrous conclusions. In order to be able to "act" on some phenomenon, it is crucial to identify its causes. Unfortunately, almost all current predictive learning systems are geared to discover correlations, not causal links, and going from correlations to causality is not a trivial problem. It is indeed largely an open problem. Judea Pearl has shown that the standard concepts, and notations, of statistics are unable to capture causality (Pearl 2009). Some approaches suppose that a graph of potential causal links is provided by an expert before a learning algorithm tries to identify the true causal links together with their directions and intensities (see chapter "A Glance at Causality Theories for Artificial Intelligence" of Volume 1 for an extended discussion of causality within artificial intelligence, and the book by Peters et al. (2017) for a thorough discussion of causality and machine learning). Recently, an interesting idea has been proposed in which the data points corresponding to some phenomenon are analyzed by a classifier (in this work, a deep neural network) which, after learning on synthetic data reflecting causality between variables, can recognize whether some variable is likely to cause the value of another (Lopez-Paz et al. 2016).

11 Conclusion

Machine learning is all the rage in artificial intelligence at the moment. Indeed, because it promises to eliminate the need to explicitly code machines by hand, when it suffices to feed them with examples and let them program themselves, machine learning seems the solution for obtaining high-performing systems in many demanding tasks such as understanding speech, playing (and winning) games, problem solving, and so on. And, truly, machine learning has demonstrated impressive achievements in recent years, reaching superhuman performance in many pattern recognition tasks and in game playing. Autonomous vehicles are, so to speak, around the corner, while Watson, from IBM, and other similar automatic assistant systems that can sift through millions of documents and extract information in almost no time are deemed to profoundly change the way even high-level professions, such as lawmaking or medicine, will be carried out in the years to come.

Still, for all these breakthroughs and the accompanying hype, today's machine learning is above all the great triumph of pattern recognition. Not much progress has been made in the integration of learning and reasoning since the 1980s. Machines are very limited in learning from unguided observations and, unlike children less than two years old, they are very task-focused, lack contextual awareness, and look for statistical correlations where we look for causal relationships. Progress in reinforcement learning and in neural networks has been highly touted, and in part rightly so, but the improvements in performance are due in large part to gains in computational power and in the quantity of training examples, rather than to new conceptual breakthroughs, even though it is indisputable that new ideas have been generated. The field of machine learning is therefore anything but a finished, polished, or even mature domain. Revolutionary ideas are yet to come. They have to come.

Accordingly, let us conclude this chapter with an intriguing idea. Maybe the key to true intelligence is not learning per se, but teaching. Infants, and, later, adults, share knowledge out of an innate compulsion. We share information with no immediate gain other than to remedy the gap of knowledge we perceive in others. It starts when a one-year-old child sees an object fall behind a piece of furniture and points it out to an adult who did not see where the object fell. Teaching is a universal instinct among humans. It is certainly intricately linked to our capacity for learning. We have models of what others know or ignore, and we act, we teach, accordingly. In order to do so, we need to recognize the state of the world and the state of others, and we need to reason.
Is the teaching ability, or rather the teaching instinct, the true frontier that machine learning must conquer? Learning and teaching: do we need to look at the two faces of the same coin in order to understand each one? An intriguing idea, isn't it, Watson!

References

Aggarwal CC (2015) Data mining: the textbook. Springer Publishing Company Incorporated, Berlin Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. Very large data bases (VLDB-94). Santiago, Chile, pp 487–499 Aloise D, Hansen P, Liberti L (2012) An improved column generation algorithm for minimum sum-of-squares clustering. Math Program 131(1–2):195–220 Amarel S (1968) On representations of problems of reasoning about actions. Mach Intell 3(3):131–171 Ankerst M, Breunig MM, Kriegel H, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: SIGMOD 1999, proceedings ACM SIGMOD international conference on management of data, June 1–3, 1999, Philadelphia, Pennsylvania, USA, pp 49–60. https://doi.org/10.1145/304182.304187 Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7–9, 2007, pp 1027–1035. http://dl.acm.org/citation.cfm?id=1283383.1283494


Bastide Y, Pasquier N, Taouil R, Stumme G, Lakhal L (2000) Mining minimal non-redundant association rules using frequent closed itemsets. In: Computational logic, pp 972–986 Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers, Norwell Bishop CM (2006) Pattern recognition and machine learning. Springer, Secaucus Breiman L (2001) Random forests. Mach Learn 45(1):5–32 Breiman L, Friedman J, Olshen R, Stone CJ (1984) Classification and regression trees. Wadsworth and Brooks/Cole Advanced Books and Software Brusco M, Stahl S (2005) Branch-and-bound applications in combinatorial data analysis (Statistics and computing), 1st edn. Springer, Berlin Busygin S, Prokopyev OA, Pardalos PM (2008) Biclustering in data mining. Comput OR 35:2964– 2987 Cesa-Bianchi N, Lugosi G (2006) Prediction, learning, and games. Cambridge University Press, Cambridge Chapelle O, Scholkopf B, Zien A (2009) Semi-supervised learning (chapelle O, et al eds; 2006). IEEE Trans Neural Netw 20(3):542 Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297 Dao T, Duong K, Vrain C (2017) Constrained clustering by constraint programming. Artif Intell 244:70–94. https://doi.org/10.1016/j.artint.2015.05.006 de la Higuera C (2010) Grammatical inference: learning automata and grammars. Cambridge University Press, Cambridge Dhillon IS, Guan Y, Kulis B (2004) Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, Washington, USA, August 22–25, 2004, pp 551–556. https://doi.org/10. 1145/1014052.1014118 Ding CHQ, He X (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. In: Proceedings of the 2005 SIAM international conference on data mining, SDM 2005, Newport Beach, CA, USA, April 21–23, 2005, pp 606–610, https://doi.org/10.1137/1. 
9781611972757.70 du Merle O, Hansen P, Jaumard B, Mladenovic N (1999) An interior point algorithm for minimum sum-of-squares clustering. SIAM J Sci Comput 21(4):1485–1505 Dunn JC (1973) A fuzzy relative of the isodata process and its use in detecting compact wellseparated clusters. J Cybern 3(3):32–57. https://doi.org/10.1080/01969727308546046 Dzeroski S, Lavrac N (eds) (2001) Relational data mining. Springer, Berlin Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining (KDD-96), Portland, Oregon, USA, pp 226–231. http:// www.aaai.org/Library/KDD/1996/kdd96-037.php Fisher DH (1987) Knowledge acquisition via incremental conceptual clustering. Mach Learn 2(2):139–172. https://doi.org/10.1007/BF00114265 Forgy E (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classification. Biometrics 21(3):768–769 Fürnkranz J, Gamberger D, Lavrac N (2012) Foundations of rule learning. Springer, Berlin Gama J (2010) Knowledge discovery from data streams. Chapman & Hall Ganter B, Wille R, Franke C (1998) Formal concept analysis: mathematical foundations. Springer, Berlin Ganter B, Stumme G, Wille R (eds) (2005) Formal concept analysis: foundations and applications. Springer, Berlin Getoor L, Taskar B (eds) (2007) An introduction to statistical relational learning. MIT Press, Cambridge Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND,
Weinberger KQ (eds) Advances in neural information processing systems 27, Curran Associates, Inc., pp 2672–2680. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf Getoor L, Taskar B (eds) (2007) An introduction to statistical relational learning. MIT Press Halkidi M, Batistakis Y, Vazirgiannis M (2002) Clustering validity checking methods: part ii. SIGMOD Rec 31(3):19–27. https://doi.org/10.1145/601858.601862 Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. SIGMOD Rec 29(2):1–12. https://doi.org/10.1145/335191.335372 Han J, Pei J, Yin Y, Mao R (2004) Mining frequent patterns without candidate generation: a frequentpattern tree approach. Data Min Knowl Discov 8(1):53–87 Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco Hansen P, Delattre M (1978) Complete-link cluster analysis by graph coloring. J Am Stat Assoc 73:397–403 Hansen P, Jaumard B (1997) Cluster analysis and mathematical programming. Math Program 79(1– 3):191–215 Hastie T, Tibshirani R, Friedman JH (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer series in statistics. Springer, Berlin Hawkins D (1980) Identification of outliers. Monographs on applied probability and statistics. Chapman and Hall. https://books.google.fr/books?id=fb0OAAAAQAAJ Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci 79(8):2554–2558 Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall Jannach D, Resnick P, Tuzhilin A, Zanker M (2016) Recommender systems-: beyond matrix completion. Commun ACM 59(11):94–102 Japkowicz N (2011) Evaluating learning algorithms: a classification perspective. Cambridge University Press Johnson S (1967) Hierarchical clustering schemes. 
Psychometrika 32(3):241–254 Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York Klein G, Aronson JE (1991) Optimal clustering: a model and method. Nav Res Logist 38(3):447– 461 Kohonen T (ed) (1997) Self-organizing maps. Springer, New York Inc, Secaucus Koller D, Friedman N (2009) Probabilistic graphical models. Principles and techniques. MIP Press Kotsiantis SB (2007) Supervised machine learning: a review of classification techniques. Informatica 31:249–268 Lance GN, Williams WTA (1967) A general theory of classificatory sorting strategies: 1. Hierarchical systems 9 Lachiche N, Vrain C (eds) (2018) Inductive logic programming - 27th international conference, ILP 2017, Orléans, France, September 4–6, 2017, Revised selected papers, Lecture notes in computer science, vol 10759. Springer. https://doi.org/10.1007/978-3-319-78090-0 Lavrac N, Dzeroski S (1994) Inductive logic programming - techniques and applications. Ellis Horwood series in artificial intelligence. Ellis Horwood Le Cun Y (1986) Learning process in an asymmetric threshold network. Disordered systems and biological organization. Springer, Berlin, pp 233–240 Le Cun Y, Boser BE, Denker JS, Henderson D, Howard RE, Hubbard WE, Jackel LD (1990) Handwritten digit recognition with a back-propagation network. In: Advances in neural information processing systems, pp 396–404 Le Cun Y, Bengio Y et al (1995) Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361(10):1995 Le Cun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444. https://doi.org/ 10.1038/nature14539 Levesque HJ, Brachman RJ (1987) Expressiveness and tractability in knowledge representation and reasoning. Comput Intell 3(1):78–93


Li H (2011) A short introduction to learning to rank. IEICE Trans Inf Syst 94(10):1854–1862 Li W, Han J, Pei J (2001) CMAR: accurate and efficient classification based on multiple classassociation rules. In: Proceedings of the 2001 IEEE international conference on data mining, 29 November–2 December 2001, San Jose, California, USA, pp 369–376. https://doi.org/10.1109/ ICDM.2001.989541 Liu B, Hsu W, Ma Y (1998) Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining, AAAI Press, KDD’98, pp 80–86. http://dl.acm.org/citation.cfm?id=3000292.3000305 Lloyd SP (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–136. https:// doi.org/10.1109/TIT.1982.1056489 Lopez-Paz D, Nishihara R, Chintala S, Schölkopf B, Bottou L (2016) Discovering causal signals in images. arXiv:160508179 Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform 1:24–45. https://doi.org/10.1109/TCBB.2004. 2, www.doi.ieeecomputersociety.org/10.1109/TCBB.2004.2 Michalski RS (1980) Knowledge acquisition through conceptual clustering: a theoretical framework and an algorithm for partitioning data into conjunctive concepts. Int J Policy Anal Inf Syst 4:219– 244 Michalski RS, Stepp RE (1983) Automated construction of classifications: conceptual clustering versus numerical taxonomy. IEEE Trans Pattern Anal Mach Intell 5(4):396–410. https://doi.org/ 10.1109/TPAMI.1983.4767409 Miclet L (1990) Grammatical inference. In: Bunke H, Sanfeliu A (eds) Syntactic and structural pattern recognition theory and applications. World Scientific, Singapore Minsky ML, Papert S (1988) Perceptrons, expanded ed. MIT Press, Cambridge, vol 15, pp 767, 776 Mitchell T (1982) Generalization as search. Artif Intell J 18:203–226 Mitchell T (1997) Machine learning. McGraw-Hill Muggleton S (1995) Inverse entailment and progol. 
New Gener Comput 13(3&4):245–286 Ng AY, Jordan MI, Weiss Y (2001) On spectral clustering: analysis and an algorithm. In: Advances in neural information processing systems 14 [neural information processing systems: natural and synthetic, NIPS 2001, December 3–8, 2001, Vancouver, British Columbia, Canada], pp 849–856. http://papers.nips.cc/paper/2092-on-spectral-clustering-analysis-and-an-algorithm Ng RT, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: VLDB’94, proceedings of 20th international conference on very large data bases, September 12–15, 1994, Santiago de Chile, Chile, pp 144–155. http://www.vldb.org/conf/1994/P144.PDF Nie F, Wang X, Huang H (2014) Clustering and projected clustering with adaptive neighbors. In: The 20th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’14, New York, NY, USA - August 24–27, 2014, pp 977–986. https://doi.org/10.1145/2623330. 2623726 Olshausen BA, Field DJ (1996) Natural image statistics and efficient coding. Netw Comput Neural Syst 7(2):333–339 Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345– 1359 Park HS, Jun CH (2009) A simple and fast algorithm for k-medoids clustering. Expert Syst Appl 36:3336–3341 Pearl J (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Pearl J (2009) Causal inference in statistics: an overview. Statist Surv 3:96–146. https://doi.org/10. 1214/09-SS057 Peters J, Janzing D, Schölkopf B (2017) Elements of causal inference: foundations and learning algorithms. MIT Press Plotkin G (1970) A note on inductive generalization. In: Machine intelligence, vol 5. Edinburgh University Press, pp 153–163 Qiu Q, Patel VM, Turaga P, Chellappa R (2012) Domain adaptive dictionary learning, pp 631–645

410

A. Cornuéjols and C. Vrain

Quinlan J (1993) C4.5: programs for machine learning. Morgan Kauffman Quinlan JR (1996) Learning first-order definitions of functions. CoRR. arXiv:cs.AI/9610102 Raedt LD (2008) Logical and relational learning. Springer, Berlin Raedt LD, Frasconi P, Kersting K, Muggleton S (eds) (2008) Probabilistic inductive logic programming - theory and applications. Lecture notes in computer science, vol 4911. Springer, Berlin Rao M (1969) Cluster analysis and mathematical programming 79:30 Rubinstein R, Bruckstein AM, Elad M (2010) Dictionaries for sparse representation modeling. Proc IEEE 98(6):1045–1057 Rumelhart DE, McClelland JL, Group PR et al (1987) Parallel distributed processing, vol 1. MIT Press, Cambridge Saitta L, Giordana A, Cornuéjols A (2011) Phase transitions in machine learning. Cambridge University Press Schölkhopf B, Smola A (2002) Learning with kernels. MIT Press Shapire R, Freund Y (2012) Boosting: foundations and algorithms. MIT Press Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M et al (2016) Mastering the game of go with deep neural networks and tree search. Nature 529(7587):484–489 Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A et al (2017) Mastering the game of go without human knowledge. Nature 550(7676):354 Suraj Z (2004) An introduction to rough sets theory and its applications: a tutorial. In: ICENCO’2004, Cairo, Egypt Sutton C, McCallum A (2012) An introduction to conditional random fields. Found Trends Mach Learn 4(4):267–373. https://doi.org/10.1561/2200000013 Tosic I, Frossard P (2011) Dictionary learning. IEEE Signal Process Mag 28(2):27–38 Uno T, Kiyomi M, Arimura H (2004) LCM ver. 2: efficient mining algorithms for frequent/closed/maximal itemsets. 
In: FIMI ’04, proceedings of the IEEE ICDM workshop on frequent itemset mining implementations, Brighton, UK, November 1, 2004. http://ceur-ws.org/ Vol-126/uno.pdf van der Laag PR, Nienhuys-Cheng SH (1998) Completeness and properness of refinement operators in inductive logic programming. J Log Program 34(3):201–225. https://doi.org/10.1016/S07431066(97)00077-0, http://www.sciencedirect.com/science/article/pii/S0743106697000770 Vapnik V (1995) The nature of statistical learning theory. Springer, Berlin Vega-Pons S, Ruiz-Shulcloper J (2011) A survey of clustering ensemble algorithms. IJPRAI 25(3):337–372. https://doi.org/10.1142/S0218001411008683 von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416. https://doi. org/10.1007/s11222-007-9033-z Wagstaff K, Cardie C (2000) Clustering with instance-level constraints. In: Proceedings of the 17th international conference on machine learning, pp 1103–1110 Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236–244. https://doi.org/10.1080/01621459.1963.10500845, http://www.tandfonline. com/doi/abs/10.1080/01621459.1963.10500845 Zaki MJ (2000) Generating non-redundant association rules. In: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, Boston, MA, USA, August 20–23, KDD, pp 34–43 Zelezný F, Lavrac N (2006) Propositionalization-based relational subgroup discovery with rsd. Mach Learn 62(1–2):33–63 Zhang C, Bengio S, Hardt M, Recht B, Vinyals O (2016) Understanding deep learning requires rethinking generalization. arXiv:161103530 Zhou ZH (2012) Ensemble methods: foundations and algorithms. CRC Press

Formal Concept Analysis: From Knowledge Discovery to Knowledge Processing Sébastien Ferré, Marianne Huchard, Mehdi Kaytoue, Sergei O. Kuznetsov and Amedeo Napoli

Abstract In this chapter, we introduce Formal Concept Analysis (FCA) and some of its extensions. FCA is a formalism, based on lattice theory, aimed at data analysis and knowledge processing. FCA allows the design of so-called concept lattices from binary and complex data. These concept lattices provide a realistic basis for knowledge engineering and the design of knowledge-based systems. Indeed, FCA is closely related to knowledge discovery in databases, knowledge representation and reasoning. Accordingly, FCA supports a wide range of complex and intelligent tasks, including classification, information retrieval, recommendation, network analysis, software engineering and data management. Finally, FCA is used in many applications, demonstrating its growing importance in data and knowledge sciences.

S. Ferré
Univ Rennes, CNRS, IRISA, Rennes, France
e-mail: [email protected]

M. Huchard
LIRMM, Université de Montpellier, CNRS, Montpellier, France
e-mail: [email protected]

M. Kaytoue
INSA-LIRIS, Infologic, Lyon, France
e-mail: [email protected]

S. O. Kuznetsov
National Research University Higher School of Economics (HSE), Moscow, Russia
e-mail: [email protected]

A. Napoli (B)
Université de Lorraine, CNRS, Inria, LORIA, 54000 Nancy, France
e-mail: [email protected]

© Springer Nature Switzerland AG 2020
P. Marquis et al. (eds.), A Guided Tour of Artificial Intelligence Research, https://doi.org/10.1007/978-3-030-06167-8_13

1 Introduction

Formal Concept Analysis (FCA) is a branch of applied lattice theory that appeared in the 1980s (Ganter and Wille 1999). Roots of the application of lattices obtained through the definition of a Galois connection between two arbitrary partially ordered


sets can also be found in the 1970s (Barbut and Monjardet 1970). Starting from a binary relation between a set of objects and a set of attributes, formal concepts are built as maximal sets of objects in relation with maximal sets of common attributes, by means of derivation operators forming a Galois connection. Concepts form a partially ordered set that represents the initial data as a hierarchy, called the concept lattice.

This conceptual structure has proved to be useful in many fields of artificial intelligence and pattern recognition, e.g. knowledge management, data mining, machine learning, and mathematical morphology. In particular, several results and algorithms from itemset and association rule mining and from rule-based classifiers were characterized in terms of FCA (Kuznetsov and Poelmans 2013). For example, the set of frequent closed itemsets is an order ideal of a concept lattice; association rules and functional dependencies can be characterized with the derivation operators; jumping patterns are defined as hypotheses; and so on, not to mention efficient polynomial-delay algorithms for building all closed itemsets, such as NextClosure (Ganter 2010) or CloseByOne (Kuznetsov 1999), the latter rediscovered under different names, e.g. as the LCM algorithm (Uno et al. 2004).

In this chapter, we present the basics of FCA and several extensions for working with complex data. FCA was generalized several times to partially ordered data descriptions in order to deal efficiently and elegantly with non-binary, complex, heterogeneous and structured data such as graphs (Liquiere and Sallantin 1998; Kuznetsov 1999). The different extensions include Fuzzy FCA (Belohlávek and Vychodil 2005; Belohlávek 2011), Generalized Formal Concept Analysis (Chaudron and Maille 2000), Logical Concept Analysis (Ferré and Ridoux 2000), and Pattern Structures (Ganter and Kuznetsov 2001).
FCA has also been extended to relational data, where concepts do not only depend on the descriptions of objects but also on the relationships between objects. These generalizations provide new ways of solving problems in several applications and include in particular Relational Concept Analysis (Rouane-Hacene et al. 2013), relational structures (Kötters 2013), Graph-FCA (Ferré 2015) and Triadic Concept Analysis (Lehmann and Wille 1995).

FCA is a data analysis and classification formalism with good mathematical foundations. Besides that, FCA is also related in several ways to Knowledge Discovery in Databases and to Knowledge Representation and Reasoning. Indeed, many links exist between concept lattices, itemsets and association rules, see e.g. (Bastide et al. 2000; Zaki 2005; Szathmary et al. 2014). In addition, FCA has been used in a variety of tasks, such as the discovery of definitions for ontologies, text mining, information retrieval, biclustering and recommendation, bioinformatics, chemoinformatics, medicine (healthcare), Natural Language Processing, and Social Network Analysis (the last section, about applications, provides details).

Visualization plays a very important role in FCA, and the display of the concept lattice is a very good support for the exploration and interpretation of the data under study. Many tools were designed for this purpose. One of the first tools to be created was Toscana (Eklund et al. 2000; Becker 2004), and the most well-known tool is probably Conexp.1 Then a series of tools was developed for building

1 http://conexp.sourceforge.net/.


friendly user interfaces for various applications such as email, pictures, or museum collections (Eklund et al. 2004; Ducrou et al. 2006; Wray and Eklund 2011). More recently, LatViz2 was introduced for drawing concept lattices based on plain FCA as well as on pattern structures (Alam et al. 2016).

In the following, we present the basics of FCA and then review a number of its main extensions, namely pattern structures, relational concept analysis, Graph-FCA, and Triadic Concept Analysis. The seminal book (Ganter and Wille 1999) remains the reference for the mathematical foundations of FCA, but let us also mention some other very useful books on various practical aspects of FCA (Carpineto and Romano 2004; Ganter et al. 2005) and on conceptual exploration (Ganter and Obiedkov 2016). Finally, the Formal Concept Analysis Homepage3 also includes a lot of useful information about FCA, FCA conferences, and software.

2 The Basics of Formal Concept Analysis

2.1 Context, Concepts and the Concept Lattice

The framework of FCA is fully detailed in Ganter and Wille (1999), and we adopt the so-called "German notation" in this chapter. FCA starts with a formal context (G, M, I), where G denotes a set of objects, M a set of attributes, and I ⊆ G × M a binary relation between G and M. The statement (g, m) ∈ I (also denoted by g I m) is interpreted as "object g has attribute m". Two operators (·)′ define a Galois connection between the powersets (2^G, ⊆) and (2^M, ⊆), with A ⊆ G and B ⊆ M:

A′ = {m ∈ M | for all g ∈ A : g I m}  and  B′ = {g ∈ G | for all m ∈ B : g I m}.

For A ⊆ G, B ⊆ M, a pair (A, B) such that A′ = B and B′ = A is called a "formal concept", where the set A is called the "extent" and the set B the "intent" of the concept (A, B). The dual aspect of a concept in knowledge representation is naturally materialized in a formal concept, where the intent corresponds to the description and the extent to the set of instances of the concept. Concepts are partially ordered by (A1, B1) ≤ (A2, B2) ⇔ A1 ⊆ A2 (⇔ B2 ⊆ B1). With respect to this partial order, the set of all formal concepts forms a complete lattice called the "concept lattice" of (G, M, I).

For illustration, we reuse a famous example introduced in Davey and Priestley (1990), where planets correspond to objects and their characteristics to attributes. The binary context is given below and the associated concept lattice is shown in Fig. 1.

2 https://latviz.loria.fr/. 3 http://www.upriss.org.uk/fca/fca.html.

The binary context of planets:

Planet    | Size               | Distance to Sun | Moon(s)
          | small medium large | near  far       | yes  no
Jupiter   |               x    |        x        |  x
Mars      |   x                |  x              |  x
Mercury   |   x                |  x              |       x
Neptune   |        x           |        x        |  x
Pluto     |   x                |        x        |  x
Saturn    |               x    |        x        |  x
Earth     |   x                |  x              |  x
Uranus    |        x           |        x        |  x
Venus     |   x                |  x              |       x

Fig. 1 The concept lattice of planets in reduced notation

The concept lattice, provided that it is not too large, can be visualized, navigated and interpreted in various ways.
• Visualization of the concept lattice: the labels of concepts are written in "reduced notation", meaning that the intent of a concept (in blue) is made of the union of all greater concept intents, while the extent (in black) is made of the union of all lower concept extents. For example, concept #10 has the extent (Jupiter, Saturn) and the intent (large, far, yes), while concept #4 has the extent (Mercury, Venus, Earth, Mars) and the intent (near, small).
• Navigation and information retrieval: the concept lattice can be navigated to search for specific information, e.g. small planets which are far from the sun (concept #9) or large planets with moons (concept #10).


• Interpretation of implications and association rules: the concept lattice can be used for knowledge discovery, by interpreting the concepts and the associated rules. Indeed, considering the intents of the concepts, it is possible to discover implications and association rules, as made precise hereafter.

When one considers non-binary contexts, e.g. numerical or interval data, conceptual scaling is often used for binarizing the data and obtaining a binary formal context (Ganter and Wille 1999). A numerical dataset, for example, can be described by a many-valued context (G, M, W, I) where G is a set of objects, M a set of numerical attributes, W a set of values (e.g. numbers), and I a ternary relation defined on the Cartesian product G × M × W. The fact that (g, m, w) ∈ I, or simply m(g) = w, means that object g takes the value w for attribute m. Standard FCA algorithms can then be applied for computing concept lattices from scaled contexts (Kuznetsov and Obiedkov 2002). However, adapted algorithms should be applied to more complex data such as intervals, sequences, trees or graphs, as detailed in the section about pattern structures.
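To make the derivation operators concrete, here is a minimal brute-force sketch in Python on the planets context above. The attribute tokens (e.g. "moon", "no_moon") and all function names are our own encoding, not from the chapter; every subset of objects is closed via A ↦ A″, which yields all formal concepts of the context.

```python
# Brute-force enumeration of the formal concepts of the planets context.
from itertools import combinations

CONTEXT = {  # object -> set of attributes (our encoding of the table above)
    "Jupiter": {"large", "far", "moon"},
    "Mars":    {"small", "near", "moon"},
    "Mercury": {"small", "near", "no_moon"},
    "Neptune": {"medium", "far", "moon"},
    "Pluto":   {"small", "far", "moon"},
    "Saturn":  {"large", "far", "moon"},
    "Earth":   {"small", "near", "moon"},
    "Uranus":  {"medium", "far", "moon"},
    "Venus":   {"small", "near", "no_moon"},
}
ATTRIBUTES = set().union(*CONTEXT.values())

def intent(objects):
    """A' : the attributes shared by all objects in A."""
    sets = [CONTEXT[g] for g in objects]
    return set.intersection(*sets) if sets else set(ATTRIBUTES)

def extent(attrs):
    """B' : the objects having every attribute in B."""
    return {g for g in CONTEXT if attrs <= CONTEXT[g]}

concepts = set()
for r in range(len(CONTEXT) + 1):
    for A in combinations(sorted(CONTEXT), r):
        B = intent(set(A))                                   # A'
        concepts.add((frozenset(extent(B)), frozenset(B)))   # (A'', A') is a concept

print(len(concepts))  # number of formal concepts of the planets context
```

Since A‴ = A′, each pair (B′, B) with B = A′ is indeed a formal concept, and closing every subset of objects reaches them all; this exponential scan is only meant for tiny contexts like this one.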

2.2 Rules and Implications

Besides formal concepts and their ordering, which model the natural taxonomy of object classes described in terms of common attributes, another very important aspect of FCA is related to implicative dependencies, both exact and approximate. Dependencies of the first type, called "implications", are closely related to functional dependencies, whereas approximate implicative dependencies became known as association rules in data mining. Let us give definitions and examples.

For A, B ⊆ M, the "attribute implication" A ⇒ B holds if A′ ⊆ B′. The "association rule" A → B is valid when its "support" s = |(A ∪ B)′| / |G| = |A′ ∩ B′| / |G| ≥ σs and its "confidence" c = |(A ∪ B)′| / |A′| = |A′ ∩ B′| / |A′| ≥ σc, where σs and σc are user-defined thresholds for support and confidence respectively. In this way, an attribute implication is an association rule with c = 1, since the latter is possible only when A′ = (A ∪ B)′ = A′ ∩ B′, hence when A′ ⊆ B′. Then no ⇒ near and near ⇒ small are implications (confidence 1), while far → medium and small → near are association rules with confidences of 2/5 and 4/5 respectively.

Implications obey the Armstrong rules:

(i) A → A;
(ii) from A → B, infer A ∪ C → B;
(iii) from A → B and B ∪ D → C, infer A ∪ D → C;

which are known as properties of functional dependencies in database theory. Indeed, implications and functional dependencies can be reduced to each other. A functional dependency X → Y is valid in a complete many-valued context (actually, a relational data table) (G, M, W, I) if the following holds for every pair of objects g, h ∈ G:


(∀m ∈ X : m(g) = m(h)) implies (∀n ∈ Y : n(g) = n(h)).

In Ganter and Wille (1999) it was shown that, given a complete many-valued context K = (G, M, W, I), one can define the context K_N := (P2(G), M, I_N), where P2(G) is the set of all pairs of distinct objects from G and I_N is given by {g, h} I_N m ⇔ m(g) = m(h). Then a set Y ⊆ M is functionally dependent on the set X ⊆ M in K iff the implication X → Y holds in the context K_N. In Baixeries et al. (2014), the relations between implications and functional dependencies are discussed in depth. The inverse reduction (Kuznetsov 2001) is given as follows: for a context K = (G, M, I), one can construct a many-valued context K_W such that an implication X ⇒ Y holds in K iff Y is functionally dependent on X in K_W. For a context K, the corresponding many-valued context is K_W = (G, M, G ∪ {×}, I_M), where for any m ∈ M and g ∈ G, one has m(g) = g if g I m does not hold and m(g) = × if g I m.

Given the Armstrong rules, it is natural to ask for an implication base, i.e., a minimal subset of implications from which all other implications of a context can be derived. Guigues and Duquenne (1986) gave an algebraic characterization of the premises of a cardinality-minimum implication base, in terms of what is now called "pseudo-intents". Another important implication base, called the "proper premise base" (Ganter and Wille 1999) or "direct canonical base" (Bertet and Monjardet 2010), is minimal w.r.t. the application of the first two Armstrong rules (the third rule is not applied in the derivation). We will not elaborate further in this direction, which is out of the scope of this chapter.
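The support and confidence definitions above can be checked directly on the planets context. The following sketch (function and attribute names are ours) computes both quantities with the derivation operator:

```python
# Support and confidence of association rules via the derivation operator.
CONTEXT = {  # the planets context, in our own encoding
    "Jupiter": {"large", "far", "moon"},
    "Mars":    {"small", "near", "moon"},
    "Mercury": {"small", "near", "no_moon"},
    "Neptune": {"medium", "far", "moon"},
    "Pluto":   {"small", "far", "moon"},
    "Saturn":  {"large", "far", "moon"},
    "Earth":   {"small", "near", "moon"},
    "Uranus":  {"medium", "far", "moon"},
    "Venus":   {"small", "near", "no_moon"},
}

def extent(attrs):
    """B' : the objects having every attribute in B."""
    return {g for g, atts in CONTEXT.items() if attrs <= atts}

def support(A, B):
    """s = |(A ∪ B)'| / |G|"""
    return len(extent(A | B)) / len(CONTEXT)

def confidence(A, B):
    """c = |(A ∪ B)'| / |A'|"""
    return len(extent(A | B)) / len(extent(A))

print(confidence({"far"}, {"medium"}))   # 2/5 on the planets data
print(confidence({"near"}, {"small"}))   # 1.0: near => small is an implication
```

A rule with confidence 1 is exactly an attribute implication, matching the examples no ⇒ near and near ⇒ small from the text.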

2.3 Algorithms for Computing Concepts

Many algorithms have been proposed for computing the set of concepts and concept lattices. The survey (Kuznetsov and Obiedkov 2002) provides a state of the art as of 2002. The most efficient practical algorithms have "polynomial delay". More recently, several new algorithms were proposed (Kourie et al. 2009; Outrata and Vychodil 2012); while having the same worst-case complexity, they perform better in practice. As for implication bases, no total-polynomial-time algorithm for computing the minimum base is known, and due to intractability results (Kuznetsov 2004; Distel and Sertkaya 2011; Babin and Kuznetsov 2013), it appears that such algorithms are not feasible. In applications, the classical "NextClosure" algorithm (Ganter 2010) and a faster algorithm from Obiedkov and Duquenne (2007) are the most used ones.
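As an illustration of a polynomial-delay algorithm, here is a compact sketch of NextClosure as we understand it (helper names are ours): closed attribute sets are enumerated in lectic order with respect to a fixed linear order on M.

```python
# A sketch of Ganter's NextClosure, enumerating all concept intents in lectic order.
CONTEXT = {  # the planets context, in our own encoding
    "Jupiter": {"large", "far", "moon"},
    "Mars":    {"small", "near", "moon"},
    "Mercury": {"small", "near", "no_moon"},
    "Neptune": {"medium", "far", "moon"},
    "Pluto":   {"small", "far", "moon"},
    "Saturn":  {"large", "far", "moon"},
    "Earth":   {"small", "near", "moon"},
    "Uranus":  {"medium", "far", "moon"},
    "Venus":   {"small", "near", "no_moon"},
}
M = ["small", "medium", "large", "near", "far", "moon", "no_moon"]
RANK = {m: i for i, m in enumerate(M)}  # fixed linear order on attributes

def closure(B):
    """B'' : attributes common to all objects having every attribute of B."""
    ext = [atts for atts in CONTEXT.values() if B <= atts]
    return set(M).intersection(*ext) if ext else set(M)

def next_closure(A):
    """The lectically next closed set after A, or None if A is the last one."""
    A = set(A)
    for i in reversed(range(len(M))):
        m = M[i]
        if m in A:
            A.discard(m)
        else:
            B = closure(A | {m})
            if all(RANK[x] >= i for x in B - A):  # nothing new below m
                return B
    return None

intents, A = [], closure(set())
while A is not None:
    intents.append(A)
    A = next_closure(A)

print(len(intents))  # one closed intent per formal concept
```

Each step costs one closure computation per attribute at most, so consecutive intents are produced with polynomial delay, without storing previously generated sets.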

2.4 The Stability Measure

Stability (Kuznetsov 2007) was proposed as a measure of the independence of a concept intent, i.e., the concept "meaning", from randomness in data composition. Every object could appear in the data (context) at random, so a stable intent, i.e., an intent with


large stability, would be preserved for a relatively large number of subsets of the concept extent having the same intent. Dually, one can speak of the stability of an extent, by considering subsets of a concept intent; here, attributes can be noisy. Stability is a special case of robustness (Tatti et al. 2014). A drawback of stability is its intractability (#P-completeness). In Kuznetsov (2007), Babin and Kuznetsov (2012), Buzmakov et al. (2014), several tractable approximations of stability were proposed, including Δ-stability, which proved to behave quite similarly to stability in practice. Stability was used in numerous applications, such as detecting expert communities (Kuznetsov et al. 2007), groups of French verbs (Falk and Gardent 2014), and chemical alerts (Métivier et al. 2015). The tractable approximation Δ-stability was proven to be an anti-monotonic measure in terms of "projection chains" (Buzmakov et al. 2017). Besides stability and support, well known in data mining, many other interestingness measures of concepts have been proposed; see the recent survey (Kuznetsov and Makhalova 2018).

3 Pattern Structures

3.1 Introduction

When data are complex, e.g. numbers, sequences or graphs, instead of applying data transformations such as the discretization of numerical data, which lead to space and time computational hardness, one may work directly on the original data. A pattern structure is defined as a generalization of a formal context describing complex data (Ganter and Kuznetsov 2001; Kuznetsov 2009), and broadens the range of applicability of plain FCA.

Besides pattern structures, Fuzzy FCA and Logical Concept Analysis (LCA) are two other extensions of FCA. Fuzzy FCA aims at extending FCA to data with graded or fuzzy attributes. There are two main ways to deal with fuzzy attributes in FCA, one using conceptual scaling and the other extending FCA into a fuzzy setting enabling one to work directly with fuzzy attributes (see for example Belohlávek 2004; Belohlávek and Vychodil 2005; Cabrera et al. 2017; Belohlávek 2011; García-Pardo et al. 2013). One of the main approaches in Fuzzy FCA is based on the use of a residuated implication, and a comprehensive overview is proposed in Belohlávek (2008). LCA (Ferré and Ridoux 2000, 2004) is a generalization of concept analysis that shares with pattern structures the objective of working directly with complex data. It defines the same Galois connection, only using different notations, and was mostly developed for information retrieval purposes. Hereafter, we focus on pattern structures, while some results involving LCA are mentioned in the next sections.

In classical FCA, object descriptions are sets of attributes, which are partially ordered by set inclusion, w.r.t. set intersection: let P, Q ⊆ M be two attribute sets, then P ⊆ Q ⇔ P ∩ Q = P, and (M, ⊆), also written (M, ∩), is a partially ordered set of object descriptions. Set intersection ∩ is a meet operator, i.e., it is idempotent,


commutative, and associative. A Galois connection can then be defined between the powerset of objects (2^G, ⊆) and a meet-semi-lattice of descriptions denoted by (D, ⊓) (standing for (M, ∩)). This idea is used to define pattern structures in the framework of FCA as follows.

Formally, let G be a set of objects, let (D, ⊓) be a meet-semi-lattice of potential object descriptions, and let δ : G → D be a mapping that takes each object to its description. Then (G, (D, ⊓), δ) is a "pattern structure". Elements of D are patterns and are ordered by a subsumption relation ⊑: ∀c, d ∈ D, c ⊑ d ⇔ c ⊓ d = c. A pattern structure (G, (D, ⊓), δ) gives rise to two derivation operators (·)^□:

A^□ = ⊓_{g ∈ A} δ(g), for A ⊆ G,   and   d^□ = {g ∈ G | d ⊑ δ(g)}, for d ∈ D.

These operators form a Galois connection between (2^G, ⊆) and (D, ⊑). Pattern concepts of (G, (D, ⊓), δ) are pairs of the form (A, d), A ⊆ G, d ∈ D, such that A^□ = d and A = d^□. For a pattern concept (A, d), d is called the "pattern intent" and is the common description of all objects from A, called the "pattern extent". When partially ordered by (A1, d1) ≤ (A2, d2) ⇔ A1 ⊆ A2 (⇔ d2 ⊑ d1), the set of all pattern concepts forms a complete lattice called the "pattern concept lattice". More importantly, the operator (·)^□□ is a closure operator and pattern intents are closed patterns.

The existing FCA algorithms (detailed in Kuznetsov and Obiedkov 2002) can be used with slight modifications to compute pattern structures, in order to extract and classify concepts. Details can be found in Ganter and Kuznetsov (2001), Kaytoue et al. (2011c), Kuznetsov (2009). Pattern structures are very useful for building concept lattices where the extents of concepts are composed of "similar objects" with respect to a similarity measure associated with the subsumption relation ⊑ in (D, ⊓) (Kaytoue et al. 2010). Pattern structures offer a concise way to define closed patterns. They also allow for efficient algorithms with polynomial delay (modulo the complexity of computing ⊓ and ⊑) (Kuznetsov 1999). In the presence of large datasets they offer natural approximation tools (projections, detailed below) and allow for lazy classification (Kuznetsov 2013). When D is the power set of a set of items I, and ⊓ and ⊑ are set intersection and set inclusion respectively, pattern intents are closed itemsets and we fall back to the standard FCA setting.

Originally, pattern structures were introduced to handle objects described by labeled graphs (Kuznetsov 1999; Kuznetsov and Samokhin 2005). A general approach for handling various types of descriptions was developed: numbers and intervals (Kaytoue et al. 2011c), convex polygons (Belfodil et al. 2017), partitions (Baixeries et al. 2014), sequences (Buzmakov et al. 2016), trees (Leeuwenberg et al. 2015), and RDF triples in the web of data (Alam et al. 2015).

Fig. 2 Meet-semi-lattice (D, ⊓) with D = {[4, 4], [5, 5], [6, 6], [4, 5], [5, 6], [4, 6]}

3.2 Interval Pattern Structures

For illustration, we analyze object descriptions given as tuples of numeric intervals. Pattern structures allow one to directly extract concepts from data whose object descriptions are partially ordered (Kaytoue et al. 2011b, c). A numerical dataset with objects G and attributes M can be represented by an interval pattern structure. Let G be a set of objects, (D, ⊓) a meet-semi-lattice of interval patterns (|M|-dimensional interval vectors), and δ a mapping associating with any object g ∈ G an interval pattern δ(g) ∈ D. The triple (G, (D, ⊓), δ) is an "interval pattern structure".

The meet operator on interval patterns can be defined as follows. Given two interval patterns c = ⟨[ai, bi]⟩ and d = ⟨[ei, fi]⟩, with i ∈ {1, ..., |M|}, then:

c ⊓ d = ⟨[min(ai, ei), max(bi, fi)]⟩, i ∈ {1, ..., |M|}

meaning that the convex hull of the intervals is taken on each vector dimension. The meet operator induces the following subsumption relation ⊑ on interval patterns:

[ai, bi] ⊑ [ci, di] ⇔ [ai, bi] ⊇ [ci, di], ∀i ∈ {1, ..., |M|}

where larger intervals are subsumed by smaller intervals. For example, with D = {[4, 4], [5, 5], [6, 6], [4, 5], [5, 6], [4, 6]}, the meet-semi-lattice (D, ⊓) is given in Fig. 2. The interval labeling a node is the meet of all intervals labeling its ascending nodes, e.g. [4, 5] = [4, 4] ⊓ [5, 5], and is also subsumed by these intervals, e.g. [4, 5] ⊑ [5, 5] and [4, 5] ⊑ [4, 4]. An example of interval pattern subsumption is given by ⟨[2, 4], [2, 6]⟩ ⊑ ⟨[4, 4], [3, 4]⟩, as [2, 4] ⊑ [4, 4] and [2, 6] ⊑ [3, 4].

      m1   m2   m3
g1     5    7    6
g2     6    8    4
g3     4    8    5
g4     4    9    8
g5     5    8    5


Let us consider the above numerical context with 5 objects and 3 attributes. The description of an object is a vector of intervals, e.g. δ(g1) = ⟨5, 7, 6⟩, where 5 stands for the closed interval [5, 5]. Then the meet operator captures the "similarity" between two object descriptions (i.e. two vectors of intervals) as the convex hull of the intervals, w.r.t. the order of the components of the vector. For example, the similarity of δ(g1) = ⟨5, 7, 6⟩ and δ(g2) = ⟨6, 8, 4⟩ is computed as:

{g1, g2}^□ = ⊓_{g ∈ {g1, g2}} δ(g) = ⟨5, 7, 6⟩ ⊓ ⟨6, 8, 4⟩ = ⟨[5, 6], [7, 8], [4, 6]⟩

Conversely, we can compute the image of an interval vector following the definition of the Galois connection for interval pattern structures:

⟨[5, 6], [7, 8], [4, 6]⟩^□ = {g ∈ G | ⟨[5, 6], [7, 8], [4, 6]⟩ ⊑ δ(g)} = {g1, g2, g5}

Finally, ({g1, g2, g5}, ⟨[5, 6], [7, 8], [4, 6]⟩) is a pattern concept. The whole pattern concept lattice for the numerical context is given in Fig. 3. Even for small numerical contexts, the pattern concept lattice is usually large and close to a Boolean lattice. This shows that it can be hard to work with the whole pattern concept lattice and that operations for simplifying the lattice should be provided, such as the projections detailed below.

Interval pattern structures offer a way to enumerate all hyper-rectangles of a numerical dataset (a tensor or numerical matrix) without redundancy, thanks to the closure operator: each pattern intent corresponds to a unique set of points in the Euclidean space, and the intent gives the minimal bounding box containing them all. This is particularly interesting for mining numerical data without having to discretize it, either in a preprocessing phase or on the fly, and thus without suffering from imprecision. For example, it was used by Kaytoue et al. for enumerating contexts which induce exceptional models of some dataset (Kaytoue et al. 2017).
As it is costly in the general case to enumerate all closed interval patterns, the authors then proposed a best-first search of interval patterns with Monte Carlo Tree Search, driven by a pattern quality measure which discriminates a class label (Bosc et al. 2017). Further on, the use of a pattern structure for the task of biclustering numerical data is also described.
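The interval pattern operations above can be sketched in a few lines of Python on the running numerical context (function names are ours; the data is the 5 × 3 table above, with numbers encoded as degenerate intervals):

```python
# Interval pattern structure: meet = componentwise convex hull.
from functools import reduce

DELTA = {  # δ : each object maps to a vector of intervals
    "g1": ((5, 5), (7, 7), (6, 6)),
    "g2": ((6, 6), (8, 8), (4, 4)),
    "g3": ((4, 4), (8, 8), (5, 5)),
    "g4": ((4, 4), (9, 9), (8, 8)),
    "g5": ((5, 5), (8, 8), (5, 5)),
}

def meet(c, d):
    """c ⊓ d : componentwise convex hull of two interval vectors."""
    return tuple((min(a, e), max(b, f)) for (a, b), (e, f) in zip(c, d))

def subsumed(c, d):
    """c ⊑ d iff c ⊓ d = c, i.e. each interval of c contains that of d."""
    return meet(c, d) == c

def box_intent(A):
    """A□ : the meet of the descriptions of all objects in A."""
    return reduce(meet, (DELTA[g] for g in A))

def box_extent(d):
    """d□ = {g ∈ G | d ⊑ δ(g)}."""
    return {g for g in DELTA if subsumed(d, DELTA[g])}

d = box_intent({"g1", "g2"})
print(d)              # the pattern <[5,6], [7,8], [4,6]>
print(box_extent(d))  # its extent contains g1, g2 and g5
```

Applying `box_extent` to `box_intent({"g1", "g2"})` reproduces the pattern concept ({g1, g2, g5}, ⟨[5, 6], [7, 8], [4, 6]⟩) computed in the text.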

3.3 Projections and Representation Context for Pattern Structures

Pattern structure projections simplify computation and reduce the number of concepts (Ganter and Kuznetsov 2001). For example, a set of labeled graphs can be projected to a set of k-chains (Kuznetsov and Samokhin 2005), while intervals


Fig. 3 An example of pattern concept lattice

can be enlarged (Kaytoue et al. 2010). In general, a projection ψ maps any pattern to a more general pattern covering more objects. A projection is ⊓-preserving: ∀c, d ∈ D, ψ(c ⊓ d) = ψ(c) ⊓ ψ(d). Numerical data can be projected when a similarity or tolerance relation (i.e. a symmetric, reflexive, but not necessarily transitive relation) between numbers is used; the projection can then be performed as a preprocessing task. A wider class of projections was introduced in Buzmakov et al. (2015): o-projections can modify not only the object descriptions but also the semi-lattice of descriptions.

For any pattern structure, a representation context can be built, i.e., a binary relation encoding the pattern structure. Concepts in both data representations are in one-to-one correspondence. This aspect was studied for several types of patterns, designing the transformation procedures and evaluating under which conditions one data representation prevails (Kaytoue et al. 2011b; Baixeries et al. 2014). It should be noticed that the bijection does not hold in general for minimal generators, qualified as free or key in pattern mining (Kaytoue et al. 2011b). The impact of projections on representation contexts is investigated with the new class of o-projections in Buzmakov et al. (2015).
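As a toy illustration of interval enlargement (our own construction, not from the chapter), the projection below widens each interval to the smallest interval whose endpoints lie on a coarser grid of step K; such an enlargement is ⊓-preserving:

```python
# A toy ⊓-preserving projection on interval patterns: snap endpoints outward
# onto a coarser grid of step K (K = 5 is an arbitrary assumption).
import math

K = 5

def meet(c, d):
    """Componentwise convex hull of two interval vectors."""
    return tuple((min(a, e), max(b, f)) for (a, b), (e, f) in zip(c, d))

def psi(c):
    """Project each interval to the enclosing grid-aligned interval
    (a more general pattern, hence covering more objects)."""
    return tuple((math.floor(a / K) * K, math.ceil(b / K) * K) for a, b in c)

c = ((5, 6), (7, 8))
d = ((2, 4), (8, 9))
print(psi(meet(c, d)) == meet(psi(c), psi(d)))  # ⊓-preservation holds
```

The preservation property holds here because flooring commutes with min and ceiling with max, so projecting the convex hull gives the hull of the projections.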


4 Relational Models in FCA

4.1 Relational Concept Analysis

Relational datasets arise in a wide range of situations, e.g. Semantic Web applications (Staab and Studer 2009), relational learning and data mining (Dzeroski and Lavrac 2001), refactoring of UML class models (Dao et al. 2004) and model-driven development (Stahl et al. 2006). Relational datasets are composed of a set of binary tables (objects × attributes) and inter-object relations (objects × objects). The binary restriction on inter-object relations is not really a limitation, as n-ary relations can always be transformed into a composition of binary relations, or reified.

Relational Concept Analysis (RCA) extends FCA to the processing of relational datasets in a way that allows inter-object links to be materialized and incorporated into formal concept intents (Rouane-Hacene et al. 2013). Each object-attribute relation corresponds to an object category, while each object-object relation corresponds to inter-category links. The objects of one category are classified in the usual way into a concept lattice depending on the binary and relational attributes that they share. Inter-object links are scaled to become "relational attributes" connecting first objects to concepts and then concepts to concepts, in a way similar to role restrictions in Description Logics (DLs) (Baader et al. 2003). The relational attributes reflect the relational aspects within a formal concept. They also conform to the classical rules of concept formation in FCA, which means that relational concept intents can be produced by standard FCA algorithms. Due to the strong analogy between role restrictions in DLs and relational attributes in RCA, formal concepts can be almost readily converted into a DL-based formalism (Rouane-Hacene et al. 2007), e.g. for ontology engineering purposes as in Rouane-Hacene et al. (2010), Bendaoud et al. (2008). RCA is introduced and detailed in Huchard et al. (2007).
The data structure is described by a relational context family (RCF), which is a pair (K, R) where K = {Ki}, i = 1, ..., n, is a set of contexts Ki = (Gi, Mi, Ii), and R = {rj}, j = 1, ..., p, is a set of relations with rj ⊆ Gk × Gl for some k, l ∈ {1, ..., n}. For a given relation rj, the domain is denoted by dom(rj) and the range by ran(rj).

Figure 4 shows a simple example where three object categories, namely dishes, cereals and countries, are described by attributes, namely Europe/Asia for countries and rice/wheat for cereals. Three relations respectively connect dishes to cereals (hasMainCereal (hmc)), cereals to countries (isProducedIn (ipi)) and countries to dishes (eatLotOf (elo)). Relational Concept Analysis considers such data under the tabular form (RCF) presented in Table 1.

RCA is based on a "relational scaling" mechanism that transforms a relation rj ⊆ Gk × Gl into a set of relational attributes that are added to the context describing the object set dom(rj). To that end, relational scaling adapts the DL semantics of role restrictions. For each relation rj ⊆ Gk × Gl, there is an initial lattice for each object set, i.e. Lk for Gk and Ll for Gl. A relational attribute is associated with an object o ∈ Gk whenever rj(o) satisfies a given constraint, where rj(o) denotes the set of


Fig. 4 Graph view on objects (dishes, cereals, and countries) with their attributes and relations

Table 1 Relational Context Family: formal (object-attribute) contexts dishes, cereal, countries, and object-object relations hasMainCereal, isProducedIn, eatLotOf


objects in Gl in relation with o through rj. A relational attribute is composed of a scaling quantifier q, the name of the relation rj, and the target concept C, and is denoted by "q rj(C)". There is a variety of scaling quantifiers. ∃∀ rj(C) (existential+universal scaling) is associated with o when o has at least one rj link and every such rj link is directed only towards objects in the extent of C, i.e. rj(o) ⊆ extent(C) and rj(o) ≠ ∅. ∃ rj(C) (existential scaling) is associated with o when o has at least one rj link directed to an object of the extent of C, i.e. rj(o) ∩ extent(C) ≠ ∅. ∃⊇ rj(C) (contains scaling) is associated with o when o has rj links directed to all objects of the extent of C, i.e. extent(C) ⊆ rj(o) and extent(C) ≠ ∅. Some other relational scaling operators exist in RCA, e.g. with percentages to relax the ∃⊇ or ∃∀ constraints, and with cardinalities following the classical role restriction semantics of DLs.

In the general case, the data model can be cyclic at the schema level and at the individual level, resulting in an iterative process which converges after a number of steps depending on the dataset. Accordingly, the RCAexplore4 system allows a variety of analyses: changing at each step the scaling quantifiers associated with the object-object contexts, the selected formal contexts and relations, and the conceptual structure which is built (concept lattice, AOC-poset (Berry et al. 2014) or iceberg lattice (Stumme et al. 2002)).

In the current example, applying the existential quantifier to all object-object relations, the iterative RCA process outputs three lattices, respectively for dishes, cereals and countries. The concept lattices are shown in Fig. 5, and the building process terminates after three steps. Hereafter, we detail the main steps of the concept construction:
• Step 0: redRice, arborioRice, thaiRice and basmatiRice are grouped together due to the common attribute rice (concept cereal5).
Then Pakistan and Thailand are grouped together due to the common attribute Asia (concept country5).
• Step 1: relational attributes of the form ∃hmc(C) (respectively ∃ipi(C) and ∃elo(C)), where C is a concept built at step 0, are added to the dish descriptions (respectively the cereal and country descriptions). All dishes have a main cereal in the extent of cereal5, thus they all have the relational attribute ∃hmc(cereal5). Then basmatiRice and thaiRice are produced in a country belonging to country5 (the Asian countries), thus they have the relational attribute ∃ipi(country5) and can be grouped into concept cereal6 (cereals produced in an Asian country).
• Step 2: the dishes khaoManKai and biryani are grouped into concept dish7, as they share the relational attribute ∃hmc(cereal6), since both have a main cereal in the extent of cereal6.
• Step 3: the countries of country5 have the relational attribute ∃elo(dish7), because people from both countries eat a lot of one of the dishes in the extent of dish7.

The concept lattices show a classification of dishes depending on their main cereals, which in turn are classified through their attributes (e.g. being rice varieties) and

4 http://dataqual.engees.unistra.fr/logiciels/rcaExplore.
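The existential scaling step can be sketched as follows. The link table is our own guess at the example (the chapter only names khaoManKai, biryani and the four rice varieties; the dish-cereal pairing beyond those two dishes is hypothetical), and the function names are ours:

```python
# Existential relational scaling: which objects acquire the attribute ∃r(C)?
HMC = {  # hasMainCereal : dish -> set of cereals (partly hypothetical links)
    "khaoManKai": {"thaiRice"},
    "biryani":    {"basmatiRice"},
    "dishA":      {"arborioRice"},   # hypothetical dish
    "dishB":      {"redRice"},       # hypothetical dish
}

def exists_scaling(links, concept_name, concept_extent):
    """Return, for each object with at least one link into the extent of C,
    the relational attribute it acquires (the ∃ quantifier of RCA)."""
    attr = f"exists hmc({concept_name})"
    return {o: attr for o, targets in links.items()
            if targets & concept_extent}  # rj(o) ∩ extent(C) ≠ ∅

# cereal6 = cereals produced in an Asian country (cf. Step 1 above)
print(exists_scaling(HMC, "cereal6", {"thaiRice", "basmatiRice"}))
```

Only khaoManKai and biryani have a main cereal in the extent of cereal6, so only they receive ∃hmc(cereal6), matching Step 2 above; iterating this scaling over all relations and rebuilding the lattices is what makes the RCA process converge.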

Formal Concept Analysis: From Knowledge Discovery to Knowledge Processing


Fig. 5 The three lattices classify objects, i.e. dishes, cereals, and countries, w.r.t. their plain and relational attributes. The label "ex. hmc(cereal3)" stands for "∃hasMainCereal(cereal3)" ("ipi" stands for "isProducedIn" and "elo" for "eatLotOf")

their relations to the countries where they are produced. Finally, the countries are classified depending on their attributes, here the continents, and on their relations to the dishes that are regularly eaten there. While the concepts group sets of similar objects, the relational attributes group sets of similar links. The lattices may also show implications between relational attributes, or between relational and non-relational attributes. All analyses that can be made with FCA apply here, enriched with the knowledge of object-object relations. RCA is indeed a powerful mechanism for managing relations in the framework of FCA. Compared to FCA approaches based on graph patterns, RCA provides a complementary view, focusing on object classification inside object categories. Some approaches, such as Nica et al. (2017) in the context of sequential data, explore


S. Ferré et al.

the transformations between the two representations, and extract, from the concept lattices produced with RCA, graph patterns included in a subsumption hierarchy. This hierarchy can be navigated by experts and allows a straightforward interpretation. There are few tools implementing RCA; let us mention Galicia (http://www.iro.umontreal.ca/~galicia/) and RCAexplore (http://dataqual.engees.unistra.fr/logiciels/rcaExplore). RCA has been used for UML class model refactoring (Dao et al. 2004; Guédi et al. 2013), model transformation learning (Dolques et al. 2010), legal document retrieval (Mimouni et al. 2015), ontology construction (Bendaoud et al. 2008), and relational rule extraction from hydrobiological data (Dolques et al. 2016). In some applications, AOC-posets (for "Attribute-Object-Concept partially ordered sets") (Berry et al. 2014) are used rather than concept lattices for efficiency purposes.
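The existential scaling at the core of the step-by-step construction above can be sketched in a few lines. This is a minimal illustration, not the RCAexplore implementation; the function name and the dish-to-cereal pairings for gardiane and arancini are assumptions made for the sake of the example.

```python
# Minimal sketch of RCA existential scaling (illustrative, not RCAexplore).
# Given an object-object relation r and the extents of concepts built at the
# previous step, an object o receives the relational attribute "exists r(C)"
# whenever r(o) has a non-empty intersection with extent(C).

def existential_scaling(relation, concept_extents):
    """relation: dict mapping o -> r(o); concept_extents: dict name -> extent."""
    scaled = {}
    for o, targets in relation.items():
        attrs = set()
        for c, extent in concept_extents.items():
            if targets & extent:  # r(o) ∩ extent(C) ≠ ∅
                attrs.add(f"exists {c}")
        scaled[o] = attrs
    return scaled

# Running example: hasMainCereal (hmc) links from dishes to cereals; cereal5
# is the step-0 concept grouping the four rice varieties. The pairings for
# gardiane and arancini are assumed here.
hmc = {"biryani": {"basmatiRice"}, "khaoManKai": {"thaiRice"},
       "gardiane": {"redRice"}, "arancini": {"arborioRice"}}
cereal5 = {"redRice", "arborioRice", "thaiRice", "basmatiRice"}

scaled = existential_scaling(hmc, {"hmc(cereal5)": cereal5})
print(scaled)  # every dish receives "exists hmc(cereal5)"
```

Iterating this scaling step, with the scaled attributes feeding a new round of concept construction, is what makes the RCA process converge on cyclic data.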

4.2 Graph-FCA

Graph-FCA (Ferré 2015; Ferré and Cellier 2016) is another extension of FCA for multi-relational data, and in particular for the knowledge graphs of the Semantic Web (Hitzler et al. 2009). The specificity of Graph-FCA is to extract n-ary concepts from a knowledge graph using n-ary relationships. The extents of n-ary concepts are sets of n-ary tuples of graph nodes, and their intents are expressed as graph patterns with n distinguished nodes, which are called the "focus". For instance, in a knowledge graph that represents family members with a "parent" binary relationship, the "sibling" binary concept can be discovered, and described as "a pair of persons having a common parent". Classical FCA corresponds to the case where n = 1, i.e. when graph nodes are disconnected and concept extents are sets of graph nodes. Graph-FCA differs from Graal (Liquiere and Sallantin 1998) and from applications of pattern structures to graphs (Kuznetsov 1999; Kuznetsov and Samokhin 2005): here objects are the nodes of one large knowledge graph, instead of each object being described by a small graph. Graph-FCA shares theoretical foundations with the work of Kötters (2013) and brings in addition algorithms for computing and displaying concepts (Ferré and Cellier 2016). Whereas FCA defines a formal context as an incidence relation between objects and attributes, Graph-FCA defines a "graph context" as an incidence relation between tuples of objects and attributes. A graph context is a triple K = (G, M, I), where G is a set of objects, M is a set of attributes, and I ⊆ G* × M is an incidence relation between object tuples g = (g1, ..., gn) ∈ G*, for any arity n, and attributes m ∈ M. Here, G* = ⋃_{n∈ℕ*} G^n = G ∪ (G × G) ∪ (G × G × G) ∪ ... denotes the set of all tuples of objects (ℕ* denotes the set of natural numbers strictly greater than 0).
The graphical representation of a graph context uses objects as nodes, incidence elements as hyper-edges, and attributes as hyper-edge labels. Attributes can be interpreted as n-ary predicates, and graph contexts as First Order Logic (FOL) models

5 http://www.iro.umontreal.ca/~galicia/. 6 http://dataqual.engees.unistra.fr/logiciels/rcaExplore.


Fig. 6 Graph context (excerpt) about dishes, cereals, and countries. Rectangles are objects, word-labeled links are binary edges, and ellipses are other edges, here unary edges

without functions and constants. For example, a hyper-edge ((g1, ..., gn), m) can be seen as the FOL atom m(g1, ..., gn). Different kinds of knowledge graphs, such as Conceptual Graphs (Sowa 1984), RDF graphs, or RCA contexts, can be directly mapped to graph contexts. We illustrate Graph-FCA on the same relational data as RCA in the previous section. Fig. 6 shows an excerpt of the graphical representation of the graph context about dishes, cereals, and countries. The objects are dishes (e.g. biryani), cereals (e.g. basmatiRice), and countries (e.g. Pakistan). They are represented as rectangles. The attributes are either unary relations, i.e. classical FCA attributes (e.g., dish, Asia, basmatiRice), or binary relations (e.g., eatLotOf, hasMainCereal, isProducedIn). The former are represented as ellipses attached to objects, and the latter as directed edges between objects. More generally, a binary edge m(x, y) is represented by an edge from x to y labeled by m, and other edges m(x1, ..., xn) are represented as ellipses labeled by m, having an edge labeled i to each node xi. Whereas FCA is about finding closed sets of attributes, Graph-FCA is about finding closed graph patterns. A graph pattern is similar to a graph context with variables instead of objects as nodes, in order to generalize over particular objects. A key aspect of Graph-FCA is that closure does not apply directly to graph patterns but to "projected graph patterns" (PGPs), i.e. graph patterns with one or several distinguished nodes as focus. These focus nodes define a projection on the occurrences of the pattern, like a projection in relational algebra. For example, the PGP (x, y) ← parent(x, z), parent(y, z) defines a graph pattern with two edges, parent(x, z) and parent(y, z), and with focus on variables x, y. It means that, for every occurrence of the pattern in the context, the valuation of (x, y) is an occurrence

428

S. Ferré et al.

of the PGP. This can be used as a definition of the "sibling" relationship, i.e. the fact that x and y are siblings if they have a common parent z. PGPs are analogous to anonymous definitions of FOL predicates and to conjunctive SPARQL queries. They play the same role as sets of attributes in FCA, i.e. as concept intents. Set operations are extended from sets of attributes to PGPs. PGP inclusion ⊆q is based on graph homomorphisms (Hahn and Tardif 1997); it is similar to the notion of subsumption on queries (Chekol et al. 2012) or rules (Muggleton and De Raedt 1994). PGP intersection ∩q is defined as a form of graph alignment, where each pair of variables from the two patterns becomes a variable of the intersection pattern; it corresponds to the "categorical product" of graphs (see Hahn and Tardif 1997, p. 116). The Galois connection underlying the concept construction in Graph-FCA is defined between PGPs (Q, ⊆q) and sets of object tuples (2^(G*), ⊆). The connection from a PGP Q to the set of object tuples Q′ is analogous to query evaluation, and the connection from a set of object tuples R to the PGP R′ is analogous to relational learning (Muggleton and De Raedt 1994). In the definitions of Q′ and R′ below, the PGP (g, I) represents the description of an object tuple g by the whole incidence relation I seen from the relative position of g:

Q′ := {g ∈ G^n | Q ⊆q (g, I)}, for a PGP Q = (x1, ..., xn) ← P
R′ := ∩q {(g, I)}_{g∈R}, for R ⊆ G^n, for n ∈ ℕ

From there, concepts can be defined in the usual way and proved to be organized into lattices. A concept is a pair (Q, R) such that Q′ = R and R′ =q Q. The arity of the projected tuple of Q must be the same as the arity of the object tuples in R; it determines the arity of the concept. Unary concepts are about sets of objects, binary concepts are about relationships between objects, and so on. It can be noticed that the intent of a unary concept can combine attributes of different arities.
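The query-evaluation direction of the Galois connection (from a PGP Q to its extent Q′) can be sketched by brute-force matching of the pattern's variables against the objects. The family data and function name below are hypothetical, and a real implementation would use graph homomorphism search rather than enumerating all valuations.

```python
from itertools import product

# Sketch: evaluating a projected graph pattern (PGP) over a graph context.
# The context is a set of FOL-like atoms (attribute, object tuple); the PGP
# (x, y) <- parent(x, z), parent(y, z) defines the "sibling" relation.
# Hypothetical family data for illustration.
atoms = {("parent", ("alice", "carol")),   # carol is a parent of alice
         ("parent", ("bob", "carol")),
         ("parent", ("dan", "erin"))}
objects = {o for _, tup in atoms for o in tup}

def evaluate_pgp(focus, pattern, objects, atoms):
    """Return the extent Q': all valuations of the focus variables such that
    every pattern edge maps to an incidence hyper-edge of the context."""
    variables = sorted({v for _, args in pattern for v in args})
    extent = set()
    for values in product(objects, repeat=len(variables)):
        env = dict(zip(variables, values))
        if all((m, tuple(env[v] for v in args)) in atoms for m, args in pattern):
            extent.add(tuple(env[v] for v in focus))
    return extent

siblings = evaluate_pgp(("x", "y"),
                        [("parent", ("x", "z")), ("parent", ("y", "z"))],
                        objects, atoms)
print(siblings)
```

Note that pairs (x, x) also occur in the extent, since nothing in the pattern forces x and y to be distinct.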
Unlike RCA, there is a concept lattice for each concept arity rather than for each object type. Figure 7 displays a compact representation of the graph concepts about dishes, cereals, and countries. Each rectangle node x identifies a unary concept (e.g., Q2a) along with its extent (here, Pakistan, Thailand). The concept intent is the PGP x ← P, where P is the subgraph containing node x and all white nodes, which is called the "pattern core". By reading the graph, we learn that concept Q2a is the concept of "Asian countries, which eat a lot of some dish whose main cereal is produced in the country itself". Formally, its intent is denoted by c ← country(c), Asia(c), eatLotOf(c, d), dish(d), hasMainCereal(d, l), cereal(l), isProducedIn(l, c). Concepts Q2b and Q2c have the same graph pattern as Q2a but a different focus: on cereals for Q2b and on dishes for Q2c. N-ary concepts are obtained by picking several nodes as focus. For example, (Q2a, Q2c, Q2b) is a ternary concept whose instances are the object triples (Pakistan, biryani, basmatiRice) and (Thailand, khaoManKai, thaiRice). It represents the cyclic relation existing between country, dish, and cereal in Asian countries. Although the generalization ordering between concepts, hence the concept lattice, is not explicitly represented in Fig. 7, it can be recovered by looking for inclusion


Fig. 7 Compact representation of graph concepts about dishes, cereals, and countries, with minimum support 2

relationships between concept extents. For example, concepts Q2a, Q2b, and Q2c are respective specializations, hence sub-concepts, of concepts Q1a, Q1b, and Q1c. The latter are general concepts for countries, cereals, and dishes respectively. They show that every dish has a main cereal and every cereal is produced in some country, but that not all countries eat a lot of some dish, and not all dishes are eaten a lot in some country. The (grayed) concepts Q3i, Q3f(d), and Q3d(f) are other specializations of the Q1 concepts: Q3d(f) denotes the concept of European countries, Q3f(d) the concept of cereals produced in Europe, and Q3i the concept of dishes whose main cereal is produced in Europe. The bracketed letter "(f)" in the label of concept Q3d indicates that concept Q3f belongs to the intent of concept Q3d, in addition to the concepts of the pattern core; similarly, concept Q3d belongs to the intent of concept Q3f (label "Q3f(d)"). In contrast with Asian countries, no cycle is observed, because rice-based dishes are less popular in Europe. Those concepts are grayed because they do not belong to the pattern core, which is here a copy of a component of the graph context. Concepts in gray represent generalizations over other concepts in the pattern core. For instance, concept Q3i generalizes over concepts Q3g and Q3h (not visible in Fig. 7), which have only gardiane and arancini in their respective extents. Compared to RCA, Graph-FCA emphasizes relational patterns over the lattice structure. However, relational patterns can be extracted from RCA lattices, and concept lattices can be recovered from Graph-FCA relational patterns. Other differences concern the scaling operators and the meaning of cycles in concept intents: Graph-FCA does not support scaling operators, and a cycle in a Graph-FCA pattern corresponds to an actual cycle in the data, whereas this is not necessarily the case for cycles through RCA concept definitions.


5 Triadic Concept Analysis

Triadic Concept Analysis (TCA; Lehmann and Wille 1995) handles ternary relations between objects, attributes, and conditions. Such triadic data can be represented as a "triadic context", i.e. a kind of 3-dimensional table, or cube. Accordingly, "triadic concepts" are 3-dimensional and can be seen as maximal sets of objects related to a maximal set of attributes under a maximal set of conditions, i.e. a maximal "sub-cube" full of × in the triadic context (up to permutations of rows, columns, and layers).

Definition 1 (Triadic context) A triadic context is a quadruple K = (G, M, B, Y), where G, M, and B respectively denote the sets of objects, attributes, and conditions, and Y ⊆ G × M × B. The fact (g, m, b) ∈ Y is interpreted as the statement "object g has attribute m under condition b".

Example 1 An example of such a triadic context is given in Table 2, where the very first cross (to the left) denotes the fact "object g2 has attribute m1 under condition b1", i.e. (g2, m1, b1) ∈ Y. In this tabular representation, each table corresponds to the projection of the triadic context for one condition. Projections can be performed along any dimension, i.e. object, attribute, or condition.

Definition 2 (Triadic concept) A triadic concept (A1, A2, A3) of (G, M, B, Y) is a triple with A1 ⊆ G, A2 ⊆ M, and A3 ⊆ B satisfying the two following statements: (i) A1 × A2 × A3 ⊆ Y; (ii) for X1 × X2 × X3 ⊆ Y, A1 ⊆ X1, A2 ⊆ X2, and A3 ⊆ X3 implies (A1, A2, A3) = (X1, X2, X3) (maximality). If (G, M, B, Y) is represented as a three-dimensional table, (i) means that a concept stands for a rectangular parallelepiped full of ×, while (ii) characterizes component-wise maximality of concepts. A1 is the "extent", A2 the "intent", and A3 the "modus" of the triadic concept (A1, A2, A3).

Example 2 ({g3, g4}, {m2, m3}, {b1, b2, b3}) is a triadic concept of the triadic context shown in Table 2.
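Definition 2 can be checked mechanically. The sketch below uses an incidence Y restricted to the crosses made explicit in the text (the cells of the triadic concept of Example 2, plus (g2, m1, b1) from Example 1); the full Table 2 contains further crosses, and the function name is ours.

```python
from itertools import product

# Sketch: checking Definition 2 on a small triadic context. Y here contains
# only the crosses explicitly mentioned in the text; Table 2 has more.
G, M, B = {"g1", "g2", "g3", "g4"}, {"m1", "m2", "m3"}, {"b1", "b2", "b3"}
Y = {(g, m, b) for g in ("g3", "g4") for m in ("m2", "m3")
     for b in ("b1", "b2", "b3")}
Y.add(("g2", "m1", "b1"))

def is_triadic_concept(A1, A2, A3, Y):
    # (i) the box A1 x A2 x A3 is full of crosses
    if not all(t in Y for t in product(A1, A2, A3)):
        return False
    # (ii) maximality: no element can be added to any component
    for extra in G - A1:
        if all(t in Y for t in product(A1 | {extra}, A2, A3)):
            return False
    for extra in M - A2:
        if all(t in Y for t in product(A1, A2 | {extra}, A3)):
            return False
    for extra in B - A3:
        if all(t in Y for t in product(A1, A2, A3 | {extra})):
            return False
    return True

print(is_triadic_concept({"g3", "g4"}, {"m2", "m3"}, {"b1", "b2", "b3"}, Y))
```

With this restricted Y, the triple of Example 2 passes both conditions, while any smaller box such as ({g4}, {m2}, {b1}) fails maximality.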
Representing the triadic context as a cube, where each condition is a layer, one can observe that this triadic concept corresponds to a maximal rectangular parallelepiped full of × (modulo permutations of rows, columns, and layers). To describe the derivation operators, it is convenient to write a triadic context as (K1, K2, K3, Y).

Definition 3 (Outer derivation operators) For {i, j, k} = {1, 2, 3}, j < k, X ⊆ Ki and Z ⊆ Kj × Kk, the (i)-derivation operators are defined by:

Φ : X ↦ X^(i) := {(aj, ak) ∈ Kj × Kk | (ai, aj, ak) ∈ Y for all ai ∈ X}
Φ′ : Z ↦ Z^(i) := {ai ∈ Ki | (ai, aj, ak) ∈ Y for all (aj, ak) ∈ Z}


Table 2 A triadic context (G, M, B, Y) with the triadic concept ({g3, g4}, {m2, m3}, {b1, b2, b3}). The context is displayed as three cross-tables, one per condition b1, b2, b3, each relating the objects g1–g4 (rows) to the attributes m1–m3 (columns); in particular, g3 and g4 have both m2 and m3 under all three conditions, and g2 has m1 under b1.
These two derivation operators lead to three dyadic contexts:

K^(1) = (K1, K2 × K3, Y^(1))
K^(2) = (K2, K1 × K3, Y^(2))
K^(3) = (K3, K1 × K2, Y^(3))

where g Y^(1) (m, b) ⇐⇒ m Y^(2) (g, b) ⇐⇒ b Y^(3) (g, m).

Example 3 Consider i = 1, j = 2, and k = 3, i.e. K1 = G, K2 = M, and K3 = B. Given the set of objects X = {g4}, we obtain:

Φ(X) = {(m2, b1), (m3, b1), (m2, b2), (m3, b2), (m2, b3), (m3, b3)}
Φ′(Φ(X)) = {g3, g4}

Further derivation operators are defined as follows.

Definition 4 (Inner derivation operators) For {i, j, k} = {1, 2, 3}, Xi ⊆ Ki, Xj ⊆ Kj, and Ak ⊆ Kk, the (i, j, Ak)-derivation operators are defined by:

Ψ : Xi ↦ Xi^(i,j,Ak) := {aj ∈ Kj | (ai, aj, ak) ∈ Y for all (ai, ak) ∈ Xi × Ak}
Ψ′ : Xj ↦ Xj^(i,j,Ak) := {ai ∈ Ki | (ai, aj, ak) ∈ Y for all (aj, ak) ∈ Xj × Ak}

This definition yields the derivation operators of the dyadic contexts defined by:

K^ij_Ak = (Ki, Kj, Y^ij_Ak), where (ai, aj) ∈ Y^ij_Ak ⇐⇒ ai, aj, ak are related by Y for all ak ∈ Ak.

Example 4 Consider i = 1, j = 2, and k = 3, i.e. K1 = G, K2 = M, and K3 = B, with A3 = {b1, b2} and X = {g3}:

Ψ(X) = {m2, m3}
Ψ′(Ψ(X)) = {g3, g4}

Operators Φ and Φ′ are called "outer operators", and a composition of both is called an "outer closure". Operators Ψ and Ψ′ are called "inner operators", and a composition of them is called an "inner closure".
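The outer and inner operators can be sketched directly from Definitions 3 and 4 for i = 1, j = 2, k = 3. With the same partial incidence Y as before (only the crosses that the text makes explicit; the real Table 2 has more), the sketch reproduces Examples 3 and 4; the function names are ours.

```python
# Sketch of the outer (phi) and inner (psi) derivation operators of
# Definitions 3 and 4, for i = 1, j = 2, k = 3.
G, M, B = {"g1", "g2", "g3", "g4"}, {"m1", "m2", "m3"}, {"b1", "b2", "b3"}
Y = {(g, m, b) for g in ("g3", "g4") for m in ("m2", "m3")
     for b in ("b1", "b2", "b3")}
Y.add(("g2", "m1", "b1"))

def phi(X):          # outer: objects -> (attribute, condition) pairs
    return {(m, b) for m in M for b in B if all((g, m, b) in Y for g in X)}

def phi_prime(Z):    # outer: (attribute, condition) pairs -> objects
    return {g for g in G if all((g, m, b) in Y for (m, b) in Z)}

def psi(Xi, A3):     # inner: objects -> attributes shared under all of A3
    return {m for m in M if all((g, m, b) in Y for g in Xi for b in A3)}

def psi_prime(Xj, A3):  # inner: attributes -> objects
    return {g for g in G if all((g, m, b) in Y for m in Xj for b in A3)}

# Example 3: X = {g4}
print(sorted(phi({"g4"})))       # the six (m, b) pairs with m in {m2, m3}
print(phi_prime(phi({"g4"})))    # equals {"g3", "g4"}

# Example 4: X = {g3}, A3 = {b1, b2}
print(psi({"g3"}, {"b1", "b2"}))              # equals {"m2", "m3"}
print(psi_prime({"m2", "m3"}, {"b1", "b2"}))  # equals {"g3", "g4"}
```

The outer closure Φ′(Φ(·)) and the inner closure Ψ′(Ψ(·)) both recover the object set {g3, g4} of the triadic concept of Example 2.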


Definition 5 (Triadic concept formation) A concept having X1 in its extent can be constructed as follows:

(X1^(1,2,A3)(1,2,A3), X1^(1,2,A3), (X1^(1,2,A3)(1,2,A3) × X1^(1,2,A3))^(3))

Example 5 In the current example, ({g3, g4}, {m2, m3}, {b1, b2, b3}) is a triadic concept. From a computational point of view, the Trias algorithm was developed in Jäschke et al. (2006) for extracting frequent triadic concepts, i.e. concepts whose extent, intent, and modus cardinalities are higher than user-defined thresholds (see also Ji et al. 2006). Cerf et al., from the pattern mining community, presented a more efficient algorithm called Data-Peeler which is able to handle n-ary relations (Cerf et al. 2009), following the formal definitions given in terms of "Polyadic Concept Analysis" in Voutsadakis (2002). Some examples of triadic analysis capabilities are given in Kaytoue et al. (2014), with a formalization of biclustering, and in Ganter and Obiedkov (2004), illustrating the use of implications in triadic analysis. Moreover, a comparison of algorithms dealing with triadic analysis is provided in Ignatov et al. (2015). In contrast with the intuitive graphical representation of a concept lattice in FCA, the exploration of a triadic conceptual structure is not an easy task, especially because of the complexity of such a structure. In Rudolph et al. (2015), a new navigation paradigm is proposed for triadic conceptual structures, based on a neighborhood notion arising from associated dyadic concept lattices. This visualization capability helps an analyst to understand the construction and the content of a triadic structure.

6 Applications

FCA and its extensions are used in many application domains and for many different tasks. Some of these tasks were already mentioned above, and we propose below a quick tour of some representative applications to complete the picture. We cannot be exhaustive; surveys exist, such as Poelmans et al. (2013) and Kuznetsov and Poelmans (2013).

Information Retrieval
FCA has been used in a myriad of ways to support a wide variety of information retrieval (IR) techniques and applications (see Codocedo and Napoli 2015 for a survey). The concept lattice concisely represents the document and query space, which can be used as an index for automatic retrieval and as a navigation structure (Godin et al. 1993; Lindig 1995; Carpineto and Romano 1996; Ducrou et al. 2006). Eklund's team has developed algorithms to efficiently compute neighbor concepts, not only children concepts but also parent and sibling concepts, and has designed user-friendly multimedia user interfaces for various applications


such as email, pictures, or museum collections (Ducrou et al. 2006). However, the Boolean IR model, and consequently FCA, is considered too limited for modern IR requirements, such as large datasets and complex document representations. This is the motivation for introducing more complex models, in particular Logical Concept Analysis (LCA) and pattern structures, to reconcile search by navigation in a lattice with search by expressive querying. Logical Concept Analysis (Ferré and Ridoux 2000) was introduced with the purpose of reconciling navigation and querying, combined with rich object descriptions. It led to the paradigm of Logical Information Systems (LIS) (Ferré and Ridoux 2004), where the document and query space can be composed in a flexible way from various components called "logic functors" (Ferré and Ridoux 2001). There are logic functors for different datatypes (e.g., numbers, dates, strings), different structures (e.g., valued attributes, taxonomies, sequences), and different logical operators (e.g., Boolean, epistemic). LIS have been applied to collections of photos (Ferré 2009), biological sequences (Ferré and King 2004), geographical data (Bedel et al. 2008), complex linguistic data (Foret and Ferré 2010), as well as OLAP cubes (Ferré et al. 2012). LIS have even been implemented as a genuine UNIX file system (Padioleau and Ridoux 2003). In line with the relational extensions of FCA (RCA and Graph-FCA), LIS have been extended to semantic web knowledge graphs (Ferré 2010). The richness of the query space has since increased to cover most of the SPARQL 1.1 query language, including computations such as aggregations, while retaining navigation-based search (Ferré 2017).

Web of Data and Ontology Engineering
There are many connections between FCA, the web of data, and the semantic web. The idea of a "conceptual hierarchy" is fundamental to both Formal Concept Analysis and Description Logics (and thus the semantic web).
Although concept construction is quite different in the two formalisms, there have been several attempts to relate FCA-based and DL-based formalisms, and to combine both approaches (Sertkaya 2010). In Baader et al. (2007), an approach for extending both the terminological and the assertional parts of a Description Logic knowledge base is proposed, using information provided by the knowledge base and by a domain expert. Materializing this approach, the "OntoComP" system is a Protégé plugin that supports ontology engineers in completing OWL ontologies using conceptual exploration (Sertkaya 2009). Related approaches were also used for the discovery of axioms or definitions in ontologies (Baader and Distel 2008; Borchmann et al. 2016). In Kirchberg et al. (2012), the authors discuss how to build a "concept layer" above the web of data in an efficient way, allowing machine-processable web content. The emphasis should be on creating links in a way that both humans and machines can explore the web of data. Indeed, FCA can bring the level of concept abstraction that makes it possible to link semantically-related facts together as meaningful units. Following the same idea, a noticeable application of pattern structures aimed at RDF data completion is discussed in Alam et al. (2015). This shows the great potential of pattern structures to support complex document representations with numerical and heterogeneous indexes (Codocedo et al. 2014; Codocedo and Napoli 2014a).

434

S. Ferré et al.

We should also mention some applications of FCA and pattern structures in text mining. One of the very first approaches to the automatic acquisition of concept hierarchies from a text corpus based on FCA is detailed in Cimiano et al. (2005). Along the same lines, the authors of Bendaoud et al. (2008) combined FCA and RCA to take into account relations within texts and build more realistic DL-based concepts, where roles correspond to the extracted relations. Moreover, the learning of expressive ontology axioms from textual definitions with so-called "relational exploration" (Rudolph 2004) is proposed in Völker and Rudolph (2008). Relational exploration is based on attribute exploration, which is used to interactively clarify underspecified dependencies and increase the quality of the resulting ontological elements. More recently, a specific pattern structure for managing syntactic trees was introduced in Leeuwenberg et al. (2015). This work was aimed at analyzing and classifying relations, such as drug-drug interactions, in medical texts where sentences are represented as syntactic trees.

Biclustering and Recommendation
Biclustering aims at finding local patterns in numerical data tables. The motivation is to overcome a limitation of standard clustering techniques, where distance functions using all the attributes may be ineffective and hard to interpret. Applications are numerous in biology, recommendation, etc. (see references in Kaytoue et al. 2014; Codocedo and Napoli 2014b). In FCA, formal concepts are maximal rectangles of true values in a binary data table (modulo permutations of rows and columns). Accordingly, concepts are binary biclusters with interesting properties: maximality (through a closure operator), overlapping, and partial ordering of the local patterns. Such properties are key elements of a mathematical definition of numerical biclusters and of the design of their enumeration algorithms.
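The "maximal rectangle" reading of formal concepts can be made concrete with a deliberately naive enumeration based on the closure extent(intent(X)). The toy table below is hypothetical, and real FCA algorithms (e.g. NextClosure or Close-by-One) avoid this exponential scan.

```python
from itertools import combinations

# Naive sketch: enumerate formal concepts (maximal rectangles of true values)
# of a binary table by closing every subset of objects. Illustrative only.
table = {"o1": {"a", "b"}, "o2": {"b", "c"}, "o3": {"a", "b"}}
attributes = set().union(*table.values())

def intent(objs):   # attributes shared by all objects in objs
    return set.intersection(*(table[o] for o in objs)) if objs else set(attributes)

def extent(attrs):  # objects having all attributes in attrs
    return {o for o, row in table.items() if attrs <= row}

concepts = set()
for r in range(len(table) + 1):
    for subset in combinations(table, r):
        i = intent(set(subset))
        e = extent(i)            # closure: extent(intent(X)) is maximal
        concepts.add((frozenset(e), frozenset(i)))

for e, i in sorted(concepts, key=lambda c: len(c[0])):
    print(set(e), set(i))
```

On this toy table the enumeration yields four concepts, including the maximal rectangle ({o1, o3}, {a, b}); the partial order by extent inclusion is exactly the concept lattice.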
These links are highlighted for several types of biclusters with interval (Kaytoue et al. 2011a) and partition pattern structures (Codocedo and Napoli 2014b) and their representation contexts. Further investigations concern dimensionality: a bijection between n-clusters and (n+1)-concepts is proven in Kaytoue et al. (2014). Some work about recommendation using plain FCA (du Boucher-Ryan and Bridge 2006) and Boolean matrix factorization (Akhmatnurov and Ignatov 2015) should also be mentioned.

Database and Functional Dependencies
Characterizing and computing functional dependencies (FDs) is an important topic in database theory (see e.g. references in Baixeries et al. 2014). In FCA, Ganter and Wille proposed a first characterization of FDs as implications in a formal context (a binary relation) obtained after a transformation of the initial data (Ganter and Wille 1999). However, one has to create n² objects from the n initial tuples. To overcome this problem, a characterization of functional dependencies is proposed in terms of partition pattern structures, offering additional benefits for the computation of dependencies (Baixeries et al. 2014). This method can be naturally generalized to other types of FDs, such as multi-valued and similarity dependencies (Baixeries et al. 2013; Codocedo et al. 2016).
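The partition view of FDs underlying these characterizations can be sketched as follows: X → Y holds iff the partition of tuples induced by X coincides with the partition induced by X ∪ Y (i.e. each X-class agrees on Y). The toy relation and function names below are hypothetical.

```python
# Sketch of checking a functional dependency X -> Y via tuple partitions:
# X -> Y holds iff the partition of rows induced by X equals the partition
# induced by X ∪ Y. Hypothetical toy relation for illustration.
rows = [{"city": "Lyon",   "zip": "69000", "country": "FR"},
        {"city": "Lyon",   "zip": "69000", "country": "FR"},
        {"city": "Rennes", "zip": "35000", "country": "FR"}]

def partition(rows, attrs):
    """Group row indices by their values on attrs."""
    classes = {}
    for i, row in enumerate(rows):
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(i)
    return set(map(frozenset, classes.values()))

def holds(rows, X, Y):
    return partition(rows, X) == partition(rows, sorted(set(X) | set(Y)))

print(holds(rows, ["city"], ["zip"]))      # city determines zip here
print(holds(rows, ["country"], ["city"]))  # but country does not determine city
```

The partitions computed here play the same role as the partition pattern structures cited above, where dependencies are read off the lattice of tuple partitions.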


Moreover, an interactive and visual way to simultaneously discover FDs, conditional FDs (i.e., FDs valid under certain conditions), and association rules is proposed in Allard et al. (2010). It is based on navigating a lattice of OLAP cubes whose dimensions correspond to the premises of functional dependencies and association rules.

Software Engineering
A survey by Tilley et al. (2005) gathers and analyzes the main research work applying FCA to the field of Software Engineering (SE) before 2005. One of the oldest contributions is due to R. Godin and H. Mili in the context of Smalltalk class hierarchy reengineering (Godin and Mili 1993), where they introduced ideas that inspired many later approaches. Another main track was initiated by Lindig (1995), aiming to classify software components. In the nineties, object identification was also a major topic, due to the spread of object-orientation and the importance of migrating from procedural to object-oriented code, for which FCA approaches were proposed (Sahraoui et al. 1997; van Deursen and Kuipers 1999). FCA then continued to expand in software engineering, e.g. to facilitate fault localization in software (Cellier et al. 2008). The abilities of FCA for classifying components or web services have been investigated in Bruno et al. (2005), Aboud et al. (2009), Azmeh et al. (2011). Connections to software product line representations (feature models) are studied in Carbonnel et al. (2017a, b). Moreover, RCA has been applied in Model-Driven Engineering, for UML class model or use case diagram analysis and reengineering (Dao et al. 2004; Falleri et al. 2008; Guédi et al. 2013) and model transformation learning (Dolques et al. 2010), a topic for which an FCA-based graph mining model was also designed (Saada et al. 2014). Let us conclude with two papers showing the diversity of FCA-based applications in software engineering. In the first paper (Obiedkov et al.
2009), the authors study lattice-based access control models using conceptual exploration, in order to understand dependencies between different security categories. In the second paper (Priss 2011), the author examines the detection of security threat problems and the way FCA can help in exploring available related data.

Social Network Analysis
The authors of Freeman and White (1993) noted the visualization power of concept (Galois) lattices and their usefulness for interpretation. Indeed, a concept lattice yields a complete and ordered view of the data based on concept extents and intents, while previous models produced several separate views. Moreover, the view provided by the concept lattice suggests useful insights about the structural properties of the data. The book by Missaoui et al. (2017) contains several research papers on recent trends in applying FCA to SNA: acquisition of terminological knowledge from social networks, knowledge communities, individuality computation, FCA-based analysis of two-mode networks, community detection and description in one-mode and multi-mode networks, multimodal clustering, adaptation of the dual-projection approach to weighted bipartite graphs, and attributed graph analysis.

436

S. Ferré et al.

In particular, the paper by Roth (2017) shows how FCA allows the assessment and analysis of actors and their attributes on an equal basis. FCA can help solve typical key challenges of community detection in SNA, such as group hierarchy and overlapping, temporal evolution, and stability of networks. The paper by Borchmann and Hanika (2017) defines individuality and introduces a new measure in two-mode (affiliation) networks using FCA, by evaluating how many unique groups of users of size k can be uniquely defined by a combination of attributes. In Ignatov et al. (2017), the authors present FCA-based biclustering and its extensions to mining multi-mode communities in social network analysis. The author of Kriegel (2017) describes a technique for the acquisition of terminological knowledge from social networks. The chapter by Soldano et al. (2017) studies social and other complex networks as attributed graphs and addresses attribute pattern mining in such graphs through recent developments in FCA. Finally, in Valverde-Albacete and Peláez-Moreno (2017), the authors adapt a dual-projection approach to weighted two-mode networks using an extension of FCA for incidences with values in a special case of semiring.

Bioinformatics and Chemoinformatics
FCA has been used by different research groups as a tool for analyzing biological and chemical data, e.g. for structure-activity relationship studies, where the correlation between chemical structure and biological properties is explored (Bartel and Bruggemann 1998; Blinova et al. 2003; Kuznetsov and Samokhin 2005; Métivier et al. 2015; Quintero and Restrepo 2017). In Gardiner and Gillet (2015), the authors describe four data mining techniques, namely Rough Set Theory (RST), Association Rule Mining (ARM), Emerging Pattern mining (EP), and Formal Concept Analysis (FCA), and give a list of their chemoinformatics applications. In bioinformatics, FCA was also applied to gene expression analysis in Kaytoue et al.
(2011c) and to the analysis of metabolomic data in Grissa et al. (2016). In the latter, FCA is combined with numerical classifiers for selecting and visualizing discriminant and predictive features in nutrition data. In molecular biology, the exploration of potentially interesting substructures within molecular graphs requires proper abstraction and visualization methods. In Bourneuf and Nicolas (2017), the so-called "power graph analysis", based on classes of nodes having similar properties and classes of edges linking node classes, is stated in terms of FCA; the problem is then solved in an alternative way thanks to "answer set programming". Enzymes are macro-molecules, i.e. linear sequences of molecules, whose activity is basic in any biochemical reaction. Predicting the functional activity of an enzyme from its sequence is a crucial task that can be approached by comparing new target sequences with already known source enzymes. In Coste et al. (2014), the authors study this problem in the framework of FCA and define the task as an optimization problem on the set of concepts. Finally, median networks, which generalize the notion of trees in phylogenetic analysis, are investigated within FCA in Priss (2013). In standard situations, concept lattices may represent the same information as median networks, but the FCA

Formal Concept Analysis: From Knowledge Discovery to Knowledge Processing

437

algorithmic machinery is powerful and operational, and offers efficient algorithms that can be used for evolutionary analysis. Moreover, it should be noticed that median graphs are also related to distributive lattices.
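To make the FCA machinery behind these applications concrete, the following sketch (our own illustrative code, not taken from any of the cited works) enumerates all formal concepts of a toy two-mode context, users crossed with interests, by brute-force closure of attribute sets. The two derivation operators (extent and intent) are the basic operations underlying the community-detection and biclustering uses discussed above; the data and names are invented for illustration.

```python
from itertools import combinations

# A small formal context: objects (users) x attributes (binary incidence).
context = {
    "user1": {"sports", "music"},
    "user2": {"sports", "movies"},
    "user3": {"sports", "music", "movies"},
    "user4": {"music"},
}
attributes = set().union(*context.values())

def extent(attrs):
    """Objects possessing every attribute in attrs (derivation A')."""
    return {o for o, a in context.items() if attrs <= a}

def intent(objs):
    """Attributes shared by all objects in objs (derivation B')."""
    if not objs:
        return set(attributes)
    return set.intersection(*(context[o] for o in objs))

def concepts():
    """Enumerate all formal concepts (extent, intent) by brute force.

    Concepts are in bijection with closed extents, so deduplicating
    on the extent is enough."""
    seen, result = set(), []
    for r in range(len(attributes) + 1):
        for attrs in combinations(sorted(attributes), r):
            e = extent(set(attrs))
            i = intent(e)  # closure of the chosen attribute set
            if frozenset(e) not in seen:
                seen.add(frozenset(e))
                result.append((e, i))
    return result

for e, i in sorted(concepts(), key=lambda c: len(c[0])):
    print(sorted(e), sorted(i))
```

On this context the sketch finds six concepts, e.g. ({user2, user3}, {sports, movies}), which reads directly as a maximal community sharing a maximal set of interests, i.e. a maximal bicluster. Real FCA toolkits use far more efficient algorithms (NextClosure, CbO) than this exponential enumeration.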

7 Conclusion

FCA is at the heart of knowledge discovery, knowledge representation and reasoning, data analysis and classification. As pointed out in Wille (2002), formal concepts and concept lattices provide a mathematization of real-world concept hierarchies and support reasoning and various other complex tasks, especially through the graphical representation of concept lattices. In such a framework, knowledge discovery can be considered as a whole process, from data to knowledge units, or, more practically, as pattern discovery associated with knowledge creation. Such a process is guided by the design of concepts and concept lattices, as well as by the representation of concepts within a knowledge representation formalism such as description logics. In particular, relational models in FCA and triadic concept analysis are of main importance for representing relations between concepts. Given the wide range of applications in which FCA is involved, it is clear that FCA is now recognized as a formalism for data and knowledge processing that can be genuinely useful for AI practitioners. More than that, we strongly believe that FCA will play a leading role in data and knowledge sciences in the forthcoming years.

Acknowledgements The work of Sergei O. Kuznetsov was supported by the Russian Science Foundation under grant 17-11-01294 and performed at National Research University Higher School of Economics, Russia.

References

Aboud NA, Arévalo G, Falleri J, Huchard M, Tibermacine C, Urtado C, Vauttier S (2009) Automated architectural component classification using concept lattices. In: Joint working IEEE/IFIP conference on software architecture (WICSA/ECSA), pp 21–30 Akhmatnurov M, Ignatov DI (2015) Context-aware recommender system based on Boolean matrix factorisation. In: Proceedings of the twelfth international conference on concept lattices and their applications (CLA), CEUR workshop proceedings, vol 1466, pp 99–110 Alam M, Buzmakov A, Codocedo V, Napoli A (2015) Mining definitions from RDF annotations using formal concept analysis. In: Proceedings of the twenty-fourth international joint conference on artificial intelligence (IJCAI). AAAI Press, pp 823–829 Alam M, Le TNN, Napoli A (2016) LatViz: a new practical tool for performing interactive exploration over concept lattices. In: Proceedings of the thirteenth international conference on concept lattices and their applications (CLA), CEUR workshop proceedings, vol 1624, pp 9–20 Allard P, Ferré S, Ridoux O (2010) Discovering functional dependencies and association rules by navigating in a lattice of OLAP views. In: Proceedings of the 7th international conference on concept lattices and their applications (CLA), CEUR workshop proceedings, vol 672, pp 199–210

438

S. Ferré et al.

Azmeh Z, Driss M, Hamoui F, Huchard M, Moha N, Tibermacine C (2011) Selection of composable web services driven by user requirements. In: IEEE international conference on web services (ICWS), pp 395–402 Baader F, Distel F (2008) A finite basis for the set of EL-implications holding in a finite model, vol 4933. Springer, Berlin, pp 46–61 Baader F, Calvanese D, McGuinness D, Nardi D, Patel-Schneider P (eds) (2003) The description logic handbook. Cambridge University Press, Cambridge Baader F, Ganter B, Sertkaya B, Sattler U (2007) Completing description logic knowledge bases using formal concept analysis. In: Proceedings of the 20th international joint conference on artificial intelligence (IJCAI), pp 230–235 Babin MA, Kuznetsov SO (2012) Approximating concept stability. In: Proceedings of the 10th international conference on formal concept analysis (ICFCA). LNCS, vol 7228. Springer, pp 7–15 Babin MA, Kuznetsov SO (2013) Computing premises of a minimal cover of functional dependencies is intractable. Discret Appl Math 161(6):742–749 Baixeries J, Kaytoue M, Napoli A (2013) Computing similarity dependencies with pattern structures. In: Proceedings of the 10th international conference on concept lattices and their applications (CLA), CEUR workshop proceedings, vol 1062, pp 33–44 Baixeries J, Kaytoue M, Napoli A (2014) Characterizing functional dependencies in formal concept analysis with pattern structures. Ann Math Artif Intell 72(1–2):129–149 Barbut M, Monjardet B (1970) Ordre et classification: algèbre et combinatoire. Hachette, Paris Bartel H-G, Bruggemann R (1998) Application of formal concept analysis to structure-activity relationships. Fresenius J Anal Chem 361:23–38 Bastide Y, Taouil R, Pasquier N, Stumme G, Lakhal L (2000) Mining frequent patterns with counting inference. SIGKDD Explor Newsl 2(2):66–75 Becker P (2004) Numerical analysis in conceptual systems with ToscanaJ.
In: Proceedings of the second international conference on formal concept analysis (ICFCA). LNCS, vol 2961. Springer, pp 96–103 Bedel O, Ferré S, Ridoux O (2008) Handling spatial relations in logical concept analysis to explore geographical data. In: Proceedings of the 6th international conference on formal concept analysis (ICFCA). LNCS, vol 4933. Springer, pp 241–257 Belfodil A, Kuznetsov SO, Robardet C, Kaytoue M (2017) Mining convex polygon patterns with formal concept analysis. In: Proceedings of the twenty-sixth international joint conference on artificial intelligence (IJCAI), pp 1425–1432 Belohlávek R (2004) Concept lattices and order in fuzzy logic. Ann Pure Appl Log 128(1–3):277–298 Belohlávek R (2008) Relational data, formal concept analysis, and graded attributes. Handbook of research on fuzzy information processing in databases. IGI Global, pp 462–489 Belohlávek R (2011) What is a fuzzy concept lattice? (II). In: Proceedings of the 13th international conference on rough sets, fuzzy sets, data mining and granular computing (RSFDGrC). LNCS, vol 6743. Springer, pp 19–26 Belohlávek R, Vychodil V (2005) What is a fuzzy concept lattice? In: Proceedings of the twelfth international conference on concept lattices and their applications (CLA), pp 34–45 Bendaoud R, Napoli A, Toussaint Y (2008) Formal concept analysis: a unified framework for building and refining ontologies. In: Proceedings of the 16th international conference on knowledge engineering and knowledge management (EKAW), pp 156–171 Berry A, Gutierrez A, Huchard M, Napoli A, Sigayret A (2014) Hermes: a simple and efficient algorithm for building the AOC-poset of a binary relation. Ann Math Artif Intell 72(1–2):45–71 Bertet K, Monjardet B (2010) The multiple facets of the canonical direct unit implicational basis. Theor Comput Sci 411(22–24):2155–2166 Blinova VG, Dobrynin DA, Finn VK, Kuznetsov SO, Pankratova ES (2003) Toxicology analysis by means of the JSM-method. Bioinformatics 19(10):1201–1207

Formal Concept Analysis: From Knowledge Discovery to Knowledge Processing

439

Borchmann D, Hanika T (2017) Individuality in social networks. Formal concept analysis of social networks. Springer, Berlin, pp 19–40 Borchmann D, Distel F, Kriegel F (2016) Axiomatisation of general concept inclusions from finite interpretations. J Appl Non-Class Log 26(1):1–46 Bosc G, Boulicaut J-F, Raïssi C, Kaytoue M (2017) Anytime discovery of a diverse set of patterns with Monte Carlo tree search. Data mining and knowledge discovery (In press) Bourneuf L, Nicolas, J (2017) FCA in a logical programming setting for visualization-oriented graph compression. In: Proceedings of the 14th international conference on formal concept analysis (ICFCA). LNCS, vol 10308, pp 89–105 Bruno M, Canfora G, Penta MD, Scognamiglio R (2005) An approach to support web service classification and annotation. In: IEEE international conference on e-technology, e-commerce, and e-services (EEE), pp 138–143 Buzmakov A, Kuznetsov SO, Napoli A (2014) Scalable estimates of concept stability, vol 8478. Springer, Berlin, pp 157–172 Buzmakov A, Kuznetsov SO, Napoli A (2015) Revisiting pattern structure projections, vol 9113. Springer, Berlin, pp 200–215 Buzmakov A, Egho E, Jay N, Kuznetsov SO, Napoli A, Raïssi C (2016) On mining complex sequential data by means of FCA and pattern structures. Int J Gen Syst 45(2):135–159 Buzmakov A, Kuznetsov SO, Napoli A (2017) Efficient mining of subsample-stable graph patterns. In: IEEE international conference on data mining (ICDM), pp 757–762 Cabrera IP, Cordero P, Ojeda-Aciego M (2017) Galois connections in computational intelligence: a short survey. In: IEEE symposium series on computational intelligence (SSCI), IEEE, pp 1–7 Carbonnel J, Huchard M, Miralles A, Nebut C (2017a) Feature model composition assisted by formal concept analysis. 
In: Proceedings of the 12th international conference on evaluation of novel approaches to software engineering (ENASE), pp 27–37 Carbonnel J, Huchard M, Nebut C (2017b) Analyzing variability in product families through canonical feature diagrams. In: The 29th international conference on software engineering and knowledge engineering (SEKE), pp 185–190 Carpineto C, Romano G (1996) A lattice conceptual clustering system and its application to browsing retrieval. Mach Learn 24(2):95–122 Carpineto C, Romano G (2004) Concept data analysis: theory and applications. Wiley, Chichester (UK) Cellier P, Ducassé M, Ferré S, Ridoux O (2008) Formal concept analysis enhances fault localization in software. In: Proceedings of the 6th international conference on formal concept analysis (ICFCA). LNCS, vol 4933. Springer, pp 273–288 Cerf L, Besson J, Robardet C, Boulicaut J-F (2009) Closed patterns meet n-ary relations. ACM Trans Knowl Discov Data 3(1):1–36 Chaudron L, Maille N (2000) Generalized formal concept analysis. In: Proceedings of the 8th international conference on conceptual structures (ICCS). LNCS, vol 1867. Springer, pp 357–370 Chekol MW, Euzenat J, Genevès P, Layaïda N (2012) SPARQL query containment under RDFS entailment regime. In: Proceedings of the 6th international joint conference on automated reasoning (IJCAR), vol 7364. Springer, pp 134–148 Cimiano P, Hotho A, Staab S (2005) Learning concept hierarchies from text corpora using formal concept analysis. J Artif Intell Res 24:305–339 Codocedo V, Napoli A (2014a) A proposition for combining pattern structures and relational concept analysis. In: Proceedings of the 12th international conference on formal concept analysis (ICFCA), vol 8478. Springer, pp 96–111 Codocedo V, Napoli A (2014b) Lattice-based biclustering using partition pattern structures. In: 21st European conference on artificial intelligence (ECAI).
IOS Press, pp 213–218 Codocedo V, Napoli A (2015) Formal concept analysis and information retrieval: a survey. In: Proceedings of the 13th international conference on formal concept analysis (ICFCA). LNCS, vol 9113. Springer, pp 61–77


Codocedo V, Lykourentzou I, Napoli A (2014) A semantic approach to concept lattice-based information retrieval. Ann Math Artif Intell 72(1–2):169–195 Codocedo V, Baixeries J, Kaytoue M, Napoli A (2016) Characterization of order-like dependencies with formal concept analysis. In: Proceedings of the thirteenth international conference on concept lattices and their applications (CLA), CEUR workshop proceedings, vol 1624, pp 123–134 Coste F, Garet G, Groisillier A, Nicolas J, Tonon T (2014) Automated enzyme classification by formal concept analysis. In: Proceedings of the 12th international conference on formal concept analysis (ICFCA). LNCS, vol 8478. Springer, pp 235–250 Dao M, Huchard M, Hacene MR, Roume C, Valtchev P (2004) Improving generalization level in UML models iterative cross generalization in practice. In: Proceedings of the 12th international conference on conceptual structures (ICCS), pp 346–360 Davey BA, Priestley HA (1990) Introduction to lattices and order. Cambridge University Press, Cambridge, UK Distel F, Sertkaya B (2011) On the complexity of enumerating pseudo-intents. Discret Appl Math 159(6):450–466 Dolques X, Huchard M, Nebut C, Reitz P (2010) Learning transformation rules from transformation examples: an approach based on relational concept analysis. In: Workshop proceedings of the 14th IEEE international enterprise distributed object computing conference (EDOCW), pp 27–32 Dolques X, Le Ber F, Huchard M, Grac C (2016) Performance-friendly rule extraction in large water data-sets with AOC posets and relational concept analysis. Int J Gen Syst 45(2):187–210 du Boucher-Ryan P, Bridge DG (2006) Collaborative recommending using formal concept analysis. Knowl-Based Syst 19(5):309–315 Ducrou J, Vormbrock B, Eklund PW (2006) FCA-based browsing and searching of a collection of images. In: Proceedings of the 14th international conference on conceptual structures (ICCS). LNCS, vol 4068. 
Springer, pp 203–214 Dzeroski S, Lavrac N (eds) (2001) Relational data mining. Springer, Berlin Eklund PW, Groh B, Stumme G, Wille R (2000) A contextual-logic extension of TOSCANA. In: Proceedings of the 8th international conference on conceptual structures (ICCS), LNCS 1867. Springer, pp 453–467 Eklund PW, Ducrou J, Brawn P (2004) Concept lattices for information visualization: can novices read line-diagrams? In: Proceedings of the second international conference on formal concept analysis (ICFCA). LNCS, vol 2961. Springer, pp 57–73 Falk I, Gardent C (2014) Combining formal concept analysis and translation to assign frames and semantic role sets to French verbs. Ann Math Artif Intell 70(1–2):123–150 Falleri J, Huchard M, Nebut C (2008) A generic approach for class model normalization. In: 23rd IEEE/ACM international conference on automated software engineering (ASE), pp 431–434 Ferré S (2009) Camelis: a logical information system to organize and browse a collection of documents. Int J Gen Syst 38(4):379–403 Ferré S (2010) Conceptual navigation in RDF graphs with SPARQL-like queries. In: Proceedings of the 8th international conference on formal concept analysis (ICFCA), vol 5986. Springer, pp 193–208 Ferré S (2015) A proposal for extending formal concept analysis to knowledge graphs. In: Proceedings of the 13th international conference on formal concept analysis (ICFCA), vol 9113. Springer, pp 271–286 Ferré S (2017) Sparklis: an expressive query builder for SPARQL endpoints with guidance in natural language. Semant Web: Interoperability Usability Appl 8(3):405–418 Ferré S, Cellier P (2016) Graph-FCA in practice. In: Proceedings of the 22nd international conference on conceptual structures (ICCS). LNCS, vol 9717. Springer, pp 107–121 Ferré S, King RD (2004) BLID: an application of logical information systems to bioinformatics. In: Proceedings of the 2nd international conference on formal concept analysis (ICFCA), vol 2961. Springer, pp 47–54


Ferré S, Ridoux O (2000) A logical generalization of formal concept analysis. In: Proceedings of the 10th international conference on conceptual structures (ICCS). LNCS, vol 1867. Springer, pp 371–384 Ferré S, Ridoux O (2001) A framework for developing embeddable customized logics. Selected papers of the 11th international workshop on logic based program synthesis and transformation (LOPSTR). LNCS, vol 2372. Springer, pp 191–215 Ferré S, Ridoux O (2004) An introduction to logical information systems. Inf Process Manag 40(3):383–419 Ferré S, Allard P, Ridoux O (2012) Cubes of concepts: multi-dimensional exploration of multivalued contexts. In: Proceedings of the 10th international conference on formal concept analysis (ICFCA), vol 7228. Springer, pp 112–127 Foret A, Ferré S (2010) On categorial grammars as logical information systems. In: Proceedings of the 8th international conference on formal concept analysis (ICFCA). LNCS, vol 5986. Springer, pp 225–240 Freeman LC, White DR (1993) Using Galois lattices to represent network data. Sociol Methodol 23:127–146 Ganter B (2010) Two basic algorithms in concept analysis. In: Proceedings of the 8th international conference on formal concept analysis (ICFCA). Lecture notes in computer science, vol 5986. Springer, pp 312–340 Ganter B, Kuznetsov SO (2001) Pattern structures and their projections. In: International conference on conceptual structures (ICCS). LNCS 2120, pp 129–142 Ganter B, Obiedkov SA (2004) Implications in triadic formal contexts. In: Proceedings of the 12th international conference on conceptual structures (ICCS). Lecture notes in computer science, vol 3127. Springer, pp 186–195 Ganter B, Obiedkov SA (2016) Conceptual exploration. Springer, Berlin Ganter B, Wille R (1999) Formal concept analysis. Springer, Berlin Ganter B, Stumme G, Wille R (eds) (2005) Formal concept analysis, foundations and applications. LNCS, vol 3626.
Springer García-Pardo F, Cabrera IP, Cordero P, Ojeda-Aciego M (2013) On Galois connections and soft computing. In: Proceedings of the 12th international work-conference on artificial neural networks (IWANN). LNCS, vol 7903. Springer, pp 224–235 Gardiner EJ, Gillet VJ (2015) Perspectives on knowledge discovery algorithms recently introduced in chemoinformatics: rough set theory, association rule mining, emerging patterns, and formal concept analysis. J Chem Inf Model 55(9):1781–1803 Godin R, Mili H (1993) Building and maintaining analysis-level class hierarchies using Galois lattices. In: Proceedings of the eighth conference on object-oriented programming systems, languages, and applications (OOPSLA), pp 394–410 Godin R, Missaoui R, April A (1993) Experimental comparison of navigation in a Galois lattice with conventional information retrieval methods. Int J Man-Mach Stud 38(5):747–767 Grissa D, Comte B, Pujos-Guillot E, Napoli A (2016) A hybrid knowledge discovery approach for mining predictive biomarkers in metabolomic data. In: Proceedings of the European conference on machine learning and knowledge discovery in databases (ECML-PKDD). LNCS, vol 9851. Springer, pp 572–587 Guédi AO, Miralles A, Huchard M, Nebut C (2013) A practical application of relational concept analysis to class model factorization: lessons learned from a thematic information system. In: Proceedings of the tenth international conference on concept lattices and their applications (CLA), pp 9–20 Guigues J-L, Duquenne V (1986) Familles minimales d'implications informatives résultant d'un tableau de données binaires. Mathématiques, Informatique et Sciences Humaines 95:5–18 Hahn G, Tardif C (1997) Graph homomorphisms: structure and symmetry. Graph symmetry. Springer, Berlin, pp 107–166 Hitzler P, Krötzsch M, Rudolph S (2009) Foundations of semantic web technologies. Chapman & Hall/CRC


Huchard M, Rouane-Hacene M, Roume C, Valtchev P (2007) Relational concept discovery in structured datasets. Ann Math Artif Intell 49(1–4):39–76 Ignatov DI, Gnatyshak DV, Kuznetsov SO, Mirkin BG (2015) Triadic formal concept analysis and triclustering: searching for optimal patterns. Mach Learn 101(1–3):271–302 Ignatov DI, Semenov A, Komissarova D, Gnatyshak DV (2017) Multimodal clustering for community detection. Formal Concept analysis of social networks. Springer, Berlin, pp 59–96 Jäschke R, Hotho A, Schmitz C, Ganter B, Stumme G (2006) TRIAS - an algorithm for mining iceberg tri-lattices. In: Proceedings of the 6th IEEE International conference on data mining (ICDM), pp 907–911 Ji L, Tan K-L, Tung AKH (2006) Mining frequent closed cubes in 3D datasets. In: Proceedings of the 32nd international conference on very large data bases (VLDB), pp 811–822. ACM Kaytoue M, Assaghir Z, Napoli A, Kuznetsov SO (2010) Embedding tolerance relations in FCA: an application in information fusion. In: Proceedings of the 19th ACM conference on information and knowledge management (CIKM), pp 1689–1692. ACM Kaytoue M, Kuznetsov SO, Napoli A (2011a) Biclustering numerical data in formal concept analysis. In: Proceedings of the 9th international conference on formal concept analysis (ICFCA). LNCS, vol 6628. Springer, pp 135–150 Kaytoue M, Kuznetsov SO, Napoli A (2011b) Revisiting numerical pattern mining with formal concept analysis. In: Proceedings of the 22nd international joint conference on artificial intelligence (IJCAI), pp 1342–1347. IJCAI/AAAI Kaytoue M, Kuznetsov SO, Napoli A, Duplessis S (2011c) Mining gene expression data with pattern structures in formal concept analysis. Inf Sci 181(10):1989–2001 Kaytoue M, Kuznetsov SO, Macko J, Napoli A (2014) Biclustering meets triadic concept analysis. Ann Math Artif Intell 70(1–2):55–79 Kaytoue M, Plantevit M, Zimmermann A, Bendimerad AA, Robardet C (2017) Exceptional contextual subgraph mining. 
Mach Learn 106(8):1171–1211 Kirchberg M, Leonardi E, Tan YS, Link S, Ko RKL, Lee B (2012) Formal concept discovery in semantic web data. In: Proceedings of the 10th international conference on formal concept analysis (ICFCA). LNCS, vol 7278. Springer, pp 164–179 Kötters J (2013) Concept lattices of a relational structure. In: Proceedings of the 20th international conference on conceptual structures (ICCS). LNCS, vol 7735. Springer, pp 301–310 Kourie DG, Obiedkov SA, Watson BW, van der Merwe D (2009) An incremental algorithm to construct a lattice of set intersections. Sci Comput Program 74(3):128–142 Kriegel F (2017) Acquisition of terminological knowledge from social networks in description logic. Formal concept analysis of social networks. Springer, Berlin, pp 97–142 Kuznetsov SO (1999) Learning of simple conceptual graphs from positive and negative examples. In: Proceedings of the third european conference on principles of data mining and knowledge discovery (PKDD). LNCS, vol 1704. Springer, pp 384–391 Kuznetsov SO (2001) Machine learning on the basis of formal concept analysis. Autom Remote Control 62(10):1543–1564 Kuznetsov SO (2004) On the intractability of computing the Duquenne-Guigues base. J Univers Comput Sci 10(8):927–933 Kuznetsov SO (2007) On stability of a formal concept. Ann Math Artif Intell 49(1–4):101–115 Kuznetsov SO (2009) Pattern structures for analyzing complex data. In: Proceedings of the 12th international conference on rough sets, fuzzy sets, data mining and granular computing (RSFDGrC), vol 5908. Springer, pp 33–44 Kuznetsov SO (2013) Fitting pattern structures to knowledge discovery in big data. In: Proceedings of the 11th international conference on formal concept analysis (ICFCA), vol 7880. Springer, pp 254–266 Kuznetsov S, Obiedkov S (2002) Comparing performance of algorithms for generating concept lattices. J Exp Theor Artif Intell 14(2/3):189–216 Kuznetsov SO, Makhalova TP (2018) On interestingness measures of formal concepts. 
Inf Sci 442–443:202–219


Kuznetsov SO, Poelmans J (2013) Knowledge representation and processing with formal concept analysis. Data Min Knowl Discov (Wiley) 3(3):200–215 Kuznetsov SO, Samokhin MV (2005) Learning closed sets of labeled graphs for chemical applications. In: Proceedings of 15th international conference on inductive logic programming (ILP). LNCS, vol 3625. Springer, pp 190–208 Kuznetsov SO, Obiedkov SA, Roth C (2007) Reducing the representation complexity of lattice-based taxonomies. In: Proceedings of the 15th international conference on conceptual structures (ICCS). LNCS, vol 4604. Springer, pp 241–254 Leeuwenberg A, Buzmakov A, Toussaint Y, Napoli A (2015) Exploring pattern structures of syntactic trees for relation extraction, vol 9113. Springer, pp 153–168 Lehmann F, Wille R (1995) A triadic approach to formal concept analysis, vol 954. Springer, Berlin, pp 32–43 Lindig C (1995) Concept-based component retrieval. In: IJCAI-95 workshop: formal approaches to the reuse of plans, proofs, and programs, pp 21–25 Liquiere M, Sallantin J (1998) Structural machine learning with Galois lattice and graphs. In: Proceedings of the fifteenth international conference on machine learning (ICML), pp 305–313 Métivier J, Lepailleur A, Buzmakov A, Poezevara G, Crémilleux B, Kuznetsov SO, Goff JL, Napoli A, Bureau R, Cuissart B (2015) Discovering structural alerts for mutagenicity using stable emerging molecular patterns. J Chem Inf Model 55(5):925–940 Mimouni N, Nazarenko A, Salotti S (2015) A conceptual approach for relational IR: application to legal collections. In: Proceedings of the 13th international conference on formal concept analysis (ICFCA), pp 303–318 Missaoui R, Kuznetsov SO, Obiedkov SA (eds) (2017) Formal concept analysis of social networks. Springer, Berlin Muggleton S, De Raedt L (1994) Inductive logic programming: theory and methods.
J Log Program 19–20:629–679 Nica C, Braud A, Le Ber F (2017) Hierarchies of weighted closed partially-ordered patterns for enhancing sequential data analysis. In: Proceedings of the 14th international conference on formal concept analysis (ICFCA), pp 138–154 Obiedkov SA, Duquenne V (2007) Attribute-incremental construction of the canonical implication basis. Ann Math Artif Intell 49(1–4):77–99 Obiedkov SA, Kourie DG, Eloff JHP (2009) Building access control models with attribute exploration. Comput Secur 28(1–2):2–7 Outrata J, Vychodil V (2012) Fast algorithm for computing fixpoints of Galois connections induced by object-attribute relational data. Inf Sci 185(1):114–127 Padioleau Y, Ridoux O (2003) A logic file system. In: USENIX annual technical conference, pp 99–112 Poelmans J, Ignatov DI, Kuznetsov SO, Dedene G (2013) Formal concept analysis in knowledge processing: a survey on applications. Expert Syst Appl 40(16):6538–6560 Priss U (2011) Unix systems monitoring with FCA. In: Proceedings of the 19th international conference on conceptual structures (ICCS). LNCS, vol 6828. Springer, pp 243–256 Priss U (2013) Representing median networks with concept lattices. In: Proceedings of the 20th international conference on conceptual structures (ICCS). LNCS, vol 7735. Springer, pp 311–321 Quintero NY, Restrepo G (2017) Formal concept analysis applications in chemistry: from radionuclides and molecular structure to toxicity and diagnosis. Partial order concepts in applied sciences. Springer, pp 207–217 Roth C (2017) Knowledge communities and socio-cognitive taxonomies. Formal concept analysis of social networks. Springer, Berlin, pp 1–18 Rouane-Hacene M, Huchard M, Napoli A, Valtchev P (2007) A proposal for combining formal concept analysis and description logics for mining relational data. In: Proceedings of the 5th international conference on formal concept analysis (ICFCA 2007). LNAI, vol 4390. Springer, pp 51–65


Rouane-Hacene M, Huchard M, Napoli A, Valtchev P (2010) Using formal concept analysis for discovering knowledge patterns. In: Proceedings of the 7th international conference on concept lattices and their applications (CLA), pp 223–234 Rouane-Hacene M, Huchard M, Napoli A, Valtchev P (2013) Relational concept analysis: mining concept lattices from multi-relational data. Ann Math Artif Intell 67(1):81–108 Rudolph S (2004) Exploring relational structures via FLE. In: Proceedings of the 12th international conference on conceptual structures (ICCS), LNCS, vol 3127. Springer, pp 196–212 Rudolph S, Sacarea C, Troanca D (2015) Towards a navigation paradigm for triadic concepts. In: Proceedings of the 13th international conference on formal concept analysis (ICFCA), LNCS, vol 9113. Springer, pp 252–267 Saada H, Huchard M, Liquiere M, Nebut C (2014) Learning model transformation patterns using graph generalization. In: Proceedings of the eleventh international conference on concept lattices and their applications (CLA), pp 11–22 Sahraoui HA, Melo WL, Lounis H, Dumont F (1997) Applying concept formation methods to object identification in procedural code. In: International conference on automated software engineering (ASE), pp 210–218 Sertkaya B (2009) OntoComP: a Protégé plugin for completing OWL ontologies. In: Proceedings of the 6th European semantic web conference (ESWC). LNCS, vol 5554. Springer, pp 898–902 Sertkaya B (2010) A survey on how description logic ontologies benefit from FCA. In: Proceedings of the 7th international conference on concept lattices and their applications (CLA), CEUR workshop proceedings, vol 672, pp 2–21 Soldano H, Santini G, Bouthinon D (2017) Formal concept analysis of attributed networks. Formal concept analysis of social networks. Springer, Berlin, pp 143–170 Sowa J (1984) Conceptual structures: information processing in mind and machine. Addison-Wesley, Reading, US Staab S, Studer R (eds) (2009) Handbook on ontologies, 2nd edn.
Springer, Berlin Stahl T, Voelter M, Czarnecki K (2006) Model-driven software development: technology, engineering, management. Wiley, New York Stumme G, Taouil R, Bastide Y, Pasquier N, Lakhal L (2002) Computing iceberg concept lattices with Titanic. Data Knowl Eng 42(2):189–222 Szathmary L, Valtchev P, Napoli A, Godin R, Boc A, Makarenkov V (2014) A fast compound algorithm for mining generators, closed itemsets, and computing links between equivalence classes. Ann Math Artif Intell 70(1–2):81–105 Tatti N, Moerchen F, Calders T (2014) Finding robust itemsets under subsampling. ACM Trans Database Syst 39(3):20:1–20:27 Tilley T, Cole R, Becker P, Eklund PW (2005) A survey of formal concept analysis support for software engineering activities. Formal concept analysis, foundations and applications. LNCS 3626:250–271 Uno T, Asai T, Uchida Y, Arimura H (2004) An efficient algorithm for enumerating closed patterns in transaction databases. In: Proceedings of the 7th international conference on discovery science (DS). LNCS, vol 3245. Springer, pp 16–31 Valverde-Albacete FJ, Peláez-Moreno C (2017) A formal concept analysis look at the analysis of affiliation networks. Formal concept analysis of social networks. Springer, Berlin, pp 171–195 van Deursen A, Kuipers T (1999) Identifying objects using cluster and concept analysis. In: Proceedings of the international conference on software engineering (ICSE), pp 246–255 Völker J, Rudolph S (2008) Lexico-logical acquisition of OWL DL axioms. In: Proceedings of the 6th international conference on formal concept analysis (ICFCA), LNCS, vol 4933. Springer, pp 62–77 Voutsadakis G (2002) Polyadic concept analysis. Order 19(3):295–304 Wille R (2002) Why can concept lattices support knowledge discovery in databases? J Exp Theor Artif Intell 14(2/3):81–92


Wray T, Eklund PW (2011) Exploring the information space of cultural collections using formal concept analysis. In: Proceedings of the 9th international conference on formal concept analysis (ICFCA), LNCS, vol 6628. Springer, pp 251–266 Zaki MJ (2005) Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Trans Knowl Data Eng 17(4):462–478

Constrained Clustering: Current and New Trends Pierre Gançarski, Thi-Bich-Hanh Dao, Bruno Crémilleux, Germain Forestier and Thomas Lampert

Abstract Clustering is an unsupervised process which aims to discover regularities and underlying structures in data. Constrained clustering extends clustering in such a way that expert knowledge can be integrated through the use of user constraints, which guide the clustering process towards a more relevant result. Different means of integrating constraints into the clustering process exist: extending classical clustering algorithms, such as the well-known k-means algorithm; modelling the constrained clustering problem in a declarative framework; and, finally, directly integrating constraints into a collaborative process that involves several clustering algorithms. A common point of these approaches is that they require the user constraints to be given before the process begins. New trends in constrained clustering highlight the need for better interaction between the automatic process and expert supervision. This chapter is dedicated to constrained clustering. In particular, after a brief overview of constrained clustering and the associated issues, it presents the three main approaches in the domain. It also discusses exploratory data mining by presenting models that develop interaction with the user in an incremental and collaborative way. Finally, moving beyond constraints, some aspects of implicit user preferences and their capture are introduced.
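The first family of approaches mentioned in the abstract, extending k-means with must-link and cannot-link constraints, can be sketched as follows. This is a simplified, hypothetical illustration in the spirit of the classical COP-KMeans algorithm, restricted to one-dimensional data; the function and variable names are our own, not from this chapter.

```python
import random

def constrained_kmeans(points, k, must_link, cannot_link, iters=20, seed=0):
    """Greedy constrained k-means on 1-D points (COP-KMeans spirit).

    must_link / cannot_link are lists of index pairs. Returns a list of
    cluster labels, or None if no consistent assignment is found."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        labels = [None] * len(points)
        for i, p in enumerate(points):
            # Try centers from nearest to farthest and keep the first one
            # that violates no constraint w.r.t. already-assigned points.
            for c in sorted(range(k), key=lambda cc: abs(p - centers[cc])):
                ok_ml = all(labels[j] is None or labels[j] == c
                            for a, b in must_link for j in (a, b)
                            if i in (a, b) and j != i)
                ok_cl = all(labels[j] != c
                            for a, b in cannot_link for j in (a, b)
                            if i in (a, b) and j != i)
                if ok_ml and ok_cl:
                    labels[i] = c
                    break
            if labels[i] is None:
                return None  # constraints cannot be satisfied greedily
        # Standard k-means update: each center becomes its cluster mean.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

# Hypothetical example: two obvious groups plus two user constraints.
data = [1.0, 1.2, 1.1, 8.0, 8.2, 7.9]
print(constrained_kmeans(data, 2, must_link=[(0, 1)], cannot_link=[(0, 3)]))
```

Note the design choice made explicit here: constraints are checked only at assignment time against already-assigned points, so the algorithm can fail (return None) even when a consistent clustering exists. This brittleness of the algorithm-extension approach is one motivation for the declarative frameworks discussed later in the chapter.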

P. Gançarski · T. Lampert
ICube, University of Strasbourg, Strasbourg, France
e-mail: [email protected]
T. Lampert
e-mail: [email protected]
T.-B.-H. Dao
LIFO, University of Orléans, Orléans, France
e-mail: [email protected]
B. Crémilleux
Normandy University, UNICAEN, ENSICAEN, CNRS - UMR GREYC, Caen, France
e-mail: [email protected]
G. Forestier
IRIMAS, University of Haute-Alsace, Mulhouse, France
e-mail: [email protected]
© Springer Nature Switzerland AG 2020
P. Marquis et al. (eds.), A Guided Tour of Artificial Intelligence Research, https://doi.org/10.1007/978-3-030-06167-8_14


1 Introduction

Supervised learning methods are at the center of artificial intelligence solutions and have, for a long time, proved their viability. The phenomenon of "Big Data" (the tremendous increase in the amount of available data) combined with increased computing capacities has led to the return of neural methods in the form of deep learning (LeCun et al. 2015). Through their striking results, these approaches have revolutionised supervised learning in, for example, the analysis and understanding of images. These techniques are quickly being generalised to domains related to decision aiding (banks, medicine, etc.) and decision making (automobiles, avionics, etc.). For such an algorithm to learn to recognise a concept (such as car, cat, etc.), it is trained using several hundreds of thousands of occurrences of this target concept, and the labelling of this training data may require many hours of manual interpretation. Once trained, however, the network can almost instantly recognise the concept in new unseen data with success rates rarely achieved before. Nevertheless, these approaches suffer from several drawbacks. On the one hand, the "black box" aspect of the learning process and the nature of the model (i.e. a network with learned weights) make it difficult for an expert to understand and interpret. Extracting reusable or transferable knowledge and applying it to other domains or applications remains a challenging problem. On the other hand, these methods, as with all supervised methods, rely upon the hypothesis that learning (and validation) data provided by users and/or domain experts exist and fully represent the underlying concept and data distribution (i.e. they do not fluctuate with time). The creation of such learning sets proves to be very time-consuming, although crowdsourcing methods, for example, make it possible to alleviate this bottleneck.
Finally, creating learning sets implies that a problem can be formalised and objects of interest defined, which is often not a realistic hypothesis and means that the algorithms are subject to error. These issues explain the development of unsupervised approaches, which allow for the discovery of both regularities and structures in the data. Unsupervised discovery of knowledge is at the core of data mining and more specifically clustering, which is the central theme of this chapter. Experiments have shown the ability of clustering methods to extract meaningful clusters from large amounts of heterogeneous data without requiring any additional prior information. Nevertheless, regardless of the efficiency of these algorithms, the lack of formalism of thematic classes and the absence of real reference data make it difficult to accurately evaluate the quality of the results. Thus, an expert cannot directly validate the results and cannot directly modify the clusters according to thematic classes. As such, the process of cluster extraction should be rethought and improved to make the results directly useful for domain experts. To this end, the results proposed by these algorithms should model the thematic "intuition" of the expert, that is to say the potential thematic classes. The clustering process is by definition unsupervised, which means that it only uses the data and no additional knowledge, in order to accentuate the principle of serendipity (the probability of finding something useful without specifically searching for it). Without any supervision, clustering algorithms often produce irrelevant solutions. Recent studies have therefore focused on approaches that allow the clustering process to be guided by background knowledge or expert knowledge, to avoid the apophenia phenomenon (the risk of seeing patterns or connections in random or meaningless data). The objective is to allow a human expert to embed their domain knowledge into the data mining process and thus to guide it towards better results. In order to limit expert intervention, which can be highly time consuming, the ideal solution is to make the expert knowledge actionable in order to automate its use in data mining. Depending on the domain, however, the representation and the type of knowledge to model can be very heterogeneous. Thus, three main knowledge representations that are independent of the application domain have been proposed for clustering, in the form of operable constraints (Basu et al. 2008; Dinler and Tural 2016).

• The first concerns the use of constraints between objects (comparison constraints), mainly of resemblance and dissemblance, for example the relations must-link and cannot-link: two objects should (not) be in the same cluster, having the same (a different) nature according to expert knowledge.
• The second consists of using labelled objects (labelling constraints), which corresponds directly to domain knowledge.
• The third concerns the clusters themselves (number, size, density, etc.) and corresponds to intrinsic qualities of the clusters (cluster constraints).

All these approaches only partially address the issue of transferring thematic constraints to actionable constraints. Thus, it is not realistic to ask an expert to define all the constraints when starting from nothing. Indeed, knowledge discovery, and in our case the discovery of more relevant clusters, is an iterative process.
This is why, in a second phase, the expert has to be able to successively refine and improve the proposed clustering. In particular, to be able to add, refine, or remove constraints, act directly on the clusters (split, merge, deletion) or freeze the clusters that are most likely candidates for thematic classes. The latter should be labelled early in the process and should not be considered by the clustering algorithm any further. Interactive learning methods that put the expert at the center of the extraction process are one solution to this problem. Surprisingly little attention has been focussed on their development even though since the emergence of data mining in the 1990s (Fayyad et al. 1996) the importance of interactive and semi-automatic processes of knowledge discovery has been well known. Multiple studies (Anand et al. 1995; Kopanas et al. 2002) have shown the importance played by background knowledge and expert knowledge in the process of data mining. Analysts interact (visualise, select, explore) not only with the data but also with the patterns or models supported by the data (Boulicaut et al. 2006). A very strong feature of big data for a data science project is that the data spans multiple domains and therefore understanding and analysing such data requires different but complementary expertises. Expert and algorithm interaction must be flexible in order to better fit the data’s perspective. Thus the process of knowledge extraction cannot be fully automatic. This highlights
the need to study mechanisms that allow a better interaction between automatic processes and expert supervision. Consequently, data analysts have rapidly focused on redefining the role of the expert or simply replacing them altogether. Thus, the formalisation of knowledge and its use is a key issue not only in clustering but also in the whole field of data mining. For this reason, this chapter is concluded by discussing new trends in exploratory data analysis. The remainder of this chapter is organised as follows. In Sect. 2 the principles of constrained clustering and the main approaches used to implement them are presented. Section 3 presents a review of classic clustering algorithms that have been extended to include user constraints. Section 4 describes declarative approaches, while Sect. 5 presents a collaborative approach. Section 6 introduces new trends in constrained clustering, i.e. interactive/incremental approaches and user-preference-based methods.

2 Constrained Clustering

Given a set of objects, cluster analysis aims at grouping the objects together into homogeneous groups, such that the objects in the same group are similar and the objects in different groups are different. The groups are called clusters and the set of groups is a clustering. Clustering is an important task in data mining and numerous methods have been developed for it. A complete overview of clustering algorithms can be found in chapter "Designing Algorithms for Machine Learning and Data Mining" of this Volume. In practice, the expert usually has some intuition or prior knowledge about the underlying clustering. In order to reach a solution relevant to expert knowledge, recent studies have focused on integrating knowledge to allow guidance on the clustering process. This section recalls some main clustering formulations and introduces constrained clustering through several types of user constraints.

2.1 Cluster Analysis

Let O be a set of instances (data) {o_1, ..., o_n}, and let us assume that there exists a dissimilarity (or a similarity) measure d(o_i, o_j) between any two instances o_i and o_j. Partitional clustering involves finding a partition of O into K non-empty and (generally) disjoint groups called clusters C_1, ..., C_K, such that instances in the same cluster are very similar and instances in different clusters are dissimilar. The homogeneity of the clusters is usually formalised by an optimisation criterion. Finding the optimal clustering (i.e. the best number of clusters and the best associated clustering) among all the possible clusterings is intractable in general. For N objects and a given number of clusters K, the number of clustering candidates is

S_{N,K} = \frac{1}{K!} \sum_{k=0}^{K} (-1)^k \binom{K}{k} (K - k)^N \simeq \frac{K^N}{K!} \quad \text{when } N \to \infty,    (1)

and for all clusterings where the number of clusters can vary

B_N = \sum_{k=1}^{N} S_{N,k}.    (2)
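As a sanity check on these counts, here is a short sketch (using the standard textbook formulas for the Stirling number of the second kind and the Bell number, not code from the chapter; the helper names are illustrative):

```python
from math import comb, factorial

def stirling2(n, k):
    """S_{n,k}: number of partitions of n objects into k non-empty clusters."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1)) // factorial(k)

def bell(n):
    """B_n: total number of clusterings when the number of clusters may vary."""
    return sum(stirling2(n, k) for k in range(1, n + 1))

print(stirling2(4, 2))  # 7
print(bell(25))         # 4638590332229999353
```

Note that `bell(25)` reproduces the figure quoted in the text for N = 25.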

For instance, for N = 25, there are 4,638,590,332,229,999,353 possible clusterings, requiring 147,000 years to be generated on a computer producing one million partitions per second. In distance-based clustering the optimisation criterion that defines the homogeneity of the clusters is based on the distance measure. Different optimisation criteria exist, the most popular being (Hansen and Jaumard 1997):

• minimising the maximal diameter of the clusters, which is defined by the maximal dissimilarity between two objects in the same cluster;
• maximising the minimal split between clusters, which is the smallest dissimilarity between two objects in different clusters;
• minimising the sum of stars of the clusters, which is defined by the minimum sum of dissimilarities between an object and all other objects in the cluster, for each object in the cluster;
• minimising the within-cluster sum of dissimilarities (WCSD), which is the sum of all the dissimilarities between two objects in the same cluster;
• minimising the within-cluster sum of squares (WCSS); in a Euclidean space WCSS is the sum of squared Euclidean distances between each object o_i and the centroid m_k of the cluster that contains o_i.

Finding a partition maximising the minimal split between clusters is polynomial since the partition can be computed from a minimum spanning tree (Delattre and Hansen 1980). As for the maximal diameter criterion, the problem is polynomial with 2 clusters (K = 2), but as soon as clusterings with at least 3 clusters (K ≥ 3) are considered the problem becomes NP-Hard (Hansen and Delattre 1978). All the other criteria are NP-Hard. The NP-Hardness of the WCSS criterion in general dimensions, even with K = 2, is shown in (Aloise et al. 2009). Thus, most of the classic clustering algorithms search for a local optimum.
For instance, the k-means algorithm finds a local optimum for the WCSS criterion, the k-median algorithm finds a local optimum for the sum of stars criterion and the FPF (Furthest Point First) algorithm (Gonzalez 1985) for the diameter criterion. In similarity-based clustering, the optimisation criterion that defines the homogeneity of the clusters is based on a similarity measure. The similarity between the instances is usually defined by an undirected graph where the vertices are the objects and the edges have non-negative weights. Spectral clustering aims to find a partition of the graph such that the edges between different groups have a very low weight and the edges within a group have high weight. Given a cluster Ci , a cut measure
cut(C_i) is defined by the sum of the weights of the edges that link an instance in C_i and an instance not in C_i. The two most common optimisation criteria are (von Luxburg 2007):

• minimising the ratio cut, which is defined by the sum of cut(C_i)/|C_i|;
• minimising the normalised cut, which is defined by the sum of cut(C_i)/vol(C_i), where vol(C_i) measures the weight of the edges within C_i.

These criteria are also NP-Hard. Spectral clustering algorithms solve relaxed versions of these problems: relaxing the normalised cut leads to normalised spectral clustering and relaxing the ratio cut leads to unnormalised spectral clustering.
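For a small weighted graph, the ratio cut criterion can be evaluated directly; a minimal sketch (the helper name and toy weights are illustrative, not from the chapter):

```python
import numpy as np

def ratio_cut(W, clusters):
    """Sum over clusters of cut(C_i) / |C_i| for a symmetric weight matrix W."""
    n = W.shape[0]
    total = 0.0
    for C in clusters:
        inside = sorted(C)
        outside = [v for v in range(n) if v not in C]
        cut = W[np.ix_(inside, outside)].sum()  # total weight leaving the cluster
        total += cut / len(inside)
    return total

# Two tight pairs joined by one weak edge: the natural 2-way split has a small ratio cut.
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
print(ratio_cut(W, [{0, 1}, {2, 3}]))  # 0.1
```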

2.2 User Constraints In practice, a user may have some requirements for, or prior knowledge about, the final solution. For instance, the user can have some information on the label of a subset of objects (Wagstaff and Cardie 2000). Several studies have demonstrated the importance of domain knowledge in the data mining processes (Anand et al. 1995). Because of the inherent complexity of the optimisation criteria, classic algorithms always find a local optimum. Several optima may exist, some of which may be closer to the user requirement. It is therefore important to integrate prior knowledge into the clustering process. Prior knowledge is expressed by user constraints to be satisfied by the clustering solution. The subject of these user constraints can be the instances or the clusters (Basu et al. 2008). With the presence of user constraints, clustering problems become harder, as for instance the polynomial criterion of maximising the minimal split between clusters becomes NP-Hard under user constraints (Davidson and Ravi 2007). Instance-level constraints are the most widely used type of constraint and were first introduced by Wagstaff and Cardie (2000). Two kinds of instance-level constraints exist: must-link (ML) and cannot-link (CL). Definition 1 Two instances oi and o j that satisfy an ML constraint must be in the same cluster, i.e. ∀k ∈ {1, . . . , K }, oi ∈ Ck ⇔ o j ∈ Ck . Definition 2 Two instances oi and o j that satisfy an CL constraint must not be in the same cluster, i.e. ∀k ∈ {1, . . . , K }, ¬(oi ∈ Ck ∧ o j ∈ Ck ). In semi-supervised clustering, this information is available to aid the clustering process and can be inferred from class labels: if two objects have the same label then they are linked by an ML constraint, otherwise by a CL constraint. Supervision by instance-level constraints is however more general and more realistic than class labels. 
Using knowledge, even when class labels may be unknown, a user can specify whether pairs of points belong to the same cluster or not (Wagstaff et al. 2001). Semi-supervised clustering is therefore a transductive operation because its objective is to define the clusters to explain the processed data and possibly to label the

Fig. 1 Examples of ML, CL, δ and ε constraints

objects not initially labelled. During semi-supervised classification, labelled objects and unlabelled objects are used to construct a classification function. Thus, the objective is to use the unlabelled objects to better understand the configuration of the data space. Semi-supervised classification is an inductive operation, the aim of which is to create a classifier which generalises the model of available data and which can be used subsequently to process other data. Cluster-level constraints define requirements on the clusters (see Fig. 1), for example:

• the number of clusters K;
• their size: a capacity constraint expresses a maximal or a minimal limit on the number of objects in each cluster. A minimal capacity constraint states that each cluster must have at least α objects, i.e. ∀k ∈ {1, ..., K}, |C_k| ≥ α, and a maximal capacity constraint requires that each cluster must have at most β objects, i.e. ∀k ∈ {1, ..., K}, |C_k| ≤ β;
• the diameter of the clusters: a maximum diameter constraint gives an upper bound γ on the diameter of each cluster, i.e. ∀k ∈ {1, ..., K}, ∀o_i, o_j ∈ C_k, d(o_i, o_j) ≤ γ;
• the split between clusters: a minimum split constraint states that the clusters must be separated by at least δ: ∀k, k′ ∈ {1, ..., K}, k′ ≠ k, ∀o_i ∈ C_k, ∀o_j ∈ C_k′, d(o_i, o_j) ≥ δ;
• finally, the ε-constraint, introduced in (Davidson and Ravi 2005), demands that each object o_i have in its neighbourhood of radius ε at least one other object in the same cluster: ∀k ∈ {1, ..., K}, ∀o_i ∈ C_k, ∃o_j ∈ C_k, o_j ≠ o_i, d(o_i, o_j) ≤ ε;
this constraint tries to capture the density notion used in the density-based clustering algorithm DBSCAN (Ester et al. 1996) and can be generalised to the requirement that each object o_i has in its neighbourhood of radius ε at least m objects in the same cluster.

Note that although the diameter and split constraints state requirements on the clusters, they can be expressed by a conjunction of cannot-link constraints and must-link constraints, respectively (Davidson and Ravi 2005). The instances can be described by a set of features that enables the computation of a dissimilarity measure, and also by a set of properties from which the definition of what is actionable/interesting is given. Constraints can therefore also be stated on properties and can be divided into the following categories (Dao et al. 2016):

• cardinality constraints place a requirement on the count of the elements in a cluster having a property; they may range from as simple as each cluster should contain at least one female to more complex variations such as the number of males must be no greater than two times the number of females;
• density constraints relate to cardinality constraints in that they provide requirements on the count of a property, except not for an entire cluster but rather a subset of instances in the cluster, e.g. we may require that each person has at least 10 people in his/her cluster that share the same hobby;
• geometric constraints place an upper or lower bound on some geometric property of a cluster or cluster combination, e.g. that the maximum diameter of a cluster with respect to the age property is 10 years; this would prevent clusters containing individuals with a wide range of ages;
• complex logic constraints express logical combinations of constraints, which can be instance-level or cluster-level, e.g. we may require that any cluster having more than 2 professors should have more than 10 Ph.D. students.
When the instances are described by binary features, as in the case of transactional data, constraints can be stated such that each cluster is associated with a definition expressed by a pattern. Four families of constraints can be identified, and k-pattern set mining problems can be specified as combinations of them (Guns et al. 2013):

• individual pattern constraints, which include among others local pattern mining constraints;
• redundancy constraints, which are used to constrain or to minimise the redundancy between different patterns;
• coverage constraints, which deal with defining and measuring how well a pattern set covers the data;
• discriminative constraints, which are used on labelled data to measure and optimise how well a pattern or pattern set discriminates between positive and negative examples.
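Whatever their origin, ML and CL constraints define a simple feasibility test on a candidate clustering; a minimal sketch (the helper name is illustrative, not from the chapter):

```python
def satisfies(assignment, ml, cl):
    """Check a cluster assignment (dict: instance -> cluster id) against ML/CL constraints."""
    return (all(assignment[i] == assignment[j] for i, j in ml)
            and all(assignment[i] != assignment[j] for i, j in cl))

print(satisfies({0: 0, 1: 0, 2: 1}, ml=[(0, 1)], cl=[(1, 2)]))  # True
print(satisfies({0: 0, 1: 1, 2: 1}, ml=[(0, 1)], cl=[(1, 2)]))  # False
```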


3 Extensions of Classic Clustering Algorithms to User Constraints

This section presents a brief review of partitional constrained clustering methods, in particular k-Means, metric learning, and spectral graph theory based methods. These all have the following properties in common: (1) they extend a clustering algorithm to integrate user constraints and are therefore specific to the objective function that is optimised by the clustering algorithm, e.g. minimising the sum of squared errors for the k-Means algorithm; (2) they integrate instance-level constraints or some form of cluster-level constraint (e.g. cluster size); (3) they are usually fast and find an approximate solution, and therefore do not guarantee the satisfaction of all the constraints.

3.1 k-Means

In this type of approach, the clustering algorithm or the objective function is modified so that user constraints are used to guide the algorithm towards a more appropriate data partitioning. Most of these works consider instance-level must-link and cannot-link constraints. The extension is done either by enforcing pairwise constraints or by using pairwise constraints to define penalties in the objective function. A survey on partitional and hierarchical clustering with instance-level constraints can be found in (Davidson and Basu 2007). In the category of enforcing pairwise constraints, the first work proposed a modified version of COBWEB (Fisher 1987) that tends to satisfy all the pairwise constraints, named COP-COBWEB (Wagstaff and Cardie 2000). Subsequent work extended the k-Means algorithm to instance-level constraints. The k-Means algorithm starts with initial assignment seeds and assigns objects to clusters in several iterations. At each iteration, the centroids of the clusters are computed and the objects are reassigned to the closest centroid. The algorithm converges and finds a solution which is a local optimum of the within-cluster sum of squares (WCSS, or distortion). To integrate must-link and cannot-link constraints, the COP-KMeans algorithm by Wagstaff et al. (2001) extends the k-Means algorithm by choosing, at each iteration, a reassignment that does not violate any constraints. This greedy behaviour without backtracking means that COP-KMeans may fail to find a solution that satisfies all the constraints even when such a solution exists. Basu et al. (2002) propose two variants of k-Means, the Seed-KMeans and Constrained-KMeans algorithms, which allow the use of objects labelled as seeds, the difference between the two being the possibility of changing the class centres or not. In both approaches, it is assumed that there is at least one seed for each cluster and that the number of clusters is known.
The seeds are used to overcome the sensitivity of the k-Means approaches to the initial parameterisation.
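The constraint-respecting assignment step of COP-KMeans can be sketched as follows (a simplified illustration with hypothetical helper names; the published algorithm also recomputes centroids and iterates until convergence):

```python
import numpy as np

def violates(idx, cluster, labels, ml, cl):
    """Would assigning instance idx to cluster break a pairwise constraint,
    given the partial assignment in labels (None = not yet assigned)?"""
    for i, j in ml:
        if idx in (i, j):
            other = j if idx == i else i
            if labels[other] is not None and labels[other] != cluster:
                return True
    for i, j in cl:
        if idx in (i, j):
            other = j if idx == i else i
            if labels[other] == cluster:
                return True
    return False

def cop_assign(X, centroids, ml, cl):
    """One COP-KMeans-style pass: assign each point to the nearest centroid
    that violates no constraint; return None on a greedy dead end."""
    labels = [None] * len(X)
    for idx, x in enumerate(X):
        for c in np.argsort(np.linalg.norm(centroids - x, axis=1)):
            if not violates(idx, int(c), labels, ml, cl):
                labels[idx] = int(c)
                break
        else:
            return None  # no feasible cluster: COP-KMeans fails here
    return labels

print(cop_assign(np.array([[0.0], [0.2], [5.0]]),
                 np.array([[0.0], [5.0]]),
                 ml=[(0, 1)], cl=[(1, 2)]))  # [0, 0, 1]
```

The greedy dead end (returning `None`) mirrors the failure mode discussed above: a satisfying assignment may exist even when this pass cannot find it.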


Incorporating must-link and cannot-link constraints makes clustering algorithms sensitive to the assignment order of instances and therefore results in consequent constraint violation. To address the issue of constraint violation in COP-KMeans, Tan et al. (2010) (ICOP-KMeans) and Rutayisire et al. (2011) propose a modified version with an assignment order, which is either based on a measure of certainty computed for each instance or a sequenced assignment of cannot-linked instances. MLC-KMeans (Huang et al. 2008) takes an alternative approach by introducing assistant centroids, which are calculated using the points implicated by must-link constraints for each cluster, and which are used to calculate the similarity of instances and clusters. For high-dimensional sparse data, the SCREEN method (Tang et al. 2007) for constraint-guided feature projection was developed, which can be used with a semi-supervised clustering algorithm. This method considers an objective function to learn the projection matrix, which can project the original high-dimensional dataset into a low-dimensional space such that the distance between any pair of instances involved in the cannot-link constraints is maximised while the distance between any pair of instances involved in the must-link constraints is minimised. A spherical k-Means algorithm is then used to try to avoid violating cannot-link constraints. Other methods use penalties as a trade-off between finding the best clustering and satisfying as many constraints as possible. Considering a subset of instances whose label is known, Demiriz et al. (1999) modify the clustering objective function to incorporate a dispersion measure and an impurity measure. The impurity measure is based on the Gini index to measure misplaced known labels. The CVQE (constrained vector quantisation error) method (Davidson and Ravi 2005) penalises constraint violations using distance.
If a must-link constraint is violated then the penalty is the distance between the two centroids of the clusters containing the two instances that should be together. If a cannot-link constraint is violated then the penalty is the distance between the cluster centroid the two instances are assigned to and the nearest cluster centroid. These two penalty types together with the distortion measure define a new differentiable objective function. An improved version, linear-time CVQE (LCVQE) (Pelleg and Baras 2007), avoids checking all possible assignments for cannot-link constraints and its penalty calculations take into account the coordinates of the instances involved in the violated constraint. The method PCK-Means (Basu et al. 2004a) formulated the goal of pairwise constrained clustering as minimising a combined objective function, defined as the sum of the total squared distances between the points and their cluster centroids (WCSS) and the cost incurred by violating any pairwise constraints. The cost can be uniform but can also take into account the metric of the clusters, as in the MPCK-Means version that integrates both constraints and metric learning. Lagrangian constrained clustering (Ganji et al. 2016) also formulates the objective function as a sum of the distortion and the penalty of violating cannot-link constraints (must-link constraints are used to aggregate instances into super-instances so they are all satisfied). This method uses a Lagrangian relaxation strategy of increasing penalties for constraints which remain unsatisfied in subsequent clustering iterations. A local search approach using Tabu search was developed to optimise the objective function, which is the sum of the distortion and the weighted cost incurred by violating pairwise constraints (Hiep et al. 2016). Grira et al. (2006) introduced the cost of violating pairwise constraints into the objective function of the Fuzzy C-Means algorithm. Li et al. (2007) use non-negative matrix factorisation to perform centroid-less constrained k-Means clustering (Zha et al. 2001). Hybrid approaches integrate both constraint enforcing and metric learning (see Sect. 3.2) into a single framework: MPCK-Means (Bilenko et al. 2004), HMRF-KMeans (Basu et al. 2004b), semi-supervised kernel k-Means (Kulis et al. 2005), and CLWC (Cheng et al. 2008). Bilenko et al. (2004) define a uniform framework that integrates both constraint-based and metric-based methods. This framework represents PCK-Means when considering a constraint-based factor and MPCK-Means when considering both constraint-based and metric-based factors. Semi-supervised HMRF k-Means (Basu et al. 2004b) is a probabilistic framework based on Hidden Markov Random Fields, where the semi-supervised clustering objective minimises both the overall distortion measure of the clusters and the number of violated must-link and cannot-link constraints. A k-Means-like iterative algorithm is used for optimising the objective, where at each step the distortion measure is re-estimated to respect user constraints. Semi-supervised kernel k-Means (Kulis et al. 2005, 2009) is a weighted kernel-based approach that generalises HMRF k-Means. The method can perform semi-supervised clustering on data given either as vectors or as a graph. It can be used on a wide class of graph clustering objectives such as minimising the normalised cut or ratio cut. The framework can therefore be applied to semi-supervised spectral clustering. Constrained locally weighted clustering (CLWC) (Cheng et al. 2008) integrates local distance metric learning with constrained learning. Each cluster is assigned to its own local weighting vector in a different subspace.
The data points in the constraint set are arranged into disjoint groups (chunklets), and the chunklets are assigned entirely in each assignment and weight update step. Beyond pairwise constraints, Ng (2000) adds suitable constraints into the mathematical program formulation of the k-Means algorithm to extend the algorithm to the problem of partitioning objects into clusters where the number of elements in each cluster is fixed. Bradley et al. (2000) avoid local solutions with empty clusters or clusters having very few points by explicitly adding k minimal capacity constraints to the formulation of the clustering optimisation problem. This work builds on the k-Means algorithm, and the constraints are enforced during the assignment step at each iteration. Banerjee and Ghosh (2006) proposed a framework to generate balanced clusters, i.e. clusters of comparable sizes. Demiriz et al. (2008) integrated a minimal size constraint into the k-Means algorithm. Considering two types of constraints, the minimum number of objects in a cluster and the minimum variance of a cluster, Ge et al. (2007) proposed an algorithm that generates clusters satisfying them both. This algorithm is based on a CD-Tree data structure, which organises data points in leaf nodes such that each leaf node approximately satisfies the significance and variance constraint and minimises the sum of squared distances.
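The penalty-based objectives discussed in this section (CVQE, PCK-Means, and related methods) share a common shape: distortion plus a cost per violated constraint. A deliberately simplified sketch with a uniform violation cost (illustrative only, not any of the published formulations):

```python
import numpy as np

def penalised_objective(X, labels, centroids, ml, cl, w=1.0):
    """WCSS plus a uniform penalty w for each violated pairwise constraint."""
    wcss = sum(float(np.sum((X[i] - centroids[labels[i]]) ** 2))
               for i in range(len(X)))
    violations = (sum(labels[i] != labels[j] for i, j in ml)    # broken must-links
                  + sum(labels[i] == labels[j] for i, j in cl))  # broken cannot-links
    return wcss + w * violations

X = np.array([[0.0], [1.0]])
centroids = np.array([[0.0], [1.0]])
print(penalised_objective(X, [0, 1], centroids, ml=[(0, 1)], cl=[]))  # 1.0
```

A clustering with zero distortion can still score badly if it breaks constraints, which is exactly the trade-off these methods optimise.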


3.2 Metric Learning

Metric learning aims to automatically learn a metric measure from training data that best discriminates the comprising samples according to a given criterion. In general, this metric is either a similarity or a distance (Klein et al. 2002). Many machine learning approaches rely on learned metrics; thus metric learning is usually a preprocessing step for such approaches. In the context of clustering, the metric can be defined as the Mahalanobis distance parameterised by a matrix M, i.e. d_M(o_i, o_j) = ||o_i − o_j||_M (Bellet et al. 2015). Unlike the Euclidean distance, which assumes that attributes are independent of one another, the Mahalanobis distance enables the similarity measure to take into account correlations between attributes. Learning the distance d_M is equivalent to learning the matrix M. For d_M to satisfy the distance properties (non-negativity, identity, symmetry, and the triangle inequality), M should be a positive semi-definite real-valued matrix. To guide the learning process, two sets are constructed from the ML and CL constraints: the set of supposedly similar (must-link) pairs Sim, and the supposedly dissimilar (cannot-link) pairs Dis, such that

• Sim = {(o_i, o_j) | o_i and o_j should be as similar as possible},
• Dis = {(o_i, o_j) | o_i and o_j should be as dissimilar as possible}.

It is also possible to introduce unlabelled data along with the constraints to prevent over-fitting. Several proposals have been made to modify (learn) a distance (or metric) taking into account this principle. We can cite works on the Euclidean distance and shortest path (Klein et al. 2002), Mahalanobis distance (Bar-Hillel et al. 2003, 2005; Xing et al. 2002), Kullback–Leibler divergence (Cohn et al. 2003), string-edit distance (Bilenko and Mooney 2003), and the Laplacian regularised metric learning (LRML) method for clustering and imagery (Hoi et al. 2008, 2010). Yi et al.
(2012) describe a metric learning algorithm that avoids the high computational cost implied by the positive semi-definite constraint. Matrix completion is performed on the partially observed constraints and it is observed that the completed similarity matrix has a high probability of being positive semi-definite, thus avoiding the explicit constraint.
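To make the Sim/Dis formulation concrete, here is a minimal sketch in the spirit of Xing et al. (2002), but much simplified: M is restricted to a diagonal matrix with non-negative entries, which keeps d_M a valid pseudo-metric without an explicit semi-definite projection. The function names and the plain gradient scheme are ours.

```python
def learn_diag_metric(points, sim, dis, steps=200, lr=0.05):
    """Learn the diagonal w of M for d_M(x, y) = sqrt(sum_a w_a (x_a - y_a)^2):
    gradient steps shrink must-link distances and stretch cannot-link
    distances; clipping w at 0 keeps M positive semi-definite."""
    d = len(points[0])
    w = [1.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for i, j in sim:          # pull similar pairs together
            for a in range(d):
                grad[a] += (points[i][a] - points[j][a]) ** 2
        for i, j in dis:          # push dissimilar pairs apart
            for a in range(d):
                grad[a] -= (points[i][a] - points[j][a]) ** 2
        w = [max(0.0, w[a] - lr * grad[a]) for a in range(d)]
    return w

def d_M(x, y, w):
    return sum(wa * (xa - ya) ** 2 for wa, xa, ya in zip(w, x, y)) ** 0.5
```

Without normalisation the weights can grow unboundedly; published formulations instead optimise under a constraint that keeps the Dis pairs separated.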

3.3 Spectral Graph Theory

Spectral clustering is an unsupervised method that takes as input a pre-calculated similarity matrix (graph) and aims to minimise the ratio cut criterion (von Luxburg 2007) or the normalised cut criterion (Shi and Malik 2000). Spectral clustering is often considered superior to classical clustering algorithms, such as k-Means, because it is capable of extracting clusters of arbitrary form (von Luxburg 2007). It has also been shown that algorithms that build partitions incrementally (like k-Means and EM)

Constrained Clustering: Current and New Trends

459

are prone to be overly constrained (Davidson and Ravi 2006). Moreover, spectral clustering has polynomial time complexity. The constraints can be expressed as ML/CL constraints or in the form of labels; they can be taken into account either as “hard” (binary) constraints or as “soft” (probabilistic) constraints. The method allows the user to specify a lower bound on constraint satisfaction, and all points are assigned to clusters simultaneously, even if the constraints are inconsistent. Kamvar et al. (2003) were the first to integrate ML and CL constraints into spectral clustering. This is achieved by modifying the affinity matrix: ML-constrained pairs are set to the maximum similarity, 1, and CL-constrained pairs to the minimum similarity, 0. This has been extended to out-of-sample points and soft constraints through regularisation (Alzate and Suykens 2009). Li et al. (2009) point out, however, that a similarity of 0 in the affinity matrix does not mean that the two objects tend to belong to different clusters. Wang and Davidson (2010a) and Wang et al. (2014) introduce a framework for integrating constraints into spectral clustering. Constraints between N objects are modelled by a matrix Q of size N × N, such that

    Q_ij = Q_ji = { +1 if ML(i, j);  −1 if CL(i, j);  0 otherwise },    (3)

upon which a constraint satisfaction measure can be defined. Soft constraints can be taken into account by allowing real values to be assigned to Q or by allowing fuzzy cluster membership values. Subsequently, the authors introduced a method to integrate a user-defined lower bound on the level of constraint satisfaction (Wang and Davidson 2010b). Work has also been described that allows for inconsistent constraints (Rangapuram and Hein 2012). Based on the Karush–Kuhn–Tucker conditions (Kuhn and Tucker 1951), an optimal solution can then be found by first finding the set of solutions satisfying all constraints and then using a brute-force approach to find the optimal solution within this set. These approaches have been extended to integrate logical combinations of constraints (Zhi et al. 2013), which are translated into linear equations or inequations. Furthermore, instead of modifying the affinity matrix using binary values, Anand and Reddy (2011) propose to modify the distances using an all-pairs-shortest-path algorithm such that the new distance metric is similar to the original space. Lu and Carreira-Perpinán (2008) state that an affinity matrix constructed using constraints is highly informative, but only for a small subset of points. To overcome this limitation they propose a method to propagate constraints (in a manner consistent with the measured similarities) to points that are not directly affected by the original constraint set. These advances are proposed for the two-class problem (a multi-class extension is discussed but is computationally inefficient); multi-class alternatives have been proposed (Lu and Ip 2010; Chen and Feng 2012; Ding et al. 2013).
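The affinity edit of Kamvar et al. (2003) is simple to sketch for the two-cluster case: set must-link pairs to similarity 1, cannot-link pairs to 0, and split on the sign of the second eigenvector of the normalised Laplacian. The helper below is an illustration, not the authors' implementation.

```python
import numpy as np

def constrained_spectral(X, ml, cl, sigma=1.0):
    """Two-cluster spectral clustering with the affinity edits of
    Kamvar et al. (2003): ML pairs get similarity 1, CL pairs get 0."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))      # Gaussian affinity
    for i, j in ml:
        A[i, j] = A[j, i] = 1.0
    for i, j in cl:
        A[i, j] = A[j, i] = 0.0
    np.fill_diagonal(A, 1.0)
    D = np.diag(A.sum(1))
    # Normalised Laplacian; the eigenvector of its second-smallest
    # eigenvalue (the Fiedler vector) splits the graph by sign.
    L = np.linalg.inv(np.sqrt(D)) @ (D - A) @ np.linalg.inv(np.sqrt(D))
    _, vecs = np.linalg.eigh(L)             # eigenvalues in ascending order
    return (vecs[:, 1] > 0).astype(int)
```

Multi-cluster versions keep the k smallest eigenvectors and run k-Means on their rows; the sign split is the K = 2 special case.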


Several works (Zhang and Ando 2006; Hoi et al. 2007; Li et al. 2008, 2009) use the constraints and point similarities to learn a kernel matrix such that points belonging to the same cluster are mapped to be close and points from different clusters are mapped to be well-separated. Most recently, progress has been made in introducing faster and simpler formulations, while providing a theoretical guarantee of the partitioning quality (Cucuringu et al. 2016).

4 Declarative Approaches for Constrained Clustering

These approaches offer the user a general framework to formalise the problem by choosing an objective function and explicitly stating the constraints. The frameworks are usually developed using a general optimisation tool, such as integer linear programming (ILP), SAT, constraint programming (CP), or mathematical programming. Detailed descriptions of SAT and CP can be found in chapters “Reasoning with Propositional Logic - From SAT Solvers to Knowledge Compilation” and “Constraint Reasoning” of this Volume, respectively. Commonalities between these approaches are that they enable the modelling of different types of user constraints and that they search for an exact solution: a global optimum that satisfies all the constraints. Some declarative approaches are reviewed in Sect. 4.1 of this chapter; more detailed descriptions of approaches using ILP are presented in Sect. 4.2 and approaches using CP in Sect. 4.3.

4.1 Overview

For dissimilarity-based constrained clustering settings, several approaches using SAT, ILP, and CP have been developed. A SAT-based framework has been proposed by Davidson et al. (2010) for constrained clustering problems with K = 2. The assignment of objects to clusters is represented by a binary variable X i, where X i = 1 (or X i = 0) means that the ith object is assigned to cluster number 1 (or number 0, respectively). Constraints such as must-link, cannot-link, maximum diameter, and minimum split can be expressed as 2-SAT problems. Using binary search, the framework offers both single-objective and bi-objective optimisation. Several single optimisation criteria are integrated: minimising the maximal diameter, maximising the minimal split, minimising the difference between diameters, and minimising the sum of diameters. When optimising multiple objectives, the framework considers minimising the diameter and maximising the split, either in a way such that one objective is used as a constraint and the other is optimised under that constraint, or by combining them into a single objective, which is the ratio of diameter to split. In order to make the framework more efficient, approximation schemes are also developed to reduce the number of calls in the binary search. CP and ILP based


approaches offer flexible frameworks with several choices of optimisation criteria and user constraints (Sects. 4.2 and 4.3). Generic declarative frameworks have also been investigated in several works for other clustering settings. Conceptual clustering considers objects described by categorical attributes and aims to associate to each cluster a definition expressed by a pattern. CP frameworks have been developed for the k-pattern set mining problem, which can be used for conceptual clustering and other pattern mining tasks (e.g. unexpected rules, k-tiling, redescription mining) (Khiari et al. 2010; Guns et al. 2013; Chabert and Solnon 2017). These frameworks integrate constraints on patterns or groups of patterns as well as different optimisation criteria (Guns et al. 2013). A SAT-based framework has also been proposed, which provides a query language to formalise conceptual clustering tasks (Métivier et al. 2012). The elements of the language are translated into SAT clauses and solved by a SAT solver. An ILP-based framework has been proposed by Ouali et al. (2016), which integrates clustering constraints that enable the modelling of conceptual clustering, soft clustering, co-clustering, and soft co-clustering. Based on a similarity graph between objects, correlation clustering aims to find a partition that agrees as closely as possible with the similarities. The cost function to be optimised is such that the number of similar points co-clustered is maximised and the number of dissimilar points co-clustered is minimised. A MaxSAT framework has been developed for constrained correlation clustering (Berg and Järvisalo 2017). In this model, hard clauses guarantee a well-defined partition and encode the must-link and cannot-link constraints, while soft clauses are used to encode the cost function. Aside from partition clustering, hierarchical clustering constructs a hierarchy of partitions represented by a dendrogram. A framework developed by Gilpin and Davidson (2017) allows hierarchical clustering to be modelled using ILP; another, SAT-based, framework allows different types of user constraints to be integrated (Gilpin and Davidson 2011).
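The K = 2 encoding of Davidson et al. (2010) can be sketched directly: a must-link ML(i, j) becomes the 2-SAT clauses (¬X_i ∨ X_j) ∧ (X_i ∨ ¬X_j), and a cannot-link CL(i, j) becomes (X_i ∨ X_j) ∧ (¬X_i ∨ ¬X_j). The brute-force checker below is only for illustration; 2-SAT is solvable in linear time via the implication graph's strongly connected components.

```python
from itertools import product

def clauses_from_constraints(ml, cl):
    """Encode K=2 constrained clustering as 2-SAT: X_i is True iff
    object i goes to cluster 1. A literal is a (variable, polarity) pair."""
    clauses = []
    for i, j in ml:   # X_i <-> X_j
        clauses += [[(i, False), (j, True)], [(i, True), (j, False)]]
    for i, j in cl:   # X_i xor X_j
        clauses += [[(i, True), (j, True)], [(i, False), (j, False)]]
    return clauses

def satisfiable(n, clauses):
    """Return a satisfying assignment, or None (brute force, tiny n only)."""
    for assign in product([False, True], repeat=n):
        if all(any(assign[v] == pol for v, pol in c) for c in clauses):
            return assign
    return None
```

An inconsistent constraint set, e.g. ML(0, 1), ML(1, 2), CL(0, 2), is detected as unsatisfiability.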

4.2 Integer Linear Programming

Different frameworks using Integer Linear Programming (ILP) have been developed for constrained clustering. Using ILP, constrained clustering problems must be formalised as a linear objective function subject to linear constraints. In formulations of clustering such as the one used in CP-based approaches, a clustering is defined by an assignment of instances to clusters. For several optimisation criteria, e.g. the within-cluster sum of squares (WCSS), this formulation leads to a non-linear objective function. ILP-based approaches therefore use an orthogonal formulation, where a clustering is considered as a subset of the set of all possible clusters. A cluster is a subset of instances; for a dataset of n instances, the number of possible clusters is 2^n. Let T = {1, . . . , 2^n} be the set of all possible non-empty clusters. Consider any cluster Ct and let ct be the cost of the cluster, which is defined on Ct depending on the optimisation criterion. In the case of minimising the WCSS,


the cost ct is defined by the sum of squared distances of the instances in Ct to its mean, such that

    c_t = (1 / (2|C_t|)) Σ_{o_i, o_j ∈ C_t} ||o_i − o_j||².

For each i ∈ {1, . . . , n}, let ait be a constant equal to 1 if oi is in cluster Ct and 0 otherwise. The unconstrained clustering problem is then formalised by the following integer linear program (du Merle et al. 1999):

    minimise    Σ_{t∈T} c_t x_t
    subject to  Σ_{t∈T} a_it x_t = 1,   for all i ∈ {1, . . . , n},
                Σ_{t∈T} x_t = K,
                x_t ∈ {0, 1}.

In this formulation, the first constraint states that each instance oi must be covered by exactly one cluster (the clustering is therefore a partition of the instances) and the second states that the clustering is formed by K clusters. The variable x_t expresses whether the cluster Ct is chosen in the clustering solution. The number of variables x_t is, however, exponential w.r.t. the number of instances. Two kinds of ILP-based approaches have therefore been developed for constrained clustering: (1) use a column generation approach, where the master problem is restricted to a smaller set T′ ⊆ T and columns (clusters) are incrementally added until the optimal solution is proved (Babaki et al. 2014); and (2) restrict the cluster candidates to a subset T′ ⊆ T and define the clustering problem on T′ (Mueller and Kramer 2010; Ouali et al. 2016). An ILP-based approach with column generation for unconstrained minimum sum-of-squares clustering was introduced by du Merle et al. (1999) and improved by Aloise et al. (2012). Column generation iterates between solving the restricted master problem and adding one or multiple columns. A column is added to the master problem if it can improve the objective function. If no such column can be found, one is certain that the optimal solution of the restricted master problem is also an optimal solution of the full master problem. Whether a column can improve the objective function can be derived from the dual of the master problem: a column that improves the objective function of the master problem corresponds to a column with a negative reduced cost in the dual. Among all the columns with negative reduced cost, the one with the smallest reduced cost is usually searched for, which yields a minimisation subproblem. The column generation approach has been extended to integrate anti-monotone user-constraints (Babaki et al. 2014). A constraint is anti-monotone if, whenever it is satisfied on a set of instances S, it is satisfied on all subsets S′ ⊆ S.
For instance, maximal capacity constraints are anti-monotone but minimal capacity constraints are not. With the observation that many user-constraints can be evaluated on each cluster


individually, the user-constraints are not part of the master problem but only have to be considered when solving the subproblems. They are enforced when solving a subproblem by removing columns corresponding to clusters that do not satisfy the constraints. Subproblems are solved by a branch-and-bound algorithm where an anti-monotone property is used to ensure the correctness of the computed bounds. The number of cluster candidates is, in principle, exponential w.r.t. the number of instances. Nevertheless, in some clustering settings, such as conceptual clustering, candidates can usually be drawn from a smaller subset T′. Considering a constrained clustering problem on a restricted subset T′, Mueller and Kramer (2010) and Ouali et al. (2016) develop ILP-based frameworks that can integrate different kinds of user-constraints. The same principle is used, i.e. instance-level and cluster-level constraints are enforced by removing cluster candidates that do not satisfy the constraints. Moreover, the frameworks can integrate constraints on the clustering itself and different optimisation criteria. In (Mueller and Kramer 2010), constraints on the clustering can be stated that give, for instance, bounds on the degree of overlap between clusters, or bounds on the number of clusters an instance can be grouped into. The best clustering can be found by optimising the mean/minimum/median quality of the clusters. This framework has been used in conceptual clustering for transactional datasets. In a transactional dataset, each instance (transaction) is described by a set of items. Conceptual clustering aims to assign the transactions to homogeneous clusters and to provide each cluster with a distinct description (concept) that characterises all the transactions contained within it. The clusters that comprise the subset T′ can be required to correspond to frequent patterns or to closed frequent patterns.
The subset T′ is therefore precomputed by an algorithm that extracts frequent patterns a priori (Mueller and Kramer 2010) or closed patterns (e.g. LCM) (Ouali et al. 2016). In the framework developed by Ouali et al. (2016), constraints on the clustering are also available, for example: at least some instances must be covered, or small overlaps between clusters are allowed. Besides modelling conceptual clustering, these constraints also enable the modelling of soft clustering (at most some transactions can be left uncovered or small overlaps are allowed), co-clustering (clustering that covers both the set of transactions and the set of items, without any overlap on transactions or on items), and soft co-clustering.
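The set-partitioning formulation can be demonstrated end to end on a toy instance: enumerate candidate clusters, drop those violating a user constraint (here a minimal size, mirroring how both approaches above remove candidates), and search for the K-cluster cover of minimal total WCSS cost. This exhaustive sketch stands in for column generation and is only feasible for tiny n; all names are ours.

```python
from itertools import combinations

def wcss_cost(cluster, points):
    # c_t = (1 / (2|C_t|)) * sum of squared pairwise distances, which
    # equals the sum of squared distances to the cluster mean.
    total = 0.0
    for i, j in combinations(cluster, 2):
        total += sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
    return total / len(cluster)

def best_partition(points, k, min_size=1):
    n = len(points)
    # Every non-empty subset is a candidate cluster; user constraints
    # (here: minimal size) simply remove candidates from T'.
    candidates = [frozenset(c)
                  for r in range(min_size, n + 1)
                  for c in combinations(range(n), r)]
    costs = {c: wcss_cost(c, points) for c in candidates}

    best = (float("inf"), None)
    def search(remaining, chosen, cost):
        nonlocal best
        if not remaining:
            if len(chosen) == k and cost < best[0]:
                best = (cost, list(chosen))
            return
        if len(chosen) >= k or cost >= best[0]:
            return
        anchor = min(remaining)          # next uncovered instance
        for c in candidates:
            if anchor in c and c <= remaining:
                search(remaining - c, chosen + [c], cost + costs[c])
    search(frozenset(range(n)), [], 0.0)
    return best
```

Covering each instance exactly once enforces the first ILP constraint; stopping at exactly k chosen clusters enforces the second.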

4.3 Constraint Programming

A general and declarative framework, based on Constraint Programming (CP), has been developed for distance-based constrained clustering (Dao et al. 2013, 2017). CP is a powerful paradigm for solving combinatorial satisfaction or optimisation problems. Modelling a problem in CP consists of its formalisation as a Constraint Satisfaction Problem (CSP) or a Constraint Optimisation Problem (COP). A CSP is a triplet ⟨X, Dom, C⟩, where X is a set of variables, Dom(x) for each x ∈ X is its domain, and C is a set of constraints, each of which expresses a condition on a subset of X. A solution to a CSP is a complete assignment of values from Dom(x)


to each variable x ∈ X that satisfies all the constraints of C. A COP is a CSP with an objective function to be optimised. An optimal solution of a COP is a solution of the CSP that optimises the objective function. In general, solving a CSP or a COP is NP-hard. Nevertheless, the constraint propagation and search strategies (Rossi et al. 2006) used by CP solvers allow a large number of real-world applications to be solved efficiently. As discussed in chapter “Constraint Reasoning” of this Volume, the propagation of a constraint c reduces the domains of the variables of c by removing some or all inconsistent values, i.e. values that cannot be part of a solution of c. A propagation scheme is defined for each type of constraint. Different kinds of constraints are available for modelling; they can be elementary constraints expressing arithmetic or logic relations, or global constraints expressing meaningful n-ary relations. Although equivalent to conjunctions of elementary constraints, global constraints benefit from efficient propagation, performed by a filtering algorithm exploiting results from other domains, e.g. graph theory. Reified constraints are also available, which allow a boolean variable to be linked to the truth value of a constraint. A catalogue containing more than 400 inventoried global constraints is maintained by Beldiceanu et al. (2005). In a CP solver, two steps are repeated until a solution is found: constraint propagation until a stable state is reached, and branching. Different strategies can be used to create and to order branches at each branching point. They can be standard search strategies defined by CP solvers or can be specifically developed. A CP-based framework for distance-based constrained clustering has been developed by Dao et al. (2013). This framework enables the modelling of different constrained clustering problems by specifying an optimisation criterion and by setting the user constraints. The framework is improved by modifying the model and by developing dedicated propagation algorithms for each optimisation criterion (Dao et al. 2017). In this model, the number of clusters K does not need to be fixed beforehand; only bounds are required, i.e. K min ≤ K ≤ K max. The clusters in a partition of K clusters are numbered from 1 to K. In order to express the cluster assignment, a variable G i is used for each instance oi ∈ O, with Dom(G i) = {1, . . . , K max}. A real-valued variable is used to represent the value of the objective function. The model has the following three components.
• Constraints to express a partition. The use of the variables G i naturally expresses a partition. Nevertheless, several assignments of the variables G 1, . . . , G n can correspond to the same partition, for instance by interchanging the numbers of two clusters. In order to break these symmetries, the constraint precede([G 1, . . . , G n], [1, . . . , K max]) is used. This constraint ensures that G 1 = 1 and, moreover, that if G i = c with 1 < c ≤ K max, there must exist at least one index j < i such that G j = c − 1. The requirement to have at least K min clusters means that all the numbers between 1 and K min must be used in the assignment of the variables G i. When using the constraint precede, one only needs to require that at least one variable G i is equal to K min. This is expressed by the relation #{i | G i = K min} ≥ 1.


• Constraints to express clustering user constraints. All popular user-defined constraints may be straightforwardly integrated. For instance, a must-link (or cannot-link) constraint on oi and o j is expressed by G i = G j (or G i ≠ G j, respectively). For a minimal cluster size α, each point must be in a cluster with at least α points (including itself). This is expressed by n constraints: for each i ∈ [1, n], the value assigned to the variable G i must appear at least α times in the array G 1, . . . , G n, i.e. #{ j | G j = G i} ≥ α.
• Constraint to express the objective function. Different optimisation criteria are available: minimising the maximal diameter D of the clusters, maximising the minimal split S between clusters, minimising the within-cluster sum of dissimilarities (WCSD), or minimising the within-cluster sum of squares (WCSS). A global optimisation constraint is developed for each criterion, along with a filtering algorithm. For instance, if the user chooses to optimise the sum of squares, the variable V will be linked by the constraint WCSS([G 1, . . . , G n], V, d). In order to improve the performance of CP solvers, different search strategies are elaborated for each criterion. For example, a CP-based framework using repetitive branch-and-bound search has been developed (Guns et al. 2016) for the WCSS criterion. Another strength of the declarative framework is the bi-objective constrained clustering problem. This problem aims to find clusters that are both compact (minimising the maximal diameter) and well separated (maximising the split), under user constraints. In (Dao et al. 2017) it is shown that the framework can solve this problem by iteratively changing the objective function and adding constraints on the other objective value.
This framework has been extended to integrate the four categories of user constraints on properties (cardinality, density, geometric, and complex logic), in order to make clustering actionable (Dao et al. 2016). Schemes are developed to express these categories using CP constraints. For instance, a density constraint provides bounds on the occurrence of some properties on a subset of instances in each cluster. To express this constraint, for each instance oi ∈ O that is eligible (e.g. more than 20 years old), the set of neighbourhood instances N(i) (e.g. persons having the same hobby) is determined. The number of instances of N(i) in the same cluster as oi can be captured using a variable Z i, which is linked by the constraint #{ j ∈ N(i) | G j = G i} = Z i. Arithmetic constraints are then stated on Z i to express density constraints. As an example, for the constraint that each person more than 20 years old must be in the same group as at least 5 people sharing the same hobby, the constraint Z i ≥ 6 (5 other instances plus the instance oi itself) is included. Several CP frameworks have been developed for conceptual clustering (Khiari et al. 2010; Guns et al. 2013; Chabert and Solnon 2017). These frameworks integrate constraints on patterns or groups of patterns as well as different optimisation criteria, for instance, maximising the minimal size of the clusters, or maximising the minimal


size of the patterns defining the clusters. The models are developed using binary variables (Guns et al. 2013) or set variables (Khiari et al. 2010; Chabert and Solnon 2017).
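A generate-and-test rendering of the distance-based model above (illustrative only; a real CP solver prunes with propagation and filtering rather than enumerating all assignments) shows how the G_i variables, the precede symmetry breaker, ML/CL constraints, and the max-diameter objective fit together:

```python
from itertools import product

def precede_ok(G, kmax):
    """Symmetry breaking precede([G_1..G_n], [1..kmax]): value c > 1 may
    appear only after c - 1 has already appeared."""
    seen = 0
    for g in G:
        if g > seen + 1:
            return False
        seen = max(seen, g)
    return True

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def solve(points, kmin, kmax, ml=(), cl=()):
    """Tiny exhaustive version of the CP model: variables G_i in
    {1..kmax}, minimise the maximal cluster diameter D."""
    n = len(points)
    best = (float("inf"), None)
    for G in product(range(1, kmax + 1), repeat=n):
        if not precede_ok(G, kmax):
            continue
        if max(G) < kmin:                      # at least kmin clusters used
            continue
        if any(G[i] != G[j] for i, j in ml):   # must-link
            continue
        if any(G[i] == G[j] for i, j in cl):   # cannot-link
            continue
        D = max((dist(points[i], points[j])
                 for i in range(n) for j in range(i) if G[i] == G[j]),
                default=0.0)
        if D < best[0]:
            best = (D, G)
    return best
```

Under precede, the values 1..max(G) are all used, so "at least K_min clusters" reduces to max(G) ≥ K_min, matching the #{i | G_i = K_min} ≥ 1 formulation.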

5 Collaborative Constrained Clustering

Over the last fifty years, a huge number of new clustering algorithms have been developed, and existing methods have been modified and improved (Jain et al. 1999; Rauber et al. 2000; Xu and Wunsch 2005). This abundance of methods can be explained by the ill-posed nature of the problem: each clustering algorithm is biased by the objective function it uses to build the clusters. Consequently, different methods can produce very different clustering results from the same data. Furthermore, the same algorithm can produce different results depending upon its parameters and initialisation. A relatively recent approach to circumvent this problem considers that the information offered by different sources and different clusterings is complementary (Kittler 1998). A single clustering is produced from the results of methods that have different points of view, and each individual clustering opinion is used to find a consensual decision. Thus, the combination of different clusterings may increase their efficiency and accuracy. Each decision can be processed from a different source or medium. The final result can be produced directly from the independently obtained results (ensemble clustering) or from the result of a collaborative process (collaborative clustering).

5.1 Ensemble Clustering

Ensemble clustering methods aim to improve the overall quality of the clustering by reducing the bias of each single algorithm (Hadjitodorov and Kuncheva 2007). Ensemble clustering is composed of two steps. First, multiple clusterings are produced by a set of methods having different points of view. These methods can be different clustering algorithms (Strehl and Ghosh 2002) or the same algorithm with different parameter values or initialisations (Fred and Jain 2002). The final result is then derived from the independently obtained results by applying a consensus function. Constraints can be integrated in two manners: each learning agent integrates them in its own fashion, or they are applied in the consensus function. The former approach faces an important dilemma: favour either diversity or quality. High quality is desired, but the gain of ensemble clustering is derived from diversity (which avoids biased solutions). Clusterings produced by constrained algorithms tend to have a low variance, which implies low diversity (Yang et al. 2017), especially when the same set of constraints is used. The advantage of ensemble clustering is therefore limited. Implementations of the first approach exist (Yu et al. 2011; Yang et al. 2012). For example, Iqbal et al. (2012) develop the semi-supervised clustering ensembles by


voting (SCEV) algorithm, in which diversity is balanced by using different types of semi-supervised algorithms (i.e. constrained k-Means, COP-KMeans, SP-Kmeans, etc.). In the first step, each semi-supervised agent computes a clustering given the data and the set of constraints. The algorithm then combines all the results using a voting procedure, after having relabelled and aligned the different clustering results. The authors propose to integrate a weight for each agent's contribution into the voting procedure. This weight is a combination of two sub-weights: the first is defined a priori, based upon the expert's trust of each agent according to the data (i.e. seeded k-Means is more efficient for noisy data, COP-KMeans and constraints are more efficient if the data is noise free); the second is also user defined but based upon the user's feedback on the clustering result. As such, the algorithm allows more flexibility and user control over the clustering. The second approach focuses on applying constraints in the consensus function (Al-Razgan and Domeniconi 2009; Xiao et al. 2016; Dimitriadou et al. 2002). These algorithms start by generating the set of clusterings from the clustering agents. The constraints are then integrated in the consensus function, which can be divided into four steps:
1. generate a similarity matrix from the set of clusterings;
2. construct a sparse graph from this similarity matrix using the CHAMELEON algorithm: an edge is constructed between two vertices if the value in the similarity matrix is greater than zero for the corresponding elements;
3. partition the graph into a large number of sub-clusters using the METIS method;
4. merge the sub-clusters using an agglomerative hierarchical clustering approach by finding the most similar pair of sub-clusters.
Constraints are integrated during partitioning. Cannot-link constraints are used as priorities for the split operation: sub-clusters that contain a CL constraint are partitioned until the two elements in the constraint are allocated to two different clusters.
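The first consensus step, building a similarity matrix from the ensemble, and the injection of cannot-link constraints into it can be sketched as follows (the CHAMELEON/METIS graph steps are beyond this fragment; the function names are ours):

```python
import numpy as np

def coassociation(clusterings):
    """Consensus similarity matrix of an ensemble: the fraction of base
    clusterings that put each pair of objects in the same cluster."""
    n = len(clusterings[0])
    S = np.zeros((n, n))
    for labels in clusterings:
        labels = np.asarray(labels)
        S += (labels[:, None] == labels[None, :]).astype(float)
    return S / len(clusterings)

def apply_cl(S, cl):
    """Constraints enter the consensus step: a cannot-link pair is forced
    to zero similarity regardless of the base clusterings."""
    S = S.copy()
    for i, j in cl:
        S[i, j] = S[j, i] = 0.0
    return S
```

Any graph-based or hierarchical consensus method can then be run on the constrained matrix.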

5.2 Collaborative Clustering

Collaborative clustering consists of making multiple clustering methods collaborate to reach an agreement on a data partitioning. While ensemble clustering (and consensus clustering (Monti et al. 2003; Li and Ding 2008)) focuses on merging clustering results, collaborative clustering focuses on iteratively modifying the clustering results by sharing information between them (Wemmert et al. 2000; Gançarski and Wemmert 2007; Pedrycz 2002). Consequently, it extends ensemble clustering by adding a refinement step before the unification of the results. For instance, in Samarah (Wemmert et al. 2000; Gançarski and Wemmert 2007), each clustering algorithm modifies its results according to all the other clusterings until all the clusterings proposed by the different methods are strongly similar. Thus, they can be more easily unified, for example through a voting algorithm.
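The relabel-and-vote unification mentioned above can be sketched with a greedy cluster alignment (practical systems use an optimal matching, e.g. Hungarian assignment; the helper names are ours):

```python
from collections import Counter

def align(reference, labels):
    """Greedily map each cluster id in `labels` to the reference id it
    overlaps most, so that votes become comparable across clusterings.
    (The greedy mapping is not guaranteed to be one-to-one.)"""
    mapping = {}
    for c in set(labels):
        overlap = Counter(r for r, l in zip(reference, labels) if l == c)
        mapping[c] = overlap.most_common(1)[0][0]
    return [mapping[l] for l in labels]

def vote(clusterings):
    """Majority vote over clusterings aligned to the first one."""
    ref = clusterings[0]
    aligned = [ref] + [align(ref, c) for c in clusterings[1:]]
    return [Counter(col).most_common(1)[0][0] for col in zip(*aligned)]
```

In a collaborative setting the vote is applied only after refinement, when the clusterings are already strongly similar and the alignment is therefore nearly unambiguous.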


Three stages at which user constraints can be integrated in the collaborative process can be identified (Forestier et al. 2010a): (1) during the generation of the final result (by labeling the clusters of the final result using label constraints); (2) directly in the collaborative clustering (in order to guide the collaborative process); (3) by using constrained agents. Integrating user constraints into the learning agents (3) is complex because it requires extensive modification of each of the clustering methods involved. The complexity of integrating constraints in the collaboration (2) depends on how information is exchanged between the learning agents. Integrating the constraints after collaboration (1), however, does not interfere with the collaborative process, which makes it easier to implement. To illustrate the second level, the Samarah method is first introduced; Sect. 5.2.2 then presents the method for integrating constraints into the collaborative process.

5.2.1 Samarah: A Framework for Collaborative Multistrategy Clustering

Samarah (Forestier et al. 2010a) is based on the principle of mutual and iterative refinement of multiple clustering algorithms. The process can be decomposed into three main steps:
1. the generation of the initial results;
2. the refinement of the different results;
3. the combination of the refined results.
The first step consists of generating the initial results that will be used during the process. In this step, different algorithms or the same algorithm with different parameters can be used. During the refinement stage, each result is compared to the set of results proposed by the other methods, the goal being to evaluate the similarity between the different results in order to observe differences between the clusterings. Once these differences (named conflicts) are identified, the objective is to modify the clusterings to reduce them, together with the number of constraint violations, i.e. to resolve the conflicts (Forestier et al. 2010b). Conflicts are resolved by iteratively merging, splitting, or re-clustering clusters. This step can be seen as a questioning of each result according to the information provided by the other actors in the collaboration and the background knowledge. After multiple iterations of refinement (in which a local similarity criterion is used to evaluate whether the modification of a pair of results is relevant (Forestier et al. 2010a)), the results are expected to be more similar than before the collaboration began. During the third and final step, the refined results are combined to propose a final and unique result (a combination that is simplified by the similarity of the results).

5.2.2 Knowledge Integration in the Samarah Collaborative Method

During the refinement step of the Samarah method, a local similarity criterion γ i,j is used to evaluate whether the proposed modification of a pair of results is relevant (Forestier et al. 2010a). This criterion includes a quality criterion δ i which represents the quality of the result R i. It therefore balances the refinement between the similarity and the quality of the expected results. It is computed for two aspects of the results: the internal and external qualities. The internal evaluation consists of evaluating the quality of the result through an unsupervised measure. The external evaluation consists of evaluating the quality of the result according to external knowledge, such as an estimation of the number of clusters, some labeled samples, or some constraints. The original version of Samarah included internal knowledge, and the only external knowledge was an estimate of the number of clusters. To take into account additional external knowledge, the quality criterion has been extended to measure the level of agreement of the results with different kinds of constraints (Forestier et al. 2010b), such that

    δ^i = Σ_{c=1}^{Nc} qc(R^i) × pc,    (4)

where N_c is the number of constraints to respect, q_c is the criterion used to evaluate the result according to the cth constraint (q_c(·) ∈ [0, 1]) and p_c is the relative importance given by the user to the cth constraint (p_1 + p_2 + ⋯ + p_{N_c} = 1). By default, each constraint is given a weight of 1/N_c. Thus, any constraint that can be defined as a function taking its values in [0, 1] can be integrated into the process. The methods to integrate some frequently encountered constraints are as follows.

Cluster quality constraints are based on the intrinsic quality of clusters, such as inertia or predictivity, and include the number of clusters. Criteria such as inertia or compactness need to be balanced with an evaluation of the number of clusters. An example of a criterion that includes both the quality of the clusters and the number of clusters is

$$q_{qb}(R^i) = \frac{p^i}{n^i} \sum_{k=1}^{n^i} \tau_k^i, \qquad (5)$$

where n^i is the number of clusters of R^i, τ_k^i defines the internal quality of the kth cluster, and p^i is the external quality of the result. The internal quality of the kth cluster is given by

$$\tau_k^i = \begin{cases} 0, & \text{if } \dfrac{1}{n_k^i} \displaystyle\sum_{l=1}^{n_k^i} \dfrac{d(x_{k,l}^i, g_k^i)}{d(x_{k,l}^i, g^i)} > 1, \\[2mm] 1 - \dfrac{1}{n_k^i} \displaystyle\sum_{l=1}^{n_k^i} \dfrac{d(x_{k,l}^i, g_k^i)}{d(x_{k,l}^i, g^i)}, & \text{otherwise,} \end{cases} \qquad (6)$$


P. Gançarski et al.

where n_k^i is the cardinality of C_k^i, g_k^i is the gravity center of C_k^i, g^i is the gravity center of the cluster closest to x_{k,l}^i, and d is the distance function. The measure is computed on each cluster to evaluate the overall quality of the clustering result. To take the number of clusters n^i into account, the criterion p^i is defined as

$$p^i = \frac{n_{\sup} - n_{\inf}}{|n^i - n_{\inf}| + |n_{\sup} - n^i|}, \qquad (7)$$

where [n_{\inf}, n_{\sup}] is the range of the expected number of clusters, which is given by the user.

Class label constraints correspond to the case where a set of labeled samples is available. To evaluate the agreement between results and such constraints, we can use any index which enables us to evaluate the similarity between a clustering and a labeled classification (where all the classes are known, and each object belongs to one of these classes). To achieve this, it is only necessary to compare results with a given partial partition R which represents the known labeled objects. In the Samarah method, the Rand index (Rand 1971) or the WG agreement index (Wemmert et al. 2000) is used. The Rand index is a measure of similarity between two data partitions, such that

$$\mathrm{Rand}(R^i, R) = \frac{a + b}{\binom{n}{2}}, \qquad (8)$$

where n is the number of objects to classify, a is the number of pairs of objects which are in the same cluster both in R^i and in the known result R, and b is the number of pairs of objects which are in the same cluster neither in the proposed result R^i nor in the known result R. The sum of these two measurements (a and b) can be seen as the number of times that the two partitions are in agreement. This index takes values in [0, 1], where 1 indicates that the two partitions are identical. A constraint q_{rand}(R^i) can therefore be defined, such that

$$q_{rand}(R^i) = \mathrm{Rand}(R^i, R). \qquad (9)$$
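A direct reading of Eq. (8), assuming clusterings are given as label sequences over the same objects (the function name is ours):

```python
from itertools import combinations

def rand_index(a, b):
    """Rand index (Eq. 8): the fraction of the C(n, 2) object pairs on which
    the two partitions agree (co-clustered in both, or separated in both)."""
    pairs = list(combinations(range(len(a)), 2))
    agreements = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agreements / len(pairs)
```

Note that the label values themselves need not coincide; only the induced partitions matter, so `rand_index([0, 0, 1, 1], [5, 5, 7, 7])` is 1.0.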

The WG agreement index is defined by

$$WG(R^i, R) = \frac{1}{n} \sum_{k=1}^{n^i} S\left(C_k^i, R\right) \left|C_k^i\right|, \qquad (10)$$

where n is the number of objects to classify and R is the reference partition (e.g. a labeled classification, another clustering, etc.). This index takes values in [0, 1], where 1 indicates that all the objects in the clustering R^i are well classified according to the object labels in R. A constraint q_{wg}(R^i) can therefore be defined, such that

$$q_{wg}(R^i) = WG(R^i, R). \qquad (11)$$


Link constraints correspond to the case where knowledge is expressed as must-link and cannot-link constraints between objects (see Sect. 2.2). In this case, the proportion of respected constraints can easily be computed, such that

$$q_{link}(R^i) = \frac{1}{n_r} \sum_{j=1}^{n_r} v(R^i, l_j), \qquad (12)$$

where n_r is the number of constraints between the objects, l_j is a must-link or cannot-link constraint, and v(R^i, l_j) = 1 if R^i respects the constraint l_j and 0 otherwise. Note that such constraints can be extracted from class-label constraints. For example, a must-link constraint can be created for all pairs of objects belonging to the same class, and a cannot-link constraint can be created for all pairs of objects belonging to different classes.
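Equation (12) and the label-derived constraints can be sketched as follows (a minimal reading, with clusterings as label lists; the function names are ours):

```python
def q_link(labels, must_link, cannot_link):
    """Fraction of respected constraints (Eq. 12): v = 1 when a must-link pair
    shares a cluster or a cannot-link pair does not, 0 otherwise."""
    respected = sum(labels[i] == labels[j] for i, j in must_link)
    respected += sum(labels[i] != labels[j] for i, j in cannot_link)
    return respected / (len(must_link) + len(cannot_link))

def constraints_from_class_labels(classes):
    """Derive ML/CL constraints from labeled samples: same class -> must-link,
    different classes -> cannot-link (`classes` maps object index to class)."""
    items = sorted(classes.items())
    ml, cl = [], []
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            (i, ci), (j, cj) = items[a], items[b]
            (ml if ci == cj else cl).append((i, j))
    return ml, cl
```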

6 New Trends

Obtaining useful results with pattern mining methods remains a difficult task. Careful tuning of the algorithm parameters and filtering of the results are needed. This requires considerable effort and expertise from the data analyst. As a consequence, the idea of interactive or exploratory data mining has been proposed (van Leeuwen 2014). Exploratory data mining looks for models and patterns that explain the data as much as possible by developing user interaction to influence the search and the results. This section deals with these trends as they concern clustering. Scientific challenges (Sect. 6.1.1), an example of user interaction (Sect. 6.1.2), and an example of incremental and collaborative clustering (Sect. 6.1.3) are given. Limitations of the constraint paradigm are sketched and, moving beyond constraints, exploratory data mining is discussed. It is shown that preferences are a way to address pattern mining tasks (Sect. 6.2.1) and that exploratory data mining enables the capturing of implicit preferences (Sect. 6.2.2). Chapter “Databases and Artificial Intelligence” of Volume 3 describes preference queries in the field of databases.

6.1 Interactive and Incremental Constrained Clustering

6.1.1 Challenges

Preliminary studies have revealed numerous scientific challenges concerning the objectives mentioned in the previous sections. For example, it is necessary to study and detail the thematic constraints (i.e. those of the domain of application) that the expert may formulate to guide the process.


These constraints can be extremely broad and have to be translated into actionable constraints. In the current state of knowledge, they are generally limited to constraints that can be directly translated into comparison constraints such as ML/CL, labeling constraints, or constraints on cluster number or cluster size. Thus, for example, the following can be accepted: “these two objects seem to be of the same nature”, or “these two sets of objects are of the same nature” (ML constraints between all the pairs of objects of the two sets); “these three sets are totally different” (CL constraints on all the pairs of objects from the three sets); “this object is of type C” (labeling constraint); and “a cluster cannot represent more than 20% of the image” (constraint on cluster size).

Generating actionable constraints from a set, however, can rapidly lead to a significant increase in combinatorial complexity. For example, a constraint “of the same nature” on two sets of size N1 and N2 respectively will, using the naive approach, generate N1(N1 − 1) + N2(N2 − 1) + N1 × N2 ML constraints. These are transitive (Wagstaff et al. 2001), however, and therefore the number of constraints needed to satisfy the user requirement can be reduced to (N1 + N2) − 1 under the assumption of guaranteed constraint satisfaction. Expressing all the constraints can be very time consuming, particularly in the case of data mining in big-data problems, where the sizes of the sets N1 and N2 can be considerable. The following problems therefore need to be tackled.

• How to design algorithms that are able to deal with large constraint sets?
  – Reduce the size of the model by limiting the number of considered elements, for example by sampling or by identifying irrelevant objects.
  – Reduce the number of constraints without loss of quality: sample the constraints or sample the objects under constraint; identify the categories of constraints and study search strategies.
  – Relax the optimality of the solution using a threshold on the execution time (which is easy to choose but cannot guarantee the result’s quality).
  – Use a local instead of a global search.
• How to limit the number of constraints—ideally, to define a minimal set of constraints—and how to use incremental approaches to allow a user to give such a set?
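The transitivity-based reduction mentioned above is straightforward to implement: chaining the objects needs only (N1 + N2) − 1 must-link constraints instead of one per pair (a sketch; the helper name is ours):

```python
def chained_ml_constraints(*groups):
    """Encode "all objects in these groups are of the same nature" with a chain
    of must-link constraints; by ML transitivity (Wagstaff et al. 2001) the
    chain implies every pairwise constraint, using only (sum of sizes) - 1."""
    objects = [obj for group in groups for obj in group]
    return list(zip(objects, objects[1:]))
```

For two groups of sizes 3 and 2 this emits 4 constraints instead of the 14 produced by the naive encoding in the text; as noted above, the reduction is only valid when constraint satisfaction is guaranteed.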

6.1.2 Interactive Clustering

In an interactive clustering model, the algorithm proposes a clustering of the data to the user and receives some feedback on the current solution. Taking the feedback into account, the algorithm makes changes to the clustering and proposes the new result to the user. This step is iterated until the user is satisfied with the clustering. The improvements to the clustering given by the user are usually in the form of splitting or merging clusters (Balcan and Blum 2008; Cutting et al. 1992; Awasthi and Zadeh 2010; Awasthi et al. 2017). The aim of efficient algorithms is to require as little user
interaction as possible. User feedback can also be the rejection of some clusters and the request for new ones. The system returns another clustering, which is chosen to fit the data as well as possible, while avoiding the creation of a cluster that is similar to any of those previously rejected. Formalising this in a Bayesian framework, after the user rejects a set of clusters, the prior distribution over model parameters is modified to severely downweight regions of the parameter space that would lead to clusters similar to those previously rejected (Srivastava et al. 2016). Another kind of feedback is that associated with each cluster: the user can lock the cluster (it is then no longer modified), refine the cluster by adding or removing elements, or change the pairwise distance of elements within a cluster. The distance between elements is recomputed accordingly, the unlocked clusters are re-clustered and the process is repeated until no unlocked clusters remain (Coden et al. 2017). Interaction with the user can also be made at different stages of the clustering process, as in an interactive approach for app search clustering in the Google Play Store, which incorporates human input in a knowledge-graph-based clustering process (Chang et al. 2016). Instead of directly clustering apps, the algorithm extracts topic labels from search results, runs clustering on a semantic graph of the topic labels and assigns search results to topic clusters. The interactive interface lets domain experts steer the clustering process at different stages: refining the input to the clustering algorithm, steering the algorithm to generate more or less fine-grained clusters, and finally editing topic label clusters and topic labels.

User feedback through constraints on the clustering makes the clustering actionable for some purposes. Let us consider the case where the user already has a clustering, which was obtained by their favourite clustering algorithm. The clustering is generally good but has a few undesirable properties. One ad hoc way to fix this is to re-run the constrained clustering algorithm, but there is no guarantee that the obtained clustering will be as good as the first one. Instead, Kuo et al. (2017) propose to minimally modify the existing clustering while reaching the desired properties given by user feedback. User feedback can be, for example, splitting or merging some clusters, stating a bound on the diameter of the clusters, or stating a bound on the size of the clusters to make them more balanced.

6.1.3 Incremental Constrained Clustering

Many constrained clustering methods require that the complete set of constraints be given before running the algorithm. This is very often unrealistic. For example, when a geographer tries to extract relevant clusters from a remote sensing image, it is almost impossible, given the number of clusters and the large space of possible constraints, to give such constraints a priori. Indeed, experiments show that the user will tend to give “obvious” and uninformative constraints (e.g., pixels in the same homogeneous region must be clustered together, or pixels of roads and pixels of vegetation cannot be in the same cluster, even if the “colors” of these pixels are sufficient to decide): these constraints will not impact the algorithm in any way. In other words, the algorithm will find the same result regardless of the constraints. A
way to tackle this problem would be to follow the example of interactive supervised learning methods and allow the user to inject constraints based on the results obtained (Davidson et al. 2007). In this manner the algorithm could reduce its uncertainty by obtaining new constraints that are related to uncertain zones (e.g., cluster edges, areas of high object density, etc.). These highly informative constraints could be proposed to the user, who may validate them, or not, according to their knowledge. The hope is that this approach (selectively choosing constraints) produces significantly better results when compared to randomly choosing constraints. Cohn et al. (2003) describe an experiment (discussed by Davidson et al. (2007)) concerning document clustering, in which ten constraints incrementally chosen by a user (not by the machine) produced results as good as those obtained using between 3000 and 6000 randomly chosen constraints. In the same way, our geographer could add label constraints (“Does the region belong to the thematic water class?”), ML/CL constraints (“Should these regions be together/apart?”) or cluster constraints (“Should this cluster be removed?”) according to the results obtained and the proposals of the algorithm. Even with research that is concerned with defining the conditions for the application of such methods (Raj et al. 2013; Vu and Labroche 2017), a large number of scientific obstacles still remain to be overcome:

• How to evaluate the informativeness of a constraint (some progress in this area has been made by Davidson et al. 2006 and Wagstaff et al. 2006)?
• How to integrate new constraints while limiting the effects on the results (a strong modification of the result can confuse the user)?
• How to concretely design incremental algorithms, not in terms of new data or interactive cluster modifications but in terms of new constraints?
• How to remove a constraint already taken into account without starting afresh?
• How to deal with inconsistent constraints: should a new constraint be rejected if it is inconsistent with a previously added constraint, or should the previously added constraint be removed?

As such, the development of incremental interactive constrained clustering methods remains a very challenging research problem.

6.2 Beyond Constraints: Exploratory Data Analysis

6.2.1 From Constraints to Explicit Preferences

The notion of constraints is at the core of numerous works in pattern mining, as presented in this chapter. Nevertheless, constraint-based pattern mining assumes that the user is able to express what they are looking for, and requires finely tuned thresholds. The result is a collection of patterns that is often too large to be truly exploited. This picture may explain why preferences in pattern mining have become more and more important. Preferences in pattern mining do not arise from nothing. In constraint-based pattern mining, the utility functions measure the interest of a pattern and can
be seen as a quantitative preference model (Yao and Hamilton 2006; Fürnkranz et al. 2012; Geng and Hamilton 2006). Many other mechanisms have been developed, such as mining the most interesting patterns with respect to one measure (top-k patterns, Wang et al. 2005) or several measures (skyline patterns, Cho et al. 2005; Soulet et al. 2011; van Leeuwen and Ukkonen 2013); reducing redundancy by integrating subjective interestingness (Gallo et al. 2007; Bie 2011; De Bie 2013; van Leeuwen et al. 2016); and casting the pattern mining task as an optimisation problem. Even though it has been realised for a long time that it is difficult for a data analyst to model their interest in terms of constraints and to overcome the well-known thresholding issue, researchers have only recently intensified their study of methods for finding high-quality patterns according to a user’s preferences.

We briefly introduce the example of skyline patterns (Cho et al. 2005; Soulet et al. 2011; van Leeuwen and Ukkonen 2013), which can be seen as a generalisation of the well-known top-k patterns (Wang et al. 2005; Ke et al. 2009). Top-k patterns integrate user preferences in the form of a score in order to limit the number of extracted patterns. By associating each pattern with a rank score, this approach returns an ordered list of the k patterns with the highest score to the user. Nevertheless, top-k patterns suffer from a diversity issue (top-k patterns tend to be similar) and the performance of top-k approaches is often sensitive to the size of the datasets and to the threshold value k. Even worse, combining several measures into a single scoring function is difficult. Skyline patterns introduce the idea of skyline queries (Börzsönyi et al. 2001) into the pattern discovery framework. Such queries have attracted considerable attention due to their importance in multi-criteria decision making, where they are usually called “Pareto efficiency” or “optimality queries”. Briefly, in a multidimensional space where a preference is defined for each dimension, a point a dominates another point b if a is better (i.e. more preferred) than b in at least one dimension, and a is not worse than b in any other dimension. For example, a user selecting a set of patterns may prefer a pattern with a high frequency, a large length, and a high confidence. In this case, we say that pattern a dominates another pattern b if frequency(a) ≥ frequency(b), length(a) ≥ length(b) and confidence(a) ≥ confidence(b), where at least one strict inequality holds. Given a set of patterns, the skyline set contains the patterns that are not dominated by any other pattern.

Skyline pattern mining is interesting for several reasons. First, skyline processing does not require any threshold tuning. Second, for many pattern mining applications it is often difficult (or impossible) to find a reasonable global ranking function; the idea of finding all optimal solutions in the pattern space with respect to multiple preferences is thus appealing. Third, the formal property of dominance satisfied by the skyline patterns defines a global interestingness measure with semantics easily understood by the user. While the notion of skylines has been extensively developed in engineering and database applications, it remained unused for data mining purposes until recently (Cho et al. 2005; Soulet et al. 2011). In this kind of approach, preferences (or measures) are explicitly given by the user.
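The dominance test and skyline extraction described above are compact to state (a naive O(n²) sketch; patterns are tuples of measure values, all oriented so that larger is better):

```python
def dominates(a, b):
    """a dominates b: at least as preferred on every measure, strictly better
    on at least one (all measures to be maximised)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(patterns):
    """The patterns not dominated by any other pattern (the Pareto front)."""
    return [p for p in patterns if not any(dominates(q, p) for q in patterns)]
```

With (frequency, length, confidence) triples, for instance, `(5, 3, 0.9)` dominates `(4, 3, 0.9)` but is incomparable with `(2, 6, 0.5)`, so both of the latter's fates differ: the first is pruned, the second stays on the skyline.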

6.2.2 From Explicit Preferences to Implicit Preferences

All of the approaches introduced in the previous section assume that preferences are explicit and given in the process. In practice, the user only has a vague idea of which patterns could be useful, and there is therefore a need to elicit preferences. The recent research field of interactive pattern mining relies on the automatic acquisition of these preferences (van Leeuwen 2014). Basically, its principle is to repeat a short mining loop centred on the user: (1) the user poses an initial query to the system, which returns an initial result; (2) the user designates components or aspects of this result as (un)desirable or (un)interesting; (3) the system translates the user feedback into a model of the user’s preferences and uses this model to adapt its search strategy; (4) a new result is produced and the process returns to step (2). At each iteration, only some patterns are mined and the user has to indicate those which are relevant (Dzyuba et al. 2014) by, for example, liking/disliking, rating, or ranking. The user feedback improves an automatically learned model of preferences that will refine the pattern mining step in the next iteration. A great advantage is that the user does not have to explicitly state their preference model.

Interactive pattern mining raises several challenges. The first is the design of user feedback options. The easiest forms of feedback to take into consideration are explicit and more or less binary in nature: the requirement that certain instances be grouped together or kept apart, or that particular descriptors be included in a cluster’s description. Unfortunately, experts are unlikely to be able to give this kind of feedback, especially early on in the discovery process. A second, less explicit but still easy to use, form of feedback allows the user to designate components of the result, e.g. descriptions, as interesting or uninteresting, or to express preferences, denoting a (component of a) result as more interesting than another (Dzyuba et al. 2014). The first form of feedback translates into constraints that existing prototypes can include in the system. The second form requires a notion of equivalence between (or alternatives to) returned descriptions, to replace uninteresting ones or to allow the user to express a preference over pairs; this requires multiple characterisations and the integration of pairwise comparisons. A long-term goal is to elicit and learn a preference model. Of equal importance is the design of methods following the principle of instant data mining (Boley et al. 2011), to avoid the expert user “checking out” of the process. Each iteration must be fast and the result must be provided in a concise form so that the user is not overwhelmed with a huge collection of patterns that are impossible to analyse. Instant data mining is based on sampling techniques and provides a representative set of patterns without explicitly searching the pattern space. These techniques, however, handle a limited set of measures or constraints (Giacometti and Soulet 2016).

Subjective interestingness is a way of exploiting user feedback to directly influence search. Interactive diverse subgroup discovery (IDSD) (Dzyuba and van Leeuwen 2013) is an interactive algorithm that allows a user to provide feedback with respect to provisional results, in order to avoid subgroups corresponding to common knowledge, which is usually uninteresting to a domain expert. The beam selection strategy is made
interactive on each level of the search, the interestingness measure thus becoming subjective. The one click mining system (Boley et al. 2013) extracts local patterns through a mechanism based on two types of preferences: one is used to allocate the computation time to different mining algorithms, the other to learn a utility function that computes a ranking over all mined patterns. Both learning algorithms rely on inputs corresponding to implicit user feedback.

A formal framework for exploratory data mining is proposed by De Bie (2011), who argues that traditional objective quality measures are of limited practical use and proposes a general framework that models background knowledge. Subjective interestingness is formalised by information theory. The principle is to consider prior beliefs, e.g. background information, as constraints on a probabilistic model representing the uncertainty of the data. Given the prior beliefs, the maximum entropy distribution is used to model the data. One can then compute how informative a pattern is given the current model. This framework follows the iterative data mining process: starting from a MaxEnt model based on prior beliefs, the subjectively most interesting pattern is searched for and added to the model, after which one can start looking for the next pattern. The exact implementation depends on the specific data and pattern types.

Finally, approaches based on the declarative modelling paradigm (see Sect. 4 in this chapter) help exploratory data analysis. Indeed, the data analyst focuses on the specification of the desired results through constraints (and optionally an optimisation criterion) rather than describing how the solution should be computed. The assumption of the declarative modelling paradigm is that a data analyst is able to express constraints that can be iteratively added to the declarative model.
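The generic mine/feedback/update loop described at the beginning of this section can be sketched with a toy linear preference model. Everything here is illustrative: patterns are feature vectors, the `user_likes` callback stands in for the analyst's like/dislike feedback, and the weight update is a naive perceptron-style stand-in for the preference-learning step.

```python
def score(w, p):
    """Learned utility of a pattern under the current preference weights."""
    return sum(wi * pi for wi, pi in zip(w, p))

def interactive_mining(patterns, user_likes, rounds=5, k=2, lr=0.1):
    """Repeat the loop: show the top-k patterns under the current model,
    collect like/dislike feedback, nudge the weights toward liked patterns
    and away from disliked ones, then re-rank the whole collection."""
    w = [0.0] * len(patterns[0])
    for _ in range(rounds):
        shown = sorted(patterns, key=lambda p: score(w, p), reverse=True)[:k]
        for p in shown:
            delta = lr if user_likes(p) else -lr
            w = [wi + delta * pi for wi, pi in zip(w, p)]
    return sorted(patterns, key=lambda p: score(w, p), reverse=True)
```

After a few rounds, the ranking drifts toward the dimension the simulated user rewards, without that user ever stating a preference model explicitly.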

7 Conclusions

While the development and generalisation of deep learning approaches is revolutionising supervised learning, particularly in the area of decision support, attention is becoming increasingly focused on unsupervised learning. This is due, in part, to the pressing need for methods that allow one to explore large data sets without well-defined preconceptions, and without predefined categories that can be used as labels for training instances. Nevertheless, without any supervision, these approaches can lead to irrelevant results. A way to circumvent this problem is to (re)introduce experts into the analysis process and thus to define methods capable of taking domain knowledge into account without the fastidious preliminary step of sample annotation. In this chapter, we have presented the alternative approach to clustering in which the process is guided by user constraints in order to produce more relevant results. Here ‘relevant’ means more directly matched to the expert’s thematic intuition, that is to say to potential thematic classes. Methods derived from this approach have been developed and have demonstrated their effectiveness and applicability in many areas.

Despite the increasing number of methods and tools dedicated to constrained clustering and the surge of interest in constrained clustering, this paradigm is still
surprisingly infrequently used. An explanation for this lies in the issues that remain to be explored and addressed. Without claiming to be exhaustive, this chapter has listed some of the scientific obstacles to be overcome: how to define more expressive operable constraints, for instance constraints involving more objects (“A is closer to B than to C”) or conditional constraints (“If A is with B then C cannot be with B”)? How to deal with increasing volumes of data, which can lead to an explosion of the number of constraints? How to design incremental interactive methods that are able to deal with incoherent constraints? Nevertheless, while these issues are important and must be addressed by computer scientists, it is apparent that the main obstacle preventing the adoption of these approaches is the lack of theoretical and practical understanding of the “translation” of an expert’s knowledge into actionable constraints. As such, research effort should be focused, on the one hand, on ways of translating domain knowledge into thematic constraints and, on the other hand, on the automatic translation of such constraints into actionable constraints.

This survey chapter has presented the principles of constrained clustering. This exciting field has arrived at a time when solutions to knowledge discovery problems in big data are needed, and it promises to offer them. The time is therefore ripe for increasing the exposure and use of these methods. All the while, it is also a domain full of questions, some of which go beyond the current theory of statistical learning, and answering these questions promises to stimulate interesting and innovative research directions.

References

Al-Razgan M, Domeniconi C (2009) Clustering ensembles with active constraints. In: Okun O, Valentini G (eds) Applications of supervised and unsupervised ensemble methods. Springer, Berlin, pp 175–189
Aloise D, Deshpande A, Hansen P, Popat P (2009) NP-hardness of Euclidean sum-of-squares clustering. Mach Learn 75(2):245–248
Aloise D, Hansen P, Liberti L (2012) An improved column generation algorithm for minimum sum-of-squares clustering. Math Program 131(1–2):195–220
Alzate C, Suykens J (2009) A regularized formulation for spectral clustering with pairwise constraints. In: Proceedings of the international joint conference on neural networks, pp 141–148
Anand R, Reddy C (2011) Graph-based clustering with constraints. In: Proceedings of the Pacific-Asia conference on knowledge discovery and data mining, pp 51–62
Anand S, Bell D, Hughes J (1995) The role of domain knowledge in data mining. In: Proceedings of the international conference on information and knowledge management, pp 37–43
Awasthi P, Zadeh RB (2010) Supervised clustering. In: Proceedings of the international conference on neural information processing systems, pp 91–99
Awasthi P, Balcan MF, Voevodski K (2017) Local algorithms for interactive clustering. J Mach Learn Res 18:1–35
Babaki B, Guns T, Nijssen S (2014) Constrained clustering using column generation. In: Proceedings of the international conference on AI and OR techniques in constraint programming for combinatorial optimization problems, pp 438–454


Balcan MF, Blum A (2008) Clustering with interactive feedback. In: Proceedings of the international conference on algorithmic learning theory, pp 316–328
Banerjee A, Ghosh J (2006) Scalable clustering algorithms with balancing constraints. Data Min Knowl Discov 13(3):365–395
Bar-Hillel A, Hertz T, Shental N, Weinshall D (2003) Learning distance functions using equivalence relations. In: Proceedings of the international conference on machine learning, pp 11–18
Bar-Hillel A, Hertz T, Shental M, Weinshall D (2005) Learning a Mahalanobis metric from equivalence constraints. J Mach Learn Res 6:937–965
Basu S, Banerjee A, Mooney R (2002) Semi-supervised clustering by seeding. In: Proceedings of the international conference on machine learning, pp 19–26
Basu S, Banerjee A, Mooney R (2004a) Active semi-supervision for pairwise constrained clustering. In: Proceedings of the SIAM international conference on data mining, pp 333–344
Basu S, Bilenko M, Mooney R (2004b) A probabilistic framework for semi-supervised clustering. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 59–68
Basu S, Davidson I, Wagstaff K (2008) Constrained clustering: advances in algorithms, theory, and applications, 1st edn. Chapman & Hall/CRC, Boca Raton
Beldiceanu N, Carlsson M, Rampon JX (2005) Global constraint catalog. Technical Report T2005-08, SICS and EMN Technical Report
Bellet A, Habrard A, Sebban M (2015) Metric learning. Morgan & Claypool Publishers, San Rafael
Berg J, Järvisalo M (2017) Cost-optimal constrained correlation clustering via weighted partial maximum satisfiability. Artif Intell 244:110–142
Bie TD (2011) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446
Bilenko M, Mooney R (2003) Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 39–48
Bilenko M, Basu S, Mooney R (2004) Integrating constraints and metric learning in semi-supervised clustering. In: Proceedings of the international conference on machine learning, pp 11–18
Boley M, Lucchese C, Paurat D, Gärtner T (2011) Direct local pattern sampling by efficient two-step random procedures. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 582–590
Boley M, Mampaey M, Kang B, Tokmakov P, Wrobel S (2013) One click mining: interactive local pattern discovery through implicit preference and performance learning. In: Proceedings of the ACM SIGKDD workshop on interactive data exploration and analytics, pp 27–35
Börzsönyi S, Kossmann D, Stocker K (2001) The skyline operator. In: Proceedings of the international conference on data engineering, pp 421–430
Boulicaut JF, De Raedt L, Mannila H (eds) (2006) Constraint-based mining and inductive databases. Lecture notes in artificial intelligence, vol 3848. Springer, Berlin
Bradley P, Bennett K, Demiriz A (2000) Constrained k-means clustering. Technical Report MSR-TR-2000-65, Microsoft Research
Chabert M, Solnon C (2017) Constraint programming for multi-criteria conceptual clustering. In: Proceedings of the international conference on principles and practice of constraint programming, pp 460–476
Chang S, Dai P, Hong L, Sheng C, Zhang T, Chi E (2016) AppGrouper: knowledge-based interactive clustering tool for app search results. In: Proceedings of the international conference on intelligent user interfaces, pp 348–358
Chen W, Feng G (2012) Spectral clustering: a semi-supervised approach. Neurocomputing 77(1):229–242
Cheng H, Hua K, Vu K (2008) Constrained locally weighted clustering. Proc VLDB Endow 1(1):90–101
Cho M, Pei J, Wang H, Wang W (2005) Preference-based frequent pattern mining. Int J Data Warehous Min 1(4):56–77

480

P. Gançarski et al.

Coden A, Danilevsky M, Gruhl D, Kato L, Nagarajan M (2017) A method to accelerate human in the loop clustering. In: Proceedings of the SIAM international conference on data mining, pp 237–245 Cohn D, Caruana R, Mccallum A (2003) Semi-supervised clustering with user feedback. Technical Report TR2003-1892. Department of Computer Science, Cornell University Cucuringu M, Koutis I, Chawla S, Miller G, Peng R (2016) Simple and scalable constrained clustering: a generalized spectral method. In: Proceedings of the international conference on artificial intelligence and statistics, pp 445–454 Cutting D, Pedersen J, Karger D, Tukey J (1992) Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the international ACM SIGIR conference on research and development in information retrieval, pp 318–329 Dao TBH, Duong KC, Vrain C (2013) A declarative framework for constrained clustering. In: Proceedings of the joint European conference on machine learning and knowledge discovery in databases, pp 419–434 Dao TBH, Vrain C, Duong KC, Davidson I (2016) A framework for actionable clustering using constraint programming. In: Proceedings of the European conference on artificial intelligence, pp 453–461 Dao TBH, Duong KC, Vrain C (2017) Constrained clustering by constraint programming. Artif Intell 244:70–94 Davidson I, Basu S (2007) A survey of clustering with instance level constraints. ACM Trans Knowl Discov Data 77(1):1–41 Davidson I, Ravi S (2005) Clustering with constraints: feasibility issues and the k-means algorithm. In: Proceedings of the SIAM international conference on data mining, pp 138–149 Davidson I, Ravi S (2006) Identifying and generating easy sets of constraints for clustering. In: Proceedings of the AAAI conference on artificial intelligence, pp 336–341 Davidson I, Ravi S (2007) The complexity of non-hierarchical clustering with instance and cluster level constraints. 
Data Min Knowl Discov 14(1):25–61 Davidson I, Wagstaff K, Basu S (2006) Measuring constraint-set utility for partitional clustering algorithms. In: European conference on principles of data mining and knowledge discovery, pp 115–126 Davidson I, Ester M, Ravi S (2007) Efficient incremental constrained clustering. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 240–249 Davidson I, Ravi S, Shamis L (2010) A SAT-based framework for efficient constrained clustering. In: Proceedings of the SIAM international conference on data mining, pp 94–105 De Bie T (2011) An information theoretic framework for data mining. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 564–572 De Bie T (2013) Subjective interestingness in exploratory data mining. In: Proceedings of the international symposium on intelligent data analysis, pp 19–31 Delattre M, Hansen P (1980) Bicriterion cluster analysis. IEEE Trans Pattern Anal Mach Intell 2(4):277–291 Demiriz A, Bennett K, Embrechts M (1999) Semi-supervised clustering using genetic algorithms. In: Proceedings of the conference on artificial neural networks in engineering, pp 809–814 Demiriz A, Bennett K, Bradley P (2008) Using assignment constraints to avoid empty clusters in k-means clustering. In: Basu S, Davidson I, Wagstaff K (eds) Constrained clustering: advances in algorithms, theory, and applications, 1st edn. Chapman & Hall/CRC, pp 201–220 Dimitriadou E, Weingessel A, Hornik K (2002) A mixed ensemble approach for the semi-supervised problem. In: Proceedings of the international conference on artificial neural networks, pp 571–576 Ding S, Qi B, Jia H, Zhu H, Zhang L (2013) Research of semi-supervised spectral clustering based on constraints expansion. Neural Comput Appl 22:405–410 Dinler D, Tural M (2016) A survey of constrained clustering. In: Celebi M, Aydin K (eds) Unsupervised learning algorithms. Springer, Berlin, pp 207–235

Constrained Clustering: Current and New Trends

481

du Merle O, Hansen P, Jaumard B, Mladenovi´c N (1999) An interior point algorithm for minimum sum-of-squares clustering. SIAM J Sci Comput 21(4):1485–1505 Dzyuba V, van Leeuwen M (2013) Interactive discovery of interesting subgroup sets. In: Proceedings of the international symposium on intelligent data analysis, pp 150–161 Dzyuba V, van Leeuwen M, Nijssen S, De Raedt L (2014) Interactive learning of pattern rankings. Int J Artif Intell Tools 23(6):1460,026 Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the international conference on knowledge discovery and data mining, pp 226–231 Fayyad U, Piatetsky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery: an overview. In: Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining. AAAI/MIT Press, pp 1–36 Fisher D (1987) Knowledge acquisition via incremental conceptual clustering. Mach Learn 2(2):139–172 Forestier G, Gançarski P, Wemmert C (2010a) Collaborative clustering with background knowledge. Data Knowl Eng 69(2):211–228 Forestier G, Wemmert C, Gançarski P (2010b) Towards conflict resolution in collaborative clustering. In: IEEE international conference on intelligent systems, pp 361–366 Fred ALN, Jain AK (2002) Data clustering using evidence accumulation. In: Proceedings of the IEEE international conference on pattern recognition, pp 276–280 Fürnkranz J, Gamberger D, Lavraˇc N (2012) Foundations of rule learning. Cognitive technologies, Springer, Berlin Gallo A, De Bie T, Cristianini N (2007) MINI: mining informative non-redundant itemsets. In: Proceedings of the European conference on principles of data mining and knowledge discovery, pp 438–445 Gançarski P, Wemmert C (2007) Collaborative multi-step mono-level multi-strategy classification. J Multimed Tools Appl 35(1):1–27 Ganji M, Bailey J, Stuckey P (2016) Lagrangian constrained clustering. 
In: Proceedings of the SIAM international conference on data mining, pp 288–296 Ge R, Ester M, Jin W, Davidson I (2007) Constraint-driven clustering. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 320–329 Geng L, Hamilton H (2006) Interestingness measures for data mining: a survey. ACM Comput Surv (CSUR) 38(3):9 Giacometti A, Soulet A (2016) Frequent pattern outlier detection without exhaustive mining. In: Proceedings of the Pacific-Asia conference on knowledge discovery and data mining, pp 196–207 Gilpin S, Davidson I (2011) Incorporating SAT solvers into hierarchical clustering algorithms: an efficient and flexible approach. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 1136–1144 Gilpin S, Davidson I (2017) A flexible ILP formulation for hierarchical clustering. Artif Intell 244:95–109 Gonzalez T (1985) Clustering to minimize the maximum intercluster distance. Theor Comput Sci 38(2):293–306 Grira N, Crucianu M, Boujemaa N (2006) Fuzzy clustering with pairwise constraints for knowledgedriven image categorization. IEE Proc Vis Image Signal Process (CORE B) 153(3):299–304 Guns T, Nijssen S, De Raedt L (2013) k-pattern set mining under constraints. IEEE Trans Knowl Data Eng 25(2):402–418 Guns T, Dao TBH, Vrain C, Duong KC (2016) Repetitive branch-and-bound using constraint programming for constrained minimum sum-of-squares clustering. In: Proceedings of the European conference on artificial intelligence, pp 462–470 Hadjitodorov ST, Kuncheva LI (2007) Selecting diversifying heuristics for cluster ensembles. In: Proceedings of the international workshop on multiple classifier systems, pp 200–209 Hansen P, Delattre M (1978) Complete-link cluster analysis by graph coloring. J Am Stat Assoc 73(362):397–403

482

P. Gançarski et al.

Hansen P, Jaumard B (1997) Cluster analysis and mathematical programming. Math Program 79(1– 3):191–215 Hiep T, Duc N, Trung B (2016) Local search approach for the pairwise constrained clustering problem. In: Proceedings of the symposium on information and communication technology, pp 115–122 Hoi S, Jin R, Lyu M (2007) Learning nonparametric kernel matrices from pairwise constraints. In: International conference on machine learning, pp 361–368 Hoi S, Liu W, Chang SF (2008) Semi-supervised distance metric learning for collaborative image retrieval. In: Proceedings of the IEEE international conference on computer vision and pattern recognition Hoi S, Liu W, Chang SF (2010) Semi-supervised distance metric learning for collaborative image retrieval and clustering. ACM Trans Multimed Comput Commun Appl 6(3):18 Huang H, Cheng Y, Zhao R (2008) A semi-supervised clustering algorithm based on must-link set. In: Proceedings of the international conference on advanced data mining and applications, pp 492–499 Iqbal A, Moh’d A, Zhan Z (2012) Semi-supervised clustering ensemble by voting. In: Proceedings of the international conference on information and communication systems, pp 1–5 Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323 Kamvar S, Klein D, Manning C (2003) Spectral learning. In: Proceedings of the international joint conference on artificial intelligence, pp 561–566 Ke Y, Cheng J, Yu JX (2009) Top-k correlative graph mining. In: Proceedings of the SIAM international conference on data mining, pp 1038–1049 Khiari M, Boizumault P, Crémilleux B (2010) Constraint programming for mining n-ary patterns. In: Proceedings of the international conference on principles and practice of constraint programming, pp 552–567 Kittler J (1998) On combining classifiers. 
IEEE Trans Pattern Anal Mach Intell 20(3):226–239 Klein D, Kamvar S, Manning C (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: Proceedings of the international conference on machine learning, pp 307–314 Kopanas I, Avouris N, Daskalaki S (2002) The role of domain knowledge in a large scale data mining project. In: Proceedings of the Hellenic conference on artificial intelligence, pp 288–299 Kuhn H, Tucker A (1951) Nonlinear programming. In: Proceedings of the Berkeley symposium, pp 481–492 Kulis B, Basu S, Dhillon I, Mooney R (2005) Semi-supervised graph clustering: a kernel approach. In: Proceedings of the international conference on machine learning, pp 457–464 Kulis B, Basu S, Dhillon I, Mooney R (2009) Semi-supervised graph clustering: a kernel approach. Mach Learn 74(1):1–22 Kuo CT, Ravi S, Dao TBH, Vrain C, Davidson I (2017) A framework for minimal clustering modification via constraint programming. In: Proceedings of the AAAI conference on artificial intelligence, pp 1389–1395 LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444 Li T, Ding C (2008) Weighted consensus clustering. In: Proceedings of the SIAM international conference on data mining, pp 798–809 Li T, Ding C, Jordan M (2007) Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization. In: Proceedings of the IEEE international conference on data mining, pp 577–582 Li Z, Liu J, Tang X (2008) Pairwise constraint propagation by semidefinite programming for semisupervised classification. In: Proceedings of the international conference on machine learning, pp 576–583 Li Z, Liu J, Tang X (2009) Constrained clustering via spectral regularization. In: Proceedings of the international conference on computer vision and pattern recognition, pp 421–428 Lu Z, Carreira-Perpinán M (2008) Constrained spectral clustering through affinity propagation. 
In: IEEE conference on computer vision and pattern recognition, pp 1–8

Constrained Clustering: Current and New Trends

483

Lu Z, Ip H (2010) Constrained spectral clustering via exhaustive and efficient constraint propagation. In: Proceedings of the European conference on computer vision, pp 1–14 Métivier KP, Boizumault P, Crémilleux B, Khiari M, Loudni S (2012) Constrained clustering using SAT. In: Proceedings of the international symposium on advances in intelligent data analysis, pp 207–218 Monti S, Tamayo P, Mesirov J, Golub T (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 52(1):91–118 Mueller M, Kramer S (2010) Integer linear programming models for constrained clustering. In: Proceedings of the international conference on discovery science, pp 159–173 Ng M (2000) A note on constrained k-means algorithms. Pattern Recognit 33(3):515–519 Ouali A, Loudni S, Lebbah Y, Boizumault P, Zimmermann A, Loukil L (2016) Efficiently finding conceptual clustering models with integer linear programming. In: Proceedings of the international joint conference on artificial intelligence, pp 647–654 Pedrycz W (2002) Collaborative fuzzy clustering. Pattern Recognit Lett 23(14):1675–1686 Pelleg D, Baras D (2007) K-means with large and noisy constraint sets. In: Proceedings of the European conference on machine learning, pp 674–682 Raj S, Raj P, Ravindran B (2013) Incremental constrained clustering: a decision theoretic approach. In: Proceedings of the Pacific-Asia conference on knowledge discovery and data mining, pp 475–486 Rand W (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(366):846–850 Rangapuram S, Hein M (2012) Constrained 1-spectral clustering. In: Proceedings of the International Conference on Artificial Intelligence and Statistics, pp 1143–1151 Rauber A, Pampalk E, Paraliˇc J (2000) Empirical evaluation of clustering algorithms. J Inf Organ Sci 24(2):195–209 Rossi F, van Beek P, Walsh T (eds) (2006) Handbook of constraint programming. 
Foundations of artificial intelligence. Elsevier B.V, New York Rutayisire T, Yang Y, Lin C, Zhang J (2011) A modified cop-kmeans algorithm based on sequenced cannot-link set. In: Proceedings of the international conference on rough sets and knowledge technology, pp 217–225 Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888–905 Soulet A, Raïssi C, Plantevit M, Cremilleux B (2011) Mining dominant patterns in the sky. In: Proceedings of the IEEE international conference on data mining, pp 655–664 Srivastava A, Zou J, Adams R, Sutton C (2016) Clustering with a reject option: interactive clustering as bayesian prior elicitation. In: Proceedings of the ICML workshop on human interpretability in machine learning, pp 16–20 Strehl A, Ghosh J (2002) Cluster ensembles-a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617 Tan W, Yang Y, Li T (2010) An improved cop-k means algorithm for solving constraint violation. In: Proceedings of the international FLINS conference on foundations and applications of computational intelligence, pp 690–696 Tang W, Xiong H, Zhong S, Wu J (2007) Enhancing semi-supervised clustering: a feature projection perspective. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 707–716 van Leeuwen M (2014) Interactive data exploration using pattern mining. Interactive knowledge discovery and data mining in biomedical informatics, vol 9. Lecture notes in computer science. Springer, Berlin, pp 169–182 van Leeuwen M, Ukkonen A (2013) Discovering skylines of subgroup sets. In: Proceedings of the joint European conference on machine learning and knowledge discovery in databases, pp 272–287 van Leeuwen M, De Bie T, Spyropoulou E, Mesnage C (2016) Subjective interestingness of subgraph patterns. Mach Learn 105(1):41–75

484

P. Gançarski et al.

von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416 Vu VV, Labroche N (2017) Active seed selection for constrained clustering. Intell Data Anal 21(3):537–552 Wagstaff K, Cardie C (2000) Clustering with instance-level constraints. In: Proceedings of the international conference on machine learning, pp 1103–1110 Wagstaff K, Cardie C, Rogers S, Schroedl S (2001) Constrained k-means clustering with background knowledge. In: Proceedings of the international conference on machine learning, pp 577–584 Wagstaff K, Basu S, Davidson I (2006) When is constrained clustering beneficial, and why? In: Proceedings of the National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference Wang X, Davidson I (2010a) Active spectral clustering. In: Proceedings of the IEEE international conference on data mining, pp 561–568 Wang X, Davidson I (2010b) Flexible constrained spectral clustering. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 563–572 Wang J, Han J, Lu Y, Tzvetkov P (2005) TFP: an efficient algorithm for mining top-k frequent closed itemsets. IEEE Trans Knowl Data Eng 17(5):652–663 Wang X, Qian B, Davidson I (2014) On constrained spectral clustering and its applications. Data Min Knowl Discov 28(1):1–30 Wemmert C, Gançarski P, Korczak J (2000) A collaborative approach to combine multiple learning methods. Int J Artif Intell Tools 9(1):59–78 Xiao W, Yang Y, Wang H, Li T, Xing H (2016) Semi-supervised hierarchical clustering ensemble and its application. Neurocomputing 173(3):1362–1376 Xing E, Ng A, Jordan M, Russell S (2002) Distance metric learning learning, with application to clustering with side-information. In: Proceedings of the international conference on neural information processing systems, pp 521–528 Xu R, Wunsch D (2005) Survey of clustering algorithms. 
IEEE Trans Neural Netw 16(3):645–678 Yang Y, Tan W, Li T, Ruan D (2012) Consensus clustering based on constrained self-organizing map and improved cop-kmeans ensemble in intelligent decision support systems. Knowl-Based Syst 32:101–115 Yang F, Li T, Zhou Q, Xiao H (2017) Cluster ensemble selection with constraints. Neurocomputing 235:59–70 Yao H, Hamilton H (2006) Mining itemset utilities from transaction databases. Data Knowl Eng 59(3):603–626 Yi J, Jin R, Jain A, Yang T, Jain S (2012) Semi-crowdsourced clustering: generalizing crowd labeling by robust distance metric learning. In: Proceedings of the international conference on neural information processing systems, pp 1772–1780 Yu ZW, Wongb HS, You J, Yang QM, Liao HY (2011) Knowledge based cluster ensemble for cancer discovery from biomolecular data. IEEE Trans NanoBiosci 10(2):76–85 Zha H, He X, Ding CHQ, Gu M, Simon HD (2001) Spectral relaxation for k-means clustering. In: Proceedings of the international conference on neural information processing systems, pp 1057–1064 Zhang T, Ando R (2006) Analysis of spectral kernel design based semi-supervised learning. In: Proceedings of the international conference on neural information processing systems, pp 1601– 1608 Zhi W, Wang X, Qian B, Butler P, Ramakrishnan N, Davidson I (2013) Clustering with complex constraints - algorithms and applications. In: Proceedings of the AAAI conference on artificial intelligence, pp 1056–1062

Afterword—Artificial Intelligence and Operations Research

Philippe Chrétienne

In this afterword, we first give an overview of the objectives and scientific approaches of Artificial Intelligence and Operations Research. We then show, through some examples, why and how these two disciplines can cooperate to improve the resolution of complex optimization problems.

1 Operations Research

Apart from the work of a few famous precursors, such as Blaise Pascal in "Réflexions sur quelques problèmes de décision dans l'incertain", Gaspard Monge in "Résolution du problème des déblais et remblais", or Harris with the "economic lot-size formula", modern Operations Research (OR) began in 1940, during the Second World War, when Patrick Blackett was charged by the British high command with assembling the first OR team, to solve optimally the problem of locating surveillance radars as well as other optimization problems connected with the management of supplies. The word "Operations" in OR comes from the fact that military operations were the first application domain of OR. Since then, and in particular owing to the great increase in the computational power of computers, solution methods have become far more efficient, new solution techniques have been developed, and the application domain of OR has greatly expanded. Administration (timetabling, …), manufacturing shops (lot-sizing, scheduling, …), physical systems (spin glasses, …), transport (delivery vehicle tours, …), computers (file localization, code optimization, …), biological systems (sequence alignment, …), telecommunications (network sizing, …) and energy (hydraulic dam management, nuclear power plant shutdown, energy distribution, …) make up a non-exhaustive list of important application domains of OR. These few examples are significant enough to convey the nature of the problems OR faces and its approach to them, which may be summarized as follows. Starting from an application problem, a mathematical optimization model,

P. Chrétienne (B) LIP6-CNRS and Université Pierre et Marie Curie, Paris, France e-mail: [email protected]

© Springer Nature Switzerland AG 2020 P. Marquis et al. (eds.), A Guided Tour of Artificial Intelligence Research, https://doi.org/10.1007/978-3-030-06167-8

which is most often arrived at by consensus between the practitioner and the researcher, must be developed. Then an exact or approximate algorithmic solution method must be provided. Finally, experimental numerical tests must show that the quality of the solutions given by the algorithm is high enough in the real environment of the initial problem. OR models essentially belong to three classes: combinatorial models (optimization in graphs, scheduling, logistics, transport, localization, …), stochastic models (queueing systems, strategies in uncertain environments, …) and decision-making models (competitive games, preference models, …). Obviously, this definition of OR comes in several variants, since some OR researchers work on the improvement of generic solution techniques while others stay closer to real applications. In any case, OR is naturally a transverse discipline that uses tools from Mathematics and Computer Science to solve optimization problems through models issued from real applications.

2 Artificial Intelligence

Artificial Intelligence (AI), which grew out of Computer Science, is thus more recent than OR. Its overall purpose is, according to its creators, to develop computer programs capable of executing, in a more satisfying way, tasks so far carried out by human beings because they require mental processes (learning, organization, reasoning, …). While the word "intelligence" refers to the imitation of human behavior (reasoning, mathematical games, understanding, visual and auditory perception, …), the qualifier "artificial" comes from the use of computers. Let us note, however, that this definition is considered too restrictive by the proponents of a "strong" AI, who envisage creating computer machines that not only produce clever behaviors but are also endowed with a consciousness allowing them to understand their own reasoning. In any case, the ambition of AI is clearly far larger than that of OR. But it is also clear that, through the resolution of complex optimization problems issued from sufficiently general generic models (e.g., combinatorial or decision models), OR and AI can strengthen each other, as I hope the examples developed in the next section will show.

3 The Common Fight of OR and AI against Complexity

My first example concerns "constraint reasoning", which is seen, from the AI point of view, as a generic model for problem solving (Apt 2003). The generic status of the approach, which is a real asset, may also be a source of ineffectiveness, since the specificities of a given problem, such as sharp structural properties of its optimal solutions, are not taken into account by a generic constraint programming software, whereas they are most often essential to find exact, or even good approximate, solutions to
real instances of the problem. In most cases, only a thorough analysis of the specific properties of the problem makes it possible to sufficiently break the combinatorial nature of the solution space. The development of adapted propagation techniques is, however, an answer that has been the source of great improvements in solving hard combinatorial problems (Dechter 2003). Let us for example consider, in the non-preemptive scheduling context, the single-machine constraint, where a single machine is available to execute a set of n tasks and each task i has a release time r_i, a deadline d_i, and a processing time p_i, all of these data being integers. We denote by δ_i the value r_i + p_i (a lower bound on the completion time of task i) and by ρ_i the value d_i − p_i (an upper bound on the starting time of task i). The starting (resp. completion) time of task i is the variable S_i (resp. C_i). We first present two simple rules, called respectively the "time-tabling" and the "disjunction" rules, and then two more sophisticated rules, called the "edge-finding" and the "not first-not last" rules. Integrated in constraint programming software, these rules were at the root of great progress in solving shop-scheduling problems, and especially the job-shop scheduling problem. The time-tabling rule simply takes advantage of knowing whether the machine is working on task i during time unit t, an information captured by the variable a_i(t), where a_i(t) = 1 (resp. 0) if the machine is (resp. is not) running task i at time unit t. Propagating the time-tabling rule yields the following adjustments:

(a_i(t) = 0) ∧ (δ_i ≥ t) ⇒ S_i ≥ t
(a_i(t) = 0) ∧ (ρ_i < t) ⇒ C_i < t
a_i(t) = 1 ⇒ (S_i ≥ t − p_i) ∧ (C_i < t + p_i)

The disjunction rule takes advantage of the fact that any two distinct tasks i and j are executed on the disjoint semi-open intervals [S_i, S_i + p_i[ and [S_j, S_j + p_j[.
Propagating the disjunction rule yields the following adjustment:

δ_i > ρ_j ⇒ S_i ≥ S_j + p_j

The edge-finding rule is a powerful deductive process that derives relations on the order in which tasks are processed by the machine. That rule may be used as part of the branching rule to select a task within a subset Ω, or to remove nodes of the search tree by proving that a task i ∉ Ω must be executed either before or after all tasks of Ω. If we denote by r_Ω (resp. d_Ω) the smallest release time (resp. the largest deadline) of the tasks in Ω, and by p_Ω the sum of their processing times, the adjustments of the edge-finding rule are the following:

(i ∉ Ω) ∧ (r_Ω + p_Ω + p_i > d_{Ω∪{i}}) ⇒ i ≺ Ω
(i ∉ Ω) ∧ (r_{Ω∪{i}} + p_Ω + p_i > d_Ω) ⇒ Ω ≺ i
(i ∉ Ω) ∧ (i ≺ Ω) ⇒ C_i ≤ min_{∅≠Ω′⊆Ω} (d_{Ω′} − p_{Ω′})
(i ∉ Ω) ∧ (Ω ≺ i) ⇒ S_i ≥ max_{∅≠Ω′⊆Ω} (r_{Ω′} + p_{Ω′})
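As a rough illustration, the disjunction adjustment can be propagated to a fixed point over the release times. The sketch below is only a naive rendering of the idea (industrial constraint solvers implement such rules far more efficiently), and the task data in the example are invented:

```python
# Naive fixed-point propagation of the disjunction rule on one machine.
# Task i has release time r[i], deadline d[i], processing time p[i];
# delta_i = r[i] + p[i] and rho_j = d[j] - p[j], as in the text.

def propagate_disjunction(r, d, p):
    """Tighten the release times of tasks sharing a single machine.
    If delta_i > rho_j, task i cannot precede task j, so j precedes i
    and S_i >= S_j + p_j: r[i] can be raised to r[j] + p[j]."""
    n = len(p)
    r = list(r)
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for j in range(n):
                if i != j and r[i] + p[i] > d[j] - p[j]:  # delta_i > rho_j
                    if r[j] + p[j] > r[i]:
                        r[i] = r[j] + p[j]
                        changed = True
            if r[i] + p[i] > d[i]:
                raise ValueError(f"task {i} cannot meet its deadline")
    return r

# Task 0 fills [0, 4] exactly, so task 1 must start after it:
print(propagate_disjunction([0, 0], [4, 10], [4, 3]))  # -> [0, 4]
```

Propagation loops until no bound moves, since a tightened release time may trigger further adjustments.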

Contrary to edge-finding, which allows one to infer that a task must be processed either before or after all the tasks of a set Ω, the "not first-not last" rule allows one to deduce that a task must be processed neither before nor after all the tasks in Ω. The corresponding adjustments are the following:

(i ∉ Ω) ∧ (r_i + p_Ω + p_i > d_Ω) ⇒ ¬(i ≺ Ω)
(i ∉ Ω) ∧ (r_Ω + p_Ω + p_i > d_i) ⇒ ¬(Ω ≺ i)
(i ∉ Ω) ∧ ¬(i ≺ Ω) ⇒ S_i ≥ min_{j∈Ω} (r_j + p_j)
(i ∉ Ω) ∧ ¬(Ω ≺ i) ⇒ C_i ≤ max_{j∈Ω} (d_j − p_j)
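For illustration, the not first-not last adjustments can be applied by brute force, enumerating every subset Ω. This is exponential and serves only to make the rules concrete (as noted below, the adjustments can actually be computed in O(n^2) time); the task data in the example are invented:

```python
from itertools import combinations

def not_first_not_last(r, d, p):
    """Apply the not-first / not-last adjustments once, by enumerating
    every subset Omega not containing i (exponential; illustration only).
    Returns the tightened release times and deadlines."""
    n = len(p)
    r, d = list(r), list(d)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(1, n):
            for omega in combinations(others, k):
                r_om = min(r[j] for j in omega)
                d_om = max(d[j] for j in omega)
                p_om = sum(p[j] for j in omega)
                if r[i] + p_om + p[i] > d_om:        # i cannot be first
                    r[i] = max(r[i], min(r[j] + p[j] for j in omega))
                if r_om + p_om + p[i] > d[i]:        # i cannot be last
                    d[i] = min(d[i], max(d[j] - p[j] for j in omega))
    return r, d

# Tasks 1 and 2 together fill [0, 6], so task 0 cannot be scheduled first:
print(not_first_not_last([0, 0, 0], [12, 6, 6], [2, 3, 3]))
# -> ([3, 0, 0], [12, 6, 6])
```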

These rules have proved very effective for solving scheduling problems since, for both edge-finding and not first-not last, the adjustments associated with the Ω(n·2^n) pairs {i, Ω} may be computed in O(n^2) time (Baptiste and Le Pape 2000). These important results, issued from research on the exact solving of the job-shop problem (Carlier and Pinson 1989), show how positive the interaction between OR and AI can be. My second example concerns the "metaheuristics" domain, and more precisely "local search", which is often the most effective approach to solve complex combinatorial problems (Siarry and Dréo 2003; Deroussy 2016). Recall that a local search algorithm is based on a neighborhood function that assigns to any solution x a set N(x) of neighboring solutions. Starting from an initial solution x_0, iteration i searches in N(x_{i−1}) for a solution x_i that is strictly better than x_{i−1} (possibly by searching for the best solution in N(x_{i−1})). If there is no such solution, x_{i−1} is a local optimum (with respect to N) and the algorithm terminates by returning x_{i−1}. Otherwise iteration i + 1 is processed. The efficiency of this approach highly depends on the choice of N. In particular, the size of N(x) is a very sensitive parameter. Indeed, a larger size generally yields fewer iterations, at the expense of a larger iteration time. So one should look for a large (possibly exponential-size) neighborhood from which a good-quality solution may be found rapidly (e.g., in weakly polynomial time). This explains why a lot of research has developed rather recently in OR concerning large neighborhoods (Pisinger and Ropke 2010). We will illustrate the interest of large neighborhoods on the following general "partitioning" problem. Let A = {a_1, …, a_n}. A partition of A into classes S_1, …, S_K (which may be required to have a particular structure) must be found such that the sum Σ_{k=1}^{K} c_k(S_k) is minimum.
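The local-search scheme just described can be sketched on this partitioning problem with the simplest neighborhood, which swaps two items of distinct classes (the 2-exchanges discussed below). The per-class cost function in the example is a made-up toy (squared deviation of the class sum from a target), not one of the applications cited in the text:

```python
def improving_exchange(classes, cost):
    """Find one strictly improving 2-exchange, apply it and return True;
    return False if the current partition is a local optimum."""
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            old = cost(classes[a]) + cost(classes[b])
            for x in sorted(classes[a]):
                for y in sorted(classes[b]):
                    na = classes[a] - {x} | {y}   # class a after the swap
                    nb = classes[b] - {y} | {x}   # class b after the swap
                    if cost(na) + cost(nb) < old:
                        classes[a], classes[b] = na, nb
                        return True
    return False

def local_search(classes, cost):
    """Descend from the initial partition until no 2-exchange improves it."""
    classes = [frozenset(c) for c in classes]
    while improving_exchange(classes, cost):
        pass
    return classes, sum(cost(c) for c in classes)

# Toy cost: squared deviation of the class sum from a target of 10.
cost = lambda c: (sum(c) - 10) ** 2
parts, total = local_search([{1, 9}, {2, 3}, {4, 6}, {5, 8}], cost)
print(sorted(sorted(c) for c in parts), total)
# -> [[1, 9], [2, 8], [3, 6], [4, 5]] 2
```

Only the two classes touched by an exchange need to be re-evaluated, which is what makes the neighborhood cheap to scan.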
Note that computing the cost c_k(S_k) associated with grouping the items of S_k may itself be difficult. For example, if S_k is the set of nodes of a tour, computing c_k(S_k) amounts to solving an instance of the Traveling Salesman Problem. This partitioning problem includes in particular the vehicle routing problem, the capacitated minimum spanning tree problem, the generalized assignment problem, the multicut problem, the aggregation and graph partitioning problems, as well as localization problems and some parallel-machine scheduling problems. The first neighborhoods studied were the 2-exchanges (Croes 1958), where the Ω(n^2) neighbors of a partition are obtained by exchanging two items that belong to distinct classes. More recently, Thompson and Orlin (1989)

Fig. 1 A 2-exchange and a 4-exchange

proposed cyclic exchanges, thus generalizing 2-exchanges, where at most K items of distinct classes are cyclically exchanged (see Fig. 1). The size Ω(n^K) of that neighborhood is much larger than the size of a 2-exchange neighborhood, and may be exponential if K depends on n. Since it is not possible to enumerate all the neighbors of a partition to find a better one, an optimization algorithm is needed. In Ahuja et al. (2000), such an algorithm is based on a valued graph G(S), called the "improvement graph", whose nodes are the n items of A and where there is an arc (i, j) if i and j are in two distinct classes of S and if the new class containing a_i, namely S′(a_i) = S(a_j) \ {a_j} ∪ {a_i}, is feasible. The cost of the arc (i, j) is c(S′(a_i)) − c(S(a_j)). A circuit of G(S) is said to be class-disjoint if its nodes belong to pairwise distinct classes. Since the cyclic exchanges and the class-disjoint circuits are in a cost-preserving one-to-one correspondence, one has to find in G(S) negative class-disjoint circuits, called valid circuits. Since deciding whether G(S) has at least one valid circuit is an NP-complete problem (Thompson and Psarafstis 1993), several heuristics and non-polynomial optimization algorithms have been developed to find a valid circuit. A simple variant of a shortest-path algorithm with label correction (Ahuja et al. 1993) has been found to be so efficient that the processing time needed to explore the neighborhood in the cyclic-exchange case is similar to that of the 2-exchange case. Such a local search algorithm, based on the cyclic exchange neighborhoods, has been used for the capacitated minimum-cost spanning tree problem. In that problem, a source node (a central processor) must send flows to the demand nodes of a network while respecting the constraint that the total flow on any arc must be less than a given capacity K. The network is assumed to be complete and the cost of each edge is

Afterword—Artificial Intelligence and Operations Research

Fig. 2 Structure of a solution

given. A solution of this problem is a partition of the demand nodes such that the sum of the demands of each class is at most K. Moreover, the cost of a class of the partition is the minimum cost of a spanning tree of the complete subgraph induced by that class. Figure 2 shows the structure of a solution with 3 spanning trees, unit demands and K = 6. The algorithm described in Ahuja et al. (2000) has made it possible not only to greatly improve (by up to 20%) the best known solutions of benchmark instances, previously obtained with tabu search algorithms, but also to solve instances with 500 nodes and arbitrary demands.
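Evaluating a candidate solution follows directly from this definition: check the capacity constraint of each class and sum the minimum spanning tree costs. Below is a small Python sketch under the assumption, consistent with Fig. 2, that the source node is included in each induced subgraph; the Euclidean instance and the function names are illustrative only.

```python
import math

def mst_cost(nodes, dist):
    """Prim's algorithm on the complete graph over `nodes`."""
    if len(nodes) <= 1:
        return 0.0
    nodes = list(nodes)
    key = {v: dist(nodes[0], v) for v in nodes[1:]}  # cheapest link to the tree
    total = 0.0
    while key:
        v = min(key, key=key.get)   # closest node not yet in the tree
        total += key.pop(v)
        for u in key:
            key[u] = min(key[u], dist(v, u))
    return total

def cmst_partition_cost(source, partition, demand, capacity, dist):
    """Cost of a candidate solution: each class must respect the capacity,
    and its cost is the minimum spanning tree over the class plus the
    source node (the attachment to the source is an assumption here)."""
    total = 0.0
    for cls in partition:
        if sum(demand[v] for v in cls) > capacity:
            return math.inf         # infeasible class
        total += mst_cost([source] + sorted(cls), dist)
    return total

# Toy Euclidean instance with unit demands and capacity K = 3.
coords = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (0, 1), 4: (0, 2), 5: (2, 2)}
dist = lambda u, v: math.dist(coords[u], coords[v])
demand = {v: 1 for v in coords if v != 0}
value = cmst_partition_cost(0, [{1, 2}, {3, 4, 5}], demand, 3, dist)
```

A local search over this cost function would then apply 2-exchanges or cyclic exchanges between classes, re-evaluating only the classes that change.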

4 Conclusion

Following a rather simplistic overview of the objectives of OR and AI, we wished to show, through two examples drawn from constraint propagation and metaheuristics, how these two disciplines can reinforce each other to better solve (i.e., obtain better solutions for) the numerous optimization problems coming from the real world that often remain out of reach despite the development of new solution techniques and the

increase of computational power. There is no doubt that Constraint Programming, through its unifying vision of problems and its generic solution methods, has brought a great improvement in programming comfort and has proved efficient enough to solve many real problems. The generic character of these methods also gives them real flexibility of adaptation and robustness in the face of possible variations in the problem specifications. Nevertheless, many highly combinatorial problems are still badly solved. The OR approach, which mainly consists in extracting the hard core of a problem’s complexity through a deep analysis of the properties of its optimal solutions, and in making the best use of the results obtained (lower bounds, dominance, …) by integrating them into a cleverly guided enumerative algorithm, often provides better results than a generic AI approach but cannot, in general, be easily adapted to even minor variants of the problem specifications. As a conclusion, we can assert that the OR and AI approaches complement each other and converge on the common objective of making progress in the resolution of complex problems.

References

Ahuja R, Magnanti T, Orlin J (1993) Network flows: theory, algorithms, and applications. Prentice Hall, NJ
Ahuja R, Orlin J, Sharma D (2000) New neighborhood search structures for the capacitated minimum spanning tree problem. Technical report, Sloan School of Management, MIT, Cambridge (MA)
Apt K (2003) Principles of constraint programming. Cambridge University Press, Cambridge
Baptiste P, Le Pape C (2000) Constraint propagation and decomposition techniques for highly disjunctive and highly cumulative project scheduling problems. Constraints 5:119–139
Carlier J, Pinson E (1989) An algorithm for solving the job shop problem. Manag Sci 35(2):164–176
Croes G (1958) A method for solving traveling salesman problems. Oper Res 6:791–812
Dechter R (2003) Constraint processing. Morgan Kaufmann, San Francisco
Deroussi L (2016) Métaheuristiques pour la logistique. ISTE
Pisinger D, Ropke S (2010) Large neighborhood search. Springer, Berlin
Siarry P, Dréo J (2003) Métaheuristiques pour l’optimisation difficile. Eyrolles
Thompson P, Orlin J (1989) The theory of cyclic transfers. Technical report OR-200-89, Operations Research Center, MIT, Cambridge (MA)
Thompson P, Psarafstis H (1993) Cyclic transfer algorithms for multivehicle routing and scheduling problems. Oper Res 41:935–946

Index

A A* algorithm, 323(II), 232(III), 239(III) abduction (abductive), 278(I), 283(I), 307(I), 487(I), 494(I), 495(I), 511(I), 512(I), 674(I), 259(II), 160(III), 359(III), 369(III), 446(III), 460(III), 461(III) act, 132(I), 133(I), 281(I), 294(I), 555(I), 556(I), 559(I), 568–570(I), 576(I), 405(II), 476(III), 477(III) action, 1(I), 2(I), 8(I), 9(I), 18(I), 52(I), 254(I), 255(I), 258(I), 264–266(I), 269(I), 275(I), 277(I), 284(I), 285(I), 287(I), 288(I), 291(I), 294(I), 295(I), 298(I), 299(I), 317(I), 319(I), 389– 396(I), 400–406(I), 444(I), 487– 508(I), 511–515(I), 523(I), 559(I), 583(I), 606(I), 612(I), 629–633(I), 637(I), 638(I), 640(I), 641(I), 646(I), 647(I), 738(I), 740(I), 763(I), 771(I), 41(II), 94(II), 287(II), 290(II), 295(II), 299(II), 303(II), 321(II), 327(II), 331(II), 123(III), 126(III), 269(III), 291(III), 304(III), 306(III), 310(III), 311(III), 313–319(III), 326(III), 354(III), 369(III), 370(III), 381(III), 389(III), 390(III), 404(III), 408–410(III), 412–416(III), 420– 422(III), 424(III), 429(III), 430(III), 443(III), 444(III), 446(III), 447(III), 454(III), 458(III), 476(III), 479(III), 492(III), 507(III), 508(III), 510(III), 541(III) action language, 487(I), 488(I), 496(I), 497(I), 500–502(I), 505(I), 511(I) action logic, 264(I), 277(I), 294(I) action selection, 304(III), 315–319(III), 326(III)

actor-critic (algorithm), 408(I), 322(II), 318(III), 319(III) adaptation (adaptive), 16(I), 23(I), 127(I), 137(I), 157(I), 255(I), 308(I), 310(I), 311(I), 313–321(I), 333(I), 372(I), 389(I), 614(I), 646(I), 741(I), 747(I), 750(I), 760(I), 38(II), 44(II), 71(II), 165(II), 171(II), 172(II), 233(II), 300(II), 302(II), 313(II), 330(II), 331(II), 402(II), 415(II), 435(II), 491(II), 99(III), 131(III), 161(III), 188(III), 210(III), 247(III), 290(III), 308(III), 318(III), 325(III), 341(III), 352(III), 353(III), 357(III), 358(III), 369–372(III), 452(III) agent, 19(I), 46(I), 48(I), 49(I), 51(I), 52(I), 60(I), 63(I), 71–74(I), 80– 82(I), 84–86(I), 89(I), 90(I), 99(I), 101(I), 120(I), 121(I), 138–142(I), 217–219(I), 225(I), 230(I), 232(I), 233(I), 237(I), 239(I), 241(I), 248(I), 257(I), 258(I), 265–271(I), 280(I), 281(I), 284(I), 285(I), 287(I), 288(I), 292(I), 293(I), 298(I), 315(I), 389(I), 390(I), 392(I), 393(I), 399(I), 404(I), 406(I), 415(I), 417(I), 419(I), 441– 447(I), 452(I), 454–459(I), 463(I), 465(I), 474(I), 476(I), 487–491(I), 493(I), 495(I), 508(I), 510(I), 512(I), 513(I), 520(I), 530(I), 539(I), 542(I), 549–554(I), 556–559(I), 561–575(I), 577–580(I), 582(I), 583(I), 587– 593(I), 600(I), 605–613(I), 615– 622(I), 629–648(I), 651–668(I), 725(I), 726(I), 735(I), 769–772(I), 202(II), 225(II), 233(II), 235(II), 259(II), 295(II), 299(II), 303(II),

© Springer Nature Switzerland AG 2020 P. Marquis et al. (eds.), A Guided Tour of Artificial Intelligence Research, https://doi.org/10.1007/978-3-030-06167-8

306(II), 329(II), 331(II), 332(II), 466–468(II), 119(III), 129(III), 149(III), 150(III), 157(III), 168(III), 182(III), 212(III), 216(III), 305(III), 314(III), 319(III), 322(III), 323(III), 326(III), 338(III), 354(III), 365(III), 367(III), 370–375(III), 382(III), 416–418(III), 426(III), 427(III), 429(III), 438(III), 447(III), 451(III), 453(III), 466(III), 476(III), 477(III), 481(III), 491(III), 508(III), 521(III), 537(III), 538(III) aggregation, 18(I), 110(I), 218(I), 241(I), 312(I), 380(I), 457(I), 459–461(I), 463(I), 475(I), 476(I), 519(I), 520(I), 522–528(I), 531(I), 535–537(I), 539(I), 544(I), 576(I), 588–591(I), 596(I), 608–611(I), 615(I), 187–191(II), 193(II), 194(II), 197(II), 199(II), 254(II), 433(II), 488(II), 52(III), 104(III), 154–156(III), 161(III), 162(III), 164(III), 169(III), 453(III), 480(III) algebraic closure, 163(I), 165(I), 167(I), 168(I) Allen interval algebra, 160(I), 165(I), 167(I), 411(III) alpha-beta algorithm, 313(II) analogical proportion, 3(I), 14(I), 307–309(I), 313(I), 321(I), 324–331(I), 473(III) analogical proportion-based learning, 324(I) analogical reasoning, 4(I), 18(I), 94(I), 307(I), 308(I), 321(I), 324(I), 330(I), 333(I), 673(I), 232(III), 243(III), 444(III), 449(III), 473(III) analogy, 3–5(I), 25(I), 26(I), 159(I), 253(I), 256(I), 307(I), 313(I), 319(I), 321–324(I), 326(I), 327(I), 329(I), 345(I), 630(I), 718(I), 422(II), 2–4(III), 325(III), 447(III), 460(III), 468(III), 473(III), 516(III) anaphora, 123(III), 541(III) and/or graph, 189(I), 244(I), 284(I), 343(I), 405(I), 494(I), 495(I), 526(I), 529(I), 582(I), 632(I), 694(I), 717(I), 2(II), 292(II), 352(III), 409(III) annotated corpus (annotated corpora), 131(III) annotation, 174(I), 744(I), 748(I), 760(I), 761(I), 210(II), 228(II), 237(II), 477(II), 118(III), 131(III), 132(III), 182(III), 212(III), 213(III), 216(III),

Index 217(III), 219–221(III), 224(III), 238(III), 246(III), 248(III), 268(III), 343(III), 344(III), 346(III) answer set programming (ASP), 65(I), 99(I), 431(I), 464(I), 465(I), 513(I), 83(II), 84(II), 90(II), 94–108(II), 262(II), 292(II), 436(II), 195(III), 234(III), 235(III), 237(III), 241(III), 242(III), 295(III) answer-set program, 99(I) ant colony optimization algorithms, 27(II), 167(III), 210(III) approximate reasoning, 27(I), 75(I), 76(I), 312(I), 330–333(I), 391(II) approximation, 105(I), 126(I), 137(I), 355(I), 356(I), 360(I), 367(I), 379(I), 390(I), 393–397(I), 399(I), 401(I), 402(I), 408(I), 465(I), 598(I), 599(I), 615–617(I), 700(I), 13(II), 28(II), 129(II), 139(II), 221–223(II), 300(II), 313(II), 350(II), 391(II), 417(II), 418(II), 460(II), 4(III), 39(III), 83(III), 230(III), 271(III), 281(III), 438(III), 439(III), 442(III), 480(III) arc consistency, 158–164(II), 169(II), 170(II), 176(II), 177(II), 197(II), 198(II), 200(II) argumentation, 3–5(I), 8(I), 12(I), 17(I), 73(I), 110(I), 247(I), 415(I), 419(I), 427–432(I), 435(I), 436(I), 443(I), 446(I), 462(I), 465(I), 514(I), 633(I), 642(I), 652(I), 664(I), 665(I), 719(I), 721(I), 722(I), 103(II), 10(III), 12(III), 113(III), 160(III), 443(III), 444(III), 449(III) argumentation graph, 430(I), 431(I), 436(I) argumentative inference, 418(I), 421(I), 274(II) ASP solver, 104–106(II) association rule, 102(I), 345–348(II), 390(II), 395(II), 412(II), 415(II), 435(II), 200(III), 379(III) ATMS, 678(I), 684(I), 138(II), 139(II), 141– 143(II) attack relation, 73(I), 110(I), 429(I), 430(I), 465(I), 667(I) attitude with respect to risk, 561(I) automatic control, 27(I), 153(I), 673(I), 674(I), 693(I), 695(I), 700(I), 701(I), 367(III), 401(III) automaton (automata), 6(I), 7(I), 12(I), 13(I), 15(I), 16(I), 21(I), 497(I),

673(I), 683(I), 685–691(I), 700(I), 89(II), 125(II), 166(II), 192(II), 294(II), 384(II), 385(II), 390–392(II), 397(II), 3(III), 5(III), 8(III), 9(III), 17(III), 18(III), 36(III), 51–53(III), 60(III), 66–79(III), 235(III), 238(III), 266(III), 273(III), 409(III), 428(III), 453(III), 497(III)

B backpropagation, 381(I), 377–380(II), 382(II), 477(III) backtrack algorithm, 195(II) backtracking, 664(I), 716(I), 72(II), 79(II), 131(II), 156(II), 166(II), 167(II), 170(II), 172–174(II), 195(II), 193(III), 280(III), 344(III), 416(III) bagging, 362(I), 367–369(I), 351(II), 380(II), 211(III), 212(III), 224(III) batch reinforcement learning, 397(I) Bayesian classifier, 342(I), 238(II), 396(II), 397(II) Bayesian network, 71(I), 83(I), 84(I), 86(I), 96(I), 219(I), 235(I), 238– 240(I), 283(I), 284(I), 289(I), 290(I), 296(I), 347(I), 379(I), 472(I), 487(I), 497(I), 506(I), 581(I), 582(I), 35(II), 192(II), 202(II), 210(II), 212(II), 215–227(II), 229– 231(II), 234(II), 235(II), 237– 239(II), 247(II), 248(II), 251– 253(II), 255(II), 256(II), 258(II), 259(II), 262–264(II), 268–270(II), 273(II), 276(II), 285(II), 286(II), 297–299(II), 306(II), 384(II), 397(II), 162(III), 244(III), 249(III), 338(III), 375(III), 405(III), 406(III), 446(III), 478(III) BDI, 630–633(I), 648(I), 374(III) belief, 11(I), 20(I), 46(I), 49(I), 51(I), 52(I), 59(I), 63(I), 69(I), 70(I), 73(I), 74(I), 80(I), 82–84(I), 86(I), 88(I), 96(I), 99(I), 101(I), 104(I), 106(I), 107(I), 110(I), 119–127(I), 129–136(I), 138– 145(I), 174(I), 246(I), 283(I), 284(I), 290(I), 292(I), 299(I), 315(I), 415(I), 417(I), 441–449(I), 451–453(I), 455– 460(I), 466(I), 467(I), 469(I), 471– 477(I), 487(I), 489–494(I), 504(I), 506–510(I), 511–513(I), 515(I), 549(I), 557(I), 572–574(I), 577(I), 578(I), 588(I), 600(I), 630(I), 631(I), 633–637(I), 639–642(I), 646–648(I),

665(I), 666(I), 668(I), 700(I), 720(I), 760(I), 771(I), 129(II), 192(II), 209(II), 210(II), 212–214(II), 217– 219(II), 227(II), 230(II), 233– 240(II), 251(II), 265(II), 297(II), 303(II), 304(II), 398(II), 477(II), 83(III), 113(III), 124(III), 129(III), 157(III), 158(III), 162(III), 199(III), 221(III), 226(III), 244(III), 342(III), 359(III), 374(III), 375(III), 418(III), 442(III), 445(III), 446(III), 449(III), 462(III), 469(III), 477(III), 478(III), 506(III) belief base, 59(I), 101(I), 444(I), 507(I), 508(I), 510(I), 512(I), 513(I) belief change, 441–443(I), 446(I), 447(I), 456(I), 466(I), 467(I), 477(I), 507(I), 512(I), 515(I), 235(II) belief function, 11(I), 69(I), 74(I), 104(I), 107(I), 110(I), 119–127(I), 129– 136(I), 138(I), 139(I), 141–145(I), 290(I), 441(I), 467(I), 471–474(I), 477(I), 549(I), 572–574(I), 234(II), 342(III) Bellman residual, 396(I), 397(I), 399(I), 400(I) biclustering, 412(II), 420(II), 432(II), 434(II), 436(II) big data, 722(I), 756(I), 340(II), 448(II), 449(II), 472(II), 478(II), 315(III), 379(III) bioinformatics, 342(I), 95(II), 102(II), 115(II), 125(II), 237(II), 412(II), 436(II), 77(III), 209–212(III), 214(III), 217(III), 218(III), 222(III), 224(III), 225(III), 227(III), 231(III), 232(III), 237(III), 238(III), 241(III), 243(III), 247(III), 249(III), 250(III), 296(III) biology, 20(I), 159(I), 434(II), 436(II), 74(III), 210(III), 211(III), 215– 220(III), 229(III), 232(III), 237(III), 238(III), 250(III), 265–268(III), 274(III), 280(III), 281(III), 283(III), 288(III), 289–291(III), 296(III), 400(III), 451(III) Boolean game, 129(III) boosting, 362(I), 367–369(I), 380(I), 339(II), 351(II), 368(II), 374(II), 375(II), 378(II), 163(III), 211(III), 212(III), 223(III), 224(III) brain, 21(I), 173(I), 347(I), 28(II), 3(III), 8(III), 9(III), 52(III), 303–306(III),

309(III), 310(III), 314(III), 315(III), 318(III), 319(III), 323–326(III), 343–346(III), 358(III), 457(III), 458(III), 462(III), 474(III), 475(III), 477(III), 503(III), 504(III), 506(III), 507(III), 538(III) branch and bound algorithm, 195(II), 356(II), 463(II), 239(III) C cake cutting, 592(I), 607(I), 610(I), 612–614(I), 622(I) capacity, 74(I), 100(I), 141–142(I), 539–542(I), 569–570(I), 572(I), 576(I), 578(I) cardinal direction calculus, 163(I), 166(I), 168(I), 169(I) cardinal utility, 520(I) case base, 308(I), 310–312(I), 316(I), 317(I), 320(I), 321(I) case-based reasoning, 26(I), 75(I), 94(I), 307(I), 308(I), 321(I), 325(I), 330(I), 331(I), 333(I), 673(I), 226(III), 232(III), 243(III), 350(III), 353(III), 444(III), 449(III), 466(III), 473(III), 493(III), 509(III), 541(III) case retrieval, 311(I), 318(I), 333(I) causal graph, 285(I), 289(I), 290(I), 674(I), 675(I), 684(I), 698(I), 216(II) causal rule, 280(I), 281(I), 295(I), 502–504(I), 511(I), 716(I), 725(I) causality, 11(I), 19(I), 20(I), 101(I), 157(I), 158(I), 258(I), 275–279(I), 281–289(I), 291–295(I), 297–300(I), 502(I), 681(I), 698(I), 725(I), 107(II), 213(II), 405(II), 162(III), 221(III), 291(III), 446(III) ceteris paribus principle, 244(I), 245(I) checkers, 21(I), 26(I), 739(I), 320(II), 236(III), 288(III), 438(III) chemoinformatics, 412(II), 436(II), 243(III) chess, 10(I), 12(I), 13(I), 16(I), 21(I), 26(I), 314(II), 321(II), 231(III), 390(III), 438(III), 441(III), 536(III) Choquet integral, 110(I), 123(I), 132(I), 539–543(I), 572(I), 573(I), 615(I), 169(III) Church-Turing thesis, 3(III) circumscription, 58(I), 498(I), 97(II), 100(II), 127(III) Clark completion, 260(II) classification, 83(I), 134(I), 136(I), 312(I), 327(I), 328(I), 342(I), 344–346(I),

348(I), 351(I), 360(I), 362(I), 366(I), 368–370(I), 376(I), 380(I), 436(I), 475(I), 488(I), 522(I), 588(I), 606(I), 660(I), 684(I), 737(I), 741(I), 755(I), 757(I), 79(II), 209(II), 210(II), 218–220(II), 224(II), 237(II), 239(II), 344(II), 346(II), 348(II), 349(II), 353(II), 368(II), 373(II), 375(II), 390–392(II), 394(II), 411(II), 412(II), 418(II), 424(II), 425(II), 437(II), 453(II), 470(II), 64(III), 83(III), 130(III), 134(III), 213(III), 226(III), 227(III), 243(III), 246(III), 291(III), 307(III), 372(III), 376(III), 419(III), 425(III), 430(III), 522(III), 524(III) clause, 97(I), 169(I), 170(I), 207(I), 208(I), 227(I), 346(I), 365(I), 433–435(I), 512(I), 678(I), 679(I), 712(I), 716(I), 717(I), 40–43(II), 57–67(II), 83–89(II), 91(II), 99(II), 105(II), 106(II), 116(II), 118–123(II), 125–136(II), 138(II), 140(II), 145(II), 156(II), 168(II), 192(II), 201(II), 294(II), 384(II), 387(II), 394(II), 461(II), 62(III), 106(III), 119(III), 122(III), 188(III), 195(III), 277(III), 471(III) closed world (closed world assumption, CWA), 186(I), 187(I), 81(II), 91(II), 96(II), 139(II), 195(III) clustering, 78(I), 126(I), 134(I), 136–138(I), 345(I), 346(I), 166(II), 178(II), 218(II), 339(II), 344–347(II), 354–356(II), 358(II), 360–367(II), 396(II), 406(II), 434(II), 435(II), 447–452(II), 454–468(II), 470–474(II), 477(II), 478(II), 83(III), 133(III), 134(III), 154(III), 291(III), 496(III), 522(III) cognition (cognitive), 17(I), 18(I), 26(I), 28(I), 46(I), 52(I), 71(I), 159(I), 172(I), 174(I), 219(I), 220(I), 239(I), 247(I), 281(I), 285(I), 292(I), 293(I), 298(I), 308(I), 321(I), 322(I), 332(I), 341(I), 427(I), 629–634(I), 636(I), 640(I), 641(I), 645(I), 647(I), 648(I), 720(I), 722(I), 734(I), 742(I), 743(I), 747(I), 754(I), 772(I), 271(II), 147(III), 209(III), 292(III), 303–305(III), 308(III), 310–315(III), 317(III), 319(III), 320(III), 324–326(III), 339(III), 348(III), 349(III), 358(III), 367–375(III), 381(III),

389(III), 430(III), 437–439(III), 443–448(III), 450(III), 452–454(III), 457–459(III), 461(III), 463(III), 468–471(III), 473–481(III), 503–510(III), 519(III), 525(III), 527(III), 538(III) coherence, 80(I), 81(I), 85(I), 93(I), 96(I), 104(I), 141(I), 298(I), 709(I), 711–715(I), 720(I), 726(I), 744(I), 11(II), 33(III), 98(III), 124(III), 445(III), 494(III), 496(III), 497(III) collaborative clustering, 466–468(II), 471(II) collective decision, 217(I), 219(I), 231(I), 242(I), 248(I), 471(I), 519(I), 520(I), 528(I), 537(I), 544(I), 587(I), 590(I), 593(I), 606(I), 614(I), 617(I), 622(I), 651(I), 652(I), 660(I), 437(III), 452(III) combinatorial auction, 242(I), 588(I), 614(I), 615(I), 617–619(I), 621(I), 622(I), 660(I) combinatorial optimization, 520(I), 606(I), 91(II), 167(III) comonotonicity (comonotone), 568(I) commonality function, 122(I), 124(I), 127(I) common sense, 18(I), 151(I), 152(I), 159(I), 550(I), 716(I), 740(I), 56(II), 366(III), 465(III), 469(III), 541(III) compact representation of preferences, 99(I), 217(I), 248(I), 330(I), 379(I), 592(I), 600(I), 614(I), 192(II), 83(III), 102(III), 105(III) compilation, 689(I), 88(II), 115–116(II), 131(II), 136–138(II), 140–143(II), 145(II), 202(II), 219(II), 288(II), 290(II), 98(III), 129(III), 187(III), 288(III) completeness, 16(I), 154(I), 193(I), 195(I), 201(I), 319(I), 435(I), 552(I), 648(I), 681(I), 696(I), 697(I), 713(I), 723(I), 724(I), 60(II), 68(II), 69(II), 71(II), 75(II), 78(II), 80(II), 87(II), 121(II), 122(II), 124(II), 129(II), 138(II), 144(II), 174(II), 175(II), 391(II), 417(II), 3(III), 17–20(III), 22(III), 34(III), 61(III), 111(III), 112(III), 122(III), 186(III), 291(III) completion, 434(I), 604(I), 58(II), 78(II), 97(II), 100(II), 105(II), 143(II), 260(II), 404(II), 433(II), 458(II), 487(II), 309(III), 381(III), 471(III) compositionality, 76(I), 118(III)

composition table, 163(I), 169(I), 170(I), 175(I) computability, 2(I), 16(I), 21(I), 76(II), 78(II), 1(III), 2(III), 5(III), 6(III), 8(III), 11–17(III), 33(III), 38(III), 41–43(III), 53(III), 54(III), 59–62(III), 69(III), 71(III), 80(III), 83(III), 84(III), 537(III) computational biology, 74(III), 232(III), 400(III) computational model, 416(I), 333(II), 66(III), 68(III), 69(III), 128(III), 304(III), 318(III), 326(III), 389(III), 390(III), 451(III), 488(III), 489(III), 525(III) computer vision, 27(I), 380(I), 229(II), 237(II), 337–341(III), 344(III), 359(III), 370(III), 406(III) concept lattice, 171(I), 411–416(II), 418(II), 420–422(II), 424(II), 426(II), 428(II), 429(II), 432(II), 435(II), 436(II), 125(III) concept learning, 341(I), 343(I), 362(I), 364(I), 366(I), 367(I), 378(I), 384(II), 385(II), 391(II) conceptual clustering, 346(II), 347(II), 365(II), 461(II), 463(II), 465(II) conceptual graph, 185(I), 187(I), 197–207(I), 713(I), 714(I), 752(I), 427(II), 126(III), 154(III), 158(III), 159(III), 183(III), 184(III), 186(III), 187(III), 192(III), 195(III), 341(III), 537(III) conceptual model, 188(I), 734–736(I), 739(I), 742(I), 744(I), 746(I), 761(I) conceptual space, 332(I) conditional, 11(I), 18(I), 45–49(I), 52–54(I), 56(I), 58–61(I), 63(I), 64(I), 69(I), 73(I), 77(I), 79(I), 81–84(I), 87(I), 88(I), 91(I), 95(I), 96(I), 99–101(I), 106(I), 125(I), 134(I), 142(I), 143(I), 175(I), 221–223(I), 226(I), 228–231(I), 234–237(I), 244–248(I), 253–256(I), 259–262(I), 264(I), 265(I), 277(I), 278(I), 283–285(I), 289(I), 291(I), 296(I), 332(I), 347–349(I), 372(I), 374(I), 375(I), 377(I), 408(I), 415(I), 443(I), 467(I), 468(I), 490(I), 501(I), 502(I), 506(I), 507(I), 511(I), 578(I), 581(I), 631(I), 634(I), 646(I), 679(I), 681(I), 35(II), 73(II), 97(II), 103(II), 192(II), 202(II), 203(II), 210(II), 211(II), 215(II), 220(II), 239(II), 249(II),

251–253(II), 268(II), 270–272(II), 288(II), 295(II), 297(II), 301(II), 359(II), 384(II), 396(II), 397(II), 435(II), 478(II), 18(III), 105(III), 107(III), 127(III), 158(III), 162(III), 165(III), 223(III), 226(III), 227(III), 231(III), 244(III), 359(III), 406(III), 444(III), 446(III), 460(III), 463–465(III), 470–472(III), 478(III) conditional independence, 284(I), 285(I), 193(II), 212(II), 214–216(II), 221–223(II), 225(II), 228(II), 229(II), 232(II), 234(II) conditioning, 70(I), 79(I), 81(I), 83(I), 84(I), 86(I), 95(I), 96(I), 99(I), 106(I), 107(I), 125(I), 142–144(I), 278(I), 290(I), 451(I), 467–470(I), 474(I), 726(I), 191(II), 195(II), 211(II), 212(II), 217(II), 219(II), 231(II), 232(II), 235(II), 236(II), 310(III), 317(III) Condorcet winner, 594(I), 596(I) conflict, 124(I), 126(I), 127(I), 130(I), 137(I), 430(I), 654(I), 655(I), 677–680(I), 682(I), 684(I), 696–698(I), 18(II), 44(II), 45(II), 105(II), 133(II), 134(II), 171(II), 183(III), 221(III), 341(III), 342(III) conflict graph, 622(I), 134(II) confluence, 156(I), 161(I), 429(I), 71(II), 11(III), 78(III) consistency (consistent), 56(I), 73(I), 90(I), 92(I), 100(I), 122(I), 126(I), 127(I), 134(I), 140(I), 141(I), 154(I), 158(I), 163(I), 165–170(I), 187(I), 190(I), 203(I), 205(I), 208(I), 257(I), 262(I), 263(I), 292(I), 315(I), 359–361(I), 365(I), 366(I), 373(I), 406(I), 417–422(I), 425(I), 428(I), 430(I), 431(I), 433–435(I), 442–449(I), 457–459(I), 461(I), 462(I), 464(I), 467(I), 468(I), 470–472(I), 474(I), 501(I), 504(I), 505(I), 508(I), 509(I), 512(I), 580(I), 581(I), 594–597(I), 636(I), 674–677(I), 681(I), 683(I), 684(I), 691(I), 694(I), 695(I), 697(I), 698(I), 723(I), 724(I), 742(I), 750(I), 760(I), 74(II), 78(II), 91(II), 92(II), 97(II), 126(II), 138(II), 140(II), 142(II), 153(II), 155–166(II), 168–170(II), 176(II), 177(II), 197–200(II), 202(II), 203(II), 211(II), 218(II), 231(II), 390(II), 391(II),

394(II), 459(II), 33(III), 34(III), 93(III), 98(III), 131(III), 135(III), 193(III), 198(III), 199(III), 241(III), 242(III), 289(III), 318(III), 341(III), 377(III), 410(III), 426(III), 473(III), 495(III), 508(III), 510(III)

constrained clustering, 136(I), 345(II), 356(II), 447(II), 450(II), 455(II), 456(II), 460–466(II), 471(II), 473(II), 474(II), 477(II), 478(II), 83(III)

constraint, 23(I), 25(I), 27(I), 74(I), 75(I), 80–82(I), 84(I), 89(I), 90(I), 93– 95(I), 97(I), 98(I), 100(I), 101(I), 120(I), 129(I), 137(I), 144(I), 156– 158(I), 160(I), 161(I), 163(I), 165– 170(I), 174(I), 175(I), 186(I), 194(I), 195(I), 197(I), 198(I), 201(I), 204– 208(I), 221(I), 227(I), 228(I), 231(I), 238(I), 239(I), 248(I), 255(I), 263(I), 265(I), 266(I), 270(I), 271(I), 278(I), 289(I), 294(I), 295(I), 298(I), 299(I), 314(I), 315(I), 319(I), 322(I), 374(I), 376(I), 378(I), 379(I), 406(I), 428(I), 452(I), 453(I), 456(I), 458(I), 459(I), 465(I), 469(I), 475(I), 501(I), 510(I), 511(I), 520(I), 537(I), 569(I), 615(I), 619(I), 621(I), 632(I), 633(I), 638(I), 652(I), 654(I), 674(I), 678(I), 684– 686(I), 688(I), 693(I), 695(I), 697(I), 699(I), 700(I), 711–715(I), 725(I), 735–737(I), 754(I), 13(II), 14(II), 16(II), 20(II), 21(II), 27(II), 29(II), 39(II), 41(II), 43–47(II), 83(II), 84(II), 88–94(II), 97(II), 100(II), 102(II), 103(II), 106(II), 108(II), 119(II), 122(II), 141(II), 153– 179(II), 185–189(II), 191–194(II), 196–199(II), 201–203(II), 221(II), 222(II), 224(II), 228(II), 234(II), 249–251(II), 253(II), 258(II), 286– 289(II), 294(II), 330(II), 345(II), 346(II), 356(II), 383(II), 385(II), 394(II), 422(II), 424(II), 447(II), 449(II), 450(II), 452–478(II), 486(II), 487(II), 489–491(II), 5(III), 7(III), 11(III), 76(III), 78(III), 91–93(III), 95–101(III), 107(III), 112(III), 113(III), 121(III), 126(III), 132(III), 148(III), 155–158(III), 160(III), 161(III), 164(III), 189(III), 191–195(III), 199(III), 201–203(III), 228–232(III), 234(III), 235(III),

237(III), 241(III), 245(III), 246(III), 265–267(III), 277–280(III), 282(III), 284(III), 285(III), 287–290(III), 292(III), 312(III), 337(III), 341(III), 344(III), 345(III), 349(III), 350(III), 355(III), 356(III), 359(III), 382(III), 397–400(III), 407(III), 408(III), 410–412(III), 416(III), 423(III), 427(III), 429(III), 445(III), 449(III), 452(III), 454(III), 468(III), 477(III), 478(III), 490(III), 491(III), 493(III), 494(III), 497(III), 506–512(III), 518(III), 519(III), 521–523(III), 525(III), 526(III), 536(III), 538(III), 541(III) constraint network, 160(I), 163(I), 165(I), 166(I), 168–170(I), 174(I), 201(I), 239(I), 465(I), 153–159(II), 162–164(II), 168–170(II), 175(II), 176(II), 185–187(II), 189(II), 288(II), 193(III), 344(III), 410(III) constraint programming, 615(I), 45(II), 90(II), 108(II), 153(II), 154(II), 178(II), 199(II), 201(II), 202(II), 287(II), 294(II), 346(II), 356(II), 460(II), 463(II), 487(II), 491(II), 199(III), 202(III), 231(III) constraint propagation, 163(I), 169(I), 174(I), 92(II), 122(II), 153(II), 157(II), 158(II), 167(II), 168(II), 172–174(II), 186–188(II), 196–199(II), 464(II), 490(II), 345(III) constraint satisfaction problem (CSP), 160(I), 169(I), 170(I), 201(I), 227(I), 228(I), 248(I), 379(I), 683(I), 27(II), 29(II), 39(II), 43–47(II), 90–92(II), 105(II), 141(II), 153(II), 154(II), 156(II), 164–166(II), 172–175(II), 177(II), 178(II), 185–193(II), 195–202(II), 289(II), 294(II), 394(II), 463(II), 464(II), 232(III), 266(III), 279(III), 344(III), 536(III), 538(III), 541(III) context, 102(I), 103(I), 245(I), 259(I), 266(I), 320(I), 324(I), 554(I), 684(I), 708(I), 737(I), 742(I), 745(I), 757(I), 763(I), 101(II), 187(II), 193(II), 233(II), 258(II), 261(II), 413(II), 415(II), 417(II), 420(II), 422(II), 423(II), 426(II), 427(II), 429–431(II), 434(II), 23(III), 25(III), 26(III), 28(III), 29(III), 60(III), 64(III), 72–74(III), 77(III), 78(III), 107(III), 118(III), 123(III), 127(III),

154(III), 155(III), 188(III), 195(III), 199(III), 242(III), 313(III), 352(III), 355(III), 369(III), 372(III), 373(III), 379(III), 421(III), 442(III), 445(III), 448(III), 449(III), 454(III), 468(III), 469(III), 522(III) contraction, 398(I), 447(I), 449(I), 450(I), 466(I), 212(II), 23(III), 221(III) contradiction, 12(I), 56(I), 58(I), 71–73(I), 75(I), 77(I), 120(I), 233(I), 257(I), 415(I), 416(I), 423(I), 426(I), 453(I), 711–714(I), 721(I), 724(I), 66(II), 78(II), 117(II), 129(II), 130(II), 133(II), 2(III), 16(III), 34(III), 63(III), 444(III) contrary-to-duty, 259(I), 263(I), 264(I), 267(I), 270(I) controllability, 153(I), 412(III) convexity (convex), 105(I), 120(I), 132(I), 133(I), 140–142(I), 144(I), 165(I), 171(I), 341(I), 343(I), 348(I), 354(I), 357(I), 358(I), 369–378(I), 380(I), 467(I), 534(I), 538–542(I), 562(I), 570–573(I), 610(I), 659(I), 165(II), 228(II), 304(II), 353(II), 357(II), 368(II), 418–420(II) convex learning, 341(I), 343(I), 354(I), 361(I), 369–373(I), 375(I), 378(I) convolutional neural network, 380(I), 380–382(II), 154(III), 164(III), 226(III), 498(III) correlation, 275(I), 277(I), 278(I), 288(I), 293(I), 492(I), 461(II), 225(III), 226(III), 307(III), 380(III), 447(III) cortex, 52(III), 304(III), 306(III), 308–310(III), 312(III), 314–316(III), 318–322(III), 324(III), 325(III), 380(III), 477(III) cost function, 26(I), 137(I), 359(I), 395(I), 397(I), 185(II), 186(II), 189–195(II), 197–203(II), 461(II), 230(III), 232(III), 356(III), 357(III) coverage, 332(I), 751(I), 385–389(II), 391(II), 393(II), 395(II), 454(II), 128(III), 130(III), 391(III), 399(III), 409(III), 419(III), 475(III), 531(III) CP-net, 221–231(I), 234(I), 235(I), 240(I), 245(I), 600(I), 614(I), 192(II), 236(II), 92(III), 105–107(III) creativity, 1(I), 366(III), 369(III), 437(III), 442(III), 487–490(III), 493(III), 495(III), 498(III), 505(III), 517(III), 524(III), 525(III), 527(III), 532(III)

Curry-Howard isomorphism, 4(III), 33(III), 42(III) D data base, 740(I), 756(I), 759(I), 107(II), 348(II), 435(II), 9(III), 183(III), 186(III), 187(III), 193(III), 194(III), 201(III), 202(III), 211(III) data integration, 212(I), 757(I), 108(III), 182(III), 216(III) data mining, 78(I), 102(I), 134(I), 136(I), 248(I), 328(I), 380(I), 394(I), 398(I), 740(I), 178(II), 276(II), 318(II), 339(II), 392(II), 395(II), 412(II), 415(II), 417(II), 422(II), 436(II), 447(II), 449(II), 450(II), 452(II), 471(II), 472(II), 475–477(II), 83(III), 131–134(III), 163(III), 200(III), 210(III), 245–247(III), 304(III), 307(III), 325(III), 356(III), 360(III), 365(III), 367(III), 375(III), 379(III), 380(III), 382(III), 391(III), 419(III), 442(III), 488(III) Davis and Putnam algorithm (DP algorithm), 130(II), 138(II) Davis, Logemann and Loveland algorithm (DLL algorithm, also known as DPLL algorithm, with P for Putnam), 121(I), 105(II), 128(II), 131–134(II), 139(II), 145(II) declarative approach, 450(II), 460(II) decidability, 209(I), 211(I), 648(I), 54(II), 63(II), 76(II), 1(III), 34(III), 36(III), 53(III), 59(III), 60(III), 62(III), 68(III), 69(III), 71(III), 75(III), 78(III), 84(III), 111(III) decision list, 346(I), 365(I) decision tree, 317(I), 342(I), 346(I), 352(I), 365(I), 367(I), 369(I), 404(I), 577–582(I), 178(II), 210(II), 227(II), 298(II), 351(II), 375(II), 392(II), 393(II), 170(III), 200(III), 223(III), 244(III), 376(III) decision under uncertainty, 559(I), 225(II), 286(II), 295(II), 169(III), 359(III) decision-support system, 318(I), 550(I), 739(I) decomposition, 127(I), 171(I), 211(I), 232–237(I), 239(I), 461(I), 600(I), 718(I), 2(II), 3(II), 59(II), 130(II), 136(II), 139(II), 140(II), 165(II), 166(II), 176(II), 177(II), 212(II), 236(II), 331(II), 352(II), 404(II), 76(III),

133(III), 134(III), 154(III), 248(III), 352(III), 353(III), 400(III), 407(III), 409(III), 410(III), 415(III), 416(III) deduction, 14(I), 16(I), 17(I), 24(I), 48(I), 88(I), 283(I), 307(I), 321(I), 330(I), 417(I), 422(I), 708(I), 711(I), 712(I), 770(I), 53(II), 54(II), 66(II), 67(II), 80(II), 83–86(II), 89(II), 91(II), 121(II), 248(II), 3(III), 4(III), 11(III), 16(III), 21(III), 23(III), 24(III), 26–28(III), 31–34(III), 63(III), 78(III), 92(III), 127(III), 158(III), 162(III), 199(III), 200–202(III), 218(III), 442(III), 445(III), 447(III) deep reinforcement learning, 398(I) deep learning, 380(I), 381(I), 394(I), 399(I), 272(II), 339(II), 376(II), 378–380(II), 394(II), 448(II), 477(II), 154(III), 163–165(III), 170(III), 226–229(III), 244(III), 245(III), 325(III), 359(III), 430(III), 441(III), 524(III), 537(III), 541(III) default, 8(I), 56–58(I), 64(I), 79(I), 89(I), 100(I), 101(I), 174(I), 246(I), 292(I), 294(I), 295(I), 431(I), 454(I), 464(I), 512(I), 654(I), 674(I), 679(I), 94(II), 95(II), 98(II), 103(II), 127(III), 160(III), 471(III) default logic, 56(I), 58(I), 64(I), 431(I), 674(I), 679(I), 95(II), 98(II), 103(II), 127(III), 160(III), 471(III) default negation, 94(II), 95(II) default rule, 56(I), 79(I), 89(I), 100(I), 101(I), 292(I), 457(I), 464(I), 512(I), 103(II) defeasible inference, 127(III) Dempster’s rule of combination, 124(I), 474(I) Dempster’s rule of conditioning, 125(I), 144(I), 231(II) deontic logic, 18(I), 52(I), 253–259(I), 261(I), 263(I), 265(I), 269–271(I), 631(I), 632(I), 637(I), 639(I), 92(III), 445(III) description logic, 64(I), 65(I), 175(I), 185(I), 187–190(I), 193(I), 195–197(I), 205–208(I), 211(I), 313(I), 316(I), 466(I), 513(I), 514(I), 752(I), 753(I), 63(II), 263(II), 264(II), 422(II), 433(II), 437(II), 78(III), 92(III), 98(III), 111(III), 112(III), 126(III), 158(III), 159(III), 189–193(III), 195–197(III), 199(III), 201(III), 221(III), 341(III)
diagnosis, 83(I), 94(I), 152(I), 153(I), 157–159(I), 174(I), 275(I), 276(I), 278(I), 283(I), 286(I), 287(I), 294(I), 296(I), 309(I), 320(I), 488(I), 493(I), 549(I), 550(I), 673–685(I), 687–695(I), 697–701(I), 720(I), 737(I), 738(I), 758(I), 4(II), 141–143(II), 202(II), 209(II), 210(II), 219(II), 237–239(II), 291(III), 408(III), 537(III), 541(III) dialogue, 17(I), 25(I), 390(I), 405(I), 427(I), 664(I), 665(I), 667(I), 668(I), 701(I), 716(I), 719–722(I), 332(II), 117(III), 119(III), 124(III), 128(III), 375(III), 377(III), 541(III) discrete-event system, 515(I), 673(I), 683–685(I) discriminative learning, 339(II) dissimilarity, 135(I), 137(I), 312(I), 326(I), 327(I), 329(I), 330(I), 405(I), 345(II), 346(II), 353–355(II), 361(II), 364(II), 367(II), 404(II), 450(II), 451(II), 454(II), 460(II), 465(II), 241(III), 451(III), 506(III) distributed decision, 592(I) diversification, 739(I), 27(II), 28(II), 30(II), 37–39(II), 41(II) ‘do’ operator, 282(I), 290(I) doxastic logic, 631(I) Dutch book, 133(I), 558(I), 559(I) dynamic epistemic logic, 45(I), 48(I), 60(I), 64(I), 633(I), 639(I) dynamic logic, 258(I), 266(I), 295(I), 487(I), 495(I), 497(I), 500(I), 504–506(I), 511(I), 514(I), 631(I), 637(I), 124(III) dynamic programming, 237(I), 238(I), 398(I), 403(I), 408(I), 579(I), 4(II), 166(II), 193(II), 196(II), 296(II), 304(II), 305(II), 223(III), 231(III), 239(III), 413(III), 414(III), 423(III), 524(III) dynamic semantics, 118(III), 123(III) dynamic system, 284(I), 487–490(I), 492(I), 701(I), 228(II), 382(III), 477(III), 538(III)

E egalitarianism (egalitarian), 459(I), 591(I), 606(I), 608(I), 611(I), 615(I), 656–659(I) Ellsberg’s urn, 565(I), 571–574(I)
embodied conversational agent, 367(III), 372–374(III), 382(III) emotion, 46(I), 52(I), 629(I), 630(I), 645–648(I), 772(I), 329(II), 333(II), 369(III), 373–376(III), 447(III), 504(III), 507(III), 508(III), 520(III), 521(III), 537(III) ensemble learning, 351(II), 368(II), 374(II), 375(II), 211(III), 225(III) envisionment, 156(I), 157(I) epistemic entrenchment, 101(I), 450(I), 469(I) epistemic logic, 45(I), 46(I), 48(I), 60(I), 64(I), 91(I), 99(I), 500(I), 504(I), 506(I), 513(I), 633(I), 634(I), 639(I), 92(III) epistemic state, 71(I), 85(I), 89(I), 90(I), 98(I), 138(I), 298(I), 441(I), 442(I), 446(I), 452–457(I), 463(I), 466(I), 468–470(I), 513(I), 629(I), 104(II), 537(III) equilibrium logic, 99(I), 99(II) equity, 528(I), 538(I), 540(I), 591(I), 592(I), 608(I), 609(I), 657(I) event, 79–82(I), 87(I), 88(I), 93–95(I), 173(I), 266(I), 287(I), 290(I), 291(I), 297(I), 298(I), 494(I), 495(I), 507(I), 508(I), 511(I), 551(I), 555(I), 565(I), 673(I), 683–685(I), 688(I), 689(I), 211(II), 212(II), 217(II), 235(II) exception, 4(I), 57(I), 64(I), 70(I), 72(I), 73(I), 88(I), 100(I), 246(I), 253(I), 255(I), 259–262(I), 405(I), 406(I), 523(I), 593(I), 614(I), 622(I), 770(I), 79(II), 97(II), 268(II), 195(III), 216(III), 229(III), 237(III), 391(III), 446(III), 465(III), 480(III), 537(III), 540(III), 541(III) execution, 154(I), 258(I), 319(I), 490–494(I), 496(I), 498(I), 500(I), 505(I), 524(I), 638(I), 89(II), 92(II), 94(II), 179(II), 288(II), 289(II), 291(II), 330(II), 472(II), 22(III), 98(III), 99(III), 101(III), 110(III), 317(III), 353(III), 373(III), 391(III), 401(III), 408–412(III), 416(III), 417(III), 419(III), 425–429(III), 459(III) existential rule, 185(I), 187(I), 188(I), 197(I), 203(I), 204(I), 206–210(I), 192(III) expansion, 209(I), 346(I), 376(I), 446–449(I), 470(I), 508(I), 3(II), 12(II), 14(II), 16(II), 67(II), 69–72(II),
301(II), 55(III), 149(III), 154(III), 155(III), 166(III), 167(III) expected utility, 19(I), 132(I), 133(I), 138(I), 407(I), 549–552(I), 554(I), 558(I), 561(I), 565(I), 567(I), 569(I), 570(I), 572(I), 573(I), 582(I), 227(II), 295(II), 305(II) experience, 11(I), 17(I), 308(I), 320(I), 342(I), 399(I), 408(I), 442(I), 7(II), 13(II), 35(II), 36(II), 46(II), 80(II), 105(II), 330(II), 331(II), 339(II), 340(II), 9(III), 52(III), 168(III), 320(III), 366(III), 373(III), 420–422(III), 440(III), 441(III), 508(III) expert system, 26(I), 70(I), 134(I), 283(I), 473(I), 673(I), 674(I), 676(I), 707(I), 708(I), 715–719(I), 721(I), 726(I), 734(I), 739(I), 740(I), 743(I), 755(I), 236(II), 237(II), 240(II), 378(II), 210(III), 240(III), 244(III), 340(III), 493(III), 536(III), 540(III) explanation, 17(I), 153(I), 174(I), 189(I), 238(I), 280(I), 282–284(I), 286(I), 290(I), 291(I), 294(I), 298(I), 299(I), 316(I), 321(I), 356(I), 477(I), 488(I), 494(I), 637(I), 674(I), 698(I), 707–709(I), 714–727(I), 108(II), 130(II), 146(II), 167(II), 168(II), 192(II), 202(II), 210(II), 217(II), 218(II), 237(II), 331(II), 342(II), 383(II), 478(II), 63(III), 136(III), 170(III), 212(III), 219(III), 233(III), 295(III), 307(III), 326(III), 444(III), 449(III), 461(III), 463(III), 467(III), 470(III), 475(III), 537(III) explanation-based learning, 167(II) exploitation / exploration, 403(I), 3(II), 4(II), 16(II), 27(II), 28(II), 38(II), 40(II), 41(II), 48(II), 72(II), 105(II), 131(II), 171(II), 172(II), 196(II), 202(II), 287(II), 315(II), 321(II), 339(II), 349(II), 350(II), 385(II), 387(II), 388(II), 392(II), 393(II), 412(II), 413(II), 432–436(II), 320(III), 420(III) expressiveness, 160(I), 501(I), 502(I), 507(I), 510(I), 512–514(I), 542(I), 99(II), 116(II), 126(II), 154(II), 378(II), 69(III), 187(III), 189(III), 191(III), 192(III), 201(III), 444(III) extension, 46(I), 47(I), 49(I), 52(I), 54(I), 57(I), 58(I), 60(I), 64(I), 71(I), 81(I), 86(I), 88(I), 94(I), 97(I), 99(I),
100(I), 102–105(I), 110(I), 119(I), 122(I), 125(I), 130(I), 153(I), 168(I), 171(I), 172(I), 192(I), 200–202(I), 205–207(I), 212(I), 221(I), 228(I), 229(I), 231(I), 234(I), 240(I), 245(I), 257(I), 258(I), 266(I), 269(I), 287(I), 288(I), 318(I), 328(I), 329(I), 347(I), 348(I), 353(I), 362(I), 365(I), 369(I), 389(I), 390(I), 397(I), 404(I), 431(I), 432(I), 435(I), 454(I), 455(I), 458(I), 464(I), 469(I), 471(I), 473(I), 475(I), 477(I), 503(I), 506(I), 514(I), 530(I), 536(I), 537(I), 580(I), 604(I), 613(I), 615(I), 619(I), 638(I), 639(I), 663(I), 666(I), 679(I), 680(I), 682(I), 684(I), 690(I), 692(I), 694(I), 753(I), 60(II), 94(II), 194(II), 198(II), 459(II), 7(III), 17(III), 68(III), 72(III), 78(III), 99(III), 111(III), 155(III), 156(III), 191(III), 199(III), 221(III), 236(III), 285(III), 292(III), 347(III) extrapolation, 299(I), 330(I), 487(I), 494(I), 508(I), 512(I) F fair allocation, 587(I), 588(I), 590(I), 592(I), 606(I), 607(I) fair division, 230(I), 242(I), 587(I), 606–608(I), 610(I), 611(I), 614–616(I), 622(I) feature, 58(I), 69(I), 91(I), 93(I), 100(I), 110(I), 135(I), 136(I), 152(I), 155(I), 160(I), 174(I), 206(I), 220(I), 255(I), 257(I), 309(I), 311(I), 312(I), 314(I), 324–328(I), 342(I), 344–346(I), 349(I), 363(I), 368(I), 376–378(I), 402(I), 403(I), 405(I), 406(I), 432(I), 496(I), 502(I), 505(I), 566(I), 568(I), 580(I), 582(I), 652(I), 674(I), 735(I), 741(I), 751(I), 752(I), 757(I), 759(I), 272(II), 347(II), 348(II), 352(II), 365(II), 366(II), 369(II), 371(II), 373(II), 375(II), 381(II), 395(II), 396(II), 397(II), 436(II), 454(II), 121(III), 125(III), 126(III), 134(III), 163(III), 212(III), 223(III), 224(III), 247(III), 249(III), 308(III), 347(III), 348(III), 504(III), 511–514(III), 516(III), 517(III) first-order logic, 19(I), 86(I), 108(I), 151(I), 185(I), 188(I), 192(I), 193(I), 195(I), 197(I), 200(I), 206(I), 211(I), 270(I), 463(I), 464(I), 466(I), 497(I), 678(I),
695(I), 53(II), 54(II), 58(II), 63(II), 69(II), 70(II), 73(II), 100(II), 103(II), 142(II), 248(II), 250(II), 252(II), 255(II), 258(II), 263(II), 266(II), 275(II), 289(II), 298(II), 300(II), 343(II), 384(II), 387(II), 388(II), 426(II), 65(III), 76(III), 112(III), 118(III), 121(III), 122(III), 125(III), 159(III), 184(III), 430(III) fixed point, 57(I), 397(I), 398(I), 159(II), 161(II), 247(II), 3(III), 14–16(III), 19(III), 43(III), 65(III), 100(III), 277(III), 413(III), 414(III) flexible query, 155(III), 160(III), 162(III) Floyd-Warshall algorithm, 411(III), 412(III) formal concept analysis (FCA), 10(I), 69(I), 70(I), 89(I), 97(I), 102(I), 103(I), 107(I), 110(I), 317(I), 324(I), 347(II), 389(II), 390(II), 396(II), 411–413(II), 415(II), 417(II), 418(II), 422(II), 425–429(II), 432–437(II), 200(III) frame, 24(I), 120(I), 266(I), 294(I), 295(I), 308(I), 495(I), 496(I), 499(I), 500(I), 502(I), 505(I), 506(I), 633(I), 687(I), 4(II), 10(II), 103(II), 119–121(III), 128(III), 153(III), 189(III), 221(III), 305(III), 340(III), 398(III), 407(III), 443(III), 444(III), 447(III), 453(III), 460(III), 469(III) frame problem, 294(I), 295(I), 495(I), 496(I), 499(I), 500(I), 502(I), 505(I), 506(I), 633(I), 103(II), 305(III), 443(III), 444(III), 447(III) frequent pattern, 345(II), 348(II), 389(II), 394(II), 395(II), 463(II) functional dependency, 325(I), 415(II), 98(III), 101(III) functional programming, 25(I), 84(II), 248(II), 270(II), 271(II), 6(III), 29(III), 32(III) function approximation, 390(I), 393–395(I), 397(I), 399(I), 402(I), 408(I) fusion, 59(I), 73(I), 84(I), 101(I), 127(I), 315(I), 415(I), 417(I), 441–444(I), 466(I), 467(I), 471–477(I), 493(I), 507(I), 508(I), 542(I), 588(I), 600(I), 103(II), 113(III), 199(III), 221(III), 338(III), 341(III), 342(III), 359(III), 366(III), 402(III), 445(III), 446(III), 449(III) fuzzy logic, 18(I), 21(I), 76(I), 646(I), 647(I), 674(I), 104(II), 384(II),
108(III), 148(III), 150(III), 157(III), 158(III), 376(III) fuzzy rule, 27(I), 94(I), 308(I), 312(I), 330–332(I) fuzzy set, 27(I), 69(I), 70(I), 74–76(I), 78(I), 89(I), 90(I), 93(I), 94(I), 105(I), 123(I), 307(I), 330–332(I), 537(I), 543(I), 92(III), 103–105(III), 107(III), 108(III), 155(III), 156(III), 160(III), 162(III), 210(III), 339(III), 342(III) G GAI-net, 235(I), 236(I) Galois connection, 102(I), 389(II), 411–413(II), 417(II), 418(II), 420(II), 428(II), 271(III) game theory, 19(I), 231(I), 523(I), 79(III), 128(III), 129(III), 150(III), 157(III), 168(III) generalization, 99(I), 120(I), 133(I), 141(I), 163(I), 246(I), 312(I), 313(I), 323(I), 341(I), 343(I), 350(I), 353(I), 361(I), 454(I), 459(I), 461(I), 511(I), 531(I), 534(I), 566(I), 570(I), 613(I), 679(I), 711(I), 7(II), 20(II), 58(II), 63(II), 119(II), 145(II), 352(II), 353(II), 383(II), 385–391(II), 394(II), 395(II), 417(II), 428(II), 64(III), 76(III), 136(III), 155(III), 156(III), 160(III), 161(III), 209(III), 218(III), 294(III), 310(III), 311(III), 313(III), 323(III), 340(III), 343(III), 425(III), 442(III), 445(III), 449(III), 496(III), 505(III), 527(III) generalized interval calculus, 163(I) generative adversarial networks (GANs), 383(II) generative learning, 358(II) genetic algorithm, 27(II), 29(II), 30(II), 32(II), 37(II), 38(II), 42–44(II), 127(II), 223(II), 157(III), 166(III), 210(III), 229(III), 240(III), 290(III), 450(III), 493(III) go, 399(I), 316(II), 318(II), 319(II), 321(II), 323(II), 327(II), 341(II), 382(II), 64(III), 219–221(III), 441(III) goal, 7(I), 24(I), 97(I), 152(I), 153(I), 155(I), 159(I), 204(I), 217(I), 228(I), 233(I), 240(I), 241(I), 243(I), 245(I), 276(I), 284(I), 309(I), 318(I), 319(I), 342(I), 344(I), 345(I), 347(I), 349(I), 350(I), 358(I), 379(I), 380(I), 408(I), 441(I),
488(I), 490(I), 492–495(I), 514(I), 549(I), 550(I), 571(I), 580(I), 606(I), 629–633(I), 636(I), 637(I), 640–642(I), 645–648(I), 653(I), 664(I), 665(I), 668(I), 717–720(I), 723(I), 735(I), 736(I), 739(I), 747(I), 758(I), 763(I), 4–6(II), 8–11(II), 13–18(II), 298(II), 299(II), 301(II), 302(II), 2(III), 34(III), 97(III), 100(III), 108(III), 120(III), 129(III), 130(III), 149(III), 163(III), 181(III), 182(III), 194(III), 199(III), 211(III), 232(III), 237(III), 238(III), 239(III), 250(III), 291(III), 304(III), 310(III), 313(III), 314(III), 316(III), 317(III), 326(III), 347(III), 349(III), 350(III), 351(III), 353(III), 355(III), 358(III), 366(III), 369(III), 371(III), 372(III), 374(III), 376(III), 382(III), 397–399(III), 408(III), 413(III), 414(III), 417(III), 425(III), 443(III), 444(III), 447(III), 453(III), 457(III), 460(III), 464(III), 467(III), 474–476(III), 488(III), 490(III), 494(III), 498(III), 509(III), 523(III), 524(III), 536(III), 539–541(III)
Gödel theorem, 16(I) graduality, 74(I), 331(I) grammar, 26(I), 49(I), 52(I), 60(I), 718(I), 318(II), 385(II), 390(II), 8(III), 64(III), 66(III), 69(III), 72(III), 73(III), 75(III), 77–79(III), 119–123(III), 125(III), 130(III), 247(III), 248(III), 285(III), 447(III), 493(III), 495(III), 505(III), 507(III), 513(III), 519(III) grammatical inference, 385(II), 390(II), 391(II), 394(II), 79(III), 238(III), 248(III) granularity, 69(I), 71(I), 77(I), 78(I), 168(I), 465(I), 684(I), 750(I), 363(II), 391(II), 99(III), 129(III), 340(III), 342(III), 407(III), 444(III), 445(III) graphical model, 84(I), 96(I), 99(I), 231(I), 234(I), 239(I), 248(I), 277(I), 283(I), 290(I), 293(I), 343(I), 346(I), 347(I), 379(I), 472(I), 506(I), 514(I), 549(I), 577(I), 582(I), 185(II), 189(II), 192(II), 193(II), 201(II), 202(II), 209(II), 210(II), 212–217(II), 219(II), 221(II), 225(II), 227–231(II), 233(II), 235–240(II),
83(III), 162(III), 232(III), 244(III), 442(III), 446(III) greedy algorithm, 598(I), 602(I), 33(II), 34(II), 228(III) Grice maxims, 124(III)

H Herbrand base, 86(II), 87(II), 260(II), 261(II) Herbrand model, 56(II), 86(II), 87(II) here-and-there logic, 98(II), 99(II) heuristic search, 27(I), 1(II), 2(II), 4(II), 7(II), 8(II), 18(II), 276(II), 287(II), 293(II), 298(II), 301(II), 536(III) heuristics, 27(I), 158(I), 167(I), 321(I), 367(I), 477(I), 598(I), 622(I), 659(I), 660(I), 674(I), 717(I), 755(I), 7(II), 8(II), 10–14(II), 17–19(II), 22(II), 23(II), 28(II), 30(II), 32(II), 33(II), 36(II), 40(II), 41(II), 44–47(II), 127(II), 132–135(II), 171(II), 172(II), 174(II), 198(II), 225(II), 234(II), 276(II), 287(II), 291–293(II), 298(II), 301(II), 302(II), 317(II), 318(II), 322–324(II), 326(II), 327(II), 342(II), 345(II), 346(II), 394(II), 54(III), 166(III), 200(III), 210(III), 231(III), 239(III), 290–292(III), 412(III), 414(III), 420(III), 441(III), 450(III), 452(III), 463(III), 465(III), 467–469(III), 472(III), 476(III), 504(III), 520(III), 522(III), 524(III), 525(III), 536(III), 538(III) hidden Markov model (HMM), 202(II), 227(II), 228(II), 256(II), 257(II), 384(II), 397(II), 398(II), 72(III), 79(III), 218(III), 221–223(III), 226(III), 228(III), 238(III), 248(III), 376(III) hierarchical clustering, 126(I), 344(II), 363(II), 364(II), 455(II), 461(II), 467(II) higher-order logic, 73(II), 120(III), 122(III) Horn clause, 169(I), 207(I), 208(I), 678(I), 83–87(II), 89(II), 122(II), 125(II), 62(III), 195(III), 277(III) human-centred design, 367(III) human-computer interaction, 305(II), 365–368(III), 370(III), 378(III), 380(III), 381(III), 477(III) Hurwicz criterion, 132(I), 574(I), 575(I)
hypothesis (hypothetical), 3(I), 4(I), 54(I), 82(I), 90(I), 103(I), 257(I), 283(I), 309(I), 325(I), 328(I), 342(I), 345(I), 346(I), 348–370(I), 374–378(I), 380(I), 416(I), 445(I), 463(I), 491(I), 507(I), 512(I), 525(I), 554(I), 690(I), 696(I), 718(I), 720(I), 769(I), 8(II), 16(II), 76(II), 77(II), 84(II), 116(II), 117(II), 124(II), 141(II), 142(II), 191(II), 224(II), 332(II), 339(II), 343(II), 349–353(II), 366–370(II), 374(II), 375(II), 383(II), 385–390(II), 392(II), 394(II), 401–403(II), 448(II), 5(III), 12(III), 13(III), 23(III), 26(III), 28(III), 30(III), 63(III), 101(III), 209(III), 218(III), 229(III), 231(III), 305–307(III), 311(III), 325(III), 380(III), 404(III), 439(III), 458(III), 460(III), 464(III), 476(III) I IDA*, 1(II), 15(II), 293(II), 325(II), 326(II) implication, 45–47(I), 50(I), 75(I), 87(I), 94(I), 95(I), 108(I), 166(I), 244(I), 256(I), 257(I), 277(I), 278(I), 294(I), 299(I), 331(I), 423(I), 425(I), 501(I), 502(I), 711–713(I), 56(II), 96(II), 98(II), 100(II), 101(II), 116(II), 134(II), 135(II), 248(II), 268(II), 387(II), 388(II), 415–417(II), 24(III), 110(III), 127(III), 158–160(III), 292(III), 308(III), 313(III), 314(III), 444(III), 478(III), 508(III) imprecise probability, 19(I), 20(I), 69(I), 85(I), 99(I), 105(I), 119(I), 120(I), 131(I), 138–145(I), 467(I), 472(I), 477(I), 228(II) incoherence, 434(I), 709–711(I), 714(I), 725(I), 198(III), 445(III) incompleteness, 16(I), 110(I), 138(I), 165(I), 709(I), 713(I), 725(I), 53(II), 76(II), 77(II), 80(II), 126(II), 33(III), 35(III), 110(III), 111(III), 148(III), 181(III), 244(III), 341(III), 342(III) inconsistency, 73(I), 97(I), 100(I), 153(I), 204(I), 212(I), 256(I), 257(I), 267(I), 332(I), 417–420(I), 422(I), 426(I), 429(I), 430(I), 432–434(I), 436(I), 441(I), 443(I), 446(I), 462(I), 464(I), 465(I), 471(I), 508(I), 514(I), 664(I), 665(I), 675(I), 678(I), 683(I), 697(I), 721(I), 726(I), 762(I), 28(II), 44(II),
106(II), 167(II), 168(II), 113(III), 191(III), 194(III), 195(III), 198(III), 202(III), 220(III), 221(III), 242(III), 295(III), 341(III), 410(III), 443(III), 444(III), 449(III), 537(III) independence, 10(I), 96(I), 124(I), 127(I), 217(I), 219(I), 221(I), 232(I), 233(I), 240(I), 284(I), 285(I), 288(I), 289(I), 326(I), 447(I), 473(I), 510(I), 553(I), 556(I), 564(I), 566(I), 568(I), 581(I), 589(I), 609(I), 655(I), 682(I), 62(II), 192(II), 193(II), 209(II), 212–217(II), 221–223(II), 225(II), 228–232(II), 234(II), 235(II), 247(II), 251(II), 297(II), 298(II), 416(II), 196(III), 292(III), 378(III), 475(III) indifference, 79(I), 220(I), 222(I), 524(I), 590(I), 102(III) INDU calculus, 164(I), 166(I), 169(I) induction (inductive), 4(I), 8(I), 11(I), 12(I), 14(I), 20(I), 23(I), 307(I), 317(I), 321(I), 576(I), 579(I), 580(I), 60(II), 67(II), 76–78(II), 80(II), 88(II), 178(II), 202(II), 262(II), 275(II), 296(II), 339(II), 340(II), 349–352(II), 354(II), 368(II), 383–386(II), 390(II), 392–394(II), 398(II), 401(II), 402(II), 453(II), 26(III), 29(III), 33(III), 65(III), 134(III), 200(III), 201(III), 227(III), 248(III), 277(III), 284(III), 292(III), 356(III), 460(III), 472(III), 481(III) inductive logic programming (ILP), 202(II), 262(II), 275(II), 384(II), 393(II), 394(II), 398(II), 227(III), 248(III), 277(III), 292(III) inference engine, 707(I), 710(I), 711(I), 716(I), 734(I), 259(II), 125(III) infinitesimal probability, 101(I), 104(I), 106(I), 467(I), 469(I) influence diagram, 581(I), 582(I), 225(II), 226(II), 300(II) information retrieval, 22(I), 231(I), 347(I), 379(I), 588(I), 734(I), 737(I), 754(I), 755(I), 757(I), 761(I), 763(I), 237(II), 264(II), 276(II), 411(II), 412(II), 414(II), 417(II), 432(II), 130(III), 147–149(III), 160(III), 165(III), 182(III), 199(III), 371(III), 373(III) information visualisation, 380(III) inheritance, 607(I), 133(II), 125(III), 340(III)
integrity constraint, 186(I), 270(I), 315(I), 458(I), 459(I), 511(I), 711–715(I), 91–93(III), 95–99(III), 101(III), 113(III) intelligent user interface, 365(III), 367(III), 369–373(III), 378(III) intensification, 27(II), 28(II), 30(II), 37–39(II), 43(II) interaction, 140(I), 159(I), 161(I), 174(I), 197(I), 232(I), 233(I), 237(I), 239(I), 240(I), 270(I), 277(I), 279(I), 280(I), 283–285(I), 288(I), 318(I), 390(I), 399(I), 428(I), 429(I), 541(I), 542(I), 588(I), 592(I), 613(I), 615(I), 622(I), 631(I), 638(I), 640–642(I), 645(I), 652(I), 654(I), 661(I), 682(I), 691(I), 708(I), 719–721(I), 723(I), 733(I), 740(I), 743(I), 761(I), 769(I), 772(I), 237(II), 305(II), 117(III), 164(III), 165(III), 211(III), 215(III), 217(III), 220(III), 225(III), 230(III), 232(III), 233(III), 243–245(III), 265–268(III), 283(III), 288(III), 296(III), 303(III), 310(III), 326(III), 338(III), 355(III), 357(III), 360(III), 365–376(III), 378(III), 380(III), 381(III), 416–419(III), 428(III), 429(III), 440(III), 450(III), 458(III), 477(III), 498(III), 532(III), 538(III) interactive learning, 26(I), 152(I), 158(I), 159(I), 708(I), 449(II) interpolative reasoning, 307(I), 308(I), 330(I), 333(I) interpretation, 3(I), 15(I), 20(I), 45(I), 51(I), 52(I), 76(I), 77(I), 79(I), 86(I), 89(I), 91(I), 97(I), 98(I), 101(I), 120(I), 131(I), 138(I), 144(I), 154(I), 160(I), 161(I), 168(I), 186(I), 190(I), 191(I), 195(I), 240(I), 244(I), 245(I), 255(I), 257(I), 293(I), 330(I), 331(I), 426(I), 442(I), 445(I), 447(I), 448(I), 451–456(I), 459–461(I), 464(I), 466(I), 475(I), 476(I), 500(I), 541(I), 553(I), 555(I), 557(I), 560(I), 561(I), 564(I), 570(I), 602(I), 604(I), 743–746(I), 55(II), 56(II), 60(II), 65(II), 66(II), 69(II), 71(II), 74(II), 75(II), 77(II), 86(II), 90(II), 91(II), 98(II), 116(II), 117(II), 126–128(II), 144(II), 153(II), 155(II), 192(II), 203(II), 211(II), 231(II), 233(II), 235(II), 249–251(II), 260(II), 264(II), 341(II), 412(II), 415(II),
426(II), 435(II), 448(II), 19(III), 29(III), 30(III), 32(III), 54(III), 92(III), 93(III), 95(III), 96(III), 99(III), 123(III), 127(III), 130(III), 134(III), 157–159(III), 161(III), 197(III), 210(III), 269–274(III), 277(III), 278(III), 337–339(III), 343(III), 344(III), 346(III), 347(III), 376(III), 378(III), 379(III), 390(III), 394(III), 408(III), 430(III), 462(III), 463(III), 469(III), 470(III), 478(III), 491(III), 492(III), 496(III), 509(III), 512(III) interval, 71(I), 73(I), 75(I), 76(I), 85–87(I), 89(I), 95–97(I), 105(I), 123(I), 125(I), 131(I), 136(I), 139(I), 140(I), 142(I), 155(I), 160(I), 162–168(I), 467(I), 474(I), 612(I), 642(I), 683(I), 684(I), 91(II), 200(II), 229(II), 231(II), 251(II), 264(II), 288(II), 290(II), 314(II), 400(II), 415(II), 419(II), 420(II), 434(II), 39(III), 103(III), 104(III), 182(III), 286(III), 410(III), 411(III), 426(III), 508(III), 521(III), 523(III) intervention, 277(I), 278(I), 281(I), 282(I), 284(I), 285(I), 289(I), 290(I), 293(I), 296(I), 297(I), 299(I), 300(I), 692(I), 449(II), 119(III), 237(III), 392(III), 446(III) intuitionistic logic, 16(I), 46(I), 428(I), 4(III), 195(III) inverse reinforcement learning, 356(III), 423(III), 424(III) is-a relation, 450(I) J Jaffray model, 574(I) Jeffrey rule, 457(I), 469(I), 471(I) junction tree, 239(I), 240(I), 691(I), 218(II), 227(II), 230(II), 233(II) K Kalman filter, 397(I), 512(I), 515(I), 694(I), 228(II), 372(II), 402(III), 403(III), 405(III) kernel method, 378(I), 373(II) k-means, 345(I), 344(II), 346(II), 347(II), 355–358(II), 360(II), 362(II), 363(II), 406(II), 447(II), 451(II), 455–458(II), 467(II), 154(III), 307(III)
knowledge acquisition, 316–318(I), 708(I), 709(I), 725(I), 734(I), 735(I), 740(I), 743(I), 749(I), 755(I), 126(III), 203(III), 354(III), 506(III), 509(III), 540(III) knowledge base, 64(I), 156(I), 185–190(I), 201(I), 202(I), 204(I), 207(I), 209(I), 292(I), 310(I), 316(I), 416–419(I), 428–432(I), 466(I), 474–476(I), 665(I), 707(I), 709(I), 711(I), 712(I), 714–716(I), 724(I), 738(I), 739(I), 742(I), 744(I), 750(I), 758(I), 760(I), 761(I), 89(II), 103(II), 116(II), 131(II), 137(II), 139(II), 142(II), 143(II), 145(II), 219(II), 253(II), 317(II), 411(II), 433(II), 125(III), 132(III), 135(III), 149(III), 151(III), 152(III), 192(III), 220(III), 244(III), 245(III), 246(III), 249(III), 340(III), 377(III), 379(III), 493(III) knowledge discovery, 102(I), 317(I), 333(I), 390(II), 411(II), 412(II), 415(II), 437(II), 449(II), 478(II), 200(III), 212(III), 370(III), 380(III), 381(III) knowledge engineering, 186(I), 311(I), 316(I), 317(I), 709(I), 721(I), 722(I), 725(I), 726(I), 733(I), 735(I), 411(II), 125(III), 152(III), 188(III), 203(III), 378(III), 444(III) knowledge representation, 24–27(I), 45(I), 46(I), 58(I), 64(I), 69(I), 70(I), 73(I), 77(I), 79(I), 86(I), 88(I), 91(I), 93(I), 99(I), 101(I), 102(I), 107(I), 110(I), 119(I), 155(I), 174(I), 175(I), 185(I), 187(I), 188(I), 197(I), 211(I), 212(I), 219(I), 246(I), 248(I), 255(I), 256(I), 262(I), 308(I), 332(I), 415(I), 443(I), 502(I), 506(I), 629(I), 631(I), 634(I), 679(I), 708(I), 724(I), 733–735(I), 739(I), 741(I), 742(I), 746(I), 748(I), 752(I), 754(I), 756(I), 760(I), 73(II), 95–97(II), 101(II), 103–105(II), 108(II), 209(II), 210(II), 237(II), 251(II), 255(II), 263(II), 264(II), 273(II), 276(II), 285(II), 306(II), 354(II), 411–413(II), 437(II), 449(II), 126(III), 127(III), 150(III), 157(III), 158(III), 160(III), 169(III), 181–183(III), 188(III), 201(III), 202(III), 211(III), 220(III), 305(III), 338(III), 340(III), 343(III), 359(III), 370(III), 373(III), 376(III), 429(III), 444(III), 446(III), 471(III), 476(III),
504(III), 506(III), 507(III), 520(III), 526(III), 527(III), 535(III), 537(III) knowledge-based system, 24(I), 185(I), 316(I), 447(I), 707–710(I), 715(I), 719(I), 722(I), 726(I), 727(I), 733–736(I), 739(I), 742(I), 747(I), 411(II), 491(III) Kolmogorov complexity, 43(III), 51(III), 53(III), 59(III), 81–83(III) Kripke semantics, 262(I), 635(I), 73(II)

L lambda calculus, 4(III), 10(III), 11(III), 19(III), 28(III), 65(III), 122(III), 136(III) least-squares methods, 397(I), 398(I), 369(II) lexical semantics, 745(I), 118(III), 120(III), 125(III), 130(III), 134(III) leximin, 243(I), 244(I), 461(I), 591(I), 611(I), 612(I), 615(I), 107(III), 161(III) lifted inference, 274(II), 275(II) likelihood, 11(I), 104(I), 134(I), 136(I), 144(I), 345(I), 348(I), 379(I), 402(I), 406(I), 472(I), 473(I), 576(I), 583(I), 660(I), 690(I), 700(I), 220–225(II), 275(II), 347(II), 349(II), 359(II), 370(II), 398(II), 401(II), 130(III), 133(III), 154(III), 165(III), 242(III), 464(III), 478(III) linear model, 377(I), 542(I), 550(I), 351(II), 353(II), 367(II), 368(II), 371(II), 376(II) LISP, 25(I), 26(I), 716(I), 119(III), 540(III) literal, 98(I), 245(I), 278(I), 346(I), 363(I), 365(I), 422(I), 428(I), 434(I), 501(I), 502(I), 511(I), 512(I), 676(I), 679(I), 680(I), 682(I), 711–713(I), 55(II), 57(II), 59–63(II), 68(II), 71(II), 72(II), 85–87(II), 97(II), 103(II), 116(II), 120–123(II), 125(II), 127(II), 128(II), 131–136(II), 138(II), 139(II), 142(II), 143(II), 145(II), 289(II), 291(II), 386–388(II), 126(III), 183(III), 184(III), 293(III), 308(III), 492(III), 496(III) literature, 6(I), 15(I), 22–23(I), 487–499(III) local search, 314(I), 27(II), 29(II), 31–33(II), 37–47(II), 126–128(II), 130(II),
174(II), 196(II), 456(II), 488(II), 489(II), 230(III), 412(III) localisation, 738(I), 435(II), 485(II), 486(II), 488(II), 401(III) logic, 2–11(I), 13–19(I), 21(I), 24–26(I), 45–61(I), 63–65(I), 69(I), 70(I), 72(I), 73(I), 76(I), 77(I), 79(I), 81(I), 86(I), 88(I), 89(I), 91(I), 93(I), 97–101(I), 105(I), 108(I), 110(I), 122(I), 129(I), 145(I), 151(I), 161(I), 167(I), 169–171(I), 175(I), 185(I), 187–190(I), 192(I), 193(I), 195–197(I), 199–201(I), 205–208(I), 211(I), 217(I), 220(I), 226(I), 228(I), 240–242(I), 244–247(I), 253–259(I), 261–266(I), 269–271(I), 277(I), 284(I), 287(I), 290(I), 291(I), 293–295(I), 298(I), 299(I), 313(I), 316(I), 319(I), 322(I), 325(I), 326(I), 332(I), 408(I), 416–418(I), 420(I), 423–429(I), 431(I), 442(I), 443(I), 445(I), 446(I), 448(I), 463–466(I), 472(I), 474(I), 475(I), 477(I), 487(I), 495(I), 497–499(I), 504–506(I), 511–514(I), 576(I), 582(I), 616(I), 620(I), 630–639(I), 642(I), 645–648(I), 674(I), 675(I), 678(I), 679(I), 695(I), 699(I), 708(I), 713(I), 752(I), 753(I), 769(I), 770(I), 73–76(II), 95–99(II), 102(II), 103(II), 259(II), 1–4(III), 23(III), 25(III), 32(III), 62(III), 65(III), 66(III), 70(III), 71(III), 75–77(III), 91–100(III), 107(III), 108(III), 112(III), 113(III), 118(III), 120–127(III), 130(III), 132(III), 148(III), 150(III), 157–160(III), 184(III), 188(III), 191(III), 194(III), 195(III), 198(III), 202(III), 211(III), 226(III), 227(III), 229(III), 236(III), 242(III), 248(III), 265–267(III), 271(III), 273(III), 277–280(III), 283–285(III), 289(III), 290(III), 292(III), 295(III), 376(III), 409(III), 426(III), 430(III), 444(III), 445(III), 460(III), 461(III), 464(III), 468–471(III), 479(III), 481(III), 526(III), 536(III) logic programming, 25(I), 65(I), 99(I), 206(I), 464(I), 500(I), 513(I), 62(II), 83–89(II), 91(II), 97–101(II), 103(II), 104(II), 107(II), 108(II), 153(II), 202(II), 248(II), 258(II), 259(II), 261(II), 262(II), 271(II), 274(II), 275(II), 384(II), 393(II),
394(II), 398(II), 194(III), 195(III), 227(III), 236(III), 248(III), 265–267(III), 277–279(III), 292(III), 295(III), 409(III), 445(III), 471(III), 526(III), 536(III) Lorenz dominance, 523(I), 528–531(I), 536(I), 539(I) lottery, 20(I), 80(I), 81(I), 86(I), 120(I), 141(I), 552(I), 553(I), 555(I), 559–562(I), 564(I), 566(I), 567(I), 573–575(I), 579(I) lower probability, 104(I), 110(I), 119–121(I), 139(I), 141(I), 144(I), 291(I) M machine learning, 21(I), 24(I), 78(I), 102(I), 134(I), 136(I), 248(I), 284(I), 312(I), 316(I), 328(I), 341–348(I), 358(I), 378(I), 380(I), 394(I), 398(I), 542(I), 549(I), 727(I), 746(I), 755(I), 761(I), 39(II), 47(II), 179(II), 209(II), 210(II), 253(II), 273(II), 275(II), 276(II), 318(II), 330(II), 331(II), 339(II), 340(II), 343(II), 349(II), 354(II), 358(II), 372(II), 378(II), 379(II), 386(II), 392(II), 394(II), 396(II), 403(II), 405(II), 406(II), 412(II), 450(II), 458(II), 74(III), 83(III), 117(III), 118(III), 128–134(III), 137(III), 150(III), 157(III), 163(III), 169(III), 199–203(III), 210(III), 211(III), 216(III), 220(III), 223(III), 225(III), 228(III), 243(III), 245(III), 246(III), 249(III), 265(III), 267(III), 291(III), 292(III), 304(III), 307(III), 319(III), 324(III), 325(III), 355(III), 356(III), 360(III), 366(III), 370(III), 371(III), 375(III), 379(III), 390(III), 391(III), 419(III), 442(III), 487(III), 488(III), 495(III), 504(III), 519(III), 522(III), 523(III), 537(III), 538(III) machine translation, 22(I), 131(III), 135(III), 541(III) Manhattan distance, 8(II), 18(II), 241(III) manipulation, 8(I), 136(I), 281(I), 289(I), 601–605(I), 619(I), 137(II), 262(II), 273(II), 326(III), 340(III), 390(III), 391(III), 393(III), 400(III), 407(III), 424(III), 462(III), 471(III), 519(III), 524(III) Markov blanket, 214(II), 215(II), 217(II), 224(II), 225(II)
Markov decision process (MDP), 390–393(I), 398(I), 402(I), 407(I), 408(I), 494(I), 515(I), 582(I), 202(II), 227(II), 285(II), 286(II), 295(II), 303–306(II), 331(II), 79(III), 353(III), 413(III), 420(III) Markov logic, 582(I), 202(II), 262(II), 265–268(II) Markov model, 161(I), 202(II), 227(II), 239(II), 256(II), 257(II), 384(II), 397(II), 72(III), 79(III), 223(III), 224(III), 228(III), 248(III), 376(III) Markov network, 347(I), 379(I), 229(II), 230(II), 265(II), 266(II), 274(II), 229(III) Markov random field, 192(II), 201(II), 202(II), 229(II), 247(II), 248(II), 265–268(II), 273(II), 274(II), 457(II) matching, 24(I), 308(I), 313–315(I), 405(I), 406(I), 750(I), 752(I), 758(I), 177(II), 199(II), 60(III), 77(III), 119(III), 148–150(III), 154(III), 155(III), 158(III), 162(III), 164(III), 165(III), 187(III), 196(III), 199–202(III), 232(III), 246(III), 280–282(III), 341(III), 343(III), 472(III), 524(III) maximum satisfiability problem (MaxSAT), 40(II), 127(II), 192(II), 201(II), 461(II), 232(III) mental state, 513(I), 630(I), 632(I), 633(I), 638–641(I), 645(I), 648(I), 664(I), 665(I), 373(III), 375(III), 521(III) merging, 59(I), 64(I), 73(I), 84(I), 101(I), 120(I), 127(I), 204(I), 205(I), 315(I), 332(I), 415(I), 417(I), 427(I), 431(I), 441(I), 443(I), 444(I), 456–466(I), 471–477(I), 493(I), 507(I), 508(I), 588(I), 600(I), 724(I), 727(I), 763(I), 236(II), 361(II), 365(II), 391(II), 436(II), 467(II), 468(II), 472(II), 473(II), 113(III), 199(III), 221(III), 281(III), 359(III), 445(III), 446(III), 449(III) meta-heuristics, 22(II), 27(II), 29(II), 30(II), 32(II), 33(II), 37–40(II), 46–48(II), 196(II), 223(II), 54(III), 166(III), 210(III), 290(III), 291(III), 450(III), 538(III) meta-knowledge, 708(I) meta-programming, 88(II) metonymy, 126(III) metric learning, 455–458(II) Möbius transform, 122(I)
modal logic, 3(I), 4(I), 18(I), 45(I), 46(I), 48(I), 49(I), 51(I), 53(I), 91(I), 100(I), 101(I), 108(I), 110(I), 122(I), 161(I), 170(I), 255–257(I), 269(I), 270(I), 287(I), 295(I), 299(I), 428(I), 446(I), 630(I), 631(I), 634–636(I), 640(I), 642(I), 648(I), 770(I), 73–76(II), 98(II), 101(III), 124(III), 150(III), 158(III), 159(III), 520(III) model-driven engineering, 435(II), 377(III) model reuse, 739(I), 741(I), 742(I), 749(I) model-based diagnosis, 153(I), 159(I), 278(I), 283(I), 673(I), 676(I), 684(I), 693(I), 698(I), 699(I), 701(I), 141–143(II) model-free reinforcement learning, 422(III) modus ponens, 50(I), 53(I), 54(I), 331(I), 424(I), 711(I), 116(II), 23(III), 463(III), 464(III), 471(III), 472(III) Monte-Carlo simulation, 126(I), 301(II), 302(II) Moore-Dijkstra algorithm, 5(II) moral agent, 637(I) motion planning, 27(I), 232(III), 391(III), 397(III), 400(III), 407(III), 415(III), 416(III) multicriteria decision, 94(I), 110(I), 248(I), 404(I), 471(I), 519–522(I), 525(I), 528(I), 537(I), 543(I), 544(I), 570(I), 577(I), 674(I), 169(III), 359(III) multiple sources (multi sources), 100(I), 427(I), 458(I), 664(I), 756(I) music, 12(I), 321(I), 324(I), 202(II), 333(II), 129(III), 503–527(III), 532(III) N Nash equilibrium, 407(I), 659(I) natural deduction, 17(I), 24(I), 54(II), 23(III), 24(III), 26–28(III), 32(III), 33(III) natural language, 12(I), 24(I), 25(I), 27(I), 46(I), 69(I), 71(I), 75(I), 76(I), 89(I), 152(I), 172(I), 173(I), 175(I), 186(I), 197(I), 219(I), 232(I), 244(I), 308(I), 316(I), 320(I), 342(I), 345(I), 380(I), 405(I), 717(I), 718(I), 720–722(I), 744(I), 745(I), 748(I), 754(I), 755(I), 4(II), 89(II), 103(II), 106(II), 108(II), 202(II), 269(II), 382(II), 398(II), 412(II), 103(III), 118(III), 122(III), 123(III), 126(III), 129(III), 148(III), 150(III), 151(III), 155(III), 170(III), 182(III), 199(III), 203(III), 218(III),

510 219(III), 326(III), 370(III), 371(III), 443(III), 444(III), 446(III), 495(III), 504(III), 506(III), 507(III), 509(III), 510(III), 512(III), 519(III), 541(III) natural language processing (NLP), 175(I), 197(I), 316(I), 320(I), 380(I), 720(I), 744–746(I), 752(I), 754(I), 755(I), 760(I), 762(I), 89(II), 202(II), 269(II), 398(II), 412(II), 69(III), 77(III), 117–121(III), 129(III), 130(III), 132(III), 133(III), 135(III), 137(III), 150–152(III), 154(III), 169(III), 217(III), 277(III), 278(III), 359(III), 430(III), 487(III), 488(III), 495(III), 497(III), 504(III), 507(III), 525(III) navigation, 404(I), 749(I), 759(I), 763(I), 237(II), 414(II), 432(II), 433(II), 187(III), 188(III), 303(III), 308(III), 310(III), 312(III), 314(III), 315(III), 326(III), 367(III), 390(III), 394(III), 397(III), 404(III), 407(III), 417(III), 421(III), 458(III) necessity, 46(I), 48(I), 50(I), 51(I), 53(I), 70(I), 74(I), 76(I), 89–95(I), 97(I), 99(I), 101(I), 103–105(I), 122(I), 123(I), 140–143(I), 158(I), 254(I), 256(I), 258(I), 270(I), 293(I), 356(I), 444(I), 451(I), 543(I), 691(I), 709(I), 735(I), 107(II), 305(II), 107(III), 153(III), 158(III), 162(III), 163(III), 197(III), 244(III), 369(III), 479(III), 539(III) necessity measure, 74(I), 76(I), 89(I), 91– 92(I), 101(I), 104–105(I), 123(I), 140–142(I), 143(I), 293(I), 451(I), 543(I) negation as failure, 259(II), 195(III) negotiation, 270(I), 592(I), 616(I), 622(I), 642(I), 651–655(I), 657–660(I), 662(I), 664–669(I), 725(I), 332(II), 185(III), 447(III), 453(III) network, 20(I), 22(I), 71(I), 83(I), 84(I), 86(I), 96(I), 99(I), 107(I), 136(I), 153(I), 159–161(I), 163(I), 165– 170(I), 174(I), 187(I), 197(I), 201(I), 205(I), 219(I), 221(I), 230(I), 231(I), 234–240(I), 283(I), 284(I), 289(I), 290(I), 296(I), 298(I), 322(I), 343(I), 347(I), 348(I), 363(I), 379(I), 380(I), 394(I), 398–400(I), 402(I), 406(I), 432–435(I), 465(I), 472(I), 487(I), 497(I), 506(I),

Index 507(I), 581(I), 582(I), 593(I), 640(I), 643(I), 683(I), 685(I), 690(I), 693(I), 708(I), 713(I), 723(I), 725(I), 726(I), 752(I), 755(I), 759(I), 35(II), 153– 159(II), 162–165(II), 168–170(II), 175(II), 176(II), 178(II), 185– 187(II), 189(II), 192(II), 202(II), 210(II), 212(II), 215–227(II), 229– 231(II), 234(II), 235(II), 237– 239(II), 247(II), 248(II), 251– 253(II), 255(II), 256(II), 258(II), 259(II), 262–264(II), 268–270(II), 273(II), 274(II), 276(II), 285(II), 286(II), 288(II), 297–299(II), 301(II), 306(II), 322(II), 339(II), 342(II), 350–352(II), 358(II), 373(II), 375(II), 376(II), 378– 384(II), 394(II), 397(II), 405(II), 406(II), 41(III), 42(III), 105(III), 130(III), 132(III), 134–136(III), 150(III), 154(III), 157(III), 162– 164(III), 168(III), 169(III), 182(III), 183(III), 187(III), 193(III), 196– 200(III), 210(III), 214(III), 216(III), 220(III), 221(III), 223(III), 226– 229(III), 231–237(III), 240(III), 244(III), 249(III), 250(III), 265– 268(III), 272(III), 274(III), 277– 280(III), 283(III), 287(III), 289– 296(III), 305(III), 307(III), 314(III), 319–326(III), 338(III), 340(III), 344(III), 369(III), 375(III), 376(III), 380(III), 382(III), 405(III), 406(III), 409–411(III), 419(III), 430(III), 446(III), 447(III), 468(III), 472(III), 477(III), 478(III), 497(III), 498(III), 520–523(III), 537(III) neural network, 136(I), 298(I), 343(I), 347(I), 348(I), 363(I), 380(I), 394(I), 398(I), 400(I), 402(I), 406(I), 178(II), 301(II), 322(II), 339(II), 342(II), 350–352(II), 358(II), 373(II), 375(II), 376(II), 378– 383(II), 394(II), 405(II), 406(II), 41(III), 130(III), 132(III), 134– 136(III), 150(III), 154(III), 157(III), 163(III), 164(III), 169(III), 200(III), 210(III), 223(III), 226(III), 229(III), 231(III), 265(III), 290(III), 304(III), 305(III), 307(III), 320(III), 322(III), 376(III), 419(III), 472(III), 477(III), 497(III), 498(III), 520(III), 522(III), 523(III), 537(III)

Index neuron, 20(I), 322(I), 347(I), 352(II), 376(II), 377(II), 379–382(II), 448(II), 41(III), 52(III), 250(III), 304(III), 305(III), 307(III), 309– 312(III), 314(III), 318(III), 321(III), 323–325(III), 358(III), 477(III) neuroscience, 83(III), 210(III), 303(III), 304(III), 307(III), 318–320(III), 324(III), 325(III), 358(III), 390(III), 448(III), 474(III), 538(III) noisy-OR, 263(II), 269(II), 270(II) non monotonic consequence relation (nonmonotonic consequence relation), 48(I), 58–59(I), 292–293(I), 299(I) non-monotonic inference (nonmonotonic inference), 58(I), 64(I), 295(I) non-monotonic logic (nonmonotonic logic), 295(I), 83(II), 84(II), 141(II), 126(III), 127(III), 130(III), 446(III), 471(III) non-monotonic reasoning (nonmonotonic reasoning), 45(I), 64(I), 65(I), 73(I), 77(I), 88(I), 91(I), 101(I), 175(I), 246(I), 248(I), 255(I), 256(I), 281(I), 332(I), 427(I), 443(I), 506(I), 514(I), 631(I), 634(I), 664(I), 679(I), 100(II), 121(III), 127(III), 128(III), 158(III), 359(III), 444(III), 446(III), 464(III), 471(III) norm, 18(I), 52(I), 74(I), 75(I), 96(I), 128(I), 253–255(I), 258–262(I), 265(I), 269(I), 270(I), 280(I), 299(I), 343(I), 358(I), 373(I), 374(I), 397(I), 534(I), 592(I), 632(I), 637(I), 639(I), 646(I), 647(I), 722(I), 738(I), 770(I), 362(II), 363(II), 103(III), 104(III), 156(III), 161(III), 241(III) normal form, 98(I), 192(I), 231(I), 680(I), 56(II), 57(II), 67(II), 73(II), 99(II), 100(II), 118(II), 119(II), 140(II), 143(II), 201(II), 247(III), 293(III) O obligation, 5(I), 108(I), 204(I), 253–271(I), 631(I), 632(I), 637(I), 506(III) observation, 8(I), 17(I), 54(I), 61(I), 71(I), 79(I), 80(I), 84(I), 87(I), 120(I), 135(I), 142(I), 143(I), 153(I), 174(I), 225(I), 278(I), 282–285(I), 287(I), 290(I), 299(I), 327(I), 345(I), 349(I), 357(I), 361(I), 389(I), 390(I), 405(I), 406(I), 416(I), 417(I), 442(I), 443(I), 472(I), 487–489(I), 491–496(I),

511 507(I), 542(I), 659(I), 663(I), 673– 677(I), 679(I), 681–685(I), 687(I), 688(I), 690–692(I), 695–697(I), 700(I), 740(I), 771(I), 7(II), 73(II), 142(II), 143(II), 172(II), 201(II), 217(II), 219(II), 257(II), 270(II), 303(II), 304(II), 346(II), 348(II), 354(II), 356(II), 357(II), 359(II), 361(II), 362(II), 365(II), 383(II), 385(II), 386(II), 389(II), 392(II), 395–397(II), 400(II), 406(II), 462(II), 26(III), 148(III), 185(III), 211(III), 216(III), 217(III), 224(III), 228(III), 232(III), 249(III), 250(III), 289–291(III), 293(III), 294(III), 306(III), 308(III), 318(III), 321(III), 339(III), 380(III), 404(III), 405(III), 444(III), 452(III), 453(III), 460(III), 463(III), 464(III), 508(III), 520(III), 522(III) ontology, 64(I), 151(I), 155(I), 174(I), 185– 189(I), 191(I), 192(I), 194(I), 195(I), 197(I), 198(I), 201(I), 205(I), 206(I), 210–212(I), 311(I), 316–318(I), 456(I), 463(I), 465(I), 466(I), 513(I), 708(I), 709(I), 713(I), 722(I), 723(I), 725(I), 726(I), 733(I), 734(I), 736– 738(I), 740(I), 742(I), 744–763(I), 112(III), 125(III), 149–151(III), 153(III), 162(III), 182(III), 183(III), 189(III), 192–202(III), 219–221(III), 246(III), 249(III), 268(III), 341(III), 348–352(III) ontology alignment, 726(I), 749(I), 752(I), 758(I), 197(III) ontology matching, 752(I), 196(III), 199(III), 200–202(III) ontology representation, 752(I) ontology reuse, 738(I), 747(I), 749(I), 750(I), 757(I) open world (open world assumption, OWA), 133(I), 186(I), 187(I), 207(I), 519(I), 536–539(I), 542(I), 611(I), 615(I) order of magnitude, 155(I), 158(I), 333(I), 117(II), 235(II), 8(III), 53(III), 54(III), 58(III), 213(III), 214(III) ordered weighted average (OWA), 133(I), 519(I), 536–539(I), 542(I), 611(I), 615(I), 161(III) ORD-Horn relation, 165(I), 167(I) ordinal conditional function (OCF), 106(I), 467(I), 234(II) ordinal utility, 19(I), 610(I)

512 overfitting, 356–358(I), 360(I), 350(II), 352(II), 368(II), 369(II), 379(II), 393(II), 249(III) OWL, 186(I), 188(I), 190(I), 193(I), 195– 197(I), 725(I), 748(I), 752–754(I), 762(I), 264(II), 433(II), 185(III), 189–195(III), 197–202(III), 216(III), 220(III), 221(III), 246(III) P PAC learning, 341(I), 351(I), 352(I), 355(I), 364–366(I), 292–295(III) paraconsistent logic, 73(I), 417(I), 418(I), 423(I), 425–427(I) parameter learning, 379(I), 220(II), 234(II), 275(II) parametric learning method, 351(II) Pareto dominance, 521(I), 522(I), 527(I), 530(I), 531(I), 589(I), 245(III) parfactor, 253–255(II), 258(II), 259(II), 266(II), 269(II), 274(II), 275(II) partial order, 129(I), 131(I), 198(I), 200(I), 202(I), 204(I), 234(I), 521(I), 603(I), 604(I), 413(II), 434(II), 105–107(III) partition, 75(I), 78(I), 79(I), 83(I), 87(I), 137(I), 138(I), 155(I), 163(I), 166(I), 167(I), 169(I), 221(I), 345(I), 350(I), 368(I), 420(I), 468(I), 555(I), 558(I), 568(I), 653(I), 667(I), 681(I), 749(I), 194(II), 265(II), 267(II), 293(II), 344–347(II), 351(II), 354–356(II), 362–367(II), 392(II), 418(II), 434(II), 450(II), 451(II), 458(II), 461(II), 462(II), 464(II), 467(II), 470(II), 488–490(II), 68(III), 106(III), 228(III), 307(III) partition scheme, 166(I), 167(I) path-consistency, 162(II), 165(II) pattern, 3(I), 4(I), 8(I), 12(I), 24(I), 173(I), 175(I), 284(I), 308(I), 323–326(I), 328(I), 329(I), 331(I), 345(I), 390(I), 588(I), 685(I), 690(I), 692(I), 718(I), 724(I), 726(I), 745(I), 746(I), 751(I), 18(II), 20(II), 23(II), 46(II), 47(II), 154(II), 175–178(II), 247(II), 252(II), 269(II), 275(II), 326(II), 327(II), 340(II), 345– 348(II), 350(II), 353(II), 354(II), 376(II), 382(II), 384(II), 385(II), 389(II), 394–396(II), 398(II), 412(II), 413(II), 415(II), 417– 421(II), 425–429(II), 432–434(II), 436(II), 449(II), 454(II), 461(II),

Index 463–466(II), 471(II), 474–477(II), 60(III), 77(III), 83(III), 119(III), 120(III), 132(III), 162(III), 186– 188(III), 194(III), 200(III), 222(III), 228(III), 236(III), 245–247(III), 284(III), 309(III), 323(III), 379(III), 380(III), 451(III), 468(III), 492(III), 493(III), 497(III), 508–511(III), 522– 525(III) pattern discovery, 437(II), 475(II) pattern recognition, 24(I), 175(I), 390(I), 588(I), 237(II), 304(II), 322(II), 376(II), 405(II), 406(II), 412(II), 5(III), 83(III), 120(III), 136(III), 238(III), 308(III), 315(III), 337– 340(III), 359(III), 380(III), 391(III), 419(III), 443(III), 537(III) pattern structure, 412(II), 413(II), 415(II), 417–421(II), 426(II), 433(II), 434(II) PDDL, 404(I), 502(I), 271(II), 285–290(II), 298–302(II), 305(II), 418(III) perceptron, 20(I), 365(I), 380(I), 352(II), 370–373(II), 376–380(II), 164(III), 477(III) permission, 108(I), 253(I), 254(I), 256– 259(I), 262(I), 265(I), 269(I), 271(I), 631(I) persuasion, 276(I), 432(I), 592(I), 616(I), 622(I), 651(I), 664(I), 272(II), 447(III), 453(III), 516(III) pignistic probability, 133(I), 137(I) Pigou–Dalton principle, 528(I), 529(I), 536(I), 609(I) planning, 25(I), 27(I), 217(I), 226(I), 248(I), 268(I), 284(I), 309(I), 319(I), 390(I), 391(I), 404(I), 488(I), 492–495(I), 501(I), 502(I), 506(I), 515(I), 520(I), 549(I), 582(I), 583(I), 606(I), 674(I), 683(I), 701(I), 718(I), 736(I), 757(I), 28(II), 95(II), 101(II), 104(II), 107(II), 125(II), 144(II), 201(II), 202(II), 210(II), 227(II), 237(II), 239(II), 271(II), 285–295(II), 298(II), 300–306(II), 318(II), 330(II), 331(II), 83(III), 124(III), 231(III), 232(III), 313(III), 319(III), 326(III), 350(III), 353(III), 357(III), 366(III), 367(III), 370(III), 371(III), 376(III), 389–391(III), 394(III), 397–401(III), 407–427(III), 429(III), 430(III), 442(III), 443(III), 447(III), 467(III), 475(III), 476(III),

Index 479(III), 494(III), 509(III), 517(III), 519(III), 526(III) plate, 253–255(II), 258(II), 261(II), 272(II), 274(II) plausibility, 54(I), 60(I), 74(I), 89(I), 90(I), 101(I), 119(I), 121–123(I), 125(I), 139–141(I), 143(I), 445(I), 450– 452(I), 454(I), 456(I), 457(I), 459(I), 460(I), 467(I), 470(I), 512(I), 513(I), 681(I), 210(II), 342(III), 469(III), 474(III) plausibility function, 74(I), 122(I), 123(I), 139–141(I), 143(I), 470(I), 231(II) point calculus, 160(I), 161(I), 163(I), 165(I), 166(I) policy, 270(I), 271(I), 389–395(I), 397– 408(I), 464(I), 465(I), 491(I), 498(I), 576(I), 31(II), 144(II), 226(II), 295(II), 296(II), 300– 302(II), 304(II), 305(II), 319(II), 320(II), 322(II), 413(III), 414(III), 420–422(III), 424(III) policy evaluation, 392(I), 401(I) policy gradient, 402(I), 408(I) policy iteration, 392(I), 393(I), 397(I), 407(I), 296(II), 302(II) policy search, 389(I), 390(I), 400(I), 401(I), 403(I), 404(I), 408(I) polynomial hierarchy, 462(I), 477(I), 598(I), 102(II), 144(II) possibilistic inference, 100(I) possibilistic logic, 69(I), 89(I), 93(I), 97–101(I), 105(I), 145(I), 242(I), 420(I), 474(I), 475(I), 477(I), 104(II), 233(II), 92(III), 107(III) possibilistic network, 99(I), 290(I), 507(I), 210(II), 232–234(II) possibility, 18–21(I), 27(I), 46(I), 54(I), 69(I), 70(I), 74(I), 76(I), 86(I), 89–110(I), 119(I), 122(I), 123(I), 129(I), 131(I), 136(I), 137(I), 140(I), 141(I), 143(I), 161(I), 168(I), 170(I), 218(I), 222(I), 232(I), 242(I), 254(I), 256(I), 258(I), 265(I), 270(I), 271(I), 278(I), 287(I), 290(I), 293(I), 299(I), 310(I), 325(I), 330–332(I), 406(I), 415(I), 420(I), 441(I), 443(I), 451(I), 458(I), 466–473(I), 475(I), 476(I), 489(I), 511(I), 535(I), 540(I), 542(I), 543(I), 555(I), 558(I), 572(I), 575(I), 590(I), 593(I), 596(I), 600(I), 603(I), 622(I), 635(I), 646(I), 752(I), 757(I), 34(II), 78(II), 79(II), 88(II), 92(II),

513 104(II), 105(II), 122(II), 127(II), 140(II), 203(II), 211(II), 230– 234(II), 236(II), 237(II), 249(II), 261(II), 275(II), 289(II), 292(II), 305(II), 322(II), 330(II), 331(II), 374(II), 379(II), 393(II), 455(II), 38(III), 82(III), 83(III), 103(III), 107(III), 112(III), 113(III), 132(III), 153(III), 155(III), 158(III), 159(III), 162(III), 163(III), 210(III), 225(III), 229(III), 233(III), 236(III), 240(III), 242(III), 286(III), 295(III), 321(III), 342(III), 377(III), 380(III), 444(III), 446–449(III), 452(III), 458(III), 462(III), 469(III), 470(III), 476(III), 478(III), 480(III), 507(III), 508(III) possibility function, 94(I), 98(I), 122(I) possible world, 10(I), 18(I), 49(I), 50(I), 52(I), 54(I), 55(I), 81(I), 88(I), 110(I), 240(I), 246(I), 257(I), 287(I), 466– 468(I), 471(I), 473(I), 637(I), 639(I), 640(I), 73(II), 74(II), 76(II), 159(III) postdiction, 283(I), 487(I), 492(I), 493(I), 495(I) pragmatics, 17(I), 278(I), 358(I), 367(I), 708(I), 115(II), 126(II), 117(III), 118(III), 120(III), 121(III), 126(III), 128(III), 130(III), 131(III), 437(III), 440(III), 443–447(III), 450(III) pre-convex relation, 165(I), 171(I) predicate, 14(I), 16(I), 18(I), 25(I), 56(I), 57(I), 69(I), 74(I), 75(I), 108(I), 167(I), 188(I), 189(I), 197(I), 200(I), 204(I), 208(I), 210(I), 294(I), 315(I), 324(I), 325(I), 498(I), 675(I), 676(I), 696(I), 713(I), 735(I), 54(II), 55(II), 57(II), 59(II), 63(II), 66(II), 71(II), 85(II), 86(II), 88(II), 90(II), 100(II), 101(II), 107(II), 143(II), 299(II), 387(II), 22(III), 32(III), 70(III), 93(III), 95(III), 96(III), 103(III), 119(III), 126(III), 134(III), 153(III), 158(III), 159(III), 183–185(III), 187(III), 189(III), 200(III), 235(III), 322(III), 323(III), 460(III), 461(III), 471(III), 498(III), 517(III) prediction, 83(I), 84(I), 142–144(I), 154(I), 157(I), 159(I), 275(I), 276(I), 283(I), 291(I), 294(I), 328(I), 342(I), 345– 348(I), 376(I), 377(I), 403(I), 404(I), 487(I), 488(I), 492(I), 493(I), 495(I), 502(I), 549(I), 564(I), 660(I), 674(I), 682(I), 
687(I), 210(II), 219(II), 340–

514 343(II), 348(II), 353(II), 377(II), 378(II), 399(II), 402(II), 403(II), 2(III), 123(III), 130(III), 136(III), 211(III), 212(III), 216(III), 220– 229(III), 232(III), 238(III), 242(III), 244(III), 245(III), 248(III), 266(III), 267(III), 290(III), 291(III), 317(III), 318(III), 322–324(III), 371(III), 408(III), 426(III), 463(III), 465(III), 468(III), 474(III), 475(III), 477(III), 478(III), 480(III), 540(III) preference, 12(I), 93(I), 97(I), 99(I), 136(I), 185(I), 217–225(I), 228–235(I), 237(I), 239–248(I), 261–265(I), 299(I), 312(I), 317(I), 330(I), 341– 343(I), 346(I), 347(I), 357(I), 358(I), 379(I), 380(I), 404(I), 407(I), 419– 421(I), 429(I), 457(I), 459(I), 460(I), 471(I), 476(I), 514(I), 519–528(I), 530–544(I), 550–554(I), 557–559(I), 561(I), 563–565(I), 567(I), 568(I), 570(I), 571(I), 573–577(I), 582(I), 583(I), 588–593(I), 599–601(I), 603–612(I), 614–618(I), 622(I), 630(I), 639(I), 651–653(I), 657(I), 659–664(I), 666–668(I), 681(I), 682(I), 690(I), 697(I), 770(I), 92(II), 104(II), 177(II), 190(II), 192(II), 203(II), 236(II), 286(II), 289– 291(II), 295(II), 303(II), 350(II), 404(II), 447(II), 450(II), 471(II), 474–477(II), 486(II), 83(III), 91(III), 92(III), 102–108(III), 113(III), 129(III), 164(III), 169(III), 228(III), 231(III), 245(III), 248(III), 321(III), 370(III), 371(III), 373(III), 450(III), 462(III), 465(III), 470(III), 481(III), 496(III) preference aggregation, 380(I), 457(I), 519(I), 520(I), 522(I), 588(I), 615(I) preference elicitation, 136(I), 239(I), 247(I), 248(I), 563(I), 582(I), 583(I), 305(II) preference relation, 219–225(I), 229(I), 230(I), 242(I), 246(I), 247(I), 261(I), 262(I), 299(I), 312(I), 347(I), 379(I), 419(I), 420(I), 514(I), 519(I), 521– 524(I), 527(I), 531–533(I), 540(I), 543(I), 551–554(I), 557(I), 558(I), 567(I), 570(I), 573(I), 574(I), 576(I), 589(I), 600(I), 666(I), 667(I), 104(II), 102(III), 106(III) preferential independence, 219(I), 221(I) preferential inference, 59(I), 60(I), 101(I)

Index preferred world, 259(I), 263(I), 265(I), 636(I) prime implicant, 679(I), 680(I), 682(I), 138– 140(II), 236(III), 294(III) prime implicate, 433–435(I), 679(I), 120(II), 138–140(II), 142(II), 143(II) priority, 65(I), 73(I), 97(I), 99(I), 240(I), 243(I), 244(I), 246(I), 315(I), 418(I), 421(I), 422(I), 435(I), 441(I), 444(I), 450(I), 458(I), 463(I), 475(I), 510(I), 594(I), 602(I), 604(I), 609(I), 55(II), 196(II), 107(III), 315(III), 379(III), 422(III), 426(III), 428(III) probabilistic description logic, 263(II), 264(II), 201(III) probabilistic inductive logic programming, 262(II), 275(II) probabilistic inference, 219(II) probabilistic logic, 18(I), 247–249(II), 251(II), 263(II) probabilistic logic programming, 103(II), 248(II), 259–262(II), 273–275(II), 277(III) probabilistic programming, 247(II), 248(II), 269(II), 272–275(II), 292(III) probabilistic relational model, 582(I), 255(II) probabilistic rule, 477(I) probability, 7(I), 9–12(I), 14(I), 17– 20(I), 23(I), 54(I), 69(I), 70(I), 73(I), 74(I), 76(I), 78–92(I), 95(I), 96(I), 99(I), 101(I), 104–106(I), 110(I), 119–124(I), 129–133(I), 135(I), 137–145(I), 219(I), 230(I), 239(I), 240(I), 242(I), 278(I), 280(I), 281(I), 283(I), 285–291(I), 293(I), 296(I), 299(I), 330–332(I), 343– 349(I), 351–354(I), 356–358(I), 360(I), 362(I), 364–366(I), 368(I), 379(I), 391(I), 401–404(I), 406(I), 407(I), 441–443(I), 451(I), 457(I), 458(I), 466–469(I), 472–475(I), 477(I), 489(I), 491–494(I), 506(I), 507(I), 531(I), 537(I), 549(I), 551– 555(I), 557–560(I), 564–568(I), 570–573(I), 575(I), 578–583(I), 598(I), 617(I), 641(I), 642(I), 647(I), 680–682(I), 697(I), 699(I), 700(I), 30(II), 32(II), 34–36(II), 41(II), 46(II), 128(II), 178(II), 192(II), 202(II), 203(II), 210–213(II), 215– 218(II), 220(II), 222–224(II), 227– 231(II), 233–236(II), 247–250(II),

Index

515 251–253(II), 255(II), 259–261(II), 263–265(II), 267–275(II), 295(II), 297–299(II), 301–303(II), 316(II), 319(II), 323(II), 341(II), 342(II), 346(II), 349(II), 351(II), 358– 360(II), 365(II), 366(II), 370(II), 381(II), 382(II), 384(II), 394(II), 397(II), 398(II), 449(II), 458(II), 38(III), 43(III), 72(III), 77(III), 80(III), 83(III), 103(III), 107(III), 113(III), 133(III), 148(III), 149(III), 153(III), 158–160(III), 162(III), 163(III), 200(III), 210(III), 223(III), 229(III), 236(III), 237(III), 242(III), 244(III), 245(III), 249(III), 270(III), 275(III), 286(III), 289(III), 292(III), 293(III), 307(III), 317(III), 342(III), 399(III), 404(III), 405(III), 413(III), 420–422(III), 444(III), 446(III), 461(III), 465(III), 466(III), 468(III), 469(III), 472(III), 477(III), 478(III), 496(III), 537(III)

progression, 487(I), 492–497(I), 501(I), 502(I), 504(I), 507–509(I), 511(I), 5(II), 174(II) prohibition, 204(I), 253(I), 254(I), 256(I), 257(I), 259(I), 262(I), 265(I), 267(I), 269–271(I), 186(II), 197(II) PROLOG, 25(I), 204(I), 713(I), 83(II), 84(II), 87–94(II), 106(II), 108(II), 394(II), 123(III), 409(III), 513(III), 520(III) proof system, 54(II), 122(II), 134(II) propositional logic, 46(I), 49(I), 50(I), 52(I), 53(I), 81(I), 88(I), 91(I), 97(I), 169(I), 175(I), 217(I), 220(I), 226(I), 228(I), 240(I), 241(I), 244(I), 295(I), 316(I), 319(I), 325(I), 429(I), 445(I), 448(I), 463–466(I), 477(I), 500(I), 512(I), 514(I), 616(I), 675(I), 699(I), 54(II), 55(II), 64(II), 70(II), 71(II), 95(II), 115(II), 116(II), 120(II), 121(II), 123(II), 125(II), 126(II), 128(II), 137(II), 142(II), 143(II), 145(II), 146(II), 167(II), 192(II), 201(II), 202(II), 248(II), 285(II), 294(II), 384(II), 386–388(II), 460(II), 62(III), 66(III), 76(III), 132(III), 158– 160(III), 202(III), 445(III) protocol, 120(I), 138(I), 219(I), 588(I), 605(I), 613(I), 651(I), 652(I), 657– 665(I), 667(I), 668(I), 692(I), 758(I),

771(I), 54(II), 63(II), 186(III), 292(III), 293(III), 295(III) psychology, 26(I), 277(I), 279(I), 285(I), 298(I), 321(I), 322(I), 645(I), 648(I), 727(I), 742(I), 743(I), 754(I), 770(I), 332(II), 123(III), 304(III), 305(III), 317(III), 390(III), 448(III), 454(III), 461(III), 462(III), 466(III), 473(III), 481(III), 504(III), 506–508(III), 527(III) public announcement logic, 60(I), 61(I), 639(I)

Q Q-algebra, 154(I) Q-function, 393(I), 397(I), 400(I) Q-learning algorithm, 231(III), 421(III), 422(III) qualification problem, 497(I), 443(III) qualitative algebra, 152(I), 154(I), 315(I), 319(I) qualitative physics, 151(I), 152(I) qualitative reasoning, 52(I), 151–159(I), 167(I), 172(I), 174(I), 175(I), 296(I), 315(I), 332(I), 333(I), 635(I), 674(I), 683(I), 685(I), 698(I), 700(I), 338(III), 339(III), 342(III), 410(III) qualitative simulation, 154–159(I), 683(I) qualitative utility, 575(I), 576(I) quantified Boolean function (QBF), 115(II), 143–146(II), 64(III) quantum model, 477(III) query, 7(I), 25(I), 187(I), 188(I), 193(I), 195(I), 201(I), 202(I), 204(I), 206(I), 209–212(I), 224(I), 227(I), 230(I), 247(I), 299(I), 310(I), 311(I), 318– 320(I), 406(I), 434(I), 539(I), 613(I), 752(I), 753(I), 756(I), 760(I), 85– 87(II), 137(II), 138(II), 145(II), 210(II), 217–219(II), 227–229(II), 274(II), 275(II), 403(II), 428(II), 432(II), 433(II), 461(II), 476(II), 53(III), 78(III), 92(III), 94–100(III), 102–112(III), 147–151(III), 153– 169(III), 186–188(III), 193(III), 194(III), 198(III), 202(III), 203(III), 236(III), 284(III), 288(III), 341(III), 371(III), 380(III) query rewriting, 195(I), 212(I), 110– 112(III), 193(III)

516 R ramification problem, 295(I), 496(I), 501(I), 443(III), 447(III) rank-dependent utility (RDU), 537(I), 549(I), 554(I), 566–571(I), 578(I), 580(I), 583(I) rationality postulates, 508(I), 513(I) RCC-8, 161(I), 163–166(I), 169–171(I) RDF, 188(I), 723(I), 724(I), 753(I), 754(I), 759(I), 760(I), 762(I), 418(II), 427(II), 433(II), 78(III), 181(III), 183–195(III), 201(III), 202(III), 216(III), 219(III), 246(III) RDFS, 202(I), 318(I), 319(I), 748(I), 112(III), 188–194(III) reasoning, 1–11(I), 13–15(I), 18(I), 22(I), 24–27(I), 45(I), 52(I), 56(I), 58(I), 63–65(I), 69(I), 70(I), 72(I), 73(I), 75–77(I), 80(I), 84(I), 88(I), 89(I), 91(I), 94(I), 96(I), 97(I), 99–101(I), 124(I), 133(I), 145(I), 151–161(I), 163(I), 167–175(I), 185–190(I), 192(I), 193(I), 197(I), 198(I), 200(I), 201(I), 203(I), 206(I), 211(I), 212(I), 219(I), 242(I), 246(I), 248(I), 253– 257(I), 259(I), 262(I), 264–266(I), 270(I), 276(I), 279(I), 281–284(I), 290(I), 291(I), 294–296(I), 307(I), 308(I), 311(I), 312(I), 314–316(I), 320(I), 321(I), 324(I), 325(I), 327(I), 330–333(I), 415–420(I), 425(I), 427(I), 431–433(I), 435(I), 436(I), 443(I), 444(I), 446(I), 462(I), 465(I), 466(I), 472(I), 487–489(I), 492– 497(I), 500(I), 505–507(I), 513– 515(I), 558(I), 629–631(I), 633– 635(I), 637(I), 645(I), 646(I), 664(I), 665(I), 673(I), 674(I), 679(I), 683– 685(I), 687(I), 698(I), 700(I), 707(I), 708(I), 713–719(I), 721–723(I), 726(I), 733(I), 735–737(I), 739(I), 743(I), 747(I), 749–753(I), 755(I), 763(I), 769(I), 770(I), 772(I), 44(II), 45(II), 53(II), 54(II), 56(II), 64(II), 73(II), 76–78(II), 80(II), 83(II), 91(II), 95(II), 97(II), 100(II), 101(II), 103–105(II), 115–118(II), 120– 123(II), 126(II), 132(II), 133(II), 135–137(II), 140(II), 142(II), 153(II), 154(II), 163(II), 167(II), 171(II), 177(II), 185(II), 192(II), 199(II), 202(II), 209(II), 210(II), 212(II), 215(II), 217(II), 225–

Index 227(II), 229(II), 231(II), 236– 239(II), 248(II), 250(II), 251(II), 263(II), 265(II), 285(II), 294(II), 297(II), 300(II), 327(II), 354(II), 378(II), 387(II), 391(II), 398(II), 406(II), 411(II), 412(II), 437(II), 460(II), 464(II), 486(II), 3(III), 34(III), 38(III), 54(III), 66(III), 76(III), 83(III), 101(III), 110– 113(III), 120–122(III), 124–129(III), 150(III), 153(III), 158–160(III), 162(III), 170(III), 182–184(III), 186– 189(III), 191–195(III), 197–202(III), 211(III), 216(III), 220(III), 226– 227(III), 230(III), 232(III), 242– 244(III), 248(III), 266–268(III), 271(III), 277–278(III), 283(III), 287(III), 326(III), 338–342(III), 344, 346(III), 351(III), 353–354(III), 359(III), 370–373(III), 410(III), 442– 447(III), 449(III), 457(III), 459– 466(III), 468–481(III), 493(III), 498(III), 504(III), 506(III), 509(III), 531(III), 535(III), 537–538(III), 541(III) recommendation, 544(I), 582(I), 583(I), 606(I), 643(I), 752(I), 754(I), 339(II), 404(II), 405(II), 411(II), 412(II), 434(II), 186(III), 188(III), 197(III), 377(III) rectangle calculus, 163(I), 166(I), 169(I) recursivity (recursive), 22(I), 53(I), 206(I), 397(I), 434(I), 504(I), 98(II), 107(II), 170(II), 251(II), 272(II), 393(II), 6(III), 7(III), 13(III), 20(III), 21(III), 31(III), 33(III), 65(III), 83(III), 124(III), 226(III), 237(III), 239(III), 405(III), 413(III), 454(III), 512(III) regression, 13(I), 136(I), 291(I), 344– 346(I), 348(I), 360(I), 369(I), 371(I), 380(I), 395(I), 398(I), 403(I), 487(I), 493–497(I), 500(I), 504(I), 291(II), 300(II), 348(II), 349(II), 351(II), 368–370(II), 372(II), 225(III), 231(III), 243(III), 425(III) REINFORCE (algorithm), 402(I) reinforcement learning, 21(I), 24(I), 380(I), 389(I), 390(I), 392(I), 394(I), 398(I), 408(I), 286(II), 295(II), 303(II), 304(II), 330–332(II), 340(II), 341(II), 406(II), 83(III), 231(III), 291(III), 304(III), 313(III), 315(III), 317–320(III), 325(III), 356(III),

Index 357(III), 360(III), 391(III), 414(III), 419(III), 420(III), 422–424(III), 430(III), 476(III), 537(III) relational algebra, 427(II), 94(III), 96(III), 103(III) relation algebra, 166(I) relational concept analysis (RCA), 412(II), 413(II), 422(II), 424–429(II), 433– 435(II) relational dependency network, 268(II) relational learning, 255(II), 384(II), 394(II), 398(II), 422(II), 428(II) relational Markov network, 265(II), 266(II), 274(II) resolution, 11(I), 98(I), 325(I), 358(I), 434(I), 591(I), 592(I), 622(I), 677(I), 708(I), 716(I), 717(I), 727(I), 752(I), 3(II), 7(II), 16(II), 17(II), 19(II), 22(II), 41(II), 56(II), 58(II), 59(II), 62(II), 63(II), 65–67(II), 73(II), 86(II), 87(II), 89(II), 91(II), 95(II), 96(II), 101(II), 102(II), 106(II), 116(II), 117(II), 121– 123(II), 125(II), 126(II), 134– 139(II), 141(II), 144–146(II), 287(II), 485(II), 486(II), 491(II), 121(III), 123(III), 136(III), 230(III), 231(III), 354(III), 427(III), 476(III), 505(III), 508(III) resource allocation, 588(I), 614–616(I), 669(I) resource, 271(I), 588(I), 590(I), 607(I), 612(I), 614–616(I), 654(I), 662(I), 663(I), 669(I), 737(I), 758(I), 4(II), 7(II), 17(II), 185(II), 201(II), 289(II), 300(II), 328(II), 5(III), 11(III), 12(III), 23(III), 51(III), 53(III), 54(III), 60(III), 61(III), 65(III), 80(III), 81(III), 84(III), 110(III), 112(III), 118(III), 121(III), 123(III), 130(III), 149(III), 151–153(III), 155(III), 182–186(III), 188–191(III), 195(III), 199–201(III), 203(III), 218(III), 268(III), 366(III), 416(III), 417(III), 425(III), 430(III), 463(III), 491–493(III), 495(III) retrieval, 310–312(I) revision, 20(I), 48(I), 59(I), 73(I), 83(I), 84(I), 101(I), 125(I), 127(I), 142(I), 144(I), 174(I), 315–317(I), 319(I), 415(I), 417(I), 441–460(I), 462– 472(I), 474(I), 477(I), 493(I), 494(I), 507(I), 508(I), 511(I), 512(I), 588(I),

517 600(I), 633(I), 647(I), 760(I), 103(II), 159(II), 235(II), 33(III), 113(III), 158(III), 199(III), 221(III), 359(III), 408(III), 445(III), 446(III), 448– 450(III), 452(III), 453(III), 506(III), 509(III) risk, 10(I), 19(I), 28(I), 70(I), 349(I), 350(I), 352(I), 354–356(I), 358– 362(I), 366(I), 371(I), 373–375(I), 379(I), 389–391(I), 404(I), 407(I), 408(I), 534(I), 537(I), 544(I), 549(I), 550(I), 554(I), 559–566(I), 568(I), 570(I), 571(I), 574(I), 575(I), 618(I), 658(I), 2(II), 27(II), 28(II), 40(II), 144(II), 174(II), 237(II), 239(II), 276(II), 302(II), 316(II), 349– 352(II), 368(II), 369(II), 375(II), 383(II), 391(II), 401(II), 403(II), 449(II), 105(III), 183(III), 445(III) risk-sensitive, 390(I), 404(I), 407(I), 408(I) robotics, 175(I), 400(I), 669(I), 4(II), 83(III), 232(III), 308(III), 311(III), 315(III), 317(III), 319(III), 326(III), 337– 339(III), 355–360(III), 366(III), 389–393(III), 395–397(III), 400(III), 401(III), 408–410(III), 412(III), 415(III), 419(III), 422–424(III), 426– 430(III), 443(III), 476(III), 538(III), 540(III) robustness, 135(I), 432(I), 605(I), 709(I), 40(II), 417(II), 491(II), 7(III), 59(III), 112(III), 120(III), 130(III), 195(III), 236(III), 283(III), 286(III), 287(III), 289(III), 391(III), 425(III), 427(III) Ross paradox, 257(I), 259(I) Rotschild-Stiglitz theorem, 561(I), 562(I) rough set, 78(I), 102(I), 107(I), 391(II), 436(II) rule base, 88(I), 100(I), 332(I), 457(I), 709(I), 711(I), 712(I), 714(I), 717(I), 108(II), 294(II), 412(II), 131(III) S SARSA (State-Action-Reward-StateAction policy), 393(I), 395(I), 318(III), 319(III), 421(III) SAT solver, 169(I), 175(I), 500(I), 40(II), 42(II), 54(II), 64(II), 65(II), 95(II), 115(II), 119(II), 120(II), 122(II), 125(II), 126(II), 131(II), 132(II), 135(II), 136(II), 139(II), 141(II), 145(II), 146(II), 167(II), 171(II), 177(II), 248(II), 294(II), 461(II),

518

Index 62(III), 66(III), 76(III), 202(III), 280(III), 282(III), 283(III)

satisfiability (SAT), 64(I), 169(I), 170(I), 175(I), 188(I), 190(I), 195(I), 197(I), 227(I), 379(I), 434(I), 477(I), 500(I), 512(I), 677(I), 678(I), 690(I), 701(I), 27(II), 29(II), 39(II), 57(II), 63(II), 64(II), 67(II), 69(II), 71(II), 91(II), 115(II), 117–119(II), 123(II), 125(II), 126(II), 128(II), 130(II), 131(II), 139(II), 142(II), 164(II), 201(II), 287(II), 294(II), 71(III), 76(III), 78(III), 98(III), 191(III), 234(III), 235(III), 265–267(III), 277(III), 278(III), 280(III), 286(III), 429(III), 445(III), 538(III) Savage axiomatics, 138(I), 552(I), 554– 556(I), 559(I), 566(I), 576(I) script, 309(I), 80(II), 330(II), 331(II), 2(III), 119–121(III), 128(III), 351(III), 494(III) search, 10(I), 11(I), 26(I), 27(I), 126(I), 133(I), 219(I), 225–227(I), 231(I), 281(I), 309(I), 314(I), 315(I), 319(I), 333(I), 389(I), 390(I), 400(I), 401(I), 403(I), 408(I), 520(I), 535(I), 542(I), 544(I), 596(I), 605(I), 622(I), 642(I), 661(I), 664(I), 674(I), 677(I), 687(I), 720(I), 721(I), 739(I), 746(I), 749(I), 750(I), 759(I), 760(I), 27(II), 29(II), 31–33(II), 37–47(II), 126–128(II), 130(II), 132(II), 157(II), 167– 175(II), 196(II), 197(II), 314(II), 315(II), 456(II), 488(II), 489(II), 66(III), 147(III), 149(III), 150(III), 156(III), 163(III), 166–169(III), 170(III), 223(III), 230–232(III), 235(III), 237–239(III), 244(III), 246(III), 247(III), 267(III), 286(III), 287(III), 289(III), 291(III), 292(III), 344(III), 355(III), 366(III), 368(III), 398(III), 400(III), 407(III), 412(III), 415(III), 416(III), 429(III), 466(III), 467(III), 469(III), 476(III), 520(III), 522(III), 536(III) search algorithm, 27(I), 400(I), 7(II), 8(II), 33(II), 42–45(II), 48(II), 92(II), 105(II), 130(II), 194–196(II), 293(II), 298(II), 314(II), 316(II), 325(II), 488–490(II), 223(III), 232(III), 237(III)

search tree, 622(I), 132(II), 157(II), 167–173(II), 175(II), 196(II), 197(II), 314(II), 315(II), 467(III)
Searle’s Chinese room, 305(III), 437(III), 439(III)
security, 173(I), 254(I), 270(I), 271(I), 566(I), 237(II), 238(II), 240(II), 435(II), 346(III), 355(III)
segmentation, 137(I), 237(II), 382(II), 133(III), 341(III), 343(III), 344(III), 346(III), 349–351(III)
semantic analysis, 118(III), 152(III)
semantic gap, 173(I), 338(III), 339(III), 341(III), 344(III), 346(III), 347(III)
semantic tableau, 53(II), 67(II), 73(II), 23(III)
semantic web, 186–188(I), 318(I), 319(I), 466(I), 708(I), 709(I), 722–726(I), 733(I), 734(I), 751(I), 753(I), 754(I), 756(I), 758(I), 759(I), 763(I), 770(I), 771(I), 264(II), 426(II), 433(II), 78(III), 112(III), 152(III), 153(III), 181–185(III), 188(III), 192–196(III), 201–203(III), 216(III), 220(III), 382(III), 390(III)
semantics, 17(I), 18(I), 23(I), 24(I), 27(I), 46(I), 47(I), 49(I), 52–54(I), 59(I), 60(I), 63–65(I), 76(I), 77(I), 87(I), 88(I), 91(I), 97–101(I), 105(I), 110(I), 121(I), 154(I), 166(I), 169(I), 173(I), 185–191(I), 193(I), 195–197(I), 200(I), 205(I), 212(I), 223(I), 244(I), 246(I), 247(I), 256(I), 257(I), 259(I), 261–265(I), 295(I), 308(I), 318(I), 319(I), 326(I), 327(I), 330–333(I), 430–434(I), 436(I), 444(I), 448(I), 451(I), 455(I), 458(I), 461(I), 462(I), 464–466(I), 474–477(I), 513(I), 631(I), 633–638(I), 648(I), 666(I), 707–711(I), 713(I), 720(I), 722–726(I), 733(I), 734(I), 737(I), 740(I), 744(I), 745(I), 748(I), 751–756(I), 758–760(I), 42(II), 43(II), 53(II), 55(II), 67(II), 73–75(II), 84–92(II), 94(II), 95(II), 97(II), 101(II), 103(II), 116–120(II), 144(II), 161(II), 163(II), 186(II), 199(II), 210(II), 231(II), 233(II), 248–251(II), 255(II), 256(II), 259–264(II), 266(II), 269(II), 273(II), 275(II), 276(II), 295(II), 299(II), 300(II), 393(II), 422(II), 424(II), 426(II), 433(II), 473(II), 475(II), 5(III), 16(III), 23(III), 96(III), 103(III), 105(III), 106(III), 117–126(III), 130(III), 131(III), 134(III), 136(III), 155(III), 156(III), 158(III), 161(III), 182–190(III), 194(III), 197–199(III), 269–273(III), 275–279(III), 284(III), 287(III), 291(III), 293(III), 294(III), 338(III), 339(III), 347(III), 348(III), 427(III), 439(III), 440(III), 446(III), 447(III), 470(III), 471(III), 510(III), 512(III), 519(III)
sequential decision, 389(I), 390(I), 549(I), 550(I), 563(I), 577(I), 578(I), 581(I), 582(I), 226(II), 286(II), 295(II), 296(II)
Shannon entropy, 129(I), 130(I), 81(III)
similarity, 20(I), 53(I), 54(I), 102(I), 208(I), 239(I), 307(I), 308(I), 311–313(I), 316–323(I), 326(I), 327(I), 330–333(I), 354(I), 358(I), 360(I), 376(I), 465(I), 748(I), 31(II), 35(II), 166(II), 218(II), 232(II), 234(II), 346(II), 347(II), 354–356(II), 361–363(II), 367(II), 368(II), 372(II), 373(II), 404(II), 418(II), 420(II), 421(II), 434(II), 450(II), 451(II), 456(II), 458–461(II), 467–470(II), 2(III), 26(III), 29(III), 30(III), 132–135(III), 155(III), 165(III), 167(III), 200(III), 201(III), 226(III), 238(III), 243(III), 247(III), 277(III), 321(III), 440(III), 459(III), 491(III), 496(III), 497(III), 506(III), 508(III), 523(III), 524(III)
simulated annealing, 661(I), 32(II), 41(II), 223(II), 234(II), 318(II), 240(III)
situation calculus, 25(I), 294(I), 298(I), 487(I), 497(I), 498(I), 500(I), 502(I), 503(I), 505(I), 514(I), 632(I), 92(III)
SLAM, 401–407(III)
social choice, 19(I), 407(I), 448(I), 476(I), 519(I), 537(I), 544(I), 587(I), 588(I), 590(I), 593(I), 597(I), 606(I), 614(I), 615(I), 622(I)
social network analysis, 435(II), 436(II)
social welfare, 542(I), 589(I), 596(I), 616(I), 617(I), 656(I), 657(I), 661–663(I)
soft constraint, 189(II), 289(II), 459(II), 156(III), 229(III)
solver, 24(I), 25(I), 169(I), 175(I), 500(I), 598(I), 622(I), 675(I), 677(I), 678(I), 683(I), 684(I), 46(II), 89(II), 91–95(II), 101(II), 102(II), 104–108(II), 117(II), 122(II), 125–128(II), 130–136(II), 145(II), 146(II), 153(II), 161(II), 163(II), 167–169(II), 171(II), 172(II), 176–178(II), 201(II), 285(II), 287(II), 302(II), 319(II), 464(II), 465(II), 234(III), 235(III), 247(III), 265(III), 278(III), 280(III), 466(III), 476(III), 494(III)
SPARQL, 318(I), 753(I), 428(II), 433(II), 185–188(III), 194(III), 198(III), 201(III), 202(III), 216(III), 219(III), 246(III)
spatial relation, 171(I), 173–175(I), 339(III), 340(III), 342–344(III), 348(III)
spectral clustering, 347(II), 354(II), 361–363(II), 451(II), 452(II), 457–459(II)
stable model, 270(I), 464(I), 477(I), 95–97(II), 99(II), 101(II), 106(II), 261(II), 262(II), 278(III)
state graph, 622(I), 1–8(II), 10–12(II), 16(II), 19(II), 20(II), 23(II), 291(II), 293(II), 323(II), 66(III), 239(III), 288(III), 291(III), 407(III)
statistical learning, 134(I), 341(I), 343(I), 344(I), 348–352(I), 355(I), 358(I), 359(I), 363(I), 369(I), 372(I), 375(I), 376(I), 378–380(I), 220(II), 350(II), 383(II), 478(II), 419(III)
STIT logic, 258(I), 259(I), 266(I), 269(I), 638(I), 646(I)
STN, 410–412(III)
stochastic dominance, 530(I), 531(I), 560(I), 570(I), 615(I)
stochastic gradient, 372(I), 377(I), 394–398(I)
strategy, 26(I), 126(I), 136(I), 160(I), 237(I), 314(I), 323(I), 324(I), 354(I), 372(I), 375(I), 401(I), 402(I), 462(I), 578–580(I), 652(I), 657–661(I), 691(I), 700(I), 718(I), 14(II), 32(II), 33(II), 37(II), 38(II), 41(II), 44(II), 45(II), 47(II), 48(II), 54(II), 63(II), 67(II), 71(II), 72(II), 76(II), 80(II), 87(II), 92(II), 107(II), 136(II), 138(II), 174(II), 175(II), 195(II), 226(II), 227(II), 263(II), 265(II), 266(II), 271–273(II), 295(II), 327(II), 328(II), 331(II), 332(II), 341(II), 346–348(II), 350(II), 364(II), 375(II), 388(II), 390(II), 392(II), 394(II), 395(II), 456(II), 464(II), 465(II), 468(II), 475(II), 476(II), 486(II), 34(III), 55(III), 79(III), 99(III), 156(III), 166(III), 168–170(III), 210–212(III), 223(III), 226(III), 234(III), 236(III), 237(III), 239(III), 245(III), 290(III), 295(III), 317(III), 319(III), 347(III), 350–352(III), 354(III), 375(III), 379(III), 410(III), 445(III), 457(III), 466(III), 467(III), 475(III), 493–495(III), 504(III), 509(III), 510(III), 513(III), 520–522(III), 525–527(III)
STRIPS, 25(I), 226(I), 404(I), 487(I), 501(I), 502(I), 511(I), 514(I), 285–288(II), 297(II), 298(II), 305(II), 408(III), 409(III)
strong negation, 96(II), 98(II)
structure learning, 379(I), 220–222(II), 224(II), 234(II), 275(II), 276(II)
sub-modular function, 200(II)
substitution, 15(I), 208(I), 238(I), 239(I), 314(I), 315(I), 500(I), 57(II), 58(II), 60(II), 61(II), 65(II), 72(II), 86(II), 101(II), 253(II), 387(II), 10(III), 28(III), 68(III), 491(III)
subsumption, 187(I), 188(I), 190(I), 192(I), 195–197(I), 754(I), 61(II), 62(II), 138(II), 387(II), 393(II), 397(II), 418(II), 419(II), 426(II), 428(II), 98(III), 191(III), 192(III), 197–199(III), 279(III), 315(III), 425(III)
Sugeno integral, 94(I), 110(I), 519(I), 543(I), 576(I)
superposition, 53(II), 58–60(II), 62(II), 63(II), 65(II), 66(II), 79(III), 451(III)
supervised learning, 135(I), 344(I), 345(I), 349(I), 351(I), 352(I), 394(I), 405(I), 322(II), 339–341(II), 343–346(II), 348(II), 349(II), 351–353(II), 366(II), 378(II), 379(II), 381–383(II), 396(II), 398–400(II), 402(II), 403(II), 448(II), 474(II), 477(II), 131(III), 132(III), 341(III), 351(III), 356(III), 357(III), 424(III), 425(III)
supervision, 159(I), 278(I), 283(I), 328(I), 379(I), 380(I), 390(I), 488(I), 493(I), 673(I), 684(I), 685(I), 691(I), 142(II), 447(II), 449(II), 450(II), 452(II), 477(II), 132(III), 165(III), 337(III), 338(III), 341(III), 346(III), 347(III), 350(III), 355(III), 360(III), 373(III), 394(III), 426(III)
support relation, 429(I), 431(I), 644(I)

support vector machine (SVM), 368(I), 369(I), 375–378(I), 380(I), 339(II), 351(II), 368(II), 373(II), 378(II), 163(III), 212(III), 220(III), 223(III), 224(III), 227(III), 244(III), 246(III), 247(III), 249(III), 376(III), 497(III), 498(III)
sure thing principle, 556(I), 565(I), 566(I), 568(I), 569(I), 571(I), 576(I), 581(I), 466(III), 478(III)
surveillance, 485(II), 392(III)
symmetry, 53(I), 79(I), 85(I), 276(I), 278(I), 325(I), 329(I), 461(I), 471(I), 472(I), 73(II), 76(II), 93(II), 135(II), 175(II), 202(II), 212(II), 214(II), 458(II), 464(II), 190(III), 230(III), 290(III), 513(III)
syntax, 189(I), 190(I), 193(I), 198(I), 309(I), 418(I), 445(I), 447(I), 448(I), 458–460(I), 462(I), 713(I), 724(I), 83(II), 86(II), 88(II), 94(II), 101(II), 210(II), 248(II), 249(II), 256(II), 259–262(II), 264(II), 269(II), 271–273(II), 5(III), 118–123(III), 131(III), 185(III), 189(III), 194(III), 268(III), 269(III), 274(III), 440(III), 446(III), 447(III), 506(III), 507(III), 512(III), 513(III)

T

tabu search, 32(II), 39(II), 42(II), 43(II), 45(II), 128(II), 456(II), 231(III)
tautology, 77(I), 90(I), 258(I), 260(I), 266(I), 450(I), 455(I), 458(I), 474(I), 60(II), 61(II), 117(II), 123(II), 25(III)
Tchebycheff norm (Chebyshev norm), 534(I)
temporal logic, 170(I), 171(I), 257(I), 265(I), 266(I), 270(I), 313(I), 408(I), 631(I), 635(I), 125(II), 126(II), 321(II), 77(III), 92(III), 93(III), 97(III), 99(III), 236(III), 242(III), 267(III), 271(III), 283–285(III), 289(III), 290(III)
temporal reasoning, 151(I), 152(I), 160(I), 161(I), 168(I), 172(I), 173(I), 635(I), 684(I), 769(I)
temporal relation, 174(I), 131(III), 514(III)
term, 17(I), 22(I), 24(I), 48(I), 60(I), 72(I), 75(I), 159(I), 186(I), 200(I), 265(I), 266(I), 316(I), 325(I), 342(I), 356–358(I), 363(I), 433(I), 445(I), 535(I), 555(I), 611(I), 639(I), 668(I), 723(I), 734(I), 737(I), 745(I), 750(I), 10(II), 42(II), 55(II), 57–59(II), 61(II), 71(II), 97(II), 137(II), 210(II), 256(II), 10(III), 11(III), 19(III), 21(III), 22(III), 24(III), 26(III), 28–31(III), 82(III), 125(III), 150–152(III), 155(III), 156(III), 160–163(III), 165(III), 221(III), 238(III), 309(III), 318(III), 366(III), 368(III), 372(III), 406(III), 469(III), 514(III), 539(III)
terminology, 64(I), 71(I), 192(I), 206(I), 220(I), 355(I), 738(I), 742(I), 745(I), 749(I), 755(I), 756(I), 45(II), 191(II), 248–252(II), 433(II), 435(II), 436(II), 125(III), 219(III), 293(III), 426(III)
time, 45(I), 52(I), 99(I), 121(I), 151(I), 157(I), 159–162(I), 165(I), 167(I), 168(I), 170–172(I), 174(I), 175(I), 222(I), 257(I), 264–267(I), 269(I), 277(I), 278(I), 281(I), 292(I), 299(I), 389(I), 391(I), 392(I), 401–403(I), 489(I), 491(I), 492(I), 494(I), 496(I), 498(I), 503(I), 506–508(I), 576–578(I), 604–608(I), 630–633(I), 635(I), 636(I), 640(I), 641(I), 647(I), 653(I), 658(I), 659(I), 668(I), 683–687(I), 690(I), 692(I), 693(I), 698(I), 700(I), 227(II), 238(II), 257(II), 289(II), 291(II), 297–299(II), 302–304(II), 341(II), 343(II), 374(II), 397(II), 398(II), 485(II), 487–489(II), 9(III), 18(III), 41(III), 42(III), 54(III), 77(III), 99(III), 100(III), 187(III), 214(III), 216(III), 233(III), 236(III), 242(III), 270(III), 277(III), 281(III), 283(III), 285(III), 286(III), 291–295(III), 308(III), 314(III), 401–406(III), 410(III), 412(III), 453(III), 519(III), 536(III)
trace, 320(I), 472(I), 533(I), 687(I), 708(I), 714(I), 715(I), 744(I), 746(I), 761(I), 108(II), 319(II), 363(II), 228(III), 271(III), 284–286(III), 314(III), 487(III), 488(III)
tractable relation, 167(I)
transfer learning, 408(I), 339(II), 402(II), 403(II), 132(III), 212(III), 220(III)
transition, 2(I), 392(I), 426(I), 453(I), 490(I), 491(I), 493(I), 502(I), 510(I), 637(I), 683(I), 686(I), 699(I), 74(II), 75(II), 125(II), 128(II), 292(II), 293(II), 295(II), 296(II), 300(II), 394(II), 9(III), 19(III), 38(III), 60(III), 233(III), 235(III), 236(III), 242(III), 267(III), 271(III), 275–277(III), 283(III), 293(III), 313(III), 314(III), 408(III), 409(III)
triadic concept analysis, 412(II), 413(II), 430(II), 437(II)
trust, 3(I), 4(I), 46(I), 52(I), 74(I), 431(I), 612(I), 629(I), 630(I), 640–644(I), 648(I), 723–725(I), 770–772(I), 381(II), 467(II), 163(III), 182(III), 374(III), 447(III)
truth-functionality, 46(I), 47(I)
Turing machine, 204(I), 3(III), 5–8(III), 11(III), 12(III), 15(III), 17(III), 18(III), 38(III), 42(III), 53(III), 59–64(III), 67(III), 69(III), 79(III), 80(III), 448(III), 474(III), 507(III)
Turing test, 9(I), 21(I), 120(III), 430(III), 441(III), 463(III), 471(III)
typical (typicality), 64(I), 76(I), 84(I), 154(I), 159(I), 160(I), 169(I), 174(I), 186(I), 298(I), 403(I), 449(I), 492(I), 494(I), 531(I), 587(I), 599(I), 609(I), 647(I), 653(I), 662(I), 740(I), 27(II), 29(II), 33(II), 38(II), 41(II), 42(II), 44(II), 47(II), 115(II), 123(II), 131(II), 141(II), 145(II), 237(II), 275(II), 319(II), 327(II), 328(II), 350(II), 351(II), 353(II), 368(II), 379(II), 381–383(II), 393(II), 402(II), 403(II), 436(II), 78(III), 102(III), 108(III), 119(III), 134(III), 148(III), 164(III), 170(III), 181(III), 195(III), 198(III), 200(III), 213(III), 215(III), 216(III), 218(III), 219(III), 227–231(III), 239(III), 241(III), 243(III), 247(III), 272(III), 281(III), 309(III), 321(III), 344(III), 372(III), 373(III), 379(III), 422(III), 426(III), 458–460(III), 478(III), 488(III), 506(III), 510(III), 515(III), 520(III), 523(III), 525(III)

U

uncertainty, 10(I), 11(I), 18(I), 19(I), 54(I), 69–71(I), 73(I), 78(I), 79(I), 83–86(I), 89(I), 91(I), 94–97(I), 99(I), 110(I), 119–122(I), 129–134(I), 136–138(I), 141(I), 144(I), 159(I), 219(I), 230(I), 231(I), 242(I), 248(I), 278(I), 283(I), 284(I), 287(I), 290(I), 293(I), 330–332(I), 390(I), 403(I), 407(I), 427(I), 441–443(I), 451(I), 457(I), 466(I), 467(I), 472(I), 475(I), 477(I), 489–492(I), 495(I), 504(I), 506(I), 507(I), 515(I), 525(I), 531(I), 537(I), 543(I), 544(I), 549–551(I), 553(I), 554(I), 558(I), 564(I), 565(I), 570(I), 572(I), 575(I), 579(I), 582(I), 583(I), 674(I), 683(I), 687(I), 708(I), 769(I), 770(I), 178(II), 192(II), 203(II), 209–211(II), 214(II), 215(II), 225–227(II), 230–232(II), 235(II), 236(II), 239(II), 249(II), 251(II), 256(II), 265(II), 266(II), 275(II), 285(II), 286(II), 296(II), 297(II), 303–306(II), 342(II), 384(II), 398(II), 474(II), 477(II), 38(III), 83(III), 103(III), 107(III), 113(III), 128(III), 147(III), 150(III), 153(III), 155(III), 158–160(III), 162(III), 169(III), 210(III), 244(III), 286(III), 319(III), 338(III), 341(III), 342(III), 346(III), 355(III), 359(III), 376(III), 378(III), 400(III), 403(III), 408(III), 413(III), 414(III), 430(III), 442(III), 444–446(III), 469(III), 472(III), 536(III), 537(III), 540(III)
undecidability, 197(I), 16(III), 35(III), 36(III)
unification, 145(I), 204(I), 209(I), 56–59(II), 65(II), 86(II), 87(II), 90(II), 93(II), 107(II), 108(II), 293(II), 467(II), 121(III), 125(III), 250(III), 511(III), 513(III)
unit propagation, 122(II), 127(II), 132–134(II)
unsupervised learning, 345(I), 349(I), 352(I), 394(I), 339–341(II), 343–346(II), 352(II), 353(II), 379(II), 381(II), 398(II), 477(II), 132(III), 133(III), 306(III)
updating (update), 56(I), 77(I), 84(I), 101(I), 144(I), 270(I), 299(I), 300(I), 368(I), 375(I), 392(I), 393(I), 395(I), 396(I), 399–404(I), 442(I), 444(I), 462(I), 464(I), 487(I), 504(I), 507–514(I), 622(I), 639(I), 641(I), 690(I), 697(I), 700(I), 717(I), 761(I), 8(II), 30(II), 31(II), 34–36(II), 132(II), 134(II), 137(II), 140(II), 142(II), 174(II), 195(II), 219(II), 235(II), 300(II), 305(II), 315(II), 316(II), 322(II), 350(II), 358(II), 364(II), 365(II), 370(II), 391(II), 457(II), 53(III), 92(III), 97–99(III), 101(III), 216(III), 220(III), 233(III), 235(III), 236(III), 274(III), 276(III), 277(III), 293(III), 294(III), 311(III), 313(III), 317(III), 323(III), 401–405(III), 407(III), 408(III), 411(III), 413(III), 420–423(III), 443(III), 444(III), 477(III)
upper probability, 105(I)
usability knowledge, 376–378(III)
utilitarianism (utilitarian), 6(I), 587(I), 590(I), 591(I), 608–611(I), 621(I), 656(I), 657(I), 661–663(I)
utility, 19(I), 132(I), 133(I), 138(I), 192(I), 218(I), 231–242(I), 407(I), 520(I), 549–552(I), 554(I), 558(I), 559(I), 561–570(I), 572–576(I), 580–583(I), 590(I), 591(I), 608–610(I), 612(I), 615(I), 616(I), 653(I), 655–660(I), 662(I), 746(I), 192(II), 203(II), 225–227(II), 286(II), 295(II), 305(II), 332(II), 365(II), 474(II), 477(II), 83(III), 169(III)
utility function, 218–220(I), 231–236(I), 239–242(I), 407(I), 457(I), 536(I), 551(I), 552(I), 554(I), 558(I), 559(I), 562–564(I), 570(I), 573–575(I), 580(I), 582(I), 590(I), 591(I), 612(I), 615(I), 619(I), 620(I), 621(I), 652(I), 653(I), 655(I), 296(II), 474(II), 477(II)

V

validation, 310(I), 316(I), 318(I), 350(I), 707–711(I), 713–715(I), 720–726(I), 762(I), 239(II), 339(II), 345(II), 346(II), 352(II), 353(II), 366(II), 401(II), 448(II), 96(III), 128(III), 130(III), 267(III), 358(III), 359(III), 390(III), 409(III), 427(III), 444(III), 449(III), 493(III)
value function, 242(I), 355(I), 390–396(I), 398–400(I), 402(I), 406(I), 300(II), 302(II), 304(II), 413(III), 422(III)
value-iteration algorithm, 392(I), 296(II), 300(II), 304(II), 413(III), 414(III)
valued constraint, 238(I), 248(I), 178(II), 185(II), 186(II), 189(II), 232(III), 266(III)
variable elimination algorithm, 365(I), 193(II), 226(II)
VC dimension, 362–366(I)

verification, 60(I), 710(I), 711(I), 723(I), 725(I), 726(I), 63(II), 78(II), 79(II), 81(II), 115(II), 119(II), 126(II), 127(II), 16(III), 66(III), 76(III), 77(III), 101(III), 242(III), 267(III), 283(III), 287(III), 409(III), 427(III)
version space, 364(I), 365(I), 350(II), 391(II), 395(II), 245(III)
veto, 532(I), 594(I), 102(III)
video game, 389(I), 390(I), 647(I), 313(II), 314(II), 321(II), 323(II), 324(II), 327–333(II), 539(III)
view, 138(I), 276(I), 290(I), 115(II), 345(II), 423(II), 311(III)
violation, 208(I), 253(I), 255(I), 256(I), 258(I), 259(I), 261(I), 263(I), 265(I), 267–270(I), 376(I), 571(I), 632(I), 698(I), 44(II), 46(II), 47(II), 103(II), 456(II), 468(II), 99(III), 286(III), 289(III)
visual analytics, 380(III)
von Neumann-Morgenstern axiomatics, 552(I), 554(I), 575(I)

W

web, 186–188(I), 231(I), 318(I), 319(I), 466(I), 588(I), 596(I), 605(I), 616(I), 643(I), 683(I), 708(I), 709(I), 722–726(I), 733(I), 734(I), 740(I), 748(I), 749(I), 751(I), 753(I), 754(I), 756–761(I), 763(I), 770(I), 771(I), 62(II), 95(II), 103(II), 264(II), 426(II), 433(II), 435(II), 8(III), 18(III), 78(III), 101(III), 108(III), 112(III), 126(III), 132(III), 147(III), 152(III), 153(III), 159(III), 167–169(III), 181–189(III), 193–197(III), 199–203(III), 216(III), 218(III), 220(III), 233(III), 242(III), 366(III), 368(III), 369(III), 376(III), 382(III), 390(III), 492(III), 541(III)
web of data, 709(I), 723(I), 726(I), 756(I), 759(I), 418(II), 433(II), 182(III), 185(III), 186(III), 188(III), 197(III), 200(III)
weighted average, 126(I), 133(I), 473(I), 534(I), 536(I), 539(I), 611(I)
well-founded semantics, 261(II)

Y

Yaari’s model, 537(I)