English Pages 608 [596] Year 2005
Monitoring, Security, and Rescue Techniques in Multiagent Systems
Advances in Soft Computing
Editor-in-chief Prof. Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, Poland. E-mail: [email protected]

Further books of this series can be found on our homepage: springeronline.com

Henrik Larsen, Janusz Kacprzyk, Slawomir Zadrozny, Troels Andreasen, Henning Christiansen (Eds.) Flexible Query Answering Systems 2000. ISBN 3-7908-1347-8
Robert John and Ralph Birkenhead (Eds.) Developments in Soft Computing 2001. ISBN 3-7908-1361-3
Leszek Rutkowski, Janusz Kacprzyk (Eds.) Neural Networks and Soft Computing 2003. ISBN 3-7908-0005-8
Jurgen Franke, Gholamreza Nakhaeizadeh, Ingrid Renz (Eds.) Text Mining 2003. ISBN 3-7908-0041-4
Mieczyslaw Klopotek, Maciej Michalewicz and Slawomir T. Wierzchon (Eds.) Intelligent Information Systems 2001 2001. ISBN 3-7908-1407-5
Tetsuzo Tanino, Tamaki Tanaka, Masahiro Inuiguchi Multi-Objective Programming and Goal Programming 2003. ISBN 3-540-00653-2
Antonio Di Nola and Giangiacomo Gerla (Eds.) Lectures on Soft Computing and Fuzzy Logic 2001. ISBN 3-7908-1396-6
Mieczyslaw Klopotek, Slawomir T. Wierzchon, Krzysztof Trojanowski (Eds.) Intelligent Information Processing and Web Mining 2003. ISBN 3-540-00843-8
Tadeusz Trzaskalik and Jerzy Michnik (Eds.) Multiple Objective and Goal Programming 2002. ISBN 3-7908-1409-1
James J. Buckley and Esfandiar Eslami An Introduction to Fuzzy Logic and Fuzzy Sets 2002. ISBN 3-7908-1447-4
Ajith Abraham and Mario Koppen (Eds.) Hybrid Information Systems 2002. ISBN 3-7908-1480-6
Ahmad Lotfi, Jonathan M. Garibaldi (Eds.) Applications and Science in Soft-Computing 2004. ISBN 3-540-40856-8
Mieczyslaw Klopotek, Slawomir T. Wierzchon, Krzysztof Trojanowski (Eds.) Intelligent Information Processing and Web Mining 2004. ISBN 3-540-21331-7
Przemyslaw Grzegorzewski, Olgierd Hryniewicz, Maria Á. Gil (Eds.) Soft Methods in Probability, Statistics and Data Analysis 2002. ISBN 3-7908-1526-8
Miguel Lopez-Diaz, Maria Á. Gil, Przemyslaw Grzegorzewski, Olgierd Hryniewicz, Jonathan Lawry Soft Methodology and Random Information Systems 2004. ISBN 3-540-22264-2
Lech Polkowski Rough Sets 2002. ISBN 3-7908-1510-1
Kwang H. Lee First Course on Fuzzy Theory and Applications 2005. ISBN 3-540-22988-4
Mieczyslaw Klopotek, Maciej Michalewicz and Slawomir T. Wierzchon (Eds.) Intelligent Information Systems 2002 2002. ISBN 3-7908-1509-8
Barbara Dunin-Kęplicz, Andrzej Jankowski, Andrzej Skowron, Marcin Szczuka Monitoring, Security, and Rescue Techniques in Multiagent Systems 2005. ISBN 3-540-23245-1
Andrea Bonarini, Francesco Masulli and Gabriella Pasi (Eds.) Soft Computing Applications 2002. ISBN 3-7908-1544-6
Barbara Dunin-Kęplicz, Andrzej Jankowski, Andrzej Skowron, Marcin Szczuka
Monitoring, Security, and Rescue Techniques in Multiagent Systems
With 138 Figures
Springer
Barbara Dunin-Kęplicz
Institute of Computer Science, Polish Academy of Sciences, Ordona 21, 01-237 Warsaw, Poland
and Institute of Informatics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland
and Institute for Decision Process Support, Chemikow 5, 09-411 Plock, Poland

Andrzej Jankowski
Institute for Decision Process Support, Chemikow 5, 09-411 Plock, Poland

Andrzej Skowron
Institute of Mathematics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland
and Institute for Decision Process Support, Chemikow 5, 09-411 Plock, Poland

Marcin Szczuka
Institute of Mathematics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland
Library of Congress Control Number: 2004116865
ISSN 1615-3871
ISBN 3-540-23245-1 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media
springeronline.com
© Springer-Verlag Berlin Heidelberg 2005
Printed in Germany

The use of general descriptive names, registered names, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and free for general use.

Cover design: Erich Kirchner, Springer Heidelberg
Typesetting: Digital data supplied by editors
Production: medionet AG, Berlin
Printed on acid-free paper
Preface
In today's society the issue of security, understood in the widest context, has become a crucial one. The people of the information age, having instant access to the sources of knowledge and information, expect technology to improve their safety in all respects. That leads straightforwardly to the demand for methods, technologies, frameworks and computerized tools which serve this purpose. Nowadays such methods and tools are increasingly expected to embed not only ubiquitous information sources, but also the knowledge that stems from them. The use of knowledge-based technology in security applications, and in society at large, clearly emerges as the next big challenge. Within the general set of security-related tasks we may single out some sub-fields such as monitoring, control, crisis and rescue management.

Multiagent systems are meant to be a toolset for modelling of, automated reasoning with, and study of the behavior of compound environments that involve many perceiving, reasoning and acting parties. In a natural way they are well suited for supporting the research on foundations of automatic reasoning processes, starting from data acquisition (including data entry, sensor measurements and multimedia information processing), through automatic knowledge perception, real-life situation assessment and planning, to action execution in the context of monitoring, security, and rescue techniques. These activities are closely related to many very active research areas like autonomous systems, spatio-temporal reasoning, knowledge representation, soft computing with rough, fuzzy and rough mereological approaches, perception, learning, evolution, adaptation, data mining and knowledge discovery, collective intelligence and behavior.
All these research directions have plenty of possible applications in the systems that are concerned with assuring security, acting in emergency and crisis situations, monitoring of vital infrastructures, managing cooperative jobs in the situation of danger, and planning of action for the rescue campaigns. This volume contains extended and improved versions of selected contributions presented at the International Workshop "Monitoring, Security, and Rescue Techniques in Multiagent Systems" (MSRAS 2004) held in Plock, Poland, June 7-9, 2004. The MSRAS 2004 workshop was aimed at gathering world's leading researchers active in areas related to monitoring, security, and rescue techniques in multiagent
systems. Such techniques are among the core issues that involve very large numbers of heterogeneous agents in a hostile environment. The intention of the workshop was to promote research and development in these significant domains. The workshop itself was a significant success thanks to the presence and contributions of the leading researchers in the field. In this way, by establishing a forum for exchanging results and experience of top specialists working in areas closely related to such tasks, new possibilities for scientific cooperation have been created. The workshop was also the first step on the road to establishing a permanent research and technology center in Plock. We hope that this center, a part of the Industrial and Technology Park, will become an important institution contributing to fostering the research in knowledge-based technologies and security.

Organization of the volume

As the 48 contributions in this volume span a very wide area of research, it is quite hard to categorize them precisely. Therefore, there are only two major parts in this volume. The first one, entitled "Foundations and Methods", gathers the papers of more theoretical and fundamental character as well as those dealing with general, basic descriptions of various methodologies and paradigms. The second part, entitled "Application Domains and Case Studies", is meant to encapsulate the articles that deal with more specific problems, concrete solutions and application examples. Naturally, the division is very subjective and should not be treated as definite. Within each part the papers are organized in accordance with their role at the workshop. It means that in each of the parts the articles from keynote presenters come first, followed by invited and regular contributions, and finished by papers that were part of special and poster sessions.

Acknowledgements

We wish to express our gratitude to Professors Zdzislaw Pawlak and Lotfi A. Zadeh who accepted our invitation to act as honorary chairs of the workshop. We are also very grateful to all outstanding scientists who participated in the workshop including: Andrzej Uszok, Jean-Pierre Muller, James F. Peters, Katia Sycara, David Wolpert, Philip S. Yu, Hans-Dieter Burkhard, Tom R. Burns, Nikolai Chilov, Andrzej Czyzewski, Barbara Dunin-Kęplicz, Rineke Verbrugge, Amal El Fallah-Seghrouchni, Vladimir Gorodetski, Zbigniew Michalewicz, Hideyuki Nakanishi, Sankar K. Pal, Lech Polkowski, Alberto Pettorossi, Zbigniew Ras, Alexander Ryjov, Marek Sergot, V.S. Subrahmanian, Guoyin Wang, Hui Wang.
Many thanks to all authors who prepared their articles for this volume. We would like to thank PKN Orlen, City of Plock and the supervisor of MSRAS organization - Ms. Sulika Kamińska. Without the money, expertise, and organizational muscle they provided, neither the MSRAS workshop nor the publication of this volume would have been possible. We are also thankful to Springer-Verlag and Dr. Thomas Ditzinger for the opportunity of publishing this volume in the "Advances in Soft Computing" series.

Warsaw, July 2004

Barbara Dunin-Kęplicz, Andrzej Jankowski, Andrzej Skowron, Marcin Szczuka
Contents
Part I Foundations and Methods

1 Flow Graphs, their Fusion and Data Analysis
   Zdzislaw Pawlak
2 Approximation Spaces for Hierarchical Intelligent Behavioral System Models
   James F. Peters
3 Distributed Adaptive Control: Beyond Single-Instant, Discrete Control Variables
   David H. Wolpert, Stefan Bieniawski
4 Multi-agent Planning for Autonomous Agents' Coordination
   Amal El Fallah-Seghrouchni
5 Creating Common Beliefs in Rescue Situations
   Barbara Dunin-Kęplicz, Rineke Verbrugge
6 Coevolutionary Processes for Strategic Decisions
   Rodney W. Johnson, Michael E. Melich, Zbigniew Michalewicz, Martin Schmidt
7 Automatic Proofs of Protocols via Program Transformation
   Fabio Fioravanti, Alberto Pettorossi, Maurizio Proietti
8 Mereological Foundations to Approximate Reasoning
   Lech Polkowski
9 Data Security and Null Value Imputation in DIS
   Zbigniew W. Ras, Agnieszka Dardzinska
10 Basic Principles and Foundations of Information Monitoring Systems
   Alexander Ryjov
11 Modelling Unreliable and Untrustworthy Agent Behaviour
   Marek Sergot
12 Nearest Neighbours without k
   Hui Wang, Ivo Duntsch, Gunther Gediga, Gongde Guo
13 Classifiers Based on Approximate Reasoning Schemes
   Jan Bazan, Andrzej Skowron
14 Towards Rough Applicability of Rules
   Anna Gomolinska
15 On the Computer-Assisted Reasoning about Rough Sets
   Adam Grabowski
16 Similarity-Based Data Reduction and Classification
   Gongde Guo, Hui Wang, David Bell, Zhining Liao
17 Decision Trees and Reducts for Distributed Decision Tables
   Mikhail Ju. Moshkov
18 Learning Concept Approximation from Uncertain Decision Tables
   Nguyen Sinh Hoa, Nguyen Hung Son
19 In Search for Action Rules of the Lowest Cost
   Zbigniew W. Ras, Angelina A. Tzacheva
20 Circularity in Rule Knowledge Bases Detection using Decision Unit Approach
   Roman Siminski, Alicja Wakulicz-Deja
21 Feedforward Concept Networks
   Dominik Ślęzak, Marcin Szczuka, Jakub Wroblewski
22 Extensions of Partial Structures and Their Application to Modelling of Multiagent Systems
   Bozena Staruch
23 Tolerance Information Granules
   Jaroslaw Stepaniuk
24 Attribute Reduction Based on Equivalence Relation Defined on Attribute Set and Its Power Set
   Ling Wei, Wenxiu Zhang
25 Query Cost Model Constructed and Analyzed in a Dynamic Environment
   Zhining Liao, Hui Wang, David Glass, Gongde Guo
26 The Efficiency of the Rules' Classification Based on the Cluster Analysis Method and Salton's Method
   Agnieszka Nowak, Alicja Wakulicz-Deja
27 Extracting Minimal Templates in a Decision Table
   Barbara Marszał-Paszek, Piotr Paszek

Part II Application Domains and Case Studies

28 Programming Bounded Rationality
   Hans-Dieter Burkhard
29 Generalized Game Theory's Contribution to Multi-agent Modelling
   Tom R. Burns, Jose Castro Caldas, Ewa Roszkowska
30 Multi-Agent Decision Support System for Disaster Response and Evacuation
   Alexander Smirnov, Michael Pashkin, Nikolai Chilov, Tatiana Levashova, Andrew Krizhanovsky
31 Intelligent System for Environmental Noise Monitoring
   Andrzej Czyzewski, Bozena Kostek, Henryk Skarzynski
32 Multi-agent and Data Mining Technologies for Situation Assessment in Security-related Applications
   Vladimir Gorodetsky, Oleg Karsaev, Vladimir Samoilov
33 Virtual City Simulator for Education, Training, and Guidance
   Hideyuki Nakanishi
34 Neurocomputing for Certain Bioinformatics Tasks
   Shubhra Sankar Ray, Sanghamitra Bandyopadhyay, Pabitra Mitra, Sankar K. Pal
35 Rough Set Based Solutions for Network Security
   Guoyin Wang, Long Chen, Yu Wu
36 Task Assignment with Dynamic Token Generation
   Alessandro Farinelli, Luca Iocchi, Daniele Nardi, Fabio Patrizi
37 DyKnow: A Framework for Processing Dynamic Knowledge and Object Structures in Autonomous Systems
   Fredrik Heintz, Patrick Doherty
38 Classifier Monitoring using Statistical Tests
   Rafal Latkowski, Cezary Głowiński
39 What Do We Learn When We Learn by Doing? Toward a Model of Dorsal Vision
   Ewa Rauch
40 Rough Mereology as a Language for a Minimalist Mobile Robot's Environment Description
   Lech Polkowski, Adam Szmigielski
41 Data Acquisition in Robotics
   Krzysztof Luks
42 Spatial Sound Localization for Humanoid
   Lech Blazejewski
43 Oculomotor Humanoid Active Vision System
   Piotr Kazmierczak
44 Crisis Management via Agent-based Simulation
   Grzegorz Dobrowolski, Edward Nawarecki
45 Monitoring in Multi-Agent Systems: Two Perspectives
   Marek Kisiel-Dorohinicki
46 Multi-Agent Environment for Management of Crisis in an Enterprises-Markets Complex
   Jaroslaw Kozlak
47 Behavior Based Detection of Unfavorable Events Using the Multiagent System
   Krzysztof Cetnarowicz, Edward Nawarecki, Gabriel Rojek
48 Intelligent Medical Systems on Internet Technologies Platform
   Beata Zielosko, Andrzej Dyszkiewicz

Author Index
Part I
Foundations and Methods
Flow Graphs, their Fusion and Data Analysis

Zdzislaw Pawlak

Institute of Computer Sciences, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
and Warsaw School of Information Technology, ul. Newelska 6, 01-447 Warsaw, Poland
zpw@ii.pw.edu.pl

Summary. This paper concerns a new approach to data analysis based on information flow distribution study in flow graphs. The introduced flow graphs differ from those proposed by Ford and Fulkerson, for they do not describe material flow in the flow graph but information "flow" about the data structure. Data analysis (mining) can be reduced to information flow analysis and the relationship between data can be boiled down to information flow distribution in a flow network. Moreover, it is revealed that information flow satisfies Bayes' rule, which is in fact an information flow conservation equation. Hence information flow has probabilistic character; however, Bayes' rule in our case can be interpreted in an entirely deterministic way, without referring to prior and posterior probabilities, inherently associated with Bayesian philosophy. Furthermore, in this paper we study the hierarchical structure of flow networks by allowing a subgraph determined by branches x and y to be substituted by a single branch connecting x and y, called the fusion of x and y. This "fusion" operation allows us to look at data with different accuracy and move from details to a general picture of the data structure.
Key words: flow graphs, data fusion, data mining, Bayes' rule
1.1 Introduction

In [4] we presented a new approach to data analysis based on information flow distribution study in flow graphs. The introduced flow graphs differ from those proposed by Ford and Fulkerson [1], for they do not describe material flow in the flow graph but information "flow" about the data structure.

With every branch of the flow graph three coefficients are associated, called strength, certainty and coverage factors. These coefficients were widely used in data mining and rough set theory, but in fact they were first introduced by Lukasiewicz [2] in connection with his study of logic and probability. These coefficients have a probabilistic flavor, but here they are interpreted in a deterministic way, describing information flow distribution in the flow graph.

We claim that data analysis (mining) can be reduced to information flow analysis and the relationship between data can be boiled down to information flow distribution in a flow network. Moreover, it is revealed that information flow satisfies Bayes' rule, which is in fact an information flow conservation equation. Hence information flow has probabilistic character; however, Bayes' rule in our case can be interpreted in an entirely deterministic way, without referring to prior and posterior probabilities, inherently associated with Bayesian philosophy.

Furthermore, in this paper we study the hierarchical structure of flow networks by allowing a subgraph determined by branches x and y to be substituted by a single branch connecting x and y, called the fusion of x and y. This "fusion" operation allows us to look at data with different accuracy and move from details to a general picture of the data structure.

This approach allows us to study different relationships in data and can be used as a new mathematical tool for data mining. Summing up, we advocate using flow analysis for:

• searching for patterns in data,
• searching for dependencies in data,
• data classification,
• data fusion.
A simple tutorial example will be used to illustrate the introduced ideas.
1.2 Example 1 - Smoking and Cancer

First let us explain the basic concepts of the proposed methodology on a simple example taken from [3]. Table 1.1 shows data concerning 60 people who do or do not smoke and do or do not have cancer.

Table 1.1. Smoking and Cancer

             Not cancer   Cancer   Total
Not smoke        40          7       47
Smoke            10          3       13
Total            50         10       60
With every data table like that presented in Table 1.1 we associate a flow graph as shown in Fig. 1.1. Nodes $x_0$ and $x_1$ are inputs of the graph, whereas $y_0$ and $y_1$ are outputs of the graph. The numbers $\varphi(x_0)$ and $\varphi(x_1)$ assigned to the input nodes of the flow graph represent inflow to the flow graph, whereas the numbers $\varphi(y_0)$ and $\varphi(y_1)$ associated with the output nodes represent outflow of the graph. Every branch $(x, y)$ of the flow graph is labeled by a number which represents a throughflow $\varphi(x, y)$ through the branch from node $x$ to node $y$. This representation of data is intended to capture the relationships in the data and is not meant to describe any material flow in the network.
[Fig. 1.1. Flow graph for Table 1.1]

We will show in the next sections that representation of data as flow in a flow graph can be used to discover many important relationships in data, e.g. dependences. However, to this end we have to "normalize" the flow graph by using, instead of absolute values of flow $\varphi(x)$, their relative values $\sigma(x)$, i.e. the percentage of flow with respect to the total flow of the graph. The absolute throughflow $\varphi(x, y)$ will also be replaced by the relative throughflow $\sigma(x, y)$. This normalized representation has very interesting mathematical properties, which can be used to discover patterns in data. Besides, we will use two additional coefficients called the certainty and coverage factors, denoted $cer(x, y)$ and $cov(x, y)$ respectively, which characterize how the flow is spread between nodes $x$ and $y$. The normalized flow graph for the flow graph given in Fig. 1.1 is shown in Fig. 1.2.
[Fig. 1.2. Normalized flow graph for Table 1.1]
From the flow graph we arrive at the following conclusions:

• 85% of non-smoking persons do not have cancer ($cer(x_0, y_0) = 40/47 \approx 0.85$),
• 15% of non-smoking persons do have cancer ($cer(x_0, y_1) = 7/47 \approx 0.15$),
• 77% of smoking persons do not have cancer ($cer(x_1, y_0) = 10/13 \approx 0.77$),
• 23% of smoking persons do have cancer ($cer(x_1, y_1) = 3/13 \approx 0.23$).

From the flow graph we get the following reasons for having or not having cancer:

• 80% of persons not having cancer do not smoke ($cov(x_0, y_0) = 4/5 = 0.80$),
• 20% of persons not having cancer do smoke ($cov(x_1, y_0) = 1/5 = 0.20$),
• 70% of persons having cancer do not smoke ($cov(x_0, y_1) = 7/10 = 0.70$),
• 30% of persons having cancer do smoke ($cov(x_1, y_1) = 3/10 = 0.30$).
Let us observe that in statistical terminology $\sigma(x_0)$, $\sigma(x_1)$ are priors, while $\sigma(x_0, y_0), \ldots, \sigma(x_1, y_1)$ are joint distributions, $cov(x_0, y_0), \ldots, cov(x_1, y_1)$ are posteriors and $\sigma(y_0)$, $\sigma(y_1)$ are marginal probabilities.
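The figures quoted in this example can be recomputed directly from the counts in Table 1.1. The following is only an illustrative sketch: the dictionary layout and the variable names (`phi`, `sigma`, `cer`, `cov`) are assumptions made for this illustration, and exact fractions are used so that no rounding enters before the final display.

```python
from fractions import Fraction

# Absolute throughflows phi(x, y) taken from Table 1.1.
# x0 = not smoke, x1 = smoke, y0 = not cancer, y1 = cancer.
phi = {
    ("x0", "y0"): 40,
    ("x0", "y1"): 7,
    ("x1", "y0"): 10,
    ("x1", "y1"): 3,
}
total = sum(phi.values())  # total flow of the graph, 60 people

# Relative (normalized) throughflow of every branch.
sigma = {branch: Fraction(flow, total) for branch, flow in phi.items()}

# Normalized flow through input nodes (priors) and output nodes (marginals).
sigma_x = {x: sum(s for (a, _), s in sigma.items() if a == x) for x in ("x0", "x1")}
sigma_y = {y: sum(s for (_, b), s in sigma.items() if b == y) for y in ("y0", "y1")}

# Certainty and coverage factors of every branch.
cer = {(x, y): s / sigma_x[x] for (x, y), s in sigma.items()}
cov = {(x, y): s / sigma_y[y] for (x, y), s in sigma.items()}

for branch in sorted(phi):
    print(branch, "cer =", cer[branch], "cov =", cov[branch])
```

Working with `Fraction` keeps values such as 40/47 exact; converting to `float` and rounding to two decimals reproduces the percentages listed in the text.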
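1.3 Flow Graphs: Basic Concepts

1.3.1 Flow Graphs

In this section the fundamental concept of the proposed approach, the flow graph, is defined and discussed.

A flow graph is a directed, acyclic, finite graph $G = (N, B, \varphi)$, where $N$ is a set of nodes, $B \subseteq N \times N$ is a set of directed branches, $\varphi : B \to R^{+}$ is a flow function and $R^{+}$ is the set of non-negative reals.

The input of a node $x \in N$ is the set $I(x) = \{y \in N : (y, x) \in B\}$; the output of a node $x \in N$ is defined as $O(x) = \{y \in N : (x, y) \in B\}$.

We will also need the concept of input and output of a graph $G$, defined, respectively, as follows: $I(G) = \{x \in N : I(x) = \emptyset\}$, $O(G) = \{x \in N : O(x) = \emptyset\}$.

Inputs and outputs of $G$ are external nodes of $G$; other nodes are internal nodes of $G$.

If $(x, y) \in B$ then $\varphi(x, y)$ is a throughflow from $x$ to $y$.

With every node $x$ of a flow graph $G$ we associate its inflow

$$\varphi_{+}(x) = \sum_{y \in I(x)} \varphi(y, x), \tag{1.1}$$

and outflow

$$\varphi_{-}(x) = \sum_{y \in O(x)} \varphi(x, y). \tag{1.2}$$

Similarly, we define an inflow and an outflow for the whole flow graph, which are defined as

$$\varphi_{+}(G) = \sum_{x \in I(G)} \varphi_{-}(x), \tag{1.3}$$

$$\varphi_{-}(G) = \sum_{x \in O(G)} \varphi_{+}(x). \tag{1.4}$$

We assume that for any internal node $x$, $\varphi_{+}(x) = \varphi_{-}(x) = \varphi(x)$, where $\varphi(x)$ is a throughflow of node $x$. Consequently, $\varphi_{+}(G) = \varphi_{-}(G) = \varphi(G)$, where $\varphi(G)$ is a throughflow of graph $G$.

A normalized flow graph is a directed, acyclic, finite graph $G = (N, B, \sigma)$, where $N$ is a set of nodes, $B \subseteq N \times N$ is a set of directed branches, $\sigma : B \to \langle 0, 1 \rangle$ is a normalized flow of $(x, y)$ and

$$\sigma(x, y) = \frac{\varphi(x, y)}{\varphi(G)} \tag{1.5}$$

is a strength of $(x, y)$. Obviously, $0 \le \sigma(x, y) \le 1$. The strength of the branch expresses simply the percentage of the total flow that passes through the branch.

In what follows we will use normalized flow graphs only; therefore by flow graphs we will understand normalized flow graphs, unless stated otherwise.

With every node $x$ of a flow graph $G$ we associate its inflow and outflow defined as

$$\sigma_{+}(x) = \frac{\varphi_{+}(x)}{\varphi(G)} = \sum_{y \in I(x)} \sigma(y, x), \tag{1.6}$$

$$\sigma_{-}(x) = \frac{\varphi_{-}(x)}{\varphi(G)} = \sum_{y \in O(x)} \sigma(x, y). \tag{1.7}$$

Obviously for any internal node $x$, we have $\sigma_{+}(x) = \sigma_{-}(x) = \sigma(x)$, where $\sigma(x)$ is a normalized throughflow of $x$. Moreover, let

$$\sigma_{+}(G) = \frac{\varphi_{+}(G)}{\varphi(G)} = \sum_{x \in I(G)} \sigma_{-}(x), \tag{1.8}$$

$$\sigma_{-}(G) = \frac{\varphi_{-}(G)}{\varphi(G)} = \sum_{x \in O(G)} \sigma_{+}(x). \tag{1.9}$$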
Obviously, $\sigma_{+}(G) = \sigma_{-}(G) = \sigma(G) = 1$.

If we invert the direction of all branches in $G$, then the resulting graph $G' = (N, B', \sigma')$ will be called an inverted graph of $G$. Of course the inverted graph $G'$ is also a flow graph and all inputs and outputs of $G$ become outputs and inputs of $G'$, respectively.

1.3.2 Certainty and Coverage Factors

With every branch $(x, y)$ of a flow graph $G$ we associate the certainty and the coverage factors. The certainty and the coverage of $(x, y)$ are defined as

$$cer(x, y) = \frac{\sigma(x, y)}{\sigma(x)}, \tag{1.10}$$

and

$$cov(x, y) = \frac{\sigma(x, y)}{\sigma(y)}, \tag{1.11}$$

respectively. Evidently, $cer(x, y) = cov(y, x)$, where $(x, y) \in B$ and $(y, x) \in B'$.

Below, some properties that are immediate consequences of the definitions given above are presented:

$$\sum_{y \in O(x)} cer(x, y) = 1, \tag{1.12}$$

$$\sum_{x \in I(y)} cov(x, y) = 1, \tag{1.13}$$

$$\sigma(x) = \sum_{y \in O(x)} cer(x, y)\,\sigma(x) = \sum_{y \in O(x)} \sigma(x, y), \tag{1.14}$$

$$\sigma(y) = \sum_{x \in I(y)} cov(x, y)\,\sigma(y) = \sum_{x \in I(y)} \sigma(x, y), \tag{1.15}$$

$$cer(x, y) = \frac{cov(x, y)\,\sigma(y)}{\sigma(x)}, \tag{1.16}$$

$$cov(x, y) = \frac{cer(x, y)\,\sigma(x)}{\sigma(y)}. \tag{1.17}$$
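The definitions of Section 1.3.1 and the properties (1.12)-(1.17) can be checked mechanically on a small graph. The sketch below is an illustration only: the three-layer graph, its node names (`a1`, `b1`, `c1`, ...) and its flow values are hypothetical and do not come from the paper, and the helper names are assumptions of this sketch.

```python
from fractions import Fraction

# Hypothetical three-layer flow graph: branch (x, y) -> absolute flow phi(x, y).
phi = {
    ("a1", "b1"): 30, ("a1", "b2"): 10,
    ("a2", "b1"): 20, ("a2", "b2"): 40,
    ("b1", "c1"): 35, ("b1", "c2"): 15,
    ("b2", "c1"): 5,  ("b2", "c2"): 45,
}
nodes = {n for branch in phi for n in branch}

def inputs(x):
    """I(x): nodes with a branch leading into x."""
    return {a for (a, b) in phi if b == x}

def outputs(x):
    """O(x): nodes reached by a branch leaving x."""
    return {b for (a, b) in phi if a == x}

def inflow(x):
    """phi_+(x), eq. (1.1)."""
    return sum(phi[(y, x)] for y in inputs(x))

def outflow(x):
    """phi_-(x), eq. (1.2)."""
    return sum(phi[(x, y)] for y in outputs(x))

graph_inputs = {x for x in nodes if not inputs(x)}    # I(G)
graph_outputs = {x for x in nodes if not outputs(x)}  # O(G)
internal = nodes - graph_inputs - graph_outputs

# Flow conservation at internal nodes and for the whole graph.
assert all(inflow(x) == outflow(x) for x in internal)
total = sum(outflow(x) for x in graph_inputs)          # phi_+(G), eq. (1.3)
assert total == sum(inflow(x) for x in graph_outputs)  # phi_-(G), eq. (1.4)

# Normalization, eq. (1.5); node strength uses outflow, or inflow for outputs
# (this shortcut assumes every branch carries a strictly positive flow).
sigma = {b: Fraction(f, total) for b, f in phi.items()}
sigma_node = {x: Fraction(outflow(x) or inflow(x), total) for x in nodes}

cer = {(x, y): s / sigma_node[x] for (x, y), s in sigma.items()}  # eq. (1.10)
cov = {(x, y): s / sigma_node[y] for (x, y), s in sigma.items()}  # eq. (1.11)

for x in nodes - graph_outputs:
    assert sum(cer[(x, y)] for y in outputs(x)) == 1  # eq. (1.12)
for y in nodes - graph_inputs:
    assert sum(cov[(x, y)] for x in inputs(y)) == 1   # eq. (1.13)
for (x, y) in sigma:
    assert cer[(x, y)] == cov[(x, y)] * sigma_node[y] / sigma_node[x]  # eq. (1.16)
    assert cov[(x, y)] == cer[(x, y)] * sigma_node[x] / sigma_node[y]  # eq. (1.17)

print(cer[("a1", "b1")], cov[("a1", "b1")])
```

The assertions pass for any flow table that satisfies conservation at internal nodes, which suggests reading (1.12)-(1.17) as bookkeeping identities about how flow is split among branches.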
Obviously the above properties have a probabilistic flavor, e.g., equations (1.14) and (1.15) have the form of the total probability theorem, whereas formulas (1.16) and (1.17) are Bayes' rules. However, these properties in our approach are interpreted in a deterministic way and they describe flow distribution among branches in the network.

1.3.3 Paths, Connections and Fusion

A (directed) path from $x$ to $y$, $x \ne y$, in $G$ is a sequence of nodes $x_1, \ldots, x_n$ such that $x_1 = x$, $x_n = y$ and $(x_i, x_{i+1}) \in B$ for every $i$, $1 \le i \le n - 1$. A path from $x$ to $y$ is denoted by $[x \ldots y]$.

The certainty of the path $[x_1 \ldots x_n]$ is defined as

$$cer[x_1 \ldots x_n] = \prod_{i=1}^{n-1} cer(x_i, x_{i+1}), \tag{1.18}$$

the coverage of the path $[x_1 \ldots x_n]$ is

$$cov[x_1 \ldots x_n] = \prod_{i=1}^{n-1} cov(x_i, x_{i+1}), \tag{1.19}$$

and the strength of the path $[x \ldots y]$ is

$$\sigma[x \ldots y] = \sigma(x)\,cer[x \ldots y] = \sigma(y)\,cov[x \ldots y]. \tag{1.20}$$
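As a small illustration of (1.18)-(1.20), the sketch below evaluates the certainty, coverage and strength of one path. The normalized graph and its values are hypothetical (they are not taken from the paper), and the function names are assumptions introduced only for this example.

```python
from fractions import Fraction as F
from math import prod

# Hypothetical normalized flow graph: branch (x, y) -> strength sigma(x, y).
sigma = {
    ("a1", "b1"): F(30, 100), ("a1", "b2"): F(10, 100),
    ("a2", "b1"): F(20, 100), ("a2", "b2"): F(40, 100),
    ("b1", "c1"): F(35, 100), ("b1", "c2"): F(15, 100),
    ("b2", "c1"): F(5, 100),  ("b2", "c2"): F(45, 100),
}

def node_strength(x):
    """sigma(x): normalized outflow of x, or its inflow if x is an output node."""
    out = sum(s for (a, _), s in sigma.items() if a == x)
    return out if out else sum(s for (_, b), s in sigma.items() if b == x)

def cer(x, y):
    return sigma[(x, y)] / node_strength(x)

def cov(x, y):
    return sigma[(x, y)] / node_strength(y)

def path_cer(path):
    """Certainty of a path, eq. (1.18): product of branch certainties."""
    return prod(cer(a, b) for a, b in zip(path, path[1:]))

def path_cov(path):
    """Coverage of a path, eq. (1.19): product of branch coverages."""
    return prod(cov(a, b) for a, b in zip(path, path[1:]))

def path_strength(path):
    """Strength of a path, first form of eq. (1.20)."""
    return node_strength(path[0]) * path_cer(path)

path = ["a1", "b1", "c1"]
print(path_cer(path), path_cov(path), path_strength(path))

# Both forms of eq. (1.20) agree.
assert path_strength(path) == node_strength(path[-1]) * path_cov(path)
```

The final assertion checks that the two expressions in (1.20) give the same strength for this path.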
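The set of all paths from $x$ to $y$ ($x \ne y$) in $G$, denoted $\langle x, y \rangle$, will be called a connection from $x$ to $y$ in $G$. In other words, the connection $\langle x, y \rangle$ is a sub-graph of $G$ determined by nodes $x$ and $y$.

The certainty of the connection $\langle x, y \rangle$ is

$$cer\langle x, y \rangle = \sum_{[x \ldots y] \in \langle x, y \rangle} cer[x \ldots y], \tag{1.21}$$

the coverage of the connection $\langle x, y \rangle$ is

$$cov\langle x, y \rangle = \sum_{[x \ldots y] \in \langle x, y \rangle} cov[x \ldots y], \tag{1.22}$$

and the strength of the connection $\langle x, y \rangle$ is

$$\sigma\langle x, y \rangle = \sum_{[x \ldots y] \in \langle x, y \rangle} \sigma[x \ldots y] = \sigma(x)\,cer\langle x, y \rangle = \sigma(y)\,cov\langle x, y \rangle. \tag{1.23}$$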
If we substitute simultaneously every sub-graph $\langle x, y \rangle$ of a given flow graph $G$, where $x$ is an input node and $y$ an output node of $G$, by a single branch $(x, y)$ such that $\sigma(x, y) = \sigma\langle x, y \rangle$, then in the resulting graph $G'$, called the fusion of $G$, we have $cer(x, y) = cer\langle x, y \rangle$, $cov(x, y) = cov\langle x, y \rangle$ and $\sigma(G) = \sigma(G')$.

1.3.4 Dependencies in Flow Graphs

Let $(x, y) \in B$. Nodes $x$ and $y$ are independent in $G$ if

$$\sigma(x, y) = \sigma(x)\,\sigma(y). \tag{1.24}$$

From (1.24) we get

$$\frac{\sigma(x, y)}{\sigma(x)} = cer(x, y) = \sigma(y), \tag{1.25}$$

and

$$\frac{\sigma(x, y)}{\sigma(y)} = cov(x, y) = \sigma(x). \tag{1.26}$$

If

$$cer(x, y) > \sigma(y), \tag{1.27}$$

or

$$cov(x, y) > \sigma(x), \tag{1.28}$$

then $x$ and $y$ are positively dependent in $G$. Similarly, if

$$cer(x, y) < \sigma(y), \tag{1.29}$$

or

$$cov(x, y) < \sigma(x), \tag{1.30}$$

then $x$ and $y$ are negatively dependent in $G$.

Relations of independency and dependences are symmetric ones, and are analogous to those used in statistics.

For every branch $(x, y) \in B$ we define a dependency (correlation) factor $\eta(x, y)$ as

$$\eta(x, y) = \frac{cer(x, y) - \sigma(y)}{cer(x, y) + \sigma(y)} = \frac{cov(x, y) - \sigma(x)}{cov(x, y) + \sigma(x)}. \tag{1.31}$$

Obviously $-1 \le \eta(x, y) \le 1$; $\eta(x, y) = 0$ if and only if $cer(x, y) = \sigma(y)$ and $cov(x, y) = \sigma(x)$; $\eta(x, y) = -1$ if and only if $cer(x, y) = cov(x, y) = 0$; $\eta(x, y) = 1$ if and only if $\sigma(y) = \sigma(x) = 0$.

It is easy to check that if $\eta(x, y) = 0$, then $x$ and $y$ are independent, if $-1 \le \eta(x, y) < 0$ then $x$ and $y$ are negatively dependent and if $0 < \eta(x, y) \le 1$ then $x$ and $y$ are positively dependent. Thus the dependency factor expresses a degree of dependency, and can be seen as a counterpart of the correlation coefficient used in statistics.
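The fusion operation and the dependency factor can be sketched in a few lines. The example below is illustrative only: the normalized three-layer graph is hypothetical, path enumeration by simple recursion is just one possible way to collect a connection, and all function names are assumptions of this sketch.

```python
from fractions import Fraction as F

# Hypothetical normalized flow graph: branch (x, y) -> strength sigma(x, y).
sigma = {
    ("a1", "b1"): F(30, 100), ("a1", "b2"): F(10, 100),
    ("a2", "b1"): F(20, 100), ("a2", "b2"): F(40, 100),
    ("b1", "c1"): F(35, 100), ("b1", "c2"): F(15, 100),
    ("b2", "c1"): F(5, 100),  ("b2", "c2"): F(45, 100),
}
graph_inputs = ["a1", "a2"]
graph_outputs = ["c1", "c2"]

def successors(x):
    return [b for (a, b) in sigma if a == x]

def node_strength(x):
    """sigma(x): normalized outflow of x, or its inflow if x is an output node."""
    out = sum(s for (a, _), s in sigma.items() if a == x)
    return out if out else sum(s for (_, b), s in sigma.items() if b == x)

def paths(x, y):
    """All directed paths from x to y (the graph is acyclic, so this terminates)."""
    if x == y:
        return [[y]]
    return [[x] + rest for nxt in successors(x) for rest in paths(nxt, y)]

def path_strength(path):
    """sigma(x) times the product of branch certainties along the path."""
    s = node_strength(path[0])
    for a, b in zip(path, path[1:]):
        s *= sigma[(a, b)] / node_strength(a)
    return s

def connection_strength(x, y):
    """Strength of the connection <x, y>: sum of its path strengths."""
    return sum(path_strength(p) for p in paths(x, y))

# Fusion: every input-output connection becomes a single branch.
fused = {(x, y): connection_strength(x, y)
         for x in graph_inputs for y in graph_outputs}

def eta(x, y):
    """Dependency factor of a fused branch, certainty form of eq. (1.31)."""
    sx = sum(fused[(x, b)] for b in graph_outputs)
    sy = sum(fused[(a, y)] for a in graph_inputs)
    c = fused[(x, y)] / sx
    return (c - sy) / (c + sy)

for branch in sorted(fused):
    print(branch, "sigma =", fused[branch], "eta =", eta(*branch))
```

In this hypothetical graph the fused strengths again sum to one, and the sign of `eta` separates positively from negatively dependent input-output pairs.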
1.4 Example 2 - Medical Test

Now we are ready to illustrate the basic concepts presented in this paper by a simple tutorial example. Various patient groups are put to the test for certain drug effectiveness. Initial data are shown in Fig. 1.3. The corresponding flow graph is presented in Fig. 1.4.

[Fig. 1.3. Initial data]
[Fig. 1.4. Relationship between Disease, Age and Test]

Fig. 1.5 shows the corresponding fusion of Disease and Test.

[Fig. 1.5. Fusion of the flow graph presented in Fig. 1.4. Recoverable node labels: Disease, $\sigma(x_1) = 0.70$, $\sigma(x_2) = 0.30$; Test, $\sigma(z_1) = 0.55$, $\sigma(z_2) = 0.45$]

This flow graph leads to the following conclusions:

• If the disease is present then the test result is positive with certainty 0.68
• If the disease is absent then the test result is negative with certainty 0.78

Explanation of test results is as follows:

• If the test result is positive then the disease is present with certainty 0.87
• If the test result is negative then the disease is absent with certainty 0.61

From the flow graph we get:

• There is slight positive correlation between presence of the disease and positive test result ($\eta = 0.10$).
• There is low positive correlation between absence of the disease and negative test result ($\eta = 0.27$).
• There is slight negative correlation between presence of the disease and negative test result ($\eta = -0.17$).
• There is higher negative correlation between absence of the disease and positive test result ($\eta = -0.40$).
1.5 Conclusions

We proposed in this paper to represent relationships in data by means of flow graphs. Flow in the flow graph is meant to capture the structure of data rather than to describe any physical material flow in the network. It is revealed that the information flow in the flow graph is governed by Bayes' formula; however, the formula can be interpreted in an entirely deterministic way, without referring to its probabilistic character. This representation allows us to study different relationships in the data and can be used as a new mathematical tool for data mining. Summing up:

• flow graphs can be used for knowledge representation,
• flow distribution represents relationships in data,
• flow conservation is described by Bayes' formula,
• Bayes' formula has a deterministic interpretation.

Acknowledgements

Thanks are due to Professor Andrzej Skowron for critical remarks.
References

1. Ford L.R., Fulkerson D.R. (1962) Flows in Networks. Princeton University Press, Princeton, New Jersey
2. Lukasiewicz J. (1913) Die logischen Grundlagen der Wahrscheinlichkeitsrechnung. Krakow. In: Borkowski L. (ed.), Jan Lukasiewicz - Selected Works, North Holland Publishing Company, Amsterdam, London, Polish Scientific Publishers, Warsaw, 1970
3. Grinstead Ch. M., Snell J. L. (1997) Introduction to Probability: Second Revised Edition. American Mathematical Society
4. Pawlak Z. (2003) Flow Graphs and Decision Algorithms. In: Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, Proceedings, G. Wang, Y. Yao and A. Skowron (eds.), Lecture Notes in Artificial Intelligence 2639, 1-10, Springer
Approximation Spaces for Hierarchical Intelligent Behavioral System Models

James F. Peters

Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, Manitoba R3T 5V6, Canada
j [email protected]

[Roughness] is an essential feature of living things, and has deep structural causes.
-Christopher Alexander, 2002.

Summary. This article introduces a hierarchical behavioral model for an intelligent system that is capable of approximate reasoning. The rough set approach introduced by Zdzislaw Pawlak provides a ground for concluding to what degree a particular model for an intelligent system is a part of a set of models representing a standard. Each layer of the hierarchical view of an intelligent system includes one or more information systems as well as one or more approximation spaces that provide a framework for approximate reasoning, learning, and pattern recognition. An approach to the solution of the behavioral system model classification problem in the context of rough sets and a satisfaction-based approximation space is suggested in this article. Approximation spaces are used to classify and measure intelligent system behavior patterns. In this context, rough inclusion of information granules relative to a standard and the proximity to a satisfaction threshold are measured. In addition, a rough set approach to ethology in classifying the behavior of cooperating agents is introduced. A hierarchical model for a swarmbot is briefly considered by way of illustration of the approach to modeling and classifying the behavior of intelligent systems.

Key words: approximation space, behavior, ethology, system model, design pattern, intelligent system, rough sets, swarm intelligence
2.1 Introduction
This paper introduces an approach to approximate reasoning about hierarchical, behavioral system models for intelligent systems in the context of rough sets [19-24,30,32-33,35,38-45]. Considerable work has been done on approximation spaces in the context of rough sets [24,27-28,38-41,46] based on generalized approximation spaces introduced in [38-39]. This work on approximation spaces is an outgrowth of the original approximation space definition in [24]. It is well-known that experimental models for system design in general and intelligent system design in particular
seldom exactly match what might be considered a standard. This is to be expected, since system designs tend to have an unbounded number of variations relative to an accepted design pattern. Consider, for example, the variations in the implementation of design patterns in architecture made possible by pattern languages [1-4]. Such variation is expected, and encouraged. It is this variation in actual system design patterns that is a source of a difficult classification problem [25-26,28-29]. The approach to creating an intelligent system model for a particular project resembles what architects do in designing a living space or what weavers do in designing a tapestry or a carpet. That is, the particular living space designed by an architect will represent variations of known architectural patterns driven by project context. The architectural patterns are drawn from a pattern language representing standards. Similarly, the design of a particular intelligent system will be guided by a pattern language tailor-made for intelligent systems. In the case of weavers, the design of each tapestry or carpet reflects an understanding about patterns representing the accumulation of wisdom about aesthetically-pleasing objects, in effect, a standard which is approximated by each new artifact created by a weaver. In either case, a design is considered good if it provides to some degree a satisfactory (comfortable, beckoning, safe) place for humans. Similarly, the design of approximation spaces provides gateways to knowledge about intelligent system models. An approach to the solution of the behavioral system model classification problem in the context of rough sets and a satisfaction-based approximation space is suggested in this article. The term model denotes an abstraction of a physical system with a specific purpose (see, e.g., [16,36]). In this work, a model is a set of interacting objects.
The term behavioral system model denotes a set of interacting objects with observable effects of a sequence of events. The term interaction is a specification of how stimuli (e.g., patterns) are sent between objects designed to perform a specific task (e.g., pattern recognition [10,50]). Interaction is best understood in the context of a collaboration between objects, which is a realization of a specification for a communicating system behavior. The term behavior denotes the observable effects of a sequence of events in the form of observed stimulation of an object (e.g., arrival of a message) and observed response. A number of the features of the behavior of cooperating agents can be identified with methods used by ethologists in the study of the behavior of biological organisms [7,15,48-49]. For example, one can consider the survival value of a response to a proximate cause (stimulus) at the onset of a behavior by an agent interacting with its environment as well as cooperating with other agents in an intelligent system. At any instant in time, behavior ontogeny (origin and development of a behavior) can be considered as a means of obtaining a better understanding of the interactions between agents and the environment in a complex system. Finally, it is helpful to consider features that can be extracted relative to the evolution of a behavior. Survival value, proximate cause, ontogeny and evolution of a behavior (the four whys) provide a basis for an ethological explanation of behavior [48]. In this article, a rough set approach to ethology in classifying the behavior of cooperating agents is introduced. A central problem considered in this work is how to reason about complex objects (e.g., a collaboration of objects used to model an intelligent system) using a
rough mereological framework, where one considers the functor "being a part in degree" as a predicate $\mu_r$, where the assertion $X \mu_r Y$ means that X is a part of Y in degree at least r [35]. The question considered here is how we can model complex objects relevant for applications such as intelligent systems (e.g., satisfying a given specification to a satisfactory degree), and how we can measure inclusion (closeness) of complex objects. These questions were included in papers on rough mereology (see, e.g., [34]). However, the problem not solved in the earlier work is how to deal with complex objects (e.g., a swarmbot) that are dynamically changing. The problem with complex systems is that although local interactions are definable, it is often the case that global behavior is not known exactly. However, using some approximate reasoning methods about global behavior, it is possible to predict (characterize) such behavior to a degree (e.g., some soft constraint will be satisfied to a degree, which can guarantee that the specification for a complex system behavior is satisfied at least to a satisfactory degree). An approach to solving this problem is considered in this article. This article also introduces a hierarchical approach to modeling the behavior of an intelligent system.
This paper has the following organization. Basic concepts from rough sets are briefly presented in Sect. 2.2. Basic concepts concerning pattern languages, forms and metamodels appear in Sect. 2.3. A brief presentation of features in a hierarchy of intelligent system models is given in Sect. 2.4. Sections 2.5 and 2.6 present a framework for measuring acceptance of intelligent system models.
2.2 Basic Concepts: Rough Sets
This section very briefly covers some fundamental concepts in rough set theory that provide a foundation for the approach to measuring the degree of acceptance of system models. In other words, the rough set approach introduced by Zdzislaw Pawlak [19-24] provides a ground for concluding to what degree a set of candidate design models are a part of a set of design models representing a standard. For computational reasons, a syntactic representation of knowledge is provided by rough sets in the form of data tables. Informally, a data table is represented as a collection of rows each labeled with some form of input, and each column is labeled with the name of an attribute that computes a value using the row input. Formally, a data (information) table IS is represented by a pair $(U, A)$, where $U$ is a non-empty, finite set of objects and $A$ is a non-empty, finite set of attributes, where $a : U \to V_a$ for every $a \in A$. For example, let $X, X' \in U$, where each of $X, X'$ represents a model (set of interacting objects) and $U$ is a collection of models. For each $B \subseteq A$, there is associated an equivalence relation $Ind_{IS}(B)$ such that $Ind_{IS}(B) = \{(X, X') \in U^2 \mid \forall a \in B.\ a(X) = a(X')\}$, and $B(X)$ denotes a block of B-indiscernible models in the partition of $U$ containing $X$. This view of the world befits the case where $U$ contains models that are spliced together to form a system model. For $X \subseteq U$, the set $X$ can be approximated from information contained in $B$ by constructing a B-lower and B-upper approximation denoted by $B_*X$ and $B^*X$, respectively, where $B_*X = \{x \in U \mid B(x) \subseteq X\}$ and $B^*X = \{x \in U \mid B(x) \cap X \neq \emptyset\}$. $B_*X$ is a collection of objects that can be classified with full certainty as
members of $X$ using the knowledge represented by attributes in $B$. By contrast, $B^*X$ is a collection of objects representing both certain and possible uncertain knowledge about $X$. Whenever $B_*X$ is a proper subset of $B^*X$, i.e., $B_*X \subset B^*X$, the collection of objects has been classified imperfectly, and $X$ is considered a rough set.
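The definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: the toy table, the attribute names `a1` and `a2`, and the helper names `blocks`, `lower`, and `upper` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical information table IS = (U, A): object -> {attribute: value}.
TABLE = {
    "x1": {"a1": 0, "a2": 1},
    "x2": {"a1": 0, "a2": 1},
    "x3": {"a1": 1, "a2": 1},
    "x4": {"a1": 1, "a2": 0},
}


def blocks(table, B):
    """Partition U into blocks of B-indiscernible objects."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in B)].add(obj)
    return list(groups.values())


def lower(table, B, X):
    """B-lower approximation: union of the blocks contained in X."""
    return {x for blk in blocks(table, B) if blk <= X for x in blk}


def upper(table, B, X):
    """B-upper approximation: union of the blocks that intersect X."""
    return {x for blk in blocks(table, B) if blk & X for x in blk}


X = {"x1", "x3"}
print(sorted(lower(TABLE, ["a1", "a2"], X)))  # ['x3']
print(sorted(upper(TABLE, ["a1", "a2"], X)))  # ['x1', 'x2', 'x3']
```

Here the lower approximation is a proper subset of the upper approximation, so the toy set `X` is rough relative to the two attributes.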
2.3 Pattern Languages, Forms, and Metamodels
The idea of a pattern language conforms to the notion of language in semiotics [8]. A language is defined by a set of objects (alphabet), a grammar, and expressions formed with objects. Similarly, a pattern language consists of a set of patterns, a grammar for connecting patterns together, and pattern constructions (patterns pieced together) (see, e.g., [3]). Pattern constructions can be quite complex, and can include overlapping patterns (e.g., an observer pattern sometimes overlaps with a memento pattern in system models, where an observer records changes in the behavior of an observable object, and also acts as a caretaker in causing the internal state of an object (a memento) to be saved). The observer and memento patterns are examples of behavioral patterns (see, e.g., [12]). Each pattern is defined by a conjunction of feature values. A row in a table of feature values for various members of a class can be thought of as an approximate description of an object belonging to a class. Of course, this is reminiscent of an idea in Plato's dialogues where a form is a pattern fixed in the nature of things (see, e.g., [31]). Patterns found in tables of feature values derived from experiments can be viewed in some sense as approximations ("likenesses") of the ideal. Consider, for simplicity, a table where all of the rows of feature values are associated with objects belonging to a single class. In effect, such a table of feature values can be viewed as an imperfect imitation of an ideal table (see, e.g., Watanabe [50]). More recently, determination of membership of objects in a class has been viewed relative to categories (clusters of objects with similar patterns), categorical perception and perceptual learning (see, e.g., [11, 13]). A set of objects in a model that describes a system has a "form" recognizable from its feature values (patterns) that identify it as a recognizable member of a class.
In the context of modeling intelligent systems, a model is an abstraction that describes interactions between cooperating objects [27]. The term intelligence is restricted to mean pattern recognition capability. In architecture, a pattern name (e.g., corner grocery, beer hall, traveler's inn, bus stop) is used instead of feature values to identify a category or cluster of similar objects (see, e.g., [3]). For example, instead of giving feature values for a corner grocery pattern (e.g., display window size, entrance size, number of counters, and so on), an architect uses the pattern name as a guide (initially, without specifics) in planning a town. This is a good idea because it simplifies the assembly of an architecture (connected patterns) which have infinite variability [1]. This also suggests the possibility of a metamodel for structures (either physical or abstract). A metamodel defines a language for models. That is, a metamodel provides a description of models (possibly metamodels) connected together to form a cohesive structure or system. In effect, structural and functional features in design patterns can be extracted at different levels in an intelligent system model. The complete model for an intelligent system then takes the form of
layers. In general, a behavioral model for an intelligent system design is represented by a hierarchy of models (see, e.g., Fig. 2.1). Each layer in the hierarchy contains a set of interacting objects. Patterns commonly found in models for intelligent system designs can be gleaned from diagrams [25-29] using UML, the Unified Modeling Language [16], especially in the context of systems engineering (see, e.g., [14, 17]). A hierarchical model of a swarmbot is briefly considered next by way of illustration of our approach. A swarmbot is a self-organizing collection of cooperating bots.
Example 3.1 Hierarchical Description of a Swarmbot.
A hierarchical description of a swarmbot (sbot) is tree-shaped with a metamodel for a swarmbot at its root, which represents cooperative behavior of a collection of evolving bots (see Fig. 2.1).
Fig. 2.1. Layers in hierarchical swarmbot model
The metamodel for an sbot has its counterpart in nature. For example, Tinbergen has observed that cooperation of a flock of flying starlings seems so perfect that one forgets the individuals and thinks of them as one huge "super-individual" [49]. The intelligence of a swarmbot is measured relative to the degree to which it learns to recognize patterns. The idea for a swarmbot comes from [6, 9]. The swarmbot metamodel subsumes metamodels for a bot, communication channel, parameterized approximation space $PAS(U, I_\#, \nu_\$, \models)$ explained in [27], bot subsystem, and swarm intelligence components (see Fig. 2.2). Each layer in the hierarchy will have its own information table used to classify feature-value patterns (see Fig. 2.1). Channels are used for interaction (message-passing) between bots. The evolution of the behavior of a swarmbot is the result of the evolving behavior of its subsystems, which are sub-swarmbots (sub-collections of cooperating, communicating bots). One of the objects in the swarmbot model in Fig. 2.2 is a metamodel
of a bot (see Fig. 2.3). The high-level view of a bot in Fig. 2.3 includes metamodels of bot machines, subsystems, approximator, and bot components.
Fig. 2.2. Swarmbot model [diagram; recoverable class labels: Swarmbot Model, Bot Model, $PAS(U, I_\#, \nu_\$, \models)$, Bot Subsystem Model, SwarmIntelligenceComponents, Channel Model; association labels: approximate, interconnect, evolve behavior]
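The layering described for the swarmbot model might be sketched as a small set of Python data classes. This is only an illustration under stated assumptions: the class and field names (`InformationTable`, `ChannelModel`, `BotModel`, `SwarmbotModel`, `layers`) are hypothetical, the multiplicities of the original diagram are not reproduced, and only the ideas that each layer carries its own information table and that channels link bots are taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class InformationTable:
    """Feature-value rows for one layer: object name -> {feature: value}."""
    rows: dict = field(default_factory=dict)


@dataclass
class ChannelModel:
    """Message-passing link between two bots, identified by name."""
    sender: str
    receiver: str


@dataclass
class BotModel:
    """A single bot with its own information table (one layer down)."""
    name: str
    table: InformationTable = field(default_factory=InformationTable)


@dataclass
class SwarmbotModel:
    """Root of the hierarchy: cooperating bots plus their channels."""
    bots: list = field(default_factory=list)
    channels: list = field(default_factory=list)
    table: InformationTable = field(default_factory=InformationTable)

    def layers(self):
        """Return the information tables, swarmbot layer first."""
        return [self.table] + [bot.table for bot in self.bots]


sbot = SwarmbotModel(
    bots=[BotModel("bot1"), BotModel("bot2")],
    channels=[ChannelModel("bot1", "bot2")],
)
print(len(sbot.layers()))  # 3
```

Keeping one table per layer mirrors the statement that each layer classifies its own feature-value patterns; an approximation space per layer could be attached in the same way.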
[Fig. 2.3: diagram of the bot metamodel; recoverable labels: Bot Model, Bot Machine, Bot Subsystem, classify pattern]
[0, 1] denotes rough inclusion
The uncertainty function defines a neighborhood of every object x belonging to the universe U. For example, the sets computed with the uncertainty function I(x) can be used to define a covering of U [35]. The rough inclusion function $\nu$ computes the degree of overlap between two subsets of U. In general, rough inclusion $\nu : \wp(U) \times \wp(U) \to [0, 1]$ can be defined in terms of the relationship between two sets where
$$\nu(X, Y) = \begin{cases} \dfrac{|X \cap Y|}{|Y|}, & \text{if } Y \neq \emptyset, \\ 1, & \text{otherwise,} \end{cases}$$
for any $X, Y \subseteq U$ [46]. In the case where $X \subseteq Y$, then $\nu(X, Y) = 1$. The minimum inclusion value $\nu(X, Y) = 0$ is obtained when $X \cap Y = \emptyset$ (i.e., X and Y have no elements in common). Other forms of approximation spaces have been considered in approximate reasoning (see, e.g., [27]) and rough neural computing (see, e.g., [18]). In a hierarchical model of an intelligent system, one or more approximation spaces would be associated with each layer.
Example 5.1 Approximation Space for a Swarmbot
To set up an approximation space for a swarmbot, let $IS_{sbot} = (U, A)$ be an information system, where U is a non-empty set of models, and A is a set of features drawn from Tables 2.1 and 2.2. Assume that $I_B : \wp(U) \to \wp(U)$ is an uncertainty function that computes a subset of U relative to parameter B (subset of attributes in A). For example, let $X \subseteq U$, $B \subseteq A$, and let $I_B(X)$ compute $B^*(X)$, a B-upper approximation of X. Further, let $Y \subseteq U$ represent a standard for swarmbot models, and let $B(Y)$ be a block in the partition of U containing Y (i.e., $B(Y)$ contains models that are equivalent to Y relative to B). Then $\nu_B(B^*(X), B(Y)) = |B^*(X) \cap B(Y)| / |B(Y)|$. In effect, rough inclusion $\nu_B(B^*(X), B(Y))$ measures the extent that an upper approximation is included in a block of the partition of the universe containing B-equivalent sets of interacting objects representing a design standard.
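A short sketch of a rough inclusion computation may help here. It follows the form used in Example 5.1, where the overlap is divided by the cardinality of the second argument; that choice, the value returned for an empty second argument, the function name `rough_inclusion`, and the sample sets are assumptions of this illustration.

```python
from fractions import Fraction


def rough_inclusion(x, y):
    """Degree of overlap |X ∩ Y| / |Y|, in the form used in Example 5.1.

    Returns 1 when Y is empty (an assumed convention for this sketch).
    """
    x, y = set(x), set(y)
    if not y:
        return Fraction(1)
    return Fraction(len(x & y), len(y))


upper_x = {"m1", "m2", "m3"}        # hypothetical upper approximation B*(X)
block_y = {"m2", "m3", "m4", "m5"}  # hypothetical block B(Y) for a standard

print(rough_inclusion(upper_x, block_y))  # 1/2
print(rough_inclusion({"m9"}, block_y))   # 0
print(rough_inclusion(block_y, block_y))  # 1
```

Exact fractions are used so that values such as 2/8 compare cleanly against a threshold without floating-point rounding.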
2.6 Measuring Acceptance of Intelligent System Models
In this section, an approach to measuring the degree of acceptance of intelligent system models is considered relative to a form of satisfaction-based approximation space, which has been inspired by [5,44].
Definition 1. Rough Inclusion Satisfaction Relation. Let the threshold $th \in [0, 1)$, and let $X, Y \in \wp(U)$, $B \subseteq A$ (feature set), where U is a non-empty set of objects. Then $X \models_{Y,th} B$ if and only if $\nu_B(X, Y) \geq th$.
That is, X satisfies B if and only if the rough inclusion value $\nu_B(X, Y)$ is greater than or equal to some preset threshold th. The threshold th serves as a filter for a set of candidate models X for system designs relative to a standard Y represented by a set of acceptable models. In the case where the overlap between a set of models X for an intelligent system design and a standard Y is greater than or equal to threshold th, then X has been satisfactorily classified relative to the standard. Let $\models_{Y_{std},th}$ denote a rough inclusion satisfaction relation that has been specialized relative to standard $Y_{std}$ and a threshold th. Basically, then, the measurement of the acceptability of intelligent system models relative to standard $Y_{std}$ can be considered in the context of $PAS_{sat}$, a parameterized approximation space that includes a satisfaction relation, where $PAS_{sat} = (U, I_\#, \nu_\$, \models_{Y_{std},th})$. This form of an approximation space was introduced in [27]. Satisfaction-based approximation spaces are also explored in [28].
Fig. 2.6. Left: Inspect-bot; Right: Caterpillar swarmbot [recoverable labels: vibration damper, power line, camera used by inspection subsystem, line-crawling inspect-bot, expandable pipe used to guide bot under obstacle]
2.7 Line-Crawling Swarmbot
A brief description of a swarmbot that resembles a caterpillar (cat-sbot) is given in this section. The cat-sbot has been designed to crawl along powerlines during inspection of various structures (e.g., towers, conductors, vibration dampers, insulators). An individual bot in the cat-sbot is equipped with one or more cameras used to inspect and classify power system structures (see Fig. 2.6). Such a bot is called an inspect-bot (it consists of two or more bots, one bot that handles locomotion and one or more independent bots with vision systems for inspection). The appendages of an inspection bot give the bot the ability to hang onto and roll along an unobstructed length of line. However, an inspect-bot has a very simple design and
does not have the capability to navigate by itself around obstacles such as vibration dampers and line clamps. This bot requires the assistance (a form of pushing action) of another bot to crawl around an obstacle. Such cooperation between bots is one of the hallmarks of swarmbot behavior (see, e.g., [6,9,17]), and of multiagent systems containing independent agents that exhibit some degree of coordination (see, e.g., [47]). The cat-sbot in Fig. 2.6 (right) shows multiple inspect-bots, which cooperate to navigate along a powerline with many obstacles.
Example 7.1 Decisions about observed sbot behavior
The basic model for an sbot given in Fig. 2.2 has been specialized to solve a power system equipment inspection problem. For simplicity, the original types of proximate causes, evolution and response in Table 2.1 have been numerically coded. This coding is shown in Table 2.3. So, for example, m1 = 2 denotes the fact that the current stimulus (proximate cause of an sbot behavior) is an inspect signal, and m3 = 3 denotes an sbot response, which is an exploration activity by an sbot. A sample decision table used to record some of the feature values for a number of different behaviors exhibited by a swarmbot is given in Table 2.4.
Table 2.3. Numerically coded why-types
Sbot  Explanation
m1    Proximate cause $\in$ {1: communication, 2: inspect, 3: repair, 4: perception}
m2    Evolution $\in$ {1: migrate, 2: self-organize}
m3    Response $\in$ {1: build, 2: filter, 3: explore}
Bot   Explanation
b1    Proximate cause $\in$ {1: hunger, 2: avoidance, 3: recognition, 4: memento, 5: observer}
b2    Evolution $\in$ {1: mutation, 2: selection, 3: reproduction}
b3    Response $\in$ {1: hunt, 2: observe, 3: protect, 4: classify, 4: learn}
Table 2.4. Partial Feature Value Table
X   m1  m2  m3  m4  b1  b2  b3  b4  b6  d
X1  2   1   3   0   3   1   2   0   0   1
X2  3   2   3   0   4   3   3   0   0   0
X3  1   1   2   0   3   1   3   0   0   1
X4  3   2   3   0   4   3   3   0   0   0
X5  4   2   3   0   1   1   1   0   0   1
X6  4   2   3   0   1   1   1   0   0   0
X7  3   2   3   0   4   3   1   0   0   0
X8  1   2   3   0   4   3   3   0   0   1
X9  2   1   3   0   3   1   2   0   0   1
In Table 2.4, m1, m2, m3, m4 correspond to Tinbergen's four whys for a swarmbot. Also included in this table are b1, b2, b3, b4 representing proximate cause, evolution, response, and skill-level respectively for a bot built into the swarmbot. Feature b6 represents bot habituation. Decision d = 1 indicates that the particular sbot behavior typifies a caterpillar swarmbot (cat-sbot), where d = 0 (reject) indicates that the sbot behavior does not represent
a cat-sbot.
Example 7.2 Approximating a Set of Decisions about cat-sbot Models
From Table 2.4, for simplicity, let U = {X1, X2, X3, X4, X5, X6, X7, X8, X9}, and let D = {X | Design(X) = 1}. Let B = {m4 (sbot growth), b1 (bot proximate cause), b3 (bot response), b4 (bot skill-level)} as in Table 2.5. Then consider the approximation of set D relative to B-patterns from Table 2.5.
Table 2.5. Cat-sbot Feature Table
X   m4  b1  b3  b4  d
X1  0   3   2   0   1
X2  0   4   3   0   0
X3  0   3   3   0   1
X4  0   4   3   0   0
X5  0   1   1   0   1
X6  0   1   1   0   0
X7  0   4   1   0   0
X8  0   4   3   0   1
X9  0   3   2   0   1
Decision Value = 1
Equivalence Classes for Attributes: {m4, b1, b3, b4}
[attr. values] : {equivalence classes}
[0, 1, 1, 0] : {X5, X6}
[0, 3, 2, 0] : {X1, X9}
[0, 3, 3, 0] : {X3}
[0, 4, 1, 0] : {X7}
[0, 4, 3, 0] : {X2, X4, X8}
Decision Class Stimuli: {X1, X3, X5, X8, X9}
$B_*D$ = {X1, X3, X9}
$B^*D$ = {X1, X2, X3, X4, X5, X6, X8, X9}
$BN_B(D)$ = {X2, X4, X5, X6, X8} [B-boundary region of D]
$U - B^*D$ = {X7} [B-outside region of D]
$\alpha_B(D) = |B_*D| / |B^*D| = 3/8 = 0.375$ [approximation accuracy]
Fig. 2.7. Approximation of Set of Decisions [regions shown: $U - B^*D$ = {X7}; $BN_B(D)$ = {X2, X4, X5, X6, X8}; $B_*D$ = {X1, X3, X9}]
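The figures reported in Example 7.2 can be checked with a short script. The sketch below is an illustration only; it assumes the rows of Table 2.5 as transcribed here, and the variable names are chosen for the example.

```python
from collections import defaultdict

# Rows of Table 2.5: object -> (m4, b1, b3, b4, d).
TABLE_2_5 = {
    "X1": (0, 3, 2, 0, 1), "X2": (0, 4, 3, 0, 0), "X3": (0, 3, 3, 0, 1),
    "X4": (0, 4, 3, 0, 0), "X5": (0, 1, 1, 0, 1), "X6": (0, 1, 1, 0, 0),
    "X7": (0, 4, 1, 0, 0), "X8": (0, 4, 3, 0, 1), "X9": (0, 3, 2, 0, 1),
}

# Group objects by their B-pattern (the first four values of each row).
classes = defaultdict(set)
for name, row in TABLE_2_5.items():
    classes[row[:4]].add(name)

universe = set(TABLE_2_5)
# Decision class D: objects whose decision value d equals 1.
D = {name for name, row in TABLE_2_5.items() if row[4] == 1}

lower = set().union(*(c for c in classes.values() if c <= D))
upper = set().union(*(c for c in classes.values() if c & D))
boundary = upper - lower
outside = universe - upper
accuracy = len(lower) / len(upper)

print(sorted(lower))     # ['X1', 'X3', 'X9']
print(sorted(boundary))  # ['X2', 'X4', 'X5', 'X6', 'X8']
print(sorted(outside))   # ['X7']
print(accuracy)          # 0.375
```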
Hence, the set D is considered rough or inexact, since the accuracy of approximation of D is less than 1. A visualization of approximating the set of design decisions relative to our knowledge in B about the behavior of the experimental sbots is shown in Fig. 2.7. The non-empty boundary region in Fig. 2.7 is an indication that the set D cannot be defined in a crisp manner.
Example 7.3 Upper Approximation as a Standard
Consider an approximation space $(U, I_B, \nu_B)$ with B equal to a feature set, and where
$I_B : \wp(U) \to \wp(U)$, where $I_B(X) = B^*(X)$,
$\nu_B : \wp(U) \times \wp(U) \to [0, 1]$, where $\nu_B(B(X), B^*D) = |B(X) \cap B^*D| / |B^*D|$.
Again, let B = {m4, b1, b3, b4} in Table 2.5. Let X denote a set of interacting objects that cooperate to achieve sbot design objectives. Relative to the knowledge represented by the feature set B and the set D = {X | Design(X) = 1, cat-sbot design pattern} = {X1, X3, X5, X8, X9}, the following upper approximation has been computed.
$B^*D$ = {X1, X2, X3, X4, X5, X6, X8, X9}
The partition of U is given by
B(X5) = {X5, X6}
B(X1) = {X1, X9}
B(X3) = {X3}
B(X7) = {X7}
B(X2) = {X2, X4, X8}
In this case, assume that $I_B(D)$ computes $B^*D$, and compute $\nu_{\{m4,b1,b3,b4\}}(B(X), B^*D)$. Then construct Table 2.6, where the threshold th = 2/8.
Table 2.6. Rough Inclusion Table
B(X)                   $\nu_B$                                   Result (th = 2/8)
B(X1) = {X1, X9}       $|B(X1) \cap B^*D| / |B^*D| = 2/8$        accept
B(X2) = {X2, X4, X8}   $|B(X2) \cap B^*D| / |B^*D| = 3/8$        accept
B(X3) = {X3}           $|B(X3) \cap B^*D| / |B^*D| = 1/8$        reject
B(X5) = {X5, X6}       $|B(X5) \cap B^*D| / |B^*D| = 2/8$        accept
B(X7) = {X7}           $|B(X7) \cap B^*D| / |B^*D| = 0$          reject
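The accept and reject decisions in Table 2.6 can be reproduced with the following sketch. It is an illustration under stated assumptions: the blocks and the upper approximation are copied from Example 7.3, while the names `BLOCKS`, `UPPER_D`, `TH`, and `inclusion` are chosen for the example.

```python
from fractions import Fraction

# Blocks of B-equivalent behaviors and the upper approximation from Example 7.3.
BLOCKS = {
    "B(X1)": {"X1", "X9"},
    "B(X2)": {"X2", "X4", "X8"},
    "B(X3)": {"X3"},
    "B(X5)": {"X5", "X6"},
    "B(X7)": {"X7"},
}
UPPER_D = {"X1", "X2", "X3", "X4", "X5", "X6", "X8", "X9"}
TH = Fraction(2, 8)  # threshold used in Table 2.6


def inclusion(block, standard):
    """Overlap of a block with the standard, relative to the standard's size."""
    return Fraction(len(block & standard), len(standard))


results = {}
for name, block in BLOCKS.items():
    degree = inclusion(block, UPPER_D)
    # Satisfaction relation: accept when the inclusion degree reaches th.
    results[name] = "accept" if degree >= TH else "reject"
    print(f"{name}: {len(block & UPPER_D)}/{len(UPPER_D)} -> {results[name]}")
```

Raising `TH` to 3/8 would leave only B(X2) accepted, which shows how strongly the outcome depends on the chosen threshold.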
The upper approximation $B^*(D)$ represents our knowledge about a particular set of sbot behaviors relative to our knowledge reflected in B. That is, every set of behaviors in the upper approximation has been classified as a possible cat-sbot. A comparison of the upper approximation with each of the equivalence classes ranging over the observed sbot behaviors provides an indication of where mismatches occur. The degree of mismatch between the upper approximation and a block of equivalent behaviors is measured using $\nu_B$. In four cases in Table 2.6, $\nu_B(B(X1), B^*D)$, $\nu_B(B(X2), B^*D)$, $\nu_B(B(X3), B^*D)$, $\nu_B(B(X5), B^*D)$, there is an overlap between the observed behaviors in B(X1), B(X2), B(X3), B(X5) and $B^*D$. This indicates that the behaviors in blocks B(X1), B(X2), B(X3), B(X5) are related to behaviors in $B^*D$ to a degree. The issue now is to devise a scheme provided by an approximation space relative to standards for classifying sbot behaviors and a measure of when an sbot behavior is satisfactory.
Assume that the threshold th = 0.25 in the rough inclusion satisfaction relation. In what follows, assume that $Y_{std}$ represents a collection of similar models for a system Sys, where each $Y \in Y_{std}$ is a model for the design of a subsystem of Sys. Further, assume that each block of the partition of the universe represented by B(X) contains candidate models for the design of Sys. The outcome of a classification is to cull from the blocks of the set of experimental models those models that satisfy the standard to some degree. Next construct a classification table (see Table 2.5). From Table 2.6, B(X1), B(X2), and B(X5) satisfy the standard. Since $\nu_B(B(X3), B^*D)$ is below the threshold th, it is rejected along with B(X7). The choice of the threshold is somewhat arbitrary and will depend on the requirements for a particular project.
2.8 Conclusion
The end result of viewing intelligent system models within a satisfaction-based approximation space is a rough mereological framework that provides a means of grading (i.e., measuring the extent to which) a set of intelligent system models is a part of a set of models representing a project standard to an acceptable degree. Since it is common for models of subsystem designs to overlap, a subsystem model extracted from a complete legacy system model has the appearance of a fragment, something incomplete when compared with a standard. Hence, it is appropriate to use approximation methods to measure the extent that an experimental model is to a degree a part of a model representing a standard. Ultimately, it is important to consider ways to model and classify the behavior of an intelligent system as it unfolds. Rather than take a rigid approach where a system behavior is entirely predictable based on its design, there is some benefit in relaxing the predictability requirement and considering how one might gain approximate knowledge about evolving patterns of behavior in an intelligent system. The studies of animal behavior by ethologists provide a number of features useful in the study of changing intelligent system behavior in response to environmental (sources of stimuli) as well as internal influences (e.g., image classification results, battery energy level, response to behavior pattern recognition, various forms of learning). Behavior decision tables would normally be constantly changing during the life of an intelligent system. As a result, there is a need for a cooperating system of agents to gain, measure, and share knowledge discovered about changing behavior patterns. This article presents a suggested rough set approach to coping with behavior pattern classification and measurement in the context of approximation spaces.
Acknowledgements
The author gratefully acknowledges the profound insights and suggestions made by Andrzej Skowron concerning this paper. The author also wishes to thank Maciej Borkowski, Dan Lockery and Peter Schilling for their recent work on the design of a line-crawling sbot that provides a basis for the illustration of swarm intelligence described in this article. This research has been supported by Natural Sciences and Engineering Research Council of Canada (NSERC) grant 185986 and grants T209, T217, T137, and T247 from Manitoba Hydro.
References
1. Alexander, C.: The Timeless Way of Building. Oxford University Press, UK (1979)
2. Alexander, C.: Notes on the Synthesis of Form. Harvard University Press, Cambridge, MA (1964)
3. Alexander, C., Ishikawa, S., Silverstein, M., Jacobson, M., Fiksdahl-King, I., Angel, S.: A Pattern Language. Oxford University Press, UK (1977)
4. Alexander, C.: A Foreshadowing of 21st Century Art. The Color and Geometry of Very Early Turkish Carpets. Oxford University Press, UK (1993)
5. Barwise, J., Seligman, J.: Information Flow. The Logic of Distributed Systems. Cambridge University Press, UK (1997)
6. Bonabeau, E., Dorigo, M., Theraulaz, G.: Swarm Intelligence. From Natural to Artificial Systems. Oxford University Press, UK (1999)
7. Cheng, K.: Generalization and Tinbergen's four whys. Behavioral and Brain Sciences 24 (2001) 660-661
8. Curry, H.B.: Mathematical Foundations of Logic. Dover Publications, NY (1963)
9. Dorigo, M.: Swarmbots. Wired (Feb. 2004) 119
10. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley, Toronto (2001)
11. Fahle, M., Poggio, T. (Eds.): Perceptual Learning. The MIT Press, Cambridge, MA (2002)
12. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Toronto (1995)
13. Harnad, S. (Ed.): Categorical Perception. The Groundwork of Cognition. Cambridge University Press, UK (1987)
14. Holt, J.: UML for Systems Engineering. Watching the Wheels. The Institute of Electrical Engineers, Herts, UK (2001)
15. Kruuk, H.: Niko's Nature. A Life of Niko Tinbergen and His Science of Animal Behavior. Oxford University Press, London (2003)
16. OMG Unified Modeling Language Specification. Object Management Group, http://www.omg.org
17. Mondada, F., Bonani, M., Magnenat, S., Guignard, A., Floreano, D.: Physical connections and cooperation in swarm robotics. In: Frans Groen, Nancy Amato, Andrea Bonarini, Eiichi Yoshida and Ben Krose (eds.): Proceedings of the 8th Conference on Intelligent Autonomous Systems (IAS8), Amsterdam, NL, March 10-14 (2004) 53-60
18. Pal, S.K., Polkowski, L., Skowron, A. (eds.): Rough-Neural Computing. Techniques for Computing with Words. Springer-Verlag, Heidelberg (2004)
19. Pawlak, Z.: Rough sets. International J. Comp. Inform. Science 11 (1982) 341-356
20. Pawlak, Z.: Rough sets and decision tables. Lecture Notes in Computer Science, Vol. 208, Springer Verlag, Berlin (1985) 186-196
21. Pawlak, Z.: On rough dependency of attributes in information systems. Bulletin Polish Acad. Sci. Tech. 33 (1985) 551-599
22. Pawlak, Z.: On decision tables. Bulletin Polish Acad. Sci. Tech. 34 (1986) 553-572
23. Pawlak, Z.: Decision tables - a rough set approach. Bulletin ETACS 33 (1987) 85-96
24. Pawlak, Z.: Rough Sets. Theoretical Reasoning about Data. Kluwer, Dordrecht (1991)
25. Peters, J.F.: Design patterns in intelligent systems. Lecture Notes in Artificial Intelligence, Vol. 2871, Springer-Verlag, Berlin (2003) 262-269
26. Peters, J.F., Ramanna, S.: Intelligent systems design and architectural patterns. In: Proceedings IEEE Pacific Rim Conference on Communication, Computers and Signal Processing (PACRIM'03) (2003) 808-811
27. Peters, J.F.: Approximation space for intelligent system design patterns. Engineering Applications of Artificial Intelligence, Vol. 17, No. 4 (2004) 1-8
28. Peters, J.F., Ramanna, S.: Measuring acceptance of intelligent system models. In: Proc. KES 2004 [to appear]
29. Peters, J.F., Ramanna, S.: Hierarchical behavioral model of a swarmbot. In: Proc. AIMethod 2004 [to appear]
30. Peters, J.F., Skowron, A., Stepaniuk, J., Ramanna, S.: Towards an ontology of approximate reason. Fundamenta Informaticae, Vol. 51, Nos. 1, 2 (2002) 157-173
31. Plato: Parmenides. In: Hamilton, E., Cairns, H. (Eds.): The Collected Dialogues of Plato Including the Letters. Princeton University Press, NJ (1961)
32. Polkowski, L., Skowron, A. (eds.): Rough Sets in Knowledge Discovery. Vol. 1, Physica-Verlag, Heidelberg (1998a)
33. Polkowski, L., Skowron, A. (eds.): Rough Sets in Knowledge Discovery. Vol. 2, Physica-Verlag, Heidelberg (1998b)
34. Polkowski, L., Skowron, A.: Rough mereology: A new paradigm for approximate reasoning. Int. J. Approximate Reasoning 15/4 (1996) 333-365
35. Polkowski, L.: Rough Sets. Mathematical Foundations. Physica-Verlag, Heidelberg (2002)
36. Sandewall, E.: Features and Fluents. The Representation of Knowledge about Dynamical Systems. Vol. 1. Clarendon Press, Oxford (1994)
37. Skowron, A.: Toward intelligent systems: Calculi of information granules. In: Hirano, S., Inuiguchi, M., Tsumoto, S. (eds.): Bulletin of the International Rough Set Society, Vol. 5, No. 1/2 (2001) 9-30
38. Skowron, A., Stepaniuk, J.: Generalized approximation spaces. In: Proceedings of the 3rd International Workshop on Rough Sets and Soft Computing, San Jose (1994) 156-163
39. Skowron, A., Stepaniuk, J.: Generalized approximation spaces. In: Lin, T.Y., Wildberger, A.M. (Eds.): Soft Computing. Simulation Councils, San Diego (1995) 18-21
40. Skowron, A., Stepaniuk, J.: Tolerance approximation spaces. Fundamenta Informaticae 27 (1996) 245-253
41. Skowron, A., Stepaniuk, J.: Information granules and approximation spaces. In: Proc. of the 7th Int. Conf. on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU'98), Paris (1998) 1354-1361
42. Skowron, A., Stepaniuk, J.: Information granules and rough neural computing. In: [18], 43-84
43. Skowron, A., Stepaniuk, J.: Information granules: Towards Foundations of Granular Computing. Int. Journal of Intelligent Systems 16 (2001) 57-85
44. Skowron, A., Stepaniuk, J., Peters, J.F.: Rough sets and infomorphisms: Towards approximation of relations in distributed environments. Fundamenta Informaticae, Vol. 54, Nos. 2, 3 (2003) 263-277
45. Skowron, A., Swiniarski, R.W.: Information granulation and pattern recognition. In: [18], 599-636
46. Stepaniuk, J.: Approximation spaces, reducts and representatives. In: [33], 109-126
47. Stone, P.: Layered Learning in Multiagent Systems. A Winning Approach to Robotic Soccer. The MIT Press, Cambridge, MA (2000)
48. Tinbergen, N.: On aims and methods of ethology. Zeitschrift für Tierpsychologie 20 (1963) 410-433
49. Tinbergen, N.: Social Behavior in Animals with Special Reference to Vertebrates. The Scientific Book Club, London (1953)
50. Watanabe, S.: Pattern Recognition: Human and Mechanical. Wiley, Toronto (1985)
Distributed Adaptive Control: Beyond Single-Instant, Discrete Control Variables

David H. Wolpert^1 and Stefan Bieniawski^2
^1 NASA Ames Research Center, USA, [email protected]
^2 Dept. of Aeronautics, Stanford University, USA, [email protected]

Summary. In extensive form noncooperative game theory, at each instant $t$, each agent $i$ sets its state $x_i$ independently of the other agents, by sampling an associated distribution, $q_i(x_i)$. The coupling between the agents arises in the joint evolution of those distributions. Distributed control problems can be cast the same way. In those problems the system designer sets aspects of the joint evolution of the distributions to try to optimize the goal for the overall system. Now information theory tells us what the separate $q_i$ of the agents are most likely to be if the system were to have a particular expected value of the objective function $G(x_1, x_2, \ldots)$. So one can view the job of the system designer as speeding an iterative process. Each step of that process starts with a specified value of $E(G)$, and the convergence of the $q_i$ to the most likely set of distributions consistent with that value. After this the target value for $E_q(G)$ is lowered, and then the process repeats. Previous work has elaborated many schemes for implementing this process when the underlying variables $x_i$ all have a finite number of possible values and $G$ does not extend to multiple instants in time. That work also is based on a fixed mapping from agents to control devices, so that the statistical independence of the agents' moves means independence of the device states. This paper extends that work to relax all of these restrictions. This extends the applicability of that work to include continuous spaces and Reinforcement Learning. This paper also elaborates how some of that earlier work can be viewed as a first-principles justification of evolution-based search algorithms.
3.1 Introduction

This paper considers the problem of adaptive distributed control [1, 2, 3]. There are several equivalent ways to mathematically represent such problems. In this paper the representation of extensive form noncooperative game theory is adopted [4, 5, 6, 7, 8]. In that representation, at each instant $t$ each control agent $i$ sets its state $x_i^t$ independently of the other agents, by sampling an associated distribution, $q_i^t(x_i^t)$. In this view the coupling between the agents does not arise directly, via statistical dependencies of the agents' states at the same time $t$. Rather it arises indirectly, through the stochastic joint evolution of their distributions $\{q_i^t\}$ across time.
More formally, let time be discrete, where at the beginning of each $t$ all control agents simultaneously and independently set their states ("make their moves") by sampling their associated distributions. After they do so, any remaining portions of the system (i.e., any stochastic part not being directly set by the control agents) respond to that joint move. Indicate the state of the entire system at time $t$ as $y^t$. ($y^t$ includes the joint move of the agents, $x^t$, as well as the state at $t$ of all stochastic elements not directly set by the agents.) So the joint distribution of the moves of the agents at any moment $t$ is given by the product distribution $q^t(x^t) = \prod_i q_i^t(x_i^t)$, and the state of the entire system, given joint move $x^t$, is governed by $P(y^t \mid x^t)$.

Now in general the observations by agent $i$ of aspects of the system's state at times previous to $t$ will determine $q_i^t$. In turn, those observations are determined by the previous states of the system. So $q_i^t$ is statistically dependent on the previous states of the entire system, $y^{\{t' < t\}}$. Accordingly, the system can be viewed as a multistage noncooperative game among the agents and Nature. Each agent plays mixed strategies $\{q_i^t\}$ at moment $t$, and Nature's move space at that time consists of those components of the vector $y^t$ not contained in $x^t$ [4, 5, 6, 7, 8]. The interdependence of the agents across time can be viewed as arising through information sets and the like, as usual in game theory.

For pedagogical simplicity, consider the problem of inducing an optimal state $y$ rather than the problem of inducing an optimal sequence of states. What the designer of the system can specify are the laws that govern how the joint mixed strategy $q^t$ gets updated from one stage of the game (i.e., one $t$) to the next. The goal is to specify such laws that will quickly lead to a good value of an overall objective function of the state of the system, $F(y)$.^3
Note that the agents work in the space of $x$'s; all aspects of the system not directly set by the agents, and in particular all noise processes, are implicitly contained in the distribution $P(y \mid x)$. Tautologically then, in distributed control the goal is to induce a joint strategy $q(x)$ with a good associated value of $E_q(F) = \int dx\, q(x) E(F \mid x)$. Defining the world utility $G(x) \triangleq \int dy\, F(y) P(y \mid x)$, we can re-express $E_q(F)$ purely in terms of $x$, as $\int dx\, q(x) G(x) = E_q(G)$.^4 Once such a $q$ is found, one can sample it to get a final $x$, and be assured that, on average, the associated $F$ value is low. In other words, such sampling is likely to give us a good value of our objective.

Previous work has elaborated several iterative schemes for updating product distributions $q$ to monotonically lower $E_q(G)$ [9, 10, 11, 12, 13, 14, 15]. In all of these schemes, each $q$ in the sequence is defined indirectly, as the minimizer of a different $G$-parameterized Lagrangian, $\mathcal{L}(q)$. Implementing such a sequence of Lagrangian-minimizing $q$'s results in the optimal control policy for the distributed system, i.e., in the $q$ minimizing $E_q(G)$. However, while one cannot directly solve for the $q$ minimizing $E_q(G)$ in a distributed manner, as elaborated below one can solve for the $q$ minimizing each $\mathcal{L}(q)$ in a distributed manner. In this way one can find the optimal distributed control policy using a purely distributed algorithm.

^3 Here we follow the convention that lower $F$ is better. In addition, for simplicity we only consider objectives that depend on the state of the system at a single instant; it is straightforward to relax this restriction.
^4 For simplicity, here we indicate integrals of any sort, including point sums for countable $X$, with the $\int$ symbol.

Many of these schemes are based on a steepest descent algorithm for each step of minimizing a Lagrangian $\mathcal{L}(q)$. Because the descent is over Euclidean vectors $q$, these algorithms can be applied whether the $x_i$ are categorical/symbolic, continuous, time-extended, or a mixture of the three. So in particular, they provide a principled way to do "gradient descent over categorical variables".

Not all previously considered algorithms for how to perform the Lagrangian-minimizing step are based on steepest descent. However they do have certain other characteristics in common. One is that the underlying variables $x_i$ all have a finite number of possible values. Another is that $G$ does not extend to multiple instants in time. This paper shows how to relax these restrictions, simply by redefining the spaces involved. This allows the previously considered algorithms to be used for continuous spaces, and also to implement Reinforcement Learning (RL) [16, 17, 18, 19].

A final shared characteristic is that all of the previously considered algorithms for minimizing the Lagrangians employ a fixed mapping from the moves of agents to the setting of control devices, so that the statistical independence of the agents' moves means independence of the device states. This paper also shows how that restriction can be relaxed, so that independent agents can result in coupled control devices.

The general mathematical framework for casting control and optimization problems in terms of minimizing Lagrangians of probability distributions is known as "Probability Collectives". The precise version where the probability distributions are product distributions is known as "Product Distribution" (PD) theory [9]. It has many deep connections to other fields, including bounded rational game theory and statistical physics [10].
As such it serves as a mathematical bridge connecting these disciplines. Some initial experimental results concerning the use of PD theory for distributed optimization and distributed control can be found in [24, 20, 21, 15, 22, 23].

The next section reviews the salient aspects of PD theory. The section does not consider any of the schemes for the Lagrangian-minimizing step of adaptive distributed control in great detail; the interested reader is directed to the literature. However it is shown in that section how those schemes provide a first-principles justification of certain types of evolution-based search algorithms. The following section presents two ways to cast PD theory for uncountably infinite $X$ as PD theory for countable $X$. This allows us to apply all the standard finite-$X$ algorithms even for uncountable $X$. Experimental tests validating one of those ways of recasting PD theory are presented in [25]. The following section shows how to recast single-instant PD theory to apply to the RL domain, in which $y$ is time-extended. That section considers both episodic and discounted sum RL. The final section considers varying the mapping from the moves of agents to the setting of control devices. Experimental tests validating the usefulness of such variations are presented in [15].
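The quantities introduced above, the world utility $G(x) = \int dy\, F(y) P(y \mid x)$ and the expected value $E_q(G)$ under a product distribution, can be illustrated for finite spaces with a minimal sketch. The toy system below (two binary agents, additive noise, quadratic cost) and the names `world_utility` and `expected_G` are assumptions made for illustration only; they do not come from the paper.

```python
from itertools import product

def world_utility(F, P_y_given_x, xs, ys):
    """G(x) = sum_y F(y) P(y | x): average the cost over everything the agents do not set."""
    return {x: sum(F(y) * P_y_given_x(y, x) for y in ys) for x in xs}

def expected_G(q, G):
    """E_q(G) = sum_x [prod_i q_i(x_i)] G(x) for a product distribution q (one dict per agent)."""
    total = 0.0
    for x in product(*(sorted(qi) for qi in q)):
        prob = 1.0
        for qi, xi in zip(q, x):
            prob *= qi[xi]
        total += prob * G[x]
    return total

# Hypothetical toy system: two agents with binary moves; y = x1 + x2 + noise,
# where the noise is 0 or 1 with equal probability.
xs = list(product([0, 1], repeat=2))
ys = [0, 1, 2, 3]
F = lambda y: (y - 1) ** 2                      # lower F is better
P = lambda y, x: 0.5 if y - sum(x) in (0, 1) else 0.0

G = world_utility(F, P, xs, ys)
q_uniform = [{0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}]
print(G)
print(expected_G(q_uniform, G))
```

Because all noise is absorbed into $G$, the agents only ever need to reason about their own moves $x$.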
3.2 Review of PD theory

Say the designer stipulates a particular desired value of $E(G)$, $\gamma$. For simplicity, consider the case where the designer makes no other claims concerning the system besides $\gamma$ and the fact that the joint strategy is a product distribution. Then information theory tells us that the a priori most likely $q$ consistent with that information is the one that maximizes entropy subject to that information [26, 27, 28].^5 In other words, of all distributions that agree with the designer's information, that distribution is the "easiest" one to induce by random search.

Given this, one can view the job of the designer of a distributed control system as an iterative equilibration process. In the first stage of each iteration the designer works to speed evolution of the joint strategy to the $q$ with maximal entropy subject to a particular value of $\gamma$. Once we have found such a solution we can replace the constraint (that is, replace the target value of $E(G)$) with a more difficult one, and then repeat the process, with another evolution of $q$ [9].

To formalize this, define the maxent Lagrangian by

$$\mathcal{L}(q) \triangleq \beta\,(E_q(G) - \gamma) - S(q) = \beta\left(\int dx\, q(x) G(x) - \gamma\right) - S(q), \qquad (3.1)$$

where $S(q)$ is the Shannon entropy of $q$, $-\int dx\, q(x) \ln\frac{q(x)}{\mu(x)}$, and for simplicity we here take the prior $\mu$ to be uniform.^6 Given $\gamma$, the associated most likely joint strategy is the $q$ that minimizes $\mathcal{L}(q)$ over all those $(q, \beta)$ such that the Lagrange parameter $\beta$ is at a critical point of $\mathcal{L}$, i.e., such that $\partial \mathcal{L} / \partial \beta = 0$. Solving, we find that the $q_i$ are related to each other via a set of coupled Boltzmann equations (one for each agent $i$),

$$q_i^\beta(x_i) \propto e^{-\beta E_{q^\beta_{(i)}}(G \mid x_i)}, \qquad (3.2)$$
where the overall proportionality constant for each $i$ is set by normalization, the subscript $q^\beta_{(i)}$ on the expectation value indicates that it is evaluated according to the distribution $\prod_{j \neq i} q_j$, and $\beta$ is set to enforce the condition $E_{q^\beta}(G) = \gamma$. Following Nash, we can use Brouwer's fixed point theorem to establish that for any fixed $\beta$, there must exist at least one solution to this set of simultaneous equations.

^5 In light of how limited the information is here, the algorithms presented below are best suited to "off the shelf" uses; incorporating more prior knowledge allows the algorithms to be tailored more tightly to a particular application.
^6 Throughout this paper the terms in any Lagrangian that restrict distributions to the unit simplices are implicit. The other constraint needed for a Euclidean vector to be a valid probability distribution is that none of its components are negative. This will not need to be explicitly enforced in the Lagrangian here.

In light of the foregoing, one natural choice for an algorithm that lowers $E_q(G)$ is the repeated iteration of the following step: Start with the $q^\beta$ matching a current $\gamma$ value, then lower $\gamma$ slightly, and end by modifying the old $q^\beta$ to find the one that matches the new $\gamma$. A difficulty with this iterative step is the need to solve for $\beta$ as a function of $\gamma$. However we can use a trick to circumvent this need. Typically if we evaluate $E(G)$ at the solutions $q^\beta$, we find that it is a declining function of $\beta$. So in following the iterative procedure of equilibrating and then lowering $\gamma$ we will raise $\beta$. Accordingly, we can avoid the repeated matching of $\beta$ to each successive constraint $E(G) = \gamma$, and simply monotonically increase $\beta$ instead. This allows us to avoid ever explicitly specifying the values of $\gamma$ [12].

An alternative interpretation of this iterative scheme is based on prior knowledge of the value of the entropy rather than of the expected $G$. Given this alternative prior knowledge, we can recast the designer's goal as finding the $q$ that is consistent with that knowledge and has minimal $E(G)$. This again leads to Eqs. 3.1 and 3.2. Now raising $\beta$ is cast as lowering the (never-specified) prior knowledge of the entropy value rather than the (never-specified) prior knowledge of $E(G)$.

Simulated annealing is an example of this approach, where rather than work directly with $q$, one works with random samples of it formed via the Metropolis random walk algorithm [29, 30, 31, 32]. There is no a priori reason to use such an inefficient means of manipulating $q$, however. In [12] for example one works with $q$ directly instead. This results in an algorithm that is not simply "probabilistic" in the sense that the updating of its variables is stochastic (as in simulated annealing). Rather the very entity being updated is a probability distribution.

Another advantage of casting the problem directly in terms of the maxent Lagrangian is that one can even avoid the need to explicitly stipulate an annealing schedule. In the usual way, first order methods can be used to find the saddle point of the Lagrangian, e.g., by performing steepest ascent of $\mathcal{L}$ in the Lagrange parameter $\beta$ while performing a descent in $q$.^7
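The procedure of equilibrating at a fixed $\beta$ and then raising $\beta$ can be sketched for a small finite game as follows. This is a minimal illustration, not the authors' implementation: it assumes that $G$ is known in closed form so that $E(G \mid x_i)$ can be summed exactly, it updates the agents one at a time, and the cost function, the $\beta$ schedule and the names `conditional_costs`, `boltzmann` and `anneal` are all hypothetical.

```python
import math
from itertools import product

def conditional_costs(G, q, i):
    """Return [E(G | x_i = j)] for each move j of agent i, averaging over the other agents."""
    sizes = [len(qk) for qk in q]
    cond = [0.0] * sizes[i]
    for x in product(*(range(n) for n in sizes)):
        weight = 1.0
        for k, xk in enumerate(x):
            if k != i:
                weight *= q[k][xk]
        cond[x[i]] += weight * G(x)
    return cond

def boltzmann(costs, beta):
    """q(j) proportional to exp(-beta * costs[j]) (Eq. 3.2), shifted by the minimum for stability."""
    low = min(costs)
    weights = [math.exp(-beta * (c - low)) for c in costs]
    norm = sum(weights)
    return [w / norm for w in weights]

def expected_cost(G, q):
    """E_q(G) under the product distribution q."""
    sizes = [len(qk) for qk in q]
    return sum(math.prod(q[k][xk] for k, xk in enumerate(x)) * G(x)
               for x in product(*(range(n) for n in sizes)))

def anneal(G, sizes, betas, sweeps=5):
    """Start from uniform q; for each beta, repeatedly reset each q_i to its Boltzmann form."""
    q = [[1.0 / n] * n for n in sizes]
    for beta in betas:              # monotonically increasing beta replaces lowering gamma
        for _ in range(sweeps):
            for i in range(len(q)):
                q[i] = boltzmann(conditional_costs(G, q, i), beta)
    return q

# Hypothetical cost over two agents with three moves each; its minimum is at x = (2, 1).
G = lambda x: (x[0] - 2) ** 2 + (x[1] - 1) ** 2 + 0.25 * x[0] * x[1]
q = anneal(G, sizes=[3, 3], betas=[0.5, 1.0, 2.0, 5.0, 10.0, 20.0])
print(q, expected_cost(G, q))
```

In this toy problem the distributions concentrate on the joint move with the lowest cost as $\beta$ grows; in general only a local minimum of the Lagrangian is guaranteed.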
In many situations one should use a modification of the maxent Lagrangian. Whenever one has extra prior knowledge about the problem domain, that should be used to modify the use of entropy as (in statistics terminology) a regularizer. This leads to Bayesian formulations [12]. Similarly, if one has constraints $\{f_i(x) = 0\}$, the Lagrangian has to be modified to account for them. The most naive way of doing this is to simply cast the constraints as Lagrange penalty terms $\{E(f_i) = 0\}$ and add those terms to the Lagrangian, in the usual way [12, 23].^8

^7 Formally, since the maxent Lagrangian is not convex, we have no guarantee that the duality gap is zero, and therefore no guarantee about saddle points. Nonetheless, just as in other domains, first order methods here seem to work well in practice.
^8 Note though that since the gradient of entropy is infinite at the border of the unit simplex, we are guaranteed that no component of $q$ will ever exactly equal 0, which typically means that the constraints $\{f_i(x) = 0\}$ will never be satisfied with probability exactly 1.

3.2.1 How to find minima of the Lagrangian

Consider the situation where each $x_i$ can take on a finite number of possible values, $|X_i|$, and we are interested in the unconstrained maxent Lagrangian. Say we are iteratively evolving $q$ to minimize $\mathcal{L}$ for some fixed $\beta$, and are currently at some point $q$ in $\mathcal{Q}$, the space of product distributions (i.e., in the Cartesian product of unit simplices). Using Lemma 1 of [12], we can evaluate the direction from $q$ within $\mathcal{Q}$ that, to first order, will result in the largest drop in the value of $\mathcal{L}(q)$:

$$\frac{\tilde{\partial} \mathcal{L}(q)}{\tilde{\partial} q_i(x_i = j)} = u_i(j) - \sum_{x_i'} \frac{u_i(x_i')}{|X_i|}, \qquad (3.3)$$
where $u_i(j) \triangleq \beta E(G \mid x_i = j) + \ln[q_i(j)]$, and the symbol $\tilde{\partial}$ indicates that we do not mean the indicated partial derivative, formally speaking, but rather the indicated component of the 1st-order descent vector.^9

Eq. 3.3 specifies the change that each agent should make to its distribution to have them jointly implement a step in steepest descent of the maxent Lagrangian. These updates are completely distributed, in the sense that each agent's update at time $t$ is independent of any other agent's update at that time. Typically at any $t$ each agent $i$ knows $q_i(t)$ exactly, and therefore knows $\ln[q_i(j)]$. However often it will not know $G$ and/or the $q_{(i)}$. In such cases it will not be able to evaluate the $E(G \mid x_i = j)$ terms in Eq. 3.3 in closed form.

One way to circumvent this problem is to have those expectation values be simultaneously estimated by all agents by repeated Monte Carlo sampling of $q$ to produce a set of $(x, G(x))$ pairs. Those pairs can then be used by each agent $i$ to estimate the values $E(G \mid x_i = j)$, and therefore how it should update its distribution. In the simplest version of this, an update to $q$ only occurs once every $L$ time-steps. In this scheme only the samples $(x, G(x))$ formed within a block of $L$ successive time-steps are used at the end of that block by the agents to update their distributions (according to Eq. 3.3).

There are numerous other schemes besides gradient descent for finding minima of the Lagrangian. One of these is a second order version of steepest descent, constrained to operate over $\mathcal{Q}$. This scheme, called "Nearest Newton" [12], starts by calculating the point $p \in \mathcal{P}$ that one should step to from the current distribution $q$, if one were to use Newton's method to descend the Lagrangian. Now in general that $p$ is not a product distribution. So we instead find $q'$, the $q \in \mathcal{Q}$ that is closest (as measured by Kullback-Leibler distance) to that point $p$; the step actually taken is to $q'$. This step from the current $q$ turns out to be identical to the gradient descent step, just with an extra multiplicative factor of $q_i(x_i = j)$ multiplying each associated component of that gradient descent step:

$$\Delta(q_i(x_i = j)) \propto q_i(j)\, u_i(j) - \sum_{x_i'} \frac{q_i(x_i')\, u_i(x_i')}{|X_i|}, \qquad (3.4)$$

where $u_i$ is as defined just below Eq. 3.3, and the proportionality constant is the step size.

^9 Formally speaking, the partial derivative is given by $u_i(j)$. Intuitively, the reason for subtracting $\sum_{x_i'} u_i(x_i')/|X_i|$ is to keep the distribution in the set of all possible probability distributions over $x$, $\mathcal{P}$.
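The block-wise Monte Carlo version of the gradient step of Eq. 3.3 can be sketched as below. Several choices here are assumptions of this sketch rather than prescriptions of the text: the step subtracts the quantity in Eq. 3.3 from $q_i$ (so that probability moves toward low-cost moves), unsampled moves fall back to the block average of $G$, components are floored at a small positive value and renormalized, and the names `estimate_conditionals`, `descent_step` and `run` are hypothetical.

```python
import math
import random

def estimate_conditionals(samples, sizes):
    """Estimate E(G | x_i = j) for every agent i and move j from (x, G(x)) pairs.

    Moves that never appear in the block fall back to the block average of G
    (an assumption of this sketch, not something the text prescribes).
    """
    overall = sum(g for _, g in samples) / len(samples)
    est = []
    for i, n in enumerate(sizes):
        totals, counts = [0.0] * n, [0] * n
        for x, g in samples:
            totals[x[i]] += g
            counts[x[i]] += 1
        est.append([totals[j] / counts[j] if counts[j] else overall for j in range(n)])
    return est

def descent_step(qi, cond_i, beta, step, floor=1e-9):
    """One agent's update along Eq. 3.3: q_i(j) -= step * (u_i(j) - mean over j' of u_i(j'))."""
    u = [beta * c + math.log(p) for c, p in zip(cond_i, qi)]
    mean_u = sum(u) / len(u)
    # Floor and renormalize so the result stays a strictly positive distribution (assumption).
    new = [max(p - step * (uj - mean_u), floor) for p, uj in zip(qi, u)]
    total = sum(new)
    return [p / total for p in new]

def run(G, sizes, beta=2.0, step=0.05, block=200, n_blocks=50, seed=0):
    """Alternate between sampling a block of joint moves from q and one descent step per agent."""
    rng = random.Random(seed)
    q = [[1.0 / n] * n for n in sizes]
    for _ in range(n_blocks):
        samples = []
        for _ in range(block):
            x = tuple(rng.choices(range(n), weights=qi)[0] for n, qi in zip(sizes, q))
            samples.append((x, G(x)))
        cond = estimate_conditionals(samples, sizes)
        # Each agent uses only its own q_i and its own conditional estimates.
        q = [descent_step(qi, ci, beta, step) for qi, ci in zip(q, cond)]
    return q

q = run(lambda x: (x[0] - 2) ** 2 + (x[1] - 1) ** 2, sizes=[3, 3])
print(q)
```

Note that each agent's update needs only the shared $(x, G(x))$ samples and its own distribution, which is what makes the scheme distributed.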
In the continuum time limit, this step rule reduces to the replicator equation of evolutionary game theory, only with an entropic term added in [11]. (Intuitively, that entropic term ensures the evolution explores sufficiently.) This connection can be viewed as a first-principles justification for (particular versions of) evolution-based search algorithms, e.g., genetic algorithms. To be precise, say we have a biological population of many "genes", each specifying a value $x$, and an associated "fitness function" $G(x)$. Have the frequency of each gene in the population be updated via the replicator dynamics, as usual in evolutionary game theory. We can justify this evolution-based search algorithm as the $\beta \to \infty$ limit of Nearest Newton for the case of a single agent with moves $x$. By allowing $\beta < \infty$, we can extend those evolution-based search algorithms in a principled manner.

A final example of a Lagrangian descent scheme, which is analogous to block relaxation, is "Brouwer updating" [11]. In that kind of updating one or more agents simultaneously jump to their optimal distribution, as given by Eq. 3.2 (with $\beta$ rather than $\gamma$ specified, as discussed above). It turns out that if the expectations defining the Brouwer updating are "exponentially aged" to reflect nonstationarity, then in the continuum time limit Brouwer updating becomes identical to Nearest Newton. The aging constant in Brouwer updating turns out to be identical to the step size in Nearest Newton.

All of the update schemes can be used so long as each agent $i$ knows or can estimate $q_i$ together with $E_{q_{(i)}}(G \mid x_i) = E(G \mid x_i)$ for all of its moves $x_i$. No other quantities are involved.
3.3 Semicoordinate transformations

3.3.1 Motivation

Consider a multi-stage game like chess, with the stages (i.e., the instants at which one of the players makes a move) delineated by $t$. In game theory, strategies are the if-then rules set by the players before play starts [4, 5, 6, 7, 8]. So in such a multi-stage game the strategy of player $i$, $x_i$, must be the set of $t$-indexed maps taking what that player has observed in the stages $t' < t$ into its move at stage $t$. Formally, this set of maps is called player $i$'s normal form strategy.

The joint strategy of the two players in chess sets their joint move-sequence, though in general the reverse need not be true. In addition, one can always find a joint strategy to result in any particular joint move-sequence. Now typically at any stage there is overlap in what the players have observed over the preceding stages. This means that even if the players' strategies are statistically independent, their move sequences are statistically coupled. In such a situation, by parameterizing the space $Z$ of joint-move-sequences $z$ with joint-strategies $x$, we shift our focus from the coupled distribution $P(z)$ to the decoupled product distribution, $q(x)$. This is the advantage of casting multi-stage games in terms of normal form strategies.

More generally, any onto mapping $\zeta : x \to z$, not necessarily invertible, is called a semicoordinate system. The identity mapping $z \to z$ is a trivial example of a semicoordinate system. Another example is the mapping from joint-strategies in a multi-stage game to joint move-sequences.
In other words, changing the representation space of a multi-stage game from move-sequences $z$ to strategies $x$ is a semicoordinate transformation of that game.

We can perform a semicoordinate transformation even in a single-stage game. Say we restrict attention to distributions over $X$ that are product distributions. Then changing $\zeta(\cdot)$ from the identity map to some other function means that the players' moves are no longer independent. After the transformation their move choices (the components of $z$) are statistically coupled, even though we are considering a product distribution. Formally, this is expressed via the standard rule for transforming probabilities,

$$P_Z(z \in Z) = \zeta(P_X) \triangleq \int dx\, P_X(x)\, \delta(z - \zeta(x)), \qquad (3.5)$$
where $P_X$ and $P_Z$ are the distributions across $X$ and $Z$, respectively. To see what this rule means geometrically, let $\mathcal{P}$ be the space of all distributions (product or otherwise) over $Z$. Recall that $\mathcal{Q}$ is the space of all product distributions over $X$, and let $\zeta(\mathcal{Q})$ be the image of $\mathcal{Q}$ in $\mathcal{P}$. Then by changing $\zeta(\cdot)$, we change that image; different choices of $\zeta(\cdot)$ will result in different manifolds $\zeta(\mathcal{Q})$.

As an example, say we have two players, with two possible moves each. So $z$ consists of the possible joint moves, labeled (1,1), (1,2), (2,1) and (2,2). Have $X = Z$, and choose $\zeta(1,1) = (1,1)$, $\zeta(1,2) = (2,2)$, $\zeta(2,1) = (2,1)$, and $\zeta(2,2) = (1,2)$. Say that $q$ is given by $q_1(x_1 = 1) = q_2(x_2 = 1) = 2/3$. Then the distribution over joint-moves $z$ is $P_Z(1,1) = P_X(1,1) = 4/9$, $P_Z(2,1) = P_Z(2,2) = 2/9$, $P_Z(1,2) = 1/9$. So $P_Z(z) \neq P_Z(z_1) P_Z(z_2)$; the moves of the players are statistically coupled, even though their strategies $x_i$ are independent.

Such coupling of the players' moves can be viewed as a manifestation of sets of potential binding contracts. To illustrate this, return to our two player example. Each possible value of a component $x_i$ determines a pair of possible joint moves. For example, setting $x_1 = 1$ means the possible joint moves are (1,1) and (2,2). Accordingly such a value of $x_i$ can be viewed as a set of proffered binding contracts. The value of the other components of $x$ determines which contract is accepted; it is the intersection of the proffered contracts offered by all the components of $x$ that determines what single contract is selected. Continuing with our example, given that $x_1 = 1$, whether the joint-move is (1,1) or (2,2) (the two options offered by $x_1$) is determined by the value of $x_2$.

To relate semicoordinates to distributed control we have to fix some notation.
To maintain consistency with the discussion of maxent Lagrangians, we will have product distributions $q(x \in X) \in \mathcal{Q}_X$. Also as before, to allow stochasticity, we write the ultimate space of interest as $Y$ with associated cost function $F(y)$. This means that $x$ sets $z$, which stochastically sets $y$:

$$E_q(F) = \int dy\, P(y) F(y) = \int dz\, P(z) G(z) = \int dx\, q(x) G(x), \qquad (3.6)$$

where

$$G(z) \triangleq \int dy\, F(y) P(y \mid z), \qquad P(z) = \int dx\, q(x)\, \delta(\zeta(x) - z); \qquad (3.7)$$

$$G(x) \triangleq \int dy\, F(y) P(y \mid x) = \int dy\, F(y) P(y \mid \zeta(x)). \qquad (3.8)$$
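The two-player example of Sec. 3.3.1 can be checked numerically by pushing the product distribution $q$ through $\zeta$ according to Eq. 3.5. The sketch below uses exact rational arithmetic; the helper names `push_forward` and `marginal` are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction
from itertools import product

# The semicoordinate map and product distribution from the two-player example.
zeta = {(1, 1): (1, 1), (1, 2): (2, 2), (2, 1): (2, 1), (2, 2): (1, 2)}
q1 = {1: Fraction(2, 3), 2: Fraction(1, 3)}
q2 = {1: Fraction(2, 3), 2: Fraction(1, 3)}

def push_forward(q_list, zeta):
    """Discrete form of Eq. 3.5: P_Z(z) = sum over x with zeta(x) = z of prod_i q_i(x_i)."""
    P_Z = {}
    for x in product(*(sorted(q) for q in q_list)):
        prob = Fraction(1)
        for q, xi in zip(q_list, x):
            prob *= q[xi]
        z = zeta[x]
        P_Z[z] = P_Z.get(z, Fraction(0)) + prob
    return P_Z

def marginal(P_Z, k):
    """Marginal distribution of the k-th component of z."""
    m = {}
    for z, p in P_Z.items():
        m[z[k]] = m.get(z[k], Fraction(0)) + p
    return m

P_Z = push_forward([q1, q2], zeta)
m1, m2 = marginal(P_Z, 0), marginal(P_Z, 1)
# The moves are coupled if P_Z fails to factor into its marginals for some z.
coupled = any(P_Z[z] != m1[z[0]] * m2[z[1]] for z in P_Z)
print(P_Z, coupled)
```

The marginals here are $P_Z(z_1 = 1) = 5/9$ and $P_Z(z_2 = 1) = 2/3$, whose product $10/27$ differs from $P_Z(1,1) = 4/9$, confirming the coupling.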
3.3.2 Representational properties

Binding contracts are a central component of cooperative game theory. In this sense, semicoordinate transformations can be viewed as a way to convert noncooperative game theory into a form of cooperative game theory. Indeed, any cooperative mixed strategy can be cast as a non-cooperative game mixed strategy followed by an appropriate semicoordinate transformation. Formally, any $P_Z$, no matter what the coupling among its components, can be expressed as $\zeta(P_X)$ for some product distribution $P_X$ and associated $\zeta(\cdot)$.^10

Less trivially, given any model class of distributions $\{P_Z\}$, there is an $X$ and associated $\zeta(\cdot)$ such that $\{P_Z\}$ is identical to $\zeta(\mathcal{Q}_X)$. Formally this is expressed in a result concerning Bayes nets. For simplicity, restrict attention to finite $Z$. Order the components of $Z$ from 1 to $N$. For each index $i \in \{1, 2, \ldots, N\}$, have the parent function $\mathcal{P}(i, z)$ fix a subset of the components of $z$ with index greater than $i$, returning the value of those components for the $z$ in its second argument if that subset of components is non-empty. So for example, with $N > 5$, we could have $\mathcal{P}(1, z) = (z_2, z_5)$. Another possibility is that $\mathcal{P}(1, z)$ is the empty set, independent of $z$.
Let $A(\mathcal{P})$ be the set of all probability distributions $P_Z$ that obey the conditional dependencies implied by $\mathcal{P}$: $\forall P_Z \in A(\mathcal{P}),\ z \in Z$,

$$P_Z(z) = \prod_{i=1}^{N} P_Z(z_i \mid \mathcal{P}(i, z)). \qquad (3.9)$$

(By definition, if $\mathcal{P}(i, z)$ is empty, $P_Z(z_i \mid \mathcal{P}(i, z))$ is just the $i$'th marginal of $P_Z$, $P_Z(z_i)$.) Note that any distribution $P_Z$ is a member of $A(\mathcal{P})$ for some $\mathcal{P}$; in the worst case, just choose the exhaustive parent function $\mathcal{P}(i, z) = \{z_j : j > i\}$. For any choice of $\mathcal{P}$ there is an associated set of distributions $\zeta(\mathcal{Q}_X)$ that equals $A(\mathcal{P})$ exactly:

Theorem 1: Define the components of $X$ using multiple indices: For all $i \in \{1, 2, \ldots, N\}$ and possible associated values (as one varies over $z \in Z$) of the vector $\mathcal{P}(i, z)$, there is a separate component of $x$, $x_{i;\mathcal{P}(i,z)}$. This component can take on any of the values that $z_i$ can. Define $\zeta(\cdot)$ recursively, starting at $i = N$ and working to lower $i$, by the following rule: $\forall i \in \{1, 2, \ldots, N\}$,

$$[\zeta(x)]_i = x_{i;\mathcal{P}(i, \zeta(x))}.$$

Then $A(\mathcal{P}) = \zeta(\mathcal{Q}_X)$.

^10 In the worst case, one can simply choose $X$ to have a single component, with $\zeta(\cdot)$ a bijection between that component and the vector $z$; trivially, any distribution over such an $X$ is a product distribution.

Proof: First note that by definition of parent functions, due to the fact that we're iteratively working down from higher $i$'s to lower ones, $\zeta(x)$ is properly defined. Next plug that definition into Eq. 3.5. For any particular $x$ and associated $z = \zeta(x)$, those components of $x$ that do not "match" $z$ by having their second index equal $\mathcal{P}(i, z)$ get integrated out. After this the integral reduces to

$$P_Z(z) = \prod_{i=1}^{N} P_X([x_{i;\mathcal{P}(i,z)}] = z_i),$$

i.e., is exactly of the form stipulated in Eq. 3.9. Accordingly, for any fixed $x$ and associated $z = \zeta(x)$, ranging over the set of all values between 0 and 1 for each of the distributions $P_X([x_{i;\mathcal{P}(i,z)}] = z_i)$ will result in ranging over all values for the distribution $P_Z(z)$ that are of the form stipulated in Eq. 3.9. This must be true for all $x$. Accordingly, $\zeta(\mathcal{Q}_X) \subseteq A(\mathcal{P})$. The proof that $A(\mathcal{P}) \subseteq \zeta(\mathcal{Q}_X)$ goes similarly: For any given $P_Z$ and $z$, simply set $P_X([x_{i;\mathcal{P}(i,z)}] = z_i)$ to $P_Z(z_i \mid \mathcal{P}(i, z))$ for all the independent components $x_{i;\mathcal{P}(i,z)}$ of $x$ and evaluate the integral in Eq. 3.5. QED.

Intuitively, each component of $x$ in Thm. 1 is the conditional distribution $P_Z(z_i \mid \mathcal{P}(i, z))$ for some particular instance of the vector $\mathcal{P}(i, z)$. Thm. 1 means that in principle we never need consider coupled distributions. It suffices to restrict attention to product distributions, so long as we use an appropriate semicoordinate system. In particular, mixture models over $Z$ can be represented this way.

3.3.3 Maxent Lagrangians over X rather than Z

While the distribution over $X$ uniquely sets the distribution over $Z$, the reverse is not true. However so long as our Lagrangian directly concerns the distribution over $X$ rather than the distribution over $Z$, by minimizing that Lagrangian we set a distribution over $Z$. In this way we can minimize a Lagrangian involving product distributions, even though the associated distribution in the ultimate space of interest is not a product distribution.

The Lagrangian we choose over $X$ should depend on our prior information, as usual. If we want that Lagrangian to include an expected value over $Z$ (e.g., of a cost function), we can directly incorporate that expectation value into the Lagrangian over $X$, since expected values in $X$ and $Z$ are identical: $\int dz\, P_Z(z) A(z) = \int dx\, P_X(x) A(\zeta(x))$ for any function $A(z)$. (Indeed, this is the standard justification of the rule for transforming probabilities, Eq. 3.5.)
However other functionals of probability distributions can differ between the two spaces. This is especially common when $\zeta(\cdot)$ is not invertible, so $X$ is larger than $Z$. In particular, while the expected cost term is the same in the $X$ and $Z$ maxent Lagrangians, this is not true of the two entropy terms in general; typically the entropy of a $q \in \mathcal{Q}$ will differ from that of its image, $\zeta(q) \in \zeta(\mathcal{Q})$, in such a case.
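A small numerical illustration of this point, under assumptions chosen purely for the example: two binary agents, a non-invertible map $\zeta(x) = x_1 \oplus x_2$ (so every $z$ has two preimages), an arbitrary test function $A(z)$, and entropies computed without the prior term. Expected values agree between the two spaces, while the entropies do not.

```python
import math
from itertools import product

def entropy(p):
    """Shannon entropy -sum p ln p of a finite distribution (prior term ignored)."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

# Hypothetical non-invertible semicoordinate map: z is the XOR of the two moves.
zeta = lambda x: x[0] ^ x[1]
q = [{0: 0.9, 1: 0.1}, {0: 0.9, 1: 0.1}]

# Product distribution over X and its image over Z (discrete form of Eq. 3.5).
P_X = {x: q[0][x[0]] * q[1][x[1]] for x in product([0, 1], repeat=2)}
P_Z = {}
for x, p in P_X.items():
    P_Z[zeta(x)] = P_Z.get(zeta(x), 0.0) + p

# Expected values of any function A agree between the two spaces.
A = lambda z: 3.0 * z + 1.0
E_Z = sum(p * A(z) for z, p in P_Z.items())
E_X = sum(p * A(zeta(x)) for x, p in P_X.items())

# The entropies do not agree.
S_X, S_Z = entropy(P_X), entropy(P_Z)
print(E_X, E_Z, S_X, S_Z)
```

In this example the entropy over $X$ exceeds the entropy over $Z$, because the distribution over $X$ also carries uncertainty about which preimage of $z$ occurred.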
More concretely, the fully formal definition of entropy includes a prior probability $\mu$: $S_X \equiv -\int dx\, p(x) \ln\left[\frac{p(x)}{\mu(x)}\right]$, and similarly for $S_Z$. So long as $\mu(x)$ and $\mu(z)$ are related by the normal laws for probability transformations, as are $p(x)$ and $p(z)$, then if the cardinalities of X and Z are the same, $S_Z = S_X$ ¹¹. When the cardinalities of the spaces differ though (e.g., when X and Z are both finite but with differing numbers of elements), this need no longer be the case. The following result bounds how much the entropies can differ in such a situation:

¹¹ For example, if $X = Z = \mathbb{R}$, then $\ln\left[\frac{p(z)}{\mu(z)}\right] = \ln\left[\frac{p(x)/J_\zeta(x)}{\mu(x)/J_\zeta(x)}\right] = \ln\left[\frac{p(x)}{\mu(x)}\right]$, where $J_\zeta(x)$ is the determinant of the Jacobian of $\zeta(.)$ evaluated at $x$. Accordingly, as far as transforming from X to Z is concerned, entropy is just a conventional expectation value, and therefore has the same value whichever of the two spaces it is evaluated in.

Theorem 2: For all $z \in Z$, take $\mu(x)$ to be uniform over all $x$ such that $\zeta(x) = z$. Then for any distribution $p(x)$ and its image $p(z)$,

$$-\int dz\, p(z) \ln(K(z)) \;\le\; S_X - S_Z \;\le\; 0,$$

where $K(z) \equiv \int dx\, \delta(z - \zeta(x))$. (Note that for finite X and Z, $K(z) \ge 1$, and counts the number of $x$ with the same image $z$.) If we ignore the $\mu$ terms in the definition of entropy, then instead we have

$$0 \;\le\; S_X - S_Z \;\le\; \int dz\, p(z) \ln(K(z)).$$

Proof: Write

$$S_X = -\int dz \int dx\, \delta(z - \zeta(x))\, p(x) \ln\left[\frac{p(x)}{\mu(x)}\right] = -\int dz \int dx\, \delta(z - \zeta(x))\, p(x) \ln\left[\frac{p(x)\, d(z)}{\mu(x)\, d(z)}\right]$$

$$= -\int dz\, p(z) \ln[d(z)] - \int dz \int dx\, \delta(z - \zeta(x))\, p(x) \ln\left[\frac{p(x)}{d(z)\,\mu(x)}\right],$$

where $d(z) \equiv \int dx\, \delta(z - \zeta(x)) \frac{p(x)}{\mu(x)}$. Define $\mu_z$ to be the common value of all $\mu(x)$ such that $\zeta(x) = z$. So $\mu(z) = \mu_z K(z)$ and $p(z) = \mu_z d(z)$. Accordingly, expand our expression as
David H. Wolpert and Stefan Bieniawski
$$S_X = -\int dz\, p(z) \ln\left[\frac{p(z)}{\mu(z)}\right] - \int dz\, p(z) \ln[K(z)] - \int dz \int dx\, \delta(z - \zeta(x))\, p(x) \ln\left[\frac{p(x)}{p(z)}\right]$$

$$= S_Z - \int dz\, p(z) \ln[K(z)] + \int dz\, p(z) \left( -\int dx\, \delta(z - \zeta(x))\, \frac{p(x)}{p(z)} \ln\left[\frac{p(x)}{p(z)}\right] \right).$$

The $x$-integral on the right-hand side of the last equation is just the entropy of the normalized distribution $p(x)/p(z)$ defined over those $x$ such that $\zeta(x) = z$. Its maximum and minimum are $\ln[K(z)]$ and 0, respectively. This proves the first claim. The second claim, where we "ignore the $\mu$ terms", is proven similarly. QED.

In such cases where the cardinalities of X and Z differ, we have to be careful about which space we use to formulate our Lagrangian. If we use the transformation $\zeta(.)$ as a tool to allow us to analyze bargaining games with binding contracts, then the direct space of interest is actually the $x$'s (that is the place in which the players make their bargaining moves). In such cases it makes sense to apply all the analysis of the preceding sections exactly as it is written, concerning Lagrangians and distributions over $x$ rather than $z$ (so long as we redefine cost functions to implicitly pre-apply the mapping $\zeta(.)$ to their arguments). However, if we instead use $\zeta(.)$ simply as a way of establishing statistical dependencies among the moves of the players, it may make sense to include the entropy correction factor in our $x$-space Lagrangian.

An important special case is where the following three conditions are met: each point $z$ is the image under $\zeta(.)$ of the same number of points in $x$-space, $n$; $\mu(x)$ is uniform (and therefore so is $\mu(z)$); and the Lagrangian in $x$-space, $\mathscr{L}_X$, is a sum of expected costs and the entropy. In this situation, consider a $z$-space Lagrangian, $\mathscr{L}_Z$, whose functional dependence on $P_Z$, the distribution over $z$'s, is identical to the dependence of $\mathscr{L}_X$ on $P_X$, except that the entropy term is divided by $n$ ¹². Now the minimizer $P^*(x)$ of $\mathscr{L}_X$ is a Boltzmann distribution in values of the cost function(s). Accordingly, for any $z$, $P^*(x)$ is uniform across all $n$ points $x \in \zeta^{-1}(z)$ (all such $x$ have the same cost value(s)). This in turn means that $S(\zeta(P_X)) = nS(P_Z)$.
So our two Lagrangians give the same solution, i.e., the "correction factor" for the entropy term is just multiplication by $n$.

3.3.4 Exploiting semicoordinate transformations

This subsection illustrates some ways to exploit semicoordinate transformations to facilitate descent of the Lagrangian. To illustrate the generality of the arguments, we consider situations where one has to use Monte Carlo estimates of conditional expectation values to descend the shared Lagrangian (rather than evaluate them in closed form).

¹² For example, if $\mathscr{L}_X(P_X) = \beta E_{P_X}(G(\zeta(.))) - S(P_X)$, then $\mathscr{L}_Z(P_Z) = \beta E_{P_Z}(G(.)) - S(P_Z)/n$, where $P_X$ and $P_Z$ are related as in Eq. 3.5.
Say we are currently at a local minimum $q \in Q$ of $\mathscr{L}$. Usually we can break out of that minimum by raising $\beta$ and then resuming the updating; typically changing $\beta$ changes $\mathscr{L}$ so that the Lagrange gaps are nonzero. So if we want to anneal $\beta$ anyway (e.g., to find a minimum of the shared cost function $G$), it makes sense to do so to break out of any local minima.

There are many other ways to break out of local minima without changing the Lagrangian (as we would if we changed $\beta$, for example) [12]. Here we show how to use semicoordinate transformations to do this. As explicated below, they also provide a general way to lower the value of the Lagrangian, whether or not one has local minimum problems.

Say our original semicoordinate system is $\zeta^1(.)$. Switch to a different semicoordinate system $\zeta^2(.)$ for Z and consider product distributions over the associated space $X^2$. Geometrically, the semicoordinate transformation means we change to a new submanifold $\zeta^2(Q_{X^2}) \subset \mathcal{P}$ without changing the underlying mapping from $p(z)$ to the value of the Lagrangian.

As a simple example, say $\zeta^2$ is identical to $\zeta^1$ except that it joins two components of X into an aggregate semicoordinate. Since after that change we can have statistical dependencies between those two components, the product distributions over $X^2$, $\zeta^2(Q_{X^2})$, map to a superset of $\zeta^1(Q_{X^1})$. Typically the local minima of that superset do not coincide with local minima of $\zeta^1(Q_{X^1})$. So this change to $X^2$ will indeed break out of the local minimum, in general.

More care is needed when working with more complicated semicoordinate transformations. Say before the transformation we are at a point $p^* \in \zeta^1(Q_{X^1})$. Then in general $p^*$ will not be in the new manifold $\zeta^2(Q_{X^2})$, i.e., $p^*$ will not correspond to a product distribution in our new semicoordinate system. (This reflects the fact that semicoordinate transformations couple the players.) Accordingly, we must change from $p^*$ to a new distribution when we change the semicoordinate system.
To illustrate this, say that the semicoordinate transformation is bijective. Formally, this means that $X^1 = X^2 \equiv X$ and $\zeta^2(x) = \zeta^1(\xi(x))$ for a bijective $\xi(.)$. Have $\xi(.)$, the mapping from $X^2$ to $X^1$, be the identity map for all but a few of the $M$ total components of X, indicated as indices $1 \to n$. Intuitively, for any fixed $x^2_{n+1 \to M} = x^1_{n+1 \to M}$, the effect of the semicoordinate transformation to $\zeta^2(.)$ from $\zeta^1(.)$ is merely to "shuffle" the associated mapping taking semicoordinates $1 \to n$ to Z, as specified by $\xi(.)$. Moreover, since $\xi(.)$ is a bijection, the maxent Lagrangians over $X^1$ and $X^2$ are identical: $\mathscr{L}_{X^1}(\xi(p^{X^2})) = \mathscr{L}_{X^2}(p^{X^2})$.

Now say we set $q^2_{n+1 \to M} = q^1_{n+1 \to M}$. This means we can estimate the expectations of $G$ conditioned on possible $x^2_{1 \to n}$ from the Monte Carlo samples conditioned on $\xi(x^2_{1 \to n})$. In particular, for any $\xi(.)$ we can estimate $E(G)$ as $\int dx^2_{1 \to n}\, p^2(x^2_{1 \to n})\, E(G \mid \xi(x^2_{1 \to n}))$ in the usual way. Now entropy is the sum of the entropy of semicoordinates $n+1 \to M$ plus that of semicoordinates $1 \to n$. So for any choice of $\xi(.)$ and $q^2_{1 \to n}$, we can approximate $\mathscr{L}_{X^2}$ as (our associated estimate of) $E(G)$ minus the entropy of $p^2_{1 \to n}$, minus a constant unaffected by the choice of $\xi(.)$.
So for finite and small enough cardinality of the subspace $|X_{1 \to n}|$, we can use our estimates $E(G \mid \xi(x^2_{1 \to n}))$ to search for the "shuffling" $\xi(.)$ and distribution $q^2_{1 \to n}$ that minimizes $\mathscr{L}_{X^2}$ ¹³. In particular, say we have descended $\mathscr{L}_{X^1}$ to a distribution $q^1(x) = q^*(x)$. Then we can set $q^2 = q^*$, and consider a set of "shuffling" $\xi(.)$. Each such $\xi(.)$ will result in a different distribution $q^1(x) = q^2(\xi^{-1}(x)) = q^*(\xi^{-1}(x))$. While those distributions will have the same entropy, typically they will have different (estimates of) $E(G)$ and accordingly different local minima of the Lagrangian. Accordingly, searching across the $\xi(.)$ can be used to break out of a local minimum.

However, since $E(G)$ changes under such transformations even if we are not at a local minimum, we can instead search across $\xi(.)$ as a new way (in addition to those discussed above) of lowering the value of the Lagrangian. Indeed, there is always a bijective semicoordinate transformation that reduces the Lagrangian (or leaves it unchanged if the ordering already holds): simply choose $\xi(.)$ to rearrange the $G(x)$ so that the lowest costs receive the most probability, i.e., $G(x) < G(x') \Rightarrow q(x) \ge q(x')$. In addition one can search for that $\xi(.)$ in a distributed fashion, where one after the other each agent $i$ rearranges its semicoordinate to shrink $E(G)$.

Furthermore, to search over semicoordinate systems we don't need to take any additional samples of $G$. (The existing samples can be used to estimate the $E(G)$ for each new system.) So the search can be done off-line.

To determine the semicoordinate transformation we can consider other factors besides the change in the value of the Lagrangian that immediately arises under the transformation. We can also estimate the amount that subsequent evolution under the new semicoordinate system will decrease the Lagrangian. We can estimate that subsequent drop in a number of ways: the sum of the Lagrangian gaps of all the agents, the gradient of the Lagrangian in the new semicoordinate system, etc.
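The rearrangement argument can be illustrated on a small flattened block of semicoordinates. This is a minimal sketch under stated assumptions: `q_star` and `g_hat` are made-up values standing for the current distribution over the shuffled block and the Monte Carlo estimates of the conditional expected cost, and `sorting_shuffle` is a hypothetical helper that pairs the largest probabilities with the smallest costs (G is treated as a cost to be minimized). Whether the shuffled distribution is still a product over the individual components of the block is not addressed here.

```python
import math


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def expected_cost(q, g):
    """E(G) for a distribution q and per-point cost estimates g."""
    return sum(qi * gi for qi, gi in zip(q, g))


def sorting_shuffle(q, g):
    """Return xi as a list with xi[x_old] = x_new, pairing large q with small g."""
    by_mass = sorted(range(len(q)), key=lambda x: -q[x])  # most probable first
    by_cost = sorted(range(len(g)), key=lambda x: g[x])   # cheapest first
    xi = [0] * len(q)
    for src, dst in zip(by_mass, by_cost):
        xi[src] = dst
    return xi


def apply_shuffle(q, xi):
    """Shuffled distribution q2 with q2[xi[x]] = q[x] (a pure relabeling)."""
    q2 = [0.0] * len(q)
    for x, target in enumerate(xi):
        q2[target] = q[x]
    return q2


# Illustrative values only.
q_star = [0.10, 0.50, 0.15, 0.25]
g_hat = [1.0, 4.0, 2.0, 3.0]

xi = sorting_shuffle(q_star, g_hat)
q_shuffled = apply_shuffle(q_star, xi)

print("E(G) before:", expected_cost(q_star, g_hat))
print("E(G) after: ", expected_cost(q_shuffled, g_hat))
print("entropy unchanged:", math.isclose(entropy(q_star), entropy(q_shuffled)))
```

Because the shuffle only relabels points, the entropy term is untouched and the whole change in the Lagrangian comes from the expected-cost term. No new samples of G are needed, which is why the search can run off-line.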
3.3.5 Distributions over semicoordinate systems

The straightforward way to implement these kinds of schemes for finding a good semicoordinate system is via exhaustive search, hill-climbing, simulated annealing, or the like. Potentially it would be very useful to instead find a new semicoordinate system using search techniques designed for continuous spaces. When there are a finite number of semicoordinate systems (i.e., finite X and Z) this would amount to using search techniques for continuous spaces to optimize a function of a variable having a finite number of values. However, we now know how to do that: use PD theory. In the current context, this means placing a product probability distribution over a set of variables parameterizing the semicoordinate system, and then evolving the probability distribution. More concretely, write
¹³ Penalizing by the bias² plus variance expression if we intend to do more Monte Carlo; see [9].
$$\mathscr{L}(q) = \beta \sum_{\theta} \sum_{x} \prod_{i=1}^{N} q_i(x_i)\, P(\theta)\, G(\zeta(x, \theta)) - S(q),$$

where $\theta$ is a parameter on the semicoordinate system. We can rewrite this using an additional semicoordinate transformation, as

$$\mathscr{L}(q^*) = \beta \sum_{x^*} \prod_{i=1}^{N+1} q^*_i(x^*_i)\, G(\zeta(x^*)) - S(q^*), \qquad (3.11)$$

where $x^*_i = x_i$ for all $i$ up to $N$, and $x^*_{N+1} = \theta$. (As usual, depending on what space we cast our Lagrangian in, the entropy can either have the argument of the entropy term starred, as here, or not.)

Intuitively, this approach amounts to introducing a new coordinate/agent, whose "job" is to set the semicoordinate system governing the mapping from the other agents to a $z$ value. This provides an alternative to periodically (e.g., at a local minimum) picking a set of alternative semicoordinate systems and estimating which gives the biggest drop in the overall Lagrangian. We can instead use Nearest Newton, Brouwer updating, or what have you, to continuously search for the optimal coordinate system as we also search for the optimal $x$. The tradeoff, of course, is that by introducing an extra coordinate/agent, we raise the noise level all the original semicoordinates experience. (This raises the issue of what parameterization of $\zeta(.)$ is best to use, an issue not addressed here.)
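The equivalence between averaging over θ and treating θ as one more agent can be checked by enumeration. The sketch below is only illustrative: the two-member family of maps `zeta`, the cost `G`, and all probability values are hypothetical. It computes the expected cost both ways and also shows that the entropy of the extended product distribution is the entropy of the original agents plus that of the distribution over θ, which is why it matters whether the entropy term is "starred".

```python
import itertools
import math


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def zeta(x, theta):
    """Hypothetical family of maps from X = {0,1}^2 to Z, indexed by theta."""
    if theta == 0:
        return (x[0], x[1])
    return (x[0], x[0] ^ x[1])  # theta == 1 couples the second coordinate to the first


def G(z):
    """Hypothetical cost over Z."""
    return 1.0 + 2.0 * z[0] + 3.0 * z[1]


q = [[0.6, 0.4], [0.2, 0.8]]  # q_1, q_2 for the N = 2 original agents
p_theta = [0.3, 0.7]          # distribution over semicoordinate systems

# Form 1: sum over theta outside, product distribution over x inside.
e_nested = sum(
    p_theta[t] * sum(q[0][x[0]] * q[1][x[1]] * G(zeta(x, t))
                     for x in itertools.product((0, 1), repeat=2))
    for t in (0, 1)
)

# Form 2: theta is component N+1 of x*, and q* is a product over N+1 factors.
q_star = q + [p_theta]
e_flat = 0.0
joint = []
for x_star in itertools.product((0, 1), repeat=3):
    weight = math.prod(q_star[i][x_star[i]] for i in range(3))
    joint.append(weight)
    e_flat += weight * G(zeta(x_star[:2], x_star[2]))

s_q = sum(entropy(q_i) for q_i in q)  # entropy of the original product distribution
s_q_star = entropy(joint)             # entropy of the extended product distribution

print(f"E(G): nested = {e_nested:.6f}, extra agent = {e_flat:.6f}")
print(f"S(q) = {s_q:.6f}, S(q*) = {s_q_star:.6f}")
```

Once θ is an ordinary component, any per-agent update rule can be applied to its factor exactly as it is to the others.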
3.4 PD theory for uncountable Z

In almost all computational algorithms for finding minima, and in particular in the algorithms considered above, we can only modify a finite set of real numbers from one step to the next. When Z is finite, we accommodate this by having the real numbers be the values of the components of the $q_i$. But how can we use a computational algorithm to find a minimum of the maxent Lagrangian when Z is uncountable?

One obvious idea is to have the real numbers our algorithm works with parameterize $p$ differently from how they do with product distributions. For example, rather than product distributions, we could use distributions that are mixture models. In that case the real numbers are the parameters of the mixture model; our algorithm would minimize the value of the Lagrangian over the values of the parameters of the mixture model.

An alternative set of approaches still uses product distributions, with all of their advantages, but employs a special type of semicoordinate system for Z. For pedagogical simplicity, say that Z is the reals between 0 and 1. So $\zeta$ must be a semicoordinate system for the reals, i.e., each $x \in X$ must map to a single $z \in Z$. Now we want to have those of the $q_i$ that we're modifying be probability distributions, not probability density functions (pdf's), so that our computational algorithm can work with them.
Accordingly, in our minimization of the Lagrangian we do not directly modify coordinates that can take on an uncountable number of values (generically indicated with superscript 2), but only coordinates that take on a finite number of values (generically indicated with superscript 1). We illustrate this for the minimization schemes considered in the preceding sections. For generality, we consider the case where Monte Carlo sampling must be used to estimate the values of $E(G \mid x^1)$ arising in those schemes.

Accordingly, we need two things. The first is a way to sample $q$ to get a $z$, which then provides a $G$ value. The second is a way to estimate the quantities $E(G \mid x^1)$ based upon such empirical data. Given those, all the usual algorithms for searching $q^1$ to minimize the Lagrangian hold; intuitively, we treat the $q^2$ like stochastic processes that reside in Z but not X, and are therefore not directly controllable by us.

3.4.1 Riemann semicoordinates

In the Riemann semicoordinate system, $x^1$ can take values $0, 1, \ldots, B-1$, and $x^2$ is the reals between 0 and 1. Then with $a_i \equiv i/B$, we have

$$z = a_{x^1} + x^2/B = a_{x^1} + x^2 (a_{x^1+1} - a_{x^1}). \qquad (3.12)$$

We then fix $q^2(x^2)$ to be uniform. So all our minimization scheme can modify are the $B$ values of $q^1(x^1)$.

To sample $q$, we simply sample $q^1$ to get a value of $x^1$ and $q^2$ to get a value of $x^2$. Plugging those two values into Eq. 3.12 gives us a value of $z$. We then evaluate the associated value of the world utility; this provides a single step in our Monte Carlo sampling process.

Next we need a way to use a set of such Monte Carlo sample points to estimate $E(G \mid x^1)$ for all $x^1$. We could do this with simple histogram averaging, using Laplace's law of succession to deal with bins ($x^1$ values) that aren't in the data. Typically though with continuous Z we expect $F(z)$ to be smooth. In such cases, it makes sense to allow data from more than one bin to be combined to estimate $E(G \mid x^1)$ for each $x^1$, by using a regression scheme. For example, we could use the weighted average regression

$$F(z) = \frac{\sum_i F_i\, e^{-(z - z_i)^2 / 2\sigma^2}}{\sum_i e^{-(z - z_i)^2 / 2\sigma^2}}, \qquad (3.13)$$

where $\sigma$ is a free parameter, $z_i$ is the $i$'th value of $z$ out of our Monte Carlo samples, and $F_i$ is the associated $i$'th value of $F$. Given such a fit, we would then estimate

$$E(G \mid x^1) = \int dx^2\, q^2(x^2)\, F(\zeta(x^1, x^2)) = \int dx^2\, q^2(x^2)\, F(a_{x^1} + x^2/B). \qquad (3.14)$$

This integral can then be evaluated numerically.
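The sampling and estimation steps of the Riemann scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: the world utility `world_utility`, the distribution `q1`, the kernel width `sigma`, and the sample count are all made up. The empty-bin fallback to the global mean is an assumption used here in place of Laplace's law of succession, and the integral over each bin is done by a simple midpoint rule.

```python
import math
import random


def world_utility(z):
    """Hypothetical smooth world utility on [0, 1]."""
    return (z - 0.3) ** 2


def sample_z(q1, rng):
    """Draw x1 from q1 and x2 uniformly, then map to z with a_i = i/B."""
    B = len(q1)
    x1 = rng.choices(range(B), weights=q1)[0]
    x2 = rng.random()
    return x1, (x1 + x2) / B


def histogram_estimate(samples, B):
    """Per-bin average of G; empty bins fall back to the global mean (an assumption)."""
    sums, counts = [0.0] * B, [0] * B
    for x1, _, g in samples:
        sums[x1] += g
        counts[x1] += 1
    global_mean = sum(g for _, _, g in samples) / len(samples)
    return [sums[b] / counts[b] if counts[b] else global_mean for b in range(B)]


def kernel_fit(samples, sigma):
    """Gaussian-weighted average of the sampled utility values."""
    def F(z):
        weights = [math.exp(-((z - zi) ** 2) / (2.0 * sigma ** 2)) for _, zi, _ in samples]
        return sum(w * g for w, (_, _, g) in zip(weights, samples)) / sum(weights)
    return F


def regression_estimate(F, B, grid=20):
    """Midpoint-rule integral of the fit F over each bin, with q2 uniform."""
    return [sum(F((b + (k + 0.5) / grid) / B) for k in range(grid)) / grid
            for b in range(B)]


rng = random.Random(0)
q1 = [0.1, 0.2, 0.3, 0.4]  # illustrative distribution over the B = 4 bins
samples = []
for _ in range(500):
    x1, z = sample_z(q1, rng)
    samples.append((x1, z, world_utility(z)))

hist = histogram_estimate(samples, len(q1))
smooth = regression_estimate(kernel_fit(samples, sigma=0.05), len(q1))
print("histogram estimates: ", [round(v, 4) for v in hist])
print("regression estimates:", [round(v, 4) for v in smooth])
```

The regression estimate borrows data from neighbouring bins, which helps most for bins that `q1` rarely visits; the price is the smoothness assumption built into the choice of `sigma`.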
Typically in practice one would use a trapezoidal semicoordinate system, rather than the rectangular one illustrated here. Doing that introduces linear terms in the integrals, but those can still be evaluated as before.

3.4.2 Lebesgue semicoordinates

The Lebesgue semicoordinate system generalizes the Riemann system, by parameterizing it. It does this by defining a set of increasing values $\{a_0, a_1, \ldots, a_B\}$ that all lie between 0 and 1 such that $a_0 = 0$ and $a_B = 1$. We then write

$$z = a_{x^1} + x^2 (a_{x^1+1} - a_{x^1}). \qquad (3.15)$$

Sampling with this scheme is done in the obvious way. The expected value of $G$ if $q^2$ is uniform (i.e., all $x^2$ are equally probable) is

$$E(G) = \sum_{x^1} q^1(x^1) \int dx^2\, q^2(x^2)\, F[a_{x^1} + x^2 (a_{x^1+1} - a_{x^1})] = \sum_{x^1} q^1(x^1) \int_{a_{x^1}}^{a_{x^1+1}} dz\, \frac{F(z)}{a_{x^1+1} - a_{x^1}}, \qquad (3.16)$$

and similarly for $E(G \mid x^1)$. When the $a_i$ are evenly spaced, the Lebesgue system just reduces to the Riemann system, of course.

Note that for a given value of $x^1$, we have probability mass 1 in the bin following $a_{x^1}$. So $q^1(x^1)$ sets the cumulative probability mass in that bin. Changing the parameters $a_i$ will change what portion of the real line we assign to that mass, but it won't change the mass.

This may directly affect the Lagrangian we use, depending on whether it's the X-space Lagrangian or the Z-space one. In the Riemann semicoordinate system, $S_X \propto S_Z$, and both Lagrangians are the same (just with a rescaled Lagrange parameter). However, in the Lebesgue system, if the $a_i$ are not evenly spaced, those two entropies are not proportional to one another. Accordingly, in that scenario, one has to make a reasoned decision about which maxent Lagrangian to use.

The $\{a_i\}$ are a finite set of real numbers, just like $q^1$. Accordingly, we can incorporate them along with $q^1$ into the argument of the maxent Lagrangian, and search for the Lagrangian-minimizing set $\{a_i\}$ and $q^1$ together ¹⁴. In fact, one can even have $q^1$ fixed, along with $q^2$, and only vary the $\{a_i\}$. The difference between such a search over the $\{a_i\}$ when $q^1$ is fixed, and a search over $q^1$ when the $\{a_i\}$ are fixed, is directly analogous to the difference between Riemann and Lebesgue integration, in how the underlying distribution $P(z)$ is being represented.

Whether or not $q^1$ is also varied, one must be careful in how one does the search for each $a_i$. Unlike each $q_i$, each $a_i$ does not arise as a multilinear product, and therefore appears more than once in the Lagrangian.

¹⁴ Compare this to the scheme discussed previously for searching directly over semicoordinate transformations, where here the search is over probability distributions defined on the set of possible semicoordinate transformations.

For example, any particular
$a_i$ term arises in Eq. 3.16 twice as a limit of an integral, and twice in an integrand. All four instances must be accounted for in differentiating the $E(G)$ term in the Lagrangian with respect to that $a_i$ term.

3.4.3 Decimal Riemann semicoordinates

In the standard Riemann semicoordinate system, we use only one agent to decide which bin $x^1$ falls into. To have good precision in making that decision, there must be many such bins. This often means that there are few Monte Carlo samples in most bins. This is why we need to employ a regression scheme (with its attendant implicit smoothness assumptions) to estimate $E(G \mid x^1)$.

An alternative is to break $x^1$ into a set of many agents, through a hierarchical decimal-type representation. For example, say $x^1$ can take on $2^K$ values. Then under a binary representation, we would specify the bin by

$$x^1 = \sum_{i=1}^{K} x^1_i\, 2^{i-1}, \qquad (3.17)$$

where $x^1_i$ is the bit specifying agent $i$'s value. With this change, updating the Lagrangian is done by $K$ agents, with each agent $i$ estimating $E(G \mid x^1_i)$ for two values of $x^1_i$, rather than by a single agent estimating $E(G \mid x^1)$ for all $2^K$ values of $x^1$.

With this system, each agent performs its estimations by looking at those Monte Carlo samples where $z$ fell within one particular subregion covering half of [0.0, 1.0]. So long as the samples weren't generated from too peaked a distribution (e.g., early in the search process), there will typically be many such samples, no matter what bit $i$ and associated bit value $x^1_i$ we are considering. Accordingly, we do not need to perform a regression to estimate $E(G \mid x^1_i)$ to run our Lagrangian minimization algorithms ¹⁵.

When $q$ is peaked, some of the bin counts from the Monte Carlo data may be small. We can use regression as above, if desired, for such impoverished bins. Alternatively, we can employ a Lebesgue-type scheme to update the bin borders, to ensure that all $x^1_i$ occur often in the Monte Carlo data.
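The binary decomposition can be sketched in a few lines. This is an illustration under assumptions: the bit ordering in `bits_to_bin` (least significant bit first) is one reading of Eq. 3.17, and the world utility, K, the per-bit distributions, and the sample count are all hypothetical. Each of the K agents keeps only two running averages, one per bit value.

```python
import random


def bits_to_bin(bits):
    """Bin index x1 = sum_i bits[i-1] * 2**(i-1); bits[0] is the least significant bit."""
    return sum(b << i for i, b in enumerate(bits))


def sample_point(q_bits, rng):
    """Each of the K agents draws its own bit; x2 is uniform within the chosen bin."""
    bits = [1 if rng.random() < q_i[1] else 0 for q_i in q_bits]
    z = (bits_to_bin(bits) + rng.random()) / (2 ** len(bits))
    return bits, z


def per_bit_estimates(samples, K):
    """Average G separately for each agent i and each of its two bit values."""
    sums = [[0.0, 0.0] for _ in range(K)]
    counts = [[0, 0] for _ in range(K)]
    for bits, g in samples:
        for i, b in enumerate(bits):
            sums[i][b] += g
            counts[i][b] += 1
    means = [[sums[i][b] / counts[i][b] if counts[i][b] else float("nan")
              for b in (0, 1)] for i in range(K)]
    return means, counts


def world_utility(z):
    """Hypothetical smooth world utility on [0, 1]."""
    return (z - 0.3) ** 2


rng = random.Random(0)
K = 5                                    # 2**5 = 32 bins
q_bits = [[0.5, 0.5] for _ in range(K)]  # unpeaked product distribution over the bits
samples = []
for _ in range(400):
    bits, z = sample_point(q_bits, rng)
    samples.append((bits, world_utility(z)))

means, counts = per_bit_estimates(samples, K)
for i in range(K):
    print(f"agent {i + 1}: counts = {counts[i]}, E(G | bit) = {means[i]}")
```

With 400 samples spread over 32 bins, a single-agent histogram would see only about a dozen samples per bin on average, whereas every sample contributes to one of the two averages of every bit agent.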
3.5 PD theory for Reinforcement Learning

In this section we show how to use semicoordinate transformations and PD theory for a single RL algorithm playing against nature. The underlying idea is to "fracture" the single learner across multiple timesteps into a set of separate agents, one for each timestep. This gives us a distributed system. Constraints are then used to couple those agents.
¹⁵ As usual, we could have the entropy term in the Lagrangian be based on either X space or Z space.
3.5.1 Episodic RL

First consider episodic RL, in which reward comes only at the end of an episode of $T$ timesteps. The learner chooses actions in response to observations of the state of the system via a policy. It does this across a set of several episodes, modifying its policy as it goes to try to maximize reward. The goal is to have it find the optimal such policy as quickly as possible.

To make this concrete, use superscripts to indicate timestep in an episode. So $z = (z^0, z^1, z^2, \ldots, z^T) = \zeta(x)$. If we assume the dynamics is Markovian, $P(z) = P(z^0) P(z^1 \mid z^0) P(z^2 \mid z^1) \cdots P(z^T \mid z^{T-1})$. Typically the objective function $G$ depends solely on $z^T$. For the conventional RL scenario, each $z^t$ can be expressed as $(s^t, a^t)$, where $s^t$ is the state of the system at $t$, and $a^t$ is the action taken then. As an example, say the learner doesn't take into account its previous actions when deciding its current one, and that it observes the state of the system (at the previous instant) with zero error. Then

$$P(z^t \mid z^{t-1}) = P(s^t, a^t \mid s^{t-1}, a^{t-1}) = P(a^t \mid s^{t-1})\, P(s^t \mid s^{t-1}, a^{t-1}). \qquad (3.18)$$

Have $\zeta(.)$ give us a representation of each of the conditional distributions in the usual way using semicoordinates (see Thm. 1). So X is far larger than Z, and we can write $P(z)$ with abbreviated notation as

$$P(s^0, a^0, \ldots, s^T, a^T) = P(a^0) P(s^0) \prod_{t \ge 1} P(a^t \mid s^{t-1})\, P(s^t \mid s^{t-1}, a^{t-1}) = q_{A^0}(a^0)\, q_{S^0}(s^0) \prod_{t \ge 1} q_{t,s^{t-1}}(a^t)\, q_{t,s^{t-1},a^{t-1}}(s^t). \qquad (3.19)$$
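The factorization of an episode into per-timestep factors can be made concrete with a tiny enumeration. This is a sketch under stated assumptions: two states, two actions, timesteps 0 through T = 2, and made-up tables `q_a0`, `q_s0`, `policy`, and `dynamics`. For brevity the policy table is taken to be the same at every timestep. The policy factors play the role of the controllable distributions and the dynamics factors are fixed.

```python
import itertools

STATES = (0, 1)
ACTIONS = (0, 1)
T = 2  # timesteps run 0..T

q_a0 = [0.5, 0.5]  # distribution of the initial action
q_s0 = [0.9, 0.1]  # distribution of the initial state
# policy[s_prev][a]: controllable factor, probability of action a given the previous state
policy = [[0.8, 0.2], [0.3, 0.7]]
# dynamics[s_prev][a_prev][s]: fixed factor, probability of state s given previous state and action
dynamics = [[[0.9, 0.1], [0.4, 0.6]], [[0.2, 0.8], [0.5, 0.5]]]


def trajectory_probability(states, actions):
    """Probability of one episode as a product of initial, policy, and dynamics factors."""
    prob = q_a0[actions[0]] * q_s0[states[0]]
    for t in range(1, len(states)):
        prob *= policy[states[t - 1]][actions[t]]
        prob *= dynamics[states[t - 1]][actions[t - 1]][states[t]]
    return prob


def expected_value(G):
    """Expectation of G(states, actions) by brute-force enumeration of all episodes."""
    total = 0.0
    for states in itertools.product(STATES, repeat=T + 1):
        for actions in itertools.product(ACTIONS, repeat=T + 1):
            total += trajectory_probability(states, actions) * G(states, actions)
    return total


total_mass = expected_value(lambda s, a: 1.0)
expected_final_state = expected_value(lambda s, a: float(s[-1]))
print(f"total probability = {total_mass:.12f}")
print(f"E[final state] = {expected_final_state:.6f}")
```

Enumeration is only feasible for toy sizes; in practice the conditional expectations of G given a policy entry would be estimated from sampled episodes, as in the Monte Carlo schemes discussed earlier.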
In RL we typically can only control the $q_{t,s^{t-1}}$ distributions. While the other $q_i$ go into the Lagrangian, they are fixed and not directly varied. If it is desired to have the policy of the learner be constant across each episode, we can add penalty terms ¹⁶, giving the Lagrangian

$$\mathscr{L}(q) = \beta \sum_{a,s} q_{A^0}(a^0)\, q_{S^0}(s^0) \prod_{t \ge 1} q_{t,s^{t-1}}(a^t)\, q_{t,s^{t-1},a^{t-1}}(s^t)\, G(s) \;-\; S(q_{S^0}) \;-\; \sum_{t \ge 1} S(q_{t,s}) \;+\; \sum_{t > 1,\, s,\, a} \lambda_{t,s,a} \left[ q_{t,s}(a) - q_{t-1,s}(a) \right], \qquad (3.20)$$

¹⁶ Note that unlike constraints over X, those over Q are not generically true only to some high probability, but rather can typically hold with probability 1.
where $s$ and $a$ indicate the vectors of all $s^t$ and all $a^t$, respectively, and the entropy function $S(.)$ should not be confused with the subscript $S^0$ on $q$ (which indicates the component of $q$ referring to the time-0 value of the state variable) ¹⁷.

We can then use any of the standard techniques for descending this Lagrangian. So for example say we use Nearest Newton. Then at the end of each episode, for each $t > 1$, $s$, $a$, we should increase $q_{t,s}(a)$ by

$$\alpha \left[ q_{t,s}(a) \left( \beta E(G \mid s^{t-1} = s, a^t = a) + \ln(q_{t,s}(a)) + \lambda_{t,s,a} - \lambda_{t+1,s,a} \right) - \text{const} \right], \qquad (3.21)$$

where as usual $\alpha$ is the step size and const is the normalization constant (see Eq. 3.4).

3.5.2 Discounted sum RL

It is worth giving a brief overview of how the foregoing gets modified when we instead have a single "episode" of infinite time, with rewards received at every $t$, and the goal of the learner at any instant being to optimize the discounted sum of future rewards.

Let the matrix $P$ be the conditional distribution of state $z_t$ given state $z_{t-1}$, and $\gamma$ a real-valued discounting factor between 0 and 1. Write the single-instant reward function as a vector $R$ whose components give the value for the various $z_t$. Then if $P_0$ is the current distribution of (single-instant) states, $z_0$, the world utility is

$$\left( \left[ \sum_{t=1}^{\infty} (\gamma P)^t \right] P_0 \right) \cdot R.$$

The sum is just a geometric series, and equals $\frac{\gamma P}{\mathbb{1} - \gamma P}$, where $\mathbb{1}$ is the identity matrix, and it doesn't matter if the matrix inversion giving the denominator term is right-multiplied or left-multiplied by the numerator term.

We're interested in the partial derivative of this with respect to one of the entries of $P$ (those entries are given by the various $q_{i,j}$). What we know though (from our historical data) is a (non-IID) set of samples of $(\gamma P)^t P_0 \cdot R$ for various values of $t$ and various (delta-function) $P_0$. So it is not as trivial to use historical data to estimate the gradient of the Lagrangian as in the canonical optimization case. More elaborate techniques from machine learning and statistics need to be brought to bear.

Acknowledgements

We would like to thank Stephane Airiau, Chiu Fan Lee, Chris Henze, George Judge, Ilan Kroo and Bill Macready for helpful discussion.
¹⁷ Equivalently, at the expense of some extra notation, we could enforce the time-translation invariance without the $\lambda_{t,s,a}$ Lagrange parameters, by using a single variable $q_s(a)$ rather than the time-indexed set $q_{t,s}(a)$.
References

1. Laughlin, D., Morari, M., Braatz, R.: Robust performance of cross-directional control systems for web processes. Automatica 29 (1993) 1394-1410
2. Wolfe, J., Chichka, D., Speyer, J.: Decentralized controllers for unmanned aerial vehicle formation flight. American Institute of Aeronautics and Astronautics 96 (1996) 3933
3. Mesbahi, M., Hadaegh, F.: Graphs, matrix inequalities, and switching for the formation flying control of multiple spacecraft. In: Proceedings of the American Control Conference, San Diego, CA. (1999) 4148-4152
4. Fudenberg, D., Tirole, J.: Game Theory. MIT Press, Cambridge, MA (1991)
5. Basar, T., Olsder, G.: Dynamic Noncooperative Game Theory. Second Edition. SIAM, Philadelphia, PA (1999)
6. Osborne, M., Rubinstein, A.: A Course in Game Theory. MIT Press, Cambridge, MA (1994)
7. Aumann, R., Hart, S.: Handbook of Game Theory with Economic Applications. North-Holland Press (1992)
8. Fudenberg, D., Levine, D.K.: The Theory of Learning in Games. MIT Press, Cambridge, MA (1998)
9. Wolpert, D.H.: Factoring a canonical ensemble. (2003) cond-mat/0307630.
10. Wolpert, D.H.: Information theory - the bridge connecting bounded rational game theory and statistical physics. In D. Braha, A.M., Bar-Yam, Y., eds.: Complex Engineering Systems. (2004)
11. Wolpert, D.H.: What information theory says about best response, binding contracts, and collective intelligence. In et al, A.N., ed.: Proceedings of WEHIA04, Springer Verlag (2004)
12. Wolpert, D.H., Bieniawski, S.: Theory of distributed control using product distributions. In: Proceedings of CDC04. (2004)
13. Macready, W., Wolpert, D.H.: Distributed optimization. In: Proceedings of ICCS 04. (2004)
14. Wolpert, D.H., Lee, C.F.: Adaptive Metropolis Hastings sampling using product distributions. Submitted to ICCS04 (2004)
15. Airiau, S., Wolpert, D.H.: Product distribution theory and semi-coordinate transformations. (2004) Submitted to AAMAS 04.
16. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998)
17. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4 (1996) 237-285
18. Crites, R.H., Barto, A.G.: Improving elevator performance using reinforcement learning. In Touretzky, D.S., Mozer, M.C., Hasselmo, M.E., eds.: Advances in Neural Information Processing Systems - 8, MIT Press (1996) 1017-1023
19. Hu, J., Wellman, M.P.: Multiagent reinforcement learning: Theoretical framework and an algorithm. In: Proceedings of the Fifteenth International Conference on Machine Learning. (1998) 242-250
20. Antoine, N., Bieniawski, S., Kroo, I., Wolpert, D.H.: Fleet assignment using collective intelligence. In: Proceedings of 42nd Aerospace Sciences Meeting. (2004) AIAA-2004-0622.
21. Lee, C.F., Wolpert, D.H.: Product distribution theory for control of multi-agent systems. In: Proceedings of AAMAS 04. (2004)
22. Bieniawski, S., Wolpert, D.H.: Adaptive, distributed control of constrained multi-agent systems. In: Proceedings of AAMAS04. (2004)
23. Bieniawski, S., Wolpert, D.H., Kroo, I.: Discrete, continuous, and constrained optimization using collectives. In: Proceedings of 10th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, Albany, New York. (2004) in press.
24. Macready, W., Bieniawski, S., Wolpert, D.: Adaptive multi-agent systems for constrained optimization. (2004)
25. Bieniawski, S., Wolpert, D.H.: Using product distributions for distributed optimization. In: Proceedings of ICCS 04. (2004)
26. Cover, T., Thomas, J.: Elements of Information Theory. Wiley-Interscience, New York (1991)
27. Mackay, D.: Information theory, inference, and learning algorithms. Cambridge University Press (2003)
28. Jaynes, E.T., Bretthorst, G.L.: Probability Theory: The Logic of Science. Cambridge University Press (2003)
29. Kirkpatrick, S., Gelatt, C.D.J., Vecchi, M.P.: Optimization by simulated annealing. Science 220 (1983) 671-680
30. Diekmann, R., Luling, R., Simon, J.: Problem independent distributed simulated annealing and its applications. In: Applied Simulated Annealing. Springer (1993) 17-44
31. Catoni, O.: Solving scheduling problems by simulated annealing. SIAM Journal on Control and Optimization 36 (1998) 1539-1575
32. Vidal, R.V.V., ed.: Applied Simulated Annealing (Lecture Notes in Economics and Mathematical Systems). Springer (1993)
Multi-agent Planning for Autonomous Agents' Coordination

Amal El Fallah-Seghrouchni
LIP6, University Pierre and Marie Curie and CNRS (UMR 7606)
8, rue du Capitaine Scott, 75015 Paris, France
[email protected] http://www-poleia.Iip6.fr/~elfallah
4.1 Coordination Issue

Features like cooperation, interoperability, distribution, data integration and problem solving are fundamental to building complex and intelligent systems. To support the effective modeling and analysis of inherent complexity, an adequate (and comprehensive) framework should combine the rigor of formal models, the practicality of existing development methods and the performance analysis of modeling tools.

This paper proposes an approach to formalize rational agent coordination. We argue that concurrent plan coordination is a necessary bridge for rational agent cooperation. Coordination endows rational agents with cooperation abilities in order to generate coherent activities and ensure correct Multi-Agent System (MAS) behavior. Such coordination requires both an adequate plan representation and an efficient interaction between agents. Based on information exchange (e.g. data, plans), the interaction allows agents to update their own plans by considering the exchanged information. Coordination aims to produce two effects: cancelling negative interactions (e.g. harmful actions) and taking advantage of helpful interactions (e.g. handling redundant actions). Agents organize their activities and update their plans in order to cooperate and avoid conflicts.

Multi-agent planning remains one of the main mechanisms for MAS coordination. However, it raises several interesting issues because of the main features of agents (such as their autonomy and partial view) and also because of changes in the environment.
4.2 Multi-Agent Planning Requirements

To meet the specific requirements of the studied application domain, a model for multi-agent planning [6] should be studied from several perspectives: dynamic (or reactive) planning, distributed planning, task allocation and resource sharing, etc. In addition, in the case of a dynamic environment, such a model must take into account the handling of new events. Indeed, in response to an event, the system must
be able to reorganize itself. This implies that the agents should dynamically adapt their behaviour by adopting new goals and re-planning their course of actions if necessary. In such a context, namely reactive planning, the models for planning have to be flexible enough to avoid replanning from scratch. A good model for multi-agent planning should:
- allow the representation of, and reasoning about, simultaneous actions and continuous processes (e.g. concurrent actions, alternatives, synchronization, etc.),
- be domain-independent and support complex plans with different levels of abstraction (i.e. only the relevant information is represented at the earlier phases),
- allow dynamic modifications with the associated verification (e.g. no structural inconsistency) and valuation methods (e.g. robustness),
- support the interleaving of execution and planning with respect to changes in the environment,
- offer plan reuse mechanisms allowing agents to bypass the planning process in similar situations, according to the execution context (e.g. a library of abstract plans which are the basic building blocks of the new plans),
- allow agents to skip some of the planning actions, detect conflicts early and reduce communication costs,
- make execution control dynamic, in accordance with the associated refinement, and thereby minimize the set of revocable choices [2], because the instantiation of actions can be deferred,
- provide plans whose size remains controllable for specification and whose complexity remains tractable for validation.

In this paper, we propose an overview of two models we developed to deal with distributed and multi-agent planning. The first one is based on recursive Petri nets, which allow an event-driven modelling of plans (both single-agent and multi-agent plans), while the second is based on hybrid automata, which give a state-driven modelling.
Through these two models, we outline the coordination issue and the specific constraints it raises.
4.3 Transportation Domain Scenario

This scenario will be used to illustrate each stage of the RPN model. The system is made up of two types of agents: Conveyors and Clients.

The Conveyor: A Conveyor has a maritime traffic net modeled as a graph where the nodes describe the ports on the net and the arcs represent the channels between them. He also has a boat with a given volume. From each port, the directed neighbors are known using a routing table. Each Conveyor has 2 hangars per port: In_Hangar used for loading incoming goods, and Out_Hangar used for unloading outgoing goods. Each hangar can stock only one container at a time. The Conveyor's motivation (goal) is to transport goods through the net using his boat between a source port and a destination port.

The Client: A Client produces goods near the departure port and consumes them at the arrival port. The Client puts the goods he produced in the Out_Hangar of the departure
4 Multi-agent Planning for Autonomous Agents' Coordination
port. He has to find a Conveyor who will transport the goods to the arrival port. Then the Client gets these goods at the In_Hangar of the arrival port in order to consume them. The Client's motivation (goal) is to get his goods transported between the production and consumer ports.
The problem now is to answer two questions:
- When must agents' coordination happen? When positive interactions are detected, coordination is desirable, even necessary, and may be considered as an optimization of the plans' execution (e.g. cooperation between Conveyors and Clients, and optimization of the boat filling). On the other hand, when negative interactions are detected, coordination becomes indispensable to the plans' execution (e.g. a hangar is a critical resource since it can stock only one good at a time (Boolean value), and the boat is limited by the quantity of goods it can contain (Real value)).
- How may agents' coordination be ensured? To answer this question, the coordination mechanism will be illustrated in the following.
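As a rough illustration of the two critical resources just mentioned, the following Python sketch models a hangar as a Boolean resource and a boat as a Real-valued one. The class and method names (Hangar, Boat, put, try_load) are hypothetical and do not come from the chapter.

```python
from dataclasses import dataclass


@dataclass
class Hangar:
    """A hangar stocks at most one container at a time (Boolean resource)."""
    occupied: bool = False

    def put(self) -> bool:
        """Store a container; refuse if the hangar is already occupied."""
        if self.occupied:
            return False
        self.occupied = True
        return True

    def take(self) -> bool:
        """Remove the container; refuse if the hangar is empty."""
        if not self.occupied:
            return False
        self.occupied = False
        return True


@dataclass
class Boat:
    """A boat is limited by the volume of goods it can contain (Real resource)."""
    capacity: float
    load: float = 0.0

    def try_load(self, volume: float) -> bool:
        """Load goods only if the remaining volume is sufficient."""
        if self.load + volume > self.capacity:
            return False
        self.load += volume
        return True


# Two clients competing for the same hangar: the second request is refused.
out_hangar = Hangar()
first_put = out_hangar.put()
second_put = out_hangar.put()

# Two loads competing for the boat volume: the second one does not fit.
boat = Boat(capacity=10.0)
first_load = boat.try_load(6.0)
second_load = boat.try_load(5.0)
```

A refused request is exactly the kind of negative interaction that the coordination mechanism has to anticipate rather than discover at execution time.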
4.4 RPN Formalism for Distributed Planning

4.4.1 Motivation

The synergy between agents helps the emergence of coherent plans, i.e. it cancels negative effects and favors cooperation between agents. Consequently, it requires information sharing and synchronization between the parallel activities of the agents. Petri Nets are suitable for modeling, analyzing and prototyping dynamic systems with parallel activities, so distributed planning lends itself very well to this approach. The main contribution we expect from Petri Nets is their ability to improve the representation of complex plans and to allow their dynamic coordination and execution. Applied to distributed planning, the Petri Net model mainly offers the following advantages:
- natural and graphical expression of the synchronization of parallel activities that are the performance of the agents' tasks,
- clear decomposition of processing (transitions) and shared data (places),
- scheduling of the actions (causal and temporal relationships) of plans,
- dynamic allocation of tasks,
- qualitative and quantitative analysis of the Petri Net modeling of a plan.
The Recursive Petri Net formalism we have introduced overcomes some limitations of the usual categories of Petri Nets [9] (e.g. ordinary Petri Nets, High-Level Petri Nets (HLPN) and even Hierarchical HLPN (HHLPN)) that become apparent if one considers a Petri Net as a plan of actions:
- transition firings are instantaneous, whilst an action lasts some time,
- (HLPN only) transitions are elementary actions, whereas one needs to see an action as an abstraction of a plan,
- (HHLPN) when a transition is an abstraction, there is no clear end to its firing,
- some dynamicity is required in the structure of the net, but in a controlled way.
The processing described in the next section is based on dynamic planning, which is supported by RPN through the interleaving of execution and planning. In addition, the hierarchical aspect of RPN supports the dynamic refinement of transitions and allows a plan to be considered at multiple levels of abstraction.

4.4.2 A Recursive Model of Plans and Actions

Plans: A plan organizes a collection of actions which can be performed sequentially or concurrently in some specific order. Furthermore, these actions demand both local and shared resources. A correct plan execution requires that every total order extending the partial order obtained from the plan remains feasible.

Actions: A plan involves both elementary actions associated with irreducible tasks (but not instantaneous ones) and complex actions (i.e. abstract views of the task). Semantically, there are three types of actions:
- an elementary action represents an irreducible task which can be performed without any decomposition,
- an abstract action, the execution of which requires its substitution (i.e. refinement) by a new sub-plan. There are two types of abstract actions:
  - a parallel action, which can be performed while another action is being executed,
  - an exclusive action, the execution of which requires coordination with other current executions,
- an end action, which is necessarily elementary, gives the final results of the performed plan. The plan goals are implicitly given through its end actions.

Methods: Intuitively, a method may be viewed as some way to perform an action. Several methods can be associated with an action. A method requires: a label, a list of formal parameters to be linked when the method is executed, a set of Pre-Conditions (i.e. the conditions that must be satisfied before the method is executed), and a set of Post-Conditions (i.e. the conditions that will be satisfied after the method has been executed).
Depending on the action definition, a method may be elementary or abstract. An elementary method calls for a sub-routine in order to execute the associated action immediately but not instantaneously. An abstract method calls for a sub-plan corresponding to the chosen refinement of the associated abstract action. The refinement occurs so as to detail abstractions and display relevant information.

Example: Let us assume that the Initial Conveyor Plan (see Fig. 4.1) is reduced to the abstract action Self.Work where Self represents the agent identity and Work
is the method label. The associated method Ag.Work encapsulates two methods: Ag.GoTo(Dest), where Dest represents the source location of the good (see the following table), and Ag.Transport(Good), which corresponds to the abstract action in the Self.Work refinement (see Fig. 4.1). No method is associated with the end action Self.End_Conv since it is just a synchronization action.

Method: Ag.GoTo(Dest)        Type: Abstract

Variables                    Conditions
Name     Class               Pre      Post
Ag       CONVEYOR            None     Ag.Cur_Loc = Dest
Dest     LOCATION            None     None
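To make the notion of a method more concrete, here is a minimal Python sketch of a method record carrying a label, formal parameters, an abstract/elementary attribute and Pre- and Post-Conditions, instantiated for Ag.GoTo(Dest). The representation (conditions as predicates over a dictionary-based world state) is an assumption made for illustration, not the chapter's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

State = Dict[str, object]     # hypothetical world state: attribute name -> value
Binding = Dict[str, object]   # links formal parameters to domain objects


@dataclass(frozen=True)
class Method:
    label: str
    params: Tuple[str, ...]                  # formal parameters
    abstract: bool                           # True: abstract, False: elementary
    pre: Callable[[State, Binding], bool]    # Pre-Conditions
    post: Callable[[State, Binding], bool]   # Post-Conditions


# Ag.GoTo(Dest): no Pre-Condition, Post-Condition Ag.Cur_Loc = Dest.
go_to = Method(
    label="GoTo",
    params=("Ag", "Dest"),
    abstract=True,
    pre=lambda state, b: True,
    post=lambda state, b: state[f"{b['Ag']}.Cur_Loc"] == b["Dest"],
)

binding = {"Ag": "cv", "Dest": "portB"}
state = {"cv.Cur_Loc": "portA"}

pre_holds = go_to.pre(state, binding)           # nothing is required beforehand
post_before = go_to.post(state, binding)        # the conveyor is still in portA
state["cv.Cur_Loc"] = "portB"                   # effect of executing the method
post_after = go_to.post(state, binding)
```

Keeping the conditions as predicates over a binding mirrors the idea that the same method can be called with different agents and destinations.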
4.4.3 Syntactic Definitions of RPN

Definition 1. Method
A Method is defined through three components:
- An identifier or label
- An abstract attribute which represents the type of the associated method (e.g. abstract, elementary)
- A set of initialized RPN generated by an initialization mechanism in the case of an abstract method.
[Figure: RPN1, the refinement of Self.Work, made of the abstract transitions Self.GoTo(Dest) and Self.Transport(Good) and the end transition Self.End_Conv, with initial marking M0 = (1,0,0).]
Fig. 4.1. First Refinement of Initial Conveyor Plan
Definition 2. Recursive Petri Net
An RPN is defined as a tuple $\langle P, T, Pre, Post, Var, Call \rangle$ where:
• $P$ is the set of places
• $T$ is the set of transitions such that $T = T_{elem} \uplus T_{abs} \uplus T_{end}$, where $T_{abs} = T_{par} \uplus T_{exc}$ and $\uplus$ represents the disjoint union
• $Pre$ is the precondition matrix and is a mapping from $P \times T$ to $\mathbb{N}$
• $Post$ is the postcondition matrix and is a mapping from $P \times T$ to $\mathbb{N}$
• $Var$ is a set of variables
• $Call(t)$ is a method call associated with $t$ and defined through the following components:
  - the label of the method,
  - an expression (built on $Var$ variables) of the agent who calls the method,
  - an expression of the call parameters (built on $Var$ variables) which represents the Pre- and Post-Conditions associated with the method.

Definition 3. An initialized RPN
An initialized RPN is defined as a tuple $\langle R, M_0, Bind \rangle$ where:
• $R$ is the skeleton of the RPN,
• $M_0$ is the initial marking of the RPN (a mapping from $P$ to $\mathbb{N}$),
• $Bind$ is the function which links all (a total link) or some (a partial link) variables of $Var$ to the domain objects.

Note that the objects represent the domain data and are used to instantiate the RPN variables and the parameters of the methods. An RPN model represents a plan according to the previous definitions:
- The initial marking $M_0$ allows the plan execution to start.
- $Pre(p, t)$ (respectively $Post(p, t)$) equals $n > 0$ if the place $p$ is an input (respectively output) place of a transition $t$ and the valuation of the arc linking $p$ to $t$ (respectively $t$ to $p$) is $n$. It equals 0 otherwise.
- The default valuation of an arc without an explicit valuation is 1.

4.4.4 The RPN Semantics

A plan is executed by executing its actions. In the RPN formalism, a transition models an action and its firing corresponds to executing that action. The dynamic refinement of an abstract transition when it has to be fired is an elegant way of handling a conditional plan without developing all the situations at plan generation. In addition, it allows the interleaving of planning and execution. In our model, the planner dynamically chooses the best refinement according to the execution context, depending on the Pre- and Post-Conditions (see cases (a), (b) and (c) in Fig. 4.2). The choice heuristics are not detailed in this paper in order to focus on the plan management. In the following, the RPN semantics is given in order to illustrate the dynamic execution. An RPN models a plan.
The successive states of a plan are represented in a tree (which may be considered as an execution tree). A node represents an initialized RPN and an arc represents an abstract transition firing.

Definition 4. A Plan State
A plan state is defined as a tree $Tr = \langle S, A \rangle$ where:
• $S$ is the set of nodes
• $A$ is the set of arcs $a \in A$ such that:
  - $a = \langle s, s' \rangle$ if and only if $s'$ is the child of $s$ in $Tr$
  - An arc $a$ is labeled by a transition $Trans(a)$, where $Trans$ is a function from $A$ to $T$ with $Trans(a) = t$
[Figure: RPN2 and the alternative refinements of Self.GoTo(Dest), built on Self.MoveTo(Z): (a) the port destination is not a neighbor port; (b) the port destination is an immediate neighbor port of the Conveyor's current location; (c) the port destination is the current port.]
The variable Z represents the next Port Location and will be linked using the route table.
Fig. 4.2. Self.GoTo(Dest) Refinements
• The initial state $Tr_0$ is a tree which is reduced to only one node $s$ such that $M(s) = M_0$, where $M_0$ is the initial marking.

Example: Starting from the refinement RPN1 in Fig. 4.1, the successive states of the plan are given in Fig. 4.3, where case (P) gives the chosen initialized RPN associated with the call of the method Self.GoTo(Good.Src_Loc) and using the RPN2 refinement (see Fig. 4.2); case (Q) illustrates the firing of the abstract action Self.GoTo(Dest); and case (R) illustrates the firing of the elementary action Self.MoveTo(Dest) according to the following definitions.

Notation: The index $Tr$ of $S$ (respectively of $A$ and $Trans$) means that we consider the set of nodes $S$ (respectively the set of arcs $A$ and the function $Trans$) relative to the tree $Tr$. $Pre(., t)$ (respectively $Post(., t)$) is a mapping from $P$ to $\mathbb{N}$ induced from $Pre$ (respectively $Post$) by fixing $t$ as the second argument.
[Figure: plan states (P), (Q) and (R), showing the nodes of the execution tree as triples (net, marking, binding), e.g. (RPN1, (1,0,0), B1), (RPN1, (0,0,0), B1) and (RPN2, (0,1,0), B2); the firing of the abstract transition Self.GoTo(Dest) and of the elementary transition Self.MoveTo(Z); and the bindings B1(Dest, Good) and B2(Z, Dest).]
Fig. 4.3. States of the Conveyor Plan
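The marking-based notions above (Pre and Post as mappings from P x T to N, the initial marking, and the columns Pre(., t) and Post(., t)) can be sketched in Python as follows. The firing rule coded here is the ordinary Petri net rule (a transition is enabled when the marking covers Pre(., t)), used as a stand-in for elementary transitions. The arcs of the small net are an assumption loosely modelled on RPN1 of Fig. 4.1, and all names are hypothetical.

```python
from typing import Dict, Tuple

Marking = Tuple[int, ...]


class Net:
    """Skeleton of a net: places are indexed 0..n-1, transitions are named."""

    def __init__(self, n_places: int,
                 pre: Dict[Tuple[int, str], int],
                 post: Dict[Tuple[int, str], int]) -> None:
        self.n_places = n_places
        self.pre = pre     # pre[(p, t)]: valuation of the arc from place p to t
        self.post = post   # post[(p, t)]: valuation of the arc from t to place p

    def pre_col(self, t: str) -> Marking:
        """Pre(., t): missing entries default to 0."""
        return tuple(self.pre.get((p, t), 0) for p in range(self.n_places))

    def post_col(self, t: str) -> Marking:
        """Post(., t): missing entries default to 0."""
        return tuple(self.post.get((p, t), 0) for p in range(self.n_places))

    def enabled(self, m: Marking, t: str) -> bool:
        """Ordinary Petri net rule: the marking must cover Pre(., t)."""
        return all(tokens >= w for tokens, w in zip(m, self.pre_col(t)))

    def fire(self, m: Marking, t: str) -> Marking:
        """Consume Pre(., t) and produce Post(., t)."""
        if not self.enabled(m, t):
            raise ValueError(f"transition {t} is not enabled in {m}")
        return tuple(tokens - w_in + w_out for tokens, w_in, w_out
                     in zip(m, self.pre_col(t), self.post_col(t)))


# Assumed structure: p0 -> GoTo -> p1 -> Transport -> p2 -> End_Conv.
rpn1 = Net(
    n_places=3,
    pre={(0, "GoTo"): 1, (1, "Transport"): 1, (2, "End_Conv"): 1},
    post={(1, "GoTo"): 1, (2, "Transport"): 1},
)
m0: Marking = (1, 0, 0)
m1 = rpn1.fire(m0, "GoTo")
```

In the recursive model, abstract and end transitions are handled differently (they create or close a sub-net), so this block only covers the token game inside a single net.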
Definition 5. A Transition Firing Rule
A transition $t$ is said to be fireable from a node $s \in S$ if and only if:
• $Pre(s, t)$ [...] is labeled by the abstract transition $t$.

Definition 9. Let $Tr(s)$ be the tree with the node $s$ as its root. The function PRUNE allows the tree $Tr$ to be cut from the node $s$ as follows: $PRUNE(Tr, s) \mapsto Tr'$ such that $Tr' = Tr \setminus Tr(s)$.
Remark: $Tr' = \emptyset$ if the node $s$ is the root of $Tr$.

Definition 10. End transition: Let $t \in T_{end}$. The firing of $t$ from $s \in S$ produces the new tree $Tr'$ with the new marking $M_{Tr'}$ such that:
• $Tr' = PRUNE(Tr, s)$
Let $s'$ be the immediate predecessor of $s$ in the tree $Tr$ and $a = \langle s', s \rangle$ the arc from $s'$ to $s$; then:
• $\forall a' \in A_{Tr'},\ Trans_{Tr'}(a') = Trans_{Tr}(a')$
• $M_{Tr'}(s') = M_{Tr}(s') + Post(., Trans_{Tr}(a))$

This definition means that the firing of an end action $t_{end}$ belonging to an initialized RPN which has been generated by the firing of $t_{abs}$ corresponds to applying the Post-Conditions of the call of the abstract method associated with $t_{abs}$. It closes the sub-net and adds $Post(., t_{abs})$ to the marking of the predecessor node.
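The effect of an end transition on the execution tree can be sketched as below: the node is pruned and Post(., t_abs) is added to the marking of its predecessor. The companion function for firing an abstract transition (consume Pre(., t_abs) in the parent and create a child node) is an assumption inferred from the plan states of Fig. 4.3, since the corresponding definitions are incomplete here; all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Marking = Tuple[int, ...]


@dataclass(eq=False)
class Node:
    """A node of the execution tree: an initialized net and its marking."""
    net: str
    marking: Marking
    parent: Optional["Node"] = None
    via: Optional[str] = None   # abstract transition labelling the arc parent -> node
    children: List["Node"] = field(default_factory=list)


def fire_abstract(s: Node, t_abs: str, pre_col: Marking,
                  child_net: str, child_m0: Marking) -> Node:
    """Assumed rule: consume Pre(., t_abs) in s and open the refinement as a child."""
    if any(tokens < w for tokens, w in zip(s.marking, pre_col)):
        raise ValueError(f"{t_abs} is not fireable from this node")
    s.marking = tuple(tokens - w for tokens, w in zip(s.marking, pre_col))
    child = Node(child_net, child_m0, parent=s, via=t_abs)
    s.children.append(child)
    return child


def fire_end(s: Node, post_cols: Dict[str, Marking]) -> Optional[Node]:
    """Prune s (and its subtree) and add Post(., t_abs) to its predecessor."""
    parent = s.parent
    if parent is None:
        return None                       # the pruned tree is empty
    parent.children.remove(s)             # PRUNE(Tr, s)
    parent.marking = tuple(tokens + w for tokens, w
                           in zip(parent.marking, post_cols[s.via]))
    return parent


root = Node("RPN1", (1, 0, 0))
child = fire_abstract(root, "GoTo", (1, 0, 0), "RPN2", (1, 0))
marking_while_refined = root.marking      # tokens consumed, sub-net open
fire_end(child, {"GoTo": (0, 1, 0)})
marking_after_end = root.marking
```

Deferring Post(., t_abs) until the end transition fires is what gives an abstract action a duration and a clear end.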
4.5 Concurrent Plan Management

Interacting situations are generally expressed in terms of binary relationships between actions and are often detected statically. The interleaving of planning and execution requires both static and dynamic detection of such situations, which is ensured through the RPN semantics. Moreover, these situations are usually represented semantically, which makes their handling difficult if not impossible. The syntactic aspects of such situations are often required to allow their operational management.

4.5.1 Interacting Situations through RPN

Here, planning and coordination aspects are merged, thus offering a number of advantages. When an agent cannot execute the refined plan, he communicates it to another agent. Communication triggers a coordination mechanism which is based on the plan merging paradigm. Coordination is globally initiated by the incoming new plan (i.e. a new plan is submitted to an agent). The most important interacting situations handled by our approach include both positive and negative interactions.

Positive Interactions
- Redundant Actions: Actions are redundant if they appear in at least two plans belonging to two different agents, have identical Pre- and Post-Conditions, and the associated methods are instantiated by the same parameters except for the agent parameter (who has to perform the action). Hence, coordination assigns action execution to one of the agents. The agent who will perform the action has to provide his results. The others have to modify their plans by including a synchronizing transition.
- Helpful Actions: Actions are said to be helpful if the execution of one satisfies all or some of the Pre-Conditions of the others. Their execution will be respectively possible or favored. There are two ways of detecting such a situation:
  - dynamic detection: during the execution of an abstract action, the refinement of which encapsulates an elementary action which validates another action's Pre-Conditions,
  - static detection: when the execution of one action precedes the execution of another and validates its Pre-Conditions.
Negative Interactions
- Harmful Actions: Actions are said to be harmful if the execution of one invalidates all or some of the Pre-Conditions of the others. Consequently, the execution of the latter will be respectively impossible or hindered. Such a situation must be detected before the new plan execution starts in order to predict failure or deadlock. Our coordination mechanism introduces an ordering between the harmful actions, as in [3], which provides a coordination algorithm (COA) for handling such interactions between n agents at once.
- Exclusive Actions: Actions are said to be exclusive if the execution of one momentarily prevents the execution of the others (e.g. their execution requires the same non-consumable resource). Detected dynamically, this situation occurs when an exclusive action has been started (i.e. an exclusive transition firing). In this case, execution remains possible but is deferred since it requires coordination with other executions.
- Incompatible Actions: Actions are said to be incompatible if the execution of one prevents the execution of the others (e.g. their execution requires the same consumable resource). In our model, such a situation models an alternative (e.g. two transitions share the same input place with one token). In this case, execution remains possible only if the critical resource can be replaced.

In our approach, the planner uses heuristics based on two alternatives, which may be combined in order to avoid conflicts: if the conflict concerns an abstract action, the planner tries to substitute the current refinement. Otherwise, the method used will be replaced.
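The syntactic flavour of these interacting situations can be illustrated with a small Python sketch that compares the Pre- and Post-Conditions of two actions. Conditions are represented as attribute/value dictionaries, which is an assumption made for illustration; the action Load and the attribute Out_Hangar.full are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Action:
    agent: str
    label: str
    args: Tuple[str, ...]
    pre: Dict[str, object] = field(default_factory=dict)    # required values
    post: Dict[str, object] = field(default_factory=dict)   # produced values


def redundant(a: Action, b: Action) -> bool:
    """Same method, parameters and conditions, but different agents."""
    return (a.agent != b.agent and a.label == b.label and a.args == b.args
            and a.pre == b.pre and a.post == b.post)


def helpful(a: Action, b: Action) -> bool:
    """Executing a satisfies at least one Pre-Condition of b."""
    return any(attr in b.pre and b.pre[attr] == value
               for attr, value in a.post.items())


def harmful(a: Action, b: Action) -> bool:
    """Executing a invalidates at least one Pre-Condition of b."""
    return any(attr in b.pre and b.pre[attr] != value
               for attr, value in a.post.items())


put1 = Action("cl1", "Put", ("good1",),
              pre={"Out_Hangar.full": False}, post={"Out_Hangar.full": True})
put2 = Action("cl2", "Put", ("good2",),
              pre={"Out_Hangar.full": False}, post={"Out_Hangar.full": True})
load = Action("cv", "Load", ("good1",),
              pre={"Out_Hangar.full": True}, post={"Out_Hangar.full": False})

put_helps_load = helpful(put1, load)      # filling the hangar enables loading
put_harms_put = harmful(put1, put2)       # the hangar holds one good at a time
puts_redundant = redundant(put1, put2)    # different goods, so not redundant
```

Exclusive and incompatible actions depend on the kind of resource involved and on the net structure, so they are not covered by this condition-only check.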
4.5.2 Plan Merging Paradigm

Hypotheses
- The RPN is acyclic and the initial net is reduced to an abstract transition before starting the execution.
- After plan generation, the plan actions have no negative interactions.
- The value of each unspecified condition is assumed to be unchanged, as in the STRIPS formalism.
- If the Pre-Conditions of an action method are valid, then the Post-Conditions are necessarily satisfied after its execution. Otherwise, no assumption is made about the validity of the Post-Conditions.
In the following, the handling of positive and negative interactions will be described through our case study. Let Clients {$Cl_1$, $Cl_2$} and Conveyor $Cv$ be three agents who are working out their respective plans.

Handling Positive Interactions: The main phases of the coordination algorithm are:
1. Starting coordination: An agent (e.g. $Cl_1$) has to perform a plan $\Pi_1$ corresponding to a leaf in his execution tree $Tr$, but $\Pi_1$ is partially instantiated, i.e. some plan methods associated with plan transitions have non-instantiated call parameters (e.g. $Y \in \Pi_1$, the TYPE of which is CONVEYOR). The agent (e.g. $Cl_1$) must find an agent (e.g. a Conveyor) who will execute these methods. He starts a selective communication, based on his acquaintances, until he receives a positive answer. Let us assume that $Cl_1$ chooses $Cv$ and then sends him $\Pi_1$.
2. Recognition and unification: The agent who receives $\Pi_1$ (e.g. $Cv$) detects the methods that are partially instantiated. Then, he examines his execution tree in search of the same methods. This phase succeeds if he finds an RPN (a node in his $Tr$) in which all the methods to be instantiated appear (see Y.Transport(X), which is non-assigned in $\Pi_1$ and assigned in $\Pi_2$ in Fig. 4.4). Then, $Cv$ triggers the unification of the methods through their call parameters and the instantiation of the variables w.r.t. the two plans. If both unification and instantiation are possible, $Cv$ tries to merge the two plans $\Pi_1$ and $\Pi_2$.
3. Structural merging through the transitions: $Cv$ produces a first merging plan $\Pi_m$ (see Fig. 4.4) through the transitions associated with the previous methods and instantiates the call parameters. Then he checks that all the variables have been instantiated and satisfy both the Pre- and Post-Conditions.
4. Consistency checking: This phase is the keystone of the coordination mechanism since it checks the feasibility of the new plan which results from the structural merging. It is based on the algorithm using the Pre- and Post-Conditions Calculus (PPCC) described in the following.
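The recognition and unification phase can be sketched as a small unification of method calls through their call parameters. This is a minimal illustration under stated assumptions: variables are written with a leading "?", the agent is treated as the first term of a call, and the names Call and unify are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


def is_var(term: str) -> bool:
    """Convention assumed here: variables start with '?'."""
    return term.startswith("?")


@dataclass(frozen=True)
class Call:
    label: str
    terms: Tuple[str, ...]   # calling agent first, then the call parameters


def resolve(term: str, binding: Dict[str, str]) -> str:
    """Follow the bindings until an unbound variable or a constant is reached."""
    while term in binding:
        term = binding[term]
    return term


def unify(a: Call, b: Call) -> Optional[Dict[str, str]]:
    """Return a binding making both calls identical, or None if impossible."""
    if a.label != b.label or len(a.terms) != len(b.terms):
        return None
    binding: Dict[str, str] = {}
    for x, y in zip(a.terms, b.terms):
        x, y = resolve(x, binding), resolve(y, binding)
        if x == y:
            continue
        if is_var(x):
            binding[x] = y
        elif is_var(y):
            binding[y] = x
        else:
            return None          # two different constants
    return binding


# The client's call has no conveyor yet; the conveyor's call has no good yet.
client_call = Call("Transport", ("?Y", "good1"))
conveyor_call = Call("Transport", ("cv", "?X"))
binding = unify(client_call, conveyor_call)
```

When a binding is found, the merged plan can be instantiated with it before the Pre- and Post-Conditions are checked.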
[Figure: panels Initial Client Plan, Client Plan Refinement, Conveyor Plan Refinement and Proposed Plan, with the transitions Self1.Put(X), Self.Transport(X), Self1.Get(X) and Self.End_Conv.]
Fig. 4.4. Structural Merging
Pre- and Post-Conditions Calculus Algorithm (PPCC)
The PPCC algorithm is based on two phases:
i. Reachability Tree Construction (RTC): To begin with, we have to build an RT using a classical recursive right depth-first algorithm which returns the tree root. Fig. 4.5 shows the RT of the Structural Merging Plan $\Pi_m$ given in Fig. 4.4.
[Figure: reachability tree of the merged plan, starting from the initial marked places, with firings of Self1.Put(X), Self.Go_To(X.Src_Loc), Self.Transport(X), Self1.Get(X) and Self.End_Conv. Annotations: Self1.Put(X) and Self.Go_To(X.Src_Loc) can be exchanged; an identical sub-tree can be cut. Legend: marked places; transition t is fired; no transition can be fired.]
Fig. 4.5. A Reachability Tree
ii. Pre- and Post-Conditions Calculus: This algorithm [3] is called with the root of the RT and with the parameters of the current execution context.

Property: For every plan $\Pi$ modeled as an RPN, if the PPCC algorithm applied to the reachability tree (RT) of $\Pi$ returns true (i.e. $\Pi$ is consistent) and the environment is stable, then no failure situation can occur.

Discussion: In order to avoid a combinatorial explosion, there exist algorithms which construct a reduced RT. In our context, let $n_i$ and $n_j$ be two nodes of the RT. The RTC can be optimized through analysis of the method calls as follows:
- Independent Nodes: there is no interference between the Pre- and Post-Conditions of $n_i$ on the one hand and the Pre- and Post-Conditions of $n_j$ on the other hand, i.e. the associated transitions can be fired simultaneously whatever their ordering (i.e. the global execution is unaffected). Here, an arbitrary ordering is decided (e.g. Self.Put and Self.Go_To can be exchanged), which allows many sub-trees to be cut (the right sub-tree in Fig. 4.5).
- Semi-Independent Nodes: there is no interference between the Post-Conditions of $n_i$ and $n_j$, i.e. the associated transitions do not affect the same attributes. If the exchanged sequences ($\{n_i, n_j\}$ or $\{n_j, n_i\}$) of firing transitions which lead to the same marking can be detected, then the sub-tree starting from this marking can be cut. The obtained graph is then acyclic and merges the redundant sub-trees.
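A depth-first reachability construction that merges identical sub-trees can be sketched as follows. It is a simplified illustration over an ordinary, acyclic net (markings only, without method calls or Pre- and Post-Conditions), and the place and transition layout of the example is an assumption.

```python
from typing import Dict, List, Tuple

Marking = Tuple[int, ...]
# transition name -> (Pre(., t), Post(., t))
Transitions = Dict[str, Tuple[Marking, Marking]]
Graph = Dict[Marking, List[Tuple[str, Marking]]]


def reachability(m0: Marking, transitions: Transitions) -> Graph:
    """Depth-first exploration; a marking already visited is not expanded again."""
    graph: Graph = {}

    def visit(m: Marking) -> None:
        if m in graph:
            return                      # identical sub-tree: cut
        graph[m] = []
        for name, (pre, post) in transitions.items():
            if all(tokens >= w for tokens, w in zip(m, pre)):
                nxt = tuple(tokens - w_in + w_out
                            for tokens, w_in, w_out in zip(m, pre, post))
                graph[m].append((name, nxt))
                visit(nxt)

    visit(m0)
    return graph


# Two independent transitions: Put (p0 -> p1) and Go_To (p2 -> p3).
transitions: Transitions = {
    "Put": ((1, 0, 0, 0), (0, 1, 0, 0)),
    "Go_To": ((0, 0, 1, 0), (0, 0, 0, 1)),
}
graph = reachability((1, 0, 1, 0), transitions)
dead_markings = [m for m, successors in graph.items() if not successors]
```

Both orderings of Put and Go_To lead to the same marking, so the second occurrence is not expanded again, which is the merging of redundant sub-trees discussed above.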
Handling Negative Interactions: Now another agent (e.g. $Cl_2$) sends his plan to $Cv$, who processes $\Pi_1$ ($Cl_1$'s plan).

GCOA as Generalization of the COA Algorithm [3]
The Conveyor starts a new coordination. The first and second phases are the same as in the case of positive interactions. He chooses the plan $\Pi_2$. Negative interactions arise when the two refinements ($\Pi_2$ and $\Pi'_2$) have shared attributes (e.g. the boat volume constraint). Now, $Cv$ has to solve internal negative interactions before proposing a merging plan to $Cl_2$. Again, the GCOA is divided into two steps.
i. Internal Structural Merging by Sequencing: $Cv$ connects $\Pi_2$ and $\Pi'_2$ by creating a place $p_{e,i}$ for each pair of transitions $(t_e, t_i)$ in $End(\Pi_2) \times Init(\Pi'_2)$ and two arcs, in order to generate a merged plan $\Pi_m$:

Function Sequencing(in $\Pi_1$, $\Pi_2$: Plan): Plan;
{this function merges $\Pi_1$ and $\Pi_2$; it produces a merged plan $\Pi_m$ and the synchronization places}
begin
  Let $T_E = \{t_e \in \Pi_1 \mid t_e$ is an end transition$\}$
  and $T_I = \{t_i \in \Pi_2 \mid t_i$ is an initial transition$\}$ (i.e. $t_i$ has no predecessor in $\Pi_2$)
  for all $(t_e, t_i) \in T_E \times T_I$ do
    Create a place $p_{e,i}$
    Create an input arc $IA_{e,i}$ from $t_e$ to $p_{e,i}$
    Create an output arc $OA_{e,i}$ from $p_{e,i}$ to $t_i$
    (i.e. $Post(p_{e,i}, t_e) = 1$ and $Pre(p_{e,i}, t_i) = 1$)
  endfor
  $\Pi_m$ := Merged_Plan($\Pi_1$, $\Pi_2$, $\{p_{e,i}\}$, $\{IA_{e,i}\}$, $\{OA_{e,i}\}$)
  return ($\Pi_m$, $\{p_{e,i}\}$)
end {Sequencing}

ii. Parallelization by Moving up arcs: $Cv$ applies the PPCC algorithm to the merged net $\Pi_m$ obtained by sequencing. If the calculus returns true, then the planner proceeds to the parallelization phase by moving up the arcs recursively in order to introduce a maximum parallelization in the plan. This algorithm [3] tries to move (or eliminate) the synchronization places. The predecessor transition of each synchronization place will be replaced by its own predecessor transition in two cases: when the transition which precedes the predecessor transition has not been fired, or when it is not currently firing. If both the Pre- and Post-Conditions remain valid, then a new arc replaces the old one. The result of this parallelization is to satisfy both $Cl_1$ and $Cl_2$ by executing the merged net $\Pi_m$.
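The Sequencing function above can be sketched in Python as follows. The plan representation (sets of places, transitions and arcs, plus an explicit set of end transitions) and the naming scheme of the synchronization places are assumptions; the sketch also assumes that the two plans use disjoint names and keeps the end transitions of both plans in the merged plan.

```python
from dataclasses import dataclass
from typing import Set, Tuple

Arc = Tuple[str, str]   # (source, target), between a place and a transition


@dataclass
class Plan:
    places: Set[str]
    transitions: Set[str]
    arcs: Set[Arc]
    end: Set[str]        # end transitions


def initial_transitions(plan: Plan) -> Set[str]:
    """Transitions with no predecessor transition in the plan."""
    fed_places = {dst for src, dst in plan.arcs if src in plan.transitions}
    with_predecessor = {dst for src, dst in plan.arcs
                        if src in fed_places and dst in plan.transitions}
    return plan.transitions - with_predecessor


def sequencing(p1: Plan, p2: Plan) -> Tuple[Plan, Set[str]]:
    """Merge p1 and p2 so that p2 starts after the end transitions of p1."""
    sync_places: Set[str] = set()
    arcs = set(p1.arcs) | set(p2.arcs)
    for t_e in sorted(p1.end):
        for t_i in sorted(initial_transitions(p2)):
            place = f"p_{t_e}_{t_i}"        # synchronization place p_{e,i}
            sync_places.add(place)
            arcs.add((t_e, place))          # input arc:  Post(p_{e,i}, t_e) = 1
            arcs.add((place, t_i))          # output arc: Pre(p_{e,i}, t_i) = 1
    merged = Plan(p1.places | p2.places | sync_places,
                  p1.transitions | p2.transitions,
                  arcs,
                  p1.end | p2.end)
    return merged, sync_places


plan_a = Plan(places={"pa"}, transitions={"a", "endA"},
              arcs={("a", "pa"), ("pa", "endA")}, end={"endA"})
plan_b = Plan(places={"pb0", "pb"}, transitions={"b", "endB"},
              arcs={("pb0", "b"), ("b", "pb"), ("pb", "endB")}, end={"endB"})
merged, sync = sequencing(plan_a, plan_b)
```

The returned synchronization places are the ones that the parallelization step later tries to move up or eliminate.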
[Figure: sequencing of $t_e$ and $t_i$ ($t_e ; t_i$), followed by parallelization ($t_e \parallel t_i$) obtained by moving up arcs.]
Fig. 4.6. Sequencing and Parallelization
Remark: Each time the arcs are moved up, the PPCC algorithm is applied to the new net.
The exchanged plans are the old ones augmented by synchronization places upstream and downstream. This algorithm can be optimized at the consistency control level. In fact, the coherence checking can be applied in an incremental way to each previous plan $\Pi_2$.
4.6 Hybrid Automata Formalism for Multi-Agent Planning

The second model of multi-agent planning we developed is based on Hybrid Automata [7], which represent an alternative formalism to deal with multi-agent planning when temporal constraints play an important role. In this modelling, the agents' behaviour (through individual plans and multi-agent plans) is state-driven. The interest of those automata is that they can model different clocks evolving with different speeds. These clocks may be the resources of each agent and the time. A Hybrid Automaton is composed of two elements, a finite automaton and a set of clocks:
• A finite automaton $A$ is a tuple $A = \langle Q, E, Tr, q_0, l \rangle$, where $Q$ is the finite set of states, $E$ the set of labels, $Tr$ the set of edges, $q_0$ the initial location, and $l$ a mapping associating the states of $Q$ with the elementary properties verified in each state.
• A set of clocks $H$, used to specify quantitative constraints associated with the edges. In hybrid automata, the clocks may evolve with different speeds.
$Tr$ is a set of edges such that each $t \in Tr$ has the form $t = \langle s, (\{g\}, e, \{r\}), s' \rangle$, where:
• $s$ and $s'$ are elements of $Q$; they respectively model the source and the target of the edge $t$;
• $\{g\}$ is the set of guards. It represents the conditions on the clocks;
• $e$ is the transition label, an element of $E$;
• $\{r\}$ is the set of actions on clocks.
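An edge with guards and clock actions can be illustrated with the Python sketch below, in which two clocks evolve at different speeds. The state names, the clock names (time, fuel), the rates and the reset-only form of the clock actions are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Clocks = Dict[str, float]


@dataclass
class Edge:
    source: str                          # s
    guard: Callable[[Clocks], bool]      # {g}: conditions on the clocks
    label: str                           # e
    reset: Dict[str, float]              # {r}: actions on clocks (here: resets)
    target: str                          # s'


def advance(clocks: Clocks, rates: Dict[str, float], dt: float) -> Clocks:
    """Let time elapse: each clock evolves with its own speed."""
    return {name: value + rates[name] * dt for name, value in clocks.items()}


def take(edge: Edge, state: str, clocks: Clocks) -> Optional[Tuple[str, Clocks]]:
    """Cross the edge if it leaves the current state and its guard holds."""
    if state != edge.source or not edge.guard(clocks):
        return None
    new_clocks = dict(clocks)
    new_clocks.update(edge.reset)
    return edge.target, new_clocks


# "time" grows at speed 1, while the "fuel" resource decreases at speed 2.
rates = {"time": 1.0, "fuel": -2.0}
clocks = {"time": 0.0, "fuel": 10.0}
refuel = Edge(source="sailing", guard=lambda c: c["fuel"] <= 5.0,
              label="refuel", reset={"fuel": 10.0}, target="docked")

too_early = take(refuel, "sailing", clocks)       # guard not yet satisfied
later = advance(clocks, rates, 3.0)
after_refuel = take(refuel, "sailing", later)
```

Treating a resource as a clock with its own rate is what allows agent features and time to be handled uniformly by the guards.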
Multi-agent plans can be modeled by a network of synchronized hybrid automata (a more detailed presentation can be found in [4]). They are of particular interest since they take into account the agents' features and the time as parameters of the simulation (those variables may be modeled by different clocks evolving with different speeds inside the automata). All the parameters of the planning problem may be represented in the hybrid automata: the tasks to be accomplished are represented by the reachable states in the automata; the relations between tasks by the edges; the pre-, post- and interruption conditions by the guards of the edges; and finally the different variables by the clocks of the automata.

Let us define the synchronized product. Considering $n$ hybrid automata $A_i = \langle Q_i, E_i, Tr_i, q_{0,i}, l_i, H_i \rangle$, for $i = 1, \ldots, n$:
• $Q = Q_1 \times \ldots \times Q_n$;
• $T = \{((q_1, \ldots, q_n), (e_1, \ldots, e_n), (q'_1, \ldots, q'_n)) \mid e_i = \text{'-'} \text{ and } q'_i = q_i, \text{ or } e_i \neq \text{'-'} \text{ and } (q_i, e_i, q'_i) \in Tr_i\}$;
• $q_0 = (q_{0,1}, q_{0,2}, \ldots, q_{0,n})$;
• $H = H_1 \times \ldots \times H_n$.
So, in this product, each automaton may do a local transition, or do nothing (empty action modeled by '-') during a transition. It is not necessary to synchronize all the transitions of all the automata. The synchronization consists of a set of synchronization labels that mark the transitions to be synchronized. Consequently, an execution of the synchronized product is an execution of the Cartesian product restricted to these transition labels. In our case, we only synchronize the edges concerning the temporal connectors $S_{start}$ and $S_{end}$. Indeed, the synchronization of individual agents' plans is done with respect to functional constraints and classical synchronisation techniques of the automata formalism, such as message sending and reception. The Hybrid Automata formalism and the associated coordination mechanisms are detailed in [5].
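The discrete part of the synchronized product can be sketched in Python as below (clocks, guards and resets are left out). Each automaton is given as a triple (states, edges, initial state), and the set of allowed label vectors plays the role of the synchronization set; the two small automata and their labels are hypothetical.

```python
from itertools import product
from typing import List, Set, Tuple

EMPTY = "-"                                   # empty action: the automaton waits
Edge = Tuple[str, str, str]                   # (q, e, q')
Automaton = Tuple[Set[str], Set[Edge], str]   # (states, edges, initial state)
Vector = Tuple[str, ...]


def sync_product(automata: List[Automaton], sync_vectors: Set[Vector]
                 ) -> Tuple[Set[Tuple[Vector, Vector, Vector]], Vector]:
    """Build the product transitions restricted to the allowed label vectors."""
    transitions: Set[Tuple[Vector, Vector, Vector]] = set()
    for qs in product(*(states for states, _, _ in automata)):
        for labels in sync_vectors:
            options = []
            for (_, edges, _), q, e in zip(automata, qs, labels):
                if e == EMPTY:
                    options.append([q])       # e_i = '-' and q'_i = q_i
                else:
                    options.append([t for (s, lab, t) in edges
                                    if s == q and lab == e])
            for targets in product(*options):
                transitions.add((qs, labels, targets))
    q0 = tuple(initial for _, _, initial in automata)
    return transitions, q0


client: Automaton = ({"idle", "waiting"},
                     {("idle", "S_start", "waiting")}, "idle")
conveyor: Automaton = ({"free", "busy"},
                       {("free", "S_start", "busy"), ("busy", "move", "free")},
                       "free")

# S_start is synchronized; move is local to the conveyor.
vectors = {("S_start", "S_start"), (EMPTY, "move")}
transitions, q0 = sync_product([client, conveyor], vectors)
```

Only the label vectors listed in the synchronization set generate product transitions, which is how the Cartesian product is restricted.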
4.7 Conclusion

The two models presented in this paper are suitable for multi-agent planning. Recursive Petri nets allow the modelling of plans (both at the agent and multi-agent levels) and their management when abstraction and dynamic refinement are required. RPN easily allow the synchronization of individual agents' plans. They are particularly interesting for multi-agent validation, thanks to the construction of the reachability tree, if combined with reduction techniques (in order to avoid the combinatorial explosion of the number of states). The main shortcoming of this model is the absence of explicit handling of temporal constraints. This is why we developed a model based on Hybrid Automata, which model different clocks evolving with different speeds. These clocks may be the resources of each agent and the time.
References

1. R. Alur and D. Dill. A Theory of Timed Automata. Theoretical Computer Science. Vol. 126, n. 3, pages 183-225. (1994)
2. A. Barrett and D.S. Weld. Characterizing Subgoal Interactions for Planning. In Proceedings of IJCAI-93, pp 1388-1393. (1993)
3. A. El Fallah Seghrouchni and S. Haddad. A Recursive Model for Distributed Planning. In the proceedings of ICMAS'96. IEEE publisher. Japan. (1996)
4. A. El Fallah-Seghrouchni, I. Degirmenciyan-Cartault and F. Marc. Framework for Multi-Agent Planning based on Hybrid Automata. In the proceedings of CEEMAS 03 (International/Eastern Europe conference on Multi-Agent System). LNAI 2691. Springer Verlag. Prague. (2003)
5. A. El Fallah-Seghrouchni, F. Marc and I. Degirmenciyan-Cartault. Modelling, Control and Validation of Multi-Agent Plans in Highly Dynamic Context. To appear in the proceedings of AAMAS 04. ACM Publisher. New York. (2004)
6. M.P. Georgeff. Planning. In Readings in Planning. Morgan Kaufmann Publishers, Inc. San Mateo, California. (1990)
7. T. A. Henzinger. The theory of Hybrid Automata. In the proceedings of 11th IEEE Symposium Logic in Computer Science, pages 278-292. (1996)
8. T. A. Henzinger, Pei-Hsin Ho and H. Wong-Toi. HyTech: a model checker for hybrid systems. Journal of Software Tools for Technology Transfer. Vol. 1, n. 1/2, pages 110-122. (2001)
9. K. Jensen. High-level Petri Nets, Theory and Application. Springer-Verlag. (1991)
10. V. Martial. Coordination of Plans in a Multi-Agent World by Taking Advantage of the Favor Relation. In Proceedings of the Tenth International Workshop on Distributed Artificial Intelligence. (1990)
Creating Common Beliefs in Rescue Situations

Barbara Dunin-Kęplicz (1,2) and Rineke Verbrugge (3)

(1) Institute of Informatics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland [email protected]
(2) Institute of Computer Science, Polish Academy of Sciences, Ordona 21, 01-237 Warsaw, Poland
(3) Institute of Artificial Intelligence, University of Groningen, Grote Kruisstraat 2/1, 9712 TS Groningen, The Netherlands [email protected]

Summary. In some rescue or emergency situations, agents may act individually or on the basis of minimal coordination, while in others, full-fledged teamwork provides the only means for the rescue action to succeed. In such dynamic and often unpredictable situations, agents' awareness about their involvement becomes crucial on the one hand, but on the other hand one can expect that only beliefs can be obtained by means of communication and reasoning. A suitable level of communication should be naturally tuned to the circumstances. Thus in some situations individual belief may suffice, while in others everybody in a group should believe a fact, or even the strongest notion of common belief is the relevant one. Even though common knowledge cannot in general be established by communication, in this paper we present a procedure for establishing common beliefs in rescue situations by minimal communication. Because the low-level part of the procedure involves file transmission (e.g. by TCP or alternating-bit protocol), next to a general assumption on trust, some additional assumptions on communication channels are needed. If in the considered situation communication is hampered to such an extent that establishing a common belief is not possible, creating a special kind of mutual intention (defined by us in other papers) within a rescue team may be of help.
5.1 Introduction
Looking at emergency situations in their complexity, a rather powerful knowledge-based system is needed to cope with them in a dynamic and often unpredictable environment. In emergencies, coordination and cooperation are on the one hand vital, and on the other hand more difficult to achieve than in normal circumstances. To make the situation even more complex, time is critical for rescues to succeed, and communication is often hampered. Also, expertise from different fields is usually needed. Multiagent systems exactly fit the bill: they deliver means for organizing complex, sometimes spectacular interactions among different, physically and/or logically distributed knowledge-based entities [1]:
A MAS can be defined as a loosely coupled network of problem solvers that work together to solve problems that are beyond the individual capabilities or knowledge of each problem solver. This paper is concerned with a specific kind of MAS, namely a team. A team is a group in which the agents are restricted to having a common goal of some sort in which team-members typically cooperate and assist each other in achieving their common goal. Rescuing people from a crisis or emergency situation is a complex example of such a common goal. Emergency situations may be classified along different lines. It is not our purpose to provide a detailed classification here, but an important dimension of classification is along the need for teamwork. A central joint mental attitude addressed in teamwork is collective intention. We agree with [2] that: Joint intention by a team does not consist merely of simultaneous and coordinated individual actions; to act together, a team must be aware of and care about the status of the group effort as a whole. In some rescue situations, agents may act individually or on the basis of minimal coordination, while in others, full-fledged teamwork, based on a collective intention, provides the only means for the rescue action to succeed. MAS can be organized using different paradigms or metaphors. For teamwork, BDI (Beliefs, Desires, Intentions) systems form a proper paradigm. Thus, some multiagent systems may be viewed as intentional systems implementing practical reasoning — the everyday process of deciding, step by step, which action to perform next. This model of agency originates from Michael Bratman's theory of human rational choice and action [3]. His theory is based on a complex interplay of informational and motivational aspects, constituting together a belief-desire-intention model of rational agency. Intuitively, an agent's beliefs correspond to information the agent has about the environment, including other agents. 
An agent's desires or goals represent states of affairs (options) that the agent would choose. Finally, an agent's intentions represent a special subset of its desires, namely the options that it has indeed chosen to achieve. The decision process of a BDI agent leads to the construction of agent's commitment, leading directly to action execution. The BDI model of agency comprises beliefs referring to agent's informational attitudes, intentions and then commitments referring to its motivational attitudes. The theory of informational attitudes has been formalized in terms of epistemic logic as in [4, 5]. As regards motivational attitudes, the situation is much more complex. In Cooperative Problem Solving (henceforth CPS), a group as a whole needs to act in a coherent pre-planned way, presenting a unified collective motivational attitude. This attitude, while staying in accordance with individual attitudes of group members, should have a higher priority than individual ones. Thus, from the perspective of CPS these attitudes are considered on three levels: individual, social (bilateral), and collective.
When analyzing rescue situations from the viewpoint of BDI systems, one of the first purposes is to define the scope and strength of motivational and informational attitudes needed for successful team action. These determine the strength and scope of the necessary communication. In [6], [7], we give a generic method for the system developer to tune the type of collective commitment to the application in question, the organizational structure of the group or institution, and to the environment, especially to its communicative possibilities. In this paper, however, the essential question is in what terms to define the communication necessary for teamwork in rescue situations. Knowledge, which always corresponds to the facts and can be justified by a formal proof or less rigorous argumentation, is the strongest and therefore preferred informational attitude. The strongest notion of knowledge in a group is common knowledge, which is the basis of all conventions and the preferred basis of coordination. Halpern and Moses proved that common knowledge of certain facts is on the one hand necessary for coordination in well-known standard examples, while on the other hand, it cannot be established by communication if there is any uncertainty about the communication channel [4]. In practice in MAS, agents make do with belief instead of knowledge for at least the following reasons. First, in MAS perception provides the main background for beliefs. In a dynamic unpredictable environment the natural limits of perception may give rise to false beliefs or to beliefs that, while true, still cannot be fully justified by the agent. Second, communication channels may be of uncertain quality, so that even if a trustworthy sender knows a certain fact, the receiver may only believe it. Common belief is the notion of group belief which is constructed in a similar way as common knowledge.
Thus, even though it puts fewer constraints on the communication environment than common knowledge, it is still logically highly complex. For efficiency reasons it is often important to minimize the level of communication among agents. This level should be tuned to the circumstances under consideration. Thus in some situations individual belief may suffice, while in others everybody in a group should believe a fact, and in yet others the strongest notion of common belief is needed. In this paper we aim to present a method for establishing common beliefs in rescue situations by minimal communication. If in the considered situation communication is hampered to such an extent that establishing a common belief is not possible, we attempt some alternative solutions. The paper is structured in the following manner. In section 5.2, a short reminder is given about individual and group notions of knowledge and belief, and the difficulty of achieving common belief in certain circumstances. Then, a procedure for creating common beliefs is introduced in section 5.3, which also discusses the assumptions on the environment and the agents that are needed for the procedure to be effective. Section 5.4 presents three case studies of rescue situations where various collective attitudes enabling appropriate teamwork are established, tuned to the communicative possibilities of the environment. Finally, section 5.5 discusses related work and provides some ideas about future research.
5.2 Knowledge and belief in groups
In multiagent systems, agents' awareness of the situation they are involved in is a necessary ingredient. Awareness in MAS is understood as a reduction of the general meaning of this notion to the state of an agent's beliefs (or knowledge when possible) about itself, about other agents as well as about the state of the environment, including the situation they are involved in. Assuming such a scope of this notion, different epistemic logics can be used when modelling agents' awareness. This awareness may be expressed in terms of any informational (individual or collective) attitude fitting given circumstances. In rescue situations, when the situation is usually rather complex and hard to predict, one can expect that only beliefs can be obtained.
5.2.1 Individual and common beliefs
To represent beliefs, we adopt a standard KD45_n-system for n agents as explained in [4], where we take BEL(i, φ) to have as intended meaning "agent i believes proposition φ". A stronger notion than the one for belief is knowledge, often called "justified true belief". The usual axiom system for individual knowledge within a group is S5_n, i.e. a version of KD45_n where the consistency axiom is replaced by the (stronger) truth axiom KNOW(i, φ) → φ. We do not define knowledge in terms of belief. Definitions occurring in the MAS-literature (such as KNOW(i, φ) ,.
(8.11)
Then, $\mu_T$ is a rough inclusion that satisfies the transitivity rule, see [14],
$$\frac{\mu_T(x,y,r),\ \mu_T(y,z,s)}{\mu_T(x,z,T(r,s))} \qquad (8.12)$$
Particular examples of rough inclusions are the Menger rough inclusion (MRI, in short) and the Łukasiewicz rough inclusion (LRI, in short), corresponding, respectively, to the product t-norm $T_M(x,y) = x \cdot y$, and the Łukasiewicz product $T_L(x,y) = \max\{0, x+y-1\}$.
The Menger rough inclusion
For the t-norm $T_M$, the generating function is $f(x) = -\ln x$, whereas $g(y) = e^{-y}$ is the pseudo-inverse to $f$. The rough inclusion $\mu_{T_M}$ is given by the formula,
$$\mu_{T_M}(x,y,r) \Leftrightarrow e^{-\frac{|DIS(x,y)|}{|A|}} \geq r. \qquad (8.13)$$
Footnote: i.e., a map $T : [0,1]^2 \to [0,1]$ that is symmetric, associative, increasing and satisfies $T(x,0) = 0$.
Footnote: This means that $g(x) = 1$ for $x \in [0, f(1)]$, $g(x) = 0$ for $x \in [f(0), 1]$, and $g(x) = f^{-1}(x)$ for $x \in [f(1), f(0)]$.
Lech Polkowski
The Łukasiewicz rough inclusion
For the t-norm $T_L$, the generating function is $f(x) = 1 - x$, and $g = f$ is the pseudo-inverse to $f$. Therefore,
$$\mu_{T_L}(x,y,r) \Leftrightarrow 1 - \frac{|DIS(x,y)|}{|A|} \geq r. \qquad (8.14)$$
Let us observe that rough inclusions based on sets DIS are necessarily symmetric.

Table 8.5. The information system A
U    a1  a2  a3  a4
x1   1   1   1   2
x2   1   0   1   0
x3   2   0   1   1
x4   3   2   1   0
x5   3   1   1   0
x6   3   2   1   2
x7   1   2   0   1
x8   2   0   0   2
For the information system A in Table 5, we calculate values of LRI, shown in Table 6; as $\mu_{T_L}$ is symmetric, we show only the upper triangle of values.

Table 8.6. $\mu_{T_L}$ for Table 5
U    x1    x2    x3    x4    x5    x6    x7    x8
x1   1     0.5   0.25  0.25  0.5   0.5   0.25  0.25
x2   -     1     0.5   0.5   0.5   0.25  0.25  0.25
x3   -     -     1     0.25  0.25  0.25  0.25  0.5
x4   -     -     -     1     0.75  0.75  0.25  0
x5   -     -     -     -     1     0.5   0     0
x6   -     -     -     -     -     1     0.25  0.25
x7   -     -     -     -     -     -     1     0.25
x8   -     -     -     -     -     -     -     1
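The computation behind Table 6 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `dis` and `lri_degree` are our own, the attribute values are those read from Table 5, and the degree returned is the largest r for which formula (8.14) holds.

```python
from fractions import Fraction

# Attribute values (a1, a2, a3, a4) of objects x1..x8, as read from Table 5.
TABLE_5 = {
    "x1": (1, 1, 1, 2),
    "x2": (1, 0, 1, 0),
    "x3": (2, 0, 1, 1),
    "x4": (3, 2, 1, 0),
    "x5": (3, 1, 1, 0),
    "x6": (3, 2, 1, 2),
    "x7": (1, 2, 0, 1),
    "x8": (2, 0, 0, 2),
}


def dis(x, y):
    """Indices of the attributes on which x and y differ, i.e. DIS(x, y)."""
    return {i for i, (u, v) in enumerate(zip(TABLE_5[x], TABLE_5[y])) if u != v}


def lri_degree(x, y):
    """Largest r such that mu_TL(x, y, r) holds: 1 - |DIS(x, y)| / |A|."""
    return 1 - Fraction(len(dis(x, y)), len(TABLE_5[x]))


if __name__ == "__main__":
    names = sorted(TABLE_5)
    for i, x in enumerate(names):
        # Print only the upper triangle, as in Table 6.
        row = ["-"] * i + [str(float(lri_degree(x, y))) for y in names[i:]]
        print(x, *row)
```

Exact fractions are used so that comparisons with a threshold r are not affected by floating-point rounding.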
Rough inclusions over relational information systems
In some applications, a need may arise to stratify objects more subtly than is secured by sets DIS. A particular answer to this need can be provided by a relational information system, by which we mean a system $(U, A, R)$, where $R = \{R_a : a \in A\}$ with $R_a \subseteq V_a \times V_a$ a relation in the value set $V_a$.
8 Mereological Foundations to Approximate Reasoning
A modified set $DIS_R(x,y)$ is defined as follows: $DIS_R(x,y) = \{a \in A : R_a(a(x), a(y))\}$. Then, for any archimedean t-norm $T$, and non-reflexive, non-symmetric, transitive, and linear relation $R$, we define the rough inclusion $\mu^R_T$ by the modified formula,
$$\mu^R_T(x,y,r) \Leftrightarrow g\left(\frac{|DIS_R(x,y)|}{|A|}\right) \geq r, \qquad (8.15)$$
where $g$ is the pseudo-inverse to $f$ in the representation $T(r,s) = g(f(r) + f(s))$; clearly, the notion of a part is here: $x \pi_R y$ if and only if $x \neq y$ and $R_a(a(y), a(x))$ for each $a \in A$. Let us look at values of $\mu^R_{T_L}$ in Table 7 for the information system in Table 5 with value sets ordered linearly as a subset of real numbers.

Table 8.7. $\mu^R_{T_L}$ for Table 5
U    x1    x2    x3    x4    x5    x6    x7    x8
x1   1     1     0.75  0.5   0.5   0.5   0.75  0.75
x2   0.5   1     0.5   0.5   0.5   0.25  0.5   0.5
x3   0.5   1     1     0.5   0.5   0.25  0.75  0.75
x4   0.75  1     0.75  1     1     0.75  0.75  0.75
x5   0.75  1     0.75  0.75  1     0.5   0.5   0.75
x6   1     1     1     1     1     1     1     1
x7   0.5   0.75  0.5   0.5   0.5   0.25  1     0.5
x8   0.5   0.75  0.75  0.25  0.25  0.25  0.75  1
As expected, the modified rough inclusion is non-symmetric. We now discuss problems of granulation of knowledge, showing an approach to them by means of rough inclusions.
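The loss of symmetry can be illustrated by a small Python sketch of formula (8.15) for the Łukasiewicz t-norm. It rests on one assumption that the text does not state explicitly: each relation R_a is read as the strict order "<" on numbers, which is one way of ordering the value sets linearly as a subset of the reals. The names `dis_r` and `lri_r_degree` are our own.

```python
from fractions import Fraction

# Attribute values (a1, a2, a3, a4) of objects x1..x8, as read from Table 5.
TABLE_5 = {
    "x1": (1, 1, 1, 2),
    "x2": (1, 0, 1, 0),
    "x3": (2, 0, 1, 1),
    "x4": (3, 2, 1, 0),
    "x5": (3, 1, 1, 0),
    "x6": (3, 2, 1, 2),
    "x7": (1, 2, 0, 1),
    "x8": (2, 0, 0, 2),
}


def dis_r(x, y):
    """DIS_R(x, y): attributes a with R_a(a(x), a(y)); R_a is assumed to be '<'."""
    return {i for i, (u, v) in enumerate(zip(TABLE_5[x], TABLE_5[y])) if u < v}


def lri_r_degree(x, y):
    """Largest r such that the modified Lukasiewicz rough inclusion holds."""
    return 1 - Fraction(len(dis_r(x, y)), len(TABLE_5[x]))


if __name__ == "__main__":
    # The two directions differ, unlike for the DIS-based inclusion.
    print(lri_r_degree("x1", "x2"), lri_r_degree("x2", "x1"))
```

Under this reading, x6 carries the largest value of every attribute, so its degree of inclusion in any other object is 1, which agrees with the row of ones for x6 in Table 7.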
8.4 Rough Mereological Granule Calculus
The granular computing paradigm, proposed by Lotfi Zadeh, is based on the idea of making entities into granules of entities and performing calculations on granules in order to reduce the computational cost of approximate problem solving. Here we propose a general scheme for granular computing based on rough mereology. We assume a rough inclusion $\mu_\pi$ on a mereological universe $(U, el_\pi)$ with a part relation $\pi$. For given $r < 1$ and $x \in U$, we let,
$$g_r(x) = Cls(\Psi_r), \qquad (8.16)$$
where
$$\Psi_r(y) \Leftrightarrow \mu_\pi(y,x,r). \qquad (8.17)$$
The class $g_r(x)$ collects all atomic objects satisfying the class definition with the concept $\Psi_r$.
We will call the class $g_r(x)$ the r-granule about x; it may be interpreted as a neighborhood of x of radius r. We may also regard the formula $y \mu_r x$ as stating similarity of y to x (to degree r). We do not discuss here the problem of representation of granules; in general, one may apply sets or lists as the underlying representation structure. The following are general properties of the granule operator $g_r$ induced by a rough inclusion $\mu_\pi$, see [14].
1. if $\mu_\pi(y,x,r)$ then $y\, el_\pi\, g_r(x)$.
2. if $\mu_\pi(x,y,r) \wedge y\, el_\pi\, z$ then $x\, el_\pi\, g_r(z)$.
3. $\forall z.[z\, el_\pi\, y \Rightarrow \exists w, q.(w\, el_\pi\, z \wedge w\, el_\pi\, q \wedge \mu_\pi(q,x,r))] \Rightarrow y\, el_\pi\, g_r(x)$.
4. if $y\, el_\pi\, g_r(x) \wedge z\, el_\pi\, y$ then $z\, el_\pi\, g_r(x)$.
5. if $s < r$ then $g_r(x)\, el_\pi\, g_s(x)$.
8.4.1 Granulation via archimedean t-norm based rough inclusions
For an archimedean t-norm T-induced rough inclusion $\mu_T$, we have a more detailed result, viz., the equivalence,
6. for each $x, y \in U_{IND}$, $x\, el_\pi\, g_r(y)$ if and only if $\mu_T(x, y, r)$.
We consider the information system of Table 5 along with values of rough inclusions $\mu_{T_L}$, $\mu^R_{T_L}$ given, respectively, in Tables 6 and 7. Admitting r = .5, we list below granules of radii .5 about objects $x_1$ to $x_8$ in both cases. We denote with the symbols $g_i$, $g^R_i$, respectively, the granule $g_{0.5}(x_i)$ defined by $\mu_{T_L}$, $\mu^R_{T_L}$, respectively, presenting them as sets. We have,
1. $g_1 = \{x_1, x_2, x_5, x_6\}$,
2. $g_2 = \{x_1, x_2, x_3, x_4, x_5\}$,
3. $g_3 = \{x_2, x_3, x_8\}$,
4. $g_4 = \{x_2, x_4, x_5, x_6\}$,
5. $g_5 = \{x_1, x_2, x_4, x_5, x_6\}$,
6. $g_6 = \{x_1, x_4, x_5, x_6\}$,
7. $g_7 = \{x_7\}$,
8. $g_8 = \{x_3, x_8\}$,
which provides an intricate relationship among granules: $g_1, g_4, g_6 \subseteq g_5$, $g_8 \subseteq g_3$, $g_2, g_5$ incomparable by inclusion, $g_7$ isolated. We may contrast this picture with that for $\mu^R_{T_L}$:
1. $g^R_1 = g^R_4 = g^R_5 = g^R_6 = U$,
2. $g^R_2 = g^R_3 = g^R_7 = U \setminus \{x_6\}$,
3. $g^R_8 = \{x_1, x_2, x_3, x_7, x_8\}$,
providing a nested sequence of three distinct granules.
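The granules listed for $\mu_{T_L}$ can be recomputed directly from Table 5. The following Python sketch is illustrative only; the names `lri_degree` and `granule` are our own, and a granule is represented simply as the set of objects included in x to degree at least r.

```python
from fractions import Fraction

# Attribute values (a1, a2, a3, a4) of objects x1..x8, as read from Table 5.
TABLE_5 = {
    "x1": (1, 1, 1, 2),
    "x2": (1, 0, 1, 0),
    "x3": (2, 0, 1, 1),
    "x4": (3, 2, 1, 0),
    "x5": (3, 1, 1, 0),
    "x6": (3, 2, 1, 2),
    "x7": (1, 2, 0, 1),
    "x8": (2, 0, 0, 2),
}


def lri_degree(x, y):
    """1 - |DIS(x, y)| / |A| for the Lukasiewicz rough inclusion (8.14)."""
    differing = sum(u != v for u, v in zip(TABLE_5[x], TABLE_5[y]))
    return 1 - Fraction(differing, len(TABLE_5[x]))


def granule(x, r):
    """The r-granule about x: all y with mu_TL(y, x, r)."""
    return {y for y in TABLE_5 if lri_degree(y, x) >= r}


if __name__ == "__main__":
    half = Fraction(1, 2)
    for x in sorted(TABLE_5):
        print(x, sorted(granule(x, half)))
```

Set inclusion between the resulting Python sets then gives the relationships among granules, for example `granule("x1", r) <= granule("x5", r)`.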
8.4.2 Extending rough inclusions to granules
We now extend $\mu_\pi$ over pairs of the form $x, g$, where $x \in U_{IND}$, $g$ a granule. We define $\mu_\pi$ in this case as follows, $\mu_\pi(x,g,r)$
$yCx$; meaning symmetry. (C3) $[\forall z.(zCx \Leftrightarrow zCy)] \Rightarrow (x = y)$; meaning extensionality. In terms of connections, schemes for spatial reasoning are constructed, see [3].
8.5.1 Connections from rough inclusions
In this section we investigate some methods for inducing connections from rough inclusions $\mu = \mu_\pi$, see [16].
Limit connection
We define a functor $C_T$ as follows,
$$x C_T y \Leftrightarrow \neg(\exists r, s < 1.\ ext(g_r(x), g_s(y))), \qquad (8.22)$$
where $ext(x, y)$ is satisfied in case x, y have no common parts. Clearly, (C1-C2) hold with $C_T$ irrespective of the rough inclusion $\mu$ applied. The status of (C3) depends on $\mu$. In case $x \neq y$, we have, e.g., $z\, el\, x$ and $ext(z, y)$ for some z. Clearly, $C_T(z, x)$; to prove $\neg(C_T(z, y))$, we add a new property of $\mu$: (RM5) $ext(x,y) \Rightarrow \exists s < 1.\forall t > s.\neg[\mu(x,y,t)]$. Under (RM5), $C_T$ induced via $\mu$ does satisfy (C3), i.e. it is a connection.
8.5.2 From Graded Connections to Connections
We begin with a definition of an individual $Bd_r x$. $Bd_r x = Cls(\mu^+_r(x))$, where $\mu^+_r(x)(z) \Leftrightarrow \mu(z,x,r) \wedge \neg(\exists s >$ We introduce a graded (r, s)-connection C(r, s) (r, s
trans[0]=red (11.11)
which says that a collision occurs only following a transition in which either one train or both violate the norms. Notice that comp(Γ_1^{D2}) ⊭ trans[0]=green → ¬collision[1]: as formulated by D2, the transition from a collision state to itself is green.
11 Modelling Unreliable and Untrustworthy Agent Behaviour
[Figure 11.2: transition diagram over the states EW, WW, tW, EE, WE, tE, Et, Wt, tt.]
Fig. 11.2. Coloured transition system defined by action description D2. Dotted lines indicate red transitions. All states and all other transitions are green. Reflexive edges (all green) are omitted for clarity.
One major advantage of taking C+ as the basic action formalism, as we see it, is its explicit transition system semantics, which enables a wide range of other analytical techniques to be applied. In particular, system properties can be expressed in the branching time temporal logic CTL and verified on the transition system defined by a C+ or (C+)++ action description using standard model checking systems. We will say that a formula φ of CTL is valid on a (coloured) transition system (S, I(σ^a), R, S_g, R_g) defined by (C+)++ action description D when s ∪ e ⊨ φ for every s ∪ e such that (s, e, s′) ∈ R for some state s′. The definition is quite standard, except for a small adjustment to allow action constants in φ to continue to be evaluated on transition labels e. (And we do not distinguish any particular set of initial states; all states in S are initial states.) We will also say in that case that formula φ is valid on the action description D. In CTL, the formula AX φ expresses that φ is satisfied in the next state in all future branching paths from now.⁵ EX is the dual of AX: EX φ = ¬AX ¬φ. EX φ expresses that φ is satisfied in the next state of some future branching path from now. The properties (11.10) and (11.11) can thus be expressed in CTL as follows:
¬collision ∧ trans=green → AX ¬collision (11.12)
or equivalently ¬collision ∧ EX collision → trans=red. It is easily verified by reference to Fig. 11.2 that these formulas are valid on the action description D2. Also valid is the CTL formula EX trans=green which expresses that there is always a permitted action for both trains. This is true even in collision states, since the only available transition is then the one where both trains remain idle, and that transition is green. The CTL formula EF collision is also valid on D2, signifying that in every state there is at least one path from then on with collision true somewhere in the future.⁶
⁵ s0 ∪ e0 ⊨ AX φ if for every infinite path s0 e0 s1 e1 ··· we have that s1 ∪ e1 ⊨ φ.
⁶ s0 ∪ e0 ⊨ EF φ if there is an (infinite) path s0 e0 ··· sm em ··· with sm ∪ em ⊨ φ for some m > 0.
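To make the operators AX, EX and EF concrete, here is a minimal Python sketch of how they can be computed on an explicit transition relation. It is deliberately simplified and rests on several assumptions: the three-state system is hypothetical and is not the transition system of D2, transition labels are ignored so that formulas are evaluated on states only, and the function names are our own.

```python
# Hypothetical toy system (not D2): every state has at least one successor.
TRANSITIONS = {
    "safe": {"safe", "risky"},
    "risky": {"safe", "collision"},
    "collision": {"collision"},
}
STATES = set(TRANSITIONS)


def ex(target):
    """States with SOME successor in `target` (EX)."""
    return {s for s in STATES if TRANSITIONS[s] & target}


def ax(target):
    """States ALL of whose successors lie in `target` (AX)."""
    return {s for s in STATES if TRANSITIONS[s] <= target}


def ef(target):
    """States from which `target` is reachable (EF), as a least fixpoint."""
    reached = set(target)
    while True:
        grown = reached | ex(reached)
        if grown == reached:
            return reached
        reached = grown


if __name__ == "__main__":
    collision = {"collision"}
    print("EX collision:", sorted(ex(collision)))
    print("AX not collision:", sorted(ax(STATES - collision)))
    print("EF collision:", sorted(ef(collision)))
```

The duality EX φ = ¬AX ¬φ shows up here as `ex(T) == STATES - ax(STATES - T)` for any set of states T.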
Marek Sergot
11.4 Example: a simple co-ordination mechanism
We now consider a slightly more elaborate version of the trains example. In general, we want to be able to verify formally whether the introduction of additional control mechanisms (additional controller agents, communication devices, restrictions on agents' possible actions) is effective in ensuring that agents comply with the norms ('social laws') that govern their behaviour. For the trains, we might consider a controller of some kind, or traffic lights, or some mechanism by which the trains communicate their locations to one another. For the sake of an example, we will suppose that there is a physical token (a metal ring, say) which has to be collected before a train can enter the tunnel. A train must pick up the token before entering the tunnel, and it must deposit it outside the tunnel as it exits. No train may enter the tunnel without possession of the token. To construct the (C+)++ action description D3 for this version of the example, we begin as usual with the C+ action description Dtrains of section 11.2. We add a fluent constant tok to represent the position of the token. It has values {W, E, a, b}. tok=W represents that the token is lying at the West end of the tunnel, tok=a that the token is currently held by train a, and so on. We add Boolean action constants pick(a), pick(b) to represent that a (resp., b) picks up the token, and drop(a), drop(b) to represent that a (resp., b) drops the token at its current location. For convenience, we will keep the action constants enter(a), enter(b), exit(a), exit(b) defined as in D2 of the previous section. The following causal laws describe the effects of picking up and dropping the token. To avoid duplication, x and l are variables ranging over a and b and locations W, E, t respectively.
  inertial tok
  drop(x) causes tok=l if tok=x ∧ loc(x)=l
  nonexecutable drop(x) if tok≠x
  pick(x) causes tok=x
  nonexecutable pick(x) if loc(x)≠tok
The above specifies that the token can be dropped by train x only if train x has the token (tok=x), and it can be picked up by train x only if train x and the token are currently at the same location (loc(x)=tok). Notice that, as defined, an action drop(x) ∧ x=stay drops the token at the current location of train x, and drop(x) ∧ x=go drops it at the new location of train x after it has moved. Since tok=t is not a well-formed atom, it is not possible that (there is no transition in which) the token is dropped inside the tunnel. pick(x) ∧ x=go represents an action in which train x picks up the token and moves with it. More refined representations could of course be constructed but this simple version will suffice for present purposes. The action description D3 is completed by adding the following permission laws:
  not-permitted enter(x) if tok≠x ∧ ¬pick(x)
  oblig drop(x) if exit(x) ∧ tok=x
It may be helpful to note that in (C+)++, the first of these laws is equivalent to
  oblig pick(x) if enter(x) ∧ tok≠x
The coloured transition system defined by action description D3 is larger and more complicated than that for D2 of the previous section, and cannot be drawn easily in its entirety. A fragment is shown in Fig. 11.3.
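The executability conditions and the first permission law can be paraphrased as simple predicates. The Python sketch below is only a reading aid under stated assumptions: it does not implement the semantics of (C+)++, the function names are hypothetical, and a state is reduced to the token value `tok` plus a dictionary `loc` of train locations.

```python
def can_drop(tok, x):
    """nonexecutable drop(x) if tok != x: only the holder can drop the token."""
    return tok == x


def can_pick(tok, loc, x):
    """nonexecutable pick(x) if loc(x) != tok: train and token must coincide."""
    return loc[x] == tok


def enter_is_green(tok, x, picks_now):
    """not-permitted enter(x) if tok != x and not pick(x).

    Entering complies with the norm when the train already holds the token
    or picks it up in the same transition.
    """
    return tok == x or picks_now


if __name__ == "__main__":
    loc = {"a": "W", "b": "E"}   # both trains outside the tunnel
    tok = "W"                    # token lying at the West end
    print(can_pick(tok, loc, "a"), can_pick(tok, loc, "b"))
    print(enter_is_green(tok, "a", picks_now=True))
    print(enter_is_green(tok, "b", picks_now=False))
```

Note that `enter_is_green` only classifies a transition as permitted or not; a non-compliant train can still enter, which is exactly why red transitions exist in the coloured transition system.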
[Figure 11.3: fragment of the transition diagram; state labels follow the convention described in the caption.]
Fig. 11.3. Fragment of the coloured transition system defined by D3. The figure shows all states but not all transitions. The dash in state labels indicates the position of the token: it is at W/E when the dash is on the left/right, and with train a/b when the dash appears above the location of a/b. Dotted lines depict red transitions. All other depicted transitions, and all states, are green.
One property we might wish to verify on D3 is that collisions are guaranteed to be avoided if both trains comply with the norms ('social laws'). Using the 'Causal Calculator' CCALC, we can try to determine whether
comp(Γ_m^{D3}) ⊨ ¬collision[0] ∧ trans[0]=green ∧ ... ∧ trans[m−1]=green → ¬collision[m]
that is, whether the formula comp(Γ_m^{D3}) ∧ ¬collision[0] ∧ trans[0]=green ∧ ··· ∧ trans[m−1]=green ∧ collision[m] is satisfiable. But what should we take as the
length m of the longest path to be considered? In some circumstances it is possible to find a suitable value m for the longest path to be considered but it is far from obvious in this example what that value is, or even if one exists. The problem can be formulated conveniently as a model checking problem in CTL. The CTL formula E[trans=green U collision] expresses that there is at least one path with collision true at some future state and trans=green true on all intervening transitions. So the property we want to verify can be expressed in CTL as follows:
¬collision → ¬E[trans=green U collision] (11.14)
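The until-formula can be evaluated by a backward fixpoint that follows green transitions only, which avoids having to fix a path length m in advance. The Python sketch below illustrates this on a hypothetical four-state abstraction; it is not the transition system of D3, and the state names, edge colours and the function name `eu_green` are all our own assumptions.

```python
# Hypothetical coloured transition system: (source, colour, destination).
EDGES = [
    ("both_outside", "green", "inside_with_token"),
    ("both_outside", "red", "inside_without_token"),
    ("inside_with_token", "green", "both_outside"),
    ("inside_with_token", "red", "collision"),
    ("inside_without_token", "green", "both_outside"),
    ("inside_without_token", "green", "collision"),
    ("collision", "green", "collision"),
]
STATES = {src for src, _, _ in EDGES} | {dst for _, _, dst in EDGES}


def eu_green(goal):
    """States satisfying E[trans=green U goal].

    Least fixpoint: start from the goal states and repeatedly add every
    state that has a green transition into the set collected so far.
    """
    reached = set(goal)
    changed = True
    while changed:
        changed = False
        for src, colour, dst in EDGES:
            if colour == "green" and dst in reached and src not in reached:
                reached.add(src)
                changed = True
    return reached


if __name__ == "__main__":
    bad = eu_green({"collision"})
    # Non-collision states from which a norm-compliant path reaches a collision.
    print(sorted(bad - {"collision"}))
```

In this toy system the counterpart of property (11.14) fails because of the state in which a train is already inside without the token, while it holds when checking starts from `both_outside`; this mirrors the kind of situation the safety analysis has to distinguish.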
It can be seen from Fig. 11.3 that property (11.14) is not valid on the action description D3: there are green transitions leading to collision states, from states where there is already a train inside the tunnel without the token. However, as long as we consider states in which both trains are initially outside the tunnel, the safety property we seek can be verified. The following formula is valid on D3: loc(a)≠t ∧ loc(b)≠t → ¬E
[Figure 12.2: plot; horizontal axis: count of neighbourhoods, 1 to 10.]
Fig. 12.2. nokNN results.
Table 12.3. The worst and best performance of kNN along with the corresponding values for k. Also the performance of nokNN when 10 neighbourhoods are used.
             kNN worst case    kNN best case     nokNN (all of 10)
Dataset      k    %correct     k    %correct     %correct
Australian   2    83.04        10   85.48        85.15
Colic        7    79.64        2    82.89        82.63
Diabetes     1    71.73        2    74.22        74.86
Hepatitis    1    78.71        2    79.35        79.35
Iris         1    93.33        3    96.00        96.00
Sonar        10   65.89        1    72.08        76.43
Wine         3    89.29        1    92.65        93.21
Average           80.23             83.24        83.95
12 Nearest Neighbours without k
thermore the stabilised performance is comparable (in fact slightly better in our experiment on the datasets) to the best performance of kNN within 10 neighbourhoods.
12.5 Summary and conclusion
In this paper we have discussed the "choice-of-k" issue related to the kNN method for classification. In order for kNN to be less dependent on the choice of value for k, we proposed to look at multiple sets of nearest neighbours rather than just one set of k nearest neighbours. A set of neighbours is here called a neighbourhood. For a data record t each neighbourhood bears certain support for different possible classes. The key question is: how can we aggregate these supports to give a more reliable support value which better reveals the true class of t? In order to answer this question we presented a probability function, G. It is defined in terms of a mass function on events and it takes into account the cardinality of events. A mass function is a basic probability assignment for events. For the classification problem, an event is specified as a neighbourhood and a mass function is taken to represent the degree of support for a particular class from different neighbourhoods. Under this specification we have shown that G is a linear function of conditional probability, which can be used to determine the class of a new data record. In other words we calculate G from a set of neighbourhoods, then we calculate the conditional probability from G according to the linear equation, and finally we classify based on the conditional probability. We designed and implemented a classification algorithm based on the contextual probability: nokNN. Experiments on some public datasets show that using nokNN the classification performance (accuracy) increases as the number of neighbourhoods increases but stabilises after a small number of neighbourhoods; using the standard voting kNN, however, the classification performance varies when different neighbourhoods are used. Experiments further show that the stabilised performance of nokNN is comparable to (in fact, slightly better than) the best performance of kNN. This fulfils our objective.
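The general idea of consulting several neighbourhoods instead of a single k can be illustrated with a short Python sketch. It must be stressed that this is not the function G or the nokNN algorithm summarised above: as a stand-in, the class support is simply averaged over the neighbourhoods k = 1, ..., max_k, and the one-dimensional toy data and all function names are hypothetical.

```python
from collections import Counter


def knn_support(train, query, k):
    """Share of each class label among the k nearest training points."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / k for label, n in counts.items()}


def knn_predict(train, query, k):
    """Standard voting kNN for a single, fixed k."""
    support = knn_support(train, query, k)
    return max(support, key=support.get)


def averaged_predict(train, query, max_k):
    """Stand-in aggregation: average the class support over k = 1..max_k."""
    totals = Counter()
    for k in range(1, max_k + 1):
        for label, share in knn_support(train, query, k).items():
            totals[label] += share / max_k
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Hypothetical 1-D data: (feature value, class label).
    train = [(0.0, "a"), (1.0, "a"), (2.0, "a"), (2.6, "b"), (5.0, "b"), (6.0, "b")]
    print(knn_predict(train, 2.4, 1), knn_predict(train, 2.4, 3))
    print(averaged_predict(train, 2.4, 5))
```

On this toy data the single-neighbourhood answer changes with k, whereas the averaged answer depends on all five neighbourhoods at once; this is the kind of sensitivity to k that motivates aggregating over neighbourhoods.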
References
1. Atkeson, C. G., Moore, A. W., and Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11(1-5):11-73.
2. Han, J. and Kamber, M. (2000). Data Mining: Concepts and Techniques. Morgan Kaufmann.
3. Hand, D., Mannila, H., and Smyth, P. (2001). Principles of Data Mining. The MIT Press.
4. Smets, P. and Kennes, R. (1994). The transferable belief model. Artificial Intelligence, 66(2):191-234.
13 Classifiers Based on Approximate Reasoning Schemes
Jan Bazan¹ and Andrzej Skowron²
¹ Institute of Mathematics, University of Rzeszow, Rejtana 16A, 35-959 Rzeszow, Poland [email protected]
² Institute of Mathematics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland [email protected]
Summary. We discuss classifiers [3] for complex concepts constructed from data sets and domain knowledge using approximate reasoning schemes (AR schemes). The approach is based on granular computing methods developed using rough set and rough mereological approaches [9, 13, 7]. In experiments we use a road simulator (see [15]) making it possible to collect data, e.g., on vehicle-agents movement on the road, at the crossroads, and data from different sensor-agents. We compare the quality of two classifiers: the standard rough set classifier based on the set of minimal decision rules and the classifier based on AR schemes.
13.1 Introduction
A classification algorithm (classifier) permits making a forecast in new situations on the basis of accumulated knowledge. We consider here classifiers predicting decisions for objects previously unseen; each new object will be assigned to a class belonging to a predefined set of classes on the basis of observed values of suitably chosen attributes (features). Many approaches have been proposed for constructing classifiers. Among them we would like to mention classical and modern statistical techniques, neural networks, decision trees, decision rules and inductive logic programming (see e.g. [5] for more details). One of the most popular methods for constructing classification algorithms is based on learning rules from examples. The standard rough set methods based on calculation of so-called local reducts make it possible to compute, for given data, the descriptions of concepts by means of minimal consistent decision rules (see, e.g., [6], [2]). Searching for relevant patterns for complex concepts can be performed using AR schemes. AR schemes (see, e.g., [13]) can be treated as approximations of reasoning performed on concepts from domain knowledge and they represent relevant patterns for complex classifier construction. The proposed approach is based on granular
computing methods developed using rough set and rough mereological approaches [9, 13, 7]. In our experiments we use a road simulator (see [15]) making it possible to collect data, e.g., on vehicle-agents movement on the road and at the crossroads and data from different sensor-agents. The simulator also registers a few more features, whose values are defined by an expert. Any AR scheme is constructed from labelled approximate rules, called productions, that can be extracted from data using domain knowledge [13]. In the paper we present a method for extracting productions from data collected by the road simulator and an algorithm for classifying objects by productions, which can be treated as an algorithm for on-line synthesis of an AR scheme for any tested object. We report experiments supporting our hypothesis that classifiers induced using the AR schemes are of higher quality than the traditional rough set classifiers (see Section 13.5). For comparison we use data sets generated by the road simulator.
13.2 Approximate reasoning scheme
One of the main tasks of data exploration [4] is discovery from available data and expert knowledge of concept approximations expressing properties of the investigated objects and rules expressing dependencies between concepts. Approximation of a given concept can be constructed using relevant patterns. Any such pattern describes a set of objects belonging to the concept to a degree p where 0 < p < 1. Relevant patterns for complex concepts can be represented by AR schemes. AR schemes can be treated as approximations of reasoning performed on concepts from domain knowledge. Any AR scheme is constructed from labeled approximate rules, called productions. Productions can be extracted from data using domain knowledge. We define productions as parameterized implications with premises and conclusion built from patterns sufficiently included in the approximated concept.
[Figure 13.1: three production rules with conclusions C3 ≥ "large", C3 ≥ "medium", C3 ≥ "small" and premises over C1 and C2.]
Fig. 13.1. The example of production as a collection of three production rules
In Figure 13.1 we present an example of production for some concepts C1, C2 and C3 approximated by three linearly ordered layers small, medium, and large. This
production is a collection of three simpler rules, called production rules, with the following interpretation: (1) if the inclusion degree to a concept C1 is at least medium and to a concept C2 at least large then the inclusion degree to a concept C3 is at least large; (2) if the inclusion degree to a concept C1 is at least small and to a concept C2 at least medium then the inclusion degree to a concept C3 is at least medium; (3) if the inclusion degree to a concept C1 is at least small and to a concept C2 at least small then the inclusion degree to a concept C3 is at least small. The concept from the upper level of production is called the target concept of production, whilst the concepts from the lower level of production are called the source concepts of production. For example, in the case of the production from Figure 13.1, C3 is the target concept and C1, C2 are the source concepts.
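The three production rules just listed can be written down as data and applied mechanically. The Python sketch below is one possible encoding, not the representation used in the experiments: the layer ranks, the rule tuples and the function name `apply_production` are our own, and when several rules fire the strongest conclusion is returned.

```python
# Linearly ordered layers, encoded by rank.
LAYERS = {"small": 1, "medium": 2, "large": 3}

# Production for target C3: ((minimum layer for C1, minimum layer for C2), layer for C3).
PRODUCTION_C3 = [
    (("medium", "large"), "large"),
    (("small", "medium"), "medium"),
    (("small", "small"), "small"),
]


def apply_production(production, degrees):
    """Return the strongest conclusion whose premises are all met, else None.

    `degrees` gives the observed inclusion layers for the source concepts,
    in the same order as the premises (here: C1, C2).
    """
    best = None
    for premises, conclusion in production:
        satisfied = all(LAYERS[d] >= LAYERS[p] for d, p in zip(degrees, premises))
        if satisfied and (best is None or LAYERS[conclusion] > LAYERS[best]):
            best = conclusion
    return best


if __name__ == "__main__":
    print(apply_production(PRODUCTION_C3, ("medium", "large")))
    print(apply_production(PRODUCTION_C3, ("large", "medium")))
    print(apply_production(PRODUCTION_C3, ("small", "small")))
```

Because the premises are lower bounds ("at least"), an object that satisfies rule (1) also satisfies rules (2) and (3); taking the strongest conclusion keeps the most informative one.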
Fig. 13.2. Synthesis of approximate reasoning scheme
One can construct an AR scheme by composing single production rules chosen from different productions from a family of productions for various target concepts. In Figure 13.2 we have two productions. The target concept of the first production is C5 and the target concept of the second production is C3. We select one production rule from the first production and one production rule from the second production. These production rules are composed, and a simple AR scheme is obtained that can be treated as a new two-level production rule. Notice that the target pattern of the lower production rule in this AR scheme is the same as one of the source patterns of the higher production rule. In this case, the common pattern is described as follows: the inclusion degree (of some pattern) to the concept C3 is at least medium. In this way, we can compose AR schemes into hierarchical and multilevel structures using productions constructed for various concepts.
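The composition step can be sketched as follows. This is a minimal illustration, not the authors' code: a rule is written as a pair (source patterns, target pattern), the function name compose is hypothetical, and the higher rule for C5 uses assumed patterns, since the text only fixes the common pattern C3 ≥ medium.

```python
from typing import Dict, Tuple

LAYERS = ["small", "medium", "large"]  # lowest first

# A production rule is written here as (source patterns, (target concept, layer)).
Rule = Tuple[Dict[str, str], Tuple[str, str]]


def compose(lower: Rule, higher: Rule) -> Rule:
    """Plug `lower` into `higher` through their common pattern.

    The target pattern of `lower` must equal one source pattern of `higher`;
    the result is a two-level rule over the remaining source patterns.
    """
    lower_sources, (concept, layer) = lower
    higher_sources, higher_target = higher
    if higher_sources.get(concept) != layer:
        raise ValueError("rules do not share a common pattern")
    merged = {c: t for c, t in higher_sources.items() if c != concept}
    for c, t in lower_sources.items():
        # If both rules constrain the same concept, keep the stricter layer.
        merged[c] = max(merged.get(c, t), t, key=LAYERS.index)
    return merged, higher_target


# Lower rule taken from the production for C3 (Figure 13.1).
LOWER: Rule = ({"C1": "small", "C2": "medium"}, ("C3", "medium"))
# Hypothetical higher rule for C5; its exact patterns are an assumption.
HIGHER: Rule = ({"C3": "medium", "C4": "small"}, ("C5", "medium"))

AR_SCHEME = compose(LOWER, HIGHER)
```

The resulting two-level rule no longer mentions C3: its premises are the patterns for C1, C2 and C4, and its conclusion is the pattern for C5.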
13.3 Road simulator
The road simulator (see [15]) is a tool for generating data sets recording vehicle movement on the road and at the crossroads. Such data are crucial for testing complex decision systems that monitor the situation on the road on the basis of information coming from different devices.
[Figure 13.3: screenshot of the simulation board. The status panel reads: Maximal number of vehicles: 20; Current number of vehicles: 14; Humidity: LACK; Visibility: 500; Traffic parameter of main road: 0.5; Traffic parameter of subordinate road: 0.2; Current simulation step: 68 (from 500); Saving data: NO.]
Fig. 13.3. The board of simulation
Driving simulation takes place on a board (see Figure 13.3) which presents a crossroads together with access roads. During the simulation the vehicles may enter the board from all four directions, that is, East, West, North and South. The vehicles coming to the crossroads from South and North have the right of way over the vehicles coming from West and East. Each vehicle entering the board has only one aim: to drive through the crossroads safely and leave the board. The simulation proceeds step by step, and during each step the vehicles may perform the following maneuvers: passing, overtaking, changing direction (at the crossroads), changing lane, entering the traffic from the minor road into the main road, stopping and pulling out. Planning each vehicle's further steps takes place independently in each step of the simulation. Each vehicle, "observing" the surrounding situation on the road and keeping in mind its destination and its own parameters (driver's profile), makes an independent decision about its further steps: whether it should accelerate or decelerate, and what (if any) maneuver should be commenced, continued, ended or stopped. When making decisions concerning further driving, a given vehicle takes into consideration its own parameters and the driving parameters of the five vehicles next to it, which are marked FR1, FR2, FL, BR and BL (see Figure 13.4).
[Figure 13.4: diagram of a given vehicle and the five neighbouring vehicles marked FL, FR2, FR1, BL and BR.]
Fig. 13.4. A given vehicle and five vehicles next to it
During the simulation the system registers a series of parameters of the local simulations, that is, simulations connected with each vehicle separately, as well as two global parameters of the simulation, that is, parameters connected with driving conditions during the simulation. The value of each simulation parameter may vary, and hence each parameter has to be treated as an attribute taking values from a specified value set. We associate the simulation parameters with the readouts of different measuring devices or technical equipment placed inside the vehicle or in the outside environment (e.g., by the road, in a helicopter observing the situation on the road, in a police car). These devices and equipment play the role of detecting devices or converters, that is, sensors (e.g., a thermometer, range finder, video camera, radar, image and sound converter). The attributes taking the simulation parameter values will, by analogy to the devices providing their values, be called sensors. Exemplary sensors are the following: initial and current road (four roads), distance from the crossroads (in screen units), current lane (two lanes), position of the vehicle on the road (values from 0.0 to 1.0), vehicle speed (values from 0.0 to 10.0), acceleration and deceleration, distance of a given vehicle from the FR1, FL, BR and BL vehicles and between FR1 and FR2 (in screen units), appearance of the vehicle at the crossroads (binary values), visibility (expressed in screen units, values from 50 to 500), and humidity (slipperiness) of the road (three values: lack of humidity (dry road), low humidity, high humidity). If, for some reason, the value of one of the sensors cannot be determined, the value of the parameter is set to NULL (missing value). Apart from sensors, the simulator registers a few more attributes, whose values are determined from the sensor values in a way specified by an expert. In the present version of the simulator these parameters take binary values and are therefore called concepts. The results returned by testing concepts are very often of the form YES, NO or DOES NOT CONCERN (NULL value). Here are exemplary concepts:
1. Is the vehicle forcing the right of way at the crossroads?
2. Is there free space on the right lane in order to end the overtaking maneuver?
3. Will the vehicle be able to easily overtake before the oncoming car?
4. Will the vehicle be able to brake before the crossroads?
5. Is the distance from the FR1 vehicle too short or do we predict it may happen shortly?
6. Is the vehicle overtaking safely?
7. Is the vehicle driving safely?
Besides binary concepts, the simulator registers for any such concept one special attribute that approximates the binary concept by six linearly ordered layers: certainly YES, rather YES, possibly YES, possibly NO, rather NO and certainly NO. Some concepts related to the situation on the road are simple, and classifiers for them can be induced directly from sensor measurements, but for more complex concepts this is infeasible. In searching for classifiers for such concepts, domain knowledge can be helpful. The relationships between concepts represented in domain knowledge can be used to construct hierarchical relationship diagrams. Such diagrams can be used to induce multi-layered classifiers for complex concepts (see [14] and the next section). Figure 13.5 shows an exemplary relationship diagram for the above-mentioned concepts. The concept specification and concept dependencies are usually not given automatically in accumulated data sets. Therefore they should be extracted from domain knowledge. Hence, the role of human experts is very important in our approach. During the simulation, when a new vehicle appears on the board, its so-called driver's profile is determined. It may take one of the following values: a very careful driver, a careful driver and a careless driver. The driver's profile is the identity of the driver, and further decisions as to the way of driving are made according to this identity. Depending on the driver's profile and weather conditions (humidity of the road and visibility), speed limits are determined, which cannot be exceeded.
The data generated during the simulation are stored in a data table (information system). Each row of the table depicts the situation of a single vehicle, and the sensors' and concepts' values are registered for a given vehicle and for the FR1, FR2, FL, BL and BR vehicles (associated with a given vehicle). Within each simulation step, descriptions of the situations of all the vehicles on the road are saved to a file.
[Figure 13.5: relationship diagram with the nodes Safe driving; Safe overtaking; Safe distance from FL during overtaking; Forcing the right of way; Possibility of going back to the right lane; Possibility of safe stopping before the crossroads; and SENSORS.]
Fig. 13.5. The relationship diagram for presented concepts
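One row of such a data table can be pictured as a small record. The sketch below is an assumption-laden illustration: the class VehicleRow, its field names and the sample sensor names are hypothetical and are not the simulator's actual file format; NULL values are represented by None.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Labels of the five vehicles associated with a given vehicle.
NEIGHBOURS = ("FR1", "FR2", "FL", "BL", "BR")


@dataclass
class VehicleRow:
    """One table row: the situation of a single vehicle in one simulation step."""
    step: int
    sensors: Dict[str, Optional[float]]    # None stands for NULL (missing value)
    concepts: Dict[str, Optional[bool]]    # None stands for DOES NOT CONCERN
    neighbours: Dict[str, Optional["VehicleRow"]] = field(default_factory=dict)

    def missing_sensors(self) -> List[str]:
        """Names of the sensors whose value could not be determined."""
        return sorted(name for name, value in self.sensors.items() if value is None)


# Hypothetical row: the distance to FR1 is unknown in this step.
ROW = VehicleRow(
    step=68,
    sensors={"speed": 4.5, "visibility": 500.0, "distance_FR1": None},
    concepts={"safe_driving": True, "safe_overtaking": None},
)
```

Keeping missing values explicit, rather than dropping the attribute, mirrors the NULL convention described in the text.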
13.4 Algorithm for classifying objects by production
In this section we present an algorithm for classifying objects by a given production, but first we have to describe the method for inducing productions. To outline this method, let us assume that a given concept C registered by the road simulator depends on two concepts C1 and C2 (also registered by the road simulator). Each of these concepts can be approximated by six linearly ordered layers: certainly YES, rather YES, possibly YES, possibly NO, rather NO and certainly NO. We induce classifiers for the concepts C1 and C2. These classifiers generate the corresponding weight (the name of one of the six approximation layers) for any tested object. We construct for the target concept C a table T over the Cartesian product of sets defined by weight patterns for C1 and C2, assuming that some additional constraints hold. Next, we add to the table T a last column containing the expert decision. From the table T, we extract production rules describing dependencies between these three concepts. In Figure 13.6 we illustrate the process of extracting a production rule for the concept C and for the approximation layer rather YES of the concept C. The production rule can be extracted in the following four steps:
1. Select all rows from the table T in which the value of column C is not less than rather YES.
2. Find the minimal values of the attributes C1 and C2 from table T for the rows selected in the previous step (in our example it is easy to see that for the attribute C1 the minimal value is possibly YES and for the attribute C2 the minimal value is rather YES).
3. Set the source patterns of the new production rule on the basis of the minimal values of attributes found in the previous step.
4. Set the target pattern of the new production rule, i.e., the concept C with the value rather YES.
Finally, we obtain the production rule:
(*) If (C1 ≥ possibly YES) and (C2 ≥ rather YES) then (C ≥ rather YES).
[Figure 13.6: the table T with columns C1, C2 and C, whose entries are approximation layers, annotated with the target pattern of the production rule (C ≥ rather YES) and its source patterns (C1 ≥ possibly YES, C2 ≥ rather YES). The layers are ordered: certainly YES > rather YES > possibly YES > possibly NO > rather NO > certainly NO.]
Fig. 13.6. The illustration of production rule extracting
A given tested object can be classified by the production rule (*) when the weights generated for the object by the classifiers induced for the concepts from the rule premise are at least equal to the degrees from the source (premise) patterns of the rule. Then the production rule classifies the tested object to the target (conclusion) pattern.
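The four extraction steps can be sketched as below. The table rows are hypothetical, chosen only so that the result reproduces rule (*); the function name extract_production_rule and the dictionary layout are likewise assumptions, not the authors' implementation.

```python
# Six approximation layers, lowest first.
LAYERS = ["certainly NO", "rather NO", "possibly NO",
          "possibly YES", "rather YES", "certainly YES"]
RANK = {name: i for i, name in enumerate(LAYERS)}

# Hypothetical table T: weights for C1 and C2 plus the expert decision C.
TABLE_T = [
    {"C1": "certainly YES", "C2": "certainly YES", "C": "certainly YES"},
    {"C1": "certainly NO",  "C2": "certainly NO",  "C": "certainly NO"},
    {"C1": "rather YES",    "C2": "certainly YES", "C": "rather YES"},
    {"C1": "possibly YES",  "C2": "possibly NO",   "C": "possibly YES"},
    {"C1": "possibly YES",  "C2": "rather YES",    "C": "rather YES"},
    {"C1": "possibly YES",  "C2": "certainly NO",  "C": "possibly NO"},
    {"C1": "certainly YES", "C2": "rather YES",    "C": "certainly YES"},
]


def extract_production_rule(table, target_layer, sources=("C1", "C2")):
    """Return (source patterns, target pattern) or None if no row qualifies."""
    # Step 1: select rows whose decision C is not less than the target layer.
    selected = [row for row in table if RANK[row["C"]] >= RANK[target_layer]]
    if not selected:
        return None
    # Step 2: minimal value of every source attribute over the selected rows.
    minima = {c: min((row[c] for row in selected), key=RANK.get) for c in sources}
    # Steps 3 and 4: the minima are the source patterns ("at least" thresholds),
    # and the target pattern is C with the requested layer.
    return minima, ("C", target_layer)


RULE_STAR = extract_production_rule(TABLE_T, "rather YES")
```

Calling the function once per approximation layer of C would yield the collection of production rules that forms a production.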
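[Figure 13.7: the object u1 (C1 = certainly YES, C2 = rather YES) is matched by the source patterns C1 ≥ possibly YES and C2 ≥ rather YES and is classified to C ≥ rather YES; the object u2 (C1 = possibly YES, C2 = certainly NO) is not matched.]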
Fig. 13.7. Classifying tested objects by single production rule
For example, the object u1 from Figure 13.7 is classified by the production rule (*) because it is matched by both patterns from the left-hand side of the production rule (*), whereas the object u2 from Figure 13.7 is not classified by the production rule (*) because it is not matched by the second source pattern of the production rule (*) (the value of the attribute C2 is less than rather YES). The method of extracting a production rule presented above can be applied for various values of the attribute C. In this way, we obtain a collection of production rules, which we call a production. Using production rules selected from productions we can compose AR schemes (see Section 13.2). In this way relevant patterns for more complex concepts are constructed. Any tested object is classified by an AR scheme if it is matched by all sensory patterns from this AR scheme. The method of object classification based on a production can be described as follows:
1. Preclassify the object to the production domain.
2. Classify the object by the production.
We assume that for any production a production guard (a Boolean expression) is given. Such a guard describes the production domain and is used in the preclassification of tested objects. The production guard is constructed using domain knowledge. An object can be classified by a given production if it satisfies the production guard. For example, let us assume that the production P is generated for the concept "Is the vehicle overtaking safely?". Then an object-vehicle u is classified by the production P iff u is overtaking. Otherwise, the message "HAS NOTHING TO DO WITH (OVERTAKING)" is returned. Now we can present the algorithm for classifying objects by production.
Algorithm 1. The algorithm for classifying objects by production
Step 1 Select a complex concept C from the relationship diagram.
Step 2 If the tested object should not be classified by a given production P extracted for the selected concept C, i.e., it does not satisfy the production guard: return HAS NOTHING TO DO WITH.
Step 3 Find a rule from the production P that classifies the object with the maximal degree to the target concept of the rule; if such a rule of P does not exist: return I DO NOT KNOW.
Step 4 Generate a decision value for the object from the degree extracted in the previous step: if the extracted degree is greater than possibly YES, then the object is classified to the concept C (return YES); otherwise the object is not classified to the concept C (return NO).
The algorithm for classifying objects by production presented above can be treated as an algorithm of dynamical synthesis of an AR scheme for any tested object. It is easy to see that during classification any tested object is classified by a single production rule selected from the production. It means that the production rule is dynamically assigned to the tested object. In other words, the approximate reasoning scheme is dynamically synthesized for any tested object. We claim that the quality of the classifier presented above is higher than that of the classifier constructed using the algorithm based on the set of minimal decision rules. In the next section we present the results of experiments with data sets generated by the road simulator that support this claim.
Table 13.1. Results of experiments for the concept: "Is the vehicle overtaking safely?"
Decision class          Method  Accuracy  Coverage  Real accuracy
YES                     RS      0.949     0.826     0.784
YES                     ARS     0.973     0.974     0.948
NO                      RS      0.889     0.979     0.870
NO                      ARS     0.926     1.0       0.926
All classes (YES + NO)  RS      0.999     0.996     0.995
All classes (YES + NO)  ARS     0.999     0.999     0.998
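Steps 2 to 4 of Algorithm 1 can be sketched as a single function. This is an illustration under assumptions: the function name and argument layout are hypothetical, the weights of the source concepts are passed in directly instead of being produced by induced classifiers, and Step 4 follows the wording "greater than possibly YES" literally.

```python
# Six approximation layers, lowest first.
LAYERS = ["certainly NO", "rather NO", "possibly NO",
          "possibly YES", "rather YES", "certainly YES"]
RANK = {name: i for i, name in enumerate(LAYERS)}


def classify_by_production(obj, guard, rules, weights):
    """Classify `obj` by a production given as a list of (sources, target layer).

    guard   -- callable deciding whether obj belongs to the production domain
    weights -- layer generated for obj by the classifier of each source concept
    """
    # Step 2: preclassification by the production guard.
    if not guard(obj):
        return "HAS NOTHING TO DO WITH"
    # Step 3: target layers of all rules whose source patterns match obj.
    matched = [target for sources, target in rules
               if all(RANK[weights[c]] >= RANK[t] for c, t in sources.items())]
    if not matched:
        return "I DO NOT KNOW"
    best = max(matched, key=RANK.get)
    # Step 4: turn the maximal degree into a binary decision.
    return "YES" if RANK[best] > RANK["possibly YES"] else "NO"


# Rule (*) plus one hypothetical, very permissive rule for a low layer.
PRODUCTION = [
    ({"C1": "possibly YES", "C2": "rather YES"}, "rather YES"),
    ({"C1": "certainly NO", "C2": "certainly NO"}, "possibly NO"),
]
U1 = {"C1": "certainly YES", "C2": "rather YES"}
U2 = {"C1": "possibly YES", "C2": "certainly NO"}
```

With only rule (*) in the production, the object U2 matches no rule and the answer is I DO NOT KNOW; with the permissive second rule added, it is classified NO.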
13.5 Experiments with Data
To verify the effectiveness of classifiers based on AR schemes, we have implemented our algorithms in the AS-lib programming library. This is an extension of the RSESlib 2.1 programming library, which forms the computational kernel of the RSES system [16]. The experiments have been performed on data sets obtained from the road simulator. We have applied the train-and-test method for estimating accuracy (see, e.g., [5]). The data set consists of 18101 objects generated by the road simulator. This set was randomly divided into the train table (9050 objects) and the test table (9051 objects). In our experiments, we compared the quality of two classifiers: RS and ARS. For inducing RS we use the RSES system, which generates the set of minimal decision rules that are then used for classifying situations from the testing data. ARS is based on AR schemes. We compared the RS and ARS classifiers using the accuracy of classification, the learning time and the rule set size. We also checked the robustness of the classifiers. Table 13.1 and Table 13.2 show the results of the considered classification algorithms for the concept "Is the vehicle overtaking safely?" and for the concept "Is the vehicle driving safely?", respectively. One can see that the accuracy of the ARS algorithm is higher than the accuracy of the RS algorithm for the analyzed data set. Table 13.3 shows the learning time and the number of decision rules induced for the considered classifiers. For the number of decision rules we present the average number of rules over all concepts from the relationship diagram.
Table 13.2. Results of experiments for the concept: "Is the vehicle driving safely?"
Decision class          Method  Accuracy  Coverage  Real accuracy
YES                     RS      0.978     0.946     0.925
YES                     ARS     0.962     0.992     0.954
NO                      RS      0.633     0.740     0.468
NO                      ARS     0.862     0.890     0.767
All classes (YES + NO)  RS      0.964     0.935     0.901
All classes (YES + NO)  ARS     0.958     0.987     0.945
Table 13.3. Learning time and the rule set size for concept: "Is the vehicle driving safely?"
Method  Learning time  Rule set size
RS      801 seconds    835
ARS     247 seconds    189
One can see that the learning time for ARS is much shorter than for RS and the number of decision rules induced for ARS is much lower than the number of decision rules induced for RS.
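Tables 13.1 and 13.2 report accuracy, coverage and real accuracy without defining the last column. In every reported row the real accuracy agrees, up to rounding, with the product of accuracy and coverage, and the sketch below checks this reading. Treating real accuracy as that product is an assumption inferred from the numbers, not a definition stated in the chapter; the names ROWS, real_accuracy and MAX_GAP are hypothetical.

```python
# (concept, decision class, method, accuracy, coverage, reported real accuracy)
ROWS = [
    ("overtaking", "YES", "RS",  0.949, 0.826, 0.784),
    ("overtaking", "YES", "ARS", 0.973, 0.974, 0.948),
    ("overtaking", "NO",  "RS",  0.889, 0.979, 0.870),
    ("overtaking", "NO",  "ARS", 0.926, 1.0,   0.926),
    ("overtaking", "All", "RS",  0.999, 0.996, 0.995),
    ("overtaking", "All", "ARS", 0.999, 0.999, 0.998),
    ("driving",    "YES", "RS",  0.978, 0.946, 0.925),
    ("driving",    "YES", "ARS", 0.962, 0.992, 0.954),
    ("driving",    "NO",  "RS",  0.633, 0.740, 0.468),
    ("driving",    "NO",  "ARS", 0.862, 0.890, 0.767),
    ("driving",    "All", "RS",  0.964, 0.935, 0.901),
    ("driving",    "All", "ARS", 0.958, 0.987, 0.945),
]


def real_accuracy(accuracy: float, coverage: float) -> float:
    """Assumed reading: share of all test objects that are classified correctly."""
    return accuracy * coverage


# Largest difference between the product and the reported value over all rows.
MAX_GAP = max(abs(real_accuracy(a, c) - r) for *_, a, c, r in ROWS)
```

Under this reading, a classifier with high accuracy but low coverage (for example RS on the class NO in Table 13.2) is penalized for the objects it leaves unclassified.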
13.6 Summary
We have discussed a method for the construction (from data and domain knowledge) of classifiers for complex concepts using AR schemes (ARS classifiers). The experiments showed that:
• the accuracy of classification by ARS is better than the accuracy of the RS classifier,
• the learning time for ARS is much shorter than for RS,
• the number of decision rules induced for ARS is much lower than the number of decision rules induced for RS.
Finally, the ARS classifier is much more robust than the RS classifier. The results are consistent with the rough-mereological approach.
Acknowledgement. The research has been supported by the grant 3 T11C 002 26 from the Ministry of Scientific Research and Information Technology of the Republic of Poland.
References
1. Bazan J. (1998) A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision tables. In: [8]: 321-365
2. Bazan J., Nguyen H. S., Skowron A., Szczuka M. (2003) A view on rough set concept approximation. LNAI 2639, Springer, Heidelberg: 181-188
3. Friedman J. H., Hastie T., Tibshirani R. (2001) The elements of statistical learning: Data mining, inference, and prediction. Springer, Heidelberg
4. Kloesgen W., Zytkow J. (eds) (2002) Handbook of KDD. Oxford University Press
5. Michie D., Spiegelhalter D. J., Taylor C. C. (1994) Machine learning, neural and statistical classification. Ellis Horwood, New York
6. Pawlak Z. (1991) Rough sets: Theoretical aspects of reasoning about data. Kluwer, Dordrecht
7. Pal S. K., Polkowski L., Skowron A. (eds) (2004) Rough-Neuro Computing: Techniques for Computing with Words. Springer-Verlag, Berlin
8. Polkowski L., Skowron A. (eds) (1998) Rough Sets in Knowledge Discovery 1-2. Physica-Verlag, Heidelberg
9. Polkowski L., Skowron A. (1999) Towards adaptive calculus of granules. In: [17]: 201-227
10. Polkowski L., Skowron A. (2000) Rough mereology in information systems. A case study: Qualitative spatial reasoning. In: Polkowski L., Lin T. Y., Tsumoto S. (eds) Rough Sets: New Developments in Knowledge Discovery in Information Systems, Studies in Fuzziness and Soft Computing 56, Physica-Verlag, Heidelberg: 89-135
11. Skowron A. (2001) Toward intelligent systems: Calculi of information granules. Bulletin of the International Rough Set Society 5 (1-2): 9-30
12. Skowron A., Stepaniuk J. (2001) Information granules: Towards foundations of granular computing. International Journal of Intelligent Systems 16 (1): 57-86
13. Skowron A., Stepaniuk J. (2002) Information granules and rough-neuro computing. In: [7]: 43-84
14. Stone P. (2000) Layered Learning in Multi-Agent Systems: A Winning Approach to Robotic Soccer. The MIT Press, Cambridge, MA
15. The Road simulator Homepage - logic.mimuw.edu.pl/~bazan/simulator
16. The RSES Homepage - logic.mimuw.edu.pl/~rses
17. Zadeh L. A., Kacprzyk J. (eds) (1999) Computing with Words in Information/Intelligent Systems 1-2. Physica-Verlag, Heidelberg
14 Towards Rough Applicability of Rules
Anna Gomolińska*
University of Bialystok, Department of Mathematics, Akademicka 2, 15-267 Bialystok, Poland [email protected]
Summary. In this article, we further study the problem of soft applicability of rules within the framework of approximation spaces. Such forms of applicability are generally called rough. The starting point is the notion of graded applicability of a rule to an object, introduced in our previous work and referred to as fundamental. The abstract concept of rough applicability of rules comprises a vast number of particular cases. In the present paper, we generalize the fundamental form of applicability in two ways. Firstly, we more intensively exploit the idea of rough approximation of sets of objects. Secondly, a graded applicability of a rule to a set of objects is defined. A better understanding of rough applicability of rules is important for building the ontology of an approximate reason and, in the sequel, for modeling of complex systems, e.g., systems of social agents. Key words: approximation space, ontology of approximate reason, information granule, graded meaning of formulas, applicability of rules To Emilia
* Many thanks to James Peters, Alberto Pettorossi, Andrzej Skowron, Dominik Ślęzak, and last but not least to the anonymous referee for useful and insightful remarks. The research has been partially supported by the grant 3T11C00226 from the Ministry of Scientific Research and Information Technology of the Republic of Poland.
14.1 Introduction
It is hardly an exaggeration to say that soft application of rules is the prevailing form of rule following in real-life situations. Though some rules (e.g., instructions, regulations, laws, etc.) are supposed to be strictly followed, it usually means "as strictly as possible" in practice. Typically, people tend to apply rules "softly" whenever the expected advantages (gain) surpass the possible loss (failure, harm). Soft application of rules is usually more efficient and effective than the strict one, however, at the cost of the results obtained. In many cases, adaptation to changing situations requires a change in the mode of application of rules only, retaining the rules unchanged. Allowing rules to be applied softly simplifies multi-attribute decision making under missing or uncertain information as well. As a research problem, applicability of rules concerns strategies (meta-rules) which specify the permissive conditions for passing from premises to conclusions of rules. In this paper, we analyze soft applicability of rules within the framework of approximation spaces (ASs) or, in other words, rough applicability of rules. The first step has already been made by introducing the concept of graded applicability of a rule to an object of an AS [3]. This fundamental form of applicability is based on the graded satisfiability and meaning of formulas and their sets, studied in [2]. The intuitive idea is that a rule r is applicable to an object u in degree t iff a sufficiently large part of the set of premises of r is satisfied for u in a sufficient degree, where sufficiency is determined by t. We aim at extending and refining this notion step by step. For the time being, we propose two generalizations. In the first one, the idea of approximation of sets of objects is exploited more intensively. The second approach consists in extending the graded applicability of a rule to an object to the case of graded applicability of a rule to a set of objects. Studying various rough forms of applicability of rules is important for building the ontology of an approximate reason. In [9], Peters et al. consider structural aspects of such an ontology. A basic assumption made is that an approximate reason is a capability of an agent. Agents classify information granules, derived from sensors or received from other agents, in the context of ASs. One of the fundamental forms of reasoning is a reflective judgment that a particular object (granule of information) matches a particular pattern.
In the case of rules, agents judge whether or not, and how far, an object (set of objects) matches the conditions for applicability of a rule. As explained in [9]:
Judgment in agents is a faculty of thinking about (classifying) the particular relative to decision rules derived from data. Judgment in agents is reflective but not in the classical philosophical sense [...]. In an agent, a reflective judgment itself is an assertion that a particular decision rule derived from data is applicable to an object (input). [...] Again, unlike Kant's notion of judgment, a reflective judgment is not the result of searching for a universal that pertains to a particular set of values of descriptors. Rather, a reflective judgment by an agent is a form of recognition that a particular vector of sensor values pertains to a particular rule in some degree.
The ontology of an approximate reason may serve as a basis for modeling of complex systems like systems of social, highly adaptive agents, where rules are allowed to be followed flexibly and approximately. Since one and the same rule may be applied in many ways depending, among others, on the agent and the situation of (inter)action, we can to a higher extent capture the complexity of the modelled system by means of relatively fewer rules. Moreover, agents are given more autonomy in applying rules. From the technical point of view, degrees of applicability may serve as lists of tuning parameters to control application of rules. Another area of possible use of rough applicability is multi-attribute classification (and, in particular, decision making). In the case of an object to which no classification rule is applicable in the strict sense, we may try to apply an available rule roughly. This happens in real life, e.g., in the process of selection of the best candidate(s), where no candidate fully satisfies the requirements. If a decision is to be made anyway, some conditions should be omitted or their satisfiability should be treated less strictly. Rough applicability may also help in classification of objects where some values of attributes are missing. In Sect. 14.2, approximation spaces are overviewed. Section 14.3 is devoted to the notions of graded satisfiability and meaning of formulas. In Sect. 14.4, we generalize the fundamental notion of applicability in the two directions mentioned earlier. Section 14.5 contains a concise summary.
14.2 Approximation Spaces
The general notion of an approximation space (AS) was proposed by Skowron and Stepaniuk [13, 14, 16]. Any such space is a triple $M = (U, \Gamma, \kappa)$, where $U$ is a non-empty set, $\Gamma : U \mapsto \wp U$ is an uncertainty mapping, and $\kappa : (\wp U)^2 \mapsto [0,1]$ is a rough inclusion function (RIF). $\wp U$ and $(\wp U)^2$ denote the power set of $U$ and the Cartesian product $\wp U \times \wp U$, respectively. Originally, $\Gamma$ and $\kappa$ were equipped with tuning parameters, and the term "parameterized" was therefore used in connection with ASs. Exemplary ASs are the rough ASs, induced by the Pawlak information systems [6, 8]. Elements of $U$, called objects and denoted by $u$ with subscripts whenever needed, are known by their properties only. Therefore, some objects may be viewed as similar. Objects similar to an object $u$ constitute a granule of information in the sense of Zadeh [17]. Indiscernibility may be seen as a special case of similarity. Since every object is obviously similar to itself, the universe $U$ of $M$ is covered by a family of granules of information. The uncertainty mapping $\Gamma$ is a basic mathematical tool to describe formally granulation of information on $U$. For every object $u$, $\Gamma u$ is a set of objects similar to $u$, called an elementary granule of information drawn to $u$. By assumption, $u \in \Gamma u$. Elementary granules are merely building blocks to construct more complex information granules which form, possibly hierarchical, systems of granules. Simple examples of complex granules are the results of set-theoretical operations on granules obtained at some earlier stages, rough approximations of concepts, or meanings of formulas and sets of formulas in ASs. An adaptive calculus of granules, measure(s) of closeness and inclusion of granules, and construction of complex granules from simpler ones which satisfy a given specification are a few examples of related problems (see, e.g., [11, 12, 15, 16]).
In our approach, a RIF $\kappa : (\wp U)^2 \mapsto [0,1]$ is a function which assigns to every pair $(x, y)$ of subsets of $U$ a number in $[0,1]$ expressing the degree of inclusion of $x$ in $y$, and which satisfies postulates (A1)-(A3) for any $x, y, z \subseteq U$: (A1) $\kappa(x, y) = 1$ iff $x \subseteq y$; (A2) if $x \neq \emptyset$, then $\kappa(x, y) = 0$ iff $x \cap y = \emptyset$; (A3) if $y \subseteq z$, then $\kappa(x, y) \leq \kappa(x, z)$. Thus, our RIFs are somewhat stronger than the ones characterized by the axioms of rough mereology, proposed by Polkowski and Skowron [10, 12].
Rough mereology extends Leśniewski's mereology [4] to a theory of the relationship of being-a-part-in-degree. Among various RIFs, the standard ones deserve special attention. Let the cardinality of a set $x$ be denoted by $\#x$. Given a non-empty finite set $U$ and $x, y \subseteq U$, the standard RIF, $\kappa^{\pounds}$, is defined by
$$\kappa^{\pounds}(x, y) = \begin{cases} \dfrac{\#(x \cap y)}{\#x} & \text{if } x \neq \emptyset, \\ 1 & \text{otherwise.} \end{cases}$$
The notion of a standard RIF, based on the frequency count, goes back to Łukasiewicz [5]. In our framework, where infinite sets of objects are allowed, by a quasi-standard RIF we understand any RIF which for finite first arguments is like the standard one. In $M$, sets of objects (concepts) may be approximated in various ways (see, e.g., [1] for a discussion and references). In [14, 16], a concept $x \subseteq U$ is approximated by means of the lower and upper rough approximation mappings $\mathrm{low}, \mathrm{upp} : \wp U \mapsto \wp U$, respectively, defined by
$$\mathrm{low}\,x = \{u \in U \mid \kappa(\Gamma u, x) = 1\} \quad\text{and}\quad \mathrm{upp}\,x = \{u \in U \mid \kappa(\Gamma u, x) > 0\}. \tag{14.1}$$
By (A1)-(A3), the lower and upper rough approximations of $x$, $\mathrm{low}\,x$ and $\mathrm{upp}\,x$, are equal to $\{u \in U \mid \Gamma u \subseteq x\}$ and $\{u \in U \mid \Gamma u \cap x \neq \emptyset\}$, respectively. Ziarko [18, 19] generalized the Pawlak rough set model [7, 8] to a variable-precision rough set model by introducing variable-precision positive and negative regions of sets of objects. Let $t \in [0,1]$. Within the AS framework, in line with (14.1), the mappings of $t$-positive and $t$-negative regions of sets of objects, $\mathrm{pos}_t, \mathrm{neg}_t : \wp U \mapsto \wp U$, respectively, may be defined as follows, for any set of objects $x$:¹
$$\mathrm{pos}_t x = \{u \in U \mid \kappa(\Gamma u, x) \geq t\} \quad\text{and}\quad \mathrm{neg}_t x = \{u \in U \mid \kappa(\Gamma u, x) \leq t\}. \tag{14.2}$$
Notice that $\mathrm{low}\,x = \mathrm{pos}_1 x$ and $\mathrm{upp}\,x = U - \mathrm{neg}_0 x$.
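The definitions of this section can be tried out on a small finite universe. The sketch below is an illustration under assumptions: the universe, the partition into granules and all function names are hypothetical; the regions are taken as κ(Γu, x) ≥ t and κ(Γu, x) ≤ t, which is the reading under which low x = pos_1 x and upp x = U − neg_0 x hold.

```python
from fractions import Fraction

# Hypothetical finite universe; granules are the blocks of a partition,
# so that every object u belongs to its own granule GAMMA[u].
U = {1, 2, 3, 4, 5, 6}
BLOCKS = [{1, 2}, {3, 4}, {5, 6}]
GAMMA = {u: block for block in BLOCKS for u in block}


def std_rif(x, y):
    """Standard RIF: #(x ∩ y) / #x for non-empty x, and 1 otherwise."""
    return Fraction(len(x & y), len(x)) if x else Fraction(1)


def low(x):
    """Lower approximation: granule fully included in x."""
    return {u for u in U if std_rif(GAMMA[u], x) == 1}


def upp(x):
    """Upper approximation: granule overlapping x."""
    return {u for u in U if std_rif(GAMMA[u], x) > 0}


def pos(t, x):
    """t-positive region: inclusion degree of the granule in x is at least t."""
    return {u for u in U if std_rif(GAMMA[u], x) >= t}


def neg(t, x):
    """t-negative region: inclusion degree of the granule in x is at most t."""
    return {u for u in U if std_rif(GAMMA[u], x) <= t}


X = {1, 2, 3}  # a concept to be approximated
```

Exact fractions are used instead of floats so that comparisons with the thresholds 0, 1/2 and 1 are not affected by rounding.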
¹ The original definitions, proposed by Ziarko, are somewhat different.
14.3 The Graded Meaning of Formulas
Suppose a formal language $L$ expressing properties of $M$ is given. The set of all formulas of $L$ is denoted by FOR. We briefly recall basic ideas concerning the graded satisfiability and meaning of formulas and their sets, studied in [2]. Given a relation of (crisp) satisfiability of formulas for objects of $U$, $\models_c$, the c-meaning (or, simply, meaning) of a formula $\alpha$ is understood as the extension of $\alpha$, i.e., as the set $\|\alpha\|_c = \{u \in U \mid u \models_c \alpha\}$. For simplicity, "c" will be omitted in formulas whenever possible. By introducing degrees $t \in [0,1]$, we take into account the fact that objects are perceived through the granules of information attached to them. In the formulas below, $u \models_t \alpha$ reads as "$\alpha$ is $t$-satisfied for $u$" and $\|\alpha\|_t$ denotes the $t$-meaning of $\alpha$:
$$u \models_t \alpha \text{ iff } \kappa(\Gamma u, \|\alpha\|) \geq t \quad\text{and}\quad \|\alpha\|_t = \{u \in U \mid u \models_t \alpha\}. \tag{14.3}$$
In other words, $\|\alpha\|_t = \mathrm{pos}_t\|\alpha\|$. Next, for $t \in T = [0,1] \cup \{c\}$, the set of all formulas which are t-satisfied for an object $u$ is denoted by $|u|_t$, i.e., $|u|_t = \{\alpha \in \mathrm{FOR} \mid u \models_t \alpha\}$. Notice that $t = c$ is allowed here. The graded satisfiability of a formula for an object is generalized on the left-hand side to a graded satisfiability of a formula for a set of objects, and on the right-hand side to a graded satisfiability of a set of formulas for an object, where degrees are elements of $T_1 = T \times [0,1]$. For any n-tuple $t$ and $i = 1, \ldots, n$, let $\pi_i t$ denote the i-th element of $t$. For simplicity, we use $\models_t$, $|\cdot|_t$, and $\|\cdot\|_t$ both for the (object, formula) case and for its generalizations. Thus, for any object $u$, a set of objects $x$, a formula $\alpha$, a set of formulas $X$, a RIF $\kappa^* : (\wp\mathrm{FOR})^2 \mapsto [0,1]$, and $t \in T_1$,

$$x \models_t \alpha \ \text{ iff } \ \kappa(x, \|\alpha\|_{\pi_1 t}) \geq \pi_2 t \quad\text{and}\quad |x|_t = \{\alpha \in \mathrm{FOR} \mid x \models_t \alpha\};$$
$$u \models_t X \ \text{ iff } \ \kappa^*(X, |u|_{\pi_1 t}) \geq \pi_2 t \quad\text{and}\quad \|X\|_t = \{u \in U \mid u \models_t X\}. \tag{14.4}$$
$u \models_t X$ reads as "X is t-satisfied for u", and $\|X\|_t$ is the t-meaning of $X$. Observe that $\models_t$ extends the classical, crisp notions of satisfiability of the sorts (set-of-objects, formula) and (object, set-of-formulas). Along the standard lines, $x \models \alpha$ iff $\forall u \in x.\ u \models \alpha$, and $u \models X$ iff $\forall \alpha \in X.\ u \models \alpha$. Hence, $x \models \alpha$ iff $x \models_{(c,1)} \alpha$, and $u \models X$ iff $u \models_{(c,1)} X$. Properties of the graded satisfiability and meaning of formulas and sets of formulas may be found in [2]. Let us only mention that a non-empty finite set of formulas $X$ cannot be replaced by the conjunction $\bigwedge X$ of all its elements, as happens in the classical, crisp case. In the graded case, one can only prove that $\|\bigwedge X\|_t \subseteq \|X\|_{(t,1)}$, where $t \in T$, but the converse may not hold.
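The remark that the inclusion $\|\bigwedge X\|_t \subseteq \|X\|_{(t,1)}$ can be strict may be checked on a small case. The following sketch assumes the standard RIF both for sets of objects and for sets of formulas, and uses a hypothetical four-object space with two formulas "a" and "b" given by their crisp extensions; all names are illustrative.

```python
from fractions import Fraction

def rif(x, y):
    """Standard rough inclusion on finite sets."""
    x, y = set(x), set(y)
    return Fraction(len(x & y), len(x)) if x else Fraction(1)

def t_meaning(gamma, ext, t):
    """||alpha||_t: objects whose granule lies in the extension `ext` to degree >= t."""
    return {u for u in gamma if rif(gamma[u], ext) >= t}

def set_meaning(gamma, exts, t1, t2):
    """||X||_(t1,t2): objects for which at least a fraction t2 of the
    formulas in X are t1-satisfied."""
    names = set(exts)
    result = set()
    for u in gamma:
        satisfied = {a for a in names if rif(gamma[u], exts[a]) >= t1}
        if rif(names, satisfied) >= t2:
            result.add(u)
    return result

# Hypothetical data (assumptions for illustration only).
GAMMA = {1: {1, 2, 3, 4}, 2: {2, 3}, 3: {2, 3}, 4: {4}}
EXTS = {"a": {1, 2, 3}, "b": {2, 3, 4}}
t = Fraction(3, 4)

conj_meaning = t_meaning(GAMMA, EXTS["a"] & EXTS["b"], t)   # meaning of the conjunction
set_m = set_meaning(GAMMA, EXTS, t, 1)                      # meaning of the set {a, b}
print(conj_meaning, set_m)
```

Object 1 satisfies each formula to degree 3/4 but their conjunction only to degree 1/2, which is why it belongs to the second set and not to the first.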
14.4 The Graded Applicability of Rules Generalized
All rules over L, denoted by $r$ with subscripts whenever needed, constitute a set RUL. Any rule $r$ is a pair of finite sets of formulas of L, where the first element, $P_r$, is the set of premises of $r$ and the second element of the pair is a non-empty set of conclusions of $r$. Along the standard lines, a rule which is not applicable in a considered sense is called inapplicable. A rule $r$ is applicable to an object $u$ in the classical sense iff the whole set of premises $P_r$ is satisfied for $u$. The graded applicability of a rule to an object, viewed here as a fundamental form of rough applicability, is obtained by replacing the crisp satisfiability by its graded counterpart and by weakening the condition that all premises be satisfied [3]. Thus, for any $t \in T_1$,

$$r \in \mathrm{apl}_t u \ \text{ iff } \ \kappa^*(P_r, |u|_{\pi_1 t}) \geq \pi_2 t, \ \text{ i.e., iff } \ u \in \|P_r\|_t. \tag{14.5}$$
$r \in \mathrm{apl}_t u$ reads as "r is t-applicable to u".^2 Properties of $\mathrm{apl}_t$ are presented in [3]. Let us only note that the classical applicability and the (c, 1)-applicability coincide.
^2 Equivalently, "r is applicable to u in degree t".
Example 1. In the textile industry, a norm determining whether or not the quality of water to be used in the process of dyeing of textiles is satisfactory, may be written
as a decision rule $r$ with 16 premises and one conclusion $(d, \mathrm{yes})$. In this case, the objects of the AS considered are samples of water. The c-meaning of the conclusion of $r$ is the set of all samples of water $u \in U$ such that the water may be used for dyeing of textiles, i.e., $\|(d, \mathrm{yes})\| = \{u \in U \mid d(u) = \mathrm{yes}\}$. Let $a_1, \ldots, a_7$ denote the attributes: colour (mgPt/l), turbidity (mgSiO2/l), suspensions (mg/l), oxygen consumption (mgO2/l), hardness (mval/l), Fe content (mg/l), and Mn content (mg/l), respectively. Then, $(a_1, [0,20])$, $(a_2, [0,15])$, $(a_3, [0,20])$, $(a_4, [0,20])$, $(a_5, [0,1.8])$, $(a_6, [0,0.1])$, and $(a_7, [0,0.05])$ are exemplary premises of $r$. For instance, the c-meaning of $(a_2, [0,15])$ is the set of all samples of water such that their turbidity does not exceed 15 mgSiO2/l, i.e., $\|(a_2, [0,15])\| = \{u \in U \mid a_2(u) \leq 15\}$. Suppose that the values of $a_2$, $a_3$ slightly exceed 15 and 20, respectively, for some sample $u$, i.e., the second and the third premises are not satisfied for $u$, whereas all remaining premises hold for $u$. That is, $r$ is inapplicable to the sample $u$ in the classical sense, yet it is (c, 0.875)-applicable to $u$, since 14 of the 16 premises hold. Under special conditions such as, e.g., serious time constraints, applicability of $r$ to $u$ in degree (c, 0.875) may be viewed as sufficient or, in other words, the quality of $u$ may be viewed as satisfactory if the expected gain surpasses the possible loss.

Observe that $r \in \mathrm{apl}_t u$ iff $u \in I_{\wp U}\|P_r\|_t$, where $I_{\wp U}$ is the identity mapping on $\wp U$. A natural generalization of (14.5) is obtained by taking a mapping $f_{\$} : \wp U \mapsto \wp U$ instead of $I_{\wp U}$, where \$ is a possibly empty list of parameters. For instance, $f_{\$}$ may be an approximation mapping. In this way, we obtain a family of mappings $\mathrm{apl}_t^{f_{\$}} : U \mapsto \wp\mathrm{RUL}$, parameterized by $t \in T_1$ and \$, and such that for any $r$ and $u$,
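$$r \in \mathrm{apl}_t^{f_{\$}} u \ \text{ iff } \ u \in f_{\$}\|P_r\|_t. \tag{14.6}$$

The family is partially ordered by $\sqsubseteq$, where for any $t_1, t_2 \in T_1$,

$$\mathrm{apl}_{t_1}^{f_{\$}} \sqsubseteq \mathrm{apl}_{t_2}^{f_{\$}} \ \text{ iff } \ \forall u \in U.\ \mathrm{apl}_{t_1}^{f_{\$}} u \subseteq \mathrm{apl}_{t_2}^{f_{\$}} u. \tag{14.7}$$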
The general notion of rough applicability, introduced above, comprises a number of particular cases, including the fundamental one. In fact, $\mathrm{apl}_t = \mathrm{apl}_t^{I_{\wp U}}$. Next, e.g., $r \in \mathrm{apl}_t^{\mathrm{low}} u$ iff $u \in \mathrm{low}\|P_r\|_t$ iff $r$ is t-applicable to every object similar to $u$. In the same vein, $r \in \mathrm{apl}_t^{\mathrm{upp}} u$ iff $u \in \mathrm{upp}\|P_r\|_t$ iff $r$ is t-applicable to some object similar to $u$. We can also say that $r$ is certainly t-applicable and possibly t-applicable to $u$, respectively. In the variable-precision case, for $f = \mathrm{pos}_s$ and $s \in [0,1]$, $r \in \mathrm{apl}_t^{f} u$ iff $u \in \mathrm{pos}_s\|P_r\|_t$ iff $r$ is t-applicable to a sufficiently large part of $\Gamma u$, where sufficiency is determined by $s$. In a more sophisticated case, where $f = \mathrm{pos}_s \circ \mathrm{low}$ ($\circ$ denotes the composition of mappings), $r \in \mathrm{apl}_t^{f} u$ iff $u \in \mathrm{pos}_s\,\mathrm{low}\|P_r\|_t$ iff $\kappa(\Gamma u, \mathrm{low}\|P_r\|_t) \geq s$ iff $r$ is certainly t-applicable to a sufficiently large part of $\Gamma u$, where sufficiency is determined by $s$. Etc. For $t = (t_1, t_2) \in [0,1]^2$, the various forms of rough t-applicability are determined up to granularity of information. An object $u$ is merely viewed as a representative of the granule of information $\Gamma u$ drawn to it. More precisely, a rule $r$ may practically be treated as applicable to $u$ even if no premise is, in fact, satisfied for $u$.
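The special cases above (identity, low, upp, pos_s) can be sketched in Python for the crisp first degree $\pi_1 t = c$, in the spirit of Example 1. The sample names, the counts of satisfied premises, and the similarity granules below are hypothetical, and the standard RIF is assumed throughout.

```python
from fractions import Fraction

def rif(x, y):
    """Standard rough inclusion on finite sets."""
    x, y = set(x), set(y)
    return Fraction(len(x & y), len(x)) if x else Fraction(1)

def premise_meaning(sat, n_premises, t2):
    """||P_r||_(c,t2): objects for which at least a fraction t2 of the
    premises hold crisply; sat[u] is the number of premises satisfied by u."""
    return {u for u, k in sat.items() if Fraction(k, n_premises) >= t2}

def low(gamma, x):
    return {u for u in gamma if rif(gamma[u], x) == 1}

def upp(gamma, x):
    return {u for u in gamma if rif(gamma[u], x) > 0}

def pos(gamma, x, s):
    return {u for u in gamma if rif(gamma[u], x) >= s}

def applicable(u, meaning, f=lambda x: x):
    """r in apl_t^f u  iff  u in f(||P_r||_t); f defaults to the identity."""
    return u in f(meaning)

# Hypothetical water samples: a rule with 16 premises, and the number
# of premises each sample satisfies (assumptions for illustration only).
N_PREMISES = 16
SAT = {"w1": 16, "w2": 14, "w3": 15, "w4": 9}
GAMMA = {"w1": {"w1", "w2"}, "w2": {"w1", "w2", "w3"},
         "w3": {"w2", "w3", "w4"}, "w4": {"w3", "w4"}}

crisp = premise_meaning(SAT, N_PREMISES, 1)                 # classical applicability
graded = premise_meaning(SAT, N_PREMISES, Fraction(7, 8))   # degree (c, 0.875)

print(applicable("w2", crisp), applicable("w2", graded))
print(applicable("w3", graded, lambda x: low(GAMMA, x)))
print(applicable("w3", graded, lambda x: pos(GAMMA, x, Fraction(2, 3))))
```

Passing the mapping `f` as a function argument mirrors the way (14.6) is parameterized: swapping the identity for an approximation mapping changes the notion of applicability without touching the graded meaning of the premises.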
It is enough that premises are satisfiable for a sufficiently large part of the set of objects similar to $u$. If used reasonably, this feature may be advantageous in the case of missing data. The very idea is intensified in the case of $\mathrm{pos}_s$. Then, $r$ is t-applicable to $u$ in the sense of $\mathrm{pos}_s$ iff it is t-applicable to a sufficiently large part of the set of objects similar to $u$, where sufficiency is determined by $s$. This form of applicability may be helpful in the classification of $u$ if we cannot check whether or not $r$ is applicable to $u$ and, on the other hand, it is known that $r$ is applicable to a sufficiently large part of the set of objects similar to $u$. Next, rough applicability in the sense of low is useful in modeling situations where the stress is laid on the equal treatment of all objects forming a granule of information. A form of stability of rules may be defined, where $r$ is called stable in a sense considered if for every $u$, $r$ is applicable to $u$ iff $r$ is applicable to all objects similar to $u$ in the very sense.

Example 2. Consider a situation of decision making whether or not to support a student financially. In this case, objects of the AS are students applying for a bursary. Suppose that some data concerning a person $u$ is missing, which makes decision rules inapplicable to $u$ in the classical, crisp sense. For simplicity, assume that $r$ would be the only decision rule applicable to $u$ if the data were not missing. Let $\alpha$ be the premise of $r$ of which we cannot be sure whether it is satisfied for $u$ or not. Suppose that for 80% of students whose cases are similar to the case of $u$, all premises of $r$ are satisfied. Then, to the advantage of $u$, we may view $r$ as practically applicable to $u$. Formally, $r$ is (0.8, 1)-applicable to $u$. Additionally, let $r$ be (0.8, 0.9)-applicable to 65% of objects similar to $u$.
In sum, $r$ is (0.8, 0.9)-applicable to $u$ in the sense of $\mathrm{pos}_{0.65}$.

The second (and last) generalization of the fundamental notion of rough applicability, proposed here, consists in extending the applicability of a rule to an object to the applicability of a rule to a set of objects. In the classical case, a rule is applicable to a set of objects $x$ iff it is applicable to each element of $x$. For any $a$, let $(a)^n$ denote the tuple consisting of $n$ copies of $a$, and let $(a)^1$ be abbreviated by $(a)$. For arbitrary tuples $s, t$, $st$ denotes their concatenation. Next, if $t$ is at least a pair of items (i.e., an n-tuple for $n \geq 2$), then 0
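(c) If $\Gamma u = \Gamma u'$ and $g \in \{\mathrm{upp} \circ f_{\$}, \mathrm{pos}_s \circ f_{\$}\}$, then $\mathrm{apl}_t u = \mathrm{apl}_t u'$ and $\mathrm{apl}_t^{g} u = \mathrm{apl}_t^{g} u'$.
(d) If $f_{\$}$ is monotone and $t \preceq t'$, then $\mathrm{apl}_{t'}^{f_{\$}} \sqsubseteq \mathrm{apl}_t^{f_{\$}}$.
(e) $\mathrm{apl}_t^{\mathrm{low}} \sqsubseteq \mathrm{apl}_t \sqsubseteq \mathrm{apl}_t^{\mathrm{upp}}$.
(f) $\mathrm{Apl}_{t(1)} x = \bigcap \{\mathrm{apl}_t u \mid u \in x\}$.

Proof. We prove (d), (f) only. For (d) consider a rule $r$ and assume (d1) $f_{\$}$ is monotone and (d2) $t \preceq t'$. First, we show (d3) $\|P_r\|_{t'} \subseteq \|P_r\|_t$. Consider the non-trivial case only, where $\pi_1 t, \pi_1 t' \neq c$. Assume that $u \in \|P_r\|_{t'}$. Then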
(d4) $\kappa^*(P_r, |u|_{\pi_1 t'}) \geq \pi_2 t'$ by the definition of graded meaning. Observe that for any formula $\alpha$, if $\kappa(\Gamma u, \|\alpha\|) \geq \pi_1 t'$, then $\kappa(\Gamma u, \|\alpha\|) \geq \pi_1 t$ by (d2). Hence, $|u|_{\pi_1 t'} \subseteq |u|_{\pi_1 t}$. As a consequence, $\kappa^*(P_r, |u|_{\pi_1 t'}) \leq \kappa^*(P_r, |u|_{\pi_1 t})$ by (A3). Hence, $\kappa^*(P_r, |u|_{\pi_1 t}) \geq \pi_2 t' \geq \pi_2 t$ by (d2), (d4). Thus, $u \in \|P_r\|_t$ by the definition of graded meaning. In the sequel, $f_{\$}\|P_r\|_{t'} \subseteq f_{\$}\|P_r\|_t$ by (d1), (d3). Hence, $r \in \mathrm{apl}_{t'}^{f_{\$}} u$ implies $r \in \mathrm{apl}_t^{f_{\$}} u$ by the definition of graded applicability in the sense of $f_{\$}$. In case (f), for any rule $r$, $r \in \mathrm{Apl}_{t(1)} x$ iff $x \subseteq \|P_r\|_t$ iff $\forall u \in x.\ u \in \|P_r\|_t$ iff $\forall u \in x.\ r \in \mathrm{apl}_t u$ iff $r \in \bigcap\{\mathrm{apl}_t u \mid u \in x\}$. □

Let us briefly comment on the results. By (a), rough applicability of a rule to $u$ in the sense of $\mathrm{pos}_s$ and the graded applicability of a rule to $\Gamma u$ coincide. (b) is a direct consequence of the properties of approximation mappings. (c) states that the fundamental notion of rough applicability as well as the graded forms of applicability in the sense of $\mathrm{upp} \circ f_{\$}$ and $\mathrm{pos}_s \circ f_{\$}$ are determined up to granulation of information. By (d), ift: 0, then Apl^ji^} = apl 0, then r G Apl^x iff a: = 0. (j) If x' n llPrlUt = 0, then r G Apl^(x U x') implies r G Apl^x and r G Apl^x impUes r G Apl^(a: — x'). (fc) If x' C ||Pr|| 0, and some premise of a rule r is vrit-unsatisfiable, then r is ^-applicable to the empty set only. Recall that RIFs are quasi-standard in cases (j), (k). (j) states that the property of being inapplicable (resp., applicable) in the sense of $\mathrm{Apl}_t$ is invariant under adding (removing) objects for which sets of premises of rules are

the so-called mother type of a functor is given. Because in a variant of ZF set theory the types of all objects expand to the type set (except Mizar structures, which are treated in a different way), the user may drop this part of a definition so as not to restrict its type. We wanted Mizar to understand automatically that approximations yield subsets of an approximation space.
For uniformity purposes, we used the notation Class (R, x) instead of neighbourhood (x, R), originally introduced in MML, even if we dealt with tolerances, not equivalence relations. Because of the implemented inheritance mechanisms and adjectives it worked surprisingly well. The Mizar articles are plain ASCII files, so some usual abbreviations (often close to their LaTeX equivalents) are used: "c=" stands for the set-theoretical inclusion, "in" for ∈, "{}" for ∅, "\/" and "/\" for the union and the intersection of sets, respectively. The double colon starts a comment, while the semicolon is a delimiter for a sentence. Another important construction in the Mizar language which we used extensively was the cluster, that is a collection of attributes. There are three kinds of cluster registrations:
•
existential, because in Mizar all types are required to be non-empty, so the existence of an object which satisfies all these properties has to be proved. We needed to construct an example of an approximation space;

    registration
      let A be non diagonal Approximation_Space;
      cluster rough Subset of A;
      existence;
    end;

The approximation space A which appears in the locus (argument of a definition) has to be non diagonal. If A is diagonal, i.e. if its indiscernibility relation is included in the identity relation, then all subsets of A become crisp, with no possibility of constructing a rough subset.
• functorial, i.e. the involved functor has certain properties, used e.g. to ensure that lower and upper approximations are exact (see the example below);

    registration
      let A be Approximation_Space,
          X be Subset of A;
      cluster LAp X -> exact;
      coherence;
    end;
Functorial clusters are the most frequent due to the large number of introduced functors (5484 in MML). The possibility of adding an adjective to the type of an object is also useful (e.g. in this way we often force an object to be non-empty).
• conditional, stating e.g. that all approximation spaces are tolerance spaces.
15 On the Computer-Assisted Reasoning about Rough Sets
    registration
      cluster with_equivalence -> with_tolerance RelStr;
      coherence;
    end;
This kind of cluster is relatively rare (see Table 15.1) because of a strong type expansion mechanism. Table 15.1 compares the number of clusters of each kind in MML with those introduced in [4].

Table 15.1. Number of clusters in MML vs. RST development

    type          in MML   in [4]
    existential     1501        7
    functorial      3181        9
    conditional     1131        7
    total           5813       23
As sometimes happens in other theories (compare e.g. the construction of fuzzy sets), paradoxically the notion of a rough set is not the central point of RST as a whole. Rough sets are in fact classes of abstraction w.r.t. rough equality of sets, and their formal treatment varies. The majority of authors (Pawlak in [9], for instance) define a rough set as an underlying class of abstraction (as noted above), but some of them (cf. [2]) claim for simplicity that a rough set is an ordered pair containing the lower and the upper limit of fluctuation of the argument X. These two approaches are not equivalent, and we decided to define a rough set also in the latter sense.

    definition
      let A be Approximation_Space;
      let X be Subset of A;
      mode RoughSet of X means :: ROUGHS_1:def 8
        it = [LAp X, UAp X];
    end;
Recall that the Mizar language has so-called modes, which correspond to the notion of a type. To properly define a mode, one only has to prove its existence. As can easily be observed, because the above definiens determines a unique object for every subset X of a fixed approximation space A, this can be reformulated as a functor definition in the Mizar language. If both approximations coincide, the notion collapses and the resulting set is exact, i.e. a set in the classical sense. Unfortunately, in the above-mentioned approach, this is not the case. In [4] we did not in fact use this notion, but chose another solution which describes rough sets more effectively, i.e. by attributes.
Adam Grabowski
15.5 Rough Inclusions and Equality
We now briefly present the fundamental predicate of rough set theory: the rough equality predicate (the lower version is cited below, while the dual upper equality, with the notation "^=", assumes the equality of upper approximations of sets).

    definition
      let A be Tolerance_Space,
          X, Y be Subset of A;
      pred X _= Y means :: ROUGHS_1:def 14
        LAp X = LAp Y;
      reflexivity;
      symmetry;
    end;
Two additional properties (reflexivity and symmetry) were added with their trivial proofs: e.g. the first one forces the checker to accept that X _= X without any justification. In Mizar it is also possible to introduce so-called redefinitions, that is to give another definiens, if its equivalence with the original one can be proved (in the case above, the rough equality can be defined e.g. as a conjunction of two rough inclusions). This mechanism may also be applied to our Mizar definition of a rough set generated by a subset of an approximation space, as an ordered pair of its lower and upper approximations rather than as a class of abstraction w.r.t. the rough equality relation.
15.6 Membership Functions
Employing the notion of indiscernibility, the concept of a membership function for rough sets was defined in [10] as

$$\mu_X^I(x) = \frac{|X \cap I(x)|}{|I(x)|},$$

where $|A|$ denotes the cardinality of $A$. Because the original approach deals with equivalence relations, $I(x)$ is equal to $[x]_I$, i.e. the equivalence class of the relation $I$ containing the element $x$. Using tolerances we should rather write $x/I$ instead. Also in Mizar we can choose between Class and neighbourhood, as we already noted in the fourth section. As can be expected, for a finite tolerance space A and a subset X of it, the function $\mu_X^I$ is defined as follows.

    definition
      let A be finite Tolerance_Space;
      let X be Subset of A;
      func MemberFunc (X, A) -> Function of the carrier of A, REAL means
        for x being Element of A holds
          it.x = card (X /\ Class (the InternalRel of A, x)) /
                 (card Class (the InternalRel of A, x));
    end;
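As an informal companion to the Mizar definition, the membership function can be computed directly for a small finite tolerance space. This is a sketch only: the helper names `tolerance`, `tolerance_class`, and `member_func` are hypothetical, and the tolerance is represented as a plain set of ordered pairs.

```python
from fractions import Fraction

def tolerance(carrier, pairs):
    """Smallest reflexive and symmetric relation on `carrier` containing `pairs`."""
    rel = {(x, x) for x in carrier}
    for a, b in pairs:
        rel |= {(a, b), (b, a)}
    return rel

def tolerance_class(rel, x):
    """Counterpart of Class(R, x): all y related to x."""
    return {b for (a, b) in rel if a == x}

def member_func(X, carrier, rel):
    """mu(x) = card(X /\\ Class(R, x)) / card Class(R, x) for every x in the carrier."""
    mu = {}
    for x in carrier:
        cls = tolerance_class(rel, x)   # non-empty because rel is reflexive
        mu[x] = Fraction(len(X & cls), len(cls))
    return mu

# Hypothetical four-element space (an assumption for illustration only).
CARRIER = {1, 2, 3, 4}
X = {1, 2}
mu = member_func(X, CARRIER, tolerance(CARRIER, [(1, 2), (2, 3)]))
mu_discrete = member_func(X, CARRIER, tolerance(CARRIER, []))
print(mu)
print(mu_discrete)
```

With the identity as the tolerance (the discrete case), the computed values are only 0 and 1, so the membership function reduces to the characteristic function of X.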
Actually, the dot "." stands in MML for function application, and it in the definiens denotes the defined object. Extensive usage of attributes makes the formulation of some theorems even simpler (at least, in our opinion) than in natural language, because it enables us e.g. to state that $\mu_X^I$ is equal to the characteristic function $\chi_X$ (theorem 44 from [4]) for a discrete finite approximation space A (that is, with the identity as the indiscernibility relation) in the following way:

    theorem :: ROUGHS_1:44
      for A being discrete finite Approximation_Space,
          X being Subset of A holds
15.7 Example of the Formalization
We formalized 19 definitions, 61 theorems with proofs, and 23 cluster registrations in [4]. This representation in Mizar of the rough set theory basics is 1771 lines long (the length of a line is restricted to 80 characters), which takes 54855 bytes of text. In this section we show one chosen lemma together with its proof given in [6] and its full formalization in the Mizar language.^4

Lemma 9. Let $R \in \mathrm{Tol}(U)$ and $X, Y \subseteq U$. If $X$ is R-definable, then $X_R \cup Y_R = (X \cup Y)_R$.
Proof. It is obvious that $X_R \cup Y_R \subseteq (X \cup Y)_R$. Let $x \in (X \cup Y)_R$, i.e., $x/R \subseteq X \cup Y$. If $x/R \cap X \neq \emptyset$, then $x \in X^R$ and $x \in X_R$ because $X$ is R-definable. If $x/R \cap X = \emptyset$, then necessarily $x/R \subseteq Y$ and $x \in Y_R$. Hence, in both cases $x \in X_R \cup Y_R$. □

It is worth noticing that the attribute exact (sometimes called R-definable in the literature) has been defined earlier to describe sets with their lower and upper approximations equal (that is, crisp sets). Defining new synonyms and redefinitions is, however, also possible here. One of the features of the Mizar language which closely reflects the mathematical vernacular is reasoning per cases (subsequent cases are marked by the keyword suppose). The references (after by) to XBOOLE_1 (which is the identifier of the file containing theorems about Boolean properties of sets) take external theorems from MML as premises; all other labels are local. Obviously, some parts of proofs in the literature may be hard to translate for a machine (compare "It is obvious that..." above), while others may depend on the checker architecture (especially if an author would like to carry out the remaining part of a proof analogously to an earlier one). However, the choice of the above example is rather accidental.
^4 In fact, to keep this presentation compact, we dropped the dual conjunct of this lemma.
    theorem Lemma_9:
      for A being Tolerance_Space,
          X being exact Subset of A,
          Y being Subset of A holds
        LAp X \/ LAp Y = LAp (X \/ Y)
    proof
      let A be Tolerance_Space,
          X be exact Subset of A,
          Y be Subset of A;
      thus LAp X \/ LAp Y c= LAp (X \/ Y) by Th26;
      let x be set;
      assume A1: x in LAp (X \/ Y);
      then A2: Class (the InternalRel of A, x) c= X \/ Y by Th8;
      A3: LAp X c= LAp X \/ LAp Y & LAp Y c= LAp X \/ LAp Y by XBOOLE_1:7;
      per cases;
      suppose Class (the InternalRel of A, x) meets X;
        then x in UAp X by A1, Th11;
        then x in LAp X by Th15;
        hence x in LAp X \/ LAp Y by A3;
      suppose Class (the InternalRel of A, x) misses X;
        then Class (the InternalRel of A, x) c= Y by A2, XBOOLE_1:73;
        then x in LAp Y by A1, Th9;
        hence x in LAp X \/ LAp Y by A3;
    end;
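Independently of the formal proof, Lemma 9 can be sanity-checked by brute force on a small tolerance space. The sketch below is not part of the Mizar development: the functions `lap`, `uap`, and `lemma9_holds` are hypothetical Python counterparts of LAp, UAp, and the lemma, and the five-element space is an arbitrary example. Such a check illustrates the statement but proves nothing in general.

```python
from itertools import combinations

def classes(carrier, pairs):
    """Tolerance classes of the reflexive, symmetric closure of `pairs`."""
    cls = {x: {x} for x in carrier}
    for a, b in pairs:
        cls[a].add(b)
        cls[b].add(a)
    return cls

def lap(cls, X):
    """Lower approximation: elements whose class is included in X."""
    return {x for x in cls if cls[x] <= X}

def uap(cls, X):
    """Upper approximation: elements whose class meets X."""
    return {x for x in cls if cls[x] & X}

def subsets(carrier):
    items = sorted(carrier)
    return [set(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def lemma9_holds(cls):
    """Check LAp X \\/ LAp Y = LAp (X \\/ Y) for every exact X and every Y."""
    subs = subsets(cls)
    return all(lap(cls, X) | lap(cls, Y) == lap(cls, X | Y)
               for X in subs if lap(cls, X) == uap(cls, X)
               for Y in subs)

# Hypothetical tolerance space on {1,...,5} (not transitive: 3~4, 4~5, but not 3~5).
CLS = classes({1, 2, 3, 4, 5}, [(1, 2), (3, 4), (4, 5)])
print(lemma9_holds(CLS))

# Without exactness of X the equality can fail, e.g. for X = {3}, Y = {4}.
X, Y = {3}, {4}
print(lap(CLS, X) | lap(CLS, Y), lap(CLS, X | Y))
```

The last line shows why the exactness assumption matters: neither {3} nor {4} contains a whole class, yet their union contains the class of 3.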
Even though the Mizar source itself is not especially hard to read for a mathematician, some translation services are available. The final version, converted automatically back to natural language, looks as follows: For every tolerance space A, every exact subset X of A, and every subset Y of A holds LAp(X) ∪ LAp(Y) = LAp(X ∪ Y). The name de Bruijn factor is used by automated reasoning researchers to describe the "loss factor" between the size of an ordinary mathematical exposition and its full formal translation inside a computer. While in Wiedijk's considerations and the Mizar examples contained in [12] it is equal to four (although in the 1960s de Bruijn assumed it to be about ten times bigger), in our case two is a good upper approximation.
15.8 Conclusions The purpose of our work was to develop a uniform formalization of basic notions of rough set theory. For lack of space we concentrated in this outline mainly on the notions of rough approximations and a membership function. Following [6] and [10], we formalized in [4] properties of rough approximations and membership functions
based on tolerances, rough inclusion and equality, the rough set notion and associated basic properties. The adjectives and type modifier mechanisms available in the Mizar type theory made our work quite feasible. Even if we take into account that transitivity was dropped from the classical indiscernibility relation treated as an equivalence relation, further generalizations (e.g. the variable precision model originating from [14]) are still possible. It is important that by including the formalization of rough sets into MML we made it usable for a number of automated deduction tools and other digital repositories. The Mizar system closely cooperates with the OMDoc system to share its mathematical library via a format close to XML. Work on the exchange of results between automatic theorem provers (e.g. Otter) and Mizar (which has already resulted in a successful solution of the Robbins problem) is under way. Formal concept analysis, as well as fuzzy set theory, is also well developed in MML. Successful experiments with theory merging mechanisms implemented in Mizar (e.g. to describe topological groups or continuous lattices) are quite promising for going further with rough concept analysis as defined in [7] or for the machine encoding of the connections between fuzzy set theory and rough sets. We have also started the formalization of the paper [3], which focuses upon a comparison of some generalized rough approximations of sets. We hope that many more interconnections can be discovered automatically. Rough set researchers could also be assisted in searching a distributed library of facts for analogies between rough sets and other domains. Eventually, it could be helpful within the rough set domain itself, thanks to e.g. the proof restructuring utilities available in the Mizar system, as well as other specialized tools. One of the most useful at this stage is the discovery of irrelevant assumptions of theorems and lemmas.
The comparatively low de Bruijn factor allows us to say that the Mizar system seems to be effective and that the library is developed well enough to go further with the encoding of rough set theory. Moreover, tools which automatically translate Mizar articles back into LaTeX source close to the mathematical vernacular are available. This makes our development not only machine-readable but also human-readable.
References
1. Ch. Benzmüller, M. Jamnik, M. Kerber, V. Sorge, Agent-based mathematical reasoning, Electronic Notes in Theoretical Computer Science, 23(3), 1999.
2. E. Bryniarski, Formal conception of rough sets, Fundamenta Informaticae, 27(2-3), 1996, pp. 109-136.
3. A. Gomolinska, A comparative study of some generalized rough approximations, Fundamenta Informaticae, 51(1-2), 2002, pp. 103-119.
4. A. Grabowski, Basic properties of rough sets and rough membership function, to appear in Formalized Mathematics, 12(1), 2004, available at http://mizar.org/JFM/Vol15/roughs_1.html.
5. A. Grabowski, Robbins algebras vs. Boolean algebras, in Proceedings of Mathematical Knowledge Management Conference, Linz, Austria, 2001, available at http://www.emis.de/proceedings/MKM2001/.
6. J. Järvinen, Approximations and rough sets based on tolerances, in: W. Ziarko, Y. Yao (eds.), Proceedings of RSCTC 2000, LNAI 2005, Springer, 2001, pp. 182-189.
7. R. E. Kent, Rough concept analysis: a synthesis of rough sets and formal concept analysis, Fundamenta Informaticae, 27(2-3), 1996, pp. 169-181.
8. The Mizar Home Page, http://mizar.org.
9. Z. Pawlak, Rough sets, International Journal of Information and Computer Science, 11(5), 1982, pp. 341-356.
10. Z. Pawlak, A. Skowron, Rough membership functions, in: R. R. Yager, M. Fedrizzi, and J. Kacprzyk (eds.), Advances in the Dempster-Shafer Theory of Evidence, Wiley, New York, 1994, pp. 251-271.
11. A. Skowron, J. Stepaniuk, Tolerance approximation spaces, Fundamenta Informaticae, 27(2-3), 1996, pp. 245-253.
12. F. Wiedijk, The de Bruijn factor, http://www.cs.kun.nl/~freek/factor/.
13. L. Zadeh, Fuzzy sets, Information and Control, 8, 1965, pp. 338-353.
14. W. Ziarko, Variable precision rough sets model, Journal of Computer and System Sciences, 46(1), 1993, pp. 39-59.
16 Similarity-Based Data Reduction and Classification
Gongde Guo^{1,3}, Hui Wang^1, David Bell^2, and Zhining Liao^1
^1 School of Computing and Mathematics, University of Ulster, BT37 0QB, UK, {G.Guo, H.Wang, Z.Liao}@ulster.ac.uk
^2 School of Computer Science, Queen's University Belfast, BT7 1NN, UK, [email protected]
^3 School of Computer and Information Science, Fujian University of Technology, Fuzhou, 350014, China

Summary. The k-Nearest-Neighbors (kNN) is a simple but effective method for classification. The major drawbacks of kNN are (1) low efficiency and (2) dependence on the parameter k. In this paper, we propose a novel similarity-based data reduction method and several variations aimed at overcoming these shortcomings. Our method constructs a similarity-based model for the data, which replaces the data to serve as the basis of classification. The value of k is automatically determined, is varied in terms of local data distribution, and is optimal in terms of classification accuracy. The construction of the model significantly reduces the amount of data for learning, thus making classification faster. Experiments conducted on some public data sets show that the proposed methods compare well with other data reduction methods in both efficiency and effectiveness.
Key words: data reduction, classification, k-Nearest-Neighbors
16.1 Introduction
The k-Nearest-Neighbors (kNN) is a non-parametric classification method, which is simple but effective in many cases [6]. For an instance $d_t$ to be classified, its k nearest neighbors are retrieved, and these form a neighborhood of $d_t$. Majority voting among the instances in the neighborhood is generally used to decide the classification for $d_t$, with or without consideration of distance-based weighting. In contrast to its conceptual simplicity, the kNN method performs as well as any other possible classifier when applied to non-trivial problems. Over the last 50 years, this simple classification method has been intensively used in a broad range of applications such as medical diagnosis, text categorization [9], pattern recognition, data mining, and e-commerce. However, to apply kNN we need to choose an appropriate value for k, and the success of classification is very much dependent on this value. In a sense, the kNN method is biased by k. There are many ways of choosing the k value. A simple one is to run the algorithm many times with different k values and choose the one with the best performance, but this is not a pragmatic method in real applications.
In order for kNN to be less dependent on the choice of k, we proposed to look at multiple sets of nearest neighbors rather than just one set of k nearest neighbors [12]. The proposed formalism is based on contextual probability, and the idea is to aggregate the support of multiple sets of nearest neighbors for various classes to give a more reliable support value, which better reveals the true class of $d_t$. As it aimed at improving classification accuracy and alleviating the dependence on k, the efficiency of the method in its basic form is worse than that of kNN, though it is indeed less dependent on k and is able to achieve classification performance close to that for the best k. From the point of view of its implementation, the kNN method consists of a search of pre-labelled instances, given a particular distance definition, to find the k nearest neighbors of each new instance. If the number of instances available is very large, the computational burden of the kNN method becomes prohibitive. This drawback rules it out in many applications, such as dynamic web mining for a large repository. These drawbacks of the kNN method motivate us to find a way of reducing the instances, which chooses only a few representatives to be stored for use in classification, in order to improve efficiency whilst both preserving classification accuracy and alleviating the dependence on k.
16.2 Related work
Many researchers have addressed the problem of training set size reduction. Hart [7] made one of the first attempts to reduce the size of the training set with his Condensed Nearest Neighbor Rule (CNN). His algorithm finds a subset S of the training set T such that every instance of T is closer to an instance of S of the same class than to an instance of S of a different class. In this way, the subset S can be used to classify all the instances in T correctly. Ritter et al. [8] extended the condensed NN method in their Selective Nearest Neighbor Rule (SNN) such that every instance of T must be closer to an instance of S of the same class than to any instance of T (instead of S) of a different class. Further, the method ensures a minimal subset satisfying these conditions. Gates [5] introduced the Reduced Nearest Neighbor Rule (RNN). The RNN algorithm starts with S = T and removes each instance from S if such a removal does not cause any other instances in T to be misclassified by the instances remaining in S. Wilson [13] developed the Edited Nearest Neighbor (ENN) algorithm, in which S starts out the same as T, and then each instance in S is removed if it does not agree with the majority of its k nearest neighbors. The Repeated ENN (RENN) applies the ENN algorithm repeatedly until all remaining instances have a majority of their neighbors with the same class, which continues to widen the gap between classes and smooth the decision boundary. Tomek [11] extended the ENN with his All k-NN method of editing. This algorithm works as follows: for i = 1 to k, flag as bad any instance not classified correctly by its i nearest neighbors. After completing the loop all k times, remove any instances from S flagged as bad. Other methods include ELGrow (Encoding Length Grow) and Explore by Cameron-Jones [3], IB1~IB5 by Aha et al. [1][2], and Drop1~Drop5 and DEL by Wilson et al. [15]. From the experimental results conducted by Wilson et al., the average classification
accuracy of those methods on reduced data sets is lower than that on the original data sets, because pure instance selection suffers from information loss to some extent. In the next section, we introduce a novel similarity-based data reduction method (SBModel). It is a type of inductive learning method. The method constructs a similarity-based model for the data by selecting a subset S, with some extra information, from the training set T, which replaces the data to serve as the basis of classification. The model consists of a set of representatives of the training data, viewed as regions in the data space. Based on SBModel, two variations called e-SBModel and p-SBModel are also presented, which aim at improving both the efficiency and the effectiveness of SBModel. The experimental results and a comparison with other methods are reported in Section 16.4.
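Of the editing rules surveyed in the related work, Wilson's ENN is the simplest to illustrate. The following is a minimal sketch under stated assumptions, not the original implementation: instances are (point, label) pairs, distance is Euclidean, and the toy data and the function names `knn_label` and `enn` are hypothetical.

```python
import math
from collections import Counter

def knn_label(data, point, k, skip=None):
    """Majority label among the k nearest instances, ignoring index `skip`."""
    candidates = [i for i in range(len(data)) if i != skip]
    nearest = sorted(candidates, key=lambda i: math.dist(data[i][0], point))[:k]
    return Counter(data[i][1] for i in nearest).most_common(1)[0][0]

def enn(data, k=3):
    """Keep only the instances that agree with the majority of their k nearest
    neighbours; all decisions are made against the original set."""
    return [inst for i, inst in enumerate(data)
            if knn_label(data, inst[0], k, skip=i) == inst[1]]

# Hypothetical toy data: two clean clusters plus one mislabelled point.
DATA = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((1, 1), "a"),
        ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"), ((6, 6), "b"),
        ((0.5, 0.5), "b")]

edited = enn(DATA, k=3)
print(len(DATA), len(edited))
```

On this data only the mislabelled point in the middle of the first cluster is removed, which is the smoothing effect on the decision boundary described above.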
16.3 Similarity-Based Data Reduction
16.3.1 The Basic Idea of Similarity-Based Data Reduction
Figure 16.1 uses the Iris data with two features, petal length and petal width, for demonstration. The data set contains 150 instances of three classes, represented as diamonds, squares, and triangles, respectively, and is plotted in a 2-dimensional data space. In Figure 16.1, the horizontal axis is for the feature petal length and the vertical axis is for the feature petal width.
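[Figure 16.1: scatter plot of the Iris data, petal length on the horizontal axis (0 to 8) and petal width on the vertical axis (0 to 3); legend: Class1, Class2, Class3; one local region is annotated with its Num(d_i) and Sim(d_i) values.]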
Fig. 16.1. Data distribution in 2-dimensional data space

Given a similarity measure, many instances with the same class label are close to each other in many local areas. In each local region, the central instance d_i (see Figure 16.1 for an example), together with some extra information, might be a good representative of this local region. The extra information is Cls(d_i), the class label of instance d_i; Num(d_i), the number of instances inside the local region; Sim(d_i), the similarity of the furthest instance inside the local region to d_i; and Rep(d_i), a representation of d_i. If we take these representatives as a model to represent the whole training set, the number of instances needed for classification is significantly reduced, thereby improving efficiency.
Gongde Guo, Hui Wang, David Bell, and Zhining Liao
For a new instance to be classified in the classification stage, if it is covered by a representative it is assigned the class label of this representative. If not, we calculate the distance of the new instance to each representative's nearest boundary, treat each representative's nearest boundary as an instance, and then classify the new instance in the spirit of kNN.

16.3.2 Terminology and Definitions

Before we give more details about the design of the proposed algorithms, some important terms (or concepts) need to be defined explicitly.

Definition 1.
1. Neighborhood. A neighborhood is defined with respect to a given instance in the data space: a neighborhood of a given instance is a set of nearest neighbors of this instance.
2. Local Neighborhood. A local neighborhood is a neighborhood which covers the maximal number of instances with the same class label.
3. Local ε-Neighborhood. A local ε-neighborhood is a neighborhood which covers the maximal number of instances with the same class label, with ε exceptions allowed.
4. Global Neighborhood. The global neighborhood is the largest local neighborhood among the set of local neighborhoods in each cycle of the model construction stage.
5. Global ε-Neighborhood. The global ε-neighborhood is the largest local ε-neighborhood among the set of local ε-neighborhoods in each cycle of the model construction stage.

With the above definitions, given a training set, each instance has a local neighborhood. Based on these local neighborhoods, the global neighborhood can be obtained. This global neighborhood can be seen as a representative of all the instances covered by it. For instances not covered by any representative we repeat the above operation until all the instances have been covered by chosen representatives.
All representatives obtained in the model construction process replace the data and serve as the basis of classification. There are two obvious advantages: (1) we need not choose a specific k, in the sense of kNN, in the model construction process. The number of instances covered by a representative can be seen as an optimal k which is generated automatically during model construction and differs from one representative to another; (2) using a list of chosen representatives as a model for classification reduces the number of instances needed for classification and thereby significantly improves efficiency. From this point of view, the proposed method overcomes the two shortcomings of kNN.
16.3.3 Modelling and Classification Algorithm

Let D be a collection of n class-known instances {d_1, d_2, ..., d_n}, d_i ∈ D. To handle heterogeneous applications, those with both numeric and nominal features, we use the HVDM distance function (presented in Section 16.4.2) as the default similarity measure in the description of the following algorithms. The detailed model construction algorithm of SBModel is as follows:

Step 1: Select a similarity measure and create a similarity matrix for the given training set D.
Step 2: Set the tag of all instances to 'ungrouped'.
Step 3: For each 'ungrouped' instance, find its local neighborhood.
Step 4: Among all the local neighborhoods obtained in step 3, find the global neighborhood N_i. Add a representative <Cls(d_i), Sim(d_i), Num(d_i), Rep(d_i)> to M to represent all the instances covered by N_i, and then set the tag of all the instances covered by N_i to 'grouped'.
Step 5: Repeat step 3 and step 4 until all the instances in the training set have been set to 'grouped'.
Step 6: Model M consists of all the representatives collected in the above learning process.

In the above algorithm, D represents the given training set and M represents the created model. The elements of a representative <Cls(d_i), Sim(d_i), Num(d_i), Rep(d_i)> are, respectively, the class label of d_i, the HVDM distance from d_i to the furthest instance among the instances covered by N_i, the number of instances covered by N_i, and a representation of d_i itself. In step 4, if more than one local neighborhood has the same maximal number of neighbors, we choose the one with the minimal value of Sim(d_i), i.e. the one with the highest density, as the representative.

The classification algorithm of SBModel is as follows:

Step 1: For a new instance d_t to be classified, calculate its similarity to all representatives in the model M.
Step 2: If d_t is covered by a representative <Cls(d_j), Sim(d_j), Num(d_j), Rep(d_j)>, i.e. the HVDM distance from d_t to d_j is smaller than Sim(d_j), d_t is assigned the class label of d_j.
Step 3: If no representative in the model M covers d_t, assign d_t the class label of the representative whose boundary is closest to d_t. The HVDM distance from d_t to the nearest boundary of a representative d_i is equal to the HVDM distance from d_i to d_t minus Sim(d_i).

In an attempt to improve the classification accuracy of SBModel, we implemented two different pruning methods. The first method removes from the model M created by SBModel the representatives that cover only a few instances, removes the instances covered by these representatives from the training set, and then constructs the model again from the revised training set using SBModel. The SBModel algorithm based on this pruning method is called p-SBModel. The model construction algorithm of p-SBModel is described as follows:
Step 1: For each representative in the model M created by SBModel that covers only a few instances (fewer than a user-defined threshold), remove all the instances covered by this representative from the training set D. Set model M = ∅, then go to step 2.
Step 2: Construct model M again from the revised training set D by using SBModel.
Step 3: The final model M consists of all the representatives collected in the above pruning process.

The second pruning method modifies step 3 of the model construction algorithm of SBModel to allow each local neighborhood to cover ε instances (ε is called the error tolerance rate) whose class label differs from the majority class label in this neighborhood. This modification integrates the pruning work into the process of model construction. The SBModel algorithm based on this pruning method is called ε-SBModel. The detailed model construction algorithm of ε-SBModel is as follows:

Step 1: Select a similarity measure and create a similarity matrix for the given training set D.
Step 2: Set the tag of all instances to 'ungrouped'.
Step 3: For each 'ungrouped' instance, find its local ε-neighborhood.
Step 4: Among all the local ε-neighborhoods obtained in step 3, find the global ε-neighborhood N_i. Add a representative <Cls(d_i), Sim(d_i), Num(d_i), Rep(d_i)> to M to represent all the instances covered by N_i, and then set the tag of all the instances covered by N_i to 'grouped'.
Step 5: Repeat step 3 and step 4 until all the instances in the training set have been set to 'grouped'.
Step 6: Model M consists of all the representatives collected in the above learning process.

SBModel is the basic algorithm, with ε = 0 (error tolerance rate) and without pruning.
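The construction and classification steps of this section can be sketched as follows. This is only an illustration under stated assumptions, not the authors' prototype: Euclidean distance stands in for HVDM, neighborhoods are grown over all training instances (grouped or not), trailing exceptions are trimmed from an ε-neighborhood, and a point is treated as covered when its distance is at most Sim(d_j) rather than strictly smaller. The names `build_model` and `classify` are hypothetical.

```python
import math

def build_model(points, labels, eps=0, dist=math.dist):
    """Return a list of representatives (cls, sim, num, rep)."""
    n = len(points)
    # Step 1: pairwise distance matrix (Euclidean stands in for HVDM)
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    grouped = [False] * n                      # Step 2: all 'ungrouped'
    model = []
    while not all(grouped):                    # Step 5
        best = None
        for i in range(n):                     # Step 3: local eps-neighborhood
            if grouped[i]:
                continue
            covered, errors = [], 0
            # Visit instances from nearest to furthest; i itself comes first
            for j in sorted(range(n), key=lambda j: (d[i][j], j != i)):
                if labels[j] != labels[i]:
                    errors += 1
                    if errors > eps:
                        break
                covered.append(j)
            # Assumption: drop trailing exceptions so the region ends on class Cls(d_i)
            while labels[covered[-1]] != labels[i]:
                covered.pop()
            sim = d[i][covered[-1]]
            # Step 4: most covered instances first, then smallest Sim (highest density)
            key = (-len(covered), sim)
            if best is None or key < best[0]:
                best = (key, i, covered, sim)
        _, i, covered, sim = best
        model.append((labels[i], sim, len(covered), points[i]))
        for j in covered:
            grouped[j] = True
    return model

def classify(model, x, dist=math.dist):
    # Margin <= 0: x is covered by the representative (classification Step 2);
    # margin > 0: distance from x to that representative's nearest boundary (Step 3).
    margins = [(dist(x, rep) - sim, cls) for cls, sim, num, rep in model]
    return min(margins, key=lambda m: m[0])[1]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
model = build_model(points, labels)

# The same data with one mislabelled instance, built with and without tolerance
noisy_points = points + [(0.5, 0.5)]
noisy_labels = labels + ['b']
strict = build_model(noisy_points, noisy_labels, eps=0)
tolerant = build_model(noisy_points, noisy_labels, eps=1)
```

On the clean data two representatives replace eight instances. On the noisy data the tolerant model needs fewer representatives than the strict one, which is the effect ε-SBModel aims for. A p-SBModel variant could be layered on top by discarding representatives whose Num value is below the threshold, removing the instances they cover, and calling `build_model` again.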
16.4 Experiment and Evaluation

16.4.1 Data Sets

To evaluate the SBModel method and its variations, fifteen public data sets were collected from the UCI machine learning repository for training and testing. Some information about these data sets is listed in Table 16.1. The column titles have the following meaning: NF - Number of Features, NN - Number of Nominal features, NO - Number of Ordinal features, NB - Number of Binary features, NI - Number of Instances, CD - Class Distribution.

16.4.2 Experimental Environment

The experiments use 10-fold cross validation to evaluate the performance of SBModel and its variations and to compare them with C5.0, kNN (voting kNN), and wkNN (distance-weighted kNN). We implemented SBModel and its variations, kNN
Table 16.1. Some information about the data sets

Dataset      NF  NN  NO  NB  NI   CD
Australian   14   4   6   4  690  383:307
Colic        23  16   7   0  368  232:136
Diabetes      8   0   8   0  768  268:500
Glass         9   0   9   0  214  70:17:76:0:13:9:29
HCleveland   13   3   7   3  303  164:139
Heart        13   3   7   3  270  120:150
Hepatitis    19   6   1  12  155  32:123
Ionosphere   34   0  34   0  351  126:225
Iris          4   0   4   0  150  50:50:50
LiverBupa     6   0   6   0  345  145:200
Sonar        60   0  60   0  208  97:111
Vehicle      18   0  18   0  846  212:217:218:199
Vote         16   0   0  16  435  267:168
Wine         13   0  13   0  178  59:71:48
Zoo          16  16   0   0   90  37:18:3:12:4:7:9
and wkNN in our own prototype. The C5.0 algorithm used in our experiments was the implementation in the Clementine software package. The experimental results of the other editing and condensing algorithms compared here are taken from Wilson's experiments [15]. In voting kNN, the k neighbors are implicitly assumed to have equal weight in the decision, regardless of their distances to the instance x to be classified. It is intuitively appealing to give different weights to the k neighbors based on their distances to x, with closer neighbors having greater weights. In wkNN, the k neighbors are assigned different weights. Let d be a distance measure, and x_1, x_2, ..., x_k be the k nearest neighbors of x arranged in increasing order of d(x_i, x). So x_1 is the first nearest neighbor of x. The distance weight w_i for the i-th neighbor x_i is defined as follows:

$$w_i = \begin{cases} \dfrac{d(x_k, x) - d(x_i, x)}{d(x_k, x) - d(x_1, x)} & \text{if } d(x_k, x) \neq d(x_1, x), \\ 1 & \text{if } d(x_k, x) = d(x_1, x). \end{cases}$$
Instance x is assigned to the class for which the weights of the representatives among the k nearest neighbors sum to the greatest value. In order to handle heterogeneous applications - those with both ordinal and nominal features - we use the heterogeneous distance function HVDM [14] as the distance function in the experiments, which is defined as:

$$HVDM(x, y) = \sqrt{\sum_{a=1}^{m} d_a^2(x_a, y_a)}$$

where m is the number of features and the function d_a(x_a, y_a) is the distance for feature a, defined as:

$$d_a(x_a, y_a) = \begin{cases} 1 & \text{if } x_a \text{ or } y_a \text{ is unknown; otherwise} \\ vdm_a(x_a, y_a) & \text{if } a \text{ is nominal,} \\ |x_a - y_a| / (4\sigma_a) & \text{if } a \text{ is numeric.} \end{cases}$$

In the above distance function, σ_a is the standard deviation of the values occurring for feature a in the instances of the training set D, and vdm_a(x_a, y_a) is the distance function for nominal features called the Value Difference Metric [10]. Using the VDM, the distance between two values x_a and y_a of a single feature a is given as:

$$vdm_a(x_a, y_a) = \sqrt{\sum_{c=1}^{C} \left| \frac{N_{a,x,c}}{N_{a,x}} - \frac{N_{a,y,c}}{N_{a,y}} \right|^2}$$
where N_{a,x} is the number of times feature a had value x_a; N_{a,x,c} is the number of times feature a had value x_a and the output class was c; and C is the number of output classes.

16.4.3 Experimental Results

[Experiment 1] In this experiment, our goal is to evaluate the basic SBModel algorithm and to compare its experimental results with C5.0, kNN and wkNN. The error tolerance rate is therefore set to 0, k for kNN is set to 1, 3, and 5 respectively, k for wkNN is set to 5, and the allowed minimal number of instances covered by a representative (N) in the final model of SBModel is set to 2, 3, 4, and 5 respectively. Under these settings, a comparison of C5.0, SBModel, kNN, and wkNN in classification accuracy using 10-fold cross validation is presented in Table 16.2. The reduction rate of SBModel is listed in Table 16.3. Note that in Table 16.2 and Table 16.3, N = i means that each representative in the final model of SBModel covers at least i instances of the training set. From the experimental results, the average classification accuracy of the proposed SBModel method in its basic form on the fifteen training sets is better than that of C5.0 and is comparable to that of kNN and wkNN. But SBModel significantly improves the efficiency of kNN by keeping only 9.19 percent (N=4) of the original instances for classification, with only a slight decrease in accuracy (81.29%) in comparison with kNN (82.58%) and wkNN (82.34%).

[Experiment 2] In this experiment, our goal is to evaluate ε-SBModel. We tune the error tolerance rate ε in a small range from 0 to 4 for each training set, and choose the ε that gives relatively higher classification accuracy. The experimental results are presented in Table 16.4, where the heading RR stands for 'Reduction Rate'. From the experimental results in Table 16.4, ε-SBModel obtains better performance than C5.0, SBModel, kNN, and wkNN. Even when N=5, ε-SBModel still obtains 82.93% classification accuracy, which is higher than the 79.96% of C5.0, the 82.58% of kNN, and the 82.34% of wkNN (refer to Table 16.2 for more details). In this situation, ε-SBModel keeps only 7.67 percent of the instances of the original training set for classification, thus significantly improving the efficiency of kNN whilst also improving its classification accuracy.
Table 16.2. A comparison of C5.0, SBModel, kNN, and wkNN in classification accuracy (columns N=2 to N=5 are SBModel)

Dataset      C5.0   N=2    N=3    N=4    N=5    kNN(1)  kNN(3)  kNN(5)  wkNN(5)
Australian   85.5   84.20  84.64  84.78  84.63  79.42   82.75   85.22   82.46
Colic        80.9   81.67  82.50  83.06  82.50  78.89   83.89   83.06   81.94
Diabetes     76.6   75.00  74.08  74.21  74.74  70.92   73.03   74.21   72.37
Glass        66.3   65.24  65.24  61.43  55.71  68.10   66.67   67.62   67.62
HCleveland   74.9   82.67  80.33  80.33  78.00  78.33   82.33   81.00   81.33
Heart        75.6   80.37  80.74  80.37  77.78  76.30   80.37   80.37   77.41
Hepatitis    80.7   83.33  85.33  87.33  87.33  80.67   80.67   83.33   83.33
Ionosphere   84.5   94.29  93.71  92.57  91.43  87.14   85.14   84.00   87.14
Iris         92.0   96.00  96.00  96.00  96.00  95.33   94.67   96.67   95.33
LiverBupa    65.8   63.53  64.41  63.82  61.76  60.00   66.47   66.47   66.47
Sonar        69.4   84.00  82.50  80.00  79.50  88.00   83.50   85.00   86.50
Vehicle      67.9   65.83  65.36  63.69  62.26  68.57   71.79   69.29   71.43
Vote         96.1   88.70  88.70  88.70  88.70  91.30   92.17   92.17   90.87
Wine         92.1   95.29  94.71  94.12  94.12  95.88   94.71   94.71   95.29
Zoo          91.1   92.22  92.22  88.89  88.89  96.67   95.56   95.56   95.56
Average      79.96  82.16  82.03  81.29  80.22  81.03   82.25   82.58   82.34
Table 16.3. The reduction rate of SBModel in the final model

Dataset      N=2    N=3    N=4    N=5
Australian   86.81  90.43  92.17  92.46
Colic        78.26  84.24  87.50  88.86
Diabetes     80.47  86.98  89.58  91.67
Glass        79.44  88.32  90.19  93.93
HCleveland   84.16  87.79  91.42  92.74
Heart        84.81  88.52  90.00  91.48
Hepatitis    85.81  88.39  90.32  91.61
Ionosphere   81.48  85.19  88.60  89.74
Iris         95.33  96.00  96.00  96.00
LiverBupa    73.62  83.48  88.70  92.75
Sonar        81.73  86.06  87.50  90.87
Vehicle      80.38  87.83  91.96  93.50
Vote         91.38  93.53  93.97  94.40
Wine         90.45  90.45  92.13  92.70
Zoo          91.11  92.22  92.22  93.33
Average      84.35  88.63  90.81  92.40
Table 16.4. The classification accuracy and reduction rate of ε-SBModel

Dataset      ε  N=2    RR     N=3    RR     N=4    RR     N=5    RR
Australian   2  84.93  90.43  84.93  90.43  85.22  92.17  85.51  92.46
Colic        1  83.06  78.26  83.06  84.24  82.78  87.50  83.61  88.86
Diabetes     1  74.34  80.47  74.47  86.98  75.13  89.58  75.53  91.67
Glass        3  69.52  90.19  69.52  90.19  69.52  90.19  69.05  93.93
HCleveland   4  81.67  92.08  81.67  92.08  81.67  92.08  81.67  92.08
Heart        1  80.74  84.81  81.11  88.52  81.85  90.00  81.11  91.48
Hepatitis    1  88.00  85.81  89.33  88.39  88.67  90.32  88.67  91.61
Ionosphere   1  93.71  81.48  93.71  85.19  92.86  88.60  92.57  89.74
Iris         0  96.00  95.33  96.00  96.00  96.00  96.00  96.00  96.00
LiverBupa    2  68.53  83.48  68.53  83.48  68.24  88.70  67.94  92.75
Sonar        2  82.50  86.54  82.50  86.54  82.50  88.94  81.50  90.38
Vehicle      2  66.43  87.83  66.43  87.83  66.55  91.96  66.67  93.50
Vote         4  91.74  94.40  91.74  94.40  91.74  94.40  91.74  94.40
Wine         0  95.29  90.45  94.71  90.45  94.12  92.13  94.12  92.70
Zoo          0  92.22  91.11  92.22  92.22  88.89  92.22  88.89  93.33
Average         83.25  87.51  83.33  89.13  83.05  90.99  82.93  92.33
[Experiment 3] In this experiment, our goal is to evaluate p-SBModel. It is a nonparametric classification method which prunes the model M by removing the representatives that cover only one instance (meaning that no induction has been done for such a representative), removing the instances covered by these representatives from the training set, and then constructing the model again from the revised training set. The experimental results are presented in Table 16.5. From the experimental results shown in Table 16.5, it is clear that, with the same classification accuracy, p-SBModel has a slightly higher reduction rate than SBModel on average. The main merit of the p-SBModel algorithm is that it does not need any parameter to be set in either the modelling or the classification stage, while its average classification accuracy remains comparable to that of kNN and wkNN. It keeps only 10.13 percent of the instances of the original training set for classification.

[Experiment 4] In this experiment, we compare our SBModel method and its variations with other algorithms in the literature in terms of average classification accuracy and reduction rate. The algorithms compared in the experiment include CNN, SNN, IB3, DEL, ENN, RENN, All k-NN, ELGrow, Explore and Drop3, each of which was introduced in Section 16.2. The experimental results are presented in Figure 16.2. Note that the values on the horizontal axis in Figure 16.2 represent different algorithms, i.e. 1-CNN, 2-SNN, 3-IB3, 4-DEL, 5-Drop3, 6-ENN, 7-RENN, 8-All k-NN, 9-ELGrow, 10-Explore, 11-SBModel, 12-(ε-SBModel), 13-(p-SBModel). From the experimental results, it is clear that the average classification accuracy and reduction rate of our proposed SBModel method and its variations on the fifteen data sets are better than those of the other data reduction methods in 10-fold cross validation, with the exceptions of
Table 16.5. A comparison of kNN, SBModel, and p-SBModel

Dataset      kNN(5)  wkNN(5)  RR  SBModel(3)  RR     p-SBModel  RR
Australian   85.22   82.46    0   84.64       90.43  86.23      95.22
Colic        83.06   81.94    0   82.50       84.24  82.78      88.59
Diabetes     74.21   72.37    0   74.08       86.98  73.16      87.11
Glass        67.62   67.62    0   65.24       88.32  65.24      84.58
HCleveland   81.00   81.33    0   80.33       87.79  80.67      89.11
Heart        80.37   77.41    0   80.74       88.52  81.85      91.11
Hepatitis    83.33   83.33    0   85.33       88.39  84.67      96.77
Ionosphere   84.00   87.14    0   93.71       85.19  92.00      87.18
Iris         96.67   95.33    0   96.00       96.00  95.33      95.33
LiverBupa    66.47   66.47    0   64.41       83.48  62.94      82.03
Sonar        85.00   86.50    0   82.50       86.06  82.50      86.54
Vehicle      69.29   71.43    0   65.36       87.83  67.26      83.69
Vote         92.17   90.87    0   88.70       93.53  90.00      96.98
Wine         94.71   95.29    0   94.71       90.45  94.71      90.45
Zoo          95.56   95.56    0   92.22       92.22  91.11      93.33
Average      82.58   82.34    0   82.03       88.63  82.03      89.87
[Figure 16.2: bar chart with two series, Classification Accuracy and Reduction Rate (vertical axis 0 to 120), for the algorithms numbered 1 to 13 on the horizontal axis.]
Fig. 16.2. Average classification accuracy and reduction rate

ELGrow and Explore in reduction rate. Though ELGrow obtains the highest reduction rate among all the algorithms compared, its rather low classification accuracy counteracts this advantage. Explore seems to be a competitive algorithm, with a higher reduction rate and a slightly lower classification accuracy than our proposed SBModel and its variations. Otherwise, Drop3 is the algorithm closest to ours in both classification accuracy and reduction rate.
16.5 Conclusions

In this paper we have presented a novel solution for dealing with the shortcomings of kNN. To overcome the problems of low efficiency and dependency on k, we select a few representatives from the training set, together with some extra information, to represent the whole training set. In the selection of each representative we use an optimal but different k, decided automatically according to the local data distribution, to eliminate
the dependency on k without user intervention. Experimental results on fifteen public data sets have shown that SBModel and its variations ε-SBModel and p-SBModel are quite competitive for classification. Their average classification accuracies on the fifteen public data sets are better than that of C5.0 and are comparable with those of kNN and wkNN. At the same time, our proposed SBModel and its variations significantly reduce the number of instances in the final model for classification, with a reduction rate ranging from 88.63% to 92.33%. Moreover, compared to other reduction techniques, ε-SBModel obtains the best performance. It keeps only 7.67 percent of the instances of the original training set on average for classification whilst improving on the classification accuracy of kNN and wkNN. It is a good alternative to kNN in many application areas, such as text categorization and financial stock market analysis and prediction.
References

1. Aha DW, Kibler D, Albert MK (1991) Instance-Based Learning Algorithms, Machine Learning, 6, pp. 37-66.
2. Aha DW (1992) Tolerating Noisy, Irrelevant and Novel Attributes in Instance-Based Learning Algorithms, International Journal of Man-Machine Studies, 36, pp. 267-287.
3. Cameron-Jones RM (1995) Instance Selection by Encoding Length Heuristic with Random Mutation Hill Climbing, Proc. of the 8th Australian Joint Conference on Artificial Intelligence, pp. 99-106.
4. Devijver P, Kittler J (1982) Pattern Recognition: A Statistical Approach, Prentice-Hall, Englewood Cliffs, NJ.
5. Gates G (1972) The Reduced Nearest Neighbor Rule, IEEE Transactions on Information Theory, 18, pp. 431-433.
6. Hand D, Mannila H, Smyth P (2001) Principles of Data Mining, The MIT Press.
7. Hart P (1968) The Condensed Nearest Neighbor Rule, IEEE Transactions on Information Theory, 14, pp. 515-516.
8. Ritter GL, Woodruff HB, Lowry SR et al (1975) An Algorithm for a Selective Nearest Neighbor Decision Rule, IEEE Transactions on Information Theory, 21-6, pp. 665-669.
9. Sebastiani F (2002) Machine Learning in Automated Text Categorization, ACM Computing Surveys, 34-1, pp. 1-47.
10. Stanfill C, Waltz D (1986) Toward Memory-Based Reasoning, Communications of the ACM, 29, pp. 1213-1228.
11. Tomek I (1976) An Experiment with the Edited Nearest-Neighbor Rule, IEEE Transactions on Systems, Man, and Cybernetics, 6-6, pp. 448-452.
12. Wang H (2003) Contextual Probability, Journal of Telecommunications and Information Technology, 4(3), pp. 92-97.
13. Wilson DL (1972) Asymptotic Properties of Nearest Neighbor Rules Using Edited Data, IEEE Transactions on Systems, Man, and Cybernetics, 2-3, pp. 408-421.
14. Wilson DR, Martinez TR (1997) Improved Heterogeneous Distance Functions, Journal of Artificial Intelligence Research (JAIR), 6-1, pp. 1-34.
15. Wilson DR, Martinez TR (2000) Reduction Techniques for Instance-Based Learning Algorithms, Machine Learning, 38-3, pp. 257-286.
17 Decision Trees and Reducts for Distributed Decision Tables

Mikhail Ju. Moshkov
Institute of Computer Science, University of Silesia
39, Będzińska St., Sosnowiec, 41-200, Poland
[email protected]
Summary. In the paper greedy algorithms for construction of decision trees and relative reducts for joint decision table generated by distributed decision tables are studied. Two ways for definition of joint decision table are considered: based on the assumption that the universe of joint table is the intersection of universes of distributed tables, and based on the assumption that the universe of joint table is the union of universes of distributed tables. Furthermore, a case is considered when the information about distributed decision tables is given in the form of decision rule systems. Key words: distributed decision tables, decision trees, relative reducts, greedy algorithms
17.1 Introduction

In this paper we investigate distributed decision tables, which can be useful for the study of multi-agent systems. Let T_1, ..., T_m be decision tables, and {a_1, ..., a_n} be the set of attributes of these tables. Two questions are considered: how to define a joint decision table T with attributes a_1, ..., a_n generated by the tables T_1, ..., T_m, and how to construct decision trees and relative reducts for the table T. We study two extreme cases:

• The universe of the table T is the intersection of the universes of the tables T_1, ..., T_m.
• The universe of the table T is the union of the universes of the tables T_1, ..., T_m.

In reality, we consider a more complicated situation in which we do not know the universes of the tables T_1, ..., T_m exactly. In this case we must use upper approximations of the table T, which are tables containing at least all rows from T. We study two such approximations which are minimal in some sense: the table T^∩ for the case of the intersection of universes, and the table T^∪ for the case of the union of universes. We show that in the first case (intersection of universes) even simple problems (for given tables T_1, ..., T_m and a decision tree it is required to recognize whether this tree is a decision tree for the table T^∩; for given tables T_1, ..., T_m and a subset of the set
{a_1, ..., a_n} it is required to recognize whether this subset is a relative reduct for the table T^∩) are NP-hard. We consider approaches to the minimization of decision tree depth and relative reduct cardinality on some subsets of decision trees and relative reducts. In the second case (union of universes) the situation is similar to that for a single decision table: there exist greedy algorithms for decision tree depth minimization and for relative reduct cardinality minimization which have relatively good bounds on precision. Furthermore, we consider the following problem. Suppose we have a complicated system consisting of parts Q_1, ..., Q_m. For each part Q_j the set of normal states is described by a decision rule system S_j. It is required to recognize, for each part Q_j, whether the state of this part is normal or abnormal. We consider an algorithm for minimizing the depth of decision trees solving this problem, and bounds on precision for this algorithm. The results of the paper are obtained in the framework of rough set theory [8,9]. However, for simplicity we consider only "crisp" decision tables, in which there are no equal rows labelled with different decisions. The paper consists of six sections. In the second section we consider known results on algorithms for the minimization of decision tree depth and relative reduct cardinality for a single decision table. In the third and fourth sections we consider algorithms for the construction of decision trees and relative reducts for the joint tables T^∩ and T^∪ respectively. In the fifth section we consider an algorithm which, for given rule systems S_1, ..., S_m, constructs a decision tree that recognizes the presence of realized rules in each of these systems. The sixth section contains a short conclusion.
17.2 Single Decision Table

Consider a decision table T (see Fig. 17.1) which has t columns labelled with attributes a_1, ..., a_t. These attributes take values from the set {0,1}. For simplicity, we assume that rows are pairwise different (rows in our case correspond not to objects from the universe, but to equivalence classes of the indiscernibility relation). Each row is labelled with a decision d.
Fig. 17.1. Decision table T
We associate a classification problem with the decision table T: for a given row it is required to find the decision attached to this row. To this end we can use the values of the attributes a_1, ..., a_t.
A test for the table T is a subset of attributes which allows us to separate each pair of rows with different decisions. A relative reduct (reduct) for the table T is a test of which no proper subset is a test. A decision tree for the table T is a rooted tree in which each terminal node is labelled with a decision, each non-terminal node is labelled with an attribute, two edges start in each non-terminal node, and these edges are labelled with the numbers 0 and 1. For each row, the work of the decision tree finishes in a terminal node labelled with the decision corresponding to that row. The depth of the tree is the maximal length of a path from the root to a terminal node. It is well known that the problem of reduct cardinality minimization and the problem of decision tree depth minimization are NP-hard. So we will consider only approximate algorithms for the optimization of reducts and trees.

17.2.1 Greedy Algorithm for Decision Tree Construction

Denote by P(T) the number of unordered pairs of rows with different decisions. This number will be interpreted as the uncertainty of the table T. A sub-table of the table T is any table obtained from T by the removal of some rows. For any a_i ∈ {a_1, ..., a_t} and b ∈ {0,1} denote by T(a_i, b) the sub-table of T consisting of the rows which contain the number b at the intersection with the column labelled with a_i. If we compute the value of the attribute a_i then the uncertainty in the worst case will be equal to

$$U(T, a_i) = \max\{P(T(a_i, 0)), P(T(a_i, 1))\}.$$

Let P(T) ≠ 0. Then we compute the value of an attribute a_i for which U(T, a_i) is minimal. Depending on the value of a_i, the given row will be localized either in the sub-table T(a_i, 0) or in the sub-table T(a_i, 1), and so on. The algorithm finishes its work when, for every terminal node of the constructed tree, the sub-table T' corresponding to this node satisfies P(T') = 0.

It is clear that the considered greedy algorithm has polynomial time complexity. Denote by h(T) the minimal depth of a decision tree for the table T. Denote by h_greedy(T) the depth of a decision tree for the table T constructed by the greedy algorithm. It is known [3,4] that

$$h_{greedy}(T) \le h(T) \ln P(T) + 1.$$

Using results of Feige [1], one can show (see [5]) that if NP ⊄ DTIME(n^{O(log log n)}) then for any ε > 0 there is no polynomial algorithm that for a given decision table T constructs a decision tree for this table whose depth is at most

$$(1 - \varepsilon) h(T) \ln P(T).$$

Thus, the considered algorithm is close to the best polynomial approximate algorithms for decision tree depth minimization.
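The greedy construction above can be sketched in a few lines. The representation is an assumption chosen for illustration: a table is a list of (row, decision) pairs with rows given as tuples of 0/1 values, decisions are not themselves tuples, and a tree is either a decision (leaf) or a triple (attribute index, subtree for 0, subtree for 1). The function names are hypothetical.

```python
from itertools import combinations

def uncertainty(rows):
    """P(T): number of unordered pairs of rows with different decisions."""
    return sum(1 for (_, d1), (_, d2) in combinations(rows, 2) if d1 != d2)

def greedy_tree(rows, n_attrs):
    """Build a decision tree by always testing the attribute a_i
    that minimizes U(T, a_i) = max(P(T(a_i, 0)), P(T(a_i, 1)))."""
    if uncertainty(rows) == 0:
        # All remaining rows share one decision (None marks an empty sub-table)
        return rows[0][1] if rows else None

    def sub_table(i, b):
        return [r for r in rows if r[0][i] == b]

    best = min(range(n_attrs),
               key=lambda i: max(uncertainty(sub_table(i, 0)),
                                 uncertainty(sub_table(i, 1))))
    return (best,
            greedy_tree(sub_table(best, 0), n_attrs),
            greedy_tree(sub_table(best, 1), n_attrs))

def depth(tree):
    """Maximal number of attribute tests on a path from the root to a leaf."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[1]), depth(tree[2]))

# Decision table of the conjunction a_1 AND a_2
table = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
tree = greedy_tree(table, 2)
```

For this table P(T) = 3 and both attributes give U = 1, so the first one is tested at the root and the second one below it. The recursion terminates because rows are pairwise different: whenever P(T) ≠ 0 some attribute separates a pair of rows with different decisions, and any such attribute has U(T, a_i) < P(T).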
It is possible to use other uncertainty measures in the considered greedy algorithm. Let F be an uncertainty measure, T a decision table, T' a sub-table of T, a_i an attribute of T and b ∈ {0,1}. Consider conditions which allow us to obtain bounds on the precision of the greedy algorithm:

(a) F(T) − F(T(a_i, b)) ≥ F(T') − F(T'(a_i, b)).
(b) F(T) = 0 iff T has no rows with different decisions.
(c) If F(T) = 0 then T has no rows with different decisions.

One can show that if F satisfies conditions (a) and (b) then h_greedy(T) [...] ε > 0 there is no polynomial algorithm that for a given decision table T constructs a reduct for this table whose cardinality is at most

$$(1 - \varepsilon) R(T) \ln P(T).$$

Thus, the considered algorithm is close to the best polynomial approximate algorithms for reduct cardinality minimization. To obtain bounds on the precision of this algorithm it is important that we can represent the considered problem as a set cover problem. We can weaken this condition and consider a set cover problem such that each cover corresponds to a reduct, but not each reduct corresponds to a cover. In this case we solve the problem of reduct cardinality minimization not on the set of all reducts but on a subset of reducts, and we are still able to obtain some bounds on the precision of the greedy algorithm on the considered subset.
17 Decision Trees and Reducts for Distributed Decision Tables
17.3 Distributed Decision Tables. Intersection of Universes

In this section we consider the case when the universe of the joint decision table is the intersection of the universes of the distributed decision tables.

17.3.1 Joint Decision Table $T^\cap$

Let $T_1, \ldots, T_m$ be decision tables and $\{a_1, \ldots, a_n\}$ be the set of attributes of these tables. Let $b = (b_1, \ldots, b_n) \in \{0,1\}^n$, $j \in \{1, \ldots, m\}$ and $\{a_{i_1}, \ldots, a_{i_t}\}$ be the set of attributes of the table $T_j$. We will say that the row $b$ corresponds to the table $T_j$ if $(b_{i_1}, \ldots, b_{i_t})$ is a row of the table $T_j$. In this case we will say that $(b_{i_1}, \ldots, b_{i_t})$ is the row from $T_j$ corresponding to $b$. Let us define the table $T^\cap$ (see Fig. 17.2). This table has $n$ columns labelled by the attributes $a_1, \ldots, a_n$. The row $b = (b_1, \ldots, b_n) \in \{0,1\}^n$ is a row of the table $T^\cap$ iff $b$ corresponds to each table $T_j$, $j \in \{1, \ldots, m\}$. This row is labelled by the tuple $(d_1, \ldots, d_m)$ where $d_j$ is the decision attached to the row from $T_j$ corresponding to $b$, $j \in \{1, \ldots, m\}$. Sometimes we will denote the table $T^\cap$ by $T_1 \times \ldots \times T_m$.
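The definition of the joint table can be made concrete by brute-force enumeration of $\{0,1\}^n$. The sketch below is illustrative only and exponential in $n$; it assumes each table $T_j$ is given as a pair `(attrs, rows)`, where `attrs` lists the column indices of its attributes and `rows` maps a value tuple to a decision. The name `joint_intersection` is hypothetical.

```python
from itertools import product

def joint_intersection(tables, n):
    """Rows b of {0,1}^n that correspond to every table, with tuple labels."""
    joint = {}
    for b in product((0, 1), repeat=n):
        decisions = []
        for attrs, rows in tables:
            key = tuple(b[i] for i in attrs)   # projection of b onto T_j
            if key not in rows:
                break                          # b does not correspond to T_j
            decisions.append(rows[key])
        else:
            joint[b] = tuple(decisions)        # b corresponds to all tables
    return joint

# Two small tables over attributes (a_1, a_2) and (a_2, a_3), zero-indexed.
t1 = ((0, 1), {(0, 0): "p", (1, 1): "q"})
t2 = ((1, 2), {(0, 1): "u", (1, 0): "v", (1, 1): "v"})
joint = joint_intersection([t1, t2], 3)
```

For these two tables the joint table has three rows, for instance `(0, 0, 1)` labelled by `("p", "u")`.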
Fig. 17.2. Joint decision table $T^\cap$ (each row is labelled by a tuple of decisions $(d_1, \ldots, d_m)$)

One can interpret the table $T^\cap$ in the following way. Let $U_1, \ldots, U_m$ be the universes corresponding to the tables $T_1, \ldots, T_m$ and $U_\cap = U_1 \cap \ldots \cap U_m$. If we know the set $U_\cap$ we can consider the table $T(U_\cap)$ with attributes $a_1, \ldots, a_n$, rows corresponding to objects from $U_\cap$, and decisions of the kind $(d_1, \ldots, d_m)$. Assume now that we do not know the set $U_\cap$. In this case we must consider an upper approximation of the table $T(U_\cap)$, that is, a table containing all rows from $T(U_\cap)$. The table $T^\cap$ is the minimal upper approximation in the case when we have no additional information about the set $U_\cap$.
17.3.2 On Construction of Decision Trees and Reducts for $T^\cap$

Our aim is to construct decision trees and reducts for the table $T^\cap$. Unfortunately, it is very difficult to work with this table. One can show that the following problems are NP-hard:

• For given tables $T_1, \ldots, T_m$, recognize whether the table $T^\cap = T_1 \times \ldots \times T_m$ is empty.
• For given tables $T_1, \ldots, T_m$, recognize whether the table $T^\cap$ has rows with different decisions.
• For given tables $T_1, \ldots, T_m$ and a decision tree $\Gamma$, recognize whether $\Gamma$ is a decision tree for the table $T^\cap$.
• For given tables $T_1, \ldots, T_m$ and a subset $D$ of the set $\{a_1, \ldots, a_n\}$, recognize whether $D$ is a reduct for the table $T^\cap$.
So in practice we can use only sufficient conditions for a decision tree to be a decision tree for the table $T^\cap$ and for a subset of the set $\{a_1, \ldots, a_n\}$ to be a reduct for the table $T^\cap$. If $P \neq NP$ then there are no simple (polynomial) uncertainty measures satisfying the condition (b). Now we consider two examples of polynomial uncertainty measures satisfying the conditions (a) and (c).

Let $a_{i_1}, \ldots, a_{i_t} \in \{a_1, \ldots, a_n\}$, $b_1, \ldots, b_t \in \{0,1\}$, and $\alpha = (a_{i_1}, b_1) \ldots (a_{i_t}, b_t)$. Denote by $T^\cap\alpha$ the sub-table of $T^\cap$ which consists of the rows that have the numbers $b_1, \ldots, b_t$ at the intersection with the columns labelled by $a_{i_1}, \ldots, a_{i_t}$. Let $j \in \{1, \ldots, m\}$ and $A_j$ be the set of attributes of the table $T_j$. Denote by $T_j\alpha$ the sub-table of $T_j$ consisting of the rows which have the number $b_k$ at the intersection with the column labelled by $a_{i_k}$ for each $a_{i_k} \in A_j \cap \{a_{i_1}, \ldots, a_{i_t}\}$.

Consider an uncertainty measure $F_1$ such that

$$F_1(T^\cap\alpha) = P(T_1\alpha) + \ldots + P(T_m\alpha).$$

One can show that this measure satisfies the conditions (a) and (c). Unfortunately, this measure does not allow us to use relationships among the tables $T_1, \ldots, T_m$. We now describe another, essentially more complicated but still polynomial, measure which takes some of these relationships into account.

Consider an uncertainty measure $F_2$. For simplicity, we define the value of this measure only for the table $T^\cap$ (the value $F_2(T^\cap\alpha)$ can be defined in a similar way). Set

$$F_2(T^\cap) = G_1 + \ldots + G_m.$$

Let $j \in \{1, \ldots, m\}$, and let the table $T_j$ have $p$ rows $r_1, \ldots, r_p$. Then

$$G_j = \sum V_l V_k,$$

where the sum is taken over all $1 \le l < k \le p$ such that the rows $r_l$ and $r_k$ have different decisions. Let $q \in \{1, \ldots, p\}$. Then

$$V_q = V_q^{1} \cdot \ldots \cdot V_q^{m},$$

where $V_q^{i}$, $i = 1, \ldots, m$, is the number of rows $r$ in the table $T_j \times T_i$ such that $r_q$ is the row from $T_j$ corresponding to $r$. It is not difficult to prove that this measure satisfies the conditions (a) and (c).

One can show that if $P \neq NP$ then it is impossible to reduce effectively the problem of reduct cardinality minimization to a set cover problem. However, we can consider set cover problems such that each cover corresponds to a reduct, but not each reduct corresponds to a cover. Consider an example. Denote by $B(T_j)$, $j = 1, \ldots, m$, the set of unordered pairs of rows from $T_j$ with different decisions, $B = B(T_1) \cup \ldots \cup B(T_m)$, and by $C_i$ the set of pairs from $B$ separated by $a_i$, $i = 1, \ldots, n$. It is not difficult to show that the set cover problem for the set $B$ and the family $\{C_1, \ldots, C_n\}$ of subsets of $B$ has the following properties: each cover corresponds to a reduct for the table $T^\cap$, but (in the general case) not each reduct corresponds to a cover.
17.4 Distributed Decision Tables. Union of Universes

In this section we consider the case when the universe of the joint decision table is the union of the universes of the distributed decision tables.

17.4.1 Joint Decision Table $T^\cup$

Let $T_1, \ldots, T_m$ be decision tables and $\{a_1, \ldots, a_n\}$ be the set of attributes of these tables. Let us define the table $T^\cup$ (see Fig. 17.3). This table has $n$ columns labelled by the attributes $a_1, \ldots, a_n$. The row $b = (b_1, \ldots, b_n) \in \{0,1\}^n$ is a row of the table $T^\cup$ iff there exists $j \in \{1, \ldots, m\}$ such that $b$ corresponds to the table $T_j$. This row is labelled by the tuple $(d_1^*, \ldots, d_m^*)$ where $d_j^*$ is the decision $d_j$ attached to the row from $T_j$ corresponding to $b$ if $b$ corresponds to the table $T_j$, and a gap otherwise, $j \in \{1, \ldots, m\}$.
Fig. 17.3. Joint decision table $T^\cup$ (each row is labelled by a tuple of decisions and gaps such as $(d_1, -, \ldots, d_m)$)

Two tuples of decisions and gaps are considered different iff there exists a position in which these tuples have different decisions (in other words, we interpret a gap as an arbitrary decision). We must localize a given row in a sub-table of the table $T^\cup$ which does not contain rows labelled by different tuples of decisions and gaps. Most of the results considered in Sect. 17.2 are valid for joint tables $T^\cup$ too.

One can interpret the table $T^\cup$ in the following way. Let $U_1, \ldots, U_m$ be the universes corresponding to the tables $T_1, \ldots, T_m$, and $U_\cup = U_1 \cup \ldots \cup U_m$. If we know the set $U_\cup$ we can consider the table $T(U_\cup)$ with attributes $a_1, \ldots, a_n$, rows corresponding to objects from $U_\cup$, and decisions of the kind $(d_1, -, \ldots, d_m)$. Assume now that we do not know the set $U_\cup$. In this case we must consider an upper approximation of the table $T(U_\cup)$, that is, a table containing all rows from $T(U_\cup)$. The table $T^\cup$ is the minimal upper approximation in the case when we have no additional information about the set $U_\cup$.
17.4.2 On Construction of Decision Trees and Reducts for $T^\cup$

Let $a_{i_1}, \ldots, a_{i_t} \in \{a_1, \ldots, a_n\}$, $b_1, \ldots, b_t \in \{0,1\}$, and $\alpha = (a_{i_1}, b_1) \ldots (a_{i_t}, b_t)$. Consider the uncertainty measure $F_1$ such that

$$F_1(T^\cup\alpha) = P(T_1\alpha) + \ldots + P(T_m\alpha).$$

One can show that this measure satisfies the conditions (a) and (b). So we can use the greedy algorithm for decision tree depth minimization based on the measure $F_1$, and we can obtain relatively good bounds on the precision of this algorithm. The number of nodes in the constructed tree can grow exponentially in $m$. However, we can efficiently simulate the work of this tree by constructing a path from the root to a terminal node.

Denote by $B(T_j)$, $j = 1, \ldots, m$, the set of unordered pairs of rows from $T_j$ with different decisions, $B = B(T_1) \cup \ldots \cup B(T_m)$, and by $C_i$ the set of pairs from $B$ separated by $a_i$, $i = 1, \ldots, n$. It is not difficult to prove that the problem of reduct cardinality minimization for $T^\cup$ is equivalent to the set cover problem for the set $B$ and the family $\{C_1, \ldots, C_n\}$ of subsets of $B$. So we can use the greedy algorithm for minimization of reduct cardinality, and we can obtain relatively good bounds on the precision of this algorithm.
17.5 From Decision Rule Systems to Decision Tree

Instead of distributed decision tables we can have information on such tables represented in the form of decision rule systems. Suppose we have a complicated object $Q$ whose state is described by values of attributes $a_1, \ldots, a_n$. Let $Q_1, \ldots, Q_m$ be parts of $Q$. For $j = 1, \ldots, m$ the state of $Q_j$ is described by values of attributes from a subset $A_j$ of the set $\{a_1, \ldots, a_n\}$. For any $Q_j$ we have a system $S_j$ of decision rules of the kind

$$a_{i_1} = b_1 \wedge \ldots \wedge a_{i_t} = b_t \to \mathit{normal}$$

where $a_{i_1}, \ldots, a_{i_t}$ are pairwise different attributes from $A_j$, and $b_1, \ldots, b_t$ are values of these attributes (not necessarily numbers from $\{0,1\}$). These rules describe the set of all normal states of $Q_j$. We will assume that for any $j \in \{1, \ldots, m\}$ and for any two rules from $S_j$ the set of conditions of the first rule is not a subset of the set of conditions of the second rule. We will assume also that all combinations of values of attributes are possible, and that for each attribute there exists an "abnormal" value which does not occur in the rules (for example, a missing value of the attribute). For each part $Q_j$ we must find a rule from $S_j$ which is realized (in this case $Q_j$ has a normal state) or must show that all rules from $S_j$ are non-realized (in this case $Q_j$ has an abnormal state).
Consider a simple algorithm for the construction of a decision tree solving this problem. In fact, we will construct only a path from the root to a terminal node of the tree. We describe the main step of this algorithm, which consists of 6 sub-steps.

Main step:
1. Find a minimal set of attributes which covers all rules (an attribute covers a rule if it occurs in this rule).
2. Compute the values of all attributes from this cover.
3. Remove all rules which contradict the obtained values of attributes. If after this sub-step a system of rules $S_j$, $j \in \{1, \ldots, m\}$, is empty, then the corresponding part $Q_j$ has an abnormal state.
4. Remove from the left side of each rule all conditions (equalities) containing attributes from the cover.
5. Remove all rules with an empty left side (such rules are realized). Remove all rules from each system $S_j$ which has realized rules. For each such system the corresponding part $Q_j$ has a normal state.
6. If the obtained rule system is not empty, then repeat the main step.

Denote by $h$ the minimal depth of a decision tree solving the considered problem, by $h_{\mathrm{alg}}$ the depth of the decision tree constructed by the considered algorithm, and by $L$ the maximal length of a rule from the system $S = S_1 \cup \ldots \cup S_m$. One can prove that

$$\max\{L, h\} \le h_{\mathrm{alg}} \le L \times h.$$

It is possible to modify the considered algorithm so that the cover of rules by attributes is constructed using the greedy algorithm for the set cover problem. Denote by $h_{\mathrm{alg}}^{\mathrm{greedy}}$ the depth of the constructed decision tree. By $N$ we denote the number of rules in the system $S$. One can prove that
$h_{\mathrm{alg}}^{\mathrm{greedy}}$ [...] is a function for $a \in A$, $V = \bigcup\{V_a : a \in A\}$, where $V_a$ is a set of values of the attribute $a \in A$.
Elements of $U$ are called objects. In this paper, they are often seen as customers. Attributes are interpreted as features, offers made by a bank, characteristic conditions etc. By a decision table we mean any information system where the set of attributes is partitioned into conditions and decisions. Additionally, we assume that the set of conditions is partitioned into stable conditions and flexible conditions. For simplicity, we also assume that there is only one decision attribute. Date of Birth is an example of a stable attribute. Interest rate on any customer account is an example
19 In Search for Action Rules of the Lowest Cost
of a flexible attribute (dependent on the bank). We adopt the following definition of a decision table: by a decision table we mean an information system of the form $S = (U, A_{St} \cup A_{Fl} \cup \{d\})$, where $d \notin A_{St} \cup A_{Fl}$ is a distinguished attribute called the decision. The elements of $A_{St}$ are called stable conditions, whereas the elements of $A_{Fl}$ are called flexible conditions. As an example of a decision table we take $S = (\{x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8\}, \{a, c\} \cup \{b\} \cup \{d\})$ represented by Table 19.1. The set $\{a, c\}$ lists stable attributes, $b$ is a flexible attribute and $d$ is a decision attribute. Also, we assume that $H$ denotes a high profit and $L$ denotes a low one.
X     a   b   c   d
x1    0   S   0   L
x2    0   R   1   L
x3    0   S   0   L
x4    0   R   1   L
x5    2   P   2   L
x6    2   P   2   L
x7    2   S   2   H
x8    2   S   2   H

Table 19.1. Decision System S
In order to induce rules in which the THEN part consists of the decision attribute $d$ and the IF part consists of attributes belonging to $A_{St} \cup A_{Fl}$, subtables $(U, B \cup \{d\})$ of $S$, where $B$ is a $d$-reduct (see [4]) in $S$, should be used for rule extraction. By $L(r)$ we mean all attributes listed in the IF part of a rule $r$. For example, if $r = [(a, 2) * (b, S) \to (d, H)]$ is a rule then $L(r) = \{a, b\}$. By $d(r)$ we denote the decision value of a rule. In our example $d(r) = H$. If $r_1$, $r_2$ are rules and $B \subseteq A_{St} \cup A_{Fl}$ is a set of attributes, then $r_1/B = r_2/B$ means that the conditional parts of rules $r_1$, $r_2$ restricted to attributes $B$ are the same. For example if $r_1 = [(b, S) * (c, 2) \to (d, H)]$, then $r_1/\{b\} = r/\{b\}$. In our example, we get the following optimal rules:

$(a, 0) \to (d, L)$, $(c, 0) \to (d, L)$, $(b, R) \to (d, L)$, $(c, 1) \to (d, L)$, $(b, P) \to (d, L)$, $(a, 2) * (b, S) \to (d, H)$, $(b, S) * (c, 2) \to (d, H)$.

Now, let us assume that $(a, v \to w)$ denotes the fact that the value of attribute $a$ has been changed from $v$ to $w$. Similarly, the term $(a, v \to w)(x)$ means that $a(x) = v$ has been changed to $a(x) = w$. In other words, the property $(a, v)$ of object $x$ has been changed to property $(a, w)$.
Zbigniew W. Ras and Angelina A. Tzacheva
Let $S = (U, A_{St} \cup A_{Fl} \cup \{d\})$ be a decision table and let rules $r_1$, $r_2$ be extracted from $S$. Assume that $B_1$ is a maximal subset of $A_{St}$ such that $r_1/B_1 = r_2/B_1$, $d(r_1) = k_1$, $d(r_2) = k_2$ and the user is interested in reclassifying objects from class $k_1$ to class $k_2$. Also, assume that $(b_1, b_2, \ldots, b_p)$ is a list of all attributes in $L(r_1) \cap L(r_2) \cap A_{Fl}$ on which $r_1$, $r_2$ differ and $r_1(b_1) = v_1$, $r_1(b_2) = v_2$, ..., $r_1(b_p) = v_p$, $r_2(b_1) = w_1$, $r_2(b_2) = w_2$, ..., $r_2(b_p) = w_p$. By an $(r_1, r_2)$-action rule on $x \in U$ we mean an expression (see [7]):

$$[(b_1, v_1 \to w_1) \wedge (b_2, v_2 \to w_2) \wedge \ldots \wedge (b_p, v_p \to w_p)](x) \Rightarrow [(d, k_1 \to k_2)](x).$$

The rule is valid if its value on $x$ is true in $S$ (there is an object $x_2 \in S$ which does not contradict $x$ on stable attributes in $S$ and $(\forall i \le p)(\forall b_i)[b_i(x_2) = w_i] \wedge d(x_2) = k_2$). Otherwise it is false.
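The construction of an $(r_1, r_2)$-action rule and the validity check can be sketched in Python on the decision system of Table 19.1. This is an illustration under assumptions: a rule is modelled as a pair `(conditions, decision_value)` with `conditions` a dictionary, objects are dictionaries, and the function names `action_rule` and `is_valid` are hypothetical.

```python
def action_rule(r1, r2, flexible):
    """Changes (b, v -> w) on flexible attributes shared by r1 and r2."""
    (cond1, k1), (cond2, k2) = r1, r2
    shared = set(cond1) & set(cond2) & set(flexible)
    changes = {b: (cond1[b], cond2[b])
               for b in sorted(shared) if cond1[b] != cond2[b]}
    return changes, (k1, k2)

def is_valid(rule, x, objects, stable, decision="d"):
    """True if some object x2 agrees with x on the stable attributes,
    has the target values w_i on the changed attributes, and is in class k2."""
    changes, (_, k2) = rule
    return any(all(x2[a] == x[a] for a in stable)
               and all(x2[b] == w for b, (_, w) in changes.items())
               and x2[decision] == k2
               for x2 in objects)

# Table 19.1 (objects x1..x8).
objects = [
    {"a": 0, "b": "S", "c": 0, "d": "L"}, {"a": 0, "b": "R", "c": 1, "d": "L"},
    {"a": 0, "b": "S", "c": 0, "d": "L"}, {"a": 0, "b": "R", "c": 1, "d": "L"},
    {"a": 2, "b": "P", "c": 2, "d": "L"}, {"a": 2, "b": "P", "c": 2, "d": "L"},
    {"a": 2, "b": "S", "c": 2, "d": "H"}, {"a": 2, "b": "S", "c": 2, "d": "H"},
]
r1 = ({"b": "P"}, "L")            # (b, P) -> (d, L)
r2 = ({"a": 2, "b": "S"}, "H")    # (a, 2) * (b, S) -> (d, H)
rule = action_rule(r1, r2, flexible={"b"})
```

For these two rules the sketch yields the action rule $(b, P \to S) \Rightarrow (d, L \to H)$, which is valid on $x_5$ because $x_7$ agrees with $x_5$ on the stable attributes $a$ and $c$.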
19.3 Distributed Information System

By a distributed information system we mean a pair $DS = (\{S_i\}_{i \in I}, L)$ where:

• $I$ is a set of sites.
• $S_i = (X_i, A_i, V_i)$ is an information system for any $i \in I$.
• $L$ is a symmetric, binary relation on the set $I$ showing which systems can directly communicate with each other.
A distributed information system $DS = (\{S_i\}_{i \in I}, L)$ is consistent if the following condition holds:

$$(\forall i)(\forall j)(\forall x \in X_i \cap X_j)(\forall a \in A_i \cap A_j)\ [(a_{[S_i]}(x) \subseteq a_{[S_j]}(x)) \text{ or } (a_{[S_j]}(x) \subseteq a_{[S_i]}(x))].$$

Consistency basically means that information about any object $x$ in one system can be either more general or more specific than in the other. In other words, two systems cannot store conflicting information about any object $x$. Another problem which has to be taken into consideration is the semantics of attributes which are common for a client and some of its remote sites. This semantics may easily differ from site to site. Sometimes such a difference in semantics can be repaired quite easily. For instance, if Temperature in Celsius is used at one site and Temperature in Fahrenheit at the other, a simple mapping will fix the problem. If information systems are complete and two attributes have the same name and differ only in their granularity level, a new hierarchical attribute can be formed to fix the problem. If databases are incomplete, the problem is more complex because of the number of options available to interpret incomplete values (including null values). The problem is especially difficult in a distributed framework when chase techniques based on rules extracted at the client and at remote sites (see [6]) are used by the client to replace current values by values which are less incomplete. In this paper we concentrate on granularity-based semantic inconsistencies. Assume first that $S_i = (X_i, A_i, V_i)$ is an information system for any $i \in I$ and that
all $S_i$'s form a Distributed Information System (DIS). Additionally, we assume that, if $a \in A_i \cap A_j$, then only the granularity levels of $a$ in $S_i$ and $S_j$ may differ but conceptually its meaning, both in $S_i$ and $S_j$, is the same. Assume now that $L(D_i)$ is a set of action rules extracted from $S_i$, which means that $D = \bigcup_{i \in I} L(D_i)$ is a set of action rules which can be used in the process of distributed action rules discovery. Now, let us say that system $S_k$, $k \in I$, is queried by a user for an action rule reclassifying objects with respect to decision attribute $d$. Any strategy for discovering action rules from $S_k$ based on action rules $D' \subseteq D$ is called sound if the following three conditions are satisfied:

• for any action rule in $D'$, the value of its decision attribute $d$ is of the granularity level either equal to or finer than the granularity level of the attribute $d$ in $S_k$.
• for any action rule in $D'$, the granularity level of any attribute $a$ used in the classification part of that rule is either equal to or softer than the granularity level of $a$ in $S_k$.
• any attribute used in the decision part of a rule has to be classified as flexible in $S_k$.
In the next section, we assume that if any attribute is used at two different sites of DIS, then at both of them its semantics is the same and its attribute values are of the same granularity level.
19.4 Cost and Feasibility of Action Rules

Assume now that $DS = (\{S_i : i \in I\}, L)$ is a distributed information system (DIS), where $S_i = (X_i, A_i, V_i)$, $i \in I$. Let $b \in A_i$ be a flexible attribute in $S_i$ and $b_1, b_2 \in V_i$ be two of its values. By $\rho_{S_i}(b_1, b_2)$ we mean a number from $(0, +\infty]$ which describes the average cost of changing the attribute value from $b_1$ to $b_2$ for any of the qualifying objects in $S_i$. An object $x \in X_i$ qualifies for the change from $b_1$ to $b_2$ if $b(x) = b_1$. If the implementation of the above change is not feasible for one of the qualifying objects in $S_i$, then we write $\rho_{S_i}(b_1, b_2) = +\infty$. A value of $\rho_{S_i}(b_1, b_2)$ close to zero means that the change of values from $b_1$ to $b_2$ is quite easy to accomplish for qualifying objects in $S_i$, whereas any large value of $\rho_{S_i}(b_1, b_2)$ means that this change of values is practically very difficult to achieve for some of the qualifying objects in $S_i$. If $\rho_{S_i}(b_1, b_2) < \rho_{S_i}(b_3, b_4)$, then we say that the change of values from $b_1$ to $b_2$ is more feasible than the change from $b_3$ to $b_4$. We assume here that the values $\rho_{S_i}(b_{j1}, b_{j2})$ are provided by experts for each of the information systems $S_i$. They are seen as atomic expressions and will be used to introduce the formal notion of the feasibility and the cost of action rules in $S_i$. So, let us assume that

$$r = [(b_1, v_1 \to w_1) \wedge (b_2, v_2 \to w_2) \wedge \ldots \wedge (b_p, v_p \to w_p)](x) \Rightarrow (d, k_1 \to k_2)(x)$$

is an $(r_1, r_2)$-action rule. By the cost of $r$, denoted by $cost(r)$, we mean the value $\sum\{\rho_{S_i}(v_k, w_k) : 1 \le k \le p\}$. We say that $r$ is feasible if $cost(r) < \rho_{S_i}(k_1, k_2)$. It means that for any feasible rule $r$, the cost of the conditional part of $r$ is lower than the cost of its decision part and clearly $cost(r) < +\infty$.
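The cost and feasibility definitions translate directly into a small Python sketch. The representation is an assumption of this illustration: expert costs are stored in a dictionary `rho` keyed by `(attribute, old_value, new_value)`, a missing entry is treated as $+\infty$, and the numbers are invented for the example.

```python
import math

def cost(changes, rho):
    """cost(r): sum of rho(v_k, w_k) over the terms (b_k, v_k -> w_k)."""
    return sum(rho.get((b, v, w), math.inf) for b, (v, w) in changes.items())

def is_feasible(changes, decision_change, rho, d="d"):
    """r is feasible if cost(r) < rho(k1, k2) for the decision attribute d."""
    k1, k2 = decision_change
    return cost(changes, rho) < rho.get((d, k1, k2), math.inf)

# Hypothetical expert-provided costs.
rho = {("b", "P", "S"): 3.0, ("d", "L", "H"): 10.0}
changes = {"b": ("P", "S")}
```

Since `math.inf < math.inf` is false, a rule containing an infeasible change is never reported as feasible, which matches the remark that $cost(r) < +\infty$ for every feasible rule.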
Assume now that $d$ is a decision attribute in $S_i$, $k_1, k_2 \in V_d$, and the user would like to re-classify some customers in $S_i$ from the group $k_1$ to the group $k_2$. To achieve this goal he may look for an appropriate action rule, possibly of the lowest cost value, to get a hint which attribute values have to be changed. To be more precise, let us assume that $R_{S_i}[(d, k_1 \to k_2)]$ [...] he may identify a rule which has the lowest cost value. But the rule he gets may still have a cost value much too high to be of any help to him. Let us notice that the cost of the action rule

$$r = [(b_1, v_1 \to w_1) \wedge (b_2, v_2 \to w_2) \wedge \ldots \wedge (b_p, v_p \to w_p)](x) \Rightarrow (d, k_1 \to k_2)(x)$$

might be high only because of the high cost value of one of its sub-terms in the conditional part of the rule. Let us assume that $(b_j, v_j \to w_j)$ is that term. In such a case, we may look for an action rule in $R_{S_i}[(b_j, v_j \to w_j)]$ which has the smallest cost value. Assume that

$$r_1 = [(b_{j1}, v_{j1} \to w_{j1}) \wedge (b_{j2}, v_{j2} \to w_{j2}) \wedge \ldots \wedge (b_{jq}, v_{jq} \to w_{jq})](y) \Rightarrow (b_j, v_j \to w_j)(y)$$

is such a rule which is also feasible in $S_i$. Since $x, y \in X_i$, we can compose $r$ with $r_1$, getting a new feasible rule which is given below:

$$[(b_1, v_1 \to w_1) \wedge \ldots \wedge [(b_{j1}, v_{j1} \to w_{j1}) \wedge (b_{j2}, v_{j2} \to w_{j2}) \wedge \ldots \wedge (b_{jq}, v_{jq} \to w_{jq})] \wedge \ldots \wedge (b_p, v_p \to w_p)](x) \Rightarrow (d, k_1 \to k_2)(x).$$

Clearly, the cost of this new rule is lower than the cost of $r$. However, if its support in $S_i$ gets too low, then such a rule has no value to the user. Otherwise, we may recursively follow this strategy, trying to lower the cost of re-classifying objects from the group $k_1$ into the group $k_2$. Each successful step will produce a new action rule whose cost is lower than the cost of the current rule. This heuristic strategy always ends because there is a finite number of action rules and any action rule can be applied only once on each path of this recursive strategy.
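The composition step can be sketched as follows. The representation is assumed for illustration only: an action rule is a pair `(changes, head)` where `head` is a triple `(attribute, old_value, new_value)`, and `rho` holds invented costs keyed the same way. The names `cost` and `compose` are hypothetical.

```python
def cost(changes, rho):
    """Sum of the costs of the terms in the conditional part."""
    return sum(rho[(b, v, w)] for b, (v, w) in changes.items())

def compose(r, r1):
    """Replace the term (b_j, v_j -> w_j) of r by the conditional part of r1."""
    changes, decision = r
    sub_changes, (bj, vj, wj) = r1
    if changes.get(bj) != (vj, wj):
        raise ValueError("the head of r1 is not a term of r")
    merged = {b: c for b, c in changes.items() if b != bj}
    for b, c in sub_changes.items():
        if merged.get(b, c) != c:
            raise ValueError("conflicting changes on attribute %r" % (b,))
        merged[b] = c
    return merged, decision

# Hypothetical costs: the term on b2 is the expensive one.
rho = {("b1", "v1", "w1"): 2, ("b2", "v2", "w2"): 9,
       ("c1", "s1", "t1"): 1, ("c2", "s2", "t2"): 3,
       ("d", "k1", "k2"): 12}
r = ({"b1": ("v1", "w1"), "b2": ("v2", "w2")}, ("d", "k1", "k2"))
r1 = ({"c1": ("s1", "t1"), "c2": ("s2", "t2")}, ("b2", "v2", "w2"))
composed = compose(r, r1)
```

Here $cost(r) = 11$, the feasible rule $r_1$ has cost $4 < 9$, and the composed rule costs $2 + 4 = 6$. The drop follows from the feasibility of $r_1$: its conditional part is cheaper than the term it replaces.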
One can argue that if the set $R_{S_i}[(d, k_1 \to k_2)]$ contains all action rules reclassifying objects from group $k_1$ into the group $k_2$, then any new action rule obtained as the result of the above recursive strategy should already be in that set. We do not agree with this statement since in practice $R_{S_i}[(d, k_1 \to k_2)]$ is only a subset of all action rules. Firstly, it takes too much time (the complexity is exponential) to generate all possible rules from an information system, and secondly, even if we extract such rules, it still takes too much time to generate all possible action rules from them. So the applicability of the proposed recursive strategy to search for rules of the lowest cost is highly justified. Again, let us assume that the user would like to reclassify some objects in $S_i$ from the class $b_1$ to the class $b_2$ and that $\rho_{S_i}(b_1, b_2)$ is the current cost to do that. Each action rule in $R_{S_i}[(d, k_1 \to k_2)]$ gives us an alternative way to achieve the same result but at a different cost. If we limit ourselves to the system $S_i$, then clearly we cannot go beyond the set $R_{S_i}[(d, k_1 \to k_2)]$. But, if we allow action rules to be extracted at other information systems and use them jointly with local action rules, then
the number of attributes which can be involved in reclassifying objects in $S_i$ will increase, and thereby we may further lower the cost of the desired reclassification. So, let us assume the following scenario. The action rule

$$r = [(b_1, v_1 \to w_1) \wedge (b_2, v_2 \to w_2) \wedge \ldots \wedge (b_p, v_p \to w_p)](x) \Rightarrow (d, k_1 \to k_2)(x),$$

extracted from the information system $S_i$, is not feasible because at least one of its terms, let us say $(b_j, v_j \to w_j)$ where $1 \le j \le p$, has too high a cost $\rho_{S_i}(v_j, w_j)$ assigned to it. In this case we look for a new feasible action rule

$$r_1 = [(b_{j1}, v_{j1} \to w_{j1}) \wedge (b_{j2}, v_{j2} \to w_{j2}) \wedge \ldots \wedge (b_{jq}, v_{jq} \to w_{jq})](y) \Rightarrow (b_j, v_j \to w_j)(y)$$

which, concatenated with $r$, will decrease the cost value of the desired reclassification. So, the current setting looks the same as the one we already had, except that this time we additionally assume that $r_1$ is extracted from another information system in $DS$. For simplicity, we also assume that the semantics and the granularity levels of all attributes listed in both information systems are the same. By the concatenation of action rule $r_1$ with action rule $r$ we mean a new feasible action rule $r_1 \circ r$ of the form:

$$[(b_1, v_1 \to w_1) \wedge \ldots \wedge [(b_{j1}, v_{j1} \to w_{j1}) \wedge (b_{j2}, v_{j2} \to w_{j2}) \wedge \ldots \wedge (b_{jq}, v_{jq} \to w_{jq})] \wedge \ldots \wedge (b_p, v_p \to w_p)](x) \Rightarrow (d, k_1 \to k_2)(x)$$

where $x$ is an object in $S_i = (X_i, A_i, V_i)$. Some of the attributes in $\{b_{j1}, b_{j2}, \ldots, b_{jq}\}$ may not belong to $A_i$. Also, the support of $r_1$ is calculated in the information system from which $r_1$ was extracted. Let us denote that system by $S_m = (X_m, A_m, V_m)$ and the set of objects in $X_m$ supporting $r_1$ by $Sup_{S_m}(r_1)$. Assume that $Sup_{S_i}(r)$ is the set of objects in $S_i$ supporting rule $r$. The domain of $r_1 \circ r$ is the same as the domain of $r$, which is equal to $Sup_{S_i}(r)$. Before we define the notion of a similarity between two objects belonging to two different information systems, we assume that $A_i = \{b_1, b_2, b_3, b_4\}$, $A_m = \{b_1, b_2, b_3, b_5, b_6\}$, and objects $x \in X_i$, $y \in X_m$ are defined by the table below:

Table 19.2. Object x from S_i and y from S_m

      b1   b2   b3   b4   b5   b6
x     v1   v2   v3   v4
y     v1   w2   w3        w5   w6
The similarity $\rho(x, y)$ between $x$ and $y$ is defined as: $[1 + 0 + 0 + 1/2 + 1/2 + 1/2]/6 = [2 + 1/2]/6 = 5/12$. To give a more formal definition of similarity, we assume that:

$$\rho(x, y) = \Big[\sum\{\rho(b_i(x), b_i(y)) : b_i \in (A_i \cup A_m)\}\Big] / card(A_i \cup A_m),$$

where:

• $\rho(b_i(x), b_i(y)) = 0$, if $b_i(x) \neq b_i(y)$,
• $\rho(b_i(x), b_i(y)) = 1$, if $b_i(x) = b_i(y)$,
• $\rho(b_i(x), b_i(y)) = 1/2$, if either $b_i(x)$ or $b_i(y)$ is undefined.
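The similarity measure can be checked against the example of Table 19.2 with a few lines of Python. The sketch assumes objects are dictionaries in which an undefined attribute is simply absent, and uses exact fractions so that the value $5/12$ is reproduced exactly; the function name `similarity` is hypothetical.

```python
from fractions import Fraction

def similarity(x, y, attrs_i, attrs_m):
    """rho(x, y): average per-attribute agreement over A_i union A_m."""
    union = set(attrs_i) | set(attrs_m)
    total = Fraction(0)
    for b in union:
        bx, by = x.get(b), y.get(b)
        if bx is None or by is None:
            total += Fraction(1, 2)   # value undefined for x or for y
        elif bx == by:
            total += 1                # values agree
        # differing values contribute 0
    return total / len(union)

A_i = ["b1", "b2", "b3", "b4"]
A_m = ["b1", "b2", "b3", "b5", "b6"]
x = {"b1": "v1", "b2": "v2", "b3": "v3", "b4": "v4"}
y = {"b1": "v1", "b2": "w2", "b3": "w3", "b5": "w5", "b6": "w6"}
```

Only $b_1$ agrees, $b_2$ and $b_3$ disagree, and $b_4$, $b_5$, $b_6$ are each undefined for one of the two objects, which gives $(1 + 3 \cdot 1/2)/6 = 5/12$.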
Let us assume that $\rho(x, Sup_{S_m}(r_1)) = \max\{\rho(x, y) : y \in Sup_{S_m}(r_1)\}$, for each $x \in Sup_{S_i}(r)$. By the confidence of $r_1 \circ r$ we mean

$$Conf(r_1 \circ r) = \Big[\sum\{\rho(x, Sup_{S_m}(r_1)) : x \in Sup_{S_i}(r)\} / card(Sup_{S_i}(r))\Big] \cdot Conf(r_1) \cdot Conf(r),$$

where $Conf(r)$ is the confidence of the rule $r$ in $S_i$ and $Conf(r_1)$ is the confidence of the rule $r_1$ in $S_m$. If we allow action rules extracted from $S_i$ to be concatenated with action rules extracted at other sites of DIS, we increase the total number of generated action rules, and thereby our chance to lower the cost of reclassifying objects in $S_i$ also increases, but possibly at the price of decreased confidence.
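The confidence of a concatenated rule can be computed as in the following sketch. It is an illustration under assumptions: the supports are given as lists of object identifiers, the similarity $\rho$ is passed in as a function `sim`, and the similarity values in the example are invented. The name `concat_confidence` is hypothetical.

```python
def concat_confidence(sup_r, sup_r1, conf_r, conf_r1, sim):
    """Conf(r1 o r): average best similarity times Conf(r1) times Conf(r)."""
    if not sup_r or not sup_r1:
        raise ValueError("both supports must be non-empty")
    # rho(x, Sup(r1)) = max over y in Sup(r1) of rho(x, y)
    best = [max(sim(x, y) for y in sup_r1) for x in sup_r]
    return sum(best) / len(sup_r) * conf_r1 * conf_r

# Hypothetical similarities between objects of S_i and S_m.
table = {("x1", "y1"): 0.5, ("x1", "y2"): 1.0,
         ("x2", "y1"): 0.25, ("x2", "y2"): 0.5}
value = concat_confidence(["x1", "x2"], ["y1", "y2"],
                          conf_r=0.5, conf_r1=0.5,
                          sim=lambda x, y: table[(x, y)])
```

In this example the best similarities are 1.0 and 0.5, their average is 0.75, and the resulting confidence is $0.75 \cdot 0.5 \cdot 0.5$. Since every factor is at most 1, concatenation can only keep or lower the confidence.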
19.5 Heuristic Strategy for the Lowest Cost Reclassification of Objects

Let us assume that we wish to reclassify as many objects as possible in the system $S_i$, which is a part of DIS, from the class described by value $k_1$ of the attribute $d$ to the class $k_2$. The reclassification $k_1 \to k_2$ jointly with its cost $\rho_{S_i}(k_1, k_2)$ is seen as the information stored in the initial node $n_0$ of the search graph built from nodes generated recursively by feasible action rules taken initially from $R_{S_i}[(d, k_1 \to k_2)]$. For instance, the rule

$$r = [(b_1, v_1 \to w_1) \wedge (b_2, v_2 \to w_2) \wedge \ldots \wedge (b_p, v_p \to w_p)](x) \Rightarrow (d, k_1 \to k_2)(x)$$

applied to the node $n_0 = \{[k_1 \to k_2, \rho_{S_i}(k_1, k_2)]\}$ generates the node

$$n_1 = \{[v_1 \to w_1, \rho_{S_i}(v_1, w_1)], [v_2 \to w_2, \rho_{S_i}(v_2, w_2)], \ldots, [v_p \to w_p, \rho_{S_i}(v_p, w_p)]\},$$

and from $n_1$ we can generate the node

$$n_2 = \{[v_1 \to w_1, \rho_{S_i}(v_1, w_1)], [v_2 \to w_2, \rho_{S_i}(v_2, w_2)], \ldots, [v_{j1} \to w_{j1}, \rho_{S_i}(v_{j1}, w_{j1})], [v_{j2} \to w_{j2}, \rho_{S_i}(v_{j2}, w_{j2})], \ldots, [v_{jq} \to w_{jq}, \rho_{S_i}(v_{jq}, w_{jq})], \ldots, [v_p \to w_p, \rho_{S_i}(v_p, w_p)]\}$$

assuming that the action rule

$$r_1 = [(b_{j1}, v_{j1} \to w_{j1}) \wedge (b_{j2}, v_{j2} \to w_{j2}) \wedge \ldots \wedge (b_{jq}, v_{jq} \to w_{jq})](y) \Rightarrow (b_j, v_j \to w_j)(y)$$

from $R_{S_m}[(b_j, v_j \to w_j)]$ is applied to $n_1$ (see Section 19.4). This information can be written equivalently as: $r(n_0) = n_1$, $r_1(n_1) = n_2$, $[r_1 \circ r](n_0) = n_2$. Also, we should notice here that $r_1$ is extracted from $S_m$ and $Sup_{S_m}(r_1) \subseteq X_m$, whereas $r$ is extracted from $S_i$ and $Sup_{S_i}(r) \subseteq X_i$. By $Sup_{S_i}(r)$ we mean the domain of action rule $r$ (the set of objects in $S_i$ supporting $r$).

The search graph can be seen as a directed graph $G$ which is dynamically built by applying action rules to its nodes. The initial node $n_0$ of the graph $G$ contains information coming from the user, associated with the system $S_i$, about which objects in $X_i$ he would like to reclassify, how, and what his current cost of this reclassification is. Any other node $n$ in $G$ shows an alternative way to achieve the same reclassification with a cost that is lower than the cost assigned to all nodes which precede $n$ in $G$.
Clearly, the confidence of action rules labelling the path from the
initial node to the node $n$ is as important as the information about the reclassification and its cost stored in node $n$. Information about the sites in DIS from which these action rules have been extracted, and how similar the objects at these sites are to the objects in $S_i$, is important as well. Information stored at the node $\{[v_1 \to w_1, \rho_{S_i}(v_1, w_1)], [v_2 \to w_2, \rho_{S_i}(v_2, w_2)], \ldots, [v_p \to w_p, \rho_{S_i}(v_p, w_p)]\}$ says that by reclassifying any object $x$ supported by rule $r$ from the class $v_i$ to the class $w_i$, for any $i \le p$, we also reclassify that object from the class $k_1$ to $k_2$. The confidence in the reclassification of $x$ supported by this node is the same as the confidence of the rule $r$.

Before we give a heuristic strategy for identifying a node in $G$, built for a desired reclassification of objects in $S_i$, with a cost possibly the lowest among all the nodes reachable from the node $n_0$, we have to introduce additional notation. So, assume that $N$ is the set of nodes in our dynamically built directed graph $G$ and $n_0$ is its initial node. For any node $n \in N$, by

$$f(n) = (Y_n, \{[v_{n,j} \to w_{n,j}, \rho_{S_i}(v_{n,j}, w_{n,j})]\}_{j \in I_n})$$

we mean its domain, the reclassification steps related to objects in $X_i$, and their cost, all assigned by the reclassification function $f$ to the node $n$, where $Y_n \subseteq X_i$ (graph $G$ is built for the client site $S_i$). Let us assume that $f(n) = (Y_n, \{[v_{n,k} \to w_{n,k}, \rho_{S_i}(v_{n,k}, w_{n,k})]\}_{k \in I_n})$. We say that action rule $r$, extracted from $S_i$, is applicable to the node $n$ if:

• $Y_n \cap Sup_{S_i}(r) \neq \emptyset$,
• $(\exists k \in I_n)[r \in R_{S_i}[v_{n,k} \to w_{n,k}]]$ (see Section 19.4 for the definition of $R_{S_i}[\ldots]$).

Similarly, we say that action rule $r$, extracted from $S_m$, is applicable to the node $n$ if:

• $(\exists x \in Y_n)(\exists y \in Sup_{S_m}(r))[\rho(x, y) < \lambda]$, where $\rho(x, y)$ is the similarity relation between $x$, $y$ (see Section 19.4 for its definition) and $\lambda$ is a given similarity threshold,
• $(\exists k \in I_n)[r \in R_{S_m}[v_{n,k} \to w_{n,k}]]$ (see Section 19.4 for the definition of $R_{S_m}[\ldots]$).
It has to be noticed that the reclassification of objects assigned to a node of $G$ may refer to attributes which are not necessarily listed in $S_i$. In this case, the user associated with $S_i$ has to decide what the cost of such a reclassification is at his site, since such a cost may differ from site to site. Now, let $RA(n)$ be the set of all action rules applicable to the node $n$. We say that the node $n$ is completely covered by action rules from $RA(n)$ if $X_n = \bigcup\{Sup_{S_i}(r) : r \in RA(n)\}$. Otherwise, we say that $n$ is partially covered by action rules.

How is the domain $Y_n$ of a node $n$ in the graph $G$ constructed for the system $S_i$ calculated? The reclassification $(d, k_1 \to k_2)$ jointly with its cost $\rho_{S_i}(k_1, k_2)$ is stored in the initial node $n_0$ of the search graph $G$. Its domain $Y_0$ is defined as the set-theoretical union of the domains of feasible action rules in $R_{S_i}[(d, k_1 \to k_2)]$ applied to $X_i$. This domain can still be extended by any object $x \in X_i$ if the following condition holds:

$$(\exists m)(\exists r \in R_{S_m}[k_1 \to k_2])(\exists y \in Sup_{S_m}(r))[\rho(x, y) < \lambda].$$
Each rule applied to the node $n_0$ generates a new node in $G$ whose domain is calculated in a similar way to that of $n_0$. To be more precise, assume that $n$ is such a node and $f(n) = (Y_n, \{[v_{n,k} \to w_{n,k}, \rho_{S_i}(v_{n,k}, w_{n,k})]\}_{k \in I_n})$. Its domain $Y_n$ is defined as the set-theoretical union of the domains of feasible action rules in $\bigcup\{R_{S_i}[v_{n,k} \to w_{n,k}] : k \in I_n\}$ applied to $X_i$. Similarly to $n_0$, this domain can still be extended by any object $x \in X_i$ if the following condition holds:

$$(\exists m)(\exists k \in I_n)(\exists r \in R_{S_m}[v_{n,k} \to w_{n,k}])(\exists y \in Sup_{S_m}(r))[\rho(x, y) < \lambda].$$

Clearly, for all other nodes dynamically generated in $G$, the definition of their domains is the same as the one above.

Property 1. An object $x$ can be reclassified according to the data stored in node $n$ only if $x$ belongs to the domain of each node along the path from the node $n_0$ to $n$.

Property 2. Assume that $x$ can be reclassified according to the data stored in node $n$ and $f(n) = (Y_n, \{[v_{n,k} \to w_{n,k}, \rho_{S_i}(v_{n,k}, w_{n,k})]\}_{k \in I_n})$. The cost $Cost_{k_1 \to k_2}(n, x)$ assigned to the node $n$ in reclassifying $x$ from $k_1$ to $k_2$ is equal to $\sum\{\rho_{S_i}(v_{n,k}, w_{n,k}) : k \in I_n\}$.

Property 3. Assume that $x$ can be reclassified according to the data stored in node $n$ and the action rules $r, r_1, r_2, \ldots, r_j$ label the edges along the path from the node $n_0$ to $n$. The confidence $Conf_{k_1 \to k_2}(n, x)$ assigned to the node $n$ in reclassifying $x$ from $k_1$ to $k_2$ is equal to $Conf[r_j \circ \ldots \circ r_2 \circ r_1 \circ r]$ (see Section 19.4).

Property 4. If a node $n_{j2}$ is a successor of the node $n_{j1}$, then $Conf_{k_1 \to k_2}(n_{j2}, x) \le Conf_{k_1 \to k_2}(n_{j1}, x)$.

Property 5. If a node $n_{j2}$ is a successor of the node $n_{j1}$, then $Cost_{k_1 \to k_2}(n_{j2}, x) \le Cost_{k_1 \to k_2}(n_{j1}, x)$.

Let us assume that we wish to reclassify as many objects as possible in the system $S_i$, which is a part of DIS, from the class described by value $k_1$ of the attribute $d$ to the class $k_2$. We also assume that $R$ is the set of all action rules extracted either from the system $S_i$ or from any of its remote sites in DIS.
The reclassification $(d, k_1 \to k_2)$ jointly with its cost $\rho_{S_i}(k_1, k_2)$ represents the information stored in the initial node $n_0$ of the search graph $G$. By $\lambda_{Conf}$ we mean the minimal confidence in reclassification acceptable by the user and by $\lambda_{Cost}$ the maximal cost the user is willing to pay for the reclassification. The algorithm Build-and-Search generates, for each object $x$ in $S_i$, the reclassification rules satisfying the thresholds for minimal confidence and maximal cost.
Algorithm Build-and-Search($R$, $x$, $\lambda_{Conf}$, $\lambda_{Cost}$, $n$, $m$);
Input
  Set of action rules $R$,
  Object $x$ which the user would like to reclassify,
  Threshold value $\lambda_{Conf}$ for minimal confidence,
  Threshold value $\lambda_{Cost}$ for maximal cost,
  Node $n$ of a graph $G$.
Output
  Node $m$ representing an acceptable reclassification of objects from $S_i$.
begin
  if $Cost_{k_1 \to k_2}(n, x) > \lambda_{Cost}$, then
    generate all successors of $n$ using rules from $R$;
    while $n_1$ is a successor of $n$ do
      if $Conf_{k_1 \to k_2}(n_1, x) < \lambda_{Conf}$ then stop
      else
        if $Cost_{k_1 \to k_2}(n_1, x) < \lambda_{Cost}$ then Output[$n_1$]
        else Build-and-Search($R$, $x$, $\lambda_{Conf}$, $\lambda_{Cost}$, $n_1$, $m$)
end
Now, calling the procedure Build-and-Search($R$, $x$, $\lambda_{Conf}$, $\lambda_{Cost}$, $n_0$, $m$), we get the reclassification rules for $x$ satisfying the thresholds for minimal confidence and maximal cost. The procedure stops on the first node $n$ which satisfies both thresholds: $\lambda_{Conf}$ for minimal confidence and $\lambda_{Cost}$ for maximal cost. Clearly, this strategy can be enhanced by allowing recursive calls on any node $n$ when both thresholds are satisfied by $n$, and forcing the recursive calls to stop on the first node $n_i$ succeeding $n$ for which $Cost_{k_1 \to k_2}(n_i, x) < \lambda_{Cost}$ and $Conf_{k_1 \to k_2}(n_i, x) < \lambda_{Conf}$. Then, the recursive procedure should terminate not on $n_i$ but on the node which is its direct predecessor.
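The recursive search can be illustrated with a short Python sketch. It is only one possible reading of the pseudocode: the `Node` structure is hypothetical, the cost and confidence of each node are assumed to be precomputed, the successors are assumed to be already generated from $R$, and "stop" is interpreted as abandoning the current branch rather than the whole search.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node of the search graph G (illustrative structure, not from the paper)."""
    name: str
    cost: float  # stands for Cost_{k1->k2}(n, x)
    conf: float  # stands for Conf_{k1->k2}(n, x)
    successors: List["Node"] = field(default_factory=list)


def build_and_search(n: Node, lam_conf: float, lam_cost: float) -> Optional[Node]:
    """Return the first node below n that satisfies both thresholds, or None."""
    if not n.cost > lam_cost:
        return None  # the pseudocode expands a node only while it is too expensive
    for n1 in n.successors:
        if n1.conf < lam_conf:
            continue  # assumption: "stop" abandons this branch only
        if n1.cost < lam_cost:
            return n1  # corresponds to Output[n1]
        found = build_and_search(n1, lam_conf, lam_cost)
        if found is not None:
            return found
    return None


# Hypothetical search graph: only node a2 meets both thresholds.
a1 = Node("a1", cost=3.0, conf=0.5)
a2 = Node("a2", cost=4.0, conf=0.8)
a = Node("a", cost=7.0, conf=0.9, successors=[a1, a2])
b = Node("b", cost=2.0, conf=0.4)
root = Node("n0", cost=10.0, conf=1.0, successors=[a, b])

result = build_and_search(root, lam_conf=0.6, lam_cost=5.0)
```

In this example the branch through `a1` is pruned because its confidence is below the threshold, and cheap node `b` is rejected for the same reason, so the search returns `a2`.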
19.6 Conclusion
The root of the directed search graph $G$ is used to store information about objects assigned to a certain class jointly with the cost of reclassifying them to a new desired class. Each node in graph $G$ shows an alternative way to achieve the same goal. The reclassification strategy assigned to a node $n$ has a cost lower than the cost of the reclassification strategy assigned to its parent. Any node $n$ in $G$ can be reached from the root by following one or more paths. This means that the confidence of the reclassification strategy assigned to $n$ should be calculated as the maximum confidence among the confidences assigned to all paths from the root of $G$ to $n$. The search strategy based on dynamic construction of graph $G$ (described in the previous section) is exponential with respect to the number of active dimensions in all information systems involved in the search for possibly the cheapest reclassification strategy. This strategy is also exponential with respect to the number of values of flexible attributes in all information systems involved in that search. We believe that the most promising strategy should be based on a global ontology [14] showing the semantical relationships between concepts (attributes and their values) used to define objects in DAIS. These relationships can be used by a search algorithm to decide which path in the search graph $G$ should be explored first. If sufficient information from the global ontology is not available, probabilistic strategies (Monte Carlo method) can be used to decide which path in $G$ to follow.
References
1. Adomavicius, G., Tuzhilin, A., (1997), Discovery of actionable patterns in databases: the action hierarchy approach, in Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD97), Newport Beach, CA, AAAI Press, 1997
2. Liu, B., Hsu, W., Chen, S., (1997), Using general impressions to analyze discovered classification rules, in Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD97), Newport Beach, CA, AAAI Press, 1997
3. Liu, B., Hsu, W., Mun, L.-F., (1996), Finding interesting patterns using user expectations, DISCS Technical Report No. 7, 1996
4. Pawlak, Z., (1985), Rough sets and decision tables, in Lecture Notes in Computer Science 208, Springer-Verlag, 1985, 186-196
5. Pawlak, Z., (1991), Rough Sets: Theoretical aspects of reasoning about data, Kluwer Academic Publisher, 1991
6. Ras, Z., Dardzinska, A., "Handling semantic inconsistencies in query answering based on distributed knowledge mining", in Foundations of Intelligent Systems, Proceedings of ISMIS'02 Symposium, LNCS/LNAI, No. 2366, Springer-Verlag, 2002, 66-74
7. Ras, Z., Wieczorkowska, A., (2000), Action Rules: how to increase profit of a company, in Principles of Data Mining and Knowledge Discovery, (Eds. D.A. Zighed, J. Komorowski, J. Zytkow), Proceedings of PKDD'00, Lyon, France, LNCS/LNAI, No. 1910, Springer-Verlag, 2000, 587-592
8. Ras, Z.W., Tsay, L.-S., (2003), Discovering Extended Action-Rules (System DEAR), in Intelligent Information Systems 2003, Proceedings of the IIS'2003 Symposium, Zakopane, Poland, Advances in Soft Computing, Springer-Verlag, 2003, 293-300
9. Ras, Z.W., Tzacheva, A., (2003), Discovering semantic inconsistencies to improve action rules mining, in Intelligent Information Systems 2003, Advances in Soft Computing, Proceedings of the IIS'2003 Symposium, Zakopane, Poland, Springer-Verlag, 2003, 301-310
10. Ras, Z., Gupta, S., (2002), Global action rules in distributed knowledge systems, in Fundamenta Informaticae Journal, IOS Press, Vol. 51, No. 1-2, 2002, 175-184
11. Silberschatz, A., Tuzhilin, A., (1995), On subjective measures of interestingness in knowledge discovery, in Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD95), AAAI Press, 1995
12. Silberschatz, A., Tuzhilin, A., (1996), What makes patterns interesting in knowledge discovery systems, in IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, 1996
13. Skowron, A., Grzymala-Busse, J., (1991), From the Rough Set Theory to the Evidence Theory, in ICS Research Reports, 8/91, Warsaw University of Technology, October, 1991
14. Sowa, J.F., (2000), Ontology, Metadata and Semiotics, in Conceptual Structures: Logical, Linguistic, and Computational Issues, B. Ganter, G.W. Mineau (Eds), LNAI 1867, Springer-Verlag, 2000, 55-81
15. Suzuki, E., Kodratoff, Y., (1998), Discovery of surprising exception rules based on intensity of implication, in Proc. of the Second Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 1998
20
Circularity in Rule Knowledge Bases Detection using Decision Unit Approach
Roman Siminski and Alicja Wakulicz-Deja
University of Silesia, Institute of Computer Science, Będzińska 39, 41-200 Sosnowiec, Poland
[email protected] wakulicz@us.edu.pl
20.1 Introduction
Expert systems are programs that have extended the range of application of software systems to non-structural, ill-defined problems. Perhaps the crucial characteristic that distinguishes expert systems from classical software systems is the impossibility of obtaining a correct and complete formal specification. This comes from the nature of the knowledge engineering process, which is essentially a modeling discipline, where the result of the modeling activity is the modeling process itself. Expert systems are programs that solve problems using knowledge acquired, usually, from human experts in the problem domain, as opposed to conventional software, which solves problems by following algorithms. But expert systems are programs, and programs must be validated. Regarding the classical definition of validation, as stated in [1]: "Determination of the correctness of a program with respect to the user needs and requirements", we claim that it is adequate for KBS. But we encounter some differences if we try to use classical verification methods from software engineering. The tasks performed by knowledge-based systems usually cannot be correctly and completely specified; these tasks are usually ill-structured and no efficient algorithmic approach is known for them. KBS are constructed using declarative languages (usually rule-based) that are interpreted by inference engines. This kind of programming is concerned with truth values, rule dependencies and heuristic associations, in contrast to conventional programming, which deals with variables, conditionals, loops and procedures. The knowledge base of an expert system contains a program, usually constructed using rule-based languages; the knowledge engineer uses declarative languages or specialized expert system shells. In this work we concentrate our attention on verification of rule knowledge bases of expert systems.
We assume that the inference engine and other parts of the expert system do not need any verification; for example, they derive their properties from a commercial expert system shell. Although the basic validation concepts are common to knowledge and software engineering, we encounter difficulties if we try to apply classical definitions of
verification and validation (from software engineering) to knowledge engineering. Verification methods of conventional software are not directly applicable to expert systems, and new, specific methods of verification are required. In our previous works [2, 3, 4, 5] we presented some theoretical and practical information about verification and validation of knowledge bases, as well as some of the best known methods and tools described in the literature. Perhaps the best reference materials can be found on Alun Preece's home page, http://www.csd.abdn.ac.uk/~apreece, especially in [8, 9, 10, 1, 12]. We can identify several kinds of anomalies in rule knowledge bases. A. Preece divides them into four groups: redundancy, ambivalence, circularity and deficiency. In the present work we discuss only one kind of anomaly: circularity. Circular rule sequences are undesirable because they may cause endless loops, as long as the inference system does not recognize them at execution time. We present a circularity detection algorithm based on the concept of decision units described in detail in [6, 7].
20.2 Circularity - the problem in backward chaining systems
Circularity presents an urgent problem in backward chaining systems. A knowledge base has circularity iff it contains some set of rules such that a loop could occur when the rules are fired. In other words, a knowledge base is circular if it contains a circular sequence of rules, that is, a sequence of rules in which the right-hand side of every rule but the last is contained in the left-hand side of the next rule in the sequence, and the right-hand side of the last rule is contained in the left-hand side of the first rule of the sequence. More formally [10], a knowledge base $\mathcal{R}$ contains circular dependencies if there is a hypothesis $H$ that unifies with the consequent of a rule $R$ in the rule base $\mathcal{R}$, where $R$ is firable only when $H$ is supplied as an input to $\mathcal{R}$ (see eq. 20.1).
$(\exists R \in \mathcal{R}, \exists E \in \mathcal{E}, \exists H \in \mathcal{H}) \; (H = conseq(R) \wedge \neg firable(\mathcal{R}, R, E) \wedge firable(\mathcal{R}, R, E \cup \{H\}))$   (20.1)
where the function $conseq(R)$ supplies the literal from the consequent of rule $R$: for $R = L_1 \wedge L_2 \wedge \ldots \wedge L_m \to M$ we have $conseq(R) = M$. $E$, called the environment, is a subset of legal input literals (that does not imply a semantic constraint). $\mathcal{H}$, the set of inferable hypotheses, is defined to be the set of literals in the consequents and their instances: $H \in \mathcal{H}$ iff $(\exists R \in \mathcal{R})(conseq(R) = H)$. The predicate $firable$ states that a rule $R \in \mathcal{R}$ is firable if there is some environment $E$ such that the antecedent of $R$ is a logical consequence of supplying $E$ as input to $\mathcal{R}$: $firable(\mathcal{R}, R, E)$ iff $(\exists \sigma)(\mathcal{R} \cup E \vdash antec(R)\sigma)$.
We can distinguish between a direct cycle, where the rule calls itself:
$P(x) \wedge R(x) \to R(x)$   (20.2)
and an indirect cycle, where the loop is formed by a sequence of rules:
$R_1 : P(x) \wedge Q(x) \to R(x)$
$R_2 : R(x) \wedge S(x) \to P(x)$   (20.3)
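For illustration, condition (20.1) can be checked directly in a simplified, propositional setting. The Python sketch below is not the authors' algorithm: it assumes variable-free literals (so no unification is needed), represents a rule as a hypothetical pair (antecedent set, consequent), and tests a single given environment rather than quantifying over all environments. Derivability is approximated by forward chaining to a fixed point.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Rule = Tuple[FrozenSet[str], str]  # (antecedent literals, consequent literal)


def closure(rules: Iterable[Rule], env: Set[str]) -> Set[str]:
    """All literals derivable from env by forward chaining over the rules."""
    rules = list(rules)
    known = set(env)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            if antecedent <= known and consequent not in known:
                known.add(consequent)
                changed = True
    return known


def firable(rules: List[Rule], rule: Rule, env: Set[str]) -> bool:
    """True when every antecedent literal of `rule` follows from rules + env."""
    return rule[0] <= closure(rules, env)


def circular_rules(rules: List[Rule], env: Set[str]) -> List[Rule]:
    """Rules that are firable only once their own consequent is supplied."""
    return [
        r for r in rules
        if not firable(rules, r, env) and firable(rules, r, set(env) | {r[1]})
    ]


# The indirect cycle (20.3), written without variables.
r1: Rule = (frozenset({"P", "Q"}), "R")
r2: Rule = (frozenset({"R", "S"}), "P")
indirect = circular_rules([r1, r2], {"Q", "S"})

# The direct cycle (20.2).
direct = circular_rules([(frozenset({"P", "R"}), "R")], {"P"})

# A plain chain A -> B -> C has no circular rule.
chain = circular_rules([(frozenset({"A"}), "B"), (frozenset({"B"}), "C")], set())
```

With the environment {Q, S}, neither rule of (20.3) can fire, but supplying R lets $R_2$ derive P, which in turn makes $R_1$ firable; this is exactly the situation the definition describes.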
Fig. 20.1. An example circular rule sequence
20.3 Decision units
In real-world rule knowledge bases literals are often coded using attribute-value pairs. In this section we briefly introduce the concept of decision units defined on a rule base containing Horn clause rules, where literals are coded using attribute-value pairs. We assume backward inference. A decision unit $U$ is defined as a triple $U = (I, O, R)$, where $I$ denotes a set of input entries, $O$ denotes a set of output entries and $R$ denotes a set of rules fulfilling a given grouping criterion. These sets are defined as follows:
$I = \{(attr_i, val_{ij}) : \exists r \in R, \; (attr_i, val_{ij}) \in antec(r)\}$
$O = \{(attr_i, val_{ij}) : \forall r \in R, \; attr_i = conclAttr(r)\}$
$R = \{r : \forall i \neq j, \; r_i, r_j \in R, \; conclAttr(r_i) = conclAttr(r_j)\}$   (20.4)
Two functions are defined on a rule $r$: $conclAttr(r)$ returns the attribute from the conclusion of rule $r$, and $antec(r)$ is the set of conditions of rule $r$. As can be seen, a decision unit $U$ contains the set of rules $R$ in which each rule $r \in R$ has the same attribute in the literal appearing in its conclusion part. All rules grouped within a decision unit take part in an inference process confirming the aim described by the attribute which appears in the conclusion part of each rule. The process given above is often considered to be a part of a decision system, thus it is called a decision unit. All pairs (attribute, value) appearing in the conditional part of each rule are called decision unit input entries, whilst all pairs (attribute, value) appearing in the conclusion part of each rule of the set $R$ are called decision unit output entries.
Fig. 20.2. The structure of the decision unit U
Summarising, the idea of decision units allows arranging rule-base knowledge according to a clear and simple criterion. Rules within a given unit work out or confirm the aim determined by a single attribute. When the rules are enclosed within a given unit and the sets of input and output entries are introduced, it is possible to review the base at a higher abstraction level. This simultaneously reveals the global connections, which are difficult to detect immediately on the basis of inspecting the list of rules. The decision unit idea can be used in the knowledge base verification and validation process and in the pragmatic issue of modelling, which is the subject to be presented later in this paper.
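The grouping criterion of (20.4) can be sketched in a few lines of Python. The representation is an assumption made for illustration: a rule is a hypothetical pair (list of condition pairs, conclusion pair), each pair being (attribute, value), and the class and function names are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]          # (attribute, value)
Rule = Tuple[List[Pair], Pair]  # (conditions, conclusion)


@dataclass
class DecisionUnit:
    attribute: str                                   # shared conclusion attribute
    rules: List[Rule] = field(default_factory=list)  # the set R
    inputs: Set[Pair] = field(default_factory=set)   # the set I
    outputs: Set[Pair] = field(default_factory=set)  # the set O


def build_decision_units(rules: List[Rule]) -> Dict[str, DecisionUnit]:
    """Group rules by the attribute of their conclusion and collect I and O."""
    units: Dict[str, DecisionUnit] = {}
    for conditions, conclusion in rules:
        attr = conclusion[0]
        unit = units.setdefault(attr, DecisionUnit(attr))
        unit.rules.append((conditions, conclusion))
        unit.inputs.update(conditions)   # pairs from the conditional part
        unit.outputs.add(conclusion)     # pairs from the conclusion part
    return units


# Hypothetical rule base with two conclusion attributes, c and d.
rule_base: List[Rule] = [
    ([("a", "1"), ("b", "2")], ("c", "x")),
    ([("a", "2")], ("c", "y")),
    ([("c", "x")], ("d", "z")),
]
units = build_decision_units(rule_base)
```

Here the output entry ("c", "x") of the unit for c is an input entry of the unit for d, which is the kind of connection the decision unit net makes visible.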
20.4 Decision units in knowledge base verification
The introduction of decision units allows anomalies to be divided into local and global ones:
• local anomalies appear within a single decision unit considered individually, and their detection is local;
• global anomalies appear at the level of the decision unit net. Their detection is based on the analysis of connections between units and is global.
A single decision unit can be considered as a model of an elementary, partial decision worked out by the system. The reason is that all rules constituting a decision unit have the same conclusion attribute. All conclusions create a set of unit output entries specifying the inference aims that can be confirmed. The decision unit net allows us to formulate a global verification method. On the strength of the analysis of connections between decision units it is possible to detect local anomalies in rules, such as deficiency, redundancy, incoherence or circularity, which create chains during an inference process. We can apply considerations at the unit level using black box and glass box techniques. Abstracting from the internal structure of the units which create a net allows us to detect characteristic symptoms of global anomalies. This can prompt a detailed analysis that takes the internal structure of each unit into account. This analysis is nevertheless limited to a given fragment of the net, indicated previously by the black box verification method.
20.5 Circular relationship detection technique using decision units
There is one particular case of circularity: circularity inside a decision unit. This is an example of a local anomaly. We can detect this kind of circularity at the local level by building a local causal graph; this case is presented in Fig. 20.3. The technique for detecting global circular rule relationships is presented on an example, shown in Figure 20.4a. A net can be described as a directed graph. After dropping the distinction between input and output entries and rejecting the vertices which stand for disjoint input and output entries, the graph assumes a shape like the one presented in Figure 20.4b. Such a graph is called a global relationship decision unit graph. As can be seen, there are two cycles: 1-2-3-1 and 1-3-1.
Fig. 20.3. An example of circularity in decision unit - local causal graph
The presence of cycles can indicate a cyclic relationship in the considered rule base. Figure 20.4c presents an example where there is no cyclic relationship: the arcs define proper rules. To make the graph clearer, the text description has been omitted. By contrast, Figure 20.4d presents a case where both cycles present in Figure 20.4b stand for real cycles in the rule base. Thus, the presence of a cyclic relationship in a decision unit relationship graph is an indication to inspect for cyclic relationships at the global level. This can be achieved by creating a suitable reason-result (causal) graph representing the relations between input and output entries of the units which cause the cyclic relations shown in the decision unit relationship diagram. The scanned graph consists only of the nodes and arcs necessary to determine circularity, which limits the scanned area.
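The first, black-box step of this technique amounts to enumerating cycles in the decision unit relationship graph. A minimal Python sketch follows; since Figure 20.4 is not reproduced here, the edge set below is a hypothetical one chosen only so that it contains the two cycles named in the text. Each cycle found is merely a candidate that still has to be confirmed on the causal graph of input and output entries.

```python
from typing import Dict, List


def simple_cycles(graph: Dict[int, List[int]]) -> List[List[int]]:
    """Enumerate simple cycles, each reported once from its smallest vertex."""
    cycles: List[List[int]] = []

    def dfs(start: int, node: int, path: List[int]) -> None:
        for nxt in graph.get(node, ()):
            if nxt == start:
                cycles.append(path + [start])  # closed a cycle back to start
            elif nxt > start and nxt not in path:
                dfs(start, nxt, path + [nxt])

    for start in sorted(graph):
        dfs(start, start, [start])
    return cycles


# Hypothetical unit graph: an arc u -> v means unit v uses entries produced by u.
unit_graph = {1: [2, 3], 2: [3], 3: [1]}
candidate_cycles = simple_cycles(unit_graph)
```

Restricting the search to vertices larger than the starting one keeps every cycle from being reported once per rotation.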
20.6 Summary
This paper presents the usage of decision units in the circularity detection task. Decision units allow modular organisation of the rule knowledge base, which facilitates programming and base verification, simultaneously increasing the clarity of achieved
$\ldots \to C_i : i = 1, \ldots, n\}$   (21.2)
defines generalized linear combinations over the concept spaces $C_i$. For any $i = 1, \ldots, n$, $W_i$ denotes the space of the combination parameters. If $W_i$ is a partial or total ordering, then we interpret its elements as weights reflecting the relative importance of particular concepts in construction of the resulting concept. Let us denote by $m(i) \in \mathbb{N}$ the number of nodes in the $i$-th network layer. For any $i = 1, \ldots, n$, the nodes from the $i$-th and $(i+1)$-th layers are connected by the links labeled with parameters $w^{j(i)}_{j(i+1)} \in W_i$, for $j(i) = 1, \ldots, m(i)$ and $j(i+1) = 1, \ldots, m(i+1)$. For any collection of the concepts $c_i^1, \ldots, c_i^{m(i)} \in C_i$ occurring as the outputs of the $i$-th network's layer in a given situation, the input to the $j(i+1)$-th node in the $(i+1)$-th layer takes the following form:
$c_{i+1}^{j(i+1)} = map_i\left(lin_i\left(\left\{\left(c_i^{j(i)}, w^{j(i)}_{j(i+1)}\right) : j(i) = 1, \ldots, m(i)\right\}\right)\right)$   (21.3)
The way of composing functions within the formula (21.3) requires, obviously, further discussion. In this paper, we restrict ourselves to the case of Figure 21.2a, where $map_i$ and $lin_i$ are stated separately. However, parameters $w^{j(i)}_{j(i+1)}$ could also be used directly in a generalized concept mapping
$genmap_i : 2^{C_i \times W_i} \to C_{i+1}$   (21.4)
as shown in Figure 21.2b. These two possibilities reflect construction tendencies described in Section 21.2. Function (21.4) can be applied to construction of more compound concepts parameterized by the elements of $W_i$, while the usage of Definitions 1 and 2 results rather in potential syntactical simplification of the new concepts (which can, however, still become more compound semantically). One can see that function $genmap$ and the corresponding illustration 21.2b refer directly to the ideas of synthesizing concepts (granules, standards, approximations) known from rough-neural computing, rough mereology, and the theory of approximation spaces (cf. [6, 11, 14]). On the other hand, splitting the functionality of $genmap$, as proposed by formula (21.3) and illustrated in 21.2a, provides us with a framework more comparable to the original artificial neural networks and their supervised learning capabilities (cf. [19, 18]).
Dominik Ślęzak, Marcin Szczuka, and Jakub Wróblewski
21.5 Weighted compound concepts
Beginning with the input layer of the network, we expect it to provide the concepts-signals $c_1^1, \ldots, c_1^{m(1)} \in C_1$, which will then be transmitted towards the target layer using (21.3). If we learn the network related directly to a real-valued training sample, then we get $C_i = \mathbb{R}$, $lin_i$ can be defined as the classical linear combination (with $W_i = \mathbb{R}$), and $map_i$ as identity. An example of a more compound concept space originates from our previous studies [18, 19]:
Fig. 21.2. Production of new concepts in consecutive layers: a. the concepts are first weighted and combined within the original space $C_i$ using function $lin_i$ and then mapped to a new concept in $C_{i+1}$; b. the concepts are transformed directly to the new space $C_{i+1}$ by using the generalized concept mapping (21.4).
Example 1. Let us assume that the input layer nodes correspond to various classifiers and the task is to combine them within a general system, which synthesizes the input classifications in an optimal way. For any object, each input classifier induces a possibly incomplete vector of beliefs in the object's membership to particular decision classes. Let $DEC$ denote the set of decision classes specified for a given classification problem. By the weighted decision space $WDEC$ we mean the family of subsets of $DEC$ with elements labeled by their beliefs, i.e.:
$WDEC = \bigcup_{X \subseteq DEC} \{(k, \mu_k) : k \in X, \mu_k \in \mathbb{R}\}$   (21.5)
Any weighted decision $\vec{\mu} = \{(k, \mu_k) : k \in X_{\vec{\mu}}, \mu_k \in \mathbb{R}\}$ corresponds to a subset $X_{\vec{\mu}} \subseteq DEC$ of decision classes for which the beliefs $\mu_k \in \mathbb{R}$ are known. Another example corresponds to the specific classifiers - the sets of decision rules obtained using the methodology of rough sets [12, 21]. The way of parametrization is comparable to the proceedings with classification granules in [11, 14].
Example 2. Let $DESC$ denote the family of logical descriptions, which can be used to define decision rules for a given classification problem. Every rule is labeled with its description $\alpha_{rule} \in DESC$ and decision information, which takes - in the most
21 Feedforward Concept Networks
general framework - the form of $\vec{\mu}_{rule} \in WDEC$. For a new object, we measure its degree of satisfaction of the rule's description (usually zero-one), combine it with the number of training objects satisfying $\alpha_{rule}$, and come out with the number $app_{rule} \in \mathbb{R}$ expressing the level of the rule's applicability to this object. As a result, by the decision rule set space $RULS$ we mean the family of all sets of elements of $DESC$ labeled by weighted decision sets and the degrees of applicability, i.e.:
$RULS = \bigcup_{X \subseteq DESC} \{(\alpha, \vec{\mu}, app) : \alpha \in X, \vec{\mu} \in WDEC, app \in \mathbb{R}\}$   (21.6)
Definition 3. By a weighted compound concept space $C$ we mean a space of collections of sub-concepts from some sub-concept space $S$ (possibly from several spaces), labeled with the concept parameters from a given space $V$, i.e.:
$C = \bigcup_{X \subseteq S} \{(s, v_s) : s \in X, v_s \in V\}$   (21.7)
For a given $c = \{(s, v_s) : s \in X_c, v_s \in V\}$, where $X_c \subseteq S$ is the range of $c$, parameters $v_s \in V$ reflect the relative importance of sub-concepts $s \in X_c$ within $c$. Just like in the case of combination parameters $W_i$ in Definition 2, we can assume a partial or total ordering over the concept parameters. A perfect situation would then be to be able to combine these two kinds of parameters while calculating the generalized linear combinations and observe how the sub-concepts from various outputs of the previous layer fight for their importance in the next one. For the sake of simplicity, we further restrict ourselves to the case of real numbers, as stated by Definition 4. However, in general $W_i$ does not need to be $\mathbb{R}$. Let us consider a classifier network, similar to Example 2, where decision rules are described by parameters of accuracy and importance (initially equal to their support). A concept transmitted by the network refers to rules matched by an input object. The generalized linear combination of such concepts may be parameterized by vectors $(w, \theta) \in W_i$ and defined as a union of rules, where importance is expressed by $w$ and $\theta$ states a threshold for the rules' accuracy.
Definition 4. Let the $i$-th network layer correspond to the weighted compound concept space $C_i$ based on sub-concept space $S_i$ and parameters $V_i = \mathbb{R}$. Consider the $j(i+1)$-th node in the next layer. We define its input as follows:
$lin_i\left(\left\{\left(c_i^{j(i)}, w^{j(i)}_{j(i+1)}\right) : j(i) = 1, \ldots, m(i)\right\}\right) = \left\{\left(s, \sum_{j(i) : s \in X_{j(i)}} w^{j(i)}_{j(i+1)} v_s^{j(i)}\right) : s \in \bigcup_{j(i)} X_{j(i)}\right\}$   (21.8)
where $X_{j(i)} \subseteq S_i$ is simplified notation for the range of the weighted compound concept $c_i^{j(i)}$ and $v_s^{j(i)} \in \mathbb{R}$ denotes the importance of sub-concept $s \in S_i$ in $c_i^{j(i)}$. Formula (21.8) can be applied both to $WDEC$ and $RULS$. In the case of $WDEC$, the sub-concept space equals $DEC$. The sum $\sum_{j(i) : s \in X_{j(i)}} w^{j(i)}_{j(i+1)} v_s^{j(i)}$ gathers the
weighted beliefs of the previous layer's nodes in the given decision class $s \in DEC$. In the case of $RULS$ we do the same with the weighted applicability degrees for the elements-rules belonging to the sub-concept space $DESC \times WDEC$. It is interesting to compare our method of the parameterized concept transformation with the way of proceeding with classification granules and decision rules in the other rough set based approaches [11, 12, 14, 21]. Actually, at this level, we do not provide anything novel but a rewriting of well known examples within a more unified framework. A more visible difference can be observed in the next section, where we complete our methodology.
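To make the weighted sum in Definition 4 concrete, the following Python sketch applies it to weighted decisions from $WDEC$. It assumes a hypothetical representation in which a weighted compound concept is a dictionary from sub-concepts to real importances, and it takes the result to range over the union of the ranges of the incoming concepts.

```python
from typing import Dict, Hashable, List, Tuple

Concept = Dict[Hashable, float]  # sub-concept s -> importance v_s


def lin(weighted_concepts: List[Tuple[Concept, float]]) -> Concept:
    """Generalized linear combination: sum w * v_s over concepts that contain s."""
    result: Concept = {}
    for concept, w in weighted_concepts:
        for s, v in concept.items():
            result[s] = result.get(s, 0.0) + w * v
    return result


# Two previous-layer nodes holding beliefs over decision classes (made-up values).
c1 = {"yes": 0.75, "no": 0.25}
c2 = {"yes": 0.5}  # incomplete: no belief stored for "no"
combined = lin([(c1, 2.0), (c2, 4.0)])
```

A decision class missing from one concept simply contributes nothing to the sum for that class, which is how incomplete belief vectors are handled in this reading.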
Fig. 21.3. The network-based object classification: the previously trained decision rule sets are activated by an object by means of their applicability to its classification; then the rule set concepts are processed and mapped to the weighted decisions using function (21.9); finally the most appropriate decision for the given object is produced.
21.6 Activation functions
The possible layout combining the concept spaces $DEC$, $WDEC$, and $RULS$ with the partly homogeneous classifier network is illustrated by Figure 21.3. Given a new object, we initiate the input layer with the degrees of applicability of the rules in particular rule-sets to this object. After processing with this type of concept along (possibly) several layers, we use the concept mapping function
$map(ruls) = \left\{\left(k, \sum_{(\alpha, \vec{\mu}, app) \in ruls : k \in X_{\vec{\mu}}} app \cdot \mu_k\right) : k \in \bigcup_{(\alpha, \vec{\mu}, app) \in ruls} X_{\vec{\mu}}\right\}$   (21.9)
that is, we simply summarize the beliefs (weighted by the rules' applicability) in particular decision classes. Similarly, we finally map the weighted decision to the decision class which is assigned the highest resulting belief. The intermediate layers in Figure 21.3 are designed to help in voting among the classification results obtained from particular rule sets. The traditional rough set approach (cf. [12]) assumes specification of a fixed voting function, which, in our terminology, would correspond to the direct concept mapping from the first RULS
layer into $DEC$, with no hidden layers and without the possibility of tuning the weights of connections. An improved adaptive approach (cf. [21]) enables us to adjust the rule sets, although the voting scheme still remains fixed. At the same time, the proposed method provides us with a framework for tuning the weights and, in this way, learning the voting formula adaptively (cf. [6, 11, 14]). Still, the scheme based only on generalized linear combinations and concept mappings is not adjustable enough. The reader may check that composition of functions (21.8) for elements of $RULS$ and $WDEC$ with (21.9) results in a collapsed single-layer structure corresponding to the most basic weighted voting among decision rules. This is exactly what happens with classical feedforward neural network models with no non-linear activation functions translating the signals within particular neurons. Therefore, we should consider such functions as well.
Definition 5. Neural concept scheme is a quadruple $(\mathcal{C}, \mathcal{MAP}, \mathcal{LIN}, \mathcal{ACT})$, where the first three entities are provided by Definitions 1, 2, and
$\mathcal{ACT} = \{act_i : C_i \to C_i : i = 2, \ldots, n+1\}$   (21.10)
is the set of activation functions, which can be used to relate the inputs to the outputs within each $i$-th layer of a network.
It is reasonable to assume some properties of $\mathcal{ACT}$ which would work for the proposed generalized scheme analogously to the classical case. Given a compound concept consisting of some interacting parts, we would like, for instance, to guarantee that the relative importance of those parts remains roughly unchanged. Such a requirement, corresponding to monotonicity and continuity of real functions, is well expressible for the weighted compound concepts introduced in Definition 3. Given a concept $c_i \in C_i$ represented as the weighted collection of sub-concepts, we claim that its more important (better weighted) sub-concepts should keep more influence on the concept $act_i(c_i) \in C_i$ than the others. In [18, 19] we introduced a sigmoidal activation function working on probability vectors comparable to the structure of $WDEC$ in Example 1. That function, which originated from the studies on monotonic decision measures in [15], can actually be generalized onto any space of compound concepts weighted with real values:
Definition 6. By the $\alpha$-sigmoidal activation function for a weighted compound concept space $C$ with real concept parameters, we mean the function $act_\alpha : C \to C$ parameterized by $\alpha > 0$ which modifies these parameters in the following way:
$act_\alpha(c) = \left\{\left(s, \frac{e^{\alpha v_s}}{\sum_{(t, v_t) \in c} e^{\alpha v_t}}\right) : (s, v_s) \in c\right\}$   (21.11)
By composition of $lin_i$ and $map_i$, which specify the concepts $c_{i+1}^{j(i+1)} \in C_{i+1}$ as inputs to the nodes in the $(i+1)$-th layer, with functions $act_{i+1}^{\alpha}$ modifying the concepts within the nodes, we obtain a classification model with satisfactory expressive and adaptive power. If we apply this kind of function to the rule sets, we modify the rules' applicability degrees by their internal comparison. Such performance cannot
be obtained using classical neural networks with the nodes assigned to every single rule. Appropriate tuning of $\alpha > 0$ results in activation/deactivation of the rules with a relatively higher/lower applicability. Similar characteristics can be observed within $WDEC$, where the decision beliefs compete with each other in the voting process (cf. [15]). The presented framework also allows for modeling other interesting behaviors. For instance, decision rules which inhibit the influence of other rules (so-called exceptions) can be easily achieved by negative weights and proper activation functions, which would be hard to emulate with plain, negation-free conjunctive decision rules. Further research is needed to compare the capabilities of the proposed construction with other hierarchical approaches [6, 10, 9, 20].
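The last stages of the classification pipeline of Figure 21.3 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: a rule set is a hypothetical list of triples (description, beliefs, applicability), the mapping to weighted decisions follows the verbal description of (21.9), and the $\alpha$-sigmoidal activation is assumed to have the softmax-like form $e^{\alpha v_s} / \sum_t e^{\alpha v_t}$.

```python
import math
from typing import Dict, List, Tuple

Beliefs = Dict[str, float]                 # decision class k -> belief mu_k
RuleSet = List[Tuple[str, Beliefs, float]]  # (description, beliefs, applicability)


def map_ruls_to_wdec(ruls: RuleSet) -> Beliefs:
    """Sum the beliefs per decision class, weighted by rule applicability."""
    out: Beliefs = {}
    for _description, beliefs, app in ruls:
        for k, mu in beliefs.items():
            out[k] = out.get(k, 0.0) + app * mu
    return out


def act_alpha(concept: Beliefs, alpha: float) -> Beliefs:
    """Assumed alpha-sigmoidal activation: exponential rescaling, normalized."""
    z = sum(math.exp(alpha * v) for v in concept.values())
    return {s: math.exp(alpha * v) / z for s, v in concept.items()}


def decide(wdec: Beliefs) -> str:
    """Pick the decision class with the highest resulting belief."""
    return max(wdec, key=wdec.get)


# Two made-up rules matched by an object.
ruls: RuleSet = [
    ("rule1", {"a": 1.0}, 2.0),
    ("rule2", {"a": 0.5, "b": 0.5}, 1.0),
]
wdec = map_ruls_to_wdec(ruls)
soft = act_alpha(wdec, alpha=1.0)
sharp = act_alpha(wdec, alpha=3.0)
decision = decide(sharp)
```

Under this assumed form, a larger $\alpha$ moves more of the total weight to the strongest sub-concept while preserving the ordering, which matches the activation/deactivation behaviour described above.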
21.7 Learning in classifier networks A cautious reader have probably already noticed the arising question about the proper choice of connection weights in the network. The weights are ultimately the component that decides about the performance of entire scheme. As we will try to advocate, it is - at least to some extent - possible to learn them in a manner similar to the case of standard neural networks. Backpropagation, the way we want to use it here, is a method for reducing the global error of a network by performing local changes in weights' values. The key issue is to have a method for dispatching the value of the network's global error functional among the nodes (cf. [4]). This method, when shaped in the form of an algorithm, should provide the direction of the weight update vector, which is then applied according to the learning coefficient. For the standard neural network model (cf. [3]) this algorithm selects the direction of weight update using the gradient of error functional and the current input. Obviously, numerous versions and modifications of gradient-based algorithm exist. In the more complicated models which we are dealing with, the idea of backpropagation transfers into the demand for a general method of establishing weight updates. This method should comply to the general principles postulated for the rough-neural models (cf. [8, 21]). Namely, the algorithm for the weight updates should provide a certain form of mutual monotonicity i.e. small and local changes in weights should not rapidly divert the behavior of the whole scheme and, at the same time, a small overall network error should result in merely cosmetic changes in the weight vectors. The need of introducing automatic backpropagation-like algorithms to rough-neural computing were addressed recently in [6]. It can be referred to some already specified solutions like, e.g., the one proposed for rough-fuzzy neural networks in [7]. 
Still, a general framework for RNC is missing, and special attention must be paid to the issue of interpreting and calculating partial error derivatives with respect to the parameters of complex structures. We do not claim to have discovered the general principle for constructing backpropagation-like algorithms for the concept (granule) networks. Still, in [18, 19] we have been able to construct a generalization of the gradient-based method for the homogeneous neural concept schemes based on the space WDEC. The step to partly homogeneous schemes is natural for the class of weighted compound concepts,
which can be processed using the same type of activation function. For instance, in the case of the scheme illustrated by Figure 21.3, the conservative choice of mappings, which turn out to be differentiable and regular, permits direct translation from the previous case. Hence, by a small adjustment of the algorithm developed previously, we get a recipe for learning the weight vectors. The example of two-dimensional weights $(w, \theta) \in W_i$ proposed in Section 21.4 is much harder to translate into the language of backpropagation. One of the most important features of the classical backpropagation algorithm is that we can achieve a local minimum of an error function (on a set of examples) by a local, easy to compute change of the weight value. This is no longer easy for two real-valued parameters instead of one. Moreover, the parameter $\theta$ is a rule threshold (fuzzified by a kind of sigmoidal characteristic to obtain a differentiable model) and, therefore, by adjusting its value we switch entire rules on and off (almost, up to the proposed sigmoidal function), causing dramatic error changes. This illustrates the problems that arise when we deal with more complicated parameter spaces: in many cases we have to use dedicated, time-consuming local optimization algorithms. Yet another issue concerns the second "tooth" of backpropagation: transmitting the error value backward through the network. The question is how to modify the error value according to the connection weight, assuming that the weight is generalized (e.g., the vector as above). The error value should be translated into a value compatible with the previous layer of classifiers, and should be useful for an algorithm of parameter modification. This means that the information about the error transmitted to the previous layer can be not only a real-valued signal but also, e.g., a complete description of each rule's positive or negative contribution to the classifier performance in the next layer.
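The classical case that this section takes as its starting point, one real-valued weight per connection updated along the gradient of the error functional, can be illustrated by a minimal sketch. It assumes a single sigmoid node and a squared error; the names `sigmoid`, `delta_rule_step` and the learning coefficient `eta` are illustrative choices, not taken from the chapter, and the sketch does not cover the generalized weights $(w, \theta)$.

```python
import math


def sigmoid(t):
    """Standard logistic activation."""
    return 1.0 / (1.0 + math.exp(-t))


def delta_rule_step(weights, inputs, target, eta=0.5):
    """One gradient step for a single sigmoid node with squared error.

    Returns the updated weights and the error measured before the update.
    """
    net = sum(w * x for w, x in zip(weights, inputs))
    out = sigmoid(net)
    # For E = 0.5 * (out - target)^2 the partial derivative with respect to
    # w_k is (out - target) * out * (1 - out) * x_k.
    delta = (out - target) * out * (1.0 - out)
    new_weights = [w - eta * delta * x for w, x in zip(weights, inputs)]
    return new_weights, 0.5 * (out - target) ** 2


# Hypothetical training example: one input vector, desired output 1.0.
weights = [0.0, 0.0]
errors = []
for _ in range(50):
    weights, err = delta_rule_step(weights, [1.0, 0.5], target=1.0)
    errors.append(err)
```

The size of each update is proportional to `out - target`, so a small error produces only a small change of the weights, which is the kind of mutual monotonicity asked for above.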
21.8 Conclusions
We have discussed the construction of hierarchical concept schemes aiming at layered learning of mappings between the inputs and desired outputs of classifiers. We proposed a generalized structure of a feedforward neural-like network approximating the intermediate concepts in a way similar to traditional neurocomputing approaches. We provided examples of compound concepts corresponding to decision rule based classifiers and gave some intuition concerning their processing through the network. Although we have some experience with neural networks transmitting non-trivial concepts [18, 19], this is definitely the very beginning of more general theoretical studies. The most pressing issue is the extension of the proposed framework to structures more advanced than the introduced weighted compound concepts, without losing the general interpretation of monotonic activation functions, as well as the relaxation of the quite limiting mathematical requirements corresponding to the general idea of learning based on error backpropagation. We are going to address these problems by developing theoretical and practical foundations, as well as by referring to other approaches, especially those related to rough-neural computing [6, 8, 9].
References
1. Bazan, J., Nguyen, S.H., Nguyen, H.S., Skowron, A.: Rough Set Methods in Approximation of Hierarchical Concepts. In: Proc. of RSCTC'2004. LNAI 3066, Springer Verlag (2004) pp. 346-355.
2. Dietterich, T.: Machine learning research: four current directions. AI Magazine 18/4 (1997) pp. 97-136.
3. Hecht-Nielsen, R.: Neurocomputing. Addison-Wesley (1990).
4. le Cun, Y.: A theoretical framework for backpropagation. In: Neural Networks - concepts and theory. IEEE Computer Society Press (1992).
5. Lenz, M., Bartsch-Spoerl, B., Burkhard, H.-D., Wess, S. (eds.): Case-Based Reasoning Technology: From Foundations to Applications. LNAI 1400, Springer (1998).
6. Pal, S.K., Peters, J.F., Polkowski, L., Skowron, A.: Rough-Neural Computing: An Introduction. In: S.K. Pal, L. Polkowski, A. Skowron (eds.), Rough-Neural Computing. Cognitive Technologies Series, Springer (2004) pp. 15-41.
7. Pedrycz, W., Peters, J.F.: Learning in fuzzy Petri nets. In: J. Cardoso, H. Scarpelli (eds.), Fuzziness in Petri Nets. Physica (1998) pp. 858-886.
8. Peters, J.F., Szczuka, M.: Rough neurocomputing: a survey of basic models of neurocomputation. In: Proc. of RSCTC'2002. LNAI 2475, Springer (2002) pp. 309-315.
9. Polkowski, L., Skowron, A.: Rough-neuro computing. In: W. Ziarko, Y.Y. Yao (eds.), Proc. of RSCTC'2000. LNAI 2005, Springer (2001) pp. 57-64.
10. Polkowski, L., Skowron, A.: Rough mereological calculi of granules: A rough set approach to computation. Computational Intelligence 17/3 (2001) pp. 472-492.
11. Skowron, A.: Approximate Reasoning by Agents in Distributed Environments. Invited speech at IAT'2001. Maebashi, Japan (2001).
12. Skowron, A., Pawlak, Z., Komorowski, J., Polkowski, L.: A rough set perspective on data and knowledge. In: W. Kloesgen, J. Zytkow (eds.), Handbook of KDD. Oxford University Press (2002) pp. 134-149.
13. Skowron, A., Stepaniuk, J.: Information granules: Towards foundations of granular computing. International Journal of Intelligent Systems 16/1 (2001) pp. 57-86.
14. Skowron, A., Stepaniuk, J.: Information Granules and Rough-Neural Computing. In: S.K. Pal, L. Polkowski, A. Skowron (eds.), Rough-Neural Computing. Cognitive Technologies Series, Springer (2004) pp. 43-84.
15. Ślęzak, D.: Normalized decision functions and measures for inconsistent decision tables analysis. Fundamenta Informaticae 44/3 (2000) pp. 291-319.
16. Ślęzak, D., Szczuka, M., Wróblewski, J.: Harnessing classifier networks - towards hierarchical concept construction. In: Proc. of RSCTC'2004, Springer (2004).
17. Ślęzak, D., Wróblewski, J.: Application of Normalized Decision Measures to the New Case Classification. In: W. Ziarko, Y. Yao (eds.), Proc. of RSCTC'2000. LNAI 2005, Springer (2001) pp. 553-560.
18. Ślęzak, D., Wróblewski, J., Szczuka, M.: Neural Network Architecture for Synthesis of the Probabilistic Rule Based Classifiers. ENTCS 82/4, Elsevier (2003).
19. Ślęzak, D., Wróblewski, J., Szczuka, M.: Constructing Extensions of Bayesian Classifiers with use of Normalizing Neural Networks. In: N. Zhong, Z. Ras, S. Tsumoto, E. Suzuki (eds.), Proc. of ISMIS'2003. LNAI 2871, Springer (2002) pp. 408-416.
20. Stone, P.: Layered Learning in Multiagent Systems: A Winning Approach to Robotic Soccer. MIT Press, Cambridge MA (2000).
21. Wróblewski, J.: Adaptive aspects of combining approximation spaces. In: S.K. Pal, L. Polkowski, A. Skowron (eds.), Rough-Neural Computing. Cognitive Technologies Series, Springer (2004) pp. 139-156.
22
Extensions of Partial Structures and Their Application to Modelling of Multiagent Systems
Bozena Staruch
Faculty of Mathematics and Computer Science, University of Warmia and Mazury, Zolnierska 14a, 10-561 Olsztyn, Poland
bs [email protected]
Summary. Various formal approaches to the modelling of multiagent systems have been used, e.g., logics of knowledge and various kinds of modal logics [4]. We discuss an approach to multiagent systems based on the assumption that the agents possess only partial information about global states, see [6]. We make a general assumption that agents perceive the world by fragmentary observations only [8, 4]. We propose to use partial structures for agent modelling and we present some consequences of such an algebraic approach. Such partial structures are incrementally enriched by new information. These enriched structures are represented by extensions of the given partial model. The extension of a partial structure is the basic notion of this paper. It makes it possible for a given agent to model hypotheses about extensions of the observable world. An agent can express the properties of the states by properties of the partial structure he has at his disposal. We assume that every agent knows the signature of the language that we use for modelling agents.
22.1 Introduction
A partial structure is a partial algebra [2, 1] enriched with predicates. For simplicity, we use a language with a sufficient number of constants and, in consequence, we describe theories of partial structures in terms of atomic formulas with constants and, additionally, inequalities between some constants. Such formulas can be treated as constraints defining the discernibility conditions that should be preserved, e.g., during data reduction [8]. Our theoretical considerations split into two methods: a partial-model theoretic one and a logical one. We investigate two kinds of sets of first order sentences. An infallible set of sentences (a partial theory) contains all sentences that should be satisfied in every extension of the given family of partial structures. A possible set of sentences is a set of sentences that is satisfied in a certain extension of the given family of partial structures. Any partial algebraic structure is closely related to its partial theory. The theory of a partial structure that is the intersection of all its extensions corresponds to the common part of extensions considered in non-monotonic logics [5].
Temporal, modal, multimodal and epistemic logics are used to express properties of extensions of partial structures (see, e.g., [10], [12] or [13]). We investigate the inconsistency problem that may appear in multiagent systems during extending and synthesizing (fusion) of partial results. From a logical point of view, inconsistency could appear if the theory of a partial structure representing the knowledge of a given agent is logically inconsistent under the information available to this single agent or to other agents. From an algebraic point of view, inconsistency could appear when identification of different constants by agents is necessary. The main tool we use for fusion of partial results is the coproduct operation. For any family of partial structures there exists a unique (up to isomorphism) coproduct, which is constructed as a disjoint sum of partial structures factored by a congruence identifying constants that should be identified. Inconsistency can then be recognized during the construction of this congruence.
Notice that Pawlak's information systems [8] can be naturally represented by partial structures. For example, any such system can be considered as a relational structure with some partial operations. Extensions of partial structures can also be applied to problems concerning data analysis in information systems, such as the decomposition problem or the synthesis (fusion) problem of partial results [7]. We also consider multiagent systems where some further logical constraints (in the form of atomic formulas), controlling the extension process, are added.
The paper is organized as follows. We introduce basic facts on partial structures in Section 22.2. We define there extensions of a partial structure and of a family of partial structures as well. In Subsection 22.2.1 we give the construction of the coproduct of a given family of partial structures. Section 22.3 includes the logical part of our theory. We give there a definition of possible and infallible sets of sentences. In the section after that we discuss how our algebraic approach can be used in multiagent systems.
22.2 Partial structures
We use partial algebra theory [2, 1] throughout the paper. Almost all facts concerning partial algebras are easily extended to partial structures [10-13]. We consider a signature $(F, C, \Pi, n)$ with at most countable (finite in practice) and pairwise disjoint sets of function, constant and predicate symbols and with an arity function $n : F \cup \Pi \to \mathbb{N}$, where $\mathbb{N}$ represents the set of nonnegative integers. Any constant is a 0-ary function, so we generally omit the set of constants from the signature and write it explicitly only when necessary.
Definition 1. A partial structure of signature $(F, \Pi, n)$ is a triple $\mathbf{A} = (A, (f^{\mathbf{A}})_{f \in F}, (r^{\mathbf{A}})_{r \in \Pi})$ such that for every $f \in F$, $f^{\mathbf{A}}$ is a partial $n(f)$-ary operation on $A$ (the domain of the operation $f^{\mathbf{A}} \subseteq A^{n(f)} \times A$ is denoted by $\mathrm{dom} f^{\mathbf{A}}$) and for every $r \in \Pi$, $r^{\mathbf{A}} \subseteq A^{n(r)}$. We say that $\mathbf{A}$ is a total structure of signature $(F, \Pi, n)$ if all its operations are defined everywhere. An operation or relation is discrete if its domain is empty. A partial structure $\mathbf{A}$ is discrete if all its operations and relations are discrete.
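For finite carriers, Definition 1 can be mirrored directly in code. The following is a minimal sketch, assuming that each partial operation is stored as a table from argument tuples to values and each relation as a set of tuples; the class name `PartialStructure` and its fields are hypothetical and serve illustration only.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class PartialStructure:
    carrier: set
    # operations[f] maps argument tuples to values; a missing tuple means
    # that f is undefined on these arguments.
    operations: dict = field(default_factory=dict)
    # relations[r] is the set of argument tuples for which r holds.
    relations: dict = field(default_factory=dict)

    def is_total(self, arity):
        """True if every operation is defined on all tuples of its arity."""
        return all(
            all(args in table
                for args in product(self.carrier, repeat=arity[f]))
            for f, table in self.operations.items()
        )

    def is_discrete(self):
        """True if all operations and relations have empty domains."""
        return (all(not table for table in self.operations.values())
                and all(not tuples for tuples in self.relations.values()))


arity = {"join": 2}
# join(1, 1) is deliberately left undefined, so the structure is partial.
partial = PartialStructure(
    carrier={0, 1},
    operations={"join": {(0, 0): 0, (0, 1): 1, (1, 0): 1}},
)
total = PartialStructure(
    carrier={0, 1},
    operations={"join": {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}},
)
discrete = PartialStructure(carrier={0, 1},
                            operations={"join": {}},
                            relations={"le": set()})
```

A constant fits the same scheme as a 0-ary operation: its table is either empty (the constant is undefined) or contains the single key `()`.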
Notice that for any constant symbol $c$, the appropriate operation $c^{\mathbf{A}}$ is either a distinguished element in $A$ or is undefined. Every structure (even total) of a given signature is a partial structure of any wider signature. Then, the additional operations and relations are discrete.
Remark 1. We will use Pawlak's information systems [8] for presenting examples, so let us recall some definitions here. An information system is a pair $\mathcal{S} = (U, A)$, where each attribute $a \in A$ is identified with a function $a : U \to V_a$ from the universe $U$ of objects into the set $V_a$ of all possible values of $a$. A formula $a = v_a$ is called a descriptor and a template is defined as a conjunction of descriptors $\bigwedge (a_i, v_{a_i})$, where $a_i \in A$ and $a_i \neq a_j$ for $i \neq j$. A decision table is an information system of the form $\mathcal{A} = (U, A \cup \{d\})$, where $d \notin A$ is a distinguished attribute called the decision. For every set of attributes $B \subseteq A$, an equivalence relation, denoted by $IND_{\mathcal{A}}(B)$ and called the B-indiscernibility relation, is defined by
$$IND_{\mathcal{A}}(B) = \{(u, u') \in U^2 : \text{for every } a \in B,\ a(u) = a(u')\} \qquad (22.1)$$
Objects $u, u'$ satisfying the relation $IND_{\mathcal{A}}(B)$ are indiscernible by attributes from $B$. If $\mathcal{A} = (U, A)$ is an information system, $B \subseteq A$ is a set of attributes and $X \subseteq U$ is a set of objects, then the sets $\underline{B}X = \{u \in U : [u]_B \subseteq X\}$ and $\overline{B}X = \{u \in U : [u]_B \cap X \neq \emptyset\}$ are called the B-lower and the B-upper approximation of $X$ in $\mathcal{A}$, respectively. The set $BN_B(X) = \overline{B}X - \underline{B}X$ will be called the B-boundary of $X$. In rough set theory, approximations determined by a tolerance relation instead of an equivalence relation are also considered. Our approach can be used there, too.
Example 1. We interpret the information system as a partial structure $\mathbf{A} = (U, R)$, where $R = \{r_{a,v} : a \in A, v \in V_a\}$ and $r_{a,v}$ is a unary relation such that for every $x \in U$, $x \in r_{a,v}$ iff $a(x) = v$. Partial operations can also be considered there.
Example 2. Every partially ordered set is a partial lattice of a signature with two binary operations $\vee$ and $\wedge$ of the least upper bound and the greatest lower bound, respectively.
Definition 2. A homomorphism of partial structures $h : \mathbf{A} \to \mathbf{B}$ of signature $(F, \Pi, n)$ is a function $h : A \to B$ such that, for any $f \in F$, if $\mathbf{a} \in \mathrm{dom} f^{\mathbf{A}}$ then $h \circ \mathbf{a} \in \mathrm{dom} f^{\mathbf{B}}$ and $h(f^{\mathbf{A}}(\mathbf{a})) = f^{\mathbf{B}}(h \circ \mathbf{a})$, and for any $r \in \Pi$ and $a_1, \dots, a_{n(r)} \in A$, if $r^{\mathbf{A}}(a_1, \dots, a_{n(r)})$ then $r^{\mathbf{B}}(h(a_1), \dots, h(a_{n(r)}))$.
Definition 3. A partial structure $\mathbf{B}$ is an extension of a partial structure $\mathbf{A}$ iff there exists an injective homomorphism $e_{\mathbf{A}} : \mathbf{A} \to \mathbf{B}$. If $\mathbf{B}$ is total, then we say that $\mathbf{B}$ is a completion of $\mathbf{A}$.
• $E(\mathbf{A})$ denotes the class of all extensions and
• $T(\mathbf{A})$ denotes the class of all completions of $\mathbf{A}$.
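For finite structures the conditions of Definitions 2 and 3 can be checked mechanically. The sketch below assumes the same table-based encoding as before (operations as dictionaries from argument tuples to values, relations as sets of tuples) and a map `h` given as a dictionary; the function names are hypothetical.

```python
def is_homomorphism(h, ops_a, rels_a, ops_b, rels_b):
    """Check Definition 2 for a map h from the carrier of A to that of B."""
    for f, table in ops_a.items():
        for args, value in table.items():
            image_args = tuple(h[x] for x in args)
            # The image of the arguments must lie in the domain of f in B ...
            if image_args not in ops_b.get(f, {}):
                return False
            # ... and h must commute with f.
            if ops_b[f][image_args] != h[value]:
                return False
    for r, tuples in rels_a.items():
        for args in tuples:
            if tuple(h[x] for x in args) not in rels_b.get(r, set()):
                return False
    return True


def is_extension_via(h, ops_a, rels_a, ops_b, rels_b):
    """Check Definition 3: h is an injective homomorphism from A to B."""
    injective = len(set(h.values())) == len(h)
    return injective and is_homomorphism(h, ops_a, rels_a, ops_b, rels_b)


# A: meet(a, b) = a is the only determined value.
ops_a = {"meet": {("a", "b"): "a"}}
rels_a = {"le": {("a", "b")}}
# B: a larger partial lattice on {x, y, z}.
ops_b = {"meet": {("x", "y"): "x", ("x", "z"): "x"}}
rels_b = {"le": {("x", "y"), ("x", "z")}}

embedding = {"a": "x", "b": "y"}
collapsing = {"a": "x", "b": "x"}
```

Here `embedding` witnesses that B is an extension of A, while `collapsing` is not even a homomorphism, because `meet(x, x)` is undefined in B.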
Remark 2. For applications in further sections we use a generalization of the above notion of extension. By a generalized extension of the given partial structure $\mathbf{A}$ we understand any partial structure $\mathbf{B}$ (even of an extended signature) such that there exists a homomorphism $h : \mathbf{A} \to \mathbf{B}$ preserving some a priori chosen constraints. Properties of extensions defined by monomorphisms are important from a theoretical point of view and can be easily transferred to more general cases. We also consider extensions under some further constraints, which follow from the assumption that extensions belong to special classes of partial structures [13].
Definition 4. $\mathbf{A}$ is a weak substructure of $\mathbf{B}$ iff the identity embedding $id_A : A \to B$ is a homomorphism of partial structures $id_A : \mathbf{A} \to \mathbf{B}$.
Hence, every partial structure is an extension of its weak substructure. We do not recall here the notions of a relative substructure and a closed substructure.
Example 3. If $\mathcal{B}$ is a subtable of the given information system $\mathcal{A}$, then the corresponding partial structure $\mathbf{A}$ is an extension of $\mathbf{B}$. By a subtable we mean any subset of the given universe with some attributes and some values of these attributes. We allow null attribute values in subtables. $\mathbf{B} = (U_B, R_B)$ is a weak substructure of the given information system $\mathbf{A} = (U, R)$ if $U_B \subseteq U$ and $R_B \subseteq R$ (then also $B \subseteq A$). It means that if $x \in r^{\mathbf{B}}_{a,v}$ then $x \in r^{\mathbf{A}}_{a,v}$. Hence, it may be that $a(x) = v$ in $\mathbf{A}$ and $x \in U_B$ and $a \in B$, but $a(x)$ is not determined in $\mathbf{B}$.
Example 4. For generalized extensions we discern some constants. For example, let $\mathbf{A} = (U, R)$ be a relational system corresponding to an information system $\mathcal{A} = (U, A)$ and let $X \subseteq U$. Take the language $\mathcal{L}_U$ in which every object of $U$ is a constant. Assume that every constant of the lower approximation $\underline{A}X$ should be discerned from every constant from the complement, while no assumption is made for objects in the boundary region of the concept. One can describe the above discernibility using decision tables.
To do that, let $d$ be an additional decision attribute such that $d(x) = 1$ for every $x \in \underline{A}X$ and $d(x) = 0$ for every $x \in U \setminus \overline{A}X$.
For the partial lattices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ described in Figure 22.1, $\mathbf{A}$ is an extension of $\mathbf{B}$ and obviously $\mathbf{B}$ is a weak substructure of $\mathbf{A}$. $\mathbf{A}$ is a generalized extension of $\mathbf{C}$ under the assumption that $a \neq b \neq c \neq a$. The appropriate homomorphism glues $c$ with $d$.
22.2.1 Extensions of a family of partial structures
We assume that a family of partial structures (agents) is given. Every possible extension should include (in some way) every member of the given family, as well as the entire family. Let us take the following definition:
Example 5.
Fig. 22.1. Partial lattices
Definition 5. Let $\mathfrak{R} = (\mathbf{A}_i)_{i \in I}$ be a family of partial structures of a given fixed signature. A partial structure $\mathbf{B}$ is an extension of $\mathfrak{R}$ iff $\mathbf{B}$ is an extension of every $\mathbf{A}_i \in \mathfrak{R}$. And $\mathbf{B}$ is a generalized extension of $\mathfrak{R}$ iff $\mathbf{B}$ is a generalized extension of every $\mathbf{A}_i \in \mathfrak{R}$. $E(\mathfrak{R})$ and $T(\mathfrak{R})$ denote the classes of extensions and completions of the family $\mathfrak{R}$, respectively.
Definition 6. Let $\mathfrak{R} = (\mathbf{A}_i)_{i \in I}$ be a family of partial structures of signature $(F, C, \Pi, n)$. A partial structure $\mathbf{B}$ of the same signature is called a coproduct of $\mathfrak{R}$ iff there exists a family of homomorphisms $h_i : \mathbf{A}_i \to \mathbf{B}$ for every $\mathbf{A}_i \in \mathfrak{R}$ and, if for a certain partial structure $\mathbf{C}$ there exists a family of homomorphisms $g_i : \mathbf{A}_i \to \mathbf{C}$, then there exists a unique homomorphism $h : \mathbf{B} \to \mathbf{C}$ such that $h \circ h_i = g_i$.
Proposition 1. For any family of partial structures the coproduct of this family exists and is unique up to isomorphism.
Construction of coproducts of partial structures. Let $\mathfrak{R} = (\mathbf{A}_i)_{i \in I}$ be a family of partial structures of signature $(F, C, \Pi, n)$. We assume that there are no 0-ary functional symbols in $F$, i.e., all constants are included in $C$. For any $\mathbf{A}_i \in \mathfrak{R}$, let $\mathbf{A}_i^0$ denote its reduct to the signature $(F, \Pi, n)$. We first take the disjoint sum $\bigsqcup \mathfrak{R}^0$ of the family $\mathfrak{R}^0 = \{\mathbf{A}_i^0 : \mathbf{A}_i \in \mathfrak{R}\}$. We now take care of the appropriate identification of the existing constants and set
$$\theta_0 = \{((c^{\mathbf{A}_i}, i), (c^{\mathbf{A}_j}, j)) : \mathbf{A}_i, \mathbf{A}_j \in \mathfrak{R},\ c \in C,\ c^{\mathbf{A}_i}, c^{\mathbf{A}_j} \text{ exist}\}.$$
Moreover, let $\theta$ be the congruence relation on $\bigsqcup \mathfrak{R}^0$ generated by $\theta_0$. Finally, we set $\mathbf{B} = \bigsqcup \mathfrak{R}^0 / \theta$ and a family of appropriate homomorphisms $h_i : \mathbf{A}_i \to \mathbf{B}$ such that $h_i(a) = [(a, i)]_\theta$.
Proposition 2. The partial structure $\mathbf{B}$ constructed above is a coproduct of the family $\mathfrak{R}$.
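For finite families, the carrier of the coproduct can be computed by following the construction step by step. The sketch below is one possible way to do it, assuming a union-find structure for the congruence and the dictionary encoding used in the example data; relations are omitted, since they do not influence the congruence. The function name `coproduct` and the input format are hypothetical.

```python
def coproduct(family):
    """family: list of dicts with keys 'carrier', 'constants', 'operations'.

    Returns the class-representative function and the list of classes of
    the congruence generated by identifying equally named constants.
    """
    # Disjoint sum: every element a of the i-th structure becomes (a, i).
    parent = {(a, i): (a, i)
              for i, s in enumerate(family) for a in s["carrier"]}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx == ry:
            return False
        parent[rx] = ry
        return True

    # theta_0: identify the interpretations of the same constant symbol.
    first_seen = {}
    for i, s in enumerate(family):
        for c, a in s["constants"].items():
            if c in first_seen:
                union(first_seen[c], (a, i))
            else:
                first_seen[c] = (a, i)

    # Generate the congruence: congruent arguments force congruent results.
    entries = [(f, tuple((x, i) for x in args), (val, i))
               for i, s in enumerate(family)
               for f, table in s["operations"].items()
               for args, val in table.items()]
    changed = True
    while changed:
        changed = False
        seen = {}
        for f, args, val in entries:
            key = (f, tuple(find(x) for x in args))
            if key in seen:
                changed = union(seen[key], val) or changed
            else:
                seen[key] = val

    classes = {}
    for x in parent:
        classes.setdefault(find(x), set()).add(x)
    return find, list(classes.values())


# Two agents that share the constants a and b but name the join differently.
family = [
    {"carrier": {"a", "b", "p"}, "constants": {"a": "a", "b": "b"},
     "operations": {"join": {("a", "b"): "p"}}},
    {"carrier": {"a", "b", "q"}, "constants": {"a": "a", "b": "b"},
     "operations": {"join": {("a", "b"): "q"}}},
]
find, classes = coproduct(family)
```

In this example the two copies of `a` and of `b` are glued by $\theta_0$, and the closure step then forces `p` and `q` into one class, so the quotient has three elements.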
Example 6.
Fig. 22.2. Coproduct of partial lattices A, B, C
If each of the above homomorphisms is injective, then we call the coproduct the free sum of $\mathfrak{R}$, and the free sum is an extension of $\mathfrak{R}$. If there are no constants in the signature, then the disjoint sum of the family $\mathfrak{R}$ is a free sum of $\mathfrak{R}$. The coproduct is a generalized extension of $\mathfrak{R}$ if it preserves all the a priori chosen inequalities. We also consider coproducts under further constraints in the form of a set of atomic formulas. Look at Figure 22.2. Assume that $a, b, c, d, e$ are constants in the signature of the language. In the coproduct of $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, the elements $a \wedge b$, $a \wedge c$, $b \wedge c$ must be determined, because they have been determined in $\mathbf{A}$. It follows from the construction of the coproduct that all of $a \vee b$, $a \vee c$, $b \vee c$ must be determined and must be equal. Notice that this coproduct is not an extension of the given family of partial lattices, but it is a generalized extension of this family when it is assumed that $a \neq b \neq c \neq a$. And it is not a generalized extension of $\{\mathbf{A}, \mathbf{B}, \mathbf{C}\}$ if we assume that all the constants $a, b, c, d, e$ are pairwise distinct. We see from this example that, depending on the initial conditions, the coproduct of the given family of partial structures can be a generalized extension or not. If it is not a generalized extension, then it means that, while generating the congruence in the construction of the coproduct, we have to identify constants which are assumed to be different. In this situation we say that the given family of partial structures is inconsistent with the undertaken assumptions (inconsistent, for short). This inconsistency is closely related to logical inconsistency. The given family $\mathfrak{R}$ of partial structures is inconsistent with the undertaken assumptions $\Delta$ iff the infallible set of sentences for $\mathfrak{R}$ (definition in the next section) is inconsistent with $\Delta$. It is important to know what to do when inconsistency appears.
We consider the simplest approach: detect which condition causes problems and take a crisp decision to reject it. In the above example, by rejecting $d \neq e$ we obtain a consistent generalized extension. For applications it is worth assuming that the partial structures under consideration are finite, with finite sets of functional and relational symbols. In this situation we can check inconsistency in a finite, predictable time. Various methods of conflict resolution, dependent on the application problem, are possible. For example, facts that cause conflicts can be rejected, or one can use voting to eliminate facts causing conflicts. In general, the process of eliminating some constraint can require more advanced negotiations between agents.
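The crisp strategy described above, detecting the violated assumptions and rejecting them, can be sketched for the finite case as follows. The sketch assumes that the congruence of the coproduct has already been computed and is given as a map from constants to class labels; the class labels below are hypothetical and only mimic the situation of the example, where `d` and `e` end up glued.

```python
from itertools import combinations


def violated_inequalities(class_of, inequalities):
    """Return the assumed inequalities whose constants share one class."""
    return [(c, d) for c, d in inequalities if class_of[c] == class_of[d]]


def reject_conflicts(class_of, inequalities):
    """Crisp resolution: drop every violated assumption, keep the rest."""
    bad = set(violated_inequalities(class_of, inequalities))
    return [pair for pair in inequalities if pair not in bad]


# Hypothetical outcome of the coproduct construction: d and e are identified.
class_of = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 3}
# Assumption: all five constants are pairwise distinct.
assumptions = list(combinations("abcde", 2))

conflicts = violated_inequalities(class_of, assumptions)
remaining = reject_conflicts(class_of, assumptions)
```

Both functions run in time linear in the number of assumed inequalities, which matches the remark that inconsistency can be checked in finite time for finite structures. Voting or negotiation would replace `reject_conflicts` by a different selection rule.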
22.3 Possible and infallible sets of sentences
We present in this section the logical part of our approach. Let $\mathcal{L}$ be a first order language of signature $(F, \Pi, n)$. The set of all sentences of the language $\mathcal{L}$ is denoted by $Sent(\mathcal{L})$. Assume that $\mathbf{A}$ is a given partial structure in the signature of $\mathcal{L}$ (a partial structure of $\mathcal{L}$, for short).
Definition 7. A set of sentences $\Sigma \subseteq Sent(\mathcal{L})$ is possible for $\mathbf{A}$ iff there is a total structure $\mathbf{B} \in T(\mathbf{A})$ such that $\mathbf{B} \models \Sigma$. The set of sentences $P_{\mathbf{A}} = \bigcap \{Th(\mathbf{B}) : \mathbf{B} \in T(\mathbf{A})\}$ is called an infallible set of sentences for $\mathbf{A}$. We also say that $P_{\mathbf{A}}$ is the theory of the partial model $\mathbf{A}$.
Notice that a set of sentences is possible for a partial structure $\mathbf{A}$ if it is possible for a certain extension of $\mathbf{A}$. The infallible set of sentences for a partial structure $\mathbf{A}$ is also the intersection of the infallible sets for all its extensions. Notice here that an extension as used in non-monotonic logics corresponds to the theory of a total structure, whereas the infallible set for a partial structure corresponds to the intersection of non-monotonic extensions. The properties of possible and infallible sets of sentences are described and proved in [10-13]. If $\mathfrak{R}$ is a family of partial structures, then we define possibility and infallibility for $\mathfrak{R}$ analogously, and if $P_{\mathfrak{R}}$ denotes the set of sentences infallible for $\mathfrak{R}$, then we have the following:
1. $P_{\mathfrak{R}} = Cn(\bigcup \{P_{\mathbf{A}_i} : \mathbf{A}_i \in \mathfrak{R}\})$, where $Cn$ denotes the classical operator of first order consequences.
2. $P_{\mathfrak{R}}$ is logically consistent iff $T(\mathfrak{R})$ is nonempty.
Let $\mathbf{A}$ be a partial structure of a language $\mathcal{L}$ of signature $(F, \Pi, n)$. We extend $\mathcal{L}$ to $\mathcal{L}_A$ by adding a set of constants $C_A = \{c_a : a \in A\}$.
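Now, we describe all the information about $\mathbf{A}$ in $\mathcal{L}_A$. Let $\Sigma_{\mathbf{A}}$ be the sum of the following sets:
$$\Sigma_F = \{f(c_{a_1}, \dots, c_{a_{n(f)}}) = c_a : f \in F,\ (a_1, \dots, a_{n(f)}) \in \mathrm{dom} f^{\mathbf{A}},\ f^{\mathbf{A}}(a_1, \dots, a_{n(f)}) = a\}$$
$$\Sigma_\Pi = \{r(c_{a_1}, \dots, c_{a_{n(r)}}) : r \in \Pi,\ r^{\mathbf{A}}(a_1, \dots, a_{n(r)})\}$$
$$\Sigma_{C_A} = \{c_a \neq c_b : a, b \in A,\ a \neq b\}$$
Remark 3. When dealing with generalized extensions as in Remark 2, we do not assume that the homomorphism for the extension is injective. Then in place of $\Sigma_{\mathbf{A}}$ we may take a set $\Sigma_F \cup \Sigma_\Pi \cup \Sigma_C$, where $\Sigma_C$ is any subset of $\Sigma_{C_A}$. Then all the results for extensions easily transform to the generalized ones.
Definition 8. Let $\mathbf{A}$ be a partial structure of a language $\mathcal{L}$. Let $\mathcal{L}_A$ be the language as above. We say that a partial structure $\mathbf{A}'$ is an expansion of $\mathbf{A}$ to $\mathcal{L}_A$ iff $A' = A$, $f^{\mathbf{A}'} = f^{\mathbf{A}}$ and $r^{\mathbf{A}'} = r^{\mathbf{A}}$ for every $f \in F$ and $r \in \Pi$, and $c_a^{\mathbf{A}'} = a$.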
Proposition 3. For any partial structure $\mathbf{A}$ of $\mathcal{L}$, $P_{\mathbf{A}'} = Cn(\Sigma_{\mathbf{A}})$ and $P_{\mathbf{A}} = P_{\mathbf{A}'} \cap Sent(\mathcal{L})$.
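For a family $\mathfrak{R} = (\mathbf{A}_i)_{i \in I}$ of partial structures of a given language $\mathcal{L}$ we can take a language $\mathcal{L}_{\mathfrak{R}}$ extending $\mathcal{L}$ by a set of constants $C_{\mathfrak{R}} = \{c_{a_i} : a_i \in A_i, i \in I\}$. Here the set $\Sigma_{\mathfrak{R}} = \bigcup \Sigma_{\mathbf{A}_i}$ has properties analogous to those of $\Sigma_{\mathbf{A}}$.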
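For a finite partial structure the three components of $\Sigma_{\mathbf{A}}$ can be listed explicitly. The following sketch assumes the table-based encoding of operations and relations, writes the constant $c_a$ as the string `c_a`, and writes inequalities with `!=`; the function name `diagram` and the output format are hypothetical.

```python
from itertools import combinations


def constant_name(a):
    """Name of the new constant c_a added for the element a."""
    return f"c_{a}"


def diagram(carrier, operations, relations):
    """Return (Sigma_F, Sigma_Pi, Sigma_C) as lists of formula strings."""
    sigma_f = [
        f"{f}({', '.join(constant_name(x) for x in args)}) = {constant_name(val)}"
        for f, table in sorted(operations.items())
        for args, val in sorted(table.items())
    ]
    sigma_pi = [
        f"{r}({', '.join(constant_name(x) for x in args)})"
        for r, tuples in sorted(relations.items())
        for args in sorted(tuples)
    ]
    # Each unordered pair is listed once; the symmetric inequality is implied.
    sigma_c = [f"{constant_name(a)} != {constant_name(b)}"
               for a, b in combinations(sorted(carrier), 2)]
    return sigma_f, sigma_pi, sigma_c


sigma_f, sigma_pi, sigma_c = diagram(
    carrier={"a", "b", "m"},
    operations={"meet": {("a", "b"): "m"}},
    relations={"le": {("m", "a"), ("m", "b")}},
)
```

For generalized extensions in the sense of Remark 3, one would keep `sigma_f` and `sigma_pi` and pass on only a chosen subset of `sigma_c`.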
22.4 Partial structures and their extensions in multiagent systems
Let us consider a team of agents with a supervisor agent who collects information, deals with conflicts and distributes knowledge. Generally, the system that we discuss will be like a firm hierarchy. There is the chief director and $n$ departments with their chiefs, the departments $1, \dots, n$ are divided into $n_i$, $i = 1, \dots, n$, sections with their chiefs, and so on. There are information channels from sections to departments and from departments to the director. Agents that are immediately higher are able to receive information from their subordinate (child) agents, fuse the information, resolve conflicts, and change the knowledge of their children. There are also information channels between departments, between sections, and so on. These channels can be used only for exchanging information. It may be assumed that such a frame for agent systems is obligatory. For simplicity we do not assume any frame except the supervisor agent. However, agents can create subsystems of child agents and supervise them. They can also exchange information if they need to. The relationships between agents follow from the existence of a homomorphism. We represent collected information as a partial structure of a language $\mathcal{L}$ with signature $(F, \Pi, n)$. We assume that the environment of the problem one works on is described in a language of arbitrarily large signature. We can reach this world by observing finite fragments only. Hence the signature we use should be finite, but sufficient for the observed fragments, and whenever a new interrelation is discovered, the signature can be extended. At the beginning we perform the following steps:
• We give names to the observable objects, discover interrelations between objects, and decide which of them should be written as relations, partial operations or constants and, additionally, which names describe different objects.
• Depending on the application problem, we decide which interrelations should be preserved while extending.
• We represent our knowledge by means of a finite partial structure $\mathbf{A}$ of a language $\mathcal{L}$ with signature $(F, \Pi, n)$, together with information about the discernibility of some names. The names of observed objects are elements of the structure, either all of them or those that are important for some reason.
Having a partial structure $\mathbf{A}$ of a language $\mathcal{L}$ in signature $(F, \Pi, n)$, we extend the language to $\mathcal{L}_A$. Let $\mathbf{A}'$ denote the expansion of $\mathbf{A}$ to $\mathcal{L}_A$. The discernibility of some names is written as a subset $\Sigma \subseteq \Sigma_{C_A}$. We distribute the knowledge to $n$ agents $Ag_1, \dots, Ag_n$. The method of distribution depends on the problem we try to solve but, most generally, we select $n$ weak substructures $\mathbf{A}_1, \dots, \mathbf{A}_n$ from $\mathbf{A}$. Every agent $Ag_i$, $i = 1, \dots, n$, is provided with knowledge represented by $\mathbf{A}_i$ and, additionally, we provide him/her with a set of inequalities $\Sigma_i \subseteq \Sigma$. There are many possibilities to do that: we can take relative or closed substructures, or substructures covering $\mathbf{A}$ (or not), or even $n$ copies of $\mathbf{A}$. And the inequalities from $\Sigma$ could be distributed to the agents by various procedures. We describe a situation when the agents get $\Sigma_i \subseteq \Sigma$ such that there exists a set of constants $C_i \subseteq C_A$ such that $\Sigma_i = \{c_a \neq c_b : c_a, c_b \in C_i, a \neq b\}$. It means that all the constants from $C_i$ are pairwise unequal. As we will show below, this situation is easy to manage from a theoretical point of view. Thus, the knowledge of an agent $Ag_i$ is represented by a partial structure $\mathbf{A}_i$ such that $\mathbf{A}_i = (C_i, (f^{\mathbf{A}_i})_{f \in F}, (r^{\mathbf{A}_i})_{r \in \Pi})$ is a weak substructure of $\mathbf{A}'$ and, additionally, for every $c_a, c_b \in C_i$ with $a \neq b$ it holds that $c_a \neq c_b$. Notice that, from a logical point of view, $\Sigma_i \subseteq \Sigma_{\mathbf{A}_i} \subseteq \Sigma_{\mathbf{A}}$. Hence $\bigcup \Sigma_i \subseteq \bigcup \Sigma_{\mathbf{A}_i} \subseteq \Sigma_{\mathbf{A}}$ is a consistent set of sentences in $\mathcal{L}_A$.
The knowledge distribution process is based on properties of a finite number of constants. In such an algebraic approach we can take advantage of homomorphisms, congruences, quotient structures, coproducts, and so on. We have at our disposal a closely related logical approach, where we can use infallible sets of sentences, consistency, and so on. We are able to propose logical methods of knowledge distribution via the standard family of partial structures (see [13]), whereas the logic, as not effective, would be less useful in applications. We also consider multiagent systems where some further logical constraints (in the form of atomic formulas), controlling the extension process, are added. Hence let every agent $Ag_i$ possess a set $\Delta_i$ of atomic formulas.
Example 7. Let our information be written as a relational system $\mathbf{A} = (U, R)$ corresponding to an information system $\mathcal{A} = (U, A)$. Let $X \subseteq U$ be a set of objects. Take the language $\mathcal{L}_U$. Thus every object is now a constant.
We discern every constant of the lower approximation $\underline{A}X$ from every constant from the complement, while no assumption is made for objects in the boundary region of the concept. Now we distribute $\mathbf{A}$ to $n$ agents $Ag_1, \dots, Ag_n$, giving to every $Ag_i$, $i = 1, \dots, n$, a weak substructure (subtable) $\mathbf{A}_i = (U_i, R_{A_i})$. This distribution depends on an expert decision; it may be a covering of $\mathbf{A}$ or only a covering of a chosen (training) set of objects. Moreover, every agent $Ag_i$ has at his disposal the sets of objects $X_i = X \cap U_i$ and $U_i - X_i$, makes his own approximations of these sets and discerns constants. Additionally, every agent $Ag_i$ gets a set of descriptors (or templates) which is either derived from his own information or obtained from the system. Hence, every agent approximates a part of the concept described by $X$; he can get new objects, new attributes, new attribute values and a new set of descriptors. Now one can consider the fusion (the coproduct) of the information obtained from the agents.
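The approximations that each agent computes on his own subtable in Example 7 can be sketched as follows. The code assumes that an information system is stored as a dictionary from objects to rows, that a missing (null) attribute value is treated as an ordinary value `None`, and that the table, the concept `X` and the split into two agents are invented for illustration.

```python
def indiscernibility_classes(table, attributes):
    """Group the objects of {object: {attribute: value}} by their values."""
    classes = {}
    for obj, row in table.items():
        key = tuple(row.get(a) for a in attributes)
        classes.setdefault(key, set()).add(obj)
    return list(classes.values())


def approximations(table, attributes, concept):
    """Return the lower and upper approximation of `concept` in `table`."""
    lower, upper = set(), set()
    for cls in indiscernibility_classes(table, attributes):
        if cls <= concept:
            lower |= cls
        if cls & concept:
            upper |= cls
    return lower, upper


# Hypothetical information system with two attributes.
table = {
    1: {"colour": "red", "size": "big"},
    2: {"colour": "red", "size": "big"},
    3: {"colour": "blue", "size": "small"},
    4: {"colour": "blue", "size": "small"},
    5: {"colour": "green", "size": "big"},
    6: {"colour": "green", "size": "small"},
}
attributes = ["colour", "size"]
X = {1, 2, 3}

lower, upper = approximations(table, attributes, X)

# Each agent receives a subtable U_i and approximates X_i = X ∩ U_i.
agent_universes = [{1, 3, 4}, {2, 5, 6}]
agent_results = []
for universe in agent_universes:
    subtable = {obj: table[obj] for obj in universe}
    agent_results.append(approximations(subtable, attributes, X & universe))
```

In the full table, objects 3 and 4 form the boundary region. The first agent sees the same boundary, while the second agent's part of the concept is exact on his subtable.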
22.4.1 Agents acting
Now we are at the starting point, i.e., at the moment $t = 0$. We have the fixed language $\mathcal{L}_A$. The local state of every agent $Ag_i$ is represented by a partial structure $\mathbf{A}_i$ equipped with a set of inequalities between constants from $C_i$ and also with a set of atomic formulas $\Delta_i$. We make the following assumptions.
• Every agent knows the signature of the language. One can also consider a situation where every agent knows all the sets $C_i$.
• Every agent is able to exchange information with others. Then he performs fusion and resolves his internal conflicts using the construction of the coproduct.
• Every agent can build his own system of child agents in the same way as the whole system.
• At every moment of time there is a supervising agent with his own constraints (in the form of atomic formulas) who collects all the information and deals with inconsistency using the coproduct.
Now the knowledge is distributed, and we show how the agents act. Every agent possesses his knowledge and is able to collect new information independently. An agent can obtain new information by exploring the world himself, through the actions of his child agents, or by exchanging information with other agents. New information takes the form of new objects, new operation and relation symbols, new determinations of constants, extensions of the domains of some relations and operations, and new constraints that can be derived or added. Assume that at some time our system is in a state $s$. This means that the agents have collected information and written it as partial structures $A_1^s, \ldots, A_n^s$, which are consistent generalized extensions of $A_1, \ldots, A_n$, respectively. The main tool for fusion is the notion of coproduct, which plays the supervising role in the system. Additionally, a set $\Delta_s$ of atomic formulas is given. We construct a coproduct $S$ of the family of partial structures $A, A_1^s, \ldots, A_n^s$. If $S$ is a generalized extension of $A_1, \ldots, A_n$ and $\Delta_s$ holds in $S$, then the system is consistent and knowledge can be redistributed; otherwise we should resolve conflicts. Now $S$ plays the role of $A$, i.e., we take a new language $\mathcal{L}_S$ and the expansion of $S$ to this language. The most general way of redistributing knowledge is to repeat the process. It is often necessary to redistribute the knowledge in accordance with, for example, either the initial information (i.e., preserving $A_i$ for every agent $Ag_i$) or the information actually sent. Notice that it is not necessary to stop the system for synthesis; agents can work during this process. In this situation every agent should synthesize his current results with the redistributed knowledge (i.e., construct a coproduct of the two) and dispose of any resulting inconsistency on his own.
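The fusion step described above can be illustrated with a small sketch. It is not the coproduct construction of the chapter: it assumes a heavily simplified setting in which each agent holds only a partial table of unary functions over named constants plus a set of constants it holds to be pairwise unequal, and it replaces the generated congruence by a union-find structure. All names (`UnionFind`, `fuse`, the `table` and `distinct` keys) are hypothetical.

```python
from itertools import combinations


class UnionFind:
    """Tracks which constants have been identified with each other."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # shorten the path
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[rx] = ry


def fuse(agents):
    """Fuse the agents' partial function tables and report conflicts.

    agents: list of dicts with the (assumed) keys
      "table":    {(function_name, argument): value}
      "distinct": set of constants the agent holds to be pairwise unequal
    Returns (fused_table, conflicts); conflicts are pairs of constants that
    some agent declared unequal but that the fusion had to identify.
    """
    uf = UnionFind()
    changed = True
    while changed:  # repeat until no further identification is forced
        changed = False
        fused = {}
        for agent in agents:
            for (f, arg), val in agent["table"].items():
                key, val = (f, uf.find(arg)), uf.find(val)
                if key in fused and uf.find(fused[key]) != val:
                    uf.union(fused[key], val)  # f(arg) got two values
                    changed = True
                else:
                    fused[key] = val
    conflicts = sorted({
        (a, b)
        for agent in agents
        for a, b in combinations(sorted(agent["distinct"]), 2)
        if uf.find(a) == uf.find(b)
    })
    return fused, conflicts


agents = [
    {"table": {("f", "a"): "b"}, "distinct": {"a", "b"}},
    {"table": {("f", "a"): "c"}, "distinct": {"a", "c"}},
    {"table": {}, "distinct": {"b", "c"}},
]
fused, conflicts = fuse(agents)
print(fused)      # f(a) has a single value after fusion
print(conflicts)  # constants identified against some agent's inequalities
```

An empty `conflicts` list plays the role of "the system is consistent" in this toy setting; a non-empty list marks the determinations that a supervisor would have to examine.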
22.4.2 Dealing with inconsistency
From a logical point of view, inconsistency may appear when the set of sentences $\bigcup \Sigma_{A_i^s}$ is either not possible for the family $A_1, \ldots, A_n$ or is possible for this family
22 Extensions of Partial Structures ...
but is inconsistent with $\Delta_s$; that is, it is logically inconsistent. Algebraic inconsistency occurs when the coproduct is inconsistent with the given constraints. There are two kinds of inconsistency: (i) an "internal" one, when an agent, say $Ag_i$, needs to identify constants that were different in $A_i$ as a consequence of extending his knowledge; (ii) an "external" one, when the knowledge of every agent is internally consistent, but there are conflicts in the whole system. Notice that a decision to remove some determinations of constants and operations is not irreversible, since the agent can resend the same information. If this happens "often", it is a signal that something is wrong and that we have to correct our initial knowledge. We remove inconsistency while exchanging information. If agent $Ag_j$ sends some information to $Ag_i$, then $Ag_i$ should resolve conflicts using a coproduct as above. $Ag_i$ can recognize the need to exchange information when he gets a constant and knows that $Ag_j$ has some information about this constant. One can use the following schema for dealing with inconsistency. The first identification of constants during congruence generation in the construction of the coproduct is a signal that inconsistency may appear. By following the course of this process, the supervisor detects the cause of the inconsistency and orders the agents concerned to remove the given determinations. We do not assume that the process should stop for this control. If agents work during this time, they fuse their current knowledge with the redistributed knowledge and resolve conflicts under the assumption that the knowledge from the supervisor is more important. The proposed system may work permanently, stopping on a human decision or when some conditions are satisfied, e.g., time conditions or restrictions on the system size.
22.5 Conclusions
We have presented a way of modelling multiagent systems via partial algebraic methods. We propose the coproduct operator to fuse knowledge and resolve conflicts under some logical-algebraic constraints. Further constraints may also be considered, for example on the number of agents and/or on the system size.
References
1. Bartol, W., 'Introduction to the Theory of Partial Algebras', Lectures on Algebras, Equations and Partiality, Univ. of Balearic Islands, Technical Report B-006, (1992), pp. 36-71.
2. Burmeister, P., 'A Model Theoretic Oriented Approach to Partial Algebras', Mathematical Research 32, Akademie-Verlag, Berlin (1986).
3. Burris, S., Sankappanavar, H.P., 'A Course in Universal Algebra', Springer-Verlag, Berlin (1981).
4. Fagin, R., Halpern, J., Moses, Y., Vardi, M.Y., 'Reasoning About Knowledge', MIT Press, Cambridge MA (1995).
5. Gabbay, D.M., Hogger, C.J., Robinson, J.A., 'Handbook of Logic in Artificial Intelligence and Logic Programming 3: Nonmonotonic Reasoning and Uncertain Reasoning', Oxford University Press, Oxford (1994).
6. d'Inverno, M., Luck, M., 'Understanding Agent Systems', Springer-Verlag, Heidelberg (2004).
7. Pal, S.K., Polkowski, L., Skowron, A. (Eds.), 'Rough-Neural Computing: Techniques for Computing with Words', Springer-Verlag, Berlin (2004).
8. Pawlak, Z., 'Rough Sets. Theoretical Aspects of Reasoning about Data', Kluwer Academic Publishers, Dordrecht (1991).
9. Shoenfield, J.R., 'Mathematical Logic', Addison-Wesley Publishing Company, New York (1967).
10. Staruch, B., 'Derivation from Partial Knowledge in Partial Models', Bulletin of the Section of Logic 32, (2002), pp. 75-84.
11. Staruch, B., Staruch, B., 'Possible sets of equations', Bulletin of the Section of Logic 32, (2002), pp. 85-95.
12. Staruch, B., Staruch, B., 'Partial Algebras in Logic', submitted to Logika, Acta Universitatis Vratislaviensis, (2002).
13. Staruch, B., Staruch, B., 'First order theories for partial model', accepted for publication in Studia Logica, (2003).
23 Tolerance Information Granules
Jaroslaw Stepaniuk
Department of Computer Science, Bialystok University of Technology, Wiejska 45a, 15-351 Bialystok, Poland
[email protected]
Summary. In this paper we discuss tolerance information granule systems. We present examples of information granules and consider two kinds of basic relations between them, namely inclusion and closeness. The relations between more complex information granules can be defined by extending the relations defined on parts of information granules. In many application areas related to knowledge discovery in databases there is a need for algorithmic methods that make it possible to discover relevant information granules. Examples of SQL implementations of the discussed algorithms are included.
23.1 Introduction
Recent years have shown a rapid growth of interest in granular computing. Information granules are collections of entities that are arranged together due to their similarity, functional adjacency or indiscernibility [13], [14]. The process of forming information granules is referred to as information granulation. Granular computing, as opposed to numeric computing, is knowledge-oriented. Knowledge-based processing is a cornerstone of knowledge discovery and data mining [3]. A way of constructing information granules and describing them is a common problem no matter which path (fuzzy sets, rough sets, ...) we follow. In this paper we follow the rough set approach [7] to constructing information granules. Different kinds of information granules will be discussed in the following sections of this paper. The paper is organized as follows. In Section 23.2 we recall selected notions of the tolerance rough set model. In Section 23.3 we discuss information granule systems. In Section 23.4 we present examples of information granules. In Section 23.5 we discuss searching for optimal tolerance granules.
23.2 Selected Notions of Tolerance Rough Sets
In this section we recall selected notions of the tolerance rough set model [8], [9], [11], [12].
We recall a general definition of an approximation space [9], [11], [12], which can be used, for example, for introducing the tolerance-based rough set model and the variable precision rough set model. For every non-empty set $U$, let $P(U)$ denote the set of all subsets of $U$.
Definition 1. A parameterized approximation space is a system $AS_{\#,\$} = (U, I_{\#}, \nu_{\$})$, where
• $U$ is a non-empty set of objects,
• $I_{\#} : U \to P(U)$ is a granulation function,
• $\nu_{\$} : P(U) \times P(U) \to [0, 1]$ is a rough inclusion function.
The granulation function defines for every object $x$ a set of similarly described objects. A constructive definition of the granulation function can be based on the assumption that some metrics (distances) are given on attribute values. For example, if for some attribute $a \in A$ a metric $\delta_a : V_a \times V_a \to [0, \infty)$ is given, where $V_a$ is the set of all values of attribute $a$, then one can define the following granulation function:
$y \in I_a^{f_a}(x)$ if and only if $\delta_a(a(x), a(y)) \le f_a(a(x), a(y))$,
where $f_a : V_a \times V_a \to [0, \infty)$ is a given threshold function. A set $X \subseteq U$ is definable in $AS_{\#,\$}$ if and only if it is a union of some values of the granulation function. The rough inclusion function defines the degree of inclusion between two subsets of $U$ [9].
This measure is widely used by the data mining and rough set communities. However, Jan Lukasiewicz [5] was the first to use this idea, to estimate the probability of implications. The lower and the upper approximations of subsets of $U$ are defined as follows.
Definition 2. For an approximation space $AS_{\#,\$} = (U, I_{\#}, \nu_{\$})$ and any subset $X \subseteq U$, the lower and the upper approximations are defined by
$LOW(AS_{\#,\$}, X) = \{x \in U : \nu_{\$}(I_{\#}(x), X) = 1\}$,
$UPP(AS_{\#,\$}, X) = \{x \in U : \nu_{\$}(I_{\#}(x), X) > 0\}$,
respectively.
Approximations of concepts (sets) are constructed on the basis of background knowledge. Obviously, concepts are also related to objects not seen so far. Hence it is very useful to define parameterized approximations with parameters tuned in the process of searching for approximations of concepts. This idea is crucial for the construction of concept approximations using rough set methods. In our notation, $\#$ and $\$$ denote vectors of parameters which can be tuned in the process of concept approximation. Approximation spaces are illustrated in Figure 23.1.
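The definitions above can be tried out on a toy example. The sketch below assumes a single numeric attribute, a constant threshold function, and the standard rough inclusion $|X \cap Y| / |X|$ (taken to be 1 for empty $X$); these choices, and all function names, are illustrative assumptions rather than the chapter's own implementation.

```python
def granule(x, universe, value, delta, threshold):
    """I(x): objects y with delta(a(x), a(y)) <= f(a(x), a(y))."""
    return {
        y for y in universe
        if delta(value[x], value[y]) <= threshold(value[x], value[y])
    }


def rough_inclusion(X, Y):
    """Standard rough inclusion: |X & Y| / |X|, and 1 for empty X."""
    return len(X & Y) / len(X) if X else 1.0


def lower(universe, I, X):
    """LOW: objects whose granule is fully included in X."""
    return {x for x in universe if rough_inclusion(I(x), X) == 1}


def upper(universe, I, X):
    """UPP: objects whose granule overlaps X."""
    return {x for x in universe if rough_inclusion(I(x), X) > 0}


# Toy data: five objects described by one integer-valued attribute.
universe = {1, 2, 3, 4, 5}
value = {1: 10, 2: 15, 3: 30, 4: 32, 5: 60}


def I(x):
    # Absolute difference as the metric, constant threshold 5.
    return granule(x, universe, value,
                   delta=lambda u, v: abs(u - v),
                   threshold=lambda u, v: 5)


X = {1, 2, 3}
print(sorted(lower(universe, I, X)))  # objects certainly in X
print(sorted(upper(universe, I, X)))  # objects possibly in X
```

Changing the threshold changes the granules and hence both approximations, which is the sense in which the parameters can be tuned during the search for a concept approximation.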
Fig. 23.1. Approximation Spaces with Two Vectors #1 and #2 of Parameters
Rough sets can approximately describe sets of patients, events, outcomes, keywords, etc. that may otherwise be difficult to circumscribe. We recall the notion of the positive region of a classification in the case of generalized approximation spaces [12].
Definition 3. Let $AS_{\#,\$} = (U, I_{\#}, \nu_{\$})$ be an approximation space and let, for a natural number $r > 1$, a set $\{X_1, \ldots, X_r\}$ be a classification of objects (i.e., $X_1, \ldots, X_r \subseteq U$, $\bigcup_{i=1}^{r} X_i = U$ and $X_i \cap X_j = \emptyset$ for $i \neq j$, where $i, j = 1, \ldots, r$). The positive region of the classification $\{X_1, \ldots, X_r\}$ with respect to the approximation space $AS_{\#,\$}$ is defined by
$POS(AS_{\#,\$}, \{X_1, \ldots, X_r\}) = \bigcup_{i=1}^{r} LOW(AS_{\#,\$}, X_i)$.
Let $DT = (U, A \cup \{d\})$ be a decision table [7], where $U$ is a set of objects, $A$ is a set of condition attributes and $d$ is a decision. For every condition attribute $a \in A$ a distance function $\delta_a : V_a \times V_a \to [0, \infty)$ is known, which is defined as follows:
for numeric attributes, $\delta_a(a(x_i), a(x_j)) = |a(x_i) - a(x_j)|$;
for symbolic attributes, $\delta_a(a(x_i), a(x_j)) = 0$ if $a(x_i) = a(x_j)$, and $1$ otherwise.
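As a rough illustration of the positive region over a decision table, the following sketch uses the two distance functions given above. It assumes that an object's granule consists of the objects lying within a per-attribute threshold on every condition attribute (a conjunction) and that the lower approximation is taken with the standard rough inclusion, so that it reduces to set inclusion. The table contents, the `eps` thresholds and the function names are hypothetical.

```python
def distance(u, v):
    """|u - v| for numeric values, 0/1 mismatch for symbolic ones."""
    if isinstance(u, (int, float)) and isinstance(v, (int, float)):
        return abs(u - v)
    return 0 if u == v else 1


def granule(x, table, attrs, eps):
    """Objects within eps[a] of x on every condition attribute a."""
    return {
        y for y in table
        if all(distance(table[x][a], table[y][a]) <= eps[a] for a in attrs)
    }


def positive_region(table, attrs, decision, eps):
    """Union of the lower approximations of all decision classes."""
    classes = {}
    for x, row in table.items():
        classes.setdefault(row[decision], set()).add(x)
    return {
        x for x in table
        if any(granule(x, table, attrs, eps) <= X for X in classes.values())
    }


# Hypothetical decision table; temperature is stored in tenths of a degree.
table = {
    "p1": {"temp": 366, "cough": "no", "d": "healthy"},
    "p2": {"temp": 369, "cough": "no", "d": "healthy"},
    "p3": {"temp": 385, "cough": "yes", "d": "ill"},
    "p4": {"temp": 370, "cough": "no", "d": "ill"},
}
eps = {"temp": 5, "cough": 0}
pos = positive_region(table, ["temp", "cough"], "d", eps)
print(sorted(pos))  # objects classified with certainty
```

Here only `p3` falls in the positive region, because the other three objects are mutually similar under the chosen thresholds but carry different decisions.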
Fig. 31.5. Sample acoustic map - noise at the area of Gdansk University of Technology
31.3.2 Measurement Results Data Visualization
The user first specifies the region in which the measurement point of interest is located. For the selected region a list of cities with measurement devices is displayed. After selecting a city, the user sees a map of the selected area with the measurement points marked. The final selection concerns a specific measurement point. It can be selected by clicking a box on the map or by selecting a measurement point from the list. For each measurement point one can specify a time range for which specific parameters will be presented. An example of a selected measurement point can be observed in Fig. 31.6. After selecting a measurement point and specifying the required time range, one can display the results in graphic or table form. The measurement card for a given point contains a table of the available noise parameters and a chart presenting the results in graphic form. Fig. 31.7 presents an example of a page containing the results of measurements. By clicking a selected parameter in the table one can add it to or remove it from the chart. To simplify viewing the results for other points, appropriate links have been added, so one can select another measurement point in the same city or specify a new location.
31 Intelligent System for Environmental Noise Monitoring
31.3.3 Visualizing Survey Results
The Web service offers access to the survey to every interested user. The survey enables users to express their own subjective opinion about the acoustic climate in their place of residence. Subjective research is a valuable addition to objective measurements, as it allows collecting information about noise annoyance directly from the inhabitants of an area. Survey results are automatically processed by the system. A number of methods for presenting the results have been prepared. The results may be charted on the map of regions of the country, for a given city in the form of pie charts, or in the form of collective pie charts for the whole country. The user may select an appropriate presentation method.
d), "default.xsl" is the name of the XSL file. Using XML Schemas, an XML document can be represented by means of the DOM (Document Object Model). The XML file is described as a tree made of nodes. These nodes are objects with various methods and properties.
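A short sketch may help to make the DOM view concrete. It uses Python's standard `xml.dom.minidom` module rather than the .Net classes the chapter has in mind, and the sample document (element names, attribute names, values) is invented for illustration; only the `default.xsl` stylesheet name comes from the text.

```python
from xml.dom import minidom

# Hypothetical document; only the stylesheet name is taken from the text.
DOCUMENT = """<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="default.xsl"?>
<patients>
  <patient id="p1"><pulse>72</pulse><temperature>36.6</temperature></patient>
  <patient id="p2"><pulse>95</pulse><temperature>38.1</temperature></patient>
</patients>
"""

dom = minidom.parseString(DOCUMENT)

# The stylesheet reference is itself a node of the tree
# (a processing instruction placed before the root element).
stylesheets = [
    node.data for node in dom.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
    and node.target == "xml-stylesheet"
]

# Element nodes are objects with methods (getAttribute) and
# properties (tagName, childNodes, firstChild).
readings = {
    patient.getAttribute("id"): {
        child.tagName: child.firstChild.data
        for child in patient.childNodes
        if child.nodeType == child.ELEMENT_NODE
    }
    for patient in dom.getElementsByTagName("patient")
}

print(stylesheets)
print(readings)
```

The same tree could be walked recursively from `dom.documentElement`; the flat traversal above is enough for a two-level document.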
Beata Zielosko and Andrzej Dyszkiewicz
48.4 Multi modular medical system of patients' diagnosis
As an example, I would like to present the possibility of adapting elements of rough sets in .Net technology to create a medical system of patients' diagnosis. The use of rough sets helps in solving problems connected with the classification of patients [9].
Fig. 48.1. Multi modular medical system of patients' diagnosis
The proposed solution is multi-channel acquisition of measurement data synchronized to a common time base. The data will be measured by a four-channel photoplethysmograph coupled with a four-channel spirometer and a four-channel thermometer [5]. An important issue in this system is the synchronous collection of data from the sensors. On the basis of this information and the disease symptoms, the system can help a doctor in deciding on a patient's diagnosis. Knowledge about correlations between the data obtained from the research could be an indirect result of the work of this system. The system will assess the reactions of the human body and emotional conditions. Those conditions will be registered as a change of breath frequency, which modifies the pulse and, as a consequence, influences the temperature of human body organs. Because data are collected synchronously from the individual modules of the system, it could be used, e.g., in intensive care, in monitoring a patient's condition away from the hospital, and in telemedicine. In further research this system will be used as one of the elements of a system for diagnosing patients with scoliosis. On the basis of the expert's knowledge and the analysis of the results obtained from research, a decision table is constructed and sent as an XML file. This file is a parameter of a function implemented in a Web Service. The result of such a function can be processed further by other Web Services. For example, a function of one Web Service returns data in the form of abstraction classes. The abstraction classes are passed as parameters to the next Web Service, which generates, e.g., a core. Further Web Services with other functions implemented allow, e.g., decision rules to be generated. The business rules of the application layer placed on the server include the algorithms for processing data. This ensures that we don't have to modify other functions implemented
48 Intelligent Medical Systems on Internet Technologies Platform
in other Web Services in case we want to change one of those algorithms. Features such as scalability, modular construction, distributed structure, device independence, built-in data exchange standards, and operating system independence make it possible to combine this technology with rough sets. This solution could give interesting results in the form of new services available in a global network.
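The chain of services described above (decision table, then abstraction classes, then a core) can be sketched with plain functions standing in for the individual Web Services. The sketch assumes the usual rough set notions: abstraction classes are groups of objects with equal values on the chosen attributes, and an attribute belongs to the core when removing it shrinks the positive region. The table contents and all function names are hypothetical, and no Web Service or XML transport is modelled.

```python
def abstraction_classes(table, attrs):
    """Group objects that agree on every attribute in attrs."""
    classes = {}
    for obj, row in table.items():
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(obj)
    return list(classes.values())


def positive_region(table, attrs, decision):
    """Objects whose abstraction class carries a single decision value."""
    pos = set()
    for cls in abstraction_classes(table, attrs):
        if len({table[obj][decision] for obj in cls}) == 1:
            pos |= cls
    return pos


def core(table, attrs, decision):
    """Attributes whose removal changes the positive region."""
    full = positive_region(table, attrs, decision)
    return [
        a for a in attrs
        if positive_region(table, [b for b in attrs if b != a], decision) != full
    ]


# Hypothetical decision table built from synchronized measurements.
table = {
    "p1": {"pulse": "high", "temp": "normal", "d": "stress"},
    "p2": {"pulse": "low", "temp": "normal", "d": "calm"},
    "p3": {"pulse": "high", "temp": "high", "d": "stress"},
}
classes = abstraction_classes(table, ["pulse", "temp"])
print(classes)
print(core(table, ["pulse", "temp"], "d"))
```

Because each step only consumes the output of the previous one, any single function can be replaced by a different algorithm without touching the others, which is the property the text attributes to the service-based design.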
48.5 Summary
The era of the static WWW (World Wide Web) is ending. The Internet is now a platform for ever newer services and standards such as XML or scripting languages. Electronic shops, banks, institutions and schools are more and more popular. The .Net platform could be an efficient environment for the exchange of data and services between applications working in individual medical units. The above examples show that implementing intelligent data processing techniques as Web Services on the .Net platform opens new possibilities in designing distributed decision support systems and autonomic computing.
Fig. 48.2. Diagnosis support system based on Web Services and XML standard
References
1. Brodziak A (1974) Formalizacja naturalnego wnioskowania diagnostycznego. Psychonika-teoria struktur i procesów informatycznych centralnego systemu nerwowego człowieka i jej wykorzystanie w informatyce. PAN Warszawa
2. Doroszewski J (1990) Komputerowe wspomaganie diagnostyki medycznej. Nałęcz M, Problemy Biocybernetyki i Inżynierii Biomedycznej. WKL Warszawa
3. Dunway R (2003) Visual studio .NET. Mikom, Warszawa
4. Dyszkiewicz A, Wróbel Z (2001) Elektromechaniczne procedury diagnostyki i terapii w rehabilitacji. Problemy Biocybernetyki i Inżynierii Biomedycznej pod redakcją Macieja Nałęcza, Warszawa
5. Dyszkiewicz A, Zielosko B, Wakulicz-Deja A, Wróbel Z (2004) Jednoczesna akwizycja wielopoziomowo sprzężonych parametrów organizmu człowieka krokiem do wyższej swoistości wnioskowania diagnostycznego. MPM Krynica
6. Esposito D (2002) Building Web Solutions with ASP .NET and ADO .NET. MS Press, Redmond
7. Komorowski J, Pawlak Z, Polkowski L, Skowron A Rough Sets: A Tutorial
8. Mackenzie D, Sharkey K (2002) Visual Basic .Net dla każdego. Helion, Gliwice
9. Pawlak Z (1991) Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer, Dordrecht
10. Panowicz L (2004) Software 2.0: Darmowa platforma .Net 1: 18-26
11. Young J. Michael (2000) XML krok po kroku. Wydawnictwo RM, Warszawa
Author Index
Bandyopadhyay, Sanghamitra, 439
Bazan, Jan, 191
Bell, David, 227
Bieniawski, Stefan, 31
Blazejewski, Lech, 527
Burkhard, Hans-Dieter, 347
Burns, Tom R., 363
Castro Caldas, Jose, 363
Cetnarowicz, Krzysztof, 579
Chen, Long, 455
Chilov, Nikolai, 385
Czyzewski, Andrzej, 397
Dardzińska, Agnieszka, 133
Dobrowolski, Grzegorz, 551
Doherty, Patrick, 479
Dunin-Kęplicz, Barbara, 69
Duntsch, Ivo, 179
Dyszkiewicz, Andrzej, 589
El Fallah-Seghrouchni, Amal, 53
Farinelli, Alessandro, 467
Fioravanti, Fabio, 99
Gediga, Gunther, 179
Glowinski, Cezary, 493
Gomolinska, Anna, 203
Gorodetsky, Vladimir, 411
Grabowski, Adam, 215
Guo, Gongde, 179, 227
Heintz, Fredrik, 479
Iocchi, Luca, 467
Johnson, Rodney W., 85
Karsaev, Oleg, 411
Kazmierczak, Piotr, 539
Kisiel-Dorohinicki, Marek, 563
Kostek, Bozena, 397
Kozlak, Jaroslaw, 571
Krizhanovsky, Andrew, 385
Latkowski, Rafal, 493
Levashova, Tatiana, 385
Liao, Zhining, 227
Luks, Krzysztof, 519
Marszal-Paszek, Barbara, 339
Melich, Michael E., 85
Michalewicz, Zbigniew, 85
Mitra, Pabitra, 439
Moshkov, Mikhail Ju., 239
Nakanishi, Hideyuki, 423
Nardi, Daniele, 467
Nawarecki, Edward, 551, 579
Nguyen Hung Son, 249
Nguyen Sinh Hoa, 249
Nowak, Agnieszka, 333
Pal, Sankar K., 439
Pashkin, Michael, 385
Paszek, Piotr, 339
Patrizi, Fabio, 467
Pawlak, Zdzislaw, 3
Peters, James F., 13
Pettorossi, Alberto, 99
Polkowski, Lech, 117, 509
Proietti, Maurizio, 99
Raś, Zbigniew W., 133, 261
Rauch, Ewa, 501
Ray, Shubhra Sankar, 439
Rojek, Gabriel, 579
Roszkowska, Ewa, 363
Ryjov, Alexander, 147
Samoilov, Vladimir, 411
Schmidt, Martin, 85
Sergot, Marek, 161
Simiński, Roman, 273
Skarzynski, Henryk, 397
Skowron, Andrzej, 191
Smirnov, Alexander, 385
Staruch, Bozena, 293
Stepaniuk, Jaroslaw, 305
Szczuka, Marcin, 281
Szmigielski, Adam, 509
Ślęzak, Dominik, 281
Tzacheva, Angelina A., 261
Verbrugge, Rineke, 69
Wakulicz-Deja, Alicja, 273, 333
Wang, Guoyin, 455
Wang, Hui, 179, 227
Wei, Ling, 317
Wolpert, David H., 31
Wróblewski, Jakub, 281
Wu, Yu, 455
Zhang, Wenxiu, 317
Zielosko, Beata, 589
Printing and Binding: Strauss GmbH, Morlenbach