KI 2005: Advances in Artificial Intelligence: 28th Annual German Conference on AI, KI 2005, Koblenz, Germany, September 11-14, 2005, Proceedings (Lecture Notes in Computer Science, 3698) 9783540287612, 3540287612


English Pages 432 [420] Year 2005




Lecture Notes in Artificial Intelligence Edited by J. G. Carbonell and J. Siekmann

Subseries of Lecture Notes in Computer Science

3698

Ulrich Furbach (Ed.)

KI 2005: Advances in Artificial Intelligence 28th Annual German Conference on AI, KI 2005 Koblenz, Germany, September 11-14, 2005 Proceedings


Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany

Volume Editor
Ulrich Furbach
Universität Koblenz-Landau, Fachbereich Informatik
Universitätsstrasse 1, 56070 Koblenz, Germany
E-mail: [email protected]

Library of Congress Control Number: 2005931931

CR Subject Classification (1998): I.2
ISSN 0302-9743
ISBN-10 3-540-28761-2 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-28761-2 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springeronline.com © Springer-Verlag Berlin Heidelberg 2005 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 11551263 06/3142 543210

Preface

This volume contains the research papers presented at KI 2005, the 28th German Conference on Artificial Intelligence, held September 11–14, 2005 in Koblenz, Germany. KI 2005 was part of the International Conference Summer Koblenz 2005, which included conferences covering a broad spectrum of topics that are all related to AI: tableau-based reasoning methods (TABLEAUX), multi-agent systems (MATES), automated reasoning and knowledge representation (FTP), and software engineering and formal methods (SEFM).

The Program Committee received 113 submissions from 22 countries. Each paper was reviewed by three referees; after an intensive discussion about the borderline papers during the online meeting of the Program Committee, 29 papers were accepted for publication in this proceedings volume.

The program included three outstanding keynote talks: Ian Horrocks (University of Manchester, UK), Luc Steels (University of Brussels and Sony) and Sebastian Thrun (Stanford University), who covered topics like logical foundations, cognitive abilities of multi-agent systems and the DARPA Grand Challenge.

KI 2005 also included two excellent tutorials: Techniques in Evolutionary Robotics and Neurodynamics (Frank Pasemann, Martin Hülse, Steffen Wischmann and Keyan Zahedi) and Connectionist Knowledge Representation and Reasoning (Barbara Hammer and Pascal Hitzler). Many thanks to the tutorial presenters and the tutorial chair Joachim Hertzberg. Peter Baumgartner, in his role as workshop chair, collected 11 workshops from all areas of AI research, which also included a meeting of the German Priority Program on Kooperierende Teams mobiler Roboter in dynamischen Umgebungen.

I want to sincerely thank all the authors who submitted their work for consideration and the Program Committee members and the additional referees for their great effort and professional work in the review and selection process. Their names are listed on the following pages. Organizing a conference like KI 2005 is not possible without the support of many individuals. I am truly grateful to the local organizers for their hard work and help in making the conference a success: Gerd Beuster, Sibille Burkhardt, Ruth Götten, Vladimir Klebanov, Thomas Kleemann, Jan Murray, Ute Riechert, Oliver Obst, Alex Sinner, and Christoph Wernhard.

September 2005

Ulrich Furbach

Organization

Program and Conference Chair Ulrich Furbach

University of Koblenz-Landau, Germany

Program Committee
Slim Abdennadher – The German University in Cairo
Elisabeth André – University of Augsburg
Franz Baader – University of Dresden
Peter Baumgartner – Max-Planck-Institute for Informatics (Workshop Chair)
Michael Beetz – Technical University of Munich
Susanne Biundo – University of Ulm
Gerhard Brewka – University of Leipzig
Hans-Dieter Burkhard – Humboldt University Berlin
Jürgen Dix – University of Clausthal
Christian Freksa – University of Bremen
Ulrich Furbach – University of Koblenz-Landau (General Chair)
Enrico Giunchiglia – Genova University
Günther Görz – Friedrich-Alexander University of Erlangen-Nürnberg
Andreas Günther – University of Hamburg
Barbara Hammer – University of Clausthal
Joachim Hertzberg – University of Osnabrück (Tutorial Chair)
Otthein Herzog – TZI University of Bremen
Steffen Hölldobler – University of Dresden
Werner Horn – University of Vienna
Stefan Kirn – University of Hohenheim
Gerhard Kraetzschmar – University of Applied Sciences, Bonn-Rhein-Sieg
Rudolf Kruse – Otto-von-Guericke-University of Magdeburg
Nicholas Kushmerick – University College Dublin
Gerhard Lakemeyer – University of Aachen
Hans-Hellmut Nagel – University of Karlsruhe
Bernhard Nebel – University of Freiburg
Bernd Neumann – University of Hamburg
Ilkka Niemelä – Helsinki University of Technology
Frank Puppe – University of Würzburg


Martin Riedmiller – University of Osnabrück
Ulrike Sattler – University of Manchester
Amal El Fallah Seghrouchni – University of Paris 6
Jörg Siekmann – German Research Center for Artificial Intelligence
Steffen Staab – University of Koblenz-Landau
Frieder Stolzenburg – University of Applied Studies and Research, Harz
Sylvie Thiébaux – The Australian National University, Canberra
Michael Thielscher – University of Dresden
Toby Walsh – National ICT Australia
Gerhard Weiss – Technical University of Munich

Additional Referees
Vazha Amiranashvili, Martin Atzmüller, Sebastian Bader, Volker Baier, Thomas Barkowsky, Joachim Baumeister, Brandon Bennett, Sven Bertel, Gerd Beuster, Thomas Bieser, Alexander Bockmayr, Peter Brucker, Andreas Brüning, Frank Buhr, Wolfram Burgard, Nachum Dershowitz, Klaus Dorfmüller-Ulhaas, Phan Minh Dung, Manuel Fehler, Alexander Ferrein, Felix Fischer, Arthur Flexer, Lutz Frommberger, Alexander Fuchs, Sharam Gharaei, Axel Großmann, Hans Werner Guesgen, Matthias Haringer, Pat Hayes, Rainer Herrler, Andreas Heß, Jörg Hoffmann, Jörn Hopf, Lothar Hotz, Tomi Janhunen, Yevgeny Kazakov, Rinat Khoussainov, Phil Kilby, Alexandra Kirsch, Dietrich Klakow, Alexander Kleiner, Birgit Koch, Klaus-Dietrich Kramer, Kai Lingemann, Christian Loos, Bernd Ludwig, Carsten Lutz, Stefan Mandl, Hans Meine, Erica Melis, Ralf Möller, Reinhard Moratz, Armin Mueller, Jan Murray, Matthias Nickles, Malvina Nissim, Peter Novak, Andreas Nüchter, Oliver Obst, Karel Oliva, Stephan Otto, Günther Palm, Dietrich Paulus, Yannick Pencole, Tobias Pietzsch, Marco Ragni, Jochen Renz, Kai Florian Richter, Jussi Rintanen, Michael Rovatsos, Marcello Sanguineti, Ferry Syafei Sapei, Bernd Schattenberg, Sebastian Schmidt, Lars Schmidt-Thieme, Martin Schuhmann, Bernhard Schüler, Holger Schultheis, Inessa Seifert, John Slaney, Freek Stulp, Iman Thabet, Bernd Thomas, Jan Oliver Wallgrün, Diedrich Wolter, Stefan Wölfl, Michael Zakharyaschev, Yingqian Zhang


Organization
KI 2005 was organized by the Artificial Intelligence Research Group at the Institute for Computer Science of the University of Koblenz-Landau.

Sponsoring Institutions
University of Koblenz-Landau
City of Koblenz
Government of Rhineland-Palatinate
Griesson - de Beukelaer GmbH

Table of Contents

Invited Talks
Hierarchy in Fluid Construction Grammars
Joachim De Beule, Luc Steels . . . . . . . . . . 1

Description Logics in Ontology Applications
Ian Horrocks . . . . . . . . . . 16

175 Miles Through the Desert
Sebastian Thrun . . . . . . . . . . 17

Knowledge Representation and Reasoning
A New n-Ary Existential Quantifier in Description Logics
Franz Baader, Eldar Karabaev, Carsten Lutz, Manfred Theißen . . . . . . . . . . 18

Subsumption in EL w.r.t. Hybrid TBoxes
Sebastian Brandt, Jörg Model . . . . . . . . . . 34

Dependency Calculus: Reasoning in a General Point Relation Algebra
Marco Ragni, Alexander Scivos . . . . . . . . . . 49

Temporalizing Spatial Calculi: On Generalized Neighborhood Graphs
Marco Ragni, Stefan Wölfl . . . . . . . . . . 64

Machine Learning
Design of Geologic Structure Models with Case Based Reasoning
Mirjam Minor, Sandro Köppen . . . . . . . . . . 79

Applying Constrained Linear Regression Models to Predict Interval-Valued Data
Eufrasio de A. Lima Neto, Francisco de A.T. de Carvalho, Eduarda S. Freire . . . . . . . . . . 92

On Utilizing Stochastic Learning Weak Estimators for Training and Classification of Patterns with Non-stationary Distributions
B. John Oommen, Luis Rueda . . . . . . . . . . 107

Noise Robustness by Using Inverse Mutations
Ralf Salomon . . . . . . . . . . 121


Diagnosis
Development of Flexible and Adaptable Fault Detection and Diagnosis Algorithm for Induction Motors Based on Self-organization of Feature Extraction
Hyeon Bae, Sungshin Kim, Jang-Mok Kim, Kwang-Baek Kim . . . . . . . . . . 134

Computing the Optimal Action Sequence by Niche Genetic Algorithm
Chen Lin, Huang Jie, Gong Zheng-Hu . . . . . . . . . . 148

Diagnosis of Plan Execution and the Executing Agent
Nico Roos, Cees Witteveen . . . . . . . . . . 161

Automatic Abstraction of Time-Varying System Models for Model Based Diagnosis
Pietro Torasso, Gianluca Torta . . . . . . . . . . 176

Neural Networks
Neuro-Fuzzy Kolmogorov's Network for Time Series Prediction and Pattern Classification
Yevgeniy Bodyanskiy, Vitaliy Kolodyazhniy, Peter Otto . . . . . . . . . . 191

Design of Optimal Power Distribution Networks Using Multiobjective Genetic Algorithm
Alireza Hadi, Farzan Rashidi . . . . . . . . . . 203

New Stability Results for Delayed Neural Networks
Qiang Zhang . . . . . . . . . . 216

Planning
Metaheuristics for Late Work Minimization in Two-Machine Flow Shop with Common Due Date
Jacek Blazewicz, Erwin Pesch, Malgorzata Sterna, Frank Werner . . . . . . . . . . 222

An Optimal Algorithm for Disassembly Scheduling with Assembly Product Structure
Hwa-Joong Kim, Dong-Ho Lee, Paul Xirouchakis . . . . . . . . . . 235

Hybrid Planning Using Flexible Strategies
Bernd Schattenberg, Andreas Weigl, Susanne Biundo . . . . . . . . . . 249

Controlled Reachability Analysis in AI Planning: Theory and Practice
Yacine Zemali . . . . . . . . . . 264


Robotics
Distributed Multi-robot Localization Based on Mutual Path Detection
Vazha Amiranashvili, Gerhard Lakemeyer . . . . . . . . . . 279

Modeling Moving Objects in a Dynamically Changing Robot Application
Martin Lauer, Sascha Lange, Martin Riedmiller . . . . . . . . . . 291

Heuristic-Based Laser Scan Matching for Outdoor 6D SLAM
Andreas Nüchter, Kai Lingemann, Joachim Hertzberg, Hartmut Surmann . . . . . . . . . . 304

A Probabilistic Multimodal Sensor Aggregation Scheme Applied for a Mobile Robot
Erik Schaffernicht, Christian Martin, Andrea Scheidig, Horst-Michael Gross . . . . . . . . . . 320

Behavior Recognition and Opponent Modeling for Adaptive Table Soccer Playing
Thilo Weigel, Klaus Rechert, Bernhard Nebel . . . . . . . . . . 335

Cognitive Modelling / Philosophy / Natural Language
Selecting What is Important: Training Visual Attention
Simone Frintrop, Gerriet Backer, Erich Rome . . . . . . . . . . 351

Self-sustained Thought Processes in a Dense Associative Network
Claudius Gros . . . . . . . . . . 366

Why Is the Lucas-Penrose Argument Invalid?
Manfred Kerber . . . . . . . . . . 380

On the Road to High-Quality POS-Tagging
Stefan Klatt, Karel Oliva . . . . . . . . . . 394

Author Index . . . . . . . . . . 409

Hierarchy in Fluid Construction Grammars

Joachim De Beule(1) and Luc Steels(1,2)

(1) Vrije Universiteit Brussel, Belgium
(2) Sony Computer Science Laboratory, Paris, France
{joachim, steels}@arti.vub.ac.be
http://arti.vub.ac.be/~{joachim,steels}/

Abstract. This paper reports further progress into a computational implementation of a new formalism for construction grammar, known as Fluid Construction Grammar (FCG). We focus in particular on how hierarchy can be implemented. The paper analyses the requirements for a proper treatment of hierarchy in emergent grammar and then proposes a particular solution based on a new operator, called the J-operator. The J-operator constructs a new unit as a side effect of the matching process.

1 Introduction

In the context of our research on the evolution and emergence of communication systems with human-language-like features [5], we have been researching a formalism for construction grammars, called Fluid Construction Grammar (FCG) [6]. Construction grammars (see [3], [2], [4] for introductions) have recently emerged from cognitive linguistics as the most successful approach so far for capturing the syntax-semantics mapping. In FCG (as in other construction grammars), a construction associates a semantic structure with a syntactic structure. The semantic structure contains various units (corresponding to lexical items or groupings of them) and semantic categorizations of these units. The syntactic structure also contains units (usually the same as the semantic structure) and syntactic categorizations. FCG uses many techniques from formal/computational linguistics, such as feature structures for representing syntactic and semantic information and unification as the basic mechanism for the selection and activation of rules. But the formalism and its associated parsing and production algorithms have a number of unique properties: All rules are bi-directional so that the same rules can be used for both parsing and production, and they can be flexibly applied so that ungrammatical sentences or meanings that are only partly covered by the language inventory can be handled without catastrophic performance degradation. We therefore prefer the term templates instead of rules. Templates have a score which reflects how 'grammatical' they are believed to be. The score is local to an agent and based solely on the agent's interaction with other agents. Our formalism is called Fluid Construction Grammar as it strives to capture the fluidity by which new grammatical constructions can enter or leave a language inventory.


The inventory of templates can be divided depending on what role they play in the overall language process. Morph(ological) templates and lex(ical)-stem templates license morphology and map lexical stems into partial meanings. Syn-cat and sem-cat templates identify syntactic or semantic features preparing or further instantiating grammatical constructions. Gram(matical) templates define grammatical constructions. So far FCG has only dealt with single-layered structures without hierarchy. It is obvious, however, that natural languages make heavy use of hierarchy: a sentence can contain a noun phrase, which itself can contain another noun phrase, etc. This raises the difficult technical question of how hierarchy can be integrated into FCG without losing the advantages of the formalism, specifically the bidirectionality of the templates. This paper presents the solution to this problem that has proven effective in our computational experiments.

2 Hierarchy in Syntax

It is fairly easy to see what hierarchy means on the syntactic side and there is a long tradition of formalisms to handle it. To take an extremely simple example: the phrase ”the big block” combines three lexical units to form a new whole. In terms of syntactic categories, we say that ”the” is an article, ”big” is an adjective, and ”block” is a noun and that the combination is a noun phrase, which can function as a unit in a larger structure as in ”the blue box next to the big block”. Traditionally, this kind of phrase structure analysis is represented in graphs as in figure 1. Using the terminology of (fluid) construction grammar,

Fig. 1. Hierarchical syntactic structure for a simple noun phrase

we would say that "the big block" is a noun phrase construction of the type Art-Adj-Noun. It consists of a noun-phrase (NP) unit which has three subunits, one for the Article, one for the Adjective and one for the Noun. "The blue box next to the big block" is another noun phrase construction of type NP-prep-NP. So constructions come in different types and subtypes. The units in a construction participate in particular grammatical relations (such as head, complement, determiner, etc.) but the relational view will not be developed here. Various syntactic constraints are associated with the realization of a construction which constrain its applicability. For example, in the case of the Art-Adj-Noun construction, there is a particular word order imposed, so that the


Article comes before the Adjective and the Adjective before the Noun. Agreement in number between the article and the noun is also required. In French, there would also be agreement between the adjective and the noun, and there would be different Art-Adj-Noun constructions with different word order patterns depending on the type of adjective and/or its semantic role ("une fille charmante" vs. "un bon copain"). When a parent-unit is constructed that has various subunits, there are usually properties of the subunits (most often the head) that are carried over to the parent-unit. For example, if a definite article occurs in an Art-Adj-Noun construction, then the noun phrase as a whole is also a definite noun phrase. Or if the head noun is singular, then the noun phrase as a whole is singular as well. It follows that the grammar must be able to express (1) what kind of units can be combined into a larger unit, and (2) what the properties are of this larger unit. The first step toward hierarchical constructions consists in determining what the syntactic structure looks like before and after the application of the construction. Right before application of the construction in parsing, the syntactic structure is as shown in figure 2. It corresponds to the left part of figure 6. Application of the Art-Adj-Noun construction to this syntactic structure should

((top (syn-subunits (determiner modifier head))
      (form ((precedes determiner modifier) (precedes modifier head))))
 (determiner (syn-cat ((lex-cat article) (number singular)))
             (form ((stem determiner "the"))))
 (modifier (syn-cat ((lex-cat adjective)))
           (form ((stem modifier "big"))))
 (head (syn-cat ((lex-cat noun) (number singular)))
       (form ((stem head "block")))))

Fig. 2. Syntactic structure for the phrase "The big block" as it is represented in FCG before the application of an NP construction of the type Art-Adj-Noun. The structure contains four units; one of them (the top unit) has the other three as subunits. The lexicon contributes the different units and their stems. The precedes-constraints are directly detectable from the utterance. This structure should trigger the NP construction.

result in the structure shown in figure 3. It corresponds to the right part of figure 6. As can be seen a new NP-unit is inserted between the top-unit and the other units. The NP-unit has the determiner, modifier and head-unit as subunits and also contains the precedence constraints involving these units. The lexical category of the new unit is NP and it inherits the number of the head-unit.

((top (syn-subunits (np-unit)))
 (np-unit (syn-subunits (determiner modifier head))
          (form ((precedes determiner modifier) (precedes modifier head)))
          (syn-cat ((lex-cat NP) (number singular))))
 (determiner (syn-cat ((lex-cat article) (number singular)))
             (form ((stem determiner "the"))))
 (modifier (syn-cat ((lex-cat adjective)))
           (form ((stem modifier "big"))))
 (head (syn-cat ((lex-cat noun) (number singular)))
       (form ((stem head "block")))))

Fig. 3. Syntactic structure for the phrase “The big block” as it is represented in FCG after the application of an NP construction of the type Art-Adj-Noun. A new NP-unit is inserted between the top-unit and the other units, which now covers the article, adjective and noun units.

Let us now investigate what the constructions look like that license this kind of transformation. In FCG a construction is defined as having a semantic pole (left) and a syntactic pole (right). The application of a template consists of a matching phase followed by a merging phase. In parsing, the syntactic pole of a construction template must match with the syntactic structure, after which the semantic pole may be merged with the semantic structure to get the new semantic structure. In production, the semantic pole must match with the semantic structure and the syntactic pole is then merged with the syntactic structure. Merging is not always possible. This section considers first the issue of syntactic parsing. Our key idea to handle hierarchy is to construct a new unit as a side effect of the matching and merging processes. Specifically, we can assume that, for the example under investigation, the syntactic pole should at least contain the units shown in figure 4. This matches with the syntactic structure given in figure 2 and can therefore be used in parsing to test whether a noun-phrase occurs. But it does not yet create the noun-phrase unit. This is achieved by the J-operator.¹ Units marked with the J-operator are ignored in matching. When matching is successful, the new unit is introduced and bound to the first argument of the J-operator. The second argument should already have been bound by the matching process to the parent unit from which the new unit should depend. The third argument specifies the set of units that will be pulled into the newly created unit. The new unit can contain additional

¹ J denotes the first letter of Joachim De Beule, who established the basis for this operator.


((?top (syn-subunits (== ?determiner ?modifier ?head))
       (form (== (precedes ?determiner ?modifier) (precedes ?modifier ?head))))
 (?determiner (syn-cat (== (lex-cat article) (number singular))))
 (?modifier (syn-cat (== (lex-cat adjective))))
 (?head (syn-cat (== (lex-cat noun) (number singular)))))

Fig. 4. The part of the NP-construction that is needed to match the structure in figure 2. Symbols starting with a question-mark are variables that get bound during a successful match or merge. The ==, or includes symbol, specifies that the following list should at least contain the specified elements but possibly more.

((?top (syn-subunits (== ?determiner ?modifier ?head))
       (form (== (precedes ?determiner ?modifier) (precedes ?modifier ?head))))
 (?determiner (syn-cat (==1 (lex-cat article) (number ?number))))
 (?modifier (syn-cat (==1 (lex-cat adjective))))
 (?head (syn-cat (==1 (lex-cat noun) (number ?number))))
 ((J ?new-unit ?top (?determiner ?modifier ?head)) (syn-cat (np (number ?number)))))

Fig. 5. The pole of figure 4 extended with a J-operator unit. The function of the ==1 or includes-uniquely symbol will be explained shortly; in matching it roughly behaves like the == symbol.

slot specifications, specified in the normal way, and all variable bindings resulting from the match are still valid. An example is shown in figure 5. Besides creating the new unit and adding its features, the overall structure should also change. Specifically, the new-unit is declared a subunit of its second argument (i.e., ?top in figure 5) and all feature values specified in this parent unit are moved to the new unit (i.e., the syn-subunits and form slots in figure 5). This way the precedence relations in the form-slot of the original parent or any other categories that transcend single units (like intonation) can automatically be moved to the new unit as well. Thus the example syntactic structure of figure 2 for "the big block" is indeed transformed into the target structure of figure 3 by applying the pole of figure 5. The operation is illustrated graphically in figure 6. Note that now the np-unit is itself a unit ready to be combined with others if necessary.


Fig. 6. Restructuring performed by the J-operator

Formally, the J-operator is a tertiary operator and is written as (variable names are arbitrary):

(J ?new-unit ?parent (?child-1 ... ?child-n))

The second argument is called the parent argument of the J-operator and the third the children-argument. Any unit-name in the left or right pole of a template may be governed by a J-operator. The parent and children-arguments should refer to other units in the same pole. Matching a pattern containing a J-operator against a target structure is equivalent to matching the pattern without the unit marked by the J-operator. Assume a set of bindings for the parent and children in the J-operator, then merging a pattern containing a J-operator with a target is defined as follows:

1. Unless a binding is already present, the set of bindings is extended with a binding of the first argument ?new-unit of the J-operator to a new constant unit-name N.
2. A new pattern is created by removing the unit marked by the J-operator from the original pattern and adding a unit with name N. This new unit has as slots the union of the slots of the unit marked by the J-operator and of the unit in the original pattern as specified by the parent argument. This pattern is now merged with the target structure.
3. The feature values of the parent specified in the J-pattern are removed from the corresponding unit in the target structure and the new unit is made a subunit of this unit.
4. Finally, all remaining references to units specified by the children are replaced by references to the new unit.

This is illustrated for the target structure in figure 2 and the pattern of figure 5. The pattern of figure 5 matches with figure 2 because the J-unit is ignored. The resulting bindings are:

[?top/top, ?determiner/determiner, ?modifier/modifier, ?head/head, ?number/singular]

(B1)


The new pattern is given by

((?top (syn-subunits (== ?determiner ?modifier ?head))
       (form (== (precedes ?determiner ?modifier) (precedes ?modifier ?head))))
 (?determiner (syn-cat (==1 (lex-cat article) (number ?number))))
 (?modifier (syn-cat (==1 (lex-cat adjective))))
 (?head (syn-cat (==1 (lex-cat noun) (number ?number))))
 (np-unit (syn-cat (np (number ?number)))
          (syn-subunits (== ?determiner ?modifier ?head))
          (form (== (precedes ?determiner ?modifier) (precedes ?modifier ?head)))))

Before merging, these bindings are extended with a binding [?new-unit/np-unit]. Merging this pattern with figure 2 given the bindings (B1) results in

((top (syn-subunits (determiner modifier head))
      (form ((precedes determiner modifier) (precedes modifier head)))
      (np-unit (syn-subunits (determiner modifier head))
               (form ((precedes determiner modifier) (precedes modifier head)))
               (syn-cat ((lex-cat NP) (number singular)))))
 (determiner (syn-cat ((lex-cat article) (number singular)))
             (form ((stem determiner "the"))))
 (modifier (syn-cat ((lex-cat adjective)))
           (form ((stem modifier "big"))))
 (head (syn-cat ((lex-cat noun) (number singular)))
       (form ((stem head "block")))))

The last two steps remove the syn-subunits and form slots from the top unit and result in the structure of figure 3. Note that the application of a template now actually consists of three phases: first the matching of a pole against the target structure, next the merging of the same structure with the altered pole as specified above and finally the normal merging phase between the template's other pole and the corresponding target structure (although here too J-operators might be involved).
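To make the four merge steps above more tangible, here is a deliberately simplified Python sketch that reproduces the figure 2 to figure 3 restructuring on plain dictionaries. It is not the FCG implementation: variable binding, the matching phase and the ==/==1 semantics are left out, and the slots to be moved are passed in by hand rather than read off the construction's parent unit.

# Simplified illustration only: units are dicts; "moved_slots" plays the role of
# the feature values specified in the pattern's parent unit (step 3).
def apply_j(structure, parent, new_name, new_slots, moved_slots):
    new_unit = dict(new_slots)                    # steps 1-2: create unit N with its slots
    for slot in moved_slots:                      # step 3: strip the specified feature
        if slot in structure[parent]:             # values from the parent ...
            new_unit[slot] = structure[parent].pop(slot)
    structure[parent]["syn-subunits"] = [new_name]  # ... and hang N under the parent
    structure[new_name] = new_unit
    return structure                              # step 4 (re-pointing child refs) omitted

figure2 = {
    "top": {"syn-subunits": ["determiner", "modifier", "head"],
            "form": [("precedes", "determiner", "modifier"),
                     ("precedes", "modifier", "head")]},
    "determiner": {"syn-cat": {"lex-cat": "article", "number": "singular"}},
    "modifier":   {"syn-cat": {"lex-cat": "adjective"}},
    "head":       {"syn-cat": {"lex-cat": "noun", "number": "singular"}},
}

figure3 = apply_j(figure2, parent="top", new_name="np-unit",
                  new_slots={"syn-cat": {"lex-cat": "NP", "number": "singular"}},
                  moved_slots=["syn-subunits", "form"])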

((top (syn-subunits (determiner modifier head))
      (determiner (form ((stem determiner "the"))))
      (modifier (form ((stem modifier "big"))))
      (head (form ((stem head "block"))))))

Fig. 7. Syntactic structure while producing “the big block”, before the application of an NP-construction

In syntactic production the syntactic structure after lexicon-lookup and before application of the Art-Adj-Noun construction is as shown in figure 7. As will be discussed shortly, in production the left (semantic) pole of the construction is matched with the semantic structure and its right (syntactic) pole (shown in figure 5) has to be merged with the above syntactic structure. Here again the J-operator is used to create a new unit and link it from the top, yielding again the structure shown in figure 3. It is important to realise that merging may fail (and hence the application of the construction may fail) if there is an incompatibility between the slot-specification in the template's pattern and the target structure with which the pattern is merged. Indeed, this is the function of the includes-uniquely symbol ==1. It signals that there can only be one clause with the predicates that follow. If, on the other hand, the slot-specification starts with a normal includes symbol ==, the clause(s) can simply be added. The different forms of a slot-specification and their behavior in matching and merging are summarised in the table below:

Notation           Read as             Matching and merging behavior
(a1 ... an)        equals              target must contain exactly the given elements
(== a1 ... an)     includes            target must contain at least the given elements
(==1 a1 ... an)    includes uniquely   target must contain at least the given elements and may not contain duplicate predicates

For example, the syn-cat slot of the determiner unit in the NP-construction's syntactic pole (see figure 5) is specified as (==1 (lex-cat article) (number ?number)). This can be merged only if the lex-cat of the target unit is none other than article and if the ?number variable is either not yet bound (in which case it becomes bound to the one contained in the target unit) or else only if it is bound to the number value in the target unit. Note also that no additional mechanisms are needed to implement agreement or the transfer from daughter to parent. Indeed, in figure 5 ?number is the same for the determiner and head units. If these units had different number values in the target structure, the merge would be blocked. This implements a test on agreement in number. And because the new-unit has its number cat-


egory specified by the same variable ?number, it will inherit the number of its constituents as specified.
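As a rough illustration of the table above, the following Python fragment spells out the three slot-specification modes as checks over a list of clauses. It is only a sketch under simplifying assumptions (clauses are ground tuples, there are no variables or bindings), so it models the matching column of the table rather than FCG's full match/merge machinery.

# Illustrative sketch of the equals / includes / includes-uniquely modes.
# Clauses are tuples whose first element is the predicate, e.g. ("lex-cat", "article").
def slot_matches(mode, spec, target):
    if mode == "equals":               # (a1 ... an)
        return sorted(spec) == sorted(target)
    if mode == "includes":             # (== a1 ... an)
        return all(clause in target for clause in spec)
    if mode == "includes-uniquely":    # (==1 a1 ... an)
        predicates = [clause[0] for clause in target]
        return (all(clause in target for clause in spec)
                and len(predicates) == len(set(predicates)))
    raise ValueError(mode)

# (==1 (lex-cat article) (number singular)) against the determiner unit of figure 2:
assert slot_matches("includes-uniquely",
                    [("lex-cat", "article"), ("number", "singular")],
                    [("lex-cat", "article"), ("number", "singular")])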

3 Hierarchy in Semantics

The mechanisms proposed so far implement the basic mechanism of syntactically grouping units together in more encompassing units. Any grammar formalism supports this kind of grouping. However, constructions have both a syntactic and a semantic pole, and so now the question is how hierarchy is to be handled semantically in tight interaction with syntax. A construction such as "the big block" not only groups together the meanings associated with its component units but also adds additional meaning, including what should be done with the meanings of the components to obtain the meaning of the utterance. FCG uses a Montague-style semantics (see [1] for an introduction), which means that individual words like "ball" or "big" introduce predicates and that constructions introduce second-order semantic operators that say how these predicates are to be used (see [7] for a sketchy description of the system IRL that generates this kind of operators and how it is used). For example, we can assume a semantic operator find-referent-1, which is a particular type of a find-referent operator that uses a quantifier, like [the], a property, like [big], and a prototype of an object, like [ball], to find an object in the current context. It would do this by filtering all objects that match the prototype, then filtering them further by taking the one that is biggest, and then taking out the unique element of the remaining singleton (a rough code sketch of this procedure is given below, after figure 9). The initial semantic structure containing the meaning of the phrase "the big block" now looks as in figure 8. Lexicon-lookup introduces units for each of

((top (meaning ((find-referent-1 obj det1 prop1 prototype1)
                (quantifier det1 [the])
                (property prop1 [big])
                (prototype prototype1 [ball])))))

Fig. 8. The initial semantic structure for producing the phrase "the big block"

the parts of the meaning that are covered by lexical entries and transforms this structure into the one in figure 9: The challenge now is to apply a construction which pulls out the additional part of the meaning that is not yet covered and constructs the appropriate new unit. Again the J-operator can be used with exactly the same behavior as seen earlier. The J-unit is ignored during matching but introduces a new unit during merging as explained before: the new unit is linked to the parent and the parent slot specifications contained in the template are moved to the new unit.

((top (sem-subunits (determiner modifier head))
      (meaning ((find-referent-1 obj det1 prop1 prototype1))))
 (determiner (referent det1) (meaning ((quantifier det1 [the]))))
 (modifier (referent prop1) (meaning ((property prop1 [big]))))
 (head (referent prototype1) (meaning ((prototype prototype1 [ball])))))

Fig. 9. The semantic structure for producing ”the big block” after lexicon lookup. Three lexical entries were used, respectively covering the parts of meaning (quantifier det1 [the]), (property prop1 [big]) and (prototype prototype1 [ball]).
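The find-referent-1 procedure described in prose above (filter by prototype, keep the biggest, require a unique result) can be written down very literally as follows. The object representation and the handling of [big] are invented here purely for illustration; the actual IRL operators are considerably richer.

# Hypothetical rendering of find-referent-1; not IRL code.
def find_referent_1(context, quantifier, prop, prototype):
    candidates = [obj for obj in context if obj["prototype"] == prototype]
    if prop == "big" and candidates:                  # filter further: keep the biggest
        candidates = [max(candidates, key=lambda o: o["size"])]
    if quantifier == "the" and len(candidates) == 1:  # [the]: take the unique element
        return candidates[0]
    return None                                       # no (unique) referent found

context = [{"prototype": "ball", "size": 3},
           {"prototype": "ball", "size": 7},
           {"prototype": "box",  "size": 5}]
referent = find_referent_1(context, "the", "big", "ball")   # -> the size-7 ball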

Hence, the semantic pole of the Art-Adj-Noun construction looks as follows:

((?top (sem-subunits (== ?determiner ?modifier ?head))
       (meaning (== (find-referent-1 ?obj ?det1 ?prop1 ?prototype1))))
 (?determiner (referent ?det1))
 (?modifier (referent ?prop1))
 (?head (referent ?prototype1))
 ((J ?new-unit ?top) (referent ?obj) (?determiner ?modifier ?head)))

Application of this semantic pole to the semantic structure of figure 9 yields:

((top (sem-subunits (np-unit)))
 (np-unit (referent obj)
          (sem-subunits (determiner modifier head))
          (meaning ((find-referent-1 obj det1 prop1 prototype1))))
 (determiner (referent det1) (meaning ((quantifier det1 [the]))))
 (modifier (referent prop1) (meaning ((property prop1 [big]))))
 (head (referent prototype1) (meaning ((prototype prototype1 [ball])))))

As on the syntactic side, various kinds of selection restrictions could be added. On the semantic side they take the form of semantic categories that may have


Art-Adj-Noun Construction:

((?top (sem-subunits (== ?determiner ?modifier ?head))
       (meaning (== (find-referent-1 ?obj ?quantifier ?property ?prototype))))
 (?determiner (referent ?quantifier))
 (?modifier (referent ?property)
            (sem-cat (==1 (property-domain ?property physical))))
 (?head (referent ?prototype)
        (sem-cat (==1 (prototype-domain ?prototype physical))))
 ((J ?new-unit ?top) (referent ?obj)
                     (sem-cat (==1 (object-domain ?obj physical)))))

<-->

((?top (syn-subunits (== ?determiner ?modifier ?head))
       (form (== (precedes ?determiner ?modifier) (precedes ?modifier ?head))))
 (?determiner (syn-cat (==1 (lex-cat article) (number ?number))))
 (?modifier (syn-cat (==1 (lex-cat adjective))))
 (?head (syn-cat (==1 (lex-cat noun) (number ?number))))
 ((J ?new-unit ?top) (syn-cat (np (number ?number)))))

Fig. 10. The complete Art-Adj-Noun construction. The double arrow <--> separates the semantic pole of the construction (above the arrow) from the syntactic pole (below the arrow) and suggests the bi-directional application of the construction during parsing and producing.

been inferred during re-conceptualization (i.e., using sem-templates). These selection restrictions would block the application of the construction while matching in production and while merging in parsing. For example, we could specialise the construction's semantic pole to be specific to the domain of physical objects (and hence physical properties and prototypes of physical objects) as follows:

((?top (sem-subunits (== ?determiner ?modifier ?head))
       (meaning (== (find-referent-1 ?obj ?quantifier ?property ?prototype))))
 (?determiner (referent ?quantifier))
 (?modifier (referent ?property)
            (sem-cat (==1 (property-domain ?property physical))))
 (?head (referent ?prototype)
        (sem-cat (==1 (prototype-domain ?prototype physical))))
 ((J ?new-unit ?top) (referent ?obj)
                     (sem-cat (==1 (object-domain ?obj physical)))))

The complete Art-Adj-Noun construction as a whole is shown in figure 10. As an example of semantic parsing, suppose the semantic structure after lexicon-lookup and semantic categorisation is as in figure 11.

((top (sem-subunits (determiner modifier head)))
 (determiner (referent ?quant1) (meaning ((quantifier ?quant1 [the]))))
 (modifier (referent ?prop1)
           (meaning ((property ?prop1 [big])))
           (sem-cat ((property-domain ?prop1 physical-object))))
 (head (referent ?prototype1)
       (meaning ((prototype ?prototype1 [ball])))
       (sem-cat ((prototype-domain ?prototype1 physical-object)))))

Fig. 11. Semantic structure after lexicon lookup and semantic categorisation while parsing the phrase "the big block"

Merging now takes place and succeeds because all semantic categories are compatible. The final result is:

((top (sem-subunits (np-unit)))
 (np-unit (referent ?obj)
          (sem-subunits (determiner modifier head))
          (meaning ((find-referent-1 ?obj ?quant1 ?prop1 ?prototype1)))
          (sem-cat ((object-domain ?obj physical))))
 (determiner (referent ?det1) (meaning ((quantifier ?quant1 [the]))))
 (modifier (referent ?prop1)
           (meaning ((property ?prop1 [big])))
           (sem-cat ((property-domain ?prop1 physical-object))))
 (head (referent ?prototype1)
       (meaning ((prototype ?prototype1 [ball])))
       (sem-cat ((prototype-domain ?prototype1 physical-object)))))

4 The J-Operator in Lexicon Look-Up

In the previous sections it was assumed that the application of lex-stem templates resulted in a structure containing separate units for every lexical entry. For example, it was assumed that the lex-stem templates transform the flat structure of figure 8 into the one in figure 9. However, an additional transformation is needed for this, and in this section we show how the J-operator can be used to accomplish this. In production the initial semantic structure is as in figure 8. Now consider the following lexical entry:

(def-lex-template [the]
  ((?top (meaning (== (quantifier ?x [the]))))
   ((J ?new-unit ?top)))

  ((?top (syn-subunits (== ?new-unit))
         (form (== (string ?new-unit "the"))))
   ((J ?new-unit ?top)
    (form (== (stem ?new-unit "the")))
    (syn-cat (== (lex-cat article))))))

and similar entries for [big] and [block]. Matching the left pole of the [the]-template with the structure in figure 8 yields the bindings: [?top/top, ?x/det1]. As explained, merging this pole with figure 8 extends these bindings with the binding [?new-unit/[the]-unit] and yields:

((top (sem-subunits ([the]-unit))
      (meaning ((find-referent-1 obj det1 prop1 prototype1)
                (property prop1 [big])
                (prototype prototype1 [ball]))))
 ([the]-unit (referent det1) (meaning ((quantifier det1 [the])))))

Note that, because of the working of the J-operator, a new unit is created which has absorbed the (quantifier det1 [the]) part of the meaning from the top unit. Applying also the templates for [big] and [block] finally results in the desired structure shown in figure 9.


In production, the initial syntactic structure is empty and simply equal to ((top)). But merging the right pole of the [the]-template given the extended binding yields:

((top (syn-subunits ([the]-unit)))
 ([the]-unit (form ((string [the]-unit "the") (stem [the]-unit "the")))
             (syn-cat ((lex-cat article)))))

Similarly applying the templates for [big] and [block] results in figure 7. As a final example, consider that the initial semantic structure in parsing is also empty and equal to ((top)). Merging of the left poles of the lexical entries results in

((top (sem-subunits ([the]-unit [big]-unit [block]-unit)))
 ([the]-unit (referent ?x-1) (meaning ((quantifier ?x-1 [the]))))
 ([big]-unit (referent ?x-2) (meaning ((property ?x-2 [big]))))
 ([block]-unit (referent ?x-3) (meaning ((prototype ?x-3 [ball])))))

which is, apart from the sem-cat features which have to be added by sem-templates (re-conceptualization), indeed equivalent to figure 11.

5 Conclusions

This paper introduced the J-operator and showed that it provides an elegant solution for handling hierarchical structure, both in the syntactic and the semantic domain. All the properties of FCG, and particularly the bi-directional applicability of templates, could be preserved. There are of course many related topics and remaining issues that could not be discussed in the present paper. One issue is the percolation of properties from the 'head' node of a construction to the governing node. Although this can be regulated explicitly (as shown in the Art-Adj-Noun construction where the number of the noun percolates to the number of the governing NP), it has been argued (in HPSG-related formalisms) that there are default percolations. This can easily be implemented in FCG as will be shown in a forthcoming paper. Another issue is the handling of optional components, i.e., elements of a construction which do not obligatorily appear. This can be handled by another kind of operator that regulates which partial matches are licensed, as will also be shown in another paper. Finally, we have already implemented learning operators that invent or acquire the kind of constructions discussed here, but their presentation goes beyond the scope of the present paper.


Acknowledgement

This research has been conducted at the Sony Computer Science Laboratory in Paris and the University of Brussels VUB Artificial Intelligence Laboratory. It is partially sponsored by the ECAgents project (FET IST-1940). We are indebted to Benjamin Bergen, Joris Bleys, and Martin Loetzsch for stimulating discussions about FCG.

References
1. Dowty, D., R. Wall, and S. Peters (1981) Introduction to Montague Semantics. D. Reidel Pub. Cy., Dordrecht.
2. Goldberg, A.E. (1995) Constructions: A Construction Grammar Approach to Argument Structure. University of Chicago Press, Chicago.
3. Kay, P. and C.J. Fillmore (1999) Grammatical Constructions and Linguistic Generalizations: the What's X Doing Y? Construction. Language, May 1999.
4. Michaelis, L. (2004) Entity and Event Coercion in a Symbolic Theory of Syntax. In J.O. Oestman and M. Fried (eds.), Construction Grammar(s): Cognitive Grounding and Theoretical Extensions. Constructional Approaches to Language series, volume 2. Amsterdam: Benjamins.
5. Steels, L. (2003) Evolving grounded communication for robots. Trends in Cognitive Science, Volume 7, Issue 7, July 2003, pp. 308-312.
6. Steels, L. (2004) Constructivist Development of Grounded Construction Grammars. In: Scott, D., Daelemans, W. and Walker, M. (eds), Proceedings Annual Meeting Association for Computational Linguistic Conference. Barcelona. p. 9-1
7. Steels, L. (2000) The Emergence of Grammar in Communicating Autonomous Robotic Agents. In: Horn, W., editor, Proceedings of European Conference on Artificial Intelligence 2000. Amsterdam, IOS Publishing. pp. 764-769.

Description Logics in Ontology Applications (Abstract)

Ian Horrocks
Faculty of Computer Science, University of Manchester, Great Britain

Abstract. Description Logics (DLs) are a family of logic based knowledge representation formalisms. Although they have a range of applications (e.g., configuration and information integration), they are perhaps best known as the basis for widely used ontology languages such as OWL (now a W3C recommendation). This decision was motivated by a requirement that key inference problems be decidable, and that it should be possible to provide reasoning services to support ontology design and deployment. Such reasoning services are typically provided by highly optimised implementations of tableaux decision procedures. In this talk I will introduce both the logics and decision procedures that underpin modern ontology languages, and the implementation techniques that have enabled state of the art systems to be effective in applications in spite of the high worst case complexity of key inference problems.


175 Miles Through the Desert (Abstract)

Sebastian Thrun
Artificial Intelligence Lab (SAIL), Stanford University, United States of America

Abstract. The DARPA Grand Challenge is one of the biggest open challenges for the robotics community to date. It requires a robotic vehicle to follow a given route of up to 175 miles across punishing desert terrain, without any human supervision. The Challenge was first held in 2004, when the best-performing team failed after 7.3 miles of autonomous driving. The speaker heads one out of 195 teams worldwide competing for the 2 million dollar prize. Thrun will present the work of the Stanford Racing Team, which is developing an automated car capable of desert driving at up to 50 km/h. He will report on research in areas as diverse as computer vision, control, fault-tolerant systems, machine learning, motion planning, data fusion, and 3-D environment modeling.


A New n-Ary Existential Quantifier in Description Logics

Franz Baader(1), Eldar Karabaev(1), Carsten Lutz(1), and Manfred Theißen(2)

(1) Theoretical Computer Science, TU Dresden, Germany
(2) Process Systems Engineering, RWTH Aachen, Germany

Abstract. Motivated by a chemical process engineering application, we introduce a new concept constructor in Description Logics (DLs), an n-ary variant of the existential restriction constructor, which generalizes both the usual existential restrictions and so-called qualified number restrictions. We show that the new constructor can be expressed in ALCQ, the extension of the basic DL ALC by qualified number restrictions. However, this representation results in an exponential blow-up. By giving direct algorithms for ALC extended with the new constructor, we can show that the complexity of reasoning in this new DL is actually not harder than the one of reasoning in ALCQ. Moreover, in our chemical process engineering application, a restricted DL that provides only the new constructor together with conjunction, and satisfies an additional restriction on the occurrence of role names, is sufficient. For this DL, the subsumption problem is polynomial.

1 Introduction

Description Logics (DLs) [2] are a class of knowledge representation formalisms in the tradition of semantic networks and frames, which can be used to represent the terminological knowledge of an application domain in a structured and formally well-understood way. DL systems provide their users with inference services (like computing the subsumption hierarchy) that deduce implicit knowledge from the explicitly represented knowledge. For these inference services to be feasible, the underlying inference problems must at least be decidable, and preferably of low complexity. This is only possible if the expressiveness of the DL employed by the system is restricted in an appropriate way. Because of this restriction of the expressive power of DLs, various application-driven language extensions have been proposed in the literature (see, e.g., [3,9,22,16]), some of which have been integrated into state-of-the-art DL systems [15,13]. The present paper considers a new concept constructor that is motivated by a process engineering application [23]. This constructor is an n-ary variant of the usual existential restriction operator available in most DLs. To motivate the need for this new constructor, assume that we want to describe a chemical plant that has a reactor with a main reaction, and in addition a reactor with a main and a side reaction. Also assume that the concepts Reactor with main reaction and Reactor with main and side reaction are defined such that the first concept


subsumes the second one. We could try to model this chemical plant with the help of the usual existential restriction operator as Plant ⊓ ∃has part.Reactor with main reaction ⊓ ∃has part.Reactor with main and side reaction. However, because of the subsumption relationship between the two reactor concepts, this concept is equivalent to Plant ⊓ ∃has part.Reactor with main and side reaction, and thus does not capture the intended meaning of a plant having two reactors, one with a main reaction and the other with a main and a side reaction. To overcome this problem, we consider a new concept constructor of the form ∃r.(C1, . . . , Cn), with the intended meaning that it describes all individuals having n different r-successors d1, . . . , dn such that di belongs to Ci (i = 1, . . . , n). Given this constructor, our concept can correctly be described as Plant ⊓ ∃has part.(Reactor with main reaction, Reactor with main and side reaction). The situation differs from other application-driven language extensions in that the new constructor can actually be expressed using constructors available in the DL ALCQ, which can be handled by state-of-the-art DL systems (Section 3). Thus, the new constructor can be seen as syntactic sugar; nevertheless, it makes sense to introduce it explicitly since this speeds up reasoning. In fact, expressing the new constructor with the ones available in ALCQ results in an exponential blow-up. In addition, the translation introduces many "expensive" constructors (disjunction and qualified number restrictions). For this reason, even highly optimized DL systems like Racer [13] cannot handle the translated concepts in a satisfactory way. In contrast, the direct introduction of the new constructor into ALC does not increase the complexity of reasoning (Section 4). Moreover, in the process engineering application [23] mentioned above, the rather inexpressive DL EL(n) that provides only the new constructor together with conjunction is sufficient. In addition, only concept descriptions are used where in each conjunction there is at most one n-ary existential restriction for each role. For this restricted DL, the subsumption problem is polynomial (Section 5). If this last restriction is removed, then subsumption is in coNP, but the exact complexity of the subsumption problem in EL(n) is still open (Section 6). Because of space constraints, some of the technical details are omitted: they can be found in [5].

2 The DL ALCQ

Concept descriptions are inductively defined with the help of a set of constructors, starting with a set NC of concept names and a set NR of role names. The constructors determine the expressive power of the DL. In this section, we restrict the attention to the DL ALCQ, whose concept descriptions are formed using the constructors shown in Table 1.

Table 1. Syntax and semantics of ALCQ

  Name                                     Syntax     Semantics
  conjunction                              C ⊓ D      C^I ∩ D^I
  negation                                 ¬C         Δ^I \ C^I
  at-least qualified number restriction    ≥ n r.C    {x | card({y | (x, y) ∈ r^I ∧ y ∈ C^I}) ≥ n}

Using these constructors, several other constructors can be defined as abbreviations:

– C ⊔ D := ¬(¬C ⊓ ¬D) (disjunction),
– ⊤ := A ⊔ ¬A for a concept name A (top-concept),
– ∃r.C := ≥ 1 r.C (existential restriction),
– ∀r.C := ¬∃r.¬C (value restriction),
– ≤ n r.C := ¬(≥ (n + 1) r.C) (at-most restriction).

The semantics of ALCQ-concept descriptions is defined in terms of an interpretation I = (Δ^I, ·^I). The domain Δ^I of I is a non-empty set of individuals and the interpretation function ·^I maps each concept name A ∈ NC to a subset A^I of Δ^I and each role r ∈ NR to a binary relation r^I on Δ^I. The extension of ·^I to arbitrary concept descriptions is inductively defined, as shown in the third column of Table 1. Here, the function card yields the cardinality of the given set.

A general ALCQ-TBox is a finite set of general concept inclusions (GCIs) C ⊑ D where C, D are ALCQ-concept descriptions. The interpretation I is a model of the general ALCQ-TBox T iff it satisfies all its GCIs, i.e., if C^I ⊆ D^I holds for all GCIs C ⊑ D in T. We use C ≡ D as an abbreviation of the two GCIs C ⊑ D, D ⊑ C. An acyclic ALCQ-TBox is a finite set of concept definitions of the form A ≡ C (where A is a concept name and C an ALCQ-concept description) that does not contain multiple definitions or cyclic dependencies between the definitions. Concept names occurring on the left-hand side of a concept definition are called defined whereas the others are called primitive.

Given two ALCQ-concept descriptions C, D we say that C is subsumed by D w.r.t. the general TBox T (C ⊑_T D) iff C^I ⊆ D^I for all models I of T. Subsumption w.r.t. an acyclic TBox and subsumption between concept descriptions (where T is empty) are special cases of this definition. In the latter case we write C ⊑ D in place of C ⊑_∅ D. The concept description C is satisfiable (w.r.t. the general TBox T) iff there is an interpretation I (a model I of T) such that C^I ≠ ∅. The complexity of the subsumption problem in ALCQ depends on the presence of GCIs. Subsumption of ALCQ-concept descriptions (with or without acyclic TBoxes) is PSpace-complete and subsumption w.r.t. a general ALCQ-

A New n-Ary Existential Quantifier in Description Logics

21

TBox is ExpTime-complete [24].1 These results hold both for unary and binary coding of the numbers in number restrictions, but in this paper we restrict the attention to unary coding (where the size of the number n is counted as n rather than log n).

3

The New Constructor

The general syntax of the new constructor is ∃r.(C1 , . . . , Cn ) where r ∈ NR , n ≥ 1, and C1 , . . . , Cn are concept descriptions. We call this expression an n-ary existential restriction. Its semantics is defined as . . ∧ (x, yn ) ∈ rI ∧ ∃r.(C1 , . . . , Cn )I := {x | ∃y1 , . . . , yn . (x, y1 ) ∈ rI ∧ . I I y1 ∈ C1 ∧ . . . ∧ yn ∈ Cn ∧ 1≤i>

=>>

>=>

>

>>> Fig. 4. The generalized neighborhood graph of the point algebra. The relation triple r1 r2 r3 in the node encode a scenario for the constraints X r1 Y , Y r2 Z, and X r3 Z.

PO PO PP

PO PO DC PO PO PO PP PO DC

PP PO PP

PO PP PP

PP DC DC PP PO PO PP PP PP PP PPi EQ PP PPi PO

PP PPi PPi

PO PP PO

Fig. 5. A subgraph of the (3, 1)-neighborhood graph for RCC5

5 Summary and Outlook We started from the question how spatial constraint calculi can be temporalized. In this context, we presented a precise notion of continuous change that seems to be conceptually adequate for temporalized topological calculi. Such a precise concept is necessary for a well-founded semantics of temporalized calculi dealing with continuous transformations. Moreover, it can be used to analytically prove the correctness of neighborhood graphs.

Temporalizing Spatial Calculi

77

In a second step we considered so-called transformation problems, i. e., problems of the kind whether some spatial configuration can be continuously transformed into another configuration, even if these transformations are constrained by further conditions. Solving such problems may be especially interesting, for instance, if we want to plan how objects have to be moved in space in order to reach a specific goal state. The classical neighborhood graphs discussed in the literature only represent possible continuous transformations of at most two objects. We proposed a concept that eliminates this limitation. These generalized neighborhood graphs may also be considered an appropriate tool for solving transformation problems, because they mirror the notion of k-consistency known from static reasoning problems. This idea is reflected in our definition of (n, l)-consistency. Future work will be concerned with the following questions: What is the exact relationship between (n, l)-consistency and satisfiability of transformation problems? Are there tractable classes of these problems? How can the notion of (n, l)-consistency be used to identify such classes? Finally, how can temporalized QSR be used to solve spatial planning scenarios?

Acknowledgments This work was partially supported by the Deutsche Forschungsgemeinschaft (DFG) as part of the Transregional Collaborative Research Center SFB/TR 8 Spatial Cognition. We would like to thank Bernhard Nebel for helpful discussions. We also gratefully acknowledge the suggestions of three anonymous reviewers, who helped improving the paper.

References [1] J. F. Allen. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11):832–843, 1983. [2] B. Bennett. Space, time, matter and things. In FOIS, pages 105–116, 2001. [3] B. Bennett, A. G. Cohn, F. Wolter, and M. Zakharyaschev. Multi-dimensional modal logic as a framework for spatio-temporal reasoning. Applied Intelligence, 17(3):239–251, 2002. [4] B. Bennett, A. Isli, and A. G. Cohn. When does a composition table provide a complete and tractable proof procedure for a relational constraint language? In Proceedings of the IJCAI97 Workshop on Spatial and Temporal Reasoning, Nagoya, Japan, 1997. [5] A. G. Cohn. Qualitative spatial representation and reasoning techniques. In G. Brewka, C. Habel, and B. Nebel, editors, KI-97: Advances in Artificial Intelligence. Springer, 1997. [6] E. Davis. Continuous shape transformation and metrics on regions. Fundamenta Informaticae, 46(1-2):31–54, 2001. [7] M. J. Egenhofer and K. K. Al-Taha. Reasoning about gradual changes of topological relationships. In A. U. Frank, I. Campari, and U. Formentini, editors, Spatio-Temporal Reasoning, Lecture Notes in Computer Science 639, pages 196–219. Springer, 1992. [8] M. Erwig and M. Schneider. Spatio-temporal predicates. IEEE Transactions on Knowledge and Data Engineering, 14(4):881–901, 2002. [9] C. Freksa. Conceptual neighborhood and its role in temporal and spatial reasoning. In Decision Support Systems and Qualitative Reasoning, pages 181–187. North-Holland, 1991.

78

M. Ragni and S. W¨olfl

[10] D. Gabelaia, R. Kontchakov, A. Kurucz, F. Wolter, and M. Zakharyaschev. Combining spatial and temporal logics: Expressiveness vs. complexity. To appear in Journal of Artificial Intelligence Research, 2005. [11] A. Galton. Qualitative Spatial Change. Oxford University Press, 2000. [12] A. Galton. A generalized topological view of motion in discrete space. Theoretical Compututer Science, 305(1-3):111–134, 2003. [13] A. Gerevini and B. Nebel. Qualitative spatio-temporal reasoning with RCC-8 and Allen’s interval calculus: Computational complexity. In Proceedings of the 15th European Conference on Artificial Intelligence (ECAI-02), pages 312–316. IOS Press, 2002. [14] M. Knauff. The cognitive adequacy of allen’s interval calculus for qualitative spatial representation and reasoning. Spatial Cognition and Computation, 1:261–290, 1999. [15] P. Muller. A qualitative theory of motion based on spatio-temporal primitives. In A. G. Cohn, L. K. Schubert, and S. C. Shapiro, editors, Proceedings of the Sixth International Conference on Principles of Knowledge Representation and Reasoning (KR’98), Trento, Italy, June 2-5, 1998, pages 131–143. Morgan Kaufmann, 1998. [16] P. Muller. Topological spatio-temporal reasoning and representation. Computational Intelligence, 18(3):420–450, 2002. [17] B. Nebel and H.-J. B¨urckert. Reasoning about temporal relations: A maximal tractable subclass of Allen’s interval algebra. Technical Report RR-93-11, Deutsches Forschungszentrum f¨ur K¨unstliche Intelligenz GmbH, Kaiserslautern, Germany, 1993. [18] M. Ragni and S. W¨olfl. Branching Allen: Reasoning with intervals in branching time. In C. Freksa, M. Knauff, B. Krieg-Br¨uckner, B. Nebel, and T. Barkowsky, editors, Spatial Cognition, Lecture Notes in Computer Science 3343, pages 323–343. Springer, 2004. [19] D. A. Randell, Z. Cui, and A. G. Cohn. A spatial logic based on regions and connection. In B. Nebel, W. Swartout, and C. Rich, editors, Principles of Knowledge Representation and Reasoning: Proceedings of the 3rd International Conference (KR-92), pages 165–176. Morgan Kaufmann, 1992. [20] M. B. Vilain, H. A. Kautz, and P. G. van Beek. Contraint propagation algorithms for temporal reasoning: A revised report. In D. S. Weld and J. de Kleer, editors, Readings in Qualitative Reasoning about Physical Systems, pages 373–381. Morgan Kaufmann, 1989. [21] F. Wolter and M. Zakharyaschev. Spatio-temporal representation and reasoning based on RCC-8. In A. Cohn, F. Giunchiglia, and B. Selman, editors, Principles of Knowledge Representation and Reasoning: Proceedings of the 7th International Conference (KR2000). Morgan Kaufmann, 2000.

Design of Geologic Structure Models with Case Based Reasoning Mirjam Minor and Sandro K¨oppen Humboldt-Universit¨at zu Berlin, Institut f¨ur Informatik, Unter den Linden 6, D-10099 Berlin [email protected]

Abstract. This paper describes a new approach of case based reasoning for the design of geologic structure models. It supports geologists in analysing building ground by retrieving and adapting old projects. Projects are divided in several complex cases consisting of pairs of drilling cores with their outcrop and the spatial context. A three-step retrieval algorithm (1) selects structurally similar candidate cases, (2) estimates the quality of a transfer of the layer structures to the query, and (3) retrieves from the remaining set of promising cases the most geologically and geometrically similar cases. The case based system is able to adapt a similar case to the query situation and to export the result to a 3D visualisation tool. A first evaluation has shown that the case based system works well and saves time.

1 Introduction Designing geologic structure models is a time-consuming process that requires a lot of geological knowledge. Legal regulations require the careful analysis of drilling data resulting in a 3D model of building ground. CAD (computer aided design) programs support the geologists by drawing tools when they determine the stone layers between drilling cores. In this paper, we present a case based approach to support the design process by the reuse of experience from former geologic projects. Section 2 describes the design procedure for geologic structure models. Section 3 deals with the division of geologic projects into reusable units in form of cases. Section 4.3 deals with the extended similarity computation and the adaptation. A sample application with first results is given in Section 5. Section 6 discusses related work, and, finally, Section 7 draws a conclusion.

2 Construction of Geological Structure Models When a real estate is developed, a geological structure model has to be constructed according to law. Therefore, a geologist uses information from geologic outcrops (drillings) and geologic basic knowledge about the area. For instance, if the land lies in the ‘Nordische Tiefebene’ in North Germany, certain types of glacial stone like sand and silt are expected. U. Furbach (Ed.): KI 2005, LNAI 3698, pp. 79–91, 2005. c Springer-Verlag Berlin Heidelberg 2005 

80

M. Minor and S. K¨oppen

Usually, a structure model is a network of geologic outcrops in their spatial context which are connected via cross sections. A cross section describes the distribution of stone layers between two geological outcrops. Fig. 1 shows an example of three outcrops with two cross sections. The third cross section is analogical to the second one. The letters A, B, and C stand for different layers of stone.

Fig. 1. Example of outcrops and cross sections between two drillings each

A layer is characterised by many properties: – two depth parameters for the beginning and the ending of the layer, – the strike and fall (vertical and horizontal direction) of the layer, and – geologic parameters: • stratigraphy (ages, e.g., Jurassic, Cretaceous), • petrography (major and minor stone components), • genesis, • colour, and • additional information, e.g., bulged or canted. The construction of the model, especially the specification of the particular cross sections, is performed in a two-dimensional representation as connected, planar graph. The nodes are the outcrops and the edges are the cross sections. To avoid conflicts at intersecting points, the topology of the graph consists of non-intersecting triangles that are computed by a standard triangulation algorithm.

Design of Geologic Structure Models with Case Based Reasoning

81

The complete structural model is visualised in 3D. An example for the three outcrops from Fig. 1 with the triangle of three cross sections is shown in Fig. 2. A CAD (computer aided design) program should support both. The main focus of our work is on the 2D construction of a cross section graph.

Fig. 2. Examples of a 3D structural model

3 Cases in a Cross Section Graph The number of outcrops per project varies from only a few for the building land of a one-family house up to some hundred drillings for a rail route. So, we decided not to use projects as cases, they are hardly comparable. Instead, a case in our approach is a single cross section with its spatial context: – The problem part of a case are the two outcrops at the border of the cross section and one or two additional outcrops; – the solution is the distribution of layers in the cross section. We distinguish two topological settings. Either the wanted cross section is at the border or within the convex closure of the triangulation graph (see Fig. 3). In case the cross section is within the convex closure of the graph, we need two additional outcrops (D3, D4) to describe the context, only one (D3), otherwhise. An outcrop is represented as sequence of layers beginning with the top layer and its location. Each layer is described by the parameters explained above. The location consists of x and y coordinates concerning the dimensions of the whole project plus the overall height of the outcrop. In fact, the ’height’ means the depth of the layer, but the geologic term is height. A cross section is represented as a graph (see Fig. 4). A node stands for the occurrence of a layer in an outcrop. The vertical edges model the sequential order of the layers within an outcrop; the horizontal edges are for continuous layers from different outcrops. To represent the ending of a layer, we use virtual outcrops. In Fig. 4, the two virtual outcrops in the middle have non-bold nodes.

82

M. Minor and S. K¨oppen

Fig. 3. Two different topological settings for cases

Fig. 4. Sample representation of a cross section

In this way, a project with 20 outcrops is divided into up to 54 cases (one per edge of the triangulation graph). The generated cases belong to the complex cases due to their structure.

4 Determining Useful Cases To find useful cases to a query, we designed a three-step retrieval mechanism including a quality estimation in the middle: 1. Retrieve candidate cases that are structurally similar to the query, 2. Retrieve promising candidate cases that are structurally transferable to the query with a promising result: – reuse the structure of each candidate case for the query, – estimate the quality of this transfer, and – reduce the set of candidates by means of a threshold, 3. Retrieve the best matching cases by means of a detailed, composite similarity computation.

Design of Geologic Structure Models with Case Based Reasoning

83

We will discuss the three steps and the adaptation in the following subsections after two general remarks on the mechanism. (1) We integrated step two afterwards what improved our retrieval results significantly. (2) In step one and three, we compared the according outcrops as illustrated in Fig. 5. And step two employs them analogically. Partner outcrops can be assigned directly or in mirrored form. To avoid effort at retrieval time, the mirrored cases could be stored in the case base, too.

Fig. 5. Direct and mirrored comparison of outcrops for two topological settings

4.1 Structural Similarity The first step of the retrieval mechanism is to find candidate cases that are structurally similar to the query. Therefore, we count the number of layers in the two main outcrops. A case must have an equal or higher number of layers on both sides than the query. The idea behind this is, that cases covering more than the query might be useful if there are not any cases available that have exactly the same numbers of layers. The lower layers of the case need to be cut off, as the upper layers are more important than the deeper ones. We have structured the case storage by means of kd-trees [7] (see Fig. 6) where the first ≥ means the number of layers on the bigger side and the second ≥ the number of layers on the smaller side. Note, that the queries are asked automatically for both, the direct and the mirrored form. The tree contains many redundant case pointers what simplifies the retrieval step one. It needs only to identify the matching leaf. Is this leaf not available as the cases are all too small we take the rightmost node to handle at least the upper part of the query. The result of the retrieval step one is the set of all cases connected with the selected leaf node. 4.2 Structural Transferability The second step of the retrieval mechanism reduces the set of candidate cases to those that are structurally transferable with a promising quality estimation. Each candiate

84

M. Minor and S. K¨oppen

Fig. 6. Organisation of the case storage with a kd-tree

case from step one is applied to the query as follows: Its structure is transferred to the query outcrops to build hypothetical connections of layers in the query. The connections are tranferred top down. The overall quality of the structural transfer is estimated by composing two values: 1. similarity of the proportion of layers, 2. quality of the hypothetical connection of layers. The similarity of the proportion of layers is computed by simpol : CASE × CASE → [0, 1] :  α1 , α1 < α2 2 simpol (case1 , case2 ) = α α2 , else, α1 where αi is the ratio of the numbers of layers in casei : αi =

number of layers of the f irst outcrop number of layers of the second outcrop .

For instance, if case1 has two and three layers and case2 has three and three layers then is α1 = 2/3, α2 = 1, and simpol (case1 , case2 ) = 2/3 1 = 2/3. The quality of the hypothetical connection of layers is estimated layer by layer li from both outcrops of the hypothetical case: qualityhypo N

1 N +M (

M i=1

i=1

sim (case, hypothetical

case) =

quality(lioutcrop 1 , case, hypothetical case)+

quality(lioutcrop 2 , case, hypothetical case)).

N and M are the numbers of layers in outcrop 1 and outcrop 2 of the hypothetical case. The quality function for single layers quality(loutcrop i , case, hypothetical case) uses

Design of Geologic Structure Models with Case Based Reasoning

85

Fig. 7. The quality of a hypothetical solution reusing the structure of a case

the geologic similarity function for layers siml (see Section 4.3) to compute the similarity between a layer and its hypothetical partner section in the connected outcrop: ⎧ ⎨ 0, lo i  case o i quality(l , case, h case) = 1, lo i  case not connected ⎩ siml (lo i , lo i+−1 ), else. An example of qualityhypo sim is illustrated in Fig. 7. The composite value for the quality estimation is 1/2 ∗ (simpol + qualityhypo sim ). 3/2 1 For the sample in Fig. 7, it is 1/2 ∗ [ 4/2 + 4+2 ((0.87 + 1 + 0.76 + 0) + (0.87 + 0.76))] = 1/2 ∗ [0.75 + 0.71] = 0.73. The result of retrieval step two is the set of candidate cases that reached a quality estimation value above a threshold. 4.3 Final Similarity Computation After determining promising candidate cases in step one and two of the retrieval mechanism, in step three the most similar cases to the query are computed by a ’usual’ retrieval process. The detailed and final similarity computation for a query and a case uses a composite similarity function sim(case1 , case2 ) that aggregates geologic and geometric similarity values: sim(case1 , case2 ) = 1/2(simgeo (case1 , case2 ) + simgeometric (case1 , case2 ). Geologic Similarity. The geologic similarity relates to properties of the stone. It needs four levels: 1. The first level similarity compares single attributes of a layer. 2. The second level similarity aggregates first level similarity values to compare one layer with another.

86

M. Minor and S. K¨oppen

3. The third level similarity aggregates similarity values of layers to compare two outcrops with each other, i.e. sequences of layers. 4. The fourth level similarity regards the constellation of all outcrops of a case. The first level similarity considers the following attributes of a layer (compare properties of a level in Section 2): upper and lower depth of the layer, strike and fall, stone parameters for stratigraphy, petrography, genesis, colour, and additional information. Each attribute has an own similarity function. The depth values are transformed to the thickness of the layer. The similarity simΔ is the proportion of thicknesses Δ1 of layer l1 and Δ2 of layer l2 :  Δ1 , Δ1 < Δ2 Δ2 simΔ (l1 , l2 ) = Δ 2 , else. Δ1 The strike and fall of a layer is compared by means of angles α and β. α describes the difference between the strike direction of the layer and the cross section. β is the fall of the layer itself as the relation to the cross section is already described by the thicknesses within the outcrops. A sample illustration is shown in Fig. 8.

Fig. 8. Strike and fall of a layer within a cross section

The similarity sims&f composes the strike and fall in equal shares: 1 −α2 ) + 90−(β901 −β2 ) ). sims&f (l1 , l2 ) = 1/2( 180−(α 180 The similarity values for stratigraphy, petrography, genesis, colour, and additional information are stored in similarity tables of possible values. We used manyfold types of geological knowledge to specify the values for the tables, e.g., on the hierarchy of ages or on the grain size of sedimentary earth. Due to space limits, we present just a section of the table for petrographic similarity, namely for magma stones in Table 4.3. The second level similarity computes the similarity of two layers siml (l1 , l2 ) by composing the first level similarity values with the following weights:

Design of Geologic Structure Models with Case Based Reasoning

87

Table 1. Similarity values simpet for magma stones

+G +Sy +Dr +Gb +R +Dz +B Lt+ Tr+

+G 1 0.6 0.75 0.4 0.3 0.2 0.1 0.1 0.1

+Sy 0.6 1 0.5 0.4 0.1 0.1 0.1 0.2 0.3

+DR 0.75 0.5 1 0.75 0.1 0.1 0.3 0.1 0.1

+Gb 0.4 0.4 0.75 1 0.1 0.1 0.3 0.1 0.1

+R 0.3 0.1 0.1 0.1 1 0.75 0.4 0.5 0.5

+Dz 0.2 0.1 0.1 0.1 0.75 1 0.75 0.4 0.3

+B 0.1 0.1 0.3 0.3 0.4 0.75 1 0.7 0.4

2simΔ +2sims&f +2simstr +9simm +3sims

+Lt 0.1 0.2 0.1 0.1 0.5 0.4 0.7 1 0.7

+Tr 0.1 0.3 0.1 0.1 0.5 0.3 0.4 0.7 1

+7simg +simc +sima

pet pet siml (l1 , l2 ) = . 28 simΔ compares the thickness of l1 and l2 , sims&f the strike and fall, simstr the s stratography, simm pet the main mixture, simpet the secondary mixture, simg the genesis, simc the colour, and sima the additional information. The third level similarity computes the similarity of two outcrops simo (o1 , o2 ) by composing the second level similarity values as normalized sum:  o1 o2 simo (o1 , o2 ) = (1/N N i=1 siml (li , li )). N is the minimum number of layers of o1 and o2 . The forth level similarity computes the similarity of two constellations of outcrops simgeo (case1 , case2 ) by composing the third level similarity values with a double weight for the main outcrops o1 and o2 : 1 2 1 2 , ocase ) + 2simo (ocase , ocase )+ simgeo (case1 , case2 ) = 1/6(2simo(ocase 1 1 2 2 case1 case2 case1 case2 simo (o3 , o3 ) + simo (o4 , o4 )), 1 2 1 2 , ocase ) is substituted by 0 if one of the outcrops ocase , ocase is where simo (ocase 4 4 4 4 not existing, and 1 if both are not existing.

Geometric Similarity. The geometric similarity relates to the location of outcrops. The location of outcrops consists of the height values of the outcrops as well as their position to each other. Both components have to be computed from the absolute values stored in the outcrop representations. The height values of the two main outcrops provide the angle of the slope of the land. If we have, for instance, two pairs of drillings with ascending slopes of 17 and 5 degree. Then the straight-forward similarity is the difference of the angles (12 in the example) divided through 180 as the maximum difference is 180 degrees. The similarity of the two angles in our example is about 0.93. To reach more realistic values, we think about taking the square of this value. The absolute positions of the three or four outcrops in a case can be used to estimate how ’spacious’ the case is. We compute the sides of the triangle or tetragon of outcrops and compare them with those of the query by 1/3(a/a + b/b + c/c ) for triangles and 1/4(a/a + b/b + c/c + d/d ) for tetragons. An alternative is to compare the areas, e.g., computed by taking the formula of Heron for triangles and Green’s theorem for tetragons.

88

M. Minor and S. K¨oppen

The results of the difference of angles and the size estimation are combined to simgeometric (case1 , case2 ) = 1/2(simangle (case1 , case2 )+simsize (case1 , case2 )). All values and functions used in this section can be modified independently as the similarity function is a composite one - despite of the manyfold ways to determine the partial values. The overall result of retrieval step three is an ordered set of the best matching cases from the set of promising candidate cases that has been the output of step two. 4.4 Adaptation of a Cross Section Case The user can select a case from the retrieval result and ask the system for adaptation. Then, the user modifies or cancels the adapted case. The automatic adaptation concerns – the coordinates of outcrops and layers, – the layer names, and – the thickness of layers in the virtual outcrops. The coordinates of the cross section have to be transferred: The original section is rotated and its length has to be scaled to the new outcrop coordinates.The relative position of a virtual outcrop must be preserved. The height of the virtual outcrop has to be modified. The names of the layers can be copied top down to the layers of the query case. Deeper layers of the original case might be cut off for a query case. The layers of the virtual outcrops have to be adapted to the thickness of the layers in the main outcrops (see Fig. 9). A linear interpolation is not realistic as more than one virtual outcrops may lie between two real outcrops. Instead, we go top down for each layer of the virtual outcrop and choose the nearest connected layer in a real outcrop. The proportion of the thickness of both layer parts of the original case is transferred to the dimensions of the new case. Has the virtual layer in the original half the size of the real layer, the virtual layer in the new case gets also half the size of the real layer in the new case.

Fig. 9. Scaling virtual layers

Design of Geologic Structure Models with Case Based Reasoning

89

5 Application and Evaluation A sample application is prototypically implemented in Visual Basic. It has an interface to the CAD program Autocad for the 3D visualisation according to the DIN4023 norm. The sample case base contains 37 cases of six solved projects from the ’Norddeutsche Tiefebene’, a glacial part of North Germany. The most projects cover an area of about 100 x 100 meters with a depth of 10 meters. The test case base consists of cases with horizontal bedding and cases with ending layers. A frequent geologic structure in projects is a lens. An example of a lens structure is shown in Fig. 10. The project of this example has four outcrops at the corners and one outcrop in the middle of the lens. It is described by eight cases.

Fig. 10. 3D visualisation of a project with a lens structure

To fill the sample case base, we solved only five cases with different structures by hand. The remaining 32 cases have been generated incrementally by the case based system. 9 of the 32 cases needed to be modified manually, 23 cases have been solved successfully and required only slight or none adaptation. Table 5 shows a manual evaluation of the test cases grouped by overlapping geologic phenomena. For further tests, the five completely manually solved cases have been asked as queries to the overall case base. The case itself was found in all cases and reached always one of the highest similarity values. The time savings during the editing of the Table 2. Six test projects with evaluation (from 0 = bad to 10 = perfect description

outcrops horizontal 4 4 ending layer lens structure 5 4 strong slope shallow slope 4 alluvial channel 6

cases new ;-) ;-( evaluation 5 1 0 4 9 5 1 1 3 8 8 0 1 7 9 5 0 3 2 7 5 0 2 3 8 9 3 2 4 6

90

M. Minor and S. K¨oppen

32 cases were about 15 minutes to half an hour per case compared to the construction of a cross section with Autocad only. This lets us hope that the system will be very useful in real scenarios. Reducing the system to the third step of the retrieval led to unacceptable results. We think about presenting also the result of the second retrieval step to allow the user to select one of those cases in case the third retrieval step has not produced sufficient cases.

6 Related Work Conventional software to support the design of geologic structure models are CAD programs. They use fix rules to construct cross sections or provide only graphical support. Surpac Minex Groups’ [6] Drillking has been developed for the collection of drilling data for mines. Drillking provides tables to entry the data and generates 3D models with the tool Surpac-Vision. The design task has to be solved manually. Drillking is not conform to the German DIN norm. Geographix’ [3] GESXplorer has been developed for the oil production and works similar to Drillking. Additionally, it provides support for project management. It is not DIN conform, too. The PROFILE system [5] is the result of a DFG research project on developing methods for the digital design of geologic structure models. It visualises the drilling data and supports the manual construction of cross sections layer by layer or drill core by drill core. A conclusion of the project that was finished in 1997 was that the automatic design is not possible at the moment as it needs experiential knowledge. We confirm this statement and have shown that case based reasoning technique helps to go one step further than PROFILE in supporting human geologists. The FABEL project [1] employs case based reasoning for design tasks in architecture. It uses conceptual clustering and conceptual analogy to find useful cases from the past. The case memory is clustered according to hierarchically organised concepts, e.g. the numbers of pipes and connections in a pipe system. The cases are represented as graphs. The geologic design task in our approach is easier to solve than the architectural design: The projects can be divided into independent cases of a less complex structure. So, we can use a composite similarity function instead of graph matching in FABEL. O’Sullivan et al. ([4]) use case based reasoning for the analysis of geo-spatial images. They use an annotation-based retrieval on textual annotations to series of geospatial images. In [2], they presented an algorithm to perform shape matching with grids of pixels. The structural similarity in our approach is comparable to the combination of topological, orientation, and distance properties in [2]. Fortunately, we don’t have to deal with noisy data like images. Our work is not related to spatial reasoning as it does not use an explicit or general model of space. It just compares particular domain-specific spatial parameters with a specific similarity function, e.g. the depth and orientation of two outcrops.

7 Conclusion and Future Work In this paper, we presented a case based approach to support geologists in designing structure models. We divided projects concerning a whole building ground into several

Design of Geologic Structure Models with Case Based Reasoning

91

cases to reduce the complexity and increase the comparability of the stored pieces of experience. Before a composite similarity function performs a classical retrieval process, the structurally best matching cases are preselected by means of an estimation how successfull their adaptation to the query might be. Many types of geologic knowledge have been integrated to define the composite similarity function for geologic and geometric properties. We implemented a prototype and modelled the knowledge for the ’Norddeutsche Tiefebene’. The knowledge is reusable for projects with similar glacial imprinting. The evaluation with a small set of cases was successfull and saved about one man-day of geological work. We would like to acknowledge the geo software company Fell-Kernbach for their support of this work!

References 1. K. B¨orner. CBR for Design. In M. Lenz, H.-D. Burkhard, B. Bartsch-Sp¨orl, and S. Weß, editors, Case-Based Reasoning Technology — From Foundations to Applications, LNAI 1400, Berlin, 1998. Springer Verlag. 2. J. D. Carswell, D. C. Wilson, and M. Bertolotto. Digital Image Similarity for Geo-spatial Knowledge Management. In S. Craw and A. Preece, editors, Advances in Case-Based Reasoning: 6th European Conference, ECCBR-2002, LNAI 2416, Berlin, 2002. Springer Verlag. 3. http://www.geographix.com/, 2005. 4. D. O’Sullivan, E. McLoughlin, M. Bertolotto, and D. C. Wilson. A case-based approach to managing geo-spatial imagery tasks. In P. Funk and P. A. Gonz´alez Calero, editors, Advances in Case-Based Reasoning: 7th European Conference, ECCBR-2004, number 3155 in Lecture Notes in Artificial Intelligence, pages 702–716, Berlin, 2004. Springer. 5. H. Preuß, M. Homann, and U. Dennert. Methodenentwicklung zur digitalen Erstellung von geologischen Profilschnitten als Vorbereitung zur 3D-Modellierung. Final report PR 236/3-1, DFG, Germany, 1997. 6. http://www.surpac.com/, 2005. 7. S. Weß, K.-D. Althoff, and G. Derwand. Using kd-trees to improve the retrieval step in casebased reasoning. In S. Wess, K.-D. Althoff, and M. M. Richter, editors, Topics in Case-Based Reasoning: First European Workshop, EWCBR-9 3, selected papers, number 837 in Lecture Notes in Artificial Intelligence, pages 167–181, Berlin, 1994. Springer.

Applying Constrained Linear Regression Models to Predict Interval-Valued Data Eufrasio de A. Lima Neto, Francisco de A.T. de Carvalho, and Eduarda S. Freire Centro de Informatica - CIn / UFPE, Av. Prof. Luiz Freire, s/n Cidade Universitaria, CEP: 50740-540 - Recife - PE - Brasil {ealn, fatc}@cin.ufpe.br

Abstract. Billard and Diday [2] were the first to present a regression method for interval-value data. De Carvalho et al [5] presented a new approach that incorporated the information contained in the ranges of the intervals and that presented a better performance when compared with the Billard and Diday method. However, both methods do not guarantee that the predicted values of the lower bounds (ˆ yLi ) will be lower than the predicted values of the upper bounds (ˆ yU i ). This paper presents two approaches based on regression models with inequality constraints that guarantee the mathematical coherence between the predicted values yˆLi and yˆU i . The performance of these approaches, in relation with the methods proposed by Billard and Diday [2] and De Carvalho et al [5], will be evaluated in framework of Monte Carlo experiments.

1

Introduction

The classical model of regression for usual data is used to predict the behaviour of a dependent variable Y as a function of other independent variables that are responsible for the variability of variable Y . However, to fit this model to the data, the estimation is necessary of a vector β of parameters from the data vector Y and the model matrix X, supposed with complete rank p. The estimation using the method of least square does not require any probabilistic hypothesis on the variable Y . This method consists of minimizing the sum of the square of residuals. A detailed study on linear regression models for classical data can be found in [6], [9], [10] among others. This paper presents two new methods for numerical prediction for intervalvalued data sets based on the constrained linear regression model theory. The probabilistic assumptions that usually are behind this model for usual data will not be considered in the case of symbolic data (interval variables), since this is still an open research topic. Thus, the problem will be investigated as an optimization problem, in which we desire to fit the best hyper plan that minimizes a predefined criterion. In the framework of Symbolic Data Analysis, Billard and Diday [2] presented for the first time an approach to fitting a linear regression model to an intervalvalued data-set. Their approach consists of fitting a linear regression model to U. Furbach (Ed.): KI 2005, LNAI 3698, pp. 92–106, 2005. c Springer-Verlag Berlin Heidelberg 2005 

Applying Constrained Linear Regression Models

93

the mid-point of the interval values assumed by the variables in the learning set and applies this model to the lower and upper bounds of the interval values of the independent variables to predict, respectively, the lower and upper bounds of the interval value of the dependent variable. De Carvalho et al [5] presented a new approach based on two linear regression models, the first regression model over the mid-points of the intervals and the second one over the ranges, which reconstruct the bounds of the interval-values of the dependent variable in a more efficient way when compared with the Billard and Diday method. However, both methods do not guarantee that the predicted values of the lower bounds (ˆ yLi ) will be lower than the predicted values of the upper bounds (ˆ yUi ). Judge and Takayama [7] consider the use of restrictions in regression models for usual data to guarantee the positiveness of the dependent variable. This paper suggests the use of constraints linear regression models in the approaches proposed by [2] and [5] to guarantee the mathematical coherence between the predicted values yˆLi and yˆUi . In the first approach we consider the regression linear model proposed by [2] with inequality constrained in the vector of parameters β which guarantee that the values of yˆLi will be lower than the values of yˆUi . The lower and upper bounds of the dependent variable are predicted, respectively, from the the lower and upper bounds of the interval values of the dependent variables. The second approach considers the regression linear model proposed by [5] that consist in two linear regression models, respectively, fitted over the midpoint and range of the interval values assumed by the variables on the learning set. However, the regression model fitted over the ranges of the intervals will take in consideration inequality constraints to guarantee that yˆLi will be lower than yˆUi . In order to show the usefulness of these approaches will be predicted according with the proposed methods in independent data sets, the lower and upper bound of the interval values of a variable which is linearly related to a set of independent interval-valued variables. The evaluation of the proposed prediction methods will be based on the estimation of the average behaviour of the root mean squared error and of the squared of the correlation coefficient in the framework of a Monte Carlo experiment. Section 2 presents the new approaches that fits a linear regression model with inequality constraints for interval-valued data. Section 3 describes the framework of the Monte Carlo simulations and presents experiments with artificial and real interval-valued data sets. Finally, the section 4 gives the concluding remarks.

2 2.1

Linear Regression Models for Interval-Valued Data The Constrained Center Method

The first approach consider the regression linear model proposed by [2] with inequality constrained in the vector of parameters β to guarantee that the values of yˆLi will be lower than the values of yˆUi .

94

E. de A. Lima Neto, F. de A.T. de Carvalho, and E.S. Freire

Let E = {e1 , . . . , en } be a set of the examples which are described by p + 1 intervals quantitative variables Y , X1 , . . . , Xp . Each example ei ∈ E (i = 1, . . . , n) is represented as an interval quantitative feature vector zi = (xi , yi ), xi = (xi1 , . . . , xip ), where xij = [aij , bij ] ∈ & = {[a, b] : a, b ∈ ', a ≤ b} (j = 1, . . . , p) and yi = [yLi , yUi ] ∈ & are, respectively, the observed values of Xj and Y . Let us consider Y (a dependent variable) related to X1 , . . . , Xp (the independent predictor variables), according to a linear regression relationship: yLi = β0 + β1 ai1 + . . . + βp aip + Li yUi = β0 + β1 bi1 + . . . + βp bip + Ui , constrained to βi ≥ 0, i = 0, . . . , p.

(1)

From the equations (1), we will denote the sum of squares of deviations in the first approach by S1 =

n  i=1

(Li + Ui )2 =

n 

(yLi − β0 − β1 ai1 − . . . − βp aip +

i=1

+ yUi − β0 − β1 bi1 − . . . − βp bip )2 ,

(2)

constrained to βi ≥ 0, i = 0, . . . , p, that represents the sum of the square of the lower and upper bounds errors. Additionally, let us consider the same set of examples E = {e1 , . . . , en } but now described by p+1 continuous quantitative variables Y md and X1md , . . . , Xpmd , which assume as value, respectively, the mid-point of the interval assumed by the interval-valued variables Y and X1 , . . . , Xp . This means that each example ei ∈ E (i = 1, . . . , n) is represented as a continuous quantitative feature vector md md md md wi = (xmd = (xmd i , yi ), xi i1 , . . . , xip ), where xij = (aij +bij )/2 (j = 1, . . . , p) md and yi = (yLi + yUi )/2 are, respectively, the observed values of Xjmd and Y md . The first approach proposed in this paper is equivalent to fit a linear regression model with inequality constraints in the vector of parameters β over the mid-point of the interval values assumed by the variables on the learning set, and applies this model on the lower and upper bounds of the interval values of the independent variables to predict, respectively, the lower and upper bounds of the interval value of the dependent variable. This method guarantee that the values of yˆLi will be lower than the values of yˆUi . In matrix notation, this first approach, called here constrained center method (CCM), can be rewritten as ymd = Xmd β md + md , with constraints βi ≥ 0, i = 0, . . . , p

(3)

T md T T md T md where ymd = (y1md , . . . , ynmd )T , Xmd = ((xmd 1 ) , . . . , (xn ) ) , (xi ) = (1, xi1 md md md T , . . . , xmd ) (i = 1, . . . , n), β = (β , . . . , β ) with constraints β ≥ 0, i 0 p ip md md md T i = 0, . . . , p and  = (1 , . . . , n ) . Notice that the model proposed in (3) will be fitted according to the inequality constraints βi ≥ 0, i = 0, 1, ..., p. Judge and Takayama [7] presented

Applying Constrained Linear Regression Models

95

a solution to this problem which uses iterative techniques from quadratic programming. Lawson and Hanson [8] presented an algorithm that guarantee the positiveness of the least squares estimates of the vector of parameters β, if Xmd has full rank p+1 ≤ n. The basic idea of the Lawson and Hanson [8] algorithm is identify the values that present incompatibility with the restriction, and change them to positive values through a re-weighting process. The algorithm has two main steps: initially, the algorithm find the classical least squares estimates to the vector of parameters β and identify the negative values in the vector of parameters. In the second step, the values which are incompatible with the restriction are re-weighted until they become positive. The authors gives a formal proof of the convergence of the algorithm. The approach proposed in this section uses the algorithm developed by Lawson and Hanson in the expression (3) to obtain the vector of parameters β. Given a new example e, described by z = (x, y), x = (x1 , . . . , xp ), where xj = [aj , bj ] (j = 1, . . . , p), the value y = [yL , yU ] of Y will be predicted by yˆ = [yˆL , yˆU ] as follow: ˆ and yˆU = (xU )T β, ˆ yˆL = (xL )T β

(4)

ˆ (βˆ0 , βˆ1 , . . . , βˆp )T where, (xL )T = (1, a1 , . . . , ap ), (xU )T = (1, b1 , . . . , bp ) and β= ˆ with βi ≥ 0, i = 0, . . . , p. 2.2

The Constrained Center and Range Method

The second approach consider the regression linear model proposed by [5]. However, the regression model fitted over the ranges of the intervals will take in consideration inequality constraints to guarantee that yˆLi will be lower than yˆUi . Let E = {e1 , . . . , en } be the same set of examples described by the interval quantitative variables Y , X1 , . . . , Xp . Let Y md and Xjmd (j = 1, 2, . . . , p) be, respectively, quantitative variables that represent the midpoint of the interval variables Y and Xj (j = 1, 2, . . . , p) and let Y r and Xjr (j = 1, 2, . . . , p) be, respectively, quantitative variables that represent the ranges of these interval variables. This means that each example ei ∈ E (i = 1, . . . , n) is represented by md md md r r r two vectors wi = (xmd = (xmd i , yi ), xi i1 , . . . , xip ) and ri = (xi , yi ), xi = r r md r md (xi1 , . . . , xip ), where xij = (aij + bij )/2, xij = (bij − aij ), yi = (yLi + yUi )/2 and yir = (yUi − yLi ) are, respectively, the observed values of Xjmd , Xjr , Y md and Y r . Let us consider Y md and Y r being dependent variables and Xjmd and Xjr (j = 1, 2, . . . , p) being independent predictor variables, which related according to the following linear regression relationship: md md md yimd = β0md + β1md xmd i1 + . . . + βp xip + i

yir = β0r + β1r xri1 + . . . + βpr xrip + ri with constraints

βir

≥ 0, i = 0, . . . , p.

(5)

96

E. de A. Lima Neto, F. de A.T. de Carvalho, and E.S. Freire

From equation (5), we will express the sum of squares of deviations in the second approach as S2 =

n 

md md 2 (yimd − β0md − β1md xmd i1 − . . . − βp xi1 ) +

i=1

+

n 

(yir − β0r − β1r xri1 − . . . − βpr xrip )2

(6)

i=1

constrained to βir ≥ 0, i = 0, . . . , p, that represent the sum of the midpoint square error plus the sum of the range square error considering independent vectors of parameters to predict the midpoint and the range of the intervals. This second approach consider the regression linear model proposed by [5] that fits two independent linear regression models, respectively, over the midpoint and range of the interval values assumed by the variables on the learning set. However, the regression model fitted over the ranges of the intervals take in consideration inequality constraints to guarantee that yˆLi is lower than yˆUi . The prediction of the lower and upper bound of an interval value of the dependent variable is accomplished from its mid-point and range which are estimated from the fitted linear regression models applied to the mid-point and range of each interval value of the independent variables. In matrix notation, this second approach, called here constrained center and range method (CCRM) can be rewritten as ymd = Xmd β md + md , yr = Xr β r + r , with constraints βir ≥ 0, i = 0, . . . , p,

(7)

where, X and X has full rank p + 1 ≤ n, y = (y1md , . . . , ynmd )T , Xmd = md T md T T md T md md ((x1 ) , . . . , (xn ) ) , (xi ) = (1, xi1 , . . . , xip ), β md = (β0md , . . . , βpmd ), yr = (y1r , . . . , ynr )T , Xr = ((xr1 )T , . . . , (xrn )T )T , (xri )T = (1, xri1 , . . . , xrip ), β r = r md (β0r , . . . , βpr ), xmd = (yLi + yUi )/2 and ij = (aij + bij )/2, xij = (bij − aij ), yi r yi = (yUi − yLi). md md

r

md

Notice that the least square estimates of the vector of parameters β first equation of the expression (7) is given by ˆmd = ((Xmd )T Xmd )−1 (Xmd )T ymd . β

in the

(8)

However, the second equation in expression (7) must be fitted according to the inequality constraints βir ≥ 0, i = 0, 1, ..., p. To accomplish this task we will use the algorithm proposed by Lawson and Hanson [8]. In this way we will obtain the vector of parameters β r according to these inequality constraints. Given a new example e, described by z = (x, y) with x = (x1 , . . . , xp ), w = md r r r r r (xmd , y md ) with xmd = (xmd 1 , . . . , xp ) and r = (x , y ) with x = (x1 , . . . , xp ), md r where xj = [aj , bj ], xj = (aj + bj )/2 and xj = (bj − aj ) (j = 1, . . . , p), the value y = [yL , yU ] of Y will be predicted from the predicted values yˆmd of Y md and yˆr of Y r as follow:

Applying Constrained Linear Regression Models

yˆL = yˆmd − (1/2)ˆ y r and yˆU = yˆmd + (1/2)ˆ yr

97

(9)

md ˆmd , yˆr = (˜ ˆr , (˜ where yˆmd = (˜ xmd )T β xr )T β xmd )T = (1, xmd xr )T = 1 , . . . , xp ), (˜ r r md md ˆmd md T r r ˆr r T ˆ ˆ ˆ ˆ ˆ ˆ (1, x1 , . . . , xp ), β = (β0 , β1 , . . . , βp ) and β = (β0 , β1 , . . . , βp ) with r ˆ βi ≥ 0, i = 0, . . . , p.

3

The Monte Carlo Experiences

To show the usefulness of the approaches proposed in this paper in comparison with the methods proposed by [2] (here called Center Method - CM ) and [5] (here called Center and Range Method - CRM ), experiments with simulated intervalvalued data sets with different degrees of difficulty to fit a linear regression model are considered in this section, along with a cardiological data set (see [2]). 3.1

Simulated Interval-Valued Data Sets

We consider standard quantitative data sets in '2 and '4 . Each data set (in '2 or '4 ) has 375 points partitioned in a learning set (250 points) and a test set (125 points). Each data point belonging to the standard data set is a seed for an interval data (a rectangle in '2 or a hypercube in '4 ). In this way, the interval data sets are obtained from these standard data sets. The construction of the standard data sets and the corresponding interval data sets is accomplished in the following steps: s1 ) Let us suppose that each random variables Xjmd is uniformly distributed in the interval [a, b]; at each iteration 375 values of each variable Xjmd are randomly selected; s2 ) The random variable Y md is supposed to be related to variables Xjmd according to Y md = (Xmd )T β + , where (Xmd )T = (1, X1md) and β = (β0 = U [c, d], β1 = U [c, d])T in '2 , or (Xmd )T = (1, X1md , X2md, X3md ) and β = (β0 = U [c, d], β1 = U [c, d], β2 = U [c, d], β3 = U [c, d])T in '4 and  = U [e, f ]; the mid-points of these 375 intervals are calculated according this linear relation; s3 ) Once the mid-points of the intervals are obtained, let us consider the range of each interval. Let us suppose that each random variable Y r , Xjr (j = 1, 2, 3) is uniformly distributed, respectively, in the intervals [g, h] and [i, j]; at each iteration 375 values of each variable Y r , Xjr are randomly selected, which are the range of these intervals; s4 ) The interval-valued data set is partitioned in learning (250 observations) and test (125 observations) data sets. Table 1 shows nine different configurations for the interval data sets which are used to measure de performance of the CCM and CCRM methods in comparison with the approaches proposed by [2] and [5]. These configurations take into account the combination of two factors (range and error on the mid-points) with three degrees of variability (low, medium and

98

E. de A. Lima Neto, F. de A.T. de Carvalho, and E.S. Freire Table 1. Data set configurations C1 C2 C3 C4 C5 C6 C7 C8 C9

Xjmd Xjmd Xjmd Xjmd Xjmd Xjmd Xjmd Xjmd Xjmd

∼ U [20, 40] ∼ U [20, 40] ∼ U [20, 40] ∼ U [20, 40] ∼ U [20, 40] ∼ U [20, 40] ∼ U [20, 40] ∼ U [20, 40] ∼ U [20, 40]

Xjr ∼ U [20, 40] Xjr ∼ U [20, 40] Xjr ∼ U [20, 40] Xjr ∼ U [10, 20] Xjr ∼ U [10, 20] Xjr ∼ U [10, 20] Xjr ∼ U [1, 5] Xjr ∼ U [1, 5] Xjr ∼ U [1, 5]

Y r ∼ U [20, 40] Y r ∼ U [20, 40] Y r ∼ U [20, 40] Y r ∼ U [10, 20] Y r ∼ U [10, 20] Y r ∼ U [10, 20] Y r ∼ U [1, 5] Y r ∼ U [1, 5] Y r ∼ U [1, 5]

 ∼ U [−20, 20]  ∼ U [−10, 10]  ∼ U [−5, 5]  ∼ U [−20, 20]  ∼ U [−10, 10]  ∼ U [−5, 5]  ∼ U [−20, 20]  ∼ U [−10, 10]  ∼ U [−5, 5]

high): low variability range (U [1, 5]), medium variability range (U [10, 20]), high variability range (U [20, 40], low variability error (U [−5, 5]), medium variability error (U [−10, 10]) and high variability error (U [−20, 20]). The configuration C1 , for example, represents observations with a high variability range and with a poor linear relationship between Y and the dependent variables due the high variability error on the mid-points. Figure 1 shows the configuration C1 when the data is in '2 .

Fig. 1. Configuration C1 showing a poor linear relationship between Y and X1

On the other hand, the configuration C9 represents observations with a low variability range and with a rich linear relationship between Y and the dependent variables due the low variability error on the mid-points. Figure 2 shows the configuration C9 when the data is in '2 . 3.2

Experimental Evaluation

The performance assessment of these linear regression models is based on the following measures: the lower bound root mean-square error (RM SEL ), the upper bound root mean-square error (RM SEU ), the square of the lower bound

Applying Constrained Linear Regression Models

99

Fig. 2. Configuration C9 showing a rich linear relationship between Y and X1 2 correlation coefficient (rL ) and the square of the upper bound correlation coeffi2 cient (rU ). These measures are obtained from the observed values yi = [yLi , yUi ] (i = 1, . . . , n) of Y and from their corresponding predicted values yˆi = [ˆ yLi , yˆUi ], denoted as:  n  n 2 (y − y ˆ ) ˆUi )2 Li Li i=1 i=1 (yUi − y and RM SEU = (10) RM SEL = n n 2 rL =

ˆL) ˆU ) Cov(yL , y Cov(yU , y 2 and rU = SyL Syˆ L SyU Syˆ U

(11)

ˆ L = (ˆ where: yL = (yL1 , . . . , yLn )T , y yL1 , . . . , yˆLn )T , yU = (yU1 , . . . , yUn )T , T ˆ U = (ˆ ˆ • ) is the covariance between yL and y ˆ L or y yU1 , . . . , yˆUn ) , Cov(y• , y ˆU , Sy• is the standard deviation of yL or yU and Syˆ • is the between yU and y ˆU . ˆ L or y standard deviation of y These measures are estimated for the CM, CCM, CRM and CCRM methods in the framework of a Monte Carlo simulation with 100 replications for each independent test interval data set, for each of the nine fixed configurations, as well as for different numbers of independent variables in the model matrix X. At each replication, we fitted a linear regression model to the training interval data set using the CM, CCM, CRM and CCRM methods. Thus, the fitted regression models are used to predict the interval values of the dependent variable Y in the test interval data set and these measures are calculated. For each measure, the average and standard deviation over the 100 Monte Carlo simulations is calculated and then a statistical t-test is applied to compare the performance of the approaches. Additionally, this procedure is repeated considering 50 different values for the parameter vector β selected randomly in the interval [−10, 10]. The comparison between these approaches is achieved by a statistical t-test for independent samples at a significance level of 1%, applied to each measure considered. For each configuration presented in the Table 1, number of independent variables and methods compared is calculated the ratio of times that the hypothesis H0 is rejected for each measure presented in the equations (10) and (11), considering the 50 different vector of parameters β.

100

E. de A. Lima Neto, F. de A.T. de Carvalho, and E.S. Freire

For any two methods compared (A and B, in this order ), concerning RM SEL and RM SEU measures (the higher the measures, the worse the method ), the null and alternative hypotheses are structured as: H0 : Method A ≥ Method B versus H1 : Method A < Method B; 2 2 Concerning rL and rU measures (the higher the measures, the better the method ), the hypotheses are structured as: H0 : Method A ≤ Method B versus H1 : Method A > Method B; In the Table 2, we illustrate for the CRM and CCRM methods (in this order) the results of the Monte Carlo experience for a specific vector of parameters β. This table shows that, for any considered situation and for all the measures considered, the CRM and CCRM methods presented an average performance very similar. The comparison between the CRM and CCRM methods (in this order) in relation to 50 different values for the parameter vector β showed that the null hypothesis was never rejected at a significance level of 1%, regardless the number of variables, the performance measure considered or the configuration. This means that the average performance of the CRM method is not better than the CCRM method. The same conclusion was again found in the comparisons of the CCRM method against CRM method (in this order). This corroborates the hypothesis that the approaches present, in average, the same performance. Then, concerning the model presented by [5] we conclude that the use of inequality constraints to guarantee that yˆLi will be lower than yˆLi does not penalize, statistically, its prediction performance. Then, the CCRM method can be preferred when compared with the CRM In the Table 3 we present the results of the comparison between CM and CCM methods (in this order). In contrast with the results presented above, we conclude that the use of inequality constraints in the method proposed by [2] penalizes its prediction performance. We can see that the growing of the number of independent variables produced an increase in the percentage of rejection of the null hypothesis H0 . This means that the performance difference between the CM and CCM methods increases when the number of independent variables in the model increases. Comparing the configurations C1 , C4 and C7 , that differ with respect to the variability on the range of the intervals, notice that when the interval’s range decreases, the percentage of rejection of the null hypothesis H0 increases. We have a similar conclusion comparing the configurations (C2 , C5 , C8 ) and (C3 , C6 , C9 ). However, it’s important remember that the CM method does not guarantee that the values of yˆLi will be lower than the values of yˆUi . Finally, the results presented in Table 4 show the superiority of the CCRM method when compared with the CM method (in this order). We can see that the percentage of rejection of the null hypothesis at a significance level of 1%, regardless the number of variables or the performance measure considered, is always higher than 80%. This indicate that the CCRM method has a better performance than the CM approach. When the number of variables is 3, in all configurations except C7 , the null hypothesis H0 was rejected in 100% of the cases.



Table 2. Comparison between CRM and CCRM methods - Average (standard deviation) of each measure calculated from the 100 replications in the framework of a Monte Carlo experiment for a specific vector of parameters β

Config. p | RMSE_L CRM | RMSE_L CCRM | RMSE_U CRM | RMSE_U CCRM | R²_L(%) CRM | R²_L(%) CCRM | R²_U(%) CRM | R²_U(%) CCRM
C1 1 | 19.874 (1.180) | 19.874 (1.179) | 19.718 (1.121) | 19.718 (1.123) | 93.85 (0.78) | 93.86 (0.78) | 94.07 (0.76) | 94.07 (0.76)
C1 3 | 19.870 (1.099) | 19.866 (1.096) | 19.697 (1.070) | 19.696 (1.069) | 97.28 (0.38) | 97.29 (0.38) | 97.25 (0.42) | 97.25 (0.42)
C2 1 | 16.919 (0.751) | 16.919 (0.750) | 17.076 (0.763) | 17.076 (0.764) | 94.76 (0.73) | 94.76 (0.74) | 95.01 (0.63) | 95.01 (0.63)
C2 3 | 19.778 (1.254) | 19.770 (1.251) | 19.648 (1.082) | 19.644 (1.083) | 91.98 (1.02) | 91.99 (1.02) | 91.83 (1.10) | 91.83 (1.10)
C3 1 | 16.310 (0.643) | 16.310 (0.644) | 16.337 (0.599) | 16.337 (0.599) | 97.32 (0.35) | 97.32 (0.34) | 97.33 (0.36) | 97.34 (0.36)
C3 3 | 16.342 (0.616) | 16.340 (0.617) | 16.316 (0.642) | 16.315 (0.643) | 98.28 (0.22) | 98.28 (0.22) | 98.27 (0.24) | 98.27 (0.25)
C4 1 | 14.060 (0.800) | 14.059 (0.800) | 14.122 (0.880) | 14.120 (0.879) | 72.52 (3.31) | 72.52 (3.31) | 72.41 (3.46) | 72.41 (3.46)
C4 3 | 14.082 (0.825) | 14.081 (0.825) | 13.959 (0.771) | 13.957 (0.769) | 94.64 (0.72) | 94.64 (0.72) | 94.62 (0.76) | 94.62 (0.76)
C5 1 | 9.915 (0.612) | 9.915 (0.612) | 9.872 (0.583) | 9.871 (0.584) | 96.91 (0.43) | 96.61 (0.43) | 96.63 (0.36) | 96.64 (0.36)
C5 3 | 9.955 (0.493) | 9.956 (0.494) | 9.905 (0.498) | 9.904 (0.498) | 97.91 (0.25) | 97.91 (0.25) | 97.93 (0.28) | 97.93 (0.28)
C6 1 | 8.490 (0.370) | 8.490 (0.370) | 8.545 (0.370) | 8.545 (0.370) | 99.02 (0.14) | 99.02 (0.13) | 99.00 (0.13) | 99.00 (0.13)
C6 3 | 8.530 (0.391) | 8.529 (0.390) | 8.577 (0.394) | 8.577 (0.395) | 99.53 (0.07) | 99.53 (0.07) | 99.53 (0.08) | 99.53 (0.08)
C7 1 | 11.863 (0.540) | 11.863 (0.540) | 11.851 (0.515) | 11.851 (0.515) | 27.14 (6.81) | 27.14 (6.80) | 27.20 (6.90) | 27.20 (6.91)
C7 3 | 11.700 (0.524) | 11.701 (0.524) | 11.753 (0.431) | 11.752 (0.431) | 94.48 (0.63) | 94.48 (0.63) | 94.50 (0.60) | 94.50 (0.60)
C8 1 | 6.121 (0.285) | 6.121 (0.285) | 6.103 (0.296) | 6.103 (0.295) | 98.85 (0.13) | 98.85 (0.13) | 96.86 (0.14) | 96.86 (0.14)
C8 3 | 6.145 (0.298) | 6.145 (0.296) | 6.125 (0.328) | 6.124 (0.329) | 98.85 (0.16) | 98.85 (0.16) | 98.86 (0.15) | 98.86 (0.15)
C9 1 | 3.476 (0.206) | 3.475 (0.205) | 3.455 (0.187) | 3.456 (0.187) | 81.56 (2.51) | 81.57 (2.51) | 81.85 (2.29) | 81.85 (2.29)
C9 3 | 3.455 (0.181) | 3.454 (0.180) | 3.489 (0.202) | 3.489 (0.202) | 99.52 (0.06) | 99.52 (0.06) | 99.51 (0.07) | 99.51 (0.07)

Additionally, in all configurations, increasing the number of independent variables produced an increase in the percentage of rejection of the null hypothesis H0. This means that the performance difference between the CCRM and CM methods increases when the number of independent variables in the model increases.



Table 3. Comparison between CM and CCM methods - Percentage of rejection of the null hypothesis H0 (T-test for independent samples following a Student's t distribution with 198 degrees of freedom)

Configuration p | RMSE_L | RMSE_U | r²_L(%) | r²_U(%)
C1 1 | 22% | 1% | 0% | 0%
C1 3 | 26% | 20% | 34% | 32%
C2 1 | 22% | 2% | 0% | 0%
C2 3 | 24% | 24% | 36% | 36%
C3 1 | 18% | 2% | 0% | 0%
C3 3 | 26% | 16% | 28% | 28%
C4 1 | 52% | 44% | 0% | 0%
C4 3 | 74% | 80% | 22% | 22%
C5 1 | 62% | 48% | 0% | 0%
C5 3 | 80% | 88% | 38% | 38%
C6 1 | 46% | 36% | 0% | 0%
C6 3 | 78% | 88% | 38% | 38%
C7 1 | 60% | 62% | 0% | 0%
C7 3 | 86% | 88% | 42% | 42%
C8 1 | 58% | 56% | 0% | 0%
C8 3 | 80% | 84% | 42% | 42%
C9 1 | 60% | 60% | 0% | 0%
C9 3 | 90% | 90% | 36% | 36%

Now, comparing the configurations C1, C2 and C3, which differ with respect to the error on the mid-points of the intervals, notice that when the variability of the error decreases, the percentage of rejection of the null hypothesis H0 increases. This shows that the more linear the relationship between the variables, the higher the statistical significance of the difference between the CCRM and CM methods. We arrive at the same conclusion when comparing the configurations (C4, C5, C6) and (C7, C8, C9). On the other hand, comparing the configurations C1, C4 and C7, which differ with respect to the variability of the range of the intervals, notice that when the intervals' range decreases, the percentage of rejection of the null hypothesis H0 decreases as well. A similar conclusion holds when comparing the configurations (C2, C5, C8) and (C3, C6, C9).

3.3 Cardiological Interval Data Set

This data set (Table 5) records the pulse rate Y, the systolic blood pressure X1 and the diastolic blood pressure X2 for each of eleven patients (see [2]). The aim is to predict the interval values y of Y (the dependent variable) from xj (j = 1, 2) through a linear regression model. The linear regression models fitted by the CM, CCM, CRM and CCRM methods on the cardiological interval data set are presented below:



Table 4. Comparison between CCRM and CM methods - Percentage of rejection of the null hypothesis H0 (T-test for independent samples following a Student's t distribution with 198 degrees of freedom)

Configuration p | RMSE_L | RMSE_U | r²_L(%) | r²_U(%)
C1 1 | 100% | 100% | 97% | 97%
C1 3 | 100% | 100% | 100% | 100%
C2 1 | 100% | 100% | 96% | 96%
C2 3 | 100% | 100% | 100% | 100%
C3 1 | 100% | 100% | 98% | 98%
C3 3 | 100% | 100% | 100% | 100%
C4 1 | 100% | 100% | 94% | 94%
C4 3 | 100% | 100% | 100% | 100%
C5 1 | 100% | 100% | 96% | 94%
C5 3 | 100% | 100% | 100% | 100%
C6 1 | 100% | 100% | 100% | 100%
C6 3 | 100% | 100% | 100% | 100%
C7 1 | 94% | 94% | 84% | 80%
C7 3 | 98% | 98% | 98% | 98%
C8 1 | 100% | 100% | 94% | 94%
C8 3 | 100% | 100% | 100% | 100%
C9 1 | 100% | 100% | 100% | 100%
C9 3 | 100% | 100% | 100% | 100%

Table 5. Cardiological interval data set

u | Pulse rate | Systolic blood pressure | Diastolic blood pressure
1 | [44-68] | [90-100] | [50-70]
2 | [60-72] | [90-130] | [70-90]
3 | [56-90] | [140-180] | [90-100]
4 | [70-112] | [110-142] | [80-108]
5 | [54-72] | [90-100] | [50-70]
6 | [70-100] | [130-160] | [80-110]
7 | [72-100] | [130-160] | [76-90]
8 | [76-98] | [110-190] | [70-110]
9 | [86-96] | [138-180] | [90-110]
10 | [86-100] | [110-150] | [78-100]
11 | [63-75] | [60-100] | [140-150]

CM: ŷL = 21.17 + 0.33a1 + 0.17a2 and ŷU = 21.17 + 0.33b1 + 0.17b2;
CCM: ŷL = 21.17 + 0.33a1 + 0.17a2 and ŷU = 21.17 + 0.33b1 + 0.17b2;
CRM: ŷ^md = 21.17 + 0.33x1^md + 0.17x2^md and ŷ^r = 20.13 − 0.15x1^r + 0.35x2^r;
CCRM: ŷ^md = 21.17 + 0.33x1^md + 0.17x2^md and ŷ^r = 17.96 + 0.20x2^r.
From the fitted linear regression models for interval-valued data listed above we present, in Table 6, the predicted interval values of the dependent variable pulse rate, in accordance with the CM, CCM, CRM and CCRM models.
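For illustration, the following hedged sketch shows one way to fit a constrained center-and-range model in the spirit of CCRM: ordinary least squares on the interval mid-points and non-negative least squares (Lawson-Hanson NNLS [8]) on the ranges, so that the predicted ranges, and hence ŷU − ŷL, remain non-negative. This is a sketch of the idea under stated assumptions, not necessarily the authors' exact estimation procedure; all function names are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

def fit_ccrm(X_L, X_U, y_L, y_U):
    """Sketch of a constrained center-and-range fit: OLS on mid-points,
    NNLS on ranges (coefficients constrained to be non-negative)."""
    X_mid, X_r = (X_L + X_U) / 2.0, X_U - X_L
    y_mid, y_r = (y_L + y_U) / 2.0, y_U - y_L
    A_mid = np.column_stack([np.ones(len(y_mid)), X_mid])
    A_r = np.column_stack([np.ones(len(y_r)), X_r])
    beta_mid, *_ = np.linalg.lstsq(A_mid, y_mid, rcond=None)
    beta_r, _ = nnls(A_r, y_r)                      # non-negative range coefficients
    return beta_mid, beta_r

def predict_ccrm(beta_mid, beta_r, X_L, X_U):
    X_mid, X_r = (X_L + X_U) / 2.0, X_U - X_L
    mid = np.column_stack([np.ones(len(X_mid)), X_mid]) @ beta_mid
    rng = np.column_stack([np.ones(len(X_r)), X_r]) @ beta_r
    return mid - rng / 2.0, mid + rng / 2.0         # (yhat_L, yhat_U)
```

Because the predicted range is non-negative by construction, the predicted lower bound can never exceed the predicted upper bound, which is exactly the property the constrained methods are designed to guarantee.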



Table 6. Predicted interval values of the dependent variable pulse rate, according to the CM, CCM, CRM and CCRM models

u | CM | CCM | CRM | CCRM
1 | [59-66] | [59-66] | [50-75] | [52-74]
2 | [63-79] | [63-79] | [60-82] | [60-82]
3 | [83-97] | [83-97] | [81-99] | [80-100]
4 | [71-86] | [71-86] | [66-99] | [67-90]
5 | [59-66] | [59-66] | [50-75] | [52-74]
6 | [78-92] | [78-92] | [72-98] | [73-97]
7 | [77-89] | [77-89] | [73-93] | [73-93]
8 | [69-102] | [69-102] | [75-97] | [73-99]
9 | [82-99] | [82-99] | [80-101] | [79-101]
10 | [71-87] | [71-87] | [68-90] | [68-90]
11 | [65-80] | [65-80] | [63-81] | [62-82]

Table 7. Performance of the methods in the cardiological interval data set

Method | RMSE_L | RMSE_U | r²_L(%) | r²_U(%)
CM | 11.09 | 10.41 | 30.29 | 53.47
CCM | 11.09 | 10.41 | 30.29 | 53.47
CRM | 9.81 | 8.94 | 41.54 | 63.34
CCRM | 9.73 | 9.18 | 41.04 | 59.41

The performance of the CM, CCM, CRM and CCRM methods is evaluated through the calculation of the RMSE_L, RMSE_U, r²_L and r²_U measures on this cardiological interval data set. From Table 7, we can see that the performance of the methods on this cardiological data set is similar to what was demonstrated in the Monte Carlo simulations. We conclude that the CRM and CCRM methods outperform the CM and CCM methods, which present the same performance in this particular case, and that the CRM and CCRM methods present a very close performance.

4 Concluding Remarks

In this paper we presented two new methods to fit a regression model on interval-valued data which consider inequality constraints to guarantee that the predicted values of the lower bounds ŷLi will be lower than the predicted values of the upper bounds ŷUi. The first approach considers the linear regression model proposed by [2] and adds inequality constraints on the vector of parameters β. The second approach considers the linear regression model proposed by [5] and adds, in the regression model fitted over the ranges of the intervals, inequality constraints to guarantee that ŷLi will be lower than ŷUi.
The assessment of the proposed prediction methods was based on the average behaviour of the root mean square error and the square of the correlation



coefficient in the framework of a Monte Carlo simulation. The performance of the proposed approaches was also measured on a cardiological interval data set.
The comparison between the CRM and CCRM methods showed that the use of inequality constraints to guarantee that ŷLi will be lower than ŷUi does not, statistically, penalize the model's prediction performance. The CCRM method was therefore preferred over the CRM method. On the other hand, we conclude that the use of inequality constraints in the method proposed by [2] penalizes its prediction performance. We observed that the performance difference between the CM and CCM methods increases when the number of independent variables in the model increases.
The results showed the superiority of the CCRM method when compared with the CM method. The percentage of rejection of the null hypothesis at a significance level of 1%, regardless of the number of variables or the performance measure considered, is always higher than 80%. The performance difference between the CCRM and CM methods increases when the number of independent variables in the model increases. Moreover, the more linear the relationship between the variables, the higher the statistical significance of the difference between the CCRM and CM methods.
Finally, the performance of the methods on the cardiological data set was similar to that shown in the Monte Carlo simulations. We concluded that the CRM and CCRM methods outperform the CM and CCM methods, and that the CRM and CCRM methods present a very similar performance.
Acknowledgments. The authors would like to thank CNPq and CAPES (Brazilian agencies) for their financial support.

References

1. Bock, H.H., Diday, E.: Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer, Berlin Heidelberg (2000)
2. Billard, L., Diday, E.: Regression Analysis for Interval-Valued Data. In: Data Analysis, Classification and Related Methods: Proceedings of the Seventh Conference of the International Federation of Classification Societies, IFCS-2000, Namur (Belgium), Kiers, H.A.L. et al. (Eds.), Vol. 1, Springer (2000) 369-374
3. Billard, L., Diday, E.: Symbolic Regression Analysis. In: Classification, Clustering and Data Analysis: Proceedings of the Eighth Conference of the International Federation of Classification Societies, IFCS-2002, Cracow (Poland), Jajuga, K. et al. (Eds.), Vol. 1, Springer (2002) 281-288
4. Billard, L., Diday, E.: From the Statistics of Data to the Statistics of Knowledge: Symbolic Data Analysis. Journal of the American Statistical Association, Vol. 98 (2003) 470-487
5. De Carvalho, F.A.T., Lima Neto, E.A., Tenorio, C.P.: A New Method to Fit a Linear Regression Model for Interval-Valued Data. In: Advances in Artificial Intelligence: Proceedings of the Twenty-Seventh German Conference on Artificial Intelligence, KI-04, Ulm (Germany), Biundo, S. et al. (Eds.), Vol. 1, Springer (2004) 295-306



6. Draper, N.R., Smith, H.: Applied Regression Analysis. John Wiley, New York (1981)
7. Judge, G.G., Takayama, T.: Inequality Restrictions in Regression Analysis. Journal of the American Statistical Association, Vol. 61 (1966) 166-181
8. Lawson, C.L., Hanson, R.J.: Solving Least Squares Problems. Prentice-Hall, New York (1974)
9. Montgomery, D.C., Peck, E.A.: Introduction to Linear Regression Analysis. John Wiley, New York (1982)
10. Scheffé, H.: The Analysis of Variance. John Wiley, New York (1959)

On Utilizing Stochastic Learning Weak Estimators for Training and Classification of Patterns with Non-stationary Distributions

B. John Oommen¹ and Luis Rueda²

¹ School of Computer Science, Carleton University, 1125 Colonel By Dr., Ottawa, ON, K1S 5B6, Canada
[email protected]
² School of Computer Science, University of Windsor, 401 Sunset Avenue, Windsor, ON, N9B 3P4, Canada
[email protected]

Abstract. Pattern recognition essentially deals with the training and classification of patterns, where the distribution of the features is assumed unknown. However, in almost all the reported results, a fundamental assumption made is that this distribution, although unknown, is stationary. In this paper, we shall relax this assumption and assume that the class-conditional distribution is non-stationary. To now render the training and classification feasible, we present a novel estimation strategy, involving the so-called Stochastic Learning Weak Estimator (SLWE). The latter, whose convergence is weak, is used to estimate the parameters of a binomial distribution using the principles of stochastic learning. Even though our method includes a learning coefficient, λ, it turns out that the mean of the final estimate is independent of λ, the variance of the final distribution decreases with λ, and the speed decreases with λ. Similar results are true for the multinomial case. To demonstrate the power of these estimates in data which is truly “nonstationary”, we have used them in two pattern recognition exercises, the first of which involves artificial data, and the second which involves the recognition of the types of data that are present in news reports of the Canadian Broadcasting Corporation (CBC). The superiority of the SLWE in both these cases is demonstrated.

1 Introduction

Every Pattern Recognition (PR) problem essentially deals with two issues: the training and then the classification of the patterns. Generally speaking, the fundamental assumption made is that the class-conditional distribution, although unknown, is stationary. It is interesting to note that this tacit assumption is so all-pervading - all the data sets included in the acclaimed UCI machine learning repository [2] represent data in which the class-conditional distribution is assumed to be stationary. Our intention here is to





propose training (learning) and classification strategies when the class-conditional distribution is non-stationary. Indeed, we shall approach this problem by studying the fundamental statistical issue, namely that of estimation. Since the problem involves random variables, the training/classification decisions are intricately dependent on the system obtaining reliable estimates on the parameters that characterize the underlying random variable. These estimates are computed from the observations of the random variable itself. Thus, if a problem can be modeled using a random variable which is normally (Gaussian) distributed, the underlying statistical problem involves estimating the mean and the variance. The theory of estimation has been studied for hundreds of years [1,3,8,24,26,27]. It is also easy to see that the learning (training) phase of a statistical Pattern Recognition (PR) system is, indeed, based on estimation theory [4,5,6,32]. Estimation methods generally fall into various categories, including the Maximum Likelihood Estimates (MLE) and the Bayesian family of estimates [1,3,4,5]. Although the MLEs and Bayesian estimates have good computational and statistical properties, the fundamental premise for establishing the quality of estimates is based on the assumption that the parameter being estimated does not change with time. In other words, the distribution is assumed to be stationary. Thus, it is generally assumed that there is an underlying parameter θ, which does not change with time, and as the number of samples increases, we would like the estimate θˆ to converge to θ with probability one, or in a mean square sense. There are numerous problems which we have recently encountered, where these strong estimators pose a real-life concern. One scenario occurs in pattern classification involving moving scenes. We also encounter the same situation when we want to perform adaptive compression of files which are interspersed with text, formulae, images and tables. Similarly, if we are dealing with adaptive data structures, the structure changes with the estimate of the underlying data distribution, and if the estimator used is “strong” (i.e., w. p. 1), it is hard for the learned data structure to emerge from a structure that it has converged to. Indeed, we can conclusively demonstrate that it is sub-optimal to work with strong estimators in such application domains, i.e., when the data is truly non-stationary. In this paper, we shall present one such “weak” estimator, referred to as the Stochastic Learning Weak Estimator (SLWE), and which is developed by using the principles of stochastic learning. In essence, the estimate is updated at each instant based on the value of the current sample. However, this updating is not achieved using an additive updating rule, but rather by a multiplicative rule, akin to the family of linear actionprobability updating schemes [13,14]. The formal results that we have obtained for the binomial distribution are quite fascinating. Similar results have been achieved for the multinomial case, except that the variance (covariance) has not been currently derived. The entire field of obtaining weak estimators is novel. In that sense, we believe that this paper is of a pioneering sort – we are not aware of any estimators which are computed as explained here. We also hope that this is the start of a whole new body of research, namely those involving weak estimators for various distributions, and their applications in various domains. 
In this regard, we are currently in the process of deriving weak estimators for various distributions. To demonstrate the power of the SLWE in



real-life problems, we conclude the paper with results applicable to news classification problems, in which the underlying distribution is non-stationary. To devise a new estimation method, we have utilized the principles of learning as achieved by the families of Variable Structure Stochastic Automata (VSSA) [10,11,13]. In particular, the learning we have achieved is obtained as a consequence of invoking algorithms related to families of linear schemes, such as the Linear Reward-Inaction (LRI ) scheme. The analysis is also akin to the analysis used for these learning automata (LA). This involves first determining the updating equations, and taking the conditional expectation of the quantity analyzed. The condition disappears when the expectation operator is invoked a second time, leading to a difference equation for the specified quantity, which equation is later explicitly solved. We have opted to use these families of VSSA in the design of our SLWE, because it turns out that the analysis is considerably simplified and (in our opinion) fairly elegant. The traditional strategy to deal with non-stationary environments has been one of using a sliding window [7]. The problem with such a “sliding window” scheme for estimation is that the width of the window is crucial. If the width is too “small”, the estimates are necessarily poor. On the other hand, if the window is “large”, the estimates prior to the change of the parameters influence the new estimates significantly. Also, the observations during the entire sliding window must be retained and updated during the entire process. A comparison of our new estimation strategy with some of the other “non-sliding-window” approaches is currently under way. However, the main and fundamental difference that our method has over all the other reported schemes is the fact that our updating scheme is not additive, but multiplicative. The question of deriving weak estimators based on the concepts of discretized LA [16,18] and estimator-based LA [12,17,19,21,23,29] remains open. We believe, however, that even if such estimator-based SLWE are designed, the analysis of their properties will not be trivial.

2 Weak Estimators of Binomial Distributions

Throughout this paper we assume that we are estimating the parameters of a binomial/multinomial distribution. The binomial distribution is characterized by two parameters, namely, the number of Bernoulli trials, and the parameter characterizing each Bernoulli trial. In this regard, we assume that the number of observations is the number of trials. Thus, all we have to do is to estimate the Bernoulli parameter for each trial. This is what we endeavor to do using stochastic learning methods. Let X be a binomially distributed random variable, which takes on the value of either '1' or '2'. We assume that X obeys the distribution S, where S = [s1, s2]^T. In other words, X = '1' with probability s1 and X = '2' with probability s2, where s1 + s2 = 1.

We depart from the traditional notation of the random variable taking values of ‘0’ and ‘1’, so that the notation is consistent when we consider the multinomial case.



Let x(n) be a concrete realization of X at time 'n'. The intention of the exercise is to estimate S, i.e., si for i = 1, 2. We achieve this by maintaining a running estimate P(n) = [p1(n), p2(n)]^T of S, where pi(n) is the estimate of si at time 'n', for i = 1, 2. Then, the value of p1(n) is updated as per the following simple rule:

p1(n + 1) ← λ p1(n)         if x(n) = 2    (1)
          ← 1 − λ p2(n)     if x(n) = 1    (2)

where λ is a user-defined parameter, 0 < λ < 1. In order to simplify the notation, the vector P (n) = [p1 (n), p2 (n)]T refers to the estimates of the probabilities of ‘1’ and ‘2’ occurring at time ‘n’, namely p1 (n) and p2 (n) respectively. In the interest of simplicity, we omit the index n, whenever there is no confusion, and thus, P is an abbreviation P (n). The first theorem, whose proof can be found in [15], concerns the distribution of the vector P which estimates S as per Equations (1) and (2). We shall state that P converges in distribution. The mean of P is shown to converge exactly to the mean of S. The proof, which is a little involved, follows the types of proofs used in the area of stochastic learning [15]. Theorem 1. Let X be a binomially distributed random variable, and P (n) be the estimate of S at time ‘n’. Then, E [P (∞)] = S.  The next results that we shall prove indicates that E [P (n + 1)] is related to E [P (n)] by means of a stochastic matrix. We derive the explicit dependence, and allude to the resultant properties by virtue of the stochastic properties of this matrix. This leads us to two results, namely that of the limiting distribution of the chain, and that which concerns the rate of convergence of the chain. It turns out that while the former is independent of the learning parameter, λ, the latter is determined only by λ. The reader will observe that the results we have derived are asymptotic results. In other words, the mean of P (n) is shown to converge exactly to the mean of S. The implications of the “asymptotic” nature of the results will be clarified presently. The proofs of Theorems 2, 3 and 4 can be found in [15]. Theorem 2. If the components of P (n + 1) are obtained from the components of P (n) as per Equations (1) and (2), E [P (n + 1)] = MT E [P (n)], where M is a stochastic matrix. Thus, the limiting value of the expectation of P (.) converges to S, and the rate of convergence of P to S is fully determined by λ.  Theorem 3. Let X be a binomially distributed random variable governed by the distribution S, and P (n) be the estimate of S at time ‘n’ obtained by (1) and (2). Then, the rate of convergence of P to S is fully determined by λ.  We now derive the explicit expression for the asymptotic variance of the SLWE. A small value of λ leads to fast convergence and a large variance and vice versa. Theorem 4. Let X be a binomially distributed random variable governed by the distribution S, and P (n) be the estimate of S at time ‘n’ obtained by (1) and (2). Then, the algebraic expression for the variance of P (∞) is fully determined by λ. 



When λ → 1, it implies that the variance tends to zero, implying mean square convergence. Our result seems to be contradictory to our initial goal. When we motivated our problem, we were working with the notion that the environment was non-stationary. However, the results we have derived are asymptotic, and thus, are valid only as n → ∞. While this could prove to be a handicap, realistically, and for all practical purposes, the convergence takes place after relatively small values of n. Thus, if λ is even as "small" as 0.9, after 50 iterations, the variation from the asymptotic value will be of the order of 10^−50, because λ also determines the rate of convergence, and this occurs in a geometric manner [13]. In other words, even if the environment switches its Bernoulli parameter after 50 steps, the SLWE will be able to track this change. Observe too that we do not need to introduce or consider the use of a "sliding window".
We conclude this Section by presenting the updating rule given in (1) and (2) in the context of some schemes already reported in the literature. If we assume that X can take values of '0' and '1', then the probability of '1' at time n + 1 can be estimated as follows:

p1(n + 1) ← (n/(n+1)) p1 + (1/(n+1)) x(n + 1)    (3)

This expression can be seen to be a particular case of the rule (1) and (2), where the parameter λ = 1 − 1/(n+1). This kind of rule is typically used in stochastic approximation [9], and in some reinforcement learning schemes, such as Q-learning [31]. What we have done is to show that when such a rule is used in estimation, the mean converges to the true mean (independent of the learning parameter λ), and that the variance and rate of convergence are determined only by λ. Furthermore, we have derived the explicit relationships for these dependencies.
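The following is a small, self-contained simulation (Python; not part of the paper) of the update rule (1)-(2) on a Bernoulli source whose parameter switches every 50 observations, illustrating the tracking behaviour discussed above. The switching schedule, the random seed and λ = 0.9 are illustrative choices.

```python
import numpy as np

def slwe_binomial_update(p1, x, lam):
    # rules (1)-(2): multiplicative update of the running estimate of s1 = P(X = '1')
    return lam * p1 if x == 2 else 1.0 - lam * (1.0 - p1)

rng = np.random.default_rng(1)
lam, p1, trace = 0.9, 0.5, []
# a source whose parameter s1 switches every 50 observations
for s1 in [0.8, 0.2, 0.8, 0.2]:
    for _ in range(50):
        x = 1 if rng.random() < s1 else 2
        p1 = slwe_binomial_update(p1, x, lam)
        trace.append(p1)
# within each 50-observation block, p1 re-converges towards the current s1
```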

3 Weak Estimators of Multinomial Distributions

In this section, we shall consider the problem of estimating the parameters of a multinomial distribution. The multinomial distribution is characterized by two parameters, namely, the number of trials, and a probability vector which determines the probability of a specific event (from a pre-specified set of events) occurring. In this regard, the number of observations is the number of trials. Thus, we are to estimate the latter probability vector associated with the set of possible outcomes or trials. Specifically, let X be a multinomially distributed random variable, which takes on values from the set {'1', . . . , 'r'}. We assume that X is governed by the distribution S = [s1, . . . , sr]^T as: X = 'i' with probability si, where Σ_{i=1}^{r} si = 1. Also, let x(n) be a concrete realization of X at time 'n'. The intention of the exercise is to estimate S, i.e., si for i = 1, . . . , r. We achieve this by maintaining a running estimate P(n) = [p1(n), . . . , pr(n)]^T of S, where pi(n) is the estimate of si at time 'n', for i = 1, . . . , r. Then, the value of p1(n) is updated as per the following simple rule (the rules for other values of pj(n) are similar):

p1(n + 1) ← p1 + (1 − λ) Σ_{j≠1} pj     when x(n) = 1    (4)
          ← λ p1                        when x(n) ≠ 1    (5)



As in the binomial case, the vector P (n) = [p1 (n), p2 (n), . . . , pr (n)]T refers to the estimate of S = [s1 , s2 , . . . , sr ]T at time ‘n’, and we will omit the reference to time ‘n’ in P (n) whenever there is no confusion. The results that we present now are as in the binomial case, i.e. the distribution of E [P (n + 1)] follows E [P (n)] by means of a stochastic matrix. This leads us to two results, namely to that of the limiting distribution of the vector, and that which concerns the rate of convergence of the recursive equation. It turns out that both of these, in the very worst case, could only be dependent on the learning parameter λ. However, to our advantage, while the former is independent of the learning parameter, λ, the latter is only determined by it (and not a function of it). The complete proofs of Theorems 5, 6 and 7 are found in [15]. Theorem 5. Let X be a multinomially distributed random variable governed by the distribution S, and P (n) be the estimate of S at time ‘n’ obtained by (4) and (5). Then, E [P (∞)] = S.  We now derive the explicit dependence of E [P (n + 1)] on E [P (n)] and the consequences. Theorem 6. Let the parameter S of the multinomial distribution be estimated at time ‘n’ by P (n) obtained by (4) and (5). Then, E [P (n + 1)] = MT E [P (n)], in which every off-diagonal term of the stochatic matrix, M, has the same multiplicative factor, (1−λ). Furthermore, the final solution of this vector difference equation is independent of λ.  The convergence and eigenvalue properties of M follow. Theorem 7. Let the parameter S of the multinomial distribution be estimated at time ‘n’ by P (n) obtained by (4) and (5). Then, all the non-unity eigenvalues of M are exactly λ, and thus the rate of convergence of P is fully determined by λ.  A small value of λ leads to fast convergence and a large variance, and vice versa. Again, although the results we have derived are asymptotic, if λ is even as “small” as 0.9, after 50 iterations, the variation from the asymptotic value will be of the order of 10−50 . In other words, after 50 steps, the SLWE will be able to track this change. Our experimental results demonstrate this fast convergence.
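A compact sketch of the multinomial rule (4)-(5) follows (Python; an illustrative implementation, not code from the paper). It exploits the fact that, because the components of P(n) sum to one, rule (4) can equivalently be written as λ p_x(n) + (1 − λ); the switching schedule in the usage example is arbitrary.

```python
import numpy as np

def slwe_multinomial_update(P, x, lam):
    """One step of rules (4)-(5): P is the current estimate (a probability
    vector of length r), x in {0, ..., r-1} is the symbol observed at time n."""
    P = lam * P              # rule (5) for every component j with x(n) != j
    P[x] += 1.0 - lam        # rule (4): p_x + (1-lam)*sum_{j != x} p_j = lam*p_x + (1-lam)
    return P                 # the components still sum to one

# example: tracking a 4-symbol distribution that switches abruptly
rng = np.random.default_rng(0)
P = np.full(4, 0.25)
for S in ([0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]):
    for _ in range(50):
        P = slwe_multinomial_update(P, rng.choice(4, p=S), 0.9)
```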

4 Experiments on Pattern Classification

The power of the SLWE has been earlier demonstrated in the "vanilla" estimation problem [20]. We now argue that these principles also have significance in the PR of non-stationary data, by demonstrating this in two pattern recognition experiments. The classification in these cases was achieved on non-stationary data respectively derived synthetically, and from real-life data files obtained from the Canadian Broadcasting Corporation's (CBC's) news files. We now describe both these experiments and the comparison of the SLWE with the well-known MLE that uses a sliding window (MLEW).



4.1 Classification of Synthetic Data

In the case of synthetic data, the classification problem was the following. Given a stream of bits, which are drawn from two different sources (or classes), say, S1 and S2, the aim of the classification exercise was to identify the source to which each symbol belonged. To train the classifier, we first generated blocks of bits using two binomial distributions (20 blocks for each source), where the probability of 0 for each distribution was s11 and s12 respectively. For the results which we report (other settings yielding similar results are not reported here), the specific values of s11 and s12 were randomly set to be s11 = 0.18 and s12 = 0.76, which were assumed unknown to the classifier. These data blocks were then utilized to compute MLE estimates of s11 and s12, which we shall call ŝ11 and ŝ12 respectively. In the testing phase, 40 blocks of bits were generated randomly from either of the sources, and the order in which they were generated was kept secret from the classifier. Furthermore, the size of the blocks was also unknown to the estimators, MLEW and SLWE. In the testing phase, each bit read from the testing set was classified using two classifiers, which were designed using the probability of the symbol '0' estimated by the MLEW and SLWE, which we refer to as pM1(n) and pS1(n) respectively. The classification rule was based on a linear classifier, and the bit was assigned to the class that minimized the Euclidean distance between the estimated probability of 0 and the "learned" probability of 0 (for the two sources) obtained during the training. Thus, based on the MLEW classifier, the nth bit read from the test set was assigned to class S1 if (ties were resolved arbitrarily)

|pM1(n) − ŝ11| < |pM1(n) − ŝ12|,

and assigned to class S2 otherwise. The classification rule based on the SLWE was analogous, except that it used pS1(n) instead of pM1(n).
The two classifiers were tested for different scenarios. For the SLWE, the value of λ was arbitrarily set to 0.9. In the first set of experiments, the classifiers were tested for blocks of different sizes, b = 50, 100, . . . , 500. For each value of b, an ensemble of 100 experiments was performed. The MLEW used a window of size w, which was centered around b, and was computed as the nearest integer of a randomly generated value obtained from a uniform distribution U[b/2, 3b/2]. In effect, we thus gave the MLEW the advantage of having some a priori knowledge of the block size. The results obtained are shown in Table 1, from which we see that classification using the SLWE was uniformly superior to classification using the MLEW, and in many cases yielded an accuracy more than 20% higher than that of the latter. The accuracy of the classifier increased with the block size, as is clear from Figure 1.
In the second set of experiments, we tested both classifiers, the MLEW and SLWE, when the block size was fixed to be b = 50. The window size w of the MLEW was again randomly generated from U[b/2, 3b/2], thus giving the MLE the advantage of having some a priori knowledge of the block size. The SLWE again used a conservative value of λ = 0.9. In this case, we conducted an ensemble of 100 simulations, and in each case, we computed the classification accuracy of the first k bits in each block. The results are shown in Table 2. Again, the uniform superiority of the SLWE over the MLEW can be observed. For example, when k = 45, the MLEW yielded an accuracy



Table 1. The results obtained from testing linear classifiers which used the MLEW and SLWE respectively. In this case, the blocks were generated of sizes b = 50, 100, . . . , 500, and the window size w used for the MLEW was randomly generated from U[b/2, 3b/2]. The SLWE used a value of λ = 0.9. The results were obtained from an ensemble of 100 simulations, and the accuracy was computed using all the bits in the block.

b | MLEW | SLWE
50 | 76.37 | 93.20
100 | 75.32 | 96.36
150 | 73.99 | 97.39
200 | 74.63 | 97.95
250 | 74.91 | 98.38
300 | 74.50 | 98.56
350 | 75.05 | 98.76
400 | 74.86 | 98.88
450 | 75.68 | 98.97
500 | 75.55 | 99.02


Fig. 1. Plot of the accuracies of the MLEW and SLWE classifiers for different block sizes. The settings of the experiments are as described in Table 1.

of only 73.77%, but the corresponding accuracy of the SLWE was 92.48%. From the plot shown in Figure 2, we again observe the increase in accuracy with k. In the final set of experiments, we tested both classifiers, the MLEW and SLWE, when the block size was made random (as opposed to being fixed as in the previous cases). This was, in one sense, the most difficult setting, because there was no fixed “periodicity” for the switching from one “environment” to the second, and vice versa.
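For concreteness, the following sketch (Python; an assumed implementation, not code from the paper) shows the two competing per-bit classifiers used in these experiments: a running SLWE estimate and a sliding-window MLE of P('0'), each compared against the class-conditional estimates ŝ11 and ŝ12 learned during training. The default values of λ and w are only examples.

```python
def classify_stream(bits, s_hat, lam=0.9, w=79):
    """Classify every bit of a test stream with the SLWE-based and the
    sliding-window-MLE-based linear rules described above.
    bits  : iterable of 0/1 symbols,
    s_hat : (s_hat_1, s_hat_2), per-class training estimates of P('0')."""
    p_slwe, window, labels = 0.5, [], []
    for b in bits:
        # SLWE update of the running estimate of P('0')
        p_slwe = 1.0 - lam * (1.0 - p_slwe) if b == 0 else lam * p_slwe
        # sliding-window MLE of P('0')
        window.append(b)
        if len(window) > w:
            window.pop(0)
        p_mlew = window.count(0) / len(window)
        # assign to the class whose learned P('0') is closest to the running estimate
        slwe_cls = 1 if abs(p_slwe - s_hat[0]) <= abs(p_slwe - s_hat[1]) else 2
        mlew_cls = 1 if abs(p_mlew - s_hat[0]) <= abs(p_mlew - s_hat[1]) else 2
        labels.append((slwe_cls, mlew_cls))
    return labels   # per bit: (SLWE decision, MLEW decision)
```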



Table 2. The results obtained from testing linear classifiers which used the MLEW and SLWE respectively. In this case, the blocks were generated of size b = 50, and the window size w used for the MLEW was randomly generated from U[b/2, 3b/2]. The SLWE used a value of λ = 0.9. The results were obtained from an ensemble of 100 simulations, and the accuracy was computed using the first k bits in each block.

k | MLEW | SLWE
5 | 50.76 | 56.46
10 | 51.31 | 70.00
15 | 53.15 | 78.65
20 | 56.03 | 83.66
25 | 59.47 | 86.78
30 | 63.19 | 88.90
35 | 67.03 | 90.42
40 | 70.65 | 91.57
45 | 73.77 | 92.48
50 | 76.37 | 93.20


Fig. 2. Plot of the accuracies of the MLEW and SLWE classifiers for different values of k. The settings of the experiments are as described in Table 2.

This block size was thus randomly generated from U [w/2, 3w/2], where w was the width used by the MLEW. Observe again that this implicitly assumed that the MLE had some (albeit, more marginal than in the previous cases) advantage of having some a priori knowledge of the system’s behaviour, but the SLWE used the same conservative value of λ = 0.9. Again, we conducted an ensemble of 100 simulations, and in each case, we computed the classification accuracy of the first k bits in each block. The



Table 3. The results obtained from testing linear classifiers which used the MLEW and SLWE respectively. In this case, the blocks were generated of a random width b uniform in U[w/2, 3w/2], where w was the window size used by the MLEW. The SLWE used a value of λ = 0.9. The results were obtained from an ensemble of 100 simulations, and the accuracy was computed using the first k bits in each block.

b | MLEW | SLWE
5 | 51.53 | 56.33
10 | 51.79 | 70.05
15 | 52.06 | 78.79
20 | 53.09 | 83.79
25 | 56.17 | 86.88


Fig. 3. Plot of the accuracies of the MLEW and SLWE classifiers for different values of k. The numerical results of the experiments are shown in Table 3.

results are shown in Table 3. As in the above two cases, the SLWE was uniformly superior to the MLEW. For example, when b = 20, the MLEW yielded a recognition accuracy of only 53.09%, but the corresponding accuracy of the SLWE was 83.79%. From the plot shown in Figure 3, we again observe the increase in accuracy with k for the case of the SLWE, but the increase in the case of the MLEW is far less marked.

4.2 Results on Real-Life Classification

In this section, we present the results of classification obtained from real-life data. The problem can be described as follows. We are given a stream of news items from different sources, and the aim of the PR exercise is to detect the source that the news come



Table 4. Empirical results obtained by classification of CBC files into Sports (S) and Business (B) items using the MLEW and SLWE

File name | Class | Size | MLEW Hits | MLEW Accuracy | SLWE Hits | SLWE Accuracy
s3.txt | S | 1162 | 721 | 62.05 | 909 | 78.23
b11.txt | B | 1214 | 1045 | 86.08 | 1154 | 95.06
b17.txt | B | 1290 | 1115 | 86.43 | 1261 | 97.75
s16.txt | S | 1044 | 610 | 58.43 | 656 | 62.84
s9.txt | S | 2161 | 1542 | 71.36 | 1789 | 82.79
b7.txt | B | 1021 | 877 | 85.90 | 892 | 87.37
b5.txt | B | 1630 | 932 | 57.18 | 1087 | 66.69
b12.txt | B | 1840 | 1289 | 70.05 | 1502 | 81.63
s20.txt | S | 2516 | 1582 | 62.88 | 1919 | 76.27
b1.txt | B | 1374 | 826 | 60.12 | 882 | 64.19
b2.txt | B | 1586 | 940 | 59.27 | 1124 | 70.87
s4.txt | S | 1720 | 1051 | 61.10 | 1169 | 67.97
b8.txt | B | 1280 | 801 | 62.58 | 917 | 71.64
b15.txt | B | 1594 | 831 | 52.13 | 928 | 58.22
s10.txt | S | 2094 | 1202 | 57.40 | 1486 | 70.96
b20.txt | B | 2135 | 1361 | 63.75 | 1593 | 74.61
s14.txt | S | 2988 | 1749 | 58.53 | 2163 | 72.39
b10.txt | B | 3507 | 2086 | 59.48 | 2291 | 65.33
b9.txt | B | 2302 | 1425 | 61.90 | 1694 | 73.59
s7.txt | S | 2209 | 928 | 42.01 | 1116 | 50.52
s5.txt | S | 2151 | 1181 | 54.90 | 1411 | 65.60
s12.txt | S | 3197 | 1636 | 51.17 | 2071 | 64.78
s8.txt | S | 1935 | 1149 | 59.38 | 1398 | 72.25
s1.txt | S | 2334 | 1079 | 46.23 | 1248 | 53.47
b3.txt | B | 1298 | 563 | 43.37 | 640 | 49.31
s2.txt | S | 2147 | 917 | 42.71 | 1218 | 56.73
b19.txt | B | 1943 | 991 | 51.00 | 1210 | 62.27
s13.txt | S | 2534 | 1380 | 54.46 | 1568 | 61.88

from. For example, consider the scenario when a TV channel broadcasts live video, or a channel that releases news in the form of text, for example, on the internet. The problem of detecting the source of the news item from a live channel has been studied by various researchers (in the interest of brevity, we merely cite one reference [25]) with the ultimate aim being that of extracting shots for the videos and using them to classify the video into any of the pre-defined classes. Processing images is extremely time-consuming, which makes real time processing almost infeasible. Thus, to render the problem tractable, we consider the scenario in which the news items arrive in the form of text blocks, which are extracted from the closed-captioning text embedded in the video streams. A classification problem of this type is similar to that of classifying files posted, for example, on the internet. To demonstrate the power of our SLWE over the MLE when used in a classification scheme for real-life problems, we used them both as in the case of the artificial data



described above, followed by a simple linear classifier. We first downloaded news files from the Canadian Broadcasting Corporation (CBC), which were from either of two different sources (the classes): business or sports2 . The data was then divided into two subsets, one for training and the other for testing purposes, each subset having 20 files of each class. With regard to the parameters, for the SLWE we randomly selected a value of λ uniformly from the range [0.99, 0.999], and for the case of the MLEW, the window size was uniformly distributed in [50, 150]. In our particular experiments, the value for λ was 0.9927, and the value for the size of the window was 79, where the latter was obtained by rounding the resulting random value to the nearest integer. The results of the classification are shown in Table 4, where we list the name of the files that were selected (randomly) from the testing set. The column labeled “hits” contains the number of characters in the file, which were correctly classified, and “accuracy” represents the percentage of those characters. The results reported here are only for the cases when the classification problem is meaningful. In a few cases, the bit pattern did not even yield a 50% (random choice) recognition accuracy. We did not report these results because, clearly, classification using these “features” and utilizing such a classifier is meaningless. But we have reported all the cases when the MLEW gave an accuracy which was less than 50%, but for which the SLWE yielded an accuracy which was greater than 50%. The superiority of the SLWE over the MLEW is clearly seen from the table - for example in the case of the file s9.txt, the MLEW yielded a classification accuracy of 71.36%, while the corresponding accuracy for the SLWE was 82.79%. Indeed, in every single case, the SLWE yielded a classification accuracy better than the corresponding accuracy obtained by the MLEW ! We conclude this section by observing that a much better accuracy can be obtained using similar features if a more powerful classifier (for example, a higher order classifier) is utilized instead of the simple linear classifier that we have invoked.

(The files used in our experiments were obtained from the CBC web-site, www.cbc.ca, between September 24 and October 2, 2004, and can be made available upon request.)

5 PR Using SLWE Obtained with Other LA Schemes

Throughout this paper, we have achieved PR by deriving training schemes obtained by utilizing the principles of learning as achieved by the families of Variable Structure Stochastic Automata (VSSA) [10,11,13,28]. In particular, the learning we have achieved is obtained using a learning algorithm of the linear scheme family, namely the LRI scheme, and we have provided the corresponding analysis. We would like to mention that the estimation principles introduced here lead to various "open problems". A lot of work has been done in the last decades which involves the so-called families of Discretized LA, and the families of Pursuit and Estimator algorithms. Discretized LA were pioneered by Thathachar and Oommen [28] and since then, all the families of continuous VSSA have found their counterparts in the corresponding discretized versions [16,18]. The design of SLWE using discretized VSSA is open. Similarly, the Pursuit and Estimator algorithms were first pioneered by Thathachar, Sastry and their colleagues [23,29,30]. These involve specifically utilizing running estimates of the penalty probabilities of the actions in enhancing the stochastic

learning. These automata were later discretized [12,19] and currently, extremely fast continuous and discretized pursuit and estimator LA have been designed [17,21,22]. The question of designing SLWE using Pursuit and Estimator updating schemes is also open. (We believe, however, that even if such estimator-based SLWEs are designed, the analysis of their properties will not be trivial.) We believe that "weak" estimates derivable using any of these principles will lead to interesting PR and classification methods that could also be applicable for non-stationary data.

6 Conclusions and Future Work

In this paper we have considered the pattern recognition (i.e., the training and classification) problem in which the class-conditional distribution of the features is assumed unknown, but non-stationary. To render the training and classification feasible within this model, we presented a novel estimation strategy, involving the so-called Stochastic Learning Weak Estimator (SLWE), which is based on the principles of stochastic learning. The latter, whose convergence is weak, has been used to estimate the parameters of a binomial/multinomial distribution. Even though our method includes a learning coefficient, λ, it turns out that the mean of the final estimate is independent of λ, the variance of the final distribution decreases with λ, and the speed decreases with λ. To demonstrate the power of these estimates on data which is truly "non-stationary", we have used them in two PR exercises, the first of which involves artificial data, and the second of which involves the recognition of the types of data that are present in news reports of the Canadian Broadcasting Corporation (CBC). The superiority of the SLWE in both these cases is demonstrated.

Acknowledgements. This research work has been partially supported by NSERC, the Natural Sciences and Engineering Research Council of Canada, CFI, the Canadian Foundation for Innovation, and OIT, the Ontario Innovation Trust.

References

1. P. Bickel and K. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics, volume I. Prentice Hall, second edition, 2000.
2. C. Blake and C. Merz. UCI repository of machine learning databases, 1998.
3. G. Casella and R. Berger. Statistical Inference. Brooks/Cole Pub. Co., second edition, 2001.
4. R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley and Sons, Inc., New York, NY, 2nd edition, 2000.
5. K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.
6. R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, Cambridge, Massachusetts, 2001.
7. Y. M. Jang. Estimation and Prediction-Based Connection Admission Control in Broadband Satellite Systems. ETRI Journal, 22(4):40-50, 2000.
8. B. Jones, P. Garthwaite, and I. Jolliffe. Statistical Inference. Oxford University Press, second edition, 2002.




9. H. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd edition, 2003.
10. S. Lakshmivarahan. Learning Algorithms Theory and Applications. Springer-Verlag, New York, 1981.
11. S. Lakshmivarahan and M. A. L. Thathachar. Absolutely Expedient Algorithms for Stochastic Automata. IEEE Trans. on Systems, Man and Cybernetics, SMC-3:281-286, 1973.
12. J. K. Lanctôt and B. J. Oommen. Discretized Estimator Learning Automata. IEEE Trans. on Systems, Man and Cybernetics, 22(6):1473-1483, 1992.
13. K. Narendra and M. Thathachar. Learning Automata. An Introduction. Prentice Hall, 1989.
14. J. Norris. Markov Chains. Springer Verlag, 1999.
15. B. J. Oommen and L. Rueda. Stochastic Learning-based Weak Estimation of Multinomial Random Variables and Its Applications to Non-stationary Environments. 2004. Submitted for publication.
16. B. J. Oommen. Absorbing and Ergodic Discretized Two-Action Learning Automata. IEEE Trans. on Systems, Man and Cybernetics, SMC-16:282-296, 1986.
17. B. J. Oommen and M. Agache. Continuous and Discretized Pursuit Learning Schemes: Various Algorithms and Their Comparison. IEEE Trans. on Systems, Man and Cybernetics, SMC-31(B):277-287, June 2001.
18. B. J. Oommen and J. R. P. Christensen. Epsilon-Optimal Discretized Reward-Penalty Learning Automata. IEEE Trans. on Systems, Man and Cybernetics, SMC-18:451-458, 1988.
19. B. J. Oommen and J. K. Lanctôt. Discretized Pursuit Learning Automata. IEEE Trans. on Systems, Man and Cybernetics, 20(4):931-938, 1990.
20. B. J. Oommen and L. Rueda. A New Family of Weak Estimators for Training in Non-stationary Distributions. In Proceedings of the Joint IAPR International Workshops SSPR 2004 and SPR 2004, pages 644-652, Lisbon, Portugal, 2004.
21. G. I. Papadimitriou. A New Approach to the Design of Reinforcement Schemes for Learning Automata: Stochastic Estimator Learning Algorithms. IEEE Trans. on Knowledge and Data Engineering, 6:649-654, 1994.
22. G. I. Papadimitriou. Hierarchical Discretized Pursuit Nonlinear Learning Automata with Rapid Convergence and High Accuracy. IEEE Trans. on Knowledge and Data Engineering, 6:654-659, 1994.
23. K. Rajaraman and P. S. Sastry. Finite Time Analysis of the Pursuit Algorithm for Learning Automata. IEEE Trans. on Systems, Man and Cybernetics, 26(4):590-598, 1996.
24. S. Ross. Introduction to Probability Models. Academic Press, second edition, 2002.
25. M. De Santo, G. Percannella, C. Sansone, and M. Vento. A multi-expert approach for shot classification in news videos. In Image Analysis and Recognition, LNCS, volume 3211, pages 564-571, Amsterdam, 2004. Elsevier.
26. J. Shao. Mathematical Statistics. Springer Verlag, second edition, 2003.
27. R. Sprinthall. Basic Statistical Analysis. Allyn and Bacon, second edition, 2002.
28. M. A. L. Thathachar and B. J. Oommen. Discretized Reward-Inaction Learning Automata. Journal of Cybernetics and Information Sciences, pages 24-29, Spring 1979.
29. M. A. L. Thathachar and P. S. Sastry. A Class of Rapidly Converging Algorithms for Learning Automata. IEEE Trans. on Systems, Man and Cybernetics, SMC-15:168-175, 1985.
30. M. A. L. Thathachar and P. S. Sastry. Estimator Algorithms for Learning Automata. In Proc. of the Platinum Jubilee Conference on Systems and Signal Processing, Bangalore, India, December 1986.
31. C. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, England, 1989.
32. A. Webb. Statistical Pattern Recognition. John Wiley & Sons, New York, second edition, 2002.

Noise Robustness by Using Inverse Mutations

Ralf Salomon

Faculty of Computer Science and Electrical Engineering, University of Rostock, 18051 Rostock, Germany
[email protected]

Abstract. Recent advances in the theory of evolutionary algorithms have indicated that a hybrid method known as the evolutionary-gradient-search procedure yields superior performance in comparison to contemporary evolution strategies. But the theoretical analysis also indicates a noticeable performance loss in the presence of noise (i.e., noisy fitness evaluations). This paper aims at understanding the reasons for this observable performance loss. It also proposes some modifications, called inverse mutations, to make the process of estimating the gradient direction more noise robust.

1 Introduction

The literature on evolutionary computation discusses quite controversially the question of whether or not evolutionary algorithms are gradient methods. Some [10] believe that by generating trial points, called offspring, evolutionary algorithms stochastically operate along the gradient, whereas others [4] believe that evolutionary algorithms might deviate from gradient methods too much because of their high explorative nature.
The goal of an optimization algorithm is to locate the optimum x^(opt) (minimum or maximum in a particular application) of an N-dimensional objective (fitness) function f(x1, . . . , xN) = f(x), with xi denoting the N independent variables. Particular points x or y are also known as search points. In essence, most procedures differ in how they derive future test points x^(t+1) from knowledge gained in the past. Some selected examples are gradient descent, Newton's method, and evolutionary algorithms. For an overview, the interested reader is referred to the pertinent literature [3,6,7,8,9,14].
Recent advances in the theory of evolutionary algorithms [1] have also considered hybrid algorithms, such as the evolutionary-gradient-search (EGS) procedure [12]. This algorithm fuses principles from both gradient and evolutionary algorithms in the following manner:
– it operates along an explicitly estimated gradient direction,
– it gains information about the gradient direction by randomly generating trial points (offspring), and
– it periodically reduces the entire population to a single search point.



A description of this algorithm is presented in Section 2, which also briefly reviews some recent theoretical analyses. Arnold [1] has shown that the EGS procedure performs superior in terms of sequential efficiency in comparison to contemporary evolution strategies. But Arnold's analysis [1] also indicates that the performance of EGS progressively degrades in the presence of noise (noisy fitness evaluations). Due to the high importance of noise robustness in real-world applications, Section 3 briefly reviews the relevant literature and results. Section 4 analyzes the reason for the observable performance loss. A first result of this analysis is that the procedure might be extended by a second, independently working step size in order to de-couple the gradient estimation from performing the actual progress step. A main reason for EGS suffering from a performance loss in the presence of noise is that large mutation steps cannot be used, because they lead to quite imprecise gradient estimates. Section 5 proposes the usage of inverse mutations, which use each mutation vector z twice, once in the original direction z and once in the inverse direction −z. Section 5 shows that inverse mutations are able to compensate for noise due to a mechanism which is called genetic repair in other algorithms. Section 6 concludes with a brief discussion and an outlook for future work.

2 Background: Algorithms and Convergence

This section summarizes some background material as far as necessary for the understanding of this paper. This includes a brief description of the procedure under consideration as well as some performance properties on quadratic functions.

2.1 Algorithms

As mentioned in the introduction, the evolutionary-gradient-search (EGS) procedure [12] is a hybrid method that fuses some principles of both gradient and evolutionary algorithms. It periodically collapses the entire population to a single search point, and explicitly estimates the gradient direction in which it tries to advance to the optimum. The procedure estimates the gradient direction by randomly generating trial points (offspring) and processing all trial points in calculating a weighted average. (Even though further extensions, such as using a momentum term or generating non-isotropic mutations, significantly accelerate EGS [13], they are not considered in this paper, because they are not appropriately covered by currently available theories. Furthermore, these extensions are not rotationally invariant, which limits their utility.) In its simplest form, EGS works as follows:

1. Generate i = 1, . . . , λ offspring y_t^(i) (trial points)

   y_t^(i) = x_t + σ z_t^(i)    (1)

   from the current point x_t (at time step t), with σ > 0 denoting a step size and z^(i) denoting a mutation vector consisting of N independent, normally distributed components.

2. Estimate the gradient direction g̃_t:

   g̃_t = Σ_{i=1}^{λ} ( f(y_t^(i)) − f(x_t) ) ( y_t^(i) − x_t ).    (2)

3. Perform a step

   x_{t+1} = x_t + σ √N g̃_t/‖g̃_t‖ = x_t + z_t^(prog),    (3)

with z_t^(prog) denoting the progress vector. It can be seen that Eq. (2) estimates an approximated gradient direction by calculating a weighted sum of all offspring regardless of their fitness values; the procedure simply assumes that for offspring with negative progress, positive progress can be achieved in the opposite direction, and sets those weights to negative values. This usage of all offspring is in clear contrast to most other evolutionary algorithms, which only use ranking information of some selected individuals. Further details can be found in the literature [1,5,12]. EGS also features some dynamic adaptation of the step size σ. This, however, is beyond the scope of this paper, and the interested reader is referred to the literature [12].
This paper compares EGS with the (μ/μ, λ)-evolution strategy [5,10], since the latter yields very good performance and is mathematically well analyzed. The (μ/μ, λ)-evolution strategy maintains μ parents, applies global-intermediate recombination to all parents, and applies normally distributed random numbers to generate λ offspring, from which the μ best ones are selected as parents for the next generation. In addition, this strategy also features a step size adaptation mechanism [1].
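As an illustration, a single EGS iteration implementing Eqs. (1)-(3) can be sketched as follows (Python with NumPy; this is a simplified reading of the procedure under the assumption of a minimization problem, and it omits the paper's step-size adaptation).

```python
import numpy as np

def egs_step(f, x, sigma, lam, rng):
    """One EGS iteration: lam trial points (Eq. (1)), weighted-sum gradient
    estimate (Eq. (2)), and a progress step of length sigma*sqrt(N) (Eq. (3))."""
    N = len(x)
    Z = rng.standard_normal((lam, N))        # mutation vectors z_t^(i)
    Y = x + sigma * Z                        # trial points y_t^(i), Eq. (1)
    fx = f(x)
    w = np.array([f(y) - fx for y in Y])     # fitness differences as weights
    g = w @ (Y - x)                          # estimated gradient direction, Eq. (2)
    # g points towards increasing f; for minimization we step against it
    return x - sigma * np.sqrt(N) * g / np.linalg.norm(g)   # Eq. (3)

# usage on the sphere model
rng = np.random.default_rng(0)
x = rng.standard_normal(10)
for _ in range(100):
    x = egs_step(lambda v: np.sum(v**2), x, sigma=0.1, lam=5, rng=rng)
```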

2.2 Progress on Quadratic Functions

Most theoretical performance analyses have so far been done on the quadratic function f(x_1, ..., x_N) = Σ_i x_i², also known as the sphere model. This choice has the following two main reasons: First, other functions are currently too difficult to analyze, and second, the sphere model approximates the optimum's vicinity of many (real-world) applications reasonably well. Thus, the remainder of this paper also adopts this choice. As a first performance measure, the rate of progress is defined as

   ϕ = f(x_t) − f(x_{t+1})                                                    (4)

in terms of the best population members' objective function values in two subsequent time steps t and t+1. For standard (μ, λ)-evolution strategies operating on the sphere model, Beyer [5] has derived the following rate of progress ϕ:

   ϕ ≈ 2 R c_{μ,λ} σ − N σ²,                                                  (5)


Fig. 1. Rate of progress for EGS and (μ/μ, λ)-evolution strategies according to Eqs. (7) and (8). Further details can be found in [1].

with R = ‖x‖ denoting the distance of the best population member to the optimum and c_{μ,λ} denoting a constant that subsumes all influences of the population configuration as well as the chosen selection scheme. Typical values are: c_{1,6} = 1.27, c_{1,10} = 1.54, c_{1,100} = 2.51, and c_{1,1000} = 3.24. To be independent from the current distance to the optimum, normalized quantities are normally considered (i.e., relative performance):

   ϕ* = σ* c_{μ,λ} − 0.5 (σ*)²,  with  ϕ* = ϕ N/(2R²) and σ* = σ N/R.          (6)

Similarly, the literature [1,5,10] provides the following rate-of-progress formulas for EGS and the (μ/μ, λ)-evolution strategy:

   ϕ*_EGS ≈ σ* ( √( λ / (1 + σ*²/4) ) − σ*/2 ),                                (7)

   ϕ*_{μ/μ,λ} ≈ σ* c_{μ/μ,λ} − (σ*)²/(2μ).                                     (8)

The rate of progress formula is useful to gain insights about the influences of various parameters on the performance. However, it does not consider the (computational) costs required for the evaluation of all λ offspring. For the common assumption that all offspring are evaluated sequentially, the literature often uses a second performance measure, called the efficiency η = ϕ*/λ. In other words, the efficiency η expresses the sequential run time of an algorithm. Figure 1 shows the efficiency of both algorithms according to Eqs. (7) and (8). It can be seen that for small numbers of offspring (i.e., λ ≈ 5), EGS is most efficient (in terms of sequential run time) and superior to the (μ/μ, λ)-evolution strategy.

3 Problem Description: Noise and Performance

Subsection 2.2 has reviewed the obtainable progress rates for the undisturbed case. The situation changes, however, when considering noise, i.e., noisy fitness evaluations. Noise is present in many (if not all) real-world applications. For the viability of practically relevant optimization procedures, noise robustness is thus of high importance. Most commonly, noise is modeled by additive N(0, σ_ε) normally distributed random numbers with standard deviation σ_ε. For noisy fitness evaluations, Arnold [1] has derived the following rate of progress:

   ϕ*_EGS ≈ σ* ( √( λ / (1 + σ*²/4 + σ_ε*²/σ*²) ) − σ*/2 ),                    (9)

with σ_ε* = σ_ε N/(2R²) denoting the normalized noise strength. Figure 2 demonstrates how the presence of noise σ_ε* reduces the rate of progress of the EGS procedure. Eq. (9) reveals that positive progress can be achieved only if σ_ε* ≤ √(4λ + 1) holds. In other words, the number of offspring (trial points) required to estimate the gradient direction grows quadratically with the noise strength. Another point to note is that the condition σ_ε* ≤ √(4λ + 1) incorporates the normalized noise strength. Thus, if the procedure approaches the optimum, the distance R decreases and the noise strength increases.
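For illustration, the following sketch evaluates the progress-rate predictions of Eqs. (7) and (9) numerically (the helper names are ours); with λ = 25 it reproduces the qualitative behavior of Fig. 2, i.e., the maximal ϕ* drops from roughly 5 in the noise-free case to barely positive values at σ_ε* = 8.

    import numpy as np

    def phi_egs(sigma_n, lam, noise_n=0.0):
        """Normalized EGS progress rate, Eq. (7); with noise_n > 0 it is Eq. (9)."""
        inner = 1.0 + sigma_n ** 2 / 4.0
        if noise_n > 0.0:
            inner += (noise_n / sigma_n) ** 2
        return sigma_n * (np.sqrt(lam / inner) - sigma_n / 2.0)

    sigmas = np.linspace(0.05, 4.0, 200)
    for eps in (0.0, 4.0, 8.0):                            # noise strengths as in Fig. 2
        best = max(phi_egs(s, lam=25, noise_n=eps) for s in sigmas)
        print(f"sigma_eps* = {eps}: max phi* = {best:.2f}, efficiency eta = {best / 25:.3f}")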


Fig. 2. The rate of progress ϕ* of EGS progressively degrades under the presence of noise σ_ε*. The example has used λ = 25 offspring. Further details can be found in [1].


Consequently, the procedure exhibits an increasing performance loss as it advances towards the optimum. By contrast, the (μ/μ, λ)-evolution strategy benefits from an effect called genetic repair that is induced by the global-intermediate recombination, and it is thus able to operate with larger mutation step sizes σ*. For the (μ/μ, λ)-evolution strategy, the literature [2] suggests that only a linear growth in the number of offspring λ is required.

4 Analysis: Imprecise Gradient Approximation

This section analyzes the reasons for the performance loss described in Section 3. To this end, Figure 3 illustrates how the EGS procedure estimates the gradient direction according to Eq. (2): g̃_t = Σ_{i=1..λ} (f(y_t^(i)) − f(x_t))(y_t^(i) − x_t). The following two points might be mentioned here:

1. The EGS procedure uses the same step size σ for generating trial points and for performing a step from x_t to x_{t+1}.

2. For small step sizes σ, the probability for an offspring to be better or worse than its parent is about one half. For larger step sizes, though, the chances for better offspring steadily decrease, and they approach zero for σ ≥ 2R.


Fig. 3. This figure shows how the EGS procedure estimates the gradient direction g̃ by averaging weighted test points y



Fig. 4. Normalized rate of progress ϕ∗ when using different step sizes σg and σp for the example N = λ = 10


Fig. 5. cos α between the true gradient g and its estimate g̃ as a function of the number of trial points λ

It is obvious that for small λ, these unequal chances have a negative effect on the gradient approximation accuracy. Both points can be further elaborated as follows: A modified version of the EGS procedure uses two independent step sizes σ_g and σ_p for generating the test points (offspring) y_t^(i) = x_t + σ_g z_t^(i) and for performing the actual progress step x_{t+1} = x_t + σ_p √N g̃_t/‖g̃_t‖, according to Eqs. (1) and (3), respectively. Figure 4 illustrates



Fig. 6. cos α between the true gradient g and its estimate g̃ as a function of the noise strength σ_ε ∈ {0.002, 0.01, 0.05, 0.5}. In this example, N = λ = 10 and R = 1 were used.

the effect of these two step sizes for the example R = 1, N = 10 dimensions, and λ = 10 trial points. It can be clearly seen that the rate of progress ϕ* drastically degrades for large step sizes σ_g. Since Figure 4 plots the performance for 'all possible' step sizes σ_p, it can be directly concluded that the accuracy of the gradient estimation significantly degrades when the step size σ_g is too large; if the estimate g̃ = g were precise, the attainable rate of progress would be ϕ* = (f(x_0) − f(0)) N/(2R²) = (1 − 0)·10/2 = 5. This hypothesis is supported by Fig. 5, which shows the angle cos α = g̃ᵀg/(‖g̃‖‖g‖) between the true gradient g and its estimate g̃. It can be clearly seen that, despite being dependent on the number of trial points λ, the gradient estimate's accuracy significantly depends on the step size σ_g. The qualitative behavior is more or less equivalent for all three numbers of trial points λ ∈ {5, 10, 25}: it is quite good for small step sizes, starts degrading at σ_g ≈ R, and quickly approaches zero for σ_g > 2R. In addition, Fig. 6 shows how the situation changes when noise is present. It can be seen that below the noise level, the accuracy of the gradient estimate degrades. In summary, this section has shown that it is generally advantageous to employ two step sizes σ_g and σ_p, and that the performance of the EGS procedure degrades when the step size σ_g is either below the noise strength or above the distance to the optimum.
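The following sketch (our own experimental setup, not the paper's code) reproduces the kind of measurement shown in Figs. 5 and 6: it estimates cos α between the true sphere gradient at distance R = 1 and the estimate of Eq. (2) for several generation step sizes σ_g under a small amount of additive fitness noise.

    import numpy as np

    def grad_estimate(x, sigma_g, lam, noise, rng):
        """EGS gradient estimate, Eq. (2), with additive fitness noise of strength `noise`."""
        f = lambda v: float(np.sum(v ** 2)) + noise * rng.standard_normal()
        z = rng.standard_normal((lam, x.size))
        y = x + sigma_g * z
        w = np.array([f(yi) for yi in y]) - f(x)
        return (w[:, None] * (y - x)).sum(axis=0)

    rng = np.random.default_rng(1)
    x = np.ones(10) / np.sqrt(10.0)                        # N = 10, distance R = 1
    g_true = 2.0 * x                                       # exact gradient of the sphere at x
    for sigma_g in (0.01, 0.1, 1.0, 10.0, 100.0):
        cosines = []
        for _ in range(200):
            g = grad_estimate(x, sigma_g, lam=10, noise=0.05, rng=rng)
            cosines.append(np.dot(g, g_true) / (np.linalg.norm(g) * np.linalg.norm(g_true)))
        print(f"sigma_g = {sigma_g:6.2f}   mean cos(alpha) = {np.mean(cosines):+.2f}")

As in the figures, the accuracy collapses both for σ_g below the noise level and for σ_g well above R.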

5 Inverse Mutations

The previous section has identified two regimes, σ_g < σ_ε and σ_g > R, that yield poor gradient estimates. This problem could be tackled either by resampling


already generated trial points or by increasing the number of different trial points. Both options, however, have high computational costs, since the performance gain would grow at most with √λ (i.e., reduced standard deviation, cf. Eq. (9)). Since the performance gain of the (μ/μ, λ)-evolution strategy grows linearly in λ (due to genetic repair [1]), these options are not further considered here. For the second problematic regime, σ_g > R, this paper proposes inverse mutations. Here, the procedure still generates λ trial points (offspring). However, half of them are mirrored with respect to the parent, i.e., they point in the opposite direction. Inverse mutations can be formally defined as:

   y_t^(i) = x_t + σ_g z_t^(i)             for i = 1, ..., ⌊λ/2⌋
   y_t^(i) = x_t − σ_g z_t^(i−⌊λ/2⌋)       for i = ⌊λ/2⌋ + 1, ..., λ           (10)

In other words, each mutation vector z^(i) is used twice, once in its original form and once as −z^(i). Figure 7 illustrates the effect of introducing inverse mutations. The performance gain is obvious: the observable accuracy of the gradient estimate is constant over the entire range of σ_g values. This is in sharp contrast to the regular case, which exhibits the performance loss discussed above. The figure, however, indicates a slight disadvantage in that the number of trial points should be twice as large in order to attain the same performance as in the regular case. In addition, the figure also shows the accuracy for λ = 6, which is smaller than the number of search space dimensions; with cos α ≈ 0.47, the accuracy is still reasonably good.
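A minimal sketch of Eq. (10), assuming an even number of trial points λ: every mutation vector z^(i) is used once as +z and once as −z.

    import numpy as np

    def trial_points_inverse(x, sigma_g, lam, rng):
        """Generate lambda trial points with inverse (mirrored) mutations, Eq. (10)."""
        assert lam % 2 == 0, "an even lambda is assumed in this sketch"
        z = rng.standard_normal((lam // 2, x.size))        # z^(1), ..., z^(lam/2)
        return np.vstack([x + sigma_g * z,                 # i = 1, ..., lam/2
                          x - sigma_g * z])                # i = lam/2 + 1, ..., lam

    rng = np.random.default_rng(2)
    y = trial_points_inverse(np.zeros(10), sigma_g=1.0, lam=10, rng=rng)
    print(y.shape)                                         # (10, 10)

Each mirrored pair contributes σ_g z [f(x + σ_g z) − f(x − σ_g z)] to the weighted sum of Eq. (2), i.e., a central difference in which the curvature-induced bias of large steps cancels; this is the genetic-repair-like effect discussed above.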


Fig. 7. cos α between the true gradient g and its estimate g̃ for various numbers of trial points λ. For comparison purposes, the regular case with λ = 10 is also shown. In this example, N = 10 and R = 1 were used.


Figure 8 illustrates the performance that inverse mutations yield in the presence of noise. For comparison purposes, the figure also shows the regular case for λ = 10 and σ_ε ∈ {0.05, 0.5} (dashed lines). Again, the performance gain is obvious: the accuracy degrades for σ_g < σ_ε but is sustained for σ_g > R, which is in sharp contrast to the regular case.


Fig. 8. cos α between the true gradient g and its estimate g̃ for λ ∈ {10, 20} and σ_ε ∈ {0.05, 0.5}. For comparison purposes, the regular case with λ = 10 and σ_ε ∈ {0.05, 0.5} is also shown (dashed lines). In this example, N = 10 and R = 1 were used.

Figure 9 shows the rate of progress ϕ* of the EGS procedure with two step sizes σ_g and σ_p and inverse mutations for the example N = 40 and λ = 24 and a normalized noise strength of σ_ε* = 8. When comparing the figure with Fig. 2, it can be seen that the rate of progress is almost that of the undisturbed case, i.e., σ_ε* = 0. It can furthermore be seen that the performance starts degrading only for too small a step size σ_g. It should be mentioned here that for the case of σ_ε* = 0, the graphs are virtually identical to the cases σ_ε* = 8 and σ_g = 32, and are thus not shown. Figure 10 demonstrates that the proposed method is not restricted to the sphere model; the general quadratic case f(x) = Σ_i ξ^i x_i² also benefits from using inverse mutations. The performance gain is quite obvious when comparing graph "C", i.e., regular mutations, with graphs "A" and "B", i.e., inverse mutations. When using regular mutations, the angle between the true and estimated gradient is rather erratic, whereas inverse mutations yield virtually the same accuracy as for the sphere model. It might be worthwhile to note that regular mutations are able to yield some reasonable accuracy only for small eccentricities, e.g., ξ = 2 in graph "D". Even though inverse mutations are able to accurately estimate the gradient direction, they do not resort to second-order derivatives, so that in the case of


eccentric quadratic fitness functions, the gradient does not point towards the minimum. Please note that all gradient-based optimization procedures that do not utilize second-order derivatives suffer from the same problem.


Fig. 9. The rate of progress ϕ* of EGS with two step sizes and inverse mutations under the presence of noise σ_ε* = 8. The example has used N = 40 and λ = 24. In comparison with Fig. 2, the performance is not affected by the noise.


Fig. 10. cos α between the true gradient g and its estimate g̃ for λ ∈ {10, 20} and σ_ε = 0.5 for the test function f(x) = Σ_i ξ^i x_i², when using inverse mutations (labels "A" and "B") as well as regular mutations (labels "C" and "D"). For graphs "A"-"C", ξ = 10, and for graph "D", ξ = 2 was used. In all examples, N = 10 and R = 1 were used.

6 Conclusions

This paper has briefly reviewed the contemporary (μ/μ, λ)-evolution strategy as well as a hybrid one known as the evolutionary-gradient-search procedure. The review of recent analyses has emphasized that EGS yields superior efficiency (sequential run time), but that its performance significantly degrades in the presence of noise. The paper has also analyzed the reasons for this observable performance loss. It has then proposed two modifications: inverse mutations and the use of two independent step sizes σ_g and σ_p, which are used to generate the λ trial points and to perform the actual progress step, respectively. With these two modifications, the EGS procedure yields a performance, i.e., a normalized rate of progress ϕ*, that is very noise robust; the performance is almost unaffected as long as the step size σ_g for generating the trial points is sufficiently large. The achievements reported in this paper have been motivated by both analyses and experimental evidence, and they have been validated by further experiments. Future work will be devoted to conducting a mathematical analysis.

References

1. D. Arnold, An Analysis of Evolutionary Gradient Search, Proceedings of the 2004 Congress on Evolutionary Computation, IEEE, 47-54, 2004.
2. D. Arnold and H.-G. Beyer, Local Performance of the (μ/μ, λ)-Evolution Strategy in a Noisy Environment, in W. N. Martin and W. M. Spears, eds., Proceedings of Foundations of Genetic Algorithms 6 (FOGA), Morgan Kaufmann, San Francisco, 127-141, 2001.
3. T. Bäck, U. Hammel, and H.-P. Schwefel, Evolutionary Computation: Comments on the History and Current State, IEEE Transactions on Evolutionary Computation, 1(1), 1997, 3-17.
4. H.-G. Beyer, On the Explorative Power of ES/EP-like Algorithms, in V. W. Porto, N. Saravanan, D. Waagen, and A. E. Eiben, eds., Evolutionary Programming VII: Proceedings of the Seventh Annual Conference on Evolutionary Programming (EP'98), Springer-Verlag, Berlin, Heidelberg, Germany, 323-334, 1998.
5. H.-G. Beyer, The Theory of Evolution Strategies, Springer-Verlag, Berlin, 2001.
6. D. B. Fogel, Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, IEEE Press, NJ, 1995.
7. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
8. D. G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, Menlo Park, CA, 1984.
9. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes, Cambridge University Press, 1987.
10. I. Rechenberg, Evolutionsstrategie, Frommann-Holzboog, Stuttgart, 1994.
11. R. Salomon and L. Lichtensteiger, Exploring Different Coding Schemes for the Evolution of an Artificial Insect Eye, Proceedings of the First IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks, 2000, 10-16.
12. R. Salomon, Evolutionary Algorithms and Gradient Search: Similarities and Differences, IEEE Transactions on Evolutionary Computation, 2(2): 45-55, 1998.
13. R. Salomon, Accelerating the Evolutionary-Gradient-Search Procedure: Individual Step Sizes, in T. Bäck, A. E. Eiben, M. Schoenauer, and H.-P. Schwefel, eds., Parallel Problem Solving from Nature (PPSN V), Springer-Verlag, Berlin, Heidelberg, Germany, 408-417, 1998.
14. H.-P. Schwefel, Evolution and Optimum Seeking, John Wiley and Sons, NY, 1995.

Development of Flexible and Adaptable Fault Detection and Diagnosis Algorithm for Induction Motors Based on Self-organization of Feature Extraction

Hyeon Bae¹, Sungshin Kim¹, Jang-Mok Kim¹, and Kwang-Baek Kim²

¹ School of Electrical and Computer Engineering, Pusan National University, Jangjeon-dong, Geumjeong-gu, 609-735 Busan, Korea
{baehyeon, sskim, jmok}@pusan.ac.kr, http://icsl.ee.pusan.ac.kr
² Dept. of Computer Engineering, Silla University, Gwaebop-dong, Sasang-gu, 617-736 Busan, Korea
[email protected]

Abstract. In this study, a datamining application was developed for fault detection and diagnosis of induction motors based on the wavelet transform and classification models with current signals. Energy values were calculated from the wavelet-transformed signals, and the distribution of the energy values for each detail was used for comparing similarity. The appropriate details could be selected by a fuzzy similarity measure. Through the similarity measure, features of the faults could be extracted for fault detection and diagnosis. For fault diagnosis, neural network models were applied, and it was examined which details are suitable for fault detection and diagnosis.

1 Introduction

The most popular way of converting electrical energy to mechanical energy is the induction motor. This motor plays an important role in modern industrial plants. The risk of motor failure can be remarkably reduced if normal service conditions can be arranged in advance. In other words, one may avoid very costly downtime by replacing or repairing motors if warning signs of impending failure can be heeded. In recent years, fault diagnosis has become a challenging topic for many electric machine researchers. The diagnostic methods to identify the faults listed above may involve several different fields of science and technology [1]. K. Abbaszadeh et al. presented a novel approach for the detection of broken rotor bars in induction motors based on the wavelet transformation [2]. Multiresolution signal decomposition based on the wavelet transform or wavelet packets provides a set of decomposed signals in independent frequency bands [3]. Z. Ye et al. presented a novel approach to induction motor current signature analysis based on wavelet packet decomposition (WPD) of the stator current [4]. L. Eren et al. analyzed the stator current via wavelet packet decomposition to detect bearing defects [5]. The proposed method enables the analysis of


frequency bands that can accommodate the rotational speed dependence of the bearing-defect frequencies. Fault detection and diagnosis algorithms have so far been developed for specific target motors, so they cannot guarantee adaptability and flexibility under changing field conditions. In this study, a similarity measure was used to find good features in the signals, which improves the compatibility of the fault detection and diagnosis algorithm.

2 Application Data from Induction Motor

2.1 Signal Acquisition from Induction Motor

Sometimes different faults produce nearly the same frequency components or behave like a healthy machine, which makes the diagnosis impossible. This is the reason why new techniques must also be considered to reach a unique policy for distinguishing among faults. In this study, current signals were used for fault diagnosis of induction motors. Current signals are changed by the inner resistivity, so they can indicate the status of the motor directly. Figure 1 shows the measuring method for induction motors, using current transformers and a burden on the 3-phase supply to the AC motor. In this study, we used this concept to detect faults of the motor.


Fig. 1. Measuring the current signal using instruments

2.2 Fault Types

Many types of signals have been studied for the fault detection of induction motors. However, each technique has advantages and disadvantages with respect to the various types of faults. In this study, the proposed fault detection system dealt with the bowed rotor, broken rotor bar, and healthy cases.

2.2.1 Bearing Faults

Though almost 40~50% of all motor failures are bearing related, very little has been reported in the literature regarding bearing-related fault detection techniques. Bearing faults might manifest themselves as rotor asymmetry faults from the category of eccentricity-related faults [6], [7]. Artificial intelligence and neural networks have been researched to detect bearing-related faults on line. In addition, adaptive, statistical time-frequency methods are being studied to find bearing faults.


2.2.2 Rotor Faults

Rotor failures now account for 5-10% of total induction motor failures. Frequency domain analysis and parameter estimation techniques have been widely used to detect this type of fault. In practice, current side bands may exist even when the machine is healthy. Also, rotor asymmetry resulting from rotor ellipticity, misalignment of the shaft with the cage, magnetic anisotropy, etc. shows up at the same frequency components as broken bars.

2.2.3 Eccentricity Related Faults

This fault is the condition of an unequal air gap between the stator and rotor. It is called static air-gap eccentricity when the position of the minimal radial air-gap length is fixed in space. In the case of dynamic eccentricity, the center of the rotor is not at the center of rotation, so the position of the minimum air gap rotates with the rotor. This may be caused by a bent rotor shaft, bearing wear or misalignment, mechanical resonance at critical speed, etc. In practice, an air-gap eccentricity of up to 10% is permissible.

3 Feature Extraction from Current Signals

3.1 Wavelet Analysis

Time series data are of growing importance in many new database applications, such as data warehousing and datamining. A time series is a sequence of real numbers, each number representing a value at a time point. In the last decade, there has been an explosion of interest in mining time series data. Literally hundreds of papers have introduced new algorithms to index, classify, cluster and segment time series. For fault detection and diagnosis of motors, the Fourier transform has been employed. The Discrete Fourier Transform (DFT) has been one of the general techniques commonly used for the analysis of discrete data. One problem with the DFT is that it misses the important feature of time localization. The Discrete Wavelet Transform (DWT) has been found to be effective in replacing the DFT in many applications in computer graphics, image, speech, and signal processing. In this paper, this technique was applied to time series for feature extraction from the current signals of induction motors. The advantage of using the DWT is the multi-resolution representation of signals. Thus, the DWT is able to give locations in both time and frequency. Therefore, wavelet representations of signals bear more information than those of the DFT [8]. In past studies, the details of the wavelet decomposition were used as features of the motor faults. The following sections introduce the principal algorithms of this past research.

3.2 Problem in Feature Extraction

In time series datamining, the following claim is made. Much of the work in the literature suffers from two types of experimental flaws, implementation bias and data bias. Because of these flaws, much of the work has very little generalizability to real world problems. In particular, many of the contributions made (speed in the case of indexing, accuracy in the case of classification and clustering, model accuracy in the case of segmentation) offer an amount of "improvement" that would have been completely dwarfed by the variance that would have been observed by testing on many


real world data sets, or the variance that would have been observed by changing minor (unstated) implementation details [9]. To solve the problem, a fuzzy similarity measure was employed to analyze the detail values and find good features for fault detection. The features were defined by the energy values of each detail level of the wavelet transform. The poor adaptability and flexibility that have been considered an important issue in fault detection and diagnosis of induction motors could be improved by the proposed algorithm.

3.2.1 Which Wavelet Function Is Proper?

Many transform methods have been used for feature extraction from fault signals. Typically, the Fourier transform has been broadly employed in data analysis in the frequency domain. In this study, instead of the Fourier transform, the wavelet transform was applied, because the wavelet transform can keep information of the time domain and can extract features from time series data. On the other hand, the types of wavelet functions and the detail levels of the decomposition were chosen by the user's decision. Therefore, it could be difficult to extract suitable features for fault detection based on an optimal algorithm. Especially when the developed algorithm is moved to new environments or motors, fault detection and diagnosis could not be achieved properly. Algorithms are sometimes very sensitive to their environments, so the system can produce incorrect results. Therefore, in this study, a new algorithm is proposed to increase the adaptability of fault detection and diagnosis.

3.2.2 How Many Details Are Best?

In past studies, fault detection and diagnosis algorithms were developed and applied to specific motors, that is, the algorithms had no flexibility for general application. Therefore, under changed environments, the system cannot detect faults correctly. In fact, when the algorithm developed in past research was applied to the data used here, the faults could not be detected and diagnosed. In this study, an advanced algorithm is proposed to compensate for the problems of the traditional approaches to fault detection and diagnosis. The best features for fault detection were automatically extracted by the fuzzy similarity measure. The purpose of the proposed algorithm is to compare features and determine the function types and detail levels of the wavelet transform. The algorithm is based on the fuzzy similarity measure and improves detection performance. If the similarity is low, the classification can be accomplished well. Through the proposed method, the best mother wavelet function and detail level can be determined according to the target signals.

3.3 Proposed Solutions

In this chapter, a solution is proposed to settle the problem in feature extraction mentioned before. When wavelet analysis is employed to find features in signals, two important things must be considered before starting the transform. One is to select a wavelet mother function, and the other is to determine a detail level for the decomposition. In this study, the fuzzy similarity measure was applied to compare the performance of 5 functions and 12 detail levels and to determine the best function and detail level. The proposed algorithm can automatically find the best condition for feature extraction, so it can guarantee adaptability and flexibility in industrial processes.


Figure 2 shows the flowchart for fault detection and diagnosis of induction motors.

[Fig. 2 flowchart: fault current signal set → wavelet transform → calculate energy values → evaluate fuzzy similarity → transform new wavelet types]

the scaling coefficients and the wavelet coefficients are

   c_{j+1,k} = Σ_n g[n − 2k] d_{j,n}                                           (3)

   d_{j+1,k} = Σ_n h[n − 2k] d_{j,n}                                           (4)
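A small sketch of the decomposition step in Eqs. (3) and (4), using the Haar (db1) filter pair as an illustrative choice; the filter naming follows the equations above, and the periodic boundary handling is an assumption of this sketch.

    import numpy as np

    def dwt_step(d, g, h):
        """One decomposition level, Eqs. (3)-(4): filter and downsample by two."""
        half = len(d) // 2
        c_next = np.zeros(half)                            # scaling (approximation) coefficients
        d_next = np.zeros(half)                            # wavelet (detail) coefficients
        for k in range(half):
            for m in range(len(g)):
                idx = (m + 2 * k) % len(d)                 # index n = m + 2k, wrapped periodically
                c_next[k] += g[m] * d[idx]                 # Eq. (3)
                d_next[k] += h[m] * d[idx]                 # Eq. (4)
        return c_next, d_next

    g = np.array([1.0, 1.0]) / np.sqrt(2.0)                # Haar low-pass filter
    h = np.array([1.0, -1.0]) / np.sqrt(2.0)               # Haar high-pass filter
    signal = np.sin(2 * np.pi * np.arange(16) / 16.0)
    approx, detail = dwt_step(signal, g, h)
    print(approx.shape, detail.shape)                      # (8,) (8,)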

3.4.2 Fuzzy Similarity Measure

A recent comparative analysis of similarity measures between fuzzy sets has shed a great deal of light on our understanding of these measures. Different kinds of measures were investigated from both the geometric and the set-theoretic points of view. In this paper, a similarity measure is used that is obtained from a distance measure by the transformation SM = 1/(1 + DM). In this sense, the similarity measures of Zwick et al. are actually distance measures used in the derivation of our SM. In a behavioral experiment, different distance measures were generalized to fuzzy sets. A correlation analysis was conducted to assess the goodness of these measures. More specifically, the distance measured by a particular distance measure was correlated with the true distance. It was assumed that it would be desirable for a measure to have high mean and median correlation, to have a small dispersion among its correlations, and to be free of extremely low correlations.


In this study, another similarity measure for fuzzy values was employed, which was proposed by Pappis and Karacapilidis [10]. The similarity measure is defined as follows:

   M_{A,B} = |A ∩ B| / |A ∪ B| = Σ_{i=1..n} (a_i ∧ b_i) / Σ_{i=1..n} (a_i ∨ b_i)   (5)
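A direct Python sketch of Eq. (5) (the ∧ and ∨ operators are the element-wise minimum and maximum of the membership values); the sample membership values are purely illustrative.

    import numpy as np

    def fuzzy_similarity(a, b):
        """Pappis-Karacapilidis similarity, Eq. (5): sum of minima over sum of maxima."""
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        return float(np.minimum(a, b).sum() / np.maximum(a, b).sum())

    healthy = np.array([0.0, 0.2, 0.9, 0.3, 0.0])          # membership values on a common grid
    broken  = np.array([0.0, 0.1, 0.4, 0.8, 0.2])
    print(round(fuzzy_similarity(healthy, broken), 4))     # low values indicate good separability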

4 Fault Detection of Induction Motor

4.1 Current Signals and Data Preprocessing

The ratings of the motor applied in this paper depend on the electrical conditions. The rated voltage, speed, and horsepower are 220 [V], 3450 [RPM], and 0.5 [HP], respectively. The specifications of the used motors are 34 slots, 4 poles, and 24 rotor bars. Figure 3 shows the experimental equipment. The current signals are measured under fixed conditions that take into account the sensitivity of the measured signals, that is, the output of the current probes. The sensitivity of each channel is 10 mV/A, 100 mV/A, and 100 mV/A, respectively. The measured input current signal under this condition consists of 16,384 samples, a 3 kHz maximum frequency, and a 2.1333 s measuring time. Therefore, the sampling time is 2.1333/16,384 s. The fault types used in this study are broken rotor, faulty bearing, bowed rotor, unbalance, and static and dynamic eccentricity. When the wavelet decomposition is used to detect faults in induction motors, the problem of unsynchronized current phases has a great influence on the detection results. Figure 4 shows original current signals that are not synchronized with the origin point. If the target signals are not synchronized with each other, unexpected results will appear in the wavelet decomposition. Therefore, the signals have to be re-sampled so that they start at a phase of 0 (zero). In this study, 64 of the 128 cycles of the signals had to be re-sampled. The average over the cycles of the signal is used to reduce the noise of the original signals.

Fig. 3. Fault detection and data acquisition equipment


Fig. 4. Current signals measured by equipped motors


Fig. 5. Wavelet transformation for one current signal

4.2 Wavelet Transform for Current Signals

As mentioned above, the wavelet transform was applied to extract features. A signal was transformed into 12 details, the number of which depends on the number of samples.


Each detail was reduced to a single energy value, so one set consists of 12 values for one case. The energy value was computed from each detail, and the distribution of the calculated energy values was evaluated by the similarity measure. Figure 5 shows one result of the wavelet transform.

4.3 Calculate Energy Values

From the 12 wavelet-transformed details, energy values were calculated that show the features according to the motor status. The energy values can be illustrated by distribution functions over 20 samples. Each distribution function was expressed as a fuzzy membership function and used for the fuzzy similarity measure. The employed conditions included one normal case and two fault cases, so three membership functions were constructed for each detail. In conclusion, 12 sets of membership functions can be constructed for one wavelet function. After determining the functions and details, features were extracted from the signals and the extracted features were classified by neural network models, that is, fault diagnosis was performed. Figure 6 shows the calculated energy values, which include 12 results according to each detail, and Fig. 7 illustrates the distribution of the energy values for 20 samples. Each detail has three distributions, for the normal, broken rotor bar, and bowed rotor fault cases. Table 1 shows the calculated similarity between the three distributions for each detail. Details 1 and 2 of the first function show the best result for fault classification. Finally, Section 5 shows the classification results using the similarity values to diagnose faults.
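The feature-extraction step described above can be sketched as follows; the use of PyWavelets and the definition of a detail's energy as the sum of its squared coefficients are assumptions of this sketch, since the paper does not spell out these implementation choices.

    import numpy as np
    import pywt  # PyWavelets

    def detail_energies(signal, wavelet="db1", levels=12):
        """Decompose a current signal and return one energy value per detail level."""
        coeffs = pywt.wavedec(signal, wavelet, level=levels)   # [cA_12, cD_12, ..., cD_1]
        return np.array([np.sum(d ** 2) for d in coeffs[1:]])  # energies of the 12 details

    # illustrative current signal: 16,384 samples over 2.1333 s (one fundamental plus noise)
    fs = 16384 / 2.1333
    t = np.arange(16384) / fs
    rng = np.random.default_rng(3)
    signal = np.sin(2 * np.pi * 60.0 * t) + 0.01 * rng.standard_normal(t.size)
    print(detail_energies(signal).shape)                   # (12,) -> one feature vector per measurement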


Fig. 6. Energy values of wavelet transform with Daubechies wavelet function



4.4 Distribution of Energy Values of Details


Fig. 7. Distribution for energy values of wavelet transform with Daubechies wavelet function

4.5 Results of Similarity Measure

In this section, the result of the fuzzy similarity measure is shown. Using the similarity measure, the proper wavelet function and detail levels were selected that can classify

Table 1. Results of fuzzy similarity measures

Detail      Daub1    Daub2    Sym      Coif     Mey      Bior
Detail 1    0.0081   0.0000   0.0000   0.0000   1.4286   0.0081
Detail 2    0.0084   0.0000   0.0000   0.0000   0.0000   0.0084
Detail 3    0.0094   2.0664   2.0664   1.2986   0.9402   0.0094
Detail 4    1.8507   1.9684   1.9684   1.9502   2.0714   1.8507
Detail 5    0.9273   2.3579   2.3579   2.7438   2.3912   0.9273
Detail 6    2.4349   2.3521   2.3521   2.6290   2.3532   2.4349
Detail 7    2.5762   2.3616   2.3616   2.3854   2.3671   2.5762
Detail 8    2.3295   2.5403   2.5403   2.2591   2.4443   2.3295
Detail 9    1.9273   2.6427   2.6427   2.3220   2.4702   1.9273
Detail 10   1.3469   2.6122   2.6122   2.3463   2.4958   1.3469
Detail 11   2.5999   2.6327   2.6327   2.3444   2.3301   2.5999
Detail 12   1.7005   2.6177   2.6177   2.3442   2.6441   1.7005


faults correctly. For the similarity measure, the fuzzy similarity measure was employed to compare the distributions built from 20 samples per case. Namely, one distribution was constructed from 20 sample sets for each fault or the normal case. On each detail domain, the three distribution functions were compared by the similarity measure. After measuring the fuzzy similarity, details with low values were selected for fault diagnosis, because details whose distributions are dissimilar can classify the cases well. As shown in Table 1, the Daubechies wavelet function has low similarity values in general, and Detail 1 to Detail 3 have low similarity in general. Therefore, Details 1 to 3 of the Daubechies wavelet function are suitable for fault detection and diagnosis in this study.
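The selection logic can be summarized in a few lines of Python; the similarity values are taken from the Daub1 column of Table 1, and the cut-off value is an illustrative choice.

    # similarity values of the Daubechies-1 (db1) function, Details 1-12 (Table 1)
    daub1 = [0.0081, 0.0084, 0.0094, 1.8507, 0.9273, 2.4349,
             2.5762, 2.3295, 1.9273, 1.3469, 2.5999, 1.7005]

    threshold = 0.05                                       # illustrative cut-off for "low" similarity
    selected = [i + 1 for i, s in enumerate(daub1) if s < threshold]
    print(selected)                                        # [1, 2, 3] -> Details 1-3 are used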

5 Fault Diagnosis of Induction Motor

5.1 Fault Diagnosis Using Whole Details


(a) Classification result of 12 details using Daub1 and Daub2 functions. (b) Classification result of 12 details using Symlet and Coiflet functions. (c) Classification result of 12 details using Meyer and Biorthogonal functions.
Fig. 8. Fault diagnosis result using 12 details of the wavelet transform



5.2 Fault Diagnosis Using Selected Details


(a) Classification result of selected details using Daub1 and Daub2 functions. (b) Classification result of selected details using Symlet and Coiflet functions. (c) Classification result of selected details using Meyer and Biorthogonal functions.
Fig. 9. Fault diagnosis result using selected details of the wavelet transform

5.3 Comparison of the Three Detail Sets

In this section, the result of fault diagnosis using neural networks is shown. As shown in the figures, fault diagnosis was performed using three data sets: first, the whole set of details was applied; second, the details selected by the fuzzy similarity measure were used; and finally, the details not selected in the second case were employed for fault diagnosis. The purpose of the evaluation was to check the effect of the selected details. If the selected details do not affect the performance of diagnosis, all three cases will show good performance, and vice versa.


As shown in Table 2, the performance is best when only the selected details are used in diagnosis. The performance of diagnosis with the whole set of details is also adequate, but there were some errors. When the selected details were used, the performance was the best, which means the selected details carry the principal information for fault diagnosis. The result using the unselected details showed the worst performance.

Table 2. Comparison of the three data sets (correctly classified samples and accuracy; 42 training and 18 test samples per set)

Function   Whole details              Selected details           Unselected details
           Train        Test          Train        Test          Train        Test
Daub1      42 (100%)    17 (94.44%)   42 (100%)    18 (100%)     42 (100%)    16 (88.88%)
Daub2      42 (100%)    18 (100%)     42 (100%)    18 (100%)     42 (100%)    12 (66.66%)
Symlet     42 (100%)    18 (100%)     42 (100%)    18 (100%)     42 (100%)    16 (88.88%)
Coiflet    42 (100%)    15 (83.33%)   42 (100%)    18 (100%)     42 (100%)    11 (61.11%)
Meyer      42 (100%)    18 (100%)     42 (100%)    18 (100%)     42 (100%)    15 (83.33%)
Bior       42 (100%)    17 (94.44%)   42 (100%)    18 (100%)     42 (100%)    16 (88.88%)

6 Conclusions

The motor is one of the most important pieces of equipment in industrial processes. Most production lines use motors for power supply in industrial fields. Motor control algorithms have been studied for a long time, but motor management systems have not been researched much. Motor management is a field that has recently become popular. However, it is difficult to diagnose a motor and guarantee the performance of the diagnosis, because motors are applied in various fields with different specifications. In this study, a datamining application was developed for fault detection and diagnosis of induction motors based on the wavelet transform and classification models with current signals. Energy values were calculated from the wavelet-transformed signals, and the distribution of the energy values for each detail was used for comparing similarity. The appropriate details could be selected by the fuzzy similarity measure. Through the similarity measure, features of the faults could be extracted for fault detection and diagnosis. For fault diagnosis, neural network models were applied, and it was examined which details are suitable for fault detection and diagnosis. Through the proposed algorithm, the fault detection and diagnosis system for induction motors can be employed for different motors and in changed environments. Therefore, the adaptability and flexibility problem that was considered important in the past can be solved to improve the performance of motors.


Acknowledgement

This work has been supported by KESRI (R-2004-B-129), which is funded by the Ministry of Commerce, Industry and Energy, in Korea.

References

1. P. Vas: Parameter Estimation, Condition Monitoring, and Diagnosis of Electrical Machines. Clarendon Press, Oxford (1993)
2. G. B. Kliman and J. Stein: Induction motor fault detection via passive current monitoring. International Conference on Electrical Machines, Cambridge, MA (1990) 13-17
3. K. Abbaszadeh, J. Milimonfared, M. Haji, and H. A. Toliyat: Broken Bar Detection in Induction Motor via Wavelet Transformation. The 27th Annual Conference of the IEEE Industrial Electronics Society (IECON '01), Vol. 1 (2001) 95-99
4. Z. Ye, B. Wu, and A. Sadeghian: Current signature analysis of induction motor mechanical faults by wavelet packet decomposition. IEEE Transactions on Industrial Electronics, Vol. 50, No. 6 (2003) 1217-1228
5. L. Eren and M. J. Devaney: Bearing Damage Detection via Wavelet Packet Decomposition of the Stator Current. IEEE Transactions on Instrumentation and Measurement, Vol. 53, No. 2 (2004) 431-436
6. Y. E. Zhongming and W. U. Bin: A Review on Induction Motor Online Fault Diagnosis. The Third International Power Electronics and Motion Control Conference (PIEMC 2000), Vol. 3 (2000) 1353-1358
7. Masoud Haji and Hamid A. Toliyat: Pattern Recognition - A Technique for Induction Machines Rotor Fault Detection, Eccentricity and Broken Bar Fault. Conference Record of the 2001 IEEE Industry Applications Conference, Vol. 3 (2001) 1572-1578
8. K. P. Chan and A. W. C. Fu: Efficient time series matching by wavelets. 15th International Conference on Data Engineering (1999) 126-133
9. E. Keogh and S. Kasetty: On the Need for Time Series Datamining Benchmarks: A Survey and Empirical Demonstration. In the 8th ACM SIGKDD International Conference on Knowledge Discovery and Datamining (2002) 102-111
10. C. P. Pappis and N. I. Karacapilidis: A comparative assessment of measures of similarity of fuzzy values. Fuzzy Sets and Systems 56 (1993) 171-174

Computing the Optimal Action Sequence by Niche Genetic Algorithm*

Chen Lin, Huang Jie, and Gong Zheng-Hu

School of Computer Science, National University of Defense Technology, Changsha, 410073, China
[email protected], [email protected]

Abstract. Diagnosis aims to identify faults that explain symptoms by a sequence of actions. Computing the minimum cost of diagnosis is an NP-hard problem if the diagnosis actions are dependent. Some algorithms for dependent actions have been proposed, but their problem descriptions are not accurate enough and their computation cost is high. A precise problem model and an algorithm named NGAOAS (Niche Genetic Algorithm for the Optimal Action Sequence) are proposed in this paper. NGAOAS can avoid running out of memory by forecasting the computing space, and it obtains better implicit parallelism than a normal genetic algorithm, which makes the algorithm easy to compute in grid environments. When NGAOAS is executed, the population is able to maintain diversity and avoid premature convergence. At the same time, NGAOAS is aware of the adaptive degree of each niche, which can reduce the computation load. Compared with the updated P/C algorithm, NGAOAS can achieve a lower diagnosis cost and obtain a better action sequence.

Keywords: ECD (Expected Cost of Diagnosis); Fault Diagnosis; Niche Genetic Algorithm.

1 Introduction

The purpose of diagnosis is to identify faults by a sequence of actions. It is only when faults can be identified in the least time and with the lowest cost that the diagnostic result makes sense. Otherwise, any sequence of actions can identify faults, at a potentially tremendous cost. The efficiency of an action is defined in papers [1-4]. The ratio eff(a_i) = P_i / c_{a_i} is called the efficiency of the action a_i, where c_{a_i} is the cost of the action a_i and P_i = p(a_i succeeds | e) is the conditional probability that the action a_i identifies faults if the evidence e is present. The corresponding P/C algorithm is presented in [1-4]; it calculates the efficiencies of all actions, performs the actions in decreasing order of their efficiencies, and finally acquires an action sequence with minimal expected cost. The algorithm obtains the optimal action sequence in the case of a single fault, independent actions, and independent costs. In [5] an updated

This work is supported by the National Grand Fundamental Research 973 Program of China under Grant No. 2003CB314802.



P/C algorithm is presented to solve the dependent-actions problem. The algorithm selects the action with the highest efficiency eff(a_i) = p(a_i succeeds | e) / c_{a_i} and performs it under the current evidence e. If the action cannot successfully identify the fault, it updates the evidence e to {a_i fails} ∪ e and erases the action a_i from the list of actions. By repeating the above procedure, we can obtain an action sequence with minimal expected cost. We describe the algorithm by a simple example. There are three actions, their probabilities of successful diagnosis are P_1 = 0.45, P_2 = 0.65 and P_3 = 0.55, and their costs are c_{a_1} = c_{a_2} = c_{a_3} = 1. We get eff(a_2) > eff(a_3) > eff(a_1). First of all, we select the action a_2 with the highest efficiency and perform it. If it cannot identify any fault, we update the probabilities, and the new efficiencies become eff(a_1) = p(a_1 succeeds | a_2 fails) / c_{a_1} = 0.5714 and eff(a_3) = p(a_3 succeeds | a_2 fails) / c_{a_3} = 0.4286, respectively. Since the action a_1 now has the higher efficiency, we perform a_1 next. If the process of fault identification fails again, we perform the last remaining action, i.e., a_3. Finally, the sequence <a_2, a_1, a_3> yields the expected diagnosis cost ECD(a_2, a_1, a_3) = c_{a_2} + (1 − p(a_2 succeeds)) c_{a_1} + (1 − p(a_1 succeeds | a_2 fails))(1 − p(a_2 succeeds)) c_{a_3} ≈ 1.50, which is lower than the expected cost acquired by the sequence <a_2, a_3, a_1> of the P/C algorithm. Suppose eff(a_1) = p(a_1 succeeds | a_3 fails) / c_{a_1} = 0.99 and eff(a_2) = p(a_2 succeeds | a_3 fails) / c_{a_2} = 0.10; then the sequence <a_3, a_1, a_2> will have a lower expected diagnosis cost, ECD(a_3, a_1, a_2) ≈ 1.455, in the case of dependent actions. Therefore, we can conclude that the approximate approach of the updated P/C algorithm is not always optimal.

In this paper, we first give a detailed description of the fault diagnosis problem, and then present an algorithm named NGAOAS. NGAOAS exploits the background knowledge of the problem and partitions the candidate action sequences into several niches, each of which corresponds to a permutation of the candidate fault set. Each individual competes with others in the same niche, and at the same time NGAOAS estimates whether a niche should continue evolving and when to do the evolution, which reduces the calculation load as much as possible and obtains the result as soon as possible. Finally, we show through experiments that NGAOAS has several outstanding features: 1) the space used by NGAOAS can be predicted; 2) compared with a normal genetic algorithm, NGAOAS has better implicit parallelism; 3) while NGAOAS executes, the population is able to maintain diversity and avoid premature convergence; 4) NGAOAS introduces a niche adaptive value to reduce the calculation load, and thus can speed up convergence and avoid premature convergence. Compared with the updated P/C algorithm, NGAOAS can achieve a lower diagnosis cost and obtain a better action sequence.
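The expected cost computed in the example above can be reproduced with a few lines of Python; the helper name and the way the conditional failure probabilities are passed are our own choices for this sketch.

    def ecd(costs, fail_probs):
        """Expected cost of diagnosis of an action sequence: each action's cost is
        weighted by the probability that all actions before it in the sequence failed.
        fail_probs[j] is p(a_j fails | all previous actions in the sequence failed)."""
        total, p_all_failed = 0.0, 1.0
        for c, p_fail in zip(costs, fail_probs):
            total += p_all_failed * c
            p_all_failed *= p_fail
        return total

    # sequence <a2, a1, a3> of the example: p(a2 fails) = 0.35,
    # p(a1 fails | a2 fails) = 1 - 0.5714; the last entry does not influence the cost
    print(round(ecd([1.0, 1.0, 1.0], [0.35, 1.0 - 0.5714, 1.0]), 2))   # -> 1.5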

2 Description of the Fault Diagnosis Problem in Network

This section describes the fault diagnosis problem in a network and related definitions. The fault evidence set E = {e_1, e_2, ..., e_m} denotes all possible fault evidences in the network. E_c ⊆ E is the current fault evidence set.


The candidate fault set F = {f_1, f_2, ..., f_n} denotes all possible faults. f_c ∈ F is the current fault. The probability P_{f_i} of the candidate fault f_i under the evidence set E_c can be expressed as P_{f_i} = p(f_i happens | E_c). The probability set of the candidate faults under the evidence set E_c is Prob_{E_c} = {p(f_1 happens | E_c), ..., p(f_n happens | E_c)}.

The action set A = {a1 , a2 ,, al } is formed by all the possible actions which can identify the current fault fi according to Ec . The cost of the action ai is cai . The action subset on fi is an action set selected to diagnose candidate fault fi , which is represented as A f i , Afi ⊆ A . The action subset on F is represented as multi-set AF =

A

fi ∈F

fi

. AF is a multi-set because the repeated actions are possibly needed while

diagnosing different candidate faults. For an arbitrary action subset Af , we can obtain | A f i |! permutations of all the i

elements in it. The sequence set is denoted as S A = {s1 A , s2 A ,…, s(| A f |)! A } , where fi

fi

fi

i

fi

s jAf is a permutation of Afi . i

Definition 1. (Expected Cost of Diagnosis, ECD) The expected cost of diagnosis refers to the cost spent by the process that an action sequence s ∈ S A completes the F

diagnosis task. Supposing there is an action sequence, s = ( a1 , a2 ,…, ak ) , which is a permutation of the multi-set AF , expected cost of diagnosis is defined as: k

j −1

j =1

i =1

ECD( s | Ec ) = ¦ p({ai fails}| Ec )ca j

(1)

p (ai fails | Ec ) is the probability of the diagnosis failure when ai is performed in the evidence set Ec .

Definition 2. For an arbitrary s ' ∈ S A , we can get ECD ( s | EC ) ≤ ECD ( s ' | EC ) , if F

s ∈ S A is the optimal action sequence when s is to identify f c in the evidence set F

Ec .

Once the evidence set Ec is known, the optimal action sequence can be found by searching s ∈ S A and computing the least ECD ( s | Ec ) . This combinatorial F

optimization problem is an NP-hard problem. Considering the space limit, we omit proof of the NP-hard problem. To solve the decision-making problem about diagnosis action, we propose the following assumptions. Assumption 1. (Single fault assumption) Only one fault is present. Assumption 2. (Independent cost) The cost of each action is independent with the results of former actions, which means that each action cost is a constant. Assumption 3. (Independent diagnosis process) We can identify faults one by one according to an action sequence until a candidate fault issued by the current symptoms is proven.

Computing the Optimal Action Sequence by Niche Genetic Algorithm

151

Although we have made assumptive limits for the problem, it is still an NP-hard problem and it is hard to find the optimal action sequence for dependence actions. However, the assumptions can help to solve the problem, particularly assumption 3, which is originated from the actual diagnosis process. According to assumption 3, finding the optimal action sequence is a process of calculating s1  s2  ...  sn , during which ECD ( s1  s2  ...  sn | Ec ) remains the least, n

where

 is the append operator.. For any i , we have si ∈  S A f j =1

j

. If

i ≠ j , si ∈ S A f , s j ∈ S Af , we have S A f ≠ S Af . k

l

k

l

According to the former assumptions, for a sequence s1  s2  ...  sn , we can suppose si ∈ S Af , s1 = ( a11 , a12 ,..., a1α ) , s2 = (a21 , a22 ,..., a2 β ) , s3 = (a31 , a32 ,..., a3γ ) ,…. i

The value of ECD can be calculated, based on the assumptions and the definition of the action subsets on F, as follows:

ECD(s_1 ⊕ s_2 ⊕ ... ⊕ s_n | E_c)
  = p(f_1 | E_c) Σ_{j=1}^{α} p( ∩_{i=1}^{j} {a_1i fails} | f_1 ) c_{a_j}
  + p(f_2 | E_c) Σ_{j=1}^{β} p( (∩_{i=1}^{α} {a_1i fails}) ∩ (∩_{i=1}^{j−1} {a_2i fails}) | f_2 ) c_{a_j}
  + p(f_3 | E_c) Σ_{j=1}^{γ} p( (∩_{i=1}^{α} {a_1i fails}) ∩ (∩_{i=1}^{β} {a_2i fails}) ∩ (∩_{i=1}^{j−1} {a_3i fails}) | f_3 ) c_{a_j}
  + ...                                                                    (2)

In formula (2), p(f_i | E_c) is the probability of fault f_i given the evidence set E_c. According to the assumptions, c_{a_i} is a constant. The term p(f_1 | E_c) Σ_{j=1}^{α} p(∩_{i=1}^{j} {a_1i fails} | f_1) c_{a_j} is the expected cost spent when we verify first whether the fault f_1 occurs. The term p(f_2 | E_c) Σ_{j=1}^{β} p((∩_{i=1}^{α} {a_1i fails}) ∩ (∩_{i=1}^{j−1} {a_2i fails}) | f_2) c_{a_j} is the expected cost spent when we verify whether fault f_2 occurs if f_1 does not occur. We assume that the conditional probability that a single action cannot find the fault f_i, namely p(a_j fails | f_i), is known, and so is the conditional probability that any two of the actions cannot find the fault f_i:

  p(a_j fails, a_k fails | f_i)                                            (3)

In formula (2), the probabilities p(a_1 fails, a_2 fails, a_3 fails, ... | f_i) can be calculated by the following iterative procedure:

  p(a_1 fails, a_2 fails, a_3 fails | f_i)
  = p(a_1 fails, a_2 fails | f_i) p(a_3 fails | a_1 fails, a_2 fails, f_i)
  ≈ p(a_1 fails, a_2 fails | f_i) p(a_3 fails | a_1 fails, f_i) p(a_3 fails | a_2 fails, f_i)
  = p(a_1 fails, a_2 fails | f_i) p(a_3 fails, a_1 fails | f_i) p(a_3 fails, a_2 fails | f_i) / ( p(a_1 fails | f_i) p(a_2 fails | f_i) )          (4)

  p(a_1 fails, a_2 fails, a_3 fails, a_4 fails | f_i)
  = p(a_1 fails, a_2 fails, a_3 fails | f_i) p(a_4 fails | a_1 fails, a_2 fails, a_3 fails, f_i)
  ≈ p(a_1 fails, a_2 fails, a_3 fails | f_i) p(a_4 fails | a_1 fails, f_i) p(a_4 fails | a_2 fails, f_i) p(a_4 fails | a_3 fails, f_i)
  = p(a_1 fails, a_2 fails, a_3 fails | f_i) p(a_4 fails, a_1 fails | f_i) p(a_4 fails, a_2 fails | f_i) p(a_4 fails, a_3 fails | f_i) / ( p(a_1 fails | f_i) p(a_2 fails | f_i) p(a_3 fails | f_i) )          (5)

In order to calculate the joint probabilities of failure of multiple actions, we therefore only need the failure probabilities of single actions and the joint failure probabilities of any two actions. Under the previous assumptions, an approximation of ECD(s_1 ⊕ s_2 ⊕ ... ⊕ s_n | E_c) can be computed from formula (2) and the iterative method of formulae (3), (4) and (5). Based on the problem description and the simplifying assumptions, computing the optimal action sequence amounts to minimizing ECD(s_1 ⊕ s_2 ⊕ ... ⊕ s_n | E_c).
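The following Python sketch illustrates one way to evaluate this expected cost of diagnosis. It is not the authors' implementation: the data layout (dictionaries of single and pairwise conditional failure probabilities) and the convention that each action's cost is weighted by the probability that all previously executed actions have failed are assumptions made for illustration.

```python
def joint_fail_prob(actions, fault, p_fail, p_fail2):
    """Approximate p(a_1 fails, ..., a_m fails | fault) with the pairwise
    iteration of formulae (3)-(5)."""
    if not actions:
        return 1.0
    joint = p_fail[(actions[0], fault)]
    for m in range(1, len(actions)):
        new = actions[m]
        # p(new fails | previous actions failed, fault) ~ product of pairwise conditionals
        factor = 1.0
        for prev in actions[:m]:
            factor *= p_fail2[(prev, new, fault)] / p_fail[(prev, fault)]
        joint *= factor
    return joint

def expected_cost_of_diagnosis(sequence, p_fault, cost, p_fail, p_fail2):
    """sequence[i] is the list of actions used to verify candidate fault i;
    p_fault[i] = p(f_i | E_c); cost[a] = c_a."""
    ecd, executed = 0.0, []          # actions already run while earlier faults were ruled out
    for i, s_i in enumerate(sequence):
        for j, a in enumerate(s_i):
            prefix = executed + s_i[:j]      # actions that must have failed before a is run
            ecd += p_fault[i] * joint_fail_prob(prefix, i, p_fail, p_fail2) * cost[a]
        executed = executed + s_i
    return ecd
```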

3 Computing the Optimal Action Sequence by Niche Genetic Algorithm

In order to solve the problem, we should first find an optimal permutation of the candidate fault set F, and then find an optimal action sequence for each f_i. Because the diagnosis actions are dependent, the diagnosis cost differs for different permutations of F even when the action sequences of the individual f_i are the same. Once the fault sequence is specified (i.e., the permutation of F is fixed), we can obtain a locally optimal action sequence by adjusting the action sequence of each f_i. Based on this analysis, we partition the solution into two steps. First, we compute a locally optimal action sequence for each permutation of F. Then, the globally optimal action sequence is selected by comparing the costs of all the locally optimal action sequences. In order to obtain the least diagnosis cost with a reduced calculation load, we present a niche genetic algorithm. From formula (2) we can see that the optimal action sequence is closely related to the sequence of candidate faults. Once a fault sequence is given, we can get a locally optimal action sequence by adjusting the order of each fault's action subset, and from the locally optimal action sequences of all candidate fault permutations we can obtain the global optimum. During the run of the algorithm we should retain as many good results from different fault sequences as possible. We therefore adopt the niche technique of cooperating multi-population genetic algorithms and partition all action sequences into multiple niches, each of which corresponds to one permutation of the fault set. An individual first competes with the other individuals in the same niche. At the same time, the algorithm uses an adaptability function to estimate whether a niche should continue evolving and when it should evolve, which reduces the calculation load as much as possible and yields the result as soon as possible. We first define the coding of the niche genetic algorithm and its operators.


The coding manner [6][7]. Each action sequence of f_i adopts the symbolic coding (a_ix, a_iy, ..., a_iz). Each action sequence of F adopts a multi-symbolic coding obtained by appending the action sequences of the individual faults:
(a_ix, a_iy, ..., a_iz, a_jx', a_jy', ..., a_jz', ..., a_kx'', a_ky'', ..., a_kz'').

Crossover operator. For chromosomes (a_ix, a_iy, ..., a_iz, ..., a_jx', a_jy', ..., a_jz', ..., a_kx'', a_ky'', ..., a_kz'') and (a_lx''', a_ly''', ..., a_lz''', ..., a_jx'''', a_jy'''', ..., a_jz'''', ..., a_nx''''', a_ny''''', ..., a_nz''''') and any integer j, the offspring generated by the crossover operation are (a_ix, a_iy, ..., a_iz, a_jx'''', a_jy'''', ..., a_jz'''', ..., a_kx'', a_ky'', ..., a_kz'') and (a_lx''', a_ly''', ..., a_lz''', ..., a_jx', a_jy', ..., a_jz', ..., a_nx''''', a_ny''''', ..., a_nz''''').

Mutation operator. For a chromosome (..., a_jx, ..., a_jy, ...) and three integers j, x, y, the chromosome generated by the mutation operation is (..., a_jy, ..., a_jx, ...).

Reversal operator. For a chromosome (..., a_iz', a_jx, a_jy, ..., a_jz, a_kx', ...) and any integer j, the chromosome generated by the reversal operation is (..., a_iz', a_jz, ..., a_jy, a_jx, a_kx', ...).
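The three operators act on whole per-fault subsequences of the symbolic coding. A minimal sketch of how they could be implemented is given below; the chromosome representation (a list of per-fault action segments) is an assumption, not the paper's data structure.

```python
def crossover(parent1, parent2, j):
    """Exchange the action subsequence of fault j between the two parents."""
    child1 = [seg[:] for seg in parent1]
    child2 = [seg[:] for seg in parent2]
    child1[j], child2[j] = parent2[j][:], parent1[j][:]
    return child1, child2

def mutation(chrom, j, x, y):
    """Swap the actions at positions x and y inside the subsequence of fault j."""
    child = [seg[:] for seg in chrom]
    child[j][x], child[j][y] = child[j][y], child[j][x]
    return child

def reversal(chrom, j):
    """Reverse the action subsequence of fault j."""
    child = [seg[:] for seg in chrom]
    child[j].reverse()
    return child
```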

Adaptability function. The adaptability (fitness) function is fitness(s | E_c) = length(s) · max{c_{a_i}} / ECD(s | E_c), where ECD(s | E_c) is calculated according to formula (2), s = (a_ix, a_iy, ..., a_iz, a_jx', a_jy', ..., a_jz', ..., a_kx'', a_ky'', ..., a_kz'') is a chromosome, and the factor length(s) · max{c_{a_i}} in the numerator keeps the adaptability value within the numeric range of the computer.

In the niche genetic algorithm, in order to make the individuals within a niche compete with each other, we adopt the fitness-sharing model. The method adjusts an individual's adaptive value according to the individual's sharing scale in the population, so that the population can maintain multiple high-quality modes. The sharing model is a nonlinear rescaling of the adaptive value based on the similarity between individuals in the population. This mechanism limits the unbounded growth of any single species in the population, which ensures that different species can develop in their own niches. In order to define the sharing function, we first define the coding distance between any two individuals.

Definition 3. d(s_i, s_j) denotes the distance between codings s_i and s_j: d(s_i, s_j) = Σ | order_i(a_kl) − order_j(a_kl) |, where order_i(a_kl) and order_j(a_kl) are the positions of action a_kl in s_i and s_j respectively.

Definition 4. Given a set of |F|! niches NICHES = {niche_1, niche_2, ..., niche_|F|!}, the radius of a niche is defined as σ_{niche_α} = max{d(s_i, s_j)}, where s_i, s_j ∈ niche_α.

Definition 5. The sharing function between individuals s_i and s_j is defined as

  sh(d(s_i, s_j)) = 1 − ( d(s_i, s_j) / σ_{niche_α} )^γ   if s_i, s_j ∈ niche_α,
  sh(d(s_i, s_j)) = 0                                      otherwise,


where γ is used to adjust the shape of the sharing function; Beasley's advice is to set γ = 2 [7]. The sharing value represents the degree of similarity between two individuals. Note that the sharing value between individuals of different niches is zero.

Definition 6. For a population S = {s_1, s_2, ..., s_n}, the sharing scale of individual s_i in the population is defined as

  m_i = Σ_{j=1}^{n} sh(d(s_i, s_j)),  i = 1, 2, ..., n.

Definition 7. The shared (adjusted) fitness value is defined as

  fitness'(s_i | E_c) = fitness(s_i | E_c) / m_i,  i = 1, 2, ..., n.
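A compact sketch of this fitness-sharing adjustment is shown below. The data layout and the way the niche radius is obtained from the current population are assumptions for illustration; only same-niche individuals share fitness, as required by Definition 5.

```python
def coding_distance(s_i, s_j):
    """d(s_i, s_j): sum of position differences of each action (chromosomes flattened)."""
    flat_i = [a for seg in s_i for a in seg]
    flat_j = [a for seg in s_j for a in seg]
    pos_j = {a: k for k, a in enumerate(flat_j)}
    return sum(abs(k - pos_j[a]) for k, a in enumerate(flat_i))

def sharing(d, radius, gamma=2.0):
    """Sharing function of Definition 5 (gamma = 2, following Beasley's advice)."""
    return 1.0 - (d / radius) ** gamma if radius > 0 else 1.0

def shared_fitness(population, niche_of, raw_fitness, gamma=2.0):
    """Adjust each individual's fitness by its sharing scale m_i (Definitions 6-7)."""
    radius = {}                      # niche radius = largest same-niche pairwise distance
    for i, s_i in enumerate(population):
        for j, s_j in enumerate(population):
            if niche_of[i] == niche_of[j]:
                d = coding_distance(s_i, s_j)
                radius[niche_of[i]] = max(radius.get(niche_of[i], 0), d)
    adjusted = []
    for i, s_i in enumerate(population):
        m_i = sum(sharing(coding_distance(s_i, s_j), radius[niche_of[i]], gamma)
                  for j, s_j in enumerate(population) if niche_of[i] == niche_of[j])
        adjusted.append(raw_fitness[i] / m_i)
    return adjusted
```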

Individual selection operator. After the fitness values have been adjusted, we use an elitist (optimal-preserving) policy and copy the α|F|! best individuals to the next generation, where α is the scale of a niche.

The niche's adaptive value. The adaptive value of a niche is NF(niche_α) = |ΔE(ECD(s | E_c))| / E(ECD(s | E_c)) (s ∈ niche_α), where E(ECD(s_i | E_c)) is the expected value of the individuals' ECD in the niche. The smaller E(ECD(s_i | E_c)) is, the better the niche is. |ΔE(ECD(s | E_c))| is the rate of change of E(ECD(s_i | E_c)) over one step, i.e. the evolution speed of the individuals in the niche; the larger |ΔE(ECD(s | E_c))| is, the faster the niche evolves. Niches with a higher adaptive value are therefore processed with priority, while niches that evolve slowly and have a large E(ECD(s_i | E_c)) can be put aside for the moment, so unnecessary calculation load is reduced. The computation of the optimal action sequence by the niche genetic algorithm proceeds as follows.

Algorithm NGAOAS (Niche Genetic Algorithm for the Optimal Action Sequence).

(1) Initialize the counter t = 0.
(2) Randomly generate the initial population P(0) with initial scale α|F|! (α is the scale of each niche, |F|! is the number of niches).
(3) Initialize the adaptive value of every niche: NF(niche) = +∞.
(4) Select the best |F|!/10 niches based on NF(niche).
(5) Apply the crossover operation between individuals with probability β: P'(t) ← Crossover[P(t)].
(6) Apply the mutation operation to individuals with probability γ: P''(t) ← Mutation[P'(t)].
(7) Apply the reversal operation to individuals with probability σ: P'''(t) ← Inversion[P''(t)].
(8) Calculate the individual adaptive value fitness(s | E_c).
(9) Adjust the adaptive value by the sharing scale to obtain the adjusted fitness'(s).
(10) Copy the α|F|! individuals with the best adjusted adaptive value to the next generation, obtaining P(t+1).
(11) If t mod 10 = 0, recalculate the adaptive value NF(niche) of each evolving niche.
(12) If there exist a niche and an individual s ∈ niche such that fitness(s | E_c) = max{fitness(s' | E_c)} and NF(niche) = 0, and for every niche NF(niche) < δ, the algorithm ends; otherwise, go to (4).
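The skeleton below condenses steps (1)-(12) into code, reusing the operator and fitness-sharing sketches given earlier; `ecd` is a callable implementing formula (2). The control details (per-niche elitism, the simplified termination test, the handling of the 10% niche selection) are assumptions for illustration rather than the authors' exact implementation.

```python
import random

def ngaoas(populations, ecd, max_cost, beta, gamma, sigma, delta, max_gen=300):
    """populations: {fault_permutation: [chromosome, ...]} -- one niche per permutation."""
    nf = {perm: float("inf") for perm in populations}                      # (3)
    prev_mean = {perm: None for perm in populations}
    for t in range(max_gen):                                               # (1)
        active = sorted(nf, key=nf.get, reverse=True)[:max(1, len(nf) // 10)]   # (4)
        for perm in active:
            pop = populations[perm]
            offspring = []
            for chrom in pop:
                j = random.randrange(len(chrom))            # pick a fault segment
                if random.random() < beta:                                 # (5) crossover
                    chrom, _ = crossover(chrom, random.choice(pop), j)
                if random.random() < gamma and len(chrom[j]) > 1:          # (6) mutation
                    x, y = random.sample(range(len(chrom[j])), 2)
                    chrom = mutation(chrom, j, x, y)
                if random.random() < sigma:                                # (7) reversal
                    chrom = reversal(chrom, j)
                offspring.append(chrom)
            candidates = pop + offspring
            raw = [sum(len(s) for s in c) * max_cost / ecd(c) for c in candidates]   # (8)
            adj = shared_fitness(candidates, [perm] * len(candidates), raw)          # (9)
            ranked = sorted(zip(adj, candidates), key=lambda x: -x[0])
            populations[perm] = [c for _, c in ranked[:len(pop)]]          # (10) elitism
            if t % 10 == 0:                                                # (11)
                mean = sum(ecd(c) for c in populations[perm]) / len(pop)
                if prev_mean[perm] is not None and mean > 0:
                    nf[perm] = abs(prev_mean[perm] - mean) / mean
                prev_mean[perm] = mean
        if all(v < delta for v in nf.values()):                            # (12), simplified
            break
    return min((c for pop in populations.values() for c in pop), key=ecd)
```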

4 The Performance Analysis of NGAOAS

The keys to computing the optimal action sequence with the genetic algorithm are selecting proper parameters and preserving good properties. Before analyzing the features of NGAOAS, we select its parameters by testing. NGAOAS has four parameters; from experience we assign the reference values β = 0.3, γ = 1, σ = 0.03. Notice that because the algorithm adopts symbolic coding, the mutation operator plays a role similar to the crossover operator of a general GA, while the reversal operator is the real mutation operator. Therefore, we set the mutation probability of NGAOAS according to the empirical crossover probability of a general GA and, accordingly, the reversal probability of NGAOAS according to the empirical mutation probability of a general GA.

Fig. 1. Choose parameter α

We emphasize the parameter α, which determines the initial population scale of NGAOAS. The choice of the initial population scale affects the probability of finding the optimal result, and an unsuitable initial scale may cause premature convergence. We select five network faults and 20 diagnosis actions, where each fault diagnosis requires 10 actions on average, and run the experiment on a stand-alone PC with a PIV 2 GHz CPU and 512 MB memory. From figure 1 we can see that when the parameter α is larger than 100, premature convergence is avoided. In order to reduce the memory space required by the algorithm, we set α = 100 in the later tests.


Furthermore, NGAOAS has some notable features.
Feature 1. The space complexity of NGAOAS is O(|F|!), so the memory space needed by the algorithm can be predicted. Only the α|F|! best individuals are retained by the elitist reservation strategy, so the memory required by NGAOAS can be predicted and kept below α|F|!, and a lack of memory can be detected before running the algorithm. Moreover, the space complexity of the algorithm is O(|F|!) rather than O(length(s)!), which reduces the space complexity effectively.
Feature 2. NGAOAS has good implicit parallelism. Through the explicit definition of the problem and its assumptions we exploit the background knowledge and adopt the niche technique. Each niche is independent and seldom communicates with the others; therefore NGAOAS has better implicit parallelism than a normal genetic algorithm, can make good use of an existing computing grid, and is more practical. We tested the performance of NGAOAS on a single PC and in a grid environment according to Feature 2. We select 5 network faults and 20 corresponding diagnosis actions; each fault diagnosis requires 10 actions on average. The stand-alone environment is a PIV 2 GHz CPU with 512 MB memory; the grid environment consists of the Globus Toolkit 2.4 computation grid platform and 10 identical test PCs.

Fig. 2. Compare the performance under the circumstance between a single PC and the grid environment

From figure 2 we find that the performance of NGAOAS increases by about 6 to 7 times in the grid environment, which confirms that NGAOAS has good implicit parallelism.
Feature 3. The population is able to maintain diversity, which avoids premature convergence. NGAOAS defines the distance between individuals, adjusts the adaptive value with the sharing scale and selects the best individuals according to the adjusted adaptive value, so that the individuals in each niche develop in a balanced way.


NGAOAS takes both the competitive pressure within a niche and the global competitive pressure into account, and thus avoids premature convergence. According to Feature 3, we compare the genetic algorithm without the niche technique with NGAOAS; the test is carried out in the same environment as for Feature 2. We adopt

  div(P(t)) = Σ_{s_i, s_j ∈ P(t)} d(s_i, s_j) / Σ_{s_k, s_l ∈ P(0)} d(s_k, s_l)

as the measure of the population's diversity.
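A one-function sketch of this diversity measure, reusing the coding_distance helper sketched above (the normalisation by the initial population P(0) is taken directly from the formula):

```python
def diversity(population, initial_population):
    """Sum of pairwise coding distances, normalised by the same sum for P(0)."""
    def pairwise_sum(pop):
        return sum(coding_distance(a, b)
                   for i, a in enumerate(pop) for b in pop[i + 1:])
    return pairwise_sum(population) / pairwise_sum(initial_population)
```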

Fig. 3. The Diversity changing curve along with genetic generations

Figure 3 shows the population's diversity curve over the genetic generations.

Fig. 4. The changing curve of adaptive value along with genetic generations

Figure 4 shows the curve of the best individual's adaptive value in the same situation. From figures 3 and 4 we can see that, compared with the general GA, NGAOAS keeps the adaptive values of the individuals increasing and retains better diversity as the number of generations grows, whereas the population diversity of the general GA decreases quickly and the best individuals' adaptive values no longer increase after the 100th generation. Premature convergence


severely harms the quality of the results. Repeated experiments show that NGAOAS has a moderate convergence speed: within 300 generations the probability of finding the optimal result is 82%, and in the remaining 18% of the runs a near-optimal result, close to the optimum, is obtained. This is sufficient for most situations.
Feature 4. NGAOAS introduces the niche adaptive value to reduce the calculation load, and thus speeds up convergence while still avoiding premature convergence. The niche adaptive value is defined according to the diagnosis decision problem: the evolution speed of a niche is proportional to its adaptive value, so such niches have a good potential of being optimized, and the smaller the current optimal ECD is, the better the niche is. A proper design of the niche adaptive value function ensures that the optimal results can still be generated, decreases unnecessary calculation load to a certain extent, and speeds up convergence without causing premature convergence. In the experiment we choose a setting with a small calculation load and compare the convergence speed before and after introducing the niche adaptive value function. In the test environment we select 6 network faults and 16 corresponding diagnosis actions; each fault diagnosis requires 3 to 4 actions.

Fig. 5. Comparison of the convergence rates

In figure 5, the algorithm without the niche adaptive value converges to a steady state after about 500 generations, whereas with the niche adaptive value function it converges after about 200 generations: the convergence rate doubles, and with more candidate faults the gain increases even further. From figure 5 we can also see that NGAOAS improves the convergence rate without causing premature convergence: it finds the optimal or near-optimal results at a faster convergence rate. We also ran experiments in various diagnosis environments and compared the diagnosis cost of the updated P/C algorithm with that of NGAOAS. As shown in figure 6, when actions are dependent, NGAOAS obtains a better diagnosis action sequence and a lower cost than the updated P/C algorithm.


Fig. 6. Comparison of the optimal diagnosis costs

5 Summary and Conclusion

This paper presents an algorithm named NGAOAS to compute the minimum diagnosis cost under the condition of dependent actions. NGAOAS decomposes the candidate action sequences into |F|! niches, each of which corresponds to a single permutation of the candidate faults. An individual competes with the other individuals in the same niche. At the same time, NGAOAS estimates whether a niche should continue evolving and when it should evolve, which reduces the calculation load as much as possible and yields the result as soon as possible. The experiments show that NGAOAS has several notable features and that, compared with the updated P/C algorithm, it obtains a better diagnosis action sequence at a lower cost. We have described the diagnosis problem under the condition of dependent actions. There are several other scenarios that we have not considered and that deserve further exploration. We assumed independent faults or a single fault, which simplifies the computations, but dependent faults and multiple faults may occur in the real world; [8] and [9] introduce the model of exact cover by 3-sets for this purpose, and these problems will be studied further. Another issue relates to the role of observations [10]: observations alone cannot solve the diagnosis problem, but they can exclude some faults and speed up the troubleshooting.

References 1. J.-F. Huard and A. A. Lazar. Fault isolation based on decision-theoretic troubleshooting. Technical Report 442-96-08, Center for Telecommunications Research, Columbia University, New York, NY 1996. 2. J.Baras, H.Li, and G..Mykoniatis. Integrated distributed fault management for communication networks. Technical Report 98-10, Center for Satellite and Hybrid Communication Networks, University of Maryland, College Park, MD,1998. 3. Breese J S, Heckerman D. Decision-Theoretic Troubleshooting: A Framework for Repair and Experiment. Proc. 12th Conf. on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers, San Francisco, CA,1996:124-132


4. Skaanning C, Jensen F V, Kjærulff U. Printer Troubleshooting Using Bayesian Networks. Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (IEA/AIE) 2000, New Orleans, USA, June, 2000 5. VomlelovÁ M. Decision Theoretic Troubleshooting. PhD Thesis, Faculty of Informatics and Statistics, University of Economics, Prague, Czech Republic, 2001. 6. Shi Zhong Zhi. Knowledge Discover. Beijing: Tsinghua university press, 2002. ISBN 7302-05061-9:265~293(in Chinese) 7. Li Min Qiang, Kou Ji Song, Lin Dan, Li Shu Quan. Basic theory and application of Genetic Algorithm. Beijing: Science press, 2002.3, ISDN7-03-009960-5:233-244. 8. C.Skaanning, F.V.Jensen, U.Kjarulff, and A.L.Madsen. Acquisition and transformation of likelihoods to conditional probabilities for Bayesian networks. AAAI Spring Symposium, Stanford, USA,1999. 9. C.Skaanning,. A knowledge acquisition tool for Bayesian network troubleshooters. Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, Stanford, USA, 2000. 10. M. Steinder and A. S. Sethi. Increasing robustness of fault localization through analysis of lost, spurious, and positive symptoms. IEEE. 2002

Diagnosis of Plan Execution and the Executing Agent

Nico Roos (1) and Cees Witteveen (2,3)

(1) Dept of Computer Science, Universiteit Maastricht, P.O. Box 616, NL-6200 MD Maastricht, [email protected]
(2) Faculty EEMCS, Delft University of Technology, P.O. Box 5031, NL-2600 GA Delft, [email protected]
(3) Centre of Mathematics and Computer Science, P.O. Box 94079, NL-1090 GB Amsterdam, [email protected]

Abstract. We adapt the Model-Based Diagnosis framework to perform (agentbased) plan diagnosis. In plan diagnosis, the system to be diagnosed is a plan, consisting of a partially ordered set of instances of actions, together with its executing agent. The execution of a plan can be monitored by making partial observations of the results of actions. Like in standard model-based diagnosis, observed deviations from the expected outcomes are explained qualifying some action instances that occur in the plan as behaving abnormally. Unlike in standard model-based diagnosis, however, in plan diagnosis we cannot assume that actions fail independently. We focus on two sources of dependencies between failures: dependencies that arise as a result of a malfunction of the executing agent, and dependencies that arise because of dependencies between action instances occurring in a plan. Therefore, we introduce causal rules that relate health states of the agent and health states of actions to abnormalities of other action instances. These rules enable us to introduce causal set and causal effect diagnoses that use the underlying causes of plan failing to explain deviations and to predict future anomalies in the execution of actions. Keywords: planning, diagnosis, agent systems.

1 Introduction

In Model-Based Diagnosis (MBD) a model of a system consisting of components, their interrelations and their behavior is used to establish why the system is malfunctioning. Plans resemble such system specifications in the sense that plans also consist of components (action specifications), their interrelations and a specification of the (correct) behavior of each action. Based on this analogy, the aim of this paper is to adapt and extend a classical Model-Based Diagnosis (MBD) approach to the diagnosis of plans. To this end, we will first formally model a plan consisting of a partially ordered set of actions as a system to be diagnosed, and subsequently we will describe how a diagnosis can be established using partial observations of a plan in progress. Distinguishing between normal and abnormal execution of actions in a plan, we will introduce sets of


actions qualified as abnormal to explain the deviations between expected plan states and observed plan states. Hence, in this approach, a plan diagnosis is just a set of abnormal actions that is able to explain the deviations observed. Although plan diagnosis conceived in this way is a rather straightforward application of MBD to plans, we do need to introduce new criteria for selecting acceptable plan diagnoses: First of all, while in standard MBD usually subset-minimal diagnoses, or within them minimum (cardinality) diagnoses, are preferred, we also prefer maximum informative diagnoses. The latter type of diagnosis maximizes the exact similarity between predicted and observed plan states. Although maximum informative diagnoses are always subset minimal, they are not necessarily of minimum cardinality. More differences between MBD and plan diagnosis appear if we take a more detailed look into the reasons for choosing minimal diagnoses. The idea of establishing a minimal diagnosis in MBD is governed by the principle of minimal change: explain the abnormalities in the behavior observed by changing the qualification from normal to abnormal for as few system components as necessary. Using this principle is intuitively acceptable if the components qualified as abnormal are failing independently. However, as soon as dependencies exist between such components, the choice for minimal diagnoses cannot be justified. As we will argue, the existence of dependencies between failing actions in a plan is often the rule instead of an exception. Therefore, we will refine the concept of a plan diagnosis by introducing the concept of a causal diagnosis. To establish such a causal diagnosis, we consider both the executing agent and its plan as constituting the system to be diagnosed and we explicitly relate health states of the executing agent and subsets of (abnormally qualified) actions to the abnormality of other actions in the form of causal rules. These rules enable us to replace a set of dependent failing actions (e.g. a plan diagnosis) by a set of unrelated causes of the original diagnosis. This independent and usually smaller set of causes constitutes a causal diagnosis, consisting of a health state of an agent and an independent (possibly empty) set of failing actions. Such a causal diagnosis always generates a cover of a minimal diagnosis. More importantly, such causal diagnoses can also be used to predict failings of actions that have to be executed in the plan and thereby also can be used to assess the consequences of such failures for goal realizability. This paper is organized as follows. Section 2 introduces the preliminaries of planbased diagnosis, while Section 3 formalizes plan-based diagnosis. Section 4 extends the formalization to determining the agent’s health state. Finally, we briefly discuss some computational aspects of (causal) plan diagnosis. In Section 6, we place our approach into perspective by discussing some related approaches to plan diagnosis. and Section 7 concludes the paper.

2 Preliminaries

Model Based Diagnosis. In Model-Based Diagnosis (MBD) [4,5,12] a system S is modeled as consisting of a set Comp of components and their relations; for each component c ∈ Comp a set Hc of health modes is distinguished, and for each health mode hc ∈ Hc of each component c a specific (input-output) behavior of c is specified. Given some input to S, its output is defined if the health mode of each component c ∈ Comp is known. The diagnostic engine is triggered whenever, under the assumption that all com-


ponents are functioning normally, there is a discrepancy between the output as predicted from the input observations, and the actually observed output. The result of MBD is a suitable assignment of health modes to the components, called a diagnosis, such that the actually observed output is consistent with this health mode qualification or can be explained by this qualification. Usually, in a diagnosis one requires the number of components qualified as abnormally to be minimized. States and Partial States. We consider plan-based diagnosis as a simple extension of the model-based diagnosis where the model is not a description of an underlying system but a plan of an agent. Before we discuss plans, we introduce a simplified statebased view on the world, assuming that for the planning problem at hand, the world can be simply described by a set Var = {v1 , v2 , . . . , vn } of variables and their respective value domains Di . A state of the world σ then is a value assignment σ(vi ) = di ∈ Di to the variables. We will denote a state simply by an element of D1 ×D2 ×. . .×Dn , i.e. an n-tuple of values. It will not always be possible to give a complete state description. Therefore, we introduce a partial state as an element π ∈ Di1 × Di2 × . . . × Dik , where 1 ≤ k ≤ n and 1 ≤ i1 < . . . < ik ≤ n. We use V ar(π) to denote the set of variables {vi1 , vi2 , . . . , vik } ⊆ Var specified in such a state π. The value dj of variable vj ∈ V ar(π) in π will be denoted by π(j). The value of a variable vj ∈ Var not occurring in a partial state π is said to be unknown (or unpredictable) in π, denoted by ⊥. Including ⊥ in every value domain Di allows us to consider every partial state π as an element of D1 × D2 × . . . × Dn . Partial states can be ordered with respect to their information content: Given values d and d , we say that d ≤ d holds iff d = ⊥ or d = d . The containment relation  between partial states is the point-wise extension of ≤ : π is said to be contained in π , denoted by π  π , iff ∀j[π(j) ≤ π(j )]. Given a subset of variables V ⊆ Var , two partial states π, π are said to be V -equivalent, denoted by π =V π , if for every vj ∈ V , π(j) = π (j). We define the partial state π restricted to a given set V , denoted by π  V , as the state π  π such that V ar(π ) = V ∩ V ar(π). An important notion for diagnosis is the notion of compatibility between partial states. Intuitively, two states π and π are said to be compatible if there is no essential disagreement about the values assigned to variables in the two states. That is, for every j either π(j) = π (j) or at least one of the values π(j) and π (j) is undefined. So we define π and π to be compatible, denoted by π ≈ π , iff ∀j[π(j) ≤ π (j) or π (j) ≤ π(j)]. As an easy consequence we have, using the notion of V -equivalent states, π ≈ π iff π =V ar(π)∩V ar(π ) π . Finally, if π and π are compatible states they can be merged into the -least state π  π containing them both: π  π (j) = max≤ {π(j), π (j)}. Goals. An (elementary) goal g of an agent specifies a set of partial states an agent wants to bring about using a plan. Here, we specify each such a goal g as a constraint, that is a relation over some product Di1 × . . . × Dik of domains. We say that a goal g is satisfied by a partial state π, denoted by π |= g, if the relation g contains at least one tuple (di1 , di2 , . . . , dik ) such that (di1 , di2 , . . . dik )  π. We assume each agent to have a set G of such elementary goals g ∈ G. 
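To make the partial-state machinery just introduced concrete, the following small sketch encodes partial states as dictionaries with an explicit "unknown" value and implements the information ordering, compatibility, merge and goal-satisfaction tests; the encoding itself is an illustration and not part of the paper.

```python
UNKNOWN = None  # stands for the undefined value ⊥

def leq(d1, d2):
    """d1 <= d2 iff d1 is unknown or d1 equals d2."""
    return d1 is UNKNOWN or d1 == d2

def contained(pi1, pi2, variables):
    """pi1 is contained in pi2 iff pi1(v) <= pi2(v) for every variable v."""
    return all(leq(pi1.get(v), pi2.get(v)) for v in variables)

def compatible(pi1, pi2):
    """pi1 ≈ pi2 iff they agree on every variable both of them define."""
    common = set(pi1) & set(pi2)
    return all(pi1[v] == pi2[v] for v in common
               if pi1[v] is not UNKNOWN and pi2[v] is not UNKNOWN)

def merge(pi1, pi2):
    """The least partial state containing two compatible states (their merge)."""
    assert compatible(pi1, pi2)
    out = dict(pi1)
    for v, d in pi2.items():
        if v not in out or out[v] is UNKNOWN:
            out[v] = d
    return out

def satisfies(pi, goal, goal_vars):
    """pi |= g: the goal relation contains a tuple whose values all occur in pi."""
    return any(all(pi.get(v) == d for v, d in zip(goal_vars, tup)) for tup in goal)
```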
We use π |= G to denote that all goals in G hold in π, i.e. for all g ∈ G, π |= g. Actions and Action Schemes. An action scheme or plan operator α is represented as a function that replaces the values of a subset Vα ⊆ Var by other values, dependent



Fig. 1. Plan operators & states

upon the values of another set Vα ⊇ Vα of variables. Hence, every action scheme α can be modeled as a (partial) function fα : Di1 × . . . × Dik → Dj1 × . . . × Djl , where 1 ≤ i1 < . . . < ik ≤ n and {j1 , . . . , jl } ⊆ {i1 , . . . , ik }. The variables whose value domains occur in dom(fα ) will be denoted by domV ar (α) = {vi1 , . . . , vik } and, likewise ranV ar (α) = {vj1 , . . . , vjl }. Note that it is required that ranV ar (α) ⊆ domV ar (α). This functional specification fα constitutes the normal behavior of the action scheme, denoted by fαnor . Example 1. Figure 1 depicts two states σ0 and σ1 (the white boxes) each characterized by the values of four variables v1 , v2 , v3 and v4 . The partial states π0 and π1 (the gray boxes) characterize a subset of values in a (complete) state. Action schemes are used to model state changes. The domain of the action scheme α is the subset {v1 , v2 }, which are denoted by the arrows pointing to α. The range of α is the subset {v1 }, which is denoted by the arrow pointing from α. Finally, the dashed arrow denotes that the value of variable v2 is not changed by operator(s) causing the state change. The correct execution of an action may fail either because of an inherent malfunctioning or because of a malfunctioning of an agent responsible for executing the action, or because of unknown external circumstances. In all these cases we would like to model the effects of executing such failed actions. Therefore, we introduce a set of health modes Mα for each action scheme α. This set Mα contains at least the normal mode nor, the mode ab indicating the most general abnormal behavior, and possibly several other specific fault modes. The most general abnormal behavior of action α is specified by the function fαab , where fαab (di1 , di2 , . . . , dik ) = (⊥, ⊥, . . . , ⊥) for every partial state (di1 , di2 , . . . , dik ) ∈ dom(fα ).1 To keep the discussion simple, in the sequel we distinguish only the health modes nor and ab. Given a set A of action schemes, we will need to consider a set A ⊆ inst(A) of instances of actions in A. Such instances will be denoted by small roman letters ai . If type(ai ) = α ∈ A, such an instance ai is said to be of type α. If the context permits we will use “actions” and “instances of actions” interchangeably. Plans. A plan is a tuple P = A, A, t = t >t Pt , P t ≥ 0 during the exe-


cution of the plan P . We would like to use these observations to infer the health states of the actions occurring in P . Assuming a normal execution of P , we can (partially) predict the state of the world at a time point t given the observation obs(t): if all actions at time t such that obs(t)→∗P (π∅ , t ). behave normally, we predict a partial state π∅ There does not need to be a strict correspondence between the variables predicted at time t and the variables observed at time t . That is, V ar(π ) and V ar(π∅ ) need not to be identical sets. This means that to check whether the predicted state matches the observed state at time t , we have to verify whether the variables occurring in both ) have identical values, that is whether π (j) = π∅ (j) holds V ar(π ) and V ar(π∅ for all vj ∈ V ar(π ) ∩ V ar(π∅ ). Therefore, these states match exactly if they are compatible i.e. π ≈ π∅ holds.4 If this is not the case, the execution of some action instances must have gone wrong and we have to determine a qualification Q such that the predicted state πQ derived using Q is compatible with π . Hence, we have the following straight-forward extension of the diagnosis concept in MBD to plan diagnosis (cf. [5]): Definition 1. Let P = A, A, t . – For every rule r of the form (h; α1 , α2 , . . . , αk ) → αk+1 ∈ R, inst(R) contains the instances (h; ai1 , ai2 , . . . , aik ) → aik+1 , whenever there exists a t ≥ 0 such that {ai1 , ai2 , . . . , aik } ⊆ P≤t and aik+1 ∈ P>t . For each r ∈ inst(R), let ante(r) denote the antecedent of r and hd(r) denote the head of r. Furthermore, let Ab ⊆ {h} be a set containing an abnormal agent health state h or be equal to the empty set (signifying a normal state of the agent) and let Q ⊆ A be a qualification of instances of actions. We can now define a causal consequence of a qualification Q and a health state Ab using R as follows: Definition 3. An instance a ∈ A is a causal consequence of a qualification Q ⊂ A and the health state Ab using the causal rules R if 1. a ∈ Q or 2. there exists a rule r ∈ inst(R) such that (i) for each ai ∈ ante(r) either ai is a causal consequence of Q or ai ∈ Ab, and (ii) a = hd(r). The set of causal consequences of Q using R and Ab is denoted by CR,Ab (Q). We have a simple characterization of the set of causal consequences CR,Ab (Q) of a qualification Q and a health state Ab using a set of causal rules R: Observation 2. CR,Ab (Q) = CnA (inst(R) ∪ Q ∪ Ab). Here, CnA (X) restricts the set Cn(X) of classical consequences of a set of propositions X to the consequences occurring in A. To avoid cumbersome notation, we will omit the subscripts R and Ab from the operator C and use C(Q) to denote the set of consequences of a qualification Q using a health state Ab and a set of causal rules R. We say that a qualification Q is closed under the set of rules R and an agent health state Ab if Q = C(Q), i.e, Q is saturated under application of the rules R. Proposition 1. The operator C satisfies the following properties: 1. (inclusion): for every Q ⊆ A, Q ⊆ C(Q) 2. (idempotency): for every Q ⊆ A, C(Q) = C(C(Q)) 3. (monotony): if Q ⊆ Q ⊆ A then C(Q) ⊆ C(Q ) Proof. Note that C(Q) = Cn(inst(R) ∪ Q ∪ Ab) ∩ A. Hence, monotony and inclusion follow immediately as a consequence of the monotony and inclusion of Cn. Monotony and inclusion imply C(Q) ⊆ C(C(Q)). To prove the reverse inclusion, let Cn∗ (Q) = Cn(instr(R) ∪ Q ∪ Ab). Then by inclusion and idempotency of Cn we have  C(C(Q)) = Cn∗ (C(Q)) ∩ A ⊆ Cn∗ (Cn∗ (Q)) ∩ A = Cn∗ (Q) ∩ A = C(Q). 
Thanks to Proposition 1 we conclude that every qualification can be easily extended to a closed set C(Q) of qualifications. Due to the presence of causal rules, we require every diagnosis Q to be closed under the application of rules, that is in the sequel we restrict diagnoses to closed sets Q = C(Q). We define a causal diagnosis as a qualification Q such that its set of consequences C(Q) constitutes a diagnosis:


Definition 4. Let P = ⟨A, A, <⟩ :

Definition 6. Let Q be a causal diagnosis of a plan P based on the observations (π, t) and (π′, t′), where t < t′. Furthermore, let obs(t) →*_{C(Q);P} (π_Q, t′) and let obs(t′) = (π′, t′). Then, for some time t′′ > t′, (π′′, t′′) is the partial state predicted using Q and the observations if (π_Q ⊔ π′, t′) →*_{C(Q);P} (π′′, t′′).


In particular, if t = depth(P ), i.e., the plan has been executed completely, we can predict the values of some variables that will result from executing P and we can check which goals g ∈ G will still be achieved by the execution of the plan, based on our current knowledge. That is, we can check for which goals g ∈ G it holds that τ |= g. So causal diagnosis might also help in evaluating which goals will be affected by failing actions. 4.3 Complexity Issues It is well-known that the diagnosis problem is computationally intractable. The decision forms of both consistency-based and abductive based diagnosis are NP-hard ([2]). It is easy to see that standard plan diagnosis has the same order of complexity. Concerning (minimal) causal diagnoses, we can show that they are not more complex than establishing plan diagnoses if the latter problem is NP-hard. The reason is that in every case the verification of Q being a causal diagnosis is as difficult as verifying a plan diagnosis under the assumption that the set instP (R) is polynomially bounded in the size ||P || of the plan P .6 Also note that subset minimality (under a set of rules inst(R) of a set of causes can be checked in polynomial time.

5 Related Research In this section we briefly discuss some other approaches to plan diagnosis. Like we use MBD as a starting point to plan diagnosis, Birnbaum et al. [1] apply MBD to planning agents relating health states of agents to outcomes of their planning activities, but not taking into account faults that can be attributed to actions occurring in a plan as a separate source of errors. However, instead of focusing upon the relationship between agent properties and outcomes of plan executions, we take a more detailed approach, distinguishing two separate sources of errors (actions and properties of the executing agents) and focusing upon the detection of anomalies during the plan execution. This enables us to predict the outcomes of a plan on beforehand instead of using them only as observations. de Jonge et al. [6] propose another approach that directly applies model-based diagnosis to plan execution. Their paper focuses on agents each having an individual plan, and where conflicts between these plans may arise (e.g. if they require the same resource). Diagnosis is applied to determine those factors that are accountable for future conflicts. The authors, however, do not take into account dependencies between health modes of actions and do not consider agents that collaborate to execute a common plan. Kalech and Kaminka [9,10] apply social diagnosis in order to find the cause of an anomalous plan execution. They consider hierarchical plans consisting of so-called behaviors. Such plans do not prescribe a (partial) execution order on a set of actions. Instead, based on its observations and beliefs, each agent chooses the appropriate behavior to be executed. Each behavior in turn may consist of primitive actions to be executed, or of a set of other behaviors to choose from. Social diagnosis then addresses the issue 6

The reason is that computing consequences of Horn-theories can be achieved in a time linear in the size of instP (R).


of determining what went wrong in the joint execution of such a plan by identifying the disagreeing agents and the causes for their selection of incompatible behaviors (e.g., belief disagreement, communication errors). This approach might complement our approach when conflicts not only arise as the consequence of faulty actions, but also as the consequence of different selections of sub-plans in a joint plan. Lesser et al. [3,8] also apply diagnosis to (multi-agent) plans. Their research concentrates on the use of a causal model that can help an agent to refine its initial diagnosis of a failing component (called a task) of a plan. As a consequence of using such a causal model, the agent would be able to generate a new, situation-specific plan that is better suited to pursue its goal. While their approach in its ultimate intentions (establishing anomalies in order to find a suitable plan repair) comes close to our approach, their approach to diagnosis concentrates on specifying the exact causes of the failing of one single component (task) of a plan. Diagnosis is based on observations of a component without taking into account the consequences of failures of such a component w.r.t. the remaining plan. In our approach, instead, we are interested in applying MBD-inspired methods to detect plan failures. Such failures are based on observations during plan execution and may concern individual components of the plan, but also agent properties. Furthermore, we do not only concentrate on failing components themselves, but also on the consequences of these failures for the future execution of plan elements.

6 Conclusion We have adapted model-based agent diagnosis to the diagnosis of plans and we have pointed out some differences with the classical approaches to diagnosis. We distinguished two types of diagnosis: minimum plan diagnosis and maximum informative diagnosis to identify (i) minimum sets of anomalously executed actions and (ii) maximum informative (w.r.t. to predicting the observations) sets of anomalously executed actions. Assuming that a plan is carried out by a single agent, anomalously executed action can be correlated if the anomaly is caused by some malfunctions in the agent. Therefore, (iii) causal diagnoses have been introduced and we have extended the diagnostic theory enabling the prediction of future failure of actions. Current work can be extended in several ways. We mention two possible extensions: First of all, we could improve the diagnostic model of the executing agent. The causal diagnoses are based on the assumption that the agent enters an abnormal state at some time point and stays in that state until the agent is repaired. In our future work we wish to extend the model such that the agent might evolve through several abnormal states. The resulting model will be related diagnosis in Discrete Event Systems [7,11]. Moreover, we intend to investigate plan repair in the context of the agent’s current (abnormal) state. Secondly, we would like to extend the diagnostic model with sequential observations and iterative diagnoses. Here, we would like to consider the possibilities of diagnosing a plan if more than two subsequent observations are made, the best way to detect errors in such cases and the construction of enhanced prediction methods. Acknowledgements. This research is supported by the Technology Foundation STW, applied science division of NWO and the technology programme of the Dutch Ministry of Economic Affairs.


References 1. L. Birnbaum, G. Collins, M. Freed, and B. Krulwich. Model-based diagnosis of planning failures. In AAAI 90, pages 318–323, 1990. 2. T. Bylander, D. Allemang, M. C. Tanner, and J. R. Josephson. The computational complexity of abduction. Artif. Intell., 49(1-3):25–60, 1991. 3. N. Carver and V.R. Lesser. Domain monotonicity and the performance of local solutions strategies for cdps-based distributed sensor interpretation and distributed diagnosis. Autonomous Agents and Multi-Agent Systems, 6(1):35–76, 2003. 4. L. Console and P. Torasso. Hypothetical reasoning in causal models. International Journal of Intelligence Systems, 5:83–124, 1990. 5. L. Console and P. Torasso. A spectrum of logical definitions of model-based diagnosis. Computational Intelligence, 7:133–141, 1991. 6. F. de Jonge and N. Roos. Plan-execution health repair in a multi-agent system. In PlanSIG 2004, 2004. 7. R. Debouk, S. Lafortune, and D. Teneketzis. Coordinated decentralized protocols for failure diagnosis of discrete-event systems. Journal of Discrete Event Dynamical Systems: Theory and Application, 10:33–86, 2000. 8. B. Horling, B. Benyo, and V. Lesser. Using Self-Diagnosis to Adapt Organizational Structures. In Proceedings of the 5th International Conference on Autonomous Agents, pages 529–536. ACM Press, 2001. 9. M. Kalech and G. A. Kaminka. On the design ov social diagnosis algorithms for multi-agent teams. In IJCAI-03, pages 370–375, 2003. 10. M. Kalech and G. A. Kaminka. Diagnosing a team of agents: Scaling-up. In AAMAS 2004, 2004. 11. Y. Pencol´e and M. Cordier. A formal framework for the decentralised diagnosis of large scale discrete event systems and its application to telecommunication networks. Artif. Intell., 164(1-2):121–170, 2005. 12. R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32:57–95, 1987.

Automatic Abstraction of Time-Varying System Models for Model Based Diagnosis

Pietro Torasso and Gianluca Torta
Dipartimento di Informatica, Università di Torino, Italy
{torasso, torta}@di.unito.it

Abstract. This paper addresses the problem of automatic abstraction of component variables in the context of the MBD of Time-Varying Systems (i.e. systems where the behavioral modes of components can evolve over time); the main goal is to produce abstract models capable of deriving fewer and more general diagnoses when the current observability of the system is reduced and/or the system operates under specific operating conditions. The notion of indiscriminability among instantiations of a subset of components is introduced and constitutes the basis for a formal definition of abstractions which preserve all the distinctions that are relevant for diagnosis given the current observability and operating conditions of the system. The automatic synthesis of abstract models further restricts abstractions so that the temporal behavior of abstract components can be expressed in terms of a simple combination of the temporal behavior of their subcomponents. As a validation of our proposal, we present the results obtained with the automatic abstraction of a non-trivial model adapted from the spacecraft domain.

1 Introduction

System model abstraction has been successfully exploited in many approaches to model-based diagnosis (MBD). The pioneer work presented in [7] and recent improvements proposed e.g. in [9] and [2] mostly use human-provided abstractions in order to focus the diagnostic process and thus improve its efficiency. However, abstraction has another main benefit, namely to provide fewer and more concise abstract diagnoses when it is not possible to discriminate among detailed diagnoses (e.g. [4], [5]). Recently, some authors have aimed at the same goal in a different way, namely automatic abstraction of static system models ([8], [11], [10]). The main idea behind these approaches is that, since the system is usually only partially observable, it can be convenient to abstract the domains of system variables ([10]) and possibly system components themselves ([8], [11]) without losing any information relevant for the diagnostic task. By using the abstracted model for diagnosis the number of returned diagnoses can be significantly reduced, and such diagnoses, by being “as abstract as possible”, are more understandable for a human. U. Furbach (Ed.): KI 2005, LNAI 3698, pp. 176–190, 2005. c Springer-Verlag Berlin Heidelberg 2005 


The work presented in this paper aims at extending the results of [11] to the automatic abstraction of time-varying system models. Such models are able to capture the possible evolutions of the behavioral modes of system components over time; they thus have an intermediate expressive power between static models and fully dynamic models, where arbitrary status variables can be modelled 1 . The goal of automatically synthesizing an abstraction of a time-varying system is very ambitious since the abstraction has to specify not only the relations between the behavioral modes of the abstract component and the ones of its subcomponents, but also to automatically derive the temporal evolutions of the modes of the abstract component. As the approaches mentioned above, our proposal relies on system observability as one of the two main factors for driving the automatic abstraction process. The other factor is represented by the fact that abstractions are tailored to specific operating conditions of the system (e.g. an engine system may be in operating conditions shut down, warming up, cruising); as already noted in the context of diagnosability analysis (e.g. [3]), indeed, different operating conditions imply differences in the system behavior so relevant that it is often impossible to draw general conclusions (such as the system is diagnosable or components C and C can be abstracted) that hold for all of them. Our proposal requires that abstractions lead to real simplifications of the system model while preserving all the information relevant for diagnosis: in order to exclude all the undesired abstractions we introduce a precise definition of abstraction mapping, and further drive the computation of abstractions through heuristics and enforcement of additional criteria. The paper is structured as follows. In section 2 we formalize the notions of time-varying system model, diagnostic problem and diagnosis and in section 3 we give a precise definition of abstraction mapping. In section 4 we describe how the declarative notions introduced in 3 can be implemented and discuss correctness and complexity properties of the computational process. In section 5 we present the results obtained by applying our approach to the automatic abstraction of a non-trivial time-varying system model taken from the spacecraft domain. Finally, in section 6 we compare our work to related papers and conclude.

2 Characterising Time-Varying Systems

In order to formally define time-varying system models, we first give the definition of Temporal System Description which captures fully dynamic system models, and then derive the notion of Time-Varying System Description from it. Definition 1. A Temporal System Description (TSD) is a tuple (SV, DT, Δ) where: - SV is the set of discrete system variables partitioned in IN P U T S (system inputs), OP CON DS (operating conditions), ST AT U S (system status) and 1

An in-depth analysis of different types of temporal systems is reported in [1], where the main characteristics of time-varying systems are singled out.


INTVARS (internal variables). Distinguished subsets COMPS ⊆ STATUS and OBS ⊆ INTVARS represent system components and observables respectively. We will denote with DOM(v) the finite domain of variable v ∈ SV; in particular, for each C ∈ COMPS, DOM(C) consists of the list of possible behavioral modes for C (an ok mode and one or more fault modes)
- DT (Domain Theory) is an acyclic set of Horn clauses defined over SV representing the instantaneous behavior of the system (under normal and abnormal conditions); we require that any instantiation of variables STATUS ∪ OPCONDS ∪ INPUTS is consistent with DT
- Δ is the System Transition Relation mapping the values of the variables in OPCONDS ∪ STATUS at time t to the values of the variables in STATUS at time t+1 (such a mapping is usually non deterministic). A generic element τ ∈ Δ (named a System Transition) is then a pair (OPC_t ∪ S_t, S_{t+1}) where OPC_t is an instantiation of OPCONDS variables at time t and S_t, S_{t+1} are instantiations of STATUS variables at times t and t+1 respectively.

In the following, we will sometimes denote a System Transition τ = (OPC_t ∪ S_t, S_{t+1}) as τ = S_t →^{OPC_t} S_{t+1}. Given a system status S and an instantiation OPC of the OPCONDS variables, the set of the possible successor states of S given OPC is defined as Π(S, OPC) = {S′ s.t. S →^{OPC} S′ ∈ Δ}. Operator Π can be easily extended to apply to a set S = {S_1, ..., S_m} of system states by taking the union of Π(S_i, OPC), i = 1, ..., m. Finally, let θ = (τ_0, ..., τ_{w−1}) be a sequence of system transitions s.t. τ_i = S_i →^{OPC_i} S_{i+1}, i = 0, ..., w−1. We say that θ is a (feasible) system trajectory.

The following definition of Time-Varying System Description is derived by putting some restrictions on the notion of Temporal System Description.

Definition 2. A Time-Varying System Description (TVSD) is a Temporal System Description TSD = (SV, DT, Δ) such that:
- STATUS = COMPS (i.e. the only status variables are the ones representing the health conditions of system components)
- Δ can be obtained as Δ_1 ⋈ ... ⋈ Δ_n, where Δ_i (Component Transition Relation) represents the possible evolutions of the behavioral modes of component C_i. A generic element τ_i ∈ Δ_i (Component Transition) is then a pair (OPC ∪ C_{i,t}(bm), C_{i,t+1}(bm′)) where OPC is an instantiation of OPCONDS variables at time t and C_{i,t}(bm), C_{i,t+1}(bm′) are instantiations of variable C_i at times t and t+1 respectively.

Since Δ can be partitioned in Δ_1, ..., Δ_n we can easily extend the notion of trajectory to apply to any subsystem Γ involving some system components. We are now ready to formalize the notions of Temporal Diagnostic Problem and temporal diagnosis; both of these notions apply to Temporal System Descriptions as well as to Time-Varying System Descriptions.
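As an illustration of the notions just introduced (the encoding below is an assumption of this sketch, not the paper's), each component transition relation Δ_i can be stored as a map from (operating condition, current mode) to the set of possible next modes, and the successor-state operator Π then combines one component transition per component:

```python
from itertools import product

# toy component transition relations (illustrative entries only, not those of Fig. 2)
VALVE_DELTA = {("open", "ok"): {"ok", "so", "sc", "lk", "hb", "br"},
               ("closed", "ok"): {"ok", "sc"}}
PIPE_DELTA = {("open", "ok"): {"ok", "lk", "br"},
              ("closed", "ok"): {"ok", "lk", "br"}}

def successors(status, opc, deltas):
    """Pi(S, OPC): all system states reachable in one step from `status` under `opc`.
    status: dict component -> mode; deltas: dict component -> its Delta_i."""
    comps = sorted(status)
    per_component = []
    for c in comps:
        # if an (opc, mode) pair is not listed, we assume the mode persists
        per_component.append(sorted(deltas[c].get((opc, status[c]), {status[c]})))
    return [dict(zip(comps, combo)) for combo in product(*per_component)]

# successors({"V": "ok", "P1": "ok"}, "open", {"V": VALVE_DELTA, "P1": PIPE_DELTA})
```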


Definition 3. A Temporal Diagnostic Problem is a tuple TDP = (TSD, S_0, w, σ) where:
- TSD is a Temporal System Description
- S_0 is the set of possible initial states (i.e. system states at time 0)
- w is an integer representing the size of the time window [0, ..., w] over which the diagnostic problem is defined
- σ is a sequence (X_0, OPC_0, Y_0, ..., X_w, OPC_w, Y_w) where the X_i s, the OPC_i s and the Y_i s are instantiations of the INPUTS, OPCONDS and OBS variables respectively at times 0, ..., w. Sequence σ then represents the available observed information about the system in time window [0, ..., w].

Given a temporal diagnostic problem TDP, we say that a system status S is instantaneously consistent at time t, t ∈ {0, ..., w}, if:

  DT ∪ X_t ∪ OPC_t ∪ Y_t ∪ S ⊬ ⊥

Instantaneous consistency of S expresses the fact that the instantiation S of STATUS variables is logically consistent with the current values of INPUTS, OPCONDS and OBS variables, under the constraints imposed by DT. The following definition formalizes the notions of belief state and temporal diagnosis.

Definition 4. Let TDP = (TSD, S_0, w, σ) be a Temporal Diagnostic Problem. We define the belief state B_t at time t (t = 0, ..., w) recursively as follows:
- B_0 = {S_0 s.t. S_0 ∈ S_0 and S_0 is instantaneously consistent at time 0}
- ∀t = 1, ..., w: B_t = {S_t s.t. S_t ∈ Π(B_{t−1}, OPC_{t−1}) and S_t is instantaneously consistent at time t}

We say that any system status S_w ∈ B_w is a (temporal) diagnosis for TDP. In order for S_t to belong to B_t, then, it must be consistent with the current observed information about the system (i.e. X_t, OPC_t and Y_t) and with the prediction Π(B_{t−1}, OPC_{t−1}) based on the past history.

Example 1. As a running example throughout the paper we will consider a very simple hydraulic system involving a pipe P1 connected in series with a valve V which in turn is in series with a pipe P2 (see the left part of figure 1). The Time Varying System Description TVSD_H of the hydraulic system is built by instantiating generic models of valve and pipe contained in a library of component models (figure 2 reports the domain theory DT and the transition relation Δ for both a valve and a pipe). For TVSD_H we have that: COMPS = {P1, V, P2}, where P1 and P2 have three behavioral modes (ok, leaking and broken), while the valve V has six behavioral modes (ok, leaking, broken, stuck-open,


Fig. 1. A fragment of an hydraulic circuit at two levels of abstraction

The domain theory clauses of the valve V and the pipe P shown in figure 2 are:

V(ok) ∧ in(f) ∧ opc(open) ⇒ out(f);  V(ok) ∧ in(rf) ∧ opc(open) ⇒ out(rf);  V(ok) ∧ in(nf) ∧ opc(open) ⇒ out(nf);  V(ok) ∧ in(*) ∧ opc(closed) ⇒ out(nf)
V(so) ∧ in(f) ∧ opc(open) ⇒ out(f);  V(so) ∧ in(rf) ∧ opc(open) ⇒ out(rf);  V(so) ∧ in(nf) ∧ opc(open) ⇒ out(nf);  V(so) ∧ in(f) ∧ opc(closed) ⇒ out(f);  V(so) ∧ in(rf) ∧ opc(closed) ⇒ out(rf);  V(so) ∧ in(nf) ∧ opc(closed) ⇒ out(nf)
V(sc) ∧ in(*) ∧ opc(open) ⇒ out(nf);  V(sc) ∧ in(*) ∧ opc(closed) ⇒ out(nf)
V(lk) ∧ in(f) ∧ opc(open) ⇒ out(rf);  V(lk) ∧ in(rf) ∧ opc(open) ⇒ out(rf);  V(lk) ∧ in(nf) ∧ opc(open) ⇒ out(nf);  V(lk) ∧ in(*) ∧ opc(closed) ⇒ out(nf)
V(hb) ∧ in(f) ∧ opc(open) ⇒ out(rf);  V(hb) ∧ in(rf) ∧ opc(open) ⇒ out(rf);  V(hb) ∧ in(nf) ∧ opc(open) ⇒ out(nf);  V(hb) ∧ in(f) ∧ opc(closed) ⇒ out(rf);  V(hb) ∧ in(rf) ∧ opc(closed) ⇒ out(rf);  V(hb) ∧ in(nf) ∧ opc(closed) ⇒ out(nf)
V(br) ∧ in(*) ∧ opc(open) ⇒ out(nf);  V(br) ∧ in(*) ∧ opc(closed) ⇒ out(nf)
P(ok) ∧ in(f) ⇒ out(f);  P(ok) ∧ in(rf) ⇒ out(rf);  P(ok) ∧ in(nf) ⇒ out(nf)
P(br) ∧ in(*) ⇒ out(nf);  P(lk) ∧ in(f) ⇒ out(rf);  P(lk) ∧ in(rf) ⇒ out(rf);  P(lk) ∧ in(nf) ⇒ out(nf)

V (lk ) ∧ in(f ) ∧ opc(open) ⇒ out(rf ) V (lk ) ∧ in(rf ) ∧ opc(open) ⇒ out(rf ) V (lk ) ∧ in(nf ) ∧ opc(open) ⇒ out(nf ) V (lk ) ∧ in(∗) ∧ opc(closed) ⇒ out(nf ) V (hb) ∧ in(f ) ∧ opc(open) ⇒ out(rf ) V (hb) ∧ in(rf ) ∧ opc(open) ⇒ out(rf ) V (hb) ∧ in(nf ) ∧ opc(open) ⇒ out(nf ) V (hb) ∧ in(f ) ∧ opc(closed) ⇒ out(rf ) V (hb) ∧ in(rf ) ∧ opc(closed) ⇒ out(rf ) V (hb) ∧ in(nf ) ∧ opc(closed) ⇒ out(nf ) V (br ) ∧ in(∗) ∧ opc(open) ⇒ out(nf ) V (br ) ∧ in(∗) ∧ opc(closed) ⇒ out(nf ) P (lk ) ∧ in(f ) ⇒ out(rf ) P (lk ) ∧ in(rf ) ⇒ out(rf ) P (lk ) ∧ in(nf ) ⇒ out(nf )

SC closed open closed

open closed

OK

open open

open open closed

SO

LK

open

BR

open closed

open closed open closed

HB

OK

LK

BR

Fig. 2. Domain Theories and Transition Relations of a Valve and a Pipe

stuck-closed and half-blocked); IN P U T S = {inP 1 } (i.e. the input of the pipe P1 is the only system input); OP CON DS = {opcV } (i.e. the op. cond. of the valve V specifying whether the valve should be open or closed); IN T V ARS = {outP 1 , outV , outP 2 }; OBS = {outP 1 , outP 2 } (i.e. the output of V is not observable). Inputs and internal variables represent qualitative values of flows (flow, reduced-flow and no-flow). For lack of space the DT and the Δ of the system are omitted.

3

Abstractions Defined

As pointed out in the introduction, we drive automatic abstraction of timevarying system models mainly by exploiting system observability and by tailoring abstractions to specific operating conditions of the system. The following definition, which introduces the notion of instantaneous indiscriminability among instantiations of subsets of COM P S, is explicitly based on both of the driving factors that we use for abstraction. Definition 5. Let Γ be a subsystem consisting of the subset SCOM P S of COM P S and OP C be an instantiation of OP CON DS. We say that two instantiations SΓ , SΓ of SCOM P S are OPC-indiscriminableinst iff for any instantiation X of IN P U T S and any instantiation O of COM P S\SCOM P S the following holds:

Automatic Abstraction of Time-Varying System Models for MBD

181

tcDT (X ∪ OP C ∪ O ∪ SΓ )|OBS = tcDT (X ∪ OP C ∪ O ∪ SΓ )|OBS where tcDT (V) is the transitive closure of V given DT (i.e. the set of variables instantiations logically derivable from V ∪ DT ) and operator | means projection of a set of instantiations on a subset of the variables. Note that the OPC–indiscriminabilityinst relation induces a partition into OPC– indiscriminabilityinst classes of the set of possible instantiations of SCOM P S. Example 2. Let us consider a subsystem Γ of the hydraulic system of figure 1 consisting of V and P2. Let also assume that OP C = {opcV (open)}, i.e. the valve has been commanded to be open. According to the notion of instantaneous indiscriminability introduced in definition 5, we have 3 indiscriminability classes for the instantiations of {V,P2}. In particular, class C1 = { V (ok) ∧ P 2(ok), V (so) ∧ P 2(ok)}, class C2 = {V (ok) ∧ P 2(lk), V (so) ∧ P 2(lk), V (hb) ∧ P 2(lk), V (hb) ∧ P 2(ok), V (lk) ∧ P 2(ok), V (lk) ∧ P 2(lk)} and class C3 including all the other assignments. It is easy to see that just on the basis of a single time point, the discriminability among different health states of the subsystem is very poor. We now extend the notion of indiscriminability by taking into account the temporal dimension. Definition 6. Let Γ be a subsystem consisting of the subset SCOM P S of COM P S and OP C be an instantiation of OP CON DS. We say that two instantiations SΓ , SΓ of SCOM P S are OPC-indiscriminablek iff: - SΓ , SΓ are OPC-indiscriminableinst - given a trajectory θΓ = (SΓ,0 , . . . , SΓ,k ) of subsystem Γ s.t. SΓ,t = SΓ for some t ∈ {0, . . . , k} there exists a trajectory θΓ = (SΓ,0 , . . . , SΓ,k ) s.t. SΓ,t = SΓ and SΓ,i , SΓ,i are OPC-indiscriminableinst for i = 0, . . . , k - given a trajectory θΓ = (SΓ,0 , . . . , SΓ,k ) of subsystem Γ s.t. SΓ,t = SΓ for some t ∈ {0, . . . , k} there exists a trajectory θΓ = (SΓ,0 , . . . , SΓ,k ) s.t. SΓ,t = SΓ and are OPC-indiscriminableinst for i = 0, . . . , k SΓ,i , SΓ,i When SΓ , SΓ are OPC-indiscriminablek for any k, we say that they are OPCindiscriminable∞ . If we require that the second and third conditions of the definition above only hold for trajectories where SΓ , SΓ are the first (last) status of the trajectory we obtain a weaker notion of indiscriminability that we denote as OPCindiscriminabilityfk ut (OPC-indiscriminabilitypast ). It is possible to prove that k the following property holds. Property 1. Instantiations SΓ , SΓ of SCOM P S are OPC-indiscriminablek iff they are both OPC-indiscriminablefk ut and OPC-indiscriminablepast . k Example 3. The indiscriminability classes of the subsystem involving V and P2 change when we take into consideration the temporal dimension. In fact, if we consider {V(hb),P2(ok)} and {V(lk),P2(lk)}, they are instantaneously indiscriminable (both belong to class C2) but are not OPC-indiscriminablef1 ut ; in the same

182

P. Torasso and G. Torta

way the two assignments {V(lk),P2(br)} and {V(sc),P2(br)} are instantaneously indiscriminable, but are not OPC-indiscriminablepast , while {V(ok),P2(ok)} and 1 {V(so),P2(ok)} are OPC-indiscriminable∞ . The final OPC-indiscriminable∞ classes for the considered subsystem {V,P2} when OP C ={opcV (open)} are C1’ = {{V(ok),P2(ok)}, {V(so),P2(ok)}}, C2’ = {{V(hb),P2(ok)}}, C3’ = {{V(ok),P2(lk)}, {V(so),P2(lk)}, {V(hb),P2(lk)}, {V(lk),P2(ok)}, {V(lk),P2(lk)}}, C4’ = {{V(ok),P2(br)}, {V(so),P2(br)}, {V(hb), P2(br)}, {V(lk),P2(br)}, {V(br),P2(ok)}, {V(br),P2(lk)}, {V(br), P2(br)}} and C5’ = {{V(sc),P2(2ok)}, {V(sc),P2(lk)}, {V(sc),P2(br)}}. It is easy to see that the temporal dimension reduces the degree of indiscriminability among mode assignment. Nevertheless, the OPC-indiscriminable∞ classes are far from being singletons even in this very simple subsystem. We now introduce the notion of abstraction mapping; as it is common in structural abstractions, abstract components are recursively built bottom-up starting with the primitive components. Definition 7. Given a set COM P S = {C1 , . . . , Cn } of component variables and an instantiation OP C of OP CON DS, a components abstraction mapping AM of COM P S defines a set COM P SA = {AC1 , . . . , ACm } of discrete variables (abstract components) and associates with each ACi ∈ COM P SA one or more Cj ∈ COM P S (subcomponents of ACi ) s.t. each component in COMPS is the subcomponent of exactly one abstract component. Moreover, AM associates with each abstract component AC a definition defAC , which is a characterization of the behavioral modes of AC in terms of the behavioral modes of its subcomponents. More precisely, an abstract component and its definition are built hierarchically as follows: - if C ∈ COM P S, AC is a simple abstract component if its definition defAC associates with each abm ∈ DOM (AC) a formula defabm = C(bm1 ) ∨ . . . ∨ C(bmk ) s.t. bm1 , . . . , bmk ∈ DOM (C); in the trivial case, AC has the same domain as C and ∀bm ∈ DOM (C) : defbm = C(bm) - Let AC , AC be two abstract components with disjoint sets of subcomponents SCOM P S , SCOM P S :AC is an abstract component with subcomponents SCOM P S ∪ SCOM P S if defAC associates with each abm ∈ DOM (AC) a definition defabm which is a DNF s.t. each disjunct is in the form defabm ∧ defabm where abm ∈ DOM (AC ) and abm ∈ DOM (AC ) We require that, given the abstract component AC, its definition defAC satisfies the following conditions: 1. correctness: given defabm ∈ defAC , the set of instantiations of SCOM P S which satisfy defabm is an OPC-indiscriminability∞ class 2. mutual exclusion: for any two distinct defabm , defabm ∈ defAC , and any instantiation COMPS of COM P S: COMPS ∪ {defabm ∧ defabm } 0 ⊥ 3. completeness: for any instantiation COMPS of COM P S: COMPS 0 defabmi for some i ∈ {1, . . . , |DOM (AC)|}

Automatic Abstraction of Time-Varying System Models for MBD

183

The definition defAC of AC thus specifies a relation between instantiations of the subcomponents of AC and instantiations (i.e. behavioral modes) of AC itself. Example 4. Let us consider again the subsystem involving {V, P2} and OP C = opcV (open). A components abstraction mapping AM may define a new abstract component AC whose subcomponents are V and P2 and whose behavioral modes are {abm1, . . ., abm5} (the behavioral modes correspond to the five OPC-indiscriminable∞ classes C1’, . . ., C5’ reported above). In particular, the definition of the behavioral modes of AC are the following – defabm1 = V (ok) ∧ P 2(ok) ∨ V (so) ∧ P 2(ok) – defabm2 = V (hb) ∧ P 2(ok) – defabm3 = V (ok)∧P 2(lk)∨V (so)∧P 2(lk)∨V (hb)∧P 2(lk)∨V (lk)∧P 2(ok)∨ V (lk) ∧ P 2(lk) – defabm4 = V (ok)∧P 2(br)∨V (so)∧P 2(br)∨V (hb)∧P 2(br)∨V (lk)∧P 2(br)∨ V (br) ∧ P 2(ok) ∨ V (br) ∧ P 2(lk) ∨ V (br) ∧ P 2(br) – defabm5 = V (sc) ∧ P 2(ok) ∨ V (sc) ∧ P 2(lk) ∨ V (sc) ∧ P 2(br) It is easy to verify that the definitions of abstract behavioral modes abm1, . . ., abm5 satisfy the properties of correctness, mutual exclusion and completeness required by definition 7.

4

Computing Abstractions

The hierarchical way abstract components are defined in section 3 suggests that, after an initial step in which simple abstract components are built, the computational process can produce new abstract components incrementally, by merging only two components at each iteration. After some finite number of iterations, arbitrarily complex abstract components can be produced. As already mentioned in section 1, however, our purpose is to produce abstractions that are not merely correct, but are also useful and meaningful. In the context of MBD, this essentially comes down to producing abstract components with a limited number of different behavioral modes. A proliferation of behavioral modes in the abstract components, indeed, has negative effects on both the efficiency of diagnosis and the understandability of abstract diagnoses. Given that the maximum number of behavioral modes of an abstract component AC built from AC , AC is clearly |DOM (AC )| · |DOM (AC )|, we chose to impose the limit |DOM (AC)| ≤ |DOM (AC )| + |DOM (AC )|; we call this condition the limited domain condition. Figure 3 reports an high-level description of the abstraction process. The body of the ForEach loop at the beginning of the algorithm has the purpose of building a simple abstract component AC by possibly merging indiscriminable behavioral modes of the basic component C while taking into consideration the operating conditions of C. This step is performed by invoking BuildSimpleAC() and the system model is updated in order to replace C with AC. The update of the Domain TheoryDTtemp is performed by ReviseDTSimple() which essentially replaces the set of formulas DTC where C occurs with formulas

184

P. Torasso and G. Torta

Function Abstract(T V SDD , OP C) AM := ∅ T V SDtemp := T V SDD ForEach C ∈ COM P Stemp AC, defAC  = BuildSimpleAC(C, OP C, DTC , ΔC ) COM P Stemp := COM P Stemp \ {C} ∪ {AC} DTtemp := ReviseDTSimple(C, DTtemp , DTC , AC, defAC ) Δtemp := ReviseDeltaSimple(C, Δtemp , ΔC , AC, defAC ) AM := AM ∪ {AC, defAC } Loop G := BuildInfluenceGraph(DTA ) While (Oracle(T V SDA , G) = Ci , Cj ) (DTloc , Gloc ) := CutDT(Ci , Cj , T V SDtemp , G) If (BuildAC(Ci , Cj , OP C, DTloc , Gloc , ΔCi , ΔCj ) = AC, defAC ) COM P Stemp := COM P Stemp \ {Ci , Cj } ∪ {AC} DTtemp := ReviseDT(Ci , Cj , DTtemp , DTloc , Gloc , AC, defAC ) Δtemp := ReviseDelta(Ci , Cj , Δtemp , ΔCi , ΔCj , AC, defAC ) G := BuildInfluenceGraph(DTtemp ) AM := AM ∪ {AC, defAC } EndIf Loop T V SDA := T V SDtemp Return AM, T V SDA EndFunction Fig. 3. Sketch of the Abstract() function

that mention the new abstract component AC. Similarly, the System Transition Relation Δtemp is updated by function ReviseDeltaSimple() which builds the Component Transition Relation ΔAC of AC and replaces ΔC with ΔAC . Finally, the definition of AC is added to the abstraction mapping AM. Once the simple abstract components have been built, the algorithm enters the While loop for incrementally building complex abstract components. Function Oracle() selects the next two candidates Ci , Cj for abstraction according to strategies that will be discussed below; Oracle() uses a DAG G (automatically built by BuildInfluenceGraph()) which represents the influences among system variables: the variables appearing in the antecedent of a formula φ ∈ DT become parents of the variable occurring in the consequent of φ. The call to function CutDT() has the purpose of isolating the portion DTloc of DTtemp relevant for computing the influence of Ci , Cj on the values of observables. CutDT() also returns the Influence Graph Gloc associated with DTloc . Function BuildAC() then tries to compute the definition of a new abstract component AC starting from Ci and Cj . Unlike BuildSimpleAC(), BuildAC() may fail when AC turns out to have a number of behavioral modes larger than |DOM (Ci )| + |DOM (Cj )|. In case BuildAC() succeeds, the system model is updated in order to replace Ci and Cj with AC. As for the case of simple abstract components, the Domain Theory and the System Transition Relation are updated (by functions ReviseDT() and ReviseDelta() respectively).

Automatic Abstraction of Time-Varying System Models for MBD

185

Then, the definition of AC is added to the abstraction mapping AM. The whole process terminates when function Oracle() has no new suggestions for further abstractions. The Oracle. Function Oracle() chooses, at each iteration, a pair of candidate components for abstraction. Since a search over the entire space of potential abstractions would be prohibitive, the function is based on a greedy heuristic that tries to achieve a good overall abstraction without doing any backtracking. In our experience it turned out that by following two simple principles significant results could be achieved. First, Oracle() follows a locality principle for choosing Ci , Cj . This principle is often used when structural abstractions are built by human experts, in fact these abstractions tend to consist in hierarchies such that at each level of the hierarchy the new abstractions involve components that are structurally close to one another. The locality principle usually has the advantage of building abstract components that have a limited and meaningful set of behavioral modes. However, since there is no guarantee on the number of behavioral modes of the abstract component, the check of the limited domain condition must be deferred to a later time, in particular when function BuildAC() is called; there’s thus no warranty that the components selected by Oracle() will end up being merged but just a good chance of it to happen. Two good examples of structural patterns that follow this principle are components connected in series and in parallel 2 . The second principle followed by Oracle() consists of preferring pairs of components that are connected to the same observables; the rationale behind this principle is that if at least one of the two components is observable separately from the other, it is more unlikely to find indiscriminable instantiations of the two components. In the current implementation, Oracle() terminates when it can’t find any pair of components which are connected in series or parallel and influence the same set of observables. Building Abstract Components. Figure 4 shows the pseudo-code of function BuildSimpleAC() which has the purpose of building a simple abstract component. First of all, DOM (C) is partitioned into OPC-indiscriminabilityinst classes by function FindIndiscriminableInst(). The function is essentially an algorithm for computing static indiscriminability classes as the one described in [11]; the only difference is that the values of OP CON DS variables are constrained to match parameter OP C, so that in general it is possible to merge more behavioral modes in the same indiscriminability class. The call to FindIndiscriminableFut() has the purpose of partitioning DOM (C) into OPC-indiscriminabilityf ut classes. Function FindIndiscriminableFut() starts with the OPC-indiscriminabilityinst classes of partition Pinst and gradually refines them into OPC-indiscriminabilityf1 ut classes, OPC-indiscriminabilityf2 ut classes, etc. until a fixed point is reached. 2

Note that the structural notions of vicinity, seriality and parallelism can be naturally transposed in terms of relationships in the influence graph G.

186

P. Torasso and G. Torta

Function BuildSimpleAC(C, OP C, DTC , ΔC ) P inst = FindIndiscriminableInst(C, OP C, DTC ) P f ut = FindIndiscriminableFut(C, OP C, ΔC , Pinst ) P past = FindIndiscriminablePast(C, OP C, ΔC , Pinst ) P = P f ut ◦ P past AC = NewComponentName() defAC = ∅ ForEach π ∈ P abm = NewBehavioralModeName()  defabm = bm∈π C(bm) defAC = defAC ∪ {(abm, defabm )} Loop Return AC, defAC  EndFunction Fig. 4. Sketch of the BuildSimpleAC() function

It is important to note that partition P inst is used by FindIndiscriminable Fut() at each iteration, since it is required by definition 6 that all the states on trajectories starting from OPC-indiscriminabilityf ut states are OPCindiscriminableinst . Function FindIndiscriminablePast() is similar to FindIndiscriminableFut(), except that it partitions DOM (C) into OPCindiscriminabilitypast classes by considering backwards transitions on ΔC . The partition P into OPC-indiscriminability∞ classes is computed by merging P f ut and P past according to property 1. The classes of P form the basis for building abstract behavioral modes definitions: associating exactly one abstract behavioral mode to each of them guarantees that the mutual exclusion, completeness and correctness conditions given in definition 7 are automatically satisfied. For this reason, the definitions of the abstract behavioral modes are generated by considering an indiscriminability class π at a time and building a behavioral mode definition for π. We do not report the algorithm for function BuildAC() because of lack of space. The function exhibits only a few differences w.r.t. BuildSimpleAC(): first of all, partitions P inst , P f ut , P past and P are computed on DOM (Ci ) × DOM (Cj ) instead of DOM (C); second, since the number of classes in the partition P corresponds to the number of abstract behavioral modes to be eventually generated, the function fails (and returns null) if |P| is larger than |DOM (Ci )| + |DOM (Cj )|; finally, the generated abstract behavioral modes are DNFs of terms in the form Ci (bmi ) ∧ Cj (bmj ). Example 5. Let us consider now the abstractions performed by the algorithm on the T V SDH system when the operating condition of the valve V is open. The first step performed by Abstract() is the creation of simple abstract modes for each component P1, V and P2. It is worth noting that the model of P1 and P2 does not change since the behavior of the pipe is not influenced by any operating condition. On the contrary, the simple abstract model of V is different from the

Automatic Abstraction of Time-Varying System Models for MBD AC (abm1 ) ∧ inAC (f ) ⇒ outAC (f ) AC (abm1 ) ∧ inAC (rf ) ⇒ outAC (rf ) AC (abm1 ) ∧ inAC (nf ) ⇒ outAC (nf ) AC (abm2 ) ∧ inAC (f ) ⇒ outAC (rf ) AC (abm2 ) ∧ inAC (rf ) ⇒ outAC (rf ) AC (abm2 ) ∧ inAC (nf ) ⇒ outAC (nf ) AC (abm3 ) ∧ inAC (f ) ⇒ outAC (rf ) AC (abm3 ) ∧ inAC (rf ) ⇒ outAC (rf ) AC (abm3 ) ∧ inAC (nf ) ⇒ outAC (nf ) AC (abm4 ) ∧ inAC (∗) ⇒ outAC (nf ) AC (abm5 ) ∧ inAC (∗) ⇒ outAC (nf )

187

abm1

abm2 abm3

abm5

abm4

Fig. 5. Model of the Abstract Component built from V and P2

detailed model of V, in particular the transition relation changes because of the influence of the operating condition; however the simple abstract model of V has six modes (despite the restriction on the operating condition, all the original behavioral modes of the valve are discriminable over time). Now Oracle() has to select two candidate components for abstraction. It selects V and P2 because they are connected in series and no observable exists that is reachable from just one of them (this is not the case for P1 and V since outP 1 is observable). When BuildAC() is invoked on components V and P2, the abstraction is successful since the abstract component involves just 5 abstract behavioral modes (the limited domain condition is satisfied). The inferred definition of the abstract component is exactly the same as the one reported in example 4. The DT and Δ of the abstract component are reported in figure 5. No attempt is made to create an abstract component starting from P1 and AC because of the heuristics used by Oracle() (i.e. outP 1 is observable) Correctness and Complexity. We now state some results concerning algorithm Abstract() 3 . The first result just states that the algorithm builds abstractions according to the definition given in section 3 while the second one makes explicit the correspondence between detailed and abstract diagnoses. Property 2. The AM returned by Abstract() is a components abstraction mapping according to definition 7. Property 3. Let AM and T V SDA be the components abstraction mapping and system description obtained from T V SDD and OP C by applying Abstract(). Given a Temporal Diagnostic Problem T DPD = (T V SDD , S0 , w, σ) and the corresponding abstract Temporal Diagnostic Problem T DPA = (T V SDA , SA,0 , w, σ), DD is a diagnosis for T DPD iff its abstraction DA according to AM4 is a diagnosis for T DPA . 3 4

The proofs are not reported because of lack of space. Each instantiation COMPS of COM P S corresponds to exactly one instantiation COMPSA of COM P SA ; we say that COMPSA is the abstraction of COMPS according to AM.

188

P. Torasso and G. Torta V13

AC1

V17 V11

V14

V12

V15

V17

V11 V18

V12

T1

AC3

T1

V16 E1

E1 V23

AC5

V27 V21

V24

V22

V25

V21

T2

T2 V22

V28

AC7

V26 E2

E2

Fig. 6. The Propulsion Sub-System (left) and its abstraction (right)

As concerns the computational complexity of Abstract(), it turns out that it is polynomial in the number |SV| of the System Variables in T V SDD ; however, some of the operations in the While loop can be exponential in |SN loc |, where SN loc denotes the set of all source nodes in Gloc . This exponential complexity is taken under control by the locality principle used by Oracle(); indeed, when Ci and Cj are close, Gloc tends to be small (this is particularly true when Ci , Cj are connected in series or parallel).

5

Applying Abstraction to a Propulsion Sub-system

In order to show the applicability of our approach to realistic domains, we report the main results of the abstraction on the model of a part of a propulsion system of a spacecraft (derived from [6]). In figure 6 (left part) the detailed model is reported where Vij represent valves whose Domain Theory and Transition Relation are the same as the ones reported in figure 2 while Ti represent tanks and Ei engines. Apart from the outputs of the two engines, the observable parameters are marked with ⊕ symbols. The only system components that have operating conditions are the valves. We analyze the case where V11 , V12 , V14 , V16 , V17 and V18 are open, while all the other valves (depicted in black) are closed. For sake of simplicity, pipes are not modeled as components (i.e. we assume they cannot fail). Figure 6 (right part) schematically shows the abstractions performed by our algorithm. In particular, abstract component AC1 is the abstraction of the subsystem involving V13 and V14 connected in parallel; abstract component AC3 is the abstraction of a subsystem involving V18 and the abstract component AC2 that has been derived by abstracting V15 and V16 ; AC5 is the abstraction of a subsystem involving V23 , V24 and V27 (in an intermediate step of the algorithm V23 and V24 have been abstracted in AC4 ); similar remarks hold for AC7 . The model of the propulsion system is interesting for analyzing the behavior of Oracle() in selecting the components to be abstracted. As said above the current version of Oracle() exploits heuristics for selecting components that are

Automatic Abstraction of Time-Varying System Models for MBD

189

connected in parallel or in series. By following this criterion Oracle() identifies that V13 and V14 are connected in parallel, so they are suitable candidates for abstraction. The abstraction is successful and returns AC1 which has just 9 behavioral modes (out of 36 possible behavioral modes that are obtained by the Cartesian product of the modes of the two valves), its Δ involves 27 state transitions (out of the 289 obtained as the Cartesian product of the Δs of V13 and V14 or the 70 transitions if we take into account the restrictions imposed by the operating conditions). After the update of the system model, the Oracle() is able to detect that V17 is connected in series with AC1 and therefore they could be candidates for abstraction. However, since the output of V17 (and input of AC1 ) is observable, Oracle() does not suggest the abstraction of V17 and AC1 . Oracle()continues its search and the next two selected components are V15 and V16 which are successfully abstracted in AC2 . Then, Oracle() selects V18 and AC2 , since they are connected in series and no observable parameter exists in between the two; this step leads to the construction of AC3 . Abstract components AC5 and AC7 are built in a similar way; the main difference is that the involved valves have different operating conditions. In all these cases abstraction provides a significant reduction in the number of behavioral modes (15 for AC3 , 3 for AC5 and AC7 , just a fraction of the possible 216 behavioral modes), in the number of transitions of Δ (for AC3 36 versus the 4913 possible transitions or 700 transition if we take into account operating conditions), in the number of propositional formulas in DT (in case of AC3 54 instead of 108). Even more dramatic reductions are obtained with AC5 and AC7 whose Δs involve just 3 transitions and whose DT s involve 9 formulas.

6

Conclusions

So far, only a few methods have been proposed for automatic model abstraction in the context of MBD and all of them deal with static system models. The work described in [10] aims at automatic abstraction of the domains of system variables given a desired level of abstraction τtarg for the domains of some variables and the system observability τobs . In [11] the authors consider the automatic abstraction of both the domains of system variables and of the system components, given the system observability expressed as the set of observables OBSAV and their granularity Π. Both of these approaches guarantee that the abstract models they build preserve all the relevant diagnostic information, i.e. there is no loss of discrimination power when using the abstract model so that the ground model can potentially be substituted by the abstract one. This is different from the goal of [7] and its improvements, where the abstract model is viewed as a focusing mechanism, and is used in conjunction with the detailed model in order to improve efficiency. As [10] and [11], the approach described in this paper guarantees that the information relevant for diagnosis is preserved, as shown by the possibility of converting back and forth between detailed and abstract diagnoses (property 3).

190

P. Torasso and G. Torta

However, our approach applies to a significantly broader class of systems, namely time-varying systems; for this reason, we tailor our abstractions not only to the system observability, but also to specific operating conditions of the system. The abstraction is clearly connected with the diagnosability assessment task ([3]), since both of them strongly depend on the indiscriminability of elements of the model. Note however that the two tasks are actually dual problems: diagnosability assessment is based on finding elements that are indiscriminable in at least one situation, while abstraction is based on finding elements that are indiscriminable in all situations; moreover, while diagnosability is a system wide property that either holds or does not hold, a system can be abstracted in several different ways, so abstraction can be viewed more as a creative process whose goal is to simplify the system model in a useful and meaningful way. The application of our approach to the propulsion sub-system has shown its ability at automatically synthesizing abstract components with complex behavior. Note that these results have been obtained under very demanding requirements: the algorithm had to identify combinations of behavioral modes indiscriminable regardless of the input vectors over unlimited time windows. The problem of the abstraction of fully dynamic system models is an even more challenging problem, since it requires not only the abstraction of component variables but also of other system status variables that interact with component variables in determining the temporal evolution of the system. We believe that this paper has laid solid ground for attacking this problem.

References 1. Brusoni, V., Console, L., Terenziani, P., Theseider Dupr´e, D.: A Spectrum of Definitions for Temporal Model-Based Diagnosis. Artificial Intelligence 102 (1998) 39–79 2. Chittaro, L., Ranon, R.: Hierarchical Model-Based Diagnosis Based on Structural Abstraction. Artificial Intelligence 155(1–2) (2004) 147–182 3. Cimatti, A., Pecheur, C., Cavada, R.: Formal Verification of Diagnosability via Symbolic Model Checking. Proc. IJCAI (2003) 363–369 4. Console, L., Theseider Dupr´e, D.: Abductive Reasoning with Abstraction Axioms. LNCS 810 (1994) 98–112 5. Friedrich, G.: Theory Diagnoses: A Concise Characterization of Faulty Systems. Proc. IJCAI (1993) 1466–1471 6. Kurien, J., Nayak, P. Pandurang: Back to the Future for Consistency-Based Trajectory Tracking. Proc. AAAI00 (2000) 370–377 7. Mozetiˇc, I.: Hierarchical Model-Based Diagnosis. Int. Journal of Man-Machine Studies 35(3) (1991) 329–362 8. Out, D.J., van Rikxoort, R., Bakker, R.: On the Construction of Hierarchic Models. Annals of Mathematics and AI 11 (1994) 283–296 9. Provan, G.: Hierarchical Model-Based Diagnosis. Proc. DX (2001) 167–174 10. Sachenbacher, M., Struss, P.: Task-Dependent Qualitative Domain Abstraction. Artificial Intelligence 162 (1–2) (2004) 121–143 11. Torta, G., Torasso, P.: Automatic Abstraction in Component-Based Diagnosis Driven by System Observability. Proc. IJCAI (2003) 394–400

Neuro-Fuzzy Kolmogorov's Network for Time Series Prediction and Pattern Classification Yevgeniy Bodyanskiy1, Vitaliy Kolodyazhniy1, and Peter Otto2 1 Control Systems Research Laboratory, Kharkiv National University of Radioelectronics, 14, Lenin Av., Kharkiv, 61166, Ukraine [email protected], [email protected] 2 Department of Informatics and Automation, Technical University of Ilmenau, PF 100565, 98684 Ilmenau, Germany [email protected]

Abstract. In the paper, a novel Neuro-Fuzzy Kolmogorov's Network (NFKN) is considered. The NFKN is based on and is the development of the previously proposed neural and fuzzy systems using the famous Kolmogorov’s superposition theorem (KST). The network consists of two layers of neo-fuzzy neurons (NFNs) and is linear in both the hidden and output layer parameters, so it can be trained with very fast and simple procedures: the gradient-descent based learning rule for the hidden layer, and the recursive least squares algorithm for the output layer. The validity of theoretical results and the advantages of the NFKN are confirmed by experiments.

1 Introduction According to the Kolmogorov's superposition theorem (KST) [1], any continuous function of d variables can be exactly represented by superposition of continuous functions of one variable and addition:

f ( x1,, xd ) =

2 d +1

d

º

¬ i =1

¼

ª

¦ gl «¦ψ l ,i ( xi )» , l =1

(1)

where gl (•) and ψ l ,i (•) are some continuous univariate functions, and ψ l ,i (•) are independent of f. Aside from the exact representation, the KST can be used as the basis for the construction of parsimonious universal approximators, and has thus attracted the attention of many researchers in the field of soft computing. Hecht-Nielsen was the first to propose a neural network implementation of KST [2], but did not consider how such a network can be constructed. Computational aspects of approximate version of KST were studied by Sprecher [3], [4] and KĤrková [5]. Igelnik and Parikh [6] proposed the use of spline functions for the construction of Kolmogorov's approximation. Yam et al [7] proposed the multi-resolution approach to fuzzy control, based on the KST, and proved that the KST representation can be U. Furbach (Ed.): KI 2005, LNAI 3698, pp. 191 – 202, 2005. © Springer-Verlag Berlin Heidelberg 2005

192

Y. Bodyanskiy, V. Kolodyazhniy, and P. Otto

realized by a two-stage rule base, but did not demonstrate how such a rule base could be created from data. Lopez-Gomez and Hirota developed the Fuzzy Functional Link Network (FFLN) [8] based on the fuzzy extension of the Kolmogorov's theorem. The FFLN is trained via fuzzy delta rule, whose convergence can be quite slow. A novel KST-based universal approximator called Fuzzy Kolmogorov's Network (FKN) with simple structure and training procedure with high rate of convergence was proposed in [9 – 11]. The training of the FKN is based on the alternating linear least squares technique for both the output and hidden layers. However, this training algorithm may require a large number of computations in the problems of high dimensionality. In this paper we propose an efficient and computationally simple learning algorithm, whose complexity depends linearly on the dimensionality of the input space. The network considered in this paper (Neuro-Fuzzy Kolmogorov's Network – NFKN) is based on the FKN architecture and uses the gradient descent-based procedure for the tuning of the hidden layer weights, and recursive least squares method for the output layer.

2 Network Architecture The NFKN (Fig. 1) is comprised of two layers of neo-fuzzy neurons (NFNs, Fig. 2) [12] and is described by the following equations: n

d

l =1

i =1

fˆ ( x1 ,  , xd ) = ¦ f l[ 2] (o[1, l ] ) , o[1,l ] = ¦ fi[1,l ] ( xi ) , l = 1,, n ,

(2)

where n is the number of hidden layer neurons, f l[ 2] (o[1,l ] ) is the l-th nonlinear syn-

apse in the output layer, o[1,l ] is the output of the l-th NFN in the hidden layer, f i[1, l ] ( xi ) is the i-th nonlinear synapse of the l-th NFN in the hidden layer. The equations for the hidden and output layer synapses are m1

f i[1,l ] ( xi ) = ¦ μi[1, h] ( xi )wi[1, h,l ] , h =1

m2

f l[ 2] (o[1,l ] ) = ¦ μ l[,2j] (o[1,l ] ) wl[,2j] , j =1

(3)

l = 1,, n , i = 1,, d ,

where m1 and m2 is the number of membership functions (MFs) per input in the hidden and output layers respectively, μi[1, h] ( xi ) and μl[,2j] (o[1, l ] ) are the MFs, wi[1,h,l ] and wl[,2j] are tunable weights. Nonlinear synapse is a single input-single output fuzzy inference system with crisp consequents, and is thus a universal approximator [13] of univariate functions (see Fig. 3). It can provide a piecewise-linear approximation of any functions gl (•) and

ψ l ,i (•) in (1). So the NFKN, in turn, can approximate any function f ( x1 ,  , xd ) .

Neuro-Fuzzy Kolmogorov's Network for Time Series Prediction and Pattern Classification 193

1st hidden layer neuron

x1

f1[1,1] f2[1,1]

o[1,1]

+

fd[1,1] 2nd hidden layer neuron

Output layer neuron

f1[1,2] x2

f2[1,2]

f1[2]

o[1,2]

+

f2[2]

fd[1,2]

+

fˆ( x1, x2,, xd )

fn[2]

n-th hidden layer neuron

f1[1,n] f2[1,n] xd

o[1,n]

+

fd[1,n] Fig. 1. NFKN architecture with d inputs and n hidden layer neurons

x1 x2 xd

f1 f2

μi,1

+



xi

fd

μi,2 μi,m

wi,1 wi,2

+

fi(xi)

wi,m

Fig. 2. Neo-fuzzy neuron (left) and its nonlinear synapse (right)

The output of the NFKN is computed as the result of two-stage fuzzy inference: n m2 ª d m1 º yˆ = ¦ ¦ μl[,2j] «¦¦ μi[1,h] ( xi ) wi[1, h,l ] » wl[,2j] . l =1 j =1 ¬« i =1 h =1 ¼»

(4)

The description (4) corresponds to the following two-level fuzzy rule base: IF xi IS X i ,h THEN o[1,1] = wi[1, h,1]d AND AND o[1, n ] = wi[1, h, n ]d , i = 1,, d , h = 1,, m1 ,

(5)

194

Y. Bodyanskiy, V. Kolodyazhniy, and P. Otto

IF o[1,l ] IS Ol , j THEN yˆ = wl[,2j]n , l = 1,, n ,

j = 1,, m2 ,

(6)

where X i ,h and Ol , j are the antecedent fuzzy sets in the first and second level rules, respectively. Each first level rule contains n consequent terms wi[1,h,1] d ,, wi[1,h,n ] d , corresponding to n hidden layer neurons. fi(xi) wi,m-1 wi,m wi,2 wi,3 wi,1 ci,1 1

ci,2

μi,1

ci,1

μi,2

ci,2

ci,3

μi,3

ci,3

ci,m-1

μi,m-1

ci,m-1

ci,m

xi

μi,m

ci,m

xi

Fig. 3. Approximation of a univariate nonlinear function by a nonlinear synapse

Total number of rules is N RFKN = d ⋅ m1 + n ⋅ m2 ,

(7)

i.e., it depends linearly on the number of inputs d. The rule base is complete, as the fuzzy sets X i ,h in (5) completely cover the input hyperbox with m1 MFs per input variable. Due to the linear dependence (7), this approach is feasible for input spaces with high dimensionality d without the need for clustering techniques for the construction of the rule base. Straightforward gridpartitioning approach with m1 MFs per input requires (m1 ) d fuzzy rules, which results in combinatorial explosion and is practically not feasible for d > 4 . To simplify computations and to improve the transparency of fuzzy rules, MFs of each input are shared between all the neurons in the hidden layer (Fig. 4). So the set of rules (5), (6) in an NFKN can be interpreted as a grid-partitioned fuzzy rule-base IF x1 IS X 1,1 AND  AND xd IS X d ,1 THEN yˆ = fˆ (c1[1,1] ,, cd[1,]1 ), (8)



IF x1 IS X 1, m1 AND  AND xd IS X d , m1 THEN yˆ = fˆ (c1[1, m] ,, cd[1,]m ), 1

1

Neuro-Fuzzy Kolmogorov's Network for Time Series Prediction and Pattern Classification 195

NS1

f1[1,1] x1

f1[1,2]

+

o[1,1]

f1[1,n] NS2

f2[1,1] Shared MFs

x2

o[1,2]

f2[1,2]

+

f2[1,n]

NSd

fd[1,1] xd

fd[1,2]

+

o[1,n]

fd[1,n] Fig. 4. Representation of the hidden layer of the NFKN with shared MFs

with total of (m1 ) d rules, whose consequent values are equal to the outputs of the NFKN computed on all the possible d-tuples of prototypes of the input fuzzy sets c1[1,1] ,, cd[1,]m [9]. 1

3 Learning Algorithm The weights of the NFKN are determined by means of a batch-training algorithm as described below. A training set containing N samples is used. The minimized error function is E (t ) = ¦ [ y (k ) − yˆ (t , k )] 2 = [Y − Yˆ (t )] [Y − Yˆ (t )] , N

T

k =1

where Y = [ y (1),, y ( N )] T is the vector of target values, and

Yˆ (t ) = [ yˆ (t ,1),  , yˆ (t , N )] T is the vector of network outputs at epoch t.

(9)

196

Y. Bodyanskiy, V. Kolodyazhniy, and P. Otto

Since the nonlinear synapses (3) are linear in parameters, we can employ recursive least squares (RLS) procedure for the estimation of the output layer weights. Re-write (4) as

[

T

], )] .

yˆ = W [ 2] ϕ [ 2] (o[1] ) , W [ 2] = w1[,21] , w1[,22] ,, w[n2,m] 2

T

ϕ [ 2] (o[1] ) = μ1[,21] (o[1,1] ), μ1[,22] (o[1,1] ),, μ n[ 2,m] 2 (o[1,n ]

T

[

(10)

Then the RLS procedure will be

­W [ 2 ] (t , k ) = W [ 2 ] (t , k − 1) + P (t , k )ϕ [ 2 ] (t , k )( y (k ) − yˆ (t , k )), ° T ® P (t , k − 1)ϕ [ 2 ] (t , k )ϕ [ 2 ] (t , k ) P(t , k − 1) P t k P t k ( , ) = ( , − 1 ) − , ° T 1 + ϕ [ 2 ] (t , k ) P (t , k − 1)ϕ [ 2 ] (t , k ) ¯

(11)

where k = 1,, N , P(t ,0) = 10000 ⋅ I , and I is the identity matrix of corresponding dimension. Introducing after that the regressor matrix of the hidden layer

[

]

T

Φ[1] = ϕ [1] ( x(1)),,ϕ [1] ( x( N )) , we can obtain the expression for the gradient of the error function with respect to the hidden layer weights at the epoch t:

[

]

T ∇W [1] E (t ) = −Φ [1] Y − Yˆ (t ) ,

(12)

and then use the well-known gradient-based technique to update these weights:

W [1] (t + 1) = W [1] (t ) − γ (t )

∇W [1] E (t ) , ∇W [1] E (t )

(13)

where γ (t ) is the adjustable learning rate, and

[

W [1] = w1[1,1,1] , w1[1,2,1] ,, w[d1,,1m]1 ,, wd[1,,mn ]1

[

]

T

,

]

T

ϕ [1] ( x) = ϕ1[1,1,1] ( x1 ),ϕ1[1, 2,1] ( x1 ),,ϕ d[1,,m1]1 ( xd ),,ϕ d[1,,mn1] ( xd ) ,

(14)

ϕ i[,1h,l ] ( xi ) = al[ 2 ] (o[1,l ] ) μi[,1h,l ] ( xi ) , and al[ 2 ] (o[1,l ] ) is determined as in [9 – 11] al[ 2] (o[1,l ] ) =

wl[,2p] +1 − wl[,2p] cl[,2p] +1 − cl[,2p]

,

(15)

where wl[,2p] and cl[,2p] are the weight and center of the p-th MF in the l-th synapse of the output layer, respectively. The MFs in an NFN are chosen such that only two adjacent MFs p and p+1 fire at a time [12].

Neuro-Fuzzy Kolmogorov's Network for Time Series Prediction and Pattern Classification 197

Thus, the NFKN is trained via a two-stage optimization procedure without any nonlinear operations, similar to the ANFIS learning rule for the Sugeno-type fuzzy inference systems [14]. In the forward pass, the output layer weights are adjusted. In the backward pass, adjusted are the hidden layer weights. An epoch of training is considered ‘successful’ if the error measure (e.g. the root mean squared error (RMSE)) on the training set is reduced in comparison with the previous epoch. Only successful epochs are counted. If the error measure is not reduced, the training cycle (forward and backward passes) is repeated until the error measure is reduced or the maximum number of cycles per epoch is reached. Once it is reached, the algorithm is considered to converge, and the parameters from the last successful epoch are saved as the result of training. Otherwise the algorithm is stopped when the maximum number of epochs is reached. The number of tuned parameters in the hidden layer is S1 = d ⋅ m1 ⋅ n , in the output layer S 2 = n ⋅ m2 , and total S = S1 + S2 = n ⋅ (d ⋅ m1 + m2 ) . Hidden layer weights are initialized deterministically using the formula [9 – 11]

­ i[m (l − 1) + h − 1]½ w[h1,,il ] = exp ®− 1 ¾ , h = 1,  , m1 , i = 1, , d , l = 1, , n , d (m1n − 1) ¿ ¯

(16)

broadly similar to the parameter initialization technique proposed in [6] for the Kolmogorov’s spline network based on rationally independent random numbers. Further improvement of the learning algorithm can be achieved through the adaptive choice of the learning rate γ (t ) as was proposed for the ANFIS.

4 Experimental Results To verify the theoretical results and compare the performance of the proposed network to the known approaches, we have carried out three experiments. The first two were prediction and emulation of the Mackey-Glass time series [15], and the third one was classification of the Wisconsin breast cancer (WBC) data [16]. In all the experiments, the learning rate was γ (t ) = 0.05 , maximum number of cycles per epoch was 50. In the first and second experiments, the error measure was root mean squared error, in the third one – the number of misclassified patterns. The Mackey-Glass time series was generated by the following equation:

dy (t ) 0.2 y (t − τ ) = − 0 .1 y ( t ) , dt 1 + y10 (t − τ )

(17)

where τ is time delay. In our experiments, we used τ = 17 . The values of the time series (17) at each integer point were obtained by means of the fourth-order RungeKutta method. The time step used in the method was 0.1, initial condition y (0) = 1.2 , and y (t ) was derived for t = 0,...,5000 .

198

Y. Bodyanskiy, V. Kolodyazhniy, and P. Otto

4.1 Mackey-Glass Time Series Prediction

In the first experiment, the values y (t − 18), y (t − 12), y (t − 6) , and y (t ) were used to predict y (t + 85) . From the generated data, 1000 values for t = 118,...,1117 were used as the training set, and the next 500 for t = 1118,...,1617 as the checking set. The NFKN used for prediction had 4 inputs, 9 neurons in the hidden layer with 3 MFs per input, and 1 neuron in the output layer with 9 MFs per synapse (189 adjustable parameters altogether). The training algorithm converged after 43 epochs. The NFKN demonstrated similar performance as multilayer perceptron (MLP) with 2 hidden layers each containing 10 neurons. The MLP was trained for 50 epochs with the Levenberg-Marquardt algorithm. Because the MLP weights are initialized randomly, the training and prediction were repeated 10 times. Root mean squared error on the training and checking sets (trnRMSE and chkRMSE) was used to estimate the accuracy of predictions. For the MLP, median values of 10 runs for the training and checking errors were calculated. The results are shown in Fig. 5 and Table 1.

Mackey−Glass time series and prediction: trnRMSE=0.018455 chkRMSE=0.017263 MG time series Prediction

1.2 1 0.8 0.6 0.4 400

600

800

1000

1200

1400

1600

Difference between the actual and predicted time series 0.06 0.04 0.02 0 −0.02 −0.04 −0.06 400

600

800

1000

1200

Fig. 5. Mackey-Glass time series prediction

1400

1600

Neuro-Fuzzy Kolmogorov's Network for Time Series Prediction and Pattern Classification 199 Table 1. Results of Mackey-Glass time series prediction

Network MLP 10-10-1 NFKN 9-1

Parameters 171 189

Epochs 50 43

trnRMSE 0.022 0.018455

chkRMSE 0.0197 0.017263

With a performance similar to that of the MLP in combination with the LevenbergMarquardt training method, the NFKN with its hybrid learning algorithm requires much less computations because no matrix inversions for the tuning of the synaptic weights are required. 4.2 Mackey-Glass Time Series Emulation

The second experiment consisted in the emulation of the Mackey-Glass time series by the NFKN. The values of the time series were fed into the NFKN only during the training stage. 3000 values for t = 118,...,3117 were used as the training set. The NFKN had 17 inputs corresponding to the delays from 1 to 17, 5 neurons in the hidden layer with 5 MFs per input, and 1 neuron in the output layer with 7 MFs per synapse (total 460 adjustable parameters). The NFKN was trained to predict the value of the time series one step ahead. The training procedure converged after 28 epochs with the final value of RMSETRN=0.0027, and the last 17 values of the time series from the training set were fed to the inputs of the NFKN. Then the output of the network was connected to its inputs through the delay lines, and subsequent 1000 values of the NFKN output were computed. As can be seen from Fig. 6, the NFKN captured the dynamics of the real time series very well. The difference between the real and emulated time series becomes visible only after about 500 time steps. The emulated chaotic oscillations remain stable, and neither fade out nor diverge. In such a way, the NFKN can be used for long-term chaotic time series predictions. 4.3 Classification of the Wisconsin Breast Cancer Data

In this experiment, the NFKN was tested on the known WBC dataset [16], which consists of 699 cases (malignant or benign) with 9 features. We used only 683 cases, because 16 had missing values. The dataset was randomly shuffled, and then divided into 10 approximately equally-sized parts to perform 10-fold cross-validation. The NFKN that demonstrated best results had 9 inputs, only 1 neuron in the hidden layer with 4 MFs per input, and 1 synapse in the output layer with 7 MFs (total 43 adjustable weights). It provided high classification accuracy (see Table 2). Typical error rates the on the WBC data with the application of other classifiers range from 2% to 5% [17, 18]. But it is important to note that most neuro-fuzzy systems rely on time-consuming learning procedures, which fine-tune the membership functions [17, 19, 20]. In the NFKN, no tuning of MFs is performed, but the classification results are very good.

200

Y. Bodyanskiy, V. Kolodyazhniy, and P. Otto Table 2. Results of the WBC data classification

Run

Epochs

1 2 3 4 5 6 7 8 9 10 Min Max Average

6 5 6 12 8 7 6 6 9 8 5 12 7.3

Training set errors 1.7915% 1.9544% 2.1173% 0.97561% 1.9512% 1.7886% 1.7886% 1.7886% 1.4634% 1.626% 0.97561% 2.1173% 1.7245%

Checking set errors 2.8986% 2.8986% 0% 4.4118% 0% 2.9412% 2.9412% 2.9412% 2.9412% 2.9412% 0% 4.4118% 2.4915%

Actual and emulated Mackey−Glass time series MG time series Emulator output

1.2 1 0.8 0.6

3200

3300

3400

3500

3600

3700

3800

3900

4000

4100

4000

4100

Difference between the real and emulated time series 0.5

0

−0.5 3200

3300

3400

3500

3600

3700

3800

Fig. 6. Mackey-Glass time series emulation

3900

Neuro-Fuzzy Kolmogorov's Network for Time Series Prediction and Pattern Classification 201

5 Conclusion A new simple and efficient neuro-fuzzy network (NFKN) and its training algorithm were considered. The NFKN contains the neo-fuzzy neurons in both the hidden and output layers and is not affected by the curse of dimensionality because of its twolevel structure. The use of the neo-fuzzy neurons enabled us to develop fast training procedures for all the parameters in the NFKN. Two-level structure of the rule base helps the NFKN avoid the combinatorial explosion in the number of rules even with a large number of inputs (17 in the experiment with chaotic time series emulation). Further work will be directed to the improvement of the convergence properties of the training algorithm by means of the adaptation of the learning rate, and to the development of a multiple output modification of the NFKN for the classification problems with more than two classes. We expect that the proposed neuro-fuzzy network can find real-world applications, because the experimental results presented in the paper are quite promising.

References 1.

Kolmogorov, A.N.: On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR 114 (1957) 953-956 2. Hecht-Nielsen, R: Kolmogorov's mapping neural network existence theorem. Proc. IEEE Int. Conf. on Neural Networks, San Diego, CA, Vol. 3 (1987) 11-14 3. Sprecher, D.A.: A numerical implementation of Kolmogorov's superpositions. Neural Networks 9 (1996) 765-772 4. Sprecher, D.A.: A numerical implementation of Kolmogorov's superpositions II. Neural Networks 10 (1997) 447–457 5. KĤrková, V.: Kolmogorov’s theorem is relevant. Neural Computation 3 (1991) 617-622 6. Igelnik, B., and Parikh, N.: Kolmogorov's spline network. IEEE Transactions on Neural Networks 14 (2003) 725-733 7. Yam, Y., Nguyen, H. T., and Kreinovich, V.: Multi-resolution techniques in the rulesbased intelligent control systems: a universal approximation result. Proc. 14th IEEE Int. Symp. on Intelligent Control/Intelligent Systems and Semiotics ISIC/ISAS'99, Cambridge, Massachusetts, September 15-17 (1999) 213-218 8. Lopez-Gomez, A., Yoshida, S., and Hirota, K.: Fuzzy functional link network and its application to the representation of the extended Kolmogorov theorem. International Journal of Fuzzy Systems 4 (2002) 690-695 9. Kolodyazhniy, V., and Bodyanskiy, Ye.: Fuzzy Neural Network with Kolmogorov’s Structure. Proc. 11th East-West Fuzzy Colloquium, Zittau, Germany (2004) 139-146 10. Kolodyazhniy, V., and Bodyanskiy, Ye.: Fuzzy Kolmogorov’s Network. Proc. 8th Int. Conf. on Knowledge-Based Intelligent Information and Engineering Systems (KES 2004), Wellington, New Zealand, September 20-25, Part II (2004) 764-771 11. Kolodyazhniy, V., Bodyanskiy, Ye., and Otto, P.: Universal Approximator Employing Neo-Fuzzy Neurons. Proc. 8th Fuzzy Days, Dortmund, Germany, Sep. 29 – Oct. 1 (2004) CD-ROM

202

Y. Bodyanskiy, V. Kolodyazhniy, and P. Otto

12. Yamakawa, T., Uchino, E., Miki, T., and Kusanagi, H.: A neo fuzzy neuron and its applications to system identification and prediction of the system behavior. Proc. 2nd Int. Conf. on Fuzzy Logic and Neural Networks “IIZUKA-92”, Iizuka, Japan (1992) 477-483 13. Kosko, B.: Fuzzy systems as universal approximators. Proc. 1st IEEE Int. Conf. on Fuzzy Systems, San Diego, CA (1992) 1153-1162 14. Jang, J.-S. R.: Neuro-Fuzzy Modeling: Architectures, Analyses and Applications. PhD Thesis. Department of Electrical Engineering and Computer Science, University of California, Berkeley (1992) 15. Mackey, M. C., and Glass, L.: Oscillation and chaos in physiological control systems. Science 197 (1977) 287-289 16. UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html 17. Nauck, D., Nauck, U., and Kruse, R.: Generating classification rules with the neuro-fuzzy system NEFCLASS. Proc. NAFIPS'96, Berkeley (1996) 18. Duch, W.: Datasets used for classification: comparison of results. Computational Intelligence Laboratory, Nicolaus Copernicus University, Torun, Poland. http://www.phys.uni.torun.pl/kmk/projects/datasets.html 19. Sun, C.-T., and Jang, J.-S. R.: Adaptive network based fuzzy classification, Proc. of the Japan-U.S.A. Symposium on Flexible Automation (1992) 20. Sun, C.-T., and Jang, J.-S. R.: A neuro-fuzzy classifier and its applications. Proc. IEEE Int. Conf. on Fuzzy Systems, San Francisco (1993)

Design of Optimal Power Distribution Networks Using Multiobjective Genetic Algorithm Alireza Hadi and Farzan Rashidi Azad University of Tehran-South Branch, Tehran, Iran [email protected]

Abstract. This paper presents solution of optimal power distribution networks problem of large-sized power systems via a genetic algorithm of real type. The objective is to find out the best power distribution network reliability while simultaneously minimizing the system expansion costs. To save an important CPU time, the constraints are to be decomposing into active constraints and passives ones. The active constraints are the parameters which enter directly in the cost function and the passives constraints are affecting the cost function indirectly as soft limits. The proposed methodology is tested for real distribution systems with dimensions that are significantly larger than the ones frequently found in the literature. Simulation results show that by this method, an optimum solution can be given quickly. Analyses indicate that proposed method is effective for large-scale power systems. Further, the developed model is easily applicable to n objectives without increasing the conceptual complexity of the corresponding algorithm and can be useful for very largescale power system.

1 Introduction Investments in power distribution systems constitute a significant part of the utilities’ expenses. They can account for up to 60 percent of capital budget and 20 percent of operating costs [1]. For this reason, efficient planning tools are needed to allow planners to reduce costs. Computer optimization algorithms improve cost reduction compared to system design by hand. Therefore, in recent years, a lot of mathematical models and algorithms have been developed. The optimal power distribution system design has been frequently solved by using classical optimization methods during the last two decades for single stage or multistage design problems. The basic problem has been usually considered as the minimization of an objective function representing the global power system expansion costs in order to solve the optimal sizing and/or locating problems for the feeder and/or substations of the distribution system [2]. Some mathematical programming techniques, such as Mix-Integral Program(MIP)[3,4], Branch and Bound [5], Dynamic Program(DP)[6], Quadratic MIP, AI/Expert system approach[7,8], Branch Exchange[9], etc., were extensively used to solve this problem. Some of the above models considered a simple linear objective function to represent the actual non-linear expansion costs, what leads to unsatisfactory representations of the real costs. Mixed-integer models achieved a U. Furbach (Ed.): KI 2005, LNAI 3698, pp. 203 – 215, 2005. © Springer-Verlag Berlin Heidelberg 2005

204

A. Hadi and F. Rashidi

better description of such costs by including fixed costs associated to integer 0-1 variables, and linear variable costs associated to continuous variables [10]. The most frequent optimization techniques to solve these models have been the branch and bound ones. These kinds of methods need large amounts of computer time to achieve the optimal solution, especially when the number of integer 0-1 variables grows for large power distribution systems [11]. Some models such as DP contained the nonlinear variable expansion costs and used dynamic programming that requires excessive CPU time for large distribution systems [12]. AI/Expert based methods propose a distribution planning heuristic method, that is faster than classical ones but the obtained solutions are sometimes local optimal solutions [13]. This model takes into account linearized costs of the feeders, that is, a linear approximation of the nonlinear variable costs. Genetic Algorithms (GAs) have solved industrial optimization problems in recent years, achieving suitable results. Genetic algorithms have been applied to the optimal multistage distribution network and to the street lighting network design [14, 15]. In these works, a binary codification was used, but this codification makes difficult to include some relevant design aspects in the mathematical models. Moreover, most papers about optimal distribution design do not include practical examples of a true multi objective optimization of real distribution networks of significant dimensions (except in [16]), achieving with detail the set of optimal multi objective non dominated solutions [17, 18] (true simultaneous optimization of the costs and the reliability). In this paper, because the total objective function is very non-linear, a genetic algorithm with real coding of control parameters is chosen in order to well taking into account this nonlinearity and to converge rapidly to the global optimum. To accelerate the processes of the optimization, the controllable variables are decomposed to active constraints and passive constraints. The active constraints, which influence directly the cost function, are included in the GA process. The passive constraints, which affect indirectly this function, are maintained in their soft limits using a conventional constraint, only, one time after the convergence of GA. Novel operators of the genetic algorithm is developed, that allow for getting out of local optimal solutions and obtaining the global optimal solution or solutions very close to such optimal one. This algorithm considers the true non-linear variable cost. It also obtains the distribution nodes voltages and an index that gives a measure of the distribution system reliability. The proposed algorithm allows for more flexibility, taking into account several design aspects as, for example, different conductor sizes, different sizes of substations, and suitable building of additional feeders for improving the network reliability. This methodology is tested for real distribution systems with dimensions that are significantly larger than the ones frequently found in the literature. Simulation results show that by this method, an optimum solution can be given quickly. Further analyses indicate that this method is effective for large-scale power systems. 
CPU times can be reduced by decomposing the optimization constraints of the power system into active constraints, manipulated directly by the genetic algorithm, and passive constraints, maintained within their soft limits by a conventional constraint-handling procedure applied to the power distribution network.
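As an illustration of this decomposition (not taken from the paper; all names are hypothetical), the following Python sketch separates the constraints into active ones, which enter a penalized fitness during the GA run, and passive ones, which are only clamped back to their soft limits once after convergence:

# Hedged sketch of the active/passive constraint split (hypothetical names).
def penalized_fitness(cost, active_constraints, x, weight=1e3):
    # Active constraints influence the cost function directly during the GA run.
    penalty = sum(max(0.0, g(x)) ** 2 for g in active_constraints)
    return cost(x) + weight * penalty

def enforce_passive(x, passive_limits):
    # Passive constraints are only restored to their soft limits once, after convergence.
    return [min(max(xi, lo), hi) for xi, (lo, hi) in zip(x, passive_limits)]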


2 Power Distribution Problems
The problem we want to solve is a simplified one: there is a set of sources (substations) and a set of sinks or demand nodes (demand centers). Each demand node has a given power demand, and each source has a maximum limit on the power it can supply. Moreover, the possible routes for building electric lines to transport power from the sources to the demand nodes are known. Each line has a cost that depends mainly on its length (basically the fixed costs) and on the power it transports (fundamentally the variable costs). The model contains an objective function that represents the costs corresponding to the lines and substations of the electric distribution system. Multiobjective optimization can be defined as the problem of finding [10]: a vector of decision variables which satisfies the constraints and optimizes a vector function whose elements represent the objective functions. These functions form a mathematical description of performance criteria, which are usually in conflict with each other. Hence, the term "optimizes" means finding a solution which gives values of all the objective functions acceptable to the designer. The standard power distribution problem can be written in the following form:

Minimize f(x)            (objective function)
subject to: h(x) = 0     (equality constraints)
            g(x) ≤ 0     (inequality constraints)          (1)
Where x = [x1, x2, ..., xn] is the vector of the control variables, that is, those which can be varied (set of routes, set of nodes, power flows, fixed cost of a feeder to be built, etc.). The essence of the optimal power distribution problem resides in minimizing the objective function while simultaneously satisfying the equality constraints and not violating the inequality constraints.

2.1 Objective Function
The most commonly used objective in the power distribution problem formulation is the optimization of the total economic cost and of the reliability of the distribution network. In this paper, the multiobjective design model is a nonlinear mixed-integer one for the optimal sizing and location of feeders and substations, which can be used for single-stage or multistage planning under a pseudo-dynamic methodology. The vector of objective functions to be minimized is f(x) = [f1(x), f2(x)], where f1(x) is the objective function of the global economic costs and f2(x) is a function related to the reliability of the distribution network. The objective function f1(x) can be written as follows [10]:

f1(x) = Σ_{(i,j)∈NF} { (CFij)(Yij) + (CVij)(Xij)² } + Σ_{k∈NS} { (CFk)(Yk) + (CVk)(Xk)² }          (2)

Where:
NF: is the set of routes (between nodes) associated with lines in the network.
NS: is the set of nodes associated with substations in the network.


(i, j): is the route between nodes i and j.
Xk: is the power flow, in kVA, supplied from the node k associated with a substation size.
Xij: is the power flow, in kVA, carried through the route (i, j) associated with a feeder size.
CVij: is the variable cost coefficient of a feeder in the network, on route (i, j).
CFij: is the fixed cost of a feeder to be built, on route (i, j).
CVk: is the variable cost coefficient of a substation in the network, at node k.
CFk: is the fixed cost of a substation to be built, at node k.
Yk: is equal to 1 if the substation associated with node k ∈ NS is built; otherwise, it is equal to 0.
Yij: is equal to 1 if the feeder associated with route (i, j) ∈ NF is built; otherwise, it is equal to 0.
The objective function f2(x) is:

f2(x) = Σ_{(i,j)∈NF} (uij)(Xf(i,j))          (3)

Where:
NF: is the set of "proposed" routes (proposed feeders) connecting the network nodes with the proposed substations.
Xf(i,j): is the power flow, in kVA, carried through the route f ∈ NF, calculated for a possible failure of an existing feeder (usually in operation) on the route (i, j) ∈ NF. In this case, reserve feeders are used to supply the power demands.
uij: are constants obtained from suitable reliability data of the distribution network, including several reliability-related parameters such as failure rates and repair rates of the distribution feeders, as well as the lengths of the corresponding feeders on the routes (i, j) ∈ NF.
The simultaneous minimization of the two objective functions is subject to technical constraints, which are [10]:
1. The first Kirchhoff law (at the existing nodes of the distribution system).
2. The restrictions relative to the power transport capacity limits of the lines.
3. The restrictions relative to the power supply capacity limits of the substations.
The presented model corresponds to a mathematical formulation of nonlinear mixed-integer programming, which includes the true nonlinear costs of the lines and substations of the distribution system, in order to search for the optimal solution from an economic viewpoint.
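To make the cost model concrete, the following Python sketch (not part of the original formulation; the data structures and names are illustrative assumptions) evaluates f1 and f2 for a candidate design described by the build decisions Y and the power flows X:

# Hedged sketch: evaluating the two objectives of Eq. (2) and Eq. (3).
# feeders: dict mapping route (i, j) -> dict with keys CF, CV, Y (0/1 build decision), X (kVA)
# substations: dict mapping node k -> dict with keys CF, CV, Y, X
# failure_flows: dict mapping route (i, j) -> (u_ij, X_f) reliability data
def f1(feeders, substations):
    # Global economic cost: fixed cost per built element plus quadratic variable cost.
    cost = sum(fd["CF"] * fd["Y"] + fd["CV"] * fd["X"] ** 2 for fd in feeders.values())
    cost += sum(sb["CF"] * sb["Y"] + sb["CV"] * sb["X"] ** 2 for sb in substations.values())
    return cost

def f2(failure_flows):
    # Reliability-related objective: weighted power rerouted under single-feeder failures.
    return sum(u_ij * x_f for (u_ij, x_f) in failure_flows.values())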

3 Evolutionary Algorithms in Power Distribution Networks
Evolutionary algorithms (EAs) are computer-based problem-solving systems which use computational models of evolutionary processes as key elements in their design and implementation. Genetic Algorithms (GAs) are the most popular and widely used of all evolutionary algorithms. They have been used to solve difficult problems with objective functions that do not possess nice properties such as continuity, differentiability, satisfaction of the Lipschitz condition, etc. These algorithms maintain and manipulate a population of solutions and implement the principle of survival of the fittest in their search to produce better and better approximations to a solution [19]. This provides an implicit as well as an explicit parallelism that allows for the exploitation of several promising areas of the solution space at the same time. The implicit parallelism is due to the schema theory developed by Holland, while the explicit parallelism arises from the manipulation of a population of points. The power of the GA comes from the mechanism of evolution, which allows searching through a huge number of possible solutions. The simplicity of the representations and operations in the GA is another feature that makes them so popular. There are three major advantages of using genetic algorithms for optimization problems [20]:
1. GAs do not involve many mathematical assumptions about the problems to be solved. Due to their evolutionary nature, genetic algorithms will search for solutions without regard for the specific inner structure of the problem. GAs can handle any kind of objective function and any kind of constraints, linear or nonlinear, defined on discrete, continuous, or mixed search spaces.
2. The ergodicity of the evolution operators makes GAs effective at performing global search. Traditional approaches perform local search by a convergent stepwise procedure, which compares the values of nearby points and moves to the relatively optimal points; global optima can be found only if the problem possesses certain convexity properties that essentially guarantee that any local optimum is a global optimum.
3. GAs provide great flexibility to hybridize with domain-dependent heuristics to make an efficient implementation for a specific problem.

The implementation of a GA involves some preparatory stages. Having decoded the chromosome representation (genotype) into the decision variable domain (phenotype), it is possible to assess the performance, or fitness, of individual members of a population [21]. This is done through an objective function that characterises an individual's performance in the problem domain. During the reproduction phase, each individual is assigned a fitness value derived from its raw performance measure given by the objective function. Once the individuals have been assigned a fitness value, they can be chosen from the population, with a probability according to their relative fitness, and recombined to produce the next generation. Genetic operators manipulate the genes. The recombination operator is used to exchange genetic information between pairs of individuals. The crossover operation is applied with a probability px when the pairs are chosen for breeding. Mutation causes the individual genetic representation to be changed according to some probabilistic rule. Mutation is generally considered as a background operator that ensures that the probability of searching a particular subspace of the problem space is never zero; this tends to inhibit the possibility of converging to a local optimum [22-25]. The traditional binary representation used in genetic algorithms creates some difficulties for large optimization problems requiring high numeric precision. Depending on the problem, the resolution of the algorithm can be expensive in time, and the crossover and mutation operators may not be well adapted. For such problems, genetic algorithms based on binary representations have poor performance. The first assumption that is typically made is that the variables representing parameters can be represented by bit strings. This means that the variables are discretized in an a priori fashion, and that the range of the discretization corresponds to some power of two. For example, with 10 bits per parameter, we obtain a range with 1024 discrete values. If the parameters are actually continuous then this discretization is not a particular problem. This assumes, of course, that the discretization provides enough resolution to make it possible to adjust the output with the desired level of precision. It also assumes that the discretization is in some sense representative of the underlying function. Therefore, one can say that a more natural representation of the problem offers more efficient solutions, and one of the greater improvements consists in using real numbers directly. Evolutionary programming algorithms in economic power dispatch provide an edge over the common GA mainly because they do not require any special coding of individuals. In this case, since the desired outcome is the operating point of each of the dispatched generators (a real number), each individual can be directly represented as a set of real numbers, each one being the produced power of the generator it concerns [26]. Our evolutionary programming is based on the complete Genocop III system [27], developed by Michalewicz and Nazhiyath. Genocop III for constrained numerical optimization (with nonlinear constraints) is based on repair algorithms. Genocop III incorporates the original Genocop system [28] (which handles linear constraints only), but also extends it by maintaining two separate populations, where a development in one population influences evaluations of individuals in the other population. The first population, Ps, consists of so-called search points, which satisfy the linear constraints of the problem; the feasibility (in the sense of linear constraints) of these points is maintained by specialized operators (as in Genocop). The second population, Pr, consists of fully feasible reference points. These reference points, being feasible, are evaluated directly by the objective function, whereas search points are "repaired" for evaluation.

3.1 The Proposed GA for Design of Optimal Power Distribution Network
The Genocop (GEnetic algorithm for Numerical Optimization of COnstrained Problems) system [28] assumes linear constraints only and a feasible starting point (a feasible population).

3.1.1 Initialization
Let Pg^k = [pg_1^k, pg_2^k, ..., pg_ng^k] be the trial vector that represents the k-th individual, k = 1, 2, ..., P, of the population to be evolved, where P is the population size and ng is the number of control variables. The elements of Pg^k are the desired control values. The initial parent trial vectors Pg^k, k = 1, ..., P, are generated randomly from a reasonable range in each dimension by setting the elements of Pg^k as:

pg_i^k = U(pg_i,min, pg_i,max),   for i = 1, 2, ..., ng          (4)

Where U(pg_i,min, pg_i,max) denotes the outcome of a uniformly distributed random variable ranging over the given lower and upper bound values of the active power outputs of the generators. A closed set of operators maintains the feasibility of solutions. For example, when a particular component xi of a solution vector X is mutated, the system determines its current domain dom(xi) (which is a function of the linear constraints and of the remaining values of the solution vector X) and the new value of xi is taken from this domain (either with a flat probability distribution for uniform mutation, or with other probability distributions for non-uniform and boundary mutations). In any case the offspring solution vector is always feasible. Similarly, the arithmetic crossover aX + (1−a)Y of two feasible solution vectors X and Y always yields a feasible solution (for 0 ≤ a ≤ 1) in convex search spaces.
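A minimal sketch of the initialization of Eq. (4) and of these two feasibility-preserving operators is given below; it is illustrative only and assumes, for simplicity, that the feasible region reduces to box constraints, so the mutation domain is just the variable's interval:

import random

# Hedged sketch, assuming simple box constraints pg_min[i] <= pg[i] <= pg_max[i].
def init_individual(pg_min, pg_max):
    # Uniform sampling within the given bounds, as in Eq. (4).
    return [random.uniform(lo, hi) for lo, hi in zip(pg_min, pg_max)]

def uniform_mutation(x, pg_min, pg_max, i):
    # Resample component i from its current (box) domain; the result stays feasible.
    y = list(x)
    y[i] = random.uniform(pg_min[i], pg_max[i])
    return y

def arithmetic_crossover(x, y, a=None):
    # aX + (1 - a)Y with 0 <= a <= 1 keeps feasibility in convex search spaces.
    a = random.random() if a is None else a
    return [a * xi + (1 - a) * yi for xi, yi in zip(x, y)]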

3.1.2 Offspring Creation
By adding a Gaussian random variation with zero mean and a standard deviation proportional to the fitness value of the parent trial solution, each parent Pg^k, k = 1, ..., P, creates an offspring vector Pg^{k+1}, that is:

pg^{k+1} = pg^k + N(0, σ_k²)          (5)

Where N(0, σ_k²) designates a vector of Gaussian random variables with mean zero and standard deviation σ_k, which is given according to:

σ_k = r (f / f_min) (pg_i,max − pg_i,min) + M          (6)

Where f is the objective function value to be minimized in (2) associated with the control vector; f_min represents the minimum objective function value among the trial solutions; r is a scaling factor; and M indicates an offset.
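The following sketch (again illustrative, not the paper's code, and with assumed example values for r and M) shows how an offspring could be generated from a parent according to Eqs. (5) and (6):

import random

# Hedged sketch of the fitness-proportional Gaussian perturbation of Eqs. (5)-(6).
def create_offspring(parent, f_parent, f_min, pg_min, pg_max, r=0.05, M=1e-3):
    offspring = []
    for i, pg_i in enumerate(parent):
        # Eq. (6): a poorer parent (larger f_parent / f_min) gets a larger sigma, plus offset M.
        sigma_i = r * (f_parent / f_min) * (pg_max[i] - pg_min[i]) + M
        # Eq. (5): add zero-mean Gaussian noise; clipping to the bounds is an extra safeguard
        # not stated in the equations.
        value = pg_i + random.gauss(0.0, sigma_i)
        offspring.append(min(max(value, pg_min[i]), pg_max[i]))
    return offspring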

3.1.3 Stopping Rule
When the stopping rule is satisfied, that is, when the maximum number of generations is reached or the minimum criterion on the value of f in Equations (2) and (3) is met, the evolution process stops, and the solution with the highest fitness value is regarded as the best control vector.

3.2 Replacement of Individuals by Their Repaired Versions
The question of replacing repaired individuals is related to so-called Lamarckian evolution, which assumes that an individual improves during its lifetime and that the resulting improvements are coded back into the chromosome. In continuous domains, Michalewicz and Nazhiyath indicated that an evolutionary computation technique with a repair algorithm provides the best results when 20% of the repaired individuals replace their infeasible originals.
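A compact sketch of this replacement rule is shown below (the function names are hypothetical, and the repair routine itself is assumed to be given):

import random

# Hedged sketch of Lamarckian replacement: each infeasible individual is repaired for
# evaluation, and with probability 0.2 the repaired version is written back (cf. [27]).
def evaluate_with_repair(population, is_feasible, repair, objective, p_replace=0.2):
    fitness = []
    for idx, ind in enumerate(population):
        if is_feasible(ind):
            fitness.append(objective(ind))
        else:
            repaired = repair(ind)
            fitness.append(objective(repaired))
            if random.random() < p_replace:
                population[idx] = repaired  # code the improvement back into the chromosome
    return fitness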

4 Computational Results
The new evolutionary algorithm has been applied intensively to the multiobjective optimal design of several real-size distribution systems. A compatible PC has been used (Pentium 1.6 GHz CPU and 256 MB of RAM) with the Windows XP operating system to run MATLAB.


Table 1. Characteristics of the distribution system

Number of existing demand nodes       43
Number of proposed demand nodes      121
Number of total demand nodes         164
Number of existing lines              43
Number of proposed lines             133
Number of total lines                176
Number of existing substations         2
Number of proposed substations         0
Number of total substations            2
Number of total 0-1 variables        532


Fig. 1. Existing and proposed feeders of the distribution network

Table 2. Results of the multiobjective optimization program

Generations    Cost (Rials)    Reliability
75             42511.4367      197.17357

[Scatter plot of the obtained solutions: Reliability versus Cost]

Fig. 2. Evolution of the non-dominated solution curve

Table 1 shows the characteristics of the optimal design example, indicating the dimensions of the distribution system used and the mathematical aspects. Notice that the number of 0-1 variables of the distribution network indicates that the complexity of the optimization and the dimensions of this network are significantly larger than most of the ones usually described in technical papers [20]. In this study we use a 20 kV network, which is a portion of Iran's power distribution network, represented in Figure 1. The figure shows the complete network, with the existing feeders (thick lines) and the proposed feeders (other lines). The existing network has two substations, at nodes 1 and 17, with a capacity of 40 MVA. The sizes of the proposed substations are 40 MVA and 10 MVA. The existing lines of the network have the following conductor sizes: 3x75Al, 3x118Al, 3x126Al and 3x42Al. In the design of the network, 4 different conductor sizes are also proposed for the construction of new lines: 3x75Al, 3x118Al, 3x126Al and 3x42Al. We have used a multiobjective toolbox written to work in the MATLAB environment, called MOEA. In Table 2, we show the results of the multiobjective optimization program that have led us to the final non-dominated solutions curve. This table provides the objective function values ("Cost", in millions of Rials, and "Reliability", a function of the expected energy not supplied, in kWh) of the ideal solutions, the number of generations (Gen.), and the objective function values of the best topologically meshed network solution for the distribution system. The multiobjective optimization finishes when the stopping criterion is reached, that is, when the upper limit of the defined number of generations is reached. The crossover rate used is 0.3 and the mutation rate is 0.02. The population size is 180 individuals and the


Fig. 3. Solution of multi objective optimal design model

process uses the roulette wheel selection method and reuses the previous population in the new generation. Figure 2 shows the non-dominated solution curve during the optimization. The trade-off between two solutions is easy to read from the curve, though ultimately subjective: when the cost improves, the reliability decreases, and vice versa. After analyzing the curve of the final non-dominated solutions, the planner can select the definitive non-dominated solution, taking into account simultaneously the most satisfactory values of the two objective functions. We believe that the set of non-dominated solutions is the best set that can be offered to the planner in order to select the most satisfactory solution from it. Finally, as mentioned before, Figure 1 shows the existing and proposed future distribution network. Figure 3 shows the final selected multiobjective non-dominated solution of this paper, corresponding to the non-dominated solution that represents the topologically meshed distribution system in radial operating state with the best reliability achieved by the complete multiobjective optimal design. The comparison of the multiobjective algorithm of this paper (an evolutionary algorithm) with other existing ones (for distribution systems of significant dimensions) has not been possible, since such existing algorithms are not able to consider the characteristics of the mathematical model used in this paper.
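For reference, a non-dominated (Pareto) set such as the one shown in Figure 2 can be extracted from a list of (cost, reliability) pairs as in the following sketch; this is not part of the paper's MOEA toolbox, and both objectives are assumed to be minimized:

# Hedged sketch: keep the solutions not dominated by any other (both objectives minimized).
def non_dominated(solutions):
    # solutions: list of (cost, reliability) tuples
    front = []
    for i, (c_i, r_i) in enumerate(solutions):
        dominated = any(
            (c_j <= c_i and r_j <= r_i) and (c_j < c_i or r_j < r_i)
            for j, (c_j, r_j) in enumerate(solutions) if j != i
        )
        if not dominated:
            front.append((c_i, r_i))
    return sorted(front)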


5 Conclusions
This paper reported the use of a multiobjective evolutionary algorithm to solve the optimal power distribution network problem, coordinating reliability and economy objectives. The multiobjective optimization model contemplates the following aspects: a non-linear objective function of the economic costs associated with the electric distribution network, and a new objective function representing the reliability of the electric network for the purposes of the optimal design. These objective functions are subject to the mathematical restrictions mentioned above. The optimal design provides the locations and optimal sizes of the electric lines, as well as of the future substations, proposed to the planner. The reliability objective of the electric power distribution system is evaluated by means of a fitness function defined over meshed (or radial) network topologies, taking into account, in general, faults of any order in the electric lines (first-order faults were studied in the reported computational results). As a study case, a portion of Iran's power distribution network was selected. The simulation results show that, for a large-scale system, a genetic algorithm with real coding can give the best result in reduced time. To save significant CPU time, the constraints are decomposed into active constraints and passive ones. The active constraints are the parameters which enter directly into the cost function, while the passive constraints affect the cost function indirectly. The developed model is easily applicable to n objectives without increasing the conceptual complexity of the corresponding algorithm. The proposed method can be useful for very large-scale power systems.

References
1. H. Lee Willis, H. Tram, M.V. Engel, and L. Finley, "Optimization applications to power distribution", IEEE Computer Applications in Power, 8(4):12-17, October 1995.
2. Gonen, T., and B.L. Foote, "Distribution-System Planning Using Mixed-Integer Programming", Proceedings IEE, vol. 128, Pt. C, no. 2, pp. 70-79, March 1981.
3. R.N. Adams and M.A. Laughton, "Optimal planning of power networks using mixed-integer programming. Part I: static and time-phased network synthesis", Proc. IEE, vol. 121(2), pp. 139-147, February 1974.
4. G.L. Thompson, D.L. Wall, "A Branch and Bound Model for Choosing Optimal Substation Locations", IEEE Trans. PAS, pp. 2683-2688, May 1981.
5. R.N. Adams and M.A. Laughton, "A dynamic programming network flow procedure for distribution system planning", presented at the IEEE Power Industry Computer Applications Conference (PICA), June 1973.
6. M. Ponnaviakko, K.S. Prakasa Rao, S.S. Venkata, "Distribution System Planning through a Quadratic Mixed Integer Programming Approach", IEEE Trans. PWRD, pp. 1157-1163, October 1987.
7. Brauner, G. and Zobel, M., "Knowledge Based Planning of Distribution Networks", IEEE Trans. Power Delivery, vol. 5, no. 3, pp. 1514-1519, 1990.
8. Chen, J. and Hsu, Y., "An Expert System for Load Allocation in Distribution Expansion Planning", IEEE Trans. Power Delivery, vol. 4, no. 3, pp. 1910-1917, 1989.


9. S.K. Goswami, "Distribution System Planning Using Branch Exchange Technique", IEEE Trans. on Power Systems, vol. 12, no. 2, pp. 718-723, May 1997.
10. I.J. Ramírez-Rosado and J.L. Bernal-Agustín, "Genetic algorithms applied to the design of large power distribution systems", IEEE Trans. on Power Systems, vol. 13, no. 2, pp. 696-703, May 1998.
11. I.J. Ramírez-Rosado and T. Gönen, "Pseudodynamic Planning for Expansion of Power Distribution Systems", IEEE Transactions on Power Systems, vol. 6, no. 1, pp. 245-254, February 1991.
12. Partanen, J., "A Modified Dynamic Programming Algorithm for Sizing, Locating and Timing of Feeder Reinforcements", IEEE Trans. on Power Delivery, vol. 5, no. 1, pp. 277-283, January 1990.
13. Ramírez-Rosado, I.J., and José L. Bernal-Agustín, "Optimization of the Power Distribution Network Design by Applications of Genetic Algorithms", International Journal of Power and Energy Systems, vol. 15, no. 3, pp. 104-110, 1995.
14. I.J. Ramírez-Rosado and R.N. Adams, "Multiobjective Planning of the Optimal Voltage Profile in Electric Power Distribution Systems", International Journal for Computation and Mathematics in Electrical and Electronic Engineering, vol. 10, no. 2, pp. 115-128, 1991.
15. J.Z. Zhu, "Optimal reconfiguration of electrical distribution network using the refined genetic algorithm", Electric Power Systems Research 62, 37-42, 2002.
16. J. Partanen, "A modified dynamic programming algorithm for sizing, locating and timing of feeder reinforcements", IEEE Trans. on Power Delivery, vol. 5, no. 1, pp. 277-283, Jan. 1990.
17. T. Gönen and I.J. Ramírez-Rosado, "Optimal multi-stage planning of power distribution systems", IEEE Trans. on Power Delivery, vol. PWRD-2, no. 2, pp. 512-519, Apr. 1987.
18. S. Sundhararajan, A. Pahwa, "Optimal Selection of Capacitors for Radial Distribution Systems Using a Genetic Algorithm", IEEE Trans. on Power Systems, vol. 9, no. 3, pp. 1499-1507, August 1994.
19. Goldberg, D., Genetic Algorithms in Search, Optimization and Machine Learning, Addison Wesley, 1989, ISBN: 0201157675.
20. Coello Coello, C.A., Van Veldhuizen, D.A., Lamont, G.B., Evolutionary Algorithms for Solving Multi-Objective Problems, Genetic Algorithms and Evolutionary Computation, Kluwer Academic Publishers, New York, NY, 2002.
21. Obayashi, S., "Pareto genetic algorithm for aerodynamic design using the Navier-Stokes equations", in Quagliarella, D., Périaux, J., Poloni, C., Winter, G., eds.: Genetic Algorithms and Evolution Strategies in Engineering and Computer Science, John Wiley & Sons, Ltd., Trieste, Italy, 1997, pp. 245-266.
22. Zitzler, E., Thiele, L., "Multiobjective evolutionary algorithms: A comparative case study and the strength Pareto approach", IEEE Transactions on Evolutionary Computation 3, 257-271, 1999.
23. Bentley, P.J., Wakefield, J.P., "Finding acceptable solutions in the Pareto-optimal range using multiobjective genetic algorithms", Proceedings of the Second Online World Conference on Soft Computing in Engineering Design and Manufacturing (WSC2) 5, pp. 231-240, 1998.
24. Deb, K., Mohan, M., Mishra, S., "A fast multi-objective evolutionary algorithm for finding well-spread Pareto-optimal solutions", Technical Report 2003002, Kanpur Genetic Algorithms Laboratory (KanGAL), Indian Institute of Technology Kanpur, Kanpur, PIN 208016, India, 2003.

25. Parmee, I.C., Watson, A.H., "Preliminary airframe design using co-evolutionary multiobjective genetic algorithms", Proceedings of the Genetic and Evolutionary Computation Conference 1999, pp. 1657-1665, 1999.
26. K. De Jong, Lecture on Coevolution, in "Theory of Evolutionary Computation", H.-G. Beyer et al. (chairs), Max Planck Inst. for Comput. Sci. Conf. Cntr., Schloß Dagstuhl, Saarland, Germany, 2002.
27. Z. Michalewicz, G. Nazhiyath, "Genocop III: A Co-evolutionary Algorithm for Numerical Optimization Problems with Nonlinear Constraints", Proceedings of the Second IEEE ICEC, Perth, Australia, 1995.
28. Z. Michalewicz, C. Janikow, "Handling Constraints in Genetic Algorithms", Proceedings of the Fourth ICGA, Morgan Kaufmann, pp. 151-157, 1991.

New Stability Results for Delayed Neural Networks

Qiang Zhang
Liaoning Key Lab of Intelligent Information Processing, Dalian University, Dalian 116622, China
[email protected]

Abstract. By constructing suitable Lyapunov functionals and combining them with matrix inequality techniques, some new sufficient conditions are presented for the global asymptotic stability of delayed neural networks. These conditions contain and improve some of the previous results in the earlier references.

1 Introduction

In recent years, the stability of a unique equilibrium point of delayed neural networks has extensively been discussed by many researchers[1]-[16]. Several criteria ensuring the global asymptotic stability of the equilibrium point are given by using the comparison method, Lyapunov functional method, M-matrix, diagonal dominance technique and linear matrix inequality approach. In [1]-[12], some sufficient conditions are given for the global asymptotic stability of delayed neural networks by constructing Lyapunov functionals. In [13]-[16], several results for asymptotic stability of neural networks with constant or time-varying delays are obtained based on the linear matrix inequality approach. In this paper, some new sufficient conditions to the global asymptotic stability for delayed neural networks are derived. These conditions are independent of delays and impose constraints on both the feedback matrix and delayed feedback matrix. The results are also compared with the earlier results derived in the literature.

2 Stability Analysis for Delayed Neural Networks

The dynamic behavior of a continuous-time delayed neural network can be described by the following state equations:

x'_i(t) = −ci xi(t) + Σ_{j=1}^{n} aij fj(xj(t)) + Σ_{j=1}^{n} bij fj(xj(t − τj)) + Ji,   i = 1, 2, ..., n          (1)

or equivalently

x'(t) = −Cx(t) + Af(x(t)) + Bf(x(t − τ)) + J          (2)


where x(t) = [x1(t), ..., xn(t)]^T ∈ R^n, f(x(t)) = [f1(x1(t)), ..., fn(xn(t))]^T ∈ R^n, f(x(t − τ)) = [f1(x1(t − τ1)), ..., fn(xn(t − τn))]^T ∈ R^n. C = diag(ci > 0) is a positive diagonal matrix, A = {aij} is referred to as the feedback matrix, B = {bij} represents the delayed feedback matrix, while J = [J1, ..., Jn]^T is a constant input vector and the delays τj are nonnegative constants. In this paper, we will assume that the activation functions fi, i = 1, 2, ..., n, satisfy the following condition (H):

0 ≤ (fi(ξ1) − fi(ξ2)) / (ξ1 − ξ2) ≤ Li,   ∀ξ1, ξ2 ∈ R, ξ1 ≠ ξ2.

This type of activation function is clearly more general than both the usual sigmoid activation functions and the piecewise linear (PWL) function fi(x) = (1/2)(|x + 1| − |x − 1|) which is used in [4]. The initial conditions associated with system (1) are of the form xi(s) = φi(s), s ∈ [−τ, 0], i = 1, 2, ..., n, τ = max_{1≤j≤n}{τj},

in which φi(s) is continuous for s ∈ [−τ, 0]. Assume x* = (x*_1, x*_2, ..., x*_n)^T is an equilibrium of Eq. (1); the transformation yi = xi − x*_i transforms system (1) or (2) into the following system:

y'_i(t) = −ci yi(t) + Σ_{j=1}^{n} aij gj(yj(t)) + Σ_{j=1}^{n} bij gj(yj(t − τj)),   ∀i          (3)

where gj(yj(t)) = fj(yj(t) + x*_j) − fj(x*_j), or

y'(t) = −Cy(t) + Ag(y(t)) + Bg(y(t − τ))          (4)

respectively. Note that since each function fj(·) satisfies the hypothesis (H), each gj(·) satisfies

gj²(ξj) ≤ Lj² ξj²,   ξj gj(ξj) ≥ gj²(ξj) / Lj,   ∀ξj ∈ R,   gj(0) = 0.          (5)

To prove the stability of x* of Eq. (1), it is sufficient to prove the stability of the trivial solution of Eq. (3) or Eq. (4). In this paper, we will use the notation A > 0 (or A < 0) to denote that the matrix A is a symmetric positive definite (or negative definite) matrix. The notations A^T and A^{−1} mean the transpose and the inverse of a square matrix A, respectively. In order to obtain our results, we need to establish the following lemma:

Lemma 1. For any vectors a, b ∈ R^n, the inequality

2a^T b ≤ a^T X a + b^T X^{−1} b          (6)

holds, in which X is any matrix with X > 0.


Proof. Since X > 0, we have

a^T X a − 2a^T b + b^T X^{−1} b = (X^{1/2} a − X^{−1/2} b)^T (X^{1/2} a − X^{−1/2} b) ≥ 0.

From this, we can easily obtain the inequality (6). Now we will present several results for the global stability of Eq. (1).

Theorem 1. The equilibrium point of Eq. (1) is globally asymptotically stable if there exist a positive definite matrix Q and a positive diagonal matrix D such that

−2L^{−1}DC + DA + A^T D + Q^{−1} + DBQB^T D < 0          (7)

holds.

Proof. We will employ the following positive definite Lyapunov functional:

V(y_t) = 2 Σ_{i=1}^{n} d_i ∫_0^{yi(t)} gi(s) ds + ∫_{t−τ}^{t} g^T(y(s)) Q^{−1} g(y(s)) ds

in which Q > 0 and D = diag(di) is a positive diagonal matrix. The time derivative of V(y_t) along the trajectories of the system (4) is obtained as

V'(y_t) = 2g^T(y(t)) D (−Cy(t) + Ag(y(t)) + Bg(y(t − τ))) + g^T(y(t)) Q^{−1} g(y(t)) − g^T(y(t − τ)) Q^{−1} g(y(t − τ))
        = −2g^T(y(t)) DC y(t) + 2g^T(y(t)) DA g(y(t)) + 2g^T(y(t)) DB g(y(t − τ)) + g^T(y(t)) Q^{−1} g(y(t)) − g^T(y(t − τ)) Q^{−1} g(y(t − τ))          (8)

Let B^T D g(y(t)) = a and g(y(t − τ)) = b. By Lemma 1, we have

2g^T(y(t)) DB g(y(t − τ)) ≤ g^T(y(t)) DBQB^T D g(y(t)) + g^T(y(t − τ)) Q^{−1} g(y(t − τ))          (9)

Using the inequality (5), we can write the following:

−2g^T(y(t)) DC y(t) ≤ −2g^T(y(t)) L^{−1} DC g(y(t))          (10)

Substituting (9), (10) into (8), we get

V'(y_t) ≤ g^T(y(t)) (−2L^{−1}DC + DA + A^T D + Q^{−1} + DBQB^T D) g(y(t)) < 0

by condition (7), which proves the theorem. For particular choices of the matrices Q and D, one can easily obtain the main result in [6]. As discussed in [6], therefore, the conditions given in [7]-[10] can be considered as special cases of the result in Theorem 1 above.

3 An Example

In this section, we give an example showing that the conditions presented here hold for different classes of feedback matrices than those given in [11]. To compare our results with those given in [11], we first restate the results of [11].

Theorem 2 [11]. If 1) A has nonnegative off-diagonal elements, 2) B has nonnegative elements, and 3) −(A+B) is row-sum dominant, then the system defined by (1) has a globally asymptotically stable equilibrium point for every constant input when the activation function is described by a PWL function and ci = 1.

Example 1. Consider the following delayed neural network:

x'_1(t) = −x1(t) − 0.1f(x1(t)) + 0.2f(x2(t)) + 0.1f(x1(t − τ1)) + 0.1f(x2(t − τ2)) − 3
x'_2(t) = −x2(t) + 0.2f(x1(t)) + 0.1f(x2(t)) + 0.1f(x1(t − τ1)) + 0.1f(x2(t − τ2)) + 2          (12)

where the activation function is described by the PWL function fi(x) = 0.5(|x + 1| − |x − 1|). The constant matrices are

A = [ −0.1  0.2 ;  0.2  0.1 ],   B = [ 0.1  0.1 ;  0.1  0.1 ],   C = I,

where A has nonnegative off-diagonal elements and B has nonnegative elements. Letting L = D = Q = I in Theorem 1 above, we can check that

−2L^{−1}DC + DA + A^T D + Q^{−1} + DBQB^T D = A + A^T + BB^T − I = [ −1.18  0.42 ;  0.42  −0.78 ]
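As a quick numerical sanity check (not part of the paper), the matrix condition of Theorem 1 for this example can be verified with NumPy; both eigenvalues turn out to be negative, confirming that the matrix is negative definite:

import numpy as np

# Hedged sketch: verify the condition of Theorem 1 for Example 1 with L = D = Q = I and C = I.
A = np.array([[-0.1, 0.2], [0.2, 0.1]])
B = np.array([[0.1, 0.1], [0.1, 0.1]])
I = np.eye(2)

M = A + A.T + B @ B.T - I        # equals -2C + A + A^T + I + BB^T for these identity choices
print(M)                         # approximately [[-1.18, 0.42], [0.42, -0.78]]
print(np.linalg.eigvalsh(M))     # both eigenvalues are negative, so M < 0 (negative definite)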