Formal Ontology in Information Systems: Proceedings of the Seventh International Conference (FOIS 2012), 1st ed. ISBN 978-1-61499-084-0 (online), 978-1-61499-083-3 (print)



FORMAL ONTOLOGY IN INFORMATION SYSTEMS


Frontiers in Artificial Intelligence and Applications FAIA covers all aspects of theoretical and applied artificial intelligence research in the form of monographs, doctoral dissertations, textbooks, handbooks and proceedings volumes. The FAIA series contains several sub-series, including “Information Modelling and Knowledge Bases” and “Knowledge-Based Intelligent Engineering Systems”. It also includes the biennial ECAI, the European Conference on Artificial Intelligence, proceedings volumes, and other ECCAI – the European Coordinating Committee on Artificial Intelligence – sponsored publications. An editorial panel of internationally well-known scholars is appointed to provide a high quality selection. Series Editors: J. Breuker, N. Guarino, J.N. Kok, J. Liu, R. López de Mántaras, R. Mizoguchi, M. Musen, S.K. Pal and N. Zhong

Volume 239


Recently published in this series

Vol. 238. A. Respício and F. Burstein (Eds.), Fusing Decision Support Systems into the Fabric of the Context
Vol. 237. J. Henno, Y. Kiyoki, T. Tokuda, H. Jaakkola and N. Yoshida (Eds.), Information Modelling and Knowledge Bases XXIII
Vol. 236. M.A. Biasiotti and S. Faro (Eds.), From Information to Knowledge – Online Access to Legal Information: Methodologies, Trends and Perspectives
Vol. 235. K.M. Atkinson (Ed.), Legal Knowledge and Information Systems – JURIX 2011: The Twenty-Fourth Annual Conference
Vol. 234. B. Apolloni, S. Bassis, A. Esposito and C.F. Morabito (Eds.), Neural Nets WIRN11 – Proceedings of the 21st Italian Workshop on Neural Nets
Vol. 233. A.V. Samsonovich and K.R. Jóhannsdóttir (Eds.), Biologically Inspired Cognitive Architectures 2011 – Proceedings of the Second Annual Meeting of the BICA Society
Vol. 232. C. Fernández, H. Geffner and F. Manyà (Eds.), Artificial Intelligence Research and Development – Proceedings of the 14th International Conference of the Catalan Association for Artificial Intelligence
Vol. 231. H. Fujita and T. Gavrilova (Eds.), New Trends in Software Methodologies, Tools and Techniques – Proceedings of the Tenth SoMeT_11
Vol. 230. O. Kutz and T. Schneider (Eds.), Modular Ontologies – Proceedings of the Fifth International Workshop (WoMO 2011)
Vol. 229. P.E. Vermaas and V. Dignum (Eds.), Formal Ontologies Meet Industry – Proceedings of the Fifth International Workshop (FOMI 2011)
Vol. 228. G. Bel-Enguix, V. Dahl and M.D. Jiménez-López (Eds.), Biology, Computation and Linguistics – New Interdisciplinary Paradigms

ISSN 0922-6389 (print)
ISSN 1879-8314 (online)


Formal Ontology in Information Systems Proceedings of the Seventh International Conference (FOIS 2012)

Edited by

Maureen Donnelly Department of Philosophy, State University of New York at Buffalo, Buffalo, USA

and

Giancarlo Guizzardi


Department of Computer Science, Federal University of Espírito Santo (UFES), Vitória, Espírito Santo, Brazil

Amsterdam • Berlin • Tokyo • Washington, DC


© 2012 The authors and IOS Press. All rights reserved.
No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without prior written permission from the publisher.

ISBN 978-1-61499-083-3 (print)
ISBN 978-1-61499-084-0 (online)
Library of Congress Control Number: 2012941397

Publisher
IOS Press BV
Nieuwe Hemweg 6B
1013 BG Amsterdam
Netherlands
fax: +31 20 687 0019
e-mail: [email protected]


Distributor in the USA and Canada IOS Press, Inc. 4502 Rachael Manor Drive Fairfax, VA 22032 USA fax: +1 703 323 3668 e-mail: [email protected]

LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.

PRINTED IN THE NETHERLANDS


Preface

This volume collects articles presented at the 7th edition of the International Conference on Formal Ontology in Information Systems (FOIS 2012). This edition of the biennial conference was held in conjunction with the 3rd edition of the International Conference on Biomedical Ontologies (ICBO 2012), in Graz, Austria. We received 71 submissions from all continents, in particular from authors affiliated with institutions in countries such as Algeria, Australia, Austria, Brazil, Canada, China, France, Germany, Hong Kong, Ireland, Italy, Japan, Mexico, Norway, Poland, Russia, Senegal, Singapore, South Africa, Spain, Sweden, Switzerland, Taiwan, Thailand, the United States and the United Kingdom. All submissions were carefully reviewed by the members of our international program committee. Based on the reviews, 24 articles were chosen for presentation at the conference. Accepted submissions were organized into 8 sessions: Ontologies and Bioinformatics; Ontologies of Physical Entities; Ontological Aspects of Artifacts and Human Resources; Methodological Aspects of Ontological Engineering; Ontology Evaluation; Ontology, Language and Social Relations; Ontological Aspects of Time and Events; and Aspects of Ontology Representation. The wide range of topics addressed in these sessions demonstrates that formal ontology is an active area of research which addresses problems ranging from theoretical questions regarding the ontology of time to applications in the sciences and engineering.

Ontologies and Bioinformatics: in the first paper in this session, entitled Probability Assignments to Dispositions in Ontologies, Adrien Barton, Anita Burgun and Régis Duvauferrier investigate the probabilistic dimension of dispositions, with a particular interest in biomedical ontologies. The authors investigate to which kinds of dispositional entities (individuals, universals, or both) a probability value can be assigned. In Maturation of Neuroscience Information Framework: An Ontology Driven Information System for Neuroscience, Fahim T. Imam and colleagues discuss the main ontology-based components of the Neuroscience Information Framework (NIF). In the context of the NIF project, the ultimate end product is a semantic search engine and knowledge discovery portal that provides federated access to a vast amount of neuroscience data and resources over the web. Finally, in Suggestions for Galaxy Workflow Design Using Semantically Annotated Services, Alok Dhamanaskar and colleagues propose an extension of the Galaxy open-source web-based framework to assist the user in the construction of service-based scientific workflows. The work is based on proposed extensions to the Ontology for Biomedical Investigations (OBI) which are intended to provide a basis for the semantic annotation of web services.

Ontologies of Physical Entities: in The Void in Hydro Ontology, Torsten Hahmann and Boyan Brodaric extend the DOLCE foundational ontology to a logical theory aimed at representing specific aspects of the physical containment of water studied in hydrology. More specifically, they address the notion of void: empty spaces that can be filled with water. In The Mysterious Appearance of Objects, Emanuele Bottazzi, Roberta Ferrario and Claudio Masolo present a constructivist approach to objects. This approach aims at making explicit how objects can be constructed from the outcome of an apparatus, be it a measurement instrument or our perceptual system, and discusses the ontological and representational problems faced by such an approach.


Finally, in Towards Making Explicit the Ontological Commitment of a Database Schema on the Geological Domain, Alda Maria Ferreira Rosa da Silva and Maria Cláudia Cavalcanti propose an approach which combines a set of reverse engineering techniques and the use of a top-level ontology as a way of making explicit the ontological commitment of a conceptual database schema. The proposal combines the OntoClean methodology and the OntoUML meta-categorization in a set of methodological guidelines aimed at producing higher-quality models to support interoperability and database integration tasks.

Ontological Aspects of Artifacts and Human Resources: in the paper entitled An Ontology for Skill and Competency Management, Maryam Fazel-Zarandi and Mark S. Fox present a formal PSL-based ontology for representing, inferring, and validating skills and competencies of human resources in a dynamic environment. In Towards a Unified Definition of Function, Riichiro Mizoguchi, Yoshinobu Kitamura and Stefano Borgo build on an existing ontological definition of artifact functions, generalizing this notion to provide a general unified definition of functions aimed at characterizing both biological organisms and technical artifacts. Finally, in Preliminaries to a Formal Ontology of Failure of Engineering Artifacts, Luca Del Frate advances a conceptual analysis of the notion of failure in engineering. The paper proposes three different notions of failure which are intended to capture practitioners' intuitions and are advocated as an important step towards the definition of a formal ontology of failure.

Methodological Aspects in Ontology Engineering: in the paper entitled A Method for Re-Engineering a Thesaurus into an Ontology, Daniel Kless and colleagues present a general method for re-engineering a standard-compliant thesaurus into an ontology by making use of top-level ontologies. In Ontology Content "At a Glance", Gökhan Coskun, Mario Rothe and Adrian Paschke present a technique to group concepts for ontology documentation by applying community detection algorithms to the graph structure of ontologies. Finally, we have in this session Interactive Semantic Feedback for Intuitive Ontology Authoring by Ronald Denaux and colleagues. Their proposal aims at increasing the efficiency and effectiveness of the ontology authoring process by providing interactive, semantic feedback that helps ontology authors to consider relevant logical consequences of their modeling inputs.

Ontology Evaluation: in Does Your Ontology Make a (Sense) Difference?, Pawel Garbacz proposes three logical criteria that an applied ontology needs to satisfy in order to satisfactorily characterize its terminology. These logical criteria correlate with graded levels of semantic indeterminacy. Next, we have two papers by A. Patrice Seyed. In the first of these papers, entitled A Method for Evaluating Ontologies – Introducing the BFO-Rigidity Decision Tree Wizard, the author proposes an integration of OntoClean's notion of rigidity with the BFO theory of types to provide a tool-supported decision tree procedure for evaluating ontologies. Moreover, in Integrating OntoClean's Notion of Unity and Identity with a Theory of Classes and Types – Towards a Method for Evaluating Ontologies, the author provides a reformulation of OntoClean's notions of identity and unity within a formal theory of classes and evaluates how the reformulations apply to BFO's theory of types. This work is aimed at making an additional contribution to ongoing efforts to build automated support for evaluating and standardizing OBO Foundry candidate ontologies.

Ontology, Language and Social Relations: in Axiomatizing Change-of-State Words, Niloofar Montazeri and Jerry R. Hobbs present a part of their program of developing core theories of fundamental commonsense phenomena. These theories are then employed to define English word senses by means of axioms using predicates explicated in these theories.


In particular, in this paper they focus on the structure of events and, more specifically, on the axiomatization of change-of-state words from the Core WordNet. In Elements for a Linguistic Ontology in the Verbal Domain, Lucia M. Tovena discusses elements of a linguistically motivated ontology and proposes a novel analysis of the philosophical notion of sortal in order to address aspects of essence and discretization of events. In Toward a Commonsense Theory of Microsociology: Interpersonal Relationships, Jerry R. Hobbs, Alicia Sagae and Suzanne Wertheim present a part of a formal ontology of microsociology (focused on small-scale social groups). The discussed part focuses on interpersonal relationships, addressing concepts such as commitments, shared plans and good will, and aims at formally characterizing relationships such as the host-guest relationship and friendship in order to support inter-cultural communication.

Ontological Aspects of Time and Events: in The Date-Time Vocabulary, Mark H. Linehan, Ed Barkmeyer and Stan Hendryx present a new OMG specification that models a foundational vocabulary of time and related notions (e.g., continuous time, discrete time, the relationship of events and situations to time, language tense and aspect, time indexicals, timetables, and schedules). The proposal offers a linguistically oriented vocabulary and ontology intended to support the specification of business rules in different business domains. In States, Processes and Events, and the Ontology of Causal Relations, Antony Galton elaborates on the difficult subject of causation by advancing aspects of an ontology of particulars. This ontology elaborates on notions such as events, states and processes (taking a particular view on the latter two), as well as different causal and causal-like relations (e.g., initiation, termination, perpetuation, enablement and prevention) holding among them. Finally, in Ontology of Time in GFO, Ringo Baumann, Frank Loebe and Heinrich Herre present a novel formal ontology of time as a part of the GFO research program. Besides presenting this formal theory, the authors revisit a number of problematic cases related to temporal representation and reasoning, and a metalogical analysis of the theory is presented (including consistency, completeness and decidability results).

Aspects of Ontology Representation: in Using Partial Automorphisms to Design Process Ontologies, Bahar Aameri proposes a methodology for the design and verification of domain-specific process ontologies that are extensions of generic process ontologies, using the notion of partial automorphism (a mapping from a model to itself which preserves some substructures of the model). In A Temporal Extension of the Hayes/ter Horst Entailment Rules and an Alternative to W3C's N-ary Relations, Hans-Ulrich Krieger proposes a novel approach that contains extended entailment rules for RDFS and the OWL Horst dialect and is designed to efficiently support the encoding of temporally changing information in OWL and RDF. Finally, in Three Semantics for the Core of the Distributed Ontology Language, Till Mossakowski, Christoph Lange and Oliver Kutz present the abstract syntax and a new kind of semantics for the meta-level constructs of DOL (the Distributed Ontology Language). A DOL ontology consists of modules formalized in existing ontology languages (e.g., OWL, Common Logic, F-Logic). The language's meta-level constructs can be employed to express different types of links between these heterogeneous ontologies.

As program chairs we would like to thank all of the authors who submitted their work and the reviewers who helped us to select the best papers from a pool of high-quality submissions.


Contents

Preface  v

Part 1. Ontologies and Bioinformatics

Probability Assignments to Dispositions in Ontologies
  Adrien Barton, Anita Burgun and Régis Duvauferrier  3

Maturation of Neuroscience Information Framework: An Ontology Driven Information System for Neuroscience
  Fahim T. Imam, Stephen Larson, Anita Bandrowski, Jeffrey S. Grethe, Amarnath Gupta and Maryann E. Martone  15

Suggestions for Galaxy Workflow Design Using Semantically Annotated Services
  Alok Dhamanaskar, Michael E. Cotterell, Jie Zheng, Jessica C. Kissinger, Christian J. Stoeckert Jr. and John A. Miller  29

Part 2. Ontologies of Physical Entities

The Void in Hydro Ontology
  Torsten Hahmann and Boyan Brodaric  45

The Mysterious Appearance of Objects
  Emanuele Bottazzi, Roberta Ferrario and Claudio Masolo  59

Towards Making Explicit the Ontological Commitment of a Database Schema on the Geological Domain
  Alda Maria Ferreira Rosa da Silva and Maria Cláudia Cavalcanti  73

Part 3. Ontological Aspects of Artifacts and Human Resources

An Ontology for Skill and Competency Management
  Maryam Fazel-Zarandi and Mark S. Fox  89

Towards a Unified Definition of Function
  Riichiro Mizoguchi, Yoshinobu Kitamura and Stefano Borgo  103

Preliminaries to a Formal Ontology of Failure of Engineering Artifacts
  Luca del Frate  117

Part 4. Methodological Aspects in Ontology Engineering

A Method for Re-Engineering a Thesaurus into an Ontology
  Daniel Kless, Ludger Jansen, Jutta Lindenthal and Jens Wiebensohn  133

Ontology Content "At a Glance"
  Gökhan Coskun, Mario Rothe and Adrian Paschke  147

Interactive Semantic Feedback for Intuitive Ontology Authoring
  Ronald Denaux, Dhaval Thakker, Vania Dimitrova and Anthony G. Cohn  160

Part 5. Ontology Evaluation

Does Your Ontology Make a (Sense) Difference?
  Pawel Garbacz  177

A Method for Evaluating Ontologies – Introducing the BFO-Rigidity Decision Tree Wizard
  A. Patrice Seyed  191

Integrating OntoClean's Notion of Unity and Identity with a Theory of Classes and Types – Towards a Method for Evaluating Ontologies
  A. Patrice Seyed  205

Part 6. Ontology, Language and Social Relations

Axiomatizing Change-of-State Words
  Niloofar Montazeri and Jerry R. Hobbs  221

Elements for a Linguistic Ontology in the Verbal Domain
  Lucia M. Tovena  235

Toward a Commonsense Theory of Microsociology: Interpersonal Relationships
  Jerry R. Hobbs, Alicia Sagae and Suzanne Wertheim  249

Part 7. Ontological Aspects of Time and Events

The Date-Time Vocabulary
  Mark H. Linehan, Ed Barkmeyer and Stan Hendryx  265

States, Processes and Events, and the Ontology of Causal Relations
  Antony Galton  279

Ontology of Time in GFO
  Ringo Baumann, Frank Loebe and Heinrich Herre  293

Part 8. Aspects of Ontology Representation

Using Partial Automorphisms to Design Process Ontologies
  Bahar Aameri  309

A Temporal Extension of the Hayes/ter Horst Entailment Rules and an Alternative to W3C's N-ary Relations
  Hans-Ulrich Krieger  323

Three Semantics for the Core of the Distributed Ontology Language
  Till Mossakowski, Christoph Lange and Oliver Kutz  337

Subject Index  353

Author Index  355


Part 1. Ontologies and Bioinformatics


Formal Ontology in Information Systems M. Donnelly and G. Guizzardi (Eds.) IOS Press, 2012 © 2012 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-61499-084-0-3


Probability assignments to dispositions in ontologies

Adrien BARTON a,b,1, Anita BURGUN a and Régis DUVAUFERRIER a
  a U936, INSERM & Université Rennes 1, Rennes, France
  b Department of Philosophy, KTH University, Stockholm, Sweden

Abstract. We investigate how probabilities can be assigned to dispositions in ontologies, building on Popper’s propensity approach. We show that if D is a disposition universal associated with a trigger T and a realization R, and d is an instance of D, then one can assign a probability to the triplets (d,T,R) and (D,T,R). These probabilities measure the causal power of dispositions, which can be defined as limits of relative frequencies of possible instances of T triggering an instance of R over a hypothetical infinite random sequence of possible instances of T satisfying certain conditions. Adopting a fallibilist methodology, these probability values can be estimated by relative frequencies in actual finite sequences. Keywords. probability, upper ontology, universal, causal power, frequency


Introduction

Probabilistic and statistical notions are ubiquitous in the medical domain. These include e.g. the prevalence of a disease in a population, the sensitivity or the specificity of a medical test, or the probability for a person to develop a disease in a given timeframe. It would therefore be valuable if ontologies aiming at adequately representing medical knowledge could formalize probabilistic notions. The OBO Foundry is to date one of the most significant attempts to build interoperable ontologies in the biomedical domain. In this context, the OGMS ontology [1] aims at supplying a general ontology for the medical domain. The question of how to represent probabilistic notions in this framework is still open. A first attempt on a related topic has been made by Röhl & Jansen [2], who analyze the non-probabilistic aspects of the notion of disposition. We will build on their work and combine it with Popper's work [3] on propensity in order to investigate the probabilistic dimension of dispositions. In particular, we will try to determine to which kinds of dispositional entities a probability can be assigned: to universals, to particulars, or to both?

1 Corresponding Author; E-mail: [email protected]


1. Dispositions and propensities

Before investigating how to formalize the concept of propensity in ontologies, let us first introduce our general ontological framework, and explain the propensity account in the field of philosophy of probability.


1.1. Realizable entities and dispositions

The OBO Foundry relies on the upper-level ontology Basic Formal Ontology (BFO), which aims at formalizing the most general concepts that domain ontologies should be based on. At the most general level, BFO recognizes two different types of entities. On one hand, there are occurrents, which are extended in time (all processes, e.g. a dinner or a movie screening, are occurrents). On the other hand, there are continuants, which are entirely present at every time they exist. These include independent continuants, which roughly correspond to what we would imagine as objects (e.g. the Earth, a bottle of wine, a molecule) or object aggregates (e.g. a flock of birds); and dependent continuants, which inhere in independent continuants (e.g. the greenness of a leaf, the shape of the Earth).

Amongst dependent continuants, BFO makes a distinction between qualities on one hand, and realizable entities on the other hand. Qualities are entities that can be described as "categorical", meaning that they are constantly realized. For example, at any instant, a ball has a color and a shape; therefore these dependent continuants are categorical properties and must be classified as qualities. By contrast, realizable entities have two different kinds of phases: actualization phases, during which they are realized through some processes; and dormancy phases, during which they still exist in their bearer but are not realized (cf. [4]). Dispositions belong to this family of realizable entities: a disposition borne by an object will lead to a given process (named here "realization") when this object is introduced into certain specific circumstances (named here "trigger"). For example, according to OGMS, the disease "epilepsy" is a disposition whose realizations are epileptic seizures; even at times when he is not undergoing any epileptic seizure, an epileptic patient is still bearing an instance of such a disposition. Let us point out that a disposition always has a categorical basis (cf. [5]) – that is, a set of categorical properties (qualities) underlying the disposition. For example, the categorical basis of the disease "epilepsy" is constituted by some anomalies in neural structures, which are going to lead to epileptic seizures when some trigger (e.g. a stressful episode, or a flashing light for photosensitive people) occurs. Finally, dispositions can be divided into sure-fire dispositions (dispositions whose triggering process leads systematically to a realization – for example the disposition for a windshield to break when it is hit by a 30-ton truck) and probabilistic dispositions (dispositions whose triggering process leads to a realization with some probability – for example the disposition for a fair coin to land on heads when it is tossed). We will investigate this second kind of dispositions in this paper.

1.2. Probabilistic dispositions and interpretation of probabilities

According to Röhl & Jansen [2], a disposition attribution has the following general structure: x has a disposition D for realization R with a trigger T with a probability p. Here, R is the realization process of the disposition, and T is its triggering process. Additionally, probability is for them simply a number between 0 and 1, and a probability attribution is a function mapping such a disposition attribution to this interval.


However, this definition leaves open two questions. First, it does not explain what it means to assign a probability to a disposition: how can we express, using only non-probabilistic concepts, the necessary and sufficient conditions for the assignment of a probability p to a disposition? This is a particular case of a classical problem in the philosophy of probability, namely the task of interpreting probabilities. Several theories have been proposed in the past, including frequentist theories (proposed by Von Mises and Reichenbach), which interpret the probability of an event as the relative frequency of this event in a hypothetical infinite sequence of trials; logicist theories (by Keynes and Carnap), which see probabilities as degrees of entailment between two propositions; subjectivist theories (by Ramsey and De Finetti), for whom a probability is a degree of belief of a rational agent in a proposition; and finally propensity theories (by Popper), which will be detailed thereafter. The second question left open by this definition is the following: are the entities present in the tuple particulars or universals? As we will see, we will have to answer the first question in order to answer the second one.

Of note, all interpretations of probability face important difficulties: see e.g. [6] and [7] for flaws in the frequentist interpretation, and see [8] for criticisms of the propensity theory. Moreover, Hansson [9] has pointed to the need for second-order probabilities, suggesting that a subjectivist interpretation of probability, though necessary, is likely not sufficient, and should be complemented with an objectivist interpretation of probability like a frequentist or propensity theory. Of these two, propensity theories appear to be more solid than frequentist ones (cf. [7], [10]). The underlying realist philosophy of BFO also naturally invites a propensity approach to probability; indeed, Röhl & Jansen's work [2] on sure-fire dispositions can be extended to probabilistic dispositions along the lines of this propensity theory. So let us briefly introduce this propensity approach as it has been developed in the philosophical literature, before trying to adapt it to the framework of ontologies.
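To fix ideas, such a disposition attribution can be pictured as a small data model. The following is only an illustrative sketch under our own, hypothetical naming (none of the classes or fields below are part of BFO, OGMS, or Röhl & Jansen's formalization); it records a bearer x, a disposition universal D, a trigger universal T, a realization universal R, and an optional probability p:

```python
# Illustrative sketch only; hypothetical names, not part of BFO, OGMS or Röhl & Jansen's work.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Universal:
    label: str   # e.g. "coin toss" (trigger universal T) or "landing on heads" (realization universal R)

@dataclass(frozen=True)
class DispositionAttribution:
    bearer: str                           # the independent continuant x bearing the disposition
    disposition: Universal                # D
    trigger: Universal                    # T
    realization: Universal                # R
    probability: Optional[float] = None   # p in [0, 1]; None when no probability is attributed

    def is_probabilistic(self) -> bool:
        # Probabilistic dispositions realize with some probability strictly below 1;
        # sure-fire dispositions realize whenever triggered.
        return self.probability is not None and self.probability < 1.0

# Example: a fair coin's disposition to land on heads when tossed, with probability 1/2.
fair_coin = DispositionAttribution(
    bearer="this fair coin",
    disposition=Universal("disposition to land on heads when tossed"),
    trigger=Universal("coin toss"),
    realization=Universal("landing on heads"),
    probability=0.5,
)
print(fair_coin.is_probabilistic())   # True
```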


1.3. Two analyses of propensity

Popper [3] has proposed the following account of propensity (refined thereafter in particular by Mellor [11] and Williamson [12]): repeatable experimental conditions (named "test") C are endowed with a disposition (named "propensity") to produce infinite hypothetical sequences of events amongst which the limit of relative frequencies of an event E would be equal to the value of the probability of E given C. For example, according to this account, an experiment of coin tosses of a symmetrical coin is endowed with a propensity which is realized when a hypothetical infinite sequence of tosses happens, by leading to a relative frequency of results "heads" of ½. According to Popper's account, probabilities are always conditional and there are no probabilities simpliciter: it does not make sense to speak of the probability of E; one can only deal with the probability of E given C. This should not be seen as a weakness of Popper's account: it may actually be a common feature shared with all other viable approaches to probability (see [13]). Let us call "propensity1" such a disposition. It is important to understand that the trigger of the propensity1 is not C, but an infinite repetition of experimental conditions C; and its realization is not E, but E happening with a given limit of relative frequencies. Also, one should note that according to this account, it is certain that the event E will happen with a given limit of relative frequency if C is repeated an infinite number of times.


Therefore, the propensity1 is not a probabilistic disposition, but a sure-fire disposition.

However, this account can seem problematic. As a matter of fact, if all propensities were realized only during hypothetical infinite sequences of tests, there would be no point in representing them in ontologies. Indeed, in real life, we are generally not interested in hypothetical infinite sequences of tests (such as an infinite hypothetical sequence of coin tosses), but in actual and finite sequences of tests (such as a finite sequence of coin tosses). This problem can however be easily overcome. As a matter of fact, like any dispositional property, propensities1 are associated with a categorical basis – that is, a set of categorical properties that underlie the disposition. For example, the propensity1 of a coin to fall on heads is associated with a categorical basis composed of some symmetry properties of the coin. But this categorical basis is also the bearer of another dispositional property that we will name here "propensity2", whose trigger is not a hypothetical infinite sequence of repetitions of C (as it was for the propensity1), but a unique test C. In the coin toss example, the symmetry properties of the coin will have a causal influence not only during an infinite hypothetical sequence of tosses, but also during a unique coin toss. The latter causal influence reveals a disposition to fall on heads after a unique toss, which is a propensity2. This is not a sure-fire disposition, but a probabilistic disposition. As a historical note, Popper's account of propensity actually evolved during his life from a propensity1 theory to a propensity2 theory – although he never properly differentiated these two interpretations, and occasionally switched between one and the other without mentioning it (cf. [10]). These two accounts should however not be seen as rivals, but as complementary. In a nutshell, we defend here the thesis that for every propensity1, there is an associated propensity2 (and vice versa) such that 1) a propensity1 and its associated propensity2 have the same categorical basis, and 2) the trigger of a propensity1 associated with C is an infinite sequence of repetitions of C, whereas the trigger of the propensity2 associated with C is a single instance of C.

We will here be interested mainly in propensities2 rather than in propensities1, as they are the ones which are realized (repeatedly) in the finite and actual sequences of tests that we normally encounter. Does it mean that propensities1 are of no use at all? This is not the case: we actually need them, because of some insufficiencies of the propensity2 account. As a matter of fact, a propensity2 for an event E will have a causal influence (also named "causal power") on the realization of this event when a test C happens; and it would be desirable to define probability as the intensity of this causal power. However, to our knowledge, there is currently no theory of causal powers giving a direct interpretation (using only non-probabilistic concepts) of such a probability by referring to only one test C (as expressed by Eagle [8]: "No account of partial causation has ever quantified the part-cause-of relation in the way that is required for probability."). Still, we can define the probability of a propensity2 through the associated propensity1, in the following way. Let us write P2 for a propensity2 for an event E and a test C, and let us consider P1 the associated propensity1 that will be realized with E happening with a limit of relative frequencies p over a hypothetical infinite sequence of repetitions of C. Then we can simply define the intensity of P2 as having this value p (see [11] for a related strategy). For example, according to this approach, a coin has a propensity2 of intensity ½ to fall on heads on a unique toss if and only if it would fall on heads with a relative frequency ½ in an infinite hypothetical sequence of tosses. Therefore, this value ½ characterizes not only the relative frequency in a hypothetical infinite repetition of coin tosses, when the propensity1 is realized, but also the causal power of the propensity2 in a unique toss.


This provides us with a first insight into the ontology of probabilistic dispositions. We now have to specify this account in the framework of the BFO ontology, by adapting the concept of propensity2 to this framework, and by introducing the distinction between universals and particulars.
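The relation between the two notions can be illustrated with a small simulation (our own sketch, not part of the paper): the per-toss chance of ½ stands for the intensity of the propensity2, and the growing runs stand for longer and longer initial segments of the hypothetical sequence whose limit realizes the propensity1.

```python
# Illustrative simulation only: a propensity-2 of intensity 1/2 (a fair coin's disposition
# to land on heads on a single toss), observed through relative frequencies over runs of
# increasing length; only the hypothetical infinite sequence would realize the propensity-1 exactly.
import random

def relative_frequency_of_heads(n_tosses: int, p_heads: float = 0.5, seed: int = 0) -> float:
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n_tosses) if rng.random() < p_heads)
    return heads / n_tosses

for n in (10, 100, 10_000, 1_000_000):
    print(n, relative_frequency_of_heads(n))
# The printed frequencies drift toward 0.5 as n grows.
```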


2. Ontologies and probabilistic dispositions

We will use here the analysis of dispositions, bearer, trigger and realization proposed by Röhl & Jansen [2], which introduces the following relations between particulars: has_bearer relates a particular disposition with its (particular) bearer; has_realization relates a (particular) disposition with its realization; has_triggerD relates a disposition to its trigger; and has_triggerR relates a realization to its trigger. Röhl & Jansen also introduce the following relations between universals: has_bearer, has_realization, has_triggerD and has_triggerR (here we adopt the usual convention of writing in bold the relations for which at least one of the relata is a particular, and writing in italic the relations that relate only universals).

Röhl & Jansen's analysis is restricted to sure-fire dispositions. If the triggering process of a sure-fire disposition happens, then its realization process also happens: this is expressed through Röhl & Jansen's so-called "realization principle". However, the triggering process of a probabilistic disposition can happen without its realization happening. Therefore, the realization principle is not verified for probabilistic dispositions. Neither is Röhl & Jansen's following axiom: d has_triggerD t ⇒ ∃r (d has_realization r ∧ r has_triggerR t), for the same reason. Instead, the following weaker axiom holds true for probabilistic dispositions: ∃r (d has_realization r ∧ r has_triggerR t) ⇒ d has_triggerD t.

In the most general case, we want to assign probabilities to triplets of the form (disposition, trigger, realization). We now have to investigate whether the entities that appear in such a triplet are universals or particulars. For this, let us introduce D a disposition universal, X an independent continuant universal such that D has_bearer X, T an occurrent universal such that D has_triggerD T, and R an occurrent universal such that D has_realization R. Let us consider also a particular disposition d such that d instance_of D, x an independent continuant such that x instance_of X and d has_bearer x, t a process such that d has_triggerD t (and therefore t instance_of T), and r a process such that d has_realization r (and therefore r instance_of R) and r has_triggerR t. We will show that we can assign a probability to two different kinds of triplets: (d,T,R) and (D,T,R). In order to show this, we have to find necessary and sufficient conditions for statements like "(d,T,R) has a probability p" or "(D,T,R) has a probability p", conditions which should not mention any probabilistic concept. That is, we need to reduce probabilistic assignments to non-probabilistic statements.
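The weaker axiom can be illustrated with a small check over a finite record of particulars (our own sketch; the encoding of the relations as sets of identifiers is hypothetical and not part of Röhl & Jansen's formalization or of BFO):

```python
# Illustrative sketch only (hypothetical encoding): checking, over a finite record of
# particulars, the weaker axiom for probabilistic dispositions,
#   ∃r (d has_realization r ∧ r has_triggerR t)  ⇒  d has_triggerD t.
from typing import NamedTuple, Set, Tuple

class DispositionRecord(NamedTuple):
    triggers_of_d: Set[str]              # all t with d has_triggerD t
    realizations_of_d: Set[str]          # all r with d has_realization r
    triggered_by: Set[Tuple[str, str]]   # pairs (r, t) with r has_triggerR t

def weaker_axiom_holds(rec: DispositionRecord) -> bool:
    # Every trigger of a recorded realization of d must itself be a trigger of d.
    return all(
        t in rec.triggers_of_d
        for (r, t) in rec.triggered_by
        if r in rec.realizations_of_d
    )

# Example: two triggers t1, t2 of the disposition d; only t1 led to a realization r1.
rec = DispositionRecord(
    triggers_of_d={"t1", "t2"},
    realizations_of_d={"r1"},
    triggered_by={("r1", "t1")},
)
print(weaker_axiom_holds(rec))   # True; note that t2 triggered d without any realization,
                                 # which is exactly what the realization principle would forbid.
```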


3. Assignment of probability to a triplet (d,T,R)


3.1. Definition of probability of (d,T,R)

In order to illustrate the meaning of a statement like "(d,T,R) has a probability p", let us consider a particular case. Let us name T' the universal process whose instances are sets of fifty white light flashes emitted by a 100-Watt bulb at a frequency of 10 Hz, seen by a person at a distance of one meter (any instance of such a repetition of fifty flashes will be abbreviated thereafter under the name 'flashing light'); R' the universal "epileptic seizure"; and D' the universal disposition such that D' has_triggerD T' and D' has_realization R' (that is, D' is the universal disposition of having an epileptic seizure while seeing a flashing light). Finally, let us write x' for the particular Mr. Dupont, a photosensitive epileptic patient, and d' for the instance of D' such that d' has_bearer x' (that is, d' is the disposition of Mr. Dupont to have an epileptic seizure when seeing a flashing light). The disposition d' is associated with a categorical basis constituted by some properties of neural structures of Mr. Dupont. The probability p' assigned to (d',T',R') should then measure the causal power of these neural anomalies in triggering an epileptic seizure of Mr. Dupont during a flashing light.

This specific disposition d' is a high-level disposition; it is likely that there are many lower-level dispositions being triggered in Mr. Dupont's brain which keep on manifesting until a particular threshold is reached and Mr. Dupont has a fit (see [14] for a theory of dispositions along these lines). However, an ontology of medicine would focus on such a high-level disposition, not on the lower-level dispositions underlying it (unless its granularity were pushed to the neurobiological level). Therefore, we have to find a way to define the causal power assigned to the triplet (d',T',R'). Such a causal power can be defined along the lines mentioned in section 1, using the fact that universals are repeatable (that is, they can be instantiated by several particulars): p' equals 0.2 if and only if, in every random hypothetical infinite sequence of flashing lights perceived by Mr. Dupont, the limit of relative frequency of situations causing an epileptic seizure is 0.2.

It seems that BFO does not have enough expressive power for these concepts. As a matter of fact, a hypothetical infinite sequence is composed (at least in part) of possible, non-actual entities, whereas BFO only recognizes actual particulars. In order to circumvent this problem, we will consider here an extension of BFO that also recognizes possible, non-actual particulars. The difficulties raised by this extension are beyond the scope of this article; and we do not claim that BFO should be permanently extended in this way (cf. our discussion in the Conclusion section). Here, we are just concerned with finding the necessary concepts to rephrase probability assignments as non-probabilistic statements, in order to determine to which kind of entities one can assign probabilities. In the remainder of this article, we will accept that particulars can be either actual or possible entities.

Let us name "sequence generated by (d,T)" an infinite sequence of possible particulars which are instances of T and triggers of d. That is, if G is a sequence generated by (d,T), one can write G = (t1, t2, …, tn, …) where: ∀i∈ℕ, ti instance_of T ∧ d has_triggerD ti. Let us now write: GnR = (ti | i∈[1,n], ∃ri (ri instance_of R ∧ d has_realization ri ∧ ri has_triggerR ti)).
That is, GnR is the subsequence, amongst the n first elements of G, of the processes that will trigger a realization of d.


Let us elaborate on our former epilepsy example. In this case, a sequence G' generated by (d',T') is a hypothetical infinite sequence of flashing lights perceived by Mr. Dupont, and G'nR' is the subsequence, amongst the n first elements of G', of the flashing lights that would cause an epileptic seizure in Mr. Dupont. The propensity theory introduces several conditions (cf. [12]), which we can formulate in our framework in the following way; let Z be a set of sequences generated by (d,T), then:
• Z satisfies the convergence condition iff for any sequence G of Z, Card(GnR)/n has a finite limit as n tends to +∞.
• Z satisfies the independence condition iff for any sequences G1 and G2 of Z: limn→+∞[Card(G1nR)/n] = limn→+∞[Card(G2nR)/n].
• Z satisfies the condition of Von Mises-Church randomness iff for any sequence G of Z, if a subsequence G° of G is extracted by a recursive place selection function, then limn→+∞[Card(G°nR)/n] = limn→+∞[Card(GnR)/n] (see [16] for more details; this condition is introduced to exclude sequences which are not random, e.g. a perfectly regular alternation of coin tosses that lead respectively to heads and tails).

Unfortunately, the set of all sequences generated by (d,T) will not satisfy these three conditions. For example, if a fair coin were tossed an infinite number of times, it would typically fall on heads with a limiting relative frequency of ½. However, it is possible (although highly non-typical) that it would fall on heads at every single toss of this infinite sequence; and of course, the relative frequency of the result 'heads' obtained in this sequence (namely, 1) is not the correct value of the probability (which is ½) (see [15] for a discussion of this point). Defining precisely what a "typical" sequence is remains a challenge for this kind of propensity interpretation of probability, a challenge that we will not tackle here (let us just remark that using the strong law of large numbers to solve this problem would not work, or at least not directly and easily; see e.g. [8] on this point). One possible solution that has been considered would involve using Lewis-Stalnaker semantics for counterfactuals, arguing that such non-typical sequences would not occur in any of the nearest possible worlds in which a fair coin is tossed infinitely many times (see [16]). Here, we will just accept without more discussion this notion of "typical" sequence generated by (d,T). As we said, all major interpretations of probability face important difficulties; defining typical sequences is precisely the major difficulty bearing on propensity interpretations. Our purpose here is not to solve this perennial problem, but to adapt the propensity interpretation to the framework of applied ontologies.

Then, if the set of typical sequences generated by (d,T) satisfies the three conditions of convergence, independence and randomness, one can define a probability assignment in the following way: (d,T,R) has a probability p if and only if for every typical sequence G generated by (d,T), limn→+∞ Card(GnR)/n = p.

3.2. Determination of probability of (d,T,R)

This concept of probability thus clarified, we now face another question: how can we determine the value of the probability associated with (d,T,R)? As a matter of fact, in practice, we never have direct access to hypothetical infinite sequences of events.
The answer to this problem is simple: although the probabilities are defined as limits of relative frequencies in infinite hypothetical sequences, we can estimate their values through relative frequencies in actual finite sequences.


Let us call "finite sequence associated with (d,T)" a finite sequence of actual instances of T which are triggers of d. Let us assume that we have recorded such a finite sequence G* = (t1, t2, …, tN) where: ∀i∈[1,N], ti instance_of T ∧ d has_triggerD ti.


Let us then define G*R = (ti | i∈[1,N], ∃ri (ri instance_of R ∧ d has_realization ri ∧ ri has_triggerR ti)). Then the value Card(G*R)/N will provide an estimate of the probability associated with (d,T,R). For example, if we have a recording of 27 episodes of Mr. Dupont perceiving a flashing light (these episodes being separated enough in time so that there are no cumulative effects), and 6 of these 27 episodes led to an epileptic seizure, then 6/27 is an estimate of the value of the probability associated with the disposition of Mr. Dupont to undergo an epileptic seizure while perceiving a flashing light. Of course, the larger our sample, the more confident we can be in our probability estimate; and statistical tests can evaluate the quality of the estimation. Moreover, we know that the relative frequency in a finite sample is certainly at least slightly different from the real probability value (even if the sample is large). However, it is a potentially reliable estimate of the probability value. One has to remember here that ontologies do not claim to be true representations of the world: the methodology underlying them is fallibilist (cf. [17]). That is, ontologies may not be true, but represent our best estimation of reality. Therefore, probability estimates could figure in them, even if these values are probably slightly different from the real probability values.

One should note, however, that this method of estimating probabilities through finite sequences is not always available. Remember the case of Mr. Dupont and his epileptic seizures. It may be the case that we cannot record several flashing lights perceived by Mr. Dupont, and therefore that we cannot record the relative frequency of the flashing lights that lead to an epileptic seizure. It is more likely that we will have to rely on medical data obtained from a sample of several photosensitive epileptic patients, not only on Mr. Dupont. This requires the formalization of another kind of probability, associated not with a triplet of the kind (d,T,R), but with one of the kind (D,T,R).
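The finite-sequence estimate can be computed directly; the sketch below (ours, not the paper's) also adds a rough 95% normal-approximation interval as one simple example of the statistical evaluation mentioned above:

```python
# Illustrative sketch only: the estimate Card(G*R)/N of the probability of (d,T,R),
# with a rough 95% normal-approximation interval (one simple choice among many;
# it is crude for small samples).
from math import sqrt

def estimate_probability(n_realized: int, n_trials: int) -> tuple[float, float, float]:
    p_hat = n_realized / n_trials
    half_width = 1.96 * sqrt(p_hat * (1.0 - p_hat) / n_trials)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Mr. Dupont: 6 seizures observed over 27 recorded flashing-light episodes.
print(estimate_probability(6, 27))   # approximately (0.222, 0.065, 0.379)
```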

4. Assignment of probability to a triplet (D,T,R)

4.1. Definition of the probability of (D,T,R)

We have defined above the probability that a particular photosensitive epileptic patient (for example Mr. Dupont) undergoes an epileptic seizure when seeing a flashing light. We will now propose a definition of the probability that a non-specified photosensitive epileptic patient undergoes an epileptic seizure when seeing a flashing light; that is, the probability will be associated with a triplet containing a universal disposition D' borne by the universal photosensitive epileptic patient X', rather than with a triplet containing a particular disposition d' borne by a particular photosensitive epileptic patient x' like Mr. Dupont. The method will be similar to the former one. First, let us call "sequence generated by (D,T)" an infinite hypothetical sequence of couples of instances of D and T, such that in every couple the first element (the instance of D) has as trigger the second one (the instance of T). That is, if H is a sequence generated by (D,T), one can write H = ((d1,t1), (d2,t2), …, (dn,tn), …) where: ∀i∈ℕ, di instance_of D ∧ ti instance_of T ∧ di has_triggerD ti.


Let us now write HnR = ((di,ti) | i∈[1,n], ∃ri (ri instance_of R ∧ di has_realization ri ∧ ri has_triggerR ti)). That is, HnR is the subsequence, amongst the n first elements of H, of the pairs (di,ti) whose processes ti trigger a realization of the disposition di. Let us now assume that the same three conditions as before (convergence, independence, randomness) are verified by the set of typical sequences generated by (D,T). Then one can define an assignment of probability in a similar way: (D,T,R) has a probability p if and only if, for every typical sequence H generated by (D,T), limn→+∞ Card(HnR)/n = p. Elaborating on our epilepsy example, if H' is a sequence generated by (D',T'), H' is an infinite sequence of couples of instances <epileptic disposition borne by a photosensitive patient, flashing light perceived by this patient>; and H'nR is the subsequence, amongst the n first elements of H', of the pairs (di,ti) such that the flashing light ti causes an epileptic seizure ri in the patient bearing the disposition di. The probability p' associated with (D',T',R') will then be defined as the limit of relative frequencies of flashing lights leading to epileptic seizures over a hypothetical random infinite sequence of flashing lights undergone by photosensitive epileptic individuals.


4.2. Determination of the probability of (D,T,R)

This being defined, the same question as before reappears: how can we practically evaluate the probability assigned to a triplet (D,T,R)? The answer will be similar to the one we gave before: actual finite relative frequencies will provide estimates of the limits of infinite hypothetical relative frequencies. For example, in the epilepsy case, in order to estimate the probability that a non-specified photosensitive epileptic patient will have an epileptic seizure when seeing a flashing light, we will estimate the relative frequency of epileptic seizures that occurred in a finite sample of flashing lights perceived by a finite sample of photosensitive epileptic patients. More generally, let us call "finite sequence associated with (D,T)" a finite sequence of couples of instances of D and T, such that in every couple the first element (the instance of D) has as a trigger the second element (the instance of T). That is, if H* is a finite sequence associated with (D,T), one can write H* = ((d1,t1), (d2,t2), …, (dN,tN)) where: ∀i∈[1,N], di instance_of D ∧ ti instance_of T ∧ di has_triggerD ti. Let us define: H*R = ((di,ti) | i∈[1,N], ∃ri (ri instance_of R ∧ di has_realization ri ∧ ri has_triggerR ti)). Then Card(H*R)/N will provide an estimate of the probability associated with (D,T,R). For example, if we have registered 954 flashing lights undergone by different photosensitive epileptic patients, and 113 of these 954 situations have led to an epileptic seizure, then an estimate of the probability that a non-specified photosensitive epileptic patient has an epileptic seizure during a flashing light would be 113/954. Here again, such an estimate may not provide the true value of the associated probability, but it fits in a fallibilist representation of reality.

4.3. Use of the probability of (D,T,R) in order to estimate the probability of (d,T,R)

As we said, it is sometimes not possible to obtain a finite sequence associated with (d,T,R), and hence to obtain an estimate of the probability of (d,T,R) as indicated in 3.2. In this case, we can try to estimate, if it is available, the probability of (D,T,R), where D is a universal instantiated by d (i.e. d instance_of D).


As a matter of fact, since d is an instance of D, the categorical basis of d may have a causal power similar (to some extent) to the causal power of the categorical basis of D; therefore, the estimate of the probability of (D,T,R) may provide us with an approximation of the probability of (d,T,R). Of course, the more specific the universal D, the better the probability associated with (D,T,R) will approximate the probability associated with (d,T,R). For example, the probability associated with the disposition borne by the universal of photosensitive epileptic patients to have a seizure when seeing a flashing light will provide an estimate of the probability that Mr. Dupont has a seizure when seeing a flashing light; but if we know that Mr. Dupont is a 46-year-old male, then the probability associated with the disposition borne by the universal (or defined class) of male photosensitive epileptic patients who are between 40 and 50 years old will presumably provide an even better estimate of the probability associated with the disposition borne by Mr. Dupont. This is a version of the "principle of the narrowest reference class" due to Reichenbach [18], who proposed to "proceed by considering the narrowest class for which reliable statistics can be compiled" (see also [10]).
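Reichenbach's principle can be operationalized in a very simple way. In the sketch below (ours, with invented counts except for the 113/954 figure used above, and an arbitrary reliability threshold), the narrowest class containing Mr. Dupont that still has enough recorded trials is selected:

```python
# Illustrative sketch only: principle of the narrowest reference class, with "reliable
# statistics" crudely operationalized as a minimum number of recorded trials.
# All counts except 113/954 are invented for the example; the threshold is arbitrary.
from typing import List, Tuple

# (class description, specificity rank, observed realizations, observed trials)
reference_classes: List[Tuple[str, int, int, int]] = [
    ("photosensitive epileptic patient",                   1, 113, 954),
    ("male photosensitive epileptic patient",              2,  60, 480),
    ("male photosensitive epileptic patient aged 40-50",   3,   9,  40),
    ("46-year-old male photosensitive epileptic patient",  4,   1,   3),
]

def narrowest_reliable_estimate(min_trials: int = 30) -> Tuple[str, float]:
    reliable = [c for c in reference_classes if c[3] >= min_trials]
    name, _, realized, trials = max(reliable, key=lambda c: c[1])   # most specific reliable class
    return name, realized / trials

print(narrowest_reliable_estimate())
# -> ('male photosensitive epileptic patient aged 40-50', 0.225)
```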


5. Conclusion

Let us now summarize. The probability of a disposition measures the intensity of the causal power of the categorical basis of this disposition in the realization of a given process, when a given kind of triggering process happens. The value of this causal power can be identified with the limit of relative frequencies, over an infinite hypothetical repetition of possible instances of triggering processes, of those processes that trigger an instance of a realization process. Relative frequencies obtained in a finite sequence of instances of the triggering process provide estimates of the values of these probabilities. Finally, the value of the probability associated with a triplet of the kind (D,T,R) may approximate the value of the probability associated with a triplet of the kind (d,T,R), where d instance_of D.

We have shown that representing a hypothetical infinite sequence of instances of a triggering process requires considering not only actual, but also possible, non-actual instances. Extending BFO to include possible instances would certainly have very significant consequences. Fortunately, we do not need to change BFO in such a way. We have shown here how, with such a change, one could define an attribution of probabilities to triplets of the kind (d,T,R) or (D,T,R). This being shown, we can now work with the classical version of BFO, restricted to actual entities, and introduce probability assignments as a primitive operation on triplets of the kind (d,T,R) or (D,T,R). It was important, though, to investigate the foundations of probability assignments, in order to determine to which kind of triplets we can assign a probability. Without such an investigation, the meaning of the notion of probability would have remained unclear, and it would therefore have been unclear to which kind of entities (particulars or universals) we can assign probabilities.2

This leads us now to three questions that exceed the scope of this article. First, how should probabilistic entities and probability values be represented in the ontology?

2 We could also wish to assign probabilities not to a particular individual or to a universal of individuals, but to a particular population or to a universal of populations. This raises some new questions that have been discussed by Eells [15] (pp. 45-55).


Second, how should probability assignments to triplets of the kind (d,T,R) or (D,T,R) be represented in the framework of ontologies, which accept only binary relations? (And how should they be made in artifacts representing ontologies, like OWL or OBO files?) This problem also appears for surefire dispositions, and it has been partially investigated by Röhl & Jansen [2]. Further investigations concerning both surefire and probabilistic dispositions will be needed. And third, at which probability threshold should we assert a disposition in an ontology? Since there is no objective threshold guided by physical reality, such a threshold should be chosen by a principle of relevance: some dispositions are so weak that there is no practical interest in representing them in an ontology. But this threshold of relevance can be expected to depend, amongst other factors, on the field under consideration (biology, medicine, engineering, etc.) and on the goal of the ontology (for example, do we want to avoid type I errors - false positives - or type II errors - false negatives - when using the ontology? Cf. [19] for an introduction to this problem).

Finally, let us notice that this analysis deals only with probabilities associated with dispositional entities, which are inherently causal. In the medical domain, that would include, for example: the probability of getting a disease in some given circumstances; the sensitivity of a test (i.e., the probability of a positive test result if one has the disease); or its specificity (i.e., the probability of a negative test result if one does not have the disease). However, this account does not apply to evidential, non-causal probabilities, like the positive predictive value of a test (i.e., the probability of having the disease if one tests positive) or the negative predictive value of a test (i.e., the probability of not having the disease if one tests negative). In such situations, the probability characterizes the evidential strength of some evidence (for example, a positive or negative test) in determining whether an event (for example, having contracted the disease) happened in the past. Such probabilities are epistemic and do not characterize the strength of a disposition (although their values can be constrained by probability values associated with some related dispositions). Their formalization in the framework of ontologies is therefore a totally different task, and needs to be investigated in future work3.
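As an illustration of how such evidential values are constrained by the dispositional ones (in standard notation that is not part of the authors' framework), Bayes' theorem links the predictive values to sensitivity Se, specificity Sp and disease prevalence p:

\[
\mathrm{PPV} = \frac{Se \cdot p}{Se \cdot p + (1 - Sp)(1 - p)}, \qquad
\mathrm{NPV} = \frac{Sp\,(1 - p)}{Sp\,(1 - p) + (1 - Se)\,p}.
\]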

Acknowledgments

Barton's contribution was funded in part by a research grant from the Swedish Institute. We would like to thank Barry Smith and the audience at the philosophy seminar of Buffalo University (USA), where a preliminary version of this paper was presented, as well as the audience at the philosophy seminar of KTH University (Sweden). We would also like to thank two anonymous referees for their valuable input and comments.

3 Such a formalization may also make it possible to answer a problem left open by the present article, namely how to account for the probability of a disposition when the triggering process can vary in intensity. Here, for example, we have been interested in the probability of undergoing an epileptic seizure after a set of fifty light flashes whose frequency was 10 Hz. But how should the probability be formalized if the frequency of the flashing light is not specified, and we just know that it is between 1 and 20 Hz? One solution (restricting ourselves to integer values of possible light frequencies) could be the following. First, one should determine the objective probabilities of undergoing an epileptic seizure after a set of fifty light flashes of 1 Hz; of 2 Hz; of 3 Hz; ...; of 20 Hz. Then, one should determine the subjective probabilities that the particular flashing light we are interested in has a frequency of 1 Hz, 2 Hz, 3 Hz, ..., 20 Hz. Finally, one should combine these objective and subjective probabilities in order to compute the resulting probability that the flashing light whose frequency is not specified will cause an epileptic fit.
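Read as a law-of-total-probability computation (our rendering, not the authors' notation), this combination would be

\[
P(\text{seizure}) \;=\; \sum_{f=1}^{20} P_{\mathrm{subj}}(F = f)\; P_{\mathrm{obj}}(\text{seizure} \mid F = f),
\]

where F denotes the (integer-valued) frequency of the flashing light.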


References

[1] R. Scheuermann, W. Ceusters, and B. Smith, Toward an Ontological Treatment of Disease and Diagnosis, Proceedings of the 2009 AMIA Summit on Translational Bioinformatics (2009), 116-120.
[2] J. Röhl and L. Jansen, Representing Dispositions, Journal of Biomedical Semantics 2(Suppl 4) (2011), S4.
[3] K. Popper, A Propensity Interpretation of Probability, British Journal for the Philosophy of Science 10 (1959), 25-42.
[4] R. Arp and B. Smith, Function, Role, and Disposition in Basic Formal Ontology, Nature Precedings 1941.1 (2008), 1-4.
[5] E. Prior, R. Pargetter, and F. Jackson, Three Theses about Dispositions, American Philosophical Quarterly 19 (1982), 251-257.
[6] A. Hájek, Mises Redux - Redux: Fifteen Arguments Against Finite Frequentism, Erkenntnis 45 (1997), 209-227.
[7] A. Hájek, Fifteen Arguments Against Hypothetical Frequentism, Erkenntnis 70 (2009), 211-235.
[8] A. Eagle, Twenty-One Arguments Against Propensity Analyses of Probability, Erkenntnis 60 (2004), 371-416.
[9] S. O. Hansson, Do We Need Second-Order Probabilities?, Dialectica 62 (2008), 525-533.
[10] D. Gillies, Philosophical Theories of Probability, Routledge, London, 2000.
[11] D. H. Mellor, Probability: A Philosophical Introduction, Routledge, London, 2005.
[12] J. Williamson, Philosophies of Probability, in A. Irvine (Ed.), Handbook of the Philosophy of Mathematics (pp. 493-533), North-Holland, Amsterdam, 2009.
[13] A. Hájek, The Reference Class Problem Is Your Problem Too, Synthese 156 (2007), 563-585.
[14] S. Mumford and R. L. Anjum, A Powerful Theory of Causation, in A. Marmodoro (Ed.), The Metaphysics of Powers (pp. 143-159), Routledge, London, 2010.
[15] E. Eells, Probabilistic Causality, Cambridge University Press, Cambridge, U.K., 1991.
[16] A. Eagle, Chance versus Randomness, in E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Spring 2012 Edition), forthcoming.
[17] P. Grenon and B. Smith, SNAP and SPAN: Towards Dynamic Spatial Ontology, Spatial Cognition and Computation 4.1 (2004), 69-103.
[18] H. Reichenbach, The Theory of Probability, University of California Press, Berkeley, 1949.
[19] S. O. Hansson, Risk, in E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Fall 2011 Edition).


Formal Ontology in Information Systems M. Donnelly and G. Guizzardi (Eds.) IOS Press, 2012 © 2012 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-61499-084-0-15


Maturation of Neuroscience Information Framework: An Ontology Driven Information System for Neuroscience


Fahim T. IMAM, Stephen LARSON, Anita BANDROWSKI, Jeffrey S. GRETHE, Amarnath GUPTA, and Maryann E. MARTONE
University of California, San Diego, United States

Abstract. The number of available neuroscience resources (databases, tools, materials and networks) on the web has expanded, and continues to expand, particularly in light of newly implemented data sharing policies required by funding agencies and journals. However, the nature of dense, multi-faceted neuroscience data and the design of classic search engine systems make efficient, reliable, and relevant discovery of such resources a significant challenge. This challenge is especially pertinent for online databases, whose dynamic content is largely opaque to contemporary search engines. The Neuroscience Information Framework1 (NIF) was initiated to address this problem of finding and utilizing neuroscience-relevant resources. The NIF provides simultaneous, concept-based search across multiple data sources, allowing neuroscientists to connect with available resources, including the deep content of experimental data in online databases. Searching the NIF portal is semantically enhanced through the utilization of a comprehensive ontology, the Neuroscience Information Framework Standard (NIFSTD), developed internally and with community involvement at NeuroLex.org. The NIFSTD also provides the foundation for a standard semantic resource description framework to facilitate navigation across resources, as well as integration and interoperability of available neuroscience data. Since the first production release in 2008, NIF has grown significantly in content and functionality, particularly with respect to the ontologies and ontology services driving the system. This paper presents NIF as a comprehensive information system that is decisively driven by its ontologies.

Keywords. Ontology, semantic search, neuroscience ontologies

Introduction

An initiative of the NIH Blueprint for Neuroscience Research, the Neuroscience Information Framework1 (NIF) project aims to advance neuroscience by enabling discovery of and access to public research data and tools worldwide through an open-source, semantically enhanced networked environment. The ultimate end product of NIF is a semantic search engine and knowledge discovery portal that provides federated access to a vast amount of neuroscience data and resources over the web. Now entering its fourth year of production release, NIF has matured into the largest source of neuroscience-relevant information on the web.

1 The Neuroscience Information Framework (NIF), http://neuinfo.org


One of the critical components of the overall NIF system, the NIF Standardized Ontologies (NIFSTD) [1] provide a comprehensive collection of neuroscience-relevant concepts (60,000+) along with their synonyms and conceptual relationships. NIFSTD covers major domains in neuroscience, including diseases, brain anatomy, cell types, subcellular anatomy, small molecules, techniques and resource descriptors. The conceptual knowledge models defined within the NIFSTD ontologies are materialized through an ontological query processing engine called OntoQuest [2, 6], which enables effective concept-based search over heterogeneous types of web-accessible information entities in the NIF production system. Unlike traditional, generic search engines like Google, NIF provides powerful domain-specific query processing mechanisms that allow query strings to be matched against their ontological semantics rather than their exact lexical forms. In this paper we present the integral components of the NIF system (the NIFSTD ontologies, the NeuroLex Wiki, and the OntoQuest system) and how they are utilized to enhance the overall NIF platform.

1. NIF System

The core objective of NIF was to address the problem of finding neuroscience-relevant resources. NIF provides simultaneous search across multiple information sources to connect neuroscientists to available resources. These sources include: (1) NIF Registry: a human-curated registry of neuroscience-relevant resources; (2) NIF Literature: a collection of neuroscience-relevant corpora; (3) NIF Database Federation: a federation of independent databases registered to the NIF, allowing for direct search and discovery of database content, often referred to as the "hidden web". Semantic search through the NIF portal is enhanced through the utilization of NIFSTD. OntoQuest enhances the search by providing ontology-based query formulation, source selection, term expansion, and, finally, improved ranking of the search results.


1.1. NIF Data Federation

Although representing web resources through RDF-like resource description formalisms is increasingly common practice for achieving interoperability, the vast majority of resources still reside in relational databases, or simply in HTML documents. Through the NIF data federation, NIF has established strategies to address the most common types of resources of biomedical interest. Thus, for this phase of the NIF, we elected to focus our data federation efforts on these types. NIF queries get translated into queries in the native forms of the databases we federate. NIF content is also transformed into RDF and has its own SPARQL endpoint. NIF will soon implement an RDF query interface for its federated resources. For a portal such as NIF, the challenge is to provide a simplified view of a complex resource that is understandable to a user coming through the NIF portal. To accomplish this, the NIF curators work with the native resource to ensure that the local semantics of the data are expressed correctly through NIF, by augmenting the model where necessary and mapping the data to NIFSTD ontologies. Before the current Semantic Web era, the common practice among the vast majority of web-based resource providers was simply to overlook structuring their data models so that they could be machine processable, reusable and interoperable. NIF curators are taking up the challenge of filtering these resources so that they can become available under a common, interoperable semantic layer. The NIF data federation currently contains one of the largest unified collections of neuroscience-relevant data available through the web, with over 50 independent data sources integrated through the NIF indices. NIF is closely following movements such as Open Data, Linked Data, and the Web of Data, which aim to integrate data regardless of their sources.

1.2. NIF Resource Registry

To help the neuroscience community discover useful digital resources, such as academic databases, software, and funding, NIF has developed a large digital catalog of resources related to neuroscience. In collaboration with the Biomedical Resource Ontology (BRO), NIF has developed a simple OWL ontology module dedicated to cataloging digital resources, called the resource ontology. So far, NIF curators have assigned a vast number of digital resources to one or more of the resource ontology categories. The resource category classes are mostly derived from NIF-Investigation and the Ontology for Biomedical Investigations (OBI) and can serve as 'resource descriptors'. The top-level resource categories are: Data, Funding, Job, Material, People, Services, Software, and Training. These categories include human-readable definitions and labels targeted at curators, along with sub-categorizations and synonyms that assist the search system in locating the right data.


1.3. Growth of NIF Contents and Outreach

Since the first release in 2008, NIF has grown significantly in content and community building. Currently, NIF provides access to the largest collection of neuroscience-relevant data on the web, all from a single interface. The chart on the left in Figure 1 illustrates the growth of federated data resources within NIF since June 2008. The chart on the right illustrates the growth in utilization, in visits per month, across NIF holdings, including the NIF search portal and NeuroLex. Currently the NIF search portal has ~6,000 visits per month, and NeuroLex has over 15,000 visits per month. It should also be noted that a significant number of NIF users are successfully finding their desired keywords in the NIF ontologies. For example, based on a recent Google Analytics report (March 19-25, 2012), out of a total of 1,823 search events, 846 were auto-complete searches (i.e., the terms exist in NIFSTD), and 85 were advanced ontological queries.

Figure 1. On the left, the increase of NIF contents in terms of the number of federated records and databases. On the right, the increase of community outreach in terms of the number of visitors to the NIF portal.


2. The NIFSTD Ontologies


NIFSTD is a set of modular ontologies where each module covers a distinct, orthogonal domain of neuroscience [1] (Figure 2). The modules in NIFSTD are expressed in OWL-DL [11], which balances expressivity against computational decidability and supports automated reasoning via common DL reasoners, e.g., Pellet and FaCT++. Acknowledging the fact that information is most often incomplete, the open-world assumption in OWL has been a suitable approach for representing the neuroscience domain in NIF, and it also allows the ontologies to be deployed incrementally. NIFSTD is loadable in the Protégé ontology editor [13] and has a SPARQL endpoint [15]. It is also available on the web through the NCBO BioPortal [16]. Wherever possible, NIFSTD reuses terms and their classification schemes from existing, standard biomedical knowledge sources. These sources range from fully structured ontologies to loosely structured controlled vocabularies or lexicons following standard nomenclatures. Depending on the state of a source, it is either adapted into an OWL ontology, has a relevant portion extracted using MIREOT principles [8], or is simply imported as a whole. Domains covered by the current NIFSTD, along with the vocabularies imported from external community sources and the corresponding OWL modules, can be found in [5]. Each class in NIFSTD is assigned a unique identifier accompanied by a variety of annotation properties, mostly drawn from SKOS and the Dublin Core metadata model.
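As a minimal sketch of what such an annotated class might look like, using the rdflib Python library (the base IRI, identifier, labels and parent class here are made up for illustration and are not actual NIFSTD content):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, OWL, RDF, RDFS, SKOS

NIF = Namespace("http://example.org/nifstd/")      # hypothetical base IRI

g = Graph()
g.bind("skos", SKOS)
g.bind("dcterms", DCTERMS)

cls = NIF["nlx_12345"]                             # made-up unique identifier
g.add((cls, RDF.type, OWL.Class))
g.add((cls, RDFS.subClassOf, NIF["Neuron"]))       # hypothetical named super-class
g.add((cls, RDFS.label, Literal("Cerebellum Purkinje cell")))
g.add((cls, SKOS.prefLabel, Literal("Cerebellum Purkinje cell")))
g.add((cls, SKOS.altLabel, Literal("Purkinje neuron")))
g.add((cls, DCTERMS.source, Literal("contributed via NeuroLex")))

print(g.serialize(format="turtle"))                # rdflib 6+ returns a string

The point of the sketch is only the pattern: one named super-class plus SKOS and Dublin Core annotation properties attached to a uniquely identified class.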

Figure 2. The semantic domains covered in the NIFSTD ontology. Separate OWL modules cover the domains specified within the ovals. The umbrella file http://purl.org/nif/ontology/nif.owl imports each of these modules when opened in Protégé. Each of the modules, in turn, may cover multiple sub-domains, some of which are shown in the rectangular boxes [1].

As one of its core principles, NIFSTD adopted the Basic Formal Ontology (BFO) [18], the most widely used upper ontology in the biomedical communities, to represent the upper-level semantic layer of its different orthogonal modules. The upper-level classes in the NIFSTD modules are normalized under the appropriate BFO entities. Since the beginning, NIF has recognized the significance of a formal ontology. As the number of biomedical ontologies increases, a formal ontology like BFO plays an important role in promoting semantic interoperability for data integration [4]. BFO provides a logical basis for categorizing domain-independent, high-level classes such as entities, characteristics and processes. This kind of categorization helped us avoid erroneous and ambiguous ontological assertions, and it was necessary for developing a large-scale ontology like NIFSTD.

NIFSTD closely follows the OBO Foundry best practices [4] as long as they are practical for day-to-day development. NIFSTD includes PATO (obofoundry.org/wiki/index.php/PATO:Main_Page) to describe the phenotypic qualities of NIFSTD entities. The relational properties in NIFSTD (example: Figure 3) are derived from the standard OBO Relations Ontology (RO). Relations that are domain-independent and hold universally within the classes of a specific module are kept integrated within that module. Relations between entities that could vary with the specific application, or that require a domain-dependent viewpoint, are kept in a separate module called a bridging module. A bridging module typically incorporates relational properties between multiple distinct modules. This kind of isolation enables NIFSTD to keep its modularity principles intact. The core modules are hence easily extendible and reusable without the need for any modification, and new bridging modules can be developed should a user desire a customized application in their own domain.

Figure 3. An exemplary knowledge model in NIFSTD. Both cross-modular and intra-modular classes are associated through object properties drawn from the OBO Relations Ontology (RO). Different colors/shapes of the boxes indicate that the classes belong to different modules. The cross-domain relations are typically kept in a separate bridging module in NIFSTD.

NIFSTD follows the simple inheritance principle [3] for its hierarchy of named classes; i.e., an asserted named class can have only one named class as its super-class, although a named class can have multiple anonymous super-classes. Classes with multiple super-classes are derived via automated classification of defined NIFSTD classes with necessary and sufficient conditions. This approach saves a great deal of manual labor and minimizes the human errors inherent in maintaining multiple hierarchies. It also provides a logical and intuitive account of how a class may appear in multiple, different hierarchies.

Since the first release, the NIFSTD ontologies have undergone extensive revision and refinement over the course of their evolution during the last couple of years. These changes include structural simplifications to the import hierarchies, a simplified 'backend' module comprising the common entities shared by all of the NIFSTD modules, enforced modularization principles, and refactoring of the modules under more appropriate BFO classes. NIFSTD has also included new modules extracted from standard biomedical ontologies such as the Gene Ontology (GO), the Protein Ontology (PRO), ChEBI, and the Human Disease Ontology (DOID). NIFSTD core content has also been rapidly enhanced by NeuroLex contributions.


3. The NeuroLex Wiki

One of the largest roadblocks that NIF encountered during its ontology development phase was the lack of tools for domain experts to view, edit and contribute their knowledge to formal ontologies like NIFSTD. Existing ontology tools were difficult to use or required expert knowledge to employ. NIF strives to balance the involvement of the neuroscience community for domain expertise and the knowledge engineering community for ontology expertise when constructing its ontologies. By combining several open-source technologies related to semantic media wikis, NIF created NeuroLex2, a semantic wiki for the neuroscience community and domain experts. The initial contents of NeuroLex were derived from NIFSTD, which established its neuroscience-centric semantic framework and enabled the semantic relationships among its category pages. NIFSTD OWL classes were automatically transformed into category pages containing simplified, human-readable class descriptions. The category pages are editable and readily available for the community or domain experts to access, annotate or enhance. Additions of new categories and enhancements to the NeuroLex contents are regularly transformed into NIFSTD in formal OWL-DL expressions. While the properties in NeuroLex are meant for easier interpretation, the restrictions in NIFSTD are more rigorous and based on standard OBO-RO relations. For example, the property 'soma located in' is translated as 'Neuron X' has_part some ('Soma' and (part_of some 'Brain region Y')) in NIFSTD. Sometimes a similar kind of 'macro' relation, such as 'has_neurotransmitter', is used in NIFSTD, recognizing that these relations can be specified more rigorously. These 'macro' relations readily lend themselves to more rigorous representations using OWL 2.0 [12] property chain composition, should that become necessary at a later date.

NIF considers NeuroLex.org to be the main entry point for the broader community to access, annotate, edit and enhance the core NIFSTD content. The peer-reviewed contributions in the media wiki are later incorporated into formal OWL modules. It should be noted that NIF is not charged with the development of new modules but relies on the community for new content. The NeuroLex wiki has therefore proven to be ideal for NIF's current scope. For example, it has proven to be effective and helpful in the area of neuronal cell types, where NIF is working with a group of neuroscientists to create a comprehensive list of neurons and their properties. NeuroLex category pages are linked with the NIF search interface, where users can quickly view descriptive ontological details about a search term.

NeuroLex can be viewed as a full-fledged information management system that provides a bottom-up ontology development approach in which multiple participants can edit the ontology instantly. Semantics in NeuroLex are limited to what is convenient for the domain experts. Essentially, the NeuroLex approach is not a replacement for top-down construction, but it is critical for increasing accessibility for non-ontologist domain experts. NeuroLex provides various simple forms for structured knowledge through which communities can contribute and verify their knowledge with ease. It also allows the generation of specific class hierarchies, or the extraction of a specific portion of the ontology contents, based on certain properties, in a spreadsheet, without having to learn complicated ontology tools.

2 The NeuroLex Wiki, http://neurolex.org
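In description-logic notation, the 'soma located in' macro expansion quoted above reads as follows (class names schematic, as in the text):

\[
\text{NeuronX} \;\sqsubseteq\; \exists\,\text{has\_part}.\bigl(\text{Soma} \sqcap \exists\,\text{part\_of}.\text{BrainRegionY}\bigr)
\]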


4. The OntoQuest System

Within NIF, NIFSTD is served through a powerful ontology management system called OntoQuest. OntoQuest views the ontology as a graph and performs graph-like operations (e.g., finding the k-neighborhood). NIFSTD is an OWL/RDF graph currently containing 60,000+ terms and about 25 edge labels. In OntoQuest, the subclassOf property induces a spanning DAG over all ontology nodes. Periodically, each major release of NIFSTD is loaded into the OntoQuest repository. However, we have recently established an incremental update mechanism that allows day-to-day updates of NIFSTD to be reflected more frequently in OntoQuest.

As the ontologies used in neuroscience and the relevant biomedical domains are usually massive in size, it was necessary for OntoQuest to utilize database technologies in order to provide scalable query processing mechanisms. OntoQuest provides efficient retrieval of knowledge and information that are semantically structured through ontologies. Currently, OntoQuest can process full-fledged ontologies represented in the OWL or OBO formalisms as well as simpler taxonomies expressed in RDFS. OntoQuest is built on top of a relational schema motivated by the IODT system from IBM [17], and it shreds OWL and OBO ontologies into this relational schema. It is important to note that OntoQuest is not a reasoner; rather, it provides an efficient navigation and query facility on ontologies by treating them as directed graphs. Details on the OntoQuest graph model can be found in [7]. OntoQuest utilizes the Jena library APIs to parse OWL ontologies. However, OntoQuest has its own customized algorithms for generating the mappings between the OWL ontologies and the backend schemas. OntoQuest stores all distinguished relationships permitted by OWL (e.g., subclass-of, allValuesFrom, disjoint) in separate tables, while all user-defined relation names are stored in a quad-store. Each class and property definition from the ontologies is mapped into an appropriate relational table. Using an advanced encoding and indexing algorithm [9], the DAG structure of the ontological class hierarchies is stored in a way that allows efficient computation of transitive relations, such as class subsumption and partonomic relations among the classes.

The current version of OntoQuest allows the expression of the property chain rules advocated by the proposed OWL 2.0 standard [12]. This enables non-recursive first-order rules like (A subclass-of B), (C part-of B) → (C part-of A). When such rules are specified, OntoQuest may be instructed to materialize them or to evaluate them during query processing. OntoQuest contains its own query processing engine to support ontological queries, which is integrated with the NIF search portal and provides automated query expansion of the terms asserted in the NIFSTD ontologies. It also provides a collection of web services to extract specific ontological contents [14]. Finally, OntoQuest accepts bridge ontologies that provide mappings between multiple existing ontologies through rules specified in OWL. This is particularly important for NIF, as most biomedical ontologies in the OBO consortium3 significantly cross-reference each other. NIFSTD currently has several modules containing extensive inter-ontology mappings.

4.1. OntoQuest Operations

OntoQuest implements various useful operations on its ontological graphs. Table 1 lists the functions along with the operations provided by OntoQuest. The NIF system extensively utilizes these functions through its interface. Refer to [7] for details on the specific operations. It is important to note that the OntoQuest ontology repository does not store instances; rather, it stores only the classes and interclass properties. The instance data are stored in the various databases and are duly mapped to ontological concepts where possible. Therefore, the instantiate operator in OntoQuest calls data access operations.

Table 1. OntoQuest operators manipulate the ontology graph stored in the ontology repository. Node and edge IDs are unique in the system over all stored ontologies.


scangraph(p): Performs a scan operation over the edges that are evaluated to satisfy predicate p.
selectNodeLabels(p): Selects a set of node labels satisfying predicate p.
selectNodes(p): Selects a subset of nodes based on predicate p.
selectEdges(p): Selects a subset of edges based on predicate p.
project(pat): Projects a set of subgraphs that satisfy a graph pattern pat.
label(g): Accepts a graph g and returns a copy of it, replacing node-ids by node labels and edge-ids by edge labels.
merge(g1,g2): Performs a node and edge union of graphs g1 and g2.
flattenPropTree(pLabel): Accepts a property label pLabel and returns the set of subproperties of pLabel.
flattenQuality(qValue): Accepts a quality value qValue and returns the set of subdomain quality names under qValue.
induce(N): Given a node set N, returns the graph induced by N.
reachable(n1, n2, ei): Determines whether node n2 is reachable from n1 by traversing edges satisfying regular expression ei.
getTransitiveAncestors(n, Label, k): Gets k levels of ancestors of node n by following the transitive edge label Label.
getTransitiveDescendants(n, Label, k): Gets k levels of descendants of node n by following the transitive edge label Label.
neighbors(N, k, ei, ex): Given nodes N, returns the k-neighborhoods of each node in N, such that the edges satisfy the regular expression ei and do not satisfy the regular expression ex.
LCA(N, Label): Finds the least common ancestor of node set N by traversing the transitive edge label Label.
dagPath(n1, n2, Label): Finds the paths connecting node n1 to n2 along the transitive edge label Label.
centerpiece(N): Given a node set N, computes the centerpiece subgraph connecting the nodes in N.
unfoldPropertyChain(chID): Computes and materializes derived edges by unfolding the OWL 2 property chains identified by property chain ID chID.
instantiate(N): Finds instances of the nodes N, where N can contain only class nodes.
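The transitive traversals in Table 1 are, at bottom, bounded walks over a labeled DAG. A minimal Python sketch of one of them (getTransitiveAncestors), using a made-up toy graph rather than the OntoQuest relational store:

def transitive_ancestors(edges, node, label, k):
    """Rough analogue of getTransitiveAncestors(n, Label, k): follow edges
    carrying the given label for up to k levels and collect the ancestors."""
    seen, frontier = set(), {node}
    for _ in range(k):
        nxt = set()
        for n in frontier:
            for edge_label, parent in edges.get(n, []):
                if edge_label == label and parent not in seen:
                    seen.add(parent)
                    nxt.add(parent)
        frontier = nxt
        if not frontier:
            break
    return seen

# Toy ontology DAG: node -> list of (edge label, target) pairs; names are illustrative.
EDGES = {
    "CerebellumPurkinjeCell": [("subClassOf", "Neuron"), ("partOf", "CerebellarCortex")],
    "CerebellarCortex": [("partOf", "Cerebellum")],
    "Cerebellum": [("partOf", "Hindbrain")],
    "Neuron": [("subClassOf", "Cell")],
}

print(sorted(transitive_ancestors(EDGES, "CerebellumPurkinjeCell", "partOf", 3)))
# ['CerebellarCortex', 'Cerebellum', 'Hindbrain']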

3 OBO Consortium, www.obofoundry.org


4.2. OntoQuest Query Processing

The NIF system uses a query language inspired by current search engines like Google. In this language, the simplest option is to ask a keyword query, but one can optionally add predicates on metadata and data attributes, specify return structures, and make references to ontologies. A detailed treatment of the full query specification with all options, and of the evaluation strategy, is beyond the scope of this paper. Simpler constructs are presented here, with a focus on how the ontologies are utilized.

Table 2. Typical ontological query expansions in NIF through OntoQuest.

Example query type: A single term query for Hippocampus and its synonyms.
Ontological expansion: synonyms(Hippocampus), which expands to Hippocampus OR "Cornu ammonis" OR "Ammon's horn" OR "hippocampus proper".

Example query type: A conjunctive query with 3 terms.
Ontological expansion: transcription AND gene AND pathway.

Example query type: A 6-term AND/OR query with one term expanded into synonyms.
Ontological expansion: (gene) AND (pathway) AND (regulation OR "biological regulation") AND (transcription) AND (recombinant).

Example query type: A conjunctive query with 2 terms, where a user chooses to select the subclasses of the 2nd term.
Ontological expansion: synonyms(zebrafish AND descendants(promoter,subclassOf)); zebrafish gets expanded by synonym search and the second term transitively expands to all subclasses of promoter as well as their synonyms.

Example query type: A single term query for an anatomical structure where a user chooses to select all of the anatomical parts of the term along with synonyms.
Ontological expansion: synonyms(descendants(Hippocampus,partOf)), which expands to all parts of hippocampus and all their synonyms through the ontology. All parts are joined by an "OR" operation.

Example query type: A conjunctive query with 2 terms, where a user chooses to select all the equivalent terms for the 2nd term.
Ontological expansion: synonyms(Hippocampus) AND equivalent(synonyms(memory)); the second term uses the ontology to find all terms that are equivalent to the term memory by ontological assertion, along with synonyms.

Example query type: A conjunctive query with 2 terms, where a user is interested in specific subclasses for both of the terms.
Ontological expansion: synonyms(x:descendants(neuron,subclassOf) where x.neurotransmitter='GABA') AND synonyms(gene where gene.name='IGF'); x is an internal variable.

Example query type: A query to seek all subclasses of neuron whose soma location is in any transitive part of the hippocampus.
Ontological expansion: synonyms(x:descendants(neuron,subclassOf) where x.soma.location = descendants(Hippocampus, partOf)).

Example query type: A query to seek a conceptual term that is semantically equivalent to a collection of terms rather than a single term.
Ontological expansion: 'GABAergic neuron' AND equivalent('GABAergic neuron'); the term gets recognized as ontologically equivalent to any neuron that has GABA as a neurotransmitter and therefore expands to a list of inferred neuron types.

A keyword query in NIF is a Boolean expression with wildcards (PARK* stands for PARKIN, PARK2, …). The basic query generation process involves the following steps.

1. The query is first sent to a query analysis unit to identify terms that are known to the ontology sources.
2. The analyzed query goes through a query expansion unit that uses the ontology to find synonyms and related terms stored with the ontology.
3. The terms in the expanded query are looked up in an inverted index to locate candidate records that reside in different data stores (graph store, relation store, etc.).
4. The Boolean conditions are then evaluated to generate a candidate list of data items (of heterogeneous types) that form the result.

Here, a candidate term refers to an actual term that matched the ontologies, and a candidate record refers to the containing data structure that holds the candidate term. Table 2 presents a set of typical queries along with their expansions. It is important to note that, to the best of our knowledge, none of the traditional search engines provide this kind of query expansion mechanism. Readers interested in the performance evaluation of typical NIF queries are referred to [7].
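A minimal Python sketch of the expansion step in this pipeline; the synonym and subclass tables here are made-up stand-ins for NIFSTD content, not the actual OntoQuest implementation:

SYNONYMS = {
    "hippocampus": ["cornu ammonis", "ammon's horn", "hippocampus proper"],
}
SUBCLASSES = {
    "neuron": ["gabaergic neuron", "glutamatergic neuron", "purkinje cell"],
}

def expand_term(term, include_subclasses=False):
    # OR together the term, its synonyms and (optionally) its subclasses.
    variants = [term] + SYNONYMS.get(term, [])
    if include_subclasses:
        for sub in SUBCLASSES.get(term, []):
            variants += [sub] + SYNONYMS.get(sub, [])
    return "(" + " OR ".join('"%s"' % v for v in variants) + ")"

def expand_query(terms):
    # Conjunctive query: each term is OR-expanded, then the terms are AND-ed.
    return " AND ".join(expand_term(t, include_subclasses=True) for t in terms)

print(expand_query(["hippocampus", "neuron"]))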


4.3. OntoQuest and NIF Search Interface

NIF is essentially an application system built upon a heterogeneous data management infrastructure that utilizes OntoQuest. NIF hides the complexity of the query language within the elements of the user interface. Specifically, the user presents only keyword and ontological keyword queries; the ontology-based expansion and predicate search happen through user interaction. The NIF search engine takes the user's keyword query and, in the most common case, performs an ontological search to retrieve conceptual terms that closely match the terms in the ontology and, if desired, the neighborhood of these ontological terms. This process of exploring the ontology to find related terms is performed interactively. When the user settles on the final query terms, the keyword module uses the index to locate sources that have the data or web documents satisfying the keywords. Once the data sources are located, the source query wrapper module transforms the query into queries against all sources and broadcasts these transformed queries. The process of transformation converts the query keywords into SQL (or HTTP calls and so on) for structured data sources, XML requests, searches against the web index, and so forth. If the user's search terms are not found in the ontology, OntoQuest allows the query to be posted directly against the sources as a string search.

5. NIF Semantic Search

One of the most powerful aspects of ontologies is that they allow explicit knowledge of a domain to be asserted, from which implicit, inferred knowledge can be automatically derived as logical consequences. NIFSTD is designed to capitalize on this ontological feature to enhance NIF's semantic search mechanism. The key feature of the current NIFSTD is the inclusion and enrichment of various cross-domain bridging modules. These modules contain the necessary restrictions along with a set of defined classes to infer useful classifications of neurons and molecules. The following list illustrates some of the defined concepts in NIFSTD and their classification schemes:

− Neurons by their soma location in different brain regions - e.g., Hippocampal neuron, Cerebellum neuron, Retinal neuron, etc.
− Neurons by their neurotransmitter - e.g., GABAergic neuron, Glutamatergic neuron, Cholinergic neuron
− Neurons by their circuit roles - e.g., Intrinsic neuron, Projection neuron
− Neurons by their morphology - e.g., Spiny neuron
− Neurons by their molecular constituents - e.g., Parvalbumin neuron, Calretinin neuron
− Classification of molecules and chemicals by their molecular roles - e.g., Drug of abuse, Neurotransmitter, Calcium binding protein

A list of defined concepts along with their textual definitions can be found on the NIFSTD wiki page in [5]. The following example illustrates the strengths and usefulness of this feature for our NIF system. NIF has various neuron types with an asserted simple single hierarchy within the NIF-Cell module (Figure 4 is an example with five neuron types).

Figure 4. The NIFSTD Cerebellum Purkinje cell is simply a subclass of Neuron before a reasoner is invoked together with the asserted restrictions specified in Figure 5.


We assert various logical necessary restrictions about these neurons in a bridging module where we also specify various defined neuron types with necessary and sufficient conditions as illustrated in Figure 5.

Figure 5. Typical NIFSTD asserted restrictions for various neuron types. The first table in the figure defines three neuron types with logical necessary and sufficient conditions. The second table lists a set of necessary restrictions for Cerebellum Purkinje cell. All of these restrictions, written here in a readable format, are expressed in the OWL DL language in the actual NIFSTD.

When the NIF-Cell module, along with the bridging modules, is passed to a reasoner, the reasoner automatically classifies the asserted neuron types with their restrictions (as indicated in Figure 5) and produces a hierarchy in which a neuron can have multiple inferred super-classes. In this example, although we did not explicitly state that Cerebellum Purkinje cell is anything other than a simple neuron, the reasoner identified the neuron as an inferred subclass of four different defined neurons (Figure 6), namely GABAergic neuron, Cerebellum neuron, Spiny neuron and Principal neuron, based on the logical restrictions specified in Figure 5. Having these defined ontological classes has enabled NIF to formulate useful concept-based queries. For example, when searching for 'GABAergic neuron', the NIF query expansion through OntoQuest recognizes the term as 'defined' in the ontology, looks for any neuron that has GABA as a neurotransmitter (instead of a lexical match of the search term), and expands the query over the inferred list of neurons. Searching for these defined terms in a Google search would essentially miss all of the GABAergic neurons unless they were explicitly listed within the search.
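A schematic description-logic rendering of the kind of axioms behind Figures 5 and 6 (relation and class names are abbreviated here for illustration; these are not the exact NIFSTD axioms):

\begin{align*}
\text{GABAergicNeuron} &\equiv \text{Neuron} \sqcap \exists\,\text{has\_neurotransmitter}.\text{GABA}\\
\text{CerebellumNeuron} &\equiv \text{Neuron} \sqcap \exists\,\text{has\_soma\_location}.(\exists\,\text{part\_of}.\text{Cerebellum})\\
\text{CerebellumPurkinjeCell} &\sqsubseteq \text{Neuron} \sqcap \exists\,\text{has\_neurotransmitter}.\text{GABA} \sqcap \exists\,\text{has\_soma\_location}.\text{CerebellarCortex}
\end{align*}

Given an additional axiom stating that CerebellarCortex is a part of the Cerebellum, a DL reasoner infers both CerebellumPurkinjeCell ⊑ GABAergicNeuron and CerebellumPurkinjeCell ⊑ CerebellumNeuron without either subsumption being asserted.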


Figure 6. After invoking a reasoner, the NIFSTD Cerebellum Purkinje cell becomes a subclass of four different defined neuron types, based on the restrictions specified in Figure 5.

To further its mission of delivering a concept-based interface for neuroscience, OntoQuest is being extended to support rules for defining certain classes that include quantitative definitions as well as logical ones. Such concepts include quantities such as age/maturity categories for common organisms, concepts such as "dementia", which may be defined as a range of scores on a set of test batteries, and qualitative assessments of expression studies. These standards allow a researcher to issue a query through the NIF keyword interface for "increased gene expression" in "adults" for "drugs of abuse". For example, many native databases report age results for individual organisms. However, when searching from a keyword-based interface, users cannot easily enter a set of age ranges into the query without specialized forms. Furthermore, looking for the expression of gene X in adult animals across species would require that the user enter age ranges for all organisms, which is clearly impractical. Thus, for concepts like maturity stages and expression levels, NIF maps the quantitative values to qualitative categories like "Adult" or "increased expression". For example, NIF defines an adult mouse as a mouse >= 40 days of age. For sources within the integrated view, we provide the age given by the data source but also an additional tag that maps this range to "Adult". For gene expression levels from microarrays or other types of assays, NIF annotates a change relative to some control that is significant at the p <= 0.05 level as "increased" or "decreased" expression. These standards are applied when the translation of results is straightforward, and the original values as recorded in the source database are always provided. When the annotation of results requires interpretation because the meaning of the values recorded in the source database is unclear, we do not apply such standards. For example, in the Gene to Brain Region view, each of the sources provides its assessment of gene expression levels within a brain region in a different way (see Figure 7).
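A minimal sketch of this kind of quantitative-to-qualitative mapping. The mouse threshold (>= 40 days = "Adult") is taken from the text; the rat threshold and the exact form of the expression rule are illustrative assumptions:

ADULT_AGE_DAYS = {"mouse": 40, "rat": 60}    # rat value is an assumed placeholder

def maturity_tag(organism, age_days):
    """Map a reported age to a qualitative maturity category, if one is defined."""
    cutoff = ADULT_AGE_DAYS.get(organism)
    if cutoff is None:
        return None                          # no standard defined: keep only the source value
    return "Adult" if age_days >= cutoff else "Juvenile"

def expression_tag(fold_change, p_value, alpha=0.05):
    """Map a measured change (relative to control) to a qualitative category."""
    if p_value > alpha:
        return "no significant change"
    return "increased expression" if fold_change > 1 else "decreased expression"

print(maturity_tag("mouse", 55))     # Adult
print(expression_tag(2.3, 0.01))     # increased expression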


Figure 7. Results for “GRM1 and Midbrain” from the 3 expression atlases.

In the Allen Brain gene expression atlas, based on analysis of in situ hybridization images through the mouse brain, gene expression levels are expressed on a numerical scale from 0-100, where expression is normalized against controls. Higher numbers indicate an increased likelihood that the gene is expressed within that region, but there is no way for NIF to translate the numbers into discrete categories that would accurately reflect the intent of the resource providers. Similarly, the GENSAT resource provides assessments of labeling patterns for the BAC transgenics per brain region using a qualitative rating of low, moderate and high staining. However, the staining intensity does not necessarily correlate with the expression level of the gene, as it reflects the number of GFP copies inserted into the cell. Thus, in both cases, NIF retains the assessments provided by the source databases without additional annotations.

6. Conclusion

In this paper we have presented the NIF as an ontology-driven information system: a system in which data resources are imported into the system by domain experts, annotated by domain experts to connect the data to its standardized formal ontologies or to other data sources through the ontologies, and, finally, all the data mapped to the ontologies can be queried in a federated manner. The NIF system enables multi-model data federation. It uses an ontologically enhanced system catalog, an ontological data index, an association index to facilitate cross-model data mapping, and an algorithm for ontology-based keyword queries with ranking. Since the launch of the NIF in September 2008, it has grown significantly in content and functionality. During this rapid period of growth, NIF has been working to develop an understanding of the current state of biomedical resources and has established an ontology-based infrastructure, procedures and guidelines for maximizing their utility.


The NIF project provides an example of practical ontology development and how ontologies can be used within hybrid information systems to enhance search and data integration across diverse resources.

Acknowledgement. Supported by a contract from the NIH Neuroscience Blueprint HHSN271200800035C via NIDA.

References

[1] W.J. Bug, G.A. Ascoli, J.S. Grethe, A. Gupta, M.E. Martone et al., The NIFSTD and BIRNLex Vocabularies: Building Comprehensive Ontologies for Neuroscience, Neuroinformatics 6(3) (2008), 175-194.
[2] A. Gupta, W.J. Bug, L. Marenco, C. Condit, M.E. Martone et al., Federated Access to Heterogeneous Information Resources in the Neuroscience Information Framework (NIF), Neuroinformatics 6(3) (2008), 205-217.
[3] A. Rector, Modularisation of Domain Ontologies Implemented in Description Logics and related formalisms including OWL, Proc. K-CAP (2003).
[4] B. Smith, M. Ashburner, C. Rosse, et al., The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration, Nat Biotech (2007), 1251-1255.
[5] F.T. Imam, S.D. Larson, J.S. Grethe, A. Gupta, A. Bandrowski, M.E. Martone, NIFSTD and NeuroLex: A Comprehensive Neuroscience Ontology Development Based on Multiple Biomedical Ontologies and Community Involvement, Proc. of Intl. Conf. on Biomedical Ontologies (ICBO), Buffalo, NY (2011).
[6] L. Chen, M.E. Martone, A. Gupta, L. Fong, M. Wong-Barnum, OntoQuest: exploring ontological data made easy, Proc. 31st Int. Conf. on Very Large Databases (VLDB) (2006), 1183-1186.
[7] A. Gupta, C. Condit, X. Qian, An ontology-enhanced information system for heterogeneous biological information, BioDB (2011).
[8] M. Courtot, F. Gibson, A. Lister, J. Malone, D. Schober, R. Brinkman, A. Ruttenberg, MIREOT: the Minimum Information to Reference an External Ontology Term, Nature Precedings (2009), http://dx.doi.org/10.1038/npre.2009.3576.1.
[9] L. Chen, A. Gupta, M.E. Kurul, Stack-based algorithms for pattern matching on DAGs, Proc. 31st Int. Conf. on Very Large Databases (VLDB), Stockholm (2005), 493-504.
[10] Ontology Design Patterns (ODPs) Public Catalog, http://odps.sourceforge.net
[11] Web Ontology Language (OWL), http://www.w3.org/2001/sw/wiki/OWL
[12] OWL 2 Web Ontology Language Primer, http://www.w3.org/TR/2009/REC-owl2-primer-20091027/
[13] Protégé Ontology Editor and Knowledge Acquisition System, http://protege.stanford.edu
[14] OntoQuest Web Services, http://ontology.neuinfo.org/ontoquestservice.html
[15] NIFSTD SPARQL Endpoint, http://ontology.neuinfo.org/sparqlendpoint.html
[16] NIFSTD in NCBO BioPortal, http://bioportal.bioontology.org/ontologies/1084
[17] J. Mei, L. Ma, Y. Pan, Ontology query answering on databases, Proc. of Int. Semantic Web Conf. (2006), 445-458.
[18] P. Grenon and B. Smith, SNAP and SPAN: Towards Dynamic Spatial Ontology, Spatial Cognition and Computation 4(1) (2004), 69-103.


Formal Ontology in Information Systems M. Donnelly and G. Guizzardi (Eds.) IOS Press, 2012 © 2012 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-61499-084-0-29


Suggestions for Galaxy Workflow Design Using Semantically Annotated Services

Alok Dhamanaskar a, Michael E. Cotterell a, Jie Zheng d, Jessica C. Kissinger a,b,c, Christian J. Stoeckert, Jr. d and John A. Miller a,b
a Dept. of Computer Science, b Institute of Bioinformatics, c Center for Tropical and Emerging Global Diseases and Dept. of Genetics, University of Georgia, Athens, GA 30602
d Penn Center for Bioinformatics and Dept. of Genetics, University of Pennsylvania, Philadelphia, PA 19104

Abstract. The wide-scale development of ontologies in the bioinformatics domain facilitates their use in the creation of scientific workflows. To speed up the design of workflows, a Service Suggestion Engine is interfaced to the Galaxy Tool Integration and Workflow Platform. This enables users to ask for suggestions (e.g., what operation should go next) while designing workflows with the Galaxy user interface. The Service Suggestion Engine utilizes semantic annotations to suggest appropriate Web service operations to plug into the workflow under design. The enriched Ontology for Biomedical Investigations (OBI) is used as a target for the annotations. The effectiveness of the suggestions provided is evaluated against a consensus of domain experts.


Keywords. Semantic Web Services, Ontology, Semantic Annotations, Workflow, Web Service Composition

1. Introduction

The bioinformatics domain is witnessing an exponential rise in available data as more efficient, cheaper and faster means of sequencing, transcript analysis, etc. are developed. Mining this vast amount of data to gain useful insights often requires the coordinated use of multiple bioinformatics analysis tools. An increasing number of such tools and software applications are being provided as Web services by the biological and biomedical communities. For example, BioCatalogue, a registry of biological Web services, currently has information on 2,278 Web services from 161 service providers [9]. To utilize these Web services effectively, there is a need to rapidly construct scientific workflows composed of Web services.

Galaxy [2] is an easy-to-use, open-source, Web-based platform that provides multiple tools for data analysis and bioinformatics research. Galaxy provides a platform for constructing workflows from existing Galaxy tools in a very simple fashion using a Yahoo Pipes-based graphical designer. In our previous work [28], we created a tool, Galaxy Web Service Extensions1, which permitted the addition of Web services as tools within Galaxy. This addition made Web service composition possible within the Galaxy framework. However, two important problems remain: 1) selection of the appropriate operations (tools) to achieve the desired workflow outcome, and 2) connection of the appropriate input parameters with the right output parameters.

To understand the complexities involved in selecting two tools, or Web services, such that they are input-output compatible with each other, consider the output and input of the Web service operations in Figure 1. On the left is the output of Web Service 1 and on the right is the input to Web Service 2. The input to Web Service 2 can be perfectly fed by the output of Web Service 1: exp maps to e-val and SequenceId to Sid. As it can be quite difficult for a naive or non-specialist researcher to assign the correct mapping, there is a need for a system to assist the user.

Figure 1. A representation of XML structures for the output and input of two different Web services
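A minimal sketch of how semantic annotations make such a mapping mechanical rather than manual; the concept IRIs below are made up for illustration, and this is not the engine's actual data structure:

# Outputs of Web Service 1 and inputs of Web Service 2, each annotated with an
# ontology concept (hypothetical IRIs standing in for OBI terms).
ws1_outputs = {
    "exp":        "http://example.org/obi/expectation_value",
    "SequenceId": "http://example.org/obi/sequence_identifier",
}
ws2_inputs = {
    "e-val": "http://example.org/obi/expectation_value",
    "Sid":   "http://example.org/obi/sequence_identifier",
}

def suggest_mappings(outputs, inputs):
    """Pair each input with an output annotated with the same concept."""
    by_concept = {concept: name for name, concept in outputs.items()}
    return {inp: by_concept.get(concept) for inp, concept in inputs.items()}

print(suggest_mappings(ws1_outputs, ws2_inputs))
# {'e-val': 'exp', 'Sid': 'SequenceId'}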

This paper extends our previous work and focuses on assisting the user by providing suggestions for the next possible Web service to use in the creation of bioinformatics and biomedical workflows that would otherwise require computer science as well as biological expertise to complete [30]. The result of our previous studies [29,30] was a Path-based data mediation algorithm capable of providing service suggestions to help a user construct a desired workflow. The second work compared three data mediation algorithms (Leaf-based, Structure-based and Path-based), and the Path-based algorithm was usually found to work best among them. Here, we provide the following improvements and extensions:

• re-engineered code to improve the calculation of metrics like Property Similarity,
• taking into account restrictions on concepts in the ontology when calculating Concept Similarity,
• a better use of semantics to suggest the next operation and assist with connecting outputs to inputs,
• assisting the user with possible input values and documentation about the different input-output parameters, and
• efficient interfacing with the existing Galaxy Workflow Editor to provide service suggestions within Galaxy.

SOAP Web services are generally described using the Web Service Description Language (WSDL), which is a W3C specification. There is also a W3C Submission called the Web Application Description Language (WADL), a description language for REST Web services or Web applications. Semantically annotating Web services involves adding references to terms in an ontology to specific inputs, outputs and operations inside a WSDL/WADL file. The W3C has recommended a specification, Semantic Annotations for WSDL (SAWSDL)2, which defines a set of extension attributes for the WSDL language.

For a system to provide service suggestions that will help the user construct a desired workflow, the system should have a precise specification of what a Web service operation does, the input it takes and the output it produces. This generates the need to agree upon a common vocabulary that would uniquely identify the various aspects of the Web service or tool (i.e., the functionality of the operation performed, its inputs and outputs). Ontologies, explicit formal specifications of the terms in a domain and the relations among them [10], are an ideal candidate for describing Web services (referred to as annotation of Web services) for a variety of reasons. Some ideal features of ontologies include providing a rich modeling framework, enabling reuse of domain knowledge, facilitating formal community agreement, being Web accessible, and facilitating reasoning to ensure consistency.

The Ontology for Biomedical Investigations (OBI)3 is being developed to address the need for consistent descriptions of biological and clinical investigations, including data analysis. It is currently a candidate for inclusion in the Open Biological and Biomedical Ontology (OBO) Foundry [27]. With the Basic Formal Ontology (BFO)4 as an upper-level ontology, OBI is interoperable with other OBO-compliant ontologies. OBI's existing structure makes it ideal for enrichment with concepts to support Web service annotations. In order to achieve this, we follow a systematic methodology. This process involves the design of ontology analysis diagrams for Web services and their subsequent analysis to discover terms that need to be added to the ontology [11]. Analysis of the following Web services: WUBLAST, NCBIBlast, PSIBlast, ClustalW2, TCoffee, WSDBFetch, WSConverter, Fasta, Muscle, WSFilterSequences and WSPhylip has resulted in the identification of approximately 100 new ontological terms, which are currently pending approval by the OBI community. We are continuing to annotate additional Web services and tools in the bioinformatics domain. Since the terms we have proposed cover many of the fundamental concepts in Web services and bioinformatics analysis, we expect the number of additional terms required to annotate new services to decrease.

Once a Web service is semantically annotated, our algorithm calculates scores (data mediation and functionality) for candidate Web services and returns a ranked list of Web service operations or tools that can succeed the current operation. In section 2, we describe the details of the Service Suggestion Engine (SSE) and the recent improvements made relative to our previous effort. We discuss the interfacing of the Service Suggestion Engine with the Galaxy Workflow tool in section 3. In section 4, we present an evaluation of the SSE in terms of the effectiveness of the suggestions. Sections 5 and 6 discuss related work and conclusions, respectively.

1 Available at: http://toolshed.g2.bx.psu.edu/repository/view_repository?id=94d0f039a25a883a
2 SAWSDL: http://www.w3.org/2002/ws/sawsdl/
3 The OBI Consortium: http://purl.obolibrary.org/obo/obi
4 BFO: http://www.ifomis.org/bfo

2. Service Suggestion Engine

The Service Suggestion Engine facilitates the process of constructing and extending workflows by providing suggestions to the user for the next Web service operation. Suggestions are provided as a ranked list of Web service operations. The implementation of the algorithms expands upon our previous work [29]. It considers the set of operations currently in the workflow (workflowOps), the set of operations available for use in the workflow (candidateOps), and a desired functionality (desiredOp) which can be either the URI for some concept in an ontology or a set of keywords. The score for each of the candidate operations is the weighted sum of their data mediation (dm) and functionality (fn) sub-scores, as seen in equation 1. The data mediation sub-score is intended to measure how well the inputs to a Web service operation can be provided by preceding Web service operations in the workflow, either directly or through some form of data mediation (e.g., based on SAWSDL schema mapping specifications). The functionality sub-score is determined by how well a Web service operation matches the functional category or objective indicated by the user (e.g., Multiple Sequence Alignment).

S = w1 · Sdm + w2 · Sfn

(1)


In equation (1), the weights w1 and w2 will always sum to one. If no desired functionality is provided, then w2 = 0. Currently, these weights are set manually to be equally weighted; however, they can be optimized by machine learning algorithms. The data mediation and functionality sub-scores are detailed in the following two sub-sections.

2.1. Data Mediation Sub-score

The suggestion algorithm finds matches between the inputs/outputs of various Web service operations. The data mediation sub-score Sdm is the result of comparing the Input Output Directed Acyclic Graph (IODAG) representing the input of the candidate operation with IODAGs representing the outputs of selected operations in the workflow [29]. This comparison involves checking both the syntactic and semantic similarity between respective nodes of the IODAG, termed Concept Similarity (CS). Sdm is a sum of these comparison sub-scores (CS), weighted as a geometric series starting with the minimum weight at the root node. This method is a path-based data mediation approach. Concept Similarity (CS), as seen in equation 2, considers syntactic, coverage and property similarity. Syntactic similarity involves comparison of the labels and definitions associated with each of the two concepts. Coverage similarity indicates how the concepts are related to each other based on their relative positions in the ontology. Property similarity measures the similarity between the properties of the concepts being compared.

CS = w3 · Syntacticsim + w4 · Coveragesim + w5 · Propertysim

(2)
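To make the weighting in equations (1) and (2) concrete, the following is a minimal Python sketch of the combination step. It is not the engine's actual implementation: the weight values, the three similarity inputs and the muscle figures are illustrative placeholders (the clustalw2 numbers are taken from Table 1 below).

```python
def concept_similarity(syntactic, coverage, prop, w3=0.4, w4=0.3, w5=0.3):
    """Equation (2): weighted combination of syntactic, coverage and property
    similarity.  The weight values here are illustrative only."""
    return w3 * syntactic + w4 * coverage + w5 * prop

def total_score(s_dm, s_fn=None, w1=0.5, w2=0.5):
    """Equation (1): weighted sum of the data mediation and functionality
    sub-scores.  If no desired functionality is given, w2 = 0 and the
    remaining weight becomes 1, so the weights still sum to one."""
    if s_fn is None:
        return s_dm
    return w1 * s_dm + w2 * s_fn

# Rank candidate operations by total score (highest first); the muscle
# figures are invented, the clustalw2 ones come from Table 1.
candidates = {"clustalw2.run": (0.386, 1.0), "muscle.run": (0.350, 0.9)}
ranking = sorted(candidates, key=lambda op: total_score(*candidates[op]),
                 reverse=True)
print(ranking)   # ['clustalw2.run', 'muscle.run']
```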


ConceptSimilarity is developed as an independent component so that it can be used by other algorithms to facilitate Semantic Web service discovery [26].

2.2. Functionality Sub-score

A functionality sub-score Sfn is calculated when a desired functionality (the functionality the operation is expected to provide) is provided by the user as the next step they would like to perform. A user can provide this information as a simple string of text describing the desired operation, or the user can choose from a list of concepts that denote objective specifications. If the desired functionality is provided in the form of a string, then Sfn is based on the string metric results between the string and the labels associated with the concepts denoting the candidate operations. In this case, we use the Levenshtein distance [6] as the metric to calculate the difference between the two. If the desired operation is provided in the form of a concept URI, then the functionality sub-score is based on the Concept Similarity (CS) score between the concepts denoting the desiredOp and the candidateOp.

Consider an example where a user requests a suggestion for the next operation (step four in the workflow given in section 3.1) after the filterByEval operation of the FilterSequences Web service. Table 1 shows the scores for the top ranked operation for three cases: (1) no desired functionality given, (2) the desired functionality given as keywords ("multiple sequence alignment"), and (3) the desired functionality given as the concept http://purl.obolibrary.org/obo/multiple_sequence_alignment from the OBI-WS ontology, where we have substituted the label for the class name OBIws_0000063.


Table 1. Scores for Step Four in the Workflow vs. Functionality Annotation

Web service          Operation      Sdm     Sfn     Stotal   fn Annotation
FilterSequencesWS    filterByEval   0.551   0       0.551    none
clustalw2            run            0.386   0.397   0.391    keywords
clustalw2            run            0.386   1.0     0.694    concept
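For the keyword case, the sketch below shows one way a normalized Levenshtein distance can be turned into a functionality sub-score; the candidate label is invented and the sketch is not claimed to reproduce the exact Sfn values in Table 1.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def functionality_subscore(keywords, label):
    """Turn the edit distance into a similarity in [0, 1]."""
    d = levenshtein(keywords.lower(), label.lower())
    return 1.0 - d / max(len(keywords), len(label), 1)

print(functionality_subscore("multiple sequence alignment",
                             "multiple sequence alignment objective"))
```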

2.3. Understanding Properties and Restrictions

An ontology, at its core, is a collection of concepts and the relationships between them. These relationships are modeled by properties and restrictions upon them. Hence, property similarity is an important part of Concept Similarity. A property restriction is a special kind of class description that describes an anonymous class, namely the class of all individuals that satisfy the restriction. Restrictions impose constraints on the range of the property (value constraints) or on the number of values the property can take (cardinality constraints). For example, if there is a value restriction on a property (owl:allValuesFrom), then the range of the property is changed to what is specified by the restriction. Hence, if this information is not captured, properties would be scored incorrectly. Both value and cardinality constraints must be taken into account when calculating the property similarity score for two concepts.
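The effect of a value restriction on a property's effective range can be read directly from the ontology. The sketch below uses rdflib with a hypothetical ontology file and hypothetical IRIs, so it illustrates the idea rather than the engine's actual code; owl:someValuesFrom could be handled analogously.

```python
from rdflib import Graph, RDF, RDFS, OWL, URIRef

g = Graph().parse("obi-ws.owl", format="xml")         # hypothetical ontology file
concept = URIRef("http://example.org/BlastQuery")     # hypothetical class IRI
prop = URIRef("http://example.org/has_program")       # hypothetical property IRI

def effective_range(graph, concept, prop):
    """Effective range of `prop` as seen from `concept`: an owl:allValuesFrom
    restriction that the concept is a subclass of overrides the property's
    globally declared rdfs:range."""
    for restriction in graph.objects(concept, RDFS.subClassOf):
        if (restriction, RDF.type, OWL.Restriction) in graph and \
           graph.value(restriction, OWL.onProperty) == prop:
            filler = graph.value(restriction, OWL.allValuesFrom)
            if filler is not None:
                return filler
    return graph.value(prop, RDFS.range)   # fall back to the declared range
```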


Let PC1 and PC2 be the sets of properties that concept 1 and concept 2 participate in, respectively. We calculate the matrix P given by

P = [propij] (i = 1..m, j = 1..n), where m = |PC1| and n = |PC2|

(3)

The value of propij is the property match score between properties i and j, as given by equation 4. The Syntacticsim between the properties is computed using a string metric that compares the labels and definitions of the two properties. When calculating the Rangesim, the algorithm checks for value restrictions that may exist on the property for the concept under consideration. If an owl:allValuesFrom or owl:someValuesFrom restriction exists, the ranges are updated accordingly. The Cardinalitysim accounts for cardinality restrictions.

propij = w6 · Syntacticsim + w7 · Rangesim + w8 · Cardinalitysim

(4)

The matrix stores property match scores between every pair of properties that the two concepts under consideration participate in. An optimal assignment between the two sets of properties is found by the Hungarian algorithm [16], which gives the final property similarity score, Propertysim.
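The assignment step can be sketched with SciPy's implementation of the Hungarian method. The per-pair scores below are placeholders standing in for equation (4), and the normalization by the larger property count is our own illustrative choice, not necessarily the paper's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def property_similarity(pair_scores):
    """pair_scores[i][j] stands in for prop_ij from equation (4), for the i-th
    property of concept 1 and the j-th property of concept 2.  The Hungarian
    algorithm finds the assignment maximizing the summed score; the result is
    divided by the larger property count so the score stays in [0, 1]."""
    m = np.asarray(pair_scores, dtype=float)
    rows, cols = linear_sum_assignment(-m)     # negate: maximize instead of minimize
    return m[rows, cols].sum() / max(m.shape)

scores = [[0.9, 0.2, 0.1],
          [0.3, 0.8, 0.4]]           # 2 properties vs. 3 properties (illustrative)
print(property_similarity(scores))   # 0.9 and 0.8 are matched -> (0.9 + 0.8) / 3
```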


2.4. Improved Use of Semantics

The Service Suggestion Engine makes use of semantics available from ontological annotations to facilitate the construction and invocation of a workflow. The problem is not completely solved by just suggesting the next operation, as the user also needs to know which output of the previous operation to connect to which input of the subsequent operation. As shown in Figure 1, each operation will typically have multiple inputs and outputs, with some having as many as fifteen inputs. The SSE can help the user with this. For instance, in Figure 1, the SSE indicates that exp can be connected to e-Val and Sequenceid to sId. The Service Suggestion Engine achieves this by keeping track of the highest scoring matched paths in the IODAG when calculating the data mediation score Sdm. At each step in the workflow construction process, the algorithm also checks whether the inputs to the newly added operation can be fed by the outputs of the preceding operation. To achieve this, the algorithm also compares the IODAGs of the outputs of all previous operations with the IODAG of the input of the newly added operation. At each step, all the inputs that cannot be fed from any of the previous outputs are tentatively categorized as Global inputs.

For each identified Global input, the Suggestion Engine assists the user in two ways. First, it tries to suggest possible input values that can be selected by the user. To accomplish this, the SSE makes use of the ontological structure and determines whether the annotated concept has any direct sub-classes or individuals. Consider an actual scenario: the BLAST Web service has an input, 'blast program', which allows the user to select the type of BLAST program to execute. In such a case, the algorithm can provide suggestions such as blastp, blastn and blastx using concepts obtained from the ontology. Second, if the algorithm cannot provide any suggestions for possible values, it may still facilitate user comprehension of the parameter using the definitions included in the annotation properties of the ontology and the documentation included in the WSDL file.

Another example of the Suggestion Engine making the best use of available semantics can be found in the WUBLAST and ClustalW Web services. The Suggestion Engine knows that the run operation of WUBLAST can take only one sequence as input, while the run operation of ClustalW needs at least two sequences in order to perform a multiple sequence alignment. The Suggestion Engine is able to supply this information to the user. It makes use of the cardinality restrictions defined in the ontology as well as information specified in the WSDL document.
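One way to enumerate value suggestions for a Global input is to list the direct subclasses and individuals of its annotated concept, as in the following sketch; the ontology file and the concept IRI are hypothetical stand-ins, not the actual OBI-WS identifiers.

```python
from rdflib import Graph, RDF, RDFS, URIRef

g = Graph().parse("obi-ws.owl", format="xml")                 # hypothetical file
blast_program = URIRef("http://example.org/blast_program")    # hypothetical IRI

def suggest_values(graph, concept):
    """Direct subclasses and individuals of the annotated concept, e.g.
    blastp, blastn and blastx for a 'blast program' input."""
    subclasses = set(graph.subjects(RDFS.subClassOf, concept))
    individuals = set(graph.subjects(RDF.type, concept))
    return sorted(subclasses | individuals)

print(suggest_values(g, blast_program))
```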


3. Interaction with Galaxy

Galaxy is a Web application, developed and maintained by researchers at The Institute for CyberScience at Penn State and Emory University. It was designed to facilitate data integration and the construction and execution of bioinformatics and biomedical workflows. It comes bundled with its own set of tools for use in the construction of such workflows. As mentioned in the introduction, in order to facilitate the use of tools and resources located externally in the form of Web services, we created an extension to Galaxy that enables Web service operations to be added as tools in the Galaxy workflow editor. A user simply provides the URI of the desired WSDL/WADL file and selects some or all of the operations that he or she desires to add. Tools in Galaxy are described using an XML description file, referred to as a "tool-config" file, that serves the same purpose as a Web service description document. Due to its similarity to WADL, we convert the tool-config file to a WADL file to facilitate adding semantic annotations. Therefore, the SSE can perform suggestions using not just the Web services added, but also the existing Galaxy tools.

The Service Suggestion Engine described in this paper is hosted as a JSONP Web service so that it can easily be used by Galaxy and other tools. The SSE itself is not tied to Galaxy and can be used by any workflow composition system of this nature by simply invoking the SSE Web service. To help make Web service composition within Galaxy easier, an interface addition to Galaxy is provided that makes the Suggestion Engine Web service available inside the Galaxy workflow editor. Users can request suggestions for Web service operations to be added after, before, or in the middle of the current workflow process, referred to as forward, backward and bi-directional suggestions, respectively [30]. By using the results provided by the Suggestion Engine, a human designer can easily cope with the input/output details of workflow composition and design. A sketch of how such a client might invoke the SSE is given at the end of this section.

3.1. Common Workflow

To illustrate the utility of semantically annotated Web services, a common workflow scenario is considered. A frequent use case encountered by biologists is that of discovering more information about a particular protein sequence and its evolutionary relationship to other protein sequences. Biologists often utilize several resources in a particular order to accomplish this task, and thus it is an ideal candidate for a workflow. Web services already exist for each of the required resources, and we utilize semantically annotated versions of each for our example. The input to the workflow is a user-supplied protein sequence. The workflow utilizes three popular bioinformatics programs: BLAST [1] for database searching and pair-wise sequence alignment, ClustalW [17] to perform multiple sequence alignment, and Phylip [8] to construct phylogenetic trees.


Additionally, a few other Web services that perform format conversions, data retrieval, etc. are required. For the purposes of this paper, it is assumed that the required Web services have already been annotated and added as tools in Galaxy. The process of creating this workflow is described below.


Figure 2. Workflow Creation 1

The process begins with the addition of the run operation of the WUBLAST Web service, which is annotated with the objective specification "pairwise sequence alignment objective" from the OBI ontology. In order to determine the next step in the workflow, a Galaxy user needs to invoke the Suggestion Engine on the current workflow. This is accomplished by clicking on the "Web Service Tools" button provided at the top of the workflow editor and selecting the Suggestion Engine from the list. Here, the user selects the run operation of WUBLAST as the previous step and clicks the "Make Suggestions" button. The Suggestion Engine returns a ranked list of possible services that can follow the previous step in the workflow. At the top of this list is the getResults operation provided by the WUBLAST Web service. As this is the desired next step in the workflow, the user is able to click on the "Add to Workflow" link that is provided in order to place the operation onto the workflow canvas as a tool.

To complete the next step in the workflow, the user invokes the Suggestion Engine again, this time selecting getResults as the previous step in the workflow. This time, the ranked list that is returned includes the filterByEvalScore operation provided by WSFilterSequences. This particular operation filters the sequences returned by WUBLAST depending upon their e-value and score, which helps the user in narrowing down the number of sequences of interest before performing multiple sequence alignment. Once the sequences have been filtered, the user can invoke the Suggestion Engine again. This time, in addition to selecting the previous step, the desired functionality for the next step, "multiple sequence alignment", is also specified. The ranked list of possible operations for the next step in the workflow includes the run operations of ClustalW2, muscle and tCoffee, which are all multiple sequence alignment programs. As the run operation of ClustalW2 appears at the top of the list and fulfills the desired functionality, the user can add it as the next step in the workflow. Just as in the previous case, the getResults operation provided by the ClustalW2 Web service is suggested by the Suggestion Engine and added to the workflow by the user.

In order to achieve the final goal, a phylogenetic analysis (in this case utilizing a distance approach to phylogenetic estimation), the user needs to generate a distance matrix based on the multiple sequence alignment produced in the previous step. This time, with the user specifying the desired functionality as "protein distance matrix", the SSE returns the protDist operation offered by the WSPhylip Web service. Once this operation is added to the workflow, the user wants to perform the last step of creating a distance-based tree using Phylip's Neighbor program. This time the user specifies the desired functionality as "phylip neighbor" and finds the operation neighbor from WSPhylip, which is then added to the workflow. The completed workflow is illustrated in Figure 3. It should be noted that, throughout the process of workflow creation, the user could have specified the desired functionality using appropriate concepts from the ontology. The results of the rankings produced by the Suggestion Engine after each of the steps in the workflow are analyzed in the Evaluation section.

Figure 3. Completed Workflow
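The paper states only that the SSE is exposed as a JSONP Web service callable from the Galaxy editor. As a rough illustration of such a client, the endpoint URL, parameter names and response shape below are entirely hypothetical.

```python
import requests

# Hypothetical endpoint and parameter names -- the actual SSE interface is
# not documented in this paper.
SSE_URL = "http://example.org/sse/suggest"

def request_suggestions(previous_op, desired_functionality=None):
    """Ask the SSE for a ranked list of operations that can follow previous_op."""
    params = {"previousOp": previous_op}
    if desired_functionality:
        params["desiredOp"] = desired_functionality
    resp = requests.get(SSE_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()   # assumed: a ranked list of {"operation": ..., "score": ...}

suggestions = request_suggestions("WSFilterSequences.filterByEvalScore",
                                  "multiple sequence alignment")
for entry in suggestions:
    print(entry["operation"], entry["score"])
```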

4. Evaluation

We have focussed our evaluation on a performance comparison of the Suggestion Engine relative to a consensus ranking by human experts. The evaluation setup comprises 60 Web service operations from the following 11 Web services: WUBLAST, NCBIBlast, PSIBlast, ClustalW2, TCoffee, WSDBFetch, WSConverter, Fasta, Muscle, WSFilterSequences and WSPhylip. We have used the enriched Ontology for Biomedical Investigations (OBI) from our previous efforts [11] to annotate the Web services. All of the SAWSDL files and the ontology used in this evaluation can be downloaded from http://mango.ctegd.uga.edu/jkissingLab/SWS/Wsannotation/sawsdls.html.

A common bioinformatics workflow involving 7 steps (described above) is used as the basis for the evaluation. For the purpose of evaluation, the algorithm, in addition to returning a ranked list of operations, classifies them as high or low. The human consensus on the operations is also categorized as high or low. Since the operations classified as high are the ones that can ideally follow the current operation in the workflow, the performance of the Service Suggestion Engine is measured using precision and recall between the two sets. An ideal match indicates a reasonable choice for the next operation in order to advance the design a step further (although the workflow from section 3.1 specifies a unique operation for each step, the evaluation considers all reasonable next operations). Precision (Eq. 5) and recall (Eq. 6) are commonly used measures of the quality of retrieved results (in our case the results are service operations) [19]. Precision (P) is the fraction of retrieved results that are relevant, recall (R) is the fraction of relevant results that are retrieved, and F-measure (F) is the harmonic mean of precision and recall.

P = (RelevantResults ∩ RetrievedResults) / RetrievedResults    (5)

R = (RelevantResults ∩ RetrievedResults) / RelevantResults     (6)

F = 2 · (P · R) / (P + R)                                      (7)
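Equations (5)–(7) amount to the usual set computation; a minimal sketch follows, in which the operation names are invented for illustration.

```python
def precision_recall_f(relevant, retrieved):
    """Equations (5)-(7): precision, recall and their harmonic mean,
    computed over sets of suggested service operations."""
    hits = len(set(relevant) & set(retrieved))
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

print(precision_recall_f(
    relevant={"clustalw2.run", "muscle.run", "tcoffee.run"},
    retrieved={"clustalw2.run", "muscle.run", "wsdbfetch.fetchData"}))
```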

Figure 4 shows the precision and recall values for all the steps for two different cases. Case 1 represents Web services with no annotations, and case 2 represents Web services with annotations on the input and output messages only. When no annotations are present, only a syntactic match between the inputs and outputs is considered, and hence success depends only on consistent naming conventions between different Web services, which is rarely the case. Working with Web services with no annotations gives an average precision of 0.33 and recall of 0.33. With annotations on input and output messages only, we obtained an average precision and recall of 0.62 and 0.87, respectively.

Figure 4. Precision and Recall : (1) Un-annotated Web services (2) Annotations on input and output messages

Figure 5 depicts the precision and recall values for Web services with annotations on input-output messages as well as the functionality of operations. Annotations with respect to the functionality of operations are important when doing semi-automatic workflow composition. This enrichment combines the input-output matching with the user's knowledge concerning the type of functionality that is desired. The graph on the left represents results obtained when the user supplies the desired functionality as text, giving precision and recall values of 0.64 and 0.98, respectively. The one on the right represents results obtained when the functionality is supplied as a concept in the ontology, giving 0.69 precision and 0.98 recall.


Figure 5. Precision and Recall : Annotations on Input-Output Messages and Functionality

Figure 6. Average precision, Recall and F-measure for different levels of annotation

Figure 6 is a plot of the average F-measure for all the steps for Web services with different levels of annotation. The F-measure, being the harmonic mean of precision and recall, is a measure of the overall effectiveness of the results. When going from Web services that are not annotated to services with annotations on input/output messages, we observe an increase in F-measure of 0.392 (from 0.333 to 0.722), more than a 100% improvement. Adding annotations on the functionality of operations shows a steady improvement in the F-measure. When the user supplies the desired functionality as text, the F-measure obtained is 0.774; with the functionality provided as a concept in the ontology, the F-measure is 0.812.

5. Related Work

In recent years, work has been done to advance the area of service composition, especially Web service composition (WSC). As early as 2002, McIlraith et al. [20] proposed the use of planning techniques for automatic composition of semantic Web services; however, this approach did not work well when workflow designers wished to influence the configuration of services during the composition process. In 2004, Kim et al. [14,15] proposed a Composition Analysis Tool (CAT) that provides feedback to composition designers as to whether a particular composition can be executed. As this work did not include user evaluations, statistical comparisons to compositions designed by domain experts were not performed. In 2005, a survey by Rao et al. [21] detailed the three main approaches to WSC: manual, semi-automatic and fully-automated composition. Using a combination of these three techniques, an architecture for WSC that provides these different levels of automation was proposed. Later work by Charif-Djebbar et al. [4] proposed to further automate the service composition process using unplanned service-based dialogs among the agents who provide Web services. Cheng et al. [5] integrated Case-Based Reasoning (CBR), the process of solving new problems based on previous case studies, to compose Web services in a similar way. In all three papers, cases were made against fully-automated WSC due to the many complexities found in Web service environments. The work of Hull et al. [12] suggests that a semi-automatic composition process is preferable for service composition in general.

Schaffner et al. [22,23,24] also integrated a form of CBR in their study, proposing a semi-automated service composition approach based on mixed initiative features derived from an industrial case study. These features include filtering inappropriate services, checking composition validity, and suggesting partial plans. Pre-conditions and effects are used from the Web Service Modeling Ontology (WSMO) 5. Evaluations were done on compositions for Business Process Management (BPM) scenarios using only composition validity as a metric. No statistical comparisons to compositions created by domain experts were provided (e.g., precision and recall). In 2008, DiBernardo et al. [7] proposed a composition client that ranks services in order to provide suggestions; however, the degree to which this aids designers in the WSC process is unknown, as no comparison evaluations were made here either.

More recently, other frameworks for semi-automatic WSC have been proposed. In 2010, Khattak et al. [13] proposed a framework that operates mostly at the service level; however, due to the absence of an evaluation or actual implementation, it is unclear how useful the suggestions provided by this approach are during the actual composition process. The work of Lécué [18] is similar to Schaffner et al. in that suggestions are provided primarily by first filtering out the services which cannot be composed. In later work by Canturk et al. [3], a similar approach is proposed that focuses on the discovery of Web services so that the WSC process can be started. An approach similar to the one presented in this paper is proposed by Schönfisch et al. [25], which focuses on approximate subsumption of inputs/outputs in order to provide suggestions during the WSC process.

5 Web Service Modeling Ontology (WSMO): http://www.w3.org/Submission/WSMO/

6. Conclusions and Future Work

This study focussed on improving and extending the Service Suggestion Engine, based on our previous design effort for assisting users with workflow construction. Re-engineered code, improvements in the calculation of metrics, and consideration of restrictions on concepts when calculating Concept Similarity have yielded substantially improved performance. We have gone a step further in making use of available semantics to help users find the next operation and build the workflow. This includes help with the appropriate connection of outputs and inputs, providing lists of possible values the inputs can take, and providing documentation for the input-output parameters that helps the user better understand and run the workflow. Furthermore, work was performed on interfacing the Service Suggestion Engine with the Galaxy Tool Integration and Workflow Platform, enabling invocation of the SSE as a Web service from the Galaxy workflow editor. This also required enhancements to Galaxy's user interface to facilitate the user dialog.

Currently no mechanism exists to check whether a Web service added as a tool to Galaxy is up and running. We are working on implementing this functionality, which would make sure that an added Web service is available before the workflow is executed. We are considering the use of schema mapping (lifting and lowering) to further enhance data mediation between Web services using specified mappings. We would also like to reconsider the impact of adding pre-conditions and effects [30], since we expect that they will further increase precision (pre-conditions and effects were dropped in the latest version for the sake of simplicity). Once we have considered schema mappings and pre-conditions and effects, we plan to do an extensive evaluation that includes query times in addition to precision and recall, for multiple workflow scenarios and additional Web services.

Acknowledgements: Funding for this study was provided in part by NIH R01 GM093132.

References

[1] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman. Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Research, 25(17):3389–3402, 1997.
[2] D. Blankenberg, G. Von Kuster, N. Coraor, G. Ananda, R. Lazarus, M. Mangan, A. Nekrutenko, and J. Taylor. Galaxy: A Web-Based Genome Analysis Tool for Experimentalists. Current Protocols in Molecular Biology, 19(19.10):11–19, 2010.
[3] D. Canturk and P. Senkul. Using Semantic Information for Distributed Web Service Discovery. International Journal of Web Science, 1(1):21–35, 2011.
[4] Y. Charif-Djebbar and N. Sabouret. Dynamic Web Service Selection and Composition: An Approach Based on Agent Dialogues. In Proceedings of the 2006 International Conference on Service-Oriented Computing, pages 515–521, 2006.
[5] R. Cheng, S. Su, F. Yang, and Y. Li. Using Case-based Reasoning to Support Web Service Composition. Computational Science – ICCS 2006, pages 87–94, 2006.
[6] F.J. Damerau. A Technique for Computer Detection and Correction of Spelling Errors. Communications of the ACM, 7(3):171–176, 1964.
[7] M. DiBernardo, R. Pottinger, and M. Wilkinson. Semi-Automatic Web Service Composition for the Life Sciences using the BioMoby Semantic Web Framework. Journal of Biomedical Informatics, 41(5):837–847, 2008.
[8] J. Felsenstein. PHYLIP (Phylogeny Inference Package), version 3.5c. Joseph Felsenstein, 1993.
[9] C.A. Goble, K. Belhajjame, F. Tanoh, J. Bhagat, K. Wolstencroft, R. Stevens, E. Nzuobontane, H. McWilliam, T. Laurent, and R. Lopez. BioCatalogue: A Curated Web Service Registry for the Life Science Community. April 2009.
[10] T.R. Gruber et al. Toward Principles for the Design of Ontologies used for Knowledge Sharing. International Journal of Human Computer Studies, 43(5):907–928, 1995.
[11] C. Guttula, A. Dhamanaskar, R. Wang, J.A. Miller, J.C. Kissinger, J. Zheng, and C.J. Stoeckert Jr. Enriching the Ontology for Biomedical Investigations (OBI) to Improve its Suitability for Web Service Annotations. In Proceedings of the 2011 International Conference on Biomedical Ontology, Buffalo, New York, pages 246–248. ICBO, 2011.
[12] R. Hull, M. Benedikt, V. Christophides, and J. Su. E-Services: A Look Behind the Curtain. In Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 1–14. ACM, 2003.
[13] A.M. Khattak, Z. Pervez, A.M.J. Sarkar, and Y.K. Lee. Service Level Semantic Interoperability. In Proceedings of the 2010 International Conference on Applications and the Internet, pages 387–390. IEEE, 2010.
[14] J. Kim, Y. Gil, and M. Spraragen. A Knowledge-based Approach to Interactive Workflow Composition. In Proceedings of the 14th International Conference on Automatic Planning and Scheduling, 2004.
[15] J. Kim, M. Spraragen, and Y. Gil. An Intelligent Assistant for Interactive Workflow Composition. In Proceedings of the 9th International Conference on Intelligent User Interfaces, pages 125–131. ACM, 2004.
[16] H.W. Kuhn. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[17] M.A. Larkin, G. Blackshields, N.P. Brown, R. Chenna, P.A. McGettigan, H. McWilliam, F. Valentin, I.M. Wallace, A. Wilm, R. Lopez, et al. ClustalW and ClustalX version 2.0. Bioinformatics, 23(21):2947–2948, 2007.
[18] F. Lécué. Combining Collaborative Filtering and Semantic Content-based Approaches to Recommend Web Services. In Proceedings of the 2010 International Conference on Semantic Computing, pages 200–205. IEEE, 2010.
[19] D.D. Lewis and W.A. Gale. A Sequential Algorithm for Training Text Classifiers. In Proceedings of the 1994 International Conference on Research and Development in Information Retrieval, pages 3–12. Springer-Verlag New York, Inc., 1994.
[20] S. McIlraith and T.C. Son. Adapting Golog for Composition of Semantic Web Services. In Proceedings of the 2002 International Conference on Principles of Knowledge Representation and Reasoning, pages 482–496. Morgan Kaufmann, 2002.
[21] J. Rao and X. Su. A Survey of Automated Web Service Composition Methods. Semantic Web Services and Web Process Composition, pages 43–54, 2005.
[22] J. Schaffner. Supporting the Modeling of Business Processes Using Semi-Automated Web Service Composition Techniques. Master's thesis, Hasso-Plattner-Institute for IT Systems Engineering, University of Potsdam, Potsdam, Germany, 2006.
[23] J. Schaffner, H. Meyer, and C. Tosun. A Semi-Automated Orchestration Tool for Service-Based Business Processes. Service-Oriented Computing, pages 50–61, 2007.
[24] J. Schaffner, H. Meyer, and M. Weske. A Formal Model for Mixed Initiative Service Composition. In Proceedings of the 2007 International Conference on Services Computing, pages 443–450. IEEE, 2007.
[25] J. Schönfisch, W. Chen, and H. Stuckenschmidt. A Purely Logic-Based Approach to Approximate Matching of Semantic Web Services. In Proceedings of the 2009 International Workshop on Service Matchmaking and Resource Retrieval in the Semantic Web, April 2009.
[26] A. Sheth, K. Verma, J. Miller, and P. Rajasekaran. Enhancing Web Service Descriptions using WSDL-S. Research-Industry Technology Exchange at EclipseCon, pages 1–2, March 2005.
[27] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L.J. Goldberg, K. Eilbeck, A. Ireland, C.J. Mungall, et al. The OBO Foundry: Coordinated Evolution of Ontologies to Support Biomedical Data Integration. Nature Biotechnology, 25(11):1251–1255, 2007.
[28] R. Wang, D. Brewer, S. Shastri, S. Swayampakula, J.A. Miller, E.T. Kraemer, and J.C. Kissinger. Adapting the Galaxy Bioinformatics Tool to Support Semantic Web Service Composition. In Proceedings of the 2009 World Conference on Services-I, pages 283–290. IEEE, 2009.
[29] R. Wang, S. Ganjoo, J.A. Miller, and E.T. Kraemer. Ranking-Based Suggestion Algorithms for Semantic Web Service Composition. In Proceedings of the 2010 6th World Congress on Services (SERVICES-1), pages 606–613. IEEE, 2010.
[30] R. Wang, C. Guttula, M. Panahiazar, H. Yousaf, J.A. Miller, E.T. Kraemer, and J.C. Kissinger. Web Service Composition using Service Suggestions. In Proceedings of the 2011 IEEE World Congress on Services, pages 482–489. IEEE, 2011.


Part 2


Ontologies of Physical Entities


Formal Ontology in Information Systems
M. Donnelly and G. Guizzardi (Eds.)
IOS Press, 2012
© 2012 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-084-0-45


The Void in Hydro Ontology

Torsten HAHMANN a and Boyan BRODARIC b
a Department of Computer Science, University of Toronto, Toronto, ON, Canada
b Geological Survey of Canada, Natural Resources Canada, Ottawa, ON, Canada

Abstract. Voids are extremely important to water science, because their size and connectivity determines the storage and flow of water both above and below the ground surface. While previous formal theories about voids strictly consider holes hosted inside objects, we generalize voids to also include spaces between objects, and distinguish voids in macroscopic objects from those occurring microscopically in an object's matter. These notions are axiomatized in first-order logic as an extension of the DOLCE ontology, and are applied to key aspects of hydrology and hydrogeology, laying the groundwork for a foundational hydro ontology.

Keywords. hydro ontology, hydrogeology, mereotopology, constituency, physical void, hole, gap, pore space, DOLCE


1. Introduction

Water is a natural resource necessary for human life. It is found in the atmosphere, on the ground surface, and in the subsurface. Environmental conditions drive the movement of water between these spheres, as reflected in the well-known water cycle. Many scientific and social issues require a sophisticated understanding of the water cycle, including climate change, flood risk, and groundwater contamination. Such issues also require increased access to large volumes of water data, which are starting to be provided by Spatial Data Infrastructures (SDI) [16] in countries such as the USA, Canada, Australia, and Europe. However, at present there exist multiple SDI data standards emerging for both the surface and subsurface water domains, and these are being developed somewhat independently, leading to incomplete and incompatible representations and a lack of coupling between them. This hinders water cycle modelling, which requires an integrated approach to water data, and signals a need for the development of a reference ontology to bridge and disambiguate conceptual differences.

Although the water cycle provides boundary entities that straddle the surface and subsurface domains, e.g. the notion of 'baseflow' as the discharge of water from subsurface to surface, a strict focus on boundary entities ignores shared foundational aspects, such as a container schema which represents water contained in holes or gaps. Thus, an important component of a reference hydro ontology is the unified representation of boundary entities and common foundational entities. In this paper we begin to develop such a representation, by building a formal theory for specific aspects of the physical containment of water. In particular, we classify and characterize empty spaces – voids – that can be filled with water, and show how the resulting distinctions can be used to help characterize key water entities, such as an aquifer or lake.

This work makes the following original contributions: it generalizes formal theories of holes to voids, to include both spaces within objects (holes) and between objects (gaps); it provides a physical characterization of voids, distinguishing and relating voids hosted by physical objects and the matter that constitutes them, leading to a refined taxonomy of voids; it specializes the DOLCE [17] foundational ontology with physical voids and associated relations, and in doing so it formalizes some key entities in hydro ontology using first-order logic, laying a basis for its further development. In this sense, this work is very much an ontology engineering task, extending an existing foundational ontology through the refinement of some fundamental distinctions about voids.

The paper is organized as follows: Sec. 2 discusses semantic issues in water data standards that motivate this work, and Sec. 3 identifies gaps in related work on voids. In sections 4, 5, and 6 we develop the formal theory on physical voids, focusing respectively on spatial regions, physical objects, and physical voids, with application mainly to hydrogeology. Sec. 7 concludes with a brief summary and a note about future directions.

Figure 1. Examples of physical objects (rock body, water body, aquifer, rock) and physical voids (water well, ground depression, gaps between rocks) in hydrogeology. Different kinds of matter are shown at the top, and various features are illustrated on the right (ground surface, water surface, rock surface, water table).


2. Ontological Issues in Groundwater Data Standards

The lack of surface-subsurface integration, and the occurrence of semantic incompatibilities, in present water data standards can be demonstrated by comparing emerging groundwater data schema: the INSPIRE schema [15] is a Europe-wide initiative, and the Groundwater Markup Language (GWML) [3] is mainly used in North America. Comparison of the schemas reveals the following issues:

• Semantic ambiguity: it is unclear whether the INSPIRE GroundWaterBody refers to a specific amount of water, which might change location, or to the object consisting of water but fixed to a specific location. E.g. the water body in the Ogallala aquifer of the US Great Plains is a timeless entity tied to the location of the aquifer, whereas the specific amounts of water that constitute the water body change over time – they can enter or leave the aquifer. The related GWML GroundwaterBody is clearer, as it distinguishes a water body object from the matter that comprises it, but this still leaves in question the compatibility of the INSPIRE and GWML entities.
• Semantic incompleteness: an aquifer is characterized similarly in both INSPIRE and GWML; however, both UML representations are incomplete, capturing only a fragment of the intended meaning evident in the accompanying text, i.e. both indicate that an aquifer is a rock body, but do not capture the fact that an aquifer is wet, porous, permeable, and yields water to wells. The expressivity limits of UML are partially responsible, but in the end the conditions expressed in the text are not completely formalized.


• Semantic granularity: elements of the containment schema are quite general in GWML: GWML Reservoir denotes the sum of fillable empty spaces in a rock body, but voids in general are not further differentiated; thus, it is impossible to distinguish large holes such as caves from minute gaps such as spaces between rock grains. The INSPIRE schema lacks voids altogether. This highlights a significant difference between GWML and INSPIRE: a GWML aquifer is a rock body that hosts a space filled by a water body composed of groundwater, and in INSPIRE it is a rock body that does not host any space nor contain groundwater directly, rather it contributes to a whole that consists of the rock body and water body as parts.
• Surface-subsurface disconnection: the groundwater schemas are largely disconnected from relevant surface water schemas, with neither boundary nor shared entities represented. The INSPIRE schemas are an exception, in that a specialization of GroundWaterBody exists in both the surface and subsurface schemas, but with different parent entities in each schema, leading to further ambiguities about the intended meaning of the entity.


3. Hydro Ontology and Related Work Key to our approach is the notion that a containment schema is central to hydro ontology. In the schema, a physical container hosts a void in which water can be stored and through which it can flow. Voids are considered here to be physical entities devoid of the hosting body’s matter, but that can be filled by other matter such as gases, liquids, or solids. The nature of the containers and voids is different for surface and subsurface water entities, as illustrated in Figure 1. Surface water is stored and transmitted in depressions in the ground surface, such as in lakes or rivers. Subsurface water exists in the gaps between unconsolidated materials such as sand or gravel, in the gaps between the grains and crystals that make up consolidated rock bodies, and in the spaces between rock bodies or cavities within them, such as caves. The size and connectivity of voids determines important qualities associated with the storage and flow of water in both the surface and subsurface, and these constitute perhaps the most important attributes for water science. As the focus of this paper is on voids and their relation to hosting containers, in a hydro context, the relevant related work includes formal theories of holes, places, and other hydro entities. The seminal formal theory on holes is developed by Casati & Varzi [6]. Holes are self-connected empty spaces, classified as cavities (e.g. caves), tunnels (e.g. donut holes), or hollows (e.g. canyons), and their hosts are non-scattered wholes. Holes are thus appropriate to represent surface water hollows as well as major subsurface cavities, but gaps between objects are not included, a critical shortcoming in the representation of subsurface voids. The relations between voids at different physical scales is also not considered, such as between macroscopic voids in an object and the microscopic voids in the matter constituting the object; e.g. between a canyon containing a lake and the tiny spaces in the aquifer materials beneath the lake, as shown in Fig. 1. In other work, formal representation of holes is also discussed in relation to relative places and surface features, such as cracks, but these are less relevant to hydro ontology [10,13]. Voids are typically not represented in foundational ontologies, though most such as DOLCE possess a general category for dependent entities [17]. Existing work on hydro ontology focuses on the physical container for surface water – to enable identification and classification [14,18], on vocabularies mainly for water constituents [1,2], and on basic hydrogeology entities [5,19], but voids are not considered in any substantial way.


4. Spatial Regions A central goal of this paper is to represent aspects of voids from primarily a hydrogeological perspective. For that purpose, we are only concerned here with enduring entities that are located in physical space such as rock formations, sediments, and various kinds of water-related bodies such as rivers, lakes, groundwater, aquifers, and wells. Perdurants such as processes, plus non-physical entities, are out of scope, and while dependent qualities such as volume or depth are important to water science, they are of secondary concern here and are not considered in this work. A key notion here is that of a spatial region, which we discuss from a geometrical, indeed mereotopological, perspective. We distinguish the physical space populated by real physical entities from an abstract space populated by spatial regions, which are of purely geometrical and topological nature. This distinction allows us to consider abstract space as a mathematical-logical construct, which provides flexibility in spatial operations. To map entities located in physical space to their associated abstract spatial regions, we reuse the region function r(x) from layered mereotopology [8,9]1 . The range of the region function defines the DOLCE category of ‘spatial region’ S (S1, S2, S-T1). We refer to entities of this category henceforth simply as ‘regions’. Throughout, all axioms and definitions are assumed to be implicitly universally quantified.


(S1)   S(r(x))                  (the range of the region function are spatial regions)
(S2)   S(x) ↔ x = r(x)          (spatial regions are their own region)
(S-T1) r(r(x)) = r(x)           (region function idempotent; from S1 and S2)

Regions are related to each other by spatial inclusion ⊆r (S3)2 . We say x ⊆r y iff the spatial region x is a subregion of y. ⊆r is the dimension-independent mereological relation (S4, S5, S6) adapted from [11,12], with the original axiom numbering included in parentheses. For convenience we maintain ZEX (x) to denote a unique zero region of no extent and no location (S7). While ⊆r is restricted to regions, it can be extended as ⊆ to non-regions (S8) so that all subsequently defined relations equally apply to regions and non-regions, unless otherwise noted. We further define proper spatial inclusion ⊂ (S9) and contact C (C-D) in terms of ⊆. If an entity x is in contact with a proper subset of the entities some y is in contact with, x must be properly spatially included in y (S10). To compare regions dimensionally, we reuse the axiomatization of the primitive relation x