Lecture Notes in Computer Science 6412
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
Jeffrey Parsons Motoshi Saeki Peretz Shoval Carson Woo Yair Wand (Eds.)
Conceptual Modeling – ER 2010 29th International Conference on Conceptual Modeling Vancouver, BC, Canada, November 1-4, 2010 Proceedings
Volume Editors
Jeffrey Parsons, Memorial University of Newfoundland, St. John’s, NL, Canada
Motoshi Saeki, Tokyo Institute of Technology, Tokyo, Japan
Peretz Shoval, Ben-Gurion University of the Negev, Beer-Sheva, Israel
Carson Woo, University of British Columbia, Vancouver, BC, Canada
Yair Wand, University of British Columbia, Vancouver, BC, Canada
Library of Congress Control Number: 2010936075
CR Subject Classification (1998): D.2, F.3, D.3, I.2, F.4.1, D.2.4
LNCS Sublibrary: SL 3 – Information Systems and Application, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-642-16372-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-16372-2 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
This publication comprises the proceedings of the 29th International Conference on Conceptual Modeling (ER 2010), which was held this year in Vancouver, British Columbia, Canada.

Conceptual modeling can be considered as lying at the confluence of the three main aspects of information technology applications: the world of the stakeholders and users, the world of the developers, and the technologies available to them. Conceptual models provide abstractions of various aspects related to the development of systems, such as the application domain, user needs, database design, and software specifications. These models are used to analyze and define user needs and system requirements, to support communications between stakeholders and developers, to provide the basis for systems design, and to document the requirements for and the design rationale of developed systems. Because of their role at the junction of usage, development, and technology, conceptual models can be very important to the successful development and deployment of IT applications.

Therefore, the research and development of methods, techniques, tools and languages that can be used in the process of creating, maintaining, and using conceptual models is of great practical and theoretical importance. Such work is conducted in academia, research institutions, and industry. Conceptual modeling is now applied in virtually all areas of IT applications, and spans varied domains such as organizational information systems, systems that include specialized data for spatial, temporal, and multimedia applications, and biomedical applications.

The annual International Conference on Conceptual Modeling is the premier forum for presenting and discussing developments in the research and practice of conceptual modeling. The diversity of the conference is manifested in the call for papers.
The call this year included information modeling; semantics, metadata, and ontology; Web information systems and the Semantic Web; business process modeling and enterprise architecture; semi-structured data and XML; integration of models and data; information retrieval, filtering, classification, and visualization; methods, tools, evaluation approaches, quality and metrics; requirements engineering, reuse, and reverse engineering; maintenance, change and evolution of models; integrity constraints and active concepts; knowledge management and business intelligence; logical foundations; and empirical methods. We are delighted to provide you with an exciting technical program this year. The Program Committee received 147 submissions from authors in 32 countries, reflecting the international nature of the conference. Thirty submissions were accepted as full papers for presentation and publication in the proceedings (an acceptance rate of 20%). The authors of a further seven papers were invited to present in poster sessions. Their papers are included as short papers (six pages) in the proceedings. The technical program consisted of 10 sessions covering all aspects of conceptual modeling and related topics. The technical program included one panel, dedicated to empirical methods in conceptual modeling research. In addition, the poster session
included two demonstrations. In parallel to the technical sessions were two additional streams, combining specialized workshops and tutorials. One of the workshops was dedicated to the Doctoral Consortium. Most workshops represented continuing activities from previous ER conferences. As well, we were fortunate to obtain the participation of three keynote speakers, each providing a different perspective on IT: industry and consulting (John Thorp of Thorp Network Inc.), research and development (Mamdouh Ibrahim of IBM), and IT management (Ted Dodds of the University of British Columbia).

We would like to thank all those who helped put this program together. The Program Chairs extend special thanks to the 70 members of the Program Committee who worked many long hours reviewing and discussing the submissions. They tolerated frequent reminders with good humor. The high standard of their reviews not only provided authors with outstanding feedback but also substantially contributed to the quality of the technical program. It was a great pleasure to work with such a dedicated group of researchers. Thanks go also to the 72 external reviewers who helped with their assessments. They are individually acknowledged in the proceedings.

We would also like to especially thank the chairs of the various activities that make any conference diverse and interesting. These included Workshop Chairs Gillian Dobbie and Juan-Carlos Trujillo; Doctoral Consortium Chairs Andrew Burton-Jones, Paul Johannesson, and Peter Green; Tutorial Chairs Brian Henderson-Sellers and Vijay Khatri; Panel Chairs Bernhard Thalheim and Michael Rosemann; and Demonstrations Program and Posters Chairs Gove Allen and Hock Chan. We are very grateful to Sase Singh, the Proceedings Chair, for working with the authors and conference submission system to organize the conference proceedings. Palash Bera helped us in publicizing the conference. William Tan was always available as our Webmaster.
Heinrich Mayr and Oscar Pastor from the ER steering committee were generous with their time in answering our questions and providing guidance. We thank Andrew Gemino, as Local Arrangement Co-chair, for making sure that the conference ran smoothly. Finally, special thanks are due to Jessie Lam, who, in her role as a Local Arrangement Co-chair, made a major contribution to making everything happen.

All aspects of the paper submission and reviewing processes were handled using the EasyChair Conference Management System. We thank the EasyChair development team for making this outstanding system freely available to the scientific community.

Finally, we would like to thank the authors of all submitted papers, workshops, tutorials, panels, and software demonstrations, whether accepted or not, for their outstanding contributions. These contributions are critical to the high quality of an ER conference, and without them this conference could not have taken place.
November 2010
Jeffrey Parsons Motoshi Saeki Peretz Shoval Yair Wand Carson Woo
ER 2010 Conference Organization
Conference Co-chairs
Yair Wand, University of British Columbia, Canada
Carson Woo, University of British Columbia, Canada

Program Co-chairs
Jeffrey Parsons, Memorial University of Newfoundland, Canada
Motoshi Saeki, Tokyo Institute of Technology, Japan
Peretz Shoval, Ben-Gurion University of the Negev, Israel

Workshop Chairs
Gillian Dobbie, University of Auckland, New Zealand
Juan-Carlos Trujillo, Universidad de Alicante, Spain

Doctoral Consortium Chairs
Andrew Burton-Jones, University of British Columbia, Canada
Paul Johannesson, Stockholm University and the Royal Institute of Technology, Sweden
Peter Green, University of Queensland, Australia

Tutorial Chairs
Brian Henderson-Sellers, University of Technology Sydney, Australia
Vijay Khatri, Indiana University, USA

Panel Chairs
Bernhard Thalheim, Christian-Albrechts-Universität zu Kiel, Germany
Michael Rosemann, Queensland University of Technology, Australia

Demonstrations Program and Posters Chairs
Gove Allen, Brigham Young University, USA
Hock Chan, National University of Singapore, Singapore

Proceedings Chair
Sase Singh, University of British Columbia, Canada

Local Arrangement Chairs and Treasurers
Andrew Gemino, Simon Fraser University, Canada
Jessie Lam, University of British Columbia, Canada

Publicity Chair
Palash Bera, Texas A&M International University, USA

Webmaster
William Tan, University of British Columbia, Canada

Steering Committee Liaison
Heinrich Mayr, University of Klagenfurt, Austria
Program Committee
Akhilesh Bajaj, University of Tulsa, USA
Carlo Batini, University of Milano, Italy
Zohra Bellahsene, University of Montpellier II, France
Boualem Benatallah, University of New South Wales, Australia
Mokrane Bouzeghoub, Université de Versailles, France
Andrew Burton-Jones, University of British Columbia, Canada
Silvana Castano, Università degli Studi di Milano, Italy
Roger Chiang, University of Cincinnati, USA
Philippe Cudre-Mauroux, MIT, USA
Alfredo Cuzzocrea, University of Calabria, Italy
Joseph Davis, University of Sydney, Australia
Umesh Dayal, HP Labs, USA
Johann Eder, Universität Vienna, Austria
Ramez Elmasri, University of Texas-Arlington, USA
David W. Embley, Brigham Young University, USA
Opher Etzion, IBM Research Labs, Haifa, Israel
Joerg Evermann, Memorial University of Newfoundland, Canada
Alfio Ferrara, University of Milano, Italy
Xavier Franch, Universitat Politècnica de Catalunya, Spain
Piero Fraternali, Politecnico di Milano, Italy
Avigdor Gal, Technion Institute of Technology, Israel
Andrew Gemino, Simon Fraser University, Canada
Paolo Giorgini, University of Trento, Italy
Paulo Goes, University of Arizona, USA
Jaap Gordijn, Vrije Universiteit Amsterdam, The Netherlands
Peter Green, University of Queensland, Australia
Giancarlo Guizzardi, Universidade Federal do Espírito Santo, Brazil
Peter Haase, Universität Karlsruhe, Germany
Jean-Luc Hainaut, University of Namur, Belgium
Sven Hartmann, Clausthal University of Technology, Germany
Brian Henderson-Sellers, University of Technology Sydney, Australia
Howard Ho, IBM Almaden Research Center, USA
Manfred Jeusfeld, Tilburg University, The Netherlands
Paul Johannesson, Stockholm University & the Royal Institute of Technology, Sweden
Vijay Khatri, Indiana University, USA
Tsvika Kuflik, Haifa University, Israel
Alberto Laender, Universidade Federal de Minas Gerais, Brazil
Qing Li, University of Hong Kong, China
Stephen Liddle, Brigham Young University, USA
Tok-Wang Ling, National University of Singapore, Singapore
Peri Loucopoulos, Loughborough University, UK
Mirella M. Moro, Universidade Federal de Minas Gerais, Brazil
Takao Miura, Hosei University, Japan
John Mylopoulos, University of Trento, Italy
Moira Norrie, ETH Zurich, Switzerland
Antoni Olivé, Universitat Politècnica de Catalunya, Spain
Sylvia Osborn, University of Western Ontario, Canada
Oscar Pastor, Technical University of Valencia, Spain
Zhiyong Peng, Wuhan University, China
Barbara Pernici, Politecnico di Milano, Italy
Dimitris Plexousakis, University of Crete, Greece
Sudha Ram, University of Arizona, USA
Iris Reinhertz-Berger, Haifa University, Israel
Lior Rokach, Ben-Gurion University, Israel
Colette Rolland, University Paris 1 Panthéon-Sorbonne, France
Gustavo Rossi, Universidad de La Plata, Argentina
Klaus-Dieter Schewe, Information Science Research Centre, New Zealand
Graeme Shanks, University of Melbourne, Australia
Richard Snodgrass, University of Arizona, USA
Pnina Soffer, Haifa University, Israel
Il-Yeol Song, Drexel University, USA
Ananth Srinivasan, University of Auckland, New Zealand
Veda Storey, Georgia State University, USA
Arnon Sturm, Ben-Gurion University, Israel
Ernest Teniente, Universitat Politècnica de Catalunya, Spain
Bernhard Thalheim, University of Kiel, Germany
Riccardo Torlone, Università Roma Tre, Italy
Juan Trujillo, University of Alicante, Spain
Aparna Varde, Montclair State University, USA
Vânia Vidal, Universidade Federal do Ceará, Brazil
Kevin Wilkinson, HP Labs, USA
Eric Yu, University of Toronto, Canada
External Referees
Sofiane Abbar, Raian Ali, Toshiyuki Amagasa, Birger Andersson, Sven Arnhold, Claudia P. Ayala, Zhifeng Bao, Moshe Barukh, Seyed Mehdi Reza Beheshti, Maria Bergholtz, Alexander Bergmayr, Windson Carvalho, Van Munin Chhieng, Paolo Ciaccia, Anthony Cleve, Fabiano Dalpiaz, Fabien Duchateau, Golnaz Elahi, Bernhard Freundenthaler, Irini Fundulaki, Matteo Golfarelli, Adnene Guabtni, Lifan Guo, Jon Heales, Patrick Heymans, Ela Hunt, Marta Indulska, Ritu Khare, Markus Kirchberg, Kerstin Klemisch, Haridimos Kondylakis, Fernando Lemos, Maya Lincoln, An Liu, Lidia López, Hui Ma, José Macedo, Amel Mammar, Sabine Matook, Stefano Montanelli, Christine Natschläger, Matteo Palmonari, Paolo Papotti, Horst Pichler, Anna Queralt, Al Robb, Fiona Rohde, Oscar Romero, Seung Ryu, Tomer Sagi, Ana Carolina Salgado, Michael Schmidt, Pierre-Yves Schobbens, Isamu Shioya, Nobutaka Suzuki, XuNing Tang, Ornsiri Thonggoom, Thu Trinh, Domenico Ursino, Gaia Varese, Hung Vu, Kei Wakabayashi, Jing Wang, Qing Wang, Chiemi Watanabe, Ingo Weber, Robert Woitsch, Huayu Wu, Haoran Xie, Liang Xu, Lijuan Yu, Rui Zhang.
Organized by
Sauder School of Business, University of British Columbia

Sponsored by
The ER Institute
Sauder School of Business
Xerox Canada Limited

In Cooperation with
ACM SIGMIS
Table of Contents
Business Process Modeling

Meronymy-Based Aggregation of Activities in Business Process Models
Sergey Smirnov, Remco Dijkman, Jan Mendling, and Mathias Weske

Leveraging Business Process Models for ETL Design
Kevin Wilkinson, Alkis Simitsis, Malu Castellanos, and Umeshwar Dayal

Adaptation in Open Systems: Giving Interaction Its Rightful Place
Fabiano Dalpiaz, Amit K. Chopra, Paolo Giorgini, and John Mylopoulos

Requirements Engineering and Modeling 1

Information Use in Solving a Well-Structured IS Problem: The Roles of IS and Application Domain Knowledge
Vijay Khatri and Iris Vessey

Finding Solutions in Goal Models: An Interactive Backward Reasoning Approach
Jennifer Horkoff and Eric Yu

The Model Role Level – A Vision
Rick Salay and John Mylopoulos

Requirements Engineering and Modeling 2

Establishing Regulatory Compliance for Information System Requirements: An Experience Report from the Health Care Domain
Alberto Siena, Giampaolo Armellin, Gianluca Mameli, John Mylopoulos, Anna Perini, and Angelo Susi

Decision-Making Ontology for Information System Engineering
Elena Kornyshova and Rébecca Deneckère

Reasoning with Optional and Preferred Requirements
Neil A. Ernst, John Mylopoulos, Alex Borgida, and Ivan J. Jureta
Data Evolution and Adaptation

A Conceptual Approach to Database Applications Evolution
Anthony Cleve, Anne-France Brogneaux, and Jean-Luc Hainaut

Automated Co-evolution of Conceptual Models, Physical Databases, and Mappings
James F. Terwilliger, Philip A. Bernstein, and Adi Unnithan

A SchemaGuide for Accelerating the View Adaptation Process
Jun Liu, Mark Roantree, and Zohra Bellahsene

Operations on Spatio-temporal Data

Complexity of Reasoning over Temporal Data Models
Alessandro Artale, Roman Kontchakov, Vladislav Ryzhikov, and Michael Zakharyaschev

Using Preaggregation to Speed Up Scaling Operations on Massive Spatio-temporal Data
Angelica Garcia Gutierrez and Peter Baumann

Situation Prediction Nets: Playing the Token Game for Ontology-Driven Situation Awareness
Norbert Baumgartner, Wolfgang Gottesheim, Stefan Mitsch, Werner Retschitzegger, and Wieland Schwinger

Model Abstraction, Feature Modeling, and Filtering

Granularity in Conceptual Modelling: Application to Metamodels
Brian Henderson-Sellers and Cesar Gonzalez-Perez

Feature Assembly: A New Feature Modeling Technique
Lamia Abo Zaid, Frederic Kleinermann, and Olga De Troyer

A Method for Filtering Large Conceptual Schemas
Antonio Villegas and Antoni Olivé

Integration and Composition

Measuring the Quality of an Integrated Schema
Fabien Duchateau and Zohra Bellahsene

Contextual Factors in Database Integration—A Delphi Study
Joerg Evermann

Building Dynamic Models of Service Compositions with Simulation of Provision Resources
Dragan Ivanović, Martin Treiber, Manuel Carro, and Schahram Dustdar
Consistency, Satisfiability and Compliance Checking

Maintaining Consistency of Probabilistic Databases: A Linear Programming Approach
You Wu and Wilfred Ng

Full Satisfiability of UML Class Diagrams
Alessandro Artale, Diego Calvanese, and Angélica Ibáñez-García

On Enabling Data-Aware Compliance Checking of Business Process Models
David Knuplesch, Linh Thao Ly, Stefanie Rinderle-Ma, Holger Pfeifer, and Peter Dadam

Using Ontologies for Query Answering

Query Answering under Expressive Entity-Relationship Schemata
Andrea Calì, Georg Gottlob, and Andreas Pieris

SQOWL: Type Inference in an RDBMS
Peter J. McBrien, Nikos Rizopoulos, and Andrew C. Smith

Querying Databases with Taxonomies
Davide Martinenghi and Riccardo Torlone

Document and Query Processing

What Is Wrong with Digital Documents? A Conceptual Model for Structural Cross-Media Content Composition and Reuse
Beat Signer

Classification of Index Partitions to Boost XML Query Performance
Gerard Marks, Mark Roantree, and John Murphy

Specifying Aggregation Functions in Multidimensional Models with OCL
Jordi Cabot, Jose-Norberto Mazón, Jesús Pardillo, and Juan Trujillo
Demos and Posters

The CARD System
Faiz Currim, Nicholas Neidig, Alankar Kampoowale, and Girish Mhatre

AuRUS: Automated Reasoning on UML/OCL Schemas
Anna Queralt, Guillem Rull, Ernest Teniente, Carles Farré, and Toni Urpí

How the Structuring of Domain Knowledge Helps Casual Process Modelers
Jakob Pinggera, Stefan Zugal, Barbara Weber, Dirk Fahland, Matthias Weidlich, Jan Mendling, and Hajo A. Reijers

SPEED: A Semantics-Based Pipeline for Economic Event Detection
Frederik Hogenboom, Alexander Hogenboom, Flavius Frasincar, Uzay Kaymak, Otto van der Meer, Kim Schouten, and Damir Vandic

Prediction of Business Process Model Quality Based on Structural Metrics
Laura Sánchez-González, Félix García, Jan Mendling, Francisco Ruiz, and Mario Piattini

Modelling Functional Requirements in Spatial Design
Mehul Bhatt, Joana Hois, Oliver Kutz, and Frank Dylla

Business Processes Contextualisation via Context Analysis
Jose Luis de la Vara, Raian Ali, Fabiano Dalpiaz, Juan Sánchez, and Paolo Giorgini

A Generic Perspective Model for the Generation of Business Process Views
Horst Pichler and Johann Eder

Extending Organizational Modeling with Business Services Concepts: An Overview of the Proposed Architecture
Hugo Estrada, Alicia Martínez, Oscar Pastor, John Mylopoulos, and Paolo Giorgini

Author Index
Meronymy-Based Aggregation of Activities in Business Process Models

Sergey Smirnov¹, Remco Dijkman², Jan Mendling³, and Mathias Weske¹

¹ Hasso Plattner Institute, University of Potsdam, Germany
{sergey.smirnov,mathias.weske}@hpi.uni-potsdam.de
² Eindhoven University of Technology, The Netherlands
³ Humboldt-Universität zu Berlin, Germany
Abstract. As business process management is increasingly applied in practice, more companies document their operations in the form of process models. Since users require descriptions of one process on various levels of detail, there are often multiple models created for the same process. Business process model abstraction emerged as a technique reducing the number of models to be stored: given a detailed process model, business process model abstraction delivers abstract representations for the same process. A key problem in many abstraction scenarios is the transition from detailed activities in the initial model to coarse-grained activities in the abstract model. This transition is realized by an aggregation operation clustering multiple activities to a single one. So far, humans decide on how to aggregate, which is expensive. This paper presents a semi-automated approach to activity aggregation that reduces the human effort significantly. The approach takes advantage of an activity meronymy relation, i.e., a part-of relation defined between activities. The approach is semi-automated, as it proposes sets of meaningful aggregations, while the user still decides. The approach is evaluated by a real-world use case.
1 Introduction
As organizations increasingly work in a process-oriented manner, they create and maintain a growing number of business process models. Often several hundred or even thousands of process models are stored in a company’s repository. There are two reasons contributing to this growth. On the one hand, modeling initiatives formalize a multitude of operational processes; on the other hand, one process is often described from different perspectives and at various levels of detail. This increasing number of models poses a considerable challenge to repository management. The BPM community has targeted this type of problem with, for example, techniques to efficiently deal with process model variety [16,30] and algorithms to search for process models that fit a particular profile [7,10]. Against this background, business process model abstraction (BPMA) emerged as a technique reducing the number of models describing one business process at different abstraction levels. BPMA is an operation on a business process model preserving essential process properties and leaving out insignificant process details in order to retain information relevant for a particular purpose.

In practice there are various scenarios where BPMA is helpful [12,15,26]. A prominent BPMA use case is the construction of a process “quick view” for rapid process comprehension. In the “quick view” scenario the user wants to familiarize herself with a business process but has only a detailed model at hand. BPMA solves this problem by deriving a process model that specifies high-level activities and overall process ordering constraints. The transition from the initial process model to the abstract one is realized by two primitive operations: elimination and aggregation [3,22,27]. While elimination simply drops model elements, aggregation merges a set of low-level model elements into one more general element. A special case of aggregation is activity aggregation, which provides coarse-grained activities for an abstract model. Often, there is a part-of relation between the aggregated activities and the aggregating activity.

While a number of papers elaborate on aggregation as the basic abstraction operation, all of them consider only the structure of the process model to decide which activities belong together. Therefore, activities without semantic connection might be aggregated. To the best of our knowledge, no paper directly discusses how to aggregate activities which are close according to their domain semantics.

In this paper we develop an aggregation technique clustering activities according to their domain semantics. The technique can guide the user during a process model abstraction by providing suggestions on related activities. To realize aggregation we refer to a part-of, or meronymy, relation between activities. The techniques we define can be extended towards composition, generalization, and classification [13,20,19]. The main contributions are the metric for comparing activity aggregations and the algorithm for aggregation mining. The presented approach is evaluated against a set of real-world process models.

The remainder of the paper is structured as follows. Section 2 illustrates the research problem. Section 3 presents an algorithm for derivation of activity aggregations. Section 4 describes the evaluation setting and results. In Section 5 we discuss the relation of our contribution to the existing research. Section 6 concludes the paper.

J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 1–14, 2010. © Springer-Verlag Berlin Heidelberg 2010
2 Motivation
A significant point of consideration in the design of the aggregation operation is the principle of activity aggregation. The principle defines which sets of activities belong together and constitute a more coarse-grained activity. Existing BPMA techniques aggregate activities according to the process model structure: activities that belong to a process model fragment are aggregated into one coarse-grained activity. A fragment is either explicitly defined [3,22], or identified by its properties [27]. To illustrate the shortcomings of purely structural BPMA techniques, let us refer to one concrete example, an approach developed in [27]. We demonstrate the abstraction mechanism by a process model example in Fig. 1.
[Figure: process model with fragment P and activities Receive order, Revise order, Confirm order, Arrange shipment, Update manufacturing plan, Generate order summary, Review order summary, Update customer file]
Fig. 1. “Customer order handling” business process
The model captures a business process of customer order handling at a factory. According to [27] the model is decomposed into several fragments: the shaded fragment P, and the top-level sequence that orders the start event, activity Confirm order, the fragment P, activities Generate order summary, Review order summary, Update customer file, and the end event. The abstraction algorithm enables aggregation of either a pair of sequential activities, or two branches in fragment P. Activity aggregation may continue and end up with a process model containing one coarse-grained activity. Obviously, this structural algorithm neglects the domain semantics of model elements. Consider activities Generate order summary and Review order summary in the example. The abstraction algorithm addresses none of the following questions:
1. Does an aggregation of Generate order summary and Review order summary make sense from the point of view of the domain semantics?
2. Is an aggregation of Generate order summary and Review order summary better than an aggregation of Review order summary and Update customer profile?
Apparently, structural information is insufficient to answer these questions. We consider domain semantics of activities as an alternative source of information. Intuitively, we may argue that it makes sense to aggregate Generate order summary and Review order summary, as both activities operate on object order summary and contribute to its evolution.

We consider business process domain ontologies as a formal representation of the domain knowledge. In particular, we focus on ontologies that organize activities by means of a meronymy relation. Meronymy is a semantic relation denoting that one entity is a part of another entity. Meronymy is often referred to as a part-of relation. The meronymy relation organizes activities into a hierarchy, a meronymy tree. Activities at the top of the tree are more coarse-grained than those deeper in the tree. Given an activity in the meronymy tree, its direct descendants are low-level activities to be accomplished to fulfill the given activity. Hence, each non-leaf activity can be iteratively refined down to leaves. Consider an example meronymy tree in Fig. 2. According to the tree, to complete activity Process order summary, activities Generate order summary and Review order summary have to be executed.
[Figure: tree with nodes Create order report, Update customer profile, Process order summary, Generate order summary, Review order summary]
Fig. 2. Meronymy tree example

We propose to use meronymy trees for activity aggregation. We refer to a set of activities that is being considered for aggregation as an aggregation candidate. If all the activities of an aggregation candidate appear in a meronymy tree, they have a lowest common ancestor (LCA). We assume that the LCA can be used as a representative for the aggregation candidate. Returning to the example, we observe that Generate order summary and Review order summary are the direct descendants of activity Process order summary in the meronymy tree. According to our argument, this is a strong indication that these two activities should be aggregated. One can notice that Update customer profile appears in the tree as well. This allows us to consider the set Generate order summary, Review order summary, and Update customer profile as an aggregation candidate as well. Which of the two candidates is preferable? To answer this question we make an assumption that a good aggregation candidate comprehensively describes its LCA. In other words, we assume activities to be related if they have a subsuming activity, the LCA, and this activity is comprehensively described by the considered activities. According to this assumption, aggregation candidate Generate order summary and Review order summary is preferable, as it fully describes the ancestor Process order summary. At the same time the set Generate order summary, Review order summary, and Update customer profile does not provide a comprehensive description of Create order report: there are other activities in the meronymy tree that contribute to its execution. Following this argumentation we are able to mine activity aggregations in a process model and guide the user with recommendations on which activities belong together.
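The LCA argument above can be made concrete with a small sketch. The tree below is only an illustrative reading of the activities named in Fig. 2: the child-to-parent edges, the `PARENT` mapping, and the helper names `proper_ancestors` and `lca` are assumptions introduced here, not part of the paper. The sketch also assumes the strict reading in which the LCA must have every candidate activity as a descendant, so an activity is never its own LCA.

```python
# Hypothetical child -> parent encoding of the meronymy tree sketched in Fig. 2.
# The exact edges are an assumption made for illustration.
PARENT = {
    "Process order summary": "Create order report",
    "Update customer profile": "Create order report",
    "Generate order summary": "Process order summary",
    "Review order summary": "Process order summary",
}


def proper_ancestors(activity, parent):
    """Return the ancestors of an activity, nearest first, excluding itself."""
    chain = []
    while activity in parent:
        activity = parent[activity]
        chain.append(activity)
    return chain


def lca(candidate, parent):
    """Lowest activity that has every activity of the candidate as a descendant.

    Returns None when no such activity exists, for example when the
    candidate contains the root of the tree.
    """
    candidate = list(candidate)
    if not candidate:
        raise ValueError("an aggregation candidate must not be empty")
    chains = [proper_ancestors(a, parent) for a in candidate]
    common = set(chains[0]).intersection(*chains[1:])
    # Common ancestors lie on every path to the root, so the first common
    # node met while walking up from any candidate activity is the lowest one.
    for node in chains[0]:
        if node in common:
            return node
    return None


small = {"Generate order summary", "Review order summary"}
large = small | {"Update customer profile"}
print(lca(small, PARENT))  # Process order summary
print(lca(large, PARENT))  # Create order report
```

The two candidates discussed in the text map to different representatives: the smaller one to Process order summary, the larger one to Create order report. Deciding which candidate describes its LCA more comprehensively is a separate step that this sketch does not cover.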
3 Meronymy-Based Activity Aggregation Mining
This section formalizes the intuitive discussion sketched above. First, we introduce the basic concepts in subsection 3.1. Subsection 3.2 explains how activities in a business process model can be related to activities in a meronymy tree. Next, subsection 3.3 defines a metric enabling the comparison of aggregation candidates. Finally, subsection 3.4 defines an algorithm for activity aggregation mining.

3.1 Basic Concepts
We start by introducing the basic concepts. First, a universal alphabet of activities A is postulated. Further, we define the notion of a process model based on the general concepts of activities, gateways, and edges capturing the control flow. Notice that Definition 1 does not distinguish gateway types. We make this design decision, as we ignore the ordering constraints imposed by the control flow, but focus on the domain semantics of activities. Meanwhile, in subsection 3.4 we make use of distances between nodes, which motivates definition of a process model as a graph.
Meronymy-Based Aggregation of Activities in Business Process Models
[Fig. 3. Example meronymy tree t1. Node labels: n0, n1, n2, n3, n4, n5, n6, n7, n8, e, f, g]
Definition 1 (Process Model). A tuple $PM = (A, G, E)$ is a process model, where:
– $A \subseteq \mathcal{A}$ is the finite nonempty set of process model activities
– $G$ is the finite set of gateways
– $A \cap G = \emptyset$
– $N = A \cup G$ is a finite set of nodes
– $E \subseteq N \times N$ is the flow relation, such that $(N, E)$ is a connected graph.

An aggregation candidate $C \subseteq A$ is a subset of activities in a process model $PM = (A, G, E)$. The search for activity aggregations utilizes a domain ontology formalized as a meronymy forest.

Definition 2 (Meronymy Tree and Meronymy Forest). A meronymy tree is a tuple $t = (A_t, r_t, M_t)$, where:
– $A_t \subseteq \mathcal{A}$ is the finite non-empty set of activities
– $r_t \in A_t$ is the root
– $M_t \subseteq A_t \times (A_t \setminus \{r_t\})$ is the set of edges such that $(a, b) \in M_t$ if $b$ is part of $a$
– $M_t$ is an acyclic and coherent relation such that each activity $a \in A_t \setminus \{r_t\}$ has exactly one direct ancestor.

A meronymy forest $F$ is a disjoint union of meronymy trees. We denote the set of activities in the meronymy forest as $A_F = \bigcup_{t \in F} A_t$.

An example meronymy tree $t_1$ is presented in Fig. 3. Notice that according to the definition of a meronymy forest, each activity appears in exactly one meronymy tree. Definition 2 does not assume the existence of one super activity subsuming all the others. This is consistent, for instance, with ontologies like the MIT Process Handbook, in which there are eight root activities [23]. In the following we make extensive use of the notion of a lowest common ancestor. While this concept is well defined for a pair of nodes in graph theory, we extend it to a node set. Let $t = (A_t, r_t, M_t)$ be a meronymy tree. We introduce an auxiliary function $lca_t : \mathcal{P}(A_t) \to A_t$, which for a set of activities $C \subseteq A_t$ returns a node $l \in A_t$ that is the lowest node having all the activities in $C$ as descendants. The function is defined for a concrete meronymy tree and can be illustrated by the following two examples in tree $t_1$: $lca_{t_1}(\{e, f\}) = n_2$ and $lca_{t_1}(\{e, g\}) = n_0$.

3.2 Matching Activities: From Process Models to Meronymy Forest
To enable aggregation we need to relate process model activities to the information in an ontology, i.e., a meronymy forest. In the trivial case each process
model activity is captured in the ontology. In practice, however, this is only the case if the process model was designed using the activities predefined by the ontology. We do not want to impose the restriction that a process model is constructed from activities in an ontology. Therefore, we use a matching step to determine which activity in a process model matches which activity in the meronymy forest. The matching step is particularly useful if the process model and the meronymy forest have been designed independently and we would like to reuse the ontology for the activity aggregation problem.

Definition 3 (Activity Match and Activity Mix-Match). Let $PM = (A, G, E)$ be a process model and $F$ be a meronymy forest. The function $match_{PM} : A \to \mathcal{P}(A_F)$ maps an activity to a set of activities in the meronymy forest. Function $match_{PM}$ is extended to sets such that $match_{PM} : \mathcal{P}(A) \to \mathcal{P}(\mathcal{P}(A_F))$ and for $Q \subseteq A$ it is defined as $match_{PM}(Q) = \{match_{PM}(q) \mid q \in Q\}$, which returns a set of match-sets, each corresponding to an element of $Q$. Further, function $mixmatch_{PM}$ returns all potential combinations of matches for each process model activity from an input set. Function $mixmatch_{PM} : \mathcal{P}(A) \to \mathcal{P}(\mathcal{P}(A_F))$ is defined so that for a set of activities $Q \subseteq A$ it holds that $S \in mixmatch_{PM}(Q)$ if $|S| = |Q|$ and $\forall u, v \in S$ it holds that $\exists a_1, a_2 \in A \, [a_1 \neq a_2 \wedge u \in match_{PM}(a_1) \wedge v \in match_{PM}(a_2)]$.

The match mapping enables activity mapping in both cases: if the process model was designed in the presence of a meronymy forest, or independently. In the former case function match maps an activity to a trivial set containing only this activity. In the latter case match maps a process model activity to a set of similar activities in the meronymy forest. Various matching techniques exist [7,8]. However, these techniques focus on matching activities from different process models, while we focus on matching activities from a process model to activities in an ontology. 
This means that relations between activities, e.g., control flow, cannot be exploited to establish a match. Therefore, we refine techniques for matching activities based only on their labels. To match an activity a in a process model to ontology activities, the ontology activities whose labels are most similar to the label of a are considered. To assess the similarity of labels we make use of relations between words, for instance, synonymy, antonymy, meronymy, and hyponymy, in the lexical database WordNet [25]. Given these relations, there are algorithms that compute a semantic similarity between words, e.g., see [4,18,21]. We use the similarity metric proposed by Jiang and Conrath in [18]. As the algorithms provide similarity values for words, we extend this method to labels, treated as lists of words. First, within each label all the words that are neither verbs nor nouns according to WordNet are removed. For each pair of labels all possible combinations of word-to-word mappings are considered. The mapping that results in the maximal similarity of words is selected. Each activity is mapped to the activities in the meronymy forest that are sufficiently similar to it. A configurable threshold value defines when an activity is considered to be “sufficiently” similar to another one.
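The label-based matching and the mix-match combination described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_word_sim` and its `SYNONYMS` table are hypothetical stand-ins for a WordNet-based word similarity such as the Jiang and Conrath measure, the normalisation by the longer label is an assumption (the text does not specify one), and the removal of words that are neither verbs nor nouns is omitted.

```python
from itertools import permutations, product

def label_similarity(label_a, label_b, word_sim):
    """Score two labels by their best word-to-word mapping."""
    short, long_ = sorted((label_a.lower().split(), label_b.lower().split()), key=len)
    if not short:
        return 0.0
    best = 0.0
    # Try every injective mapping of the shorter label's words onto the longer one's.
    for mapped in permutations(long_, len(short)):
        total = sum(word_sim(x, y) for x, y in zip(short, mapped))
        # Assumption: normalise by the longer label so unmatched words lower the score.
        best = max(best, total / len(long_))
    return best

def match(label, forest_labels, word_sim, threshold):
    """Forest activities whose labels are at least `threshold`-similar to `label`."""
    return {f for f in forest_labels
            if label_similarity(label, f, word_sim) >= threshold}

def mixmatch(candidate, match_of):
    """Pick one match per process activity; keep combinations of distinct matches."""
    activities = sorted(candidate)
    combos = product(*(sorted(match_of[a]) for a in activities))
    return {frozenset(c) for c in combos if len(set(c)) == len(activities)}

# Hypothetical word similarity: 1.0 for equal words, 0.8 for listed synonyms.
SYNONYMS = {frozenset(("generate", "create")), frozenset(("review", "check"))}

def toy_word_sim(x, y):
    if x == y:
        return 1.0
    return 0.8 if frozenset((x, y)) in SYNONYMS else 0.0

forest = ["Review order summary", "Update customer profile"]
matches = match("Check order summary", forest, toy_word_sim, threshold=0.7)
```

Enumerating all word mappings grows factorially with label length, which is tolerable here only because activity labels are short.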
3.3 Aggregation Candidates Ranking
Without any prior knowledge, every subset of a process model activity set might be considered as a potential aggregation candidate. However, we aim to select only those aggregation candidates whose activities are strongly semantically related. There are various options for defining semantic relations among activities, for instance, based on operation on the same data object or execution by the same resource. In this paper we utilize meronymy relations between activities in order to judge their semantic relatedness. We say that activities in an aggregation candidate are strongly related if together they comprehensively describe another activity, their LCA. The comprehensiveness depends on the existence of LCA descendants that do not belong to the aggregation candidate. The larger the share of the LCA descendants that belongs to the aggregation candidate, the more comprehensive the description. For example, activity set {e, f} in Fig. 3 fully describes its LCA n2. In contrast, activity set {e, g} describes only a small part of its LCA, activity n0. We define a metric to measure how comprehensively a set of activities describes its LCA. We impose the following requirements on the metric. The metric must reflect whether the activities of an aggregation candidate describe the LCA comprehensively. The more descendants the LCA has that do not belong to the aggregation candidate, the smaller the share of the LCA described by the aggregation candidate. The metric must be neutral to the distance between the activities of an aggregation candidate and the LCA, as the distance has no direct influence on how comprehensively activities describe their ancestors. Similarly, the position of the LCA relative to the tree root is not characteristic in this context. This position reflects the abstraction level of an activity, which is of no interest here. We also require the metric to be neutral to the size of an aggregation candidate. 
Hence, the metric enables comparison of aggregation candidates of different sizes, and even comparison of an aggregation candidate with its own subsets. Finally, it is convenient if the metric has a value between 0 and 1. We summarize these requirements as a list.
R1. Reflect whether the LCA has descendants other than the aggregation candidate activities.
R2. Be neutral to the depth of the aggregation candidate in the LCA-rooted subtree.
R3. Be neutral to the depth of the LCA in the meronymy tree.
R4. Be neutral to the size of the aggregation candidate.
R5. Have a value between 0 and 1.
To present the designed function, we first introduce an auxiliary function, meronymy leaves. The function establishes a correspondence between a meronymy tree node and its descendant leaves.
Definition 4 (Meronymy Leaves). Let $t = (A_t, r_t, M_t)$ be a meronymy tree in meronymy forest $F$. A function $w_t : A_t \to \mathcal{P}(A_t)$ is a meronymy leaves function, which for activity $a \in A_t$ returns the leaves of the subtree rooted at activity $a$.
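A meronymy tree with its leaves function and the set-valued LCA function can be represented as in the following sketch. The class and method names are illustrative, and the shape given to `T1` is an assumption: Fig. 3 is not recoverable here, so the tree was chosen only to agree with the values quoted in the text for t1. The sketch also treats a node as a descendant of itself, which the definitions leave open.

```python
class MeronymyTree:
    """A meronymy tree given as a parent -> children mapping (Definition 2)."""

    def __init__(self, root, children):
        self.root = root
        self.children = children
        # Every non-root activity has exactly one direct ancestor.
        self.parent = {c: p for p, cs in children.items() for c in cs}

    def leaves(self, node):
        """w_t(node): leaves of the subtree rooted at `node` (Definition 4)."""
        kids = self.children.get(node, [])
        if not kids:
            return {node}
        return set().union(*(self.leaves(k) for k in kids))

    def path_to_root(self, node):
        """Nodes from `node` up to and including the root."""
        path = [node]
        while path[-1] != self.root:
            path.append(self.parent[path[-1]])
        return path

    def lca(self, nodes):
        """lca_t(nodes): lowest node whose subtree contains every given node."""
        nodes = list(nodes)
        common = set(self.path_to_root(nodes[0]))
        for n in nodes[1:]:
            common &= set(self.path_to_root(n))
        # Walking upwards, the first node shared by all paths is the lowest one.
        return next(n for n in self.path_to_root(nodes[0]) if n in common)

# Assumed shape of t1 (hypothetical, consistent with the quoted examples).
T1 = MeronymyTree("n0", {
    "n0": ["n1", "n2", "n3"],
    "n1": ["g", "n4", "n5"],
    "n2": ["e", "f"],
    "n3": ["n6", "n7", "n8"],
})
```

With this tree, `T1.lca({"e", "f"})` evaluates to `"n2"` and `T1.lca({"e", "g"})` to `"n0"`, matching the two examples given for the LCA function.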
Returning to the example tree $t_1$, consider $w_{t_1}(g) = \{g\}$ and $w_{t_1}(n_2) = \{e, f\}$. Given the meronymy leaves function, we propose the following metric for aggregation candidate ordering.

Definition 5 (Degree of Aggregation Coverage). Let $t = (A_t, r_t, M_t)$ be a meronymy tree in a meronymy forest $F$ and $C \subseteq A_t$ be an aggregation candidate. A function $cover : \mathcal{P}(A_t) \to (0, 1]$ describes the degree of aggregation coverage, defined as:

$$cover(C) = \frac{\left|\bigcup_{a \in C} w_t(a)\right|}{\left|w_t(lca_t(C))\right|}$$

The metric captures the extent to which the activity set covers the LCA activity. The larger the share, the more comprehensive the description provided by the activity set. For the motivating example in the tree $t_1$ the metric has values $cover(\{e, f\}) = 1$ and $cover(\{e, g\}) = 0.25$, i.e., $cover(\{e, f\}) > cover(\{e, g\})$. We therefore conclude that {e, f} is a better aggregation than {e, g}. As the aggregation metric makes use of the meronymy leaves function, it accounts for the presence of LCA descendants other than those in the aggregation candidate. As the metric makes no use of distance measures, it is neutral to the depth of aggregation candidates in the LCA-rooted subtree, as well as to the depth of the LCA in the meronymy tree. The metric is indifferent to the size of the aggregation candidate, but considers the number of leaves in the tree “covered” by the candidate. Finally, the metric value is always greater than 0 and at most 1. We conclude that the proposed aggregation metric satisfies requirements R1–R5.

3.4 Activity Aggregation Mining Algorithm
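The degree of aggregation coverage from Definition 5, on which the algorithm in this subsection relies, can be sketched as below. This is a minimal illustration under assumptions: the metric is read as the number of leaves covered by the candidate divided by the number of leaves under its LCA, the function names are illustrative, and `T1_CHILDREN` is a hypothetical tree shape chosen only to reproduce the values 1 and 0.25 quoted for t1. The LCA is found here by descending from the root.

```python
def subtree_nodes(children, node):
    """All nodes of the subtree rooted at `node`, including `node` itself."""
    nodes = {node}
    for child in children.get(node, ()):
        nodes |= subtree_nodes(children, child)
    return nodes

def leaves(children, node):
    """w_t(node): the leaves below `node`; a leaf is its own only leaf."""
    return {n for n in subtree_nodes(children, node) if not children.get(n)}

def lca(children, root, candidate):
    """Descend from the root while one child subtree still holds the whole candidate."""
    candidate = set(candidate)
    node = root
    while True:
        holders = [c for c in children.get(node, ())
                   if candidate <= subtree_nodes(children, c)]
        if not holders:
            return node
        node = holders[0]

def cover(children, root, candidate):
    """Share of the LCA's leaves that the candidate's activities account for."""
    covered = set().union(*(leaves(children, a) for a in candidate))
    return len(covered) / len(leaves(children, lca(children, root, candidate)))

# Assumed shape of t1 (hypothetical): n0 has eight descendant leaves.
T1_CHILDREN = {
    "n0": ["n1", "n2", "n3"],
    "n1": ["g", "n4", "n5"],
    "n2": ["e", "f"],
    "n3": ["n6", "n7", "n8"],
}

full = cover(T1_CHILDREN, "n0", {"e", "f"})     # two of the two leaves under n2
partial = cover(T1_CHILDREN, "n0", {"e", "g"})  # two of the eight leaves under n0
```

Taking the union of leaf sets, rather than summing their sizes, keeps the value at most 1 when a candidate contains both a node and one of its descendants.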
Finally, we propose an algorithm for mining activity aggregations from a process model. The mining algorithm has two subproblems: generation of aggregation candidates out of a process model and selection of aggregations from the aggregation candidates. While the latter problem exploits the developed aggregation metric cover, the former requires a discussion. Generation of aggregation candidates from the model can be approached in a brute-force fashion, by considering all possible activity combinations. However, the number of combinations in this case is $2^{|A|}$, where A is the set of activities in a process model. As BPMA addresses complex process models with a large number of activities, this brute-force method is impractical. We need a method for coping with the computational complexity. A helpful observation is that related activities are co-located within a process model [29]. According to this observation, we assume that for a given activity the related activities can be found within a fixed graph distance. In this way we keep the computational complexity manageable. The computational complexity is further reduced if we construct aggregation candidates iteratively, pruning redundant ones. First, all the aggregation candidates of size two are created and analyzed. Candidates whose matches do not appear in one meronymy tree are pruned. In the next iteration, aggregation candidates of size three are constructed from the
Algorithm 1. Activity aggregation mining
 1. mine(Model PM = (A, G, E), MForest F, double cover0, int dist)
 2.   Set aggregations = ∅;
 3.   for all activity ∈ A of PM do
 4.     Set candidates = ∅;
 5.     for all activityPair ∈ findNeighbours(activity, dist) do
 6.       candidate = {activityPair[1], activityPair[2]};
 7.       for all ontologyCandidate ∈ mixmatchPM(candidate) do
 8.         if ∃t ∈ F, t = (At, rt, Mt) : ontologyCandidate ⊆ At then
 9.           candidates = candidates ∪ {candidate};
10.           if cover(ontologyCandidate) ≥ cover0 then
11.             aggregations = aggregations ∪ {candidate};
12.     aggregations = aggregations ∪ kStep(candidates, PM, F, cover0, dist);
13.   return aggregations;
14.
15. \\Inductive step of aggregation mining
16. kStep(Set kCandidates, Model PM, MForest F, double cover0, int dist)
17.   Set aggregations = ∅;
18.   Set (k + 1)Candidates = ∅;
19.   int k = kCandidates[1].size;
20.   for all candidatePair from kCandidates do
21.     newCandidate = candidatePair[1] ∪ candidatePair[2];
22.     if newCandidate.size == k + 1 then
23.       for all ontologyCandidate ∈ mixmatchPM(newCandidate) do
24.         if ∃t ∈ F, t = (At, rt, Mt) : ontologyCandidate ⊆ At then
25.           (k + 1)Candidates = (k + 1)Candidates ∪ {newCandidate};
26.           if cover(ontologyCandidate) ≥ cover0 then
27.             aggregations = aggregations ∪ {newCandidate};
28.   aggregations = aggregations ∪ kStep((k + 1)Candidates, PM, F, cover0, dist);
29.   return aggregations;
candidates of size two. Hence, the construction of aggregation candidates of size k + 1 makes use of the aggregation candidates of size k that survived pruning. Algorithm 1 formalizes the discussion above. The input of the algorithm is a process model PM = (A, G, E), a meronymy forest F, an aggregation metric threshold value cover0, and dist, the graph node distance. The threshold value cover0 and the distance dist are the configuration parameters of the algorithm. Their values can be selected by the user or obtained empirically (see Section 4). The output of the algorithm is the set of aggregations. The iterative construction of aggregations of increasing size is realized by two functions: mine and kStep. The entry point of the algorithm is function mine. For each activity in a process model (line 3) the algorithm finds a set of neighboring activities within a specified distance dist. Function findNeighbours(activity, dist) returns the set of activities located within a distance not greater than dist from activity in the process model (line 5). Within this set all the subsets of size two are considered as aggregation candidates (line 6). Each candidate is evaluated against the ontology. If the candidate has no mappings to the ontology activities that belong to one tree, it
is pruned (lines 7–8). Otherwise, the candidate mappings are evaluated against the specified metric threshold value cover0. If there is at least one mapping of an aggregation candidate for which the value of cover is not less than cover0, the candidate is considered to be an aggregation (lines 10–11). All the aggregation candidates that have not been pruned are used as the input for function kStep (line 12). Function kStep iteratively increases the size of aggregation candidates by one, pruning and evaluating them (lines 16–29). The pruning and evaluation of candidates follow the same principle as in function mine.
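The two functions of Algorithm 1 can be sketched in runnable form as follows. This is a sketch under stated assumptions rather than the authors' implementation: `mixmatch`, `in_one_tree` and `cover` are passed in as callables, the graph distance is taken to be undirected, the neighbourhood is assumed to include the activity itself, and an explicit stop on an empty candidate set is added, which the pseudocode leaves implicit. The toy model at the end is hypothetical.

```python
from collections import deque
from itertools import combinations

def find_neighbours(adjacency, start, dist):
    """Breadth-first search: nodes within `dist` hops of `start`, `start` included."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == dist:
            continue
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return set(seen)

def evaluate(candidate, mixmatch, in_one_tree, cover, cover0):
    """Return (keep, is_aggregation), mirroring lines 7-11 and 23-27."""
    keep = aggregate = False
    for onto in mixmatch(candidate):
        if in_one_tree(onto):
            keep = True
            aggregate = aggregate or cover(onto) >= cover0
    return keep, aggregate

def k_step(k_candidates, mixmatch, in_one_tree, cover, cover0):
    """Grow surviving candidates of size k into candidates of size k + 1."""
    if not k_candidates:  # added base case: nothing left to combine
        return set()
    k = len(next(iter(k_candidates)))
    aggregations, bigger = set(), set()
    for a, b in combinations(k_candidates, 2):
        new = a | b
        if len(new) == k + 1:
            keep, agg = evaluate(new, mixmatch, in_one_tree, cover, cover0)
            if keep:
                bigger.add(new)
            if agg:
                aggregations.add(new)
    return aggregations | k_step(bigger, mixmatch, in_one_tree, cover, cover0)

def mine(activities, adjacency, mixmatch, in_one_tree, cover, cover0, dist):
    aggregations = set()
    for activity in activities:
        # Gateways may lie on the path, but only activities form candidates.
        hood = find_neighbours(adjacency, activity, dist) & set(activities)
        candidates = set()
        for pair in combinations(sorted(hood), 2):
            cand = frozenset(pair)
            keep, agg = evaluate(cand, mixmatch, in_one_tree, cover, cover0)
            if keep:
                candidates.add(cand)
            if agg:
                aggregations.add(cand)
        aggregations |= k_step(candidates, mixmatch, in_one_tree, cover, cover0)
    return aggregations

# Hypothetical model: chain a - b - c - d; a, b, c are the leaves of one tree.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = mine(
    activities={"a", "b", "c", "d"},
    adjacency=adjacency,
    mixmatch=lambda cand: [cand],               # activities taken from the ontology
    in_one_tree=lambda onto: onto <= {"a", "b", "c"},
    cover=lambda onto: len(onto) / 3,
    cover0=0.6,
    dist=1,
)
```

Passing the ontology-specific parts as callables keeps the sketch focused on candidate generation and pruning, which is the part Algorithm 1 contributes.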
4 Empirical Validation
We evaluated the developed aggregation technique by applying it to a set of models capturing the business processes of a large electronics manufacturer. The model collection considered in the experiment includes 6 business process models. Each model contains on average 42 activities, with a minimum of 18 activities and a maximum of 81 activities. On average, an activity label contains 4.1 words. A meronymy forest is represented by the MIT Process Handbook [23]. The MIT Process Handbook describes business processes elicited by researchers in the interviews with business process experts. It spans several business domains, like sales, distribution, and production. The handbook describes about 5 000 activities and specifies hyponymy and meronymy relations between them. We make use of activities and a meronymy relation only. The process models were not aligned with the Handbook in advance: no relations between process model activities and the MIT Process Handbook activities were established. We matched process model activities to the activities of the handbook according to the semantics of their labels, as discussed in Section 3.2. In the evaluation we rely on human judgment for assessing the relevance of the results delivered by our approach. We asked a process modeling expert from TU Eindhoven, who was unfamiliar with the technique, to select those aggregations that were considered relevant. We gave the instruction to consider an abstraction relevant, if a given set of activities could be reasonably “abstracted” into a single activity. The means for abstraction that could be considered were: aggregating the activities in the abstraction, generalizing the activities, putting the activities into a common subprocess, or any other means that the evaluator considered relevant. Activity aggregations delivered by the aggregation mining algorithm vary in size. To get a precise result we decomposed each mined activity set into a set of all its subsets of size two. 
Due to this decomposition, the evaluation shows whether all the activities of an aggregation are strongly related. If in the set {a, b, c} activity c is weakly related to a and b, the evaluator easily points this out by marking (a, c) and (b, c) as irrelevant. We have conducted a series of experiments in which we varied the parameters of our aggregation technique. In each run of the experiments we fixed the parameters of the match function (each process model activity was mapped to at most 10 activities in the Handbook). At the same time we varied the node distance and the cover threshold value. The node distance takes the values from 1
[Fig. 4. Precision of the activity aggregation mining algorithm. Horizontal axis: node distance (1 to 4); vertical axis: precision (0 to 1); series: cover threshold 0.2 and cover threshold 0.3]
to 4, while the cover threshold values were 0.2 and 0.3. Within the experiment we observed the precision value, i.e., the number of relevant activity pairs retrieved by the algorithm divided by the total number of retrieved pairs. Fig. 4 illustrates the observed results. The precision value varies between 0.27 (node distance 4, cover threshold 0.2) and 0.46 (node distance 1, cover threshold 0.3). One can see two tendencies in the experiment. First, a higher cover0 threshold value leads to a higher precision. Indeed, a high threshold prunes more aggregation candidates, as it imposes stricter search conditions. The total number of aggregations declines, increasing the precision. Second, an increase of the node distance leads to a precision decrease. This observation can be explained by the fact that a node distance increase brings more activities into the algorithm's consideration. As [29] argues, the greater the distance, the less related the activities in the set are. Thereby, the precision decrease is expected. While the technique returns a considerable number of helpful suggestions, a substantial number of irrelevant aggregations is still proposed. Hence, we aim to improve the precision of the technique by combining it with other information on relatedness, e.g., shared data access or execution by the same resource. Further, the conducted experiment evaluated both the aggregation mining algorithm and the match function. Since the process models and the used ontology were not aligned beforehand, the match technique also contributes to the gap in precision. To study the behavior of the aggregation mining algorithm further, we need a setting where process models are created using activities from a domain ontology. We regard such an evaluation as future work. In conclusion, we suggest that although the developed technique cannot be used in a fully automatic fashion, it can support the user with suggestions.
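The pairwise precision measure used in this evaluation can be sketched as follows. The function names and the sample data are hypothetical, and counting each pair once even when it occurs in several mined sets is an assumption, since the text does not say how repeated pairs were handled.

```python
from itertools import combinations

def to_pairs(aggregations):
    """Decompose every mined activity set into all of its size-two subsets."""
    pairs = set()
    for agg in aggregations:
        pairs |= {frozenset(p) for p in combinations(sorted(agg), 2)}
    return pairs

def precision(aggregations, relevant_pairs):
    """Relevant retrieved pairs divided by all retrieved pairs."""
    retrieved = to_pairs(aggregations)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant_pairs) / len(retrieved)

# Hypothetical run: two mined sets, two pairs judged relevant by the evaluator.
mined = [{"a", "b", "c"}, {"c", "d"}]
judged_relevant = {frozenset(("a", "b")), frozenset(("c", "d"))}
score = precision(mined, judged_relevant)
```

In this toy run the mined sets yield four distinct pairs, two of which are judged relevant, so the score is 0.5.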
5 Related Work
This paper extends existing methods for business process model abstraction by addressing the semantics of activities in business process models. As we mentioned
in Section 2, the available BPMA techniques focus on the structural aspects of process model transformation, e.g., see [3,27,31]. In addition to those papers, there are works that discuss criteria for abstraction of model elements, see [12,15]. These criteria are defined on activity properties such as execution cost, duration, frequency, and associated resources; activities are either abstracted or preserved depending on whether they meet the given criteria. Basu and Blanning [2] propose a technique for detecting possible abstractions by searching for structural patterns. We are not aware of BPMA algorithms and, in particular, activity aggregation algorithms, that use semantic information. However, object-oriented analysis and design has extensively studied the meronymy, or whole-part, relation. In [1] Barbier et al. extend the formalization of this relation beyond the UML specification. Guizzardi focuses on the modal aspects of the meronymy relation and the underlying objects in [14]. Business process model abstraction is related to modularization of business process models, which is the task of extracting a set of activities and putting them into a common subprocess. In [28] Reijers and Mendling investigate quality metrics that determine whether tasks should be abstracted and put into a common subprocess. This work is extended by studying strict criteria that can be used to determine when tasks can be put into a common subprocess [29]. Another stream of the related research is the work on lexical semantic relatedness. There exist a number of measures that utilize semantic relations of WordNet to evaluate semantic relatedness of words. In [4] Budanitsky and Hirst provide a comprehensive overview of such measures that consider the WordNet hyponymy relation. Further, they evaluate the measures and conclude on their pros and cons. Our interest in this body of research is twofold. 
On the one hand, we apply these results for finding related activity labels: based on the outcome of the evaluation in [4] and our experiment setting, we utilize the measures proposed in [18] and [21]. On the other hand, these measures inspired the aggregation metric that we developed in this paper. Other techniques for finding related activity labels or related process models vary with respect to the type of information that they rely on. Structural and semantic information is used in [11], behavioral information in [9], and graph-edit distance techniques are used as well [8]. However, these techniques focus on finding related activities from different processes, while the focus of our work is mining of related activities in the same model. The discussion of semantic relations between activities falls into the area of semantic BPM. Hepp et al. [17] present a vision of semantic BPM, explaining the role of ontologies and showing how they expand process analysis capabilities. In [5] Casati et al. suggest using taxonomies for high-level analysis of business processes. Medeiros et al. employ activity meronymy and hyponymy relations to advance the analysis of information derived from logs [24]. Although the above-named papers describe advanced analysis of process models, they do not suggest metrics that enable mining of aggregations in domain ontologies. Finally, Dapoigny and Barlatier introduce a formalism that facilitates a precise and unambiguous description of a part-whole relation [6]. Similar to the semantic BPM work, their contribution does not directly address the aggregation mining problem.
6 Conclusions and Future Work
This paper presented a semi-automated approach to aggregation of activities in process models. We developed a technique for mining sets of semantically related activities in a process model. The activity relatedness is evaluated with the help of information external to the model, namely a domain ontology specifying an activity meronymy relation. The contributions of the paper are 1) the metric enabling judgment on the relatedness of activity sets and 2) the algorithm for activity aggregation mining. Further, we proposed a technique for matching activities of a process model and an ontology. The developed approach is evaluated in a real-world setting including a set of business process models from a large electronics manufacturer and the domain ontology of the MIT Process Handbook. We foresee a number of future research directions. First, there is potential for further improvement of the activity matching technique. In particular, we aim to develop advanced label analysis techniques by identifying verbs and business objects. Second, we want to evaluate the aggregation mining technique in a setting where process models have been created using a domain ontology. Such an evaluation allows the performance of the aggregation mining algorithm to be checked in isolation. Third, it is of great interest to investigate other types of information in process models that may support aggregation, for instance, data flow or organizational role hierarchies. Finally, our research agenda includes a study of the interconnection between structural and semantic aggregation approaches.
References
1. Barbier, F., Henderson-Sellers, B., Le Parc-Lacayrelle, A., Bruel, J.-M.: Formalization of the Whole-Part Relationship in the Unified Modeling Language. IEEE TSE 29(5), 459–470 (2003)
2. Basu, A., Blanning, R.W.: Synthesis and Decomposition of Processes in Organizations. ISR 14(4), 337–355 (2003)
3. Bobrik, R., Reichert, M., Bauer, T.: View-Based Process Visualization. In: Alonso, G., Dadam, P., Rosemann, M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 88–95. Springer, Heidelberg (2007)
4. Budanitsky, A., Hirst, G.: Evaluating WordNet-based Measures of Lexical Semantic Relatedness. COLI 32(1), 13–47 (2006)
5. Casati, F., Shan, M.-C.: Semantic Analysis of Business Process Executions. In: Jensen, C.S., Jeffery, K., Pokorný, J., Šaltenis, S., Bertino, E., Böhm, K., Jarke, M. (eds.) EDBT 2002. LNCS, vol. 2287, pp. 287–296. Springer, Heidelberg (2002)
6. Dapoigny, R., Barlatier, P.: Towards an Ontological Modeling with Dependent Types: Application to Part-Whole Relations. In: Laender, A.H.F. (ed.) ER 2009. LNCS, vol. 5829, pp. 145–158. Springer, Heidelberg (2009)
7. Dijkman, R.M., Dumas, M., García-Bañuelos, L.: Graph Matching Algorithms for Business Process Model Similarity Search. In: Dayal, U., Eder, J., Koehler, J., Reijers, H.A. (eds.) BPM 2009. LNCS, vol. 5701, pp. 48–63. Springer, Heidelberg (2009)
8. Dijkman, R.M., Dumas, M., García-Bañuelos, L., Käärik, R.: Aligning Business Process Models. In: EDOC 2009, pp. 45–53. IEEE CS, Los Alamitos (2009)
9. van Dongen, B., Dijkman, R.M., Mendling, J.: Measuring Similarity between Business Process Models. In: Bellahsène, Z., Léonard, M. (eds.) CAiSE 2008. LNCS, vol. 5074, pp. 450–464. Springer, Heidelberg (2008)
10. Dumas, M., García-Bañuelos, L., Dijkman, R.M.: Similarity Search of Business Process Models. IEEE Data Eng. Bull. 32(3), 23–28 (2009)
11. Ehrig, M., Koschmider, A., Oberweis, A.: Measuring Similarity between Semantic Business Process Models. In: APCCM 2007, Ballarat, Victoria, Australia, pp. 71–80. ACSC (2007)
12. Eshuis, R., Grefen, P.: Constructing Customized Process Views. DKE 64(2), 419–438 (2008)
13. Evermann, J., Wand, Y.: Toward Formalizing Domain Modeling Semantics in Language Syntax. IEEE TSE 31(1), 21–37 (2005)
14. Guizzardi, G.: Modal Aspects of Object Types and Part-Whole Relations and the de re/de dicto Distinction. In: Krogstie, J., Opdahl, A.L., Sindre, G. (eds.) CAiSE 2007 and WES 2007. LNCS, vol. 4495, pp. 5–20. Springer, Heidelberg (2007)
15. Günther, C.W., van der Aalst, W.M.P.: Fuzzy Mining–Adaptive Process Simplification Based on Multi-perspective Metrics. In: Alonso, G., Dadam, P., Rosemann, M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 328–343. Springer, Heidelberg (2007)
16. Hallerbach, A., Bauer, T., Reichert, M.: Capturing Variability in Business Process Models: The Provop Approach. SPIP (2009)
17. Hepp, M., Leymann, F., Domingue, J., Wahler, A., Fensel, D.: Semantic Business Process Management: A Vision Towards Using Semantic Web Services for Business Process Management. In: ICEBE 2005, pp. 535–540. IEEE CS, Los Alamitos (2005)
18. Jiang, J.J., Conrath, D.W.: Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In: ROCLING 1997, pp. 19–33 (1997)
19. Kühne, T.: Matters of (Meta-) Modeling. SoSyM 5(4), 369–385 (2006)
20. Kühne, T.: Contrasting Classification with Generalisation. In: APCCM 2009, Wellington, New Zealand (January 2009)
21. Lin, D.: An Information-Theoretic Definition of Similarity. In: ICML 1998, pp. 296–304. Morgan Kaufmann, San Francisco (1998)
22. Liu, D., Shen, M.: Workflow Modeling for Virtual Processes: an Order-preserving Process-view Approach. IS 28(6), 505–532 (2003)
23. Malone, T.W., Crowston, K., Herman, G.A.: Organizing Business Knowledge: The MIT Process Handbook. The MIT Press, Cambridge (2003)
24. De Medeiros, A.K.A., van der Aalst, W.M.P., Pedrinaci, C.: Semantic Process Mining Tools: Core Building Blocks. In: ECIS 2008, Galway, Ireland, pp. 1953–1964 (2008)
25. Miller, A.G.: Wordnet: A Lexical Database for English. CACM 38(11), 39–41 (1995)
26. Polyvyanyy, A., Smirnov, S., Weske, M.: Process Model Abstraction: A Slider Approach. In: EDOC 2008, pp. 325–331. IEEE CS, Los Alamitos (2008)
27. Polyvyanyy, A., Smirnov, S., Weske, M.: The Triconnected Abstraction of Process Models. In: Dayal, U., Eder, J., Koehler, J., Reijers, H.A. (eds.) BPM 2009. LNCS, vol. 5701, pp. 229–244. Springer, Heidelberg (2009)
28. Reijers, H.A., Mendling, J.: Modularity in Process Models: Review and Effects. In: Dumas, M., Reichert, M., Shan, M.-C. (eds.) BPM 2008. LNCS, vol. 5240, pp. 20–35. Springer, Heidelberg (2008)
29. Reijers, H.A., Mendling, J., Dijkman, R.M.: On the Usefulness of Subprocesses in Business Process Models. BPM Center Report BPM-10-03, BPMcenter.org (2010)
30. Rosemann, M., van der Aalst, W.M.P.: A Configurable Reference Modelling Language. IS 32(1), 1–23 (2007)
31. Smirnov, S.: Structural Aspects of Business Process Diagram Abstraction. In: BPMN 2009, Vienna, Austria, pp. 375–382. IEEE CS, Los Alamitos (2009)
Leveraging Business Process Models for ETL Design Kevin Wilkinson, Alkis Simitsis, Malu Castellanos, and Umeshwar Dayal HP Labs, Palo Alto, CA, USA {kevin.wilkinson,alkis,malu.castellanos,umeshwar.dayal}@hp.com
Abstract. As Business Intelligence evolves from off-line strategic decision making to on-line operational decision making, the design of the back-end Extract-Transform-Load (ETL) processes is becoming even more complex. Many challenges arise in this new context like their optimization and modeling. In this paper, we focus on the disconnection between the IT-level view of the enterprise presented by ETL processes and the business view of the enterprise required by managers and analysts. We propose the use of business process models for a conceptual view of ETL. We show how to link this conceptual view to existing business processes and how to translate from this conceptual view to a logical ETL view that can be optimized. Thus, we link the ETL processes back to their underlying business processes and so enable not only a business view of the ETL, but also a near real-time view of the entire enterprise. Keywords: Business Intelligence, ETL, Process & Conceptual Models.
1 Introduction
Enterprises use Business Intelligence (BI) technologies for strategic and tactical decision making, where the decision-making cycle may span a time period of several weeks (e.g., marketing campaign management) or months (e.g., improving customer satisfaction). Competitive pressures, however, are forcing companies to react faster to rapidly changing business conditions and customer requirements. As a result, there is an increasing need to use BI to help drive and optimize business operations on a daily basis, and, in some cases, even for intraday decision making. This type of BI is called operational business intelligence.
Traditionally, business processes touch the OLTP databases, analytic applications touch the data warehouse (DW), and ETL provides the mapping between them. However, the delay between the time a business event occurs and the time that the event is reflected in the warehouse could be days. In the meantime, the data is likely to be available but it is stored in the data staging area in a form that is not available for analysis. In operational BI, knowing the current state of the enterprise may require the provisioning of business views of data that may be in the OLTP sources, already loaded into the DW, or in-flight between the sources and the DW.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 15–30, 2010. © Springer-Verlag Berlin Heidelberg 2010
The notion of ETL is now generalized to that of
integration flows, which define these business views. In addition, the BI architecture must evolve to the new set of requirements of operational BI that current data warehouse architectures cannot handle. In particular, a complex variety of quality requirements and their tradeoffs need to be considered in the design of integration flows. In this paper, we present an approach to the design of integration flows for operational BI that links BI to the operational processes of the enterprise. Our approach is based on a layered methodology that starts with modeling the business processes of the enterprise, and the BI information requirements and associated service level objectives, and proceeds systematically to physical implementation through intermediate layers comprising a conceptual model and a logical model. The methodology is centered on the notion of quality objectives, which we call collectively the QoX objectives [5] and are used to drive the optimization of the design [15]. An important feature of our design methodology is the use of business process models for the business requirements and conceptual models of ETL designs. This has several advantages. First, it provides a unified formalism for modeling both production (operational) processes as well as the end-to-end enterprise data in the warehouse; this offers a high level view of the process populating the DW. Second, it provides a business view of the intermediate enterprise state. Third, it enables ETL design from a business view that hides the low-level IT details and therefore facilitates the specification of Service Level Agreements (SLAs) and metrics by business analysts. The value of a layered methodology is that the ETL flow can be modeled and represented in a form that is appropriate for users of that layer. At the conceptual layer the flow is presented in business terms. 
At the logical layer, the flow is expressed as logical IT operations, a form that is suited for optimization and is also independent of any specific ETL engine. The physical layer is expressed using operations specific to the implementation technology. A key benefit of our approach is coupling all these layers and using QoX to guide the translation process.
Contributions. Our main contributions are as follows.
– We capture quality requirements for business views.
– We propose a conceptual model for ETL based on an enriched business process model with QoX annotations.
– We present a logical model for ETL having an XML-based representation, which preserves the functional and quality requirements and can be used by existing ETL engines.
– We describe techniques to map from conceptual to logical and then to physical models, while facilitating QoX-driven optimization.
Outline. The rest of this paper is structured as follows. Section 2 describes our QoX-driven approach and layered methodology for ETL design. Sections 3 and 4 present the business requirements model and the conceptual model, respectively. Section 5 describes the logical model and Section 6 the mapping from conceptual to logical and logical to physical models. Section 7 overviews the related work and Section 8 concludes the paper.
2 Layered QoX-Driven ETL Modeling
As shown in figure 1, our design methodology starts with a business requirements model that specifies the required business views in the context of the operational business process whose execution triggers the generation of those views. Business views are information entities that allow the user to view how the business is going in a timely manner. They are derived from source data manipulated by business processes and correspond to facts and dimensions in a data warehouse. Business requirements are usually captured by a business consultant through interviews with the business users. Then, a conceptual model is defined by the ETL designer that contains more technical details (but, it is still a high level model). The conceptual model is used for producing a logical model that is more detailed and can be optimized. Finally, we create a physical model that can be executed by an ETL engine. Only the physical model ties to a specific ETL tool, whilst the other layers are agnostic to ETL engines. The functional and non-functional requirements gathered at the business layer as SLAs are defined over a set of quality objectives which we represent as a set of metrics called QoX metrics. A non-exhaustive list of QoX metrics includes: performance, recoverability, reliability, freshness, maintainability, scalability, availability, flexibility, robustness, affordability, consistency, traceability, and auditability. Some metrics are quantitative (e.g., reliability, freshness, cost) while other metrics may be difficult to quantify (e.g., maintainability, flexibility). In previous work, we have defined these metrics and have shown how to measure them [15,16]. In this paper, we use a subset of these metrics to illustrate how QoX metrics drive the modeling. At each design level, QoX objectives are introduced or refined from higher levels. There are opportunities for optimization at each successive level of specification. 
Optimizations at all design levels are driven by QoX objectives. These objectives prune the search space of all possible designs, much like cost estimates are used to bound the search space in cost-based query optimization. For example, there may be several alternative translations from operators in the conceptual model to operators in the logical model and these alternatives are determined by the QoX objectives and their tradeoffs. Similarly, the translation from the logical model to the physical model enables additional types of QoX-driven optimizations.
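The pruning role of QoX objectives described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the optimizer of [15,16]: the candidate designs, their estimated metric values, and the abstract `cost` field are hypothetical, and only two metrics (freshness and MTTR) are considered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateDesign:
    name: str
    freshness_min: float  # estimated delay until data is visible, in minutes
    mttr_min: float       # estimated mean time to recover, in minutes
    cost: float           # abstract cost estimate (hypothetical unit)


def prune(candidates, max_freshness_min, max_mttr_min):
    """Keep only the designs that satisfy both QoX objectives."""
    return [
        c for c in candidates
        if c.freshness_min <= max_freshness_min and c.mttr_min <= max_mttr_min
    ]


def choose(candidates, max_freshness_min, max_mttr_min):
    """Return the cheapest feasible design, or None if none is feasible."""
    feasible = prune(candidates, max_freshness_min, max_mttr_min)
    return min(feasible, key=lambda c: c.cost) if feasible else None


# Hypothetical alternatives for the same conceptual flow.
candidates = [
    CandidateDesign("batch_extract", 1440, 30, 1.0),
    CandidateDesign("batch_extract_with_recovery_point", 1440, 5, 1.4),
    CandidateDesign("message_probe", 15, 5, 3.0),
]
best = choose(candidates, max_freshness_min=60, max_mttr_min=10)
```

With a one-hour freshness objective only the message-probe alternative survives the pruning; relaxing freshness to a day lets the cheaper batch design with a recovery point win.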
3 Business Requirements
The first level of our layered architectural model for QoX-driven design is the business requirements level. Traditionally business requirements are informally captured in documents and dealt with in an ad-hoc manner. Our goal in this level is four-fold. First, we need to identify the data entities needed for decision support that are to be created and maintained by the ETL processes. These correspond to the business views. Second, we need to relate the business views to events in the operational business processes that affect them. These events represent the input sources for the business view.
Fig. 1. Layered approach for ETL
Fig. 2. Connecting Order-to-Revenue operational processes to DailyRevenue business view
Third, we need to express
quality requirements on the business views as SLAs on QoX metrics. Fourth, we need to collect the business rules that specify how the objects in the operational processes are to be mapped into objects in the business view. For example, consider a hypothetical, on-line, retail enterprise and an associated business process for accepting a customer order, fulfilling and shipping the order and booking the revenue. Such an Order-to-Revenue process involves a number of workflows that utilize various operational (OLTP) databases. A Shopping subprocess would support the customer browsing and the shopping cart. A Checkout subprocess would validate the order and get payment authorization. A Delivery subprocess would package and ship the order. Figure 2 illustrates how the shopping subprocess sends order details of the shopping cart to the checkout process. In turn, the check-out subprocess sends a validated order to the delivery subprocess. The delivery subprocess may then send messages for each individual delivery package and when the full order has been completed. Suppose there is a business requirement for a DailyRevenue business view that computes total sales by product by day. That raises the question of what constitutes a sale. An item might be considered sold at customer check-out (the order details message in figure 2), after the order is validated (the validated order message), when the product is shipped (the shipped item message) or when all items in the order have shipped (the order complete message). The choice depends on the business need but, for correctness, the daily revenue business view must be linked to the appropriate business events in the operational processes. For our example, we will consider an item is sold when the order is complete. Once linkages to the business events are understood, the QoX for the business view must be defined. For DailyRevenue, suppose there is a requirement to generate a weekly report. 
This corresponds to a freshness requirement that no more than one week can elapse between a sale and its inclusion in the DailyRevenue business view. For another example, suppose there is a reliability requirement that the report must be generated on a weekend night and that the flow that implements the report should be able to tolerate two failures during execution. This translates into a reliability requirement that the view must complete within 8 hours and may be restarted (either from the beginning or from some recovery point) at most twice.
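As a sketch of how the two DailyRevenue requirements above might be recorded and checked, the following assumes a simple record type. The class names, field names, and the `violations` helper are illustrative assumptions and are not part of the paper's models.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class QoXRequirement:
    freshness: timedelta         # max delay between a sale and its appearance in the view
    execution_window: timedelta  # time within which the flow must complete
    max_restarts: int            # number of restarts (failures) to tolerate


@dataclass(frozen=True)
class RunRecord:
    oldest_pending_sale: timedelta  # age of the oldest sale not yet in the view
    elapsed: timedelta              # wall-clock time of the run, restarts included
    restarts: int


def violations(req, run):
    """Return the names of the QoX requirements that a run did not meet."""
    failed = []
    if run.oldest_pending_sale > req.freshness:
        failed.append("freshness")
    if run.elapsed > req.execution_window or run.restarts > req.max_restarts:
        failed.append("reliability")
    return failed


# One week of freshness, an 8-hour window, and at most two restarts.
daily_revenue = QoXRequirement(timedelta(weeks=1), timedelta(hours=8), 2)
ok_run = RunRecord(timedelta(days=6), timedelta(hours=5), restarts=2)
bad_run = RunRecord(timedelta(days=8), timedelta(hours=5), restarts=3)
```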
Finally, the detailed business rules for constructing the business view must be collected. For example, suppose internal orders should not be counted as revenue. In that case, the rule for detecting an internal order must be defined and included in the business view. As another example, there may be a requirement that revenue from EU countries should be converted from euros to dollars and tax should be calculated on US orders. Given the QoX, business rules and relevant business events from the operational processes, the next step is to construct the ETL conceptual model as described in the next section.
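The three example rules above (exclude internal orders, convert EU revenue from euros to dollars, compute tax on US orders) could be written down as in the sketch below. The exchange rate, the tax rate, the country subset, and the way an internal order is detected are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass

# All constants below are hypothetical placeholders for illustration.
EUR_TO_USD = 1.25
US_TAX_RATE = 0.08
EU_COUNTRIES = {"DE", "FR", "IT"}  # illustrative subset only


@dataclass(frozen=True)
class Order:
    customer_type: str  # "internal" or "external" (assumed detection rule)
    country: str        # country code of the order
    amount: float       # in euros for EU orders, in dollars otherwise


def revenue_and_tax(order):
    """Apply the three business rules; return (revenue_usd, tax_usd) or None."""
    if order.customer_type == "internal":
        return None                    # internal orders are not counted as revenue
    revenue = order.amount
    if order.country in EU_COUNTRIES:
        revenue *= EUR_TO_USD          # convert euros to dollars
    tax = revenue * US_TAX_RATE if order.country == "US" else 0.0
    return revenue, tax
```

For example, `revenue_and_tax(Order("internal", "US", 50.0))` returns `None`, so the order never reaches the business view.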
4 Conceptual Model
There is a fundamental disconnect between the real-time view of the enterprise provided by the operational systems (e.g., OLTP systems) and the historical, analytical view provided by the data warehouse. These two views are reconciled by ETL but the ETL processes are themselves opaque to business users. Operational data stores have been employed to bridge the gap but this adds yet another view of the enterprise; e.g., three data sources, three data models, three query systems. We believe the use of a formal model for the ETL conceptual level can bridge the disconnect. We propose the use of business process models for the conceptual level. In our approach, for each business view of a data warehouse there is a corresponding business process to construct that view. This has several advantages as mentioned in the Introduction.
4.1 BPMN as an ETL Conceptual Model
In this section, we describe how we adapt BPMN (Business Process Modeling Notation [4]) for ETL conceptual models. BPMN is a widely-used standard for specifying business processes in graphical terms. A related specification, BPEL (Business Process Execution Language), was considered but it is more focused on details of process execution than process specification and semantics. The legend in figure 3 shows the modeling components we use in our BPMN diagrams. They should be self-explanatory. For a complete description of BPMN, see [4].
Design patterns. A major challenge in using business process models is that ETL fundamentally describes a data flow whereas business process models describe control flow. Note that the proposed BPMN version 2.0 includes features for capturing dataflow but they are not yet adopted nor widely available in tools. Another challenge is that ETL generally operates over batch datasets that represent an entire day or perhaps week of business activity. On the other hand, an operational business process represents a single business activity. In other words, an ETL flow aggregates a number of business flows. These two different perspectives must be reconciled. A third challenge is our need to capture QoX within a BPMN model. We discuss these issues in turn. To address the data flow issue, we associate documents with tasks in a business process and use these to represent the data flow.
Fig. 3. DailyRevenue ETL Conceptual Model
Each task has an associated
input document and an output document. The input document describes the input data objects and their schemas as well as parameters for the task. The output document identifies the result object and its schema. To track the QoX for the ETL process, a distinguished document, QoX, is designated to contain the QoX parameters. To address the conversion from single process instances to batch datasets, we use a design pattern that models each business view as four related flows: scheduler, extract, fact, and load. Note that in this context, fact is used generically to refer to any warehouse object, i.e., dimensions, roll-ups as well as fact tables. The scheduler orchestrates the other three flows, e.g., starting and stopping, encoding QoX as parameters, and so on. The remaining flows correspond roughly to the extract, transform and load steps of ETL. The extract flow is responsible for interfacing with operational business processes. Probes inserted into the operational business processes notify the extract flow of relevant business events. The extract process accumulates the business events and periodically (as per the freshness requirement), releases a batch data object to the fact flow. In our model, data objects are modeled as hypercubes. This is a familiar paradigm for business users. In addition, operations over hypercubes are readily translated to relational algebra operators [1] so this facilitates translating the conceptual model to the logical model. The fact flow performs the bulk of the work. It uses the raw business event data collected by the extract process and transforms it to the warehouse view. The resulting view instance (e.g., fact object) is then added to the warehouse by the load flow. Note that quality objectives must be captured at three levels of granularity: process level, flow level, and task (operator) level. The distinguished QoX document addresses the flow level. Document associations address the task level. 
The process level is addressed by a (global) process level QoX document that is inherited by all flows.
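The batching behavior of the extract flow described above (accumulate business events, then release one batch per freshness interval) can be sketched as a small generator. This is a simplified illustration: timestamps are plain numbers, the scheduler is folded into the loop, and the function name and event payloads are assumptions.

```python
def batches_by_freshness(events, start, freshness):
    """Group time-ordered (timestamp, payload) events into one batch per interval.

    An interval with no events yields an empty batch, mirroring an extract
    flow that is stopped and restarted even when no orders completed.
    """
    batch = []
    window_end = start + freshness
    for timestamp, payload in events:
        # Close every interval that ended before this event arrived.
        while timestamp >= window_end:
            yield batch
            batch = []
            window_end += freshness
        batch.append(payload)
    yield batch  # release whatever accumulated in the last interval


# Hypothetical "order complete" events, timestamps in hours since the start.
events = [(1, "order-1"), (5, "order-2"), (170, "order-3")]
weekly = list(batches_by_freshness(events, start=0, freshness=168))
```

Here `weekly` holds two batches: the first two orders fall in the first week, the third in the second.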
4.2 Example of an ETL Conceptual Process
We illustrate our approach using an example process for DailyRevenue. Suppose the DailyRevenue fact table records the total quantity, revenue, and average time to ship each product sold each day, and that its schema is . A BPMN diagram for the DailyRevenue fact process is shown in figure 3. The other processes are either compressed or not shown to simplify the presentation and also because they convey implementation details that are not necessarily of interest (e.g., process orchestration) at the conceptual level to the business user. Assume that the quality objectives specify high reliability in addition to freshness (e.g., one week). The scheduler flow includes the QoX document at the start of the flow (not shown). The scheduler then starts an extract flow, delays for the freshness interval, then stops the current extract flow and loops back to start a new extract for the next freshness period. Note that to simplify the diagram we do not show all document associations within a flow. The extract flow aggregates order information needed to compute the DailyRevenue fact. A key point is the linkage of this flow to the operational business process that confirms an order, i.e., Delivery. The extract flow receives a message from Delivery for each completed order. Note that this is a conceptual model and does not require that an implementation use a message interface to an operational business process. The translation from conceptual to logical to a physical ETL model will use QoX objectives to determine if the extract flow should use a (near real-time) message probe or a periodic extract from a source table. For the specified freshness period, extract collects the orders, building a conceptual hypercube. At the end of the freshness interval it sends the accumulated order hypercube to the fact process.
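The DailyRevenue fact described above records, per product and day, the total quantity, the revenue, and the average time to ship. A minimal sketch of that roll-up over the accumulated line items follows. The tuple layout is an assumption, and the average is taken per line item (unweighted), which is one possible reading of the text.

```python
from collections import defaultdict


def roll_up(line_items):
    """Aggregate line items on the (date, product) dimensions.

    Each line item is (date, prod_no, quantity, revenue, days_to_ship).
    Returns {(date, prod_no): (total_quantity, total_revenue, avg_days_to_ship)}.
    """
    # Running sums per cell: quantity, revenue, days to ship, item count.
    sums = defaultdict(lambda: [0, 0.0, 0.0, 0])
    for date, prod_no, quantity, revenue, days_to_ship in line_items:
        cell = sums[(date, prod_no)]
        cell[0] += quantity
        cell[1] += revenue
        cell[2] += days_to_ship
        cell[3] += 1
    return {
        key: (qty, rev, ship / count)
        for key, (qty, rev, ship, count) in sums.items()
    }


# Hypothetical line items for one day.
items = [
    ("2010-11-01", "p1", 2, 20.0, 2.0),
    ("2010-11-01", "p1", 1, 10.0, 4.0),
    ("2010-11-01", "p2", 5, 50.0, 1.0),
]
facts = roll_up(items)
```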
For this example, we assume the fact process computes three aggregate values for each date and product: total quantity sold, total revenue and average time to ship the product. As mentioned earlier, we assume three business rules relating to internal orders, EU currency conversion, and US state taxes. Figure 3 shows details of the fact flow (dailyRevFact). The main flow of the process is executed, conceptually in parallel, for each lineitem slice of the orders hypercube. The idResolve operation represents identity resolution, a concept familiar to business managers (e.g., a customer may have multiple aliases, a new customer must be added to the master data, and so on). At the logical level, this operation includes surrogate key generation but that level of detail is hidden by the abstract operation. In addition, there are other predefined functions useful for all flows; e.g., dateDiff, lkupTable, appendFile, currCvt, etc. Other functions may be specific to a single business view. At the end of the fact flow, the orders hypercube is rolled up on the date and product dimensions. This creates new facts that are added to the warehouse by the load process (not shown).
4.3 Encoding the ETL Conceptual Model
A number of vendors and open-source projects support the creation of BPMN models using a GUI (e.g., [8,12]). However, the original BPMN specification did not include a (textual) serialization format. This is a problem for our approach because a serialization format is needed in order to process the conceptual model and generate an ETL logical model. In fact, the proposed BPMN version 2.0 does include an XML serialization but it is not yet adopted. In the meantime, XPDL is in wide use as a de facto standard XML serialization for BPMN models and we use it for now. The mapping from our BPMN models to XPDL is straightforward and not detailed here due to space limitations. For example, we use a Package for each business view and, within a package, one Pool per process. The mapping of BPMN to XPDL MessageFlows, Artifacts, WorkflowProcesses, and Associations is direct. XPDL DataFields are useful for sharing QoX metrics across flows. Each XPDL object (an XML element) has a unique identifier in its Id attribute and we use that to reference the object from other elements. Clearly, there is also a mapping from attribute names in the process diagrams (e.g., prodNo) to elements in the data objects, but that need not be presented in this view. Finally, XPDL has attributes for the placement of model elements in a graphical representation.
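To make the packaging convention above concrete (one Package per business view, one Pool per process, DataFields for shared QoX metrics), the sketch below builds a heavily simplified XPDL-like skeleton. It omits the XPDL namespace and most mandatory content, so it is not schema-valid XPDL; the identifier scheme (`pool_…`, `qox_…`) and the `build_package` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_package(view_name, processes, qox):
    """Build a simplified, non-validating XPDL-like skeleton for one business view."""
    package = ET.Element("Package", Id=view_name)
    pools = ET.SubElement(package, "Pools")
    workflows = ET.SubElement(package, "WorkflowProcesses")
    for process in processes:
        # One Pool per process; the Pool refers to its WorkflowProcess by Id.
        ET.SubElement(pools, "Pool", Id=f"pool_{process}", Process=process)
        ET.SubElement(workflows, "WorkflowProcess", Id=process)
    fields = ET.SubElement(package, "DataFields")
    for metric, value in qox.items():
        # DataFields carry QoX metrics that are shared across flows.
        field = ET.SubElement(fields, "DataField", Id=f"qox_{metric}")
        ET.SubElement(field, "InitialValue").text = str(value)
    return package


package = build_package(
    "DailyRevenue",
    ["scheduler", "extract", "fact", "load"],
    {"freshness": "1 week"},
)
xml_text = ET.tostring(package, encoding="unicode")
```

Other elements can then reference an object through its `Id` attribute, e.g. `package.find("Pools/Pool[@Id='pool_fact']")`.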
5 Logical Model
5.1 Preliminaries
In this subsection we present theoretical underpinnings for the logical model. In the next subsection we provide its representation in XML. An ETL design can be represented as a directed, acyclic graph (DAG) of vertices and edges, G = (V, E) [19]. The vertices are either data stores or activities (a.k.a. operations and transformations). As in [16], each node has a set of input, output, and parameter schemata; however, here we focus on modeling QoX metrics and for clarity of presentation we do not mention such schemata. The ETL graph represents the flow of data from the source data stores (e.g., operational systems) to the target data stores (e.g., data warehouse and data marts). The data flow is captured by the edges of the graph called provider relationships. A provider relationship connects a data store to an activity or an activity to either another activity or a data store. Such constructs are adequate for capturing the data flow semantics and functionality and are used by existing logical models (see Section 7). We extend these models as follows. For incorporating additional design information like business requirements Qx (as QoX metrics and values), physical resources needed for the ETL execution Rp, and other generic characteristics F (e.g., useful for visualization), we extend the ETL graph and we consider its parameterized version G(P), where P is a finite set of properties that keep track of the above information. Formally, an ETL design is a parameterized DAG G(P) = (V(P), E), where P = Qx ∪ Rp ∪ F. If we do not wish to keep additional design information, then P can be the empty set. P may contain zero or more elements according to the design phase. For example, at the logical level, the ETL design does not necessarily contain information about the physical execution (e.g., database connections, memory and processing requirements), and thus, Rp may be empty.
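One way to picture the parameterized DAG G(P) = (V(P), E) defined above is the small data structure below. It is a sketch under assumptions: the class and method names are invented here, properties are kept as plain dictionaries, and acyclicity is checked with the standard-library `graphlib` module (Python 3.9 or later).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Vertex:
    name: str
    kind: str                               # "datastore" or "activity"
    qx: dict = field(default_factory=dict)  # QoX metrics and values
    rp: dict = field(default_factory=dict)  # physical resources
    f: dict = field(default_factory=dict)   # generic characteristics


@dataclass
class EtlGraph:
    vertices: dict = field(default_factory=dict)  # name -> Vertex
    edges: list = field(default_factory=list)     # (provider, consumer) pairs
    qx: dict = field(default_factory=dict)
    rp: dict = field(default_factory=dict)
    f: dict = field(default_factory=dict)

    def add_vertex(self, vertex):
        self.vertices[vertex.name] = vertex

    def add_edge(self, provider, consumer):
        """Add a provider relationship; two data stores cannot be linked directly."""
        kinds = (self.vertices[provider].kind, self.vertices[consumer].kind)
        if kinds == ("datastore", "datastore"):
            raise ValueError("a provider relationship needs at least one activity")
        self.edges.append((provider, consumer))

    def execution_order(self):
        """Return a topological order; graphlib raises CycleError on a cycle."""
        sorter = TopologicalSorter({name: set() for name in self.vertices})
        for provider, consumer in self.edges:
            sorter.add(consumer, provider)
        return list(sorter.static_order())


g = EtlGraph(qx={"cycle": "15min"})
for name, kind in [("LineItem", "datastore"), ("Extract", "activity"),
                   ("GroupBy", "activity"), ("Load", "activity"),
                   ("DailyRevenueFact", "datastore")]:
    g.add_vertex(Vertex(name, kind))
for edge in [("LineItem", "Extract"), ("Extract", "GroupBy"),
             ("GroupBy", "Load"), ("Load", "DailyRevenueFact")]:
    g.add_edge(*edge)
order = g.execution_order()
```

Leaving `rp` empty at the logical level corresponds to the case where Rp is the empty set.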
Fig. 4. DailyRevenue ETL logical model: (a) ETL graph; (b) xLM representation
As we have discussed, additional design information like the QoX metrics can be assigned at different abstraction levels: at the flow level (e.g., ‘the F flow should run every 15min’) or at a certain operation (e.g., ‘add a recovery point after the O operator’). While the parameterized DAG G(P) covers the former case, the latter case is modeled through the parameterized vertices V(P). For example, to meet a recoverability requirement, we may decide to add a recovery point after a certain operator O; we can model this as O(Qx), Qx = {add RP}. As with the parameterized DAG, the parameterized vertices may contain zero or more elements of P, which at different design levels are used differently. For example, for a specific operation at the logical level a property named ‘operational semantics’ describes an abstract generic algorithm, e.g., merge-sort, whereas at the physical level it contains the path/name/command for invoking the actual implementation code that executes this operation. For example, consider figure 4a that depicts the ETL graph, say G1, for the DailyRevenue example previously discussed. To denote that G1 should run every 15min (freshness), have mean time to recover (MTTR) equal to 2min, use 2GB of memory and 4 CPUs, and use a certain database dbx for temporary storage, we can write: G1(P) = {{cycle = 15min, MTTR = 2min}, {memory = 2GB, cpus = {cp1, cp2, cp3, cp4}, tmpdb sid = dbx}}. Alternatively, we can optimize the design for recoverability [16] and push the requirement for recoverability down to the operator level. A possible outcome of this optimization might be that the MTTR requirement can be achieved if we add a recovery point after an expensive series of operators and in particular after an operator Ok. Then, we write: G1(P) = {{cycle = 15min}, {memory = 2GB, cpus = {cp1, cp2, cp3, cp4}, tmpdb sid = dbx}} and V1(P) = {..., Ok({add RP}, {}, {}), ...}. Finally, it is possible to group parts of an ETL flow in order to create ETL subflows. In other words, an ETL graph G(P) may comprise several subgraphs Gi(Pi), where P = ∪∀i Pi.
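The G1 example above, including the step that pushes the recoverability requirement from the flow down to an operator, can be mimicked with plain dictionaries. This is a sketch only: the key names (`tmpdb_sid`, `add_RP`) are Python-friendly spellings chosen here, and `push_recovery_point` stands in for the outcome of the optimization in [16] rather than reproducing it.

```python
def push_recovery_point(flow_props, vertex_props, operator):
    """Replace the flow-level MTTR with a recovery point after one operator.

    Returns new (flow_props, vertex_props); the inputs are left unchanged.
    """
    new_flow = dict(flow_props)
    new_flow["Qx"] = {k: v for k, v in flow_props["Qx"].items() if k != "MTTR"}
    new_vertices = {
        name: {**props, "Qx": dict(props["Qx"])}
        for name, props in vertex_props.items()
    }
    target = new_vertices.setdefault(operator, {"Qx": {}, "Rp": {}, "F": {}})
    target["Qx"]["add_RP"] = True
    return new_flow, new_vertices


# G1(P) with flow-level freshness and MTTR, plus physical resources.
g1 = {
    "Qx": {"cycle": "15min", "MTTR": "2min"},
    "Rp": {"memory": "2GB", "cpus": ["cp1", "cp2", "cp3", "cp4"], "tmpdb_sid": "dbx"},
    "F": {},
}
v1 = {"Ok": {"Qx": {}, "Rp": {}, "F": {}}}

g1_opt, v1_opt = push_recovery_point(g1, v1, "Ok")
```

After the call, `g1_opt["Qx"]` keeps only the cycle, and the operator `Ok` carries the recovery-point property, matching the second form of G1(P) and V1(P) in the text.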
5.2 Encoding the ETL Logical Model
Although different design alternatives have been proposed for the logical modeling of ETL flows (see Section 7), which are based either on ad hoc formalisms or some standard design language (e.g., UML), we follow a novel approach and use XML notation for representing logical ETL models. Since there is no standard modeling technique for ETL, we choose to work close to the prevailing practice in the market. Many commercial ETL tools use XML for storing and loading ETL designs. We call our logical ETL model xLM. xLM uses two main entities: the design and node elements. The design element represents an ETL flow (or an ETL graph). ETL subflows are denoted as nested design elements. The node element represents a vertex of the ETL graph (either activity or recordset). Next, we elaborate on these two elements.
The design element. It contains all the information needed for representing an ETL flow. Its main elements are as follows:
– Nodes: Each node represents a vertex of the ETL graph (see below).
– Edges: An edge stands for a provider relationship connecting two vertices (activities or data stores). It has (a) a name, (b) a starting vertex, (c) an ending vertex, (d) additional information like enabled or not, partitioning type (if it participates in a partitioned part of the flow), and so on.
– Properties (Qx): ETL graph properties involve a set of QoX metrics defined at the flow level. Such properties are defined as simple expressions of the form: qmetric θ value or f(qmetric) θ value. qmetric is a quantitative metric for non-functional requirements; i.e., QoX metrics (e.g., MTTR, uptime, degree of parallelism, memory, cpus). f can be any built-in or user defined function (e.g., an aggregate function like min, max, avg). θ can be any of the usual comparison operators, like = and ≤. A value belongs to a domain specified according to the respective qmetric; thus, the variety of value domains has a 1-1 mapping to the variety of qmetric domains. Example properties are: ‘cycle=15min’, ‘MTTR=2min’, ‘failure probability ≤ 0.001’, ‘max(tolerable failures) = 2’, and ‘execution window = 2h’.
– Resources (Rp): This element specifies the set of resources needed for ETL execution. Typical resources are: memory, cpus, disk storage, db-connections, paths for tmp/log storage, and so forth. For example, we may specify the following resources: memory=2GB, cpus={cp1,cp2,cp3,cp4}, tmpdb sid=dbx.
– Generic characteristics (F): This generic element comprises metadata needed for the visualization of the ETL flow. Typical metadata are x,y coordinates of the design, colors for each node type, language of the labels, font size, and so on.
The node element. It represents a specific vertex of the ETL graph. It consists of several elements that specify, customize, and define the operation of the vertex. Its main elements are as follows:
– Name and type: The name and type of the vertex; a vertex can be either activity or data store.
– optype: The operational type of a vertex defines its operational semantics. We consider an extensible library of ETL activities (operations) in the spirit of [19] (e.g., filter, join, aggregation, diff, SK assignment, SCD-1/2/3, pivoting, splitter/merger, function). The operational type can be any of these if the node represents an activity. If the node is a data store, then the optype represents the nature of the data store; e.g., file, relational table, xml document, and so on. (At the physical level, this element specifies the path/name/command for invoking the actual implementation code that executes this operation.)
– schemata: This element describes the schemata employed by the node. These are the input, output, and parameter schemata. The cardinality of the first two is equal or greater than one. If the node is a data store, then it has only one schema (without loss of generality, one input schema) with cardinality equal to one. The input and output schemata stand for the schema of a processing tuple before and after, respectively, the application of the operation. The parameter schema specifies the parameters needed for the proper execution of the operation. For example, for a filter operation Of, the input schema can be {id, age, salary}, the output schema is the same, and the parameter schema is {age>30, salary< /schemata>
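The flow-level property expressions described above, of the form qmetric θ value or f(qmetric) θ value, are simple enough to parse with a regular expression. The sketch below is one possible reading: the set of accepted comparison operators and the returned dictionary layout are assumptions, since the text only gives example expressions.

```python
import re

# Either f(qmetric) or a bare qmetric, then a comparison operator, then a value.
_PROPERTY = re.compile(
    r"^\s*(?:(?P<func>\w+)\s*\(\s*(?P<inner>[^()]+?)\s*\)|(?P<metric>[^()<>=≤≥]+?))"
    r"\s*(?P<op><=|>=|≤|≥|=|<|>)\s*(?P<value>.+?)\s*$"
)


def parse_property(text):
    """Split a property expression into function, qmetric, operator, and value."""
    match = _PROPERTY.match(text)
    if match is None:
        raise ValueError(f"not a property expression: {text!r}")
    return {
        "function": match.group("func"),  # None when no function is applied
        "qmetric": match.group("inner") or match.group("metric"),
        "op": match.group("op"),
        "value": match.group("value"),
    }


# The example properties quoted in the text.
examples = [
    "cycle=15min",
    "MTTR=2min",
    "failure probability ≤ 0.001",
    "max(tolerable failures) = 2",
    "execution window = 2h",
]
parsed = [parse_property(e) for e in examples]
```

Values are kept as strings here; converting "15min" or "2h" into typed quantities would depend on the domain of each qmetric.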
The QoX objectives are stored as a set of strings that contains elements of the form: f(qmetric) θ value (see properties in Section 5.2 for f, θ, and value). For example, figure 5 shows an objective for max(freshness)=12. Expressions are stored similarly. The expression set may contain notes as narratives. These notes are not to be automatically translated or used; they are intended to be used by the designer who will manually take the appropriate actions. It is straightforward for an appropriate parser to transform such information from one model to the other (essentially from one XML file containing XPDL constructs to another XML file containing xLM constructs). Both XPDL and xLM describe graphs so there is a natural correspondence between many elements. XPDL activities map to xLM nodes. XPDL transitions become xLM edges. XPDL workflows become xLM designs. XPDL document artifacts that describe dataflow are converted to schemata for nodes. QoX in the XPDL become properties either at the flow level or the node level. However, the translation from XPDL to xLM is not mechanical. As we discussed, patterns in the XPDL are converted to specific operators. QoX objectives may affect the choice of operator or flow sequence. Parameters for templated nodes (e.g., those with alternative implementations) must be selected, etc.
Fig. 5. Example objective
Fig. 6. Example Kettle k::transformation
6.2 Logical to Physical Models
Once we have modeled ETL at the logical level, we have both: (a) an ETL graph comprising activities and data stores as vertices and provider relationships as edges, and (b) a representation of the design in XML form. (Although our approach is generic enough and the logical XML file can be transformed to other proprietary formats as well, we use an XML representation for ETL graphs, since most ETL tools use the same format for storing and loading ETL designs.) The former representation can be used for optimizing the design, whilst the second for creating a corresponding physical model. Elsewhere, we have proposed algorithms for optimizing logical ETL designs solely for performance [14] and also, for freshness, recoverability, and reliability [16]. The input and output of our optimizer [16] are ETL graphs, which can be translated to appropriate XML representations interpretable by an ETL engine. Therefore, when we finish processing the logical model (after we have complemented it with the appropriate details and optimized it) we create a corresponding physical design. The details of the physical design are tied to the specific ETL engine chosen for the implementation. Then, the task of moving from the logical to the physical level, essentially is taken care of by an appropriate parser that transforms the logical XML representation to the XML representation that the chosen ETL engine uses. As a proof of concept, we describe how the proposed logical XML representation can be mapped to the XML format used by Pentaho’s Kettle (a.k.a. Pentaho Data Integration or PDI). Similarly, it can be mapped to other solutions like Informatica’s PowerCenter, and others.
Fig. 7. DailyRevenue ETL Physical Model (in Kettle)
Kettle supports ‘jobs’ and ‘transformations’ for representing control and data flows, respectively. In the generic case, the top-level element of our logical XML maps to a ‘job’, and any elements of the same kind nested within it map to ‘transformations’. Briefly, a part of the XML schema for a Kettle transformation is shown in figure 6. It has three parts. The first involves physical properties (e.g., the database connection) and can be populated with the Rp of the ETL (sub)graph G(P) mapped to this ‘transformation’. The second part specifies the order and interconnection of ETL activities, which corresponds to the provider relationships of G(P). The third part describes the ETL activities (steps) and contains all metadata needed for their execution (e.g., partitioning method, memory, temp directory, sort size, compression), the parameters needed, and visualization information. The metadata changes with the type of each step; e.g., in figure 6, if the type of the step is SortRows, the described operation is a sorter. Thus, for a certain vertex of G(P), v(P), the name and optype are mapped to the name and type of a step. The input schemata are used for the definition of the source data stores. If we need to modify a schema, and the type of the step allows it, then we update the schema of the step according to the formulae: generated attributes = output schema - input schema, and projected out attributes = input schema - output schema. The parameter schema of v(P) populates the parameters of the step. The physical resources Rp of v(P) are used to populate the physical resources of the step accordingly. Similarly, the properties Qx that made it to the physical level are mapped to the respective fields. For example, a requirement about partitioning populates the respective partitioning element by filling in the partitioning method and defining other related parameters (e.g., the degree of parallelism).
A requirement about increased freshness would affect the implementation type; e.g., by making us choose an algorithm that can handle streaming data instead of an algorithm that performs better (and possibly more accurately) when it processes data in batch mode. Finally, some features (either of G(P) or of a v(P)) are different in each design level. For example, visualization information (e.g., x- and y-coordinates) is different in logical and physical designs, since the designs may contain different nodes and have different structure.
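The two schema-update formulae above amount to set differences. The following is a minimal sketch, assuming schemata are plain collections of attribute names; the function name and the sample attributes are hypothetical and are not taken from Kettle or from the DailyRevenue flow.

```python
def schema_changes(input_schema, output_schema):
    """Apply the two formulae: generated = out - in, projected out = in - out."""
    inp, out = set(input_schema), set(output_schema)
    return {
        "generated": out - inp,      # attributes the step adds
        "projected_out": inp - out,  # attributes the step drops
    }


# Hypothetical step that replaces a natural key with a surrogate key.
changes = schema_changes(
    input_schema=["order_id", "customer_name", "amount"],
    output_schema=["order_id", "customer_sk", "amount"],
)
```

Attributes present in both schemata (here `order_id` and `amount`) appear in neither set, which matches the intent that only the differences need to be written into the physical step.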
In terms of our running example, figure 7 shows a physical model for the DailyRevenue process previously discussed. Note the effect of optimization in that surrogate key generation for the RevenueFact was moved forward in the flow to appear with the other surrogate key methods (preceding the filter). Also note the use of Kettle methods for conditional branches, function invocation, and database lookup.
7 Related Work
In business process modeling, the business artifact centric approach has been proposed [3]. This approach relates to our approach in that our business views could also be considered business artifacts. However, business artifacts refer to objects in the OLTP databases that are manipulated by the operational business processes throughout their lifecycle. Business views, in contrast, are BI information entities necessary for business analysis and decision support.
Several efforts have been proposed for the conceptual modeling of ETL processes. These include ad hoc formalisms [19] and approaches based on standard languages like UML (e.g., [9]) or MDA (e.g., [10,11]). Akkaoui and Zimányi present a conceptual model based on the BPMN standard and provide a BPMN representation for frequently used ETL design constructs [2]. Habich et al. propose an extension to BPEL, called BPEL-DT, to specify data flows within BPEL processes [7]. The intuition of this work is to use the infrastructure of web services to support data intensive service applications. Our work differs in that we provide a complete framework that covers other design levels beyond the conceptual one. We consider business and integration processes together and include quality objectives (QoX metrics) that pass seamlessly through the design layers.
The logical modeling of ETL processes has attracted less research. Vassiliadis et al. use LDL as a formal language for expressing the operational semantics of ETL activities [18]. ETL tools available in the market provide design interfaces that could be used for representing logical ETL models. However, since the designs provided are closely related to the execution layer, it is difficult to provide a logical abstraction of the processes. Also, to the best of our knowledge, there is no means for incorporating quality objectives into the design.
The only work related to our multi-layer design approach is a previous effort in translating conceptual (expressed in ad hoc notation) to logical ETL models (in LDL) [13]. Here, we use standard notation, and tools and techniques closer to those provided by commercial ETL tools. Also, we describe a comprehensive approach that captures QoX requirements at the business level and propagates them to lower design levels.
8 Conclusions
We described a layered methodology for designing ETL processes for operational Business Intelligence. This methodology is novel in that it uses a unified formalism for modeling the operational processes of the enterprise as well as
the processes for generating the end-to-end business views required for operational decision-making. The methodology starts with gathering functional and non-functional requirements for business views. Then, we discuss how to design conceptual and logical models. We also present a method for translating one model to another, in order to produce the final physical model that runs in an ETL engine of choice. The whole process is driven by QoX objectives that are transferred through all the design levels. We illustrated the methodology with examples derived from a real-world scenario of an on-line retail enterprise.
References
1. Agrawal, R., Gupta, A., Sarawagi, S.: Modeling multidimensional databases. In: ICDE, pp. 232–243 (1997)
2. Akkaoui, Z.E., Zimányi, E.: Defining ETL worfklows using BPMN and BPEL. In: DOLAP, pp. 41–48 (2009)
3. Bhattacharya, K., Gerede, C.E., Hull, R., Liu, R., Su, J.: Towards formal analysis of artifact-centric business process models. In: Alonso, G., Dadam, P., Rosemann, M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 288–304. Springer, Heidelberg (2007)
4. BPMN (2009), http://www.bpmn.org
5. Dayal, U., Castellanos, M., Simitsis, A., Wilkinson, K.: Data integration flows for business intelligence. In: EDBT, pp. 1–11 (2009)
6. Dayal, U., Wilkinson, K., Simitsis, A., Castellanos, M.: Business processes meet operational business intelligence. IEEE Data Eng. Bull. 32(3), 35–41 (2009)
7. Habich, D., Richly, S., Preissler, S., Grasselt, M., Lehner, W., Maier, A.: BPEL-DT - Data-aware Extension of BPEL to Support Data-Intensive Service Applications. In: WEWST (2007)
8. Intalio: BPMN designer (2009), http://www.intalio.com/
9. Luján-Mora, S., Vassiliadis, P., Trujillo, J.: Data Mapping Diagrams for Data Warehouse Design with UML. In: Atzeni, P., Chu, W., Lu, H., Zhou, S., Ling, T.-W. (eds.) ER 2004. LNCS, vol. 3288, pp. 191–204. Springer, Heidelberg (2004)
10. Mazón, J.-N., Trujillo, J., Serrano, M.A., Piattini, M.: Applying MDA to the development of data warehouses. In: DOLAP, pp. 57–66 (2005)
11. Muñoz, L., Mazón, J.-N., Trujillo, J.: Automatic generation of ETL processes from conceptual models. In: DOLAP, pp. 33–40 (2009)
12. Oryx: Oryx tool (2009), http://bpt.hpi.uni-potsdam.de/oryx/bpmn
13. Simitsis, A., Vassiliadis, P.: A method for the mapping of conceptual designs to logical blueprints for ETL processes. Decision Support Systems 45(1), 22–40 (2008)
14. Simitsis, A., Vassiliadis, P., Sellis, T.K.: Optimizing ETL processes in data warehouses. In: ICDE, pp. 564–575 (2005)
15. Simitsis, A., Wilkinson, K., Castellanos, M., Dayal, U.: QoX-driven ETL design: reducing the cost of ETL consulting engagements. In: SIGMOD, pp. 953–960 (2009)
16. Simitsis, A., Wilkinson, K., Dayal, U., Castellanos, M.: Optimizing ETL Workflows for Fault-Tolerance. In: ICDE, pp. 385–396 (2010)
17. van der Aalst, W.: Patterns and XPDL: A Critical Evaluation of the XML Process Definition Language. Technical report, QUT, FIT-TR-2003-06 (2003)
18. Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M., Skiadopoulos, S.: A generic and customizable framework for the design of ETL scenarios. Inf. Syst. 30(7), 492–525 (2005)
19. Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Conceptual modeling for ETL processes. In: DOLAP, pp. 14–21 (2002)
Adaptation in Open Systems: Giving Interaction Its Rightful Place
Fabiano Dalpiaz, Amit K. Chopra, Paolo Giorgini, and John Mylopoulos
Department of Information Engineering and Computer Science, University of Trento
{dalpiaz,chopra,paolo.giorgini,jm}@disi.unitn.it
Abstract. We address the challenge of adaptation in open systems. Open systems are characterized by interactions among autonomous and heterogeneous participants. In such systems, each participant is a locus of adaptation; nonetheless, a participant would typically have to interact with others in order to effect an adaptation. Existing approaches for software adaptation do not readily apply to such settings as they rely upon control-based abstractions. We build upon recent work on modeling interaction via social commitments. Our contributions in this paper include (1) formalizing the notion of a participant’s strategy for a goal not just in terms of goals and plans, but also in terms of the commitments required, and (2) a conceptual model and framework for adaptation built around this notion of strategy that allows using arbitrary strategy selection criteria—for example, trust. We illustrate our contributions with examples from the emergency services domain.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 31–45, 2010. © Springer-Verlag Berlin Heidelberg 2010
1 Introduction
One of the principal challenges in software engineering is supporting runtime adaptation in software systems. In this paper, we address the challenge of adaptation in open systems. Open systems involve autonomous and heterogeneous participants who interact in order to achieve their own respective goals [8]. Autonomy implies that no participant has control over another; heterogeneity implies that the participants’ internal constructions, not only in terms of code but also in terms of goals and policies, may be different. Additionally, a participant will likely keep its internal construction private from others. In this sense, participants are completely independent of each other.
Many of the applications that we rely upon today are open, for example banking, foreign exchange transactions, and trip planning. Trip planning involves a customer, a travel agency, airlines, credit card companies, and so on, each an autonomous entity with its own private goals and policies. The participants interact with each other in order to fulfill their respective goals.
Clearly, supporting adaptation in open systems is as valuable as in any other kind of system. In the trip planning application, the travel agency may book an alternate flight for its customer in case the workers of the airline with which the customer is currently booked are likely to go on strike the day of the flight. Typically, however, the travel agent will not do that before interacting with the customer and getting his approval for the change. Neither is the customer’s approval guaranteed: he could prefer to arrive a day early rather than two hours late. Thus, in adapting, the travel agent must interact with another autonomous entity, the customer. The point we emphasize here is twofold.
1. In open systems, each participant, being autonomous, is an independent locus of adaptation.
2. Nonetheless, in effecting an adaptation, a participant needs to interact with others to achieve goals it cannot achieve by itself.
This paper addresses the challenge of understanding what it means for a participant to adapt in open systems, and thus how one might design adaptive software that represents the participant. It emphasizes interaction among participants and the corresponding social abstractions. By contrast, recent trends in software adaptation emphasize centralized, control-based abstractions and largely ignore interaction. Our emphasis on interaction is not a matter of technique: interaction is what makes things work in open systems, and our approach reflects that reality.
From here on, we refer to a participant as an agent. An important kind of agent, one that ties our work to software engineering, is a software system that pursues the interests of a particular stakeholder. Our contributions specifically are the following.
– A conceptual model for adaptation in open systems that emphasizes interaction.
– The formalization of the notion of an agent’s strategy for a goal. The notion of strategy covers interaction with other agents, and forms the common semantic substrate for adaptation across agents, regardless of who designed them.
– A framework for adaptation that allows plugging in arbitrary agent-specific criteria in order to select and operationalize alternate strategies.
In previous work we proposed design-time reasoning about the suitability of interaction protocols for a participant’s goals [5]. Here, such results constitute the baseline of our framework for participant runtime adaptation in open systems.
The rest of the paper is organized as follows. Section 2 introduces a conceptual model for adaptive agents in open systems. Section 3 presents a set of motivating examples of agent adaptation drawn from a firefighting scenario.
Section 4 formalizes the notion of a strategy. Section 5 explains the overall framework for adaptive agents. It presents an agent control loop and focuses on the selection and operationalization of a strategy. Section 6 contrasts our model to existing approaches, summarizes the key points, and outlines future work.
2 Modeling Adaptive Agents in Open Systems
Section 2.1 recaps a conceptual model for agents in open systems [4]. Section 2.2 then describes the concepts involved in agent adaptation.
2.1 Agents in Open Systems
Following Tropos [1], we model an agent as a goal-driven entity (Fig. 1). An agent has goals that reflect its own interests. An agent may have the capability to achieve certain goals; for others, he may have no such capability. A capability is an abstraction for specific plans that an agent may execute to achieve the goal. To support those goals, an agent may depend on other agents. Conceptually, the approach in [4] goes beyond
Fig. 1. Conceptual model for agents in open systems [4]
Tropos in that it makes these dependencies concrete and publicly verifiable by explicitly modeling interaction. In other words, the agent interacts with others (via messaging) to realize those dependencies. A distinguishing feature of [4] is that the interactions are modeled in terms of commitments. A commitment C(debtor, creditor, antecedent, consequent) is a promise made by the debtor to the creditor that if the antecedent is brought about, the consequent will be brought about. For example, C(customer, travel agency, tickets delivered, paid) represents a commitment from the customer to the travel agency that if the tickets are delivered, then the payment will be made. In open systems, an agent would have to interact with others and rely on commitments with them in order to support its goals. The agent can play either the debtor or the creditor in a commitment. In the above example, the customer’s goal of having tickets delivered (tickets delivered) is supported by the commitment he makes (by sending the appropriate message, details in [4]) to the travel agency. Alternatively, the customer could request a commitment from the travel agency for tickets. If the travel agency responds by creating C(travel agency, customer, paid, tickets delivered), then the customer’s goal is supported (provided the customer can bring about the payment). Roughly, supporting a goal means identifying a strategy that will lead to the fulfillment of that goal [4] at run-time, provided that the strategy is enacted faultlessly. In principle, goals can be related to commitments and vice versa, because both goals and commitments (via the antecedent and consequent) talk about states of the world. It is well known that commitments abstract over traditional descriptions of interactions (such as message choreographies) in terms of data and control flow [17].
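The commitment notation used above can be written down as a small record. The following is a minimal sketch, assuming atomic antecedents and consequents represented as strings; the status names follow common usage in the commitments literature and, like the helper `may_support`, are illustrative assumptions rather than part of the model in [4].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commitment:
    """C(debtor, creditor, antecedent, consequent)."""
    debtor: str
    creditor: str
    antecedent: str
    consequent: str

    def status(self, facts):
        """Classify the commitment against the set of facts that hold."""
        if self.consequent in facts:
            return "discharged"   # the promise has been kept
        if self.antecedent in facts:
            return "detached"     # the debtor now owes the consequent
        return "conditional"      # nothing has happened yet


def may_support(c, agent, goal):
    """True if the commitment can help `agent` with `goal`, in the two ways described above."""
    as_debtor = c.debtor == agent and c.antecedent == goal       # agent offers it
    as_creditor = c.creditor == agent and c.consequent == goal   # agent receives it
    return as_debtor or as_creditor


offer = Commitment("customer", "travel agency", "tickets delivered", "paid")
request = Commitment("travel agency", "customer", "paid", "tickets delivered")
```

Both `offer` and `request` relate to the customer's goal `tickets delivered`, the first with the customer as debtor and the second with the customer as creditor, which mirrors the two alternatives in the travel example.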
More importantly, commitments support a standard of compliance suitable for open systems: an agent behaves in a compliant manner as long as it fulfills its commitments. Commitments are notably different from traditional Tropos goal modeling:
– Commitments are a social abstraction better suited to open systems than dependencies. Commitments relate agents; they are created by explicit and observable messaging and hence are publicly verifiable (whether they hold or have been discharged) [14]. Dependencies relate agents, but are not tied to communication.
– Commitments decouple agents: to support a goal, in principle, an agent has only to enter into the appropriate commitment relationships with another agent. The agent need not care if the latter actually has the intention of achieving that goal. This is possible since commitments are publicly verifiable, and thus socially binding.
By contrast, reasoning with dependencies assumes a centralized perspective, where traditional AI planning techniques can be exploited [2].
2.2 A Model for Agent Adaptation
Fig. 2 shows the concepts involved in agent adaptation and the relations among them. The figure helps answer the following questions. Why should an agent adapt? When is adaptation required? What is the object of adaptation?
Fig. 2. Conceptual model for agent adaptation
An agent acts on the basis of its motivational component, the target goals it currently wants to achieve. The reason for adaptation (the why) is that the current strategy for achieving the current goals is inadequate or can be improved.
Every agent maintains its own state in a knowledge base, which is updated based upon the events the agent observes. An agent adapts when a certain adaptation trigger is activated. A trigger is a condition that is monitored for one or more target goals and is specified over the knowledge base. Three types of trigger are shown in Fig. 2; this list of types is not meant to be exhaustive but rather illustrative. A threat means that target goals are at risk; an opportunity means that a better strategy may be adopted; a violation means that the current strategy has failed. Both threat and opportunity relate to proactive adaptation policies: the agent adapts to prevent failures. By contrast, violation triggers are set off upon failures. Table 1 extends such triggers with more specific trigger types.
The object of adaptation, that is, what to adapt to, represents the new strategy, technically a variant. When the trigger goes off, the variant is activated. The variant represents the set of goals that need to be achieved in order to achieve the target goal. These goals are supported either by the agent’s capabilities or by commitments. Additionally, the variant is computed with respect to the agent’s goal model; for example, this ensures that the set of goals to be achieved is sound with respect to goal decomposition (this notion is formalized later in Definition 3).
As explained above, the model in Fig. 1 motivates the supports relation. Fig. 2 exploits the supports relation in the notion of variant. A variant is essentially a collection of goals, commitments, and capabilities that are necessary to support the agent’s target goals (Definition 4 formalizes this notion).
Table 1. Taxonomy of adaptation trigger types
Type | Description
Capability threat | The capability (plan) for an active goal is undermined
Commitment threat | The fulfillment of a commitment from another agent is at risk
Capability opportunity | An alternative capability has become more useful to exploit
Social opportunity | A new agent is discovered with whom it is better to interact
Commitment violation | A commitment from another agent is broken (timeout or cancellation)
Capability failure | A capability was executed but failed
Quality violation | An internal quality threshold is not met by the agent
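A trigger, as described above, is a condition over the knowledge base that is monitored for some target goals. The following is a minimal sketch, assuming the knowledge base is a dictionary of observations; the class names, the key `c1_wait_minutes`, and the ten-minute threshold are hypothetical illustrations, and placing capability failure under the violation kind is our reading of the text rather than something Table 1 states.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, FrozenSet


class Kind(Enum):
    THREAT = "threat"
    OPPORTUNITY = "opportunity"
    VIOLATION = "violation"


@dataclass(frozen=True)
class Trigger:
    name: str
    kind: Kind
    goals: FrozenSet[str]                # target goals the trigger is monitored for
    condition: Callable[[dict], bool]    # predicate over the knowledge base

    @property
    def proactive(self):
        """Threats and opportunities fire before a failure; violations fire after one."""
        return self.kind is not Kind.VIOLATION


def fired(triggers, kb):
    """Return the triggers whose condition holds in the knowledge base."""
    return [t for t in triggers if t.condition(kb)]


goals = frozenset({"fire extinguished"})
c1_threat = Trigger("commitment threat on C1", Kind.THREAT, goals,
                    lambda kb: kb.get("c1_wait_minutes", 0) > 10)
plan_failed = Trigger("capability failure", Kind.VIOLATION, goals,
                      lambda kb: kb.get("pipe_connect_failed", False))

active = fired([c1_threat, plan_failed], {"c1_wait_minutes": 15})
```

Keeping the condition as a plain predicate leaves each agent free to decide what counts as a threat or an opportunity, in line with the point that the list of trigger types is illustrative.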
3 Motivating Examples
We discuss examples of agent adaptation in a firefighting scenario. It serves as the running example throughout the rest of the paper. Jim is a fire chief. His top goal is to extinguish fires. That goal may be achieved either by using a fire hydrant or a tanker truck. The top part of Table 2 shows Jim’s goal model in Tropos. Ticked goals are those for which Jim has capabilities. The bottom part of Table 2 lists the commitments used in Figures 3–6.
Each of the Figures 3–6 depicts Jim’s active variant for the goal of having fires extinguished before (on the left of the dark, solid arrow) and after adaptation (on the right of the arrow). The chorded circles are agents. A variant depicts the active part of Jim’s goal model, that is, those goals in the goal model of Jim that he has instantiated, and the capabilities and commitments required to support that goal. Commitments are represented by labeled directed arrows between agents; the debtor and creditor are indicated by the tail and the head of the arrow respectively. We lack the space to explain the details of how Jim’s fire extinguished goal is supported after adaptation in each example.
Table 2. Running example: Jim is a fire chief (top); commitments in the scenario (bottom)
Label | Commitment
C1 | C(Brigade 1, Jim, hydrant need notified, hydrant usage authorized)
C2 | C(Jim, Brigade 1, tanker service paid, tanker truck used)
C3 | C(Tanker 1, Jim, tanker service paid, fire reached by tanker truck)
C4 | C(Tanker 2, Jim, tanker service paid, fire reached by tanker truck)
The key point to take away from these
examples is that Jim’s set of active goals, the capabilities required, and his commitments to and from others change as a result of adaptation in response to some trigger.
Tactic 1 (Alternative goals). (Example 1) Choose a different set of goals in a goal model to satisfy target goals. The agent believes the current strategy will not succeed.
Example 1. (Fig. 3) Jim tries to achieve fire extinguished via a variant that relies upon using the fire hydrant. However, the fulfillment of C1, which is necessary to support the goal, is threatened because Brigade 1 hasn’t authorized hydrant usage yet. So Jim switches to another variant that supports fire extinguished via tanker truck used. To support tanker truck used, Jim makes C2 to Brigade 1 and gets C3 from Tanker 1.
Tactic 2 (Goal redundancy). (Example 2) Select a variant that includes redundant ways for goal satisfaction. Useful for critical goals that the agent wants to achieve at any cost.
Example 2. (Fig. 4) Jim’s current strategy is to fight the fire via the hydrant. However, C1 is threatened. So Jim adopts a strategy which involves also calling a water tanker truck. By contrast, Example 1 involves no redundancy.
Tactic 3 (Commitment redundancy). (Example 3) More commitments for a goal are taken. Useful if the agent does not trust some agent it interacts with. Also, it applies
Fig. 3. Alternative goals: Jim switches from a variant involving fire hydrant usage to another involving tanker truck usage
Fig. 4. Goal redundancy: Jim adopts a redundant variant, which involves also calling a tanker truck
Fig. 5. Commitment redundancy: Jim gets C4 from Tanker 2
Fig. 6. Switch debtor: Jim releases Tanker 1 from C3 and takes C4 from Tanker 2
when a commitment from someone else is at risk due to the surrounding environment, and a different commitment is more likely to succeed.
Example 3. (Fig. 5) Jim doesn’t trust Tanker 1 much for C3. Therefore, he decides to get a similar commitment C4 from Tanker 2.
Tactic 4 (Switch debtor). (Example 4) Get a commitment for the same state of the world but from a different debtor agent. Useful if the creditor believes the current debtor will not respect its commitment or a more trustworthy debtor comes into play. The original debtor is released from his commitment.
Example 4. (Fig. 6) Jim takes C3 from Tanker 1, but fears that Tanker 1 will violate the commitment. Thus, Jim releases Tanker 1 from C3 and instead gets C4 from Tanker 2.
Tactic 5 (Division of labor). (Example 5) Rely on different agents for different goals instead of relying on a single agent. Distribution of work spreads the risk of complete failure.
Example 5. Suppose Jim wants to use both fire hydrant and a water tanker truck. Also, suppose Brigade 1 acts as a water tanker provider. Then Jim could use the tanker service from Brigade 1. However, Jim applies division of labor to minimize risk of failure: he takes C1 from Brigade 1 and C3 from Tanker 1.
Tactic 6 (Commitment delegation). (Example 6) An agent delegates a commitment in which he is debtor to another agent, perhaps because he can’t fulfill it.
Example 6. Jim does not have resources to fight a fire. So he delegates his commitment to extinguish a fire to another fire chief Ron of a neighboring town.
Tactic 7 (Commitment chaining). (Example 7) Agent x’s commitment C(x, y, g0, g1) is supported if he can get C(z, x, g2, g1) from some z and if x supports g2.
Example 7. Jim wants to achieve goal tanker truck used. It makes C2 to Brigade 1 so that tanker service paid is achieved. In such a way, it can get C3 from Tanker 1.
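Some of the tactics above can be read as simple operations on the set of commitments an agent holds. The following is a minimal sketch of Tactics 3 and 4 under that reading, using C3 and C4 from Table 2; the function names are hypothetical, and the sketch ignores the messaging needed to actually create or release a commitment.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Commitment:
    debtor: str
    creditor: str
    antecedent: str
    consequent: str


def add_redundant(commitments, original, new_debtor):
    """Tactic 3: keep the original and add the same commitment from another debtor."""
    return commitments | {replace(original, debtor=new_debtor)}


def switch_debtor(commitments, original, new_debtor):
    """Tactic 4: release the original debtor and take the commitment from another."""
    return (commitments - {original}) | {replace(original, debtor=new_debtor)}


C3 = Commitment("Tanker 1", "Jim", "tanker service paid", "fire reached by tanker truck")
C4 = Commitment("Tanker 2", "Jim", "tanker service paid", "fire reached by tanker truck")

after_example_3 = add_redundant({C3}, C3, "Tanker 2")   # Jim holds both C3 and C4
after_example_4 = switch_debtor({C3}, C3, "Tanker 2")   # Jim holds only C4
```

The only difference between the two tactics in this reading is whether the original commitment is kept, which is the distinction drawn between Fig. 5 and Fig. 6.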
4 Formalization
We now formalize the notion of a variant. This formalization is not specific to any individual agent; it forms the common semantic substrate upon which we later build a comprehensive adaptation framework. We explain the formalization by referring to the examples introduced earlier.
Let $g, g', g'', g_1, g_2, \ldots$ be atomic propositions (atoms); $p, q, r, \ldots$ be generic propositions; $x, y, z, \ldots$ be variables for agents. Let $a_{id}$ be the agent under consideration. A commitment is specified as a 4-ary relation $C(x, y, p, q)$. It represents a promise from a debtor agent $x$ to the creditor agent $y$ for the consequent $q$ if the antecedent $p$ holds. Let $P$ be a set of commitments. Commitments can be compared via a strength relation [6]. If an agent commits for something, it will also commit for something less. Also, it will commit if he gets more than expected in return. Such an intuition is captured via the transitive closure of $P$.
Definition 1. Given a set of commitments $P$, $P^*$ is its transitive closure with respect to the commitments strength relation [6].
Let $P = \{C(\text{fireman}, \text{brigade}, \text{team assigned} \lor \text{ambulance sent}, \text{fire fought} \land \text{casualties rescued})\}$. Then, for instance, $C(\text{fireman}, \text{brigade}, \text{team assigned} \lor \text{ambulance sent}, \text{fire fought}) \in P^*$, $C(\text{fireman}, \text{brigade}, \text{team assigned} \lor \text{ambulance sent}, \text{casualties rescued}) \in P^*$, and $C(\text{fireman}, \text{brigade}, \text{team assigned}, \text{fire fought}) \in P^*$.
Definition 2. A goal model $M_{id}$ specifies an agent $a_{id}$ as:
1. a set of AND/OR trees whose nodes are labeled with atoms;
2. a binary relation on atoms p-contrib;
3. a binary relation on atoms n-contrib.
An AND/OR tree encodes the agent’s knowledge about how to achieve the root node. The nodes are the agent’s goals. $\text{p-contrib}(g, g')$ represents positive contribution: the achievement of $g$ also achieves $g'$. $\text{n-contrib}(g, g')$ is negative contribution: the achievement of $g$ denies the achievement of $g'$. The top part of Table 2 is a goal model for agent Jim.
$M_{Jim}$ has one AND/OR tree rooted by goal fire extinguished. $M_{Jim}$ contains no contributions. We introduce the predicate scoped to capture a well-formedness intuition: a goal cannot be instantiated unless its parent is, and if a goal’s parent is and-decomposed, all the siblings of such a goal must also be instantiated.
Definition 3. A set of goals $G$ is scoped with respect to goal model $M_{id}$, that is, $scoped(G, M_{id})$, if and only if, for all $g_0 \in G$, either
1. $g_0$ is a root goal in $M_{id}$, or
2. there exists a simple path $g_0, g_1, \ldots, g_n$ in $M_{id}$ such that $g_n$ is a root goal in $M_{id}$ and $\forall i, 0 \le i \le n$:
(a) $g_i \in G$, and
(b) if $anddec(g_{i+1})$ ($i \ne n$), then $\forall g$ such that $parent(g_{i+1}, g)$, $g \in G$.
Example 8. $G_1$ = {fire extinguished, fire hydrant used} is scoped with respect to $M_{Jim}$. Indeed, fire extinguished is a root goal, whereas fire hydrant used is part of the path fire hydrant used, fire extinguished.
A variant is an abstract agent strategy to achieve some goal. It consists of a set of goals $G$ the agent intends to achieve via a set of commitments $P$ and a set of capabilities $C$. A variant is defined with respect to a goal model $M_{id}$. The notion of variant is more general than supports [5]: it considers scope and commitments between arbitrary agents.
Definition 4. A triple $\lfloor G, P, C \rfloor$ is a variant for a goal $g$ with respect to goal model $M_{id}$, that is, $\lfloor G, P, C \rfloor \models_{M_{id}} g$, if and only if
1. $scoped(G, M_{id})$ and $g \in G$, and
2. $g$ is supported: $\nexists g' \in G : \text{n-contrib}(g', g) \in M_{id}$, and either
(a) $g \in C$, or
(b) $C(x, a_{id}, g', g) \in P^* : \lfloor G, P, C \rfloor \models_{M_{id}} g'$, or
(c) $C(x, y, g, g') \in P^*$, or
(d) $ordec(g)$, and either
i. $\exists g' : parent(g, g')$ and $\lfloor G, P, C \rfloor \models_{M_{id}} g'$, or
ii. $C(x, a_{id}, g', q) \in P^* : q \vdash \bigvee_{parent(g, g_i)} g_i$ and $\lfloor G, P, C \rfloor \models_{M_{id}} g'$, or
iii. $C(x, y, p, g') \in P^* : p \vdash \bigvee_{parent(g, g_i)} g_i$; or
(e) $anddec(g)$ and $\forall g' : parent(g, g') \Rightarrow \lfloor G, P, C \rfloor \models_{M_{id}} g'$, or
(f) $\text{p-contrib}(g', g) \in M_{id} : \lfloor G, P, C \rfloor \models_{M_{id}} g'$.
The goals $G$ that $a_{id}$ intends to achieve should be scoped with respect to the goal model $M_{id}$ (clause 1). Goal $g$ must be supported: there should be no negative contributions to $g$ from any goal in $G$, and one clause among 2a-2f should hold (clause 2).
2a. capabilities support goals;
2b. $a_{id}$ gets a commitment for $g$ from some other agent $x$ if $a_{id}$ supports the antecedent;
2c. some agent $y$ brings about $g$ in order to get a commitment for $g'$ from some other agent $x$ (possibly $a_{id}$ itself);
2d. an or-decomposed goal $g$ is supported if either there is some subgoal $g'$ such that $\lfloor G, P, C \rfloor \models_{M_{id}} g'$ (2(d)i), or $g$ is supported via commitment to (2(d)iii) or from (2(d)ii) other agents. The latter two clauses cover the case of an agent who commits for a proposition that logically implies the disjunction of all the goal children. For instance, a commitment for $g_1 \lor g_2$ supports a goal $g$ or-decomposed to $g_1 \lor g_2 \lor g_3$;
2e. an and-decomposed goal is supported if $\lfloor G, P, C \rfloor$ is a variant for every child;
2f. positive contribution from $g'$ supports $g$ if $\lfloor G, P, C \rfloor \models_{M_{id}} g'$.
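The scoped predicate of Definition 3 can be checked by walking from each goal up to its root. The following is a minimal sketch, assuming a goal model encoded as a dictionary from each decomposed goal to its decomposition kind and subgoals; the encoding is hypothetical, and `fire hydrant used` is kept as a leaf because its subgoals are not spelled out in the text.

```python
# Goal model for Jim: goal -> (decomposition kind, subgoals). Leaves are omitted.
# Assumption: "fire hydrant used" is treated as a leaf in this sketch.
M_JIM = {
    "fire extinguished": ("or", ["fire hydrant used", "tanker truck used"]),
    "tanker truck used": ("and", ["tanker service paid",
                                  "fire reached by tanker truck",
                                  "pipe connected"]),
}
ROOTS = {"fire extinguished"}


def parent_of(model, goal):
    """Return the parent of `goal` in the AND/OR tree, or None if it has none."""
    for parent, (_, subgoals) in model.items():
        if goal in subgoals:
            return parent
    return None


def scoped(goals, model, roots):
    """Definition 3: every goal reaches a root through goals that are also in the set,
    and every and-decomposed goal on the way has all of its children in the set."""
    for goal in goals:
        node = goal
        while node not in roots:
            parent = parent_of(model, node)
            if parent is None or parent not in goals:
                return False              # clause 2(a) fails
            kind, subgoals = model[parent]
            if kind == "and" and not set(subgoals) <= goals:
                return False              # clause 2(b) fails
            node = parent
    return True


G1 = {"fire extinguished", "fire hydrant used"}   # the set from Example 8
```

Because the goal model is a set of trees, each goal has at most one parent, so the simple path required by the definition is unique and a single upward walk suffices.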
Definition 5 generalizes the notion of variant to sets of goals. A variant for a goal set should be a variant for each goal in the set.
Definition 5. A triple $\lfloor G, P, C \rfloor$ is a variant for a goal set $G'$ with respect to goal model $M_{id}$, that is, $\lfloor G, P, C \rfloor \models_{M_{id}} G'$, if and only if, for all $g$ in $G'$, $\lfloor G, P, C \rfloor \models_{M_{id}} g$.
Example 9. $G$ = {fire extinguished, tanker truck used, tanker service paid, fire reached by tanker truck, pipe connected}, $P$ = {$C_y$ = C(Jim, y, tanker service paid, fire extinguished), $C_z$ = C(z, Jim, tanker service paid, fire reached by tanker truck)}, $C$ = {pipe connected}. Then $\lfloor G, P, C \rfloor \models_{M_{Jim}}$ fire extinguished. This variant is shown on the right side of Fig. 3.
Step 1. From Definition 3, $G$ is scoped with respect to $M_{Jim}$. Indeed, fire extinguished is a root goal, tanker truck used is and-decomposed, all its subgoals are in $G$, and there is a path from all goals in $G$ to fire extinguished.
Step 2. From Definition 4, we should check if $g$ is supported. Clause 2(d)i applies to fire extinguished if $\lfloor G, P, C \rfloor \models_{M_{Jim}}$ tanker truck used.
Step 3. tanker truck used is and-decomposed into tanker service paid, fire reached by tanker truck, pipe connected; 2e tells us to verify every subgoal.
Step 4. pipe connected is in $C$, therefore 2a applies.
Step 5. tanker service paid can be supported if Jim commits for $C_y$ to some agent (2c).
Step 6. fire reached by tanker truck is supported if some agent commits for $C_z$ to Jim (2b), given that the antecedent tanker service paid is supported.
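The steps of Example 9 can be replayed mechanically. The following is a simplified sketch of Definition 4, under several stated assumptions: antecedents and consequents are atoms, the closure $P^*$ is replaced by $P$ itself, clauses 2(d)ii and 2(d)iii are omitted, and a goal revisited during recursion is treated as unsupported. The encoding and all function names are hypothetical.

```python
from collections import namedtuple

Commitment = namedtuple("Commitment", "debtor creditor antecedent consequent")

# Jim's goal model: goal -> (decomposition kind, subgoals); no contributions.
TREE = {
    "fire extinguished": ("or", ["fire hydrant used", "tanker truck used"]),
    "tanker truck used": ("and", ["tanker service paid",
                                  "fire reached by tanker truck",
                                  "pipe connected"]),
}
ROOTS = {"fire extinguished"}
P_CONTRIB, N_CONTRIB = set(), set()


def scoped(G):
    """Definition 3, using the unique parent of each goal in the tree."""
    for g in G:
        while g not in ROOTS:
            parents = [p for p, (_, subs) in TREE.items() if g in subs]
            if not parents or parents[0] not in G:
                return False
            kind, subs = TREE[parents[0]]
            if kind == "and" and not set(subs) <= G:
                return False
            g = parents[0]
    return True


def supported(g, G, P, C, me, seen=frozenset()):
    """Clause 2 of Definition 4 (simplified), with g in G from clause 1."""
    if g not in G or g in seen or any((h, g) in N_CONTRIB for h in G):
        return False
    seen = seen | {g}
    rec = lambda h: supported(h, G, P, C, me, seen)
    kind, subs = TREE.get(g, (None, []))
    return (
        g in C                                                      # 2a
        or any(c.creditor == me and c.consequent == g
               and rec(c.antecedent) for c in P)                    # 2b
        or any(c.antecedent == g for c in P)                        # 2c
        or (kind == "or" and any(rec(s) for s in subs))             # 2(d)i
        or (kind == "and" and all(rec(s) for s in subs))            # 2e
        or any(tgt == g and rec(src) for src, tgt in P_CONTRIB)     # 2f
    )


def is_variant(g, G, P, C, me):
    return scoped(G) and supported(g, G, P, C, me)


G = {"fire extinguished", "tanker truck used", "tanker service paid",
     "fire reached by tanker truck", "pipe connected"}
Cy = Commitment("Jim", "y", "tanker service paid", "fire extinguished")
Cz = Commitment("z", "Jim", "tanker service paid", "fire reached by tanker truck")
C = {"pipe connected"}
```

With both commitments present, `is_variant("fire extinguished", G, {Cy, Cz}, C, "Jim")` follows Steps 1 to 6; dropping `Cz` leaves fire reached by tanker truck unsupported, so the and-decomposed goal tanker truck used fails as well.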
5 A Framework for Adaptive Agents
The conceptual model we introduced in the previous sections is the foundation to define a framework for the development of adaptive agents. First, we sketch a generic control loop for an adaptive agent in Algorithm 1. Then, we investigate agent adaptation policies for variant selection (Section 5.1) and variant operationalization (Section 5.2).
Algorithm 1. An adaptive agent. AGENT is a generic agent control loop. TRIGGERED is an event handler for adaptation triggers.
AGENT()
  while OBSERVE(ε)
    do UPDATE(σ, ε)
       ...
TRIGGERED(goal[] G, goalModel M, state σ)
  variant[] V ← GENVARIANTS(G, M)
  variant V ← SELVARIANT(V, σ)
  OPERATIONALIZE(V, σ)
Algorithm 1 shows the skeleton of an adaptive agent. The procedure AGENT sketches the part of a generic agent control loop related to adaptation. An agent observes an event ε from the environment, then it updates the current state σ of its knowledge base.
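The observe/update part of the loop can be illustrated with a short Python sketch. It is only one possible reading of the AGENT procedure: the representation of events and state as dictionaries, and the parameters is_trigger and on_trigger, are assumptions introduced here, and the steps elided in Algorithm 1 are not filled in.

```python
def agent_loop(events, state, is_trigger, on_trigger):
    """Observe each event, update the state, and fire the adaptation handler.

    events     -- iterable of observed events (assumed: dicts of observed facts)
    state      -- the agent's knowledge base (assumed: a dict)
    is_trigger -- hypothetical predicate deciding whether adaptation is needed
    on_trigger -- hypothetical handler playing the role of TRIGGERED
    """
    for event in events:              # OBSERVE(ε)
        state.update(event)           # UPDATE(σ, ε)
        if is_trigger(event, state):  # some event sets off an adaptation trigger
            on_trigger(state)
    return state
```

Passing the handler as a parameter keeps the loop generic, in line with the idea that each agent plugs in its own adaptation policies.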
The procedure TRIGGERED is an event handler for adaptation triggers, which is executed whenever some event sets off an adaptation trigger. The input parameters are a set of target goals G, the agent’s goal model M, and the state σ. First, the agent generates all variants V for the goals G with respect to the goal model M. Function GENVARIANTS is standard for any agent and is computed according to Definitions 4 and 5. Second, the agent selects one of the variants in V. Such choice depends on the agent’s internal policies. Section 5.1 details variant selection. A variant V is an abstract strategy. It is a triple ⌊G, P, C⌋ composed of goals G, commitments P, and capabilities C. Neither commitments nor capabilities are grounded to concrete entities. The agent should therefore operationalize the variant: commitments must be bound to actual agents, capabilities to real plans. Section 5.2 describes operationalization.
5.1 Variant Selection
Variant selection is the choice of one variant among all the generated ones. Function SELVARIANT takes as input a set of variants and a state and returns one of these variants.
SELVARIANT : 2^V × S → V
SELVARIANT({V1, . . . , Vn}, σ) = Vi, 1 ≤ i ≤ n
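A selection function with this signature can be sketched as follows. This is a hypothetical policy, not one prescribed by the framework: it picks the variant with the lowest total cost and breaks ties by closeness to the current strategy. The Variant tuple, the state keys "cost" and "current", and the distance measure are all assumptions made for illustration.

```python
from typing import FrozenSet, NamedTuple, Sequence


class Variant(NamedTuple):
    """A variant as a triple of goals, commitments, and capabilities."""
    goals: FrozenSet[str]
    commitments: FrozenSet[str]
    capabilities: FrozenSet[str]


def distance(a: Variant, b: Variant) -> int:
    """Number of elements on which two variants differ, over all three parts."""
    return sum(len(x ^ y) for x, y in zip(a, b))


def sel_variant(variants: Sequence[Variant], state: dict) -> Variant:
    """Return the cheapest variant; ties go to the one nearest the current one."""
    if not variants:
        raise ValueError("no variant to select from")
    cost = state.get("cost", {})      # assumed: element name -> cost
    current = state.get("current")    # assumed: the variant in use, if any

    def score(v: Variant):
        total = sum(cost.get(item, 0) for part in v for item in part)
        drift = distance(v, current) if current is not None else 0
        return (total, drift)

    return min(variants, key=score)
```

Because the policy is an ordinary function of the variant set and the state, an agent can replace it with any other criterion without touching the rest of the loop.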
Table 3 shows some common criteria for variant selection. Due to its autonomy, each agent is free to decide its own criterion.
Table 3. Generic criteria for variant selection
Name               Description
Cost               Minimize the overall cost, expressed as money, needed resources, time
Stability          Minimize the distance between the current strategy and the new one
Softgoals          Maximize the satisfaction of quality goals (performance, security, risk, . . . )
Preference         Choose preferred goals and commitments
Goal Redundancy    Choose a redundant variant to achieve critical goals
Example 10. Table 4 specifies the function SELVARIANT for Fig. 4 as an Event-Condition-Action rule. We used such formalism to keep our explanation simple. In practice variant selection policies will be expressed via appropriate policy definition languages. Such a function is based upon the goal redundancy tactic.
Table 4. Event-Condition-Action rule for variant selection with goal redundancy (Fig. 4)
Event:      threatened(C1)
Condition:  target(fire extinguished), ¬made(C2), ¬taken(C3), ¬taken(C4), adopted(fire hydrant used), ¬adopted(tanker truck used)
Action:     adopt(tanker truck used), adopt(tanker service paid), adopt(fire reached by tanker truck), adopt(pipe connected), useCapability(pipe connected), get(Cz), make(Cy)
The triggering event is that
commitment C1 is threatened. It applies if the target goal is fire extinguished, commitments C2, C3, C4 are not in place, Jim adopted goal fire hydrant used and not tanker truck used. The action specifies the transition to the new variant. Jim adopts goal tanker truck used and its children, uses his capability for pipe connected, gets commitment Cz, and commits for Cy. Cy and Cz, from Example 9, are unbound.
5.2 Variant Operationalization
The OPERATIONALIZE function takes as input the selected variant and an agent’s state and returns a set of states.
OPERATIONALIZE : V × S → 2^S
OPERATIONALIZE(⌊G, P, C⌋, σ) = BINDTOPLAN(C); BINDAGENT(P)
Operationalization means identifying a concrete strategy to achieve goals and commitments in a variant. Commitments are bound to real agents (BINDAGENT), whereas capabilities are bound to executable plans (BINDTOPLAN). Let’s explain why such a function returns a set of states instead of a single one. Suppose Jim’s selected variant includes finding some agent that will commit for Cz. Jim may send a request message for Cz to all known tanker providers (Tanker 1 and Tanker 2) and get a commitment from the first one that accepts. If Tanker 1 answers first, the function returns a state σ1 where C3 holds; if Tanker 2 answers first, the returned state will be σ2 such that C4 holds. With respect to Table 4, operationalization would be invoked inside useCapability(pipe connected) to bind an appropriate plan to the capability, and inside get(Cz) and make(Cy) to bind z and y to the appropriate agents. Table 5 shows some generic criteria an agent can exploit and combine to operationalize a variant.
Example 11. Let’s operationalize the variant in Example 10. Jim wants to delegate firefighting with tanker truck to the agent he trusts more. He knows two fire chiefs, Ron and Frank. The one he trusts more is Ron.
Step 1. Bind capabilities to plans. Jim binds his capability for pipe connected to a specific plan where he connects a water pipe to the rear connector of a water tanker truck.
Step 2. Bind commitments to agents. Jim delegates C2 to Ron, but he doesn’t get any response. Thus, he delegates this commitment to Frank, who accepts the delegation. Frank creates a commitment to Brigade 1 and notifies Jim. Fig. 7 illustrates binding to agents.
Table 5. Generic criteria for operationalization
Name                 Description
Comm Redundancy      More commitments for the same goal from different agents
Division of Labour   Involve many agents, each agent commits for a small amount of work
Delegation           Delegate some commitment where the agent is debtor to someone else
Trust                The agent gets commitments only from other agents it trusts
Reputation           Rely on reputation in community to select agents to interact with
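Step 2 of Example 11 combines the Delegation and Trust criteria: candidates are approached in decreasing order of trust until one accepts. The Python sketch below illustrates that idea only. The function bind_agent, the numeric trust scores, and the request callback are hypothetical, and the sketch asks candidates one at a time, so it returns a single binding rather than the set of possible states that OPERATIONALIZE returns.

```python
from typing import Callable, Dict, Iterable, Optional


def bind_agent(candidates: Iterable[str],
               trust: Dict[str, float],
               request: Callable[[str], bool]) -> Optional[str]:
    """Ask candidates in decreasing order of trust; return the first to accept.

    trust   -- assumed mapping from agent name to a trust score
    request -- assumed callback that returns True if the agent accepts
    """
    ranked = sorted(candidates, key=lambda a: trust.get(a, 0.0), reverse=True)
    for agent in ranked:
        if request(agent):
            return agent
    return None  # no candidate accepted; the commitment stays unbound
```

In Example 11, Ron would be asked first and, since he does not respond, the commitment would be bound to Frank.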
Fig. 7. Bind commitments to agents: Jim delegates C2 on the basis of trust
6 Discussion and Conclusion
Zhang and Cheng [18] introduce a formal model for the behavior of self-adaptive software. They separate adaptation models from non-adaptation models. Adaptation models guide the transition from a source program to a target one. Salehie et al. [13] propose a model for adaptation changes based on activity theory. They define a hierarchy for adaptation changes and match such concepts to a hierarchy for objectives. Both approaches are inadequate for open systems: they presume an omniscient view on the system, which violates heterogeneity, and full control on system components, which violates autonomy.
Component-based approaches to adaptation [9,10] assume an external controller that adds, replaces, and rewires system components as necessary. The controller effects adaptations by reflecting upon the architectural model of the system. A centralized controller-based approach is unrealistic in open systems. Our approach includes several elements of adapting at the architectural level, albeit without any central controller, when one considers that commitments are nothing but the interconnections among agents [15]. Our motivating examples may be seen as patterns constituting an adaptation style [9] for the commitment-based architectural style.
The meaning that we ascribe to (an agent) being autonomous is different from being autonomic [11]. A system is autonomic to the extent it can operate without supervision from its operator; the operator retains ultimate control. Broadly speaking, the self-* approaches refer to this notion of being autonomic.
Goal-oriented approaches for adaptation differ from the architectural ones in their emphasis on modeling the rationale for adaptation in a more detailed manner. However, the modus operandi remains similar. Either system code is instrumented to support adaptation [16] or a system is augmented with a controller that runs a monitor, diagnose, adapt loop [7]. They are inadequate for open systems, since they violate heterogeneity.
Approaches for adaptive agents characterize adaptation in terms of mentalistic notions such as goals, beliefs, desires, and intentions. Commitments, on the other hand, represent a social notion: they cannot be deduced from mentalistic notions, only from publicly observable communication [14]. Morandini et al. [12] give an operational account of goals so as to support adaptation in agents. However, lacking commitments, their approach is not applicable to open systems. Unity [3] is a multiagent system created for autonomic computing. In Unity, autonomic elements (agents) collaborate to fulfill the system mission. Open systems, however, encompass competitive settings as well, and the agents may have no common goal.
In our framework, adaptation is conceived from the perspective of one autonomous agent which makes no assumptions about the internals of other agents (preserving, therefore, heterogeneity). An agent relies on interaction with others to achieve its own goals. Both the agent’s goals and its architectural connections, specified in terms of commitments, are explicit and formally related to one another.
This paper provides the underpinnings of agent adaptation in open systems. Our contribution lies in incorporating interaction as a first-class entity in the notion of a variant for a goal. We built a framework around this notion of a variant that allows plugging in agent-specific variant selection and operationalization policies. Future work involves detailing these policies, building a middleware that understands the notion of a variant, and building implementations of adaptive agents on top of this middleware.
Acknowledgements. This work has been partially funded by the EU Commission, through projects SecureChange, COMPAS, NESSOS and ANIKETOS.
References
1. Bresciani, P., Perini, A., Giorgini, P., Giunchiglia, F., Mylopoulos, J.: Tropos: An agent-oriented software development methodology. Autonomous Agents and Multi-Agent Systems 8(3), 203–236 (2004)
2. Bryl, V., Giorgini, P., Mylopoulos, J.: Designing socio-technical systems: From stakeholder goals to social networks. Requirements Engineering 14(1), 47–70 (2009)
3. Chess, D.M., Segal, A., Whalley, I., White, S.R.: Unity: experiences with a prototype autonomic computing system. In: Proceedings of ICAC, pp. 140–147 (2004)
4. Chopra, A.K., Dalpiaz, F., Giorgini, P., Mylopoulos, J.: Modeling and reasoning about service-oriented applications via goals and commitments. In: Pernici, B. (ed.) CAiSE 2010. LNCS, vol. 6051, pp. 113–128. Springer, Heidelberg (2010)
5. Chopra, A.K., Dalpiaz, F., Giorgini, P., Mylopoulos, J.: Reasoning about agents and protocols via goals and commitments. In: Proceedings of AAMAS, pp. 457–464 (2010)
6. Chopra, A.K., Singh, M.P.: Multiagent commitment alignment. In: Proceedings of AAMAS, pp. 937–944 (2009)
7. Dalpiaz, F., Giorgini, P., Mylopoulos, J.: An architecture for requirements-driven self-reconfiguration. In: van Eck, P., Gordijn, J., Wieringa, R. (eds.) CAiSE 2009. LNCS, vol. 5565, pp. 246–260. Springer, Heidelberg (2009)
8. Desai, N., Mallya, A.U., Chopra, A.K., Singh, M.P.: Interaction protocols as design abstractions for business processes. IEEE Transactions on Software Engineering 31(12), 1015–1027 (2005)
9. Garlan, D., Cheng, S.-W., Huang, A.-C., Schmerl, B., Steenkiste, P.: Rainbow: Architecture-based self-adaptation with reusable infrastructure. IEEE Computer 37(10), 46–54 (2004)
10. Heaven, W., Sykes, D., Magee, J., Kramer, J.: A case study in goal-driven architectural adaptation. In: Cheng, B.H.C., de Lemos, R., Giese, H., Inverardi, P., Magee, J. (eds.) SEAMS 2009. LNCS, vol. 5525, pp. 109–127. Springer, Heidelberg (2009)
11. Kephart, J.O., Chess, D.M.: The vision of autonomic computing. IEEE Computer 36(1), 41–50 (2003)
12. Morandini, M., Penserini, L., Perini, A.: Operational semantics of goal models in adaptive agents. In: Proceedings of AAMAS, pp. 129–136 (2009)
13. Salehie, M., Li, S., Asadollahi, R., Tahvildari, L.: Change support in adaptive software: A case study for fine-grained adaptation. In: Proceedings of EaSE, pp. 35–44 (2009)
14. Singh, M.P.: Agent communication languages: Rethinking the principles. IEEE Computer 31(12), 40–47 (1998)
15. Singh, M.P., Chopra, A.K., Desai, N.: Commitment-based service-oriented architecture. IEEE Computer 42(11), 72–79 (2009)
16. Wang, Y., Mylopoulos, J.: Self-repair through reconfiguration: A requirements engineering approach. In: Proceedings of ASE, pp. 257–268 (2009)
17. Yolum, P., Singh, M.P.: Flexible protocol specification and execution: Applying event calculus planning using commitments. In: Proceedings of AAMAS, pp. 527–534 (2002)
18. Zhang, J., Cheng, B.H.C.: Model-based development of dynamically adaptive software. In: Proceedings of ICSE, pp. 371–380 (2006)
Information Use in Solving a Well-Structured IS Problem: The Roles of IS and Application Domain Knowledge
Vijay Khatri (1) and Iris Vessey (2)
(1) Indiana University, 1309 E. 10th Street, Bloomington IN 47405, United States
(2) The University of Queensland, St. Lucia QLD 4072, Australia
Abstract. While the application domain is acknowledged to play a significant role in IS problem solving, little attention has been devoted to formal analyses of what role it plays, why and how it makes a difference, and in what circumstances. The theory of dual-task problem solving, which formalizes and generalizes the role of both the IS and application domains in IS problem solving, responds to these issues. The theory, which is based on the theory of cognitive fit, can be used to identify supportive, neutral, and conflicting interactions between the two types of knowledge, depending on problem structure. We used this theory to determine how IS and application domain knowledge support the solution of schema-based problem-solving tasks. Although such tasks are well-structured and therefore can be solved using IS domain knowledge alone, they are not fully structured. They require knowledge transformation, which is aided by application domain knowledge. Further, in well-structured tasks, IS and application domain knowledge play independent roles, with no interaction between the two. Analysis of verbal protocol data from the perspective of information use showed that problem solution is aided by both better IS knowledge and better application knowledge.
Keywords: Dual-task problem-solving, cognitive fit, well-structured problems, conceptual schema understanding, schema-based problem-solving tasks, protocol analysis, problem-solving processes, information use.
1 Introduction
Information Systems (IS) development can be viewed as “application domain problem solving using a software solution” [4], a perspective that highlights the role of both the IS and application domains. While research and practice alike have long acknowledged the importance of the application domain in IS (e.g., [4]), the majority of IS research has investigated the role of the IS domain alone. Increased interest in the role of the application domain is evidenced in more recent studies [5, 14, 18, 20-21]. There has, however, been little theoretical development surrounding the joint roles of IS and application domain knowledge in IS problem solving.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 46–58, 2010. © Springer-Verlag Berlin Heidelberg 2010
In prior literature on domain knowledge, a domain is viewed as an area to which a set of theoretical concepts is applied [1]. As is evident in this definition, a domain is viewed as a single “area” of inquiry; that is, domains are viewed as encompassing both content and principles. Hence this literature per se provides no insights into problem solving in dual domains. In the current paper, we present theory that explains the roles of the IS and application domains in IS problem solving, and their inter-relationship. The theory is based on the dual-task model of IS problem solving [27] and can be used to differentiate the different types of interaction, supportive, neutral, or conflicting, that may occur between the two types of domain knowledge. We apply the theory of dual-task problem solving to the area of conceptual data modeling. In increasingly complex IS environments, firms rely on the selection, evaluation, and combination of data to provide the information that is essential to their survival and growth. Conceptual modeling involves both the design and use of conceptual schemas, abstract representations of the structure of the data relevant to a specific area of application [12]. Data structures are captured in representations such as Entity-Relationship (ER) diagrams [8] and class diagrams of the Unified Modeling Language (UML) [9]. Prior research on conceptual schema understanding has shown that performance on certain, but not all types of conceptual schema understanding tasks is contingent on knowledge of the application domain [14]. Specifically, application domain knowledge plays a role in the solution of schema-based problem-solving tasks, but not schema comprehension tasks, findings that are explained by the theory of dual-task problem solving. 
In this paper, we seek to build on Khatri et al.’s [14] findings for schema-based problem-solving tasks by conducting an in-depth, exploratory study into how both information systems domain knowledge (ISDK) and application domain knowledge (ADK) influence the solution of schema-based problem-solving tasks. Our research question is: “How do IS and application domain knowledge influence the use of information in solving schema-based problem-solving tasks?” We address this issue by examining the information used in problem solving using data from a protocol analysis study in which we varied knowledge in both the IS and application domains. The paper proceeds as follows. In the following section, we present theory on problem solving that takes place simultaneously in two related domains, while in Section 3, we present theory supporting our investigation of the roles of IS and application domain knowledge in understanding schema-based problem-solving tasks. We then present our research methodology, followed by our findings. Finally, we present the implications of our research.
2 Theory on Dual-Task Problem Solving in Information Systems
We first present the theory of dual-task problem solving and then apply it to the solution of conceptual schema understanding tasks.
2.1 Need for Dual Domain Knowledge in IS Problem Solving
While the IS domain consists of representations, methods, techniques, and tools that form the basis for the development of application systems, the application domain is
the area to which those methods, tools, and techniques are applied. Application domain knowledge is therefore necessary to elicit and organize information related to the real-world problems in that area of inquiry. IS domain knowledge is necessary to apply the tools of the IS domain to represent formally the relevant data and processes of the application domain. IS problem solving therefore applies theoretical concepts from the IS domain to the application domain of interest. Hence, knowledge of the IS and the application domains have a symbiotic relationship in solving IS problems.
2.2 Theory of Dual-Task Problem Solving
Recent research has developed a unifying theory to explain the role of IS and application domain knowledge in IS problem solving [27]. The theory is based on the theoretical perspectives of dual-task problem solving, the theory of cognitive fit, and the theory on task structure. Research in cognitive psychology that examines problem solvers engaging in the simultaneous solution of two (in this case, unrelated) tasks (see, for example, [16]) provides the theoretical framework for establishing roles for both IS and application domain knowledge. The model is based on three repetitions of the basic problem-solving model used to describe cognitive fit [26] (one for each of the two domains, with a third for their interaction, called the “interaction model”), extended to include the notions of distributed cognition [29-30]. Formulating problem solving in IS as a dual-task model based on cognitive fit facilitates examination of situations in which a “cognitive” task in one domain has different types of influences on the performance of a cognitive task in the other domain. That is, the theory of cognitive fit can be used to distinguish different types of interactions between the tasks in the IS and application domains: when the two types of tasks match and when they do not. Those interactions may be supportive, neutral, or conflicting.
The types of interactions between the tasks in the dual domains depend on the extent of structure in the problem under investigation [19]. Theory on problem structure is therefore used as a contingency factor in establishing cognitive fit between the dual tasks. Well-structured problems are those that have a well-defined initial state, clearly-defined goal state, well-defined, constrained set of transformation functions to guide the solution process, well-defined evaluation processes, and an optimal solution path [28]. Further, the information needed to solve the problem is contained in the problem statement and any associated representations. In solving tasks in well-structured problem areas, all of the information needed for problem solution is available in the external problem representation and problem solving can take place with reference to IS domain knowledge alone. In this case, knowledge of the application domain plays a role only in solving problems in which cognitive fit does not exist. On the other hand, ill-structured problems are those for which the initial and goal states are vaguely defined or unclear [28]. They have multiple solutions and solution paths, or no solution at all [15]. Further, the problem statement does not contain all the information needed for their solution; hence, it is not clear what actions are required to reach a solution. Based on the theory of cognitive fit, when the information needed for problem solution is not available, application domain knowledge is essential to problem solution. When knowledge of the application domain matches the knowledge required to solve the problem, cognitive fit exists and problem solving is
facilitated. However, when knowledge of the application domain does not match that required to solve the problem, dual-task interference, which is manifested as an inverse relationship between the two types of knowledge, occurs (see, for example, [22]).
2.3 Application of Theory to Conceptual Schema Understanding
A conceptual schema represents the structure and inter-relationships in a set of data. The structure of data has been subject to extensive formalization over the past four decades (see, among others, [7]). As a result, all of the information required to solve conceptual schema understanding tasks (the IS task) can be gained from the schema itself, which, from the viewpoint of the model of dual-task problem solving, is represented by the external IS problem representation, i.e., the schema. There is, therefore, a clearly-defined initial state, a well-defined goal state, a formal set of transformation and evaluation paths, as well as a well-defined solution path. Conceptual schema understanding can therefore be addressed using IS domain knowledge alone and we can characterize conceptual schema understanding as a well-structured problem area. Because well-structured IS problems can be solved using IS domain knowledge alone, there is no interaction between the two types of domain knowledge, and each therefore has independent effects on performance. That is, any effect of application domain knowledge occurs in addition to the effect of IS domain knowledge. The role of the application domain in solving well-structured problems is, however, itself contingent upon the type of task under investigation. Khatri et al. [14] identify four types of conceptual schema understanding tasks: syntactic and semantic comprehension tasks and schema-based and inferential problem-solving tasks. We do not consider further, here, inferential problem-solving tasks because they require background information not provided in the schema.
Two situations may arise with respect to cognitive fit in regard to the solution of these tasks. First, in addressing schema comprehension tasks, which require knowledge of the constructs of the notation under investigation (syntactic) and knowledge of the meaning of the constructs (semantic) [10], the knowledge required for task solution can be acquired directly from the schema; that is, cognitive fit exists. The problem solving that takes place is therefore both accurate and timely [26]. Second, in addressing schema-based problem-solving tasks, while all of the information essential to task solution is available in the schema, it is not available directly. Therefore, the knowledge required to address the task and that available for task solution do not match; that is, cognitive fit does not exist. Problem solvers must transform either knowledge emphasized in the schema to match that emphasized in the IS task, or vice versa, to form a mental representation of the IS task, and ultimately a mental representation that facilitates task solution (the key to the interaction model). The need to transform such knowledge to solve the task effectively increases the complexity of the task at hand. The presence of application domain knowledge may, however, effectively reduce that complexity, thereby playing a role in problem solution. In terms of the dual-task problem-solving model, the formulation of the mental representation for task solution (interaction model) may be aided by the presence of application domain knowledge. And, indeed, Khatri et al. [14] showed that performance on schema-based problem-solving tasks improves when application domain knowledge is present.
3 Examining the Solution of Schema-Based Problem-Solving Tasks
This research builds on Khatri et al.’s prior research [14] on conceptual schema understanding by seeking to determine how both IS and application domain knowledge are used in solving those conceptual schema understanding tasks that benefit from knowledge of the application domain, that is, schema-based problem-solving tasks. We therefore address the research question: “How do IS and application domain knowledge influence the use of information in solving schema-based problem-solving tasks?” Studies that address how problem solving occurs most often focus on “opening up the black box” that lies between problem-solving inputs and outputs; that is, such studies investigate what happens during individual problem solving (isomorphic approach) rather than simply observing the effects of certain stimuli averaged over a number of cases, as in traditional studies (paramorphic approach) [11]. While there are a number of such approaches, the most common approach to opening up the black box is to analyze verbal protocols [25]. The researcher then has two major choices with regard to analyzing verbal protocol data: 1) examine problem-solving processes; and/or 2) examine information use. We examined problem-solving processes in a prior study [13]. In this study, we examine information use. Although schema-based problem-solving tasks are well-structured, the fact that they are not “fully” structured (cognitive fit does not exist) leaves the way open for examining them in ways that have traditionally been used to examine more complex tasks, that is, by drawing on the literature on the use of information in decision making. The expectation in early behavioral decision making research was that better (more expert) decision makers made better decisions because they used more information cues, the so-called information-use hypothesis (see, for example, [6]).
However, researchers in a multitude of subsequent studies found little evidence to support that notion [6, 23-24]. Rather, expert problem solvers were observed to use similar, if not fewer, numbers of information cues than novices. Two reasons have been advanced to explain this issue. The first explanation is that while expert decision makers may use less information than novices, the information they use is more relevant (see, for example, [23]). In a review of prior literature and detailed analysis of five studies, Shanteau found that the information experts use is more relevant than that used by novices. The second explanation is that differences in the use of information by experts and novices are contingent upon the structure of the task being investigated (see, for example, [24]). Only in the middle region of the continuum of problem structure, do experts have an advantage over novices. In the case of fully structured problems, both experts and novices have similar insights into their solution (as we have seen, no transformations are required), while in the case of fully unstructured problems, neither experts nor novices have relevant insights. Khatri et al. [14] found support for this notion in their examination of performance on fully-structured schema comprehension tasks: there was no difference in performance between participants with higher and lower levels of IS domain knowledge. It is, therefore, the middle region of task structure, where transformations are required, that expertise plays a role.
It is apparent, therefore, that the behavioral decision making literature dovetails very well with the theory of dual-task problem solving that we present specifically in the IS context. We use the findings in the behavioral decision making literature to examine information use in schema-based problem-solving tasks. We expect that problem solvers, in general, will engage in focused information use because, as we have seen, although such problems require transformation, the goal in well-structured problem areas is clear, and achievable. Participants with lower IS domain knowledge, however, may well be less certain about the information they need to solve them. Hence, we expect that participants with higher IS domain knowledge will be more focused in their use of information, in general, and will also use information that is more relevant than those with lower IS domain knowledge. Because, as we have seen, there is no interaction between IS and application domain knowledge in well-structured tasks, we expect that application domain knowledge will play a similar role to that of IS domain knowledge. We state Proposition 1 and associated hypotheses in relation to the extent of information use, and Proposition 2 and associated hypotheses in relation to the relevance of the information used.
• Proposition 1: Problem solvers with better domain knowledge engage in focused information use when solving well-structured problems.
Hypothesis 1a: In solving well-structured schema-based problem-solving tasks, problem solvers with higher IS domain knowledge use fewer information cues than those with lower IS domain knowledge.
Hypothesis 1b: In solving well-structured schema-based problem-solving tasks, problem solvers in the familiar application domain use fewer information cues than in unfamiliar application domains.
• Proposition 2: Problem solvers with better domain knowledge use relevant information when solving well-structured problems.
Hypothesis 2a: In solving well-structured schema-based problem-solving tasks, problem solvers with higher IS domain knowledge use information cues that are more relevant than those with lower IS domain knowledge.
Hypothesis 2b: In solving well-structured schema-based problem-solving tasks, problem solvers in familiar application domains use information cues that are more relevant than in unfamiliar application domains.
4 Research Methodology

As noted above, we used verbal protocol data to examine schema-based problem solving in familiar and unfamiliar application domains.

4.1 Task Setting

We investigated sales and hydrology as our two application domains. We expected that participants drawn from a business school (see the following section) would be more familiar with a sales application and less familiar with a hydrology application.
V. Khatri and I. Vessey
Further, we investigated the solution of schema-based problem-solving tasks on the conceptual models most commonly used in practice: the ER and EER models (see [7] and [10], respectively). A recent survey found that the ER Model is the most commonly used formalism, with usage exceeding by far that of Object Role Modeling (ORM) or UML class diagrams [8].

4.2 Participants

Study participants were 12 undergraduate students, proficient in conceptual modeling, drawn from two sections of a data management course offered in the business school of a large university in the U.S. Midwest. Participation in the study was voluntary and the participants were given $30 to complete the conceptual schema understanding experiment. All of the participants were between 20 and 25 years old, and had a high-school diploma, some work experience, and little database-related work experience.

4.3 Experimental Design

We used a 2 x 2 mixed design with knowledge of the IS domain as a between-subjects factor and familiarity with the application domain as a within-subjects factor. Participants demonstrating high and low IS expertise each completed four tasks, two tasks in each of the familiar and unfamiliar application domains, which we refer to as Task 1 and Task 2. Participants were randomly assigned to two groups (ER and EER). The schema-based problem-solving tasks investigated in this research involved only entity types/relationships and attributes (henceforth ERA), which are common across ER and EER models. Further, the presentation sequence of the two schemas (familiar and unfamiliar) was counterbalanced, thereby effectively controlling for any order effects.

4.3.1 Operationalizing IS Domain Knowledge

To investigate the influence of IS domain knowledge on the solution of schema-based problem-solving tasks, we formed groups of participants with high and low expertise in the IS domain. To do so, we examined participants’ scores on syntactic and semantic comprehension tasks. Performance on syntactic and semantic comprehension questions is an appropriate measure of IS expertise because it is well established in the cognitive psychology literature that knowledge of surface features, or declarative knowledge, forms the foundation for developing higher forms of knowledge such as procedural knowledge [2]. We used the assessment of syntactic and semantic knowledge to form groups of participants with varying levels of knowledge of conceptual modeling. We then selected the six highest performers to form the group with higher IS domain knowledge (H-ISDK), and the six lowest performers to form the group with lower IS domain knowledge (L-ISDK).

4.3.2 Operationalizing Application Domain Knowledge

Our experimental design called for the use of two domains with which our participants would not be equally familiar. We refer to these application domains as familiar (F-AD) and unfamiliar (U-AD). As a manipulation check on application domain knowledge prior to the experiment proper, we asked each participant to describe five
terms that mapped to concepts on the conceptual schemas with which they later interacted. The sales terms were product line, salesperson, warehouse, area headquarter and manufacturer, while the hydrology terms were seep, playa, bore hole, lithology and pump. Hence this exercise highlighted for the participants what they knew about aspects of each domain. The participants were then asked to rate their familiarity, on a 7-point scale, with sales and hydrology applications (where 7 = high and 1 = low familiarity). The self-reported familiarity of all the participants was far higher in the sales domain (5.00) than in the hydrology domain (1.33).

4.4 Experimental Materials

Each participant was presented with two schemas (together with corresponding data dictionaries), one in the familiar domain (sales) and the other in the unfamiliar domain (hydrology). The data dictionary included application-oriented descriptions of each entity type/relationship (E/R) and attribute on the schema. The schemas were syntactically equivalent; only the labels used for entity types, relationships, and attributes differed. The schema and the data dictionary were adapted from [14]. The sales schema was a typical order-processing application that included concepts such as SALES AREA, SALES TERRITORY, PRODUCT, PRODUCT LINE, and MANAGER. The hydrology schema was adapted from a schema for a ground water application at the U.S. Geological Survey. This application included hydrological concepts such as SEEP, PLAYA, BORE HOLE, CASING, and ACCESS TUBE.

Our participants responded to two schema-based problem-solving tasks in each of the sales and hydrology domains. As noted above, both tasks focused on ERA only. Tasks 1 and 2 could be solved by referring to one and two entity types, respectively. The tasks were structurally equivalent in each domain; that is, structurally corresponding entity types and attributes were needed to respond to the corresponding task in each application domain. Khatri and Vessey [13] provide details of the tasks, including the concepts that were required to address them.
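The group formation described in Section 4.3.1 can be illustrated with a short sketch. The comprehension scores below are invented purely for illustration (individual scores are not reported here), and the participant ids and the function name `split_by_expertise` are assumptions; only the rule itself, taking the six highest and the six lowest performers, comes from the text.

```python
def split_by_expertise(scores, group_size=6):
    """Rank participants by comprehension score and return (high, low) groups.

    `scores` maps a participant id to a combined syntactic/semantic
    comprehension score; higher means better performance.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    if 2 * group_size > len(ranked):
        raise ValueError("not enough participants for two disjoint groups")
    return ranked[:group_size], ranked[-group_size:]


# Hypothetical scores for 12 participants (not data from the study).
scores = {
    "P01": 14, "P02": 9, "P03": 17, "P04": 6, "P05": 12, "P06": 19,
    "P07": 8, "P08": 15, "P09": 5, "P10": 11, "P11": 16, "P12": 7,
}

h_isdk, l_isdk = split_by_expertise(scores)
print("H-ISDK:", h_isdk)
print("L-ISDK:", l_isdk)
```

With exactly 12 participants and groups of six, every participant falls into one of the two groups; with a larger pool, the same rule would leave the middle performers unassigned.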
5 Findings

We examine the between-subjects effects of IS domain knowledge and the within-subjects effects of application domain knowledge for the extent of information use, examined in terms of focused information, and the use of relevant information, in turn.

5.1 Use of Focused Information

Here we address Proposition 1, that problem solvers engage in use of focused information when solving well-structured problems. From the viewpoint of ISDK, Table 1 presents those problem solvers who explored more than the median number of E/Rs explored by participants. The median number of concepts (E/Rs) explored for Task 1 in F-AD was 1. H-2 explored 2 concepts and therefore appears in Table 1 under Task 1, H-ISDK and F-AD.
Table 1. Analysis of Information Use on Tasks 1 and 2 Based on IS Domain Knowledge

Task     Domain   H-ISDK       L-ISDK
Task 1   F-AD     H-2, H-5     L-1, L-3, L-6
Task 1   U-AD     -            L-4
Task 2   F-AD     H-1          L-3, L-5
Task 2   U-AD     H-1, H-3     L-1, L-2, L-3, L-4
On the other hand, because H-1 explored just 1 concept on that task, s/he does not appear in the corresponding cell of Table 1. Perusal of Table 1 shows that, in both familiar and unfamiliar domains, more L-ISDK than H-ISDK participants explored more than the median number of concepts. Our findings are consistent for Tasks 1 and 2. Hence Hypothesis 1a, that problem solvers with lower IS domain knowledge use more information than those with higher IS domain knowledge, is supported.

From the viewpoint of ADK, Table 2 presents an analysis similar to that for ISDK, above. We see that in the third column (F-AD < U-AD), the number of both H- and L-ISDK problem solvers is higher than in the second column (F-AD > U-AD), indicating that more participants explored more information cues in the unfamiliar than in the familiar domain. Our findings are again consistent for Tasks 1 and 2. Hence, our findings support Hypothesis 1b, that problem solvers in the unfamiliar application domain explore more information than those in the familiar application domain.

Table 2. Analysis of Information Use on Tasks 1 and 2 Based on Application Domain Knowledge

Task     Group    F-AD > U-AD   F-AD < U-AD
Task 1   H-ISDK   H-5           H-1, H-4
Task 1   L-ISDK   L-6           L-2, L-4, L-5
Task 2   H-ISDK   -             H-1, H-3, H-4, H-6
Task 2   L-ISDK   L-3, L-5      L-1, L-2, L-4
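The median-based tabulation behind Table 1 can be sketched as follows. The per-participant counts are hypothetical: the text reports only that the median for Task 1 in F-AD was 1, that H-2 explored 2 concepts, and that H-1 explored 1, so the remaining values were chosen for illustration and happen to reproduce the Task 1, F-AD row. The function names are likewise assumptions.

```python
from statistics import median


def above_median(counts):
    """Return ids of participants who explored more concepts than the median."""
    cutoff = median(counts.values())
    return sorted(pid for pid, n in counts.items() if n > cutoff)


def by_group(ids):
    """Split participant ids into H-ISDK and L-ISDK using their id prefix."""
    groups = {"H-ISDK": [], "L-ISDK": []}
    for pid in ids:
        groups["H-ISDK" if pid.startswith("H-") else "L-ISDK"].append(pid)
    return groups


# Hypothetical numbers of E/Rs explored on Task 1 in the familiar domain.
task1_fad = {
    "H-1": 1, "H-2": 2, "H-3": 1, "H-4": 1, "H-5": 2, "H-6": 1,
    "L-1": 3, "L-2": 1, "L-3": 2, "L-4": 1, "L-5": 1, "L-6": 2,
}

cell = by_group(above_median(task1_fad))
print(cell)
```

Comparing the lengths of the two lists in each cell is then the informal check used in the text: a longer L-ISDK list than H-ISDK list is read as support for Hypothesis 1a.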
5.2 Use of Relevant Information

Here we address Proposition 2, that problem solvers use relevant information when solving well-structured problems. From the viewpoint of IS domain knowledge, Table 3 shows that higher numbers of H-ISDK participants than L-ISDK participants used just the relevant concepts in the schema, in both the familiar and unfamiliar domains. Our findings are consistent for Tasks 1 and 2. Hence our findings support Hypothesis 2a, that problem solvers with higher IS domain knowledge focus on relevant information to a greater extent than those with lower IS domain knowledge.

Table 3. Analysis of Relevant Information Use on Tasks 1 and 2 Based on IS Domain Knowledge

Task     Domain   H-ISDK               L-ISDK
Task 1   F-AD     H-1, H-3, H-4, H-6   L-2, L-4, L-5
Task 1   U-AD     H-3, H-5, H-6        L-6
Task 2   F-AD     H-2, H-3, H-6        L-2
Task 2   U-AD     H-2, H-4             L-5

From the viewpoint of application domain knowledge, Table 4 presents the participants who used only relevant information in each of the application domains. It shows that, in general, participants are more selective in their use of information in the familiar than the unfamiliar application domain. Our findings are consistent for Tasks 1 and 2, except that on the more complex task L-ISDK participants examined just one entity in each of the familiar and unfamiliar application domains. Khatri and Vessey [13] illustrate graphically the difficulty these participants had in addressing this problem. Hence, these findings provide support for Hypothesis 2b, that problem solvers in the familiar application domain use more relevant information than those in the unfamiliar application domain.

Table 4. Analysis of Relevant Information Use on Tasks 1 and 2 Based on Application Domain Knowledge

Task     Group    F-AD                 U-AD
Task 1   H-ISDK   H-1, H-3, H-4, H-6   H-3, H-5, H-6
Task 1   L-ISDK   L-2, L-4, L-5        L-6
Task 2   H-ISDK   H-2, H-3, H-6        H-2, H-4
Task 2   L-ISDK   L-2                  L-5
6 Discussion of the Findings

Our research addresses a widely-acknowledged, though not widely-studied, issue in IS problem solving, that of the role of the application domain. In particular, there has been little theoretical development in this area. We present a theory that addresses the joint roles of the IS and application domains in IS problem solving and apply it to the specific case of conceptual schema understanding. We then report the findings of an exploratory study into how IS and application domain knowledge support the solution of schema-based problem-solving tasks, the solution of which is aided by application domain knowledge.

The theory of dual-task problem solving formalizes and generalizes to problems of different levels of structure, the role of both the IS and application domains in IS problem solving. In applying the theory of dual-task problem solving to the well-structured problem area of conceptual modeling, we focus on the solution of
schema-based problem-solving tasks, a subset of conceptual schema understanding tasks. While all conceptual schema understanding tasks can be solved by reference to the schema alone (all the necessary information is available in the schema), the solution of schema-based problem-solving tasks is aided by the presence of application domain knowledge [14]. We seek to extend Khatri et al.’s work by examining how both types of knowledge contribute to the solution of schema-based problem-solving tasks; that is, we address the following research question: “How do IS and application domain knowledge influence the use of information in solving schema-based problem-solving tasks?”

The theory of dual-task problem solving tells us that when the task is well structured, IS and application domain knowledge have independent effects on problem solving. To examine the effects of each type of knowledge, and thereby address our research question, we used verbal protocol data from problem solvers with both high and low knowledge of the IS domain in both familiar and unfamiliar application domains. We examined information use (based on literature in behavioral decision making). We found that both higher IS domain knowledge and higher application domain knowledge resulted in use of more focused information (that is, use of fewer information cues) and use of relevant information (that is, use of only that information necessary for problem solution).

Our research makes a number of contributions to the literature. First, introducing the theory of dual-task problem solving [27] to a broader audience represents a significant contribution to understanding the role of the application domain in IS. Second, problem solving in well-structured domains, such as conceptual schema understanding, appears to have been understudied in cognitive psychology research. Characterizing conceptual schema understanding as a well-structured problem area opens the way for the examination and characterization of problem solving on well-structured problems, in general. For example, a key finding of our study is that expert problem solving in well-structured problem areas is characterized by both focused and relevant information use. Our research therefore makes a contribution to the behavioral decision making literature. Third, this study is the first of which we are aware to examine information use to study problem solving in IS.

Our study has the following limitations. First, we conducted our investigation using students who were relatively inexperienced in using real world conceptual schemas. We characterize them as novice conceptual modelers. Note, however, that the difference between students and professionals is not always clear cut. For example, a study on maintaining UML diagrams found no differences in performance of undergraduate/graduate students and junior/intermediate professional consultants [3]. Second, while verbal protocol data has been questioned on a number of issues (see [17] for a detailed analysis), it remains the accepted way of collecting process data. Further, it is a far better approach than the use of retrospective reports or various types of self-reported data. Third, due to the high density of data in a single verbalization, small numbers of participants are typically used, “commonly between 2 and 20” ([25], p. 501). Our study, with 12 participants, is therefore in the mid-range.
7 Conclusion

The role of the application domain is an issue that has been largely neglected in research into IS problem solving. In this research, we introduce directly to the data modeling community a theory [27] that formalizes, and generalizes to problems of different levels of structure, the roles of both the IS and the application domain in IS problem solving. We then applied the theory to the solution of schema-based problem-solving tasks that, while well-structured, benefit from the existence of application domain knowledge. Analyses of information use reveal that both better application domain knowledge and better IS domain knowledge result in more focused and more relevant information use.
References

[1] Alexander, P.A.: Domain Knowledge: Evolving Themes and Emerging Concerns. Educational Psychologist 27(1), 33–51 (1992)
[2] Anderson, J.R.: Acquisition of Cognitive Skill. Psychological Review 89(4), 369–406 (1982)
[3] Arisholm, E., Sjøberg, D.I.K.: Evaluating the Effect of a Delegated versus Centralized Control Style on the Maintainability of Object-Oriented Software. IEEE Transactions on Software Engineering 30(8), 521–534 (2004)
[4] Blum, B.A.: A Paradigm for the 1990s Validated in the 1980s. In: Proceedings of the AIAA Conference 1989, pp. 502–511 (1989)
[5] Burton-Jones, A., Weber, R.: Understanding relationships with attributes in entity-relationship diagrams. In: Proceedings of the Twentieth International Conference on Information Systems, Charlotte, North Carolina, USA, pp. 214–228 (1999)
[6] Camerer, C.F., Johnson, E.J.: The process-performance paradox in expert judgment: How can experts know so much and predict so badly? In: Ericsson, K.A., Smith, J. (eds.) Towards a general theory of expertise: Prospects and limits, pp. 195–217. Cambridge Press, New York (1991)
[7] Chen, P.P.: The Entity-Relationship Model - Toward a Unified View of Data. ACM Transactions of Database Systems 1(1), 9–36 (1976)
[8] Davies, I., Green, P., Rosemann, M., Indulska, M., Gallo, S.: How do practitioners use conceptual modeling in practice? Data & Knowledge Engineering 58(3), 358–380 (2006)
[9] Dobing, B., Parsons, J.: How UML is used. Communications of the ACM 49(5), 109–113 (2006)
[10] Elmasri, R., Navathe, S.B.: Fundamentals of Database Systems, 6th edn. Addison Wesley, Boston (2006)
[11] Ericsson, K.A., Simon, H.A.: Verbal reports as data. Psychological Review (87), 215–251 (1980)
[12] Hoffer, J.A., Prescott, M.B., McFadden, F.R.: Modern Database Management, 8th edn. Pearson/Prentice Hall, Upper Saddle River (2007)
[13] Khatri, V., Vessey, I.: Information Search Process for a Well-Structured IS Problem: The Role of IS and Application Domain Knowledge. In: 29th International Conference on Information Systems, Paris, France (2008)
[14] Khatri, V., Vessey, I., Ramesh, V., Clay, P., Kim, S.-J.: Understanding Conceptual Schemas: Exploring the Role of Application and IS Domain Knowledge. Information Systems Research 11(3), 81–99 (2006)
[15] Kitchner, K.S.: Cognition, metacognition, and epistemic cognition: A three-level model of cognitive processing. Human Development (26), 222–232 (1983)
[16] Navon, D., Miller, J.: Role outcome conflict in dual-task interference. Journal of Experimental Psychology: Human Perception and Performance (13), 435–448 (1987)
[17] Nisbett, R.E., Wilson, T.D.: Telling more than we can know: verbal reports on mental processes. Psychological Review (84), 231–259 (1977)
[18] Purao, S., Rossi, M., Bush, A.: Toward an Understanding of the Use of Problem and Design Spaces During Object-Oriented System Development. Information and Organization (12), 249–281 (2002)
[19] Reitman, W.R.: Heuristic Decision Procedures: Open Constraints and the Structure of Ill-Defined Problems. In: Shelly, M.W., Bryan, G.L. (eds.) Human Judgments and Optimality, pp. 282–315. John Wiley, New York (1964)
[20] Shaft, T.M., Vessey, I.: The Relevance of Application Domain Knowledge: The Case of Computer Program Comprehension. Information Systems Research 6(3), 286–299 (1995)
[21] Shaft, T.M., Vessey, I.: The Relevance of Application Domain Knowledge: Characterizing the Computer Program Comprehension Process. Journal of Management Information Systems 15(1), 51–78 (1998)
[22] Shaft, T.M., Vessey, I.: The Role of Cognitive Fit in the Relationship Between Software Comprehension and Modification. MIS Quarterly 30(1), 29–55 (2006)
[23] Shanteau, J.: How Much Information Does An Expert Use? Is It Relevant? Acta Psychologica (81), 75–86 (1992)
[24] Spence, M.T., Brucks, M.: The Moderating Effects of Problem Characteristics on Experts’ and Novices’ Judgments. Journal of Marketing Research 34(2), 233–247 (1997)
[25] Todd, P., Benbasat, I.: Process Tracing Methods in Decision Support Systems Research: Exploring the Black Box. MIS Quarterly 11(4), 493–512 (1987)
[26] Vessey, I.: Cognitive Fit: A Theory-based Analysis of Graphs Vs. Tables Literature. Decision Sciences 22(2), 219–240 (1991)
[27] Vessey, I.: The Effect of the Application Domain in IS Problem Solving: A Theoretical Analysis. In: Hart, D., Gregor, S. (eds.) Information Systems Foundations: Theory, Representation and Reality, pp. 25–48. ANU Press, Canberra (2006)
[28] Voss, J.F., Post, T.A.: On the solving of ill-structured problems. In: Chi, M.H., Glaser, R., Farr, M.J. (eds.) The nature of expertise, pp. 261–285. Lawrence Erlbaum Associates, Hillsdale (1988)
[29] Zhang, J.: The nature of external representations in problem solving. Cognitive Science 21(2), 179–217 (1997)
[30] Zhang, J., Norman, D.A.: Representations in distributed cognitive tasks. Cognitive Science 18(1), 87–122 (1994)
Finding Solutions in Goal Models: An Interactive Backward Reasoning Approach

Jennifer Horkoff¹ and Eric Yu²

¹ University of Toronto, Department of Computer Science
² University of Toronto, Faculty of Information
[email protected], [email protected]

Abstract. Modeling in the early stage of system analysis is critical for understanding stakeholders, their needs, problems, and different viewpoints. We advocate methods for early domain exploration which provoke iteration over captured knowledge, helping to guide elicitation, and facilitating early scoping and decision making. Specifically, we provide a framework to support interactive, iterative analysis over goal- and agent-oriented (agent-goal) models. Previous work has introduced an interactive evaluation procedure propagating forward from alternatives allowing users to ask “What if?” questions. In this work we introduce a backwards, iterative, interactive evaluation procedure propagating backward from high-level target goals, allowing users to ask “Is this possible?” questions. The approach is novel in that it axiomatizes propagation in the i* framework, including the role of human intervention to potentially resolve conflicting contributions or promote multiple sources of weak evidence.

Keywords: Goal- and Agent-Oriented Modeling, Early System Analysis, Model Analysis, Interactive Analysis, Iterative Analysis.
1 Introduction

Understanding gained during early stages of system analysis, including knowledge of stakeholders, their needs, and inherent domain problems, can be critical for the success of a socio-technical system. Early stages of analysis are characterized by incomplete and imprecise information. It is often hard to quantify or formalize critical success criteria such as privacy, security, employee happiness, or customer satisfaction in early stages. Ideally, early analysis should involve a high degree of stakeholder participation, not only gathering information, but presenting information gathered thus far, allowing validation and improved understanding in an iterative process. Goal- and agent-oriented models (agent-goal models) have been widely advocated for early system analysis [1] [2], as such models allow even imprecise concepts to be reasoned about in terms of softgoals and contribution links, and have a relatively simple syntax, making them amenable to stakeholder participation.

We advocate methods for early domain exploration which provoke and support iterative inquiry over captured knowledge, prompting analysts and stakeholders

J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 59–75, 2010. © Springer-Verlag Berlin Heidelberg 2010
to review what is known, helping to guide elicitation, and facilitating early scoping and decision making. To this end we have created a framework for iterative, interactive analysis of agent-goal models in early system analysis. Previous work has introduced an interactive procedure which propagates evidence from means to ends, allowing users to ask “what if?” questions [3]. In this work we introduce an interactive “backward” procedure, propagating target values from ends to means, helping users to ask “Is this possible?”, “If so how?” and “If not, why not?” questions. The procedure introduced in this paper encodes forward and backward propagation rules in conjunctive normal form (CNF), iteratively applying a SAT solver and human intervention to search for an acceptable solution.

In formulating such an interactive backward procedure we face some interesting questions and technical challenges. What types of questions could and should be posed to the user, and at what point in the procedure? How can the encoding be modified to reflect human judgment: what is added, and what is removed? When a choice does not lead to an acceptable solution, to what state does the procedure backtrack? As information is lost in forward propagation when evidence is manually combined, what assumptions about this evidence can be made when propagating backward? How can the axiomatization allow for explicit values of conflict and unknown, compared to approaches that only allow for positive and negative values [4]? How can we find a balance between constraining the problem sufficiently to avoid nonsensical values and allowing enough freedom to detect the need for human judgment? How can we use information about SAT failures to inform the user? Is there a computationally realistic approach? The procedure in this work represents one approach to answering these questions.

The paper is organized as follows: an overview of the framework for iterative, interactive analysis for agent-goal models is provided (Section 2), including a summary of the forward propagation procedure (2.1). We motivate the need for backward analysis (2.2), and provide an overview of the proposed backward analysis procedure (3). Background on SAT solvers is provided (3.1) along with a formalization of the i* Framework as an example agent-goal syntax (3.2), including axioms for forward and backward propagation. The iterative, backward algorithm is described in (3.5), including an example and a consideration of termination, run time, soundness, and completeness. Related work is described in Section 4, with discussion, conclusions, and future work in Section 5.
2 A Framework for Iterative, Interactive Analysis of Agent-Goal Models in Early System Analysis

We introduce a framework for iterative, interactive analysis of agent-goal models consisting of the following components [5]:
– An interactive, qualitative forward analysis procedure, facilitating “What if?” analysis.
– Management of results for each analyzed alternative.
– An interactive, qualitative backward analysis procedure, facilitating “Is this possible?”, “If so, how?”, and “If not, why?” analysis.
– Management of human judgments provided by users.
– Integration with textual justifications for modeling and evaluation decisions.
– Reflecting model and judgment changes in alternative evaluation results.

Currently, the first component has been implemented, applied, and described in detail [6] [3] [7], with the first and second components implemented in the OpenOME tool [8]. In this work, we focus on the third component: interactive, qualitative backward analysis. Our analysis framework uses the i* notation as an example goal modeling framework [2], but could be applicable to any goal models using softgoals and/or contribution links.

2.1 Background: Forward Interactive Analysis
The forward analysis procedure starts with an analysis question of the form “How effective is an alternative with respect to goals in the model?” The procedure makes use of a set of qualitative evaluation labels assigned to intentions to express their degree of satisfaction or denial, shown in the left column of Table 2. Following [1], the (Partially) Satisfied label represents the presence of evidence which is (insufficient) sufficient to satisfy an intention. Partially Denied and Denied have the same definition with respect to negative evidence. Conflict indicates the presence of positive and negative evidence of roughly the same strength. Unknown represents the presence of evidence with an unknown effect. Although tasks, resources, and goals typically have a binary nature (true or false), the use of softgoal decompositions or dependencies on softgoals means that they often encompass quality attributes. We allow partial labels for tasks, resources, and goals for greater expressiveness.

The analysis starts by assigning labels to intentions related to the analysis question. These values are propagated through links using defined rules. See [2] or [9] for a review of i* syntax (legend in Fig. 2.2). The nature of a Dependency indicates that if the element depended upon (dependee) is satisfied then the element depended for (dependum) and element depending on (depender) will be satisfied. Decomposition links depict the elements necessary to accomplish a task, indicating the use of an AND relationship, selecting the “minimum” value amongst all of the values, using the ordering in (1). Similarly, Means-Ends links depict the alternative tasks which are able to satisfy a goal, indicating an OR relationship, taking the maximum value of the intentions in the relation. To increase flexibility, the OR is interpreted to be inclusive.
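The propagation rules for Dependency, Decomposition, and Means-Ends links can be sketched as below. The ordering referred to as (1) is not reproduced in this excerpt, so the ordering encoded in `Label` is an assumption made for illustration, as are all function names; contribution links and the human-judgment step are deliberately left out.

```python
from enum import IntEnum


class Label(IntEnum):
    """Qualitative evaluation labels; the numeric order is an ASSUMED ordering."""
    DENIED = 0
    PARTIALLY_DENIED = 1
    CONFLICT = 2
    UNKNOWN = 3
    PARTIALLY_SATISFIED = 4
    SATISFIED = 5


def propagate_dependency(dependee):
    """Dependency: the dependee's label is passed on unchanged."""
    return dependee


def propagate_decomposition(children):
    """Decomposition (AND): the task receives the minimum label of its elements."""
    return min(children)


def propagate_means_ends(children):
    """Means-Ends (inclusive OR): the goal receives the maximum label of its means."""
    return max(children)


# A task decomposed into one satisfied and one partially denied element.
task = propagate_decomposition([Label.SATISFIED, Label.PARTIALLY_DENIED])
# A goal with two alternative tasks, one denied and one partially satisfied.
goal = propagate_means_ends([Label.DENIED, Label.PARTIALLY_SATISFIED])
print(task.name, goal.name)
```

Using an integer-backed enumeration keeps the AND and OR rules as plain `min` and `max`; a different ordering only requires changing the numeric values.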
Fig. 2. The Nòmos modelling language: visual representation of the Italian Personal Data Protection Code
NPs. Rights also impact on the social interaction of actors - in the Hohfeldian legal taxonomy, rights are related by correlativity relations: for example, if someone has the claim to access some data, then somebody else will have the duty of providing that data. This means that duty and claim are correlatives. Similarly, privilege-noclaim, power-liability, immunity-disability are correlatives: they describe the same reality from two different points of view. So instead of defining two separate classes for “duty” and “claim”, we have a single class, ClaimDuty, which is able to model both. Similarly, we have the classes PrivilegeNoclaim, PowerLiability and ImmunityDisability, each of them a sub-class of the abstract class Right. Priorities between rights are captured in the meta-model by means of the Dominance class, which connects two rights. The PrescribedAction class contains the actual object of the NP. Such prescribed action is bound to the behaviour of actors by means of the Realization class. It specifies that a certain goal is wanted by the actor in order to accomplish the action prescribed by law. For more information on the Nòmos meta-model, see [12].

Figure 2 exemplifies the Nòmos language used to create models of laws. The language is an extension of the i* modelling language, and inherits its notation. For example, actors of the domain are represented as circles, and they have an associated rationale, which contains the goals, tasks and resources of the actors. In the Nòmos language, the actor’s rationale also contains the NPs addressing that actor, partially ordered through dominance relations. Finally, the holder and counter-party actors of a right are linked by a legal relation. The diagram in the figure shows an excerpt of the Nòmos models that represent some fragments of the Italian Personal Data Protection Code². The figure shows the subject of right - the [User] - and its claim, toward the data processor,

² An English translation of the law can be found at http://www.privacy.it/
A. Siena et al.
to have [Protection of the personal data] (as of Art. 7.1). However, the data processor is free to [Process performance data] (Art. 7.1). The general duty to protect the personal data is then overcome by a number of more specific duties, such as [Be informed of the source of the personal data], [Obtain updating, rectification or integration of the data], [Confirmation as to whether or not personal data concerning him exist], [Be informed of the source of the purposes] (all from Art. 7.2), and so on. However, the data processor has the claim, towards the data subject, to refuse requests of information in the (unlikely) case, for example, of data that are processed for reasons of justice by judicial authorities.
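The meta-model classes named above (Right and its four sub-classes, Dominance, PrescribedAction, Realization) can be sketched as plain data classes. This is only an illustrative reading of the description: the field names (`holder`, `counterparty`, `dominant`, `dominated`, and so on) and the direction chosen for the dominance relation are assumptions, not the Nòmos meta-model as specified in [12].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrescribedAction:
    """The actual object of a normative proposition."""
    description: str


@dataclass(frozen=True)
class Right:
    """Abstract base: a legal relation between a holder and a counter-party."""
    holder: str
    counterparty: str
    action: PrescribedAction

    def __post_init__(self):
        # Emulate an abstract class: only the correlative sub-classes are usable.
        if type(self) is Right:
            raise TypeError("Right is abstract; use one of its sub-classes")


class ClaimDuty(Right):
    """The holder's claim is the counter-party's duty."""


class PrivilegeNoclaim(Right):
    """The holder's privilege is the counter-party's no-claim."""


class PowerLiability(Right):
    """The holder's power is the counter-party's liability."""


class ImmunityDisability(Right):
    """The holder's immunity is the counter-party's disability."""


@dataclass(frozen=True)
class Dominance:
    """Priority between two rights (field names are assumed)."""
    dominant: Right
    dominated: Right


@dataclass(frozen=True)
class Realization:
    """Binds an actor's goal to the action prescribed by law."""
    actor: str
    goal: str
    action: PrescribedAction


# Example instances taken from the description of Fig. 2.
general = ClaimDuty("User", "Data processor",
                    PrescribedAction("Protection of the personal data"))
specific = ClaimDuty("User", "Data processor",
                     PrescribedAction("Be informed of the source of the personal data"))
priority = Dominance(dominant=specific, dominated=general)
print(priority.dominated.action.description)
```

Modelling each correlative pair as a single class mirrors the point made in the text: one object describes the same legal relation from both the holder's and the counter-party's point of view.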
3 A Healthcare Information System

Amico³ is an industrial research and development (R&D) project in the health care domain. It involved around 25 people (including project manager, analysts, software architect, programmers and researchers) and lasted 18 months. The Amico project is intended to define the architecture for an integrated service-based system, aiming at increasing the possibilities of self-supporting life for elderly or disabled people in their own home. The project is focused on the realisation of an Electronic Patient Record (EPR) for storing social and health information to be used in health care. Information is stored and accessed independently by the subjects that operate in the health care system: social workers, doctors, social cooperatives, relatives. The EPR, accessed via the web, allows the subjects to collaborate in order to improve health care and to have social and health information available, as well as economic and managerial data.

The EPR. The Amico system has been conceived as a network of interconnected components, as depicted in Figure 3. Nodes of the network are mainly the health care facilities with their information systems, called Local Authorities (LA). Local Authorities run their own databases, and provide services such as data search and retrieval to other members of the network. Local Authorities collect data directly from patients mainly in two ways: through direct input by operators, such as doctors and nurses; or through automatic sensing, by means of input peripherals such as cameras, heart rate monitors, and so on. Alternatively, they receive data that had previously been collected by other Local Authorities. The collected data can in turn be further propagated to other members of the network, if needed. Certificate Authorities (CA) are the reference actors for Local Authorities: they keep a copy of those data that have been verified and can be trusted. So, the data that a Local Authority retrieves from the Certificate Authorities are considered “clean”, as opposed to data retrieved from other Local Authorities, which are not verified and are considered “dirty”. An Index node manages the list of members of the network. Through the Index, a Local or Certificate Authority can learn of other Authorities registered system-wide. The network-wide collection of patient data forms the EPR for the patient.

Usage scenario. For checking the requirements of the system, a usage scenario has been proposed by the industrial partners. The scenario focuses on the case of a patient
that needs to access the (physical) services of a health care facility. The patient may turn to various facilities, according to their specialisation and to his/her needs. On reception, the facility needs to know the clinical history of the patient, in order to select the most appropriate cure. The clinical data can be in the local database of the accessed facility; or they can be distributed somewhere across the Amico network; or they can be completely absent. Through the Amico system, it should be possible to access an integrated EPR, which collects all the useful information available for the patient anywhere in the network; alternatively, it should be possible to create the EPR from scratch, and broadcast it through the network.

³ Assistenza Multilivello Integrata e Cura Ovunque.

Fig. 3. The demo scenario for the Amico system architecture (nodes LA 1, LA 2, CA 1 and Index, with their databases and the services S1 to S5, connected through a bus)

Involved services. With regard to the described scenario, five services are provided by the nodes of the network; they are labelled S1 to S5 in Figure 3. Services S1, S2 and S3 are provided by the Local Authority node. In particular, service S1 is responsible for accessing the underlying system and provides services such as local data search and update. Service S2 is responsible for accessing the Certificate Authority node and provides remote data search. It is not possible to update information in this case, as S2 is a read-only service; once information is available from the Certificate Authority, it is possible to update or create new information locally by invoking S1. Service S3 broadcasts “dirty” information (information not yet verified) to all Local Authorities. This service enables each Local Authority to keep the integrity of data across local systems. Services S4 and S5 are exposed by the Certificate Authority and the Index nodes, respectively. They behave as follows. S4: each Certificate Authority is responsible for providing certified data search, i.e., it provides read-only access to some verified data. S5: this service is accessible from anywhere and returns information on the corresponding Certificate Authority. The Local Authority accesses S1 to S5, regardless of whether the services are provided by itself or by another node. The Certificate Authority accesses S4 and S5.

Compliance issues. Our role in the Amico project consisted in refining the analysis of gathered requirements from the point of view of legal compliance. Specifically, the law under analysis was the Italian Personal Data Protection Code, D.Lgs. n. 196/2003 (depicted in Figure 2), limited to Part I, Title II (Data Subject’s Rights) of the law. Eight
people were involved in this task: 3 analysts, 1 industry partner, 1 software architect, 2 designers and 1 programmer. We modelled the requirements of the demo scenario by means of goals. In goal models, goals express the why of requirements choices. Goals are decomposed into sub-goals and operationalised by means of plans. Plans, in turn, may need resources to be executed. In the Amico project, we used i* (on which Nòmos is based) to create goal models representing the rationale behind the demo scenario. Figure 4 represents this rationale, limited to the [Local Authority] actor. When a patient ([User]) accesses a health care centre, the EPR of the patient has to be retrieved from the system at check-in. In the health care centre accessed by the patient, the system (a [Local Authority]) executes a query on the local database, and the [S1] service provides the data. If the data are not found in the local database, the [Local Authority] forwards the request to the [S2] service, which returns the name of the reference [Certificate Authority]. That Authority is then queried to obtain certified data. However, the [Certificate Authority] may also be unable to provide the requested data. In this case, the Local Authority contacts another Local Authority (the actor [Peer Local Authority] in the diagram), which in turn executes a local search or queries its own reference Certificate Authority. If the data searched for do not exist in the system, the Local Authority proceeds to insert them, marking them as “dirty”. In this case, after the data insertion, the Local Authority invokes the [S3] service, which broadcasts the data to the whole system. When the broadcast notification is received, each Local Authority updates its local database. However, the privacy law lays down many prescriptions concerning the processing of personal data (in particular, sensitive data) of patients. For example, it requires the owner’s confirmation for the data being processed. Before building the system, it is necessary to provide some kind of evidence that the described scenario does not violate the law.
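The retrieval flow of the demo scenario (local search, then the reference Certificate Authority, then a peer, then insertion and broadcast of dirty data) can be sketched as follows. This is a minimal sketch under stated assumptions, not the Amico implementation: the class and method names are hypothetical, the databases are plain dictionaries, and the reference_ca attribute stands in for the lookup of the reference Certificate Authority, which the text assigns to a service.

```python
from dataclasses import dataclass, field


@dataclass
class CertificateAuthority:
    verified: dict = field(default_factory=dict)  # patient id -> record

    def search_certificate(self, patient_id):
        """Read-only certified search (role of S4 in the text)."""
        return self.verified.get(patient_id)


@dataclass
class LocalAuthority:
    name: str
    reference_ca: CertificateAuthority            # assumed to be already resolved
    peers: list = field(default_factory=list)
    local_db: dict = field(default_factory=dict)  # patient id -> (record, status)

    def search_local(self, patient_id):
        """Local search (role of S1)."""
        return self.local_db.get(patient_id)

    def update_local(self, patient_id, record, status):
        """Local update (role of S1)."""
        self.local_db[patient_id] = (record, status)

    def broadcast_dirty(self, patient_id, record):
        """Broadcast of not yet verified data to the peers (role of S3)."""
        for peer in self.peers:
            peer.update_local(patient_id, record, "dirty")

    def retrieve_epr(self, patient_id, new_record=None, ask_peers=True):
        # 1. Local database.
        found = self.search_local(patient_id)
        if found is not None:
            return found
        # 2. Reference Certificate Authority: certified, hence "clean".
        certified = self.reference_ca.search_certificate(patient_id)
        if certified is not None:
            return certified, "clean"
        # 3. Peers: whatever a peer returns is unverified, hence "dirty".
        if ask_peers:
            for peer in self.peers:
                found = peer.retrieve_epr(patient_id, ask_peers=False)
                if found is not None:
                    record, _ = found
                    return record, "dirty"
        # 4. Nothing exists: insert as dirty and broadcast, if data is given.
        if new_record is None:
            return None
        self.update_local(patient_id, new_record, "dirty")
        self.broadcast_dirty(patient_id, new_record)
        return new_record, "dirty"
```

Note that this flow contains no compliance check yet: nothing asks for the patient's authorisation before data are inserted or broadcast, which is the gap the analysis in the next section addresses.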
4 Compliant and Auditable Requirements

The purpose of our work in Amico is to provide evidence of compliance of requirements. In a general sense, being compliant with law means behaving according to law prescriptions. However, this meaning has the drawback that behaviours can only be verified at run-time. For this reason, Nòmos splits the notion of compliance into intentional compliance and auditability. Intentional compliance is defined as the design-time distribution of responsibilities such that, if every actor fulfils its goals, then actual compliance is ensured. It represents the intention to comply. Intentional compliance moves the notion of compliance to design time, but has the drawback that having the intention to comply is not equivalent to actually behaving in compliance. Therefore, intentional compliance is associated with the concept of auditability, which is the design-time distribution of auditing resources, such that the run-time execution of processes can be monitored and possibly checked against their purpose.

4.1 Intentional Compliance

In a Nòmos model, we distinguish goals with respect to their role in achieving compliance. We define as strategic goals those goals that come from stakeholders and represent needs of the stakeholders. We define as compliance goals those goals that have been developed to cope with legal prescriptions. For example, in Figure 4, the goal [Update
data locally] is a strategic goal, because it is only due to the reason-to-be of the owning actor; vice versa, [Ask user authorization] is a compliance goal, because it is due to the need to comply with the [Confirmation as to whether or not personal data concerning him exist] claim of the user. The identification of compliance goals, and specifically the identification of missing compliance goals, was actually the objective of our analysis.

Fig. 4. A goal model for the demo scenario of the Amico project (actors: User, Local Authority, Peer Local Authority, Certificate Authority; services S1 to S4 with the operations searchLocal, updateLocal, updateAllLocal, searchCertificate and getCertificateAuthority)

We started from the analysis of the Italian Privacy Code, which lays down many prescriptions concerning the processing of personal data (in particular, sensitive data) of patients. We modelled the relevant fragments of the law through the Nòmos language, as in Figure 2. Afterwards, we identified those goals that could serve for achieving compliance with a particular law fragment, and associated them with the corresponding normative proposition. If no appropriate goals were identified, new ones were conceived and added to the model. For example, the law requires the owner’s confirmation for the data being processed. In Figure 4, this is depicted by means of the normative proposition [Confirmation as to whether or not personal data concerning him exist], extracted from article 7.1. The normative proposition is modelled as a claim of the patient, held towards the Local Authority, which therefore has a corresponding duty. This results in two additions to the diagram. The first one concerns the insertion of the data into the local database, and the subsequent broadcast to the system. In this case, before the broadcast is executed, it is necessary to obtain the patient’s authorisation (goal [Ask
user authorization]), and to add such information to the broadcast message. The second case concerns the reception of the broadcast message by a Local Authority. In this case, before updating the local data with the received data, the Local Authority must verify that the authorisation to data processing is declared in the broadcast message (task [Verify user’s authorization]). This has been done for every normative proposition considered relevant for the demo scenario. The resulting models (partially depicted in Figure 2) contained 10 normative propositions relevant for the described demo scenario. This approach allowed us to assign to the domain’s actors a set of goals, which represent their responsibilities in order to achieve compliance.

Fig. 5. The rationale for an auditable process (Local Authority goals [Ask user authorization], [Verify user’s authorization], [Insert data] and [Broadcast dirty data], with the resources Authorizations record and Broadcast log)
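The two additions described in Section 4.1 can be sketched as a pair of guards around the broadcast. This is only an illustration: BroadcastMessage, prepare_broadcast and receive_broadcast are hypothetical names, and the way the authorisation is obtained (a callback here) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BroadcastMessage:
    patient: str
    record: str
    authorization: Optional[str]  # None when no authorisation is declared


def prepare_broadcast(patient, record, ask_user_authorization):
    """Sender side: [Ask user authorization] before [Broadcast dirty data]."""
    authorization = ask_user_authorization(patient)
    if authorization is None:
        return None  # no broadcast without the patient's authorisation
    return BroadcastMessage(patient, record, authorization)


def receive_broadcast(local_db, message):
    """Receiver side: [Verify user's authorization] before the local update."""
    if not message.authorization:
        return False  # reject messages that do not declare an authorisation
    local_db[message.patient] = (message.record, "dirty")
    return True
```

The receiver-side check is deliberately redundant with the sender-side one: it protects a Local Authority against a peer that failed to obtain the authorisation.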
4.2 Compliance Auditability

The second kind of compliance evidence we produced concerns the capability of the adopted solution to be monitored at run-time in order to confirm compliance. Nòmos supports this evidence through the concept of auditability. The idea is that, if an activity or a whole process is not executed or is executed incorrectly, this is reflected in the log data. Auditing the log data makes it possible to monitor the execution of processes and detect problems. Designing for auditability means deciding which log data have to be produced, by which process, when, and so on. For the information systems supporting the processes, it is important to specify requirements of compliance auditability together with other requirements. Consequently, compliance auditability has to be conceived during the requirements analysis. In order to assess auditability, we associate data log resources with goals. More properly, the resources are associated with the plans intended to fulfil compliance goals. Or, as a shortcut, the plans are omitted and resources are associated directly with goals. In any case, when plans (either explicitly modelled or not) are executed to achieve a goal, they use the associated resource. The idea, conceived in the Nòmos framework, is that such resources can be the key to monitoring, at run-time, the execution of the processes. When a resource, such as a database, is used, its state is affected and the state change can be recorded. Even the simple access to a resource can be recorded. Based on this idea, 2 analysts undertook a modelling session. An excerpt of the results is depicted in Figure 5. The ultimate purpose was to revise existing models, searching for
compliance goals. Once found, compliance goals have been elaborated to be made auditable. For example, the achievement of the [Local Authority] goal [Ask user authorisation], which has been developed to comply with the duty to obtain such authorisation from the user before processing his data, cannot be proved at run-time. Even if the authorisation is requested, if a legal controversy arises, the developed models do not inform on how to prove that this request has been made. To deal with this situation, we added the [Authorisation record] to the model, and associated it with the [Ask user authorisation] goal. This way, we are saying that, whenever the goal is achieved, this is recorded in the authorisation record, where we store the name of the patient who gave the authorisation and the authorisation itself (if the authorisation has been provided electronically; otherwise, the ID of the archived copy of it). On the other hand, the [Local Authority] receives broadcast messages when other Local Authorities commit dirty data in their databases. In this case, the goal [Verify user’s authorization] had been added, to avoid that failures of other Local Authorities in getting the authorisation from the user lead to compliance issues. Again, in this case, the model does not specify how the achievement of this goal can be monitored. For this reason, after having verified the presence of the authorisation information within the broadcast message, the [Local Authority] stores such information in another log, which informs on which authorisations have been received and from which peer. Additionally, we enriched the model with a resource dependency from the [Local Authority] to the service [S3] (of the peer authority), to indicate that the authorisation has to be provided by that service. In turn, this raises another problem: where that information has come from. This is not explicit in the model, so it is not possible to associate the behaviour of [S3] with its auditability resources. So finally, we associated the [Insert data] goal with its resources: the [Authorizations record], from which it takes the authorisation information; and the [Patients data], where the actual data is stored. This way, patients’ data is associated with the authorisation to process them.
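To make the idea of auditable compliance goals concrete, the sketch below attaches a log to each of the two goals discussed above. It is an assumption-laden illustration: AuditLog and both function names are hypothetical, the timestamp field is an addition of this sketch, and the entries use placeholder identifiers rather than real data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditLog:
    """Append-only log acting as an auditability resource."""
    name: str
    entries: list = field(default_factory=list)

    def record(self, **data):
        # The timestamp is an assumption; the text only lists the stored fields.
        entry = {"at": datetime.now(timezone.utc).isoformat(), **data}
        self.entries.append(entry)
        return entry


def ask_user_authorisation(authorisations_record, patient, authorisation):
    """Achieving [Ask user authorisation] leaves a trace in the record.

    `authorisation` is the authorisation itself if given electronically,
    otherwise the ID of the archived hard copy.
    """
    return authorisations_record.record(patient=patient,
                                        authorisation=authorisation)


def verify_user_authorisation(received_log, message, peer):
    """Achieving [Verify user's authorization] logs what came from which peer."""
    authorisation = message.get("authorisation")
    if authorisation is None:
        return False
    received_log.record(patient=message["patient"],
                        authorisation=authorisation, peer=peer)
    return True
```

With this arrangement, a later audit can answer both questions raised in the text: whether a given patient's authorisation was requested, and which peer supplied an authorisation that arrived through a broadcast.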
5 Results

The proposed approach has led to some important results. The Software Requirements Specification (SRS) document has been extended to reflect the compliance solutions found, and to support the auditability of the compliance solutions. Table 1 reports the most important additions introduced by the Nòmos analysis. In the table, the first column reports the law or law fragment for which the requirement has been developed; the second column presents the description of the requirements, gathered or induced by the law fragment; the third column indicates whether the requirement is intended to be audited at run-time. As the first column shows, not all of the articles in the selected law slice were addressed. This is due to the demo scenario, which concerned only search, addition and update of data: law articles not impacting on these functionalities were not addressed. Conversely, one article (157.1) has been considered, even though it is not contained in Title II of the law. The reason is a cross-reference from article 8.3, in relation to the role played by the [Garante] actor, a public body in charge of controlling law application. The Garante has monitoring responsibilities, and may request
from the data processors to “provide information and produce documents”. Such information and documents are those derived with the use of the Nòmos framework, and are enumerated in Table 2. Table 2 shows the overall results of the analysis process. It reports the auditability documents for the identified compliance goals. The table has been created by collecting from the model all the data logs used as auditing sources. The first column contains the name of the audit document. The second column contains the name of the actor who is in charge of maintaining the document. The third column contains the cases in which the document is modified. Basically, the content of the table has to be attached to the SRS, and specifies what documents the system needs to produce once developed, to provide compliance evidence at run-time.

Table 1. An excerpt of the Software Requirements Specification document
Columns: Law article | Requirement | Audit
Law articles: Art. 7.1; Art. 7.2e; Art. 7.3a, 7.3b; Art. 9.4; ...; Art. 157.1
Requirements:
• The Local Authority registers users’ authorisations
• The Local Authority writes the User’s in the Authorisations base
• The Local Authority inserts the data into the local DB
• The Local Authority verifies the entrance of new peers
• The Local Authority maintains the list of verified peers
• S1 gets the list of verified peers from the Local Authority
• The Local Authority writes data modifications to log
• The Local Authority identifies the patient by means of identity card
• The Local Authority records patients’ ID card number
• ...
• The Local Authority produces a report with the collected data to the Garante

Table 2. An excerpt of the Auditability Requirements document
Columns: Auditability document | Responsible | When used
Auditability documents (responsible): Authorisations record (Local Authority); Database log (Local Authority); Broadcast log (S3); Requests log (Local Authority); Peers list (Local Authority)
When used:
• Request of user’s authorisation
• Insertion of dirty data into the local DB
• ...
• Insertion of dirty data
• ...
• Broadcast of dirty data entries
• Requests of data modifications are received from the patient
• Changes are made in the local database
• Addition of a new peer to the list of known peers
6 Lessons Learned

Based on our experience with the Amico project, we performed a qualitative evaluation of the Nòmos framework by answering a set of questions. We summarize it below in terms of lessons learned.

What has been the main contribution to the Amico project? An interview with the industrial partners allowed us to capture their perception of using Nòmos models of law: primarily, an overall decrease in ambiguity. The representation of law prescriptions as models made analysis choices more explicit. This also results in a better understanding of these choices by the customer, showing that models
can be a powerful communication language. It is worth noting that, for models to be used for interaction, the customer has to be briefly instructed on the notation.

Was the framework effective, with regard to a costs/benefits comparison? The effort spent on the compliance analysis amounted to 15 person-days, including meetings with the industrial partner as well as modelling and analysis of compliance goals. The analysis of the law sources was not included in this estimate: it was necessary in any case, regardless of the use of the Nòmos framework. The modelling activity lasted around 7 person-days. Overall, 29 law articles (including sub-paragraphs) were analysed. Of these, 10 were mapped into normative propositions (focusing on the demo scenario described in Section 3); from their analysis, 12 new goals were added to the goal diagram of the demo scenario, and 5 audit resources were identified. Globally, 25 new requirements were derived, including those related to log data for compliance auditing. We considered as requirements: the goals added for compliance reasons; the resources representing auditable data logs; and the associations between goals and resources, considered as the explanation of which processes are responsible for maintaining the auditing logs. A quantitative evaluation reveals a productivity measure of around 4 elements per day. This is not a high number, given the limited demo scenario, but the identified requirements had previously not been considered at all. So, in our experience, the use of the Nòmos framework has ultimately been effective.

Is the framework scalable to larger case studies? The activity of identifying compliance basically consists of an exhaustive search throughout the models for the portions of them addressed by normative propositions. Once these are found, and after the proper compliance goals have been identified, an additional effort consists in keeping synchronised the use of auditability resources by different compliance goals across the models. In the case of a large model, the effort required by these activities may be considerable. As a partial mitigation, the framework does not seem to introduce additional complexity compared with working directly on the textual sources of the law; the law is in fact the major source of complexity. As mentioned above, we exclude from this analysis the effort needed to analyse laws with the purpose of extracting normative propositions.

What advantages did the analysts see in using the framework? The Nòmos framework allowed for a straightforward association of compliance goals with legal prescriptions. In particular, when a normative proposition had effects on multiple parts of the model, the visual notation of the framework made it easy for the analysts to observe these effects. The association of auditing resources with compliance goals was more complex. Identifying resources required a deeper comprehension of previously created goals. For example, the goal [Ask user authorization] generated the problem that a hard copy of the authorisation cannot be broadcast. This forced us to reconsider the semantics of the goal, and decompose it into lower level goals ([Archive hard copy of authorization], [Assign authorization number] and [Broadcast the authorization number]).

What issues did the analysts see in using the framework? A major issue concerned the values to be given to some elements of the model (in particular, to resources) for them to be significant. The semantics of the model relies mainly
on the semantics of the labels on goals and tasks. When associating a resource with a goal or task, we noticed that the meaning of this association changed depending on the semantics of the goal (or task). For example, the goal [Ask user authorization] and the goal [Insert data] both use the [Authorizations record] resource. However, while the former accesses the resource to produce data, the latter accesses it only to read data. The difference between these two cases is important, because producing data actually refers to the capability to provide auditability of compliance goals. This has been resolved by adding annotations to the goal models, in order to differentiate requirements for auditable compliance from the others.
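One way to picture such an annotation is an access mode attached to each association between a goal and a resource. The sketch below is hypothetical: the paper does not describe the form of the annotations, so the Access enumeration and the ResourceUse record are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    PRODUCES = "produces"  # the goal writes audit data into the resource
    READS = "reads"        # the goal only reads from the resource


@dataclass(frozen=True)
class ResourceUse:
    goal: str
    resource: str
    access: Access


uses = [
    ResourceUse("Ask user authorization", "Authorizations record", Access.PRODUCES),
    ResourceUse("Insert data", "Authorizations record", Access.READS),
]


def auditability_requirements(resource_uses):
    """Keep only the associations that produce audit data."""
    return [u for u in resource_uses if u.access is Access.PRODUCES]
```

Filtering on the access mode separates the associations that yield auditability requirements from those that merely consume an existing log.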
7 Related Works

In recent years, there have been various efforts to deal with law-related issues from the requirements elicitation phase. Antón and Breaux have developed a systematic process, called semantic parameterisation [2], which consists of identifying in legal text restricted natural language statements (RNLSs) and then expressing them as semantic models of rights and obligations (along with auxiliary concepts such as actors and constraints). Secure Tropos [5] is a framework for security-related goal-oriented requirements modelling that, in order to ensure access control, uses strategic dependencies refined with concepts such as trust, delegation of a permission to fulfil a goal, execute a task or access a resource, as well as ownership of goals or other intentional elements. Along similar lines, Darimont and Lemoine have used KAOS as a modelling language for representing objectives extracted from regulation texts [3]. Such an approach is based on the analogy between regulation documents and requirements documents. Ghanavati et al. [4] use GRL to model goals and actions prescribed by laws. This work is founded on the premise that the same modelling framework can be used for both regulations and requirements. Likewise, Rifaut and Dubois use i* to produce a goal model of the Basel II regulation [8]. It is worth mentioning that the authors have also experimented with this goal-only approach in the Normative i* framework [10], in which the notion of compliance was not considered. Finally, much work has been done in AI on formalizing law, e.g. [7,9]. We use some of this work as a foundation for our framework. However, our software engineering task of having a person check for compliance between a model of law and another of requirements is different from that of formalizing law for purposes of automatic question-answering and reasoning.
8 Conclusions and Future Work

The paper reports on the application of the Nòmos framework to ensure compliance and auditability for a given set of requirements for a healthcare information system. The case study was conducted over a period of 18 months and involved 25 people, including project manager, analysts, software architect, programmers and researchers. On the basis of the results reported in this paper, the Nòmos framework has been refined by introducing the distinction between strategic goals, the goals related to the strategic dimension of the domain, and compliance goals, devoted to the fulfilment of norms in the domain. Moreover, we introduced the concept of auditability for the requirements
and means to describe it in the Nòmos requirements models. More case studies that do not involve the principal authors of the Nòmos framework need to be conducted, to validate the approach and to evaluate the effort needed to introduce it in a typical requirements analysis process.

Acknowledgement. The activities described here have been funded by the Autonomous Province of Trento, Italy, L6 funding, project “Amico”.
References

1. Medical privacy - national standards to protect the privacy of personal health information. Office for Civil Rights, US Department of Health and Human Services (2000)
2. Breaux, T.D., Antón, A.I., Doyle, J.: Semantic parameterization: A process for modeling domain descriptions. ACM Trans. Softw. Eng. Methodol. 18(2), 1–27 (2008)
3. Darimont, R., Lemoine, M.: Goal-oriented analysis of regulations. In: Laleau, R., Lemoine, M. (eds.) ReMo2V, held at CAiSE 2006. CEUR Workshop Proceedings, CEUR-WS.org, vol. 241 (2006)
4. Ghanavati, S., Amyot, D., Peyton, L.: Towards a framework for tracking legal compliance in healthcare. In: Krogstie, J., Opdahl, A.L., Sindre, G. (eds.) CAiSE 2007 and WES 2007. LNCS, vol. 4495, pp. 218–232. Springer, Heidelberg (2007)
5. Giorgini, P., Massacci, F., Mylopoulos, J., Zannone, N.: Requirements engineering meets trust management: Model, methodology, and reasoning. In: Jensen, C., Poslad, S., Dimitrakos, T. (eds.) iTrust 2004. LNCS, vol. 2995, pp. 176–190. Springer, Heidelberg (2004)
6. Hohfeld, W.N.: Fundamental Legal Conceptions as Applied in Judicial Reasoning. Yale Law Journal 23(1) (1913)
7. Padmanabhan, V., Governatori, G., Sadiq, S.W., Colomb, R., Rotolo, A.: Process modelling: the deontic way. In: Stumptner, M., Hartmann, S., Kiyoki, Y. (eds.) APCCM. CRPIT, vol. 53, pp. 75–84. Australian Computer Society (2006)
8. Rifaut, A., Dubois, E.: Using goal-oriented requirements engineering for improving the quality of ISO/IEC 15504 based compliance assessment frameworks. In: Proceedings of RE 2008, pp. 33–42. IEEE Computer Society, Washington (2008)
9. Sartor, G.: Fundamental legal concepts: A formal and teleological characterisation. Artificial Intelligence and Law 14(1-2), 101–142 (2006)
10. Siena, A., Maiden, N.A.M., Lockerbie, J., Karlsen, K., Perini, A., Susi, A.: Exploring the effectiveness of normative i* modelling: Results from a case study on food chain traceability. In: Bellahsène, Z., Léonard, M. (eds.) CAiSE 2008. LNCS, vol. 5074, pp. 182–196. Springer, Heidelberg (2008)
11. Siena, A., Mylopoulos, J., Perini, A., Susi, A.: Designing law-compliant software requirements. In: Conceptual Modeling - ER 2009, pp. 472–486 (2009)
12. Siena, A., Mylopoulos, J., Perini, A., Susi, A.: A meta-model for modeling law-compliant requirements. In: 2nd International Workshop on Requirements Engineering and Law (Relaw 2009), Atlanta, USA (September 2009)
13. Yu, E.S.-K.: Modelling strategic relationships for process reengineering. PhD thesis, Toronto, Ont., Canada (1996)
Decision-Making Ontology for Information System Engineering

Elena Kornyshova and Rébecca Deneckère
CRI, University Paris 1 - Panthéon Sorbonne, 90, rue de Tolbiac, 75013 Paris, France
{elena.kornyshova,rebecca.deneckere}@univ-paris1.fr

Abstract. Information Systems (IS) engineering (ISE) processes contain steps where decisions must be made. Moreover, the growing role of IS in organizations imposes requirements on ISE such as quality, cost and time. As a consequence, the number of studies dealing with decision-making (DM) in ISE keeps growing. As DM becomes widespread in the ISE field, it is necessary to build a representation, shared between researchers and practitioners, of DM concepts and their relations with DM problems in ISE. In this paper, we present a DM ontology which aims at formalizing DM knowledge. Its goal is to enhance DM and to support DM activities in ISE. This ontology is illustrated within the requirements engineering field.

Keywords: Decision-making, Ontology, Information System Engineering.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 104–117, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

Information system (IS) conception, development, implementation, and every other process in IS engineering includes steps where several alternatives are considered and a decision must be made. Decision-making (DM) may be considered as an outcome of a cognitive process leading to the selection of an action among several alternatives. It might be regarded as a problem-solving activity which is terminated when a satisfactory solution is found. With regard to IS engineering methodologies, the issue of DM has already been explored with respect to requirements engineering [1], to method engineering [2] [3], and, more generally, to systems engineering [4]. For instance, the GRL model allows solutions to be evaluated according to their contribution to the goals [5]. Ruhe emphasized the importance of DM in SE along the whole life cycle [4]. Several examples of the application of different DM methods can also be mentioned: AHP for prioritizing requirements [6] and evolution scenarios [7]. Saeki uses a weighting method to deal with software metrics [8]. Outranking and weighting methods are illustrated in the field of method engineering in order to select method fragments from a repository according to some project characteristics [3]. As shown in [4], engineering-related decisions may result from the need to satisfy practical constraints such as quality, cost or time. Ruhe stresses the importance of DM in the field of IS because of: (i) time, effort, quality and resources constraints; (ii) presence of multiple objectives; (iii) uncertain, incomplete and fuzzy information, and
Decision-Making Ontology for Information System Engineering
(iv) complex decision space. However, the arguments supporting final decisions are still poor, and choices are made in an intuitive and haphazard way [1] [4]. We consider the lack of DM in ISE at three levels: (i) at the tool level, (ii) at the method level, and (iii) at the model level. At the tool level, even if DM tools exist, none offers a complete context-aware DM process. At the method level, intuitive and ad hoc decisions overshadow the method-based ones. At the model level, decisions are often ill-formulated. They are characterized, for instance, by poor understanding and description of decision problems, by misunderstanding of decision consequences, and by the lack of formalization of alternatives and criteria.
We have developed the MADISE (MAke Decisions in Information Systems Engineering) approach to address DM drawbacks at the method and model levels. The main goal of the MADISE approach is to guide IS engineers through DM activities. The MADISE approach includes three elements: the DM ontology, the MADISE process, and the DM methodological repository. The DM ontology (DMO) is a representation of DM concepts for formalizing DM knowledge. The MADISE process is a generic DM process that includes the main activities used for DM and explains how to use DMO. The DM methodological repository provides a set of methodological guidelines for realizing DM activities.
The goal of this work is to present the DM ontology. Even though a growing body of research deals with DM in IS engineering [9], a complete DM ontology does not exist. As DM becomes widespread in the IS engineering field, it is necessary to build a shared representation of DM concepts and to show how these concepts are related to DM problems in IS engineering. The main goal of DMO is to represent concepts of the DM domain, as well as their properties and relations. The DM ontology fulfils the following needs:
• to clarify and organize DM concepts;
• to build a shared representation of DM concepts between researchers and practitioners;
• to show how these concepts are related to DM problems in IS engineering;
• to make DM knowledge reusable in similar IS engineering situations;
• to compare existing DM models in order to select an appropriate one;
• to validate the completeness of existing DM models;
• to support the creation of new DM models.
This paper is organized as follows. In Section 2, we explain the main concepts used for building the DM ontology. We give an overview of the DM fundamentals and describe the DM ontology in Section 3. In Section 4, we validate DMO by applying it to a case from the requirements engineering field. We conclude this paper by presenting the possible applications of the DM ontology and our future work in Section 5.
2 Building a Decision-Making Ontology
In this section, we analyse the different aspects that we have used for building DMO. We present a generic definition of the ontology concept; several classifications applied to DMO; the DMO elements; the way DMO is modelled; and, finally, the DMO goals.
E. Kornyshova and R. Deneckère
Definition. The term ontology is taken from philosophy, where Ontology is a systematic account of Existence. The notion of “ontology” denotes the science of being and, with this, of descriptions for the organization, designation and categorization of existence [10]. Gruber was the first to formulate the term ontology in the field of Computer Science [11]; he defines it as “an explicit specification of a conceptualization”. Gruber [11] established the main principles for constructing ontologies in Computer Science and defined the main ontology elements, such as classes, relations, functions, or other objects. Since then, many approaches have been developed for creating and applying ontologies [12] [13] [14] [15] [16] [17]. For instance, [10] defines ontology as applied to Computer Science as “the capture of the recognized and conceived in a knowledge domain for the purpose of their representation and communication”. For [16], an ontology is a way of representing a common understanding of a domain. [12] considers ontology as a novel and distinct method for scientific theory formation and validation. However, all these newer definitions are based on the idea that a Computer Science ontology is a way of representing concepts, as in Gruber’s approach [16]. Ontologies can be linked to object models: the former express a formal, explicit specification of a shared conceptualization, and the latter refer to the collection of concepts used to describe the generic characteristics of objects in object-oriented languages [13]. The main distinction between ontologies and object models relates to the semantic nature and shared conceptual representation of ontologies.
Classification. [16] presents several classifications of ontologies in Computer Science: by their level of generality (Guarino N., 1998 and Fensel D., 2004 classifications), by their use (Van Heijst G., Schereiber A.T.
and Wieringa B.J., 1996 classification), and by the level of specification of relationships among the terms gathered in the ontology (Gómez-Pérez A., Fernández-López M. and Corcho O., 2003 classification). According to these works, DMO can be defined as follows:
• according to the generality level: DMO is a domain ontology which captures the knowledge valid for the ISE domain and describes the vocabulary related to the domain of DM in ISE;
• according to the use: DMO is a knowledge modeling ontology as it specifies the conceptualization of the DM knowledge;
• according to the specification level: DMO is a lightweight ontology, which includes concepts, concept taxonomies, relationships between concepts and properties describing concepts, and which omits axioms and constraints.
Elements. DMO is a lightweight ontology including the following main elements: concepts, relations and properties [13] [16]:
• Concepts represent objects from the real world and reflect the representational vocabulary of the domain knowledge [11];
• Relations are relationships representing a type of interaction between concepts. Three types of relationships may be used in an ontology: generalization, association, and aggregation [13];
• Properties (or attributes) are characteristics of a concept describing its main particularities, which are concise and relevant to the ontology’s goals.
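The three element types listed above can be illustrated with a small data structure. The following Python fragment is only a minimal sketch under our own assumptions: the class names `Concept`, `Relation`, `RelationKind` and `LightweightOntology`, and the helper `parents_of`, are hypothetical and are not part of DMO; the three example concepts are borrowed from the glossary in the Appendix.

```python
from dataclasses import dataclass, field
from enum import Enum


class RelationKind(Enum):
    """The three relationship types that may be used in an ontology [13]."""
    GENERALIZATION = "generalization"
    ASSOCIATION = "association"
    AGGREGATION = "aggregation"


@dataclass
class Concept:
    """A concept with its properties (attribute name -> attribute type)."""
    name: str
    properties: dict[str, str] = field(default_factory=dict)


@dataclass
class Relation:
    """A named, typed relationship from a source concept to a target concept."""
    name: str
    kind: RelationKind
    source: Concept
    target: Concept


@dataclass
class LightweightOntology:
    """Concepts, relations and properties only: no axioms, no constraints."""
    concepts: dict[str, Concept] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)

    def add_concept(self, name, **properties):
        concept = Concept(name, dict(properties))
        self.concepts[name] = concept
        return concept

    def relate(self, name, kind, source, target):
        relation = Relation(name, kind, self.concepts[source], self.concepts[target])
        self.relations.append(relation)
        return relation

    def parents_of(self, name):
        """Names of the concepts that `name` specializes (taxonomy lookup)."""
        return [r.target.name for r in self.relations
                if r.kind is RelationKind.GENERALIZATION and r.source.name == name]


# Hypothetical usage with three concepts taken from the DMO glossary.
dmo = LightweightOntology()
dmo.add_concept("Decision", validity="Boolean")
dmo.add_concept("IntuitiveDecision")
dmo.add_concept("MethodBasedDecision")
dmo.relate("is_a", RelationKind.GENERALIZATION, "IntuitiveDecision", "Decision")
dmo.relate("is_a", RelationKind.GENERALIZATION, "MethodBasedDecision", "Decision")
```

The sketch deliberately stores nothing beyond names, typed links and attribute declarations, which is what distinguishes a lightweight ontology from one with axioms and constraints.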
Modelling. [16] mentions several methods for modelling ontologies, such as frames and first-order logic, Description Logics, Entity-Relationship (ER) diagrams, or Unified Modeling Language (UML) diagrams. UML is sufficient to model lightweight ontologies [16] and is a well-known modeling language. The UML formalism is already used for this purpose, for instance, in an ontology for software metrics and indicators for cataloging web systems [18] or for representing domain ontologies and a meta-model ontology in requirements engineering [15]. For this reason, we have selected the UML class diagram for representing DMO. In this case, each class represents a concept. Concept taxonomies are represented by generalization relationships. Relations between concepts are represented by association relationships. Concept properties are attributes of the corresponding classes.
Goals. In general, an abstract representation of phenomena expressed with a model must be relevant to the model’s purpose [19]. An ontology is a conceptualization, which is an abstract, simplified view of the world that we wish to represent for some purpose [11]. This is the reason why the goals for building DMO must be specified. Based on the reviewed literature on ontologies in Computer Science [11] [12] [15] [16] [19] [20], we define the following goals of DMO:
1. Knowledge conceptualization. As ontologies represent concepts, they offer “ways to model phenomena of interest, and, in particular, model theories that are cast in the form of a conceptual framework in a much more rigorous fashion” [12]. In this manner, DMO provides a flexible way of conceptualizing DM knowledge.
2. Domain modelling. An ontology aims at modelling a specific domain [16]. [19] claims that “domain understanding is the key to successful system development”. Domain understanding is usually represented via some form of domain modeling [19]. This allows complex real-world objects to be represented in graphical and diagrammatic forms that are understandable and accessible to experts and practitioners in diverse domains. The motivation for DMO is to have a unique model for DM knowledge.
3. Anchoring [19]. An ontology allows concepts, which are often abstract, to be anchored to concrete application domains. From this viewpoint, DMO aims at relating DM concepts to ISE.
4. Sharing representation. Agents must communicate about a given domain and have a common language within this domain. [11] calls this “ontological commitments”. An ontology must include atomic concepts that any stakeholders, or other agents, can commonly have in a problem domain [15]. From this viewpoint, an ontology is a compromise between different viewpoints, different stakeholders, or involved parties. [16] claims that “not only people, but also applications must share a common vocabulary, that is, a consensus about the meaning of things”. This consensus is reached by building ontologies, which are one of the solutions for representing this common understanding. Therefore, all participants of the DM process (such as IS engineers, method engineers, users, and stakeholders), as well as applications, must share a common understanding of a DM problem.
5. Model validation. In Software Engineering, a specific ontology could be taken as a reference point to validate a model that acts over a particular domain [16]. Applying DMO enables the validation of existing or new DM models in the ISE domain. The following criteria may be tested: consistency, completeness, conciseness, expandability, sensitiveness [20].
3 DMO: Decision-Making Ontology
In this section, we describe the DM ontology. After an introduction to DM fundamentals, we present DMO and, then, the organization of DM knowledge into DM method components.
3.1 Decision-Making Fundamentals
A decision is an act of intellectual effort initiated for satisfying a purpose and allowing a judgement about the set of potential actions in order to prescribe a final action. Bernard Roy defines three basic concepts that play a fundamental role in analysing and structuring decisions [21]: decision problem, alternatives (potential actions), and criteria.
The decision problem [21] can be characterised by the result expected from DM. When the result consists of a subset of the potential alternatives (most often only one alternative), it is a choice problem. When the result represents the assignment of the potential alternatives to some predefined clusters, it is a classification problem. When the result consists of an ordered collection of the potential alternatives, it is a ranking problem.
The concept of alternative designates the decision object. Any decision involves at least two alternatives that must be well identified.
A criterion can be any type of information that enables the evaluation of alternatives and their comparison. There are many different kinds of criteria: intrinsic characteristics of artefacts or processes, stakeholders' opinions, potential consequences of alternatives, etc. When dealing with criteria, the engineer must determine "preference rules", i.e. the desired value of a criterion (for example, max. or min. for a numeric criterion) according to a given need.
Herbert Simon (1978 Nobel Prize in Economics) was the first to formalize the decision-making process. He suggested a model including three main phases: intelligence, design, and choice (I.D.C. model) [22]. Intelligence deals with investigating an environment for conditions that call for decisions. Design concerns inventing, developing, and analyzing possible decision alternatives. Choice calls for selecting an alternative from the possible ones. This process has been modified and extended in different ways. Currently, the commonly agreed and used decision-making steps are defined as follows [23]:
• define problem (necessity to define priorities),
• identify problem parameters (for instance, alternatives and criteria),
• establish evaluation matrix (estimate alternatives according to all criteria),
• select method for decision making,
• aggregate evaluations (provide a final aggregated evaluation allowing decision).
These are the basic notions of DM. However, an additional analysis of the DM literature is required in order to build a complete DM ontology. We have used the following references for completing the DM knowledge: [24] [25] [26] [27] [28] [29].
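The three problem types distinguished above differ only in the shape of the expected result, which a short Python sketch can make concrete. It is an illustration under our own assumptions, not a method from the DM literature cited here: each alternative is assumed to already have a single aggregated score where higher is better, clusters are assumed to be defined by minimum scores, and the function names, scores and cluster labels are all hypothetical.

```python
def ranking(scores):
    """Ranking problem: return all alternatives ordered from best to worst."""
    return sorted(scores, key=lambda name: scores[name], reverse=True)


def choice(scores, k=1):
    """Choice problem: return a subset of alternatives (most often only one)."""
    return ranking(scores)[:k]


def classification(scores, clusters):
    """Classification problem: assign each alternative to a predefined cluster.

    `clusters` is a list of (label, minimum_score) pairs; an alternative goes
    to the cluster with the highest minimum that its score reaches.
    """
    ordered = sorted(clusters, key=lambda cluster: cluster[1], reverse=True)
    return {
        name: next((label for label, minimum in ordered if score >= minimum), None)
        for name, score in scores.items()
    }


# Hypothetical aggregated scores for three alternatives.
scores = {"A": 0.7, "B": 0.4, "C": 0.9}

ranked = ranking(scores)
chosen = choice(scores)
classified = classification(scores, [("accept", 0.6), ("reject", 0.0)])
```

With these illustrative scores, the same evaluation yields an ordered list for the ranking problem, a one-element subset for the choice problem, and a mapping from alternatives to clusters for the classification problem.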
3.2 Decision-Making Concepts Ontology
Based on the state of the art and DM knowledge, we have developed the DM ontology, which is a lightweight domain knowledge ontology including concepts, attributes and relationships. This ontology represents DM knowledge as a UML class diagram (see Fig. 1). In the following, we describe the DM ontology and illustrate it with a running example from project portfolio management (PPM): a project of an ERP purchase. A more detailed description of the DM concepts ontology is given in the Appendix. The concepts, attributes, and relationships are shown in Tables 1, 2 and 3, respectively.
[Figure: UML class diagram. Classes: DMSituation, DMObject, ProductElement, ProcessElement, Problem, AlternativeSet, Alternative, CriteriaSet, Criterion, Consequence, State, DecisionMaker, Stakeholder, Goal, PreferenceRule, Weight, Threshold, Preference, AlternativeValue, Decision, IntuitiveDecision, MethodBasedDecision. Relationships: contains, leads_to, defines, is, is_associated_to, characterizes, has, determines, is_defined_for, aggregates, concerns, is_related_to, evaluates, validates, makes, is_based_on, responds_to.]
Fig. 1. Decision-Making Ontology
The starting point for analyzing the DM concepts ontology is the DM situation. The DM situation is an abstract concept which puts together the main DM elements and describes a concrete case of DM dealing with a given DM object. The DM object can be a process element or a product element. In the case of PPM, the project is the DM object, which is a product element. A given DM situation contains a DM problem and a set of alternatives. In PPM, the problem is a choice of one or more relevant projects, and the alternatives can be an SAP ERP project, an Oracle e-Business Suite project, and an OpenERP project. The DM situation can also
contain a set of criteria. It is not mandatory, as a decision could be made without analyzing criteria. Criteria can be of different natures. A criterion can be: (i) an intrinsic characteristic of alternatives (characterizes relationship), (ii) a future consequence of alternatives depending on the future state (is(3) relationship), (iii) a decision-maker’s goal (is(1) relationship), or (iv) a decision-maker (in this case, decision-makers have the role of stakeholders according to the is(2) relationship). In the PPM case, two criteria can be considered: purchase cost and maintenance cost. The first one is known directly and constitutes a characteristic of a project. The second one depends on several factors (for instance, project duration) and reflects the future consequences of the ERP implementation. The criteria can be analyzed in order to determine their weights, preference rules, and thresholds. In our case, the weights could be equal (the two criteria have the same importance); the preference rule is the minimization of both costs; and a threshold could be established in order to indicate the maximal acceptable cost of the ERP purchase and maintenance. All alternatives are evaluated according to the identified criteria in order to obtain the alternative values. In the ERP purchase case, a value matrix (3 × 2) will be constituted. The alternative values can be aggregated in order to produce a unique value per alternative (aggregates relationship), for instance by a weighted sum. In the PPM case, the two ERP costs will be added for each alternative ERP. Thus, each ERP will have only one value, which makes it easier to compare all the ERPs. Based on these aggregated values, a method-based decision will be made. Decision-makers can participate in DM in several ways. They define the DM problem and have goals and preferences with regard to preference rules, weights and thresholds. They can also become criteria as a particular decision-maker type – stakeholder.
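The ERP purchase example above can be traced step by step in a few lines of Python. This is a sketch under stated assumptions rather than part of DMO: all cost figures and threshold values are invented for illustration only, thresholds are assumed to be per-criterion maxima that exclude an alternative when exceeded, and the function names are hypothetical. Only the structure (a 3 × 2 value matrix, equal weights, minimization as preference rule, aggregation by weighted sum) comes from the text.

```python
def within_thresholds(row, thresholds):
    """True if every criterion value stays at or below its maximal acceptable value."""
    return all(row[criterion] <= limit for criterion, limit in thresholds.items())


def aggregate(values, weights):
    """Weighted sum of the criterion values: one aggregated value per alternative."""
    return {
        alternative: sum(weights[criterion] * value for criterion, value in row.items())
        for alternative, row in values.items()
    }


def method_based_decision(values, weights, thresholds):
    """Drop alternatives violating a threshold, then minimize the aggregated cost."""
    admissible = {alt: row for alt, row in values.items()
                  if within_thresholds(row, thresholds)}
    totals = aggregate(admissible, weights)
    return min(totals, key=totals.get), totals


# Hypothetical 3 x 2 value matrix (alternatives x criteria), arbitrary cost units.
values = {
    "SAP ERP": {"purchase": 500, "maintenance": 300},
    "Oracle e-Business Suite": {"purchase": 450, "maintenance": 400},
    "OpenERP": {"purchase": 100, "maintenance": 350},
}
weights = {"purchase": 1.0, "maintenance": 1.0}    # equal importance
thresholds = {"purchase": 600, "maintenance": 380}  # maximal acceptable costs

decision, totals = method_based_decision(values, weights, thresholds)
```

With these invented figures, one alternative is discarded because its maintenance cost exceeds the threshold, and the remaining two are compared on a single aggregated value each, which is exactly the simplification that the aggregates relationship is meant to capture.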
Decision-makers evaluate alternatives, validate decisions and make intuitive decisions. In our case, decision-makers participate in the definition of the DM problem and in the establishment of weights, preference rules, and thresholds. They also validate the final decision. Both method-based and intuitive decisions are related to the DM situation. Each DM situation can lead to either none or several decisions.
3.3 Decision-Making Method Components
DMO elements are organized into DM method components in order to make their use easier. The notion of method component is inspired by the Method Engineering domain [30]. The DM component model is shown in Fig. 2. It includes six concepts: component, actor, intention, concept, activity, and context.
Component. A DM method component is a reusable building block of a DM method that can be used separately. Each DM method component may contain several method components, which, in turn, may also be decomposed into simpler components.
Actor. Actors participating in DM can have three main roles: stakeholder, IS engineer, and DM staff. A stakeholder defines the decision problem, sets goals, expresses preferences on alternatives and criteria [23] and validates the final decision. An IS engineer evaluates alternatives and makes a proposal for DM to stakeholders. DM staff is responsible for assisting stakeholders and IS engineers in all stages of the
DM process [23]. DM staff may include machine support for DM; in this case, it is a system actor. If actors are human, we call them decision-makers. Decision-makers have the same roles as actors but have an additional property, which is their type: individual or collective. A collective decision-maker is a group of decision-makers having the same goals and preferences and acting as a unique actor. Actors contribute to the DM process at different stages. The same actor can play different roles in a specific DM process.
[Figure: UML class diagram. Classes: Component, Actor, Intention, Concept, Activity, Context. Relationships: has, defines, carries_out, uses, produces, is_used_for, aims_at, is_used_in, is_available_in, is_applied_in, is_related_to.]
Fig. 2. DM Method Component
Intention. This concept describes the goals that actors have within the DM process. It is represented as a taxonomy. The main intention is to find a solution to a problem having a DM nature. This intention is decomposed into more detailed ones, for instance, define alternatives, define relative importance of criteria, and so on.
Concept. Concepts are objects used in DM, for instance, the DM problem (choice, ranking, classification), alternative, criterion, and so on. The DM concepts and their relationships are complex. We have organized them into an ontology of DM concepts, which is described in detail in Section 3.2.
Activity. The activity concept describes elementary actions used for making decisions. The activity taxonomy contains activities (enumerate alternatives, calculate a weighted sum, calculate an aggregate value, etc.) and the possible relationships between them: composition, precedence and parallel execution.
Context. The context describes the conditions in which decisions are made. The context is represented as a taxonomy of characteristics, such as cost, time, etc.
These meta-concepts are related to each other as follows. An actor has the intention to make a decision in order to resolve a problem. The actor defines several DM concepts and carries out different DM activities. DM activities use various concepts and produce other ones. Both DM concepts and activities are related to intentions in order to show for which reason they are used in a DM process (is_used_for and aims_at relationships). They are also related to a context in order to indicate the conditions in which they can be used (is_used_in and is_available_in relationships). A combination of intentions, concepts and activities (composition link between component and intention, concept, and activity) represents a component, as each component contains methodological information about its application. Components are related to a context in order to indicate the characteristics of the situation in which they can be applied (is_applied_in relationship). The context is defined by the involved actors (defines relationship).
4 An Application Case: The REDEPEND Approach
This section aims at validating the DM ontology. Our goal is to show how existing DM models can be expressed through the DM ontology. We have chosen an existing and well-known DM method dealing with requirements engineering: the REDEPEND approach [6]. We capture its DM model and express it through DMO, i.e. DMO concepts, attributes and relationships are used in order to represent the REDEPEND approach.
The REDEPEND approach uses requirements for selecting candidate tasks, which represent possible alternatives. It is based on the AHP (Analytic Hierarchy Process) DM method. The AHP, proposed by T.L. Saaty [31], includes pair-wise comparison between alternatives and/or criteria and aggregation of the comparison results into quantitative indicators (scores). The REDEPEND approach integrates the AHP and i*, which is a well-known requirements modeling formalism.
Fig. 3 represents the DM concepts used in the REDEPEND approach. The DM situation in this approach is characterized as follows. The DM problem is ranking. The DM object is a task, which can be a scenario (process element) or a goal (product element). Tasks represent alternatives, which are fragmented, i.e. they can depend on one another. All alternatives are considered valid, as the REDEPEND approach does not contain a module for validating them. The alternative set is evolving, as it can change over time. One or more decision-makers (individual stakeholders) define goals and soft goals. Goals and soft goals represent requirements and are considered as criteria in the given model. These goals are deterministic; their measure scale is nominal; the data type is qualitative; and they are valid.
[Figure: UML class diagram instantiating DMO. Classes and legible attribute values: DMSituation; DMObject (name = Task); ProductElement; ProcessElement; Problem (type = rank); AlternativeSet (nature = evol); Alternative (type = frag, validity = true); CriteriaSet; Criterion (informationType = det, measureScale = nominal, dataType = qual, validity = true); Goal (description); DecisionMaker (type = ind, role = stakeholder); Weight; Preference; AlternativeValue (type = numeric); Decision; MethodBasedDecision. Relationships: contains, leads_to, defines, is, is_associated_to, characterizes, is_defined_for, has, concerns, aggregates, responds_to, is_based_on.]
Fig. 3. DMO: Application to the REDEPEND Approach
Decision-makers make pair-wise comparisons of tasks and goals. In this way, they express preferences on weights and alternative values (concerns relationships). For instance, they compare each pair of alternatives according to a criterion and give a numeric value to it. A value can vary from 1 (equal importance) to 9 (absolute importance) in accordance with the basic AHP method. This constitutes the elementary alternative value. These values for all alternatives are then aggregated in order to rank the alternatives against the given criterion (aggregates relationship). The same analysis is made between alternatives for each criterion, and between criteria in order to prioritize them too (Weight class and aggregates relationship, respectively). The ranked alternatives and criteria are combined to compute the final ranking of the alternatives in order to make a decision (is_based_on relationship).
Fig. 4 illustrates the DM method components used in the REDEPEND approach. REDEPEND implements the AHP method and has three related components (see Fig. 4.A), which are used for the pair-wise comparison of tasks and the pair-wise comparison of criteria (two instances of PWComponent), and for the computation of the ranking of candidates (an instance of CompComponent).
We detail the PWComponent dealing with the ranking of alternatives in Fig. 4.B. The PWComponent contains an intention, which is to prioritize candidates. This component includes alternative values as concepts and two activities. The first activity, normalizing values, uses alternative values for producing normalized alternative values. The second activity, calculating relative values (used for calculating the relative rating of each task), uses normalized alternative values for producing ranked alternative values.
[Figure: two UML diagrams. A) AHPComponent : Component, composed of two PWComponent : Component instances and one CompComponent : Component instance. B) PWComponent : Component with the Intention (name = prioritize candidates), the Concepts AlternativeValue, NormAlternativeValue and RankAlternativeValue, and the Activities NormalizingValues and CalculatingRelativeValues. Relationships: is_used_for, uses, produces, follows, aggregates.]
Fig. 4. A) REDEPEND DM Method Components; B) PWComponent of REDEPEND Approach
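The two activities of the PWComponent and the final computation of the CompComponent can be sketched in Python as follows. This is an illustration under our own assumptions and not the REDEPEND implementation: we assume that "normalizing values" divides each entry of the pair-wise comparison matrix by its column sum and that "calculating relative values" averages each normalized row, which is a common approximation of AHP priorities (Saaty's method [31] is usually stated in terms of the principal eigenvector). The task names, goal names and comparison values are hypothetical.

```python
def ahp_priorities(pairwise):
    """Approximate AHP priorities from a reciprocal pair-wise comparison matrix."""
    n = len(pairwise)
    # Activity 1 (normalizing values): divide each entry by its column sum.
    column_sums = [sum(pairwise[i][j] for i in range(n)) for j in range(n)]
    normalized = [[pairwise[i][j] / column_sums[j] for j in range(n)]
                  for i in range(n)]
    # Activity 2 (calculating relative values): average each normalized row.
    return [sum(row) / n for row in normalized]


def rank_alternatives(names, per_criterion, criteria_weights):
    """Final computation: weighted sum of the per-criterion priorities."""
    scores = {
        name: sum(weight * per_criterion[criterion][i]
                  for criterion, weight in criteria_weights.items())
        for i, name in enumerate(names)
    }
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical data: three tasks compared against two goals.
tasks = ["task1", "task2", "task3"]
task_priorities = {
    "goal1": ahp_priorities([[1, 2, 4], [1 / 2, 1, 2], [1 / 4, 1 / 2, 1]]),
    "goal2": ahp_priorities([[1, 1 / 2, 1 / 4], [2, 1, 1 / 2], [4, 2, 1]]),
}
# Pair-wise comparison of the two goals: goal1 is judged 3 times as important.
goal_weights = dict(zip(["goal1", "goal2"], ahp_priorities([[1, 3], [1 / 3, 1]])))

final_ranking, final_scores = rank_alternatives(tasks, task_priorities, goal_weights)
```

In this sketch the two calls per goal and the call on the goal comparison matrix play the role of the two PWComponent instances, and `rank_alternatives` plays the role of the CompComponent. The example matrices are perfectly consistent, so the approximation coincides with the eigenvector result; for inconsistent judgements the two would differ slightly.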
5 Conclusion
In this paper, we have presented the DM ontology. DMO aims at representing and formalizing DM knowledge. It includes concepts, their properties and relationships organized into two levels. We have validated DMO by applying it to a well-known DM method from the requirements engineering field. The main goal of DMO is to enhance and facilitate DM. It supports IS engineers in DM activities. Therefore, IS engineers could use this ontology in every case of DM. This implies that DMO includes all necessary elements and links for supporting DM in various situations.
We foresee different applications of DMO. Firstly, DMO helps to overcome the above-mentioned drawbacks of DM in IS engineering at the model and method levels. For instance, at the model level, it contributes to a better comprehension of the DM problem and of decision consequences; it allows DM situations to be formulated in a clear way shared between different DM actors. At the method level, DMO encourages the use of scientific DM methods for decisions. In fact, by showing how IS engineering concepts are related to DM ones, DMO makes it easier to use different methods and models from the operational research field. Secondly, DMO responds to practical needs such as the validation of existing DM methods and models, their possible enhancement by adding different DM components, and assistance in the creation of new ones.
As the definition of this ontology was motivated by the necessity to support the generic DM process MADISE, our future research includes (i) validation of the MADISE approach and (ii) development of the DM methodological repository.
References
1. Ngo-The, A., Ruhe, G.: Decision Support in Requirements Engineering. In: Aurum, A., Wohlin, C. (eds.) Engineering and Managing Software Requirements, pp. 267–286 (2005)
2. Aydin, M.N.: Decision-making support for method adaptation, Ed. Enschede, Netherlands (2006)
3. Kornyshova, E., Deneckère, R., Salinesi, C.: Method Chunks Selection by Multicriteria Techniques: an Extension of the Assembly-based Approach. In: Situational Method Engineering (ME), Geneva, Switzerland (2007)
4. Ruhe, G.: Software Engineering Decision Support – Methodology and Applications. In: Tonfoni, Jain (eds.) Innovations in Decision Support Systems. International Series on Advanced Intelligence, vol. 3, pp. 143–174 (2003)
5. Amyot, D., Mussbacher, G.: URN: Towards a New Standard for the Visual Description of Requirements. In: Proceeding of 3rd Int. WS on Telecommunications and beyond: the broader applicability of SDL and MSC (2002)
6. Maiden, N.A.M., Pavan, P., Gizikis, A., Clause, O., Kim, H., Zhu, X.: Integrating Decision-Making Techniques into Requirements Engineering. In: REFSQ 2002, Germany (2002)
7. Papadacci, E., Salinesi, C., Sidler, L.: Panorama des approches d’arbitrage dans le contexte de l’urbanisation du SI, Etat de l’art et mise en perspective des approches issues du monde de l’ingénierie des exigences. special issue ISI Journal (2005)
8. Saeki, M.: Embedding Metrics into Information Systems Development Methods: An Application of Method Engineering Technique. In: Eder, J., Missikoff, M. (eds.) CAiSE 2003. LNCS, vol. 2681, pp. 374–389. Springer, Heidelberg (2003)
9. Kou, G., Peng, Y.: A Bibliography Analysis of Multi-Criteria Decision Making in Computer Science (1989-2009). In: Cutting-Edge Research Topics on Multiple Criteria Decision Making. CCIS, vol. 35, pp. 68–71. Springer, Heidelberg (2009)
10. Rebstock, M., Fengel, J., Paulheim, H.: Ontologies-Based Business Integration. Springer, Heidelberg (2008)
11. Gruber, T.R.: Toward Principles for the Design of Ontologies Used for Knowledge Sharing. International Journal of Human-Computer Studies 43, 907–928 (1993)
12. Akkermans, H., Gordijn, J.: Ontology Engineering, Scientific Method and the Research Agenda. In: Staab, S., Svátek, V. (eds.) EKAW 2006. LNCS (LNAI), vol. 4248, pp. 112–125. Springer, Heidelberg (2006)
13. Batanov, D.N., Vongdoiwang, W.: Using Ontologies to Create Object Model for Object-Oriented Software Engineering. In: Sharman, R., Kishore, R., Ramesh, R. (eds.) Ontologies: A Handbook of Principles, Concepts and Applications in Information Systems, pp. 461–487. Springer, US (2007)
14. Hevner, A., March, S., Park, J., Ram, S.: Design science research in information systems. MIS Quarterly 28(1), 75–105 (2004)
15. Kaiya, H., Saeki, M.: Using Domain Ontology as Domain Knowledge for Requirements Elicitation. In: 14th RE Conference, vols. 11-15, pp. 189–198 (2006)
16. Sánchez, D.M., Cavero, J.M., Martínez, E.M.: The Road Toward Ontologies. In: Sharman, R., Kishore, R., Ramesh, R. (eds.) Ontologies: A Handbook of Principles, Concepts and Applications in Information Systems, pp. 3–20. Springer, US (2007)
17. Wieringa, R., Maiden, N., Mead, N., Rolland, C.: Requirements engineering paper classification and evaluation criteria: A proposal and a discussion. Requirements Engineering 11(1), 102–107 (2006)
18. Martín, M., Olsina, L.: Towards an Ontology for Software Metrics and Indicators as the Foundation for a Cataloging Web System. In: Proceedings of the First Conference on Latin American Web Congress, Washington, DC, USA, pp. 103–113 (2003)
19. Offen, R.: Domain understanding is the key to successful system development. Requirements Engineering 7(3) (2002)
20. Gómez-Pérez, A.: Evaluation of ontologies. In: Workshop on Verification and Validation at DEXA, Vienna, Austria, vol. 16(3) (2001)
21. Roy, B.: Paradigms and challenges. In: Figueira, J., Greco, S., Ehrgott, M. (eds.) Multiple Criteria Decision Analysis - State of the Art Survey, pp. 3–24. Springer, Heidelberg (2005)
22. Simon, H.: The New Science of Management Decision. Harper&Row, New York (1960)
23. Baker, D., Bridges, D., Hunter, R., Johnson, G., Krupa, J., Murphy, J., Sorenson, K.: Guidebook to decision-making methods, Developed for the Department of Energy (2001)
24. Bouyssou, D., Marchant, T., Pirlot, M., Perny, P., Tsoukias, A., Vincke, P.: Evaluation and decision models: a critical perspective. Kluwer Academic Publishers, USA (2000)
25. Hanne, T.: Meta Decision Problems in Multiple Criteria Decision Making. In: Gal, T., Stewart, T.J., Hanne, T. (eds.) Multiple Criteria Decision Making-Advances in MCDM Models, Algorithms, Theory and Applications, ch. 6. International Series in Operations Research and Management Science, vol. 21. Springer, Kluwer (1999)
26. Keeney, R.L., Raiffa, H.: Decisions with Multiple Objectives: Preferences and Value Trade-Offs. Cambridge University Press, Cambridge (1993)
27. Bouyssou, D.: Outranking methods. In: Encyclopedia of Optimization. Kluwer, Dordrecht (2001)
28. Ballestero, E., Romero, C.: Multiple criteria decision making and its applications to economic problems. Kluwer Academic Publishers, Netherlands (1998)
29. Vincke, P.: L’aide multicritère à la décision. Edition Ellipses, Edition de l’Université de Bruxelles, Bruxelles, Belgique (1989) (in French)
30. Ralyté, J., Deneckere, R., Rolland, C.: Towards a Generic Model for Situational Method Engineering. In: Eder, J., Missikoff, M. (eds.) CAiSE 2003. LNCS, vol. 2681. Springer, Heidelberg (2003)
31. Saaty, T.L.: The Analytic Hierarchy Process. McGraw-Hill, New York (1980)
E. Kornyshova and R. Deneckère
Appendix: DMO Description

Table 1. Decision-Making Ontology: Glossary of concepts

Concept Name | Description
Alternative | Possible action available for decision-making.
AlternativeSet | Set of alternatives available in a given DM situation.
AlternativeValue | Evaluation of an alternative according to a criterion or an aggregated alternative value.
Consequence | Impact that an alternative can have if it is realized following the decision made.
CriteriaSet | Set of criteria available in a given DM situation.
Criterion | Information of any kind that enables the evaluation of alternatives and their comparison.
Decision | Act of intellectual effort initiated for satisfying a purpose and allowing a judgment about the potential actions set in order to prescribe a final action. It can be an IntuitiveDecision or MethodBasedDecision.
DecisionMaker | Actor contributing to the DM process at its different stages.
DMObject | Artifact being the subject of decision-making (ProductElement or ProcessElement) in IS engineering.
DMSituation | Set of specific conditions of DM dealing with a given DM object.
Goal | Intention or a projected state that a decision-maker intends to achieve.
IntuitiveDecision | Decision made ‘on the fly’ without using a DM method.
MethodBasedDecision | Decision based on the application of different DM methods.
Preference | Preference that a decision-maker has on different DM situation elements of alternatives and criteria.
PreferenceRule | Wishful value of a criterion according to a given need.
Problem | Result expected from a DM
ProcessElement | DM object corresponding to a process in IS engineering.
ProductElement | DM object corresponding to a product in IS engineering.
Stakeholder | Particular role of a decision-maker, which defines the DM problem, sets goals, expresses preferences on alternatives and criteria and validates the final decision.
State | State of the environment affecting alternative consequences in the future.
Threshold | Acceptable value for alternative values expressed for a given criterion.
Weight | Relative importance of a criterion.
Table 2. Decision-Making Ontology: Attributes Description

Concept | Attribute | Description and/or Domain
Alternative | type | Type of alternative: global or fragmented.
Alternative | validity | Validity of an alternative, which is a Boolean value.
AlternativeSet | nature | Nature of the alternative set, which is stable or evolving.
AlternativeValue | type | Type of an alternative value: character, numeric, or enumeration.
AlternativeValue | value | Value of an alternative.
Consequence | nature | Nature of a consequence, which can be certain or uncertain.
Criterion | informationType | Type of information on a criterion: determinist, probabilistic, fuzzy, or mixed.
Criterion | measureScale | Measure scale: nominal, ordinal, interval, ratio, and absolute.
Criterion | dataType | Type of data: qualitative or quantitative.
Criterion | validity | Validity of a criterion, which is a Boolean value.
Decision | validity | Validity of a decision, which is a Boolean value.
DecisionMaker | type | Type of a decision-maker, which can be individual or collective (a group of decision-makers having the same goals and preferences and acting as a unique decision-maker).
DMObject | name | Name of a DM object.
Goal | description | Description of a goal.
PreferenceRule | type | Type of preference, for instance, a function (max, min), or an ordered list.
Problem | type | Problem type, which can be a choice, a ranking, a classification.
State | probability | Probability of a state realization in the future.
Threshold | type | Threshold type: preference, indifference, or veto thresholds.
Threshold | value | Numeric value of a threshold.
Weight | value | Numeric value of a weight.
Table 3. Decision-Making Ontology: Relationships Description

Relationship Name | Description
aggregates (1) | An alternative value can be an aggregation of at least two values which describe this alternative according to different criteria.
aggregates (2) | A weight value can be an aggregation of at least two values.
characterizes | Each criterion characterizes one or more alternatives. Each alternative can be characterized by one or more criteria.
concerns (1) | A preference may concern a threshold or a preference rule.
concerns (2) | A preference may concern a weight or two weights in the case of pair-wise comparisons.
concerns (3) | A preference may concern an alternative value or two alternative values in the case of pair-wise comparisons. An alternative value may be defined by a preference of a decision-maker.
contains (1) | Each DM situation contains a problem and an alternative set and can contain a criteria set.
contains (2) | An alternative set contains at least two alternatives.
contains (3) | A criteria set contains one or more criteria.
defines | A decision-maker defines a problem for each DM situation. He (she) can define several problems for different DM situations.
determines | A state can determine one or more consequences.
evaluates | A decision-maker can evaluate one or more alternatives.
has (1) | An alternative can have one or more consequences; each consequence is related to an alternative.
has (2) | An alternative can have several values; each alternative value is related to one or two alternatives.
has (3) | A decision-maker can have several goals. The same goal may be shared by several decision-makers.
has (4) | A decision-maker can have several preferences. The same preference may be shared by several decision-makers.
is (1) | A stakeholder can be a criterion in one or more DM situations.
is (2) | A goal can be a criterion in a given DM situation.
is (3) | A consequence can be a criterion in a given DM situation.
is_associated_to (1) | None or several alternatives are associated to a DM object.
is_associated_to (2) | None or several alternative values can be associated to a criterion.
is_based_on | A method-based decision is based on one or more alternative values.
is_described_by | Each DM situation is described by none or several DM characteristics.
is_defined_for | A preference rule, a threshold, or a weight can be defined for a criterion.
is_related_to (1) | A threshold can be related to one or two alternative values.
leads_to | Each DM situation leads to a DM object. A DM object can be related to several DM situations.
makes | A decision-maker can make intuitive decisions. An intuitive decision is made by a decision-maker.
responds_to | A decision responds to a DM situation. A DM situation can be related to none or several decisions.
validates | A decision-maker can validate decisions; a decision can be validated by none or several decision-makers.
Reasoning with Optional and Preferred Requirements

Neil A. Ernst1, John Mylopoulos1, Alex Borgida2, and Ivan J. Jureta3

1 Department of Computer Science, University of Toronto, {nernst,jm}@cs.toronto.edu
2 Department of Computer Science, Rutgers University, [email protected]
3 FNRS & Information Management, University of Namur, [email protected]
Abstract. Of particular concern in requirements engineering is the selection of requirements to implement in the next release of a system. To that end, there has been recent work on multi-objective optimization and user-driven prioritization to support the analysis of requirements tradeoffs. Such work has focused on simple, linear models of requirements; in this paper, we work with large models of interacting requirements. We present techniques for selecting sets of solutions to a requirements problem consisting of mandatory and optional goals, with preferences among them. To find solutions, we use a modified version of the framework from Sebastiani et al. [1] to label our requirements goal models. For our framework to apply to a problem, no numeric valuations are necessary, as the language is qualitative. We conclude by introducing a local search technique for navigating the exponential solution space. The algorithm is scalable and approximates the results of a naive but intractable algorithm.
1 Introduction
Requirements modeling languages such as SADT [2], KAOS [3], and i* [4] have been part of the very core of Requirements Engineering (RE) since the early days. Every requirements modeling language is grounded in an ontology consisting of a set of primitive concepts in terms of which requirements can be elicited, modelled, and analyzed. Traditionally, requirements were viewed as functions the system-to-be ought to support. This view is reflected in SADT and other structured analysis techniques of the 1970s and 1980s. Recently, an intentional perspective on requirements, Goal-Oriented Requirements Engineering (GORE), has gained ground. Requirements are now viewed as also containing stakeholder goals representing the intended purposes for the system-to-be. This deceptively small shift in the underlying ontology for requirements has had tremendous impact on RE research and is beginning to be felt in RE practice [5].

J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 118–131, 2010. © Springer-Verlag Berlin Heidelberg 2010
The objective of this paper is to propose an extension to the goal-oriented modelling and analysis framework in order to scalably accommodate priorities in the form of optional goals (“nice-to-have” requirements) and preferences (“requirement A is preferred over requirement B”). The extension is much needed to align RE theory with RE practice, where optionality and prioritization have been routinely used to manage requirements [6]. This paper makes the following contributions:

– An extension of the qualitative goal modeling framework of Sebastiani et al. [1], supporting stakeholder preferences and optional requirements.
– Macro operators for managing the scale of the possible problem and solution spaces.
– Algorithms to generate and compare all possible solutions.
– A local search algorithm to efficiently find solutions in a complex requirements model.
– A case study showing these concepts working on a problem at reasonable scale.

In this paper we make reference to the requirements problem. Our objective is to (efficiently) find solutions to a given requirements problem. According to [7], a solution to the requirements problem is a combination of tasks and domain assumptions. This combination must ensure that the execution of the tasks under the domain assumptions satisfies all mandatory goals, and zero or more optional goals. This paper looks at how this might be achieved.

We start by introducing the requirements modeling language that serves to define a requirements problem for a system of interest. The language is built on top of a formal logic. We begin by sorting its propositions and well-formed formulas (wffs), and then adding extra-logical constructs used to indicate two things: (i) which wffs are preferred over others; (ii) for which wffs is satisfaction optional to the satisfaction of mandatory wffs. Preference and optionality are extra-logical in the sense that they serve in algorithms but are not defined within the logic.
We then study the implications of adding preference and optionality when searching for solutions to the requirements problems. We define formally what we mean by ‘optional’ goals below.

1.1 Case Study: Enterprise Strategy for Portals
Figures 1 and 2 present a subset of the case study we use throughout the paper. We show how the model and the reasoning techniques can be used to model a real life setting concerning enterprise web portal development (ESP). We base our scenario on the documentation that the provincial Ministry of Government Services of Ontario (Canada) (MGS) provides, specifically on the description of the content management requirements. This is a series of reports from a 2003 project to re-work the provincial government’s enterprise-wide citizen access portals, as part of an e-Government initiative. The content management requirements detail the specific technical aspects such a portal would have to meet.¹ Our extended model contains 231 goals and several hundred relationships. This is the same size as real-world problems modeled with KAOS [8, p. 248].

¹ Available at http://www.mgs.gov.on.ca/en/IAndIT/158445.html

[Figure: goal graph rooted at "Support government-wide portal technology", with AND/OR decompositions and positive/negative contribution links. Nodes include User Profile Information, Authentication, X.509, Support profile access, Component-Based Development, Web Accessibility, Multibrowser Support, Support Content Authoring, Thesaurus Support, Centralized templates, Security, Usability, Portability, Maintainability and Accessibility.]

Fig. 1. A partial QGM model of the case study (1.1)
2 pQGM: Extending Qualitative Goal Models with Preferences and Optionality
Goal modeling is a widely accepted methodology for modeling requirements (cf. [5]). We leverage the qualitative goal modeling framework, which we call ‘QGM’, as defined in Sebastiani et al. [1]. That paper defines an axiomatization of goal models that transforms the goal labeling problem into a boolean satisfiability (SAT) problem (leveraging the power of off-the-shelf SAT solvers). The goal labeling problem, as stated in that paper, is “to know if there is a label assignment for leaf nodes of a goal graph that satisfies/denies all root goals [1, p. 21]”. This approach can be used, among others, to find what top-level goals can be achieved by some set of leaf goals/tasks (“forward”), or conversely, given some top-level goals which are mandated to be True, what combination of tasks will achieve them (“backward”). The framework is however fully general, so that one can give an initial specification that indicates any set of goals as being mandated or denied, and solutions are consistent labelings of all the nodes in the model. Input goals serve as parameters for a given evaluation scenario. Thus a solution to the requirements problem amounts to answering the question: given the input goals, can we satisfy the mandatory goals?
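To make the "forward" and "backward" readings concrete, the following sketch labels a tiny AND/OR goal tree by brute force. It is only an illustration under simplifying assumptions, not the SAT encoding of [1]: each goal carries a single boolean label rather than the four evidence predicates, contribution links are ignored, and the goal names (including the hypothetical `Password` task) are chosen for the example.

```python
from itertools import product

# Hypothetical toy model: goal -> (decomposition kind, children).
# Goals absent from this map are leaves (tasks).
GOALS = {
    "Portal": ("and", ["Authentication", "Multibrowser"]),
    "Authentication": ("or", ["X.509", "Password"]),
}

def leaves(goal):
    """Collect the leaf names below `goal` in left-to-right order."""
    if goal not in GOALS:
        return [goal]
    _, children = GOALS[goal]
    return [leaf for child in children for leaf in leaves(child)]

def forward(goal, labels):
    """Forward reasoning: propagate boolean leaf labels up to `goal`."""
    if goal not in GOALS:
        return labels[goal]
    kind, children = GOALS[goal]
    values = [forward(child, labels) for child in children]
    return all(values) if kind == "and" else any(values)

def backward(root):
    """Backward reasoning by enumeration: every leaf labeling that satisfies `root`."""
    names = leaves(root)
    solutions = []
    for bits in product([False, True], repeat=len(names)):
        labels = dict(zip(names, bits))
        if forward(root, labels):
            solutions.append({name for name, on in labels.items() if on})
    return solutions

solutions = backward("Portal")
```

Enumerating all leaf labelings is exponential in the number of leaves, which is the reason the framework delegates this search to a SAT solver.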
We briefly recapitulate the definitions introduced in [9,1]. We work with sets of goal nodes $G_i$, and relations $R_i \subseteq \wp(G) \times G$. Relations have sorts Decomposition: {and, or}, which are (n+1)-ary relations; and Contributions: {+s, ++s, +d, ++d, -s, --s, -d, --d, ++, +, --, -}, which are binary. A goal graph is a pair $\langle G, R \rangle$ where $G$ is a set of goal nodes and $R$ is a set of goal relations. Truth predicates are introduced, representing the degree of evidence for the satisfaction of a goal: $PS(G)$, $FS(G)$, $PD(G)$, $FD(G)$, denoting (F)ull or (P)artial evidence of (S)atisfaction or (D)enial for $G$. A total order on satisfaction (resp. denial) is introduced as $FS(G) \geq PS(G) \geq \top$ (no evidence). This permits an axiomatization of this graphical model into propositional logic, so that the statement “A contributes some positive evidence to the satisfaction of B”, represented $A \xrightarrow{+s} B$, becomes the logical axiom $PS(a) \rightarrow PS(b)$. The complete axiomatization from which this section is derived is available in [1].

Extra-logical extensions. To manage preferences and optional elements, we add extra-logical elements to QGM. These elements do not affect the evaluation of admissible solutions (i.e., whether there is a satisfying assignment), thus our use of the term “extra-logical”. We call our extension prioritized Qualitative Goal Models (pQGM). Models in this language form requirements nets (r-nets). pQGM r-nets allow us to capture stakeholder attitudes on elements in the model. Attitudes (i.e., emotions, feelings, and moods) are captured via optionality and preference relationships.

Optionality is an attribute of any concept indicating its optional or mandatory status. Being mandatory means that the element $e$ in question must be fully satisfied in any solution (in the QGM formalization, $FS(e)$ must be true) and not partially denied ($PD(e)$ must be false). (In general, element $e$ will be said to be “satisfied” by a solution/labeling iff $FS(e) \wedge \neg PD(e)$ is true.)
Being optional means that, although a solution does not have to satisfy this goal in order to be acceptable, solutions that do satisfy it are more desirable than those which do not, all else being equal. We consider non-functional requirements (NFRs) like Usability to be ideal candidates for being considered optional goals: if not achieved, the system still functions, but achievement is desirable if possible.

Alternatives arise when there is more than one possible solution of the problem (i.e., more than one possible satisfying assignment). In goal models with decomposition, this typically occurs when there is an OR-decomposition. We treat each branch of the OR-decomposition as a separate alternative solution to the requirements problem. Note that even for very small numbers of OR-decompositions one generates combinatorially many alternatives.

Preferences are binary relationships defined between individual elements in an r-net. A preference relationship compares concepts in terms of desirability. A preference from a goal to another goal in a pQGM r-net indicates that the former is strictly preferred to the latter. Preferences are used to select between alternatives. We illustrate the role of these language elements in the sections which follow.
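The definitions of this section can be sketched in a few lines, assuming each goal carries four boolean evidence flags. The class and function names below are illustrative and not part of QGM or pQGM; only the "satisfied" test (FS and not PD), the ordering of full over partial evidence, and the +s axiom PS(a) → PS(b) are taken from the text.

```python
from dataclasses import dataclass

@dataclass
class Label:
    """Evidence flags for one goal, named after the PS/FS/PD/FD predicates."""
    ps: bool = False  # partial evidence of satisfaction
    fs: bool = False  # full evidence of satisfaction
    pd: bool = False  # partial evidence of denial
    fd: bool = False  # full evidence of denial

    def normalise(self):
        # Full evidence implies partial evidence (FS >= PS, FD >= PD).
        self.ps = self.ps or self.fs
        self.pd = self.pd or self.fd
        return self

def satisfied(label):
    """An element counts as satisfied iff FS holds and PD does not."""
    return label.fs and not label.pd

def contribute_plus_s(source, target):
    """Apply the '+s' axiom PS(source) -> PS(target), updating target in place."""
    if source.normalise().ps:
        target.ps = True
    return target

def is_acceptable(labels, mandatory):
    """A labeling is acceptable when every mandatory element is satisfied;
    optional elements may or may not be satisfied."""
    return all(satisfied(labels[name]) for name in mandatory)

# Usage: a mandatory goal that is fully satisfied, and an optional goal
# (Usability) that is partially denied. The labeling is still acceptable.
labels = {"Authentication": Label(fs=True).normalise(), "Usability": Label(pd=True)}
acceptable = is_acceptable(labels, mandatory={"Authentication"})
```

Keeping optionality outside the label itself mirrors the "extra-logical" design: the same labeling is checked against whichever set of elements is currently treated as mandatory.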
3 Finding Requirements Solutions
The purpose of creating a pQGM model is to use it to find solutions to the requirements problem it defines. We now define a procedure to identify these solutions. Briefly, we identify admissible models (‘possible solutions’); satisfy as many optional requirements as we can; identify alternatives; then filter the set of alternatives using user-expressed preferences. We give both a naive and a local search algorithm for this process.

[Figure: the goal graph of Fig. 1 annotated as a pQGM r-net, with optional elements, mandatory ('M') markers and preference ('p') arcs.]

Fig. 2. An example pQGM r-net. Options have dashed outlines and mandatory elements are labeled ‘M’. Preferences are represented with ‘p’ arcs on double-headed arrows. $\bar{R}$ consists of the elements outlined in solid borders and $\bar{S}$ the stippled elements (a candidate solution).
What are the questions that pQGM can answer? We want to know whether, for the mandatory goals we defined – typically the top-level goals as these are most general – there is some set of input nodes which can satisfy them. Furthermore, we want these sets of inputs to be strictly preferred to other possible sets, and include or result in as many options as possible. We show some answers to these questions in Section 4, below. With respect to the pQGM model shown in Fig. 2, a dominant solution (stippled nodes) consists of the goals Support government-wide portal technology, Support Content Authoring, Web Accessibility, Conform to W3C Accessibility Guidelines, Support platform requirements, Authentication, X.509, User Profile Information, Support profile access, UTF-8, Accessibility, Security, Portability. Goals such as Minimally support IE5 are alternatives that are dominated, and therefore not included (in this case, because they break the Usability and Security options). Deriving this is the focus of the remainder of the paper.
3.1 Identifying Admissible Models

Our first step is to find satisfying label assignments to the model elements. We rely on the solution to the backwards propagation problem of [1]. We take an attitude-free r-net $\bar{R}$ (attitude-free meaning one without optional elements, edges leading to/from them, and without considering preferences), and label the model. The labeling procedure encodes the user’s desired output values ($\Phi_{outval}$), the model configuration ($\Phi_{graph}$) and the backwards search axiomatization $\Phi_{backward}$ (cf. [1, p. 8]).² The output of this is a satisfying truth assignment $\mu$, or None, if there is no assignment that makes the mandatory nodes $FS \wedge \neg PD$.

If there was a satisfying assignment, we say our attitude-free model $\bar{R}$ is admissible, as our mandatory nodes were satisfied, and we call the resulting labeled goal model $\bar{R}_m$. At this point, it suffices to identify that there is admissibility. Subsequent stages handle alternatives.

² In the case of nodes $e$ representing “hard” goals and especially tasks, one might require only truth assignments that satisfy axioms forcing binary “holds”/“does not hold” of $e$, through axioms $FS(e) \rightarrow \neg FD(e)$ and $FS(e) \vee FN(e)$, forming additional sets $\Phi_{conflict}$ and $\Phi_{constrain}$ in [1].

3.2 Satisfying Optional Requirements

We now turn to consideration of the optional goals of our initial r-net $R$. In our diagrams, a node is marked optional by having dashed outlines. (This could of course be formalized by using a meta-predicate $O$.) Although a solution does not have to satisfy such a goal in order to be admissible, solutions that do satisfy it are more desirable than ones that do not, all else being equal. For example, in Fig. 2, we would prefer to satisfy all the dashed nodes (Conform to W3C accessibility guidelines, etc.). However, the use of optional goals allows us to accept solutions which only satisfy some of these nodes.

Consider the example in Fig. 2. In this example we have an NFR Usability and incoming concepts which can satisfy the goal. By specification, our algorithm should find the maximal sets of optional elements that can be added to the mandatory ones while preserving admissibility. It is important to note that in our framework optional elements can interact. This means that adding an optional goal to $R_m$ could render (a) the new model inadmissible, or (b) previously admissible optional goals inadmissible. This reflects the notion that introducing an optional goal has consequences.

Option identification. We describe our naïve implementation in Algorithm 1. We are given an admissible r-net, $R_m$, e.g., a set of mandatory goals, along with a map, $T$, from elements of $R_m$ to the set of label assignments for $R_m$, e.g., $\{PD, PS, FD, FS\}$. We are given a set, $O$, the set of optional elements, with $O \cup R_m = R$. In our example, a subset of $O$ is equal to the set of goals culminating in Usability. For each subset input of $O$ (i.e., element of $\wp(O)$), with the exception of the empty set, add it to $\bar{R}_m$, the admissible solution. We then use the SAT solver to find a satisfying assignment, checking for admissibility. If we find an admissible solution (an option set $O' \subseteq O$ such that $O' \cup R_m$ is admissible), we can add this to our set of acceptable solutions, $S_a$. Now, because of the order in which we traverse subsets, this solution is a maximal set, and the proper subsets of $O'$ contain fewer optional goals. We therefore remove these sets from the collection being traversed.

Algorithm 1. NaiveSelect

    Input: A solution R_m and a set of optional goals, O
    Output: All maximal admissible option sets that can be added to R_m
    O_m ← ∅
    foreach input ∈ ℘(O) do        // in a traversal of sets in non-increasing size
        value ← admissible(input ∪ R_m)
        if value == T then
            O_m ← O_m ∪ {input}
            Remove subsets of input
        end
    end
    return O_m

The running time of this naive approach is clearly impractical since, in the worst case, we must check all $2^n$ subsets of $O$ (the elements of $\wp(O)$), where $n$ is the number of optional elements. (The worst case could arise if all optional goals invalidate all other optional goals.) This is on top of the complexity of each SAT test! Furthermore, the result set might be quite large. We therefore show some mechanisms to improve this.

Pruning the set of optional goals. We introduce two ways to reduce the number of option sets the naive algorithm finds, in order to reduce problem size. One approach prunes the initial sets of optional goals, and another reduces the admissible option set presented to the user. The intuition is to support a form of restriction on database queries; some requirements might be useful in some scenarios but not others, yet we would like to retain them in the original model. For example, we might implement the ESP model in an environment which does not need authentication.

Our first approach defines simple operations over optional goals: operators on $\wp(O)$ that act as constraints. Each element (a set) in $\wp(O)$ has two variables, cardinality and membership. We allow constraints on option set cardinality (boolean inequalities), and option set membership (inclusion or exclusion). An equivalent expression in SQL might be DELETE FROM options WHERE Size $n and DELETE FROM options WHERE $x IN options.

Our second approach is to make use of pQGM’s preference relations to remove option sets which are dominated by other sets.
In our example, with $P = \{$(Authentication, Component-based development), (Security, Usability)$\}$, if we found admissible solutions $S_1$ containing {Security, Authentication}, and $S_2$ containing {Component-based development}, we would discard $S_2$ in the case where both are admissible. This makes use of the Dominate function we define in Section 3.4.
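The naive selection and the cardinality constraint can be sketched as follows. This is an illustrative Python rendering under stated assumptions, not the authors' implementation: `admissible` stands in for the SAT-solver call and is supplied by the caller, `min_size` and `max_size` are hypothetical parameter names for the cardinality constraint, and the toy conflict between two options is invented for the example.

```python
from itertools import combinations

def naive_select(mandatory, options, admissible, min_size=1, max_size=None):
    """Return all maximal admissible option sets (as frozensets).

    Candidate sets are visited from largest to smallest, so a set that is a
    subset of an already accepted set is skipped rather than re-checked.
    """
    options = sorted(options)
    if max_size is None:
        max_size = len(options)
    maximal = []
    for size in range(min(max_size, len(options)), min_size - 1, -1):
        for combo in combinations(options, size):
            candidate = frozenset(combo)
            if any(candidate <= found for found in maximal):
                continue  # already covered by a larger admissible set
            if admissible(frozenset(mandatory) | candidate):
                maximal.append(candidate)
    return maximal

# Toy admissibility check standing in for the SAT call (an assumption):
# two options conflict, so they can never be selected together.
def toy_admissible(elements):
    return not {"Usability", "IE5 support"} <= elements

result = naive_select(
    mandatory={"Authentication"},
    options={"Usability", "Security", "IE5 support"},
    admissible=toy_admissible,
)
```

With the default bounds every non-empty subset is a candidate, which is the exponential worst case described above; tightening `min_size` and `max_size` shrinks the candidate space before any admissibility check is made.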
Section 4 shows the success of these techniques in reducing model evaluation times.

Local search. The naive approach must search through each member of the set of possible options, an exponential worst-case running time. This is clearly infeasible. A more tractable approach is to define a local search algorithm to find, in a bounded amount of time, a locally optimal solution. We implemented Tabu search [10] (Algorithms 2 and 3). The algorithm iteratively searches for improvements to the cardinality of an admissible option-set. We start with a randomly chosen option, and add (random) remaining options singly, checking for admissibility. If we reach a point where no admissible options are found, we preserve the best result and randomly restart. A tabu list prevents the searcher from re-tracing steps to the same point it just found. A tabu tenure details for how many moves that point will be considered tabu. One commonly lists the last few moves as tabu. An iteration limit ensures the algorithm terminates. IsAdmissible represents a call to the SAT solver with the given optionset merged with the existing admissible solution, $R$. Although it is not necessarily the case that there are single options which are admissible, in practice this is common. If this isn’t the case, our algorithm performs a random search, beginning with 1-sets of options.

Algorithm 2. TabuSearch

    Input: solution R_m, set of options O, time limit tlim, tabu expiration expire
    Output: A locally optimal set of optionsets O_t
    while time < tlim and candidate ∉ tabu list do
        O_t = O_t + TabuMove(∅, O, O_t, tabu, time)
        if time % expire ≤ 1 then
            tabu ← ∅
        end
    end
    foreach o′, o′′ ∈ O_t do
        if o′ ⊂ o′′ then
            O_t ← O_t − o′
        end
    end

Algorithm 3. TabuMove

    Input: candidate, remainder, solution, tabu, time
    Output: solution
    step = 0
    while step < radius do
        tmp = candidate
        selected = Random.Choice(remainder)
        tmp = tmp + selected
        step = step + 1
        if selected ∉ tabu list and tmp ∉ solution then
            break
        end
    end
    if length(candidate) == initial then
        solution = solution + candidate
        return solution        // base case
    end
    remainder = remainder − selected
    candidate = candidate + selected
    if time > tlim then
        return solution
    end
    time = time + 1
    if IsAdmissible(candidate) then
        solution = solution + TabuMove(candidate, remainder)
    else
        candidate = candidate − selected
        tabu list = tabu list + selected
        remainder = remainder + selected
        solution = solution + TabuMove(candidate, remainder)
    end
    return solution

3.3 Identify Solution Alternatives

For comparison purposes, our example model with 231 non-optional nodes and 236 relations is translated into a SAT expression consisting of 1762 CNF clauses. This is well within the limits of current SAT solvers. The input into our penultimate phase consists of the admissible, labeled r-net $R_m$, along with a set, possibly empty, of option sets that can be joined to that r-net, $O$. The number of current solutions, then, is $|O| + 1$. The goal of this phase is to identify alternatives that are created at disjunctions in the model. We do this by converting each admissible solution $s \in S$ (each $s$ being $R_m \cup O'$ for an option set $O' \in O$) to a boolean formula (a traversal of the AND-OR graph), and then converting this formula to conjunctive normal form. This is the accepted format for satisfiability checking (SAT). We pass this representation of the r-net as the SAT formula $\Phi$ to a SAT solver, store, then negate the resulting satisfiability model $\mu$ and add it as a conjunct to $\Phi$. We repeat this process until the result is no longer satisfiable (enumerating all satisfying assignments). This produces a set of possible alternative solutions, e.g. $\mu_i, \mu_{i+1}, \ldots, \mu_n$. We convert this to a set of sets of concepts that are solutions, $S_a$.

3.4 Solution Selection
Our final step is to prune the sets of solution r-nets, $S_a$, using stakeholder valuations over individual elements, expressed as preferences and (possibly) costs. We do not prune before finding optional goals since we might discard a dominated set in favour of one that is ultimately inadmissible. Similarly, we do not risk the possibility of discarding an optionset that is nonetheless strictly preferred to another set, since by definition, if set $O'$ is admissible and yet not selected, there was a set $O \supset O'$ that contains the same elements (and therefore preferences) and was admissible.

Selection using preferences. We will use in the text the notation $p \succ q$ if the r-net indicates that node $p$ is preferred to node $q$, and let $\geq$ be the transitive reflexive closure of $\succ$. We use this to define the function Dominate(M, N), mapping $S \times S$ to booleans as follows:

$$\mathit{Dominate}(M, N) = \mathit{True} \iff \forall n \in N .\ \exists m \in M : m \geq n$$

Intuitively, every element of a dominated set is equal to a value in the dominant one, or is (transitively) less preferred. Note that the Dominates relation is a partial order, and we can therefore choose maximal elements according to it.

Selection using cost. Although not shown in the case study, our framework provides for solution selection using a simple cost operator, whenever such numbers are available. Clearly there are many cost dimensions to consider. For a given cost function, we define a min_cost(value, increment, set) function which ranks the proposed solutions using the cost of the solution as a total ordering. The meaning of value is as an upper cost threshold (‘return all solutions below value’), and increment as a relaxation step in the case where no solutions fall under the threshold. Note that we are ranking admissible solutions, and not just requirements.
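The dominance test can be sketched directly from its definition. The code below is an illustration under stated assumptions: preferences are given as (preferred, less preferred) pairs, the reflexive transitive closure is computed by a simple fixed-point loop, and the helper names `closure` and `non_dominated` are ours rather than the paper's. The data reuses the preference example from Section 3.2.

```python
def closure(preferences, elements):
    """Reflexive transitive closure of the strict preference pairs."""
    geq = {(e, e) for e in elements} | set(preferences)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(geq):
            for (c, d) in list(geq):
                if b == c and (a, d) not in geq:
                    geq.add((a, d))
                    changed = True
    return geq

def dominate(m, n, geq):
    """True iff every element of n is matched or beaten by some element of m."""
    return all(any((x, y) in geq for x in m) for y in n)

def non_dominated(solutions, geq):
    """Keep the solutions that no other solution strictly dominates."""
    keep = []
    for s in solutions:
        beaten = any(
            dominate(t, s, geq) and not dominate(s, t, geq)
            for t in solutions
            if t is not s
        )
        if not beaten:
            keep.append(s)
    return keep

P = {("Authentication", "Component-based development"), ("Security", "Usability")}
S1 = {"Security", "Authentication"}
S2 = {"Component-based development"}
geq = closure(P, S1 | S2 | {"Usability"})
survivors = non_dominated([S1, S2], geq)
```

Here `S1` dominates `S2` because Authentication is preferred to Component-based development, while `S2` has nothing that matches Security, so only `S1` survives the filter.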
4 Evaluation
We now present experimental results of our technique on our large goal model presented in Section 1.1. Our ESP model contains 231 non-optional nodes and 236 relations. We demonstrate the utility and scalability of our technique by presenting evaluation results for this model in various configurations of options, using different pre-processing steps to reduce the problem scale (Table 1). The first row of results reflects the raw time it takes to evaluate this particular model with no options (using the SAT solver). The third column shows that adding the set cardinality heuristic (a call to Naive with a maximal set size of 8 and minimal size of 5) results in some improvement in running times. For example, with 15 options, the running time is approximately 30% shorter (while returning the same options, in this case). Similarly, while we note the exponential increase in evaluation time for our naive algorithm, TabuSearch clearly follows a linear trend. However, the tradeoff is that TabuSearch misses some of the solutions, although in our tests, assuming an average-case model configuration, it found half of the maximal sets of options.

Table 1. Comparing naïve vs. heuristic option selection. The fifth column represents the number of calls to the SAT solver; the last column indicates how many solutions the local search found: max, the number of maximal sets, sub the remaining non-optimal sets. We used a time step limit of 400.

Options  Naive    Naive(8,5)  TabuSearch  # calls  solns
0        0.062 s  –           –           –        –
4        0.99     –           0.08s       2800     all
6        3.97                 0.11        2800     1 max, 1 sub
9        31.8     23.6s       0.14        2942     1 max, 2 sub
12       4m23s    3m6s        0.16        3050     1 max, 2 sub
15       33m      21m         0.18        3165     1 max, 2 sub
20       –        –           0.19        3430     1 max, 2 sub

Quality of solutions. While the naive approach returns all solutions (the Pareto-front of non-dominated solutions), the Tabu search heuristic will only return an approximation of this frontier. TabuSearch returned, in the case with 15 options, one maximal set (of two), and 2 subsets. This in turn affected the number of alternatives that were found. Using the naive approach with 15 options, we identified 7 dominant solutions and 8 other alternatives (unrelated via preferences). Using TabuSearch, with the smaller option sets, we only return 4 dominant solutions and 6 other alternatives. These are individual solutions; we permit combinations, so the total number is much higher (the powerset). However, we feel the greatly reduced running time will allow for more model exploration than the naive approach.
5 Related Work
We focus our comparison on the process of deriving a high-level solution from an initial model. Tropos [11] uses the syntax of i* to generate early requirements models that are then the source for eventual derivation of a multi-agent system design. The ‘early requirements’ phase generates goal models using the notions of decomposition, means-ends and contribution links. These three notions bear no semantic distinction in the eventual reasoning procedure: they all propagate truth values. Partial satisfaction (denial) of stakeholder goals is modeled with the qualitative ‘positive contribution’ (denial) links, allowing goals to be partially satisfied (denied). While this is used to model preference, it is not clear what to make of a partially satisfied goal (with respect to preference). Although formalized (in [1] and others), the reasoning still relies on stakeholder intervention to evaluate how ‘well’ a solution satisfies the qualities (e.g., is it preferable that security is partially satisfied while usability is partially denied?). From [11, p. 226]: “These [non-deterministic decision points] are the points where the designers of the software system will use their creative [sic] in designing the system-to-be.” KAOS [3] uses a graphical syntax backed by temporal logic. KAOS has a strong methodological bent, focusing on the transition from acquisition to specification. It is goal-oriented, and describes top-down decomposition into operationalized requirements which are assigned to actors in the system-to-be.
Alternative designs can be evaluated using OR decompositions, as in Tropos. In [12], a quantitative, probabilistic framework is introduced to assign partial satisfaction levels to goals in KAOS, which allows tradeoffs to be analyzed, provided the quantification is accurate. The notion of gauge variables is hinted at in [13], which seem to be measures attached to goal achievement. Obstacle analysis [14] provides a mechanism to identify risks to normal system design.

There are several techniques that can be used to generate pairwise preferences over lists of requirements, surveyed in [15,16]. These include iterative pairwise comparisons, economic valuations, and planning games. Here, all requirements are assumed to be achievable and there are no interactions between them. This is a very simple model of requirements. For example, what if one requirement is blocked if another is implemented? Or if there is a trade-off analysis required?

The QGM implementation in [1] describes a variant using a minweight SAT solver; weights are assigned using qualitative labels. It searches for a solution that minimizes the number/cost of input goals (in essence, the tasks a solution must accomplish). Our approach doesn’t consider weights, relying on the user’s judgement to evaluate the optimal solution they prefer. As mentioned, it is simple to add cost as a factor in our solution finding, but we don’t assume that this is available. Finally, researchers have pointed out that assuming stakeholders understand the cost of a given requirement is dangerous [12]. Arguably it is equally dangerous to assume prioritization is possible; our approach assumes that such preferences will have to be iteratively provided, as stakeholders are presented with various solutions (or no solutions).

Search-based SE focuses on combinatorial optimization, using evolutionary and local search algorithms to efficiently find solutions, e.g., [17,18]. DDP [19] supports quantitative reasoning and design selection over non-hierarchical requirements models, and the latest iteration uses search-based heuristics to find system designs [20]. The principal difference is in the nature of the underlying model. Our framework uses goal models to provide for latent interactions between requirements that are satisfied.

Workshop techniques (e.g., [21]) are an alternative to formalized requirements models. Here, requirements problems and solutions can be explored and evaluated without the use of formal reasoning, which can be difficult to construct and display to end users. The motivation for formalization is that it is more amenable to rapid prototyping and semi-autonomous execution, particularly in the context of evolving systems. Workshops are expensive and difficult to organize. A good formal methodology will ensure end users don’t use the model, but merely get the results, in this case the set of optimal solutions.
6 Conclusions and Future Work
With pQGM, we have introduced an extension to a well-known formal goal reasoning procedure. Our extension allowed us to compare solutions to the requirements problem using preferences and optional goals. We described some
techniques for generating solution alternatives and comparing them, and showed that our techniques can scale with the addition of simple constraints. We view the modeling of requirements problems in pQGM as iterative: identify a core set of mandatory elements and verify admissibility; identify optional elements; generate solutions; refine the model to narrow the solution space (using preferences or costs). We hope to extend this work to accommodate incremental revisions of the set of solutions as new information becomes available in the model. Finally, we would like to highlight the enduring problem of propagation of conflicting values. Although separate in QGM, many goals are only partially satisfied and often also partially denied. We have been working on a paraconsistent requirements framework that addresses this issue. Most modeling languages are focused on finding single solutions to the problem. They use priorities and evaluation to find that solution. We have defined some ways to select from many solutions. The model assessment procedure defined a) ways to narrow the search space and b) ways to select between solutions once found. In other modeling languages, such as KAOS [3] or Tropos [11], the aim of the methodology is to generate a single solution which best solves the problem. Why do we want to model multiple solutions? First, we think it is unrealistic to expect to find a single solution. Many practitioners, particularly from the lean and agile paradigm, prefer to wait until the ‘last responsible moment’ before making decisions [22]. Secondly, in a changing system, we should not presume to know what single solution will always apply. Finally, this method allows the user – with appropriate tool and language support – to define the conditions under which a solution is acceptable.
References
1. Sebastiani, R., Giorgini, P., Mylopoulos, J.: Simple and Minimum-Cost Satisfiability for Goal Models. In: Persson, A., Stirna, J. (eds.) CAiSE 2004. LNCS, vol. 3084, pp. 20–35. Springer, Heidelberg (2004)
2. Ross, D.: Structured Analysis (SA): A Language for Communicating Ideas. Trans. Soft. Eng. 3(1), 16–34 (1977)
3. Dardenne, A., van Lamsweerde, A., Fickas, S.: Goal-directed requirements acquisition. Science of Computer Programming 20(1-2), 3–50 (1993)
4. Yu, E.S.: Towards modelling and reasoning support for early-phase requirements engineering. In: Intl. Conf. Requirements Engineering, Annapolis, Maryland, pp. 226–235 (1997)
5. van Lamsweerde, A.: Goal-Oriented Requirements Engineering: A Guided Tour. In: Intl. Conf. Requirements Engineering, Toronto, pp. 249–263 (2001)
6. Liaskos, S., Mcilraith, S.A., Mylopoulos, J.: Goal-based Preference Specification for Requirements Engineering. In: Intl. Conf. Requirements Engineering, Sydney (September 2010)
7. Jureta, I.J., Mylopoulos, J., Faulkner, S.: Revisiting the Core Ontology and Problem in Requirements Engineering. In: Intl. Conf. Requirements Engineering, Barcelona, pp. 71–80 (September 2008)
8. van Lamsweerde, A.: Requirements engineering: from craft to discipline. In: Intl. Conf. Foundations of Software Engineering, Atlanta, Georgia, pp. 238–249 (November 2008)
9. Giorgini, P., Mylopoulos, J., Nicchiarelli, E., Sebastiani, R.: Formal Reasoning Techniques for Goal Models. Journal on Data Semantics 2800, 1–20 (2003)
10. Glover, F.: Future paths for integer programming and links to artificial intelligence. Computers and Operations Research 13(5) (1986)
11. Bresciani, P., Giorgini, P., Giunchiglia, F., Mylopoulos, J., Perini, A.: TROPOS: An Agent-Oriented Software Development Methodology. Autonomous Agents and Multi-Agent Systems 8, 203–236 (2004)
12. Letier, E., van Lamsweerde, A.: Reasoning about partial goal satisfaction for requirements and design engineering. In: Intl. Conf. Foundations of Software Engineering, Newport Beach, CA, pp. 53–62 (2004)
13. van Lamsweerde, A.: Reasoning About Alternative Requirements Options. In: Borgida, A., Chaudhri, V.K., Giorgini, P., Yu, E.S.K. (eds.) Conceptual Modeling: Foundations and Applications. LNCS, vol. 5600, pp. 380–397. Springer, Heidelberg (2009)
14. van Lamsweerde, A., Letier, E.: Handling obstacles in goal-oriented requirements engineering. Trans. Soft. Eng. 26, 978–1005 (2000)
15. Karlsson, L., Höst, M., Regnell, B.: Evaluating the practical use of different measurement scales in requirements prioritisation. In: Intl. Conf. Empirical Software Engineering, Rio de Janeiro, Brasil, pp. 326–335 (2006)
16. Herrmann, A., Daneva, M.: Requirements Prioritization Based on Benefit and Cost Prediction: An Agenda for Future Research. In: Intl. Conf. Requirements Engineering, Barcelona, pp. 125–134 (September 2008)
17. Battiti, R., Brunato, M., Mascia, F.: Reactive Search and Intelligent Optimization. Operations research/Computer Science Interfaces, vol. 45. Springer, Heidelberg (2008)
18. Finkelstein, A., Harman, M., Mansouri, S., Ren, J., Zhang, Y.: A search based approach to fairness analysis in requirement assignments to aid negotiation, mediation and decision making. Requirements Engineering J. 14(4), 231–245 (2009)
19. Feather, M.S., Cornford, S.: Quantitative risk-based requirements reasoning. Requirements Engineering J. 8, 248–265 (2003)
20. Jalali, O., Menzies, T., Feather, M.S.: Optimizing requirements decisions with KEYS. In: International Workshop on Predictor Models in Software Engineering, Leipzig, Germany, pp. 79–86 (2008)
21. Maiden, N., Robertson, S.: Integrating Creativity into Requirements Processes: Experiences with an Air Traffic Management System. In: Intl. Conf. Requirements Engineering, Paris, France (2005)
22. Thimbleby, H.: Delaying commitment. IEEE Software 5(3), 78–86 (1988)
A Conceptual Approach to Database Applications Evolution
Anthony Cleve¹, Anne-France Brogneaux², and Jean-Luc Hainaut²
¹ ADAM team, INRIA Lille-Nord Europe, Université de Lille 1, LIFL CNRS UMR 8022, France
[email protected]
² Faculty of Computer Science, PReCISE Research Center, University of Namur, Belgium
{afb,jlh}@info.fundp.ac.be
Abstract. Data-intensive systems are subject to continuous evolution that translates ever-changing business and technical requirements. System evolution usually constitutes a highly complex, expensive and risky process. This holds, in particular, when the evolution involves database schema changes, which in turn impact on data instances and application programs. This paper presents a comprehensive approach that supports the rapid development and the graceful evolution of data-intensive applications. The approach combines the automated derivation of a relational database from a conceptual schema, and the automated generation of a data manipulation API providing programs with a conceptual view of the relational database. The derivation of the database is achieved through a systematic transformation process, keeping track of the mapping between the successive versions of the schema. The generation of the conceptual API exploits the mapping between the conceptual and logical schemas. Database schema changes are propagated as conceptual API regeneration so that application programs are protected against changes that preserve the semantics of their view on the data. The paper describes the application of the approach to the development of an e-health system, built on a highly evolutive database.
1 Introduction
Data-intensive applications generally comprise a database and a collection of application programs in strong interaction with the former. Such applications constitute critical assets in most enterprises, since they support business activities in all production and management domains. Like any software system, they usually have a very long life, during which they are subject to continuous evolution in order to meet ever-changing business and technical requirements.

The evolution of data-intensive applications is known as a highly complex, expensive and risky process. This holds, in particular, when the evolution involves database schema changes, which in turn impact on data instances and application programs. Recent studies show, in particular, that schema evolutions may have a huge impact on the database queries occurring in the programs, reaching up to 70% query loss per new schema version [1]. Evaluating the impact of database schema changes on related programs typically requires sophisticated techniques [2,3], especially in the presence of dynamically generated queries [4]. The latter also severely complicate the adaptation of the programs to the new database schema [5,6]. Without reliable methods and tools, the evolution of data-intensive applications rapidly becomes time-consuming and error-prone.

This paper addresses the problem of automated co-evolution of database schemas and programs. Building on our previous work in the field [7,8,9], we present a comprehensive approach that supports the rapid development and the graceful evolution of data-intensive applications. The proposed approach combines the automated derivation of a relational database from a conceptual schema and the automated generation of a data manipulation API that provides programmers with a conceptual view of that relational database. This conceptual API can be re-generated in order to mitigate the impact of successive schema evolutions and to facilitate their propagation to the program level.

The remainder of the paper is structured as follows. Section 2 further analyzes the problem of data-intensive application evolutions and specifies the objective of our work. Section 3 provides a detailed presentation of our approach. A set of tools supporting the proposed approach is described in Section 4. Section 5 discusses the application of our approach and tools to a real-life data-intensive system. Concluding remarks are given in Section 6.

J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 132–145, 2010. © Springer-Verlag Berlin Heidelberg 2010
2 Problem Statement
2.1 Database Engineering: The Abstraction Levels
Standard database design methodologies are built on a three-level architecture providing, at design time, a clean separation of objectives, and, at execution time, logical and physical independence (Figure 1). The conceptual design process translates users' functional requirements into a conceptual schema describing the structure of the static objects of the domain of interest as well as the information one wants to record about them. This schema is technology-independent and is to meet semantic requirements only. The logical design process translates the conceptual schema into data structure specifications compliant with a database model (the logical schema), such as the relational, object-relational or XML models. Though it is not specific to a definite DBMS, the logical schema is, loosely speaking, platform dependent. In addition, it is semantically equivalent to the conceptual schema. The result of logical design is twofold: a logical schema and the mapping between the latter and its source conceptual schema. This mapping defines how each conceptual object has been translated into logical constructs and, conversely, what are the source conceptual objects of each logical construct. Though mappings can be quite simple and deterministic for unsophisticated and undemanding databases, they can be quite complex for actual corporate databases. Hence the need for rigorously defined mapping. The physical schema specializes the logical schema for a specific DBMS. In particular, it includes indexes and storage space specifications.
Fig. 1. Standard database design processes
Application programs interact with the database through an external schema, or view, generally expressed as a subset of the logical schema or as an object schema mapped on the latter. In both cases, a mapping describes how the external view has been built from the logical schema. The popular object-relational mapping interfaces (ORM) store this mapping explicitly, often as XML data.
2.2 Dependencies between Schemas and Programs
Current programming architectures entail a more or less strong dependency of programs on database structures, and particularly on the logical schema. Changes in the logical schema must be translated into physical schema changes and data conversion [7]. Most generally, they also lead to discrepancies between the logical and external schemas, in such a way that the latter must be adapted. In many cases, that is, when the changes cannot be absorbed by logical/external mapping adaptation, the client programs must be modified as well [10]. This problem is known as a variant of the query/view synchronization problem [11]. As shown in [12], the use of ORMs does not solve the problem but may make things worse, since both logical and external schemas can evolve in an asynchronous way, each at its own pace, under the responsibility of independent teams. Severe inconsistencies between system components may then progressively emerge due to undisciplined evolution processes. In the context of this paper, we identify four representative schema modification scenarios, by focussing on the schema that is subject to the initial change (conceptual or logical) and its actual impact on client programs¹. Let CS, LS, PS and XS denote respectively the conceptual, logical, physical and external database schemas. Let M(CS, LS) and M(LS, XS) denote the conceptual/logical and logical/external mappings, and let P be a program using XS.

¹ Existing taxonomies for schema evolution [13,14] usually consider the reversibility of the schema change and its potential impact on the programs.
1. Change in CS without impact on P. The conceptual schema CS undergoes structural modifications that translate domain evolution (we ignore refactoring changes). However, the modified conceptual constructs are outside the scope of the part of external schema XS that is used by program P. Conceptually, this modification must be followed by the replay of the logical design and view design processes, which yields new versions of logical schema LS and mappings M(CS, LS) and M(LS, XS). Theoretically, this evolution should have no impact on P. However, the extent to which this property is achieved depends on two factors: (1) the quality of the logical design process and (2) the power of the logical/external mapping technology and the care with which this mapping is adapted, be it automated or manual.
2. Change in CS with impact on P. In this scenario, conceptual changes cause the modification of the part of external schema XS used by P, in such a way that no modification of mapping M(LS, XS) is able to isolate program P from these conceptual changes. Accordingly, some parts of P must be changed manually. Considering the current state of the art, the best one can expect from existing approaches is the identification of those parts.
3. Change in LS with CS unchanged. The logical schema LS is refactored to better suit non-functional requirements such as adaptation to another platform, performance, structural quality, etc. As in scenario 1, this modification requires replaying the logical design process and yields new versions of LS, M(CS, LS) and M(LS, XS). Automatic program adaptation has proved tractable through co-transformation techniques that couple schema and program transformations [15,9,6].
4. Change in PS with CS and LS unchanged. Such a modification is fully controlled by the DBMS and has no impact on program P, which, at worst, could require recompilation.
Our proposal aims to address the evolution problems posed by scenarios 1, 2 and 3 via a tool-supported programming architecture that allows programs to interface with the database through an external conceptual view instead of a logical view (Figure 2). This architecture provides logical independence in scenarios 1 and 3 by encapsulating mappings M(CS, LS) and M(LS, XS) into an API automatically generated by an extension of the DB-MAIN CASE tool. In scenario 2, the identification of the program parts that must be changed is carried out by the recompilation process driven by the strong typing of the API. Though seeking full automation would be unrealistic, the approach provides a reliable and low-cost solution that does not require high skills in system evolution.

Fig. 2. Programs interact with a logical (left) or a conceptual (right) database
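The logical independence claimed for scenario 3 can be illustrated with a small sketch. It is only an analogy written under explicit assumptions: the mapping dictionaries, table and column names, and the `generate_api` function are all hypothetical, and the code is neither the API produced by DB-MAIN nor a statically typed API, so it does not reproduce the compile-time detection used in scenario 2.

```python
import sqlite3

# Hypothetical conceptual/relational mappings: conceptual attribute -> column.
MAPPING_V1 = {"Customer": ("CUSTOMER", {"id": "CUST_ID", "name": "NAME"})}
# Same conceptual schema after a logical refactoring (scenario 3).
MAPPING_V2 = {"Customer": ("CLIENT", {"id": "CLI_CODE", "name": "CLI_NAME"})}


def generate_api(mapping):
    """Stand-in for API generation: returns a finder bound to one mapping."""
    def find(conn, entity, key):
        table, cols = mapping[entity]
        select = ", ".join(cols.values())
        sql = f"SELECT {select} FROM {table} WHERE {cols['id']} = ?"
        row = conn.execute(sql, (key,)).fetchone()
        # Results are keyed by conceptual attribute names only.
        return None if row is None else dict(zip(cols.keys(), row))
    return find


def client_program(find, conn):
    """Application code: uses conceptual names, never tables or columns."""
    return find(conn, "Customer", 1)["name"]


def make_db(table, id_col, name_col):
    """Create an in-memory database holding one customer row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {table} ({id_col} INTEGER PRIMARY KEY, {name_col} TEXT)")
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (1, "Ada"))
    return conn


db_v1 = make_db("CUSTOMER", "CUST_ID", "NAME")
db_v2 = make_db("CLIENT", "CLI_CODE", "CLI_NAME")
result_v1 = client_program(generate_api(MAPPING_V1), db_v1)
result_v2 = client_program(generate_api(MAPPING_V2), db_v2)
```

The function `client_program` is identical for both database versions; only the regenerated finder changes, which is the property the architecture aims for when the conceptual schema is unchanged.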
3 Approach
The approach proposed in this paper relies on a generic representation of schemas, and on a formal definition of schema transformations and schema mappings. It consists in automatically building a schema mapping during the logical design process, in order to use such a mapping as input for (re)generating a conceptual data manipulation API to the relational database.
3.1 Generic Schema Representation
The approach deals with schemas at different levels of abstraction (conceptual, logical, physical) and relying on various data modeling paradigms. To cope with this variety of models, we use a large-spectrum model, namely the Generic Entity-relationship model [16], or GER for short. The GER model is an extended ER model that includes most concepts of the models used in database engineering processes, encompassing the three main levels of abstraction, namely conceptual, logical and physical. In particular, it serves as a generic pivot model between the major database paradigms including ER, relational, object-oriented, object-relational, file structures, network, hierarchical and XML. The GER model has been given a formal semantics [16].
3.2 Transformational Schema Derivation
Most database engineering processes can be modeled as chains of schema transformations (such a chain is called a transformation plan). A schema transformation basically is a rewriting rule T that replaces a schema construct C by another one C′ = T(C). A transformation that preserves the information capacity of the source construct is said to be semantics-preserving. This property has been formally defined and its applications have been studied in [16]. In the context of system evolution, schema transformations must also specify how instances of source construct C are transformed into instances of C′.

The logical design process can be described as follows: we replace, in a conceptual schema, the ER constructs that do not comply with the relational model by application of appropriate schema transformations. We have shown in [16] that (1) deriving a logical (relational) schema from an ER schema requires a dozen semantics-preserving transformations only and that (2) a transformation plan made up of semantics-preserving transformations fully preserves the semantics of the conceptual schema. The transformations on which a simple relational logical design process relies include the following: transforming is-a relations into one-to-one rel-types, transforming complex rel-types into entity types and functional (one-to-many) rel-types, transforming complex attributes into entity types, transforming functional rel-types into foreign keys. A transformation plan should be idempotent, that is, applying it several times to a source schema yields the same target schema as if it was applied only once.

The transformational approach to database engineering provides a rigorous means to specify forward and backward mappings between source and target schemas: the trace (or history) of the transformations used to produce the target schema. However, as shown in [7], extracting a synthetic mapping from this history has proved fairly complex in practice. Therefore, we will define in the next sections a simpler technique to define inter-schema mappings.
3.3 Schema Mapping Definition
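Before mappings are defined, the idempotence requirement on transformation plans stated in Section 3.2 can be made concrete with a toy example. The sketch below is an assumption-laden illustration: the dictionary-based schema encoding, the single transformation `reltypes_to_foreign_keys`, and the foreign-key naming rule are invented for the example and are not the GER model or the DB-MAIN transformations.

```python
import copy


def reltypes_to_foreign_keys(schema):
    """Replace each functional (one-to-many) rel-type by a foreign key.

    Hypothetical encoding: a rel-type is a dict with the entity types on
    its 'one' and 'many' sides; the foreign key is added on the 'many' side.
    """
    out = copy.deepcopy(schema)  # transformations return a new schema
    for rel in out.pop("rel_types", []):
        many, one = rel["many"], rel["one"]
        fk = f"{one.lower()}_id"
        out["entity_types"][many].append(fk)
        out.setdefault("foreign_keys", []).append((many, fk, one))
    return out


def apply_plan(schema, plan):
    """Apply a transformation plan, i.e. a chain of schema transformations."""
    for transformation in plan:
        schema = transformation(schema)
    return schema


conceptual = {
    "entity_types": {"Customer": ["id"], "Order": ["id"]},
    "rel_types": [{"name": "places", "one": "Customer", "many": "Order"}],
}
plan = [reltypes_to_foreign_keys]

once = apply_plan(conceptual, plan)
twice = apply_plan(once, plan)
```

After the first application no rel-type is left to transform, so the second application returns an equal schema, which is what idempotence requires of the plan.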
Let C be the set of GER schema constructs. Let S be the set of GER schemas. Definition 1 (Schema). A schema S ∈ S is a non empty set of schema constructs : S = {C1 , C2 , ..., Cn } : ∀i : 1 0).
Automated Co-evolution of Conceptual Models
If we find an instance of the TPH scheme, we can use column mapping patterns to distinguish further how the existing mapping reuses store columns. Column mapping patterns do not use local scope, but rather look at the entire mapping table for all entities that map to a given table; we expand the set of considered entities to all entities because the smaller scope is not likely to yield enough data to exhibit a pattern:

Remap by column name (RBC): If types E and E′ are cousin types in a hierarchy², and both E and E′ have a property named P with the same domain, then E.P and E′.P are mapped to the same store column. This scheme maps all properties with like names to the same column, and is the scheme that Ruby on Rails uses by convention [8]. Given hierarchy table T, the RBC pattern is:

$$Q^{+}_{RBC} \equiv \left(\exists_{C \in \pi_{SC}\,\sigma_{ST=T}\,\sigma_{\neg K} M}\; \left|\sigma_{CP \in NKP(CE)}\,\sigma_{ST=T \wedge SC=C} M\right| > 1\right) \wedge \left(\forall_{C \in \pi_{SC}\,\sigma_{ST=T}\,\sigma_{\neg K} M}\; \left|\pi_{CP}\,\sigma_{CP \in NKP(CE)}\,\sigma_{ST=T \wedge SC=C} M\right| = 1\right).$$

That is, check if a store column C is mapped to more than one client property, and all client properties CP that map to store column C have the same name.

Remap by domain (RBD): If types E and E′ are cousin types in a hierarchy, let P be the set of all properties of E with domain D (including derived properties), and P′ be the set of all properties of E′ with the same domain D. If C is the set of all columns to which any property in P or P′ map, then |C| = max(|P|, |P′|). In other words, the mapping maximally re-uses columns to reduce table size and increase table value density, even if properties with different names map to the same column. Said another way, if one were to add a new property P0 to an entity type mapped using the TPH scheme, map it to any column C0 such that C0 has the same domain as P0 and is not currently mapped by any property in any descendant type, if any such column exists. Given hierarchy table T, the RBD pattern is:

$$Q^{+}_{RBD} \equiv \left(\exists_{C \in \pi_{SC}\,\sigma_{ST=T}\,\sigma_{\neg K} M}\; \left|\sigma_{CP \in NKP(CE)}\,\sigma_{ST=T \wedge SC=C} M\right| > 1\right) \wedge \left(\forall_{X \in \pi_{D}\,\sigma_{ST=T \wedge \neg K} M}\; \exists_{E \in \pi_{CE}\,\sigma_{ST=T} M}\; \left|\pi_{CP}\,\sigma_{CE=E \wedge ST=T \wedge D=X \wedge \neg K} M\right| = \left|\pi_{SC}\,\sigma_{ST=T \wedge D=X \wedge \neg K} M\right|\right).$$

There is at least one store column C that is remapped, and for each domain D, there is some client entity E that uses all available columns of that domain.

Fully disjoint mapping (FDM): If types E and E′ are cousin types in a hierarchy, the non-key properties of E map to a set of columns disjoint from the non-key properties of E′. This pattern minimizes ambiguity of column data provenance: given a column c, all of its non-null data values belong to instances of a single entity type. Given hierarchy table T, the FDM pattern is:

$$Q^{+}_{FDM} \equiv \forall_{C \in \pi_{SC}\,\sigma_{ST=T}\,\sigma_{\neg K} M}\; \left|\sigma_{CP \in NKP(CE)}\,\sigma_{ST=T \wedge SC=C} M\right| = 1.$$

Each store column C is uniquely associated with a declared entity property CP.

² Cousin types belong to the same hierarchy, but neither is a descendant of the other.
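The three column mapping patterns can be read as simple checks over the rows of the mapping relation. The sketch below is one possible reading, under stated assumptions: each row of M is a dictionary with the fields CE, CP, ST, SC, K and D used in the formulas, and every non-key row is assumed to list a property declared on its entity type, so the condition CP ∈ NKP(CE) is not tested separately. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def non_key_rows_by_column(M, T):
    """Group the non-key rows of table T by store column (SC)."""
    cols = defaultdict(list)
    for r in M:
        if r["ST"] == T and not r["K"]:
            cols[r["SC"]].append(r)
    return cols


def is_fdm(M, T):
    """FDM: every non-key store column is mapped by exactly one row."""
    return all(len(rows) == 1 for rows in non_key_rows_by_column(M, T).values())


def is_rbc(M, T):
    """RBC: some column is reused, and rows sharing a column share a name."""
    cols = non_key_rows_by_column(M, T)
    return (any(len(rows) > 1 for rows in cols.values())
            and all(len({r["CP"] for r in rows}) == 1 for rows in cols.values()))


def is_rbd(M, T):
    """RBD: some column is reused, and for each domain some entity type
    uses as many properties of that domain as there are columns of it."""
    cols = non_key_rows_by_column(M, T)
    if not any(len(rows) > 1 for rows in cols.values()):
        return False
    rows = [r for rs in cols.values() for r in rs]
    entities = {r["CE"] for r in M if r["ST"] == T}
    for d in {r["D"] for r in rows}:
        n_cols = len({r["SC"] for r in rows if r["D"] == d})
        if not any(len({r["CP"] for r in rows
                        if r["CE"] == e and r["D"] == d}) == n_cols
                   for e in entities):
            return False
    return True


def row(ce, cp, sc, d, key=False):
    return {"CE": ce, "CP": cp, "ST": "Person", "SC": sc, "K": key, "D": d}


# Like-named properties share a column; other properties get their own.
by_name = [row("Person", "Id", "Id", "int", key=True),
           row("Employee", "Title", "Title", "str"),
           row("Customer", "Title", "Title", "str"),
           row("Employee", "Salary", "Salary", "int"),
           row("Customer", "CreditScore", "CreditScore", "int")]

# Properties of the same domain share a column regardless of name.
by_domain = [row("Person", "Id", "Id", "int", key=True),
             row("Employee", "Title", "Str1", "str"),
             row("Customer", "Title", "Str1", "str"),
             row("Employee", "Salary", "Int1", "int"),
             row("Customer", "CreditScore", "Int1", "int")]
```

In the first mapping the `int` domain occupies two columns while each entity type has only one `int` property, so the RBD check fails; in the second, `Int1` holds differently named properties, so the RBC check fails.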
J.F. Terwilliger, P.A. Bernstein, and A. Unnithan
In addition to hierarchy and column mapping schemes, other transformations may exist between client types and store tables. For instance:

– Horizontal partitioning (HP): Given an entity type E with a non-key property P, one can partition instances of E across tables based on values of P.
– Store-side constants (SSC): One can assign a column to hold a particular constant. For instance, one can assign to column C a value v that indicates which rows were created through the ORM tool. Thus, queries that filter on C = v eliminate any rows that come from an alternative source.

Strictly speaking, we do not need patterns for these final two schemes: our algorithm for generating new mapping relation rows (Section 5) carries such schemes forward automatically. Other similar schemes include vertical partitioning and merging, determining whether a TPH hierarchy uses a discriminator column (as opposed to patterns of NULL and NOT NULL conditions), and association inlining (i.e., whether one-to-one and one-to-many relationships are represented as foreign key columns on the tables themselves or in separate tables).

Note that each group of patterns is not complete on its own. The local scope of an entity may be too small to find a consistent pattern or may not yield a consistent pattern (e.g., one sibling is mapped TPH, while another is mapped TPC). In our experience, the developer is most likely to encounter this situation during bootstrapping, when the client model is first being built. Most mappings we see are totally homogeneous, with entire models following the same scheme. Nearly all the rest are consistent in their local scope (specifically, all siblings are mapped identically). However, for completeness in our implementation, we have chosen the following heuristics for the rare case when consistency is not present:

– If we do not see a consistent hierarchy mapping scheme (e.g., TPT), we rely on a global default given by the user (similar to [3]).
– If we do not see a consistent column mapping scheme, we default to the disjoint pattern.
– If we do not see consistent condition patterns like store constants or horizontal partitioning, we ignore any store and client conditions that are not relevant to TPH mapping.
5 Evolving a Mapping
Once we know that a pattern is present in the mapping, we can then effect an incremental change to the mapping and the store based on the nature of the change. The incremental changes that we support fall into four categories:

– Actions that add constructs: One can add entity types to a hierarchy, add a new root entity type, add properties, or add associations. Setting an abstract entity type to be concrete is also a change of this kind. For these changes, new rows may be added to the mapping relation, but existing rows are left alone.
– Actions that remove constructs: One can drop any of the above artifacts, or set a concrete entity type to be abstract. For changes of this kind, rows may be removed from the mapping relation, but no rows are changed or added.
– Actions that alter construct attributes: One can change individual attributes, or “facets”, of artifacts. Examples include changing the maximum length of a string property or the nullability of a property. For such changes, the mapping relation remains invariant, but is used to guide changes to the store.
Automated Co-evolution of Conceptual Models
– Actions that refactor or move model artifacts: One can transform model artifacts in a way that maximizes information preservation, such as renaming a property (rather than dropping and re-adding it), transforming an association into an inheritance, or changing an association’s cardinality. Changes of this kind may result in arbitrary mapping relation changes, but such changes are often similar to (and thus re-use logic from) changes of the other three kinds.

The set of possible changes is closed in that one can evolve any client model M1 to any other client model M2 by dropping any elements they do not have in common and adding the ones unique to M2 (a similar closure argument has been made for object-oriented models, e.g. [1]). All of the rest of the supported changes (property movement, changing the default value for a property, etc.) can be accomplished by drop-add pairs, but are better supported by atomic actions that preserve data. For the rest of the section, we show the algorithms for processing a cross-section of the supported model changes.

Adding a new type to the hierarchy: When adding a new type to a hierarchy, one must answer three questions: what new tables must be created, what existing tables will be re-used, and which derived properties must be remapped. For clarity, we assume that declared properties of the new type will be added as separate “add property” actions. When a new entity type E is added, we run algorithm AddNewEntity:

1. AddNewEntity(E):
2.   k ← a key column for the hierarchy
3.   $\mathcal{G} \leftarrow \gamma_{CX}\, \sigma_{CP=k \wedge CE \in \Phi(E)} M$, where $\gamma_{CX}$ groups rows of the mapping relation according to their client conditions
4.   If $\exists i\; |\pi_{CE} G_i| \neq |\Phi(E)|$ then $\mathcal{G} \leftarrow \{\sigma_{CP=k \wedge CE \in \Phi(E)} M\}$ (if there is no consistent horizontal partition across entity types, then just create one large partition, ignoring client-side conditions)
5.   For each $G \in \mathcal{G}$:
6.     If $Q^+_{TPT}(G)$: (if TPT pattern is found when run just on the rows in G)
7.       For each property $P \in Keys(E) \cup NKP(E)$:
8.         Add NewMappingRow(GenerateTemplate(G, P), E)
9.     If $Q^+_{TPH}(G)$ or $Q^+_{TPC}(G)$:
10.      A ← the common ancestor of Φ(E)
11.      For each property $P \in Keys(E) \cup \bigcap_{e \in \mathcal{E}} NKP(e)$, where $\mathcal{E}$ is the set of all entities between E and A in the hierarchy, inclusive:
12.        Add NewMappingRow(GenerateTemplate(G, P), E)

Function GenerateTemplate(R, P) is defined as follows: we create a mapping template T as a derivation from a set of existing rows R, limited to those where CP = P. For each column C ∈ {CE, CP, ST, SC}, set T.C to be X if ∀r ∈ R, r.C = X. Thus, for instance, if there is a consistent pattern mapping all properties called ID to columns called PID, that pattern is continued. Otherwise, set T.C = ⊗, where ⊗ is a symbol indicating a value to be filled in later. For condition column CX (and SX), template generation follows a slightly different path. For any condition C = v, C IS NULL, or C IS NOT NULL that
Table 2. Creating the mapping template for a type added using a TPH scheme, over a single horizontal partition where “Editor=Tom” and with a store-side constant “Source=A”; the final row shows the template filled in for a new type Alumnus

| CE | CP | CX | ST | SC | SX | K | D |
|---|---|---|---|---|---|---|---|
| Person | ID | Editor=Tom | TPerson | PID | Type=Person AND Source=A | Yes | Guid |
| Student | ID | Editor=Tom | TPerson | PID | Type=Student AND Source=A | Yes | Guid |
| Staff | ID | Editor=Tom | TPerson | PID | Type=Staff AND Source=A | Yes | Guid |
| ⊗ | ID | Editor=Tom | TPerson | PID | Type=⊗ AND Source=A | Yes | Guid |
| Alumnus | ID | Editor=Tom | TPerson | PID | Type=Alumnus AND Source=A | Yes | Guid |
appear in every CX (or SX) field in R (treating a conjunction of conditions as a list that can be searched), and the value v is the same for each, add the condition to the template. If each row r ∈ R contains an equality condition C = v, but the value v is distinct for each row r, add condition C = ⊗ to the template. Ignore all other conditions.

Table 2 shows an example of generating a mapping template for a set of rows corresponding to a TPH relationship; the rows for the example are drawn from Table 1, with additional client and store conditions added to illustrate the effect of the algorithm acting on a single horizontal partition and a store constant. Note that the partition conditions and store conditions translate to the template; note also that the name of the store column remains consistent even though it is not named the same as the client property.

The function NewMappingRow(F, E) takes a template F and fills it in with details from E. Any ⊗ values in CE, CX, ST, and SX are filled with value E.

Translating these new mapping table rows back to an EF mapping fragment is straightforward. For each horizontal partition, take all new rows collectively and run the algorithm from Section 2 backwards to form a single fragment.

Adding a new property to a type: When adding a new property to a type, one has a different pair of questions to answer: which descendant types must also remap the property, and to which tables must a property be added. The algorithm for adding property P to type E is similar to adding a new type:

– For each horizontal partition, determine the mapping scheme for Φ(E).
– If the local scope has a TPT or TPC scheme, add a new store column and a new row that maps to it. Also, for any child types whose local scope is mapped TPC, add a column and map to it as well.
– If the local scope has a TPH scheme, detect the column remap scheme. If remapping by name, see if there are other properties with the same name, and if so, map to the same column. If remapping by domain, see if there is an available column with the same domain and map to it. Otherwise, create a new column and map to it. Add a mapping row for all descendant types that are also mapped TPH.

Translating these new mapping rows backward to the existing EF mapping fragments is straightforward. Each new mapping row may be translated into a new item added to the projection list of a mapping fragment. For a new mapping
row N, find the mapping fragment that maps $\sigma_{N.CX}\, N.CE = \sigma_{N.SX}\, N.ST$ and add N.CP and N.SC to the client and store projection lists respectively.

Changing or dropping a property: One can leverage the mapping relation to propagate schema changes and deletions through a mapping as well. Consider first a scenario where the user wants to increase the maximum length of Student.Major from 20 to 50 characters. We use the mapping relation to effect this change as follows. First, if E.P is the property being changed, issue query $\pi_{ST,SC}\, \sigma_{CE=E \wedge CP=P} M$, finding all columns that property E.P maps to (there may be more than one if there is horizontal partitioning). Then, for each result row t, issue query $Q = \pi_{CE,CP}\, \sigma_{ST=t.ST \wedge SC=t.SC} M$, finding all properties that map to the same column. Finally, for each query result, set the maximum length of the column t.SC in table t.ST to be the maximum length of all properties in the result of query Q.

For the Student.Major example, the property only maps to a single column TPerson.String1. All properties that map to TPerson.String1 are shown in Table 3. If Student.Major changes to length 50, and Staff.Office has maximum length 40, then TPerson.String1 must change to length 50 to accommodate. However, if TPerson.String1 has a length of 100, then it is already large enough to accommodate the wider Major property.

Dropping a property follows the same algorithm, except that the results of query Q are used differently. If query Q returns more than one row, that means multiple properties map to the same column, and dropping one property will not require the column to be dropped. However, if r is the row corresponding to the dropped property, then we issue a statement that sets r.SC to NULL in table r.ST for all rows that satisfy r.SX. So, dropping Student.Major will execute UPDATE TPerson SET String1 = NULL WHERE Type='Student'. If query Q returns only the row for the dropped property, then we delete the column.³ In both cases, the row r is removed from M. We refer to the process of removing the row r and either setting values to NULL or dropping a column as DropMappingRow(r).

Table 3. A listing of all properties that share the same mapping as Student.Major

| CE | CP | ST | SC | SX | K | D |
|---|---|---|---|---|---|---|
| Student | Major | TPerson | String1 | Type=Student | No | Text |
| Staff | Office | TPerson | String1 | Type=Staff | No | Text |
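The two queries used for changing and dropping a property can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping relation is held as an in-memory list, the names MappingRow, columns_of, properties_on, required_lengths and drop_statements are hypothetical, the SX condition is stored directly as SQL text, and the K and D columns are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingRow:
    ce: str  # client entity type (CE)
    cp: str  # client property (CP)
    st: str  # store table (ST)
    sc: str  # store column (SC)
    sx: str  # store-side condition (SX), kept here as SQL text (assumption)


# The two rows of Table 3.
M = [
    MappingRow("Student", "Major", "TPerson", "String1", "Type='Student'"),
    MappingRow("Staff", "Office", "TPerson", "String1", "Type='Staff'"),
]


def columns_of(m, entity, prop):
    """First query: all (table, column) pairs that entity.prop maps to."""
    return sorted({(r.st, r.sc) for r in m if r.ce == entity and r.cp == prop})


def properties_on(m, table, column):
    """Query Q: all (entity, property) pairs stored in table.column."""
    return sorted({(r.ce, r.cp) for r in m if r.st == table and r.sc == column})


def required_lengths(m, entity, prop, max_length):
    """Width each affected column needs, given per-property maximum lengths."""
    return {
        (table, column): max(max_length[p] for p in properties_on(m, table, column))
        for table, column in columns_of(m, entity, prop)
    }


def drop_statements(m, entity, prop):
    """SQL for dropping entity.prop: NULL out a shared column, else drop it."""
    statements = []
    for r in m:
        if r.ce != entity or r.cp != prop:
            continue
        if len(properties_on(m, r.st, r.sc)) > 1:
            statements.append(f"UPDATE {r.st} SET {r.sc} = NULL WHERE {r.sx}")
        else:
            statements.append(f"ALTER TABLE {r.st} DROP COLUMN {r.sc}")
    return statements


lengths = {("Student", "Major"): 50, ("Staff", "Office"): 40}
print(required_lengths(M, "Student", "Major", lengths))
print(drop_statements(M, "Student", "Major"))
```

The sketch only computes the width that the sharing properties require; whether the column must actually be altered still depends on its current length, as in the TPerson.String1 example with length 100.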
³ Whether to actually delete the data or drop the column from storage or just remove it from the storage model available to the ORM is a policy matter. Our current implementation issues ALTER TABLE DROP COLUMN statements.

Moving a property from a type to a child type: If entity type E has a property P and a child type E′, it is possible using a visual designer to specify that the property P should move to E′. In this case, all instances of E′ should keep their values for property P, while any instance of E that is not an instance of E′ should drop its P property. This action can be modeled using analysis of the mapping relation M as well. Assuming for brevity that there are no client-side conditions, the property movement algorithm is as follows:
1. MoveClientProperty(E, P, E′):
2.   $r_0 \leftarrow \sigma_{CE=E \wedge CP=P} M$ (without client conditions, this is a single row)
3.   If $|\sigma_{CE=E' \wedge CP=P} M| = 0$: (E′ is mapped TPT relative to E)
4.     AddProperty(E′, P) (act as if we are adding property P to E′)
5.     For each $r \in \sigma_{CE=E' \vee CE \in Descendants(E')}\, \sigma_{CP=P} M$:
6.       UPDATE r.ST SET r.SC = $(r.ST \bowtie r_0.ST).(r.SC)$ WHERE r.SX
7.   $E^- \leftarrow$ all descendants of E, including E but excluding E′ and descendants
8.   For each $r \in \sigma_{CE \in E^- \wedge CP=P} M$:
9.     DropMappingRow(r) (drop the mapping row and effect changes to the physical database per the Drop Property logic in the previous case)
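Returning to the template functions used when adding a new type, GenerateTemplate and NewMappingRow can be sketched in Python as follows. This is a minimal sketch under assumptions, not the authors' implementation: mapping rows are plain dictionaries, conditions are restricted to equalities held as column-to-value dictionaries (IS NULL and IS NOT NULL tests are left out), the K and D columns are omitted, and the names merge_conditions and table2_row are hypothetical helpers. The sample rows are those of Table 2.

```python
HOLE = "⊗"  # placeholder for a value to be filled in later


def merge_conditions(condition_lists):
    """Combine equality conditions (dicts column -> value) across rows."""
    shared = set.intersection(*(set(c) for c in condition_lists))
    merged = {}
    for column in sorted(shared):
        values = [c[column] for c in condition_lists]
        if len(set(values)) == 1:
            merged[column] = values[0]   # same value in every row: keep it
        elif len(set(values)) == len(values):
            merged[column] = HOLE        # a distinct value in each row
    return merged                        # all other conditions are ignored


def generate_template(rows, prop):
    """Derive a mapping template from the rows whose CP equals prop."""
    rows = [r for r in rows if r["CP"] == prop]
    if not rows:
        raise ValueError(f"no mapping rows for property {prop!r}")
    template = {}
    for column in ("CE", "CP", "ST", "SC"):
        values = {r[column] for r in rows}
        template[column] = values.pop() if len(values) == 1 else HOLE
    for column in ("CX", "SX"):
        template[column] = merge_conditions([r[column] for r in rows])
    return template


def new_mapping_row(template, entity):
    """Fill the holes in CE, CX, ST and SX with the new entity type."""
    row = dict(template)
    for column in ("CE", "ST"):
        if row[column] == HOLE:
            row[column] = entity
    for column in ("CX", "SX"):
        row[column] = {k: entity if v == HOLE else v
                       for k, v in template[column].items()}
    return row


def table2_row(ce):
    """One row of Table 2 for entity type ce (K and D omitted)."""
    return {"CE": ce, "CP": "ID", "CX": {"Editor": "Tom"}, "ST": "TPerson",
            "SC": "PID", "SX": {"Type": ce, "Source": "A"}}


rows = [table2_row("Person"), table2_row("Student"), table2_row("Staff")]
template = generate_template(rows, "ID")
alumnus = new_mapping_row(template, "Alumnus")
print(template)
print(alumnus)
```

On the Table 2 rows, the template keeps Editor=Tom and Source=A, leaves CE and the Type condition as ⊗, and the filled-in row matches the Alumnus row of the table.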
6 Related Work
A demonstration of our implementation is described in [10]. That work describes how to capture user intent to construct incremental changes to a conceptual model and how to generate SQL scripts. It gives intuition about how to select a mapping scheme given a change, and mentions the mapping relation. This paper extends that work by providing the algorithms, the formal underpinnings, a more complete example, and the exact conditions when the approach is applicable.

A wealth of research has been done on schema evolution [9], but very little on co-evolution of mapped schemas connected by a mapping. One example is MeDEA, which uses manual specifications of update policies [3]. One can specify policies for each class of incremental change to the client schema as to what the desired store effect should be on a per-mapping basis. The Both-As-View (BAV) federated database language can express non-trivial schema mappings, though it does not handle inheritance [5]. For some schema changes (to either schema in BAV), either the mapping or the other schema can be adjusted to maintain validity. Many cases require manual intervention for non-trivial mappings.

Two cases of schema evolution have been considered in data exchange, one on incremental client model changes [11], and one where evolution is represented as a mapping [12]. Both cases focus on “healing” the mapping between schemas, leaving the non-evolved schema invariant. New client constructs do not translate to new store constructs, but rather add quantifiers or Skolem functions to the mapping, which means new client constructs are not persisted. It is unclear whether a technique similar to ours can be applied to a data exchange setting. However, it would be an interesting exercise to see if it is possible to define both patterns and a mapping table representation for first-order predicate calculus, in which case similar techniques could be developed.
7 Conclusion and Future Work
We have presented a way to support model-driven application development by automatically translating incremental client model changes into changes to a store model, the mapping between the two, and any database instance conforming to the store model. Our technique relies on treating an O-R mapping as minable data, as well as a notion of pattern uniformity within a local scope.
A prominent feature of EF is that it compiles mapping fragments into views that describe how to translate data from a store model into a client model and vice versa. Mapping compilation provides several benefits, including precise mapping semantics and a method to validate that a mapping can round-trip client states. The computational cost for compiling and validating a mapping can become large for large models. An active area of our research is to translate incremental changes to a model into incremental changes to the relational algebra trees of the compiled query and update views, with results that are still valid and consistent with the corresponding mapping and store changes. Finally, the mapping relation is a novel method of expressing an O-R mapping, and as such, it may have desirable properties that are yet unstudied. For instance, it may be possible to express constraints on a mapping relation instance that can validate a mapping’s roundtripping properties.
References

1. Banerjee, J., Kim, W., Kim, H., Korth, H.F.: Semantics and Implementation of Schema Evolution in Object-Oriented Databases. In: SIGMOD 1987 (1987)
2. Blakeley, J.A., Muralidhar, S., Nori, A.: The ADO.NET Entity Framework: Making the Conceptual Level Real. In: Embley, D.W., Olivé, A., Ram, S. (eds.) ER 2006. LNCS, vol. 4215, Springer, Heidelberg (2006)
3. Domínguez, E., Lloret, J., Rubio, A.L., Zapata, M.A.: Evolving the Implementation of ISA Relationships in EER Schemas. In: Roddick, J., Benjamins, V.R., Si-said Cherfi, S., Chiang, R., Claramunt, C., Elmasri, R.A., Grandi, F., Han, H., Hepp, M., Lytras, M.D., Mišić, V.B., Poels, G., Song, I.-Y., Trujillo, J., Vangenot, C. (eds.) ER Workshops 2006. LNCS, vol. 4231, pp. 237–246. Springer, Heidelberg (2006)
4. Hibernate, http://www.hibernate.org/
5. McBrien, P., Poulovassilis, A.: Schema Evolution in Heterogeneous Database Architectures, a Schema Transformation Approach. In: Pidduck, A.B., Mylopoulos, J., Woo, C.C., Özsu, M.T. (eds.) CAiSE 2002. LNCS, vol. 2348, p. 484. Springer, Heidelberg (2002)
6. Melnik, S., Adya, A., Bernstein, P.A.: Compiling Mappings to Bridge Applications and Databases. ACM TODS 33(4) (2008)
7. Oracle TopLink, http://www.oracle.com/technology/products/ias/toplink/
8. Ruby on Rails, http://rubyonrails.org/
9. Rahm, E., Bernstein, P.A.: An Online Bibliography on Schema Evolution. SIGMOD Record 35(4) (2006)
10. Terwilliger, J.F., Bernstein, P.A., Unnithan, A.: Worry-Free Database Upgrades: Automated Model-Driven Evolution of Schemas and Complex Mappings. In: SIGMOD 2010 (2010)
11. Velegrakis, Y., Miller, R.J., Popa, L.: Preserving Mapping Consistency Under Schema Changes. VLDB Journal 13(3) (2004)
12. Yu, C., Popa, L.: Semantic Adaptation of Schema Mappings When Schemas Evolve. In: VLDB 2005 (2005)
A SchemaGuide for Accelerating the View Adaptation Process⋆

Jun Liu¹, Mark Roantree¹, and Zohra Bellahsene²

¹ Interoperable Systems Group, Dublin City University
² LIRMM CNRS/University of Montpellier II
{jliu,mark.roantree}@computing.dcu.ie, [email protected]
Abstract. Materialization of XML views significantly improves query performance, given the often slow execution times of XPath expressions. Existing efforts focus on how to reuse materialized views for answering XPath queries and on the problem of synchronizing materialized data in response to changes taking place at the data source level. In this paper, we study a closely related problem, the view adaptation problem, which maintains the materialized data incrementally after view definitions have been changed (view redefinition). Our research focuses on an efficient process for view adaptation based on a fragment-based view representation: we segment materialized data into fragments and develop algorithms to update only those materialized fragments that have been affected by the view definition changes. This serves to minimize the effect of view adaptation and provide a more efficient process for stored views. Additionally, we study the containment problem at fragment level under the constraints expressed in a so-called SchemaGuide. We have implemented our view adaptation system, and we present its performance analysis in this paper.
1 Introduction
XML data is semi-structured, providing flexibility and a bridge between the more structured world of relational databases and the free-form world of unstructured and web data. Its flexible nature also makes XML suitable for exchanging data between heterogeneous systems, which led to its standardization as the format for information interchange on the Web. However, where applications are required to store data in native XML databases, query optimization is an ongoing problem. There have been many approaches to XML query optimization: SQL-based optimizers have been used in [1,2]; advanced tree-structured indexes were created to prune the search space in [3]; and XPath axis navigation algorithms were developed in [4]. More recently, efforts have focused on precomputing and storing the results of XML queries [5].

In relational database systems, materialized views are widely used for query result caching, especially where systems are queried more than updated, such as in data warehouses. In XML systems, [6,7,8] provide different approaches using single views, while [9,10,11,5] focus on using multiple materialized XPath views. Furthermore, there has been increasing focus on the issue of XPath view updates and maintenance, with [12,13] synchronizing the materialized view data with updates to source data. However, none of these approaches facilitates updates when view definitions have changed. This problem, first introduced by Gupta et al. [14], is known as the view adaptation problem. Our motivation is to provide a more holistic approach to view-based XML optimization by including view adaptation algorithms to manage query/view updates.

⋆ Funded by Enterprise Ireland Grant No. CFTD/07/201.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 160–173, 2010. © Springer-Verlag Berlin Heidelberg 2010

1.1 Contribution and Paper Structure
The core novelty in our work is the provision of a framework in which we materialize fragments of views that can be shared and that also facilitate far more efficient view adaptation when view redefinition takes place frequently.

– Based on our previous work [15,16], the XML Fragment-based Materialization (XFM) approach, we have developed a new fragment-based containment checking mechanism which uses a construct called the SchemaGuide. It is applied by our adaptation algorithms to detect more efficiently the containment relationship between different fragments in the XFM view graph.
– We devised a set of XPath view adaptation algorithms to efficiently handle the view redefinition caused by different types of changes.

This paper is structured as follows: in §2, we provide analysis of similar approaches to this topic; in §3, a brief review of the view graph and its components is provided; in §4, our containment checking approach is introduced, while in §5, the process for view adaptation is described; in §6, we provide details of our experiments and finally in §7, we offer some conclusions. A long version of the paper containing a detailed Related Work and an overview of the view adaptation system architecture can be found in [17].
2 Related Work
XPath query containment is a necessary condition for using materialized XPath views, and has been studied in [18,19,20]. Traditionally, the XPath containment problem is based on comparing two tree patterns containing a subset of XPath expressions (/, ∗, []). However, it has been proved in [19] that even for such a small subset of XPath expressions, the containment checking process is rather complex and time consuming. In this paper, we reduce the containment checking problem to fragment-based comparison rather than tree patterns, with the assistance of a so-called SchemaGuide, which is a variation of the QueryGuide introduced in [21]. The cost of containment checking between two fragments depends on the number of schema nodes bound to each fragment.

Based on the containment checking approach, we also provide an XML Fragment Based View Adaptation approach. The authors of [22] claim to be the first to provide a solution for the view adaptation problem in the XML world. They deal with the view adaptation problem for XPath access-control views with a set of comprehensive incremental view adaptation techniques. Materialized data is represented to the users according to a set of access control rules. Based on these rules, data representation is restricted and dynamically changed for different users. However, this technique only operates when view adaptation starts from the output node of a query to its subtree. This is due to XPath semantics, where only the XML fragment below the output node will be materialized. Taking inspiration from [23], we utilize multiple view fragments to achieve sharing of materialized data between different XML views. Due to sharing of materialized fragments, view adaptation can theoretically take place at any point in the view construct. The number of accesses to the source data is significantly reduced by using our approach. While our approach yields a significant benefit, one side effect is that the final result of each view still needs to be computed using the materialized fragments. However, the query evaluation is still more efficient than computing queries from scratch using the source data.
3 The XFM View Graph
The XFM View Graph is a directed acyclic graph that combines XPath views built from a set of fragments and operators. The graph exposes the common subexpressions between different view definitions. Figure 1 is a sample XFM view graph representing sample XPath views, where the fragments in gray are those that are materialized. A detailed description of the XFM view graph with fragments and logical operators can be found in our previous work [16,15]; here, we give only a brief overview.

Fig. 1. XML Fragment Materialized View Graph

Fragments are categorized into 5 types. Each fragment (except the Source Fragment) represents the result of a single step in an XPath expression, and it is these fragments that can be shared across XML views. All fragments contain a set of V-typed instances (XML tree nodes) for each node label V. In the case of RF and SF fragments, this will be the entire set of instances for V. For the remaining fragments, there will generally be some subset of V generated for the fragment.

– Root Fragment (RF) - A Root Fragment represents a node sequence containing a single node, which is the root node of an XML tree T (also known as the document node). It always represents the starting point of an XFM view graph. While a view graph will contain multiple query representations, they are all joined by the same root fragment, as shown in Figure 1 (e.g., RF with rectangle box).
– Filter Fragment (FF) - A Filter Fragment (e.g., FF1 in Figure 1) represents the node sequence produced by a select operation. In our view model, the select operation always contains a predicate used to filter an input node sequence, e.g., location='United States' in Figure 1 represents the filter operation that results in FF1.
– Dependency Join Fragment (DF) - A Dependency Join Fragment (e.g., DF1 in Figure 1) represents the node sequence resulting from a d-join operation, e.g., the $\overrightarrow{\bowtie}$ before DF1 represents a dependency join operation.
– Source Fragment (SF) - A Source Fragment represents the full set of V-typed nodes. The major difference between this fragment and all others is that it cannot be reused and merely acts as an operand in a d-join operation. An example of a source fragment is shown in Figure 1 with dashed boxes.
– View Fragment (VF) - A View Fragment (e.g., VF1 in Figure 1) represents the result of a view. It always follows a deep project operation, e.g., ΠD before VF1 indicates a deep project operation. Each fragment within a particular view is referenced by a corresponding VF fragment representing the context view. For example, DF6 is referenced by VF4, VF5 and VF6 as it is shared by them, whereas DF8 is referenced only by VF6. We use Vn and VFn interchangeably to represent an XPath view: VF1 and V1 both represent XPath view 1.

The fragmentation approach is used to facilitate this sharing of fragments across views, as each fragment indicates a potential end point (materialization candidate) for a view. Furthermore, each fragment links to one or more fragments that directly follow it, e.g., FF2 links to FF3, FF4 and FF5 in Figure 1. Each fragment is also linked by one and only one fragment that directly precedes it, e.g., FF2 is linked by FF1.
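The linking rules just described (several successors, exactly one predecessor, and references from the views that share a fragment) can be captured in a small data structure. The following Python sketch is illustrative only: the class and field names (Kind, Fragment, link, is_shared) are hypothetical, and only the links and view references stated in the text are reproduced, not the full graph of Figure 1.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class Kind(Enum):
    ROOT = "RF"
    FILTER = "FF"
    DJOIN = "DF"
    SOURCE = "SF"
    VIEW = "VF"


@dataclass(eq=False)  # identity comparison; avoids recursing through links
class Fragment:
    name: str
    kind: Kind
    materialized: bool = False
    parent: Optional["Fragment"] = field(default=None, repr=False)
    children: List["Fragment"] = field(default_factory=list, repr=False)
    views: Set[str] = field(default_factory=set)  # views referencing this fragment

    def link(self, child: "Fragment") -> "Fragment":
        """Make child directly follow this fragment (one predecessor only)."""
        if child.parent is not None:
            raise ValueError(f"{child.name} already follows {child.parent.name}")
        child.parent = self
        self.children.append(child)
        return child

    def is_shared(self) -> bool:
        """A fragment is shared when more than one view references it."""
        return len(self.views) > 1


# Links stated in the text: FF1 precedes FF2, which links to FF3, FF4 and FF5.
ff1 = Fragment("FF1", Kind.FILTER)
ff2 = ff1.link(Fragment("FF2", Kind.FILTER))
for name in ("FF3", "FF4", "FF5"):
    ff2.link(Fragment(name, Kind.FILTER))

# View references stated in the text.
df6 = Fragment("DF6", Kind.DJOIN, views={"VF4", "VF5", "VF6"})
df8 = Fragment("DF8", Kind.DJOIN, views={"VF6"})

print([c.name for c in ff2.children], df6.is_shared(), df8.is_shared())
```

Keeping a single parent pointer per fragment mirrors the rule that each fragment is linked by one and only one preceding fragment, while the set of referencing views identifies which fragments are shared.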
4 Containment Checking
In this section, we introduce our SchemaGuide, a metadata construct that optimizes the containment checking process. We then present the properties that manage the decision making process for containment.

4.1 The SchemaGuide

Our SchemaGuide is a heavily extended version of the QueryGuide introduced in [21], as it contains region encoding and a set of properties to govern containment checking and facilitate a more efficient fragment-based containment check. Figure 2b is a sample SchemaGuide summarizing the XML tree in Figure 2a. They represent equivalent subsets of the SchemaGuide and the XML Data Tree for the XMark dataset respectively. Each node within the schema guide maps to a set of nodes with identical root-to-node path in an XML tree. For instance, item1 and item2 in Figure 2a both map to the schema node item within the subtree of africa in Figure 2b. From now on, we use the term schema node to represent any node within the SchemaGuide, and node instance for any node in the XML data tree.

Fig. 2. Sample XML Data Tree and Schema Guide: a) XML Data Tree for XMark; b) Schema Guide for XMark

A SchemaGuide (SG) is a tree-based data structure that describes the XML document structure. Each schema node (sn) in SG is defined by a tuple (pid, start, end), where pid is the unique identifier of each root-to-node (rtn) path within an XML data tree, and the start and end values encapsulate the sub-tree region (number of nodes) for that schema node. All nodes in the schema guide are encoded by the StartEnd encoding scheme, which is widely used in the XML twig pattern matching process, e.g., [24], to facilitate the determination of the parent-child or ancestor-descendant relationship between XML tree nodes. The start and end values of a node v together are called the region of node v, denoted as reg(v). For two XML tree nodes, u and v, if v is in the subtree of u, then we say that u contains v, denoted by $reg(v) \sqsubset reg(u)$.

In this paper, we make use of the StartEnd encoding scheme to determine the containment relationship between two sequences of schema nodes mapped to the sequences of node instances that are represented by the fragments. By containment between fragments we mean the following: fragments represent sequences of node instances, so for two sequences of node instances $S_u$ and $S_v$, if every node instance in $S_u$ is in the subtree of at least one node in $S_v$, then we say that $S_v$ contains $S_u$, or $S_u$ is contained in $S_v$. Because each node instance is mapped to a schema node within the SchemaGuide, in this paper we deal with the containment problem at the schema level.

4.2 Region Containment
The mathematical notations ⊏, ⊐ and ≡ indicate the relationships is contained in, contains and equivalent between either two schema nodes or two sequences of schema nodes. A fourth relationship, incomparability (disjoint), is also distinguished. Additionally, two sequences of schema nodes may overlap. We use the function OVERLAP to determine whether two sequences of schema nodes overlap.

Property 1 (Region Containment). For any two given schema nodes sn_u and sn_v in a schema guide SG, the following hold:
1. reg(sn_u) ⊏ reg(sn_v), iff sn_u.start > sn_v.start and sn_u.end < sn_v.end.
2. reg(sn_u) ⊐ reg(sn_v), iff sn_u.start < sn_v.start and sn_u.end > sn_v.end.
3. reg(sn_u) ≡ reg(sn_v), iff sn_u.start = sn_v.start and sn_u.end = sn_v.end.
4. reg(sn_u) and reg(sn_v) are incomparable (disjoint), iff sn_u.end < sn_v.start or sn_u.start > sn_v.end.

Property 1 is used to determine the containment relationship between two schema nodes in a SchemaGuide. As shown in Figure 2b, we may wish to determine if item (4,3,44) is contained within africa (3,2,45). The start value 3 is greater than the start value 2, and the end value 44 is less than the end value 45; thus, the region of item is contained in the region of africa.

Based on Property 1, we use Property 2 to detect the containment relationship between two sequences of schema nodes. This property is the core concept used to determine the containment relationship between the fragments introduced in §3. Each fragment represents a sequence of node instances, and each node instance further maps to a schema node within the SchemaGuide. Therefore, to check the containment between fragments, it suffices to check the containment relationship between the two sequences of schema nodes associated with the fragments (or the sequences of node instances represented by the fragments).

Property 2 (Sequence Containment). For any two given sequences of schema nodes S_u and S_v, where sn_i ∈ S_u = {sn_0 · · · sn_n} and sn_j ∈ S_v = {sn_n+1 · · · sn_m}:
1. S_u ⊏ S_v, iff ∀sn_i ∈ S_u, ∀sn_j ∈ S_v → reg(sn_i) ⊏ reg(sn_j)
2. S_u ⊐ S_v, iff ∀sn_i ∈ S_u, ∀sn_j ∈ S_v → reg(sn_i) ⊐ reg(sn_j)
3. S_u ≡ S_v, iff ∀sn_i ∈ S_u, ∀sn_j ∈ S_v → reg(sn_i) ≡ reg(sn_j)
4. S_u and S_v are incomparable (disjoint), iff ∀sn_i ∈ S_u, ∀sn_j ∈ S_v → reg(sn_i) and reg(sn_j) are incomparable
5. OVERLAP(S_u, S_v) = true, iff ∃sn_i ∈ S_u and ∃sn_j ∈ S_v → reg(sn_i) ⊏ reg(sn_j) or reg(sn_i) ⊐ reg(sn_j)

Property 2 extends Property 1 to the case where containment checking involves two sequences of schema nodes. Items (1) to (4) mirror Property 1, except that sequences of schema nodes are compared. Item (5) is new: two sequences may overlap.

Algorithm 1: ContainmentCheck(F_u, F_v)
Input: two fragments
Output: a state indicating the containment relationship between two inputs
 1  if both F_u and F_v are Root Fragment then
 2      F_u ≡ F_v;
 3  else if F_u is Root Fragment then
 4      F_u ⊐ F_v;
 5  else if F_v is Root Fragment then
 6      F_u ⊏ F_v;
 7  else
 8      S_u = F_u → GetSchemaNodes();
 9      S_v = F_v → GetSchemaNodes();
10      FindRelationship(S_u, S_v);
11      if both F_u and F_v are Filter Fragment then
12          compare predicate p_u and p_v associated with F_u and F_v respectively;
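Property 1 and the OVERLAP test of Property 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Java implementation: the names `SchemaNode`, `Rel`, `region_relation` and `overlap` are hypothetical, and only the start and end components of each triple are used, since the text compares only those two values.

```python
from dataclasses import dataclass
from enum import Enum


class Rel(Enum):
    CONTAINED_IN = "contained in"   # reg(u) is contained in reg(v)
    CONTAINS = "contains"           # reg(u) contains reg(v)
    EQUIVALENT = "equivalent"       # identical regions
    INCOMPARABLE = "incomparable"   # disjoint regions


@dataclass(frozen=True)
class SchemaNode:
    name: str
    start: int
    end: int


def region_relation(u: SchemaNode, v: SchemaNode) -> Rel:
    """Classify two schema nodes by the four cases of Property 1."""
    if u.start == v.start and u.end == v.end:
        return Rel.EQUIVALENT
    if u.start > v.start and u.end < v.end:
        return Rel.CONTAINED_IN
    if u.start < v.start and u.end > v.end:
        return Rel.CONTAINS
    if u.end < v.start or u.start > v.end:
        return Rel.INCOMPARABLE
    # Assumption: regions of a tree encoding either nest or are disjoint.
    raise ValueError("partially overlapping regions are not expected")


def overlap(su, sv) -> bool:
    """Property 2, item 5: some pair of nodes is related by containment."""
    return any(
        region_relation(u, v) in (Rel.CONTAINED_IN, Rel.CONTAINS)
        for u in su
        for v in sv
    )


# The example from the text: item (4,3,44) and africa (3,2,45).
item = SchemaNode("item", start=3, end=44)
africa = SchemaNode("africa", start=2, end=45)
print(region_relation(item, africa).value)  # contained in
```

Each comparison inspects only two integers per node, which is what makes the region encoding attractive for containment checks.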
4.3 Containment Algorithm
Algorithm 1 is the containment checking algorithm used by our adaptation process, based on the properties listed above. It takes two fragments as input. If neither fragment is the Root Fragment, we compare the sequences of schema nodes associated with each fragment (Lines 8-10 of Algorithm 1). Algorithm 2 detects the relationship between the two sequences of schema nodes mapped to the fragments. It takes two sequences of schema nodes as input. If both fragments are Filter Fragments, we compare the predicates as well. Comparing predicates is straightforward and, due to space limitations, we do not list the algorithm here. The output of ContainmentCheck is one of the containment relationships mentioned in Property 2. The time complexity of our fragment-based containment checking algorithm is O(n × m), where n and m are the numbers of schema nodes within the sequences associated with the input fragments.

Algorithm 2: FindRelationship(S_u, S_v)
Input: S_u, S_v contain sequences of schema nodes labeled u and v respectively
Output: a state indicating the containment relationship between two inputs
 1  if |S_u| = |S_v| then
 2      if ∀u_i ∈ S_u, ∀v_j ∈ S_v → u_i ≡ v_j then
 3          return S_u ≡ S_v;
 4      else if ∃u_i ∈ S_u, ∃v_j ∈ S_v
 5          → u_i ⊏ v_j or u_i ⊐ v_j then
 6          return S_u overlaps with S_v;
 7      else S_u and S_v are incomparable;
 8  else
 9      if ∀u_i ∈ S_u, ∀v_j ∈ S_v → u_i ⊏ v_j then
10          return S_u ⊏ S_v;
11      else if ∀u_i ∈ S_u, ∀v_j ∈ S_v → u_i ⊐ v_j then
12          return S_u ⊐ S_v;
13      else if ∃u_i ∈ S_u, ∃v_j ∈ S_v
14          → u_i ⊏ v_j or u_i ⊐ v_j then
15          return S_u overlaps with S_v;
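The branch structure of Algorithm 2 can be sketched as follows. This is a hedged illustration that follows the listing as printed; schema nodes are reduced to hypothetical `(start, end)` pairs, the function names are invented for the example, and the final fall-through to "incomparable" in the unequal-length branch is an assumption, since the listing does not state it.

```python
from itertools import product


def rel(u, v):
    """Property 1 on (start, end) pairs: returns '<', '>', '=' or '|'."""
    (us, ue), (vs, ve) = u, v
    if us == vs and ue == ve:
        return "="
    if us > vs and ue < ve:
        return "<"   # u is contained in v
    if us < vs and ue > ve:
        return ">"   # u contains v
    return "|"       # treated as incomparable


def find_relationship(su, sv):
    """Mirror of Algorithm 2: all pairs are compared, hence O(n * m)."""
    pairs = [rel(u, v) for u, v in product(su, sv)]
    has_containment = any(p in ("<", ">") for p in pairs)
    if len(su) == len(sv):
        if all(p == "=" for p in pairs):
            return "equivalent"
        if has_containment:
            return "overlap"
        return "incomparable"
    if all(p == "<" for p in pairs):
        return "contained in"
    if all(p == ">" for p in pairs):
        return "contains"
    if has_containment:
        return "overlap"
    return "incomparable"  # assumption: not spelled out in the listing


print(find_relationship([(3, 10), (12, 20)], [(2, 45)]))  # contained in
```

The `product` call makes the quadratic cost visible: every node of one sequence is compared with every node of the other.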
5 View Adaptation
The adaptation methods consist of two types of adaptation: structural adaptation and data adaptation. Structural adaptation maintains the structure of the XFM view graph in response to changes. It is followed by data adaptation, which decides on the materialized fragments to be adapted. Our effort so far has focused mainly on structural maintenance and containment. We begin with an overview of the adaptation process and identify the key algorithms.

5.1 Adaptation Method
The main idea of view adaptation is to keep the XFM view graph both structurally and physically up to date in response to the changes applied to it. Based on the effect on the XFM view graph, structural adaptation is further categorized into three types: integration, deletion and modification of a fragment. The adaptation process flow can be summarized as follows:
1. Detect whether the adaptation process (e.g., the integration, deletion or modification of a target fragment) affects other views beyond the target view.
2. For integration and modification, iteratively call the containment check subprocess to find an existing fragment containing (including) the target fragment. Such a fragment must be the most restricted one among the fragments that contain the target fragment: there may be many fragments containing the target one, but the desired one is the one that contains the fewest node instances.
3. Adapt the structure of the XFM view graph based on the results returned by the above two processes.
4. Find any existing materialized fragment that is affected by the requested change.
5. Search for any existing materialized fragment that can be reused to physically adapt the affected fragments in response to the change.

The following methods are used to implement the adaptation process.
– GetNextByRef retrieves the fragment directly linked by the context fragment in a target view.
– GetPrevious retrieves the fragment preceding the context fragment.
– ProcessNFs and ProcessPFs recursively check the containment relationship for the fragments directly linked by, or preceding, the context fragment respectively, and then adapt the XFM view graph according to the containment relationship detected.
– GetDuplicatedFragment returns a copy of the specified fragment. Only the structural and mapping information is duplicated, not the materialized data.
– RemoveReference deletes the reference between a fragment and a view.
– AddNext adds a link between two fragments.
– RemoveNext removes the link between two fragments.

A detailed description of the methods listed above can be found in our technical report [25]. In the rest of this section, due to space limitations, we give a detailed description of one type of adaptation method only. The others can also be found in our technical report [25].

5.2 Worked Example
In this section, we use a worked example for the modification of a predicate. The main issue in this case is the containment relationship between the target predicate and its preceding predicates (if any exist). Additionally, it is essential to check whether the predicate being modified is shared by other views. The case is straightforward when the target predicate is referenced by only one view; further effort is required when the predicate in question is shared by more than one XPath view in the XFM view graph.

Fig. 3. Modify Predicate

Figure 3 gives an example of modifying a predicate. Suppose we would like to modify the predicate ./date > '10/12/1999', represented by FF3 in V4, to ./date < '01/02/1999'. Note that DF6 is materialized by V4 and shared by different views as well. Initially, the requested change is transformed into the fragment representation, FF3′. The new fragment (FF3′) is then compared to the target fragment (FF3). If they are equivalent (Line 2), the original XFM view graph is returned. This is more like a validation process. If they are not equivalent (Line 3), then depending on whether the target fragment, FF3, is shared by different views (Line 5) or not (Line 21), we take a different approach for structural adaptation.

Algorithm 3: modifyPredicate(G, F_c, F_n, ref)
Input: the XFM View Graph G, F_c is the target fragment being modified, F_n is the new fragment after applying the modification, the context view reference ref
Output: an updated MFM View Graph G
 1  begin
 2  if F_c ≡ F_n then return G;
 3  else
 4      F_p = F_c→GetPrevious();
 5      if F_c→IsShared() then
 6          if F_n ⊏ F_c then
 7              return ProcessNFs(G, F_c, F_n, ref);
 8          else if F_c ⊏ F_n then
 9              return ProcessPFs(G, F_p, F_c, F_n, ref);
10          else
11              if F_p ⊏ F_n then
12                  ProcessPFs(G, F_p, F_c, F_n, ref);
13              else
14                  F_c→RemoveReference(ref);
15                  F_p→AddNext(F_n);
16                  size = F_c→GetNextsCount();
17                  for int i = 0 to size do
18                      F_cn = F_c→GetNext(i);
19                      add ref to all fragments linked to F_cn;
20              return G;
21      else
22          if F_p ⊏ F_n then
23              return ProcessPFs(G, F_p, F_c, F_n, ref);
24          else if F_n ⊏ F_p or
25              OVERLAP(F_p, F_n)→true or F_p and F_n are incomparable then
26              return ProcessNFs(G, F_p, F_n, ref);
27          else
28              F_p→RemoveNext(F_c);
29              F_cn = F_c→GetNextByRef(ref);
30              F_p→AddNext(F_cn);
31              return G;

In this case, FF3 is shared by the views V3, V4, V5 and V6. The actual containment relationship between the target fragment (FF3) and the new fragment (FF3′) must then be determined. Different containment relationships are possible (Lines 6, 8 and 10). If FF3 contains (includes) FF3′ (Line 6), ProcessNFs is called to process the following fragments, e.g., DF6, linked by FF3 (Line 7) in the target view. The purpose of this method is to recursively detect the containment relationship between the new fragment and the fragments linked by the target fragment and, eventually, to adapt the structure of G. If the target fragment (FF3) is contained in (more restricted than) the new fragment (FF3′) (Line 8), we call ProcessPFs (Line 9) to recursively detect the containment relationship between the fragments preceding the target fragment (FF3) and the new fragment (FF3′) until either a relationship other than incomparability is detected or the Root Fragment is reached. The structure of G is then adapted. However, if the relationship between the target fragment and the new fragment is either incomparability or overlap (Line 10), the process takes the third approach (Lines 11-20).
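The case analysis above depends on comparing the old and the new predicate, a step the paper does not list. One possible way to do this for simple range predicates on a single attribute is sketched below. This is an assumption-laden illustration rather than the authors' algorithm: the function names are hypothetical, only the operators `<` and `>` are handled, and the date literals are read as MM/DD/YYYY (reading them as DD/MM/YYYY gives the same outcomes for these values).

```python
from datetime import date


def to_interval(op, value):
    """Map 'attr op value' to an open interval (low, high) over dates."""
    if op == ">":
        return (value, date.max)
    if op == "<":
        return (date.min, value)
    raise ValueError(f"unsupported operator: {op}")


def compare_predicates(p, q):
    """Relationship of predicate p to predicate q, each given as (op, value)."""
    (pl, ph), (ql, qh) = to_interval(*p), to_interval(*q)
    if (pl, ph) == (ql, qh):
        return "equivalent"
    if ql <= pl and ph <= qh:
        return "contained in"   # p is more restrictive than q
    if pl <= ql and qh <= ph:
        return "contains"
    if ph <= ql or qh <= pl:
        return "disjoint"       # open intervals that do not intersect
    return "overlap"


old = (">", date(1999, 10, 12))   # ./date > '10/12/1999'
new = ("<", date(1999, 1, 2))     # ./date < '01/02/1999'
print(compare_predicates(new, old))  # disjoint
```

For the worked example the sketch reports that the two predicates are disjoint, which corresponds to the third branch of Algorithm 3 (Lines 10-20).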
In this case, it is obvious that the predicates do not overlap, which means that FF3 and FF3′ are disjoint. As a result, we compare the fragment (FF2) preceding FF3 to FF3′ (Line 11). If the preceding fragment is contained in FF3′, ProcessPFs is called to continue processing the preceding fragments and adapt the structure of the graph. If any other relationship is found, we link the new fragment to the fragment preceding the target fragment (Line 15), and we remove the view reference between the target fragment and the target view (Line 14). In the current example, because FF2 contains FF3′, we add a link between FF2 and FF3′. As the fragments after FF3 in V4 are also shared by other views (in this case DF6 only), we also need a copy of DF6, DF6′, that maps to the same sequence of schema nodes. We then link DF6′ to FF3′ as shown in Figure 3.

Algorithm 8: AdaptFragment(G, F_n, ref)
Input: the MFM View Graph G, F_n is the adapted fragment, ref specifies the target view
Output: an adapted MFM view G
 1  F_nn = F_n→GetNextByRef(ref);
 2  while F_nn is not View Fragment do
 3      if F_nn is materialized then
 4          F_p = F_nn;
 5          while F_p is not Root Fragment do
 6              if F_p is materialized then
 7                  adapt F_m using F_p;
 8                  return;
 9              F_p = F_p→GetPrevious();
10          materialize F_m from scratch;
11          return;
12      F_nn = F_nn→GetNextByRef(ref);
13  return G;

Fig. 4. Data Adaptation
5.3 Maintenance of Materialized Fragment
The above process maintains only the structure of the XFM view graph G in response to the change. We now give a brief description of the data adaptation algorithm, which maintains the existing materialized fragments that are affected by the change. Algorithm 8 (Fig. 4) adapts the materialized data in response to the change. In the current version, we simply search for an existing materialized fragment that we can reuse to maintain the affected fragment (Lines 5-9). The cost of using such an existing fragment is less than that of other approaches, as we will show in §6. If there is a materialized fragment that is affected by the adapted view (Line 3), we search the fragments that precede the adapted fragment for one that can be reused by the affected one (Line 5). In our future work, we will provide a solution to adapt the affected materialized fragments by either inserting extra data into, or removing redundant data from, the fragments.
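The reuse search described above can be sketched as a walk over a linked chain of fragments. This is a simplified reading of Algorithm 8 under stated assumptions: the `Fragment` class, its fields and `plan_data_adaptation` are hypothetical names, the backward search starts at the predecessor of the affected fragment, and the function only reports which fragment would be reused instead of touching any data.

```python
class Fragment:
    """Toy fragment node; field names are illustrative, not the authors' API."""

    def __init__(self, name, kind="inner", materialized=False):
        self.name = name
        self.kind = kind                  # "root", "inner" or "view"
        self.materialized = materialized
        self.prev = None                  # preceding fragment
        self.next_by_ref = {}             # view reference -> next fragment


def link(a, b, ref):
    """Connect a -> b within the view identified by ref."""
    a.next_by_ref[ref] = b
    b.prev = a


def plan_data_adaptation(fn, ref):
    """Return (affected, source) for the first affected materialized fragment.

    source is the name of a preceding materialized fragment to reuse,
    or None when the affected fragment must be materialized from scratch.
    Returns None when no materialized fragment follows fn in the view.
    """
    nxt = fn.next_by_ref.get(ref)
    while nxt is not None and nxt.kind != "view":
        if nxt.materialized:
            p = nxt.prev  # assumption: search starts at the predecessor
            while p is not None and p.kind != "root":
                if p.materialized:
                    return (nxt.name, p.name)
                p = p.prev
            return (nxt.name, None)
        nxt = nxt.next_by_ref.get(ref)
    return None


def build_chain(ff2_materialized):
    """Hypothetical chain root -> FF2 -> FF3' -> DF6' -> VF4 for view V4."""
    root = Fragment("root", kind="root")
    ff2 = Fragment("FF2", materialized=ff2_materialized)
    ff3n = Fragment("FF3'")
    df6n = Fragment("DF6'", materialized=True)
    vf4 = Fragment("VF4", kind="view")
    for a, b in [(root, ff2), (ff2, ff3n), (ff3n, df6n), (df6n, vf4)]:
        link(a, b, "V4")
    return ff3n


print(plan_data_adaptation(build_chain(True), "V4"))  # ("DF6'", 'FF2')
```

Whether FF2 is actually materialized in the paper's view graph is not stated; the flag is set here only to exercise both outcomes of the search.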
6 Results of Experiments
In this section, we demonstrate the performance of different adaptation methods. We then compare our XFM adaptation approach to the Full Materialization (FULL) approach, which is more traditional and based on the materialization of entire views.

Fig. 5. Evaluation Result: (a) Query Processing Cost, (b) Materialisation Cost. Both panels plot seconds for XFM and FULL over the sequence of changes: 1) Modify Predicate, 2) Modify Predicate, 3) Add Step, 4) Remove Predicate, 5) Add Predicate, 6) Remove Step, 7) Add Step, 8) Delete Predicate, 9) Add Step.
6.1 Experiment Setup
Two XML database servers were deployed for this experiment: a remote MonetDB server and a local MonetDB server. The remote server contains all XML source data, and the local server stores all XPath views and their fragments. This models a typical data warehousing system where data is often distributed due to the high volumes generated. We use version 4.34.4 for both the remote and the local MonetDB server. The remote MonetDB server runs on an Intel Core(TM)2 Duo 2.66GHz workstation running 64-bit Fedora Server version 12. The local MonetDB server is installed on an Intel Core(TM)2 Duo CPU 3.00GHz Windows 7 workstation. The second server contains all materialised XPath views with data obtained from the remote site. We use the XMark benchmark in our experiment, which generates a narrow and deep XML document with a maximum of 13 levels. The document we generated is 1GB. We implemented all our adaptation algorithms in Java. The SchemaGuide is built at document parsing time and stored in main memory. Our XFM view graph contains 15 XPath views, including the XPath views demonstrated in Figure 1. The XFM view graph is built automatically by a Java program. The total size of the materialized data within the chosen fragments is 27MB, only 2.6% of the original document. The list of queries is contained in an accompanying technical report, which contains a more detailed breakdown and analysis of the experiments [25].

6.2 Experimental Analysis
We now demonstrate the performance of view adaptation using the XFM and FULL approaches. The experiment is initialized by materializing XPath views with data obtained from the remote MonetDB server. To demonstrate the benefit of the XFM approach, we assume that it will always use data from the materialized fragments shared across the XPath views. Concerning the query processing cost, the full materialization approach clearly outperformed the fragment-based approach since some fragments are still virtual (not materialized). However, we show that when considering the total adaptation cost, our fragment-based approach provides superior optimization. Our experiment demonstrates the general cost of both structurally and physically adapting the existing fragments in response to a sequence of mixed types of changes. Our approach provides a trade-off between maintenance cost and query processing cost, which is important in a volatile query environment.

We apply a sequence of changes to the 15 XPath views. The changes are a combination of different types, e.g., Add a Predicate or Remove a Step. Examples of changes that were introduced (from the full set in [25]) are modifying the predicate date>'10/12/1999' to date>'01/04/2002' in V4 and adding a step /descendant::description in V3 before the step /descendant::text. This sequence of changes is evaluated sequentially by both the XFM and the FULL approach. The evaluation analysis is based on the following costs:
i) Recomputing Cost, the cost of evaluating the query (view) after applying the change;
ii) Data Transferring Cost, the cost of transferring the query result from the remote XML database server to the local XML database server;
iii) Materialization Cost, the cost of parsing the XML query result (an XML document) and storing it in the local XML database server.

The data transferring cost does not count for the XFM approach, as we assume that it reuses the existing fragments on the local server in response to the changes. Therefore, we use the term Query Processing Cost to indicate the sum of the Recomputing Cost and the Data Transferring Cost. Figure 5a lists the query processing cost for performing a sequence of changes on the XFM view graph. The types of the changes are displayed on the x-axis and are applied from left to right. It is clear that our approach is far more efficient than the FULL approach, as we reuse the existing materialized fragments (27MB) during the adaptation process. On the other hand, the FULL approach must recompute the query after applying each change, and additionally it has to query the remote MonetDB database (>1GB) and retrieve data from it. Figure 5b compares the materialization cost of both approaches.
– For changes 1, 2 and 8, the XFM approach requires much more time for materialization. The reason is that the FULL approach rematerializes only the target view after each requested change. In the XFM approach, however, we must rematerialize the affected fragment, which in this case is a superset of the result set generated by the target view. In other words, the data set the fragment contains is larger than the actual size of the view result.
– For changes 4, 5 and 7, the data set contained in the affected fragment is close to the final view result, so the materialization costs are similar for both approaches.
– There is no materialization cost for changes 3, 6 and 9 for the XFM approach. In some cases, a change may not affect any existing materialized fragment at all; it only requires a structural adaptation of the XFM view graph.

Although both approaches have a similar average materialization cost, our approach provides superior adaptation performance when the overall view adaptation cost is considered. Figure 6 gives the total adaptation cost after applying each change.

Fig. 6. Total Adaptation Cost (seconds, XFM and FULL, for changes 1 to 9). The data labels in the figure range from 24.057 to 24.665 for FULL and from 0 to 1.234 for XFM.

It is obvious that our approach is much more efficient than the FULL approach. This is because, rather than recomputing the query from scratch after applying each change, we adapt the affected materialized fragment by reusing the existing ones. In our example, the total size of the existing materialized fragments is only approximately 2.6% of the source dataset. Therefore, even though there is no data transferring cost, querying the existing materialized data is much more efficient than querying the source dataset (1GB). The FULL approach, however, has to recompute the query and retrieve data from the remote database. This becomes even worse when the data required is located on multiple sites.
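The cost terms used in this analysis can be restated compactly. The symbols below are introduced here for illustration only, and treating the total adaptation cost as the plain sum of the three components is an assumption: the text implies it but does not give a formula.

```latex
\begin{align*}
C_{\text{qp}}    &= C_{\text{rec}} + C_{\text{trans}}
  && \text{(Query Processing Cost)} \\
C_{\text{total}} &= C_{\text{qp}} + C_{\text{mat}}
  && \text{(assumed total adaptation cost)} \\
C_{\text{trans}}^{\text{XFM}} &= 0
  && \text{(XFM is assumed to reuse local fragments)}
\end{align*}
```

Under this reading, XFM can lose on the materialization term for individual changes (as for changes 1, 2 and 8) and still win on the total, because the recomputing and transfer terms dominate for FULL.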
7 Conclusions
In this paper, we presented a fragment-based view adaptation method for materialized XPath views. The benefit of our approach is that, in the event of a change to views, it is not necessary to recompute the entire view. Additionally, the containment problem between two XPath views is reduced to a fragment-based comparison rather than a query-based one. We provide a SchemaGuide to facilitate the view adaptation process by efficiently determining the containment relationship. Our experimental analysis shows that significant gains in performance can be achieved with this approach. While the fragmented approach has been applied in relational databases, it has not been used in XML systems in the way our approach uses it. The structure of XML trees, together with the more complex format of expressions, provides a more significant challenge. In our current version of XFM, we manually select fragments for materialization when building the view graph. In future work, we are creating a cost-based method for fragment selection, which will provide a more fully automated fragment-based approach to XML view materialization.
References
1. Boncz, P.A., Grust, T., van Keulen, M., Manegold, S., Rittinger, J., Teubner, J.: MonetDB/XQuery: A Fast XQuery Processor Powered By A Relational Engine. In: SIGMOD 2006 (2006)
2. Marks, G., Roantree, M.: Metamodel-Based Optimisation of XPath Queries. In: BNCOD 2009 (2009)
3. Bruno, N., Koudas, N., Srivastava, D.: Holistic twig joins: optimal XML pattern matching. In: SIGMOD 2002 (2002)
4. Grust, T.: Accelerating XPath Location Steps. In: SIGMOD 2002 (2002)
5. Tang, N., Yu, J.X., Tang, H., Özsu, M.T., Boncz, P.A.: Materialized view selection in xml databases. In: DASFAA 2009 (2009)
6. Arion, A., Benzaken, V., Manolescu, I., Papakonstantinou, Y.: Structured Materialized Views for XML Queries. In: VLDB 2007 (2007)
7. Balmin, A., Özcan, F., Beyer, K.S., Cochrane, R., Pirahesh, H.: A Framework for Using Materialized XPath Views in XML Query Processing. In: VLDB 2004 (2004)
8. Lakshmanan, L.V.S., Wang, H., Zhao, Z.: Answering Tree Pattern Queries Using Views. In: VLDB 2006 (2006)
9. Cautis, B., Deutsch, A., Onose, N.: XPath Rewriting Using Multiple Views: Achieving Completeness and Efficiency. In: WebDB 2008 (2008)
10. Gao, J., Wang, T., Yang, D.: MQTree Based Query Rewriting over Multiple XML Views. In: Wagner, R., Revell, N., Pernul, G. (eds.) DEXA 2007. LNCS, vol. 4653, Springer, Heidelberg (2007)
11. Tang, N., Yu, J., Ozsu, M., Choi, B., Wong, K.F.: Multiple materialized view selection for xpath query rewriting. In: ICDE 2008 (2008)
12. Sawires, A., Tatemura, J., Po, O., Agrawal, D., Candan, K.S.: Incremental Maintenance of Path-Expression Views. In: SIGMOD 2005 (2005)
13. Lim, C.H., Park, S., Son, S.H.: Access Control of XML Documents Considering Update Operations. In: XMLSEC 2003 (2003)
14. Gupta, A., Mumick, I.S., Ross, K.A.: Adapting Materialized Views after Redefinitions. In: SIGMOD 1995 (1995)
15. Liu, J., Roantree, M., Bellahsene, Z.: Optimizing XML Data with View Fragments. In: ADC 2010 (2010)
16. Liu, J., Roantree, M.: Precomputing Queries for Personal Health Sensor Environments. In: MEDES 2009 (2009)
17. Liu, J.: A SchemaGuide for Accelerating the View Adaptation Process. Technical report, Dublin City University (2010), http://www.computing.dcu.ie/~isg/
18. Deutsch, A., Tannen, V.: Containment and Integrity Constraints for XPath. In: KRDB 2001 (2001)
19. Miklau, G., Suciu, D.: Containment and Equivalence for a Fragment of XPath. Journal of the ACM 51, 2–45 (2004)
20. Neven, F., Schwentick, T.: On the Complexity of XPath Containment in the Presence of Disjunction, DTDs, and Variables. CoRR (2006)
21. Izadi, S.K., Härder, T., Haghjoo, M.S.: S3: Evaluation of Tree-Pattern XML Queries Supported by Structural Summaries. Data and Knowledge Engineering 68, 126–145 (2009)
22. Ayyagari, P., Mitra, P., Lee, D., Liu, P., Lee, W.C.: Incremental Adaptation of XPath Access Control Views. In: ASIACCS 2007 (2007)
23. Bellahsene, Z.: View Adaptation In The Fragment-Based Approach. IEEE Transactions on Knowledge and Data Engineering 16, 1441–1455 (2004)
24. Liu, J., Roantree, M.: OTwig: An Optimised Twig Pattern Matching Approach for XML Databases. In: van Leeuwen, J., Muscholl, A., Peleg, D., Pokorný, J., Rumpe, B. (eds.) SOFSEM 2010. LNCS, vol. 5901. Springer, Heidelberg (2010)
25. Liu, J.: Schema Aware XML View Adaptation. Technical report, Dublin City University (2010), http://www.computing.dcu.ie/~isg/
Complexity of Reasoning over Temporal Data Models

Alessandro Artale¹, Roman Kontchakov², Vladislav Ryzhikov¹, and Michael Zakharyaschev²
¹ Faculty of Computer Science, Free University of Bozen-Bolzano, I-39100 Bolzano, Italy, [email protected]
² Department of Computer Science, Birkbeck College, London WC1E 7HX, UK, {roman,michael}@dcs.bbk.ac.uk
Abstract. We investigate the computational complexity of reasoning over temporal extensions of conceptual data models. The temporal conceptual models we analyse include the standard UML/EER constructs, such as ISA between entities and relationships, disjointness and covering, cardinality constraints and their refinements, multiplicity and key constraints; in the temporal dimension, we have timestamping, evolution, transition and lifespan cardinality constraints. We give a nearly comprehensive picture of the impact of these constructs on the complexity of reasoning, which can range from NLOGSPACE to undecidability.
1 Introduction

Temporal conceptual data models [25,24,14,15,19,4,23,12] extend standard conceptual schemas with means to visually represent temporal constraints imposed on temporal database instances. According to the glossary [18], such constraints can be divided into three categories, to illustrate which we use the temporal data model in Fig. 1.

Timestamping constraints are used to discriminate between those elements of the model that change over time (they are called temporary) and others that are time-invariant, or snapshot. Timestamping is realised by marking entities, relationships and attributes by T (for temporary) or S (for snapshot), which is then translated into a timestamping mechanism of the database. In Fig. 1, Employee and Department are snapshot entities, Name and PaySlipNumber are snapshot attributes and Member a snapshot relationship. On the other hand, Manager is a temporary entity, Salary a temporary attribute and WorksOn a temporary relationship. If no timestamping constraint is imposed on an element, it is left unmarked (e.g., Manages).

Evolution and transition constraints control permissible changes of database states [13,23,7]. For entities, we talk about object migration from one entity to another [17]. Transition constraints presuppose that migration happens in some specified amount of time. For example, the dashed arrow marked by TEX in Fig. 1 means that each Project expires in one year. On the other hand, evolution constraints are qualitative in the sense that they do not specify the moment of migration. In Fig. 1, an AreaManager will eventually become a TopManager (the dashed arrow marked by DEV), a Manager was once an Employee (DEX−), and a Manager cannot be demoted (PEX).

J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 174–187, 2010. © Springer-Verlag Berlin Heidelberg 2010
Fig. 1. An ER_VT temporal conceptual data model
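The two evolution constraints DEV and DEX− mentioned above can be checked mechanically on a finite trace of database states. The sketch below is only an approximation made for illustration: the real semantics quantifies over all integer time points, whereas here the list `trace` is assumed to be the whole timeline, and the function names and the object `ann` are hypothetical.

```python
def satisfies_dev(e1, e2, trace):
    """E1 DEV E2: every member of E1 at time t belongs to E2 at some t' > t."""
    for t, state in enumerate(trace):
        for obj in state.get(e1, set()):
            later = range(t + 1, len(trace))
            if not any(obj in trace[u].get(e2, set()) for u in later):
                return False
    return True


def satisfies_dex_minus(e1, e2, trace):
    """E1 DEX- E2: every member of E1 at time t belonged to E2 at some t' < t."""
    for t, state in enumerate(trace):
        for obj in state.get(e1, set()):
            if not any(obj in trace[u].get(e2, set()) for u in range(t)):
                return False
    return True


# A three-instant toy trace: each state maps an entity name to its members.
trace = [
    {"Employee": {"ann"}},
    {"Employee": {"ann"}, "Manager": {"ann"}, "AreaManager": {"ann"}},
    {"Employee": {"ann"}, "Manager": {"ann"}, "TopManager": {"ann"}},
]

print(satisfies_dex_minus("Manager", "Employee", trace))   # True
print(satisfies_dev("AreaManager", "TopManager", trace))   # True
```

Truncating the trace after the second instant makes the DEV check fail, because `ann` is then an AreaManager with no later state in which she is a TopManager.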
Evolution-related knowledge can also be conveyed through generation relationships [16,23]. For instance, the generation relationship Propose between Manager and Project (marked by GP, with an arrow pointing to Project) in Fig. 1 means that managers may create new projects.

Lifespan cardinality constraints [14,23,24] are temporal counterparts of standard cardinality constraints in that they restrict the number of times an object can participate in a relationship. However, the standard cardinality constraints are evaluated at each moment of time, while the lifespan cardinality constraints are evaluated over the entire existence of the object. In Fig. 1, we use [k, l] to specify the lifespan cardinalities and (k, l) for the standard ones: for instance, every TopManager manages exactly one project at each moment of time but not more than five different projects over the whole lifetime.

The temporal conceptual model ER_VT we consider in this paper is a generalisation of the formalisms introduced in [4,7]. Apart from the temporal constraints discussed above, it includes the standard UML/EER constructs such as ISA (solid arrows in Fig. 1), disjointness (circled d) and covering (double arrows) constraints between entities and relationships, cardinality constraints and their refinements, multiplicity constraints for attributes and key constraints for entities (underlined). The language of ER_VT and its model-theoretic semantics are defined in Section 2.

This formalisation of temporal conceptual models also provides us with a rigorous definition of various quality properties of temporal conceptual schemas. For instance, a minimal quality requirement for a schema is its consistency in the sense that the constraints are not contradictory or, semantically, are satisfied in at least one (nonempty) database instance. We may also need a guarantee that some entities and relationships in the schema are not necessarily empty (at some or all moments of time) or that one entity (relationship) is not subsumed by another one. To automatically check such quality properties, it is essential to provide an effective reasoning support for the construction phase of a temporal conceptual model.

The main aim of this paper is to investigate the impact of various types of temporal and atemporal constraints on the computational complexity of checking quality properties of ER_VT temporal conceptual models. First, we distinguish between the full (atemporal) EER language ER_full, its fragment ER_bool where ISA can only be used for entities (but not relationships), and the fragment ER_ref thereof where covering constraints are not available. Reasoning in these (non-temporal) formalisms is known to be, respectively, EXPTIME-, NP- and NLOGSPACE-complete [10,2]. We then combine each of these EER languages with the temporal constraints discussed above and give a nearly comprehensive classification of their computational behaviour. Of the obtained complexity results summarised in Table 1, Section 2.2, we emphasise here the following:
– It is known [1] that timestamping and evolution constraints together cause undecidability of reasoning over ER_full (EXPTIME-completeness, if timestamping is restricted to entities [5]). We show in Section 3, however, that reasoning becomes only NP-complete if the underlying EER component is restricted to ER_bool or ER_ref.
– Timestamping and transition constraints over ER_bool result in NP-complete reasoning; addition of evolution constraints increases the complexity to PSPACE.
– Evolution constraints over ER_ref give NP-complete reasoning; transition constraints result in PTIME, while timestamping over the same ER_ref gives only NLOGSPACE.
– Reasoning in ER_full with both lifespan cardinalities and timestamping is known to be 2EXPTIME-complete [8]. We show that for ER_bool restricted to binary relationships the problem becomes NP-complete.
– Reasoning with lifespan cardinalities and transition (or evolution) constraints is undecidable over both ER_full and ER_bool.

We prove these results by exploiting the tight correspondence between conceptual modelling formalisms and description logics (DLs) [10,2], the family of knowledge representation formalisms tailored towards effective reasoning about structured class-based information [9]. DLs form a basis of the Web Ontology Language OWL¹, which is now in the process of being standardised by the W3C in its second edition OWL 2. We show in Section 3 how temporal extensions of DLs (see [21] for a recent survey) can be used to encode the different temporal constraints and thus to provide complexity results for reasoning over temporal data models.
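The difference between standard and lifespan cardinalities drawn in the introduction can be illustrated on the Manages example, with (1,1) evaluated per instant and [0,5] over the whole lifetime. This is an informal sketch with hypothetical function names and made-up project identifiers; `history` is assumed to list, for one TopManager, the set of projects managed at each instant of that object's existence.

```python
def standard_cardinality_ok(history, lo, hi):
    """(lo, hi): the bound must hold at every single instant."""
    return all(lo <= len(projects) <= hi for projects in history)


def lifespan_cardinality_ok(history, lo, hi):
    """[lo, hi]: the bound applies to the distinct projects over all instants."""
    distinct = set().union(*history)
    return lo <= len(distinct) <= hi


# One project at a time, but six different projects over the lifetime.
history = [{"p1"}, {"p2"}, {"p3"}, {"p4"}, {"p5"}, {"p6"}]

print(standard_cardinality_ok(history, 1, 1))   # True
print(lifespan_cardinality_ok(history, 0, 5))   # False
```

The example history satisfies the per-instant constraint while violating the lifespan one, which is exactly the situation that the two kinds of brackets in Fig. 1 are meant to tell apart.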
2 The Temporal Conceptual Model ERVT
To give a formal foundation to temporal conceptual models, we describe here how to associate a textual syntax and a model-theoretic semantics with an EER/UML modelling language. In particular, we consider the temporal EER model ERVT generalising the formalisms of [4,7] (VT stands for valid time). ERVT supports timestamping for entities, attributes and relationships, as well as evolution constraints and lifespan cardinalities. It is upward compatible (i.e., preserves the non-temporal semantics of conventional (legacy) conceptual schemas) and snapshot reducible [20,23] (i.e., at each moment of time, the atemporal constraints can be verified by the database with the given temporal schema). ERVT is equipped with both textual and graphical syntax along with a model-theoretic semantics defined as a temporal extension of the EER semantics [11]. We illustrate the formal definition of ERVT using the schema in Fig. 1.
1 http://www.w3.org/2007/OWL/
Complexity of Reasoning over Temporal Data Models
Throughout, by an X-labelled n-tuple over Y we mean any sequence of the form ⟨x1 : y1, ..., xn : yn⟩, where xi ∈ X with xi ≠ xj if i ≠ j, and yi ∈ Y. A signature is a quintuple L = (E, R, U, A, D) consisting of disjoint finite sets E of entity symbols, R of relationship symbols, U of role symbols, A of attribute symbols and D of domain symbols. Each relationship R ∈ R is assumed to be equipped with some k ≥ 1, the arity of R, and a U-labelled k-tuple REL(R) = ⟨U1 : E1, ..., Uk : Ek⟩ over E. For example, the binary relationship WorksOn in Fig. 1 has two roles emp and act, ranging over Employee and Project, respectively, i.e., REL(WorksOn) = ⟨emp : Employee, act : Project⟩. Each entity E ∈ E comes equipped with an A-labelled h-tuple ATT(E) = ⟨A1 : D1, ..., Ah : Dh⟩ over D, for some h ≥ 0. For example, the entity Employee in Fig. 1 has three attributes: Name of type String and PaySlipNumber and Salary of type Integer. Domain symbols D ∈ D are assumed to be associated with pairwise disjoint countably infinite sets BD called basic domains. In Fig. 1, the basic domains are the set of integer numbers (for Integer) and the set of strings (for String).
A temporal interpretation of signature L is a structure of the form I = (Z, […] t with $e \in E_2^{I(t')}$.
– E1 DEX− E2 is satisfied in I if, for each $e \in E_1^{I(t)}$ and t ∈ Z, there exists t′ < t such that $e \in E_2^{I(t')}$.
In Fig. 1, AreaManager DEV TopManager means that every AreaManager will eventually migrate to TopManager and Manager DEX− Employee means that every Manager was once an Employee.
There are also three types of persistent evolution constraints:
– E1 PEV E2 is satisfied in I if, for each $e \in E_1^{I(t)}$, t ∈ Z, we have $e \in E_2^{I(t')}$ and $e \notin E_1^{I(t')}$, for all t′ > t.
– E1 PEX E2 is satisfied in I if, for each $e \in E_1^{I(t)}$, t ∈ Z, we have $e \in E_2^{I(t')}$, for all t′ > t.
– E1 PEX− E2 is satisfied in I if, for each $e \in E_1^{I(t)}$, t ∈ Z, we have $e \in E_2^{I(t')}$, for all t′ < t.
In Fig. 1, Manager PEX Manager reflects the persistent status of Manager (once a manager, always a manager).
Transition constraints (TRANS) are defined only for entities and can be of three types:
– E1 TEV E2 is satisfied in I if, for each $e \in E_1^{I(t)}$, t ∈ Z, we have $e \in E_2^{I(t+1)}$ and $e \notin E_1^{I(t+1)}$.
– E1 TEX E2 is satisfied in I if, for each $e \in E_1^{I(t)}$, t ∈ Z, we have $e \in E_2^{I(t+1)}$.
– E1 TEX− E2 is satisfied in I if, for each $e \in E_1^{I(t)}$, t ∈ Z, we have $e \in E_2^{I(t-1)}$.
In Fig. 1, Project TEX Ex-Project means that every Project will expire in one year.
Lifespan cardinality constraints (LFC) and their refinements are defined as follows:
– Let REL(R) = ⟨U1 : E1, ..., Uk : Ek⟩. For 1 ≤ i ≤ k, the lifespan cardinality constraint L-CARD(R, Ui, Ei) = (α, β) with α ∈ N and β ∈ N ∪ {∞} is satisfied in I if, for all t ∈ Z and $e \in E_i^{I(t)}$,
$$\alpha \;\le\; \sharp \bigcup_{t' \in \mathbb{Z}} \{ \langle U_1 : e_1, \dots, U_i : e_i, \dots, U_k : e_k \rangle \in R^{I(t')} \mid e_i = e \} \;\le\; \beta. \qquad (2)$$
In Fig. 1, TopManager can Manage at most five distinct Projects throughout the whole life.
– Let REL(R) = ⟨U1 : E1, ..., Uk : Ek⟩ and Ei′ ISA Ei, for some 1 ≤ i ≤ k. The refinement L-REF(R, Ui, Ei′) = (α, β) of the lifespan cardinality constraint is satisfied in I if (2) holds for all t ∈ Z and $e \in (E_i')^{I(t)}$.
Generation relationships (GEN) are a sort of evolution constraints conveyed through relationships. Suppose R is a binary relationship with REL(R) = ⟨s : E1, t : E2⟩, where s and t are two fixed role symbols.
A. Artale et al.
– A production relationship constraint GP(R) = R′, where R′ is a fresh relationship with REL(R′) = ⟨s : E1, t : E2′⟩ and E2′ a fresh entity (i.e., R′ and E2′ do not occur in other constraints), is satisfied in I if, for all t ∈ Z, we have $E_2^{I(t)} \cap (E_2')^{I(t)} = \emptyset$ and if $\langle s : e_1, t : e_2 \rangle \in (R')^{I(t)}$ then $e_1 \in E_1^{I(t)}$ and $e_2 \in (E_2')^{I(t)} \cap E_2^{I(t+1)}$. In Fig. 1, the fact that Managers can create at most one new Project at a time is captured by constraining Propose to be a production relationship (marked by GP) together with the (0, 1) cardinality constraint.
– The transformation relationship constraint GT(R) = R′, where R′ is a fresh relationship with REL(R′) = ⟨s : E1, t : E2′⟩ and E2′ a fresh entity, is satisfied in I if, for all t ∈ Z, we have $E_2^{I(t)} \cap (E_2')^{I(t)} = \emptyset$ and if $\langle s : e_1, t : e_2 \rangle \in (R')^{I(t)}$ then $e_1 \in E_1^{I(t)}$, $e_2 \in (E_2')^{I(t)} \cap E_2^{I(t+1)}$ and $e_1 \notin E_1^{I(t')}$, for all t′ > t.
Note that the production relationship constraint GP(R) = R′ can be equivalently replaced with the disjointness and evolution constraints E2′ DISJ E2 and E2′ TEX E2. Similarly, the transformation relationship constraint GT(R) = R′ can be equivalently replaced with REL(R′) = ⟨s : E1′, t : E2′⟩, E2′ DISJ E2, E2′ TEX E2, E1′ ISA E1, E1′ PEX E1′′ and E1′′ DISJ E1, where E1′ and E1′′ are fresh entities. Therefore, in what follows we do not consider generation relationship constraints.
2.1 Reasoning Problems
The basic reasoning problems over temporal data models we are concerned with in this paper are entity, relationship and schema consistency, and subsumption for entities and relationships. To define these problems, suppose that L = (E, R, U, A, D) is a signature, E1, E2 ∈ E, R1, R2 ∈ R and Σ is an ERVT schema over L. Σ is said to be consistent if there exists a temporal interpretation I over L satisfying all the constraints from Σ and such that $E^{I(t)} \neq \emptyset$, for some E ∈ E and t ∈ Z. In this case we also say that I is a model of Σ. The entity E1 (relationship R1) is consistent with respect to Σ if there exists a model I of Σ such that $E_1^{I(t)} \neq \emptyset$ (respectively, $R_1^{I(t)} \neq \emptyset$), for some t ∈ Z. The entity E1 (relationship R1) is subsumed by the entity E2 (relationship R2) in Σ if any model of Σ is also a model of E1 ISA E2 (respectively, R1 ISA R2). It is well known that the reasoning problems of checking schema, entity and relationship consistency, as well as entity and relationship subsumption are reducible to each other (see [10,2] for more details). Note, however, that if the covering construct is not available, to check schema consistency we have to run, in the worst case, as many entity consistency checks as the number of entities in the schema. In what follows, we only consider the entity consistency problem.
2.2 Complexity of Reasoning
We investigate the complexity of reasoning not only for the full ERVT but also for its various sub-languages obtained by weakening either the EER or the temporal component. We consider the three non-temporal EER fragments identified in [2]:
Table 1. Complexity of reasoning in fragments of ERVT

                   | EER component
temporal features  | ERfull          | ERbool          | ERref
TS                 | 2EXPTIME [8]    | NP [Th.2]       | NLOGSPACE [Th.5]
TRANS              | EXPTIME [5]     | PSPACE [Th.1]   | in PTIME [Th.4]
TS, TRANS          | UNDEC. [Th.6]   | PSPACE [Th.1]   | in PTIME [Th.4]
EVO                | EXPTIME [5]     | NP [Th.2]       | NP [Th.3]
TS, EVO            | UNDEC. [1]      | NP [Th.2]       | NP [Th.3]
TRANS, EVO         | EXPTIME [5]     | PSPACE [Th.1]   | NP [Th.3]
TS, TRANS, EVO     | UNDEC. [1]      | PSPACE [Th.1]   | NP [Th.3]
TS, LFC            | 2EXPTIME [8]    | NP† [Th.7]      | in NP† [Th.7]
TRANS, LFC         | UNDEC. [Th.8]   | UNDEC. [Th.8]   | ?
EVO, LFC           | UNDEC. [Th.6]   | ?               | ?

(†) This result is proved only for binary relationships.
– ERfull contains all the EER constraints.
– ERbool has ISA only between entities; it is also required that attributes do not change their types: if A : D is in ATT(E) and A : D′ in ATT(E′) then D′ = D.
– ERref is the fragment of ERbool without covering constraints.
As shown in [10,2], reasoning in these languages is, respectively, EXPTIME-, NP- and NLOGSPACE-complete. Table 1 summarises the complexity results known in the literature or to be proved below. Unless otherwise indicated, the complexity bounds are tight. In the subsequent sections, we denote the languages by explicitly indicating their EER and temporal components. For example, $ER_{full}^{TS, EVO, TRANS, LFC}$ denotes the full conceptual modelling language ERVT. The missing proofs can be found in the full version of the paper which is available at http://www.dcs.bbk.ac.uk/~roman/papers/er10-full.pdf.
3 Embedding Temporal ERbool/ref in Temporal DL-Litebool/core
We prove the positive (i.e., decidability) results in the table above by reducing reasoning over temporal data models based on ERbool and ERref without lifespan cardinality constraints to reasoning in temporal description logics based on variants of DL-Lite [3,6] (in fact, these temporal DL-Lite logics were originally designed with the aim of capturing temporal data models). The language of $T_{FPX}\text{DL-Lite}^{N}_{bool}$ contains concept names A0, A1, ..., local role names P0, P1, ... and rigid role names G0, G1, .... Roles R, basic concepts B and concepts C are defined as follows:
$$R ::= S \mid S^{-}, \qquad S ::= P_i \mid G_i, \qquad B ::= \bot \mid A_i \mid {\geq}\, q\, R,$$
$$C ::= B \mid \neg C \mid C_1 \sqcap C_2 \mid \Diamond_F C \mid \Box_F C \mid \Diamond_P C \mid \Box_P C \mid \bigcirc_F C \mid \bigcirc_P C,$$
where q ≥ 1 is a natural number (given in binary). We use the construct $C_1 \sqcup C_2$ as a standard Boolean abbreviation, and also set $\Diamond^{*} C = \Diamond_P \Diamond_F C$ and $\Box^{*} C = \Box_P \Box_F C$. A $T_{FPX}\text{DL-Lite}^{N}_{bool}$ interpretation I is a function
$$I(n) = \bigl(\Delta^{I}, A_0^{I(n)}, \dots, P_0^{I(n)}, \dots, G_0^{I(n)}, \dots\bigr), \qquad n \in \mathbb{Z},$$
where $\Delta^{I} \neq \emptyset$, $A_i^{I(n)} \subseteq \Delta^{I}$ and $P_i^{I(n)}, G_i^{I(n)} \subseteq \Delta^{I} \times \Delta^{I}$ with $G_i^{I(n)} = G_i^{I(k)}$, for all k ∈ Z. The role and concept constructs are interpreted in I as follows:
$$(S^{-})^{I(n)} = \{(y, x) \mid (x, y) \in S^{I(n)}\}, \qquad \bot^{I(n)} = \emptyset, \qquad (\neg C)^{I(n)} = \Delta^{I} \setminus C^{I(n)},$$
$$(C_1 \sqcap C_2)^{I(n)} = C_1^{I(n)} \cap C_2^{I(n)}, \qquad ({\geq}\, q\, R)^{I(n)} = \{ x \mid \sharp\{ y \mid (x, y) \in R^{I(n)} \} \geq q \},$$
$$(\Diamond_P C)^{I(n)} = \bigcup_{k<n} C^{I(k)}, \qquad (\Diamond_F C)^{I(n)} = \bigcup_{k>n} C^{I(k)}, \qquad (\Box_F C)^{I(n)} = \bigcap_{k>n} C^{I(k)}, \qquad (\Box_P C)^{I(n)} = \bigcap_{k<n} C^{I(k)},$$
[…]
> 0 and s = (s1, ..., sd) is a vector of scale factors. Given the importance of the scaling operation on scientific raster data, image pyramid support has previously been added at the application-logic level. However, this concept does not scale with the number of dimensions, and we were not convinced that it is optimal even for the 2D case given the variety of queries observed. With interest we observed that
A.G. Gutierrez and P. Baumann
in OLAP data cubes are rolled up efficiently along dimension hierarchies using preaggregation (both techniques will be discussed in Section 2). All this gave rise to our investigation on how to optimize scaled access. Obviously, materialization is a key candidate for optimization of scaling. Hence, our research question is: How can scaling operations on multi-dimensional scientific raster data be made more efficient? Our approach is to determine, for some given query set (the workload), a suitable set of preaggregates (i.e., downscaled versions) of the raster object addressed. For preaggregate selection, a benefit function indicates how well a preaggregate under consideration supports the workload; further, storage constraints are considered. Any subsequent query containing scaling will be inspected to find an equivalent query which relies on a suitable preaggregate rather than on the base set. We present the conceptual framework of this approach in Section 3 and the preaggregate selection and scale query rewriting algorithms in Sections 4 and 5, respectively. In Section 6 we report performance evaluations. Finally, conclusions are given in Section 7.
2 Related Work
Approaches to quickly scaling multi-dimensional gridded data exist in two rather distinct domains, OLAP and geo raster services. We inspect both in turn.
2.1 OLAP
In OLAP, multi-dimensional data spaces are spanned by axes (there called measures) where cell values (there called facts) sit at the grid intersection points (see, e.g., [11]). For example, a 3D data cube on car repair frequency can be spanned by time, garages, and car type. Facts might correspond to having a particular car in a particular garage at some time for repair. Measures normally are considered discrete; in the case of time, granularity possibly is one day because subsidiaries report their business volume every evening. This is paralleled by scientific data which are discretized into some raster during acquisition. Hence, the structure of an OLAP data cube is rather similar to a raster array. Dimension hierarchies serve to group (consolidate) value ranges along an axis. For example, individual vehicles might be categorized into models and brands. Querying data by referring to coordinates on the measure axes yields ground truth data, whereas queries using axes higher up a dimension hierarchy (such as car brand) will return aggregated values (such as repairs summed up by brand). A main differentiating criterion between OLAP and observation data is density: OLAP data are sparse, with typically only 3% to 5% of the cells populated, whereas, e.g., simulation data are 100% dense.
2.2 Image Pyramids
In Geographic Information Systems (GIS) technology, multi-scale image pyramids have long been used to improve performance of scaling operations on 2D
Preaggregation for Spatio-temporal Scaling
raster images [2]. Basically, this technique consists of resampling the original dataset and creating a specific number of copies from it, each one resampled at a coarser resolution. This process resembles a primitive form of preaggregation. The pyramid consists of a finite number of levels, which differ in scale typically by a fixed step factor, and are much smaller in size than the original dataset but adequate for visualization at a lower scale (zoom ratio). Common practice is to construct pyramids in scale levels of a power of 2 (see Fig. 2). Materializing such a pyramid requires about 33% of extra storage space, which is considered acceptable in view of the massive performance gain obtained, in particular with larger scale factors.
Fig. 2. Sample image pyramid over a 2D map
Pyramid evaluation works as follows. During evaluation of a scaling query with some target scale factor s, the pyramid level with the largest scale factor s′ with s′ < s is determined. This level is loaded and then an adjustment is done by scaling this image by a factor of s/s′. If, for example, scaling by s = 11 is required then pyramid level 3 with scale factor s′ = 8 will be chosen, requiring a rest scaling of 11/8 = 1.375, thereby touching only 1/64 of what needs to be read without a pyramid. In comparison to the OLAP use case before, GIS image pyramids are particular in several respects:
– Image pyramids are constrained to 2D imagery. To the best of our knowledge there is no generalization of pyramids to nD.
– The x and y axis always are zoomed by the same scalar factor s in the 2D zoom vector (s, s). This is exploited by image pyramids in that they only offer preaggregates along a scalar range. In this respect, image pyramids actually are 1D preaggregates.
– Several interpolation methods are in use for the resampling done during scaling. Some of the techniques used in practice are standardized [8]; they include nearest-neighbor, linear, quadratic, cubic, and barycentric. The two scaling steps incurred (construction of the pyramid level and rest scaling) must be done using the same interpolation technique to achieve valid results. In OLAP, summation during roll-up normally is done in a way that corresponds to linear interpolation in imaging.
– Scale factors are continuous, as opposed to the discrete hierarchy levels in OLAP. It is, therefore, impossible to materialize all possible preaggregates.
2.3 Time Scaling
Shifts in temporal detail have been studied in various application domains, such as [7][13][14]. In GIS technology, up to now there is little support for zooming with respect to time [5][6][10].
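The pyramid evaluation rule of Section 2.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `pick_pyramid_level`, `fraction_read` and `pyramid_overhead` are hypothetical, the default levels are the powers of two from 2 to 256, and falling back to the base image (scale 1) when no level is small enough is our choice rather than something the text prescribes.

```python
def pick_pyramid_level(target_scale, levels=(2, 4, 8, 16, 32, 64, 128, 256)):
    """Return (level_scale, rest_scale) for a target scale factor.

    Chooses the largest materialized scale factor s' with s' < target_scale
    and the rest scaling target_scale / s'. If no level qualifies, the base
    image (scale 1) is used (an assumption of this sketch).
    """
    candidates = [s for s in levels if s < target_scale]
    level = max(candidates) if candidates else 1
    return level, target_scale / level


def fraction_read(level_scale, dims=2):
    """Fraction of the base cells held by a level: 1 / s'^dims."""
    return 1 / level_scale ** dims


def pyramid_overhead(levels=(2, 4, 8, 16, 32, 64, 128, 256), dims=2):
    """Extra storage of all levels relative to the base image."""
    return sum(fraction_read(s, dims) for s in levels)


# Worked example from the text: scaling by s = 11.
level, rest = pick_pyramid_level(11)
print(level, rest, fraction_read(level), round(pyramid_overhead(), 4))
```

For s = 11 the sketch picks s′ = 8 with a rest scaling of 1.375, and the summed level sizes of a 2D power-of-two pyramid come out at roughly one third of the base image, in line with the 33% figure quoted above.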
3 Conceptual Framework
3.1 Scaling Lattices
We focus on preaggregation for scaling operations. The relation between scaling operations can be represented as a lattice framework [4]. A scaling lattice consists of a set of queries L and dependence relations ⪯, denoted by ⟨L, ⪯⟩. The ⪯ operator imposes a partial ordering on the queries of the lattice. Consider two queries q1 and q2. We say q1 ⪯ q2 if q1 can be answered using only the results of q2. The base node of the lattice is the scaling operation with the smallest scale vector, upon which every query is dependent. The selection of preaggregates, that is, queries for materialization, is equivalent to selecting vertices from the underlying nodes of the lattice. Figure 3 shows a lattice diagram for a workload containing five queries. Each node has an associated label, which represents a scale operation defining a raster object, scale vector, and resampling method.
Fig. 3. Sample scaling lattice diagram for a workload with five scaling operations
In our approach, we use the following function to define scaling operations:
$$scale(objName[lo_1 : hi_1, \dots, lo_n : hi_n],\ s,\ resMeth) \qquad (1)$$
where
– objName[lo1 : hi1, ..., lon : hin] is the name of the n-dimensional raster object to be scaled down. The operation can be restricted to a specific area of the raster object. In such a case, the region is specified by defining lower (loi) and upper (hii) bounds for each dimension i. If the spatial domain is omitted, the operation shall be performed on the entire spatial extent defining the raster object.
– s is a vector whose elements are the numeric values by which the original raster object must be scaled down along each dimension.
– resMeth specifies the resampling method to be used.
For example, scale(CalFires, [2, 2, 2], nearest-neighbor) defines a scaling operation by a factor of 2 on each dimension, using Nearest Neighbor for resampling, on a three-dimensional dataset CalFires.
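As an illustration of Eq. 1, a scaling operation can be held in a small record type. The sketch below is hypothetical: the class `ScaleOp`, the helper `output_shape`, the inclusive `(lo, hi)` bounds and the rounding up of partial cells are all assumptions made for the example, not part of the framework described here.

```python
from dataclasses import dataclass
from math import ceil, prod
from typing import Optional, Tuple

Bounds = Tuple[Tuple[int, int], ...]  # inclusive (lo, hi) per dimension


@dataclass(frozen=True)
class ScaleOp:
    """One scaling operation scale(objName[lo:hi, ...], s, resMeth)."""
    obj_name: str
    scale: Tuple[float, ...]
    res_meth: str
    domain: Optional[Bounds] = None  # None means the entire spatial extent


def output_shape(op: ScaleOp, full_domain: Bounds) -> Tuple[int, ...]:
    """Cells per dimension of the scaled result (rounding up is assumed)."""
    domain = op.domain if op.domain is not None else full_domain
    if len(domain) != len(op.scale):
        raise ValueError("scale vector and domain differ in dimensionality")
    return tuple(ceil((hi - lo + 1) / s) for (lo, hi), s in zip(domain, op.scale))


# The example from the text, on a hypothetical 100 x 100 x 10 extent.
op = ScaleOp("CalFires", (2, 2, 2), "nearest-neighbor")
shape = output_shape(op, ((0, 99), (0, 99), (0, 9)))
print(shape, prod(shape))
```

Making the record immutable and hashable (`frozen=True`) lets identical operations be counted and compared directly, which is convenient when a workload contains repeated queries.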
3.2 Cost Model
We use a cost model in which the cost of answering a scaling operation (in terms of execution time) is driven by the number of disk I/Os and memory accesses required. These parameters are influenced by the number of tiles that are required to answer the query as well as by the number and size of the cells composing the raster object used to compute the query. The raster object can be the original object, or a suitable preaggregate. Several simplifying assumptions underlie our estimates.
1. Time to retrieve a tile from disk is constant, i.e., the same for all tiles.
2. Similarly, cell (pixel) access time in main memory is the same for all cells.
3. We do not take into account the time for performing the rest scaling on the preaggregates loaded for obtaining the final query answer. This is acceptable as the very goal of our approach is to keep the rest scale factors very small so that as little excess data as possible is touched.
Definition 1. Preaggregates Selection Problem. Given a query workload Q and a storage space constraint C, the preaggregates selection problem is to select a set P ⊆ Q of queries such that P minimizes the overall costs of computing Q while the storage space required by P does not exceed the limit given by C.
Based on view selection strategies that have proven successful in OLAP, we consider the following aspects for the selection of preaggregates:
– Query frequency. The rationale is that preaggregates yield particular speed-ups if they support scaling operations with high frequency.
– Storage space. We consider the amount of storage space taken up by a candidate scaling operation in the selection process. To simplify the algorithms we assume (realistically) that the storage space constraint is at least of the size of the query with the smallest scale vector. This assumption guarantees that for any workload at least one preaggregate can be made available.
– Benefit. Since a scaling operation may be used to compute not only the same but also other dependent queries in the workload, we use a metric to calculate the cost savings gained with such a candidate scaling operation; this we call the benefit of a preaggregate set. We normalize benefit against the base object's storage volume.
Frequency. The frequency of query q, denoted by F(q), is the number of occurrences of a given query in the workload divided by the total number of queries in the workload:
$$F(q) = N(q)/|Q| \qquad (2)$$
where N (q) is a function that returns the number of occurrences, and |Q| is the total number of queries in the workload. Storage space. The storage space of a given query, denoted by S(q), represents the storage space required to save the result of query q which is given by:
$$S(q) = Ncells(q) \cdot SizeCell(q) \qquad (3)$$
where Ncells(q) is a function that returns the number of cells composing the output object defined in query q, and SizeCell(q) is the storage size of one such cell.
Benefit. The benefit of a candidate scale operation for preaggregation q is given by adding up the savings in query cost for each scaling operation in the workload dependent on q, including all queries identical to q. That is, a query q may contribute to save processing costs for the same or similar queries in the workload. In both cases, matching conditions must be satisfied.
Full-Match Conditions. Let q be a candidate query for preaggregation and p a query in workload Q. Further, let p and q both be scaling operations as defined in Eq. 1. There is a full-match between q and p if and only if:
– objName[] in the scale function defined for q is the same as in p
– s in the scale function defined for q is the same as in p
– resMeth in the scale function defined for q is the same as in p
Partial-Match Conditions. Similarly, let q be the candidate query for preaggregation and p be a query in the workload Q. There is a partial-match between p and q if and only if:
– objName[] in the scale function defined for q is the same as in p
– resMeth in the scale function defined for q is the same as in p
– s is of the same dimensionality for both q and p
– each element of s in q is lower than the corresponding element in p, so that p can be obtained from the preaggregate q by a further downscaling (as with the pyramid level s′ < s in Section 2.2)
Definition 2. Benefit. Let T ⊆ Q be a subset of scaling operations that can be fully or partially computed from query q. The benefit B(q) of q is the sum of computational cost savings gained by selecting query q for preaggregation:
$$B(q) = \frac{F(q)}{C(q)} + \sum_{t \in T} \frac{F(t)}{C(t, q)} \qquad (4)$$
where F(q) represents the frequency of query q in the workload, C(q) is the cost of computing query q from the original raster object as given by our cost model, and C(t, q) is the relative cost of computing query t from q.
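Eqs. 2 to 4 translate almost directly into code. The following Python sketch is a minimal illustration: the function names are hypothetical, queries are represented by any hashable value, and the two cost functions are passed in as parameters because the cost model of Section 3.2 is described only qualitatively.

```python
from collections import Counter


def frequencies(workload):
    """F(q) = N(q) / |Q| for every distinct query q in the workload (Eq. 2)."""
    total = len(workload)
    return {q: n / total for q, n in Counter(workload).items()}


def storage(n_cells, cell_size):
    """S(q) = Ncells(q) * SizeCell(q) (Eq. 3)."""
    return n_cells * cell_size


def benefit(q, dependents, freq, cost_from_base, cost_from):
    """B(q) = F(q)/C(q) + sum over t in T of F(t)/C(t, q) (Eq. 4).

    dependents     -- the set T of queries computable from q
    cost_from_base -- callable C(q), supplied by the caller (assumption)
    cost_from      -- callable C(t, q), supplied by the caller (assumption)
    """
    total = freq[q] / cost_from_base(q)
    for t in dependents:
        total += freq[t] / cost_from(t, q)
    return total


# Toy workload of scalar scale factors with made-up constant costs.
freq = frequencies([2, 2, 4, 8])
b = benefit(2, [4, 8], freq, cost_from_base=lambda q: 4.0, cost_from=lambda t, q: 2.0)
print(freq, b)
```

Keeping the cost functions as parameters makes it possible to plug in a tile-based I/O estimate or a simple cell count without touching the benefit computation itself.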
4 Preaggregate Selection
Preaggregating all scaling operations in the workload is not always a possible solution because of space limitations. The same problem is found with the selection of preaggregates in OLAP. However, it is reasonable to assume that the storage space constraint is at least big enough to hold the result of one scaling operation in the workload.
One approach to finding the optimal set of scaling operations to precompute consists of enumerating all possible combinations and finding the one that yields the minimum average query cost, or the maximum benefit. Finding the optimal set of preaggregates in this way has a complexity of O(2^n), where n is the number of queries in the workload. If the number of scaling operations on a given raster object is 50, there exist 2^50 possible sets of preaggregates for that object. Therefore, computing the optimal set of preaggregates exhaustively is not feasible. In fact, it is an NP-hard problem [4][12]. In view of this, we consider the selection of preaggregates as an optimization problem where the input includes the raster objects, a query workload, and an upper bound on the available disk space. The output is a set of queries that minimizes the total cost of evaluating the query workload, subject to the storage limit. We present an algorithm that uses the benefit per unit space of a scaling operation. The expected queries we model by a query workload, which is a set of scaling operations Q = {qi | 1 ≤ i ≤ n} where each qi has an associated nonnegative frequency, F(qi). We normalize frequencies so that they sum up to 1, i.e., $\sum_{i=1}^{n} F(q_i) = 1$. Based on this setup we study different workload patterns.
The PreaggregatesSelection procedure returns a set P = {pi | 0 ≤ i ≤ n} of queries to be preaggregated. Input is a workload Q and a storage space constraint c. The workload contains a number of queries where each query corresponds to a scaling operation as defined in Eq. 1. Frequency, storage space, and benefit per unit space are calculated for each distinct query. When calculating the benefit, we assume that each query is evaluated using the root node, which is the first selected preaggregate, p1. The second chosen preaggregate p2 is the one with the highest benefit per unit space. The algorithm recalculates the benefit of each scaling operation given that it will be computed either from the root, if the scaling operation is above p1, or from p2, if not. Subsequent selections are performed in a similar manner: the benefit is recalculated each time a scaling operation is selected for preaggregation. The algorithm stops selecting preaggregates when the remaining storage space c is used up, or when there are no more queries in the workload that can be considered for preaggregation, i.e., all scaling operations have been selected for preaggregation. In this algorithm, highestBenefit(Q, P) is a function which returns the scaling operation in Q, not yet in P, with the highest benefit. Complexity is O(n log n), which arises from the cost of sorting the preaggregates by benefit per unit size.
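The greedy loop described above can be sketched as follows. This is not the authors' implementation: it is a simplified illustration that assumes a single raster object and resampling method, measures cost as the number of cells read, treats a preaggregate as usable when none of its scale factors exceeds the query's, charges the root against the storage budget, and uses absolute savings per unit space (in the style of OLAP greedy view selection) instead of Eq. 4. All function names are hypothetical.

```python
from math import prod


def size(q, base_cells):
    """Cells in the result of scaling the base object by vector q."""
    return base_cells / prod(q)


def usable(p, t):
    """Assumed dependence test: t can be derived from preaggregate p."""
    return len(p) == len(t) and all(pi <= ti for pi, ti in zip(p, t))


def cost(t, selected, base_cells):
    """Cells read to answer t from the smallest usable preaggregate."""
    sizes = [size(p, base_cells) for p in selected if usable(p, t)]
    return min(sizes, default=base_cells)


def select_preaggregates(freq, base_cells, budget):
    """Greedy selection by benefit per unit space under a storage budget."""
    root = min(freq, key=prod)  # assumed lattice root: smallest scale vector
    selected = [root]
    budget -= size(root, base_cells)
    candidates = set(freq) - {root}
    while candidates and budget > 0:
        def gain(c):
            saved = sum(
                f * (cost(t, selected, base_cells) - cost(t, selected + [c], base_cells))
                for t, f in freq.items()
            )
            return saved / size(c, base_cells)

        best = max(sorted(candidates), key=gain)
        if gain(best) <= 0 or size(best, base_cells) > budget:
            break  # mirrors the "else c := 0" branch of Algorithm 1
        selected.append(best)
        budget -= size(best, base_cells)
        candidates.remove(best)
    return selected


freq = {(2, 2): 0.4, (4, 4): 0.3, (8, 8): 0.3}
print(select_preaggregates(freq, base_cells=1024, budget=400))
```

Because the gain of every remaining candidate is recomputed after each selection, this sketch is quadratic in the number of distinct queries; a single sort by benefit per unit space, as in the O(n log n) figure above, would avoid the recomputation at the price of ignoring interactions between selected preaggregates.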
5 Scaling with Preaggregates
We now look at the problem of computing scaling operations in the presence of preaggregates. We say that a preaggregate p answers query q if there exists some other query q ′ which when executed on the result of p, provides the result of q. The result can be exact with respect to q i.e., q ′ ◦p ≡ q, or only an approximation, i.e., q ′ ◦ p ≈ q. In practice, the result can be only an approximation because
Algorithm 1. PreaggregatesSelection
Require: A workload Q, and a storage space constraint c
  P := top scaling operation
  while c > 0 and |P| != |Q| do
    p := highestBenefit(Q, P)
    if (c - S(p) > 0) then
      c := c - S(p)
      P := P ∪ {p}
    else
      c := 0
    end if
  end while
  return P
the application of the resampling method in a layer other than the original may consider a different subset of input cell values and thus affect the cell values in the output raster object. Note that the same effect occurs with the traditional image pyramids approach; however, it has been considered negligible since many applications do not require an exact result. In our approach, when two or more preaggregates can be used to compute a scaling operation, we select the preaggregate with the scale vector closest to the one of the scaling operation.
Example 1. Consider q = scale(r, (4.0, 4.0, 4.0), linear). Based on Table 1, what should the composing query q′ be so that q′ ◦ p ≈ q? Clearly, p2 and p3 can be used to answer query q. p3 is the preaggregate with the closest scale factor to q. Therefore, q′ = scale(p3, (0.87, 0.87, 0.87), linear). Note that q′ represents a complementary scaling operation on a preaggregate.

Table 1. Sample preaggregate list

raster OID | scale vector    | resampling method
p1         | (2.0, 2.0, 2.0) | nearest-neighbor
p2         | (3.0, 3.0, 3.0) | linear
p3         | (3.5, 3.5, 3.5) | linear
p4         | (6.0, 6.0, 6.0) | linear
The RewriteOperation procedure returns for a query q a rewritten query q′ in terms of a preaggregate identified with pid. The input of the algorithm is a scaling operation q and a set of preaggregates P. The algorithm first looks for a full-match between q and one of the elements in P, verifying that the full-match conditions listed in Section 3.2 are all satisfied. If a full match is found, then it returns the identifier of the matched preaggregate. Otherwise, the algorithm verifies the partial-match conditions for all preaggregates in P. All qualified preaggregates are then added to a set S. In the case of a partial match, the algorithm finds the preaggregate with the closest scale vector to the one defined in q. rewriteQuery rewrites the original query in function
of the chosen preaggregate, and adjusts the scale vector values so they reflect the needed complementary scaling. The algorithm makes use of the following auxiliary procedures.
– fullMatch(q, P) verifies that all full-match conditions are satisfied. If no matching is found it returns 0, else the matching preaggregate id.
– partialMatch(q, P) verifies that all partial-match conditions are satisfied. Each qualified preaggregate of P is then added to a set S.
– closestScaleVector(q, S) compares the scale vectors between q and the elements of S, and returns the identifier (pid) of the preaggregate whose values are the closest to those defined for q.
– rewriteQuery(q, pid) rewrites query q in terms of the selected preaggregate, and adjusts the scale vector values accordingly.
Algorithm 2. RewriteOperation
Require: A query q, and a set of preaggregates P
  initialize S = {}, pid = 0
  pid = fullMatch(q, P)
  if (pid == 0) then
    S = partialMatch(q, P)
    pid = closestScaleVector(q, S)
  end if
  q′ = rewriteQuery(q, pid)
  return q′
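A possible Python rendering of RewriteOperation is sketched below, using the preaggregate list of Table 1 as test data. Several points are assumptions of this sketch rather than statements of the paper: queries are modelled as `(object, scale vector, resampling method)` tuples, a preaggregate qualifies for a partial match when each of its scale factors is strictly smaller than the query's (as in Example 1, where p2 and p3 qualify but p4 does not), closeness is measured by squared Euclidean distance, the complementary factor is the ratio of preaggregate factor to requested factor (3.5/4.0 = 0.875, which Example 1 prints as 0.87), and a query without any match is returned unchanged.

```python
def full_match(q, preaggs):
    """Return the id of a preaggregate identical to q, or None."""
    for pid, p in preaggs.items():
        if p == q:
            return pid
    return None


def partial_match(q, preaggs):
    """Ids of preaggregates on the same object and method with smaller factors."""
    obj, scale, meth = q
    return [
        pid for pid, (o, s, m) in preaggs.items()
        if o == obj and m == meth and len(s) == len(scale)
        and all(si < qi for si, qi in zip(s, scale))
    ]


def closest_scale_vector(q, candidates, preaggs):
    """Candidate whose scale vector is nearest to q's (squared distance)."""
    scale = q[1]
    return min(
        candidates,
        key=lambda pid: sum((qi - si) ** 2 for qi, si in zip(scale, preaggs[pid][1])),
    )


def rewrite_operation(q, preaggs):
    """Rewrite q as a complementary scaling on the best matching preaggregate."""
    pid = full_match(q, preaggs)
    if pid is None:
        candidates = partial_match(q, preaggs)
        if not candidates:
            return q  # no usable preaggregate: evaluate on the base object
        pid = closest_scale_vector(q, candidates, preaggs)
    _, scale, meth = q
    rest = tuple(si / qi for si, qi in zip(preaggs[pid][1], scale))
    return (pid, rest, meth)


table1 = {
    "p1": ("r", (2.0, 2.0, 2.0), "nearest-neighbor"),
    "p2": ("r", (3.0, 3.0, 3.0), "linear"),
    "p3": ("r", (3.5, 3.5, 3.5), "linear"),
    "p4": ("r", (6.0, 6.0, 6.0), "linear"),
}
print(rewrite_operation(("r", (4.0, 4.0, 4.0), "linear"), table1))
```

Using `None` instead of 0 for "no match" avoids reserving an identifier value, which is the only deliberate deviation from the control flow of Algorithm 2.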
6 Experimental Results
Evaluation experiments have been conducted on a PC with a 3.00 GHz Pentium 4 running SuSE 9.1. Different query distributions have been applied to the following real-life sample data:
– R1: 2D airborne image with spatial domain [0:15359, 0:10239] consisting of 600 tiles of extent [0:512, 0:512], totalling 1.57e+8 cells.
– R2: 3D image time series with spatial domain [0:11299, 0:10459, 0:3650] consisting of 226,650 tiles of extent [0:512, 0:512, 0:512], totalling 4.31e+11 cells.
6.1 2D Scaling
In this experiment, the workload contained 12,800 scalings defined for object R1, all using the same resampling method. First, we assumed a uniform distribution for the scaling vectors. Scale vectors were in the range from 2 to 256 in steps of one. Further, x and y dimensions were assumed as coupled. 50 different scale vectors were used. We consider a storage space constraint of 33%.
Execution of the PreaggregatesSelection algorithm yields 14 preaggregates for this test: scaling operations with scale vectors 2, 3, 4, 6, 9, 14, 20, 28, 39, 53, 74, 102, 139, and 189. We compare our algorithm against standard image pyramids which select eight levels: 2, 4, 8, 16, 32, 64, 128, and 256. Pyramids require approximately 34% of additional storage space. Our algorithm selects 14 preaggregates and requires approx. 40% of additional storage space. For a Poisson distribution image pyramids still take up 34% space. Workload cost (i.e., number of cells read) is 95,468. In contrast, our algorithm selects 33 preaggregates, i.e., performs a full preaggregation, while computing the workload at a cost of 42,455. Figure 4 shows both a uniform scale vector distribution (left) and a Poisson distribution with mean=50 (right). In summary, our algorithm performs 2.2 times faster than image pyramids at the price of less than 5% additional space. The power-of-two rule, a straightforward rule of thumb which "works well in practice", has probably never been analyzed analytically before.
Fig. 4. 2D scaling preaggregates using image pyramids for a uniform (left) and Poisson (right) workload distribution
6.2 3D Scaling
For 3D testing we picked four different scale vector distributions. Recalling that x and y are coupled, we name the dimensions xy and t, respectively.

Uniform distribution in xy, t. The workload encompasses 10,000 scaling operations on object R2 with a scale vector distribution graph as in Figure 5 (left). Scale vectors are distributed uniformly along xy and t in a range from 2 to 256. We ran the PreaggregatesSelection algorithm for different instances of c. The minimum storage space required to support the lattice root was 12.5% of the size of the original dataset. We then increased c by factors of five, always selecting a new set of preaggregates. Figure 5 (center) shows the average R2 query cost over space, and Figure 5 (right) the scaling operations selected when c = 36%; the algorithm selected 49 preaggregates. Computing the workload from the original dataset would require reading 1.28e+12 cells, whereas our algorithm decreases this to 6.44e+5.
Fig. 5. 3D workload, xy and t uniform: query frequency over xy/t (left), average query cost over preaggregation space (center), preaggregates selected for space constraint c = 36% (right)
Fig. 6. 3D workload, xy uniform, t Poisson: query frequency over xy/t (left), average query cost over preaggregation space (center), preaggregates selected for space constraint c = 26% (right)
Uniform distribution in xy and Poisson distribution in t. Here the workload contained 23,460 scalings on R2. The distribution was uniform along xy and Poisson on t, with vectors between 2 and 256. Figure 6 (left) shows the distribution graph. Again, we ran PreaggregatesSelection for different instances of c. The minimum space required to support the lattice root was 3.13% of the original dataset. We then increased c by factors of five. Figure 6 (center) shows the average query cost over space, and Figure 6 (right) the selection for c = 26%, which comprised 67 preaggregates. Computing the workload from the original dataset requires reading 2.31e+11 cells, as opposed to 1.21e+7 using these preaggregates. Note how the storage amount skyrockets if we want to achieve very low costs: as further preaggregates are added uniformly all over the workload spectrum, the high-resolution preaggregates, expressing the curse of dimensionality, cause particularly high space requirements.

Poisson distribution in xy, t. Next, we look at a workload with 600 scaling operations on R2 based on a Poisson distribution along both the xy and t dimensions. Scale vectors varied from 2 to 256; see Figure 7 (left) for the distribution graph. The minimum space required by the lattice root was 4.18% of the original size. Again, c stepping was by factors of five. The resulting average query cost over space is displayed in Figure 7 (center), and Figure 7 (right) shows the candidates selected for c = 26%, summing up to 23 preaggregates. On the original dataset this requires access to 1.34e+12 cells, as compared to 1,680 cells using these preaggregates.
Fig. 7. 3D workload, xy and t Poisson distribution: query frequency over xy/t (left), average query cost over preaggregation space (center), preaggregates selected for space constraint c = 30% (right)
Fig. 8. 3D workload, xy Poisson, t uniform distribution: query frequency over xy/t (left), average query cost over preaggregation space (center), preaggregates selected for space constraint c = 21% (right)
Poisson distribution in xy and uniform distribution along t. Finally, we tested a workload with 924 R2 scalings where xy follows a Poisson distribution and t a uniform distribution, with scale factors again between 2 and 256 (see Figure 8, left). This time the lattice root node occupied 4.18% of the original dataset size. Increasing c by steps of five resulted in the average query cost shown in Figure 8 (center). For c = 21% the algorithm selected 17 preaggregates, see Figure 8 (right). Cell access compares as 1.63e+12 cells accessed if using the original data set versus 1,472 using the preaggregates.
7 Conclusion and Outlook
We have introduced preaggregation as a means to speed up a particular class of queries occurring in e-Science and Web mapping, namely scaling operations. Incoming queries containing scale operations can make use of the preaggregates using a simple and efficient method. Performance evaluations show significant performance gains with modest storage overhead, even improving on the well-known image pyramids. Based on our results, geo service administrators and tuning tools can in the future make an informed decision about which preaggregates to materialize by inspecting user workload and storage constraints. Currently we are investigating 4D data sets, such as atmospheric, oceanographic, and astrophysical data. Further, workload distribution deserves more attention. While from our 10+ years of field experience in geo raster services we feel fairly confident that the distributions chosen are practically relevant, there might
be further situations worth considering. We, therefore, plan to apply preaggregation to user-exposed services like EarthLook (www.earthlook.org) to obtain empirical results. Our investigations initially had a focus on spatio-temporal data. However, there is no restriction in that respect; our results seem equally valid for data cubes containing one or more non-spatio-temporal dimensions, such as pressure (which is common in meteorological and oceanographic data sets).
Acknowledgments. Research supported by CONACYT, DAAD, and Jacobs University. Thanks go to Eugen Sorbalo for his evaluation code. The first author's work was performed while with Jacobs University.
References

1. Baumann, P., Holsten, S.: A comparative analysis of array models for databases. Technical report, Jacobs University Bremen (2010)
2. Burt, P., Adelson, E.: The laplacian pyramid as a compact code. IEEE Transactions on Communications COM-31, 532–540 (1983)
3. Garcia, A., Baumann, P.: Modeling fundamental geo-raster operations with array algebra. In: Proc. SSTDM 2007, October 28-31, pp. 607–612 (2007)
4. Harinarayan, V., Rajaraman, A., Ullman, J.D.: Implementing data cubes efficiently. SIGMOD Rec. 25(2), 205–216 (1996)
5. Hornsby, K., Egenhofer, M.J.: Shifts in detail through temporal zooming. In: Bench-Capon, T.J.M., Soda, G., Tjoa, A.M. (eds.) DEXA 1999. LNCS, vol. 1677, pp. 487–491. Springer, Heidelberg (1999)
6. Hornsby, K., Egenhofer, M.J.: Identity-based change: A foundation for spatiotemporal knowledge representation. Intl. J. Geogr. Inf. Science 14, 207–224 (2000)
7. Lopez, I.F.V.: Scalable algorithms for large temporal aggregation. In: Proc. ICDE 2000, p. 145 (2000)
8. n.n. Geographic Information - Coverage Geometry and Functions. Number 19123:2005. ISO (2005)
9. n.n. rasdaman query language guide, 8.1 edition (2009)
10. Peuquet, D.J.: Making space for time: Issues in space-time data representation. Geoinformatica 5(1), 11–32 (2001)
11. Sapia, C.: On modeling and predicting query behavior in olap systems. In: Proc. DMDW 1999, June 14-15 (1999)
12. Shukla, A., Deshpande, P., Naughton, J.F.: Materialized view selection for multidimensional datasets. In: Proc. VLDB 1998, pp. 488–499 (1998)
13. Spokoiny, A., Shahar, Y.: An active database architecture for knowledge-based incremental abstraction of complex concepts from continuously arriving time-oriented raw data. J. Intell. Inf. Syst. 28(3), 199–231 (2007)
14. Wiederhold, G., Jajodia, S., Litwin, W.: Dealing with granularity of time in temporal databases. In: Andersen, R., Solvberg, A., Bubenko Jr., J.A. (eds.) CAiSE 1991. LNCS, vol. 498, pp. 124–140. Springer, Heidelberg (1991)
Situation Prediction Nets
Playing the Token Game for Ontology-Driven Situation Awareness⋆

Norbert Baumgartner¹, Wolfgang Gottesheim², Stefan Mitsch², Werner Retschitzegger², and Wieland Schwinger²

¹ team Communication Tech. Mgt. GmbH, Goethegasse 3, 1010 Vienna, Austria
² Johannes Kepler University Linz, Altenbergerstr. 69, 4040 Linz, Austria
Abstract. Situation awareness in large-scale control systems such as road traffic management aims to predict critical situations on the basis of spatiotemporal relations between real-world objects. Such relations are described by domain-independent calculi, each of them focusing on a certain aspect, for example topology. The fact that these calculi are described independently of the involved objects, isolated from each other, and irrespective of the distances between relations leads to inaccurate and crude predictions. To improve the overall quality of prediction while keeping the modeling effort feasible, we propose a domain-independent approach based on Colored Petri Nets that complements our ontology-driven situation awareness framework BeAware!. These Situation Prediction Nets can be generated automatically and allow increasing (i) prediction precision by exploiting ontological knowledge in terms of object characteristics and interdependencies between relations and (ii) expressiveness by associating multiple distance descriptions with transitions. The applicability of Situation Prediction Nets is demonstrated using real-world traffic data.

Keywords: Situation Awareness, Ontology, Colored Petri Nets.
1 Introduction
Situation awareness in large-scale control systems. Situation awareness is gaining increasing importance in large-scale control systems such as road traffic management. The main goal is to support human operators in assessing current situations and, particularly, in predicting possible future situations in order to take appropriate actions pro-actively to prevent critical events. The underlying data describing real-world objects (e. g., wrong-way driver) and their relations (e. g., heads towards), which together define relevant situations (e. g., wrong-way driver rushes into traffic jam), are often highly dynamic and vague. As a consequence, reliable numerical values are hard to obtain, which makes qualitative situation prediction approaches better suited than quantitative ones [14].

⋆ This work has been funded by the Austrian Federal Ministry of Transport, Innovation and Technology (BMVIT), grant FIT-IT 819577.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 202–218, 2010. © Springer-Verlag Berlin Heidelberg 2010
Ontology-driven situation prediction based on spatio-temporal calculi. Recently, ontology-driven situation awareness techniques [9], [2] have emerged as a basis for predicting critical situations from spatio-temporal relations between objects. Such relations are expressed by employing relation calculi, each of them focusing on a certain spatio-temporal aspect, such as mereotopology [23], orientation [11], or direction [22]. These calculi are often formalized by means of Conceptual Neighborhood Graphs (CNGs, [13]), imposing constraints on the existence of transitions between relations. CNGs are an important construct for modeling continuously varying processes [20], and are adopted in, for example, qualitative simulation [10], prediction [6], tracking moving objects [25], and agent control [12]. The domain-independent nature of calculi (i) leaves interpretations (e. g., Close means within 10km) to applications, (ii) does not consider object characteristics (e. g., whether they are moveable), (iii) is irrespective of interdependencies (e. g., topological transitions depend on spatial distance), and (iv) does not express any kind of distance for transitions such as probability, which altogether lead to inaccurate and crude situation predictions. Existing approaches try to increase quality by constructing domain- and even situation-specific calculi manually, which, however, requires considerable modeling effort.

Colored Petri Nets to the rescue. In order to achieve a proper balance between prediction quality and modeling effort, we propose a domain-independent approach on the basis of Colored Petri Nets (CPNs, [16]) that complements our ontology-driven situation awareness framework BeAware! [5]. Representing CNGs as CPNs allows, on the one hand, increasing prediction precision by exploiting ontological knowledge included in the framework in terms of object characteristics and interdependencies between spatio-temporal relations and, on the other hand, increasing prediction expressiveness by associating transitions with dynamically derived distances for multiple view-points. These so-called Situation Prediction Nets (SPNs) are derived automatically from the situation awareness ontologies of BeAware!. Petri net properties are preserved, which enables features such as predicting multiple situation evolutions in parallel, which are not as easily realizable with alternative formalisms such as state transition diagrams. The applicability of SPNs is demonstrated using real-world traffic data.

Structure of the paper. In Sect. 2, a brief overview of our work on situation awareness is given, detailing further the challenges tackled in this paper by means of a road traffic example. Section 3 discusses related work, Sect. 4 introduces SPNs, and their applicability is discussed in Sect. 5. Finally, Sect. 6 concludes the paper with lessons learned and an outlook on further research directions.
2 Motivating Example
Road traffic management systems responsible, for example, for improving traffic flow and ensuring safe driving conditions are a typical application domain of situation awareness. Based on our experience in this area, examples from road traffic management further detail the challenges of enhancing the quality of
neighborhood-based predictions [10] in situation awareness. In such neighborhood-based predictions, the relations of a current situation are the starting point for tracing transitions in CNGs to predict possible future situations. In our previous work [5], we introduced a generic framework for building situation-aware systems that provides common knowledge about (i) situations, which consist of objects and relations between them, and (ii) relation neighborhood in a domain-independent ontology. This ontology is used in generic components to derive new knowledge from domain information provided at runtime. A prototypical implementation supports assessing situations in real-world road traffic data, which in turn forms the basis for simple predictions [6] following a neighborhood-based approach. To illustrate the shortcomings of neighborhood-based prediction approaches, which are rooted in relation calculi and CNG characteristics in general, let us consider the following example: Suppose that an initial situation is assessed in which a wrong-way driver is heading towards an area of road works, as depicted in Fig. 1. This initial situation is characterized by two relations (Disrelated from Region Connection Calculus (RCC) [23] and Far from Spatial Distance calculus) between the objects Wrong-way driver and Road works. Human operators would like to know potential future situations (for instance, a wrong-way driver in the area of road works) in order to take appropriate actions. With neighborhood-based prediction, we can provide this information by following the edges of Disrelated (1) and Far (3, 4) in the respective CNGs, thereby predicting five possible subsequent situations, which form the basis for further predictions. This leads, however, to combinatorial explosion, making such crude predictions, even when using small CNGs as in the example above, inaccurate and incomprehensible to human operators.
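The counting behind the five subsequent situations can be reproduced with a small sketch. It assumes, as the edge labels suggest, that Disrelated has one neighbor and Far has two; the concrete neighbor names for Far and the function successor_situations are assumptions made for illustration only and are not taken from BeAware!. Each relation either stays as it is or moves to a conceptual neighbor, and the unchanged combination is excluded.

```python
from itertools import product

# Hypothetical fragments of two CNGs, reduced to the edges needed here.
# The neighbors listed for Far are an assumption for illustration.
RCC_NEIGHBORS = {"Disrelated": ["PartiallyOverlapping"]}
DISTANCE_NEIGHBORS = {"Far": ["Close", "VeryFar"]}


def successor_situations(situation, cngs):
    """Enumerate situations that differ from `situation` in at least one relation.

    `situation` holds one relation per calculus; each relation may stay
    unchanged or move to one of its conceptual neighbors.
    """
    options = [
        [relation] + cng.get(relation, [])
        for relation, cng in zip(situation, cngs)
    ]
    return [s for s in product(*options) if s != tuple(situation)]


current = ("Disrelated", "Far")
successors = successor_situations(current, [RCC_NEIGHBORS, DISTANCE_NEIGHBORS])
for s in successors:
    print(s)
print(len(successors), "possible subsequent situations")
```

Because the options of all calculi are multiplied, every additional calculus or neighbor enlarges the result set, which is the combinatorial explosion referred to in the text.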
Current approaches [1] try to tackle these problems with manually defined constraints, resulting in high modeling effort. To further illustrate the challenges of enhancing prediction quality, we provide examples below and a summary in Fig. 2.

Challenge A: Increasing precision with object characteristics. CNGs model relations independently of objects and their characteristics, which results
Fig. 1. Prediction of possible future situations from an assessed, current one
Fig. 2. Challenges in prediction quality enhancement
in a vast number of predicted situations, of which many are actually impossible. Let us revisit our example: Since a wrong-way driver has non-permeable boundaries, all predictions that require areas of other traffic objects to become a proper part of the wrong-way driver are wrong (i. e., resulting in impossible situations, e. g., “Wrong-way driver Proper Part Inverse road works”). If only we had knowledge about object characteristics, such as boundary permeability, available in CNGs, we could increase prediction precision by excluding impossible situations. In [4], we laid the basis for tackling this challenge by proposing an ontology for representing object characteristics and corresponding optimization rules. In this paper, we describe how to translate this knowledge to SPNs.

Challenge B: Increasing precision with relation interdependencies. A CNG describes relations of a single calculus without taking interdependencies between different calculi into account, which again leads to a large number of false predictions. For example, let us reconsider the initial situation “Wrong-way driver disrelated and far from traffic jam” involving two different calculi. In a world in which motion is continuous, transitions in these calculi are not independent. In our example, this means that two objects will first transition in Spatial Distance calculus from Far to Close, then advance to Very Close before a transition in RCC from Disrelated to Partially Overlapping can happen. If only we were able to describe such relation interdependencies between different calculi, we could further increase prediction precision.
Challenges C and D: Increasing expressiveness with distance descriptions. CNGs only describe the mere existence of transitions between relations, but do not associate any distance descriptions with them. Human operators, however, are keen to know details beyond existence, such as temporal distance, probability, impact, and confidence. For example, the duration of a transition from Far to Close depends on the speed of both objects, while the transition probability is influenced by the direction of motion (i. e., whether or not the wrong-way driver heads towards road works, which is expressed by an orientation relation FrontFront of OPRA [11]). If only we had such knowledge about object characteristics, for example their speed, and about relation interdependencies, for example mereotopology on spatial distance, we could derive distance descriptions to increase prediction expressiveness.
3 Related Work
In this section, we discuss related research in qualitative neighborhood-based prediction and simulation because, in fact, such predictions are based on simulating evolutions of situations. We distinguish between methods trying to increase precision and those focusing on expressiveness. Since our approach generates CPNs from ontologies, we also cover related work in this area.

Increasing precision. Increasing precision is a major concern in fields such as qualitative simulation [1], [7] and robot agent control [12]. In [1], a qualitative simulation method was presented which manually defines (i) simulated situations more precisely with unmodifiable relations and object characteristics (e. g., relative positions of static objects, termed intra-state constraints), and (ii) customized CNGs (termed inter-state constraints, or dynamic constraints in a similar approach [7]) to describe valid transitions for determining the next simulated situation. Similarly, in [12], the effect of object characteristics (e. g., whether objects can move or rotate) on the conceptual neighborhood of a particular calculus was emphasized. To represent this knowledge in terms of CNGs, six different manually defined conceptual neighborhoods of the orientation calculus OPRA_m were introduced. The drawback common to all these approaches is that intra- and inter-state constraints (CNGs) must be defined manually for each domain or even worse for each prediction. Situation Prediction Nets, in contrast, can be derived automatically from such ontological knowledge, allowing us, at the same time, to increase prediction precision.

Increasing expressiveness. Increasing expressiveness is of concern, for instance, when assigning preferences to CNGs in order to customize multimedia documents [18] or to describe costs for assessing spatial similarity [19].
Laborie [18] uses spatio-temporal calculi to describe relations between parts of a multimedia document, and CNGs to find a similar configuration, in case a multimedia player cannot deal with the original specification of the document. In order to increase expressiveness with preferences for selecting the most suitable configuration, distances between relations are described with statically defined, quantitative weights on CNG edges. Similarly, Li and Fonseca [19] increase expressiveness
with weights to describe distances in terms of static costs for making transitions in a CNG. These costs are used to assess spatial similarity: less costly transitions connect relations that are more similar. We take these approaches further by increasing expressiveness with qualitative distances for multiple view-points which incorporate relation interdependencies in addition to object characteristics.

Translating ontologies to Petri nets. Petri nets are commonly known models appropriate for describing the static and dynamic aspects of a system, thereby enabling prediction of future situations by simulating evolutions [24]. Of particular interest for defining our Situation Prediction Nets are extensions to the original place-transition net formalism in the form of hierarchical Colored Petri Nets [16], because they allow representing ontological knowledge with complex data types. Translations from ontologies to Petri nets are described in the literature as a pre-requisite, for instance, for defining a hybrid ontological and rule-based reasoner [26] or for achieving formal analyses of Web services [8]. In [26], patterns were presented for translating OWL axioms of a particular ontology with their accompanying SWRL rules into Petri nets in order to create a combined ontological and rule-based reasoner. In [8], translations from concepts in OWL-S Web service specifications (such as choice, sequence, and repeat-until) into a custom-defined Petri net variant were introduced to check static properties of the Web service, such as its liveness. Both approaches focus on translations from ontologies to Petri nets, but, in contrast to our approach, they do not exploit dynamic information to influence the behavior of their nets.
4 From Ontologies to Situation Prediction Nets
In this section, we propose Situation Prediction Nets (SPNs) based on CPNs for tackling the challenges described in Sect. 2 with the goal of enhancing prediction quality in domain-independent situation awareness. Note that the modeling examples given in this section, in order to keep them concise and comprehensible, only show simplified subsets of our nets, describing aspects relevant to situation awareness, such as mereotopology, distance, speed, and orientation. As illustrated in Fig. 3, the overall architecture of our approach formalizes a conceptual view of objects, spatio-temporal relations between them (i. e., a prediction’s static definition), and situations (i. e., a prediction’s dynamic starting point) as knowledge for situation-aware systems in an ontology¹, which in turn forms the basis for generating Situation Prediction Nets. This situation awareness ontology is structured into: (i) a domain-independent part including spatio-temporal calculi and their CNGs, accompanied by algorithms for situation assessment, duplicate detection, and prediction, forming our situation awareness framework BeAware! [5], and (ii) a domain-dependent part extending the domain-independent part during implementation of situation-aware systems.

¹ Our situation awareness ontology [5] builds upon the notion of Barwise and Perry [3], which makes the proposed approach applicable to a wide range of efforts in situation awareness, for instance, to Kokar’s approach [17].
Fig. 3. From ontological knowledge to Situation Prediction Nets
Translating from ontological knowledge to CPNs. We first describe how to automatically translate ontological knowledge to plain CPNs as a basis for Situation Prediction Nets. As illustrated in Fig. 3 and formalized in Tab. 1, translating the static structure of a CNG (an undirected, unweighted graph) to a CPN (a bipartite directed graph) is straightforward: We (i) represent every calculus on a dedicated Petri net page, (ii) define the CPN’s color set, (iii) create one place per node of the corresponding CNG, and (iv) represent every edge with two transitions (one for each direction), connecting the transitions with arcs to the respective places. (v) Situations, being defined by objects and relations, are modeled by two-colored tokens, each composed of the two objects to be related, which are placed in the corresponding relation places. For example, the situation “wrong-way driver disrelated from, close to, and heading towards road works” is modeled by tokens consisting of the objects wrong-way driver and road works placed in Disrelated, Close, and FrontFront (see Fig. 3).
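Steps (i) to (v) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the generator used in BeAware!: the Page class, the helper names cng_to_page and place_situation, and the reduced relation sets are all hypothetical, and guards, arc expressions, and hierarchy are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One Petri net page per calculus (step i); hypothetical structure."""
    name: str
    places: set = field(default_factory=set)
    transitions: set = field(default_factory=set)
    arcs: set = field(default_factory=set)        # (source, target) pairs
    marking: dict = field(default_factory=dict)   # place -> list of tokens


def cng_to_page(name, relations, edges):
    """Translate an undirected CNG into a bipartite directed net page."""
    page = Page(name)
    page.places = set(relations)                  # step iii: one place per node
    for a, b in edges:                            # step iv: two transitions per edge
        for src, dst in ((a, b), (b, a)):
            t = f"{src}->{dst}"
            page.transitions.add(t)
            page.arcs.add((src, t))               # place -> transition
            page.arcs.add((t, dst))               # transition -> place
    page.marking = {p: [] for p in page.places}
    return page


def place_situation(pages, token, relations):
    """Step v: put the object-pair token into one relation place per page."""
    for page, relation in zip(pages, relations):
        page.marking[relation].append(token)


# Reduced, illustrative CNG fragments.
rcc = cng_to_page("RCC", ["Disrelated", "PartiallyOverlapping"],
                  [("Disrelated", "PartiallyOverlapping")])
dist = cng_to_page("Distance", ["VeryClose", "Close", "Far"],
                   [("VeryClose", "Close"), ("Close", "Far")])

# Step ii: tokens are pairs of objects, i.e., the color set is OT x OT.
token = ("WrongWayDriver", "RoadWorks")
place_situation([rcc, dist], token, ["Disrelated", "Close"])
print(sorted(rcc.transitions))
```

Each undirected CNG edge yields two transitions and four arcs, so a page is fully determined by the node and edge lists of its calculus.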
Table 1. Ontology to CPN translation

Ontology concept: A Situation Awareness Ontology is a tuple $SAW = (RC, OT, RT, ST, AC, RN, rcm, n, st, occ, rcd, idf)$, satisfying the requirements below.
CPN concept: A CPN is a tuple $CPN = (\Sigma, P, T, A, N, C, G, E, I)$, a hierarchical CPN is a tuple $HCPN = (S, \ldots)$ satisfying the requirements below [15].

(i) Ontology concept: $RC$ is a finite set of relation calculi.
CPN concept: $S$ is a finite set of pages, each one being a $CPN$.
Translation: $S = RC$

(ii) Ontology concept: $OT$ is a finite set of object types, $ST$ is a finite set of situation types, such that $ST \subseteq OT$.
CPN concept: $\Sigma$ is a finite set of non-empty types, called color sets. $C$ is a color function $C : P \to \Sigma$. $E$ is an arc expression function $E : A \to expression$, such that $\forall a \in A : [Type(E(a)) = C(p(a))_{MS} \wedge Type(Var(E(a))) \subseteq \Sigma]$.
Translation: The color set, color function and arc expression consist of object tuples (we only consider binary relations): $\Sigma = (OT \times OT)$, $E(a) = (o_1, o_2)$ in every case, $C(p) = (OT \times OT)$ in every case.

(iii) Ontology concept: $RT$ is a finite set of relation types, $RT \cap OT = \emptyset$. $rcm$ is a relation calculus membership function, $rcm : RT \to RC$.
CPN concept: $P$ is a finite set of places.
Translation: All relations of a family are added to the set of places of the family’s corresponding page: $\forall s \in S\ \exists rc \in RC : P_s = \{r \in RT \mid rcm(r) = rc\}$; every relation is represented by a place, $P_i = RT_i$.

(iv) Ontology concept: $n$ is a neighborhood function $n : RT \to RT$, defining for each relation type a finite set of relation neighbors, $\forall r \in RT : RN_r = \{r' \mid n(r) = r'\}$.
CPN concept: $T$ is a finite set of transitions. $A$ is a finite set of arcs, $P \cap T = P \cap A = T \cap A = \emptyset$. $N$ is a node function, $N : A \to P \times T \cup T \times P$.
Translation: For each pair of neighboring relation types, a transition with two arcs connecting the respective places exists: $\forall r, r' \in RT : n(r) = r' \Rightarrow \exists a_1, a_2 \in A, t \in T$ such that $N(a_1) = (r, t) \wedge N(a_2) = (t, r')$. Only one transition exists per pair of neighboring relation types: $\forall t, t' \in T : N(a) = (r, t) \wedge N(a') = (t, r') \wedge N(a'') = (r, t') \wedge N(a''') = (t', r') \Rightarrow t = t'$.

(v) Ontology concept: $st$ is a situation type definition function, $st : (RT, (OT \times OT)) \to ST$.
CPN concept: $I$ is an initialization function $P \to expression$ such that $\forall p \in P : [Type(I(p)) = C(p)_{MS}]$.
Translation: Situation types become initial markings:
$$I(p) = \begin{cases} (o_1, o_2) & \text{if } \exists s \in ST,\ o_1, o_2 \in OT : st(p, (o_1, o_2)) = s \\ \emptyset & \text{otherwise} \end{cases}$$
In this section, we propose translations from object characteristics and relation interdependencies to Petri nets, thus promoting CPNs to SPNs.

Challenge A: Object characteristics exploited in guards. The first step in enhancing prediction quality, as shown in Fig. 4 and formalized in Tab. 2, aims to increase prediction precision by disabling wrong transitions between relations on the basis of object characteristics, such as permeability and moveability. In SPNs, (vi) object characteristics carried by two-colored tokens (assigned to objects by an object change characteristic function) are exploited in guards on transitions. These guards express optimization rules given by inherent characteristics of relations, such as IsPermeable, IsMoveable, and IsScalable [4], thereby defining firing conditions of transitions more precisely. For example, to determine whether the transition from Disrelated to Partially Overlapping should be disabled for a token even though it is placed in Disrelated, guard 1 in Fig. 4 checks whether either object can move. The information for this check is supplied by objects: The guard evaluates to true for Wrong-way driver objects, and hence the transition remains enabled. In contrast, guard 2 checks
Fig. 4. Object characteristics exploited in guards
whether the first object of the token has a permeable boundary (i. e., whether it allows the second object to enter its area), which is not the case for Wrong-way drivers. Consequently, the transition is disabled.

Table 2. Translations necessary for considering object characteristics

(vi) Ontology concept: $AC$ is a finite set of attribute change types (e. g., moveable). $occ$ is an object change characteristic function, $occ : OT \to AC$. $rcd$ is a relation change dependency function determining whether its from $o_1$ or to $o_2$ must fulfill the requirement, $rcd : RT \to AC$.
CPN concept: $G$ is a guard function, $G : T \to expression$, such that $\forall t \in T : [Type(G(t)) = \mathbb{B} \wedge Type(Var(G(t))) \subseteq \Sigma]$. $P \cup T$ is called the set of nodes. $Out$ maps each node to its output nodes, such that $Out(x) = \{x' \in X \mid \exists a \in A : N(a) = (x, x')\}$.
Translation: The transition is enabled if the change dependencies of the relation represented by a particular place are fulfilled by the incoming tokens:
$$G(t) = \begin{cases} \exists p \in Out(t) : p \in P \wedge rcd(p) \in occ(o_1) & \text{if } \mathit{from}\ o_1(p) = o_1 \\ \exists p \in Out(t) : p \in P \wedge rcd(p) \in occ(o_2) & \text{if } \mathit{from}\ o_1(p) = o_2 \\ \mathit{false} & \text{otherwise} \end{cases}$$
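The effect of such guards can be illustrated with a small sketch. The dictionaries below are hypothetical stand-ins for occ (characteristics of objects) and rcd (the characteristic a target relation requires); the assignment of IsPermeable to road works and the association of guard 2 with Proper Part Inverse are assumptions made for this example, chosen to mirror the two guards discussed in the text.

```python
# Hypothetical object characteristics (stand-in for occ in Table 2).
OBJECT_CHARACTERISTICS = {
    "WrongWayDriver": {"IsMoveable"},
    "RoadWorks": {"IsPermeable"},
}

# Hypothetical change dependencies (stand-in for rcd in Table 2): the
# characteristic a target relation requires and which token position
# must provide it; "any" means either object may provide it.
RELATION_REQUIREMENTS = {
    "PartiallyOverlapping": ("IsMoveable", "any"),
    "ProperPartInverse": ("IsPermeable", "first"),
}


def guard(target_relation, token):
    """Return True if the transition into target_relation stays enabled."""
    required, position = RELATION_REQUIREMENTS[target_relation]
    first, second = token
    if position == "first":
        candidates = [first]
    elif position == "second":
        candidates = [second]
    else:
        candidates = [first, second]
    return any(required in OBJECT_CHARACTERISTICS[o] for o in candidates)


token = ("WrongWayDriver", "RoadWorks")
print(guard("PartiallyOverlapping", token))  # guard 1: either object can move
print(guard("ProperPartInverse", token))     # guard 2: first object permeable?
```

With these assumed characteristics, the first guard keeps its transition enabled and the second one disables it, so the impossible situation "Wrong-way driver Proper Part Inverse road works" is never predicted.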
Challenge B: Relation interdependencies expressed by configurable dependency pages. In a world in which motion is continuous, relation interdependencies between different calculi may reduce the number of transitions between relations, as described in Sect. 2. These interdependencies, however, vary across different situations: For example, in the situation “Traffic jam disrelated from and close to road works”, transitions in RCC depend on spatial distance, but in other cases, they may for instance depend on orientation. To tackle challenge B, we therefore need concepts both to represent interdependencies between relations of different calculi, and to achieve configurability on the basis of situations, eliminating the need for manually composed situation-dependent calculi. In order to keep the modeling effort low by eliminating the need to create customized calculi, we split interdependencies into two parts, as shown in Fig. 5 and formalized in Tab. 3: The first part describes that other calculi may depend upon a particular relation (Very Close in our example), while the second part specifies that a particular transition (Disrelated to Partially Overlapping in the example) depends on certain preconditions being met (vii). A transition in-between combines these two parts and, in our example, forms the precondition “Transition from Disrelated to Partially Overlapping depends on Very Close”. This transition also serves as an extension point for specifying additional dependencies, which are accumulated in the place All Preconditions Fulfilled. One may use this extension point, for instance, to add orientation information as a further precondition. Now that we know how to represent dependencies between different calculi conceptually, let us turn our attention to making these dependencies configurable to accommodate different situations. 
For this purpose, we represent configurations on dedicated Dependency Configuration pages, making use of the Petri net design pattern “Deterministic XOR-Split” [21]. If the current situation makes it necessary to consider relation interdependencies (which is the case
for the situation in our example), this design pattern results in a token being placed in Wait_n, meaning that we need to wait for a token to appear in Relation_n in order to fulfill our precondition (i. e., place a token in Fulfilled_n). If the current situation does not consider relation interdependencies, this pattern fulfills the precondition directly (i. e., it is not relevant whether a token appears in Relation_n). In order to prevent tokens from accumulating in Fulfilled_n (which would make multiple transitions possible without actually evaluating the interdependencies), we follow the Petri net design pattern “Capacity Bounding” [21] with an anti-place Eval_n restricting the capacity of Fulfilled_n to 1.
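The behaviour of one dependency configuration page can be approximated with a small token-counting model. The sketch below is a simplification in plain Python rather than CPN ML: the class name `DependencyPage` and its method names are hypothetical, token colours are ignored, and it assumes that the transition consuming the token from Fulfilled_n hands a token back to the anti-place Eval_n.

```python
from dataclasses import dataclass


@dataclass
class DependencyPage:
    """Toy model of one dependency configuration page (token counts only)."""
    relation: int = 0   # Relation_n: the depended-on relation currently holds
    wait: int = 0       # Wait_n: precondition still has to be evaluated
    fulfilled: int = 0  # Fulfilled_n: precondition is fulfilled
    eval_: int = 1      # Eval_n: anti-place, initially marked

    def fire_t2(self, situation_uses_dependency: bool) -> bool:
        """Deterministic XOR-split: route to Wait_n or directly to Fulfilled_n."""
        if self.eval_ == 0:
            return False  # capacity of Fulfilled_n is exhausted
        self.eval_ -= 1
        if situation_uses_dependency:
            self.wait += 1
        else:
            self.fulfilled += 1
        return True

    def fire_t1(self) -> bool:
        """Fulfil the precondition once the depended-on relation is present."""
        if self.wait == 0 or self.relation == 0:
            return False
        self.wait -= 1
        # The Relation_n token is read and put back, so its count is unchanged.
        self.fulfilled += 1
        return True

    def consume_fulfilled(self) -> bool:
        """Assumed consumer: takes the Fulfilled_n token and re-marks Eval_n."""
        if self.fulfilled == 0:
            return False
        self.fulfilled -= 1
        self.eval_ += 1
        return True
```

In this model the sum of tokens in Wait_n, Fulfilled_n and Eval_n stays at one, which is what bounds Fulfilled_n to a single token.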
Fig. 5. Relation interdependencies expressed by configurable dependency pages

Table 3. Translations necessary for considering relation interdependencies

Ontology concept:
(vii) idf is an interdependency function idf : RT → RT, defining for each relation type a finite set of depended-on relations: ∀r ∈ RT : RD_r = {r′ | idf(r) = r′}.

CPN concept:
A token is a pair (p, c) where p ∈ P ∧ c ∈ C(p); a marking M is a multi-set over tokens in a CPN.
In order to model Deterministic XOR-Split, we extend the color set with type boolean: Σ = (OT × OT) ∪ B.
For each depended-on relation n ∈ RD_i, we create a dedicated dependency configuration page:
P_n = {Relation_n, Wait_n, Fulfilled_n, Eval_n}
T_n = {T_n1, T_n2}
A_n = {RT_n1, WT_n1, ET_n2, T_n1R, T_n1F, T_n2W, T_n2F}
G_n(t) = true

\[
E_n(a) =
\begin{cases}
\text{if } \exists s \in ST, p \in P, c \in C(p) : (p, c) \in M \wedge st(p, c) = s \text{ then true else empty} & \text{if } a = T_{n2}W \\
\text{if } \forall s \in ST\ \exists p \in P, c \in C(p) : (p, c) \in M \Rightarrow st(p, c) = s \text{ then true else empty} & \text{if } a = T_{n2}F \\
(o_1, o_2) & \text{if } a \in \{RT_{n1}, T_{n1}R\} \\
b & \text{otherwise}
\end{cases}
\]

\[
I_n(p) =
\begin{cases}
\mathit{true} & \text{if } p = Eval_n \\
\emptyset & \text{otherwise}
\end{cases}
\]

We extend the guard functions on transitions to also check relation interdependencies:

\[
G(t) =
\begin{cases}
\text{cases from above} \\
\mathit{true} & \text{if } \exists (p, c) \text{ where } p = \text{All Preconditions Fulfilled} \wedge c \in \Sigma \\
\mathit{false} & \text{otherwise}
\end{cases}
\]
Fig. 6. Distances derived from axiomatic mappings in code segments
Challenges C and D: Distances derived from axiomatic mappings in code segments. Building upon the concepts of increasing precision introduced above, we describe methods for increasing expressiveness using distance descriptions. Let us recall the example from Sect. 2, in which temporal distance— “distant” if one object moves slowly or “soon” if both objects move fast— describes the transition from Disrelated to Partially Overlapping in more detail. For tackling challenges C and D, we need knowledge in the form of axioms, which map from object characteristics, like speed in our example, and from relations between objects to distance descriptions, as shown in Fig. 6. In road traffic management, such axioms model domain knowledge providing rough estimations in the absence of real-world training data. If, in some other domain, such training data is available, learning from observed situation evolutions helps to further refine these axioms. Note that for simplicity, axioms are given in tabular form in this paper, but other representations, such as Hidden Markov Models and Bayesian nets, are also possible. Using these axioms, code segments on transitions estimate distances in the prediction process. For this purpose, in addition to two-colored object tokens, the current marking in an SPN is reified as n-colored tokens (each color representing a particular place) and disseminated to transitions using the Petri net design pattern “Shared Database” [21]. The code segment in Fig. 6 derives estimations for temporal distance and probability, describing transitions between Disrelated and Partially Overlapping in more detail. Temporal distances are looked up in axiom tables using object characteristics (speed in our example), while probability is determined using the SPN’s current marking.
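A tabular axiom of the kind described above can be pictured as a simple lookup. The sketch below encodes only the two rules quoted in the running example (“distant” if one object moves slowly, “soon” if both move fast). The names `TEMPORAL_AXIOMS` and `temporal_distance` are hypothetical, and speed combinations that the text does not mention are deliberately left undefined rather than guessed.

```python
# Hypothetical axiom table for the transition Disrelated -> Partially Overlapping:
# (speed of o1, speed of o2) -> qualitative temporal distance.
TEMPORAL_AXIOMS = {
    ("slow", "slow"): "distant",
    ("slow", "fast"): "distant",
    ("fast", "slow"): "distant",
    ("fast", "fast"): "soon",
}


def temporal_distance(speed_o1, speed_o2):
    """Look up the temporal distance; None if no axiom covers the combination."""
    return TEMPORAL_AXIOMS.get((speed_o1, speed_o2))
```

A code segment on a transition would perform a lookup of this kind with the characteristics of the two objects bound to the transition. How the probability is derived from the current marking is not modelled here.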
5 Evaluation
In this section, we evaluate the applicability of SPNs using real-world data from the domain of road traffic management covering Austrian highways over a period
of four weeks. These data were collected from multiple sources, such as traffic flow sensors, road maintenance schedules, and motorists reporting incidents to a call center. The recorded data set used for this evaluation consists of 3,563 distinct road traffic objects, comprising, among others, 778 traffic jams, 819 road works, 1,339 other obstructions, 460 accidents, and 64 weather warnings. As a proper starting point for situation prediction, we derived relations between traffic objects using our situation-awareness prototype BeAware! to detect situations that possibly require a human operator’s attention. In order to restrict detected situations to those most relevant, we defined 13 situations in cooperation with the Austrian highways agency, of which three interesting ones (see footnote 2) were selected for this evaluation. Table 4 lists these situations together with the characteristics of involved objects and the number of occurrences in our data set.

Evaluation method. Based on the situations detected, we predicted possible future situations with SPNs. We discuss the predicted situations in the context of our major goals: We determined the effectiveness of increasing precision (challenges A and B) by comparing the resulting number of possible future situations to that derived from the respective unoptimized calculi. (H1: Optimizing calculi reduces the number of falsely predicted situations while retaining critical ones). The potential of distance descriptions for increasing expressiveness (challenges C and D) was evaluated by comparing the results to recorded real-world data, using duration and probability as example distances. (H2: Temporal distances and probabilities match real-world evolutions). It must be noted that, although covering a period of four weeks, the data with which we were provided were updated very infrequently. Hence, we obtained only a small number of observed evolutions.
Although the first evaluation indicates that the approach we propose to challenges C and D is applicable, further (real-world) observations are needed to confirm H2. To this end, we are continuously extending our data set.

Evaluation setup. In our evaluation, we employed the guards IsPermeable, IsMoveable, and IsScalable, which use object characteristics, in conjunction with interdependencies between the calculi mereotopology, spatial distance, and size. In particular, transitions in RCC require objects to be very close to each other; transitions to Proper Part (Inverse) and Equals check relative sizes; spatial distances when being Partially Overlapping, Proper Part (Inv.), or Equals are restricted on the basis of object size to increase precision. For increasing expressiveness, we used temporal axioms mapping object speed to durations (on the basis of domain bindings defining spatial distances), as well as probability axioms mapping orientation of objects towards each other to probabilities. Figure 7 summarizes the achieved prediction quality enhancement. It can be seen that object characteristics alone are not effective, but in combination with relation interdependencies they exclude many false predictions, and that the predicted distances correspond to the expected overall evolution. Below, we discuss these results in more detail.

Table 4. Overview of situations that are starting points for predictions (situation description and formalization, including object characteristics; # = number of occurrences)

Sit. 1 (# = 17): traffic jam close to another traffic jam (may merge)
  TrafficJam(o1) ∧ TrafficJam(o2) ∧ Disrelated(o1, o2) ∧ Close(o1, o2) ∧ FrontBack(o1, o2)
  Traffic jam (o1): moveable, permeable, scalable, large, medium speed
  Traffic jam (o2): moveable, permeable, scalable, medium size, slow

Sit. 2 (# = 10): wrong-way driver heading towards road works (may cause an accident)
  WrongWayDr.(o1) ∧ Roadworks(o2) ∧ Disrelated(o1, o2) ∧ Close(o1, o2) ∧ FrontFront(o1, o2)
  Wrong-way driver: moveable, non-permeable, small, fast
  Road works: permeable, large, static

Sit. 3 (# = 2): poor driving conditions (snow) in the area of road works (may evolve towards border)
  PoorDrivingConditions(o1) ∧ Roadworks(o2) ∧ ProperPart(o1, o2) ∧ VeryClose(o1, o2)
  Poor driving conditions: moveable, permeable, medium size, slow
  Road works: permeable, large, static

Footnote 2: Showing strengths, and shortcomings indicating potential improvements in SPNs.

Situation 1: Traffic jams close to each other. In the first situation, two traffic jams are very close (about 0.5 km apart, as stated in the real-world data), but still disrelated from each other (meaning that they will probably merge), with the rear one being larger, and growing faster, than the front one (real-world data, cf. Table 4). Traffic jams, which can move, scale, and have permeable boundaries, do not allow us to exclude situations by inspecting object characteristics. In this case, prediction precision can only be increased by additionally taking relation interdependencies, as described above, into account, reducing the number of predicted situations to nine (excluding evolutions such as traffic jams partially overlapping but far from each other). In these predicted situations, two crucial evolutions are preserved: the traffic jams may drift apart or merge (confirms H1). Distances for duration and probability are based on the current motion of traffic jams and state that merging is more likely than drifting apart, although it will take a considerable amount of time.
This prediction was confirmed by our data set, which showed that the observed traffic jams indeed merged into a single large one about 90 minutes after detection.
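The effect of relation interdependencies on the set of candidate situations can be pictured with a toy filter over combinations of a mereotopological relation and a spatial distance. Everything in the sketch is an assumption made for illustration: the distance vocabulary, the function name `consistent`, and the single rule that every relation other than Disrelated requires the objects to be very close. The resulting counts are properties of this toy vocabulary and are not the numbers reported in the evaluation.

```python
from itertools import product

# Illustrative vocabularies; the distance values are hypothetical.
RCC = ["Disrelated", "PartiallyOverlapping", "ProperPart",
       "ProperPartInverse", "Equals"]
DISTANCE = ["VeryClose", "Close", "Far"]


def consistent(rcc, distance):
    """Assumed interdependency: anything but Disrelated requires VeryClose."""
    return rcc == "Disrelated" or distance == "VeryClose"


# All combinations of the two calculi, then only the consistent ones.
candidates = list(product(RCC, DISTANCE))
remaining = [pair for pair in candidates if consistent(*pair)]
```

With this rule, a combination such as Partially Overlapping together with Far is discarded, while drifting apart (Disrelated and Far) and merging (Partially Overlapping and Very Close) both remain.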
Fig. 7. Evaluation overview
Situation 2: A wrong-way driver heads towards road works. In the second scenario, a wrong-way driver (small, fast) is detected heading towards road works (large, static). The fact that a wrong-way driver’s boundary is not permeable discards predictions involving Proper Part Inverse and reduces their number from 20 to 16 solely on the basis of object characteristics. With relation interdependencies, a further reduction to 7 (size relationship between the two objects) results in the following predicted evolutions: the wrong-way driver may enter and then drive past the area of road works, or he/she may turn around (never observed in our data set). The most likely scenario is that, due to the wrong-way driver’s current orientation and speed, he/she may enter the area of road works. This prediction partially matches our data set, in which, ten minutes after being detected, the wrong-way driver entered the area of road works (becoming Proper Part in accordance with our prediction) and then, luckily and against all odds, managed to drive past the road works (which was assigned only low probability).

Situation 3: Snow in the area of road works. In our final scenario, poor driving conditions (a medium-sized area of snowfall) are detected within a large area of road works. The non-deterministic nature of weather conditions, together with the limited amount of information currently available to our system (e. g., directions of weather movements are not provided), makes it impossible to exclude situations on the basis of object characteristics. It also does not allow us to exclude a large number of evolutions when considering relation interdependencies, and makes deriving probabilities impossible (no direction given). Only approximate durations (distant and very distant) may be given on the basis of the domain knowledge that weather conditions typically change slowly. These predicted durations are consistent with observations in our real-world data reporting poor driving conditions over a period of three hours in one case and over about a day in another case.
6 Lessons Learned and Future Work
In this section, we present lessons learned from implementing and using Situation Prediction Nets and, based on these findings, indicate directions for future work.

Object characteristics are only effective in combination with relation interdependencies. Situations described with objects and relations are the basis for neighborhood-based predictions in situation awareness. The potential for enhancing prediction quality when using object characteristics in isolation may, depending on the domain, therefore be rather limited (as shown for road traffic management by our evaluation). Only in combination with relation interdependencies can one achieve substantial improvements in such a case.

CPNs are suitable for deriving, but not for keeping track of, distance descriptions. Distance descriptions must be retained for later examination by human operators. In a naive approach, distances are attached to tokens or accumulated in dedicated places, resulting in prediction state space being no longer bounded by the number of possible combinations between relations. Persistent storage outside the CPN seems to be better suited to preventing this.

Distances in axiomatic mappings should be learned. Distances and probabilities modeled a priori in axiomatic mappings are only a starting point for the system. In order to keep them up-to-date, a learning component should analyze events occurring in the domain (e. g., to learn that something that was considered unlikely actually occurs more often than assumed).

Recursive aggregation of situations in predictions facilitates re-use. In our previous work [5], we encouraged re-use of situations as objects in recursively defined higher-level situations. For example, a wrong-way driver could head towards road works in snowfall. We could re-use the situation “Poor driving conditions in the area of road works”, and relate the wrong-way driver with this situation. SPNs, therefore, need to be extended with concepts for aggregating simulations of one net into the simulation of another one, for instance by representing the marking of one Petri net in a two-colored token of another one.

Predictions using partial information require planning. Neighborhood-based prediction assumes that starting points for predictions already comprise all relevant objects. This is particularly problematic if causal relations between objects are described. For example, the emergence of the critical situation “Accident causes traffic jam” should clearly be indicated by the corresponding initial situation, namely the occurrence of an accident. However, the absence of relations prevents our current approach from predicting any situation in this case. We intend to extend SPNs with ideas from qualitative planning [22]. Critical situations could then be represented as goals, and the planning approach should yield the necessary steps (e. g., emergence of a traffic jam) for reaching them.
References

1. Apt, K.R., Brand, S.: Constraint-based qualitative simulation. In: Proc. of the 12th Intl. Symp. on Temporal Rep. and Reasoning, pp. 26–34. IEEE, Los Alamitos (2005)
2. Bailey-Kellogg, C., Zhao, F.: Qualitative spatial reasoning - extracting and reasoning with spatial aggregates. AI Magazine 24(4), 47–60 (2003)
3. Barwise, J., Perry, J.: Situations and Attitudes. MIT Press, Cambridge (1983)
4. Baumgartner, N., Gottesheim, W., Mitsch, S., Retschitzegger, W., Schwinger, W.: On optimization of predictions in ontology-driven situation awareness. In: Proc. of the 3rd Intl. Conf. on Knowl., Science, Eng. and Mgmt., pp. 297–309 (2009)
5. Baumgartner, N., Gottesheim, W., Mitsch, S., Retschitzegger, W., Schwinger, W.: BeAware! – situation awareness, the ontology-driven way. Accepted for publication in: Data and Knowledge Engineering (Journal) (2010)
6. Baumgartner, N., Retschitzegger, W., Schwinger, W., Kotsis, G., Schwietering, C.: Of situations and their neighbors – Evolution and Similarity in Ontology-Based Approaches to Situation Awareness. In: Kokinov, B., Richardson, D.C., Roth-Berghofer, T.R., Vieu, L. (eds.) CONTEXT 2007. LNCS (LNAI), vol. 4635, pp. 29–42. Springer, Heidelberg (2007)
7. Bhatt, M., Rahayu, W., Sterling, G.: Qualitative simulation: Towards a situation calculus based unifying semantics for space, time and actions. In: Cohn, A.G., Mark, D.M. (eds.) COSIT 2005. LNCS, vol. 3693. Springer, Heidelberg (2005)
8. Bonchi, F., Brogi, A., Corfini, S., Gadducci, F.: Compositional specification of web services via behavioural equivalence of nets: A case study. In: Proc. of the 29th Intl. Conf. on Application and Theory of Petri Nets and other models of concurrency, Xi’an, China, pp. 52–71. Springer, Heidelberg (2008)
9. Cohn, A.G., Renz, J.: Qualitative Spatial Representation and Reasoning. In: Handbook of Knowledge Representation, pp. 551–596. Elsevier, Amsterdam (2008)
10. Cui, Z., Cohn, A.G., Randell, D.A.: Qualitative simulation based on a logical formalism of space and time. In: Proc. AAAI 1992, pp. 679–684. AAAI Press, Menlo Park (1992)
11. Dylla, F., Wallgrün, J.O.: On generalizing orientation information in OPRA_m. In: Freksa, C., Kohlhase, M., Schill, K. (eds.) KI 2006. LNCS (LNAI), vol. 4314, pp. 274–288. Springer, Heidelberg (2007)
12. Dylla, F., Wallgrün, J.O.: Qualitative spatial reasoning with conceptual neighborhoods for agent control. Intelligent Robotics Systems 48(1), 55–78 (2007)
13. Freksa, C.: Conceptual neighborhood and its role in temporal and spatial reasoning. In: Proc. of the Imacs Intl. Workshop on Decision Support Systems and Qualitative Reasoning, pp. 181–187 (1991)
14. Ibrahim, Z.M., Tawfik, A.Y.: An abstract theory and ontology of motion based on the regions connection calculus. In: Miguel, I., Ruml, W. (eds.) SARA 2007. LNCS (LNAI), vol. 4612, pp. 230–242. Springer, Heidelberg (2007)
15. Jensen, K.: Coloured Petri Nets - Basic Concepts, Analysis Methods and Practical Use, 2nd edn. Springer, Heidelberg (1997)
16. Jensen, K., Kristensen, L.M., Wells, L.: Coloured petri nets and CPN Tools for modeling and validation of concurrent systems. International Journal on Software Tools for Technology Transfer 9, 213–254 (2007)
17. Kokar, M.M., Matheus, C.J., Baclawski, K.: Ontology-based situation awareness. International Journal of Information Fusion 10(1), 83–98 (2009)
18. Laborie, S.: Spatio-temporal proximities for multimedia document adaptation. In: Euzenat, J., Domingue, J. (eds.) AIMSA 2006. LNCS (LNAI), vol. 4183, pp. 128–137. Springer, Heidelberg (2006)
19. Li, B., Fonseca, F.: TDD - a comprehensive model for qualitative spatial similarity assessment. Journal on Spatial Cognition & Computation 6(1), 31–62 (2006)
20. Ligozat, G.: Towards a general characterization of conceptual neighborhoods in temporal and spatial reasoning. In: Proceedings of the AAAI 1994 Workshop on Spatial and Temporal Reasoning, Seattle, WA, USA, pp. 55–59. AAAI, Menlo Park (1994)
21. Mulyar, N., van der Aalst, W.: Towards a pattern language for colored petri nets. In: Proc. of the 6th Workshop on Practical Use of Coloured Petri Nets and the CPN Tools, Aarhus, Denmark, pp. 39–58. University of Aarhus (2005)
22. Ragni, M., Wölfl, S.: Temporalizing cardinal directions: From constraint satisfaction to planning. In: Proc. of 10th Intl. Conf. on Principles of Knowledge Representation and Reasoning, pp. 472–480. AAAI Press, Menlo Park (2006)
23. Randell, D.A., Cui, Z., Cohn, A.G.: A spatial logic based on regions and connection. In: Proceedings of the 3rd International Conference on Knowledge Representation and Reasoning, pp. 1–34. Morgan Kaufmann, San Francisco (1992)
24. Reisig, W., Rozenberg, G.: Informal intro. to Petri nets. In: Lectures on Petri Nets I: Basic Models - Advances in Petri Nets, pp. 1–11. Springer, Heidelberg (1998)
25. van de Weghe, N., Maeyer, P.D.: Conceptual neighborhood diagrams for representing moving objects. In: Proceedings of the ER Workshop on Perspectives in Conceptual Modeling, Klagenfurt, Austria, pp. 228–238. Springer, Heidelberg (2005)
26. Zhang, G., Meng, F., Jiang, C., Pang, J.: Using petri net to reason with rule and owl. In: Proc. of the 6th Intl. Conf. on Computer and Information Technology, Seoul, Korea, pp. 42–47. IEEE, Los Alamitos (2006)
Granularity in Conceptual Modelling: Application to Metamodels

Brian Henderson-Sellers (1) and Cesar Gonzalez-Perez (2)

(1) Faculty of Engineering and Information Technology, University of Technology, Sydney, PO Box 123, Broadway, NSW 2007, Australia
(2) The Heritage Laboratory (LaPa), Spanish National Research Council (CSIC), Santiago de Compostela, Spain
[email protected], [email protected]
Abstract. The granularity of conceptual models depends at least in part on the granularity of their underpinning metamodel. Here we investigate the theory of granularity as it can be applied to conceptual modelling and, especially, metamodelling for information systems development methodologies. With a background context of situational method engineering, this paper applies some theoretical works on granularity to the study of current metamodelling approaches. It also establishes some granularity-related best practices to take into account when adopting a metamodel, especially for its future use in developing method fragments for situational method engineering. Using these best practices will result in components of better quality and, consequently, better conceptual models and methodologies. Keywords: concepts, granularity, metamodels, abstraction, modeling.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 219–232, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

In conceptual modelling we represent concepts with abstractions variously named “entity”, “class”, “type”, “object”, “agent” etc. (called here simply “entity” for generality). Although these entities (should) follow a definition in terms of either an ontological definition or specification by means of a metamodel (itself a conceptual model), there is no guarantee that the “size” of the resulting entities will be consistent across the model (be it metamodel or conceptual model). In the context of conceptual modelling at both these levels (metamodel as well as model), we identify from the theoretical literature an underpinning theory for abstraction. The theory is then available to assist us in, firstly, determining a good quality set of meta-elements that will not only be self-consistent but that will, in due course, be able to generate “instances” (in the conceptual model) that are themselves of a consistent “size”. An example of where this is relevant is in the area of situational method engineering (SME) [1] [2] [3] in which each conceptual “instance”, known as a method fragment, is used as the basic “building block” for the creation and ultimate use of a software development methodology.

In this paper, we therefore investigate how to apply a theory of granularity abstraction (Section 2) to both the metamodel and conceptual model level particularly in the context of software development methodologies. Following this discussion, in Section 3 we compare and contrast a number of methodology metamodels, focussing on work units, and evaluate how these are modelled in a variety of standard process-focussed metamodels (considered as conceptual models). In Section 4, we identify some guidelines for the best practice application of granularity in conceptual modelling. Section 5 outlines some related work before concluding in Section 6.
2 Granularity Abstraction

Abstraction refers to a process by which detail is removed in order to simplify our understanding of some entity or system (e.g. [4]). During such removal via a mapping between two representations, it is necessary to retain the essence of the system such that the simplification can be used effectively as a surrogate for the more detailed (and thus less abstract) entity/system. The aim of this process is to provide assistance with the original problem (e.g. [5]). Formally, an abstraction can be defined as a mapping between two languages which may be similar or dissimilar (e.g. [6]) and defined as [5]:

f : Σ1 ⇒ Σ2 is a pair of formal systems (Σ1, Σ2) with languages Λ1 and Λ2 respectively and an effective total function fΛ : Λ1 → Λ2 (1a)

or more simply [4] as:

given two languages L0 and L1, abs : L0 → L1 is an abstraction. (1b)
where abs is an abstraction function, L0 is known as the ground language and L1 the abstract language. Having specified what an abstraction is, we can now look at the particularities of the granularity abstraction. By granularity, we mean (loosely) the number of entities in different representations of the same conceptual model. We can thus talk about a model that consists of a small number of entities as having coarse granularity or of a large number of entities as possessing fine granularity [7]. Hobbs [8] (quoted in [6]) argues that when we conceptualize the world, we do so at different levels of granularity, these levels being determined by the number of characteristics that are seen as relevant to a particular purpose. For example, a forest is composed of trees; that means that the forest (the composite) can be used as a granular abstraction of the trees in it (the components). Granularity relationships can be taken to the concept plane: the forest concept can be said to be a granular abstraction of the tree concept. Lakoff [9] also deals with granularity in terms of the gestalt effect: he says that our brains and psychology, as human beings, are tuned to the perception of entities in a quite narrow range of spatial and temporal scales. For example, we do not immediately
perceive atoms or galaxies, although we are constantly “seeing” them. We don’t perceive geological eras, although we are in one. Without resorting to these extreme examples, we can use the following scenario. Walk with a friend to the edge of a forest and ask him/her what s/he sees. The response will probably be “trees”, or perhaps “a forest” but it is unlikely to be “leaves”, even though this is also correct. The response is even more unlikely to be “cells” or “part of the biosphere”. Our brain is better tuned to perceiving trees than leaves or forests, and much better attuned to trees than to cells or populations. For us, the collection of parts that compose a tree is perceived as a gestalt (definition: collection of parts perceived as a whole) rather than a set of individual parts. We are aware of the parts, and we can perceive each one individually if required, but we tend to perceive the whole (here, tree) first. Our brain is tuned to the tree concept. A tree frog probably perceives the world differently: its brain is most likely tuned to the leaf concept. From a modelling perspective, we can say that the sequence [molecule-cell-leaf-tree-forest-biosphere] is a chain made of concepts, each related to the next by whole/part relationships, although it should be noted that these meronymic relationships are not identical, some being configurational and others being non-configurational [10]. Since granularity in this example is linked to whole/part relationships [11], we can say that this sequence can be seen as a series of granularity steps, from the finer (molecule) to the coarser (biosphere). We can measure the abstraction of a concept in this sequence with relation to another concept in the sequence as the number of whole/part relationships that must be traversed to navigate from the source to the reference concept. For example, forest has a granularity of 2 with respect to leaf.
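The traversal count just described can be computed directly from the position of each concept in the chain. The following Python sketch is only an illustration: the names `CHAIN` and `steps_between` are hypothetical, and the chain is treated as a simple ordered list, ignoring the differences between kinds of meronymic relationship noted above.

```python
# The whole/part chain from the text, ordered from finest to coarsest.
CHAIN = ["molecule", "cell", "leaf", "tree", "forest", "biosphere"]


def steps_between(source, reference):
    """Number of whole/part relationships traversed between two concepts."""
    return abs(CHAIN.index(source) - CHAIN.index(reference))
```

For instance, `steps_between("forest", "leaf")` counts the two relationships leaf-tree and tree-forest.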
The gestalt effect says that the most intuitive concept for us in this sequence is tree, so we can argue that we should select tree as the basic reference point in the sequence, or the “zero”, so to speak. Traversing a whole/part relationship from the whole to the part means that granularity increases, and traversing a whole/part relationship from the part to the whole means that it decreases. Assuming this, we can say that cell has a granularity of 2 and that forest has a granularity of -1.

In a second example, rather than considering granularity chains in terms of whole/part relationships, various levels of granularity can be seen using a classification/generalization depiction. For example, at a fine granularity, we might have walkfrom(location), drivefrom(location) and flyfrom(location), which are equivalent to the coarse granularity gofrom(location) [4]. We might browse in a library collection through all the titles, some of which are CDs, some books (Figure 1). In the fiction section, we might find historical novels and crime novels. Historical novels might be further classified as pre- and post-1500 CE. Individual books that are classified as, say, pre-1500 Historical Novels are members of the set thus defined, yet are also members of the set at the coarse granular level as specified by Collection Item. As with the meronymic example, from a modelling perspective, we can say that the sequence [pre-1500 CE historical novel - historical novel - fiction book - book - title - collection item] has a series of granularity steps from fine (pre-1500 CE historical novel) to coarse (collection item).

[Figure 1: generalization/specialization hierarchy containing the entities Collection Item, Title, CD, Book, Fiction Book, Factual Book, Historical Novel, Crime Novel, Pre-1500 CE Historical Novel and Post-1500 CE Historical Novel]

Fig. 1. Example of granularity levels in a generalization/specialization hierarchy

A third example, and a third application of abstraction, is that of classification/instantiation. When an object is created from a class, detail is added (by specifying values for the attributes and links for the associations). Or, from the opposite viewpoint, when an object is ascribed to a class, detail is removed. In fact, the process of finding the classes that takes place during the usual analysis stages in most object-oriented software development projects is about looking at the objects around us and removing details (i.e. abstracting) in order to obtain the corresponding classes. This kind of abstraction leads to the common appellation of “levels of abstraction” as applied to the multi-level architecture as promulgated in the OMG for standards such as UML, MOF and SPEM (Figure 2).

Mani [6] (citing [8] and [5]) formally defines a granularity abstraction as:

An abstraction F is a granularity abstraction iff
(i) F maps individual constants in Λ to their equivalence class under the indistinguishability relation ~ in Λ (thus, for an individual x in Λ, F(x) = K(x) where K(x) = {y such that x ~ y});
(ii) F maps everything else, including the predicates in Λ, to itself. (2)
Fig. 2. Four levels of abstraction used in several OMG standards (after [12]): M3 Metametamodel, M2 Metamodel, M1 Model and M0 Data, with instance_of relationships between adjacent levels. © Pearson Education Limited.
Alternatively, granularity is formally defined in [4] as x1, …… xn ∈ Lo, x ∈ L1 and abs(xi) = x for all i ∈ [1,n] (where x is either a constant, a function or a predicate).
(3)
Thus we are mapping n elements from a fine-grained system (L0) to a single element in a coarse-grained system (L1). A natural consequence of this, it is noted, is that granularity abstractions tend to lose information. We propose here a measure of the system granularity, GS, as being related to the number of entities in each system. Since it is reasonable to propose that the fine-grained system should have a smaller value for GS than for a coarse-grained system, we hypothesize that the grain size (system granularity value) is thus a reciprocal measure of the number of granularity abstraction mappings (equation 2 or 3) between two entities [9]. Thus
GS = 1/n    (4)
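The relation between the mapping of Equation 3 and the measure of Equation 4 can be sketched as follows. This is an illustrative fragment only: the dictionary representation of abs, the function name and the Car example are assumptions introduced here, not part of the cited definitions.

```python
from fractions import Fraction

def system_granularity(abs_map, coarse_entity):
    """Equation 4: GS = 1/n, where n fine-grained elements map to coarse_entity."""
    n = sum(1 for target in abs_map.values() if target == coarse_entity)
    if n == 0:
        raise ValueError("no fine-grained element maps to this entity")
    return Fraction(1, n)

# Hypothetical abs mapping: three elements of a fine-grained system
# mapped onto one element of a coarse-grained system.
abs_map = {"Wheel": "Car", "Engine": "Car", "Chassis": "Car"}

gs = system_granularity(abs_map, "Car")
print(gs, float(gs))
```

The more fine-grained elements are collapsed into one coarse-grained element, the smaller the resulting value of GS.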
This measure refers to entities represented in a single system/model. In other words, granularity refers to the degree of decomposition/aggregation, generalization or classification levels often observed in terms of the number and size of extant entities. These are generally regarded as orthogonal. With a composition granularity abstraction, entities take the role of “parts” in a whole-part (aggregation or meronymic) relationship. Moving from the “parts” (fine detail) to the “whole” (coarse detail) loses detail, again supporting the notion that in many senses granularity is a kind of abstraction. In the OO literature, this removal of detail in the process of moving between granularity levels can be modelled not only by a whole-part relationship but also by a generalization relationship between two sets – the generalization granularity abstraction; or by an instance-of relationship between objects and their class – the classification granularity abstraction. Consequently, making such parts or subclasses visible/invisible changes the granularity value of the overall system/model.
B. Henderson-Sellers and C. Gonzalez-Perez
Granularity is thus a kind of abstraction that uses aggregation, generalization or classification relationships between entities to achieve simplification. This kind of mechanism produces entities that are more coarse granular than the original, fine grained entities. For example, consider the entities along the main “chain” in Figure 1: x1=Collection Item, x2=Title, x3=Book, x4=Fiction Book, x5=Historical Novel and x6=Pre-1500 CE Historical Novel. Using Equation 3, we then map each of these six elements in the fine granular system (of Figure 1) into a single coarse grained entity called simply Library Collection Item, i.e. abs(xi)=Library Collection Item for all xi in the original fine grained system (L0). [A similar argument applies if the fine-grained system uses aggregation relationships rather than the generalization relationships of Figure 1.] One problem is that this approach, by itself, cannot guarantee anything about the comparative sizes of individual entities. For the example in Figure 1, we have no information about the consistency or otherwise of entities such as Historical Novel, Fiction Book, Book etc. This is a topic requiring detailed information on the structure and semantics of each entity and is beyond the scope of this current paper. In the following sections, we apply these theoretical ideas of granularity to a range of current metamodels (Section 3) before a final discussion (Section 4), a brief look at related work (Section 5) and the conclusions (Section 6).
3 Granularity in Current Metamodels
Rather than assessing the use of granularity in conceptual models in general, we take a special case of conceptual model commonly called a “metamodel”. A metamodel is a model of models (e.g. [13]) and can alternatively be thought of as a language to express conceptual models. The context may be, for example, work products (using a metamodel-based language like UML [14]) or methodologies (in the software engineering context). Here, we focus on the latter and evaluate the current situation for methodology metamodels, examining their granularity in terms of how they model one particular metamodel element: WorkUnit (Figure 3).
[Figure 3 diagram: Producer, Work Unit and Work Product, linked by relationships labelled produces, performs, creates, evaluates and iterates.]
Fig. 3. The metamodel triangle of Producer, Work Unit and Work Product that underpins SPEM, OPF and 24744 standards for software engineering process modelling
[Figure 4 diagram: three domains. Metamodel domain (methodology metamodel, a.k.a. metamodel): language for describing methodologies. Method domain (methodology, a.k.a. method or process model): both a methodology and a model for enactment of processes. Endeavour domain (process): as enacted by real people on a specific project or endeavour.]
Fig. 4. Terminology of the three domains of interest
[Figure 5 diagram; recoverable labels: Process Element, Activity, Task.]
Fig. 5. Alternative representations for Process Element – fine granular (left hand side) or coarse granular (right hand side)
While metamodels (in object-oriented software engineering) were originally used to formulate modelling languages (e.g. [15]) such as UML [14] [16], more recently they have been used to define process models (e.g. the OPEN Process Framework (OPF) [17] [18]; and the Software Process Engineering Metamodel (SPEM) [19]) and very recently combining both process and product aspects in an integrated methodology metamodel standard [20] [21] – see Figure 4, which elaborates on the three lower levels of Figure 2, showing how the methodology metamodel acts as the language for describing a specific methodology, which in turn acts as a language (or model) for the actual process carried out in human endeavours. Both OPF and SPEM emphasize process aspects, whilst also supporting a link to work product modelling languages like UML or AUML [22]. At the topmost level,
[Figure 6 diagrams, panels (a) to (d); recoverable labels: Work Definition, Life Cycle Process Group, Process, Activity, Task, Step, List, Note, Technique, and WorkUnit with attributes +StartTime, +EndTime, +Duration and association ends +Context 1, +Component 0..*, +Parent 0..1, +Child 0..*.]
Fig. 6. Metamodel fragments from (a) SPEM V1 [19], (b) 12207 [25], (c) proposed 12207/15288, (d) SEMDM [21]
both the OPF and SPEM metamodels have only three major classes (as depicted in Figure 3) – although there are many subclasses (not shown here) as well as two support classes (Stage and Language [18 Appendix G]) – here we consider just one of these metalevel entities: WorkUnit. In this section, we address the granularity of entities in various methodology metamodels, focussing on the definitions of process-focussed entities (often called work units) in the metamodel such as Activity, Task, Step etc. as compared to simply ProcessElement (Figure 5) [this argument can then be applied similarly to the other top-level classes of Figure 3 as well as to the support classes and subclasses such as
Stage, Phase, Lifecycle, Language etc.]. We highlight those entities in these and other methodology metamodels that represent WorkUnits i.e. what work needs to be done and how – but neglecting the “who”, the “when” and the “what” (as depicted, for instance, in the Zachman framework [23]) for the sake of simplicity. For example, in the first version of SPEM [19], this notion is represented by the class WorkDefinition, which is represented by the subtype Activity, which in turn is composed of Steps and classified by a Discipline (Figure 6(a)). In other words, there are three classes involved of different granularities: Discipline – Activity – Step. The value of GS is thus 0.33. Each of these three entities can be mapped from the fine granular depiction of Figure 6 into the single coarse-grained entity, here Work Definition. (A similar application of Equation 4 to the other parts of Figure 6 reaches the same conclusion: that the granularity abstraction mapping is valid.) [In Version 2.0 of SPEM [24], the metamodel is much more complex, these three metaclasses being split between the Method Content package and the Core package, although the granularity remains similar – it will therefore not be included in this proof-of-concept comparison.] In contrast, ISO/IEC 12207 [25] has five levels (Figure 6(b): LifeCycle Process Group – Process – Activity – Task – List). Here, the value of GS is 0.20. More recently, a merger of ISO/IEC 12207 and 15288 has led to a slightly different model (Figure 6(c)) in which Processes can have sub-processes as well as activities and tasks, giving a value for GS of 0.25. In ISO/IEC 24744 [21], Processes can be broken down recursively into Tasks and subtasks which are effected by means of the application of Techniques (Figure 6(d)). Based on Equation 4, we calculate a value for GS of 0.25.
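The GS values quoted above follow directly from Equation 4 and can be reproduced with a short calculation. The sketch below is illustrative only: the entity chains for Figures 6(a) and 6(b) are those named in the text, whereas for Figures 6(c) and 6(d) only the resulting value of 0.25 is stated, so n = 4 is an assumption inferred from that value. All identifiers are hypothetical.

```python
def system_granularity(n: int) -> float:
    """Equation 4: GS = 1/n for n entities mapped onto one coarse-grained entity."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 / n

# Entity chains named in the text for Figures 6(a) and 6(b).
chains = {
    "6(a) SPEM V1": ["Discipline", "Activity", "Step"],
    "6(b) ISO/IEC 12207": ["LifeCycle Process Group", "Process",
                           "Activity", "Task", "List"],
}
counts = {name: len(chain) for name, chain in chains.items()}

# Assumption: the text reports only GS = 0.25 for these two, implying n = 4.
counts["6(c) 12207/15288"] = 4
counts["6(d) ISO/IEC 24744"] = 4

gs_values = {name: system_granularity(n) for name, n in counts.items()}
for name, gs in gs_values.items():
    print(f"Figure {name}: n = {counts[name]}, GS = {gs:.2f}")
```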
4 Discussion
Although the above theory of granularity allows us to calculate a system granularity value, there is no judgemental value regarding whether this number should be high or low or medium to be able to conclude that the system is of “good quality”. A value of 1 implies a single entity – for any other system than the trivial, clearly the loss of information thus entailed is intolerable and inconsistent with the notion of good quality. Similarly, for a value of GS→0, it is highly unlikely that such a disparate system could be of good quality since the understandability factor will most likely be low. Thus we seek a ‘good’ value as being some intermediate value. For example, Figure 6(a) with a value of GS of 0.33 and Figure 6(b) with a value of 0.2 – which is “better”? Both have intermediate values of GS. Clearly, the difference lies in the semantics of the entities. We must ask whether every intermediate entity adds value to the overall chain of concepts and/or whether any one of the entities is actually a compound of two or more atomic concepts – the goal, we suggest, should be that all elements are atomic in nature. A good rule of thumb would therefore be to evaluate each entity in turn and ask these two questions. Thus for Figure 6(a) we might argue that Activity is a compound of two concepts: Activity and Task; and for Figure 6(b) we might argue that Process and Activity are semantically essentially the same or we might argue that since a Task doesn’t really consist of a number of Lists, the List concept is out of scope, thus giving a more realistic value of GS of 0.25 – almost commensurate with the value of GS for Figure 6(a).
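The effect of rescoping on GS can be illustrated with a small sketch. It assumes that rescoping amounts to dropping the out-of-scope entity from the chain before applying Equation 4; the helper names are hypothetical.

```python
def gs(chain):
    """Equation 4 applied to a chain of entities: GS = 1/n."""
    return 1 / len(chain)

def rescope(chain, out_of_scope):
    """Drop entities judged to be out of scope, preserving order."""
    return [entity for entity in chain if entity not in out_of_scope]

# Chain of Figure 6(b) as listed in the text.
chain_b = ["LifeCycle Process Group", "Process", "Activity", "Task", "List"]

# Treat List as out of scope, as argued above.
rescoped_b = rescope(chain_b, {"List"})

print(f"before: {gs(chain_b):.2f}, after: {gs(rescoped_b):.2f}")
```

The calculation itself is trivial; the judgement about which entities are atomic or out of scope remains a semantic one that the metric cannot make.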
Similarly, we might evaluate Figure 6(b) versus Figure 6(c) by first eliminating the LifeCycle Process Group of Figure 6(b) thus rendering both Figures 6(b) and 6(c) to have identical values of system granularity. However, there remain differences, namely the additional complexity (but not granularity) of Figure 6(c) resulting from two additional relationships: from Process to Task and recursively from Process to (sub)Process. This leads us to conclude that a higher quality model cannot be judged solely on the basis of granularity – yet system granularity may certainly be a useful contributory factor to an overall assessment of model quality. In other words, we need not a single metric (like GS) but a vector of measurements. Now considering Figure 6(d), which also has a value of system granularity equal to 0.25, we see a somewhat different structure since this metamodel replaces the white diamond aggregation of the first three metamodels with three specializations, the whole-part structure resulting from the single recursive relationship at the top level, inherited by all the subtypes. In other words, granularity cannot discriminate between this and Figure 6(b) and Figure 6(c) rescoped as above. Yet the semantics – or at least the way in which the modelling elements are related – seem different. At this stage of our research, it is impossible to advocate decision-making on anything other than subjective grounds – asking whether the concepts are defined clearly and in such a way that either specialization or whole-part relationships can be argued as being more advantageous.
5 Related Work
While there has been considerable work on the theoretical characterization of abstraction and granularity (as discussed and cited in Section 2), we are not aware of any previous direct application of this theoretical material to the issues of conceptual modelling (and especially metamodelling) in software engineering. Kühne [26] discusses model features, including model abstraction as a projection, but does not investigate granularity abstractions. In situational method engineering, we are aware of the work of Rolland and Prakash [27], who discuss method knowledge in terms of three levels of granularity (called context, tree and forest of trees). Other related work is that of metamodel evaluation – although not an evaluation of granularity, there are several papers that offer quality assessments of metamodels (e.g. [28] and citations therein), including the application of the Semiotic Quality Framework of Lindland et al. [29] to the BPMN metamodel [30]; Recker et al. [31], who compare the UML and BPMN metamodels using the Rossi and Brinkkemper [32] metrics; and Ma et al. [33], who use standard OO metrics to assess the UML metamodel. Ontological assessment of various metamodels has also been presented; for example, Opdahl and Henderson-Sellers [34] as applied to UML. There is also a large literature on quality evaluation of conceptual models (e.g. [35-37]), although again not in terms of granularity. Indeed, the contributory nature of granularity to overall quality needs to be assessed in comparison with other important influencing quality factors. These might include, for instance, the clarity of the semantics and some of the “-ilities” e.g. flexibility, stability.
Granularity issues have also been addressed in the area of software patterns. For example, Fowler [38 p.59] discusses granularity criteria as something to be considered in the Enterprise Segment pattern. Gamma et al. [39 p.195] introduce Flyweight as a structural pattern specifically devoted to the efficient management of fine-grained objects. Rising [40 p.60] claims that the flyweight pattern is often combined with the Composite pattern to implement a specific kind of hierarchical structure that is suited to fine-grained objects; this corresponds to what we have called decomposition/aggregation granularity abstraction in previous sections. Roberts and Johnson [41 p.481] introduce the Fine-Grained Objects pattern and argue that fine-grained components are best to achieve reusability, and that by decomposing larger components into smaller abstractions that have no meaning in the specific problem domain where they were created, reusability in other (not yet foreseen) problem domains is increased.
6 Conclusions and Recommendations
Granularity, as defined by Equation 3 and measured by Equation 4, can be helpful in comparing two model structures but must be augmented by a semantic argument regarding the atomicity or otherwise of each concept in the concept “chain”. Currently, this is generally accomplished manually. Future work could make this evaluation more objective by the use of an ontological analysis approach, perhaps similar to that used in [42] in their evaluation of the quality of modelling languages like UML [34] or of the aggregation relationship itself [43]. Such an analysis, based on the Bunge-Wand-Weber model [44] [45], has proven useful in identifying construct overload (when one modelling construct refers to several ontological concepts), construct redundancy (when several modelling constructs refer to a single ontological concept), construct excess (when a model construct does not represent any ontological concept) and construct deficit (when a necessary ontological concept is not represented in the model). Conceptual model quality cannot be addressed by granularity alone. Model quality is, as yet, an unresolved issue (e.g. [46] [47]). It is most unlikely that a single measure can be identified or constructed that will uniquely characterize model quality. Rather, we should seek a vector of measurements that, together, will provide us with a good quality approach to measuring model quality. One element of that quality vector, we propose, should be granularity. Finally, in our overall context of situational method engineering (e.g. [3]), our next step is to understand how the granularity of the methodology metamodel affects the granularity of the process models that can be created in conformance with this metamodel. In other words, given a conceptual model (the metamodel) belonging to the Metamodel Domain of Figure 4, as actioned in, say, the ISO/IEC 24744 [21] architecture, what is the granularity of conformant models in the Method Domain?
Clearly, there is an immediate consequence that the generated method fragments will first have the same granularity in the sense that if the metamodel has entities called Life Cycle Process Group, Process, Activity, Task and List (as in Figure 6(b)), then there will also be method fragments of type Activity, type Task and type List; whereas if the metamodel used were more like that of Figure 6(a), then the only types of method fragments possible would be Activities and Steps. In the first case, the system granularity value, GS, is 0.2 (=1/5) whereas in the second it is 0.5 (=1/2).
However, there is also a second granularity issue for method fragments that needs investigation: the size of each fragment. To take an extreme example, we could generate a single instance of Task that encompassed everything that needed to be done in building the system – or we could have each Task instance addressing a small area such that, say, 50 tasks are needed to give complete coverage. This assessment will be the focus of a future paper.
Acknowledgements This is paper number 2010/05 of the Centre for Object Technology Applications and Research within the Centre for Human Centred Technology Design of the University of Technology, Sydney.
References
1. Welke, R., Kumar, K.: Method Engineering: A Proposal for Situation-Specific Methodology Construction. In: Cotterman, W.W., Senn, J.A. (eds.) Systems Analysis and Design: A Research Agenda. J. Wiley & Sons, Chichester (1991)
2. Brinkkemper, S.: Method Engineering: Engineering of Information Systems Development Methods and Tools. Inf. Software Technol. 38(4), 275–280 (1996)
3. Henderson-Sellers, B.: Method Engineering for OO System Development. ACM Comm. 46(10), 73–78 (2003)
4. Ghidini, C., Giunchiglia, F.: A Semantics for Abstraction. In: Procs. ECAI 2004 (2004)
5. Giunchiglia, F., Walsh, T.: A Theory of Abstraction. Artificial Intelligence 57(2-3), 323–390 (1992)
6. Mani, I.: A Theory of Granularity and its Application to Problems of Polysemy and Underspecification of Meaning. In: Cohn, A.G., Schubert, L.K., Shapiro, S.C. (eds.) Principles of Knowledge Representation and Reasoning: Proceedings of the Sixth International Conference (KR 1998), pp. 245–257. Morgan Kaufmann, San Mateo (1998)
7. Unhelkar, B., Henderson-Sellers, B.: ODBMS Considerations in the Granularity of a Reusable OO Design. In: Mingins, C., Meyer, B. (eds.) TOOLS15, pp. 229–234. Prentice Hall, Englewood Cliffs (1995)
8. Hobbs, J.: Granularity. In: Procs. Int. Joint Conf. on Artificial Intelligence, IJCAI 1985 (1985)
9. Lakoff, G.: Fire, Women, and Dangerous Things. What Categories Reveal About the Mind. University of Chicago Press, Chicago (1987)
10. Winston, M., Chaffin, R., Herrmann, D.: A Taxonomy of Part-Whole Relations. Cognitive Science 11, 417–444 (1987)
11. Jørgensen, K.A.: Modelling on Multiple Abstraction Levels. In: Procs. 7th Workshop on Product Structuring – Product Platform Development, Chalmers University of Technology, Göteborg (2004)
12. Henderson-Sellers, B., Unhelkar, B.: OPEN Modeling with UML. Addison-Wesley, Harlow (2000)
13. Favre, J.-M.: Foundations of Model (Driven) (Reverse) Engineering: Models. Episode I: Stories of The Fidus Papyrus and of The Solarus. In: Bézivin, J., Hockel, R. (eds.) Procs. Dagstuhl Seminar 04101 Language Engineering for Model-Driven Software Development (2005)
14. OMG: UML superstructure, v2.2. OMG documentsmsc/09-02-01: UML superstructure, v2.2 (2009)
15. Henderson-Sellers, B., Bulthuis, A.: Object-Oriented Metamethods. Springer, New York (1998)
16. OMG: UML Semantics, Version 1.0. OMG document ad/97-01-03 (January 13, 1997)
17. Graham, I., Henderson-Sellers, B., Younessi, H.: The OPEN Process Specification. Addison-Wesley, Harlow (1997)
18. Firesmith, D.G., Henderson-Sellers, B.: The OPEN Process Framework. Addison-Wesley, Harlow (2002)
19. OMG: Software Process Engineering Metamodel Specification. OMG document formal/02-11-14 (2002)
20. Standards Australia: Standard Metamodel for Software Development Methodologies, AS 4651-2004, Standards Australia, Sydney (2004)
21. ISO/IEC: Software Engineering – Metamodel for Software Development. ISO/IEC 24744, Geneva, Switzerland (2007)
22. Odell, J., Parunak, H.V.D., Bauer, B.: Extending UML for Agents. In: Wagner, G., Lesperance, Y., Yu, E. (eds.) Procs. Agent-Oriented Information Systems Workshop, 17th National Conference on Artificial Intelligence, Austin, TX, USA, pp. 3–17 (2000)
23. Zachman, J.A.: A Framework for Information Systems Architecture. IBM Systems J. 26(3), 276–292 (1987)
24. OMG: Software & Systems Process Engineering Meta-Model Specification. Version 2.0, OMG Document Number: formal/2008-04-01 (2008)
25. ISO/IEC: Software Life Cycle Processes. ISO/IEC 12207. International Standards Organization / International Electrotechnical Commission (1995)
26. Kühne, T.: Matters of (Meta-)modeling. Softw. Syst. Model. 5, 369–385 (2006)
27. Rolland, C., Prakash, N.: A Proposal for Context-specific Method Engineering. In: Brinkkemper, S., Lyytinen, K., Welke, R.J. (eds.) Method Engineering. Principles of Method Construction and Tool Support. Procs. IFIP TC8, WG8.1/8.2 Working Conference on Method Engineering, Atlanta, USA, August 26-28, pp. 191–208. Chapman & Hall, London (1996)
28. Henderson-Sellers, B.: On the Challenges of Correctly using Metamodels in Method Engineering. In: Fujita, H., Pisanelli, D. (eds.) New Trends in Software Methodologies, Tools and Techniques. Proceedings of the sixth SoMeT 2007, pp. 3–35. IOS Press, Amsterdam (200)
29. Lindland, O.I., Sindre, G., Solvberg, A.: Understanding Quality in Conceptual Modeling. IEEE Software 11(2), 42–49 (1994)
30. Wahl, T., Sindre, G.: An Analytical Evaluation of BPMN using a Semiotic Quality Framework. In: Siau, K. (ed.) Advanced Topics in Database Research, vol. 5, ch. VI, pp. 102–113. Idea Group Inc., Hershey (2006)
31. Recker, J.C., zur Muehlen, M., Siau, K., Erickson, J., Indulska, M.: Measuring Method Complexity: UML versus BPMN. In: Procs 15th Americas Conf. on Information Systems, San Francisco, CA, USA, August 6-9 (2009)
32. Rossi, M., Brinkkemper, S.: Complexity Metrics for Systems Development Methods and Techniques. Information Systems 21(2), 209–227 (2006)
33. Ma, H., Shao, W., Zhang, L., Ma, Z., Jiang, Y.: Applying OO Metrics to Assess UML Meta-models. In: Baar, T., Strohmeier, A., Moreira, A., Mellor, S.J. (eds.) UML 2004. LNCS, vol. 3273, pp. 12–26. Springer, Heidelberg (2004)
34. Opdahl, A., Henderson-Sellers, B.: Ontological Evaluation of the UML using the Bunge-Wand-Weber Model. Software and Systems Modelling 1(1), 43–67 (2002)
35. Genero, M., Piattini, M., Calero, C.: A Survey of Metrics for UML Class Diagrams. Journal of Object Technology 4(9), 59–92 (2005), http://www.jot.fm/issues/issue_2005_11/article1
36. Unhelkar, B.: Verification and Validation for Quality of UML 2.0 Models. J. Wiley and Sons, Chichester (2005)
37. Aguilar, E.R., Ruiz, F., Garcia, F., Piattini, M.: Evaluation Measures for Business Process models. In: Procs 2006 ACM Symposium on Applied Computing, pp. 1567–1568. ACM, New York (2006)
38. Fowler, M.: Analysis Patterns. Addison-Wesley, Reading (1997)
39. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, Reading (1995)
40. Rising, L.: The Pattern Almanac 2000. Addison-Wesley, Reading (2000)
41. Roberts, D., Johnson, R.: Patterns for Evolving Frameworks. In: Martin, R.C., Riehle, D., Buschmann, F. (eds.) Pattern Languages of Program Design 3, pp. 471–486. Addison-Wesley Longman, Amsterdam (1997)
42. Opdahl, A., Henderson-Sellers, B.: Template-based Definition of Information Systems and Enterprise Modelling Constructs. In: Green, P., Rosemann, M. (eds.) Business Systems Analysis with Ontologies, pp. 105–129. Idea Group, Hershey (2005)
43. Opdahl, A.L., Henderson-Sellers, B., Barbier, F.: Ontological Analysis of Whole-Part Relationships in OO Models. Information and Software Technology 43(6), 387–399 (2001)
44. Wand, Y., Weber, R.: On the Ontological Expressiveness of Information Systems Analysis and Design Grammars. Journal of Information Systems 3, 217–237 (1993)
45. Wand, Y., Weber, R.: On the Deep Structure of Information Systems. Information Systems Journal 5, 203–223 (1995)
46. Genero, M., Piattini, M., Calero, C. (eds.): Metrics for Software Conceptual Models. Imperial College Press, London (2005)
47. Shekhovtsov, V.A.: On Conceptualization of Quality. Paper presented at Dagstuhl Seminar on Conceptual Modelling, April 27-30 (2008) (preprint on conference website)
Feature Assembly: A New Feature Modeling Technique
Lamia Abo Zaid, Frederic Kleinermann, and Olga De Troyer
Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050 Brussel, Belgium
{Lamia.Abo.Zaid,Frederic.Kleinermann,Olga.DeTroyer}@vub.ac.be
http://wise.vub.ac.be/
Abstract. In this paper we present a new feature modeling technique. This work was motivated by the fact that although feature modeling techniques have been used in software research for over two decades for domain analysis and modeling of Software Product Lines, they have not found their way into industry. Feature Assembly modeling overcomes some of the limitations of the current feature modeling techniques. We use a multi-perspective approach to deal with the complexity of large systems, we provide a simpler and easier to use modeling language, and last but not least we separate the variability specifications from the feature specifications, which allows reusing features in different contexts.
Keywords: Feature, Variability Modeling, Feature Models, Domain Analysis.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 233–246, 2010. © Springer-Verlag Berlin Heidelberg 2010
1 Introduction
Over the last decades software development has evolved into a complex task, firstly due to the large number of features available in software, and secondly due to the many (often implicit) dependencies between these features. In addition, there is an increased demand to deliver similar software on different platforms and/or to different types of customers. This has led to the emergence of so-called Software Product Lines (SPL) [1] or more generally variable software. SPLs tend to manufacture the software development process. Instead of developing a single product the fundamental base is to develop multiple closely related but different products. These different products share some common features but each individual product has a distinguishable set of features that gives each product a unique flavor. To be able to profit maximally from the benefits of variability, but to keep the development of such software under control, feature-oriented analysis is used to effectively identify and characterize the SPL capabilities and functionalities. In feature-oriented analysis, features are abstractions that different stakeholders can understand. Stakeholders usually speak of product characteristics, i.e. in terms of the features the product has or delivers [2]. Feature oriented domain analysis (FODA) [3] was first introduced in 1990 for domain modeling, and since then it has become an appealing technique for modeling SPLs. It was applied in several case studies [2] and many extensions to the original technique have been defined. However, these feature modeling techniques have not
gained much popularity outside the research community. Several explanations can be given for this. Firstly, there are many different “dialects” of feature modeling techniques (e.g. [4] [5] [6]), each focusing on different issues; there is no commonly accepted model [7]. Secondly, feature models do not scale well, mainly because they lack abstraction mechanisms. This makes them difficult to use in projects with a large number of features [8]. Thirdly, few guidelines or methods exist on how to use the modeling technique. This often results in feature models with little added value or of questionable quality. To overcome these limitations companies define their own notations and techniques to represent and implement variability. Examples are Bosch [9], Philips Medical Systems [10] and Nokia [11]. Yet the proposed notations are tailored to each company’s specific needs for modeling variability in their product line. In [9] and [10] a hierarchical structure of features, introducing new feature types, was adopted. Feature interaction and scalability issues were more important for [11], which therefore adopted a separation of concerns approach for devising higher level features. They used documentation to specify the system’s evolution using its features and relations. In this paper we present a new feature modeling technique that is based on using multiple perspectives (viewpoints) to model (variable) software in terms of its composing features. We call it Feature Assembly Modeling (FAM). The presented modeling technique is innovative from different perspectives. It separates the information on variability (i.e. how features are used to come to variability) from the features themselves. In FAM, how a feature contributes to the variability of a specific piece of software (or product line) is not inextricably associated with the feature.
Rather, this information is part of how the features are assembled together in the feature assembly model that models the software (or product line). This yields more flexibility and allows the reuse of these features in other contexts and even in other software. The model is also based on a multi-perspective abstraction mechanism. It is well known that focusing on one aspect at a time helps to deal with complexity (also known as the separation of concerns paradigm). FAM provides a better abstraction mechanism by using perspectives to model large and complex software, and thus will also increase the scalability of the modeling process. Furthermore, we have reduced the number of modeling primitives to simplify and ease the modeling process. This paper is organized as follows. In Section 2, we review existing feature modeling techniques. In Section 3, we discuss the limitations of the mainstream feature modeling techniques. In Section 4, we explain our Feature Assembly Modeling technique. Section 5 provides an example that illustrates the approach and its benefits. Next, in Section 6 we discuss how FAM offers solutions for the limitations identified in Section 3. Finally, Section 7 provides a conclusion and future work.
2 Mainstream Feature Modeling Techniques
Over the past few years, several variability modeling techniques have been developed that aim to support variability representation and modeling. Some of the techniques extend feature models (e.g. [4], [5], [6], and [12]), while others tend to add profiles for variability representation in UML (e.g. [13], [14], and [15]). In addition, some
work has been done on defining new modeling languages and frameworks to model variability information (e.g. [16] and [17]). For the purpose of this paper we restrict ourselves to the modeling methods extending Feature Oriented Domain Analysis (FODA), commonly called feature models [3] [4]. For a detailed study classifying the existing well known feature modeling techniques, methodologies and implementation frameworks, we refer the reader to [18]. A feature model is a hierarchical domain model with a tree-like structure for modeling features and their relations. It is a variability modeling (visual) language indicating how the features contribute to variability. Over the past decade several extensions to FODA (the first feature modeling language) have been defined to compensate for some of its ambiguity and to introduce new concepts and semantics to extend FODA’s expressiveness. Yet, all keep the hierarchical structure originally used in FODA, albeit with somewhat different notations. Feature-Oriented Reuse Method (FORM) [4] extends FODA by adding a domain architecture level which enables identifying reusable components. It starts with an analysis of commonality among applications in a particular domain in terms of four different categories (also called layers): capabilities, operating environments, domain technologies, and implementation [2]. AND/OR nodes are used to build a hierarchical tree structured feature model for the features belonging to each of the previously mentioned categories. The excludes and requires feature dependencies originally defined in FODA are still used; a new implemented by dependency was defined. FeatureRSEB [5] aims at integrating feature modeling with the Reuse-Driven Software Engineering Business (RSEB). It uses UML use case diagrams as a starting point for defining features and their variability requirements. FeatureRSEB classifies features into optional, mandatory (similar to FODA) and variant.
Variant is used to indicate alternative features. FeatureRSEB adds the concept of vp-features which represents variation points. The excludes and requires dependencies originally defined in FODA are used to represent constraints between features. PLUSS [12], which is the Product Line Use case modeling for Systems and Software engineering, introduced the notation of multiple adapter to overcome the limitation of not being able to specify the at-least-one-out-of many relation in FODA. PLUSS also renamed alternative features to single adaptor features following the same naming scheme. The modeling notation was also slightly changed in PLUSS to meet the needs of the modified model, yet it remained a hierarchical tree structure based on the notation of FODA. Similar to FeatureRSEB, the excludes and requires dependencies originally defined in FODA are used to represent feature dependencies. Cardinality Based Feature Models (CBFS) [6] represent a hierarchy of features, where each feature has a feature cardinality. Two types of cardinality are defined: clone cardinality and group cardinality. A feature clone cardinality is an interval of the form [m..n]. Where m and n are integers that denote how many clones of the feature (with its entire subtree) can be included in a specified configuration. A group cardinality is an interval of the form [m..n], where m and n are integers that denote how many features of the group are allowed to be selected in a certain configuration. Features still had one of four feature types AND, OR, Alternative, and Optional. In addition, the notation of feature attribute was defined. A feature attribute indicates a property or parameter of that feature; it could be a numeric or string value. CBFS kept
L. Abo Zaid, F. Kleinermann, and O. De Troyer
the original FODA feature dependencies. In addition, there are rational constraints associated with the value of the feature attribute (i.e. >, =, 1.
Note that Σ_{e′ ∈ FS} d(e, e′) = |FS| when e is directly connected to all entity types of FS. If e and e′ are not connected (because at least one of them does not participate in relationship types nor IsA relationships, or both belong to different connected components of the graph denoted by the schema), then we define d(e, e′) = |E|.
5.3 Interest of Entity Types (Φ)
The importance metric is useful when a user wants to know which are the most important entity types, but it is of little use when the user is interested in a
A Method for Filtering Large Conceptual Schemas
specific subset of entity types, independently from their importance. What is needed then is a metric that measures the interest of a candidate entity type e with respect to a focus set FS. This metric should take into account both the absolute importance of e (as explained in Section 5.1) and the closeness measure of e with regard to the entity types in FS. For this reason, we define:
\[ \Phi(e, FS) = \alpha \times \Psi(e) + (1 - \alpha) \times \Omega(e, FS) \tag{2} \]
where Φ(e, FS) is the interest of a candidate entity type e with respect to FS, Ψ(e) the importance of e, and Ω(e, FS) is the closeness of e with respect to FS. Note that α is a balancing parameter in the range [0,1] to set the preference between closeness and importance for the retrieved knowledge. An α > 0.5 benefits importance against closeness while an α < 0.5 does the opposite. The default α value is set to 0.5 and can be modified by the user. The computation of the interest Φ(e, FS) for candidate entity types returns a ranking which is used by our filtering method to select the K − |FS| top candidate entity types.
As an example, Table 2 shows the top-8 entity types with the highest interest when the user defines FS = {TaxRate, TaxClass}, K = |FS| + 8 = 10 and α = 0.5 (the rejection set is the default one, RS = ∅). Within the top of interest there may be entity types directly connected to all members of the focus set, as in the case of TaxZone (Ω(TaxZone, FS) = 1.0), but also entity types that are not directly connected to any entity type of FS (although they are closer/important).

Table 2. Top-8 entity types of interest with regard to FS = {TaxRate, TaxClass}

Rank  Entity Type (e)  Importance Ψ(e)  Distance d(e, TR)  Distance d(e, TC)  Closeness Ω(e, FS)  Interest Φ(e, FS)
1     TaxZone          0.57             1                  1                  1.0                 0.785
2     Product          0.84             2                  1                  0.66                0.75
3     Language         1.0              3                  2                  0.4                 0.7
4     Customer         0.62             3                  2                  0.4                 0.51
5     Zone             0.35             2                  2                  0.5                 0.425
6     Order            0.41             3                  2                  0.4                 0.405
7     Special          0.29             3                  2                  0.4                 0.345
8     Currency         0.4              4                  3                  0.28                0.34
(TR = TaxRate, TC = TaxClass)
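The ranking of Table 2 can be reproduced with a short sketch. The closeness formula is not visible in this part of the text; the code assumes Ω(e, FS) = |FS| / Σ d(e, e′), which matches the tabulated closeness values up to rounding. The names `CANDIDATES`, `closeness`, `interest` and `rank` are illustrative and do not come from the authors' tool.

```python
# Importance Ψ(e) and distances to TaxRate and TaxClass, copied from Table 2.
CANDIDATES = {
    "TaxZone": (0.57, [1, 1]),
    "Product": (0.84, [2, 1]),
    "Language": (1.0, [3, 2]),
    "Customer": (0.62, [3, 2]),
    "Zone": (0.35, [2, 2]),
    "Order": (0.41, [3, 2]),
    "Special": (0.29, [3, 2]),
    "Currency": (0.4, [4, 3]),
}


def closeness(distances):
    # Assumed form: size of the focus set divided by the sum of distances to it.
    return len(distances) / sum(distances)


def interest(importance, distances, alpha=0.5):
    # Eq. 2: weighted combination of importance and closeness.
    return alpha * importance + (1 - alpha) * closeness(distances)


def rank(candidates, alpha=0.5):
    # Sort candidate entity types by decreasing interest.
    scored = {name: interest(psi, dists, alpha) for name, (psi, dists) in candidates.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for position, (name, phi) in enumerate(rank(CANDIDATES), start=1):
        print(position, name, round(phi, 3))
```

Calling `rank(CANDIDATES, alpha=1.0)` ranks by importance alone, and `alpha=0.0` ranks by closeness alone.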
Each candidate entity type e of the conceptual schema CS can be seen, in a geometric sense, as a point in a bidimensional space whose axes are the measures of importance Ψ(e) and closeness Ω(e, FS). Figure 2(a) shows such a bidimensional space with the corresponding axes. Let r be a straight line between the points (0, Ωmax) and (Ψmax, 0) of the maximum values of closeness and importance (Ωmax = Ψmax = 1 in Fig. 2(a)). We choose r in order to maintain the same proportion between closeness and importance (α = 0.5). A straight line r′ parallel to r traversing the point (Ψmax, Ωmax) indicates the interest line to the user (see Fig. 2(b)). Taking the importance and the closeness measures for the entity types from Tab. 2, we obtain the coordinates to place them as bidimensional points in the
A. Villegas and A. Olivé
[Figure 2: two plots with Importance Ψ on the horizontal axis and Closeness Ω on the vertical axis, both ranging from 0 to 1. (a) Movement of r to obtain r′. (b) Distances to the interest indicator r′, showing the points P1 to P8 and the distances dist(P2, r′) and dist(P6, r′).]
Fig. 2. Geometrical foundation of the concept of Interest of entity types Φ(e)
plane, as shown in Fig. 2(b). The smaller the distance between a point in the plane and the straight line r′, the greater the interest of the entity type the point represents. Figure 2(b) shows that Product, placed at point P2 = (0.84, 0.66), is of more interest (position 2) than Order (position 6) at point P6 = (0.41, 0.4) due to its smaller distance to r′. Note that the balancing parameter α in Eq. 2 can be seen as a modifier of the slope of the straight line r′ of Fig. 2(b), in order to prioritize the closeness or importance components. In particular, if we choose α = 1 then we only take into account the importance, and Language (which is at position 3) would be ranked first.
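The geometric reading can be checked numerically. The sketch below assumes α = 0.5 and Ψmax = Ωmax = 1, so that r′ is the line Ψ + Ω = 2 (parallel to r and passing through (1, 1)); the function name is illustrative.

```python
import math


def distance_to_interest_line(importance, closeness):
    # Point-to-line distance from (importance, closeness) to the assumed
    # interest line r': importance + closeness = 2.
    return abs(importance + closeness - 2.0) / math.sqrt(2)


# Points taken from the text: Product (P2) and Order (P6).
dist_p2 = distance_to_interest_line(0.84, 0.66)
dist_p6 = distance_to_interest_line(0.41, 0.40)

if __name__ == "__main__":
    print("dist(P2, r') =", round(dist_p2, 3))
    print("dist(P6, r') =", round(dist_p6, 3))
```

Under these assumptions the distance equals √2 × (1 − Φ), so ordering points by increasing distance to r′ gives the same ranking as ordering them by decreasing interest.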
6 Filtered Conceptual Schema (FCS)
The main task of our filtering method consists in constructing a filtered conceptual schema, FCS, from the K most interesting entity types computed in the previous section, and the knowledge of the original schema (see Fig. 1).
Definition 5. (Filtered Conceptual Schema) A filtered conceptual schema FCS of a conceptual schema CS = ⟨E, R, I, C, D⟩ is defined as a tuple FCS = ⟨EF, RF, IF, CF, DF⟩, where:
– EF is a set of entity types filtered from E of CS (Section 6.1).
– RF is a set of relationship types filtered from R of CS (Section 6.2).
– IF is a set of IsA relationships filtered from I of CS (Section 6.3).
– CF is a set of integrity constraints filtered from C of CS (Section 6.4).
– DF is a set of derivation rules filtered from D of CS (Section 6.5).
6.1 Filtered Entity Types (EF)
The entity types EF of the filtered conceptual schema FCS are those included in the union of the focus set FS, the set Etop of the K − |FS| most interesting candidate entity types computed by our method, and the set Eaux of auxiliary entity types due to association projections (see details in Section 6.2). Formally we have EF = FS ∪ Etop ∪ Eaux and |EF| = K + |Eaux|.
6.2 Filtered Relationship Types (RF)
The relationship types in RF are those r ∈ R whose participant entity types belong to EF, or are ascendants of entity types of EF (in which case a projection of r is required). If such relationship types contain an association class, we also include it in FCS. Formally,
∀r ∈ R (∀e that participates in r (e ∈ EF ∨ ∃e′ (e′ is descendant of e ∧ e′ ∈ EF)) ⟹ r ∈ RF)
The projection of a relationship type r ∈ R consists in descending the participations of entity types not in EF into each of their descendants in EF. Figure 3 shows an example of projection of a relationship type R. The marked area in Fig. 3(a) indicates the entity types that are included in EF (E2, E6 and E7). The relationship type R has two participants. E2 is included in EF while E1 has two indirect descendants, E6 and E7, included in EF. Therefore, R should be projected as shown in Fig. 3(b) but, unfortunately, R is repeated, which is correct but increases the complexity of the schema.
(a) Original Relationship
(b) Repeated Relationship (c) Projected Relationship
Fig. 3. Result of projecting a relationship to the filtered conceptual schema
There is a special subset of entity types Eaux inside EF that includes the auxiliary entity types that are required to avoid repetitions of relationship types. In Fig. 3(a) the closest common ascendant between the entity types in EF (E6 and E7) that descend from the original participant E1 of R is the entity type E4. Therefore, in order to avoid having two R relationship types (connected to E6 and E7, respectively, as shown in Fig. 3(b)) it is necessary to include E4 in Eaux, project R to E4, and create IsA relationships between the descendants and the auxiliary class (see Fig. 3(c)) to maintain the semantics. It is important to note that if there is only one descendant the auxiliary class is not necessary because the projection of the relationship will be with the descendant itself. Figure 3(a) shows that the cardinality constraint 1 in E1 has to be changed to 0..1 after the projection of R. This happens because the cardinality constraint of the projected participant E1 must be satisfied for the union of its descendants (E3 and E4), and not for only a subset of them [23].
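The selection rule for RF can be sketched as follows. This is a minimal illustration, not the authors' implementation: the IsA hierarchy is only loosely modeled on the description of Fig. 3, the relationship `S` and the entity type `E9` are hypothetical, and the projection step itself (choosing the auxiliary entity type and adjusting cardinalities) is not covered.

```python
def descendants(entity, parents):
    """Return all direct and indirect descendants of an entity type.

    `parents` maps each entity type to the set of its direct IsA parents.
    """
    result = set()
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for child, child_parents in parents.items():
            if current in child_parents and child not in result:
                result.add(child)
                frontier.append(child)
    return result


def filter_relationships(relationships, parents, ef):
    """Keep a relationship type when every participant is in EF
    or has at least one descendant in EF."""
    kept = {}
    for name, participants in relationships.items():
        if all(p in ef or descendants(p, parents) & ef for p in participants):
            kept[name] = participants
    return kept


# Hypothetical hierarchy: E3 and E4 descend from E1; E6 and E7 descend from E4.
PARENTS = {"E3": {"E1"}, "E4": {"E1"}, "E6": {"E4"}, "E7": {"E4"}}
RELATIONSHIPS = {"R": ["E1", "E2"], "S": ["E2", "E9"]}
EF = {"E2", "E6", "E7"}

if __name__ == "__main__":
    print(filter_relationships(RELATIONSHIPS, PARENTS, EF))
```

In this example `R` is kept because E2 is in EF and E1 has descendants in EF, while `S` is dropped because E9 neither belongs to EF nor has a descendant in it.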
6.3 Filtered IsA Relationships (IF)
If e and e′ are entity types in EF and there is a direct or indirect IsA relationship between them in I of CS, then such IsA relationship must also exist in IF of FCS. Formally we have ∀e′, e ∈ EF ((e′ IsA+ e) ∈ I ⟹ (e′ IsA+ e) ∈ IF)¹. Figure 4(a) shows a fragment of an original schema where E1, E4, E5 and E6 are the entity types included in EF. Figure 4(b) presents the IsA relationships included in IF of FCS. Note that we maintain the direct IsA relationships of the original schema between entity types in EF, as in the case of E6 IsA E5. We also keep the semantics by adding to IF of FCS the new IsA relationships E4 IsA E1 and E5 IsA E1, as shown in Fig. 4(b).
(a) Original IsA Relationships
(b) Filtered IsA Relationships
Fig. 4. Example of filtering IsA relationships
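The IsA filtering rule can be sketched with a transitive closure restricted to EF. The original hierarchy below is hypothetical (the intermediate entity types E2 and E3 are assumed, since the figure is not reproduced here), and dropping pairs implied by other kept pairs is a presentation choice of this sketch rather than something the text prescribes.

```python
def ancestors(entity, parents):
    """All entity types reachable through one or more IsA steps (IsA+)."""
    seen = set()
    frontier = list(parents.get(entity, ()))
    while frontier:
        current = frontier.pop()
        if current not in seen:
            seen.add(current)
            frontier.extend(parents.get(current, ()))
    return seen


def filter_isa(parents, ef):
    """Return the (child, parent) IsA pairs to draw in the filtered schema."""
    # Every IsA+ pair between entity types of EF must be preserved.
    closure = {(c, a) for c in ef for a in ancestors(c, parents) if a in ef}
    # Keep only pairs that are not implied by two other pairs inside EF.
    return {
        (c, a)
        for (c, a) in closure
        if not any((c, b) in closure and (b, a) in closure for b in ef)
    }


# Assumed original hierarchy: E4 IsA E2 IsA E1, E5 IsA E3 IsA E1, E6 IsA E5.
PARENTS = {"E2": {"E1"}, "E3": {"E1"}, "E4": {"E2"}, "E5": {"E3"}, "E6": {"E5"}}
EF = {"E1", "E4", "E5", "E6"}

if __name__ == "__main__":
    print(sorted(filter_isa(PARENTS, EF)))
```

With this input the result contains E6 IsA E5 (a direct relationship of the original schema) and the new relationships E4 IsA E1 and E5 IsA E1, as in the example of the text.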
6.4 Filtered Integrity Constraints (CF)
The integrity constraints CF included in FCS are a subset of the integrity constraints C of CS. Concretely, the included integrity constraints are those whose expressions only involve entity types from EF. Formally we have
∀c ∈ C (∀e involved in c (e ∈ EF) ⟹ c ∈ CF)
An entity type can be referenced by means of its attributes, its participations in relationship types or by referencing the entity type itself. As an example, the integrity constraint ic1 in Fig. 5 has A and B as its participants. A is referenced as the context of the constraint and by means of its attribute a1 in the OCL expression self.a1. Also, B is referenced by means of its attribute b1 in the OCL expression self.b.b1. Our method only includes ic1 in CF of FCS if both A and B are entity types in EF.
6.5 Filtered Derivation Rules (DF)
The derivation rules DF included in FCS are those rules D of CS whose expressions only involve entity types from EF. Formally we have
∀d ∈ D (∀e involved in d (e ∈ EF) ⟹ d ∈ DF)
The derivation rule dr1 in Fig. 5 is included in DF if both A and B are entity types in EF. If only B ∈ EF, our method marks the derived attribute b2 as materialized and does not include dr1 in DF because that derivation rule also references the entity type A, which is not included in EF.
¹ Note that “IsA+” denotes the transitive closure of IsA relationships.
Fig. 5. Example of integrity constraint (ic1) and derivation rule (dr1)
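The rules of Sections 6.4 and 6.5 share the same shape and can be sketched together. The sketch assumes that the set of entity types involved in each OCL expression has already been extracted (that analysis is not shown), and the function and variable names are illustrative.

```python
def filter_rules(rules, ef):
    """Keep the constraints or derivation rules that only involve entity types of EF.

    `rules` maps a rule name to the set of entity types its expression involves.
    """
    return {name: involved for name, involved in rules.items() if involved <= ef}


def materialized_attributes(derived, rules, ef):
    """Derived attributes whose rule is dropped although their owner is in EF.

    `derived` maps a derivation rule name to (owner entity type, attribute name).
    """
    kept = filter_rules(rules, ef)
    return {
        (owner, attribute)
        for name, (owner, attribute) in derived.items()
        if name not in kept and owner in ef
    }


# ic1 and dr1 from the running example both involve A and B; dr1 derives b2 of B.
RULES = {"ic1": {"A", "B"}, "dr1": {"A", "B"}}
DERIVED = {"dr1": ("B", "b2")}

if __name__ == "__main__":
    print(filter_rules(RULES, {"A", "B"}))
    print(materialized_attributes(DERIVED, RULES, {"B"}))
```

When EF contains both A and B, ic1 and dr1 are kept; when only B is in EF, both are dropped and b2 is reported as materialized.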
7 Experimentation
This section presents the results obtained by our filtering method on two large real-world schemas: osCommerce [13] and ResearchCyc (research.cyc.com). Table 3 shows some metrics of both conceptual schemas.

Table 3. Conceptual schema characteristics of two large schemas

             Entity Types  Attributes  Relationship Types  IsA Relationships
osCommerce   84            209         183                 28
ResearchCyc  26,725        1,060       5,514               43,323

7.1 osCommerce
The conceptual schema of osCommerce [13] includes the elements shown in Tab. 3 and also 204 general constraints and derivation rules. Figure 6 shows the filtered conceptual schema FCS that results when the user selects K = 10 and wants to know more about FS = {TaxRate, TaxClass}. Figure 6(b) presents the integrity constraints and derivation rules (CF and DF) included in FCS. The derivation rules of the derived attributes id, name, phone, primary and currencyValue of Order in Fig. 6(a) are included in DF because they only use information contained in FCS. Additionally, we mark each derived attribute that has been materialized with an asterisk (∗) at the end of its name (as in the case of total of Order), and its derivation rule is hidden because it uses information about entity types outside FCS.
7.2 ResearchCyc
The ResearchCyc knowledge base contains more than 26,000 entity types and is defined using the CycL language [24]. Our experimentation with ResearchCyc has been done with a UML version obtained through a conversion process from CycL. The anatomy of this ontology has the peculiarity that it contains a small core of abstract concepts strongly connected through high-level relationship types. The rest of the concepts are all descendants of this core. The interesting knowledge about Cancer that the user obtains with the filtered conceptual schema in Fig. 7 consists of the IsA relationships with other interesting concepts, because the relationship types are defined only between top elements in the hierarchy of concepts. The experiments we have done with our implementation show that, starting with a focus set of up to three entity types, the time required to compute the interest and filter the ResearchCyc ontology is about half a second (the average of 100 experiments is 0.53 seconds, with a standard deviation of 0.31).
(a) Filtered Schema
(b) CF and DF
Fig. 6. Filtered conceptual schema for FS = {TaxRate, TaxClass} in the osCommerce
Fig. 7. Filtered conceptual schema for FS = {Cancer } and K = 18 in the ResearchCyc
8 Conclusions and Future Work
We have focused on the problem of filtering a fragment of the knowledge contained in a large conceptual schema. The problem appears in many information systems development activities in which people need to operate for some purpose with a piece of the knowledge contained in that schema. We have proposed a filtering method in which a user indicates a focus set consisting of one or more entity types of interest, and the method determines a subset of the elements of the original schema that is likely to be of interest to the user. In order to select this subset, our method measures the interest of each entity type with respect to the focus set based on the importance and closeness. We have implemented our method in a prototype tool built on top of the USE environment. We have experimented with it on two large schemas. In both cases, our tool obtains the filtered schema in a short time. Using our prototype tool it is practical for a user to specify a focus set, to obtain a filtered schema, and to repeat the interaction until the desired knowledge has been obtained.
We plan to improve our method in several ways. One improvement is to take into account the importance of the relationship types. In the current method, we only use the importance of entity types and assume that all relationship types are equally important. This improvement requires the definition of a convenient metric of the importance of relationship types, which does not yet exist. Another enhancement consists in a fine-grained filter of integrity constraints and derivation rules in order to hide only those OCL expressions that reference entity types out of the user focus instead of hiding the whole constraint or rule. Finally, we plan to conduct experiments to precisely determine the usefulness of our method to real users.
Acknowledgements
Thanks to the people of the GMC group for their useful comments on previous drafts of this paper. This work has been partly supported by the Ministerio de Ciencia y Tecnologia under TIN2008-00444 project, Grupo Consolidado, and by Universitat Politècnica de Catalunya under FPI-UPC program.
References
1. Olivé, A.: Conceptual Modeling of Information Systems. Springer, Heidelberg (2007)
2. Lindland, O.I., Sindre, G., Sølvberg, A.: Understanding quality in conceptual modeling. IEEE Software 11(2), 42–49 (1994)
3. Conesa, J., Storey, V.C., Sugumaran, V.: Usability of upper level ontologies: The case of researchcyc. Data & Knowledge Engineering 69(4), 343–356 (2010)
4. Tzitzikas, Y., Hainaut, J.L.: On the visualization of large-sized ontologies. In: AVI 2006, Working Conf. on Advanced Visual Interfaces, pp. 99–102. ACM, New York (2006)
5. Katifori, A., Halatsis, C., Lepouras, G., Vassilakis, C., Giannopoulou, E.: Ontology visualization methods-a survey. ACM Computing Surveys 39(4), 10 (2007)
6. Lanzenberger, M., Sampson, J., Rester, M.: Visualization in ontology tools. In: Intl. Conf. on Complex, Intelligent and Software Intensive Systems, pp. 705–711. IEEE Computer Society, Los Alamitos (2009)
7. Shoval, P., Danoch, R., Balabam, M.: Hierarchical entity-relationship diagrams: the model, method of creation and experimental evaluation. Requirements Engineering 9(4), 217–228 (2004)
8. Rokach, L., Maimon, O.: Clustering methods. In: Data Mining and Knowledge Discovery Handbook, ch. 15, pp. 321–352. Springer, Heidelberg (2005)
9. Campbell, L.J., Halpin, T.A., Proper, H.A.: Conceptual schemas with abstractions making flat conceptual schemas more comprehensible. Data & Knowledge Engineering 20(1), 39–85 (1996)
10. Kuflik, T., Boger, Z., Shoval, P.: Filtering search results using an optimal set of terms identified by an artificial neural network. Information Processing & Management 42(2), 469–483 (2006)
11. Hanani, U., Shapira, B., Shoval, P.: Information filtering: Overview of issues, research and systems. User Modeling and User-Adapted Interaction 11(3), 203–259 (2001)
12. Gogolla, M., Büttner, F., Richters, M.: USE: A UML-based specification environment for validating UML and OCL. Science of Computer Programming (2007)
13. Tort, A., Olivé, A.: The osCommerce Conceptual Schema. Universitat Politècnica de Catalunya (2007), http://guifre.lsi.upc.edu/OSCommerce.pdf
14. Lenat, D.B.: Cyc: a large-scale investment in knowledge infrastructure. Communications of the ACM 38(11), 33–38 (1995)
15. Villegas, A., Olivé, A., Vilalta, J.: Improving the usability of hl7 information models by automatic filtering. In: IEEE 6th World Congress on Services (SERVICES) (2010), http://www.computer.org/portal/web/csdl/doi/10.1109/SERVICES.2010.32
16. Villegas, A., Olivé, A.: On computing the importance of entity types in large conceptual schemas. In: Heuser, C.A., Pernul, G. (eds.) ER 2009 Workshops. LNCS, vol. 5833, pp. 22–32. Springer, Heidelberg (2009)
17. Castano, S., De Antonellis, V., Fugini, M.G., Pernici, B.: Conceptual schema analysis: techniques and applications. ACM Transactions on Database Systems 23(3), 286–333 (1998)
18. Moody, D.L., Flitman, A.: A methodology for clustering entity relationship models-a human information processing approach. In: Akoka, J., Bouzeghoub, M., Comyn-Wattiau, I., Métais, E. (eds.) ER 1999. LNCS, vol. 1728, pp. 114–130. Springer, Heidelberg (1999)
19. Tzitzikas, Y., Kotzinos, D., Theoharis, Y.: On ranking rdf schema elements (and its application in visualization). Journal of Universal Computer Science 13(12), 1854–1880 (2007)
20. Tzitzikas, Y., Hainaut, J.L.: How to tame a very large er diagram (using link analysis and force-directed drawing algorithms). In: Delcambre, L.M.L., Kop, C., Mayr, H.C., Mylopoulos, J., Pastor, Ó. (eds.) ER 2005. LNCS, vol. 3716, pp. 144–159. Springer, Heidelberg (2005)
21. Yu, C., Jagadish, H.V.: Schema summarization. In: VLDB 2006, 32nd Intl. Conf. on Very Large Data Bases, pp. 319–330 (2006)
22. Yang, X., Procopiuc, C.M., Srivastava, D.: Summarizing relational databases. In: VLDB 2009, 35th Intl. Conf. on Very Large Data Bases, pp. 634–645 (2009)
23. Conesa, J.: Pruning and refactoring ontologies in the development of conceptual schemas of information systems. PhD thesis, UPC (2008)
24. Lenat, D.B., Guha, R.V.: The evolution of cycl, the cyc representation language. ACM SIGART Bulletin 2(3), 84–87 (1991)
Measuring the Quality of an Integrated Schema⋆
Fabien Duchateau1 and Zohra Bellahsene2
1 CWI, Science Park 123, 1098 XG Amsterdam, The Netherlands
[email protected]
2 LIRMM - Université Montpellier 2, 161 rue Ada, 34392 Montpellier, France
[email protected]
Abstract. Schema integration is a central task for data integration. Over the years, many tools have been developed to discover correspondences between schema elements. Some of them produce an integrated schema. However, the schema matching community lacks metrics which evaluate the quality of an integrated schema. Two measures have been proposed, completeness and minimality. In this paper, we extend these metrics for an expert integrated schema. Then, we complete them by another metric that evaluates the structurality of an integrated schema. These three metrics are finally aggregated to evaluate the proximity between two schemas. These metrics have been implemented as part of a benchmark for evaluating schema matching tools. We finally report experimental results using these metrics over 8 datasets with the most popular schema matching tools which build integrated schemas, namely COMA++ and Similarity Flooding.
1 Introduction
Schema integration is the process of merging existing data source schemas into one unified schema named global schema or integrated schema. This unified schema serves as a uniform interface for querying the data sources [1]. However, an integrated schema can also serve in many other applications. Indeed, due to the growing availability of information in companies, agencies, or on the Internet, decision makers may need to quickly understand some concepts before acting, for instance for building communities of interest [2]. In these contexts, the quality of an integrated schema is crucial both for improving query execution through the mediated schema and for data exchange and concepts sharing [3]. Although schema matching tools mainly emphasize the discovering of correspondences, most of them also generate an integrated schema based on these correspondences. Evaluating the quality of discovered correspondences is performed by
⋆ Supported by ANR DataRing ANR-08-VERSO-007-04. The first author carried out this work during the tenure of an ERCIM “Alain Bensoussan” Fellowship Programme.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 261–273, 2010. c Springer-Verlag Berlin Heidelberg 2010
F. Duchateau and Z. Bellahsene
using widely accepted measures (precision and recall). Yet, the schema matching community lacks some measures for assessing the quality of the integrated schema that is also automatically produced by the tools. Consequently, authors of [4] have proposed the completeness and minimality measures. The former represents the percentage of data sources concepts which are covered by the integrated schema, while the latter checks that no redundant concept appears in the integrated schema. As stated by Kesh [5], these metrics are crucial to produce a more efficient schema, i.e. that reduces query execution time. However, they do not measure the quality of the structure of the produced integrated schema. We believe that the structure of an integrated schema produced by a schema matching tool may also decrease schema efficiency if it is badly built. Besides, an integrated schema that mainly preserves the semantics of the source schemas is easier to interpret and understand for an end-user. This paper discusses the evaluation of the quality for integrated schemas. First, we adapt completeness and minimality, proposed in [4], for an expert integrated schema. Then, we complete them by another metric that evaluates the structurality of an integrated schema. These three metrics are finally aggregated to evaluate the schema proximity of two schemas. Experiments using two state-of-the-art schema matching tools enable us to demonstrate the benefits of our measures. The rest of the paper is organised as follows: first, we give some definitions in Section 2. Section 3 covers the new measures we have designed for evaluating integrated schemas. We report in Section 4 the results of two schema matching tools. Related work is presented in Section 5. Finally, we conclude and outline future work in Section 6.
2 Preliminaries
Here we introduce the notions used in the paper. Schema matching is the task which consists of discovering semantic correspondences between schema elements. We consider schemas as edge-labeled trees (a simple abstraction that can be used for XML schemas, web interfaces, or other semi-structured or structured data models). Correspondences (or mappings) are links between schema elements which represent the same real-world concept. We limit correspondences to 1:1 (i.e., one schema element is matched to only one schema element) or to 1:n (i.e., one schema element is matched to several schema elements). Currently, only a few schema matching tools produce n:m correspondences. Figure 1 depicts an example of two schemas (from hotel booking web forms) and the correspondences discovered by a schema matching tool. A schema matching dataset is composed of a schema matching scenario (the set of schemas to be matched), the set of expert mappings (between the schemas of the scenario) and the integrated expert schema. A metric proposed in this paper uses a rooted Directed Acyclic Graph (rDAG) for evaluating the schema structure. Schemas can be seen as rDAGs. A rDAG is a DAG, expressed by a triple <V, E, r> where:
Fig. 1. Correspondences between two hotel booking schemas
– V is a set of elements, noted V = <e0, e1, ..., en>;
– E is a set of edges between elements, with E ⊆ V × V;
– r is the root element of the rDAG.
A property of the rDAG deals with the path. In a rDAG, all elements can be reached from the root element. Given a rDAG = <V, E, e0>, ∀ element e ∈ V, ∃ a path P(e0, e) = <e0, ei, ..., ej, e>.
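The rDAG definition and its path property can be illustrated with a small sketch. The edge list below is an assumption: it encodes the expert integrated schema as it can be inferred from the ancestor paths used in the worked example of Section 3.2, and the function names are illustrative.

```python
from collections import deque


def path_from_root(edges, root, target):
    """Return one path <root, ..., target> as a list, or None if unreachable."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    queue = deque([[root]])
    visited = {root}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for child in children.get(path[-1], []):
            if child not in visited:
                visited.add(child)
                queue.append(path + [child])
    return None


def is_rooted(vertices, edges, root):
    """Check the rDAG property: every element is reachable from the root."""
    return all(path_from_root(edges, root, v) is not None for v in vertices)


# Assumed expert schema: A is the root, B and D its children, and so on.
VERTICES = ["A", "B", "C", "D", "E", "F", "G"]
EDGES = [("A", "B"), ("A", "D"), ("B", "C"), ("D", "E"), ("D", "F"), ("D", "G")]

if __name__ == "__main__":
    print(path_from_root(EDGES, "A", "C"))
    print(is_rooted(VERTICES, EDGES, "A"))
```

Removing the last element of a returned path gives the list of ancestors of the target element.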
3 Quality of an Integrated Schema
The schema matching community lacks some metrics which evaluate the quality of an integrated schema. Indeed, some schema matching tools produce an integrated schema (with the set of mappings between input schemas). To the best of our knowledge, there are only a few metrics [4] for assessing the quality of this integrated schema. Namely, authors define two measures for integrated schema w.r.t. data sources. Completeness represents the percentage of concepts present in the data sources and which are covered by the integrated schema. Minimality checks that no redundant concept appears in the integrated schema. We have adapted these metrics for an expert integrated schema. Then, we complete them by another metric that evaluates the structurality of integrated schema. These three metrics are finally aggregated to evaluate the schema proximity of two schemas. To illustrate the schema proximity metric, we use the integrated schemas depicted by figures 2(a) and 2(b). Note that a set of mappings is necessarily provided with the integrated schema. Indeed, let us imagine that elements X and G match, i.e. they represent the same concept. This means that only one of them should be added in the integrated schema. On
(a) produced by a matching tool
(b) given by an expert
Fig. 2. Two examples of integrated schemas
figure 2, we notice that X has been added in the tool’s integrated schema while G appears in the expert integrated schema. Thus, with the set of mappings, we are able to check that the concept represented by X and G is present in the integrated schema, and only once.
3.1 Completeness and Minimality
In our context, we have an integrated schema produced by a matching tool, named Sitool, and an expert integrated schema Siexp. Recall that this expert integrated schema is ideal. |Siexp| stands for the number of elements in schema Siexp. Thus, completeness, given by formula 1, represents the proportion of elements of the expert integrated schema which also appear in the tool integrated schema. Minimality is computed with formula 2: it is one minus the proportion of extra elements in the tool integrated schema w.r.t. the expert integrated schema. Both metrics are in the range [0, 1], with a value of 1 meaning that the tool integrated schema is totally complete (respectively minimal) with respect to the expert integrated schema.
\[ \mathrm{comp}(Si_{tool}, Si_{exp}) = \frac{|Si_{tool} \cap Si_{exp}|}{|Si_{exp}|} \tag{1} \]
\[ \mathrm{min}(Si_{tool}, Si_{exp}) = 1 - \frac{|Si_{tool}| - |Si_{tool} \cap Si_{exp}|}{|Si_{exp}|} \tag{2} \]
Let us compute completeness and minimality for the schemas shown in figure 2. As the number of common elements between the expert and tool integrated schemas is 6, completeness is equal to comp(Sitool, Siexp) = 6/7. Indeed, we notice that the integrated schema produced by the matching tool lacks one element (G) according to the expert integrated schema. Similarly, we compute minimality, which gives us min(Sitool, Siexp) = 1 − (8 − 6)/7 = 5/7. The tool integrated schema is not minimal since two elements (X and Z) have been added w.r.t. the expert integrated schema.
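Formulas 1 and 2 translate directly into set operations. The sketch below is an illustration under the assumption that each schema is reduced to the set of its element names, as listed in the example; exact fractions are used so that the results can be compared with 6/7 and 5/7.

```python
from fractions import Fraction


def completeness(tool, expert):
    # Formula 1: share of expert elements that also appear in the tool schema.
    return Fraction(len(tool & expert), len(expert))


def minimality(tool, expert):
    # Formula 2: one minus the share of extra elements in the tool schema.
    return 1 - Fraction(len(tool) - len(tool & expert), len(expert))


# Element sets of the running example: the tool schema lacks G and adds X and Z.
EXPERT = {"A", "B", "C", "D", "E", "F", "G"}
TOOL = {"A", "B", "C", "D", "E", "F", "X", "Z"}

if __name__ == "__main__":
    print("completeness:", completeness(TOOL, EXPERT))
    print("minimality:", minimality(TOOL, EXPERT))
```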
3.2 Structurality
Structurality denotes “the qualities of the structure an object possesses”¹. To evaluate the structurality of a tool integrated schema w.r.t. an expert integrated schema, we check that each element owns the same ancestors. The first step consists of converting the schemas into rooted directed acyclic graphs (rDAGs), which have been described in section 2. Consequently, integrated schemas Siexp and Sitool are respectively transformed into rDAGexp and rDAGtool. Secondly, for each element ei from rDAGexp (except for the root), we build the two paths from the roots e0 of both rDAGs. These paths are noted Pexp(e0, ei) and Ptool(e0, ei). We also remove element ei from these paths. For the sake of clarity, we respectively write Pexp and Ptool instead of Pexp(e0, ei) and Ptool(e0, ei). Note that if element ei has not been included in rDAGtool, then Ptool = ∅. From these two paths, we can compute the structurality of element ei using formula 3. The intuition behind this formula is that element ei should share the maximum number of common ancestors in both integrated schemas, and that no extra ancestor should have been added in the tool integrated schema. Besides, an α parameter enables users to give a greater impact to the common ancestors to the detriment of extra ancestors. As the number of ancestors in Ptool might be large and cause a negative value, we constrain this measure to return a value between 0 and 1 thanks to a max function.
\[ \mathrm{structElem}(e_i) = \max\left(0, \frac{\alpha\,|P_{exp} \cap P_{tool}| - \left(|P_{tool}| - |P_{exp} \cap P_{tool}|\right)}{\alpha\,|P_{exp}|}\right) \tag{3} \]
Back to our example, we can compute the structurality of each (non-root) element from rDAGexp, with a weight for α set to 2:
– B: Pexp = A and Ptool = A. Thus, structElem(B) = max(0, (2×1 − (1 − 1)) / (2×1)) = 1.
– D: Pexp = A and Ptool = A. Thus, structElem(D) = max(0, (2×1 − (1 − 1)) / (2×1)) = 1.
– E: Pexp = A, D and Ptool = A, D. Thus, structElem(E) = max(0, (2×2 − (2 − 2)) / (2×2)) = 1.
– G: Pexp = A, D and Ptool = ∅. Thus, structElem(G) = max(0, (2×0 − (0 − 0)) / (2×2)) = 0.
– C: Pexp = A, B and Ptool = A, D. Thus, structElem(C) = max(0, (2×1 − (2 − 1)) / (2×2)) = 1/4.
– F: Pexp = A, D and Ptool = A. Thus, structElem(F) = max(0, (2×1 − (1 − 1)) / (2×2)) = 1/2.
Finally, structurality of a tool integrated schema Sitool w.r.t. an expert integrated schema Siexp is given by formula 4. It is the sum of all element structuralities (except for the root element noted e0) divided by this number of elements.
\[
\mathrm{struct}(\mathit{Si}_{tool}, \mathit{Si}_{exp}) = \frac{\sum_{i=1}^{n} \mathrm{structElem}(e_i)}{n - 1} \qquad (4)
\]

1 http://en.wiktionary.org/wiki/structurality (March 2010)
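Formulas 3 and 4 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `struct_elem` and `struct`, and the representation of each path as a set of ancestor labels, are assumptions made for this example. The ancestor sets are those of the running example with α = 2.

```python
def struct_elem(p_exp, p_tool, alpha=1.0):
    """Formula 3 for one non-root element.

    p_exp and p_tool are the sets of ancestors of the element in the
    expert and tool rDAGs (p_tool is empty if the element is missing).
    p_exp is never empty, since every non-root element has the root
    as an ancestor.
    """
    common = len(p_exp & p_tool)      # |Pexp ∩ Ptool|
    extra = len(p_tool) - common      # ancestors found only in the tool schema
    return max(0.0, (alpha * common - extra) / (alpha * len(p_exp)))


def struct(paths_exp, paths_tool, alpha=1.0):
    """Formula 4: average of struct_elem over the non-root expert elements."""
    scores = [
        struct_elem(p_exp, paths_tool.get(element, set()), alpha)
        for element, p_exp in paths_exp.items()
    ]
    return sum(scores) / len(scores)


# Ancestor sets of the running example (root A excluded from the keys).
paths_exp = {
    "B": {"A"}, "D": {"A"}, "E": {"A", "D"},
    "G": {"A", "D"}, "C": {"A", "B"}, "F": {"A", "D"},
}
paths_tool = {
    "B": {"A"}, "D": {"A"}, "E": {"A", "D"},
    "C": {"A", "D"}, "F": {"A"},          # G is absent from the tool schema
}

score = struct(paths_exp, paths_tool, alpha=2)
print(score)  # 0.625
```

Representing the paths as sets is sufficient here because the formula only uses cardinalities and the intersection of the two paths.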
In our example, the structurality of the tool integrated schema is therefore the sum of all element structuralities divided by the number of non-root elements. Thus, we obtain struct(Sitool, Siexp) = (1 + 1 + 1 + 0 + 1/4 + 1/2) / 6 = 0.625.

3.3 Integrated Schema Proximity
The integrated schema proximity, which computes the similarity between two integrated schemas, is a weighted average of the previous measures, namely completeness, minimality and structurality. Three parameters (α, β and γ) enable users to give more weight to any of these measures. By default, these parameters are set to 1 so that the three measures have the same impact. Formula 5 shows how to compute schema proximity. It computes values in the range [0, 1].

\[
\mathrm{prox}(\mathit{Si}_{tool}, \mathit{Si}_{exp}) = \frac{\alpha\,\mathrm{comp}(\mathit{Si}_{tool}, \mathit{Si}_{exp}) + \beta\,\mathrm{min}(\mathit{Si}_{tool}, \mathit{Si}_{exp}) + \gamma\,\mathrm{struct}(\mathit{Si}_{tool}, \mathit{Si}_{exp})}{\alpha + \beta + \gamma} \qquad (5)
\]
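Formula 5 is a plain weighted average, which the following sketch makes explicit. The function name `prox` and its keyword arguments are assumptions made for this illustration; the input values 0.86, 0.71 and 0.625 are the completeness, minimality and structurality of the running example.

```python
def prox(comp, minimality, structurality, alpha=1.0, beta=1.0, gamma=1.0):
    """Formula 5: weighted average of the three quality measures."""
    total_weight = alpha + beta + gamma
    weighted_sum = alpha * comp + beta * minimality + gamma * structurality
    return weighted_sum / total_weight


# Default weights: the three measures have the same impact.
equal = prox(0.86, 0.71, 0.625)
print(round(equal, 2))

# Giving more weight to structurality pulls the score towards 0.625.
struct_heavy = prox(0.86, 0.71, 0.625, gamma=3.0)
print(round(struct_heavy, 3))
```

Because all three inputs lie in [0, 1] and the weights are positive, the result also stays in [0, 1].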
In our example, the schema proximity between tool and expert integrated schemas is equal to prox(Sitool, Siexp) = (0.86 + 0.71 + 0.625) / 3 = 0.73 with all parameters set to 1. Thus, the quality of the tool integrated schema is equal to 73% w.r.t. the expert integrated schema.

3.4 Discussion
We now discuss some issues dealing with the proposed schema proximity metric. Contrary to [6], our structurality metric does not rely on discovering common subtrees. Instead, we check the common ancestors of each element, so that no element is unduly penalised. With a subtree-based metric, for instance, child elements whose parent element is different are not included in a subtree, and they are taken into account as single elements (not part of a subtree) when measuring the schema quality. With our structurality metric, we avoid this problem since each element is individually checked together with its ancestors.

We have decided to exclude the root element from the metric, because it already has a strong weight due to its position. If the root element of the tool integrated schema is the same as the one in the expert integrated schema, then all compared elements (those present in both schemas) already have a common ancestor (the root). Conversely, if the root elements of both integrated schemas are different, then every compared element has a decreased structurality due to the different root elements. Therefore, there was no need to consider this root element.

Our measure assumes that a set of mappings between the source schemas has been discovered. This set of mappings has a strong impact on building the integrated schema. In most cases, domain experts can check and validate the mappings, so that mapping errors do not affect the quality of the integrated schema. However, there also exist many cases in which manual checking is not possible, e.g., in dynamic environments or in large scale scenarios. What is the influence of mapping quality in such contexts? Let us discuss these points according to
precision and recall. The former measure denotes the percentage of correct mappings among all those which have been discovered. In other words, the lower the precision is, the more incorrect mappings the tool has discovered. For all these incorrect mappings, only one element of the mapping is chosen to be included in the integrated schema, while the other is not. Since the mapping is incorrect, all elements composing it should have been put in the schema. Thus, precision has an influence on completeness. On the contrary, the second measure, recall, directly impacts minimality. As it computes the percentage of correct mappings that have been discovered among all correct mappings, it evaluates the number of correct mappings that have been “missed” by the tool. A “missed” mapping is fully integrated, i.e., all of its elements are added in the integrated schema. Yet, only one of them should be added. For these reasons, the quality of the set of mappings is strongly correlated with the quality of the integrated schema.

Although one could see the requirement of an expert integrated schema as a drawback, we point out that the measures used to evaluate the quality of mappings (precision and recall) are also based on the existence of an expert ground truth. Besides, the authors of [2] indicate that companies and organizations often own global repositories of schemas or common vocabularies. These databases can be seen as incomplete expert integrated schemas. Indeed, they have mainly been manually built, thus ensuring an acceptable quality. They are also incomplete since not all schemas, nor all of their underlying concepts, are integrated in these databases. Yet, it is possible to use them as ground truth. Let us imagine that users of a company are accustomed to a global repository. If the company needs an extended integrated schema which includes the concepts of the global repository, it could be convenient for the users that the new integrated schema keeps a structure and completeness similar to those of the global repository. In this case, we can apply our measures both on the global repository and the new integrated schema to check whether these constraints are respected.

However, the integrated schema proximity metric does not take into account user requirements and other constraints. For instance, a user might not want a complete integrated schema since (s)he will query only a subset of the schema. Or minimality might not be respected because the application domain requires some redundancies. Likewise, some hardware constraints may also impact integrated schemas.
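The precision and recall measures used in this discussion can be sketched as follows. This is a minimal illustration assuming that mappings are represented as pairs of element names; the function name `precision_recall` and the element names in the example are hypothetical and do not come from the datasets of this paper.

```python
def precision_recall(discovered, correct):
    """Precision and recall of a set of discovered mappings.

    Both arguments are sets of (source element, target element) pairs;
    `correct` plays the role of the expert ground truth.
    """
    true_positives = len(discovered & correct)
    precision = true_positives / len(discovered) if discovered else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical ground truth and tool output.
correct = {("name", "fullName"), ("zip", "postcode"), ("tel", "phone")}
discovered = {("name", "fullName"), ("zip", "phone")}

precision, recall = precision_recall(discovered, correct)
print(precision, recall)
```

In this toy case, the incorrect mapping (zip, phone) lowers precision and would cause one of its two elements to be dropped from the integrated schema, while the two missed mappings lower recall and would cause both of their elements to be kept, which is the redundancy that minimality penalises.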
4 Experiments Report
In this section, we present the evaluation results of the following schema matching tools: COMA++ [7,8] and Similarity Flooding (SF) [9,10]. These tools are described in the next section (see section 5). We notice that it is hard to find available schema matchers to evaluate. We first describe our experiment protocol, mainly the datasets that we used. We then report results achieved by schema matching tools on the quality of integrated schema by datasets. Due to space limitation, we do not include quality (in terms of precision, recall or F-measure) obtained by the tools for discovering the mappings. As explained in Section 3.4, mapping discovery is a crucial initial step for building the integrated
schema. Thus, we provide some figures when necessary to justify the quality of the integrated schema.

4.1 Experiments Protocol
Here are the datasets used for these experiments:
– Person dataset. It contains two small-sized schemas describing a person. These schemas are synthetic.
– Order dataset. It deals with business. The first schema is drawn from the XCBL collection2, and it contains about 850 elements. The second schema (from the OAGI collection3) also describes an order but it is smaller, with only 20 elements. This dataset reflects a real-case scenario in which a repository of schemas exists (similar to our large schema) and the users would like to know if a new schema (the small one in our case) is necessary or if a schema or subset of a schema can be reused from the repository.
– University courses dataset. These 40 schemas have been taken from the Thalia collection presented in [11]. Each schema has about 20 nodes and they describe the courses offered by some worldwide universities. As described in [2], this dataset could refer to a scenario where users need to generate an exchange schema between various data sources.
– Biology dataset. The two large schemas come from different collections which are protein domain oriented, namely Uniprot4 and GeneCards5. This is an interesting dataset for deriving a common specific vocabulary from different data sources which have been designed by human experts.
– Currency and sms datasets. They are popular web services which can be found at http://www.seekda.com
– University department dataset. It describes university departments and it has been widely used in the literature [12]. These two small schemas have very heterogeneous labels.
– Betting dataset. It contains tens of webforms, extracted from various websites by the authors of [13]. As explained by the authors of [2], schema matching is often a process which evaluates the costs (in terms of resources and money) of a project, thus indicating its feasibility. Our betting dataset can be a basis for project planning, i.e., to help users decide whether integrating their data sources is worthwhile or not.

4.2 Experiments
For each schema matching tool, we have first run the schema matching process to discover mappings between source schemas. Thanks to these mappings (which have not been manually checked), the tools have then built an integrated schema. All experiments were run on a 3.0 GHz laptop with 2 GB of RAM under Ubuntu Hardy.

2 www.xcbl.org
3 www.oagi.org
4 http://www.ebi.uniprot.org/support/docs/uniprot.xsd
5 http://www.geneontology.org/GO.downloads.ontology.shtml
Betting dataset. Figure 3(a) depicts the quality for the betting dataset. COMA++ successfully encompasses all concepts (100% completeness) while SF produces the same structure as the expert (100% structurality). Neither tool achieved a minimal integrated schema, i.e., one without redundancies. SF generates the most similar integrated schema w.r.t. the expert one (schema proximity equal to 92%).
[Figure 3(a): Quality for the betting dataset. Figure 3(b): Quality for the biology dataset. Bar charts of completeness, minimality, structurality and schema proximity (value in %) for COMA++ and SF.]
Biology dataset. With this large scale and domain-specific dataset, the schema matching tools performed poorly at discovering mappings (less than 10% F-measure). These mediocre results might be explained by the fact that no external resource (e.g., a domain ontology) was provided. However, as shown by figure 3(b), the tools were able to build integrated schemas with acceptable completeness (above 80%) but many redundancies (minimality below 40%) and different structures (58% and 41% structuralities). These scores can be explained by the failure to discover correct mappings. As a consequence, lots of schema elements have been added into the integrated schemas, including redundant elements. For structurality, we believe that for unmatched elements, the schema matching tools have copied the same structure as the one of the input schemas.

Currency dataset. In figure 3(c), we can observe the quality of the integrated schemas built by COMA++ and SF for currency, a nested average-sized dataset. The latter tool manages to build a more similar integrated schema (83% schema proximity against 62% for COMA++). Although both tools have a 100% completeness, COMA++ avoids more redundancies (due to a better recall during mapping discovery) while SF better respects the schema structure. We notice that COMA++ produces a schema with a different structure from the one of the expert. This is probably due to [...]

Order dataset. This experiment deals with large schemas whose labels are normalised. Similarly to the other large scale scenario, schema matching tools do not perform well for this order dataset (F-measures less than 30%). As for the quality of the integrated schema, given by figure 3(d), both tools achieve a schema proximity above 70%, with a high completeness.
[Figure 3(c): Quality for the currency dataset. Figure 3(d): Quality for the order dataset. Bar charts of completeness, minimality, structurality and schema proximity (value in %) for COMA++ and SF.]
Person dataset. Figure 3(e) depicts the quality for the person dataset, which contains small schemas featuring low heterogeneity in their labels. We notice that both generated schemas are complete and they achieve the same minimality (76%). However, for this dataset containing nested schemas, COMA++ achieves a higher structurality than SF. The tools achieve an 80% schema proximity, mainly due to the good precision and recall that they both achieve.

Sms dataset. The sms dataset does not feature any specific criteria, but it is a web service. A low quality for discovering mappings has been achieved (all F-measures below 30%). As the tools missed many correct mappings, the integrated schemas that they produced have a minimality around 50%, as shown in figure 3(f). SF obtains better completeness and structurality than COMA++.
[Figure 3(e): Quality for the person dataset. Figure 3(f): Quality for the sms dataset. Bar charts of completeness, minimality, structurality and schema proximity (value in %) for COMA++ and SF.]
Univ-courses dataset. The univ-courses dataset contains flat and average-sized schemas. Figure 3(g) shows the quality of the integrated schemas built by COMA++ and SF. It appears that both tools produce an acceptable integrated schema w.r.t. the expert one (schema proximity equal to 94% for COMA++ and 83% for SF). Notably, COMA++ achieves a 100% completeness and 100% structurality.

Univ-dept dataset. The last dataset, univ-dept, has been widely used in the literature. It provides small schemas with high heterogeneity and the results of the schema matching tools are shown in figure 3(h). Both tools achieve acceptable completeness and structurality (all above 90%), but they have more difficulties respecting the minimality constraint, mainly due to their average recall.
[Figure 3(g): Quality for the univ-courses dataset. Figure 3(h): Quality for the univ-dept dataset. Bar charts of completeness, minimality, structurality and schema proximity (value in %) for COMA++ and SF.]

4.3 Concluding the Experiments Report
We conclude this section by underlining some general points about these experiments.
– Average completeness (for all tools and all datasets) is equal to 91%. On the contrary, average minimality is 58% and average structurality reaches 68%. Indeed, schema matching tools mainly promote precision, thus they avoid the discovery of incorrect mappings and they do not miss too many schema elements when building the integrated schema. A lower recall means that many similar schema elements are added in the integrated schema, thus reducing minimality.
– We also notice that it is possible to obtain a high minimality with a low recall, if precision is low too. Indeed, the low recall means that we have missed many correct mappings, thus two similar elements could both be added in the integrated schema. But with a low precision, there are many incorrect discovered mappings, and only one of their elements would be added in the integrated schema. As an example, let us imagine that a correct mapping between elements A and A’ is not discovered. Both A and A’ are added in the integrated schema, unless one of them has been incorrectly matched to another element. This explains the high minimality achieved with some datasets, despite a low recall.
– Similarity Flooding provides a better quality when building integrated schemas (79% average schema proximity against 67% for COMA++).
– If a correct mapping is missed by a matching tool, then both elements of this missed mapping are added in the integrated schema. Structurality only takes into account one of these elements (the one which is in the expert integrated schema). The other is ignored by structurality, but it penalizes minimality. This explains why structurality and completeness have high values even when mapping quality measures return low values.
– Schema proximity is also quite high, simply because it averages completeness and structurality values which are already high.

For instance, when few correct mappings are discovered (order or biology datasets), many elements are added into the integrated schema, thus ensuring a high completeness but a low minimality. Due to the missed mappings, lots of elements have to be added into the integrated schema, and the easiest way is to keep the
same structure that can be found in the source schemas, thus guaranteeing an acceptable structurality. However, our schema proximity measure can be tuned (with parameters α, β and γ) to highlight a weakness in any of the three criteria (completeness, minimality or structurality).
5 Related Work

Many approaches have been devoted to schema matching. In [14,15], authors have proposed a classification for matching tools, which has been later refined in [16]. Similarly, ontology researchers have also been prolific in designing approaches to fulfill the alignment task between ontologies [17]. However, the yearly OAEI challenge6, for instance, mainly evaluates the mapping quality, and not ontology integration. This section only focuses on schema matching tools which are publicly available for evaluation with our benchmark, namely Similarity Flooding and COMA++.

5.1 Similarity Flooding/Rondo

Similarity Flooding [9] (also called Rondo [10]) is a neighbour affinity matching tool. First, it applies a terminological similarity measure to discover initial correspondences, and then feeds them to the structural matcher for propagation. The similarity value between two elements is increased if the algorithm finds some similarity between related elements of the pair. The user can then (in)validate the discovered correspondences, and the tool builds an integrated schema based on these correspondences.

5.2 COMA/COMA++
COMA/COMA++ [18,7] is a generic, composite matcher with very effective matching results. The similarity of pairs of elements is calculated using linguistic and terminological measures. Then, a strategy is applied to determine the pairs that are presented as correspondences. COMA++ supports a number of other features like merging, saving and aggregating match results of two schemas.
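The idea of a composite matcher (several similarity measures, an aggregation step, then a selection strategy) can be illustrated with the sketch below. It is not COMA++'s implementation: the two string measures, the averaging step, the threshold of 0.5 and all function names are assumptions chosen only to make the three stages visible.

```python
from difflib import SequenceMatcher


def edit_sim(a, b):
    """First measure: similarity ratio of the two lower-cased labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def trigrams(label):
    """Set of character trigrams of a label (the whole label if shorter)."""
    label = label.lower()
    return {label[i:i + 3] for i in range(max(len(label) - 2, 1))}


def trigram_sim(a, b):
    """Second measure: Jaccard coefficient of the trigram sets."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def match(source, target, threshold=0.5):
    """Aggregate both measures by averaging, then keep for each source
    element its best target if the aggregated score reaches the threshold."""
    correspondences = {}
    for s in source:
        scores = {t: (edit_sim(s, t) + trigram_sim(s, t)) / 2 for t in target}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:
            correspondences[s] = best
    return correspondences


# Hypothetical element labels.
print(match(["customerName", "orderDate"], ["CustName", "DateOfOrder", "Price"]))
```

Changing the aggregation (for instance taking the maximum instead of the average) or the selection strategy changes which pairs are proposed as correspondences, which is the kind of flexibility a composite matcher offers.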
6 Conclusion

In this paper, we have presented new measures for assessing the quality of the integrated schema produced by schema matching tools. Namely, we are able to evaluate the structure of this schema. Combined with minimality and completeness, the schema proximity measure computes the likeness of an integrated schema w.r.t. an expert one. We have finally evaluated two schema matching tools, COMA++ and Similarity Flooding, over 10 datasets. The resulting report indicates that Similarity Flooding generates better integrated schemas. But it also shows that schema matching tools could be enhanced to let users express some constraints for generating an integrated schema, for instance in terms of design.

6 http://oaei.ontologymatching.org/
As future work, we intend to enhance our measures for ontologies. The structurality measure should be refined to express the different relationships (e.g., generalization, instance) between the paths of two elements. As many organizations own schema repositories which could be used as expert integrated schemas, we also plan to extend our measures for reflecting the incompleteness of these repositories.
References
1. Batini, C., Lenzerini, M., Navathe, S.B.: A Comparative Analysis of Methodologies for Database Schema Integration. ACM Computing Surveys 18(4), 323–364 (1986)
2. Smith, K., Morse, M., Mork, P., Li, M., Rosenthal, A., Allen, D., Seligman, L.: The role of schema matching in large enterprises. In: CIDR (2009)
3. Castano, S., De Antonellis, V., Fugini, M.G., Pernici, B.: Conceptual schema analysis: techniques and applications. ACM Trans. Database Syst. 23(3), 286–333 (1998)
4. da Conceição Moraes Batista, M., Salgado, A.C.: Information quality measurement in data integration schemas. In: QDB, pp. 61–72 (2007)
5. Kesh, S.: Evaluating the quality of entity relationship models. Information and Software Technology 37, 681–689 (1995)
6. Duchateau, F., Bellahsene, Z., Hunt, E.: Xbenchmatch: a benchmark for xml schema matching tools. In: VLDB, pp. 1318–1321 (2007)
7. Aumueller, D., Do, H.H., Massmann, S., Rahm, E.: Schema and ontology matching with COMA++. In: ACM SIGMOD Conference, DEMO paper, pp. 906–908 (2005)
8. Do, H.H., Rahm, E.: Matching large schemas: Approaches and evaluation. Information Systems 32(6), 857–885 (2007)
9. Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In: ICDE, pp. 117–128 (2002)
10. Melnik, S., Rahm, E., Bernstein, P.A.: Developing metadata-intensive applications with rondo. J. of Web Semantics I, 47–74 (2003)
11. Hammer, J., Stonebraker, M., Topsakal, O.: Thalia: Test harness for the assessment of legacy information integration approaches. In: Proceedings of ICDE, pp. 485–486 (2005)
12. Doan, A., Madhavan, J., Domingos, P., Halevy, A.: Ontology matching: A machine learning approach. In: Handbook on Ontologies in Information Systems (2004)
13. Marie, A., Gal, A.: Boosting schema matchers. In: Meersman, R., Tari, Z. (eds.) OTM 2008, Part II. LNCS, vol. 5332, pp. 283–300. Springer, Heidelberg (2008)
14. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB Journal 10(4), 334–350 (2001)
15. Euzenat, J., et al.: State of the art on ontology matching. Technical Report KWEB/2004/D2.2.3/v1.2, Knowledge Web (2004)
16. Shvaiko, P., Euzenat, J.: A survey of schema-based matching approaches. Journal of Data Semantics IV, 146–171 (2005)
17. Euzenat, J., Shvaiko, P.: Ontology matching. Springer, Heidelberg (2007)
18. Do, H.H., Rahm, E.: COMA - A System for Flexible Combination of Schema Matching Approaches. In: VLDB, pp. 610–621 (2002)
Contextual Factors in Database Integration — A Delphi Study Joerg Evermann Faculty of Business Administration Memorial University of Newfoundland St. John’s, NL, Canada [email protected]
Abstract. Database integration is an important process in information systems development, maintenance, and evolution. Schema matching, the identification of data elements that have the same meaning, is a critical step to ensure the success of database integration. Much research has been devoted to developing matching heuristics. If these heuristics are to be useful and acceptable, they must meet the expectations of their users. This paper presents an exploratory Delphi study that investigates what information is used by professionals for schema matching.
1 Introduction

The central step in any database integration project is schema matching, the identification of data elements in multiple sources that correspond to each other. The identification of such data elements is a judgment or decision task for which database professionals use a diverse set of information about each data element and data source. A wide array of software tools is available, using increasingly complex heuristics, to support or automate the identification of matching database elements [1]. If these software tools are to be successful, their results must satisfy their users' expectations [2]. Thus, knowledge of how database professionals make matching decisions is important to develop new schema matching heuristics. Such knowledge also provides criteria with which to evaluate the performance of schema matching tools. This is increasingly important, as there are currently no standard evaluation criteria [3].

Compared to the wealth of software heuristics, little work has tackled the investigation of schema matching decisions by database professionals [2]. A recent experimental study has investigated the use and relative importance of different kinds of schema and instance information [4]. There appears to be a greater importance of instance than schema information, varying significantly between individuals. There were also strong correlations between different notions of "match": subjects took "matching" to mean any or all of (1) having the same meaning, (2) referring to the same domain elements, (3) storing information about the same domain objects, or (4) matching each other. Another study investigated the processes by which database professionals arrive at their matching decisions [5]. Examining the same kinds of information as [4], this study found that there is no unique process by which database professionals arrive at the matching decision [5]. While this confirms the results of the earlier study [4], it does little to clarify why the individual differences occur.

Thus, the question of the causes of the individual differences in schema matching decisions remains. In this study, we employ the Delphi methodology to identify the contextual factors that affect database professionals' decision making for schema matching and may account for the individual differences. The term "context" has been used in schema matching research to signify different things, such as the semantic relationships of a data item to other data items, metainformation associated with data items, or sets of objects [7], or the fact that specific matches between data instances hold only when certain constraints on other data instances are satisfied [6]. In contrast, the term "context" is here defined as the set of circumstances outside or external to the database that may have an effect on the schema matching outcome, specifically by affecting the importance of information items used for schema matching decisions.

The contributions of this study are twofold. First, we identify the relevant information for schema matching, grounded in actual practice. Second, we identify contextual factors that influence matching decisions by affecting the relative importance of the information used for matching. The contributions are of use to researchers for further development of matching heuristics and for improving the validity of empirical evaluations of matching tools.

The remainder of the paper is structured as follows. Section 2 provides a brief introduction to schema matching. Section 3 presents the methodology and results of this study. A discussion with reference to previous work is provided in Section 4. The paper concludes with an outlook to further research.

J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 274–287, 2010. © Springer-Verlag Berlin Heidelberg 2010
2 Schema Matching Review
Schema matching methods are primarily categorized by their use of schema-level or instance-level information, although many methods use both types of information [1,2,8]. Schema-level matching methods may use constraint information, such as data type, optionality, or uniqueness constraints on attributes [10,11,12,13,14,15,16,17]. They may also use structural information such as foreign-key dependencies [7,9,11,12,15,16,18,19,20,21]. Schema-level information may be limited to local information, where only individual database elements are considered, e.g. columns within a table, or it may encompass global information, where the overall structure of the database is considered. Instance-level information can be used in addition to, or instead of, schema-level information. For example, a schema-level matching method may be used to first match tables, and an instance-level matching method may subsequently match table columns [9,22,23,24,25,26]. Instance-level information typically consists of aggregate measures such as value distributions, frequencies, averages, etc.
that are computed for attributes and are used to identify similar attributes. For example, when two table fields contain the same distribution of values, then the fields may be argued to be similar. Machine learning techniques, such as neural networks [26], Bayesian learners [22,27], and information retrieval techniques [28] have been used to establish relevant aggregate measures. Instance-level information may be used locally, e.g. instances of a single table column, or it may be used globally, e.g. computing data dependencies among multiple columns [9]. Two further dimensions are added by [9]. Local information techniques use information about single elements, e.g. individual table columns, while global information techniques use information about the entire database, e.g. the global schema graph. Interpreted techniques use semantic knowledge about the database structure or its instances, while uninterpreted techniques do not1. All of the above types of information are used by one matching method or another, often in different combinations. However, without knowledge of the contribution of each type of information to human similarity judgments, the developers of schema matching methods have little guidance on how to improve their methods. Further, a review of experimental evaluation methods shows that "the evaluations have been conducted in so [sic] different ways that it is impossible to directly compare their results" [3]. As Do et al. acknowledge, "the match task first has to be manually solved" [3], but most tool evaluations only use a single human expert to perform the manual solution, thus introducing possible bias, and casting doubt on the validity and generalizability of the findings. None of the studies that use human experts to develop a reference match report the specific tasks and contexts based on which the human experts develop the matches [2].
Identifying what information human experts use, how they use it, and what factors influence that use, allows us to design rigorous and valid evaluation standards against which to compare the performance of tools.
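The instance-level idea mentioned in this section, namely that two fields with the same distribution of values may be argued to be similar, can be sketched as follows. This is an illustrative example only: the use of cosine similarity over relative value frequencies, the function names and the sample columns are assumptions and are not taken from any of the cited matching methods.

```python
import math
from collections import Counter


def distribution(values):
    """Relative frequency of each distinct value in a column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two value distributions."""
    dot = sum(p[v] * q.get(v, 0.0) for v in p)
    norm = (math.sqrt(sum(x * x for x in p.values()))
            * math.sqrt(sum(x * x for x in q.values())))
    return dot / norm if norm else 0.0


# Hypothetical column instances from two databases.
country_a = ["CA", "CA", "US", "UK"]
country_b = ["US", "CA", "CA", "UK"]
colour = ["red", "blue"]

same = cosine_similarity(distribution(country_a), distribution(country_b))
different = cosine_similarity(distribution(country_a), distribution(colour))
print(same, different)
```

A measure of this kind is local and uninterpreted in the terminology above: it looks at single columns and uses no semantic knowledge about what the values mean.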
3 Research Methodology and Results
Given the inconclusive results of the earlier, observational studies [4,5], a Delphi study that asks professionals directly about the information they use for making matching decisions might shed more light on the phenomenon. Rather than relying on the researcher to advance a theory, Delphi studies identify information grounded in actual practice. Given the limited success of prior studies with theoretical foundations, a Delphi study can provide a way forward to interpreting and making sense of previous results. The Delphi method is an exploratory technique to elicit a consensus importance rank ordering from a panel of experts [29,30]. It does this by first using a brainstorming phase to elicit a set of items, including their definitions and examples. Next, experts rank-order the items by importance. The researcher compiles average ranks and computes an agreement measure. which is returned to the experts, who are asked to revise their ranking. The process is repeated until either 1
What is called local and global information matching here is called element and schema level in [9], which is confusing with the term schema level as used by [1].
Contextual Factors in Database Integration — A Delphi Study
277
Table 1. Demographics of participants (self-reported) Data integration experience Integration work Years Projects Database administrator (re- 18 6 Data integration for sponsible for maintenance, data business/financial remigration, QA for database deporting sign, advice on data integration) Software developer 26 Data migration from legacy systems Senior programmer (database 8 5 Data migration from legacy systems legacy applications) Web application developer 2 2 Data integration for (business intelligence) business intelligence solutions Manager of Advanced Tech- 11 ≥ 50 System / Application nologies – Enterprise Architecintegration ture Group Senior application architect 10 20 Developing new mod(systems development, mostly ules for inadequately relational database-based sysdocumented systems tems, frequent activity in data integration) Computer database analyst 3 to 5 5 to 6 Integration of multiple (DBA) databases Executive director of applica- 4 to 6 20 to 30 System / Application tion development integration Director (data masking prod- 18 0 Assisting and consulting for data masking ucts) Job title or description
a sufficient level of agreement is established, or there is no further improvement in agreement among the experts. This study follows the recommendations in [30].

Because Delphi studies require a significant commitment from the participants, and are designed to collect rich data, the suggested sample size for a Delphi study is small when compared to survey research. Expert panels with as few as 8 or as many as 40 members are considered acceptable [29]. The experts for this study were selected through contacts with companies in the software development and IT consulting business. Demographic information for the nine participating experts is shown in Table 1. While the last expert has no direct experience in data integration, he is an expert in data masking, which is closely related to, and in a sense the inverse of, integration. The study was conducted by email and the experts did not know each other’s identity, a feature of Delphi studies that is intended to minimize bias and mutual influence beyond what is desired by the researcher. The study was conducted over a period of 8 months.

The Delphi study was used to simultaneously elicit and rank two sets of items. Part A of the study was intended to elicit the information items that database professionals use in making matching decisions and to determine their relative importance. Part B of the study was intended to elicit the contextual factors that database professionals believe impact the relative importance of the information found in part A, and the relative importance of those contextual factors. The instructions included an explanation of the concept of schema matching as used in
J. Evermann
this research, and some of the ways in which participants might have undertaken schema matching in their work. Specific instructions for each round of ranking are reproduced below.

Initial Round (Item Generation)

The initial round of a Delphi study is the generation of items, called the “brainstorming round”. The experts were asked to identify a set of items (separately for parts A and B of the study) they consider important. The instructions for parts A and B were as follows:

“This part of the questionnaire is about the information that you find useful or important to determine whether two pieces of data are equivalent or correspond to each other, as shown in the introductory example on the first page. I realize not all such information is always available, but please list these things anyway. Specifically, I am trying to find out about information that can be found without asking others; information that is found by analyzing the database, application software, interface definition files, etc.”

“The importance and usefulness of the information you listed above might depend on other factors. Please list such contextual factors that can influence if you would use certain information to make your decision about the correspondence of data in two elements. Contextual factors are those not present in the databases, application system, etc. These might be connected with the business or organizational environment, the purpose of integration, project and customer characteristics, schedule, technical factors, etc.”

Participants were asked for at least six items with no upper limit. The responses for each part included a brief definition and an example or comment. Eliciting definitions and examples ensures that the researcher and all participants have the same understanding of the items [30]. A total of 57 information items were generated for part A, and 42 contextual factors were generated for part B.
Using the provided definitions and examples, these responses were examined for duplicates. When duplicates were collapsed into a single item, all of their definitions and examples were retained. This allowed all participants to see other perspectives in later rounds. As a result, 27 unique information items were retained for part A, and 17 unique contextual factors were retained for part B. These are shown in Tables 2 and 3.

3.1 Round 1 (Item Reduction)
The consolidated lists of 27 information items for part A and 17 contextual factors for part B were given to the panel of experts. The elements in each list were randomly ordered and the participants were asked to select eight elements in each list that they believed are most important. Participants were told not to
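Table 2. Items and Rankings for Part A

| Label | Item | Rank in round 1 (b) | 2 | 3 (a) | 4 | 5 |
|---|---|---|---|---|---|---|
| A1 | Field Name | X | 1 | 1 | 1 | 1 |
| A21 | Data/Content | X | 2 | 2 | 2 | 2 |
| A5 | Foreign keys (lookup table relationships) | X | 6 | 6 | 3 | 3 |
| A2 | Table name | X | 3 | 3 | 4 | 4 |
| A4 | Foreign keys / Relationships | X | 4 | 4 | 5 | 5 |
| A6 | Primary keys | X | 5 | 5 | 6 | 6 |
| A20 | Field Collection / Table structure | X | 7 | 7 | 7 | 7 |
| A19 | Queries / Views | X | 8 | 8 | 8 | 8 |
| A8 | Narrow/specific data type | X | 10 | 9 | 9 | 9 |
| A17 | Application logic / code | X | 10 | 11 | 11 | 10 |
| A16 | Application user interface | X | 7 | 10 | 10 | 11 |
| A15 | Column Comments | X | 12 | 11 | 12 | 12 |
| A3 | Instance / Schema name | | | | | |
| A7 | Broad/generic data type | | | | | |
| A9 | Field size | | | | | |
| A10 | Uniqueness | | | | | |
| A11 | Includes nulls | | | | | |
| A12 | CHECK Constraints | | | | | |
| A13 | Indices | | | | | |
| A14 | Triggers | | | | | |
| A18 | Business Rules | | | | | |
| A22 | Checksums / Hashes | | | | | |
| A23 | Data range | | | | | |
| A24 | Length/Size of data | | | | | |
| A25 | Data gaps | | | | | |
| A26 | Patterns / Regular Expressions | | | | | |
| A27 | Unique/distinct column values | | | | | |
| | Agreement (Kendall’s W) | | .226 | .457 | .613 | .626 |

(a) Eight participants in this and the following rounds
(b) No ranking was required in this round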
rank-order the selected elements at this stage. The purpose of this first round was to pare down the lists to a manageable size for later rank-ordering, as well as to gather feedback on the collapsing of duplicate elements from the previous round to ensure validity. The participants were also given the opportunity to add elements during this round that they may have forgotten previously. The lists for each part were reduced to those elements that at least half of the participants had selected as most important. No additional elements were generated at this stage, and no comments about the validity of the lists were received, indicating that the elements are valid and reflect the intentions of the experts who submitted them. As a result, the list for part A contained 12 information items and the list for part B contained 9 contextual factors. The third columns in Tables 2 and 3 show the retained items marked with an “X”.

3.2 Round 2 (Ranking)
The lists of information items and contextual factors were again randomly ordered and presented to participants. In this round, participants were asked to
Table 3. Items and Rankings for Part B

| Label | Item | Rank in round 1 (b) | 2 | 3 (a) | 4 | 5 |
|---|---|---|---|---|---|---|
| B2 | Client / Customer’s business knowledge and cooperation | X | 1 | 1 | 1 | 1 |
| B1 | Integration Purpose | X | 2 | 2 | 2 | 2 |
| B5 | Integrator domain knowledge | X | 5 | 3 | 3 | 3 |
| B17 | Nature of data integrity required | X | 4 | 4 | 4 | 4 |
| B14 | Criticality / Importance of the data | X | 3 | 5 | 5 | 5 |
| B12 | Relationship with the customer/project owner | X | 7 | 6 | 6 | 6 |
| B4 | Data Accessibility | X | 6 | 7 | 7 | 7 |
| B9 | Size of databases | X | 8 | 8 | 7 | 8 |
| B16 | Technology | X | 8 | 8 | 9 | 9 |
| B3 | Source system ownership | | | | | |
| B6 | Existing data integrity requirements | | | | | |
| B7 | System availability | | | | | |
| B8 | System architecture | | | | | |
| B10 | Available time | | | | | |
| B11 | Ease of access to data | | | | | |
| B13 | Project size | | | | | |
| B15 | Project Team Skill-Set | | | | | |
| | Agreement (Kendall’s W) | | .212 | .547 | .693 | .697 |

(a) Eight participants in this and the following rounds
(b) No ranking was required in this round
rank-order the elements of each list. Participants were asked to avoid tie ranks. The specific guidelines for part A issued to participants were as follows:

“A database integrator may look towards these criteria/this information to determine whether two data elements (fields, tables, rows) from different data sources match. Please rank the items in this list by their order of importance to determining whether two data elements from different data sources match. A rank of 1 means that this is most important to you. Please do NOT USE TIES when ranking them. You may think that all of them are equally important, so here is a guideline to help you decide: If you consider a database in which two factors give you conflicting evidence, for example A1 suggests they match, while A2 suggests they do not, and all other factors being equal, would you trust A1 or A2 more?”

For part B, the identification of contextual factors that might influence the importance of information items in part A, the instructions were as follows:

“These context factors may change the importance that a database integrator assigns to items in the first list when s/he tries to determine whether two data elements (fields, tables, rows) from different data sources match. Please rank the items in this list by their order of importance to determining whether two data elements from different data sources match. A
rank of 1 means that this is most important to you. Please do NOT USE TIES when ranking them.”

The ranks for the list elements in parts A and B were averaged over all respondents. The average ranks are shown in Tables 2 and 3. To assess the degree of agreement on the ranks among the participants, Kendall’s W coefficient was computed [31]. A W-value of 0.7 shows strong agreement and gives high confidence in the ranks [30]. The agreement in this round was weak to very weak, with W = 0.226 for part A and W = 0.212 for part B, which is to be expected at the beginning of a Delphi study. The subsequent rounds of controlled feedback and ranking are designed to increase the agreement on ranks.

3.3 Round 3 (Ranking)
Given the poor agreement among the participants in the previous round, another round of ranking was conducted. Participants received the lists in parts A and B ordered by their average rank and were also shown their own previous ranking, as well as the percentage of participants that ranked an element in the top half of their ranks. This serves as an important feedback and calibration mechanism to indicate to respondents the spread of ranking for an element and to signal how much consensus there is among the experts for an element [30]. Participants were also provided with the W agreement measures and the interpretation that this was “low to very low” agreement. The participants were asked to comment on their ranking if their rank for an element differed significantly from the average rank. This is done to help build consensus through controlled feedback [30]. The instructions for this round included the following:

“You do not have to revise your ranking, but please provide a comment on your ranking if you rank a factor far from the average rank. These comments will be given back to the other participants, in the hope that they may be useful for finding a consensus ranking soon.”

Results were received from only eight of the participants, despite repeated appeals through email and telephone. Ranks were again averaged and the resulting average ranks are shown in Tables 2 and 3. The agreement was W = .457 for part A and W = .547 for part B. While there was little change in the average ranks relative to the previous round, this represents a substantial improvement in agreement and indicates fair agreement. Given this substantial improvement, it was decided to conduct another round of ranking [30].

3.4 Round 4 (Ranking)
Participants were again provided with the average ranks, their own previous ranking and the percentage of respondents who ranked an element within their top half of ranks. Participants were also provided with all other participants’ comments of the previous round together with the rank that those participants
had assigned to an item. To avoid overburdening participants, they were not asked to provide further comments on their ranking. Again, only minor changes in average ranks occurred, but the agreement measures again showed significant improvement for both part A (from W = .457 to W = .613) and part B (from W = .547 to W = .693). These agreement values constitute fair to good agreement for part A and good agreement for part B. Delphi studies should be terminated either when good agreement (W > 0.7) is achieved or when no further improvement of agreement seems achievable [30]. At this point, there was good agreement on the importance of contextual factors in part B, but not on the importance of information items in part A. Also, this round of ranking had shown that significant improvement to agreement could still be achieved. On the other hand, given the increasing difficulties in soliciting responses from the participants, it was clear that study fatigue had begun to set in, endangering further rounds of ranking either through participant withdrawal or response bias [30]. Weighing these factors, it was decided to conduct one further round of ranking.

3.5 Round 5 (Ranking)
Participants were again provided with the average rank of elements within each list, their own previous rank of an element, and the percentage of respondents who ranked an element within the top half of the ranks. The results of this round are shown in the final column in Tables 2 and 3. This round showed few changes in the average rank of any item, and little improvement in the agreement for either part (from W = .613 to W = .626 for part A, and from W = .693 to W = .697 for part B). Hence, there is little to be gained from another round of ranking. The agreement measures indicate good agreement for both parts [30]. Finally, the average ranks of the elements in each list did not change by more than a few positions over the entire course of the study. Based on these observations, the final rankings can be assumed to be valid and stable.
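As an illustration of the agreement measure reported for rounds 2 to 5, the following Python sketch computes average ranks and Kendall's W for complete, tie-free rankings using the standard formula W = 12S / (m²(n³ − n)), where m is the number of respondents, n the number of items, and S the sum of squared deviations of the item rank sums from their mean. The function names, the toy data, and the `min_gain` cut-off in `should_stop` are assumptions made for this sketch; the study itself reports only the resulting W values and the qualitative stopping criteria from [30].

```python
from typing import List, Sequence


def mean_ranks(rankings: Sequence[Sequence[int]]) -> List[float]:
    """Average rank of each item over all respondents."""
    m = len(rankings)
    n = len(rankings[0])
    return [sum(r[i] for r in rankings) / m for i in range(n)]


def kendalls_w(rankings: Sequence[Sequence[int]]) -> float:
    """Kendall's coefficient of concordance W for tie-free rankings.

    rankings[j][i] is the rank (1..n) that respondent j gave to item i.
    """
    m = len(rankings)      # number of respondents
    n = len(rankings[0])   # number of items
    if m < 2 or n < 2:
        raise ValueError("need at least two respondents and two items")
    for r in rankings:
        if sorted(r) != list(range(1, n + 1)):
            raise ValueError("each ranking must be a permutation of 1..n")
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2  # mean of the item rank sums
    s = sum((total - mean_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def should_stop(w_now: float, w_prev: float,
                threshold: float = 0.7, min_gain: float = 0.05) -> bool:
    """Stop when agreement is strong or has stopped improving.

    The 0.7 threshold follows the text; min_gain is a hypothetical
    numeric stand-in for "no further improvement seems achievable".
    """
    return w_now >= threshold or (w_now - w_prev) < min_gain


if __name__ == "__main__":
    # Hypothetical toy data: three respondents ranking three items.
    toy = [[1, 2, 3], [1, 3, 2], [2, 1, 3]]
    print(mean_ranks(toy))
    print(round(kendalls_w(toy), 3))   # 0.444
    # W values for part A as reported in the text.
    print(should_stop(0.613, 0.457))   # round 3 to 4: continue
    print(should_stop(0.626, 0.613))   # round 4 to 5: stop
```

With the part A values reported above, the assumed cut-off reproduces the decisions taken in the study: the gain from .457 to .613 justifies another round, whereas the gain from .613 to .626 does not. The formula omits the correction for ties, which matches the instruction to participants not to use tied ranks.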
4 Discussion
This Delphi study was conducted to make sense of previously reported results, which showed large variations in the use of database information across data integration professionals [4,5]. The earlier studies were based on philosophical and psychological considerations [2] and examined a pre-determined set of database information. In contrast, this study sought to generate a set of important information items from practitioners. The information items examined in the previous studies [4,5] are shown in Table 4, ranked in order of the effect sizes in [4]. A first difference is the number of items. This study has generated many more than the seven items previously examined. However, many of the extra items are specific types of constraints, aggregates, and names. Examining how the earlier studies operationalized the seven categories, we find that the constraints category in [4,5] includes all of the information items generated here, the aggregate category in [4,5] is missing A27—Unique values, A26—Patterns,
Table 4. Comparison with prior results

| Rank from [4] | Item from [4,5] | Items from this study (ranks) |
|---|---|---|
| 1 | Instance information | A21—Data/Content (2) |
| 2 | Effects in DB | A14—Triggers (—), A18—Business Rules (—) |
| 3 | Aggregate information | A22—Checksums/Hashes (—), A23—Data Range (—), A24—Length/Size (—), A25—Data Gaps (—), A26—Patterns/RegExp (—), A27—Unique Values (—) |
| 4 | Schema structure | A4—FK-Relationships (5), A5—FK-LookUp (3), A20—Table Structure (7) |
| 5 | Constraints | A6—Primary Keys (6), A7—Generic Type (—), A8—Specific Type (9), A9—Field Size (—), A10—Uniqueness (—), A11—Null (—), A12—CHECK (—) |
| No effect | Name | A1—Field Name (1), A2—Table Name (4), A3—Schema Name (—) |
| No effect | Intent | A16—App UI (11), A17—App Logic (10), A19—Queries/Views (8) |
| — | — | A13—Indices (—), A15—Column Comments (12) |
A25—Data gaps, and A22—Checksums/hashes, but does include A23—Data range and A24—Length/size. This study generated two items that were not part of the earlier studies, A13—Indices and A15—Comments. Together, A13—Indices and A19—Queries may constitute a category representing the use of data, which participants in [4] had found missing.

Naming was found to have no significant effect in [4], yet, in this study, A1—Field names are considered most important and A2—Table names are ranked 4th. The reason may be that names in the earlier study were fictitious [4]. The database effects, which describe sequences of changes to data in the data sources, are found to be highly important in [4]. However, two items related to data modification, A14—Triggers and A18—Business rules, are considered to not be important in this study. These discrepancies might arise because research subjects may not be fully aware of their actions, especially when these actions are internalized. This effect was also found in [4], where respondents’ actual information use was compared to their stated information use, and is well-known and documented in the literature [32].

While the process-tracing study described in [5] did not set out to identify the importance of information items for matching but the sequence of their use, some relevant information was collected. For example, the importance of an information item may be related to how frequently it is mentioned or viewed
during the matching process. By that reasoning, database content and schema structure are by far the two most important pieces of information, with constraint information following in 3rd position. There were few differences among the remaining 5 pieces of information investigated in that study. This is somewhat similar to the present results, where we find that A21—Data/Content ranks very highly (2nd), followed by schema structure related elements (A5—FK-Look Up (3rd), A4—FK-Relationships (5th), A20—Table structure (7th)), and then in turn followed by constraint related elements (A6—PK (6th), A8—Data type (9th)). This cross-validation of results serves to strengthen our confidence in the validity of the present results. Another interesting result is that some of the information used in schema matching heuristics was not even generated in this study. This may lead to potential issues about the acceptability of these heuristics for professional data integrators. If the tool users do not understand or accept the importance of the information used by a tool, they may not accept the tool’s matching decisions. For example, attribute correlations [9] and dependencies [33], the use of transformations to improve matching [18], or the exploitation of hierarchical domain structure [34] are not mentioned by our participants, yet used in schema matching heuristics. Most of the contextual factors identified in this study (Table 3) have not previously been examined in schema matching research, nor have these factors been measured or experimentally manipulated in previous studies. No existing schema matching heuristics take these factors into account when determining the similarity of database elements. Only the contextual factor B1—Integration Purpose has been briefly discussed as a possible affecting variable in ontology alignment [35]. 
The participants of the previous studies [4,5] provided the business context in which they typically perform data integration, with responses such as ”Application integration”, ”Business intelligence”, or ”Data warehousing”. This may also be related to contextual factor B1—Integration purpose, generated in this study. However, the definition for B1—Integration purpose in this study was ”The reason for integrating the data” with the following clarifying comment: ”Through the integration request, the business owner(s) may define what items they want integrated or the type of view they wish to see. This would provide a basis on which items in each data to compare”. Hence, while related, business context and integration purpose are distinct. Moreover, the process-tracing study reported in [5] failed to find correlations between business context and the integration process. The identification of contextual factors that affect the importance of schema matching information raises a challenge for the empirical evaluation of schema matching results [3,2], as the human-created reference matches may not be valid in different contexts or for different people. Hence, the generalizability of the matching heuristics to different contexts and different people may be limited. The identification of the contextual factors in this study can be used by researchers to more clearly specify task contexts when they use human participants to create reference matches for evaluation of heuristics and tools. This will support the comparability of evaluation results across studies and tools [3].
5 Conclusion
As set out in Section 1, the contributions of this study are twofold. First, some of the information items generated in this study are novel and the rankings provided by this study are new and may be useful to schema matching researchers and practitioners alike. Many of the information items that have found little or no use in existing schema matching research, e.g. indices, application logic, or triggers, are not judged as important. In contrast, the items judged as important by participants are those typically used in schema matching heuristics. This validates existing research approaches and matching tools. However, application logic, application user interface, and queries/views on the data are perceived as important, yet have seen no use in matching heuristics, perhaps because they go beyond the static database and extend into the surrounding IS. It may be more appropriate to speak of systems matching rather than schema matching. Incorporating these items, with their relative importance, into matching tools may enhance the quality of the matching results, as perceived by human users. They constitute an area for future research, both for empirical studies such as this one, or for developing heuristics through design science research. Second, the contextual factors identified in this study show that researchers must take care to explicate the task context in which schema matching is assumed to occur, limiting the potential generalizability of matching results. Context matters and needs to be taken into account when comparing results, either between matching tools, or between tools and human reference matches. While our study did not elicit examples, it is easy to construct plausible scenarios for the influence of contextual factors. For example, large databases may prevent some compute-intensive instance-based techniques and thus require schema-based approaches. 
Actual data may be inaccessible due to privacy reasons, and therefore necessitate schema-level approaches. When data is critical, e.g. financial data, data integrators may pay increased attention to criteria they would not otherwise consider. The domain knowledge of the data integrator may lead him or her to act based on prior knowledge and naming, and to pay less attention to other information. The purpose of the integration may lead to a focus on structural information, e.g. if separate databases with global views are required, whereas instance information may be more important when data is to be merged.

While the contextual factors were ranked in importance, this is only the beginning of further research. Further empirical work needs to observe the actual impact of these contextual factors, if any. For example, experiments may be performed in which the contextual factors are systematically manipulated to determine how and in what combination they affect the matching outcome. Similarly, more detailed process tracing studies may be conducted to examine at what point and by what process such influence is exerted. Other kinds of studies are possible as well. In-depth case or field studies are suitable to compare and contrast data integration projects with different contextual characteristics. Using such rich observational techniques in a realistic setting may also reveal whether the underlying paradigm of this research is justified. This study and prior work [4,5] have framed data integration as a decision making problem, but
this may not necessarily be the dominant mode. For example, it may be conceived as a problem solving issue, where a certain outcome or functionality is required, and the integrator’s task is to solve the problem of how to achieve this. Alternatively, data integration may be described as a design task, where an integrated database or data schema is designed using the existing data as input or foundation. Finally, data integration and schema matching could also be conceived of as a negotiation task, either between human actors, such as data integrators and clients, among data integrators on the project team, or, alternatively, between human actors and computer tools, as in the case of semi-automatic, human-guided schema matching. The identified contextual factors could play a role in any of these conceptualizations of schema matching.

Delphi studies such as this are by their nature atheoretic [30,29]. This study has provided some interesting and useful data, but has offered no theory that might explain the findings. Earlier, theory-based approaches have had mixed success in explaining their findings. Hence, the present study is just one further step toward theory development, by raising puzzling questions and providing some data that needs to be explained, and from which theories can be developed.
References

1. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10, 334–350 (2001)
2. Evermann, J.: Theories of meaning in schema matching: A review. J. Database Manage. 19, 55–83 (2008)
3. Do, H.H., Melnik, S., Rahm, E.: Comparison of schema matching evaluations. In: Chaudhri, A.B., Jeckle, M., Rahm, E., Unland, R. (eds.) NODe-WS 2002. LNCS, vol. 2593, pp. 221–237. Springer, Heidelberg (2003)
4. Evermann, J.: Theories of meaning in schema matching: An exploratory study. Inform. Syst. 34, 28–44 (2009)
5. Evermann, J.: An exploratory study of database integration processes. IEEE T. Knowl. Data En. 20, 99–115 (2008)
6. Bohannon, P., Elnahrawy, E., Fan, W., Flaster, M.: Putting context into schema matching. In: Proc. Int. Conf. VLDB, pp. 307–318 (2006)
7. Palopoli, L., Sacca, D., Terracina, G., Ursino, D.: Uniform techniques for deriving similarities of objects and subschemas in heterogeneous databases. IEEE T. Knowl. Data En. 15, 271–294 (2003)
8. Batini, C., Lenzerini, M., Navathe, S.: A comparative analysis of methodologies for database schema integration. ACM Comput. Surv. 18, 323–364 (1986)
9. Kang, J., Naughton, J.F.: Schema matching using interattribute dependencies. IEEE T. Knowl. Data En. 20, 1393–1407 (2008)
10. Lerner, B.S.: A model for compound type changes encountered in schema evolution. ACM T. Database Syst. 25, 83–127 (2000)
11. Mitra, P., Wiederhold, G., Kersten, M.: A graph-oriented model for articulation of ontology interdependencies. In: Zaniolo, C., Grust, T., Scholl, M.H., Lockemann, P.C. (eds.) EDBT 2000. LNCS, vol. 1777, pp. 86–100. Springer, Heidelberg (2000)
12. Bertino, E., Guerrini, G., Mesiti, M.: A matching algorithm for measuring the structural similarity between an XML document and a DTD and its applications. Inform. Syst. 29, 23–46 (2004)
13. Castano, S., De Antonellis, V., De Capitani di Vimercati, S.: Global viewing of heterogeneous data sources. IEEE T. Knowl. Data En. 13, 277–297 (2001)
14. Larson, J., Navathe, S., Elmasri, R.: A theory of attribute equivalence in databases with application to schema integration. IEEE T. Software Eng. 15, 449–463 (1989)
15. Gotthard, W., Lockemann, P.C., Neufeld, A.: System-guided view integration for object-oriented databases. IEEE T. Knowl. Data En. 4, 1–22 (1992)
16. Hayne, S., Ram, S.: Multi-user view integration system (MUVIS): an expert system for view integration. In: Proc. ICDE, pp. 402–409 (1990)
17. Spaccapietra, S., Parent, C.: View integration: A step forward in solving structural conflicts. IEEE T. Knowl. Data En. 6, 258–274 (1992)
18. Yeh, P.Z., Porter, B., Barker, K.: Using transformations to improve semantic matching. In: Proc. K-CAP 2003, pp. 180–189 (2003)
19. Noy, N., Musen, M.: Anchor-PROMPT: Using non-local context for semantic matching. In: Workshop on Ontologies and Information Sharing at IJCAI (2001)
20. Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In: Proc. ICDE, pp. 117–126 (2002)
21. Wang, T.L.J., Zhang, K., Jeong, K., Shasha, D.: A system for approximate tree matching. IEEE T. Knowl. Data En. 6, 559–571 (1994)
22. Berlin, J., Motro, A.: Database schema matching using machine learning with feature selection. In: Pidduck, A.B., Mylopoulos, J., Woo, C.C., Ozsu, M.T. (eds.) CAiSE 2002. LNCS, vol. 2348, pp. 452–466. Springer, Heidelberg (2002)
23. Kang, J., Naughton, J.F.: On schema matching with opaque column names and data values. In: Proc. ACM SIGMOD, pp. 205–216 (2003)
24. Miller, R.J., Hernandez, M.A., Haas, L.M., Yan, L., Ho, C.H., Fagin, R., Popa, L.: The Clio project: Managing heterogeneity. SIGMOD Rec. 30, 78–83 (2001)
25. Chua, C.E.H., Chiang, R.H., Lim, E.P.: Instance-based attribute identification in database integration. VLDB J. 12, 228–243 (2003)
26. Li, W.S., Clifton, C., Liu, S.Y.: Database integration using neural networks: Implementation and experiences. Knowl. Inf. Syst. 2, 73–96 (2000)
27. Doan, A., Madhavan, J., Dhamankar, R., Domingos, P., Halevy, A.: Learning to match ontologies on the semantic web. VLDB J. 12, 303–319 (2003)
28. Su, X., Hakkarainen, S., Brasethvik, T.: Semantic enrichment for improving system interoperability. In: Proc. ACM SAC, pp. 1634–1641 (2004)
29. Okoli, C., Pawlowski, S.D.: The Delphi method as a research tool: an example, design considerations and applications. Inform. & Manage. 42, 15–29 (2004)
30. Schmidt, R.C.: Managing Delphi surveys using nonparametric statistical techniques. Decision Sci. 28, 763–774 (1997)
31. Kendall, M., Babington Smith, B.: The problem of m rankings. Ann. Math. Stat. 10, 275–287 (1939)
32. Hufnagel, E.M., Conca, C.: User response data: The potential for errors and biases. Inform. Syst. Res. 5, 48–73 (1994)
33. Berlin, J., Motro, A.: Autoplex: Automated discovery of content for virtual databases. In: Batini, C., Giunchiglia, F., Giorgini, P., Mecella, M. (eds.) CoopIS 2001. LNCS, vol. 2172, pp. 108–122. Springer, Heidelberg (2001)
34. Ganesan, P., Garcia-Molina, H., Widom, J.: Exploiting hierarchical domain structure to compute similarity. ACM T. Inform. Syst. 21, 64–93 (2003)
35. Quix, C., Geisler, S., Kensche, D., Li, X.: Results of GeRoMeSuite for OAEI 2008. In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 160–166. Springer, Heidelberg (2008)
Building Dynamic Models of Service Compositions with Simulation of Provision Resources⋆

Dragan Ivanović1, Martin Treiber2, Manuel Carro1, and Schahram Dustdar2

1 Facultad de Informática, Universidad Politécnica de Madrid
2 Distributed Systems Group, Technical University of Vienna
Abstract. Efficient and competitive provision of service compositions depends both on the composition structure and on the planning and management of the computational resources necessary for provision. Resource constraints on the service provider side have an impact on the provision of composite services and can cause violations of predefined SLA criteria. We propose a methodology for modeling the dynamic behavior of provider-side orchestration provision systems, based on the structure of the orchestrations that are provided, their interaction, statistically estimated run-time parameters (such as running time) based on log traces, and the model of resources necessary for orchestration provision. We illustrate the application of our proposed methodology on a non-trivial real world example, and validate the approach using a simulation experiment.

Keywords: Service Compositions, Business Process Modeling, Quality of Service, Simulation.
1 Introduction Service compositions allow organizations to develop complex, cross-organizational business processes by reusing existing services, and are thus attractive for service providers and service consumers alike. Service compositions have been studied thoroughly over recent years and different models to define service compositions have emerged [1]. Approaches like BPEL [2] or YAWL [3] define service compositions in a top down manner using specific notation. At the same time, abstract service composition models include different strands of Petri Nets [4] and process calculi [5]. Key to business usability of service compositions is conformance of their Quality of Service (QoS) attributes with Service-Level Agreements (SLA). Both are intimately related to monitoring and adaptation capabilities [6]. From the computational point of view, resource utilization management, especially the ability to scale computational resources to the expected level of demand, may have drastic impact on response time and failure rates. ⋆
The research leading to these results has received funding from the European Community Seventh Framework Programme FP7/2007-2013 under grant agreement 215483 (S-Cube). Dragan Ivanović and Manuel Carro were also partially supported by Spanish MEC project 2008-05624/TIN DOVES and CM project P2009/TIC/1465 (PROMETIDOS).
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 288–301, 2010. © Springer-Verlag Berlin Heidelberg 2010
Building Dynamic Models of Service Compositions
In this paper we take the approach of modeling service compositions with dynamic, continuous-time models that are usually found in applications of system dynamics [7]. We extend the previous work on applying system dynamics to (atomic) service provision management [8], by deriving the quantitative indicators for a service provision chain (number of executing instances, invocation and failure rates) from the structure of a particular composition being provided, as well as from a model of computational resources involved in provision. Usefulness of the system dynamics approach for generating simulators of different (and potentially complex and “non-standard”) what-if scenarios that reflect interesting situations in provider’s environment, has been already studied [9]. Such simulators can be used as a basis for developing and validating provision management policies for service-oriented systems. We propose an approach that utilizes a common (composition-independent) service provision framework that divides the modeling concern into the composition and the computational resource parts. The former has been studied in several interesting, but specific, cases [9,10], and some examples of a more systematic approaches to building such dynamic composition models based on QoS constraints have been demonstrated [11]. Our intention is to propose a generic method of converting descriptions of orchestrations in the form of a place-transition networks (PT-nets) [12] into dynamic models, in a manner that ensures composability of orchestrations within choreographies. We believe that such automatic dynamic model generation is a prerequisite for practical application.
2 Motivating Example

To illustrate the challenges of our proposed approach, consider the following service composition example. An SME offers a data cleansing service, which is the process of filtering arbitrary customer data (e.g., consumer or company addresses). The data cleansing removes duplicates and, if necessary, corrects data using the SME’s internal (consumer/company) database. An overview of the service composition is given in Figure 1. In the first step (Format Service), the data set is transformed into a format that is accepted by the services that are used for data cleansing. This is supported by a set of auxiliary tools, like spreadsheets or text editors, which allow users to check the content and to apply modifications to it. Examples include the rearrangement of data columns or the deletion of columns. In the second step, a service (Data Checker Service) is used to determine whether a data record represents consumer data or company data. Afterwards, a Search Service is used to search the company’s database for matching data. If a data record can be found, the customer data record is marked with a unique numeric identifier and the result is stored using a Storage Service. If no matching datum is found, the data record is used to create a new entry in the SME database with a unique identifier (Add Service). In the case of multiple hits, the data is checked again (Check Service) using additional resources like online databases to look for additional data to confirm the identity of a data record and either to create a new entry in the SME database or to assign a unique identifier manually. These activities are controlled
Fig. 1. Overview of data cleansing process (services shown: Format Service, Data Checker Service, Search Service with outcomes "hit", "no hit" and "multiple hits", Check Service, Add Service, Storage Service, Calibration Service on "precision not ok", and Logging Service in parallel execution)

Fig. 2. A PT-net representation of workflow from Fig. 1 (places $p_{in}$, $p_1$ to $p_9$, $p_{out}$; transitions A to H and the AND-join J; branch weights $w_{4h}$, $w_{4nh}$, $w_{4mh}$, $w_{5h}$, $w_{5nh}$, $w_{5p}$)
by employees who supervise the search service. If the observed quality of the search result is not satisfactory (e.g., low precision of the search result), the search service is re-calibrated manually and the re-calibrated Search Service is invoked again. In parallel, a Logging Service is executed to log service invocations (execution times and service precision).

Figure 2 shows a representation of the cleansing process workflow from Figure 1 in the shape of a PT-net. Services from the workflow are marked as transitions (A . . . H), and additional transitions are introduced for the AND-split/join. Places represent the arrival of an input message ($p_{in}$), the end of the process ($p_{out}$), and activity starting/ending. The places that represent branches to mutually exclusive activities ($p_4$ and $p_5$) have outgoing branches annotated with non-negative weight factors that add up to 1; e.g., the branches $w_{4h}$, $w_{4nh}$ and $w_{4mh}$ of $p_4$, which correspond to the single hit, no hits and multiple hits outcomes of the search service (C), respectively.

As indicated by the example, the actual execution time of the service composition depends on various factors, like the execution time of the human-provided services. Based on these observations, we can summarize the challenges of the working example as follows:

– Unpredictable content. The content of the customer data varies in terms of data structure, data size, and quality. For instance, there might be missing and/or wrong parts of addresses, wrong names, or misplaced column content, all of which can cause extra effort.
Fig. 3. Conceptual view of composition CT model (classes shown: Quantity with name and value; Variable; Aggregation with aggregate op; Rate Var with place id and "is parameter?"; Activity Var with net rate and activity id; OCT Model with process def; CCT Model with composition def; Connection)
– Unpredictable manual intervention. Depending on the data size and the data quality, the precision of the search result differs. This requires manual calibration of search profiles during the execution of the task.
– Customer-specific details. The provided and expected data formats also need to be considered. For instance, one customer provides Excel spreadsheets, another favors ASCII files with fixed-length columns, and yet another provides an XML structure.
– Delegation. Employees might delegate parts of their activities to other employees due to time constraints.

All these challenges arise because of resource constraints (e.g., the number of employees that work on a given task) and need to be taken into account when service compositions are modeled. The provider may be interested not only in maximal levels of required resources, but may also ask, given an input request scenario, when the peak loads can be expected to occur, how long it takes for resource loads to stabilize after a change in input regime, and what resource up/down scaling policies are reasonable.
3 Conceptual Dynamic Modeling Framework

3.1 Conceptual Composition Model

We start with a conceptual model for dynamic representation of service compositions (orchestrations and choreographies). The goal of these models is to represent how the numbers of executing activities and the rates of activity-to-activity transitions vary over time. Figure 3 presents a conceptual framework for continuous-time (CT) composition modeling. The fundamental building block of an orchestration CT (OCT) model is a variable that has a time-varying value. Relative to a PT-net model, activity variables are attached to transitions, and the value of each is the expected number of executing instances of the corresponding activity at each point in time. In our example, each service in the data cleansing process would be represented by an activity variable that shows the expected number of concurrently executing instances of the service at any moment.
Rate variables correspond to places, i.e., the events, in the workflow: arrival of input messages, dispatching of replies, or branching from one activity to another. They represent the number of the corresponding events per unit of time. Rates that come from outside the OCT are called parameters. The most common example is the input message rate (number of incoming messages per unit of time), and the reply rate from the model of an invoked service. In our example, besides the input parameter, the rate variables would include any service-to-service transitions, branches, parallel split-join points, and the resulting finishing rate. Other than variables, quantities in an OCT model include aggregations, which use some aggregation operator (e.g. sum, minimum, maximum) to combine values of two or more quantities. Examples include: the sum of activity variables for a given set of activity types, the number of invocations of a specific partner service, and the sum of the reply and failure rates. We could, for instance, create an aggregate quantity that sums all activity variables for services in the data cleansing process that are hosted by the same provider. Such aggregate quantity could be used as an indication of the total load on the provider’s infrastructure. The OCT conceptual model can also be applied as a wrapper for services with unknown structure, or those that are not available for analysis. Such “wrapper OCT models” consist of a single activity variable, which stands for the entire service execution, with the input rate (the parameter), the reply rate, and the failure rate. A composition CT (CCT) model is an extension of the concept of OCT model, and puts together one or more sub-models, which are connected by pairing their respective input and reply rates. For instance, when an orchestration A invokes an orchestration B, at least two connections are required. 
The first one connects A’s send rate to B’s input rate, and the second connects B’s reply rate to the corresponding receive rate of A. Additionally, B’s failure rate may be connected to an appropriate slot in A. A CCT model includes all quantities from the underlying sub-models, but only the unconnected parameters from the sub-models qualify as parameters of the CCT model. In the rest of the paper, we implicitly assume that the sub-models used in simulations are connected in such a way that the resulting model has a single parameter, its input message rate.
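Fig. 4. Simulation setting for the CCT model (stock/flow diagram: incoming requests $i$ fill the stock $R$; rejects leave $R$ at rate $e = R/T_e$ into the stock $E$; requests enter the OCT model at rate $p_{in} = \beta \cdot R/T_{in}$; the OCT model discharges successful finishes at rate $p_{out}$ into the stock $S$ and failures at rate $p_{fail}$ into the stock $F$; the resource model supplies $\beta$)

Fig. 5. Stocks, flows and their meaning: a stock $S$ with initial value $S_0$, inflow $f_i$ and outflow $f_o$ satisfies

$$S|_{t=0} = S_0, \qquad \frac{d}{dt} S = f_i - f_o \;\Leftrightarrow\; S = S_0 + \int_0^t (f_i - f_o)\,dt$$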
3.2 Dynamic Framework Provision Model

Figure 4 shows a view of the framework dynamic provision model that embeds an OCT model and a resource model, using the stock/flow notation [7]. As described in Figure 5, stock/flow diagrams provide a visual notation that corresponds to a part of the system of ordinary differential equations. Rates are represented as flows, while stocks correspond to integrals over the difference between inflows and outflows. Auxiliary variables (circles) represent parameters or intermediate calculations in the model, and arrows depict influences. The framework dynamic provision model is driven by the input rate (or input regime) $i$, which models different situations of interest in the environment. Note that this dynamic model (as well as other system dynamic models) is usually not solved analytically, but simulated using numeric integration techniques. This allows experimentation with arbitrary input regime scenarios (i.e., without limiting to the “standard” statistical distributions at the entry).

The input regime $i$ fills the stock of received requests $R$. Some of those requests that are not served in time are rejected (at rate $e$, filling the stock $E$ of rejects). Others, after a period of preparation (modeled by the time constant $T_{in}$), are fed by the input rate $p_{in}$ into the OCT model. There they execute, and end up either as a successfully finished instance (rate $p_{out}$ filling stock $S$) or as a failed execution (rate $p_{fail}$ filling stock $F$).

The relation between the OCT component and the resource model component of the framework model is twofold. First, quantities from the OCT model are used by the resource model to determine the resource load at any moment in time. For instance, the sum of all activity variables corresponding to the services in the workflow hosted by the same provider may represent the sum of all threads that need to be available at any moment of time for the orchestration to run.
The resource model is, of course, a model of particular resource management policies. Some models can be as simple as to assume no resource limits, others can have “firm” limits, and yet others may include mechanisms for scaling the resources up or down, based on observed trends in the input regime. We will present such a scalable model in Section 4.5; until then, we will tacitly assume the infinite resource case. The second link is the blocking factor $\beta$ that regulates the composition input rate $p_{in}$. $\beta$ stands for the probability that the system is able to start executing a prepared process instance. In an infinite resource case, $\beta = 1$, but in a more realistic case, as the system load increases beyond the optimal carrying capacity of the provider’s infrastructure, the system becomes less able to accept new requests, and $\beta$ may decrease down to a point of complete denial of service ($\beta = 0$).
Fig. 6. Overview of the approach (elements shown: service composition in BPEL, transformation service, Petri Net model, workflow engine, service execution, monitoring data, ODE composition model)
The framework model can be used to produce some basic metrics for the provision system, relative to the input regime $i$. Assuming that $R^+$ is the “inflow” part of $R$, i.e., the sole result of accumulating requests from $i$ (without the effect of $e$ and $p_{in}$), then if $R^+ > 0$, we may obtain:

– Percentage of all failures and rejects: $(E + F)/R^+$;
– Percentage of all finished orchestrations: $(S + F)/R^+$;
– Percentage of all successful finishes: $S/R^+$.
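The framework model of Fig. 4 and the metrics above can be explored numerically. The following is a minimal sketch, not the authors' simulator: it integrates the stock/flow equations with a forward Euler step, uses a single-activity “wrapper” OCT model, and assumes a constant blocking factor $\beta$ and a fixed failure probability. The function name `simulate_provision`, the parameter names and the sample values are illustrative assumptions.

```python
def simulate_provision(i, T_e, T_in, T_m, phi=0.0, beta=1.0, t_end=100.0, dt=0.01):
    """Forward-Euler sketch of the framework provision model (Fig. 4).

    i     -- input regime: function t -> request arrival rate
    T_e   -- time constant of rejection (e = R / T_e)
    T_in  -- preparation time constant (p_in = beta * R / T_in)
    T_m   -- mean execution time of the wrapper activity (assumption)
    phi   -- failure probability of the wrapper activity (assumption)
    beta  -- blocking factor, held constant here (assumption)
    """
    R = E = S = F = M = R_plus = 0.0   # stocks; M is the wrapper activity variable
    for step in range(int(round(t_end / dt))):
        t = step * dt
        e = R / T_e                    # reject rate
        p_in = beta * R / T_in         # rate fed into the OCT model
        done = M / T_m                 # exponential-decay outflow of the activity
        p_out, p_fail = (1.0 - phi) * done, phi * done
        arrivals = i(t)
        R += (arrivals - e - p_in) * dt
        R_plus += arrivals * dt        # "inflow" part of R only
        E += e * dt
        M += (p_in - done) * dt
        S += p_out * dt
        F += p_fail * dt
    stocks = {"R": R, "E": E, "M": M, "S": S, "F": F, "R_plus": R_plus}
    metrics = {}
    if R_plus > 0:
        metrics = {"failed_or_rejected": (E + F) / R_plus,
                   "finished": (S + F) / R_plus,
                   "successful": S / R_plus}
    return stocks, metrics


# A burst of 5 requests per unit of time lasting 10 time units, then silence.
stocks, metrics = simulate_provision(lambda t: 5.0 if t < 10.0 else 0.0,
                                     T_e=20.0, T_in=2.0, T_m=3.0, phi=0.1)
```

Because every request that leaves $R$ ends up in exactly one of the other stocks, $R^+ = R + E + M + S + F$ holds throughout the run, which is a convenient sanity check for any replacement of the wrapper by a full OCT model.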
4 Automatic Derivation of OCT Models

The derivation of an OCT model for an orchestration follows a series of steps as shown in Figure 6. The overall process starts with the generation of a PT-net model from, e.g., an executable BPEL process specification. During the process execution, activity timing and branching probability information is collected, and used to calibrate the Petri Net model. In the final step, the calibrated Petri Net model is translated into a model based on ordinary differential equations (ODE), whose elements are OCT rates and activity variables.

4.1 Elements of the Petri Net Model

Petri Nets are a formalism frequently used to represent workflows and reason about them [12]. Many standard workflow patterns can be naturally expressed using Petri Nets [13], and there exist numerous tools that allow automatic translation of service composition languages, such as WS-BPEL [14] and YAWL [3], into a Petri Net representation, and their analysis. Additionally, Petri Net process models can be (partially) discovered from (incomplete) execution log traces. Among many different variations of Petri Nets, we start with simple PT-nets (place-transition networks). A PT-net is a tuple $\langle P, M, R \rangle$, where $P$ and $M$ are finite non-empty sets of places and transitions, respectively, and $R \subseteq (P \times M) \cup (M \times P)$ is a relation that represents edges from places to transitions and from transitions to places. For any place $p \in P$, we denote the set of input transitions for $p$ as ${}^{\bullet}p = \{m \in M \mid (m, p) \in R\}$, and the set of output transitions as $p^{\bullet} = \{m \in M \mid (p, m) \in R\}$. The sets ${}^{\bullet}m$ and $m^{\bullet}$ of input and
output places, respectively, for any transition $m \in M$, are defined analogously. For the derivation of an ODE model from the given PT-net, we require that $|{}^{\bullet}m| > 0$ for each $m \in M$. A marking $s : P \to \mathbb{N}$ assigns to each place $p \in P$ a non-negative integer $s(p)$, known as the number of tokens. We say that $p$ is marked (under $s$) if $s(p) > 0$. A marked place $p \in P$ enables exactly one of its output transitions among $p^{\bullet}$. For a transition $m$ to fire, all places in ${}^{\bullet}m$ must be marked. When firing, $m$ consumes a token from each $p \in {}^{\bullet}m$, and when $m$ finishes, it sends a token into each $p \in m^{\bullet}$. In a typical composition setting, tokens are used to describe orchestration instances, transitions are used to model activities, which may take some time to complete, and places typically represent entry/exit points or pre/post conditions for execution of activities.

4.2 Elements of the ODE Orchestration Model

In a discrete time model of an orchestration provision system, at any moment of time, each running instance of an orchestration has its own marking, and the superposition of these markings gives the aggregate view of the provision system. The aggregate number of tokens $p(t_i)$ in a place $p \in P$ between time steps $t_i$ and $t_{i+1}$ remains stable until $t_{i+1} = t_i + \Delta t_i$, where $\Delta t_i$ is a discrete time increment, implying instantaneous transitions. In real execution environments, however, activities (transitions) use some definite (sometimes long) amount of time to execute, while tokens stay in places for a very short period of time which is needed by the execution engine to start the next activity. To build an ODE model based on a PT-net, we consider an idealized execution environment, where the time step $\Delta t_i$ becomes infinitely small and turns into the time differential $dt$. Consequently, we can no longer assume that tokens stay in places for definite periods of time, but rather presume they are immediately passed to the destination activities.
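The PT-net definitions of Section 4.1 (the tuple of places, transitions and edges, the input and output sets, and the firing rule) can be sketched in code as follows. This is an illustrative data structure, not part of the original method; the class name `PTNet` and its method names are assumptions, and firing is collapsed into a single step (consume and produce at once), which ignores the execution time of the activity.

```python
from collections import Counter


class PTNet:
    """A place-transition net <P, M, R> with unit-weight edges (illustrative)."""

    def __init__(self, places, transitions, edges):
        self.places = set(places)
        self.transitions = set(transitions)
        self.edges = set(edges)  # pairs (place, transition) or (transition, place)

    def pre(self, node):
        """Input set of a node: sources of its incoming edges."""
        return {a for (a, b) in self.edges if b == node}

    def post(self, node):
        """Output set of a node: targets of its outgoing edges."""
        return {b for (a, b) in self.edges if a == node}

    def ode_ready(self):
        """Requirement for the ODE derivation: every transition has an input place."""
        return all(self.pre(m) for m in self.transitions)

    def enabled(self, m, marking):
        """A transition may fire only if all of its input places are marked."""
        return all(marking[p] > 0 for p in self.pre(m))

    def fire(self, m, marking):
        """Return the marking after m fires and finishes (simplified to one step)."""
        if not self.enabled(m, marking):
            raise ValueError(f"transition {m!r} is not enabled")
        new = Counter(marking)
        for p in self.pre(m):
            new[p] -= 1
        for p in self.post(m):
            new[p] += 1
        return new


# Hypothetical two-activity sequence: p_in -> A -> p1 -> B -> p_out
net = PTNet({"p_in", "p1", "p_out"}, {"A", "B"},
            {("p_in", "A"), ("A", "p1"), ("p1", "B"), ("B", "p_out")})
marking = Counter({"p_in": 2})
after_a = net.fire("A", marking)
```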
Therefore, in the CT case, we associate a place $p \in P$ with the rate $p(t)$ of tokens passing through $p$, measured in instances per unit of time. On the other hand, activities are fed by tokens emitted from places at their corresponding rates. For activity $m \in M$, we denote the aggregate number of its currently executing instances at time $t$ with $m(t)$. In the CT setting, we operate on probabilistic expectations of both rates and activities (transitions). When $p$ has more than one outgoing transition, we use a non-negative real number $w_{pm}$ to denote the expected fraction of $p(t)$ that is passed to $m \in p^{\bullet}$, such that $\sum_{m \in p^{\bullet}} w_{pm} = 1$. Also, we use exponential decay to model the expected number of executing activity instances. With $m(t)$ we associate a non-negative average execution time $T_m$. When $T_m = 0$, the transition is immediate, and $m(t)$ always remains empty. When $T_m > 0$, we take the usual convenience assumption in dynamic modeling that the running time of individual instances of $m$ obeys an exponential distribution with mean $T_m$ (so that completions form a Poisson process). The weight factors $w_{pm}$ and the average execution times $T_m$ are assumed to be obtained from execution logs, i.e., the statistical information from previous executions of the orchestrations. Figure 7 shows a general ODE scheme for a transition $m \in M$ with one or more input places. With a single input place, the transition continuously accumulates tokens from the
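$$\begin{aligned}
& m_p(t): \text{a new stock}, \quad \forall p \in {}^{\bullet}m \\
& q_t = \operatorname{argmin}_{p \in {}^{\bullet}m} \{ m_p(t) \} \\
& m(t) = m_{q_t}(t) \\
& \frac{d}{dt} m_p(t) = p(t)\, w_{pm} - o_m(t), \quad \forall p \in {}^{\bullet}m \\
& o_m(t) = \begin{cases} m(t)/T_m & T_m > 0 \\ q_t(t)\, w_{q_t m} & T_m = 0 \end{cases}
\end{aligned}$$

Fig. 7. ODE scheme for a transition

$$p(t) = \sum_{m \in {}^{\bullet}p} o_m(t)$$

Fig. 8. ODE scheme for a place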
input place, and discharges them either instantaneously ($T_m = 0$) or gradually ($T_m > 0$) through $o_m(t)$. When a transition has more than one input place, its execution is driven by the smallest number of accumulated tokens. At time $t$, $q_t \in {}^{\bullet}m$ denotes the place from which the smallest number of tokens has been accumulated. Because a transition needs to collect a token from all of its input places to fire, the smallest token accumulation $m_{q_t}(t)$ dictates $m(t)$. When the average execution time $T_m > 0$, the outflow $o_m(t) = m(t)/T_m$ corresponds to exponential decay. When $T_m = 0$, the transition is instantaneous, i.e., $m(t) = 0$, which means that the outflow has to balance the inflow $q_t(t)\, w_{q_t m}$ from $q_t$, therefore keeping $m_{q_t}(t)$ at zero. Figure 8 shows a general ODE scheme for a place $p \in P$, which is simply a sum of outflows from incoming transitions, assuming that ${}^{\bullet}p$ is non-empty. Places with an empty set of incoming transitions must be treated as exogenous factors.

4.3 An Example ODE Model

To illustrate the approach to construction of the ODE model, we look at the PT-net representation of our working example, shown in Figure 2. The PT-net model has the starting place $p_{in}$ and the final place $p_{out}$. Transitions that correspond to invocations of partner services are marked with letters A..H, and we assume that the corresponding average execution times $T_A..T_H$ are non-zero. Other transitions are assumed to be instantaneous, and with the exception of the AND-join transition (marked J), they simply propagate their inflow. Places $p_4$ and $p_5$ are decision nodes, and their outgoing links are annotated with weight factors corresponding to branch probabilities. With reference to Figure 1, index “h” stands for “hit”, “nh” for “no hit”, “mh” for “multiple hits”, and “p” for “precision not ok.” Weights for other (single) place-transition links are
$$\begin{aligned}
\frac{d}{dt} A(t) &= p_{in}(t) - p_1(t) & p_1(t) &= A(t)/T_A \\
p_2(t) &= p_1(t) & & \\
\frac{d}{dt} B(t) &= p_2(t) - p_8(t) & p_8(t) &= B(t)/T_B \\
p_3(t) &= p_1(t) + o_F(t) & & \\
\frac{d}{dt} C(t) &= p_3(t) - p_4(t) & p_4(t) &= C(t)/T_C \\
\frac{d}{dt} E(t) &= p_4(t)\, w_{4mh} - p_5(t) & p_5(t) &= E(t)/T_E \\
\frac{d}{dt} F(t) &= p_5(t)\, w_{5p} - o_F(t) & o_F(t) &= F(t)/T_F \\
p_6(t) &= p_4(t)\, w_{4nh} + p_5(t)\, w_{5nh} & & \\
\frac{d}{dt} G(t) &= p_6(t) - o_G(t) & o_G(t) &= G(t)/T_G \\
p_7(t) &= p_4(t)\, w_{4h} + o_G(t) + p_5(t)\, w_{5h} & & \\
\frac{d}{dt} J_{p_8}(t) &= p_8(t) - p_9(t) & \frac{d}{dt} J_{p_7}(t) &= p_7(t) - p_9(t) \\
p_9(t) &= \begin{cases} p_8(t) & J_{p_8}(t) \le J_{p_7}(t) \\ p_7(t) & J_{p_7}(t) < J_{p_8}(t) \end{cases} & & \\
\frac{d}{dt} H(t) &= p_9(t) - p_{out}(t) & p_{out}(t) &= H(t)/T_H
\end{aligned}$$

Fig. 9. ODE model for PT-net from Fig. 2
implicitly set to 1. For simplicity, the PT-net model does not represent auxiliary computations that in reality take some definite, if small, time to execute. Figure 9 shows the corresponding ODE model. Some obvious simplifications were applied. For instance, when for a $p \in P$, ${}^{\bullet}p = \{m\}$, it is not necessary to represent $o_m(t)$ and $p(t)$ separately, so we use the latter. Also, for a transition $m \in M$ where ${}^{\bullet}m = \{p\}$ and $T_m = 0$, we omit the equation for $\frac{d}{dt} m(t)$ (which is always 0), and directly propagate $p(t)\, w_{pm}$ as $o_m(t)$. The AND-join transition J has two input places, and thus two auxiliary token stocks $J_{p_7}(t)$ and $J_{p_8}(t)$. Since the join is instantaneous, at least one of these two stocks is always zero, and the outflow $p_9(t)$ copies the inflow of the smaller stock. We assume that the initial marking of the PT-net model contains only $p_{in}$. Consequently, we implicitly assume that the initial condition for all transitions ($A(0)$, $B(0)$, etc.) is zero. Since $p_{in}$ has no input transitions in the model, we assume that $p_{in}(t)$ is exogenous. Conversely, $p_{out}$ has no output transitions, and we assume that it is the terminal place of the model. The function $p_{out}(t)$ thus gives the finishing rate of the orchestrations in the ODE model, relative to the start rate $p_{in}(t)$.

4.4 Asynchronous Composition and Failures

The example PT-net in Figure 2 is simplified as it does not involve asynchronous messaging with partner services, nor does it account for potential failures during service invocations. Both can be built into the model automatically, when translating from a concrete orchestration language with known formal semantics. Here we discuss a way to deal with asynchronicity and faults in a general case.
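As an illustration of how the ODE model of Fig. 9 can be simulated, the sketch below integrates its equations with a forward Euler step. It is a minimal example under stated assumptions, not the authors' implementation: the function name `simulate_cleansing`, the dictionary keys, and all numeric values for execution times and branch weights are hypothetical, and the AND-join stocks may drift slightly below zero because of the discrete time step.

```python
def simulate_cleansing(p_in, T, w, t_end=200.0, dt=0.01):
    """Forward-Euler integration of the Fig. 9 equations.

    p_in -- function t -> input message rate
    T    -- mean execution times, keys "A", "B", "C", "E", "F", "G", "H"
    w    -- branch weights, keys "4h", "4nh", "4mh", "5h", "5nh", "5p"
    Returns the final stocks and the last computed finishing rate p_out.
    """
    s = dict.fromkeys(["A", "B", "C", "E", "F", "G", "H", "J7", "J8"], 0.0)
    p_out = 0.0
    for step in range(int(round(t_end / dt))):
        t = step * dt
        p1 = s["A"] / T["A"]
        p2 = p1
        p8 = s["B"] / T["B"]
        o_F = s["F"] / T["F"]
        p3 = p1 + o_F
        p4 = s["C"] / T["C"]
        p5 = s["E"] / T["E"]
        o_G = s["G"] / T["G"]
        p6 = p4 * w["4nh"] + p5 * w["5nh"]
        p7 = p4 * w["4h"] + o_G + p5 * w["5h"]
        # Instantaneous AND-join: the outflow copies the inflow of the smaller stock.
        p9 = p8 if s["J8"] <= s["J7"] else p7
        p_out = s["H"] / T["H"]
        deriv = {"A": p_in(t) - p1, "B": p2 - p8, "C": p3 - p4,
                 "E": p4 * w["4mh"] - p5, "F": p5 * w["5p"] - o_F,
                 "G": p6 - o_G, "H": p9 - p_out,
                 "J8": p8 - p9, "J7": p7 - p9}
        for name, d in deriv.items():
            s[name] += d * dt
    return s, p_out


# Hypothetical calibration values (not taken from the paper).
T = {"A": 1.0, "B": 0.5, "C": 2.0, "E": 5.0, "F": 3.0, "G": 1.0, "H": 0.5}
w = {"4h": 0.6, "4nh": 0.3, "4mh": 0.1, "5h": 0.5, "5nh": 0.3, "5p": 0.2}
stocks, p_out = simulate_cleansing(lambda t: 2.0, T, w)
```

With a constant input rate, the equations imply that both join inflows $p_7$ and $p_8$ settle at the input rate, so the finishing rate should approach the input rate as well; the re-calibration loop only raises the steady-state load on C, E and F.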
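Fig. 10. Asynchronous messaging scheme (transition $A_s$ sends via place $s_A$ to the partner service $A_e$; transition $A_r$ receives the reply via place $r_A$)

Fig. 11. Failure accounting scheme (a transition $m$ is decorated so that its outflow continues with weight $1 - \phi_m$ and is diverted to fault handling $\Phi$ with weight $\phi_m$)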
Figure 10 shows a usual pattern of asynchronous communication with a partner service A. Transition $A_s$ sends a message to A via a dedicated place $s_A$, and transition $A_r$ receives the reply through a synchronizing place $r_A$. The same representation applies to synchronous messaging as well, with $A_r$ directly following $A_s$, and $r_A$ as its single input place. In Figure 2, we have combined $A_s$, $s_A$, $A_e$, $r_A$, and $A_r$ into a single transition A characterized by an overall average execution time $T_A$. The rate $s_A(t)$ is the send rate, to be connected with the input rate parameter of the sub-model of $A_e$, while $r_A(t)$ is the reply rate from the sub-model, to be connected with the receive rate in the main OCT model. Examples of “wrapper” sub-models for $A_e$ are: $\{r_A(t) = s_A(t)\}$ (short circuiting, zero time), and $\{r_A(t) = A_e(t)/T_{A_e};\ \frac{d}{dt} A_e(t) = s_A(t) - r_A(t)\}$ (black box, definite time). Failures can be accounted for by introducing failure probabilities $\phi_m$ for each transition in the PT-net model, and decorating the transitions as shown in Figure 11. Fault handling is represented by $\Phi$. In the simplest case of unrecoverable faults, $\Phi$ is an instantaneous transition to a terminal fault place $p_{fail}$.

4.5 A Sample Resource Model

For a sample resource model, we model threads that execute orchestration activities on the provider’s infrastructure. In the sample, shown in Figure 12, we assume that services A, B, G and H from Figure 2 (corresponding to the Formatting, Logging, Adding and Storage services) are “back-end” services hosted by the orchestration provider, so that each of their executions occupies a (logical) thread. The number of occupied threads in the resource model is shown as $X$. The current capacity (available number of threads) is shown as $\hat{X}$, and $\gamma$ is the degree of utilization. The blocking factor $\beta$ is 1 if some capacity is free, 0 otherwise. On the management side, we form a perception $X_p$ of the number of threads required to meet the needs.
That perception changes at a rate $r_p$ that is driven by the adjustment time $T_p$. This is a well-known method of approximating the formation of perception/trend reporting based on exponential smoothing [7]. Finally, we assume that we
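Fig. 12 (caption not recovered). Resource model quantities shown: degree of utilization $\gamma = X/\hat{X}$ with current capacity $\hat{X}$; $\hat{r} = \lfloor X_p - X \rfloor_B \cdot \delta(0)$; blocking factor $\beta \in \{1, 0\}$ depending on $\gamma$.

[…] $(t_1 > t_2 > \cdots > t_n)$ in $r$ and $(t'_1 > t'_2 > \cdots > t'_n)$ in $r'$, where $t'_i \in r'$ is the modified tuple corresponding to $t_i \in r$. The distance between $r$ and $r'$ is then given by

$$\mathrm{Dist}(r, r') = c \cdot \sum_{i=1}^{n} \sum_{A \in R} \mathrm{Dist}(t_i[A], t'_i[A]) \cdot F_{t_i}, \qquad \text{where the constant } c = \frac{1}{|R| \cdot T}.$$

4 Minimal Consistency Problem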
In this section, we briefly review the underlying idea of the linear programming (LP) technique and define the Minimal Consistency Problem (MCP) in relations. We show how to solve MCP by transforming the problem into a standard LP setting.

4.1 Linear Programming Algorithm
Linear programming (LP) is a technique that optimizes a linear objective function, subject to a given set of linear equality and linear inequality constraints. The standard form (or the canonical form) of a linear programming problem
Y. Wu and W. Ng
is that, given a variable vector $\vec{X} = (x_1, x_2, \ldots, x_n)$, a constant vector $\vec{C} = (c_1, c_2, \ldots, c_n)$ and a linear objective function $G = \vec{C}^{T} \cdot \vec{X}$ which is subject to a set of constraints expressed in a matrix equation $M \cdot \vec{X} \le \vec{B}$ (or $\ge \vec{B}$), where $M$ is an $m \times n$ matrix with constant entries and $\vec{B}$ is a constant vector $(b_1, b_2, \ldots, b_m)$, we are able to optimize $G$ by using the simplex algorithm. We employ the LP solver lp_solve detailed in [10] to tackle the consistency problem in this work.

4.2 Minimal Consistency Problem and LP Transformations
In the subsequent discussion, we denote by $r$ the input relation and by $r'$ the output relation, where $r$ and $r'$ conform to the same schema. The variables $x_{i,j}$ and $x'_{i,j}$ carry the same meaning as defined in Defs. 9, 10, 11 and 12. We now define the problem of generating a consistent relation with respect to a given set of FDs $F$ where the right-hand side of all FDs is $A$ and the left-hand side of no FD contains $A$. Suppose we modify $r$ to $r'$. There are several requirements for the modification. First, the modification should change nothing over attribute values of $X$. Second, $r'$ should satisfy $F$. Third, it does not change the tuple frequencies in $r$. Fourth, it does not change the relative value frequency of any domain value over attribute $A$ (recall Def. 4). Finally, $r'$ should differ as little as possible from $r$ according to the distance measure in Def. 8. We formalize all the requirements of the modification and call the problem MCP.

Definition 9 (Minimal Consistency Problem MCP). Let $X$ be a finite set of attributes and $A$ be a single attribute. Let $F = \{ {X_i}_{\alpha_i} \to A_{\beta_i} \mid X_i \subseteq X \}$ be a finite set of FDs. Let $k$ be the size of the domain of attribute $A$. Let $V_{t_i,A} = (x_{i,1}, x_{i,2}, \ldots, x_{i,k})$ and $V_{t'_i,A} = (x'_{i,1}, x'_{i,2}, \ldots, x'_{i,k})$, $i = 1, 2, \ldots, n$. The Minimal Consistency Problem (MCP) is to find $r'$ that minimizes $\mathrm{Dist}(r, r')$ and satisfies the following four conditions (C1 to C4).

C1. $r[X] = r'[X]$.
C2. $r'$ is consistent with respect to $F$.
C3. $\forall i \in [1, n] \cap \mathbb{Z}$, $F_{t_i} = F_{t'_i}$.
C4. $\forall i \in [1, k] \cap \mathbb{Z}$, $f_{i,A} = f'_{i,A}$.
In the above definition, C1 to C4 correspond to our first four requirements. The last requirement is realized by minimizing $\mathrm{Dist}(r, r')$ as already stated. Let $T$ denote the sum of tuple frequencies of all tuples of $r$, i.e., $T = \sum_{i=1}^{n} F_{t_i}$. We now transform MCP into the first version of its equivalent LP problem. We start by assuming the special case $F = \{X_{\alpha} \to A_{\beta}\}$ (i.e., singleton $F$).

Definition 10 (1LPMCP: First LP Transformed MCP). We minimize the objective function

$$\frac{1}{(|X| + 1) \cdot T} \cdot \sum_{i=1}^{n} \Big( \sum_{j=1}^{k} |x'_{i,j} - x_{i,j}| \Big) \cdot F_{t_i},$$

which is subject to the following constraints:

1. $\forall i \in [1, n] \cap \mathbb{Z}$, $\sum_{j=1}^{k} x'_{i,j} = 1$.
2. $\forall i \in [1, n] \cap \mathbb{Z}$, $j \in [1, k] \cap \mathbb{Z}$, $1 \ge x'_{i,j} \ge 0$.
Maintaining Consistency of Probabilistic Databases
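3. $\forall j \in [1, k] \cap \mathbb{Z}$, $\sum_{i=1}^{n} x'_{i,j} \cdot F_{t_i} = \sum_{i=1}^{n} x_{i,j} \cdot F_{t_i}$.
4. $\forall i, j \in [1, n] \cap \mathbb{Z}$, $\frac{1}{2} \sum_{l=1}^{k} |x'_{i,l} - x'_{j,l}| \le \beta$ if $\mathrm{Dist}(t_i[X], t_j[X]) \le \alpha$.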
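To make the 1LPMCP formulation concrete, the following sketch evaluates its objective function and checks its four constraints for a candidate modification. It does not solve the LP; it only verifies a given candidate. The function names, the tolerance, and the sample data are illustrative assumptions, and because the distance over $X$ is defined outside this section, the pairs of tuples with $\mathrm{Dist}(t_i[X], t_j[X]) \le \alpha$ are passed in precomputed as `close_pairs`.

```python
def objective(x_old, x_new, freq, n_lhs_attrs):
    """Value of the 1LPMCP objective; n_lhs_attrs is |X|."""
    T = sum(freq)
    total = sum(f * sum(abs(b - a) for a, b in zip(old, new))
                for old, new, f in zip(x_old, x_new, freq))
    return total / ((n_lhs_attrs + 1) * T)


def is_feasible(x_old, x_new, freq, close_pairs, beta, tol=1e-9):
    """Check constraints 1 to 4 of 1LPMCP for a candidate x_new."""
    n, k = len(x_new), len(x_new[0])
    for row in x_new:
        if abs(sum(row) - 1.0) > tol:                      # constraint 1
            return False
        if any(v < -tol or v > 1.0 + tol for v in row):    # constraint 2
            return False
    for j in range(k):                                     # constraint 3
        new_mass = sum(x_new[i][j] * freq[i] for i in range(n))
        old_mass = sum(x_old[i][j] * freq[i] for i in range(n))
        if abs(new_mass - old_mass) > tol:
            return False
    for i, j in close_pairs:                               # constraint 4
        half_l1 = 0.5 * sum(abs(x_new[i][l] - x_new[j][l]) for l in range(k))
        if half_l1 > beta + tol:
            return False
    return True


# Two tuples that agree on X but hold opposite values of A (hypothetical data).
x_old = [[1.0, 0.0], [0.0, 1.0]]
x_new = [[0.75, 0.25], [0.25, 0.75]]
freq = [1.0, 1.0]
close_pairs = [(0, 1)]
cost = objective(x_old, x_new, freq, n_lhs_attrs=1)
```

In this example the original relation violates constraint 4 for $\beta = 0.5$, while the candidate satisfies all constraints at a cost of 0.25.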
Notably, the four conditions in Def. 9 are addressed in the above definition: C1 and C3 in Def. 9 are satisfied, since all the values involved in these two conditions remain constant in the LP formulation. C4 is realized by the third 1LPMCP constraint and finally C2 is realized by the fourth 1LPMCP constraint. The objective function of 1LPMCP is employed to find the minimal $\mathrm{Dist}(r, r')$. The first two 1LPMCP constraints in Def. 10 simply ensure that the modified relation $r'$ is a valid one. There is still a problem in 1LPMCP: we need to convert the absolute-value expressions of the fourth 1LPMCP constraint and the objective function into standard linear form so that 1LPMCP can be solved by a linear programming algorithm. To achieve the conversion, we need to introduce more technical variables and define another version of LPMCP as follows.

Definition 11 (2LPMCP: Second LP Transformed MCP). We minimize the objective function

$$\frac{1}{(|X| + 1) \cdot T} \cdot \sum_{i=1}^{n} \Big( \sum_{j=1}^{k} d_{i,j} \Big) \cdot F_{t_i},$$

which is subject to the following constraints:

1. $\forall i \in [1, n] \cap \mathbb{Z}$, $\sum_{j=1}^{k} x'_{i,j} = 1$.
2. $\forall i \in [1, n] \cap \mathbb{Z}$, $j \in [1, k] \cap \mathbb{Z}$, $1 \ge x'_{i,j} \ge 0$.
3. $\forall j \in [1, k] \cap \mathbb{Z}$, $\sum_{i=1}^{n} x'_{i,j} \cdot F_{t_i} = \sum_{i=1}^{n} x_{i,j} \cdot F_{t_i}$.
4. $\forall i, j \in [1, n] \cap \mathbb{Z}$,
   a. $\forall l \in [1, k] \cap \mathbb{Z}$, $D_{i,j,l} \ge x'_{i,l} - x'_{j,l}$ and $D_{i,j,l} \ge x'_{j,l} - x'_{i,l}$.
   b. if $\mathrm{Dist}(t_i[X], t_j[X]) \le \alpha$, $\frac{1}{2} \sum_{l=1}^{k} D_{i,j,l} \le \beta$.
5. $\forall i \in [1, n] \cap \mathbb{Z}$, $j \in [1, k] \cap \mathbb{Z}$, $d_{i,j} \ge x'_{i,j} - x_{i,j}$ and $d_{i,j} \ge x_{i,j} - x'_{i,j}$.

In Def. 11, $d_{i,j}$ and $D_{i,j,l}$ are the newly introduced variables that replace the absolute-value expressions in Def. 10. Clearly, the objective function and all the constraints in 2LPMCP are in the standard setting of a linear programming problem. Therefore, the minimal value of the objective function can be obtained by using the simplex algorithm. The following theorem formally shows that the minimal value of the 2LPMCP objective function is equal to the minimal possible distance $\mathrm{Dist}(r, r')$ in MCP.

Theorem 1. Let $m_1$ be the minimal value of the objective function in 2LPMCP.
Let $m_2$ be the minimal possible distance $\mathit{Dist}(r, r')$ in MCP. Then $m_1 = m_2$.

By Theorem 1, we establish the result that MCP with a single FD ($X_\alpha \to A_\beta$) can be solved in polynomial time. Next, we consider transforming MCP in the general case of multiple FDs having the same attribute on the right-hand side.

Definition 12 (3LPMCP: Third Transformed MCP). We minimize the objective function $\frac{1}{(|X|+1)\cdot T}\sum_{i=1}^{n}\bigl(\sum_{j=1}^{k} d_{i,j}\bigr)\cdot F_{t_i}$, which is subject to the following constraints:
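1. $\forall i \in [1,n] \cap \mathbb{Z}$: $\sum_{j=1}^{k} x'_{i,j} = 1$.
2. $\forall i \in [1,n] \cap \mathbb{Z},\ j \in [1,k] \cap \mathbb{Z}$: $1 \ge x'_{i,j} \ge 0$.
3. $\forall j \in [1,k] \cap \mathbb{Z}$: $\sum_{i=1}^{n} x'_{i,j} \cdot F_{t_i} = \sum_{i=1}^{n} x_{i,j} \cdot F_{t_i}$.
4. $\forall i,j \in [1,n] \cap \mathbb{Z}$:
   a. $\forall l \in [1,k] \cap \mathbb{Z}$: $D_{i,j,l} \ge x'_{i,l} - x'_{j,l}$ and $D_{i,j,l} \ge x'_{j,l} - x'_{i,l}$.
   b. $\frac{1}{2}\sum_{l=1}^{k} D_{i,j,l} \le \min_{(X_p)_{\alpha_p} \to A_{\beta_p} \in F} \bigl\{ [[\mathit{Dist}(t_i[X_p], t_j[X_p]) \le \alpha_p]] \cdot \beta_p + (1 - [[\mathit{Dist}(t_i[X_p], t_j[X_p]) \le \alpha_p]]) \cdot (+\infty) \bigr\}$.
5. $\forall i \in [1,n] \cap \mathbb{Z},\ j \in [1,k] \cap \mathbb{Z}$: $d_{i,j} \ge x'_{i,j} - x_{i,j}$ and $d_{i,j} \ge x_{i,j} - x'_{i,j}$.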
The essential difference between 3LPMCP and 2LPMCP is in the right-hand side of the inequality in Constraint 4b. The double square bracket notation $[[E]]$ returns 1 if the boolean expression $E$ evaluates to true, and 0 otherwise. The expressions in the constraint contain only known values, so the right-hand side is still a constant expression. The following corollary immediately follows from Theorem 1.

Corollary 1. Let $m_1$ be the minimal value of the objective function in 3LPMCP. Let $m_2$ be the minimal possible distance $\mathit{Dist}(r, r')$ in MCP. Then $m_1 = m_2$.

From Corollary 1, we can see that 3LPMCP is an effective means to obtain a consistent relation $r'$ modified from $r$, since it guarantees the minimal possible change from the original (inconsistent) relation $r$ with respect to $F$.
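To make the constraints of Def. 12 concrete, the following is a minimal Python sketch that checks whether a candidate assignment satisfies them and computes the amount of change. It is not the authors' implementation and does not solve the LP. The names `pair_bound`, `is_feasible` and `unnormalized_change`, the callable `dist` (standing in for $\mathit{Dist}(t_i[X_p], t_j[X_p])$) and the encoding of an FD as a triple `(lhs, alpha, beta)` are all assumptions made for illustration, and the normalizing constant of the objective function is deliberately left out.

```python
import math


def pair_bound(i, j, fds, dist):
    """Right-hand side of Constraint 4b for the tuple pair (i, j).

    fds  : list of (lhs, alpha, beta) triples, one per FD (assumed encoding)
    dist : hypothetical callable, dist(i, j, lhs) = Dist(t_i[lhs], t_j[lhs])
    """
    # An FD whose left-hand sides are farther apart than alpha imposes no bound.
    return min(
        (beta if dist(i, j, lhs) <= alpha else math.inf for lhs, alpha, beta in fds),
        default=math.inf,
    )


def is_feasible(x_old, x_new, freq, fds, dist, eps=1e-9):
    """Check Constraints 1-4, with absolute values used in place of D_{i,j,l}."""
    n, k = len(x_old), len(x_old[0])
    # Constraints 1 and 2: every row of x_new is a probability distribution.
    for row in x_new:
        if abs(sum(row) - 1.0) > eps or any(v < -eps or v > 1.0 + eps for v in row):
            return False
    # Constraint 3: frequency-weighted column sums are preserved.
    for j in range(k):
        before = sum(x_old[i][j] * freq[i] for i in range(n))
        after = sum(x_new[i][j] * freq[i] for i in range(n))
        if abs(before - after) > eps:
            return False
    # Constraint 4: tuples that are close on the left-hand side are close on A.
    for i in range(n):
        for j in range(i + 1, n):
            gap = 0.5 * sum(abs(x_new[i][l] - x_new[j][l]) for l in range(k))
            if gap > pair_bound(i, j, fds, dist) + eps:
                return False
    return True


def unnormalized_change(x_old, x_new, freq):
    """Frequency-weighted sum of |x'_{i,j} - x_{i,j}|, without the constant factor."""
    return sum(
        f * sum(abs(a - b) for a, b in zip(new, old))
        for f, new, old in zip(freq, x_new, x_old)
    )
```

For two tuples that agree on the left-hand side but hold opposite certain values of A, the original assignment is infeasible, while moving both tuples to the uniform distribution satisfies all constraints.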
Complexity Note. The worst-case running time of the simplex algorithm is exponential. However, Spielman and Teng [13] used smoothed analysis to show that the algorithm has polynomial complexity in terms of the input size and the magnitude of perturbation, where small random perturbations can be caused by noise or instrument errors. In our problem setting, the smoothed complexity of solving 3LPMCP is $O(n^8 k^4 \ln nk)$. The modeling time for transforming the input of MCP to the input of 3LPMCP is $O(n^2 k\,||F||)$. Corollary 1 implies that solving 3LPMCP is equivalent to solving MCP. Therefore, the total time for solving MCP is $O(n^8 k^4 \ln nk + n^2 k\,||F||)$.

4.3 Further Improving the Evaluation of 3LPMCP
The time for solving 3LPMCP ($O(n^8 k^4 \ln nk)$) dominates the modeling time ($O(n^2 k\,||F||)$) asymptotically. However, the running time for solving 3LPMCP can be further reduced as follows. We construct an undirected weighted graph $G$ where each vertex $v_i$ corresponds to tuple $t_i$ of the input relation $r$ for $i = 1, 2, \ldots, n$. An edge of weight $d$ exists between two vertices $v_i$ and $v_j$ if and only if $d$ ($\le 1$) is the maximum distance allowed between $t_i$ and $t_j$ according to $F$ (cf. Constraint 4b in Def. 12). It is clear that two tuples in different connected components of $G$ will not violate any FD in $F$. We partition $r$ into sub-relations where tuples belong to the same sub-relation if and only if their corresponding vertices are in the same connected component of $G$. Then 3LPMCP of the original relation $r$ can be solved by solving 3LPMCP of all the sub-relations.
In fact, the running time for solving 3LPMCP becomes negligible compared to I/O time and modeling time as domain size increases, because each subrelation becomes smaller and the coefficient matrix of 3LPMCP is very sparse (most entries are zero). This interesting point will be further discussed using the empirical results presented in Sect. 6.
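The partitioning step described above can be sketched with a union-find structure. This is an illustrative sketch, not the authors' code: `partition_tuples` is a hypothetical name, tuples are identified by their indices 0..n-1, and `bounded_pairs` is assumed to list exactly the pairs of tuples that are joined by an edge in G, i.e. those with a finite bound in Constraint 4b.

```python
def partition_tuples(n, bounded_pairs):
    """Group tuple indices 0..n-1 into the connected components of G."""
    parent = list(range(n))

    def find(v):
        # Find the representative of v, compressing the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for i, j in bounded_pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())
```

Each returned group can then be modeled and solved as its own, smaller 3LPMCP instance.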
5 Chase Algorithm: LPChase
Chase algorithms are a commonly used technique for dealing with consistency problems in databases [9,14]. Our chase also takes a relation $r$ and an FD set $F$ as inputs and outputs a relation $r'$ that is consistent with respect to $F$. In contrast to others, our chase algorithm is developed by solving 3LPMCP, which guarantees minimal change to $r$. Before presenting our chase algorithm, we define a specific class of FD sets, called acyclic FD sets. We will show that acyclicity of the FD set is a necessary assumption in constructing the algorithm.

Definition 13 (Acyclic FDs). Let $F$ be a set of FDs over a relational schema $R$. Let $G_F = (V, E)$ be a directed graph, where $V = R$ and $E = \{(A, B) \mid \exists X_\alpha \to Y_\beta \in F \text{ such that } A \in X \text{ and } B \in Y\}$. $F$ is said to be acyclic iff $G_F$ is acyclic.

Notably, the definition of an acyclic FD set is more general than that of a canonical FD set [7]. Obviously, a canonical set is acyclic. However, the converse is not true. For example, $F = \{A \to B, B \to C\}$ is acyclic but not canonical, since $B$ occurs on both the left-hand side and the right-hand side.

We now present our chase algorithm based on solving MCP in Algorithm 1. This algorithm, denoted by LPChase($r$, $F$), takes a relation $r$ and an acyclic FD set $F$ as inputs. In Algorithm 1, MCPSolve($F''$, $A$, $r[XA]$) is defined to be a function that chases $r[XA]$ over the FD set $F''$ by modifying tuples over attribute $A$; MCPSolve uses the 3LPMCP model described in Sect. 4. The underlying idea of LPChase is that we first generate the acyclic graph $(V, E)$ from the input (acyclic) FD set $F$ in Lines 1 and 2 and impose a topological order ($A_1 < A_2 < \cdots < A_{|R|}$) on the attributes of the schema $R$ by TopoSort($V$, $E$) in Line 3. Then, in Line 4, we decompose $F$ into $F'$ such that every FD in $F'$ contains only one attribute on its right-hand side. The "for" loop from Lines 5 to 12 is a chase that essentially checks the consistency of $r$ with respect to the set of FDs, following the topological order of the right-hand-side attribute $A_i$.
In each round, the sub-relation $r[X^i A_i]$ and all the FDs having $A_i$ on the right-hand side are passed to MCPSolve($F''$, $A_i$, $r[X^i A_i]$) in Line 9. The remaining FDs ($F' - F''$), which may contain $A_{i+1}$ on the right-hand side, are generated in Line 10 to prepare the next round of the chase.

Algorithm 1. LPChase($r$, $F$)
 1: $V \leftarrow R$
 2: $E \leftarrow \{(A, B) \mid \exists (X_\alpha \to B_\beta) \in F \text{ such that } A \in X\}$
 3: $(A_1, A_2, \ldots, A_{|R|}) \leftarrow$ TopoSort($V$, $E$)
 4: $F' \leftarrow$ Decompose($F$)
 5: for $i = 1$ to $|R|$ do
 6:   $F'' \leftarrow \{(X_\alpha \to (A_i)_\beta) \in F'\}$
 7:   if $F'' \ne \emptyset$ then
 8:     $X^i \leftarrow \bigcup \{X \mid \exists X_\alpha \to (A_i)_\beta \in F'\}$
 9:     MCPSolve($F''$, $A_i$, $r[X^i A_i]$)
10:     $F' \leftarrow (F' - F'')$
11:   end if
12: end for
13: return $r$

The following theorem presents important properties of LPChase.

Theorem 2. The following statements are true.
(1) $\forall i \in [1,k] \cap \mathbb{Z}$, $f_{i,A}$ (the relative value frequency of domain value $v_i$) does not change in LPChase($r$, $F$).
(2) $r$ is consistent with respect to $F$ after LPChase($r$, $F$).
(3) LPChase($r$, $F$) terminates in polynomial time.

Example 5. Given $r$ over $\{A, B, C, D\}$ and an FD set $F = \{A_{\alpha_1} \to B_{\beta_1}, B_{\alpha_2} \to CD_{\beta_2}, AB_{\alpha_3} \to D_{\beta_3}\}$, we have $F' =$ Decompose($F$) $= \{A_{\alpha_1} \to B_{\beta_1}, B_{\alpha_2} \to C_{\beta_2}, B_{\alpha_2} \to D_{\beta_2}, AB_{\alpha_3} \to D_{\beta_3}\}$. According to Def. 13, $G_F$ can be topologically sorted as $ABCD$ (another possibility is $ABDC$). Thus, Algorithm 1 chases for the first attribute $A_1 = A$ but gets $F'' = \emptyset$ in round 1. It chases for $A_2 = B$ and gets $F'' = \{A_{\alpha_1} \to B_{\beta_1}\}$, $X^2 = A$ and $F' = \{B_{\alpha_2} \to C_{\beta_2}, B_{\alpha_2} \to D_{\beta_2}, AB_{\alpha_3} \to D_{\beta_3}\}$. It chases for $A_3 = C$ and gets $F'' = \{B_{\alpha_2} \to C_{\beta_2}\}$, $X^3 = B$ and $F' = \{B_{\alpha_2} \to D_{\beta_2}, AB_{\alpha_3} \to D_{\beta_3}\}$. Finally, it chases for $A_4 = D$ and gets $F'' = \{B_{\alpha_2} \to D_{\beta_2}, AB_{\alpha_3} \to D_{\beta_3}\}$, $X^4 = AB$ and $F' = \emptyset$.
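The control flow of Algorithm 1 can be sketched in Python as follows. This is only an illustration of the chase loop under several assumptions: attributes are single characters, an FD is encoded as a tuple `(lhs, alpha, rhs, beta)`, and `mcp_solve` is a hypothetical callback standing in for MCPSolve (no LP is solved here). The helper `topo_sort` uses Kahn's algorithm with alphabetical tie-breaking, which is one possible choice for TopoSort, and the returned `trace` is added purely for inspection.

```python
def topo_sort(attributes, fds):
    """Topologically sort G_F (Def. 13); raise ValueError if F is not acyclic."""
    succ = {a: set() for a in attributes}
    indeg = {a: 0 for a in attributes}
    for lhs, _alpha, rhs, _beta in fds:
        for a in lhs:
            for b in rhs:
                if b not in succ[a]:
                    succ[a].add(b)
                    indeg[b] += 1
    ready = sorted(a for a in attributes if indeg[a] == 0)
    order = []
    while ready:
        a = ready.pop(0)
        order.append(a)
        for b in sorted(succ[a]):
            indeg[b] -= 1
            if indeg[b] == 0:
                ready.append(b)
        ready.sort()  # alphabetical tie-breaking keeps the order deterministic
    if len(order) != len(attributes):
        raise ValueError("FD set is not acyclic")
    return order


def lp_chase(relation, attributes, fds, mcp_solve):
    """Chase loop of Algorithm 1; mcp_solve is a placeholder for MCPSolve."""
    order = topo_sort(attributes, fds)                      # Lines 1-3
    remaining = [(lhs, alpha, b, beta)                      # Line 4: Decompose
                 for lhs, alpha, rhs, beta in fds for b in rhs]
    trace = []
    for a_i in order:                                       # Lines 5-12
        current = [fd for fd in remaining if fd[2] == a_i]  # Line 6
        if current:                                         # Line 7
            x_i = sorted(set().union(*(set(fd[0]) for fd in current)))  # Line 8
            relation = mcp_solve(current, a_i, x_i, relation)           # Line 9
            remaining = [fd for fd in remaining if fd[2] != a_i]        # Line 10
            trace.append((a_i, "".join(x_i)))
    return relation, trace                                  # Line 13
```

Run on the FD set of Example 5 with a callback that leaves the relation unchanged, the trace lists the rounds for B, C and D with left-hand-side attribute sets A, B and AB, matching the example.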
6 Experiments
We implemented Algorithm 1 in Microsoft Visual C++. All experiments were conducted on a Windows-based machine with a 2.66GHz CPU and 1GB RAM. We use a set of synthetic data generated according to a Gaussian distribution. For the LP algorithm, we use the lp_solve library [10] as discussed in Sect. 4.1. The main goal of our experiments is to study the efficiency of LPChase. Specifically, we examine the running time against various database parameters related to the data domain, tuples and FD sets. The total running time of LPChase consists of three main components: the modeling time that generates the 3LPMCP setting as described in Def. 12, the I/O time that transfers tuples to memory, and the MCP solving time that runs the LP solver.

6.1 Domain Size k
In this study, we observed that the number of data values that have non-zero probabilities in the data domain D is important to the running time of LPChase.
Fig. 1. Running time of LPChase with different domain size k: (a) solving time for MCP; (b) total running time
To understand the impact better, we call the set of all data values that have non-zero probability in $r$ the active domain (or simply AD); its size is computed as the expected size of AD derived from the Gaussian distribution. We call $D$ the physical domain (or simply PD). Clearly $AD \subseteq PD = D$. We fix $n = 20000$ and $||F|| = 2$ (a single FD $A_{0.05} \to B_{0.05}$) throughout this experiment. As shown in Fig. 1(a), the time for solving the modeled LP decreases drastically as $k$ increases and becomes negligible around $k = 10$ (i.e. PD = AD = 10). As $k$ is normally much larger than 10, the time for solving MCP is negligible. The total running time then becomes linear in $k$ due to the modeling time in the LP transformation (recall the complexity note in Sect. 4.2). The running time of Algorithm 1 for various ADs is shown in Fig. 1(b). It consistently shows that when $k$ (i.e. PD) is larger than AD, the running time decreases as $k$ increases. This is because when the PD size is increased, with all other parameters unchanged, tuples are less likely to be similar. Hence, each sub-relation (i.e. $r[X^i A_i]$ used in Line 9 of Algorithm 1) is small, and thus the sum of the MCPSolve running times (each quadratic in $n$) decreases. Intuitively, this gain is due to the "divide-and-conquer" strategy explained in Sect. 4.3.

6.2 Size of FDs ||F||
We fix $n = 20000$, $k = 10$ and $\alpha = \beta = 0.05$ for each FD in this experiment. Fig. 2 shows the running time of Algorithm 1 with different sizes of FD set. The running time is linear in $||F||$, since the modeling time is linear in $||F||$ and the time for solving the LP is negligible compared to the I/O and modeling time.

Fig. 2. Running time of LPChase with different ||F||

Fig. 3. Running time of LPChase with different n

Fig. 4. Running time of LPChase with different α: (a) total running time; (b) time for solving MCP

6.3 Size of Relation n
We fix $k = 10$, $||F|| = 2$ and $\alpha = \beta = 0.05$ for each FD in this experiment. Fig. 3 shows the running time of LPChase with different numbers of tuples. The running time is non-linear: at $k = 10$ the time for solving the LP is negligible compared to the I/O time (in $O(n)$) and the modeling time (in $O(n^2)$), and thus the modeling time, which is non-linear, dominates.

6.4 Sensitivity of FD α
We fix $k = 10$ and $||F|| = 2$ in this experiment. Fig. 4 shows the running time against different $\alpha$ values when $n = 1000$, $2000$ and $3000$, respectively. As shown in Fig. 4(a), when $\alpha$ increases from 0.08 to 0.09, the running time increases significantly. The running time is very sensitive to $\alpha$ at the critical point (0.08–0.09 in Fig. 4). Fig. 4(b) shows the total running time of LPChase ($\alpha$ = 0.05–0.08). It is clear that the sizes of the 3LPMCP sub-problems increase monotonically with $\alpha$, and so does the running time. However, in this experimental setting, $k$ is only 10 (i.e. PD = AD = 10). When $k$ becomes larger, non-zero probability values are distributed over more domain values, and the critical point of $\alpha$ will increase rapidly because tuples are less likely to be similar. Admittedly, this is an issue that deserves further study. We may perform a binary search on $\alpha$, model the 3LPMCP sub-problems for a fixed $\alpha$ without solving them, and examine the sizes of the sub-problems until an acceptable $\alpha$ value is found.
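The binary search suggested above could look roughly like the following sketch. It is one possible reading of the idea, not a procedure given in the paper: the acceptance criterion (largest sub-problem no bigger than `max_size`), the sorted list of candidate α values, and the dictionary `lhs_dist` of pairwise left-hand-side distances are all assumptions made for illustration. The search relies on the observation in the text that sub-problem sizes grow monotonically with α.

```python
def largest_subproblem(n, lhs_dist, alpha):
    """Size of the largest connected component when pairs with distance <= alpha are linked."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for (i, j), d in lhs_dist.items():
        if d <= alpha:
            parent[find(j)] = find(i)

    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values())


def largest_acceptable_alpha(n, lhs_dist, candidates, max_size):
    """Binary search over sorted candidates; returns None if no candidate is acceptable."""
    lo, hi, best = 0, len(candidates) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if largest_subproblem(n, lhs_dist, candidates[mid]) <= max_size:
            best = candidates[mid]   # acceptable, try a larger alpha
            lo = mid + 1
        else:
            hi = mid - 1             # sub-problems too large, try a smaller alpha
    return best
```

Only the sub-problem sizes are examined here; no LP is modeled in full or solved, which is what keeps each probe cheap.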
7 Related Work
The problem of maintaining the consistency of a conventional relational database is well-known [14]. When dealing with uncertain information (data that are missing, unknown, or imprecisely known), treatments based on probability theory, fuzzy sets and possibility theory have been applied to extend classic relational databases. For example, stochastic dependency [5] is the generalization of the concept of functional dependency in probabilistic databases, and fuzzy functional dependencies [1] have been proposed to handle the integrity constraints problem in the context of fuzzy databases. Demetrovics proposed the error-correcting functional dependency [4] in a deterministic database containing erroneous data. However, the mentioned work addresses checking data consistency rather than developing an effective and efficient means to maintain consistency for uncertain data. Levene [9] introduces the notion of imprecise relations and employs FDs to maintain imprecise relations. An imprecise data model caters for relational data obtained from different, equally likely and noise-free sources, which therefore may be imprecise. Lu and Ng [11] use vague sets, which assume that data sources are not equally likely, to address similar issues.
8 Conclusions
We have studied the problem of maintaining consistency of probabilistic relations with respect to acyclic FD sets, which are more general than the canonical form of FDs [7]. We developed an LP-based chase algorithm, called LPChase, which can be employed to maintain consistency by transforming MCP into an LP setting. LPChase has a polynomial running time, and it is efficient in practice if the domain size is not extremely small (say $|D| = 10$) and the $\alpha$ values are chosen below a critical point. The output of LPChase is also effective in the sense that it is the minimally modified input relation, as proved in Corollary 1. As shown in Sect. 6, the time for solving the modeled LP problems becomes negligible compared to the time for constructing them and the I/O time when the physical domain is reasonably larger than the active domain of $r$. Then the running time of LPChase is bounded by $O(n^2 k\,||F||)$.
References
1. Brown, P., Haas, P.J.: BHUNT: automatic discovery of fuzzy algebraic constraints in relational data. In: VLDB, pp. 668–679 (2003)
2. Cormode, G., Li, F., Yi, K.: Semantics of ranking queries for probabilistic data and expected ranks. In: ICDE, pp. 305–316 (2009)
3. Dalvi, N., Suciu, D.: Efficient query evaluation on probabilistic databases. VLDB Journal 16(4), 523–544 (2007)
4. Demetrovics, J., et al.: Functional dependencies distorted by errors. DAM 156(6), 862–869 (2008)
5. Dey, D., Sarkar, S.: Generalized normal forms for probabilistic relational data. TKDE 14(3), 485–497 (2002)
6. Faloutsos, C., Lin, K.I.: FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In: SIGMOD, pp. 163–174 (1995)
7. Greco, S., Molinaro, C.: Approximate probabilistic query answering over inconsistent databases. In: Li, Q., Spaccapietra, S., Yu, E., Olivé, A. (eds.) ER 2008. LNCS, vol. 5231, pp. 311–325. Springer, Heidelberg (2008)
8. Huang, J., et al.: MayBMS: a probabilistic database management system. In: SIGMOD, pp. 1071–1074 (2009)
9. Levene, M.: Maintaining consistency of imprecise relations. Comput. J. 39(2), 114–123 (1996)
10. lp_solve, http://lpsolve.sourceforge.net/5.5/
11. Lu, A., Ng, W.: Maintaining consistency of vague databases using data dependencies. DKE 68(7), 622–651 (2009)
12. Singh, S., et al.: Orion 2.0: native support for uncertain data. In: SIGMOD, pp. 1239–1242 (2008)
13. Spielman, D.A., Teng, S.H.: Smoothed analysis of algorithms: why the simplex algorithm usually takes polynomial time. JACM 51(3), 385–463 (2004)
14. Wijsen, J.: Database repairing using updates. TODS 30(3), 722–768 (2005)
Full Satisfiability of UML Class Diagrams
Alessandro Artale, Diego Calvanese, and Angélica Ibáñez-García
KRDB Research Centre, Free University of Bozen-Bolzano, Italy
{artale,calvanese,ibanezgarcia}@inf.unibz.it
Abstract. UML class diagrams (UCDs) are the de-facto standard formalism for the analysis and design of information systems. By adopting formal language techniques to capture constraints expressed by UCDs one can exploit automated reasoning tools to detect relevant properties, such as schema and class satisfiability and subsumption between classes. Among the reasoning tasks of interest, the basic one is detecting full satisfiability of a diagram, i.e., whether there exists an instantiation of the diagram where all classes and associations of the diagram are non-empty and all the constraints of the diagram are respected. In this paper we establish tight complexity results for full satisfiability for various fragments of UML class diagrams. This investigation shows that the full satisfiability problem is ExpTime-complete in the full scenario, NP-complete if we drop isa between relationships, and NLogSpace-complete if we further drop covering over classes.¹

Keywords: Reasoning over Conceptual Models, Description Logics, Complexity Analysis.
1 Introduction
UML (Unified Modeling Language, http://www.omg.org/spec/UML/) is the de-facto standard formalism for the analysis and design of information systems. One of the most important components of UML are class diagrams (UCDs). UCDs describe the domain of interest in terms of objects organized in classes and associations between them. The semantics of UCDs is by now well established, and several works propose to represent it using various kinds of formal languages, e.g., [2,3,4,5,6,7]. Thus, one can in principle reason on UCDs. The reasoning tasks that one is interested in are, e.g., subsumption between two classes, i.e., the fact that each instance of one class is necessarily also an instance of another class, satisfiability of a specific class (or association) in the diagram, i.e., the fact that the information encoding that class (or association) in the diagram is not contradictory, diagram satisfiability, which requires that at least one class in the diagram is satisfiable, and full satisfiability of the diagram [8,9], i.e., the fact that there exists an instantiation of the diagram where all classes and associations of the diagram are non-empty.

¹ A preliminary and shortened version of this paper was presented at the 2009 Int. Workshop on Logic in Databases (LID 2009), with informal proceedings printed as a technical report [1].
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 317–331, 2010. © Springer-Verlag Berlin Heidelberg 2010
The latter property is of importance since the presence of some unsatisfiable class or association actually means either that the diagram contains unnecessary information that should be removed, or that there is some modeling error that leads to the loss of satisfiability. In this paper, we adopt the well established formalization of UCDs in terms of Description Logics (DLs). DLs [10] are decidable logics that are specifically tailored for capturing various forms of conceptual data models (cf. [11,12,13,14,15,16,5]), and they allow one to exploit state-of-the-art DL reasoners [17] to provide automated reasoning support over such data models. The complexity of reasoning over UCDs has been addressed in [5], where it has been shown that, in the presence of the standard UML/EER constructs, such as isa, disjointness and covering between entities and associations, cardinality constraints (also called participation constraints) for associations, and multiplicity constraints for attributes, checking class satisfiability and schema satisfiability is ExpTime-complete. This result has been strengthened in [6] to UCDs² with simple isa between associations (and both disjointness and completeness constraints on class hierarchies only), where it was also shown that by dropping isa between associations reasoning becomes NP-complete, and by further forbidding completeness in class hierarchies it drops to NLogSpace-complete. The only works that explicitly addressed the complexity of full satisfiability of UCDs are [8,9], which include a classification of UCDs based on inconsistency triggers. Each inconsistency trigger is a pattern for recognizing possible inconsistencies of the diagram based on the interaction between different modelling constraints. [8,9] introduce various algorithms for checking full satisfiability of UCDs with different expressive power, together with an analysis of their computational complexity (i.e., upper bounds are provided).
In particular, checking full satisfiability in the following scenarios is shown to be in:
1. ExpTime, if the standard constructs are used;
2. NP, if isa between associations and multiple and overwriting inheritance of attributes are dropped, i.e., each attribute has a fixed type;
3. P, if diagrams are further restricted by forbidding completeness constraints;
4. PSpace (instead of ExpTime), if standard constructs are used (as in scenario 1) but types for attributes associated to sub-classes are sub-types of types for the respective attributes associated to super-classes;
5. NP and P in the scenarios 2 and 3, respectively, if we further allow for attributes with types restricted as in 4.

The main contributions of this paper can be summarised as follows:
– We show tight complexity results for checking full satisfiability, proving that the problem is ExpTime-complete in the standard scenario 1, NP-complete in the scenario 2 and NLogSpace-complete (instead of P) in the scenario 3;
– We prove that full satisfiability in the scenario 4 is ExpTime-hard, thus showing that the PSpace algorithm presented in [8,9] must be incomplete.

² The results in [6] are formulated in terms of the Entity-Relationship model, but they also carry directly over to UML class diagrams.
Our results build on the formalization of UCDs in terms of DLs given in [5,6]. In fact, our upper bounds for full satisfiability are an almost direct consequence of the corresponding upper bounds of the DL formalization. On the other hand, the obtained lower bounds for full satisfiability are more involved, and in some cases require a careful analysis of the corresponding proof for class satisfiability. The rest of the paper is organized as follows. In Section 2, we briefly introduce the DL ALC, on which we base our results, and show that full satisfiability in ALC is ExpTime-complete. In Section 3, we recall the FOL semantics of UCDs. In Sections 4 and 5, we provide our results on full satisfiability for various variants of UCDs. Finally, in Section 6, we draw some conclusions.
2 Full Satisfiability in the Description Logic ALC
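We start by studying full satisfiability for the DL ALC, one of the basic variants of DLs [10]. The basic elements of ALC are atomic concepts and roles, denoted by $A$ and $P$, respectively. Complex concepts $C$, $D$ are defined as follows:

$C, D ::= A \mid \neg C \mid C \sqcap D \mid \exists P.C$

The semantics of ALC, as usual in DLs, is specified in terms of interpretations. An interpretation $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$, with a non-empty domain $\Delta^{\mathcal{I}}$ and an interpretation function $\cdot^{\mathcal{I}}$, assigns to each concept $C$ a subset $C^{\mathcal{I}}$ of $\Delta^{\mathcal{I}}$, and to each role name $P$ a binary relation $P^{\mathcal{I}}$ in $\Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$ such that the following conditions are satisfied:

$A^{\mathcal{I}} \subseteq \Delta^{\mathcal{I}}$,
$(\neg C)^{\mathcal{I}} = \Delta^{\mathcal{I}} \setminus C^{\mathcal{I}}$,
$(C \sqcap D)^{\mathcal{I}} = C^{\mathcal{I}} \cap D^{\mathcal{I}}$,
$(\exists P.C)^{\mathcal{I}} = \{a \in \Delta^{\mathcal{I}} \mid \exists b.\ (a, b) \in P^{\mathcal{I}} \land b \in C^{\mathcal{I}}\}$.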
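The four semantic conditions above translate directly into a recursive evaluator over a finite interpretation. The following Python sketch is only illustrative: the function name `extension`, the encoding of concepts as nested tuples, and the dictionaries `concepts` and `roles` holding the extensions of atomic concepts and roles are assumptions, not notation from the paper.

```python
def extension(concept, domain, concepts, roles):
    """Compute the extension of an ALC concept in a finite interpretation.

    concept  : atomic name (str), ("not", C), ("and", C, D) or ("some", P, C)
    domain   : set of domain elements
    concepts : dict mapping atomic concept names to subsets of the domain
    roles    : dict mapping role names to sets of pairs over the domain
    """
    if isinstance(concept, str):
        return set(concepts.get(concept, set()))
    op = concept[0]
    if op == "not":
        return set(domain) - extension(concept[1], domain, concepts, roles)
    if op == "and":
        return (extension(concept[1], domain, concepts, roles)
                & extension(concept[2], domain, concepts, roles))
    if op == "some":
        filler = extension(concept[2], domain, concepts, roles)
        return {a for (a, b) in roles.get(concept[1], set()) if b in filler}
    raise ValueError(f"unknown constructor: {op!r}")
```

For instance, with a domain {1, 2, 3}, an atomic concept B interpreted as {2} and a role P interpreted as {(1, 2), (3, 3)}, the concept encoded as `("some", "P", "B")` evaluates to {1}.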
We use the standard abbreviations $C_1 \sqcup C_2 := \neg(\neg C_1 \sqcap \neg C_2)$ and $\forall P.C := \neg\exists P.\neg C$, with the corresponding semantics. An ALC terminological box (TBox) $\mathcal{T}$ is a finite set of (concept inclusion) assertions of the form $C \sqsubseteq D$. An interpretation $\mathcal{I}$ satisfies an assertion of the form $C \sqsubseteq D$ if and only if $C^{\mathcal{I}} \subseteq D^{\mathcal{I}}$. A TBox $\mathcal{T}$ is satisfiable if there is an interpretation $\mathcal{I}$, called a model of $\mathcal{T}$, that satisfies every assertion in $\mathcal{T}$. A concept $C$ is satisfiable w.r.t. a TBox $\mathcal{T}$ if there is a model $\mathcal{I}$ of $\mathcal{T}$ such that $C^{\mathcal{I}} \ne \emptyset$. It can be shown that TBox satisfiability and concept satisfiability w.r.t. a TBox are reducible to each other in polynomial time. Moreover, reasoning w.r.t. ALC TBoxes is ExpTime-complete (see e.g., [10]). We now define the notion of full satisfiability of a TBox and show that for ALC it has the same complexity as classical satisfiability.

Definition 1 (TBox Full Satisfiability). An ALC TBox $\mathcal{T}$ is said to be fully satisfiable if there exists a model $\mathcal{I}$ of $\mathcal{T}$ such that $A^{\mathcal{I}} \ne \emptyset$ for every atomic concept $A$ in $\mathcal{T}$. We say that $\mathcal{I}$ is a full model of $\mathcal{T}$.

We first prove that full satisfiability in ALC is ExpTime-hard.
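Lemma 2. Concept satisfiability w.r.t. ALC TBoxes can be linearly reduced to full satisfiability of ALC TBoxes.

Proof. Let $\mathcal{T}$ be an ALC TBox and $C$ an ALC concept. As pointed out in [18], $C$ is satisfiable w.r.t. $\mathcal{T}$ if and only if $C \sqcap A_{\mathcal{T}}$ is satisfiable w.r.t. the TBox $\mathcal{T}_1$ consisting of the single assertion

$A_{\mathcal{T}} \sqsubseteq \bigsqcap_{C_1 \sqsubseteq C_2 \in \mathcal{T}} (\neg C_1 \sqcup C_2) \sqcap \bigsqcap_{1 \le i \le n} \forall P_i.A_{\mathcal{T}}$,

where $A_{\mathcal{T}}$ is a fresh atomic concept and $P_1, \ldots, P_n$ are all the atomic roles in $\mathcal{T}$ and $C$. In order to reduce the latter problem to full satisfiability, we extend $\mathcal{T}_1$ to $\mathcal{T}_2 = \mathcal{T}_1 \cup \{A_C \sqsubseteq C \sqcap A_{\mathcal{T}}\}$, with $A_C$ a fresh atomic concept, and prove that $C \sqcap A_{\mathcal{T}}$ is satisfiable w.r.t. $\mathcal{T}_1$ if and only if $\mathcal{T}_2$ is fully satisfiable.

(⇒) Let $\mathcal{I}$ be a model of $\mathcal{T}_1$ such that $(C \sqcap A_{\mathcal{T}})^{\mathcal{I}} \ne \emptyset$. We construct an interpretation of $\mathcal{T}_2$, $\mathcal{J} = (\Delta^{\mathcal{I}} \cup \{d_{top}\}, \cdot^{\mathcal{J}})$, with $d_{top} \notin \Delta^{\mathcal{I}}$, such that:

$A_{\mathcal{T}}^{\mathcal{J}} = A_{\mathcal{T}}^{\mathcal{I}}$, $A_C^{\mathcal{J}} = (C \sqcap A_{\mathcal{T}})^{\mathcal{I}}$,
$A^{\mathcal{J}} = A^{\mathcal{I}} \cup \{d_{top}\}$, for each atomic concept $A$ in $\mathcal{T}$ and $C$,
$P^{\mathcal{J}} = P^{\mathcal{I}}$, for each atomic role $P$ in $\mathcal{T}$ and $C$.

Obviously, the extension of every atomic concept is non-empty in $\mathcal{J}$. Next, we show that $\mathcal{J}$ is a model of $\mathcal{T}_2$, by relying on the fact (easily proved by structural induction) that $D^{\mathcal{I}} \subseteq D^{\mathcal{J}}$ for each subconcept $D$ of concepts in $\mathcal{T}_1$ or of $C$. Then, it is easy to show that $\mathcal{J}$ satisfies the two assertions in $\mathcal{T}_2$.

(⇐) Conversely, every full model $\mathcal{J}$ of $\mathcal{T}_2$ is also a model of $\mathcal{T}_1$ with $(C \sqcap A_{\mathcal{T}})^{\mathcal{J}} \ne \emptyset$, as $A_C^{\mathcal{J}} \subseteq (C \sqcap A_{\mathcal{T}})^{\mathcal{J}}$. □

Theorem 3. Full satisfiability of ALC TBoxes is ExpTime-complete.

Proof. The ExpTime membership is straightforward, since full satisfiability of an ALC TBox $\mathcal{T}$ can be reduced to satisfiability of the TBox $\mathcal{T} \cup \bigcup_{1 \le i \le n} \{\top \sqsubseteq \exists P'.A_i\}$, where $A_1, \ldots, A_n$ are all the atomic concepts in $\mathcal{T}$, and $P'$ is a fresh atomic role. The ExpTime-hardness follows from Lemma 2. □

We now modify the reduction of Lemma 2 so that it applies also to primitive ALC⁻ TBoxes, i.e., TBoxes that contain only assertions of the form:

$A \sqsubseteq B$, $A \sqsubseteq \neg B$, $A \sqsubseteq B \sqcup B'$, $A \sqsubseteq \forall P.B$, $A \sqsubseteq \exists P.B$,

where $A$, $B$, $B'$ are atomic concepts, and $P$ is an atomic role.

Theorem 4. Full satisfiability of primitive ALC⁻ TBoxes is ExpTime-complete.

Proof. The ExpTime membership follows from Theorem 3. For proving the ExpTime-hardness, we use a result in [5] showing that concept satisfiability in ALC can be reduced to atomic concept satisfiability w.r.t. primitive ALC⁻ TBoxes. Let $\mathcal{T}^- = \{A_j \sqsubseteq D_j \mid 1 \le j \le m\}$ be a primitive ALC⁻ TBox, and $A_0$ an atomic concept. By the proof of Lemma 2, we have that $A_0$ is satisfiable w.r.t. $\mathcal{T}^-$ if and only if the TBox $\mathcal{T}_2'$ consisting of the assertions

$A_0' \sqsubseteq A_0 \sqcap A_{\mathcal{T}^-}$,
$A_{\mathcal{T}^-} \sqsubseteq \bigsqcap_{A_j \sqsubseteq D_j \in \mathcal{T}^-} (\neg A_j \sqcup D_j) \sqcap \bigsqcap_{1 \le i \le n} \forall P_i.A_{\mathcal{T}^-}$

is fully satisfiable, with $A_{\mathcal{T}^-}$, $A_0'$ fresh atomic concepts. $\mathcal{T}_2'$ is not a primitive ALC⁻ TBox, but it is equivalent to the TBox containing the assertions:

$A_0' \sqsubseteq A_{\mathcal{T}^-}$, $A_0' \sqsubseteq A_0$,
$A_{\mathcal{T}^-} \sqsubseteq \neg A_1 \sqcup D_1$, ..., $A_{\mathcal{T}^-} \sqsubseteq \neg A_m \sqcup D_m$,
$A_{\mathcal{T}^-} \sqsubseteq \forall P_1.A_{\mathcal{T}^-}$, ..., $A_{\mathcal{T}^-} \sqsubseteq \forall P_n.A_{\mathcal{T}^-}$.

Finally, to get a primitive ALC⁻ TBox, $\mathcal{T}_2^-$, we replace each assertion of the form $A_{\mathcal{T}^-} \sqsubseteq \neg A_j \sqcup D_j$ by $A_{\mathcal{T}^-} \sqsubseteq B_j^1 \sqcup B_j^2$, $B_j^1 \sqsubseteq \neg A_j$, and $B_j^2 \sqsubseteq D_j$, with $B_j^1$ and $B_j^2$ fresh atomic concepts, for $j \in \{1, \ldots, m\}$. We show now that $\mathcal{T}_2'$ is fully satisfiable iff $\mathcal{T}_2^-$ is fully satisfiable:

(⇒) Let $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$ be a full model of $\mathcal{T}_2'$. We extend $\mathcal{I}$ to an interpretation $\mathcal{J}$ of $\mathcal{T}_2^-$. Let $\Delta^{\mathcal{J}} = \Delta^{\mathcal{I}} \cup \{d^+, d^-\}$, with $\{d^+, d^-\} \cap \Delta^{\mathcal{I}} = \emptyset$, and define $\cdot^{\mathcal{J}}$ as follows:

$A_{\mathcal{T}^-}^{\mathcal{J}} = A_{\mathcal{T}^-}^{\mathcal{I}}$, $(A_0')^{\mathcal{J}} = (A_0')^{\mathcal{I}}$,
$A^{\mathcal{J}} = A^{\mathcal{I}} \cup \{d^+\}$, for every other atomic concept $A$ in $\mathcal{T}_2'$,
$(B_j^1)^{\mathcal{J}} = (\neg A_j)^{\mathcal{J}}$ and $(B_j^2)^{\mathcal{J}} = D_j^{\mathcal{J}}$, for each $A_{\mathcal{T}^-} \sqsubseteq B_j^1 \sqcup B_j^2 \in \mathcal{T}_2^-$,
$P^{\mathcal{J}} = P^{\mathcal{I}} \cup \{(d^+, d^+)\}$, for each atomic role $P$ in $\mathcal{T}_2^-$.

It is easy to see that $\mathcal{J}$ is a full model of $\mathcal{T}_2^-$.

(⇐) Trivial, since every model of $\mathcal{T}_2^-$ is a model of $\mathcal{T}_2'$. □

3 Formalizing UML Class Diagrams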
In this section, we briefly describe UCDs and provide their semantics in terms of First Order Logic (the formalization adopted here is based on previous presentations in [5,15]). A class in UCDs denotes a set of objects with common features. Formally, a class $C$ corresponds to a unary predicate $C$. An n-ary association (also called relationship) in UCDs represents a relation between instances of $n \ge 2$ classes. Names of associations (as names of classes) are unique in a UCD. A binary association between two classes $C_1$ and $C_2$ is graphically rendered as in Fig. 1. The multiplicity constraint $n_l..n_u$ (also called participation constraint) written on one end of the binary association specifies that each instance of the class $C_1$ participates at least $n_l$ times and at most $n_u$ times in the association $R$, and the multiplicity constraint $m_l..m_u$ specifies an analogous constraint for each instance of the class $C_2$. When a multiplicity constraint is omitted, it is intended to be $0..*$. Formally, an association $R$ between the classes $C_1$ and $C_2$ is captured by a binary predicate $R$ that satisfies the FOL axiom $\forall x_1, x_2.\ (R(x_1, x_2) \to C_1(x_1) \land C_2(x_2))$, while multiplicities are formalized by the following FOL assertions:

$\forall x.\ (C_1(x) \to \exists_{\ge n_l} y.\ R(x, y) \land \exists_{\le n_u} y.\ R(x, y))$,
$\forall y.\ (C_2(y) \to \exists_{\ge m_l} x.\ R(x, y) \land \exists_{\le m_u} x.\ R(x, y))$,

where we use counting quantifiers to abbreviate the FOL formulas encoding the multiplicity constraints.

Fig. 1. Binary association

Fig. 2. Binary association with related class $C_R$

Fig. 3. A class hierarchy in UML

A more general form of multiplicity is the so-called refinement of multiplicity constraints for sub-classes participating in associations. With such a construct we are able to change (and thus refine) the multiplicity constraints for sub-classes. Refinement involving a binary association $R$ between classes $C_1$ and $C_2$, and a sub-class of $C_1$, say $C_1'$, can be formalized with the following FOL axioms:

$\forall x.\ (C_1'(x) \to C_1(x))$,
$\forall x.\ (C_1'(x) \to \exists_{\ge n_l'} y.\ R(x, y) \land \exists_{\le n_u'} y.\ R(x, y))$.
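On a finite interpretation, the typing axiom and the two multiplicity axioms for a binary association can be checked by simple counting. The sketch below is illustrative only: `satisfies_association` and its parameters are hypothetical names, the association is given as a set of pairs, and an upper bound of `None` is assumed to stand for the unbounded multiplicity `*`.

```python
def satisfies_association(R, C1, C2, n_l=0, n_u=None, m_l=0, m_u=None):
    """Check the FOL axioms for a binary association R between C1 and C2.

    R        : set of pairs (x, y)
    C1, C2   : sets of instances of the two classes
    n_l, n_u : multiplicity bounds for instances of C1 (None means '*')
    m_l, m_u : multiplicity bounds for instances of C2 (None means '*')
    """
    # Typing axiom: R(x1, x2) implies C1(x1) and C2(x2).
    if any(x not in C1 or y not in C2 for x, y in R):
        return False

    def within(count, low, high):
        return count >= low and (high is None or count <= high)

    # Each instance of C1 participates between n_l and n_u times.
    for x in C1:
        if not within(sum(1 for a, _ in R if a == x), n_l, n_u):
            return False
    # Each instance of C2 participates between m_l and m_u times.
    for y in C2:
        if not within(sum(1 for _, b in R if b == y), m_l, m_u):
            return False
    return True
```

A refinement for a sub-class can be checked in the same way by calling the function with the sub-class extension and the refined bounds, after verifying that the sub-class is contained in its super-class.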
An association class describes properties of the association, such as attributes, operations, etc. (see Fig. 2). A binary association between classes $C_1$ and $C_2$ with a related association class $C_R$ is formalized in FOL by reifying the association into a unary predicate $C_R$ with two binary predicates $P_1$, $P_2$, one for each component of the association. We enforce the following semantics, for $i \in \{1, 2\}$:

$\forall x.\ (C_R(x) \to \exists y.\ P_i(x, y))$,
$\forall x, y.\ (C_R(x) \land P_i(x, y) \to C_i(y))$,
$\forall x, y, y'.\ (C_R(x) \land P_i(x, y) \land P_i(x, y') \to y = y')$,
$\forall y_1, y_2, x, x'.\ (C_R(x) \land C_R(x') \land (\bigwedge_{i \in \{1,2\}} P_i(x, y_i) \land P_i(x', y_i)) \to x = x')$.

For associations with a related class, the multiplicity constraints are formalized by the following FOL assertions:

$\forall y_1.\ (C_1(y_1) \to \exists_{\ge n_l} x.\ (C_R(x) \land P_1(x, y_1)) \land \exists_{\le n_u} x.\ (C_R(x) \land P_1(x, y_1)))$,
$\forall y_2.\ (C_2(y_2) \to \exists_{\ge m_l} x.\ (C_R(x) \land P_2(x, y_2)) \land \exists_{\le m_u} x.\ (C_R(x) \land P_2(x, y_2)))$.

Classes can have attributes, formalized similarly to binary associations, relating the class with values of a given type. As for associations, we can specify multiplicity constraints over attributes. A generalization (also called ISA constraint) between two classes $C_1$ and $C$, formalized as $\forall x.\ C_1(x) \to C(x)$, specifies that each instance of $C_1$ is also an instance of $C$. Several generalizations can be grouped together to form a class hierarchy, as shown in Fig. 3. Such a hierarchy is formally captured by means of the FOL axioms $\forall x.\ C_i(x) \to C(x)$ for $i \in \{1, \ldots, n\}$. Disjointness and completeness constraints can also be enforced on a class hierarchy, by adding suitable labels to the diagram. Disjointness among the classes $C_1, \ldots, C_n$ is expressed by $\forall x.\ C_i(x) \to \bigwedge_{j=i+1}^{n} \neg C_j(x)$, for $i \in \{1, \ldots, n-1\}$. The completeness constraint, expressing that each instance of $C$ is an instance of at least one of $C_1, \ldots, C_n$, is captured by $\forall x.\ C(x) \to \bigvee_{i=1}^{n} C_i(x)$. We can also have generalization, disjointness and completeness constraints between associations and between association classes, with the obvious semantics.

In this paper, we denote with UCD_full the class diagram language that comprises all the standard constructs as discussed above (i.e., what we called scenario 1 in Section 1). With UCD_bool we denote the language without generalization between associations (i.e., scenario 2 in Section 1), and with UCD_ref we further drop completeness constraints over classes (i.e., scenario 3 in Section 1). The constructors allowed in these languages are summarized in Table 1, together with the tight complexity results obtained in this paper.

Table 1. Complexity of Full Satisfiability in UML (sub)languages

Language  | Classes                  | Associations/Attributes        | Complexity of
          | isa  disjoint  complete  | isa  multiplicity  refinement  | Full Satisfiability
UCD_full  | ✓    ✓         ✓         | ✓    ✓             ✓           | ExpTime [Th.7]
UCD_bool  | ✓    ✓         ✓         | ✗    ✓             ✓           | NP [Th.9]
UCD_ref   | ✓    ✓         ✗         | ✗    ✓             ✓           | NLogSpace [Th.11]
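The three kinds of class-hierarchy axioms (ISA, disjointness, completeness) can likewise be checked on finite extensions with plain set operations. In this sketch the function name `check_hierarchy` and the representation of class extensions as Python sets are assumptions made for illustration.

```python
def check_hierarchy(parent, children, disjoint=False, complete=False):
    """Check the FOL axioms of a class hierarchy on finite extensions.

    parent   : set of instances of the super-class C
    children : list of sets of instances of the sub-classes C_1, ..., C_n
    disjoint : also require pairwise disjointness of the sub-classes
    complete : also require that the sub-classes cover the super-class
    """
    # ISA: every instance of each C_i is an instance of C.
    if any(not child <= parent for child in children):
        return False
    # Disjointness: no instance of C_i is an instance of C_j for j > i.
    if disjoint:
        for i, c_i in enumerate(children):
            for c_j in children[i + 1:]:
                if c_i & c_j:
                    return False
    # Completeness: every instance of C is an instance of some C_i.
    if complete and not parent <= set().union(*children):
        return False
    return True
```

Dropping the `complete` flag corresponds to the restriction that separates UCD_ref from UCD_bool in Table 1.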
4 Full Satisfiability of UML Class Diagrams
Three notions of UCD satisfiability have been proposed in the literature [19,5,6,20,9]. First, diagram satisfiability refers to the existence of a model, i.e., a FOL interpretation that satisfies all the FOL assertions associated to the diagram and where at least one class has a nonempty extension. Second, class satisfiability refers to the existence of a model of the diagram where the given class has a nonempty extension. Third, we can check whether there is a model of a UML diagram that satisfies all classes and all relationships in the diagram. This last notion of satisfiability, referred to here as full satisfiability and introduced in [8,9], is thus stronger than diagram satisfiability, since a model of a diagram that satisfies all classes is, by definition, also a model of that diagram.

Definition 5 (UML Full Satisfiability). A UCD, D, is fully satisfiable if there is a FOL interpretation, I, that satisfies all the constraints expressed in D and such that $C^I \neq \emptyset$ for every class C in D, and $R^I \neq \emptyset$ for every association R in D. We say that I is a full model of D.
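The difference between the three notions is only in which extensions must be nonempty. As a rough illustration, the sketch below assumes an interpretation that already satisfies the constraints of the diagram and is given as a dictionary of extensions; all names are hypothetical.

```python
def witnesses_diagram_satisfiability(ext, classes):
    """At least one class has a nonempty extension."""
    return any(ext.get(c) for c in classes)

def witnesses_class_satisfiability(ext, cls):
    """The given class has a nonempty extension."""
    return bool(ext.get(cls))

def witnesses_full_satisfiability(ext, classes, associations):
    """Every class and every association has a nonempty extension."""
    return all(ext.get(n) for n in list(classes) + list(associations))

classes, associations = ["A", "B"], ["R"]
partial = {"A": {1}, "B": set(), "R": set()}
full = {"A": {1}, "B": {2}, "R": {(1, 2)}}
```

Here `partial` witnesses diagram satisfiability and the satisfiability of class A, but only `full` is a full model in the sense of Definition 5.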
Fig. 4. Encoding of A ⊑ ¬B
Fig. 5. Encoding of A ⊑ B1 ⊔ B2
Fig. 6. Encoding of A ⊑ ∀P.B
Fig. 7. Encoding of A ⊑ ∃P.B
We now address the complexity of full satisfiability for UCDs with the standard set of constructs, i.e., UCDfull. For the lower bounds, we use the results presented in Section 2 and reduce full satisfiability of primitive ALC⁻ TBoxes to full satisfiability of UCDfull. This reduction is based on the ones used in [5,6] for the lower complexity bound of schema satisfiability in the extended Entity-Relationship model, but the proof of their correctness is more involved here.

Given a primitive ALC⁻ TBox T, construct a UCDfull diagram Σ(T) as follows: for each atomic concept A in T, introduce a class A in Σ(T). Additionally, introduce a class O that generalizes (possibly indirectly) all the classes in Σ(T) that encode an atomic concept in T. For each atomic role P, introduce a class $C_P$, which reifies the binary relation P. Further, introduce two functional associations $P_1$ and $P_2$ that represent, respectively, the first and second component of P. The assertions in T are encoded as follows:

– For each assertion of the form A ⊑ B, introduce a generalization between the classes A and B.
– For each assertion of the form A ⊑ ¬B, construct the hierarchy in Fig. 4.
– For each assertion of the form A ⊑ B1 ⊔ B2, introduce an auxiliary class B, and construct the diagram shown in Fig. 5.
– For each assertion of the form A ⊑ ∀P.B, add the auxiliary classes $C_{P_{AB}}$, $\overline{C}_{P_{AB}}$, $A_{P_B}$, and $\bar{A}_{P_B}$, and the associations $P_{AB1}$, $P_{\bar{A}B1}$, and $P_{AB2}$, and construct the diagram shown in Fig. 6.
– For each assertion of the form A ⊑ ∃P.B, add the auxiliary class $C_{P_{AB}}$ and the associations $P_{AB1}$ and $P_{AB2}$, and construct the diagram shown in Fig. 7.

Notice that Σ(T) is a UCD in UCDfull.

Lemma 6. A primitive ALC⁻ TBox T is fully satisfiable if and only if the UCD Σ(T), constructed as above, is fully satisfiable.

Proof. "⇐" Let $J = (\Delta^J, \cdot^J)$ be a full model of Σ(T). We construct a full model $I = (\Delta^I, \cdot^I)$ of T by taking $\Delta^I = \Delta^J$. Further, for every concept name A and for every atomic role P in T, we define respectively $A^I = A^J$ and $P^I = (P_1^-)^J \circ P_2^J$ ($r_1 \circ r_2$ denotes the composition of two binary relations $r_1$ and $r_2$). Let us show that I satisfies every assertion in T.

– For assertions of the form A ⊑ B, A ⊑ ¬B, and A ⊑ B1 ⊔ B2, the statement easily follows from the construction of I.
– For assertions of the form A ⊑ ∀P.B and A ⊑ ∃P.B, the proof uses arguments similar to those in the proof of Lemma 1 in [6].

"⇒" Let $I = (\Delta^I, \cdot^I)$ be a full model of T, and let role(T) be the set of role names in T. We extend I to an instantiation $J = (\Delta^J, \cdot^J)$ of Σ(T), by assigning suitable extensions to the auxiliary classes and associations in Σ(T). Let $\Delta^J = \Delta^I \cup \Gamma \cup \Lambda$, where:

$\Lambda = \bigcup_{A \sqsubseteq \forall P.B \in T} \{a_{A_{P_B}}, a_{\bar{A}_{P_B}}\}$, such that $\Delta^I \cap \Lambda = \emptyset$, and
$\Gamma = \bigcup_{P \in role(T)} \Delta_P$, with $\Delta_P = P^I \cup \bigcup_{A \sqsubseteq \forall P.B \in T} \{(a_{A_{P_B}}, b), (a_{\bar{A}_{P_B}}, \bar{o})\}$,

where b is an arbitrary instance of B, and $\bar{o}$ an arbitrary element of $\Delta^I$. We set $O^J = \Delta^I \cup \Lambda$, $A^J = A^I$ for each class A corresponding to an atomic concept in T, and $C_P^J = \Delta_P$ for each $P \in role(T)$. Additionally, the extensions of the associations $P_1$ and $P_2$ are defined as follows:

$P_1^J = \{((o_1, o_2), o_1) \mid (o_1, o_2) \in C_P^J\}$,
$P_2^J = \{((o_1, o_2), o_2) \mid (o_1, o_2) \in C_P^J\}$.

We now show that J is a full model of Σ(T).

– For the portions of Σ(T) due to TBox assertions of the form A ⊑ B, A ⊑ ¬B, and A ⊑ B1 ⊔ B2, the statement follows from the construction of J.
– For each TBox assertion in T of the form A ⊑ ∀P.B, let us define the extensions for the auxiliary classes and associations as follows:

$A_{P_B}^J = A^I \cup \{a_{A_{P_B}}\}$,
$\bar{A}_{P_B}^J = O^J \setminus A_{P_B}^J$,
$C_{P_{AB}}^J = \{(o, o') \in C_P^J \mid o \in A_{P_B}^J\}$,
$\overline{C}_{P_{AB}}^J = \{(o, o') \in C_P^J \mid o \in \bar{A}_{P_B}^J\}$,
$P_{AB1}^J = \{((o, o'), o) \in P_1^J \mid o \in A_{P_B}^J\}$,
$P_{\bar{A}B1}^J = \{((o, o'), o) \in P_1^J \mid o \in \bar{A}_{P_B}^J\}$,
$P_{AB2}^J = \{((o, o'), o') \in P_2^J \mid o \in A_{P_B}^J\}$.

It is not difficult to see that J satisfies the fragment of Σ(T) as shown in Fig. 6. It remains to show that each class and each association has a non-empty extension. This is clearly the case for classes that encode atomic concepts in T. For the classes $A_{P_B}$, $\bar{A}_{P_B}$, $C_{P_{AB}}$, and $\overline{C}_{P_{AB}}$ we have that

$a_{A_{P_B}} \in A_{P_B}^J$, $a_{\bar{A}_{P_B}} \in \bar{A}_{P_B}^J$, $(a_{A_{P_B}}, b) \in C_{P_{AB}}^J$, $(a_{\bar{A}_{P_B}}, \bar{o}) \in \overline{C}_{P_{AB}}^J$.

For the associations $P_1$, $P_2$, $P_{AB1}$, $P_{AB2}$, and $P_{\bar{A}B1}$ we have that

$((a_{A_{P_B}}, b), a_{A_{P_B}}) \in P_{AB1}^J \subseteq P_1^J$, $((a_{\bar{A}_{P_B}}, \bar{o}), a_{\bar{A}_{P_B}}) \in P_{\bar{A}B1}^J$, $((a_{A_{P_B}}, b), b) \in P_{AB2}^J \subseteq P_2^J$.

– For each TBox assertion in T of the form A ⊑ ∃P.B, let us define:

$C_{P_{AB}}^J = \{(o, o') \in C_P^J \mid o \in A^I \text{ and } o' \in B^I\}$,
$P_{AB1}^J = \{((o, o'), o) \in P_1^J \mid (o, o') \in C_{P_{AB}}^J\}$,
$P_{AB2}^J = \{((o, o'), o') \in P_2^J \mid (o, o') \in C_{P_{AB}}^J\}$.

We have that $C_{P_{AB}}^J \neq \emptyset$ as there exists a pair $(a, b) \in \Delta_P$ with $a \in A^I$ and $b \in B^I$. Since $C_{P_{AB}}^J \neq \emptyset$, we have that $P_{AB1}^J \neq \emptyset$ and $P_{AB2}^J \neq \emptyset$. □

Fig. 8. Reducing UML full satisfiability to class satisfiability
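The reification used throughout the proof of Lemma 6 can be made concrete on finite relations. The sketch below is an illustration under simple assumptions (relations as Python sets of pairs, hypothetical function names): it builds the reified class with its two projections, and recovers the role as the composition of the inverse of the first projection with the second.

```python
def reify(relation):
    """Reify a binary relation: one object per pair, plus two projections."""
    c_p = set(relation)
    p1 = {(pair, pair[0]) for pair in c_p}  # first component of each pair
    p2 = {(pair, pair[1]) for pair in c_p}  # second component of each pair
    return c_p, p1, p2

def recover(p1, p2):
    """Compose the inverse of p1 with p2: all (a, b) linked by a common object."""
    return {(a, b) for (o, a) in p1 for (o2, b) in p2 if o == o2}

def is_functional(assoc, domain):
    """Every element of `domain` has exactly one image under `assoc`."""
    return all(sum(1 for (o, _) in assoc if o == x) == 1 for x in domain)

role = {(1, 2), (1, 3), (2, 2)}
c_p, p1, p2 = reify(role)
```

Reifying and then composing returns the original relation, which is why a nonempty reified class yields a nonempty role and vice versa.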
Theorem 7. Full satisfiability of UCDfull diagrams is ExpTime-complete.

Proof. We establish the upper bound by a reduction to class satisfiability in UCDs, which is known to be ExpTime-complete [5]. Given a UCD D, with classes $C_1, \dots, C_n$, we construct a UCD D′ by adding to D a new class $C_\top$ and new associations $R_i$, for $i \in \{1, \dots, n\}$, as shown in the left part of Fig. 8. Furthermore, to check that every association is populated we use reification, i.e., we replace each association P in the diagram D between the classes $C_i$ and $C_j$ (such that neither $C_i$ nor $C_j$ is constrained to participate at least once in P) with a class $C_P$ and two functional associations $P_1$ and $P_2$ to represent each component of P. Finally, we add the constraints shown in the right part of Fig. 8. Intuitively, if there is a model I of the extended diagram D′ in which $C_\top^I \neq \emptyset$, then the multiplicity constraint 1..∗ on the association $R_P$ forces the existence of at least one instance o of $C_P$. By the functionality of $P_1$ and $P_2$ there are elements $o_i$ and $o_j$ such that $o_i \in C_i^I$, $o_j \in C_j^I$, $(o, o_i) \in P_1^I$ and $(o, o_j) \in P_2^I$. Then, one instance of P can be the pair $(o_i, o_j)$. Conversely, if there is a full model J of D, it is easy to extend it to a model I of D′ that satisfies $C_\top$. The ExpTime-hardness follows from Lemma 6 and Theorem 4. □

Note that the proof of the above theorem does not involve attributes. Thus, the ExpTime complexity result is valid for both scenarios 1 and 4 in Section 1.
5 Full Satisfiability of Restricted UML Class Diagrams
In this section, we investigate the complexity of the full satisfiability problem for the two sub-languages UCDbool and UCDref defined in Section 3. By building on the techniques used for the satisfiability proofs in [6], we show that also in this case checking for full satisfiability does not change the complexity of the problem.

We consider first UCDbool diagrams, by showing that deciding full satisfiability is NP-complete. For the lower bound, we provide a polynomial reduction of the 3sat problem (which is known to be NP-complete) to full satisfiability of UCDbool CDs. Let an instance of 3sat be given by a set $\phi = \{c_1, \dots, c_m\}$ of 3-clauses over a finite set Π of propositional variables. Each clause is such that $c_i = \ell_i^1 \lor \ell_i^2 \lor \ell_i^3$, for $i \in \{1, \dots, m\}$, where each $\ell_i^k$ is a literal, i.e., a variable or its negation. We construct a UCDbool diagram $D_\phi$ as follows: $D_\phi$ contains the classes $C_\phi$, $C_\top$, one class $C_i$ for each clause $c_i \in \phi$, and two classes $C_p$ and $C_{\lnot p}$ for each variable $p \in \Pi$. To describe the constraints imposed by $D_\phi$, we provide the corresponding DL inclusion assertions, since they are more compact to write than a UCD. For every $i \in \{1, \dots, m\}$, $j \in \{1, 2, 3\}$, and $p \in \Pi$, we have the assertions

$C_\phi \sqsubseteq C_\top$, $C_i \sqsubseteq C_\top$, $C_{\ell_i^j} \sqsubseteq C_i$,
$C_p \sqsubseteq C_\top$, $C_\phi \sqsubseteq C_i$, $C_i \sqsubseteq C_{\ell_i^1} \sqcup C_{\ell_i^2} \sqcup C_{\ell_i^3}$,
$C_{\lnot p} \sqsubseteq C_\top$, $C_\top \sqsubseteq C_p \sqcup C_{\lnot p}$, $C_{\lnot p} \sqsubseteq \lnot C_p$.
Clearly, the size of $D_\phi$ is polynomial in the size of φ.

Lemma 8. A set φ of 3-clauses is satisfiable if and only if the UCDbool class diagram $D_\phi$, constructed as above, is fully satisfiable.

Proof. "⇒" Let $J \models \phi$. Define an interpretation $I = (\{0, 1\}, \cdot^I)$, with

$C_\top^I = \{0, 1\}$,
$C_\ell^I = \{1\}$ if $J \models \ell$, and $C_\ell^I = \{0\}$ otherwise,
$C_i^I = C_{\ell_i^1}^I \cup C_{\ell_i^2}^I \cup C_{\ell_i^3}^I$, for $c_i = \ell_i^1 \lor \ell_i^2 \lor \ell_i^3$,
$C_\phi^I = C_1^I \cap \dots \cap C_m^I$.

Clearly, $C^I \neq \emptyset$ for every class C representing a clause or a literal, and for $C = C_\top$. Moreover, as at least one literal $\ell_i^j$ in each clause is such that $J \models \ell_i^j$, then $1 \in C_i^I$ for every $i \in \{1, \dots, m\}$, and therefore $1 \in C_\phi^I$. It is straightforward to check that I satisfies $D_\phi$.

"⇐" Let $I = (\Delta^I, \cdot^I)$ be a full model of $D_\phi$. We construct a model J of φ by taking an element $o \in C_\phi^I$, and setting, for every variable $p \in \Pi$, $J \models p$ if and only if $o \in C_p^I$. Let us show that $J \models \phi$. Indeed, for each $i \in \{1, \dots, m\}$, since $o \in C_\phi^I$ and by the generalization $C_\phi \sqsubseteq C_i$, we have that $o \in C_i^I$, and by the completeness constraint $C_i \sqsubseteq C_{\ell_i^1} \sqcup C_{\ell_i^2} \sqcup C_{\ell_i^3}$, there is some $j_i \in \{1, 2, 3\}$ such that $o \in C_{\ell_i^{j_i}}^I$. If $\ell_i^{j_i}$ is a variable, then $J \models \ell_i^{j_i}$ by construction, and thus $J \models c_i$. Otherwise, if $\ell_i^{j_i} = \lnot p$ for some variable p, then, by the disjointness constraint $C_{\lnot p} \sqsubseteq \lnot C_p$, we have that $o \notin C_p^I$. Thus, $J \models \lnot p$, and therefore, $J \models c_i$. □
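Both directions of the proof of Lemma 8 can be replayed on small instances. The sketch below is only an illustration: it assumes DIMACS-style integer literals (a positive integer for a variable, its negation for the negated variable) and at least one clause, and all function and key names are hypothetical.

```python
def interpretation_from_assignment(clauses, assignment):
    """Build the interpretation over {0, 1} from the '⇒' direction."""
    def holds(lit):
        return assignment[abs(lit)] == (lit > 0)

    ext = {"top": {0, 1}}
    for p in {abs(l) for c in clauses for l in c}:
        for lit in (p, -p):
            ext[("lit", lit)] = {1} if holds(lit) else {0}
    for i, c in enumerate(clauses):
        ext[("clause", i)] = set().union(*(ext[("lit", l)] for l in c))
    ext["phi"] = set.intersection(
        *(ext[("clause", i)] for i in range(len(clauses))))
    return ext

def is_full_model(clauses, ext):
    """Check the inclusion assertions of the reduction and non-emptiness."""
    top, phi = ext["top"], ext["phi"]
    ok = phi <= top
    for p in {abs(l) for c in clauses for l in c}:
        cp, cnp = ext[("lit", p)], ext[("lit", -p)]
        ok = ok and cp <= top and cnp <= top   # literal classes below top
        ok = ok and top <= (cp | cnp)          # completeness over p and not p
        ok = ok and not (cp & cnp)             # disjointness of p and not p
    for i, c in enumerate(clauses):
        ci = ext[("clause", i)]
        lits = [ext[("lit", l)] for l in c]
        ok = ok and ci <= top and phi <= ci
        ok = ok and all(l <= ci for l in lits) and ci <= set().union(*lits)
    return ok and all(ext.values())

def assignment_from_model(clauses, ext):
    """The '⇐' direction: read an assignment off one element of the phi class."""
    o = next(iter(ext["phi"]))
    return {p: o in ext[("lit", p)] for p in {abs(l) for c in clauses for l in c}}

def satisfies(clauses, assignment):
    return all(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses)

clauses = [(1, -2, 3), (-1, 2, 3)]
model = interpretation_from_assignment(clauses, {1: True, 2: True, 3: False})
```

For an unsatisfiable clause set, the class for φ ends up empty whatever assignment is used, so the constructed interpretation is not a full model.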
Theorem 9. Full satisfiability of UCDbool is NP-complete.

Proof. To prove the NP upper bound, we reduce full satisfiability to class satisfiability, which, for the case of UCDbool, is known to be in NP [6]. We use an encoding similar to the one used in the proof of Theorem 7 (see Fig. 8). The NP-hardness follows from Lemma 8. □

We turn now to UCDref class diagrams and show that full satisfiability in this case is NLogSpace-complete. We provide a reduction of the reachability problem on (acyclic) directed graphs, which is known to be NLogSpace-complete (see, e.g., [21]), to the complement of full satisfiability of UCDref CDs. Let G = (V, E, s, t) be an instance of reachability, where V is a set of vertices, E ⊆ V × V is a set of directed edges, s is the start vertex, and t the terminal vertex. We construct a UCDref diagram $D_G$ from G as follows:

– $D_G$ has two classes $C_v^1$ and $C_v^2$, for each vertex $v \in V \setminus \{s\}$, and one class $C_s$ corresponding to the start vertex s.
– For each edge $(u, v) \in E$ with $u \neq s$ and $v \neq s$, $D_G$ contains the following constraints (again expressed as DL inclusion assertions): $C_u^1 \sqsubseteq C_v^1$, $C_u^2 \sqsubseteq C_v^2$.
– For each edge $(s, v) \in E$, $D_G$ contains the following constraints: $C_s \sqsubseteq C_v^1$, $C_s \sqsubseteq C_v^2$.
– For each edge $(u, s) \in E$, $D_G$ contains the following constraints: $C_u^1 \sqsubseteq C_s$, $C_u^2 \sqsubseteq C_s$.
– The classes $C_t^1$ and $C_t^2$ are constrained to be disjoint in $D_G$, expressed by: $C_t^1 \sqsubseteq \lnot C_t^2$.

The following lemma establishes the correctness of the reduction.

Lemma 10. t is reachable from s in G iff $D_G$ is not fully satisfiable.

Proof. "⇒" Let $\pi = v_1, \dots, v_n$ be a path in G with $v_1 = s$ and $v_n = t$. We claim that the class $C_s$ in the constructed diagram $D_G$ is unsatisfiable. Suppose otherwise that there is a model I of $D_G$ with $o \in C_s^I$, for some $o \in \Delta^I$. From π, a number of generalization constraints hold in $D_G$, which imply $C_s^I \subseteq (C_t^1)^I$ and $C_s^I \subseteq (C_t^2)^I$. Thus, we obtain that $o \in (C_t^1)^I$ and $o \in (C_t^2)^I$, which violates the disjointness between the classes $C_t^1$ and $C_t^2$, in contradiction to I being a model of $D_G$. Hence, $C_s$ is unsatisfiable, and therefore $D_G$ is not fully satisfiable.

"⇐" Let us consider the contrapositive. Assume that t is not reachable from s in G. We construct a full model I of $D_G$. Let $\Delta^I = \{d_s\} \cup \bigcup_{v \in V \setminus \{s\}} \{d_v^1, d_v^2\}$. Define inductively a sequence of interpretations as follows:

$I^0 = (\Delta^I, \cdot^{I^0})$, such that: $C_s^{I^0} = \{d_s\}$, $(C_v^i)^{I^0} = \{d_v^i\}$, for all $i \in \{1, 2\}$, $v \in V \setminus \{s\}$;
$I^{n+1} = (\Delta^I, \cdot^{I^{n+1}})$, such that:
$C_s^{I^{n+1}} = C_s^{I^n} \cup \bigcup_{(u,s) \in E} ((C_u^1)^{I^n} \cup (C_u^2)^{I^n})$,
$(C_v^i)^{I^{n+1}} = (C_v^i)^{I^n} \cup \bigcup_{(u,v) \in E,\, u \neq s} (C_u^i)^{I^n} \cup \bigcup_{(s,v) \in E} C_s^{I^n}$.
The definition induces a monotone operator over a complete lattice, and hence it has a fixed point. Let I be defined by such a fixed point. It is easy to check that I is such that for all $i \in \{1, 2\}$, and $u, v \in V \setminus \{s\}$ the following holds:

1. For each class $C_v^i$, we have that $d_v^i \in (C_v^i)^I$.
2. $d_s \in C_s^I$.
3. For all $d \in \Delta^I$, $d \in (C_u^i)^I$ implies $d \in (C_v^i)^I$ iff v is reachable from u in G.
4. For all $d_u^i \in \Delta^I$, $d_u^i \in (C_v^j)^I$ for $i \neq j$ iff s is reachable from u in G, and v is reachable from s in G.
5. $d_s \in (C_v^i)^I$ iff v is reachable from s in G.

From (1) and (2) we have that all classes in $D_G$ are populated in I. It remains to show that I satisfies $D_G$. A generalization between the classes $C_u^i$ and $C_v^i$ corresponds to the edge $(u, v) \in E$. This means that v is reachable from u in G, and therefore, by (3) we have that $(C_u^i)^I \subseteq (C_v^i)^I$. A similar argument holds for generalizations involving the class $C_s$. Furthermore, the classes $C_t^1$ and $C_t^2$ are disjoint under I. To show this, suppose that there is an element $d \in \Delta^I$ such that $d \in (C_t^1)^I \cap (C_t^2)^I$. Then by (5), $d \neq d_s$, as t is not reachable from s. Moreover, $d \neq d_v^i$ for all $i \in \{1, 2\}$ and $v \in V \setminus \{s\}$. Indeed, suppose w.l.o.g. that i = 1. Then, by (4), $d_v^1 \in (C_t^2)^I$ iff s is reachable from v, and t is reachable from s, which leads to a contradiction. Hence, $(C_t^1)^I \cap (C_t^2)^I = \emptyset$. □

Theorem 11. Full satisfiability of UCDref class diagrams is NLogSpace-complete.

Proof. The NLogSpace membership follows from the NLogSpace membership of class satisfiability [6], and a reduction similar to the one used in Theorem 9. Since NLogSpace = coNLogSpace (by the Immerman-Szelepcsényi theorem; see, e.g., [21]), and as the above reduction is logspace bounded, it follows that full satisfiability of UCDref class diagrams is NLogSpace-hard. □
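The fixed-point construction in the proof of Lemma 10 is easy to simulate. The following sketch is an illustration only: it assumes t ≠ s, encodes vertices and domain elements as plain Python values with hypothetical names, and propagates extensions edge by edge until nothing changes, which reaches the same least fixed point as the synchronous sequence of interpretations used in the proof.

```python
def fixed_point_interpretation(vertices, edges, s):
    """Least interpretation closed under the generalizations of the reduction."""
    def cls(v, i):
        # The start vertex has a single class; every other vertex has two.
        return "Cs" if v == s else (v, i)

    ext = {"Cs": {"ds"}}
    for v in vertices:
        if v != s:
            for i in (1, 2):
                ext[(v, i)] = {("d", v, i)}
    changed = True
    while changed:
        changed = False
        for (u, v) in edges:
            for i in (1, 2):
                src, dst = ext[cls(u, i)], ext[cls(v, i)]
                if not src <= dst:
                    dst |= src  # enforce the inclusion for edge (u, v)
                    changed = True
    return ext

def fully_satisfiable(vertices, edges, s, t):
    """All classes are populated by construction; check the disjointness at t."""
    ext = fixed_point_interpretation(vertices, edges, s)
    return not (ext[(t, 1)] & ext[(t, 2)])

def reachable(edges, s, t):
    """Plain graph search, used here only to compare against the reduction."""
    seen, frontier = {s}, [s]
    while frontier:
        u = frontier.pop()
        for (a, b) in edges:
            if a == u and b not in seen:
                seen.add(b)
                frontier.append(b)
    return t in seen

V = {"s", "a", "b", "t"}
E_path = {("s", "a"), ("a", "t"), ("b", "s")}
E_nopath = {("s", "a"), ("t", "a"), ("b", "s")}
```

On both example graphs the outcome agrees with the lemma: the diagram is fully satisfiable exactly when t is not reachable from s.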
6 Conclusions
This paper investigates the problem of full satisfiability in the context of UML class diagrams, i.e., whether there is at least one model of the diagram where each class and association is non-empty. Our results (reported in Table 1) show that the complexity of checking full satisfiability is ExpTime-complete both in the full scenario (UCDfull) and in the case where attributes are dropped, NP-complete if we drop isa between relationships (UCDbool), and NLogSpace-complete if we further drop covering over classes (UCDref), thus matching the complexity of the classical class diagram satisfiability check. These complexity bounds extend the ones presented in [6] for class/schema satisfiability to full satisfiability. We show a similar result also for the problem of checking the full satisfiability of a TBox expressed in the description logic ALC. As future work, we intend to investigate the problem under the finite model assumption.
Acknowledgements. This research has been partially supported by the FP7 ICT projects ACSI, contract n. 257593, and OntoRule, contract n. 231875.
References

1. Artale, A., Calvanese, D., Ibanez-Garcia, A.: Full satisfiability of UML class diagrams (extended abstract). Technical Report 127, Roskilde University Computer Science Research Reports. In: Proc. of the 2009 Int. Workshop on Logic in Databases (LID 2009) (2009)
2. Clark, T., Evans, A.S.: Foundations of the Unified Modeling Language. In: Duke, D., Evans, A. (eds.) Proc. of the 2nd Northern Formal Methods Workshop, Springer, Heidelberg (1997)
3. Evans, A., France, R., Lano, K., Rumpe, B.: Meta-modelling semantics of UML. In: Kilov, H. (ed.) Behavioural Specifications for Businesses and Systems. Kluwer Academic Publishers, Dordrecht (1999)
4. Harel, D., Rumpe, B.: Modeling languages: Syntax, semantics and all that stuff. Technical Report MCS00-16, The Weizmann Institute of Science, Rehovot, Israel (2000)
5. Berardi, D., Calvanese, D., De Giacomo, G.: Reasoning on UML class diagrams. Artificial Intelligence 168(1-2), 70–118 (2005)
6. Artale, A., Calvanese, D., Kontchakov, R., Ryzhikov, V., Zakharyaschev, M.: Reasoning over extended ER models. In: Parent, C., Schewe, K.-D., Storey, V.C., Thalheim, B. (eds.) ER 2007. LNCS, vol. 4801, pp. 277–292. Springer, Heidelberg (2007)
7. Artale, A., Calvanese, D., Kontchakov, R., Zakharyaschev, M.: The DL-Lite family and relations. J. of Artificial Intelligence Research 36, 1–69 (2009)
8. Kaneiwa, K., Satoh, K.: Consistency checking algorithms for restricted UML class diagrams. In: Dix, J., Hegner, S.J. (eds.) FoIKS 2006. LNCS, vol. 3861, pp. 219–239. Springer, Heidelberg (2006)
9. Kaneiwa, K., Satoh, K.: On the complexities of consistency checking for restricted UML class diagrams. Theoretical Computer Science 411(2), 301–323 (2010)
10. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, Cambridge (2003)
11. Bergamaschi, S., Sartori, C.: On taxonomic reasoning in conceptual design. ACM Trans. on Database Systems 17(3), 385–422 (1992)
12. Borgida, A.: Description logics in data management. IEEE Trans. on Knowledge and Data Engineering 7(5), 671–682 (1995)
13. Artale, A., Cesarini, F., Soda, G.: Describing database objects in a concept language environment. IEEE Trans. on Knowledge and Data Engineering 8(2), 345–351 (1996)
14. Calvanese, D., Lenzerini, M., Nardi, D.: Description logics for conceptual data modeling. In: Chomicki, J., Saake, G. (eds.) Logics for Databases and Information Systems, pp. 229–264. Kluwer Academic Publishers, Dordrecht (1998)
15. Calvanese, D., Lenzerini, M., Nardi, D.: Unifying class-based representation formalisms. J. of Artificial Intelligence Research 11, 199–240 (1999)
16. Borgida, A., Brachman, R.J.: Conceptual modeling with description logics. In: [10], ch. 10, pp. 349–372
17. Möller, R., Haarslev, V.: Description logic systems. In: [10], ch. 8, pp. 282–305
18. Buchheit, M., Donini, F.M., Schaerf, A.: Decidable reasoning in terminological knowledge representation systems. J. of Artificial Intelligence Research 1, 109–138 (1993)
19. Lenzerini, M., Nobili, P.: On the satisfiability of dependency constraints in entity-relationship schemata. Information Systems 15(4), 453–461 (1990)
20. Jarrar, M., Heymans, S.: Towards pattern-based reasoning for friendly ontology debugging. Int. J. on Artificial Intelligence Tools 17(4), 607–634 (2008)
21. Papadimitriou, C.H.: Computational Complexity. Addison Wesley Publ. Co., Reading (1994)
On Enabling Data-Aware Compliance Checking of Business Process Models⋆

David Knuplesch¹, Linh Thao Ly¹, Stefanie Rinderle-Ma³, Holger Pfeifer², and Peter Dadam¹

¹ Institute of Databases and Information Systems, Ulm University, Germany
² Institute of Artificial Intelligence, Ulm University, Germany
³ Faculty of Computer Science, University of Vienna, Austria
{david.knuplesch,thao.ly,holger.pfeifer,peter.dadam}@uni-ulm.de, [email protected]

Abstract. In the light of an increasing demand on business process compliance, the verification of process models against compliance rules has become essential in enterprise computing. To be broadly applicable, compliance checking has to support data-aware compliance rules as well as to consider data conditions within a process model. Independently of the actual technique applied to accomplish compliance checking, data-awareness means that in addition to the control flow dimension, the data dimension has to be explored during compliance checking. However, naive exploration of the data dimension can lead to state explosion. We address this issue by introducing an abstraction approach in this paper. We show how state explosion can be avoided by conducting compliance checking for an abstract process model and abstract compliance rules. Our abstraction approach can serve as a preprocessing step to the actual compliance checking and provides the basis for more efficient application of existing compliance checking algorithms.

Keywords: Process verification, Compliance rules, Process data, Abstraction.
1 Introduction
In many application domains, business processes are subject to compliance rules and policies that stem from domain-specific requirements such as standardization or legal regulations [1]. Examples of compliance rules for order-to-delivery processes are collected in Table 1. Ensuring compliance of their business processes is crucial for enterprises today, particularly since auditing and certification of their business processes has become a competitive edge in many domains. Examples include certified family-friendly enterprises being more attractive to prospective employees or clinics proving a certain standard of their audited treatments to patients.

Table 1. Examples of compliance rules for order-to-delivery processes

c1  After confirming an order, goods have to be shipped eventually.
c2  Production (i.e., local and outsourced production) shall not start until the order is confirmed.
c3  Each order shall either be confirmed or declined.
c4  Local production shall be followed by a quality test.
c5  Premium customer status shall only be offered after a prior solvency check.
c6  Orders with a piece number beyond 50,000 shall be approved before they are confirmed.
c7  For orders of a non-premium customer with a piece number beyond 80,000 a solvency check is necessary before assessing the order.
c8  Orders with piece number beyond 80,000 require additional shipping insurance before shipping.
c9  After confirming an order of a non-premium customer with piece number of at least 125,000, premium status should be offered to the customer.

⋆ This work was done within the research project SeaFlows partially funded by the German Research Foundation (DFG).
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 332–346, 2010. © Springer-Verlag Berlin Heidelberg 2010

Since process models are the common way to represent business processes, business process compliance can be ensured by verifying process models against imposed compliance rules at process build time. Such a priori compliance checking might help process designers to define compliant process models and avoid instantiations of non-compliant processes. Further, legacy process models can be checked for compliance when introducing new compliance rules. Fig. 1 shows a simplified order-to-delivery process P which might be subject to the rules given in Table 1. For brevity we abstain from modeling the complete data flows of P. A closer look at compliance rules c1 to c4 reveals that they basically constrain the execution and ordering of activities and events within a process model. For example, c1 being applied to P means that event confirm order has to be eventually followed by the activity ship goods in all execution paths of P. We can apply approaches from literature to verify P against c1 to c4 (e.g., [2,3,4]). However, compliance rules c6 to c9 obviously do not only refer to activities and events, but also to process data. In particular, in the context of P process data includes piece number pn, customer status c, and approved a. In order to verify P against data-aware compliance rules such as c6 to c9, data flows as well as branching conditions of P have to be considered, i.e., any compliance checking approach should be able to deal with data conditions. It is notable that although compliance rule c5 does not contain any references to process data of P, data-awareness of the compliance checking is still needed to enable correct verification.
Verifying c5 while ignoring the data conditions in P would lead to violation of c5 over P since activity offer premium status is not always preceded by activity check solvency. However, when having a closer look at the data conditions under which these activities are executed (i.e., the branching conditions) we can see that c5 is satisfied over P. Note that offer premium status is executed only for orders of non-premium customers with piece number beyond 150,000. The correlation of the data conditions assigned to data-based exclusive gateways in P guarantees a prior solvency check. We denote compliance checking mechanisms that are able to deal with correlations of data-based gateways as well as to verify processes against data-aware compliance rules as data-aware compliance checking.

Fig. 1. A simplified order-to-delivery process modeled in BPMN

Challenges. As our examples show, data-awareness is crucial for applying compliance checking in practice. Independently of how compliance checking is actually accomplished (i.e., which techniques, such as model checking [2], are applied), data-awareness poses challenges for compliance checking in general. Data-aware compliance checking has to consider the states that relevant data objects can adopt during process execution. Activity offer premium status from Fig. 1, for example, can only be executed under data condition pn > 150,000. As this example shows, we may have to deal with arbitrary data such as integers that have huge domains. When compliance checking is applied in a straightforward and naive manner, data-awareness can lead to state explosion caused by the states that relevant data objects can adopt during process execution. Consider for example data object pn representing the piece number in the order-to-delivery process and compliance rule c8. Let us assume, for example, that the domain of pn ranges from 1 to 500,000. Then, naive exploration of the data dimension means that process model P is verified against c8 for each possible state of pn between 1 and 500,000. This, in fact, means that the complexity of data-aware compliance checking is 500,000 times the complexity of the non-data-aware case. Hence, to enable efficient compliance checking, strategies are
required that keep the complexity manageable. A further challenge is that data-aware compliance checking also necessitates advanced concepts for user feedback. This is necessary since compliance violations that occur only under certain data conditions are often not as obvious as compliance violations at activity level.

Fig. 2. The overall process of automatic abstraction based on the analysis of the process model and the compliance rules to be verified

Contributions. Although the verification of process models has been addressed by a multitude of recent approaches, data-awareness has not been sufficiently supported yet. Only few approaches consider the data perspective at all. While addressing data-aware compliance rules, [5] only enables data conditions that do not correlate. However, as the order-to-delivery process in Fig. 1 shows, data-based gateways may also contain conditions that are correlated. For example, the data conditions pn > 100,000 and pn > 150,000 in P correlate (i.e., pn > 150,000 implies pn > 100,000 but not vice versa). In addition, these conditions correlate with data condition pn > 80,000 ("piece number beyond 80,000") resulting from compliance rule c8. The limitation to only non-correlating data conditions simplifies the compliance checking problem. However, as our examples indicate, this can be too restrictive for many practical applications. While existing approaches mainly focus on the compliance checking part (cf. Fig. 2 A), this paper focuses on the pre- and postprocessing steps (cf. Fig. 2 B) that enable data-aware compliance checking by tackling the state explosion problem. The latter can occur when fully exploring the data dimension during compliance checking. In this paper, we introduce abstraction strategies to reduce the complexity of data-aware compliance checking. This is achieved by abstracting from concrete states of data objects to abstract states. Based on the compliance rules to be checked our approach automatically derives an abstract process model and abstract compliance rules.
The latter enable more efficient exploration of the data dimension when used as input to the actual compliance checking (cf. Fig. 2 B). Moreover, we discuss how a concretization can be accomplished to provide users with intelligible feedback in case of compliance violations. Our approach is finally validated by a powerful implementation, the SeaFlows compliance checker. This paper is structured as follows. Related work is discussed in Sect. 2. Fundamentals are introduced in Sect. 3. Sect. 4 discusses how the abstract process model and corresponding abstract compliance rules are derived. The application of our abstraction approach and the proof-of-concept implementation are discussed in Sect. 5. We close with a summary and an outlook in Sect. 6.
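To give an intuition for why abstract states help, consider the piece number pn from the introduction. The sketch below is not the abstraction defined in Sect. 4; it is a simplified illustration that assumes all relevant data conditions have the form pn > t, and its function name and parameters are hypothetical. It partitions the domain of pn into intervals on which every such condition has a constant truth value.

```python
def abstract_intervals(thresholds, lo, hi):
    """Split the integer range [lo, hi] at every threshold t of a condition
    'pn > t', so that each condition is constant on each interval."""
    cuts = sorted({t for t in thresholds if lo <= t < hi})
    intervals, start = [], lo
    for t in cuts:
        intervals.append((start, t))  # here 'pn > t' is false
        start = t + 1                 # from t + 1 on, 'pn > t' is true
    intervals.append((start, hi))
    return intervals

# Thresholds taken from the branching conditions of P and from rules c6 and c8.
thresholds = [50_000, 80_000, 100_000, 150_000]
intervals = abstract_intervals(thresholds, 1, 500_000)
```

Under these assumptions, five abstract states stand in for the 500,000 concrete states of pn, since any two values from the same interval satisfy exactly the same conditions.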
2 Related Work
Due to its increasing importance, compliance verification has been addressed by numerous approaches. Most of them focus on the compliance checking part (cf. Fig. 2 A). They propose a variety of techniques to accomplish the verification of process models against imposed compliance rules, such as model checking [2,6,7,8,9] or the analysis of the process model structure [3,10]. In our previous work in the SeaFlows project, we addressed the support of activity-level compliance rules throughout the process lifecycle [4,10]. So far, however, only few approaches have addressed data-awareness. The modeling of data-aware compliance rules is addressed by [4,5,7,11]. Graphical notations that are mapped to logical formulas (e.g., in linear temporal logic) are introduced in [4,5,7]. Basically, these approaches support the enrichment of activity-related compliance rules by data conditions. Our work in this paper applies these modeling approaches, since we do not focus on how to model data-aware compliance rules but rather on enabling their verification. [12] introduces an approach for semantically annotating activities with preconditions and effects that may refer to data objects. In addition, [12] discusses an efficient algorithm for compliance verification using propagation. In contrast to this approach, we focus on deriving suitable abstraction predicates from process models and compliance rules. Further, [5] allows for verifying process models against compliance rules with data conditions based on linear temporal logic (LTL). However, as previously discussed, [5] only addresses data conditions that do not correlate. This is for example not sufficient to support proper verification of the order-to-delivery process from Fig. 1 against compliance rule c5 from Table 1, since we have to deal with the correlation of two data-based exclusive gateways both referring to data object pn. In this paper, we propose strategies to deal with such cases.
By applying abstraction techniques, our approach provides the basis for applying approaches such as [5]. Orthogonal strategies to reduce complexity of compliance checking are discussed in [6,7]. [7] sequentializes parallel flows in order to avoid analyzing irrelevant interleavings. [6] limits the complexity of compliance checking by reducing process models to the relevant parts. These abstraction strategies operating at the structural level are orthogonal to our abstraction approach. They can be applied to complement the approach introduced in this paper.
3 Fundamentals
Independently of the actual technique applied to accomplish compliance checking, data-awareness means that in addition to the control flow dimension the data dimension must be explored during compliance checking. However, this can lead to state explosion since a potentially huge number of states of data objects has to be explored. In Sect. 4, we show how the state explosion can be avoided by applying suitable abstraction strategies. Before discussing these, we provide some fundamentals.
Verification of Data-Aware Compliance Rules
A process domain representing a particular business domain typically consists of process artifacts (e.g., activities and events). The process domain notion in Def. 1 provides the basis for both process models and compliance rules.
Definition 1 (Process Domain). A process domain $\mathcal{D}$ is a tuple $\mathcal{D} = (\mathcal{A}, \mathcal{E}, \mathcal{O}, \mathbb{D}, dom)$, where
– $\mathcal{A}$ is the set of activity types,
– $\mathcal{E}$ is the set of event types,
– $\mathcal{O}$ is the set of data objects,
– $\mathbb{D}$ is the set of data domains, and
– $dom : \mathcal{O} \to \mathbb{D}$ is a function assigning a data domain to each data object.
We further define $\Omega_{\mathcal{D}} := \bigcup_{D \in \mathbb{D}} D$ as the set of all values (i.e., data states) of $\mathcal{D}$.
Example 1. Consider process model P from Fig. 1. The related process domain may be $\mathcal{D} = (\mathcal{A}, \mathcal{E}, \mathcal{O}, \mathbb{D}, dom)$, where
$\mathcal{A} := \{\text{process order}, \text{10\% discount}, \text{check solvency}, \text{assess order}, \ldots\}$
$\mathcal{E} := \{\text{receive order}, \text{confirmation of receipt}, \text{confirm order}, \ldots\}$
$\mathcal{O} := \{pn, c, a\}$
$\mathbb{D} := \{\, \mathbb{N} = \{0, 1, 2, \ldots\},\ \{\text{new}, \text{normal}, \text{premium}\},\ \mathbb{B} = \{\text{true}, \text{false}\} \,\}$
$dom(pn) := \mathbb{N}$; $dom(c) := \{\text{new}, \text{normal}, \text{premium}\}$; $dom(a) := \mathbb{B}$
Data-Aware Compliance Rules. It is not our intention to introduce an approach to model data-aware compliance rules. Hence, we rely on existing work such as [4,5,13]. Since our approach is not restricted to a particular compliance rule modeling language, we provide a general notion of data-aware compliance rules in Def. 3 that is applicable to a multitude of existing approaches. Compliance rules typically contain conditions on activities and events of certain types (e.g., cf. compliance rules c1 to c4). We denote these as type conditions. In addition to type conditions, data-aware compliance rules contain conditions on the states of data objects, so-called data conditions. A formalization of type conditions and data conditions is given in Def. 2.
Definition 2 (Type Condition, Data Condition). Let $\mathcal{D} = (\mathcal{A}, \mathcal{E}, \mathcal{O}, \mathbb{D}, dom)$ be a process domain and let $t \in \mathcal{A} \cup \mathcal{E}$ be an activity type or an event type. Then
– a type condition is an expression of the form $(type = t)$.
Let further $o \in \mathcal{O}$ be a data object, $v \in dom(o)$ a certain value of the related domain, and $\otimes \in \{=, \neq, <, >, \leq, \geq, \ldots\}$ a relation. Then
– a data condition is an expression of the form $(o \otimes v)$.
Moreover, we define:
– $TC_{\mathcal{D}} \subseteq \{ (type = t) \mid t \in \mathcal{A} \cup \mathcal{E} \}$ as the set of all type conditions over $\mathcal{D}$.
– $DC_{\mathcal{D}} \subseteq \{ (o \otimes v) \mid o \in \mathcal{O},\ v \in dom(o),\ \otimes \in \{=, \neq, <, >, \ldots\} \}$ as the set of all data conditions over $\mathcal{D}$.
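Definitions 1 and 2 can be made concrete with a minimal sketch. It is only an illustration under our own assumptions: the class names `TypeCondition` and `DataCondition`, the `holds` methods, and the ASCII spelling of the relation symbols are hypothetical and not part of the SeaFlows implementation.

```python
import operator
from dataclasses import dataclass

# Relation symbols admitted in data conditions (Def. 2), in ASCII spelling.
RELATIONS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "<=": operator.le, ">=": operator.ge,
}


@dataclass(frozen=True)
class TypeCondition:
    """A type condition (type = t) on an activity or event type."""
    type_name: str

    def holds(self, node_type: str) -> bool:
        return node_type == self.type_name


@dataclass(frozen=True)
class DataCondition:
    """A data condition (o ⊗ v) on the state of data object o."""
    obj: str
    rel: str
    value: object

    def holds(self, state: dict) -> bool:
        # state maps each data object name to its current value
        return RELATIONS[self.rel](state[self.obj], self.value)


# Conditions corresponding to a rule about large orders of non-premium customers.
tc = TypeCondition("confirm order")
dc_pn = DataCondition("pn", ">=", 125_000)
dc_c = DataCondition("c", "!=", "premium")

state = {"pn": 130_000, "c": "normal", "a": True}
print(tc.holds("confirm order"), dc_pn.holds(state), dc_c.holds(state))
```

Frozen dataclasses are hashable, so conditions can be collected in sets, which mirrors the set-based definitions.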
Example 2. Consider compliance rule c9 from Table 1. Here, c9 yields the following type conditions ($TC_{\mathcal{D}}$) and data conditions ($DC_{\mathcal{D}}$), where $pn \in \mathcal{O}$ is the data object representing the piece number and $c \in \mathcal{O}$ is the data object representing the customer status (cf. Fig. 1).
Phrase                                              Corresponding condition              Set
After confirming an order                           $(type = \text{confirm order})$          $\in TC_{\mathcal{D}}$
of a non-premium customer                           $(c \neq \text{premium})$                $\in DC_{\mathcal{D}}$
with piece number of at least 125,000               $(pn \geq 125{,}000)$                    $\in DC_{\mathcal{D}}$
premium status should be offered to the customer    $(type = \text{offer premium status})$   $\in TC_{\mathcal{D}}$
Finally, a general data-aware compliance rule is defined as follows:
Definition 3 (Data-Aware Compliance Rule). Let $\mathcal{D} = (\mathcal{A}, \mathcal{E}, \mathcal{O}, \mathbb{D}, dom)$ be a process domain. Then, a data-aware compliance rule is a tuple $c = (C, \Delta)$, with:
– $C = TC_c \cup DC_c$ is a finite set of conditions that is partitioned into the set of type conditions $TC_c \subseteq TC_{\mathcal{D}}$ and the set of data conditions $DC_c \subseteq DC_{\mathcal{D}}$.
– $\Delta$ is an expression defining temporal (ordering) and logical relations over the conditions in $C$.
Further, we define:
– $conditionsCR_c : \mathcal{O} \to 2^{DC_c},\ o' \mapsto \{ (o \otimes v) \mid o = o' \wedge (o \otimes v) \in DC_c \}$ as a function returning all data conditions of $c$ that affect a certain data object.
Example 3. To illustrate our examples, we use linear temporal logic (LTL). LTL is applied to model compliance rules by numerous approaches [5,13]. Note, however, that our approach is not restricted to a particular compliance rule modeling language. Using LTL, we can model c9 from Table 1 as follows:
$c_9 : \mathbf{G}\big( ((type = \text{confirm order}) \wedge (pn \geq 125{,}000) \wedge (c \neq \text{premium})) \Rightarrow \mathbf{F}\,(type = \text{offer premium status}) \big)$
According to Def. 3, this means $c_9 = (C_9, \Delta_9)$, where
$C_9 = \{ tc_1 = (type = \text{confirm order}),\ tc_2 = (type = \text{offer premium status}),\ dc_1 = (pn \geq 125{,}000),\ dc_2 = (c \neq \text{premium}) \}$
$\Delta_9 = \mathbf{G}\big( (tc_1 \wedge dc_1 \wedge dc_2) \Rightarrow \mathbf{F}\, tc_2 \big)$
$TC_{c_9} = \{ (type = \text{confirm order}),\ (type = \text{offer premium status}) \}$
$DC_{c_9} = \{ (pn \geq 125{,}000),\ (c \neq \text{premium}) \}$
Based on the above notion of data-aware compliance rules, we later show how automatic abstraction is conducted to enable data-aware compliance checking.
Processes. A process model, commonly represented by a process graph, can be composed using activities, events, and data objects from a process domain. Since our approach is not restricted to a particular process definition language, we provide a general definition of process graphs in Def. 4 following common notations, such as BPMN:
Definition 4 (Process Graph). Let $\mathcal{D} = (\mathcal{A}, \mathcal{E}, \mathcal{O}, \mathbb{D}, dom)$ be a process domain. Then, a process graph is a tuple $P = (N, F, O, I, type, con)$, where:
– $N = A_P \cup E_P \cup G_P$ is a finite set of nodes that is partitioned into the set of activities $A_P$, the set of events $E_P$, and the set of gateways $G_P$.
– $F \subseteq N \times N$ represents the sequence flow relation between nodes.
– $O \subseteq \mathcal{O}$ is a finite set of data objects.
– $I \subseteq (O \times N) \cup (N \times O)$ is the data flow relation between nodes and data objects.
– $type : A_P \cup E_P \to \mathcal{A} \cup \mathcal{E}$ is a function assigning an activity type to each activity in $P$ and an event type to each event in $P$, such that $a \in A_P \Rightarrow type(a) \in \mathcal{A}$ and $e \in E_P \Rightarrow type(e) \in \mathcal{E}$ hold.
– $con : F \to 2^{DC_{\mathcal{D}}}$ is a function assigning a (possibly empty) set of data conditions to each sequence flow.
Further, we define:
– $DC_P := \{ (o \otimes v) \mid \exists f \in F : (o \otimes v) \in con(f) \} \subseteq DC_{\mathcal{D}}$ as the set of data conditions in $P$.
– $conditionsPG_P : \mathcal{O} \to 2^{DC_P},\ o' \mapsto \{ (o \otimes v) \mid o = o' \wedge (o \otimes v) \in DC_P \}$ as a function returning all data conditions of $P$ on the associated data object.
Example 4. Process model P from Fig. 1 contains the following data conditions over pn:
$conditionsPG_P(pn) = \{ (pn \leq 50{,}000),\ (pn > 50{,}000),\ (pn \leq 100{,}000),\ (pn > 100{,}000),\ (pn \leq 150{,}000),\ (pn > 150{,}000) \}$
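As a rough sketch of how $con$ and $conditionsPG_P$ from Def. 4 might be represented, consider the following fragment. The gateway names `g1` and `g2`, the choice of edges, and the helper `conditions_pg` are hypothetical; they are not taken from Fig. 1 or from the actual implementation.

```python
from collections import namedtuple

# A data condition (o ⊗ v); the relation is kept as an ASCII symbol.
Cond = namedtuple("Cond", "obj rel value")

# con: maps each sequence flow (source node, target node) to its set of
# data conditions. Flows without a condition map to the empty set.
con = {
    ("g1", "check solvency"): {Cond("pn", ">", 50_000)},
    ("g1", "assess order"): {Cond("pn", "<=", 50_000)},
    ("g2", "10% discount"): {Cond("c", "=", "premium")},
    ("receive order", "process order"): set(),
}


def conditions_pg(con, obj):
    """All data conditions attached to some sequence flow that refer to obj."""
    return {dc for conds in con.values() for dc in conds if dc.obj == obj}


print(sorted(conditions_pg(con, "pn")))
```

Collecting the conditions per data object in this way is the first of the three abstraction steps, since only conditions on the same data object can interact.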
4 On Enabling Data-Aware Compliance Checking
As discussed, the full exploration of the data dimension can lead to state explosion when conducting compliance checking. The basic idea to achieve more efficient data-aware compliance checking of process models and to limit the state explosion problem is to abstract from states that are irrelevant for the verification of a particular compliance rule. Consider, for example, compliance rule c8 from Table 1. Concerning the satisfaction or violation of c8, it is not relevant whether pn = 120,000, pn = 120,001, pn = 120,002, . . ., or pn = 130,000 holds when executing the order-to-delivery process from Fig. 1. Hence, it is not necessary to differentiate between these cases when verifying P against c8. These potential states of pn, namely 120,000, . . ., 130,000, could be treated as one "merged" state. The merged state can be described by the abstraction predicates (pn ≥ 120,000) ∧ (pn ≤ 130,000) and be applied to data-aware compliance checking (e.g., by applying model checking techniques). Other irrelevant states of pn can be merged in a similar manner to derive a more compact set of states that serves as the domain of pn in the abstract process model. In fact, a differentiation between the concrete states of pn beyond 100,000 is not necessary for verifying P against c8. Hence, all states of pn beyond 100,000 can be merged into one abstract state pn > 100,000.
In general, by abstracting from states of data objects that are irrelevant for the verification of a compliance rule, fewer cases have to be explored in the verification procedure. This helps to reduce the complexity of compliance checking. However, abstracting from states must not lead to incorrect verification results. Consider, for example, again the order-to-delivery process P and compliance rule c8 from Table 1. To verify c8 in particular, it is not sufficient to only consider whether pn > 100,000 or pn ≤ 100,000 holds. The challenge of automatic abstraction is to identify adequate abstraction predicates that enable us to "merge" states without falsifying verification results. In the literature, abstracting from concrete states to abstract predicates is common practice for dealing with state explosion [14,15,16,17]. This is particularly relevant in engineering domains, where large systems have to be verified against safety properties. In many applications, abstraction constitutes a task that requires human interaction. In particular, domain experts are required to find the right abstraction. By analyzing the dependencies between a process model P and a compliance rule c, our abstraction approach automatically derives an abstract process model $P_{abstract}$ and an abstract compliance rule $c_{abstract}$. These can serve as input to the actual compliance checking. We want to find conservative abstraction predicates such that the following holds:
$P \models c \Leftrightarrow P_{abstract} \models c_{abstract}$
The data-based abstraction introduced in this paper can be combined with structural abstraction strategies (cf. Sect. 2) to achieve a further reduction of the compliance checking complexity.
Automatic Abstraction for Data Conditions. To achieve automatic abstraction, we have to accomplish three steps:
1) Identify the data objects potentially relevant to c and the data conditions on them in c and in the data-based gateways of P.
2) Identify abstraction predicates for the relevant data objects.
3) Apply the abstraction predicates to obtain $P_{abstract}$ and $c_{abstract}$.
Altogether, the states of each data object o can be represented by a set of abstraction predicates that are relevant for proper verification of the associated compliance rule over the process model. This is accomplished by analyzing the data conditions in the process model and in the compliance rule. For the identified set of abstraction predicates, we can identify combinations of predicates whose conjunction is satisfiable (i.e., can evaluate to true). Each such combination represents a potential abstract state of the corresponding data object o:
Definition 5 (Abstraction for Data Conditions). Let $\mathcal{D} = (\mathcal{A}, \mathcal{E}, \mathcal{O}, \mathbb{D}, dom)$ be a process domain, $P = (N, F, O, I, type, con)$ be a process model, and $c$ be a compliance rule over $\mathcal{D}$. Let further $o \in O$ be a data object in $P$. Then,
– $predicates^c_P : \mathcal{O} \to 2^{DC_{\mathcal{D}}},\ o \mapsto predicates^c_P(o)$ with $predicates^c_P(o) := conditionsCR_c(o) \cup conditionsPG_P(o)$ is a function returning the set of all data conditions in $c$ and $P$ that affect $o$ and, thus, constitute relevant abstraction predicates.
– $solve : 2^{DC_{\mathcal{D}}} \times \Omega_{\mathcal{D}} \to 2^{DC_{\mathcal{D}}},\ (C, v') \mapsto solve(C, v')$ with $solve(C, v') := \{ (o \otimes v) \mid (v' \otimes v) = true \wedge (o \otimes v) \in C \}$ is a function returning the particular subset of predicates that are satisfied by $v'$.
– $allocations^c_P : \mathcal{O} \to 2^{2^{DC_{\mathcal{D}}}},\ o \mapsto allocations^c_P(o)$ with $allocations^c_P(o) := \{ S \mid \exists v \in dom(o) : S = solve(predicates^c_P(o), v) \}$ is a function returning a set of sets of predicates such that for each value $v \in dom(o)$ there is a set in $allocations^c_P(o)$ containing all predicates over $o$ that are satisfied by $v$.
Note that for deriving the abstraction predicates $predicates^c_P(o)$, not only the data conditions of $P$ but also the data conditions of $c$ are considered. Based on the predicates $predicates^c_P(o)$ for a data object $o$, we can narrow the data domain of $o$ that has to be explored during data-aware compliance checking. In particular, instead of exploring the complete domain of $o$, which may cause a state explosion, only the corresponding set of abstract states (i.e., the elements of $allocations^c_P(o)$) has to be explored in the compliance checking procedure. By Def. 5, $|allocations^c_P(o)| \leq |dom(o)|$ always holds. For a large data domain $dom(o)$, $allocations^c_P(o)$ typically contains significantly fewer elements. Hence, narrowing the domain of $o$ to the set of abstract states of $o$ that are relevant for the verification of the actual compliance rule is a crucial step to avoid the state explosion problem.
Dealing with Large Domains. Although it is easy to derive $allocations^c_P(o)$ for small finite domains by calculating $solve(C, v)$ for each $v \in dom(o)$, this procedure is not feasible for large data domains, such as $D = \mathbb{N}$. However, if $D$ is a totally ordered domain (e.g., $\mathbb{N}$, $\mathbb{Z}$, or $\mathbb{R}$) and all conditions $(o \otimes v) \in predicates^c_P(o)$ use ordering relations $\otimes \in \{<, \leq, =, \geq, >\}$, as is the case with $pn$ and the corresponding data conditions, we can efficiently calculate $allocations^c_P(o)$ as follows: First, we determine $(v_i)_{1 \leq i \leq n} = \langle v_1, \ldots, v_n \rangle$, the ascendingly sorted finite sequence of all values $v$ with $\exists (o \otimes v) \in predicates^c_P(o)$, without any multiple occurrences. Now it is sufficient to limit the calculation of $solve(predicates^c_P(o), v)$ to the following cases of $v$:
1. the values $v_1, \ldots, v_n$ that are the limits of the relevant abstraction predicates,
2. for any two successive values $v_i$ and $v_{i+1}$, a value $w_i$ with $v_i < w_i < v_{i+1}$,
3. a value $s < v_1$ smaller than any $v_i$ and a value $b > v_n$ bigger than any $v_i$.
Obviously, it is sufficient to use one $w_i$ with $v_i < w_i < v_{i+1}$, since all values of this interval fulfill and violate exactly the same conditions of $predicates^c_P(o)$. For the same reason, the use of one $s$ and one $b$ is sufficient. Note that sometimes there may be no $s \in D$ with $s < v_1$ or no $w_i \in D$ with $v_i < w_i < v_{i+1}$ (e.g., $D = \mathbb{N}$ and $v_1 = 0$, $v_2 = 1$). Then, the corresponding cases have to be ignored. The calculation of $allocations^c_P(o)$ may also be delegated to an SMT solver (e.g., Yices [18]) that is even able to deal with large domains and conditions using linear arithmetic.
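The three cases above can be sketched as follows for a data object over the natural numbers. This is a minimal illustration, not the SeaFlows code: the names `Cond`, `solve`, `representatives`, and `allocations` are our own, the domain is assumed to be the integers from `lower` upwards, and the sketch picks $v_i + 1$ as the in-between value $w_i$ (any value strictly between two successive limits would do).

```python
import operator
from typing import NamedTuple


class Cond(NamedTuple):
    """A data condition (o ⊗ v) over an integer-valued data object."""
    obj: str
    rel: str
    value: int


OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       ">=": operator.ge, ">": operator.gt}


def solve(conds, v):
    """Subset of conds that is satisfied by the concrete value v."""
    return frozenset(c for c in conds if OPS[c.rel](v, c.value))


def representatives(conds, lower=0):
    """Candidate values: the limits, one value between successive limits,
    one value below the smallest and one above the largest limit."""
    limits = sorted({c.value for c in conds})
    if not limits:
        return [lower]
    reps = set(limits)                       # case 1
    for lo, hi in zip(limits, limits[1:]):   # case 2, skipped if no such value
        if hi - lo > 1:
            reps.add(lo + 1)
    if limits[0] > lower:                    # case 3, skipped at domain bound
        reps.add(limits[0] - 1)
    reps.add(limits[-1] + 1)
    return sorted(reps)


def allocations(conds, lower=0):
    """Abstract states: the distinct sets of predicates that some value satisfies."""
    return {solve(conds, v) for v in representatives(conds, lower)}


# Predicates over pn: six gateway conditions plus one rule condition.
PN = {Cond("pn", ">=", 125_000),
      Cond("pn", "<=", 50_000), Cond("pn", ">", 50_000),
      Cond("pn", "<=", 100_000), Cond("pn", ">", 100_000),
      Cond("pn", "<=", 150_000), Cond("pn", ">", 150_000)}

print(len(allocations(PN)))
```

For these seven predicates only nine candidate values are evaluated, independent of the size of the underlying domain.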
Example 5. Consider process model P from Fig. 1 and compliance rule c9 from Table 1 over process domain $\mathcal{D}$. As described above, to calculate $allocations^{c_9}_P(pn)$ it is sufficient to consider 50,000, 100,000, 125,000, 150,000 as well as 75,000, 112,500, 137,500 and 49,999, 150,001. We thus obtain the following abstraction predicates for the data object pn:
$predicates^{c_9}_P(pn) = conditionsCR_{c_9}(pn) \cup conditionsPG_P(pn)$
$= \{ (pn \geq 125{,}000) \} \cup \{ (pn \leq 50{,}000),\ (pn > 50{,}000),\ (pn \leq 100{,}000),\ (pn > 100{,}000),\ (pn \leq 150{,}000),\ (pn > 150{,}000) \}$
$= \{ (pn \geq 125{,}000),\ (pn \leq 50{,}000),\ (pn > 50{,}000),\ (pn \leq 100{,}000),\ (pn > 100{,}000),\ (pn \leq 150{,}000),\ (pn > 150{,}000) \}$
$allocations^{c_9}_P(pn) = \{ \alpha, \beta, \gamma, \delta, \epsilon \}$, where
$\alpha := \{ (pn \leq 50{,}000),\ (pn \leq 100{,}000),\ (pn \leq 150{,}000) \}$
$\beta := \{ (pn > 50{,}000),\ (pn \leq 100{,}000),\ (pn \leq 150{,}000) \}$
$\gamma := \{ (pn > 50{,}000),\ (pn > 100{,}000),\ (pn \leq 150{,}000) \}$
$\delta := \{ (pn \geq 125{,}000),\ (pn > 50{,}000),\ (pn > 100{,}000),\ (pn \leq 150{,}000) \}$
$\epsilon := \{ (pn \geq 125{,}000),\ (pn > 50{,}000),\ (pn > 100{,}000),\ (pn > 150{,}000) \}$
The sets of predicates in $allocations^{c_9}_P(pn)$ constitute properties describing the "merged" sets of original states of pn. $\{ \alpha, \beta, \gamma, \delta, \epsilon \}$ may be used as the abstract data domain of pn for verification against c9. During compliance checking of P against c9, only these abstract states have to be explored. Compared to the original domain of pn, this constitutes a significant reduction of the complexity of exploring the data dimension. In Def. 6, we transfer the above results to the process domain, the process graph, and the compliance rule and thereby formalize the abstract process domain, the abstract process graph, and the abstract compliance rule (cf. Fig. 2).
Definition 6 (Abstract Process Domain, Abstract Process Graph, Abstract Compliance Rule). Let $\mathcal{D} = (\mathcal{A}, \mathcal{E}, \mathcal{O}, \mathbb{D}, dom)$ be a process domain with a process graph $P = (N, F, O, I, type, con)$ and a compliance rule $c = (C, \Delta)$.
The abstract process domain $\mathcal{D}_{abstract}$ of $\mathcal{D}$ with respect to $P$ and $c$ is defined as $\mathcal{D}_{abstract} = (\mathcal{A}, \mathcal{E}, \mathcal{O}, \mathbb{D}_{abstract}, dom_{abstract})$, where
– $\mathbb{D}_{abstract} := \{ allocations^c_P(o) \mid o \in \mathcal{O} \}$
– $dom_{abstract}(o) := allocations^c_P(o)$
Then the abstract process graph $P_{abstract}$ of $P$ with respect to $c$ is defined as $P_{abstract} = (N, F, O, I, type, con_{abstract})$, where for $f \in F$ the following holds:
– $con_{abstract}(f) := \{ (\text{``}(o \otimes v)\text{''} \in o) \mid \text{``}(o \otimes v)\text{''} \in con(f) \}$
The abstract compliance rule $c_{abstract}$ of $c$ with respect to $P$ is defined as $c_{abstract} = (C_{abstract}, \Delta)$, where:
– $C_{abstract} := TC_c \cup \{ (\text{``}(o \otimes v)\text{''} \in o) \mid \text{``}(o \otimes v)\text{''} \in DC_c \}$
Example 6. Consider process model P from Fig. 1 and compliance rule c9 over process domain $\mathcal{D}$ (cf. Example 1). Due to space limitations, we apply the
[Figure: abstract process model P_abstract of the order-to-delivery process; each gateway condition on pn is replaced by a membership test such as [(pn > 100,000) ∈ pn].]
Fig. 3. The abstract order-to-delivery process after applying abstraction to pn
abstraction only for data object pn. Based on Example 5, we obtain the abstract process domain $\mathcal{D}_{abstract} = (\mathcal{A}, \mathcal{E}, \mathcal{O}, \mathbb{D}_{abstract}, dom_{abstract})$, with
– $\mathbb{D}_{abstract} := \{ allocations^{c_9}_P(pn), \ldots \} = \{ \{ \alpha, \beta, \gamma, \delta, \epsilon \}, \ldots \}$
– $dom_{abstract}(pn) := allocations^{c_9}_P(pn) = \{ \alpha, \beta, \gamma, \delta, \epsilon \}$
The corresponding abstract process graph $P_{abstract} = (N, F, O, I, type, con_{abstract})$ is depicted in Fig. 3. Further, we obtain the corresponding compliance rule $c_{abstract} = (C_{abstract}, \Delta)$, where:
$C_{abstract} := TC_c \cup \{ (\text{``}(o \otimes v)\text{''} \in o) \mid \text{``}(o \otimes v)\text{''} \in DC_c \}$
$= \{ tc_1 = (type = \text{confirm order}),\ tc_2 = (type = \text{offer premium status}),\ dc_{1,abstract} = (\text{``}(pn \geq 125{,}000)\text{''} \in pn),\ \ldots \}$
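To make the membership tests of Def. 6 tangible, the following sketch evaluates abstract conditions on the five abstract states of pn. It assumes, for illustration only, that predicates are stored as plain strings and that abstract states are named; `holds_abstract` and `satisfying_states` are hypothetical helpers.

```python
# Abstract states of pn, each given as the set of predicates it satisfies.
DOM_ABSTRACT = {
    "alpha": frozenset({"pn <= 50000", "pn <= 100000", "pn <= 150000"}),
    "beta": frozenset({"pn > 50000", "pn <= 100000", "pn <= 150000"}),
    "gamma": frozenset({"pn > 50000", "pn > 100000", "pn <= 150000"}),
    "delta": frozenset({"pn >= 125000", "pn > 50000", "pn > 100000",
                        "pn <= 150000"}),
    "epsilon": frozenset({"pn >= 125000", "pn > 50000", "pn > 100000",
                          "pn > 150000"}),
}


def holds_abstract(predicate, abstract_state):
    """An abstract condition holds iff the predicate is a member of the state."""
    return predicate in abstract_state


def satisfying_states(predicate, dom_abstract):
    """Names of all abstract states in which the abstract condition holds."""
    return {name for name, state in dom_abstract.items()
            if holds_abstract(predicate, state)}


print(sorted(satisfying_states("pn >= 125000", DOM_ABSTRACT)))
```

Evaluating a condition thus no longer requires arithmetic on concrete values, only a set lookup over a handful of abstract states.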
5 Analysis and Implementation
We applied our approach to enable data-aware compliance checking of process models without running into intractable state explosion problems. To accomplish the actual compliance checking step (cf. Fig. 2 A), we applied model checking techniques (e.g., by using SAL [19]). Model checking is applied to compliance verification by numerous approaches in the literature [2,6,7]. It comprises techniques for the automatic verification of a model specification against predefined properties. In order to apply model checking, we have to provide a state transition system and a logic property model to the model checker (cf. Fig. 4). Therefore, we transform the abstract process model into a state representation and the abstract compliance rule into a logic property. To provide both to the model checker, a
[Figure: processing pipeline. Process model, data-aware compliance rule, and process domain → automatic abstraction → abstract process model and abstract data-aware compliance rule → transformation → state transition system and temporal-logic property → conversion → SAL input file → model checking with SAL (true/false). Counterexample path: SAL output stream → parsing → counterexample state trace → retransformation → counterexample abstract process trace → automatic concretization → counterexample process trace. Abstraction, transformation, and conversion mappings connect both directions.]
Fig. 4. Data-aware compliance checking and generation of counterexample
[Figure: screenshot showing the original process graph, the data-aware compliance rules, the counterexample as process log, the counterexample as process graph, and the visualization of the counterexample's steps.]
Fig. 5. Aristaflow Process Template Editor with the SeaFlows compliance checker plugin for data-aware compliance rule checking
conversion with respect to the model checker’s specific syntax and restrictions is required. The model checker then performs automatic exploration of the state space and checks for conformance to the compliance rule. In case of a violation, the model checker provides an incompliant execution trace as counterexample. As previously discussed, a major challenge of data-aware compliance checking is to provide meaningful feedback in case of compliance violations (e.g., data
conditions under which a violation occurs). To tackle this, we have to memorize the steps taken during the transformation procedure and conduct a retransformation. Fig. 4 shows the steps accomplished by our implementation. Our proof-of-concept implementation, the SeaFlows compliance checker, is implemented as a Java plug-in for the AristaFlow Process Template Editor, which is part of the AristaFlow BPM Suite [20]. 17,000 lines of code and a class hierarchy comprising about 70 interfaces and 210 classes indicate the complexity of the implementation. The SeaFlows compliance checker enables the modeling of LTL-based data-aware compliance rules using a tree-based editor. Automatic abstraction as discussed in Sect. 4 is supported for domains of numbers. The SeaFlows compliance checker conducts the automatic abstraction, transforms an AristaFlow process model into a state representation, and passes it to the model checker SAL [19]. Counterexamples obtained from SAL can be shown as process logs or visualized as a process graph, as shown in Fig. 5.
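One conceivable way to turn an abstract state from a counterexample back into feedback on concrete values is sketched here. It is an assumption on our part, not a description of the SeaFlows concretization step: predicates are (relation, value) pairs over a single natural-number data object, only the relations <, <=, >, >= are handled, and `concretize` is a hypothetical helper.

```python
import math

# Negation of each supported ordering relation.
NEGATE = {"<": ">=", "<=": ">", ">": "<=", ">=": "<"}


def concretize(state, predicates, lower=0):
    """Inclusive bounds (lo, hi) of the values that satisfy exactly the
    predicates in `state` and violate all other predicates."""
    constraints = list(state)
    # Predicates outside the state are violated, so their negation holds.
    constraints += [(NEGATE[rel], v) for rel, v in predicates - state]
    lo, hi = lower, math.inf
    for rel, v in constraints:
        if rel == ">=":
            lo = max(lo, v)
        elif rel == ">":
            lo = max(lo, v + 1)
        elif rel == "<=":
            hi = min(hi, v)
        elif rel == "<":
            hi = min(hi, v - 1)
    return lo, hi


PREDICATES = frozenset({(">=", 125_000), ("<=", 50_000), (">", 50_000),
                        ("<=", 100_000), (">", 100_000),
                        ("<=", 150_000), (">", 150_000)})
GAMMA = frozenset({(">", 50_000), (">", 100_000), ("<=", 150_000)})

print(concretize(GAMMA, PREDICATES))
```

Reporting such an interval, rather than the raw set of predicates, gives users a directly readable data condition under which the violation occurs.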
6 Summary and Outlook
Enabling process-aware information systems to support the compliance of process models with imposed data-aware compliance rules can be regarded as one step towards installing business process compliance management in practice. In this paper, we introduced an abstraction approach that enables data-aware compliance checking in a more efficient manner by limiting the state explosion problem that can occur when fully exploring the data dimension during verification. The approach serves as a preprocessing step to the actual compliance checking and provides the basis for the efficient application of existing compliance checking algorithms. Being independent of a particular process meta-model and of a particular compliance rule modeling language, our approach is applicable to a variety of existing approaches. To accomplish data-aware compliance checking in a comprehensive manner, we also address the challenge of providing users with intelligible feedback in case of compliance violations. To the best of our knowledge, we are the first to apply automatic data abstraction in the context of compliance checking of business process models. In future work, we will further investigate automatic abstraction for other types of domains, also considering relationships among them. Furthermore, we will continue refining the verification output to provide more intelligible feedback to users.
References
1. Sadiq, S., Governatori, G., Namiri, K.: Modeling control objectives for business process compliance. In: Alonso, G., Dadam, P., Rosemann, M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 149–164. Springer, Heidelberg (2007)
2. Ghose, A., Koliadis, G.: Auditing business process compliance. In: Krämer, B.J., Lin, K.-J., Narasimhan, P. (eds.) ICSOC 2007. LNCS, vol. 4749, pp. 169–180. Springer, Heidelberg (2007)
3. Sadiq, S., Orlowska, M., Sadiq, W.: Specification and validation of process constraints for flexible workflows. Information Systems 30(5), 349–378 (2005)
4. Ly, L.T., Rinderle-Ma, S., Dadam, P.: Design and verification of instantiable compliance rule graphs in process-aware information systems. In: Pernici, B. (ed.) Advanced Information Systems Engineering. LNCS, vol. 6051, pp. 9–23. Springer, Heidelberg (2010)
5. Awad, A., Weidlich, M., Weske, M.: Specification, verification and explanation of violation for data aware compliance rules. In: Baresi, L., Chi, C.-H., Suzuki, J. (eds.) ICSOC-ServiceWave 2009. LNCS, vol. 5900, pp. 500–515. Springer, Heidelberg (2009)
6. Awad, A., Decker, G., Weske, M.: Efficient compliance checking using BPMN-Q and temporal logic. In: Dumas, M., Reichert, M., Shan, M.-C. (eds.) BPM 2008. LNCS, vol. 5240, pp. 326–341. Springer, Heidelberg (2008)
7. Liu, Y., Müller, S., Xu, K.: A static compliance-checking framework for business process models. IBM Systems Journal 46(2), 335–361 (2007)
8. Yu, J., et al.: Pattern based property specification and verification for service composition. In: Aberer, K., Peng, Z., Rundensteiner, E.A., Zhang, Y., Li, X. (eds.) WISE 2006. LNCS, vol. 4255, pp. 156–168. Springer, Heidelberg (2006)
9. Foerster, A., Engels, G., Schattkowsky, T.: Activity diagram patterns for modeling quality constraints in business processes. In: Briand, L.C., Williams, C. (eds.) MoDELS 2005. LNCS, vol. 3713, pp. 2–16. Springer, Heidelberg (2005)
10. Ly, L.T., Rinderle, S., Dadam, P.: Semantic correctness in adaptive process management systems. In: Dustdar, S., Fiadeiro, J.L., Sheth, A.P. (eds.) BPM 2006. LNCS, vol. 4102, pp. 193–208. Springer, Heidelberg (2006)
11. Weber, I., Hoffmann, J., Mendling, J.: Semantic business process validation. In: Proc. of the 3rd Int'l Workshop on Semantic Business Process Management, pp. 22–36 (2008)
12. Governatori, G., et al.: Detecting regulatory compliance for business process models through semantic annotations. In: Business Process Management Workshops. LNBIP, vol. 17, pp. 5–17. Springer, Heidelberg (2008)
13. van der Aalst, W., Pesic, M.: DecSerFlow: Towards a truly declarative service flow language. In: Bravetti, M., Núñez, M., Zavattaro, G. (eds.) WS-FM 2006. LNCS, vol. 4184, pp. 1–23. Springer, Heidelberg (2006)
14. Cousot, P., Cousot, R.: Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: Proc. of the 4th ACM Symp. on Principles of Programming Languages, pp. 238–252. ACM Press, New York (1977)
15. Graf, S., Saidi, H.: Construction of abstract state graphs with PVS. In: Grumberg, O. (ed.) CAV 1997. LNCS, vol. 1254, pp. 72–83. Springer, Heidelberg (1997)
16. Das, S.: Predicate Abstraction. PhD thesis, Stanford University (2003)
17. Model, F.S., Clarke, E., Lu, Y.: Counterexample-guided abstraction refinement. In: Emerson, E.A., Sistla, A.P. (eds.) CAV 2000. LNCS, vol. 1855, pp. 154–169. Springer, Heidelberg (2000)
18. Dutertre, B., De Moura, L.: The YICES SMT solver (2006), Tool paper at http://yices.csl.sri.com/tool-paper.pdf
19. Bensalem, S., et al.: An overview of SAL. In: Proc. of the 5th NASA Langley Formal Methods Workshop, pp. 187–196. NASA Langley Research Center (2000)
20. Dadam, P., Reichert, M.: The ADEPT project: a decade of research and development for robust and flexible process support. Computer Science-Research and Development 23(2), 81–97 (2009)
Query Answering under Expressive Entity-Relationship Schemata
Andrea Calì (3,2), Georg Gottlob (1,2), and Andreas Pieris (1)
1 Computing Laboratory, University of Oxford, UK
2 Oxford-Man Institute of Quantitative Finance, University of Oxford, UK
3 Department of Information Systems and Computing, Brunel University, UK
[email protected]
Abstract. We address the problem of answering conjunctive queries under constraints representing schemata expressed in an extended version of the Entity-Relationship model. This extended model, called the ER+ model, comprises is-a constraints among entities and relationships, plus functional and mandatory participation constraints. In particular, it allows arbitrary permutations of the roles in is-a among relationships. A key notion that ensures high tractability in ER+ schemata is separability, i.e., the absence of interaction between the functional participation constraints and the other constructs of ER+. We provide a precise syntactic characterization of separable ER+ schemata, called ER± schemata, by means of a necessary and sufficient condition. We present a complete complexity analysis of the conjunctive query answering problem under ER± schemata. We show that the addition of so-called negative constraints does not increase the complexity of query answering. With such constraints, our model properly generalizes the most widely adopted tractable ontology languages.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 347–361, 2010. © Springer-Verlag Berlin Heidelberg 2010
1 Introduction
Since the origins of the Entity-Relationship formalism, introduced in Chen's milestone paper [16], conceptual modeling has been playing a central role in database design. More recently, a renewed interest in conceptual modeling has arisen in the area of the Semantic Web, with applications in semantic information integration, data exchange, and web information systems. Logic-based formalisms are prominent in data modeling for the Semantic Web, in particular description logics (DLs) [14]. Most of the large corpus of research on DLs focuses on consistency and subsumption in knowledge bases. However, the attention has more recently shifted to tractable query answering under DL knowledge bases. As with databases, the complexity that matters here is data complexity, i.e., the complexity calculated taking only the data as input, while the query and the knowledge base are considered fixed. The DL-lite family [12,26] is a prominent family of DL languages that achieves low data complexity. More precisely, a query over a DL-lite knowledge base can be rewritten as a first-order query (and therefore also in SQL), and then evaluated directly against the initial data, thus obtaining the correct answer. Languages enjoying such a property are called first-order (FO) rewritable. In this paper, we consider an expressive Entity-Relationship formalism that we call ER+. Like the Entity-Relationship model, the ER+ formalism is flexible and expressive,
and at the same time it is well understood by database practitioners. ER+ allows, on top of the basic ER constructs [16], is-a constraints among entities and relationships, mandatory and functional participation of entities in relationships, and mandatory and functional attributes. Notice that we also allow for arbitrary permutations associated with is-a between two relationships; for example, we can assert that each instance ⟨a, b, c⟩ of a ternary relationship R1 is also an instance of another ternary relationship R2, but the three objects (instances of the participating entities) appear in the order ⟨c, a, b⟩. Such permutations in is-a constraints among relationships are used in DL languages, for instance, in DL-lite among binary roles (relationships). The addition of this feature raises the complexity of query answering, as we explain below. We first recall the semantics of the ER+ formalism [11] by means of a translation of an ER+ schema into a relational one with a class of constraints (a.k.a. dependencies) called conceptual dependencies (CDs); in particular, CDs are tuple-generating dependencies (TGDs) and key dependencies (KDs) (more precisely, the TGDs in a set of CDs are inclusion dependencies). We then address the problem of answering conjunctive queries (a.k.a. select-project-join queries) under ER+ schemata, providing a complete study of its data and combined complexity, the latter being the complexity calculated taking as input, together with the data, also the query and the schema. Summary of Contributions. The main property that ensures tractable query answering over an ER+ schema is separability, which occurs when the TGDs and KDs in the associated set of CDs do not interact. For separable sets of constraints, we can basically ignore the KDs and consider the TGDs only [9]. In Section 3, we give a syntactic condition, which we prove to be necessary and sufficient, for an ER+ schema to be separable, and therefore FO-rewritable.
Our condition hinges on a graph-based representation of an ER+ schema. We denote with ER± the language of ER+ schemata that satisfy our syntactic condition. ER± schemata are FO-rewritable, and conjunctive query answering can be done by rewriting the given query into a first-order one, which can therefore be directly translated into SQL. Notice also that the class of CDs associated to general ER+ schemata is not first-order rewritable [7].
In Section 4, we study the complexity of conjunctive query answering over ER± schemata. We prove that the problem is PSPACE-complete in combined complexity. The membership in PSPACE is shown by utilizing separability of ER± schemata, and the fact that query answering under inclusion dependencies is in PSPACE, established in a seminal paper by Johnson and Klug [20]. The hardness is proved by providing a reduction from the finite function generation problem, studied by Kozen in [21]. The data complexity is immediately in AC0, since ER± schemata are FO-rewritable.
Finally, in Section 5, similarly to what is done in [6], we enrich ER± schemata with negative constraints of the form ∀X ϕ(X) → ⊥, where ϕ(X) is a conjunction of atoms and ⊥ is the constant false, without increasing the complexity of query answering. Negative constraints serve to express several relevant constructs in ER± schemata, e.g., disjointness between entities and relationships, and non-participation of entities in relationships, but also more general ones. This extension allows us to show that our work properly generalizes several prominent and tractable formalisms for ontology reasoning, in particular the DL-lite family [12,26].
Query Answering under Expressive Entity-Relationship Schemata
Language   Arity of rel.   Perm.   Combined compl.              Data compl.
ER±        any             yes     PSPACE-complete (UB: [20])   AC0 (UB: [10])
ER±_i      any             no      NP-complete (LB: [15])       AC0 (UB: [10])
ER±_2      = 2             yes     NP-complete (LB: [15])       AC0 (UB: [10])
Fig. 1. Summary of complexity results
The table in Figure 1 summarizes our results together with others, which we could not include in this paper for space reasons. Results already known from the literature are detailed in the table (LB and UB stand for lower and upper bound, respectively). ER±_i is the language of separable ER+ schemata that do not admit arbitrary permutations in is-a among relationships, which was studied in [7]; ER±_2 is the language of separable ER+ schemata with only binary relationships.
Related work. The ER+ model is based on Chen’s Entity-Relationship model [16]. [19] considers an extended ER model providing arbitrary data types, optional and multivalued attributes, a concept for specialization and generalization, and complex structured entity types. In addition to the ER model, the extended model considered in [24] includes the generalization and full aggregation constructs. EER-convertible relational schemas, i.e., relational schemas with referential integrity constraints that can be associated with an extended ER schema, are considered in [23]. A logic-based semantics for the ER model was given in [3], while [2] studies reasoning tasks under variants of ER schemata. Query answering in data integration systems with mediated schemata expressed in a variant of the ER model, which is strictly less expressive than those treated in this paper, is studied in [4]. The formalism of [25], which also studies query answering, is instead more expressive than ER±, and it is not computationally tractable in the same way. Other works [13,20] consider query containment, which is tightly related to the query answering problem. [13] considers query containment in a formalism similar to ER+ but with less expressive negative constraints, and incomparable to it; the combined complexity is higher than the one under ER±, and data complexity is not studied.
The notion of separability between KDs and inclusion dependencies was first introduced in [10], together with a sufficient syntactic condition for it. The works on DL-lite [12,26] exhibit tractable query answering algorithms (in AC0 in data complexity) for different languages in the DL-lite family. Our formalism, with the addition of negative constraints, properly generalizes all languages in the DL-lite family (this can be shown in a way similar to that of [6]), while providing a query answering algorithm with the same data complexity. Recent works [5,6] deal with expressive rules (TGDs) that constitute the languages of the Datalog± family, which are capable of capturing the ER+ formalism presented here, if we consider TGDs only. The languages in the Datalog± family are more expressive (and less tractable) than ours, except for Linear Datalog±, which allows for query answering in AC0 in data complexity. However, ER± is not expressible in Linear Datalog±
(plus the class of KDs presented in [6]). Finally, [11] deals with general (not separable) ER+ schemata. A rewriting technique is presented, but the upper bound implied by it is significantly higher than the one for ER± .
2 Preliminaries
Relational Model and Dependencies, Queries and Chase. We define the following pairwise disjoint (infinite) sets of symbols: (i) a set Γ of constants (which constitute the “normal” domain of a database), and (ii) a set Γf of labeled nulls (used as placeholders for unknown values, and thus can also be seen as variables). Different constants represent different values (unique name assumption), while different nulls may represent the same value. A lexicographic order is defined on Γ ∪ Γf, such that every value in Γf follows all those in Γ. A relational schema R is a set of relational symbols (or predicates), each with its associated arity. We write r/n to denote that the predicate r has arity n. A position r[i] is identified by a predicate r ∈ R and its i-th attribute. A term t is a constant, null, or variable. An atomic formula (or atom) has the form r(t1, ..., tn), where r/n is a predicate, and t1, ..., tn are terms. For an atom a, we denote as dom(a) the set of its terms. This notation naturally extends to sets and conjunctions of atoms. Conjunctions of atoms are often identified with the sets of their atoms. A substitution is a function h : S1 → S2 defined as follows: (i) ∅ is a substitution (empty substitution), (ii) if h is a substitution then h ∪ {X → Y} is a substitution, where X ∈ S1 and Y ∈ S2, and h does not already contain some X → Z with Y ≠ Z. If X → Y ∈ h, then we write h(X) = Y. A homomorphism from a set of atoms A1 to a set of atoms A2, both over the same schema R, is a substitution h : dom(A1) → dom(A2) such that: (i) if t ∈ Γ, then h(t) = t, and (ii) if r(t1, ..., tn) is in A1, then h(r(t1, ..., tn)) = r(h(t1), ..., h(tn)) is in A2. If there are homomorphisms from A1 to A2 and vice-versa, then A1 and A2 are homomorphically equivalent. The notion of homomorphism naturally extends to conjunctions of atoms.
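The definition of homomorphism above can be made concrete with a small backtracking search. The following is only an illustrative sketch, not part of the formal development: the atom encoding (a predicate name paired with a tuple of terms), the convention that constants start with a lowercase letter while labeled nulls start with an underscore, and the function names `is_constant` and `find_homomorphism` are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple

# Assumed encoding: an atom is (predicate, terms).
Atom = Tuple[str, Tuple[str, ...]]


def is_constant(term: str) -> bool:
    # Assumed convention: constants start with a lowercase letter;
    # labeled nulls start with "_" and variables with an uppercase letter.
    return term[0].islower()


def find_homomorphism(
    a1: List[Atom], a2: List[Atom], h: Optional[Dict[str, str]] = None
) -> Optional[Dict[str, str]]:
    """Search for a homomorphism from the atoms a1 to the atoms a2.

    Returns the mapping on nulls/variables (constants are mapped to
    themselves implicitly), or None if no homomorphism exists.
    """
    h = dict(h or {})
    if not a1:
        return h
    (pred, terms), rest = a1[0], a1[1:]
    for pred2, terms2 in a2:
        if pred2 != pred or len(terms2) != len(terms):
            continue
        candidate = dict(h)
        compatible = True
        for s, t in zip(terms, terms2):
            if is_constant(s):
                # Condition (i): constants must be mapped to themselves.
                if s != t:
                    compatible = False
                    break
            elif candidate.setdefault(s, t) != t:
                # A substitution may not map the same term to two values.
                compatible = False
                break
        if compatible:
            result = find_homomorphism(rest, a2, candidate)
            if result is not None:
                return result
    return None


# Two small instances: the first contains labeled nulls, the second does not.
i1 = [("r", ("a", "_z1")), ("r", ("_z1", "_z2"))]
i2 = [("r", ("a", "a"))]
print(find_homomorphism(i1, i2))  # maps both nulls to the constant a
print(find_homomorphism(i2, i1))  # None: r(a, a) has no image in i1
```

In this toy case there is a homomorphism from `i1` to `i2` but not in the other direction, so the two instances are not homomorphically equivalent.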
A database (instance) D for a schema R is a (possibly infinite) set of atoms of the form r(t), where r/n ∈ R and t ∈ (Γ ∪ Γf)^n. We denote as r(D) the set {t | r(t) ∈ D}. A conjunctive query (CQ) q of arity n over a schema R, written as q/n, has the form q(X) = ∃Y ϕ(X, Y), where ϕ(X, Y) is a conjunction of atoms over R, X and Y are sequences of variables or constants, and the length of X is n. ϕ(X, Y) is called the body of q, denoted as body(q). The answer to such a CQ q over a database D, denoted as q(D), is the set of all n-tuples t ∈ Γ^n for which there exists a homomorphism h : X ∪ Y → Γ ∪ Γf such that h(ϕ(X, Y)) ⊆ D and h(X) = t. A tuple-generating dependency (TGD) σ over a schema R is a first-order formula ∀X ∀Y ϕ(X, Y) → ∃Z ψ(X, Z), where ϕ(X, Y) and ψ(X, Z) are conjunctions of atoms over R, called the body and the head of σ, denoted as body(σ) and head(σ), respectively. Henceforth, for notational convenience, we will omit the universal quantifiers in front of TGDs. Such σ is satisfied by a database D for R iff, whenever there exists a homomorphism h such that h(ϕ(X, Y)) ⊆ D, there exists an extension h′ of h, i.e., h′ ⊇ h, such that h′(ψ(X, Z)) ⊆ D. A key dependency (KD) κ over R is an assertion of the form key(r) = A, where r ∈ R and A is a set of attributes of r. Such κ is satisfied by a database D for R iff, for each pair of distinct tuples t1, t2 ∈ r(D),
t1[A] ≠ t2[A], where t[A] is the projection of tuple t over A. A negative constraint (NC) ν over R is a first-order formula ∀X ϕ(X) → ⊥, where ϕ(X) is a conjunction of atoms over R and ⊥ is the truth constant false. For conciseness of notation, we will omit the universal quantifiers in front of NCs. Such ν is satisfied by a database D iff there is no homomorphism h such that h(ϕ(X)) ⊆ D. As observed in [6], this is equivalent to the fact that the answer to the Boolean CQ¹ q() = ∃X ϕ(X) over D is the empty set; denote as qν the Boolean CQ q() = ∃X ϕ(X).
Given a database D for a schema R and a set of dependencies Σ over R, the answers we consider are those that are true in all models of D w.r.t. Σ, i.e., all databases that contain D and satisfy Σ. Formally, the models of D w.r.t. Σ, denoted as mods(D, Σ), is the set of all databases B such that B |= D ∪ Σ. The answer to a CQ q w.r.t. D and Σ, denoted as ans(q, D, Σ), is the set {t | t ∈ q(B) for each B ∈ mods(D, Σ)}. The decision problem associated to CQ answering under dependencies is the following: given a database D for R, a set Σ of dependencies over R, a CQ q/n over R, and an n-tuple t ∈ Γ^n, decide whether t ∈ ans(q, D, Σ). For the moment we put NCs aside and deal only with TGDs and KDs; we shall return to consider also NCs in Section 5.
The chase procedure (or chase) is a fundamental algorithmic tool introduced for checking implication of dependencies [22], and for checking query containment [20]. Informally, the chase is a process of repairing a database w.r.t. a set of dependencies so that the resulting database satisfies the dependencies. We shall use the term chase interchangeably for both the procedure and its result. The chase works on an instance through the so-called TGD and KD chase rules. The TGD chase rule comes in two different equivalent fashions: oblivious and restricted [5], where the restricted one repairs TGDs only when they are not satisfied.
In the sequel, we focus on the oblivious one for better technical clarity. The chase rules follow.
TGD CHASE RULE. Consider a database D for a schema R, and a TGD σ = ϕ(X, Y) → ∃Z ψ(X, Z) over R. If σ is applicable to D, i.e., there exists a homomorphism h such that h(ϕ(X, Y)) ⊆ D, then: (i) define h′ ⊇ h such that h′(Zi) = zi for each Zi ∈ Z, where zi ∈ Γf is a “fresh” labeled null not introduced before, and following lexicographically all those introduced so far, and (ii) add to D the set of atoms in h′(ψ(X, Z)), if not already in D.
KD CHASE RULE. Consider a database D for a schema R, and a KD κ of the form key(r) = A over R. If κ is applicable to D, i.e., there are two (distinct) tuples t1, t2 ∈ r(D) such that t1[A] = t2[A], then for each attribute i ∉ A of r such that t1[i] ≠ t2[i]: (i) if t1[i] and t2[i] are both constants of Γ, then there is a hard violation of κ and the chase fails; in this case mods(D, Σ) = ∅ and we say that D is inconsistent with Σ, (ii) otherwise, either replace each occurrence of t1[i] with t2[i], if the former follows lexicographically the latter, or vice-versa otherwise.
Given a database D and a set of dependencies Σ = ΣT ∪ ΣK, where ΣT are TGDs and ΣK are KDs, the chase algorithm for D and Σ consists of an exhaustive application of the chase rules in a breadth-first fashion, which leads to a (possibly infinite) database. Roughly, the chase of D w.r.t. Σ, denoted as chase(D, Σ), is the (possibly infinite) instance constructed by iteratively applying (i) the TGD chase rule once, and (ii) the KD chase rule as long as it is applicable (i.e., until a fixed point is reached). A formal definition of the chase algorithm is given in [5].
¹ A Boolean CQ has no variables in the head, and has only the empty tuple ⟨⟩ as possible answer, in which case it is said that the query has positive answer.
Example 1. Let R = {r, s}. Consider the set Σ of TGDs and KDs over R constituted by the TGDs σ1 = r(X, Y) → ∃Z r(Z, X), s(Z) and σ2 = r(X, Y) → r(Y, X), and the KD κ of the form key(r) = {2}. Let D be the database for R consisting of the single atom r(a, b). During the construction of chase(D, Σ) we first apply σ1, and we add the atoms r(z, a), s(z), where z is a “fresh” null of Γf. Moreover, σ2 is applicable and we add the atom r(b, a). Now, the KD κ is applicable and we replace each occurrence of z with the constant b; thus, we get the atom s(b). We continue by applying exhaustively the chase rules as described above. □
The (possibly infinite) chase for D and Σ is a universal model of D w.r.t. Σ, i.e., for each database B ∈ mods(D, Σ), there exists a homomorphism from chase(D, Σ) to B [18,17]. Using this fact it can be shown that the chase is a formal tool for query answering under TGDs and KDs. In particular, the answer to a CQ q/n w.r.t. D and Σ, in the case where the chase does not fail, can be obtained by evaluating q over chase(D, Σ), and discarding tuples containing at least one null [18]. In case the chase fails, recall that mods(D, Σ) = ∅, and thus ans(q, D, Σ) contains all tuples in Γ^n.
The ER+ Model. We now present the conceptual model we adopt in this paper, and we define it in terms of relational schemata with dependencies. This model, called ER+, incorporates the basic features of the ER model [16] and OO models, including subset (or is-a) constraints on both entities and relationships. An ER+ schema consists of a collection of entity, relationship, and attribute definitions over an alphabet of symbols, partitioned into entity, relationship and attribute symbols.
The model is similar to, e.g., the one in [11], and it can be summarized as follows: (i) entities and relationships can have attributes; an attribute can be mandatory (instances have at least one value for it), and functional (instances have at most one value for it), (ii) entities can participate in relationships; a participation of an entity E in a relationship R can be mandatory (instances of E participate at least once), and functional (instances of E participate at most once), and (iii) is-a relations can hold between entities and between relationships; assuming that both relationships have arity n, a permutation [i1, ..., in] of the set {1, ..., n} specifies the correspondence between the components of the two relationships in the is-a constraint. Henceforth, for brevity, given an integer n > 0 we will denote by [n] the set {1, ..., n}. We refer the interested reader to [11] for further details. In what follows we give an example, taken from [7], of an ER+ schema.
Example 2. The schema in Figure 2, based on the usual ER graphical notation, describes members of a university department working in research groups. The is-a constraints specify that Ph.D. students and professors are members, and that each professor works in the same group that (s)he leads. The cardinality constraint (1, N) on the participation of Group in Works_in, for instance, specifies that each group has at least one member and no maximum number of members (symbol N). The participating entities to each relationship are numbered (each number identifies a component). The permutation [1, 2], labeling the arrow that denotes the is-a constraint among Leads and Works_in, implies that the is-a constraint holds considering the components in the same order. □
[Figure 2 (ER diagram, not reproduced): entities Member, Phd_student, Professor, Group; relationships Works_in and Leads, each with components numbered 1 and 2; attributes memb_name, stud_gpa, gr_name, since; cardinality labels (1, 1), (1, N), (0, 1); is-a arrow from Leads to Works_in labeled [1, 2].]
Fig. 2. ER+ Schema for Example 2
Table 1. Derivation of relational dependencies from an ER+ schema
ER+ Construct                                   Relational Constraint
attribute A for an entity E                     a(X, Y) → e(X)
attribute A for a relationship R                a(X1, ..., Xn, Y) → r(X1, ..., Xn)
rel. R with entity E as i-th component          r(X1, ..., Xn) → e(Xi)
mandatory attribute A of entity E               e(X) → ∃Y a(X, Y)
mandatory attribute A of relationship R         r(X1, ..., Xn) → ∃Y a(X1, ..., Xn, Y)
functional attribute A of an entity             key(a) = {1} (a has arity 2)
functional attribute A of a relationship        key(a) = {1, ..., n} (a has arity n + 1)
is-a between entities E1 and E2                 e1(X) → e2(X)
is-a between relationships R1 and R2 where      r1(X1, ..., Xn) → r2(Xi1, ..., Xin)
  components 1, ..., n of R1 correspond to
  components i1, ..., in of R2
mandatory part. of E in R (i-th comp.)          e(Xi) → ∃X r(X1, ..., Xn)
functional part. of E in R (i-th comp.)         key(r) = {i}
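Table 1 can be read as a translation procedure from ER+ constructs to dependencies. The sketch below covers only some of its rows (entity attributes, participation, and is-a) and emits the dependencies as plain strings. The helper names, the textual syntax (`->`, `exists`), and the reading of a cardinality (1, 1) as "mandatory and functional" are assumptions made for illustration; they are not notation fixed by the paper.

```python
from typing import List


def variables(n: int) -> List[str]:
    """Return the variable names X1, ..., Xn."""
    return [f"X{i}" for i in range(1, n + 1)]


def atom(pred: str, terms: List[str]) -> str:
    return f"{pred}({', '.join(terms)})"


def entity_attribute(attr: str, entity: str,
                     mandatory: bool = False, functional: bool = False) -> List[str]:
    """Rows 'attribute A for an entity E' plus its mandatory/functional variants."""
    cds = [f"{attr}(X, Y) -> {entity}(X)"]
    if mandatory:
        cds.append(f"{entity}(X) -> exists Y {attr}(X, Y)")
    if functional:
        cds.append(f"key({attr}) = {{1}}")
    return cds


def participation(rel: str, arity: int, entity: str, i: int,
                  mandatory: bool = False, functional: bool = False) -> List[str]:
    """Rows for entity E as i-th component of R, with optional constraints."""
    xs = variables(arity)
    xi = xs[i - 1]
    cds = [f"{atom(rel, xs)} -> {atom(entity, [xi])}"]
    if mandatory:
        others = [x for x in xs if x != xi]
        prefix = f"exists {', '.join(others)} " if others else ""
        cds.append(f"{atom(entity, [xi])} -> {prefix}{atom(rel, xs)}")
    if functional:
        cds.append(f"key({rel}) = {{{i}}}")
    return cds


def isa_entities(sub: str, sup: str) -> List[str]:
    return [f"{sub}(X) -> {sup}(X)"]


def isa_relationships(sub: str, sup: str, perm: List[int]) -> List[str]:
    """Row 'is-a between relationships' with permutation [i1, ..., in]."""
    xs = variables(len(perm))
    return [f"{atom(sub, xs)} -> {atom(sup, [xs[j - 1] for j in perm])}"]


# Part of the schema of Example 2 (assuming Member participates in
# Works_in, component 1, in a mandatory and functional way).
cds = (
    participation("works_in", 2, "member", 1, mandatory=True, functional=True)
    + isa_entities("professor", "member")
    + isa_relationships("leads", "works_in", [1, 2])
)
for cd in cds:
    print(cd)
```

With a non-identity permutation, `isa_relationships("r1", "r2", [3, 1, 2])` yields `r1(X1, X2, X3) -> r2(X3, X1, X2)`, which corresponds to the situation where an instance ⟨a, b, c⟩ of R1 appears in R2 in the order ⟨c, a, b⟩.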
The semantics of an ER+ schema C is defined by associating a relational schema R to it, and then specifying when a database for R satisfies all the constraints imposed by the constructs of C. We first define the relational schema that represents the so-called concepts, i.e., entities, relationships and attributes, of an ER+ schema C as follows: (i) each entity E has an associated predicate e/1, (ii) each attribute A of an entity E has an associated predicate a/2, (iii) each relationship R of arity n has an associated predicate r/n, and (iv) each attribute A of a relationship R of arity n has an associated predicate a/(n + 1). Intuitively, e(c) asserts that c is an instance of entity E. a(c, d) asserts that d is the value of attribute A (of some entity E) associated to c, where c is an instance of E. r(c1, ..., cn) asserts that ⟨c1, ..., cn⟩ is an instance of relationship R (among entities E1, ..., En), where c1, ..., cn are instances of E1, ..., En, respectively. Finally, a(c1, ..., cn, d) asserts that d is the value of attribute A (of some relationship R of arity n) associated to the instance ⟨c1, ..., cn⟩ of R. Once we have defined the relational schema R associated to an ER+ schema C, we give the semantics of each construct of C. We do that by using the relational dependencies, introduced above, as shown in Table 1 (where we assume that the relationships are of arity n). The dependencies that we obtain from an ER+ schema are called conceptual dependencies [11]. In particular, let C be an ER+ schema and let R be the relational schema associated to C. The set Σ of dependencies over R obtained from C (according
to Table 1) is called the set of conceptual dependencies (CDs) associated to C. In general, a set of CDs associated to an ER+ schema consists of key and inclusion dependencies², where the latter are a special case of TGDs. Henceforth, when using the term TGD, we shall refer to TGDs that are part of a set of CDs (the results of this paper do not hold in case of general TGDs). A conjunctive query over an ER+ schema C is a query over the relational schema R associated to C. Given a database D for R, the answer to q w.r.t. D and C is the set ans(q, D, Σ), where Σ is the set of CDs over R associated to C.
Example 3. Consider the ER+ schema C given in Example 2. The relational schema R associated to C consists of member/1, phd_student/1, professor/1, group/1, works_in/2, leads/2, memb_name/2, stud_gpa/2, gr_name/2 and since/3. Suppose that we want to know the names of the professors who work in the DB group since 1998. The corresponding CQ is q(B) = ∃A ∃C professor(A), memb_name(A, B), works_in(A, C), since(A, C, 98), gr_name(C, db). □
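To make the query of Example 3 concrete, the sketch below evaluates it over a small, entirely hypothetical database by joining the body atoms one at a time. It computes only the plain answer q(D) from Section 2, not ans(q, D, Σ), so no dependencies are taken into account. The function names, the data, and the convention that variables start with an uppercase letter (and nulls with an underscore) are assumptions of this example.

```python
from typing import Dict, List, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]


def is_variable(term: str) -> bool:
    # Assumed convention: variables start with an uppercase letter.
    return term[0].isupper()


def matches(term: str, value: str, binding: Dict[str, str]) -> bool:
    """Bind a variable to a value, or compare a constant with the value."""
    if is_variable(term):
        return binding.setdefault(term, value) == value
    return term == value


def evaluate_cq(head: Tuple[str, ...], body: List[Atom],
                db: List[Atom]) -> Set[Tuple[str, ...]]:
    """Evaluate a CQ over db by extending variable bindings atom by atom."""
    bindings: List[Dict[str, str]] = [{}]
    for pred, terms in body:
        extended = []
        for binding in bindings:
            for fact_pred, values in db:
                if fact_pred != pred or len(values) != len(terms):
                    continue
                candidate = dict(binding)
                if all(matches(t, v, candidate) for t, v in zip(terms, values)):
                    extended.append(candidate)
        bindings = extended
    answers = {tuple(b[v] for v in head) for b in bindings}
    # Answers are tuples of constants: discard any tuple containing a null.
    return {t for t in answers if not any(x.startswith("_") for x in t)}


# Hypothetical data for the schema of Example 3.
db = [
    ("professor", ("p1",)), ("professor", ("p2",)),
    ("memb_name", ("p1", "alice")), ("memb_name", ("p2", "bob")),
    ("works_in", ("p1", "g1")), ("works_in", ("p2", "g2")),
    ("since", ("p1", "g1", "98")), ("since", ("p2", "g2", "05")),
    ("gr_name", ("g1", "db")), ("gr_name", ("g2", "ai")),
]

# q(B) = exists A, C: professor(A), memb_name(A, B), works_in(A, C),
#                     since(A, C, 98), gr_name(C, db)
body = [
    ("professor", ("A",)),
    ("memb_name", ("A", "B")),
    ("works_in", ("A", "C")),
    ("since", ("A", "C", "98")),
    ("gr_name", ("C", "db")),
]
print(evaluate_cq(("B",), body, db))
```

Over this data only the professor named `alice` satisfies all five atoms, so she is the single answer.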
3 Separable ER+ Schemata
In this section we introduce a novel conceptual model, called ER±, obtained by applying certain syntactic restrictions on the ER+ model presented in the previous section. Intuitively, given an ER± schema C, the TGDs and KDs in the set of CDs associated to C do not interact. This implies that answers to queries can be computed by considering the TGDs only, and ignoring the KDs, once it is known that the chase does not fail. This semantic property, whose definition is given below, is known as separability [9,6]. In the rest of the paper, for notational convenience, given a set Σ of CDs we will denote as ΣT and ΣK the set of TGDs and KDs in Σ, respectively.
Definition 1. Consider an ER+ schema C, and let Σ be the set of CDs over a schema R associated to C. C is said to be separable iff for every instance D for R, either chase(D, Σ) fails or for every CQ q over R, ans(q, D, Σ) = ans(q, D, ΣT).
Before defining ER± schemata, we need some preliminary definitions.
Definition 2. Consider an ER+ schema C, and let Σ be the set of CDs over a schema R associated to C. The CD-graph for C is a triple ⟨V, E, λ⟩, where V is the node set, E is the edge set, and λ is a labeling function E → ΣT. The node set V is the set of positions in R. If there is a TGD σ ∈ ΣT such that the same variable appears at position pb in body(σ) and at position ph in head(σ), then in E there is an edge e = pb → ph with λ(e) = σ. A node corresponding to a position derived from an entity (resp., a relationship) is called an e-node (resp., an r-node). Moreover, an r-node corresponding to a position which is a unary key in a relationship is called a k-node.
² Inclusion dependencies are TGDs with just one atom in the body and one in the head, with no repetition of variables either in the body or in the head [1].
[Figure 3 (CD-graph, not reproduced): nodes member[1], phd_student[1], professor[1], works_in[1], works_in[2], leads[1], leads[2], group[1].]
Fig. 3. CD-graph for Example 4
Let G be the CD-graph for an ER+ schema C, where Σ is the set of CDs over a schema R associated to C. Consider an edge u → v in G which is labeled by the TGD r1(X1, ..., Xn) → r2(Xj1, ..., Xjn), where [j1, ..., jn] is a permutation of the set [n]. Intuitively, the permutation [j1, ..., jn] indicates that the ji-th component of the relationship R1 is the i-th component of the relationship R2. This can be represented by the bijective function f_{u→v} : [n] → [n], where f_{u→v}(ji) = i, for each i ∈ [n]. Now, consider a cycle C = v1 → v2 → ... → vm → v1 of only r-nodes in G. The permutation associated to C, denoted as πG(C), is defined as the permutation g([1, ..., n]) = [g(1), ..., g(n)], where g = f_{vm→v1} ∘ ... ∘ f_{v2→v3} ∘ f_{v1→v2}. We are now ready to define ER± schemata.
Definition 3. Consider an ER+ schema C. Let Σ be the set of CDs over a schema R associated to C, and G be the CD-graph for C. C is an ER± schema iff for each path v1 → v2 → ... → vm, for m ≥ 2, in G such that v1 is an e-node, v2, ..., vm−1 are r-nodes, and vm is a k-node, the two following conditions are satisfied: (i) for each cycle C of only r-nodes in G going through vm, πG(C) = [1, ..., n], where n is the arity of the predicate of vm (recall that vm is a position of R), and (ii) if m ≥ 3, then there exists a path of only r-nodes from vm to v2.
Example 4. Consider the ER+ schema C obtained from the one given in Example 2 by ignoring the attributes. The relational schema R associated to C is given in Example 3. The CD-graph for C is depicted in Figure 3, where the k-nodes are shaded; for clarity, the labels of the edges are not presented. It is easy to verify, according to Definition 3, that C is an ER± schema. □
In what follows we give a more involved example of an ER± schema, where a cycle of only r-nodes, going through a k-node, occurs in its CD-graph, and also arbitrary permutations are used in is-a constraints among relationships.
Example 5.
Consider the ER+ schema C depicted in Figure 4. The relational schema R associated to C consists of e1/1, ..., e9/1, r1/3, r2/3, r3/3. It is not difficult to verify that C = r1[1] → r3[3] → r2[3] → r1[1] is the only cycle of only r-nodes in the CD-graph G for C going through the k-node r1[1]. By defining g = f_{r2[3]→r1[1]} ∘ f_{r3[3]→r2[3]} ∘ f_{r1[1]→r3[3]}, we get that πG(C) = [g(1), g(2), g(3)] = [1, 2, 3], and thus the first condition in Definition 3 is satisfied. Moreover, due to the edge r1[1] → r3[3], the second condition in Definition 3 is also satisfied. Hence, C is an ER± schema. □
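The computation of πG(C) in Example 5 can be replayed mechanically. The sketch below builds the function f for each edge from the permutation of its labeling TGD and composes these functions along the cycle. The function names are hypothetical, and the assignment of the three permutations to the is-a constraints (R1 to R3 with [2, 3, 1], R3 to R2 with [2, 1, 3], R2 to R1 with [3, 2, 1]) is an inference from the cycle stated in Example 5 and the labels of Figure 4, not something spelled out in the text.

```python
from typing import Dict, List


def edge_function(perm: List[int]) -> Dict[int, int]:
    """Function f for an edge labeled r1(X1..Xn) -> r2(Xj1..Xjn).

    perm is [j1, ..., jn]; by definition f(j_i) = i.
    """
    return {j: i for i, j in enumerate(perm, start=1)}


def cycle_permutation(perms_along_cycle: List[List[int]]) -> List[int]:
    """Compose the edge functions in the order the cycle is traversed.

    For edges e1, ..., em the result is [g(1), ..., g(n)] with
    g = f_em o ... o f_e2 o f_e1.
    """
    n = len(perms_along_cycle[0])
    g = {k: k for k in range(1, n + 1)}
    for perm in perms_along_cycle:
        f = edge_function(perm)
        g = {k: f[g[k]] for k in g}
    return [g[k] for k in range(1, n + 1)]


# Cycle r1[1] -> r3[3] -> r2[3] -> r1[1] of Example 5, with the
# permutations assumed above for R1 is-a R3, R3 is-a R2, R2 is-a R1.
example5_cycle = [[2, 3, 1], [2, 1, 3], [3, 2, 1]]
print(cycle_permutation(example5_cycle))
```

Under these assumptions the composition is the identity [1, 2, 3], in agreement with the first condition of Definition 3. A cycle whose composition is not the identity, such as `cycle_permutation([[2, 1], [1, 2]])`, would violate that condition if it went through a k-node reached by a path of the form described in Definition 3.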
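[Figure 4 (ER diagram, not reproduced): entities E1, ..., E9; ternary relationships R1, R2, R3 with components numbered 1 to 3; cardinality labels (0, 1) and (1, N); is-a arrows among relationships labeled [3, 2, 1], [2, 3, 1] and [2, 1, 3].]
Fig. 4. ER± schema for Example 5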
The following example shows that during the construction of the chase of a database w.r.t. a set of CDs associated to an ER± schema, a KD may be violated.
Example 6. Consider the ER+ schema C obtained from the one given in Example 2 by ignoring the attributes. Suppose that Σ is the set of CDs associated to C. Let D = {professor(p), leads(p, g)}. During the construction of chase(D, Σ), we add the atoms member(p), works_in(p, g) and works_in(p, z), where z ∈ Γf. The KD key(works_in) = {1} in Σ implies that p cannot participate more than once in Works_in as the first component. Thus, we deduce that z = g. We must therefore replace each occurrence of z with g in the part of chase(D, Σ) constructed so far. □
We continue to establish that every ER± schema is separable.
Theorem 1. Consider an ER+ schema C. If C is an ER± schema, then it is separable.
Proof (sketch). Let Σ be the set of CDs over a schema R associated to C, and let D be a database for R such that chase(D, Σ) does not fail. It holds that chase(D, Σ) and chase(D, ΣT) are homomorphically equivalent. Thus, for every CQ q over R we have that q(chase(D, Σ)) and q(chase(D, ΣT)) coincide, and the claim follows. □
We now show that for an ER+ schema the property of being an ER± schema is not only sufficient for separability, but also necessary. This way we precisely characterize the class of separable ER+ schemata.
Theorem 2. Consider an ER+ schema C. If C is not an ER± schema, then it is not separable.
Proof (sketch). We prove this result by exhibiting a database D and a Boolean CQ q such that chase(D, Σ) does not fail, and ⟨⟩ ∈ ans(q, D, Σ) but ⟨⟩ ∉ ans(q, D, ΣT), where Σ is the set of CDs associated to C. □
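The unification step of Example 6 can be sketched in code as an application of the KD chase rule of Section 2. This is a minimal illustration under stated assumptions, not the full chase algorithm: it handles a single key dependency, it represents labeled nulls as strings starting with an underscore, and the names `ChaseFailure`, `find_unification` and `apply_kd` are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Atom = Tuple[str, Tuple[str, ...]]


class ChaseFailure(Exception):
    """Raised on a hard violation: two distinct constants would be equated."""


def is_null(term: str) -> bool:
    # Assumed convention: labeled nulls start with "_".
    return term.startswith("_")


def find_unification(db: Set[Atom], pred: str,
                     key: List[int]) -> Optional[Tuple[str, str]]:
    """Find one pair (old, new) of terms that key(pred) = key forces equal."""
    tuples = sorted(args for p, args in db if p == pred)
    for i, t1 in enumerate(tuples):
        for t2 in tuples[i + 1:]:
            if any(t1[k - 1] != t2[k - 1] for k in key):
                continue  # the two tuples differ on the key: no violation
            for a, b in zip(t1, t2):
                if a == b:
                    continue
                if not is_null(a) and not is_null(b):
                    raise ChaseFailure(f"hard violation of key({pred}): {a} vs {b}")
                # Replace a null by a constant, or the later null by the earlier.
                if is_null(a) and (not is_null(b) or a > b):
                    return a, b
                return b, a
    return None


def apply_kd(db: Set[Atom], pred: str, key: List[int]) -> Set[Atom]:
    """Apply the KD chase rule for key(pred) = key until a fixed point."""
    db = set(db)
    while True:
        step = find_unification(db, pred, key)
        if step is None:
            return db
        old, new = step
        db = {(p, tuple(new if x == old else x for x in args)) for p, args in db}


# Fragment of the chase described in Example 6, before the KD is applied.
fragment = {
    ("professor", ("p",)), ("leads", ("p", "g")), ("member", ("p",)),
    ("works_in", ("p", "g")), ("works_in", ("p", "_z")),
}
print(sorted(apply_kd(fragment, "works_in", [1])))
```

After the step, the null `_z` has been replaced by `g` everywhere and the two works_in atoms collapse into one. Calling `apply_kd` on a database containing both works_in(p, g) and works_in(p, h), with g and h constants, raises `ChaseFailure`, mirroring a hard violation.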
Note that results analogous to Theorems 1 and 2 hold also for ER±_i schemata (where only the identity permutation can be used in is-a constraints between relationships), and also for ER±_2 schemata (with binary relationships only) [7]. Observe that for these two classes of ER+ schemata, the first condition in the definition of ER± schemata is satisfied trivially.
4 Query Answering under ER± Schemata
In this section we investigate the data and combined complexity of the (decision) problem of conjunctive query answering under ER± schemata.
Data Complexity. As we shall see, once we know that the chase does not fail, the problem under consideration is in the highly tractable class AC0 in data complexity. We establish this by showing that the class of CDs associated to ER± schemata is first-order rewritable [9,6]. A class T of TGDs is first-order rewritable, henceforth abbreviated as FO-rewritable, iff for every set Σ of TGDs in T, and for every CQ q/n, there exists a first-order query qΣ such that, for every database D and an n-tuple t ∈ Γ^n, t ∈ ans(q, D, Σ) iff t ∈ qΣ(D). Since answering first-order queries is in the class AC0 in data complexity [27], it follows that for FO-rewritable TGDs, CQ answering is in AC0 in data complexity.
Theorem 3. Consider an ER± schema C, and let Σ be the set of CDs over a schema R associated to C. Consider a database D for R, a CQ q/n over R, and an n-tuple t ∈ Γ^n. If chase(D, Σ) does not fail, then the problem whether t ∈ ans(q, D, Σ) is in AC0 in data complexity.
Proof. Since chase(D, Σ) does not fail, from Theorem 1 we get that the problem whether t ∈ ans(q, D, Σ) is equivalent to the problem whether t ∈ ans(q, D, ΣT). Recall that ΣT is a set of inclusion dependencies. Since the class of inclusion dependencies is FO-rewritable [9], the claim follows. □
It is important to clarify that the above result does not give the exact upper bound for the data complexity of the problem under consideration. This is because we assume that the chase does not fail. Thus, we need also to determine the data complexity of the problem whether the chase fails. This will be studied in the last paragraph of this section. Note that the general class of CDs associated to ER+ schemata is not FO-rewritable. This was established in [7] by providing a counterexample ER+ schema and a Boolean CQ such that no first-order rewriting exists for the query.
Combined Complexity. We now focus on the combined complexity. We establish, provided that the chase does not fail, that the problem of CQ answering under ER± schemata is PSPACE-complete.
Theorem 4. Consider an ER± schema C, and let Σ be the set of CDs over a schema R associated to C. Consider a database D for R, a CQ q/n over R, and an n-tuple t ∈ Γ^n. If chase(D, Σ) does not fail, then the problem whether t ∈ ans(q, D, Σ) is in PSPACE in combined complexity. The problem is PSPACE-hard in combined complexity.
Proof (sketch). By Theorem 1 we get that t ∈ ans(q, D, Σ) iff t ∈ ans(q, D, ΣT), where ΣT is a set of inclusion dependencies. Membership in PSPACE follows from the fact that CQ answering under inclusion dependencies is in PSPACE in combined complexity [20]. PSPACE-hardness is established by providing a reduction from a PSPACE-hard problem called finite function generation³ [21]. □
As for Theorem 3, it is important to say that the above result does not provide the exact upper bound for the combined complexity of the problem under consideration, since we assume that the chase does not fail. The problem whether the chase fails is the subject of the next paragraph.
Chase Failure. We show that the problem whether the chase fails is tantamount to the CQ answering problem (provided that the chase does not fail). This is established by exploiting a technique proposed in [6]. A union of conjunctive queries Q of arity n is a set of CQs, where each q ∈ Q has the same arity n and uses the same predicate symbol in the head. The answer to Q over a database D, denoted as Q(D), is defined as the set {t | there exists q ∈ Q such that t ∈ q(D)}.
Lemma 1. Consider an arbitrary (resp., a fixed) ER± schema C. Let Σ be the set of CDs over a schema R associated to C, and let D be a database for R. There exist a database D′ and a union of Boolean CQs Q, that can be constructed in PTIME (resp., in AC0), such that chase(D, Σ) fails iff ⟨⟩ ∈ Q(chase(D′, ΣT)).
The following complexity characterization of the CQ answering problem under ER± schemata follows immediately from Theorems 3 and 4, and Lemma 1.
Corollary 1. CQ answering under ER± schemata is in AC0 in data complexity, and it is PSPACE-complete in combined complexity.
Note that for ER±_i and ER±_2 schemata, conjunctive query answering is NP-complete in combined complexity; for more details we refer the reader to [8].
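The finite function generation problem used in the hardness proof asks whether a given bijection f on a finite set S can be obtained by composing functions from a given finite set F of bijections on S. The sketch below decides this by naive breadth-first enumeration of all compositions. It is meant only to illustrate the problem statement: it may visit up to |S|! functions and is not the polynomial-space procedure behind the complexity bound. The tuple encoding of functions on S = {0, ..., n − 1} and the names `compose` and `generated_by` are assumptions of this example.

```python
from collections import deque
from typing import List, Tuple

# A function on S = {0, ..., n-1} is encoded as a tuple p with p[x] = image of x.
Func = Tuple[int, ...]


def compose(f: Func, g: Func) -> Func:
    """Return f o g, i.e., the function mapping x to f(g(x))."""
    return tuple(f[g[x]] for x in range(len(g)))


def generated_by(generators: List[Func], target: Func) -> bool:
    """Decide whether target is a composition of one or more generators."""
    seen = set(generators)
    queue = deque(seen)
    while queue:
        h = queue.popleft()
        if h == target:
            return True
        for g in generators:
            c = compose(g, h)
            if c not in seen:
                seen.add(c)
                queue.append(c)
    return False


# A 3-cycle on {0, 1, 2} generates its own square but no transposition.
three_cycle = (1, 2, 0)
print(generated_by([three_cycle], (2, 0, 1)))
print(generated_by([three_cycle], (1, 0, 2)))
```

In this toy instance the first call succeeds because (2, 0, 1) is the 3-cycle composed with itself, while the second fails because the compositions of a single 3-cycle are only the three rotations.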
³ Consider a finite set F of (bijective) functions from a set S to itself, and another (bijective) function f : S → S. Decide whether f can be obtained by composing functions in F.
5 Adding Negative Constraints
In this section we show how we can utilize NCs in order to express several relevant constructs in ER+ schemata, e.g., disjointness between entities and relationships, and non-participation of entities in relationships, but also more general ones. The new formalism that arises is called ER+_⊥. In particular, an ER+_⊥ schema C is an ER+ schema with an additional set Σ⊥ of NCs over the relational schema associated to C. If C is also an ER± schema, then it is called ER±_⊥.
Example 7. Consider the ER± schema C obtained from the one in Example 2 (see Figure 2) by adding the entity Pension_scheme and a relationship Enrolled among Member and Pension_scheme, without cardinality constraints. The fact that students and professors are disjoint sets can be expressed by the NC phd_student(X), professor(X) → ⊥ (entity disjointness). Moreover, the fact that a student cannot be enrolled in a pension scheme (i.e., it does not participate in Enrolled) can be represented by the NC phd_student(X), enrolled(X, Y) → ⊥ (non-participation of an entity in a relationship). □
As in the case without NCs, a conceptual schema is separable if the answers to queries can be computed by considering the TGDs only, and ignoring the KDs and the NCs, once it is known that the chase does not fail, and also satisfies the set of NCs.
Definition 4. Consider an ER+_⊥ schema C. Let Σ and Σ⊥ be the set of CDs and NCs, respectively, over a schema R associated to C. C is separable iff for every instance D for R, we have that either chase(D, Σ) fails, or chase(D, Σ) ⊭ Σ⊥, or for every CQ q over R, ans(q, D, Σ) = ans(q, D, ΣT).
It is straightforward to see that given an ER+_⊥ schema C, the fact that C is also an ER±_⊥ schema is sufficient for separability. This follows immediately from Theorem 1. The interesting question that comes up is whether this fact is also necessary for separability of ER+_⊥ schemata. It is not difficult to construct an ER+_⊥ schema which is not an ER±_⊥ schema, but it is separable. This implies that in general the answer to the above question is negative. We are interested in identifying particular cases where the above question is answered positively. We assert that one such case is if we consider strongly consistent ER+_⊥ schemata [2]; for more details on this issue we refer the reader to [8]. The next result implies that the addition of NCs does not alter the computational complexity of the CQ answering problem.
Theorem 5. CQ answering under ER±_⊥ schemata is in AC0 in data complexity, and is PSPACE-complete in combined complexity.
It is important to say that ER±⊥ schemata, where only binary relationships are allowed, are general enough to capture the main ontology languages in the DL-lite family, in particular DL-lite_F and DL-lite_R [12,26], while keeping the same, highly tractable data complexity. For the translation of DL-lite_F and DL-lite_R into ER±⊥ schemata we refer the interested reader to [8].
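The two NCs of Example 7 can be illustrated on a small instance. The following is only a minimal sketch: the relation contents and the helper names (`first_components`, `nc_violations`) are hypothetical, and the check is run directly on a finite set of facts, whereas in the formal development the NCs are evaluated over chase(D, Σ). It only handles NCs whose body atoms share the variable X in their first position, which covers entity disjointness and non-participation.

```python
# Toy instance (hypothetical data): relation name -> set of tuples.
database = {
    "phd_student": {("alice",), ("bob",)},
    "professor": {("carol",)},
    "enrolled": {("carol", "scheme1")},
}


def first_components(db, relation):
    """Constants occurring in the first position of a relation."""
    return {t[0] for t in db.get(relation, set())}


def nc_violations(db, body):
    """Constants X witnessing a violation of an NC r1(X, ...), r2(X, ...) -> ⊥,
    where all body atoms share X in their first position."""
    witnesses = None
    for relation in body:
        members = first_components(db, relation)
        witnesses = members if witnesses is None else witnesses & members
    return witnesses or set()


# phd_student(X), professor(X) -> ⊥   (entity disjointness)
disjointness = ("phd_student", "professor")
# phd_student(X), enrolled(X, Y) -> ⊥ (non-participation)
non_participation = ("phd_student", "enrolled")

print(nc_violations(database, disjointness))
print(nc_violations(database, non_participation))
```

An empty result means that the instance satisfies the NC; any returned constant is a witness of a violation.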
6 Conclusions

In this paper we have identified, by means of a graph-based representation, a class of ER+ schemata, called ER±, for which query answering is tractable, and more precisely in AC0 in data complexity, and PSPACE-complete in combined complexity. A key tool in our complexity analysis is the notion of separability, for which we have provided a precise characterization in terms of a necessary and sufficient syntactic condition. We have also shown that NCs can be added to ER± schemata, without increasing the data and combined complexity of query answering. The class of ER±⊥ schemata is general enough to capture the most widely adopted conceptual modeling and knowledge representation formalisms; in particular, it is strictly more expressive than the languages in the DL-lite family, which is prominent in the field of ontological reasoning.

Acknowledgments. This work was supported by the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013)/ERC grant no. 246858 DIADEM. The authors also acknowledge support by the EPSRC project "Schema Mappings and Automated Services for Data Integration and Exchange" (EP/E010865/1). Georg Gottlob's work was also supported by a Royal Society Wolfson Research Merit Award.
References

1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison-Wesley, Reading (1995)
2. Artale, A., Calvanese, D., Kontchakov, R., Ryzhikov, V., Zakharyaschev, M.: Reasoning over extended ER models. In: Parent, C., Schewe, K.-D., Storey, V.C., Thalheim, B. (eds.) ER 2007. LNCS, vol. 4801, pp. 277–292. Springer, Heidelberg (2007)
3. Di Battista, G., Lenzerini, M.: A deductive method for entity-relationship modeling. In: Proc. of VLDB, pp. 13–21 (1989)
4. Calì, A., Calvanese, D., De Giacomo, G., Lenzerini, M.: Accessing data integration systems through conceptual schemas. In: Kunii, H.S., Jajodia, S., Sølvberg, A. (eds.) ER 2001. LNCS, vol. 2224, pp. 270–284. Springer, Heidelberg (2001)
5. Calì, A., Gottlob, G., Kifer, M.: Taming the infinite chase: Query answering under expressive relational constraints. In: Proc. of KR, pp. 70–80 (2008)
6. Calì, A., Gottlob, G., Lukasiewicz, T.: A general datalog-based framework for tractable query answering over ontologies. In: Proc. of PODS, pp. 77–86 (2009)
7. Calì, A., Gottlob, G., Pieris, A.: Tractable query answering over conceptual schemata. In: Proc. of ER, pp. 175–190 (2009)
8. Calì, A., Gottlob, G., Pieris, A.: Query answering under expressive entity-relationship schemata. Unpublished manuscript, available from the authors (2010)
9. Calì, A., Lembo, D., Rosati, R.: On the decidability and complexity of query answering over inconsistent and incomplete databases. In: Proc. of PODS, pp. 260–271 (2003)
10. Calì, A., Lembo, D., Rosati, R.: Query rewriting and answering under constraints in data integration systems. In: Proc. of IJCAI, pp. 16–21 (2003)
11. Calì, A., Martinenghi, D.: Querying incomplete data over extended ER schemata. TPLP 10(3), 291–329 (2010)
12. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: The DL-lite family. J. Autom. Reasoning 39(3), 385–429 (2007)
13. Calvanese, D., De Giacomo, G., Lenzerini, M.: On the decidability of query containment under constraints. In: Proc. of PODS, pp. 149–158 (1998)
14. Calvanese, D., Lenzerini, M., Nardi, D.: Description logics for conceptual data modeling. In: Logics for Databases and Information Systems, pp. 229–263 (1998)
15. Chandra, A.K., Merlin, P.M.: Optimal implementation of conjunctive queries in relational data bases. In: Proc. of STOC, pp. 77–90 (1977)
16. Chen, P.P.: The entity-relationship model: Towards a unified view of data. ACM Trans. Database Syst. 1(1), 9–36 (1976)
17. Deutsch, A., Nash, A., Remmel, J.B.: The chase revisited. In: Proc. of PODS, pp. 149–158 (2008)
18. Fagin, R., Kolaitis, P.G., Miller, R.J., Popa, L.: Data exchange: Semantics and query answering. Theor. Comput. Sci. 336(1), 89–124 (2005)
19. Gogolla, M., Hohenstein, U.: Towards a semantic view of an extended entity-relationship model. ACM Trans. Database Syst. 16(3), 369–416 (1991)
20. Johnson, D.S., Klug, A.C.: Testing containment of conjunctive queries under functional and inclusion dependencies. J. Comput. Syst. Sci. 28(1), 167–189 (1984)
21. Kozen, D.: Lower bounds for natural proof systems. In: Proc. of FOCS, pp. 254–266 (1977)
22. Maier, D., Mendelzon, A.O., Sagiv, Y.: Testing implications of data dependencies. ACM Trans. Database Syst. 4(4), 455–469 (1979)
23. Markowitz, V.M., Makowsky, J.A.: Identifying extended entity-relationship object structures in relational schemas. IEEE Trans. Software Eng. 16(8), 777–790 (1990)
24. Markowitz, V.M., Shoshani, A.: Representing extended entity-relationship structures in relational databases: A modular approach. ACM Trans. Database Syst. 17(3), 423–464 (1992)
25. Ortiz, M., Calvanese, D., Eiter, T.: Characterizing data complexity for conjunctive query answering in expressive description logics. In: Proc. of AAAI (2006)
26. Poggi, A., Lembo, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Rosati, R.: Linking data to ontologies. J. Data Semantics 10, 133–173 (2008)
27. Vardi, M.Y.: On the complexity of bounded-variable queries. In: Proc. of PODS, pp. 266–276 (1995)
SQOWL: Type Inference in an RDBMS

P.J. McBrien, N. Rizopoulos, and A.C. Smith

Imperial College London⋆, 180 Queen's Gate, London, UK

Abstract. In this paper we describe a method to perform type inference over data stored in an RDBMS, where rules over the data are specified using OWL-DL. Since OWL-DL is an implementation of the Description Logic (DL) called SHOIN(D), we are in effect implementing a method for SHOIN(D) reasoning in relational databases. Reasoning may be broken down into two processes of classification and type inference. Classification may be performed efficiently by a number of existing reasoners, and since classification alters the schema, it need only be performed once for any given relational schema as a preprocessor of the schema before creation of a database schema. However, type inference needs to be performed for each data value added to the database, and hence needs to be more tightly coupled with the database system. We propose a technique to meet this requirement based on the use of triggers, which is the first technique to fully implement SHOIN(D) as part of normal transaction processing.
⋆ The work reported in this paper was funded by the Systems Engineering for Autonomous Systems (SEAS) Defence Technology Centre established by the UK Ministry of Defence.

J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 362–376, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

There is currently a growing interest in the development of systems that store and process large amounts of Semantic Web knowledge [6,14,17]. A common approach is to represent such knowledge as data in RDF tuples [1], together with rules in OWL-DL [10]. When large quantities of individuals in an ontology need to be processed efficiently, it is natural to consider that the individuals are held in a relational database management system (RDBMS), in which case we refer to the individuals as data, and make the unique name assumption (UNA). Hence, the question arises of how knowledge expressed in OWL-DL can be deployed in a relational database context, and take advantage of the RDBMS platforms in use today to process data in an ontology, and make inferences based on the open world assumption.

To illustrate the issues we address in this paper, consider a fragment from the terminology box (TBox) of the Wine Ontology [12] expressed in DL:

    Loire ≡ Wine ⊓ locatedIn:{LoireRegion}                                     (1)
    WhiteLoire ≡ Loire ⊓ WhiteWine                                             (2)
    WhiteLoire ⊑ ∀madeFromGrape.{CheninBlanc, PinotBlanc, SauvignonBlanc}      (3)
    ⊤ ⊑ ∀locatedIn⁻.Region                                                     (4)
    ⊤ ⊑ ∀madeFromGrape.Wine                                                    (5)
    ⊤ ⊑ ∀madeFromGrape⁻.WineGrape                                              (6)

To differentiate between classes and properties, classes start with an upper case letter, e.g. Wine. Properties start with a lower case letter, e.g. madeFromGrape. Individuals start with an upper case letter and appear inside curly brackets, e.g. {LoireRegion}. Obviously, there is a simple mapping from classes and properties in DL to unary and binary relations in an RDBMS. Thus, from the above DL statements we can infer a relational schema:

    Wine(id)         WineGrape(id)    madeFromGrape(domain,range)
    Loire(id)        Region(id)       locatedIn(domain,range)
    WhiteWine(id)    WhiteLoire(id)

Furthermore, each property has a domain and a range which can be restricted. For example, (5) could imply the foreign key madeFromGrape.domain → Wine.id and (6) could imply madeFromGrape.range → WineGrape.id. However, as it stands, the relational schema, with its closed world semantics, does not behave in the same manner as the open world semantics of the DL. For example, we can insert into the database the following facts (which in DL would be called the assertion box (ABox)):

    Loire(SevreEtMaineMuscadet)                                (7)
    WhiteWine(SevreEtMaineMuscadet)                            (8)
    madeFromGrape(SevreEtMaineMuscadet, PinotBlancGrape)       (9)

Using these rules, we have the following problems in the DBMS version of the ontology:

P1 SevreEtMaineMuscadet would not be a member of Wine, despite that being implied by TBox rule (1) from ABox rule (7), and by (5) from (9).
P2 SevreEtMaineMuscadet is not a member of WhiteLoire, despite that being implied by TBox rule (2) from ABox rules (7) and (8) together.
P3 Assertion of the property membership (9) would fail, since we have not previously asserted PinotBlancGrape as a member of WineGrape.

Performing classification using a reasoner on the TBox infers new subclass relationships (i.e. in relational terms, foreign keys), and such classification can partially solve these problems. In particular, classification would infer the following:

    Loire ⊑ Wine                          (10)
    Loire ⊑ locatedIn:{LoireRegion}       (11)
    WhiteLoire ⊑ Loire                    (12)
    WhiteLoire ⊑ WhiteWine                (13)

which will then allow us to infer additional foreign keys. For example, (10) would imply Loire.id → Wine.id. Now, the insert of data value SevreEtMaineMuscadet into Loire would be disallowed unless the data value was already in Wine. However, this does not capture the open world semantics of the DL statement. In type inference performed by a reasoner on the ABox, data values can be inserted into classes if their
presence is logically determined by the presence of data values in other classes. Hence in open world systems with type inference, you are allowed to insert SevreEtMaineMuscadet into Loire provided that the data value was either a member of Wine already, or could be inserted into Wine. Furthermore, type inference would determine that SevreEtMaineMuscadet should be a member of WhiteLoire based on TBox and ABox rules (2), (7) and (8).

In general, when performing type inference over data in an RDBMS, there is a choice between the reasoning being performed by a separate application outside the database, or being performed within the database system. Current implementations of the separate application approach all have the disadvantage that each change to the data requires the external application to reload the data and recompute type inference for all the individuals. Whilst data updates in semantic web applications might not be very intensive, even moderately sized databases will take a considerable time to be loaded and have type inference performed. For example, in our tests [9] of the SOR [7] system with the LUBM benchmark [14] containing 350,000 individuals, it took 16 seconds to load the set of individuals. Hence, each addition to a database of that size would require the data to be locked for 16 seconds whilst type inference is performed. It might be argued that it is possible to achieve tighter integration with an external application, such that a total reload is not required each time an insert is made. However, it will always be an inefficient process to keep the data in an external application synchronised with the contents of a DBMS.

Hence we investigate in this paper how type inference can be performed within an RDBMS. One previous approach, followed in DLDB2 [14], is to use views to compute inferred types, and for each relation have an intensional definition based on rules and an extensional definition with stored values.
An alternative approach studied in this paper is to use triggers to perform type inference as data values are inserted into the database. This approach has the advantage that since the classes are all materialised, query processing is much faster than when using the view based approach. Although there is a similarity between our work and materialised view maintenance [4,16], to implement OWL-DL we must allow a relation to be both a base table and have view statements to derive certain additional values. Our approach has the disadvantage of additional data storage for the materialised views, and additional time taken to insert data into the database. The prototype of our SQOWL approach is 1000 times faster than DLDB2 at query processing, but 10 times slower than DLDB2 at inserts on the LUBM case study. Since most database applications are query intensive rather than update intensive, there will be a large range of applications that would benefit from our approach.

The SQOWL approach presented in this paper is the first complete (in the sense of supporting all OWL-DL constructs) implementation of type inference for OWL-DL on data held in an RDBMS where the type inference is entirely performed within the RDBMS. Compared to previous reasoners for use on ontologies with large numbers of individuals, our approach has the following advantages (validated by experiments comparing our prototype with other implementations):

– In common with other rule based approaches [13,6], our approach to type inference is much more efficient than tableaux based reasoners [2], since we do not need to use a process of refutation to infer instances as being members of classes.
– Apart from SOR [7], we are the only rule based approach implementing the full SHOIN(D) DL [2] of OWL-DL. By fully implementing, we mean supporting all of the OWL-DL constructs including oneOf and hasValue restrictions, and providing a type inference complete enough to be able to run all queries in the LUBM [14] and UOBM [8] benchmarks.
– Apart from DLDB2 [14], we are the only approach of any type where all type inference is performed within the RDBMS, and hence we allow RDBMS based applications to incorporate OWL-DL knowledge without alteration to the RDBMS platform, and perform reasoning within transactions.
– Since we materialise the data instances of classes, we support faster query processing than any other approach.

The remainder of this paper is structured as follows. Section 2 gives an outline of how the SQOWL approach works, describing the basic technique for implementing type inference implied by OWL-DL constructs using relational schemas and triggers on the schema. The set of production rules for mapping all OWL-DL constructs to relational schemas with triggers is presented in Section 3. We give a more detailed description of related work in Section 4, and give our summary and conclusions in Section 5.
2 The SQOWL Approach

Our approach to reasoning over large volumes of data is based on a three stage approach to building the reasoning system, which we describe below, together with some technical details of the prototype implementation of the SQOWL approach that we developed in order to run the benchmark tests.

i. Classification and consistency checking of the TBox of an OWL-DL ontology is performed with any suitable reasoner to produce the inferred closure of the TBox. In our prototype system, we load an OWL-DL ontology as a Jena OWL model using the Protege-OWL API, and use Pellet [15], a tableaux based reasoner.
ii. From the TBox we produce an SQL schema that can store the classes and properties of the TBox. In our prototype system, we take the simple approach of implementing each class as a unary relation and each property as a binary relation, which generates a set of ANSI SQL CREATE TABLE statements, but use triggers to perform type inference rather than use foreign keys to maintain integrity constraints.
iii. We use a set of production rules that generate SQL trigger statements that perform the type inference and ABox consistency checking. The production rules map statements in OWL-DL to triggers in an abstract syntax. In our prototype system, the production rules are programmed in Java, and produce the concrete syntax of PostgreSQL function definitions and trigger definitions.

Note that once steps (i)–(iii) have been performed, the database is ready to accept ABox rules such as (7), (8) and (9) implemented as insertions to the corresponding relations in the database.

We have already illustrated in the introduction how steps (i) and (ii) of the above process work to produce a set of SQL tables. However, one detail omitted in the introduction is that anonymous classes such as that for the enumeration of individuals in
TBox rule (3) will also cause a table to be created for the anonymous class (which in our prototype would be named cheninblanc_pinotblanc_sauvignonblanc).

Now we shall introduce the abstract trigger syntax we use in step (iii) above, and how the triggers serve to perform type inference within the RDBMS. The triggers are ECA rules in the standard form

    when event if condition then action

where:

– event will always be some insertion of a tuple to a table, prefixed with a '−' if the condition and action are to execute before the insertion of the tuple is applied to the table, or prefixed with a '+' if the condition and action are to execute after the insertion of the tuple to the table is applied.
– condition is some Datalog query over the database. Each comma in the condition specifies a logical AND operator.
– action is one of
  • some list of tuple(s) to insert into the database, or
  • reject if the whole transaction involving the event is to be aborted, or
  • ignore if the event is to be ignored, which may only be used if the event is prefixed by '−', i.e. is a before trigger.

In order to perform type inference within the RDBMS, we require that we have a trigger for each table that appears in the left-hand side (LHS) of a sufficient (⊑) DL rule, with that table as the event. The remainder of the LHS is re-evaluated in the condition, and if it holds, then the changes to the right-hand side (RHS) of the DL rule are made as the action. These actions must be made before changes to the table are applied in the database, and hence we must have a 'before trigger'. For instance, for TBox rule (12), we can identify a trigger rule:

    when +WhiteLoire(x) if true then Loire(x)

which states that after the insertion of x into the WhiteLoire table, we unconditionally go on to assert that x is a value of Loire. This in turn may be implemented by an SQL trigger, the PostgreSQL version being presented in Figure 1(a).
Due to the design of PostgreSQL, the trigger has to call a function that implements the actions of the trigger. The function insert_Loire() first checks whether the new tuple (NEW.id) already exists in Loire, and if not, then inserts the new tuple.

For each necessary and sufficient (≡) TBox rule, we additionally require a trigger on any table appearing in the RHS of the rule to reevaluate the RHS, and then assert the LHS after the RHS is inserted into the database. Thus there is one trigger for each table in the RHS, with that table being the event, the remainder of the RHS in the condition, and the tables of the LHS in the action. For example, for TBox rule (2), we have two triggers, one for each table in the RHS:

    when +Loire(x) if WhiteWine(x) then WhiteLoire(x)
    when +WhiteWine(x) if Loire(x) then WhiteLoire(x)
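The trigger rules for TBox rules (12) and (2) can be tried out on a small scale. The sketch below is not the prototype described above: it assumes SQLite (through Python's standard sqlite3 module) in place of PostgreSQL, and the names `DDL`, `build_db` and `members` are hypothetical. SQLite lets a trigger body contain the insert directly, so no separate function is needed.

```python
import sqlite3

DDL = """
CREATE TABLE Loire(id TEXT PRIMARY KEY);
CREATE TABLE WhiteWine(id TEXT PRIMARY KEY);
CREATE TABLE WhiteLoire(id TEXT PRIMARY KEY);

-- Rule (12): when +WhiteLoire(x) if true then Loire(x)
CREATE TRIGGER propagateTo_Loire AFTER INSERT ON WhiteLoire
BEGIN
  INSERT INTO Loire(id)
  SELECT NEW.id WHERE NOT EXISTS (SELECT 1 FROM Loire WHERE id = NEW.id);
END;

-- Rule (2): when +Loire(x) if WhiteWine(x) then WhiteLoire(x)
CREATE TRIGGER Loire_implies_WhiteLoire AFTER INSERT ON Loire
WHEN EXISTS (SELECT 1 FROM WhiteWine WHERE id = NEW.id)
BEGIN
  INSERT INTO WhiteLoire(id)
  SELECT NEW.id WHERE NOT EXISTS (SELECT 1 FROM WhiteLoire WHERE id = NEW.id);
END;

-- Rule (2): when +WhiteWine(x) if Loire(x) then WhiteLoire(x)
CREATE TRIGGER WhiteWine_implies_WhiteLoire AFTER INSERT ON WhiteWine
WHEN EXISTS (SELECT 1 FROM Loire WHERE id = NEW.id)
BEGIN
  INSERT INTO WhiteLoire(id)
  SELECT NEW.id WHERE NOT EXISTS (SELECT 1 FROM WhiteLoire WHERE id = NEW.id);
END;
"""


def build_db():
    """Create an in-memory database holding the three class tables and triggers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn


def members(conn, table):
    """Sorted list of the ids currently stored in a class table."""
    return sorted(row[0] for row in conn.execute(f"SELECT id FROM {table}"))


conn = build_db()
conn.execute("INSERT INTO Loire VALUES ('SevreEtMaineMuscadet')")      # ABox (7)
conn.execute("INSERT INTO WhiteWine VALUES ('SevreEtMaineMuscadet')")  # ABox (8)
print(members(conn, "WhiteLoire"))
```

After the two ABox insertions, the individual is also stored in WhiteLoire, which is the behaviour asked for by problem P2.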
3 Translating OWL-DL to ECA Rules

In this section we will describe how we translate an OWL-DL KB into the ECA rules introduced in the previous section. Note that these ECA rules may in turn be translated into any specific implementation of SQL triggers that supports both BEFORE and AFTER triggers on row level updates. The outline of the mapping from our when-if-then ECA rules to SQL is as follows:

    when −C(x)   :=  BEFORE INSERT ON C
    when +C(x)   :=  AFTER INSERT ON C
    if C(x)      :=  IF EXISTS (SELECT id FROM C WHERE id=x)
    if ¬C(x)     :=  IF NOT EXISTS (SELECT id FROM C WHERE id=x)
    then C(x)    :=  THEN INSERT INTO C(id) VALUES(x); END IF;
    then ignore  :=  THEN RETURN NULL; END IF;
    then reject  :=  THEN RAISE EXCEPTION '...'; END IF;

(a) WhiteLoire ⊑ Loire

    CREATE FUNCTION insert_Loire() RETURNS trigger AS $$
    BEGIN
      -- Propagate the new WhiteLoire member to Loire unless already present
      IF NOT EXISTS (SELECT id FROM Loire WHERE id = NEW.id) THEN
        INSERT INTO Loire(id) VALUES (NEW.id);
      END IF;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER propagateTo_Loire AFTER INSERT ON WhiteLoire
      FOR EACH ROW EXECUTE PROCEDURE insert_Loire();

(b) Allow asserts on Wine

    CREATE FUNCTION skip_insert_Wine() RETURNS trigger AS $$
    BEGIN
      -- Returning NULL from a BEFORE trigger skips the insert of this row
      IF EXISTS (SELECT id FROM Wine WHERE id = NEW.id) THEN
        RETURN NULL;
      END IF;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER skipinsert BEFORE INSERT ON Wine
      FOR EACH ROW EXECUTE PROCEDURE skip_insert_Wine();

(c) {CheninBlanc, PinotBlanc, SauvignonBlanc}

    CREATE FUNCTION reject_insert_cps() RETURNS trigger AS $$
    BEGIN
      -- The enumeration is fixed: reject new values, skip re-asserted ones
      IF NOT EXISTS (SELECT id FROM cheninblanc_pinotblanc_sauvignonblanc
                     WHERE id = NEW.id) THEN
        RAISE EXCEPTION 'Unable to change enumeration';
      END IF;
      RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER rejectinsert BEFORE INSERT ON cheninblanc_pinotblanc_sauvignonblanc
      FOR EACH ROW EXECUTE PROCEDURE reject_insert_cps();

(d) Loire ⊑ Wine

    CREATE FUNCTION insert_Wine() RETURNS trigger AS $$
    BEGIN
      -- Propagate the new Loire member to Wine unless already present
      IF NOT EXISTS (SELECT id FROM Wine WHERE id = NEW.id) THEN
        INSERT INTO Wine(id) VALUES (NEW.id);
      END IF;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER propagateTo_Wine AFTER INSERT ON Loire
      FOR EACH ROW EXECUTE PROCEDURE insert_Wine();

Fig. 1. Some examples of Postgres triggers implementing type inference for DL statements
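The mapping from ECA rules to SQL outlined above can be sketched as a small text generator. This is an illustration only, restricted to rules over unary class tables; it is not the Java generator of the prototype, and the function name `eca_to_plpgsql`, its parameters and the exception message are assumptions made for this example. The emitted text follows the PostgreSQL style of Figure 1.

```python
def eca_to_plpgsql(name, timing, table, condition, action):
    """Render one ECA rule over unary class tables as PL/pgSQL text.

    timing:    '-' (before the insert) or '+' (after the insert)
    condition: None for 'true', ('in', D) for D(x), ('not_in', D) for ¬D(x)
    action:    ('insert', E) for E(x), 'ignore', or 'reject'
    """
    when = {"-": "BEFORE", "+": "AFTER"}[timing]

    if condition is None:
        test = "TRUE"
    else:
        kind, cls = condition
        if kind not in ("in", "not_in"):
            raise ValueError(f"unknown condition kind: {kind}")
        negation = "NOT " if kind == "not_in" else ""
        test = f"{negation}EXISTS (SELECT id FROM {cls} WHERE id = NEW.id)"

    if action == "ignore":
        if timing != "-":
            raise ValueError("ignore may only be used in a before trigger")
        body = "RETURN NULL;"
    elif action == "reject":
        body = "RAISE EXCEPTION 'insert rejected';"
    else:
        _, target = action
        body = f"INSERT INTO {target}(id) VALUES (NEW.id);"

    return (
        f"CREATE FUNCTION {name}() RETURNS trigger AS $$\n"
        f"BEGIN\n"
        f"  IF {test} THEN\n"
        f"    {body}\n"
        f"  END IF;\n"
        f"  RETURN NEW;\n"
        f"END;\n"
        f"$$ LANGUAGE plpgsql;\n"
        f"CREATE TRIGGER {name}_trg {when} INSERT ON {table}\n"
        f"  FOR EACH ROW EXECUTE PROCEDURE {name}();\n"
    )


# when +WhiteLoire(x) if true then Loire(x)
print(eca_to_plpgsql("insert_Loire", "+", "WhiteLoire",
                     ("not_in", "Loire"), ("insert", "Loire")))
```

The call at the end passes ¬Loire(x) as the condition so that the generated function has the shape of Figure 1(a), where the existence check guards the insert.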
In this section we present the translation of basic OWL-DL classes and properties in Sections 3.1 and 3.2. Then we describe how class descriptions are translated in Section 3.3, and how the special case of intersections in class definitions is handled in Section 3.4. Finally we present the translation of restrictions on properties in Section 3.5.

3.1 OWL-DL Classes and Individuals

An OWL-DL ontology contains declarations of classes. In our translation to SQL, each class declaration C maps to an SQL table C. The production rule is:
    Class: C
        CREATE TABLE C(id VARCHAR PRIMARY KEY),
        when −C(x) if C(x) then ignore

with the semantics that any class C found in OWL-DL causes two additions to the relational schema. The first addition is a table to hold the known instances of this class, and the second addition is an SQL trigger on table C to ignore any insertions of a tuple value x where x already exists in C. Note that the trigger is fired before x is actually inserted into C, hence preventing spurious duplicate key error messages from the RDBMS when an attempt is made to insert again a fact x that has already been inserted.

To illustrate how a class is implemented in the RDBMS, the declaration of class Loire in TBox rule (1) produces:

    CREATE TABLE Loire(id VARCHAR PRIMARY KEY)
    when −Loire(x) if Loire(x) then ignore

where the translation of the ECA rule to a Postgres trigger is illustrated in Figure 1(b). The function checks whether the value to be inserted already exists. If it exists, then the function returns NULL, which corresponds to ignoring the insert. If it does not exist, then the function returns NEW, which is the value to be inserted.

An OWL-DL class may contain individuals. Each individual of class C will be inserted into table C with the name of the individual as the id. The production rule is:

    individual: C(a)
        INSERT INTO C VALUES (a)

3.2 OWL-DL Properties

An OWL-DL property defines a binary predicate P(D, R), where the domain D is always an OWL-DL class, and the range R varies according to which of two types P belongs to. A datatype property has a range which is a datatype, normally defined as an RDF literal or XML Schema datatype [3]. An object property has a range which is an OWL-DL class. Hence, there is a rough analogy between properties and subclass relationships, in that membership of a property implies membership in the domain class, and if it exists, the range class.
Thus the implementation of properties is as a two-column table, with triggers to update any classes that the property references, giving the following rules:

    datatypeProperty: P(D, R)
        CREATE TABLE P(domain VARCHAR, range sqlType(R))
        when −P(x, y)  if P(x, y) then ignore
                       if true then D(x)

    objectProperty: P(D, R)
        CREATE TABLE P(domain VARCHAR, range VARCHAR)
        when −P(x, y)  if P(x, y) then ignore
                       if true then D(x), R(y)

Note that the function sqlType(R) returns the SQL data type corresponding to the OWL-DL datatype R. Applying the above to the wine ontology rules ⊤ ⊑ ∀hasFlavor.Wine and ⊤ ⊑ ∀hasFlavor⁻.WineFlavor will lead to a table being created to represent the object property:

    CREATE TABLE hasFlavor(domain VARCHAR, range VARCHAR)

plus a trigger that causes an update to hasFlavor to be propagated to both Wine and WineFlavor. Instances of a property have the obvious mapping to insert statements on the corresponding property table:

    propertyInstance: P(x, y)
        INSERT INTO P VALUES (x, y)
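The Class and objectProperty rules can be exercised together on the madeFromGrape example. The following is a minimal sketch under stated assumptions: it uses SQLite via Python's sqlite3 module instead of PostgreSQL, the helper names (`class_ddl`, `object_property_ddl`, `build`) are hypothetical, the duplicate check and the propagation are split into a BEFORE and an AFTER trigger, and SQLite's `RAISE(IGNORE)` plays the role of the `ignore` action.

```python
import sqlite3


def class_ddl(c):
    """Class rule: a unary table plus a before trigger ignoring known members."""
    return f'''
    CREATE TABLE {c}(id TEXT PRIMARY KEY);
    CREATE TRIGGER skip_insert_{c} BEFORE INSERT ON {c}
    WHEN EXISTS (SELECT 1 FROM {c} WHERE id = NEW.id)
    BEGIN SELECT RAISE(IGNORE); END;
    '''


def object_property_ddl(p, d, r):
    """objectProperty rule: a binary table whose inserts ignore duplicates
    and propagate x to the domain class d and y to the range class r."""
    return f'''
    CREATE TABLE {p}(domain TEXT, "range" TEXT);
    CREATE TRIGGER skip_insert_{p} BEFORE INSERT ON {p}
    WHEN EXISTS (SELECT 1 FROM {p}
                 WHERE domain = NEW.domain AND "range" = NEW."range")
    BEGIN SELECT RAISE(IGNORE); END;
    CREATE TRIGGER propagate_{p} AFTER INSERT ON {p}
    BEGIN
      -- Duplicates are filtered by the class tables' own before triggers.
      INSERT INTO {d}(id) VALUES (NEW.domain);
      INSERT INTO {r}(id) VALUES (NEW."range");
    END;
    '''


def build():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        class_ddl("Wine")
        + class_ddl("WineGrape")
        + object_property_ddl("madeFromGrape", "Wine", "WineGrape")
    )
    return conn


conn = build()
# ABox rule (9), asserted without first asserting the grape or the wine.
conn.execute(
    "INSERT INTO madeFromGrape VALUES ('SevreEtMaineMuscadet', 'PinotBlancGrape')"
)
print(conn.execute("SELECT id FROM WineGrape").fetchall())
```

Because every class table silently ignores re-assertions, the propagation trigger can insert into Wine and WineGrape unconditionally, which mirrors the `if true then D(x), R(y)` part of the rule.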
With these definitions, we have a solution to problem P3 in the introduction, since assertions of membership such as in (9) will cause any 'missing' class memberships, such as PinotBlancGrape being a member of WineGrape, to be created.

3.3 OWL-DL Class Descriptions

In OWL-DL, if C, D each denote a class, P denotes a property, and a1, ..., an individuals, then a class description takes the form:

    class-des ::= class-expression | property-restriction | class-des ⊓ class-des | class-des ⊔ class-des
    class-expression = D | ¬D | {a1, ..., an}
    property-restriction = ∀P.D | ∃P.D | ∃P:a | =nP | >nP | […]

    […] n then C(x)

The DL hasValue construct C ⊑ ∃P:a specifies that each individual x of C has a tuple ⟨x, a⟩ in P. This restriction translates into a trigger which is executed after an insertion of instance x into C. The trigger inserts the tuple ⟨x, a⟩ into table P.

    hasValue: C ⊑ ∃P:a
        when +C(x) if true then P(x, a)

For example, based on the TBox rule (11) we know that each Loire wine is locatedIn LoireRegion. Thus, when the ABox rule (7) is examined, the trigger will insert tuple {SevreEtMaineMuscadet, LoireRegion} in table locatedIn. In the case of ∃P:a ⊑ C, for each x with ⟨x, a⟩ a tuple of P, x is inserted into C.

    hasValue: ∃P:a ⊑ C
        when +P(x, y) if P(x, a) then C(x)

3.4 Inferences from Intersections

When an intersection appears on the left of a subclass relationship there is an additional inference that can be performed. For example, from TBox rule (2), we have

    Loire ⊓ WhiteWine ⊑ WhiteLoire
and hence if there is an individual x inserted which is both a member of Loire and a member of WhiteWine, then it can be inferred to be a member of WhiteLoire. To achieve this we would need a trigger which is executed after an instance x is inserted in table Loire or WhiteWine, and if the value is present in the other of those two tables, inserts x into WhiteLoire:

    when +Loire(x) if WhiteWine(x) then WhiteLoire(x)
    when +WhiteWine(x) if Loire(x) then WhiteLoire(x)

In general, if E1 ⊓ ... ⊓ En ⊑ C, then we require a trigger on each Ei that will check to see if E1 ⊓ ... ⊓ En now holds. Thus the implementation of intersection is:

    intersection: E1 ⊓ ... ⊓ En ⊑ C
        for each i = 1, ..., n:  when +trigger(Ei) if holds(E1 ⊓ ... ⊓ En) then C(x)

where the trigger function identifies the tables to trigger on as follows:

    trigger(>nP)    := P(x, y)
    trigger(Ei)     := Ei(x)
    trigger(∃P.D)   := P(x, y)
    trigger(P:{a})  := P(x, y)
    trigger(∀P.D)   := P(x, y)

and the holds function maps the OWL-DL intersection into predicate logic (and hence SQL) as follows:

    holds(D ⊓ E)    := holds(D), holds(E)
    holds(D)        := D(x)
    holds(P:{a})    := P(x, a)
    holds(>nP)      := count(P(x, _)) > n
    holds(∃P.D)     := P(x, y), D(y)
    holds(∀P.D)     := false

Note that ∀P.D is assumed to be false simply because, with open world semantics, there is no simple query that can determine if it is true.
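The trigger and holds functions lend themselves to a direct transcription. The sketch below produces abstract ECA rules as strings; the tuple encoding of class expressions (for instance `("class", "Loire")` or `("some", P, D)`) and the function `intersection_rules` are assumptions made for this example, as is the simplification step that drops the check already guaranteed by the event itself.

```python
def trigger(e):
    """Table event on which a conjunct has to be re-checked."""
    kind = e[0]
    if kind == "class":                       # ("class", D)
        return f"{e[1]}(x)"
    if kind in ("some", "all", "hasValue", "min"):
        return f"{e[1]}(x, y)"                # e[1] is the property P
    raise ValueError(f"unknown class expression: {kind}")


def holds(e):
    """Condition (in the Datalog-like ECA syntax) expressing that e holds for x."""
    kind = e[0]
    if kind == "class":                       # D
        return f"{e[1]}(x)"
    if kind == "hasValue":                    # ("hasValue", P, a) for P:{a}
        return f"{e[1]}(x, {e[2]})"
    if kind == "min":                         # ("min", P, n) for >nP
        return f"count({e[1]}(x, _)) > {e[2]}"
    if kind == "some":                        # ("some", P, D) for ∃P.D
        return f"{e[1]}(x, y), {e[2]}(y)"
    if kind == "all":                         # ("all", P, D): not decidable by a query
        return "false"
    raise ValueError(f"unknown class expression: {kind}")


def intersection_rules(conjuncts, c):
    """One ECA rule per conjunct of E1 ⊓ ... ⊓ En ⊑ C."""
    rules = []
    for i, e in enumerate(conjuncts):
        # A named-class conjunct is already guaranteed by its own insert event.
        conds = [holds(f) for j, f in enumerate(conjuncts)
                 if not (j == i and e[0] == "class")]
        condition = ", ".join(conds) if conds else "true"
        rules.append(f"when +{trigger(e)} if {condition} then {c}(x)")
    return rules


for rule in intersection_rules([("class", "Loire"), ("class", "WhiteWine")],
                               "WhiteLoire"):
    print(rule)
```

For TBox rule (2) this yields one rule triggered on Loire and one triggered on WhiteWine, each checking only the other class.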
As an example of using the above rules, if we apply them to the TBox rule (2), we have that E1 ≡ Loire and E2 ≡ WhiteWine, and that expands out to:

    when +trigger(Loire) if holds(Loire ⊓ WhiteWine) then WhiteLoire(x)
    when +trigger(WhiteWine) if holds(Loire ⊓ WhiteWine) then WhiteLoire(x)

and expanding the trigger and holds functions gives:

    when +Loire(x) if Loire(x), WhiteWine(x) then WhiteLoire(x)
    when +WhiteWine(x) if Loire(x), WhiteWine(x) then WhiteLoire(x)

If we remove the redundant check on Loire in the condition of the first rule, and the redundant check on WhiteWine in the condition of the second rule, we have the two ECA rules we talked about in the beginning of this section, and have provided a solution to problem P2 in the introduction.

3.5 Restrictions on and between OWL-DL Properties

Properties can be functional and/or inverse functional. A functional property P(D, R) can have only one tuple ⟨x, y⟩ for each x in D, and an inverse functional property can have only one tuple ⟨x, y⟩ for each y in R. We translate these restrictions into SQL as key/unique constraints:

    FunctionalProperty: P
        ALTER TABLE P ADD PRIMARY KEY (domain)
    InverseFunctionalProperty: P
        ALTER TABLE P ADD UNIQUE (range)
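The effect of the key constraint for a functional property can be observed on a toy table. This is a sketch under assumptions: it uses SQLite through Python's sqlite3 module, where a primary key cannot be added by ALTER TABLE and is therefore declared when the table is created; the wine and flavor values and the helper `assert_flavor` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The key on the domain column encodes FunctionalProperty: hasFlavor.
conn.execute('CREATE TABLE hasFlavor(domain TEXT PRIMARY KEY, "range" TEXT)')


def assert_flavor(conn, wine, flavor):
    """Try to assert hasFlavor(wine, flavor); report whether it was accepted."""
    try:
        conn.execute("INSERT INTO hasFlavor VALUES (?, ?)", (wine, flavor))
        return True
    except sqlite3.IntegrityError:
        # A second flavor for the same wine violates the key constraint.
        return False


print(assert_flavor(conn, "SomeWine", "Strong"))
print(assert_flavor(conn, "SomeWine", "Delicate"))
```

An inverse functional property would be handled the same way with a UNIQUE constraint on the range column.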
In the Wine ontology there is a functional property definition ⊤ ⊑ ≤1 hasFlavor, which adds a primary key constraint on the domain column of table hasFlavor. This constraint will not allow the same wine to appear in the hasFlavor table twice, and therefore enforces the functional constraint on the property. Note that it would be possible in an implementation to normalise a functional property table (such as hasFlavor) with the class of its domain (such as Wine) to make the range a column in the class table (i.e. to have a column grape in Wine, instead of the separate hasFlavor property table). All triggers that were on the property table would instead be on the relevant columns of the class table. However this would make our presentation more complex, and so we do not use such an optimisation in this paper.

Properties can also be transitive and/or symmetric. For example, the locatedIn property in the Wine ontology is transitive, and hence if we know:

  locatedIn(ChateauChevalBlancStEmilion, BordeauxRegion)
  locatedIn(BordeauxRegion, FrenchRegion)

then we can infer:

  locatedIn(ChateauChevalBlancStEmilion, FrenchRegion)

The production rule for a transitive property P needs to define a trigger to be executed after each insert of a tuple ⟨x, y⟩ in P. For each ⟨y, z⟩ existing in P the rule inserts the tuple ⟨x, z⟩, and for each ⟨z, x⟩ in P it inserts the tuple ⟨z, y⟩. The macro foreach is used in the production rule to perform these iterations:

  foreach(z, P(y, z), P(x, z)) :=
    FOR z IN (SELECT range FROM P WHERE domain = y) LOOP
      IF ⟨x, z⟩ NOT IN (SELECT domain, range FROM P) THEN
        INSERT INTO P VALUES (x, z)
      END IF;
    END LOOP;
The production rule is as follows:

  TransitiveProperty: P ∈ P+
    when +P(x, y) if true then foreach(z, P(y, z), P(x, z)), foreach(z, P(z, x), P(z, y))

If a property P is declared to be symmetric, then a rule needs to be defined that will insert in P the tuple ⟨y, x⟩ after an event inserts tuple ⟨x, y⟩ on P:

  SymmetricProperty: P ≡ P−
    when +P(x, y) if true then P(y, x)

Like classes, OWL-DL properties can be related to one another. For example, a property P might be declared to be a subproperty of Q, which means that each tuple of P is also a tuple of Q. An SQL trigger is added on table P to specify that after inserting any tuple ⟨x, y⟩ in P, the tuple must also be inserted in Q. The production rule is as follows:

  subPropertyOf: P ⊑ Q
    when +P(x, y) if true then Q(x, y)

A property P might be declared as the inverse of another property Q. This declaration asserts that for each tuple ⟨x, y⟩ in P, the inverse tuple ⟨y, x⟩ exists in Q, and vice versa. An SQL trigger is added on table P to specify that after each insertion on P the inverse tuple must be inserted on Q, if it does not already exist. Note that in our methodology, for each such property declaration two inverseOf constructs are created: P ≡ Q− and Q ≡ P−. The production rule for the inverseOf construct is:

  inverseOf: P ≡ Q−
    when +P(x, y) if ¬Q(y, x) then Q(y, x)

Finally, a property P might be declared to be equivalent to another property Q. In this case, an SQL trigger is added on table P so that after each insertion of a tuple ⟨x, y⟩ in P the trigger inserts the tuple in table Q, and vice versa. The production rule is:

  equivalentProperty: P ≡ Q
    when +P(x, y) if true then Q(x, y)
    when +Q(x, y) if true then P(x, y)
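The behaviour of the transitive and symmetric rules can be illustrated without a database. The sketch below is an assumption-laden simplification: the property table is modelled as a Python set of pairs, the function name insert_property is hypothetical, and a worklist stands in for triggers that fire again on every tuple they insert.

```python
def insert_property(table, x, y, transitive=False, symmetric=False):
    """Insert (x, y) into `table` (a set of pairs) and apply the property rules.

    Every tuple that is actually added is treated as a new insert event,
    mimicking a trigger that fires on the rows it inserts itself.
    """
    pending = [(x, y)]
    while pending:
        a, b = pending.pop()
        if (a, b) in table:
            continue  # already stored, so no insert event fires
        table.add((a, b))
        if symmetric:
            # when +P(a, b) if true then P(b, a)
            pending.append((b, a))
        if transitive:
            # foreach(z, P(b, z), P(a, z))
            pending.extend((a, z) for (d, z) in table if d == b)
            # foreach(z, P(z, a), P(z, b))
            pending.extend((z, b) for (z, r) in table if r == a)
    return table


located_in = set()
insert_property(located_in, "BordeauxRegion", "FrenchRegion", transitive=True)
insert_property(located_in, "ChateauChevalBlancStEmilion", "BordeauxRegion",
                transitive=True)
```

After the second insert, located_in also contains the inferred pair for ChateauChevalBlancStEmilion and FrenchRegion, as in the locatedIn example. The membership test before adding a tuple plays the role of the NOT IN check in the foreach macro and guarantees termination on a finite table.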
4 Related Work

DL reasoners come in a number of forms [2]. The most common type are tableaux-based reasoners like Racer, FaCT++ and Pellet. These are very efficient at computing classification hierarchies and checking the consistency of a knowledge base. However, the tableaux-based approach is not suited to the task of processing ontologies with large numbers of individuals, due to the use of a refutation procedure rather than a query answering algorithm [5]. Rule-based reasoners provide an alternative to the tableaux-based approach that is more promising for handling large datasets. O-DEVICE [11] translates OWL rules into an in-memory representation, and can process all of OWL-DL except oneOf, complementOf or data ranges. The fact that the system is memory based provides fast load and query times, but means that it does not scale beyond tens of thousands of individuals. OWLIM [6] is similar in both features and problems, but supports a smaller subset of OWL-DL than O-DEVICE. KAON2 [13] does reasoning by means of theorem proving. The TBox is translated into first-order clauses, which are executed on a disjunctive Datalog engine to compute the inferred closure. KAON2 has fast load and query times, but is unable to handle nominals (hasValue and oneOf), i.e. it lacks the O in SHOIN(D). DLDB2 [14] and SOR [17] (previously called Minerva) are most similar to SQOWL, since they use an RDBMS as their rule engine. DLDB2 stores the rules inside the database as non-materialised views. Tables are created for each atomic property and class, populated with individuals from the ontology. A separate DL reasoner is used to classify the ontology. The resulting TBox axioms are translated into non-recursive Datalog rules that are translated into SQL view create statements. DLDB2 enjoys very fast load times because the inferred closure of the database is not calculated at load time, but its querying is slow.
An advantage of the system is that because the closure is only calculated when queries are posed on the system, updates and deletes can be performed on the system. DLDB2 is not able to perform type inference based on allValuesFrom. SOR [17] also uses a standard tableaux based DL reasoner to first classify the ontology. It differs from DLDB2 in that rules are kept outside the database and the SQL statements created from the OWL-DL rules are not used to create views but are rather executed at load time to materialise the inference results. This makes query processing faster. However, since the rules are kept outside the database, any additions to the database necessitate a complete rerun of the reasoning.
5 Summary and Conclusions

We have described a method of translating an OWL-DL ontology into an active database that can be queried and updated independently of the source ontology. In particular, we have implemented type inference for OWL-DL in relational databases, and have produced a prototype implementation that builds such type inference into Postgres databases. Our approach gives a complete implementation of OWL-DL in a relational database, assuming that we make the UNA. As such, we have no need to handle the OWL-DL constructs differentFrom, AllDifferent, or sameAs, since they are concerned with issues where the UNA does not hold.
Running the LUBM [14] and UOBM [8] benchmarks shows we are between two and 1000 times faster at query answering than other DBMS-based approaches [9]. Only the DLDB2 approach has the same advantage of performing all its type inference within the DBMS, and DLDB2 is at least 30 times slower than SQOWL in query answering in these benchmarks. The current prototype is crude in its generation of trigger statements, in that it does not attempt to combine multiple triggers on one table into a single trigger and function call. Furthermore, it does not make the obvious optimisation that all functional properties of a class can be stored as a single table, which would further reduce the number of triggers and the number of joins required in query processing. A more substantial addition would be to handle deletes and updates, since we would then need to consider how a fact might be derived from more than one rule. We have shown that the SQOWL approach offers, for a certain class of semantic web applications, a method of efficiently storing data and performing type inference. We also believe that our work opens up the possibility of using ontologies expressed in OWL-DL to enhance database schemas with type inference capabilities, and will explore this theme of ‘knowledge reasoning’ in RDBMS applications in future work.
References 1. Resource Description Framework, RDF (2001), http://www.w3.org/RDF/ 2. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F.: The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, Cambridge (2003) 3. Biron, P., Malhotra, A.: XML Schema part 2: Datatypes, 2nd edn. (2004), http://www.w3.org/TR/xmlschema-2 4. Ceri, S., Widom, J.: Deriving production rules for incremental view maintenance. In: Proc. VLDB, pp. 577–589 (1991) 5. Hustadt, U., Motik, B.: Description logics and disjunctive datalog the story so far. In: Description Logics (2005) 6. Kiryakov, A., Ognyanov, D., Manov, D.: Owlim - a pragmatic semantic repository for OWL. In: WISE Workshops, pp. 182–192 (2005) 7. Lu, J., Ma, L., Zhang, L., Brunner, J.-S., Wang, C., Pan, Y., Yu, Y.: Sor: A practical system for ontology storage, reasoning and search. In: VLDB, pp. 1402–1405 (2007) 8. Ma, L., Yang, Y., Qiu, Z., Xie, G.T., Pan, Y., Liu, S.: Towards a complete OWL ontology benchmark. In: Sure, Y., Domingue, J. (eds.) ESWC 2006. LNCS, vol. 4011, pp. 125–139. Springer, Heidelberg (2006) 9. McBrien, P., Rizopoulos, N., Smith, A.: SQOWL: Performing OWL-DL type inference in SQL. Technical report, AutoMed Technical Report 37 (2009) 10. McGuinness, D., van Harmelen, F.: OWL Web Ontology Language Overview (2004), http://www.w3.org/TR/owl-features/ 11. Meditskos, G., Bassiliades, N.: A rule-based object-oriented OWL reasoner. IEEE Trans. Knowl. Data Eng. 20(3), 397–410 (2008) 12. Smith, M.K., Welty, C., McGuinness, D.: OWL Web Ontology Language Guide (2004), http://www.w3.org/TR/owl-guide/
13. Motik, B., Sattler, U.: A comparison of reasoning techniques for querying large description logic aboxes. In: Hermann, M., Voronkov, A. (eds.) LPAR 2006. LNCS (LNAI), vol. 4246, pp. 227–241. Springer, Heidelberg (2006) 14. Pan, Z., Zhang, X., Heflin, J.: DLDB2: A scalable multi-perspective semantic web repository. In: Web Intelligence, pp. 489–495 (2008) 15. Pellet, http://clarkparsia.com/pellet/ 16. Urpí, T., Olivé, A.: A method for change computation in deductive databases. In: Proc. VLDB, pp. 225–237 (1992) 17. Zhou, J., Ma, L., Liu, Q., Zhang, L., Yu, Y., Pan, Y.: Minerva: A scalable OWL ontology storage and inference system. In: Mizoguchi, R., Shi, Z.-Z., Giunchiglia, F. (eds.) ASWC 2006. LNCS, vol. 4185, pp. 429–443. Springer, Heidelberg (2006)
Querying Databases with Taxonomies

Davide Martinenghi¹ and Riccardo Torlone²

¹ Dip. di Elettronica e Informazione, Politecnico di Milano, Italy, [email protected]
² Dip. di Informatica e Automazione, Università Roma Tre, Italy, [email protected]
Abstract. Traditional information search in which queries are posed against a known and rigid schema over a structured database is shifting towards a Web scenario in which exposed schemas are vague or absent and data comes from heterogeneous sources. In this framework, query answering cannot be precise and needs to be relaxed, with the goal of matching user requests with accessible data. In this paper, we propose a logical model and an abstract query language as a foundation for querying data sets with vague schemas. Our approach takes advantage of the availability of taxonomies, that is, simple classifications of terms arranged in a hierarchical structure. The model is a natural extension of the relational model in which data domains are organized in hierarchies, according to different levels of generalization. The query language is a conservative extension of relational algebra where special operators allow the specification of relaxed queries over vaguely structured information. We study equivalence and rewriting properties of the query language that can be used for query optimization.
1 Introduction
There are today many application scenarios in which user queries do not match the structure and the content of data repositories, given the nature of the application domain or just because the schema is not available. This happens for instance in location-based search (find an opera concert in Paris next summer), multifaceted product search (find a cheap blu-ray player with an adequate user rating), multi-domain search (find a database conference held in a seaside location), and social search (find the objects that my friends like). In these situations, the query is usually relaxed to accommodate user’s needs, and query answering relies on finding the best matching between the request and the available data. In spite of this trend towards “schema-agnostic” applications, the support of current database technology for query relaxation is quite limited. The only examples are in the context of semi-structured information, in which schemas and values are varied and/or missing [1]. Conversely, the above mentioned applications can greatly benefit from applying traditional relational database technology enhanced with a comprehensive support for the management of query relaxation.

J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 377–390, 2010. © Springer-Verlag Berlin Heidelberg 2010
To this aim, we propose in this paper a logical data model and an abstract query language supporting query relaxation over relational data. Our approach relies on the availability of taxonomies, that is, simple ontologies in which terms used in schemas and data are arranged in a hierarchical structure according to a generalization-specialization relationship. The data model is a natural extension of the relational model in which data domains are organized in hierarchies, according to different levels of detail: this guarantees a smooth implementation of the approach with current database technology. In this model data and metadata can be expressed at different levels of detail. This is made possible by a partial order relationship defined both at the schema and at the instance level. The query language is called Taxonomy-based Relational Algebra (TRA) and is a conservative extension of relational algebra. TRA includes two special operators that extend the capabilities of standard selection and join by relating values occurring in tuples with values in the query using the taxonomy. In this way, we can formulate relaxed queries that refer to attributes and terms different from those occurring in the actual database. We also present general algebraic rules governing the operators over taxonomies and their interactions with standard relational algebra operators. The rules provide a formal foundation for query equivalence and for the algebraic optimization of queries over vague schemas. 
In sum, the contributions of this paper are the following: (i) a simple but solid framework for embedding taxonomies into relational databases: the framework does not depend on a specific domain of application and makes the comparison of heterogeneous data possible and straightforward; (ii) a simple but powerful algebraic language for supporting query relaxation: the query language makes it possible to formulate complex searches over vague schemas in different application domains; (iii) the investigation of the relationships between the query language operators and the identification of a number of equivalence rules: the rules provide a formal foundation for the algebraic optimization of relaxed queries. Because of space limitation, we do not address the issue of implementing the formal framework proposed in this paper and we disregard the orthogonal problem of taxonomy design. Both issues will be addressed in forthcoming works.
2 A Data Model with Taxonomies

2.1 Partial Orders and Lattices
A (weak) partial order ≤ on a domain V is a subset of V × V, whose elements are denoted by v1 ≤ v2, that is: reflexive (v ≤ v for all v ∈ V), antisymmetric (if v1 ≤ v2 and v2 ≤ v1 then v1 = v2), and transitive (if v1 ≤ v2 and v2 ≤ v3 then v1 ≤ v3). If v1 ≤ v2 we say that v1 is included in v2. A set of values V with a partial order ≤ is called a poset. A lower bound (upper bound) of two elements v1 and v2 in a poset (V, ≤) is an element b ∈ V such that b ≤ v1 and b ≤ v2 (v1 ≤ b and v2 ≤ b). A maximal lower bound (minimal upper bound) is a lower bound (upper bound) b of two elements v1 and v2 in a poset (V, ≤) such that there is no other lower bound (upper bound) b′ of v1 and v2 with b ≤ b′ (b′ ≤ b).
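For a finite poset these notions can be checked by brute force. The following sketch is only an illustration: the function names are hypothetical, and the example poset (the divisors of 12 ordered by divisibility) is a standard textbook case chosen here, not one taken from the paper.

```python
from itertools import product


def is_partial_order(elements, leq):
    """Check reflexivity, antisymmetry and transitivity of `leq` on `elements`."""
    reflexive = all(leq(v, v) for v in elements)
    antisymmetric = all(not (leq(a, b) and leq(b, a)) or a == b
                        for a, b in product(elements, repeat=2))
    transitive = all(not (leq(a, b) and leq(b, c)) or leq(a, c)
                     for a, b, c in product(elements, repeat=3))
    return reflexive and antisymmetric and transitive


def lower_bounds(elements, leq, v1, v2):
    """All b with b <= v1 and b <= v2."""
    return {b for b in elements if leq(b, v1) and leq(b, v2)}


def maximal_lower_bounds(elements, leq, v1, v2):
    """Lower bounds that lie below no other lower bound."""
    lbs = lower_bounds(elements, leq, v1, v2)
    return {b for b in lbs if not any(b != c and leq(b, c) for c in lbs)}


# Illustrative poset: a <= b iff a divides b.
divisors = [1, 2, 3, 4, 6, 12]


def divides(a, b):
    return b % a == 0
```

For instance, lower_bounds(divisors, divides, 4, 6) yields the common divisors of 4 and 6, and maximal_lower_bounds keeps only those that are not below another common divisor. Upper bounds are obtained symmetrically by swapping the arguments of leq.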
The greatest lower bound or glb (least upper bound or lub) is a lower bound (upper bound) b of two elements v1 and v2 in a poset (V, ≤) such that b′ ≤ b (b ≤ b′) for any other lower bound (upper bound) b′ of v1 and v2. It easily follows that if a lub (glb) exists, then it is unique. The glb and the lub are also called meet and join, respectively. A lattice is a poset in which any two elements have both a glb and a lub. The glb and lub can also be defined over a set of elements. By induction, it follows that every non-empty finite subset of a lattice has a glb and a lub.

2.2 Hierarchical Domains and t-Relations
The basic construct of our model is the hierarchical domain or simply the h-domain, a collection of values arranged in a containment hierarchy. Each h-domain is described by means of a set of levels representing the domain of interest at different degrees of granularity. For instance, the h-domain time can be organized in levels like day, week, month, and year.

Definition 1 (H-domain). An h-domain h is composed of:
– a finite set L = {l1, . . . , lk} of levels, each of which is associated with a set of values called the members of the level and denoted by M(l);
– a partial order ≤L on L having a bottom element, denoted by ⊥L, and a top element, denoted by ⊤L, such that:
  • M(⊥L) contains a set of ground members whereas all the other levels contain members that represent groups of ground members;
  • M(⊤L) contains only a special member m⊤ that represents all the ground members;
– a family CM of containment mappings $\mathit{cmap}_{l_1}^{l_2} : M(l_1) \to M(l_2)$ for each pair of levels l1 ≤L l2 satisfying the following consistency conditions:
  • for each level l, the function $\mathit{cmap}_{l}^{l}$ is the identity on the members of l;
  • for each pair of levels l1 and l2 such that l1 ≤L l ≤L l2 and l1 ≤L l′ ≤L l2 for some l ≠ l′, we have: $\mathit{cmap}_{l}^{l_2}(\mathit{cmap}_{l_1}^{l}(m)) = \mathit{cmap}_{l'}^{l_2}(\mathit{cmap}_{l_1}^{l'}(m))$ for each member m of l1.

Example 1. The h-domain time has a bottom level whose (ground) members are timestamps and a top level whose only member, anytime, represents all possible timestamps. Other levels can be day, week, month, quarter, season and year, where day ≤L month ≤L quarter ≤L year and day ≤L season. A possible member of the day level is 23/07/2010, which is mapped by the containment mappings to the member 07/2010 of the level month and to the member Summer of the level season.
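A fragment of the time h-domain of Example 1 can be sketched as follows. This is a simplified illustration under stated assumptions: only the mappings between adjacent levels are stored (in a hypothetical dictionary DIRECT), mappings to coarser levels are obtained by composition, the bottom (timestamp) and top (anytime) levels are omitted, and the assignment of 07/2010 to the quarter 3Q 2010 is assumed for the sake of the example.

```python
# Direct containment mappings between adjacent levels (illustrative data only).
DIRECT = {
    ("day", "month"): {"23/07/2010": "07/2010"},
    ("month", "quarter"): {"07/2010": "3Q 2010"},
    ("quarter", "year"): {"3Q 2010": "2010"},
    ("day", "season"): {"23/07/2010": "Summer"},
}


def cmap(lower, upper, member):
    """Map `member` of level `lower` to the member of level `upper` containing it.

    Returns None when `upper` is not reachable from `lower`.
    """
    if lower == upper:
        return member  # the mapping from a level to itself is the identity
    for (src, dst), mapping in DIRECT.items():
        if src == lower and member in mapping:
            result = cmap(dst, upper, mapping[member])
            if result is not None:
                return result
    return None
```

With this data, cmap("day", "year", "23/07/2010") composes the day-to-month, month-to-quarter and quarter-to-year mappings. The second consistency condition of Definition 1 requires that any two such composition paths between the same pair of levels agree; the sketch simply returns the first path found and does not verify this.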
As should be clear from Definition 1, in this paper we consider a general notion of taxonomy in which, whenever l1 ≤L l2 for two levels in an h-domain, then the set of ground members for l1 is contained in the set of ground members for l2 . The following result can be easily shown. Proposition 1. The poset (L, ≤L ) is a lattice and therefore every pair of levels l1 and l2 in L has both a glb and a lub.
Actually, a partial order ≤M can also be defined on the members M of an h-domain h: it is induced by the containment mappings as follows.

Definition 2 (Poset on members). Let h be an h-domain and m1 and m2 be members of levels l1 and l2 of h, respectively. We have that m1 ≤M m2 if: (i) l1 ≤L l2 and (ii) $\mathit{cmap}_{l_1}^{l_2}(m_1) = m_2$.

Example 2. Consider the h-domain of Example 1. Given the members m1 = 29/06/2010 and m2 = 23/08/2010 of the level day, m3 = 06/2010 and m4 = 08/2010 of the level month, m5 = 2Q 2010 and m6 = 3Q 2010 of the level quarter, m7 = 2010 of the level year, and m8 = Summer of the level season, we have: m1 ≤M m3 ≤M m5 ≤M m7, m2 ≤M m4 ≤M m6 ≤M m7, and m1 ≤M m8 and m2 ≤M m8.

Example 2 shows an interesting property: unlike the poset on the levels of an h-domain, the poset on the members of an h-domain is not a lattice in general. Consider for instance the members m1 and m2 of the example above: they have no lower bound, since their intersection is empty (more precisely, the intersection of the ground members that they represent is empty), and have two incomparable minimal upper bounds: 2010 at the year level and Summer at the season level. Indeed, it is possible to show that the poset (M, ≤M) can be converted into a lattice by adding to M all the elements of the powerset of the ground members (including the empty set, which would become the bottom level). This however would imply an explosion of the number of members and an unnatural representation of an h-domain.

We are ready to introduce the main construct of the data model: the t-relation, a natural extension of a relational table built over taxonomies of values.

Definition 3 (T-relation). Let H be a set of h-domains. We denote by S = {A1 : l1, . . . , Ak : lk} a t-schema (schema over taxonomies), where each Ai is a distinct attribute name and each li is a level of some h-domain in H. A t-tuple t over a t-schema S = {A1 : l1, . . . , Ak : lk} is a function mapping each attribute Ai to a member of li. A t-relation r over S is a set of t-tuples over S.

Given a t-tuple t over a t-schema S and an attribute Ai occurring in S on level li, we will denote by t[Ai : li] the member of level li associated with t on Ai. Following common practice in relational database literature, we use the same notation A : l to indicate both the single attribute-level pair A : l and the singleton set {A : l}; also, we indicate the union of attribute-level pairs (or sets thereof) by means of the juxtaposition of their names. For a subset S′ of S, we will denote by t[S′] the restriction of t to S′. Finally, for the sake of simplicity, often in the following we will not make any distinction between the name of an attribute of a t-relation and the name of the corresponding h-domain, when no ambiguities can arise.

Example 3. As an example, a t-schema over the h-domains time, location and weather conditions can be the following: S = {Time : day, Location : city, Weather : brief}. A possible t-relation over this schema is the following:
r1 =  Time:day     Location:city   Weather:brief
      11/05/2010   Rome            Sunny           t1,1
      24/04/2009   Milan           Cloudy          t1,2
      24/07/2010   New York        Showers         t1,3
Then we have: t1,1[Location:city] = Rome. A partial order relation on both t-schemas and t-relations can also be defined in a natural way.

Definition 4 (Poset on t-schemas). Let S1 and S2 be t-schemas over sets of h-domains H1 and H2 respectively. We have that S1 ≤S S2 if: (i) H2 ⊆ H1, and (ii) for each Ai : li ∈ S2 there is an element Ai : lj ∈ S1 such that lj ≤L li.

Definition 5 (Poset on t-tuples). Let t1 and t2 be t-tuples over S1 and S2 respectively. We have that t1 ≤t t2 if: (i) S1 ≤S S2, and (ii) for each Ai : li ∈ S2 there is an element Ai : lj ∈ S1 such that t1[Ai : lj] ≤M t2[Ai : li].

Definition 6 (Poset on t-relations). Let r1 and r2 be t-relations over S1 and S2 respectively. We have that r1 ≤r r2 if for each t-tuple t ∈ r1 there is a t-tuple t′ ∈ r2 such that t ≤t t′.

Note that, in these definitions, we assume that levels of the same h-domain occur in different t-schemas with the same attribute name: this strongly simplifies the notation that follows without loss of expressibility. Basically, it suffices to use as attribute name the role played by the h-domain in the application scenario modeled by the t-schema.

Example 4. Consider the following t-relations:

S1 =  Title:cultural-event   Author:artist   Time:day     Location:theater
r1 =  Romeo & Juliet         Prokofiev       13/04/2010   La Scala           t1,1
      Carmen                 Bizet           24/05/2010   Opéra Garnier      t1,2
      Requiem                Verdi           28/03/2010   La Scala           t1,3
      La bohème              Puccini         09/01/2010   Opéra Garnier      t1,4

S2 =  Title:event   Time:quarter   Location:city
r2 =  Concert       1Q 2010        Milan           t2,1
      Ballet        2Q 2010        Milan           t2,2
      Sport         3Q 2010        Rome            t2,3
      Opera         2Q 2010        Paris           t2,4
Then, it is easy to see that: (i) S1 ≤S S2, and (ii) t1,1 ≤t t2,2, t1,2 ≤t t2,4, t1,3 ≤t t2,1, and t1,4 ≤t t2,4. It follows that r1 ≤r r2. The same considerations made for the poset on levels also apply to the poset on t-schemas.

Proposition 2. Let S be the set of all possible t-schemas over a set of h-domains H. Then, the poset (S, ≤S) is a lattice.

Conversely, the poset on t-relations is not a lattice in general since, as is easy to show, two t-relations can have more than one minimal upper bound (but necessarily at least one) as well as more than one maximal lower bound (possibly none). In the following, for the sake of simplicity, we will often make no distinction between the name of an attribute and the corresponding level.
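The orders of Definitions 2, 5 and 6 can be checked mechanically on part of Example 4. The sketch below rests on several assumptions: a t-tuple is modelled as a dictionary from attribute name to a (level, member) pair, the hypothetical table CMAP lists containment mappings only for the level pairs that occur, and the member assignments (for example Romeo & Juliet to Ballet) are those implied by the comparisons stated in the example.

```python
# Illustrative containment mappings: (lower level, upper level) -> member map.
CMAP = {
    ("cultural-event", "event"): {"Romeo & Juliet": "Ballet", "Requiem": "Concert"},
    ("day", "quarter"): {"13/04/2010": "2Q 2010", "28/03/2010": "1Q 2010"},
    ("theater", "city"): {"La Scala": "Milan"},
}


def member_leq(l1, m1, l2, m2):
    """m1 <=_M m2: the containment mapping from l1 to l2 sends m1 to m2."""
    if l1 == l2:
        return m1 == m2
    return CMAP.get((l1, l2), {}).get(m1) == m2


def tuple_leq(t1, t2):
    """t1 <=_t t2: every attribute of t2 is matched by a finer value in t1."""
    return all(attr in t1 and member_leq(*t1[attr], *t2[attr]) for attr in t2)


def relation_leq(r1, r2):
    """r1 <=_r r2: each t-tuple of r1 is below some t-tuple of r2."""
    return all(any(tuple_leq(t, u) for u in r2) for t in r1)


t11 = {"Title": ("cultural-event", "Romeo & Juliet"), "Author": ("artist", "Prokofiev"),
       "Time": ("day", "13/04/2010"), "Location": ("theater", "La Scala")}
t13 = {"Title": ("cultural-event", "Requiem"), "Author": ("artist", "Verdi"),
       "Time": ("day", "28/03/2010"), "Location": ("theater", "La Scala")}
t21 = {"Title": ("event", "Concert"), "Time": ("quarter", "1Q 2010"),
       "Location": ("city", "Milan")}
t22 = {"Title": ("event", "Ballet"), "Time": ("quarter", "2Q 2010"),
       "Location": ("city", "Milan")}
```

Here tuple_leq(t11, t22) and tuple_leq(t13, t21) hold, matching t1,1 ≤t t2,2 and t1,3 ≤t t2,1, and the Author attribute of the finer tuples is ignored because it does not occur in S2.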
3 Querying with Taxonomies
In this section we present TRA (Taxonomy-based Relational Algebra), an extension of relational algebra over t-relations. This language provides insights on the way in which data can be manipulated taking advantage of available taxonomies over those data. Moreover, for its procedural nature, it can be profitably used to specify query optimization. The goal is to provide a solid foundation to querying databases with taxonomies.

Similarly to what happens with the standard relational algebra, the operators of TRA are closed, that is, they apply to t-relations and produce a t-relation as result. In this way, the various operators can be composed to form the t-expressions of the language. TRA is a conservative extension of basic relational algebra (RA) and so it includes its standard operators: selection (σ), projection (π), and natural join (⋈). It also includes some variants of these operators that are obtained by combining them with the following two new operators.

Definition 7 (Upward extension). Let r be a t-relation over S, A be an attribute in S defined over a level l, and l′ be a level such that l ≤L l′. The upward extension of r to l′, denoted by $\hat{\varepsilon}_{A:l}^{A:l'}(r)$, is the t-relation over S ∪ {A : l′} defined as follows:

$$\hat{\varepsilon}_{A:l}^{A:l'}(r) = \{\, t \mid \exists t' \in r : t[S] = t',\ t[A:l'] = \mathit{cmap}_{l}^{l'}(t'[A:l]) \,\}$$

Definition 8 (Downward extension). Let r be a t-relation over S, A be an attribute in S defined over a level l, and l′ be a level such that l′ ≤L l. The downward extension of r to l′, denoted by $\check{\varepsilon}_{A:l'}^{A:l}(r)$, is the t-relation over S ∪ {A : l′} defined as follows:

$$\check{\varepsilon}_{A:l'}^{A:l}(r) = \{\, t \mid \exists t' \in r : t[S] = t',\ t'[A:l] = \mathit{cmap}_{l'}^{l}(t[A:l']) \,\}$$

For simplicity, in the following we will often simply write $\hat{\varepsilon}_{l}^{l'}$ or $\check{\varepsilon}_{l'}^{l}$, when there is no ambiguity on the attribute name associated with the corresponding levels.

Example 5. Consider the t-relations r1 and r2 from Example 4. The result of $\hat{\varepsilon}_{theater}^{city}(r_1)$ is the following t-relation.

S3 =  Title:cultural-event   Author:artist   Time:day     Location:theater   Location:city
r3 =  Romeo & Juliet         Prokofiev       13/04/2010   La Scala           Milan           t3,1
      Carmen                 Bizet           24/05/2010   Opéra Garnier      Paris           t3,2
      Requiem                Verdi           28/03/2010   La Scala           Milan           t3,3
      La bohème              Puccini         09/01/2010   Opéra Garnier      Paris           t3,4

The result of $\check{\varepsilon}_{month}^{quarter}(r_2)$ is the following t-relation.

S4 =  Title:event   Time:quarter   Location:city   Time:month
r4 =  Concert       1Q 2010        Milan           Jan 2010     t4,1
      Concert       1Q 2010        Milan           Feb 2010     t4,2
      Concert       1Q 2010        Milan           Mar 2010     t4,3
      Sport         3Q 2010        Rome            Jul 2010     t4,4
      Sport         3Q 2010        Rome            Aug 2010     t4,5
      Sport         3Q 2010        Rome            Sep 2010     t4,6
      ...           ...            ...             ...
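The two extension operators of Example 5 can be sketched over in-memory tuples. The code below is illustrative only and makes several assumptions: a t-relation is a list of dictionaries keyed by attribute:level strings, each containment mapping is given as a plain dictionary from finer to coarser members, only some columns and rows of r1 and r2 are reproduced, and the function names are hypothetical.

```python
# Containment mappings as plain dicts from finer to coarser members.
THEATER_TO_CITY = {"La Scala": "Milan", "Opéra Garnier": "Paris"}
MONTH_TO_QUARTER = {
    "Jan 2010": "1Q 2010", "Feb 2010": "1Q 2010", "Mar 2010": "1Q 2010",
    "Jul 2010": "3Q 2010", "Aug 2010": "3Q 2010", "Sep 2010": "3Q 2010",
}


def upward_extension(relation, attr, mapping, new_attr):
    """Add `new_attr` holding the coarser member that contains t[attr]."""
    return [{**t, new_attr: mapping[t[attr]]} for t in relation]


def downward_extension(relation, attr, mapping, new_attr):
    """Add `new_attr`: one tuple per finer member that is mapped to t[attr]."""
    return [
        {**t, new_attr: fine}
        for t in relation
        for fine, coarse in mapping.items()
        if coarse == t[attr]
    ]


# Fragments of r1 and r2 from Example 4 (only some columns and rows).
r1 = [
    {"Title:cultural-event": "Romeo & Juliet", "Location:theater": "La Scala"},
    {"Title:cultural-event": "Carmen", "Location:theater": "Opéra Garnier"},
]
r2 = [
    {"Title:event": "Concert", "Time:quarter": "1Q 2010"},
    {"Title:event": "Sport", "Time:quarter": "3Q 2010"},
]

r3 = upward_extension(r1, "Location:theater", THEATER_TO_CITY, "Location:city")
r4 = downward_extension(r2, "Time:quarter", MONTH_TO_QUARTER, "Time:month")
```

The upward extension produces exactly one output tuple per input tuple, because a containment mapping is a function, whereas the downward extension may multiply tuples, as the three months per quarter in r4 show.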
The main rationale behind the introduction of the upward extension is the need to relax a query with respect to the level of detail of the queried information. For example, one might want to find events taking place in a given country, even though the events might be stored with a finer granularity (e.g., city). Similarly, the downward extension allows the relaxation of the answer with respect to the level of detail of the query. For instance, a query about products available in a given day may return the products available in that day’s month. Both kinds of extensions meet needs that arise naturally in several application domains. For this purpose, we introduce two new operators for the selection that leverage the available taxonomies; they can reference an h-domain that is more general or more specific than that occurring in its tuples.

Definition 9 (Upward selection). Let r be a t-relation over S, A be an attribute in S defined over l, m be a member of l′ with l ≤L l′, and θ ∈ {=, <, >, ≤, ≥, ≠}: the upward selection of r with respect to A θ m on level l, denoted by $\hat{\sigma}_{A:l\,\theta\,m}(r)$, is the t-relation over S defined as follows:

$$\hat{\sigma}_{A:l\,\theta\,m}(r) = \{\, t \in r \mid \mathit{cmap}_{l}^{l'}(t[A:l])\ \theta\ m \,\}$$

Definition 10 (Downward selection). Let r be a t-relation over S, A be an attribute in S defined over l, m be a member of l′ with l′ ≤L l, and θ ∈ {=, <, >, ≤, ≥, ≠}: the downward selection of r with respect to A θ m on level l, denoted by $\check{\sigma}_{A:l\,\theta\,m}(r)$, is the t-relation over S defined as follows:

$$\check{\sigma}_{A:l\,\theta\,m}(r) = \{\, t \in r \mid \mathit{cmap}_{l'}^{l}(m)\ \theta\ t[A:l] \,\}$$

In the following, we will often simply write $\hat{\sigma}_{A\,\theta\,m}$ and $\check{\sigma}_{A\,\theta\,m}$, without explicitly indicating the name of the level, when this is unambiguously determined by the corresponding attribute. Also, we will call these operators t-selections, to distinguish them from the standard selection operator.

Example 6. Consider again the t-relations r1 and r2 from Example 4. We have that: $\hat{\sigma}_{City=Milan}(r_1) = \{t_{1,1}, t_{1,3}\}$ and $\check{\sigma}_{Day=13/03/2010}(r_2) = \{t_{2,1}\}$.

It can be easily seen that these operators can be obtained by composing the upward or downward extension, the (standard) selection, and the projection operators, as shown in (1) and (2) below.

$$\hat{\sigma}_{A:l\,\theta\,m}(r) = \pi_{S}(\sigma_{A:l'\,\theta\,m}(\hat{\varepsilon}_{A:l}^{A:l'}(r))) \qquad (1)$$

$$\check{\sigma}_{A:l\,\theta\,m}(r) = \pi_{S}(\sigma_{A:l'\,\theta\,m}(\check{\varepsilon}_{A:l'}^{A:l}(r))) \qquad (2)$$
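Equation (1) can be illustrated by comparing the direct definition of the upward selection with its composed form. The sketch below is a simplification under assumptions: t-relations are lists of dictionaries, the theater-to-city mapping is a plain dictionary, θ is passed as a Python comparison function, and the temporary attribute name and function names are hypothetical.

```python
import operator

# Assumed containment mapping from theater to city.
THEATER_TO_CITY = {"La Scala": "Milan", "Opéra Garnier": "Paris"}


def upward_selection(relation, attr, mapping, theta, member):
    """Direct form: keep t when cmap(t[attr]) theta member."""
    return [t for t in relation if theta(mapping[t[attr]], member)]


def upward_selection_composed(relation, attr, mapping, theta, member):
    """Composed form of equation (1): extend, select, then project back to S."""
    tmp = "__coarser__"  # hypothetical name for the attribute added by the extension
    extended = [{**t, tmp: mapping[t[attr]]} for t in relation]
    selected = [t for t in extended if theta(t[tmp], member)]
    return [{k: v for k, v in t.items() if k != tmp} for t in selected]


# r1 from Example 4, restricted to the Title and Location columns.
r1 = [
    {"Title": "Romeo & Juliet", "Location": "La Scala"},
    {"Title": "Carmen", "Location": "Opéra Garnier"},
    {"Title": "Requiem", "Location": "La Scala"},
    {"Title": "La bohème", "Location": "Opéra Garnier"},
]

direct = upward_selection(r1, "Location", THEATER_TO_CITY, operator.eq, "Milan")
composed = upward_selection_composed(r1, "Location", THEATER_TO_CITY, operator.eq, "Milan")
```

Both forms select the tuples for Romeo & Juliet and Requiem, which correspond to t1,1 and t1,3 in Example 6. The composed form is useful mainly as a rewriting target, since it exposes a standard selection that an optimizer already knows how to move.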
Finally, we introduce two new join operators. Their main purpose is to combine information stored at different levels of granularity.

Definition 11 (Upward join). Let r1 and r2 be two t-relations over S1 and S2 respectively, and let S be an upper bound of a subset S̄1 of S1 and a subset S̄2 of S2. The upward join of r1 and r2 with respect to S on S̄1 and S̄2, denoted by $r_1 \mathbin{\hat{\bowtie}}_{S:\bar{S}_1,\bar{S}_2} r_2$, is the t-relation over S1 ∪ S2 defined as follows:

$$r_1 \mathbin{\hat{\bowtie}}_{S:\bar{S}_1,\bar{S}_2} r_2 = \{\, t \mid \exists t_1 \in r_1,\ \exists t_2 \in r_2,\ \exists t' \text{ over } S : t_1[\bar{S}_1] \le_t t',\ t_2[\bar{S}_2] \le_t t',\ t[S_1] = t_1,\ t[S_2] = t_2 \,\}$$

Definition 12 (Downward join). Let r1 and r2 be two t-relations over S1 and S2 respectively, and let S be a lower bound of a subset S̄1 of S1 and a subset S̄2 of S2. The downward join of r1 and r2 with respect to S on S̄1 and S̄2, denoted by $r_1 \mathbin{\check{\bowtie}}_{S:\bar{S}_1,\bar{S}_2} r_2$, is the t-relation over S1 ∪ S2 defined as follows:

$$r_1 \mathbin{\check{\bowtie}}_{S:\bar{S}_1,\bar{S}_2} r_2 = \{\, t \mid \exists t_1 \in r_1,\ \exists t_2 \in r_2,\ \exists t' \text{ over } S : t' \le_t t_1[\bar{S}_1],\ t' \le_t t_2[\bar{S}_2],\ t[S_1] = t_1,\ t[S_2] = t_2 \,\}$$

In the following, we will omit the indication of S̄1 and S̄2 when evident from the context. Also, we will call these operators t-joins, to distinguish them from the standard join operator.

Example 7. Consider the t-relation r1 from Example 4 and the following t-relation.

S5 =  Company:airline-company   Location:airport
r5 =  Alitalia                  Linate             t5,1
      Air France                Roissy             t5,2
The result of $r_1 \mathbin{\hat{\bowtie}}_{city}\, r_5$ is the following t-relation:

S6 = Event:cultural-event  Author:artist  Time:day    Location:theater  Company:airline-company  Location:airport
r6 = Romeo & Juliet        Prokofiev      24/04/2010  La Scala          Alitalia                 Linate            t6,1
     Carmen                Bizet          24/05/2010  Opéra Garnier     Air France               Roissy            t6,2
     Requiem               Verdi          24/03/2010  La Scala          Alitalia                 Linate            t6,3
     La bohème             Puccini        09/01/2010  Opéra Garnier     Air France               Roissy            t6,4
Now, consider the following t-relations.

S7 = Loc:theater  Time:year  Price:money
r7 = La Scala     2010       150          t7,1

S8 = Loc:theater  Time:month  Discount:perc.
r8 = La Scala     03/2010     10%             t8,1
     La Scala     06/2010     20%             t8,2
The result of $r_7 \mathbin{\check{\bowtie}}_{theater,day}\, r_8$ is the following t-relation:

S9 = Loc:theater  Time:year  Price:money  Time:month  Discount:perc.
r9 = La Scala     2010       150          03/2010     10%             t9,1
     La Scala     2010       150          06/2010     20%             t9,2
Also in this case, both the upward join and the downward join can be obtained by combining the upward extension or the downward extension, and the (standard) join. Equation (3) below shows this for the upward join, where $S = \{A_1 : l_1, \ldots, A_n : l_n\}$, $S_i \supseteq \bar{S}_i \supseteq \{A_1 : l_1^i, \ldots, A_n : l_n^i\}$ for $i = 1, 2$, and $P$ is a predicate requiring pairwise equality in both sides of the join for all fields added by the extensions.

$$r_1 \mathbin{\hat{\bowtie}}_{S:\bar{S}_1,\bar{S}_2} r_2 = \pi_{S_1 S_2}\big(\hat{\varepsilon}_{A_1:l_1^1}^{A_1:l_1} \cdots \hat{\varepsilon}_{A_n:l_n^1}^{A_n:l_n}(r_1) \bowtie_P \hat{\varepsilon}_{A_1:l_1^2}^{A_1:l_1} \cdots \hat{\varepsilon}_{A_n:l_n^2}^{A_n:l_n}(r_2)\big) \tag{3}$$

Equation (4) below shows this for the downward join, where $S \supseteq \{A_1 : l_1, \ldots, A_n : l_n\}$, $S_i \supseteq \bar{S}_i \supseteq \{A_1 : l_1^i, \ldots, A_n : l_n^i\}$ for $i = 1, 2$, and $P$ is as above.

$$r_1 \mathbin{\check{\bowtie}}_{S:\bar{S}_1,\bar{S}_2} r_2 = \pi_{S_1 S_2}\big(\check{\varepsilon}_{A_1:l_1}^{A_1:l_1^1} \cdots \check{\varepsilon}_{A_n:l_n}^{A_n:l_n^1}(r_1) \bowtie_P \check{\varepsilon}_{A_1:l_1}^{A_1:l_1^2} \cdots \check{\varepsilon}_{A_n:l_n}^{A_n:l_n^2}(r_2)\big) \tag{4}$$
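The construction in (3) can be sketched for a single attribute. The code below is an illustration under stated assumptions: t-relations are lists of dicts keyed by attribute-level pairs, the two schemas are disjoint, and `CMAP`, `_up1` and `_up2` are hypothetical helper names. The data loosely follows Example 7, where theaters and airports are both rolled up to the level city.

```python
# Assumed toy taxonomy: theaters and airports both roll up to the level "city".
CMAP = {
    ("theater", "city"): {"La Scala": "Milan", "Opéra Garnier": "Paris"},
    ("airport", "city"): {"Linate": "Milan", "Roissy": "Paris"},
}

def upward_extension(r, attr, low, up, out_key):
    """Add a column out_key holding cmap(low -> up) applied to attr:low."""
    cmap = CMAP[(low, up)]
    return [{**t, out_key: cmap[t[(attr, low)]]} for t in r]

def upward_join(r1, r2, attr, low1, low2, up):
    """Equation (3) for one attribute: extend both sides, equi-join, project."""
    e1 = upward_extension(r1, attr, low1, up, "_up1")
    e2 = upward_extension(r2, attr, low2, up, "_up2")
    # P: pairwise equality on the fields added by the two extensions.
    joined = [{**t1, **t2} for t1 in e1 for t2 in e2 if t1["_up1"] == t2["_up2"]]
    # Projecting on S1 ∪ S2 drops the two helper columns again.
    return [{k: v for k, v in t.items() if k not in ("_up1", "_up2")} for t in joined]

EVENT, THEATER = ("Event", "cultural-event"), ("Location", "theater")
COMPANY, AIRPORT = ("Company", "airline-company"), ("Location", "airport")

r1 = [{EVENT: "Requiem", THEATER: "La Scala"},
      {EVENT: "Carmen", THEATER: "Opéra Garnier"}]
r5 = [{COMPANY: "Alitalia", AIRPORT: "Linate"},
      {COMPANY: "Air France", AIRPORT: "Roissy"}]

r6 = upward_join(r1, r5, "Location", "theater", "airport", "city")
pairs = [(t[EVENT], t[COMPANY]) for t in r6]
```

Each event is paired with the airline whose airport lies in the same city as the theater, although neither input relation stores a city.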
As in the standard relational algebra, it is possible to build complex expressions combining several TRA operators thanks to the fact that TRA is closed, i.e., the result of every application of an operator is a t-relation. Formally, one can define and build the expressions of TRA, called t-expressions, by assuming that t-relations themselves are t-expressions, and by substituting the t-relations appearing in Definitions 7-12 with a t-expression. Similar extensions are possible for other RA operators (e.g., difference); we omit them in the interest of space.
4 Query Equivalence in TRA

One of the main benefits of Relational Algebra is the use of algebraic properties for query optimization. In particular, equivalences allow transforming a relational expression into an equivalent expression in which the average size of the relations yielded by subexpressions is smaller. Rewritings may be used, e.g., to break up an application of an operator into several, smaller applications, or to move operators to more convenient places in the expression (e.g., pushing selection and projection through join). In analogy with the standard case, we are now going to describe a collection of new equivalences that can be used for query optimization in Taxonomy-based Relational Algebra. In the remainder of this section, we shall use, together with possible subscripts and primes, the letter r to denote a t-relation, l for a level, A for a set of attributes, and P for a (selection or join) predicate.

4.1 Upward and Downward Extension
Border cases. Let $l$ be the level of an attribute in $r$. Then:

$$\hat{\varepsilon}_{l}^{l}(r) = \check{\varepsilon}_{l}^{l}(r) = r \tag{5}$$

Equivalence (5) shows that if the upper and lower level of an extension coincide, then the extension is idle, both for the upward and for the downward case. The proof of (5) follows immediately from Definitions 7 and 8.

Idempotency. Let $l$ be the level of an attribute in $r$ such that $l \le_L l'$ and $l'' \le_L l$. Then:

$$\hat{\varepsilon}_{l}^{l'}(\hat{\varepsilon}_{l}^{l'}(r)) = \hat{\varepsilon}_{l}^{l'}(r) \tag{6}$$

$$\check{\varepsilon}_{l''}^{l}(\check{\varepsilon}_{l''}^{l}(r)) = \check{\varepsilon}_{l''}^{l}(r) \tag{7}$$
Equivalences (6) and (7) state that repeated applications of the same extension are idle, both for the upward and for the downward case. Here, too, the proof follows immediately from Definitions 7 and 8.
Duality. Let $l$ be the level of an attribute in $r$ such that $l' \le_L l$. Then:

$$\hat{\varepsilon}_{l'}^{l}(\check{\varepsilon}_{l'}^{l}(r)) = \check{\varepsilon}_{l'}^{l}(r) \tag{8}$$

The above Equivalence (8) shows that an upward extension is always idle after a downward extension on the same levels. To prove (8), it suffices to consider that the mapping from members of a lower level to members of an upper level is many-to-one, so no new tuple can be generated by the upward extension. Note, however, that the downward extension after an upward extension on the same levels is generally not redundant, since the mapping from members of an upper level to members of a lower level is one-to-many.

Commutativity. Let $l_1$, $l_2$ be levels of attributes of $r$, s.t. $l_i \le_L l_i'$ and $l_i'' \le_L l_i$, for $i = 1, 2$. Then:

$$\hat{\varepsilon}_{l_2}^{l_2'}(\hat{\varepsilon}_{l_1}^{l_1'}(r)) = \hat{\varepsilon}_{l_1}^{l_1'}(\hat{\varepsilon}_{l_2}^{l_2'}(r)) \tag{9}$$

$$\check{\varepsilon}_{l_2''}^{l_2}(\check{\varepsilon}_{l_1''}^{l_1}(r)) = \check{\varepsilon}_{l_1''}^{l_1}(\check{\varepsilon}_{l_2''}^{l_2}(r)) \tag{10}$$
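The many-to-one argument behind Equivalence (8) can be checked on a small instance. This is only a sketch: the extension operators are defined in Definitions 7 and 8, which are not restated here, so the functions below encode one assumed reading of them (upward extension maps each member to its unique parent, downward extension produces one tuple per child member). The taxonomy `UP` and the field names are hypothetical.

```python
# Assumed toy taxonomy: two theaters in Milan, one in Paris (many-to-one).
UP = {"La Scala": "Milan", "Piccolo Teatro": "Milan", "Opéra Garnier": "Paris"}

def up_ext(r):
    """Upward extension theater -> city over a set of frozenset-encoded tuples."""
    out = set()
    for t in r:
        d = dict(t)
        out.add(frozenset({**d, "city": UP[d["theater"]]}.items()))
    return out

def down_ext(r):
    """Downward extension city -> theater: one output tuple per child member."""
    out = set()
    for t in r:
        d = dict(t)
        for theater, city in UP.items():
            if city == d["city"]:
                out.add(frozenset({**d, "theater": theater}.items()))
    return out

# A t-relation stored at the level city.
r = {frozenset({"event": "Requiem", "city": "Milan"}.items())}

down = down_ext(r)      # one tuple per theater in Milan
round_trip = up_ext(down)  # mapping each theater back up yields Milan again
```

Because every theater maps back to the city it was generated from, the upward extension applied after the downward one adds nothing, which is the content of (8) under these assumptions.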
The above Equivalences (9) and (10) state that two extensions of the same kind can be swapped. Both follow straightforwardly from Definitions 7 and 8.

Interplay with standard projection. Let $l$ be the level of an attribute $A$ in a relation $r$ over $S$ s.t. $l \le_L l_1' \le_L l_2'$ and $l_2 \le_L l_1 \le_L l$, and let $A_p \subseteq S$ s.t. $A_p \not\ni A : l_1$ and $A_p \not\ni A : l_1'$. Then:

$$\pi_{A_p}\,\hat{\varepsilon}_{A:l}^{A:l_2'}(r) = \pi_{A_p}\,\hat{\varepsilon}_{A:l_1'}^{A:l_2'}(\hat{\varepsilon}_{A:l}^{A:l_1'}(r)) \tag{11}$$

$$\pi_{A_p}\,\check{\varepsilon}_{A:l_2}^{A:l}(r) = \pi_{A_p}\,\check{\varepsilon}_{A:l_2}^{A:l_1}(\check{\varepsilon}_{A:l_1}^{A:l}(r)) \tag{12}$$

Note that the outer $\pi_{A_p}$ in Equivalence (11) is necessary, because, in case $l \neq l_1' \neq l_2'$, the left-hand sides of the equivalences would be t-relations that do not include the attribute-level pair $A : l_1'$, whereas the right-hand sides would; therefore, projecting away $A : l_1'$ is essential. Similarly for Equivalence (12).

Let $l$ be the level of an attribute $A$ in a relation $r$ over $S$ s.t. $l \le_L l'$ and $l'' \le_L l$, and $A_p \subseteq S$ s.t. $A_p \not\ni A : l'$ and $A_p \not\ni A : l''$. Then:

$$\pi_{A_p}(\hat{\varepsilon}_{A:l}^{A:l'}(r)) = \hat{\varepsilon}_{A:l}^{A:l'}(\pi_{A_p}(r)) \tag{13}$$

$$\pi_{A_p}(\check{\varepsilon}_{A:l''}^{A:l}(r)) = \check{\varepsilon}_{A:l''}^{A:l}(\pi_{A_p}(r)) \tag{14}$$
Equivalences (13) and (14) show that, similarly to Equivalences (11) and (12), it is also possible to swap extension and standard projection provided that the projection does not retain the added attribute.
Interplay with standard selection. Let $l$ be the level of an attribute $A$ in $r$ s.t. $l \le_L l'$ and $l'' \le_L l$, and $P$ be a selection predicate that does not refer either to $A : l'$ or $A : l''$. Then:

$$\sigma_P(\hat{\varepsilon}_{A:l}^{A:l'}(r)) = \hat{\varepsilon}_{A:l}^{A:l'}(\sigma_P(r)) \tag{15}$$

$$\sigma_P(\check{\varepsilon}_{A:l''}^{A:l}(r)) = \check{\varepsilon}_{A:l''}^{A:l}(\sigma_P(r)) \tag{16}$$
Equivalences (15) and (16) show swapping of extension and standard selection, when the added attribute-level pair is immaterial to the selection predicate.

Interplay with standard join. Let $l$ be the level of an attribute $A$ in $r_1$ but not $r_2$ s.t. $l \le_L l'$, $l'' \le_L l$, and $P$ be a join predicate that does not refer either to $A : l'$ or $A : l''$. Then:

$$\hat{\varepsilon}_{A:l}^{A:l'}(r_1 \bowtie_P r_2) = (\hat{\varepsilon}_{A:l}^{A:l'}(r_1)) \bowtie_P r_2 \tag{17}$$

$$\check{\varepsilon}_{A:l''}^{A:l}(r_1 \bowtie_P r_2) = (\check{\varepsilon}_{A:l''}^{A:l}(r_1)) \bowtie_P r_2 \tag{18}$$

Equivalences (17) and (18) show that extension can be “pushed” through standard join. (Note that, if $A : l$ was in the schema of both $r_1$ and $r_2$, the extension should be “pushed” through both sides of the join.)

4.2 Upward and Downward Selection
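Idempotency. Let $l$ be the level of an attribute $A$ of $r$ s.t. $l \le_L l'$ and $l'' \le_L l$, where $l'$ is the level of $m'$ and $l''$ of $m''$. Then:

$$\hat{\sigma}_{A:l\,\theta\,m'}(\hat{\sigma}_{A:l\,\theta\,m'}(r)) = \hat{\sigma}_{A:l\,\theta\,m'}(r) \tag{19}$$

$$\check{\sigma}_{A:l\,\theta\,m''}(\check{\sigma}_{A:l\,\theta\,m''}(r)) = \check{\sigma}_{A:l\,\theta\,m''}(r) \tag{20}$$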
Equivalences (19) and (20) state that repeated applications of the same t-selection are idle, both for the upward and for the downward case. To prove (19), consider that, by (1), the left-hand side of the equivalence can be written as:

$$\pi_S(\sigma_{A:l'\,\theta\,m'}(\hat{\varepsilon}_{A:l}^{A:l'}(\pi_S(\sigma_{A:l'\,\theta\,m'}(\hat{\varepsilon}_{A:l}^{A:l'}(r))))))$$

where $S$ is the schema of $r$. The innermost $\pi_S$ can be moved outside the upward extension by using equivalence (13) if $l \neq l'$ or equivalence (5) if $l = l'$. By using standard properties of the relational operators, the innermost $\pi_S$ can also be moved outside the outermost selection, and eliminated by idempotency:

$$\pi_S(\sigma_{A:l'\,\theta\,m'}(\hat{\varepsilon}_{A:l}^{A:l'}(\sigma_{A:l'\,\theta\,m'}(\hat{\varepsilon}_{A:l}^{A:l'}(r)))))$$

Now, equivalence (15) allows swapping selection and upward extension provided that the selection predicate does not refer to the attribute-level pair introduced by $\hat{\varepsilon}$. This condition is only required to make sure that, after the swap, the selection refers to an existing attribute-level pair. Therefore, equivalence (15) can be used here to move the innermost selection outside the outermost $\hat{\varepsilon}$, although $A : l'\ \theta\ m'$ is a predicate that clearly refers to $A : l'$ ($l'$ being the level of $m'$), since $A : l'$ is already introduced by the innermost $\hat{\varepsilon}$. By idempotency of both standard selection and upward extension (as of equivalence (6)), we obtain

$$\pi_S(\sigma_{A:l'\,\theta\,m'}(\hat{\varepsilon}_{A:l}^{A:l'}(r)))$$
which, by (1), corresponds to the right-hand side of (19). Analogously for (20).

Commutativity.

$$\hat{\sigma}_{A_2:l_2\,\theta_2\,m_2}(\hat{\sigma}_{A_1:l_1\,\theta_1\,m_1}(r)) = \hat{\sigma}_{A_1:l_1\,\theta_1\,m_1}(\hat{\sigma}_{A_2:l_2\,\theta_2\,m_2}(r)) \tag{21}$$

$$\check{\sigma}_{A_2:l_2\,\theta_2\,m_2}(\check{\sigma}_{A_1:l_1\,\theta_1\,m_1}(r)) = \check{\sigma}_{A_1:l_1\,\theta_1\,m_1}(\check{\sigma}_{A_2:l_2\,\theta_2\,m_2}(r)) \tag{22}$$

$$\check{\sigma}_{A_2:l_2\,\theta_2\,m_2}(\hat{\sigma}_{A_1:l_1\,\theta_1\,m_1}(r)) = \hat{\sigma}_{A_1:l_1\,\theta_1\,m_1}(\check{\sigma}_{A_2:l_2\,\theta_2\,m_2}(r)) \tag{23}$$

The above equivalences state that t-selection is commutative, both for the upward and the downward case. Moreover, an upward selection can be swapped with a downward selection (and vice versa), as shown in equivalence (23). The proof of these follows straightforwardly from commutativity of standard selection and interplay of extension and standard selection.

4.3 Upward and Downward Join
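Pushing upward and downward selection through upward and downward join. Let $A : l$ be in the schema $S_1$ of $r_1$ but not in the schema $S_2$ of $r_2$, and $C_{low}$ and $C_{up}$ be a lower and an upper bound of $C_1 \subseteq S_1$ and $C_2 \subseteq S_2$. Then:

$$\hat{\sigma}_{A:l\,\theta\,m}(r_1 \mathbin{\hat{\bowtie}}_{C_{up}:C_1,C_2} r_2) = (\hat{\sigma}_{A:l\,\theta\,m}\, r_1) \mathbin{\hat{\bowtie}}_{C_{up}:C_1,C_2} r_2 \tag{24}$$

$$\hat{\sigma}_{A:l\,\theta\,m}(r_1 \mathbin{\check{\bowtie}}_{C_{low}:C_1,C_2} r_2) = (\hat{\sigma}_{A:l\,\theta\,m}\, r_1) \mathbin{\check{\bowtie}}_{C_{low}:C_1,C_2} r_2 \tag{25}$$

$$\check{\sigma}_{A:l\,\theta\,m}(r_1 \mathbin{\hat{\bowtie}}_{C_{up}:C_1,C_2} r_2) = (\check{\sigma}_{A:l\,\theta\,m}\, r_1) \mathbin{\hat{\bowtie}}_{C_{up}:C_1,C_2} r_2 \tag{26}$$

$$\check{\sigma}_{A:l\,\theta\,m}(r_1 \mathbin{\check{\bowtie}}_{C_{low}:C_1,C_2} r_2) = (\check{\sigma}_{A:l\,\theta\,m}\, r_1) \mathbin{\check{\bowtie}}_{C_{low}:C_1,C_2} r_2 \tag{27}$$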
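Equivalence (24) can be exercised on a small instance. The sketch below assumes an equality predicate, a single join attribute per side, and a hypothetical `CITY` table that maps both theaters and airports to cities; the operators are written directly from their set-builder definitions rather than through extensions.

```python
# Assumed toy mapping from theaters and airports to the common level "city".
CITY = {"La Scala": "Milan", "Opéra Garnier": "Paris",
        "Linate": "Milan", "Roissy": "Paris"}

def up_select(r, key, member):
    """Upward selection: keep t if the city of t[key] equals member."""
    return [t for t in r if CITY[t[key]] == member]

def up_join(r1, key1, r2, key2):
    """Upward join at level city: pair tuples whose members share a city."""
    return [{**t1, **t2} for t1 in r1 for t2 in r2
            if CITY[t1[key1]] == CITY[t2[key2]]]

r1 = [{"Event": "Requiem", "Location:theater": "La Scala"},
      {"Event": "Carmen", "Location:theater": "Opéra Garnier"}]
r5 = [{"Company": "Alitalia", "Location:airport": "Linate"},
      {"Company": "Air France", "Location:airport": "Roissy"}]

# Left-hand side of (24): join first, then select.
late = up_select(up_join(r1, "Location:theater", r5, "Location:airport"),
                 "Location:theater", "Milan")
# Right-hand side of (24): select on r1 first, then join.
early = up_join(up_select(r1, "Location:theater", "Milan"),
                "Location:theater", r5, "Location:airport")
```

Both plans return the same t-relation, but the second one joins a smaller left operand, which is the kind of rewriting an optimizer would look for.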
The above equivalences (24)-(27) indicate that a t-selection can be “pushed” through a t-join on the side that involves the attribute-level pair used in the selection. To prove the equivalences, it suffices to use (1)-(4) and the properties of standard operators.

Pushing standard projection through upward and downward join. Let $r_i$ be a t-relation over $S_i$ for $i = 1, 2$, $C_{low}$ and $C_{up}$ be a lower and an upper bound of $C_1 \subseteq S_1$ and $C_2 \subseteq S_2$, $L$ be a subset of $S_1 \cup S_2$, and $L_i = C_i \cup (L \setminus S_i)$ for $i = 1, 2$. Then:

$$\pi_L(r_1 \mathbin{\hat{\bowtie}}_{C_{up}:C_1,C_2} r_2) = \pi_L((\pi_{L_1} r_1) \mathbin{\hat{\bowtie}}_{C_{up}:C_1,C_2} (\pi_{L_2} r_2)) \tag{28}$$

$$\pi_L(r_1 \mathbin{\check{\bowtie}}_{C_{low}:C_1,C_2} r_2) = \pi_L((\pi_{L_1} r_1) \mathbin{\check{\bowtie}}_{C_{low}:C_1,C_2} (\pi_{L_2} r_2)) \tag{29}$$
Equivalences (28) and (29) show how standard projection can be “pushed” through an upward or downward join to both sides of the join by properly breaking up the projection attributes into smaller sets. Again, the equivalences follow immediately by applying (3) and (4) together with the standard “push” of projection through join and through extension (as of equivalences (13) and (14)).

From the above discussion, we have the following correctness result.

Theorem 1. Equivalences (5)-(29) hold for any possible t-relation.

Theorem 1 together with the fact that TRA is closed entails that equivalences (5)-(29) can also be used to test equivalence of complex t-expressions.

Finally, we observe that some applications of the TRA operators preserve partial order between relations. For instance, $r_1 \le_r r_2$ entails (i) $\hat{\varepsilon}_{l}^{l'}(r_1) \le_r r_2$, (ii) $\check{\varepsilon}_{l'}^{l}(r_1) \le_r r_2$, and (iii) $r_1 \le_r \hat{\varepsilon}_{l}^{l'}(r_2)$, but not (iv) $r_1 \le_r \check{\varepsilon}_{l'}^{l}(r_2)$.
5 Related Work
The approach proposed in this paper, which extends to a more general scenario a work on modeling context-aware database applications [10], is focused on the relaxation of queries to a less restricted form with the goal of accommodating user’s needs. This problem has been investigated in several research areas under different perspectives. In the database area, query relaxation has been addressed in the context of XML and semi-structured databases, with the goal of combining database style querying and keyword search [1] and for querying databases with natural language interfaces [9]. Query relaxation has also been addressed to avoid empty answers to complex queries by adjusting values occurring in selections and joins [8]. Malleable schemas [5,11] deal with vagueness and ambiguity in database querying by incorporating imprecise and overlapping definitions of data structures. An alternative formal framework relies on multi-structural databases [6], where data objects are segmented according to multiple distinct criteria in a lattice structure and queries are formulated in this structure. The majority of these approaches rely on non-traditional data models, whereas we rely on a simple extension of the relational model. Moreover, none of them consider relaxation via taxonomies, which is our concern. In addition, the systematic analysis of query equivalence for optimization purposes has never been studied in the relaxed case. Query relaxation is also used in location-based search [4], but in the typical IR scenario in which a query consists of a set of terms and query evaluation is focused on the ranked retrieval of documents. This is also the case of the approach in [3], where the authors consider the problem of fuzzy matching queries to items.
Actually, in the information retrieval area, which is however clearly different from ours, document taxonomies have been already used in, e.g., [7], where the authors focus on classifying documents into taxonomy nodes and developing the scoring function to make the matching work well in practice, and in [2], where the authors propose a framework for relaxing user requests over ontologies, a notion that is more general than that of taxonomy.
6 Conclusion
In this paper, we have presented a logical model and an algebraic language as a foundation for querying databases using taxonomies. In order to facilitate the implementation of the approach with current technology, they rely on a natural extension of the relational model. The hierarchical organization of data allows the specification of queries that refer to values at varying levels of detail, possibly different from those available in the underlying database. We have also studied the interaction between the various operators of the query language as a formal foundation for the optimization of taxonomy-based queries.

We believe that several interesting directions of research can be pursued within the framework presented in this paper. We are particularly interested in a deep investigation of general properties of the query language. In particular, we plan to develop methods for the automatic identification of the level in which two heterogeneous t-relations can be joined for integration purposes. Also, we are currently studying the impact of our model on the complexity of query answering. On the practical side, we plan to study how the presented approach can be implemented, in particular whether materialization of taxonomies is convenient. With such a prototype, we plan to develop quantitative analysis oriented to the optimization of relaxed queries. The equivalence results presented in this paper provide an important contribution in this direction.

Acknowledgments. D. Martinenghi acknowledges support from the Search Computing (SeCo) project, funded by the European Research Council (ERC).
References

1. Amer-Yahia, S., Lakshmanan, L.V.S., Pandit, S.: Flexpath: Flexible structure and full-text querying for XML. In: Proc. of SIGMOD, pp. 83–94 (2004)
2. Balke, W.-T., Wagner, M.: Through different eyes: assessing multiple conceptual views for querying web services. In: Proc. of WWW, pp. 196–205 (2004)
3. Broder, A.Z., Fontoura, M., Josifovski, V., Riedel, L.: A semantic approach to contextual advertising. In: Proc. of SIGIR, pp. 559–566 (2007)
4. Chen, Y.-Y., Suel, T., Markowetz, A.: Efficient query processing in geographic web search engines. In: Proc. of SIGMOD, pp. 277–288 (2006)
5. Dong, X., Halevy, A.Y.: Malleable schemas: A preliminary report. In: Proc. of WebDB, pp. 139–144 (2005)
6. Fagin, R., Guha, R.V., Kumar, R., Novak, J., Sivakumar, D., Tomkins, A.: Multistructural databases. In: Proc. of PODS, pp. 184–195 (2005)
7. Fontoura, M., Josifovski, V., Kumar, R., Olston, C., Tomkins, A., Vassilvitskii, S.: Relaxation in text search using taxonomies. In: Proc. of VLDB, vol. 1(1), pp. 672–683 (2008)
8. Koudas, N., Li, C., Tung, A.K.H., Vernica, R.: Relaxing join and selection queries. In: Proc. of VLDB, pp. 199–210 (2006)
9. Li, Y., Yang, H., Jagadish, H.V.: NaLIX: A generic natural language search environment for XML data. TODS 32(4), art. 30 (2007)
10. Martinenghi, D., Torlone, R.: Querying Context-Aware Databases. In: Proc. of FQAS, pp. 76–87 (2009)
11. Zhou, X., Gaugaz, J., Balke, W., Nejdl, W.: Query relaxation using malleable schemas. In: Proc. of SIGMOD, pp. 545–556 (2007)
What Is Wrong with Digital Documents? A Conceptual Model for Structural Cross-Media Content Composition and Reuse

Beat Signer
Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussels, Belgium
[email protected]
Abstract. Many of today’s digital document formats are strongly based on a digital emulation of printed media. While such a paper simulation might be appropriate for the visualisation of certain digital content, it is generally not the most effective solution for digitally managing and storing information. The oversimplistic modelling of digital documents as monolithic blocks of linear content, with a lack of structural semantics, does not pay attention to some of the superior features that digital media offers in comparison to traditional paper documents. For example, existing digital document formats adopt the limitations of paper documents by unnecessarily replicating content via copy and paste operations, instead of digitally embedding and reusing parts of digital documents via structural references. We introduce a conceptual model for structural cross-media content composition and highlight how the proposed solution not only enables the reuse of content via structural relationships, but also supports dynamic and context-dependent document adaptation, structural content annotations as well as the integration of arbitrary non-textual media types. We further discuss solutions for the fluid navigation and cross-media content publishing based on the proposed structural cross-media content model.
1 Introduction
In his 1945 seminal article ‘As We May Think’ [1], the visionary Vannevar Bush introduced the concept of the Memex, a prototypical hypertext machine for storing and accessing information on microfilm. As a knowledge worker, Bush was not happy with the current way of accessing information based on hierarchical classifications such as the Dewey Decimal Classification (DDC). As described in his article, the Memex was meant to enhance information management by introducing a superimposed metadata structure to be considered as a natural extension of the human mind based on cross-references between different microfilms:

When data of any sort are placed in storage, they are filed alphabetically or numerically, and information is found (when it is) by tracing it down from subclass to subclass. [. . . ] The human mind does not work that way. It operates by association.

J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 391–404, 2010. © Springer-Verlag Berlin Heidelberg 2010
While Bush’s vision is often accredited as being the origin of hypermedia systems, some of the early hypermedia pioneers, including Douglas Engelbart and Ted Nelson, brought the idea of defining associations between pieces of information into the digital age. The Memex offered only limited structural representation of information since it was based on printed documents (i.e. microfilm document pages). The only available document structure was the concept of a page which implied that links (associations) could only be defined at the page level. In his Xanadu document model [2], Ted Nelson introduced the idea of so-called deep documents, where snippets of information can be reused in higher-level document structures via the concept of transclusion. The Xanadu document model no longer requires textual information to be replicated since the same content can be embedded in different documents via structural references (transclusions) in combination with a versioning mechanism.

In the mid 70’s, researchers at Xerox PARC coined the term ‘What you see is what you get’ (WYSIWYG) which would become a de facto standard for document representations including the Portable Document Format (PDF), the Word Document Format and most other existing document formats. Unfortunately, the WYSIWYG representation did not take into consideration richer digital document representations such as the one proposed in Engelbart’s oN-Line System (NLS) [3] or Nelson’s Xanadu project. It further did not address the new possibilities that digital document formats could offer in comparison to printed media and the computer rather became degraded to a paper simulator as criticised by Nelson [4]:

Most people don’t understand the logic of the concept: “What You See Is What You Get” is based on printing the document out (“get” means “get WHEN YOU PRINT IT OUT”). And that means a metaphysical shift: a document can only consist of what can be printed!
This re-froze the computer document into a closed rectangular object which cannot be penetrated by outside markings (curtailing what you could do with paper). No marginal notes, no sticky notes, no crossouts, no insertions, no overlays, no highlighting – PAPER UNDER GLASS.

Unfortunately, these limitations in terms of missing document structure in combination with hierarchical file systems led to unsatisfactory information management solutions. Due to monolithic and closed document formats, users often have to replicate content, for example if the same picture is going to be used in multiple documents. Driven by recent developments on the Web where document structures are made explicitly available, for example through the increasing use of the Extensible Markup Language (XML), people start to realise that desktop applications could also profit from richer hypermedia structures to overcome some of the limitations of hierarchical file systems. We strongly believe that existing desktop applications can provide enhanced functionality based on a richer digital document representation that pays attention to structural semantics as well as bidirectional inter-document relationships. This functionality should not be provided in a proprietary and application-specific manner on top of existing file systems, but rather form part of the core
functionality offered by an enhanced file system. In this paper, we introduce a general and media-neutral structural cross-media document model based on a core link metamodel which can be extended via resource plug-ins to support existing as well as emerging document formats. We further show that the application of such a cross-media document layer may lead to enhanced information management functionality and ultimately result in a remediation of existing “paper simulation” approaches. We begin in Sect. 2 by providing an overview of existing digital document representations that are used on local file systems as well as on the Web. We discuss some advantages of existing document formats and highlight potential future improvements for dealing with cross-media information management. In Sect. 3 we introduce the main concepts of our conceptual model for structural cross-media content composition and discuss some differences to existing document formats. We then present an implementation of the model based on an object database management solution and outline potential authoring and publishing components. Various application scenarios of the presented cross-media document model, including a prototype of an associative file system, are highlighted in Sect. 5. Concluding remarks are given in Sect. 6.
2 Background
As denoted in the previous section, early hypermedia and document models offered rich features for addressing parts of documents, defining bidirectional relationships between documents as well as parts of documents and the structural embedding of documents or parts of documents via transclusion. However, the WYSIWYG concept offered by early desktop solutions sacrificed the rich digital document models in favour of simpler implementations and visualisations of documents and led to some of the document formats that are still in use today. In most existing applications, a document is represented as a single file that is stored somewhere in the file system. Furthermore, the content of a document is normally stored in a proprietary format which makes it nearly impossible for third-party applications to access any structural metadata that is completely controlled by the monolithic application “owning” the document. These closed document formats prevent third-party applications from offering supplemental services related to parts of a document and make the external annotation of document substructures impossible. There are two types of structural metadata that have to be considered: the within document structure describing different parts of a document (e.g. sections, images, etc.) and the external document structure which superimposes a structure on multiple documents (e.g. to organise projects with multiple documents). The within document structure is normally fixed and only accessible to the application that has been used to create the document, whereas an explicit external document structure is not offered by most existing file systems. A problem of proprietary document formats is not only that it is difficult to access the corresponding data from within third-party applications, but also
the document rendering functionality is generally not portable between different document types. This results in major development efforts for the visualisation of particular document formats and limits innovation since only large companies have the necessary resources to develop advanced document renderers. To overcome this problem, Phelps and Wilensky [5] proposed an architecture that separates the document format from its associated functionality represented by reusable and composable behaviour objects. The paper simulating document formats (e.g. PDF or Word Document Format) support only a single WYSIWYG document structure. However, in a digital environment that offers many different non-printable media types, this kind of representation might not be appropriate for certain mixed-media information compositions. A flexible solution should rather support some basic structuring mechanism and allow for different structural views on top of existing pieces of information. Note that such a basic structuring mechanism should be general enough to support existing as well as emerging document formats as we show when introducing our structural cross-media model in the next section. Of course, the WYSIWYG representation might always be one of the potential structural representations, but it should not be the only possible one. As mentioned earlier, an external document structure to define the semantics of a set of documents is not supported by most file systems. The general way to organise documents of a project is to make use of hierarchical folder structures. However, the hierarchical organisation of documents has a number of limitations. In most file systems a file can only be in a single folder since also folders are inspired by their physical counterparts.
The affordances of objects in the physical space have again been copied to digital space rather than making use of the richer functionality that could be offered for digital documents in terms of multiple classification. Without any physical constraints, there is no reason why a file should not be put into multiple folders. Nevertheless, the lack of this functionality leads to an unnecessary replication of information since the same file is copied to different project folders. Even worse, there is no relation between multiple instances of the same file which makes it possible that different inconsistent versions of the same document might exist. Last but not least, the underlying file system as well as the different applications have no understanding of the semantics that a user encodes in these folder structures and therefore cannot profit from exploiting this structural semantics. Alternative non-hierarchical access forms to traditional file systems are also investigated by new collaborative tangible user interfaces [6]. When we introduce potential applications of our structural cross-media content model in Sect. 5, we will list some applications that offer enhanced information management functionality by making use of the underlying non-hierarchical structural metadata. In the early days, the Web had similar limitations in terms of not providing access to document structures. The Hypertext Markup Language (HTML) is a mix of content and visualisation that interweaves content with structural and navigational elements. While earlier hypertext models envisioned a separation of content and external link structures, the Web mixes these concepts. HTML only
offers embedded and unidirectional links, which means that only the owner of a document can add links to it and links cannot be traced back. HTML offers a very restricted form of transclusion for embedded images [7]. An external image can be embedded in an HTML document by referencing its URL via the src attribute of an <img> tag. However, we can only embed the entire image and it is not possible to address parts of an image. While this limited form of transclusion is supported for images, the structural composition is not possible for textual HTML content. More recently, it has been realised that the external linking and annotation of webpages offers an immense flexibility in terms of dynamic and adaptive content composition. The hypermedia structures that are used on the Web start to blur the notion of classical document boundaries since it is no longer clear which hyperlinked resources should be counted as part of a particular document. The introduction of the Extensible Markup Language (XML) was an important step to open the structure of documents. In combination with the XPointer and XPath languages for addressing parts of an XML structure, the XML Linking Language (XLink) [8] can be used to define external multi-directional links. The use of XML as a general representation for cross-media documents might be problematic since XML itself puts some constraints on the structural grouping of elements by offering a hierarchical document model. Based on the opening of web document formats, the Annotea project [9] introduced a solution for external collaborative semantic annotations that is, for example, used in the W3C’s Amaya¹ web browser to edit third-party webpages. More recently, the use of mashups, another form of transclusion where snippets of content and services are composed to form new content and services, is becoming more important on the Web.
Another form of transclusion is provided by CrystalBox2, an agile software documentation tool that embeds code as well as other resources into HTML pages and other web documents by simply defining references to the corresponding resources in an online software repository. In the near future, we can see a trend to apply ideas from the Web in desktop applications, thereby reducing the gap between applications running on a local machine and services or documents available on the Web. Some companies already started to open their proprietary document formats. For example, the latest Microsoft Office suite uses the Office Open XML [1 0 ] standard to represent text documents, presentations as well as other documents. This interesting development makes the structural document components (e.g. a slide of a PowerPoint presentation) accessible to third-party applications and tools which can reuse parts of a document based on structural references (e.g. via an XPointer expression to an Office Open XML document). While this solution uses XML as a tool for providing access to the application-specific structural semantics, we propose a set of structural concepts to be o ffered at the file system level. This has the advantage, that any application will be able to make use of the additional metadata. 1 2
¹ http://www.w3.org/Amaya/
² http://crystalbox.org
Existing hypermedia solutions have been challenged by more recent approaches such as structural computing where structure is treated as a first-class citizen and the focus is no longer on the data [11]. As we show in the next section, our model treats data, navigational metadata and structural metadata at the same level. Note that new hypermedia data structures such as Zzstructures, polyarchies and mSpaces might also be used for organising documents [12]. The idea of open document formats has been addressed by Rivera et al. when discussing future directions of their OMX-FS object file system [13]. They recommended representing documents as containers of different media components that are associated with the corresponding digital functionality (e.g. editor, viewer etc.). This approach is somewhat related to the visualisation that we later introduce for our structural cross-media document model. It is likely that in the future we will see a reconciliation of digital document spaces with tangible objects such as interactive paper documents [14], RFID-tagged objects and other physical resources. A general structural cross-media document model should therefore be able to define structural compositions of mixed digital and physical resources. In the next section, we introduce our general conceptual model for structural cross-media content composition and explain how this core model can be used and extended to address some of the issues mentioned above.
3 Structural Cross-Media Model
In this section we present our structural cross-media document model and explain how its minimal set of concepts can be extended via resource plug-ins to support different types of documents. The presented structural model is an application of our general resource-selector-link (RSL) model [15] that was originally developed for the cross-media linking of digital and physical resources. However, we will show that the structural composition of different resources can be seen as a specialisation of the general link concept. The core components of the RSL model with its structural link support are shown in Fig. 1. Our structural cross-media composition model has been defined using the OM data model [16] which integrates concepts from both entity relationship (ER) and object-oriented data models. The OM model distinguishes between typing and classification. Typing deals with the representation of entities by objects with their attributes and methods. The semantic roles of individual object instances, on the other hand, are managed by classification via named collections which are represented by the rectangular shapes in Fig. 1. Furthermore, the OM model offers a first-class association construct which is represented by the oval shapes with their corresponding cardinality constraints. Note that associations might be ordered and that a ranking over an association is highlighted by putting the name of the association between two vertical lines (e.g. |HasChild|). It is beyond the scope of this paper to provide a full overview of the OM model but a detailed description can be found in [16].
[Figure: OM diagram of the RSL model. Collections (member type): Entities (entity), Resources (resource), Selectors (selector), Links (link), Navigational Links (link), Structural Links (link), Structures (structure), Properties (parameter), Context Resolvers (contextResolver). Associations: HasSource, HasTarget, |HasChild|, HasElements, HasProperties, HasResolver, RefersTo. Partition constraints hold over the subcollections of Entities and of Links.]
Fig. 1. Structural cross-media composition model
The most general concept introduced by our RSL model is the notion of an entity type. All entity instances are further grouped by the corresponding Entities collection. Our conceptual cross-media link model further introduces three concrete specialisations of the abstract entity concept as indicated by the corresponding resource, selector and link subtypes. The resource type represents any particular resource that can be used as an element in a composition. A resource itself is an abstract concept and for each specific media type, the corresponding resource plug-in has to be provided. A resource plug-in stores additional metadata about the corresponding media type (e.g. a URL attribute for the HTML resource plug-in). Links between different entities can be defined via the link type. A link has one or multiple source entities and one or multiple target entities as indicated by the cardinality constraints defined for the HasSource and HasTarget associations. As mentioned earlier, we treat Structural Links and Navigational Links as two different roles that a link can have. A link has to be either a structural link or a navigational link as indicated by the partition constraint over the subcollection relationship. By modelling links as well as structural links as specialisations of the entity type, a structural link can have other structural links as source or target entities resulting in a multi-level composition pattern. To support the concept of transclusions in its most general form, it is not enough that we can define structural links over resources; we also need a mechanism to address parts of a resource. In the RSL model, we therefore provide the abstract concept of a selector which always has to be related to a resource via the RefersTo association. Also for each selector type, a media-specific plug-in
has to be provided which enables the selection of parts of the related resource. A selector for XML documents could for example be an XPointer expression as explained in the previous section whereas a selector for an image might be defined by an arbitrary shape within the image. Furthermore, arbitrary metadata in the form of key/value pairs can be added to an entity via the HasProperties association. The context-dependent availability of an entity can be defined by associating a set of contextResolver instances with an entity. Last but not least, each entity has associated information about its creator as well as information about which users might access a specific entity based on well-defined access rights. For the sake of simplicity, the user management concepts are not shown in Fig. 1. However, the full model as well as a more detailed description of the various concepts can be found in [15]. The user management component guarantees that users only see those entities for which they have the corresponding access rights. In addition, by encrypting information at entity level, privacy can be ensured when sharing documents or parts of documents. Structures are handled by the Structures collection and the HasElements association to the corresponding structural links. It is necessary to have such an explicit grouping of structure elements since parts of structures might be reused by other structures. Furthermore, structural links are a specialisation of general links since we have to introduce an order for the substructure relationship. For example, if we want to model the structure of a document with different chapters and sections within a chapter, we have to know the order of the chapters within the entire document as well as the order of the sections within the chapters. The order over such substructure relationships is introduced via the ordered |HasChild| subassociation of the HasTarget association.
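The concepts introduced so far can be illustrated with a minimal in-memory sketch. It is not the iServer implementation: the class names follow the entity, resource, selector and link types of the model, but the constructor signatures, the region tuple and the outline helper are assumptions made for this example, and the structure is assumed to be acyclic.

```python
class Entity:
    """Abstract root type; any entity can carry key/value properties."""

    def __init__(self, name):
        self.name = name
        self.properties = {}  # plays the role of HasProperties


class Resource(Entity):
    """A whole resource, e.g. a picture or a chapter."""


class Selector(Entity):
    """Addresses a part of exactly one resource (RefersTo)."""

    def __init__(self, name, resource, region):
        super().__init__(name)
        self.resource = resource
        self.region = region  # media-specific; here a hypothetical bounding box


class Link(Entity):
    """A link with at least one source and at least one target."""

    def __init__(self, name, sources, targets):
        if not sources or not targets:
            raise ValueError("a link needs at least one source and one target")
        super().__init__(name)
        self.sources = list(sources)
        self.targets = list(targets)


class StructuralLink(Link):
    """Structural role of a link; the target order models |HasChild|."""


def outline(entity, links, depth=0):
    """Depth-first (depth, name) pairs below entity, in child order."""
    rows = [(depth, entity.name)]
    for link in links:
        if isinstance(link, StructuralLink) and entity in link.sources:
            for child in link.targets:
                rows.extend(outline(child, links, depth + 1))
    return rows


report = Resource("report")
ch1, ch2 = Resource("chapter 1"), Resource("chapter 2")
sec = Resource("section 1.1")
photo = Resource("photo.png")
crop = Selector("photo crop", photo, (10, 10, 200, 120))

links = [
    StructuralLink("report-chapters", [report], [ch1, ch2]),
    StructuralLink("chapter-1-parts", [ch1], [sec, crop]),
]

for depth, name in outline(report, links):
    print("  " * depth + name)
```

Because the selector is an entity in its own right, the cropped region of the photo takes part in the composition in exactly the same way as a complete resource.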
Therefore, the Structures and |HasChild| components provide information about all of the components that belong to a specific structure as well as their structural relationships. Note that since our model treats resources, selectors and links equally, structural compositions can be defined over a combination of these three entity types. Since we have a clear separation between data and structure, it is possible to reuse the same entity in different structural compositions and to support different views or structures on top of existing pieces of information as recommended by Nelson [4]. While our core model only provides some basic constructs to define navigational as well as structural links, it is up to the implementation of a particular document type to specify application-specific compositions based on these fundamental concepts. The advantage of this approach is that all document types that make use of the RSL model also offer generic access to some basic structural metadata information. This implies that we can develop general structural browsers and that third-party applications can exploit the underlying structural metadata to provide additional services as explained later in Sect. 5. With our RSL model and its structural cross-media composition functionality, we were aiming for the most general form of document representation that one could imagine. While the model only offers a few basic concepts, it can be extended
to support arbitrary digital or physical document representations via the resource and selector concepts. Furthermore, our model supports navigational as well as structural linking based on these abstract media-neutral concepts. Even if the presented solution for the structural grouping of entities is very general, it can of course also be used to implement existing solutions such as hierarchical folder structures. However, an important difference is that hierarchical ordering becomes an option and is no longer an imposition. Even if we start by organising our content in a hierarchical way, we have the flexibility to adapt the form of structural references at any later stage. The concept of context resolvers as well as the access rights specified at entity level enable some other interesting possibilities. The structure of a document no longer has to be fixed but it can change based on some general contextual factors or based on the role of the user who is currently accessing the structural component. Similar to adaptive hypermedia, where the navigational link structure can be adapted based on a user's browsing history or some other metadata, we can adapt the structure of documents. This can, for example, be used to model highly dynamic documents that change their structure and content based on a multidimensional set of contextual parameters such as the user role or the preferred language by making use of context-dependent transclusions. The goal of our cross-media document and link model was not just to introduce yet another hypermedia model; we wanted to find a small set of concepts that are general enough to offer some basic structural and navigational functionality for existing as well as emerging digital or physical document types.
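A context-dependent structure of this kind can be sketched in a few lines. The sketch below is an assumption-laden illustration rather than the resolver mechanism of the RSL implementation: the ContextResolver class, its key/allowed-values design and the resolve_children helper are hypothetical.

```python
class ContextResolver:
    """Decides whether an entity is available in a given context."""

    def __init__(self, key, allowed):
        self.key = key
        self.allowed = set(allowed)

    def is_available(self, context):
        return context.get(self.key) in self.allowed


def resolve_children(children, context):
    """Keep the ordered children whose resolvers all accept the context.

    children is a list of (name, resolvers) pairs; an empty resolver
    list means that the child is always available.
    """
    return [
        name
        for name, resolvers in children
        if all(r.is_available(context) for r in resolvers)
    ]


exercise_sheet = [
    ("questions", []),
    ("solutions", [ContextResolver("role", {"teacher"})]),
    ("hints (German)", [ContextResolver("language", {"de"})]),
]

print(resolve_children(exercise_sheet, {"role": "student", "language": "en"}))
print(resolve_children(exercise_sheet, {"role": "teacher", "language": "de"}))
```

The same stored children yield different document structures for different contexts, without any child being duplicated.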
4 Implementation
The RSL model has been implemented based on the OMS Java object database management system [17]. The resulting iServer platform offers separate Java classes for all the RSL concepts introduced in the previous section as well as a main IServer API that provides a set of static methods to create, access, update and delete information stored in an iServer database. The most important methods of the IServer API are highlighted in Fig. 2. Of course, we cannot assume that every client application that wants to make use of document metadata managed by the iServer platform is implemented in Java. We therefore offer an XML representation of all the data managed by iServer as well as a Web Service interface that provides the same functionality as offered by the Java API. Note that we do not primarily represent our data in XML but only use XML at the interface level to provide a language-neutral interface for accessing iServer information. In addition to the iServer core functionality, a variety of media-specific resource plug-ins have been implemented over the last few years by providing specific realisations of the corresponding resource and selector concepts. These plug-ins range from support for digital resources such as HTML pages, movie clips or Flash movies to physical resources including interactive paper or RFID-tagged physical objects.
IServer
+createLink(name: String) : Link
+createLink(name: String, source: Entity, target: Entity, creator: Individual) : Link
+deleteLink(link: Link)
+createIndividual(name: String, desc: String, login: String, password: String) : Individual
+deleteIndividual(individual: Individual)
+createGroup(name: String, desc: String) : Group
+deleteGroup(group: Group)
+createMedium(name: String, desc: String, medium: OMMime, creator: Individual) : Medium
+deleteResource(resource: Resource)
+deleteContainer(container: Resource)
+createSelector(name: String) : Selector
+createSelector(name: String, layer: Layer, resource: Resource, creator: Individual) : Selector
+deleteSelector(selector: Selector)
+createLayer(name: String) : Layer
+createLayer(name: String, position: int) : Layer
+deleteLayer(layer: Layer)
+createPreference(key: String, value: String) : Parameter
+deletePreference(preference: Parameter)
Fig. 2. Core iServer API
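The create/delete pattern of the API can be mimicked with a toy in-memory stand-in. The sketch is not iServer and does not call it: ToyServer and its dictionary-based links are hypothetical, and only the two createLink variants and deleteLink from Fig. 2 are mirrored, using Python naming and default arguments in place of Java overloads.

```python
import itertools


class ToyServer:
    """In-memory stand-in; method names echo Fig. 2 but this is NOT iServer."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.links = {}

    def create_link(self, name, source=None, target=None, creator=None):
        # Covers both createLink variants: name only, or name plus an
        # initial source, target and creator.
        link = {
            "id": next(self._ids),
            "name": name,
            "sources": [] if source is None else [source],
            "targets": [] if target is None else [target],
            "creator": creator,
        }
        self.links[link["id"]] = link
        return link

    def delete_link(self, link):
        # Deleting an unknown link is treated as a no-op in this sketch.
        self.links.pop(link["id"], None)


server = ToyServer()
draft = server.create_link("draft")
cites = server.create_link(
    "cites", source="paper.pdf", target="slides.ppt", creator="alice"
)
server.delete_link(draft)
print(sorted(link["name"] for link in server.links.values()))
```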
The implementation of a new resource plug-in includes not only the realisation of the corresponding resource and selector components to persistently store the appropriate metadata; some visual plug-ins also have to be provided. We are currently implementing a general visual cross-media authoring and browser tool that can be extended in a similar way to the data model via different visual plug-ins. The visual plug-in interface defines some common functionality to be offered by all the visual components. A media-specific visual plug-in, for example, has to be able to visualise a given resource or selector. Furthermore, it has to offer the functionality to create new selectors for a given resource. Based on these media-specific visualisation components, a general browser for the fluid navigation of structural cross-media documents can be developed. Note that such a general structural browser can be implemented independently of specific media types since it only has to visualise the compositional aspects between different entities whereas the visual resource plug-ins will provide the necessary rendering functionality at the resource level. Further details about our future development plans in terms of a general visual cross-media authoring and browser tool can be found in [18]. We are also investigating how the functionality of the presented RSL model can be offered at the level of a file system. There seem to be two ways in which applications with their proprietary document formats can profit from such a general structural cross-media document model. Either they have to be reimplemented to make use of the new API offered by an enhanced file system, or plug-ins and add-ons have to be written for existing applications to enable the new structural metadata functionality.
However, it could make sense to execute all applications on top of such an associative file system and at the same time offer a kind of convenience layer that provides the old hierarchical file system view based on the new underlying structural functionality. This would enable a gradual migration
of old applications as soon as developers think that their applications could profit from the newly offered associative file system functionality.
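One way to picture such a convenience layer is a function that derives classical paths from containment links. The following sketch is hypothetical and not part of the described system: the contains mapping stands in for structural links between folders and their children, and the structure is assumed to be acyclic.

```python
def hierarchical_view(contains, root):
    """Enumerate every path from root, following containment links.

    contains maps a folder name to the ordered names of its children.
    An entity that is linked from several folders shows up once per
    folder, although it is stored only once.
    """
    paths = []

    def walk(node, prefix):
        path = prefix + "/" + node
        paths.append(path)
        for child in contains.get(node, []):
            walk(child, path)

    walk(root, "")
    return paths


contains = {
    "home": ["pictures", "projects"],
    "pictures": ["beach.jpg"],
    "projects": ["er2010"],
    "er2010": ["paper.tex", "beach.jpg"],
}

for path in hierarchical_view(contains, "home"):
    print(path)
```

A legacy application would see beach.jpg in both folders, while the underlying store keeps a single entity with two incoming structural links.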
5 Applications
In this section, we would like to evaluate the presented structural cross-media model by outlining different application scenarios and highlighting how one can profit from the existence of structural metadata. The presented structural RSL functionality is currently being investigated by implementing a prototype of an RSL-based associative file system (RBAF). In addition to a file system API, the RBAF prototype offers a visual document explorer that provides access to the underlying RSL concepts to offer various metadata about a selected document. A screenshot of the initial RBAF Explorer implementation is shown in Fig. 3.
Fig. 3. RBAF Explorer
For the realisation of the RBAF file system, new file and folder resource subtypes have been introduced as shown in Fig. 4. Files can further be specialised into different types of file formats (e.g. pictures or PowerPoint presentations). Furthermore, structural links can be used to associate single files with one or more folders. A first application is the structural grouping of multiple documents in a project. The different documents of a project no longer have to be copied to specific folders to ensure that they form part of the corresponding hierarchical folder structure. Often people arrange their files by media types (e.g. a folder with topical picture subfolders). However, if a picture has to be used within a specific project, it is often copied from its original picture folder to a subfolder within the corresponding project hierarchy. In our RBAF prototype this is no longer necessary, since we can just add a structural link originating from the structure defining the project to the corresponding picture. By reducing the need to copy documents just for organisational purposes, we can lower the amount of redundant and potentially inconsistent
[Figure: OM diagram extending the resource type. Resources (resource) is partitioned into Folders (folder) and Files (file); Files is partitioned into Pictures (picture) and PowerPoint Files (powerpoint).]
Fig. 4. RBAF model extensions
data. Of course, the same picture can be used in multiple projects at the same time and we still have only a single copy in the original picture folder, as recommended by Nelson when introducing the concept of transclusions. Even better, since all the associations in our model are bidirectional, for any picture or other document we can check whether the document is used in a project via a structural reference. Our picture can not only be grouped with other documents, but it can also be reused as part of a document structure. In our new document representation, a PowerPoint presentation has structural links to the different slides forming part of the presentation. Each slide can further have structural links to some of its content. Our picture might therefore be used within a specific PowerPoint slide. However, we can not only use the entire image but also define a structural link to a selector addressing parts of the picture. Of course, content can also be reused on the level of single PowerPoint slides. Instead of copying specific slides from one presentation to another (something that happens quite often when people want to reuse a few slides), the same slide can be reused and embedded into multiple presentations via structural links. As we have explained when introducing the context resolver concept, the structural links can also vary based on contextual information and we can have slightly different versions of the same presentation without having to replicate the commonly shared slides. The same kind of variation can be based on the current user role. While it might not be evident why we need this for PowerPoint presentations, it becomes more obvious when we talk about teaching documents in educational settings. Different structural links, and thereby different versions of a Word document, can be provided based on user roles.
For example, a student might get a Word document with some questions whereas the version of the teacher contains an additional structural link to the corresponding solutions. The adaptation could even be based on individual students and their learning progress, which is similar to the provision of dynamic navigational link information in adaptive hypermedia. Note that not only applications that understand the structural within-document information could profit from the additional semantics offered by the RSL
approach. Let us revisit the grouping of documents within a project. What happens if we would like to send the content of a project to a colleague? Since the corresponding project documents are no longer grouped in a single project folder, this might look like a challenge. Nevertheless, if our email client is RSL-aware, we can just select the main project file and the email client will recognise that there are structural project dependencies, which means that any structurally referenced files also have to be sent to our colleague. The email client will call an RSL-aware compression tool that will automatically create a zip file containing all relevant project resources and send the resulting zip file to our colleague. Of course, structural linking can not only be used to group different documents but it can also be applied to keep track of different versions of the same document when a document is edited. Furthermore, any structural link can be annotated with additional metadata either in terms of another link that has the structural link as a source and some other resource as target or by adding arbitrary key/value properties to the structural link. The flexible composition of arbitrary digital or physical resources implies that some information might not be available under certain circumstances. Since we no longer follow the WYSIWYG principle, we can also no longer guarantee that all information forming part of a structural cross-media document can also be printed or visualised on another output channel. For example, if our cross-media document structure contains sound files or movie clips, then these resources cannot easily be printed on paper. On the other hand, visual components might be difficult to output via a voice output channel. Even these small examples show plenty of possibilities for how our daily work and organisation of documents could be enhanced and simplified if applications had access to some additional structural document metadata.
Of course, there are still various open issues such as whether documents have to be globally identifiable and what would be the best way to offer a general entity versioning mechanism.
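Two of the scenarios above, collecting everything a project depends on and checking where a document is used, reduce to simple graph traversals. The sketch below is illustrative only: the structural_links dictionary and both helper functions are assumptions, not part of the RBAF prototype.

```python
def structural_closure(start, structural_links):
    """All entities reachable from start via structural links, start included."""
    seen, stack = [], [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guards against shared children and cycles
        seen.append(node)
        # reversed() so that children are visited in their stored order
        stack.extend(reversed(structural_links.get(node, [])))
    return seen


def used_in(entity, structural_links):
    """Reverse lookup: which structures reference entity directly?"""
    return sorted(
        source
        for source, targets in structural_links.items()
        if entity in targets
    )


structural_links = {
    "project-A": ["report.doc", "talk.ppt"],
    "talk.ppt": ["slide-1", "slide-2"],
    "slide-2": ["beach.jpg"],
    "project-B": ["beach.jpg"],
}

print(structural_closure("project-A", structural_links))
print(used_in("beach.jpg", structural_links))
```

An RSL-aware packaging tool could hand the first result to an archiver, while the second corresponds to following a structural reference in the reverse direction.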
6 Conclusions
We have presented our conceptual model for structural cross-media content composition and reuse and highlighted the importance of extensible document models. After reviewing some early hypermedia models and introducing the concept of transclusions, we have highlighted how existing file systems could profit from more flexible ways of organising documents and reusing parts of documents via structural references. After describing our conceptual model for extensible structural cross-media documents, we have outlined some application scenarios that can profit from having access to detailed structural document metadata. We hope that our structural cross-media model might provide a foundation for further discussions about innovative forms of document interfaces and interactions that are no longer simply simulating paper. Acknowledgment. We would like to thank Gregory Cardone for the implementation of the RBAF file system.
References
1. Bush, V.: As We May Think. Atlantic Monthly 176(1), 101–108 (1945)
2. Nelson, T.H.: Literary Machines. Mindful Press (1982)
3. Engelbart, D.C., English, W.K.: A Research Center for Augmenting Human Intellect. In: Proceedings of AFIPS Joint Computer Conferences, San Francisco, USA, pp. 395–410 (December 1968)
4. Nelson, T.: Geeks Bearing Gifts: How the Computer World Got This Way. Mindful Press (2009)
5. Phelps, T.A., Wilensky, R.: The Multivalent Browser: A Platform for New Ideas. In: Proceedings of DocEng 2001, ACM Symposium on Document Engineering, Atlanta, USA, pp. 58–67 (November 2001)
6. Collins, A., Apted, T., Kay, J.: Tabletop File System Access: Associative and Hierarchical Approaches. In: Proceedings of Tabletop 2007, Second Annual IEEE International Workshop on Horizontal Interactive Human-Computer Systems (2007)
7. Krottmaier, H., Maurer, H.: Transclusions in the 21st Century. Universal Computer Science 7(12), 1125–1136 (2001)
8. Christensen, B.G., Hansen, F.A., Bouvin, N.O.: Xspect: Bridging Open Hypermedia and XLink. In: Proceedings of WWW 2003, 12th International World Wide Web Conference, Budapest, Hungary, pp. 490–499 (May 2003)
9. Koivunen, M.R.: Semantic Authoring by Tagging with Annotea Social Bookmarks and Topics. In: Proceedings of SAAW 2006, 1st Semantic Authoring and Annotation Workshop, Athens, Greece (November 2006)
10. Miller, F.P., Vandome, A.F., McBrewster, J.: Office Open XML. Alphascript Publishing (2009)
11. Nürnberg, P.J., Schraefel, m. c.: Relationships Among Structural Computing and Other Fields. Journal of Network and Computer Applications 26(1), 11–26 (2003)
12. McGuffin, M.J., Schraefel, m. c.: A Comparison of Hyperstructures: Zzstructures, mSpaces, and Polyarchies. In: Proceedings of Hypertext 2004, 15th ACM Conference on Hypertext and Hypermedia, Santa Cruz, USA, pp. 153–162 (2004)
13. Rivera, G., Norrie, M.C.: OMX-FS: An Extended File System Architecture Based on a Generic Object Model. In: Weck, W., Gutknecht, J. (eds.) JMLC 2000. LNCS, vol. 1897, pp. 161–174. Springer, Heidelberg (2000)
14. Signer, B.: Fundamental Concepts for Interactive Paper and Cross-Media Information Spaces. Books on Demand GmbH (May 2008)
15. Signer, B., Norrie, M.C.: As We May Link: A General Metamodel for Hypermedia Systems. In: Parent, C., Schewe, K.-D., Storey, V.C., Thalheim, B. (eds.) ER 2007. LNCS, vol. 4801, pp. 359–374. Springer, Heidelberg (2007)
16. Norrie, M.C.: An Extended Entity-Relationship Approach to Data Management in Object-Oriented Systems. In: Elmasri, R.A., Kouramajian, V., Thalheim, B. (eds.) ER 1993. LNCS, vol. 823, pp. 390–401. Springer, Heidelberg (1994)
17. Kobler, A., Norrie, M.C.: OMS Java: A Persistent Object Management Framework. In: Java and Databases. Hermes Penton Science, pp. 46–62 (May 2002)
18. Signer, B., Norrie, M.C.: An Architecture for Open Cross-Media Annotation Services. In: Vossen, G., Long, D.D.E., Yu, J.X. (eds.) WISE 2009. LNCS, vol. 5802, pp. 387–400. Springer, Heidelberg (2009)
Classification of Index Partitions to Boost XML Query Performance⋆
Gerard Marks, Mark Roantree, and John Murphy
Interoperable Systems Group, Dublin City University, Dublin 9, Ireland
{gmarks,mark.roantree,jmurphy}@computing.dcu.ie
Abstract. XML query optimization continues to occupy considerable research effort due to the increasing usage of XML data. Despite many innovations over recent years, XML databases struggle to compete with more traditional database systems. Rather than using node indexes, some efforts have begun to focus on creating partitions of nodes within indexes. The motivation is to quickly eliminate large sections of the XML tree based on the partition they occupy. In this research, we present one such partition index that is unlike current approaches in how it determines size and number of these partitions. Furthermore, we provide a process for compacting the index and reducing the number of node access operations in order to optimize XML queries.
1 Introduction
Despite the continued growth of XML data and applications that rely on XML for the purpose of communication, there remains a problem in terms of query performance. XML databases cannot perform at the same level as their relational counterparts, and as a result, many of those who rely on XML for reasons of interoperability are choosing to store XML data in relational databases rather than its native format. For this reason, the advantages of semi-structured data (i.e. schema-less data storage) are lost in the structured world of relational databases, where schema design is required before data storage is permitted. The result of this is that many domains such as sensor networks are using rigid data models where more flexible and dynamic solutions are required. Over the last decade, many research groups have developed new levels of optimization. However, there remains significant scope and opportunity for further improvements. In this paper, we adopt some of the methodology that has been applied in the past but introduce a new approach where we dynamically partition the XML document, together with a metadata structure, to improve the performance of the index. In doing so, we can demonstrate new levels of optimization across XPath expressions. The paper is organized as follows: in the remainder of this section, we provide further background and motivation and state our contribution to this area; in §2, we examine similar research approaches in XML query optimization; in §3, we provide a detailed description of our partitioned index; in §4, we describe how query processing can take advantage of our approach; in §5, we present our experiments and discuss the findings, before concluding in §6.
⋆ Funded by Enterprise Ireland Grant No. CFTD/07/201.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 405–418, 2010. © Springer-Verlag Berlin Heidelberg 2010
1.1 Background and Motivation
Current XML query optimization solutions can be placed in two broad categories. On one hand, index based approaches build indexes on XML documents to provide efficient access to nodes, e.g. XRel [1], XPath Accelerator [5], Xeek [10]. On the other hand, algorithmic based approaches are focused on designing new join algorithms, e.g. TJFast [9], StaircaseJoin [7]. The former approach can use standard relational databases to deploy the index structure and thus benefit from mature relational technology. The latter depends on a modification to the underlying RDBMS kernel [2], or a native XML database may be built from scratch. The XPath Accelerator [5] demonstrated that an optimized XPath index that lives entirely within a relational database can be used to evaluate all of the XPath axes. However, the XPath Accelerator, and similar approaches [6], suffer from scalability issues, as this type of node evaluation (even across relatively small XML documents) is inefficient [10]. A more recent solution is to partition nodes in an XML tree into disjoint subsets, which can be identified more efficiently as there will always be fewer partitions than there are nodes. After the relevant partitions are identified, only the nodes that comprise these partitions are evaluated using the inefficient node comparison step. Based on pre/post encoding, [11] is an index based approach that requires a user defined partitioning factor to divide the pre/post plane into disjoint sub-partitions. However, an optimal partitioning factor cannot be known in advance and as a result, rigorous experimentation is needed to identify this parameter (as is discussed in our related research).
1.2 Contribution
The main contributions in our work can be described as follows:
– We provide a novel partitioning method for XML document indexes that offers new levels of optimization for XML queries.
– We have developed efficient algorithms that automatically identify and resize these document partitions in a single pass of the XML dataset; user defined partitioning factors are not used.
– Using structural information, we can allow identical node partitions to be merged and thus reduce the size of the index and avoid processing large numbers of equivalent nodes.
– Finally, for the purpose of comparing our approach to similar works we use pre/post encoding. However, to the best of our knowledge, in this paper, we present the first index based partitioning approach that is independent of the specific properties of the XML node labeling scheme used. Therefore, our approach can be more easily integrated with other XML node labeling schemes such as ORDPATH [12].
We also provide a longer version of this paper, which includes the concepts and terminology that underpin our work, and extended related research, optimization and experiments sections [3].
2 Related Research
The XPath Accelerator [5] exemplifies an XML database built on top of a relational database. In this work, pre/post region information, i.e. region encoding, is used as the XML node encoding scheme. However, querying large XML datasets using pre/post labels is inefficient [10]. In [11], the pre/post plane is partitioned based on a user defined partitioning factor. Fig 1 illustrates the pre/post plane partitioned into parts using a partitioning factor of 4. For each node, the pre/post identifier of its part is the lower bound of its x and y values respectively. For example, in Fig 1 the part P associated with node x(6, 5) is: P(4, 4). The ancestors of node x can only exist in the parts that have a lower bound x value ≤ 4 and a lower bound y value ≥ 4, i.e. the shaded parts (Fig 1). A similar condition holds for the other major XPath axes, i.e. descendant, following and preceding. The problem with this approach is that an ideal partitioning factor is not known in advance and requires rigorous experimentation to identify. For example, in reported experiments each XML document was evaluated for the partitioning factors 1, 2, 4,.., 256 [11]. We believe this type of experimentation is infeasible even for relatively small XML documents. Additionally, as XML data is irregular by nature, a single partitioning factor per dataset is less than ideal. Finally, although it is suggested in [11] that the partitioning approach may be tailored to other encoding schemes such as order/size, it relies heavily on the lower bound of each x and y value in the partitioned pre/post (or order/size) plane. Therefore, this approach does not lend itself naturally to prefix based encoding schemes such as ORDPATH [12], which have become very popular in recent years for reasons of updatability.
[Figure: pre/post plane with preorder on the x-axis and postorder on the y-axis, divided into parts; node x(6, 5) is marked and the parts that may contain its ancestors are shaded.]
Fig. 1. Partitioning factor N = 4
The work presented in this paper overcomes these issues by automatically partitioning nodes based on their individual layout and structural properties within each XML dataset. We do not rely on user defined partitioning factors. Also, our approach is not dependent on the specific properties of XML node labels, and thus can be used in conjunction with any XML encoding scheme.
3 Optimization Constructs
In this section, we present the optimization process together with the various constructs and indexing layers it comprises. We start by introducing a small number of new constructs that form part of the optimization process. Following this, we provide a step-by-step description of how we create the dynamic partition index that is influenced by the structure and data content of each XML document.
Definition 1. A branch is a set of connected node identifiers within an XML document.
A branch (sometimes referred to as a sub-tree) is the abstract data type used to describe a partition of nodes. In our work, we will deal with the local-branch and path-branch sub-types of a branch.
Definition 2. A local-branch is a branch, such that its members represent a single branching node and the nodes in its subtree. A local-branch cannot contain a member that represents a descendant of another branching node.
In a tree data structure, a branching node will have a minimum of two child nodes, whereas a non-branching node has at most one child node [8]. The local-branch uses the branching node to form each partition. Our process uses the rule that each local-branch must not contain nodes that are descendants of another branching node, to create primary partitions.
Definition 3. A path-branch is a branch with a single path.
The path-branch is an abstract type with no branching node. Each member is a child member of the preceding node. Its three sub-types (orphan-path, branchlink-path and leaf-path) are used to partition the document.
Definition 4. An orphan-path is a path-branch such that its members cannot belong to a local-branch.
The orphan-path definition implies that members of the orphan-path cannot have an ancestor that is a branching node. The motivation is to ensure that each node in the XML document is now a member of some partition.
Definition 5. A branchlink-path is a path-branch that contains a link to a single descendant partition of its local-branch.
In any local-branch, there is always a single branching node and a set of non-branching nodes. With the non-branching nodes, we must identify those that share descendant relationships with other partitions. These are referred to as branchlink-path partitions and each member occupies the path linking two branching nodes (i.e. two partitions).
Definition 6. A leaf-path is a path-branch that contains a leaf node inside its local-branch.
A leaf-path differs from a branchlink-path in that it does not contain a link to descendant partitions. In other words, it contains a single leaf node and its ancestors.
3.1 Creating the Primary Partitions
When creating the first set of partitions, the goal is to include all nodes in local-branch or path-branch partitions. As explained earlier, path-branches are abstract types and at this point, all path-branch instances will be orphan-paths. The algorithms for encoding an XML document using a pre/post encoding scheme were provided by Grust in [5]. In brief, each time a starting tag is encountered a new element object is created, which is assigned the attributes: name, type, level, and preorder. After which, the new element is pushed onto an element stack. Each time an end tag is encountered an element is popped from the element stack and is assigned a postorder identifier. Once an element has been popped from the stack, we call it the current node, and the waiting list is a set in which elements reside temporarily prior to being indexed. The first step in the process is to determine if the current node is a branching node by checking if it has more than one child node. The next steps are as follows:
1. If the current node is non-branching and does not reside at level 1, it is placed on the waiting list.¹
2. If the current node is branching, it is assigned to the next local-branch in sequence. Also, the nodes on the waiting list will be its descendants. Therefore, the nodes on the waiting list are output to the same local-branch as the current node.
3. If the current node is non-branching, but a node at level 1 is encountered, the current node does not have a branching node ancestor. Therefore, the current node is assigned to an orphan-path (Definition 4). For the same reason, any node currently on the waiting list is assigned to the same orphan-path.
3.2 Partition Refinement
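As a recap of the encoding pass that feeds both the primary partitions and their refinement, the stack-based pre/post labeling summarized in §3.1 can be sketched as follows. This is a minimal illustration rather than the algorithm of [5]: the function name, the dictionary layout and the use of Python's `xml.etree.ElementTree.iterparse` are assumptions, and the sample document is a reconstruction of the DBLP snippet of Fig. 4 with text values taken from Table 1. The branching test is the one stated in the text (more than one child node).

```python
import io
import xml.etree.ElementTree as ET

# Assumed reconstruction of the snippet in Fig. 4 (text values from Table 1).
DBLP_SNIPPET = (
    "<dblp>"
    "<article><author/><title><sub>2</sub><i/></title></article>"
    "<article><author/><title><sub>4</sub><i/></title></article>"
    "<article><author/><title/></article>"
    "</dblp>"
)


def encode(xml_text):
    """Label every element with pre, post, level and a branching flag."""
    stack, current_nodes = [], []
    pre = post = 0
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Start tag: create the element, assign preorder, push on the stack.
            if stack:
                stack[-1]["children"] += 1
            stack.append({"name": elem.tag, "level": len(stack),
                          "pre": pre, "children": 0})
            pre += 1
        else:
            # End tag: pop the element and assign its postorder identifier.
            node = stack.pop()
            node["post"] = post
            post += 1
            node["branching"] = node["children"] > 1
            current_nodes.append(node)  # nodes appear here in postorder
    return current_nodes


nodes = encode(DBLP_SNIPPET)
for n in sorted(nodes, key=lambda n: n["pre"]):
    print((n["pre"], n["post"]), n["name"], n["level"], n["branching"])
```

For this input the (pre, post) pairs and levels agree with the label and level columns of Table 1; the order in which nodes are appended to `current_nodes` is the order in which they would become the current node in the partitioning steps.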
Although the local-branches are rooted subtrees they may contain nodes that do not have an ancestor/descendant relationship. As we will discuss in §4, the separation of nodes that do not have a hierarchical association leads to an optimized pruning effort. Each local-branch instance has a single branching node root which may have many (non-branching node) descendants. It is the non-branching descendants of the root that are checked to see if they share a hierarchical association. For this reason, we partition the non-branching nodes (in each local-branch) into disjoint path-branches (Definition 3). As orphan-paths and local-branches are disjoint, each of these path-branch instances will be a branchlink-path (Definition 5) or a leaf-path (Definition 6). The RefinePartitions procedure (Algorithm 1) replaces all steps outlined for creating the primary index (above). The new branch partitions are created by processing two local-branches simultaneously. All current nodes (see creating primary partitions above), up to and including the first branching node, are placed in the first waiting list (wList1) where they wait to be indexed. Subsequently, the next set of current nodes, up to and including the second branching node, are placed on the second waiting list (wList2). At this point, wList1 and wList2 contain the nodes that comprise the first and second local-branches respectively.
¹ Examples and illustrations of the primary partitions are provided in the long version of this paper [3].
Algorithm 1. RefinePartitions
1: if node at level 1 encountered then
2:   move nodes that comprise wList2 to orphan-path;
3: end if
4: move non-branching nodes from wList1 to leaf-path;
5: for each node n in wList2 do
6:   if n = ancestor of wList1.ROOT ∧ n ≠ branching node then
7:     move n to branchlink-path;
8:   else if n ≠ ancestor of wList1.ROOT then
9:     move n to leaf-path;
10:   end if
11: end for
12: move local-branch from wList1 to local-branch;
13: move local-branch from wList2 to wList1;
If a node at level 1 is encountered, the nodes that comprise wList2 are an orphan-path (line 2). If a branchlink-path (Definition 5) exists, RefinePartitions identifies it as the non-branching nodes in wList2 that are ancestors of the root node in wList1 (lines 6-7). If one or more leaf-paths (Definition 6) exist, they will be the nodes in wList2 that are not ancestors of the root node in wList1 (lines 8-9). The remaining nodes that comprise the first local-branch (wList1) are then moved to the index (line 12); this will be the single branching node root of the first local-branch only. At this point, the only node that remains in wList2 is the root node of the second local-branch. This local-branch is then moved to wList1 (line 13); wList2 will not contain any nodes at this point. The next local-branch is placed in wList2 and the process is repeated until no more branches exist.
When this process has completed, the result will be many more partitions, with the benefit of increased pruning. This is illustrated in Fig 2. The process will also track the ancestor-descendant relationships between branch partitions. This is achieved by maintaining the parent-child mappings between branches. Given two branches: B1 and B2, B2 is a child of B1 if and only if the parent node of a node that comprises B2 belongs to B1. When the RefinePartitions process is complete, the ancestor-descendant relationships between branches are determined using a recursive function across these parent-child relationships, i.e. select the branch's children, then its children's children recursively.
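The bookkeeping of branch relationships described above can be sketched in two small functions: one derives the parent-child mapping between branches from node membership, the other expands it recursively into ancestor-descendant relationships. The node numbers, branch names and data layout below are hypothetical and chosen only to exercise the rule "B2 is a child of B1 if and only if the parent node of a node that comprises B2 belongs to B1".

```python
from collections import defaultdict


def child_branches(parent_of, branch_of):
    """Map each branch to the set of branches that are its children."""
    children = defaultdict(set)
    for node, parent in parent_of.items():
        # A node whose parent lives in another branch links the two branches.
        if branch_of[parent] != branch_of[node]:
            children[branch_of[parent]].add(branch_of[node])
    return children


def descendant_branches(children, branch):
    """Recursively collect the children, then the children's children."""
    found = set()
    for child in children.get(branch, ()):
        found.add(child)
        found |= descendant_branches(children, child)
    return found


# Hypothetical six-node tree: node -> parent node, and node -> branch.
parent_of = {1: 0, 2: 1, 3: 1, 4: 3, 5: 0}
branch_of = {0: "B1", 1: "B1", 2: "B2", 3: "B3", 4: "B4", 5: "B5"}

children = child_branches(parent_of, branch_of)
print(sorted(descendant_branches(children, "B1")))
print(sorted(descendant_branches(children, "B3")))
```

Because partitions are disjoint and form a tree, the recursion terminates without needing a visited set; a graph with cycles would require one.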
[Figure: tree of the sample document after refinement, with nodes grouped into local-branch (LB), branchlink-path (BLP), leaf-path (LP) and orphan-path (OP) partitions.]
Fig. 2. After Partition Refinement
3.3 Branch Class Index
The indexing process results in the creation of a large number of branch partitions. This benefits the optimization process as it facilitates a highly aggressive pruning process and thus, reduces the inefficient stage of node comparisons. The downside of aggressive pruning is the large index size it requires. Our final step is to reduce the size of our index while maintaining the same degree of pruning. To achieve this, we use a classification process for all branches based on the root-to-leaf structure of the partition.
Definition 7. A branch class describes the structure of a branch, from the document node to its leaf node, and includes both elements and attributes.
Every branch instance can belong to a single branch class. A process of classifying each branch will use the structure of the branch instance and its relationship to other branch instances as the matching criteria. Earlier work on DataGuides [4] adopted a similar approach, although here the branch class includes the DataGuide and set of attribute names associated with each element node on the path from the document node to the leaf node within each branch instance. Additionally, in order to belong to the same class, each branch instance must have an identical set of descendant branches. The latter is required to ensure that there is no overlap between branch classes, which we will discuss in §4.
[Figure: sample XML document tree with three branch instances B1, B2 and B3 (left); two nodes carry the attributes @key="123" and @key="456". The extended DataGuides of the two branch classes (right) are:]
Order  Class C1     Class C2
1      a            a
2      a/b{@key}    a/j
3      a/b/c        a/j/k
4      a/b/c/e      a/j/k/m
5      a/b/c/e/f    a/j/k/m/n
Fig. 3. Branch Classifications
Fig 3 depicts a sample XML document showing three branch instances, B1-B3 (left) and the extended DataGuides associated with two branch classes, C1 and C2 (right). Note that the order of the extended DataGuides associated with each branch class is important. After classification, if B1 and B3 have an identical set of descendant branch instances, they will be instances of the C1 class, while branch B2 is an instance of the C2 class.
Finally, the process that maintains parent-child relationships between branch instances (discussed earlier), must be replaced with one that maintains parent-child relationships between branch classes. The ancestor-descendant relationships are then generated for branch classes in the same manner as they were for branch instances.
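The classification criteria of this section (identical ordered extended DataGuide, identical set of descendant branches) can be sketched as a signature-based grouping. The sketch below is an illustration under assumptions, not the authors' implementation: the function name and data layout are invented, descendant branches are compared by their class numbers, and the branch B4 in the second scenario is hypothetical. The path lists for B1, B2 and B3 are the ones shown for C1 and C2 in Fig 3.

```python
C1_PATHS = ("a", "a/b{@key}", "a/b/c", "a/b/c/e", "a/b/c/e/f")
C2_PATHS = ("a", "a/j", "a/j/k", "a/j/k/m", "a/j/k/m/n")


def classify(branches, children):
    """Assign a class number to each branch.

    branches: branch id -> ordered tuple of extended DataGuide paths
    children: branch id -> list of child branch ids
    Two branches share a class only if their ordered paths and the set of
    classes of their descendant branches are both identical.
    """
    class_ids, result, desc_of = {}, {}, {}

    def visit(branch):
        if branch in result:
            return
        desc = set()
        for child in children.get(branch, ()):
            visit(child)  # classify descendants first (bottom-up)
            desc.add(result[child])
            desc |= desc_of[child]
        desc_of[branch] = frozenset(desc)
        signature = (branches[branch], desc_of[branch])
        result[branch] = class_ids.setdefault(signature, len(class_ids) + 1)

    for branch in branches:
        visit(branch)
    return result


# Scenario of Fig 3: B1 and B3 have the same structure and no descendants.
branches = {"B1": C1_PATHS, "B2": C2_PATHS, "B3": C1_PATHS}
classes = classify(branches, {})
print(classes)

# Hypothetical variation: B3 gains a descendant branch B4, so it no longer
# has the same set of descendant branches as B1.
branches_with_child = dict(branches, B4=("a/b/c/e/f/g",))
classes_with_child = classify(branches_with_child, {"B3": ["B4"]})
print(classes_with_child)
```

In the first scenario B1 and B3 receive the same class number and B2 a different one; in the second, B1 and B3 are separated because their descendant sets differ, which is the condition the text imposes to avoid overlap between classes.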
4 Index Deployment and Query Processing
In this section, we describe the indexing constructs resulting from the indexing process in §3. Following this, we give an overview of our query processing approach and continue with a worked example to illustrate how query optimization is achieved. Using the sample XML document in Fig. 4, Tables 1-3 illustrate the NODE, NCL (Name/Class/Level), and CLASS index respectively. The NODE index contains an entry for each node in the XML document. The NCL is generated by selecting each distinct name, class, level and type from the NODE index. The CLASS index contains ancestor-descendant mappings between branch classes, where the attributes ac and dc are the ancestor-or-self classes and descendant-or-self classes respectively. The NCL index allows us to bypass, i.e. avoid processing, large numbers of nodes (discussed shortly). In the traditional approach to XPath query processing, there is a two step process: (1) retrieve nodes (based on the XPath axis and NodeTest), (2) input these nodes to the subsequent step (i.e. context nodes), or return them as the result set (if the current step is the rightmost step in the path expression). In partitioning approaches, a third step is added. Thus, the query process is performed in the following steps:
1. Identify the relevant partitions, i.e. prune the search space.
2. Retrieve the target nodes from these partitions, i.e. by checking their labels (e.g. pre/post, Dewey).
3. Input these nodes to the subsequent step, or return them as the result set.
The NODE and CLASS indexes are sufficient to satisfy all three steps, where the CLASS index prunes the search space (step 1), thus optimizing step 2. However, ultimately we are only concerned with the nodes that are output from the rightmost step in an XPath expression, as these will form the result set for the query. Nodes that are processed as part of the preceding steps are only used to navigate to these result nodes. Using the NCL index instead of the NODE index (where possible), enables us to bypass (or avoid processing) many of these nodes that are only used to navigate to the result set, thus step 2 is optimized further.
[Figure: tree of the sample document with preorder numbers: dblp 0; article 1, 6, 11; author 2, 7, 12; title 3, 8, 13; sub 4, 9; i 5, 10.]
Fig. 4. XML Snippet taken from the DBLP Dataset
Table 1. Node Index
label                name     type  level  class  value
(0,13) | 0           dblp     1     0      n/a    -
(1,4) | 0.0          article  1     1      5      -
(2,0) | 0.0.0        author   1     2      1      -
(3,3) | 0.0.1        title    1     2      4      -
(4,1) | 0.0.1.0      sub      1     3      2      2
(5,2) | 0.0.1.1      i        1     3      3      -
(6,9) | 0.1          article  1     1      5      -
(7,5) | 0.1.1        author   1     2      1      -
(8,8) | 0.1.2        title    1     2      4      -
(9,6) | 0.1.2.1      sub      1     3      2      4
(10,7) | 0.1.2.2     i        1     3      3      -
(11,12) | 0.2        article  1     1      7      -
(12,10) | 0.2.1      author   1     2      8      -
(13,11) | 0.2.2      title    1     2      6      -
Table 2. NCL Index
NAME     CLASS  LEVEL  TYPE
author   1      2      1
sub      2      3      1
i        3      3      1
title    4      2      1
article  5      1      1
title    6      2      1
article  7      1      1
author   8      2      1
Table 3. CLASS Index
ac  dc
1   1
2   2
3   3
4   2
4   3
4   4
5   1
5   2
5   3
5   4
5   5
6   6
7   6
7   7
7   8
8   8
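The relationship between the three indexes can be checked with a short sketch that starts from the rows of Table 1 and derives the other two tables. The NCL derivation follows the text directly (distinct name, class, level and type). For the CLASS index, the paper derives the mappings from parent-child relationships between branch classes; the sketch instead uses pre/post containment between nodes as a stand-in, which is an assumption made so that the example stays self-contained. The tuple layout and variable names are likewise illustrative.

```python
# Rows of Table 1 as (pre, post, name, type, level, class); dblp has no class.
NODE = [
    (0, 13, "dblp", 1, 0, None),
    (1, 4, "article", 1, 1, 5), (2, 0, "author", 1, 2, 1),
    (3, 3, "title", 1, 2, 4), (4, 1, "sub", 1, 3, 2), (5, 2, "i", 1, 3, 3),
    (6, 9, "article", 1, 1, 5), (7, 5, "author", 1, 2, 1),
    (8, 8, "title", 1, 2, 4), (9, 6, "sub", 1, 3, 2), (10, 7, "i", 1, 3, 3),
    (11, 12, "article", 1, 1, 7), (12, 10, "author", 1, 2, 8),
    (13, 11, "title", 1, 2, 6),
]

classified = [row for row in NODE if row[5] is not None]

# NCL: each distinct (name, class, level, type), ordered by class.
ncl = sorted({(name, cls, level, typ)
              for (_pre, _post, name, typ, level, cls) in classified},
             key=lambda row: row[1])

# CLASS: (ac, dc) pairs for ancestor-or-self / descendant-or-self classes.
# Assumption: node a is an ancestor-or-self of node d when
# a.pre <= d.pre and a.post >= d.post (pre/post containment).
class_index = sorted({(a[5], d[5])
                      for a in classified for d in classified
                      if a[0] <= d[0] and a[1] >= d[1]})

for row in ncl:
    print(row)
print(class_index)
```

With these inputs the derived rows coincide with Tables 2 and 3: fourteen NODE entries collapse to eight NCL entries and sixteen class pairs, which is the size reduction the bypassing strategy relies on.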
Bypassing is not possible across all steps in an XPath expression. Therefore, a selection process is required to choose which steps must access the NODE index, and which steps can access the (much smaller) NCL index instead. We are currently in the process of formally defining this process across all steps. Thus, in this paper we present the rules for the selection process that we have currently defined:
1. The NODE index must be used at the rightmost step in the path expression, i.e. to retrieve the actual result nodes. For example, see the rightmost step (/education) in Q1 (Fig 5).
2. If the query does not evaluate a text node, the NCL index can be used in all but the rightmost step. For example, Q1 does not evaluate a text node, thus only the rightmost step accesses the NODE index as required by rule 1.
3. All steps that evaluate a text node must use the NODE index, e.g. //zipcode = ‘17’ (Q2).
4. A step that contains a predicate filter that subsequently accesses a text node must use the NODE index, e.g. step two in Q2.
Q1: //people//person[.//address//zipcode]//profile/education (index per step: NCL, NCL, NCL, NCL, NCL, NODE)
Q2: //people//person[.//address//zipcode = ‘17’]//profile/education (index per step: NCL, NODE, NCL, NODE, NCL, NODE)
Fig. 5. Index Selection
NODE index accesses are required to filter nodes based on the character content of text nodes, i.e. the VALUE attribute (Table 1), or to retrieve the result set for the rightmost step. The character content of text nodes was not considered during the branch classification process (§3) in order to keep the number of branch classes, and therefore, the size of the CLASS index small. However, NODE index accesses (based on the character content of text nodes) are efficient as they usually have high selectivity. In fact, where the character content of text nodes that do not have high selectivity can be identified, e.g. gender has only 2 possible values, they can be included as part of the classification process ensuring high selectivity for all remaining NODE accesses. However, we are currently examining the cost/benefit aspects of including text nodes in our classification process.
Example 1. //people//person
1. SELECT * FROM NODE SRM WHERE SRM.TYPE = 1 AND SRM.NAME = 'person'
2. AND SRM.BRANCH IN (
3. SELECT C1.DC FROM NCL N1, CLASS C1
4. WHERE N1.NAME = 'people'
5. AND N1.CLASS = C1.AC
6. AND SRM.LEVEL > N1.LEVEL
7. )
8. ORDER BY SRM.PRE
In Example 1, notice that the NODE index is only accessed in the rightmost step (line 1 ). The layout of the final branch partitions (see Fig 2) enables us to evaluate the ancestor (or self), descendant (or self), parent or child axis by checking the LEVEL attribute (line 6 ). Note, this would not be possible if we allowed overlap between branches (discussed in §3). Similar approaches must evaluate unique node labels, e.g. pre/post or dewey. An additional benefit of the fact that we do not allow overlap between branch classes is that the inefficient DISTINCT clause that is required by related approaches [11, 5] to remove duplicates from the result set can be omitted. Also, as large numbers of nodes are bypassed, the IN clause is efficient as the sub-query usually returns a small number of branch classes.
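To make the query pattern of Example 1 concrete, the following sketch loads the rows of Tables 1-3 into an in-memory SQLite database and runs the same statement shape against them. Several things here are assumptions for illustration: SQLite stands in for whatever relational engine the authors used, the BRANCH column of Example 1 is taken to hold the class column of Table 1, only the columns needed by the query are stored, and the element names are those of the DBLP snippet (Fig. 4) rather than people/person.

```python
import sqlite3

# (pre, name, type, level, branch) from Table 1; dblp has no class.
NODE_ROWS = [
    (0, "dblp", 1, 0, None), (1, "article", 1, 1, 5), (2, "author", 1, 2, 1),
    (3, "title", 1, 2, 4), (4, "sub", 1, 3, 2), (5, "i", 1, 3, 3),
    (6, "article", 1, 1, 5), (7, "author", 1, 2, 1), (8, "title", 1, 2, 4),
    (9, "sub", 1, 3, 2), (10, "i", 1, 3, 3), (11, "article", 1, 1, 7),
    (12, "author", 1, 2, 8), (13, "title", 1, 2, 6),
]
# (name, class, level, type) from Table 2.
NCL_ROWS = [("author", 1, 2, 1), ("sub", 2, 3, 1), ("i", 3, 3, 1),
            ("title", 4, 2, 1), ("article", 5, 1, 1), ("title", 6, 2, 1),
            ("article", 7, 1, 1), ("author", 8, 2, 1)]
# (ac, dc) from Table 3.
CLASS_ROWS = [(1, 1), (2, 2), (3, 3), (4, 2), (4, 3), (4, 4), (5, 1), (5, 2),
              (5, 3), (5, 4), (5, 5), (6, 6), (7, 6), (7, 7), (7, 8), (8, 8)]

# Same shape as Example 1, with the two element names as parameters.
QUERY = """
SELECT SRM.PRE FROM NODE SRM
WHERE SRM.TYPE = 1 AND SRM.NAME = ?
  AND SRM.BRANCH IN (
    SELECT C1.DC FROM NCL N1, CLASS C1
    WHERE N1.NAME = ?
      AND N1.CLASS = C1.AC
      AND SRM.LEVEL > N1.LEVEL
  )
ORDER BY SRM.PRE
"""


def build_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE NODE (PRE INTEGER, NAME TEXT, TYPE INTEGER, "
                 "LEVEL INTEGER, BRANCH INTEGER)")
    conn.execute("CREATE TABLE NCL (NAME TEXT, CLASS INTEGER, LEVEL INTEGER, "
                 "TYPE INTEGER)")
    conn.execute("CREATE TABLE CLASS (AC INTEGER, DC INTEGER)")
    conn.executemany("INSERT INTO NODE VALUES (?, ?, ?, ?, ?)", NODE_ROWS)
    conn.executemany("INSERT INTO NCL VALUES (?, ?, ?, ?)", NCL_ROWS)
    conn.executemany("INSERT INTO CLASS VALUES (?, ?)", CLASS_ROWS)
    return conn


def descendant_step(conn, ancestor_name, target_name):
    """Evaluate //ancestor_name//target_name and return preorder numbers."""
    return [row[0] for row in conn.execute(QUERY, (target_name, ancestor_name))]


conn = build_db()
print(descendant_step(conn, "article", "title"))
print(descendant_step(conn, "title", "sub"))
```

As in Example 1, the NODE table is touched only for the rightmost step, while the context step is resolved through the much smaller NCL and CLASS tables and a LEVEL comparison.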
5 Experiments
In this section, we compare our branch-based approach to similar (lab-based) approaches. Following this, we evaluate how our approach performs against vendor systems. Experiments were run on identical servers with a 2.66GHz Intel(R) Core(TM)2 Duo CPU and 4GB of RAM. For each query, the time shown includes the time taken for: (1) the XPath-to-SQL transformation, (2) the SQL query execution, and (3) the execution of the SQL count() function on the PRE column of the result set. The latter is necessary as some SQL queries took longer to return all rows than others. Each query was executed 11 times, ignoring the first execution to ensure hot cache result times across all queries. The 10 remaining response times were then averaged to produce the final time in milliseconds. Finally, we placed a 10 minute timeout on queries.
Table 4. XPath Queries (XMark)
Q01 /site/regions/africa
Q02 /site/people/person[@id = 'person0']
Q03 //regions/africa//item/name
Q04 //person[profile/@income]/name
Q05 //people/person[profile/gender][profile/age]/name
Q06 /site/keyword/ancestor::listitem/text/keyword
Q07 /site/closed_auctions/closed_auction//keyword
Q08 /site/closed_auctions/closed_auction[./descendant::keyword]/date
Q09 /site/closed_auctions/closed_auction/annotation/description/text/keyword
Q10 /site/closed_auctions/closed_auction[annotation/description/text/keyword]/date
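The measurement protocol stated at the start of this section (11 executions, the first discarded as a warm-up, the remaining 10 averaged in milliseconds) can be expressed as a small harness. This is a sketch of the protocol only, not the authors' test code: the function and parameter names are hypothetical, the injectable clock exists purely to make the example checkable, and the 10 minute timeout is not enforced here.

```python
import time


def average_hot_time_ms(run_query, runs=11, clock=time.perf_counter):
    """Run a query `runs` times and average all but the first execution.

    run_query: zero-argument callable standing in for the XPath-to-SQL
               transformation, SQL execution and count() steps.
    Returns the mean elapsed time of the hot-cache runs in milliseconds.
    """
    hot_times = []
    for i in range(runs):
        start = clock()
        run_query()
        elapsed_ms = (clock() - start) * 1000.0
        if i > 0:  # the first execution only warms the cache
            hot_times.append(elapsed_ms)
    return sum(hot_times) / len(hot_times)


if __name__ == "__main__":
    # Placeholder workload; a real run would execute the translated SQL.
    print(average_hot_time_ms(lambda: sum(range(100000))))
```

Discarding the first run keeps cold-cache I/O out of the average, which is why the text describes the reported figures as hot cache times.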
5.1 Comparison Tests with Lab-Based Systems
In this section, we will evaluate the performance of a traditional node based approach to XPath: Grust07 [6], and the partitioning approach most similar to ours: Luoma07 [11]. For Grust07, we built the suggested partitioned B-trees: Node(level,pre), Node(type,name,pre) and Node(type,name,level,pre). Additionally we built indexes on size, name, level, value and type. For Luoma07, we used partitioning factors 20, 40, 60, and 100. As suggested in this work, Node(pre) is a primary key. Node(part) is a foreign key reference to the primary key Part(part) and indexes were built on Node(post), Node(name), Node(part), Part(pre), and Part(post). Our overall findings for both approaches are that they do not scale well even for relatively small XML documents. As such, we had to evaluate these approaches using a relatively small dataset. Later in this section, we evaluate large XML datasets across vendor systems. For the following experiments, we generated an XMark dataset of just 115 MB in size and tested both approaches against queries from the XPathMark [13] benchmark and Grust07 (Table 4). In Fig 6, the query response time for each of these queries is shown. These results show the following:
[Figure: execution time (ms, log scale) for Q1-Q10; series: Luoma07 with partitioning factors 20, 40, 60, 80 and 100, Grust07, and BranchIndex.]
Fig. 6. Query Response Times for Luoma07, Grust07 and BranchIndex
– Grust07 timed out on all but Q01, Q02, Q07.
– In Luoma07, a partitioning factor of 100 returned results for the greatest number of queries: Q01, Q02, Q04, Q05, Q07, Q08, Q09, Q10. Q07 shows an increase in processing times as the partitioning factor increased, whereas Q09 showed a decrease. The remaining queries do not provide such a pattern.
– Luoma07 returned results for a greater number of queries than Grust07 across all partitioning factors.
– The BranchIndex is orders of magnitude faster across all queries.
Queries Q01 and Q02 have high selectivity as they return a single result node. Also, the first two steps in Q07, i.e. /site and /closed_auctions, both access a single node. We attribute the fact that Grust07 returned results for queries Q01, Q02 and Q07 to the high selectivity of these queries. As the second bullet point indicates that there is no consistent pattern between the incrementing partitioning factors, we suggest that a single partitioning factor per dataset is not ideal. Luoma07 provides better results than Grust07, both in terms of the query response times, and number of queries that returned a result within 10 minutes. However, the exhaustive experimentation required to identify suitable partition factors is infeasible. Neither approach scales well for queries that have low selectivity: even for relatively small XML datasets, e.g. 115 MB, the query response times are relatively large.
5.2 Comparison Tests with Vendor Systems
In this section, we will evaluate the branch index against a leading commercial XML database solution (Microsoft SQL Server 2008) and a leading open source XML database (MonetDB/XQuery) [2] using the XPathMark [13] benchmark.
The XPathMark Benchmark. The standard XPath benchmark (XPathMark [13]) consists of a number of categories of queries across the synthetic XMark dataset. In this paper, we are examining the performance of the ancestor, ancestor-or-self, descendant, descendant-or-self, parent and child axes. The queries in Table 4 were chosen for this purpose.
[Figure: execution time (ms, log scale) for Q1-Q10; series: BranchIndex, MonetDB/XQuery and SQLServer.]
Fig. 7. XMark Query Response Times
Fig 7 shows the following:
– SQL Server threw an exception on Q6 as it contains the ancestor axis.
– Q1 and Q2 have high selectivity (discussed earlier), thus all three systems took a small amount of time to return the result.
– In queries Q3, Q4, Q6, Q7, and Q9 the BranchIndex shows orders of magnitude improvements over the times returned by SQL Server and MonetDB/XQuery.
– In query Q5, the branch index is almost twice as efficient as MonetDB/XQuery and three times as efficient as SQL Server.
– In Q8 and Q10, the BranchIndex and SQL Server returned similar times, and MonetDB/XQuery took twice as long.
The branch index is the preferred option across all queries except Q2, in which case the time difference is negligible. SQL Server performs well across queries that have multiple parent-child edges, e.g. Q8, Q9 and Q10, which we attribute to the secondary PATH index we built. By contrast, SQL Server performs very poorly in Q3, which has an ancestor-descendant join on the third step. MonetDB/XQuery is quite consistent across all queries, i.e. taking around 10 to 11 seconds across all low selectivity queries. However, it performs particularly poorly in Q6, which could indicate that it does not evaluate the ancestor axis efficiently.
6 Conclusions
In this paper, we presented a partitioning approach for XML documents. These partitions are used to create an index that optimizes XPath's hierarchical axes. Our approach differs from the only major effort in this area in that we do not need to analyze the document in advance to determine efficient partition sizes. Instead, our algorithms are dynamic, thus they create partitions based on document characteristics, e.g. structure and node layout. This provides for a fully automated process for creating the partition index. We obtain further optimization by compacting the partition index using a classification process. As each identical partition will generate identical results in query processing, we need only a representative partition (a branch class) for all partitions of equivalent structure. We then demonstrated the overall optimization gains through experimentation. Our current work focuses on evaluating non-hierarchical XPath axes, e.g. following, preceding, and on using real-world datasets (sensor-based XML output) to test different XML document formats and to utilize real world queries to understand the broader impact of our work.
References
1. Xrel: a path-based approach to storage and retrieval of xml documents using relational databases. ACM Trans. Internet Technol. 1(1), 110–141 (2001)
2. Boncz, P., Grust, T., van Keulen, M., Manegold, S., Rittinger, J., Teubner, J.: MonetDB/XQuery: a fast XQuery processor powered by a relational engine. In: SIGMOD 2006: Proceedings of the ACM SIGMOD international conference on Management of data, pp. 479–490. ACM Press, New York (2006)
3. Marks, G., Roantree, M., Murphy, J.: Classification of Index Partitions. Technical report, Dublin City University (2010), http://www.computing.dcu.ie/~isg/publications/ISG-10-03.pdf
4. Goldman, R., Widom, J.: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. In: Jarke, M., Carey, M.J., Dittrich, K.R., Lochovsky, F.H., Loucopoulos, P., Jeusfeld, M.A. (eds.) Proceedings of 23rd International Conference on Very Large Data Bases, VLDB 1997, Athens, Greece, August 25-29, pp. 436–445. Morgan Kaufmann, San Francisco (1997)
5. Grust, T.: Accelerating XPath Location Steps. In: SIGMOD 2002: Proceedings of the 2002 ACM SIGMOD international conference on Management of data, pp. 109–120. ACM, New York (2002)
6. Grust, T., Rittinger, J., Teubner, J.: Why off-the-shelf RDBMSs are better at XPath than you might expect. In: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pp. 949–958. ACM Press, New York (2007)
7. Grust, T., van Keulen, M., Teubner, J.: Staircase Join: Teach a Relational DBMS to Watch Its (axis) Steps. In: VLDB 2003: Proceedings of the 29th international conference on Very large data bases, pp. 524–535. VLDB Endowment (2003)
8. Lu, J., Chen, T., Ling, T.W.: Efficient Processing of XML Twig Patterns with Parent Child Edges: A Look-ahead Approach. In: CIKM 2004: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pp. 533–542. ACM, New York (2004)
9. Lu, J., Chen, T., Ling, T.W.: TJFast: Effective Processing of XML Twig Pattern Matching. In: WWW 2005: Special interest tracks and posters of the 14th international conference on World Wide Web, pp. 1118–1119. ACM, New York (2005)
10. Luoma, O.: Xeek: An efficient method for supporting xpath evaluation with relational databases. In: ADBIS Research Communications (2006)
11. Luoma, O.: Efficient Queries on XML Data through Partitioning. In: WEBIST (Selected Papers), pp. 98–108 (2007)
12. O'Neil, P., O'Neil, E., Pal, S., Cseri, I., Schaller, G., Westbury, N.: ORDPATHs: Insert-Friendly XML Node Labels. In: SIGMOD 2004: Proceedings of the ACM SIGMOD international conference on Management of data, pp. 903–908. ACM, New York (2004)
13. XPathMark Benchmark. Online Resource, http://sole.dimi.uniud.it/~massimo.franceschet/xpathmark/
Specifying Aggregation Functions in Multidimensional Models with OCL
Jordi Cabot¹, Jose-Norberto Mazón², Jesús Pardillo², and Juan Trujillo²
¹ INRIA - École des Mines de Nantes (France) [email protected]
² Universidad de Alicante (Spain) {jnmazon,jesuspv,jtrujillo}@dlsi.ua.es
Abstract. Multidimensional models are at the core of data warehouse systems, since they allow decision makers to define, at an early stage, the relevant information and queries that are required to satisfy their information needs. The use of aggregation functions is a cornerstone in the definition of these multidimensional queries. However, current proposals for multidimensional modeling lack the mechanisms to define aggregation functions at the conceptual level: multidimensional queries can only be defined once the rest of the system has already been implemented, which requires much effort and expertise. In this sense, the goal of this paper is to extend the Object Constraint Language (OCL) with a predefined set of aggregation functions. Our extension facilitates the definition of platform-independent queries as part of the specification of the conceptual multidimensional model of the data warehouse. These queries are automatically implemented with the rest of the data warehouse during the code-generation phase. The OCL extensions proposed in this paper have been validated by using the USE tool.
1 Introduction
Data warehouse systems support decision makers in analyzing large amounts of data integrated from heterogeneous sources into a multidimensional model. Several authors [1,2,3,4] and benchmarks for decision support systems (e.g., TPC-H or TPC-DS [5]) have highlighted the great importance of aggregation functions during this analysis to compute and return a unique summarized value that represents the whole set, such as sum, average or variance. Although it is widely accepted that multidimensional structures should be represented in an implementation-independent conceptual model in order to reflect real-world situations as accurately as possible [6], multidimensional queries that satisfy information needs of decision makers are not currently expressed at the conceptual level but only after the rest of the data warehouse system has been developed. Therefore, the definition of these queries is implementation-dependent, which requires a lot of effort and expertise in the target implementation platform.
The main drawback of this traditional way of proceeding is that it prevents designers from properly validating whether the conceptual schema meets the requirements of decision makers before the final implementation. Therefore, if any required change is discovered after the implementation, designers must start the whole process from the early stages, thereby dramatically increasing the overall cost of data warehouse projects. As stated by Olivé [7], this main drawback comes from the little importance given to the informative function of the information system, that is, to the definition of queries at the conceptual level that must be provided to the users in order to satisfy their information needs. To overcome this drawback in the data warehouse scenario, multidimensional queries must be defined at the conceptual level.
The main restriction for defining multidimensional queries at the conceptual level is the rather limited support offered by current conceptual modeling languages [8,9,10,11], which lack rich constructs for the specification of aggregation functions. So far, researchers have focused on using a small subset of them, namely sum, max, min, avg and count [12] (and most modeling languages do not even cover all these basic ones). However, data warehouse systems require aggregation functions for a richer data analysis [6,4]. Therefore, we believe that it is highly important to be able to provide a wide set of aggregation functions as predefined constructs offered by the modeling language used in the specification of the data warehouse so that the definition of multidimensional queries can be carried out at the conceptual level. This way, designers can define and validate them regardless of the final technology platform chosen to implement the data warehouse.
To this aim, in this paper, the standard Object Constraint Language (OCL [13]) library is extended with a new set of aggregation functions in order to facilitate the specification of multidimensional queries as part of the definition of UML conceptual schemas. In our work, we will use the operations in combination with our UML profile for multidimensional modeling [14]. Nevertheless, our OCL extension is independent of the UML profile and could be used in the definition of any standard UML model. Our new OCL operations have been tested and implemented in the USE tool [15] in order to ensure their well-formedness and to validate them on sample data from our running example (see Sect. 2).
Our work is aligned with current Model-Driven Development (MDD) approaches, such as those of [16,17], where the implementation of the system is supposed to be (semi)automatically generated from its high-level models. The definition of all multidimensional queries at the conceptual level permits a more complete code-generation phase, including the automatic translation of these queries from their initial platform-independent definition to the final (platform-dependent) implementation, as we describe later in the paper. Therefore, code can be easily generated for implementing multidimensional queries in several languages, such as MDX or SQL.
The remainder of this paper is structured as follows: a motivating example is presented in the next section to illustrate the benefits of our proposal throughout the paper. Our OCL extension to model this kind of queries at the conceptual level is presented in Sect. 3, while its validation is carried out in Sect. 4. Sect. 5 defines how to automatically implement it. Finally, Sect. 6 discusses the related work and Sect. 7 presents the main conclusions and sketches future work.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 419–432, 2010. © Springer-Verlag Berlin Heidelberg 2010
Specifying Aggregation Functions in Multidimensional Models with OCL
2 Motivating Example
To motivate the importance of our approach and illustrate its benefits, consider the following example, which is inspired by one of the scenarios described in [18]: an airline's marketing department wants to analyze the flight activity of each member of its frequent flyer program. The department is interested in seeing what flights the company's frequent flyers take, which planes they travel with, what fare basis they pay, how often they upgrade, and how they earn their frequent flyer miles¹. A possible conceptual model for this example is shown in Fig. 1 as a class diagram annotated and displayed using the multidimensional UML profile presented in [14]. The figure represents a multidimensional model of the flight legs taken by frequent flyers in the FrequentFlyerLegs Fact class. This class contains several FactAttribute properties: Fare, Miles and MinutesLate. These properties are measures that can be analyzed according to several aspects, such as the origin and destination airport (Dimension class Airport), the Customer, FareClass, Flight and Date (these last two Dimension classes are not detailed in the diagram).
Fig. 1. Conceptual multidimensional model for our frequent flyer scenario
Given this conceptual multidimensional model, decision makers can request a set of queries to retrieve useful information from the system. For instance, they are probably interested in knowing the miles earned by a frequent flyer in his/her trips from a given airport (e.g., airports located in Denver) in a given fare class. Many other multidimensional queries can be similarly defined. These kinds of queries are usually of particular interest to decision makers because they (i) aggregate the data (e.g., the earned miles in the previous example) and (ii) summarize values by means of different aggregation functions. For example, it is likely that decision makers will be interested in knowing the total number of miles earned by the frequent flyer, a ranking of frequent flyers per number of miles earned, the average number of earned miles, several percentiles on the number of miles and so forth.

Interestingly, these multidimensional queries are related to several concepts [19]:
– Phenomenon of interest, which is the measure or set of measures to analyze (FactAttribute properties in Fig. 1). Miles are the phenomenon of interest in the previously defined query.
– Category attributes, which are the context for analyzing the phenomenon of interest (Dimension and Base classes in Fig. 1). E.g., FareClass and Airport are category attributes.
– Aggregation sets, which are subsets of the phenomenon of interest according to several category attributes. In our sample query, the aggregation set only contains miles obtained by frequent flyers that depart from Denver.
– Aggregation functions, which are predefined operators that can be applied on the aggregation sets to summarize or analyze their factual data. E.g., the sum, avg or percentile operators mentioned above.

The first two aspects (i.e., the definition of the category attributes and the phenomenon of interest) can be easily modeled in UML (as we have already done in Fig. 1). Furthermore, a method for defining aggregation sets in OCL has been proposed in [16]. With regard to aggregation functions, so far researchers and practitioners have focused on a small subset of them, namely sum, max, min, avg and count [12]. Moreover, query-intensive applications, such as data warehouses or OLAP systems, require other kinds of statistical functions for a richer data analysis (e.g., see [4]). However, support for statistical functions is very limited (e.g., OCL does not even support all of the basic aggregation functions), which hinders designers who want to directly define the kind of queries presented above and prevents them from easily satisfying the user requirements.

Therefore, we believe that it is highly important to provide all kinds of aggregation functions as predefined constructs of the modeling language (UML and OCL in our case), so that multidimensional queries can be defined and validated at the conceptual level regardless of the technology platform finally chosen to implement the data warehouse. In the rest of the paper, we propose an extension of the OCL language to solve this issue.

¹ Note that, in this case study, the interest is in actual flight activity, but not in reservation or ticketing activity.
3 Extending OCL with Aggregation Functions
Conceptual modeling languages based on visual formalisms are commonly managed together with textual formalisms, since some model elements are not easily or properly mapped into the graphical constructs provided by the modeling language [20]. For UML schemas, OCL [13] is typically used for this purpose. The goal of this section is to extend the OCL with a new set of predefined aggregation functions to facilitate the definition of multidimensional queries on UML schemas.
The core aggregation functions included in our study are among the most widely used in data analysis [4]. To simplify their presentation, we classify these functions into three different groups, following [21,3]:
– Distributive functions, which can be defined by structural recursion, i.e., the input collection can be partitioned into subcollections that can be individually aggregated and combined.
– Algebraic functions, which are expressed as finite algebraic expressions over distributive functions, e.g., average is computed using count and sum.
– Holistic functions, which are all other functions that are neither distributive nor algebraic.
These functions can be combined to provide many other advanced operators. An example of such an operator is top(x), which uses the rank operation to return a subset of the x highest values within a collection.

3.1 Preliminary OCL Concepts
OCL is a rich language that offers predefined mechanisms for retrieving the values of the attributes of an object, for navigating through a set of related objects, for iterating through collections of objects (e.g., by means of the forAll, exists and select iterators) and so forth. As part of the language, a standard library is also provided, including a predefined set of types and a list of predefined operations that can be applied on those types. The types can be primitive (Integer, Real, Boolean and String) or collection types (Set, Bag, OrderedSet and Sequence). Some examples of operations provided for those types are: and, or, not (Boolean), +, −, ∗, >, < (Real and Integer), union, size, includes, count and sum (Set). All these constructs can be used in the definition of OCL constraints, derivation rules, queries and pre/post-conditions. In particular, the definition of queries follows the template:

    context Class::Q(p1:T1, ..., pn:Tn): Tresult
    body: Query-ocl-expression

where the query Q returns the result of evaluating Query-ocl-expression by using the arguments passed as parameters in its invocation on an object of the context type Class. Apart from the parameters p1...pn, in Query-ocl-expression designers may use the implicit parameter self (of type Class) representing the object on which the operation has been invoked. As an example, the previous query (total miles earned by a frequent flyer in his/her trips from Denver in a given fare class) can be defined as follows:

    context Customer::sumMiles(fc: FareClass)
    body: self.frequentFlyerLegs->select(f | f.fareClass = fc and f.origin.city.name = 'Denver').Miles->sum()
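To make the semantics of the sumMiles query concrete, the following Python sketch evaluates the same select-then-sum pattern over in-memory objects. It is only an illustration under stated assumptions: the Leg and Customer classes, their field names and the sample data are hypothetical simplifications of the schema in Fig. 1, not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Leg:
    """One frequent flyer leg; a hypothetical, flattened view of Fig. 1."""
    miles: int
    fare_class: str
    origin_city: str


@dataclass
class Customer:
    name: str
    legs: List[Leg] = field(default_factory=list)

    def sum_miles(self, fare_class: str, city: str = "Denver") -> int:
        # select(f | f.fareClass = fc and f.origin.city.name = 'Denver')
        selected = [leg for leg in self.legs
                    if leg.fare_class == fare_class and leg.origin_city == city]
        # project the Miles measure of the selected legs and apply sum()
        return sum(leg.miles for leg in selected)


# Illustrative data only.
customer = Customer("sample customer", [
    Leg(992, "Discount", "Denver"),
    Leg(1432, "Discount", "Denver"),
    Leg(500, "Discount", "Boston"),
    Leg(61, "Business", "Denver"),
])
print(customer.sum_miles("Discount"))  # prints 2424
```

The list comprehension plays the role of the OCL select iterator, and the generator passed to sum() corresponds to navigating to the Miles attribute before aggregating.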
Unfortunately, many other interesting queries cannot be defined in a similar way, since the operators they require are not part of the standard library (e.g., the average number of miles earned by a customer in each flight leg cannot be expressed directly because the average operation is not defined in OCL). In the next section, we present our extension to the OCL standard library, which includes these operators as predefined operations available to all users of the language.

3.2 Extending the OCL Standard Library
Multidimensional queries cannot be easily defined in OCL because the aggregation functions required to specify them are not part of the standard library. They must therefore be manually defined by the designer every time they are needed, which is an error-prone and time-consuming activity (due to the complexity of some aggregation functions). To solve this problem, we propose in this section an extension to the OCL Standard Library that predefines a list of new aggregation functions that designers can reuse in the definition of their OCL expressions. The new operations are formally defined in OCL by specifying their operation contract, in exactly the same style in which existing operations of the library are defined in the official OCL specification document.

Our extension does not change the OCL metamodel and thus does not compromise the standard compliance of the UML/OCL models that use it. In fact, our operations can be regarded as new user-defined operations, a possibility which is supported by most current OCL tools. Therefore, our extension could be easily integrated in those tools.

Each operation is attached to the most appropriate (primitive or collection) type. As usual, functions defined on a supertype can be applied on instances of the subtypes. For each operation we indicate the context type, the signature and the postcondition that defines the result computed by it. When required, preconditions restricting the application of the operation are also provided. Note that some aggregation functions have several slightly different alternative definitions in the literature. Due to space limitations we stick to just one of them. These functions can be called within OCL expressions in the same way as any other standard OCL operation. See an example in Sect. 3.3.

Distributive Functions
– MAX: Returns the element with the highest value in a non-empty collection of objects of type T. T must support the >= operation. If several elements share the highest value, one of them is randomly selected.
    context Collection::max(): T
    pre: self->notEmpty()
    post: result = self->any(e | self->forAll(e2 | e >= e2))
– MIN: Returns the element with the lowest value in the collection of objects of type T. T must support the <= operation.
    context Collection::min(): T
    pre: self->notEmpty()
    post: result = self->any(e | self->forAll(e2 | e <= e2))
– [...]
    [...]->asSet()->size()
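The structural-recursion property that characterizes distributive functions can be illustrated with a short Python sketch. The function names (ocl_max, ocl_min, aggregate_in_parts) are hypothetical and only mirror the contracts above; the sample values are the Business miles of Table 1, and the partitioning into three sublists is an arbitrary choice made for illustration.

```python
from typing import Callable, List, Sequence

Number = float


def ocl_max(values: Sequence[Number]) -> Number:
    """Mirror of max(): requires a non-empty collection."""
    if not values:
        raise ValueError("precondition violated: self->notEmpty()")
    return max(values)


def ocl_min(values: Sequence[Number]) -> Number:
    """Mirror of min(): requires a non-empty collection."""
    if not values:
        raise ValueError("precondition violated: self->notEmpty()")
    return min(values)


def aggregate_in_parts(parts: List[Sequence[Number]],
                       aggregate: Callable[[Sequence[Number]], Number]) -> Number:
    """Aggregate each non-empty partition, then aggregate the partial results.

    For a distributive function this yields the same value as aggregating
    the whole collection at once.
    """
    partials = [aggregate(part) for part in parts if part]
    return aggregate(partials)


miles = [61, 61, 61, 1634, 1634, 1906]
parts = [miles[:2], miles[2:4], miles[4:]]
print(aggregate_in_parts(parts, ocl_max))  # prints 1906
print(aggregate_in_parts(parts, ocl_min))  # prints 61
print(aggregate_in_parts(parts, sum))      # prints 5357
```

The same combination step would not work for a holistic function such as the median, which is why the classification matters when queries are evaluated over partitioned data.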
Algebraic Functions
– AVG: Returns the arithmetic average value of the elements in the non-empty collection. The type of the elements in the collection must support the + and / operations.
    context Collection::avg(): Real
    pre: self->notEmpty()
    post: result = self->sum() / self->size()
– VARIANCE: Returns the variance of the elements in the collection. The type of the elements in the collection must support the +, −, ∗ and / operations. The function accumulates the deviation of each element from the average value of the collection (this is computed by using the iterate operator: for each element e in the collection, the acc variable is incremented with the square of the result of subtracting the average value from e). Note that this function uses the previously defined avg function.
    context Collection::variance(): Real
    pre: self->notEmpty()
    post: result = (1/(self->size()-1)) * self->iterate(e; acc:Real = 0 | acc + (e - self->avg()) * (e - self->avg()))
– STDDEV: Returns the standard deviation of the elements in the collection.
    context Collection::stddev(): Real
    pre: self->notEmpty()
    post: result = self->variance().sqrt()
– COVARIANCE: Returns the covariance value between two ordered sets (or sequences). We present the version for OrderedSets. The version for the Sequence type is exactly the same, only the context type changes. The standard indexOf operation returns the position of an element in the ordered set, and the at operation returns the element at a given position. As guaranteed by the operation precondition, both input collections must have the same number of elements.
    context OrderedSet::covariance(Y: OrderedSet): Real
    pre: self->size() = Y->size() and self->notEmpty()
    post: let avgY:Real = Y->avg() in
          let avgSelf:Real = self->avg() in
          result = (1/self->size()) * self->iterate(e; acc:Real = 0 | acc + ((e - avgSelf) * (Y->at(self->indexOf(e)) - avgY)))
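The algebraic contracts above can be read as the following Python sketch. It follows the postconditions as written (divisor n − 1 for the variance, divisor n for the covariance); the function names are hypothetical, elements are paired by position instead of through indexOf, and the stricter check of at least two elements in variance is an assumption added here to avoid a division by zero.

```python
import math
from typing import Sequence


def avg(values: Sequence[float]) -> float:
    if not values:
        raise ValueError("precondition violated: self->notEmpty()")
    return sum(values) / len(values)


def variance(values: Sequence[float]) -> float:
    # Assumption: at least two elements, since the divisor is size() - 1.
    if len(values) < 2:
        raise ValueError("variance needs at least two elements")
    mean = avg(values)
    # iterate(e; acc = 0 | acc + (e - avg) * (e - avg))
    acc = 0.0
    for e in values:
        acc += (e - mean) * (e - mean)
    return acc / (len(values) - 1)


def stddev(values: Sequence[float]) -> float:
    return math.sqrt(variance(values))


def covariance(xs: Sequence[float], ys: Sequence[float]) -> float:
    if len(xs) != len(ys) or not xs:
        raise ValueError("precondition violated: same size and notEmpty()")
    avg_x, avg_y = avg(xs), avg(ys)
    # Elements are paired by position, as at(indexOf(e)) does for ordered sets.
    acc = sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys))
    return acc / len(xs)


discount_miles = [992, 1432]
print(avg(discount_miles))       # prints 1212.0
print(variance(discount_miles))  # prints 96800.0
```

Note that avg, variance and stddev are all expressed through sums and counts, which is what makes them algebraic in the classification of Sect. 3.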
Holistic Functions
– MODE: Returns the most frequent value in a collection.
    context Collection::mode(): T
    pre: self->notEmpty()
    post: result = self->any(e | self->forAll(e2 | self->count(e) >= self->count(e2)))
– DESCENDING RANK: Returns the position (i.e., ranking) of an element within a Collection. We assume that the order is given by the >= relation among the elements (the type T of the elements in the collection must support this operator). The input element must be part of the collection. Repeated values are assigned the same rank value. Subsequent elements have a rank increased by the number of elements in the upper level. As mentioned above, this is just one of the possible existing interpretations of the rank function. Others would be similarly defined.
    context Collection::rankDescending(e: T): Integer
    pre: self->includes(e)
    post: result = self->size() - self->select(e2 | e >= e2)->size() + 1
– ASCENDING RANK: Inverse of the previous one. The order is now given by the <= relation.
    context Collection::rankAscending(e: T): Integer
    pre: self->includes(e)
    post: result = self->size() - self->select(e2 | e <= e2)->size() + 1
– PERCENTILE: Returns the value of the percentile p, i.e., the value below which a certain percent p of elements fall.
    context Collection::percentile(p: Integer): T
    pre: p >= 0 and p <= 100 and self->notEmpty()
    post: let n: Real = (self->size()-1) * p / 100 + 1 in
          let k: Integer = n.floor() in
          let d: Real = n - k in
          let s: Sequence(Integer) = self->sortedBy(e | e) in
          result = if k = 0 then s->first() * 1.0
                   else if k = s->size() then s->last() * 1.0
                   else s->at(k) + d * (s->at(k+1) - s->at(k))
                   endif endif
– MEDIAN: Returns the value separating the higher half of a collection from the lower half, i.e., the value of the percentile 50.
    context Collection::median(): T
    pre: self->notEmpty()
    post: result = self->percentile(50)
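As a reading aid, the holistic contracts can be sketched in Python as follows. The function names are hypothetical; percentile assumes the interpolation uses the parameter p (so that median is percentile(50)) and translates the 1-based at(k) of OCL into 0-based list indexing. The choice made by mode when several values are equally frequent is an artifact of this sketch, since the contract allows any of them.

```python
import math
from collections import Counter
from typing import Sequence


def mode(values: Sequence[float]) -> float:
    if not values:
        raise ValueError("precondition violated: self->notEmpty()")
    # Any most frequent value satisfies the postcondition.
    return Counter(values).most_common(1)[0][0]


def rank_descending(values: Sequence[float], e: float) -> int:
    if e not in values:
        raise ValueError("precondition violated: self->includes(e)")
    return len(values) - sum(1 for e2 in values if e >= e2) + 1


def rank_ascending(values: Sequence[float], e: float) -> int:
    if e not in values:
        raise ValueError("precondition violated: self->includes(e)")
    return len(values) - sum(1 for e2 in values if e <= e2) + 1


def percentile(values: Sequence[float], p: int) -> float:
    if not (0 <= p <= 100) or not values:
        raise ValueError("precondition violated: 0 <= p <= 100 and notEmpty()")
    n = (len(values) - 1) * p / 100 + 1
    k = math.floor(n)
    d = n - k
    s = sorted(values)
    if k == 0:
        return s[0] * 1.0
    if k == len(s):
        return s[-1] * 1.0
    # OCL at(k) is 1-based, Python indexing is 0-based.
    return s[k - 1] + d * (s[k] - s[k - 1])


def median(values: Sequence[float]) -> float:
    return percentile(values, 50)


business_miles = [61, 61, 61, 1634, 1634, 1906]
print(percentile(business_miles, 25))  # prints 61.0
print(median(business_miles))          # prints 847.5
```

With the values [10, 20, 20, 30], rank_descending assigns rank 1 to 30, rank 2 to both occurrences of 20 and rank 4 to 10, which matches the description of repeated values above.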
3.3 Applying the Operations
As noted above, these operations can be used in exactly the same way as any other standard OCL operation. As an example, we show the use of our avg function to compute the average number of miles earned by a customer in each flight leg.

    context Customer::avgMilesPerFlightLeg(): Real
    body: self.frequentFlyerLegs.Miles->avg()
4 Validation
Our OCL extension has been validated by using the UML Specification Environment (USE) tool [15]. As a first step, we have implemented our aggregation operations as new user-defined functions in USE. Thanks to the syntactic analysis performed by USE, the syntactic correctness of our functions has been checked in this step. Additionally, in order to check that our functions behave as expected (i.e., that they are also semantically correct), we have evaluated them over sample scenarios and compared the results returned by USE when executing queries that include our operations with the expected results as computed by ourselves.

Fig. 2 shows more details of the process. In the background of the USE environment we can see the implementation of the multidimensional conceptual schema of Fig. 1 in USE (left-hand side) and the script that loads the data provided in [18] (objects and links, which have been obtained by using the operations described in [16]) into the corresponding classes and associations (right-hand side). In the foreground we show one of the queries we have used to test our functions (in this case, the query checks our avg function) together with the resulting collection of data returned by the query. Interested readers can download² the scripts and data of our running example together with the definition of our library of aggregation functions.

It is worth noting that during the validation process we had to work around some limitations of the USE tool, since it provides neither the indexOf nor the Cartesian product functions. Therefore, functions that make use of these OCL operators, e.g., the covariance function, needed to be slightly redefined for their implementation in USE.

To create the queries to test our operations we have used as a base query the query defined in Sect. 3.1 (miles earned by a frequent flyer in his/her trips from Denver according to their fare). Test queries have been created by applying a different aggregation function to this base query every time. The results returned by the base query are shown in Table 1. Tables 2 and 3 then show the results of applying our aggregation functions³ to the collections of values of Table 1. The results returned by our functions were the ones expected (according to the underlying data) in all cases.

Fig. 2. Conceptual querying of frequent flyer legs implemented in USE

Table 1. Collections of miles by fare class when the departure's city is Denver

City    FareClass  Miles
Denver  Economy    ∅
        Business   {61,61,61,1634,1634,1906}
        First      {977,977,1385}
        Discount   {992,1432}

Table 2. Results for distributive and algebraic statistical functions

miles     sum   max   min  avg       var          stddev    covar.
Economy   0     N/A   N/A  N/A       N/A          N/A       N/A
Business  5357  1906  61   892.8333  840200.5667  916.6246  248379.4444
First     2406  977   452  802       91875        303.1089  20650
Discount  2424  1432  992  1212      96800        311.1270  9240

² http://www.lucentia.es/index.php/OCL_Statistics_Library
³ Fare per frequent flyer is used as an additional collection to compute the covariance.
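The comparison of returned and expected results can be reproduced outside USE. The sketch below is not the authors' validation script: it recomputes the Business and Discount rows of Table 2 (plus the corresponding medians of Table 3) with Python's standard statistics module, assuming the sample variance (divisor n − 1) and a tolerance of 0.001 to absorb the rounding of the published figures.

```python
import math
import statistics

# Collections of miles taken from Table 1.
MILES = {
    "Business": [61, 61, 61, 1634, 1634, 1906],
    "Discount": [992, 1432],
}

# Published values: sum, max, min, avg, var, stddev (Table 2), median (Table 3).
EXPECTED = {
    "Business": (5357, 1906, 61, 892.8333, 840200.5667, 916.6246, 847.5),
    "Discount": (2424, 1432, 992, 1212, 96800, 311.1270, 1212),
}


def recompute(miles):
    """Return the aggregates of one fare class in the order used by EXPECTED."""
    return (
        sum(miles),
        max(miles),
        min(miles),
        statistics.mean(miles),
        statistics.variance(miles),  # sample variance, divisor n - 1
        statistics.stdev(miles),
        statistics.median(miles),
    )


def agrees(fare_class, tol=1e-3):
    """True if every recomputed aggregate matches the published one."""
    actual = recompute(MILES[fare_class])
    return all(math.isclose(a, e, abs_tol=tol)
               for a, e in zip(actual, EXPECTED[fare_class]))


for fare_class in MILES:
    print(fare_class, agrees(fare_class))
```

The covariance column is left out because the additional Fare collection it depends on is not listed in the tables.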
Table 3. Results for holistic statistical functions

miles     mode  perc.(25)  median
Economy   N/A   N/A        N/A
Business  61    61         847.5
First     977   714.5      977
Discount  992   1102       1212

5 Automatic Code Generation
This section shows how our "enriched" schema can be used in the context of an MDD process. In fact, conceptual schemas containing queries defined using our aggregation functions can be directly implemented in any final technology platform by using exactly the same existing MDD methods and tools that are able to generate code from UML/OCL schemas. These methods do not need to be extended to cope with our aggregation functions. Automatic code generation is possible thanks to the fact that (i) our library is defined at the model level and thus is technology-independent, and (ii) aggregation functions are specified in terms of standard OCL operations.

More specifically, given a query operation q including an OCL aggregation operation s, q can be directly implemented in a technology platform p (for instance, a relational database or an object-oriented Java program) if p offers native support for s. In that case, we just need to replace the call to s with the call to the corresponding operation in p as part of the usual translation process followed to generate the code for implementing OCL queries in that platform. Otherwise, i.e., if p does not support s, we first need to unfold s in q by replacing the call to s with the body condition of s. After the unfolding, q only contains standard OCL functions and can therefore be implemented in p as explained for the former case.

As an example, we show in Fig. 3 the implementation of the query average miles per flight leg specified in OCL in Sect. 3.3. Fig. 3 (a) shows the implementation for a relational database, while Fig. 3 (b) shows it for a Java program. In the database implementation, queries can be translated as views. The relational tables (for the classes and associations in the conceptual schema) and the views for the query operations can be generated with the DresdenOCL tool [22] (among others). Since database management systems usually offer statistical packages for all of our functions, the avg operation in the query is directly translated by calling the predefined SQL AVG function in the database (see Fig. 3 (a)). For the Java example, queries are translated as methods in the class owning the query. Java classes and methods can be generated from a UML/OCL specification using the same DresdenOCL tool or other OCL-to-Java tools (see a list in [23]). However, in this case we first need to unfold the definition of avg in the query, since Java does not directly support aggregation operations. The new OCL query body becomes:
    context Customer::avgMilesPerFlightLeg(): Real
    post: result = self.frequentFlyerLegs.Miles->sum() / self.frequentFlyerLegs.Miles->size()
This new body is the one passed on to the Java code-generation tool to obtain the corresponding Java method, as can be seen in Fig. 3 (b). All non-standard Java operations (e.g., sumMiles) are implemented by the OCL-to-Java tool itself during the translation (basically, such tools traverse the AST of the OCL expression and generate a new auxiliary method for each node in the tree without an exact mapping to one of the predefined methods in the Java API). Obviously, different tools will generate different Java code excerpts.

(a) DBMS code
    create view AvgMilesFlight as
      select c.id, avg(l.miles)
      from customer c, frequentflyerlegs l
      where c.id = l.customer
      group by c.id

(b) Java code
    class Customer {
      int id;
      String name;
      Vector f;
      ...
      public float avgMiles() {
        return sumMiles(f) / f.size();
      }
    }

Fig. 3. Code excerpts for an OCL query using the avg function
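The decision between native translation and unfolding described in this section can be summarized in a few lines of Python. This is a hypothetical sketch, not the behavior of DresdenOCL or any other tool: the NATIVE and UNFOLDED_BODY tables, the platform names and the translate function are assumptions introduced for illustration only.

```python
from typing import Dict, Tuple

# Hypothetical table of aggregation operations natively supported per platform.
NATIVE: Dict[str, Dict[str, str]] = {
    "sql": {"avg": "AVG", "sum": "SUM", "max": "MAX", "min": "MIN"},
    "java": {},  # assumption: no native aggregation operations
}

# Body conditions used for unfolding, written over standard OCL operations.
UNFOLDED_BODY: Dict[str, str] = {
    "avg": "{c}->sum() / {c}->size()",
}


def translate(platform: str, operation: str, collection: str) -> Tuple[str, str]:
    """Return how the call is handled and the resulting expression text."""
    native = NATIVE.get(platform, {}).get(operation)
    if native is not None:
        # Platform p supports s: replace the call with the native operation.
        return ("native", f"{native}({collection})")
    # Otherwise unfold s: replace the call with the body condition of s.
    return ("unfolded", UNFOLDED_BODY[operation].format(c=collection))


print(translate("sql", "avg", "l.miles"))
print(translate("java", "avg", "self.frequentFlyerLegs.Miles"))
```

The unfolded expression returned for the Java case is the same body that is shown above for avgMilesPerFlightLeg, which a standard OCL-to-Java generator can then process without knowing about avg.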
6 Related Work
Multidimensional modeling languages (and modeling languages in general) offer limited support for the definition of aggregation operations at the conceptual level. Early approaches [9,10,24] are concerned only with static aspects and lack mechanisms to properly model multidimensional query behavior. At most, these approaches suggest a limited set of predefined aggregation functions, but without providing a formal definition. More recently, other approaches have tried to use more expressive constructs to model aggregation functions at the conceptual level by extending UML [8,14,11]. They all propose to use OCL to complete the multidimensional model with information about the applicable aggregation functions in order to define multidimensional queries properly. They also suggest that aggregation functions should be defined in the UML schema but, unfortunately, they do not provide any mechanism to do so. Therefore, to overcome this drawback, we define in this paper how to extend OCL with new aggregation functions in order to query multidimensional schemas at the conceptual level. A subset of these functions was presented in a preliminary short paper [25].
7 Conclusions and Future Work
Aggregation functions should be part of the predefined constructs provided by existing languages for multidimensional modeling to allow designers to specify queries at the conceptual level. However, due to the current lack of support in modeling languages, queries are not currently defined as part of the conceptual schema but added only after the schema has been implemented in the final platform. In this paper, we address this issue by providing an OCL extension that predefines a set of aggregation functions that facilitate the definition of platform-independent queries as part of the specification of the multidimensional conceptual schema of the data warehouse. These queries can then be animated and validated at design time and automatically implemented along with the rest of the system during the code-generation phase.

Our short-term future work is to better integrate these aggregation functions with the OLAP operations already presented in [16] in order to provide a more complete definition of the conceptual schema. Furthermore, the definition of multidimensional queries at the conceptual level opens the door to the development of systematic techniques for the treatment of aggregation problems in data analysis at the conceptual level, as a way to evaluate the overall quality of the data warehouse at design time. Finally, we also plan to develop mechanisms that help users to define their own ad hoc OCL queries in a more intuitive manner.
Acknowledgements
Work supported by the projects: TIN2008-00444, ESPIA (TIN2007-67078) from the Spanish Ministry of Education and Science (MEC), QUASIMODO (PAC080157-0668) from the Castilla-La Mancha Ministry of Education and Science (Spain), and DEMETER (GVPRE/2008/063) from the Valencia Government (Spain). Jesús Pardillo is funded by MEC under FPU grant AP2006-00332.
References
1. Cabibbo, L.: A framework for the investigation of aggregate functions in database queries. In: Beeri, C., Buneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 383–397. Springer, Heidelberg (1999)
2. Lenz, H.-J., Thalheim, B.: OLAP databases and aggregation functions. In: SSDBM, pp. 91–100. IEEE Computer Society, Los Alamitos (2001)
3. Lenz, H.-J., Thalheim, B.: OLAP schemata for correct applications. In: Draheim, D., Weber, G. (eds.) TEAA 2005. LNCS, vol. 3888, pp. 99–113. Springer, Heidelberg (2006)
4. Ross, R.B., Subrahmanian, V.S., Grant, J.: Aggregate operators in probabilistic databases. J. ACM 52(1), 54–101 (2005)
5. TPC: Transaction Processing Performance Council, http://www.tpc.org
6. Rizzi, S., Abelló, A., Lechtenbörger, J., Trujillo, J.: Research in data warehouse modeling and design: dead or alive? In: DOLAP, pp. 3–10 (2006)
7. Olivé, À.: Conceptual schema-centric development: A grand challenge for information systems research. In: Pastor, Ó., Falcão e Cunha, J. (eds.) CAiSE 2005. LNCS, vol. 3520, pp. 1–15. Springer, Heidelberg (2005)
8. Abelló, A., Samos, J., Saltor, F.: YAM2: a multidimensional conceptual model extending UML. Inf. Syst. 31(6), 541–567 (2006)
9. Golfarelli, M., Maio, D., Rizzi, S.: The dimensional fact model: A conceptual model for data warehouses. Int. J. Cooperative Inf. Syst. 7(2-3), 215–247 (1998)
10. Hüsemann, B., Lechtenbörger, J., Vossen, G.: Conceptual data warehouse modeling. In: DMDW, 6 (2000)
11. Prat, N., Akoka, J., Comyn-Wattiau, I.: A UML-based data warehouse design method. Decision Support Systems 42(3), 1449–1473 (2006)
12. Shoshani, A.: OLAP and statistical databases: Similarities and differences. In: PODS, pp. 185–196. ACM Press, New York (1997)
13. Object Management Group: UML 2.0 OCL Specification (2003)
14. Luján-Mora, S., Trujillo, J., Song, I.Y.: A UML profile for multidimensional modeling in data warehouses. Data Knowl. Eng. 59(3), 725–769 (2006)
15. Gogolla, M., Büttner, F., Richters, M.: USE: A UML-based specification environment for validating UML and OCL. Sci. Comput. Program. 69(1-3), 27–34 (2007)
16. Pardillo, J., Mazón, J.N., Trujillo, J.: Extending OCL for OLAP querying on conceptual multidimensional models of data warehouses. Information Sciences 180(5), 584–601 (2010)
17. Mazón, J.N., Trujillo, J.: An MDA approach for the development of data warehouses. Decis. Support Syst. 45(1), 41–58 (2008)
18. Kimball, R., Ross, M.: The Data Warehouse Toolkit. Wiley & Sons, Chichester (2002)
19. Rafanelli, M., Bezenchek, A., Tininini, L.: The aggregate data problem: A system for their definition and management. SIGMOD Record 25(4), 8–13 (1996)
20. Embley, D., Barry, D., Woodfield, S.: Object-Oriented Systems Analysis. A Model-Driven Approach. Yourdon Press Computing Series (1992)
21. Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., Pirahesh, H.: Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub totals. Data Min. Knowl. Discov. 1(1), 29–53 (1997)
22. Software Technology Group - Technische Universität Dresden: Dresden OCL toolkit, http://dresden-ocl.sourceforge.net/
23. Cabot, J., Teniente, E.: Constraint support in MDA tools: A survey. In: Rensink, A., Warmer, J. (eds.) ECMDA-FA 2006. LNCS, vol. 4066, pp. 256–267. Springer, Heidelberg (2006)
24. Sapia, C., Blaschka, M., Höfling, G., Dinter, B.: Extending the E/R Model for the Multidimensional Paradigm. In: ER Workshops, pp. 105–116 (1998)
25. Cabot, J., Mazón, J.-N., Pardillo, J., Trujillo, J.: Towards the conceptual specification of statistical functions with OCL. In: CAiSE Forum, pp. 7–12 (2009)
The CARD System
Faiz Currim1, Nicholas Neidig2, Alankar Kampoowale3, and Girish Mhatre4
1 Department of Management Sciences, Tippie College of Business, University of Iowa, Iowa City IA, USA
2 Software Engineer, Kansas City MO, USA
3 State Hygienic Lab, University of Iowa, Iowa City IA, USA
4 Department of Internal Medicine, Carver College of Medicine, University of Iowa, Iowa City IA, USA
{faiz-currim,alankar-kampoowale,girish-mhatre}@uiowa.edu, [email protected]
Abstract. We describe a CASE tool (the CARD system) that allows users to represent and translate ER schemas, along with more advanced cardinality constraints (such as participation, co-occurrence and projection [1]). The CARD system supports previous research that proposes representing constraints at the conceptual design phase [1], and builds upon work presenting a framework for establishing completeness of cardinality and the associated SQL translation [2]. From a teaching perspective, instructors can choose to focus student efforts on data modeling and design, and leave the time-consuming and error-prone aspect of SQL script generation to the CARD system. Graduate-level classes can take advantage of support for more advanced constraints.
Keywords: conceptual design, schema import, relational translation, CASE tool, cardinality, triggers.
1 Introduction
Cardinality constraints have been a useful and integral part of conceptual database diagrams since the original entity-relationship (ER) model proposed by Chen [3]. A variety of papers since then have examined cardinality constraints in more detail, and many frameworks and taxonomies have been proposed to comprehensively organize the types of cardinality constraints [4, 5]. Cardinality captures the semantics of real-world business rules, and is needed for subsequent translation into the logical design and implementation in a database. In previous research we suggested it was useful to explicitly model constraints at the conceptual stage to permit automation in translation into logical design [1]. We also proposed an approach for establishing completeness of cardinality constraint classifications [2]. This allowed us to come up with a well-defined SQL mapping for each constraint type.

The CARD (Constraint Automated Representation for Database systems) project seeks to use knowledge of the SQL mapping to automatically generate database trigger code. Our software aims to go beyond freely available CASE tools by also generating triggers that manage the more complicated constraints. This automated approach to code generation has the two-fold advantage of improving both database integrity and programmer productivity. In the next section we describe the architecture of the CARD system.

J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 433–437, 2010. © Springer-Verlag Berlin Heidelberg 2010
F. Currim et al.
2 System Architecture The architecture of the CARD system is summarized in Fig. 1. The user interacts with the system via the web front-end, which provides options both to import and to manually create schemas. Changes to the schema may also be made through the web interface. For example, the user may wish to change an attribute name, or add constraints that are not typically shown on an ER diagram. The SQL generator module takes an existing schema and translates it into tables and associated triggers (for constraints). The purpose of the data access layer is to provide the different modules with standardized access to the schema repository. This also allows for future additions to client software while re-using functionality. An in-depth description of the complete CARD system would be beyond the scope of this demo proposal, so we focus on the schema import and SQL generation aspects.
Fig. 1. System Architecture
2.1 The Visio Import Layer The current prototype allows the import of ER schemas developed in MS-Visio. We provide a stencil of standard shapes (downloadable from our website) for entity classes, relationships and attributes based on an extended ER [6] grammar. In the future, we would like to support UML as well. Some of the basic stencil shapes are shown in Fig. 2 and a screen capture of the import process is in Fig. 3. Next, we briefly explain how the Visio parser functions. Our Visio import layer uses JDOM to parse the Visio XML drawings (.vdx files) in two stages. First, each master shape and its correspondence with ER modeling constructs is identified. Then,
Fig. 2. Sample Visio stencil shapes (entity class, weak entity class, relationship, attribute, identifier, multi-valued attribute)
Fig. 3. Importing a Visio XML Drawing
we map the items in the drawing to corresponding master types (e.g., entity classes or attributes). After parsing diagram objects (e.g., entity classes) into data structures, we analyze which shapes are connected to one another to determine relationships and attributes. Further processing transforms the parsed objects into XML elements that associate different roles played by the components of the specific database schema. For example, we track which class in an inclusion relationship serves as the superclass and which others serve as subclasses. We use a variety of heuristics to determine the root of a superclass-subclass hierarchy, as well as the root of a composite-attribute tree. To allow for complex diagrams and eliminate ambiguity, we introduce a few syntactic requirements over standard ER modeling grammars, such as having a directed arrow from a weak entity class to its identifying relationship. The Visio parser also does some structural validation of the diagram. For example, we test whether lines are connected at both ends to a valid shape (e.g., a relationship should not be connected to another relationship, but only to one or more entity classes). To facilitate SQL generation, we check the names given to the objects on the diagram (and raise warnings if special characters are used). The output of our Visio parser is an XML document that contains information about the different schema constructs. This is passed on to the XML Importer.
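The structural checks described above can be illustrated with a minimal Python sketch. It is not the CARD implementation (which is written in Java and operates on Visio XML); the `Shape` and `Line` types, the `validate` function, and the naming rule in `SAFE_NAME` are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Shape:
    name: str
    kind: str  # "entity", "relationship", or "attribute" (assumed categories)


@dataclass(frozen=True)
class Line:
    source: Optional[Shape]  # None models a dangling line end
    target: Optional[Shape]


# Assumed naming rule: letters, digits and underscores, not starting with a digit.
SAFE_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def validate(shapes, lines):
    """Return (errors, warnings) for a parsed diagram."""
    errors, warnings = [], []
    for line in lines:
        if line.source is None or line.target is None:
            errors.append("line is not connected at both ends")
        elif line.source.kind == "relationship" and line.target.kind == "relationship":
            errors.append(
                f"relationship {line.source.name!r} is connected to "
                f"relationship {line.target.name!r}"
            )
    for shape in shapes:
        if not SAFE_NAME.match(shape.name):
            warnings.append(f"name {shape.name!r} contains special characters")
    return errors, warnings
```

In this sketch, structural problems are reported as errors while naming problems only raise warnings, mirroring the distinction drawn in the text.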
2.2 The XML Importer The XML import layer of the CARD project is responsible for parsing an XML document representing a user’s ER schema. While a user is allowed to directly edit and provide such a document for import to the CARD system, in most cases we assume the schema is generated by our Visio parser. We provide an XML Schema document which we use to check the validity of the incoming XML document. Next, we evaluate the document for ER structural validity (e.g., strong entity classes should have an identifier) and for satisfaction of constraints limiting the types and numbers of individual constructs in a relationship and the cardinalities allowed. If a document fails to adhere to a constraint, the user is notified with a message. The process of updating the schema repository goes hand in hand with the parsing. If at any point the document is found to be invalid, the transaction updating the repository is aborted. The XML layer can also act as a canonical standard for the specification of data that an ER diagram is designed to store. This feature allows for future development of additional user-defined input file formats. The only requirement is to create a parser that converts from the desired representation format to the structure enforced by our XML Schema document. The Visio layer of the project is an example of one such method of conversion and further establishes the flexibility and versatility of the project. Schema SQL Writing The system supports writing the SQL representation of an entity-relationship schema. This involves the implicit relational conversion, and the associated SQL generation. The writing of a schema begins with the schema’s entity classes. The strong entity classes are translated first, followed by related subclasses, and weak entity classes. Multi-valued attributes are also written right after their parent class. The translation includes definition of primary and foreign keys. 
We provide support for translation of a variety of relationships, including interaction, grouping, composition and instantiation. We fully support unary, binary, as well as higher-order relationships. Further, we generate trigger code to manage participation constraints for the general n-ary relationship. We are in the process of developing code to handle co-occurrence and projection constraints. A minimum cardinality greater than 0 implies the need for an ON DELETE trigger, while a maximum cardinality less than many requires an INSERT trigger. If the database allows updates to primary key values, or if constraints are specified with predicates, then an UPDATE trigger must be used as well. A row trigger is written for the tables corresponding to a relationship. The core SQL mapping for each of these constraint types has been previously discussed [2]. Since a relationship may have multiple constraints on it (corresponding to different predicate conditions), our triggers call a constraint check procedure that is written for each relationship to verify that the affected rows do not violate the cardinality of any other constraints on the relationship.
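The rule relating cardinality bounds to trigger events can be summarized in a short Python sketch. This is only an illustration of the rule as stated in the text, not the CARD generator itself; the function name `required_triggers`, its parameters, and the encoding of "many" as `None` are assumptions, as is the choice to emit no trigger at all for an unconstrained 0..many participation.

```python
MANY = None  # assumed encoding of an unbounded maximum cardinality


def required_triggers(min_card, max_card, key_updates_allowed=False, has_predicate=False):
    """Return the trigger events needed to enforce one participation constraint."""
    events = []
    if min_card > 0:
        # Deleting a row could drop participation below the minimum.
        events.append("DELETE")
    if max_card is not MANY:
        # Inserting a row could push participation above the maximum.
        events.append("INSERT")
    if not events:
        # Assumption: 0..many imposes no constraint, so nothing needs checking.
        return events
    if key_updates_allowed or has_predicate:
        # An update can change which rows participate or satisfy the predicate.
        events.append("UPDATE")
    return events
```

For example, a participation of 1..many calls only for the delete check, whereas an exactly-one participation in a database that permits key updates needs all three events.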
3 Summary The CARD prototype has been developed in Java, and is available over the web at: http://www.iowadb.com/ . For the demonstration, we would like to present the ER
schema development and import process, the options for constraint annotation, as well as the generation of table creation SQL code and associated triggers. While the SQL table creation code is designed to be ANSI compliant and to work across platforms, the constraint triggers currently use PL/SQL and are Oracle-specific (although we believe the core logic can be adapted to other platforms in a straightforward manner). Since it is an online tool, we feel it will be suitable for an audience interested in conceptual data modeling research as well as for instructors of database classes.
References 1. Currim, F., Ram, S.: Modeling Spatial and Temporal Set-based Constraints during Conceptual Database Design. Information Systems Research (forthcoming) 2. Currim, F., Ram, S.: Understanding the concept of Completeness in Frameworks for Modeling Cardinality Constraints. In: 16th Workshop on Information Technology and Systems, Milwaukee (2006) 3. Chen, P.P.: The Entity-Relationship Model - Toward a Unified View of Data. ACM Transactions on Database Systems 1, 9–36 (1976) 4. Liddle, S.W., Embley, D.W., Woodfield, S.N.: Cardinality Constraints in Semantic Data Models. Data and Knowledge Engineering 11, 235–270 (1993) 5. Ram, S., Khatri, V.: A Comprehensive Framework for Modeling Set-based Business Rules during Conceptual Database Design. Information Systems 30, 89–118 (2005) 6. Ram, S.: Intelligent Database Design using the Unifying Semantic Model. Information and Management 29, 191–206 (1995)
AuRUS: Automated Reasoning on UML/OCL Schemas Anna Queralt, Guillem Rull, Ernest Teniente, Carles Farré, and Toni Urpí Universitat Politècnica de Catalunya - BarcelonaTech {aqueralt,grull,teniente,farre,urpi}@essi.upc.edu
Abstract. To ensure the quality of an information system, the conceptual schema that represents its domain must be semantically correct. We present a prototype to automatically check whether a UML schema with OCL constraints is right in this sense. It is well known that the full expressiveness of OCL leads to undecidability of reasoning. To deal with this problem, our approach finds a compromise between expressiveness and decidability, thus being able to handle very expressive constraints guaranteeing termination in many cases.
1 Introduction We present a tool for assessing the semantic quality of a conceptual schema consisting of a UML class diagram complemented with a set of arbitrary OCL constraints. Semantic quality can be seen from two perspectives. First, the definition of the schema cannot include contradictions or redundancies. In other words, the schema must be right. Second, the schema must be the right one, i.e. it must correctly represent the requirements. Due to the high expressiveness of the combination of UML and OCL, it becomes very difficult to manually check the correctness of a conceptual schema, especially when the set of constraints is large, so it is desirable to provide the designer with automated support. Our approach consists in checking a set of properties on a schema, both to assess that it is right and that it is the right one. Most of the questions that allow checking these properties are automatically drawn from the schema. Additionally, we provide the designer with the ability to define his own questions to check the correspondence of the schema with the requirements in an interactive validation process. It is well known that the problem of automatically reasoning with integrity constraints in their full generality is undecidable. This means that it is impossible to build a reasoning procedure that deals with the full expressiveness of OCL, always terminates, and answers whether the schema satisfies a certain property or not. Thus, the problem has been approached in the following ways in the literature:
1. Allowing general constraints without guaranteeing termination [3, 5].
2. Allowing general constraints without guaranteeing completeness [1, 4, 7, 11].
3. Ensuring both termination and completeness by allowing only specific kinds of constraints, such as cardinalities or identifiers [2].
Approaches of the first kind are able to deal with arbitrary constraints but do not guarantee termination, that is, a result may not be obtained in some particular cases. J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 438–444, 2010. © Springer-Verlag Berlin Heidelberg 2010
The second kind of approaches always terminate, but they may fail to find that a schema satisfies a certain property when it does. Finally, the third kind of approaches guarantee both completeness and termination by disallowing arbitrary constraints. AuRUS combines the benefits of the first and the third approaches. It admits highly expressive OCL constraints, for which decidability is not guaranteed a priori, and determines whether termination is ensured for each schema to be validated. The theoretical background of this tool was published in [8-10]. AuRUS is an extension of our relational database schema validation tool [6] to the context of conceptual schemas. It handles UML schemas with arbitrary OCL constraints, and provides the analysis of termination. Also, additional predefined properties are included, both to check that the schema is right and that it is the right one. Moreover, the explanations for unsatisfiable properties are given in natural language.
2 Description of the Validation Process In this section we describe how a schema can be validated using AuRUS, which corresponds to the demonstration we intend to perform. We will use the schema in Fig. 1, specified using the CASE tool Poseidon, with a set of OCL constraints that can be introduced as comments. The schema must be saved as XMI (cf. Section 3).
1. context Department inv UniqueDep: Department.allInstances()->isUnique(name)
2. context Employee inv UniqueEmp: Employee.allInstances()->isUnique(name)
3. context WorkingTeam inv UniqueTeam: WorkingTeam.allInstances()->isUnique(name)
4. context Department inv MinimumSalary: self.minSalary > 1000
5. context Department inv CorrectSalaries: self.minSalary < self.maxSalary
6. context Department inv ManagerIsWorker: self.worker->includes(self.manager)
7. context Department inv ManagerHasNoSuperior: self.manager.superior->isEmpty()
8. context Boss inv BossIsManager: self.managedDep->notEmpty()
9. context Boss inv BossHasNoSuperior: self.superior->isEmpty()
10. context Boss inv SuperiorOfAllWorkers: self.subordinate->includesAll(self.managedDep.worker)
11. context WorkingTeam inv InspectorNotMember: self.employee->excludes(self.inspector)
12. context Member inv NotSelfRecruited: self.recruiter <> self
13. context WorkingTeam inv OneRecruited: self.member->exists(m | m.recruiter.workingTeam=self)
Fig. 1. Class diagram introduced in Poseidon and OCL integrity constraints
Once the schema is loaded in AuRUS, the tool shows a message informing the user whether termination of the reasoning is guaranteed. The user can proceed in any case, knowing that he will get an answer in finite time if the result is positive. In our example we can check any (predefined or user-defined) property with the guarantee of termination, as can be seen in Fig. 2.
Fig. 2. Message informing that reasoning on our schema is always finite
Fig. 3 shows the interface of AuRUS. The schema is represented in the tree on the left, which contains the class diagram and the constraints. When clicking on a class, its attributes appear below; for associations, the type and cardinality of their participants are shown; and for constraints, their corresponding OCL expressions appear.
Fig. 3. Is the schema right? properties, with results for liveliness
The first step is to check whether the schema is right (Is the schema right? tab). The user can choose the properties to be checked, expressed in the form of questions. We check all of them, and we see for example that the answer to the question Are all classes and associations lively? is negative. This takes 7.6 seconds, as can be seen at the bottom of the window. The Liveliness tab below shows that class Boss (in red) is not lively, and clicking on it we get an explanation consisting of the set of constraints (graphical, OCL or implicit in the class diagram) that prevent it from being instantiated. This means we must remove or modify at least one of these constraints to make Boss lively. The rest of the classes and associations are lively, and clicking on each of them we obtain a sample instantiation as a proof. Both the explanation for the unliveliness of Boss and a sample instantiation for the liveliness of Department are shown in Fig. 3. The constants used in the instantiations are real numbers, since this facilitates the implementation and does not affect the results. Instances of classes include the values of their attributes as parameters, as well as a first value representing the Object Identifier (OID). Instances of associations include the OIDs of the instances they link. For instance, the instantiation for Department in Fig. 3 shows that in order to have a valid instance of this class (in this case, a department with 0.0 as OID) we also need an employee with OID 0.0010, who must be linked to this department by means of the associations WorksIn and Manages, due to the cardinalities and to the OCL constraint 6. Also, the minimum salary of the department must be over 1000 (constraint 4), and lower than the maximum salary (constraint 5). This is shown in the values 1000.001 and 1000.002 given to its attributes. We can also see that there is a redundant constraint (details are given in the Non-redundancy tab). 
In particular, the OCL constraint 9 (BossHasNoSuperior) is redundant. Fig. 4 shows the explanation computed by AuRUS, consisting of the constraints that cause the redundancy: BossIsManager and ManagerHasNoSuperior, together with the fact that Boss is a subclass of Employee, which is the class in which the latter constraint is defined. This redundancy means we can remove BossHasNoSuperior from the schema, thus making it simpler and easier to maintain while preserving its semantics. The rest of the properties are satisfied (see Fig. 3).
Fig. 4. Explanation for the redundancy of the OCL constraint BossHasNoSuperior
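The redundancy of BossHasNoSuperior can be made plausible with a bounded brute-force check in Python. This is only a sketch under simplifying assumptions, not the reasoning procedure used by AuRUS: it enumerates tiny worlds (two employees and two departments by default), assumes each department has exactly one manager, and assumes that `managedDep` is the inverse of `Department.manager`. The function name and the encoding of instances are hypothetical.

```python
from itertools import product


def constraint_9_is_implied(n_employees=2, n_departments=2, use_constraint_8=True):
    """Bounded check: do constraints 7 (and 8) force constraint 9 in every small world?"""
    employees = range(n_employees)
    # superior[e] is the superior of employee e (or None),
    # is_boss[e] marks the employees that are instances of Boss,
    # manager[d] is the employee managing department d.
    for superior in product([None, *employees], repeat=n_employees):
        for is_boss in product([False, True], repeat=n_employees):
            for manager in product(employees, repeat=n_departments):
                bosses = [e for e in employees if is_boss[e]]
                c7 = all(superior[m] is None for m in manager)  # ManagerHasNoSuperior
                c8 = all(b in manager for b in bosses)           # BossIsManager
                c9 = all(superior[b] is None for b in bosses)    # BossHasNoSuperior
                if c7 and (c8 or not use_constraint_8) and not c9:
                    return False  # counterexample found: 9 is not implied
    return True
```

With both constraints 7 and 8 in place no counterexample exists in these small worlds, whereas dropping constraint 8 (`use_constraint_8=False`) immediately yields one, which matches the explanation that both constraints are needed to cause the redundancy. A bounded enumeration like this is evidence, not a proof.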
When all the properties in Is the schema right? tab are satisfied, we can check the ones in Is it the right schema? to ensure that it represents the intended domain. As shown in Fig. 5(a) (Predefined validation tab), some predefined questions help the designer to check whether he has overlooked something. As a result we get that all classes have an identifier (the answer to Is some identifier missing? is No), but some
other constraints might be missing. For instance, the answer Maybe to the question Is some irreflexive constraint missing? warns us that some recursive association in the schema can link an object to itself. In the Irreflexivity tab, the highlighted line tells us that an employee can be related to himself through the association WorksFor. The designer must decide whether this is correct according to the domain and add an appropriate constraint if not. Finally, the designer may want to check additional ad-hoc properties, such as May a superior work in a different department than his subordinates?. This test can be introduced in the Interactive validation tab (Fig. 5(b)), where instantiations for classes, association classes or associations can be edited to formalize the desired questions. Questions consist in a set of instances that must (not) hold in the schema, and these instances may be specified using variables, as shown in the figure. In this case we want to check whether the schema admits an employee, formalized using the variable Superior, that has a Subordinate (instance of the association WorksFor), such that Superior works in a department Dept (instance of WorksIn) and Subordinate does not work in the department Dept (another instance of WorksIn, which this time must be negated). This partial instantiation of the schema, shown at the bottom of Fig. 5(b), is satisfiable, which means that in our schema a superior may work in a department that is different from that of his subordinates.
(a) Predefined validation
(b) Interactive validation
Fig. 5. Is it the right schema? properties
In the same way, specific instances can be tested to see whether they are accepted by the schema, by providing constants instead of variables when defining the question.
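To make the shape of such a question concrete, the following Python sketch evaluates the example question (a superior working in a department that a subordinate does not work in) against one hand-written instantiation. It is a deliberate simplification: AuRUS searches for an instantiation that satisfies the schema and the question, whereas this sketch only looks for variable bindings inside a given set of facts. The data and the function name are hypothetical.

```python
from itertools import product

# Hand-written sample instantiation (hypothetical data).
works_for = {("ann", "bob")}                        # (subordinate, superior)
works_in = {("bob", "sales"), ("ann", "research")}  # (employee, department)


def superior_in_other_department(works_for, works_in):
    """Find bindings for Superior, Subordinate and Dept, or return None."""
    employees = {e for pair in works_for for e in pair} | {e for e, _ in works_in}
    departments = {d for _, d in works_in}
    for superior, subordinate, dept in product(employees, employees, departments):
        if (
            (subordinate, superior) in works_for        # WorksFor(Subordinate, Superior)
            and (superior, dept) in works_in            # WorksIn(Superior, Dept)
            and (subordinate, dept) not in works_in     # negated WorksIn(Subordinate, Dept)
        ):
            return {"Superior": superior, "Subordinate": subordinate, "Dept": dept}
    return None
```

Replacing the variables with constants corresponds to fixing some of the loop variables in advance, which is how specific instances would be tested in this simplified setting.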
3 AuRUS Overview AuRUS works as a standalone application. Its input is an XMI file with the UML/OCL schema. This file is loaded into a Java library that implements the UML 2.0 and OCL 2.0 metamodels [12]. Since the XMI files generated by different CASE tools are usually not compatible, this library implements its own XMI. Models can be constructed using the primitives offered by the library, or can be drawn in Poseidon and then imported using a converter [12]. Poseidon does not support some UML constructs, such as n-ary association classes. If required, they can be added using the primitives in the library. Currently, only the converter from Poseidon is available, but we plan to provide converters for other popular tools. All the components of AuRUS have been implemented in Java, except for the reasoning engine, which is implemented in C#. AuRUS can be executed on any system featuring the .NET 2.0 framework and the Java Runtime Environment 6.
Acknowledgements Our thanks to Lluís Munguía, Xavier Oriol and Guillem Lubary for their work in the implementation of this tool, and to Albert Tort and Antonio Villegas for their help. We also thank the people in the FOLRE and GMC research groups. This work has been partly supported by the Ministerio de Ciencia y Tecnología under the projects TIN2008-03863 and TIN2008-00444, Grupo Consolidado, and the FEDER funds.
References 1. Anastasakis, K., Bordbar, B., Georg, G., Ray, I.: On Challenges of Model Transformation from UML to Alloy. Software and System Modeling 9(1), 69–86 (2010) 2. Berardi, D., Calvanese, D., De Giacomo, G.: Reasoning on UML Class Diagrams. Artificial Intelligence 168(1-2), 70–118 (2005) 3. Brucker, A.D., Wolff, B.: The HOL-OCL Book. Swiss Federal Institute of Technology (ETH),525 (2006) 4. Cabot, J., Clarisó, R., Riera, D.: Verification of UML/OCL Class Diagrams Using Constraint Programming. In: Proc. Workshop on Model Driven Engineering, Verification and Validation, MoDEVVa 2008 (2008) 5. Dupuy, S., Ledru, Y., Chabre-Peccoud, M.: An Overview of RoZ: A Tool for Integrating UML and Z Specifications. In: Wangler, B., Bergman, L.D. (eds.) CAiSE 2000. LNCS, vol. 1789, pp. 417–430. Springer, Heidelberg (2000) 6. Farré, C., Rull, G., Teniente, E., Urpí, T.: SVTe: A Tool to Validate Database Schemas Giving Explanations. In: Proc. International Workshop on Testing Database Systems DBTest, p. 9 (2008) 7. Gogolla, M., Büttner, F., Richters, M.: USE: A UML-based Specification Environment for Validating UML and OCL. Science of Computer Programming 69(1-3), 27–34 (2007) 8. Queralt, A., Teniente, E.: Reasoning on UML Class Diagrams with OCL Constraints. In: Embley, D.W., Olivé, A., Ram, S. (eds.) ER 2006. LNCS, vol. 4215, pp. 497–512. Springer, Heidelberg (2006)
9. Queralt, A., Teniente, E.: Decidable Reasoning in UML Schemas with Constraints. In: Bellahsène, Z., Léonard, M. (eds.) CAiSE 2008. LNCS, vol. 5074, pp. 281–295. Springer, Heidelberg (2008) 10. Rull, G., Farré, C., Teniente, E., Urpí, T.: Providing Explanations for Database Schema Validation. In: Bhowmick, S.S., Küng, J., Wagner, R. (eds.) DEXA 2008. LNCS, vol. 5181, pp. 660–667. Springer, Heidelberg (2008) 11. Snook, C., Butler, M.: UML-B: Formal Modeling and Design Aided by UML. ACM Trans. on Soft. Engineering and Methodology 15(1), 92–122 (2006) 12. UPC, UOC. EinaGMC, http://guifre.lsi.upc.edu/eina_GMC
How the Structuring of Domain Knowledge Helps Casual Process Modelers
Jakob Pinggera1, Stefan Zugal1, Barbara Weber1, Dirk Fahland2, Matthias Weidlich3, Jan Mendling2, and Hajo A. Reijers4
1 University of Innsbruck, Austria {jakob.pinggera,stefan.zugal,barbara.weber}@uibk.ac.at
2 Humboldt-Universität zu Berlin, Germany {[email protected],jan.mendling}@wiwi.hu-berlin.de
3 Hasso-Plattner-Institute, University of Potsdam, Germany [email protected]
4 Eindhoven University of Technology, The Netherlands [email protected]
Abstract. Modeling business processes has become a common activity in industry, but it is increasingly carried out by non-experts. This raises a challenge: How to ensure that the resulting process models are of sufficient quality? This paper contends that a prior structuring of domain knowledge, as found in informal specifications, will positively influence the act of process modeling in various measures of performance. This idea is tested and confirmed with a controlled experiment, which involved 83 master students in business administration and industrial engineering from Humboldt-Universität zu Berlin and Eindhoven University of Technology. In line with the reported findings, our recommendation is to explore ways to bring more structure into the specifications that are used as input for process modeling endeavors.
1 Introduction
Business process modeling is the task of creating an explicit, graphical model of a business process from internalized knowledge of that process. This type of conceptual modeling has recently received considerable attention in information systems engineering due to its increasing importance in practice [1]. Business process modeling typically involves two specific, associated roles. A domain expert concretizes domain knowledge into an informal description, which is abstracted into a formal model by a system analyst. This works well if the domain expert and the process modeler closely interact with each other, and if both the domain expert and the process modeler attain a high level of expertise. However, these two conditions are often not met. Increasingly, casual modelers are involved in process modeling initiatives, who are neither domain nor process modeling experts. Many organizations do not reserve the time or resources for iterative and consensus-seeking approaches. To illustrate, we are in contact with a financial services provider that employs over 400 business professionals of which only two are skilled process modelers.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 445–451, 2010. © Springer-Verlag Berlin Heidelberg 2010
Process modeling activities in this organization are
often carried out by IT specialists with a low process modeling expertise. As a consequence, process modeling is driven by informal requirement specifications as provided by domain experts, and models are generated in a small number of cycles with little opportunity for feedback and interaction. Requirements engineering points to the importance of structure in requirements specifications [2,3]. Along these lines, our central idea is that the quality of process models may be improved through providing inexperienced, casual process modelers with well-structured domain descriptions. To investigate this contention we designed an experiment in which we observed and measured how a process modeler creates a formal process model from an informal requirements specification. We varied the level of content organization of an informal requirements specification (serving as proxy for differently skilled domain experts) and traced its impact on process model quality. The subjects in our experiment were 83 students who received only limited prior process modeling training. The paper is structured as follows. Section 2 discusses the background of our research. Section 3 describes our experimental framework. Section 4 covers the execution and results of our experiment. Section 5 concludes the paper.
2 Background
To discuss the factors that influence the creation of a formal process model, it is necessary to first reflect on conceptual modeling in general. Conceptual models are developed and used during the requirements analysis phase of information systems development [4]. At that stage, conceptual modeling is an exchange process between a domain expert on the one hand and a system analyst on the other hand [3,5]. Typically, a domain expert can be characterized as someone with (1) superior, detailed knowledge of the object under consideration but often (2) minor powers of abstraction beyond that knowledge. The strengths of the system analyst are exactly the opposite. In this sense, the domain expert is mainly concerned with concretization, which refers to the act of developing an informal description of the situation under consideration. The system analyst, in contrast, is concerned with abstraction, i.e., using the informal description to derive a formalized model of the object. The interaction between domain expert and analyst comprises elicitation, mapping, verification, and validation. In the elicitation step the domain expert produces an initial problem specification, also referred to as the dialogue document. As natural language is humans' essential vehicle to convey ideas, this is normally written in natural language [3]. The primary task of the system analyst is to map the sentences of this informal dialogue document onto concepts of a modeling technique. The resulting formal model can be verified using the syntax rules of the technique. The formal model, in turn, can be translated into a comprehensible format for validation purposes. DeMarco states that a dialogue document is not the problem in the analysis if it is a “suitably partitioned spec with narrative text used at the bottom level” [6]. This statement is in line with more general insights from cognitive psychology that the presentation of
a problem has a significant impact on the solution strategy [7]. This need for a good organization of the requirements is reflected in various works that suggest guidelines and automated techniques for increasing the quality of informal descriptions, particularly of use cases [8,9]. The quality of the dialogue document can be improved using a multitude of requirements elicitation techniques [10]. In the situation of casual modeling the steps of mapping, verification, and validation are conducted by a system analyst with limited abstraction capabilities. Currently, we lack a detailed understanding of what the process of mapping an informal dialogue document to a process model looks like and what exactly the abstraction capabilities entail that we expect from a good system analyst or process modeler. In this paper, we focus on the organization of domain knowledge as it has to be done during mapping to a formal model. To investigate its impact on the creation of process models, we provide different dialogue documents in an experiment that have different degrees of internal organization. For subjects like graduate students without established expertise in modeling, we should be able to observe the consequences of a lack of content organization. Insights from this investigation might improve guidelines on organizing a dialogue document and the effectiveness of approaches supporting the modeling process.
3 Research Setup
The main goal of our experiment is to investigate the impact of content organization of the dialogue document on the modeling outcome and the modeling process. To this end, we designed a task of creating a formal process model in BPMN syntax from an informal dialogue document under varying levels of content organization. To investigate the very process of process modeling, we recorded every modeling step in the experiment in a log. In this section the setup of our experiment is described in conformance with the guidelines of [11]. Subjects: In our experiment, subjects are 66 students of a graduate course on Business Process Management at Eindhoven University of Technology and 17 students of a similar course at the Humboldt-Universität zu Berlin. Participation in the study was voluntary. The participants conducted the modeling in the Cheetah BPMN Modeler [12], which is a graphical process editor specifically designed for conducting controlled experiments. Objects: The object to be modeled is an actual process run by the “Task Force Earthquakes” of the German Research Center for Geosciences (GFZ), which coordinates the allocation of an expert team after catastrophic earthquakes. The task force runs in-field missions for collecting and analyzing data, including seismic data of aftershocks, post-seismic deformation, hydrogeological data, damage distribution, and structural conditions of buildings at major disaster sites [13]. In particular, subjects were asked to model the “Transport of Equipment” process of the task force. The task force needs scientific equipment in the disaster area to complete its mission. We provided a description of how the task force transports its equipment from Germany to the disaster area.
Factor and Factor Levels: The considered factor in our experiment is the organization of the dialogue document. We provided our subjects with dialogue documents with varying degrees of content organization simulating the structuring capabilities of domain specialists. The documents differ in the order in which the process is described (Factor Levels: breadth-first, depth-first and random order description). For all three dialogue document variants, we created a natural language description of the process from a set of elementary text blocks, each block describing one activity of the process. Depending on the factor level, the text blocks were ordered differently. The breadth-first description begins with the start activity and then explains the entire process by taking all branches into account. The depth-first description, in turn, begins with the start activity and then describes the entire first branch of the process before moving on with other branches. Finally, the random description yields a dialogue document for which the order of activity text blocks does not correlate with the structure of the process model. Response Variables: As response variable we considered accuracy of the resulting model, estimated by comparing each model to a reference model of the process. Here, we relied on the graph-edit distance, which defines the minimal number of atomic graph operations needed to transform one graph into another. The graph-edit distance, in turn, can be leveraged to define a similarity metric [14]. For our setting, we weighted insertion and deletion operations of edges and nodes equally, whereas node substitutions are not taken into account as they have been established manually for corresponding pairs of activities. The corresponding hypothesis is: Null Hypothesis H0: There is no significant difference in the accuracy of the created process models between the three groups.
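The accuracy measure can be sketched in Python as follows. This is a minimal illustration, not the metric of [14] itself: it assumes that corresponding activities have already been mapped to shared labels (so node substitutions cost nothing), counts every node or edge insertion and deletion with weight 1, and normalizes by the total number of nodes and edges in both models. The normalization and the function names are assumptions made for this example.

```python
def edit_distance(model, reference):
    """Number of node/edge insertions and deletions turning model into reference.

    Each graph is a pair (nodes, edges) of sets; nodes are activity labels and
    edges are (source, target) pairs over those labels.
    """
    nodes_a, edges_a = model
    nodes_b, edges_b = reference
    # With a fixed node correspondence, the minimal edit script deletes what is
    # only in the model and inserts what is only in the reference.
    return len(nodes_a ^ nodes_b) + len(edges_a ^ edges_b)


def similarity(model, reference):
    """Assumed normalization: 1.0 for identical graphs, 0.0 for disjoint ones."""
    size = sum(len(part) for part in (*model, *reference))
    if size == 0:
        return 1.0
    return 1.0 - edit_distance(model, reference) / size
```

For a reference model with activities A, B, C in sequence and a subject model that contains only A followed by B, one node and one edge are missing, so the distance is 2 and the similarity under this normalization is 0.75.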
4 Performing the Experiment
This section describes the preparation and execution of the experiment, as well as the analysis and discussion of the results.
Preparation: As part of the set-up of the intended experiment, we provided the task to model the “Transport of Equipment” process from a natural language description. Three variants of the task were created: a depth-first description (Variant D), a breadth-first description (Variant B), and a random description (Variant R). To ensure that each description is understandable and can be modeled in the available amount of time, we conducted a pre-test with 14 graduate students at the University of Innsbruck. Based on their feedback, the modeling task descriptions were refined in several iterations.
Execution: The experiment was conducted at two distinct, consecutive events. The first event took place in early November 2009 in Berlin; the second was performed a few days later in Eindhoven. The modeling session started with a demographic survey and was followed by a modeling tool tutorial in which the basic functionality of the BPMN Modeler was explained to our subjects. This was followed by the actual modeling task in which the students had to model
the “Transport of Equipment” process. Roughly a third of the students were randomly assigned to the D variant of the modeling task, another third to the B variant and the remaining third to the R variant. After completing the modeling task, the students received a questionnaire on cognitive load.
Data Validation: Once the experiment was carried out, logged data was analyzed. We discarded data of 8 students because their data was incomplete. Finally, data provided by 66 Eindhoven students and 17 Berlin students was used in our data analysis.
Data Analysis: In total 83 students participated in our experiment. Out of the 83 students, 27 worked on the breadth-first description, 25 on the depth-first description and 31 on the random description. To assess how accurately the 83 models reflect the “Transport of Equipment” process, we compared each model to a reference model of the process using the graph-edit distance. A statistical analysis revealed a significant difference in accuracy between the three groups in terms of this similarity metric (p=0.0026), see Fig. 1. Pairwise Mann-Whitney tests showed a significant difference between breadth-first and random (p=0.0021 < 0.05/3) and between depth-first and random (p=0.0073 < 0.05/3). No difference can be observed between breadth-first and depth-first (p=0.4150).
Discussion of Results. Our concern in this paper is whether the organization level of an informal specification affects the outcome of a modeling process. The results suggest that it does: an explicit ordering of the specification is positively related to the accuracy of the process model that is derived from it. The models created from a breadth-first and depth-first description are significantly more similar to the reference model than those created on the basis of the randomized description. The group dealing with the latter had to re-organize the dialogue document quite extensively.
This suggests that casual modelers would perform better when presented with well-structured specifications. How can the insights from this study be exploited? First of all, it seems reasonable to be selective with respect to the domain experts that will be involved in drawing up the informal specifications. After all, some may be more apt than others to bring structure to such a document. Secondly, it may be feasible to instruct domain experts on how to bring structure to their specifications. In a research stream that is concerned with structuring use cases [9,15], various measures can be distinguished that make use cases easier to understand. For example, one proposal is to create pictures along with use cases that sketch the hierarchical relations between them, or simply to use numbering to both identify and distinguish between logical fragments of the use cases.
Fig. 1. Accuracy of Models (similarity to the reference model for the breadth-first, depth-first, and random groups; similarity axis from 0.11 to 0.36)
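The pairwise comparisons in the data analysis use a threshold of 0.05/3, which corresponds to what is commonly called a Bonferroni correction for three tests. The following sketch only re-applies that threshold to the p-values reported in the text; the helper name `bonferroni_significant` is hypothetical, and the Mann-Whitney tests themselves are not recomputed.

```python
ALPHA = 0.05

# p-values of the pairwise Mann-Whitney tests as reported in the text.
pairwise_p = {
    ("breadth-first", "random"): 0.0021,
    ("depth-first", "random"): 0.0073,
    ("breadth-first", "depth-first"): 0.4150,
}


def bonferroni_significant(p_values, alpha=ALPHA):
    """Mark each comparison as significant if p < alpha / number of tests."""
    threshold = alpha / len(p_values)
    return {pair: p < threshold for pair, p in p_values.items()}


for pair, significant in bonferroni_significant(pairwise_p).items():
    print(pair, "significant" if significant else "not significant")
```

With three comparisons the threshold is about 0.0167, so only the two comparisons involving the random description pass it.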
5 Summary and Outlook
This paper presented findings from an experiment investigating the impact of content organization of the informal dialogue document on both the modeling outcome and the modeling process. Apparently, a breadth-first organization was best suited to yield good results, indicating that industrial practice of process modeling can be improved when selecting domain specialists with good content organization skills. Our future work aims at further investigating the process of creating process models.
Acknowledgements. We thank H. Woith and M. Sobesiak for providing us with the expert knowledge of the disaster management process used in our experiment.
References
1. Indulska, M., Recker, J., Rosemann, M., Green, P.: Business process modeling: Current issues and future challenges. In: van Eck, P., Gordijn, J., Wieringa, R. (eds.) CAiSE 2009. LNCS, vol. 5565, pp. 501–514. Springer, Heidelberg (2009)
2. Davis, A., et al.: Identifying and measuring quality in a software requirements specification. In: Proc. METRICS, pp. 141–152 (1993)
3. Frederiks, P.J.M., van der Weide, T.P.: Information Modeling: The Process and the Required Competencies of Its Participants. DKE 58, 4–20 (2006)
4. Wand, Y., Weber, R.: Research Commentary: Information Systems and Conceptual Modeling - A Research Agenda. ISR 13, 363–376 (2002)
5. Hoppenbrouwers, S., Proper, H., Weide, T.: A fundamental view on the process of conceptual modeling. In: Delcambre, L.M.L., Kop, C., Mayr, H.C., Mylopoulos, J., Pastor, Ó. (eds.) ER 2005. LNCS, vol. 3716, pp. 128–143. Springer, Heidelberg (2005)
6. DeMarco, T.: Software Pioneers. Contributions to Software Engineering (2002)
7. Lindsay, P., Norman, D.: Human Information Processing: An introduction to psychology. Academic Press, London (1977)
8. Cockburn, A.: Writing Effective Use Cases. Addison-Wesley, Reading (2000)
9. Rolland, C., Achour, C.B.: Guiding the Construction of Textual Use Case Specifications. DKE 25, 125–160 (1998)
10. Davis, A.M., et al.: Effectiveness of requirements elicitation techniques: Empirical results derived from a systematic review. In: Proc. RE, pp. 176–185 (2006)
11. Wohlin, C., et al.: Experimentation in Software Engineering: an Introduction. Kluwer, Dordrecht (2000)
12. Pinggera, J., Zugal, S., Weber, B.: Investigating the Process of Process Modeling with Cheetah Experimental Platform. Accepted for ER-POIS 2010 (2010)
13. Fahland, D., Woith, H.: Towards Process Models for Disaster Response. In: Proc. PM4HDPS 2008, pp. 254–265 (2008)
14. Dijkman, R., Dumas, M., García-Bañuelos, L.: Graph Matching Algorithms for Business Process Model Similarity Search. In: Dayal, U., Eder, J., Koehler, J., Reijers, H.A. (eds.) Business Process Management. LNCS, vol. 5701, pp. 48–63. Springer, Heidelberg (2009)
15. Constantine, L., Lockwood, L.: Structure and style in use cases for user interface design. In: Object Modeling and User Interface Design, pp. 245–280 (2001)
SPEED: A Semantics-Based Pipeline for Economic Event Detection
Frederik Hogenboom, Alexander Hogenboom, Flavius Frasincar, Uzay Kaymak, Otto van der Meer, Kim Schouten, and Damir Vandic
Erasmus University Rotterdam, PO Box 1738, NL-3000 DR Rotterdam, The Netherlands
{fhogenboom,hogenboom,frasincar,kaymak}@ese.eur.nl, {276933rm,288054ks,305415dv}@student.eur.nl
Abstract. Nowadays, emerging news on economic events such as acquisitions has a substantial impact on the financial markets. Therefore, it is important to be able to automatically and accurately identify events in news items in a timely manner. For this, one has to be able to process a large amount of heterogeneous sources of unstructured data in order to extract knowledge useful for guiding decision making processes. We propose a Semantics-based Pipeline for Economic Event Detection (SPEED), aiming to extract financial events from emerging news and to annotate these with meta-data, while retaining a speed that is high enough to make real-time use possible. In our implementation of the SPEED pipeline, we reuse some of the components of an existing framework and develop new ones, e.g., a high-performance Ontology Gazetteer and a Word Sense Disambiguator. Initial results suggest good performance on emerging news.
1 Introduction
In today’s information-driven society, machines that can process natural language can be of great importance. Decision makers are expected to be able to extract information from an ever-increasing amount of data such as emerging news, and subsequently to be able to acquire knowledge by applying reasoning to the gathered information. In today’s global economy, it is of utmost importance to have a complete overview of the business environment to enable effective, well-informed decision making. Financial decision makers thus need to be aware of events on their financial market, which is often extremely sensitive to economic events like stock splits and dividend announcements. Proper and timely event identification can aid decision making processes, as these events provide means of structuring information using concepts, with which knowledge can be generated by applying inference. Hence, automating information extraction and knowledge acquisition processes can be a valuable contribution.
This paper proposes a fully automated framework for processing financial news messages gathered from RSS feeds. Events detected in these messages are represented in a machine-understandable way. Extracted events can be made accessible for other applications as well through the use of Semantic Web technologies. Furthermore, the framework aims to handle news messages at a speed suitable for real-time use, as new events can occur at any time and require decision makers to respond in a timely and adequate manner.
Our proposed framework (pipeline) identifies the concepts related to economic events, which are defined in a domain ontology and are associated with synsets from a semantic lexicon (e.g., WordNet [1]). For concept identification, lexico-semantic patterns based on concepts from the ontology are employed in order to match lexical representations of concepts retrieved from the text with event-related concepts that are available in the semantic lexicon, and thus aim to maximize recall. The identified lexical representations of relevant concepts are subject to a Word Sense Disambiguation (WSD) procedure for determining the corresponding sense, in order to maximize precision. In order for our pipeline to be real-time applicable, we also aim to minimize the latency, i.e., the time it takes for a new news message to be processed by the pipeline.
The remainder of this paper is structured as follows. Firstly, Sect. 2 discusses related work. Subsequently, Sects. 3 and 4 elaborate on the proposed framework and its implementation, respectively. Finally, Sect. 5 wraps up this paper.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 452–457, 2010. © Springer-Verlag Berlin Heidelberg 2010
2 Related Work
This section discusses tools that can be used for Information Extraction (IE) purposes. Firstly, Sect. 2.1 discusses the ANNIE pipeline and subsequently, Sects. 2.2 and 2.3 elaborate on the CAFETIERE and KIM frameworks, respectively. Finally, Sect. 2.4 wraps up this section.
2.1 The ANNIE Pipeline
The General Architecture for Text Engineering (GATE) [2] is a freely available general purpose framework for IE tasks, which provides the possibility to construct processing pipelines from components that perform specific tasks, e.g., linguistic, syntactic, and semantic analysis tasks. By default, GATE loads the A Nearly-New Information Extraction (ANNIE) system, which consists of several key components, i.e., the English Tokenizer, Sentence Splitter, Part-Of-Speech (POS) Tagger, Gazetteer, Named Entity (NE) Transducer, and OrthoMatcher. Although the ANNIE pipeline has proven to be useful in various information extraction jobs, its functionality does not suffice when applied to discovering economic events in news messages. An important missing component is one that can be employed for WSD, although some disambiguation can be done using JAPE rules in the NE Transducer. This is, however, a cumbersome and error-prone approach in which rules have to be created manually for each term. Furthermore, ANNIE lacks the ability to individually look up concepts from a large ontology within a limited amount of time. Despite its drawbacks, GATE is highly flexible and customizable, and therefore ANNIE’s components are either usable as they are, or can be extended or replaced in order to suit our needs.
2.2 The CAFETIERE Pipeline
An example of an adapted ANNIE pipeline is the Conceptual Annotations for Facts, Events, Terms, Individual Entities, and RElations (CAFETIERE) relation extraction pipeline [3], which consists of an ontology lookup process and a rule engine. Within CAFETIERE, the Common Annotation Scheme (CAS) DTD is applied, which allows for three layers of annotation, i.e., structural, lexical, and semantic annotation. CAFETIERE employs extraction rules defined at the lexico-semantic level, which are similar to JAPE rules. Nevertheless, the syntax is at a higher level than is the case with JAPE, resulting in rules that are easier to express but less flexible. As knowledge is stored in an ontology using Narrative Knowledge Representation Language (NKRL), Semantic Web ontologies are not employed. NKRL has no formal semantics and there is no reasoning support, which is desired when identifying, for instance, economic events. Furthermore, gazetteering is a slow process when going through large ontologies. Finally, the pipeline also lacks a WSD component.
2.3 The KIM Platform
The Knowledge and Information Management (KIM) platform [4] combines GATE components with semantic annotation techniques in order to provide an infrastructure for IE purposes. The framework focuses on automatic annotation of news articles, where entities, inter-entity relations, and attributes are discovered. For this, the authors employ a pre-populated OWL upper ontology. In the back-end, a semantically enabled GATE pipeline, which utilizes semantic gazetteers and pattern-matching grammars, is invoked for named entity recognition using the KIM ontology. Furthermore, GATE is used for managing the content and annotations within the back-end of KIM’s architecture. The middle layer of the KIM architecture provides services that can be used by the topmost layer, e.g., semantic repository navigation, semantic indexing and retrieval, etcetera. The front-end layer of KIM embodies front-end applications, such as the Annotation Server and the News Collector. KIM differs from our envisaged approach in that we aim for a financial event-focused information extraction pipeline, in contrast to KIM’s general purpose framework. Hence, we employ a domain-specific ontology rather than an upper ontology. Furthermore, we focus on event extraction from corpora, in contrast to mere (semantic) annotation. Finally, the authors do not mention the use of WSD, whereas we consider WSD to be an essential component in an IE pipeline.
2.4 Conclusions
The IE frameworks discussed in this section have their merits, yet each framework fails to fully address the issues we aim to alleviate. The frameworks incorporate semantics only to a limited extent, e.g., they make use of gazetteers or knowledge bases that are either not ontologies or ontologies that are not based on OWL. Being able to use a standard language such as OWL fosters application interoperability and the reuse of existing reasoning tools. Also, existing frameworks lack a feedback loop, i.e., there is no knowledge base updating. Furthermore, WSD appears not to be sufficiently tackled in most cases. Finally, most existing approaches focus on annotation, rather than event recognition. Therefore, we aim for a framework that combines the insights gained from the approaches discussed above and that targets financial event discovery in news articles.
3 Economic Event Detection Based on Semantics
The analysis presented in Sect. 2 demonstrates several approaches to automated information extraction from news messages, which are often applied for annotation purposes and are not semantics-driven. Because we hypothesize that domain-specific information captured in semantics facilitates detection of relevant concepts, we propose a Semantics-Based Pipeline for Economic Event Detection (SPEED). The framework is modeled as a pipeline and is driven by a financial ontology developed by domain experts, containing information on the NASDAQ-100 companies that is extracted from Yahoo! Finance. Many concepts in this ontology stem from a semantic lexicon (e.g., WordNet), but another significant part of the ontology consists of concepts representing named entities (i.e., proper names). Figure 1 depicts the architecture of the pipeline. In order to identify relevant concepts and their relations, the English Tokenizer is employed, which splits text into tokens (which can be for instance words or numbers) and subsequently applies linguistic rules in order to split or merge identified tokens. These tokens are linked to ontology concepts by means of the Ontology Gazetteer, in contrast to a regular gazetteer, which uses lists of words as input. Matching tokens in the text are annotated with a reference to their associated concepts defined in the ontology.
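The first two steps described above, tokenization followed by an ontology-driven gazetteer, can be sketched as follows. This is a minimal illustration and not the GATE-based implementation: the label table `ONTOLOGY_LABELS`, the concept identifiers, and the function names are hypothetical, and a simple regular expression stands in for the English Tokenizer.

```python
import re

# Hypothetical mini-ontology: lexical representation -> concept identifier.
ONTOLOGY_LABELS = {
    "google": "ont:Google",
    "acquisition": "ont:Acquisition",
    "dividend": "ont:Dividend",
}


def tokenize(text):
    """Split text into word/number tokens with their character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def annotate(text, labels=ONTOLOGY_LABELS):
    """Annotate tokens that match an ontology label with their concept."""
    annotations = []
    for token, start, end in tokenize(text):
        concept = labels.get(token.lower())  # dictionary look-up per token
        if concept is not None:
            annotations.append({"start": start, "end": end, "concept": concept})
    return annotations


for annotation in annotate("Google announced the acquisition"):
    print(annotation)
```

Indexing the labels in a hash table keeps each look-up independent of the ontology size, which is one plausible way to keep gazetteering fast on large ontologies.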
Fig. 1. SPEED design (pipeline components: News, English Tokenizer, Ontology Gazetteer, Sentence Splitter, Part-Of-Speech Tagger, Morphological Analyzer, Word Group Look-Up, Word Sense Disambiguator, Event Phrase Gazetteer, Event Pattern Recognition, Ontology Instantiator; resources: Ontology, Semantic Lexicon; arrows denote information flow and used-by relationships)
Subsequently, the Sentence Splitter groups the tokens in the text into sentences, based on tokens indicating a separation between sentences. These sentences are used for discovering the grammatical structures in a corpus by determining the type of each word token by means of the Part-Of-Speech Tagger. As words can have many forms that have a similar meaning, the Morphological Analyzer subsequently reduces the tagged words to their lemma as well as a suffix and/or affix. A word can have multiple meanings and a meaning can be represented by multiple words. Hence, the framework needs to tackle WSD tasks, given POS tags, lemmas, etcetera. To this end, first of all, the Word Group Look-Up component combines words into maximal word groups, i.e., it aims for as many words per group as possible for representing some concept in a semantic lexicon (such as WordNet). It is important to keep groupings of words in mind, as combinations of words may have very specific meanings compared to the individual words. Subsequently, the Word Sense Disambiguator determines the word sense of each word group by exploring the mutual relations between senses of word groups using graphs. The senses are determined based on the number and type of detected semantic interconnections in a labeled directed graph representation of all senses of the considered word groups [5]. After disambiguating word group senses, the text can be interpreted by introducing semantics, which links word groups to an ontology, thus capturing their essence in a meaningful and machine-understandable way. Therefore, the Event Phrase Gazetteer scans the text for specific (financial) events, by utilizing a list of phrases or concepts that are likely to represent some part of a relevant event. Events thus identified are then supplied with available additional information by the Event Pattern Recognition component, which matches events to lexico-semantic patterns that are subsequently used for extracting additional information. Finally, the knowledge base is updated by inserting the identified events and their extracted associated information into the ontology by means of the Ontology Instantiator.
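The idea of maximal word groups can be sketched with a greedy, left-to-right longest-match strategy over lemmas. The matching strategy, the toy `LEXICON`, and the function name `group_words` are assumptions for illustration; the text only states that groups should contain as many words as possible while still denoting a concept in the semantic lexicon.

```python
# Hypothetical lexicon entries; multi-word entries are tuples of lemmas.
LEXICON = {("stock",), ("split",), ("stock", "split"), ("announce",), ("company",)}
MAX_GROUP_LENGTH = max(len(entry) for entry in LEXICON)


def group_words(lemmas, lexicon=LEXICON, max_len=MAX_GROUP_LENGTH):
    """Greedily combine lemmas into the longest groups found in the lexicon."""
    groups = []
    i = 0
    while i < len(lemmas):
        # Try the longest candidate first, then shrink down to a single word.
        for size in range(min(max_len, len(lemmas) - i), 0, -1):
            candidate = tuple(lemmas[i:i + size])
            # Single words are kept even if unknown, so the loop always advances.
            if candidate in lexicon or size == 1:
                groups.append(candidate)
                i += size
                break
    return groups


print(group_words(["company", "announce", "stock", "split"]))
```

Here "stock split" is kept as one group rather than two separate words, which matters because the combination has a more specific meaning than its parts.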
4 SPEED Implementation
The analysis presented in Sect. 2 exhibits the potential of a general architecture for text engineering: GATE. The modularity of such an architecture can be of use in the implementation of our semantics-based pipeline for economic event detection, as proposed in Sect. 3. Therefore, we made a Java-based implementation of the proposed framework by using default GATE components, such as the English Tokenizer, Sentence Splitter, Part-Of-Speech Tagger, and the Morphological Analyzer, which generally suit our needs. Furthermore, we extended the functionality of other GATE components (e.g., ontology gazetteering), and also implemented additional components to tackle the disambiguation process. Initial results on a test corpus of 200 news messages fetched from the Yahoo! Business and Technology RSS feeds show fast gazetteering of about 1 second and a precision and recall for concept identification in news items of 86% and 81%, respectively, which is comparable with existing systems. Precision and recall of fully decorated events result in lower values of approximately 62% and 53%, as they rely on multiple concepts that have to be identified correctly.
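For reference, precision and recall as used above can be computed from the set of identified items and a gold-standard set. The sketch below uses made-up identifiers purely for illustration; it does not reproduce the reported figures, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(identified, relevant):
    """Return (precision, recall) for identified items against a gold standard."""
    identified, relevant = set(identified), set(relevant)
    true_positives = len(identified & relevant)
    precision = true_positives / len(identified) if identified else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative identifiers only, not data from the test corpus.
gold = {"c1", "c2", "c3", "c4", "c5"}
found = {"c1", "c2", "c3", "c9"}
precision, recall = precision_recall(found, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

An event that requires several concepts is only counted as correct when all of them are identified, which is consistent with the lower event-level scores.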
5 Conclusions and Future Work
In this paper, we have proposed a semantics-based framework for economic event detection: SPEED. The framework aims to extract financial events from news articles (announced through RSS feeds) and to annotate these with meta-data, while maintaining a speed that is high enough to enable real-time use. We discussed the main components of the framework, which introduce some novelties, as they are semantically enabled, i.e., they make use of semantic lexicons and ontologies. Furthermore, pipeline outputs also make use of semantics, which introduces a potential feedback loop, making event identification a more adaptive process. Finally, we briefly touched upon the implementation of the framework and initial test results on the basis of emerging news. The established fast processing time and high precision and recall provide a good basis for future work. The merit of our pipeline is in the use of semantics, enabling broader application interoperability. For future work, we aim not only to perform thorough testing and evaluation, but also to implement the proposed feedback of newly obtained knowledge (derived from identified events) to the knowledge base. Also, it would be worthwhile to investigate further possibilities for implementation in algorithmic trading environments, as well as a principled way of linking sentiment to discovered events, in order to assign more meaning to these events.
Acknowledgments. The authors are partially sponsored by the NWO EW Free Competition project FERNAT: Financial Events Recognition in News for Algorithmic Trading.
References
1. Fellbaum, C.: WordNet: An Electronic Lexical Database. Computational Linguistics 25(2), 292–296 (1998)
2. Cunningham, H.: GATE, a General Architecture for Text Engineering. Computers and the Humanities 36(2), 223–254 (2002)
3. Black, W.J., McNaught, J., Vasilakopoulos, A., Zervanou, K., Theodoulidis, B., Rinaldi, F.: CAFETIERE: Conceptual Annotations for Facts, Events, Terms, Individual Entities, and RElations. Technical Report TR–U4.3.1, Department of Computation, UMIST, Manchester (2005)
4. Popov, B., Kiryakov, A., Ognyanoff, D., Manov, D., Kirilov, A.: KIM - A Semantic Platform For Information Extraction and Retrieval. Journal of Natural Language Engineering 10(3-4), 375–392 (2004)
5. Navigli, R., Velardi, P.: Structural Semantic Interconnections: A Knowledge-Based Approach to Word Sense Disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(7), 1075–1086 (2005)
Prediction of Business Process Model Quality Based on Structural Metrics
Laura Sánchez-González¹, Félix García¹, Jan Mendling², Francisco Ruiz¹, and Mario Piattini¹
¹ Alarcos Research Group, TSI Department, University of Castilla La Mancha, Paseo de la Universidad, nº4, 13071, Ciudad Real, España {Laura.Sanchez,Felix.Garcia,Francisco.RuizG,Mario.Piattini}@uclm.es
² Humboldt-Universität zu Berlin, Unter den Linden 6, D-10099 Berlin, Germany [email protected]
Abstract. The quality of business process models is an increasing concern as enterprise-wide modelling initiatives have to rely heavily on non-expert modellers. Quality in this context can be directly related to the actual usage of these process models, in particular to their understandability and modifiability. Since these attributes of a model can only be assessed a posteriori, it is of central importance for quality management to identify significant predictors for them. A variety of structural metrics have recently been proposed, which are tailored to approximate these usage characteristics. In this paper, we address a gap in terms of validation for metrics regarding understandability and modifiability. Our results demonstrate the predictive power of these metrics. These findings have strong implications for the design of modelling guidelines. Keywords: Business process, measurement, correlation analysis, regression analysis, BPMN.
1 Introduction
Business process models are increasingly used as an aid in various management initiatives, most notably in the documentation of business operations. Such initiatives have grown to an enterprise-wide scale, resulting in several thousand models and involving a significant number of non-expert modellers [1]. This setting creates considerable challenges for the maintenance of these process models, particularly in terms of adequate quality assurance. In this context, quality can be understood as “the totality of features and characteristics of a conceptual model that bear on its ability to satisfy stated or implied needs” [2]. It is well known that poor quality of conceptual models can increase development efforts or result in a software system that does not satisfy user needs [3]. It is therefore vitally important to understand the factors of process model quality and to identify guidelines and mechanisms to guarantee a high level of quality from the outset.
An important step towards improved quality assurance is a precise quantification of quality. Recent research into process model metrics pursues this line of argument by measuring the characteristics of process models. The significance of these metrics relies on a thorough empirical validation of their connection with quality attributes [4]. The most prominent of these attributes are understandability and modifiability, which belong to the more general concepts of usability and maintainability, respectively [5]. While some research provides evidence for the validity of certain metrics as predictors of understandability, there is, to date, no insight available into the connection between structural process model metrics and modifiability. This observation is in line with a recent systematic literature review that identifies a validation gap in this research area [6].
In accordance with the previously identified issues, the purpose of this paper is to contribute to the maturity of measuring business process models. The aim of the empirical research presented herein is to discover the connections between an extensive set of metrics and the ease with which business process models can be understood (understandability) and modified (modifiability). This was achieved by adapting the measures defined in [7] to BPMN business process models [8]. The empirical data of six experiments which had been defined for previous works were used. A correlation analysis and a regression estimation were applied in order to test the connection between the metrics and both the understandability and modifiability of the models.
The remainder of the paper is as follows. In Section 2 we describe the theoretical background of our research and the set of metrics considered. Section 3 describes the series of experiments that were used. Sections 4 and 5 present the results. Finally, Section 6 draws conclusions and presents topics for future research.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 458–463, 2010. © Springer-Verlag Berlin Heidelberg 2010
2 Structural Metrics for Process Models
In this paper we consider a set of metrics defined in [6] for a series of experiments on process model understanding and modifiability. The hypothetical correlation with understandability and modifiability is annotated in brackets as (+) for positive correlation or (-) for negative correlation. The metrics include:
• Number of nodes (-): number of activities and routing elements in a model;
• Diameter (-): the length of the longest path from a start node to an end node;
• Density (-): ratio of the total number of arcs to the maximum number of arcs;
• Coefficient of Connectivity (-): ratio of the total number of arcs in a process model to its total number of nodes;
• Average Gateway Degree (-): the average number of both incoming and outgoing arcs of the gateway nodes in the process model;
• Maximum Gateway Degree (-): the maximum sum of incoming and outgoing arcs of these gateway nodes;
• Separability (+): the ratio of the number of cut-vertices to the total number of nodes in the process model;
• Sequentiality (+): degree to which the model is constructed out of pure sequences of tasks;
• Depth (-): maximum nesting of structured blocks in a process model;
• Gateway Mismatch (-): the sum of gateway pairs that do not match with each other, e.g., when an AND-split is followed by an OR-join;
• Gateway Heterogeneity (-): the extent to which different types of gateways are used in a model;
• Cyclicity (-): the ratio of the number of nodes in a cycle to the sum of all nodes;
• Concurrency (-): the maximum number of paths in a process model that may be concurrently active due to AND-splits and OR-splits.
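Several of the metrics listed above follow directly from the node and arc sets of a model. The sketch below computes five of them for a small hypothetical model with one XOR split and one XOR join. It assumes a directed graph without parallel arcs, so the maximum number of arcs is taken as n(n-1); the function name `structural_metrics` and the example model are illustrative only.

```python
def structural_metrics(nodes, arcs, gateways):
    """Compute simple size and connection metrics of a process model graph."""
    n, a = len(nodes), len(arcs)
    degree = {g: 0 for g in gateways}
    for source, target in arcs:
        if source in degree:
            degree[source] += 1   # outgoing arc of a gateway
        if target in degree:
            degree[target] += 1   # incoming arc of a gateway
    return {
        "number_of_nodes": n,
        # Assumed maximum of n * (n - 1) arcs in a directed graph.
        "density": a / (n * (n - 1)) if n > 1 else 0.0,
        "coefficient_of_connectivity": a / n if n else 0.0,
        "average_gateway_degree": sum(degree.values()) / len(degree) if degree else 0.0,
        "maximum_gateway_degree": max(degree.values(), default=0),
    }


# Hypothetical model: A -> split -> (B | C) -> join -> D
nodes = ["A", "split", "B", "C", "join", "D"]
arcs = [("A", "split"), ("split", "B"), ("split", "C"),
        ("B", "join"), ("C", "join"), ("join", "D")]
for name, value in structural_metrics(nodes, arcs, {"split", "join"}).items():
    print(name, value)
```

Metrics such as diameter, separability, and depth need path or block analysis and are left out of this sketch.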
3 Research Design
The empirical analysis performed is composed of six experiments: three to evaluate understandability and three to evaluate modifiability. The experimental material for the first three experiments consisted of 15 BPMN models with different structural complexity. Each model included a questionnaire related to its understandability. The experiments on modifiability included 12 BPMN models related to a particular modification task. A more detailed description of the family of experiments can be found in [9]. It was possible to collect the following objective data for each model and each task: time of understandability or modifiability for each subject, number of correct answers in understandability or modifiability, and efficiency, defined as the number of correct answers divided by time. Once the values had been obtained, the variability of the values was analyzed to ascertain whether the measures varied sufficiently to be considered in the study. Two measures were excluded, namely Cyclicity and Concurrency, because the results they offered had very little variability (80% of the models had the same value for both measures, and the mean value was near 0, as was their standard deviation). The remaining measures were included in the correlation analysis. The experimental data was accordingly used to test the following null hypotheses:
• For the experiments on understandability, H0,1: there is no correlation between structural metrics and understandability.
• For the experiments on modifiability, H0,2: there is no correlation between structural metrics and modifiability.
The following sections present the results of the correlation and regression analyses of the empirical data.
4 Correlation Analysis
Understandability: Understanding time is strongly correlated with number of nodes, diameter, density, average gateway degree, depth, gateway mismatch, and gateway heterogeneity in all three experiments. There is no significant correlation with the connectivity coefficient, and the separability ratio was correlated only in the first experiment. With regard to correct answers, the size measures number of nodes (-.704, p-value .003) and diameter (-.699, .004), as well as gateway heterogeneity (.620, .014), show a significant and strong correlation. With regard to efficiency, we obtained evidence of correlation for all the measures except separability.
The correlation analysis results indicate that there is a significant relationship between structural metrics and the time and efficiency of understandability. The results for correct answers are not as conclusive, since only 3 of the 11 analyzed measures are correlated. We have therefore found evidence to reject the null hypothesis H0,1. The alternative hypothesis suggests that these BPMN elements affect the level of understandability of conceptual models in the following way. It is more difficult to understand models if:
• There are more nodes.
• The path from a start node to the end is longer.
• There are more nodes connected to decision nodes.
• There is higher gateway heterogeneity.
Modifiability: We observed a strong correlation between structural metrics and time and efficiency. For correct answers there is no significant correlation in general. There are significant results for diameter, but these are not conclusive, since the correlation is positive in one case and negative in another. For efficiency we find significant correlations with average (.745, .005) and maximum gateway degree (.763, .004), depth (-.751, .005), gateway mismatch (-.812, .001) and gateway heterogeneity (.853, .000). We have therefore found some evidence to reject the null hypothesis H0,2. The usage of decision nodes in conceptual models apparently implies a significant reduction in efficiency in modifiability tasks. In short, it is more difficult to modify the model if:
• More nodes are connected to decision nodes.
• There is higher gateway heterogeneity.
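Each coefficient reported in this section relates one structural metric to one performance measure across the models. The text does not state which correlation coefficient was used, so the sketch below assumes Pearson's r purely for illustration; the data values are hypothetical and are not taken from the experiments.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical data: one value per process model.
number_of_nodes = [10, 20, 30, 40, 50]
correct_answers = [5, 4, 4, 2, 1]

r = pearson_r(number_of_nodes, correct_answers)
# A negative r indicates that larger models go with fewer correct answers.
```

A significance test (the p-values quoted in the text) would be computed on top of the coefficient and is not shown here.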
5 Regression Analysis
The previous correlation analysis suggests that it is necessary to investigate the quantitative impact of structural metrics on the time, accuracy and efficiency dependent variables of both understandability and modifiability. This goal was achieved through the statistical estimation of a linear regression. The regression equations were obtained by performing a regression analysis with 80% of the experimental data. The remaining 20% were used for the validation of the regression models. The first step was to select the prediction models with p-values below 0.05. The selected models were then validated by verifying the distribution and independence of residuals through Kolmogorov-Smirnov and Durbin-Watson tests. Both test values were considered to be satisfactory. The accuracy of the models was studied by using the Mean Magnitude of Relative Error (MMRE) [10] and the prediction levels Pred(25) and Pred(30) on the remaining 20% of the data, which were not used in the estimation of the regression equation. These levels indicate the percentage of model estimations that do not differ from the observed data by more than 25% and 30%, respectively. A model can therefore be considered to be accurate when it satisfies any of the following conditions: a) MMRE ≤ 0.25, b) Pred(0.25) ≥ 0.75, or c) Pred(0.30) ≥ 0.70. Table 3 depicts the results.
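The accuracy criteria above can be made concrete with a short sketch. It assumes the usual definitions: the magnitude of relative error of one estimate is |observed - estimated| / observed, MMRE is the mean of these errors, and Pred(l) is the fraction of estimates whose error does not exceed l. The function names and the sample values are hypothetical.

```python
def relative_errors(observed, estimated):
    """Magnitude of relative error for each observed/estimated pair."""
    return [abs(o - e) / abs(o) for o, e in zip(observed, estimated)]


def mmre(observed, estimated):
    """Mean Magnitude of Relative Error."""
    errors = relative_errors(observed, estimated)
    return sum(errors) / len(errors)


def pred(observed, estimated, level):
    """Fraction of estimates whose relative error is at most `level`."""
    errors = relative_errors(observed, estimated)
    return sum(1 for err in errors if err <= level) / len(errors)


def is_accurate(observed, estimated):
    """Any one of the three conditions stated in the text is sufficient."""
    return (
        mmre(observed, estimated) <= 0.25
        or pred(observed, estimated, 0.25) >= 0.75
        or pred(observed, estimated, 0.30) >= 0.70
    )


# Hypothetical validation data (the held-out 20% in the text).
observed = [100.0, 80.0, 120.0, 60.0]
estimated = [110.0, 70.0, 96.0, 63.0]

model_mmre = mmre(observed, estimated)
model_pred25 = pred(observed, estimated, 0.25)
```

Because the three conditions are joined by "or", a model with a poor MMRE can still count as accurate if enough individual estimates are close to the observed values.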
Table 1. Prediction models of understandability
                    Exp  MMRE  p(0.25)  p(0.30)  Prediction model
Understandability
  Time              E3   .32   .51      .58      47.04 + 2.46 nº nodes
  Correct answers   E2   .18   .79      .79      3.17 - 0.005 nº nodes - 0.38 coeff. of connectivity + 0.17 depth - 0.015 gateway mismatch
  Efficiency        E3   .84   .22      .25      0.042 - 0.0005 nº nodes + 0.026 sequentiality
Modifiability
  Time              E4   .37   .31      .38      50.08 + 3.77 gateway mismatch + 422.95 density
  Correct answers   E4   .23   .82      .83      1.85 - 3.569 density
  Efficiency        E4   .62   .32      .42      0.006 + 0.008 sequentiality
Understandability: The best model for predicting understandability time is obtained in E3, which has the lowest MMRE value of all the models. The best model with which to predict correct understandability answers originates from E2, and it also satisfies all the assumptions. For efficiency, no model was found that satisfied all the assumptions. The model with the lowest MMRE value is obtained in E3. In general, the results further support the rejection of the null hypothesis H0,1.
Modifiability: We did not obtain any models which satisfy all of the assumptions for the prediction of modifiability time, but we have highlighted the prediction model obtained in E4 since it has the best values. However, the model to predict the number of correct answers may be considered to be a precise model, as it satisfies all the assumptions. The best results for predicting efficiency of modifiability are also provided by E4, with the lowest MMRE value. In general, we find some further support for rejecting the null hypothesis H0,2. The best indicators for modifiability are gateway mismatch, density and sequentiality ratio. Two of these metrics are related to decision nodes. Decision nodes apparently have a negative effect on time and the number of correct answers in modifiability tasks.
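As an illustration of how such regression equations are applied, the sketch below evaluates the two time models from the table of prediction models. The coefficients are copied from that table; the function names, the input values and the assignment of each equation to the "Time" rows are assumptions made for this example, and the unit of time is not stated in the text.

```python
def predicted_understandability_time(number_of_nodes):
    """Time model reported for E3: 47.04 + 2.46 * number of nodes."""
    return 47.04 + 2.46 * number_of_nodes


def predicted_modifiability_time(gateway_mismatch, density):
    """Time model reported for E4: 50.08 + 3.77 * mismatch + 422.95 * density."""
    return 50.08 + 3.77 * gateway_mismatch + 422.95 * density


# Hypothetical process model with 30 nodes, gateway mismatch 2, density 0.05.
understanding_time = predicted_understandability_time(30)
modification_time = predicted_modifiability_time(2, 0.05)
```

Under these equations every additional node adds 2.46 time units to the predicted understanding time, while density dominates the predicted modification time because of its large coefficient.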
6 Conclusions and Future Work
In this paper we have investigated structural metrics and their connection with the quality of business process models, namely understandability and modifiability. The statistical analyses suggest rejecting the null hypotheses, since the structural metrics appear to be closely connected with understandability and modifiability. For understandability these include Number of Nodes, Gateway Mismatch, Depth, Coefficient of Connectivity and Sequentiality. For modifiability, Gateway Mismatch, Density and Sequentiality showed the best results. The regression analysis also provides some hints with regard to the interplay of different metrics. Some metrics are therefore not investigated in greater depth, owing to their correlations with other metrics.
Our findings demonstrate the potential of these metrics to serve as validated predictors of process model quality. Some limitations of the experimental data concern the nature of the subjects, which implies that the results are particularly relevant to non-expert modellers. This research contributes to the area of process model measurement and its still limited degree of empirical validation. This work has implications both for research and practice. The strength of the correlation of structural metrics with different quality aspects (up to 0.85 for gateway heterogeneity with modifiability) clearly shows the potential of these metrics to accurately capture aspects that are closely connected with actual usage. From a practical perspective, these structural metrics can provide valuable guidance for the design of process models, in particular for selecting semantically equivalent alternatives that differ structurally. In future research we aim to contribute to the further validation and actual applicability of process model metrics.
Acknowledgments. This work was partially funded by projects INGENIO (PAC 080154-9262), ALTAMIRA (PII2I09-0106-2463), ESFINGE (TIN2006-15175-C05-05) and PEGASO/MAGO (TIN2009-13718-C02-01).
References
1. Rosemann, M.: Potential pitfalls of process modeling: part A. Business Process Management Journal 12(2), 249–254 (2006)
2. ISO/IEC: ISO Standard 9000-2000: Quality Management Systems: Fundamentals and Vocabulary (2000)
3. Moody, D.: Theoretical and practical issues in evaluating the quality of conceptual models: current state and future directions. Data and Knowledge Engineering 55, 243–276 (2005)
4. Zelkowitz, M., Wallace, D.: Experimental models for validating technology. IEEE Computer, Computing practices (1998)
5. ISO/IEC: 9126-1, Software engineering - product quality - Part 1: Quality Model (2001)
6. Sánchez, L., García, F., Ruiz, F., Piattini, M.: Measurement in Business Processes: a Systematic Review. Business Process Management Journal 16(1), 114–134 (2010)
7. Mendling, J.: Metrics for Process Models: Empirical Foundations of Verification, Error Prediction, and Guidelines for Correctness. Springer Publishing Company, Incorporated, Heidelberg (2008)
8. OMG: Business Process Modeling Notation (BPMN), Final Adopted Specification (2006), http://www.omg.org/bpm
9. ExperimentsURL (2009), http://alarcos.inf-cr.uclm.es/bpmnexperiments/
10. Foss, T., Stensrud, E., Kitchenham, B., Myrtveit, I.: A Simulation Study of the Model Evaluation Criterion MMRE. IEEE Transactions on Software Engineering 29, 985–995 (2003)
Modelling Functional Requirements in Spatial Design
Mehul Bhatt, Joana Hois, Oliver Kutz, and Frank Dylla
SFB/TR 8 Spatial Cognition, University of Bremen, Germany
Abstract. We demonstrate the manner in which high-level design requirements, e.g., as they correspond to the commonsensical conceptualisation of expert designers, may be formally specified within practical information systems, wherein heterogeneous perspectives and conceptual commitments are needed. Focussing on semantics, modularity and consistency, we argue that our formalisation serves as a synergistic interface that mediates between the two disconnected domains of human abstracted qualitative/conceptual knowledge and its quantitative/precision-oriented counterpart within systems for spatial design (assistance). Our demonstration utilises simple, yet real world examples.
1 Conceptual Modelling for Spatial Design
This paper investigates the role of ontological formalisation as a basis for modelling high-level conceptual requirement constraints within spatial design. We demonstrate the manner in which high-level functional requirements, e.g., as they correspond to the commonsensical conceptualisation of expert designers, may be formally specified within practical information systems. Here, heterogeneous perspectives and conceptual commitments are needed for capturing the complex semantics of spatial designs and artefacts contained therein. A key aspect of our modelling approach is the use of formal qualitative spatial calculi and conceptual design requirements as a link between the structural form of a design and the differing functional capabilities that it affords or leads to. In this paper, we focus on the representational modalities that pertain to ontological modelling of structural forms from different perspectives: human / designer conceptualisations and qualitative spatial abstractions suited to spatial reasoning, and geometric primitives as they are applicable to practical information systems for computer-aided design (CAD) in general, and computer-aided architecture design (CAAD) in particular. Our modelling is focussed on semantics, modularity and functional requirement consistency, as elaborated on in the following:
⊲ Semantics. The expert’s design conceptualisation is semantic and qualitative in nature: it involves abstract categories such as Rooms, Doors, Motion Sensors and the spatial (topological, directional, etc.) relationships among them, e.g., ‘Room A and Room B have a Door in Between, which is monitored by Camera C’. Professional design tools lack the ability to exploit the design expertise that a designer is equipped with but is unable to communicate to the design tool explicitly in a manner consistent with its inherent human-centred conceptualisation, i.e., semantically and qualitatively.
⊲ Modular and Multi-dimensional Representation. An abstraction such as a Room or Sensor may be identified semantically by its placement within an ontological hierarchy and its relationships with other conceptual categories. This is what a designer must deal with during the initial design conceptualisation phase. However, when these notions are transferred to a CAD design tool, the same concepts acquire a new perspective, i.e., now the designer must deal with points, line-segments, polygons and other geometric primitives. Within contemporary design tools, there is no way for a knowledge-based system to make inferences about the conceptual design and its geometric interpretation within a CAD model in a unified manner.
⊲ Functional Requirements. A crucial aspect that is missing in contemporary design tools is the support to explicitly characterise the functional requirements of a design. For instance, it is not possible to model spatial artefacts such as the range space of a sensory device (e.g., camera, motion sensor), which is not strictly a spatial entity in the sense of having a material existence, but needs to be treated as such nevertheless. Consider the following constraint: ‘the motion-sensor should be placed such that the door connecting room A and room B is always within the sensor’s range space’. The capability to model such a constraint is absent from even the most state-of-the-art design tools.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 464–470, 2010. © Springer-Verlag Berlin Heidelberg 2010
Organisation of paper. We present an overview of our conceptual modelling approach and the manner in which our formalisation serves as a synergistic interface that mediates between the two disconnected domains of human abstracted qualitative/conceptual knowledge and its quantitative/precision-oriented counterpart within practical information systems.
Section 2 presents the concept of ontological modularity and its use for modelling multi-perspective design requirements using a spatial ontology. Section 3 details some key aspects of our spatial ontology and requirements modelled therein. Finally, Section 4 concludes.
2 Multi-perspective Representation and Modularity
Modularity has become one of the key issues in ontology engineering, covering a wide spectrum of aspects (see [11]). The main research question is how to define the notion of a module and how to re-use such modules.
2.1 Ontological Modules
The architectural design process defines constraints of architectural entities that are primarily given by spatial types of information. Space is particularly defined from a conceptual, qualitative, and quantitative perspective. The three ontological modules are briefly discussed in the following:
[Figure labels: Task-Specific Requirements (building automation, access restriction, ...); Integrated Representation and Axioms based on E-Connections; M1 - Conceptual Module (DOLCE, Physical Object); M2 - Qualitative Module (RCC-8 Relations, Building Architecture); M3 - Quantitative Module (Building Construction, IFC data model)]
Fig. 1. Multi-Dimensional Representation
M1 – Conceptual Module. This ontological module reflects the most general, or abstract, terminological information concerning architectural entities: they are conceptualised according to their essential properties, i.e., without taking into account the possible contexts into which they might be put. The ontology Physical Object categorises these entities with respect to their attributes, dependencies, functional characteristics, etc. It is based on DOLCE [9].
M2 – Qualitative Module. This module reflects qualitative spatial information of architectural entities. It specifies the architectural entities based on their region-related spatial characteristics.¹ In particular, the ontology uses relations as provided by the RCC-8 fragment of the Region Connection Calculus (RCC) [10]. Here, we reuse an RCC ontology that has been introduced in [6], which defines the taxonomy for RCC-8 relations.
¹ We concentrate on region-based spatial relations, as they are most suitable for our architectural design examples. However, other spatial relations (e.g., for distances, shapes, or orientations) may be applied as well.
M3 – Quantitative Module. This ontological module reflects metrical and geometric information of architectural entities, i.e., their polygon-based characteristics in the floor plan. It is closely related to an industrial standard for data representation and interchange in the architectural domain, namely the Industry Foundation Classes (IFC) [5]. This quantitative module specifies those entities of the architectural domain that are necessary to describe structural aspects of environments. In particular, information that is available from construction plans of buildings is described here.
2.2 E-Connecting Multiple Perspectives
The main aspects of modularity are: syntactic and logical heterogeneity, notions of module, distributed semantics and modular reasoning. Here, we restrict ourselves to describing our use of E-connections for multi-perspective modelling of spatial or architectural design.
In order to model spatial design scenarios, we need to be able to cover rather disparate aspects of ‘objects’ on various conceptual (and spatial) levels. E-connections allow a formal representation of different views on the same domain together with a loose coupling of such views by means of axiomatic constraints employing so-called ‘link-relations’ to formally realise the coupling. Specifically, in E-connections, an ‘abstract’ object o of some description logic (DL) can, e.g., be related via a relation E to its spatial extension in a logic such as RCC-8 (i.e. a (regular-closed) set of points in a topological space), by a relation T to its life-span in a temporal logic (i.e. an interval of time points), or by a relation S to another conceptual view (i.e. the concept of all rooms object o may be found in). Essentially, the language of an E-connection is the (disjoint) union of the original languages enriched with operators capable of talking about the link relations (see [8] for technical details).
The connection of the three modules (M1–M3) is formalised by axiomatising the link relations used. The Integrated Representation defines couplings between classes from different modules. An overall integration of these thematic modules is achieved by E-connecting the aligned vocabulary along newly introduced link relations and appropriate linking axioms. Based on this Integrated Representation, the module for task-specific requirements adds definitions and constraints to the architectural information available in the modules (M1–M3). It formulates requirements that describe certain functions that a specific design, e.g. a concrete floor plan, has to satisfy. They can codify building regulations that a work-in-progress design generally must meet, as explained next.
[Figures 2 and 3: a floor plan with labelled elements (Room1–Room2, Wall1–Wall8, Win1–Win10, Door1–Door3, Col1–Col3, Sensor1–Sensor3) and schematic objects (Sensor, Wall Panel, Door, Window) annotated with regions labelled fs, fswatch, fstouch, ops and range]
Fig. 2. Concrete Interpretations in R²
Fig. 3. Spatial Artefact Groundings
3 Functional Requirement Constraints in Architecture
Semantic descriptions of designs and their requirements acquire real significance when the spatial and functional constraints are among strictly spatial entities as well as abstract spatial artefacts. This is because although spatial artefacts may not be physically extended within a design, they need to be treated in a real physical sense nevertheless. In general, architectural working designs only contain physical entities. Therefore, it becomes impossible for a designer to model constraints involving spatial artefacts at the design level. In [1], we identified three important types of spatial artefacts (this list is not assumed to be exhaustive):
A1. the operational space denotes the region of space that an object requires to perform its intrinsic function that characterises its utility or purpose
A2. the functional space of an object denotes the region of space within which an agent must be located to manipulate or physically interact with a given object
A3. the range space denotes the region of space that lies within the scope of a sensory device such as a motion or temperature sensor
[Figure: two door situations, (a) Door 1 and (b) Door 2, with the labels Door, Washsink and Phone]
Fig. 4. Killer Doors
[Figure: two stair-and-door layouts, (a) Consistent and (b) Inconsistent, with the labels Stairs and Door]
Fig. 5. Building code for doors and upward stairs: Landesbauordnung Bremen §35(10)
Fig. 2 provides a detailed view on the different kinds of spaces we introduced. From a geometrical viewpoint, all artefacts refer to a conceptualised and derived physical spatial extension in Rⁿ. The derivation of an interpretation may depend on an object’s inherent spatial characteristics (e.g., size and shape), as well as additional parameters referring to mobility, transparency, etc. We utilise the spatial artefacts introduced in (A1–A3) towards formulating functional requirement constraints for a work-in-progress spatial design. Constraints such as the following (C1–C2) may need to be satisfied by a design:
C1. Steps of a staircase may not be connected directly to a door that opens in the direction of the steps. There has to be a landing between the staircase steps and the door. The length of this landing has to be at least the door width. (“Bremen building code”/Landesbauordnung Bremen §35 (10))
C2. People should not be harmed by doors opening up. In general, the operation of a door should be non-interfering with the function / operation of surrounding objects.
Constraints such as (C1–C2) involve semantic characterisations and spatial relationships among strictly spatial entities as well as other spatial artefacts. In Fig. 5 we depict a consistent and an inconsistent design regarding requirement C1. This official regulation can be modelled in the integrated representation by using the link relations introduced in Section 2. The regulation is specified by the ontological constraint that no operational space of a door is allowed to overlap with the steps of a staircase:
    Class: m2:DoorOperationalSpace
        SubClassOf: m2:OperationalSpace,
            inv(ir:compose) exactly 1 m3:Door,
            not (rcc:overlaps some (inv(ir:compose) some m3:StaircaseSteps))
In this example, the different modules are closely connected with each other. In detail, categories in the qualitative module M2, namely DoorOperationalSpace, which is a subclass of OperationalSpace, are related to entities in the quantitative module M3, namely Door and StaircaseSteps, by the link relation given in the integrated representation module, namely compose.
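To give a geometric intuition for the constraint above, the following sketch checks it on the quantitative level. It is not the authors' implementation: it assumes that regions are simplified to axis-aligned rectangles in the floor plan and that "overlaps" means sharing interior points, so that regions which merely touch do not overlap. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in the floor plan (a simplifying assumption)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def interiors_overlap(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share interior points (touching is not enough)."""
    return (a.x_min < b.x_max and b.x_min < a.x_max
            and a.y_min < b.y_max and b.y_min < a.y_max)


def door_is_consistent(door_operational_space: Rect,
                       staircase_steps: Iterable[Rect]) -> bool:
    """The operational space of the door must not overlap any staircase steps."""
    return not any(interiors_overlap(door_operational_space, steps)
                   for steps in staircase_steps)


# Hypothetical layouts: the door swings through the unit square.
door_space = Rect(0.0, 0.0, 1.0, 1.0)
steps_behind_landing = [Rect(2.0, 0.0, 3.0, 1.0)]   # landing between door and steps
steps_at_the_door = [Rect(0.5, 0.0, 1.5, 1.0)]      # steps reach into the door swing
```

The check covers only the overlap condition expressed in the ontological constraint; the minimum landing length required by C1 would need an additional distance test.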
4 Conclusion
The work described in this paper is part of an initiative that aims at developing the representation and reasoning methodology [2] and practically usable tools [4] for intelligent assistance in spatial design tasks. We have provided an overview of the overall approach to encoding design semantics within an architectural assistance system. High-level conceptual modelling of requirements, and the need to incorporate modular specifications therein, were the main topics covered in the paper. Owing to space limitations, we could only provide a glimpse of the representation; details of the formal framework and ongoing work may be found in [3, 7].
Acknowledgements. We acknowledge the financial support of the DFG through the Collaborative Research Center SFB/TR 8 Spatial Cognition. Mehul Bhatt also acknowledges funding by the Alexander von Humboldt Stiftung, Germany. Participating projects in this paper include: [DesignSpace], I1-[OntoSpace] and R3-[Q-Shape]. We thank Graphisoft (http://www.graphisoft.com/) for providing licenses for the design tool ArchiCAD v13 2010.
References
[1] Bhatt, M., Dylla, F., Hois, J.: Spatio-terminological inference for the design of ambient environments. In: Hornsby, K.S., Claramunt, C., Denis, M., Ligozat, G. (eds.) COSIT 2009. LNCS, vol. 5756, pp. 371–391. Springer, Heidelberg (2009)
[2] Bhatt, M., Freksa, C.: Spatial computing for design: An artificial intelligence perspective. In: NSF International Workshop on Studying Visual and Spatial Reasoning for Design Creativity, SDC 2010 (to appear, 2010), http://www.cosy.informatik.uni-bremen.de/staff/bhatt/seer/Bhatt-Freksa-SDC-10.pdf
[3] Bhatt, M., Hois, J., Kutz, O.: Modelling Form and Function in Architectural Design. Submitted to a journal (2010), http://www.cosy.informatik.uni-bremen.de/staff/bhatt/seer/form-function.pdf
[4] Bhatt, M., Ichim, A., Flanagan, G.: DSim: A Tool for Assisted Spatial Design. In: Proceedings of the 4th International Conference on Design Computing and Cognition, DCC 2010 (2010)
[5] Froese, T., Fischer, M., Grobler, F., Ritzenthaler, J., Yu, K., Sutherland, S., Staub, S., Akinci, B., Akbas, R., Koo, B., Barron, A., Kunz, J.: Industry Foundation Classes for Project Management: A Trial Implementation. ITCon 4, 17–36 (1999), http://www.ifcwiki.org/
[6] Grütter, R., Scharrenbach, T., Bauer-Messmer, B.: Improving an RCC-Derived Geospatial Approximation by OWL Axioms. In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 293–306. Springer, Heidelberg (2008)
[7] Hois, J., Bhatt, M., Kutz, O.: Modular Ontologies for Architectural Design. In: Proc. of the 4th Workshop on Formal Ontologies Meet Industry, FOMI 2009, Vicenza, Italy. Frontiers in Artificial Intelligence and Applications, vol. 198, pp. 66–77. IOS Press, Amsterdam (2009)
[8] Kutz, O., Lutz, C., Wolter, F., Zakharyaschev, M.: E-Connections of Abstract Description Systems. Artificial Intelligence 156(1), 1–73 (2004)
[9] Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltramari, A.: WonderWeb Deliverable D18: Ontology Library. Technical report, ISTC-CNR (2003)
[10] Randell, D.A., Cui, Z., Cohn, A.G.: A spatial logic based on regions and connection. In: Proc. of KR 1992, pp. 165–176 (1992)
[11] Stuckenschmidt, H., Parent, C., Spaccapietra, S. (eds.): Modular Ontologies. LNCS, vol. 5445. Springer, Heidelberg (2009)
Business Processes Contextualisation via Context Analysis
Jose Luis de la Vara¹, Raian Ali², Fabiano Dalpiaz², Juan Sánchez¹, and Paolo Giorgini²
¹ Centro de Investigación en Métodos de Producción de Software, Universidad Politécnica de Valencia, Spain, {jdelavara,jsanchez}@pros.upv.es
² Department of Information Engineering and Computer Science, University of Trento, Italy, {raian.ali,fabiano.dalpiaz,paolo.giorgini}@disi.unitn.it
Abstract. Context-awareness has emerged as a new perspective for business process modelling. Even though some works have studied it, many challenges have not been addressed yet. There is a clear need for approaches that (i) facilitate the identification of the context properties that influence a business process and (ii) provide guidance for correct modelling of contextualised business processes. This paper addresses this need by defining an approach for business process contextualisation via context analysis, a technique that supports reasoning about context and discovery of its relevant properties. The approach facilitates adequate specification of context variants and of business process execution for them. As a result, we obtain business processes that fit their context and are correct.
Keywords: business process modelling, context-awareness, business process contextualisation, context analysis, correctness of business process models.
1 Introduction
Traditional approaches for business process modelling have not paid much attention to the dynamism of the environment of a business process. However, business processes are executed in an environment in which changes are usual, and modelling perspectives that aim to represent and understand them are necessary. Context-awareness has recently appeared as a new perspective for business process modelling to meet this need [3]. It is expected to improve business process modelling by explicitly addressing fitness between business processes and their context.
The context of a business process is the set of environmental properties that affect business process execution. Therefore, these properties should be taken into account when designing a business process. If context is analysed when modelling a business process, then identification of all its variants (relevant states of the world in which the business process is executed) and definition of how the business process should be executed in them are facilitated.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 471–476, 2010. © Springer-Verlag Berlin Heidelberg 2010
Some works have contributed to the advance of context-aware business process modelling by addressing issues such as context-aware workflows [4], general principles (e.g. [3]) and modelling of context effect (e.g. [2]). However, research on this topic is still at an initial stage and many challenges have not been addressed yet. This paper aims to advance research on context-aware business process modelling by dealing with two of these challenges: 1) provision of techniques for determining the relevant context properties that influence a business process, and 2) provision of mechanisms and guidance for correct business process contextualisation.
The objectives of the paper are to determine how business process context can be analysed, how it can influence business processes, how to create contextualised business process models, and how to guarantee their correctness. These objectives are achieved by defining an approach for business process contextualisation via context analysis [1], which is a technique that aims to support reasoning about context and discovery of contextual information to observe. Context analysis is adapted in the paper for analysis of business process context. The approach provides mechanisms and guidance that can help process designers to reason about business process context and to model business processes that fit their context and are correct. Context properties and variants are analysed in order to determine how they influence a business process, to guarantee that a business process is properly executed in all its context variants, and to correctly model contextualised business processes.
The next sections present the approach and our conclusions, respectively.
2 Approach Description
The approach consists of four stages (Fig. 1): modelling of the initial business process, analysis of business process context, analysis of context variants, and modelling of the contextualised business process. First, an initial version of the business process that needs to fit its context is modelled. Next, the remaining stages are carried out for as long as relevant context variations (changes) are found that are not represented in the business process model. Relevant context variations influence the business process and imply that business process execution has to change. If a context variation is found, then business process context is analysed to find the context properties that allow process participants to know whether a context variant holds. A context analysis model is created, and the context variants of the business process are then analysed. Finally, a contextualised business process model is created on the basis of the final context variants and their effect on the business process.
Fig. 1. Business process contextualisation
Fig. 2. Initial business process model
As a running example, product promotion in a department store is used (Fig. 2). The business process has been modelled with BPMN, and it does not reflect context variations such as the fact that customers do not like being addressed if they are in a hurry. The paper focuses on contextualisation of the task “Find potential buyer”.
2.1 Analysis of Business Process Context
Business process context is analysed in the second stage of the approach. This stage aims to understand context, to reason about it and to discover the context properties that influence a business process. For these purposes, context analysis (which has been presented in the requirements engineering field) has been adapted for analysis of business process context. Further details about context analysis can be found in [1].
Context is specified as a formula of world predicates, which can be combined conjunctively and disjunctively. World predicates can be facts (they can be verified by a process participant) or statements (they cannot be). The truth value of a statement can be assumed if there is enough evidence to support it. Such evidence comes from another formula of world predicates that holds. A context is judgeable if there exists a formula of facts that supports it, and thus implies it. Identifying a judgeable context can be considered the main purpose of this stage. The facts of the formula correspond to the context properties that characterise the context and its variants, and their truth values influence business process execution. A context analysis model (Fig. 3) is created to facilitate reasoning about business process context and discovery of the facts of the formula that implies it.
Fig. 3. Context analysis model
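The idea of a context supported by a formula of facts can be illustrated with a minimal Python sketch. It only evaluates a conjunctive/disjunctive formula of facts against the truth values that a process participant has verified. The class names, the function `holds` and the fact names are hypothetical and chosen for illustration; they are not taken from the paper or from [1].

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Fact:
    """A world predicate that a process participant can verify."""
    name: str


@dataclass(frozen=True)
class And:
    parts: Tuple["Formula", ...]


@dataclass(frozen=True)
class Or:
    parts: Tuple["Formula", ...]


Formula = Union[Fact, And, Or]


def holds(formula: Formula, observed: Dict[str, bool]) -> bool:
    """Evaluate a formula of facts; facts that were not observed count as false."""
    if isinstance(formula, Fact):
        return observed.get(formula.name, False)
    if isinstance(formula, And):
        return all(holds(part, observed) for part in formula.parts)
    if isinstance(formula, Or):
        return any(holds(part, observed) for part in formula.parts)
    raise TypeError(f"unsupported formula node: {formula!r}")


# Hypothetical formula of facts supporting the statement
# "the customer is not in a hurry" (fact names are invented).
SUPPORT = Or((
    And((Fact("walks_slowly"), Fact("looks_at_products"))),
    Fact("asks_for_help"),
))

print(holds(SUPPORT, {"walks_slowly": True, "looks_at_products": True}))
```

In this reading, a statement is judgeable when such a formula consisting only of facts exists for it; the sketch does not attempt to decide which evidence is sufficient, which remains a modelling decision.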
J.L. de la Vara et al.
2.2 Analysis of Context Variants
The main purposes of this stage are to adequately define the (final) context variants of a business process and to ensure that they allow correct business process contextualisation. A context variant corresponds to a set of facts whose conjunction implies a context. Fig. 4 shows the eleven initial context variants for C1, which is analysed in Fig. 3.
Initial context variants: {F1}; {F2}; {F3, F4, F5, F9}; {F3, F4, F5, F10}; {F3, F4, F5, F11}; {F3, F5, F6, F9}; {F3, F5, F6, F10}; {F3, F5, F6, F11}; {F3, F5, F7, F8, F9}; {F3, F5, F7, F8, F10}; {F3, F5, F7, F8, F11}
Final context variants:
CV1: {F1}
CV2: {F2}
CV3: {F3 → (F4, F5, F9)}
CV4: {F3 → (F4, F5, F10)}
CV5: {F6 → F3 → (F5, F9)}
CV6: {F6 → F3 → (F5, F10)}
CV7: {F4, F11 → F5}
CV8: {F6 → F11 → F5}
CV9: {F3 → (F7, F8, F9)}
CV10: {F7, F11 → F8}
Fig. 4. Context variants
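Since a context variant is a set of facts whose conjunction implies the context, the initial variants can be obtained by expanding an and/or formula of facts into its disjuncts. The following Python sketch shows this expansion. The formula `C1` used here is an assumption: it was reverse-engineered so that its expansion matches the eleven initial variants listed above, and it is not copied from the context analysis model. The tuple encoding and the names `variants` and `F` are hypothetical.

```python
from itertools import product


def variants(node):
    """Expand an and/or formula of facts into its context variants (sets of facts)."""
    kind = node[0]
    if kind == "fact":
        return [frozenset([node[1]])]
    children = [variants(child) for child in node[1:]]
    if kind == "or":
        # Any variant of any alternative is a variant of the disjunction.
        return [v for child in children for v in child]
    if kind == "and":
        # One variant per combination of the conjuncts' variants.
        return [frozenset().union(*combo) for combo in product(*children)]
    raise ValueError(f"unknown node kind: {kind}")


def F(i):
    """Shorthand for a fact node named F<i>."""
    return ("fact", f"F{i}")


# Assumed formula, consistent with the initial context variants of Fig. 4.
C1 = ("or", F(1), F(2),
      ("and", F(3), F(5),
       ("or", F(4), F(6), ("and", F(7), F(8))),
       ("or", F(9), F(10), F(11))))

initial = variants(C1)
print(len(initial))  # 11
for variant in initial:
    print(sorted(variant, key=lambda name: int(name[1:])))
```

The expansion yields 2 + 3 × 3 = 11 variants. Refining them into the final variants (ordering fact verification, removing conflicting variants and unnecessary facts) depends on the relationships between facts and is not covered by this sketch.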
Correctness of a business process is usually related to its soundness [5]. For business process executions that are defined from context variants, two situations can impede soundness of a contextualised business process. The first one is that a context variant contains conflicting facts. The second is that a sequence of fact verifications is followed that does not allow a business process instance to finish. These situations are avoided by analysing the context variants. For this purpose, a table is created to specify the relationships between facts. The table also aims to obtain context variants whose sets of facts are minimal. An example is shown in Table 1, which specifies the relationships between the facts of the initial context variants of Fig. 4. The relationships are specified as follows. Given a pair of facts Fr (fact of a row) and Fc (fact of a column), their relationship can be: ‘X’ (no context variant contains Fr and Fc together); ‘Pr’ (Fr verification will precede Fc verification); ‘Pc’ (opposite to ‘Pr’); ‘Kr’ (Fr truth value will be known before Fc verification); ‘Kc’ (opposite to ‘Kr’); ‘Ur’ (Fr is always true when Fc is true, thus Fr verification will be unnecessary when Fc is true); ‘Uc’ (opposite to ‘Ur’); ‘C’ (Fr and Fc are conflicting); ‘-’ (no relationship exists). Finally, context variants are refined by specifying the sequence of fact verification (‘→’) and removing conflicting variants and unnecessary facts (Fig. 4).
2.3 Modelling of Contextualised Business Process
A contextualised business process is modelled on the basis of its final context variants. The first step is determination of the tasks that will be part of the business process. They can correspond to: 1) tasks of the initial business process model that are not influenced by context; 2) tasks that are defined from refinement of the tasks of the initial business process model (e.g. “Address customer” refines “Find potential buyer”); and 3) tasks that make facts true (e.g. “Approach customer” makes F3 true).
Table 1. Relationships between facts
Rows (Fr): F1 F2 F3 F4 F5 F6 F7 F8 F9 F10
Column F11: X X Ur Pc Kr Pc X X
Column F10: X X Pr Kr C X
Column F9: X X Pr Kr -
Column F8: X X Pr X Ur X -
Column F7: X X Pr X X
Column F6: X X Kc X Kc
Column F5: X X Pr -
Column F4: X X Pr
Column F3: X X
Column F2: X
Table 2. Relationships between tasks and facts
                       F1  F2  F3  F4   F5   F6   F7   F8   F9   F10  F11
T1: Approach customer  U   U   M                                      U
T2: Address customer   U   U   Sc  Sc1  Sc1  Sc1  Sc1  Sc1  Sc1  Sc1  U
If a task of the latter type is executed when a given fact is false, then the fact becomes true. Such facts are called manageable. Once tasks are determined, a table is created to specify their relationships with the facts of the final context variants. An example is shown in Table 2. The relationships are specified as follows. Given a fact F, a set of facts φ and a task T, their relationship can be: ‘M’ (T allows F to be manageable); ‘U’ (T execution will be unnecessary if F is true); ‘Sc’ (T execution will succeed F verification); ‘ScX’ (where ‘X’ is a number; T execution will succeed verification of the facts of φ); ‘-’ (no relationship exists).
CE1: F1
CE2: F2
CE3: (F3 | T1) → (F4, F5, F9) → T2
CE4: (F3 | T1) → (F4, F5, F10) → T2
CE5: F6 → (F3 | T1) → (F5, F9) → T2
CE6: F6 → (F3 | T1) → (F5, F10) → T2
CE7: F4, F11 → F5
CE8: F6 → F11 → F5
CE9: (F3 | T1) → (F7, F8, F9) → T2
CE10: F7, F11 → F8
Fig. 5. Contextualised executions
The next step for modelling of a contextualised business process is specification of its contextualised executions (Fig. 5). A contextualised execution is a set of fact verifications and task executions that specifies a correct execution of a business process or of a fragment of a business process for a context variant. Contextualised executions are specified by extending the final context variants of a business process with the execution sequence of its tasks (‘→’). The manageable facts and their associated tasks are put in brackets and the symbol ‘|’ is put between them: either the fact is true or the task has to be executed. Finally, a contextualised business process model is created on the basis of the constraints (fact verification and task execution sequences) that the contextualised executions impose. BPMN has been extended by labelling its sequence flows for specification of formulas that have to hold so that a sequence flow is executed. Fact and formula verification is represented by means of gateways. Fig. 6 shows the effect of contextualisation of the task “Find potential buyer” for the running example.
Fig. 6. Contextualised business process model
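How a contextualised execution constrains the order of fact verifications and task executions can be sketched in Python. The example walks through CE3, (F3 | T1) → (F4, F5, F9) → T2, under one possible reading: a manageable fact is either already true or made true by executing its task, a group of facts must all hold before the execution continues, and the final task is executed last. The step encoding and the function name `run` are hypothetical, not part of the approach.

```python
def run(execution, facts):
    """Walk one contextualised execution; return (completed, executed_tasks)."""
    facts = dict(facts)          # local copy, manageable facts may change
    executed = []
    for step in execution:
        kind = step[0]
        if kind == "verify":     # ("verify", [facts that must all be true])
            if not all(facts.get(name, False) for name in step[1]):
                return False, executed
        elif kind == "manage":   # ("manage", fact, task): fact is true or task runs
            _, fact, task = step
            if not facts.get(fact, False):
                executed.append(task)
                facts[fact] = True
        elif kind == "task":     # ("task", task): unconditional task execution
            executed.append(step[1])
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return True, executed


# CE3 from Fig. 5, encoded with the hypothetical step tuples above.
CE3 = [("manage", "F3", "T1"), ("verify", ["F4", "F5", "F9"]), ("task", "T2")]

# F3 is false, so T1 ("Approach customer") runs before T2 ("Address customer").
print(run(CE3, {"F4": True, "F5": True, "F9": True}))
# F3 already holds, so only T2 is executed.
print(run(CE3, {"F3": True, "F4": True, "F5": True, "F9": True}))
```

In the contextualised BPMN model the same constraints appear as gateways and labelled sequence flows; the sketch only mimics the ordering, not the notation.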
3 Conclusions and Future Work
This paper has addressed several challenges of context-aware business process modelling in order to allow research on it to advance further. As a result, an approach for business process contextualisation has been presented. The approach adapts context analysis for analysis of business process context, and provides mechanisms and guidance for analysis of business process context and its variants and for modelling of contextualised business processes. It facilitates discovery and adequate specification of relevant context properties in the form of facts, as well as of the relationships between facts and between facts and tasks of a contextualised business process. These relationships affect business process execution. Furthermore, the mechanisms and guidance can guarantee that a contextualised business process fits its context and is sound. As future work, we have to address approach automation and formal evaluation.
Acknowledgements. This work has been developed with the support of the Spanish Government under the projects SESAMO TIN2007-62894 and HI2008-0190 and the program FPU AP2006-02324, partially funded by the EU Commission through the projects COMPAS, NESSOS and ANIKETOS, and co-financed by FEDER. The authors would also like to thank Amit K. Chopra for his useful comments.
References
1. Ali, R., Dalpiaz, F., Giorgini, P.: A Goal-based Framework for Contextual Requirements Modeling and Analysis. Requirements Engineering Journal (to appear, 2010)
2. Hallerbach, A., Bauer, T., Reichert, M.: Capturing Variability in Business Process Models: The Provop Approach. Journal of Software Maintenance and Evolution (to appear, 2010)
3. Rosemann, M., Recker, J., Flender, C.: Contextualisation of business processes. International Journal of Business Process Integration and Management 3(1), 47–60 (2008)
4. Smanchat, S., Ling, S., Indrawan, M.: A Survey on Context-Aware Workflow Adaptations. In: MoMM 2008, pp. 414–417 (2008)
5. Weske, M.: Business Process Management. Springer, Heidelberg (2007)
A Generic Perspective Model for the Generation of Business Process Views
Horst Pichler and Johann Eder
Universitaet Klagenfurt
{horst.pichler,johann.eder}@uni-klu.ac.at
Abstract. Overwhelmed by the model size and the diversity of presented information in huge business process models, the stakeholders in the process lifecycle, like analysts, process designers, or software engineers, find it hard to focus on certain details of the process. We present a model along with an architecture that allows capturing arbitrary process perspectives, which can then be used for the generation of process views that contain only relevant details.
1 Introduction
Big business (workflow) process models may contain hundreds of connected activities, augmented with information from diverse perspectives which, depending on the application domain, are required to represent the process’s universe of discourse. Accordingly, it is hard for the various stakeholders in the process lifecycle (e.g., analysts, process managers, software engineers) to focus on their areas of interest. This can be accomplished with process views, which are extracts of processes that contain only relevant (selected) activities or aggregations of them. Most process view-related research publications focus solely on control flow issues. They assume that a set of already selected view-relevant control flow elements is given and aim at the generation of process views that satisfy diverse correctness criteria. Their findings are important for the generation of process views, but they do not show how view-relevant control flow elements are selected according to specified characteristics [4,3,2,5]. These characteristics are usually defined as parts of process perspectives (also called aspects), of which the most frequently mentioned are: behavior (control flow), function, information (data), organization, and operation. However, this list is neither complete nor fixed and may therefore be arbitrarily modified and extended depending on the application domain or workflow system [6]. Correspondingly, most standardization efforts are likely to fail. Furthermore, especially in ERP-systems, complex perspectives cannot always be captured directly in the process model, but only through referral to an external resource repository. We aim at the generation of process views for analytical purposes, based on queries which formulate combinations of constraints on diverse perspectives. In the following we present an architecture to import process models and related information from arbitrary sources (workflow systems, process modelling tools, ERP-systems, etc.) into a predefined generic perspective model, which is then used to formulate user queries for the generation of process views. Very similar to workflow data warehousing approaches [1], the structure and components of perspectives must suit the queries required to answer relevant questions. Furthermore, it must be possible to extract information from various external sources, prepare it (e.g., with aggregation operations), transform it to the target perspective structures, and load it into the target perspective model, which can then be queried for further view generation.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 477–482, 2010. © Springer-Verlag Berlin Heidelberg 2010
2 Architecture Overview
The architecture of our system, as visualized in Figure 1, consists of several components.
Fig. 1. System Architecture
How to use this architecture is indicated by the encircled numbers: (1) An expert specifies the perspective structure of a given model type (like XPDL), which must be stored in the perspective database. (2) The expert then implements an import interface for every perspective of this model type and an XPDL-transformer for the control flow perspective, and imports all processes (or process model instances, respectively) into the instance database. (3) Now a user can access the system by formulating queries with the query interface, which (4) guides the user with model-specific context-aware information from the perspective database for a selected model type and process model instance. (5) When this specification step is finished, the query engine generates and executes the queries, which results in a list of relevant components. Then a view can be generated, followed by an export as an XPDL document. The white boxes are future components for a complete architecture: a definition tool that helps the expert during the definition phase by scanning external data structures, and the import (e.g., an ETL-tool similar to those for data warehouse systems). Another component could be a process viewer that is integrated into the architecture and provides additional information on the generated views by accessing the perspective models.
3 Generic Perspective Model
Our generic perspective model allows experts to specify the components and structures of arbitrary process model perspectives, which are then filled with data from process model instances of arbitrary source systems. The UML class diagram of our generic perspective database model, as visualized in Figure 2, basically consists of two levels: (1) the perspective model that defines the components, structure, and relations of each perspective of a specific model type, and (2) the instance model, to store specific instances of process model types (e.g., the process ’claim handling’). In order to be generic the model must not contain any item that adheres to a specific concept, like ’user’, ’department’, or ’document’. Although we wanted our model to be as generic as possible, we opted against a totally generic approach for the control flow perspective for several reasons and chose the standardized process definition language XPDL in order to connect the components of other perspectives to it.
3.1 Perspective Model
The upper part of Figure 2 shows the model for arbitrary perspectives. The perspective database stores information about perspective components and their structures for any process model type (e.g., WorkParty, YAWL, SAP Business Workflow, BPEL, or any proprietary notation). Such a Model consists of several Perspectives (behavior, organization, etc.), where each perspective is composed of Components (e.g., activity, role, duration). The XPDLComponent is a specialization of a component that represents an XPDL control flow element type, which is required for the connection between the externalized XPDL control flow specification and other perspective components. Multiple inheritance between different components is supported by the association is-a, including attribute inheritance. Every component may have an arbitrary number of Attributes (e.g., a label, a description field, a condition). Due to space limitations we omitted the following attribute-related concepts in the UML class diagram: attribute types (e.g., integer, decimal, string, boolean, date, time, enum, etc.) and type-dependent value ranges and constraints (e.g., between 1 and 10, enumeration {red, green, blue}). A relationship between components is realized by the class Relation (e.g., has, takes, connects, etc.), where multiple components can be source or target of such a relation. We differentiate between three different types of relations: (1) An association is a simple relationship between components (e.g., activity ’has’ duration). The other types are network types used to specify directed graphs and tree structures, which means that they also implicitly describe transitive relationships that may be traversed. Again, multiple inheritance between different relations is supported by the association is-a. This allows, for instance, the specification of relations ’connects-regular’ and ’connects-exception’, which are both sub-classes of a relation ’connects’.
Fig. 2. Generic Perspective Database Model (perspective model classes: Model, Perspective, Component, XPDLComponent, Attribute, Relation; instance model classes: ModelInstance, ComponentInstance, XPDLComponentInstance, AttributeInstance, RelationInstance)
Figure 3 visualizes how we mapped the structures of perspectives of a sample process to our perspective model. In this figure components, to be stored as instances of the class Component in the generic model, are represented by rectangles; their attributes, to be stored in the class Attribute, are listed below the component name; and the relations between components, to be stored in the class Relation, are represented by directed edges. With the exception of the relation ’hierarchy’, all Relations are of type ’association’. The relation type ’network’ indicates that organizational units may form a hierarchy. Inheritance structures between components are visualized as directed edges (between super and sub) with white arrowheads.
3.2 Instance Model
Fig. 3. Mapping the Perspective Model of the Sample Process (perspectives: Control Flow, Organization, Data; components: Activity, Participant, Role, User, Unit, Permission, Form, FormField; relations: assigned, requires, access, hierarchy, contains, hasAccessType, has, affiliation)
The lower part of Figure 2 shows the instance model, which stores the perspective information for specific process model instances. It contains all component instances and their attribute instances (e.g., an activity instance with the name ’a’, a unit instance ’DepartmentA’, a user instance ’Pichler’). Additionally it contains the relation instances, which connect the component instances (e.g., DepartmentA ’hierarchy’ BranchSales, Pichler ’has’ Clerk). Specifically for the control flow, each XPDLComponentInstance has an idXPDL, which is the original id (or key attribute) of the referenced element in the input file. This information is required when generating views, such that the control flow components in the view can be related to the components in the original process. The left-hand side of Figure 4 visualizes a small part of the organizational instance structure for our process as it is imported from the source system: user Pichler has the role Clerk, and is affiliated to a unit DepartmentA, which is a sub-unit of BranchSales. According to the assigned-relation, Activity ’a’ may be executed by anybody within DepartmentA.
Fig. 4. Small Part of the Instance Model’s Content for the Sample Process (left: original instances; right: instances including closures)
The right-hand side of Figure 4 shows the closures, which exist according to the inheritance between components and relations of type ’network’. According to the perspective model the components User, Unit, and Role inherit from the component Participant, which means that for each of their component instances there also exists a corresponding component instance of type Participant, along with duplicates of the relations to other components. Similarly, as component instances of type Unit are connected by a ’network’ relation, all relations connected to a specific unit instance must also connect to its predecessors in a bottom-up fashion. E.g., as user ’Pichler’ is affiliated to ’DepartmentA’, he is also affiliated to ’BranchSales’.
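The two kinds of closure can be illustrated with a small Python sketch over the sample instances. It is only one literal reading of the rule stated above: every non-network relation whose target is a unit is duplicated for all units above it in the ’hierarchy’ relation, and every component type is expanded along is-a. The sketch assumes that each unit has at most one parent and that the hierarchy is acyclic; the names `SUPER`, `types_of`, `ancestors` and `close` are hypothetical and do not describe the authors’ implementation.

```python
# is-a between component types, as described for the sample perspective model.
SUPER = {"User": "Participant", "Unit": "Participant", "Role": "Participant"}


def types_of(component):
    """All component types an instance belongs to, following is-a upwards."""
    result = [component]
    while result[-1] in SUPER:
        result.append(SUPER[result[-1]])
    return result


def ancestors(unit, parent):
    """Units above `unit` in the hierarchy, nearest first (assumes a tree)."""
    chain = []
    while unit in parent:
        unit = parent[unit]
        chain.append(unit)
    return chain


def close(relations, network="hierarchy"):
    """Duplicate each non-network relation instance for all ancestors of its target."""
    parent = {src: tgt for (src, rel, tgt) in relations if rel == network}
    closed = set(relations)
    for (src, rel, tgt) in relations:
        if rel == network:
            continue
        for upper in ancestors(tgt, parent):
            closed.add((src, rel, upper))
    return closed


# Relation instances (source, relation, target) from the sample process.
relations = {
    ("DepartmentA", "hierarchy", "BranchSales"),
    ("Pichler", "affiliation", "DepartmentA"),
    ("Pichler", "has", "Clerk"),
    ("a", "assigned", "DepartmentA"),
}

for triple in sorted(close(relations)):
    print(triple)
print(types_of("User"))
```

Materialising the closures in this way trades storage for simpler queries: a query for all users affiliated to ’BranchSales’ then needs no recursive traversal of the hierarchy.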
4 Conclusions and Outlook
In this paper we presented a system that captures process models with arbitrary perspective structures in order to generate process views based on user queries. We showed by example how to model these perspectives and how to represent them in our generic perspective model in order to import process model instances from external sources. Currently we are working on a context-sensitive dynamic query interface that helps users define queries for view generation, along with an adaptation of an existing reduction-based view generation technique [5], to be completed with a BPMN-based visualization component for generated views.
References
1. Bonifati, A., et al.: Warehousing workflow data: Challenges and opportunities. In: Proc. of the 27th International Conference on Very Large Databases, VLDB 2001. Morgan Kaufmann, San Francisco (2001)
2. Bobrik, R., Reichert, M.U., Bauer, T.: View-Based Process Visualization. In: Alonso, G., Dadam, P., Rosemann, M. (eds.) BPM 2007. LNCS, vol. 4714, pp. 88–95. Springer, Heidelberg (2007)
3. Chebbi, I., Dustdar, S., Tata, S.: The view-based approach to dynamic inter-organizational workflow cooperation. Data & Knowledge Engineering 56(2) (2006)
4. Liu, D., Shen, M.: Workflow modeling for virtual processes: an order-preserving process-view approach. Information Systems 28(6) (2003)
5. Tahamtan, N.A.: Modeling and Verification of web Service Composition Based Interorganizational Workflows. Ph.D. Thesis, University of Vienna (2009)
6. zur Muehlen, M.: Workflow-based Process Controlling. In: Foundation, Design, and Application of workflow-driven Process Information Systems, Logos (2004) ISBN 978-3-8325-0388-8
Extending Organizational Modeling with Business Services Concepts: An Overview of the Proposed Architecture*
Hugo Estrada (1), Alicia Martínez (1), Oscar Pastor (2), John Mylopoulos (3), and Paolo Giorgini (3)
(1) CENIDET, Cuernavaca, Mor. México {hestrada,amartinez}@cenidet.edu.mx
(2) Technical University of Valencia, Spain [email protected]
(3) University of Trento, Italy {jm,paolo.giorgini}@dit.unitn.it
Abstract. Nowadays, there is wide consensus on the importance of organizational modelling in the definition of software systems that correctly address business needs. Accordingly, there exist many modelling techniques that capture business semantics from different perspectives: transactional, goal-oriented, aspect-oriented, value-oriented, etc. However, none of these proposals accounts for the service nature of most business organizations, nor for the growing importance of service orientation in computing. In this paper, an overview of a new business service-oriented modeling approach that extends the i* framework is presented as a solution to this problem. The proposed modeling approach enables analysts to represent an organizational model as a composition of business services, which are the basic building blocks that encapsulate a set of business process models. In these models the actors participate in actor dependency networks through interfaces defined in the business service specification.
Keywords: Organizational modeling, Business Services.
1 Introduction
Nowadays, there exists a great variety of business modelling techniques in academia and industry alike. Many of these include modelling primitives and abstraction mechanisms intended to capture the semantics of business organizations from a specific viewpoint: process-oriented, goal-oriented, aspect-oriented, value-oriented, etc. However, none of them support primitives that capture the service orientation of most business organizations, nor do they account for the growing importance of service orientation within IT. In this context, additional modelling and programming efforts are needed to adapt the organizational concepts (from a specific view of the enterprise) to service-oriented software systems.
* This research has been partially supported by DGEST Project #24.25.09-P/2009.
J. Parsons et al. (Eds.): ER 2010, LNCS 6412, pp. 483–488, 2010. © Springer-Verlag Berlin Heidelberg 2010
The objective of this work is to reduce the mismatch between organizational descriptions and service-oriented specifications by using the concept of service to model an enterprise. Therefore, the main contribution of this paper is an overview of a new modeling and methodological approach that addresses the enterprise modeling activity using business services as building blocks for encapsulating organizational behaviors. In order to ground the service-oriented approach in a well-known and well-founded business modeling technique, extensions to the i* framework [1] were proposed as initial work of this research [2]. The paper is structured as follows: Section 2 presents related work that uses services at the organizational level. Section 3 presents the proposed business service architecture, and Section 4 presents the conclusions of this work.
2 Related Works in Services at the Organizational Level
The use of services at the organizational level is one of the most recently emerging research fields in service-oriented modeling. The focus of this phase is the definition of the services that are offered by an enterprise. Below, we present the relevant works in this area. One of the few existing proposals is On demand Business Service Architecture [3]. In this proposal, the authors explore the impact of service orientation at the business level. The services represent functionalities offered by the enterprise to its customers. The proposal considers the definition of complex services composed of low-level services. One of the contributions of the Cherbakov work is that the services are represented from the customer point of view. One of its main weaknesses is the lack of mechanisms to model the complex internal behavior needed to satisfy the business services. The services are represented as “black boxes” in which the internal details of the implementation of each service are not represented; therefore, this approach offers no mechanism to represent the relationship between services and the goals that justify their creation. Another example of the use of services at the business level is the proposal of Software-aided Service Bundling [4][5][6]. The main contribution of this research work is the definition of an ontology (a formalized conceptual model) of services to develop software for service bundling. A service bundle consists of elementary services, and service providers can offer service bundles via the Internet. The ontology describes services from a business value perspective. Therefore, the services are described by the exchange of economic values between suppliers and customers rather than by physical properties. This modeling technique shares the same problem as the proposal of on demand business service.
The services are defined as black boxes, where the main focus is on the definition of the set of inputs and outputs of the service. One of the main consequences of not having mechanisms to describe the internal behavior of the services is that it is impossible to relate the services offered to the strategic objectives of the enterprise. Therefore, it could be difficult to define the alternative services that better satisfy the goals of the enterprise. Regardless of the setting in which services are analyzed, there is always a strong dependency between the concept of services and the concept of business functionalities. However, this key aspect of service modeling has been historically neglected in the literature. At present, there is only a partial solution to the problem of representing services at the organizational level in the same way as the services are perceived by the final customers. This paper presents an overview of the solution to this problem. In the proposed approach the goals are the mechanisms that allow the analyst to match the business functionalities and the user’s needs.
3 The Business Service Architecture
The research work presented in this paper is based on the hypothesis that it is possible to focus the organizational modeling activity on the values (services) offered by the enterprise to its customers. In this research work, we call them business services. Following this hypothesis, a method has been developed that provides mechanisms to guide the organizational modeling process from the business service viewpoint. In this context, the business services can be used as the basic granules of information that allow us to encapsulate a set of composite process models. The use of services as building blocks enables the analyst to represent new business functionalities by composing models of existing services. It is important to point out that the research presented in this paper is an overview of a larger research project in which the following components are proposed: a) a modeling language that extends the i* modeling framework to support services; the language addresses issues detected in empirical evaluations of i* in practice [7]; b) a three-tier architecture that captures relevant aspects of services: composition, variability, goals, actors, plans, behaviors; c) an elicitation technique to find current implementations of the services offered and requested by the analyzed enterprise, where goals play a very relevant role in the discovery process; d) a specific business modeling method to design or redesign an enterprise in accordance with the concept of business service; e) a formal definition (axioms) of the modeling primitives and diagrams of the service-oriented architecture. Throughout the project, formal (axioms) and informal (diagrammatic) definitions are provided for business service components.
3.1 Our Conceptualization about Business Service
We have defined a business service as a functionality that an organizational entity (an enterprise, functional area, department, or organizational actor) offers to other entities in order to fulfill its goals [2]. To provide the functionality, the organizational unit publishes a fragment of the business process as an interface with the users of the service. The business service concept refers to the basic building blocks that act as the containers in which the internal behaviors and social relationships of a business process are encapsulated.
Fig. 1. The Business Service Notation (labelled elements: service, customer, Goal A, internal provider goals, services and dependency, internal customer goal)
The business services have been represented using an extension of the notation of the i* framework. The concept of dependency provided by the i* framework has been modified to appropriately represent the social agreement between customers and providers (Fig. 1). 3.2 The Three-Tier Architectural Models The business service architecture is composed of three complementary models that offer a view of what an enterprises offers to its environment and what enterprise obtains in return. Global Model: In the proposed method, the organizational modeling process starts with the definition of a high-level view of the services offered and used by the enterprise. The global model permits the representation of the business services and the actor that plays the role of requester and provider. Extensions to i* conceptual primitives are used in this model. Fig. 2 shows a fragment of the detailed view of the business service global model for the running example. Manage travel agency Manage car rentals
Fig. 2. Fragment of the detailed view of the global model for the running example
[Figure labels: services Flight Reservation, Car Reservation, Hotel Reservation, Integrated Travel Planning; customer dependencies reserve a flight, rent a car, reserve a hotel, buy a travel package; provider goals include Manage travel agency, Manage car rentals, Manage travel packages, Manage flight reservations, Manage car reservations, Manage hotel reservations, Manage integrated planning travels]
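The global model described above can be pictured as a small data structure: each business service links a provider actor to a requester actor and records what the requester wants from it. The following Python fragment is only a minimal sketch under stated assumptions; the class names (`Actor`, `BusinessService`, `GlobalModel`), the provider name "Travel agency", and the pairing of each service with a customer goal are illustrative readings of the labels in Fig. 2, not part of the proposed method or of the i* notation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Actor:
    """An organizational entity that offers or requests services (hypothetical type)."""
    name: str


@dataclass
class BusinessService:
    """A functionality offered by a provider to a requester (hypothetical type)."""
    name: str
    provider: Actor
    requester: Actor
    requester_goal: str  # what the requester depends on the provider for


@dataclass
class GlobalModel:
    """High-level view of the services offered and used by the enterprise."""
    services: List[BusinessService] = field(default_factory=list)

    def offered_by(self, actor: Actor) -> List[str]:
        # Names of the services for which the actor plays the provider role.
        return [s.name for s in self.services if s.provider == actor]

    def used_by(self, actor: Actor) -> List[str]:
        # Names of the services for which the actor plays the requester role.
        return [s.name for s in self.services if s.requester == actor]


# Assumed actors; the provider's name is not stated in the text.
AGENCY = Actor("Travel agency")
CUSTOMER = Actor("Customer")

# Service/goal pairings are assumed from the labels of Fig. 2.
GLOBAL_MODEL = GlobalModel([
    BusinessService("Flight Reservation", AGENCY, CUSTOMER, "reserve a flight"),
    BusinessService("Car Reservation", AGENCY, CUSTOMER, "rent a car"),
    BusinessService("Hotel Reservation", AGENCY, CUSTOMER, "reserve a hotel"),
    BusinessService("Integrated Travel Planning", AGENCY, CUSTOMER,
                    "buy a travel package"),
])
```

Querying `GLOBAL_MODEL.offered_by(AGENCY)` lists the four services of the running example, which mirrors the "offered and used by the enterprise" view that the global model is meant to give.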
Process Model: Once business services have been elicited, they must be decomposed into a set of concrete processes that perform them. To do this, we use a process model that represents the functional abstractions of the business process for a specific service. At least one process model needs to be developed for each business service defined in the global model. Extensions to the i* conceptual primitives are used in this model. Fig. 3 shows an example of the simplified view of the process model for the walk-in rental car case study. The process model provides the mechanisms required to describe the flow of multiple processes, where the direction of the arrows in the figure indicates dependency for process execution. For example, analyzing the car availability is a precondition for requesting a walk-in rental. The box connectors labeled with the letter “T” indicate a transactional process.

Interaction Model: Finally, the semantics of the interactions and transactions of each business process are represented in an isolated diagram using the i* conceptual
Extending Organizational Modeling with Business Services Concepts
Fig. 3. Example of the process model for the walk-in reservation business service
[Figure labels: Enterprise; customer; service Walk-in reservation (Rent a car in the branch); aggregated processes Analyze car availability, Request walk-in rental, Formalize the rent, Finish walk-in rental; “T” connectors]
Fig. 4. The interaction needed for requesting a walk-in rental
[Figure labels: actors Customer, Enterprise, Bank; process Request walk-in car rental; interactions include request the service, authorize the service, provide the service, use the service, requested data, deliver data, acceptation/rejection, Validate the customer credit]
constructs. This model describes a set of structured and associated activities that produce a specific result or product for a business service. It is represented using redefined i* modeling primitives. Fig. 4 presents an example of the interaction model for the running example.
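The execution dependencies of the process model (the arrows of Fig. 3) behave like a precondition graph, so one valid execution order can be derived by a topological sort. The Python fragment below is a minimal sketch under stated assumptions: only the dependency between "Analyze car availability" and "Request walk-in rental" is given in the text, while the remaining chain is assumed from the figure labels for illustration. The names `PRECONDITIONS` and `execution_order` are hypothetical and do not belong to the proposed method.

```python
from collections import deque

# Each process maps to the processes that must complete before it can start.
PRECONDITIONS = {
    "Analyze car availability": [],
    "Request walk-in rental": ["Analyze car availability"],  # stated in the text
    "Formalize the rent": ["Request walk-in rental"],        # assumed
    "Finish walk-in rental": ["Formalize the rent"],         # assumed
}


def execution_order(preconditions):
    """Return one valid execution order using Kahn's topological sort.

    Every precondition must itself be a key of the mapping.
    Raises ValueError if the dependencies contain a cycle.
    """
    # Number of unmet preconditions per process.
    remaining = {p: len(deps) for p, deps in preconditions.items()}
    # Reverse index: which processes are waiting on each process.
    dependents = {p: [] for p in preconditions}
    for process, deps in preconditions.items():
        for dep in deps:
            dependents[dep].append(process)

    ready = deque(p for p, count in remaining.items() if count == 0)
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for waiting in dependents[current]:
            remaining[waiting] -= 1
            if remaining[waiting] == 0:
                ready.append(waiting)

    if len(order) != len(preconditions):
        raise ValueError("cyclic dependency between processes")
    return order
```

Calling `execution_order(PRECONDITIONS)` yields the processes with "Analyze car availability" first, which matches the precondition stated for requesting a walk-in rental. Transactional processes (the "T" connectors) are not modeled in this sketch.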
4 Conclusions and Future Work

To address the lack of appropriate mechanisms for reducing the current mismatch between business models and service-oriented designs and implementations, we have proposed a service-oriented organizational model. In this model, services represent the functionalities that the enterprise offers to potential customers. Business services are the building blocks for a three-tier business architecture: business services, business processes, and business interactions. The organizational modeling process starts with the definition of a high-level view of the services offered and used by the
enterprise. Later, each business service is refined into more concrete process models, following the proposed business service method. Finally, business interactions are represented using the revised version of the modeling concepts of the i* framework. The proposed service-oriented architecture introduces new i*-based modeling diagrams and the analysis needed to represent services at an organizational level. Our current research focuses on semi-automatically generating WSDL Web service descriptions from business services.
References

1. Eric, Y.: Modelling Strategic Relationships for Process Reengineering. PhD Thesis, Department of Computer Science, University of Toronto (1995)
2. Hugo, E.E.: A service-oriented approach for the i* framework. PhD Thesis, Valencia University of Technology, Valencia, Spain (2008)
3. Cherbakov, L., Galambos, G., Harishankar, R., Kalyana, S., Rackham, G.: Impact of service orientation at the business level. IBM Systems Journal 44(4), 653–668 (2005)
4. Ziv, B.: Software-aided Service Bundling - Intelligent Methods & Tools for Graphical Service Modeling. PhD Thesis, Vrije Universiteit Amsterdam, The Netherlands (2006)
5. Gordijn, J., Akkermans, H.: E3-value: Design and evaluation of e-business models. IEEE Intelligent Systems 16(4), 11–17 (2001)
6. Gordijn, J., Akkermans, H.: Value based requirements engineering: Exploring innovative e-commerce idea. Requirements Engineering Journal 8(2), 114–134 (2003)
7. Estrada, H., Martinez, A., Pastor, O., Mylopoulos, J.: An empirical evaluation of the i* framework in a model-based software generation environment. In: Dubois, E., Pohl, K. (eds.) CAiSE 2006. LNCS, vol. 4001, pp. 513–527. Springer, Heidelberg (2006) ISSN: 0302-9743
Author Index
Abo Zaid, Lamia 233
Ali, Raian 471
Armellin, Giampaolo 90
Artale, Alessandro 174, 317

Baumann, Peter 188
Baumgartner, Norbert 202
Bellahsene, Zohra 160, 261
Bernstein, Philip A. 146
Bhatt, Mehul 464
Borgida, Alex 118
Brogneaux, Anne-France 132

Cabot, Jordi 419
Calì, Andrea 347
Calvanese, Diego 317
Carro, Manuel 288
Castellanos, Malu 15
Chopra, Amit K. 31
Cleve, Anthony 132
Currim, Faiz 433

Dadam, Peter 332
Dalpiaz, Fabiano 31, 471
Dayal, Umeshwar 15
Deneckère, Rébecca 104
de la Vara, Jose Luis 471
De Troyer, Olga 233
Dijkman, Remco 1
Duchateau, Fabien 261
Dustdar, Schahram 288
Dylla, Frank 464

Eder, Johann 477
Ernst, Neil A. 118
Estrada, Hugo 483
Evermann, Joerg 274

Fahland, Dirk 445
Farré, Carles 438
Frasincar, Flavius 452

García, Félix 458
Giorgini, Paolo 31, 471, 483
Gonzalez-Perez, Cesar 219
Gottesheim, Wolfgang 202
Gottlob, Georg 347
Gutierrez, Angelica Garcia 188

Hainaut, Jean-Luc 132
Henderson-Sellers, Brian 219
Hogenboom, Alexander 452
Hogenboom, Frederik 452
Hois, Joana 464
Horkoff, Jennifer 59

Ibáñez-García, Angélica 317
Ivanović, Dragan 288

Jureta, Ivan J. 118

Kampoowale, Alankar 433
Kaymak, Uzay 452
Khatri, Vijay 46
Kleinermann, Frederic 233
Knuplesch, David 332
Kontchakov, Roman 174
Kornyshova, Elena 104
Kutz, Oliver 464

Liu, Jun 160
Ly, Linh Thao 332

Mameli, Gianluca 90
Marks, Gerard 405
Martinenghi, Davide 377
Martínez, Alicia 483
Mazón, Jose-Norberto 419
McBrien, Peter J. 362
Mendling, Jan 1, 445, 458
Mhatre, Girish 433
Mitsch, Stefan 202
Murphy, John 405
Mylopoulos, John 31, 76, 90, 118, 483

Neidig, Nicholas 433
Ng, Wilfred 302

Olivé, Antoni 247

Pardillo, Jesús 419
Pastor, Oscar 483
Perini, Anna 90
Pfeifer, Holger 332
Piattini, Mario 458
Pichler, Horst 477
Pieris, Andreas 347
Pinggera, Jakob 445

Queralt, Anna 438

Reijers, Hajo A. 445
Retschitzegger, Werner 202
Rinderle-Ma, Stefanie 332
Rizopoulos, Nikos 362
Roantree, Mark 160, 405
Ruiz, Francisco 458
Rull, Guillem 438
Ryzhikov, Vladislav 174

Salay, Rick 76
Sánchez-González, Laura 458
Sánchez, Juan 471
Schouten, Kim 452
Schwinger, Wieland 202
Siena, Alberto 90
Signer, Beat 391
Simitsis, Alkis 15
Smirnov, Sergey 1
Smith, Andrew C. 362
Susi, Angelo 90

Teniente, Ernest 438
Terwilliger, James F. 146
Torlone, Riccardo 377
Treiber, Martin 288
Trujillo, Juan 419

Unnithan, Adi 146
Urpí, Toni 438

van der Meer, Otto 452
Vandic, Damir 452
Vessey, Iris 46
Villegas, Antonio 247

Weber, Barbara 445
Weidlich, Matthias 445
Weske, Mathias 1
Wilkinson, Kevin 15
Wu, You 302

Yu, Eric 59

Zakharyaschev, Michael 174
Zugal, Stefan 445