Enterprise Information Systems: 10th International Conference, ICEIS 2008, Barcelona, Spain, June 12-16, 2008, Revised Selected Papers (Lecture Notes in Business Information Processing) [1 ed.] 3642006698, 9783642006692, 9783642006708



Lecture Notes in Business Information Processing 19

Series Editors
Wil van der Aalst, Eindhoven Technical University, The Netherlands
John Mylopoulos, University of Trento, Italy
Norman M. Sadeh, Carnegie Mellon University, Pittsburgh, PA, USA
Michael J. Shaw, University of Illinois, Urbana-Champaign, IL, USA
Clemens Szyperski, Microsoft Research, Redmond, WA, USA

Joaquim Filipe, José Cordeiro (Eds.)

Enterprise Information Systems
10th International Conference, ICEIS 2008
Barcelona, Spain, June 12-16, 2008
Revised Selected Papers


Volume Editors
Joaquim Filipe
José Cordeiro
Institute for Systems and Technologies of Information, Control and Communication (INSTICC)
and Instituto Politécnico de Setúbal (IPS), Department of Systems and Informatics
Rua do Vale de Chaves, Estefanilha, 2910-761 Setúbal, Portugal
E-mail: {j.filipe,jcordeir}@est.ips.pt

Library of Congress Control Number: Applied for
ACM Computing Classification (1998): H.3.5, H.4, J.1, K.4.4, J.2

ISSN: 1865-1348
ISBN-10: 3-642-00669-8 Springer Berlin Heidelberg New York
ISBN-13: 978-3-642-00669-2 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12638550 06/3180 543210

Preface

This book contains the best papers of the 10th International Conference on Enterprise Information Systems (ICEIS 2008), held in the city of Barcelona (Spain), organized by the Institute for Systems and Technologies of Information, Control and Communication (INSTICC) in cooperation with AAAI and co-sponsored by WfMC. ICEIS has become a major point of contact between research scientists, engineers and practitioners in the area of business applications of information systems. This year, five simultaneous tracks were held, covering different aspects related to enterprise computing, including: “Databases and Information Systems Integration,” “Artificial Intelligence and Decision Support Systems,” “Information Systems Analysis and Specification,” “Software Agents and Internet Computing” and “Human–Computer Interaction.” All tracks focused on real-world applications and highlighted the benefits of information systems and technology for industry and services, thus making a bridge between academia and enterprise.

Following the success of 2007, ICEIS 2008 received 665 paper submissions from more than 40 countries. In all, 62 papers were published and presented as full papers, i.e., completed work (8 pages in proceedings / 30-min oral presentations), 183 papers, reflecting work-in-progress or position papers, were accepted for short presentation, and another 161 for poster presentation. These numbers, leading to a “full-paper” acceptance ratio below 10% and a total oral paper acceptance ratio below 37%, show the intention of preserving a high-quality forum for the next editions of this conference. Additionally, as usual in the ICEIS conference series, a number of invited talks, presented by internationally recognized specialists in different areas, positively contributed to reinforcing the overall quality of the conference and to providing a deeper understanding of the enterprise information systems field. We included in the first part of this book the set of papers that supported some of these keynote speeches.

The program for this conference required the dedicated effort of many people. Firstly, we must thank the authors, whose research and development efforts are recorded here. Secondly, we thank the members of the Program Committee and the additional reviewers for their diligence and expert reviewing. Thirdly, we thank the invited speakers for their invaluable contribution and for taking the time to synthesize and prepare their talks. Fourthly, we thank the Workshop Chairs and Special Session Chairs whose collaboration with ICEIS was much appreciated. Finally, special thanks to the INSTICC staff whose collaboration was fundamental for the success of this conference.

December 2008

Joaquim Filipe
José Cordeiro

Organization

Conference Chair
Joaquim Filipe, Polytechnic Institute of Setúbal / INSTICC, Portugal

Program Chair
José Cordeiro, Polytechnic Institute of Setúbal / INSTICC, Portugal

Organizing Committee
Paulo Brito, INSTICC, Portugal
Marina Carvalho, INSTICC, Portugal
Helder Coelhas, INSTICC, Portugal
Vera Coelho, INSTICC, Portugal
Andreia Costa, INSTICC, Portugal
Bruno Encarnação, INSTICC, Portugal
Bárbara Lima, INSTICC, Portugal
Vitor Pedrosa, INSTICC, Portugal
Vera Rosário, INSTICC, Portugal
Mónica Saramago, INSTICC, Portugal

Senior Program Committee Luís Amaral, Portugal Senén Barro, Spain Jean Bézivin, France Enrique Bonsón, Spain João Alvaro Carvalho, Portugal Albert Cheng, USA Bernard Coulette, France Jan Dietz, The Netherlands Virginia Dignum, The Netherlands Schahram Dustdar, Austria

António Figueiredo, Portugal Ulrich Frank, Germany Nuno Guimarães, Portugal Jatinder Gupta, USA Dimitris Karagiannis, Austria Michel Leonard, Switzerland Kecheng Liu, UK Pericles Loucopoulos, UK Andrea De Lucia, Italy Kalle Lyytinen, USA


Yannis Manolopoulos, Greece José Legatheaux Martins, Portugal Masao Johannes Matsumoto, Japan Luís Moniz Pereira, Portugal Marcin Paprzycki, Poland Alain Pirotte, Belgium Klaus Pohl, Germany Matthias Rauterberg, The Netherlands Colette Rolland, France Narcyz Roztocki, USA Abdel-Badeeh Salem, Egypt Bernardette Sharp, UK

Timothy K. Shih, Taiwan Alexander Smirnov, Russian Federation Ronald Stamper, UK David Taniar, Australia Miguel Toro, Spain Antonio Vallecillo, Spain Michalis Vazirgiannis, Greece François Vernadat, Luxembourg Ioannis Vlahavas, Greece Frank Wang, UK Merrill Warkentin, USA Hans Weigand, The Netherlands

Program Committee Mohd Syazwan Abdullah, Malaysia Rama Akkiraju, USA Patrick Albers, France Vasco Amaral, Portugal Yacine Amirat, France Andreas Andreou, Cyprus Plamen Angelov, UK Pedro Antunes, Portugal Nasreddine Aoumeur, Germany Gustavo Arroyo-Figueroa, Mexico Wudhichai Assawinchaichote, Thailand Juan Carlos Augusto, UK Ramazan Aygun, USA Bart Baesens, UK Cecilia Baranauskas, Brazil Steve Barker, UK Balbir Barn, UK Daniela Barreiro Claro, Brazil Nick Bassiliades, Greece Remi Bastide, France Nadia Bellalem, France Orlando Belo, Portugal Hatem Ben Sta, Tunisia Sadok Ben Yahia, Tunisia

Manuel F. Bertoa, Spain Peter Bertok, Australia Robert Biddle, Canada Oliver Bittel, Germany Luis Borges Gouveia, Portugal Djamel Bouchaffra, USA Danielle Boulanger, France Jean-louis Boulanger, France José Ângelo Braga de Vasconcelos, Portugal Sjaak Brinkkemper, The Netherlands Miguel Calejo, Portugal Coral Calero, Spain Luis M. Camarinha-Matos, Portugal Olivier Camp, France Roy Campbell, USA Gerardo Canfora, Italy Fernando Carvalho, Brazil Nunzio Casalino, Italy Jose Jesus Castro-Schez, Spain Luca Cernuzzi, Paraguay Maria Filomena Cerqueira de Castro Lopes, Portugal Laurent Chapelier, France


Cindy Chen, USA Jinjun Chen, Australia Abdelghani Chibani, France Henning Christiansen, Denmark Chrisment Claude, France Francesco Colace, Italy Cesar Collazos, Colombia Jose Eduardo Corcoles, Spain Antonio Corral, Spain Ulises Cortes, Spain Sharon Cox, UK Alfredo Cuzzocrea, Italy Mohamed Dahchour, Morocco Sergio de Cesare, UK Nuno de Magalhães Ribeiro, Portugal José-Neuman de Souza, Brazil Suash Deb, India Vincenzo Deufemia, Italy Rajiv Dharaskar, India Kamil Dimililer, Cyprus Gillian Dobbie, New Zealand José Javier Dolado, Spain Anonio Dourado, Portugal Juan C. Dueñas, Spain Alan Eardley, UK Hans-Dieter Ehrich, Germany Jean-Max Estay, France Yaniv Eytani, USA Antonio Fariña, Spain Antonio Fernández-Caballero, Spain Edilson Ferneda, Brazil Paulo Ferreira, Portugal Filomena Ferrucci, Italy Juan J. Flores, Mexico Donal Flynn, UK Ana Fred, Portugal Lixin Fu, USA Mariagrazia Fugini, Italy Jose A. Gallud, Spain

Juan Garbajosa, Spain Leonardo Garrido, Mexico Peter Geczy, Japan Marcela Genero, Spain Joseph Giampapa, USA Paolo Giorgini, Italy Raúl Giráldez, Spain Pascual González, Spain Gustavo Gonzalez-Sanchez, Spain Robert Goodwin, Australia Jaap Gordijn, The Netherlands Silvia Gordillo, Argentina Feliz Gouveia, Portugal Virginie Govaere, France Rune Gustavsson, Sweden Sven Groppe, Germany Sissel Guttormsen Schär, Switzerland Sung Ho Ha, Korea Maki Habib, Japan Lamia Hadrich Belguith, Tunisia Beda Christoph Hammerschmidt, USA Abdelwahab Hamou-Lhadj, Canada Thorsten Hampel, Germany Sven Hartmann, New Zealand Christian Heinlein, Germany Ajantha Herath, USA Suvineetha Herath, USA Francisco Herrera, Spain Colin Higgins, UK Peter Higgins, Australia Wladyslaw Homenda, Poland Jun Hong, UK Wei-Chiang Hong, Taiwan Nguyen Hong Quang, Vietnam Jiankun Hu, Australia Kaiyin Huang, China Joshua Ignatius, Malaysia François Jacquenet, France Hamid Jahankhani, UK


Arturo Jaime, Spain Ivan Jelinek, Czech Republic Luis Jiménez Linares, Spain Paul Johannesson, Sweden Michail Kalogiannakis, France Nikos Karacapilidis, Greece Nikitas Karanikolas, Greece Stamatis Karnouskos, Germany Hiroyuki Kawano, Japan Nicolas Kemper Valverde, Mexico Seungjoo Kim, Korea Alexander Knapp, Germany John Krogstie, Norway Stan Kurkovsky, USA Rob Kusters, The Netherlands Joaquín Lasheras, Spain James P. Lawler, USA Chul-Hwan Lee, USA Jintae Lee, USA Alain Leger, France Kauko Leiviskä, Finland Carlos León de Mora, Spain Joerg Leukel, Germany Hareton Leung, China Xue Li, Australia Therese Libourel, France John Lim, Singapore ZongKai Lin, China Matti Linna, Finland Panos linos, USA Honghai Liu, UK Jan Ljungberg, Sweden Stephane Loiseau, France João Correia Lopes, Portugal Víctor López-Jaquero, Spain María Dolores Lozano, Spain Miguel R. Luaces, Spain Christopher Lueg, Australia

Mark Lycett, UK Edmundo Madeira, Brazil Laurent Magnin, Canada S. Kami Makki, USA Mirko Malekovic, Croatia Nuno Mamede, Portugal João Bosco Mangueira Sobral, Brazil Pierre Maret, France Farhi Marir, UK Maria João Marques Martins, Portugal Herve Martin, France Miguel Angel Martinez, Spain David Martins de Matos, Portugal Katsuhisa Maruyama, Japan Hamid Mcheick, Canada Andreas Meier, Switzerland Engelbert Mephu Nguifo, France John Miller, USA Subhas Misra, USA Sudip Misra, USA Michele Missikoff, Italy Ghodrat Moghadampour, Finland Pascal Molli, France Francisco Montero, Spain Paula Morais, Portugal Fernando Moreira, Portugal Nathalie Moreno Vergara, Spain Gianluca Moro, Italy Haralambos Mouratidis, UK Pietro Murano, UK Tomoharu Nakashima, Japan Paolo Napoletano, Italy Ana Neves, Portugal Jose Angel Olivas, Spain Luis Olsina Santos, Argentina Peter Oriogun, UK Tansel Ozyer, Turkey Claus Pahl, Ireland


José R. Paramá, Spain João Pascoal Faria, Portugal Leif Peterson, USA Steef Peters, The Netherlands Vicente Pelechano, Spain Maria Carmen Penadés Gramaje, Spain Gabriel Pereira Lopes, Portugal Laurent Péridy, France Dana Petcu, Romania Paolo Petta, Austria José Pires, Portugal Geert Poels, Belgium José Ragot, France Abdul Razak Rahmat, Malaysia Jolita Ralyte, Switzerland Srini Ramaswamy, USA Pedro Ramos, Portugal Marek Reformat, Canada Hajo A. Reijers, The Netherlands Ulrich Reimer, Switzerland Marinette Revenu, France Yacine Rezgui, UK Simon Richir, France Roland Ritsch, Switzerland David Rivreau, France Daniel Rodriguez, Spain Pilar Rodriguez, Spain Jimena Rodriguez Arrieta, Spain Oscar M. Rodriguez-Elias, Mexico Jose Raul Romero, Spain Agostinho Rosa, Portugal Gustavo Rossi, Argentina Angel L. Rubio, Spain Francisco Ruiz, Spain Roberto Ruiz, Spain Ángeles S. Places, Spain Manuel Santos, Portugal Jurek Sasiadek, Canada


Daniel Schang, France Mareike Schoop, Germany Remzi Seker, USA Isabel Seruca, Portugal Jianhua Shao, UK Alberto Silva, Portugal Maria João Silva Costa Ferreira, Portugal Spiros Sirmakessis, Greece Hala Skaf-Molli, France Pedro Soto-Acosta, Spain Chantal Soule-Dupuy, France Priti Srinivas Sajja, India Chris Stary, Austria Janis Stirna, Sweden Markus Stumptner, Australia Chun-Yi Su, Canada Vijayan Sugumaran, USA Lily Sun, UK Gion K. Svedberg, Sweden Ramayah T., Malaysia Ryszard Tadeusiewicz, Poland Sotirios Terzis, UK Claudine Toffolon, France Robert Tolksdorf, Germany Grigorios Tsoumakas, Greece Theodoros Tzouramanis, Greece Gulden Uchyigit, UK Athina Vakali, Greece Michael Vassilakopoulos, Greece Belén Vela Sánchez, Spain Christine Verdier, France Maria-Amparo Vila, Spain HO Tuong Vinh, Vietnam Aurora Vizcaino, Spain Bing Wang, UK Hans Weghorn, Germany Gerhard Weiss, Austria Graham Winstanley, UK


Claus Witfelt, Denmark Wita Wojtkowski, USA Robert Wrembel, Poland Baowen Xu, China Haiping Xu, USA Hongji Yang, UK

Lili Yang, UK Jasmine Yeap, Malaysia Kokou Yetongnon, France Jun Zhang, China Liping Zhao, UK Shuigeng Zhou, China

Auxiliary Reviewers Antonia Albani, The Netherlands Francisco Martinez Alvarez, Spain Simona Barresi, UK Bruno Barroca, Portugal Christos Berberidis, Greece Beatriz Pontes Balanza, Spain Félix Biscarri, Spain Valeria de Castro, Spain José María Cavero, Spain Max Chevalier, France Evandro de Barros Costa, Brazil Stergiou Costas, Greece Guillermo Covella, Argentina Andrea Delgado, Uruguay Manuel Fernández Delgado, Spain Yuhui Deng, China Remco Dijkman, The Netherlands Vincent Dubois, France Beatrice Duval, France Fausto Fasano, Italy Paulo Félix, Spain Oscar Pedreira Fernández, Spain David Ferreira, Portugal Rita Francese, Italy Vittorio Fuccella, Italy David Heise, Germany Na Helian, UK Andi Iskandar, Japan Nitin Kanaskar, USA R. B. Lenin, USA

Oriana Licchelli, France Fernanda Lima, Brazil Marcos López, Spain Luiz Mauricio Martins, Portugal Shamila Makki, USA Philip Mayer, Germany Bernado Mello, Brazil Germana Menezes da Nóbrega, Brazil Iñigo Monedero, Spain Gabriele Monti, Italy Diego Seco Naveiras, Spain Rocco Oliveto, Italy Efi Papatheocharous, Cyprus Ignazio Passero, Italy Manuel Lama Penín, Spain Hércules Antonio do Prado, Brazil Franck Ravat, France Michele Risi, Italy K. Sauvagnat, France João Saraiva, Portugal Carlos Senna, Brazil Ivo dos Santos, Brazil Ernst Sikora, Germany Renate Strazdina, Latvia Xosé Antón Vila Sobrino, Spain Anastasis Sofokleous, Cyprus Sithu D. Sudarsan, USA Jonas Sprenger, Germany Mehdi Snene, Switzerland Giuseppe Scanniello, Italy


Constantinos Stylianou, Cyprus Guilaine Talens, France Christer Thörn, Sweden Athanasios Tsadiras, Greece S. Vimalathithan, USA Zhiming Wang, USA


Sining Wu, UK Kenji Yoshigoe, USA Johannes Zaha, Germany Chuanlei Zhang, USA

Invited Speakers
Moira C. Norrie, ETH Zurich, Switzerland
Ricardo Baeza-Yates, VP of Yahoo! Research for Europe and LatAm, Spain and Chile
Jorge Cardoso, SAP AG, Germany
Jean-Marie Favre, University of Grenoble, LIG, France

Table of Contents

Invited Papers

The Link between Paper and Information Systems . . . . . . . . . . . . . . . . . . . 3
Moira C. Norrie

Service Engineering for the Internet of Services . . . . . . . . . . . . . . . . . . . . . 15
Jorge Cardoso, Konrad Voigt, and Matthias Winkler

Part I: Databases and Information Systems Integration

Bringing the XML and Semantic Web Worlds Closer: Transforming XML into RDF and Embedding XPath into SPARQL . . . . . . . . . . . . . . . . 31
Matthias Droop, Markus Flarer, Jinghua Groppe, Sven Groppe, Volker Linnemann, Jakob Pinggera, Florian Santner, Michael Schier, Felix Schöpf, Hannes Staffler, and Stefan Zugal

A Framework for Semi-automatic Data Integration . . . . . . . . . . . . . . . . . . 46
Paolo Ceravolo, Zhan Cui, Ernesto Damiani, Alex Gusmini, and Marcello Leida

Experiences with Industrial Ontology Engineering . . . . . . . . . . . . . . . . . . . 61
Jon Atle Gulla

A Semiotic Approach to Quality in Specifications of Software Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
Erki Eessaar

Hybrid Computational Models for Software Cost Prediction: An Approach Using Artificial Neural Networks and Genetic Algorithms . . . 87
Efi Papatheocharous and Andreas S. Andreou

Part II: Artificial Intelligence and Decision Support Systems

How to Semantically Enhance a Data Mining Process? . . . . . . . . . . . . . . . 103
Laurent Brisson and Martine Collard

Next-Generation Misuse and Anomaly Prevention System . . . . . . . . . . . . 117
Pablo García Bringas and Yoseba K. Penya

Discovering Multi-perspective Process Models: The Case of Loosely-Structured Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Francesco Folino, Gianluigi Greco, Antonella Guzzo, and Luigi Pontieri

Tackling the Debugging Challenge of Rule Based Systems . . . . . . . . . . . . 144
Valentin Zacharias

Semantic Annotation of EPC Models in Engineering Domains to Facilitate an Automated Identification of Common Modelling Practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Andreas Bögl, Michael Schrefl, Gustav Pomberger, and Norbert Weber

Part III: Information Systems Analysis and Specification

Tool Support for the Integration of Light-Weight Ontologies . . . . . . . . . . 175
Thomas Heer, Daniel Retkowitz, and Bodo Kraft

Business Process Modeling for Non-uniform Work . . . . . . . . . . . . . . . . . . . 188
Kimmo Tarkkanen

Association Rules and Cosine Similarities in Ontology Relationship Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Jon Atle Gulla, Terje Brasethvik, and Gøran Sveia Kvarv

Compositional Model-Checking Verification of Critical Systems . . . . . . . 213
Luis E. Mendoza, Manuel I. Capel, María Pérez, and Kawtar Benghazi

Model-Driven Web Engineering in the CMS Domain: A Preliminary Research Applying SME . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Kevin Vlaanderen, Francisco Valverde, and Oscar Pastor

Part IV: Software Agents and Internet Computing

Binary Serialization for Mobile XForms Services . . . . . . . . . . . . . . . . . . . . 241
Jaakko Kangasharju and Oskari Koskimies

An Efficient Neighbourhood Estimation Technique for Making Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Li-Tung Weng, Yue Xu, Yuefeng Li, and Richi Nayak

Improve Recommendation Quality with Item Taxonomic Information . . 265
Li-Tung Weng, Yue Xu, Yuefeng Li, and Richi Nayak

Adapting Integration Architectures Based on Semantic Web Services to Industrial Needs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
Daniel Bachlechner

Part V: Human-Computer Interaction

“Fact or Fiction?” Imposing Legitimacy for Trustworthy Information on the Web: A Qualitative Inquiry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Emma Nuraihan Mior Ibrahim, Nor Laila Md. Noor, and Shafie Mehad

Enabling End Users to Proactively Tailor Underspecified, Human-Centric Business Processes: “Programming by Example” of Weakly-Structured Process Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Todor Stoitsev, Stefan Scheidl, Felix Flentge, and Max Mühlhäuser

Enhancing User Experience on the Web via Microformats-Based Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Anca-Paula Luca and Sabin C. Buraga

Designing Universally Accessible Mobile Multimodal Artefacts . . . . . . . . 334
Tiago Reis, Marco de Sá, and Luís Carriço

Dissection of a Visualization On-Demand Server . . . . . . . . . . . . . . . . . . . . 348
Romain Vuillemot, Béatrice Rumpler, and Jean-Marie Pinon

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361

Invited Papers

The Link between Paper and Information Systems

Moira C. Norrie

Institute for Information Systems, ETH Zurich, CH-8092 Zurich, Switzerland
[email protected]

Abstract. Emerging technologies for interactive paper make it possible to capture and access information from paper in a variety of interesting ways. Over the past seven years, we have developed a rich infrastructure for the prototyping and production of interactive paper documents and paper-based interfaces to applications. We will provide a review of this work, starting with a motivation for reforming rather than replacing paper and then going on to describe various ways in which paper can be linked to information systems to support both the capture of and access to information. This will include an introduction to commercial digital pen and paper technologies that can be used to capture user actions on paper as well as our own iPaper framework. Keywords: Digital pen, interactive paper, cross-media content publishing.

1 Introduction

Paper remains an important information medium even though most information is now managed digitally. Studies have shown that, far from disappearing from the workplace, the affordances of paper have assured its retention [1]. Many users still prefer to read on paper, especially given the ease with which printed documents can be annotated. Paper also supports forms of collaboration that are difficult to mimic in digital environments. In the case of mobile settings, paper has further advantages over existing mobile devices as it is light, foldable, requires no power and can be easily read by one or more users even in bright daylight. As a consequence, a lot of personal and professional information is still acquired from paper documents, whether it be contact information printed on business cards or tourist information in guide books and magazines. Yet little attention is given to the role of paper in information systems and the ways in which emerging technologies that bridge the paper-digital divide could be used to capture and access information. On the whole, paper documents are only considered as possible original sources of data, such as documents to be scanned, or as final target output documents in report generation or content publishing.

Over the past seven years, we have investigated ways in which printed and digital information could be integrated and paper used as the basis for novel forms of interaction with information systems. Our investigations have not only considered innovative ways of accessing and capturing information from paper, but also the technologies and infrastructures required to support the development of paper-based interfaces to existing applications and the process of publishing interactive paper documents. In this article, we provide a brief overview of the underlying digital pen and paper technologies and the general iPaper framework for the development of applications based on these technologies. By means of example applications related to tourism, we show how these technologies open up a wide range of possibilities for using paper as a medium for not only accessing a wide range of information services, but also capturing information.

We start in Sect. 2 with an introduction to digital pen and paper technologies and then provide an overview of our iPaper framework for developing interactive paper applications in Sect. 3. Two examples of tourist guides based on interactive paper are described in Sect. 4 to show different modes of accessing information. Section 5 then discusses possible ways in which information can be captured from documents using a gesture-based interface. Concluding remarks are given in Sect. 6.

Fig. 1. Digital Pen and Paper (labelled components: camera, ink cartridge, battery, memory, processor; 0.3 mm dot spacing)

2 Digital Pen and Paper

Anoto digital pen and paper technology (http://www.anoto.com) was originally developed for the digital capture of handwritten information. A digital pen has a camera alongside the stylus that can be used to track the position of the pen on paper, as illustrated in Fig. 1. Pages are covered with an almost invisible printed dot pattern that encodes position information. The pen position is given as a coordinate within a vast virtual document space defined by Anoto, and applications can map position data returned by the pen to (x,y) coordinates within physical pages used by the application. Developers are assigned pages within the Anoto pattern space as part of their licensing agreement. For example, the manufacturers of note books designed to be used with digital pens each have their own pattern licence that allows them to produce books of pages printed with the pattern and track pen strokes on paper within these pages.

Multiple handwritten pages can be captured and stored within the pen before being transferred to a PC via a Bluetooth or USB connection. Timing information is recorded along with position information, which allows user actions to be replayed if required. The applications associated with note books typically allow handwritten text and sketches to be stored as images, or they may use character or gesture recognition software to interpret the pen strokes.
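To make the mapping from the global pattern space to application pages concrete, the following minimal sketch shows how a streamed pen position could be resolved against a registry of pattern pages assigned to an application. All class, field and address names here are illustrative assumptions and do not correspond to the actual Anoto SDK.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: maps a pen position reported in the global Anoto
// pattern space onto an application-defined physical page. Names and the
// pattern page address format are invented for this example.
public class PatternPageRegistry {

    // A physical page is registered with the pattern page it was printed with
    // and its size in millimetres.
    static class PhysicalPage {
        final String documentId;
        final int pageNumber;
        final double widthMm, heightMm;

        PhysicalPage(String documentId, int pageNumber, double widthMm, double heightMm) {
            this.documentId = documentId;
            this.pageNumber = pageNumber;
            this.widthMm = widthMm;
            this.heightMm = heightMm;
        }
    }

    // Pen position as delivered in streaming mode: a pattern page address plus
    // fractional coordinates within that pattern page (0..1 in this sketch).
    static class PenPosition {
        final String patternPageAddress;
        final double fx, fy;

        PenPosition(String patternPageAddress, double fx, double fy) {
            this.patternPageAddress = patternPageAddress;
            this.fx = fx;
            this.fy = fy;
        }
    }

    private final Map<String, PhysicalPage> pages = new HashMap<>();

    public void register(String patternPageAddress, PhysicalPage page) {
        pages.put(patternPageAddress, page);
    }

    // Resolve a streamed pen position to (x, y) in millimetres on the printed page.
    public double[] toPageCoordinates(PenPosition pos) {
        PhysicalPage page = pages.get(pos.patternPageAddress);
        if (page == null) {
            throw new IllegalArgumentException("Unknown pattern page: " + pos.patternPageAddress);
        }
        return new double[] { pos.fx * page.widthMm, pos.fy * page.heightMm };
    }

    public static void main(String[] args) {
        PatternPageRegistry registry = new PatternPageRegistry();
        // Hypothetical pattern page address for an A4 page of a brochure.
        registry.register("231.36.1.17", new PhysicalPage("edfest-brochure", 12, 210.0, 297.0));

        PenPosition sample = new PenPosition("231.36.1.17", 0.25, 0.60);
        double[] xy = registry.toPageCoordinates(sample);
        System.out.printf("Pen is at (%.1f mm, %.1f mm) on page 12%n", xy[0], xy[1]);
    }
}
```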

Anoto also license their technology to hardware manufacturers and a number of Anoto-enabled pens have been produced by companies such as Logitech, Nokia and Hitachi Maxell. The major business application to date has been automated forms processing in large organisations such as insurance companies and hospitals. Handwritten forms can be automatically transformed into digital files, thereby dramatically reducing the time and cost of processing forms and enabling the form-filling process to be traced in detail. Importantly, all this can be achieved without changes to current work practices.

Anoto provides a range of developer tools including the Anoto Forms Design Kit (FDK) and the Software Development Kit (SDK). The Anoto FDK provides a plug-in for Adobe Acrobat that allows developers to generate Anoto-enabled documents by adding capture areas and pidgets to existing PDF documents. A pidget is a special piece of pattern and an associated icon that can be interpreted as commands to the pen. For example, these can be used to create buttons within pages to mark the end of a specific capture or to transmit data to the PC. Once the developer has created the layout of their form, the system will automatically create a printable version of the file that includes the Anoto pattern assigned according to the associated pattern licence. The Anoto SDK can then be used to build the application that will process the data captured from Anoto-enabled documents. More recently, Anoto released the Paper SDK which provides greater flexibility than the original FDK and SDK by allowing developers to create Anoto-enabled documents from applications of their choice. This eases the development process by avoiding a two-stage authoring process and allowing designers to create Anoto-enabled documents directly from within the authoring tools that they currently use. This is done by allowing developers access to core Anoto functionality such as the generation of the pattern and also printing profiles.

Since Anoto pens were originally intended to be used for handwriting capture, they were designed so that pen stroke data is transmitted in batch mode. To enable the pen to be used as an interaction device and not just a capture device requires that the position data can be processed in real time. As described above, interactivity is only supported in the standard Anoto technology on a very limited scale based on pidgets, which are special pieces of pattern with pre-defined functionality intended to be used as some form of command buttons. Hitachi Maxell and Logitech have released digital pens based on Anoto functionality that can also be used in streaming mode where position information is transmitted continuously. This enables the pens to be used for real-time interaction as well as writing capture, and the mode is controlled by the pattern. In addition to the original pattern space, Anoto have developed a so-called streaming pattern space that not only delivers pen position data, but also instructs the pen to transmit it in real time via Bluetooth. Unfortunately, at this point in time, licences for the streaming pattern space are not openly available, probably due to the fact that Anoto intend using this for their own application development. A few research groups have been granted research licences to allow some investigations into the potential of digital pen and paper technologies for more general forms of natural interaction.

Finally, we mention that there are some digital pens that have been adapted or designed to transmit position data in real time based on the original Anoto pattern space. For example, it is possible to adapt the firmware of the Nokia digital pens to operate in streaming mode. The pen can be switched to streaming mode through a special piece of pattern, similar to the pidgets described above, that instructs the pen to change modes. This solution has been used by many research groups to experiment with pen-based interaction based on Anoto technologies. The commercial product known as the FLY Fusion Pentop computer (http://www.flyworld.com), based on Anoto technologies, also allows position data to be processed in real time, but, in this case, they designed a special multimedia pen that is self-contained in that the application runs on the pen and it uses audio as an output channel. The pen was designed primarily for educational applications for children. A similar approach is used in the recently released Pulse smartpen from Livescribe (http://www.livescribe.com), which allows not only handwritten notes to be captured during lectures and meetings, but also the sound. This means that the handwritten notes can be linked to sections of the recorded audio to allow users to later play back parts of the event by simply pointing to the entries within their notes.

3 Interactive Paper

As described in the previous section, Anoto digital pen and paper technology was originally developed for the digital capture of handwriting rather than real-time interaction. However, for many years, researchers had been investigating ways of bridging the paper-digital divide by linking paper to digital media and they quickly recognised the potential of Anoto technologies for pen-based interaction. A number of research groups began to investigate ways in which Anoto technology could be used to support a wide variety of tasks and modes of interaction. These groups mainly worked with the modified Nokia digital pens that could stream data in real time, with a few also experimenting with the more recent Hitachi Maxell pens that recognise the streaming pattern.

Various frameworks have been developed to deal with interactions from an Anoto-enabled paper document. The Anoto SDK described previously supports the simple post-processing of the data captured by the digital pen within a paper form. However, the PaperToolkit framework developed at Stanford University [2], the PADD system from the University of Maryland [3] and the iPaper/iServer infrastructure created at ETH Zurich [4,5] all allow more complex real-time interactions to be managed. The major difference between the iPaper framework developed within our group and the PaperToolkit of Stanford is that we wanted to provide support for everything from rapid prototyping of paper-based interfaces to existing applications through to large-scale publishing of interactive paper documents. Further, adopting ideas familiar from the web, we wanted to offer an authoring rather than programming approach to application development. This means that developers can create their own applications by linking together a rich variety of media and services using graphical authoring tools. Programming is only required when developers need to write their own plug-in to integrate new forms of media or services, or possibly implement their own services.

By using the general cross-media link server iServer as our development platform for iPaper, it is possible to define active areas on paper and bind them to digital resources such as images, videos and web pages or to digital services. The digital pen can then be used in much the same way on paper documents as a mouse would be used during web browsing sessions to navigate links to other web resources or trigger application calls.

Fig. 2. iPaper and iServer (iServer core: links with source and target ends over entities, i.e., selectors and resources, plus users and layers; iPaper plug-in: shapes as selectors over pages as resources)

iServer generalises concepts from hypermedia systems. Any link between two information entities is based on the abstract concepts of resources and selectors as indicated in Fig. 2. The platform was designed to be extensible for new types of media based on a resource plug-in mechanism. By implementing media-specific instances of the resource and selector concepts, any new type of media can be integrated. In the case of the iPaper plug-in, resources are pages and selectors are shapes within a page that define active areas linked to other resources or selectors over these resources. This means that it is possible to define areas within a printed page that, when touched by a digital pen, will activate links to a variety of digital media or services. Note however that active areas may also serve as link targets, meaning that it is not only possible to link from paper to digital, but also from digital back to paper. For example, we have developed applications that will show all the active areas defined within a document that link to a specific video clip. At the moment, this is done by showing the result as an annotated PDF of the document or by giving audio information about the positions of the active areas. However, there are currently researchers working on technologies to turn paper into an active medium and develop paper displays. Therefore, in the future, these technologies could be used to actually highlight regions within the printed document itself.

Through the generality of the RSL link model [6] on which iServer is based, it is possible to have links with multiple sources and/or multiple targets and even links over links. A user management component manages access rights and can be used as the basis for providing personalised and context-dependent delivery of information. The delivery of information can further be controlled by iServer's layer concept. A physical page can have any number of virtual layers on which links are defined. These layers can be activated and deactivated dynamically as well as being re-ordered, and this provides a flexible means of resolving the links to be activated as the result of touching the paper with the pen.
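The following sketch captures the flavour of the resource/selector/link model and the layer-based resolution just described. It is a simplified illustration written for this overview, not the actual iServer or RSL implementation, so all class names and the resolution logic are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of an RSL-style link model: links connect entities,
// where an entity is either a whole resource or a selector over a resource.
// This is not the real iServer code; names and structure are assumptions.
public class LinkModelSketch {

    interface Entity { }

    // A resource is a whole piece of media (page, image, video, web page, ...).
    static class Resource implements Entity {
        final String uri;
        Resource(String uri) { this.uri = uri; }
    }

    // A selector addresses part of a resource; for iPaper, a shape within a page.
    static class Selector implements Entity {
        final Resource resource;
        Selector(Resource resource) { this.resource = resource; }
    }

    // iPaper-specific selector: a rectangular active area on a printed page.
    static class Shape extends Selector {
        final double x, y, width, height; // in page coordinates (mm)
        final int layer;                  // virtual layer the link is defined on

        Shape(Resource page, double x, double y, double w, double h, int layer) {
            super(page);
            this.x = x; this.y = y; this.width = w; this.height = h; this.layer = layer;
        }

        boolean contains(double px, double py) {
            return px >= x && px <= x + width && py >= y && py <= y + height;
        }
    }

    // A link may have multiple sources and targets (links over links are omitted here).
    static class Link {
        final List<Entity> sources = new ArrayList<>();
        final List<Entity> targets = new ArrayList<>();
    }

    private final List<Link> links = new ArrayList<>();

    void add(Link link) { links.add(link); }

    // Resolve a pen event on a page: among the active areas that contain the
    // position, the one on the topmost currently active layer wins.
    Link resolve(Resource page, double px, double py, List<Integer> activeLayers) {
        Link best = null;
        int bestLayer = Integer.MIN_VALUE;
        for (Link link : links) {
            for (Entity source : link.sources) {
                if (source instanceof Shape) {
                    Shape s = (Shape) source;
                    if (s.resource == page && s.contains(px, py)
                            && activeLayers.contains(s.layer) && s.layer > bestLayer) {
                        best = link;
                        bestLayer = s.layer;
                    }
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        LinkModelSketch model = new LinkModelSketch();
        Resource brochurePage = new Resource("edfest-brochure/page-12");
        Resource video = new Resource("http://example.org/venue-tour.mp4");

        Link link = new Link();
        link.sources.add(new Shape(brochurePage, 20, 150, 10, 10, /*layer*/ 1));
        link.targets.add(video);
        model.add(link);

        Link hit = model.resolve(brochurePage, 25, 155, List.of(0, 1));
        System.out.println(hit != null ? "link activated" : "no link");
    }
}
```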


Fig. 3. EdFest Interactive Brochure (sample event entry showing rating pictograms, event title, venue details, description and date line)

In addition to linking to digital media, active areas can also be linked to pieces of programming logic known as active components. These may be bridges to existing applications or independent pieces of application logic such as components for handwriting capture. The current set of active components offers a wide range of functionality, including services to link to SQL databases and also Microsoft applications such as PowerPoint. For example, the application PaperPoint [7] allows PowerPoint presentations to be controlled and annotated from printed slide handouts by defining the slide views and also various printed buttons as active areas that link to active components that send commands to PowerPoint.

A wide variety of interactive paper applications have been developed using the iPaper framework, ranging from form-like paper-based interfaces to existing applications to digitally-augmented documents. In the latter case, the printed document can be used on its own and generally provides the reader with access to core, static information which can optionally be augmented with digital information and services as a special value-added service. We have focussed a lot of attention on the potential use of digitally-augmented documents for tourists on the move and describe two such applications in the next section to show possible ways in which information services can be accessed from interactive paper documents.
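To make the notion of active components described at the start of this section more concrete, the sketch below defines a minimal component interface and a PaperPoint-style component that forwards slide selections to a presentation application. The interface and the command it sends are illustrative assumptions, not the published iPaper API.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of "active components": link targets that execute application
// logic when an active area is touched with the pen. The interface and the
// example component are illustrative only, not the actual iPaper API.
public class ActiveComponentSketch {

    // Context handed to a component when its active area is activated.
    static class PenEvent {
        final String documentId;
        final int page;
        final double x, y;

        PenEvent(String documentId, int page, double x, double y) {
            this.documentId = documentId; this.page = page; this.x = x; this.y = y;
        }
    }

    interface ActiveComponent {
        void activate(PenEvent event);
    }

    // A PaperPoint-style component: touching a printed slide thumbnail causes
    // the corresponding slide to be shown by the presentation application.
    static class ShowSlideComponent implements ActiveComponent {
        private final int slideNumber;

        ShowSlideComponent(int slideNumber) { this.slideNumber = slideNumber; }

        @Override
        public void activate(PenEvent event) {
            // In a real system this would call out to PowerPoint (e.g. via its
            // automation interface); here we only log the command that would be sent.
            System.out.println("Presentation application: go to slide " + slideNumber);
        }
    }

    public static void main(String[] args) {
        // Bind active areas (identified here simply by a string key) to components.
        Map<String, ActiveComponent> bindings = new HashMap<>();
        bindings.put("handout-page1-slide3", new ShowSlideComponent(3));

        // Simulate the pen touching the active area for slide 3 on the handout.
        PenEvent event = new PenEvent("paperpoint-handout", 1, 42.0, 120.0);
        bindings.get("handout-page1-slide3").activate(event);
    }
}
```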

4 Accessing Information

In this section, we describe two forms of interaction with paper documents that can be used to obtain access to information that is either dynamic or supplementary to that printed on the page. Both are based on examples related to tourism: the first is an event brochure for the Edinburgh Fringe Festival and the second a digitally augmented version of the Rough Guide to Switzerland (http://www.roughguides.com).

One of the major challenges of mobile information systems is how to support interaction with users on the move. Most research projects have focussed on small mobile devices such as PDAs and mobile phones, and yet it is accepted that these can be problematic when accessing information systems of any complexity [8]. Restricted screen sizes make it difficult to compare and combine information and to support various forms of collaboration. A further problem of current mobile devices is the difficulty of reading in bright sunlight and also that of power consumption. We therefore decided to investigate the use of combining paper and audio as a means of accessing information in mobile settings. Specifically, we developed a set of interactive paper documents for visitors to the Edinburgh Fringe festival using a text-to-speech engine for delivery of database content.

The EdFest system that we developed is based around a set of three interactive paper documents and an audio output channel. The set of documents includes an event brochure, a map and also a physical bookmark which could be used to mark places in the brochure and provides additional services such as searches and a ticket reservation system. Figure 3 shows an entry of the interactive paper event brochure. The printed information corresponds to the entries in the existing printed Fringe festival brochures. We added pictograms which are active areas linking to various forms of additional information and services. Under the title of the event, there are three pictograms. The one to the left is used to select the desired event when reserving tickets or adding a review. The "i" pictogram provides summary information about the event in terms of the artists, associated awards and any other special remarks about the event such as its suitability for children. The pictogram to the right of this provides access to user reviews. The pictograms located under the description of the event venue provide access to additional information about the venue such as its location, public transport, disabled access and catering facilities. The dates shown at the bottom of the entry correspond to the dates of the festival and the ones with pictograms are the dates when the event is taking place. Touching a date with the pen provides the user with information about ticket availability for that date. To the right of the date line is an alarm clock pictogram which allows a user to set a reminder for an event. This service is only available to users who have registered their mobile phones, in which case they will be sent a reminder by SMS half an hour before the start of the event. Last but not least, the rating pictograms at the top right of the event entry allow users to access the average rating of the event and also input their own rating.

At the back of the brochure, there are empty pages for users to write their own comments about events. These pages consist of an active area that links to an active component for handwriting capture. When the user has written their comment, they finish the capture by touching a special button at the bottom of the page and then select the event within the brochure to which the comment should be linked.

The interactive paper map gives users information about venues and events close to locations specified by touching the map with the pen. It also allows users to be guided to their own location or the location of event venues. In the case of the former, the user's position is given by a GPS, and by touching a "Where am I?" button at the top of the map, the system tells the user where they are located on the map. At first, this is done through a general grid reference such as "top left of quadrant G2". The user is then instructed to touch the map with the pen and the system gives audio instructions as to how they should move the pen to position it at the required location. This is repeated until the user touches the correct location on the map with the pen. Similarly, users touching on the event location pictogram within the event brochure entry are guided to the position of the venue on the map.

The services that we have described so far show three different forms of interaction with the paper documents. One involves touching on pictograms to activate links in much the same way as is done using a mouse to activate hyperlinks in web pages. The second involves a mode switch so that pen strokes are recorded to enable the digital capture of handwriting. Handwritten texts could then be later retrieved as images or as digital texts using intelligent character recognition software. The third type of interaction is a form of linking from the digital world to paper as the pen is used as a guiding device to locate a position on paper.

While it is beyond the scope of this paper to describe the architecture and systems behind the EdFest system in detail, we note that the system was based on an event database constructed from XML data provided by the festival organisers. The system operates in parallel with a regular web-based interface that demonstrates how the system could be accessed through public kiosks where personalised daily event programmes could be printed as interactive paper documents on demand. The production of the documents as well as the operation of the system was based on a content-publishing approach. Further details of the EdFest system and the publishing process are available in [9].

Event brochures such as the one used in EdFest are well-structured, and the brochure along with the link definitions could be generated from the event database using templates which define the various active areas and links to information services. In a second project, we wanted to produce an interactive version of a general tourist guide which, in contrast to an event brochure, mainly consists of unstructured textual descriptions and images. One approach to designing an interactive paper version of such a guide would be to author links on paper for place names, keywords etc. in much the same way as would be done for web pages. Again, similar to web conventions, underlining or the use of different colour fonts could be adopted as a visual means of showing links within printed documents. Two key disadvantages of such an approach are the fact that it may require a lot of manual authoring of links and it can be disruptive to the content if many links are present. We therefore decided to investigate an alternative approach based on gesture-based interaction.

Fig. 4. Gesture-based Interaction

Figure 4 shows part of an interactive paper version of the Rough Guide to Switzerland. Users can select items of interest based on gestures such as circling words or underlining phrases. We developed a prototype to demonstrate how gestures could be used to access a variety of information services and how this could be supported in the underlying infrastructure for interactive paper. Two services involved a Google search on a selected phrase. One gesture was used for standard Google searches returning result lists and another for the "I feel lucky" search to return a single web page directly. The other service was a database search of related images. We have yet to carry out detailed studies to determine what is a good gesture set for this application.

The development of gesture-based interfaces requires experimentation with proposed gesture-recognition algorithms and also specific gesture sets. We therefore developed a Java-based framework to support both the use of existing algorithms and also the design of new algorithms. The resulting framework, iGesture (http://www.igesture.org), also provides a graphical tool for the creation and management of gesture sets. Through the tool, it is simple to compare existing algorithms for a specific gesture set as well as evaluating new algorithms [10].

In the case of interactive paper documents, the technical challenge of providing gesture-based interfaces of the kind used in the Rough Guide is the ability to map a gesture on a physical page, such as the circling of a word, back to the corresponding word element within the digital document in order that it can be used as a search parameter. Similar issues were faced in an application called PaperProof that we developed to allow digital documents to be edited by marking up a printed version of the document [11]. In this case, we need to be able to interpret gestures such as striking through a word as a delete command to be executed in the original source document. Further, we want to be able to do this even if the digital document has been edited in parallel. A general solution is to maintain a mapping between elements in the digital and physical instances of documents as described in [12].

Gesture-based interfaces offer a very natural style of interaction using digital pens. It is interesting to note that normal pens are often used as pointing devices during collaboration as well as writing devices, and these two modes correspond to the two styles of interaction that we have described in this section. In the next section, we consider how gesture-based interfaces might be used to support the capture of information.
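As a rough illustration of the gesture-to-service dispatch described in this section, the sketch below maps recognised gestures over printed text to the three example services. The gesture names, the way the selected phrase is recovered and the query strings are all assumptions made for the example; the real iGesture API is not used here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of dispatching a recognised gesture on a printed page to an
// information service. Gesture names and interfaces are invented for this
// example; the actual iGesture framework API is not used.
public class GestureDispatchSketch {

    // Result of gesture recognition plus the phrase it was drawn over,
    // recovered from the mapping between the printed and digital document.
    static class RecognisedGesture {
        final String gestureName; // e.g. "circle", "underline", "double-underline"
        final String selectedPhrase;

        RecognisedGesture(String gestureName, String selectedPhrase) {
            this.gestureName = gestureName;
            this.selectedPhrase = selectedPhrase;
        }
    }

    private final Map<String, Function<String, String>> services = new HashMap<>();

    public GestureDispatchSketch() {
        // Each gesture is bound to a service that takes the selected phrase.
        // The query URLs below are purely illustrative.
        services.put("circle",
                phrase -> "http://www.google.com/search?q=" + phrase.replace(' ', '+'));
        services.put("underline",
                phrase -> "lucky-search:" + phrase); // direct single-result search
        services.put("double-underline",
                phrase -> "image-database://query?keyword=" + phrase.replace(' ', '+'));
    }

    public String dispatch(RecognisedGesture gesture) {
        Function<String, String> service = services.get(gesture.gestureName);
        if (service == null) {
            return "no service bound to gesture " + gesture.gestureName;
        }
        return service.apply(gesture.selectedPhrase);
    }

    public static void main(String[] args) {
        GestureDispatchSketch dispatcher = new GestureDispatchSketch();
        RecognisedGesture g = new RecognisedGesture("circle", "Matterhorn");
        System.out.println(dispatcher.dispatch(g)); // standard search on the circled word
    }
}
```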

5 Capturing Information

As described earlier, Anoto technology was first developed for the digital capture of handwriting and the major business application to date has been automated forms processing. In this section, we consider other ways in which information could be captured from paper documents.

Information is frequently accessible in the form of printed documents. For example, as already discussed, it is still common for information such as tourist guides and event brochures to be accessed on paper in preference to the web, especially in mobile settings. Also, it is common to read information about possible future travel destinations in flight magazines. During the reading process, users often filter these items, selecting those that best match their interests, financial resources or likely travel plans. But having performed this selection process, how can readers store that information for later retrieval? The only possibilities at the moment are either to copy the information (or at least some link to it) or to take the document (or part of it) with you. In both cases, the main problem is one of how to store that information so that it can easily be found. While organised individuals might create a database of travel information that they can search in the future, the effort currently required to create such a database and enter the data would discourage most of us from doing this. It would be useful if we could find an easy way of capturing these information scraps and managing them in a database. Interactive paper and gesture-based interfaces could provide the foundations for such a system.

Fig. 5. Gesture-Based Capture

Consider the example of part of a flight magazine shown in Fig. 5. The page lists information about various hotels with the descriptions in English and German. We show how gestures along with annotations could be used to capture that information. Assume the following sequence of actions:

1) a user selects the area describing the basic information about the hotel by drawing the upper left corner and lower right corner gestures that define the corresponding area on the page;
2) the user then writes the type of the object that the information describes alongside the area, i.e. the text "hotel";
3) the user selects the English description of the hotel by again using the upper left corner and lower right corner gestures around the text;
4) the user writes "description" alongside the text and links it to the hotel object by drawing an arrow connecting the two text areas.

Using such a sequence of gestures and annotations, a system could capture the selected information and create objects and associations within the database to represent it. In this way, users could capture all sorts of information while reading and store it in such a way that it could easily be retrieved later.

There are a number of open questions as to exactly how the information should be stored and what gestures and annotations would be most appropriate, and this is a current topic of research within our group. If one considers the above example, a range of possibilities exist. The information selected could be stored as simple text objects with associated keywords and relationships between objects. An alternative would be to interpret the information according to a fixed schema. For example, the annotation "hotel" could specify that this is a hotel object and the selected text could be parsed to try and automatically extract attributes such as the address and phone number. This approach is similar in some ways to various approaches that have been adopted for the automatic discovery and extraction of information published on web pages. Further, the selection of the text annotated with "description" together with the arrow gesture linking it to the hotel object could be used to denote that this should be represented as an attribute of the hotel object with the corresponding name. Since this is an area of on-going research, we do not present any fixed solutions here, but rather want to draw attention to the possibilities offered by digital pen and paper technology for innovative ways of capturing information while reading.
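A minimal sketch of how such a sequence of gestures and annotations might be turned into objects and attributes is shown below, using the hotel example. It illustrates only one of the representation options discussed above; every class name, and the sample hotel text, is a hypothetical illustration rather than a fixed design.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: turning captured regions and handwritten annotations
// into simple objects and attributes, following the "hotel"/"description"
// example in the text. This is one possible representation, not a fixed design.
public class CaptureToDatabaseSketch {

    // A region selected on paper with corner gestures, together with the text
    // recovered from the digital version of the document and the handwritten
    // annotation written alongside it (after character recognition).
    static class CapturedRegion {
        final String annotation;    // e.g. "hotel" or "description"
        final String extractedText; // text of the selected area in the digital document
        CapturedRegion linkedTo;    // set when an arrow gesture connects two regions

        CapturedRegion(String annotation, String extractedText) {
            this.annotation = annotation;
            this.extractedText = extractedText;
        }
    }

    // A very simple object store: each object has a type and named attributes.
    static class CapturedObject {
        final String type;
        final Map<String, String> attributes = new LinkedHashMap<>();
        CapturedObject(String type) { this.type = type; }
    }

    // Interpret the captured regions: regions without an arrow become objects,
    // regions linked by an arrow become attributes of the object they point to.
    static List<CapturedObject> interpret(List<CapturedRegion> regions) {
        Map<CapturedRegion, CapturedObject> objects = new LinkedHashMap<>();
        for (CapturedRegion r : regions) {
            if (r.linkedTo == null) {
                CapturedObject obj = new CapturedObject(r.annotation);
                obj.attributes.put("text", r.extractedText);
                objects.put(r, obj);
            }
        }
        for (CapturedRegion r : regions) {
            if (r.linkedTo != null && objects.containsKey(r.linkedTo)) {
                objects.get(r.linkedTo).attributes.put(r.annotation, r.extractedText);
            }
        }
        return new ArrayList<>(objects.values());
    }

    public static void main(String[] args) {
        // Steps 1-2: select the basic hotel information and annotate it "hotel".
        CapturedRegion hotel = new CapturedRegion("hotel",
                "Hotel Bellevue, Zermatt (hypothetical sample text)");
        // Steps 3-4: select the English description, annotate it "description"
        // and link it to the hotel object with an arrow gesture.
        CapturedRegion description = new CapturedRegion("description",
                "A quiet family-run hotel with views of the Matterhorn.");
        description.linkedTo = hotel;

        for (CapturedObject obj : interpret(List.of(hotel, description))) {
            System.out.println(obj.type + " -> " + obj.attributes);
        }
    }
}
```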

6 Conclusions

We have presented an overview of digital pen and paper technologies and shown various ways in which interactive paper could be used to access and capture information. Just as paper plays an important role in document life cycles, we believe that it could play an important role in providing natural forms of interaction with information systems in various settings.

Acknowledgements. Many members of the Global Information Systems group at ETH Zurich have contributed to the work reported in this paper over the last seven years. Special thanks are due to Beat Signer who has led these activities, and Nadir Weibel and Adriana Ispas who are currently working in interactive paper projects.

References

1. Sellen, A.J., Harper, R.: The Myth of the Paperless Office. MIT Press, Cambridge (2001)
2. Yeh, R.B., Klemmer, S.R., Paepcke, A.: Design and Evaluation of an Event Architecture for Paper UIs: Developers Create by Copying and Combining. Technical report, Stanford University, Computer Science Department (2007)
3. Guimbretière, F.: Paper Augmented Digital Documents. In: Proceedings of UIST 2003, 16th Annual ACM Symposium on User Interface Software and Technology, Vancouver, Canada, pp. 51–60 (2003)
4. Norrie, M.C., Signer, B., Weibel, N.: General Framework for the Rapid Development of Interactive Paper Applications. In: Proceedings of CoPADD 2006, 1st International Workshop on Collaborating over Paper and Digital Documents, Banff, Canada, pp. 9–12 (2006)
5. Signer, B.: Fundamental Concepts for Interactive Paper and Cross-Media Information Spaces. PhD thesis, ETH Zurich (2006)
6. Signer, B., Norrie, M.C.: As We May Link: A General Metamodel for Hypermedia Systems. In: Parent, C., Schewe, K.-D., Storey, V.C., Thalheim, B. (eds.) ER 2007. LNCS, vol. 4801, pp. 359–374. Springer, Heidelberg (2007)
7. Signer, B., Norrie, M.C.: PaperPoint: A Paper-Based Presentation and Interactive Paper Prototyping Tool. In: Proceedings of TEI 2007, First International Conference on Tangible and Embedded Interaction, Baton Rouge, USA, pp. 57–64 (2007)
8. Pashtan, A., Blattler, R., Heusser, A., Scheuermann, P.: CATIS: A Context-Aware Tourist Information System. In: Proceedings of IMC 2003, 4th International Workshop of Mobile Computing, Rostock, Germany (2003)
9. Signer, B., Grossniklaus, M., Norrie, M.: Interactive Paper as a Mobile Client for a Multi-Channel Web Information System. World Wide Web (WWW) Journal 10 (2007)
10. Signer, B., Kurmann, U., Norrie, M.C.: iGesture: A General Gesture Recognition Framework. In: Proceedings of ICDAR 2007, 9th International Conference on Document Analysis and Recognition, Curitiba, Brazil (2007)
11. Weibel, N., Ispas, A., Signer, B., Norrie, M.: PaperProof: A Paper-Digital Proof-Editing System. In: Proceedings of CHI 2008, 26th SIGCHI Conference on Human Factors in Computing Systems: Interactivity Track, Florence, Italy (2008)
12. Weibel, N., Norrie, M.C., Signer, B.: A Model for Mapping between Printed and Digital Document Instances. In: Proceedings of DocEng 2007, ACM Symposium on Document Engineering, Winnipeg, Canada (2007)

Service Engineering for the Internet of Services

Jorge Cardoso, Konrad Voigt, and Matthias Winkler

SAP Research CEC, Chemnitzer Strasse 48, 01187 Dresden, Germany
{jorge.cardoso,konrad.voigt,matthias.winkler}@sap.com

Abstract. The Internet and the Web have extended traditional business networks by allowing a Web of different digital resources to work together to create value for organizations. The most industrialized countries have entered a post-industrial era where their prosperity is largely created through a service economy. There is a clear transition from a manufacturing based economy to a service based economy. From the technological perspective, the development of Web-based infrastructures to support and deliver services in this new economy raises a number of challenges. From a business perspective, there is the need to understand how value is created through services. In this paper, we describe how we propose to address these two perspectives and realize the vision of the Internet of Services (IoS), where Web-based IT-supported service ecosystems form the base of service business value networks. This paper addresses the main challenging issues that need to be explored to provide an integrated technical and business infrastructure for the Internet of Service. Keywords: Internet of Services, service engineering, service, e-service, web service, business models.

1 Introduction

Throughout the years, organizations have always tried to introduce new business models to gain a competitive advantage over competitors or to explore hidden markets. For example, IKEA introduced the concept that people could transport the merchandise and assemble the furniture by themselves. eBay gained an early competitive advantage by being first to market with a new business model based on auctions. Dell was able to bypass distributors, resellers, and retailers and use the Internet to reduce costs. In all these examples, the new or adapted business models are often derived from the human perception that something could be done in a different way. The idea very often comes from intuition and is driven by a business need. Recently, the vision of the Internet of Services (IoS) [13] emerged; it can be seen as a new business model that can radically change the way we discover and invoke services. The IoS describes an infrastructure that uses the Internet as a medium for offering and selling services. As a result, services become tradable goods. Service marketplaces, where service consumers and providers are brought together to trade services and so engage in business interaction, are an enabling technology for the IoS vision. Thus, the IoS provides the business and technical base for advanced business models where service providers and consumers form business networks for service


provision and consumption. Within these business networks, organizations work together to deliver a service to consumers. For example, a service-based value network may include the research, development, design, production, marketing, sales and distribution of a particular service. All these phases work together to add to the overall worth of a service. Value is created from the relationship between the company, its customers, intermediaries, aggregators and suppliers. In the Internet of Services, the underlying IT perspective provides a global description of the standards, tools, applications, and architectures available to support the business perspective. Currently, the service-oriented architecture paradigm has gained mainstream acceptance as a strategy for consolidating and repurposing applications so that they can be combined with new applications in more dynamic environments through configurable services. Services that are composed into advanced business processes can interoperate with other services in order to support business processes spanning organizational boundaries. In this paper we describe the challenges of supporting the concept of the Internet of Services. We enumerate the areas that need to be explored to provide fundamental insights into the research that will enable a radically new way to trade services on the Internet. A special focus is placed on service marketplaces, as an enabling technology for the IoS, and on service engineering, which we see as a fundamental approach for creating services. The remainder of this paper is structured into four main sections. In Section 2 we describe the role of marketplaces and clarify the concept of a service for the IoS. Section 3 identifies a set of requirements that needs to be addressed to support the concept of the IoS and to provide, create and drive a new “service industry” for producing, changing, adapting, (re)selling, and operating services in a Web-based business service economy. In Section 4, we discuss the importance of Service Engineering (SE) for the IoS. SE is a new discipline that will enable the development and implementation of technological solutions based on the Internet of Services. Finally, Section 5 presents our conclusions.

2 Marketplaces for the Internet of Services Electronic marketplaces for products have gained much attention over the last years enabling business interaction between providers and consumers of physical goods. Examples of such marketplaces include eBay and Amazon. In the IoS vision, services are seen as tradable goods that can be offered on service marketplaces by their providers to make them available for potential consumers. [2] describe service marketplaces as one example of web service ecosystems which represent “… a logical collection of web services whose exposure and access are subject to constraints, which are characteristic of business service delivery.” On a service marketplace multiple providers may offer their services. Providers may be large providers as well as small companies offering specialized services. As such, an ecosystem of competing as well as collaborating services may be created. 2.1 What Are Services? The terms Service, e-Service and Web Service have been widely used to refer, sometimes, to the same concept and other times to different concepts. These terms are generally used to identify an autonomous software component that is uniquely identified by a URI and that can be accessed using standard Internet protocols such as


XML, SOAP, or HTTP. [1] have identified that the terms Service, e-Service and Web Service actually address related concepts from different domains such as computer science, information science and business science. We believe that a deeper understanding of these concepts is needed in order to conceptually separate them and address the various stakeholders involved when architecting an enterprise-wide solution based on services. Therefore, we introduce a set of definitions for the IoS.

Business Service. In business and economics, a service is the non-material equivalent of a good. In these domains, a service is considered to be an activity which is intangible by nature. Services are offered by a provider to its consumers. We adopt the definition from [1], who defined business services as business activities provided by a service provider to a service consumer to create value for the consumer. This definition is consistent with how the term is also understood in the business research community. In traditional economies, business services are typically discovered and invoked manually, but their realization may be performed by automated or manual means (Figure 1). Therefore, business services may be performed by humans. Examples include cutting hair, painting a house, typing a letter, or filling in a form. If a service is executed by means of automated mechanisms, then processing an insurance claim is also considered a service. Services are considerably different from products, primarily due to their intangible nature. Most products can be described physically based on observable properties, such as size, color, and weight. Services, on the other hand, lack concrete characteristics. Thus, services must be defined indirectly in terms of the effects they have on consumers. Products usually have a well-defined set of possible variants for customization. For example, if a consumer requires a faster laptop, a more powerful CPU can be designed, built and attached to the motherboard. If an important consumer (e.g. Yellow Cab Co.) desires yellow cars, a manufacturer only needs to notify the production chain to select a new color. The same cannot be easily achieved for services. This makes the description of services one of the most important undertakings for the IoS.

e-Service. With the advances made by the Internet, companies started to use electronic information technologies for supplying services that were to some extent processed by means of automated applications. At this stage, the concept of e-service [12], electronic or e-commerce, was introduced to describe transactions conducted

[Figure 1 contrasts three invocation channels: a manual request for a bank transfer (Business Service), an EDI request for a bank transfer (e-Service), and a Web-based request for a bank transfer (Web Service).]

Fig. 1. Examples of invocations of Business Services, e-Services and Web Services


over the Internet. The main technology that made e-commerce a reality was computer networks. Initial developments included on-line buying and selling transactions where business was done via Electronic Data Interchange (EDI). Examples of such transactions include an EDI request for a bank transfer (Figure 1) or a money transfer via a private network. While many definitions for e-services can be found in the literature, we will use the definition given by [6] since it adequately matches our service concept: “an e-service is a collection of network-resident software services accessible via standardized protocols, whose functionality can be automatically discovered and integrated into applications or composed to form more complex services.” Therefore, we consider e-services to be a subset of business services (Figure 1). E-services are services for which the Internet (or any other equivalent network, such as mobile and interactive TV platforms) is used as a channel to interact with consumers. Virtually any service can be transformed into an e-service if it can be invoked via a data network. It should be pointed out that this definition implies that the ability to withdraw money from an ATM is supplied through an e-service. E-services are independent of the specification language used to define their functionality, non-functional properties or interface. The term Internet Services will be used to refer to the discovery and invocation of e-services using the Internet as a channel. In this paper, when no ambiguity arises, we will use the term service to refer to an Internet service.

Web Service. Web services are e-services that are made available to consumers using Web-based protocols or Web-based programs. Separating the logical and technical layer specifications of a service leaves open the possibility of alternative concrete technologies for e-services. Nowadays, we can identify three types of Web services: RPC Web Services, SOA Web Services, and RESTful Web Services. RPC Web Services bring distributed programming functions and methods from the RPC world; some researchers view RPC Web services as a reincarnation of CORBA in the form of Web services. SOA Web Services implement an architecture according to SOA, where the basic unit of communication is a message, rather than an operation. This is often referred to as “message-oriented” services. Unlike RPC Web services, loose coupling is achieved more easily since the focus is on the “contract” that WSDL provides, rather than on the underlying implementation details. RESTful Web Services are based on HTTP and use a set of well-known operations, such as GET, PUT, and DELETE. The main focus is on interacting with stateful resources, rather than messages or operations (as it is with WSDL and SOAP). [4] describes the REST objectives in the following way: “The name ‘Representational State Transfer’ (REST) is intended to evoke an image of how a well-designed Web application behaves: a network of web pages (a virtual state-machine), where the user progresses through an application by selecting links (state transitions), resulting in the next page (representing the next state of the application) being transferred to the user and rendered for their use.” We also consider that any e-service that can be invoked using Web standards, such as HTTP, is also a Web service.

2.2 Discovery, Invocation and/or Execution of Services

The lifecycle of services includes two main phases that are of importance to the IoS (Figure 1): discovery/invocation and execution. Discovery and invocation refer to the medium and technology used to find and request the execution of a particular service (human-based, via EDI, Web-based, etc.). The execution describes how the realization


of a service is carried out. A service may be carried out only with human involvement, with a combination of humans and automated devices, or by relying purely on automated machines (Figure 2). Therefore, services in the vision of the IoS may lie anywhere in the spectrum between services executed by humans, on the one end, and purely automated services on the other. Nonetheless, in the IoS, the discovery and invocation of all services is IT-based. Service marketplaces provide the access point to the services made available by their providers. One example of a service whose invocation is IT-based but whose execution is performed by humans would be a house painting service where a consumer selects the painter (service provider) and the color of the house using a service marketplace. The painting of the house is, of course, done by humans.

Fig. 2. Relationships between Business Service, e-Service and Web Service

2.3 Atomic vs. Composite Services The vision of marketplaces for services was previously described by [11]. They describe two steps in the evolution of marketplaces for offering services. As a first step basic services are offered on an electronic marketplace. The focus is on making services available via a new medium – the Internet – as opposed to the real world. Advanced mechanisms for searching and finding suitable services based on their properties were created and support for automatic negotiation of provisioning terms between the service provider and the service consumer was established. Consumers were able to create a one-to-one relationship with a provider. As a second step more complex relationships between consumers and providers are to be formed. Basic services can, in many cases, only provide basic functionality. To provide more complex functionality, composite services may be created from basic services or other


composite services and offered on the marketplace. This might mean that there is a need for interaction of the different services with each other. In the scope of our work we are striving for the second step. There will not only be service providers and service consumers involved in business interactions. A further role often called service aggregator will be performed by entities that specialize on combining services into more complex service compositions or bundles which are then offered on the marketplace to be found and used by service consumers. While the aggregated services are sold to the service consumers by the aggregator, the individual parts they consist of (atomic services or other service compositions) will be provided by their original providers who will receive a payment from the aggregator. The added value created by the service aggregator stems from the selection and composition of available services into more complex services or bundles which constitute an advantage for the consumer. Aspects of service supply and demand such as service discovery, monitoring, charging and payment as well as publishing, composition, and re-provisioning (leasing, licensing) that are important for the different stakeholders (service providers, aggregators and clients) need to be regulated by the service marketplace [2]. In order to do so, service marketplaces need to provide a broad range of functionality which we will describe in more detail in the next section of this paper.
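To make the aggregator role described above a little more tangible, the following toy sketch (not from the paper; class names, providers and prices are purely illustrative) models a composite service as a bundle of atomic services whose summed price the aggregator offers on the marketplace while the individual parts remain attributable to their original providers:

import java.math.BigDecimal;
import java.util.List;

// Toy illustration of service aggregation: a composite service bundles atomic services from
// different providers; the aggregator sells the bundle and pays the original providers their parts.
public class CompositeServiceSketch {
    record AtomicService(String provider, String name, BigDecimal price) {}
    record CompositeService(String aggregator, List<AtomicService> parts) {
        BigDecimal totalPrice() {
            return parts.stream().map(AtomicService::price).reduce(BigDecimal.ZERO, BigDecimal::add);
        }
    }

    public static void main(String[] args) {
        CompositeService tripBundle = new CompositeService("TravelAggregator", List.of(
                new AtomicService("AirlineA", "flight booking", new BigDecimal("30.00")),
                new AtomicService("HotelB", "hotel reservation", new BigDecimal("20.00"))));
        System.out.println(tripBundle.totalPrice()); // 50.00, the price offered on the marketplace
    }
}

In practice the aggregator's added value lies in selecting and combining the parts, possibly with a margin on top of the summed provider prices.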

3 Requirements for the IoS and Marketplaces

Marketplaces, as enablers of the IoS vision, need to offer complex functionality for bringing together a large number of consumers, aggregators, and providers, and to enable the means for business interactions among them. Service providers offer their services on the marketplace. Consumers select services from different providers based on functionality, best pricing, offered quality of service, or rating. After the service has been selected, it is delivered by the provider. Finally, the client has to pay the amount due for the service consumption. In order to pave the way for the IoS, there is a need to identify and understand the underlying challenges that need to be addressed to provide solutions realizing the vision. The IoS bridges the business and IT infrastructure perspectives. As a result, IoS requirements need to place a strong emphasis on both the business and the IT side. Therefore, the following topics need to be analyzed, studied and framed within the IoS.

Legal and Community Aspects, and Business Models. The implications of business value networks need to be studied from a legal perspective. The combination and integration of world-wide regulations is fundamental. A special emphasis has to be given to the generation of new business models for all stakeholders (i.e., service providers, aggregators, and consumers) and corresponding incentive mechanisms. Community aspects encourage cooperation and boost innovation through the extensive exchange of knowledge.

Service Search based on Advanced Service Description. Marketplaces need to offer search mechanisms that allow for heterogeneous search criteria. At the base of such search mechanisms, a framework for describing different aspects of services is needed. A suitable service description framework covers not only the functional description of a service but also aspects such as pricing, quality of service, user rating,


and legal aspects among others. Consumers will be able to search for service functionality based on functional classifications such as UN/SPSC or eCl@ss [5] or natural language based descriptions. The search results may then be further refined taking into consideration a large variety of non-functional properties, thus leading to improved search results [8]. Negotiation of Service Level Agreements (SLA). Prior to interacting with a service, the consumer needs to create an agreement with the service provider stating the terms under which the service need to be provided. Rights as well as obligations of both parties regarding the service consumption will be described. The aspects specified in an SLA (e.g. QoS and pricing) will be derived from the service description as it provides the base for negotiation. Service Monitoring. In order to enable trust among the participants of a business interaction, there is a need for monitoring the interaction. The goal is to make sure that service providers deliver the service under the terms promised to the consumer. The monitoring of functionality may be provided by the marketplace itself or by a trusted third party (e.g. a monitoring service). The base for the monitoring is the SLA negotiated between the provider and the consumer. Billing and Payment. Consumers need to pay for the services they have consumed. The marketplace needs to provide the means to assure that clients pay the correct amount of money. An invoice will be created based on the usage information gathered during the interaction with services. During the payment process, a consumer will need to transfer the amount due according to a predefined payment method. Billing and payment functionality may be provided either by the marketplace or by a third party service. Service Governance. Governance addresses the strategic alignment between business services and business requirements thereby reducing risks and assuring compliance with rules and regulations. The ability to freely compose and orchestrate business functions which are available as services on a diversity of marketplaces bears overwhelming opportunities. Trust and trustworthiness of service offerings must be facilitated by the platform, balancing individual requirements, policies, and must be capable of adapting to the given business context. Service Delivery Platform. An infrastructure for service delivery has to be provided for technically enabling businesses to engage in business transactions. This infrastructure has to be scalable with respect to complexity, i.e., its consumers must be able to counter the intricacies of distributed systems. Platform services need to be provided directly by the platform. These fundamental services include brokering, mediation, billing, security and trust services. Service Engineering. Service engineering (SE) is a new approach to the analysis, design and implementation of service-based ecosystems in which organizations and IT provide value for others in the form of services. SE not only provides methodologies to handle the increased complexity of numerous business actors and their value exchanges, but also provides tools for constructing and deploying services that merge the IT and business perspectives.
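As a minimal illustration of how a service description feeds SLA negotiation and monitoring, consider the sketch below; it is not a TEXO artifact, and the field names and thresholds are invented for illustration only.

// Minimal data sketch: the non-functional part of a service description is the basis for the SLA,
// and monitoring compares observed values against the agreed terms.
public class SlaSketch {
    record ServiceDescription(String serviceId, double pricePerCall, long maxResponseTimeMs) {}
    record ServiceLevelAgreement(String provider, String consumer, ServiceDescription agreedTerms) {}

    // A monitoring component (the marketplace or a trusted third party) could flag violations like this.
    static boolean violates(ServiceLevelAgreement sla, long observedResponseTimeMs) {
        return observedResponseTimeMs > sla.agreedTerms().maxResponseTimeMs();
    }

    public static void main(String[] args) {
        ServiceDescription offer = new ServiceDescription("payment-service", 0.05, 2000);
        ServiceLevelAgreement sla = new ServiceLevelAgreement("ProviderA", "ConsumerB", offer);
        System.out.println(violates(sla, 3500)); // true: the provider missed the agreed response time
    }
}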


These requirements are the basis for the TEXO project [14], whose main goal is to develop new business models for the Web. It targets the development of an (open) platform for the development, distribution and provision of (business) services. While all these topics are important to support the vision of the IoS, in this paper we concentrate our study on the emerging research discipline termed Service Engineering. The ISE methodology, which we describe in the next section, provides the means for suppliers of services to create new services.

4 Service Engineering One recent development that is believed to allow organizations to support the notion of IoS is the adoption of Service-Oriented Architectures (SOA). The OASIS SOA Reference Model defines SOA as “a paradigm for organizing and utilizing distributed capabilities that may be under the control of different ownership domains” [9]. With SOA, designers of services are facing the challenge of gaining a deep understanding of the business for which they are developing solutions with the right scope and granularity. Designing services is not only a technical undertaking; it is the job of analyzing the business environment and business processes, and identifying business functions that could be implemented as a service. It should be noticed that it is frequently impossible to implement an innovative business model without, eventually, rely on the underlying IT infrastructure. This constitutes a major problem since there is a considerable gap between these two complementary worlds. Therefore, one challenge lies on bridging the gap between business and IT. This challenge requires a set of design principles, patterns, and techniques that has not been precisely identified yet. Therefore, the IoS cannot be realized without giving a strong emphasis on the business side of services. 4.1 Definition The set of activities involved in the development of service-based solutions in a systematic and disciplined way that span, and take into account, business and technical perspectives can be referred to as service engineering. “Service Engineering is an approach that provides a discipline for using models and techniques to guide the understanding, structure, design, implementation, deployment, documentation, operation, maintenance and modification of electronic services (e-services).” Service Engineering is a structured approach for describing a part of an organization from a service perspective that expresses the way the organization works. The approach should systematically translate an initial description from a natural language that expresses the way stakeholders think and communicate about the organization through a sequence of representations using various models to a representation that is accepted and understood. Developing and implementing services is a major chore for organizations. Dealing with hundreds of services may be seen, from a management point-of-view, as difficult as managing hundreds of human resources inside an organization, requiring a dedicated department, specialized staff, and adequate methodologies.


4.2 The ISE Methodology

Compared to other approaches, the methodology we are developing (the ISE methodology) focuses not only on a technical perspective, but also on a deep and prominent business perspective when developing services for the IoS. Since the notions of abstraction (perspectives) and dimensions (entities) are important for our approach, we have followed a solution based on the Zachman framework [16] to support service engineering.

Fig. 3. Perspectives and dimensions of the ISE methodology

Each of the perspectives (layers or rows) of the ISE methodology (Figure 3) can be regarded as a phase in the development process of services. Thus, the models and methods which are assigned to each of the layers support the development process from different viewpoints (e.g., business, conceptual, logical, technical, and runtime). For all the cells of the matrix we have developed major artifacts which should be considered in the business service development process. These artifacts include important elements such as balanced scorecards, UML, mind maps, BPMN, BPEL, OWL, OCL, etc. Artifacts are assigned to the intersection of an abstraction layer and a dimension (Figure 3). At the business layer, the development of a service is triggered, typically but not always, by the planning process, where the strategies, objectives and performance measures (KPIs) that can help an organization achieve its goals are defined. Fundamental services can often be derived from the strategic planning activity of an organization. Other elements that are typically part of the strategy, or direction, of an organization include resources, capital and people. Models and techniques that can be used to identify fundamental services at the business layer include SWOT and PEST analyses. Once a list of services is identified – one that is deemed necessary for the organization to


stay competitive – the service engineering process will proceed with the analysis of the conceptual layer, the logical layer and the technical layer. Once a full technical specification of the service is created, the service is sent to the runtime platform for execution. For services, the business layer, the conceptual layer, the logical layer, the technical layer, and the runtime layer give a different perspective for stakeholders (i.e. CEO, CTO, CIO, architects, IT analysts, programmers, etc.) to services. 4.3 Service Model Integration In order to implement the ISE methodology different stakeholders have to develop different models defining a service. Since the union of all models defines a service they need to be integrated and synchronized. This integration task is facing major challenges because of the various people involved within the development process and the rising complexity of the models. To cope with these challenges we propose to integrate the models automatically supporting each role in ISE. The ISE models contain artifacts representing modeled information. Following the separation of concerns paradigm raised by [10], the ISE methodology divides a service into five dimensions, namely: service description, workflow, data, people and rules. Furthermore, each of these models is divided into four layers (levels) of abstraction. Throughout one dimension artifacts are modeled with respect to different views and refined until they conform to a technical specification. This leads to multiple representations of information on different layers of abstraction in the corresponding dimensions. Each of these models has to be specified and maintained. Changes in one model have to be propagated into the affected models holding the overlapping information. This is a time-consuming and challenging task since each of the models has to be aware of changes and needs to be adjusted. Due to several people involved in the development this leads also to an increased communication effort. For a structured approach we separate the dependencies between models into two classes: vertical and horizontal. Vertical Dependencies cover the synchronization of dependencies between models on different layers of abstraction in one dimension. It represents the bridging of layers of abstraction by transforming between multiple representations of artifacts. Horizontal Dependencies define the synchronization of models on the same layer of abstraction. This describes dependencies between models of different dimensions which refer to artifacts of other dimensions. This also includes multiple representations of an artifact on the same layer of abstraction. These dependencies form the integration of the models and have to be implemented manually or by automatic support. Being more precise, a dependency is defined by a mapping. Formally a mapping assigns to a set of artifacts a set of artifacts; where one sets corresponds to the other. That means the different representations of information are assigned to each other. To illustrate the dependencies, Figure 4 shows an example which depicts the dependencies between two layers of abstraction as well as between models on the same layer but of different dimensions. The workflow dimension shown is specified regarding the conceptual and logical layers. The conceptual layer is represented by an UML activity diagram. The Business Process Modeling Notation (BPMN) is used to represent the logical layer. 
The artifacts of the logical layer of the data dimension are modeled using an OWL-UML profile. The arrows depict artifacts that need to be synchronized and are mapped onto each other.


Fig. 4. Example of models (UML activity diagram, BPMN and OWL-UML profile) synchronization

Actions modeled in the activity diagram are again represented in BPMN as tasks. Therefore, Action A needs to be in synchronized with Task A. That means that UML actions need to be mapped to BPMN tasks. The XOR between Task B and Task C of the BPMN model is mapped from Action B or C of the UML model. Furthermore, the Information I artifact used in the workflow is defined in the OWL-model (i.e., it depends on it). When one model changes (e.g. renaming or deletion), the depending models have to be updated. These updates can be done manually or by providing an automatic support. One solution to enable an automatic approach is by using model transformations for implementing mappings. The first step to enable the implementation of model transformations is to define one common formal representation of models. This can be done using ontology formalism or more mature concepts like the Meta Object Facility (MOF). Based on this formalism, a domain specific language for model transformation can be used to define rules and apply them to the models. During the last years many model transformation languages have been proposed, both by academia and industry. For an overview, we refer to [3] classification of today’s approaches. The two most prominent proposals in the context of Model Driven Architecture (MDA) are Query, View and Transformation (QVT) and the ATLAS Transformation Language (ATL). We have chosen to rely on MDA to support model transformations because of matured concepts, well established infrastructure for model management and transformation, and available OMG standards. The MDA guide (2003) defines a model transformation as “the process of converting one model to another model of the same system”. Thus a model transformation is an implementation of a mapping (model dependency specification). We follow [7] refining this definition to an automatic


generation of a target model from a source model, following a transformation definition. A transformation definition is a set of rules describing formally how a source model can be transformed into a target model. Using a rule-based language like QVT to define model transformations executed by an engine allows for incremental and traceable transformations. For an automatic model integration we argue for model transformations as the implementation of mappings. Using and applying these concepts enables an automatic model synchronization. This supports both the implementation of vertical and horizontal dependencies, thus reducing the complexity, effort and errors in modeling a service using ISE.
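To give a flavour of what such an implemented mapping amounts to, here is a deliberately simplified, self-contained sketch; it is not part of the ISE tooling, and in the ISE tool chain this rule would be written in a transformation language such as QVT or ATL rather than in plain Java. It maps every Action of the conceptual UML model to a Task of the logical BPMN model.

import java.util.List;
import java.util.stream.Collectors;

// Simplified illustration of a model transformation implementing a dependency between layers:
// each Action of the conceptual (UML) model is mapped to a Task of the logical (BPMN) model,
// so a rename or deletion on one side can be propagated automatically to the other.
public class UmlToBpmnSketch {
    record Action(String name) {}  // artifact of the conceptual layer
    record Task(String name) {}    // corresponding artifact of the logical layer

    static List<Task> transform(List<Action> actions) {
        return actions.stream().map(a -> new Task(a.name())).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Action> conceptual = List.of(new Action("A"), new Action("B"), new Action("C"));
        System.out.println(transform(conceptual)); // [Task[name=A], Task[name=B], Task[name=C]]
    }
}

A rule-based language executed by a transformation engine adds what this sketch lacks: incremental execution and traceability of the generated elements.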

5 Conclusions The Internet of Services (IoS) will provide the opportunity to create and drive a new “service industry” for producing, changing, adapting, (re)selling, and operating services. By providing a holistic approach, the IoS will be able to contribute to the larger topic of a Web-based business service economy. Business value networks based on the IoS can only be successfully achieved if important topics, such as legal issues, community aspects, new business models, service innovation, service governance and service engineering are exploited. Service marketplaces act as enablers for business interactions between various stakeholders in the IoS where business services are offered, composed, sold, and invoked by the means of IT. In order to support all stakeholders in their business, marketplaces need to provide advanced functionality such as service search based on functional and non-functional service properties, negotiation and monitoring of SLA and the means for billing and payment. A major constituent of a service marketplace include a common service description framework forming the base for the service lifecycle on the marketplace. Based on the requirements from marketplaces and based on the concept of IoS, we have introduced a new service engineering methodology for developing and describing services. By covering the technical and business perspectives, ISE provides a structured approach for service engineering. The structuring is achieved by following the separation of concerns and model-driven design. Therefore, we divide a service into several models and identify the need for model integration. Finally, we adopt a model-driven approach by using model transformations to integrate individual models on different layers of abstraction. This leverages service engineering as a discipline and enables the realization of the IoS. Acknowledgements. The TEXO project was funded by means of the German Federal Ministry of Economy and Technology under the promotional reference 01MQ07012. The authors take the responsibility for the contents.

References 1. Baida, Z., Gordijn, J., Omelayenko, B.: A Shared Service Terminology for Online Service Provisioning. In: The 6th International Conference on Electronic Commerce (ICEC 2004) (2004)


2. Barros, A., Dumas, M., Bruza, P.: The Move to Web Service Ecosystems. BPTrends Newsletter 3(3) (2005) 3. Czarnecki, K., Helsen, S.: Feature-based Survey of Model Transformation Approaches. IBM Systems Journal 45(3) (June 2006) 4. Fielding, R.T.: Architectural Styles and the Design of Network-based Software Architectures, Ph.D. Thesis, University of California, Irvine, California (2000) 5. Hepp, M., Leukel, J., Schmitz, V.: A quantitative analysis of product categorization standards: content, coverage, and maintenance of ecl@ss, UNSPSC, eOTD, and the rosettanet technical dictionary. Knowl. Inf. Syst. 13(1), 77–114 (2007) 6. Hull, R., Benedikt, M., Christophides, V., Su, J.: E-services: a look behind the curtain. In: Proceedings of the twenty-second ACM SIGMOD-SIGACTSIGART symposium on Principles of database systems, pp. 1–14. ACM Press, New York (2003) 7. Kleppe, A., Warmer, J.: MDA Explained. The Model Driven Architecture: Practice and Promise. Addison-Wesley, Reading (2003) 8. O’Sullivan, J.: Towards a Precise Understanding of Service Properties. PhD thesis, Queensland University of Technology (2006) 9. OASIS. OASIS SOA Reference Model (2006) (retrieve on 8 April 2008), http://www.oasis-open.org/committees/tc_home.php? wg_abbrev=soa-rm 10. Parnas, D.L.: On the criteria to be used in decomposing systems into modules. Communications of the ACM (12), 1053–1058 (1972) 11. Piccinelli, G., Mokrushin, L.: Dynamic Service Aggregation in Electronic Marketplaces. TechReport HPL-2001-31, Hewlett-Packard Company (2001) 12. Rust, R.T., Kannan, P.: E-service: a new paradigm for business in the electronic environment. Communications of the ACM 46(6), 36–42 (2003) 13. Schroth, C., Janner, T.: Web 2.0 and SOA: Converging Concepts Enabling the Internet of Services. IT Professional 3, 36–41 (2007) 14. Texo, TEXO – Business Webs in the Internet of Services (retrieve on 8 April 2008), http://theseus-programm.de/scenarios/en/texo 15. Theseus (retrieve on 8 April 2008), http://theseus-programm.de/ 16. Zachman, J.A.: A Framework for Information Systems Architecture. IBM Systems Journal 26(3) (1987)

Part I

Databases and Information Systems Integration

Bringing the XML and Semantic Web Worlds Closer: Transforming XML into RDF and Embedding XPath into SPARQL Matthias Droop1, Markus Flarer1, Jinghua Groppe2, Sven Groppe2, Volker Linnemann2, Jakob Pinggera1, Florian Santner1, Michael Schier1, Felix Schöpf1, Hannes Staffler1, and Stefan Zugal1 1 University of Innsbruck, Technikerstrasse 21a, A-6020 Innsbruck, Austria [email protected], {Markus.Flarer,Jakob.Pinggera, Florian.Santner}@student.uibk.ac.at, [email protected], [email protected], {Hannes.Staffler, Stefan.Zugal}@student.uibk.ac.at 2 Institute of Information Systems, University of Lübeck, Ratzeburger Allee 160 D-23538 Lübeck, Germany {jinghua.groppe,groppe,linnemann}@ifis.uni-luebeck.de

Abstract. XPath is an established query language developed by the W3C for XML, which is supported by many tools and used in many applications. SPARQL is a new query language developed by the W3C for RDF data. Recently available SPARQL query evaluators do not deal with XML data and XPath queries. In this contribution, we show how to enable SPARQL query evaluators to deal with XML data and XPath queries in order to support XPath processing and SPARQL processing in parallel. Keywords: XML, XPath, Semantic Web, SPARQL, RDF.

1 Introduction

XML is a data format for exchanging data on the Web, between databases and elsewhere. Furthermore, XML has become a widely used native data format. The W3C has developed XPath [17] as a query language for XML data. XPath is embedded in many other languages, such as the XQuery query language and the XSLT language for transforming XML data. The name of XPath derives from its basic concept, the path expression, with which the user can hierarchically address the nodes of the XML data. The user of XPath may use not only simple relationships like parent-child, but also more complex relationships like the descendant relationship, which is the transitive closure of the parent-child relationship. Furthermore, complex filter expressions are allowed in XPath queries. The Resource Description Framework (RDF) [3] is a language for representing information about resources on the World Wide Web. SPARQL [14] is a query language for formulating queries against RDF graphs. SPARQL supports querying by triple


patterns, conjunctions, disjunctions, and optional patterns, and constraining queries by a source RDF graph and extensible value testing. The results of SPARQL queries can be ordered, limited and offset in number. There are plans to embed SPARQL in forthcoming languages, in a similar way to how XPath is embedded in XQuery and XSLT. In recent years, many RDF storage systems which support, or plan to support, SPARQL have appeared, such as Jena [15]. These RDF storage systems do not support XML data and XPath queries, which are currently widely used in applications. Integrating XML data into RDF data and embedding XPath queries into SPARQL queries can make XML data and XPath available to these products. Furthermore, the proposed embedding enables users to work in parallel with XML data and RDF data, and with XPath queries and SPARQL queries, i.e. XML data is integrated in RDF data and the result of XPath subqueries can be used in SPARQL for further processing. As many SPARQL tools do not allow calling an external XPath evaluator from SPARQL, we propose to translate embedded XPath queries into SPARQL subexpressions. Tree-based queries can be expressed more easily in the tree query language XPath than in the graph query language SPARQL. For example, SPARQL does not allow computing all descendant nodes of a node as XPath does. Conversely, the formulation of joins in graphs is easier in SPARQL than in XPath. An embedding of XPath into SPARQL allows the easy formulation of tree queries and graph queries in one query. Thus, the host language SPARQL benefits from the embedded language XPath, and the embedded language XPath benefits from its host language SPARQL. In this paper, we first compare the different data models of XML and RDF and the different query languages XPath and SPARQL. Based on this comparison, we propose a translation scheme for XML data into RDF data and XPath queries into SPARQL queries. Furthermore, we present the results of an experimental evaluation of a prototype, which shows that various different XPath queries can be embedded into SPARQL.

2 Further Related Work There are many contributions (e.g. [10] and [11]) to source-to-source translations for evaluating XPath expressions on relational database management systems. Many techniques described there can be adapted for evaluating XPath expressions on SPARQL evaluators like using a numbering scheme for the XML data to support all XPath axes [10], but some other techniques cannot be adapted like the evaluation of positional predicates [11] as SQL supports more language constructs than SPARQL like the rank clause. Furthermore, some contributions deal with bridging the gap between SPARQL/RDF and the relational world (e.g. [4], [5] and [13]). [9] describes a translation scheme from SPARQL to XQuery/XSLT. [7] deals with a translation from XML data into RDF data and from XPath queries into SPARQL queries, but not with an embedding of XPath queries into SPARQL queries. We extend our contributions given in [6] by the detailed translation scheme, which we present in the Appendix.


3 Comparison of XML/RDF and XPath/SPARQL 3.1 XPath and XQuery Data Model and XPath Query Language The XPath and XQuery data model is defined as follows: Definition 1 (Data Model of XPath and XQuery). An XML document is a tree of nodes. The kinds of nodes are document, element, attribute, text, namespace, processing-instruction, and comment. Every node has a unique node identity that distinguishes it from other nodes. In addition to nodes, the data model allows atomic values, which are single values that correspond to the simple types defined in [16], such as strings, Booleans, decimal, integers, floats, doubles, and dates. The first node in any document is the document node, which contains the entire document. Element nodes, comment nodes, and processing instruction nodes occur in the order in which they are found in the textual representation of the XML document. Element nodes occur before their children – the element nodes, text nodes, comment nodes, and processing instructions, which they contain. Attribute nodes and namespace nodes are not considered as children of an element. See Fig. 1 for an example of a textual XML document representing a bookstore containing the two books “Harry Potter” from J. K. Rowling and “Learning XML” from Erik T. Ray. Fig. 2 is a graphical representation of the XML document of Fig. 1.

<bookstore>
  <book category="CHILDREN">
    <title>Harry Potter</title>
    <author>J. K. Rowling</author>
  </book>
  <book category="WEB">
    <title>Learning XML</title>
    <author>Erik T. Ray</author>
  </book>
</bookstore>

Fig. 1. An example XML document representing a bookstore

[Figure 2 shows the corresponding tree: a document node whose root element bookstore has two book child elements with category attributes (“CHILDREN”, “WEB”), each containing title and author elements with text nodes; the legend distinguishes element, attribute and text nodes as well as the parent-child, parent-attribute and next-sibling relationships.]

Fig. 2. Graphical representation of the XML document of Fig. 1

The W3C developed the XPath language as a simple query language to describe node sets of XML documents. The basic concept of an XPath expression is the location step; location steps are separated by a slash (“/”). Each location step of the form a::nt[P1]…[Pn] contains


• an axis a, which can be one of child, attribute, self, parent, descendant, descendant-or-self, ancestor, ancestor-or-self, following, following-sibling, preceding and preceding-sibling,
• a node test nt. Among the possible node tests are a name node test for a specific name A (declared by A itself), for an arbitrary name (declared by the wildcard *), for a text node (declared by text()), and a node test node() for all node types, and
• an arbitrary number of predicates P1 to Pn. A predicate is enclosed by the brackets [ and ]. A predicate contains a Boolean expression, e.g. a comparison with a constant string or number.

Starting with the root node of an XML document, each location step from left to right describes which XML nodes must be considered for the next location step, by following the axis from the XML nodes of the previous location step and checking the node test and the predicates. The whole XPath expression describes the resultant XML nodes of the last location step. For example, the resultant nodes of the XPath query of Fig. 3 with the input XML document of Fig. 1 are the text nodes containing “Harry Potter” and “Learning XML”.

/bookstore/parent::node()/descendant::title/text()

Fig. 3. An example XPath query
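For readers who want to reproduce this result, the following self-contained Java sketch (not part of the paper) evaluates the query of Fig. 3 against the bookstore document of Fig. 1 with the standard javax.xml.xpath API; it is only meant to show the expected XPath semantics before turning to the SPARQL side.

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathExample {
    public static void main(String[] args) throws Exception {
        String xml = "<bookstore>"
            + "<book category=\"CHILDREN\"><title>Harry Potter</title><author>J. K. Rowling</author></book>"
            + "<book category=\"WEB\"><title>Learning XML</title><author>Erik T. Ray</author></book>"
            + "</bookstore>";
        // Parse the textual XML document of Fig. 1 into a DOM tree (the data model of Definition 1).
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        // Evaluate the XPath query of Fig. 3; the location steps are processed from left to right.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList result = (NodeList) xpath.evaluate(
            "/bookstore/parent::node()/descendant::title/text()", doc, XPathConstants.NODESET);
        for (int i = 0; i < result.getLength(); i++) {
            System.out.println(result.item(i).getNodeValue()); // "Harry Potter", "Learning XML"
        }
    }
}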

3.2 RDF Data Model and SPARQL In comparison to the XPath and XQuery data model, the data model of RDF is defined as follows: Definition 2 (Data Model of RDF). The underlying structure of any expression in RDF is a collection of triples, each consisting of a subject (a RDF URI reference or a blank node), a predicate (a RDF URI reference) and an object (a RDF URI reference, a blank node or a literal, which can be plain literals having optionally a language tag, or which can be a typed literal having additionally a datatype URI being a RDF URI reference). A set of such triples is called an RDF graph. The nodes of an RDF graph are its subjects and objects. The predicate holds a directed relationship from a subject to an object. There are different ways to represent RDF data, e.g. RDF Triplets or RDF/XML documents, which use XML to encode RDF data. Fig. 4 contains an example RDF graph, which actually represents the generated RDF graph of the data translation module of our prototype when using the XML data of Fig. 1 as input. SPARQL (see [14]) is a query language for retrieving information from RDF graphs stored in semantic storage systems. The outline query model is graph patterns expressed by simple triple patterns. It does not use rules and is not path based. We briefly introduce SPARQL by a simple example. Fig. 5 presents a SPARQL query, which actually is a query translated by our prototype with the input XPath query of Fig. 3. The query consists of three parts, the PREFIX declarations, the SELECT clause and the WHERE clause. The PREFIX declarations specify prefixes for short name usages (e.g. here rel is declared as short name for ). The SELECT clause identifies the variables to appear in the query results (here ?v9). The WHERE clause contains triple patterns like “?v0 rel:type "9".”. The first position (?v0) in the triple pattern represents the constraints or


bindings to variables for the subjects in the RDF data. The second position (here it is the relation rel:type) contains the constraints or bindings to variables for the predicates of the triples of the RDF data, and the third (here "9") contains the constraints or bindings to variables for the objects of the triples of the RDF data. A join between the first triple pattern (?v0 rel:type "9".) and the second triple pattern (?v0 rel:child ?v1.) of the query is expressed by using the same variable ?v0 in both triple patterns. Furthermore, SPARQL queries may contain filter expressions (here e.g. FILTER (xsd:long(?v6) > xsd:long(?v4)).), which contain Boolean expressions to constrain the input data (here the considered filter expression contains a greater-than comparison (“>”) between the two variables ?v6 and ?v4, which are first cast to the XML Schema datatype long).
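Translated queries of this kind can be evaluated with any SPARQL engine. The sketch below (not from the original prototype) shows the general execution pattern using the Jena API that the performance experiments also rely on; the package names are those of the Jena 2.x releases, the file name bookstore.rdf is an assumption, and the generic pattern in queryString is a stand-in for a translated query such as the one of Fig. 5.

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

// Minimal sketch of executing a (translated) SPARQL query with Jena.
// The RDF file name is an assumption; the generic pattern below stands in for a translated query.
public class RunTranslatedQuery {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.read("file:bookstore.rdf");                       // RDF graph produced by the data translation
        String queryString = "SELECT * WHERE { ?s ?p ?o } LIMIT 10";
        QueryExecution qexec = QueryExecutionFactory.create(queryString, model);
        try {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution solution = results.nextSolution(); // bindings of the SELECT variables
                System.out.println(solution);
            }
        } finally {
            qexec.close();
        }
    }
}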

[Figure 4 depicts the generated RDF graph rooted at file:///C:/bookstore.xml: every XML node is represented by a resource (A0–A12) with rel:type, rel:start and rel:end properties plus a rel:name or rel:value, and the resources are connected by rel:child and rel:attribute relationships mirroring the document tree of Fig. 2.]

Fig. 4. Graphical representation of the RDF data representing the generated RDF graph of the data translation module of our prototype when using the XML data of Fig. 1 as input

There are further constructs, e.g. built-in functions and set operations like the UNION operator. We refer the interested reader to [14] for a complete list and description of the SPARQL features.

PREFIX rel: <…>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?v9
WHERE { ?v0 rel:type "9". ?v0 rel:child ?v1. ?v1 rel:name "bookstore". ?v2 rel:child ?v1.
        ?v7 rel:start ?v3. ?v2 rel:start ?v5. ?v7 rel:end ?v4. ?v2 rel:end ?v6.
        ?v7 rel:name "title". ?v7 rel:child ?v8. ?v8 rel:type "3". ?v8 rel:value ?v9.
        FILTER(xsd:long(?v6) > xsd:long(?v4)). FILTER(xsd:long(?v5) < xsd:long(?v3)). }

Fig. 5. SPARQL query translated by our prototype from the XPath query of Fig. 3

… a.start < b.start and a.end … b.start (a.start … xsd:long(?v4)), the processing of which is time consuming. Future versions of Jena or other SPARQL engines may optimize these kinds of filter expressions, so that the translations for recursive axes are processed faster.
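The recursive axes are the expensive part precisely because they are encoded through the start/end numbering that the data translation attaches to every node (rel:start, rel:end). A small stand-alone sketch of that interval test (not from the paper; names and the example numbers are illustrative) shows why a descendant check reduces to two numeric comparisons, which end up as the FILTER expressions mentioned above.

// Sketch of the region (start/end) numbering used to encode the XML tree in RDF.
// Names and example numbers are illustrative; the prototype stores such values
// as rel:start and rel:end literals and compares them in SPARQL FILTER expressions.
public class RegionEncodingSketch {
    static final class Node {
        final String name;
        final long start; // document-order position of the opening tag
        final long end;   // document-order position of the matching closing tag
        Node(String name, long start, long end) { this.name = name; this.start = start; this.end = end; }
    }

    // b is a descendant of a iff a's interval strictly contains b's interval.
    static boolean isDescendant(Node b, Node a) {
        return a.start < b.start && b.end < a.end;
    }

    public static void main(String[] args) {
        Node bookstore = new Node("bookstore", 1, 22);
        Node title = new Node("title", 3, 6);
        System.out.println(isDescendant(title, bookstore)); // true: 1 < 3 and 6 < 22
    }
}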


Fig. 9. Average execution times of the original queries of the XPathMark benchmark, of data translation, query translation, result translation and the execution times of the translated SPARQL queries (all, only those containing recursive axes and those, which contain only nonrecursive queries) using Jena

6 Conclusions In this paper, we first compare the RDF and XPath and XQuery data model and the XPath and SPARQL query languages. Then we propose translations from XML into RDF, from XPath into SPARQL, and from the result of the translated SPARQL query into the XPath and XQuery data model, in order to integrate XML data into RDF data and embed XPath subqueries into SPARQL queries. A translation from XML into RDF and an embedding from XPath into SPARQL enable SPARQL query evaluators to deal with XML data and to process XPath queries as subqueries. We have developed a prototype to verify our translations and to show the practical usability of such a source-to-source translator. We have done a performance analysis to measure the execution times of the translations and the evaluations of the XPath query and the translated SPARQL query.

References 1. Axyana software. Qizx/open version 1.1 (2006), http://www.axyana.com/qizxopen 2. Cardoso, J.: The Semantic Web Vision: Where are We? IEEE Intelligent Systems, 22–26 (2007) 3. Carroll, J.J., Klyne, G.: Resource Description Framework: Concepts and Abstract Syntax. W3C Recommendation, February 10 (2004) 4. Chong, E.I., Das, S., Eadon, G., Srinivasan, J.: An Efficient SQL-based RDF Querying Scheme, VLDB, Trondheim, Norway (2005)


5. Dokulil, J.: Evaluation of SPARQL queries using relational databases. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 972–973. Springer, Heidelberg (2006) 6. Droop, M., Flarer, M., Groppe, J., Groppe, S., Linnemann, V., Pinggera, J., Santner, F., Schier, M., Schöpf, F., Staffler, H., Zugal, S.: Embedding XPath Queries into SPARQL Queries. In: ICEIS 2008, Barcelona, Spain (2008) 7. Droop, M., Flarer, M., Groppe, J., Groppe, S., Linnemann, V., Pinggera, J., Santner, F., Schier, M., Schöpf, F., Staffler, H., Zugal, S.: Translating XPath Queries into SPARQL Queries. In: ODBASE 2007, Vilamoura, Algarve, Portugal (2007) 8. Franceschet, M.: XPathMark: An xPath benchmark for the xMark generated data. In: Bressan, S., Ceri, S., Hunt, E., Ives, Z.G., Bellahsène, Z., Rys, M., Unland, R. (eds.) XSym 2005. LNCS, vol. 3671, pp. 129–143. Springer, Heidelberg (2005) 9. Groppe, S., Groppe, J., Linnemann, V., Kukulenz, D., Höller, N., Reinke, C.: Embedding SPARQL into XQuery / XSLT. In: ACM SAC 2008, Fortaleza, Brazil (2008) 10. Grust, T., van Keulen, M., Teubner, J.: Accelerating XPath evaluation in any RDBMS. ACM Trans. Database Syst. 29, 91–131 (2004) 11. Tatarinov, I., Viglas, S., Beyer, K.S., Shanmugasundaram, J., Shekita, E.J., Zhang, C.: Toring and querying ordered XML using a relational database system. In: SIGMOD Conference 2002, Madison, Wisconsin, U.S.A (2002) 12. Kay, M.H.: Saxon - The XSLT and XQuery Processor (2006), http://saxon.sourceforge.net 13. de Laborda, C.P., Conrad, S.: Bringing Relational Data into the SemanticWeb using SPARQL and Relational.OWL. In: SWDB 2006, Atlanta, Georgia, U.S.A (2006) 14. Prud’hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF. W3C Recommendation (2008) 15. Wilkinson, K., Sayers, C., Kuno, H.A., Reynolds, D.: Efficient RDF Storage and Retrieval in Jena2. In: SWDB 2003 co-located with VLDB 2003, Berlin (2003) 16. W3C. XML Schema Part 2: Datatypes. W3C Recommendation (2001) 17. W3C, XPath Version 2.0,W3C Recommendation (2007)


Appendix We describe the translation from XPath into SPARQL by an attribute grammar. Definition 3 (Attribute Grammar). An attribute grammar consists of a grammar G in EBNF notation and computation rules for attributes of symbols of G added to a production rule of G. In the following, we use the notation P { C }, where P is a production rule of G in the EBNF notation and C contains the computation rules for attributes of symbols, which occur in P. We use a slightly different variant of the Java notation in C. We refer to an attribute a of the m-th symbol n in P as n[m].a. If there is only one symbol n in P, we use n.a instead of n[1].a. If there exists an arbitrary number of symbols n in P, then |n| represents the concrete number of occurrences of the symbol n in P. n.getString() returns the textual representation of the symbol n, i.e. the parsed string for n in the source code. If there are choices n1 | … | nm in the right-hand side of P, then Symbol represents the concrete symbol, which is one of n1, …, nm. Due to space limitations, we do not present the full attribute grammar here, which has been implemented in the mentioned prototype, but present a subset of the attribute grammar, which shows the main technical problems for the translation process from XPath to SPARQL. We do not present the attribute grammar for the abbreviated syntax of XPath, any expression of which can be transformed into the here considered long form, constructors for sequences, datatype-specific operators like treat as, cast as and castable as, the for clause, the quantified expressions some and every, the conditional if expression and the except operator. We have left out translations for the idiv and mod operators of XPath, as SPARQL does not support corresponding operators. Note that we use a global numeric variable c to create a unique identifier for intermediate variables and we use the global variable p to store a variable prefix so that the result of the string concatenation p+c contains the variable, which is meant to contain the current result of the parsed subexpression determined by a production rule. Note that the generated variable names must be different from the variable names, which are already used in the SPARQL query with embedded XPath query. This can be easily considered by the prototype, but we do not present this in the proposed translation scheme here due to simplicity of presentation. XPath::=Expr. { c=0;p=”?v”;d=0;s=Expr.SPARQL; XPath.SPARQL="PREFIX rel:http://" + "uibk.ac.at/informatic/comdesign/relations:"+ "PREFIX xsd:http://www.w3.org/2001/XMLSchema#" + "SELECT DISTINCT "+p+c+" WHERE {"+s+"}"; }
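The essential mechanism of the grammar is that the counter c and the prefix p generate a fresh SPARQL variable for every intermediate result, and each location step links the previous variable to the next one through triple patterns. A stripped-down, stand-alone illustration of this idea for plain child::name steps (not the full translator given in this appendix) could look as follows:

// Simplified illustration of how the attribute grammar emits SPARQL triple patterns:
// a global counter c and prefix p ("?v") provide fresh variable names for every location step.
public class LocationStepTranslator {
    private int c = 0;
    private final String p = "?v";

    // Translate a single child::name step: the context node is p+c, the result node becomes p+(c+1).
    String translateChildStep(String name) {
        String context = p + c;
        c++;
        String result = p + c;
        return context + " rel:child " + result + ". " + result + " rel:name \"" + name + "\". ";
    }

    public static void main(String[] args) {
        LocationStepTranslator t = new LocationStepTranslator();
        // child::bookstore/child::book would yield two chained triple patterns:
        System.out.println(t.translateChildStep("bookstore") + t.translateChildStep("book"));
        // ?v0 rel:child ?v1. ?v1 rel:name "bookstore". ?v1 rel:child ?v2. ?v2 rel:name "book".
    }
}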

getValueOfExpr(AdditiveExpr[2]); c++; RangeExpr.SPARQL+=”Filter(xsd:integer(”+ p+c+”)>=xsd:integer(“+p+ic+ ”).Filter(xsd:integer(”+ p+c+”)xsd:integer(”+p+(c+3)+”)).”+p+(c+5)+ ” rel:end ”+p+(c+2)+”.”+p+c+” rel:end ”+ p+(c+4)+”. Filter(xsd:integer(”+p+(c+2)+ ”)xsd:integer(”+p+(c+2)+”)).”;c+=4;} else if("preceding" occurs) {Axis.SPARQL = Translated SPARQL Expression for ancestor-or-self::node()/ preceding-sibling::node()/ descendant-or-self::; } else if("ancestor-or-slf" occurs) { Axis.SPARQL=p+(c+5)+” rel:start ”+p+(c+1)+ ”.”+p+c+” rel:start ”+p+(c+3)+ ”. Filter(xsd:integer(”+ p+(c+1)+”)=xsd:integer(”+p+(c+4)+”)).”;c+=5;} } NodeTest::=KindTest | QName | "*". {if(KindTest occurs) NodeTest.SPARQL=KindTest.SPARQL; else { if(QName occurs) NodeTest.SPARQL= p+c+” rel:name ‘”+QName.getString()+”’.”; else if("*" occurs) NodeTest.SPARQL=p+c+” rel:type ”+p+c+ ”i.”+”Filter(”+p+c+”i=”+getTypeConstant (ElementTest)+”||”+p+c+”i=”+ getTypeConstant(AttributeTest)+”).”; }} Pred::= "[" Expr "]". {ic=c;c=0;ip=p;d++;p=’?p’+d+’_’;Pred.SPARQL= Expr.SPARQL+”FILTER(”+ip+ic+”=”+p+”0).”; p=ip;c=ic;} Prim::=Literal|"$" QName|PExp|FunctionCall. { if(Literal occurs) Prim.SPARQL=Literal.getString(); else if(QName occurs) Prim.SPARQL=”?”+QName.getString(); else if(PExp occurs) Prim.SPARQL=PExp.SPARQL;


c++; return s+p+(c-1)+" rel:value "+p+c; }

FunctionCall::=QName "(" (Expr ("," Expr)*)? ")".
{ FunctionCall.SPARQL=Expr[1].SPARQL;
  if(Expr[1].SPARQL does not refer to value)
    FunctionCall.SPARQL+=getValueOfExpr(Expr[1]);
  ic[1]=c;
  …
  FunctionCall.SPARQL+=Expr[|Expr|].SPARQL;
  if(Expr[|Expr|].SPARQL does not refer to value)
    FunctionCall.SPARQL+=getValueOfExpr(Expr[|Expr|]);
  ic[|Expr|]=c;
  FunctionCall.SPARQL+=getSPARQLOfFunction(
    QName.getString(), ic[1], …, ic[|Expr|]) }

KindTest::=DocumentTest|ElementTest|AttributeTest|PITest|CommentTest|TextTest|AnyKindTest.
{ if(not AnyKindTest occurs)
    KindTest.SPARQL+=p+c+" rel:type "+getTypeConstant(Symbol);
  if(last location step and (AttributeTest occurs or TextTest occurs))
    {c++; KindTest.SPARQL+=s+p+(c-1)+" rel:value "+p+c;} }
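To give a concrete feel for the generated patterns, the following sketch mimics the name-test translation of the NodeTest rule above for a single, hypothetical child step. The rel:child property, the helper functions and the variable naming are our own illustrative assumptions, not part of the grammar presented here.

# Illustrative sketch only: a miniature, hand-rolled imitation of the NodeTest
# name-test translation shown above; it is not the authors' implementation.

def translate_name_test(qname: str, c: int, p: str = "?v") -> str:
    # Mirrors NodeTest.SPARQL = p+c+" rel:name '"+QName.getString()+"'."
    return f"{p}{c} rel:name '{qname}'."

def translate_child_step(qname: str, c: int, p: str = "?v") -> tuple[str, int]:
    """A hypothetical child-axis step: relate the current node to a fresh variable."""
    c_next = c + 1  # fresh intermediate variable, in the spirit of the global counter c
    pattern = f"{p}{c} rel:child {p}{c_next}. " + translate_name_test(qname, c_next, p)
    return pattern, c_next

if __name__ == "__main__":
    where, c = translate_child_step("author", 0)
    print(f"SELECT DISTINCT ?v{c} WHERE {{ {where} }}")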

A Framework for Semi-automatic Data Integration

Paolo Ceravolo1, Zhan Cui2, Ernesto Damiani1, Alex Gusmini2, and Marcello Leida1

1 Università degli Studi di Milano, Dipartimento di Tecnologie dell'Informazione,
  via Bramante 65, 26013 Crema (CR), Italy
  {ceravolo,damiani,leida}@dti.unimi.it
2 Intelligent Systems Research Centre, BT Exact, British Telecom,
  Adastral Park, Martlesham Heath, Ipswich, Suffolk, U.K.
  {zhan.cui,alex.gusmini}@bt.com

Abstract. Recent studies on Business Intelligence highlight the need for on-time, trustworthy and sound data access systems. Moreover, the application of these systems in a flexible and dynamic environment calls for an approach based on automatic procedures that can provide reliable results. A crucial factor for any automatic data integration system is the matching process. Different categories of matching operators carry different semantics. For this reason, combining them in a single algorithm is a non-trivial process that has to take into account a variety of options. This paper proposes a solution based on a categorization of matching operators that allows grouping similar attributes in a semantically rich form. This way we define all the information needed in order to create a mapping. Mapping generation is then activated only on those sets of elements that can be queried without violating any integrity constraints on the data.

1 Introduction

For many years, data integration has been a relevant problem in applications that need to access, analyse and display data coming from heterogeneous data sources; the problem has become crucial for large-scale distributed applications on corporate networks, or on the global net. In principle, data integration can be done by a procedural approach, e.g. performing an ad-hoc integration with respect to a set of predefined needs, such as in [11]. But when the queries to be applied on the integrated sources cannot be defined a priori, a declarative approach is required; here we limit our discussion to the latter. According to the declarative approach, we call local schemata (L) the set of representations referring to local data sources, while the global schema (G) is the representation integrating the different local sources. In general, two data sets can be integrated only if they describe a common set of real-world facts. Of course, this common set does not have to cover the totality of the described facts. In [13], relations between facts described by different data sets are defined by set relationships. This approach is partially inappropriate, however, because the instances of two data sets can describe the same facts at different levels of detail, or they can describe distinct facts that have to be related in G.



Research issues on declarative techniques for data integration can be grouped into three big clusters. A first cluster of issues focuses on the generation of G, which can be either normative, as in [2], or inductive, as in [9]. A second category of issues focuses on how to represent the mapping between G and L. Here two main approaches exist: in the Global as View (GaV) approach, the mapping on G objects is provided by using an L vocabulary, while in the Local as View (LaV) approach the mapping on L objects is provided by using a G vocabulary. In [12] a detailed discussion is given of how these approaches impact application modeling and data reasoning. The last cluster of issues focuses on the problem of query answering, studying the computational complexity related to the different solutions, as in [1] or in [10], and defining effective algorithms for dealing with it, as for instance in [8] or [5]. These three clusters cover most of the relevant theoretical aspects of data integration. Moreover, with the increasing number of interactions and complexity of relations, we can no longer rely on human intervention: an automatism is needed that provides a high level of quality in the final mapping. Hence the importance of sound matching operators, capable of discovering semantic relations between elements of the system. Matching operators are very sensitive to the input data; usually, operators are tailored to specific data, and no generic matching function can be easily designed. For this reason, the only way to implement a generic data integration algorithm is to support a set of matching operators. Moreover, combining them in a single algorithm is a non-trivial process that has to take into account a variety of options. This paper deals with the problem of managing a palette of matching operators supporting different semantics. The approach chosen is to combine all the available relations produced by different operators in a cluster. This cluster collects all the elements that can be associated using one of the matching operators in the palette and expresses the semantics of the association. This way the cluster contains all the information necessary to create a mapping. Mapping generation is then activated only on those sets of elements that can be queried without violating any integrity constraints. In our system, called Ontology Driven Data Integration (ODDI), these clusters are used as a starting point for Formal Concept Analysis (FCA) [7] in order to discover concept-level relations. We use an ontology as data access layer. Using an ontology as a common conceptualization brings several benefits, but the most relevant is that, due to its sound logical basis, it is possible to perform reasoning tasks on the knowledge base such as Consistency Checking and Classification [4]. The paper is structured as follows: Section 2 formally introduces a generic data integration system, focusing then on our definition of mapping; Section 3 describes the matching process, providing first our formalization and then a categorization of the traditional matching operators. Section 4 describes our mapping generation module, focusing on the use of FCA as a formalism for representing the information. The paper is enriched with an example of



the generation of the FCA lattice starting from a local schema S and a global schema G. Section 5 concludes the paper.

2 Data Integration

The system we propose in this paper is based on the Global as View approach [3], because G is given through an ontology and the mapping is constructed by associating to each concept of G the set of attributes in L that carry the same informative value as the attributes of that concept. Accordingly, we can define a data integration system as a triple DIS = <G, L, M>, where G is the global representation, L is the set of local representations composed of n single representations s1, s2, ..., sn, and M is the mapping between L and G. The mapping M is the result of a complex process taking as input Mt, a set of matching relations among the simple elements of G and L, and generating the mapping M defined as:

M = <Mp, Mo>    (1)

where Mp is a mapping between objects of the local representation L and the global representation G (such as, for instance, concepts in an ontology or tables in a database) and Mo is a mapping between elements of L.¹ According to our work, a mapping between data sets can have two distinct goals:

– Composition. In this case some redundant information is assumed to be stored in the data sets. The mapping acts on this redundant information in order to aggregate new compositions of data items. In this perspective G contains views that recompose the data items contained in L in a new structure.

– Summarization. In this case the information stored in the data sets can be reduced to a common type. The mapping expresses the commonality shared by different data items. In this perspective G contains views that summarize the data items contained in L in a more compact representation.

In principle a mapping can pursue both these goals. If a human agent generates the mapping, she will naturally distinguish between the two cases. But if the mapping is generated by an algorithm, achieving the right goal mainly depends on the operator adopted for matching the data items. The system that we propose consists of two modules: the first generates Mt, given G and L; the second generates Mp and Mo, by representing Mt as an FCA lattice used as a search space to find semantic relations between elements of G and L.
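As a reading aid, the definitions above can be rendered as plain data structures. The sketch below is our own illustration, not the ODDI implementation; the toy instance at the end uses invented identifiers merely to make the roles of Mp and Mo concrete.

from dataclasses import dataclass, field

@dataclass
class Mapping:
    # Mp: global concept -> local attributes carrying the same informative value
    mp: dict[str, set[str]] = field(default_factory=dict)
    # Mo: relations between local elements needed to query the mapped tables together,
    # e.g. primary-key -> foreign-key joins inside or across source schemata
    mo: set[tuple[str, str]] = field(default_factory=set)

@dataclass
class DIS:
    g: set[str]             # concepts of the global representation (ontology)
    l: dict[str, set[str]]  # local schemata s1..sn with their elements
    m: Mapping              # the mapping between L and G

# Hypothetical toy instance in the spirit of the paper's Customer example
dis = DIS(
    g={"Customer", "Employee"},
    l={"DS1": {"Customer.CustomerName", "Customer.Phone", "Employee.FirstName"}},
    m=Mapping(mp={"Customer": {"Customer.CustomerName", "Customer.Phone"}},
              mo={("Employee.OfficeId", "Office.Id")}),
)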

3 Matching

Matching (Mt) is the problem of discovering relations between elements of two different representations (G and L in this case). The matching at the simple element level can be defined as a relation:

¹ These can be relations between objects of the same source schema sa in L, such as the typical primary-key→foreign-key relation, but also relations between elements of different source schemas si, sj of L that are semantically related.



e_i^{sk} ≅_δ e_j^g

where ≅ can be equality, inclusion or specification.

...

where R, the set of relations r(x, y) = o_x ⊲⊳ c_y, is defined as:

o_i ⊲⊳ c_j = true, if the cluster c_j contains an attribute a_k of o_i; false, otherwise.³

³ In this section we provided just an introduction to FCA for the sake of clarity. For a more exhaustive, theoretically grounded coverage, please refer to [7].
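The incidence relation above is easy to compute once the clusters are available. The following sketch is a minimal illustration under assumed data structures (objects as attribute sets, clusters as sets of similar attributes); it is not the ODDI implementation.

# Minimal sketch: build the formal context (objects x clusters) via the "bowtie" relation
objects = {                      # hypothetical tables/concepts with their attributes
    "Customer_DS1": {"CustomerName", "Phone", "AddressLine1"},
    "Customer_ONTO": {"FullName", "Contact", "Address"},
}
clusters = {                     # clusters of attributes judged similar by the matchers
    "FullName": {"FullName", "CustomerName"},
    "Contact": {"Contact", "Phone"},
}

def bowtie(obj_attrs: set[str], cluster_attrs: set[str]) -> bool:
    # o_i bowtie c_j is true iff the cluster contains an attribute of the object
    return bool(obj_attrs & cluster_attrs)

context = {(o, c): bowtie(a, clusters[c]) for o, a in objects.items() for c in clusters}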


Table 3. The FCA context table

           hunts  fly  bird  mammal  swim  fish
Lion         ×                  ×
Finch               ×     ×
Eagle        ×      ×     ×
Hare                            ×
Ostrich                   ×
Bat                 ×           ×
Shark        ×                          ×     ×
Penguin                   ×             ×
Whale                           ×       ×

Fig. 3. FCA Concept Lattice

This way, a lattice ℑ on ℜ is generated. Figure 3 shows the lattice representation of the formal context of Table 3, generated using Concept Explorer (http://conexp.sourceforge.net/). The lattice generated above is then processed in order to discover semantic relations between objects o_sn from the source schema sn in L (the tables of a data source) and the set of objects o_g from G (the concepts of the ontology). The mapping algorithm analyses the FCA concept lattice and generates, for each object o_g of G, a set of elements T_og = {t_e1^g, t_e2^g, ..., t_ej^g}, where t_ej^g is related to an attribute e_j^g of a concept of G and is defined as:

t_ej^g = <c_k, W>

with c_k ∈ C | ∀ e_x^g ∈ c_k, ∃ W = ∪_{i=0..m, j=0..n} e_i^{sj}, ∀ e_i^{sj} ∈ c_k
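For readers less familiar with FCA, the sketch below computes the formal concepts (extent/intent pairs) of a miniature context with the standard closure construction. The context data is made up in the spirit of Table 3, and the code is a didactic illustration rather than the lattice generation used in ODDI.

from itertools import chain, combinations

# A miniature formal context in the spirit of Table 3 (made-up incidence data)
incidence = {
    "Lion":  {"hunts", "mammal"},
    "Finch": {"fly", "bird"},
    "Eagle": {"hunts", "fly", "bird"},
}
all_attrs = set().union(*incidence.values())

def extent(attrs):            # objects having all the given attributes
    return {o for o, a in incidence.items() if attrs <= a}

def intent(objs):             # attributes shared by all the given objects
    if not objs:              # convention: the empty extent has the full intent
        return set(all_attrs)
    return set.intersection(*(incidence[o] for o in objs))

# Formal concepts are the fixpoints (A'', A') of the double-prime (closure) operator
concepts = set()
for attrs in chain.from_iterable(combinations(sorted(all_attrs), r)
                                 for r in range(len(all_attrs) + 1)):
    ext = extent(set(attrs))
    concepts.add((frozenset(ext), frozenset(intent(ext))))

for ext, itt in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(ext), "<->", sorted(itt))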

In the case of our example, applying the lattice generation formula to the clusters, our algorithm generates the FCA lattice shown in Fig. 5. This lattice reports all the information extracted during the matching process; this information is distributed in a highly structured search space that is the input to our mapping generation algorithm. The lattice is traversed by the mapping generation algorithm in an iterative way. For example,



Fig. 4. Selecting the object Customer:ONTO, the intents Address, FiscalCode, FullName and Contact are obtained (above); for each intent the target objects are discovered. In the case of FullName the extents are Customer:ONTO, Customer:DB1, Employee:ONTO and Employee:DB1 (below).

Fig. 5. The FCA lattice generated from the schemas of the example

considering the concept Customer_ONTO, the first step is to get the intent of the object to map. Selecting Customer_ONTO in the FCA lattice we obtain the set of intents Address, FullName, FiscalCode and Contact, which refer to the respective clusters. Now, for each intent considered, we extract the extent (not considering the elements of G) that represents the target objects of the selected intent. Referring to Figure 4, the list of extents to consider for the intent FullName will be:

T(FullName^Customer_ONTO) = {Customer_DS1, Employee_DS1}

=

CustomerDS1 , EmployeeDS1 , O f f iceDS1



T (AddressCustomer ONT O ) = {CustomerDS1 , O f f iceDS1 } T (FiscalCodeCustomer ONTO ) = {CustomerDS1 }

58

P. Ceravolo et al.

Table 4. Mapping relations for the concepts Customer and Employee generated by the FCAmapping generator ⎫ ⎧ T (FullNameCustomer ) = CustomerNameCustomer ⎪ ⎪ ⎪ ⎪ ONTO DS1 ⎪ ⎪ Customer ) = PhoneCustomer ⎪ ⎪ ⎪ ⎪ T (Contact ⎪ ⎪ DS1 ONTO ⎞ ⎛ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ Customer ⎜ ⎟ ⎪ ⎪ AddressLine1 ⎪ ⎪ DS1 ⎜ ⎟ ⎪ ⎪ ⎪ ⎪ ⎜ ⎟ ⎪ ⎪ ∧ ⎪ ⎪ T (AddressCustomer ) = ⎜ ⎟ ⎪ ⎪ ⎪ ⎪ Customer ONTO ⎟ ⎜ AddressLine2 ⎪ ⎪ ⎬ ⎨ ⎟ ⎜ DS1 ⎠ ⎝ ∧ TCustomerONTO ⎪ ⎪ ⎪ ⎪ Δ(PhoneCustomer ) ⎪ ⎪ ⎪ ⎪ DS1 ⎞ ⎛ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ Customer , ⎟ ⎪ ⎪ ⎪ ⎜ ⎪ ⎪ CustomerName ⎪ ⎪ DS1 ⎟ ⎜ ⎪ ⎪ ⎪ Customer ) = Fc ⎜ DateO f BirthCustomer , ⎟ ⎪ ⎪ ⎪ T (FiscalCode ⎪ ⎪ DS1 ONTO ⎟ ⎜ ⎪ ⎪ ⎪ ⎪ Customer ⎠ ⎝ ⎪ ⎪ PlaceO f Birth , ⎪ ⎪ DS1 ⎪ ⎪ ⎭ ⎩ Customer SexDS1 ⎧ Employee ⎫ Employee Employee T (FullNameONTO ) = (FirstNameDS1 )⎪ ∧ LastNameDS1 ⎪ ⎪ ⎪ ⎪ ⎪ Employee O f f ice Employee ⎪ ⎪ ⎪ ⎪ ) ∨ Email ) = (TelePhone T (Contact ⎪ ⎪ DS1 DS1 ONTO ⎪ ⎪ ⎛ ⎞ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ O f f ice ⎪ ⎪ ⎜ ⎟ ⎪ ⎪ Street ⎪ ⎪ ⎜ ⎟ DS1 ⎪ ⎪ ⎪ ⎪ ⎜ ⎟ ⎪ ⎪ ⎬ ⎨ ⎟ ⎜∧ ⎟ ⎜ O f f ice TEmployeeONTO CityDS1 ⎟ ⎜ Employee ⎪ ⎪ ⎟ ⎪ ⎪ T (AddressONTO ) = ⎜ ⎪ ⎪ ⎟ ⎜∧ ⎪ ⎪ ⎪ ⎪ ⎟ ⎜ ⎪ ⎪ O f f ice ⎪ ⎪ ⎟ ⎜ PostalCode ⎪ ⎪ ⎪ ⎪ DS1 ⎜ ⎟ ⎪ ⎪ ⎪ ⎪ ⎝ ⎠ ∧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ O f f ice ⎪ ⎪ ⎪ ⎪ ) Δ(TelePhone ⎪ ⎪ DS1 ⎪ ⎪ ⎭ ⎩ Department Employee T (TaskONTO ) = TaskDS2

The set Tog of elements Teig needs to be semantically analysed and pruning redundant information. The set Tog is pruned by applying a process that removes from the set Tog the elements that do not share any equal instance. The algorithm performs a set of queries and analyses the results to decide which elements esn of sn in L are not semantically related to the element eg of G4 . After the pruning process the set Tog will be:

TCustomerONT O

⎧ ⎫ T (FullNameCustomer ⎪ ONTO ) = {CustomerDS1 } ⎪ ⎨ ⎬ Customer T (ContactONT O ) = {CustomerDS1 }

Customer ) = {Customer ⎪ DS1 } ⎩ T (AddressONTO Customer

⎪ ⎭ ⎧ ⎫ Employee ⎪ T (FullNameONT O )  = {EmployeeDS1} ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ T (Contact Employee) = EmployeeDS1 ⎪ ⎬ ONTO O f f iceDS1 TEmployeeONT O ⎪ ⎪ Employee ⎪ ⎪ ⎪ ⎪ T (AddressONT O ) = {O f f iceDS1} ⎪ ⎪ ⎪ ⎪ ⎩ Employee ⎭ T (TaskONT O ) = {DepartmentDS2} T (FiscalCodeONTO ) = {CustomerDS1 }

4

We will avoid the details of the queries preformed to disambiguate the elements because the semantic queries are out of the scope of this paper.

A Framework for Semi-automatic Data Integration

59

for the concepts Customer and Employee respectively. Once the set of Tog is pruned from the redundant information it is possible to convert the set in the mappings M p and Mo . The conversion process is performed substituting the objects in Tog (the tables of the local schemas) with its correspondent in the set of ARs. For our example we obtain the relations in Table 4, which is the set of M p for the concepts Employee and Customer. To complete the mapping we need to generate the set Mo that is produced according to the tables involved in the mapping M p . Considering the set of ICs generated previously the set of Mo is empty in case of the concept Customer, because all the attributes are mapped on a single table, in case of the concept Employee the tables involved are: EmployeeDS1 , O f f iceDS1 and DepartmentDS2 and then the set Mo in this case is composed by all the ICs that refers to the tables considered in the mapping. It is important to underline that all the tables of the IC need to be present in the mapping. The set Mo in case of Employee is: Mo = {(IC1), (IC4)}. The resulting sets are the mapping M p and Mo that concludes the mapping generation process. The discovered mapping M needs to be validated: if the mapping does not return any result from the query engine or the query can not be resolved then the mapping is not considered to be correct, the wrong mapping is passed in the mapping generation process and an alternative mapping is generated.

5 Conclusions This paper addressed the issue of managing a variety of matching operators in a complex Data Integration system performing a semiautomatic process. A solution based on a categorization of matching operators that allows to group similar attributes in a semantically rich form. This way we define all the information need in order to create a mapping. The system described has been tested with encouraging results. Several public data sources and the correspondent ontological representation can be found online and they can be used for an evaluation by comparing the mappings generated by our tool with mapping generated by a domain expert. This way we can compare our tool with others well known data integration systems (COMA++, OntoBuilder, Harmony, Mafra) by exploiting classical Information Retrieval quality measures such as Precision and Recall. Moreover, the matching process is a key factor for the quality of the final mapping, then we performed an additional evaluation based on the benchmark test of the Ontology Alignment Evaluation Initiative contest 2007 (OAEI, http://oaei.ontologymatching.org/). Future work will focus on the use of logic based decisional process that will build a knowledge base starting from the results of the matching operators and will return the final matching as result of reasoning process over the knowledge base. Acknowledgements. This work was partly funded by the Italian Ministry of Research under FIRB contract n. RBNE05FKZ2 004, TEKNE. This work is also partially supported by British Telecom (BT) research and venturing.



References 1. Abiteboul, S., Duschka, O.M.: Complexity of answering queries using materialized views, pp. 254–263 (1998) 2. Braun, P., L¨otzbeyer, H., Sch¨atz, B., Slotosch, O.: Consistent integration of formal methods. In: Schwartzbach, M.I., Graf, S. (eds.) TACAS 2000. LNCS, vol. 1785, pp. 48–62. Springer, Heidelberg (2000) 3. Calvanese, D., Lenzerini, M., Nardi, D.: Description logics for conceptual data modeling. In: Logics for Databases and Information Systems, pp. 229–263 (1998) 4. Cui, Z., Damiani, E., Leida, M.: Benefits of ontologies in real time data access. In: Digital EcoSystems and Technologies Conference, 2007. DEST 2007. Inaugural IEEE-IES, February 21-23, 2007, pp. 392–397 (2007) 5. Duschka, O.M., Genesereth, M.R., Levy, A.Y.: Recursive query plans for data integration. Journal of Logic Programming 43(1), 49–73 (2000) 6. Euzenat, J., Shvaiko, P.: Ontology matching. Springer, Heidelberg (2007) 7. Ganter, B., Stumme, G., Wille, R. (eds.): Formal Concept Analysis, Foundations and Applications. LNCS, vol. 3626. Springer, Heidelberg (2005) 8. Grahne, G., Mendelzon, A.O.: Tableau techniques for querying information sources through global schemas. In: Beeri, C., Bruneman, P. (eds.) ICDT 1999. LNCS, vol. 1540, pp. 332– 347. Springer, Heidelberg (1999) 9. Hakimpour, F., Geppert, A.: Global schema generation using formal ontologies (2002) 10. Halevy, A.Y.: Answering queries using views: A survey. VLDB Journal: Very Large Data Bases 10(4), 270–294 (2001) 11. Hammer, J., Garcia-Molina, H., Widom, J., Labio, W., Zhuge, Y.: The stanford data warehousing project. IEEE Quarterly Bulletin on Data Engineering; Special Issue on Materialized Views and Data Warehousing 18(2), 41–48 (1995) 12. Lenzerini, M.: Data integration: a theoretical perspective. In: PODS 2002: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 233–246. ACM Press, New York (2002) 13. Parent, C., Spaccapietra, S.: Issues and approaches of database integration. Commun. ACM 41(5), 166–178 (1998) 14. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB Journal: Very Large Data Bases 10(4), 334–350 (2001) 15. Yager, R.R.: On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Trans. Syst. Man Cybern. 18(1), 183–190 (1988)

Experiences with Industrial Ontology Engineering

Jon Atle Gulla

Department of Computer and Information Science, Norwegian University of Science and Technology, Norway
[email protected]

Abstract. The petroleum industry is a technically challenging business with highly specialized companies and complex operational structures. Several terminological standards have been introduced over the last few years, though they address particular disciplines and cannot help people collaborate efficiently across disciplines and organizational borders. This paper discusses the results from the industrially driven Integrated Information Platform project, which has developed and formalized an extensive OWL ontology for the Norwegian petroleum business. The ontology is now used in production reports, and it is considered vital to semantic interoperability and the concept of integrated operations on the Norwegian continental shelf.

1 Introduction

The petroleum industry on the Norwegian continental shelf (NCS) is technically challenging, with demanding subsea installations and difficult climatic conditions. It is a fragmented business, in the sense that there is little collaboration between phases and disciplines in large petroleum projects. There are many specialized companies involved, though their databases and applications tend not to be well integrated with each other. Research done by the Norwegian Oil Industry Association (OLF) shows that there is a need for more collaboration and integration across phases, disciplines and companies to maintain the industry's profitability [11]. The existing standards do not provide the necessary support for this, and the result is costly and risky projects and decisions based on wrong or outdated data. This paper presents the vision and some main results of the Integrated Information Platform (IIP) project. The idea of the IIP project was to extend and formalize an existing terminology standard for the petroleum industry, ISO 15926. Using Semantic Web technologies, we have turned this standard into a real ontology that provides a consistent, unambiguous terminology for selected areas in the oil and gas industry. The results of the project so far are promising, and the ontology developed by IIP is now being adopted by industry and is used in production reporting to the government. The work in IIP is the first step towards the concept of integrated operations in the petroleum sector. In this long-term vision, semantic standards and tools enable companies to work seamlessly together across geographical and organizational borders, and people from different disciplines or phases can cooperate without terminological confusion and misunderstandings.



The paper is organized as follows. In Section 2 we go through the structures and challenges in the subsea petroleum industry, explaining the status of current standards and the vision of future integrated operations. Section 3 briefly presents the parts of the Semantic Web initiative relevant to this project. Whereas the ontological work in the IIP project is introduced in Section 4, we discuss the issue of introducing semantic standards in the petroleum business in Section 5. Conclusions are found in Section 6.

2 The Subsea Petroleum Industry The Norwegian subsea petroleum industry is characterized by sophisticated technologies and highly competent and specialized companies. Many disciplines and competences need to come together in oil and gas projects, and their success is highly affected by the way people and systems collaborate and coordinate their work. On the Norwegian Continental Shelf (NCS) there are traditional oil companies like Statoil, Norsk Hydro and ElfTotalFina, but also specialized service companies like Schlumberger, Haliburton, Baker Hughes, Aker Kværner, FMC KongsbergSub, and smaller ICT service companies. Both the projects and the subsequent production systems are information-intensive. When a well is put into operation, the production has to be monitored closely to detect any deviation or problems. The next generation subsea systems will include numerous sensors that measure the status of the systems and send real-time production data back to onshore operation centers. For these centers to be effective, they need tools that allow them to understand and harmonize data, relate it to other relevant information, and help them deal with the situation at hand. There is a challenge in dealing with the sheer size of this information, but also in interpreting information that is deeply rooted in very technical terminologies. The Norwegian petroleum industry is now facing a number of challenges [10]: Firstly, as most of the resources are in the decline phase, we now produce 2-3 times more oil than what is added through the development of new fields. Secondly, the costs on all the bigger fields are increasing significantly as we enter the decline phase. Thirdly, we see a development from traditional big oil fields of 300-400 million Sm3 (standard cubic meters, equal to 6.29 barrels) to fields of only 3-5 million Sm3, which also implies that many small and specialized companies enter the market. Lastly, the exploration in the north is environmentally very sensitive and requires new approaches to deal with climatic and geographical issues. All these trends pose a challenge to the profitability of existing and future petroleum fields on NCS. While the costs of old large fields are increasing, the new ones are financially less attractive due to scalability problems. The multitude of companies involved, with their own applications and databases, makes coordination and collaboration more important than in the past. For the industry as a whole, this severely hampers the integration of applications and organizations as well as the decision making processes in general: •

Integration. Even though there is some cooperation between companies in the petroleum sector, this cooperation tends to be set up on an ad-hoc basis for a particular purpose and supported by specifically designed mappings between applications and databases. There is little collaboration across disciplines and phases,





as they usually have separate databases structured according to different goals, processes and terminologies. It is of course possible to map data from one database to another, but with the complexity of data and the multitude of companies and applications in the business this is not a viable approach for the industry as a whole. Decision making. A current problem is the lack of relevant high-quality information in decision making processes. Some data is available too late or not at all because of lack of integration of databases. In other cases relevant data is not found due to differences in terminology or format. And even when information is available, it is often difficult to interpret its real content and understand its limitations and premises. This is for example the case when companies report production figures to the government using slightly different terminologies and structures, making it very hard to compare figures from one company to another.

XML is already used extensively in the petroleum industry as a syntactic format for exchanging data. Over the last few years, there have been several initiatives for defining semantic standards to support information sharing in the business, but they have typically been limited to particular disciplines, companies or activities.

ISO 15926 Integration of Life-Cycle Data

ISO 15926 is a standard for integrating life-cycle data across phases (e.g. concept, design, construction, operation, decommissioning) and across disciplines (e.g. geology, reservoir, process, automation). It consists of 7 parts, of which Parts 1, 2 and 4 are the most relevant to this work. Whereas Part 1 gives a general introduction to the principles and purpose of the standard, Part 2 specifies the representation language for defining application-specific terminologies. Part 2 comes in the form of a data model and includes 201 entities that are related in a specialization hierarchy of types and sub-types. It is intended to provide the basic types necessary for defining any kind of industrial data. Being specified in EXPRESS (International Standards Association [8]), it has a formal definition based on set theory and first-order logic. Part 4 of ISO 15926 is comprised of application- or discipline-specific terminologies, and is usually referred to as the Reference Data Library (RDL). These terminologies, described as RDL classes, are instances of the data types from Part 2 and are related to each other in a specialization hierarchy of classes and sub-classes as well as through memberships and relationships. If Part 2 defines the language for describing standardized terminologies, Part 4 describes the semantics of these terminologies. Part 4 today contains approximately 50.000 general concepts like motor, turbine, pump, pipes and valves. ISO 15926 is still under development, and only Parts 1 and 2 have so far become ISO standards. In addition to adding more RDL classes for new applications and disciplines in Part 4, there is also a discussion about standards for geometry and topology (Part 3), procedures for adding and maintaining reference data (Parts 5 and 6), and methods for integrating distributed systems (Part 7). Neither ISO 15926 nor other standards have the scope and formality to enable proper integration of data across phases and disciplines in the petroleum industry.
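The Part 2 / Part 4 layering can be made concrete with a small sketch: an RDL-style reference class declared as an instance of a Part 2 entity type and as a specialization of a more general RDL class. All namespaces, URIs and class names below are illustrative placeholders chosen for the example, not the official ISO 15926 identifiers, and the code is only a minimal sketch using the rdflib library.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

# Hypothetical namespaces standing in for the ISO 15926-2 data model and the Part 4 RDL
PART2 = Namespace("http://example.org/iso15926/part2#")
RDL = Namespace("http://example.org/iso15926/rdl#")

g = Graph()
# A Part 4 reference-data class is an instance of a Part 2 entity type ...
g.add((RDL.CENTRIFUGAL_PUMP, RDF.type, PART2.ClassOfInanimatePhysicalObject))
# ... and sits in the RDL specialization hierarchy of classes and sub-classes
g.add((RDL.CENTRIFUGAL_PUMP, RDFS.subClassOf, RDL.PUMP))
g.add((RDL.CENTRIFUGAL_PUMP, RDFS.label, Literal("centrifugal pump")))

print(g.serialize(format="turtle"))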



Integrated Operations The Norwegian Oil Industry Association launched the Integrated Operations program in 2004. The fundamental idea is to integrate processes and people onshore and offshore using new information and communication technologies. Facilities to improve onshore’s abilities to support offshore operationally are considered vital in the first phase of this program. Personnel onshore and offshore should have access to the same information in real-time and their work processes should be redefined to allow more collaboration and be less constrained by time and space. OLF has estimated that the implementation of integrated operations on NCS can increase oil recovery by 34%, accelerate production by 5-10% and lower operational costs by 20-30% [11]. Central in the program is the semantic and uniform manipulation of heterogeneous data that can be shared by all relevant parties. Decisions often depend on real-time production data, visualization data, and background documents and policies, and the data range from highly structured database tables to unstructured textual documents. This necessitates intelligent facilities for capturing, tracking, retrieving and reasoning about data.

Fig. 1. An oil and gas ontology allows cooperation across companies and disciplines (adapted from OLF)

The first generation of OLF’s integrated operations includes the definition of common terminologies that enable the automatic transfer of data between applications in the same discipline or inside the same company. Onshore operation centers for monitoring and controlling subsea oil installations are also part of this generation. The second generation requires complete formal ontologies that cover multiple domains and disciplines and support reasoning and inference of data using real-time data and rules. This will allow operators and vendors to integrate their operation centers, and subsea installations can to some extent control themselves using smart sensors and rule-based control systems that make use of semantic standards to integrate and interpret data



from highly heterogeneous sources. Figure 1 shows how a comprehensive oil and gas ontology based on ISO 15926 is intended to support integration across disciplines and phases.

3 Semantic Web and Interoperability “The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation” [3]. The Semantic Web is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. The general idea is to annotate data and services with machine-processable semantic descriptions. These descriptions must be specified according to a certain grammar and with reference to a standardized domain vocabulary. The domain vocabulary is referred to as an ontology and is meant to represent a common conceptualization of some domain. The grammar is a semantic markup language, as for example the OWL web ontology language recommended by W3C. With these semantic annotations in place, intelligent applications can retrieve and combine documents and services at a semantic level, they can share, understand and reason about each other’s data, and they can operate more independently and adapt to a changing environment by consulting a shared ontology [15, 17]. Interoperability can be defined as a state in which two application entities can accept and understand data from the other and perform a given task in a satisfactory manner without human intervention. We often distinguish between syntactic, structural and semantic interoperability [1, 5]: • •



Syntactic interoperability denotes the ability of two or more systems to exchange and share information by marking up data in a similar fashion (e.g. using XML). Structural interoperability means that the systems share semantic schemas (data models) that enable them to exchange and structure information (e.g. using RDF). Semantic interoperability is the ability of systems to share and understand information at the level of formally defined and mutually accepted domain concepts, enabling machine-processable interpretation and reasoning.

For the Semantic Web technology to enable semantic interoperability in the petroleum industry, it needs to tackle the problem of semantic conflicts, also called semantic heterogeneity. Since the databases are developed by different companies and for different phases and/or disciplines, it is often difficult to relate information that is found in different applications. Even if they represent the same type of information, they may use formats or structures that prevent the computers from detecting the correspondence between data. For example, the tables ORG_NAME and COMPNY in two different applications may in fact contain the same information about organizations. Similarly, while a time period may be modeled with the variables “StartTime” and “Endtime” in one database, the same information may be represented with “StartTime” and “Duration” in another (see for example Pollock & Hodgson [13]). Even for concepts that are well understood and subjected to international conventions, the definitions may be slightly different from one source to another. The descriptions of ‘mean time between failure’ in Figure 2, which are extracted from various sources

66

J.A. Gulla

used in the petroleum industry, are almost identical, but it turns out that the differences are large enough to cause problems when data about mean times are transferred between applications. 1 2 3 4 5 6 7

8

Mean time between failure “A period of time which is the mean period of time interval between failures” “The time duration between two consecutive failures of a repaired item” (International Electrotechnical Vocabulary online database) “The expectation of the time between failures” (International Electrotechnical Vocabulary online database) “The expectation of the operating time between failures” (MIL-HDBK-296124) “Total time duration of operating time between two consecutive failures of a repaired item” (International Electrotechnical Vocabulary online database) “Predicts the average number of hours that an item, assembly, or piece part will operate before it fails” (Jones, J. V. Integrated Logistics Support Handbook, McGraw Hill Inc, 1987) “For a particular interval, the total functional life of a population of an item divided by the total number of failures within the population during the measurement interval. The definition hoolds for time, rounds, miles, events, or other measure of life units”. (MIL-PRF-49506, 1996, Performance Specification Logistics Management Information) “The average length of time a system or component works without failure” (MIL-HDBK-29612-4) Fig. 2. Different definitions of ‘mean time between failure’

The Semantic Web’s approach to these problems is the construction of shared formal ontologies of all important domain concepts. These may be specified in OWL, which is a semantic markup language based on Description Logic. It has an XML syntax, is built on top of RDF(S)’s property statements and class hierarchies, and adds constraints for class membership, equivalence, consistency and classification [2, 16].

4 Developing Oil and Gas Ontologies The Integrated Information Platform (IIP) project was a collaboration project between companies active on NCS and academic institutions, supported by the Norwegian Research Council [14]. Its long-term target was to increase petroleum production from subsea systems by making high quality real-time information for decision support accessible to onshore operation centers. The IIP project started in June 2004 and terminated at the end of June 2007 with a total budget of 26 million NOK (about 3.25 million Euro). The participants included Det Norske Veritas, Statoil, Norsk Hydro, Cap Gemini, Poseidon, OLF, FMC Technologies, National Oilwell Varco, OilCamp, POSC, IBM and NTNU. The project addressed the need for a common understanding of terms and structures in the subsea petroleum industry. The objective was to ease the integration of data and processes across phases and disciplines by providing a comprehensive

Experiences with Industrial Ontology Engineering

67

Fig. 3. The standardization approach in IIP

unambiguous and well accepted terminology standard that lends itself to machineprocessable interpretation and reasoning. This should reduce risks and costs in petroleum projects and indirectly lead to faster, better and cheaper decisions. The project has identified a representative set of real-time data from reservoirs, wells and subsea production facilities. The OWL web ontology language was chosen as the markup language for describing these terms semantically in an ontology. The entire standard is thus rooted in the formal properties of OWL, which has a modeltheoretic interpretation and to some extent support formal reasoning. A major part of the project was to convert and formalize the terms already defined in ISO 15926 Part 2 (Data Model) and Part 4 (Reference Data Library). Since the ISO standard addresses rather generic concepts, though, the ontology also includes more specialized terminologies for the oil and gas segment. Detailed terminologies for standard products and services were included from other dictionaries and initiatives (DISKOS,WITSML, ISO 13628/14224, SAS), and the project also opened for the inclusion of terms from particular processes and products at the bottom level. In sum, the ontology built in IIP has a structure as shown in Figure 3. The ontology engineering approach in IIP was a combination of converting formal ISO 15926 definitions to manual modeling and verification of ontological structures. Due to the formality of ISO 15926’s EXPRESS notation most of the ISO concepts could be automatically converted into legal OWL constructs. The manual modeling part was led by Det Norske Veritas and was handled by multi-disciplinary teams with years of experiences from standardization work and modeling projects. This conversion of ISO 15926-2/4 from EXPRESS gave us an OWL hierarchy that has formed the backbone of the new oil and gas ontology. Additional terms were gradually and manually added to this hierarchy to reflect the larger scope of the new standard. In these initial stages it was considered important to concentrate on hierarchical relationships between concepts. Relationships and constraints of classes and

68

J.A. Gulla

Fig. 4. Christmas tree OWL hierarchy

relationships, which are needed for more sophisticated reasoning with rules, are assumed to be added over time as the ontlogy matures. Take for example the concept Christmas tree, which is an assembly of parts that is connected to the top of a wellhead to control the flow out of the well. Its OWL definition (without relationships and constraints) is:



An artefact that is an assembly of pipes and piping parts, with valves and associated control equipment that is connected to the top of a wellhead and is intended for control of fluid from a well.

CHRISTMAS TREE



These statements give us an informal definition of Christmas trees and reveal that they are subclasses of artefact. Looking at the excerpt of the class hierarchy in Figure 4, we see that there are at least three types of Christmas tree (subsea, vertical, and horizontal). It is a specialization of Artefact, which in turn is an Inanimate physical object that is made or given a shape by man. The Pipe class is also a specialization of Artefact, but it is also a specialization of two other classes. This is quite natural, as the pipe both has a physical (artefact) and a functional dimension (pipeline or network connection). More details about the construction of the ontology can be found in [4].

Experiences with Industrial Ontology Engineering

69

The IIP project has now converted the ISO 15926 Part 2 (210 elements) and Part 4 (about 50.000) elements into OWL class hierarchies. In addition, we have incorporated additional terms from the following disciplines: • • • • • • •

Geometry and topology: ca. 400 terms Drilling and logging: ca. 2.700 terms Production: ca. 2.000 terms Safety and automation: ca. 150 terms Subsea equipment: ca. 1.000 terms Reservoir characterization Reliability and maintenance

The Tyrihans oil field, operated by Statoil, was used as a case in the IIP project. This means that the initial terms included in the ontology were based on the Tyrihans specifications, though they had been generalized and verified against other specifications as well, like ISO 13628 “Petroleum and natural gas industries – Design and operation of subsea production systems”. The ontology is the basis for developing new semantically interoperable applications, and IIP has already started experimenting with integrated visualization and information retrieval environments.

5 Industrial Adoption of Semantic Standards In recent years a number of powerful new ontologies have been constructed and applied in domains like medicine and biology, where Semantic Web technologies and web mining have been exploited in new intelligent applications [1, 6, 12 ]. However, these disciplines are heavily influenced by government support and are not as commercially fragmented as the petroleum industry. Creating an industry-wide standard in a fragmented industry is a huge undertaking that should not be underestimated. In this particular case, we have been able to build on an existing standard, ISO 15926. This has ensured sufficient support from companies and public institutions. There is still an open question, though, what the coverage of such an ontology should be. There are other smaller standards out there, and many companies use their own internal terminologies for particular areas. The scope of this standard has been discussed throughout the project as the ontology grew and new companies signalled their interest. For any standard of this complexity, it is important also to decide where the ontology stops and to what extent hierarchical or complementing ontologies are to be encouraged. Techniques for handling ontology hierarchies and ontology alignment and enrichment must be considered in a broader perspective. As far as the construction of the ontology is concerned, there was a need for both domain experts and ontology engineers. Since both the syntax and the semantics of OWL are non-trivial, it cannot be assumed that domain experts do the modeling themselves. To handle the complexity, the IIP project decided to model only the hierarchical relations in the first round, delaying relationships and constraints until the hierarchies were stable. For later update and quality assessment, it may be useful to use text mining techniques for automatic term extraction [7, 9]. The quality of ontologies is a delicate topic. It is important to choose an appropriate level of granularity. In this project we have been fortunate to have an existing

70

J.A. Gulla

standard to start with. What was considered satisfactory in ISO 15926 may however not be optimal for the ontology-driven applications that will make use of the future ontology. Ultimately, we need to consider how the ontology will be used in these applications and the nature of the source data to be annotated with ontological descriptions. Since the Semantic Web is still a rather immature technology, there are still open issues that need to be addressed in the future. One problem in the IIP project is that we needed the full expressive power of OWL (OWL Full) to represent the structures of ISO 15926-2/4. Reasoning with OWL specifications is then incomplete. The lack of industrial SW applications is another issue worth taking into consideration. There may be performance and maintenance complexities that are still unclear with such an untested technology. However, there is now a large community promoting SW technologies and developing innovative applications, and the first commercial products have also emerged. Additionally, the tool development in IIP indicates that the technology can form the semantic foundation for a new generation of intelligent, interoperable information services. The success of the new ontology, and standardization work in general, depends on the users’ willingness to commit to the standard and devote the necessary resources. If people do not find it worthwhile to take the effort to follow the new terminology, it will be difficult to build up the necessary support. This means that it is important to provide environments and tools that simplify the use and maintenance of the ontology. Intelligent ontology-driven applications must demonstrate the benefits of the new technology and convince the users that the additional sophistication pays off. A positive sign is that daily production reports and daily drilling reports are now standardized across companies with the help of our ontology, and the major oil companies on NCS as well as IBM are now working on a similar semantic standarization of monthly production reports. The industry has received the standard with enthusiasm and are already planning new projects for further expansion of the standard and the development of appropriate semantic applications.

6 Conclusions The Integrated Information Platform project is one of the first attempts at applying state-of-the-art Semantic Web technologies in an industrial context. Existing standards have been converted and extended into a comprehensive OWL ontology for reservoir and subsea production systems. The intention is that this ontology will later be approved as an ISO standard and form a basis for developing interoperable applications in the industry. With the new ontology at hand, the industry will have taken the first step towards integrated operations on the Norwegian Continental Shelf. Data can be related across phases and disciplines, helping people collaborate and reducing costs and risks. However, there are costs associated with building and maintaining such an ambitious ontology. It remains to be seen if the industry is able to take advantage of the additional expressive power and formality of the new ontology. The work in IIP indicates that both information retrieval systems and sensor monitoring systems can benefit from having access to an underlying ontology for analyzing data and interpreting user needs.

Experiences with Industrial Ontology Engineering

71

As the class hierarchies in the ontology are completed, the emphasis of the IIP project will be put on adding more relationships and constraints to the ontology. This also includes specifying rules that will be used to analyze anomalies in the real-time data from subsea sensors. At that point we can start exploiting the logical properties of OWL and start experimenting with the next generation rule-based notification systems. We can also use agents to simplify the coordination of work and improve cooperation along the entire value chain. We will then see if a strong semantic foundation makes it easier for us to handle and interpret the vast amount of data that are so typical to the petroleum industry. Acknowledgements. This research is funded by the Integrated Information Platform for reservoir and subsea production systems project under the Petromax research program.

References 1. Aguilar, A.: Semantic interoperability in the context of e-health: CDH Seminar (2005), http://m3pe.org/seminar/aguilar.pdf 2. Antoniou, G., Franconi, E., van Harmelen, F.: Introduction to semantic web ontology languages. In: Eisinger, N., Małuszyński, J. (eds.) Reasoning Web. LNCS, vol. 3564, pp. 1– 21. Springer, Heidelberg (2005) 3. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 284(5), 34–43 (2001) 4. Christiansen, T., Jensen, M., Valen-Sendstad, M.: Defining iso 15926-4 reference data library classes in owl: The Norwegian Oil Industry Association (2005), http://www.olf.no/io/kunnskapsind/?28140.pdf 5. Dublin Core. Dublin core metadata glossary (2004), http://library.csun.edu/mwoodley/dublincoreglossary.html 6. Gene Ontology Consortium, Gene ontology: Tool for the unification of biology. Nature Genet. 25, 25–29 (2000) 7. Gulla, J.A., Brasethvik, T., Kaada, H.: A flexible workbench for document analysis and text mining. In: The 9th International Conference on Applications of Natural Language to Information Systems (NLDB 2004), Salford (2004) 8. International Standards Association, Industrial automation systems and integration - product data representation and exchange. Par 11: Description methods: The express language reference manual (2007), http://www.iso.org/iso/en/ CatalogueDetailPage.CatalogueDetail?CSNUMBER=18348 9. Maedche, A.: Ontology learning for the semantic web. Kluwer Academic Publishers, Dordrecht (2002) 10. OLF. Digital infrastructure offshore - common network operation management for digital infrastructure offshore on the norwegian continental shelf: The Norwegian Oil Industry Association (2005) 11. OLF. Integrated work processes: Future work processes on the norwegian continental shelf: The Norwegian Oil Industry Association (2005) 12. Pisanelli, D.M. (ed.): Ontologies in medicine. Studies in health technology and informatics, vol. 102. IOS Press, Amsterdam (2004)

72

J.A. Gulla

13. Pollock, J.T., Hodgson, R.: Adaptive information: Improving business through semantic interoperability, grid computing, and enterprise integration. Wiley Publishers, Chichester (2004) 14. Sandsmark, N., Mehta, S.: Integrated information platform for reservoir and subsea production systems. In: Proceedings of the 13th Product Data Technology Europe Symposium (PDT 2004), Stockholm (2004) 15. Sheth, A., Bertram, C., Avant, D., Hammond, B.: Managing semantic content for the web. IEEE Internet Computing, 80–87 (July/August 2002) 16. W3C. Owl web ontology language overview (2006), http://www.W3c.Org/tr/owl-features/ 17. Zhong, N., Liu, J., Yao, Y.: In search of the wisdom web. Computer, 27–31 (2002)

A Semiotic Approach to Quality in Specifications of Software Measures

Erki Eessaar

Department of Informatics, Tallinn University of Technology, Raja 15, 12618 Tallinn, Estonia
[email protected]

Abstract. Each software entity should have as high quality as possible in the context of limited resources. A specification of a software quality measure is a kind of software entity. Existing studies about the evaluation of software measures do not pay enough attention to the quality of specifications of measures. Semiotics has been used as a basis in order to evaluate the quality of different types of software entities. In this paper, we propose a multidimensional, semiotic quality framework of specifications of software quality measures. We apply this framework in order to evaluate the syntactic and semantic quality of two sets of specifications of database design measures. The evaluation shows that these specifications have some quality problems. Keywords: Metrics, Measures, Semiotics, Quality, Metamodel, Database design, SQL.

1 Introduction

Measurement results, which are produced by performing measurements based on software quality measures (software measures), allow developers to evaluate the quality of software entities and improve them if necessary. Measures themselves must also have as high a quality as possible. A part of the development of each measure is the formal and empirical evaluation of the measure [1]. Existing evaluation methods of measures do not pay enough attention to the quality of specifications of measures. If the quality of a specification is low, then it is difficult to understand and apply the measure. Therefore, we need a method for evaluating the quality of specifications of measures. On the other hand, there are already quite a lot of studies about how to use semiotics (the theory of signs) in order to evaluate the quality of software entities. In this paper, we extend this research to the domain of measures. The first goal of the paper is to introduce a semiotic quality framework for evaluating specifications of software measures. This framework is created based on the semiotic quality framework of conceptual modeling SEQUAL that was proposed by Lindland et al. [2] and has been improved since then. The second goal of the paper is to show the usefulness of the proposed framework by presenting the results of a study about the syntactic and semantic quality of two sets of specifications of database design measures.



We follow the guidelines of García et al. [3] and use the term "measure" instead of the term "metric". In this paper the word "measure" denotes "software measure", if not stated otherwise. We use analogy [4] as the research method in order to work out the framework and new measures based on the results of existing research. The rest of the paper is organized as follows. In Section 2, we specify a semiotic quality framework for evaluating specifications of measures. In Section 3, we use the framework in order to evaluate two sets of specifications of database design measures. Section 4 summarizes the paper and points to the future work with the current topic.

2 A Semiotic Quality Framework Many authors have investigated how to evaluate measures and have proposed frameworks that involve empirical and formal validation of measures [5, 6, 7]. The IEEE Standard for a Software Quality Metrics Methodology [8] also specifies how to validate "the relationship between a set of metrics and a quality factor for a given application". Jacquet and Abran [9] investigate the state of the art of validation of measures and describe a detailed model of the measurement process. They claim, based on a literature review, that existing validation frameworks of measures do not pay enough attention to the validation of all the aspects of the design of a measurement method. McQuillan and Power [10] write that many measures "are incomplete, ambiguous and open to a variety of different interpretations". Some researchers have used semiotics as the basis for working out evaluation frameworks of different kinds of software entities. According to the Merriam-Webster dictionary [11], semiotics is "a general philosophical theory of signs and symbols that deals especially with their function in both artificially constructed and natural languages and comprises syntactics, semantics, and pragmatics." Van Belle [12] writes that any informational object has a syntactic, semantic, and pragmatic aspect. Syntax, semantics, and pragmatics relate an informational object to the specification language, the specified domain, and the audience of the object, respectively [2]. Semiotics has been used as the basis for evaluating the quality of conceptual models [2], specifications of requirements [13], ontologies [14], enterprise models [12], and process models [15]. A specification of a software measure is a kind of software entity. In this paper, we propose that semiotics can be successfully used in order to evaluate specifications of measures. 2.1 Specification of the Framework In this section, we present a multidimensional, semiotic evaluation framework of the quality of specifications of measures. A model is a kind of software entity. A specification of a measure is a kind of software entity. Each software entity can be characterized in terms of different quality levels (physical, empirical, syntactic etc.). Each quality level has one or more quality goals. Each quality goal has zero or more associated measures that allow us to measure the quality of a software entity in terms of the goal. The framework comprises physical, empirical, syntactic, semantic, perceived semantic, pragmatic, and social quality. We adapt the semiotic quality framework SEQUAL in order to use it in a new context – the evaluation of specifications of
measures. The framework is intended to complement the existing validation frameworks of measures. In addition, we present three candidate measures for evaluating the syntactic and semantic quality of specifications of measures. A candidate measure is a measure that has not yet been accepted or rejected by experts. We demonstrate the use of these measures in Section 3. These measures do not form a complete suite for evaluating specifications of measures. Future studies must work out a suite of measures that covers all the aspects of the framework. We propose to use metamodels, mappings of model elements, and model-management operations in order to check the quality of some aspects of a specification of a measure. The novelty is in their combined use. The use of metamodels and ontologies in order to specify and evaluate measures is not a new method. Baroni et al. [16] define some database design measures in terms of an SQL:2003 ontology and use the Object Constraint Language (OCL) in order to specify measures as precisely as possible. McQuillan and Power [10] propose to extend the metamodel of the Unified Modeling Language (UML) with a separate package that contains specifications of measures as OCL queries. This allows us to find measurement results based on a software entity e that is created by using a language L. The precondition for the use of the method is the existence of a metamodel of L and the existence of a UML model that represents e. A mapping of model elements has also been used, for instance, in order to evaluate the UML metamodel [17] in terms of the Bunge-Wand-Weber (BWW) model of information systems. In the proposed method and examples we assume that the relevant models are UML class models. Syntactic Quality. Syntactic correctness is the only syntactic goal [15]. Syntactic correctness has two subgoals in the context of specifications of measures because we have to use two different types of languages in order to specify measures. Firstly, the content of each specification of a measure is written by using one or more languages. For instance, these languages could be natural languages like English, generic formal textual languages like OCL, domain-specific formal textual languages like the Performance Metrics Specification Language [18], or generic visual languages like UML. For example, Baroni et al. [16] specify database design measures by using English and OCL. Therefore, the first subgoal of syntactic correctness is to ensure that all specifications of measures follow the syntax rules of the languages that are used to write the content of these specifications. Next, we use an analogy with the database domain in order to illustrate additional aspects of the syntactic quality of specifications of measures. The Third Manifesto [19] is a specification of future database systems. According to the manifesto, each appearance of a value of a scalar type T has exactly one physical representation and one or more possible representations. Each possible representation specifies how to present an appearance of a value to users. The specification of each possible representation for values of type T is part of the specification of T. We could conceptually think about measures as values that belong to the scalar type Measure. In this case, each measure has one or more possible representations of its specification.
Therefore, the second subgoal of the syntactic correctness is to ensure that each appearance of a specification of a measure (specification of a measure in short) conforms to the rules of one of the possible representations of type Measure.

There is more than one specification that can be used as a basis in order to work out a possible representation of a measure. IEEE Standard for a Software Quality Metrics Methodology [8] prescribes how to document software metrics (measures) and Common Information Model [20] provides specification of metrics (measures) schema. Each possible representation has one or more associated constraints that a correctly structured specification of a measure must follow. A problem with the IEEE Standard for a Software Quality Metrics Methodology is that it does not clearly describe constraints that must be present in the possible representation of a measure. For example, if we want to specify this possible representation by using UML class model, then we do not have precise information in order to specify minimum and maximum cardinality at the ends of associations. In this paper, we denote a specification of a measure with the letter m. If we want to check whether m conforms to the second subgoal, then we have to do the following. Firstly, we have to create a model of the structure of m. After that we have to create a mapping between the model of the structure of m and the model that specifies a possible representation of the type Measure. There is a pair of model elements in the mapping if the constructs behind these elements are semantically similar or equivalent. Let us assume that we create these models as UML class models. The elements of these models are classes, properties, and relationships. If X is the set of all the elements of the model of the structure of m and Y is the set of all the elements of the model of possible representation, then ideally there must be a bijective function f: X→Y. The amount of discrepancies between the models characterizes the amount of syntactic problems of m. The creator of a UML class model can often choose whether to model something as a class or as a property (attribute) of a class. Larman [21] suggests about the construction of a conceptual class model: "If in doubt, define something as a separate conceptual class rather than as an attribute." Based on this suggestion, we can simplify the use of the method by considering only classes and not considering properties/relationships that are present in the class models (see Figure 1). It is in line with the example that is provided by Opdahl and Henderson-Sellers [17]. They evaluate a language based on classes of a metamodel (and not based on properties or relationships). We note that Figure 1 illustrates bijective functions and Y does not contain all the possible model elements. Next, we present a candidate measure for evaluating the syntactic richness of a specification of a measure m. SR(m): Let Y be the set of all the classes in a model of a possible representation of measures. Let y be the cardinality of Y. Let X be the set of all the classes in a model of the structure of a specification of a measure m. Let Z be the set of all the classes in Y that have a corresponding class in X. There exists a pair of (corresponding) classes if the constructs behind these classes are semantically similar or equivalent. Let z be the cardinality of Z. Then SR(m) = z/y. The possible value of SR(m) is between 0 and 1. 0 and 1 denote minimal and maximal syntactic richness, respectively. For instance, y=3, z=3, and z/y=1 in case of the example in Figure 1.
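To make the computation concrete, the following sketch (a hypothetical illustration, not part of the framework itself) evaluates SR(m) from two sets of class names; the correspondence test is passed in as a function because, in practice, deciding whether two classes denote semantically similar constructs is a human judgement.

```python
def sr(spec_classes, representation_classes, corresponds):
    """Syntactic richness SR(m) = z / y."""
    y = len(representation_classes)
    # Z: classes of the possible representation (Y) that have a
    # corresponding class in the model of the structure of m (X)
    z = sum(1 for b in representation_classes
            if any(corresponds(a, b) for a in spec_classes))
    return z / y if y else 0.0


# Hypothetical example mirroring Figure 1: X = Y = {name, costs, benefits}
X = {"name", "costs", "benefits"}
Y = {"name", "costs", "benefits"}
print(sr(X, Y, lambda a, b: a == b))  # z = 3, y = 3, SR(m) = 1.0
```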

Fig. 1. A bijective function (X = {name, costs, benefits} mapped one-to-one onto Y = {name, costs, benefits})

Semantic Quality. Each measure has one or more associated domains. For instance, Choinzon and Ueda [22] present 40 measures that belong to the domain of object-oriented design. Piattini et al. [1] present twelve measures that belong to the domain of object-relational database design. Let us assume that we have a specification of a measure m that describes how to measure a domain d. The feasible validity and feasible completeness are the only two semantic goals according to the SEQUAL framework [15]. Validity means that each statement about d that is made by m must be correct and relevant. Completeness means that m must contain all the statements about d that are correct and relevant. On the other hand, it is often impossible to achieve the highest possible semantic quality due to limited resources. Therefore, the goal is to achieve feasible validity and feasible completeness. In this case, there does not exist an improvement of the semantic quality that satisfies the rule: its additional benefit to m exceeds the drawbacks of using it. Each measure considers only some aspect of the domain and not the entire domain. Therefore, we have to consider completeness in terms of sets of related measures. Specifications of measures, which belong to a set of specifications of measures about some domain, must together contain all the statements about the domain that are correct and relevant. How can we evaluate the validity and completeness of specifications of measures? Krogstie et al. [15] write about models that it is only possible to objectively measure the syntactic quality of models. Krogstie et al. [15] think that objective measurement of other quality levels (including semantic quality) of models is not possible because "both the problem domain and the minds of the stakeholders are unavailable for formal inspection." We claim that the situation is partially different in the case of measures. The minds of the stakeholders are still unavailable for formal inspection. On the other hand, each measure can be used in order to measure the quality of one or more software entities. Each software entity is created by using one or more languages. Many of these languages are formal languages. Examples of these languages are UML and the underlying data model of SQL:2003. The abstract syntax of a formal language can be specified by using a metamodel [23]. In the context of measures, the metamodels of these languages are specifications of the domains. We can use the metamodels as a basis in order to evaluate the semantic quality of specifications of measures. Let us assume that we use UML class models for creating metamodels. In this case classes specify language elements and properties/relationships specify relationships between the language elements [23]. Let us assume that we want to evaluate the validity of a specification of a measure m that is used for evaluating software entities that are created by using a language L. The procedure is as follows.

1. Identification of L-specific concepts from m. For instance, Piattini et al. [1] specify the measure "Referential Degree of a table T" as "the number of foreign keys in the table T." In this case, L is SQL and the L-specific concepts are foreign key and table.
2. Construction of a UML class model based on the concepts that are found during step 1.
3. If X is the set of all the model elements from step 2 and Y is the set of all the elements of a metamodel of L, then ideally there must exist a total injective function f: X→Y.
We can simplify the evaluation of validity by considering only classes (see Figure 2) and not considering properties/relationships that are present in the class models (see the previous section). Model elements in Y in Figure 2 are from a metamodel of the underlying data model of SQL:2003 [24]. We note that Figure 2 illustrates total injective functions and Y does not contain all the possible model elements.

Fig. 2. A total injective function (X = {table, foreign key} mapped into Y = {base table, referential integrity constraint, viewed table})

One of the object-relational database design measures [1] is "Percentage of complex columns of a table T." The SQL standard [24] does not specify the concept "complex column". Therefore, in this case the function f is a partial injective function. Next, we present a candidate measure EV(m) for evaluating the validity of a specification of a measure m. EV(m): Let X be the set of all the classes in a class model that is constructed based on the L-specific concepts that are present in a specification of a measure. Let x be the cardinality of X. Let Y be the set of all the classes in a metamodel of a language L. Let Z be the set of all the classes in X that have a corresponding class in Y. There exists a pair of (corresponding) classes if the constructs behind these classes are semantically similar or equivalent. Let z be the cardinality of Z. Then EV(m) = z/x. The possible value of EV(m) is between 0 and 1. 0 and 1 denote minimal and maximal semantic validity, respectively. For instance, x=2, z=2, and z/x=1 in the case of the example in Figure 2. Next, we present a candidate measure EC(M) for evaluating the completeness of a set of specifications of measures (we denote this set as M). We assume that all the measures allow us to evaluate software entities that are created by using a language L. For simplicity, the calculation procedure considers only classes and does not consider properties and relationships. The calculation of EC(M) starts with a preparative phase that contains three steps.
1. For each specification in M, perform step 1 from the validity evaluation procedure.
2. For each specification in M, construct a simplified class model that specifies only classes (based on the result of step 1).
3. Merge all the models that are constructed during step 2 by using the generic model-management operator merge [25].

EC(M): Let X be the set of all the classes in the merged model that is produced as the result of step 3. Let Y be the set of all the classes in a metamodel of a language L. Let y be the cardinality of Y. Let Z be the set of all the classes in Y that have a corresponding class in X. There exists a pair of (corresponding) classes if the constructs behind these classes are semantically similar or equivalent. Let Z' be the set that contains all classes from Z together with all their direct and indirect subclasses. Let z' be the cardinality of Z'. Then EC(M) = z'/y. The possible value of EC(M) is between 0 and 1. 0 and 1 denote minimal and maximal semantic completeness, respectively. Why do we have to construct the set Z'? Value substitutability in the case of a parameter of a read-only operator (that has the declared type T) means that "wherever a value of type T is permitted, a value of any subtype of T shall also be permitted" [19]. Similarly, for instance, a base table is a kind of table. In a metamodel of SQL, base table can be specified as a subclass of table. If we have a measure that allows us to measure tables in general, then it is possible to use this measure in order to measure base tables in particular. For example, if X = {table} and Y = {table, base table}, then Z = {table}, Z' = {table, base table}, y = 2, z' = 2, and z'/y = 1. Other Quality Levels. We use the works of Krogstie et al. [13, 15] as the basis in order to introduce the other quality levels. Physical quality has two goals: externalisation and internalisability [15]. Externalisation means that each measure must be available as a physical artefact that uses statements of one or more languages. Each measure must represent the knowledge of one or more software development specialists. Internalisability means that each measure must be accessible so that interested parties can make sense of it. Minimal error frequency is the only empirical quality goal [13]. Each externalised measure has one or more possible specifications that a human user can read and use. The layout and readability of each specification must allow users to correctly interpret the measure. Feasible perceived validity and feasible perceived completeness are the only two perceived semantic quality goals [13]. The perceived semantic quality of measures considers how the audience of measures interprets measures and their domains. For instance, if we want to evaluate the perceived validity of a specification of a measure, then we have to construct a model that specifies how some interested parties understand the specification. We also have to construct a model that specifies how the parties understand the domain of the measure. After that we have to compare these models. Comprehension is the only pragmatic quality goal [15]. Each specification of a measure must be understandable to its audience. For instance, Kaner and Bond [7] present ten evaluation questions about measures. If a specification of a measure has high pragmatic quality, then an interested party should be able to answer these questions based on the specification. Feasible agreement is the only goal of social quality [15]. The social quality considers how well different parties have accepted a measure (how widely a measure is used), how much they agree on the interpretation of a measure, and how well they resolve the conflicts that arise from different interpretations.
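Returning to the candidate measures EV(m) and EC(M) defined above, they can be sketched in the same set-based style as SR(m). The helper names (corresponds, subclasses_of) and the toy SQL classes below are assumptions made purely for illustration; the paper itself does not prescribe an implementation.

```python
def ev(spec_classes, metamodel_classes, corresponds):
    """Semantic validity EV(m) = z / x (classes extracted from a specification
    that have a counterpart in the metamodel of language L)."""
    x = len(spec_classes)
    z = sum(1 for a in spec_classes
            if any(corresponds(a, b) for b in metamodel_classes))
    return z / x if x else 0.0


def ec(merged_classes, metamodel_classes, corresponds, subclasses_of):
    """Semantic completeness EC(M) = z' / y.

    merged_classes: classes of the merged model built from all specifications in M.
    subclasses_of(c): direct and indirect subclasses of metamodel class c."""
    y = len(metamodel_classes)
    covered = {b for b in metamodel_classes
               if any(corresponds(a, b) for a in merged_classes)}   # set Z
    covered_closure = set(covered)
    for b in covered:
        covered_closure.update(subclasses_of(b))                    # set Z'
    covered_closure &= set(metamodel_classes)
    return len(covered_closure) / y if y else 0.0


# Hypothetical correspondences in the spirit of Figure 2
similar = {("table", "base table"), ("foreign key", "referential integrity constraint")}
corr = lambda a, b: a == b or (a, b) in similar
print(ev({"table", "foreign key"},
         {"base table", "referential integrity constraint", "viewed table"}, corr))  # 2/2 = 1.0

# Subclass closure example from the text: X = {table}, Y = {table, base table}
subs = {"table": {"base table"}, "base table": set()}
print(ec({"table"}, {"table", "base table"}, corr, lambda c: subs[c]))  # z'/y = 2/2 = 1.0
```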

2.2 Discussion Next, we discuss the advantages and possible problems of the proposed approach and analyse the proposed framework in terms of the Software Measurement Ontology [3]. Advantages. The use of the semiotic framework has already been tested in case of different types of software entities. The proposed framework allows us to organize the knowledge about the evaluation of specifications of measures. We can use the existing studies about semiotic frameworks in order to find new means of improving the quality of specifications of measures and candidate measures for evaluating the quality of these specifications. For instance, Burton-Jones et al. [14] present a suite of measures for evaluating ontologies. The suite consists of ten measures that allow us to measure the syntactic, semantic, pragmatic, and social quality. The measure SR(m) (see Section 2.1) is analogous to the measure for evaluating syntactic richness of an ontology. The measure EV(m) is similar to the measure EI for evaluating semantic interpretability of an ontology: "Let C be the total number of terms used to define classes and properties in ontology. Let W be the number of terms that have a sense listed in WordNet. Then EI = W/C" [14]. Instead of WordNet, the measure EV(m) uses a metamodel of the language that is the domain of the measure. The measure EC(M) does not have a corresponding measure in the suite of measures for evaluating ontologies. Challenges. Firstly, the construction of a model based on a specification of a measure, and the creation of a mapping between different models requires somewhat subjective decisions. Therefore, it is possible that two different parties, who use the same measure (SR(m), EV(m), or EC(M)) in case of the same set of specifications of measures, will get different results. Simsion [26] found based on experiments that different data modeling practitioners do produce different conceptual data models for the same scenario. Conceptual data models are similar to the models that are constructed based on the specifications of measures. If we simplify the calculation of syntactic richness, validity, and completeness by considering only classes, then the result depends on whether the creators of models prefer to use attributes or classes in UML class models. In addition, different parties could interpret the same specification of a measure differently. For instance, in our view Piattini et al. [1, 27] use the concept table in order to denote base tables. Base table is not the only possible type of tables. A human user can find this kind of inconsistent use of terminology by studying the context of specification. On the other hand, it makes the automation of the evaluation process more difficult. Secondly, the use of EV(m) and EC(M) requires the existence of metamodels of languages. If the required metamodels do not exist, then the use of the measures will be time consuming because a developer has firstly to acquire the metamodels. Thirdly, there could exist more than one specification of the same measure. These specifications could refer to different language elements. For instance, informal specification of the measure "Referential Degree of a table T" that is proposed by Baroni et al. [16] refers to the language (SQL) elements foreign key and table. On the other hand, formal specification of the same measure in OCL [16] refers to the language (SQL) elements foreign key and base table. 
Therefore, each evaluation must be accompanied with the information about the specification of the measure that is used as the basis of this evaluation.

Finally, it is possible that a language has more than one metamodel. Metamodels are constructed for different reasons, for instance for understanding or for building tools. These metamodels could be created by different parties and could have different levels of detail. For instance, the DMTF Common Information Model database specification of SQL Schema [28], the relational package of the OMG Common Warehouse Metamodel [29], and the ontology of SQL:2003 [16] are variants of a metamodel of SQL. These models contain 8, 24, and 38 classes, respectively. It is also possible that there are differences between the different versions of the same metamodel. The values that characterize the quality of a specification of a SQL-database design measure will be different depending on the metamodel used (see Section 3). Therefore, each metamodel-based evaluation of a specification of a measure must be accompanied with the information about the version of the metamodel that is used in the evaluation. If we want to compare two sets of specifications of measures based on the values of the proposed measures, then these values must be calculated based on the same metamodel version. The Framework and the Software Measurement Ontology. García et al. [3] analyse and consolidate the terminology about software measurement that is present in different standards and research proposals. They propose the Software Measurement Ontology (SMO) in order to present common terminology and overcome inconsistencies between different standards and research proposals. We note that this ontology can be used as a basis in order to evaluate the quality of the proposed measures (SR(m), EV(m), and EC(M)) in terms of the proposed semiotic quality framework. Is the proposed framework consistent with SMO? In this section, we use italics in order to denote concepts from SMO. The framework that is proposed in this paper specifies a quality model, which provides "the basis for specifying quality requirements and evaluating the quality of the entities of a given entity class" [3]. Entities are in this case specifications of measures and sets of specifications of measures. SMO does not specify the concepts quality level and quality goal. In our view, the concept quality level is most similar to the SMO concept measurable concept. Each quality model evaluates one or more measurable concepts according to SMO. A measurable concept is an abstract relationship between information needs, which are necessary in order to manage objectives, goals, risks, and problems, and attributes of entities [3]. Date and Darwen [19] point to the logical differences between a value and an appearance of the value in a particular context. A value has no location in time or space, but its representations (appearances) in memory can simultaneously appear in many different contexts [19]. Similarly, a measure has one or more specifications that have a location in time and space. For instance, the measure "Depth of Relational Tree of a table T" is specified in references [1, 16, 27]. A user of the proposed semiotic framework evaluates specifications of measures. García et al. [3] define the concept measure as "The defined measurement approach and the measurement scale". It is not clear whether the concept measure in SMO corresponds to the concept value or the concept appearance of a value. A specification of a software measure is an entity. Each entity is composed of zero or more other entities according to SMO. A set of specifications of software measures is an example of this kind of composite entity. Each entity belongs to one or more entity classes according to SMO. "Specification of a software measure" and "Set of specifications of software measures" are examples of entity classes. Each entity class has one or more attributes. Each attribute is "a measurable physical or abstract property of an entity, that is shared by all the entities of an entity class" [3]. We suggest that each quality goal from the proposed framework has a corresponding attribute of one or more entity classes.

Fig. 3. A fragment of the conceptual model of the proposed framework (classes Entity, Set of specifications, Measure, Reference, Specification of a measure, Syntactic element, Semantic element, Language, Metamodel, and Metamodel element, together with their associations; syntactic and semantic elements correspond to metamodel elements, and a metamodel specifies the abstract syntax of a language)

SR(m), EV(m), and EC(M) are examples of measures. More specifically, they are derived measures that are "derived from other base or derived measures, using a measurement function as measurement approach" [3]. For instance, in order to find the measurement result of EV(m), we have to find the cardinality (z and x) of two sets of classes. The measurement function of EV(m) is z/x. Section 3 contains some measurement results that have been found based on the measures SR(m), EV(m), and EC(M). SMO does not specify all the concepts that are mentioned in the description of the framework. In Figure 3, we present a conceptual model that shows relationships between some important concepts of the framework. Class Entity is present in the specification of SMO as well as in Figure 3. Each specification of a measure has one or more syntactic and semantic elements. Each such element has zero or more corresponding elements that are part of a metamodel of a language.
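Purely as an illustration of how the Figure 3 fragment could be realised (for instance, as the schema of an evaluation-support tool), the following record types mirror some of its classes; all field names are assumptions made for this sketch and are not prescribed by the framework.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MetamodelElement:
    name: str

@dataclass
class Metamodel:
    name: str                      # specifies the abstract syntax of a language
    elements: List[MetamodelElement] = field(default_factory=list)

@dataclass
class SemanticElement:
    name: str
    corresponds_to: List[MetamodelElement] = field(default_factory=list)

@dataclass
class SyntacticElement:
    name: str
    corresponds_to: List[MetamodelElement] = field(default_factory=list)

@dataclass
class MeasureSpecification:
    name: str
    source: str                    # the reference the specification comes from
    syntactic_elements: List[SyntacticElement] = field(default_factory=list)
    semantic_elements: List[SemanticElement] = field(default_factory=list)
```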

3 An Evaluation of Database Design Measures Next, we illustrate the use of the proposed framework. In this paper, we investigate the quality of specifications of database design measures. The work of Blaha [30] shows us that many databases do not have the highest possible quality. Blaha [30]
writes that about 50% of the databases that his team has reverse engineered have major design errors. Therefore, it is clearly necessary to evaluate and improve the design of databases. We can use database design measures for this purpose. Unfortunately, there exist only a few database design measures. Piattini et al. [27] present three table-oriented measures for relational databases. Piattini et al. [1] present twelve measures that help us to evaluate the design of object-relational databases. The measures allow us to evaluate databases that are created by using SQL. We call the sets of informal specifications of these measures MSQL and MORSQL, respectively. We investigated MSQL and MORSQL by using the proposed measures (see Section 2). For recording the evaluation results and performing the calculations, we constructed a software system based on the database system MS Access. The database of the system was designed based on the conceptual model in Figure 3. For each specification of a measure, we calculated the value of SR(m) based on the specification of the possible representation of measures that is proposed in IEEE Std. 1061-1998 [8]. We assumed that all the components of the possible representation are modelled as separate classes. In Table 1, we summarize the results. For each set of specifications (M), we present the lowest value, the mean value, and the highest value of SR(m) among all the specifications that belong to M. The only components that are in our view present in all the evaluated specifications are name, data items, and computation.

Table 1. Syntactic richness of measures

          lowest   mean   highest
MSQL       0.31    0.36    0.38
MORSQL     0.19    0.24    0.31

For each specification of a measure, we calculated the values of EV(m) based on the following specifications of the domain (SQL): the relational package of the OMG Common Warehouse Metamodel (v1.1) [29], the DMTF CIM database specification (v2.16) [28], and the ontology of SQL:2003 [16]. In Table 2, we summarize the results. For each pair of a set of specifications (M) and a specification of the domain, we present the lowest value, the mean value, and the highest value of EV(m) among all the specifications in M. We also calculated EC(MSQL) and EC(MORSQL) based on the same specifications that we used for calculating EV(m). Table 3 summarizes the results. For each pair of a set of specifications (M) and a specification of the domain (d), we present the value of EC(M) that is calculated in terms of d. Table 2 and Table 3 demonstrate that the results of measurements based on EV(m) and EC(M) depend on the metamodel that is used in the calculation. The CIM database specification specifies fewer classes (8) compared to the CWM (24) and the SQL:2003 ontology (38). Therefore, EC(MSQL) has a relatively high value in the case of the CIM database specification.

Table 2. Validity of measures

                                             lowest   mean   highest
OMG Common Warehouse Metamodel (v1.1)
  MSQL                                        0.33    0.61    1
  MORSQL                                      0.12    0.63    1
DMTF CIM database specification (v2.16)
  MSQL                                        0.33    0.44    0.50
  MORSQL                                      0.12    0.54    1
The ontology of SQL:2003
  MSQL                                        0.33    0.61    1
  MORSQL                                      0.25    0.64    1

Table 3. Completeness of sets of measures

          CWM    CIM    SQL:2003
MSQL      0.08   0.12   0.05
MORSQL    0.21   0.38   0.18

The specifications that belong to MSQL have greater completeness problems than the specifications that belong to MORSQL. However, MORSQL is also not complete. For instance, the specifications of measures in MORSQL do not consider type constructors, domains, triggers, SQL-invoked procedures, and sequence generators. On the other hand, the specifications of measures refer to elements that in our view do not have a corresponding element in the used metamodels: aggregation, arc, attribute of a table, class, complex attribute, complex column, generalization, hierarchy, involved class, referential path, shared class, simple attribute, simple column, and type of complex column.

4 Conclusions In this paper, we proposed a new framework for evaluating the quality of specifications of software measures. The novelty of the approach (in the context of the development of measures) is that it is based on semiotics – the theory of signs. We developed this framework by adapting an existing semiotic framework that is used in order to investigate the quality of different kinds of software entities. We proposed how to use this framework in order to evaluate specifications of measures. We proposed three candidate measures for evaluating the syntactic and semantic quality of specifications of measures. The proposed evaluation framework is intended to complement the existing evaluation methods of measures, which do not pay enough attention to the quality of specifications of measures. We also investigated two sets of specifications of database design measures in terms of the proposed framework as an example. These measures allow designers to measure the design of relational and object-relational databases that are created by using the SQL language. We evaluated the semantic quality of these specifications in terms of different metamodels that specify the domain of the measures (SQL). The results demonstrate that the selection of a metamodel affects the results of the evaluation. We found that the syntactic and semantic quality of the specifications is quite low. Future work must include improving the quality of the measures that were proposed in the paper. We also have to improve the quality of existing database design measures, develop more database design measures, and evaluate these measures in terms of the proposed framework.

References
1. Piattini, M., Calero, C., Sahraoui, H., Lounis, H.: Object-Relational Database Metrics. L'Object (March 2001)
2. Lindland, O.I., Sindre, G., Solvberg, A.: Understanding quality in conceptual modeling. IEEE Software 11(2), 42–49 (1994)
3. García, F., Bertoa, M.F., Calero, C., Vallecillo, A., Ruíz, F., Piattini, M., Genero, M.: Towards a Consistent Terminology for Software Measurement. Information & Software Technology 48, 631–644 (2006)
4. Maiden, N., Sutcliffe, A.: Exploiting reusable specifications through analogy. Commun. ACM 35(4), 55–64 (1992)
5. Schneidewind, N.F.: Methodology for Validating Software Metrics. IEEE Trans. Softw. Eng. 18(5), 410–422 (1992)
6. Kitchenham, B., Pfleeger, S.L., Fenton, N.: Towards a framework for software measurement validation. IEEE Transactions on Software Engineering 21(12), 929–944 (1995)
7. Kaner, C., Bond, P.: Software Engineering Metrics: What Do They Measure and How Do We Know? In: 10th International Software Metrics Symposium (2004)
8. IEEE Std. 1061-1998, Standard for a Software Quality Metrics Methodology. IEEE Standards Dept. (1998)
9. Jacquet, J., Abran, A.: Metrics Validation Proposals: A Structured Analysis. In: 8th International Workshop of Software Measurement (1998)
10. McQuillan, J.A., Power, J.F.: Towards the re-usability of software metric definitions at the meta level. In: PhD Workshop of the 20th European Conference on Object-Oriented Programming (2006)
11. Merriam-Webster online dictionary, http://www.m-w.com/
12. Van Belle, J.P.: A Framework for the Evaluation of Business Models and its Empirical Validation. The Electronic Journal of Information Systems Evaluation 9(1), 31–44 (2006)
13. Krogstie, J.: A Semiotic Approach to Quality in Requirements Specifications. In: IFIP 8.1 Working Conference on Organizational Semiotics, pp. 231–249. Kluwer, B.V., The Netherlands (2002)
14. Burton-Jones, A., Storey, V.C., Sugumaran, V., Ahluwalia, P.: A Semiotic Metrics Suite for Assessing the Quality of Ontologies. Data & Knowledge Engineering 55(1), 84–102 (2005)
15. Krogstie, J., Sindre, G., Jorgensen, H.: Process models representing knowledge for action: a revised quality framework. European Journal of Information Systems 15(1), 91–102 (2006)
16. Baroni, A.L., Calero, C., Piattini, M., Abreu, F.B.: A Formal Definition for Object-Relational Database Metrics. In: 7th International Conference on Enterprise Information Systems, pp. 334–339 (2005)
17. Opdahl, A.L., Henderson-Sellers, B.: Ontological Evaluation of the UML Using the Bunge–Wand–Weber Model. Software and Systems Modeling 1(1), 43–67 (2002)
18. Wismüller, R., Bubak, M., Funika, W., Arodz, T., Kurdziel, M.: Support for User-Defined Metrics in the Online Performance Analysis Tool G-PM. In: Dikaiakos, M.D. (ed.) AxGrids 2004. LNCS, vol. 3165, pp. 159–168. Springer, Heidelberg (2004)
19. Date, C.J., Darwen, H.: Databases, Types and the Relational Model, 3rd edn. The Third Manifesto. Addison-Wesley, Reading (2006)
20. DMTF Common Information Model (CIM) Standards. CIM Schema Ver. 2.15. Metrics schema
21. Larman, C.: Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design and the Unified Process, 2nd edn. Prentice Hall, USA (2002)
22. Choinzon, M., Ueda, Y.: Design Defects in Object Oriented Designs Using Design Metrics. In: 7th Joint Conference on Knowledge-Based Software Engineering, pp. 61–72. IOS Press, Amsterdam (2006)
23. Greenfield, J., Short, K., Cook, S., Kent, S.: Software Factories: Assembling Applications with Patterns, Models, Frameworks, and Tools. John Wiley & Sons, USA (2004)
24. Melton, J.: ISO/IEC 9075-2:2003 (E) Information technology — Database languages — SQL — Part 2: Foundation (SQL/Foundation) (August 2003)
25. Bernstein, A.P.: Applying Model Management to Classical Meta Data Problems. In: Conference on Innovative Database Research, pp. 209–220 (2003)
26. Simsion, G.: Data Modeling Theory and Practice. Technics Publications, LLC (2007)
27. Piattini, M., Calero, C., Genero, M.: Table Oriented Metrics for Relational Databases. Software Quality Journal 9(2) (June 2001)
28. DMTF Common Information Model (CIM) Standards. CIM Schema Ver. 2.16. Database specification
29. OMG Common Warehouse Metamodel Specification, Version 1.1, formal/03-03-02
30. Blaha, M.R.: Dimensions of Database Reverse Engineering. In: Fourth Working Conference on Reverse Engineering, pp. 176–183 (1997)

Hybrid Computational Models for Software Cost Prediction: An Approach Using Artificial Neural Networks and Genetic Algorithms
Efi Papatheocharous and Andreas S. Andreou
University of Cyprus, Department of Computer Science, 75 Kallipoleos str. CY1678 Nicosia, Cyprus
{efi.papatheocharous,aandreou}@cs.ucy.ac.cy

Abstract. Over the years, software cost estimation through sizing has led to the development of various estimating practices. Despite the uniqueness and unpredictability of the software processes, people involved in project resource management have always been striving for acquiring reliable and accurate software cost estimations. The difficulty of finding a concise set of factors affecting productivity is amplified due to the dependence on the nature of products, the people working on the project and the cultural environment in which software is built and thus effort estimations are still considered a challenge. This paper aims to provide size- and effort-based cost estimations required for the development of new software projects utilising data obtained from previously completed projects. The modelling approach employs different Artificial Neural Network (ANN) topologies and input/output schemes selected heuristically, which target at capturing the dynamics of cost behavior as this is expressed by the available data attributes. The ANNs are enhanced by a Genetic Algorithm (GA) whose role is to evolve the network architectures (both input and internal hidden layers) by reducing the Mean Relative Error (MRE) produced by the output results of each network. Keywords: Artificial neural networks, Genetic algorithms, Software cost estimation.

1 Introduction Nowadays, the demanding and complex software development environment within enterprises presents several technical challenges concerning scheduling, cost estimation, reliability, performance, etc. The software development methods – from the classical waterfall to agile methods – usually involve complex problem-solving activities demonstrating a high level of uncertainty due to the nature of the products developed, which are never identical, the people working on the project and the environment in which software is built. Specifically, software project managers depend on several disparate factors to identify project costs and seek reliable data sources to enable real-time managerial decisions from the initiation of the project and during the whole project life-cycle. Acquiring accurate software development cost estimations has
always been a major concern especially for people involved in project management, resource control and schedule planning. A good and reliable estimate could provide more efficient management over the whole software process and guide a project to success. The track record of IT projects shows that often a large number fails. Most IT experts agree that such failures occur more regularly than they should [1]. These figures are aggravated by contingencies such as changing requirements, team dynamics, and high staff turnover affecting the project costs. According to the 10th edition of the annual CHAOS report from the Standish Group that studied over 40,000 projects in 10 years [2], success rates increased to 34% and failures declined to 15% of the projects. However, 51% of the projects overrun time, budget and/or lack critical features and requirements, while the average cost apparently overruns by 43%. One of the main reasons for these figures is failure to estimate the actual effort required to develop a software project. The problem is further amplified due to the high level of complexity and uniqueness of the software process. Estimating software costs, as well as choosing and assessing the associated cost drivers, both remain difficult issues that are constantly at the forefront right from the initiation of a project and until the system is delivered. Cost estimates even for well-planned projects are hard to make and will probably concern project managers long before the problem is adequately solved. Over the years software cost estimation has attracted considerable research attention and many techniques have been developed to effectively predict software costs. Nonetheless, no single solution has yet been proposed to address the problem. Typically, the amount and complexity of the development effort proportionally drives software costs. However, as other factors affect the development process, such as technology shifting, team and manager skills, quality, size etc., it is even more difficult to assess the actual costs. A commonly investigated approach is to accurately estimate some of the fundamental project characteristics related to cost, such as effort, usually measured in person-months. Of course, it is preferred to measure a condensed set of characteristics that are available from the early phases of the life-cycle and then use them to estimate the actual effort. Software size is commonly recognised as one of the most important factors affecting the amount of effort required to complete a project (e.g., [3]). However, it is considered a fairly unpromising metric to provide early estimates mainly because it is unknown until the project terminates. Nonetheless, many researchers investigate cost models using size to estimate effort (e.g., [4], [5]) whereas others direct their efforts towards defining concise methods and measures to estimate software size from the early project phases (e.g., [6], [7]). The present work is related to the former, aspiring to provide size- and effort-based estimations for the software effort required for a new project using data from past completed projects, even though some of the data originate back from the 90’s. The hypothesis is that once a robust relationship between size and effort is affirmed by means of a model, then this model may be used along with size estimations to predict the effort required for new projects more accurately. 
Thus, this work attempts to study the potential of developing a software cost model using computational intelligence techniques relying only on size and effort data. The core of the model proposed consists of Artificial Neural Networks (ANN), which are used as effort predictors. The ANN's performance is further
optimised with the use of a Genetic Algorithm (GA), focused on evolving the number and type of inputs, as well as the internal hidden architecture, to predict effort as precisely as possible. The inputs used to train and test the ANN are project size measurements (either Lines of Code (LOC) or Function Points (FP)) and the associated effort, used to predict the subsequent, unknown project effort in the series. A discussion is also provided on the value added to the software process by using size and effort time-series prediction over a set of past project measurements, which also assesses practical aspects of the models proposed. The rest of the paper is organised as follows: Section 2 presents a brief overview of related research on size-based software cost estimation and especially focuses on machine learning techniques. Section 3 provides a description of the datasets and performance metrics used in the experiments given in Section 4. Specifically, Section 4 includes the application of an ANN cost estimation model and describes an investigation for further improvement of the model through a hybrid algorithm to construct the optimal input/output scheme and internal architecture. Section 5 concludes with the findings of this work, discusses a few limitations and suggests future research steps.

2 Related Work Several techniques have been investigated for software cost estimation, especially data-driven artificial intelligence techniques, such as neural networks, evolutionary computing, regression trees, rule-based induction etc., as they present several advantages over other, classic approaches like regression. The majority of related studies investigate, among other issues, the identification and realisation of the most important factors that influence software costs. Literature relating to software cost estimation suggests that software size is typically considered as one of the most important product attributes that directly affects effort, and in turn it is often used to build cost models. This section focuses mainly on size-based cost estimation models. To begin with, most size-based models consider either the number of lines written for a project (called lines of code (LOC) or thousands of lines of code (KLOC)), such as the COCOMO [8], or the number of function points (FP) used in models such as Albrecht’s Function Point Analysis (FPA) [9]. Many research studies investigate the potential of developing software cost prediction systems using different approaches, datasets, factors, etc. Review articles like the ones of [10] and [11] include a detailed description of such studies. We will attempt to highlight some of the most important findings of relevant studies: In [4] effort estimation was assessed using backpropagation ANN on the Desharnais and ASMA datasets, mainly using system size to determine the latter’s relationship with effort. The approach yielded promising prediction results indicating that the model required a more systematic development approach to establish the topology and parameter settings and consequently obtain better results. In [5] the cost estimation equation of the relationship between size and effort was investigated using a Genetic Programming technique which evolved tree structures representing several classical equations, like the linear, power, quadratic, etc. The approach reached to moderately good levels of prediction accuracy by using solely the size attribute, but also suggested that further improvements can be achieved.

In summary, the literature thus far, includes many research attempts focusing on measuring effort and size as the key variables for cost modelling. In addition, many studies encourage the use of ANN models as cost estimators, which may perform better or at least as well as other approaches. Adopting this position, the present paper firstly aims to examine the suitability of ANNs in software cost modeling and secondly to investigate the possibility of providing such a model. Also, this work targets to examine whether: (i) a successful ANN-based cost model, in terms of input parameters, may be built; (ii) we can achieve sufficient estimates of software development effort using only size or function based metrics on different datasets of empirical cost samples; (iii) a hybrid computational model, which consists of a combination of ANN and GA, may contribute to devising the ideal ANN architecture and set of inputs that meet some evaluation criteria. Our strategy is to exploit the benefits of computational intelligence and provide a near to optimal effort predictor for impending new projects.

3 Datasets and Performance Metrics A variety of historical software cost data samples from various datasets containing empirical cost samples were employed to provide a strong comparative basis with results reported in other studies. Also, in this section, the performance metrics used to assess the ANN’s precision accuracy are described. 3.1 Datasets Description The following datasets describing historical project data were used to test the proposed approach: COCOMO`81 (COC`81), Kemerer`87 (KEM`87), a combination of COCOMO`81 and Kemerer`87 (COKEM`87), Albrecht and Gaffney`83 (ALGAF`83), Desharnais`89 (DESH`89) and ISBSG Release 9 (ISBSG`05). The COC`81 [12] dataset contains information about 63 software projects from different applications. Each project is described by the following 17 cost attributes: Required reliability, database size, product complexity, required reusability, documentation, execution time constraint, main storage constraint, platform volatility, analyst capability, programmer capability, applications experience, platform experience, language & tool experience, personnel continuity, use of software tools, multisite development and required development schedule. The second dataset, named KEM`87 [13] contains 15 software project records gathered by a single organisation in the USA, which constitute business applications written mainly in COBOL. The attributes of the dataset are: The actual project’s effort measured in man-months, the duration, the KLOC, the unadjusted and the adjusted function points (FP)’s count. A combination of the two previous datasets resulted to the third dataset, namely COKEM`87. This would allow us to experiment with a larger but more heterogeneous dataset and test the efficiency of our approach on a more dispersed dataset. The fourth dataset ALGAF`83 [9] contains information about 24 projects developed by the IBM DP service organisation. The datasets’ characteristics correspond to the actual project effort, the KLOC, the number of inputs, the number of outputs, the number of master files, the number of inquiries and the FP’s count.

The fifth dataset, DESH`89 [14], includes observations for more than 80 systems developed by a Canadian Software Development House at the end of 1980. The basic characteristics of the dataset account for the following: The project name, the development effort measured in hours, the team's experience and the project manager's experience measured in years, the number of transactions processed, the number of entities, the unadjusted and adjusted FP, the development environment and the year of completion. The sixth dataset, ISBSG`05 [15] is obtained from the International Software Benchmarking Standards Group (ISBSG, Repository Data Release 9) and contains an analysis of software project costs for a group of projects. The projects come from a broad cross section of industry and range in size, effort, platform, language and development technique data. The release of the dataset used contains 92 variables for each of the projects and has multi-organisational, multi-application domain and multi-environment data that is considered fairly heterogeneous. Although rich in the number and types of attributes, our datasets were filtered so as to include only size and effort related samples, because they were the common attributes existing in all datasets and, furthermore, they are the main factors reported in the literature to affect productivity and cost the most [16]. 3.2 Performance Metrics The performance of the predictions was evaluated using a combination of three common error metrics, namely the Mean Relative Error (MRE), the Correlation Coefficient (CC) and the Normalised Root Mean Squared Error (NRMSE), together with a customised Sign Predictor (Sign) metric. These error metrics were employed to validate the model's forecasting ability considering the difference between the actual and the predicted cost samples and their ascendant or descendant progression in relation to the actual values. The MRE, given in equation (1), shows the prediction error focusing on the sample being predicted; x_act(i) is the actual effort and x_pred(i) is the predicted effort of the i-th project.

MRE(n) = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| x_{act}(i) - x_{pred}(i) \right|}{x_{act}(i)}    (1)

The CC between the actual and predicted series, described by equation (2), measures the ability of the predicted samples to follow the upward or downward movements of the original series as it evolves in the sample prediction sequence. An absolute CC value equal to or near 1 is interpreted as a perfect follow-up of the original series by the forecasted one. A negative CC sign indicates that the forecasting series follows the same direction of the original series with negative mirroring, that is, with a rotation about the time-axis.

CC(n) = \frac{\sum_{i=1}^{n} \left[ \left( x_{act}(i) - \bar{x}_{act,n} \right) \left( x_{pred}(i) - \bar{x}_{pred,n} \right) \right]}{\sqrt{\left[ \sum_{i=1}^{n} \left( x_{act}(i) - \bar{x}_{act,n} \right)^{2} \right] \left[ \sum_{i=1}^{n} \left( x_{pred}(i) - \bar{x}_{pred,n} \right)^{2} \right]}}    (2)

where \bar{x}_{act,n} and \bar{x}_{pred,n} denote the mean actual and mean predicted effort over the n samples.

The NRMSE assesses the quality of predictions and is calculated using the Root Mean Squared Error (RMSE) as follows:

RMSE(n) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left[ x_{pred}(i) - x_{act}(i) \right]^{2}}    (3)

NRMSE(n) = \frac{RMSE(n)}{\sigma_{\Delta}} = \frac{RMSE(n)}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} \left[ x_{act}(i) - \bar{x}_{n} \right]^{2}}}    (4)

If NRMSE=0 then predictions are perfect; if NRMSE=1 the prediction is no better than taking x_pred equal to the mean value of the n samples. The Sign Predictor (Sign(p)) metric assesses if there is a positive or a negative transition of the actual and predicted effort trace in the projects used only during the evaluation of the models on unknown test data. With this measure we are not interested in the exact values, but only if the tendency of the previous to the next value is similar; meaning that if the actual effort value rises and also the predicted value rises in relation to their previous values, then the tendency is identical. This is expressed in equations (5) and (6).

Sign(p) = \frac{\sum_{i=1}^{n} z_{i}}{n}    (5)

z_{i} = \begin{cases} 1 & \text{if } \left( x_{t+1}^{pred} - x_{t}^{pred} \right) \left( x_{t+1}^{act} - x_{t}^{act} \right) > 0 \\ 0 & \text{otherwise} \end{cases}    (6)
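The four metrics can be written down compactly; the following sketch restates equations (1)–(6) in plain NumPy and is an illustration of the formulas rather than the authors' implementation (in particular, the sign predictor below averages over the n−1 observed transitions).

```python
import numpy as np

def mre(actual, predicted):
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(a - p) / a)                              # eq. (1)

def cc(actual, predicted):
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.corrcoef(a, p)[0, 1]                                 # eq. (2)

def nrmse(actual, predicted):
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    rmse = np.sqrt(np.mean((p - a) ** 2))                          # eq. (3)
    return rmse / np.sqrt(np.mean((a - a.mean()) ** 2))            # eq. (4)

def sign_predictor(actual, predicted):
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    z = (np.diff(p) * np.diff(a)) > 0                              # eq. (6)
    return z.mean()                                                # eq. (5), over the transitions
```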

4 Experimental Approach In this section the detailed experimental approach and the associated results yielded by the various models developed are described. Firstly, we consider an ANN approach with varying input and output schemes (a random timestamp was given to the data samples, which were fed to the network using a sliding-window technique); secondly, we introduce a hybrid model, coupling ANN with a GA, to reach a near-optimal input/output scheme and internal neural network architecture. 4.1 A Basic ANN-Model Approach An ANN cost model is presented here which investigates the relationship between software size (expressed in LOC or FP) and effort, by conducting a series of experiments. We are concerned with inspecting the predictive ability of the ANN model with respect to the architecture utilised and the input/output scheme (volume and chronological order of the data fed to the model) per dataset used. 4.1.1 Model Description The core architecture of the ANN is a feedforward MLP (Figure 1) linking each input neuron with three hidden layers, consisting of parallel slabs activated by a different
function (i.e., i-h1-h2-h3-o, where i is the input vector, h1, h2, h3 are the internal hidden layers and o is the output) [17]. Variations of this architecture are employed regarding the number of inputs and the number of neurons in the internal hidden layers, whereas the difference between the actual and the predicted effort is manifested at the output layer (forecasting deviation).

Fig. 1. A Feed-forward MLP Neural Network consisting of an input and an output layer and three slabs of hidden neurons
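One plausible reading of this topology is sketched below with TensorFlow/Keras: the input feeds three parallel slabs whose outputs are concatenated before the output neuron. The functional forms chosen for the Gaussian and Gaussian-complement activations, and the slab size, are assumptions made for the illustration; the activations, learning rate and momentum follow the values reported later in this section.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, optimizers

def gaussian(x):
    return tf.exp(-tf.square(x))            # assumed form of the Gaussian activation

def gaussian_complement(x):
    return 1.0 - tf.exp(-tf.square(x))      # assumed form of its complement

def build_slab_mlp(n_inputs, slab_size=9):
    """Feed-forward MLP with three parallel hidden slabs (cf. Figure 1)."""
    inp = layers.Input(shape=(n_inputs,))
    h1 = layers.Dense(slab_size, activation=gaussian)(inp)
    h2 = layers.Dense(slab_size, activation="tanh")(inp)
    h3 = layers.Dense(slab_size, activation=gaussian_complement)(inp)
    out = layers.Dense(1, activation="sigmoid")(layers.concatenate([h1, h2, h3]))
    model = Model(inp, out)
    model.compile(optimizer=optimizers.SGD(learning_rate=0.1, momentum=0.1),
                  loss="mse")
    return model
```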

Firstly, the ANNs are trained in a supervised manner, using the backpropagation algorithm. Also, a technique to filter the data and reserve holdout samples is utilised, creating the training, validation and testing subsets. The extraction is made randomly using 70% of the data samples for training, 20% for validation and 10% for testing. With backpropagation the inputs propagate through the ANN resulting in an output value according to the initial weights. The predicting behaviour of the ANN is characterised by the difference among the predicted and the desired output values. Then, the difference (error) is propagated in a backward manner adjusting the necessary weights of the internal neurons, so that the predicted value is moved closer to the actual one in the subsequent iteration. The training set is utilised during the learning process, the validation set is used to ensure that no overfitting occurs in the final result. The testing set is an independent dataset, i.e., does not participate during the learning process and measures how well the network performs with unknown data, and confirms that the network is able to generalise the knowledge gained. During training the inputs are presented to the network in patterns of input/output values and corrections are made on the weights of the network according to the overall error in the output and the contribution of each node to this error. 4.1.2 Feed-Forward MLP ANN Results The experiments presented in this section constituted an empirical investigation related mainly to the number of inputs and internal neurons forming the layers of the ANN. In these experiments several ANN parameters were kept constant as some preliminary experiments previously conducted implied that varying the type of the activation function in each layer had no significant effect on the forecasting quality. More specifically, we employed the following functions: The input layer employed the linear
transfer function in the range [-1, 1], the first hidden layer used the Gaussian, the second hidden layer the Tanh, the third hidden layer the Gaussian complement and the output layer the Logistic function. Also, the learning rate, the momentum, the initial weights and the number of iterations were set to 0.1, 0.1, 0.3 and 10000 respectively. In addition, as the data were randomly divided into the three subsets, a specific chronological order (ti) was provided to test whether there is time-series dependence between the data values. Table 1. Sliding window technique to determine the ANN input/output data supply scheme Input Output Scheme (IOS) IOS-1 IOS-2 IOS-3 IOS-4 IOS-5 IOS-6

Inputs* LOC(ti) FP(ti) LOC(ti), EFF(ti) FP(ti), EFF(ti) LOC(ti), LOC(ti+1), EFF(ti) FP(ti), FP(ti+1), EFF(ti)

Output* EFF(ti) EFF(ti) EFF(ti+1) EFF(ti+1) EFF(ti+1) EFF(ti+1)

(*where i=1..5)
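The sketch below illustrates, under our reading of Table 1, how a sliding window of size i can turn the (LOC, FP, EFF) project series into input/output pairs for the six IOS schemes. Function and variable names are our own, and the exact way the window is widened for schemes 3-6 is an assumption; the original tooling is not described in the paper.

```python
def build_ios_samples(loc, fp, eff, scheme, window=1):
    """Build (inputs, target) pairs for one Input/Output Scheme (IOS).

    loc, fp, eff : equally long lists ordered by the assigned timestamps
    scheme       : 1..6, following Table 1
    window       : sliding-window size i (1..5 in the paper's experiments)
    """
    samples = []
    for t in range(len(eff) - window):
        past = range(t, t + window)              # the i past projects
        nxt = t + window                         # the (i+1)-th project
        if scheme == 1:
            x, y = [loc[j] for j in past], eff[t + window - 1]
        elif scheme == 2:
            x, y = [fp[j] for j in past], eff[t + window - 1]
        elif scheme == 3:
            x, y = [loc[j] for j in past] + [eff[j] for j in past], eff[nxt]
        elif scheme == 4:
            x, y = [fp[j] for j in past] + [eff[j] for j in past], eff[nxt]
        elif scheme == 5:
            x, y = [loc[j] for j in past] + [loc[nxt]] + [eff[j] for j in past], eff[nxt]
        elif scheme == 6:
            x, y = [fp[j] for j in past] + [fp[nxt]] + [eff[j] for j in past], eff[nxt]
        else:
            raise ValueError("scheme must be 1..6")
        samples.append((x, y))
    return samples
```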

For each repetition of the procedure a sliding-window technique was applied to extract the input vector and feed it to the ANN model, with values i=1…5. Practically, this is expressed in Table 1, covering the following Input Output Schemes (IOS):

IOS-1, -2: Using the Lines Of Code or the Function Points of i past projects to estimate the Effort of the i-th project in the data series sample sequence;

IOS-3, -4: Using the Lines Of Code or the Function Points with the Effort of the i-th project to estimate the Effort required for the next, (i+1)-th, project in the series sequence;

IOS-5, -6: Using the Lines Of Code or the Function Points of the i-th and (i+1)-th projects and the Effort of the i-th project to estimate the Effort required for the (i+1)-th project.

In each Input/Output Scheme the sliding-window size (i index), i.e. the number of past samples per variable, varied from 1 to 5. All of these input/output data schemes would enable us to draw conclusions relating the dependent variable of effort to the input cost drivers and identify the potential descriptive power of the latter to effort.

Table 2. Best Experimental Results obtained with the ANN-model

DATASET   | IOS   | ANN ARCHITECTURE | TRAINING MRE | TRAINING CC | TRAINING NRMSE | TESTING MRE | TESTING CC | TESTING NRMSE | Sign(p) | Sign(p) %
COC`81    | IOS-5 | 3-15-15-15-1 | 0.929 | 0.709 | 0.716 | 0.551 | 0.407 | 0.952 | 5/10    | 50.00
COC`81    | IOS-1 | 2-9-9-9-1    | 0.871 | 0.696 | 0.718 | 0.525 | 0.447 | 0.963 | 7/12    | 58.33
KEM`87    | IOS-1 | 1-15-15-15-1 | 0.494 | 0.759 | 0.774 | 0.256 | 0.878 | 0.830 | 2/3     | 66.67
KEM`87    | IOS-5 | 5-20-20-20-1 | 0.759 | 0.939 | 0.384 | 0.232 | 0.988 | 0.503 | 2/2     | 100.00
COKEM`87  | IOS-3 | 8-20-20-20-1 | 5.038 | 0.626 | 0.781 | 0.951 | 0.432 | 0.948 | 3/8     | 37.50
COKEM`87  | IOS-3 | 4-3-3-3-1    | 5.052 | 0.610 | 0.796 | 0.768 | 0.257 | 1.177 | 4/8     | 50.00
ALGAF`83  | IOS-6 | 5-3-3-3-1    | 0.371 | 0.873 | 0.527 | 1.142 | 0.817 | 0.649 | 3/4     | 75.00
ALGAF`83  | IOS-2 | 2-20-20-20-1 | 0.335 | 0.975 | 0.231 | 1.640 | 0.936 | 0.415 | 2/4     | 50.00
DESH`89   | IOS-4 | 4-9-9-9-1    | 0.298 | 0.935 | 0.355 | 0.481 | 0.970 | 0.247 | 17/20   | 85.00
DESH`89   | IOS-4 | 6-9-9-9-1    | 0.031 | 0.999 | 0.042 | 0.051 | 1.000 | 0.032 | 20/20   | 100.00
ISBSG`05  | IOS-2 | 4-15-15-15-1 | 1.014 | 0.299 | 2.285 | 0.843 | 0.577 | 2.617 | 240/480 | 50.00
ISBSG`05  | IOS-6 | 3-3-3-3-1    | 0.899 | 0.594 | 1.502 | 0.809 | 0.601 | 2.435 | 300/480 | 62.50
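For readers who want to reproduce the error figures reported in Table 2, the sketch below computes the four quantities used throughout (MRE, CC, NRMSE and the sign-tendency count). Their exact definitions are not restated in this section, so we assume the usual formulations: mean relative error, Pearson correlation coefficient, root mean squared error normalised by the standard deviation of the actual series, and the fraction of consecutive predictions whose direction of change matches the actual series. Any of these may differ in detail from the authors' implementation.

```python
import numpy as np

def evaluation_metrics(actual, predicted):
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    mre = np.mean(np.abs(a - p) / np.abs(a))                 # mean relative error
    cc = np.corrcoef(a, p)[0, 1]                             # Pearson correlation coefficient
    nrmse = np.sqrt(np.mean((a - p) ** 2)) / np.std(a)       # assumed normalisation by std of actual
    # Sign(p): how often the predicted series moves in the same direction
    # as the actual one between consecutive samples.
    same = np.sign(np.diff(a)) == np.sign(np.diff(p))
    return {"MRE": mre, "CC": cc, "NRMSE": nrmse,
            "Sign(p)": f"{int(same.sum())}/{same.size}"}
```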


The best results obtained utilising the ANN model and the various datasets are summarised in Table 2. The first column refers to the dataset used, the second column to the Input and Output Scheme (IOS) with which i data inputs were fed to the model, the third column refers to the ANN topology and the rest of the columns list the error figures obtained during the training and testing phases. The last two columns indicate the number of predicted projects that have the same sign tendency in the sequence of the effort samples and the total percentage of successful tendency during testing.

The results in Table 2 indicate that the ANN architectures which achieve the best results vary from dataset to dataset, as expected. The best performance is observed for DESH`89, presenting the lowest MRE equal to 0.05, a CC equal to 1.0 and an NRMSE equal to 0.03 during testing. The KEM`87 dataset also performs adequately well with relatively low error figures. The worst prediction performance is obtained with the ALGAF`83 and COKEM`87 datasets, whereas the most heterogeneous dataset, ISBSG`05, achieves mediocre accuracy results. These failures may be attributed to the small number of projects involved in the prediction process in the first case, and to the heterogeneous nature of the datasets in the second and third cases. Finally, as the results suggest, the COC`81 and KEM`87 datasets achieve adequately fit predictions and thus we may claim that the model is able to approximate well the actual development cost of these datasets.

It is also worth noting that in some cases the error figures are worse in the training phase than in the testing phase. This may be attributed to the small dataset size and the heterogeneity of the data values, which do not allow the ANNs to find an adequate input/output mapping function, at least with the specific ANN architectures that were empirically selected and utilised. In addition, the majority of the best yielded results involve ANN architectures with a relatively large number of internal neurons. Therefore, further investigation is needed with respect to different ANN topologies and IOS for the various datasets, aiming at improving prediction performance and, if possible, at simplifying the structure of the ANN by decreasing the number of hidden neurons. To this end we resorted to using a hybrid scheme, combining an ANN with a GA, the latter attempting to evolve the near-optimal internal network topology and Input/Output Scheme that yields accurate predictions and has a reasonably small size, so as to avoid overfitting and confine the search to simpler network topologies.

4.2 A Hybrid Model Approach

The rationale behind this attempt resides in the fact that the performance of an ANN highly depends on the size, structure and connectivity of the network, and results may be further improved if the right parameters are identified. Therefore, we applied a GA to investigate whether we may discover the ideal network settings by means of a cycle of generations including candidate solutions that are pruned by the criterion 'survival of the fittest', meaning the best performing ANN in terms of effort prediction.

4.2.1 Model Description
The first task for creating the hybrid model was to determine the proper type of encoding so as to express the potential solutions. The encoding used was a binary string representing the ANN architecture, the internal hidden neurons and the varying inputs. The inputs were inserted into the ANNs participating in the hybrid model following the IOS scheme explained earlier. The number of neurons participating in each layer varied from 1 to 20. The space of all feasible solutions (i.e., the set of solutions among which the desired solution resides) is called the search space. Each point in the search space represents one possible solution. Each possible solution is "marked" by its fitness value, which in our case was expressed by equation (7), minimizing both the MRE and the size of the network:

fitness = 1 / (1 + MRE + size)    (7)
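As an illustration of how such a chromosome and fitness could be coded, the sketch below decodes a binary string into an IOS choice and three hidden-layer sizes (1-20 neurons each) and scores a candidate with equation (7). The bit layout, and the way `size` is measured (here, the total number of hidden neurons, possibly normalised in the original work), are our assumptions; the paper does not spell them out.

```python
def decode_chromosome(bits):
    """Decode a binary string into (ios, hidden_sizes).

    Assumed layout: 3 bits for the IOS scheme (1..6), then 5 bits per
    hidden layer giving 1..20 neurons each; this layout is illustrative.
    """
    ios = (int(bits[0:3], 2) % 6) + 1
    hidden = []
    for k in range(3):
        raw = int(bits[3 + 5 * k: 8 + 5 * k], 2)
        hidden.append((raw % 20) + 1)
    return ios, hidden

def fitness(bits, evaluate_mre):
    """Equation (7): fitness = 1 / (1 + MRE + size)."""
    ios, hidden = decode_chromosome(bits)
    mre = evaluate_mre(ios, hidden)     # train/test an ANN and return its MRE
    size = sum(hidden)                  # assumed network-size penalty
    return 1.0 / (1.0 + mre + size)
```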

Searching for a solution is equivalent to looking for some value in the search space which maximizes (or minimizes) the objective (fitness) function. The GA developed included three types of operators: selection (roulette wheel), crossover (with a rate equal to 0.25) and mutation (with a rate equal to 0.01). Selection chooses members from the population of chromosomes proportionally to their fitness; in addition, elitism was used to ensure that the best member of each population is always selected for the new population. Crossover adapts the genotype of two parents by exchanging their parts and creates a new chromosome. Crossover was performed by selecting a random gene along the length of the chromosomes and swapping the genes after that point. Finally, the mutation operator simply changes a specific gene (flips a 0 to 1 and vice versa) of a selected individual in order to create a new chromosome with a different genotype. The algorithm was executed for 50 generations (evolutions), each including 50 candidate solutions (population size). From this population the best performing ANNs were identified in each evolution according to the error fitness function described in equation (7).

4.2.2 Hybrid Model Results
This section presents and discusses the results obtained using the Hybrid model on the available datasets. The best ANN architectures yielded for each dataset are listed in the third column of Table 3 with the error figures obtained both during the training and testing phases. These experiments did not make use of the sliding-window technique, that is, index i in Table 1 remained constant and equal to 1 in all cases reported. The performance of the different ANN architectures constructed with the aid of the GA exhibits high learning ability, as indicated by the error figures of Table 3. The main observation is that for all of the datasets the Hybrid model was able to optimise the ANN prediction accuracy. This is remarkably consistent through both training and testing error figures reported by the best solutions. In fact, for all of the datasets inspected, the integrated model performs adequately well in terms of generalisation ability and prediction accuracy. Comparing these results to the output of the experiments conducted with the simple ANN (Table 2), we observe that during both the training and the testing phase the MRE is significantly lowered, the CC improves in all cases, and the NRMSE is also highly improved. We focus now on each dataset separately: it seems that the experiments using KEM`87 showed similar MRE and CC error figures and an improved NRMSE in favour of FP instead of LOC, both during training and testing. The ALGAF`83 dataset showed similar NRMSE and CC error figures, whereas an improved MRE was observed using LOC. Finally, DESH`89 and ISBSG`05 present considerable improvement in all error metrics compared to the initial ANN results. Overall, the experiments conducted using one of the two size-related measures, either LOC or FP, and predicting effort (i.e., IOS-1 and IOS-2) produce superior results consistently throughout the datasets employed.
This finding indicates that both LOC and FP are very good descriptors of effort, something that on the one hand agrees with what has already been pointed out by numerous studies in the literature and on the other suggests that our model behaves as it should. Another observation is that the best accuracy predictions are significantly lowered (error rates are higher) when the EFF is given as an additional input to the ANNs (IOS-3 or IOS-4 cases). This suggests that, in that setting, the model seems unable to capture the correlation between size and effort. However, prediction accuracy is significantly and consistently improved when the LOC or FP of the project whose effort is being predicted is given to the model (IOS-5 or IOS-6 cases). Heuristically, this is a logical conclusion, as the model in the latter case is fed with information regarding the project's LOC or FP and therefore the prediction accuracy is enhanced by this additional information. Overall, the proposed model seems to work under these assumptions consistently well.

Table 3. Hybrid model (coupling ANN and GA) results

DATASET   | IOS   | ANN ARCHITECTURE | TRAINING MRE | TRAINING CC | TRAINING NRMSE | TESTING MRE | TESTING CC | TESTING NRMSE | Sign(p)  | Sign(p) %
COC`81    | IOS-1 | 1-9-17-10-1  | 0.004 | 1.000 | 0.014 | 0.003 | 1.000 | 0.014 | 24/24   | 100.00
COC`81    | IOS-3 | 2-20-18-3-1  | 0.092 | 0.963 | 0.270 | 0.075 | 0.961 | 0.278 | 13/18   | 72.22
COC`81    | IOS-5 | 3-19-20-4-1  | 0.043 | 0.990 | 0.149 | 0.044 | 0.981 | 0.199 | 14/24   | 58.33
KEM`87    | IOS-1 | 1-17-13-16-1 | 0.008 | 1.000 | 0.015 | 0.009 | 1.000 | 0.019 | 4/4     | 100.00
KEM`87    | IOS-3 | 2-18-14-18-1 | 0.246 | 0.825 | 0.539 | 0.211 | 0.822 | 0.550 | 1/3     | 33.33
KEM`87    | IOS-5 | 3-19-15-20-1 | 0.004 | 1.000 | 0.006 | 0.028 | 0.997 | 0.081 | 3/3     | 100.00
KEM`87    | IOS-2 | 1-17-20-11-1 | 0.006 | 1.000 | 0.005 | 0.009 | 1.000 | 0.006 | 4/4     | 100.00
KEM`87    | IOS-4 | 2-19-15-20-1 | 0.041 | 0.998 | 0.074 | 0.062 | 0.993 | 0.122 | 3/3     | 100.00
KEM`87    | IOS-6 | 3-19-9-16-1  | 0.029 | 0.999 | 0.045 | 0.031 | 0.999 | 0.051 | 3/3     | 100.00
ALGAF`83  | IOS-1 | 1-13-20-6-1  | 0.002 | 1.000 | 0.008 | 0.005 | 1.000 | 0.024 | 6/6     | 100.00
ALGAF`83  | IOS-3 | 2-19-20-8-1  | 0.087 | 0.990 | 0.136 | 0.163 | 0.977 | 0.210 | 5/6     | 83.33
ALGAF`83  | IOS-5 | 3-19-11-10-1 | 0.089 | 0.990 | 0.139 | 0.173 | 0.975 | 0.218 | 3/6     | 50.00
ALGAF`83  | IOS-2 | 1-9-17-10-1  | 0.011 | 1.000 | 0.018 | 0.014 | 1.000 | 0.018 | 6/6     | 100.00
ALGAF`83  | IOS-4 | 2-18-15-11-1 | 0.092 | 0.989 | 0.145 | 0.112 | 0.985 | 0.171 | 5/6     | 83.33
ALGAF`83  | IOS-6 | 3-20-19-10-1 | 0.088 | 0.983 | 0.179 | 0.084 | 0.984 | 0.177 | 5/6     | 83.33
DESH`89   | IOS-2 | 1-3-18-20-1  | 0.013 | 0.999 | 0.038 | 0.016 | 0.998 | 0.075 | 22/22   | 100.00
DESH`89   | IOS-4 | 2-20-19-20-1 | 0.912 | 0.495 | 1.107 | 0.589 | 0.437 | 1.022 | 13/22   | 59.09
DESH`89   | IOS-6 | 3-20-20-19-1 | 0.354 | 0.878 | 0.480 | 0.381 | 0.674 | 0.750 | 21/22   | 95.45
ISBSG`05  | IOS-2 | 1-16-18-11-1 | 0.004 | 0.998 | 0.068 | 0.004 | 0.998 | 0.073 | 288/288 | 100.00
ISBSG`05  | IOS-4 | 2-19-14-20   | 0.329 | 0.174 | 0.985 | 0.952 | 0.030 | 1.141 | 62/288  | 21.53
ISBSG`05  | IOS-6 | 3-19-15-26-1 | 0.164 | 0.728 | 0.686 | 1.312 | 0.705 | 0.742 | 235/288 | 81.60
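For completeness, the sketch below shows one way to wire together the GA operators described in Section 4.2.1 (roulette-wheel selection with elitism, one-point crossover at rate 0.25, bit-flip mutation at rate 0.01, 50 generations of 50 individuals) around the fitness function of equation (7). It is an illustrative skeleton, not the authors' implementation, and the `fitness` callback is assumed to train and score an ANN as in the earlier sketch.

```python
import random

def evolve(fitness, chrom_len, pop_size=50, generations=50,
           p_cross=0.25, p_mut=0.01, seed=0):
    rng = random.Random(seed)
    pop = ["".join(rng.choice("01") for _ in range(chrom_len))
           for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c) for c in pop]
        elite = pop[scores.index(max(scores))]          # elitism
        total = sum(scores)

        def roulette():
            r, acc = rng.uniform(0, total), 0.0
            for c, s in zip(pop, scores):
                acc += s
                if acc >= r:
                    return c
            return pop[-1]

        children = [elite]
        while len(children) < pop_size:
            a, b = roulette(), roulette()
            if rng.random() < p_cross:                  # one-point crossover
                cut = rng.randrange(1, chrom_len)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            a = "".join(bit if rng.random() >= p_mut    # bit-flip mutation
                        else "10"[int(bit)] for bit in a)
            children.append(a)
        pop = children
    return max(pop, key=fitness)
```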

Figure 2 presents the actual versus the predicted normalised effort sample values during the training and testing phases, juxtaposed from an indicative experiment using the ISBSG`05 dataset and the IOS-2 scheme with the ANN architecture 1-16-18-11-1. The cycle of experiments continued and closed with a smaller set of runs which utilised a varying sliding-window size (see Table 4). Even though it was logical to assume that providing more historical data to the model regarding past projects should somehow improve the results obtained from the experiments without a sliding window (Table 3), the final series of experiments, conducted only with the DESH`89 and ISBSG`05 datasets, presented an inferior predictive ability. This leads us to conclude that time-series dependence between the project samples under investigation does not exist. This is a notable conclusion as, in our opinion, several distinct project clusters with similar characteristics exist in the datasets (this is more visible in the ISBSG`05 case, as projects from different countries, organisations, chronological periods, etc. are evaluated).


[Figure 2 plot: normalised effort (actual vs. predicted) against project sample index, shown for the training and testing phases]

Fig. 2. Actual vs. predicted normalised effort estimation values with ANN architecture 1-16-18-11-1 and IOS-2 on the ISBSG`05 dataset

Table 4. Hybrid model (coupling ANN and GA) results with sliding-window

DATASET   | IOS          | ANN ARCHITECTURE | TRAINING MRE | TRAINING CC | TRAINING NRMSE | TESTING MRE | TESTING CC | TESTING NRMSE | Sign(p)  | Sign(p) %
DESH`89   | IOS-2 / i=2  | 2-20-19-19-1 | 0.230 | 0.960 | 0.279 | 0.285 | 0.899  | 0.443 | 15/22   | 68.18
DESH`89   | IOS-4 / i=2  | 4-19-20-17-1 | 0.623 | 0.892 | 0.961 | 0.566 | 0.658  | 0.755 | 11/22   | 50.00
DESH`89   | IOS-6 / i=2  | 5-20-19-20-1 | 0.305 | 0.885 | 0.464 | 0.493 | 0.461  | 1.058 | 13/21   | 61.90
DESH`89   | IOS-2 / i=3  | 3-20-18-20-1 | 0.327 | 0.881 | 0.470 | 0.386 | 0.701  | 0.763 | 9/22    | 40.91
DESH`89   | IOS-4 / i=3  | 6-17-15-20-1 | 0.705 | 0.869 | 0.979 | 0.327 | 0.835  | 0.550 | 10/21   | 47.62
DESH`89   | IOS-6 / i=3  | 7-20-19-18-1 | 0.136 | 0.967 | 0.254 | 0.707 | 0.579  | 0.936 | 13/21   | 61.90
ISBSG`05  | IOS-2 / i=2  | 2-18-3-19-1  | 0.089 | 0.175 | 1.100 | 0.073 | 0.045  | 1.012 | 80/288  | 27.78
ISBSG`05  | IOS-4 / i=2  | 4-19-20-20-1 | 1.980 | 0.288 | 0.957 | 1.201 | 0.005  | 1.738 | 110/286 | 38.46
ISBSG`05  | IOS-6 / i=2  | 5-19-20-19-1 | 1.980 | 0.123 | 0.992 | 0.892 | -0.053 | 1.648 | 94/286  | 32.87
ISBSG`05  | IOS-2 / i=3  | 3-19-9-3-1   | 0.085 | 0.274 | 0.962 | 0.070 | 0.241  | 1.000 | 143/286 | 50.00
ISBSG`05  | IOS-4 / i=3  | 6-19-20-19-1 | 1.851 | 0.183 | 0.983 | 1.244 | -0.031 | 1.845 | 98/286  | 34.27
ISBSG`05  | IOS-6 / i=3  | 7-3-12-15-1  | 0.070 | 0.100 | 0.994 | 0.071 | 0.039  | 1.013 | 163/286 | 56.99

The outcome of the experiments also indicates that in most of the cases the yielded architectures are consistent, in that the best performing ANNs need a rather large number of neurons in the internal layers to improve the associated effort estimations. These observations may lead to the argument that good performance is just a result of overfitting. Nevertheless, having separated the training phase from the testing one (which uses previously unseen patterns) almost guarantees that such an argument does not hold. We can also suggest with relative confidence that this approach improves the performance levels across datasets, as the yielded results are consistent among most of them. The hybrid model manages to generalise in representing complex underlying relationships and to improve the software cost estimation process.

5 Conclusions

In the present work we attempted to study the potential of developing a software cost model using computational intelligence techniques relying only on size and effort project data. The core of the proposed model consists of Artificial Neural Networks (ANN) trained and tested using project size metrics (Lines of Code or Function Points) and effort, aiming to predict the effort of the next project in the series sequence as accurately as possible. Separate training and testing subsets were used, and serial sampling with a sliding window was propagated through the data to extract the projects fed to the models. It is commonly recognised that the performance of an ANN model depends mainly on its architecture and parameter settings, and usually empirical rules are used to determine these settings. The problem was thus reduced to finding the ideal ANN architecture for formulating a reliable prediction model. The first experimental results indicated mediocre to high prediction success depending on the dataset used. In addition, it became evident that there was a need for more extensive exploration of solutions in the search space of various topologies and input schemes, as the results obtained by the simple ANN model did not converge to a general solution. Therefore, in order to select a more suitable ANN architecture, we resorted to using Evolutionary Algorithms. More specifically, a Hybrid model was introduced consisting of ANN and Genetic Algorithms (GA). The latter evolved a population of networks to select the optimal architecture and inputs that provided the most accurate software cost predictions.

The results of this work showed that the Hybrid model yielded better estimates than the simple ANN models and that the proposed technique is very promising. The ANNs evolved by the GA were able to detect the relationships among the investigated attributes of project size (LOC or FP) and the associated effort. Also, the model provided nearly accurate effort predictions using only LOC or FP in the Input/Output Schemes. It was also observed that giving additional information to the model regarding the respective effort values of the samples did not assist in acquiring better estimates, whereas the significance of providing the size estimates was again confirmed, as the model yielded optimised results.

The main limitation of the proposed model is that it does not add value to the early phases of cost estimation: on the one hand the LOC measure does not usually address the functionality of a system, and on the other hand the FP are not known until the system reaches an advanced development phase. The same drawback is also found in any other size-based approach, as size estimates must be known in advance to provide accurate enough effort estimations. There is a large discrepancy between the actual and estimated size, especially when the estimation is made in the early project phases, which also makes the model's practical utilisation harder. Nevertheless, the correlation degrees found among LOC and FP as strong cost drivers can still facilitate adequate software cost estimations, as they can be translated into several other drivers that are known beforehand or measured relatively more easily, such as the programming language, system specifications, etc. This leads us to assume that size metrics can be successfully used to normalise other software metrics and also to compare different projects. Finally, the lack of a satisfactory volume of homogeneous data and of definition and measurement rules on software metrics results in high uncertainty in the estimation process.
The software size is also affected by other factors that are not investigated by the models, such as the programming language and platform; in this work we focused only on coding effort, which accounts for just a percentage of the total effort in software development. Therefore, future research steps will concentrate on ways to improve the performance of the approach, examples of which may be: (i) the study of other software factors affecting development effort and the location of interdependencies, (ii) further adjustment of the ANN and GA parameter settings, such as modification of the fitness function, (iii) improvement of the efficiency of the algorithms by testing more homogeneous or clustered project data and, finally, (iv) assessment of the model in a real software development environment.

References

1. Charette, R.N.: Why software fails. IEEE Spectrum 42(9), 42–49 (2005)
2. Standish: Project success rates improved over 10 years. Software Magazine (2004) (accessed November 2007), http://www.softwaremag.com/L.cfm?Doc=newsletter/2004-01-15/Standish
3. Fenton, N.E., Pfleeger, S.L.: Software Metrics: A Rigorous and Practical Approach. International Thomson Computer Press (1997)
4. Wittig, G., Finnie, G.: Estimating software development effort with connectionist models. Information and Software Technology 39, 469–476 (1997)
5. Dolado, J.J.: On the Problem of the Software Cost Function. Information and Software Technology 43(1), 61–72 (2001)
6. Park, R.: Software size measurement: a framework for counting source statements. CMU/SEI-TR-020. Report (1996) (accessed November 2007), http://www.sei.cmu.edu/pub/documents/92.reports/pdf/tr20.92.pdf
7. Albrecht, A.J.: Measuring Application Development Productivity. In: Proceedings of the Joint SHARE, GUIDE, and IBM Application Developments Symposium, pp. 83–92 (1979)
8. Boehm, B.W., Abts, C., Clark, B., Devnani-Chulani, S.: COCOMO II Model Definition Manual. The University of Southern California (1997)
9. Albrecht, A.J., Gaffney, J.R.: Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation. IEEE Transactions on Software Engineering 9(6), 639–648 (1983)
10. Briand, L.C., Wieczorek, I.: Resource Modeling in Software Engineering. Encyclopedia of Software Engineering 2 (2001)
11. Jorgensen, M., Shepperd, M.: A Systematic Review of Software Development Cost Estimation Studies. IEEE Transactions on Software Engineering 33(1), 33–53 (2007)
12. Boehm, B.W.: Software Engineering Economics. Prentice-Hall, Englewood Cliffs (1981)
13. Kemerer, C.F.: An Empirical Validation of Software Cost Estimation Models. CACM 30(5), 416–429 (1987)
14. Desharnais, J.M.: Analyse Statistique de la Productivité des Projets de Développement en Informatique à Partir de la Technique des Points de Fonction. MSc Thesis, Université du Québec à Montréal (1988)
15. International Software Benchmarking Standards Group (ISBSG): Estimating, Benchmarking & Research Suite Release 9. ISBSG, Victoria (2005), http://www.isbsg.org/
16. Sommerville, I.: Software Engineering. Addison-Wesley, Reading (2007)
17. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice-Hall, Englewood Cliffs (1999)

Part II

Artificial Intelligence and Decision Support Systems

How to Semantically Enhance a Data Mining Process?

Laurent Brisson1,3 and Martine Collard2,4

1 Institut TELECOM, TELECOM Bretagne, CNRS UMR 3192 LAB-STICC, Technopôle Brest-Iroise CS 83818, 29238 Brest Cedex 3, France
[email protected]
http://perso.enst-bretagne.fr/laurentbrisson/
2 INRIA Sophia Antipolis, 2004 route des Lucioles, 06902 BP93 Sophia Antipolis, France
[email protected]
3 Université européenne de Bretagne, France
4 Université Nice Sophia Antipolis, France

Abstract. This paper presents the KEOPS data mining methodology centered on domain knowledge integration. KEOPS is a CRISP-DM compliant methodology which integrates a knowledge base and an ontology. In this paper, we focus first on the pre-processing steps of business understanding and data understanding in order to build an ontology driven information system (ODIS). Then we show how the knowledge base is used for the post-processing step of model interpretation. We detail the role of the ontology and we define a part-way interestingness measure that integrates both objective and subjective criteria in order to evaluate model relevance according to expert knowledge. We present experiments conducted on real data and their results. Keywords: Data mining, Knowledge integration, Ontology Driven Information System.

1 Introduction

In knowledge discovery from data, methods and techniques are developed for discovering specific trends in a system or organization business by analyzing its data. The real advantage for decision making lies in the added value obtained by comparing extracted knowledge against the a priori knowledge of domain experts. Integrating a priori domain knowledge during the data mining process is currently an important research issue in the data mining field. In this paper, we present the KEOPS methodology, based on an ontology driven information system which integrates a priori knowledge all along the data mining process in a coherent and uniform manner. We detail each of these ontology driven steps and we define a part-way interestingness measure that integrates both objective and subjective criteria in order to evaluate model relevance according to expert knowledge. The paper is organized in six sections. Section 2 presents related works. Section 3 presents the KEOPS methodology step by step. In Section 4, we comment on some results which demonstrate the relevance of the approach. We discuss our approach in Section 5 and conclude in Section 6.


Fig. 1. KEOPS methodology

2 Related Works

2.1 Knowledge Integration in Data Mining
The data mining process described according to the CRISP-DM model [1] is presented as both iterative and interactive. The iterative nature is due to the way processes run cycles of trial-and-error experiments. Indeed, data miners have to repeat the pre-processing steps of domain understanding, data understanding and data preparation until the final models are considered relevant. The interactive nature is inherent to a data mining activity, since communication with experts is necessary for understanding the domain and the data and for interpreting results. Issues in evaluating and interpreting mining process results are currently major research challenges. In order to avoid useless iterations on preliminary tasks and to facilitate model interpretation, one solution is to explore deeply expert knowledge and source data in order to formalize them in conceptual structures, and to exploit these structures both for robust data preparation and for flexible model interpretation. In the literature, partial solutions exploiting domain knowledge are proposed for optimizing pre-processing steps [2]. For model evaluation, detailed studies have been devoted to interestingness measures [3]. A consensus among researchers is now established to consider objective interestingness versus subjective interestingness. Objective interestingness is traditionally evaluated by a variety of statistical indexes, while subjective interestingness is generally evaluated by comparing discovered patterns to user knowledge or a priori convictions of domain experts. In this paper we present the KEOPS methodology, based on an ontology driven information system which addresses the knowledge integration issue (see Figure 1). The system relies on three main components: an ontology, a knowledge base and a mining oriented database rebuilt from source raw data. These components allow modelling domain concepts and relationships among them. They are used to pre-process data and to identify mappings between discovered patterns and expert knowledge.


2.2 Ontology Driven Information System (ODIS)
An ontology driven information system is an information system (IS) which relies mainly on an explicit ontology. This ontology may underlie all aspects and components of the information system. An ODIS contains three kinds of components: application programs, information resources and user interfaces. [4] discusses the impact of an ontology on an information system according to the temporal and structural dimensions. The temporal dimension refers to the ontology's role during IS construction and at run-time. If we have a set of reusable ontologies, the semantic content expressed can be transformed and translated into an IS component. Even if the volume of ontology knowledge available is modest, it may nevertheless help a designer in a conceptual analysis task. This task frequently consists of redesigning an existing information system. This approach fits the needs of data mining tasks, where an operational database has to be transformed into datasets before the data mining modeling step. The structural dimension refers to the fact that each information system component may use the ontology in a specific way.
– Database component: at development time, an ontology can play an important role in requirement analysis and conceptual modeling. The resulting conceptual model can be represented as a computer-processable ontology mapped to a concrete target platform [5]. Usually, IS conceptual schemes (CS) are created from scratch, wasting a lot of time and resources.
– Interface components may be assisted by ontologies which are used to generate personalized interfaces or to manage user profiles [6,7].
– Application program components use implicit knowledge in order to perform a task. However, this knowledge is often hardcoded in software. Ontologies may provide a formal base helping to access domain knowledge.

2.3 Ontology-Based Validation Methods
Subjective interestingness measures were developed in order to complement objective measures and give an insight on real human interest. However, these measures lack semantic formalization and force the user to express all of his expectations. Consequently, the extracted-pattern validation process must involve not only the study of patterns but also the use of a domain ontology and domain experts' expectations. Rules expressed to filter out noisy patterns or to select the most interesting ones should be relevant. An important issue in ontology-based validation methods is the definition of semantic similarity measures between ontology concepts. Fortunately, there are numerous works that address this problem. We can consider two kinds of methods for measuring semantic similarity within an ontology: edge counting methods and information-theoretic methods. Edge counting methods consist of calculating the distance between ontology concepts, similarity decreasing as distance increases. If there are several paths, minimum or average distances can be used. Leacock and Chodorow [8] measure semantic similarity by finding the shortest path distance between two concepts and then scaling the distance by the maximum distance in the "is-a" hierarchy. Choi and Kim [9] use the concept hierarchy tree to calculate a concept distance between two concepts. Zhong et al. [10] define weights for the links among concepts according to their position in the taxonomy. Resnik introduced information-theoretic measures [11,12] based on the information content of the lowest common ancestor of two concepts. The information content of a term decreases as its probability of occurrence increases. If the lowest common ancestor of two concepts is a generic concept, these concepts should be quite different and their lowest common ancestor has a low information level. Resnik demonstrated that such information-theoretic methods are less sensitive, and in some cases not sensitive, to the problem of link density variability [11]. Lin [13] improves Resnik's measure by considering how close the concepts are to their lowest common ancestor. Jiang presents a combined approach that inherits the edge-counting approach and enhances it with a node-based calculation of information content [14]. Lord compared Resnik's, Lin's and Jiang's measures in order to use them to explore the Gene Ontology (GO). His results suggest that all three measures show a strong correlation between sequence similarity and molecular function semantic similarity. He concludes that none of the three measures has a clear advantage over the others, although each one has strengths and weaknesses [15]. Schlicker et al. [16] introduced a new measure of similarity between GO terms that is based on Resnik's and Lin's definitions. This measure takes into account how close these terms are to their lowest common ancestor and uses a score allowing the identification of functionally related gene products from different species that have no significant sequence similarity. In KEOPS, we introduce ontology-based post-processing and evaluation steps too.
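As a concrete illustration of the information-content family of measures discussed above, the sketch below computes Resnik's and Lin's similarities over a toy "is-a" hierarchy. Concept probabilities are estimated from hypothetical occurrence counts; this is a didactic example, not the configuration used in KEOPS.

```python
import math

# Toy "is-a" hierarchy (child -> parent) and hypothetical occurrence counts.
parent = {"Academic": "Bookshop", "General": "Bookshop",
          "Sciences": "Academic", "Letters": "Academic", "Bookshop": None}
counts = {"Sciences": 4, "Letters": 3, "Academic": 2, "General": 5, "Bookshop": 1}

def ancestors(c):
    out = []
    while c is not None:
        out.append(c)
        c = parent[c]
    return out

def prob(c):
    # p(c) = occurrences of c and of all its descendants / total occurrences
    total = sum(counts.values())
    desc = [x for x in counts if c in ancestors(x)]
    return sum(counts[x] for x in desc) / total

def ic(c):
    return -math.log(prob(c))               # information content

def lca(a, b):
    anc_a = ancestors(a)
    return next(x for x in ancestors(b) if x in anc_a)

def resnik(a, b):
    return ic(lca(a, b))

def lin(a, b):
    return 2 * ic(lca(a, b)) / (ic(a) + ic(b))

print(resnik("Sciences", "Letters"), lin("Sciences", "Letters"))
```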

3 KEOPS Methodology

KEOPS is a methodology which drives data mining processes by integrating expert knowledge. The goals it addresses are:
– To manage interactions between knowledge and data all along the data mining process: data preparation, dataset generation, modeling, evaluation and results visualization.
– To evaluate extracted models according to domain expert knowledge.
– To provide easy navigation throughout the space of results.
KEOPS (cf. Fig. 1) is based upon an ontology driven information system (ODIS) set up with four components:
– An application ontology whose concepts and relationships are dedicated to the domain and the data mining task.
– A Mining Oriented DataBase (MODB): a relational database whose attributes and values are chosen among ontology concepts.
– A knowledge base to express consensual knowledge, obvious knowledge and user assumptions.
– A set of information system components - user interfaces, extraction algorithms, evaluation methods - in order to select the most relevant extracted models according to expert knowledge.


The KEOPS methodology extends the CRISP-DM process model by integrating knowledge in most steps of the mining process. The initial step focuses on business understanding. The second step focuses on data understanding and on activities to check data reliability. Data reliability problems are solved during the third step, data preparation. The fourth step is the evaluation of extracted models. In this paper we do not focus on the modeling step of the CRISP-DM model, since we ran the CLOSE algorithm [17], which extracts association rules without domain knowledge.

3.1 Business Understanding
During the business understanding step, documents, data, domain knowledge and discussion between experts lead to assessing the situation, determining business objectives and success criteria, and evaluating risks and contingencies. However, this step is often rather informal. The KEOPS methodology requires building an ontology driven information system during the next step, data understanding. Consequently, an informal specification of business objectives and expert knowledge is henceforth insufficient; it is thus necessary to formalize expert knowledge during business understanding. We chose to state knowledge with production rules, also called "if ... then ..." rules. These rules are modular, each defining a small and independent piece of knowledge. Furthermore, they can be easily compared to extracted association rules. Each knowledge rule has some essential properties used to select the most interesting association rules:
– Knowledge confidence level: five different values are available to describe knowledge confidence according to a domain expert. These values are ranges of confidence values: 0-20%, 20-40%, 40-60%, 60-80% and 80-100%. We call confidence the probability for the rule consequence to occur when the rule condition holds.
– Knowledge certainty:
• Obvious: knowledge that cannot be contradicted.
• Consensual: domain knowledge shared among experts.
• Assumption: knowledge the user wants to check.
Since the description of the expert interview methodology used to capture knowledge is beyond the scope of this paper, the reader should refer to [18].

3.2 Data Understanding
Data understanding means selecting and describing source data in order to capture their semantics and reliability. During this step, the ontology is built in order to identify domain concepts and relationships between them (the objective is to select among the data the most interesting attributes according to the business objectives), to solve ambiguities within data and to choose data discretization levels. Consequently, the ontology formalizes domain concepts and information about data. This ontology is an application ontology; it contains the essential knowledge needed to drive data mining tasks. Ontology concepts are related to domain concepts; however, relationships between them model database relationships. During the next step, data preparation (cf. Section 3.3), a relational database called the Mining Oriented DataBase (MODB) will be built.


Fig. 2. Bookshop ontology snapshot

In order to understand the links between the MODB and the ontology, it is necessary to define the notions of domain, concept and relationship (a small illustration in code follows this list):
– Domain: this notion, in the KEOPS methodology, refers to the notion of domain in relational theory. A domain represents a set of values associated to a semantic entity (or concept).
– Concept: each concept of the ontology has a property defining its role. There exist two classes of concepts: attribute concepts and value concepts.
• An attribute concept is identified by a name and a domain.
• Each value of a domain is called a value concept. Thus a domain is described by an attribute concept and by value concepts organized into a taxonomy. Each MODB attribute is linked to one and only one attribute concept and takes its values in the associated domain. In Figure 2, "Bookshop" is an attribute concept, "Academic" a value concept, and the set {Academic, General, Sciences, Letters} defines the "Bookshop" domain.
– Relationships: there exist three kinds of relationships between concepts:
• A data-related relationship: the "valueOf" relationship between an attribute concept and a value concept. The set of value concepts linked to an attribute concept with the "valueOf" relationship defines a domain within the MODB.
• A subsumption relationship between two value concepts. A concept subsumed by another one is a member of the same domain. This relationship is useful during data preparation (to select data granularity in datasets), for the reduction of rule volume (to generate generalized association rules, see Section 3.4), for the comparison between models and knowledge (to consider sibling and ancestor concepts) and for final results visualization.
• Semantic relationships between value concepts. These relationships can be order, composition, exclusion or equivalence relationships. They can be used to compare extracted models and knowledge and to visualize results.
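The sketch below shows one possible in-memory representation of these notions, using the Bookshop domain of Figure 2: an attribute concept, its value concepts linked by "valueOf", and subsumption links forming the taxonomy. Class and attribute names are illustrative choices, not the KEOPS implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ValueConcept:
    name: str
    parent: "ValueConcept | None" = None          # subsumption relationship

    def ancestors(self):
        node, out = self.parent, []
        while node is not None:
            out.append(node)
            node = node.parent
        return out

@dataclass
class AttributeConcept:
    name: str
    values: list = field(default_factory=list)    # "valueOf" relationship

    def domain(self):
        return {v.name for v in self.values}

# Bookshop domain from Figure 2
academic = ValueConcept("Academic")
general = ValueConcept("General")
sciences = ValueConcept("Sciences", parent=academic)
letters = ValueConcept("Letters", parent=academic)
bookshop = AttributeConcept("Bookshop", [academic, general, sciences, letters])

print(bookshop.domain())                          # {'Academic', 'General', 'Sciences', 'Letters'}
print([a.name for a in sciences.ancestors()])     # ['Academic']
```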


The KEOPS methodology aims to extract interesting models according to user knowledge. Consequently, it is necessary during ontology construction to be careful with some usual concerns in data mining:
– Aggregation level: like data, ontology concepts have to represent disjoint domains.
– Discretization level: ontology concepts have to model various solutions for data discretization. Bad choices may affect the efficiency of the modeling step.
– Data correlation: if concepts are strongly related in the MODB, extracted models might be trivial and uninteresting.
Since these concerns are beyond the scope of this paper, the reader should refer to [19] for a better insight on concept elicitation and to [20] for a better insight on discretization and grouping.

Table 1. Bookshop ontology concept elicitation

Source Data     | Attribute Concept | Value Concept
St Denis Shop   | Bookshop          | Academic
St Denis Shop   | Location          | St Michel bd
Rive Gauche 5th | Bookshop          | General
Rive Gauche 5th | Location          | 5th District

Example. Let us take the case of a bookstore company with several bookshops in Paris which plans to improve customer relationships. Bookshops may be specialized in a field like "academic" or not (general) (see Figure 2). Bookshops are located geographically. Data are provided on bookshops, customers and sales. Table 1 shows a way of mapping source values to ontology concepts.

3.3 Data Preparation
Data preparation is very iterative and time consuming. The objective is to refine data: discretize, clean and build new attributes and values in the MODB. During this step, KEOPS suggests building the MODB by mapping original data to ontology concepts. The database contains only bottom ontology concepts. The objective is to structure knowledge and data in order to process efficient mining tasks and to save the time spent in data preparation. The idea is to allow the generation of multiple datasets from the MODB, using ontology relationships, without another preparation step from raw data. Furthermore, during ODIS construction, experts can express their knowledge using the ontology, which is consistent with the data.

Mining Oriented Database (MODB) Construction. Databases often contain several tables sharing similar information. However, it is desirable that each MODB table contains all the information that is semantically close, and it is important to observe normal forms in these tables. During dataset generation, it is easy to use joins in order to create interesting datasets to be mined. However, these datasets do not have to observe normal forms.
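Continuing the bookstore example, the following sketch maps raw source values to MODB attribute/value concepts using a lookup like the one in Table 1. The dictionary-based mapping is our own illustration of the idea; KEOPS itself stores this correspondence in the ontology.

```python
# Mapping from source value to (attribute concept, value concept), as in Table 1.
CONCEPT_MAP = {
    "St Denis Shop":   [("Bookshop", "Academic"), ("Location", "St Michel bd")],
    "Rive Gauche 5th": [("Bookshop", "General"), ("Location", "5th District")],
}

def to_modb_row(source_shop):
    """Translate one source record into a MODB row keyed by attribute concepts."""
    row = {}
    for attribute_concept, value_concept in CONCEPT_MAP[source_shop]:
        row[attribute_concept] = value_concept
    return row

print(to_modb_row("St Denis Shop"))
# {'Bookshop': 'Academic', 'Location': 'St Michel bd'}
```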


Datasets Generation. It is often necessary, in a data mining process, to step back to data preparation. Algorithms have been proposed to choose relevant attributes among large data sources. However, sometimes the results do not satisfy user expectations and datasets have to be built again to run new tests. The KEOPS methodology suggests using the ontology in order to describe domain values and relationships between these values. Consequently, various datasets can be generated according to expert user choices. The ontology driven information system allows choosing among all data preparation strategies, providing various datasets from the same source values. A dataset is built using the following operators:
– Traditional relational algebra operators: projection, selection and join.
– Data granularity: this operator allows choosing, within the ontology, the concepts which will appear in the generated dataset.
In order to generate datasets we developed software whose inputs are the MODB and user parameters and whose outputs are new datasets. The user can graphically select relational algebra operators and data granularity. Since database attributes and values are also ontology concepts, the KEOPS methodology and the KEOPS software make the data preparation task easier.

3.4 Evaluation
This step assesses to what extent models meet the business objectives and seeks to determine whether there is some business reason why these models are deficient. Furthermore, algorithms may generate lots of models according to the parameters chosen for the extraction. That is why evaluation is an important task in the KEOPS methodology, in order to select the most interesting models according to expert knowledge.

Table 2. Interestingness measure if confidence levels are similar

                  | Rule R informative level (vs. K)
Kind of knowledge | More than K | Similar | Less than K
Obvious           | weak        | none    | none
Consensual        | medium      | weak    | weak
Assumption        | strong      | medium  | medium
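To show how Table 2 can be applied mechanically, the sketch below encodes it as a lookup from (knowledge certainty, relative informativeness of rule R versus knowledge K) to an interestingness level. It covers only the similar-confidence case shown here; the companion tables for other confidence relations are given in [22] and are not reproduced.

```python
# Table 2: interestingness of extracted rule R against knowledge K,
# valid only when their confidence levels are similar.
INTERESTINGNESS = {
    ("obvious",    "more"):    "weak",
    ("obvious",    "similar"): "none",
    ("obvious",    "less"):    "none",
    ("consensual", "more"):    "medium",
    ("consensual", "similar"): "weak",
    ("consensual", "less"):    "weak",
    ("assumption", "more"):    "strong",
    ("assumption", "similar"): "medium",
    ("assumption", "less"):    "medium",
}

def interestingness(knowledge_certainty, informative_level):
    """knowledge_certainty: 'obvious' | 'consensual' | 'assumption';
    informative_level: 'more', 'similar' or 'less' informative than K."""
    return INTERESTINGNESS[(knowledge_certainty, informative_level)]

print(interestingness("assumption", "more"))   # 'strong', as in the paper's example
```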

Rule Volume Reduction. We choose an association rule extraction algorithm which can generate bases containing only minimal non-redundant rules without information loss. Then, these rules are filtered to suppress semantic redundancies. The KEOPS methodology is based on Srikant's generalized association rules definition [21]. These rules are minimal because they forbid all irrelevant relationships within their items. We give a formal definition below. Let T be a taxonomy of items. R : A → C is called a generalized association rule if:
– A ⊂ T
– C ⊂ T
– No item in C is an ancestor of any item in A or C
– No item in A is an ancestor of any item in A
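A small checker for this definition might look as follows; `is_ancestor` walks the taxonomy's parent links, and the representation of the taxonomy as a child-to-parent dictionary is an assumption made for the example (membership of the items in T is taken for granted).

```python
def is_ancestor(candidate, item, parent):
    """True if `candidate` is a (strict) ancestor of `item` in taxonomy `parent`."""
    node = parent.get(item)
    while node is not None:
        if node == candidate:
            return True
        node = parent.get(node)
    return False

def is_generalized_rule(antecedent, consequent, parent):
    a, c = set(antecedent), set(consequent)
    no_c_ancestor = not any(is_ancestor(x, y, parent) for x in c for y in a | c)
    no_a_ancestor = not any(is_ancestor(x, y, parent) for x in a for y in a)
    return no_c_ancestor and no_a_ancestor

taxonomy = {"J2EE": "JAVA", "JAVA": "programming book", "student": "youngs"}
print(is_generalized_rule({"J2EE"}, {"JAVA"}, taxonomy))      # False: JAVA is an ancestor of J2EE
print(is_generalized_rule({"JAVA"}, {"Academic"}, taxonomy))  # True
```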


Consequently, the relationships appearing within these rules are semantic and generalization relationships from C items to A items. The objective is to maximize the information level in minimal rules. The last step consists of replacing a set of such rules by a more generalized one.

Rule Interestingness Evaluation. The KEOPS methodology suggests comparing extracted rules with expert knowledge. Extracted rules having one or more items that are in relationship with some knowledge rule items (i.e., value concepts linked in the ontology) have to be selected. Then, for each pair of knowledge rule/extracted rule:
– The extracted rule antecedent coverage is compared to the knowledge rule antecedent coverage, then the extracted rule consequent coverage is compared to the knowledge rule consequent coverage.
– By coverage comparison the most informative rule is deduced, i.e., the rule predicting the largest consequence from the smallest condition.
– The IMAK interestingness measure is applied [22]. This measure evaluates rule quality considering relative confidence values, relative information levels and knowledge certainty (see Section 3.1).
Thus, ontology driven information systems are useful in order to formalize domain concepts, to express knowledge, to generate models and to facilitate the ontology-based comparison of knowledge and models.

Example. Let us assume that a domain expert makes the following assumption: "If a student wants to buy a book about JAVA he comes to an academic bookshop", and gives it a 60%-80% estimation of confidence. Let us assume that the extracted rule is slightly different because it says that "Every young customer buying a book about J2EE comes to an academic bookshop" and has 75% confidence.

Assumption K: book='JAVA' ∧ buyer='student' → bookshop='Academic'
Extracted Rule R: book='J2EE' ∧ buyer='youngs' → bookshop='Academic'

According to the KEOPS methodology these two rules are said to be comparable because at least one extracted rule item is in relationship with a knowledge rule item: 'youngs' is more general than 'student' and 'JAVA' is more general than 'J2EE'. Then, the algorithm compares the coverage of these two rules in order to evaluate which is the more informative. Let us make the assumption that R is more informative than K. Since these two rules have similar confidence we can use Table 2 in order to evaluate the extracted rule interestingness (similar tables for various confidence levels are presented in [22]). Since the knowledge is an assumption, the interestingness degree of the extracted rule is strong.
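Putting the pieces together, the sketch below decides whether an extracted rule is comparable to a knowledge rule (at least one pair of items related through the taxonomy) for the JAVA/J2EE example above. The coverage-based informativeness test is stubbed out, since it depends on the underlying data, and the final interestingness level simply follows Table 2; all data-structure choices here are illustrative.

```python
def generalizes(x, y, parent):
    """True if concept x is an ancestor of concept y in the taxonomy."""
    while y in parent:
        y = parent[y]
        if y == x:
            return True
    return False

def comparable(rule_r, rule_k, parent):
    r_items = set(rule_r["antecedent"]) | set(rule_r["consequent"])
    k_items = set(rule_k["antecedent"]) | set(rule_k["consequent"])
    return any(x == y or generalizes(x, y, parent) or generalizes(y, x, parent)
               for x in r_items for y in k_items)

taxonomy = {"J2EE": "JAVA", "student": "youngs"}   # child -> parent ("is more specific than")
K = {"antecedent": ["JAVA", "student"], "consequent": ["Academic"], "certainty": "assumption"}
R = {"antecedent": ["J2EE", "youngs"], "consequent": ["Academic"]}

if comparable(R, K, taxonomy):
    # Coverage analysis (not shown) is assumed to find R more informative than K;
    # with similar confidences and an 'assumption' K, Table 2 gives 'strong'.
    print("interestingness: strong")
```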

4 Experiments

Although we illustrated the KEOPS methodology in this paper with the bookstore example, we ran experiments on real data provided by the French Family Allowance Office (CAF: Caisses d'allocations familiales). In this section we do not present specific rules about beneficiary behaviour (for privacy reasons) but only the reliability of the extracted rules. These results show that we are able to select relevant rules to provide to experts for final human evaluation. The CAF data were extracted during 2004 in the town of Grenoble (France). Each row describes one contact between the office and a beneficiary with 15 attributes, and data about 443716 contacts were provided. We ran the CLOSE algorithm and extracted 4404 association rules. The interestingness measure IMAK helps to filter the best ones. Figure 3 plots the 4404 rules according to confidence and lift. The experiments illustrated by Figures 4 and 5 compare these rules to a specific knowledge rule.

Fig. 3. Confidence vs Lift of all of the extracted rules

Fig. 4. Extracted rules (dots) matching knowledge rule 335 (square): a) Confidence vs Lift, b) Confidence vs Support (IMAK interestingness value increases with dot size)

Fig. 5. Extracted rules (dots) matching knowledge rule 565 (square): a) Confidence vs Lift, b) Confidence vs Support (IMAK interestingness value increases with dot size)

We may observe that among all of the extracted rules only a few of them are selected. The selection condition is to match the knowledge and to have an interestingness value greater than 0. In these figures the interestingness value is illustrated by the dot size. In Figure 4 the lift of the selected rules is greater than 1 and often greater than the knowledge lift (lift equals 1 at independence). Furthermore, some extracted rules have a better confidence but a smaller support: they illustrate the discovery of rare events, which could be very interesting for expert users. Figure 5 shows some results for another specific knowledge rule. We may observe again that only a few rules are selected. These rules offer various trade-offs allowing the selection of rare events (low support and high confidence) or general rules (high support and good confidence) to provide to domain experts.

5 Discussion

As future work, we plan to evaluate the rules selected by the KEOPS system on a larger scale with the help of expert groups who are able to validate their semantic relevance. However, we face the problem of managing patterns in a way that is coherent with knowledge management. These patterns are heterogeneous (decision trees, clusters, rules, etc.) and it is a laborious task to access and analyze them. Research on pattern management aims at setting up systems to maintain the persistence and availability of results. Such systems have to represent patterns for sharing and reasoning and to manage them in order to allow efficient searches among them. Consequently, there is a need for services to store patterns (indexation of various patterns), to manage patterns (insert, delete, update patterns), to manage metadata (links to data sources, temporal information, semantic information, quality measures) and to query patterns (to retrieve patterns efficiently under some constraints, to evaluate similarity). Several approaches have defined logical models for pattern representation (PMML, CWMDM, etc.). Although they are well suited for sharing data models, they seem inadequate to represent and manage patterns in a flexible and coherent way. Inductive databases (IDB) [23] provide models for pattern storage and inductive query languages. These languages allow generating patterns satisfying some user-specified constraints (using a data mining algorithm) and querying previously generated patterns. Rizzi et al. [24] defined the Pattern Base Management System (PBMS), claiming that a logical separation between database and pattern base is needed to ensure efficient handling of both raw data and patterns through dedicated management systems. In order to extend our work by enhancing post-processing steps with expert knowledge, we plan to develop a system dedicated to pattern storage and querying. In a first stage, we will focus only on rule-based patterns: association rules, classification rules or sequential rules.

The current KEOPS rule interestingness evaluation method is mainly based on the previously defined IMAK measure [22]. We plan to enrich the method in different ways:
– Rather than confronting one extracted rule with each of the existing knowledge rules in the base, we will measure the rule's relative interestingness by comparison with a set of corresponding knowledge rules. Assuming that the knowledge rule set is relevant and does not contain any contradiction, the interestingness of a given rule will be evaluated relative to a set of similar knowledge rules, since it may either generalize them or highlight a more precise case than them.
– The current measure is computed partly upon rule quality indices such as support, confidence and lift. These indices are easily interpretable. They allow selecting rules according to the common sense of coverage and implication, but they are quite simple. More sophisticated interestingness measures could be used and combined to emphasize precise objectives of the user [25], and it would be valuable to observe how the extracted rules differ.
– The ODIS is the backbone of the KEOPS approach, and one of our further works will more deeply take advantage of the semantic relationships stored among ontological concepts. Semantic distances [26,11,10], as discussed in Section 2.3, will be integrated in order to determine more accurately similarity measures between an extracted rule and a knowledge rule and to refine the interestingness measure too.

6 Conclusions

Managing domain knowledge during the data mining process is currently an important research issue in the data mining field. In this paper, we have presented the KEOPS methodology for integrating expert knowledge all along the data mining process in a coherent and uniform manner. We built an ontology driven information system (ODIS) based on an application ontology, a knowledge base and a mining oriented database re-engineered from source raw data. Thus, expert knowledge is used during the business and data understanding, data preparation and model evaluation steps. We have shown that integrating expert knowledge during the first step gives experts a better insight into the whole data mining process. For the last step we have introduced IMAK, a part-way interestingness measure that integrates both objective and subjective criteria in order to evaluate model relevance according to expert knowledge.

We implemented a KEOPS prototype in order to run experiments. Experimental results show that the IMAK measure helps to select a reduced rule set among data mining results. These rules offer various trade-offs, allowing experts to select rare events or more general rules which are relevant according to their knowledge.

References

1. Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., Wirth, R.: CRISP-DM 1.0: Step-by-step data mining guide. SPSS Inc. (2000)
2. Kedad, Z., Métais, E.: Ontology-based data cleaning. In: Andersson, B., Bergholtz, M., Johannesson, P. (eds.) NLDB 2002. LNCS, vol. 2553, pp. 137–149. Springer, Heidelberg (2002)
3. McGarry, K.: A survey of interestingness measures for knowledge discovery. Knowl. Eng. Rev. 20, 39–61 (2005)
4. Guarino, N.: Formal Ontology in Information Systems. IOS Press, Amsterdam (1998); amended version of the paper in: Proceedings of the 1st International Conference, Trento, Italy, June 6-8 (1998)
5. Ceri, S., Fraternali, P.: Designing Database Applications with Objects and Rules: The IDEA Methodology. Series on Database Systems and Applications. Addison-Wesley, Reading (1997)
6. Guarino, N., Masolo, C., Vetere, G.: OntoSeek: Using large linguistic ontologies for gathering information resources from the web. Technical report, LADSEB-CNR (1998)
7. Penarrubia, A., Fernandez-Caballero, A., Gonzalez, P., Botella, F., Grau, A., Martinez, O.: Ontology-based interface adaptivity in web-based learning systems. In: ICALT 2004: Proceedings of the IEEE International Conference on Advanced Learning Technologies (ICALT 2004), Washington, DC, USA, pp. 435–439. IEEE Computer Society, Los Alamitos (2004)
8. Leacock, C., Chodorow, M.: Combining local context with WordNet similarity for word sense identification. In: Fellbaum, C. (ed.) WordNet: A Lexical Reference System and its Application. MIT Press, Cambridge (1998)
9. Choi, I., Kim, M.: Topic distillation using hierarchy concept tree. In: ACM SIGIR Conference, pp. 371–372 (2003)
10. Zhong, J., Zhu, H., Li, J., Yu, Y.: Conceptual graph matching for semantic search. In: ICCS Conference, pp. 92–196 (2002)
11. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. In: IJCAI Conference, pp. 448–453 (1995)
12. Resnik, P.: Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11, 95–130 (1999)
13. Lin, D.: An information-theoretic definition of similarity. In: ICML Conference (1998)
14. Jiang, J., Conrath, D.W.: Semantic similarity based on corpus statistics and lexical taxonomy. CoRR cmp-lg/9709008 (1997)
15. Lord, P., Stevens, R., Brass, A., Goble, C.A.: Semantic similarity measures as tools for exploring the Gene Ontology. In: PSB Conference (2003)
16. Schlicker, A., Domingues, F., Rahnenfuhrer, J., Lengauer, T.: A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 7, 302 (2006)
17. Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Closed set based discovery of small covers for association rules. In: Actes des 15èmes Journées Bases de Données Avancées (BDA 1999), pp. 361–381 (1999)
18. Becker, H.S.: Sociological Work: Method and Substance. Transaction Publishers, U.S. (1976)

116

L. Brisson and M. Collard

19. De Leenheer, P., de Moor, A.: Context-driven disambiguation in ontology elicitation. In: Shvaiko, P., Euzenat, J. (eds.) Context and Ontologies: Theory, Practice and Applications, Pittsburgh, Pennsylvania, AAAI, pp. 17–24. AAAI Press, Menlo Park (2005) 20. Berka, P., Bruha, I.: Discretization and grouping: Preprocessing steps for data mining. In: ˙ Zytkow, J.M. (ed.) PKDD 1998. LNCS, vol. 1510, pp. 239–245. Springer, Heidelberg (1998) 21. Srikant, R., Agrawal, R.: Mining generalized association rules. In: VLDB 1995: Proceedings of the 21th International Conference on Very Large Data Bases, pp. 407–419. Morgan Kaufmann Publishers Inc., San Francisco (1995) 22. Brisson, L.: Knowledge extraction using a conceptual information system (ExCIS). In: Collard, M. (ed.) ODBIS 2005/2006. LNCS, vol. 4623, pp. 119–134. Springer, Heidelberg (2007) 23. Imieli´nski, T., Mannila, H.: A database perspective on knowledge discovery. Commun. ACM 39, 58–64 (1996) 24. Rizzi, S., Bertino, E., Catania, B., Golfarelli, M., Halkidi, M., Terrovitis, M., Vassiliadis, P., Vazirgiannis, M., Vrachnos, E.: Towards a logical model for patterns. In: Song, I.-Y., Liddle, S.W., Ling, T.-W., Scheuermann, P. (eds.) ER 2003. LNCS, vol. 2813, pp. 77–90. Springer, Heidelberg (2003) 25. Collard, M., Vansnick, J.C.: How to measure interestingness in data mining: a multiple criteria decision analysis approach. In: RCIS, pp. 395–400 (2007) 26. Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and application of a metric on semantic nets, vol. 19, pp. 17–30 (1989)

Next-Generation Misuse and Anomaly Prevention System

Pablo García Bringas and Yoseba K. Penya

Deusto Technology Foundation, University of Deusto, Bilbao, Basque Country
pgb,[email protected]

Abstract. Network Intrusion Detection Systems (NIDS) aim at preventing network attacks and unauthorised remote use of computers. More precisely, depending on the kind of attack it targets, a NIDS can be oriented to detect misuses (by defining all possible attacks) or anomalies (by modelling legitimate behaviour and detecting whatever does not fit that model). Still, since their problem knowledge is restricted to possible attacks, misuse detectors fail to notice anomalies, and vice versa. Against this background, we present ESIDE-Depian, the first unified misuse and anomaly prevention system based on Bayesian networks that analyses network packets in their entirety, together with the strategy used to create a consistent knowledge model that integrates misuse-based and anomaly-based knowledge. The training process of the Bayesian network may become intractable very quickly in some extreme situations; we also present a method to cope with this problem. Finally, we evaluate ESIDE-Depian against well-known and new attacks, showing how it outperforms a well-established industrial NIDS.
Keywords: Intrusion prevention, Misuse detection, Anomaly detection, Data mining, Machine learning, Bayesian networks.

1 Introduction
The Internet System Consortium estimates that, nowadays, more than 489 million computers are connected to the biggest network in the world [1]. Being part of such a vast community brings amazing possibilities but also worrying dangers. Against this record-breaking growth (the same survey in July 2000 yielded only 93 million computers), traditional passive measures for isolation and access control have proved inadequate to dam the current flood of digital attacks and intrusion attempts. Consequently, the area of Computer Security and, more specifically, Network Intrusion Detection Systems (NIDS) have lately been the subject of increasing interest and research as a suitable answer to the aforementioned threat. A NIDS is software in charge of distinguishing between legitimate and malicious network users. Moreover, due to the rising complexity and volume of attacks, NIDS operate in an automated manner: the NIDS software monitors system usage to identify behaviour that breaks the security policy. Based on their scope, NIDS can be divided into misuse detectors and anomaly detectors. Initially, NIDS were conceived as misuse detectors, i.e., they had a well-defined set
of malicious behaviours and simply supervised the system to find them. Misuse Detection Systems are commonly characterised by high accuracy in their decisions, as well as by excellent throughput, since they are very good at detecting well-known attacks. Nevertheless, they also present an important flaw: they are not able to respond to unknown attacks and, further, they require an operator to specify the expert knowledge. In order to overcome this shortcoming, another strategy, known as anomaly detection, has been developed over the last decade. Anomaly Detection Systems model legitimate system usage in order to obtain a certainty measure of potential deviations from that normal profile. Each deviation that is found significant enough is considered anomalous and notified to a human operator. This alarm can be analysed manually or processed automatically, either to filter intruder actions (in line with the Intrusion Prevention paradigm), to reconfigure the environment, or to collect audit information. Anomaly Detection Systems, however, cannot compete with Misuse Detection ones when it comes to detecting well-known attacks; therefore, each approach fails in the other's area of expertise. Several paradigms have been used to develop diverse NIDS approaches (a detailed analysis of related work in this area can be found, for instance, in [2]): Expert Systems [3], Finite Automata [4], Rule Induction Systems [5], Neural Networks [6], Intent Specification Languages [7], Genetic Algorithms [8], Fuzzy Logic [9], Support Vector Machines [6], Intelligent Agent Systems [10] and Data-Mining-based approaches [11]. Still, none of them tries to combine anomaly and misuse detection, and they fail when applied to either well-known or zero-day attacks. One exception is [12], but its analysis of network packets is too superficial (headers only) to yield good results in real life. Moreover, few proposed models (such as [13,14]) use historical data either for analysis or for sequential adaptation of the knowledge representation models used for detection, so this information about the essence and the potential trends of the target system is not commonly considered, e.g., to obtain a baseline profile of normal behaviour. Against this background, this paper advances the state of the art in two main ways. First, we present ESIDE-Depian (Intelligent Security Environment for Detection and Prevention of Network Intrusions), the first inherently unified Misuse and Anomaly Detector that analyses the whole network packet. Second, we detail a new methodology and knowledge representation model that allow the adaptive reasoning engine of ESIDE-Depian to infer conclusions considering both Misuse and Anomaly Detection knowledge in a unified and homogeneous way. The remainder of the paper is structured as follows. Section 2 describes the general architecture of ESIDE-Depian, including the creation process of the knowledge model for each kind of Bayesian expert used for Misuse Detection and the integration of all verdicts in one Naive Bayesian Network to assure a coherent outcome. Section 3 details the structural learning phase and the methodology developed to reduce its complexity. Section 4 presents the experiments carried out to evaluate ESIDE-Depian with real network traffic. Finally, Section 5 concludes and outlines future work.


2 Architecture and Approach
The internal design of ESIDE-Depian is principally determined by its dual nature. Being both a misuse and an anomaly detection system requires answering sometimes clashing needs and demands: it must be able to offer an efficient response against both well-known and zero-day attacks simultaneously. In order to ease the way to this goal, ESIDE-Depian has been conceived and deployed in a modular way that allows the problem to be decomposed into several smaller units. Thereby, Snort (a rule-based, state-of-the-art Misuse Detection System [15]) has been integrated into the training procedure to increase the accuracy of ESIDE-Depian. Following a strategy proven successful in this area [3], the reasoning engine we present here is composed of a number of Bayesian experts working over a common knowledge model. The Bayesian experts must cover all possible areas where a menace may arise. In this way, there are five Bayesian experts in ESIDE-Depian: three of them deal with the packet headers of the TCP, UDP, ICMP and IP network protocols (the so-called TCP-IP, UDP-IP and ICMP-IP expert modules); a further one, the Connection Tracking expert, analyses potential temporal dependencies between TCP network events; and, finally, the Protocol Payload expert is in charge of packet payload analysis. In order to obtain the knowledge model, each expert separately carries out a Snort-driven supervised learning process in its area of expertise. Therefore, the final knowledge model is the sum of the individual ones obtained by each expert. Fig. 1 shows the general ESIDE-Depian architecture. The rest of this section details the creation and updating of the knowledge model for each kind of Bayesian expert (including the exact role of Snort in this process) and the way ESIDE-Depian combines all the experts' verdicts.
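As a rough, illustrative sketch of this modular decomposition (the class and field names below are ours, not part of ESIDE-Depian's implementation, and which experts apply to which protocols is a simplification), incoming packets can be routed to the applicable experts and their individual verdicts collected for later combination:

class BayesianExpert:
    """Placeholder for one Bayesian expert; the real modules infer
    P(attack | observed parameters) over a learned Bayesian network."""
    def __init__(self, name):
        self.name = name

    def verdict(self, packet):
        # A fixed neutral value stands in for actual Bayesian inference.
        return 0.0

EXPERTS = {
    "tcp_ip": BayesianExpert("TCP-IP header"),
    "udp_ip": BayesianExpert("UDP-IP header"),
    "icmp_ip": BayesianExpert("ICMP-IP header"),
    "conn_tracking": BayesianExpert("Connection Tracking"),
    "payload": BayesianExpert("Protocol Payload"),
}

def collect_verdicts(packet):
    """Route one packet (a dict with a 'proto' field) to the relevant experts."""
    if packet["proto"] == "tcp":
        names = ["tcp_ip", "conn_tracking", "payload"]
    elif packet["proto"] == "udp":
        names = ["udp_ip"]
    else:
        names = ["icmp_ip"]
    return {n: EXPERTS[n].verdict(packet) for n in names}

print(collect_verdicts({"proto": "tcp", "dst_port": 80}))  # all-zero placeholder verdicts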

Fig. 1. ESIDE-Depian general architecture


2.1 ESIDE-Depian Knowledge Model Generation Process
The knowledge model can be obtained automatically in either an unsupervised or a supervised way. Typically, unsupervised learning approaches do not take into account expert knowledge about well-known attacks. They reach their decisions based on mathematical representations of the distance between observations from the target system, which makes them ideal for Anomaly Detection. Supervised learning models, on the other hand, do use expert knowledge in their decision making, in line with the Misuse Detection paradigm, but usually impose high administrative costs. Thus, both approaches present important advantages and several shortcomings. Since ESIDE-Depian is both, it needs a balanced solution that manages both kinds of knowledge in a uniform way. Therefore, ESIDE-Depian uses not only Snort's information-gathering capabilities but also Snort's decision-based labelling of network traffic. Thereby, the learning processes inside ESIDE-Depian can be considered automatically-supervised Bayesian learning, divided into the following phases. Please note that this sequence only applies to the standard generation process followed by the Packet Header Parameter Analysis experts (i.e., the TCP-IP, UDP-IP and ICMP-IP expert modules):
– Traffic sample obtaining: First, we need to establish the information source in order to gather the sample. This set usually includes normal traffic (typically gathered from the network by sniffing, ARP-poisoning or similar means), as well as malicious traffic generated by the well-known arsenal of hacking tools such as [17]. Subsequently, the Snort Intrusion Detection System embedded in ESIDE-Depian adds labelling information regarding the legitimacy or malice of the network packets. Specifically, Snort's main decision about a packet is added to the set of detection parameters under the name of attack variable. In this way, it is possible to obtain a complete sample of evidences that includes both protocol fields and Snort labelling information. It therefore combines knowledge about normal behaviour with knowledge about well-known attacks or, in other words, the information necessary for both Misuse Detection and Anomaly Detection.
– Structural Learning: The next step defines the operational model ESIDE-Depian should work with. With this goal in mind, we have to provide logical support for the knowledge extracted from network traffic information. Packet parameters need to be related within a Bayesian structure of nodes and edges, in order to ease the later inference of conclusions over this structure. In particular, the PC algorithm [18] is used here to obtain the structure of causal and/or correlative relationships among the given variables from data. In other words, the PC algorithm uses the traffic sample data to define the Bayesian model, representing the whole set of dependence and independence relationships among detection parameters. Due to its high requirements in terms of computational and temporal resources, this phase is usually performed off-line.
– Parametric Learning: The knowledge model fixed so far is a qualitative one. Therefore, the following step applies parametric learning in order to obtain the quantitative model representing the strength of the previously learned relationships, before the exploitation phase begins. Specifically, ESIDE-Depian implements maximum likelihood estimation [19] to achieve this goal. This method completes the Bayesian model obtained in the previous step by defining the quantitative description of the set of edges between parameters. That is, structural learning finds the structure of probability distribution functions among detection parameters, and parametric learning fills this structure with proper conditional probability values. The high complexity of this phase calls for a deeper description, which is given in Section 3.
– Bayesian Inference: Next, every packet captured from the target communication infrastructure needs one value for the posterior probability of the badness variable (i.e., the Snort label), given the set of observable packet detection parameters. Hence, we need an inference engine based on Bayesian evidence propagation. More precisely, we use the Lauritzen and Spiegelhalter method for conclusion inference over junction trees, since it is slightly more efficient than the alternatives in terms of response time [20]. Thereby, already working in real time, incoming packets are analysed by this method (on the basis of the observable detection parameters obtained from each network packet) to determine the posterior probability of the attack variable. The continuous probability value produced here represents the certainty that an evidence is good or bad. Generally, a threshold-based alarm mechanism can be added in order to obtain a balance between false positive and false negative rates, depending on the context.
– Adaptation: Normally, system operation does not stay static, but usually presents more or less important deviations as a result of service installation or reconfiguration, deployment of new equipment, and so on. In order to keep the knowledge representation model updated with potential variations in the normal behaviour of the target system, ESIDE-Depian uses the general sequential/incremental maximum likelihood estimates [19] (in a continuous or periodical way) to achieve continuous adaptation of the model to potential changes in the normal behaviour of traffic.
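To make the parametric-learning and inference phases concrete, the following self-contained sketch (our own illustration, not ESIDE-Depian's implementation) estimates conditional probabilities from Snort-labelled packets by frequency counting with add-one smoothing, and computes the posterior of the attack variable. For brevity it assumes a naive structure over hypothetical header fields, whereas ESIDE-Depian learns the structure with the PC algorithm and propagates evidence over junction trees.

from collections import Counter, defaultdict

def learn_cpts(packets, fields):
    """Maximum-likelihood style estimates (with add-one smoothing) for P(attack)
    and P(field = value | attack), from Snort-labelled packet records."""
    prior = Counter()
    cond = {f: defaultdict(Counter) for f in fields}
    for pkt in packets:
        a = pkt["attack"]                 # label added by Snort (0 or 1)
        prior[a] += 1
        for f in fields:
            cond[f][a][pkt[f]] += 1
    return prior, cond

def posterior_attack(pkt, prior, cond, fields):
    """P(attack = 1 | observed header fields), naive-Bayes style propagation."""
    total = sum(prior.values())
    scores = {}
    for a in (0, 1):
        p = (prior[a] + 1) / (total + 2)
        for f in fields:
            counts = cond[f][a]
            p *= (counts[pkt[f]] + 1) / (sum(counts.values()) + len(counts) + 1)
        scores[a] = p
    return scores[1] / (scores[0] + scores[1])

# Toy usage with hypothetical header fields:
fields = ["ip_proto", "tcp_flags", "dst_port"]
sample = [
    {"ip_proto": 6, "tcp_flags": "S", "dst_port": 80, "attack": 0},
    {"ip_proto": 6, "tcp_flags": "SF", "dst_port": 80, "attack": 1},
]
prior, cond = learn_cpts(sample, fields)
print(posterior_attack({"ip_proto": 6, "tcp_flags": "SF", "dst_port": 80}, prior, cond, fields))

In ESIDE-Depian the same counting step would instead fill the conditional probability tables of the learned network, and the adaptation phase would simply keep updating these counts as new packets arrive.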


2.2 Connection Tracking and Payload Analysis Bayesian Experts Knowledge Model Generation
The Connection Tracking expert attends to the potential temporal influence among network events within TCP-based protocols [21] and, therefore, requires a structure that allows the concept of time (predecessor, successor) to be included in its model. Similarly, the Payload Analysis expert, devoted to packet payload analysis, needs to model state transitions among symbols and tokens in the payload (following the strategy proposed in [22]). Usually, Markov models are used in such contexts because of their ability to represent problems based on stochastic state transitions. Nevertheless, the Bayesian concept is even better suited, since it not only includes an inherent representation of time but also generalises classical Markov models, adding features for a complex characterisation of states. Specifically, the Dynamic Bayesian Network (DBN) concept is commonly recognised as a superset of Hidden Markov Models [23] and, among other capabilities, it can represent dependence and independence relationships between parameters within one common state (i.e., in the traditional static Bayesian style) and also across different chronological states.
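As a deliberately simplified, first-order illustration of this temporal modelling (our own Markov-chain-style rendering, not the actual two-slice DBN used by ESIDE-Depian, which is described next), one can keep a table of flag-to-flag transition counts per destination address and port and score how well an observed sequence fits the learned transitions; the string encoding of the TCP flag combinations is just an illustrative stand-in for the artificial TCP-h-flags parameter introduced below.

from collections import defaultdict, Counter
import math

class TransitionExpert:
    """One table of flag-to-flag transition counts per (dst_ip, dst_port)."""
    def __init__(self):
        self.tables = defaultdict(lambda: defaultdict(Counter))

    def train(self, dst, flag_sequence):
        table = self.tables[dst]
        for prev, cur in zip(flag_sequence, flag_sequence[1:]):
            table[prev][cur] += 1

    def log_fit(self, dst, flag_sequence):
        """Log-probability that the sequence fits the learned model
        (add-one smoothing over the observed alphabet)."""
        table = self.tables[dst]
        alphabet = {f for row in table.values() for f in row} | set(flag_sequence)
        score = 0.0
        for prev, cur in zip(flag_sequence, flag_sequence[1:]):
            row = table[prev]
            score += math.log((row[cur] + 1) / (sum(row.values()) + len(alphabet)))
        return score

# Toy usage: flag combinations rendered as strings such as "S", "SA", "A", "FA".
expert = TransitionExpert()
expert.train(("10.10.10.2", 80), ["S", "SA", "A", "A", "FA", "A"])
print(expert.log_fit(("10.10.10.2", 80), ["S", "SA", "A"]))
print(expert.log_fit(("10.10.10.2", 80), ["S", "FA"]))   # unusual transition, lower score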


Thus, ESIDE-Depian implements a fixed two-node DBN structure to emulate the Markov-chain model (with at least the same representational power, plus the possibility of being extended with further features in the future), because the full use of Bayesian concepts removes several restrictions of Markov-based designs. For instance, it is not necessary to run the structural learning process used by the packet header analysis experts, since the structure is clear beforehand. Moreover, according to [21] and [22], the introduction of an artificial parameter may ease this kind of analysis. Accordingly, the Connection Tracking expert defines an artificial detection parameter named TCP-h-flags (based on an arithmetical combination of the TCP header flags), and the Payload Analysis expert uses the symbol and the token (in fact, there are two Payload Analysis experts: one for token analysis and another for symbol analysis). Finally, traffic behaviour (and thus the temporal transition patterns of TCP flags), as well as the lexical and syntactical patterns of payload protocols, may differ substantially depending on the sort of service provided by each specific piece of equipment (i.e., by each different IP address and each specific TCP destination port). To this end, ESIDE-Depian uses a multi-instance schema, with several Dynamic Bayesian Networks, one for each combination of TCP destination address and port. Afterwards, in the exploitation phase, Bayesian inference can be performed on real-time incoming network packets. In this case, the a priori fixed structure suggests applying the expectation-maximisation algorithm [19] in order to calculate not the posterior probability of attack, but the probability with which a single packet fits the learned model.
2.3 Naive Bayesian Network of the Expert Modules
Having different Bayesian modules is a two-fold strategy. On the one hand, the more specific expertise of each module allows it to deliver more accurate verdicts; on the other hand, there must be a way to resolve possibly conflicting decisions. In other words, a unique measure must emerge from the diverse judgements. To this end, ESIDE-Depian presents a two-tiered schema where the first layer is composed of the results of the expert modules and the second layer includes only one class parameter: the most conservative response among those provided by Snort and the community of expert modules (in order to prioritise the absence of false negatives over false positives). Thus, both layers form, in fact, a Naive Bayesian Network (as shown in Fig. 1 and Fig. 2). Such a naive classifier [20] has occasionally been proposed in Network Intrusion Detection, mostly for Anomaly Detection. This approach provides a good balance between representational power and performance, and also affords interesting flexibility, allowing, for instance, the dynamic enabling and disabling of ESIDE-Depian's expert modules in order to support heavy load conditions derived, e.g., from denial-of-service attacks. Now, the parameters of the Naive Bayesian Network should be discrete in nature, which, depending on the expert, may not be the case. To remove this problem, ESIDE-Depian allows the use of the aforementioned set of administratively-configured threshold conditioning functions.
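In very reduced form (a sketch under our own simplifying assumptions, with hypothetical expert names and thresholds), the thresholding and the "most conservative response" policy of this second layer can be pictured as follows; the actual combination in ESIDE-Depian is carried out by the naive Bayesian network itself.

def discretise(verdicts, thresholds):
    """Apply per-expert thresholds to turn continuous verdicts into 0/1 alarms."""
    return {name: int(p >= thresholds.get(name, 0.5)) for name, p in verdicts.items()}

def combined_verdict(snort_alert, expert_verdicts, thresholds):
    """Most conservative response: raise an alarm if Snort or any enabled
    Bayesian expert flags the packet (prioritising the absence of false negatives)."""
    alarms = discretise(expert_verdicts, thresholds)
    return int(snort_alert) or max(alarms.values(), default=0)

# Toy usage with hypothetical expert names and thresholds:
thresholds = {"tcp_ip": 0.7, "conn_tracking": 0.8, "payload": 0.9}
verdicts = {"tcp_ip": 0.35, "conn_tracking": 0.92, "payload": 0.10}
print(combined_verdict(snort_alert=False, expert_verdicts=verdicts, thresholds=thresholds))  # prints 1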


Finally, the structure of the Naive Bayesian Network model is fixed beforehand, assuming conditional independence among every possible cause and the presence of dependency edges between these causes and the effect or class. Therefore, no structural learning process is necessary here either; only sequential parametric learning must be performed, while the expert modules produce their packet-classifying verdicts during their respective parametric learning stages. Once this step is accomplished, the inference of unified conclusions and the sequential adaptation of knowledge can be provided in the same way as mentioned before. Fig. 2 details the individual knowledge models and how they fit together to form the general one.

3 The Structural Learning Challenge
As outlined before, structural learning makes it possible to model, in a completely automated way, the set of dependence and independence relationships that can exist between the different detection parameters. Nevertheless, in situations with a large volume of evidences and detection parameters with many different possible states, the aforementioned PC algorithm (as well as similar alternative methods) has very high computational requirements. Moreover, depending on the inner complexity of the set of relationships, those requirements can grow even further. The complexity therefore depends entirely on the nature of the data, which renders it unpredictable in advance. Hence, this task may sometimes be too resource-demanding for small and medium computing platforms. Against this background, we have developed a method that splits the traffic sample horizontally in order to reduce the complexity of the structural learning process. This method is detailed in the following sections.
3.1 PC-Algorithm Application
First of all, please note that structural learning methods commonly use a significance parameter in order to define, in a flexible manner, the strength that a dependence relationship needs in order to be considered a definitive one in the Bayesian network. This significance parameter can thus be used to relativise the concept of equality required by the independence tests implemented inside the learning algorithms; in particular, inside the PC algorithm. On the one hand, a high significance value produces a higher number of connections in the Bayesian model, with an increase in representativeness but also larger requirements in terms of main memory and computational power, and the possibility of overfitting. On the other hand, a low significance value usually produces a sparse Bayesian network, with lower requirements but also less semantic power. Therefore, in order to achieve a trade-off between representativeness and throughput, we expanded the structural learning process across multiple significance levels; more precisely, we used the seven most representative orders of magnitude. Once the expansion process is planned, it is possible to proceed with the structural learning process itself, applied in this case to the TCP-IP, UDP-IP and ICMP-IP protocols, for which a set of working dependence and independence hypotheses was defined a priori based on each protocol's RFC.


Fig. 2. ESIDE-Depian general architecture


It is also possible to apply the PC algorithm to the entire traffic sample in order to verify the accuracy of these hypotheses. In our case, this process was estimated to take 3,591 hours (150 days) of continuous learning. Thus, and also considering that this is a general problem, we propose parallelising the collection of learning processes, depending on the equipment available to the researcher. In fact, this parallelisation, performed over 60 computers at the same time, reduced the overall time to 59.85 nominal hours. Finally, with one instance of the PC algorithm applied to every fragment, for each of the four data sets and with the seven different significance values, 1,197 partial Bayesian structures were obtained at this point.
3.2 Partial Bayesian Structures Unifying Process
The collection of partial Bayesian structures obtained in the previous phase must be unified into a single Bayesian network that represents the reality of the application domain. For this purpose, we defined a statistical measure based on the frequency with which each dependence relationship between every two detection parameters appears in the set of partial structures. In this way, it is possible to calculate one single Bayesian structure for each of the 7 significance levels and each of the 4 data sets, leaving only 28 partial structures out of the initial 1,197. The next step achieves the definitive unification. To this end, we use the average value of each edge between variables, which allows us to reach the desired balance between representativeness and throughput through the selective addition to the Bayesian model of those edges above a specific, configurable (domain-dependent) significance threshold. Still, both the horizontal splitting of the traffic sample and the significance-based expansion present an important problem: cycles can appear in the Bayesian model obtained after unification, rendering the model invalid. An additional step prevents such cycles by using the average value obtained before as the criterion for selecting the weakest edge in the cycle, which is the one to be deleted.
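A compact sketch of this unification step (our own simplified rendition: edge frequencies are averaged across the partial structures, edges above a configurable threshold are kept, and any remaining cycle is broken by dropping its weakest edge, as described above) could look as follows:

from collections import defaultdict

def unify(partial_structures, threshold):
    """partial_structures: list of edge sets, each edge a (parent, child) pair.
    Returns the edges whose average frequency reaches the threshold, made acyclic."""
    freq = defaultdict(float)
    for edges in partial_structures:
        for e in edges:
            freq[e] += 1.0 / len(partial_structures)
    kept = {e: w for e, w in freq.items() if w >= threshold}

    def find_cycle(edges):
        graph = defaultdict(list)
        for a, b in edges:
            graph[a].append(b)
        for start in list(graph):
            stack = [(start, [start])]
            while stack:
                node, path = stack.pop()
                for nxt in graph[node]:
                    if nxt == start:
                        return path + [start]
                    if nxt not in path:
                        stack.append((nxt, path + [nxt]))
        return None

    while True:
        cycle = find_cycle(kept)
        if cycle is None:
            return kept
        cycle_edges = list(zip(cycle, cycle[1:]))
        weakest = min(cycle_edges, key=lambda e: kept[e])
        del kept[weakest]          # drop the weakest edge of the cycle

# Toy usage with three partial structures over detection parameters A, B, C:
parts = [{("A", "B"), ("B", "C")}, {("A", "B"), ("C", "A")}, {("B", "C"), ("C", "A")}]
print(unify(parts, threshold=0.5))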

4 Evaluation
In order to measure the performance of ESIDE-Depian, we designed two different kinds of experiments. In the first group, the network suffers well-known attacks (i.e., Misuse Detection) and, in the second group, zero-day attacks (i.e., Anomaly Detection), putting each aspect of the dual nature of ESIDE-Depian to the test. In both cases, the system was fed with a simulation of network traffic comprising more than 700,000 network packets sniffed during a one-hour capture from a university network. The first experiment (corresponding to Misuse Detection) aimed to compare Snort and the Packet Header Parameter Analysis experts. To this end, Snort's rule-set-based knowledge was used as the main reference for the labelling process, instantiated through the Sneeze Snort stimulator [24]. The sample analysed was a mixture of normal and poisoned traffic. Table 1 details the results of this experiment. As can be seen, the three experts achieved a 100% hit rate. Such results are not surprising: since ESIDE-Depian integrates Snort's knowledge, if Snort is able to detect an attack, ESIDE-Depian should do so as well. Nevertheless, not only the number of hits is important; the number of anomalous packets detected reflects the level of integration between the anomaly and the misuse detection parts of ESIDE-Depian.

Table 1. Bayesian expert modules for TCP, UDP and ICMP header analysis results

Indicator                    TCP      UDP    ICMP
Analyzed net packets         699.568  5.130  1.432
Snort's hits                 38       0      450
ESIDE-Depian's hits          38       0      450
Anomalous packets            600      2      45
False negatives              0        0      0
Potential false positives    0,08%    0,03%  3,14%

In fact, the latter can be highlighted as the most important achievement of ESIDE-Depian: detecting unusual packets while preserving the misuse detection advantages at the same time. Concerning potential false positive rates, very good rates are reached for the TCP and UDP protocols (according to the values defined in [16] as not human-operator-exhausting), but not so good for ICMP. Table 1 shows, however, a significant bias in the number of attacks introduced in the ICMP traffic sample (above 30%) and labelled as such by Snort; thus, it is not strange that the rate of anomalous packets detected here by ESIDE-Depian is slightly excessive. In the second experiment (also corresponding to misuse detection), the goal was to test the other expert modules (Connection Tracking and Payload Analysis). With this objective in mind, a set of attacks against a representative set of popular services was launched with several hacking tools such as [17]. The outcome of this test is summarised in Table 2. As can be seen, ESIDE-Depian prevailed in all cases, with a 0% false negative rate and a 100% hit rate. Still, not only the absorption of Snort's knowledge and of normal traffic behaviour was tested; the third experiment was intended to assess ESIDE-Depian's performance with zero-day attacks. With this idea in mind, a sample of artificial anomalies [25] was prepared with Snort's rule set as a basis and crafted (by means of the tool Packit) with slight variations aimed at avoiding Snort's detection (i.e., zero-day attacks that go unnoticed by misuse detection systems). Some of these attacks are detailed in Table 3.

Table 2. Results of the Connection Tracking and Payload Analysis expert modules

Indicator                Conn. Tracking  Payload Analysis (Symbol)  Payload Analysis (Token)
Analyzed net packets     226.428         2.676                      2.676
Attacks in sample        29              139                        19
ESIDE-Depian hits        29              139                        19
Anomalous net packets    0               0                          3
False negatives          0               0                          0
Pot. false positives     0%              0%                         0,11%


Table 3. Example of Zero-day attacks detected by ESIDE-Depian and not by Snort

Protocol  Artificial Network Anomaly                                                       Snort  ESIDE-Depian
TCP       packit -nnn -s 10.12.206.2 -d 10.10.10.100 -F SFP -D 1023                        x      √
TCP       packit nnn -s 10.12.206.2 -d 10.10.10.100 -F SAF                                 x      √
UDP       packit -t udp -s 127.0.0.1 -d 10.10.10.2 -o 0x10 -n 1 -T ttl -S 13352 -D 21763   x      √
UDP       packit -t udp -s 127.0.0.1 -d 10.10.10.2 -o 0x50 -n 0 -T ttl -S 13352 -D 21763   x      √
ICMP      packit -i eth0 -t icmp -K 17 -C 0 -d 10.10.10.2                                  x      √

Note that going beyond Snort's expert knowledge only makes sense for those expert modules that use this knowledge, i.e., the modules specialised in protocol headers, because the semantics of Snort's labelling does not fit the morphology of the payload and dynamic-nature parameters.

5 Conclusions and Future Lines
As the use of the Internet grows beyond all boundaries, the number of menaces rises and becomes a subject of concern and increasing research. Against this, Network Intrusion Detection Systems monitor local networks to separate legitimate from dangerous behaviours. According to their capabilities and goals, NIDS are divided into Misuse Detection Systems (which aim to detect well-known attacks) and Anomaly Detection Systems (which aim to detect zero-day attacks). So far, no system to our knowledge combines the advantages of both without any of their disadvantages. Moreover, the use of historical data for analysis or sequential adaptation is usually ignored, missing the possibility of anticipating the behaviour of the target system. Our system addresses both needs. We have presented ESIDE-Depian, a Bayesian-network-based Misuse and Anomaly Detection system. Our approach integrates Snort as a misuse-detection trainer, so that the Bayesian network of five experts is able to react against both misuses and anomalies. The Bayesian experts are devoted to the analysis of different network protocol aspects and obtain the common knowledge model by means of separate Snort-driven automated learning processes. A naive Bayesian network integrates all the partial verdicts achieved by the experts. Since ESIDE-Depian has passed the experiments brilliantly, it is possible to conclude that its use of Bayesian networking concepts provides an excellent basis for paradigm-unifying Network Intrusion Detection, providing not only
stable Misuse Detection but also effective Anomaly Detection capabilities, with a single flexible knowledge representation model and a well-proven set of inference and adaptation methods. On the other hand, the Bayesian approach also makes it possible to implement powerful features on top of it, such as the intrinsic, full representation of time offered by Dynamic Bayesian Networks, in order to accomplish fully characterised connection tracking and low-level chronological event correlation, or explanation tracking of the inferred cause-effect reasoning processes. Furthermore, contrary to other approaches such as Neural Networks, Bayesian networks allow administrative management of the inner information structures, so specific relationships between packet detection parameters and the final conclusion can be explained in a white-box manner. Moreover, it is not only possible to recover reasoning information, but also to act on both the Bayesian network structures and the conditional probability parameters, in order to adjust the whole behaviour of the Network Intrusion Detection System to special needs or configurations. Further, dynamic regulation of the knowledge representation model can be accomplished by using the sensitivity analysis proposed in [20], so as to withstand denial-of-service attacks by automatically enabling or disabling expert modules by means of a combined heuristic measure that considers specific throughput and representative power. In addition, it is also possible to perform model optimisation, to obtain the minimal set of representative parameters and the minimal set of edges among them, with the subsequent increase in overall performance. Approximate evidence propagation methods can also be applied in order to improve inference and adaptation response times. The current expert models only consider exact inference, but methods exist that provide fast responses with only a small and affordable loss of accuracy. Finally, Bayesian knowledge representation models offer one further interesting capability in the current state of the art of Network Intrusion Detection: the possibility of providing an ad-hoc method for NIDS evaluation. The Bayesian concept allows the simulation of samples corresponding to the learned knowledge, so it is an ideal environment for artificial anomaly generation. Future work will focus on further research on exploiting the aforementioned omnidirectional inference capability of Bayesian networks for the prediction of the next event, as well as on comparing ESIDE-Depian to other cutting-edge Intrusion Detection Systems.
Acknowledgements. The authors would like to thank Kenneth Lobato Lastra, Álvaro Marín Illera, Rodrigo del Val Peralta, Jon Ander Ortiz Durántez, Alejandro López Monge and Gorka Rodríguez Morales for their passion and commitment in the development of the different parts of ESIDE-Depian. Thanks also to the Regional Government of Biscay (Bizkaiko Foru Aldundia) and the Basque Government (Eusko Jaurlaritza) for their financial support.

References
1. Internet System Consortium: Internet Domain Survey, http://www.isc.org
2. Kabiri, P., Ghorbani, A.A.: Research on intrusion detection and response: A survey. Int. J. on Information Security 1(2), 84–102 (2005)
3. Alipio, P., Carvalho, P., Neves, J.: Using CLIPS to Detect Network Intrusion. In: Pires, F.M., Abreu, S.P. (eds.) EPIA 2003. LNCS, vol. 2902, pp. 341–354. Springer, Heidelberg (2003)
4. Vigna, G., Eckman, S., Kemmerer, R.: The STAT tool suite. In: DARPA Information Survivability Conference and Exposition, vol. 2, p. 1046. IEEE Press, Los Alamitos (2000)
5. Kantzavelou, I., Katsikas, S.: An attack detection system for secure computer systems - outline of the solution. In: 13th International IFIP TC11 Conference on Information Security, pp. 123–135 (1997)
6. Mukkamala, S., Sung, A., Abraham, A.: Intrusion detection using an ensemble of intelligent paradigms. J. of Network and Computer Applications 28, 167–182 (2005)
7. Doyle, J., Kohane, I., Long, W., Shrobe, H., Szolovits, P.: Event recognition beyond signature and anomaly. In: 2001 IEEE Workshop on Information Assurance and Security, pp. 170–174 (2001)
8. Kim, D., Nguyen, H., Park, J.: Genetic algorithm to improve SVM-based network intrusion detection system. In: 19th International Conference on Advanced Information Networking and Applications, vol. 2, pp. 155–158 (2005)
9. Chavan, S., Shah, K., Dave, N., Mukherjee, S., Abraham, A., Sanyal, S.: Adaptive neuro-fuzzy intrusion detection systems. In: 2004 International Conference on Information Technology: Coding and Computing, vol. 1, pp. 70–74 (2004)
10. Helmer, G., Wong, J., Honavar, V., Miller, L., Wang, Y.: Lightweight agents for intrusion detection. J. of Systems and Software 67, 109–122 (2003)
11. Lazarevic, A., Ertoz, L., Kumar, V., Ozgur, A., Srivastava, J.: A comparative study of anomaly detection schemes in network intrusion detection. In: SIAM International Conference on Data Mining (2003)
12. Skinner, K., Valdes, A.: Adaptive, Model-based Monitoring for Cyber Attack Detection. In: Debar, H., Mé, L., Wu, S.F. (eds.) RAID 2000. LNCS, vol. 1907, pp. 80–92. Springer, Heidelberg (2000)
13. Singhal, A., Jajodia, S.: Data warehousing and data mining techniques for intrusion detection systems. Int. J. on Information Security 1(2), 149–166 (2006)
14. Brugger, T.: Data Mining Methods for Network Intrusion Detection. PhD thesis, University of California Davis (2004)
15. Roesch, M.: SNORT: Lightweight intrusion detection for networks. In: 13th Systems Administration Conference, pp. 229–238 (1999)
16. Crothers, T.: Implementing Intrusion Detection Systems: A Hands-On Guide for Securing the Network. John Wiley & Sons, Chichester (2002)
17. Metasploit: Exploit research (2006), http://www.metasploit.org
18. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search. Adaptive Computation and Machine Learning, 2nd edn. MIT Press, Cambridge (2001)
19. Murphy, K.: An introduction to graphical models. Technical report, Intel Research, Intel Corporation (2001)
20. Castillo, E., Gutierrez, J.M., Hadi, A.S.: Expert Systems and Probabilistic Network Models. Springer, Heidelberg (1997)
21. Estevez-Tapiador, J., Garcia-Teodoro, P., Diaz-Verdejo, J.: Stochastic protocol modelling for anomaly based network intrusion detection. In: 1st IEEE International Workshop on Information Assurance, pp. 3–12 (2003)
22. Kruegel, C., Vigna, G.: Anomaly detection of web-based attacks. In: 10th ACM Conference on Computer and Communications Security, pp. 251–261 (2003)
23. Ghahramani, Z.: Learning Dynamic Bayesian Networks. In: Adaptive Processing of Sequences and Data Structures, International Summer School on Neural Networks, pp. 168–197. Springer, London (1998)
24. Snort: The de facto standard for intrusion detection and prevention, http://www.snort.org/
25. Lee, W., Stolfo, S., Chan, P., Eskin, E., Fan, W., Miller, M., Hershkop, S., Zhang, J.: Real time data mining-based intrusion detection. In: 2nd DARPA Information Survivability Conference and Exposition, vol. 1, pp. 89–100 (2001)

Discovering Multi-perspective Process Models: The Case of Loosely-Structured Processes

Francesco Folino (1), Gianluigi Greco (2), Antonella Guzzo (3), and Luigi Pontieri (1)

(1) ICAR-CNR, via P. Bucci 41C, I87036 Rende, Italy
    ffolino, [email protected]
(2) Dept. of Mathematics, UNICAL, Via P. Bucci 30B, I87036 Rende, Italy
    [email protected]
(3) DEIS, UNICAL, Via P. Bucci 41C, I87036 Rende, Italy
    [email protected]

Abstract. Process Mining techniques exploit the information stored in the execution log of a process to extract a high-level process model, useful for analysis or design tasks. Most of these techniques focus on "structural" aspects of the process, in that they only consider which elementary activities were executed and in which order. Hence, other "non-structural" data usually kept in real log systems (e.g., activity executors, parameter values) are disregarded, despite being a potential source of knowledge. In this paper, we overcome this limitation by proposing a novel approach to the discovery of process models, where the behavior of a process is characterized from both structural and non-structural viewpoints. Basically, we recognize different execution classes via a structural clustering approach and model them with a collection of specific workflows. Relevant correlations between these classes and non-structural properties are captured by a rule-based classification model, which can be used for both explanation and prediction. To increase the versatility of our approach, we also combine it with a pre-processing method, which allows the log events to be restructured according to different analysis perspectives and studied at the right abstraction level. Interestingly, such an approach reduces the risk of obtaining knotty, "spaghetti-like" process models when analyzing the logs of loosely-structured processes consisting of low-level operations performed in a more autonomous way than in traditional BPM platforms. Preliminary results on a real-life application scenario confirm the validity of the approach.
Keywords: Business process intelligence, Process mining, Decision trees.

1 Introduction
Process mining techniques have been receiving increasing attention in the Business Process Management (BPM) community because of their ability to characterize and analyze process behaviors, which turns out to be particularly useful for the (re-)design of complex systems. In fact, these techniques can extract a model of a process based on data gathered during its past enactments and stored in suitable logs by workflow (or similar) systems. In particular, traditional and consolidated approaches (see, e.g., [12] for a survey on this topic) focus on "structural" aspects of the process, i.e., they try to
single out mutual dependencies among process activities, in terms of precedence relationships and/or various routing constructs (e.g., parallelism, synchronization, exclusive choice, and loops). Only very recently have process mining techniques been proposed to deal with important "non-structural" aspects of the process, such as activity executors, parameter values, and performance data. For instance, decision trees were used in [5] to express staff assignment rules, which correlate agent profiles with the execution of process activities, while a decision point analysis was discussed in [9], where decision trees are used to determine, at each split point of the model, the activities that are most likely to be executed next. However, these techniques still rely on a few simplifying assumptions that limit their applicability in some application contexts. In particular, the various approaches share the basic idea of using non-structural data to characterize which activities are to be executed, while disregarding the specific coordination mechanisms of their enactment, i.e., how the flow of execution is influenced by certain data values. For example, in a sales order process where different sub-processes are enacted depending on whether the customer is a new one or instead has a fidelity card, current approaches will hardly discover that two different usage scenarios occur and that they can be discriminated by some specific data fields associated with the customer. On the other hand, traditional process mining approaches, originally tailored to the case of workflow management systems, need to be provided with a representation of log data where each single process instance gathers a set of events, each of which refers to a well-defined process activity. As a consequence, they are likely to yield knotty, "spaghetti-like" [11] process models when applied to the logs of a loosely-structured process, i.e., a process enacted in a more autonomous way, by composing low-level operations, with a poor business-oriented characterization. This limitation mainly resides in the inability of classical process mining techniques to view log data at a suitable abstraction level, as pointed out in [11] and [1]. The main aim of this paper is precisely to enhance current techniques with the capability of characterizing how activity executors, parameter values and performance data affect routing constructs and, more generally, how they determine the sub-processes to be executed. To this end, we investigate the mining of a multi-perspective process model, where structural and non-structural aspects are formally related to each other. In fact, the model consists of (i) a (structure-centric) behavioral schema, where the various sub-processes are explicitly and independently described; and (ii) a (data-dependent) classification model, assessing when the sub-processes have to be enacted, in terms of decision rules over process attributes. Technically, the discovery of a behavioral schema is carried out by resorting to the clustering approach presented in [4], where log instances are partitioned according to the sequence of activities performed, and each cluster is equipped with a distinct workflow schema. A data-dependent classification model is then induced with a rule-based classifier, by regarding the clusters as different execution classes.
In order to enable a flexible and effective application of the above technique to the logs of loosely-structured processes, we also adopt a pre-processing method for restructuring them in a workflow-oriented way, according to different analysis perspectives. This method, similar to the one we proposed in [1] for the analysis of collaboration
work, increases the versatility of our approach and allows a suitable abstraction level to be achieved in the analysis of log data, based on any domain-oriented information available. The whole approach has been implemented as a plug-in for the ProM framework [13] and tested on a challenging real-life application scenario concerning the trading of containers in an Italian harbor. Experimental results show that accurate models can be found with our approach, which provided useful hints for the tuning of logistic operations.
Organization. The remainder of the paper is organized as follows. After introducing some basic notation and concepts in Section 2, we illustrate our approach in Section 3, describing both the pre-processing restructuring method and the algorithm for the computation of a multi-perspective process model. The implementation of the approach within ProM is discussed in Section 4. Then, the real-life case study and the results of the experimental activity are discussed in Section 5. Finally, a few concluding remarks are reported in Section 6.

2 Formal Framework
Process logs contain a wide range of information about process executions. Following a standard approach in the literature, we next introduce a simple representation of process logs, where each trace storing a single enactment of the process (named process instance, or case) is just viewed as a sequence of task identifiers, while additional data are represented via attributes associated with the traces. Let T be a set of task labels, and A = {A_1, ..., A_n} be a set of (names of) process attributes, taking values from the domains D_1, ..., D_n, respectively. Then, a trace (over T and A) is a sequence s = [s_1, ..., s_k] of task labels in T, associated with a tuple data(s) of values from D_1, ..., D_n, i.e., data(s) ∈ D_1 × ... × D_n. A process log (or simply log) over T and A is a set of traces over T and A. By inspecting and analyzing a process log, we can extract a high-level representation of the (possibly unknown) underlying process, named multi-perspective process model. This kind of model, which is indeed the ultimate outcome our approach aims to deliver, is defined below.
Definition 1. Given a workflow log L, a multi-perspective process model for L is a quadruple M = ⟨C^L, W^T, λ, δ^A⟩ such that:
– C^L = {C_1^L, ..., C_q^L} is a partition of L, i.e., ⋂_{i=1..q} C_i^L = ∅ and ⋃_{i=1..q} C_i^L = L;
– W^T = {W_1^T, ..., W_q^T} is a set of workflow schemas, one for each cluster in C^L, where all tasks are named with labels from T;
– λ is a bijective function mapping the clusters in C^L to their associated workflow schemas in W^T, i.e., λ : C^L → W^T, where λ(C_i^L) is the schema modelling C_i^L;
– δ^A : D_1 × ... × D_n → C^L is a q-ary classification function, which discriminates among the clusters in C^L based on the values of A's attributes.
In words, W^T is a modular description of the behavior registered in the log, as far as structural aspects only are concerned, where each execution cluster is represented with a
separate workflow schema, describing the typical way tasks are executed in that cluster. Conversely, δ^A is a classification function that correlates these structural clusters with (non-structural) trace attributes, by mapping any n-tuple of values for the attributes A_1, ..., A_n into a single cluster of C^L. For notational convenience, in the rest of the paper the above model will be seen as two sub-models, which represent the process in different and yet complementary ways: a structural model ⟨C^L, W^T, λ⟩, which focuses on the different execution flows across the process tasks, and a data-aware classification model, which is essentially encoded in the function δ^A in the form of a decision tree [2]. In fact, while being strictly connected to each other, both models are relevant in themselves, for they take care of separate aspects of the process.
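To fix this notation in code form (a sketch of the data structures only; all names below are ours and purely illustrative), a trace couples an activity sequence with an attribute tuple, and the mined model couples a clustering, per-cluster workflow schemas, and the classification function δ:

from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Trace:
    tasks: List[str]                 # the sequence s = [s_1, ..., s_k]
    data: Dict[str, Any]             # the attribute tuple data(s)

@dataclass
class MultiPerspectiveModel:
    clusters: Dict[str, List[Trace]]                 # the partition C^L
    schemas: Dict[str, Any]                          # lambda: one workflow schema per cluster
    classifier: Callable[[Dict[str, Any]], str]      # delta^A over the trace attributes

    def assign(self, trace: Trace) -> str:
        """Predict the execution class of a (possibly unseen) trace from its data attributes."""
        return self.classifier(trace.data)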

3 A Method for the Discovery of Multi-perspective Process Models
In this section, we illustrate a technique for extracting a multi-perspective process model from a given process log. We then describe a pre-processing method which makes it possible to flexibly apply the technique to the analysis of the logs of loosely-structured processes.
3.1 Process Mining Technique
As discussed in the introduction, the problem of discovering process models based on structural information has been widely investigated in the literature. Figure 1 shows an algorithm that addresses instead the more general problem of discovering a high-level process representation, by computing a multi-perspective process model for a given log L. To this aim, we resort to the clustering scheme proposed in [4], sketched through the function structuralClustering (also shown in Figure 1), which iteratively partitions the traces as long as the number of mined clusters (or, equivalently, the number of workflow schemas) does not exceed the parameter p of the algorithm (Step F3). In more detail, at each step a cluster is chosen to be refined. To this end, a special kind of sequential patterns is extracted to model ways of executing the activities that frequently occur in the cluster and yet are unexpected w.r.t. its associated schema (Step F5). The traces are then partitioned via the well-known k-means algorithm [6] (Step F7), after projecting them into the vectorial space associated with these patterns (Step F6). The structural model is updated at each iteration by replacing the cluster split in Step F4 with all the new clusters derived from it, each of which is equipped with a specific workflow model (Steps F9-F10), with the help of the function mineWF, which could be implemented with any classical control-flow mining algorithm, such as those discussed in [12]. Afterwards, the algorithm takes advantage of all available data attributes by invoking the function induceClassifier, which derives a classification function (see Step 2) expressing the mapping from data attributes to the clusters discovered before. Note that, by using the clusters produced by structuralClustering as class labels for all their associated traces, we may learn a classification model over the attributes in A. To this end, we resort to decision trees, since they can easily be understood and do not require any prior assumptions on the data distribution; furthermore, many algorithms exist that compute them efficiently, while coping well with noise, overfitting, and missing values.


Input: A log L over tasks T and attributes A.
Output: A multi-perspective process model for L.
Method: Perform the following steps:
1   C^L := structuralClustering(L);
2   δ^A := induceClassifier(C^L);
3   C^L := classifyTraces(L, δ);
4   for each C_i ∈ C^L do
5       W_i' := WFDiscovery(C_i);
6       W^T := W^T − {λ(C_i)} ∪ {W_i'};
7       λ(C_i) := {W_i'};
8   end
9   return ⟨C^L, W^T, λ, δ^A⟩

Function structuralClustering(L: log): a structural model;
F1   W_0 := mineWF(L);
F2   C := {L}; W := {W_0};
F3   while |W| ≤ p do
F4       let C* be the most numerous cluster in C and W* = λ(C*) be its associated schema;
F5       FS := mineStructuralFeatures(C*, W*);
F6       VS := extractFeatureVectors(C*, FS);
F7       {C_1*, ..., C_k*} := kMeans(VS, k);
F8       for i = 1..k do λ(C_i*) := mineWF(C_i*);
F9       W := W − {λ(C*)} ∪ {λ(C_i*) | i = 1..k};
F10      C := C − {C*} ∪ {C_i* | i = 1..k};
F11  end while
F12  return ⟨C, W, λ⟩;

Fig. 1. Algorithm ProcessDiscovery
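The actual implementation relies on the DWS clustering plug-in, the Heuristics Miner and Weka's J48 (see Section 4). Purely as an illustration of Steps 1-8, and under the assumption of numeric trace attributes, the skeleton below uses scikit-learn's DecisionTreeClassifier for the classifier-induction step and leaves the structural clustering and the control-flow mining as abstract callables:

from sklearn.tree import DecisionTreeClassifier

def process_discovery(traces, attributes, structural_clustering, mine_workflow):
    """traces: list of (task_sequence, attr_dict) pairs; the two mining callables are
    stand-ins for the DWS clustering scheme and for a control-flow mining algorithm."""
    # Step 1: structure-based clustering of the log.
    clusters = structural_clustering(traces)          # dict: cluster_id -> list of traces

    # Step 2: induce the classification function delta from the trace attributes.
    X, y = [], []
    for cid, members in clusters.items():
        for _, attrs in members:
            X.append([attrs[a] for a in attributes])   # numeric attributes assumed
            y.append(cid)
    delta = DecisionTreeClassifier().fit(X, y)

    # Step 3: reassign every trace to a cluster using delta predictively.
    reassigned = {cid: [] for cid in clusters}
    for seq, attrs in traces:
        cid = delta.predict([[attrs[a] for a in attributes]])[0]
        reassigned[cid].append((seq, attrs))

    # Steps 4-8: re-mine one workflow schema per (refined) cluster.
    schemas = {cid: mine_workflow(members) for cid, members in reassigned.items() if members}
    return reassigned, schemas, delta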

The rest of the ProcessDiscovery algorithm is meant to refine the structural model computed in the first phase, making it more consistent with the classification function δ just discovered. Indeed, all log traces are first remapped to the clusters, in Step 3, by using the model in a predictive way. A new workflow model is then found for each cluster, in Steps 4-8, which should model more precisely the structure of the traces in that cluster. As a final result, the algorithm returns a multi-perspective process model that integrates the different kinds of knowledge discovered.
3.2 Log Restructuring
As mentioned above, process mining approaches rely on quite a rigid, workflow-oriented representation of process logs. In particular, each atomic event must register the execution of a well-defined process activity for a well-identified process instance. This information may be missing for a loosely-structured process that arises spontaneously, as a composition of elementary operations, without the use of a proper workflow model. Regarding these operations as high-level process activities may yield very
intricate ("spaghetti-like" [11]) process models that are of little usefulness for analysis purposes. In order to increase the versatility of process mining techniques and to improve their effectiveness in such a loosely-structured scenario, we adopt a flexible method for restructuring the original event logs in a workflow-oriented way. We assume that, in general, each original event can encode the execution time, the basic operation performed, the tool used for performing it, the executor and a series of data parameters. Moreover, detailed, domain-specific information can be available for each of the involved entities. The approach is meant to produce a workflow-oriented log, like the one sketched in Section 2, and consists of three distinct logical steps: (a) select a subset of events, (b) arrange the events into process instances, and (c) map the events to workflow activities. In step (a), the analyst can choose a subset of log events based on their properties (e.g., the execution date or the kind of event), as well as on the properties of their associated entities (e.g., tools or actors). In step (b), the previously selected events are arranged into a number of sequences, each of which is regarded as a distinct trace (i.e., process instance). Indeed, we admit that such information may be unavailable for a loosely-structured process and, in any case, we want to leave the analyst free to try different ways of reorganizing the basic log events into separate flows of work. The last step specifies the mapping from basic events to the high-level activities that will appear in the discovered workflow models. In this phase the analyst can determine which abstraction/aggregation level is to be used for analyzing the process. In the most detailed case, all of the information associated with each event is mapped to a single activity, i.e., the activity corresponds to the execution of a certain event, by a certain actor, through a certain tool, and so on. Since such an approach may well yield rather cumbersome models, a less detailed description can be chosen to model the behavior of the process in a more concise and meaningful way. To this aim, the analyst can focus on some facets of the events (e.g., the tool employed), and even exploit the availability of concept taxonomies to represent them in a more abstract way.
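A schematic implementation of these three steps (our own sketch; the event fields, the case key and the abstraction function are placeholders that the analyst would choose) might look as follows:

def restructure(events, select, case_key, activity_of):
    """events: list of dicts with fields such as 'time', 'operation', 'tool', 'actor', ...
    select: predicate for step (a); case_key: function giving the process-instance id
    for step (b); activity_of: abstraction function for step (c)."""
    cases = {}
    for ev in sorted((e for e in events if select(e)), key=lambda e: e["time"]):
        cases.setdefault(case_key(ev), []).append(activity_of(ev))
    return cases

# Toy usage, loosely inspired by the case study of Section 5: one case per container,
# with events abstracted to the kind of move performed.
events = [
    {"time": 1, "container": "C1", "operation": "MOV", "tool": "straddle-carrier"},
    {"time": 2, "container": "C1", "operation": "DRB", "tool": "multi-trailer"},
    {"time": 3, "container": "C2", "operation": "MOV", "tool": "straddle-carrier"},
]
traces = restructure(events,
                     select=lambda e: True,
                     case_key=lambda e: e["container"],
                     activity_of=lambda e: e["operation"])
print(traces)   # {'C1': ['MOV', 'DRB'], 'C2': ['MOV']}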

4 Implementation
The above approach has been implemented in a prototype system, integrated as a plug-in within the ProM framework [13], a powerful platform for the analysis of process logs that is quite popular in the Process Mining community. The logical architecture of the system is shown in Figure 2, where solid arrow lines stand for data exchange. Note that the whole mining process is driven by the Process Mining Handler module, while the other modules roughly replicate the computation scheme of Figure 1. By Log Repository we denote a collection of existing process logs, represented in MXML [14], the format actually used in the ProM framework [13]. The Structural Mining module, together with the ones connected to it, is responsible for the construction of structural models. In particular, the discovery of the different trace clusters is carried out by the Structural Clustering module, which exploits some functionalities of the DWS plug-in implementing the clustering scheme in [4].


Fig. 2. System architecture

The module WFDiscovery is used, instead, to derive a workflow schema for each discovered cluster, by exploiting the ProM implementation of the Heuristic miner algorithm [15]. Discovered clusters and schemas are stored in the Trace Clusters repository and in the Workflow Repository, respectively, for inspection and further analysis. The Training Set Builder module mainly labels each trace with the name of the cluster it has been assigned to, as registered in the Trace Clusters repository, in order to provide the Classifier induction module with a training set for learning a classification model. The module Classifier applies such a model to reassign the original log traces to the clusters. Both these latter modules, coordinated by the module Classification Mining, have been implemented by using the Weka library [3] and, in particular, the J48 implementation of the well-known algorithm C4.5 [8]. In this regard, we notice that the Training Set Builder also acts as a “translator” by encoding the labelled log traces into a tabular form, according to the ARFF format used in Weka. Incidentally, this format requires that proper types (such as nominal, numeric, string) are specified for all attributes. Since this information is missing from the original (MXML) log files, a semi-automatic procedure is used to assign such a type to each attribute. Two modules help the user in evaluating the quality of the discovered models: the Classifier Evaluator, which computes standard performance indexes for the classification models, and the Conformance Evaluator module, which measures to what extent the workflow models represent the behaviors actually registered in the log traces they have been derived from. To this end, some metrics have been implemented, which will be briefly described in Section 5.2. Figure 3 shows two screenshots of the plug-in, which concern the input panel, which allows the user to launch the mining algorithm and to set its parameters (Figure 3(a)), and the visualization of results (Figure 3(b)), respectively. For the latter, note, in particular, the decision tree (left) and one of the workflow schemas (right) found by the tool.
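The system itself relies on the DWS clustering plug-in, the Heuristic miner and Weka's J48 inside ProM. Purely as a functional analogy, and not as the actual implementation, the following Python sketch reproduces the classification step with scikit-learn: traces already labelled with their structural cluster are encoded as attribute vectors and a decision tree is induced, which can then reassign traces to clusters. The attribute values below are invented for illustration.

    # Rough analogy of the classification step: traces labelled with their structural
    # cluster are encoded as attribute vectors and a decision tree is induced.
    # scikit-learn stands in for Weka/J48 here purely for illustration.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.feature_extraction import DictVectorizer

    labelled_traces = [
        ({"PrevHarbor": "portA", "ContHeight": 8.5, "ShipTypeIN": "feeder"}, "cluster0"),
        ({"PrevHarbor": "portB", "ContHeight": 9.5, "ShipTypeIN": "mother"}, "cluster1"),
        ({"PrevHarbor": "portA", "ContHeight": 8.5, "ShipTypeIN": "feeder"}, "cluster0"),
    ]

    vec = DictVectorizer(sparse=False)          # one-hot encodes the categorical attributes
    X = vec.fit_transform([attrs for attrs, _ in labelled_traces])
    y = [label for _, label in labelled_traces]

    clf = DecisionTreeClassifier().fit(X, y)
    # The induced tree is then used to reassign every original trace to a cluster.
    print(clf.predict(vec.transform([{"PrevHarbor": "portA", "ContHeight": 8.5,
                                      "ShipTypeIN": "feeder"}])))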

Fig. 3. Screenshots of the plug-in within ProM: (a) Parameter setting; (b) Result view

5 Case Study

This section is devoted to discussing the application of the proposed approach within a real-life scenario, concerning an Italian maritime container terminal, as a way to give some evidence for the validity of our proposal, as well as to better explain its behavior through a practical example.

5.1 Application Scenario

A series of logistic activities are registered for each of the containers that pass through the harbor, which actually amount to nearly 4 million per year. Massive volumes of data are hence generated continually, which can profitably be exploited to analyze and improve the enactment of logistics processes. For the sake of simplicity, in the remainder of the paper we only consider the handling of containers which both arrive and depart by sea, and focus on the different kinds of moves they undergo over the “yard”, i.e., the main area used in the harbor for storage purposes. This area is logically partitioned into a finite number of bi-dimensional slots, which are the units of storage space used for containers, and are organized in a fixed number of sectors. The life cycle of any container can be roughly summarized as follows. The container is unloaded from the ship and temporarily placed near the dock, until it is carried to some suitable yard slot to be stocked. Symmetrically, at boarding time, the container is first placed in a yard area close to the dock, and then loaded onto the cargo ship. Different kinds of vehicles can be used for moving a container, including, e.g., cranes, straddle-carriers (a vehicle capable of picking up and carrying a container, possibly lifting it), and multi-trailers (a sort of train-like vehicle that can transport many containers). This basic life cycle may be extended with additional transfers, classified as “housekeeping”, which are meant to move the container closer to its final embarkation point or to leave room for other containers. More precisely, the following basic operations are registered for any container c:
– MOV, when c is moved from one yard position to another by a straddle carrier;
– DRB, when c is moved from one yard position to another by a multi-trailer;
– DRG, when a multi-trailer moves to get c;


– LOAD, when c is loaded onto a multi-trailer;
– DIS, when c is discharged off a multi-trailer;
– SHF, when c is moved upward or downward, possibly to switch its position with another container;
– OUT, when a dock crane embarks c on a ship.

5.2 Evaluation Setting

In order to provide a quantitative evaluation of the discovered models, two aspects have been measured in the experiments: (i) the predictive power of the classification model in discriminating the structural clusters found, and (ii) the level of conformance of each workflow schema w.r.t. the clusters that it models. As to point (i), a number of well-known measures are available in the literature to evaluate a classification model. Here, we simply resort to the Accuracy measure, which roughly expresses the percentage of correct predictions that the classifier would make over all possible traces of the process. In particular, we estimate this measure via the popular cross-validation method (with ten folds) [7]. In addition, standard Precision and Recall measures will be used to provide a “local” evaluation of the classifier w.r.t. each single cluster; and, for each cluster, the F measure will be reported as well, which is defined as Fc = ((β² + 1) · Pc · Rc) / (Pc + β² · Rc); note that for β = 1, it coincides with the harmonic mean of the precision and recall values. As to point (ii), the conformance of a workflow model W w.r.t. a log L is evaluated via the following metrics, all defined in [10] and ranging over the real interval [0,1]:
– Fitness, which essentially evaluates the ability of W to parse all the traces in L, by indicating how much the events in L comply with W.
– Advanced Behavioral Appropriateness (denoted by BehAppr, for short), which estimates how much of the flexibility allowed in W (i.e., alternative/parallel behavior) is really used to produce L.
– Advanced Structural Appropriateness (or StrAppr, for short), which assesses the capability of W to describe L in a maximally concise way.
These measures have been defined for a single workflow schema and do not apply directly to the multi-schema structural model discovered by our approach. In order to show a single overall score for such a model, we simply average the values computed by each of these measures against all of its workflow schemas. More precisely, the conformance values of these schemas are added up in a weighted way, where the weight of each schema is the fraction of original log traces that constitute the cluster it was mined from.

5.3 Experimental Results

We next discuss three tests, which were carried out on these data according to different analysis perspectives:
– Setting A (“operation-centric”), where we focus on the sequence of basic logistic operations applied to the containers. In more detail, for each container a distinct log trace is built that records the sequence of basic operations (i.e., MOV, DRB, DRG, LOAD, DIS, SHF, OUT) it underwent.


– Setting B (“position-centric”), where we focus on the flow of containers across the yard. Specifically, the original data are transformed into a set of log traces, each of which encodes the sequence of yard sectors occupied by a single container during its stay.
To this purpose, we exploited the method described in Section 3.2, which allowed us to restructure the original data into two different workflow-oriented logs. In both cases, we first selected only log data regarding the containers that completed their entire life cycle in the hub during the first two months of 2006, and which were exchanged with four given ports around the Mediterranean Sea; this yielded about 50 MB of log data about 5336 containers. We then regarded the transit of any container through the hub as a single instance of the (unknown) logistic process, for which a model is to be discovered. A different choice was made, in the two settings, as concerns the mapping of log data to process activities (cf. step (c) of the method). Indeed, in the first case the activities were defined to simply coincide with the low-level move operations. Conversely, in the second case we defined them to coincide with the yard position reached by the container at each single move. Notice that we chose to represent such positions at a lower level of detail (i.e., yard sectors) than that registered in the transactional logs of the harbor (i.e., yard slots), using an aggregation hierarchy defined for them. Notice also that two dummy activities (denoted by START and END, respectively) were introduced in both settings to univocally mark the beginning and the end of each log trace. Further, various data attributes were considered for each container (i.e., for each process instance), including, e.g., its origin and final destination ports, its previous and next calls, diverse characteristics of the ship that unloaded it, its physical features (e.g., size, weight), and a series of categorical attributes concerning its contents (e.g., the presence of dangerous or perishable goods).
Table 1 summarizes a few relevant figures for these tests, which concern both the clustering structure and the two models associated with it. More specifically, for each test, we report the number of clusters found and the different conformance measures for the structural model (i.e., the set of workflow schemas), as well as the accuracy and size of the decision tree. Moreover, to give some intuition on the semantic value of this latter model, we report four of the most discriminant attributes, actually appearing in its top levels (decision trees are not shown in detail for both space and privacy reasons: many attributes express, indeed, sensitive information about the hub company and its partners). In general, we note that surprisingly high effectiveness results have been achieved, as concerns both the structural model and the classification model. Notably, such precision in modelling both structural and non-structural aspects of the logged events does not come with a verbose (and possibly overfitting) representation. Indeed, for all the tests, the number of clusters and the size of the tree are quite restrained, while the workflow models collectively attain a maximal score with the StrAppr metric.

Table 1. Summary of results obtained against real log data
(Structural model: Fitness, BehAppr, StrAppr; Data-aware classification model: Accuracy, Tree Size, Top-4 Discriminant Attributes)

Setting  Clusters  Fitness  BehAppr  StrAppr  Accuracy  Tree Size  Top-4 Discriminant Attributes
A        2         0.8725   0.9024   1        96.01%    69         PrevHarbor, NavLine OUT, ContHeight, ShipType IN
B        5         0.8558   0.9140   1        91.64%    105        ShipSize IN, NavLine OUT, PrevHarbor, ContHeight
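To make the weighted aggregation of per-schema conformance scores (Section 5.2) concrete, here is a small sketch under stated assumptions: the cluster sizes are those later reported for Setting A in Table 2, while the per-cluster fitness values are made-up placeholders, since the paper reports only the aggregated scores.

    # Weighted aggregation of per-schema conformance, using the Setting A cluster
    # sizes from Table 2; the per-cluster fitness values are hypothetical placeholders.
    sizes = {"cluster0": 4736, "cluster1": 600}
    fitness = {"cluster0": 0.90, "cluster1": 0.80}   # hypothetical values
    total = sum(sizes.values())
    overall = sum(sizes[c] / total * fitness[c] for c in sizes)
    print(round(overall, 4))   # weighted overall fitness, ~0.8888 for these placeholders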


Fig. 4. Setting A - the two workflow schemas discovered: (a) Cluster 0; (b) Cluster 1

In order to give more insight into the behavior of the approach, we next illustrate some detailed results for both analysis settings.
Setting A. In this case, the approach has discovered two distinct usage scenarios, whose structural aspects are described by the workflow schemas in Figure 4. These schemas differ substantially in the presence of operations performed with multi-trailer vehicles: the schema of Figure 4(a) does not feature any of these operations, which are instead contained in the other schema. Notably, the former schema captures the vast majority of handling cases (4736 containers of the original 5336). This reflects a major aim of yard allocation strategies: to keep each container as near as possible to its positions of disembarkation/embarkation, by performing short transfers via straddle-carriers. Interestingly enough, these two markedly different structural models appear to strongly depend (an astonishing 96% accuracy score is achieved, cf. Table 1) on some features that go beyond the mere occurrence of yard operations. Among these features, the following container attributes stand out: the provenance port of a container (PrevHarbor), the navigation line that is going to take it away (NavLine OUT), the height of a container (ContHeight), and the kind of ship that delivered it to the hub (ShipType IN).

Fig. 5. Setting B - two of the workflow schemas found: (a) Cluster 2; (b) Cluster 3

A finer-grained analysis, performed with the help of Table 2 (where individual precision/recall measures for the two clusters are shown), confirms that the model guarantees a high rate of correct predictions for either cluster.
Setting B. The “position-centric” approach is aimed at analyzing the different ways of moving the containers around the yard, while comparing the usage of different storage areas. In principle, due to the high number of sectors and moving patterns that come into play in such an analysis perspective, any flat representation of container flows, consisting of just a single workflow schema, risks being either inaccurate or difficult to interpret. Conversely, by separating different behavioral classes, our approach ensures a modular representation, which can better support explorative analyses. In fact, the five clusters found in this test have been equipped with clear and compact workflow schemas, which still guarantee a high level of conformance (see Table 1). As an example, in Figure 5, we report two of these schemas, which differ both in the usage of sectors and in some of the paths followed by the containers across these sectors. Good quality results are achieved again both for the structural model and for the decision tree. In actual fact, by comparing these results with those obtained in the other tests, we notice a slight decrease in accuracy and a larger tree size, mainly due to the higher level of complexity that distinguishes the position-centric analysis from the operation-centric one. Incidentally, Table 3 reveals that this worsening is mainly due to the inability of the decision tree to recognize well the second cluster, which is, in fact, slightly confused with the third one (further details are omitted here for lack of space).

Table 2. Setting A - details on the discovered clusters (sizes and classification metrics)

Cluster  Size  P       R       F (β = 1)
0        4736  97.28%  98.25%  97.76%
1        600   84.99%  78.33%  81.53%
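As a quick consistency check of the figures in Table 2, the F measure defined in Section 5.2 with β = 1 reduces to the harmonic mean of precision and recall; the short computation below reproduces the reported values up to rounding.

    # F measure with beta = 1 (harmonic mean of precision and recall),
    # checked against the two Setting A clusters in Table 2.
    def f_measure(p, r, beta=1.0):
        return (beta**2 + 1) * p * r / (p + beta**2 * r)

    print(round(f_measure(0.9728, 0.9825), 4))  # ~0.9776, cluster 0 (97.76% in Table 2)
    print(round(f_measure(0.8499, 0.7833), 4))  # ~0.8152, cluster 1 (81.53% in Table 2, up to rounding of P and R)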

Table 3. Setting B - details on the discovered clusters (sizes and classification metrics)

Cluster  Size  P       R       F (β = 1)
0        3664  93.76%  98.39%  96.02%
1        188   66.67%  53.19%  59.17%
2        346   90.81%  74.28%  81.72%
3        1070  87.25%  80.56%  83.77%
4        68    94.29%  97.06%  95.65%

Almost the same attributes as in Setting A have been employed to discriminate the clusters, except for the usage of ShipSize IN (i.e., the size category of the ship that delivered the container) in place of ShipType IN.

6 Conclusions

In this paper we have proposed a novel approach to the discovery of multi-perspective process models, where structural and non-structural aspects are represented in an integrated way. In a nutshell, a number of homogeneous execution clusters are detected and described with separate workflow models, and then correlated with non-structural data (e.g., activity executors and parameter values) by a classification model. In order to increase the versatility of the approach, a preprocessing method has been introduced to restructure historical log data according to different analysis perspectives and abstraction levels. This feature can be particularly useful for the analysis of loosely-structured processes, which are not ruled by a well-specified execution model and may even lack a clear definition of process tasks. The whole approach has been implemented as a plug-in of the ProM framework, and validated on a real test case. Experimental results show that the discovered classification models give valuable help for interpreting and discriminating different ways of executing the process, by making explicit the link between these variants and other process properties. As future work, we will investigate the extension of the approach with outlier detection techniques, in order to provide it with the capability of spotting anomalous executions, which may mislead the learning of process models and could be analyzed separately. Moreover, we are exploring the integration of multi-perspective models into an existing process management platform, as a means for providing prediction and simulation features, supporting both the design of new processes and the enactment of future cases.

References

1. Basta, S., Folino, F., Gualtieri, A., Mastratisi, M.A., Pontieri, L.: A Knowledge-Based Framework for Supporting and Analysing Loosely Structured Collaborative Processes. In: Atzeni, P., Caplinskas, A., Jaakkola, H. (eds.) ADBIS 2008. LNCS, vol. 5207, pp. 140–153. Springer, Heidelberg (2008)
2. Buntine, W.: Learning Classification Trees. Statistics and Computation 2, 63–73 (1992)
3. Frank, E., Hall, M.A., Holmes, G., Kirkby, R., Pfahringer, B.: WEKA - A Machine Learning Workbench for Data Mining. In: The Data Mining and Knowledge Discovery Handbook, pp. 1305–1314. Springer, Heidelberg (2005)


4. Greco, G., Guzzo, A., Pontieri, L., Saccà, D.: Discovering Expressive Process Models by Clustering Log Traces. IEEE Trans. on Knowledge and Data Engineering 18(8), 1010–1027 (2006)
5. Ly, L.T., Rinderle, S., Dadam, P., Reichert, M.: Mining Staff Assignment Rules from Event-Based Data. In: Bussler, C.J., Haller, A. (eds.) BPM 2005. LNCS, vol. 3812, pp. 177–190. Springer, Heidelberg (2006)
6. MacQueen, J.B.: Some Methods for Classification and Analysis of Multivariate Observations. In: 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297. University of California Press (1967)
7. Mitchell, T.: Machine Learning. McGraw-Hill, New York (1997)
8. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
9. Rozinat, A., van der Aalst, W.M.P.: Decision Mining in ProM. In: Dustdar, S., Fiadeiro, J.L., Sheth, A.P. (eds.) BPM 2006. LNCS, vol. 4102, pp. 420–425. Springer, Heidelberg (2006)
10. Rozinat, A., van der Aalst, W.M.P.: Conformance Checking of Processes Based on Monitoring Real Behavior. Information Systems 33(1), 64–95 (2008)
11. van der Aalst, W.M.P., Günther, C.W.: Finding Structure in Unstructured Processes: The Case for Process Mining. In: 7th International Conference on Application of Concurrency to System Design (ACSD 2007), pp. 3–12. IEEE Computer Society Press, Los Alamitos (2007)
12. van der Aalst, W.M.P., van Dongen, B.F., Herbst, J., Maruster, L., Schimm, G., Weijters, A.J.M.M.: Workflow mining: A survey of issues and approaches. Data & Knowledge Engineering 47(2), 237–267 (2003)
13. van Dongen, B.F., de Medeiros, A.K.A., Verbeek, H.M.W., Weijters, A.J.M.M., van der Aalst, W.M.P.: The ProM Framework: A New Era in Process Mining Tool Support. In: Ciardo, G., Darondeau, P. (eds.) ICATPN 2005. LNCS, vol. 3536, pp. 444–454. Springer, Heidelberg (2005)
14. van Dongen, B.F., van der Aalst, W.M.P.: A Meta Model for Process Mining Data. In: EMOI-INTEROP, pp. 309–320. CEUR-WS.org (2005)
15. Weijters, A.J.M.M., van der Aalst, W.M.P.: Rediscovering Workflow Models from Event-Based Data using Little Thumb. Integrated Computer-Aided Engineering 10(2), 151–162 (2003)

Tackling the Debugging Challenge of Rule Based Systems Valentin Zacharias FZI Research Center for Information Technology, Karlsruhe 76131, Germany [email protected] http://www.fzi.de

Abstract. Rule based systems are often presented as a simpler and more natural way to build computer systems - compared to both imperative programming and other logic formalisms. However, with respect to finding and preventing faults this promise of simplicity remains elusive. Based on the experience from three rule base creation projects and a survey of developers, this paper identifies reasons for this gap between promise and reality and proposes steps that can be taken to close it. An architecture for a complete support of rule base development is presented. Keywords: Rule based systems, F-logic, Knowledge acquisition, Rule based modeling, Knowledge modeling, Verification, Anomaly detection.

1 Introduction

Rule based systems are often presented as a simpler and more natural way to build computer systems, compared to both imperative programming and other logic formalisms such as full first order logic or description logics. It seems natural to assume that this simplicity also applies to finding and preventing faults, and to software quality assurance in general. However, with respect to the debugging of rule bases this promise of simplicity remains elusive. Indeed, a survey [1] found that most developers of rule based systems think that the debugging of rule based systems is more difficult than the debugging of ’conventional’ (object oriented and procedural) programs, an observation that the authors also found corroborated in three rule base development projects they participated in.

1.1 The Debugging Challenge

The assumption that rule based systems are simpler than ’conventional’ programs rests on three observations:
– Rule languages are (mostly) declarative languages that free the developer from worrying about how something is computed. This should decrease the complexity for the developer.
– The If-Then structure of rules resembles the way humans naturally communicate a large part of knowledge.
– The basic structure of a rule is very simple and easy to understand.


To see whether this simplicity assumption holds in practice, a survey of developers of rule based systems [1] included a question that asked participants to compare rule base development to the development of procedural and object oriented programs. The results showed that these developers indeed see the advantages in ease of change, reuse and understandability that were expected from rule based (and declarative) knowledge and program specification. However, at the same time a majority of rule base developers see rule base development as inferior with respect to debugging; participants of the survey saw debugging of rule based systems as actually more difficult than the debugging of conventional software and, in a different question, ranked it as the issue most hindering the development of rule based systems. The authors found this problem of fault detection corroborated in three different projects they participated in.

1.2 Overview

This paper is an attempt to reconcile the promise of simplicity of rule based systems with reality. It identifies why rule bases are not currently simple to build and proposes steps to address this. The paper’s structure is as follows: Section 2 gives a short overview of the three projects that form the input for this analysis. Section 3 discusses what kinds of problems were encountered. The next two sections then discuss overall principles to better support the creation of rule bases and an architecture of tool support for the development process.

2 Experiences

In addition to the survey in [1] and literature research, the analysis presented in this paper is based on the experience from three projects.
– Project Halo (http://www.projecthalo.com) is a multistage project to develop systems and methods that enable domain experts to model scientific knowledge. As part of the second phase of Project Halo, six domain experts were employed for six weeks each to create rule bases representing domain knowledge in physics, chemistry and biology.
– Online Forecast was a project to explore the potential of knowledge based systems with respect to maintainability and understandability. Towards this end an existing reporting application in a human resources department was re-created as a rule based system. This project was performed by a junior developer who had little prior experience with rule based systems. Approximately 5 person months went into this application.
– The goal of the project F-verify was to create a rule base as support for verification activities. It models (mostly heuristic) knowledge about anomalies in rule bases. It consists of anomaly detection rules that work on a reified version of a rule base. This project was done by a developer with experience in creating rule bases and took 3 months to complete.


All of these projects used F-logic [2] as the representation language and the Ontobroker inference engine [3]. The editing tools differed between the projects: Project Halo used the high-level, graphical tools developed in the project. Online Forecast used a simple version of Ontoprise’s OntoStudio (http://www.ontoprise.de). F-verify used OntoStudio together with prototypical debugging and editing tools. The use of only one rule formalism obviously restricts the generality of any statements that can be made. However, F-logic is very similar to some of the rule languages under discussion for the Semantic Web [4] and it is based on normal logic programs, probably the prototypical rule language. The authors examined literature and tools to ensure that tool support identified as missing is indeed missing from rule based systems in general and not only from systems supporting F-logic. All of these projects have been created using relatively lightweight methods that focus more on the actual implementation of rules than on high level models - as is to be expected for such relatively small projects, in particular when performed by domain experts or junior programmers. The analysis, methods and tools presented in this paper are geared towards this kind of small, domain expert/end user programmer driven projects.

3 Analysis

In the three projects sketched above the authors found that the creation of working rule bases was an error-prone and tiresome process. In all cases the developers complained that creating correct rule bases takes too long and in particular that debugging takes up too much time. Developers with prior experience in imperative programming languages said that developing rule bases was more difficult than developing imperative applications. While part of these observations can be explained by longer training with imperative programming, this still stands in marked contrast to the often repeated assertion that creating rule bases is simpler. We identified six reasons for this observed discrepancy.
– The One Rule Fallacy and the Problem of Terminology. Declarative statements such as rules are often assumed to be more easily reusable and more modular, because they are true in broader contexts than those in which program statements can be used [5]. Based on these sentiments the one rule fallacy goes as follows: one rule is simple to create, and the combination of declarative rules is handled automatically by the inference engine, hence entire rule bases are simple to create. However, the inference engine obviously combines the rules only based on how the user has specified them; and it is here - in the creation of rules in a way that they can work together - that most errors get made. In particular, for the rules in a rule set to work together, all rules need to be consistent in the domain formalization and the use of terminology. For example, it must not happen:
– that one rule uses a relation part while another one uses part of;
– that one rule understands a parent relation biologically, while another understands it socially;
– that one rule processes a number as inches, while another processes it as millimeters.
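The following Datalog-style fragment, constructed for illustration and not taken from any of the three projects, shows how such a purely lexical mismatch keeps two individually plausible rules from ever working together: one rule expects a relation named part_of while the facts use part, so the query simply yields no answer.

    # Illustrative only: every rule looks correct in isolation, yet the query fails
    # because one rule uses part_of/2 while the rest of the rule base uses part/2.
    hot(X) ← part_of(chili, X)            # expects part_of
    spicy_dish(X) ← dish(X), hot(X)
    dish(arrabbiata)
    part(chili, arrabbiata)               # terminology mismatch: part, not part_of
    # ?- spicy_dish(arrabbiata)  yields no answer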


Rule based systems hold the promise of allowing the automatic recombination of rules to tackle even problems not directly envisioned during the creation of the rule base. However, this depends on ensuring the consistent formalization and use of terminology across the rule base. Hence a part of the gap between the expected simplicity of rule base creation and the reality can be explained by naive assumptions about rule base creation. At the same time this points to the Problem of Terminology as one explanation for the difficulty in debugging rule bases. The problem of terminology makes the rule base harder to understand, and makes it harder to identify, avoid and correct mistakes based on only a local view of the rule base.
– The Problem of Opacity. In the authors' experience failed interactions between rules are the most important source of errors during rule base creation. At the same time, however, it is this interaction between rules that is commonly not shown; it is opaque to the user. The only common way to explicitly show these interactions between the rules is in a proof tree of a successful query to the rule base. This stands in contrast to imperative programming, where the relations between the entities and the overall structure of the program are explicitly created by the developer, shown in the source code and the subject of visualizations. Not having an overall view of the rule base mainly affects the ability of developers to correctly edit the rule base, because it becomes hard, for example, to see the consequences of changes or to gain confidence in the completeness of a rule base. The Problem of Opacity also affects the developer's ability to identify faults in rule bases.
– The Problem of Interconnection. Because rule interactions are managed by the inference engine, everything is potentially relevant to everything else (at least in the absence of robust, user-managed modularization). This complicates error localization for the user, because errors appear in seemingly unrelated parts of the rule base. As a very simple example consider the following rule base:

    hot(X) ← part of(Chile, X)    # fault - should be part of(Chili, X)
    ...
    part of(Chile, SouthAmerica)

In this example a rule that deduces that something is hot based on it having chili as a part is changed by a typo to deduce that something is hot if the country Chile is part of it. This simple error then introduces a connection to a part of the rule base that deals with countries and that - from the user's point of view - is absolutely unrelated. More typical than this example, however, are cases that do not lead to wrong conclusions but merely lead the inference engine to try one path that ends in a rule cycle or in some kind of error. One example would be that a rule causes a whole rule base to become unstratifiable, another that it leads to a rule being tried in unexpected circumstances that then lead to errors due to the wrong usage of built-ins.


– The No-Result Case and the Problem of Error Reporting. By far the most common symptom of an error in the rule base was a query that (unexpectedly) did not return any result - the no-result case. In such a case most inference engines give no further information that could aid the user in error localization. This is unlike many imperative languages, which often produce a partial output and a stack trace. Both imperative and rule based systems sometimes show bugs by behaving erratically, but only rule based systems show the overwhelming majority of bugs by terminating without giving any help on error localization.
– The Problem of Procedural Debugging. All deployed debugging tools for rule based programs known to the authors are based on the procedural or imperative debugging paradigm. This debugging paradigm is well known from the world of imperative programming and is characterized by the concepts of breakpoints and stepping. A procedural debugger offers a way to indicate a rule/rule part where the program execution is to stop and wait for commands from the user. The user can then look at the current state of the program and give commands to execute a number of the successive instructions. The execution steps of a procedural debugger are then, for instance: the inference engine tries to prove a goal A, or it tries to find more results for another goal B. However, as a declarative representation a rule base does not define an order of execution - hence the order of debugging is based on the evaluation strategy of the inference engine. In this way procedural debugging of rule bases breaks the declarative paradigm - it forces the developer to learn about the inner structure of the inference engine. This stands in marked contrast to the idea that rule bases free the developer from worrying about how something is computed. The development of rule based systems cannot take full advantage of the declarative nature of rules when debugging is done on the procedural nature of the inference engine.
– Agile, Iterative and Lightweight Methods. Recent years also saw the increasing adoption and acceptance of iterative and evolutionary development methodologies [6] - they are now widely believed to be superior to waterfall-like models [7]. A high prevalence of agile and iterative methods was also found in the developer survey [1]. Agile and iterative methods were used in more than 50% of the projects with more than 10 person months of effort. A further 23% of the projects worked without following any specific methodology. The first corollary of this observation is that the verification support for these systems cannot rely on the availability of formal specifications of any kind - an impediment for many static analysis methods. A second likely consequence is a higher prevalence of implementation mistakes. With many of these methods, architecture and design decisions are taken on the fly during the implementation of software. In addition, these design and architecture decisions are often repeated and changed, because agile methodologies encourage developers to make these decisions based on the scope of the current iteration, not on the expected scope of the final application. Making these design and architecture decisions ’on the fly’ reduces the risk of costly requirements, design and (up-front) architecture mistakes; however, at the same time it means that many of these mistakes are then made during implementation.


– End User Programmers and Domain Experts. The amount of software in society increases continually, and more and more people are involved in its creation. The creation of software artifacts used to be the very specialized profession of a few thousand experts worldwide, but has now become a set of skills possessed to some degree by tens of millions of people. Within this group an increasing role is played by end user programmers - people who are usually trained for a non-programming area and just need a program, script or spreadsheet as a tool for some task. For the US it is estimated that there are at least four times as many end user programmers as professional programmers [8], with estimates for the number of end user programmers ranging from 11 million [8] up to 55 million [9]. End user programmers are particularly important for web related development, because web developers are known to include an even higher percentage of end user programmers [10,11]. This rise in the number of end user programmers means that the average developer is not trained as well as the knowledge engineers who created the first expert systems. End user programmers usually cannot justify an investment in programming training comparable to that of professional programmers. The survey of developers [1] also found that the average rule base development project includes 1.5 domain experts who create rules themselves and 1.7 domain experts who are involved in verification and validation. Slightly more than half of the projects included at least one domain expert who created rules.
– Less refined tool support. Compared to the tools available to support development with imperative languages, those for the development of rule bases often lack refinement. This discrepancy is a direct consequence of the fact that in recent years the percentage of applications built with imperative programming languages was much larger than that built with rule languages.

4 Core Principles

The authors have identified four core principles to guide the building of better tool support for the creation of rule bases. These principles were conceived either by generalizing from tools that worked well or as direct antidotes to problems encountered repeatedly.
– Interactivity. To create tools in a way that they give feedback at the earliest moment possible, and to support an incremental, trial-and-error process of rule base creation by allowing a rule to be tried out as it is created. Tools that embodied interactivity proved to be very popular and successful in the three projects under discussion. Tools such as fast graphical editors for test queries, text editors that automatically load their data into the inference engine, or simple schema-based verification during rule formulation were the most successful tools employed. In cases where quick feedback during knowledge formulation could not be given (mostly due to very long reasoning times or to technical problems with the inference engine), this was reflected immediately in erroneous rules and unmotivated developers.


Interactivity is known to be an important success factor for development tools, in particular for those geared towards end user programmers [12,13]. Interactivity as a principle addresses many of the problems identified in the previous section by supporting faster learning during knowledge formulation. Immediate feedback after small changes also helps to deal with the problem of interconnection and the problem of error reporting.
– Visibility. To show the hidden structure of (potential) rule interactions at every opportunity. Visibility is included as a direct counteragent to the problems of opacity and interconnection.
– Declarativity. To create tools in a way that the user never has to worry about the how, i.e., about the procedural nature of the computation. Declarativity of all development tools is a prerequisite to realizing the potential of reduced complexity offered by declarative programming languages. The declarativity principle is a direct response to the problem of procedural debugging described in the previous section.
– Modularization. To support the structuring of a rule base in modules in order to give the user the possibility to isolate parts of the rule base and prevent unintended rule interactions. This principle is a direct consequence of the problem of interconnection. However, modularization of rule bases has been extensively researched and implemented (e.g., [14,15,16,17]) and will not be further discussed in this paper.

5 Supporting Rule Base Development

The principles identified in the previous section need to be embodied in concrete tools and development processes in order to be effective. Based on our experience from the three projects described earlier and on best practices as described in the literature (e.g., [18,19,20]), we have identified (and largely implemented) an architecture of tool support for the development of rule bases. A high-level view of this architecture is shown in Figure 1. At its core it shows test, debug and rule creation as an integrated activity - as mandated by the interactivity principle. These activities are supported by test coverage, anomaly detection and visualization. Each of the main parts will be described in more detail in the following paragraphs.

5.1 Test, Debug and Rule Creation as an Integrated Activity

In order to truly support the interactivity principle, the test, debug and rule creation activities need to be closely integrated. This stands in contrast to the usual sequence of: create program part, create test, and debug if necessary. An overview of this integrated activity is shown in Figure 2. Testing has been broken up into two separate activities: the creation/identification of test data and the creation and evaluation of test queries. This has been done because test data can inform the rule creation process - even in the absence of actual test queries. Based on the test data an editor can give feedback on the inferences made possible by a rule while it is created. An editor can automatically display that (given the state of the rule that is being created, the contents of the rule base and the test data) the current rule allows such and such to be inferred. This allows the developer to get instant feedback on whether her intuition of what a rule is supposed to infer matches reality.


Fig. 1. The overall architecture of tool support for the rule base development

Fig. 2. Test, debug and rule creation as integrated activities

When the inferences a rule enables do not match the expectations of the user, she must be able to directly switch to the debugger to investigate this. In this way testing and debugging become integrated, even without a test query being present. Hence we speak of integrated test, debug and rule creation because, unlike in traditional development:
– test data is used throughout editing to give immediate feedback;
– debugging is permanently available to support rule creation and editing, even without an actual test query. The developer can debug rules in the absence of test queries.

5.2 Anomalies

Anomalies are symptoms of probable errors in the knowledge base [21]. Anomaly detection is a well-established and often used static verification technique for the development of rule bases. Its goal is to identify early those errors that would be expensive to diagnose later. Anomalies give feedback to the rule creation process by pointing to errors and can guide the user to perform extra tests on parts of the rule base that seem problematic. Anomaly detection heuristics focus on errors across rules that would otherwise be very hard to detect. Examples of anomalies are rules that use concepts that are not in the ontology, or rules deducing relations that are never used anywhere. Of the problems identified in Section 3, anomaly detection partly addresses the problems of opacity, interconnection and error reporting by finding some of the errors related to rule interactions based on static analysis.
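As an illustration of the kind of cross-rule heuristic meant here (a sketch, not the code of the tooling described in this paper), the following Python fragment flags relations that some rule deduces but that no other rule or query ever uses - one of the anomaly types just mentioned. Rules are modelled simply as head/body predicate names.

    # Illustrative anomaly heuristic: report predicates that appear in some rule head
    # but never in any rule body or query (deduced but never used).
    def deduced_but_unused(rules, queried=()):
        heads = {head for head, _ in rules}
        used = set(queried)
        for _, body in rules:
            used.update(body)
        return heads - used

    rules = [
        ("discount", ["customer", "gold_status"]),
        ("gold_status", ["customer", "high_revenue"]),
        ("obsolete_flag", ["customer"]),          # deduced, but nobody uses it
    ]
    print(deduced_but_unused(rules, queried={"discount"}))
    # {'obsolete_flag'}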


We deployed anomaly detection heuristics within Project Halo, both as very simple heuristics integrated into the rule editor and as more complex heuristics in a separate tool. The anomaly heuristics integrated into the editor performed well and were well received by the developers; the more complex heuristics, however, took a long time to be calculated, thereby violating the interactivity principle, and were not accepted by the users.

5.3 Visualization

The visualization of rule bases here means the use of graphic design techniques to display the overall structure of the entire rule base, independent of the answer to any particular query. The goals of such visualizations are the same as for UML class diagrams and other overview representations of programs and models: to aid teaching, development, debugging, validation and maintenance by facilitating a better understanding of a computer program/model by the developers - for example, by showing which other parts are affected by a change. The overview visualization of the entire rule base is a response to the problem of opacity. Not being able to get an overview of the entire rule base and being lost in the navigation of the rule base was a frequent complaint in the three projects described in the beginning. To address this we created a scalable overview representation of the static and dynamic structure of a rule base; detailed descriptions can be found in [22].

5.4 Debugging

In Section 3 the current procedural debugging support was identified as unsuited to supporting the simple creation of rule bases. It was established that debuggers need to be declarative in order to realize the full potential of declarative programming languages. To address this shortcoming we have developed and implemented the Explorative Debugging paradigm for rule-based systems [23]. Explorative Debugging works on the declarative semantics of the program and lets the user navigate and explore the inference process. It enables the user to use all her knowledge to quickly find the problem and to learn about the program at the same time. An Explorative Debugger is not only a tool to identify the cause of an identified problem but also a tool to try out a program and learn about its working. Explorative Debugging puts the focus on rules. It enables the user to check which inferences a rule enables, how it interacts with other parts of the rule base and what role the different rule parts play. Unlike in procedural debuggers, the debugging process is not determined by the procedural nature of the inference engine but by the user, who can use the logical/semantic relations between the rules to navigate. An explorative debugger is a rule browser that:
– Shows the inferences a rule enables.
– Visualizes the logical/semantic connections between rules and how rules work together to arrive at a result. It supports the navigation along these connections (such logical connections are static dependencies, formed for example by the possibility to unify a body atom of one rule with the head atom of another; others are given by the proof tree that represents the logical structure of a proof leading to a particular result - a small sketch of such a dependency graph follows after this list).


– It allows the user to further explore the inferences a rule enables by digging down into the rule parts.
– It is integrated into the development environment and can be started quickly to try out a rule as it is formulated.
The interested reader can find a complete description of the explorative debugging paradigm and one of its implementations in [23].
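To make the static dependencies navigated by such a browser more tangible, the following sketch (an illustration under simplified assumptions, not the implementation described in [23]) links two rules whenever a body predicate of one is deduced by the head of another; an explorative debugger can let the user walk this graph instead of stepping through the engine's evaluation order.

    # Sketch of the static dependency graph an explorative debugger can expose:
    # rule A depends on rule B if some body predicate of A is deduced by B's head.
    # Rules are (head_predicate, [body_predicates]) pairs; illustration only.
    def dependency_graph(rules):
        by_head = {}
        for name, (head, _) in rules.items():
            by_head.setdefault(head, []).append(name)
        graph = {}
        for name, (_, body) in rules.items():
            graph[name] = [dep for pred in body for dep in by_head.get(pred, [])]
        return graph

    rules = {
        "r1": ("spicy_dish", ["dish", "hot"]),
        "r2": ("hot", ["part"]),
    }
    print(dependency_graph(rules))   # {'r1': ['r2'], 'r2': []}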

6 Conclusions

Current tool support for the development of rule based knowledge based systems fails to address many common problems encountered during this process. An architecture that combines integrated development, debugging and testing, supported by test coverage metrics, visualization and anomaly detection heuristics, can help tackle this challenge. The principles of interactivity, declarativity, visibility and modularization can guide the instantiation of this architecture in concrete tools. The novel paradigm of explorative debugging, together with techniques from the automatic debugging community, can form a robust basis for debugging in this context. Such complete support for the development of rule bases is an important prerequisite for Semantic Web rules to become a reality and for business rule systems to reach their full potential.

References

1. Zacharias, V.: Development and verification of rule based systems - a survey of developers. In: Bassiliades, N., Governatori, G., Paschke, A. (eds.) RuleML 2008. LNCS, vol. 5321, pp. 6–16. Springer, Heidelberg (2008)
2. Kifer, M., Lausen, G., Wu, J.: Logical foundations of object-oriented and frame-based languages. Journal of the ACM 42, 741–843 (1995)
3. Decker, S., Erdmann, M., Fensel, D., Studer, R.: Ontobroker: Ontology-based access to distributed and semi-structured information. In: Database Semantics: Semantic Issues in Multimedia Systems, pp. 351–369 (1999)
4. Kifer, M., de Bruijn, J., Boley, H., Fensel, D.: A realistic architecture for the semantic web. In: Adi, A., Stoutenburg, S., Tabet, S. (eds.) RuleML 2005. LNCS, vol. 3791, pp. 17–29. Springer, Heidelberg (2005)
5. McCarthy, J.: Programs with common sense. In: Proceedings of the Teddington Conference on the Mechanization of Thought Processes, London, Her Majesty’s Stationery Office, pp. 75–91 (1959)
6. Larman, C., Basili, V.R.: Iterative and incremental development: A brief history. IEEE Computer, 47–56 (June 2003)
7. MacCormack, A.: Product-development practices that work. MIT Sloan Management Review, 75–84 (2001)
8. Scaffidi, C., Shaw, M., Myers, B.: Estimating the numbers of end users and end user programmers. In: VL/HCC 2005: Proceedings of the 2005 IEEE Symposium on Visual Languages and Human-Centric Computing, pp. 207–214 (2005)
9. Boehm, B.: Software Cost Estimation with COCOMO II. Prentice Hall PTR, Englewood Cliffs (2000)


10. Rosson, M.B., Balling, J., Nash, H.: Everyday programming: Challenges and opportunities for informal web development. In: Proceedings of the 2004 IEEE Symposium on Visual Languages and Human Centric Computing (2004)
11. Harrison, W.: The dangers of end-user programming. IEEE Software, 5–7 (July/August 2004)
12. Ruthruff, J., Burnett, M.: Six challenges in supporting end-user debugging. In: 1st Workshop on End-User Software Engineering (WEUSE 2005) at ICSE 2005 (2005)
13. Ruthruff, J., Phalgune, A., Beckwith, L., Burnett, M.: Rewarding good behavior: End-user debugging and rewards. In: VL/HCC 2004: IEEE Symposium on Visual Languages and Human-Centric Computing (2004)
14. Jacob, R., Froscher, J.: A software engineering methodology for rule-based systems. IEEE Transactions on Knowledge and Data Engineering 2, 173–189 (1990)
15. Baroff, J., Simon, R., Gilman, F., Shneiderman, B.: Direct manipulation user interfaces for expert systems, pp. 99–125 (1987)
16. Grossner, C., Gokulchander, P., Preece, A., Radhakrishnan, T.: Revealing the structure of rule based systems. International Journal of Expert Systems (1994)
17. Mehrotra, M.: Rule groupings: a software engineering approach towards verification of expert systems. Technical report, NASA Contract NAS1-18585, Final Rep. (1991)
18. Preece, A., Talbot, S., Vignollet, L.: Evaluation of verification tools for knowledge-based systems. Int. J. Hum.-Comput. Stud. 47, 629–658 (1997)
19. Gupta, U.: Validation and verification of knowledge-based systems: a survey. Journal of Applied Intelligence, 343–363 (1993)
20. Tsai, W.T., Vishnuvajjala, R., Zhang, D.: Verification and validation of knowledge-based systems. IEEE Transactions on Knowledge and Data Engineering 11, 202–212 (1999)
21. Preece, A.D., Shinghal, R.: Foundation and application of knowledge base verification. International Journal of Intelligent Systems 9, 683–701 (1994)
22. Zacharias, V.: Visualization of rule bases - the overall structure. In: Proceedings of the 7th International Conference on Knowledge Management - Special Track on Knowledge Visualization and Knowledge Discovery (2007)
23. Zacharias, V., Abecker, A.: Explorative debugging for rapid rule base development. In: Proceedings of the 3rd Workshop on Scripting for the Semantic Web at the ESWC 2007 (2007)

Semantic Annotation of EPC Models in Engineering Domains to Facilitate an Automated Identification of Common Modelling Practices

Andreas Bögl1, Michael Schrefl1, Gustav Pomberger2, and Norbert Weber3

1 Department of Business Informatics – Data & Knowledge Engineering, Johannes Kepler University Linz, Altenberger Straße 69, 4040 Linz, Austria {boegl,schrefl}@dke.uni-linz.ac.at
2 Department of Business Informatics – Software Engineering, Johannes Kepler University Linz, Altenberger Straße 69, 4040 Linz, Austria [email protected]
3 Siemens AG, Corporate Technology-SE 3, Otto-Hahn-Ring 6, Munich, Germany [email protected]

Abstract. An automated identification of common modeling practices from EPC (Event-Driven Process Chain) models requires to perform a semantic analysis of EPC functions and events. The semantic analysis faces the problem that an essential part of the EPC semantics is bound to natural language expressions in functions and events with undefined process semantics. The semantic annotation of natural language expressions provides an adequate approach to tackle this problem. This paper introduces a novel approach that enables an automated semantic annotation of EPC functions and events. It employs semantic patterns to analyze the textual structure of natural language expressions and to relate them to instances of a reference ontology. Thus, semantically annotated EPC model elements are input for subsequent semantic analysis that identifies common modeling practices. Keywords: Semantic Annotation, Semantic Patterns, Semantic EPC Models, Process Ontology.

1 Introduction

Designing processes is a sophisticated and cognitive task for process designers, usually supported by tools such as the ARIS Toolset. Process designs are represented as process models in a particular language, usually depicted by a graph of activities and their causal dependencies. The Event-driven Process Chain (EPC) modeling language [10] has gained broad acceptance and popularity both in research and in practice. Engineering processes are characterized by an identifiable progression from requirements, through specification, to design and implementation. Each of these phases comprises context-specific tasks, conducted similarly in different engineering domains [12]; e.g., the task “Design Architecture” usually succeeds the task “Identify Requirements”.


the extraction of process patterns, tracing back to obviously existing structural analogies in process models describing various domains in engineering domains [4]. A process pattern represents a common or best practice solution to solve a particular problem in a certain context. Hence, it might assist process modelers for constructing high quality process solutions. Extracting process patterns requires to compare process models either human-driven or automatically. Prerequisites for an automatable comparison of conceptual models are discussed in [13]. Most emphasized prerequisites refer to a formal semantics of modeling language and constructs. Additionally, a reference ontology should capture the real world semantics. EPC models express the model element structure in terms of the meta language constructs which are Functions (F), Events(E) and Connectors (and, xor, or). Natural language expressions describe the inner process semantics of functions and events. Depending on the sequence of words and on its semantics, different meanings may arise. For example, “Define Software Requirements with Customer” has a different meaning than “Define Software Requirements for Customer”. The former expression means that software requirements are defined in cooperate manner with a customer; the latter one indicates the customer as output target of the performed activity. Semantic annotation of EPC models constitutes an appropriate solution to tackle the problem of different meanings but raises the problems to determine the process semantics of lexical terms used in natural language expressions and to link them to reference ontology instances. The process semantics of lexical terms represents process tasks executed on process objects, for instance. Linking lexical terms to reference ontology instances requires the availability of corresponding reference ontology instances. In case of non-corresponding instances, semantic annotation also entails to populate a reference ontology with new instances derived from lexical terms. Natural language expressions follow naming conventions or standards that represent guidelines for naming EPC functions/events [19]. If a naming convention is used, it is clear what a lexical term expresses. For example, a typical suggested naming convention for an EPC function is the template Task followed by a Process Object that specifies the semantics of natural language expressions such as “Define Project Plan”; “Define” has the semantics of a task, the lexical term “Project Plan” indicates the semantics of a process object. We analyzed about 5,000 EPC functions/events in engineering domains. We experienced that the suggested recommendations do not fully cover the semantics of used natural language expressions. These investigations addionally triggers the neccessity of a formal notation that enables to express or specify guidelines. Formalized naming guidelines enable verifying natural language expressions against a set of predefined conventions. Based on our investigations for clarifying the semantics of natural language expressions used for naming EPC functions/events, we introduce a set of guidelines that extend traditional naming conventions. The introduced guidelines are expressed through semantic pattern descriptions. Semantic pattern descriptions represent semantic templates that bridge the gap between informal and formal representation. Formal representation refers to concepts specified by a reference ontology. 
Semantic pattern descriptions are defined either for EPC functions or for EPC events. A semantic pattern description has a pattern template, which for an EPC function is represented in the form Function(Context)[Task; Process Object; Parameter].


Context is the name of the engineering domain the analyzed EPC function is associated with, the concept Task represents the activity being performed on an instance of the concept Process Object, and the concept Parameter optionally captures a Process Object instance involved in executing an EPC function. Representing the semantics of a function consists in instantiating a semantic pattern template by binding lexical terms to the variables Context, Task, Process Object and Parameter. For example, the instantiated semantic pattern template Function(Software)[Task: ”Define”; Process Object: ”Quality Goal”] defines the process semantics of the EPC function “Define Quality Goal” in the context “Software”. The same or similar knowledge in EPC functions/events can be expressed in alternative ways due to the freedom of modeling (e.g. the usage of synonyms, abbreviations etc.). Consequently, alternative natural language expressions may refer to the same process semantics. For example, the EPC functions “Define SW Requirements” and “Define Requirements for Software” refer to the same process semantics. For that reason, each semantic pattern description defines a set of lexical structures and analysis rules. A lexical structure is defined by a sequence of word classes (e.g. Verb, Noun, Preposition). The lexical structure LS := [Verb_Task] [NounGroup_ProcessObject] captures a natural language expression such as Define [Verb_Task] Quality Goal [NounGroup_ProcessObject]. A lexical structure instantiates a semantic pattern template by applying analysis rules. They define how to map lexical structures onto the reference ontology concepts specified in a semantic pattern template. An instance of a reference ontology concept has a unique identifier, which may have several textual counterparts. For example, a process object with the OID=123 has the textual counterparts {“Software Requirements”, “SW Requirements”, “Requirements for Software”}. To separate meaning from its lexical representation, the reference ontology is split into two layers. Layer one represents the lexical knowledge base that captures the commonsense vocabulary used in natural language expressions; layer two represents the semantic domain for this vocabulary by providing the process ontology concepts and relations defined in the process knowledge base. This paper is an extended version of our paper “Semantic Annotation of EPC Models in Engineering Domains by Employing Semantic Patterns” [1] and is organized as follows: The upcoming section provides an overview of our approach for the identification of common modelling practices and its relation to semantically annotated EPC models. Section 3 introduces the reference ontology, and Section 4 deals with the main contribution of this paper, the semantic annotation process. In Section 5, we discuss related approaches. Finally, Section 6 concludes the paper by summarizing our approach.
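To make the two-layer split concrete, the following minimal sketch (not taken from the BPI implementation; the dictionary layout and the OID value are purely illustrative) shows how several textual counterparts in the lexical layer can resolve to a single process knowledge base instance:

```python
# Lexical layer: textual counterparts pointing to one process knowledge base OID.
LEXICAL_INDEX = {
    "software requirements": 123,
    "sw requirements": 123,
    "requirements for software": 123,
}

# Process knowledge layer: one instance per OID, independent of its wording.
PROCESS_KB = {
    123: {"concept": "Process Object", "label": "Software Requirements"},
}

def resolve(term: str):
    """Map a lexical term to its process ontology instance, if any."""
    oid = LEXICAL_INDEX.get(term.lower())
    return PROCESS_KB.get(oid) if oid is not None else None

print(resolve("SW Requirements"))  # -> {'concept': 'Process Object', 'label': 'Software Requirements'}
print(resolve("Project Plan"))     # -> None (would trigger ontology population)
```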

2 An Outline of the Overall Approach: Identification of Common Modelling Practices

The work presented in this paper relies on research activities that are part of the BPI (Business Project Improvement) project1. The project addresses the question “How can process patterns be automatically extracted from given process descriptions in engineering domains?”

1 The project was funded by Siemens AG, Corporate Technology – SE 3, Munich.


It assumes process descriptions in terms of Event-driven Process Chains (EPC) [10]. The automated extraction of process patterns from given process models yields several advantages. Using a large set of specific models offers a detailed insight into the common and best practices of a domain. The frequency of occurrence of process patterns (i.e., the number of process pattern instances) provides an objective measure to evaluate candidates for common-practice solutions. If the frequency of a process pattern reaches a predefined threshold value, it automatically represents a candidate for a common practice.
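As an illustration of this frequency criterion, the following Python sketch counts hypothetical pattern identifiers across analyzed models and keeps those reaching a threshold; the identifiers and the threshold value are invented for the example and do not come from the BPI project:

```python
from collections import Counter

def common_practice_candidates(pattern_instances, threshold):
    """Count how often each process pattern occurs and return those whose
    frequency reaches the given threshold."""
    frequency = Counter(pattern_instances)
    return {pattern: count for pattern, count in frequency.items() if count >= threshold}

# Hypothetical pattern identifiers extracted from several EPC models.
observed = ["define-quality-goal", "define-quality-goal", "review-spec",
            "define-quality-goal", "review-spec"]
print(common_practice_candidates(observed, threshold=3))
# -> {'define-quality-goal': 3}
```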

[Figure 1 shows (a) process model repositories, (b) the three stages for the automated identification of common practices – semantic annotation of EPC models (1), extraction of process patterns (2), and matching of process goals (3) – and (c) the process knowledge warehouse with the reference ontology (lexical knowledge base and process knowledge base) and the process patterns.]

Fig. 1. Automated Identification of Common Practices in EPC Models – The Approach

Figure 1 illustrates our approach for the identification of common practices in given EPC models. The overall procedure comprises the three stages (1) Semantic Annotation of EPC Models, (2) Extraction of Process Patterns and (3) Matching of Process Goals (Figure 1-b). The Process Knowledge Warehouse (Figure 1-c) represents a common knowledge base that captures and organizes process knowledge in terms of a reference ontology and process patterns. Stage 1 elicits process knowledge from external process repositories (Figure 1-a, e.g. an ARIS repository). It performs a semantic annotation of EPC functions and events and populates a reference ontology automatically. Stage 2 extracts process patterns from semantically annotated EPC models. Finally, Stage 3 performs a matching of process goals in order to identify process patterns achieving similar process goals. Based on similar process patterns, common practices are constructed. Stage 1 is discussed in detail in the remaining sections. We now give a brief overview of Stages 2 and 3. The objective of a process pattern is to provide a solution for a problem that occurs in a certain situation. A problem solved by a process pattern can be interpreted in the sense of how to achieve a defined process goal. A goal “represents the purpose or the outcome that the business as a whole is trying to achieve. Goals control the behaviour of the business and show the desired states of some resource in the business” [4]. The solution of a process pattern to meet a process goal is represented either by the semantically described elements of a process pattern or by a set of interrelated process patterns (a composite process pattern). Our work makes a clear distinction between elementary and composite process patterns. Elementary process patterns cannot be broken down into a set of sub-patterns, whereas composite process patterns go beyond a simple composition of elementary process patterns.


Their constituents are interrelated by semantic pattern relationships such as “hasSuccessor” or “uses” [25]. An elementary process pattern represents the smallest unit for describing a process solution to achieve a process goal. Technically, an elementary process pattern encapsulates the smallest unit of a semantically annotated EPC model. The smallest chunk of an EPC model consists of at least one start event, exactly one function and at least one end event.
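One possible way to represent such an elementary pattern, including the well-formedness check for the smallest EPC chunk, is sketched below; the field names and the goal attributes are illustrative assumptions, not the authors' data model:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ElementaryProcessPattern:
    """Smallest unit of a semantically annotated EPC model."""
    start_events: List[str]
    function: str
    end_events: List[str]
    initial_goals: List[str] = field(default_factory=list)  # goals that must hold before use
    resulting_goal: Optional[str] = None                     # goal met by applying the pattern

    def is_well_formed(self) -> bool:
        # at least one start event, exactly one function, at least one end event
        return len(self.start_events) >= 1 and bool(self.function) and len(self.end_events) >= 1

pattern = ElementaryProcessPattern(
    start_events=["Project Risk Defined"],
    function="Define Development Plan",
    end_events=["Development Plan Defined"],
    resulting_goal="Development Plan Defined",
)
print(pattern.is_well_formed())  # True
```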

Fig. 2. Example for Elementary Process Pattern

Figure 2 depicts an example that illustrates the basic approach for describing an elementary process pattern based on semantically annotated EPC functions and events. The EPC model consists of the two events named “Project Risk Defined” and “Development Plan Defined” and of the function named “Define Development Plan”, each annotated with instances of the reference ontology. It is important to note that EPC events mainly provide candidates for describing process goals, since they generally capture state information of process objects. Due to space limitations, we omit a fully fledged discussion of how process goals are extracted from given EPC models. Both the initial and the resulting context information of an elementary process pattern are represented by a goal-based description. Initial context information means that the assigned goals must already be achieved in order to use the underlying process pattern. The resulting context indicates a process goal which is met by the concerned pattern. Composite process patterns represent coherent structures consisting of a composition of elementary process patterns. As already mentioned, semantic pattern relationships enable the composition of coherent pattern structures that achieve a higher-level process goal, allowing the patterns to unfold their full strength [25]. The specification of such relationships between elementary pattern structures relies on process goals as previously described. A process goal tree results from a decomposition of process goals into subgoals through and/or/xor graph structures borrowed from problem reduction techniques in artificial intelligence [26]. In Figure 3-b, process goal A is decomposed by a sequence decomposition into the subgoals B and C, each of which has an assigned process pattern (B1 and C1) that provides a process solution to meet the respective goal. The sequence decomposition of process goal A implies that process goals B and C must be fulfilled in that sequence in order to meet process goal A. Figure 3-c illustrates a composite process pattern derived from the process goal tree depicted in Figure 3-b.


Fig. 3. Example for Composite Process Pattern

The composite process pattern provides a process solution to achieve process goal A. It consists of two elementary process patterns which are interrelated by the semantic pattern relationship hasSuccessor. The final step of our approach identifies process patterns in terms of common practices. In general, a common practice represents a process pattern that is abstracted from domain-specific process patterns describing similar process solutions. Process patterns are similar if they achieve a common or similar process goal. Hence, the identification of common practices relies on a matching of process goals in order to identify similar process patterns. For a comprehensive overview we refer to [4].
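The goal-driven composition described above can be sketched as follows; the restriction to sequence decomposition, the class names and the output format are simplifying assumptions made for illustration:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Goal:
    name: str
    decomposition: str = "leaf"              # e.g. "sequence", "and", "or", "xor"
    subgoals: List["Goal"] = field(default_factory=list)
    pattern: Optional[str] = None            # elementary pattern meeting this goal

def compose(goal: Goal):
    """Derive a composite pattern for a sequence-decomposed goal: the elementary
    patterns of the subgoals, linked by hasSuccessor in subgoal order."""
    if goal.decomposition != "sequence":
        raise ValueError("sketch only handles sequence decomposition")
    patterns = [sub.pattern for sub in goal.subgoals]
    links = [(patterns[i], "hasSuccessor", patterns[i + 1]) for i in range(len(patterns) - 1)]
    return {"achieves": goal.name, "elements": patterns, "relations": links}

goal_a = Goal("A", "sequence", [Goal("B", pattern="B1"), Goal("C", pattern="C1")])
print(compose(goal_a))
# {'achieves': 'A', 'elements': ['B1', 'C1'], 'relations': [('B1', 'hasSuccessor', 'C1')]}
```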

3 Reference Ontology

The semantic annotation of EPC models entails the population of reference ontology instances and the establishment of a semantic linkage from EPC functions/events to ontology instances. The reference ontology provides concepts and relations whose instances capture both the vocabulary (lexical knowledge) used to describe EPC functions/events and its process semantics (process knowledge). Thus it serves as the knowledge base for the semantic annotation of EPC functions/events. Lexical knowledge comprises morphological and syntactic knowledge about the used vocabulary. Morphological knowledge relates to word inflections such as the singular/plural form of a noun or the past tense of verbs. Syntactic knowledge comprises word class information and binary features such as countable/uncountable. Additionally, the lexical knowledge base explicitly considers context information and the relationships isAcronymOf and isSynonymTo between entries. Context information refers to engineering domains such as software or hardware development. Process knowledge represents the semantic domain for natural language expressions. The process semantics is expressed by instances of process ontology concepts such as task and process object and their semantic relationships such as isPartOf or isSubClassOf. Figure 4 sketches the overall architecture. Natural language expressions are used for naming EPC functions/events. The process semantics implicitly described in an EPC function/event is captured in the process knowledge base; its textual representation is captured in the lexical knowledge base. Publicly available resources such as WordNet [23] may provide commonsense vocabulary, but cannot be considered fully suitable for capturing domain-specific vocabulary.

[Figure 4 depicts how natural language expressions name EPC functions/events and how the semantic annotation maps them onto concepts and instances of the lexical knowledge base and, via a semantic mapping, onto concepts and instances of the process knowledge base of the reference ontology.]

Fig. 4. Coherence between Reference Ontology and EPC Functions/Events

In general, such resources are open-world dictionaries comprising several hundred thousand open-world entities and semantic relationships. It is an unreasonable demand for a process designer to maintain this vocabulary. A domain-specific controlled vocabulary within an engineering domain usually comprises several hundred entities, which can be maintained more easily. Nevertheless, WordNet plays a vital role for the semantic annotation of EPC functions/events. Its purpose is further discussed in Section 4 of this paper. The conceptual design of the lexical knowledge base relies on the analysis of natural language expressions used for naming EPC functions/events. Figure 5 illustrates the structure of a natural language expression. A natural language expression used for naming is a composition of words, each belonging to a word class. The sequence of word classes specifies the lexical structure of the natural language expression. In general, a lexical term represents a cluster of words belonging to the same word class. For example, the word “Define” belongs to the word class verb; since there is only one occurrence of this word class, the lexical term consists only of the word “Define”. Besides this general rule, an “and” conjunction used in a natural language expression requires special attention: an “and” conjunction results in a separation of lexical terms.

Fig. 5. Structure of Natural Language Expressions

For example, let us replace the preposition “of” with an “and” conjunction in Figure 5. In this case, the natural language expression would comprise the two additional lexical terms “Requirements” and “Software”. The lexical knowledge base consists of lexical entries. A lexical entry is either single- or multi-structured. A single-structured lexical entry is represented by exactly one word; a multi-structured lexical entry consists of several words. One may recognize the analogy to a lexical term. Hence, a lexical entry represents a lexical term captured in the lexical knowledge base. Let us consider Figure 6, which provides an example of the multi-structured lexical entry “Software Requirements” and of the lexical mapping between the lexical and the process knowledge base. The right section illustrates a part of a process ontology that captures the semantics of the lexical entry “Software Requirements”, realized by a lexical mapping between a lexical entry and a process ontology instance.
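A minimal sketch of the clustering rule described above (consecutive words of the same word class form one lexical term, an “and” conjunction starts a new term) could look as follows; the word-class tag names are assumptions, not the tags used by the BPI tool:

```python
def lexical_terms(tagged_words):
    """Group consecutive words of the same word class into lexical terms and
    start a new term at an 'and' conjunction."""
    terms, current_words, current_class = [], [], None
    for word, word_class in tagged_words:
        if word_class == "Conjunction":                     # e.g. "and" separates terms
            if current_words:
                terms.append((" ".join(current_words), current_class))
            current_words, current_class = [], None
            continue
        if word_class != current_class and current_words:   # word class changes: close term
            terms.append((" ".join(current_words), current_class))
            current_words = []
        current_words.append(word)
        current_class = word_class
    if current_words:
        terms.append((" ".join(current_words), current_class))
    return terms

# "Define Requirements and Software" yields three separate lexical terms.
print(lexical_terms([("Define", "Verb"), ("Requirements", "Noun"),
                     ("and", "Conjunction"), ("Software", "Noun")]))
```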

[Figure 6 shows the multi-structured lexical entry “Software Requirements” in the lexical knowledge base, composed of the lexical entries “Software” (noun, singular, with the abbreviation “SW”) and “Requirements” (noun, plural, countable, with the singular form “Requirement”), in the context “Software Development”, together with its mapping to a Process Object instance in the process knowledge base that is a subclass of the Process Object “Requirements”.]

Fig. 6. Example for Multi-Structured Lexical Entry

The meaning of this entry can be interpreted as follows: “Software Requirements” is a process object and it is a specialization of the process object “Requirements”.

3.1 Process Knowledge Base

The concepts and relations of the process knowledge result from the analysis of natural language terms describing EPC functions/events and their associated meanings. Figure 7 depicts the process ontology concepts for capturing the semantics of EPC functions/events. The top-level concept EPC Entity classifies a lexical term into either a Task (T), a Process Object (PO) or a State (S) concept. A Process Object represents a real or an abstract thing of interest within a (business process) domain. According to Rosemann [17, p. 177 et seq.], the concept for describing a Process Object has the semantic relations isPartOf (e.g. Project Handbook isPartOf Development Project), isSubClassOf (e.g. Software Project isSubClassOf Project) and migratesTo (e.g. Software Requirements migratesTo Software Architecture). A Task can be performed manually by a human or electronically by a service (e.g. a Web Service) for achieving a desired (business) objective. It can be specified at different levels of abstraction; refinements or specializations are expressed by the semantic relationship hasSubTask. A State always refers to a Process Object, indicating the state that results from performing a Task on a Process Object. Parameters are process objects that are relevant for a task execution or a state description. The process ontology comprises the four parameter types Source Direction Parameter, Target Direction Parameter, Means Parameter and Dependency Parameter. A Source Direction Parameter defines a source process object, e.g. “Derive Quality Goal from Specification Document” indicates “Specification Document” as a Source Direction Parameter. A Target Direction Parameter denotes a recipient process object within a function execution, e.g. the function “Rework Specification for Project Plan” specifies “Project Plan” as a Target Direction Parameter. A Means Parameter semantically describes a process object as an input requirement for task execution.


[Figure 7 shows the process ontology: the concepts Task, Process Object and State as subclasses of EPC Entity; the Process Object relations isPartOf, isSubClassOf and migratesTo; the relations hasSubTask, hasSubState, refersTo, isPerformedOn and isAttribute; the optional Parameter concept with its subclasses Source Direction Parameter, Target Direction Parameter, Means Parameter and Dependency Parameter; and the Composite Parameter composed of parameters.]

Fig. 7. Process Ontology for Capturing the Semantics of EPC Functions/Events

For example, “Rework Specification with Software Goals” indicates “Software Goals” as a Means Parameter. A Dependency Parameter indicates that executing a task on a process object depends on an additional process object, e.g. the function “Decide Quality Measure Upon Review Status” specifies “Review Status” as a Dependency Parameter. Additionally, a function/event may contain a composition of parameters, expressed by the concept CompositeParameter. For example, “Rework Specification with Software Goals for Project Handbook” is a composition of the two parameters “Software Goals” (Means Parameter) and “Project Handbook” (Target Direction Parameter).
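For illustration, the concept taxonomy of Figure 7 can be written down as plain subject-predicate-object triples, as in the sketch below; this triple encoding is our assumption and not the serialization used by the authors:

```python
# Concepts and relations named in the text and in Figure 7, as simple triples.
PROCESS_ONTOLOGY_SCHEMA = [
    ("Task", "isSubClassOf", "EPC Entity"),
    ("Process Object", "isSubClassOf", "EPC Entity"),
    ("State", "isSubClassOf", "EPC Entity"),
    ("Task", "isPerformedOn", "Process Object"),
    ("State", "refersTo", "Process Object"),
    ("Task", "hasOptionalParameter", "Parameter"),
    ("State", "hasOptionalParameter", "Parameter"),
    ("Source Direction Parameter", "isSubClassOf", "Parameter"),
    ("Target Direction Parameter", "isSubClassOf", "Parameter"),
    ("Means Parameter", "isSubClassOf", "Parameter"),
    ("Dependency Parameter", "isSubClassOf", "Parameter"),
    ("Composite Parameter", "isComposedOf", "Parameter"),
]

def subconcepts(concept):
    """All concepts declared as direct subclasses of the given concept."""
    return [s for s, p, o in PROCESS_ONTOLOGY_SCHEMA if p == "isSubClassOf" and o == concept]

print(subconcepts("Parameter"))
# ['Source Direction Parameter', 'Target Direction Parameter', 'Means Parameter', 'Dependency Parameter']
```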

Fig. 8. Stages for Automatic Semantic Annotation


4 Semantic Annotation Process

The input consists of natural language expressions used for naming EPC functions/events. The semantic annotation process comprises the four stages depicted in Figure 8, which are discussed in the following subsections. Figure 9 sketches the output of the semantic annotation process, which comprises (1) a semantic linkage between EPC functions/events and instances of process ontology concepts and (2) updated instances of reference ontology concepts.

4.1 Term Extractor

The semantic annotation process starts with the extraction of the used words by parsing the natural language expression of each EPC function/event. The extracted words are input for the term normalizer.

Fig. 9. Example for Semantically Annotated EPC Function

4.2 Term Normalizer

The term normalizer component addresses the problems of word classification and of naming conflicts. This step reduces the number of potential naming conflicts to synonyms and abbreviations. It neglects homonyms, since a non-ambiguous meaning of the used vocabulary in engineering domains is assumed. The determination of word classes (e.g. noun, verb, etc.) requires finding a match between the words in natural language expressions (extracted by the Term Extractor) and the words associated with lexical entries in the lexical knowledge base. The match procedure considers the semantic relationships (e.g. isAbbreviationTo) associated with a lexical entry (e.g. SW is an abbreviation of Software). If a search for a word is successful, the word class derives from the name of the concept the matched word is an instance of. In case of naming conflicts, the term normalizer follows the rule of delivering the base word. For instance, if SW has been identified as an abbreviation of Software, the term normalizer delivers the term “Software” as a noun. If a query for a word in the lexical knowledge base delivers an empty result, an automatically driven word classification is not feasible. In this case, the publicly available dictionary WordNet is employed for word classification and synonym detection. According to Liu and Singh [11], it is particularly suited for this task as it is “optimized for lexical categorization and word-similarity determination”.


WordNet originates from the Cognitive Science Laboratory at Princeton University. Its schema comprises the three main classes synset, wordSense and word. A synset groups words with a synonymous meaning, such as {car, auto, machine}. Due to the different senses of words, a synset contains one or more word senses, and each word sense belongs to exactly one synset [23]. A synset contains words of only one of the word classes noun, verb, adjective or adverb. There are seventeen relations between synsets (e.g. hyponymy, entailment, meronymy, etc.) and five between word senses (e.g. antonym, see also). The term normalizer tries to retrieve semantic information by consulting WordNet. A WordNet query delivers either a set of word classes and synonyms (associated with the queried lexical term) or an empty set. If an empty set is delivered, the term normalizer component requires an interaction with the analyst in order to obtain a human classification entry. In our introduced example in Figure 8, the term normalizer identifies the lexical terms “Define” as a verb, “Requirements” as a noun, “For” as a preposition and “Software Prototype” as a noun group.

4.3 Semantic Pattern Analyzer

The term normalizer component determines word classes and resolves word conflicts as described in the previous section. The semantic pattern analyzer instantiates semantic patterns by employing the associated analysis rules. Semantic pattern descriptions enable the formalization and specification of naming conventions. Naming conventions represent guidelines for naming EPC functions/events, as proposed by the ARIS method and by the guidelines of modeling [19]. These conventions suggest naming EPC functions with a verb that expresses a task, followed by a noun that refers to a process object (e.g. Define: [Task] Development Plan: [Process Object]). The naming conventions for EPC events suggest expressing state information by a passive verb preceded by the associated process object (e.g. Development Plan: [Process Object] Defined: [State]). The singular noun form is propagated, since a process object can be regarded as a class type. Hence, the conventions proposed in data or class modeling are advocated. A semantic pattern description is given as a tuple S = (T, LA), where T defines the semantic pattern template and LA is a set of pairs (l_i, a_j) with l_i ∈ L and a_j ∈ A, where L = {l_1, …, l_n} is a set of lexical structures and A = {a_1, …, a_n} is a set of analysis rules. A semantic pattern template is a tuple P = (E, C, O), where E ∈ {Function, Event} defines the semantic pattern type, C defines the context of a template instance, and O is the set of addressed process ontology concepts (e.g. task, state). As an example of a semantic pattern description, the pattern template Function(Context)[Task; Process Object] is introduced. It is used to discuss the other parts of the semantic pattern in the following subsections.

Lexical Structures. A lexical structure is a tuple (I, C), where I is a unique identifier and C is an ordered set of word classes; the following set of predefined word classes is available: {Noun [N], NounGroup [NG], Verb [V], Passive Verb [PV], Preposition [P], Conjunction [C]}. The introduced example of a semantic pattern description defines the following two lexical structures: L1 := [V_Task] [N_ProcessObject] and L2 := [V_Task] [NG_ProcessObject].
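A minimal rendering of these definitions as data structures might look as follows; the class and function names are invented for illustration, and an analysis rule is reduced to a plain Python callable:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class LexicalStructure:
    identifier: str
    word_classes: List[str]                 # ordered word classes, e.g. ["V", "NG"]

@dataclass
class SemanticPatternTemplate:
    entity: str                             # "Function" or "Event"
    context: str                            # engineering domain, e.g. "Software"
    concepts: List[str]                     # addressed process ontology concepts

@dataclass
class SemanticPatternDescription:
    template: SemanticPatternTemplate
    pairs: List[Tuple[LexicalStructure, Callable]]   # (lexical structure, analysis rule)

def rule_task_process_object(terms: Dict[str, str]) -> Dict[str, str]:
    # analysis rule body: bind the verb to Task and the noun (group) to Process Object
    return {"Task": terms["V"], "Process Object": terms.get("N") or terms.get("NG")}

template = SemanticPatternTemplate("Function", "Software", ["Task", "Process Object"])
L1 = LexicalStructure("L1", ["V", "N"])
L2 = LexicalStructure("L2", ["V", "NG"])
S = SemanticPatternDescription(template, [(L1, rule_task_process_object),
                                          (L2, rule_task_process_object)])

print(S.pairs[0][1]({"V": "Define", "N": "Goal"}))
# -> {'Task': 'Define', 'Process Object': 'Goal'}
```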


Analysis Rules. Analysis rules evaluate natural language expressions against predefined lexical structures and instantiate one or several semantic pattern templates. If the lexical structure of a natural language expression corresponds to a lexical structure associated with a semantic pattern template, the semantic pattern template is instantiated by assigning lexical terms to the addressed process ontology concepts of the template. Consider the natural language expression “[Verb]: Define [Noun]: Requirements” used for naming an EPC function. It matches the lexical structure [Verb_Task] [Noun_ProcessObject] of the semantic pattern template Function(Context)[Task; Process Object]. A defined analysis rule for this semantic pattern template maps the lexical term “Define” to the process ontology concept Task and “Requirements” to the process ontology concept Process Object, and instantiates the semantic pattern template Function(Software)[Task: “Define”; Process Object: “Requirements”]. An analysis rule is specified by a precondition and a body separated by a “→”. The precondition consists of the operator MATCH, whose parameters represent (1) a predefined lexical structure, (2) logical expressions (e.g. Preposition=”for”) and (3) a list of lexical terms (E = {T_1, …, T_n} or F = {T_1, …, T_n}) extracted by the term normalizer. The body denotes an action that generates one or several instantiated semantic pattern templates. Analysis rules are also used to determine the semantics of parameters. The semantics of a parameter depends on the preposition associated with a noun. For instance, the rule R: IF MATCH([V_Task] [N1_ProcessObject] [Preposition=”FOR”] [N2], F = {T_1, …, T_n}) → GENERATE(Function(Context)[Task: V; Process Object: N1; Target Direction Parameter: N2]) generates an instantiated semantic pattern having a Target Direction Parameter. A Source Direction Parameter is determined by the prepositions “FROM” or “OF”, a Target Direction Parameter by the prepositions “FOR”, “ON” or “IN”, a Means Parameter by the preposition “WITH”, and a Dependency Parameter by the preposition “UPON”. Another feature is the setting of state information, which is required for the semantic analysis of EPC events. State information indicates an attribute for a process object with an assigned attribute value. For example, the state information “Quality Goals Defined” assigns the attribute “Defined” to the process object “Quality Goal” with the boolean value true. This is realized by the rule R: IF MATCH([N_ProcessObject] [PV_State], E = {T_1, …, T_n}) → GENERATE(Event(Context)[State: PV; State Value: “True”; Process Object: N]). Analysis rules also play a vital role in resolving the semantics of natural language expressions that address (1) more than one task or state, (2) more than one process object, (3) more than one parameter or (4) a combination of these. To illustrate this situation, let us consider the following example: F(Identify And Analyze Quality Goal). This function specifies the two tasks “Identify” and “Analyze”, connected via an “And” conjunction, which are executed on the process object “Quality Goal”. For capturing the process semantics of this function in the process ontology, two semantic pattern instances are generated by applying the analysis rule R: IF MATCH([V1_Task] [Con=”AND”] [V2_Task] [NG_ProcessObject], F = {T_1, …, T_n}) → {GENERATE(Function(Software)[Task: V1; Process Object: NG]), GENERATE(Function(Software)[Task: V2; Process Object: NG])}.
This analysis rule results in the two instantiated semantic pattern templates Function(Software)[T: ”Identify”; PO: ”Quality Goal”] and Function(Software)[T: ”Analyze”; PO: ”Quality Goal”].
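The following sketch imitates such a MATCH/GENERATE rule for the “and” conjunction case; the word-class abbreviations and the dictionary layout of the generated instances are assumptions made for the example:

```python
def match(tagged, expected_classes):
    # MATCH precondition: the expression exhibits the expected sequence of word classes
    return [word_class for _, word_class in tagged] == expected_classes

def analyze_and_conjunction(tagged, context="Software"):
    """Analysis rule for [V1] [Con="AND"] [V2] [NG]: one pattern instance per task,
    both executed on the same process object."""
    if not match(tagged, ["V", "Con", "V", "NG"]) or tagged[1][0].lower() != "and":
        return []
    v1, _, v2, ng = [word for word, _ in tagged]
    return [
        {"entity": "Function", "context": context, "Task": v1, "Process Object": ng},
        {"entity": "Function", "context": context, "Task": v2, "Process Object": ng},
    ]

tagged = [("Identify", "V"), ("And", "Con"), ("Analyze", "V"), ("Quality Goal", "NG")]
for instance in analyze_and_conjunction(tagged):
    print(instance)
```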


Common Semantic Patterns. By a manual analysis of about 5,000 EPC functions/events in engineering domains, we gained the insight that the suggested naming conventions do not fully cover the implicit process semantics of the natural language expressions used for naming EPC functions/events. As an additional contribution of this paper, we introduce a set of semantic pattern descriptions resulting from our investigations. Tables 1 and 2 summarize these semantic pattern descriptions for EPC functions/events. They use the following set of abbreviations: {V := Verb, N := Noun, NG := Noun Group, PO := Process Object, T := Task, Con := Conjunction, P := Preposition, F := Function, C := Context, TDP := Target Direction Parameter, SDP := Source Direction Parameter, MP := Means Parameter, DP := Dependency Parameter}.

Table 1. Semantic Pattern Descriptions for EPC Functions

Template Function(Context)[Task; Process Object]

- Lexical Structure: LS1 := [V] [N]
  Example: F(Define Goal)
  Analysis Rule: IF MATCH([V] [N], F) → GENERATE(F(C)[T:V; PO:N])
  Pattern Template Instance: SP1 := Function(Software)[T:”Define”; PO:”Goal”]

- Lexical Structure: LS2 := [V] [NG]
  Example: F(Define Quality Goal)
  Analysis Rule: IF MATCH([V] [NG], F) → GENERATE(F(C)[T:V; PO:NG])
  Pattern Template Instance: SP1 := Function(Software)[T:”Define”; PO:”Quality Goal”]

- Lexical Structure: LS3 := [V1] [C] [V2] [N]
  Example: F(Identify And Analyze Goal)
  Analysis Rule: IF MATCH([V1] [Con=”AND”] [V2] [N], F) → {GENERATE(F(C)[T:V1; PO:N]), GENERATE(F(C)[T:V2; PO:N])}
  Pattern Template Instances: SP1 := Function(Software)[T:”Identify”; PO:”Quality Goal”]; SP2 := Function(Software)[T:”Analyze”; PO:”Goal”]

- Lexical Structure: LS4 := [V] [NG1] [C] [NG2]
  Example: F(Define Quality Goal And Quality Measure)
  Analysis Rule: IF MATCH([V] [NG1] [Con=”AND”] [NG2], F) → {GENERATE(F(C)[T:V; PO:NG1]), GENERATE(F(C)[T:V; PO:NG2])}
  Pattern Template Instances: SP1 := Function(Software)[T:”Define”; PO:”Quality Goal”]; SP2 := Function(Software)[T:”Define”; PO:”Quality Measure”]

Template Function(Context)[Task; Process Object; Parameter]

- Lexical Structure: LS4 := [V] [NG1] [P] [NG2]
  Example: F(Derive Quality Goal From Specification Document)
  Analysis Rule: IF MATCH([V] [NG1] [P=(”FROM” | ”OF”)] [NG2], F) → GENERATE(F(C)[T:V; PO:NG1; SDP:NG2])
  Pattern Template Instance: SP1 := Function(Software)[T:”Derive”; PO:”Quality Goal”; SDP:”Specification Document”]

- Lexical Structure: LS4 := [V] [NG1] [P] [NG2]
  Example: F(Define Quality Goal For Project Plan)
  Analysis Rule: IF MATCH([V] [NG1] [P=(”FOR” | ”ON” | ”IN”)] [NG2], F) → GENERATE(F(C)[T:V; PO:NG1; TDP:NG2])
  Pattern Template Instance: SP1 := Function(Software)[T:”Define”; PO:”Quality Goal”; TDP:”Project Plan”]

- Lexical Structure: LS4 := [V] [N1] [P] [N2]
  Example: F(Rework Specification With Customer)
  Analysis Rule: IF MATCH([V] [N1] [P=”WITH”] [N2], F) → GENERATE(F(C)[T:V; PO:N1; MP:N2])
  Pattern Template Instance: SP1 := Function(Software)[T:”Rework”; PO:”Specification”; MP:”Customer”]

- Lexical Structure: LS4 := [V] [NG1] [P] [NG2]
  Example: F(Decide Quality Measure Upon Review Status)
  Analysis Rule: IF MATCH([V] [NG1] [P=”UPON”] [NG2], F) → GENERATE(F(C)[T:V; PO:NG1; DP:NG2])
  Pattern Template Instance: SP1 := Function(Software)[T:”Decide”; PO:”Quality Measure”; DP:”Review Status”]

168

Table 2. Semantic Pattern Descriptions for EPC Events

4.4 Ontology Instance Generator

The ontology instance generator populates the reference ontology by adding or updating instances of the predefined concepts and relations in the lexical and in the process knowledge base. The instantiated semantic patterns generated by the semantic pattern analyzer are the input for the ontology instance generator. This step concludes with the establishment of the semantic linkage between EPC functions/events and the concerned reference ontology instances.
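A rough sketch of this population and linking step is given below; the dictionary-based storage and the relation name annotatedWith are illustrative assumptions, not the actual BPI data model:

```python
def populate(reference_ontology, epc_element_id, pattern_instance):
    """Add missing instances to the knowledge base, record their lexical
    counterparts, and link the EPC function/event to the resulting instances."""
    links = []
    for concept, term in pattern_instance.items():
        oid = reference_ontology["lexicon"].get(term.lower())
        if oid is None:                                   # unknown term: create a new instance
            oid = len(reference_ontology["instances"]) + 1
            reference_ontology["instances"][oid] = {"concept": concept, "label": term}
            reference_ontology["lexicon"][term.lower()] = oid
        links.append((epc_element_id, "annotatedWith", oid))
    reference_ontology["annotations"].extend(links)
    return links

ontology = {"lexicon": {}, "instances": {}, "annotations": []}
print(populate(ontology, "F-42", {"Task": "Define", "Process Object": "Quality Goal"}))
```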

5 Related Work

The work presented in this paper refers to research activities involving semantic Business Process Management and Natural Language Processing.


Enhancing Business Process Management (BPM) with semantic web technologies to overcome obstacles in automated processing has triggered a new wave in research and practice (e.g. [8], [9]). Process ontology design is a well-established field of research comprising many distinguished approaches. The most important are the following: the Business Process Management Ontology (BPMO) is a fully-fledged semantic business process modeling framework [24]. Semantic EPC (sEPC) [9] has emerged from the SUPER project [21] and aims at supporting the annotation of EPC models. Thomas and Fellmann [22] describe a similar approach that addresses the semantic annotation of EPC models. Plan ontologies such as the Dolce+DnS Plan Ontology (DPPO) [6] are founded on a theory of planning problems and on semantic descriptions of plans. The process ontology proposed in this paper primarily intends to capture the implicit semantics of natural language expressions used for naming EPC functions/events (e.g. relationships between tasks and process objects, or state information resulting from performing a task). Further, the introduced process ontology does not consider the control flow. The semantic annotation of EPC models is comparable with the approaches used by Semantic Web annotation platforms (SAPs), whose purpose is to annotate existing and new documents on the Web. SAPs can be classified according to the annotation method used. The two primary categories are pattern-based and machine-learning-based. In pattern-based approaches, “an initial set of entities is defined and the corpus is scanned to find the patterns in which the entities exist” [14]. Machine-learning-based SAPs utilize probability and induction methods. The approach presented in this paper follows the paradigm of a pattern-based analysis of natural language expressions used to describe EPC functions/events by employing semantic patterns. The idea of using semantic patterns is inspired by Rolland and Achour [16], who employ semantic patterns to extract a use case model from ambiguous textual use case descriptions. Further, the usage of patterns traces back to Hearst [7]. The underlying idea is the use of patterns whose purpose is to explicitly grasp a certain relation between words [3], [5]. In this work, instantiated semantic patterns bridge the gap between informal and formal representations. Instances of predefined semantic patterns establish the semantic linkage between EPC functions/events and instances of process ontology concepts.

6 Conclusions

A semantic annotation of EPC models yields several advantages. The resulting reference ontology represents a necessary prerequisite for the extraction of patterns from process models, as proposed by Schütte [20, p. 237]. The frequency of occurrence of process patterns provides an objective measure to evaluate candidates for common or best practice solutions. The dependencies between patterns can provide information on larger structures (reference models [18], or process variants and the configurable and generic adaptation of reference models [2]). The introduced approach shows how to perform an automated semantic annotation of EPC functions/events. It employs semantic pattern descriptions to bridge the gap between semi-formal process representations and formal reference ontologies.


Semantic pattern descriptions allow the specification of semantic pattern templates (naming conventions for EPC functions/events), lexical structures (the grammar of natural language expressions) and analysis rules (the instantiation of semantic pattern templates). Our proposal for an automated semantic annotation is limited to resolving the semantics of natural language expressions used for the description of EPC functions/events. These natural language expressions must obey basic naming conventions as suggested for the EPC modeling language: a task within an EPC function is expressed by means of a verb, and state information is indicated by a passive verb. Despite these limitations, the declarative nature of semantic pattern descriptions enables the definition of an arbitrary set of naming conventions. The definition of semantic pattern descriptions provides a mechanism to standardize the naming of EPC functions/events in a distributed modeling environment. The common semantic patterns proposed in Section 4.3 resulted from practical experience gained in a human-driven analysis of about 5,000 EPC functions/events in engineering domains. The identified semantic pattern descriptions are a first approach toward an additional standardization of the naming of EPC functions/events.

References

1. Bögl, A., et al.: Semantic Annotation of EPC Models in Engineering Domains by Employing Semantic Patterns. In: Proceedings of the 10th International Conference on Enterprise Information Systems (ICEIS 2008), Barcelona, Spain, June 12-16 (2008)
2. Becker, J., et al.: Adaptive Reference Modeling: Integrating Configurative and Generic Adaptation Techniques for Information Models. In: Becker, J., Delfmann, P. (eds.) Reference Modeling: Efficient Information Systems Design Through Reuse of Information Models, pp. 27–58. Physica, Heidelberg (2007)
3. Biemann, C.: Ontology Learning from Text: A Survey of Methods. LDV-Forum 20(2), 75–93 (2005)
4. Bögl, A., et al.: Knowledge Acquisition from EPC Models for Extraction of Process Patterns in Engineering Domains. In: Proceedings der Multikonferenz Wirtschaftsinformatik (MKWI 2008), München, Deutschland (2008)
5. Cimiano, P., et al.: Ontologies on Demand? – A Description of the State-of-the-Art, Applications, Challenges and Trends for Ontology Learning from Text. Information, Wissenschaft und Praxis 58(6-7), 315–320 (2006)
6. Gangemi, A., et al.: Task Taxonomies for Knowledge Content D07, http://www.loa-cnr.it/Papers/D07_v21a.pdf
7. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th International Conference on Computational Linguistics (COLING 1992), Nantes, France, vol. 2, pp. 539–545 (1992)
8. Hepp, M., et al.: Semantic Business Process Management: A Vision Towards Using Semantic Web Services for Business Process Management. In: IEEE International Conference on e-Business Engineering, Beijing, China, pp. 535–540 (2005)
9. Hepp, M., et al. (eds.): Proceedings on Semantic Business Process and Product Lifecycle Management, 3rd European Semantic Web Conference, Innsbruck, Austria (2007)
10. Keller, G., et al.: Semantische Prozessmodellierung auf der Grundlage Ereignisgesteuerter Prozeßketten (EPK). In: Scheer, A.-W. (Hrsg.): Veröffentlichungen des Instituts für Wirtschaftsinformatik, Heft 89, Saarbrücken, http://www.iwi.uni-sb.de/Download/iwihefte/heft89.pdf


11. Liu, H., Singh, P.: ConceptNet – A Practical Commonsense Reasoning Tool-Kit. BT Technology Journal 4(22), 211–226 (2004)
12. Moore, J., et al.: Combining and Adapting Process Patterns for Flexible Workflow. In: 11th International Conference on Database and Expert Systems Applications, London, United Kingdom, pp. 797–801 (2000)
13. Pfeiffer, D., Gehlert, A.: A Framework for Comparing Conceptual Models. In: Workshop on Enterprise Modelling and Information Systems Architectures, Klagenfurt, Austria, pp. 108–122 (2005)
14. Reeve, L., Han, H.: Survey of semantic annotation platforms. In: Proceedings of the 2005 ACM Symposium on Applied Computing, Santa Fe, New Mexico, March 13-17 (2005)
15. Basili, R., et al.: Language Learning and Ontology Engineering: an Integrated Model for the Semantic Web. In: 2nd Meaning Workshop, Trento, Italy (2005)
16. Rolland, C., Achour Ben, C.: Guiding the construction of textual use case specifications. The Data & Knowledge Engineering Journal 25(1-2), 125–160 (1998); special Jubilee issue
17. Rosemann, M.: Komplexitätsmanagement in Prozeßmodellen. Methodenspezifische Gestaltungsempfehlungen für die Informationsmodellierung. Gabler, Wiesbaden (1996)
18. Schermann, M., et al.: Fostering the Evaluation of Reference Models: Application and Extension of the Concept of IS Design Theories. In: 8th International Conference Wirtschaftsinformatik, Karlsruhe, Germany, pp. 181–198 (2007)
19. Schuette, R., Rotthowe, T.: The Guidelines of Modeling – An Approach to Enhance the Quality in Information Models. In: Ling, T.-W., Ram, S., Li Lee, M. (eds.) ER 1998. LNCS, vol. 1507, pp. 240–254. Springer, Heidelberg (1998)
20. Schütte, R.: Grundsätze ordnungsgemäßer Referenzmodellierung: Konstruktion konfigurations- und anpassungsorientierter Modelle. Gabler, Wiesbaden (1998)
21. SUPER, Integrated Project Semantics Utilized for Process Management within and between Enterprises, http://www.ip-super.org
22. Thomas, F., Fellmann, M.: Semantic Business Process Management: Ontology Based Process Modeling Using Event-Driven Process Chains. International Journal of Interoperability in Business Information Systems 1(2), 29–44 (2007)
23. W3C RDF/OWL Representation of WordNet, W3C Working Draft (June 2006), http://www.w3.org/TR/wordnet-rdf/
24. Yan, Z., et al.: BPMO: Semantic Business Process Modeling and WSMO Extension. In: International Conference on Web Services, Salt Lake City, USA, pp. 1185–1186 (2007)
25. Hagen, M., Gruhn, V.: Towards Flexible Software Processes by using Process Patterns. In: 8th International Conference on Software Engineering and Applications, Cambridge, USA, pp. 436–441 (2004)
26. Nilsson, N.J.: Problem Solving Methods in Artificial Intelligence. McGraw-Hill, New York (1971)

Part III

Information Systems Analysis and Specification

Tool Support for the Integration of Light-Weight Ontologies

Thomas Heer1, Daniel Retkowitz1, and Bodo Kraft2

1 Department of Computer Science 3, RWTH Aachen University, Ahornstr. 55, 52074 Aachen, Germany
{heer,retkowitz}@i3.informatik.rwth-aachen.de
2 AMB Generali Informatik Services GmbH, Anton-Kurze-Allee 16, 52064 Aachen, Germany
[email protected]

Abstract. In many areas of computer science ontologies become more and more important. The use of ontologies for domain modeling often brings up the issue of ontology integration. The task of merging several ontologies, covering specific subdomains, into one unified ontology has to be solved. Many approaches for ontology integration aim at automating the process of ontology alignment. However, a complete automation is not feasible, and user interaction is always required. Nevertheless, most ontology integration tools offer only very limited support for the interactive part of the integration process. In this paper, we present a novel approach for the interactive integration of ontologies. The result of the ontology integration is incrementally updated after each definition of a correspondence between ontology elements. The user is guided through the ontologies to be integrated. By restricting the possible user actions, the integrity of all defined correspondences is ensured by the tool we developed. We evaluated our tool by integrating different regulations concerning building design. Keywords: Knowledge management, Ontology engineering, Information integration tools, Human factors.

1 Introduction

Our approach to ontology integration has been developed in the context of the ConDes research project [1]. In this project we have developed new concepts for software tools to support the conceptual design phase in building design. Thereby a knowledge-based approach has been followed. The relevant terminology is defined in several domain-specific ontologies. Based on these ontologies, restrictions for the conceptual design of a building can be specified. Therefore, the example ontologies in this paper come from the domain of building engineering, but our approach for interactive ontology integration is applicable to many other domains. There is a broad field of different types of structures that are all subsumed by the term ontology [2]. Ontologies can be simple vocabularies, i. e. lists of terms which denote the entities of a certain domain. If a generalization relation is defined for these terms, one speaks of a taxonomy. Both of these types are called light-weight ontologies.


Light-weight ontologies define concepts, classifications of these concepts, properties and relations. In contrast to that, heavy-weight ontologies comprise further semantic information about a domain. This additional information is specified by axioms or constraints. Ontologies can describe concepts on different levels of abstraction. Ontologies which define general concepts are called upper ontologies, foundation ontologies, or top-level ontologies [3]. Ontologies which contain knowledge about a specific domain are called domain ontologies. With regard to ontologies, the term integration is used with several different semantics. Three types of ontology integration can be distinguished [4]. The first type is integration in terms of reuse. This means constructing a new ontology based on already existing ontologies, which are incorporated in the new ontology. A second type is integration in terms of merging. In this case, two or more ontologies are unified into a single ontology by merging corresponding concepts of the original ontologies. The third type is integration in terms of use. This type of integration is applied when applications are built which are based on one or more ontologies. In our approach, we use the term integration in the second sense, i. e. in terms of merging. Before merging different ontologies into one unified ontology, a prior alignment of these ontologies is required [4]. Alignment is the process in which the relations between the concepts contained in the different ontologies are determined. This is usually done by the definition of a mapping between the ontology elements. This mapping defines how the source ontologies have to be merged into one integrated ontology, so that the resulting ontology contains all the semantic information of the source ontologies, not more and not less. One difficulty in the alignment of different ontologies comes from the fact that the structure of an ontology is not only determined by the comprised knowledge, but also by the design decisions made during its development. Therefore, even ontologies which model the same part of a certain domain may be structured significantly differently. This makes the integration of the ontologies a difficult task. The paper is structured as follows. First, we give an overview of related work in section 2. Then, in section 3, we describe how to define semantic correspondences and how to generate an integrated ontology from these correspondences. The following section 4 contains the description of our developed integration algorithm. Next, in section 5, we describe how the integrity of the defined semantic correspondences is ensured. In section 6, we give a short overview of the tool we developed to implement the integration approach. Finally, we give a conclusion and an outlook on further possible developments at the end of the paper.

2 Related Work

Ontologies and ontology integration are still emerging topics in the field of computer science. Many approaches for the use and integration of ontologies have been proposed in research. In [5] different techniques for the alignment of ontologies are described. These are the manual definition of correspondences, the use of linguistic heuristics, top-level grounding and the use of semantic correspondences. These techniques are not exclusive, but rather complement each other.


The first technique requires a knowledge engineer who develops an ontology to manually define certain correspondences between the concepts of the ontologies to integrate. These correspondences mainly have the semantics of equivalence, but are not restricted to 1:1 relations. In the second method, heuristics are applied to find correspondences automatically based on linguistic features of the terms representing the concepts. The method of top-level grounding requires a common top-level ontology for all ontologies to be integrated. This top-level ontology is then used to identify related concepts and to use this information as a basis for the integration. Finally, semantic correspondences can be defined. In this method, different types of semantic relations are used to relate the concepts of the ontologies to integrate. This way, not only equivalence relations, but also relations with other semantics can be defined. In our approach, we use the techniques of top-level grounding and semantic correspondences. In [6] and [7] surveys of existing approaches to ontology alignment are presented. Both works give an overview of theoretical frameworks and several current research projects. The surveyed works range from formal and heuristic approaches to approaches which use machine learning to automate the process of ontology alignment. However, most of the presented works more or less neglect the issues involved with the interactive part of the integration process. In the following we present two examples of related works which use heuristics for the alignment of ontologies. One alternative for aligning ontologies is to consider lexical similarities between the terms which represent the defined concepts. Such a lexical integration approach is implemented by the tool Chimaera [8]. Chimaera is an environment which can be used for merging and testing ontologies. When integrating ontologies, Chimaera generates lists of suggestions for equivalent terms from the ontologies. These suggestions are based on lexical similarity measures. Besides that, Chimaera can identify parts of the class hierarchy which probably need to be reorganized. These parts are identified by means of heuristic strategies. Since Chimaera uses heuristics based on lexical analysis, the identified similarities may contain mismatches. Thus, it is necessary that the user verifies all suggestions made by the tool. However, Chimaera does not propose any solutions in case of conflicts which may arise during the integration process. In [9], an algorithm for semi-automatic merging and alignment of ontologies called PROMPT is presented. This algorithm realizes a semi-automatic integration of ontologies. The Anchor-PROMPT algorithm [10] is an extension to PROMPT. It is used to generate suggestions which are not only based on linguistic similarity, but also on structural properties of the ontologies. In the first step, PROMPT generates suggestions for correspondences between the classes of the ontologies to be integrated. These initial suggestions are based on linguistic similarities of the class names and on the structure of the ontologies. The latter is analyzed by the Anchor-PROMPT algorithm. In the next step, the user selects for each suggestion an operation to perform or defines a different operation manually. PROMPT then automatically performs the selected operation and applies additional modifications to the merged ontology, if required. Subsequently, the list of suggestions is updated and a list of conflicts which resulted from the previous operation is generated.
After this, the procedure is executed again, until no more operations have to be performed and all suggestions are processed. In [11], ontologies are used to integrate database schemata. To perform the integration, the schemata are augmented by corresponding ontologies that define the schema semantics.


These ontologies are then integrated into a global ontology from which a unified schema can be derived. To integrate the ontologies, similarity relations between schema concepts are defined. In [11] the same four types of similarity relations are used as in our ontology integration approach. It is described how the resulting integrated ontology can be derived from the source ontologies and the defined correspondences. There are different suggestions how to define the similarity relations between ontology elements, none of which is discussed in detail. One suggestion is to provide common references by using a higher-level ontology. Other possibilities are to use thesauri, experts familiar with both ontologies, or a hybrid semi-automatic method. However, no concepts are proposed for how an expert could be supported in defining the similarity relations, which includes finding corresponding elements and choosing the right relation types. Nothing is said about how to ensure the integrity of the defined correspondences, or about how to take the effects of defined correspondences on the integration result into account during the alignment of the ontologies.

3 Semantic Correspondences

In our approach of merging light-weight ontologies, semantic correspondences are used to relate the concepts of different ontologies to each other. Following an interactive, incremental process, a knowledge engineer defines correspondences between elements of the ontologies to be integrated. Based on these correspondences, an integrated ontology is automatically generated. There are four types of correspondences with different semantics: equivalence, overlap, generalization and disjointness. The chosen terms for the semantic correspondence types overlap and disjointness are rooted in the field of set theory. In our ontology integration scenario, the elements of ontologies are terms, structured in a generalization hierarchy. The terms represent concepts. Hence, correspondences between ontology elements relate concepts to each other. A concept defines a mental collection of objects or circumstances that have common attributes. This collection is called the extension of the concept [11]. Extensions are basically sets. Thus, between two extensions one of the four possible relationships between sets must hold. The extensions of two concepts can be equal or disjoint, one can be a subset of the other, or the two extensions can overlap, i. e. they have a nonempty intersection but are neither equal nor in a set-subset relation. From these four possible relations between sets the four correspondence types equivalence, disjointness, generalization and overlap are derived. In figure 1, examples of corresponding ontology elements and their extensions are shown. For example, the generalization correspondence from ladies restroom to restroom implies that the extension of the former concept is a subset of the extension of the latter concept. Informally speaking, all ladies restrooms are restrooms. The disjoint correspondence is a special case insofar as it is not explicitly represented by an edge between ontology elements. Whenever no correspondence of one of the other three types is defined between two concepts, they are implicitly defined as disjoint. The defined correspondences between ontology elements allow for the automatic generation of a merged ontology. In our approach an arbitrary number of source ontologies can be merged into one ontology. To illustrate this, we consider here the result of the integration example from section 4.
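The derivation of a correspondence type from two extensions can be sketched directly with Python sets; the concrete member identifiers are invented for the example:

```python
def correspondence_type(ext_a: set, ext_b: set) -> str:
    """Derive the correspondence type from the extensions of two concepts,
    mirroring the four possible set relations described above."""
    if ext_a == ext_b:
        return "equivalence"
    if ext_a.isdisjoint(ext_b):
        return "disjointness"
    if ext_a < ext_b or ext_b < ext_a:      # proper subset in either direction
        return "generalization"
    return "overlap"

restrooms = {"r1", "r2", "r3"}
ladies_restrooms = {"r1"}
kitchens = {"k1"}
print(correspondence_type(ladies_restrooms, restrooms))  # generalization
print(correspondence_type(restrooms, kitchens))          # disjointness
```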

Tool Support for the Integration of Light-Weight Ontologies Ontology 1

Correspondence

Ontology 2

179

Extensions toilets, restrooms ladies restrooms

restrooms

hallways

corridors

toilets

kitchens

Fig. 1. Corresponding concepts and their extensions

In figure 3 c), cutouts of three ontologies from the domain of building construction are shown along with the resulting merged ontology. Several correspondences are defined between the elements of the source ontologies, e. g. the terms toilet and restroom are defined as equivalent. The terms dining room and living room are defined as overlapping, since there are rooms that have the functionality of both, like e. g. a family room. Therefore family room is defined as a specialization of dining room. If several ontology elements are defined as equivalent, then the merged ontology only contains one representative for the equivalence class. Therefore the merged ontology in figure 3 c) contains only one element room and only the element restroom for the two equivalent ontology elements toilet and restroom. For each defined generalization correspondence a generalization edge is generated in the merged ontology. Afterwards, redundant generalization edges are removed. An overlap correspondence indicates that there are objects in the modeled domain which belong to both corresponding concepts. When specifying an overlap correspondence, the knowledge engineer who carries out the ontology integration can choose between two options. Either an ontology element representing the intersection of the overlapping concepts is generated for the overlap correspondence in the merged ontology, or the overlap correspondence is simply defined to indicate the overlapping of the two concepts, without any direct influence on the resulting ontology. In the example in figure 3 c) no ontology element is generated. During the execution of the integration procedure, the knowledge engineer selects elements of the merged ontology to define correspondences, but the correspondences are established between the elements of the source ontologies or the correspondences which generate the selected elements of the merged ontology. The merged ontology is not generated all at once after all correspondences have been defined. It is incrementally updated throughout the execution of the integration algorithm, which is described in the following section.
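One plausible way to realize the "one representative per equivalence class" rule is a union-find structure over ontology elements, sketched below; this illustrates the idea and is not the data structure used in the presented tool:

```python
class EquivalenceClasses:
    """Union-find over ontology elements: equivalence correspondences merge
    classes, and one representative per class appears in the merged ontology."""
    def __init__(self):
        self.parent = {}

    def find(self, element):
        self.parent.setdefault(element, element)
        while self.parent[element] != element:              # walk up with path splitting
            self.parent[element] = self.parent[self.parent[element]]
            element = self.parent[element]
        return element

    def declare_equivalent(self, a, b):
        self.parent[self.find(a)] = self.find(b)

eq = EquivalenceClasses()
eq.declare_equivalent(("O1", "toilet"), ("O2", "restroom"))
# Both source elements now map to the same representative in the merged ontology.
print(eq.find(("O1", "toilet")) == eq.find(("O2", "restroom")))  # True
```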

4 Integration Algorithm

In our integration approach, an arbitrary number of ontologies can be merged into one ontology which contains all the semantic information of the source ontologies. In the first step, two ontologies are integrated.

After that, all remaining ontologies are integrated one by one into the merged ontology, where each ontology is integrated with the current intermediate result. Except for the choice of terms which represent the concepts in the merged ontology, the integration result is independent of the order in which the source ontologies are integrated. This is guaranteed because the correspondences defined throughout the integration process are established between elements of the original source ontologies. The source ontologies remain unchanged in the knowledge base, and the merged ontology can be generated from the source ontologies and the defined correspondences at any time.

In our ontology integration approach there is no strict distinction between the phases of ontology alignment and merging. In the related work on ontology integration [6,7], different approaches are presented for the alignment of ontologies, which is often regarded as the first step of the integration procedure. The actual merging of the ontologies is then carried out in a second step, based on the previously defined correspondences. This way, many dependencies between the defined correspondences are not taken into account. Especially if the alignment is performed by a human being, e.g. a knowledge engineer, it is difficult for this person to oversee all effects of the defined correspondences on the integration result. Therefore we follow a different approach. The knowledge engineer always works with the current intermediate result of the ontology integration. He defines correspondences between elements of this merged ontology. Whenever the knowledge engineer defines a correspondence, the merged ontology is immediately updated, so that he can directly see the effects of his action. Thus, ontology alignment and merging steps alternate throughout the integration process.

Defining semantic correspondences between ontology elements is a difficult task [12]. It is difficult to identify those ontology elements which are related. Especially in the case of large ontologies, a knowledge engineer without any guidance often does not know where to find corresponding elements. If corresponding elements have been identified, it is often hard to decide which type of correspondence should be established. Sometimes correspondence types are chosen which conflict with other correspondences. This happens because the effects of previously defined correspondences are hard to oversee.

Our ontology integration approach provides solutions to the aforementioned problems. It aims at providing tool support for the interactive integration of ontologies. The effects of defined correspondences are taken into account, on the one hand, by the fact that the knowledge engineer always works with the current intermediate result. On the other hand, defined correspondences restrict the possibilities for defining new correspondences. The problems of finding corresponding ontology elements and defining the correspondences in the right order are addressed by two aspects of our integration algorithm. First, a restrictive traversal order of the merged ontology is enforced. Second, the ontology elements for which correspondences can be defined at a certain point in the integration process are restricted to a manageable number. This way the knowledge engineer is guided through the merged ontology, and his attention is focused on a relatively small part of the possibly large ontology. Our integration approach relies on the assumption that all ontologies use a common top-level ontology.

Fig. 2. Highlighting of ontology elements

When integrating two ontologies, one can assume that the roots are equivalent concepts. Thus, in a first step a top-level grounding of the two ontologies is performed. After the definition of equivalence correspondences between the roots of the ontologies, a first version of the merged ontology is generated. In this merged ontology the corresponding ontologies are combined by unifying their roots. After the top-level grounding, an adapted breadth-first traversal is performed. The traversal is steered by the defined correspondences. At each point during the integration of two ontologies, some elements are highlighted. The knowledge engineer is only allowed to define correspondences between these highlighted elements.

In figure 2, the highlighting of ontology elements depending on the type of a previously defined correspondence is shown. For example, in figure 2 a) an equivalence correspondence has been established between the ontology elements A and 1. At a later point in the integration procedure, the knowledge engineer is asked to define correspondences between the highlighted elements B, C, 2 and 3. In figure 2 b), a generalization correspondence between A and 1 has been established. The ontology element A is defined to be a specialization of 1. Hence, in a later step the relationships between A and the specializations of 1, namely 2 and 3, have to be clarified.

An overlap correspondence constitutes a special case. It provides the least information about the relationships of the specializations of the linked concepts. Thus, the highlighting of ontology elements is conducted in three steps. In the first step, one of the corresponding elements and the specializations of the other are highlighted, as shown in figure 2 c). In this step it is not allowed to define equivalence correspondences between A and either 2 or 3, because this would imply that A is a generalization of 1.

For the same reason, it is furthermore not allowed to define a generalization correspondence with A as source and one of the specializations of 1 as target. In a second step, the highlighting of the ontology elements is the other way round from the first step, while the same restrictions apply. This situation is depicted in figure 2 d). While in figure 2 c) the case of an overlap correspondence without generation (indicated by the dashed arrows) of an ontology element is shown, in figure 2 d) the element for the intersection is generated. This element is not highlighted in steps one and two, because its relationships to the overlapping concepts are already defined. Finally, in the third step all specializations of the overlapping concepts are highlighted, including a potentially generated element, to give the knowledge engineer the opportunity to clarify their relationships. This situation is not depicted in figure 2.

Figure 3 shows some steps of the integration algorithm by example. Figure 3 a) shows the situation directly after the top-level grounding of ontology 1 and ontology 2, as explained earlier. The root concepts of the ontologies are linked by an equivalence correspondence. Hence, the generated merged ontology depicted in figure 3 a) on the right contains only one element room, and all specializations of room from the two ontologies are children of this root element. The elements sanitary room, toilet and dining room are highlighted. In the following, the knowledge engineer defines a generalization correspondence from toilet to sanitary room. This is the only correspondence he defines for the highlighted elements, and thus the algorithm proceeds. Because of the defined generalization correspondence, in figure 3 b) the ontology elements restroom and toilet are highlighted. The knowledge engineer defines an equivalence correspondence between these elements. In this case, the generalization correspondence is modified so that it references the equivalence correspondence instead of the ontology element toilet. This is depicted in figure 3 c). Figure 3 c) shows the final situation, in which a third ontology has been integrated with the other two ontologies. The resulting merged ontology contains all the semantic information of the source ontologies.
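As an illustration of the traversal, the sketch below (ours; a simplified reading of the rules in figure 2 a) and b), not the complete behaviour of the tool) computes which elements are highlighted next after an equivalence or generalization correspondence between elements a and b has been defined.

```python
# Simplified reading of the highlighting rules of figure 2 a) and b);
# children(x) returns the direct specializations of an element (hypothetical data).

def children(x):
    return {"A": {"B", "C"}, "1": {"2", "3"}}.get(x, set())

def highlight_next(corr_type, a, b):
    if corr_type == "equivalence":
        return children(a) | children(b)   # fig. 2 a): relate the specializations of both
    if corr_type == "generalization":
        return {a} | children(b)           # fig. 2 b): a is the specialization of b
    raise ValueError("overlap is handled in three separate steps (fig. 2 c, d)")

print(highlight_next("equivalence", "A", "1"))      # {'B', 'C', '2', '3'} (order may vary)
print(highlight_next("generalization", "A", "1"))   # {'A', '2', '3'}
```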

5 Correspondence Integrity

Many tools for the integration of ontologies generate suggestions for possible semantic correspondences between ontology elements, but provide no assistance for choosing the right correspondences in the right order. In our view, the main functionality of a tool which provides support for the alignment of ontologies should be to ensure the integrity of user-defined semantic correspondences. The integrity is ensured if no conflict exists between the defined correspondences regarding their semantics.

In our ontology integration approach, the integrity of defined correspondences is ensured by several means. Some of these have already been described in section 4. The breadth-first traversal order ensures that correspondences between general concepts of the ontologies are defined before correspondences between their specializations. The former restrict the possibilities for defining the latter. The incremental integration approach reduces the possibility of defining inconsistent correspondences, because changes like the unification of equivalent elements are immediately performed on the merged ontology. The highlighting of ontology elements and the restrictions which hold for defining correspondences between them also contribute to ensuring the integrity of defined correspondences, as described in section 4.

Fig. 3. Example steps of the integration procedure (cutouts of ontologies 1-3 and the resulting merged ontology in steps a)-c); edge labels: e = equivalence, g = generalization, o = overlap)

These restrictions can be seen as static restrictions, as they do not depend on previously defined correspondences between the highlighted elements and their generalizations. However, when defining correspondences between the highlighted ontology elements, all previously defined correspondences between the highlighted elements, between their generalizations, and between the former and the latter have to be taken into account. Thereby, we consider the type of the correspondence to be defined as well as the types of previously defined correspondences. Restrictions which arise through this consideration are called dynamic restrictions. To determine all dynamic restrictions for a certain correspondence type, we examined the extensions of the concepts that would be linked by the new correspondence. In figure 4, one example case is shown for each correspondence type in which the definition of a correspondence is prohibited. The restrictions result from the relationships of the concepts' extensions.

Fig. 4. Examples for dynamic restrictions for correspondences: a) a case in which an equivalence correspondence between 1 and 2 is not allowed, b) a case in which a generalization correspondence from S to G is not allowed, c) a case in which an overlap correspondence between 1 and 2 is not allowed, each shown together with the possible relations of the concepts' extensions

Crossed-out correspondences mean that in the depicted case there must not be a correspondence of the given type; e.g. the pattern in figure 4 b) includes all cases where there is no generalization correspondence from S to X. The example in figure 4 b) is a case in which no generalization correspondence can be established between the ontology elements G and S, where G should be defined as the more general concept and S as its specialization. The ontology element G is a specialization of another ontology element X, while S is not. On the right side, the possible relations between the extensions of G, S and X are shown. The extensions of X and S can be disjoint, they can overlap, or the extension of X can be a subset of the extension of S. However, in any case it is impossible that the extension of S is a subset of the extension of G. Hence, a generalization correspondence from S to G is not allowed. The fewest restrictions apply to the definition of an overlap correspondence, because it provides the least information about the relation of the linked concepts and is thus compatible with most other correspondences.

Whenever a knowledge engineer is about to define a new correspondence, all dynamic and static restrictions are checked. If some restriction would be violated by establishing the correspondence, the corresponding user action is prohibited by our tool and the user is informed about the reason for the denial.

Our integration algorithm can be extended by using heuristics for finding correspondences between the highlighted ontology elements. These could be linguistic similarity measures as in [8], or heuristics which take the structure of the ontology graph into account, as in [10]. In this way, it could be possible to achieve a high degree of automation of the integration process. Correspondences between highlighted ontology elements would be generated automatically if their similarity measure exceeded a certain threshold and their definition did not violate any restrictions. User interaction would only be required in undecidable cases. We did not follow this approach, because our main focus lay on the support for interactive ontology integration.
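As a sketch of how such a dynamic restriction can be checked mechanically (our simplification, covering only the case of figure 4 b) and only directly defined generalizations), a proposed generalization correspondence from S to G can be rejected whenever G is already known to be a specialization of some element X for which no generalization from S is known.

```python
# Checks only the dynamic restriction of figure 4 b): a generalization
# correspondence from S to G is prohibited if G has a (direct) generalization X
# for which no generalization from S is known.  `known` is a hypothetical set
# of (specialization, generalization) pairs defined so far.

def generalization_allowed(s, g, known):
    generalizations_of_g = {x for (spec, x) in known if spec == g}
    generalizations_of_s = {x for (spec, x) in known if spec == s}
    return generalizations_of_g <= generalizations_of_s

known = {("G", "X")}
print(generalization_allowed("S", "G", known))                 # False: S -> G is rejected
print(generalization_allowed("S", "G", known | {("S", "X")}))  # True: no conflict remains
```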


6 Tool Support

We implemented our approach by developing a graph-based tool for ontology integration. The underlying data structure for the representation of ontologies is a graph. All ontologies to be integrated are subgraphs of a common host graph. The host graph is stored in a GRAS database [13], which is a specialized database for graph structures. We used the graph rewriting system PROGRES [14] to specify transformations on the host graph, which contains, besides the representations of the source ontologies, also the correspondences and the resulting merged ontology. The application logic, which constitutes the core part of the integration tool, was generated from this specification. The graphical user interface of the integration tool was realized by means of the UPGRADE framework [15]. Our visual graph-based tool provides an abstraction of the internal data structure and a user-friendly, problem-adequate representation.

We evaluated the applicability of our approach and the efficiency of our integration tool by merging several large ontologies from the domain of building design. The merged ontologies contained the relevant concepts for the definition of knowledge for the conceptual design of the university hospital in Aachen. In figure 5, a screenshot of the integration tool is depicted.

Fig. 5. Integration tool

The graphical user interface is divided into two views. Throughout the integration process, the upper view shows a part of the intermediate merged ontology. In this view the user can select highlighted ontology elements, and he can define correspondences between these by clicking on the corresponding toolbar buttons or context menu entries. The lower view of the integration tool shows the original source ontologies together with the correspondences which have been defined between their elements. In this view, a knowledge engineer can inspect how the correspondences are actually established between the elements of the source ontologies.

The integration proceeds as follows. The knowledge engineer selects the ontologies to be integrated via a dialog. After that, the alignment and merging of the first two ontologies takes place. After the first two ontologies have been integrated, the views are updated to the intermediate result of the integration of the first three ontologies, and so on, until all ontologies are merged into one. Whenever there are steps in the integration process in which the user cannot take any action, e.g. because no more correspondences are allowed for the highlighted elements, these steps are automatically skipped and the algorithm proceeds.

7 Conclusions

In this paper we presented a novel approach for interactive ontology integration. Several different ontologies can be merged into one. The ontologies are integrated one by one, while the structure of the resulting merged ontology is independent of the order in which they are integrated. The alignment of the ontologies relies on the definition of semantic correspondences between their elements. These correspondences are manually defined by a knowledge engineer.

The knowledge engineer is supported by our graph-based tool in many ways. Alignment and merging steps alternate throughout the integration process. The intermediate result of the integration is immediately updated after each definition of a correspondence. That way, the effects of defined correspondences on the integration result become directly visible. The user is guided through the merged ontology, and his attention is focused on small parts of the ontology where he has to define new correspondences. In this way, the common problems of finding corresponding ontology elements and defining the correspondences in the right order are substantially reduced. The integrity of all defined correspondences is ensured, because actions that would violate it are prohibited by the tool. Thereby, all restrictions that arise from previously defined correspondences are taken into account. We identified static and dynamic restrictions which may prohibit the definition of certain correspondences, and motivated these restrictions by looking at the extensions of corresponding concepts.

The integration algorithm can be extended by using heuristics for generating suggestions for correspondences. Thereby, a high degree of automation of the integration process could be achieved. The combination of the two approaches – the calculation of suggestions for correspondences using heuristics on the one hand, and the restriction of possible correspondences on the other hand – would probably enable the tool to make reliable estimations about the correct correspondences between ontology elements. So far, our focus has been on the support for interactive ontology integration. Nevertheless, it would be a promising approach to combine our concepts with approaches for automatic ontology integration.


References

1. Kraft, B.: Semantische Unterstützung des konzeptuellen Gebäudeentwurfs. Dissertation, RWTH Aachen University, Aachen (2007)
2. Gómez-Pérez, A., Fernández-López, M., Corcho, O.: Ontological Engineering: With Examples from the Areas of Knowledge Management, e-Commerce and the Semantic Web. Springer, Heidelberg (2004)
3. Guarino, N.: Formal Ontology and Information Systems. In: Guarino, N. (ed.) Proc. of the 1st Intl. Conf. on Formal Ontology in Information Systems (FOIS 1998), pp. 3–15. IOS Press, Amsterdam (1998)
4. Pinto, H.S., Gómez-Pérez, A., Martins, J.P.: Some Issues on Ontology Integration. In: Benjamins, V.R., Chandrasekaran, B., Gómez-Pérez, A., Guarino, N., Uschold, M. (eds.) Proc. of the IJCAI 1999 Workshop on Ontologies and Problem-Solving Methods: Lessons Learned and Future Trends (KRR5), Aachen, Department of Computer Science 5, RWTH Aachen. CEUR Workshop Proceedings, vol. 18, pp. 7/1–7/12 (1999)
5. Wache, H., Vögele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann, H., Hübner, S.: Ontology-Based Integration of Information – A Survey of Existing Approaches. In: [16], pp. 108–117
6. Kalfoglou, Y., Schorlemmer, M.: Ontology Mapping: The State of the Art. The Knowledge Engineering Review 18, 1–31 (2003)
7. Euzenat, J.: State of the Art on Ontology Alignment. Technical Report D2.2.3, Knowledge Web Consortium (2004)
8. McGuinness, D.L., Fikes, R., Rice, J., Wilder, S.: An Environment for Merging and Testing Large Ontologies. In: Giunchiglia, F., Selman, B. (eds.) Proc. of the 17th Intl. Conf. on Principles of Knowledge Representation and Reasoning (KR 2000), pp. 483–493. Morgan Kaufmann Publishers, San Francisco (2000)
9. Noy, N.F., Musen, M.A.: PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment. In: Engelmore, R., Hirsh, H. (eds.) Proc. of the 12th Conf. on Innovative Applications of Artificial Intelligence (IAAI 2000), Orlando, Florida, pp. 450–455. AAAI Press, Menlo Park (2000)
10. Noy, N.F., Musen, M.A.: Anchor-PROMPT: Using Non-Local Context for Semantic Matching. In: [16], pp. 63–70
11. Hakimpour, F., Geppert, A.: Resolving Semantic Heterogeneity in Schema Integration: an Ontology Based Approach. In: Proc. of the 2nd Intl. Conf. on Formal Ontology in Information Systems, pp. 297–308. ACM Press, New York (2001)
12. Klein, M.: Combining and Relating Ontologies: An Analysis of Problems and Solutions. In: [16], pp. 53–62
13. Kiesel, N., Schürr, A., Westfechtel, B.: GRAS, A Graph-Oriented Database System for (Software) Engineering Applications. Information Systems 20, 21–51 (1995)
14. Schürr, A., Winter, A., Zündorf, A.: The PROGRES approach: Language and environment. In: Ehrig, H., Engels, G., Kreowski, H.J., Rozenberg, G. (eds.) Handbook on Graph Grammars and Computing by Graph Transformation: Applications, Languages, and Tools, vol. 2, pp. 487–550. World Scientific, Singapore (1997)
15. Böhlen, B., Jäger, D., Schleicher, A., Westfechtel, B.: UPGRADE: Building Interactive Tools for Visual Languages. In: Callaos, N., Zheng, B., Kaderali, F. (eds.) Proc. of the 6th World Multiconference on Systemics, Cybernetics, and Informatics (SCI 2002), pp. 17–22. TPA Publishing, New York (2002)
16. Gómez-Pérez, A., Gruninger, M., Stuckenschmidt, H., Uschold, M. (eds.): Proc. of the IJCAI–01 Workshop on Ontologies and Information Sharing, Orlando, Florida. AAAI Press, Menlo Park (2001)

Business Process Modeling for Non-uniform Work

Kimmo Tarkkanen

Work Informatics, Department of Information Technology
Joukahaisenkatu 3-5, FI-20014 University of Turku, Finland
[email protected]

Abstract. Business process and workflow models play an important role in developing information system integration and later in training its usage. New ways of working and information system usage practices are designed with as-is and to-be process models, which are implemented into system characteristics. However, after the IS implementation the work practices may become differentiated. A variety of work practices on the same business process can have unexpected and harmful social and economic consequences in an IS-mediated work environment. This paper employs grounded theory methodology and a case study to explore non-uniformity of work in a retail business organization. By differentiating two types of non-uniform work tasks, the paper shows how process models were designed with less effort, yet maintaining the amount of uniformity required by the organization and the support for employees' uniform actions. In addition to process model designers, the findings help organizations struggling with the consistency of IS use practices to separate the practices that may turn out most harmful from the practices that are not worth altering.

Keywords: Business process modeling, Model re-engineering, IS use, Uniformity of work practices.

1 Introduction

Common and ever-growing IS solutions of the last decades have been ERP systems, which integrate different business functions under a shared application and database. ERP systems embody expectations of organizational integration and uniformity, as the systems are based on standardization and centralization of both work processes and data. Organizational formalisms, such as process models, are required for designing these standardized business operations and integrated information systems. However, designs and descriptions of work practices always tend to be more or less incomplete [1]. Too vague work descriptions may not act as a guide for a worker or give enough operational support. On the other hand, very detailed models may be too restricting and ruling for the worker in everyday routine tasks or when an exceptional situation is confronted. Incompleteness of process models can also result in computer applications which have functions and data fields that are not needed or used in situated work. Similarly, system functionality can be insufficient and incomplete for work task accomplishment.


The best practices implemented into an ERP system may not have substance in the line of business they are designed for, and later these practices need to be refuted and amended locally [2]. Luckily, information system users are able to work around the system [3] and reconstruct the planned sequence of actions to match their actual work process [4]. Without these accommodating employees, computing and work performance would degrade very rapidly at significant organizational cost [3]. By acting irrationally with the computer, users actually make systems more usable locally. Thus, deviations from planned work actions are not always harmful, but an essential and inherent part of work activity.

Workarounds and unexpectedly acting workers, as well as those who act according to guidelines, together constitute an occurrence of non-uniformity – a group of people with minor or major differences in their work practices. Such non-uniformity of computer-mediated work practices has been found to imply unexpected results [5, 6, 7, 8, 9]. Significant and harmful differences in information system use emerged both between employees and between work communities [5]. Different business units may also vary in their processes and data after enterprise systems-enabled integration [10]. Non-uniformity implied problems in individual work, in cooperation within work communities, in organizational coordination activities and in the evaluation of the state of affairs [11]. The productivity of work and the usefulness of system data can weaken considerably due to non-uniform system usage [8]. The disadvantages of non-uniformity show that system use and system development need to be directed toward the goal of supporting system usage by group members so that their actions are congruent with each other [9].

Related attempts have evolved continuously throughout the years of the computerization era. The need for flexible and adaptive systems, system models and business processes is one realization of that attempt. For example, process modeling theory has long searched for adequate formality, granularity, precision, prescriptiveness and fitness of the models [12] with different languages and approaches. This is complicated, because computer-mediated work is human work, which is always shaped by freedom, opportunism and the recreation capabilities of rationality and norms [13, 14]. Non-uniform acts are more a rule than an exception in a computer-mediated cooperative organizational environment. In the IS development phase, as-is process models represent these non-uniform acts, while to-be models typically seek to determine organizationally uniform best practices. Neither type of model can erase the occurrences of non-uniformity, but this paper asks if the models and the modeling practice can be adjusted to consider the non-uniformity of work.

This paper focuses first on identifying different non-uniform work practices and their causes and consequences within a case organization. Before introducing and modeling the non-uniform work practices of the case organization, the next section discusses the research methodology and the data collection and analysis methods. Lastly, the paper shows how process models represented these non-uniform work practices of the case organization and draws conclusions for the modeling practice to adapt to non-uniform work activity.

2 Research Methodology

The research was conducted as a cross-sectional, although long-standing, case study that allows comparisons of data collected with different methods [15].

The case study was approached with the grounded theory methodology, in order to reject a priori theorizing and to use an iterative process of constant comparison between data incidents, emerging concepts and conceptual categories [16]. The research site was chosen with respect to the grounded theory methodology. The case organization is one of the leading retail trade companies in Finland and the Baltic countries. The empirical findings were collected from one sub-unit of the organization: the unit of agricultural retail trade, named here Agro. Two years ago, the organization introduced a new organization-wide information system. This new ERP system was to cover all of the organization's business areas and units. It was aimed at managing both processes and data on a daily basis. The system was in the go-live phase when the research started. This suited the research setting well, as the concerns were targeted at daily and routine work practices and the organizational impacts of non-uniformity. As the Agro unit is part of a larger organization, it was positioned to follow the rules of organizational standardization and change.

2.1 Data Collection

This study views information system use as an inseparable part of work activity [17]. The scope of the study is work in its richness and entirety, which may or may not involve information system use as a part of the performance. The implication is that, in order to relate the study to the IS discipline, the data collection must be extended to the organizational formalisms that determine information system usage in business processes. These include the information system's user instructions, quality systems, business process models and other guidelines for organizing and managing work on different organizational levels. In the first place, this collected material guides the study to concentrate on work processes which, in theory, should involve information system use actions regardless of the fact that computers may not be used in situated work actions. Secondly, it gives an understanding of the organizationally documented and intended way to accomplish the work processes and their expected results. Thirdly, the material plays a critical role in determining which practices should be noted as uniform or non-uniform from the organizational point of view.

Next, the data collection proceeded through observations, recorded interviews and informal discussions. The total number of recorded interviews was 26, and in connection with these the work of 18 different clerks was observed. Interviews and observations took place in 7 different grocery stores of Agro around Finland. Certain interview themes were repeated, which included basic questions about job description, work duties and responsibilities, as well as communication patterns related to work processes. Most important was to document the current work practice. Employees were allowed and encouraged to accomplish their routine work tasks during the interviews. Due to this, interviews turned into observation situations, and contextual inquiry then took place.

2.2 Data Analysis

Data collection and analysis occurred iteratively. After an interview, the recorded data was transcribed. The transcribed data provided insights into employees' situated work practices, the problems they faced, and their opinions about their work and practices.


Based on the transcribed data, workflow models of every work process and process instance discussed in the interviews were modeled. As regards the work practices, the purpose of the modeling was to reveal a) differences between employees' situated work practices and b) differences between situated and organizationally documented work practices. Firstly, the data analysis focused on comparing the modeled practices and revealing non-uniformity in them. The constant comparison of employees' work practices directed subsequent data collection. After identifying a difference between practices, data collection and analysis focused on the causes and consequences of those practices. Evaluations of the positive and negative impacts of non-uniform practices were based on the experiences of the clerks and other stakeholders of the work. The evaluation proceeded from the individual to the group, unit and organizational levels. A gradual evaluation offers a way to analyze problems encountered in the use situation of information technology [18].

As regards the process modeling itself, the focus of the analysis is to find out how the observed non-uniform activities emerge from the diagram and method point of view. After identifying and modeling non-uniform practices, the occurrences of different practices were captured into one model. The modeling technique adopted resembled the one applied by Agro themselves during the ERP implementation. The models were built on three abstraction levels. A first-level diagram shows only hand-off situations, meaning that each time an actor is involved in the process it is shown with a single rectangle [19]. Thus, this level focuses only on workflows from one actor to another. Second-level diagrams show significant milestones and decisions while an actor has the work, but not details of how the actor should do the tasks [19]. In general, second-level diagrams represent tasks that cannot be excluded in order to achieve the intended result of the process. The third level adds more details and logic to the diagrams and contains the individual steps leading up to a certain milestone [19].

3 Case Description

There were no prior restrictions on which specific work processes were to be studied. Data collection and analysis led to the identification of seemingly typical and frequently executed work processes throughout the Agro organization. Work processes related to the clerks' purchasing and selling activities soon became central and got the focus of the study. A business process called 'direct delivery' was one of these processes, as it combines both selling and purchasing transactions and flows through the different levels of the organization.

Direct delivery is a special kind of sales process in which Agro acts as an agency, a retail dealer, between its end customers (e.g. farmers) and product suppliers. In the direct delivery process, the company delivers the products, like oil, cattle feed or a tractor, from a supplier to a customer without any warehousing. Customer relationships are important, as the farmers are both sellers and buyers from the Agro point of view. Farmers usually do business with the same store and the same clerk, who serves a certain geographical area. Figure 1 represents the organizationally planned and accepted work practices of the direct delivery process on a level 1 diagram.


Fig. 1. Hand-off diagram of direct delivery process

The clerk's work consists of five milestones during the process accomplishment: creating the sales order, converting the order type to direct delivery, recording the sales order, recording the purchase order and sending the order to the supplier. The direct delivery process begins when an end customer expresses a need for a product of the Agro company that is not in stock. First, a clerk at the company fills in a new sales order in the IS with the specific product information (i.e. quantity, price etc.) and the end customer information (i.e. name, delivery address, terms of payment etc.). After filling in the sales order, the clerk converts it to a direct delivery type of order by selecting the corresponding system function. In practice, the conversion itself is an automatic creation of a new purchase order based on the information entered on the sales order. The clerk records the sales order and prints it as a backup copy of the customer transaction. The clerk then moves to the created purchase order and reviews the purchase information, such as purchase prices and special terms of payment. Usually the purchase prices are available on an updated price list of the supplier. The clerk may also agree special purchase prices with the supplier. Reviewing is finished when the purchase order is recorded and printed. The actual purchase transaction with the supplier takes place through a telephone call, a fax, or by filling in a form on the supplier's website. Product transportation is managed either by the supplier, by an external transportation provider or by the company's own transportation resources. The supplier delivers the products to the end customer and sends the invoice to Agro's accountant. The accountant matches the arrived invoice and the purchase order in the IS using the reference note on the invoice. Lastly, the accountant sends the sales invoice to the customer. Figure 2 represents the organizationally planned workflow of the direct delivery process on a third-level diagram.

After the data analysis had begun, it became apparent that the direct delivery process also embedded a variety of different work practices during the process accomplishment. Ten non-uniform work practices in this workflow were found among the clerks interviewed (table 1). In figure 2, the non-uniformities are located on the related tasks.
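To connect the case description to the diagrams, the following sketch (our illustration; the step list is a simplified reading of the process described above, not the complete figure 1) represents the direct delivery process as a sequence of (actor, task) steps and derives the level 1 hand-off view from it.

```python
# Simplified reading of the direct delivery process as (actor, task) steps;
# the level 1 view keeps one entry per consecutive run of the same actor.

steps = [
    ("clerk", "create the sales order"),
    ("clerk", "convert the order type to direct delivery"),
    ("clerk", "record the sales order"),
    ("clerk", "record the purchase order"),
    ("clerk", "send the order to the supplier"),
    ("supplier", "deliver the products and send the invoice"),
    ("accountant", "match the invoice with the purchase order"),
    ("accountant", "send the sales invoice to the customer"),
]

def level1_handoff_view(steps):
    view = []
    for actor, _ in steps:
        if not view or view[-1] != actor:
            view.append(actor)          # a new actor means a hand-off
    return view

print(level1_handoff_view(steps))       # ['clerk', 'supplier', 'accountant']
```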


Fig. 2. Third level diagram of direct delivery process with numeric references to ten different non-uniform actions


Table 1. Non-uniform work practices with their causes and negative (−) or positive (+) consequences

1. The clerks do not charge for billing.
   Causes: The customers are not willing to pay the billing charges.
   Consequences: + Increased customer satisfaction, which maintains a good customer relationship. − Billing charges are lost.

2. Customer-specific special terms are kept on paper notes.
   Causes: Customer-specific discount percentages put into the system are applied to every product transaction for this customer, which is not reasonable.
   Consequences: + Given discounts are considered more carefully based on product type. − Customer-specific special terms are not mediated to other clerks in case of sick leave etc.

3. Freight rates of the sales orders are entered separately for different products.
   Causes: Need for improved service and avoidance of misunderstandings by improved documenting.
   Consequences: + Increased customer satisfaction.

4. Product discounts are subtracted from the total costs and the discount field is set to zero.
   Causes: Lack of skills in using the discount field.
   Consequences: + Quickened work.

5. A direct delivery type of sales transaction is performed using the separate purchasing and selling IS functions successively instead of the one specific function.
   Causes: A common way to perform the task in the old system. Need for control. Possible with certain products and customers.
   Consequences: + Clerks are more aware of and can better control the movements of the products from one place to another. − The clerk must perform extra work tasks, sometimes including momentary warehousing.

6. Confirmations of the sales orders are not printed.
   Causes: Printed sales orders are not needed by the clerks.
   Consequences: + Economizing paper costs and minimizing space requirements. − A backup of the sales transaction is not available if needed by the organization.

7. Freight rates are entered on the purchase orders.
   Causes: Lack of use skills.
   Consequences: − The freight rate may be invoiced twice, which causes additional financial expenses for the organization. − Decreased supplier satisfaction.

8. Purchase prices are not revised when filling in the purchase order.
   Causes: The clerk does not know the correct purchase price. The clerk wants to ease his job and follow the prices of the invoice.
   Consequences: − The work process is extended and delayed when the prices on the order and the invoice do not match (the accountant faxes the invoice to the clerk, who enters the agreed price into the IS and informs the accountant).

9. The purchase price is set unreasonably high.
   Causes: The clerk wants to be contacted by the accountant and have extra information about current markets, commission prices etc.
   Consequences: − Erroneous data in the purchase price field can result in financial expenses if transferred into real payment transactions. − The work process is extended and delayed due to the use of incorrect communication channels. + The clerk may produce improved results of the purchasing process with the extra information.

10. Purchase orders are entered into the IS after placing the order by telephone, after the product is delivered, or after the purchase invoice has arrived from the supplier.
   Causes: The employee is busy with other work and there is a hurry to place the order with the supplier. It is easier to fill in the purchase order after the purchase invoice has arrived, because the clerk can follow the information on the invoice (e.g. set the purchase prices correctly).
   Consequences: − The accountant cannot find the purchase order in the IS and cannot match the order and the arrived invoice. − The work process is extended and delayed (the accountant sends the invoice to the clerk, who enters the order into the IS and returns the number of the order to the accountant).

4 Modeling for Non-uniformity of Work

As the causes of non-uniformity in the previous section show, there is a need to improve IS functions and fields as well as the related operational and managerial processes of Agro. Post-implementation training of operational IS skills would be needed, but also education in which the entire nature of the IS-mediated work environment is made familiar [20, 21]. This calls for revision and re-engineering of the process and workflow models used for supporting these work operations and worker learning.

The findings of the previous section show that the models of the direct delivery process of Agro could not describe the non-uniform work practices at an appropriate level of detail. For example, the third-level model (fig. 2) has only slight correspondence with the situated actions, even though the model is detailed and operational. At best, the model captures one non-uniform work practice in one modeled task. However, this modeled task embeds other work practices as well and is therefore not at the level of detail of the non-uniformity. The Agro case findings support the notion that within a single process some parts need to be modeled with more detail and operationality than other parts [22]. The necessary level of detail of the model depends on the amount of conformance and operational support required by the organization [22, 23]. The relation between these dimensions is not totally orthogonal, but rather vague [22].


For example, more details in a workflow model do not necessarily provide more operational support for the worker, nor guarantee any conformance of the situated work actions. The Agro models show that fixing the level of detail of the models before determining the needed conformance and the nature of operational support does not support uniformity of work or designing for it. Furthermore, and first of all, the level of detail is defined by the model designer and the modeling technique. Thus, the level of detail is a somewhat artificially created variable of the model, whereas the level of conformance is based on the actual requirements of a work process and is set by the work organization.

For the organization, determining the necessary level of conformance for the work process can be based on an evaluation of the effects of non-uniformity. In other words, there is no need for high conformance on a task if less conformity does not mean harmful consequences for the business and the parties involved in the process. Applying this evaluation criterion to the Agro case results, we can determine the necessary level of conformance for the direct delivery process. Non-uniform practices that have only positive effects (see table 1) indicate that there is no need for more uniformity on these tasks (and the negative ones the opposite). A more holistic evaluation of consequences is needed when the effects vary. Reviewing the intents of the actors and the criticality of the possible effects on different organizational levels, it turns out that practices 7-10 introduce more harmful effects than practices 1-6. For example, not charging the customer for billing (practice 1) was well-intentioned and had a positive influence on important customer loyalty, whereas the loss of income from this practice was regarded as an insignificant consequence on the larger scale. The practices 1-6 and 7-10 also have another classification: the latter practices are hand-off tasks whereas the first six practices are not. Hand-off tasks are those passing the control of work to another actor [19]. In contrast to hand-off tasks, in the first six non-uniform practices the actors have the work item and operate on it themselves through these phases. Thus, the case findings suggest that tasks other than hand-off tasks introduce variance that is positive for the current process instance, whereas hand-off tasks introduce variance that affects the same process instance negatively.

Organizations implementing a new information system or analyzing a current one face a great need to minimize the work effort of modeling. The modeling technique used in the Agro case has a simple notation, which makes it a rather attractive option for time-, cost- and resource-limited IS customer organizations. The cost of modeling is minimized when only the necessary amount of detail is embedded into the models [24], and the level of necessity was found to be a tricky question [22]. The Agro case findings suggest that it is applicable to avoid details for the tasks that are not hand-off tasks. In other words, three-level abstractions are not used with tasks that introduced positive variance. The benefit is that instead of focusing on every step of the process, the focus is targeted only at minor parts of the whole process. Aggregated modeling of these "on-hand" tasks of Agro will sustain an adequate level of conformity, because they do not introduce negative variance.
From the modeling point of view, what we can do with the greater conformance need of hand-off practices is to add more details and a more operational nature to the model and hope that this is also realized in the actual work practice. A more accurate model is created either by adding more details to the naming [22] or by continuing to focus on smaller subtasks. In figure 3, the direct delivery process of Agro is re-modeled more effectively with regard to the required amount of uniformity and the allowed amount of non-uniformity.


Fig. 3. Direct delivery process with necessary level of details for uniformity in situated work

As figure 3 shows, the model mixes different abstraction levels (between practices 1-6 and the others). It is noteworthy that focusing on sub-tasks typically entails that not all steps of a hand-off task remain hand-offs. For example, after building the hand-off diagram and identifying the two hand-off tasks 'record the purchase order' and 'pass the purchase order to the supplier', the former expands on the level three diagram into three different steps (fig. 2), of which only two introduce a hand-off. One must then define which steps of such a hand-off task to model with more detail, if not all. Without any exact rules and limitations the focusing may become endless and certainly not a cost-effective option. The modeling procedure applied in re-modeling the direct delivery process of Agro is represented in table 2. The procedure led to cost-effective process modeling for the required amount and support of uniform work actions, while also accepting the existing non-uniformity.


Table 2. Modeling procedure for required uniformity and allowed non-uniformity of work

PHASE 1: Model the hand-off diagram
PHASE 2: Identify the work tasks that lead into new hand-off situation
PHASE 3: Identify individual steps of every hand-off task
PHASE 4: Model step(s) found in phase 3 with more details into the hand-off diagram

The modeling effort begins at the most abstract level, in this case with modeling the hand-off diagram. In the second phase, the work tasks that lead to a hand-off are identified. The third phase is to identify the individual steps within each hand-off task and to expand the created first-level model with these individual steps. The new Agro model, created with this procedure, remains understandable in the context in which it is used; the modeled steps are comparable to units in reality, and it is still a representation of the real world.
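The sketch below (our illustration; the task names, sub-steps and hand-off annotations are a hypothetical decomposition, and the real procedure is carried out by the modeler rather than computed) shows the selective expansion of table 2: only tasks that hand work off to another actor keep their individual steps in the model.

```python
# Illustrative sketch of the procedure in table 2 with hypothetical data:
# sub-steps are retained only for tasks that hand work off to another actor.

process = [
    # (actor, task, hands_off_to, individual sub-steps)
    ("clerk", "create and record the sales order", None,
     ["fill product data", "fill customer data", "convert to direct delivery", "record"]),
    ("clerk", "record the purchase order", "accountant",
     ["review purchase prices", "record", "print"]),
    ("clerk", "pass the purchase order to the supplier", "supplier",
     ["place the order by telephone, fax or website"]),
    ("accountant", "match invoice and purchase order", "customer",
     ["find the order by the reference note", "send the sales invoice"]),
]

def selective_model(process):
    model = []
    for actor, task, receiver, substeps in process:
        is_handoff = receiver is not None and receiver != actor   # phase 2: hand-off tasks
        detail = substeps if is_handoff else []                   # phases 3-4: expand only those
        model.append((actor, task, detail))
    return model

for actor, task, detail in selective_model(process):
    print(f"{actor}: {task} {detail or '(no further detail modeled)'}")
```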

5 Discussion

Non-uniformity of work practices in an IS-mediated environment can have either serious or almost innovative impacts on different levels of an organization. The case of Agro introduced this variation in one organizational business process and gave the opportunity to draw conclusions for modeling method improvements. Ten non-uniform practices were found in the case process. Those practices were hidden in the current process models, within or between the task descriptions, due to the fixed level of detail of the model. More or fewer details would have been appropriate in different parts of the model in order to attain more unified work practices as well as to avoid unnecessary bounding of employee activity. Instead of adding details purposelessly to the model and accepting the arbitrariness of human work, the approach was to fit models and situated work by evaluating the harmfulness of the consequences of different non-uniform practices. The evaluation revealed different consequences between hand-off and on-hand types of tasks. The latter introduced more positive consequences, thus leading to a part of the model with fewer details.

The introduced modeling procedure adjusted the level of detail according to the harmfulness of non-uniformity and the different types of tasks. Firstly, this avoids unnecessary bounding of employee activity in those parts of work where freedom should lie. Secondly, standardization of work practices is focused on the parts of work that create harmful consequences. Due to the partial and gradual focusing, the approach also offers a way to decrease the costs of modeling work.

The findings are especially useful for Agro in their future modeling and work standardization practices within and between the units. Other companies in the retail industry may also gain insights into their selling and purchasing practices by applying the results of this study.


How the introduced modeling procedure would apply to harmful non-uniformities in other organizations' processes still calls for further research. Using other meta-level categorizations of work tasks, like different forms of task interdependences [25, 26], may reveal new aspects of non-uniformity and model design in the IS post-implementation phase. Open questions also concern the realization of the benefits of using the procedure in terms of the time and work effort needed. By gathering data from many organizations and business processes, it would be possible to further define these gaps between process models and process instances and to develop efficient methods for determining the necessary level of detail and conformance for the process models while the organization is reaching for more standardized practices. This would require systematic evaluation of the impacts of non-uniform IS practices based on practical business process evaluation methods. The recently introduced ProM framework provides a promising technically oriented and real-time approach to identify and measure the impact of non-uniform acts based on information system event logs [27, 28]. However, in order to reveal the tacit and intangible causes and consequences of non-uniformity, we may still need to exploit the methods of qualitative field research. Business process models, and modeling itself, as highly subjective and designer-dependent matters, set a challenge for research validity. Therefore, validation of the findings with different modeling techniques and by different process modelers would also be needed in the future.

References 1. Suchman, L.: Plans and Situated Actions: The Problem of Human-Machine Interaction. Cambridge University Press, New York (1987) 2. Wagner, E.L., Scott, S.V., Galliers, R.D.: The Creation of ‘Best Practice’ Software: Myth, Reality and Ethics. Information and Organization 16, 251–275 (2006) 3. Gasser, L.: The Integration of Computing and Routine Work. ACM Transactions on Office Information Systems 3, 205–225 (1986) 4. Robinson, M.: Design for Unanticipated Use.... In: de Michelis, G., Simone, C., Schmidt, K. (eds.) Proceedings of the Third European Conference on Computer Supported Cooperative Work (ECSCW 1993), pp. 187–202. Kluwer Academic Publishers, Dordrecht (1993) 5. Koivisto, J.: Drifting Work Practices After EPR Implementation: The Case of a Home Health Care Organization. In: 4S & EASST Conference (2004) 6. Mark, G., Poltrock, S.: Shaping Technology across Social Worlds: Groupware Adoption in a Distributed Organization. In: Schmidt, K., Pendergast, M., Tremaine, M., Simone, C. (eds.) Proceedings of the International ACM SIGGROUP Conference on Supporting Group Work (GROUP 2003), pp. 284–293. ACM, New York (2003) 7. Nurminen, M.I., Reijonen, P., Vuorenheimo, J.: Tietojärjestelmän Organisatorinen Käyttöönotto: Kokemuksia ja Suuntaviivoja (Organizational Implementation of IS: Experiences and Guidelines). Turku Municipal Health Department Series A, Turku (2002) 8. Reijonen, P., Sjöros, A.: Toimintatapojen Vakiintuminen Tietojärjestelmän Käyttöönoton Jälkeen (Stabilization of the Work Practices After IS Implementation). In: SoTeTiTe Conference (2001) 9. Prinz, W., Mark, G., Pankoke-Babatz, U.: Designing Groupware for Congruency in Use. In: Proceedings of the 1998 ACM Conference on Computer Supported Cooperative Work, pp. 373–382. ACM, New York (1998)


10. Volkoff, O., Strong, D.M., Elmes, M.B.: Understanding Enterprise Systems-Enabled Integration. European journal of Information Systems 14, 110–120 (2005) 11. Koivisto, J., Aaltonen, S., Nurminen, M.I., Reijonen, P.: Työkäytäntöjen Yhtenäisyys Tietojärjestelmän Käyttöönoton jälkeen – Tapaustutkimus Turun Terveystoimen Kotisairaanhoidosta (Uniformity of Work Practices after IS Implementation – A Case Study in Home Care.) Turku Municipal Health Department Series A, Turku (2004) 12. Curtis, B., Kellner, M.I., Over, J.: Process Modelling. Communications of the ACM 35, 75–90 (1992) 13. Heritage, J.: Garfinkel and Ethnometodology. Polity Press, Cambridge (1984) 14. Giddens, A.: The Constitution of Society: Outline of the Theory of Structuration. Polity Press, Cambridge (1984) 15. Silverman, D.: Interpreting Qualitative Data. Sage, London (1993) 16. Glaser, B.G., Strauss, A.: The Discovery of Grounded Theory: Strategies for Qualitative Research. Aldine, New York (1967) 17. Nurminen, M.I., Eriksson, I.: Research Notes in Information Systems Research: The ’Infurgic’ Perspective. International Journal of Information Management 19, 87–94 (1999) 18. Kortteinen, B., Nurminen, M.I., Reijonen, P., Torvinen, V.: Improving IS Deployment Through Evaluation: Application of the ONION Model. In: Brown, A., Remenyi, D. (eds.) Proceedings of Third European Conference on the Evaluation of Information Technology, pp. 175–181 (1996) 19. Sharp, A., McDermott, P.: Workflow Modelling: Tools for Process Improvement and Applications Development. Artech House, London (2001) 20. Reijonen, P., Toivonen, M.: How to Be an Effective End-User? In: Dahlbom, B., Ljungberg, F., Nuldén, U., Simon, K., Stage, J., Sørensen, C. (eds.) Proceedings of IRIS 19, pp. 751–765. Gothenburg Studies of Informatics, Gothenburg (1996) 21. Yu, C.-S.: Causes Influencing the Effectiveness of the Post-implementation ERP System. Industrial Management & Data Systems 105, 115–132 (2005) 22. Ellis, C.A.: Workflow Technology. In: Beaudouin-Lafor, M. (ed.) Computer Supported Cooperative Work, pp. 29–54. John Wiley and Sons, New York (1999) 23. Nutt, G.J.: The Evolution Towards Flexible Workflow Systems. Distributed Systems Engineering 3, 276–294 (1996) 24. Mackulak, G.T., Lawrence, F.P., Colvin, T.: Effective Simulation Model Reuse: A Case Study for Amhs Modelling. In: Medeiros, D., Watson, E., Carson, J., Manivannan, M. (eds.) Proceedings of the 1998 Winter Simulation Conference, pp. 979–984. IEEE Press, New York (1998) 25. Thompson, J.D.: Organizations in Action: Social Science Bases of Administrative Theory. McGraw-Hill, New York (1967) 26. Malone, T.W., Crowston, K.: The Interdisciplinary Study of Coordination. ACM Computing Surveys 26, 87–119 (1994) 27. Rozinat, A., van der Aalst, W.M.P.: Conformance Testing: Measuring the Fit and Appropriateness of Event Logs and Process Models. In: Bussler, C.J., Haller, A. (eds.) BPM 2005. LNCS, vol. 3812, pp. 163–176. Springer, Heidelberg (2006) 28. Verbeek, H.M.W., van Dongen, B.F., Mendling, J., van der Aalst, W.M.P.: Interoperability in the ProM Framework. In: Latour, T., Petit, M. (eds.) Proceedings of the CAiSE 2006 Workshops and Doctoral Consortium, pp. 619–630. Presses Universitaires de Namur, Namur (2006)

Association Rules and Cosine Similarities in Ontology Relationship Learning Jon Atle Gulla, Terje Brasethvik, and Gøran Sveia Kvarv Department of Computer and Information Sciences Norwegian University of Science and Technology, Trondheim, Norway [email protected]

Abstract. Ontology learning is the application of automatic tools to extract ontology concepts and relationships from domain text. Whereas ontology learning tools have been fairly successful in extracting concept candidates, it has proven difficult to detect relationships with the same level of accuracy. This paper discusses the use of association rules to extract relationships in the project management domain. We evaluate the results and compare them to another method based on tf.idf scores and cosine similarities. The findings confirm the usefulness of association rules, but also expose some interesting differences between association rules and cosine similarity methods in ontology relationship learning.

1 Introduction Traditional ontology engineering approaches are tedious and labor-intensive, requiring a wide range of skill sets as well as an ability to deal with very complex and formal representations. In the ontology modeling process it is hard to manage and coordinate the contributions from various types of domain experts and ontology modelers. There are also technical, political and economical challenges that severely hamper the construction and maintenance of ontologies. At the same time, the ontologies are important in Semantic Web applications and integration projects, as they provide the vocabulary for semantic annotation of data and help applications to interoperate and people to collaborate. Most ontology engineering methods today are based on traditional modeling approaches and emphasize the systematic manual assessment of the domain and gradual elaboration of model descriptions (e.g. [5, 7]). Ontology learning is the process of automatically or semi-automatically constructing ontologies on the basis of textual domain descriptions. The assumption is that the domain text reflects the terminology that should go into an ontology, and that appropriate linguistic and statistical methods should be able to extract the appropriate concept candidates and their relationships from these texts. Numerous approaches to ontology learning have been proposed in recent years (e.g. Haase & Völker [11]; Navigli & Velardi [14]; Sabou et al. [17]), and they seem to allow ontologies to be generated faster and with less costs than traditional modeling environments. Even though many of the approaches display impressive results, the complexities of ontologies are so fundamental that the generated candidate structures often just constitute a starting point for the manual modeling task. Advanced approaches with J. Filipe and J. Cordeiro (Eds.): ICEIS 2008, LNBIP 19, pp. 201–212, 2009. © Springer-Verlag Berlin Heidelberg 2009

deep semantic analyses of text or whole batteries of statistical tests tend to yield better results, but are expensive to develop and may still not compete with traditional ontology modeling with respect to accuracy and completeness. So far, the best results are for the learning of prominent terms, synonyms and concepts. For more advanced constructions, like relationships and rules, there are still very few good tools out there to help us. Even though there are some ontology learning tools with relationship learning included, the accuracy of these relationships are questionable and there has only been limited work on comparing the various approaches to relationship learning. This is unfortunate, as there are indications that many of these approaches may be successfully combined into more reliable relationship learning approaches. In this paper we present an approach to ontology relationship learning that makes use of association rules. The theory of association rules comes from data mining, though it can easily be adapted to the task of extracting relationships between concepts in domain text. The underlying idea is that concepts tend to be related if it can be shown that they show up together in documents with a certain predictability. The technique neither distinguishes between types of relationships nor identifies relationship labels, but gives a first rough set of candidate relationships to the ontology modelers. The paper is structured as follows. Section 2 discusses the role of relationship learning in ontology engineering. We then introduce the association rules in Section 3 and briefly explain how they are used to extract relationships between concepts in Section 4. Section 5 introduces an alternative approach to relationship learning, using cosine similarities between concept vectors. An evaluation and comparison of the two approaches follows in Section 6, while some related work is discussed in Section 7. The conclusions are found in Section 8.

2 Learning Ontology Relationships An ontology can be regarded as a representation of a set of domain concepts (also called classes or objects) and their relationships. The concepts may be taxonomically related by the transitive IS_A relation or non-taxonomically related by a user-named relation, for example, hasPart [13]. Some also make a distinction between non-taxonomic relations about whole/parts, class/instance or associations in general. Web Ontology Language (OWL) is a semantic markup language recommended by the World Wide Web Consortium for the representation of ontologies. For the learning of relationships, OWL has four primitives of particular interest:

♦ Class: A class defines a group of objects or concepts that belong together.
♦ subClassOf: Stating that a class is a subclass of another gives us the ability to create generalization hierarchies of classes.
♦ Property: Properties are used to define relationships between concepts. A property of a class Person, for example, can be hasChild or ownsCar.
♦ subPropertyOf: Hierarchies of properties can be useful in structuring the ontology for easy maintenance and extension. For example, the property hasRelative for a class Person may be specialized into the subproperty hasSibling.
Classes represent concepts that are taxonomically related, while properties define non-taxonomical relationships between concepts. Association rules do not distinguish

between these two types of relationships and merely suggest relationships of some kind between two or more concepts. Moreover, the method is not able to derive any candidate names of the relationships identified. Used in an ontology learning environment, association rules may give us a rough overview of potential relationships between concepts in the ontology. Other techniques or manual inspection are needed to categorize the relationships and – if needed – give them descriptive labels. Mapping approved relationships to the OWL constructs shown above, for example, still remains a manual task. The technique may be used to relate already modeled concepts, but it usually includes a concept extraction pre-phase that identifies the concepts to be analyzed with association rules afterwards.
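As an illustration, the four primitives map directly onto RDF triples. A minimal sketch using the rdflib Python library (assumed available) could look as follows; the example namespace, classes and properties are made up and not taken from the project management ontology discussed later.

```python
# The four OWL primitives of interest, expressed as RDF triples with rdflib.
# The ex: namespace and the example classes/properties are invented.
from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL

EX = Namespace("http://example.org/ontology#")
g = Graph()
g.bind("ex", EX)

# Class and subClassOf: taxonomic (IS_A) structure.
g.add((EX.Deliverable, RDF.type, OWL.Class))
g.add((EX.ProjectPlan, RDF.type, OWL.Class))
g.add((EX.ProjectPlan, RDFS.subClassOf, EX.Deliverable))

# Property and subPropertyOf: non-taxonomic relationships and their hierarchy.
g.add((EX.hasRelative, RDF.type, OWL.ObjectProperty))
g.add((EX.hasSibling, RDF.type, OWL.ObjectProperty))
g.add((EX.hasSibling, RDFS.subPropertyOf, EX.hasRelative))

print(g.serialize(format="turtle"))
```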

3 Association Rules for Text Mining Association rules are a data mining technique that identifies data or text elements that co-occur frequently within a dataset. They were first introduced by Agrawal et al. [1] as a technique for market basket analysis, where they were used to predict the purchase behavior of customers. This was primarily done for large databases of items purchased on a per-transaction basis. An example of such an association rule is the statement that "90% of the transactions that purchased bread and butter also purchased milk." The problem in association rules mining can be formally stated as follows: Let I be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. A transaction T contains X, a set of some items in I, if X ⊆ T. An association rule is an implication of the form

X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅

A rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y . The idea is to generate all association rules that have support and confidence greater than a user specified minimum support and minimum confidence. The most important algorithm for the generation of association rules is the Apriori algorithm, introduced in Agrawal & Srikant [2]. The algorithm finds all sets of items that have support greater than the minimum support. These sets are called frequent item sets. For every itemset l in the frequent itemset, Lk, it finds subsets of size k-1. For every subset X, it produces a rule X ⇒ Y , where Y = l – X. The rule is kept if the confidence support ( X ∪ Y ) / support ( X ) is greater than or equal to the minimum confidence. In a text mining context, association rules may be used to indicate relationships between concepts. Let us assume that an item set is a set of one or more concepts. If the

rule X ⇒ Y has been confirmed, we conclude that there is a relationship between the concepts in X and the concepts in Y. With item sets of size 1, we have rules that indicate relationships between two concepts. In order to run association rule mining on text, we need to structure the text to mirror the situation in data mining. Following Delgado et al. [6] and Haddad et al. [10], we consider documents – rather than sentences or paragraphs – to correspond to transactions in data mining. Furthermore, we are only interested in extracting relationships between potential concepts, which means that we can restrict the analysis to noun phrases only. We reduce the noun phrases to their base forms, so that project plans and project plan count as the same term and only include noun phrases that have a certain prominence in the document set. We then have documents as item sets and lemmatized prominent noun phrases as items and can run a standard association rules analysis to extract relationships between these prominent noun phrases.
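A minimal sketch of this document-as-transaction setup: each document is reduced to the set of prominent lemmatized noun phrases it contains, and a rule X ⇒ Y between two concepts is kept when it reaches the minimum support and confidence (here with item sets of size 1 on each side). The toy corpus and thresholds are invented; a full implementation would use the Apriori algorithm to scale to larger item sets.

```python
# Documents as transactions: each transaction is the set of prominent
# (lemmatized) noun phrases found in one document. Thresholds are made up.
from itertools import permutations

docs = [
    {"project plan", "cost", "estimate"},
    {"project plan", "schedule", "cost"},
    {"risk", "cost", "estimate"},
    {"project plan", "cost"},
]
MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.7
n_docs = len(docs)

def support(itemset):
    return sum(1 for d in docs if itemset <= d) / n_docs

concepts = set().union(*docs)
candidate_relations = []
for x, y in permutations(concepts, 2):           # rules between two concepts
    supp = support({x, y})
    if supp < MIN_SUPPORT:
        continue
    conf = supp / support({x})                   # confidence of X => Y
    if conf >= MIN_CONFIDENCE:
        candidate_relations.append((x, y, round(supp, 2), round(conf, 2)))

for x, y, s, c in sorted(candidate_relations):
    print(f"{x} => {y}  (support={s}, confidence={c})")
```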

4 Learning Relationships for Project Management Ontology Our ontology learning tool is built as an extension to the GATE environment from the University of Sheffield [8]. General Architecture for Text Engineering (GATE) is an open source Java framework for text analysis. It contains an architecture and a development environment that allows new components to be easily added and integrated with existing ones. The architecture defines the organization of a text engineering system, in which each component is assigned particular responsibilities. The framework comes with a set of built-in components that can be used, extended and customized to the specific needs of the analysis. This includes NLP components like tokenizers, POS taggers, sentence splitters and noun phrase extractors, but also more extensive plug-ins for multilanguage stemming, WordNet retrieval, machine learning and ontology editors. An analysis with GATE typically consists of a chain of components that one by one go through the text and annotate it with information that will be needed by later components. With our own components for association rules added, we built the analysis chain shown in Figure 1 and explained in more detail below. The analysis is run on a repository of documents representative of the project management domain. Whereas the GATE components work on individual documents, we developed our own modules for association rules that pulled the individual files together, extracted prominent noun phrases as keywords, and suggested relationships between these phrases. The components of the chain are:

♦ Tokenizer and Sentence splitter are GATE components that split the document texts up into tokens and identify sentences for analysis. A token can be a simple word or something like a number or a punctuation mark.
♦ The GATE tagger is a statistical tagger that associates every word in the text with a part-of-speech tag. Having identified the part-of-speech of a term, the lemmatizer can look it up in a dictionary and retrieve its lemma, or base form. The lemma is the common base form of all inflections of the same lexical entry, like the lemma process for verb forms like processes, processing, processed, etc.

♦ The noun phrase extractor identifies noun phrases in the text of the form Noun (Noun)*, i.e., phrases that consist of consecutive nouns. This means that a phrase like very large databases will not be recognized, since very is an adverb and large is an adjective, whereas project cost plan is a perfectly recognized phrase.
♦ The noun phrase indexer is responsible for indexing noun phrases found in the documents. After removing stopwords, the component extracts and counts the frequencies of noun phrases in the document set. A normalized term-frequency score (tf score) is used to select those prominent noun phrases that are most likely to be concepts in the domain. The result is a set of candidate concepts (see the sketch after this list).
♦ The association rules miner uses the Apriori algorithm and extracts association rules between the noun phrases (concepts) found by the previous component. These association rules constitute possible relationships between ontology concepts of the domain.
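The noun phrase extraction and indexing steps can be pictured with a plain-Python stand-in (the real chain uses GATE's Java components, so none of this is GATE's API): consecutive-noun phrases are collected from already lemmatized, POS-tagged tokens and ranked by a normalized term frequency. The tagged toy documents and the 0.5 cut-off are invented.

```python
# Stand-in for the noun phrase extractor and indexer described above.
from collections import Counter

STOPWORDS = {"the", "a", "of"}

def noun_phrases(tagged_tokens):
    """Yield maximal runs of consecutive nouns, e.g. 'project cost plan'."""
    phrase = []
    for token, tag in tagged_tokens:
        if tag.startswith("NN") and token not in STOPWORDS:
            phrase.append(token)
        else:
            if phrase:
                yield " ".join(phrase)
            phrase = []
    if phrase:
        yield " ".join(phrase)

tagged_docs = [
    [("project", "NN"), ("plan", "NN"), ("is", "VBZ"), ("ready", "JJ")],
    [("project", "NN"), ("plan", "NN"), ("and", "CC"), ("cost", "NN"), ("estimate", "NN")],
]

counts = Counter(np for doc in tagged_docs for np in noun_phrases(doc))
max_count = max(counts.values())
candidates = {np: c / max_count for np, c in counts.items() if c / max_count >= 0.5}
print(candidates)   # e.g. {'project plan': 1.0, 'cost estimate': 0.5}
```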

[Figure 1: the analysis chain. The GATE tokenizer, Sentence splitter, Tagger, Lemmatizer and Noun phrase extractor run on each individual document; the Noun phrase indexer and the Association rules miner then work on the whole document repository.]

Fig. 1. Process for extracting relationships using association rules

The relationship learning tool was set up for the project management domain, using documentation from a petroleum company as domain text. The integration of GATE components with internally developed components was unproblematic, though the performance of the system would need to be improved for large-scale document collections. A 76 document collection was loaded in 0,88 seconds on average, and the complete analysis with this collection took 6 minutes and 57 seconds.

5 An Alternative Relationship Learning Method An alternative method to association rules is the traditional information retrieval approach with calculations of cosine similarities between concepts. Solskinnsbakk [18] presents an implementation of such a system for the same project management domain. In this approach we make use of a vector of weighted terms for each concept in an already existing ontology. This concept vector is constructed on the basis of a domain text collection and contains words that tend to co-occur with the concept itself in the text. If the term estimate appears with weight 0.21 in concept Cost’s concept vector, it means that estimate and Cost are to some extent used in the same context (sentence, paragraph or document) and should display some semantic similarities. The weights

are based on the tf.idf score, though we boost co-occurrences in the same paragraph and even more in the same sentence. Technically, we construct a document vector for each concept j, so that the score for each term i in the vector for j is given by the following formula,

where N is the number of concept vectors, ni is the number of concept vectors containing term i, max(vfl,j) is the frequency of the most frequent occuring term l in concept vector j, and the term frequency vfi,j for term i in concept vector j is defined below:

In this formula, fi,d is the frequency of term i in document d, fi,p is the frequency of term i in paragraph p, and fi,s is the frequency of term i in sentence s. We calculate all frequencies for all sentences, paragraphs and documents in the collection. The constants α,β and γ decide the internal weights of document occurrences, paragraph occurrences and sentence occurrences and have been set to 0.1, 1.0 and 10.0. This means that if a term i appear in the same sentence as concept j, this occurrence will count 100 times more in fi,j than if the same term had appeared in the same document as j only. When all concepts are described in terms of concept vectors, we calculate the relatedness between concepts using the cosine formula

cos(x, y) = Σi xi yi / ( √(Σi xi²) · √(Σi yi²) )

where x and y are concept vectors and xi is the weight of the ith word of vector x . If the cosine similarity is above a certain threshold, we conclude that there is a relationship between the concepts. The set of all cosine similarities above this threshold for all concepts pairs in the ontology is the system’s suggested list of relationships in the domain.
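A short sketch ties the weighting and the thresholded cosine comparison together. The normalized tf.idf form used for the weights is an assumption based on the description above; the toy vectors, co-occurrence counts and the 0.3 threshold are invented.

```python
# Sketch of the alternative method: build weighted concept vectors and suggest
# a relationship when their cosine similarity clears a threshold. The exact
# weighting formula is an assumption (normalized tf x idf); numbers are toy values.
import math

ALPHA, BETA, GAMMA = 0.1, 1.0, 10.0   # document / paragraph / sentence boosts

def boosted_frequency(f_doc, f_par, f_sent):
    """vf_{i,j}: co-occurrence frequency of term i with concept j, boosted by scope."""
    return ALPHA * f_doc + BETA * f_par + GAMMA * f_sent

def weight(vf_ij, max_vf_j, n_vectors, n_vectors_with_term):
    """Normalized tf x idf style weight of term i in concept vector j (assumed form)."""
    return (vf_ij / max_vf_j) * math.log(n_vectors / n_vectors_with_term)

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Two toy concept vectors (term -> weight), e.g. 'estimate' in Cost's vector.
cost = {"estimate": weight(boosted_frequency(3, 2, 1), 20.0, 50, 10),
        "budget": 0.40, "baseline": 0.10}
schedule = {"milestone": 0.35, "baseline": 0.20, "estimate": 0.05}

THRESHOLD = 0.3
related = cosine(cost, schedule) >= THRESHOLD
print(f"Cost -- Schedule suggested: {related} (cos = {cosine(cost, schedule):.2f})")
```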

6 Evaluation Evaluating ontology relationships learning systems is notoriously difficult, as there are potential relationships between all concepts and only subjective judgment can tell the important ones from the others. A comparative evaluation of ontology relationship learning may be more interesting than absolute evaluations, as it may expose the differences between the systems and reveal to what extent they may be combined in hybrid approaches.

Concept Extraction The domain chosen for the evaluation was project management in STATOIL, a large Norwegian petroleum company. They use a particular project management methodology, PMI, that is documented in handbooks and also reflected in project documentation from their own projects. Domain experts from STATOIL have together with ontology modelers built a project management ontology [9], which served as a gold standard for our concept extraction part. Our association rules mining system was run on STATOIL's documentation of their project management methodology, PMBOK [16]. This is a book of about 50,600 words (tokens) divided into 12 chapters. The system extracted a total of 196 concepts, compared to the manually constructed ontology's 142 concepts. 50 concepts were identical in both sets, whereas some other 61 concepts found were abstractions of similar concepts in the manual ontology. If we assume that both the 50 perfect matches and the 61 abstract matches are valid, we have a precision of 56.7% and a recall of 78.2% for the concept extraction part.

Relationship Learning For the relationship part, we compared the association rule approach to the cosine similarity method explained above. The manual ontology did not contain enough relationships to be of much use in this part of the evaluation. We first made a distinction between three types of relationships found by the two systems:

♦ Relationships suggested only by the association rule approach
♦ Relationships suggested only by the cosine similarity approach
♦ Relationships suggested by both approaches

Slightly more than 50% of the relationships found were also identified by the cosine similarity method. A selection of concepts were chosen. For each of the three groups above, all suggested relationships to/from these concepts were shown to four persons that all had project management experience. Each person individually rated each relationship as not related (these two concepts are not related), related (there is probably a relationship between the two concepts) or highly related (there is definitely a relationship between these two concepts). An average score for each relationship was calculated on the basis of the individual scores from the test persons. Figure 2 shows the related concepts suggested for the ontology concept Cost for the three groups, as well as their average scores. Adding the results for all concepts together, we can compare the quality of relationships for the three groups. As shown in Figure 3, association rules and cosine similarities tend to produce the same share of good relationships (score Related and Highly related). The two methods suggested 82% and 86% good relationships, respectively, which is a fairly good result for such a small document collection. It should be noted, though, that this does not mean that they necessarily suggest the same relationships.

Related concepts only from association rules (average score in parentheses): project management team (R), management team (R), organization (R), product (HR), information (R), tool (R), project team (R), application area (R), risk analysis (R), result (R), risk (R), resource (R), consequence (R), estimate (R), phase (NR), probability (R), action (R), analysis (R), seller (HR).

Related concepts only from cosine similarity: cost management (HR), cost baseline (HR), actual cost (HR), schedule (R), project schedule (R), earn value (R), staff (R), project staff (R), milestone (NR), plan value (R), stakeholder (HR), project deliverable (R), ev (NR), earn value management (R), management (R), scope definition (NR), scope management (R), customer (R), sponsor (R), project management IS (R), constraint (R), project manager (R), project plan development (R), procurement management (NR), project plan execution (NR), quality management (R), work breakdown structure (R).

Related concepts from both methods: activity (R), assumption (NR), control (R), cost estimate (HR), performance (R), process (R), project (HR), project management (R), project objective (R), project plan (R), quality (R), scope (R), scope statement (R).

Fig. 2. Relationships suggested for Cost. The average scores are NR (not related), R (related) or HR (highly related).

The share of very good relationships is worth a closer inspection. Whereas the association rules method only generated 7% very good relationships, the cosine similarity method reached an impressive 24%. A possible explanation for this difference lies in the mechanics of association rules and cosine similarity. For an association rule to be generated, the corresponding concepts need to occur is a wide range of documents. This will typically be the case for very general concepts and their rather general relationships. The cosine similarity method, on the other hand, makes use of tf.idf to characterize concepts by their differences to other concepts, and the relationships based on cosine similarities will be based on these discriminating concept vectors. The relationships get more specialized and precise and are easier to recognize as very good relationships.This may also explain why the association rule method had a larger share of normally good relationships (75%) than the cosine similarity method (65%). Interestingly, a combination of the two methods seems to produce much better results that each individual method. Both methods carry some noise, but our results

indicate that this noise is dramatically reduced if we only keep the results that are common to both methods. In total, 97% of the relationships suggested by both methods were rated as good relationships by the test group (right column in Figure 3). 30% were considered very good relationships. This suggests that the two approaches – although comparable in quality – are fundamentally different with their own weaknesses and strengths. Since overgeneration is already a problem in relationship learning, a better approach might be to combine approaches and only accept relationships that are supported by several methods. As far as association rules and cosine similarities are concerned, our research indicates that a combined approach will display substantially better results than each individual approach.

[Figure 3: stacked bar chart showing, for each of the three groups (relations only from association rules, relations only from cosine similarity, relations from both methods), the shares of Not related, Related and Highly related scores.]

Fig. 3. Evaluation results for three categories of relationships
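The combination discussed above amounts to keeping only the candidates proposed by both methods; a couple of lines over the two candidate sets is enough. The example pairs below are invented.

```python
# Keep only the relationship candidates proposed by both methods
# (pairs are stored order-independently as frozensets).
assoc_rules = {frozenset(p) for p in [("cost", "estimate"), ("cost", "risk"), ("cost", "seller")]}
cosine_sim  = {frozenset(p) for p in [("cost", "estimate"), ("cost", "cost baseline")]}

combined = assoc_rules & cosine_sim
print([tuple(sorted(p)) for p in combined])   # [('cost', 'estimate')]
```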

7 Related Work As discussed in Cimiano et al. [4], there are today numerous ontology learning systems with facilities for learning relationships (see Figure 4). Association rules are already in use in some of these systems.

[Figure 4: a table comparing the systems Text2Onto, HASTI, OntoBasis, OntoLT/RelExt, CBC/DIRT, DOOBLE, ASIUM, OntoLearn and ATRACT on whether they learn synonyms, concepts, hierarchies and relations, with entries marked X or clusters.]

Fig. 4. Relationships learning in current systems

Our approach to association rules is comparable to what can be found in other ontology learning tools. The accuracy of ontology relationship learning is still not very impressive and suffers from both over-generation and uncertainty. So far, the techniques have also failed in coming up with good labels for these relationships. Text2Onto has a particular structure, the Probabilistic Ontology Model (POM), that allows them to incrementally learn concepts and relationships [3]. Our approach draws on many of the ideas employed by Haddad et al. [10]. In their work they also use documents as transactions and focus on noun phrases as the carriers of meaning and the objects of analysis. A similar approach is taken by Nørvåg et al. [15]. Our research is now focused on the integration of different relationship learning approaches. The combination of association rules and cosine similarity is promising, and has to our knowledge not been done before. Another interesting application of association rules is presented in Delgado et al. [6]. Their idea is to use association rules to refine vague queries to search engine applications. After the search engine's processing of the initial query, their system weights the words in the retrieved documents with tf.idf and extracts an initial set of prominent keywords. Stopwords are removed and the remaining keywords are stemmed. Representing the stemmed keywords of each document as a transaction, their system is able to derive association rules that relate the initial query terms with other terms that can be added as a refined query. Association rules have also been applied in web news monitoring systems. Ingvaldsen et al. [12] incorporate association rules and latent semantic analysis in a system that extracts the most popular news from RSS feeds and identifies important relationships between companies, products and people.

8 Conclusions This paper presented an ontology relationship learning approach that makes use of association rules to identify relationships between concepts. The approach is implemented as a text mining analysis chain, with GATE as the underlying architecture and our association rules components integrated with GATE through their standard API. As such, it is implemented as part of a comprehensive ontology learning workbench that also includes a battery of other ontology learning techniques. Association rules provide a powerful and straight-forward method for extracting possible ontology relationships from domain text. The relationships extracted may be both taxonomic and non-taxonomic, though it is difficult to use the analysis alone to decide on the nature of the relationships. Of the relationships extracted for the project management domain, about 82% were considered valid relationships by a test group with previous experience in project management. However, association rules do not seem to be substantially better than methods based on concept vector construction and cosine similarity calculations. The cosine similarity approach seems to generate more specialized and precise relationships than association rules. However, the methods are complementary, since association rules tend to focus on general relationships between high-level concepts and cosine similarity approaches focus on specialized relationships among low-level concepts.

We are now investigating to what extent several relationship learning techniques can be combined in an incremental learning strategy. An extension of the POM structure from Text2Onto may be useful in this respect, though we need to carry over more than just probability measures when these techniques are applied sequentially. The whole set of uncertainties, possibly supported by evidence in terms of vectors or raw calculations, needs to come together in such a hybrid ontology learning framework.

References 1. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on management of data (1993) 2. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases (VLDB 1994) (1994) 3. Cimiano, P., Völker, J.: Text2Onto. In: Montoyo, A., Muńoz, R., Métais, E. (eds.) NLDB 2005. LNCS, vol. 3513, pp. 227–238. Springer, Heidelberg (2005) 4. Cimiano, P., Völker, J., Studer, R.: Ontologies on demand? A description of the state-ofthe-art, applications, challenges and trends for ontology learning from text. Information, Wissenschaft und Praxis 57(6-7), 315–320 (2006) 5. Cristiani, M., Cuel, R.: A survey on ontology creation methodologies. Idea Group Publishing (2005) 6. Delgado, M., Martín-Bautista, M.J., Sánchez, D., Vila, M.A.: Association rule extraction for text mining. In: Andreasen, T., Motro, A., Christiansen, H., Larsen, H.L. (eds.) FQAS 2002. LNCS, vol. 2522. Springer, Heidelberg (2002) 7. Fernandez, M., Goméz-Peréz, A., Juristo, N.: Methontology: From ontological art towards ontological engineering. In: Proceedings of the AAAI 1997 spring symposium series on ontological engineering, Stanford, pp. 33–40 (1997) 8. Gaizauskas, R., Rodgers, P., Cunningham, H., KHumphreys, K.: Gate user guide (1996), http://gate.Ac.Uk/sale/tao/index.Html#x1-40001.2 9. Gulla, J.A., Borch, H.O., Ingvaldsen, J.E.: Ontology learning for search applications. In: Meersman, R., Tari, Z. (eds.) ODBASE 2007. LNCS, vol. 4803, pp. 1050–1062. Springer, Heidelberg (2007) 10. Haddad, H., Chevallet, J., Bruandet, M.: Relations between terms discovered by association rules. In: Proceedings of PKDD 2000 workshop on machine learning and textual information access. Lyon (2000) 11. Haase, P., Völker, J.: Ontology learning and reasoning - dealing with uncertainty and inconsistency. In: da Costa, P.C.G., Laskey, K.B., Laskey, K.J., Pool, M. (eds.) Proceedings of the international semantic web conference. Workshop 3: Uncertainty reasoning for the semantic web (ISWC-URSW 2005), Galway, pp. 45–55 (2005) 12. Ingvaldsen, J.E., Gulla, J.A., Lægreid, T., Sandal, P.C.: Financial news mining: Monitoring continuous streams of text. In: Proceedings of the 2006 IEEE/WIC/ACM international conference on web intelligence, Hong Kong, pp. 321–324 (2006) 13. Maedche, A., Staab, S.: Semi-automatic engineering of ontologies from text. In: Proceedings of the 12th Internal Conference on Software and Knowledge Engineering, Chicago, USA. KSI (2000) 14. Navigli, R., Velardi, P.: Learning domain ontologies from document warehouses and dedicated web sites. Computational Linguistics 30(2), 151–179 (2004)

15. Nørvåg, K., Eriksen, T.Ø., Skogstad, K.-I.: Mining association rules in temporal document collections. In: Esposito, F., Raś, Z.W., Malerba, D., Semeraro, G. (eds.) ISMIS 2006. LNCS, vol. 4203, pp. 745–754. Springer, Heidelberg (2006) 16. PMI: A guide to the project management body of knowledge (PMBOK). Project Management Institute (2000) 17. Sabou, M., Wroe, C., Goble, C., Stuckenschmidt, H.: Learning domain ontologies for semantic web service descriptions. Accepted for publication in Journal of Web Semantics (2007) 18. Solskinnsbakk, G.: Ontology-driven query reformulation in semantic search. Master's thesis. Norwegian University of Science and Technology, Trondheim (2007)

Compositional Model-Checking Verification of Critical Systems Luis E. Mendoza1, Manuel I. Capel2, María Pérez1, and Kawtar Benghazi2 1

Processes and Systems Department, Simón Bolívar University P.O. box 89000, Baruta, Caracas 1080–A, Venezuela {lmendoza,movalles}@usb.ve http://www.lisi.usb.ve 2 Software Engineering Department, University of Granada ETSI Informatics and Telecommunication, 18071 Granada, Spain {manuelcapel,benghazi}@ugr.es http://lsi.ugr.es/˜sc

Abstract. Ensuring the correctness of Critical Systems (CS) becomes more complex if we consider that their behaviour is the result of the concurrent execution of many components. Furthermore, any automaton–based representation of concurrent components yields an explosion in the number of states, thus limiting the use of Model–Checking (MC) verification techniques in practice. This article presents a compositional verification approach, which is formally supported by state–of– the–art MC tools. To facilitate and guarantee the verification of large CS, the proposed approach integrates MEDISTAM–RT (Spanish acronym of Method for System Design based on Analytic Transformation of Real–Time Models), CCTL temporal logic as the property specification formal language, and the formal language CSP+T, used to formally describe a model of the system to be verified. To show a practical use of the proposed approach, a critical part of a realistic industry project related to mobile phone communications is discussed. Keywords: Critical software systems, Compositional verification, Formal methods, Model-checking, Case study.

1 Introduction The complexity of modern Critical Systems (CS) together with the absence of appropriate software tools is one of the reasons for the large number of errors in the design and implementation of these systems. The verification process requires proving the correctness of a large set of components, which can exceed the temporal and spatial complexity limits [1,2,3] of any software verification tool. Because of these implementation restrictions, failures to achieve validation are not precluded during the aforementioned verification process. To cope with verification complexity, only core critical parts of the CS are modelled as finite state systems and verified through Model–Checking (MC) technique [4]. MC techniques have become successful tactics, frequently used to uncover well–hidden bugs in sizeable industrial applications [4]. However, in this field, the state–explosion problem (i.e., the number of states to explore grows exponentially J. Filipe and J. Cordeiro (Eds.): ICEIS 2008, LNBIP 19, pp. 213–225, 2009. c Springer-Verlag Berlin Heidelberg 2009 

with the number of the system’s processes) still arises whenever there are too many dependencies between components of any CS. In order to contribute to solving the problems described above, a compositional verification approach, integrated with MEDISTAM–RT —Spanish acronym of Method for System Design based on Analytic Transformation of Real–Time Models [5], is presented in this paper, which can be proved as a sound verification approach since it is based on formal aspects of MC and compositional verification. The integration is attained by using two formalisms that are in the same formal semantics domain, given by Kripke Structures (KS), i.e., Clocked Computation Tree Logic (CCTL) for temporal properties and Communicating Sequential Processes + Time (CSP+T) for system process formal specifications. KS are also called transition graphs, consisting of a set of states, a set of transitions between states, and a function that labels each state with a set of properties that are true in this state [4]. Thanks to this common, semantically compatible interpretation of models that the aforementioned specification languages present, state–of–the– art MC tools can be incorporated in the development cycle to facilitate the verification of some complex software system properties, e.g., safety, deadlock–freeness, liveness, etc. To show the usefulness of our proposal, the verification approach presented here is applied to a case study that has critical temporal requirements in the field of mobile phone communications. Similar works about combining compositional verification and MC can be found in the literature. Some of these works [1,3,6] use the compositional capacity of temporal logics to address the complex software systems verification problem. While others, such as [7,8], take advantage of the process algebra operators to facilitate the checking of the system’s behaviour with respect to its predefined properties. And others [9,10,11,12,13] use the Assumption/Commitment, Rely/Guarantee, Assumption/Commitment or Assume/Guarantee paradigm to minimize the state–explosion problem. Finally, the work already carried out in [3,6,8,14] aims to make a component local verification approach compatible with a deductive system for software verification. Nevertheless, none of the aforementioned works merges the analysis/design processes with the verification task. In contrast with other research, our goal is to give the development of software systems a systemic and integrated framework for performing analysis, design and verification tasks. Software verification is achieved by using state–of–the–art MC tools, which allow the verification of the complete system design. The paper is organized as follows. In the next section, we give the main ideas of our compositional verification approach. After that, we give a brief description of the MEDISTAM–RT design method and the formal framework (CCTL and CSP+T) used in the MC technique integrated into the approach. Subsequently, we establish how all elements described previously are combined into the MC technique. Finally, we apply and discuss our proposal to a real project related to mobile phone communications. The last section gives conclusions and mentions further research.

2 Compositional Verification Approach The approach described here is aimed at dividing complex software systems into components. In order to mitigate complexity, modular software development makes use

Fig. 1. Compositional verification approach

of system decomposition and abstraction/refinement concepts. Every system’s components are individually verified, and the results are deductively combined to obtain the global system characteristics. Moreover, the behaviour of the entire system can be derived from descriptions of system components [13,15], without it being necessary to take into account any other information about components’ internal structures (black box principle [13,16]). Figure 1 shows the proposed compositional approach (the symbol  denotes parallel composition, the symbol  denotes satisfaction, and the symbol ∧ denotes conjunction). Formally, this approach is based on compositional reasoning, which derives conclusions that are relevant to one complete system from the analysis of its individual components. This reasoning is founded on the basis of the Assume/Guarantee paradigm, i.e., each system component guarantees certain properties (φ i ) based on other component assumptions (Ai ) [4,12]. More specifically, given the C1 and C2 components, and the relation between them is given by C = C1 C2 , the C1 behaviour depends on the C2 behaviour and the C2 behaviour depends on the C1 behaviour, in order to achieve the desired behaviour of C. The engineer who is going to carry out the verification of the C behaviour will specify the set of assumptions that C2 and C1 (A2 and A1 , respectively) must satisfy to guarantee the correctness of C1 and C2 (φ1 and φ2 ), respectively. By appropriately combining the set of properties assumed and guaranteed by C1 and C2 —which are represented by the pairs (A1 , φ1 ) and (A2 , φ2 ), respectively—, it is possible to assure the correctness of the complete system made up of C1 and C2 , i.e., C, without it being necessary to build the complete state transition graph of C, i.e., the (entire) system. This is known as the “Assumption/Commitment”, “Rely/Guarantee”, “Assumption/Commitment” or “Assume/Guarantee” [12] paradigm [4]. Decomposition. The initial division of the System into smaller modelling entities, called Subsystems, is performed. Then, as result of a refining process, each subsystem becomes further divided into smaller entities named Components. Making the decision

of where the system and subsystems should be divided is a task that can be optimized by performing the separation of subsystems until the smallest size components are found, those whose behaviour could be described by a single control line, i.e., the behaviour can be specified by only one automaton, or state machine, or process (depending on the formalism used to represent the system behaviour). This corresponds to Local processes in Figure 1. System decomposition does not usually involve any formal specification work, but only provides a general infrastructure for guiding the next steps in the approach. Nevertheless, the specification language should be capable of recomposing the separately specified components. Abstraction, Refinement and Modelling. The system will be developed as result of stepwise refinement, starting from the requirement specification phase and progressively adding details until an acceptable implementation is achieved [17]. An important choice to be taken is to find the correct abstraction level at which any subsystem or component should be described. It is recommended, as a general analysis/design strategy for system components, to begin with models as abstract as possible, but without excessive information loss. The described compositional approach ought to be able to integrate both apparently opposing views (structural and behavioural). On the one hand, the architectural design of the system is obtained, which represents a high level description of the system (its structure), on the other, one detailed description is obtained, which specifies the complete design of the system (its behaviour). At the end of the abstraction, refinement and modelling processes, each component must be modelled or mapped into a formal framework that allows verification. This modelling or transformation into a formal language enables the mathematical definition of the system, thus facilitating its formal analysis. Carrying out the description and accurately analyzing the behaviour of system components with the aid of formal methods will help to detect inconsistencies, ambiguities and omissions. Local Verification. Once the specification and the model of the system are obtained, the engineer can verify whether the system model satisfies specification of the properties using adequate formal verification techniques and tools. Thanks to MC, the kind of formal verification used in our approach, it is possible to study behavioural situations that can be classified as “normal” or abnormal. If the verification of a component fails, the MC tool will generally give a counterexample of use to correct the original error in the behaviour or in the property specification being checked. Every system component model should be tested against its formal specification, which allows complete automation of this step. Deduction. In order to provide the desired proof of the global system properties, the models of local processes and the specifications of their properties are composed by using the appropriate operators of each formal specification notation deployed. The idea is then to combine the local results obtained from the Local verification process by deduction in order to derive a global property of the complete system. The crucial step of this approach is to verify local properties for each component individually, using justified assumptions about its environment —here the Assume/Guarantee method is employed again. By the application of deductive techniques [13], it can be proved that

the global system properties to be satisfied are the result of applying the propositional conjunction operator to the proved properties of the components. Finally, if the global system property cannot be proved, there may be information lost, quite probably in one or more of the local component specifications. In general, the intermediate results yielded by the combined proof will show the missing information. Having found the information leaks and having verified the local specifications once more, the test is applied again.
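The deduction step can be pictured with a small skeleton: local verification verdicts for each component's (assumption, property) pair are conjoined into the global verdict, and a failed conjunction points back at the local specifications that may be missing information. The component names and verdicts are invented placeholders, not results from any tool.

```python
# Skeleton of the deduction step: conjoin local Assume/Guarantee verdicts and,
# on failure, report which local specifications need revisiting.
local_results = {
    ("C1", "(A1, phi1)"): True,
    ("C2", "(A2, phi2)"): True,
    ("C3", "(A3, phi3)"): False,
}

global_ok = all(local_results.values())
print("global property holds:", global_ok)
if not global_ok:
    for (component, spec), ok in local_results.items():
        if not ok:
            print(f"revisit {component}: local specification {spec} not discharged")
```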

3 Integrated Elements 3.1 MEDISTAM–RT With MEDISTAM–RT we can specify the structural and behavioral aspects of CS systematically as stated in [5]. These two different viewpoints of a system are usually attained in UML–RT (an extension to UML which adds four new building blocks to the standard UML: capsules, ports, protocols, and connectors [18]) by using Class and Composite Structure Diagrams, and by using State Diagrams, respectively. We apply a transformational method, based on a proposed set of transformation rules [5], which allow us to create a CSP+T model from a UML–RT analysis model of a given CS. MEDISTAM–RT is divided into two main phases: the first one (top–down modelling process) models the system using UML–RT, while the second one (bottom–up specification process) obtains the formal specification in terms of CSP+T by transforming each UML–RT sub–model. Mapping links are continuously established between the UML–RT diagrams of the components, of which the system is constructed, and their formal specifications in terms of CSP+T processes. These links demonstrate how CSP+T syntactical terms are used to represent the real–time constraints and the internal components and connectors that constitute the system architecture, at different levels of description detail. 3.2 CCTL CCTL [19] is a temporal logic extending Computation Tree Logic (CTL) [4] with quantitative bounded temporal operators. CCTL is used to deal with sequences of states, where a state gives a time interpretation of atomic propositions at a given time instant and time is isomorphic to the set of non–negative integers. CCTL includes CTL [4] with the operators until (U) and the operator next (X) and other derived operators in LTL useful for facilitating CS property specifications. All “LTL–like” temporal operators are preceded by a run quantifier (A universal, E existential) which determines whether the temporal operator must be interpreted over a single run (existential quantification) or over every run (universal quantification) starting within the current configuration. 3.3 CSP+T CSP+T [20] is a real–time specification language which extends Communicating Sequential Processes (CSP) [21] to allow the description of complex event timings, from within a single sequential process, of use in the behavioural specification of CS.

A CSP+T process term P is defined as a tuple (αP, P ), where αP =Comm act(P )∪ Interf ace(P ) is named communication alphabet of P. These communications represent the events that process P receives from its environment (made up of all the other processes in the system) or those that occur internally, such as the event e which is not externally visible. CSP+T is a superset of CSP. As a major change to the latter, the traces of events are now pairs denoted as t.e, where t is the absolute global time at which event e is observed. The operators, related with timing and enabling–intervals included in CSP+T are [20]: (a) the special process instantiation event denoted ⋆ (star); (b) the time capture operator (⋊ ⋉) associated to the time stamp function ae = s(e) that allows storing in a variable a (marker variable) the occurrence time of an event e (marker event) as it occurs; and (c) the event–enabling interval I(T, t1 ).a, viewed as representing timed refinements of the untimed system behaviour and facilitates the specification and verification of temporal system properties [20].
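A small sketch of the timed-trace view: observations are pairs t.e of an absolute time and an event, and an enabling interval constrains when an event may be observed. Reading I(T, t1).a as "a must be observed within T time units after instant t1" is an assumption about the notation, and the trace and bounds below are invented (the event names are borrowed from the case study later in the paper).

```python
# Timed traces in the CSP+T style: pairs (t, e) of absolute time and event.
# The interval semantics used here is an assumption about I(T, t1).a.
def within_enabling_interval(trace, event, t1, T):
    """True iff `event` occurs at some time t with t1 <= t <= t1 + T."""
    return any(e == event and t1 <= t <= t1 + T for (t, e) in trace)

trace = [(0.0, "star"), (1.2, "SndMsg"), (2.4, "AckMsg"), (3.9, "RcvConf")]
print(within_enabling_interval(trace, "AckMsg", t1=1.2, T=2.0))   # True  (2.4 <= 3.2)
print(within_enabling_interval(trace, "RcvConf", t1=1.2, T=2.0))  # False (3.9 >  3.2)
```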

4 Our Verification Integrated View Figure 2 is a graphical summary of how the MC concepts support the integration of MEDISTAM–RT, UML–RT, CSP+T, and CCTL, into the compositional verification approach. The complete description of the system’s behaviour is obtained as result of using MEDISTAM–RT. A series of system views represented by Class Diagrams, Composition Structure Diagrams and State Diagrams are obtained when the concepts of decomposition, abstraction/refinement and modelling, following the UML–RT guidelines, are applied. The use of UML–RT allows us to support the analysis/design processes with the abstraction/refinement concepts behind UML. After that, these views are specified using CSP+T process terms, which share a refinement and satisfaction relation

Fig. 2. Integrated view according to our compositional verification approach

equivalent to the one existing between UML–RT diagrams [15]. In parallel to the description of the system behaviour, the non–functional requirements and temporal constraints that the system must fulfill are specified with CCTL formulas. Bear in mind that the basis of CCTL is interval structure, i.e., a time–annotated automata; see [19] for more details. Once CSP+T [20] process terms and CCTL [19] formulas are obtained, we can proceed to the system's verification in the same semantic domain given by KS. This semantic domain is very important within our approach, since it is the most adequate formalism to represent automata describing finite transition systems, which constitute the basic component used by MC tools. Our approach uses the CSP+T's parallel composition operator (∥), a simple but powerful form of composition, which has the property that one can deduce the behaviour of one system part separately, confident that the resulting system will continue respecting the properties established by the parts, as proved in [17]. By using MC tools it is possible to check whether a possible model of the system under development, expressed as CSP+T process terms, satisfies the expected temporal behaviour of the system, given by CCTL formulas that express the properties of the system. As a result, we will obtain the verification of local system processes through the interpretation of Boolean expressions (True, False). Finally, by applying deductive techniques [13], it can be proved that the global system properties to be satisfied are the result of applying the propositional conjunction operator to the proved properties of components. Formally, it is possible to assure and obtain the complete verification of the system by using the relation:

∥_{i:1...n, j:1...m} Pij  ⊨  ⋀_{i:1...n, j:1...m} φij        (1)

According to [12], the formal proof of relation 1 can be automated using MC [1]. Keep in mind that the Cij component was already modelled using the Pij CSP+T process term (which is represented by automata Pij). Alternatively, a test automaton CTij can be constructed, which includes a fail state that is reachable if and only if φij, modelled by automata Aij, is not fulfilled. Then, the fail state events are checked to show that:

Pij ∥ Aij ∥ CTij ⊨ ¬reach(fail)  ⇒  Cij ⊨ (Aij, φij)        (2)
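The check behind relation (2) can be sketched as an explicit-state search: compose the component model, the assumption automaton and the test automaton by synchronizing on shared events and ask whether any product state puts the test automaton in its fail state. The toy automata are invented and the code is plain Python, not the input language of FDR2 or any other checker.

```python
# Fail-state reachability over the synchronous product of deterministic
# automata given as dicts mapping (state, event) -> next state.
from collections import deque

def fail_reachable(automata, inits, fail_index, fail_state):
    """BFS over the product; an event is taken only if it is enabled in every
    automaton whose alphabet contains it."""
    alphabets = [{e for (_, e) in aut} for aut in automata]
    events = set().union(*alphabets)
    seen, queue = {inits}, deque([inits])
    while queue:
        states = queue.popleft()
        if states[fail_index] == fail_state:
            return True
        for e in events:
            successor = []
            for i, aut in enumerate(automata):
                if e in alphabets[i]:
                    if (states[i], e) not in aut:
                        break                    # e is blocked by automaton i
                    successor.append(aut[(states[i], e)])
                else:
                    successor.append(states[i])  # automaton i does not observe e
            else:
                nxt = tuple(successor)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False

P  = {("idle", "snd"): "wait", ("wait", "ack"): "idle"}    # component model
A  = {("free", "snd"): "busy", ("busy", "ack"): "free"}    # assumption automaton
CT = {("ok", "snd"): "armed", ("armed", "ack"): "ok",
      ("armed", "snd"): "fail"}                            # test automaton

print(fail_reachable((P, A, CT), ("idle", "free", "ok"), 2, "fail"))  # -> False
```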

5 Case Study A way to validate the applicability (and consistency) of any approach in compositional verification is through its deployment in a case study. To this end, we selected a real project related to mobile phone communications. The aim was to verify an application with an estimated daily transaction volume in the order of millions. The case study is related to monitoring the state of Base Transceiver Stations (BTS), i.e., sites where antennas and electronic communication equipment are placed to create a cell in a mobile phone network. Figure 3 depicts a simplified scheme of the case study, where five BTSs (A to E) and the messages interchanged (SndMsg, RcvMsg, AckMsg, RcvConf ) are shown. In order to be able to guarantee an adequate service level, a continuous monitoring of BTSs should be performed. The BTSs have a great number of strongly interconnected devices and individual characteristics susceptible to reconfiguration, to decide

Fig. 3. Case study simplified scheme

whether to give a new service, or to substitute a damaged device, or simply to offer a better service to the customers. Obtaining good performance of the network requires guaranteeing the integrity of the information state of each one of the devices in each BTS by using a Distributed Data Base (DDB) modelling approach. Each BTS has its local Data Base (DB) and its own Distributed Data Base Manager (DDBM) that sends the changes to the rest of the BTSs. After updating the local DBs, each DDBM sends a confirmation message to the BTS that requested the update, notifying it of the change. 5.1 Properties Specification Several CCTL formulas have been defined to specify in detail the properties that guarantee the correct functioning of the network, i.e., to express reachability, safety, liveness, deadlock–freeness, and fairness properties, mainly related to: (a) data integrity between the different global data replicas, (b) dynamic state of the transmitted messages, (c) reception of acknowledgement messages preceding local data updating, and (d) maintenance of active and idle states of DDBM. Nevertheless, for the sake of simplicity, we only describe the detailed analysis and discussion of one CCTL formula to show our approach. This formula is considered simple but very important to the objectives of the system described in the case study because it assures that only one data update request can progress at any instant, i.e., two DDBMs are not allowed to send a data update request at the same time; thus, the information integrity in the DDBM is guaranteed. The complete set of CCTL formulas is detailed in [16]. We use here the ϑ := AG[1,3] (¬ [SndM sg(s)∧ SndM sg(s′ )]) formula, which expresses that “only one send–and–update message can be performed at the same time within [1,3] time interval”. More formally, the proposition ¬ [SndM sg(s) ∧ SndM sg(s′ )] is always globally true within the [1,3] interval. In the case study scheme represented in Figure 3, an instance of the above property is when one DDBM sends an update message (e.g., DDBM B), within the [1,3] interval, so none of the other DDBM (i.e., DDBMs A, C, D or E) can send another update message (until the time interval finishes). This CCTL formula represents a safety property since a sequential implementation of the protocol used by the DDBMs to communicate between each other will always satisfy this property, i.e., first one DDBM sends and when it receives the acknowledgement, another sends. The Figure 4 shows a KS (in this case, a Timed B¨uchi Automata —TBA—) semantically equivalent to the ϑformula obtained by the application of the TBA algorithm proposed in a prior work [22] (to improve the figure readability, we use the variable

Fig. 4. TBA semantically equivalent to the ϑ CCTL formula

ϕ = ¬[SndMsg(s) ∧ SndMsg(s′)]). Therefore, to carry out the verification, we use

the CSP+T process term obtained from the data structure generated by the algorithm execution as input of a MC tool to check the DDBM model according to the ϑ formula. This CSP+T process term corresponds with the ESP DDBM, ESP Act Control and, ESP Man Message specifications, when we use the FDR2 MC tool [23]. 5.2 DDBM Modelling The DDBM is made up of two subcapsules, Act Control and Man Message, both responsible for managing the states of the DDBMs and the message states, respectively. In [16] is shown the detailed architecture of each DDBM. In Figure 5 we can observe the Timed State Machines (TSM) that model the behaviour of each one of the subcapsules Act Control and Man Message. Attention must be paid to the fact that the necessary time labels have been included in the TSM to meet the maximum time expected, in order to meet the model’s corresponding temporal constraints (see section 5.1). For instance, when the subcapsule Act Control is in the Updating state, it should receive —within the time interval [ta, ta + 1)— the event Rcv produced by the subcapsule Man Message to change it to the Inactive state; given ta the time instant at which the subcapsule changes to the state Updating. If, within the specified time span, the event Rcv is not received, a “timeout” event will be raised at ta + 1 time instant. With the specification of these intervals and time instants, we guarantee that the DDBM can not enter into a blocking state that would prevent new updating occurrences. In [5] the TSM extension rules used to design the TSM shown in Figure 5 are explained. CSP+T process terms that specify the behaviours of UML–RT TSM in Figure 5 are presented in Figure 6, specifying the DDBM performance behaviour modelled and according to the architecture of each DDBM. The CSP+T process terms, which are used as models of DDBM, Act Control, and Man Message subcomponent automata, are checked w.r.t. the CCTL formula automaton (see section 5.1) represented by (a more abstract) CSP+T process term. 5.3 Component Verification First, we perform the verification of each subcomponent (Act Control and Man Message) w.r.t. the properties each one must satisfy (ESP Act Control and ESP Man Message), respectively. Secondly, we check that the component DDBM ( (Act Control M an M essage) \ C ) satisfies the ESP DDBM (ESP Act Control ∧ ESP M anM essage) property. From the specification of the property in section 5.1, which represents the property specification of the system (ESP DDBM), and the processes model


Fig. 5. DDBM components TSM: (a) Act Control; (b) Man Message

DDBM = CSP+T_g,m(Act Control, C, Man Message)
     = V_a(CSP+T(Act Control)) |[a_in, a_out]| C |[m_in, m_out]| V_m(CSP+T(Man Message))

Act Control = ⋆.t0 → Inactive
Inactive = (Ext_a?RcvMsg ⋊⋉ ta → (a!Rcv → Updating)) | (a?Snd ⋊⋉ ta → (Ext_a!SndMsg → WaitingAck))
Updating = (I[ta, ta+1).a?Rcv → (Ext_a!RcvConf → Inactive)) | (I[ta+1, ta+1] → (timeout → Inactive))
WaitingAck = (I[ta, ta+1).Ext_a?AckMsg → (a!Ack → WaitingConf)) | (I[ta+1, ta+1] → (timeout → Inactive))
WaitingConf = (I[ta, ta+2).Ext_a?RcvConf → (a!Conf → Inactive)) | (I[ta+2, ta+2] → (timeout → Inactive))
V_Ext_a(CSP+T(Act Control)) = {RcvMsg, SndMsg, RcvConf, AckMsg}
V_a(CSP+T(Act Control)) = {Rcv, Snd, Ack, Conf}

Man Message = ⋆.t0 → Not_used
Not_used = (m?Rcv ⋊⋉ tm → (Int_m!LocUp → Received)) | (Int_m?Up → (m!Snd → Dispatched))
Received = (m?Conf ⋊⋉ tm → (Int_m!Conf → Confirmed)) | (Int_m?Upd → (m!Conf → Not_used))
Dispatched = m?Ack → Received
Confirmed = (I[tm, tm+1).Int_m?Ready → Not_used) | (I[tm+1, tm+1] → (timeout → Not_used))
V_Int_m(CSP+T(Man Message)) = {LocUp, Up, Conf, Upd, Ready}
V_m(CSP+T(Man Message)) = {Rcv, Snd, Conf, Ack}

Fig. 6. Act Control and Man Message CSP+T process terms
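As an informal companion to Figures 5 and 6, the sketch below renders the Act Control states and timeouts as a plain state machine in Java. It is a simplified, single-threaded illustration that reuses the event and state names of the figures but omits the output events of the CSP+T terms; the real semantics is the one given by the process terms above.

// Simplified rendering of the Act Control TSM; illustrative only, output events omitted.
public class ActControl {
    enum State { INACTIVE, UPDATING, WAITING_ACK, WAITING_CONF }

    private State state = State.INACTIVE;
    private int ta;                      // instant at which the current state was entered

    // Deliver an input event (e.g. "RcvMsg", "Snd", "Rcv", "AckMsg", "RcvConf") at time t.
    public void onEvent(String event, int t) {
        checkTimeout(t);
        switch (state) {
            case INACTIVE:
                if (event.equals("RcvMsg")) enter(State.UPDATING, t);
                else if (event.equals("Snd")) enter(State.WAITING_ACK, t);
                break;
            case UPDATING:               // expects Rcv within [ta, ta+1)
                if (event.equals("Rcv") && t < ta + 1) enter(State.INACTIVE, t);
                break;
            case WAITING_ACK:            // expects AckMsg within [ta, ta+1)
                if (event.equals("AckMsg") && t < ta + 1) enter(State.WAITING_CONF, t);
                break;
            case WAITING_CONF:           // expects RcvConf within [ta, ta+2)
                if (event.equals("RcvConf") && t < ta + 2) enter(State.INACTIVE, t);
                break;
        }
    }

    // A missed deadline corresponds to the "timeout" event and returns to Inactive.
    private void checkTimeout(int t) {
        int limit = (state == State.WAITING_CONF) ? 2 : 1;
        if (state != State.INACTIVE && t >= ta + limit) enter(State.INACTIVE, t);
    }

    private void enter(State next, int t) { state = next; ta = t; }
}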

From the specification of the property in section 5.1, which represents the property specification of the system (ESP DDBM), and the process model in section 5.2, which represents a possible model of the DDBMs, we proceed to their verification. In this case, considering that we are working with a simplified representation of the DDBMs, we use the FDR2 MC tool [23]. As can be observed in Figure 7, the model of each subcapsule (Act Control and Man Message) satisfies the corresponding property (ESP Act Control and ESP Man Message; check marks at rows one and two, respectively) with respect to the failures and divergences semantic models. Finally, Figure 7 also shows that the DDBM component model satisfies the specified ESP DDBM property (check mark at row three), again with respect to the failures and divergences semantic models.


Fig. 7. DDBM verification screen shot

5.4 Discussion of Results

Following our verification approach, the results shown in Figure 7 allow us to deduce that the DDBM model does not degrade the expected behaviour specified by the CCTL formula, nor does a deadlock situation occur in which the system remains waiting indefinitely for communication between the DDBMs. In other words, the verification shows that, among the n DDBMs, each DDBM globally assures the integrity of the distributed data when they are updated. Furthermore, the updates are carried out within the maximum execution times specified by the respective CCTL formulas, and Act Control and Man Message remain occupied for the shortest possible time when a DDBM is busy. By integrating MEDISTAM–RT in our compositional verification approach, the temporal annotations (the "time labels" on the graphs in Figure 5) become part of the CSP+T process terms, and thus remain integrated with the process execution when the DDBMs carry out the global data updates. In this way, it is guaranteed that processes fully accomplish their work within the previously specified times (i.e., the maximum times that can be occupied are met), thus contributing to avoiding service-time degradation. On the other hand, the temporal constraints have been modelled and specified by taking into account the necessary synchronisation between CSP+T process terms, which permits us to prove that any concurrent execution of the automata representing these terms respects the specified temporal properties; in particular, in this case, a safety property (deadlock-freeness).

The application of the approach to the mobile phone communication case shows the feasibility of our vision of compositional verification, supported by a state-of-the-art MC tool, and its integration with analysis/design methods under the same formal semantics of KS. Thus, with the integration of MEDISTAM–RT, UML–RT, CSP+T, and CCTL into our compositional verification approach, we can now obtain the following artifacts within the same conceptual framework:


(a) the CCTL-expressed properties that the CS must fulfill, (b) the UML–RT model of the CS, (c) the CSP+T processes that specify the CS behaviour, and (d) the verification of the CS using the FDR2 MC tool. Therefore, we can say that a systemic and integrated framework for performing analysis, design, and verification tasks has been achieved: software verification is integrated within the software development cycle by using state-of-the-art MC tools, which allows us to carry out the verification of the complete system design. The selection of MC tools depends on the set of properties to be verified; the software engineer is therefore free to select the most appropriate state-of-the-art MC tool according to the properties of the system, and at the same time continue to benefit from our approach.

6 Conclusions

In this paper, we have described a compositional verification approach integrated with MEDISTAM–RT; it can be argued to be sound because it is based on the formal foundations of MC. The integration is attained by using two formalisms that share the same formal semantics of KS, i.e., CCTL for the specification of temporal properties and CSP+T for the formal modelling of system processes. Thanks to the compositionality of the CSP+T process algebra and the common interpretation of CSP+T and CCTL in the same semantic domain, state-of-the-art MC tools can be incorporated into our approach to enable the design verification of large and complex CS. As the case study discussed above shows, our approach facilitates the verification of complex CS designs. Future and ongoing work is aimed at applying the approach to other case studies in industrial CS modelling; our goal is to conduct in-depth research on the verification of these specifications and to support it with state-of-the-art verification tools.

Acknowledgements. This research was partially supported by the National Fund of Science, Technology and Innovation, Venezuela, under contract G–2005000165, and by the project MAT2004–06872–C03–03 of the Spanish Ministry of Science's National R+D+I Plan.

References

1. Grumberg, O., Long, D.: Model checking and modular verification. ACM Transactions on Programming Languages and Systems 16(3), 843–871 (1994)
2. Jonsson, B.: Compositional specification and verification of distributed systems. ACM TOPLAS 16(2), 259–303 (1994)
3. Bultan, T., Fischer, J., Gerber, R.: Compositional verification by model checking for counter-examples. In: ISSTA 1996: Proc. of the 1996 ACM SIGSOFT International Symposium on Software Testing and Analysis (1996)
4. Clarke, E., Grumberg, O., Peled, D.: Model Checking. The MIT Press, Cambridge (2000)
5. Benghazi, K., Capel, M., Holgado, J., Mendoza, L.: A methodological approach to the formal specification of real-time systems by transformation of UML-RT design models. Science of Computer Programming 65(1), 41–56 (2007)


6. Clarke, E., Long, D., McMillan, K.: Compositional model checking. In: Proc. of the Fourth Annual Symposium on Logic in Computer Science (1989)
7. Giese, H., Tichy, M., Burmester, S., Flake, S.: Towards the compositional verification of real-time UML designs. In: ESEC/FSE-11: Proc. 9th European Software Engineering Conference and 11th ACM SIGSOFT International Symposium on Foundations of Software Engineering (2003)
8. Yeh, W., Young, M.: Compositional reachability analysis using process algebra. In: TAV4: Proc. of the Symposium on Testing, Analysis, and Verification (1991)
9. Berezin, S., Campos, S., Clarke, E.: Compositional Reasoning in Model Checking. In: de Roever, W.-P., Langmaack, H., Pnueli, A. (eds.) COMPOS 1997. LNCS, vol. 1536, pp. 81–102. Springer, Heidelberg (1998)
10. Kesten, Y., Klein, A., Pnueli, A., Raanan, G.: A Perfecto Verification: Combining Model Checking with Deductive Analysis to Verify Real-Life Software. In: Wing, J.M., Woodcock, J.C.P., Davies, J. (eds.) FM 1999. LNCS, vol. 1708, pp. 173–194. Springer, Heidelberg (1999)
11. Cobleigh, J., Giannakopoulou, D., Păsăreanu, C.: Learning Assumptions for Compositional Verification. In: Garavel, H., Hatcliff, J. (eds.) TACAS 2003. LNCS, vol. 2619, pp. 331–346. Springer, Heidelberg (2003)
12. Frehse, G., Stursberg, O., Engell, S., Huuck, R., Lukoschus, B.: Modular analysis of discrete controllers for distributed hybrid systems. In: b'02: The XV IFAC World Congress, Barcelona, Spain, IFAC, pp. 21–26 (2002)
13. Lukoschus, B.: Compositional Verification of Industrial Control Systems: Methods and Case Studies. PhD thesis, Technische Fakultät der Christian-Albrechts-Universität zu Kiel (July 2005)
14. Cheung, S., Kramer, J.: Enhancing compositional reachability analysis with context constraints. In: SIGSOFT 1993: Proc. of the 1st ACM SIGSOFT Symposium on Foundations of Software Engineering, pp. 115–125. ACM Press, New York (1993)
15. Mendoza, L., Capel, M.: Consistency checking of UML composite structure diagrams based on trace semantics. In: Software Engineering in Progress – 2nd IFIP CEE-SET 2007 (2007)
16. Mendoza, L., Capel, M., Pérez, M., Benghazi, K.: A conceptual scheme for compositional model-checking verification of critical communicating systems. In: Proc. 10th ICEIS 2008, ISAS-1, pp. 86–93 (June 2008)
17. Allen, R., Garlan, D.: A formal basis for architectural connection. ACM Trans. Softw. Eng. Methodol. 6(3) (1997)
18. Selic, B., Rumbaugh, J.: UML for Modeling Complex Real-Time Systems. ObjecTime Technical Report. ObjecTime, New York, USA (1998)
19. Ruf, J., Kropf, T.: Modeling and Checking Networks of Communicating Real-Time Processes. In: Pierre, L., Kropf, T. (eds.) CHARME 1999. LNCS, vol. 1703, pp. 256–279. Springer, Heidelberg (1999)
20. Žic, J.: Time-constrained buffer specifications in CSP+T and Timed CSP. ACM Transactions on Programming Languages and Systems 16(6), 1661–1674 (1994)
21. Hoare, C.: Communicating Sequential Processes. International Series in Computer Science. Prentice-Hall International Ltd., Hertfordshire, UK (1985)
22. Mendoza, L., Capel, M.: Algorithm proposal to automata generation from CCTL formulas. Technical report, University of Granada (2008)
23. Formal Systems (Europe) Ltd: Failures-Divergence Refinement – FDR2 User Manual. Formal Systems (Europe) Ltd, Oxford (2005)

Model-Driven Web Engineering in the CMS Domain: A Preliminary Research Applying SME

Kevin Vlaanderen1, Francisco Valverde2, and Oscar Pastor2

1 Universiteit Utrecht, Padualaan 14, 3584CH Utrecht, The Netherlands
[email protected]
2 Centro de Investigación en Métodos de Producción de Software, Universidad Politécnica de Valencia, Spain
[email protected], [email protected]

Abstract. In recent years, the use of Content Management Systems (CMS) as the core tool to define a Web application has gained popularity. However, Model-Driven Web Engineering methods are not well suited to the CMS domain, mainly because these methods focus on the data and navigation aspects. To address this problem, we propose in this chapter the use of Situational Method Engineering (SME) to detect the potential issues and improvements of a Web Engineering method in the CMS domain. Specifically, the suitability of the OOWS method in the context of CMS-based Web applications is evaluated by means of a user-registration use case. From the results of this evaluation, a list of current limitations of the OOWS method in the CMS domain is derived. Additionally, the improvements that can be applied from an SME perspective are introduced.

Keywords: Web engineering, Method engineering, Model-driven development.

1 Introduction

Over the last few years, the number of Web applications developed by industry has increased dramatically. However, the quality of those applications is often poor, and they are difficult to maintain because of the lack of precise techniques to guide their development. For that reason, the principles proposed by the Web Engineering [1] community have been widely applied, in particular from a Model-Driven Engineering perspective. The main result has been a wide array of Model-Driven Web development methods (see section 2) intended to improve the quality of Web applications. The development of specialised methods has certainly improved the efficiency of Web development processes. However, it is difficult to choose a Web Engineering method that is the most suitable in all the different domains, mainly because there is no silver-bullet method that can address the heterogeneous domains faced by Web development. With the arrival of "Web 2.0", this issue has become more obvious, since current Web Engineering methods are mainly designed to develop data-intensive Web applications. As a consequence, the methods must be revisited to take into account the social perspective of this new application paradigm.


In other words, when a new Web application domain needs to be supported, the method must be adapted, extended or redefined. Many researchers agree that developing a new Web Engineering method from scratch for every possible domain is not a reasonable approach. In this context, applying Method Engineering principles provides interesting advantages. Method Engineering is "the discipline to design, construct and adapt methods, techniques and tools for the development of information systems" [2]. Situational Method Engineering (SME) is a type of Method Engineering that focusses on adapting a method to a particular situation, as described in [3] and [4]. This adaptation can be summarised in four generic steps [5]: (1) identify concrete method needs, (2) select candidate methods that meet some of the identified needs, (3) analyse the methods and store the relevant method fragments in a method base, and (4) assemble a new method from useful method fragments. Applying SME to Web Engineering methods has two main advantages:

1. Building a common method base allows the analyst to define a method for a concrete domain using existing, already validated method fragments.
2. A method can be easily adapted taking into account the needs of a concrete project.

In this chapter we present how the SME principles can be applied to a Web Engineering method in order to improve its suitability for a particular domain. The chosen domain is that of Content Management Systems (CMS): Web systems used to manage the content (texts, images, resources, electronic documents) of a Web site. The main difference with respect to traditional Web applications is that the content is created or added dynamically by the Web users. In this work, the Web Engineering method selected for improvement is OOWS [6]. This method extends OO-Method [7], an automatic code generation method that produces the equivalent software from a conceptual specification of the system. The possibility of creating code directly from a conceptual model enables drastic improvements in the time and resources required for the creation of Web applications. However, the OOWS method lacks the expressiveness to define Web applications in some domains, such as CMS. To be able to apply OOWS in more fields, of which CMS-based Web application development is one, a better understanding and a more detailed specification of the method are needed.

This chapter extends the work previously presented by the authors in [8]. In that work, SME was applied to define the OOWS method metamodel and to detect the different method fragments. In addition, an analysis was carried out by means of a case study to evaluate the suitability of OOWS in the context of CMS-based Web applications. In this chapter, we perform a new evaluation using a more complex case study based on a CMS application registration process. This new evaluation has led us to extend the previously defined OOWS method metamodel. Both the new method metamodel and the evaluation provide preliminary research into how a common method base for the CMS domain can be defined. The rest of the chapter is organised as follows: section 2 presents the background on related Web Engineering methods and introduces the OOWS method. Section 3 describes the improved OOWS method metamodel that was defined in previous works.


Section 4 describes the new use case defined to analyse the OOWS method and the improvements needed. Finally, section 5 presents the conclusions of this preliminary research.
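As a minimal illustration of the SME steps summarised above (storing fragments in a method base and assembling a method from them), the following Java sketch uses hypothetical types and fragment names; it is not taken from any existing SME tool or from the OOWS implementation.

// Hypothetical sketch of a method base from which situational methods are assembled.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class MethodFragment {
    final String name;
    final List<String> addressedNeeds;   // e.g. "user modelling", "dynamic navigation"
    MethodFragment(String name, List<String> addressedNeeds) {
        this.name = name;
        this.addressedNeeds = addressedNeeds;
    }
}

class MethodBase {
    private final List<MethodFragment> fragments = new ArrayList<>();

    void store(MethodFragment f) { fragments.add(f); }            // SME step (3)

    // SME step (4): select the fragments that match the identified project needs.
    List<MethodFragment> assemble(Predicate<MethodFragment> projectNeeds) {
        List<MethodFragment> method = new ArrayList<>();
        for (MethodFragment f : fragments) {
            if (projectNeeds.test(f)) method.add(f);
        }
        return method;
    }
}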

2 Background

In recent years, several Web Engineering methods have proposed a process to develop Web applications. Specifically, they have encouraged a Model-Driven Engineering point of view to improve development. The generic method proposed can be summarised in five main steps: 1) requirements gathering, using use cases or a textual specification; 2) definition of the domain model, which gathers the entities of the application by means of a UML-like class diagram; 3) definition of the Navigational Model, which represents the navigation between the application nodes; 4) specification of the presentation and design aspects; and 5) the final implementation phase. Apart from small differences and different conceptual models, this five-step process is followed by approaches such as OOHDM [9], WebML [10], WSDM [11] or UWE [12]. Therefore, it is possible to say that there is a common agreement about the steps a Web Engineering method must have. However, in practice, a rigid or generic method does not fit every Web application domain well. For instance, a very exhaustive requirements gathering phase may not be necessary in small Web applications. In addition, as [13] has stated, current methodologies do not have enough expressiveness to model new domains such as Rich Internet Applications. The first solution to these issues has been to redefine the methods; for instance, the WebML method [14] has introduced a set of new conceptual models. However, this is not a long-term solution, because the Web is evolving continuously and Web Engineering methods must therefore evolve simultaneously. This fact leads to defining Web Engineering methods more flexibly by applying the SME principles. An example of a CMS-based Web application design method developed using SME is the GX WebEngineering Method (WEM) [15,5]. WEM is currently used at GX creative online development, a Web technology company in the Netherlands, to implement Web applications using their own Content Management System, called GX WebManager. [15] describes how this method has been created and improved through the use of already existing (proprietary) methods such as [12]. In this chapter, the approach defined for WEM is applied to OOWS. The main purpose is to define the OOWS method using the SME principles and to find method fragments which can be improved.

2.1 The OOWS Web Engineering Method

OO-Method [7] is an object-oriented software production method to automatically generate information systems. OO-Method models the system at different abstraction levels, distinguishing between the problem space (the most abstract level) and the solution space (the least abstract level). The system is represented by a set of conceptual models that represent the static structure of the system (Class Diagram) and its behaviour (State and Functional Diagrams). OOWS [6] is the OO-Method extension used to model and generate Web applications. The OOWS Web Engineering method adds three models that describe the different concerns of a Web application:


– User Model: A User Diagram allows us to specify the types of users that can interact with the Web system. The types of users are organised hierarchically by means of inheritance relationships.
– Navigational Model: This model defines the navigational structure of the system. It describes the navigation allowed for each type of user by means of a Navigational Map. This map is depicted as a directed graph whose nodes represent Navigational Contexts and whose arcs represent navigational links that define the valid navigational paths over the system. Basically, a Navigational Context represents a Web page of the Web application at the conceptual level. Navigational Contexts are made up of a set of Abstract Information Units (AIU), which represent the requirement of retrieving a chunk of related information. AIUs are views defined over the underlying OO-Method Class Diagram. These views are represented graphically as UML classes that are stereotyped with the "view" keyword and that contain the set of attributes and operations which will be available to the user.
– Presentation Model: Using this model, we are able to specify the visual properties of the information to be shown. To achieve this goal, a set of presentation patterns is proposed to be applied over our conceptual primitives. Some properties that can be defined with this kind of patterns are information arrangement (register, tabular, master-detail, etc.), order (ascending/descending), or pagination cardinality.

These models are complemented by the OO-Method models, which represent the functional and persistence layers. The OOWS development process is compliant with Model Driven Architecture (MDA) principles, where a model compiler transforms a Platform Independent Model (PIM) into its corresponding software product. This MDA transformation process has been implemented by means of a CASE tool and a model compiler. The OOWS Model Compiler generates the code corresponding to the user interaction layer, whereas OLIVANOVA (www.care-t.com), the industrial OO-Method implementation, generates the business logic layer and the persistence layer. Further details can be found in [16].
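To make the Navigational Model primitives more tangible, the sketch below represents a Navigational Context as a set of AIU views over Object Model classes. The Java types and the product-list example are hypothetical; they only illustrate the concepts, not the OOWS tooling or its metamodel.

// Hypothetical representation of OOWS navigational primitives; illustrative only.
import java.util.List;

// An AIU: a view over an Object Model class exposing a subset of attributes/operations.
class AIU {
    final String viewedClass;
    final List<String> attributes;
    final List<String> operations;
    AIU(String viewedClass, List<String> attributes, List<String> operations) {
        this.viewedClass = viewedClass;
        this.attributes = attributes;
        this.operations = operations;
    }
}

// A Navigational Context: conceptually one Web page made up of AIUs,
// plus the links (arcs of the Navigational Map) leaving it.
class NavigationalContext {
    final String name;
    final List<AIU> units;
    final List<String> linksTo;          // names of reachable contexts
    NavigationalContext(String name, List<AIU> units, List<String> linksTo) {
        this.name = name;
        this.units = units;
        this.linksTo = linksTo;
    }
}

class NavigationalExample {
    public static void main(String[] args) {
        AIU products = new AIU("Product", List.of("name", "price"), List.of("order"));
        NavigationalContext catalogue =
                new NavigationalContext("Catalogue", List.of(products), List.of("Home"));
        System.out.println(catalogue.name + " shows " + catalogue.units.size() + " AIU(s)");
    }
}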

3 The OOWS Method Metamodel

The metamodeling technique used in this paper to obtain a view of the OOWS method is based on the proposal of [3]. The technique uses a combination of two UML diagrams (see Fig. 1), as described in [17]:

– On the left-hand side, an adaptation of the UML activity diagram is used for modeling the process of the method. This diagram consists of method activities, sub-activities and the transitions between them.
– On the right-hand side, an adaptation of the UML class diagram models the concepts (sets of objects which share the same attributes, operations, relations and semantics), capturing the deliverable view of the method.

These two diagrams are integrated in a straightforward way after being built. Some of the activities from the UML activity diagram are associated with concepts from the UML class diagram through a dotted line. The resulting model is called a Process-Deliverable Diagram (PDD), in which each process step represents a method fragment.


Fig. 1. Process-Deliverable Diagram for OOWS Conceptual Modeling phase

To validate the metamodel proposed in this work, expert validation has been applied: the metamodel has been checked and revised by a PhD student and by a PhD researcher, both working intensively on and with OOWS. The OOWS method metamodel defined in this chapter consists of the two PDDs that represent the Requirements Modeling and Conceptual Modeling phases of the method. The metamodel presented here introduces the new method fragments that were detected in [8]. Owing to space constraints, the implementation phase is not shown, as no modifications were introduced there; further information about this phase, as well as more detailed PDDs, can be found in [18]. The activities that make up both phases are briefly introduced below:

1. Requirements Modeling: The aim of the Requirements Modeling phase is to gather the user needs in order to build the Web application conceptual model. The PDD is composed of four activities:


(a) Description of the main purpose of the system, as a summarised textual definition of the system's final goal.
(b) Identification of the different tasks the user expects to achieve when interacting with the Web application. Each task is described using UML activity diagrams, in terms of the input and output activities that the user has to perform.
(c) Description of each user task in terms of the user–system interaction. For each task, a textual description of the input and output data structures/functionality is provided.
(d) Categorisation of the user tasks into those performed by the system and those imported from external systems.

The output of this phase is a Web application requirements specification. Using model-to-model transformations, it is possible to obtain a first draft of the Web conceptual model. Further details about this phase and the activities involved can be found in [19].

2. Conceptual Modeling: In this phase, the OOWS and OO-Method models that describe the Web system are defined. Since OOWS is an extension of OO-Method, the OO-Method models are part of OOWS as well. These models and their purpose were explained in section 2.1. Ten activities compose this phase, as the PDD in Figure 1 shows. For each activity, the conceptual model built is stated:

(a) Define the OO-Method Class Diagram (Object Model). This model provides information about the entities, their properties and the services that define the information system.
(b) Define the external entities (legacy classes, external services) that interact with the Object Model.
(c) Define the possible states of an object and their lifetime using a UML State Transition Diagram (Dynamic Model). This model constrains which services can be executed in a particular state of an object.
(d) Detect the interactions between objects and specify the communication between them (Dynamic Model).
(e) Specify the object changes after an event occurrence (Functional Model). These changes are specified using a set of logic formulas applied over the object attributes.
(f) Specify the different user profiles (preferences, rights, etc.) that can access the Web application (User Model).
(g) Define the Navigational Contexts that a particular user can access (Navigational Map).
(h) Define the information and services available in a Navigational Context (Navigational Model).
(i) Define the complex processes that involve several Navigational Contexts (Process Model).
(j) Specify presentation requirements (layout, information order criteria) for a Navigational Context (Presentation Model).

All the activities related to the specification of OO-Method models (a to e) define the behaviour and data structures (objects) of the system, whereas the OOWS model activities (f to j) define the Web interface.


OO-Method models must be defined before the OOWS models, since there is a relationship between them (for instance, a user type is linked to a class from the Object Model). This constraint implies that activities (a) to (e) must always be carried out before defining the OOWS models. In addition, for each model to be built, a detailed PDD [18] describes the sub-activities needed. The output of this phase is an OOWS model that specifies the Web system. The OOWS method metamodel provides a clear representation of the method as a whole; moreover, the activities involved in the Web application specification are classified and clearly described. This metamodel is the starting point for detecting method fragments, including those that have to be adapted to the CMS domain.
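A PDD can be thought of as pairing process steps with the deliverables they produce. The sketch below encodes that idea, together with the ordering constraint that OO-Method activities precede the OOWS ones, using hypothetical Java types; it does not reproduce the actual metamodel of [17] or [18].

// Hypothetical encoding of a Process-Deliverable Diagram fragment; illustrative only.
import java.util.LinkedHashMap;
import java.util.Map;

class Pdd {
    // Activity name -> deliverable (conceptual model) it produces.
    private final Map<String, String> steps = new LinkedHashMap<>();
    private boolean oowsStarted = false;

    // isOoMethod marks activities (a)-(e); the OOWS activities (f)-(j) must come later.
    void addActivity(String activity, String deliverable, boolean isOoMethod) {
        if (isOoMethod && oowsStarted) {
            throw new IllegalStateException(
                "OO-Method activities must precede the OOWS model activities");
        }
        if (!isOoMethod) oowsStarted = true;
        steps.put(activity, deliverable);
    }

    Map<String, String> methodFragments() { return steps; }
}

class PddExample {
    public static void main(String[] args) {
        Pdd conceptualModeling = new Pdd();
        conceptualModeling.addActivity("Define Class Diagram", "Object Model", true);
        conceptualModeling.addActivity("Specify user profiles", "User Model", false);
        System.out.println(conceptualModeling.methodFragments());
    }
}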

4 Analysis of the OOWS Method in the CMS Domain

The main interest of this work is how the OOWS method can be adapted to define CMS Web applications. In the previous section, the OOWS method metamodel was introduced in terms of method fragments or activities. In order to detect which current method fragments should be adapted, a use case from the CMS domain has been modelled. The use case used for evaluation purposes describes a possible user-registration process, which is an important part of every CMS-based Web application. Figure 2 shows the first part of the sequence diagram for the use case. First, the user gets to the registration page, where he must introduce all the required information. The information to enter consists of some basic account items such as a username, the e-mail address, a name, etc. In addition, a cell phone number is required because the user does not have to provide a password; this will be automatically created by the system and sent to the user through a text message. After all the data have been introduced, a 'please wait' page will be shown while the data are being checked. At this point, the created password is sent to the user and an automatically created confirmation code is sent to the user's e-mail address. The user account will not be active until the user enters the received password on the website, and either clicks on the link in the received e-mail or enters the confirmation code on the website's 'confirm' page. When this validation process is finished, the user is redirected to the "referrer page" from which the registration process was started. Finally, when the user logs in to the system, a "session" object is initialised with the information provided during registration.

This use case is heavily based on the work previously presented in [8]. However, some more complex requirements have been added and the validation process has been altered. As the sequence diagram shows (see Figure 2), the main reason why this use case is called 'complex' is the number of different systems involved: the UDB (user database), the SMSC (service for sending text messages), the e-mail server and the authorisation system.

4.1 User Registration Use Case

The use case presented was modelled using the OOWS conceptual primitives introduced in section 2.1. Firstly, every user type has a Navigational Context 'Home', on which currently nothing is shown except links to the other available contexts. The AnonymousUser has a 'Register' context, which allows the entering of all the required information needed to execute the register operation.


Fig. 2. Partial Sequence Diagram for the complex registration use-case

The RegisteredUser has, next to its 'Home' context, two additional contexts: 'Confirm', which allows the entering of the confirmation code, and 'Phone', which shows all the related Phone entities and the available 'Phone' services (InsPhone and DelPhone). The resulting application provides a 'register' service to every 'AnonymousUser' that enters the system. When the user executes that service, he gets prompted for all the required information. Upon completion of the form, a new 'RegisteredUser' is created. When the user logs on to this account, he has the opportunity to 'Confirm' the e-mail address by entering the confirmation code through another service. If the confirmation code is correct, the 'authorised' attribute is set to true, indicating that the user receives full access to his account. Based on this attribute, decisions can be made regarding which services, objects and attributes are visible and/or active.

The authentication mechanism provided by the OOWS-generated Web applications is a simple approach to user registration and login. By default, a standard login screen is created in which existing users can enter only their login information (id and password) to identify themselves. For this use case, the user-registration process had to be altered. It is certainly possible to do this at the coding level, but this is not ideal; a better solution is to do it at the conceptual level. However, this currently requires a complex solution.

4.2 Issues Detected and Method Improvements

The use case presented in this chapter was modelled and generated as the OOWS method proposes. Taking into account the lessons learned in the requirements and conceptual modeling phases, as well as the semi-automatically generated implementation, several issues were detected. In order to improve the OOWS method, these issues have been described and a specific solution has been proposed for each of them. It is important to note that these issues and solutions are not specific to the CMS domain.


Therefore, this evaluation has provided a general improvement of the method in other domains as well. The issues detected can be classified into three main points:

1. Automatic generation of user information: When a new user is registered in a CMS, not all the user information is introduced by the user. As the use case illustrates, a first password must be automatically generated by the system and sent to the user, in order to check that the e-mail address is correct. However, the OOWS User Model does not support the definition of this kind of behaviour. Each user defined in the model is associated with a class from the Object Model, which defines its properties and services. Therefore, the default information that is generated when a new user is created is defined in the construction service of the class. Although it is possible to associate default values with the user creation service, complex values such as a randomly generated password, or initialisation conditions, cannot be introduced. In order to support this requirement, a new step should be defined in the method to optionally extend the definition of the User Model. This new step, named "Dynamic User Information", is required to define the user information that cannot be decided at modeling time. The first user password is an example of this kind of information; another interesting example is automatic language selection based on the network from which the user has accessed the application. As step 2.f only includes a static user profile, this optional step to dynamically define the profile must be included.

2. Dealing with session information from different users: This issue has a strong relationship with the previous requirement. The user information not only has to be initialised, but is also commonly accessed while the user is logged into the system. The password or user identification is a very common example of this kind of information, and one that is not restricted to the CMS domain. For that reason, the technological framework that supports the OOWS method stores the user login in a session object by default. With this session information, the Navigational Map of the user (i.e., which navigational contexts are available) is defined. However, only the login is stored; other information, such as the password, the login time or the newly added content, is not. In addition, OOWS does not provide any conceptual primitive to access the session-only information, for example to filter a context according to the type of user logged in. Only the information defined in the Object Model, and therefore stored in the database, can be used in a particular context. To solve this issue, the "Define Object Model" method step must be redefined. After the Object Model has been specified, the analyst should explicitly point out which information will be persisted in the session cache. For this purpose, a new optional method step is introduced: "Session Information Specification". In this step, the analyst must use the "Session" stereotype for each class attribute which must be session-persisted. Hence, when the navigational contexts are defined, the session attributes will be available to retrieve information or to establish filter conditions.

3. Dynamic navigation to the referrer and the 'please-wait' pages: In the OOWS method, the available navigations are defined in the Navigational Model.
These navigations are always related to association relationships between two classes from the Object Model, linking related information instances (for example, the books written by a specific author).


However, as the case study has shown, not all navigations are triggered to retrieve related information. In particular, on the user-registration Web page two navigations are performed automatically: 1) a transition to a feedback Web page to wait until the registration operation is finished, and 2) a transition to the Web page, known as the referrer page, from which the user started the registration process. Neither navigation is related to a specific relationship from the Object Model, so they cannot be introduced in the Navigational Model. The OOWS method supports the definition of a navigation transition when a service execution has finished, but not while it is being performed, as the 'please-wait' page requires. Furthermore, every time the service finishes, the user is redirected to the same context; in other words, the target of a defined navigation is always the same. To solve this problem, the Navigational Model definition must be extended to include dynamic navigations (a sketch of such a conditional navigation rule is given after this list). This new type of navigation is defined in an extension of the traditional "Detail Navigational Map" method step. The main difference is that dynamic navigations have a condition that defines when the transition is performed; hence, different targets can be defined according to which condition is satisfied. For that reason, these navigations are not restricted to navigational contexts that have a structural relationship between their classes.
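The following sketch illustrates the intended shape of such a conditional navigation rule for the registration use case. The Java types, method names and example conditions are hypothetical and do not correspond to the actual OOWS primitives or code generator.

// Hypothetical sketch of a dynamic (condition-guarded) navigation rule; illustrative only.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

class DynamicNavigation<S> {
    // Evaluation order matters: the first satisfied condition selects the target context.
    private final Map<Predicate<S>, String> rules = new LinkedHashMap<>();
    private final String defaultTarget;

    DynamicNavigation(String defaultTarget) { this.defaultTarget = defaultTarget; }

    DynamicNavigation<S> when(Predicate<S> condition, String targetContext) {
        rules.put(condition, targetContext);
        return this;
    }

    String target(S state) {
        for (Map.Entry<Predicate<S>, String> r : rules.entrySet()) {
            if (r.getKey().test(state)) return r.getValue();
        }
        return defaultTarget;
    }
}

class RegistrationNavigation {
    record Submission(boolean finished, String referrer) {}

    public static void main(String[] args) {
        DynamicNavigation<Submission> nav = new DynamicNavigation<Submission>("PleaseWait")
            .when(s -> s.finished() && s.referrer() != null, "Referrer")
            .when(Submission::finished, "Home");
        System.out.println(nav.target(new Submission(false, null)));       // PleaseWait
        System.out.println(nav.target(new Submission(true, "Catalogue"))); // Referrer
    }
}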

5 Conclusions and Further Research

Currently, Web applications are focused on more specialised tasks beyond simple information retrieval. It is not hard to imagine that every Web application domain has its own requirements and therefore introduces its own problems. Creating an all-encompassing method might not be the best approach; a better way to handle this is the creation of domain-specific adaptations of the method. In this paper, Situational Method Engineering has been applied to the OOWS Web Engineering method in order to improve its support for a specialised domain: Content Management Systems. The results of this preliminary research have been: 1) a formal description of the method by means of a metamodel, 2) an analysis in the CMS domain, to detect which method fragments need to be re-adapted, and 3) an improved method metamodel to address the improvements detected. The results presented in this chapter aim to provide a better view of the OOWS method, its possibilities and its limitations. However, the same line of reasoning and the lessons learned can be applied to other Web Engineering methods. From the experiences introduced in this chapter and in previous works [8], we can summarise the CMS method improvements in three main points:

1. In the CMS domain, user information management is a key requirement. Although several Web Engineering methods introduce a User Model, their expressiveness is not enough to deal with the modelling of a CMS. Specifically, the OOWS User Model does not provide mechanisms to store all the information required by a CMS system, even though in many Web applications a taxonomy of users is enough for identification purposes. Applying the SME principles, an additional step to enhance the user modelling can easily be included in the method to meet the domain requirements.

2. Web applications are evolving to include external services and even to be part of complex business processes.


To model these complex requirements, new conceptual models, such as service models or BPMN-based diagrams, are required to capture the new expressiveness. Several approaches, such as OOWS or WebML, have extended their methods to achieve this goal. Nevertheless, this makes the methods more difficult to apply. SME can be used to define when these complex models must be built, because a simple Web application does not require a complex business process.

3. Web Engineering methods have focused on the definition of navigational models to deal with the hypermedia requirements. These models link different Web pages using relationships between the data. However, they define neither navigations to Web pages that are not intended to retrieve data nor dynamic navigations that depend on some condition. Since the current navigational models have been widely used to define Web applications, SME could add the definition of alternative navigations without impacting the rest of the method.

In addition to the methodological changes explained here, other technological issues regarding CMS Web systems were detected. Examples are the management of multimedia content, aesthetic presentation and user-error prevention. However, these issues must be solved from an implementation point of view rather than a methodological one; it is expected that future versions of the tools and frameworks will improve the code generation. Finally, the goal of this preliminary research is to encourage the use of SME among the different Web Engineering methods. In this way, a common method base can be built, instead of either a generic Web Engineering method that tries to deal with any domain or a set of new domain-specific methods. As a consequence, if new methodological requirements need to be introduced, the current method can be extended using the suitable fragments from the method base. Future work will define this common method base applying the same process that we have applied in this chapter.

Acknowledgements. This work has been developed with the support of MEC under the project SESAMO TIN2007-62894.

References 1. Deshpande, Y., Murugesan, S., Ginige, A., Hansen, S., Schwabe, D., Gaedke, M., White, B.: Web engineering. Journal of Web Engineering 1, 3–17 (2002) 2. Brinkkemper, S.: Method engineering: engineering of information methods and tools. Information and Software Technology 38, 275–280 (1996) 3. Saeki, M.: Embedding metrics into information systems development methods: An application of method engineering technique. In: Eder, J., Missikoff, M. (eds.) CAiSE 2003. LNCS, vol. 2681, pp. 374–389. Springer, Heidelberg (2003) 4. Ralyt´e, J., Deneck`ere, R., Rolland, C.: Towards a generic model for situational method engineering. In: Eder, J., Missikoff, M. (eds.) CAiSE 2003. LNCS, vol. 2681, p. 1029. Springer, Heidelberg (2003) 5. van de Weerd, I., Brinkkemper, S., Souer, J., Versendaal, J.: A situational implementation method for web-based content management system-applications: Method engineering and validation in practice. In: Software Process Improvement and Practice, pp. 521–538 (2006)

Model-Driven Web Engineering in the CMS Domain

237

6. Fons, J., Pelechano, V., Albert, M., Pastor, O.: Development of web applications from web-enhanced conceptual schemas. In: Song, I.-Y., Liddle, S.W., Ling, T.-W., Scheuermann, P. (eds.) ER 2003. LNCS, vol. 2813, pp. 232–245. Springer, Heidelberg (2003)
7. Pastor, O., Molina, J.C.: Model-Driven Architecture in Practice: A Software Production Environment Based on Conceptual Modeling. Springer, Berlin (2007)
8. Vlaanderen, K., Valverde, F., Pastor, O.: Improvement of a web engineering method applying situational method engineering. In: Cordeiro, J., Filipe, J. (eds.) ICEIS (3-1), pp. 147–154 (2008)
9. Schwabe, D., Rossi, G.: The object-oriented hypermedia design model. Communications of the ACM 38, 45–46 (1995)
10. Ceri, S., Fraternali, P., Bongio, A.: Web Modeling Language (WebML): a modeling language for designing web sites. Computer Networks 33, 137–157 (2000)
11. De Troyer, O.M.F., Leune, C.J.: WSDM: A user-centered design method for web sites. Computer Networks and ISDN Systems 30, 85–94 (1998)
12. Koch, N., Kraus, A.: The expressive power of UML-based web engineering. In: Schwabe, D., Pastor, O., Rossi, G., Olsina, L. (eds.) 2nd International Workshop on Web-Oriented Software Technology (2002)
13. Preciado, J.C., Trigueros, M.L., Sanchez, F., Comai, S.: Necessity of methodologies to model rich internet applications. In: WSE, pp. 7–13 (2005)
14. Bozzon, A., Comai, S., Fraternali, P., Carughi, G.T.: Conceptual modeling and code generation for rich internet applications. In: Wolber, D., Calder, N., Brooks, C.H., Ginige, A. (eds.) ICWE, pp. 353–360. ACM, New York (2006)
15. van de Weerd, I.: WEM: A design method for CMS-based web implementations. Master's thesis, Utrecht University, Utrecht (2005)
16. Valverde, F., Valderas, P., Fons, J., Pastor, O.: An MDA-based environment for web applications development: From conceptual models to code (2007)
17. van de Weerd, I., Brinkkemper, S.: Meta-modeling for situational analysis and design methods (2007)
18. Vlaanderen, K.: OOWS in a CMS-based environment: a preliminary research. Master's thesis, University of Utrecht (pending publication, 2007)
19. Valderas, P., Fons, J., Pelechano, V.: Transforming web requirements into navigational models: An MDA-based approach. In: Delcambre, L.M.L., Kop, C., Mayr, H.C., Mylopoulos, J., Pastor, O. (eds.) ER 2005. LNCS, vol. 3716, pp. 320–336. Springer, Heidelberg (2005)

Part IV

Software Agents and Internet Computing

Binary Serialization for Mobile XForms Services

Jaakko Kangasharju1 and Oskari Koskimies2

1 Helsinki University of Technology, PO Box 5400, 02015 TKK, Finland
2 Nokia Research Center, Itämerenkatu 11–13, 00180 Helsinki, Finland

Abstract. We consider here the case of XForms applications on small mobile devices. The aim is to find out whether a schema-aware binary XML format is better suited to this area than generic compression applied to regular XML. We begin by limiting the potential areas of improvement through considering the features of the binary format, and then proceed to measure effects in the identified areas to determine whether a binary format would be effective. Keywords: Mobile XForms services, binary XML serialization.

1 Introduction

The ServiceSphere project at Nokia Research Center has focused on researching mobile service solutions for small and medium enterprises (SMEs) in different business domains. One of the goals has been to study the benefits of the software-as-a-service paradigm [1], which has gained considerable momentum in fixed Internet-based enterprise solutions. The increasing capabilities of cellular mobile devices open possibilities for developing mobile applications and services in many different business domains. Better UIs, larger memory, integrated peripherals like cameras and voice recorders, and wirelessly connected peripherals like GPS and bar code readers make it possible to use generic mobile phones for applications that used to require custom-made, integrated systems. The development of mobile service implementations in such an environment requires extreme agility, low overhead, rapid prototyping capability, and high enough quality that the services can be deployed in a commercial environment for trial use. One of the technologies we selected based on these requirements was XForms [2]. We have implemented a prototype mobile-optimized XForms processor that allows us to easily reconfigure the user interface of the mobile device. As a consequence of using XForms, all client–server communication is done in XML, which also makes it easier to design flexible server-side components. However, XML is a relatively verbose format, and in some domains the amount of data transmitted from client to server each day can be considerable. When operators charge for data transfer on a per-kilobyte basis, the question of efficient XML interchange becomes important for the commercial viability of a mobile service. In this paper, we analyze real-world data from a deployed mobile XForms client used in the trial of a product distribution chain application. This follows our previous analysis [3], which used two different kinds of documents, repetitive and non-repetitive, for an initial assessment. The application analyzed in this paper has produced a much
larger data set, covering a wide variety of types of XML documents, from small to large, from linear information lists to repetitive structures. The application supports the work of travelling salespersons who visit wholesalers to sell products on behalf of one or more manufacturers. The salespersons are independent agents who get a commission for each sale, and there is no direct contact between wholesaler and manufacturer. The majority of XML documents in the application are product orders that the salespersons send on behalf of the wholesalers, and order confirmations and delivery notifications that the manufacturers send to the salespersons. These documents are typically small, but they can also be large when many items are included in the same order. In addition, the salespersons sometimes receive larger documents containing information needed to make the orders, such as product and contact lists. Finally, salespersons occasionally receive updates to the forms used for creating and viewing documents. The forms are typically large, but there are also some smaller ones used e.g. to add a new wholesaler to the salesperson’s client list. All the documents used in the application were tracked for a period of six months, resulting in a database of thousands of documents. However, the tracking period included the ramp-up phase of the service, so some of the business documents were created only for testing purposes, and the number of forms sent to salespersons is abnormally large due to the initially frequent need to update the forms in response to user feedback. These anomalies do not affect our analysis since forms are excluded already by our initial assessment, and test documents are easily pruned based on addressing information. We begin this paper with a more detailed problem statement in Section 2. A brief overview of the main features of the Xebu binary XML format that we used is given in Section 3. The purpose of the section is to introduce Xebu sufficiently well to allow following the rest of this document, so no technical details are provided. Section 4 considers the scenario on a high level, looking at the features of Xebu to determine how to best use it in the scenario and in what specific ways. Section 5 provides extensive measurements to determine the effectiveness of Xebu in the problem domain. Section 6 lists some conclusions on the feasibility based on the previous sections. Finally, Section 7 outlines some future work that could be done on Xebu to make it a better fit for a variety of applications, especially this one.

2 Problem Statement

XForms [2] is a useful language for specifying interactive XML-based applications for distributed computing. However, when considering small mobile devices that could definitely benefit from a standardized, truly user-interface-agnostic language, the use of XML raises some questions. Chief among these is XML's verbosity, which causes a large amount of network bandwidth to be used in communication. In an XForms application, there are three kinds of documents. The form document follows the XForms schema and contains the data model and presentation logic of the application. The data model also includes the specification of what data the user is expected to provide, and a document template for submitting the data. After a user has filled in the requisite information on the form, the application then needs to send the information to the server.


This is done by filling the user-provided information into the template provided in the form, and sending the resulting XML document over whatever protocol the application uses to communicate. To support the form there are resource documents that contain data needed by the form, such as a product list or a contacts database. These are updated more often than the form itself and so are stored in separate documents.

To mitigate the effect of XML's verbosity, the obvious solution is to apply some form of compression to it before sending it over the network. Common existing protocols such as HTTP [4] already support indicating the use of generic compression like gzip [5]. Since XML is text, and highly redundant text at that, generic compression algorithms usually perform acceptably well. There are, however, two potential issues with generic compression over XML when applied to XForms applications. One is that compression takes time, and when this gets added to the already significant time needed to process XML, the amount of required processing may become prohibitive. The larger problem is that the amount of data may in many cases be quite small, so generic compression that is based on redundancy in the data will not perform very well. For these and other reasons, there have been several proposals for binary XML formats [6,7,8]. Such a format is a replacement for XML that is usually intended to be a more compact representation as well as more efficiently processable. Two requirements for a general binary XML format are that it be able to represent any XML and that it be able to use available schema information (usually in the form of XML Schema [9,10]) to improve its compression ratio. There are a large number of binary XML formats already in use. Well-known general-purpose formats include Fast Infoset [11] and XBIS [12], but these are capable of using only a limited amount of schema information, i.e., they cannot take advantage of the structure information present in a schema. Better use of schema is provided by formats such as ASN.1 [13], BiM [14], and Xenia [15], but these have the drawback that all documents must be schema-valid, and usually good results are achieved only when the schema describes everything very precisely. The EXI format currently being developed at the W3C [16] has the ability to serialize any XML but also to use all the information available in a schema to improve its compression ratio. EXI is also capable of serializing any document even when a schema is in use, but schema-valid documents will result in much smaller final documents.
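As a concrete baseline, generic compression of an XML document can be done with the standard Java gzip classes; the snippet below simply reports the size before and after compression. The sample document string is made up for illustration and does not come from the trial data.

// Measures the effect of generic gzip compression on a small XML document.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipBaseline {
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(data);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical order submission; real documents come from the XForms template.
        String xml = "<order><customer>W-17</customer>"
                   + "<item><code>A-1</code><quantity>5</quantity></item></order>";
        byte[] plain = xml.getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(plain);
        System.out.println(plain.length + " bytes -> " + packed.length + " bytes");
        // For documents this small, the gzip header and dictionary warm-up
        // often eat most of the gain, which motivates schema-aware formats.
    }
}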

3 Xebu Overview

Xebu [17] is a binary XML format developed by one of the authors in the Fuego Core research project (http://www.hiit.fi/fi/fc/). It has been designed to be usable on mobile phones as part of a general XML-based middleware system. It consists of a basic format applicable to any XML and a few techniques that decrease document size further when a schema is available. The main reason why Xebu is attractive for this research is that it has a publicly available, Open Source implementation (http://hoslab.cs.helsinki.fi/homepages/xebu/) written for mobile phones that support the Java Mobile Information Device Profile (MIDP) 1.0, which includes most Java-enabled phones; this eliminates the effort needed for implementing a format processor.


The basic Xebu format, like Fast Infoset and XBIS, is based on tokenization, i.e., the replacement of repeatedly occurring pieces of content by small binary tokens. This is similar to generic compression methods, but in Xebu this tokenization happens at the XML level: the only candidates for tokenization are complete XML namespace URIs, XML names, attribute values, etc. By concentrating on the XML level, Xebu is much faster than a byte-oriented generic compression algorithm. In principle, Xebu has three different schema-based techniques that are each individually applicable:

1. Pre-tokenization establishes a set of token mappings beforehand, based on the names and potential values that appear in a schema.
2. Typed content encoding uses a binary encoding for typed data values, such as integers, instead of encoding everything as strings as in XML.
3. Omission automata process XML event streams by completely leaving out events that are deducible from context; e.g., when the schema specifies a sequence of elements, the start tag of each element is deducible from the previous element's end tag and can be omitted.

In the current implementation, typed content encoding is based on an extended XML API and the presence of xs:type attributes, instead of determining the proper data type from the schema. The omission automaton technique, as it is defined, has one nice benefit: when the automata encounter an event that they do not recognize, it gets passed unchanged through the automata. This permits extending the schema by, e.g., adding new elements at certain places without hurting the compression ratio of the rest of the document. Deletions are also possible, but with the current system the rest of the document will then potentially not benefit from the schema use. However, the omission automata are not usefully applicable in some cases. They perform best for linear schemas, i.e., where elements follow each other in a predetermined sequence. In the case of alternatives, i.e., where there are even two choices for the next event, they do not provide any additional compression, due to the either-or nature of event omission. This is in contrast with techniques that have the choices built into the processing and use an indicator of which choice to take [15,16]. A specific, and potentially unexpected, instance of this issue is mixed content, since there each position has the option of containing either text or an element, so the omission automata do not handle mixed content very well. This also extends to the whitespace often used to indent XML documents for easier readability and editability. This choice reflects Xebu's primary application area, since Xebu-using applications typically write XML through an API, where putting in additional whitespace is normally not expected.
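The following sketch conveys the flavour of XML-level tokenization: each distinct name is emitted together with a one-byte token on first use and as the token alone afterwards. It is a drastically simplified, hypothetical illustration, not the actual Xebu encoding.

// Toy XML-level tokenizer: maps repeated names/values to small tokens.
import java.util.LinkedHashMap;
import java.util.Map;

public class ToyTokenizer {
    private final Map<String, Integer> tokens = new LinkedHashMap<>();

    // Returns "token" for known strings, "token=string" for first occurrences.
    public String encode(String item) {
        Integer token = tokens.get(item);
        if (token != null) {
            return "#" + token;                       // already in the table
        }
        token = tokens.size();
        tokens.put(item, token);
        return "#" + token + "=" + item;              // define the mapping on first use
    }

    public static void main(String[] args) {
        ToyTokenizer t = new ToyTokenizer();
        // Element names of a repetitive product list benefit the most.
        for (String name : new String[] {"Product", "Name", "Price", "Product", "Name"}) {
            System.out.print(t.encode(name) + " ");
        }
        // Output: #0=Product #1=Name #2=Price #0 #1
    }
}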

4 Initial Assessment

In XForms applications, three kinds of XML data are sent over the network. The client receives the form in a document that includes both the form data model and the


form presentation logic. In addition to the form, the client may also receive one or more resource documents that support the form, e.g., a product list. The data model of the form also contains a document template for submitting the data. After filling in the information in the template, the client submits it to the server. This latter step can happen multiple times for the same form.
Looking at the server-to-client sending of the form, real-world applications usually require relatively large forms to capture the necessary presentation logic. The resulting documents are typically in the 10–50 kB size range, for which gzip probably performs quite well, so a binary format might not provide much additional compaction. Furthermore, applying schema-based techniques to this document is not straightforward. The XForms schema is very free-form, allowing many different elements at each place, so the omission automata of Xebu would probably not provide much benefit in this case.
Resource documents, like forms, are sent from server to client, and are of similar size or even larger. The typical resource schema is a table encoded in XML, i.e., there is one repeating "row" element and each row element contains the same "column" elements. For example, a product list would contain a number of Product elements, each of which would contain a Name element, a Price element, and so on. Resource documents are an interesting case for analysis, as their large size makes them amenable to gzip compression, but their simple schema should also allow Xebu to perform well.
The situation is different for submitted and received form data. In this case the document is typically much smaller than in the other cases, in the 1–10 kB size range, and therefore gzip does not achieve as high a compression ratio. Better still, the schema in this case is often very linear, as the template provided in the form has its elements in a specific order, which is then used when sending the filled-in form as well. Therefore the omission automata of Xebu should be well suited for this case.
There still remains the question of how best to apply Xebu in the given situation. For one, there are several options for how to construct the omission automata and pre-tokenization tables from the schema. The current Xebu implementation can either create a description of both in a reasonably simple text format, or it can create Java classes that directly implement the tables and automata in a form understood by the Xebu implementation. Second, there is the choice of constructing these on the server side and transmitting them to the client, or having the client construct them.
In the specific application scenario, having the client construct the tables and automata from a schema seems like the less reasonable option. The client is not going to use the schema for anything other than these Xebu techniques, so having the schema available at the client seems like a waste. Furthermore, doing the generation on the client side consumes the client's resources. Finally, the option of having the server generate the tables and automata for both sides ensures that the client and server are interoperable, as there is no need to specify the generation process precisely.
There are drawbacks to this approach, though. The first is that, in essence, it creates a new schema language that would need to be supported by all implementations of Xebu.
This might not be a large hurdle if Xebu is treated as an essentially proprietary format that is specified precisely by its implementation. The second concern is the size of the transmitted descriptions compared to the schema. While in most cases


encountered so far the generated files are approximately the same size as the schemas, it is possible that some schemas would generate much larger automata (Section 7 outlines ways to eliminate even the potential for this issue). However, for a simple linear schema this should not be a concern.
The current implementation of Xebu supports both the generation of individual Java classes for each schema and generic implementations of the tokenization tables and omission automata. The generic implementations read a simple text-based format file that gives the contents of the tables and the transitions of both automata. The choice between these approaches is probably dictated by the environment. The client will reside on a mobile phone, and MIDP permits loading only the system classes and the classes contained in the application's suite. Therefore, when generating classes, it would not be possible to dynamically include newly recognized schemas at runtime; rather, all used schemas would have to be known at the build time of the application. As XForms-based applications are often generic, instead of being specific to a particular form, it therefore seems best to adopt the generic tables and automata.
Acquiring the schema to use in generation is another question. The algorithm used to generate the omission automata is mostly generic, but it is currently implemented only for RELAX NG [18]. For this initial assessment, the automata generation was also ported to understand XML Schema, to the extent required to process the sample documents (this already covers a substantial amount of what is used in practice). This latter work should be useful, as XForms already supports XML Schema, and when no schema is specified, it should be possible to generate a schema from the template in the form. This generation would seem preferable to the alternative of implementing the automata generation algorithm specifically for the XForms language.
When using XML with schemas in the real world, it often happens that the documents provided are not actually valid according to the schema. As noted, Xebu's omission automata provide some resilience against invalid documents (or, equivalently, schema changes), but the precise limits of this resilience are not known at present. Some cases are straightforward to observe: e.g., adding an element inside a sequence of elements is supported, while deleting a mandatory element is often not supported well. It is not expected that the automata would be fully resilient to changes. However, what the serialization-side automaton should be able to do is determine in all cases whether the serialization succeeded or not. Here success is defined to mean that the parser-side automaton will produce an XML document equivalent to the one that was transmitted. The current implementation can check that the automata have properly round-tripped back to the start state at document end, but this check might not be sufficient. Further evaluation with real documents that are schema-invalid would be needed.
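As a rough illustration of the omission idea for a linear schema, the following Java snippet is a sketch under our own simplifying assumptions, not the generated Xebu automata; the string-based event encoding is invented for this example. It drops start-tag events that a fixed child-element order makes deducible, while passing unknown events through unchanged, mirroring the resilience behaviour described above.

    import java.util.ArrayList;
    import java.util.List;

    // Toy model: events are strings ("S:Name" = start tag, "E:Name" = end tag,
    // "T:..." = text). If the schema fixes the child-element order, each start
    // tag is implied by its position and need not be written.
    public class LinearOmission {
        private final List<String> order;   // fixed child-element order from the schema

        public LinearOmission(List<String> order) {
            this.order = order;
        }

        // Serializer side: drop start-tag events that the schema makes deducible.
        public List<String> omit(List<String> events) {
            List<String> out = new ArrayList<>();
            int expected = 0;
            for (String ev : events) {
                if (ev.startsWith("S:") && expected < order.size()
                        && ev.substring(2).equals(order.get(expected))) {
                    expected++;          // deducible from position: omit the event
                    continue;
                }
                out.add(ev);             // unknown events pass through unchanged
            }
            return out;
        }

        public static void main(String[] args) {
            LinearOmission filter = new LinearOmission(List.of("Name", "Price"));
            List<String> events = List.of("S:Name", "T:Pen", "E:Name",
                                          "S:Price", "T:1.50", "E:Price");
            System.out.println(filter.omit(events));
            // [T:Pen, E:Name, T:1.50, E:Price] -- a parser-side counterpart would
            // re-insert the start tags by walking the same schema-derived order.
        }
    }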

5 Measurements

We performed an extensive experiment with real-world XForms data to properly evaluate the feasibility of using Xebu. Following the analysis of Section 4, these were mostly client-submitted data files rather than the larger and more complex server-provided forms. We included some resource documents to verify whether there is any benefit to be gained from using Xebu with them. After pruning test documents and non-well-formed


Table 1. Compiled class file sizes in bytes for kXML and Xebu

Code    Size
kXML   18056
Xebu   22883
COA    11990

documents, we had a total of 1119 form data documents, all below 7 kB in size, and 32 resource documents, all approximately 30 kB in size. As the documents come from actual runs of the application, these proportions between data and resource documents are realistic, though the sizes of the resource documents are all from the low end. We used Gnuplot [19] and R [20] for plotting and analysis.
We used the trang program (http://thaiopensource.com/relaxng/trang.html) to create an initial schema for the data set. Since this schema was overly permissive, we modified it based on the application characteristics to constrain the data more. For instance, the generation treated a sequence of elements as a repeated choice, which significantly reduces the ability of schema-based optimization techniques to work. Also, in some cases we knew from the application that some values were enumerations and not just generic strings.
XML compressed with gzip has been deemed acceptable for this particular scenario, so that needs to be the main comparison point. Since schema information is present, we can use Xebu's schema-based techniques. We use pre-tokenization both on its own (dubbed "Table" in the graphs) and combined with the omission automata (dubbed "Schema"), because pre-tokenization is always applicable whereas the omission automata are usable only when the schema is of an appropriate form. Since gzip compression is acceptable, we also use it in conjunction with the two Xebu formats.
We begin by measuring the sizes of the compiled class files, shown in Table 1. As the XML processor, we used kXML (http://kxml.org/), which is commonly used on mobile devices. The Xebu component measures the actual Xebu parser and serializer, and the COA component is everything that uses schema information, i.e., pre-tokenization and omission automata. These measurements do not show the size of the compression library; the size of JZlib (http://www.jcraft.com/jzlib/), which is sometimes used on mobile phones, comes to almost 70 kB, but smaller gzip libraries exist. In particular, a decompression-only library, which is sufficient if gzip is used only for documents sent from server to client, can be implemented in less than 10 kB. When deployed on a mobile phone, Java class files are always obfuscated, which shortens class, method, and field names, removes unused code, and performs other optimizations intended to reduce class size. Therefore, the numbers shown in Table 1 are after obfuscation with ProGuard (http://proguard.sourceforge.net/).
We ran our measurements on a regular desktop computer, so we focus only on final document size and do not show any processing time measurements. This is justifiable, as several measurements demonstrate that communication costs, i.e., the amount of transmitted data, dominate energy consumption, which is the most important resource


Fig. 1. Smoothed size ratios for the data documents (the curves plot size relative to uncompressed XML, from 0 to 0.6, against original document size in bytes, from 1000 to 7000, for the XML.gz, Table, Table.gz, Schema, and Schema.gz formats)

on mobile devices [21,22,23,24]. We note, though, that we have measured Xebu’s performance on mobile phones [25] and have found it to be several times faster and also to consume less memory than XML or gzipped XML. Figure 1 shows the document sizes for each format, plotted as ratios against the original XML document size. As there are a very large number of documents, the lines in the graph are drawn as smoothed Bezier curves to more clearly show the overall measurements. Even this approximate graph shows that for the very smallest documents gzipped XML loses even to pre-tokenization. Around the 3-kilobyte mark gzipped XML begins to perform clearly better than just pre-tokenization but even so, it reaches only approximate equality with the omission automaton version of Xebu. When compressed with gzip, Xebu is clearly better than gzipped XML in all cases, even without the omission automata. An interesting feature of the graph is that there appears to be some inverse correlation between gzipped XML and the plain Xebu formats in that the peaks and valleys of the lines mirror each other. We believe that this is explainable by looking at how repetitive the elements in the documents are. When the whole document consists of little else except repetition of a certain kind of element, gzip does very well whereas Xebu will serialize each repetition in approximately the same number of bytes. In contrast, gzip does poorly on a non-repetitive document, but when such a document is also linear, Xebu and especially the omission automata perform very well. Table 2 shows aggregate statistics for the data documents, namely the minimum, 1st quartile, median, 3rd quartile, and maximum for each format. These numbers show approximately what can also be seen from Figure 1, namely that gzipped XML surpasses pre-tokenized Xebu slightly before the half-way point and the omission automaton Xebu only for a few of the documents. We note that there are a few documents


Table 2. Percentiles of small document sizes in bytes

Format      Min   1st q.  Median  3rd q.   Max
XML         549    1560    2617    5444   6813
XML.gz      309     647    1027    1174   1285
Table       217     579     974    1954   2482
Table.gz    195     426     670     754    978
Schema      152     444     718    1054   2117
Schema.gz   150     352     515     586    859

where uncompressed Xebu is clearly inferior to gzipped XML, but gzipping Xebu as well is a clear improvement.
The resource documents were very uniform in their results, so Table 3 only shows approximate ratios for them, which are reasonably accurate over the whole data set. The sizes of these documents range between 29 kB and 32 kB. As can be seen, at these sizes gzip compression proves to be much more efficient than schema-based techniques alone. There is still a slight advantage when a schema-optimized document is compressed with gzip, likely due to the entropy reduction that is possible when a schema is available, but the advantage is very small and would probably disappear completely on even larger documents.

Table 3. Approximate size ratios for the resource documents

Format      Ratio
XML          1.00
XML.gz       0.10
Table        0.34
Table.gz     0.10
Schema       0.23
Schema.gz    0.09

Depending on the application, and especially on the expected document sizes, it may even be beneficial to save application footprint by dispensing with a gzip compressor completely and using plain Xebu for communicating from client to server. This is particularly the case when the documents are non-repetitive and only a few kilobytes in size. Considering a typical XForms application, such as the one we measured, this appears to be the common case.

6 Conclusions

Based on the experiments above, we can conclude that replacing gzipped XML with a binary serialization format, gzipped or not, yields significant benefits in the examined domains. In particular, even if the COA implementation is considered too fragile, pre-tokenization alone seems to provide improved compactness compared to gzip. We note that the Xebu implementation with the COA is approximately twice the size of kXML, so the code size poses little hindrance except in the most resource-constrained of environments. More complex bit-efficient XML formats, such as the


upcoming EXI standard [16], may yield even more savings. On the other hand, in the resource-constrained mobile environment, simplicity of format and implementation is also important.
Specifically in the XForms case, we recommend the approach shown in Figure 2. Gzip should be used for the form and resource documents, since they are sufficiently large for gzip to do well and the XForms schema is complex, so Xebu's schema optimizations have less effect. The situation is reversed for the actual data documents, which are small and have rigid schemas, so Xebu, or another equally efficient binary format, should be used there. A further advantage of this approach is that since the gzip-compressed documents are always sent from server to client, the client only needs to support gzip decompression.

Fig. 2. Recommended compression use in XForms applications

It will also be beneficial to include negotiation of at least compression, and possibly the format itself, in the communication protocol. For the smallest documents there is little benefit in compressing the already very small Xebu documents, so always compressing would simply waste time. However, for even slightly larger documents compression can become a net win, so the protocol should offer the ability to compress when necessary.
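A minimal server-side sketch of such a size-based decision, assuming Java SE's java.util.zip is available there (MIDP clients lack it) and using an invented 1 kB threshold purely for illustration, could look as follows; a real protocol would negotiate the threshold and the format with the peer.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    // Illustrative sketch only: choose between sending a serialized document
    // as-is or gzipped. The threshold value is an assumption for illustration.
    public class CompressionChooser {
        private static final int SMALL_DOCUMENT_THRESHOLD = 1024;

        public static byte[] prepareForSending(byte[] serialized) throws IOException {
            // Very small documents: compression overhead is not worth it.
            if (serialized.length <= SMALL_DOCUMENT_THRESHOLD) {
                return serialized;
            }
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
                gzip.write(serialized);
            }
            byte[] compressed = buffer.toByteArray();
            // Fall back to the uncompressed form if gzip did not actually help.
            return compressed.length < serialized.length ? compressed : serialized;
        }
    }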

7 Future Work

The design and implementation of Xebu have been carried out in the context of a research project, and their main purpose has been to test a variety of ideas related to binary serialization of XML. Therefore the implementation choices that have been made may not be


the best possible for wide application, and there are some concerns that would be useful to address to make Xebu more widely applicable.
The implementation of typed content fits well into Xebu's initial application area, and taking proper advantage of type support in the format does require an extended API of some form. However, requiring xs:type attributes is not a very good solution, as they are not often present in real data apart from some SOAP applications, and they require additional processing. A better solution could be to build a filter based on the schema that would decode bytes from the input directly into objects according to the schema state.
The omission automata are currently specified as normal finite automata. This means that they cannot, e.g., support recursive content in elements. Another problem this causes is that, for each kind of element, the part of the automata constructed for that element needs to be duplicated everywhere the element could appear. This does not cause any problems for so-called "Russian doll" schemas [26], but with a flat schema it could lead to a blowup in the size of the automaton compared to the schema (theoretically even exponential, though this should not be a concern in the XForms case). This could be fixed by splitting each element into its own automaton and extending the automata with an auxiliary stack that keeps track of the automata corresponding to the parent elements and the states they are in.
Regarding the transmission format for the automata and tokenization tables, the current text-based format is designed for readability, to aid in debugging the automata. In particular, it contains much repetition and readable names. It would be straightforward to design a much more compact format by tokenizing all information and packing it tightly into bytes. However, it is not yet clear what kind of size savings such a format would offer over gzipping the current text-based format. The automata description file used in the experiments above gzips to 11 kilobytes.
However, all of these considerations on extending Xebu need to be contrasted with how the work at the W3C on a standard binary XML format is progressing [16]. The format specification is already feature-complete and expected to enter Last Call soon. A few implementation efforts are also under way, but we are not aware of any targeting mobile devices. EXI has the benefit that it is an open standard, which we expect to become accepted in the mobile device community, whereas Xebu is essentially a proprietary format. Xebu does have the advantages, though, that an implementation already exists, its properties are mostly well understood, and the Xebu format is much simpler than EXI.

References 1. Traudt, E., Konary, A.: 2005 software as a service taxonomy and research guide. Research report, IDC (2005) 2. World Wide Web Consortium Cambridge, Massachusetts, USA: XForms 1.0. 2nd edn., W3C Recommendation (2006) 3. Kangasharju, J., Koskimies, O.: Using bit-efficient XML to optimize data transfer of XForms-based mobile services. In: 10th International Conference on Enterprise Information Systems (2008) (in submission)


4. Fielding, R., Gettys, J., Mogul, J., Nielsen, H.F., Masinter, L., Leach, P., Berners-Lee, T.: RFC 2616: Hypertext Transfer Protocol — HTTP/1.1. Internet Engineering Task Force (1999) 5. Deutsch, L.P.: RFC 1952: GZIP File Format Specification Version 4.3. Internet Engineering Task Force (1996) 6. Pericas-Geertsen, S.: Binary interchange of XML Infosets. In: XML Conference and Exposition (2003) 7. World Wide Web Consortium: W3C Workshop on Binary Interchange of XML Information Item Sets, World Wide Web Consortium (2003) 8. World Wide Web Consortium Cambridge, Massachusetts, U.S.A.: XML Binary Characterization, W3C Note (2005) 9. World Wide Web Consortium Cambridge, Massachusetts, U.S.A.: XML Schema Part 1: Structures, W3C Recommendation (2001) 10. World Wide Web Consortium Cambridge, Massachusetts, USA: XML Schema Part 2: Datatypes, W3C Recommendation (2001) 11. Sandoz, P., Triglia, A., Pericas-Geertsen, S.: Fast Infoset. On Sun Developer Network (2004) 12. Sosnoski, D.M.: XBIS XML Infoset encoding. In: [7] 13. International Telecommunication Union, Telecommunication Standardization Sector Geneva, Switzerland: Mapping W3C XML Schema Definitions into ASN.1, ITU-T Rec. X.694 (2004) 14. Niedermeier, U., Heuer, J., Hutter, A., Stechele, W., Kaup, A.: An MPEG-7 tool for compression and streaming of XML data. In: IEEE International Conference on Multimedia and Expo, pp. 521–524 (2002) 15. Werner, C., Buschmann, C., Brandt, Y., Fischer, S.: Compressing SOAP messages by using pushdown automata. In: IEEE International Conference on Web Services, pp. 19–26. Institute of Electrical and Electronic Engineers, Piscataway (2006) 16. World Wide Web Consortium Cambridge, Massachusetts, USA: Efficient XML Interchange (EXI) Format 1.0, W3C Working Draft (2008) 17. Kangasharju, J., Tarkoma, S., Lindholm, T.: Xebu: A binary format with schema-based optimizations for XML data. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, J.-Y., Sheng, Q.Z. (eds.) WISE 2005. LNCS, vol. 3806, pp. 528–535. Springer, Heidelberg (2005) 18. Organization for the Advancement of Structured Information Standards Billerica, Massachusetts, U.S.A.: RELAX NG Specification (2001) 19. Williams, T., Kelley, C., et al.: (Gnuplot) 20. R Development Core Team Vienna, Austria: R: A Language and Environment for Statistical Computing (2008) 21. Kangasharju, J., Lindholm, T., Tarkoma, S.: XML security with binary XML for mobile Web services. International Journal of Web Services Research 5(3), 1–19 (2008) 22. Karri, R., Mishra, P.: Optimizing the energy consumed by secure wireless sessions — Wireless Transport Layer Security case study. Mobile Networks and Applications 8, 177–185 (2003) 23. Potlapally, N.R., Ravi, S., Raghunathan, A., Jha, N.K.: A study of the energy consumption characteristics of cryptographic algorithms and security protocols. IEEE Transactions on Mobile Computing 5, 128–143 (2006) 24. Barr, K.C., Asanovic, K.: Energy-aware lossless data compression. ACM Transactions on Computer Systems 24, 250–291 (2006) 25. Kangasharju, J.: XML Messaging for Mobile Devices. PhD thesis, University of Helsinki, Department of Computer Science, Helsinki, Finland (2008) 26. van der Vlist, E.: Using W3C XML schema. On XML.com (2001)

An Efficient Neighbourhood Estimation Technique for Making Recommendations

Li-Tung Weng, Yue Xu, Yuefeng Li, and Richi Nayak

Faculty of Information Technology, Queensland University of Technology, 4001 Queensland, Australia
[email protected], {yue.xu,y2.li,r.nayak}@qut.edu.au

Abstract. Recommender systems produce personalized product recommendations during a live customer interaction, and they have achieved widespread success in e-commerce. For many recommender systems, especially collaborative filtering based ones, neighbourhood formation is an essential algorithmic component: in order for a collaborative filtering based recommender to make a recommendation, it must form a set of users sharing similar interests with the target user. Forming a neighbourhood by scanning all users in the dataset is not desirable for large datasets containing millions of items and users. In this paper, we present a novel neighbourhood estimation method which is both memory and computation efficient. Moreover, the proposed technique also alleviates the common "fixed-n-neighbours" problem of standard "best-n-neighbours" techniques, and therefore allows better recommendation quality. We combined the proposed technique with a taxonomy-driven product recommender, and in our experiments both the time efficiency and the recommendation quality of the recommender are improved.

Keywords: Recommender System, Neighbourhood Formation, Taxonomic Information.

1 Introduction

Recommender systems are designed to improve users' information-seeking experience by recommending information that matches their needs. User-based collaborative filtering is the most fundamental and widely applied recommendation technique [1]; it generates recommendations by finding items that are commonly preferred by the neighbourhood of the target user. Specifically, a target user's neighbourhood is a set of users sharing similar preferences with the target user [2]. Neighbourhood formation in collaborative filtering techniques requires comparing the target user's preferences to the preferences of all users in the dataset, and this preference comparison process can become a major computational bottleneck for recommenders. For large datasets, the neighbourhood formation process requires a large amount of I/O to retrieve user profiles, and each user profile may be represented by a very high-dimensional vector, so the similarity computation between the vectors can be very expensive.


Our main contribution in this paper is a novel neighbourhood estimation method called "relative distance filtering" (RDF). It is based on pre-computing a small set of relative distances between users and using these pre-computed distances to eliminate most of the unnecessary similarity comparisons between users. The proposed RDF method is also capable of handling frequent data updates dynamically: whenever user preferences in the dataset are added, deleted or modified, the pre-computed structure cache can be updated efficiently.
Part of our research is to develop a novel collaborative filtering based recommender that utilizes item taxonomy information for its user preference representation. Our work is based on a well-known taxonomy recommender, the taxonomy-driven product recommender (TPR) proposed by Ziegler [3], which utilizes the taxonomy information of the products to address the data sparsity and cold-start problems. TPR outperforms standard collaborative filtering systems with respect to recommendation accuracy when producing recommendations for sites with sparse data. However, the time efficiency of TPR drops significantly when dealing with a huge number of users, because the user preferences in TPR are represented by high-dimensional vectors. We applied the proposed RDF technique to TPR, and the experimental results show that by utilizing the proposed technique, both the accuracy and the efficiency of TPR are significantly improved.

2 Related Work

Neighbourhood formation is a process required by most collaborative filtering based recommenders to find users with similar interests to the target user. Sarwar [4] proposed an efficient neighbourhood selection method based on pre-clustering the users. However, clustering is an expensive process and can only be done offline, while datasets keep changing over time; therefore the overall quality of the neighbourhoods derived from existing clusters will degrade until the next clustering update. Moreover, cluster-based neighbourhood selection favours target users near cluster centres; for users located at cluster edges, the quality of the resulting neighbourhoods is usually poor because their actual neighbours are very likely in other clusters [4].
There are also several neighbourhood formation algorithms developed specifically for high-dimensional data, such as RTree [5], kd-tree [6], etc. The basic idea behind these algorithms is to index the high-dimensional data into a search tree structure: each tree node holds a region of the data space, and at each level the children nodes subdivide the region held by their parent node into finer regions. The search efficiency of these algorithms is very impressive, because the search space is recursively reduced at each tree level (i.e., O(log N) search). However, they suffer from a problem similar to that of cluster-based neighbourhood search, namely loss of precision. In fact, these algorithms usually produce worse results than clustering-based methods. Moreover, because the internal tree structures for indexing the data are fairly complex, these algorithms are usually memory intensive and slow to initialize. The proposed RDF technique is not as good as these tree-structure based methods in terms of computation efficiency, but it is still more efficient than cluster-based search methods. In terms of accuracy, the proposed method produces much better results than these tree-structure based methods because it does not

2 Related Work Neighbourhood formation is a process required by most collaborative filtering based recommenders to find users with similar interests to the target user. Sarwar [4] proposed an efficient neighbourhood selection method by pre-computing users into clusters. However, clustering is an expensive process and can only be done offline. Datasets keep changing over time. Therefore the overall quality of the result neighbourhood based on existing clusters will degrade until the next clustering update. Moreover, clustering based neighbourhood selection favours target users nearby cluster centres, and for other users located at surrounding cluster edges the quality of their result neighbourhoods are usually poor because their actual neighbours are very likely in other clusters [4]. There are also several neighbourhood formation algorithms developed specifically for high dimensional data, such as RTree [5], kd-Tree [6], etc. The basic idea behind these algorithms is to index these high dimensional data into a search tree structure, and within each level, the children nodes subdivides the cluster their parent node holds into finer clusters and each tree node holds one of the cluster spaces. The search efficiency of these algorithms is very impressive, because the search space are quadratically reduced in each tree level (i.e. O(logN)). However, they suffer from similar problems to cluster based neighbourhood search, which is “loss of precision”. In fact, these algorithms usually produce worse result than clustering based method. Moreover, because the internal tree structures for indexing the data are fairly complex, therefore these algorithms are usually memory intensive and slow in initialization. The proposed RDF technique is not as good as these tree-structure based methods in terms of computation efficiency, however it is still more efficient than cluster based search method. In terms of accuracy, the proposed method produces much better result than these tree-structure based methods because it does not

An Efficient Neighbourhood Estimation Technique for Making Recommendations

255

constrain neighbourhood search within local clusters. The internal structure of the proposed RDF technique can be updated dynamically in real time and requires only very small amount of physical memory.

3 Taxonomy Product Recommender An overview of taxonomy-driven product recommender (TPR) proposed by Ziegler [3, 7] is given in this section. 3.1 Item Taxonomy Model We envision a world with a finite set of users , ,…, and a finite set of items , ,…, . For each user , he or she is associated with a set of corresponding implicit ratings , where . Unlike explicit ratings in which users are asked to supply their perceptions to items explicitly in a numeric scale, implicit ratings such as transaction histories, browsing histories, etc., are more common and obtainable for e-commerce sites and communities. In standard collaborative filtering recommenders, user profiles are represented by -dimensional vectors, where | | and each dimension represents an explicit item rating. However, for many systems, can be very large and the number of ratings made by each user can be very small. This problem is often addressed as cold start problem or data sparsity problem. Data sparsity problem is relieved with TPR, because instead of using the productrating vectors with | | dimensionalities as user profiles, TPR uses taxonomy vectors with dimensionalities, where is the number of topics in the product taxonomy space. Specifically, we denote the taxonomy vector for as , ,…, , and each dimension of indicates the degree of ’s interest to the corresponding topic. The taxonomy vector in TPR has three advantages over standard product rating vector. Firstly, for most e-commerce sites is much smaller than | |, and therefore it can yield better computational performances. Secondly, because the taxonomy vector records the user taxonomy preferences instead of item preference, and different items can share their descriptors entirely or partially, thus, even for users with no common item interests, their profiles can still be correlated. Thirdly, the construction of the taxonomy vector can be done with only implicit ratings, and therefore it effectively solved the data sparsity problem. 3.2 Recommendation Generation In this paper, the distances between user taxonomy vectors are computed by Euclidean distance, specifically: ,

∑|

|

(1)

Based on the distance measure, the target user u_i's neighbourhood N(u_i) can be formed by selecting the n users from U \ {u_i} with the shortest distances to u_i. By extracting the items implicitly rated by the neighbourhood, a candidate item list B is formed for u_i's personalized recommendation list, formally:


B = ( ⋃_{u_j ∈ N(u_i)} R_j ) \ R_i                    (2)

The items in the candidate list B need to be ranked according to their closeness to the target user's personal interest. The ranking equation used to weight u_i's possible interest in a candidate item t is shown below:



(3)

In equation (3), the computed score is negated because the proximity measure is distance based (i.e., a small value indicates strong similarity); by negating the score we let larger weight values indicate higher item interest. The weighting relies on a conversion that creates a dummy user for item t, so that the proximity of the taxonomy vectors between u_i and t can be measured; the conversion simply creates a user whose implicit rating set contains only t. Finally, after the candidate item weights are computed, the top m items with the highest weight values are recommended to the target user.
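As a toy illustration of the candidate-list step in equation (2), assuming items are identified by plain strings (the class and item names below are ours, not TPR's), the candidate set is simply the union of the neighbourhood's implicitly rated items minus the items the target user has already rated:

    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    // Toy illustration of equation (2): collect everything the neighbours rated
    // implicitly and drop what the target user already knows.
    public class CandidateList {
        public static Set<String> candidates(Set<String> targetRatings,
                                              List<Set<String>> neighbourRatings) {
            Set<String> candidates = new LinkedHashSet<>();
            for (Set<String> ratings : neighbourRatings) {
                candidates.addAll(ratings);        // union of the neighbourhood's items
            }
            candidates.removeAll(targetRatings);   // exclude items the target already rated
            return candidates;
        }

        public static void main(String[] args) {
            Set<String> target = Set.of("book-a", "book-b");
            List<Set<String>> neighbours = List.of(
                    Set.of("book-a", "book-c"),
                    Set.of("book-c", "book-d"));
            System.out.println(candidates(target, neighbours));   // [book-c, book-d]
        }
    }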

4 Proposed Approach

In this paper, we identified two aspects of TPR that can be improved. Firstly, even though the product-rating vectors are compressed into taxonomy vectors with a smaller number of dimensions, for datasets with a large number of users and extensive taxonomy structures the neighbourhood formation still becomes one of the computational bottlenecks of TPR, because it requires an extensive amount of I/O to retrieve user profiles (i.e., taxonomy vectors) from the database, and the proximity computation (i.e., equation (1)) for high-dimensional vectors is expensive as well.
Next, in Ziegler's TPR implementation [3], "best-n-neighbours" is applied as the neighbourhood selection method, since "best-n-neighbours" performs better than "correlation-threshold" for sparse datasets [3]. However, because the value of n is pre-specified in "best-n-neighbours", the resulting neighbourhoods will be biased for users whose number of true neighbours is less than n [8]. This issue is particularly sensitive for users with unusual tastes, as it is likely that a portion of their neighbourhoods formed by "best-n-neighbours" will contain neighbours that are dissimilar to them. For example, if a user has distinct tastes and shares similar tastes with only 2 other users, the recommendation result for this user might be biased if a neighbourhood of 20 users is used.
In this paper, we propose a novel neighbourhood estimation method which is both memory and computation efficient. By substituting the proposed technique for the standard "best-n-neighbours" in TPR, the following two improvements are achieved:

• The computation efficiency of TPR is greatly improved.
• The recommendation quality of TPR is also improved, as the impact of the "fixed neighbours" problem is reduced. That is, the proposed technique can help TPR locate the true neighbours of a given target user (the number of true


neighbours might be smaller than n), and therefore the recommendation quality can be improved, as only the truly close neighbours of the target user are included in the computation.

4.1 Relative Distance Filtering

Forming the neighbourhood of a given user u with the standard "best-n-neighbours" technique involves computing the distances between u and all other users and selecting the top n neighbours with the shortest distances to u. However, unless the distances between all users can be pre-computed offline or the number of users in the dataset is small, forming the neighbourhood dynamically can be an expensive operation.
Clearly, for the standard neighbourhood formation technique described above, a significant amount of the overhead is spent computing distances to users that are obviously far away (i.e., dissimilar users). The performance of the neighbourhood formation can be drastically improved if we exclude most of these very dissimilar users from the detailed distance computation. In the proposed RDF method, this exclusion or filtering process is achieved with a simple geometrical implication: if two points are very close to each other in a space, then their distances to a given randomly selected point in the space must be similar.
In Figure 1, a user set is projected onto a two-dimensional plane, where each user is depicted as a dot. In the figure, u is the target user, and the dots embraced by small circles are the top 15 neighbours of u. The RDF method starts by randomly selecting a reference user r_1 in the user set, and then r_1's distances to all other users are computed and sorted (r_2 and r_3 are also reference users). Based on the triangle inequality, it is easy to observe that all of u's neighbours have similar distances to r_1. This means that, in the process of forming u's neighbourhood, we only need to compute distances between u and the users in the set S_1, which is defined as:

S_1 = { x ∈ U : |dist(x, r_1) − dist(u, r_1)| ≤ ε }                    (4)

where dist(x, r_1) denotes the distance between user x and the reference user r_1. In equation (4), |dist(x, r_1) − dist(u, r_1)| is the difference between the distances from x to r_1 and from u to r_1. According to the modus tollens inference rule (if the consequent of an implication is false, the antecedent of the implication must be false), it follows from the geometrical implication mentioned above that if |dist(x, r_1) − dist(u, r_1)| is large, then x and u are not close to each other. ε is a distance threshold: if |dist(x, r_1) − dist(u, r_1)| is larger than ε, the user x can be excluded from u's neighbourhood. If ε is set to a larger value, the distance threshold is relaxed and more users are included; in this case the performance decreases, because more users are included in the actual distance computations. In our experiment, ε is set to one tenth of the distance between the reference user and its furthest neighbour.
To further optimize the neighbourhood estimation, we can bring more reference users (for example r_2 and r_3) into the estimation process to obtain more estimated searching spaces (i.e., S_2 and S_3). With multiple estimated searching spaces, the final estimated searching space can be drastically reduced by intersecting these spaces (i.e., S_1 ∩ S_2 ∩ S_3). It can be observed in Figure 2 that the intersected searching space


Fig. 1. Projected user profiles

Fig. 2. Estimated searching space with three reference users

is much smaller than the entire set and, most importantly, it covers u's closest users. Only the users in the intersection area need to be checked to determine u's neighbourhood. The actual I/O and distance computations only need to be conducted within the intersected space, and thus the efficiency is greatly improved.

4.2 Reference User Selection

The selection of reference users is important for RDF. In order to optimize the performance of TPR, the final estimated searching space (i.e., S_1 ∩ S_2 ∩ S_3) needs to be as small as possible for any given target user. To achieve this, the reference users need to be as far from each other as possible: if the reference users are close to each other, the ring-shaped borders of their search spaces will overlap heavily (since they all have similar centres and radii). Moreover, the number of reference users should be kept small (we use only 3 reference users in all our experiments), because as the number of reference users increases, the time required for the offline reference user initialization and the memory required for caching the sorted distances increase too.


In our implementation, the reference users are initialized with a simple two-pass technique. The first reference user r_1 is chosen randomly, and we compute its distances to all other users in U. Next, using the computed distances, we obtain the second reference user r_2 as the user furthest from r_1. Finally, we again find the user furthest from both r_1 and r_2 and set it as the third reference user r_3. With this method, the initialization process is kept simple and efficient, and the resulting reference users are also very distant from each other.

4.3 Proposed RDF Implementation

This section describes in detail the implementation of the RDF method discussed in Sections 4.1 and 4.2. With the proposed implementation, the full benefit of RDF can be realized. First of all, it is important to note that the distances between users and reference users are not meant to be computed online, because doing so would be more expensive than the one-by-one search. Instead, these distances are computed, structured and indexed offline into a data structure called the RDF searching cache, which is loaded into memory in the initialization stage of the online recommendation process. This pre-computed searching cache is shared by all neighbourhood formation processes. Its detailed structure is depicted in Figure 3. In the searching cache, each user is associated with a data structure called a "user node"; for any user u, η(u) denotes u's user node. A user node stores two types of information about a user:

• User ID. Instead of fitting the entire user profile or user taxonomy vector into memory, only the user ID is stored in the cache. The user IDs are used to identify and retrieve the actual user profiles from the database.
• Distances to the Reference Users. The distances from the user node's corresponding user to the reference users are stored in a vector. In our implementation there are only three reference users, r_1, r_2 and r_3, so the distance vector of the user node η(u) is (d_1, d_2, d_3), where d_1, d_2 and d_3 correspond to the distances from u to r_1, r_2 and r_3, respectively.

In order to efficiently retrieve the estimated searching space described in equation (4), a binary tree structure is used to index and sort the user nodes. The index keys used for each user node are the distances between the user and the reference users, that is, the index keys for η(u) are d_1, d_2 and d_3. With the three different index keys, the user nodes can be sorted under different index key settings, i.e., by any one of the three keys. Because the user nodes are stored in this binary tree structure, the computational cost of evaluating equation (4) is reduced to O(log N), where N = |U|. Note that this estimated user-space retrieval process is very efficient, not only because the whole computation can be done within a small amount of memory (thus no database I/O is required), but also because each index key lookup involves only a comparison of two


double values. Finally, because the distances between the target user and the reference users are needed during the neighbourhood formation process, the user profiles of the reference users themselves are stored in the cache. The memory required for the reference user profiles is trivial, because there are only three reference users.
Given that the RDF searching cache is properly initialized, the detailed RDF procedure is described below:

RDF Algorithm
1) Let u be the target user and n be the pre-specified number of neighbours for u.
2) Use the indexed tree structure to locate the minimal user node set ξ_u within the given boundary:

   ξ_u = { η(x) : |d_p(x) − d_p(u)| ≤ ε },

where the primary index key d_p is chosen from {d_1, d_2, d_3} so as to achieve the minimal search space. Note that the actual computation of ξ_u can be very efficient: by utilizing the pre-computed searching cache, estimating the size of the user node set does not involve looping through the user nodes one by one.
3) Based on step 2, d_p is the primary index key used to sort and retrieve ξ_u, and it is one of d_1, d_2 and d_3. The other two index keys (also in {d_1, d_2, d_3}) are denoted d_y and d_z.
4) We refine the searching space using the reference users corresponding to d_y and d_z. This process is similar to finding the intersected space described in Section 4.2.

Fig. 3. Structure for the RDF searching cache

5) FOR each η in ξ_u DO
     IF |d_y(η) − d_y(u)| > ε OR |d_z(η) − d_z(u)| > ε THEN
       remove η from ξ_u


     END IF
   END FOR
6) Perform the standard "best-n-neighbours" search against the estimated searching space ξ_u and return the resulting neighbourhood for u.
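To make the overall procedure concrete, the following self-contained Java sketch is our own simplified rendering of relative distance filtering, not the authors' implementation: users are plain taxonomy vectors, the three reference vectors are supplied externally, the ε threshold is assumed, and a sorted map over the first reference distance stands in for the binary-tree index, so the band of equation (4) can be fetched without a full scan before the remaining references prune it and exact distances are computed only for the survivors.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Simplified sketch of relative distance filtering (RDF).
    public class RdfNeighbourhood {
        private final List<double[]> users;
        private final double[][] refs;        // three reference vectors
        private final double[][] refDist;     // refDist[k][i] = dist(users[i], refs[k])
        private final TreeMap<Double, List<Integer>> byRef0 = new TreeMap<>();

        public RdfNeighbourhood(List<double[]> users, double[][] refs) {
            this.users = users;
            this.refs = refs;
            this.refDist = new double[refs.length][users.size()];
            for (int i = 0; i < users.size(); i++) {
                for (int k = 0; k < refs.length; k++) {
                    refDist[k][i] = dist(users.get(i), refs[k]);
                }
                byRef0.computeIfAbsent(refDist[0][i], d -> new ArrayList<>()).add(i);
            }
        }

        // Top-n neighbours of the target, checking exact distances only for users
        // whose reference distances are all within epsilon of the target's.
        public List<Integer> neighbours(double[] target, int n, double epsilon) {
            double[] targetRef = new double[refs.length];
            for (int k = 0; k < refs.length; k++) {
                targetRef[k] = dist(target, refs[k]);
            }
            // Range query on the first reference user (equation (4)).
            NavigableMap<Double, List<Integer>> band =
                    byRef0.subMap(targetRef[0] - epsilon, true, targetRef[0] + epsilon, true);
            List<Integer> candidates = new ArrayList<>();
            for (List<Integer> bucket : band.values()) {
                for (int i : bucket) {
                    boolean keep = true;
                    for (int k = 1; k < refs.length; k++) {   // intersect with the other references
                        if (Math.abs(refDist[k][i] - targetRef[k]) > epsilon) {
                            keep = false;
                            break;
                        }
                    }
                    if (keep) {
                        candidates.add(i);
                    }
                }
            }
            // Exact distances only for the surviving candidates.
            candidates.sort(Comparator.comparingDouble(i -> dist(users.get(i), target)));
            return candidates.subList(0, Math.min(n, candidates.size()));
        }

        private static double dist(double[] a, double[] b) {
            double sum = 0.0;
            for (int k = 0; k < a.length; k++) {
                double d = a[k] - b[k];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }
    }

In use, the structure is built once offline and neighbours() is called per request; as with the paper's method, a too-small ε may exclude some true neighbours, which is the accuracy/efficiency trade-off discussed above.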

5 Experiments

This section presents empirical results obtained from our experiments.

5.1 Experiment Setup

The dataset used in this experiment is the "Book-Crossing" dataset (http://www.informatik.uni-freiburg.de/~cziegler/BX/), which contains 278,858 users providing 1,149,780 ratings for 271,379 books. Because TPR uses only implicit user ratings, we removed all explicit user ratings from the dataset and kept the remaining 716,109 implicit ratings for the experiment.
The goal of our experiment is to compare the recommendation performance and computation efficiency of the standard TPR [3] and the RDF-based TPR proposed in this paper. The k-folding technique is applied (with k set to 5) for the recommendation performance evaluation. With k-folding, every user u's implicit rating list R is divided into 5 equal-size portions; in each of the five combinations, one portion is selected as u's training set and the remaining 4 portions are combined into a test set for u. In the experiment, the recommenders use the training set to learn u's interests, and the recommendation list generated for u is then evaluated against the test set. Moreover, the neighbourhood size used for neighbourhood formation is set to 20, and the number of items in each recommendation list is also set to 20.
For the computation efficiency evaluation, we implemented four versions of TPR, each equipped with a different neighbourhood formation algorithm. The four TPR versions are:

• Standard TPR: the neighbourhood formation method is based on comparing the target user to all users in the dataset.

• RDF based TPR: the proposed RDF method is used to find the neighbourhood.
• RTree based TPR: the RTree [5] is used to find the neighbourhood. RTree is a tree-structure based neighbourhood formation method that has been widely applied in many applications.
• Random TPR: this TPR forms its neighbourhood with randomly chosen users. It is used as the baseline for the recommendation quality evaluation.

The average time required by the standard, RTree based and RDF based TPRs to make a recommendation is compared. We incrementally increase the number of users in the dataset (from 1000, in steps of 1000, up to 14000) and observe how the computation times are affected.


In this paper, precision and recall are used for the evaluation of TPR, with the following formulas:

precision = |L_u ∩ T_u| / |L_u|                    (5)

recall = |L_u ∩ T_u| / |T_u|                    (6)

where L_u is the recommendation list generated for user u and T_u is u's test set.

5.2 Result Analysis

Figures 4 and 5 show the performance comparison between the standard TPR and the proposed RDF based TPR using the precision and recall metrics. The horizontal axis of both the precision and recall charts indicates the minimum number of ratings in a user's profile (i.e., |R_u|); larger x-coordinates therefore imply that fewer users are considered in the evaluation. It can be observed that the proposed RDF based TPR outperformed the standard TPR in both recall and precision. The results confirm that when dissimilar users are removed from the neighbourhood, the quality of the resulting recommendations improves. The RTree based TPR performs much worse than both the RDF based TPR and the standard TPR, as it is unable to accurately allocate neighbours for target users.

Fig. 4. Precision results obtained from TPR with different neighbourhood formation methods

The efficiency evaluation is shown in Figure 6. It can be seen that the time efficiency of the standard TPR drops drastically as the number of users in the dataset increases. For a dataset with 15000 users, the system needs about 14 seconds to produce a recommendation for a user, which is not acceptable for most commercial systems. By comparison, the RDF based TPR is much more efficient: it needs less than 4 seconds to produce a recommendation for a dataset with 15000 users. The RTree based TPR greatly outperforms the proposed method when the number of users in the


Fig. 5. Recall results obtained from TPR with different neighbourhood formation methods

Fig. 6. Average recommendation time for different TPR settings

dataset is under 8000. However, as the number of users in the dataset increases, the difference between the RDF and RTree based TPRs becomes smaller, and RDF starts to outperform RTree when the number of users exceeds 9000. This is because RTree is only efficient when the number of tree levels is small; as the number of levels increases (i.e., as the number of users increases), RTree's performance drops drastically, because the chance of a high-dimensional vector comparison increases quadratically with the number of tree levels. The proposed RDF method outperforms the RTree method because its indexing strategy is based on single values, which reduces the need for high-dimensional vector comparisons.


6 Conclusions

In this paper, we presented a novel neighbourhood estimation method for recommenders, namely RDF. By embedding RDF in a TPR based recommender, not only is the computation efficiency of the system improved, but the recommendation quality is also improved. The RDF method differs from clustering based neighbourhood formation methods, which use offline-computed clusters as the neighbourhoods; instead, our method forms the neighbourhood of any given target user dynamically from scratch (and is thus more accurate than cluster based approaches) in an efficient manner. Our experiments show that the proposed method improves both recommendation quality and computation efficiency for the standard TPR recommender.

References 1. Schafer, J.B., Konstan, J.A., Riedl, J.: E-Commerce Recommendation Applications. Journal of Data Mining and Knowledge Discovery 5, 115–152 (2000) 2. Awerbuch, B., et al.: Improved recommendation systems. In: Proceedings of 16th Annual ACM-SIAM symposium on Discrete algorithms, Vancouver, British Columbia (2005) 3. Ziegler, C.-N., Lausen, G., Schmidt-Thieme, L.: Taxonomy-driven Computation of Product Recommendations in International Conference on Information and Knowledge Management 2004, Washington D.C., U.S.A (2004) 4. Sarwar, B., et al.: Recommender systems for large-scale e-commerce: Scalable neighborhood formation using clustering. In: Proceedings of 5th International Conference on Computer and Information Technology (2002) 5. Manolopoulos, Y., et al.: R-Trees: Theory and Applications. Springer, Heidelberg (2005) 6. Bentley, J.L.: K-d Trees for Semidynamic Point Sets. In: 6th Annual Symposium on Computational Geometry 1990, Berkley, California. ACM Press, United States (1990) 7. Ziegler, C.-N., et al.: Improving Recommendation Lists through Topic Diversification. In: Proceedings of 14th International World Wide Web Conference, Chiba, Japan (2005) 8. Li, B., Yu, S., Lu, Q.: An Improved k-Nearest Neighbor Algorithm for Text Categorization. In: Proceedings of the 20th International Conference on Computer Processing of Oriental Languages, Shenyang, China (2003)

Improve Recommendation Quality with Item Taxonomic Information

Li-Tung Weng, Yue Xu, Yuefeng Li, and Richi Nayak

Faculty of Information Technology, Queensland University of Technology, 4001 Queensland, Australia
[email protected], {yue.xu,y2.li,r.nayak}@qut.edu.au

Abstract. A recommender system's performance can easily suffer when previous users have not provided sufficient item preference data. This problem is commonly referred to as the cold-start problem. This paper suggests another information source, item taxonomies, in addition to item preferences for assisting recommendation making. Item taxonomic information is widely used in diverse e-commerce domains for product or content classification, and can therefore be easily obtained and adapted by recommender systems. In this paper, we investigate the implicit relations between users' item preferences and taxonomic preferences, and we suggest and verify, using information gain, that users who share similar item preferences may also share similar taxonomic preferences. Under this assumption, a novel recommendation technique is proposed that combines users' item preferences with the additional taxonomic preferences to make better quality recommendations as well as alleviate the cold-start problem.

Keywords: Recommender System, Taxonomy, Ecommerce.

1 Introduction

Recommender systems have been an active research area for more than a decade, and many different techniques and systems with distinct strengths have been developed [1]. Among these techniques, collaborative filtering is perhaps the most successful and widely applied technique for building recommender systems [2, 3]. In general, collaborative filtering based recommenders recommend items that are commonly preferred by users with item preferences similar to those of the target user. Therefore, the recommendation quality of the collaborative filtering technique depends upon the number of users with preferences similar to the target user's. If there are only a few users in the dataset with similar preferences to the target user, the standard collaborative filtering technique will not be able to suggest quality recommendations to that user. This issue, commonly referred to as the cold-start problem [4], usually arises when the system is newly built (there is no initial data in the dataset) or when there is no data available for a new target user [5].
A commonly used approach to alleviate the cold-start problem is to take item content information into consideration when making recommendations. That is, when it is not


possible to form a neighbourhood for a target user, content based techniques can be used to mine the item content preferred by the target user, and recommendations can then be generated by finding items with content similar to that preferred by the target user [6, 7]. However, because most content based techniques represent item content information as word vectors and maintain no semantic relations among the words, the resulting recommendations are usually very content centric and poor in quality [6-9]. To improve content based techniques, the content information of the items should be captured in more sophisticated ways, so that associations among items can be measured by the semantic meaning of their content rather than by simple keyword matching.
In this paper, we propose a novel recommendation approach, namely the Hybrid Taxonomy based Recommender (HTR), which generates item recommendations based on both users' item preferences and their item taxonomic preferences. The notion of item taxonomy information is used in our system in place of standard item content information; that is, instead of using keyword vectors to represent items, our system describes items with taxonomic topics extracted from a tree-like taxonomy structure. Item taxonomy information is useful for encapsulating item content semantics, as it allows items with different topics to be related if they share common super topics. Hence, not only can the use of item taxonomy significantly alleviate the cold-start problem, it can also improve recommendation quality by reducing the content centric issue.
The relationship between item preferences and item taxonomic preferences is also investigated in this paper. Based on our study and experiments, we suggest that when a set of users shares similar item preferences, they may also share similar item taxonomic preferences. The HTR technique utilizes this relation to achieve competitive computation efficiency and recommendation performance. Regarding applicability, since item taxonomy information is available from most e-commerce sites and standardization organizations, HTR can easily be applied and adapted to a wide range of domains. Moreover, HTR can also exploit implicit user preference information (in addition to standard explicit user preferences) to further enhance its recommendation quality in cold-start environments.

2 Related Work

Much research has suggested that the cold-start problem can be alleviated by combining collaborative filtering and content based techniques [4, 6, 9, 10]. However, because part of the recommendation process of these hybrid recommenders is content based, the generated recommendations may be excessively content centric and lack novelty [5, 11]. Hence, semantic and ontology based techniques have been suggested to improve the generality of the recommendations of content based filtering. Middleton [5] suggested an ontology based recommender which uses an external organizational ontology (e.g., publication and authorship relationships, projects and project membership relationships, etc.) to address the cold-start problem. However, as Middleton's technique is mainly designed for recommending research papers and documents, and also relies on a specific organizational ontology, it is not easy to adopt this method for general recommenders. On the other hand, Ziegler [11] proposed a taxonomy-driven product recommender (TPR) that utilizes a general tree-

Improve Recommendation Quality with Item Taxonomic Information

267

structured product taxonomy to enhance its recommendations. Due to the simplicity of the taxonomy structure, Ziegler’s technique is considered widely applicable to different domains [11]. To the best of our knowledge, Middleton and Ziegler’s techniques are the only two works bearing traits similar to the proposed HTR technique. HTR employs similar tree structured taxonomy to TPR, and therefore it inherits TPR’s generality advantage. However, while TPR only considers implicit item preferences for making recommendations, HTR utilizes the relationship between users’ explicit item preference and implicit taxonomic preferences for recommendation making, therefore yields better recommendation performances. Moreover, HTR adopts item-based collaborative filtering paradigm [2] in contrast to TPR’s user-based collaborative filtering. Item-based collaborative filtering allows most computations to be done offline. Therefore, the computation efficiency of online recommendation generation can be improved.

3 Proposed Approach

This section is divided into four parts. In Section 3.1, the basic system model and the general notation used throughout this paper are described. In Section 3.2, we discuss the implicit relation between users' item preferences and taxonomic preferences. The technique for taxonomic preference extraction is described in Section 3.3. Finally, Section 3.4 details the proposed HTR method.

3.1 System Model

We envision a world with a set of users $U = \{u_1, u_2, \ldots, u_{|U|}\}$ and a set of items $T = \{t_1, t_2, \ldots, t_{|T|}\}$. Each user $u \in U$ is associated with a set of rated items $T_u \subseteq T$. Based on the different rating methods, we can divide these items into implicitly rated items $T_u^{im}$ and explicitly rated items $T_u^{ex}$. A user can rate an item implicitly or explicitly, but not both (i.e., $T_u^{im} \cap T_u^{ex} = \emptyset$ and $T_u = T_u^{im} \cup T_u^{ex}$). In explicit ratings, users express their preferences for items in numeric form, where the value 0 indicates minimal satisfaction and 1 indicates maximum satisfaction. We use $r_{u,t}$ to denote user $u$'s rating of item $t$, such that $r_{u,t} \in [0,1]$.

HTR uses taxonomy-based descriptors to describe items. Specifically, $D_t = \{d_1, d_2, \ldots, d_{|D_t|}\}$ denotes the set of descriptors characterizing an item $t$'s taxonomy, where $|D_t| \geq 1$. A taxonomy descriptor $d = \langle p_1, p_2, \ldots, p_{|d|} \rangle$ is a sequence of ordered taxonomic topics, where $p_k \in P$. The topics within a descriptor are sequenced so that the former topics are super topics of the latter topics; specifically, $p_k$ is the direct super topic of $p_{k+1}$ for $1 \leq k < |d|$. A super topic covers a broader concept than its sub-topics, and a topic can have more than one direct sub-topic. Thus, the taxonomy topics can be stored in a tree-like structure; the tree formed by the taxonomy topics is referred to as the taxonomy tree, and all item descriptors are paths extracted from the root to a leaf node of this tree. Let $P$ be the set of all taxonomy topics, $P = \{p \mid p \in d, d \in D_t, t \in T\}$, and let $sub: P \to 2^P$ be the map from $P$ to $2^P$ that retrieves all direct sub-topics of a topic $p \in P$. Based on $sub$, we define a partial order on the taxonomy topic set to differentiate between super topics and sub-topics: for all $p, p' \in P$, if $p' \in sub(p)$ then $p \succ p'$, and the relation $\succ$ is required to be transitive, i.e., $p \succ p'$ and $p' \succ p''$ imply $p \succ p''$. With this requirement and the map $sub$, we can recursively extract the taxonomy tree structure from the set $P$. Moreover, as in standard tree structures, the taxonomy tree has exactly one top-most element with zero in-degree representing the most general topic; it is denoted by $p_\top$ in this paper. By contrast, the bottom-most elements with zero out-degree are denoted by $P_\bot$ and represent the most specific topics. In our system, for any item descriptor $d = \langle p_1, \ldots, p_{|d|} \rangle$ it is required that $p_1 \in sub(p_\top)$ and $p_{|d|} \in P_\bot$.
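To make the data model above concrete, the following sketch shows one way the taxonomy tree (the map sub) and the root-to-leaf item descriptors could be represented. It is an illustrative reconstruction rather than the authors' implementation; the identifiers and the sample book data are hypothetical.

```python
from collections import defaultdict

# Taxonomy tree: each topic maps to its direct sub-topics (the map "sub").
sub = defaultdict(set)

def add_descriptor(descriptor):
    """Register a root-to-leaf topic path, e.g. ("Books", "Fiction", "Fantasy")."""
    for parent, child in zip(descriptor, descriptor[1:]):
        sub[parent].add(child)

# Item descriptors D_t: item id -> set of descriptors (topic paths).
item_descriptors = {
    "book_1": {("Books", "Fiction", "Fantasy")},
    "book_2": {("Books", "Nonfiction", "History"), ("Books", "Reference")},
}

for paths in item_descriptors.values():
    for d in paths:
        add_descriptor(d)

def topics_of(item_id):
    """All taxonomy topics that appear in any descriptor of an item."""
    return {p for d in item_descriptors[item_id] for p in d}

print(sorted(sub["Books"]))        # direct sub-topics of the top-level topic
print(sorted(topics_of("book_2")))
```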





3.2 Cluster-Based User Neighbourhood

In HTR, cluster-based neighbourhood formation is adopted to ensure computation efficiency. In order to form the user neighbourhoods, or clusters, a similarity measure for computing user similarities is essential. In HTR, we adopted the correlation measure described in [12] to compute the item preference similarity between two users $u_a$ and $u_b$, as given in Equation (1):

$$sim(u_a, u_b) = \frac{\sum_{t \in T_{u_a,u_b}} (r_{u_a,t} - \bar{r}_{u_a})(r_{u_b,t} - \bar{r}_{u_b})}{\sqrt{\sum_{t \in T_{u_a,u_b}} (r_{u_a,t} - \bar{r}_{u_a})^2}\,\sqrt{\sum_{t \in T_{u_a,u_b}} (r_{u_b,t} - \bar{r}_{u_b})^2}} \qquad (1)$$

where $T_{u_a,u_b} = T_{u_a}^{ex} \cap T_{u_b}^{ex}$, that is, each $t \in T_{u_a,u_b}$ is an item rated explicitly by both $u_a$ and $u_b$, and $\bar{r}_u$ denotes the average of the explicit ratings made by $u$. Based on Equation (1), $U$ can be divided into a set of clusters $UC = \{uc_1, uc_2, \ldots, uc_{|UC|}\}$, such that $\bigcup_{uc \in UC} uc = U$ and $uc_i \cap uc_j = \emptyset$ for $i \neq j$. For the sake of convenience, let $uc_u$ denote the cluster which contains user $u$. Because the clusters are constructed based on users' item preference similarity, users within the same cluster will have similar item preferences.
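As an illustration of Equation (1), the sketch below computes this correlation over the items two users have both rated explicitly. The dictionary-based rating store and all names are our own assumptions, not part of the original system.

```python
from math import sqrt

# Explicit ratings r[u][t] in [0, 1].
ratings = {
    "u1": {"t1": 0.9, "t2": 0.2, "t3": 0.7},
    "u2": {"t1": 0.8, "t2": 0.1, "t4": 0.6},
}

def user_sim(ua, ub):
    """Correlation over the items both users rated explicitly (cf. Eq. 1)."""
    common = set(ratings[ua]) & set(ratings[ub])
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings[ua].values()) / len(ratings[ua])
    mean_b = sum(ratings[ub].values()) / len(ratings[ub])
    num = sum((ratings[ua][t] - mean_a) * (ratings[ub][t] - mean_b) for t in common)
    den_a = sqrt(sum((ratings[ua][t] - mean_a) ** 2 for t in common))
    den_b = sqrt(sum((ratings[ub][t] - mean_b) ** 2 for t in common))
    return num / (den_a * den_b) if den_a and den_b else 0.0

print(round(user_sim("u1", "u2"), 3))
```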

In this paper, we investigate this further and suggest the following assumption:

Users within the same neighbourhood or cluster, sharing similar item preferences, may also share similar taxonomic preferences and interests.

The idea behind the assumption is that the users within one cluster should have an apparent common taxonomic focus, and the taxonomic focuses of the users in different clusters should be different. In this paper, we use information gain to measure the certainty of the taxonomic focus of a user set, and we empirically demonstrate the validity of the above assumption using this information gain measure. When the information gain is high, it indicates that the certainty of the taxonomic focuses of the user clusters is high. Therefore, we can use information gain to investigate whether different clusters have apparent taxonomic focuses and whether these focuses differ between user clusters. The adapted information gain is calculated as below:

$$IG(UC) = H(U) - \sum_{uc \in UC} \Pr(uc) \cdot H(uc) \qquad (2)$$

where $\Pr(uc)$ is the probability that an item rating is made by a user in cluster $uc$, and $H(\cdot)$ is the information entropy for a given user space. The concept of information entropy is adapted in this paper to measure the degree of taxonomic focus in a user set (i.e., a cluster or a neighbourhood). If the information entropy is high for a user set, then there is no apparent taxonomic focus in the set (i.e., the users in the set prefer all taxonomy topics equally), and vice versa. The information entropy formula is depicted below:

$$H(U') = -\sum_{p \in P} \Pr(p, U') \cdot \log \Pr(p, U') \qquad (3)$$

In the entropy equation, $\Pr(p, U')$ denotes the probability that the users in the user set $U' \subseteq U$ are interested in the taxonomy topic $p$. For a given clustering $UC = \{uc_1, uc_2, \ldots, uc_{|UC|}\}$, if the $H(uc)$ are low, which means the taxonomic focuses are apparent in each cluster $uc$, then according to Equation (2) the information gain is high.

The effect of user clustering on taxonomy information gain is depicted in Table 1. This result is obtained by using the k-means clustering technique to divide the 278,858 users in the "Book-Crossing" dataset (www.informatik.uni-freiburg.de/~cziegler/BX/) into 100 clusters according to their explicit ratings. We tried producing different numbers of clusters for the dataset (i.e., different values for k), and we found that setting k to 100 (i.e., 100 clusters) produces clusters of reasonable quality. Our first experiment examines whether user clusters have stronger taxonomic focuses than the entire dataset when only explicit ratings are considered. As shown in the first column of Table 1, the resulting information gain is 0.823, which is a large increase compared with the information gain obtained from randomly formed cluster partitions (i.e., -0.385). This result shows that, by clustering users on their explicit ratings, each user cluster has its own taxonomic focuses. Because our clusters are generated based on explicit ratings only, it might be unfair to consider only explicit ratings when calculating the taxonomy information gain. Hence, we further include the implicit ratings when computing the taxonomy information gain. With identical cluster settings, we still obtain a strong information gain increase (i.e., 0.458) compared to the information gain obtained from the randomly formed clusters (i.e., -0.319). Based on the information gain analysis, we can conclude that users within the same clusters not only share similar item preferences, but also share similar taxonomic preferences.

Table 1. The effect of user clustering on taxonomy information gain
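The information gain of Equations (2) and (3) can be sketched as follows; it assumes the per-topic interest probabilities have already been estimated (e.g., from normalized rating counts), and the toy numbers are hypothetical.

```python
from math import log2

def entropy(topic_probs):
    """H(U') = -sum_p Pr(p, U') * log Pr(p, U')   (cf. Eq. 3)."""
    return -sum(p * log2(p) for p in topic_probs if p > 0)

def information_gain(global_probs, cluster_probs, cluster_weights):
    """IG(UC) = H(U) - sum_uc Pr(uc) * H(uc)   (cf. Eq. 2).

    cluster_probs:   per-cluster topic probability distributions
    cluster_weights: Pr(uc), the share of ratings contributed by each cluster
    """
    return entropy(global_probs) - sum(
        w * entropy(probs) for w, probs in zip(cluster_weights, cluster_probs)
    )

# Toy example: two clusters with sharply different topic focuses.
global_probs = [0.5, 0.5]                 # topics equally popular overall
cluster_probs = [[0.9, 0.1], [0.1, 0.9]]  # but each cluster prefers one topic
print(round(information_gain(global_probs, cluster_probs, [0.5, 0.5]), 3))
```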

3.3 Taxonomic Preferences Extraction

For each cluster $uc \in UC$, we build a cluster-based taxonomy tree similar to the global taxonomy tree defined in Section 3.1. Formally, we define the cluster-based topic set $P_{uc} = \{p \in P \mid p \in d, d \in D_t, t \in T_u, u \in uc\}$ and the map $sub_{uc}: P_{uc} \to 2^{P_{uc}}$, which, for topics $p \in P_{uc}$, extracts the direct sub-topics of $p$ within the cluster.


In a similar way to that described in Section 3.1, with the map $sub_{uc}$ we can construct a local taxonomy tree for a cluster $uc$. With the local, cluster-based taxonomy tree, we can then find the frequent and distinct topics for each cluster. We measure the distinctness of a topic $p$ within a local cluster $uc$ with respect to the global user set $U$ by:

$$topic\_score(p, uc) = \frac{rating\_count(p, uc)}{rating\_count(p, U)} \cdot \min\!\left(1, \frac{rating\_count(p, uc)}{\beta}\right) \qquad (4)$$

where $rating\_count(p, U')$ is the number of user ratings given to items involving taxonomy topic $p$ within a given user set $U'$, and $\beta$ is a user-defined constant used to filter out topics that are not of popular interest to users. In this paper, $\beta$ is set to 50, so topics need to be involved in at least 50 ratings in order to get a reasonable score. The higher the topic score, the higher the possibility that the taxonomy topic is unique to a cluster. Based on the topic score, the topics whose topic scores are higher than a predefined threshold are chosen as the hot topics for that cluster. We denote the hot topic set by:

$$hot\_topics(uc) = \{p \in P_{uc} \mid topic\_score(p, uc) \geq \zeta\} \qquad (5)$$

where $\zeta$ is the user-defined threshold. In our experiment, $\zeta$ is set to 0.6. Figure 1 shows the average number of topics left for each cluster for different threshold settings.

[Plot: average number of hot topics per cluster (0 to 3500) against the minimal topic score $\zeta$ (0 to 0.96)]

Fig. 1. Average number of hot topics per cluster given different minimal topic score (ζ)

For the "Book-Crossing" dataset there are originally 10746 topics in the entire dataset. After user clustering, the average number of topics per cluster is around 3164.12. The ratio of the number of topics in the clusters to the number of topics in the entire dataset is about 0.29. This ratio suggests that different clusters may have very different taxonomy topics. Moreover, as we increase the topic score threshold $\zeta$, the ratio decreases drastically (e.g., when $\zeta = 0.68$, the entire dataset has 530 topics and the average number of topics per cluster is 5.9, so the ratio is only 0.01). This observation further strengthens the conclusion about cluster taxonomic focuses detailed in Section 3.2.
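Since the exact form of Equation (4) cannot be fully recovered from the text, the sketch below implements the behaviour described above as an assumption: a topic's score grows with its in-cluster share of ratings and is damped when the topic is involved in fewer than β ratings; hot topics are then selected with the threshold ζ as in Equation (5). The function names and sample counts are hypothetical.

```python
def topic_score(p, counts_in_cluster, counts_global, beta=50):
    """Distinctness of topic p inside a cluster (reconstruction of Eq. 4).

    counts_in_cluster / counts_global hold rating_count(p, uc) and rating_count(p, U).
    Topics with fewer than beta in-cluster ratings are damped towards 0.
    """
    in_cluster = counts_in_cluster.get(p, 0)
    overall = counts_global.get(p, 0)
    if overall == 0:
        return 0.0
    return (in_cluster / overall) * min(1.0, in_cluster / beta)

def hot_topics(cluster_topics, counts_in_cluster, counts_global, zeta=0.6, beta=50):
    """hot_topics(uc) = {p | topic_score(p, uc) >= zeta}   (cf. Eq. 5)."""
    return {
        p for p in cluster_topics
        if topic_score(p, counts_in_cluster, counts_global, beta) >= zeta
    }

counts_uc = {"Fantasy": 120, "History": 10}
counts_all = {"Fantasy": 150, "History": 400}
print(hot_topics({"Fantasy", "History"}, counts_uc, counts_all))
```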


3.4 Hybrid Taxonomy Recommender

In this section, we describe the proposed Hybrid Taxonomy based Recommender (HTR), which incorporates the hot topic set described in Section 3.3 into item-based collaborative filtering (item-based CF) to improve recommendation quality. HTR generates item recommendations by combining estimates of item preferences with estimates of taxonomic preferences.

We first explain the item-based CF technique used in HTR to estimate item preferences. Item-based CF recommends an item $t$ to a user $u$ based on the similarity between $t$ and the items that have been rated by $u$, together with the user's ratings of these items. The similarity between two items $t_a$ and $t_b$ is computed from user explicit ratings as defined below:

$$item\_sim(t_a, t_b) = \frac{\sum_{u \in U_{t_a,t_b}} (r_{u,t_a} - \bar{r}_{t_a})(r_{u,t_b} - \bar{r}_{t_b})}{\sqrt{\sum_{u \in U_{t_a,t_b}} (r_{u,t_a} - \bar{r}_{t_a})^2}\,\sqrt{\sum_{u \in U_{t_a,t_b}} (r_{u,t_b} - \bar{r}_{t_b})^2}} \qquad (6)$$

where $r_{u,t}$ is shorthand for user $u$'s rating of item $t$, $\bar{r}_t$ is the average rating for $t$ over the users in $U_{t_a,t_b}$, and $U_{t_a,t_b}$ is the set of users who have rated both $t_a$ and $t_b$, defined as:

$$U_{t_a,t_b} = \{u \in U \mid t_a \in T_u^{ex}, t_b \in T_u^{ex}\}$$

Note that it is possible that two items are never rated by more than one common user, i.e., $U_{t_a,t_b} = \emptyset$. In such a case, $item\_sim(t_a, t_b)$ returns a special value which is a label indicating "Not Computable". As mentioned above, the estimate of user $u$'s preference for item $t$ is based on the similarities between $t$ and the items rated by $u$. In order to achieve this, we need to find the target user's rated items whose similarity with the target item $t$ is computable. That is,

$$C(u, t) = \{t' \in T_u^{ex} \mid item\_sim(t, t') \neq \text{Not Computable}\}$$

Finally, user $u$'s item preference prediction for item $t$ is computed as below:

$$\eta(u, t) = \frac{\sum_{t' \in C(u,t)} item\_sim(t, t') \cdot r_{u,t'}}{\sum_{t' \in C(u,t)} |item\_sim(t, t')|} \qquad (7)$$
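A minimal sketch of the item-based CF component of Equations (6) and (7), assuming ratings are held in memory as nested dictionaries. It mirrors the description above (a "Not Computable" label for item pairs without enough common raters, and a similarity-weighted average for the prediction) but is not the authors' code.

```python
from math import sqrt

NOT_COMPUTABLE = None  # label for item pairs without enough common raters

def item_sim(ta, tb, ratings):
    """Correlation of two items over their common raters (cf. Eq. 6)."""
    raters = [u for u, r in ratings.items() if ta in r and tb in r]
    if len(raters) < 2:
        return NOT_COMPUTABLE
    mean_a = sum(ratings[u][ta] for u in raters) / len(raters)
    mean_b = sum(ratings[u][tb] for u in raters) / len(raters)
    num = sum((ratings[u][ta] - mean_a) * (ratings[u][tb] - mean_b) for u in raters)
    den = (sqrt(sum((ratings[u][ta] - mean_a) ** 2 for u in raters))
           * sqrt(sum((ratings[u][tb] - mean_b) ** 2 for u in raters)))
    return num / den if den else NOT_COMPUTABLE

def predict_item_preference(u, t, ratings):
    """eta(u, t): similarity-weighted average of u's ratings (cf. Eq. 7)."""
    num = den = 0.0
    for t_prime, r in ratings[u].items():
        s = item_sim(t, t_prime, ratings)
        if s is NOT_COMPUTABLE:
            continue
        num += s * r
        den += abs(s)
    return num / den if den else 0.0

ratings = {
    "u1": {"t1": 0.9, "t2": 0.8, "t3": 0.2},
    "u2": {"t1": 0.8, "t2": 0.9, "t3": 0.1},
    "u3": {"t1": 0.7, "t2": 0.6},
}
print(round(predict_item_preference("u3", "t3", ratings), 3))
```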

In order to improve the recommendation quality (especially in cold-start situations), HTR also checks whether the taxonomy of a candidate item is preferred by the target user. We use $\Psi(u, t)$ to denote the prediction of user $u$'s taxonomic preference for item $t$, and it can be computed as below:

$$\Psi(u, t) = \begin{cases} 0 & \text{if } HD(u,t) = \emptyset \\ \max_{p \in HD(u,t)} topic\_score(p, uc_u) & \text{otherwise} \end{cases} \qquad (8)$$

where $HD(u, t) = \{p \mid p \in d, d \in D_t, p \in hot\_topics(uc_u)\}$ is the set of $t$'s topics that are hot topics of the cluster which contains $u$. The idea behind the computation of the taxonomic preference score is straightforward. We first check whether any of the target item $t$'s taxonomy topics are hot topics of the user $u$'s neighbourhood (i.e., whether $HD(u,t) \neq \emptyset$). If the item's topics are not hot topics of $uc_u$, then we suggest that the user is not interested in the item's taxonomy, and hence 0 is given as the taxonomy score. If the item's topics are in the hot topic set, then among these matched hot topics ($|HD(u,t)|$ can be greater than 1) the maximum hot topic score is chosen as $t$'s taxonomy score.

It should be mentioned that the hot topics calculated by Equation (5) represent the taxonomic focuses of the users in a cluster. That means the topics in $hot\_topics(uc)$ represent cluster-level taxonomic focuses commonly preferred by the users in that cluster, not the focuses of any particular individual user. There are two reasons for doing so. Firstly, cluster-level taxonomic preferences can be pre-computed offline, which ensures the computation efficiency of the proposed technique. Secondly, since the cluster-level taxonomic preferences cover the taxonomic interests of all the users in one cluster, by recommending items with topics commonly preferred by the users in the cluster the recommender can recommend items covering a wider range of topics, including topics which may not be particularly preferred by the target user but are preferred by other users in the cluster; the recommendation quality can thus be improved.

In order to recommend a set of items to a target user $u$, we first form a candidate item list containing all items rated by $u$'s neighbours but not yet rated by $u$. Next, for each item in the candidate list, we compute the item preference score and the taxonomic preference score for the item. The proposed preference score for each candidate item can then be computed by combining the item preference score $\eta(u,t)$ and the item taxonomic preference score $\Psi(u,t)$. Finally, the candidate items with the highest preference scores are recommended to the user $u$, and these recommended items are sorted by their ranking values. The complete algorithm is listed below:

Algorithm. HTR recommendation
where u is a given target user and k is the number of items to be recommended.
1) SET C ← (⋃_{u′ ∈ uc_u} T_{u′}) \ T_u, the candidate item list
2) FOR EACH t ∈ C
3)    SET rank(u,t) ← α · η(u,t) + (1 − α) · Ψ(u,t)
4) END FOR
5) Return the top k items with the highest rank(u,t) scores to u.

From line (3) of the algorithm we can see that the predicted score for an item is computed as a linear combination of the item preference score $\eta(u,t)$ and the topic preference score $\Psi(u,t)$. The coefficient $\alpha$, computed by Equation (9) below, is used to adjust the weights of $\eta(u,t)$ and $\Psi(u,t)$:

$$\alpha = \lambda + (1 - \lambda) \cdot \omega \qquad (9)$$

where $\omega = \frac{|C(u,t)|}{|T_u|}$ and $\lambda$ is a user-controlled variable. $\omega$ is the ratio between the number of items that are commonly rated with item $t$ by $u$ and other users and the number of items rated by $u$. In Equation (9), $\omega$ reflects the quality confidence of $\eta(u,t)$, because the more of the target user's past rated items are related to the target item, the higher the accuracy of the item preference prediction $\eta(u,t)$ will be.


When $\omega$ increases, $\alpha$ increases too, and thus $\eta(u,t)$ receives a higher weight in the final score $rank(u,t)$. The variable $\lambda$, on the other hand, is used to adjust the weight of $\omega$ in $\alpha$; thus, if $\lambda$ is large (e.g., 0.9), $\eta(u,t)$ will still receive a high weight even if $\omega$ is small. The value of $\alpha$ is automatically adjusted along with the change in the number of users who commonly rated a given item $t$. The higher the value of $\alpha$, the more users have commonly rated the item (i.e., $\omega$ is high, which indicates a normal situation without a severe cold-start problem), and thus the item preference $\eta(u,t)$ estimated from these users' rating data becomes more important and reliable. In this case, the predicted item preference $\eta(u,t)$ makes a larger contribution to the predicted score $rank(u,t)$ for item $t$ than the contribution made by the predicted taxonomic preference $\Psi(u,t)$. On the other hand, if the value of $\alpha$ is low (i.e., $\omega$ is low, which indicates a cold-start situation), the taxonomic preference prediction becomes more important and contributes more to the predicted score $rank(u,t)$ than the predicted item preference does. This design ensures that taxonomic preferences are used to supplement and enrich the item preference prediction, especially in cold-start situations.
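The final ranking step can be sketched as below. The blend alpha = lambda + (1 - lambda) * omega follows the reconstruction of Equation (9) above and should be treated as an assumption; the hot-topic scores and the item preference estimate eta are taken as given inputs, and all names are hypothetical.

```python
def taxonomic_preference(item_topics, hot_topic_scores):
    """Psi(u, t): best hot-topic score among the item's topics, else 0 (cf. Eq. 8)."""
    matched = [hot_topic_scores[p] for p in item_topics if p in hot_topic_scores]
    return max(matched) if matched else 0.0

def rank(eta, psi, omega, lam=0.5):
    """Blend the item preference eta and the taxonomic preference psi.

    alpha = lam + (1 - lam) * omega mirrors the reconstructed Eq. (9):
    omega is the share of the user's rated items that are comparable with
    the target item, so sparse (cold-start) cases push weight towards psi.
    """
    alpha = lam + (1.0 - lam) * omega
    return alpha * eta + (1.0 - alpha) * psi

# A cold-start candidate: few comparable items (omega small), strong topic match.
hot = {"Fantasy": 0.8, "History": 0.65}
psi = taxonomic_preference({"Fantasy", "Poetry"}, hot)
print(round(rank(eta=0.1, psi=psi, omega=0.05, lam=0.3), 3))
```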

4 Experimentation

This section presents the empirical results obtained from our experiment.

4.1 Data Acquisition

The dataset used in this experiment is the "Book-Crossing" dataset (http://www.informatik.uni-freiburg.de/~cziegler/BX/), which contains 278,858 users providing 1,149,780 ratings for 271,379 books. Of these user ratings, 433,671 are explicit user ratings and the remaining 716,109 are implicit ratings. The taxonomy tree and book descriptors for our experiment were obtained from Amazon.com. Amazon.com's book classification taxonomy is tree-structured (i.e., limited to "single inheritance") and is therefore perfectly suited to the proposed technique. However, not every book in our dataset is available on Amazon.com, and we were only able to extract taxonomy descriptors for 270,868 books from Amazon.com. The books without descriptors are removed from the dataset. The average number of descriptors per book is around 3.15, and the taxonomy tree formed by these descriptors contains 10746 unique topics.

4.2 Experiment Framework

All recommenders used in the experiment are developed using the Taste framework (http://taste.sourceforge.net/). Taste provides a set of standardized components for developing recommenders and therefore ensures a fair comparison of the developed recommenders. Moreover, Taste also provides an evaluation framework allowing researchers and developers to evaluate the performance of their recommenders easily and effectively on a standardized test bed. In this experiment we constructed 7 different recommenders; they are listed in Table 2.


Table 2. List of experimental recommenders

4.3 Evaluation Metrics

The goal of our experiment is to compare the recommendation performance and computation efficiency of the recommenders listed in Table 2. For the recommendation quality evaluation, we randomly divided each user $u$'s past ratings (i.e., $T_u$) into two parts, one for training and one for testing. We use $T_u^{tr}$ to denote $u$'s training rating data and $T_u^{te}$ to denote the testing rating data, such that $T_u^{tr} \cap T_u^{te} = \emptyset$, $T_u^{tr} \cup T_u^{te} = T_u$ and $|T_u^{tr}| = |T_u^{te}|$. The testing data consists of three types of items:

- items implicitly rated by $u$: $T_u^{te} \cap T_u^{im}$;
- items preferred by $u$: $T_u^{te+} = \{t \in T_u^{te} \cap T_u^{ex} \mid u \text{ rates } t \text{ positively}\}$;
- items not preferred by $u$: $(T_u^{te} \cap T_u^{ex}) \setminus T_u^{te+}$.

In the experiment, the recommenders recommend a list of items $L_u$ to $u$ based on the training set $T_u^{tr}$, and the recommendation list can be evaluated against $T_u^{te}$. In order to evaluate the performance of different recommenders based on $L_u$ and $T_u^{te}$, recommendation-list-based evaluation metrics such as precision and recall, Breese score, half-life, etc. [4, 14] can be utilized. In this paper, the precision and recall metrics are used for the evaluation; their formulas are listed below, with $T_u^{rel}$ denoting the test items implicitly rated or preferred by $u$:

$$Precision_u = \frac{|L_u \cap T_u^{rel}|}{|L_u|} \qquad (10)$$

$$Recall_u = \frac{|L_u \cap T_u^{rel}|}{|T_u^{rel}|} \qquad (11)$$
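The list-based metrics of Equations (10) and (11) reduce to simple set operations; the sketch below also returns the F1 combination defined in Equation (12) further down. The sample lists are hypothetical.

```python
def precision_recall_f1(recommended, relevant):
    """Precision, recall (cf. Eqs. 10-11) and their harmonic mean F1 (cf. Eq. 12)."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1(["t1", "t2", "t3", "t4", "t5"], ["t2", "t5", "t9"]))
```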

In order to provide a general overview of the overall performance, the F1 metric is used to combine the results of precision and recall:

$$F1_u = \frac{2 \cdot Precision_u \cdot Recall_u}{Precision_u + Recall_u} \qquad (12)$$

For the computation efficiency evaluation, the average time required by the recommenders to make a recommendation is compared.

4.4 Experiment Result

The test dataset is constructed by randomly choosing 10,000 users from the 278,858 users in the Book-Crossing dataset mentioned in Section 4.1. We let each recommender recommend a list of k items to these 10,000 users, and we tested different values of k ranging from 5 to 25. The results of this part of the experiment are shown in Figure 2, Figure 3 and Figure 4. It can be observed from the figures that, for all three evaluation metrics, the proposed HTR technique achieves the best result among all the recommenders. When only explicit rating data are used, the recommendation quality of HTR (i.e., HTR_E) still outperforms the other recommenders, even though it degrades slightly compared with using both explicit and implicit rating data; that is, HTR performs the best and HTR_E performs the second best. The standard item-based CF recommender (IR) performed similarly to the slope one recommender (SO), although the slope one recommender seems slightly better at recommending longer item lists.

In the experiment, the clustering-based CF recommender (IRC) performed better than the standard one (IR). The only difference between these two recommenders lies in the candidate item list formation process. The standard item-based CF uses all items from the dataset as its candidate item list (i.e., $T \setminus T_u$), whereas the clustering-based version uses only items within a user cluster (i.e., $(\bigcup_{u' \in uc_u} T_{u'}) \setminus T_u$). Intuitively, the clustering-based CF might be expected to perform worse than the standard one, because its candidate item list is formed from a cluster which is only a subset of the entire item set, so some potentially promising items might be excluded and thus not recommended. However, based on our observation, many of these excluded items are noise generated by the item similarity measure (some item similarity measures can generate prediction noise; please refer to [2] for more information), so removing these items from the candidate list can actually improve the recommendation quality. The proposed HTR also benefits from the clustering strategy, as it generates recommendations from the candidate item list formed from a cluster.


[Line chart: precision of the recommenders IRC, IR, SO, HTR, TPR, HTR_E and ITR for the top k recommended items, k = 5, 10, 15, 20, 25]

Fig. 2. Recommender evaluation with precision metric

[Line chart: recall of the recommenders IRC, IR, SO, HTR, TPR, HTR_E and ITR for the top k recommended items, k = 5, 10, 15, 20, 25]

Fig. 3. Recommender evaluation with recall metric

We also implemented the TPR technique proposed by Ziegler [11], and it performed worst among all recommenders in our evaluation scheme. TPR uses only implicit ratings as its data source and generates recommendations based only on taxonomic preferences. In order to make the proposed HTR and Ziegler's TPR more comparable, we modified TPR by adding the item-based CF component into TPR, resulting in the new recommender ITR.


[Line chart: F1 of the recommenders IRC, IR, SO, HTR, TPR, HTR_E and ITR for the top k recommended items, k = 5, 10, 15, 20, 25]

Fig. 4. Recommender evaluation with F1 metric

ITR performed better than the standard TPR as it includes the item preference consideration in its recommendation-making process; however, it is still worse than all the other recommenders (i.e., TPR performs the worst and ITR performs the second worst). The difference between HTR and ITR lies in the method used to compute the taxonomic preferences (they use the same method to compute the item preferences). The result that HTR outperforms ITR indicates that users' item preferences are also helpful for generating users' taxonomic preferences. The proposed HTR technique considers the item preference implication when generating the taxonomic preferences (i.e., the taxonomic preferences are extracted from user clusters which are formed based on users' item preferences). In contrast, TPR generates users' taxonomic preferences purely from taxonomy data without using any of the users' item preferences.

In the experiment, the recommender with the best computation efficiency is the clustering-based CF (IRC), as shown in Figure 5; it is much faster than the standard CF because its candidate item list is much smaller. The proposed HTR methods (HTR and HTR_E) perform third and second best, as they add a small amount of computational complexity for the taxonomic preference predictions. However, this extra computational complexity is trivial, because most of these computations (i.e., computing $hot\_topics(uc)$ for each user cluster) can be done offline. HTR_E performed slightly better than HTR because it uses less data (only explicit ratings) to make recommendations. Ziegler's TPR is computationally expensive because it needs to convert all users and items into high-dimensional taxonomy vectors. ITR performed slightly worse than TPR because it needs to compute extra item preference predictions using the standard CF technique. The standard CF technique is the most inefficient one among all the recommenders, whereas the slope one recommender offers a slight advantage in computation efficiency.


[Bar chart: average seconds per recommendation for the recommender types IRC, IR, SO, HTR, TPR, HTR_E and ITR; the plotted values range from 0.0017 to 5.6664 seconds]

Fig. 5. Average second per recommendation

5 Conclusions

In this paper, we investigated the implicit relations between users' item preferences and taxonomic preferences, and suggested, and also verified using information gain, that users who share similar item preferences may also share similar taxonomic preferences. Based on this investigation, we proposed HTR, a novel hybrid technique for automated recommendation making based on large-scale item taxonomies, which are readily available for diverse e-commerce domains today. HTR produces quality recommendations by incorporating both users' taxonomic preferences and their item preferences. Moreover, it can utilize both explicit and implicit ratings for recommendation making, and hence it is less prone to the cold-start problem. We compared the proposed HTR technique with standard benchmark techniques such as the item-based recommender, and with advanced modern techniques related to ours, such as TPR. We conducted extensive experiments which demonstrated that the proposed HTR outperforms the other recommenders in both recommendation quality and computation efficiency.

References

1. Montaner, M., López, B., Rosa, J.L.D.L.: A Taxonomy of Recommender Agents on the Internet. Artificial Intelligence Review 19(4), 285–330 (2003)
2. Deshpande, M., Karypis, G.: Item-based top-N recommendation algorithms. ACM Transactions on Information Systems 22(1), 143–177 (2004)
3. Schafer, J.B., Konstan, J.A., Riedl, J.: E-Commerce Recommendation Applications. Journal of Data Mining and Knowledge Discovery 5, 115–152 (2000)
4. Schein, A.I., et al.: Methods and metrics for cold-start recommendations. In: 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, Tampere (2002)


5. Middleton, S.E., et al.: Exploiting Synergy Between Ontologies and Recommender Systems. In: The Semantic Web Workshop, World Wide Web Conference (2002)
6. Burke, R.: Hybrid Recommender Systems: Survey and Experiments. User Modeling and User-Adapted Interaction 12(4), 331–370 (2002)
7. Sarwar, B., et al.: Application of dimensionality reduction in recommender systems–a case study. In: ACM WebKDD Workshop 2000, Boston, MA, USA (2000)
8. Adomavicius, G., et al.: Incorporating contextual information in recommender systems using a multidimensional approach. ACM Trans. Inf. Syst. 23(1), 103–145 (2005)
9. Ferman, A.M., et al.: Content-based filtering and personalization using structured metadata. In: 2nd ACM/IEEE-CS Joint Conference on Digital Libraries 2002, Portland, Oregon, USA (2002)
10. Park, S.-T., et al.: Naive filterbots for robust cold-start recommendations. In: 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA (2006)
11. Ziegler, C.-N., Lausen, G., Schmidt-Thieme, L.: Taxonomy-driven Computation of Product Recommendations. In: International Conference on Information and Knowledge Management, Washington D.C., USA (2004)
12. Breese, J.S., Heckerman, D., Kadie, C.: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In: Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, Madison, WI (1998)
13. Lemire, D., Maclachlan, A.: Slope One Predictors for Online Rating-Based Collaborative Filtering. In: 2005 SIAM Data Mining (2005)
14. Herlocker, J.L., et al.: Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS) 22(1), 5–53 (2004)

Adapting Integration Architectures Based on Semantic Web Services to Industrial Needs

Daniel Bachlechner

Innsbruck University School of Management, Department of Information Systems, Production and Logistics Management, Universitätsstraße 15, 6020 Innsbruck, Austria
[email protected]

Abstract. Semantic Web services pledge the automation of core Web service tasks such as discovery, selection and execution. The use of Semantic Web services facilitates the integration and adaptation of information systems considerably. However, such integration architectures have not yet been adopted by the industry. Within the scope of this work, we discuss the capabilities of integration architectures based on Semantic Web services as well as environmental factors that potentially drive or restrict their use. Our objective is to take a first step toward closing the gap between research trends and industrial needs. To that end, particular importance was attached to differences in the viewpoints of practitioners and researchers. The discourse is based on the findings of a SWOT analysis that was conducted in 2007.

1 Introduction

Today, many enterprises employ multiple mission-critical, best-of-breed application systems from different vendors with different technologies and platforms [1]. They usually choose the best vendor for every operational area and connect the products via the interfaces they provide. Although this approach normally tends to result in highly complex systems, until recently this strategy was considered a silver bullet when assembling business software. Together with mergers and acquisitions, reorganizations, and leadership changes, best-of-breed solutions have a considerable impact on IT infrastructures. Without doubt, the operation of heterogeneous application patchworks is complex and costly. Nevertheless, the ease of integration and adaptation of information systems within organizations and across organizational boundaries is critical for realizing competitive advantages. Even if just a few systems cannot share their data effectively, they create information bottlenecks that often require human intervention to resolve. Only with properly deployed integration architectures can organizations focus their efforts on their value-creating core competencies.

Web services brought about a revolution by taking a remarkable step toward the seamless integration of distributed software components. The importance of Web services as a cornerstone of service-oriented integration architectures is recognized and widely accepted by experts from industry and academia. Current Web service technology, however, operates at the syntactic level, is not suited for automatic processing, and hence still requires human interaction to a large extent.


To better support service discovery or to automate the composition of services into complex processes, a formal and standardized description of services seems useful. Using Semantic Web technologies for the description of Web services has been discussed extensively in the literature [2, 3, 4]. Semantic Web services (SWS) pledge the automation of core Web service tasks, such as discovery, selection, composition and execution, thus enabling interoperation and adaptability of systems, whereby human intervention is kept to a minimum. The objective of SWS research is to combine services on the fly in order to achieve given goals. Based on goal descriptions and descriptions of available services, a complex service yielding the desired result is composed automatically out of atomic building blocks [5]. However, with respect to SWSs, there seems to be a gap between research trends and industrial needs [6].

Within the scope of this work, we discuss the correspondence of the viewpoints of experts from academia and industry with respect to the capabilities of integration architectures based on SWSs as well as relevant environmental factors. The discourse is based on the findings of a SWOT analysis that was conducted within the scope of a Delphi study [7]. In order to enhance practical relevance, the study investigated SWSs mostly with respect to their application as the basis of integration architectures. While section 2 describes our research approach and the design of the underlying study, the results of the SWOT analysis are presented in section 3. The key findings are highlighted in section 4 and the work concludes with a recapitulation of the main ideas in section 5.

2 Research Approach

The underlying Delphi study was conducted at the University of Innsbruck in early 2007. The main goal of the study was to collect and quantify the opinions of practitioners and researchers on the potential of SWSs as basis for integration architectures that enable organizations to link their data processing systems efficiently. It was expected that an understanding of the relevance and applicability of SWS-based integration architectures would help to align future research efforts with industry needs effectively. Another goal was to make participating experts from academia and industry more sensitive to the progress and focus of SWS research. A Delphi study with experts from academia and industry seemed to be particularly suitable to achieve these goals.

By means of Delphi studies, anonymous expert judgments are obtained in a series of survey rounds. After each round, the panel is provided with controlled feedback about its responses [8]. A basic limitation of the Delphi method is its inability to make complex forecasts with multiple factors. Potential future outcomes are usually considered as if they had no effect on each other. Most events and developments, however, are in some way connected to each other. Hence, these interdependencies would have to be taken into consideration for more consistent and accurate forecasts. Despite this shortcoming, today, the method is a widely accepted tool for technology foresight and has been used successfully in many studies. The Delphi method also is not new to Web-related research. For instance, it was used to forecast the development of online communication and to identify trends in the field of Semantic Web research [9, 10].


2.1 Survey Design

Within the scope of the study, the participants were provided with two questionnaires. The first one contained open-ended questions designed to capture the experts' views concerning factors potentially affecting the relevance and applicability of SWS-based integration architectures. The responses from the first round were aggregated into groups and classified by the unique issues that best summarized their contents. The second questionnaire was based on the results of the first round. The participants were asked to review the aspects identified in the first round and rank them on structured bipolar rating scales ranging from 1 to 5, with 1 representing Strong Disagreement and 5 representing Strong Agreement. Two rounds were expected to be sufficient to attain a first impression concerning the opinions of the experts.

The Delphi study consisted of four parts, structured and formalized in a way that allowed the production of a technology roadmap as well as various analyses such as a SWOT (Strengths, Weaknesses, Opportunities and Threats) analysis, a requirements analysis and an analysis of expected effects. Within the scope of this work, we focus exclusively on the results of the SWOT analysis. Its purpose was to compare SWS-based integration architectures and traditional approaches with respect to their capabilities and to collect information on relevant environmental factors.

2.2 Questionnaire

The SWOT analysis helped to assess the relevance and applicability of integration architectures based on SWSs by analyzing their strengths, weaknesses, opportunities and threats. The questions stated within the scope of the analysis read as follows:

(1) Where do you see the strengths of integration architectures based on SWSs?
(2) Where do you see the weaknesses of integration architectures based on SWSs?
(3) What factors do you think will drive the use of integration architectures based on SWSs in the future?
(4) What factors do you think will restrict the use of integration architectures based on SWSs in the future?

Questions 1 and 2 are related to issues that make SWS-based integration architectures better than or inferior to traditional approaches. Questions 3 and 4 embody an approach to identify external factors that will potentially drive or restrict the use of integration architectures based on SWSs in the future.

2.3 Expert Panel

The candidates were selected from academia and industry in similar proportion. Repeated involvement at major conferences and publication in at least one of the relevant fields were two of the main criteria used to find suitable representatives of the target population. The candidates were exclusively people involved in at least one of the major international conferences related to SWSs and associated technologies, enterprise integration architectures and middleware solutions. Additionally, book authors and members of widely recognized initiatives active in at least one of the related research fields were considered.


Whereas 21 of the experts who participated in both rounds of the study had academic backgrounds, 17 had industrial ones. These numbers correspond well with respective recommendations for an adequate panel size [11]. The expertise of the participants in the area of research was gathered to evaluate their technical qualification for the study. The scale ranged from 1 to 5, with 1 representing Novice and 5 representing Expert. In spite of the fact that none of the participants ranked in the lowest category, the expertise distribution is sufficiently balanced to limit the risk of achieving too optimistic or too pessimistic forecasts [12].

2.4 Survey System

Web-based surveys provide capabilities far beyond those of any other type of self-administered survey technique. They can be designed in ways that facilitate a dynamic interaction between respondents and the survey system, which is of particular interest for Delphi studies [13]. However, it must always be kept in mind that a survey has to correspond to a level of technical sophistication that makes it possible for most users to respond. Within the scope of the development of the survey system, programming and design steps were taken to minimize the differences across respondents caused by different operating systems, Web browsers and screen configurations. Furthermore, the survey system allowed answering the questions in multiple sessions, and measures were taken to avoid known biases of self-administered surveys.

3 Results

The respondents rated up to 40 statements with respect to each of the questions stated within the scope of the SWOT analysis part of the Delphi study. The respondents were free to leave statements unrated or to check a No Comment box. The statements were ranked according to their arithmetic means. The statements ranked above the upper quartile (i.e., the ten top-ranked statements) are presented in tabular form for both groups of respondents. The tables show for each statement the number of ratings, the arithmetic mean, the standard deviation (SD) and Fleiss' kappa.

Fleiss' kappa κ was computed for each statement in order to measure inter-rater reliability [14]. Fleiss' kappa measures the reliability of agreement between a number of raters and is scored as a number between 0 and 1. The arithmetic mean of the κ values for the group of researchers is 0.36 and that for the group of practitioners is 0.42; the corresponding standard deviations are 0.07 and 0.11. These κ values indicate moderate agreement within the groups. The most controversial statements when comparing the two groups of respondents are illustrated by means of net diagrams. The five statements for which the difference of the means of the two groups of respondents is maximal are defined as the most controversial.

3.1 Strengths

Tables 1 and 2 list the top-ranked strengths of SWS-based integration architectures from an academic and an industrial perspective, respectively. Improved service discovery capability and facilitated interoperability are the most important strengths according to both groups of respondents.
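Fleiss' kappa, reported above per statement, can be computed from a matrix of rating counts as in the following sketch. The counts shown are hypothetical; they merely illustrate the calculation for a 1-5 rating scale with an equal number of raters per statement.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects x categories matrix of rating counts.

    counts[i][j] = number of raters assigning category j to subject i;
    every subject must have the same total number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    total = n_subjects * n_raters
    # Proportion of all assignments falling into each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    # Per-subject observed agreement.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical counts: 3 statements rated by 10 experts on the 1-5 scale.
ratings = [
    [0, 0, 1, 2, 7],
    [0, 1, 7, 2, 0],
    [8, 1, 1, 0, 0],
]
print(round(fleiss_kappa(ratings), 3))
```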


Both groups also consider the improved mediation between services and the availability of explicit definitions of conditions and functionalities as important strengths of SWS-based integration architectures. Furthermore, researchers and practitioners agree that the facilitated reuse of services is an issue that makes SWS-based integration architectures better than traditional ones. Finally, the improved service composition capability is considered a strength of integration architectures based on SWSs.

Table 1. Top-ranked strengths from an academic perspective

Statement                                                 N    Mean   SD     Kappa
Improved service discovery capability                     21   4.24   0.70   0.40
Facilitated interoperability                               20   4.20   0.89   0.44
Improved mediation between services                        20   4.10   0.97   0.34
Enhanced process and term definitions                      18   4.06   1.06   0.31
Formalization of systems                                   21   4.00   0.84   0.36
Explicit definitions of conditions and functionalities     20   3.95   0.69   0.64
Use of ontologies                                           20   3.95   0.94   0.32
Facilitated reuse of services                               20   3.95   0.94   0.32
Improved service validation capability                      21   3.81   0.81   0.40
Improved service composition capability                     21   3.81   1.03   0.44

Table 2. Top-ranked strengths from an industrial perspective

Statement                                                 N    Mean   SD     Kappa
Facilitated interoperability                               14   4.36   0.84   0.46
Improved service discovery capability                      15   4.33   0.62   0.47
Facilitated reuse of services                               14   4.07   0.73   0.63
Improved service composition capability                     14   4.00   0.96   0.45
Goal-based paradigm                                         15   3.93   0.70   0.56
Explicit definitions of conditions and functionalities     14   3.93   1.14   0.41
Service-orientation                                         14   3.93   0.92   0.32
Loose coupling                                              15   3.87   0.64   0.64
Improved mediation between services                         14   3.86   0.95   0.35
Increased flexibility                                       14   3.79   1.12   0.35

From an academic point of view, enhanced process and term definitions, the formalization of systems in general, the use of ontologies and the improved service validation capability are considered important strengths. From an industrial point of view, with average ratings of 3.67, enhanced process and term definitions, the use of ontologies and the improved service validation capability are also considered strengths of integration architectures based on SWSs. With an average rating of 3.40, practitioners do not perceive the formalization of systems as a very important strength. The formalization of systems is even among the most controversial issues with regard to strengths of integration architectures based on SWSs. Service-orientation, loose coupling and increased flexibility are among the top-ranked strengths from an industrial point of view. From an academic viewpoint, these strengths, with average ratings between 3.62 and 3.74, are also considered quite important. Practitioners also perceive the goal-based paradigm of integration architectures based on SWSs as an important strength, whereas researchers, with an average rating of 3.16, are not in accord with them.


[Net diagram (scale 1 to 5) comparing Industry and Academia ratings for: Goal-based paradigm, Formalization of systems, Compliance with business and legal rules, Improved service choreography capability, and Facilitated system upgrades]

Fig. 1. Most controversial strengths

Just as the formalization of systems, the goal-based paradigm is also among the most controversial strengths of integration architectures based on SWSs.

Figure 1 illustrates the most controversial strengths of SWS-based integration architectures when comparing the two groups of respondents. The goal-based paradigm and the general formalization of systems through integration architectures based on SWSs have already been addressed. Furthermore, respondents with academic backgrounds consider the compliance with business and legal rules, the improved service choreography capability and facilitated system upgrades as quite important strengths, whereas practitioners do not consider these issues particularly important.

3.2 Weaknesses

The top-ranked weaknesses of integration architectures based on SWSs are listed in Tables 3 and 4 from an academic and an industrial perspective, respectively. The use of immature technologies and the description overhead associated with SWS-based integration architectures are considered particularly important weaknesses by both groups of respondents.

Table 3. Top-ranked weaknesses from an academic perspective

Statement                                        N    Mean   SD     Kappa
High initial start-up costs                      19   4.42   0.61   0.48
Lack of agreement on description depth           16   4.38   0.89   0.42
Use of immature technologies                     19   4.26   0.81   0.43
High system complexity                           19   4.21   0.85   0.39
Description overhead                             19   4.16   1.07   0.35
High service development costs                   19   4.05   0.78   0.43
Unsatisfactory support of change management      19   4.05   0.85   0.37
Labor-intensive service specification            19   4.05   1.03   0.31
Software engineers are not ontology experts      19   3.89   1.05   0.34
Unsatisfactory life-cycle support                18   3.89   0.96   0.32

Table 4. Top-ranked weaknesses from an industrial perspective

Statement                                        N    Mean   SD     Kappa
Use of immature technologies                     13   4.23   0.60   0.51
Not yet adopted                                  13   4.15   0.69   0.42
Software engineers are not ontology experts      13   4.08   1.12   0.40
Description overhead                             13   4.00   0.41   0.91
Labor-intensive service specification            13   3.92   0.95   0.38
Unsatisfactory security features                 13   3.92   0.64   0.51
Lack of effective tools                          12   3.92   1.00   0.35
Lack of standards                                13   3.77   1.30   0.40
Unsatisfactory management capability             11   3.64   0.67   0.54
Lack of a dominant design                        13   3.62   0.77   0.40

Furthermore, researchers and practitioners agree that the labor-intensive service specification and the fact that software engineers are usually not ontology experts make SWS-based integration architectures inferior to traditional ones.

Respondents with academic backgrounds perceive the high initial start-up costs, the lack of agreement on description depth and the high system complexity as very important weaknesses. Both the lack of agreement on description depth and the high system complexity are among the most controversial issues with respect to weaknesses of SWS-based integration architectures. With average ratings below 3.31, practitioners do not consider them very important weaknesses. From an industrial point of view, the high initial start-up costs, with an average rating of 3.54, are likewise not perceived as a particularly important weakness. Furthermore, from an academic perspective, the high service development costs, the unsatisfactory support of change management and the unsatisfactory life-cycle support are also considered important issues that make integration architectures based on SWSs inferior to traditional ones. With respect to the importance of the high service development costs, practitioners, with an average rating of 3.17, are not in accord with researchers. Just like high system complexity and the lack of agreement on description depth, high service development costs is among the most controversial weaknesses of SWS-based integration architectures. With average ratings of 3.58, practitioners do not perceive the unsatisfactory support of change management and the unsatisfactory life-cycle support as very important weaknesses.

That SWS-based integration architectures have not yet been adopted is considered an important weakness by practitioners. With an average rating of 3.68, researchers also perceive this weakness as quite important. Furthermore, respondents with industrial backgrounds perceive unsatisfactory security features, the lack of effective tools, the lack of standards, the lack of a dominant design and the unsatisfactory management capability as important issues that make SWS-based integration architectures inferior to traditional ones. With average ratings between 3.47 and 3.84, these weaknesses are also quite important from an academic point of view.

Figure 2 illustrates the most controversial weaknesses of integration architectures based on SWSs when comparing the two groups of respondents. Controversial weaknesses of SWS-based integration architectures such as the lack of agreement on description depth, high system complexity and high service development costs have already been discussed. These weaknesses are perceived as important by researchers but not by practitioners.


Fig. 2. Most controversial weaknesses

Similarly, the high degree of formality and the unintuitive concepts associated with SWS-based integration architectures are considered rather important by respondents with academic backgrounds. From an industrial point of view, they are not perceived as particularly important.

3.3 Opportunities

Tables 5 and 6 list the top-ranked factors driving the use of integration architectures based on SWSs from an academic and an industrial perspective, respectively. The need for service interoperability, the availability of compliant middleware implementations, the availability of business cases and the availability of effective tools are considered important drivers by both groups of respondents. Respondents with academic backgrounds perceive proven cost-effectiveness and a compelling value proposition as very important factors driving the use of integration architectures based on SWSs. With an average rating of 3.80, respondents with industrial backgrounds also consider a compelling value proposition an important driver.

Table 5. Top-ranked opportunities from an academic perspective

Statement                                               N    Mean   SD     Kappa
Availability of business cases                          19   4.42   0.77   0.48
Proven cost-effectiveness                               18   4.33   0.77   0.39
Availability of compliant middleware implementations    19   4.21   0.71   0.41
Increasing dynamics of cooperation                      19   4.16   0.76   0.45
Availability of best practices                          19   4.16   0.90   0.34
Need for service interoperability                       19   4.05   0.91   0.43
Compelling value proposition                            19   4.05   0.91   0.33
Increasing support from standardization bodies          19   4.00   0.88   0.39
Proliferation of services                               18   4.00   0.91   0.32
Availability of effective tools                         19   4.00   1.25   0.29

Table 6. Top-ranked opportunities from an industrial perspective

Statement                                               N    Mean   SD     Kappa
Need for service interoperability                       13   4.46   0.52   0.58
Preceding agreement on standards                        12   4.33   0.98   0.43
Availability of effective tools                         13   4.31   0.85   0.47
Buy-in from large integration players                   12   4.25   0.87   0.46
Increasing dynamics of systems                          11   4.18   0.87   0.54
Potential savings                                       11   4.18   1.08   0.50
Availability of business cases                          13   4.15   1.14   0.38
Availability of compliant middleware implementations    13   4.15   0.80   0.60
Availability of integrated development environments     13   4.15   1.14   0.38
Need for effective collaboration                        12   4.08   1.00   0.32

With respect to the effects of proven cost-effectiveness, practitioners, with an average rating of 3.69, are not in accord with researchers. Proven cost-effectiveness is even among the most controversial issues with regard to factors driving the use of integration architectures based on SWSs. External factors driving the use of SWS-based integration architectures such as the increasing dynamics of cooperation, the availability of best practices and increasing support from standardization bodies are also among the top-ranked ones from an academic perspective. With average ratings between 3.58 and 3.73, respondents with industrial backgrounds do not perceive these factors as particularly important. Furthermore, researchers consider the proliferation of services an important factor driving the use of integration architectures based on SWSs. With an average rating of 3.77, practitioners are in accord with researchers with respect to this factor.

From an industrial point of view, the preceding agreement on standards, the buy-in from large integration players and potential savings are considered important drivers of the use of SWS-based integration architectures. With average ratings of 3.95 and 3.74, respectively, respondents with academic backgrounds are in accord with practitioners with respect to all factors but the buy-in from large integration players; this external factor, with an average rating of 3.58, is not perceived as very important by researchers. Furthermore, external factors such as the increasing dynamics of systems, the availability of integrated development environments and the need for effective collaboration are also among the top-ranked ones from an industrial perspective. With average ratings of 3.68, these drivers are not perceived as particularly important from an academic perspective. Just like proven cost-effectiveness, the increasing dynamics of systems is also among the most controversial factors driving the use of SWS-based integration architectures.

Figure 3 illustrates the most controversial factors driving the use of SWS-based integration architectures when comparing the two groups of respondents. Factors such as proven cost-effectiveness and the increasing dynamics of systems have already been addressed. Furthermore, the cooperation across industries, academia and interest organizations, preceding globalization, and a consolidated pattern algebra are very controversial drivers of the use of integration architectures based on SWSs. A consolidated pattern algebra is perceived as important by respondents with industrial backgrounds but not by respondents with academic backgrounds. Conversely, the other two factors are perceived as important by researchers but not by practitioners.


Fig. 3. Most controversial opportunities

3.4 Threats

The top-ranked factors restricting the use of SWS-based integration architectures are listed in Tables 7 and 8 from an academic and an industrial perspective, respectively. Both researchers and practitioners consider only the lack of effective tools and the limited consideration of business needs as important external factors restricting the use of SWS-based integration architectures.

Table 7. Top-ranked threats from an academic perspective

Statement                                           N    Mean   SD     Kappa
Difficulty of describing semantics                  18   4.39   0.50   0.60
Unavailability of convincing case studies           18   4.22   0.88   0.38
Unproven cost-effectiveness                         18   4.11   0.83   0.38
Increasing complexity                               18   4.11   0.90   0.34
High costs                                          18   4.06   0.64   0.47
Failure to reach critical mass                      18   4.06   0.87   0.36
Limited consideration of business needs             18   4.06   0.87   0.36
Lack of integration into middleware technologies    17   4.00   0.94   0.45
Lack of skilled developers                          16   4.00   1.03   0.43
Lack of effective tools                             18   3.94   1.06   0.40

The difficulty of describing semantics and the unavailability of convincing case studies are considered the most important barriers to the use of SWS-based integration architectures by researchers. With average ratings of 3.69 and 3.58, respectively, both factors are less important for practitioners. Unproven cost-effectiveness, increasing complexity and high costs are also considered important factors restricting the use of SWS-based integration architectures by researchers. With a rating of 3.33, practitioners do not perceive high costs as an important barrier with respect to the use of integration architectures. The other two, with average ratings of 3.77 and 3.62, respectively, are not perceived as particularly important from an industrial perspective.

Table 8. Top-ranked threats from an industrial perspective

Statement                                              N    Mean   SD     Kappa
Lack of effective tools                                13   4.23   0.83   0.42
Limited interest of vendors                            13   4.08   0.95   0.38
Lack of industrial commitment                          13   4.08   1.04   0.33
Difficulty of catalyzing the market                    12   4.00   0.74   0.40
Market does not understand values and capabilities     13   4.00   0.82   0.47
Dominant vendors use own technology                    12   4.00   1.04   0.35
Inability to communicate strengths                     13   3.85   0.99   0.38
Lack of common terminology for service description     13   3.85   1.07   0.33
Limited consideration of business needs                13   3.85   0.99   0.33
Lack of compelling value proposition                   13   3.85   1.34   0.38

respectively, are not perceived as particularly important from an industrial perspective. Furthermore, researchers consider the failure to reach critical mass, the lack of integration of SWS into middleware technologies and the lack of skilled developers as important barriers. Practitioners, with ratings of 3.83 and 3.69, respectively, are in accord with researchers with respect to the lack of integration into middleware technologies and the lack of skilled developers. With a rating of 3.54, the failure to reach critical mass is not perceived as an important barrier by practitioners. From an industrial point of view, the limited interest of vendors, the lack of industrial commitment and the difficulty of catalyzing the market are considered as important barriers. With an average rating of 3.76, researchers also perceive the difficulty of catalyzing the market as an important barrier. The limited interest of vendors and the lack of industrial commitment, with average ratings of 3.17 and 3.35, respectively, are not considered as important factors restricting the use of integration architectures based on SWS by researchers. Both factors are also among the most controversial barriers. That the market does not understand the values and capabilities and that dominant vendors use their own technologies are perceived as important external factors restricting the use of SWS-based integration architectures. With respect to the importance of these barriers, researchers, with average ratings of 3.84 and 3.61, respectively, are not in accord with practitioners. External factors restricting the use of SWS-based integration architectures such as the inability to communicate the strengths of SWS-based integration architectures and the lack of a common terminology for service description are also among the top-ranked ones from an industrial perspective. With average ratings of 3.50 and 3.67, respectively, researchers only attach limited importance to these barriers. Finally, the lack of a compelling value proposition also plays a major role with respect to restricting factors from an industrial point of view. From the perspective of researchers, this factor, with an average rating of 3.89, is also quite important. Figure 4 illustrates the most controversial factors restricting the use of integration architectures based on SWSs comparing the two groups of respondents. Controversial barriers to the use of SWS-based integration architectures such as the limited interest of vendors and the lack of industrial commitment have already been discussed. However, the two most controversial barriers are the lack of semantic annotations and the heterogeneity of workflows and business processes. Both are considered as important by researchers but not by practitioners. Conversely, the lack funding is perceived as an important barrier by practitioners but not by researchers.
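The per-statement figures reported in Tables 7 and 8 (N, mean, SD) and the group comparisons discussed above can be reproduced by a simple aggregation over the Delphi ratings. The following Python sketch is illustrative only and is not the instrument used in the study; the 1-5 rating scale, the example data and the "controversy" measure (the gap between the two groups' mean ratings, in the spirit of Figs. 3 and 4) are assumptions.

```python
from statistics import mean, stdev

# Hypothetical Delphi ratings on a 1-5 importance scale, keyed by statement.
# Each statement maps to the individual ratings of the academic and the
# industrial panel (values below are made up for illustration).
ratings = {
    "Lack of effective tools": {
        "academic":   [4, 4, 3, 5, 4, 4, 3, 4, 5, 4],
        "industrial": [5, 4, 4, 4, 5, 3, 4, 5, 4, 4],
    },
    "High costs": {
        "academic":   [4, 4, 5, 4, 4, 3, 4, 4, 5, 4],
        "industrial": [3, 4, 3, 3, 4, 3, 4, 3, 3, 3],
    },
}

def summarize(values):
    """Return the N / mean / SD triple reported per statement in the tables."""
    return len(values), round(mean(values), 2), round(stdev(values), 2)

for statement, groups in ratings.items():
    n_a, mean_a, sd_a = summarize(groups["academic"])
    n_i, mean_i, sd_i = summarize(groups["industrial"])
    # A large gap between the two group means marks a "controversial" factor
    # (one possible reading of Figs. 3 and 4, not necessarily the paper's exact metric).
    gap = round(abs(mean_a - mean_i), 2)
    print(f"{statement}: academic N={n_a} mean={mean_a} SD={sd_a} | "
          f"industrial N={n_i} mean={mean_i} SD={sd_i} | gap={gap}")
```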


Fig. 4. Most controversial threats

4 Discussion

Based on the results of the study, it seems that researchers and practitioners have realized that SWS-based integration architectures, if applied, are able to reduce interoperability problems. SWSs and integration architectures based on them help organizations to react to the need for a higher level of integration and more agility. They not only allow information to be moved from application to application but also make it possible to create composite applications by combining services found in any number of different local or remote systems. However, the participants in the study also revealed some serious problems that have to be solved before the full potential of SWSs and SWS-based integration architectures can be exploited.

It is agreed that the use of immature technologies is currently one of the most important weaknesses of integration architectures based on SWSs, particularly with respect to highly complex systems. With respect to immature technologies, the production of a technology roadmap seems natural in order to analyze the status quo and to coordinate future research and development activities [15]. Considerable ambiguity with respect to costs is also a problem. Particularly researchers perceive high initial start-up costs and high service development costs as important weaknesses of integration architectures based on SWSs. There is no doubt that the cost-effectiveness of SWS-based integration architectures needs to be proved. Making disparate systems share information cost-effectively is a key problem for companies and represents billions of euros in technology spending, with a high percentage of worldwide IT budgets dedicated to integration projects. The high costs are often associated with a lack of agreement on the extent of the semantic annotations. The service specification is perceived as very labor-intensive, and particularly researchers attach importance to the lack of agreement on description depth. Finding the right balance between satisfying high knowledge requirements and avoiding description overhead is critical.

In principle, the use of ontologies is considered a strength, but both groups of respondents expect problems due to the fact that software engineers are not ontology experts. Besides the lack of skilled developers, the lack of effective tools is also perceived as a serious issue by both groups of respondents. Unlike researchers, practitioners attach particular importance to the lack of industrial commitment and the limited interest of vendors. Respondents with industrial backgrounds think that the market does not understand the values and capabilities of integration architectures based on SWSs. The fact that dominant vendors use their own technologies makes catalyzing the market difficult. Researchers are apparently aware of this and also realize that their consideration of business needs is limited. The unavailability of convincing case studies and best practices, however, can be considered a direct consequence of their lack of target group orientation.

5 Conclusions

With respect to many aspects, the picture of integration architectures based on SWSs looks quite different from the academic and the industrial point of view. In the end, it is the practitioners who decide on the adoption and use of SWS-based integration architectures in industry. Researchers are supposed to deliver technologies meeting the requirements of the practitioners. The perceived strengths of SWS-based integration architectures are worth knowing, but the weaknesses are what researchers have to focus on. The weaknesses are the limitations, faults or defects that keep an approach, such as SWS-based integration architectures, from achieving its purpose. The same applies to environmental factors: exploiting opportunities is desirable, but countering threats is essential. Focusing on the weaknesses and threats will ultimately help to answer the question of whether SWS-based integration architectures are relevant to and applicable in industry. Closing the gap between research trends and industrial needs is an important step towards exploiting the full potential of SWSs within the scope of integration architectures, but also in general.

References 1. Hohpe, G., Woolf, B.: Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley, Boston (2005) 2. Terziyan, V., Kononenko, O.: Semantic Web Enabled Web Services: State-of-Art and Industrial Challenges. In: Web Services - ICWS-Europe (2003) 3. McIlraith, S.A., Son, T.C., Zeng, H.: Semantic Web Services. IEEE Intelligent Systems 16(2), 46–52 (2001) 4. Fensel, D., Bussler, C.: The Web Service Modeling Framework. Electronic Commerce Research and Applications 1(2), 113–137 (2002) 5. Alesso, H.P., Smith, C.F.: Developing Semantic Web Services. A K Peters, Ltd., Wellesley (2005) 6. Sollazzo, T., Handschuh, S., Staab, S., Frank, M.: Semantic Web Service Architecture Evolving Web Service Standards Toward the Semantic Web. In: Proceedings of the 15th International FLAIRS Conference (2002) 7. Bachlechner, D.: Relevance and Applicability of Semantic Web Services in Electronic Business: A Delphi Study. University of Innsbruck, Innsbruck (2007)


8. Linstone, H.A., Turoff, M.: The Delphi Method: Techniques and Applications. Addison-Wesley, Reading (1975) 9. Beck, K., Glotz, P., Vogelsang, G.: Die Zukunft des Internet. UVK-Medien, Konstanz (2000) 10. Cuel, R., Leger, A., Giunchiglia, F., Shvaiko, P., Zhdanova, A.V., Maynard, D., Euzenat, J., Ding, Y., Lucchese, L., Sure, Y., Stutt, A., Dzbor, M., Motta, E., Pasin, M., O’Murchu, I., Breslin, J.: Technology Roadmap. Knowledge Web project deliverable (2005) 11. Clayton, M.J.: Delphi: A Technique to Harness Expert Opinion for Critical Decision-Making Tasks in Education. Educational Psychology 17(4), 373–386 (1997) 12. Grupp, H., Blind, K., Cuhls, K.: Analyse von Meinungsdisparitäten in der Technikbewertung mit der Delphi-Methode. In: Häder, M. (Ed.) Die Delphi-Technik in den Sozialwissenschaften. Westdeutscher Verlag, Wiesbaden (2000) 13. Dillman, D.: Mail and Internet Surveys. John Wiley & Sons Inc., New York (2000) 14. Fleiss, J.L.: Measuring Nominal Scale Agreement among many Raters. Psychological Bulletin 76(5), 378–382 (1971) 15. Bachlechner, D., Fink, K.: Semantic Web Service Research: Current Challenges and Proximate Achievements. International Journal of Computer Science and Applications 5(3b), 117–140 (2008)

Part V

Human-Computer Interaction

“Fact or Fiction?” Imposing Legitimacy for Trustworthy Information on the Web: A Qualitative Inquiry Emma Nuraihan Mior Ibrahim, Nor Laila Md. Noor, and Shafie Mehad Universiti Teknologi Mara, 40450 Shah Alam, Selangor, Malaysia {emma,norlaila,shafie}@tmsk.uitm.edu.my

Abstract. The primary aim of this research is the question of how to impose a sense of “legitimacy” in designing information artifacts that can be rationalized and controlled as part of the overall interface design strategy and future IS construction for sensitive settings. The key focal point of the research is to refine the notions of the human-computer interface within the Web Mediated Information Environment (W-MIE) context through novel ways of assisting its design, deployment and evaluation from non-technical perspectives of requirements, addressing the earliest (conceptual) stage of development work concerning the role of “soft” trust dimensions as signals of trustworthiness governing interaction on the web. Drawing attention to web-based information for Islamic content sharing sites, the paper presents and discusses empirical results of a qualitative study that explicates the role of “institutional symbolisms” representation in safeguarding interaction on the web.

Keywords: Trust, web mediated information environment (W-MIE), institutional symbolisms, interview, qualitative, legitimacy.

1 Introduction

In the information systems (IS) community and the wider human-computer interaction (HCI) field, trust has been discussed widely in the context of the e-tailing environment, also known as the computer-mediated exchange (CME) context [1]. This work focuses on e-commerce aspects, particularly the business-to-consumer (B2C), business-to-business (B2B) and consumer-to-consumer (C2C) electronic exchange models. Many discussions of trust in the context of IS use the term rather loosely to refer to technologies such as encryption and communication protocols, cryptography and trusted information architectures [2] and [3]. In the IS literature, this is known as tangible trust, comprising psychological studies and formal mathematical and cognitive models of trust [4], or as the hard trust dimensions [5]. The emphasis is placed on the role of trust for e-commerce adoption and short-term transactional values. However, the Internet is growing as a marketplace not only for products and services but also for information. Unnoticed by many, consumers are involved not only in interpersonal or inter-organizational transactions within the electronic exchange model but also in knowledge transactions and exchanges within the information exchange model [6]. Websites that provide services on topics like career, relationships, medical, financial and legal information, religion and its practices, and politics are some of the evidence that users nowadays are extending their web use to access information about matters that affect their lives, and also to establish personal and organizational connections. In this sense we coined the term web mediated information environment (W-MIE) to refer to the activities involved in acquiring, seeking and disseminating information on the web [7]. Within the information environment, the notion of trust in information conforms to the interpersonal model of trust. It is a social attitude towards a technological artifact, in this case the electronic information or document such as a web page or electronic article [8]. A person may search for information to satisfy any of several needs: evidentiary support for a decision-making process, reference material for producing one's own, or facts to supplement personal knowledge [8]. Although the Internet is widely used, has become the most effective communication tool and information provider, and gives easy access to abundant sources, some limitations can be seen in the questionable quality of the information provided and the risk of getting false information [9]. Trust in W-MIE is fairly new and the risks associated with it are novel to users. Some researchers have explicitly raised the problem of trust in information obtained from the Internet [10]. Apparently, there is less control and gatekeeping on the web than for print publications. Neither authoritarian governments nor institutions can screen all the information on the Internet due to its nature. Nearly anyone can publish on the Internet. However, less critical and uninformed people are more likely to accept an untruth as a truth [11]. Falsity on the web is seldom revealed because there is too much information. Hence, users must deliberately decide whether to trust, using their own knowledge to evaluate the information on its own terms. Other issues like fraudulent behavior, forgery and pretense, questions concerning the original and the copy [12], not to mention the evaluation of goods that are the object of commercial transactions, have given rise to the problem of trust in electronically mediated environments and have highlighted just how crucial trust-building mechanisms have been to the functioning of markets and communities since the beginning of time. What is at stake here is the entire range of mechanisms that will facilitate interpersonal and inter-organizational transactions, given the new conditions for knowledge transactions and exchanges: increasing specialization, increasing asymmetrical distribution of information and assessment capabilities, greater anonymity among interlocutors and more opportunities for identity forgery, as highlighted in [13]. Clearly, new methods need to be devised to “certify” the knowledge circulating on the Internet within a context where inputs are no longer subject to control [13]. Another big issue concerns regulation, social behavior and the formation of cooperation based upon trust and shared ethos/identity in a virtual context [12]. This is because trust and culture are interconnected. The meaning, antecedents, and effects of trust are indeed determined by one's culture [14], [15] and [16]. In the HCI literature, much of the initial research on consumers' judgment of trust in information is conceptualized through its credibility perceptions [17] and [18] or quality indicators [10], focusing on the domains of health [19], finance [20], risk communication [21] and advice websites [22]. However, some of these works are heavily criticized because the operationalization of trust was not understandable [23]. Compounding the problem surrounding trust research within W-MIE is the fact that many of the scales used in the studies are hardly scientific, being neither theoretically grounded nor validated. Indeed, trust exerts an influence on consumer behavior and decision-making processes. However, given the varying results of previous studies on


consumers' trust assessment within the e-tailing environment, we are left wondering whether consumers' trust evaluation of topics that are sensitive in context and bound to cultural norms, such as religion, would give a different connotation of trust and different guidelines. There is a longstanding interest in designing information and computational systems that support enduring human values [24]. The Internet spans the globe in all languages and cultures. Users and organizations communicate their business ideas, knowledge and information across vast distances; hence it is critical to avoid misunderstandings and misinterpretations. Users must comprehend accurately the meaning of what is said. This is inhibited by differences in value systems, attitudes, beliefs, and communication styles. Such differences must be taken into account in order to ensure that interfaces are usable and acceptable, as the cultural background of users affects how they operate and interact with an interface. The communication style one uses for generating ideas, exchanging opinions, sharing knowledge and expressing ideas is indeed culture-centric [24]. However, these key issues, rooted in deep cultural identities represented via interface elements within an information context, have not been fully explored and understood empirically. Some researchers have done work in the area of culture and design, but the results have been either inconclusive or unrelated to developing an IS for information settings in a sensitive context. Moreover, there is no consensus as to how the trust construct should be operationalized [25].

1.1 Contextualizing the Problems of Trust within W-MIE - Web Based Information for Islamic Content Sharing Sites

Moving away from the dominant trust issues in the e-tailing environment, we 'contextualized' the notion of trust within W-MIE, focusing on the domain of information in a sensitive context. For the purpose of this research, we look at the context of web-based information for Islamic content sharing sites. These websites highlight information, knowledge and services, be they commercial or entertainment in nature, reflecting Islamic ideologies, content, norms or values. We believe the Islamic context offers an interesting view on the investigation of this trusting phenomenon, as Islamic principles rely much on a “legitimacy” governed by cultural cognitive, normative and regulative elements, both formal and informal. Many Muslims have joined the Internet bandwagon in extolling its merits as a primary source of information, as reflected in the increasing number of Islamic sites on the Internet, some of which are devoted to Islamic education and dissemination, e.g. Al-Islam (al-islam.org) or information on consumer products and services (e.g. ifanca.org), while others are more commercial and entertainment oriented in nature. However, due to the anonymous nature of such technology, the reliability and authenticity of information [27] received by seekers of Islamic knowledge can be problematic. To some extent, it is hard to make a distinction between information (about Islam) and true knowledge [13] with some of the inauthentic Islamic sites that disseminate and communicate information pertaining to Islam. Hence, researchers argue that there is a need to monitor information on Islam so that fabricated and misleading information can be easily identified [13].
Some have recommended developing a mechanism of certification and authentication for Islamic sites disseminating information on Islam, or obtaining approval from well-known Islamic organizations, in a similar manner to the halal certification required for food products [26] and [13]. Hence, the focus of this paper is to reveal in greater depth the understanding of trust within W-MIE, taking into account trust determinants from the perspective of human actors' reasoning in a context where the sense of a community with common norms and values is significant. We believe the solution is to understand trust operationalization in a holistic manner by taking human factors into account, focusing on those parts of the systems directly experienced and understood by ordinary people.

1.2 Research Assumptions

Research on trust issues in the e-mediated environment has previously investigated tangible trust perspectives in relation to the e-tailing context. In IS research, the study of trust formation as an emotional, 'intangible' response to computer-based stimuli [28], [29] and [30], also known as the 'non-technical mechanism' perspective on trust [31] or the 'soft trust dimension' [5], is an equally diverse field as tangible trust. It is proposed here that a paradigm shift is needed from specification to discovery, and from detailed analysis of tangible, known requirements to activity designs (the intangible known), in order to uncover hidden or at least unarticulated or semi-articulated user knowledge and needs. Hence this paper is an extension of our previous work in [7] and [32], which views trust not as purely governed by technical mechanisms but as mediated by technology that carries already established and long-practiced routines of human behavior. It looks into how humans reason about trust online, triggered by the perceived trustworthy elements of an interface apparent to the users. As more people go online to seek information, regardless of context, it becomes increasingly important to identify what makes people choose to trust some sites and reject others. In this sense, while not disregarding the importance of online security and the evolving systems that support it, we believe that security has little to do with general consumer trust. This is the case not only because consumers lack the expertise needed to make informed choices but, more importantly, because of their general lack of interest in the technology of security [33]. This leads to the assumption that current trust research in IS is hampered by designing for computers rather than humans. Although some of these measures are surely useful and needed, we believe that the idea of total control and a purely technical solution to protect against deception and to favor non-self-interested cooperation is unrealistic. Hence we posit that designing trust metrics requires an understanding of not only the technical nuances of security but also the human subtleties of trust perception. It is necessary to have a more cognitive and affective view of trust as a complex mental structure of beliefs and goals, which would imply that the trustor has a “theory of mind” of the trustee [34] and [35]. An exploratory study of trust phenomena within W-MIE and the ways in which they affect people's decision to trust or not to trust is needed. Understanding the nature of trust seems like a logical first step. Asking users directly about the topic is usually not a good idea: users tend to give 'school class answers' to such direct questions instead of describing how they really behave in their trusting decision processes. This is also why we have some doubts about using questionnaires to find out, for example, about security and privacy issues, even though such research exists [36].
However, we believe that understanding the concept of trust in other areas, such as social science theories, can give insight into what facilitates trustworthy information online. Applying this understanding to W-MIE will be helpful in developing initial guidelines for designs that support trust and social identity. Drawing on previous sociological research on trust, interaction and everyday experience, this work re-asserts the position of trust within a non-technical, non-deterministic context. It demonstrates how users approach information-mediated exchange activities bringing with them previous experiences of trust and applying them to the new computer-mediated situations, rather than being tabulae rasae onto which the design of information-based systems can write its preferred responses.

2 Theoretical Backgrounds

To guide us we have formulated the goal of creating a framework to facilitate trust conceptualization and operationalization within the W-MIE context via explication of institutional theory and the semiotics paradigm [7] and [32]. The framework we propose is called Institutional Symbolisms Trust Inducing Features. Institutional symbolisms are a visible, physical manifestation of institutional characteristics, behavior and values represented by trust marks: signs that depict and present a connoted message of some 'assurance', signified under four dimensions and their underlying properties, namely content credibility, emotional assurance, brand/reputation and trusted third party [7] (see Table 1). This assurance implies the sense of 'legitimacy' that safeguards the overall impersonal structures and situations on the web in which the information domain resides. It implies that the symbols carry their own disposition and meaning, the trust-warranting properties being manifested textually or graphically on the website. In this sense, institutional symbolisms are seen as a form of social trust, where trust is initiated from social mechanisms, behavior and values through the means of symbolic representation. We contend that institutional design features could align formal and informal signs of trust and match their meanings through shared norms, assumptions, beliefs, perceptions and actions, as elaborated in [7] and [32].

Table 1. The Framework of Institutional Symbolisms Trust Inducing Features (adapted from [7])

Dimension: Trust marks that reflect third party assurance or seals of approval.
Value: A belief that it will perform a particular action, to monitor or to control that certain acts and behavior is warranted.
Measurements: Trust marks that symbolize: protecting privacy; providing security; demonstrating consumer satisfaction; providing reliability; providing assurance or guarantee.

Dimension: Trust marks that reflect credibility of the web content.
Value: A belief that it has the ability and competency to carry out the obligations.
Measurements: Trust marks that symbolize: competence (knowledge, expertise and skill); reliability (accuracy, currency, coverage and believability); predictability (stability of information).

Dimension: Trust marks that evoke emotional assurance or security.
Value: A belief that it will provide a sense of comfort that is reflective, thoughtful and careful.
Measurements: Trust marks that symbolize: benevolence (goodwill and objectivity); honesty (validity and openness); integrity (fiduciary obligations).

Dimension: Trust marks that reflect trustworthy expectations derived from the message.
Value: A belief that it signifies positive or prominent identities and values.
Measurements: Trust marks that symbolize: reputation (offline reputation); brand (brand image, brand personality).


3 Methodology

We conducted a qualitative study to inquire into users' perceptions of trustworthy information artifacts. Such an information artifact is assumed to act as an object (graphically or textually displayed on the web) that has some mandated specifications, meets conventions or standards, or possesses some kind of symbolic value to the user. The nature of the proposed data collection is rich and qualitative but also simple and intuitive. The technique involves concentrating on what might be loosely called the general look and feel of an interface as well as detailed design elements such as the choice of specific content. Hence, we seek to anticipate (pre-align) an early application concept to a particular target audience in order to identify patterns of preferences for design characteristics by a specific group. The aim is to capture the users' understanding of their decoding of signs embedded in electronic interfaces.

3.1 Interview – Semantic Meanings of Trustworthy Elements

Prior to the interview, we conducted a preliminary study to uncover cognitive representations, using an indirect approach to probe non-functional quality aspects of websites, as detailed in [32]. At this stage, our empirical investigation used the 'closed card sorting' technique to probe users' trust perceptions of institutional signs embedded within sensitive-context web-based content sharing sites. In this study we wanted to learn how users sort institutional elements into each category, in order to probe users' mental models and achieve a stable picture of their preferred structure of website aspects. A focus group of 15 users participated in this study. Our participants consisted of 10 females and 5 males between the ages of 25 and 40, having at least a bachelor degree qualification and having participated in online transaction activities for at least 2 years. We used subjects with previous experience in online transaction activities because they would already have well-developed schemas for offline risks [37]. We also preferred to have educated people because they are said to be more likely to have some experience with technology [35]. For the operationalization of the institutional dimensions, web-based information for Islamic content sharing sites was chosen. The subjects were presented with e-halal homepages from Malaysia (www.halaljakim.gov.my) and Singapore (www.muis.gov.sg/cms/index.aspx). These homepages are the official websites that disseminate information pertaining to halal products and services in their respective countries. The subjects were asked to think about any trust elements that came to mind when browsing or searching for halal information on the web while performing the card sort activity. The results of the card sorting study are elaborated further in [32]. However, we contend that these institutional elements carry some semantic meaning that captures the user's cognitive structure. Hence, we conducted semi-structured interviews with the participants to understand the semantic meanings behind each trust element solely from the user's perspective. We conducted the interviews immediately after their searches to elicit trust decision-making processes and criteria for assessing trustworthy information. The interviews were transcribed and subjected to content analysis. The results allow us either to add new content to, or eliminate existing content from, an existing theoretical framework. We analyzed the transcripts of the focus groups and interviews for emerging themes concerning markers used by the users to evaluate trustworthy information artifacts present in the web-based information for Islamic content sharing sites. Several criteria and categories for each dimension emerged from the interviews, as detailed in Table 2.
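The closed card sorting data described above can be aggregated into a simple element-by-dimension count matrix to obtain the "stable picture" of users' preferred structure. The following Python sketch is illustrative only; the dimension names follow the framework in Table 1, while the example elements and placements are hypothetical and not taken from the study data.

```python
from collections import Counter, defaultdict

# The four predefined dimensions of the framework (closed card sort categories).
DIMENSIONS = ["Content Credibility", "Emotional Assurance",
              "Trusted Third Party/Seals of Approval", "Brand/Reputation"]

# Hypothetical placements: for each participant, element -> chosen dimension.
placements = [
    {"privacy seal": "Trusted Third Party/Seals of Approval",
     "domain name": "Brand/Reputation",
     "authorship": "Content Credibility"},
    {"privacy seal": "Trusted Third Party/Seals of Approval",
     "domain name": "Content Credibility",
     "authorship": "Content Credibility"},
]

# Tally how often each element was sorted into each dimension.
counts = defaultdict(Counter)
for participant in placements:
    for element, dimension in participant.items():
        counts[element][dimension] += 1

# The modal dimension per element approximates the group's preferred structure;
# low agreement flags elements whose meaning should be probed in the interviews.
for element, tally in counts.items():
    dimension, votes = tally.most_common(1)[0]
    agreement = votes / sum(tally.values())
    print(f"{element}: {dimension} (agreement {agreement:.0%})")
```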


4 Results - Transcripts of Focus Groups and Interviews

Table 2. Results of users' perception of trustworthy information (institutional dimensions, generated categories and institutional elements)

Content Credibility
- Content trustworthy (truth, validity): authority/authorships; past experience/previous encounter with the site; attributions; content reliability; content believability (links, accuracy, currency); language
- Content legitimacy (lawful, evidential): information legitimacy; information protection/disclosure
- Content presentation (appearance, functionality): navigation of content; design and layout

Emotional Assurance
- Site benevolence (goodwill): organization/institution's values; organization/institution's positive intentions
- Site integrity (security application and enforcement): security and authenticity

Trusted Third Party/Seals of Approval
- Demonstrating user's satisfaction: demonstrating user's satisfaction
- Guarantees and safety nets: providing third party on site security; providing third party on information privacy
- Information practices: site policies and practices validation; content credibility validation

Brand/Reputation
- Ethics (obligations): purpose
- Expertness (authority, knowledge): perceived institution/organization's ability; social roles or functions
- Familiarity (general business sense): sources; offline reputation; domain name; trademarks


5 Discussions and Conclusions

In order to find out how online users define, interpret and understand what is perceived as trustworthy information, we conducted interviews to elicit the semantic meanings underlying each trust element solely from the users' understanding. Several criteria to assess the credibility of information emerged from the interviews, as shown in Table 2. The content credibility dimension is represented by three categories: content trustworthy (truth, validity), content legitimacy (lawful, evidential) and content presentation (appearance, functionality). Several criteria to assess the emotional assurance of information also emerged from the interviews, as shown in Table 2. Two categories were generated, consisting of site benevolence (goodwill, identification) and site integrity (security and enforcement). Under the third party/seals of approval dimension, three categories were generated, consisting of ethical (moral values, obligations), expertness (authority, knowledge) and familiarity (general business sense). Finally, under the brand/reputation dimension, three categories were generated, consisting of demonstrating user's satisfaction, guarantees and safety nets, and information practices. The results yield evidence for understanding the trust phenomenon, representing the interpretation of online “legitimacy” within the environment of web-based information for Islamic content sharing sites from the Muslim perspective, as to how trustworthy information artifacts on the web are perceived. The paper thereby attempts to highlight the concept of trust within the unique context of W-MIE, which has received little attention in the HCI literature. There has been little argument for making trust the core concern of system design and evaluation of web interface elements in an informational context, particularly for information within sensitive context settings. By highlighting the context of information exchange activities within the notion of W-MIE, it opens up discussion on debates, issues and methods for the future construction of information systems (IS) and the overall interface design strategy. To provide further explanation of the existing constructs, thorough measurements of the constructs with appropriate instruments should be developed and empirically tested in order to provide a better understanding and explanation of their instantiations within the IS environment. This can be done by outlining a series of research propositions that can move us towards a more comprehensive understanding of designing trustworthy information artifacts. Next, empirical testing is needed to validate the research model and to examine the relative importance of the trust dimensions and their antecedents. A more effective strategy would be to assess a construct's efficacy in predicting or explaining the trust phenomenon through a quantitative study. In a nutshell, this would be akin to assessing the construct's contribution to existing knowledge and theory in the field of trust studies and HCI.

References 1. Bailey, B.P., Gurak, L.J., Konstan, J.A.: An examination of trust production in computer-mediated exchange. In: Proceedings of Seventh Human Factors and the Web 2001 Conference, Madison, WI, June 4 (2001) 2. Pavlou, P.A., Gefen, D.: Effective Online Marketplaces with Institution-Based Trust. Information Systems Research 15(1), 37–59 (2004)


3. Ratnasingam, P., Pavlou, P.A.: Technology Trust in B2B Electronic Commerce: Conceptual Foundations. In: Kangas, K. (ed.) Business Strategies for Information Technology Management, pp. 200–215. Idea Group Publishing, Hershey (2004) 4. Abdul-Rahman, A., Hailes, S.: Supporting Trust in Virtual Communities. In: Proceedings of the 33rd Hawaii International Conference on System Sciences, vol. 6, p. 6007, January 04-07 (2000) 5. Kräuter-Grabner, S., Kaluscha, E.A., Marliese, F.: Perspectives of Online Trust and Similar Constructs – A Conceptual Clarification. In: Proceedings of The Eighth International Conference on Electronic Commerce, pp. 235–243. ACM, New York (2006) 6. Forray, D.: The Economics of Knowledge. MIT Press, Cambridge (2004) 7. Ibrahim, E.N.M., Noor, N.L.M., Mehad, S.: Seeing Is Not Believing But Interpreting, Inducing Trust Through Institutional Symbolism: A Conceptual Framework for Online Trust Building in a Web Mediated Information Environment. In: Smith, M.J., Salvendy, G. (eds.) HCII 2007. LNCS, vol. 4558, pp. 64–73. Springer, Heidelberg (2007) 8. Chopra, K., Wallace, A.W.: Trust in Electronic Commerce. In: Proceedings of the 36th Hawaii International Conference on System Sciences (2002) 9. Marchand, D.: Managing information quality. In: Wormell, I. (ed.) Information Quality: Definitions and Dimensions. Taylor Graham, London (1990) 10. Alexander, J.E., Tate, M.: Web wisdom: How to evaluate and create information quality on the Web. Lawrence Erlbaum, Mahwah (1999) 11. Hernon, P.: Disinformation and Misinformation through the Internet: Findings of an Exploratory Study. Government Information Quarterly 12, 133–139 (1995) 12. Eco, U.: Interpretation And Overinterpretation. In: Collini, S. (ed.). Tanner Lectures In Human Values. Cambridge University Press, U.K (1992) 13. Khan, K., Khan, S.: Da’wah via Internet: Opportunities and Challenges. In: Abstract for Islam Internet Conference in USA. Islamic Society of North America (1999) 14. Doney, P.M., Cannon, J.P.: An examination of the nature of trust in buyer-seller relationships. Journal of Marketing 61, 35–51 (1997) 15. Fukuyama, F.: Trust: The social virtues and the creation of prosperity. Free Press, New York (1995) 16. Zucker, L.: Production of Trust: Institutional Sources of Economic Structure, 1840-1920. Research in Organization Behavior 8(1), 53–111 (1986) 17. Fogg, B.J., Tseng, H.: The elements of computer credibility. In: Proceedings of the CHI 1999, pp. 80–87. ACM Press, New York (1999) 18. Metzger, M.J., Flanagin, A.J., Zwarun, L.: College student web use, perceptions of information credibility, and verification behaviour. Computers and Education 41(3), 271–290 (2003) 19. Constantinides, H., Swenson, J.: Credibility and Medical Web Sites: A Literature Review, University of Minnesota (2000) 20. Stanford, J., Tauber, E.R., Fogg, B.J., Marable, L.: Experts vs. online consumers: A comparative credibility study of health and finance Web sites. Consumer WebWatch Research Report (2002) 21. Peters, G.R., Covello, T.V., McCallum, B.D.: The Determinants of Trust and Credibility in Environmental Risk Communication: An Empirical Study. Risk Analysis 17(1), 43–54 (1997) 22. McKnight, D.H., Kacmar, C.: Factors of Information Credibility for an Internet Advice Site. In: Proceedings of the 38th Hawaii International Conference on System Sciences (HICSS 2006) (2006)


23. Kräuter-Grabner, S., Kaluscha, A.E.: Empirical research in on-line trust: a review and critical assessment. International Journal of Human-Computer Studies 58(6), 783 (2003) 24. Friedman, B., Kahn Jr., P.H., Borning, A.: Value Sensitive Design and information systems. In: Zhang, P., Galletta, D. (eds.) Human-computer interaction in management information systems: Foundations, Armonk, New York, pp. 348–372. M.E. Sharpe, London (2006) 25. Bhattacherjee, A.: Individual trust in on-line firms: scale development and initial test. Journal of Management Information Systems 19, 211–241 (2002) 26. Ahmad, S.: Islamic Web Sites: Protecting Islamic Heritage. In: Abstract for Islam Internet Conference in USA. Islamic Society of North America (1999) 27. Ibrahim, E.N.M., Noor, N.L.M., Mehad, S.: Trust or Distrust in the Web Mediated Information Environment: A Perspective of Online Muslims Users, CD Rom. In: Irani, Z., Sahraoui, S., Ghoneim, A., Sharp, J., Ozkan, S., Ali, M., Alshawi, S. (eds.) Online Proceedings of the European and Mediterranean on Information Systems (EMCIS) 2008, Al Bostan Rotana, Dubai, UAE, 25-26 May (2008) 28. Corritore, C.L., Wiedenbeck, S., Kracher, B.: On-line Trust: Concepts, evolving themes, a model. International Journal of Human Computer Studies (58), 737–758 (2003) 29. Egger, F.N.: Affective design of e-commerce user interface: How to maximize perceived trustworthiness. In: Proceedings of the International Conference on Affective Human Factors Design. Academic Press, London (2001) 30. French, T., Liu, K., Springett, M.A.: Card-sorting Probe for E-Banking. In: Proceedings of British Human Computer Interaction, vol. 1. BCS Publications (2007) 31. Riegelsberger, J., Sasse, S.M., McCarthy, D.J.: The mechanics of trust: A framework for research and design. International Journal Human-Computer Studies 62, 381–422 (2005) 32. Ibrahim, E.N.M., Noor, N.L.M., Mehad, S.: Wisdom on the Web: On Trust, Institution and Symbolisms, A Preliminary Investigations. In: Proceedings of Enterprise Information Systems, Barcelona, Spain, ICEIS, vol. (5), pp. 13–20 (2008b) 33. Karvonen, K., Parkkinen, J.: Signs of trust. In: Proceedings of the 9th International Conference on HCI, New Orleans, LA, USA (2001) 34. Castelfranchi, C., Rosembergh, T.: Through the agents’ mind: Cognitive mediators of social action. Mind and Society, 109–140 (2000) 35. Lewicki, R.J., Bunker, B.B.: Developing and maintaining trust in work relationships. In: Kramer, R., Tyler, T. (eds.) Trust in Organizations: Frontiers of Theory and Research, pp. 114–139. Sage, Newbury Park (1996) 36. Fogg, B.J., Marshall, J., Kameda, T., Solomon, J., Rangnekar, A., Boyed, J., Brown, B.: Web credibility research: a method for online experiments and early study results. In: Proceedings of the Conference on Human Factor in Computing Systems CHI Extended abstract, pp. 295–296. ACM Press, New York (2001) 37. Nyshadyam, E.A., Ugbaja, M.: A Study of E-Commerce Risk Perceptions among B2C Consumers. In: A Two Country Study, in the 19th BLED eConference, eValues (2006)

Enabling End Users to Proactively Tailor Underspecified, Human-Centric Business Processes: “Programming by Example” of Weakly-Structured Process Models Todor Stoitsev1, Stefan Scheidl1, Felix Flentge2, and Max Mühlhäuser2 1

SAP AG, SAP Research, Bleichstrasse 8, 64283 Darmstadt, Germany {todor.stoitsev,stefan.scheidl}@sap.com 2 Darmstadt University of Technology, Telecooperation Group Hochschulstrasse 10, 64289 Darmstadt, Germany [email protected], [email protected]

Abstract. Enterprises face the challenge of managing underspecified, human-centric business processes which are executed in distributed teams in a rather informal, ad-hoc manner. This gave hibernating CSCW and ad-hoc workflow research a new push recently. However, there is still the need to clearly perceive end users as the actual drivers of business processes and to enable them to proactively tailor these processes according to their expertise and problem-solving strategies. This paper presents the design and evaluation of a prototype for end-user development of weakly-structured process models through email-integrated task management. The presented CTM (Collaborative Task Manager) prototype uses “programming by example” to leverage user experience with standard email and task management applications and to extend user skills towards the definition of reusable process structures. By closely correlating to the actual user work practices and software environment, the tool provides a “gentle slope of complexity” for process tailoring by end users.

Keywords: End-user development, human-computer interaction, computer supported cooperative work, ad-hoc workflow, knowledge management.

1 Introduction

Up until recently, workflow systems were too formal and restrictive to be useful for knowledge-intensive and rather informal processes [18]. The importance of such processes and the increase in distributed team work led to further research on enterprise efficiency, which clearly shows how “individual actions lead to overall enterprise performance” [21]. It becomes apparent that the traditional enterprise process modeling perspective is being replaced by tailoring business processes according to the individual point of view and connecting them towards the achievement of common enterprise goals. This novel view on business processes emerges in analyst reports as the “Process of Me” [6] and is recognized as one of the major challenges for next-generation Business Process Management (BPM). It states the fundamental

need to provide end users with adequate techniques to proactively express process knowledge and to participate in business process management and design. End-User Development (EUD) is defined as “a set of methods, techniques, and tools that allow users of software systems, who are acting as non-professional software developers, at some point to create, modify, or extend a software artefact” [12]. Within the presented paper a process model is considered a software artifact that can be adapted and enacted to support underspecified, human-centric processes. The presented study is motivated by the possibility to “render” appropriation of process models to end users and to “exploit the potential of opportunity-based and emergent changes” from the introduction of groupware in enterprises [22]. Riss et al. [17] discuss the challenges for next-generation BPM by suggesting the recognition and reuse of “task patterns” and “process patterns” as an alternative to static workflows. However, concrete examples of engaging business users in task pattern definition and modeling towards generic enterprise process models are still missing, as are techniques for achieving that. This issue is the focus of the presented paper. The described approach ensures a “gentle slope of complexity” [13] for process tailoring activities by leveraging user experience with standard tools for collaboration (email) and task management (to-do lists) and extending user skills towards the definition of weakly-structured process models through “programming by example” [11]. This EUD technique enables unobtrusive support by embedding process definition in the existing end-user working environment and inferring process models from the captured executed activities. The described approach presents a valuable extension to “evolutionary” workflows [7] and “interactive process models” [10] by allowing “seeding, evolutionary growth, and reseeding” [5] of weakly-structured process models in shared enterprise repositories as well as task instance-based evolution tracking. In section 2 we present basic problems regarding current practices in ad-hoc processes, which are used to introduce process tailoring by end users. Section 3 presents a prototype for end-user-driven process definition. Section 4 describes results from a prototype evaluation at a partner company. In section 5 we give conclusions and future research directions.

2 Addressed Problem Areas

The presented study builds on state-of-the-art research in the areas of task management, flexible workflows, computer supported cooperative work and EUD. It is based on intra-organizational knowledge sources accumulating customer requirements as well as on dedicated site visits and interviews at three companies from various industries: textile (120 employees), software (ca. 500 employees) and automotive (ca. 150 employees). Based on the preliminary studies we identified five generic problem areas concerning user work practices in ad-hoc processes that can be used to introduce user-driven process composition:

2 Addressed Problem Areas The presented study builds up on state of the art research in the areas of task management, flexible workflows, computer supported cooperative work and EUD. It is based on intra-organizational knowledge sources accumulating customer requirements as well as on dedicated site visits and interviews at three companies from various industries: textile (120 employees), software (ca. 500 employees), automotive (ca. 150 employees). Based on the preliminary studies we identified five generic problem areas concerning user work practices in ad-hoc processes that can be used to introduce user-driven process composition: •

Lacking Transparency. Email is the main tool for exchange of tasks and taskrelated information in informal processes [3]. Users further organize tasks in


to-do lists [2]. These tools do not provide end-to-end overview of running collaborative activities. No Structured Storage and Retrieval of Process Knowledge. Users spent considerable effort to search for task-related data in email folders [2]. While having individual strategies for storing data in email and file folders, users are not able to predict how their “sorting” practice will scale over time. Increasing data amount increases search effort and user efficiency degrades. Lacking Exchange of Process Knowledge. As process knowledge often remains implicit, stuck in personal email and file folders, people “know” what to do but cannot share it efficiently with their colleagues. This leads to problems when domain experts are not available and cannot provide support. Disjunction between Best-practices and Running Processes. A common way to store process guidelines is in text documents (e.g. Microsoft Word). Text representations do not provide the possibility to follow evolving user tasks with respect to the provided guidelines and to observe to what extent the described (best) practice is being followed, or why deviations have occurred. Inability to trace Evolving Best-practices. Best-practices for informal processes may often change due to the changing business conditions. Having previous process information in email and file folders and guidelines in text-based documents does not allow structured comparison and reasonable evaluation to what extent best-practices need to be adapted or if different variations have to be managed for different application contexts.

3 Collaborative Task Manager (CTM)

The Collaborative Task Manager (CTM) is an email-integrated task management tool with extensive support for the definition, adaptation and reuse of weakly-structured process models. All industry partner companies involved in our preliminary studies were using Microsoft Outlook (OL) as a standard email client. To ensure integrated support within the common working environment, CTM is delivered as an OL add-in, additionally exploiting the fact that tasks and email are provided in the same office application. The CTM add-in provides extensions of the OL mail and task items and enables “programming by example” [11] by using web services to track user actions executed on CTM tasks and to replicate data on a central server. The data is held in a database that provides a central tracking repository for all CTM users. Tracking of the email communication for task delegation integrates the individual task hierarchies of different users into overall enterprise process structures emerging on the server. The CTM add-in application provides “Process Info” links on tasks and task-related email messages, which open a web-based client providing overview and navigation in the generated process structures by retrieving data from the server.

3.1 CTM To-Do List

The CTM to-do list is shown in Fig. 1. CTM extends OL tasks with functionality for displaying a hierarchical tree structure. The add-in provides additional toolbars for direct access to the main CTM functionalities. CTM enables insertion and removal of tasks and sub tasks in a task hierarchy in a light-weight manner. Task insertion opens a new OL task dialog where the user works with the familiar OL task fields. Files can be added to CTM tasks as common OL task attachments. An email can be saved as a CTM task, whereby the mail subject, body and attachments are applied to the task.

Fig. 1. CTM to-do list

3.2 Transfer of Tasks and Deliverables

A CTM task is delegated through a preformatted “Request” message. Recipients can “Accept”, “Decline” or “Negotiate” the request. While request/accept/decline are standard actions known also from the exchange of meeting requests in OL, iterative negotiations allow additional clarifications on tasks. The actual discourse takes place in the email text, which is independent of the given message type. This allows open-ended collaboration on tasks and prevents submitting user behavior to strict speech-act rules, which is a known limitation of speech-act adoption [4]. When a request is accepted, and later on completed by a recipient, the latter issues a “Declare Complete” message. Hereupon the requester can respond with an “Approve Completion” or “Decline Completion” message. These additional actions allow negotiation of deliverables before the final completion of a delegated task. To avoid flooding of the OL inbox with task-related messages, a “Move CTMs” button is provided which moves all task-related emails to a special CTM mail folder. All email exchange related to a task is associated with a task dialog and stored on the server. Dialogs can be inspected through a hierarchical process tree-view, where the nodes provide links opening task and email items with text and attachments (Fig. 2). The collaborative functionality in CTM is further supported through a notifications framework which issues notifications throughout the task delegation chain to inform participants in collaborative processes if a related task of another process participant has changed. Stakeholders can accordingly adapt “in situ” to the occurred changes.


Fig. 2. Detailed task dialog overview
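The delegation exchange of Section 3.2 behaves like a small message protocol between requester and recipient. The following Python sketch is a hypothetical illustration of the message types and one possible bookkeeping of task states; it is not taken from the CTM implementation (an Outlook add-in backed by web services), and since CTM deliberately keeps the discourse open-ended rather than enforcing strict speech-act rules, the transition table below is an assumption for illustration only.

```python
from enum import Enum, auto

class Msg(Enum):
    REQUEST = auto()
    ACCEPT = auto()
    DECLINE = auto()
    NEGOTIATE = auto()
    DECLARE_COMPLETE = auto()
    APPROVE_COMPLETION = auto()
    DECLINE_COMPLETION = auto()

class State(Enum):
    REQUESTED = auto()
    NEGOTIATING = auto()
    ACCEPTED = auto()
    DECLINED = auto()
    DECLARED_COMPLETE = auto()
    COMPLETED = auto()

# Allowed transitions for a delegated task. Negotiation can iterate, and a
# declined completion sends the task back to the recipient ("Accepted").
TRANSITIONS = {
    (State.REQUESTED, Msg.ACCEPT): State.ACCEPTED,
    (State.REQUESTED, Msg.DECLINE): State.DECLINED,
    (State.REQUESTED, Msg.NEGOTIATE): State.NEGOTIATING,
    (State.NEGOTIATING, Msg.NEGOTIATE): State.NEGOTIATING,
    (State.NEGOTIATING, Msg.ACCEPT): State.ACCEPTED,
    (State.NEGOTIATING, Msg.DECLINE): State.DECLINED,
    (State.ACCEPTED, Msg.DECLARE_COMPLETE): State.DECLARED_COMPLETE,
    (State.DECLARED_COMPLETE, Msg.APPROVE_COMPLETION): State.COMPLETED,
    (State.DECLARED_COMPLETE, Msg.DECLINE_COMPLETION): State.ACCEPTED,
}

def apply_message(state: State, msg: Msg) -> State:
    """Advance the task state; the free-text email body travels alongside."""
    try:
        return TRANSITIONS[(state, msg)]
    except KeyError:
        raise ValueError(f"{msg.name} is not allowed in state {state.name}")

# Example: request -> negotiate -> accept -> declare complete -> approve.
state = State.REQUESTED
for msg in (Msg.NEGOTIATE, Msg.ACCEPT, Msg.DECLARE_COMPLETE, Msg.APPROVE_COMPLETION):
    state = apply_message(state, msg)
print(state.name)  # COMPLETED
```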

3.3 Process Overview and Navigation

In CTM, process models emerge as examples of the actual process execution and comprise the individual to-do lists of all process participants. These lists are integrated through the tracked task-related email exchange. Thereby overall process models emerge as Task Delegation Graphs (TDG) [19], where the personal task trees of different users are shown in different user containers (Fig. 3). We suggest that this overview provides a highly intuitive process representation and enables end users to more adequately recognize their position and role in the overall enterprise processes, to identify potential bottlenecks and to evaluate work distribution. Currently, due date, status and percent-complete indications are provided. The description link within a task node opens a dialog with the full task (text) description. Task attachments, added in OL tasks, are replicated in a central artifact repository on the CTM server and are accessible in the task instances. Through the “Show Roottasks” button the user can open a list view with all initial process tasks (root tasks) generated on the server throughout the whole enterprise. Within this view the user can navigate through the root task list and open a TDG (process execution example) for a given root task.


Fig. 3. Detailed process overview – Task Delegation Graph (TDG)
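Section 3.3 describes how a Task Delegation Graph emerges from the tracked to-do lists and delegation emails. The following Python sketch is an illustrative reconstruction, not CTM code: it assumes a flat list of tracked task records (with parent links inside a to-do list and delegation links across users) and groups them into per-user containers, mirroring the overview in Fig. 3. All record fields and example data are hypothetical.

```python
from collections import defaultdict

# Hypothetical tracked task records as they might be replicated to the server.
# "parent" links build the personal task hierarchy, "delegated_from" links
# connect a recipient task to the requester task in another user's container.
tasks = [
    {"id": "t1", "owner": "alice", "subject": "Prepare offer", "parent": None, "delegated_from": None},
    {"id": "t2", "owner": "alice", "subject": "Collect prices", "parent": "t1", "delegated_from": None},
    {"id": "t3", "owner": "bob",   "subject": "Collect prices", "parent": None, "delegated_from": "t2"},
    {"id": "t4", "owner": "bob",   "subject": "Ask supplier",   "parent": "t3", "delegated_from": None},
]

def build_tdg(task_list):
    """Group tasks into user containers and collect delegation edges."""
    containers = defaultdict(list)   # owner -> tasks in that user's to-do list
    edges = []                       # (requester task id, recipient task id)
    for t in task_list:
        containers[t["owner"]].append(t)
        if t["delegated_from"]:
            edges.append((t["delegated_from"], t["id"]))
    return containers, edges

containers, edges = build_tdg(tasks)
for owner, owned in containers.items():
    roots = [t["subject"] for t in owned if t["parent"] is None]
    print(f"{owner}: root tasks {roots}")
print("delegation edges:", edges)
```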

3.4 Process Model Adaptation and Reuse Within the presented paper a Task Pattern (TP) [17, 19] is considered as a reusable task structure, comprising one task with its sub task hierarchy and the complete context information of the contained task instances, like e.g. description, used resources, involved persons etc. CTM enables export of a local task from the personal to-do list to a single TP, and export of complete TDG from the server to multiple TPs, which are interlinked through suggestions according to the delegation flow. A TP can be saved in a local or remote Task Pattern Repository (TPR). A local TPR is a XMLbased document [19], whereas remote TPRs reside in database on the CTM server. The exported task structures are managed in the Task Patterns Explorer/Editor (TPE) which is shown in Fig. 4. The TPE provides rich editing and search functionality: cut, copy, paste, insert and remove operations are enabled on task trees and on data in context fields (on the right hand side). TPE enables also search and extraction of TPs from the tracking repository. When editing the provided process execution examples (interlinked TPs) in the TPE “the user is not required to interact in the interface domain of computational abstraction, but works directly with the data that interests him or her” [12]. In this sense CTM enables programming by direct manipulation of the TP fields. The “Name”, ”Description” and “Suggested Execution Time” fields hold simple task context information in text format and are self-explanatory. The “Owner” field recommends expertise, i.e. when a TP is extracted from an executed process the owner is the person in whose to-do list a task was residing. The field “Suggested


Fig. 4. Task Pattern Explorer/Editor (TPE)

The “Suggested Delegates” field contains information about the persons who have the expertise to execute a given task. When a TP is extracted from a collaborative process, task recipients are set in this field. The “Suggested Pattern” field holds a reference to a TP which can be used for the further processing of a task. In the case of TDG extraction, such references in requester tasks point at recipient tasks, which are themselves extracted as separate TPs. The “Artifacts” field holds all task attachments. Adding artifacts to a task replicates them to the artifact repository.

Studies on ad-hoc processes report that “Employees often do not accept a strict sequencing of those tasks which they have to execute themselves, because this causes a limitation of their flexibility” [7]. Our preliminary studies confirm that statement and the necessity to minimize sequencing of activities where possible. Therefore TPs do not incorporate a declaration of explicit temporal relationships known from formal task modeling approaches [9, 15, 20] and workflow modeling notations [14]. TPs provide structured process execution examples, where the default assumption is to execute tasks along the provided task hierarchy in a top-down manner. Actual temporal relationships between tasks can only be observed through the task statuses (e.g., “Waiting for someone else”, “In Progress”) provided in the TDG in the web client during a concrete process execution (see Fig. 3).


TPs can be reused through an “Apply Pattern” operation, available on tasks in the CTM to-do list. It opens the TPE, where the user can browse through different TPRs and search for tasks on the server based on different criteria (owner, subject, description, etc.). Tasks from remote TPRs can be opened in the TPE, whereas TDGs and dialogs of tracked tasks can additionally be viewed in the web client so that the user can estimate the task applicability to their current situation. No proactive information delivery on tasks [8] is currently provided. We have considered that many users approach their colleagues for help prior to looking for a solution in the available software infrastructure (see also [16]). Therefore TPs can be exchanged through a “Send To” function in the TPE and as attachments in task requests. The application of a TP reactivates the process example by generating the complete task hierarchy and filling all pre-modeled structure and content information into the to-do list. If during execution a user initiates a delegation, available delegates are suggested automatically. A user can change the anticipated (example) flow by entering different recipients. Suggested TP references are also available on tasks. A suggestion, stored as a reference to a recipient task in the original process execution, may be used by the person activating the TP to accomplish the task themselves without further delegations. If, on the other hand, a delegation is issued, the recipient task contains the reference and the recipient(s) can still refer to the suggested TP to possibly adapt and reuse it. To allow this, application of a TP from a local TPR enables iterative replication of all referenced TPs from the local TPR to a default remote, user-specific repository, where these are accessible by all users.

3.5 Task Pattern Evolution

Best-practice deviations may occur due to changing business conditions and different problem-solving strategies of end users. CTM provides functionality to trace such deviations through task instance-based ancestor/descendant relationships [19]. Such relationships are set, e.g., on copy/paste of a (sub) task hierarchy in the TPE: iteratively, each task in the resulting hierarchy receives an ancestor reference to the corresponding task in the original hierarchy. When a TP is exported from an executed process and saved to a remote TPR, all resulting tasks receive ancestor references to the corresponding original tasks in the tracking repository. If a remote TP is applied, the resulting tracked tasks receive ancestor references to the corresponding tasks of the remote TP. If a TP is exported from an executed process to a local TPR, the resulting tasks preserve the information (ids) of the tracked tasks. When a local TP is applied, the resulting tasks receive ancestor references to the originating tasks in the tracking repository. Evolutions can be viewed in the Task Evolution Explorer (TEE) shown in Fig. 5. The “introduce consignment” task of user Y (selected node) originates from a tracked ancestor task with the same name, which was executed by user X (root node). The latter task also has another descendant, resulting from its reuse by user W (task at the bottom).
User Y has saved a global TP from his execution to a remote TPR (expanded node with black descendant icon under selected node), which was reused in two further executions, the one of which resulted in a second global TP version. The TDG and dialogs of tracked ancestor/descendant tasks can be shown through the “View in Repository” button for case analysis.


Fig. 5. Task Evolution Explorer (TEE)
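To make the reuse and evolution mechanisms of Sections 3.4 and 3.5 concrete, the following Java sketch shows one possible shape of a Task Pattern with its context fields and an ancestor reference. It is an illustration only: the field names follow the description above but are not the actual CTM schema, and the example values are invented.

// Illustrative Task Pattern (TP) structure with the context fields described in the text
// and a task instance-based ancestor reference for tracing evolution. Names are hypothetical.
import java.util.List;

public class TaskPatternSketch {
    record TaskPattern(
            String name,
            String description,
            String suggestedExecutionTime,
            String owner,                     // recommends expertise (to-do list the task resided in)
            List<String> suggestedDelegates,  // persons with the expertise to execute the task
            String suggestedPatternRef,       // reference to a TP for further processing of the task
            List<String> artifacts,           // attachments replicated to the artifact repository
            List<TaskPattern> subTasks,
            String ancestorRef                // reference to the originating task instance
    ) {}

    public static void main(String[] args) {
        TaskPattern check = new TaskPattern("Check credibility", "Run a customer credit check", "2 days",
                "SL1", List.of("SL2"), null, List.of("credit-checklist.xls"), List.of(), null);
        TaskPattern root = new TaskPattern("introduce consignment", "Handle a special sale", "2 weeks",
                "CSO", List.of("SL1", "SL2"), null, List.of(), List.of(check), "tracked-task-42");
        System.out.println(root.name() + " with " + root.subTasks().size() + " sub task(s)");
    }
}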

4 CTM Evaluation

The CTM evaluation was conducted at the textile production company (cf. section 2) and involved 6 users, selected for having related, collaborative tasks:
• Chief Officer Assistant (COA): serves as a single point of contact to the chief officer (forwards accept/reject of contract proposals); coordinates all departments.
• Chief Sales Officer (CSO): manages the sales department; responsible, e.g., for processing of special sales (consignment), credit approval and budget planning.
• Sales Employees (SL1 & SL2): process sales orders, make credibility checks, participate in price definition processes, assist the CSO.
• IT Department Lead (ITL): coordinates activities of the IT department; decides about acquisition of new software and hardware; manages adaptations and extensions to existing systems.
• IT Employee (ITE): installs soft- and hardware; executes business process-related transactions in internal systems; maintains documentation about executed transactions; provides guidelines for transaction execution.

4.1 Setting and Extent of Use

The evaluation was initiated with a workshop in which we gave a 1-hour presentation on CTM, followed by 30 minutes of individual training of each user in the basic functionalities. Detailed CTM user guides were provided to all participants. After several days we visited the users individually to check how they were working with the tool and to provide further instructions. The evaluation concluded with a short video recording and transcription of the tool use, followed by a structured debriefing interview, in which we asked each participant to assess the basic features and to rate to what extent CTM improved their ability to manage tasks in ad-hoc processes, using Likert scales and freeform explanations.


The CTM trial was planned initially for 4 weeks. However, the installation of the tool required network adaptations as well as OL configuration changes. Therefore only a 2-week trial was possible. Problems with character encoding schemes suspended the CTM usage by the COA for a further week.

4.2 Findings

Despite the initial technical difficulties and usability issues mentioned in the following, end users found the concepts behind CTM compelling and clearly identified the high potential to structure and optimize their activities with the tool - the average overall approval rating for CTM was 4.29 (on a Likert scale of 1: Hate it, to 5: Love it). A summary of the observations follows:

Missing Initial Process Context. Some users suggested that root tasks should be created by senior employees, who actually trigger processes: “I do not initiate processes, I actually execute on them. […] I always expected to get a task request from somebody [COA, CSO] who would create a root task and distribute the sub tasks. I then would receive a task, break it down and distribute the resulting tasks to the others [Sales]” (ITE). Due to the encoding problems in the to-do list of the COA, the latter did not send requests for a week after ITE had started using CTM. This also affected the number of tasks ITE acted on. Similarly, SL1 had created a root task for a task description which had been sent by the CSO via email some time before but had not been acted upon before the CTM installation. No root tasks were created for ongoing activities in which users were engaged before the CTM installation. This reveals that process modeling can be triggered along the organizational hierarchy, where senior employees can drive a top-down implementation of the “Process of Me” [6].

Transparency. The ability to represent artifacts in process steps was considered crucial. We found that different artifact versions were attached to consecutive tasks in a process flow, which revealed how artifacts are elaborated within a process. For example, an empty, preformatted MS Excel table was attached in a request issued from the CSO to SL2, and a filled MS Excel table was available in the resulting SL2 recipient task, which was elaborated to 75%. Further, users highly approved the status information and notifications on task changes, as they saw in them the potential to reduce the overhead of calling colleagues and writing emails with task status enquiries: “Such processes [price definition] draw like a red thread through the whole company. I certainly want to know how far things have gone. […] It is annoying when you do not get feedback on requested actions. This [CTM process overview] will save me the effort to constantly call people or write mails to ask about the status of things” (SL1). Generally, employees with managerial functions had greater interest in the overview functionality than others. SL2 for example stated that seeing what others do might not be of interest to him as it might concern activities outside of his expertise scope. COA, CSO, SL1 (who had more senior functions) and ITL clearly wanted an overview.


As CTM was used only by a small group of people, privacy issues were not raised during the trial. However, ITL stated that authorization has to be considered for extended CTM use in the enterprise, by providing the possibility to hide certain process fragments in black-box containers in the web process overview. SL1 further demanded extensions in the notifications handling and suggested, e.g., having notifications on each change in a delegated task and its sub tasks – structural or context change. Notifications for overdue delegated tasks were also requested. As a further extension, users suggested aggregating the percent complete of sub tasks into the percentage of the parent task.

Structured Storage and Retrieval of Process Knowledge. Users generally reported that creating a task in the CTM to-do list does not impede their current work practice compared, e.g., to dealing with email: “A task is a task - I clearly know that I should act on it. […] Putting it in the CTM task list does not bother me. I need to think how it should be handled anyway. If I can explicitly write that down, this only helps me to clearly structure my thoughts before executing and reduces the chance to miss something” (SL2). ITE further reported that sometimes the CSO asks him to execute transactions which he is normally not allowed to. Before the CTM installation, ITL would preserve the emails requesting those transactions for responsibility tracking. Receiving a CTM task for such transactions reflected this “opportunistic” behavior in the generated process example (TDG) on the server and hence in the emerging process model. Despite the clear benefits from CTM usage for visibility on time-critical activities, users stated that email cannot be replaced fully by CTM tasks. Informal enquiries outside of a concrete process would still be done over email. Although only a few TPs were extracted – 2 in the IT department (1 in a remote TPR and 1 in a local TPR) and 3 in sales (1 in a remote TPR and 2 in local TPRs) – structuring process knowledge in a way that it could be reused was stated as a clear benefit. However, we clearly perceived that users were uncertain about the reuse potential of TPs and the way these should be distributed to others. The overall attitude was that global TPs should be delivered by a (senior) domain expert, who can also handle the responsibility for providing them. The CSO, e.g., experimented and developed a TP in a remote TPR instead of writing a text-based guideline. SL2, on the other hand, refrained from submitting a TP to a remote TPR, stating that he could send the local TP to a colleague personally upon request and, furthermore, that he “silently agrees” for other colleagues to take and adapt his implicitly generated task example from the tracking repository on their own responsibility. Some of the users proposed that the collaborative flow on tasks should be structured better to facilitate the handling of CTM emails for task delegation. The “Move CTMs” functionality (cf. section 3.2) was not accepted well - users preferred to get CTM request messages in a dedicated “CTM Mail/Requests” email folder and responses in a “Responses” folder.

Exchange of Process Knowledge. Having an example of how a problem should be approached was appreciated by all users: “Basically I have to achieve certain output for the tasks I receive [from CSO]. I really appreciate to know how she would break down the task and what the different facets in the task are. This helps me to stay on the right track and to know what is expected of me” (SL2).


However, we actually observed that the CSO would send a single task with a generic description, e.g. “prepare contracts for customers C1, C2, and C3”, and SL2 would then break it down, creating a task for each customer. In this way, tasks disperse and are refined as they pass down the organizational hierarchy. This reveals that “seeding, evolutionary growth, and reseeding (SER)” [5] towards complementing abstract process definitions can happen during task execution and iterative reuse of process examples in organizations. Domain experts, e.g. ITL, on the other hand did not think that they would benefit much from external knowledge. ITL however appreciated being able to distribute knowledge himself, i.e. as a TP on a remote TPR, to avoid repeated inquiries from other employees on the same topics.

Connecting Best-practices and Running Processes. The users considered that comparison of TPs and the running tasks resulting from their application might not scale for large processes. Best-practices were generally desired as higher-level process descriptions, while running processes could produce multiple fine-grained tasks: “As far as I am concerned a TP will contain only top-level tasks as my employees always do things differently. This doesn’t bother me if the results are delivered on time. […] It is good to have a guideline, even if you do not care how the described tasks are accomplished concretely” (CSO). The overview provided in the TEE was not considered intuitive. Differences in task structures could be identified through additional effort, which would bring benefit only to managerial employees. Users suggested enabling task comparison in a “swimming lane” overview, where the corresponding top-level tasks can be put against each other. This would enable users to better see the corresponding and missing process facets, by possibly discarding low-level tasks. For the latter, filtering based on different criteria, e.g. “Task Level” and “Owner”, was suggested.

Tracing of Evolving Best-practices. Despite the deficiencies in the TEE usability, the functionality that it provided was considered necessary by senior employees due to the frequent changes in informal process recommendations. Tracing of such changes could help to at least undo wrong strategies: “We often change processes to check if we can achieve better results. We check e.g. for the processing of these contracts we needed that much time, while we have planned that much. […] If we see that a change does not deliver better results, we switch back to our previous practice. […] An overview and comparison of the tasks for both practices in CTM is nice to have” (SL1). In this respect, the provided structural overview was still insufficient, as users also cared about certain performance indicators. They proposed that the comparison of task hierarchies in the TEE should be enabled based on specific criteria, e.g. execution time or persons involved. It was further suggested that, in addition to the ancestor/descendant relationships, versioning of TPs should also be supported.

5 Conclusions and Future Work

The presented paper describes an integrated approach, leveraging user experience with email and to-do lists and ensuring a “gentle slope of complexity” for process tailoring by end users. It delivers a valuable extension to known evolutionary workflow approaches by enabling “programming by example” of decentralized-emerging, weakly-structured process models by both users, who execute processes, and domain experts, who explicitly adapt captured process examples.


Thereby, SER of weakly-structured process models is enabled through the top-down implementation of the “Process of Me”, where: (i) generic tasks are refined during execution; (ii) users can adapt reusable process fragments (TPs) through direct manipulation of the execution data (delegations, artifacts, suggested TPs). Opportunistic and emergent changes are supported during runtime and design time. CTM captures conversational (email) and control (task) flows. Unlike known email-based workflows, CTM provides the ability to decouple process fragments (interlinked TPs) of different granularity from process runtime representations and to make them available for SER by managing task instance-based ancestor/descendant relationships, allowing navigation to the original or to similar execution contexts and inspection of task-related dialog flows. The CTM evaluation delivered user-proposed extensions which will be addressed in further prototype implementations. Long-term evaluation in the partner companies is under negotiation and will allow the generation of larger tracking and TP repositories, their quantitative evaluation, as well as scalability assessments. Further research will aim at the translation of user-defined into formal process models towards the automation of rigidly recurring processes.

Acknowledgements. The work this paper is based on was financially supported by the German Federal Ministry of Education and Research (project EUDISMES, number 01 IS E03 C). We thank all participants in the user studies for their time and cooperation.

References
1. Agostini, A., De Michelis, G.: Rethinking CSCW systems: the architecture of Milano. In: ECSCW 1997, pp. 33–48. Springer, Heidelberg (1997)
2. Bellotti, V., Dalal, B., Good, N., Flynn, P., Bobrow, D.G., Ducheneaut, N.: What a To-Do: Studies of Task Management towards the Design of a Personal Task List Manager. In: CHI 2004, pp. 735–742. ACM Press, New York (2004)
3. Bellotti, V., Ducheneaut, N., Howard, M., Smith, I., Grinter, R.: Quality Versus Quantity: E-Mail-Centric Task Management and Its Relation With Overload, vol. 20, pp. 89–138. Lawrence Erlbaum Associates, Mahwah (2005)
4. Button, G.: What’s Wrong With Speech-Act Theory. Computer Supported Cooperative Work 3(1), 39–42 (1994)
5. Fischer, G., Giaccardi, E., Ye, Y., Sutcliffe, A., Mehanjiev, N.: Meta-Design: A Manifesto for End-User Development. Communications of the ACM 47(9) (September 2004)
6. Gartner Research: Person-to-Process Interaction Emerges as the ‘Process of Me’. Gartner Inc. (2006)
7. Herrmann, T.: Evolving Workflows by User-driven Coordination. In: Reichwald, R., Schlichter, J. (eds.) Tagungsband D-CSCW 2000, pp. 103–114. Teubner (2000)
8. Holz, H., Rostanin, O., Dengel, A., Suzuki, T., Maeda, K., Kanasaki, K.: Task-based process know-how reuse and proactive information delivery in TaskNavigator. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pp. 522–531. ACM Press, New York (2006)


9. John, B., Kieras, D.: The GOMS family of analysis techniques: Comparison and contrast. ACM Transactions on Computer-Human Interaction 3(4), 320–351 (1996)
10. Jorgensen, H.D.: Interactive Process Models. Ph.D. Thesis, Norwegian University of Science and Technology, Trondheim, Norway (2004)
11. Lieberman, H.: Your Wish is My Command: Programming by Example. Morgan Kaufmann, San Francisco (2001)
12. Lieberman, H., Paterno, F., Wulf, V.: End-User Development. Springer, Heidelberg (2006)
13. MacLean, A., Carter, K., Lövstrand, L., Moran, T.: User-tailorable systems: pressing the issues with buttons. In: Proc. CHI 1990, pp. 175–182. ACM Press, New York (1990)
14. Object Management Group: BPMN, http://www.bpmn.org/
15. Paterno, F., Mancini, S., Meniconi, S.: ConcurTaskTrees: a diagrammatic notation for specifying Task Models. In: Proceedings Interact 1997, pp. 362–369. Chapman & Hall, Boca Raton (1997)
16. Ribak, A., Jacovi, M., Soroka, V.: Ask Before You Search: Peer Support and Community Building with ReachOut. In: Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work, pp. 126–135. ACM Press, New York (2002)
17. Riss, U., Rickayzen, A., Maus, H., van der Aalst, W.: Challenges for Business Process and Task Management. Journal of Universal Knowledge Management (2), 77–100 (2005)
18. Schwarz, S., Abecker, A., Maus, H., Sintek, M.: Anforderungen an die Workflow-Unterstützung für wissensintensive Geschäftsprozesse. In: WM 2001, 1st Conference for Professional Knowledge Management, Baden-Baden, Germany (2001)
19. Stoitsev, T., Scheidl, S., Spahn, M.: A Framework for Light-Weight Composition and Management of Ad-Hoc Business Processes. In: Winckler, M., Johnson, H., Palanque, P. (eds.) TAMODIA 2007. LNCS, vol. 4849, pp. 213–226. Springer, Heidelberg (2007)
20. Veer, G., Lenting, B.v.d., Bergevoet, B.: GTA: Groupware task analysis – modeling complexity. Acta Psychologica 91, 297–322 (1996)
21. Wiig, K.M.: People-focused knowledge management: How effective decision making leads to corporate success. Elsevier Butterworth–Heinemann (2004)
22. Wulf, V., Jarke, M.: The Economics of End-User Development. Communications of the ACM 47(9) (September 2004)

Enhancing User Experience on the Web via Microformats-Based Recommendations

Anca-Paula Luca and Sabin C. Buraga

Faculty of Computer Science, A. I. Cuza University of Iasi, Berthelot 16, Iasi, Romania
{lucaa,busaco}@info.uaic.ro

Abstract. The multiple ways in which we rely on the information available on the web to solve increasingly more tasks encountered in everyday life have led to the question of whether machines can assist us in parsing the amounts of data and bringing the interesting closer to us. This type of activity, most frequently, requires machines to understand human-defined semantics which, fortunately, can easily be done on the present web through semantic markup. Our purpose is to develop a flexible user agent that understands the behavior of a user on the web and – on the basis of microformats – filters out the irrelevant data, presenting to the user only the information she is most interested in, while being as discreet as possible: no preference settings and no explicit feedback are required from the user.

Keywords: Semantic markup, microformats, recommender system, prediction, web interaction, user agent.

1 Introduction

Navigating the web each day, accessing numerous websites containing information from various domains can sometimes be overwhelming, making the obvious or the interesting hard to get, due to the huge amount of useless data surrounding it. The explosion of the amount of information on the web over the past few years has satisfied our need for information, but has also invaded our web lives with considerable amounts of unnecessary data we must surf through in order to get to the things we really want.

In the current stage of the web, some patterns in published information and users’ requests have emerged: there are numerous sites for blogs, social groups, collaborative bookmarks, collaborative knowledge, product/company presentations, news portals – all aligned to the social web, the so-called “Web 2.0” [13]. There are also a lot of users taking advantage of this information: relying on the web for communicating with friends and family, researching different topics, staying up-to-date with the most recent news and events. In this context, semantics tend to repeat on the web. It is either the semantics of the published information – a publisher’s point of view – or the semantics of the needed information – a user’s point of view. The patterns are reflected in the markup of the information as well: resembling elements, attribute names and values, and imbrications can be noticed [5].


In the current circumstances, the microformats initiative [17] aims to specify a frame for this kind of patterns: standards for publishing information with these most frequent semantics (thus semantically marking the data), pushing the web to a new stage where information is equally accessible to humans and machines. With a growing number of websites adapting their markup to follow these standards and produce semantic markup, the idea of a tool that comprehends and simulates a user’s behavior on the web becomes a need, and it gets consistent and closer to implementation. Such an instrument can increase the efficiency of a user’s sessions on the Internet, measured as a proportion of the information gained per time spent on the web.

The existing tools focus either on microformat processing – detection, presentation and storage – without trying to assimilate them from a semantic point of view, or on the detection of semantics using the “classical” Semantic Web methods such as RDF description of metadata and/or ontology specifications [2, 4], or on parsing hypertext as ordinary text (web scraping), using standard text classification methods. Most of the tools also require configuration, training data or explicit feedback from the user. The innovation of our approach lies in using microformats as semantic sources in the task of “understanding” the web – for the arguments sustaining this decision, see also [5]. The purpose is to achieve our goal without human effort: the user is not requested to change her navigation behavior to adapt to the new tool or to provide it with training data.

This paper is structured as follows: first, in Section 2, we present microformats and describe their possible usage in this context. In the next section, we detail the model used for data and the recommending system [1] that constitute the foundation of the project. Section 4 focuses on the presentation of the application to the user: the user interface – design and, most importantly, interaction. After enumerating different related approaches, the paper ends with an outline of the discussed topics and presents further research ideas.

2 Microformats

According to [17], the microformats definition is: Designed for humans first and machines second, microformats are a set of simple, open data formats built upon existing and widely adopted standards. A more accessible definition is the following: Microformats are simple conventions for embedding semantics in HTML to enable decentralized development. Even more precisely, microformats are conventions for XHTML (Extensible HyperText Markup Language) element names, attribute names and associated values, with precise semantics – see also [3] and [10].

2.1 Important Features

The key principles in designing microformats are simplicity – they are designed to solve a specific problem – and loose connectivity – they represent small pieces loosely joined together to form larger blocks and to express increasingly complex semantics, without decreasing their semantic expressivity through connections.


Microformats achieve their goal either by adding to the (X)HTML markup – these are called elemental microformats – or by specifying a set of attribute values for existing XHTML elements and imbrications of such elements to be the frame for a piece of content – in this latter case, they are called compound microformats. Certain microformats are definitively specified and can be considered standard, while others are works in progress. Regardless of this, microformats are widely spread – either explicitly or through the semantics of content and similar structure of markup, with the possibility of actually being made explicit.

2.2 Representative Microformats

The list of the current official microformats is: hCalendar, hCard, rel-license, rel-nofollow, rel-tag, VoteLinks, XFN, XMDP, XOXO, adr, geo, hAtom, hResume, hReview, rel-directory, rel-enclosure, rel-home, rel-payment, Robots Exclusion, xFolk. The microformats useful for a navigation assistant are the ones that encapsulate the content as well as properties of the specific content:
• rel-tag specifies that the current page or a portion of it is marked with a tag. The tag for a piece of content is, usually, a single word that expresses a keyword for the content, or the topic of the content. It is a frequent practice to use multiple tags for a piece of content.
• geo allows the description of a location using geographic coordinates (latitude and longitude). This microformat can be embedded into other microformats such as hCard or hCalendar, to mark the location of an entity or an event.
• adr specifies an address, properly marked with fields for country, city, street and so on. This microformat is also embeddable into other microformats such as hCard or hCalendar, either joined or not by a geo microformat.
• hCard denotes a full description of an entity: a person (most often), an organization, a company, etc. It specifies fields for the name of the entity, the nickname, an address, a website and other information.
• hCalendar encapsulates a calendar entry (an event): date, description, address, etc.
• hReview is defined to be used in publishing reviews for different items. It contains fields for title, description, hCard of reviewer, hCard of reviewed, date of the review, etc.
• hAtom mirrors the Atom syndication method, enabling the embedding of an Atom feed in (X)HTML.

2.3 Examples

The following is an example of using hCalendar to mark the ICEIS 2008 conference:

12 and 16 June 2008 in Barcelona, Spain

Note that it is easy to extract the information of the event by both a human user (who will actually see the result of the browser processing of the HTML) and an automated tool (that can process markup to extract the URL of the event, a description, a summary and the dates for the event). Also, we present a new proposed microformat to be used to denote georeferencing. With the help of microformats, the information about certain geographic locations can be easily embedded into Web pages. In [8] we describe a straightforward and open hLocation microformat, for representing georeferences, for example: museums, libraries, offices, banks, home addresses, etc. that can appear in any Web content. We illustrate the use of hLocation microformat in the following context: a group of people wants to attend an art symposium in Atlan City, a location in Atlantis. This scientific event is located in the Hall of Mirrors, near the Museum of Arts.

Call for Participation: Symposium on Artistic Developments at Hall of Mirrors, in Atlan City, Atlantis, October 11, 2008. The symposium will take place near Museum of Arts.

Our solution provides a more detailed description of several attributes regarding locations (such as addressing, related locations in fuzzy terms, images) which can not easily be expressed by other microformats. Additionally, we use the most useful constructs provided by the traditional microformats.


3 Proposed Recommending System

In this section, we will present the models used for determining the users’ preferences and the prediction algorithms. We start with a presentation of the collected data and the associated data model, and then we describe the definition the program associates with the notion of preferred content and the methods used to determine which portion of content is of interest and which is not.

3.1 Data Model

Although there are multiple sources of content information in a web page, our purpose is to build a tool that will only use content semantics provided by microformat markup. Because of this, the following discussion will refer to microformatted web pages – documents that have various microformats associated with them. A web page is assumed to be composed of one or more blocks of data: pieces of content that each belong to certain categories of topics (one topic or more) and, most importantly, that are separable from the rest of the blocks – a program can identify and extract such a block from the web page. Consider, for example, a web page that contains news where each piece of news is properly marked (by using hAtom or hCalendar) – see also Figure 1. All the pieces of news represent blocks of data in the model described above.
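In implementation terms, such a block can be represented simply as a typed bag of attribute values. The following Java sketch is only an illustration of this data model; the class name, the attribute names and the example values are assumptions and do not come from the actual tool.

// Illustrative sketch of the "block of data" unit used by the recommender:
// a block has a type (the encapsulating microformat) and a set of attribute values.
import java.util.Map;
import java.util.Set;

public class BlockSketch {
    record Block(String type, Map<String, Set<String>> attributeValues) {}

    public static void main(String[] args) {
        Block news = new Block("hAtom",
                Map.of("rel-tag", Set.of("enterprise", "bpm"),
                       "author",  Set.of("Jane Doe")));
        System.out.println(news.type() + " block with " + news.attributeValues().size() + " attributes");
    }
}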

Fig. 1. Detecting the blocks of data


Such a block of content is considered to be the unit of content: from a web page, the algorithm will recommend one or more such blocks of content – in the presented example it seems natural to recommend a piece of news as the unit of information. Naturally, a block will be considered to be the piece of content encapsulated by a microformat – e.g., hCalendar, hAtom, hReview, hCard, and hLocation. Given a web page, the flow of the recommending program is the following: first, the blocks are identified in the web page. Then, for each block, a preference score is computed (the algorithms are described in Section 3.2) and the recommendation algorithm is updated to take the new blocks into account as well.

3.2 Algorithms

Certain assumptions have been used in order to build the preference model – partially elaborated from [6]:
• The entire semantic markup is semantically correct: any value assigned, through markup, to an attribute of the content is the real value of that attribute for the specific content. This assumption might not always hold (it depends on the quality of the markup), but there is no general way to detect or correct such errors in an automated manner.
• In the context of liking or disliking content, a user operates with blocks, as defined in the subsection above. The validity of this assumption is highly correlated with the way the blocks are built, which in turn also depends on the quality of the markup.
• The user’s preferences are expressed in terms of attribute values rather than items, and the user tends to attach greater importance to some attributes.
• A user accesses items she likes more often and ignores items she does not like.
In consequence, a solution inspired by machine learning methods [12] is proposed. We will estimate the degree of interest in a given item by combining two components: the preference for an attribute and the preference for the particular value of that attribute, by the following formula:

interest(item) = \sum_{i=1}^{n} w_{p_i}^{type(item)} \cdot preference(p_i, p_i(item))

where n denotes the number of all possible properties (considered in an ordered list), p_i is the i-th attribute, p_i(item) represents the projection of the item on attribute p_i (the value of the item for the property p_i), w_{p_i}^{type(item)} is the weight associated with attribute p_i for the type of this item, and preference is a function that estimates the interest of the user in a value of a property. The key components of the estimation are these last two values, and they are approximated using the following methods:
1. For each property of each type of block, the associated weight is computed as follows: initially, all weights are set equal for all attributes, and then they are adjusted using the assumption that the more an attribute is present in the liked items, the more the user “trusts” that attribute (the user looks for the items that specify values for the properties she trusts more); therefore its weight grows. Note that the weights can be different for different types of blocks.


2. For a given value of a given attribute, the user preference for that value is computed as the probability that the value is among the values of that attribute for all preferred items, which is estimated by the ratio between the number of preferred items that contain that value and the total number of preferred items.

For the computation of this last value, we have opted for a memory-based approach: items are stored and the preference is computed when a new estimation is required – instead of transforming data into hypotheses of preference and then testing the items against them – for flexibility and data reusability reasons. This algorithm is able to handle a set of issues that might appear in the recommendation process:
• It can efficiently learn the preferences incrementally, without the necessity of providing “training data”. Of course, the results in the first stages will be poor but, as data is collected, they will improve;
• The assumption made about the user behavior regarding liked and disliked items allows the algorithm to work even if no explicit feedback is provided by the user. As we have mentioned before, this is a key feature of the designed tool;
• By simply setting an expiry date for the stored items – after a period of time they no longer influence the recommending process – the algorithm can deal with dynamic preferences: a user’s preferences can change and the agent must change its recommendations accordingly.

3.3 User Interaction

One of the first requirements regarding this aspect is that the application must present its suggestions in real time (when a new web page is accessed) and must not do so in an imposing manner: the agent only suggests the content to be accessed, it does not dictate it. Related to these aspects, we will discuss two options for the application: a standalone application and a browser extension.

The interaction model for this application is the observable-observer one: first the agent (the observer) must be notified that the user (the observable) has accessed a new web page, and then the user (the observer this time) has to be notified about the agent’s recommendations (the observable). Two types of observing can be used: push – the observable notifies the observer when the state changes – and pull – the observer reads the observable’s state from time to time. We can assume that, in both cases, a push paradigm can be used for obtaining the current web page (the user does not have to notify the agent that a new page was accessed), and this is the method to be used in any of the cases since the agent should not require pointless user assistance.

As a standalone application, the agent exists outside the navigation process, in an application external to the browser. If the agent is built as a push observable, then it will have to notify the user, from time to time, that its state has changed – using a dialog window or any other method – which can disturb the user in her navigation activity. If the agent is designed using a pull paradigm, thus eliminating the disturbing factor, then the user would be required to switch from one application to another to check the state of the agent, which would cause an overhead big enough to eliminate this option.


The most important note to be made in the case of a browser extension is that the agent can act as a component of the navigation process itself, transparent to the user, whose behavior is not required to change in order to manage the new application she is using. Clearly, the best option is the implementation as a browser extension, but there are a few other choices to be made regarding the way the application presents its results to the user. We have chosen to highlight the interesting blocks (using background colors, tool-tips and combinations of the two techniques) in the web page, in the exact place where they appear, for a number of reasons:
• The model of this interaction corresponds to a push-pull combination: the agent is a push observable that allows a pull behavior from the user: it will alter the web page highlighting the interesting items, but will still provide the whole page to the user so she can access any piece of content she wishes to;
• It does not use page space as a side panel – frequently used by many applications – would;
• It does not require a change of focus from the page itself to another area of the screen;
• The recommended items can be observed as the page is scrolled down; it does not require permanently referring to a recommendations list;
• Though somewhat discreet, the results returned by the application still stand out from the rest of the web page, thus allowing one to easily notice them if this is the desired goal.
Optionally, the recommendations list can also be displayed in a side bar, for faster access of the user to the recommendations and greater freedom in choosing the manner in which she browses the recommendations: either as a pull or as a push observer.

4 Aspects regarding the Implementation

We will describe our approach to design and implement a “smart” navigation assistant, using the new web technologies and aiming at the largest possible public. The goals are platform independence and the use of widely spread standards. Two main modules of the application can be distinguished: a module that interacts with the navigation process (data collecting and results display) and the module that encapsulates the recommendation engine (storage and prediction). The implementation of the two as loosely connected components is a key feature, enabling independent testing and improvement, and also facilitating reusability in similar contexts. Loose connectivity has been achieved in this situation through a moderator component – a “data manager” – that enables bidirectional communication between the two modules. An illustration of the architecture presented above is depicted in Figure 2.

4.1 Data Collecting and Display Module

This component must be analyzed in the context of the extension development interface available from the web browser. As a platform at this level, we have chosen the Firefox web browser, whose advantages are in the direction of wide platform availability, extended support for current web standards, and the encouragement of extension development through the clear structure imposed for the sub-application, the large development community and documentation, and the possibilities of communicating with external modules: native libraries, Java programs, etc.


Fig. 2. Modules and data flow

The implementation of the collecting and display module takes advantage of the browser-embedded approach. Data collecting is done using the web documents downloaded by the browser; no new Internet connection and no distinct request to the server are required for this activity. Extracting the data from HTML documents takes place at the level of the DOM (Document Object Model) tree, built by the browser for its internal use but made available to extension developers. Displaying the results takes place in the same context of preprocessed data, thus decreasing the running time and the complexity of the programs.

We can now summarize the activity of the implemented agent as follows: when a new web page is accessed, the DOM tree is processed in search of microformatted content and a list of blocks is built. This list is then sent to the data storage and prediction module. When the response from this module arrives – which represents a list of preference scores associated with the blocks – the display module is called to modify the properties of the DOM tree associated with the page, in order to highlight the recommended blocks.
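A minimal sketch of the “data manager” role described above is given below, in Java; the interfaces and names are assumptions used only to illustrate the moderator between the two modules, not the actual module APIs.

// Illustrative sketch of the "data manager" moderator: it receives the blocks found
// in a page, asks the recommender for scores, stores the blocks, and returns the
// scores for highlighting. All interfaces and names are hypothetical.
import java.util.List;

public class DataManagerSketch {
    interface Recommender { List<Double> score(List<String> blocks); }
    interface BlockStore  { void save(List<String> blocks); }

    static class DataManager {
        private final Recommender recommender;
        private final BlockStore store;
        DataManager(Recommender r, BlockStore s) { recommender = r; store = s; }

        /** Called by the collecting module whenever a new page has been processed. */
        List<Double> onPageBlocks(List<String> blocks) {
            List<Double> scores = recommender.score(blocks); // preference estimation (Section 3.2)
            store.save(blocks);                              // update the model with the new blocks
            return scores;                                   // sent back for highlighting
        }
    }

    public static void main(String[] args) {
        DataManager dm = new DataManager(b -> b.stream().map(x -> 0.5).toList(), b -> {});
        System.out.println(dm.onPageBlocks(List.of("hReview block", "hCard block")));
    }
}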


4.2 Data Storage and Prediction Module

The data storage module relies on the physical storage engine, which is, in our case, a native XML database engine, chosen for its management of concurrent access, its advantages in storing unstructured data and its programming interface based on current XML technologies. Our choice was Berkeley DB XML, a fully embeddable database engine, available on multiple platforms, which provides programming interfaces in a large variety of languages. The prediction sub-module implements the algorithm presented in Section 3.2 as a standalone module – the one most open to future improvement – in a platform-independent language to increase portability. A solution for this implementation is the Java language, which offers platform independence, compatibility with the XML document management libraries (the native XML database, the Saxon XML processing library), and moderate requirements for installed interpreters on the client machine.
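As an illustration of what such a prediction sub-module computes, the following Java sketch implements the interest estimation of Section 3.2 under its stated assumptions (per-type attribute weights, value preference as the fraction of stored preferred items containing the value). It is a simplified sketch, not the module’s actual code; all names and the default weight of 1.0 are assumptions.

// Minimal sketch of the interest estimation from Section 3.2:
// interest(item) = sum over attributes of weight(type, attribute) * preference(attribute, values),
// where the preference for a value is the fraction of stored preferred items containing it.
import java.util.*;

public class InterestSketch {
    // memory of preferred items: for each attribute, the values seen in liked items
    private final List<Map<String, Set<String>>> preferredItems = new ArrayList<>();
    // per item type, a weight for each attribute
    private final Map<String, Map<String, Double>> weights = new HashMap<>();

    void addPreferredItem(Map<String, Set<String>> item) { preferredItems.add(item); }

    double preference(String attribute, Set<String> values) {
        if (preferredItems.isEmpty()) return 0.0;
        long matches = preferredItems.stream()
                .filter(it -> it.getOrDefault(attribute, Set.of()).stream().anyMatch(values::contains))
                .count();
        return (double) matches / preferredItems.size();
    }

    double interest(String type, Map<String, Set<String>> item) {
        Map<String, Double> w = weights.getOrDefault(type, Map.of());
        double sum = 0.0;
        for (Map.Entry<String, Set<String>> e : item.entrySet()) {
            sum += w.getOrDefault(e.getKey(), 1.0) * preference(e.getKey(), e.getValue());
        }
        return sum;
    }

    public static void main(String[] args) {
        InterestSketch s = new InterestSketch();
        s.addPreferredItem(Map.of("rel-tag", Set.of("semantic-web")));
        s.addPreferredItem(Map.of("rel-tag", Set.of("microformats")));
        System.out.println(s.interest("hAtom", Map.of("rel-tag", Set.of("microformats")))); // 0.5
    }
}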

Fig. 3. Highlighting the information of interest


Thus, the activity of this module can be summarized as follows: when the data manager receives – from the collecting module – a list of blocks, it sends it to the recommending module, whose results are stored in order to be sent to the display module. Then, the blocks are sent to the update module and, finally, the response for the browser is elaborated from the recommending module’s results and sent.

4.3 Usage Scenario

A usage scenario for such an application is as follows: when the user accesses a new webpage, the tool analyzes the content of the page and discovers the blocks of data the user might be interested in. Then, it highlights these blocks and attaches tool-tips containing related preferred data – see Figure 3. The blocks are marked in the place where they appear on the webpage, thus minimizing the interference of the agent in the navigation process. The results of such an activity are a decrease in the user effort needed to parse large documents in search of the interesting content, an increased efficiency in discovering preferred content – blocks that would otherwise have been missed can be emphasized through the use of such a tool – and better semantic comprehension in the process of discovering new data on the web, provided by the permanent connections between current data and previously accessed information.

5 Related Approaches

5.1 Tools

We will briefly describe in the following some of the tools that serve purposes related to this application:
• Tails is a Firefox extension that collects microformats from the web page and allows users to execute miscellaneous actions through the Tails scripts.
• Operator is a Firefox extension as well; as an improvement, it combines microformats with other services aligned to the social web: Del.icio.us, Flickr, Google Maps, Google Calendar or Upcoming.org.
• Greasemonkey is a Firefox extension that allows the execution of user scripts and offers – via JavaScript programs like microformat-find-gm5 and XFN Viewer – support for extracting and processing microformats.
• The Firefox 3 browser has an API for the detection and parsing of microformats.
• Magpie [7] is a tool that proposes a semantic navigation by detecting, in a web page, all the items (words) that correspond to certain publicly available ontologies. An enhanced version is PowerMagpie [9].
• WebIC [16] represents a recommender system which, using the words from the visited documents, determines the user preferences and helps the user achieve her goal by retrieving documents with similar content.


5.2 Websites

There are various websites that use microformats in the generated markup, and their number is continuously rising (the expansion of microformats is facilitated by their fast assimilation by all web developers with basic XHTML knowledge), thus contributing to the success of microformats as a source of semantics for the web. A list of implementations is available online [17] – for example, Upcoming.org and Last.fm use hCalendar to mark events, Yahoo! Tech and Cork’d use hReview for product and service reviews, and many social websites (including Last.fm, LinkedIn, Flickr, and Twitter) use hCard to mark user profiles. In this context, we must also mention that there are various tools dedicated to microformat authoring and publishing – from editors to content management systems for blogs (for example, WordPress) and wiki systems like XWiki [8] – thus widening the set of possible authors of microformatted content.

6 Conclusions

One of the main goals of the microformats initiative is to facilitate machine access to the information published by humans, to enable machines to assist humans in the web browsing process. Although tools that use microformats have been developed with the emergence of microformats, the attempts have settled for extracting – manually or semi-automatically – the (meta)data marked through microformats, leaving the comprehension and semantic parsing to the human users. Our proposal takes a step further and tries to emphasize automated semantic detection as the first usage of the semantic markup, in the context of human-computer interaction. Such an approach has real applicability in a better web browsing experience (by increasing efficacy), information retrieval, product or service recommendation, social network recognition, user assistance for various tasks, and others.

An important direction to follow is towards collaborative recommending: the application can automatically correlate two users based on the detected browsing preferences, and can use these correlations to improve its recommendations – using collaborative filtering [6] or association rules. Also, the recommending principles presented in this paper can be improved by elaborating superior category-building techniques – by taking into account new properties or new similarity measures. Since microformats are in continuous evolution and the proposal of new microformats is open to the web community, the specification of new structures that would encapsulate new semantics represents a possibility of improvement for the dedicated tools, by offering access to data which is presently “hidden” by the lack of appropriate markup. By combining this type of instrument – which uses exclusively microformats – with applications focused on the “classical” Semantic Web approaches [2, 4] and with standard text classification methods [15], we can go further in the direction of processing all kinds of information on the web: practically, any page on the web could be understood by the navigation assistant, thus achieving one of the goals of the “new web” or the “web of data” – according to [11] and [14]: the equality of humans and machines as information consumers.


References
1. Adomavicius, G., Tuzhilin, A.: Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering 17(6), 734–749 (2005)
2. Allemang, D., Hendler, J.: Semantic Web for the Working Ontologist. Morgan Kaufmann, San Francisco (2008)
3. Allsopp, J.: Microformats: Empowering Your Markup for Web 2.0. Apress, Berkeley (2007)
4. Antoniou, G., van Harmelen, F.: A Semantic Web Primer, 2nd edn. MIT Press, Boston (2008)
5. Celik, T., Marks, K.: Real World Semantics. In: ETech Conference. O’Reilly, Sebastopol (2004)
6. Chakrabarti, S.: Mining the Web – Discovering Knowledge from Hypertext Data. Morgan Kaufmann, San Francisco (2003)
7. Domingue, J.B., Dzbor, M., Motta, E.: Collaborative Semantic Web Browsing with Magpie. In: Bussler, C.J., Davies, J., Fensel, D., Studer, R. (eds.) ESWS 2004. LNCS, vol. 3053, pp. 388–401. Springer, Heidelberg (2004)
8. Dumitriu, S., Girdea, M., Buraga, S.: Knowledge Management in a Wiki Platform via Microformats. In: Wilson, D., Sutcliffe, G. (eds.) FLAIRS 2007 Proceedings, pp. 278–283. AAAI Press, Key West (2007)
9. Gridinoc, L., et al.: Semantic Browsing with PowerMagpie. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 802–806. Springer, Heidelberg (2008)
10. Haine, P.: HTML Mastery: Semantics, Standards, and Styling. Apress, Berkeley (2006)
11. Khare, R., Celik, T.: Microformats: a Pragmatic Path to the Semantic Web. In: Proceedings of the 15th International Conference on World Wide Web. ACM Press, New York (2006)
12. Mitchell, T.: Machine Learning. McGraw-Hill, New York (1997)
13. O’Reilly, T.: What is Web 2.0 – Design Patterns and Business Models for the Next Generation of Software. O’Reilly, Sebastopol (2005)
14. Shadbolt, N., Hall, W., Berners-Lee, T.: The Semantic Web Revisited. IEEE Intelligent Systems 3(21), 96–101 (2006)
15. Zaihrayeu, I., et al.: From Web Directories to Ontologies: Natural Language Processing Challenges. In: Aberer, K., et al. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 623–636. Springer, Heidelberg (2007)
16. Zhu, T., Greiner, R., Häubl, G.: An Effective Complete-Web Recommender System. In: Proceedings of the 12th International Conference on World Wide Web. ACM Press, New York (2003)
17. Microformats Initiative, http://microformats.org/

Designing Universally Accessible Mobile Multimodal Artefacts

Tiago Reis, Marco de Sá, and Luís Carriço

LaSIGE, Faculdade de Ciências da Universidade de Lisboa, Portugal
[email protected], {marcodesa,lmc}@di.fc.ul.pt

Abstract. People’s characteristics and surrounding environments sometimes reduce or eliminate their capability to perform paper-based activities. The support of such activities and their extension through the utilization of non-paper-based modalities introduce new perspectives on their accomplishment for both impaired and non-impaired persons. We introduce Universally Accessible Mobile Multimodal Artefacts. We briefly explain the main tools of a Mobile Artefact Framework, focusing on an Artefact Emulator. We stress the set of interaction modalities available and a set of predefined combinations of adequate interaction modalities. The design of both the artefacts and the emulator is presented and discussed based on their early user-centred usability evaluation.

Keywords: Universal Accessibility, Multimodalities, Mobile, User-centred Design, User-Centred Evaluation.

1 Introduction

Current paper-based activities and practices are highly disseminated and intrinsic to our daily lives. Notebooks, exams, questionnaires and formularies represent a minuscule sub-set of an endless list of paper artefacts that are used to support several activities, inherent to an enormous variety of domains. Paper comprises several characteristics which make it suitable for many different purposes. This kind of medium is widespread and its utilization forms are standardized for different types of users (e.g., writing with a pen or a Braille typing machine). Its light weight makes it portable, distributable, and, when stored in appropriate conditions, a lifetime register. However, as the amount of information in a paper artefact increases, its portability is affected by weight and organization becomes difficult. Additionally, its durability is affected by the conditions under which it is stored, and its passive characteristics inhibit pro-active behaviour. Finally, once used, the unnoticed modification/edition of this medium becomes very difficult. Both the limitations and the advantages inherent to paper artefacts have a strong impact on the processes and activities supported by this medium. Accordingly, many digital alternatives to paper have become available. These try to overcome its limitations and keep its advantages, mapping characteristics of paper and paper interaction to solutions such as Digital Paper, PDAs, Smartphones, and Tablet PCs.


PDAs, Smartphones and Tablet PCs are mobile devices which include both paper-based and non-paper-based interaction modalities, constituting an excellent target for the support of a digital, universally accessible alternative to paper artefacts. However, most of the developed applications that support paper-based activities do not fully exploit the potential of their technological support. The presentation and gathering of information, as well as the interaction, through non-paper-based interaction modalities (e.g. video, audio, voice, gestures, etc.) present great potential for the extension and augmentation of paper-based activities. Moreover, the inclusion of such interaction modalities in a digital alternative to paper may allow the support of both paper-based and non-paper-based activities for everyone, including those with special needs.
The main challenge of the work presented in this paper is to provide a universally accessible mobile multimodal alternative to paper artefacts. This alternative aims at the universal support and augmentation of both paper-based and non-paper-based activities inherent to several domains. We start by introducing some work developed in this area. Then, we present a general overview of a Mobile Artefact Framework (MAFra), focusing mostly on the Universally Accessible Mobile Multimodal Artefacts (UAMMA). Afterwards, the design process followed in the creation of such artefacts is explained in detail and, finally, conclusions are drawn and future work directions are outlined.

2 Concepts and Related Work
Universal accessibility refers to the ability of all people to have equal opportunity and access to a service or product from which they can benefit, regardless of their social class, ethnicity, background or, most notably, physical disabilities. This concept is strongly tied to human rights and in some cases becomes a legal concept. Universal design is a relatively new paradigm that aims at universal accessibility. It strives to be a broad-spectrum solution that helps everyone, not just people with disabilities [1]. The inclusion of different forms of interaction in software products is a common practice of universal design.
Multimodal interaction is a characteristic of everyday human activities and communications, in which we speak, listen, look, make gestures, write, draw, touch and point, alternately or at the same time, in order to achieve an objective. When it comes to human-machine interfaces, the main goal of multimodal interaction is to consider the human perceptual channels through the inclusion of elements of natural human behaviour in human-machine interaction [2]. The interaction modalities included in a multimodal application can be used either in a complementary way (to supplement the other modalities), in a redundant manner (to provide the same information through more than one modality), or as an alternative to the other modalities (to provide the same information through a different modality) [3]. Interface adaptability is a key issue in these applications. Users can, in some circumstances, take advantage of a single modality or a specific group of modalities according to their personal and/or situational needs [4].
The previous employment of this interaction paradigm on different types of devices and applications suggests that multimodal interfaces can improve accessibility for different users and usage contexts, being, therefore, particularly well suited


for mobile systems, given the varying constraints placed on both users and surrounding environments [5]. Multimodal interaction has also been studied from points of view outside the accessibility domain. It has been employed to augment unimodal activities, enhance game interactivity and multimedia-based activities, and increase the performance, stability, robustness, expressive power and efficiency of computer- and mobile-device-supported activities [6, 7]. Studies on multimodal mobile systems have shown improvements when compared to their unimodal versions [8].
Several multimodal systems have been introduced in different domains. For instance, mobile systems that combine different interaction modalities in order to support and extend specific paper-based activities have been used with success in art festivals [9] and museums [10]. The latter supports visually impaired user interaction. Still, as with the majority of the applications that support paper-based activities, both these solutions are extremely specific and target activities that occur in particular, controlled environments. Other research lines focus on the complementary combination of interaction modalities in order to eliminate ambiguities inherent to a specific modality, such as speech recognition [5, 11]. However, once again, they focus on specific domains and activities. Closer to our goals and research line, ACICARE [12] provides a good example of a framework created to enable the development of multimodal mobile phone applications. The framework allows rapid and easy development of multimodal interfaces, providing automatic usage capture that is used in their evaluation. Nonetheless, the creation and analysis of these interfaces cannot be done in a graphical way, thus not enabling users with no programming experience to use this tool.
In the available bibliography, the multimodal alternatives to paper-based artefacts vary in the combination of interaction modalities, which generally suit specific purposes addressing pre-defined users' needs and surrounding environments. None of the work found enables users without programming experience to create, distribute, analyse and interact with universally accessible mobile multimodal artefacts that can easily be adapted to suit different purposes, users and environments.

3 MAFra - Mobile Artefact Framework
This section provides an overview of MAFra, a mobile framework for the creation, emulation, analysis of use and distribution of Universally Accessible Mobile Multimodal Artefacts (e.g. questionnaires, prototypes, applications). A detailed explanation of the final interaction design of both artefacts and emulation tool is provided, along with an overview of the remaining tools that compose MAFra.
3.1 UAMMA - Universally Accessible Mobile Multimodal Artefacts
Artefacts are abstract entities composed of pages and rules. Pages contain one or more elements, which are the interaction building blocks of the artefacts (e.g. labels, choices, images). These blocks are arranged in space and time within a page, combining visual and audible presentations. They include a type-intrinsic counterpart (e.g. a drop-down menu for its visual form and an audio description of how to interact with it) and an element-specific part corresponding to its content (e.g. the textual items and their matching audio streams).
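A minimal data-model sketch of this artefact structure is given below. It is illustrative only: the class and property names (Artefact, Page, Element, ...) are our assumptions, not the actual MAFra classes, which the paper does not describe.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of the artefact structure described above: an artefact
// is composed of pages, each page holds interaction elements, and every
// element combines a visual and an audible presentation.
public enum ElementKind { Label, TextEntry, MediaEntry, TrackBar, Choice, Selector2D }

public class Element
{
    public ElementKind Kind;          // type-intrinsic counterpart (how it looks/sounds)
    public string VisualContent;      // e.g. the textual items of a choice
    public string AudioContentPath;   // matching audio stream for audible presentation
    public bool Compulsory;           // fully interactive elements may require a response
}

public class Page
{
    public List<Element> Elements = new List<Element>();
}

public class Artefact
{
    public List<Page> Pages = new List<Page>();
    // Rules (described next) would reference pages and elements in order to
    // alter navigation or page characteristics at run time.
}
```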


Rules can alter the sequence of pages or determine their characteristics. They are triggered by user responses, interaction, navigation, or by external events, thereby defining the artefacts' behaviour.
3.1.1 Basic Output Elements
Simple artefacts, such as tutorials, guides and digital books, can be built with basic output elements. They present content and put forward size (e.g. fixed size), time (e.g. reproduction speed limits) and audio-related (e.g. recommended volume) characteristics; they may also include interaction (e.g. scrolling, play/pause buttons) that, however, does not correspond to user responses. The following variant is provided: text/image/audio/video labels present textual, image, audio or video content, or a combination of audio with image or text.
3.1.2 Fully Interactive Elements
Fully interactive elements expect user responses, which can be optional or compulsory and may have default values. The elements may have pre-defined content (e.g. the options of a choice element), thus inheriting the characteristics of basic ones, or gather it from user responses (e.g. inputted text). The content of users' responses may be used within rules to control artefact behaviour. As such, the entire set of elements and rules can be used to compose fairly elaborate adaptive artefacts. The following elements are available: Text entries allow users to enter text and optionally hear their own entered text when the device is able to support a text-to-speech (TTS) package available in the emulation tool. Audio/video/image entries enable users to record an audio/video stream or to take a picture. Track bars allow users to choose one value from a numeric scale - scale, initial value and user selection are conveyed visually and/or audibly. Text/image/audio/video choices permit users to select one or more items from an array of possible options. 2D selectors allow users to interact with images or drawings by picking one screen point or a predefined region - audible output is available for point (coordinates) selection and for region (recorded audio) selection and navigation.
3.2 Related Tools
MAFra is composed of four main tools. A wizard-based creation tool allows users with no programming experience to easily create fairly elaborate artefacts through a strongly guided process, allowing domain-specific templates for pages, page sequences and rules. An Artefact Emulator enables users to interact with the created artefacts. An Analysis Tool allows graphical analysis of artefact utilization; this tool may work as a simple inspection browser or fully reproduce the user's interaction with the artefact. In the latter mode, a log of user actions, recorded during artefact emulation, is reproduced at adjustable paces. As such, users' mistakes, hesitations and


accesses to help information are available for analysis. Finally, a Synchronization Tool allows the transfer of artefacts and utilization logs amongst users and devices. All the tools and corresponding libraries were developed in C# for Microsoft Windows platforms. Desktop/laptop/tablet and hand-held versions are available. The latter, particularly in the creation tool's case, are simplified versions of the former.
3.2.1 Artefact Emulator
The Artefact Emulator (Fig. 1) allows interaction with the artefacts. Optionally, it may record all user interaction in time-stamped XML files for deferred analysis.
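The paper does not specify the format of these logs; a minimal sketch of how such a time-stamped interaction log could be written is shown below. The element and attribute names (interactionLog, event, modality, ...) are assumptions, not the actual MAFra log schema.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Illustrative only: serializes a list of interaction events as a
// time-stamped XML log for deferred analysis.
public class LogEvent
{
    public DateTime Time;
    public string Modality;   // e.g. "GestureRecognition"
    public string Action;     // e.g. "Next"
    public string Target;     // e.g. "page3/choice1"
}

public static class InteractionLog
{
    public static void Write(string path, IEnumerable<LogEvent> events)
    {
        var settings = new XmlWriterSettings { Indent = true };
        using (XmlWriter w = XmlWriter.Create(path, settings))
        {
            w.WriteStartElement("interactionLog");
            foreach (LogEvent e in events)
            {
                w.WriteStartElement("event");
                w.WriteAttributeString("timestamp", e.Time.ToString("yyyy-MM-ddTHH:mm:ss"));
                w.WriteAttributeString("modality", e.Modality);
                w.WriteAttributeString("action", e.Action);
                w.WriteAttributeString("target", e.Target);
                w.WriteEndElement();
            }
            w.WriteEndElement();
        }
    }
}
```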

Fig. 1. Artefact Emulator

This tool provides different interaction modalities and preset modes, which were introduced in order to consider different device characteristics (e.g. non-touch or touch screens, physical or virtual keyboards), interaction forms (e.g. one hand, hands free, eyes free) and utilization contexts (e.g. movement and stationary situations).
3.2.1.1 Interaction Modalities. Direct Interaction (DI). Considers devices with a touch screen, pen or mouse: interaction is done directly on the elements. These must reveal their visual presentation (Fig. 1), composed of the element itself (e.g. a text label, a text entry box, a drop-down menu) and a small button to access its audible content. In general, it is an incomplete interaction modality, since navigation between pages is not directly available. However, some artefacts may offer page navigation in page elements, explicitly or hidden (e.g. through a rule, in a game artefact). Interaction Bar (IB). This bar is composed of six buttons. It usually appears at the bottom of the artefact (Fig. 1). The “Next” and “Previous” buttons allow navigation between pages. Navigation between elements of the same page is available through the “Up” and “Down” buttons. The “Play” button plays the audible content of the selected element. The “Action” button changes according to the selected element and allows basic indirect interaction with it. Examples are record/stop recording and select. For more complex elements (such as lists, menus or track bars), the “Action” button starts/stops a nested navigation. On a nested navigation, users use the “Next” and


“Previous” buttons to navigate horizontally (e.g. changing the value on a track bar) or the “Up” and “Down” buttons to navigate vertically (e.g. items on a combo box). Altogether, these functionalities constitute a complete set for indirect artefact interaction (except for raw textual data entry). Device Keypad (K). This approach builds directly on the indirect interaction functionalities, mapping them onto the device keys. It may or may not be a complete set, depending on the keys available on the device (and on the mapping). On most devices with navigation keys, the default mapping is: Previous - Left Arrow; Up - Up Arrow; Play - Enter; Action - holding Enter for one and a half seconds; Down - Down Arrow; Next - Right Arrow. Voice Recognition (VR). This approach maps the functionalities associated with the buttons of the IB to voice commands that are recognized by the application: Previous, Up, Play, Action, Down, Next. Gesture Recognition (GR). This approach maps the aforementioned functionalities to gestures that are recognized on the device's touch screen. A basic gesture recognition algorithm was developed, allowing the interpretation of the six different gestures presented below (Fig. 2).

Fig. 2. Gesture Recognition - the mapping of the Graphical Interaction Bar to gestures. From left to right: Previous, Up, Play, Action, Down, Next. The small dot represents a tap on the device's screen; the line represents a continuous gesture after the stroke; the big dot represents a tap-and-hold lasting more than one and a half seconds.
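For illustration, a basic classifier in the spirit of Fig. 2 could decide between a tap, a tap-and-hold and directional strokes as sketched below. This is an assumption on our part: the paper does not detail the actual algorithm or the exact gesture shapes, and the threshold values shown here are invented.

```csharp
using System;

// Hypothetical sketch: a short tap maps to Play, a tap-and-hold (> 1.5 s) to
// Action, and straight strokes left/right/up/down to Previous/Next/Up/Down.
public enum GestureCommand { Previous, Up, Play, Action, Down, Next }

public static class GestureRecognizerSketch
{
    public static GestureCommand Classify(
        int startX, int startY, int endX, int endY, double durationSeconds)
    {
        int dx = endX - startX;
        int dy = endY - startY;
        const int MoveThreshold = 20;   // pixels; assumed value

        // No significant movement: distinguish a tap from a tap-and-hold.
        if (Math.Abs(dx) < MoveThreshold && Math.Abs(dy) < MoveThreshold)
            return durationSeconds > 1.5 ? GestureCommand.Action : GestureCommand.Play;

        // The dominant axis decides between horizontal (page) and vertical
        // (element) navigation; screen coordinates grow downwards.
        if (Math.Abs(dx) >= Math.Abs(dy))
            return dx > 0 ? GestureCommand.Next : GestureCommand.Previous;

        return dy > 0 ? GestureCommand.Down : GestureCommand.Up;
    }
}
```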

3.2.1.2 Preset Modes. Visual Mode. Ignores the elements' audible counterpart throughout interaction. In this mode, interaction can be done through any of the available modalities. Eyes-free Mode. Directed at situations when users cannot pay much (e.g. walking or running) or any (e.g. blind users) visual attention to the device. This version of the tool relies on interaction through VR, GR, K or Haptics. For the latter, directed at devices without a keypad, a haptic card placed on top of the device's touch screen (Fig. 3) maps a T9 keypad (for textual input) and a simplified version of the IB. Considering space issues on hand-held devices' touch screens, this bar includes only the Up, Down, Play and Action buttons. Navigation between pages is achieved indirectly, by pressing and holding the Down (Next) or Up (Previous) buttons for 1.5 seconds or more. When navigating through pages, users are audibly informed about the current page. While navigating between elements, users are audibly informed about the type of the selected element and how to interact with it. This information can be skipped through any of the available modalities by issuing Play, Down, Up, Next or Previous.


Fig. 3. Haptic interaction for Eyes-free mode

Generally, this mode considers DI, and the visual counterpart of the application is available (except when using the haptic card), aiming at situations where users can spare sporadic visual attention to the device. Hands-free Mode. Considers only VR for input and relies on any of the available modalities for output.
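These preset modes can be thought of as configurations enabling subsets of the input and output modalities; a hypothetical sketch follows, with the mode contents inferred from the descriptions above (the type and field names are ours).

```csharp
using System.Collections.Generic;

// Hypothetical sketch: preset modes as modality configurations.
public enum InputModality { DI, IB, K, VR, GR, Haptics }
public enum OutputModality { Visual, Audio }

public class PresetMode
{
    public string Name;
    public HashSet<InputModality> Inputs;
    public HashSet<OutputModality> Outputs;
}

public static class PresetModes
{
    // Visual mode: any input modality, audible counterpart ignored.
    public static readonly PresetMode Visual = new PresetMode
    {
        Name = "Visual",
        Inputs = new HashSet<InputModality> { InputModality.DI, InputModality.IB,
            InputModality.K, InputModality.VR, InputModality.GR },
        Outputs = new HashSet<OutputModality> { OutputModality.Visual }
    };

    // Eyes-free mode: VR, GR, K or haptics; output is primarily audible.
    public static readonly PresetMode EyesFree = new PresetMode
    {
        Name = "Eyes-free",
        Inputs = new HashSet<InputModality> { InputModality.VR, InputModality.GR,
            InputModality.K, InputModality.Haptics },
        Outputs = new HashSet<OutputModality> { OutputModality.Audio }
    };

    // Hands-free mode: voice input only, any output.
    public static readonly PresetMode HandsFree = new PresetMode
    {
        Name = "Hands-free",
        Inputs = new HashSet<InputModality> { InputModality.VR },
        Outputs = new HashSet<OutputModality> { OutputModality.Visual, OutputModality.Audio }
    };
}
```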

4 Prototyping Process
During the design of the universally accessible mobile multimodal artefacts, we followed a user-centred approach specifically directed at mobile interaction design [13]. Requirements were gathered from a wide set of paper-based activities that could benefit from the introduction of new interaction modalities. These activities considered different environments (e.g. class rooms, gymnasiums), users (e.g. visually impaired users) and movement stages (e.g. walking, running). Tasks were elicited from psychotherapy (e.g. scheduling, registering activities and thoughts), education (e.g. performing exams and homework, reading and annotating books), personal training (e.g. using guides and exercise lists), etc. These were subsequently modelled into use cases and diagrams that were employed in the definition of multiple scenarios. As these scenarios gained form, several prototypes were created and evaluated by end-users within some of the mentioned scenarios. Users' procedures were filmed in order to gather usage and usability information that was crucial to our conclusions. At the end of each evaluation session, the users answered a usability questionnaire where they pointed out the difficulties experienced.
4.1 Low-Fidelity Prototypes
The low-fidelity prototypes created were questionnaires composed of seven pages, each with a question (basic output element) and an answer holder. For the latter, different types of fully interactive elements were used (e.g. choices, entries, etc.). In all tests users had to accomplish two tasks: (1) fill in the form and (2) change their answers to some specific questions. Results were rated as follows: one point was credited if the user was able to successfully fill/change the answer at the first attempt; half a point was credited if the user was able to successfully answer at the second attempt; no points were attributed in any other case. The time spent on every task was registered.
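As a simple illustration of this rating scheme, the per-task score and a session total could be computed as follows; the function names are ours and the code merely restates the scoring rule above.

```csharp
// Illustrative sketch of the rating scheme: 1 point for success at the first
// attempt, 0.5 at the second, 0 otherwise.
public static class TaskScoring
{
    // successfulAttempt: 1 or 2, or any other value if the user never succeeded.
    public static double Score(int successfulAttempt)
    {
        if (successfulAttempt == 1) return 1.0;
        if (successfulAttempt == 2) return 0.5;
        return 0.0;
    }

    // Average over all tasks of a session, expressed as the percentages
    // reported in the evaluation tables.
    public static double SessionPercentage(int[] successfulAttempts)
    {
        double total = 0.0;
        foreach (int attempt in successfulAttempts) total += Score(attempt);
        return 100.0 * total / successfulAttempts.Length;
    }
}
```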


This evaluation session [14] involved 11 persons, none visually impaired, all students (7 male, 4 female), between 18 and 30 years old. They were familiar with computers, mp3 players and mobile phones, but not with PDAs, listenable or multimodal interfaces. Approximately half of the users tested the prototype while walking in a noisy environment. The remainder took the test sitting down in a silent environment. The researchers simulated the application behaviour and audio reproduction (Wizard of Oz approach).
4.1.1 Visual Mode
We used a rigid card prototyping frame that mimics a real PDA in size and weight (Fig. 4). For each test, seven replaceable cards, each representing one page, were drawn to imitate the application. Two major audio/video control variants were assessed: one based on direct interaction with the elements (DI - two prototypes) and another on indirect interaction (IB - one prototype). All users performed all tests, but their order was defined in order to minimize bias.

Fig. 4. PDA prototype

DI. One button controlling the audible content of an element/item is available. Two design alternatives were tested: one relies on controls only (Fig. 5) and the other includes additional text (Fig. 6). During the tests, we noticed that all the users manipulated the prototype with both hands (one holding the device and the other interacting with the cards). The test results (Table 1) show that the additional text information improved the time and success rate significantly. In the final questionnaire, users confirmed the difficulties of interacting without the textual information.

Fig. 5. All contain a page navigation bar (top), an audio label (middle) and a fully interactive element (bottom) – the latter is (from left to right) an audio entry, an audio choice and a track bar


Fig. 6. Same as for Fig. 5 but with text

IB. An interaction bar is available for the interaction with all the contents of elements/items, within an artefact (Fig. 7).

Fig. 7. Same as for Figure 6, but the navigation bar was replaced by an interaction bar

During the test, we noticed that the users manipulating this type of control used only one hand (holding the device and interacting with it using the same hand). We also noticed that the bar's position made users cover the artefact while manipulating it. The test results (Table 1) show that the IB suits movement situations better than the DI variant.

Table 1. Average evaluation/time of the visual version

          DI (without text)    DI                  IB
Stopped   71.4% in 3:30 min    100% in 3 min       100% in 3 min
Walking   71.4% in 5:30 min    100% in 4:30 min    100% in 3 min

Considering both quantitative and qualitative results, we decided to create high-fidelity prototypes with a configurable solution, allowing both DI and IB. The decision was based on the fact that, although the latter performs better than or equally to the former (with text), two-handed interaction is often used in a sitting situation. We also decided to locate the bar at the bottom of the device's screen instead of the top.
4.1.2 Eye-Free Version
We used the same prototyping frame, but with a single card only. The card contained the IB and a T9 keyboard for textual input. Sounds were defined to notify a new working page, the required/possible interaction (dependent on the elements' type) and the interaction feedback. Two alternatives were evaluated: one relies on earcons and the other on voice prompts. Again, all users performed all tests in an appropriate order.


Earcons. Abstract, synthetic tones were defined and repeated for each notification (see above). The meaning of the sounds was carefully explained before the test. The results (Table 2) show that the users failed some operations. We believe some of these problems could be overcome with training and/or with a better choice of sounds. The comments reported in the post-test questionnaire corroborate these findings.
Voice Prompts. Succinct phrases were defined and repeated for each notification. The user could skip the information by pressing forward. The test results (Table 2) show that this approach assured the correct filling of the questionnaire, but also increased the time needed to accomplish it. This is because voice prompts are a lot longer than the earcons, and the users did not realize that they could skip them.

Table 2. Average evaluation/time for the eye-free version

          Earcons              Voice prompts
Stopped   85.7% in 4:30 min    100% in 6 min
Walking   87.8% in 4:20 min    100% in 6:30 min

Considering the evaluation results, voice prompts seemed the preferable solution. Moreover, from the video analysis and the users' final comments, the six IB buttons occupied too much screen space to provide proper haptic feedback on the buttons' locations. The high-fidelity prototypes included a simplified version of the IB with only four buttons and were voice-prompt based.
4.2 High-Fidelity Prototypes
Similarly to the low-fidelity prototypes, the high-fidelity prototypes created were questionnaires composed of seven pages. The evaluation results were rated in the same way as in the previous evaluation session. This session [14] involved 20 persons who had not been involved in the previous session, none visually impaired, all students (10 male, 10 female), between 17 and 38 years old, familiar with computers, mp3 players and mobile phones, but not with PDAs, listenable or multimodal interfaces.
4.2.1 Visual Mode
The evaluation of the visual version (Fig. 8) was done by 10 of the 20 persons involved in the high-fidelity prototypes' testing.

Fig. 8. Evaluation of the high-fidelity prototype


Half of this population performed the test using the DI variant and the other half did it through the IB, both in stationary situations. The purpose of this particular evaluation was to understand whether people: (1) were capable of using our interfaces correctly; (2) felt comfortable interacting with them; and (3) thought they could perform school exams on them. The results (Table 3) clearly indicate some interaction issues on the first attempt. Namely, people were not sure how to manipulate audio/video entries, time selectors and audible track bars. Nevertheless, on a second utilization the results improved substantially, suggesting a very short learning curve (Table 3).

Table 3. Comparing average success and speed for element/item control vs. page/artefact control on the 1st and 2nd attempts

          DI                  IB
1st try   88.5% in 2.5 min    80% in 2.6 min
2nd try   100% in 2.7 min     100% in 2.6 min

During the video analysis of these tests, we were able to identify some other problems. The most significant were: (1) button feedback (audio and visual) was not enough - some people were not sure whether they had pressed some buttons or not; (2) in some situations, regarding the page/artefact control bar, people were not sure which button to use in order to perform specific actions - here again, graphical feedback was not enough. The users' answers, expressed in the post-test questionnaire (Table 4), revealed good acceptance.

Table 4. Users' evaluation of the high-fidelity prototype

                                                             DI     IB
It was easy for me to accomplish the proposed activities.    80%    70%
I think this application is easy to use.                     80%    70%
I would use this application to perform an exam.             80%    60%

The overall results of this evaluation suggested some minor modifications to our final prototype. These were considered and implemented during the development of the new modalities of the Artefact Emulator described above.
4.2.2 Eye-Free Version
The evaluation of the eye-free version was done by the 10 remaining persons. We developed a prototype without any graphical information, besides four buttons (back, record, play and forward) in the place of the control bar. This version reproduces the audio content of the elements, provides voice prompts guiding navigation and interaction requests, and gives audio feedback. The prototype simulated, as much as possible, the usage scenarios faced by a blind person. The session results (100% correct answers in 7 minutes) proved that people were able to use the application. However, the task accomplishment time (when


compared to the visual version) and the users' comments suggested some changes. Although the users were informed that they could skip navigation/interaction information in order to accelerate their task's accomplishment, all of them reported an excessive use of the voice prompts. In view of that and of the previous tests, we adopted a voice-prompt-based solution in which the voice prompts can be defined by the user.
4.3 Recognition-Based Interaction Prototypes
After evaluating the indirect interaction functionalities in the above-mentioned evaluation sessions (IB evaluation), these were mapped to recognition modalities: voice and gesture recognition. The technologies that support these modalities are error prone, and their accuracy rates are of major importance to achieve universal accessibility in the Artefact Emulator and, consequently, in the artefacts. Accordingly, it was decided to evaluate and adjust the mentioned modalities per se. The aim, as already done for DI and IB, was to select a proper command set to support interaction. Two prototypes were created, one for each interaction modality considered.
4.3.1 Gesture Recognition
A specific gesture recognition algorithm was implemented for the recognition of the conceived gesture set (Fig. 2). The validation of this set and the evaluation of the algorithm's accuracy were performed in two consecutive laboratory evaluation sessions. In each session, users performed the six different gestures. Each gesture was tried three times in a row, holding and interacting with the device in three different ways: holding the device with one hand and using a stylus (1) or a finger (2) of the other, or using only one hand (3). Overall, each user performed each gesture 9 times. The order of usage ways varied between users to minimize bias. Ten users were involved, five in each session. The first session's results demonstrated short learning curves. Nevertheless, the algorithm's accuracy rates were below our expectations, especially for finger and one-hand interaction. Accordingly, the algorithm was adjusted and re-evaluated. Table 5 shows the second session's results, which are self-explanatory. Learning curves remain small even for finger and one-hand interaction, and the accuracy rates are significantly better.
Table 5. Accuracy rates for gesture recognition

                                                                          1st Try   2nd Try   3rd Try
One hand holding the device and the other interacting using the stylus    100%      100%      100%
One hand holding the device and the other interacting using a finger       95%      100%      100%
Holding and interacting with the device using the same hand                90%       95%      100%


4.3.2 Voice Recognition
The module used to support voice recognition in this prototype was the Microsoft Speech SDK. As for gestures, two evaluation sessions were conducted in order to understand the accuracy rates of voice recognition for specific lists of commands that could support the interaction command set. The first session involved four users and one list of commands ("Back, Down, Play, Action, Up, Next"). The results indicated 100% accuracy rates for most cases. However, the "Back" command was often interpreted as "Next", and vice versa. In the second session, the command list was modified, substituting "Back" with "Previous". Four other users were involved. The results indicated a 100% accuracy rate for every voice command. Noise conditions were close to optimal in both tests.
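For illustration, a command grammar restricted to the final six-command list could be loaded as sketched below. This uses the desktop System.Speech API as a stand-in, since the exact Microsoft Speech SDK version and calls used on the device are not stated in the paper.

```csharp
using System;
using System.Speech.Recognition;

// Illustrative sketch (assumed API usage): restricts recognition to the
// six-command list adopted after the second evaluation session.
class VoiceCommandListener
{
    static void Main()
    {
        var commands = new Choices("Previous", "Up", "Play", "Action", "Down", "Next");
        var grammar = new Grammar(new GrammarBuilder(commands));

        using (var recognizer = new SpeechRecognitionEngine())
        {
            recognizer.LoadGrammar(grammar);
            recognizer.SetInputToDefaultAudioDevice();
            recognizer.SpeechRecognized += (s, e) =>
                Console.WriteLine("Recognized command: " + e.Result.Text);
            recognizer.RecognizeAsync(RecognizeMode.Multiple);
            Console.ReadLine();   // keep listening until Enter is pressed
        }
    }
}
```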

5 Conclusions and Future Work
In this paper, we have presented a framework that supports the creation, distribution, interaction and interaction analysis of Universally Accessible Mobile Multimodal Artefacts. We have focused on the design and evaluation of an Artefact Emulator that enables users to interact with such artefacts. The results of the evaluation sessions conducted have shown that these artefacts can augment paper-based activities and significantly improve their accessibility through non-paper-based modalities. Although the K modality was not evaluated in the early design stages of the artefacts and emulator, it is important to emphasize that this modality relies on the indirect interaction functionalities, which were evaluated in the sessions addressing the IB variant. Our future work plans involve conducting a wider set of tests addressing both impaired and non-impaired users in order to enable the quantification of the levels of accessibility for different users and usage contexts.
Acknowledgements. This work was supported by EU, LASIGE and FCT, through project JoinTS and through the Multiannual Funding Programme.

References
1. Stephanidis, C., Salvendy, G., Akoumianakis, D., Bevan, N., Brewer, J., Emiliani, P.L., Galetsas, A., Haataja, S., Iakovidis, I., Jacko, J., Jenkins, P., Karshmer, A., Korn, P., Marcus, A., Murphy, H., Stary, C., Vanderheiden, G., Weber, G., Ziegler, J.: Towards an Information Society for All: An International R&D Agenda. International Journal of Human-Computer Interaction 10(2), 107–134 (1998)
2. Turk, M., Robertson, G.: Perceptual user interfaces introduction. Communications of the ACM 43(3), 33–35 (2000)
3. Oviatt, S.: Mutual disambiguation of recognition errors in a multimodal architecture. In: Procs. CHI 1999, pp. 576–583. ACM Press, New York (1999)
4. Gibbon, D., Mertins, I., Moore, R.: Handbook of Multimodal and Spoken Dialogue Systems: Resources, Terminology and Product Evaluation. Kluwer, Dordrecht (2000)


5. Hurtig, T.: A mobile multimodal dialogue system for public transportation navigation evaluated. In: Procs. of HCI 2006, pp. 251–254. ACM Press, New York (2006) 6. Oviatt, S., Darrell, T., Flickner, M.: Multimodal interfaces that flex, adapt, and persist. Commun. ACM 47(1), 30–33 (2004) 7. Lai, J.: Facilitating Mobile Communication with Multimodal Access to Email Messages on a Cell Phone. In: Procs. of CHI 2004, pp. 1259–1262. ACM Press, New York (2004) 8. Lai, J.: Facilitating Mobile Communication with Multimodal Access to Email Messages on a Cell Phone. In: Procs. of CHI 2004, pp. 1259–1262. ACM Press, New York (2004) 9. Signer, B., Norrie, M., Grossniklaus, M., Belotti, R., Decurtins, C., Weibel, N.: PaperBased Mobile Access to Databases. In: Procs. of theACM SIGMOD, pp. 763–765 (2006) 10. Santoro, C., Paternò, F., Ricci, G., Leporini, B.: A Multimodal Museum Guide for All. In: Mobile interaction with the Real World Workshop, Mobile HCI, Singapore (2007) 11. Lambros, S.: SMARTPAD: A Mobile Multimodal Prescription Filling System. A Thesis in TCC 402 University of Virginia (2003) 12. Serrano, M., Nigay, L., Demumieux, R., Descos, J., Losquin, P.: Multimodal interaction on mobile phones: development and evaluation using ACICARE. In: Procs. Of Mobile HCI 2006, Helsinki, Finland, vol. 159, pp. 129–136 (2006) 13. Sá, M., Carriço, L.: Designing for mobile devices: Requirements, low-fi prototyping and evaluation. In: Jacko, J.A. (ed.) HCI 2007. LNCS, vol. 4551, pp. 260–269. Springer, Heidelberg (2007) 14. Reis, T., Sá, M., Carriço, L.: Designing Mobile Multimodal Artefacts. In: Procs of ICEIS 2008, Barcelona - Spain, June 2008, pp. 79–85 (2008) ISBN: 978-989-8111-48-7

Dissection of a Visualization On-Demand Server
Romain Vuillemot, Béatrice Rumpler, and Jean-Marie Pinon
Université de Lyon, LIRIS, F-69621, Lyon, France
[email protected]
http://liris.cnrs.fr/romain.vuillemot

Abstract. In this paper, we detail the specifications of a Visualization On-Demand (VizOD) server. We show that packaging information visualization processes into services reachable over a network benefits both users and programmers by reducing development cycles. We implemented a prototype based on our architecture, resulting in an innovative way to visually explore a large movie database. We discuss early results, and our main perspective is to federate a community of users and practitioners to better design interactive environments and understand users' behaviors. Keywords: Information Visualization, Service Oriented Architecture, Visualization On-Demand.

1 Introduction
Available information visualization techniques are almost always strongly tied to a specific task or application. Making them broadly available to users according to data, needs and tasks to achieve (see Figure 1) would make more applications available. Availability is meant in terms of time-to-product, skills required and span of choice to offer users the right visualization technique. Painted with broad strokes, one major technical issue with visualization techniques is data format heterogeneity and the lack of reliability: there is a huge gap between

Fig. 1. The users' ecosystem is complex, with data requirements, various needs and tasks to achieve. Trends show that it tends to be heterogeneous and increasing in size. Our focus is on visualization and interactions, and our approach is to make them available as services.


a proof of concept issued from research work and an off-the-shelf tool. Advanced contributions exist, but they are scattered across many different application fields. For instance, visual data mining tools are very prolific in biology [1], with stunning results facing real-life problems, especially when dealing with data masses. New pattern-finding heuristics for huge structures are available for specific tasks, but these inspiring data depictions remain domain-specific. At the end of the day, end-users cannot benefit from most of the scientific tools even if they look inspiring and potentially useful. Finally, concerning interactive environments, the desktop metaphor is still massively and exclusively used because of its universal availability: innovation cannot make its breakthrough. As companies massively digitalize data for productivity, ubiquity and quality management, users need to keep track of and get analytics on data evolution, issued from multiple sources. Whereas there exist tools to perform dedicated data analysis, as far as we know there is no generic visual overview technique to get visual insights, regardless of data or tasks (see Figure 2). Our goal is to quickly (in a product development perspective) build or set up an interface to uncover phenomena unseen at first sight, which will guide users to a specific data subset or projection. Existing solutions are either restricted to an application or limited in extension. Starting a whole product cycle induces cost, time waste and risk. This process may also decrease users' attention and productivity by changing their habits and reflexes. New approaches have to keep users in the same interactive environment, with the same interactive devices they master, with only the data being different.

Fig. 2. Users' tasks have a broad span of data organization and time to result. A generic overview (gray area) of data structures leads to more specific analysis. We focus on that area, which is the root of analytical and discovery processes.

Actors' needs are identified as follows:
– End-Users. They need abstract data handling to get visual insights into the data, with appealing visual metaphors. Such a process must take into account situations where no symbol is entered into the system (in case users cannot formalize


their needs). Users have different backgrounds and cultures, so individual characteristics have to be stored.
– Designers. They are experts in the field of translating users' needs into software or assemblies of software coupled with interactive devices. Whereas design knowledge is stored in a guideline format [2], it lacks formalization and evaluation. Re-use of existing libraries and environments becomes crucial with increasing system complexity and heterogeneity. Product life cycles are then extended.
– Managers. They use advanced monitoring tools with high reliability to detect trends. Trends are a way to anticipate the future and can be raised by means of complex and long-term analysis.
This paper is organized as follows. Section 2 focuses on similar works in Information Visualization and Service Oriented Architecture. Section 3 outlines the architecture. Section 4 describes a prototype that has been developed. Section 5 discusses results and perspectives. Section 6 concludes.

2 Related Work
Our approach consists in bridging Information Visualization (InfoVis) techniques and Service Oriented Architecture (SOA). We review related work and give a synthetic comparison of the two fields (Table 1).
2.1 Information Visualization
Conceptually, the fundamental goal of InfoVis techniques is to find the right visualization at the right moment. A complete visualization process results in outputs such as maps to discover relationships among data. Relationships can be either internal or external, helping users to better understand complex static or dynamic datasets changing over time. Human capabilities are thus enhanced, but limited cognitive memory must be taken into account in order not to overload users [3]. Also, codes or users' knowledge are to be integrated to reduce information dimensions. Limited display space must also be considered, which can be done by coordinating multiple views [4].
Technically, the underlying issue is that visualization is hard-coded to the data and task to achieve. Today, reusing a technique means starting another configuration/implementation cycle according to informal design guidelines. These recommendations lack formalization and are given in a pattern format which has to be understood by experts, inducing extra costs. Another limit is that visualizations are local to the user's application and limited to his or her computing resources. And because contributions are scattered across so many different application fields, there is neither a central repository nor evaluation. Some lists exist, such as Many Eyes (http://services.alphaworks.ibm.com/manyeyes/) or Visual Complexity (http://www.visualcomplexity.com/). The latter consists in an updated screenshot and video repository, but there is no semantic taxonomy providing automatic access to visualizations.



Data Transformation Models. Our goal is to understand the visualization process in general, regardless of data types, tasks or application domains. To that end, a model-oriented analysis of the visualization has to be performed [5]. This approach has been commonly used, resulting in many models. We focus on two major models from two distinct fields: Information Visualization [6] and Scientific Visualization [7] (see Figure 3).

Fig. 3. Existing data transformation models are common in Information Visualization [6] and Scientific Visualization [7]. They help better understanding the various states of data.

We proposed a slightly different model based on the previous ones, including interactions and considering the visualization process as a sequence of independent operations. Our model decomposes the Information Visualization process into three stages, each coupled with users' actions (see Figure 4). Interactions can be either manual (user action), semi-manual (a user action triggers a system reaction) or automatic (system action). The model description is as follows:
Extraction: a step transforming data into an internal structure, such as a semantic (RDF), physical (matrix, directories) or even social one, which extracts different points of view from the data. Selection reduces datasets by selecting (SQL, SPARQL, ...), aggregating and projecting in an appropriate manner, relieving users from a potential overload.
Layout: gives a spatial attribute to abstract data structures such as graphs, lists or multidimensional sets. The layout can be a 2D map or a 3D model. Organization is any action changing the layout attribute of data, but not the data themselves.
Render: transforms an abstract data layout into images or 3D scenes. Filtering means post-processing (using image analysis techniques such as a Gaussian blur or a Laplacian filter) to provide a pre-processed visual result [8]. These mathematical computations based on 2D-signal transformations help users, for instance, to fade details or highlight contours.
The above layer descriptions help to identify and describe each technique, which can now be seen as a set of independent modular programs.
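A minimal sketch of how these three stages could be exposed as independent, composable modules is given below; the interface names and payload types are our assumptions for illustration, not the authors' actual service contracts.

```csharp
// Hypothetical sketch: the three stages of the model as independent modules
// that can be chained (Extraction -> Layout -> Render) and deployed locally
// or remotely. Payload types are simplified placeholders.
public class DataStructure { public string Xml; }   // e.g. a graph serialized as XML
public class LayoutResult  { public string Xml; }   // nodes with spatial attributes
public class RenderResult  { public byte[] Image; public string Annotations; }

public interface IExtractionService
{
    DataStructure Extract(string query);             // selection/aggregation (SQL-like)
}

public interface ILayoutService
{
    LayoutResult Layout(DataStructure data);          // assigns spatial attributes
}

public interface IRenderService
{
    RenderResult Render(LayoutResult layout, string filter);  // e.g. a blur filter for overviews
}

// Chaining the stages, wherever each service actually runs:
public static class VisualizationPipeline
{
    public static RenderResult Run(IExtractionService e, ILayoutService l, IRenderService r,
                                   string query, string filter)
    {
        return r.Render(l.Layout(e.Extract(query)), filter);
    }
}
```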


Fig. 4. We introduce a holistic data transformation model, including both data transformation processes and interactions

2.2 Service Oriented Architectures
Service Oriented Architecture (SOA) is an architectural style that aims at reorganizing business processes into loosely coupled packages. The packages are distributed into modules reachable over a network. The simplicity of use and universality of access make it a design style helping the reuse of self-describing components. The components are interfaced as individual entities with the propensity to build applications very quickly. The goal of services is to allow functionality to be assembled to form ad-hoc applications reusing existing software services. The capabilities for adaptability and evolution are high, since adding a new service reduces development costs and allows quick deployment. Services are software reachable over a network independently of the underlying implementation. The languages describing them are:
– Interfaces are published in a WSDL file (an XML-based document that describes how to communicate with the Web Service) [9].
– Service repositories store WSDL files and use UDDI (Universal Description, Discovery and Integration) to help match them with users' needs [10].
– Clients, having retrieved the relevant interface, contact the service provider using SOAP [11].
Service composition [12] enables resources to be merged and responds more quickly and cost-effectively to changing market conditions. Services help avoid license/software distribution issues, and can reduce distribution costs, piracy and reverse engineering.
2.3 Visualization Service
The challenge of generically benefiting from visualization techniques has already been faced in much research. A prolific field is Scientific Visualization, providing


related architectures. [13] introduced a client/server communication based on a simple reference model to perform data visualization. The visualization is done over the web, and focuses only on a specific data type: plots. [14] introduces modular visualization environments enabling users to change the data pipeline. A GUI interface is described, where the visualization pipeline can be changed dynamically by users. Finally, [15] is a Scientific Visualization system offering web service facilities, but which is not based on a model. Thus visualization is seen as a whole process whose steps cannot be isolated.

Table 1. Comparative characteristics of both InfoVis and SOA. InfoVis is locally tied to clients, whereas SOA is distributed and broadly available.

Type/Field      InfoVis          SOA
Location        Local            Distributed
Coupling        Tight            Loose
Flexibility     Bundle           Package
Messages        File format      Messages
Communication   Function calls   Protocol

3 A Visualization On-Demand Architecture
Our contribution, a Visualization On-Demand (VizOD) architecture, is 1) to separate the visualization process into independent modules according to our model (as seen in Figure 4) and 2) to distribute these processes, regardless of the data being studied. This modular approach has to be combined according to a strategy, taking into account both technical constraints and users' preferences. Using the on-demand paradigm means we consider the rendered data stream as a medium, like movies in Video On-Demand (VoD), that can be chosen according to a user action. Conceptually, this expression is well fitted, since we consider the visualization as a stream of existing knowledge, just seen from another perspective. Technically, it holds many properties identical to VoD, such as caching, replication and performance.
3.1 Architecture
We now focus on the interfaces and communication protocols of each step we introduced in our holistic model (Figure 4).
Processes Subdivision. This consists in making smaller parts (called business processes) of a larger system, operating independently. Each business process has properties (described in a UDDI repository) such as a description including the task it achieves, its performance and its complexity. This information will be useful to respect users' Quality of Service constraints. While communication among the processes becomes asynchronous, modules keep interdependence constraints. For instance, a change in the layout will trigger a new render. A color change in the graph depiction will leave an identical layout, but here again


the render will have to be regenerated. As the exchange protocol among the modules, we used the existing intermediate data transformations. For instance, the extraction module will communicate a graph structure to the layout process as would be done in a monolithic architecture. In other words, the external interfaces mirror the internal temporary data residing in computer memory. The language chosen to wrap messages is XML-like, as shown in Figure 5. An additional SOAP communication protocol layer is then added.

Fig. 5. Modules are independent steps issued from data transformation models. They are reusable and can be remotely located. Modules are encapsulated and communicate using SOAP messages.
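To illustrate, the graph structure passed from the extraction module to the layout module could be wrapped as follows before being placed in a SOAP envelope; the element names are hypothetical, since the paper does not give the actual message schema.

```csharp
using System.Text;
using System.Xml;

// Hypothetical sketch: wrapping an intermediate graph structure in an XML
// message that a SOAP layer would then carry to the remote layout service.
// Element names (graph, node, edge) are illustrative, not the authors' schema.
public class Edge { public string From; public string To; }

public static class GraphMessage
{
    public static string Wrap(string[] nodes, Edge[] edges)
    {
        var sb = new StringBuilder();
        using (XmlWriter w = XmlWriter.Create(sb, new XmlWriterSettings { Indent = true }))
        {
            w.WriteStartElement("graph");
            foreach (string n in nodes)
            {
                w.WriteStartElement("node");
                w.WriteAttributeString("id", n);
                w.WriteEndElement();
            }
            foreach (Edge e in edges)
            {
                w.WriteStartElement("edge");
                w.WriteAttributeString("from", e.From);
                w.WriteAttributeString("to", e.To);
                w.WriteEndElement();
            }
            w.WriteEndElement();
        }
        return sb.ToString();
    }
}
```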

Processes Distribution. Business processes can be located at any place and reached through a service repository. Process distribution means that some processes may remain local to users (e.g. because of privacy issues) while others may be distributed and reachable over a network (e.g. because they require lots of computing resources).

Fig. 6. New strategies appear based on our approach of cutting data transformation steps and providing them as independent services. For instance, strategy (a) uses a remote layout process, while (b) keeps only interactions local.


There exist many distribution strategies that can be complex and dynamic (changing with time, service availability or service load). Two examples are described in Figure 6: (a) shows a user hosting the dataset and transferring only the data structure in order to perform a layout process on it; this is a case where the layout needs lots of computing power. (b) shows the opposite strategy, where users only select the data and interact with the render; this is a case where a dataset is shared and reached through a local interactive environment.
3.2 Strategy of Access
A strategy is a common way of combining small steps to tackle a bigger problem. A step will be regarded as a group of processes, which are selected and assembled together to solve a task. Assemblies of steps will be done as closely as possible to the way the mind works and will be called patterns. These patterns follow common orchestrations that have been the subject of study, such as Shneiderman's Overview, Zoom and Details-on-Demand (OZD) [3].
3.3 Personalization
While a company is an organization having the same focus, individuals have specificities to be considered. Even if every task or context may be different, there exists a visual knowledge about data structures (e.g. tree-like structures similar to Figure 8) that is common to every kind of information. Learning how to understand and master this knowledge requires time, but once done it can result in time savings by cutting the delay to visually master new datasets. Thus, users' visual habits and preferences have to be identified and stored. Personalization [16] is a way of taking users' preferences into account. Research has been carried out in this direction, as in information retrieval systems, in order to reduce datasets according to users' explicit or implicit preferences. Explicit preferences are users' selections and configuration operations, such as local environment choice or service choice. Implicit preferences are users' history or any typical behavior registered in a non-intrusive way. For instance, if a specific service or group of services is invoked many times, it will be considered as a preference, even if no question has ever been asked to users [17]. Preferences can also vary from short-term to long-term interests. Short-term preferences are edge colors, any encoded knowledge (such as symbols) and render filters, which aim at solving a task in a specific context. A middle-term preference is the data layout, which cannot vary (whereas colors can) in order not to disturb the user's mental model, i.e. his internal vision of the virtual scene. Finally, a long-term interest concerns the interactive environment in which visualizations are integrated. Preferences will be stored in a User Visual Profile and will be available regardless of the user's location or context.
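The paper does not define the structure of the User Visual Profile; a minimal hypothetical sketch reflecting the short-, middle- and long-term preferences described above could look like this (all field names are assumptions).

```csharp
using System.Collections.Generic;

// Hypothetical sketch of a User Visual Profile grouping the preference
// horizons described in the text.
public class UserVisualProfile
{
    public string UserId;

    // Short-term: rendering choices tied to the current task/context,
    // e.g. { "edgeColor", "#3366CC" } or { "filter", "gaussian-blur" }.
    public Dictionary<string, string> RenderPreferences = new Dictionary<string, string>();

    // Middle-term: the layout kept stable to preserve the mental model.
    public string PreferredLayout;        // e.g. "tree", "force-directed"

    // Long-term: the interactive environment visualizations are embedded in.
    public string PreferredEnvironment;   // e.g. "GoogleEarth", "Desktop3D"

    // Implicit preferences: service invocation counts gathered non-intrusively.
    public Dictionary<string, int> ServiceUsage = new Dictionary<string, int>();
}
```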


Fig. 7. Sequence Diagram of an application based on the VizOD architecture. Interactions and rendering steps remain local, whereas all other steps are outsourced remotely.

4 Prototype
To validate our architecture, we developed a first VizOD prototype, based on our recommendations and using existing visualization techniques and interactive environments. Advanced technical details are available in [8]. Our approach was to implement the strategy described in Figure 6 (b), whose sequence diagram is shown in Figure 7. The dataset is a movie database based on a sample of the Internet Movie Database (http://www.imdb.com/). The data extraction consists in performing queries, selecting 1) users and 2), for each user, their rated movies. That selection is done by means of a web interface allowing SQL-like queries. The extracted result consists in a tree-like structure where the root is an artificial node connecting all users, with their rated movies as leaves. The graph layout technique used was originally aimed at displaying protein networks [1]. Other graph visualization tools exist, such as [18]. The render step results in an image with annotated data (containing details about movies), provided in a separate file structured in an XML-like format, KML (Keyhole Markup Language). The result is the picture displayed in Figure 8. The image looks intriguing, but holds only structural information: it has to be included in the user's interactive environment, which makes metadata available (such as node and edge names). We selected Google Earth (GE) as an interactive environment (with all geo-spatial features disabled). A screenshot is available (Figure 9). GE is installed and run by the client, and connects to VizOD by means of HTTP requests. Using HTTP helps avoid firewall issues or any complicated network configuration. VizOD will seamlessly



Fig. 8. The image shows a tree-like layout visualization of a query result from a movie database. Images need to be coupled to a local interactive environment to let users get more information, such as metadata and annotations.

map an image onto GE's 3D sphere according to the user's altitude and angle of view, following the recommendations of [3]. The result is a 3-layered multi-scale strategy combining external business processes, resulting (by decreasing altitude) in an overview layer, a zoom layer and a details layer. The overview layer aims at showing global trends, so it is connected to a blurring render service to remove details and raise trends. The service interfaces Gimp (the GNU Image Manipulation Program, http://www.gimp.org/), used in command line. The zoom layer remains the original image. The detail layer is the original image augmented with details on top of it (included in the KML file). Bandwidth usage has been minimized in the detail layer by keeping the zoom image locally and only requesting and adding light KML data on top of it. The KML is converted to additional vectorial graphics (lines, captions, ...) by GE. We used a server running on GNU/Linux Fedora Core 6, with 512 MB RAM and an AMD Athlon XP 2500+ (1.8 GHz) with 512 KB of cache memory. It took 59 seconds to generate a single image. The layout process is the greediest one. Comparatively, generating or blurring images involves nearly no extra time or resource cost. A result from the experiments is that GE is a good metaphor, since it implements a well-known object, the Earth. Users had already used GE for other purposes, so we managed to minimize the environment-mastering phase by reusing an existing and widespread tool. Usability was excellent since we re-used a powerful environment, which remains very reactive to every gesture and move from users, even if images were not fully loaded or updated yet. Thanks to our VizOD approach, many



Fig. 9. Mapping rendered images onto a 3D sphere is an efficient way to provide end users with interactive abilities. Panning, zooming and rotating, coupled with details on demand, are powerful ways to face visual overloads.

visualization strategies can be adopted, while the client stays focused on the very same interface. The dataset can even change totally without any interface change. Adding new features to VizOD will be transparent for users. Finally, users appreciated using an attractive medium such as the 3D sphere (similar to the iPod effect, where the appealing wheel attracts users).
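For illustration, the detail layer's annotations could be produced as a small KML document like the one generated below; the placemark content is hypothetical, since the paper does not show the prototype's actual KML output.

```csharp
using System.Globalization;
using System.Text;
using System.Xml;

// Hypothetical sketch: emitting one KML placemark for a movie node of the
// rendered image, so that GE can overlay its name and details on demand.
public static class KmlAnnotations
{
    public static string Placemark(string movieTitle, double lon, double lat, string details)
    {
        var sb = new StringBuilder();
        using (XmlWriter w = XmlWriter.Create(sb, new XmlWriterSettings { Indent = true }))
        {
            w.WriteStartElement("kml", "http://www.opengis.net/kml/2.2");
            w.WriteStartElement("Placemark");
            w.WriteElementString("name", movieTitle);
            w.WriteElementString("description", details);
            w.WriteStartElement("Point");
            // KML coordinates are "longitude,latitude"; use the invariant
            // culture so the decimal separator is always a dot.
            w.WriteElementString("coordinates",
                lon.ToString(CultureInfo.InvariantCulture) + "," +
                lat.ToString(CultureInfo.InvariantCulture));
            w.WriteEndElement();   // Point
            w.WriteEndElement();   // Placemark
            w.WriteEndElement();   // kml
        }
        return sb.ToString();
    }
}
```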

5 Discussions and Perspectives
In this section we discuss applications of our architecture and provide research tracks that need to be investigated.
Company Benefits from using a VizOD architecture will come from outsourcing visualization to experts, providing software as services with better support and piracy prevention. Information system rationalization will go further by centralizing computing power at the same remote place and leaving end-users with lightweight heterogeneous terminals. The software maintenance routine is no longer on site but on the VizOD servers holding the business processes, which can be numerous, allowing diversity and backup alternatives. New economic models will appear for producers.
Privacy and Security Issues are a big concern in our approach. Regarding the steps of our model, one can see that rendered images prevent reverse engineering. Furthermore, extracting only data structures will give structural information and prevent details from being visible: the quantity of information is given, not its quality. Finally, a service approach prevents implementation details from being visible. However, we keep SOA-related issues, such as message exchanges that are prone to attacks.


Users' Needs have to be considered globally, with imperfections such as short attention spans. Memory is also an important aspect to deal with, especially with data masses and in case information is available in streams which are not stored: users' full attention is then required. The focus can be on visualizing updates and data growth rather than the content itself: the change or the behavior becomes as important as the instant content. The scalability and adaptability of the VizOD architecture are crucial.
Service Mashup Interfaces are a new way for users to compose services in order to build up and share new ones, but they do not yet include extra processes such as data layout and render. A future work is to implement a Visualization Mashup Interface to cope with the lack of a semantic, integrated visualization service repository. Users need tools in a new era where web users have become actors using web interfaces.
New Design Processes and product life cycles will emerge. Programmers are not constrained, so they will keep their own programming habits. A piecewise conception process can be set up, with progressive features.
Other Research Communities are addressed, such as cognitive scientists in order to observe users' behavior, and interface designers to re-think the way to design and evaluate. New actors, such as artists, can now fully take part in the design process by including artistic content in one or many navigation steps.

6 Conclusions

In this paper we introduced and implemented a visualization on-demand server (VizOD). We first proposed a holistic data transformation model considering both visualizations and interactions. Every step of the model is treated as a business process that can either remain local or be distributed. Such an approach allows end users to benefit from and personalize visualization services and to couple them with their local interactive environments. A present-day result is a prototype assembled from existing tools, showing that even tools not dedicated to visualization (such as graph manipulation libraries) can quickly lead to an innovative application following VizOD’s specifications, in a context of affordable systems and inexpensive software. Encapsulating processes to make them automatically discoverable and usable is our next research step. There is also a semantic gap between users’ task needs, which are expressed in natural language, and the tasks the machine already knows how to perform. To tackle this issue, our next step is to build online communities that will cross-fertilize best practices and novel uses. We will also focus on user-generated data (e.g., traces of use) that have to be structured, filtered and sorted. More generally, a stable online framework has to emerge and be sustained over time; lessons will then be learned from widespread use.

References

1. Adai, A.T., Date, S.V., Wieland, S., Marcotte, E.M.: LGL: Creating a map of protein function with an algorithm for visualizing very large biological networks. J. Mol. Biol. 340, 179–190 (2004)
2. Shneiderman, B.: Designing the User Interface: Strategies for Effective Human-Computer Interaction. Addison-Wesley Longman Publishing Co., Inc., Boston (1997)


3. Shneiderman, B.: The eyes have it: A task by data type taxonomy for information visualizations. In: VL 1996: Proceedings of the 1996 IEEE Symposium on Visual Languages, Washington, DC, USA, p. 336. IEEE Computer Society, Los Alamitos (1996)
4. Michelle, W.A., Kuchinsky, A.: Guidelines for using multiple views in information visualization. In: Advanced Visual Interfaces, pp. 110–119 (2000)
5. Butler, D.M., Almond, J.C., Bergeron, R.D., Brodlie, K.W., Haber, R.B.: Visualization reference models. In: VIS 1993: Proceedings of the 4th Conference on Visualization 1993, pp. 337–342 (1993)
6. Chi, E.H.: A taxonomy of visualization techniques using the data state reference model. In: INFOVIS, pp. 69–76 (2000)
7. Haber, R., McNabb, D.: Visualization idioms: A conceptual model for scientific visualization systems. In: Nielson, G.M., Shriver, B., Rosenblum, L.J. (eds.) Visualization in Scientific Computing, pp. 74–93. IEEE Computer Society Press, Los Alamitos (1990)
8. Vuillemot, R., Peralta, V.: From Beautiful to Useful: A Multi-Scale Visualization of Users Movie Ratings. Technical Report RR-LIRIS-2008-001, LIRIS UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/École Centrale de Lyon (2008)
9. Christensen, E., Curbera, F., Meredith, G., Weerawarana, S.: Web Services Description Language (WSDL) 1.1. Technical report, W3C Note (2001), http://www.w3.org/TR/wsdl
10. UDDI: Universal Description, Discovery and Integration, Version 3. OASIS, Billerica, Mass. (2000), http://www.uddi.org
11. SOAP: Simple Object Access Protocol (SOAP 1.1) (2000), http://www.w3.org/TR/SOAP
12. Agarwal, V., Dasgupta, K., Karnik, N., Kumar, A., Kundu, A., Mittal, S., Srivastava, B.: A service creation environment based on end to end composition of web services. In: WWW 2005: Proceedings of the 14th International Conference on World Wide Web, pp. 128–137. ACM, New York (2005)
13. Wood, J., Brodlie, K., Wright, H.: Visualization over the world wide web and its application to environmental data. In: VIS 1996: Proceedings of the 7th Conference on Visualization 1996, p. 81. IEEE Computer Society Press, Los Alamitos (1996)
14. Bonneau, G.P., Ertl, T., Nielson, G.M.: Scientific Visualization: The Visual Extraction of Knowledge from Data. Mathematics+Visualization. Springer, Heidelberg (2005)
15. Blazona, B., Mihajlovic, Z.: Visualization service based on web services. In: 29th International Conference on Information Technology Interfaces (ITI 2007), June 25–28, 2007, pp. 673–678 (2007)
16. Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.): The Adaptive Web (2007)
17. Eirinaki, M., Vazirgiannis, M.: Web mining for web personalization. ACM Trans. Inter. Tech. 3, 1–27 (2003)
18. Auber, D.: Tulip: A huge graph visualisation framework. In: Mutzel, P., Jünger, M. (eds.) Graph Drawing Software. Mathematics and Visualization, pp. 105–126. Springer, Heidelberg (2003)

Author Index

Andreou, Andreas S. 87
Bachlechner, Daniel 280
Benghazi, Kawtar 213
Bögl, Andreas 155
Brasethvik, Terje 201
Bringas, Pablo García 117
Brisson, Laurent 103
Buraga, Sabin C. 321
Capel, Manuel I. 213
Cardoso, Jorge 15
Carriço, Luís 334
Ceravolo, Paolo 46
Collard, Martine 103
Cui, Zhan 46
Damiani, Ernesto 46
Droop, Matthias 31
Eessaar, Erki 73
Flarer, Markus 31
Flentge, Felix 307
Folino, Francesco 130
Greco, Gianluigi 130
Groppe, Jinghua 31
Groppe, Sven 31
Gulla, Jon Atle 61, 201
Gusmini, Alex 46
Guzzo, Antonella 130
Heer, Thomas 175
Ibrahim, Emma Nuraihan Mior 297
Kangasharju, Jaakko 241
Koskimies, Oskari 241
Kraft, Bodo 175
Kvarv, Gøran Sveia 201
Leida, Marcello 46
Li, Yuefeng 253, 265
Linnemann, Volker 31
Luca, Anca-Paula 321
Mehad, Shafie 297
Mendoza, Luis E. 213
Mühlhäuser, Max 307
Nayak, Richi 253, 265
Noor, Nor Laila Md. 297
Norrie, Moira C. 3
Papatheocharous, Efi 87
Pastor, Oscar 226
Penya, Yoseba K. 117
Pérez, María 213
Pinggera, Jakob 31
Pinon, Jean-Marie 348
Pomberger, Gustav 155
Pontieri, Luigi 130
Reis, Tiago 334
Retkowitz, Daniel 175
Rumpler, Béatrice 348
Sá, Marco de 334
Santner, Florian 31
Scheidl, Stefan 307
Schier, Michael 31
Schöpf, Felix 31
Schrefl, Michael 155
Staffler, Hannes 31
Stoitsev, Todor 307
Tarkkanen, Kimmo 188
Valverde, Francisco 226
Vlaanderen, Kevin 226
Voigt, Konrad 15
Vuillemot, Romain 348
Weber, Norbert 155
Weng, Li-Tung 253, 265
Winkler, Matthias 15
Xu, Yue 253, 265
Zacharias, Valentin 144
Zugal, Stefan 31