George A. Tsihrintzis, Maria Virvou, Robert J. Howlett, and Lakhmi C. Jain (Eds.) New Directions in Intelligent Interactive Multimedia
Studies in Computational Intelligence, Volume 142

Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: [email protected]

Further volumes of this series can be found on our homepage: springer.com
Vol. 123. Shuichi Iwata, Yukio Ohsawa, Shusaku Tsumoto, Ning Zhong, Yong Shi and Lorenzo Magnani (Eds.)
Communications and Discoveries from Multidisciplinary Data, 2008
ISBN 978-3-540-78732-7

Vol. 124. Ricardo Zavala Yoe
Modelling and Control of Dynamical Systems: Numerical Implementation in a Behavioral Framework, 2008
ISBN 978-3-540-78734-1

Vol. 125. Larry Bull, Ester Bernadó-Mansilla and John Holmes (Eds.)
Learning Classifier Systems in Data Mining, 2008
ISBN 978-3-540-78978-9

Vol. 126. Oleg Okun and Giorgio Valentini (Eds.)
Supervised and Unsupervised Ensemble Methods and their Applications, 2008
ISBN 978-3-540-78980-2

Vol. 127. Régis Gras, Einoshin Suzuki, Fabrice Guillet and Filippo Spagnolo (Eds.)
Statistical Implicative Analysis, 2008
ISBN 978-3-540-78982-6

Vol. 128. Fatos Xhafa and Ajith Abraham (Eds.)
Metaheuristics for Scheduling in Industrial and Manufacturing Applications, 2008
ISBN 978-3-540-78984-0

Vol. 129. Natalio Krasnogor, Giuseppe Nicosia, Mario Pavone and David Pelta (Eds.)
Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), 2008
ISBN 978-3-540-78986-4

Vol. 130. Richi Nayak, Nikhil Ichalkaranje and Lakhmi C. Jain (Eds.)
Evolution of the Web in Artificial Intelligence Environments, 2008
ISBN 978-3-540-79139-3

Vol. 131. Roger Lee and Haeng-Kon Kim (Eds.)
Computer and Information Science, 2008
ISBN 978-3-540-79186-7

Vol. 132. Danil Prokhorov (Ed.)
Computational Intelligence in Automotive Applications, 2008
ISBN 978-3-540-79256-7

Vol. 133. Manuel Graña and Richard J. Duro (Eds.)
Computational Intelligence for Remote Sensing, 2008
ISBN 978-3-540-79352-6

Vol. 134. Ngoc Thanh Nguyen and Radoslaw Katarzyniak (Eds.)
New Challenges in Applied Intelligence Technologies, 2008
ISBN 978-3-540-79354-0

Vol. 135. Hsinchun Chen and Christopher C. Yang (Eds.)
Intelligence and Security Informatics, 2008
ISBN 978-3-540-69207-2

Vol. 136. Carlos Cotta, Marc Sevaux and Kenneth Sörensen (Eds.)
Adaptive and Multilevel Metaheuristics, 2008
ISBN 978-3-540-79437-0

Vol. 137. Lakhmi C. Jain, Mika Sato-Ilic, Maria Virvou, George A. Tsihrintzis, Valentina Emilia Balas and Canicious Abeynayake (Eds.)
Computational Intelligence Paradigms, 2008
ISBN 978-3-540-79473-8

Vol. 138. Bruno Apolloni, Witold Pedrycz, Simone Bassis and Dario Malchiodi
The Puzzle of Granular Computing, 2008
ISBN 978-3-540-79863-7

Vol. 139. Jan Drugowitsch
Design and Analysis of Learning Classifier Systems, 2008
ISBN 978-3-540-79865-1

Vol. 140. Nadia Magnenat-Thalmann, Lakhmi C. Jain and N. Ichalkaranje (Eds.)
New Advances in Virtual Humans, 2008
ISBN 978-3-540-79867-5

Vol. 141. Christa Sommerer, Lakhmi C. Jain and Laurent Mignonneau (Eds.)
The Art and Science of Interface and Interaction Design, 2008
ISBN 978-3-540-79869-9

Vol. 142. George A. Tsihrintzis, Maria Virvou, Robert J. Howlett and Lakhmi C. Jain (Eds.)
New Directions in Intelligent Interactive Multimedia, 2008
ISBN 978-3-540-68126-7
George A. Tsihrintzis Maria Virvou Robert J. Howlett Lakhmi C. Jain (Eds.)
New Directions in Intelligent Interactive Multimedia
Prof. George Tsihrintzis
Department of Informatics
University of Piraeus
80, Karaoli & Dimitriou St.
Piraeus 18534
Greece
E-mail: [email protected]

Prof. Maria Virvou
Department of Informatics
University of Piraeus
80, Karaoli & Dimitriou St.
Piraeus 18534
Greece
E-mail: [email protected]

Prof. Robert Howlett
University of Brighton
School of Engineering
Research Centre
Moulsecoomb, Brighton, BN2 4GJ
UK
E-mail: [email protected]

Prof. Lakhmi C. Jain
KES Centre
School of Electrical and Information Engineering
University of South Australia
Adelaide, Mawson Lakes Campus
South Australia SA 5095
Australia
E-mail: [email protected]
ISBN 978-3-540-68126-7
e-ISBN 978-3-540-68127-4
DOI 10.1007/978-3-540-68127-4 Studies in Computational Intelligence
ISSN 1860-949X
Library of Congress Control Number: 2008926411

© 2008 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

9 8 7 6 5 4 3 2 1

springer.com
Sponsoring Institutions
University of Piraeus – Rector’s Council
University of Piraeus – Research Center
University of Piraeus – Graduate Program of Studies in Informatics
Ministry of National Education and Religious Affairs
County of Piraeus
Eurobank EFG, S.A.
Preface to: New Directions in Intelligent Interactive Multimedia

George A. Tsihrintzis (1), Maria Virvou (1), Robert J. Howlett (2), and Lakhmi C. Jain (3)

(1) Department of Informatics, University of Piraeus
(2) Center for Smart Systems, University of Brighton
(3) School of Electrical & Information Engineering, University of South Australia
Multimedia systems is the term chosen to refer to the coordinated and secure storage, processing, transmission and retrieval of multiple forms of information, such as audio, image, video, animation, graphics, and text. During the last decade, multimedia systems have become a vibrant field of research and development worldwide, and multimedia services based on multimedia systems have made significant progress in recent times. Multimedia systems and services have been developed to address needs in various areas including, but not limited to, advertisement, art, business, education, entertainment, engineering, medicine, mathematics, scientific research and spatiotemporal applications. The growth rate of multimedia services has become explosive, as technological progress struggles to match consumers' demand for content.

In our times, computers are more widespread than ever, and computer users range from highly qualified scientists to non-computer-expert professionals, and may include people with special needs. Thus, interactivity, personalization and adaptivity have become a necessity in modern multimedia systems and services. Modern intelligent multimedia systems need to be interactive not only through classical modes of interaction, where the user inputs information through a keyboard or mouse. They must also support other modes of interaction, such as visual or lingual computer-user interfaces, which render them more attractive, more user-friendly, more human-like and more informative. On the other hand, "one-size-fits-all" solutions are no longer applicable to wide ranges of users of various backgrounds and needs. Therefore, one important goal of many intelligent multimedia systems is their ability to provide personalized service and to adapt dynamically to their users.

To achieve these goals, intelligent interactive multimedia systems and services (IIMSS) need to evolve at all levels of processing. Specific sub-areas requiring further research include:

1. Advances in Multimedia Data Analysis
2. New Reasoning Approaches
3. More efficient Infrastructure for Intelligent Interactive Multimedia Systems and Services
4. Development of innovative Multimedia Application Areas
5. Improvement of the Quality of Interactive Multimedia Services
This book summarizes the works and new research results presented at the First International Symposium on Intelligent Interactive Multimedia Systems and Services (KES-IIMSS 2008), organized by the University of Piraeus and its Department of Informatics in conjunction with KES International (Piraeus, Greece, July 9–11, 2008). The aim of the symposium was to provide an internationally respected forum for scientific research into the technologies and applications of intelligent interactive multimedia systems and services.

Besides the Preface, the book contains sixty-four (64) chapters. The first four (4) chapters are printed versions of the keynote addresses of the invited speakers of KES-IIMSS 2008. Besides the invited-speaker chapters, the book contains fifteen (15) chapters on recent Advances in Multimedia Data Analysis, eleven (11) chapters on Reasoning Approaches, nine (9) chapters on Infrastructure of Intelligent Interactive Multimedia Systems and Services, fourteen (14) chapters on Multimedia Applications, and eleven (11) chapters on Quality of Interactive Multimedia Services.

More specifically, Chapter 1 by Germano Resconi is on "Morphic Computing." Chapter 2 by Mike Christel is on "Amplifying Video Information-Seeking Success through Rich, Exploratory Interfaces." Chapter 3 by Alfred Kobsa is on "Privacy-Enhanced Personalization." Chapter 4 by Paul Brna is on "Narrative Interactive Multimedia Learning Environments: Achievements and Challenges." Chapters 5, 6, 7, and 8 cover various aspects of Image and Video Analysis, while Chapters 9, 10, 11, and 12 are devoted to Fast Methods for Intelligent Image Recognition. Chapters 13, 14, 15, and 16 are devoted to Audio Analysis, and Chapters 17, 18, and 19 present new results in Time Series Analysis in Financial Services. Chapters 20, 21, and 22 present new results in Multimedia Information Clustering and Retrieval, while Chapters 23, 24, and 25 are devoted to Decision Support Services. Additionally, Chapters 26, 27, 28, 29, and 30 are devoted to Reasoning-based Intelligent Information Systems. Chapters 31, 32, 33, and 34 are devoted to Wireless and Web-based Multimedia, and Chapters 35, 36, 37, 38, and 39 present Techniques and Applications for Multimedia Security. Chapters 40, 41, 42, 43, and 44 are devoted to Tutoring Systems, while Chapters 45, 46, and 47 are devoted to Geographical Multimedia Services. Chapters 48 and 49 present multimedia applications in Interactive TV, and Chapters 50, 51, 52, and 53 are devoted to Intelligent and Interactive Multimedia in Bioinformatics and Medical Informatics. Chapters 54, 55, and 56 present new results in Affective Multimedia, while Chapters 57, 58, 59, 60, and 61 present Multimedia Techniques for Ambient Intelligence. Finally, Chapters 62, 63, and 64 are devoted to approaches for the Evaluation of Multimedia Services.

We wish to express our gratitude to the authors of the various chapters and to the reviewers for their wonderful contributions. For their help with organizational issues of KES-IIMSS 2008, we express our thanks to Ms. Paraskevi Lampropoulou, Ms. Lina Stamati, Mr. Efthimios Alepis and Mr. Konstantinos Patsakis, doctoral students at the University of Piraeus, and to Mr. Peter Cushion of KES International.
Thanks are due to Springer-Verlag for their editorial support, and we would also like to express our sincere thanks to Mr. Thomas Ditzinger for his wonderful editorial support. We believe that this book will help create interest among researchers and practitioners towards realizing human-like interactive multimedia services. It will prove useful to researchers, professors, research students and practitioners, as it reports novel research work on challenging topics in the area of intelligent interactive multimedia systems and services. Moreover, special emphasis has been placed on highlighting issues concerning the development process of such complex systems and services, thus revisiting the difficult issue of knowledge engineering of such systems. In this way, the book aims at providing readers with a better understanding of how intelligent interactive multimedia systems and services can be successfully implemented to incorporate recent trends and advances in the theory and applications of intelligent systems.

George A. Tsihrintzis
Maria Virvou
Robert J. Howlett
Lakhmi C. Jain
Contents
Morphic Computing
Germano Resconi ..... 1

Amplifying Video Information-Seeking Success through Rich, Exploratory Interfaces
Michael G. Christel ..... 21

Privacy-Enhanced Personalization
Alfred Kobsa ..... 31

Narrative Interactive Multimedia Learning Environments: Achievements and Challenges
Paul Brna ..... 33

A Support Vector Machine Approach for Video Shot Detection
Vasileios Chasanis, Aristidis Likas, Nikolaos Galatsanos ..... 45

Comparative Performance Evaluation of Artificial Neural Network-Based vs. Human Facial Expression Classifiers for Facial Expression Recognition
I.-O. Stathopoulou, G.A. Tsihrintzis ..... 55

Histographic Steganographic System
Constantinos Patsakis, Nikolaos Alexandris ..... 67

Moving Object Detection and Tracking for the Purpose of Multimodal Surveillance System in Urban Areas
Andrzej Czyzewski, Piotr Dalka ..... 75

Image Similarity Search in Large Databases Using a Fast Machine Learning Approach
Smiljan Šinjur, Damjan Zazula ..... 85

Fast Segmentation of Ovarian Ultrasound Volumes Using Support Vector Machines and Sparse Learning Sets
Mitja Lenič, Boris Cigale, Božidar Potočnik, Damjan Zazula ..... 95

Fast and Intelligent Determination of Image Segmentation Method Parameters
Božidar Potočnik, Mitja Lenič ..... 107

Fast Image Segmentation Algorithm Using Wavelet Transform
Tomaž Romih, Peter Planinšič ..... 117

Musical Instrument Category Discrimination Using Wavelet-Based Source Separation
P.S. Lampropoulou, A.S. Lampropoulos, G.A. Tsihrintzis ..... 127

Music Perception as Reflected in Bispectral EEG Analysis under a Mirror Neurons-Based Approach
Panagiotis Doulgeris, Stelios Hadjidimitriou, Konstantinos Panoulas, Leontios Hadjileontiadis, Stavros Panas ..... 137

Automatic Recognition of Urban Soundscenes
Stavros Ntalampiras, Ilyas Potamitis, Nikos Fakotakis ..... 147

Low Bitrate Coding of Spot Audio Signals for Interactive and Immersive Audio Applications
Athanasios Mouchtaris, Christos Tzagkarakis, Panagiotis Tsakalides ..... 155

Extracting Input Features and Fuzzy Rules for Forecasting Exchange Rate Using NEWFM
Sang-Hong Lee, Hyoung J. Jang, Joon S. Lim ..... 165

Forecasting Short-Term KOSPI Time Series Based on NEWFM
Sang-Hong Lee, Hyoung J. Jang, Joon S. Lim ..... 175

The Convergence Analysis of an Improved Artificial Immune Algorithm for Clustering
Jianhua Tong, Hong-Zhou Tan, Leiyong Guo ..... 185

Artificial Immune System-Based Music Genre Classification
D.N. Sotiropoulos, A.S. Lampropoulos, G.A. Tsihrintzis ..... 191

Semantic Information Retrieval Dedicated to Multimedia Systems: A Platform Based on Conceptual Graphs
Xavier Aimé, Francky Trichet ..... 201

Interactive Cluster-Based Personalized Retrieval on Large Document Collections
Petros Belsis, Charalampos Konstantopoulos, Basilis Mamalis, Grammati Pantziou, Christos Skourlas ..... 211

Decision Support Services Facilitating Uncertainty Management
Sylvia Encheva ..... 221

Efficient Knowledge Transfer by Hearing a Conversation While Doing Something
Eiko Yamamoto, Hitoshi Isahara ..... 231

On Managing Users' Attention in Knowledge-Intensive Organizations
Dimitris Apostolou, Stelios Karapiperis, Nenad Stojanovic ..... 239

Two Applications of Paraconsistent Logical Controller
Jair Minoro Abe, Kazumi Nakamatsu, Seiki Akama ..... 249

Encoding Modalities into Extended Petri Net for Analyzing Discrete Event Business Process
Takashi Hattori, Hiroshi Kawakami, Osamu Katai, Takayuki Shiose ..... 255

Paraconsistent Before-After Relation Reasoning Based on EVALPSN
Kazumi Nakamatsu, Jair Minoro Abe, Seiki Akama ..... 265

Image Representation with Reduced Spectrum Pyramid
Roumen Kountchev, Roumiana Kountcheva ..... 275

Constructive Logic and the Sorites Paradox
Seiki Akama, Kazumi Nakamatsu, Jair Minoro Abe ..... 285

Resource Authorization in IMS with Known Multimedia Service Adaptation Capabilities
Tomislav Grgic, Vedran Huskic, Maja Matijasevic ..... 293

Visualizing Ontologies on the Web
Ioannis Papadakis, Michalis Stefanidakis ..... 303

Performance Analysis of ACL Packets Using Turbo Code in Bluetooth Wireless System
Il-Young Moon ..... 313

Design and Implementation of Remote Monitoring System for Supporting Safe Subways Based on USN
Seok Cheol Lee, Chang Soo Kim ..... 321

Evaluation of PC-Based Real-Time Watermark Embedding System for Standard-Definition Video Stream
Takaaki Yamada, Yoshiyasu Takahashi, Hiroshi Yoshiura, Isao Echizen ..... 331

User Authentication Scheme Using Individual Auditory Pop-Out
Kotaro Sonoda, Osamu Takizawa ..... 341

Combined Scheme of Encryption and Watermarking in H.264/Scalable Video Coding (SVC)
Su-Wan Park, Sang-Uk Shin ..... 351

Evaluation of Integrity Verification System for Video Content Using Digital Watermarking
Takaaki Yamada, Yoshiyasu Takahashi, Yasuhiro Fujii, Ryu Ebisawa, Hiroshi Yoshiura, Isao Echizen ..... 363

Improving the Host Authentication Mechanism for POD Copy Protection System
Eun-Jun Yoon, Kee-Young Yoo ..... 373

User Stereotypes Concerning Cognitive, Personality and Performance Issues in a Collaborative Learning Environment for UML
Kalliopi Tourtoglou, Maria Virvou ..... 385

Intelligent Mining and Indexing of Multi-language e-Learning Material
Angela Fogarolli, Marco Ronchetti ..... 395

Classic and Multimedia Based Activities to Teach Colors for Both Teachers and Their Pre-school Kids at the Kindergarten of Arab Schools in South of Israel
Mahmoud Huleihil, Huriya Huleihil ..... 405

TeamSim: An Educational Micro-world for the Teaching of Team Dynamics
Orazio Miglino, Luigi Pagliarini, Maurizio Cardaci, Onofrio Gigliotta ..... 417

The Computerized Career Gate Test K.17
Theodore Katsanevas ..... 427

Fuzzy Logic Decisions and Web Services for a Personalized Geographical Information System
Constantinos Chalvantzis, Maria Virvou ..... 439

Design Rationale of an Adaptive Geographical Information System
Katerina Kabassi, Georgios P. Heliades ..... 451

Multimedia, User-Centered Design and Tourism: Simplicity, Originality and Universality
Francisco V. Cipolla Ficarra, Miguel Cipolla Ficarra ..... 461

Dynamically Extracting and Exploiting Information about Customers for Knowledge-Based Interactive TV-Commerce
Anastasios Savvopoulos, Maria Virvou ..... 471

Caring TV as a Service Design with and for Elderly People
Katariina Raij, Paula Lehto ..... 481

A Biosignal Classification Neural Modeling Methodology for Intelligent Hardware Construction
Anastasia Kastania, Stelios Zimeras, Sophia Kossida ..... 489

Virtual Intelligent Agents to Train Abilities of Diagnosis in Psychology and Psychiatry
José Gutiérrez-Maldonado, Ivan Alsina-Jurnet, María Virginia Rangel-Gómez, Angel Aguilar-Alonso, Adolfo José Jarne-Esparcia, Antonio Andrés-Pueyo, Antoni Talarn-Caparrós ..... 497

The Role of Neural Networks in Biosignals Classification
Stelios Zimeras, Anastasia Kastania ..... 507

Medical Informatics in the Web 2.0 Era
Iraklis Varlamis, Ioannis Apostolakis ..... 513

Affective Reasoning Based on Bi-modal Interaction and User Stereotypes
Efthymios Alepis, Maria Virvou, Katerina Kabassi ..... 523

General-Purpose Emotion Assessment Testbed Based on Biometric Information
Jorge Teixeira, Vasco Vinhas, Eugenio Oliveira, Luis Paulo Reis ..... 533

Realtime Dynamic Multimedia Storyline Based on Online Audience Biometric Information
Vasco Vinhas, Eugenio Oliveira, Luis Paulo Reis ..... 545

Assessing Separation of Duty Policies through the Interpretation of Sampled Video Sequences: A Pair Programming Case Study
Marco Anisetti, Valerio Bellandi, Ernesto Damiani, Gabriele Gianini ..... 555

Trellis Based Real-Time Depth Perception Chip Using Interline Constraint
Sungchan Park, Hong Jeong ..... 565

Simple Perceptually-Inspired Methods for Blob Extraction
Paolo Falcoz ..... 577

LOGOS: A Multimodal Dialogue System for Controlling Smart Appliances
Theodoros Kostoulas, Iosif Mporas, Todor Ganchev, Nikos Katsaounos, Alexandros Lazaridis, Stavros Ntalampiras, Nikos Fakotakis ..... 585

One-Channel Separation and Recognition of Mixtures of Environmental Sounds: The Case of Bird-Song Classification in Composite Soundscenes
Ilyas Potamitis ..... 595

Evaluating the Next Generation of Multimedia Software
Ray Adams ..... 605

Evaluation Process and Results of a Middleware System for Accessing Digital Music LIbraries in MObile Services
P.S. Lampropoulou, A.S. Lampropoulos, G.A. Tsihrintzis ..... 615

Interactive Systems, Design and Heuristic Evaluation: The Importance of the Diachronic Vision
Francisco V. Cipolla Ficarra, Miguel Cipolla Ficarra ..... 625

Author Index ..... 635
Morphic Computing

Germano Resconi
Catholic University, Brescia, Italy
[email protected]
Abstract. In this paper, we introduce a new type of computation called "Morphic Computing". Morphic Computing is based on Field Theory and, more specifically, on Morphic Fields. Morphic Fields were first introduced by Rupert Sheldrake [1981] based on his hypothesis of formative causation, which makes use of the older notion of Morphogenetic Fields. Rupert Sheldrake [1981] developed his famous theory, Morphic Resonance, on the basis of the work of the French philosopher Henri Bergson. Morphic Fields and their subset, Morphogenetic Fields, have been at the center of controversy for many years in mainstream science, and the hypothesis is not accepted by some scientists, who consider it pseudoscience. We claim that Morphic Computing is a natural extension of Holographic Computation, Quantum Computation, Soft Computing, and DNA Computing. All natural computations bounded by the Turing Machine can be formalised and extended by our new computation model – Morphic Computing. In this paper, we introduce the basis for our new computing paradigm – Morphic Computing – together with its extensions, such as Quantum Logic and Entanglement in Morphic Computing, Morphic Systems and Morphic System of Systems (M-SOS), and its applications to the field of computation with words as an example of Morphic Computing, Morphogenetic Fields in neural networks and Morphic Computing, Morphic Fields – concepts and Web search, and agents and fuzziness in Morphic Computing.

Keywords: Morphic Computing, Morphogenetic Computing, Morphic Fields, Morphogenetic Fields, Quantum Computing, DNA Computing, Soft Computing, Computing with Words, Morphic Systems, Morphic Network, Morphic System of Systems.
1 Introduction

Inspired by the work of the French philosopher Henri Bergson, Rupert Sheldrake [1981] developed his famous theory, Morphic Resonance. His work on Morphic Fields, which is based on the Morphic Resonance theory, was published in his well-known book "A New Science of Life: The Hypothesis of Morphic Resonance" (1981, second edition 1985). The Morphic Fields of Rupert Sheldrake [1981] are based on his hypothesis of formative causation, which makes use of the older notion of Morphogenetic Fields. Morphic Fields and their subset, Morphogenetic Fields, have been at the centre of controversy for many years in mainstream science, and the hypothesis is not accepted by some scientists, who consider it pseudoscience. The Morphogenetic Field is a hypothetical biological field, used by environmental biologists since the 1920s, that deals with living things. However, Morphic Fields are more general than Morphogenetic Fields and are defined as universal information for both organic (living) and abstract forms.
Sheldrake defined Morphic and Morphogenetic Fields in his book, The Presence of the Past [1988], as follows:

"The term [Morphic Fields] is more general in its meaning than Morphogenetic Fields, and includes other kinds of organizing fields in addition to those of morphogenesis; the organizing fields of animal and human behaviour, of social and cultural systems, and of mental activity can all be regarded as Morphic Fields which contain an inherent memory." – Sheldrake [1988]

Our new computation paradigm – Morphic Computing – is based on Field Theory and, more specifically, on Morphic Fields. We claim that Morphic Computing is a natural extension of Holographic Computation, Quantum Computation, Soft Computing, and DNA Computing. We also claim that all natural computations bounded by the Turing Machine can be formalised and extended by our new computation model – Morphic Computing. In this paper, we first introduce the basis for our new computing paradigm – Morphic Computing – based on Field Theory. Then we introduce its extensions, such as Quantum Logic and Entanglement in Morphic Computing, and Morphic Systems and Morphic System of Systems (M-SOS). Then Morphic Computing's applications to the field of computation with words are given. Finally, we present Morphogenetic Fields in neural networks and Morphic Computing, Morphic Fields – concepts and Web search, and agents and fuzziness in Morphic Computing.
2 Morphic Computing and Field Theory: Classical and Modern Approach

2.1 Fields

In this paper, we assume that computing is not always related to symbolic entities such as numbers, words or other symbols. Fields as entities are more complex than any symbolic representation of knowledge. For example, Morphic Fields include the universal database for both organic (living) and abstract (mental) forms.

In classical physics, we represent the interaction among particles by local forces that are the cause of the movement of the particles. In classical physics, it is also more important to know at any moment the individual values of the forces than the structure of the forces. This approach considers the particles to be independent of the other particles under the effect of the external forces. With the further development of particle physics, however, researchers discovered that forces are produced by intermediate entities that are not located at one particular point of space but are at every point of a specific space at the same time. These entities are called "Fields". In this new theory, the structure of the fields is more important than the value of the force itself at any point. In this representation of the universe, any particle at any position is under the effect of the fields; the fields therefore connect all the particles of the universe into one global entity. However, if any particle is under the effect of the other particles, every local invariant property disappears, because every system is open and it is not possible to close any local system. To solve this invariance problem, scientists discovered that the local invariant can be conserved through a deformation of the local geometry and the metric of the space.
The change of local geometry compensates for the change of the invariant due to the field. We can assume that the action of the fields can be compensated by a deformation of the reference at any point. Thus we cannot use only one global reference; we have an infinite number of references, one at each point. Any particle is under the action of the fields; however, the references that we have chosen change in space and time in a way that compensates the action of the field. In this case, the whole reference space has been changed and the action of the field is completely compensated; the action of the field is hidden in the change of the reference. We thus obtain a different reference space whose geometry in general is non-Euclidean. With quantum phenomena, the problem becomes more complex, because the particles are correlated with one another in a much more hidden way, without any physical interaction through fields. This correlation, or entanglement, generates a structure inside the universe for which the probability of detecting a particle is a virtual or conceptual field that covers the entire Universe.

2.2 Morphic Computing: Basis for Quantum, DNA, and Soft Computing

Gabor [1972] and H. Fatmi and Resconi [1988] discovered the possibility of computing images made of a huge number of points as output, from objects given as sets of a huge number of points as input, by reference beams or lasers (holography). It is also known that a set of particles can have a huge number of possible states given by the positions and momenta of the particles. In classical physics, it is impossible to have two different states at the same time: one or more particles cannot simultaneously have different positions and different momenta, so the states are separate from one another. In quantum mechanics, one can have a superposition of all states: all states are present in the superposition at the same time. It is also very important to note that one cannot separate the superposed states into individual entities; one must consider all the states as a single entity. Because of this very peculiar property of quantum mechanics, one can change all the superposed states at the same time. This type of global computation is the conceptual principle on which we think quantum computers can be built. Similar phenomena can be used to develop DNA computation, where a huge number of DNA strands, as a field of DNA elements, are transformed (replication) at the same time and filtered (selection) to solve non-polynomial problems. In addition, soft computing, or computation with words, extends the classical local definition of the true and false values of a logic predicate to a field of degrees of truth and falsity inside the space of all possible values of the predicates. In this way, the computational power of soft computing is extended in a manner similar to the computational power found in quantum computing, DNA computing, and holographic computing. In conclusion, one can expect that all the previous approaches and models of computing are examples of a more general computation model called "Morphic Computing", where "morphic" means "form" and is associated with the ideas of holism, geometry, field, superposition, globality and so on.

2.3 Morphic Computing and Conceptual Fields – Non-physical Fields

Morphic Computing changes or computes fields that are not physical but conceptual. One example is the representation of the semantics of words.
In this case, a field is generated by a word or a sentence acting as a source. For example, in a library, the reference space is where the documents are located. For any given word, we define the field as a map from the positions of the documents in the library to the number of occurrences (values) of the word in each document. The word or source is located at one point of the reference space (query), but the field (answer) can be located in any part of the reference space. Complex strings of words (structured queries) generate a complex field or complex answer, whose structure can be obtained by the superposition of the fields of the individual words as sources with different intensities. Any field is a vector in the space of the documents. A set of basic fields spans a vector space and forms a concept. We break with the traditional idea that a concept is one word in the conceptual map. The internal structure (entanglement) of the concept is the relation of dependence among the basic fields. An ambiguous word is the source (query) of a fuzzy set (field or answer).

2.4 Morphic Computing and Natural Languages – Theory of Generalized Constraint

In a particular case, we know that a key assumption in computing with words is that the information conveyed by a proposition expressed in a natural language, or by a word, may be represented as a generalized constraint of the form "X isr R", where X is a constrained variable, R is a constraining relation, and r is an indexing variable whose value defines the way in which R constrains X. Thus, if p is a proposition expressed in a natural language, then "X isr R" represents the meaning of p, or equivalently, the information conveyed by p. The generalised constraint model can therefore be represented by field theory in the following way: the meaning of any natural proposition p is given by the space X of the fields that form a concept in the reference (or objective) space, and by a field R in the same reference space. We note that a concept is not only a word, but a domain or context X in which the proposition p, represented by the field R, is located. In this new picture, the word is not a passive entity but an active one: the word is the source of the field. We can also use the idea that the word as an abstract entity is a query, and the field, as the set of instances of the query, is the answer.

2.5 Morphic Computing and Agents – Non-classical Logic

In the agent picture, where only one word (query) is used as a source for each agent, the field generated by the word (answer) is a Boolean field (the values at all points are true or false). Therefore, we can compose words by logic operations to create a complex Boolean expression, or complex Boolean query. This query generates a Boolean field for each agent. The set of agents creates a set of elementary Boolean fields whose superposition is a fuzzy set, represented by a field with fuzzy values. This field is the answer to the ambiguous structured query whose source is the complex expression p. Fields with fuzzy values for complex logic expressions are coherent with traditional fuzzy logic, but with more conceptual transparency, because they are founded on agents and Boolean logic structure. As [Nikravesh, 2006] points out, the Web is a large, unstructured and in many cases conflicting set of data. So in the World Wide Web, fuzzy logic and fuzzy sets are an essential part of queries, and also of finding appropriate searches to obtain the relevant answer.
For the agent interpretation of fuzzy sets, the net of the Web is structured as a set of conflicting and in many cases irrational agents whose task is to create a concept. Agents produce actions to create answers for ambiguous words in the Web. A structured query in RDF can be represented as a graph of three elementary concepts, subject, predicate and complement, in a conceptual map. Every word and relationship in the conceptual map is a variable whose value is a field, and the superposition of these fields gives the answer to the query. Because we are more interested in the meaning of the query than in how we write the query itself, we are more interested in the field than in how we produce the field from the query; in fact, different linguistic representations of the same query can give the same field or answer.

In the construction of the answer to a query, we use symbolic words as sources of semantic fields with different intensities. Given a query as a chain of words, we activate the words as instruments to detect semantic sources in the context of documents, for example in the Web. The words of the query are diffused over many different documents. From the sources, as a semantic hologram in the Web, we can activate a process by which we generate other words located in the same documents. The fields of the words in the same documents are superposed in a way that generates the answer to the query. The localisation of the words of the query inside the Web can be denoted as a WRITE process: we represent, or write, each individual word of the query inside the space of the Web as a semantic field, one for each word. Afterwards we have the READ process, in which we generate other semantic fields for other words from the fields of the query words inside the Web. The READ process gives us the answer. In analogy with holography, the WRITE process is the construction of the hologram when we know the light field of the object, and the READ process is the construction of the light-field image from the hologram. In holography, the READ process uses a beam of coherent light, such as a laser, to obtain the image. In our structured query, the words inside the text are activated at the same time; the words as sources are coherent in the construction, by superposition, of the desired answer or field. The field image of the computation with words, in a crisp and fuzzy interpretation, prepares the implementation of the Morphic Computing approach to computation with words. In this way, we have presented an example of the meaning of the new type of computation, "Morphic Computing".

2.6 Morphic Computing: Basic Concepts

Morphic Computing is based on the following concepts:

1) The concept of field in the reference space.
2) The fields as points or vectors in the N-dimensional Euclidean space of the objects (points).
3) A set of M ≤ N basis fields in the N-dimensional space. The M basis fields are vectors in the N-dimensional space and form a non-Euclidean subspace H (context) of the space of dimension N. The coordinates Sα of the field X in M are the contravariant components of the field X. The components of X in M are also the intensities of the sources of the basis fields. The superposition of the basis fields with different intensities gives us the projection Q of X, or Y = QX, into the space H. When M < N, the projection of X into H defines a constraint, or relation, among the components of Y.
4) With tensor calculus on the components Sα of the vector X, or on the components of more complex entities such as tensors, we can generate invariants for any unitary transformation of the object space or change of the basis fields.
5) Given two projection operators Q1, Q2 on two spaces H1, H2 with dimensions M1 and M2, we can generate the space of dimension M = M1 M2 with the product of Y1 and Y2, or Y = Y1 Y2. Any projection Q into the space H, or Y = QX, of the product of the basis fields generates Y. When Y ≠ Y1 Y2, the output Y is in an entangled state and cannot be separated into the two projections Q1 and Q2.
6) The logic of the Morphic Computing entity is the logic of the projection operators, which is isomorphic to quantum logic.
The information can be coded inside the basis fields by the relation among the basis fields. In Morphic Computing, this relation is represented by a non-Euclidean geometry whose metric, or expression of the distance between two points, shows the relation. The projection operator is similar to a measurement in quantum mechanics, and it can introduce constraints among the components of Y. The sources are the instrument for controlling the image Y in Morphic Computing. There is a deep analogy between Morphic Computing and computation by holography, as well as computation by secondary sources (Jessel) in physical fields.

The computation of Y from X, via the projection operator Q that projects X into the space H, gives its result when Y is similar to X. In this case, the sources S are the solution of the computation. We see the analogy with neural networks, where the solution is to find the weights wk at the synapses; in this paper, we show that the weights are sources in Morphic Computing. It is also possible to compose different projection operators in a network of Morphic Systems, and it is natural to consider such a system as a System of Systems.

Any Morphic Computation is always context-dependent, where the context is H. With the projection operator Q, we project the input query X into the context H. In the hologram, the projection operator is the process by which, from the object X, through the hologram, we obtain the image Y. The context H is the space of the possible images that are generated by the hologram as sources. We remark that the images are a subset of the objects: in the image we lose information that is located in the object. We can control the context in a way to obtain desired results. When any projection of X, or QX, is regarded as a measurement, in analogy with quantum mechanics, any projection operator loses information, but its result can be seen by the instruments. Also in holography, the image is the projection of the object: we lose information, but from the image we can obtain information, as with an instrument, about the properties of the object. In the measurement analogy, any measurement depends on the previous measurements; any measurement is dependent on the path of measurements, or projection operators, realised before, that is, on the history. So we can say that different projection operators form a story (see Roland Omnès' stories in quantum mechanics). As in any story, we lose information, but we can use the story to obtain information and properties of the original phenomena or objects. The measurement analogy also gives us another intuitive idea of Morphic Computing: a measurement becomes a good measurement when it gives us an image Y of the real phenomenon X that is similar to it.
When the internal rules of X are not destroyed in the measurement process, the measurement is a good one. The same holds for Morphic Computing: the computation is a good computation when the projection operator does not destroy the internal relations of the input field X. The projection operator, as in holography, also gives us a model of the object. In fact, the image results from a construction process in which we use sources to generate the image. A change or transformation of the sources gives new images, which we can regard as computed by the sources and the transformation. So the sources give us the model of the object, by which we can generate new images of the same object. The analogy with measurement in quantum mechanics is also useful to explain the concept of Morphic Computing, because the instrument in a quantum measurement is the fundamental context that interferes with the physical phenomenon, just as H interferes with the input field X. A deeper connection exists between the projection-operator lattice that represents quantum logic and Morphic Computing processes (see Eddie Oshins). Moreover, any fuzzy set is a scalar field of membership values on the factors (reference space) (Wang and Sugeno), and we recall that any concept can be viewed as a fuzzy set in the factor space. So for fuzzy sets we can introduce all the processes and concepts that we use in Morphic Computing. Through the relation between concept and field, we introduce into field theory an intrinsic fuzzy logic. So in Morphic Computing we have an external logic of the projection or measurement (quantum logic) and a possible internal fuzzy logic arising from the fuzzy interpretation of the fields. Finally, because we also use superpositions of agents to define fuzzy sets and fuzzy rules, we can again use Morphic Computing to compute agent inconsistency and irrationality. So fuzzy sets and fuzzy logic are part of the more general computation denoted Morphic Computing.
3 Reference Space, Space of the Objects, and Space of the Fields in Morphic Computing

Given the n-dimensional reference space (R1, R2, …, Rn), any point P = (R1, R2, …, Rn) is an object. We then create the space of the objects, whose dimension is equal to the number of points and whose coordinate values are the values of the field at those points. We call this space the "space of the objects". Inside the space of the objects, we can locate any type of field as a vector. In field theory, we assume that any complex field can be considered as a superposition of prototype fields whose models are well known. The prototype fields are vectors in the space of the objects that form a new reference, or field space. In general, the field space is a non-Euclidean space. In conclusion, any complex field Y can be written as

$$Y = S_1 H_1(R_1, \dots, R_n) + S_2 H_2(R_1, \dots, R_n) + \dots + S_n H_n(R_1, \dots, R_n) = H(R)\, S \qquad (1)$$
In equation (1), H1, H2, …, Hn are the basic, or prototype, fields and S1, S2, …, Sn are the weights, or source values, of the basic fields. We assume that any basic field is generated by a source, and that the intensity of a prototype field is proportional to the intensity of the source that generates it.

3.1 Example of the Basic Fields and Sources

In Figure 1, we show an example of two different basic fields in a two-dimensional reference space (x, y). The general equation of the fields is
$$F(x, y) = S\, e^{-h\left((x - x_0)^2 + (y - y_0)^2\right)} \qquad (2)$$
where the parameters of the field F1 are S = 1, h = 2, x0 = −0.5 and y0 = −0.5, and the parameters of the field F2 are S = 1, h = 2, x0 = 0.5 and y0 = 0.5.
Fig. 1. Two different basic fields F1 and F2 in the two-dimensional reference space (x, y)
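A minimal NumPy sketch of equation (2) and of the superposed fields shown in Figures 1 and 2 (the grid bounds and resolution below are arbitrary illustrative choices, not taken from the chapter):

```python
import numpy as np

def basic_field(x, y, S=1.0, h=2.0, x0=0.0, y0=0.0):
    # Prototype field of equation (2): a Gaussian bump of intensity S centred at (x0, y0)
    return S * np.exp(-h * ((x - x0) ** 2 + (y - y0) ** 2))

# Reference space sampled on a grid of points (the "objects")
x, y = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))

F1 = basic_field(x, y, h=2.0, x0=-0.5, y0=-0.5)   # first basic field of Figure 1
F2 = basic_field(x, y, h=2.0, x0=0.5, y0=0.5)     # second basic field of Figure 1

F_a = 1.0 * F1 + 1.0 * F2   # superposition with sources S1 = 1, S2 = 1 (Figure 2)
F_b = 1.0 * F1 + 2.0 * F2   # superposition with sources S1 = 1, S2 = 2 (Figure 2)
```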
For the sources S1 = 1 and S2 = 1, the superposition field F shown in Figure 2 is F = F1 + F2. For the sources S1 = 1 and S2 = 2, the superposition field F, shown again in Figure 2, is F = F1 + 2 F2.

3.2 Computation of the Sources
To compute the sources Sk, we represent the prototype fields Hk and the input field X in Table 1, where the objects are the points and the attributes are the fields. The values in Table 1 are represented by the following matrices:
$$H = \begin{bmatrix} H_{1,1} & H_{1,2} & \dots & H_{1,N} \\ H_{2,1} & H_{2,2} & \dots & H_{2,N} \\ \dots & \dots & \dots & \dots \\ H_{M,1} & H_{M,2} & \dots & H_{M,N} \end{bmatrix}, \qquad X = \begin{bmatrix} X_1 \\ X_2 \\ \dots \\ X_M \end{bmatrix}$$
Fig. 2. Example of superposition of the elementary fields F1, F2: F = F1 + F2 and F = F1 + 2 F2

Table 1. Field values for M points in the reference space

        H1       H2       ...    HN       Input field X
P1      H1,1     H1,2     ...    H1,N     X1
P2      H2,1     H2,2     ...    H2,N     X2
...     ...      ...      ...    ...      ...
PM      HM,1     HM,2     ...    HM,N     XM
The matrix H expresses the relation between the prototype fields Hk and the points Ph. At this point, we are interested in computing the sources S that give the best linear model of X in terms of the elementary field values. Therefore, we have the superposition expression
$$Y = S_1 \begin{bmatrix} H_{1,1} \\ H_{2,1} \\ \dots \\ H_{M,1} \end{bmatrix} + S_2 \begin{bmatrix} H_{1,2} \\ H_{2,2} \\ \dots \\ H_{M,2} \end{bmatrix} + \dots + S_n \begin{bmatrix} H_{1,n} \\ H_{2,n} \\ \dots \\ H_{M,n} \end{bmatrix} = H S \qquad (3)$$
Then we compute the best sources S, namely those for which the difference Y − X has the minimum distance over all possible choices of the set of sources. It is easy to show that the best sources are given by

$$S = (H^T H)^{-1} H^T X \qquad (4)$$
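Equation (4) is the ordinary least-squares solution; a minimal NumPy sketch (the random H and X below are placeholders chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
M, n = 100, 5                  # M points (objects), n prototype fields
H = rng.normal(size=(M, n))    # columns: prototype fields sampled at the M points
X = rng.normal(size=M)         # input field sampled at the same points

# Sources of equation (4): S = (H^T H)^{-1} H^T X
S = np.linalg.solve(H.T @ H, H.T @ X)
# Equivalently (and more stably): S, *_ = np.linalg.lstsq(H, X, rcond=None)

Y = H @ S                      # output field Y = H S = Q X, the projection of X onto H
```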
Given the previous discussion and field presentation, the elementary Morphic Computing element is given by the input-output system as shown in Figure 3.
Fig. 3. Elementary Morphic Computing: an input field X enters, the sources S = (H^T H)^{-1} H^T X are computed with respect to the prototype fields H(R), and the output field Y = H S = Q X is produced
Fig. 4. Network of Morphic Computing: three sets of prototype fields H, with sources S1, S2, S3, connect the input field X to the output field Y
Figure 4 shows a network of elementary Morphic Computing units with three sets of prototype fields and three types of sources, with one general field X as input, one general field Y as output, and intermediate fields between X and Y.
When H is a square matrix, we have Y = X and

$$S = H^{-1} X \quad \text{and} \quad Y = X = H S \qquad (4)$$
Now, for any elementary computation in Morphic Computing, we have the following three fundamental spaces:

1) the reference space;
2) the space of the objects (points);
3) the space of the prototype fields.

Figure 5 shows a very simple geometric example in which the number of objects is three (P1, P2, P3) and the number of prototype fields is two (H1, H2). The space whose coordinates are the two fields is the space of the fields.
Fig. 5. The fields H1 and H2 span the space of the fields. The coordinates of the vectors H1 and H2 are the values of the fields at the three points P1, P2, P3.
Please note that the output Y = H S is the projection of X onto the space H:

$$Y = H (H^T H)^{-1} H^T X = Q X$$

with the property Q² X = Q X. Therefore, the input X can be separated into two parts,

$$X = Q X + F,$$

where the vector F is perpendicular to the space H, as we can see in the simple example given in Figure 6.

Fig. 6. Projection operator Q and output of the elementary Morphic Computing. We see that X = Q X + F, where the sum is the vector sum.
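The projection and the perpendicular decomposition can be verified numerically; a minimal sketch, again with random placeholder data:

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.normal(size=(100, 5))
X = rng.normal(size=100)

# Projection operator Q = H (H^T H)^{-1} H^T
Q = H @ np.linalg.solve(H.T @ H, H.T)

Y = Q @ X          # projection of X onto the space spanned by the prototype fields
F = X - Y          # residual component, perpendicular to that space

assert np.allclose(Q @ Q, Q)       # idempotence: Q^2 = Q
assert np.allclose(H.T @ F, 0)     # F is orthogonal to every prototype field
```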
Now we try to extend the expression of the sources in the following way. Given G(Γ) = Γ^T Γ and G(H) = H^T H, let

$$S^* = \left[\, G(H) + G(\Gamma) \,\right]^{-1} H^T X.$$

So, for S* = (H^T H)^{-1} H^T X + Ω = S + Ω, we have G(Γ) S + [G(H) + G(Γ)] Ω = 0, and

$$S^* = S + \Omega = (H^T H + \Gamma^T \Gamma)^{-1} H^T X,$$

where Ω is a function of S through the equation G(Γ) S + [G(H) + G(Γ)] Ω = 0. For a non-square and/or singular matrix, we can use the generalized model given by Nikravesh [] as follows:
$$S^* = (H^T \Lambda^T \Lambda H)^{-1} H^T \Lambda^T \Lambda X = \left((\Lambda H)^T (\Lambda H)\right)^{-1} (\Lambda H)^T \Lambda X$$
where Λ transforms both the input and the references H. The value of the variable D (the metric of the space of the fields) is computed by the expression

$$D^2 = (H S)^T (H S) = S^T H^T H S = S^T G S = (Q X)^T Q X \qquad (5)$$
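A minimal sketch of the two generalized source computations and of the metric of equation (5); the choices of Γ (a small ridge-style term) and Λ (a diagonal weighting) below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
M, n = 100, 5
H = rng.normal(size=(M, n))
X = rng.normal(size=M)

# Regularized sources: S* = (H^T H + Gamma^T Gamma)^{-1} H^T X
Gamma = 0.1 * np.eye(n)                        # illustrative choice of Gamma
S_star = np.linalg.solve(H.T @ H + Gamma.T @ Gamma, H.T @ X)

# Weighted sources: S* = ((Lambda H)^T (Lambda H))^{-1} (Lambda H)^T Lambda X
Lam = np.diag(rng.uniform(0.5, 1.5, size=M))   # illustrative weighting Lambda
LH = Lam @ H
S_w = np.linalg.solve(LH.T @ LH, LH.T @ (Lam @ X))

# Metric of equation (5): D^2 = S^T G S with G = H^T H
S = np.linalg.solve(H.T @ H, H.T @ X)
D2 = S @ (H.T @ H) @ S
```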
For a unitary transformation U, for which U^T U = I, the prototype fields change as

$$H' = U H, \qquad G' = (U H)^T (U H) = H^T U^T U H = H^T H,$$

and

$$S' = \left[\, (U H)^T (U H) \,\right]^{-1} (U H)^T Z = G^{-1} H^T U^T Z = G^{-1} H^T (U^{-1} Z).$$

For Z = U X we have S' = S, and the variable D is invariant under the unitary transformation U.
We remark that G = H^T H is a square matrix that gives the metric tensor of the space of the fields. When G is diagonal, the elementary fields are all independent of one another. But when G has non-diagonal elements, the elementary fields are mutually dependent: among the elementary fields there is a correlation or relationship, and the geometry of the space of the fields is non-Euclidean.
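The invariance of S and D under a unitary transformation is easy to check numerically; a minimal sketch using a random orthogonal matrix as U:

```python
import numpy as np

rng = np.random.default_rng(3)
M, n = 100, 5
H = rng.normal(size=(M, n))
X = rng.normal(size=M)

# Random orthogonal (real unitary) transformation: U^T U = I
U, _ = np.linalg.qr(rng.normal(size=(M, M)))

def sources(H, X):
    return np.linalg.solve(H.T @ H, H.T @ X)

S  = sources(H, X)
S2 = sources(U @ H, U @ X)     # transformed prototype fields with Z = U X

D2  = S  @ (H.T @ H) @ S
D2b = S2 @ ((U @ H).T @ (U @ H)) @ S2

assert np.allclose(S, S2)      # S' = S
assert np.allclose(D2, D2b)    # D is invariant under U
```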
4 Quantum Logic and Entanglement in Morphic Computing

In Morphic Computing, we can compute on the context H just as we compute on a Hilbert space, and we have an algebra on the contexts, or spaces, H. In fact, we have

H = H1 ⊕ H2,

where ⊕ is the direct sum. For example, given

H1 = (h1,1, h1,2, …, h1,p), where the h1,k are the basis fields in H1,
H2 = (h2,1, h2,2, …, h2,q), where the h2,k are the basis fields in H2,

we have

H = H1 ⊕ H2 = (h1,1, h1,2, …, h1,p, h2,1, h2,2, …, h2,q).

The intersection of the contexts is H = H1 ∩ H2; the space H is the subspace common to H1 and H2. In fact, for the sets V1 and V2 of vectors

V1 = S1,1 h1,1 + S1,2 h1,2 + … + S1,p h1,p,
V2 = S2,1 h2,1 + S2,2 h2,2 + … + S2,q h2,q,

the space or context H = H1 ∩ H2 includes all the vectors in V1 ∩ V2. Given a space H, we can also build the orthogonal space H⊥ of the vectors that are orthogonal to all vectors in H. We then have the following logic structure:

Q(H1 ⊕ H2) = Q1 ∨ Q2 = Q1 OR Q2,

where Q1 is the projection operator on the context H1 and Q2 is the projection operator on the context H2;

Q(H1 ∩ H2) = Q1 ∧ Q2 = Q1 AND Q2;

Q(H⊥) = ID − Q = ¬Q = NOT Q.

In fact, we know that Q X − X = (Q − ID) X is orthogonal to Y and thus orthogonal to H; up to sign, this is the action of ID − Q, which is the NOT operator.
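A minimal sketch of these three lattice operations on contexts and their projectors; the numerical construction of the meet (via the eigenvalue-2 eigenspace of Q1 Q2 + Q2 Q1) is one possible implementation choice, not prescribed by the text:

```python
import numpy as np

def projector(H):
    # Q = H (H^T H)^{-1} H^T, projection onto the column space of H
    return H @ np.linalg.solve(H.T @ H, H.T)

def join(H1, H2):
    # Q(H1 ⊕ H2): projector onto the span of both bases together
    return projector(np.hstack([H1, H2]))

def meet(Q1, Q2, tol=1e-10):
    # Q(H1 ∩ H2): projector onto the common subspace. A unit vector v lies in
    # both ranges iff Q1 v = Q2 v = v, i.e. iff (Q1 Q2 + Q2 Q1) v = 2 v.
    w, V = np.linalg.eigh(Q1 @ Q2 + Q2 @ Q1)
    common = V[:, w > 2 - tol]
    n = Q1.shape[0]
    return common @ common.T if common.size else np.zeros((n, n))

def complement(Q):
    # NOT: projector onto H⊥, the range of ID − Q
    return np.eye(Q.shape[0]) - Q
```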
It is easy to show [9] that the logic of the projection operators is isomorphic to quantum logic and forms the operator lattice for which the distributive law (interference) does not hold. In Figure 7, we show an expression in the projection lattice for Morphic Computing.
Fig. 7. Expression in the projection lattice for the Morphic Computing entity: the field X, with sources S, is projected as Q X = [(Q1 ∨ Q2) ∧ Q3] X = Y, with prototype fields H = (H1 ⊕ H2) ∩ H3
Now we give an example of the projection logic and lattice as follows. Given the elementary field references
$$H_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad H_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad H_3 = \begin{bmatrix} \tfrac{1}{2} \\ \tfrac{1}{2} \end{bmatrix}$$

we have the projection operators

$$Q_1 = H_1 (H_1^T H_1)^{-1} H_1^T = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \qquad Q_2 = H_2 (H_2^T H_2)^{-1} H_2^T = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}, \qquad Q_3 = H_3 (H_3^T H_3)^{-1} H_3^T = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}$$

With the lattice logic we have
$$H_{1,2} = H_1 \oplus H_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad Q_1 \vee Q_2 = H_{1,2} (H_{1,2}^T H_{1,2})^{-1} H_{1,2}^T = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

$$H_{1,3} = H_1 \oplus H_3 = \begin{bmatrix} 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} \end{bmatrix}, \qquad Q_1 \vee Q_3 = H_{1,3} (H_{1,3}^T H_{1,3})^{-1} H_{1,3}^T = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

$$H_{2,3} = H_2 \oplus H_3 = \begin{bmatrix} 0 & \tfrac{1}{2} \\ 1 & \tfrac{1}{2} \end{bmatrix}, \qquad Q_2 \vee Q_3 = H_{2,3} (H_{2,3}^T H_{2,3})^{-1} H_{2,3}^T = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

and
$$H = H_1 \cap H_2 = H_1 \cap H_3 = H_2 \cap H_3 = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \quad \text{and} \quad Q_1 \wedge Q_2 = Q_1 \wedge Q_3 = Q_2 \wedge Q_3 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$$

In conclusion, we have the lattice
        Q1 ∨ Q2 = Q1 ∨ Q3 = Q2 ∨ Q3 = I
       /              |              \
     Q1              Q2              Q3
       \              |              /
        Q1 ∧ Q2 = Q1 ∧ Q3 = Q2 ∧ Q3 = 0

We remark that (Q1 ∨ Q2) ∧ Q3 = Q3, but (Q1 ∧ Q3) ∨ (Q2 ∧ Q3) = 0 ∨ 0 = 0. When we try to separate Q1 from Q2 in the second expression, the result changes. Between Q1 and Q2 there is a connection, or relation (Q1 and Q2 together generate the two-dimensional space), that we destroy when we separate one from the other. In fact, Q1 ∧ Q3 projects onto the zero point, and a union of zero points cannot create the two-dimensional space. The non-distributive property indicates that among the projection operators there is an entanglement, or relation, that we destroy when we separate the operators from one another.
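The failure of distributivity in this example can be checked directly; a minimal sketch with the three references above:

```python
import numpy as np

def projector(H):
    return H @ np.linalg.solve(H.T @ H, H.T)

H1 = np.array([[1.0], [0.0]])
H2 = np.array([[0.0], [1.0]])
H3 = np.array([[0.5], [0.5]])

Q3 = projector(H3)

# Join: projector onto the span of the concatenated bases (here the whole plane)
Q12 = projector(np.hstack([H1, H2]))        # Q1 ∨ Q2 = identity

# Meets: H1 ∩ H3 and H2 ∩ H3 are the zero subspace, so both meets are 0
Z = np.zeros((2, 2))                        # Q1 ∧ Q3 = Q2 ∧ Q3 = 0

lhs = Q12 @ Q3                              # (Q1 ∨ Q2) ∧ Q3: meet with identity is Q3
rhs = Z + Z                                 # (Q1 ∧ Q3) ∨ (Q2 ∧ Q3) = 0 ∨ 0 = 0

assert np.allclose(lhs, Q3) and not np.allclose(lhs, rhs)   # distributivity fails
```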
Given two references, or contexts, H1 and H2, the tensor product H = H1 ⊗ H2 is the composition of the two independent contexts into one. We can prove that the projection operator of the tensor product H is the tensor product of Q1 and Q2. So we have

Q = H (H^T H)^{-1} H^T = Q1 ⊗ Q2.

The sources are Sαβ = S1α S2β, so we have

Y = Y1 ⊗ Y2 = (H1 ⊗ H2) Sαβ.

The two Morphic Systems are independent of one another, and the output is the product of the two outputs of the individual Morphic Systems. We now give some examples.
H1 = [1 1/2; 1/2 1; 1 1/2], H2 = [1/2 1; 0 1/2; 1/2 1/2],

H1 ⊗ H2 = H1α,β H2γ,δ = [1·H2 (1/2)·H2; (1/2)·H2 1·H2; 1·H2 (1/2)·H2] = Hα,β,γ,δ (a 9 × 4 matrix).
That is, to every entry of the basis fields in H1 we associate the basis field H2 multiplied by that entry. Now, because we have
Q1 = H1(H1ᵀH1)⁻¹H1ᵀ = [1/2 0 1/2; 0 1 0; 1/2 0 1/2],
Q2 = H2(H2ᵀH2)⁻¹H2ᵀ = [2/3 1/3 1/3; 1/3 2/3 −1/3; 1/3 −1/3 2/3],

Q = Q1 ⊗ Q2 = [(1/2)Q2 0·Q2 (1/2)Q2; 0·Q2 1·Q2 0·Q2; (1/2)Q2 0·Q2 (1/2)Q2].
X1 = [1/3 0 1/3]ᵀ, X2 = [1/4 0 1/4]ᵀ, X = X1 ⊗ X2 = [(1/3)X2 0·X2 (1/3)X2]ᵀ,

S1 = (H1ᵀH1)⁻¹H1ᵀX1 = [4/9 −2/9]ᵀ, S2 = (H2ᵀH2)⁻¹H2ᵀX2 = [−3/2 4/3]ᵀ,

and

Sαβ = S1 ⊗ S2 = [(4/9)S2 (−2/9)S2]ᵀ.
For

Y1 = H1S1 = [1/3 0 1/3]ᵀ, Y2 = H2S2 = [7/12 2/3 −1/12]ᵀ,

Y = Y1 ⊗ Y2 = [(1/3)Y2 0·Y2 (1/3)Y2]ᵀ.

In conclusion, the computation of Q, S and Y from H and X can be obtained purely from the results of Q1, S1, Y1 and Q2, S2, Y2 computed independently. When H and X cannot be written as tensor products of H1, X1 and H2, X2, the context H and the input X are not separable into simpler entities, and so they are entangled. With the tensor product, then, we can tell whether two measures or projection operators are dependent on or independent of each other.
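The separability property of this construction can be verified numerically. The sketch below is illustrative only; the matrices follow the values reconstructed from the example above, and the factorization it checks holds for any X1 and X2.

```python
import numpy as np

def sources(H, X):
    # S = (H^T H)^(-1) H^T X: source coordinates of the input X in context H
    return np.linalg.inv(H.T @ H) @ (H.T @ X)

H1 = np.array([[1.0, 0.5], [0.5, 1.0], [1.0, 0.5]])
H2 = np.array([[0.5, 1.0], [0.0, 0.5], [0.5, 0.5]])
X1 = np.array([1/3, 0.0, 1/3])
X2 = np.array([0.25, 0.0, 0.25])

S1, S2 = sources(H1, X1), sources(H2, X2)
Y1, Y2 = H1 @ S1, H2 @ S2

# composite Morphic System: context and input as Kronecker (tensor) products
H = np.kron(H1, H2)
X = np.kron(X1, X2)
S = sources(H, X)
Y = H @ S

# the composite sources and output factor into the component results
assert np.allclose(S, np.kron(S1, S2))
assert np.allclose(Y, np.kron(Y1, Y2))
```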
Thus, for the tensor product of contexts, all the other elements of the entity are likewise obtained by tensor products (Figure 8).
[Figure: query field X = X1 ⊗ X2 and sources S = S1 ⊗ S2 enter the entity with prototype fields H = H1 ⊗ H2, producing QX = Q(X1 ⊗ X2) = Y1 ⊗ Y2 = Y.]
Fig. 8. Tensor product for independent projection operators or measures
5 Conclusion

In this paper we have presented a new type of computation, denoted Morphic Computing. This new type of computation extends and improves on the principles of optical computation by holography. In the holographic process we have one object and one image, the image being the projection of the object. We have given a formal description
of the projection operator together with its hidden logic. Based on this new logic we can implement a new type of computation. The logic of the projection is similar to quantum logic, in which we lose the distributive rule because of interference or superposition among the states. Indeed, in quantum mechanics we lose the exclusion principle by which a particle can assume only one position or momentum at a time: in a superposition state, one particle can have different positions or momenta at the same time. This generates a new type of state, the superposed state, which cannot be found in classical physics. The property of superposition changes the computation process dramatically and gives us a new type of computer, the quantum computer. Morphic Computing extends the quantum computing process to any type of context and any type of query; quantum mechanics is the prototype system for Morphic Computing. Because Morphic Computing is not tied to the physics of the particle domain, we argue that it includes neural computing, soft computing and genetic computing.
References
1. Zadeh, L.A., Nikravesh, M.: Perception-Based Intelligent Decision Systems. Office of Naval Research, Summer 2002 Program Review, Covel Commons, University of California, Los Angeles, July 30–August 1 (2002)
2. Zadeh, L.A., Kacprzyk, J. (eds.): Computing With Words in Information/Intelligent Systems 1: Foundations. Physica-Verlag, Germany (1999)
3. Zadeh, L.A., Kacprzyk, J. (eds.): Computing With Words in Information/Intelligent Systems 2: Applications. Physica-Verlag, Germany (1999)
4. Resconi, G., Jain, L.C.: Intelligent Agents. Springer, Heidelberg (2004)
5. Nikravesh, M.: Intelligent Computing Techniques for Complex Systems. In: Soft Computing and Intelligent Data Analysis in Oil Exploration, pp. 651–672. Elsevier, Amsterdam (2003)
6. Gabor, D.: Holography 1948–1971. Proc. IEEE 60, 655–668 (1972)
7. Fatmi, H.A., Resconi, G.: A New Computing Principle. Il Nuovo Cimento 101B(2), 239–242 (February 1988)
8. Omnès, R.: The Interpretation of Quantum Mechanics. Princeton Series in Physics (1994)
9. Oshins, E., Ford, K.M., Rodriguez, R.V., Anger, F.D.: A Comparative Analysis: Classical, Fuzzy, and Quantum Logic. In: 2nd Florida Artificial Intelligence Research Symposium (FLAIRS 1989), St. Petersburg, Florida, April 5, 1989. In: Fishman, M.B. (ed.) Advances in Artificial Intelligence Research, vol. II. JAI Press, Greenwich, CT (1992). Most Innovative Paper Award, 1989
10. Jessel, M.: Acoustique Théorique. Masson et Cie Éditeurs (1973)
11. Wang, P.Z., Sugeno, M.: The Factor Fields and Background Structure for Fuzzy Subsets. Fuzzy Mathematics 2, 45–54 (1982)
12. Sheldrake, R.: A New Science of Life: The Hypothesis of Morphic Resonance (1981; second edition 1985)
13. Sheldrake, R.: The Presence of the Past (1988)
14. Resconi, G., Nikravesh, M.: Morphic Computing: Concepts and Foundation. In: Nikravesh, M., Zadeh, L.A., Kacprzyk, J. (eds.) Forging New Frontiers: Fuzzy Pioneers I. Studies in Fuzziness and Soft Computing. Springer, Heidelberg (2007)
15. Resconi, G., Nikravesh, M.: Morphic Computing: Quantum Field. In: Nikravesh, M., Zadeh, L.A., Kacprzyk, J. (eds.) Forging New Frontiers: Fuzzy Pioneers II. Studies in Fuzziness and Soft Computing. Springer, Heidelberg (2007)
16. Resconi, G., Nikravesh, M.: Morphic Computing. Applied Soft Computing Journal (July 2007)
17. Resconi, G., Nikravesh, M.: Morphic Computing Part I: Foundation. In: Melin, P., Castillo, O., Aguilar, L.T., Kacprzyk, J., Pedrycz, W. (eds.) IFSA 2007. LNCS (LNAI), vol. 4529. Springer, Heidelberg (2007)
18. Resconi, G., Nikravesh, M.: Morphic Computing Part II: Web Search. In: Melin, P., Castillo, O., Aguilar, L.T., Kacprzyk, J., Pedrycz, W. (eds.) IFSA 2007. LNCS (LNAI), vol. 4529. Springer, Heidelberg (2007)
Amplifying Video Information-Seeking Success through Rich, Exploratory Interfaces Michael G. Christel School of Computer Science, Carnegie Mellon University 5000 Forbes Ave., Pittsburgh, PA 15213, USA [email protected]
Abstract. Years of international participation and collaboration in TRECVID have shown that interactive multimedia systems, those with a user in the search loop, have consistently outperformed fully automated systems. Interface capabilities like querying by text, by image, and by semantic concept and storyboard layouts have led to significant performance improvements on provided search tasks. Lessons learned for TRECVID shot-based video retrieval are presented. In the real world, however, video collection users may focus on story threads instead of shots, or may not be provided with a clear stated search task. The paper also discusses users facing situations where they lack the knowledge or contextual awareness to formulate queries and having a genuine need for exploratory search systems supporting serendipitous browsing. Interfaces promoting various views for navigating complex information spaces can help with exploratory search and investigation into video corpora ranging from documentaries to broadcast news to oral histories. Work is presented using The HistoryMakers oral history archive as well as TRECVID international broadcast news to discuss the utility of various query and presentation mechanisms emphasizing people, time, location, and visual attributes. The paper leads into a discussion of how exploratory interfaces for video extend beyond storyboards, with a series of user studies referenced as empirical data in support of the presented conclusions. Keywords: Video browsing, digital video retrieval, TRECVID, Informedia, user studies.
1 Introduction

The Informedia research project at Carnegie Mellon University (CMU) has worked since 1994 on various issues related to digital video understanding, tackling search, retrieval, visualization and summarization in both contemporaneous and archival video content collections through speech, image, and natural language understanding [1]. As the interface designer, developer, and evaluator on the Informedia team, the author's role has been to iterate through a number of deployments that leverage the latest advances in machine learning techniques and other approaches to automated video metadata creation. Benchmarking user performance with digital video retrieval to chart progress became much easier with the creation of a TREC video retrieval track in 2001, the subject of Section 2. A number of Informedia user studies have taken place through the years, most often with CMU students and staff as the participants. These studies were surveyed in a 2006 paper reporting on how they can provide a user pull complementing the
technology push as automated video processing advances [2]. Section 3 overviews a few studies, reporting empirical conclusions on video summarization and browsing. Section 4 reports on recent studies on two types of video corpora: international broadcast news as used in TRECVID, and video oral histories from The HistoryMakers. Section 5 presents conclusions and opportunities for future work.
2 TRECVID Interactive Video Search Benchmarking

The Text REtrieval Conference (TREC) was started in 1992 to support the text retrieval industry by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. The same needs for the video retrieval community led to the establishment of the TREC Video Track in 2001. Now an independent evaluation, TRECVID began with the goal to promote progress in content-based retrieval from digital video via open, metrics-based evaluation. The corpora have ranged from documentaries to advertising films to broadcast news, with international participation growing from 12 to 54 companies and academic institutions from 2001 to 2007 [3]. A number of tasks are defined in TRECVID, including shot detection, semantic feature extraction, rush video summarization, and information retrieval.

The Cranfield paradigm of retrieval evaluation is based on a test collection consisting of three components: a set of documents, a set of information need statements called topics, and a set of relevance judgments. The relevance judgments are a list of the “correct answers” to the searches: the documents that should be retrieved for each topic. Success is measured based on quantities of relevant documents retrieved, in particular the metrics of recall and precision. The two are combined into a single measure of performance, average precision, which measures precision after each relevant document is retrieved for a given topic. Average precision is then itself averaged over all of the topics to produce a mean average precision (MAP) metric for evaluating a system's performance.

For TRECVID video searches, the individual “documents” retrieved are shots, where a shot is defined as a single continuous camera operation without an editor's cut, fade or dissolve – typically 2-10 seconds long for broadcast news. The TRECVID search task is defined as follows: given a multimedia statement of information need (a topic) and the common shot reference, return a ranked list of up to 1000 shots from the reference which best satisfy the need. For the interactive search task, the user can view the topic, interact with the system, see results, and refine queries and browsing strategies interactively while pursuing a solution. The interactive user has no prior knowledge of the search test collection or topics. The topics are defined by NIST to reflect many of the sorts of queries real users pose, based on query logs against video corpora like the BBC Archives and other empirical data [3, 4].

Three TRECVID test sets are used in studies cited in Section 3: the TRECVID 2004 test set holds 128 broadcasts (64 hours) of ABC News and CNN video from 1998, consisting of 33,367 reference shots; TRECVID 2005 is 140 international broadcasts (85 hours) of English language, Arabic, and Chinese news from 2004, consisting of 45,765 reference shots; and TRECVID 2006 is similar but with more data: 165 hours of U.S., Arabic, and Chinese news with 79,484 reference shots.
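As an illustration of the metric, average precision and MAP can be computed as in the sketch below (a simplified rendering of the standard formulation, not NIST's evaluation code; the data structures are hypothetical):

```python
def average_precision(ranked_shots, relevant_shots):
    # precision is sampled at the rank of each relevant shot retrieved,
    # then averaged over the number of relevant shots for the topic
    hits, precision_sum = 0, 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant_shots:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_shots) if relevant_shots else 0.0

def mean_average_precision(topic_runs):
    # topic_runs: one (ranked_shots, relevant_shots) pair per topic
    return sum(average_precision(r, rel) for r, rel in topic_runs) / len(topic_runs)
```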
An ACM SIGMM retreat on the future of multimedia information retrieval praised the contributions of TRECVID for international community benchmarking, noting that “repeatable experiments using published benchmarks are required for the field to progress” [5]. The NIST TRECVID organizers are clearly cognizant of issues of ecological validity: the extent to which the context of a user study matches the context of actual use of a system, such that it is reasonable to suppose that the results of the study are representative of actual usage and that the differences in context are unlikely to impact the conclusions. TRECVID user studies can make use of the TRECVID community effort to claim ecological validity in most regards: the data set is real and representative, the tasks (topics) are representative based on prior analysis of BBC and other empirical data, and the processing efforts are well communicated with a set of rules for all to follow. The remaining question of validity is whether the subject pool represents a broader set of users, with university students and staff for the most part comprising the subject pool for many research groups because of their availability.

TRECVID has provided metrics showing the benefits of automated tool support in combination with human manipulation and interpretation for video information retrieval. Without automated tools to support browsing and summarization, the human user is swamped with too many possibilities as the quantity and diversity of video proliferate. Ignoring the human user, though, is a mistake: through the history of TRECVID, fully automated systems involving no human user have consistently and significantly underperformed compared to interactive human-in-the-loop search systems [3]. Over the years, Informedia TRECVID experiments have confirmed the utility of storyboards showing matching thumbnails across multiple video documents [6], documented the differences in expert and novice search behavior when given TRECVID topics [7], confirmed the utility of transcript text for news video topics [8], and revealed that users tend to overlook concept filters (e.g., including or excluding all shots having the “roads” concept or “outdoors” concept) as a way to reduce the shot space [6, 8, 9]. These studies are surveyed in the next section.
3 Evaluation and User Studies with Respect to Video Summarization and Browsing Video summaries have many purposes, summarized well by Taskiran et al. [10] and including the following: intriguing the user to watch the whole video (movie trailers), deciding if the program is worth watching (electronic program guide), locating specific regions of interest (lecture overview), collapsing highly redundant footage into the subset with important information (surveillance executive summary). For most applications video summaries mainly serve two functions [10]: an indicative function, where the summary is used to indicate what topics of information are contained in the original program; and the informative function, where the summaries are used to cover the information in the source program as much as possible, subject to the summary length. This paper focuses on indicative summaries, i.e., the assessment of video surrogates meant to help users better judge the relevance of the source program for their task at hand.
A 1997 Informedia study with 30 high school and college students and a documentary corpus found that a single thumbnail image chosen from query context represents a source document well [11]. It produces faster, more accurate, more satisfying retrieval performance compared to straight text representations or a context-independent thumbnail menu, in which each document is always represented by the same selection strategy of taking the first shot in the document. Figure 1 shows a number of views into TRECVID 2006 data following a query on any of “tornado earthquake flood hurricane volcano.” As an example of a query-based thumbnail, the tornado story as the eleventh result in Figure 1 (3rd row, 3rd thumbnail in segment grid at lower left) starts off with anchorperson and interview shots in a studio that are much less informative visually than the tornado/sky shot shown in Figure 1, with the tornado shot chosen automatically based on the user’s query.
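A minimal sketch of that query-based thumbnail idea (the data schema and scoring are hypothetical simplifications, not the Informedia implementation): score each candidate shot against the query and let the best-matching shot supply the document's thumbnail, falling back to the first shot when nothing matches.

```python
def pick_thumbnail(shots, query_terms):
    # shots: list of dicts with a "transcript" field (hypothetical schema);
    # score each shot by word overlap between its transcript and the query
    terms = {t.lower() for t in query_terms}
    def score(shot):
        return len(set(shot["transcript"].lower().split()) & terms)
    best = max(shots, key=score)
    # context-independent fallback: the document's first shot
    return best if score(best) > 0 else shots[0]
```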
Fig. 1. 417 segments returned from text query against TRECVID 2006 data set, shown in 4 views: Segment Grid; Storyboard of shots filtered to non-map, non-anchor outdoor shots; Common Phrases; and VIBE visualization-by-example plot showing volcano by itself and hurricane aligned with flood
The automatic breakdown of video into component shots has received a great deal of attention from the image processing community [2, 5, 8]. TRECVID has had a shot detection task charting the progress of automatic shot detection since 2001, and has shown it to be one of the most realizable tasks for video processing, with accuracy in excess of 90% [3]. A thumbnail image representing each shot can be arranged into a single chronological display, a storyboard surrogate, which captures the visual flow of a video document along with the locations of matches to a query. From Figure 1's interface, clicking the filmstrip icon in the segment grid displays a storyboard surrogate for just that one segment. The storyboard interface is equivalent to drilling into a document to expose more of its visual details before deciding whether it should be viewed. Storyboards are also navigation aids, allowing the user to click on an
image to seek to and play the video document from that point forward. Informedia storyboards were evaluated primarily through discount usability techniques, two of which were heuristic evaluation and think-aloud protocol, working in the context of TRECVID tasks [6]. Storyboards were found to be an ideal roadmap into a video possessing a number of shots, and very well suited to the TRECVID interactive search task emphasizing the retrieval of shots relevant to a stated task, as evidenced in annual empirical studies run with CMU students and staff as participants [6, 7, 9, 12]. As for ecological validity, in practice users were struggling more with the task of finding the right shot from a collection of videos, rather than just finding the right shot within a single video, once the corpus grew from tens to hundreds to thousands of hours. The obvious interface extension for presentations like Figure 1 is to present all of the shots for a set of video segments in a multiple document storyboard, e.g., all of the shots for the 417 segments of Figure 1. If all shots are shown, though, for even this small set of 417 segments, the storyboard would contain 11503 shots, a much greater number of thumbnails than is likely to be scanned efficiently. Hence, a major difficulty with storyboards is that there are often too many shots to display in a single screen [11, 13]. Rather than show all the shots, only those shots containing matches could be included in a representation for a collection of video, so that rather than needing to show 11503 shots, 661 matching shots could be shown to represent the 417 segments returned in the query shown in Figure 1. Such a storyboard is shown in the upper right, with triangle notches at the top of thumbnails communicating some match context: what matched and where for a given query against this selected video. Storyboards have achieved great success for TRECVID interactive search tasks. Worring et al. from MediaMill report on three alternate forms of shot thumbnail displays for video: the CrossBrowser, SphereBrowser, and GalaxyBrowser [14], with the CrossBrowser evaluating well for TRECVID interactive search [15]. In the Cross Browser, two strips of thumbnails are shown rather than a storyboard grid, with the vertical strip corresponding to a visual concept or search engine ranked ordering and the horizontal strip corresponding to temporal shot order. In the Informedia storyboard interface, the thumbnails are kept the same size and in a packed temporal grid, with the dense layout allowing over two thousand shots to be visually reviewed within the 15-minute TRECVID task time limit, with high task performance [12]. For TRECVID evaluations from 2002 through 2006, storyboard interfaces from Informedia and the MediaMill team have consistently and overwhelmingly produced the best interactive search performance [6, 7, 9, 12, 15]. When given a stated need, a short period of time to fulfill that need, many answer candidates, and an average precision metric to measure success, storyboards produce the best results. Storyboards are the most frequently employed interface into video libraries seen today, but that does not mean that they are sufficient. On the contrary, a 2007 workshop involving the BBC [16] witnessed discussion over the shortcomings of storyboards and the need for playable, temporal summaries and other forms of video surrogates for review and interactive interfaces for control. 
A BBC participant stated that the industry is looking to the multimedia research community for the latest advances into video summarization and browsing. The next section overviews a few Informedia studies showing the need to move beyond storyboards and TRECVID tasks when addressing opportunities with various user communities.
4 Opportunities with Real World Users For some video, like an hour video of a single person talking, the whole video is a single shot of that person’s head, and a storyboard of that one shot provides no navigational value. Such video is typical of oral history interviews. For this material, storyboards are not as useful as other views, such as those shown in Figure 2. These views are interactive, allowing users to browse and explore, e.g., to filter down into a subset of interest as shown in Figure 3. Along with the nature of the video corpus, the nature of the task can modify whether storyboards and other widgets are effective and sufficient interfaces. The HistoryMakers is a non-profit institution headquartered in Chicago whose purpose is to record, preserve and disseminate the content of video oral history interviews highlighting the accomplishments of individual African Americans and African-American-led groups and movements. Life oral histories were used from The HistoryMakers (18,254 stories, 913 hours of video) addressing the question as to whether video adds value beyond transcripts and audio (full details in [17]). The question was operationalized by two variants of the same user interface: one with a still image with the audio and one with the video track with the audio.
Fig. 2. Multiple access strategies into an oral history collection like the Common Phrases, Map, and Event Timeline let the user browse attributes of the set of 725 segments produced from a query on “chemistry biology science”
These two interfaces were used in two user studies conducted with primarily CMU and University of Pittsburgh students. In the first study, 24 participants conducted a treasure hunt task (find the good video for 12 topics), similar to TRECVID topics in that the information need is targeted and expressed to the participant under tight time constraints on performance. There were no statistical differences on a range of metrics (performance and satisfaction), and participants did not seem to even notice the differences between the two systems. This is somewhat surprising and disappointing in a way: the video offered no value with this particular task.
Fig. 3. Working from the same data of Figure 2, the views encourage exploration; 1 story discusses Antarctica, the shown video with Walter Massey
In a follow-up study that considers the effect of task, when the same user interfaces were tested with 14 participants on an exploratory search task (find several stories for a report), there was a significant subjective preference for the video interface. Unlike what occurred with the first study, the subjects were aware of the interface differences and strongly preferred the video treatment over still image with transcript for the exploratory task. These two studies together show the subtle dynamics of task and multimedia stimuli, as discussed in a full report on the work [17]. Reflecting on ecological validity again, what are the real world tasks that history students and other users of The HistoryMakers corpus actually conduct? If they are more of the stated fact-finding nature, video plays little or no role. If they are more of the exploratory generate-research-report nature, then video is much preferred. Ongoing work is taking place to investigate the utility of the views from Figures 2 and 3 with history students, showing a preference for text views like Common Text over more complex views like the Map View or VIBE View on initial interactions with the system.
Returning to Figure 1, TRECVID 2004-2006 tasks, and broadcast news corpora, what real-world users are there, and how do their tasks compare to the TRECVID topics for interactive search? Six intelligence analysts were recruited to participate in an experiment as representatives of a user pool for news corpora: people mining open broadcast sources for information as their profession. These analysts, compared to the university students participating in prior referenced studies, were older, more familiar with TV news, just as experienced with web search systems and frequent web searchers, but less experienced digital video searchers. Their expertise was in mining text sources and text-based information retrieval rather than video search. More details and a full discussion of the experiments appear in [18], with the results of a TRECVID study showing that the analysts do not even attempt performance on relatively easy sports topics, and in general stop filling in answers well before the 15-minute time limit was reached. For the analysts, sports topics were irrelevant and not meaningful or interesting, and the TREC metric of MAP at a depth of 1000 shots is also unrealistic: they were content with finding 30 answer shots. Importantly, these real-world users did successfully make use of all three query strategies: query-by-text, query-by-image, and query-by-concept (using semantic concepts like “road” or “people” for video retrieval; filtering storyboards by such concepts is shown in Figure 1). When given an expressed information need, the TRECVID topic, this community performed better with and favored a system with image and concept query capabilities over an exclusive text-search system [18]. Analyst activity is creative and exploratory as well, where the information need is discovered and evolves over time based on interplay with data sources. Likewise, video search activity can be creative and exploratory where the information need is discovered and evolves over time. Evaluating tools for exploratory, creative work is difficult, as acknowledged by Shneiderman and Plaisant [19], with this subject being the current work of the author in the context of both broadcast news sources and life oral histories. The conference talk will discuss the very latest work, building from some exploratory task work done with these same six analysts using views like Fig. 1.
5 Lessons Learned and Future Directions TRECVID provides a public corpus with shared metadata to international researchers, allowing for metrics-based evaluations and repeatable experiments along with other advantages [8]. An evaluation risk with over-relying on TRECVID is tailoring interface work to deal solely with the genre of video in the TRECVID corpus. This risk is mitigated by varying the TRECVID corpus genre. Another risk is the topics and corpus drifting from being representative of real user communities and their tasks, which the TRECVID organizers hope is addressed by continually soliciting broad researcher and consumer involvement in topic and corpus definitions. An area that so far has remained outside of TRECVID evaluation has been the exploratory browsing interface capabilities supported by multiple views into video data as illustrated in Figures 1-3. The HistoryMakers study [17] hints that for TRECVID stated search topics and time limits, exploratory search is not needed and perhaps the video modality itself is not needed. What is the point of video for a user community, and what are their tasks with that video? The tasks drive the metadata and presentation
requirements. The work with the intelligence analysts [18] shows that if the task does not fit the users' expectations, there will be measurable changes in performance; for the analysts there was a significant drop-off in gathering more than 30 shots per topic and in working diligently on the sports topics. From the cited user studies and Informedia interface work in general through the years, the following lessons learned and strategic directions are offered:

• Leverage context (improve the video surrogate based on user activity)
• Provide easy user tuning of precision vs. recall (e.g., some users may want to see just correct shots, others want to get all of them; through widgets like the dynamic query sliders shown with Figure 1 for “outdoor”, the user is left in control; see the sketch at the end of this section)
• Exploratory search, the new frontier for video repositories as cheap storage and transmission allow for huge corpora (discussed in [17]; evaluation in [19])
• Augment automatically produced metadata with human-provided descriptors (take advantage of what users are willing to volunteer, and in fact solicit additional feedback from humans through motivating games that allow for human computation, a research focus of Luis von Ahn at CMU)

Of course, these directions are not independent. Fielding an interactive system supporting each of these features can gather operational data and feedback for yet more improvements in video information seeking. For example, a web video news service offered by a broadcaster could:

• Track user interests to gauge that field footage of weather-related stories was a typical context,
• Note that users wanted many possible rather than a few definite shots to peruse,
• Streamline exploration along desired date, location, and reporter dimensions, and
• Solicit additional feedback, recommendations, and tags from a willing social network user community.

As video corpora grow on the web and their user bases grow as well, sophisticated personalization mechanisms can couple with automatically derived metadata for video to allow rich, engaging interfaces supporting effective exploration.
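As one concrete (hypothetical) rendering of the precision-versus-recall tuning mentioned in the list above, a dynamic query slider can be modeled as a confidence threshold applied to automatically detected concept scores:

```python
def filter_shots(shots, concept, threshold, exclude=False):
    # shots: list of dicts with a "concepts" map of detector confidences
    # (hypothetical schema); raising the threshold favors precision,
    # lowering it favors recall; exclude=True drops matching shots instead
    passes = lambda s: s["concepts"].get(concept, 0.0) >= threshold
    return [s for s in shots if passes(s) != exclude]
```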
Acknowledgements

This work is supported by the National Science Foundation under Grant No. IIS-0205219 and Grant No. IIS-0705491. The HistoryMakers, CNN, and others' video contributions are gratefully acknowledged, with thanks to NIST and the TRECVID organizers for enabling video evaluation work through the years.
References
1. Informedia Research at Carnegie Mellon University, http://www.informedia.cs.cmu.edu
2. Christel, M.: Evaluation and User Studies with Respect to Video Summarization and Browsing. In: Chang, E.Y., Hanjalic, A., Sebe, N. (eds.) Proceedings of SPIE, Multimedia Content Analysis, Management, and Retrieval 2006, vol. 6073 (2006), doi:10.1117/12.642841
3. NIST TREC Video Retrieval Evaluation Online Proceedings (2001–2007), http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.html
4. Enser, P.G.B., Sandom, C.J.: Retrieval of Archival Moving Imagery – CBIR Outside the Frame? In: Lew, M.S., Sebe, N., Eakins, J.P. (eds.) CIVR 2002. LNCS, vol. 2383, pp. 206–214. Springer, Heidelberg (2002)
5. Rowe, L.A., Jain, R.: ACM SIGMM Retreat Report on Future Directions in Multimedia Research. ACM Trans. Multimedia Computing, Comm., & Applications 1, 3–13 (2005)
6. Christel, M.G., Moraveji, N.: Finding the Right Shots: Assessing Usability and Performance of a Digital Video Library Interface. In: Proc. ACM Multimedia, pp. 732–739. ACM Press, New York (2004)
7. Christel, M.G., Conescu, R.: Mining Novice User Activity with TRECVID Interactive Retrieval Tasks. In: Sundaram, H., et al. (eds.) CIVR 2006. LNCS, vol. 4071, pp. 21–30. Springer, Berlin (2006)
8. Hauptmann, A.G., Christel, M.G.: Successful Approaches in the TREC Video Retrieval Evaluations. In: Proc. ACM Multimedia, pp. 668–675. ACM Press, New York (2004)
9. Christel, M.G., Conescu, R.: Addressing the Challenge of Visual Information Access from Digital Image and Video Libraries. In: Proc. Joint Conference on Digital Libraries, pp. 69–78. ACM Press, New York (2005)
10. Taskiran, C.M., Pizlo, Z., Amir, A., Ponceleon, D., Delp, E.J.: Automated Video Program Summarization Using Speech Transcripts. IEEE Trans. on Multimedia 8, 775–791 (2006)
11. Christel, M.G., Winkler, D., Taylor, C.R.: Improving Access to a Digital Video Library. In: Howard, S., Hammond, J., Lindgaard, G. (eds.) Human-Computer Interaction: INTERACT 1997, pp. 524–531. Chapman and Hall, London (1997)
12. Christel, M., Yan, R.: Merging Storyboard Strategies and Automatic Retrieval for Improving Interactive Video Search. In: Proc. CIVR 2007, pp. 69–78. ACM Press, New York (2007)
13. Lienhart, R., Pfeiffer, S., Effelsberg, W.: Video Abstracting. Comm. ACM 40(12), 54–62 (1997)
14. Worring, M., Snoek, C., et al.: MediaMill: Advanced Browsing in News Video Archives. In: Sundaram, H., et al. (eds.) CIVR 2006. LNCS, vol. 4071, pp. 533–536. Springer, Heidelberg (2006)
15. Snoek, C., Worring, M., Koelma, D., Smeulders, A.: A Learned Lexicon-Driven Paradigm for Interactive Video Retrieval. IEEE Trans. Multimedia 9, 280–292 (2007)
16. ACM: Proc. Int'l Workshop on TRECVID Video Summarization (in conjunction with ACM Multimedia). ACM Press, New York (2007). ISBN 978-1-59593-780-3
17. Christel, M.G., Frisch, M.H.: Evaluating the Contributions of Video Representation for a Life Oral History Collection. In: Proc. Joint Conference on Digital Libraries. ACM Press, New York (2008)
18. Christel, M.G.: Establishing the Utility of Non-Text Search for News Video Retrieval with Real World Users. In: Proc. ACM Multimedia, pp. 706–717. ACM Press, New York (2007)
19. Shneiderman, B., Plaisant, C.: Strategies for Evaluating Information Visualization Tools: Multi-dimensional In-depth Long-term Case Studies. In: Proc. ACM BELIV 2006 Workshop, Advanced Visual Interfaces Conference, pp. 1–7. ACM Press, New York (2006)
Privacy-Enhanced Personalization Alfred Kobsa Donald Bren School of Information and Computer Sciences University of California, Irvine Irvine, CA 92697-3440, U.S.A. [email protected] http://www.ics.uci.edu/~kobsa
Personalized interaction with computer systems can be at odds with privacy since it necessitates the collection of considerable amounts of personal data. Numerous consumer surveys revealed that computer users are very concerned about their privacy online. The collection of personal data is also subject to legal regulations in many countries and states. This talk presents work in the area of Privacy-Enhanced Personalization that aims at reconciling personalization with privacy through suitable human-computer interaction strategies and privacy-enhancing technologies.
References 1. Kobsa, A.: Privacy-Enhanced Personalization. Communications of the ACM 50(8), 24–33 (2007) 2. Kobsa, A.: Privacy-Enhanced Web Personalization. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web 2007. LNCS, vol. 4321, pp. 628–670. Springer, Heidelberg (2007)
Narrative Interactive Multimedia Learning Environments: Achievements and Challenges Paul Brna School of Informatics, the University of Edinburgh, Edinburgh EH8 9LW, Scotland [email protected] http://webmail.mac.com/paulbrna/home/
Abstract. The promise of Narrative Interactive Learning Environments is that, somehow, the various notions of narrative can be harnessed to support learning in a manner that adds significantly to the effectiveness of learning environments. Here, we briefly review what has been achieved, and seek to identify the most productive paths along which researchers need to travel if we are to make the most of the insights that are currently available to us.
1 The Promise of Narrative

As a (broad) working definition, a Narrative Interactive Learning Environment (NILE) is an environment which has been designed with an explicit awareness of the advantages for learning to be cast as a process of setting up a challenge, seeking to overcome some obstacles and achieving a (partial) resolution. The notion of a Narrative Interactive Learning Environment is attractive, in part, because of the potential for stories to engage the reader. There seems to be an implicit promise that NILEs will have, for example, the intensity of seeing an exciting film, or reading an absorbing book. Because of the association with the purpose of ILEs as promoting effective learning, there is also the suggestion that the learning experience will be enhanced by the use of narrative (in some sense). The notion of narrative is increasingly utilised in the rhetoric of current designers of interactive learning environments. Narrative is seen as one key ingredient in the search for providing environments that strongly motivate the learner. It is also seen as a key ingredient in making sense of personal experience — and hence of value in seeking to communicate meaning. Dickey, for example, comes to the problem of developing motivating learning environments from the perspective of edutainment; an approach based on games with a strong story line [1]. While the argument is often accepted with little criticism, there is an implicit assumption that engagement = motivation to learn, but it is equally possible that the motivation is just “to play”. While there is genuine value in some kinds of play, for the most part, the trick is to make learning environments that encourage enjoyable learning.
Turning to the issue of making sense, Davis and her colleagues are working with autistic children to help them participate with others to overcome some of the deficits traditionally associated with autism [2]. Davis is also hopeful that there is value for the autistic child in working through narrative to build some skills at making sense of experience, and communicating this to others. We provide a simple framework to present which aspects of learning have received the most attention from designers. We will then consider the main achievements with the help of some existing systems. Finally we present some possible future developments. The conclusion, in brief, is that NILEs can help designers refocus their attentions upon the whole person.
2 A Framework

The simple framework presented here does not do full justice to the underlying complexity of narrative in all its forms. However, it allows us to focus on the main roles of narrative in learning. We classify the uses of narrative into five primary contexts that relate to the main purpose of learning:

Preconditions for learning: arranging for the learner to be prepared for some learning experience in the fullest possible manner. This preparation can include cognitive, affective and conative factors, as well as situational and social ones.

Postconditions for learning: ensuring that appropriate activities take place that encourage the learner to consolidate their learning, maintain or increase their motivation, reflect on their experience and so on.

Learning content: the actual interaction with material to be understood, mastered and applied.

Design: prior to any learning there is the planning that is needed to set in motion the events that lead to the experience. Design involves, amongst other things, setting up the design team, deciding on the design methodology, working with stakeholders, choosing the kinds of interactions with the learner that are to be provided, and the narrative within which the learner is to be placed.

Evaluation: both during and after the learning experience there is a need to examine the learning experience with a view to making sure that the learner benefited. This process might well lead to suggestions for improving the experience.

The above five contexts are relevant to the design of NILEs, but learners can be (self) motivated towards achieving many kinds of goals such as ones connected with: becoming more knowledgeable and skilled; becoming more creative; improving one's sense of personal identity (so as to become more confident etc.); improving one's social identity (e.g. receiving the approval of others); and improving relationships. These aims are neither mutually exclusive nor independent. Nor are they necessarily the “designed-in” aims of any specific NILE. By listing them we seek to emphasise the importance of taking into account the widest possible range of human wants and needs.
2.1 Preconditions for Learning
Preparing learners for learning is a key idea. Ausubel, in his persuasive approach to this problem, introduced the idea of advance organisers [3]. Ausubel was not just concerned with preparing the learner to learn by providing the vocabulary to be used, or even making sure that certain preconditions were met:

Advance organisers bridge the gap between what the student already knows and the new material at a higher level of abstraction, generality and inclusiveness. [3]

Other approaches focus on motivational and attitudinal aspects of learning — most famously, Keller and his colleagues [4], who introduced the ARCS model involving attention (A), relevance (R), confidence (C) and satisfaction (S). The ARCS model is used by many to support the design of ILEs that are intended to engage learners. Narrative is also used to set up the situation in which learning takes place. From a situational perspective, setting up the social and physical context in which action will happen requires some engagement with the concerns that are central to narrative design. Perhaps the most natural usage is in the exposition of the ‘back story’. Generating such a narrative is by no means straightforward but, if done well, it can provide context and help to motivate learners.

2.2 The Postconditions of Learning
While Ausubel is well known for his notion of advance organisers, the idea of post organisers is less familiar, yet there is evidence as to their effectiveness [5]. Hall, Hall and Saling examined the effectiveness of knowledge maps — a form of concept map — when used as a post organiser. Their empirical work suggested some advantages to this approach. Much of the recent work that covers related ideas has tended to draw on Schön's notion of reflection-on-action [6]. In the last few years there seems to have been a noticeable increase in interest in systems that encourage the learner to reflect on their learning or on the way that they learn.

2.3 Learning the Content
Designers frequently embed a set of tasks into a suitable story - real or imaginary. For example, Waraich's Binary Arithmetic Tutor (BAT) system consists of a set of tasks connected with binary arithmetic embedded within a motivating story [7]. His MOSS (Map reading and Ordnance Survey Skills) system, developed as a pilot for his approach to narrative centred informant design, was aimed at teaching map reading skills to children [7]. The MOSS system weaves the narrative in together with the tasks. The key challenges are how to make the narrative aspects of an ILE work to achieve the learning aims and goals, and how to ensure that the learning goals do not undercut the narrative aspects in such a way that the whole endeavour is compromised.
2.4 Design
Traditional methods can be used for the design of NILEs. Learner centred design has been increasingly favoured, and there have been some notable developments in the approaches taken. These include Scaife, Rogers, Aldrich and Davies in their development of informant design [8], as well as Chin, Rosson and Carroll's scenario-based design [9], illustrated by Cooper and Brna in their work on the development of T'rrific Tales [10]. A recent special issue of the International Journal of Artificial Intelligence in Education (Volume 16, number 4) provides a good overview of the ways in which learner centred design is conceived as a general means of developing ILEs. In particular, Luckin, Underwood, du Boulay, Holmberg, Kerawalla, O'Connor, Smith and Tunley provide a detailed account of their learner centred methodology for software development [11]. They term their approach an example of Human Centred Design. Their approach seeks to identify the stakeholders and then involves working with them using, for example, storyboards and interviews. The process is cyclical. It is claimed that the experience of working through this cycle leads to an increasingly rich understanding of the needs of the learners.

2.5 Evaluation
If we want to know whether a particular NILE is delivering the goods then we need to evaluate the current state of the design in relation to the kinds of people who have an interest in the outcomes — learners, parents, schools, educational policy makers and designers themselves. Since NILEs work on so many levels — cognitive, affective and conative, as well as on self-identity and personal relationships — the methods needed for any evaluation are very varied. We are not ‘simply’ looking for learning gains on standardised tests; we are also looking for more elusive gains. Self identity, for example, is something that can be examined throughout one's life, and there are no simple metrics that can identify changes in some absolutely ‘right’ direction. In the case of evaluations of NILEs we find both standard methods from experimental psychology and ones that are qualitative. While there is no obvious requirement to evaluate the effectiveness of a NILE using ‘narrative’ methods, there is a place for methods that can loosely be described under the heading of “narrative inquiry” [12], which seeks to take the stories of participants very seriously.
3 The Achievements We select some clear exemplars that demonstrate distinctive qualities for the first four primary contexts — but the fifth primary context, that of evaluation, does not have any strong exemplar. However, some promising approaches are outlined.
3.1 Preparing the Learner
Robertson's thesis work on Ghostwriter [13] is a good example of the use of narrative to prepare the learner to take part in a learning experience. Ghostwriter is based on the idea that children who find it difficult to write stories could be provided with a stimulating experience which would then provide them with the germs of ideas that they could turn into stories. The environment was designed to avoid the point-and-shoot style of many games in order to encourage imaginative characterisation. Ghostwriter involves two participants who need to help each other complete the task given to them [13]. The participants cannot avoid making value-based decisions about other characters in the game. After the children finish discussing their experiences, they are encouraged to write a new story. The empirical evidence obtained from this work is impressive: learners were motivated, developed good working relationships with each other, identified with the characters in the game [13] and, importantly, their new stories featured a greater use of the relationships between characters [14]. The “Ghostwriter” scenario is a clear use of a NILE to give learners an experience to prepare them for an educational task. The designer's aim with Ghostwriter may have been that it should be a preparatory experience that supports the development of a learner's story writing skills, but the experience-in-itself also seems to have been a success.

3.2 Reflecting on the Experience
Machado's Support And Guidance Architecture (SAGA) was a significant attempt to develop a more learner-centred support architecture for collaborative learning [15]. The aim of the work was to produce a kind of plug-in component that could be included in a variety of systems designed for story creation. It has been tested with Teatrix, a 3D collaborative story creation environment designed for eight year olds. Perhaps the most significant educational innovation was the inclusion of a method for encouraging reflection through “hot seating”, derived from the approach developed by Dorothy Heathcote to the use of drama in education [16]. The reflection engine is the component that generates a ‘reflection moment’, consisting of a request for the learner to stop their work in the learning environment and review their character's behaviour and the story's development. Heathcote makes it clear that such a move in the classroom is more than just a means of generating reflection: she sees this as a failure saver as well as a slower down into experience [16].

3.3 Learning to Manage Difficult Situations
While there are many ILEs that are designed for learning science and mathematics, modern NILEs featuring role playing immersive environments are often
targeted at procedural training, topics in the humanities, or topics connected with social and psychological aspects. This latter class of systems is of great value, given a growing awareness in some countries of the urgent need to socialise young people into good relationships with each other, older people and various social institutions (not least, the schools themselves). FearNot! is one of the most significant NILEs of recent years [17]. The VICTEC (Virtual ICT with Empathic Characters) project which developed the FearNot! system took an approach intended to help children understand the nature of bullying and, perhaps, learn how to deal with bullies. The system was targeted at 8-12 year olds in the UK, Portugal and Germany. When using FearNot!, the child views a sequence of incidents that features direct bullying. The characters in the scenes are synthetic, and the overarching context is a school. The characters needed to be believable because it was intended that children should be engaged with the situations, and care about the outcomes for the characters involved. It was also intended that the children using the system should both empathise with the characters and help them by proffering advice between incidents. This advice affects the synthetic character's emotional state, which in turn affects the actions the character selects in the next episode. Hence the child is kept at a “safe” emotional distance from the incidents in two ways: through an empathic relationship (rather than one in which the child identifies directly with the characters), and by trying out ways of dealing with bullying through offering advice.

3.4 Bringing Narrative into the Design Process
Waraich takes an informant design approach in his work on designing motivating narratives [7]. Informant design seeks to draw on the expertise of various stakeholders. However, when working with children in particular, it can be very difficult to extract the key information from the contributions being made. Informant design seeks to recognise the different kinds of contributions made by different contributors. Waraich explicitly introduces narrative concepts into the process of structuring the contributions from the informants. Such an approach focuses on helping informants work on the problem of generating software which is engaging in terms of theme, setting, characterisation and plot structure. Providing informants with sufficient background in the understanding of narrative is challenging. Not only do different learners have different needs in order to participate constructively, but the need is conditioned to some extent by the curricular system in which learners grew up.

3.5 Evaluating the Experience
As pointed out above, there is a need to be very flexible about the manner in which NILEs benefit learners. The approach needs to be suited to the kind of outcomes in which we are interested. For example, suppose we are interested in how engaged students are when using a NILE. Webb and Mallon used focus groups to examine engagement when students used narrative adventure and
role-play games [18]. The method yielded some very useful guidelines which demonstrated some ways in which narrative devices could work within a game scenario. Another interesting approach was taken by Marsh who, in his thesis work, turned the VR notion of “being there” on its head and examined the notion of “staying there” [19]. He developed an evaluation method for three categories of experience: the vicarious, the visceral and the voyeuristic. The voyeuristic experience is associated with sensations, aesthetics, atmosphere and a sense of place, the visceral experience with thrills, attractions and sensations, and the vicarious with empathy and emotional information [20]. The engaging experience that a NILE is supposed to provide needs such evaluations: the division proposed by Marsh is one way of categorising the kinds of experience that need to be evaluated. However, the approach needs to be fleshed out further. Knickmeyer and Mateas have sought to evaluate the Façade system [21]. They were particularly interested in responses to interaction breakdown. Their method used a form of retrospective protocol analysis combined with a post-experience interview; their analysis was based on a coding scheme developed from Malone's identification of the importance of challenge, fantasy and curiosity [22].
4 Ways Forward

There are three areas which I believe will be important in the future and need attention. Progress in each of these areas has the potential for improving learning environments — and for each area there is some evidence that progress is possible.

– Empathic design
– Personal development/relationships
– Narrative pedagogy

We can also expect some significant developments in traditional learning environments — e.g. environments designed to train people (to drive, fight fires, play football etc.) or ones designed to deliver learning outcomes that are found in the school curriculum (solve equations, learn French etc.). Various researchers will no doubt produce learning environments that increasingly blend systems for learning and systems for delivering a strong narrative experience. While SAGA is one of the few systems for planning narratives designed for educational purposes, it does not blend the work on narrative with that of ILEs in an explicit manner. Riedl, Lane, Hill and Swartout at the Institute for Creative Technologies, University of Southern California [23] have been seeking to study how narrative and teaching objectives might be managed. The area of planning narrative experiences which has the potential to be highly productive is that exemplified by Machado's reflection tool, which emphasises the role of the learner in engaging with the narrative. While this approach takes one out of the narrative being constructed in order to reflect on it, it is this that makes
Machado's work attractive. There are two major pathways — “narrative as motivation” and “narrative as the ways in which we approach difficult ideas and experiences”. Machado uses both approaches, but it must be clear that, in the situation in which any NILE is used, the capability to move in and out of the engaging experiences and take stock of the situation in terms of what has been learned and how the experience can be built upon is at the heart of the educational uses of NILEs.

4.1 Empathic Design
In terms of designing NILEs, there is a further issue worth mentioning which is implicit in much that has been done — that of Empathic Design [24]. This is connected with the designer’s duty of care to the learner. In the artificial intelligence in education community, John Self argued that caring for learners involved responding to their needs [25]. Caring for learners goes beyond effective and efficient learning of the specific content being considered, and looks to the wider picture, both in terms of the content and in terms of personal development. Designers of educational software need to factor empathy into the design process adequately, to ensure that issues connected with management and the curriculum do not dominate.

Empathy can be defined in a number of distinct ways, all of which have some bearing on the problem of utilising empathy in the design of NILEs. Preston and de Waal’s process model makes: “empathy a superordinate category that includes all sub-classes of phenomena that share the same mechanism. This includes emotional contagion, sympathy, cognitive empathy, helping behaviour, etc.” [26]. Emotional contagion is the notion that, for example, seeing a person smile literally evokes a muscular response emulating a smile, or that seeing a child in a state of fear evokes fear in the child’s mother. I feel that this notion has been exploited — knowingly or unknowingly — in many agent-based systems. The educational community is perhaps more interested in cognitive empathy, a conscious, rational assessment of another’s situation, as found in Rogers’ work [27]. Designing empathy into learning environments should almost certainly aim at working both at the conscious and unconscious levels.

If we bring into consideration that all learning environments ‘stand in’ in some way for the teacher, then how does a good teacher express empathy? The empathic teacher treats children as individuals, seeking to discover a pupil’s existing skills and to help them develop. The empathic teacher knows the child as a person, and knows their confidence levels as well as their knowledge. The empathic teacher also nurtures each child’s sense of self, supports their academic progress, and seeks to develop each child’s awareness of themselves [28]. So I take the position that:

– A strong sense of empathy is a valuable, probably essential, characteristic for designers of learning systems.
– Good teachers demonstrate a set of empathic characteristics that provide a starting point for the development of better quality interactions between the learning environment and the learner.

Empathic design supplements methods derived from informant design [8]. Waraich’s NCID adds a narrative dimension as an aspect of design that can be confronted explicitly within an informant design framework — but there is also scope for extending the often unconscious use of empathy within a design team to become a far more explicit component of the design process.

4.2 Personal Development/Relationships
While some environments, such as FearNot!, focus on personal experiences of bullying and the development of ways in which bullying might be managed, others have looked at the ways in which people can seek to grow or restore their sense of personal worth and their relationships with others [29]. Their emphasis is on telling stories which evoke connections with the learner and give insights into their own personal circumstances. Each learning experience may generate a story from which something is taken, memorised and learned. We do not need a NILE for this to happen — but a NILE might well facilitate it by encouraging learners to ‘tell’ their own stories, whether within the NILE or elsewhere.

4.3 Narrative Pedagogy
What about the future of pedagogy in relation to the design of NILEs? In many of the systems discussed, the underlying pedagogy is obscure — sometimes because it is not seen as important by the authors/designers of the NILEs, sometimes because the pedagogy is taken for granted. On the other hand, some of the systems have a clear pedagogy in mind, even if there are other ways of using the specific NILEs.

Rather than dwell on ‘standard’ pedagogies, there is an approach with significant potential for system designers. Diekelmann has introduced and advocated the use of narrative pedagogy within nursing education [30]. This approach has also found application within teacher education. Narrative Pedagogy stands somewhat in tension with standard approaches to learning in that the emphasis is on the generation of person-centred descriptions of situations and their interpretation within a community. Narrative pedagogy downplays the importance of being absolutely right or absolutely wrong, and of assessing learners through objective tests. It is similar to Narrative Inquiry in that argumentation is aimed at mutual understanding rather than winning [31]. For some, this will appear anathema (i.e. more or less taboo). For others, taking a strong constructivist approach, it is not such an alien way of thinking about learning.

For school learners, the methods of Narrative Pedagogy may not always be appropriate, but there is resonance with some movements in education connected
with inclusion, the promotion of self-esteem, the recognition of different kinds of individual achievement, and so on. I would also argue that the underlying philosophy can be used as the theoretical grounding for future work on NILEs. If NILEs can embody complex situations that encourage learners to generate their own responses to the challenges found in a situation — whether in relation to understanding physics or to responding to bullying — then we are part way to an approach that could support Narrative Pedagogy. What is evidently missing from most NILEs is the pooling of learners’ interpretations and the opportunity for a learning community to work with these interpretations to form a new understanding.
5 Consequences

I would like to suggest, in line with the notion of empathic design [24], that it becomes increasingly important to understand the system designer in terms of their relationship with every learner. In some cases, it will prove worthwhile to realise this relationship as a two-way one [32]. System designers can make a valuable contribution to re-establishing the importance of personal relationships at a time when these are under stress, in a world in which an impersonal, functional view of people seems dominant. This is the hope for the future design of NILEs: that such systems will be of use in sustaining the personal development of learners, in terms of building and supporting quality relationships with, amongst others, parents, teachers and fellow learners. In using NILEs of the future, we might also hope that these systems will help learners attain a wide range of competencies.
Acknowledgements

This paper is based on another article [33]. I thank Atif Waraich for his comments.
References

1. Dickey, M.D.: Game design narrative for learning: Appropriating adventure game design narrative devices and techniques for the design of interactive learning environments. Educational Technology Research and Development 54(3), 245–263 (2006)
2. Davis, M., Dautenhahn, K., Nehaniv, C.L., Powell, S.D.: The narrative construction of our (social) world: steps towards an interactive learning environment for children with autism. Universal Access in the Information Society 6(2), 145–157 (2007)
3. Ausubel, D., Novak, J., Hanesian, H.: Educational Psychology: A Cognitive View. Holt, Rinehart and Winston, New York (1978)
4. Keller, J.M.: Motivational design of instruction. In: Reigeluth, C.M. (ed.) Instructional-design theories and models: an overview of their current status. Lawrence Erlbaum Associates, Hillsdale (1983)
5. Hall, R.H., Hall, C.R., Saling, C.B.: The effects of graphical post organization strategies on learning from knowledge maps. Journal of Experimental Education 67(2), 101–112 (1999)
6. Schon, D.A.: Educating the Reflective Practitioner. Jossey-Bass, San Francisco (1987)
7. Waraich, A.: Designing Motivating Narratives for Interactive Learning Environments. PhD thesis, Computer Based Learning Unit, Leeds University (2003)
8. Scaife, M., Rogers, Y., Aldrich, F., Davies, M.: Designing for or designing with? Informant design for interactive learning environments. In: CHI 1997: Proceedings of Human Factors in Computing Systems, pp. 343–350. ACM, New York (1997)
9. Chin, G.J., Rosson, M., Carroll, J.: Participatory analysis: Shared development requirements from scenarios. In: Pemberton, S. (ed.) Proceedings of CHI 1997: Human Factors in Computing Systems, pp. 162–169 (1997)
10. Cooper, B., Brna, P.: Classroom conundrums: The use of a participant design methodology. Educational Technology & Society 3(4), 85–100 (2000)
11. Luckin, R., Underwood, J., du Boulay, B., Holmberg, J., Kerawalla, L., O’Connor, J., Smith, H., Tunley, H.: Designing educational systems fit for use: A case study in the application of human centred design for AIED. International Journal of Artificial Intelligence in Education 16(4), 353–380 (2006)
12. Clandinin, D.J., Connelly, F.M.: Narrative Inquiry: Experience and Story in Qualitative Research. Jossey-Bass, San Francisco (2000)
13. Robertson, J.: The effectiveness of a virtual role-play environment as a story preparation activity. PhD thesis, Edinburgh University (2001)
14. Robertson, J., Good, J.: Using a collaborative virtual role-play environment to foster characterisation in stories. Journal of Interactive Learning Research 14(1), 5–29 (2003)
15. Machado, I., Brna, P., Paiva, A.: Learning by playing: Supporting and guiding story-creation activities. In: Moore, J.D., Redfield, C.L., Johnson, W.L. (eds.) Proceedings of the 10th International Conference on Artificial Intelligence in Education AI-ED 2001, pp. 334–342. IOS Press, Amsterdam (2001)
16. Heathcote, D.: Drama and learning. In: Johnson, L., O’Neill, C. (eds.) Collected Writing on Education and Drama, pp. 90–102. Northwestern University Press, Evanston, Illinois (1991)
17. Hall, L., Woods, S., Aylett, R.: FearNot! Involving children in the design of a virtual learning environment. International Journal of Artificial Intelligence in Education 16(4), 327–351 (2006)
18. Mallon, B., Webb, B.: Stand up and take your place: Identifying narrative elements in narrative adventure and role-play games. Computers in Entertainment 3(1) (2005)
19. Marsh, T.: Staying there: an activity-based approach to narrative design and evaluation as an antidote to virtual corpsing. In: Riva, G., Davide, F., IJsselsteijn, W. (eds.) Being There: Concepts, Effects and Measurement of User Presence in Synthetic Environments, pp. 85–96. IOS Press, Amsterdam (2003)
20. Marsh, T.: Presence as experience: Film informing ways of staying there. Presence 12(5), 538–549 (2003)
21. Knickmeyer, R.L., Mateas, M.: Preliminary evaluation of the interactive drama Façade. In: CHI 2005, ACM, New York (2005)
22. Malone, T.: Towards a theory of intrinsically motivating instruction. Cognitive Science 5(4), 333–369 (1981)
23. Riedl, M., Lane, H., Hill, R., Swartout, W.: Automated story direction and intelligent tutoring: Towards a unifying architecture. In: AI and Education 2005 Workshop on Narrative Learning Environments, Amsterdam, The Netherlands (July 2005)
24. Brna, P.: On the role of self esteem, empathy and narrative in the development of intelligent learning environments. In: Pivec, M. (ed.) Affective and Emotional Aspects of Human-Computer Interaction: Game-Based and Innovative Learning Approaches, pp. 237–245. IOS Press, Amsterdam (2006)
25. Self, J.: The defining characteristics of intelligent tutoring systems research: ITSs care, precisely. International Journal of Artificial Intelligence in Education 10(3-4), 350–364 (1999)
26. Preston, S.D., de Waal, F.B.M.: Empathy: Its ultimate and proximate bases. Behavioral and Brain Sciences 25, 1–72 (2001)
27. Rogers, C.: Empathic: An unappreciated way of being. The Counselling Psychologist 5(2), 2–10 (1975)
28. Cooper, B., Brna, P., Martins, A.: Effective affective in intelligent systems — building on evidence of empathy in teaching and learning. In: Paiva, A. (ed.) Affect in Interactions: Towards a New Generation of Computer Interfaces, pp. 21–34. Springer, Heidelberg (2000)
29. Sharry, J., Brosnan, E., Fitzpatrick, C., Forbes, J., Mills, C., Collins, G.: ‘Working Things Out’: a therapeutic interactive CD-ROM containing the stories of young people overcoming depression and other mental health problems. In: Brna, P. (ed.) Proceedings of Narrative and Interactive Learning Environments NILE 2004, pp. 67–74 (2004)
30. Diekelmann, N.: Narrative Pedagogy: Heideggerian hermeneutical analyses of lived experiences of students, teachers, and clinicians. Advances in Nursing Science 23(3), 53–71 (2001)
31. Conle, C.: The rationality of narrative inquiry in research and professional development. European Journal of Teacher Education 24(1), 21–33 (2001)
32. Sims, R.: Interactivity or narrative? A critical analysis of their impact on interactive learning. In: Proceedings of ASCILITE 1998, Wollongong, Australia, pp. 627–637 (1998)
33. Brna, P.: In search of narrative interactive learning environments. In: Virvou, M., Jain, L.C. (eds.) Intelligent Interactive Systems in Knowledge-based Environments, pp. 47–74. Springer, Berlin (2008)
A Support Vector Machine Approach for Video Shot Detection

Vasileios Chasanis, Aristidis Likas, and Nikolaos Galatsanos

Department of Computer Science, University of Ioannina, 45110 Ioannina, Greece
{vchasani,arly,galatsanos}@cs.uoi.gr
Abstract. The first step towards indexing and content-based video retrieval is video shot detection. Existing methodologies for video shot detection are mostly threshold-dependent; with no prior knowledge of the video content, such methods are sensitive to the content of each particular video. To ameliorate this shortcoming we propose a learning-based methodology using a set of features that are specifically designed to capture the differences among hard cuts, gradual transitions and normal sequences of frames simultaneously. A Support Vector Machine (SVM) classifier is trained both to locate shot boundaries and to characterize transition types. Numerical experiments using a variety of videos demonstrate that our method is capable of accurately detecting and discriminating shot transitions in videos with different characteristics.

Keywords: Abrupt cut detection, Dissolve detection, Support Vector Machines.
1 Introduction

In recent years there has been a significant increase in the availability of high quality digital video, as a result of the expansion of broadband services and the availability of large-volume digital storage devices. Consequently, there has been an increase in the need to access this huge amount of information and a great demand for techniques that will provide efficient indexing, browsing and retrieval of video data. The first step towards this direction is to segment the video into smaller “physical” units in order to proceed with indexing and browsing. The smallest physical segment of a video is the shot, defined as an unbroken sequence of frames recorded from one camera.

Shot transitions can be classified into two categories. The first, and most common, is the abrupt cut. An abrupt or hard cut takes place between consecutive frames due to a camera switch; in other words, a different or the same camera is used to record a different aspect of the scene. The second category concerns gradual transitions such as dissolves, fade-outs followed by fade-ins, wipes and a variety of video effects which stretch over several frames. A dissolve takes place when the initial frames of the second shot are superimposed on the last frames of the first shot.

A formal study of the shot boundary detection problem is presented in [20]. In [11] the major issues to be considered for the effective solution of the
shot-boundary detection problem are identified. A comparison of existing methods is presented in ([4], [10], and [15]). There are several approaches to the shot-boundary detection task, most of which involve the determination of a predefined or adaptive threshold. A simple way to declare a hard cut is pair-wise pixel comparison [22]. This method is very sensitive to object and camera motion, thus many researchers propose the use of a motion-independent characteristic: the intensity or color histogram, global or local ([17], [22]). The use of second-order statistical characteristics of frames, in a likelihood ratio test, has also been suggested ([12], [22]). In [21] an algorithm is presented based on the analysis of entering and exiting edges between consecutive frames; this approach works well on abrupt changes, but fails in the detection of gradual changes. In [5] mutual information and joint entropy between frames are used for the detection of cuts, fade-ins and fade-outs. An original approach to partitioning a video into shots based on a foveated representation of the video is proposed in [3]. A quite interesting approach is presented in [20], where the detection of shot boundaries is posed as a graph partitioning problem, and support vector machines with active learning are implemented to declare boundaries and non-boundaries. A Support Vector Machine classifier with color and motion features is also employed in [7]. In [8] the authors propose, as features for SVMs, wavelet coefficient vectors within sliding windows.

A variety of methods have been proposed for gradual transition detection, but they remain inadequate due to the complicated nature of such transitions. In [22], a twin-comparison technique is proposed for hard cut and gradual transition detection, applying different thresholds based on differences in color histograms between successive frames. In [18] a spatio-temporal approach was presented for the detection of a variety of transitions. There is also research aimed specifically at the dissolve detection problem. In [16], dissolve detection is treated as a pattern recognition problem. Another direction, followed in ([9], [11], and [14]), is to model the transition types by presupposing probability distributions for the feature difference metrics and performing a posteriori shot change estimation. It is worth mentioning that the organization of the TREC video shot detection task [19] provides a standard performance evaluation and comparison benchmark.

In summary, the main drawback of most previous algorithms is that they are threshold-dependent. As a result, if there is no prior knowledge about the visual content of a video that we wish to segment into shots, it is rather difficult to select an appropriate threshold. To overcome this difficulty, we propose in this paper a supervised learning methodology for the shot detection problem. In other words, the approach proposed herein does not use thresholds and can detect shot boundaries in videos with totally different visual characteristics. Another advantage of the proposed approach is that we can detect hard cuts and gradual transitions at the same time, in contrast with existing approaches. For example, in [7] the authors propose a Support Vector Machine classifier only for abrupt cut detection. In [20], features for abrupt cuts and dissolves are constructed
separately and two different SVM models are trained. In our approach, we define a set of features designed to discriminate hard cuts from gradual transitions. These features are obtained from color histograms and describe both the variation between adjacent frames and the contextual information. Because gradual transitions spread over several frames, frame-to-frame differences are not sufficient to characterize them; thus, we also use differences between non-adjacent frames in the definition of the proposed features. These features are used as inputs to a Support Vector Machine (SVM) classifier. A set of nine different videos with over 70K frames from TV series, documentaries and movies is used to train and test the SVM classifier. The resulting classifier achieves content-independent correct detection rates greater than 94%.

The rest of this paper is organized as follows: In Sections 2 and 3 the proposed features are described. In Section 4 the SVM method employed for this application is briefly presented. In Section 5 we present numerical experiments and compare our method with three existing methods. Finally, in Section 6 we present our conclusions and suggestions for future research.
2 Feature Selection

2.1 Color Histogram and χ² Value
Color histograms are the most commonly used features to detect shot boundaries. They are robust to object and camera motion, and provide a good trade-off between accuracy of detection and implementation speed. We have chosen to use normalized RGB histograms. For each frame a normalized histogram is computed, with 256 bins for each of the RGB components, defined as H^R, H^G and H^B respectively. These three histograms are concatenated into a 768-dimensional vector representing the final histogram of each frame:

H = [H^R \; H^G \; H^B].   (1)
To decide whether two shots are separated by an abrupt cut or a gradual transition we have to look for a difference measure between frames. In our approach we use a variation of the χ² value to compare the histograms of two frames, in order to enhance the difference between the two histograms. The difference between two images I_i, I_j based on their color histograms H_i, H_j is given by the following equation:

d(I_i, I_j) = \frac{1}{3} \sum_{k=1}^{768} \frac{(H_i(k) - H_j(k))^2}{H_i(k) + H_j(k)},   (2)

where k denotes the bin index.
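The following is a minimal Python sketch of the descriptor of Eq. (1) and the dissimilarity of Eq. (2). The small eps guard against empty bins is our addition; everything else follows the definitions above.

```python
import numpy as np

def rgb_histogram(frame):
    """Concatenated, normalized 256-bin histograms of the R, G and B
    channels: the 768-dimensional descriptor H of Eq. (1).
    `frame` is an H x W x 3 uint8 array."""
    parts = [np.histogram(frame[..., c], bins=256, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(parts).astype(np.float64)
    return h / h.sum()

def chi2_distance(h_i, h_j, eps=1e-12):
    """Chi-square variant of Eq. (2): d = (1/3) * sum over the 768 bins
    of (H_i(k) - H_j(k))^2 / (H_i(k) + H_j(k)); eps guards empty bins."""
    return ((h_i - h_j) ** 2 / (h_i + h_j + eps)).sum() / 3.0
```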
2.2 Inter-frame Distance
The dissimilarity value given in equation (2) can be computed for any pair of frames within the video sequence. We compute the value not only between adjacent frames, but also between frames with time distance l, where l is called the inter-frame distance, as suggested in ([1], [11]). We compute the dissimilarity value d(I_i, I_{i+l}) for three values of the inter-frame distance l (a computational sketch follows this list):

– l=1. This is used to identify hard cuts between two consecutive frames, so the dissimilarity values are computed for l=1, Fig. 1(a).

– l=2. Due to the fact that during a gradual transition two consecutive frames may be the same or very similar to each other, the dissimilarity value will tend to zero and, as a result, the sequence of dissimilarity values could have the form shown in Fig. 1(b). The computation for l=2 usually results in a smoother curve, which is more useful for our further analysis.

– l=6. A gradual transition stretches over several frames, while the difference between consecutive frames is small, so we are interested not only in the difference between consecutive frames, but also between frames that are a specific distance apart. As the inter-frame distance increases, the curve becomes smoother, as can be observed in the example of Fig. 1(c). Of course, the maximum distance between frames for which the inter-frame distance is useful is rather small: this distance should be less than the minimum length of all transitions in the video set, in order to capture the form of the transition. In our set of videos, most gradual transitions have a length between 7 and 40 frames.
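As a sketch, the three dissimilarity sequences (formalized as Eq. (3) in the next section) can be computed from the per-frame histograms using the chi2_distance helper defined earlier:

```python
def dissimilarity_sequences(hists):
    """One chi-square dissimilarity sequence per inter-frame distance l:
    distances between the histograms of frames i and i+l.
    `hists` is the list of per-frame descriptors from rgb_histogram()."""
    return {l: np.array([chi2_distance(hists[i], hists[i + l])
                         for i in range(len(hists) - l)])
            for l in (1, 2, 6)}
```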
3 Feature Vector Selection for Shot-Boundary Classification

The dissimilarity values defined in Section 2 are not going to be compared with any threshold, but will be used to form feature vectors based on which an SVM classifier will be constructed.

3.1 Definition of Feature Vectors
The feature vectors selected are the normalized dissimilarity values calculated in a temporal window centered at the frame of interest. More specifically, the dissimilarity values computed in Section 2 form three vectors, one for each of the three inter-frame distances l:

D_{l=1} = [d(I_1, I_2), \ldots, d(I_i, I_{i+1}), \ldots, d(I_{N-1}, I_N)]
D_{l=2} = [d(I_1, I_3), \ldots, d(I_i, I_{i+2}), \ldots, d(I_{N-2}, I_N)]   (3)
D_{l=6} = [d(I_1, I_7), \ldots, d(I_i, I_{i+6}), \ldots, d(I_{N-6}, I_N)]
Fig. 1. Dissimilarity patterns for inter-frame distances (a) l=1, (b) l=2, (c) l=6
Moreover, for each frame we define a window of length w that is centered at this frame and contains the dissimilarity values. As a result, for the i-th frame the following three vectors are composed:

W^{l=1}(i, 1:w) = [D_{l=1}(i - w/2), \ldots, D_{l=1}(i), \ldots, D_{l=1}(i + w/2 - 1)]
W^{l=2}(i, 1:w) = [D_{l=2}(i - w/2), \ldots, D_{l=2}(i), \ldots, D_{l=2}(i + w/2 - 1)]   (4)
W^{l=6}(i, 1:w) = [D_{l=6}(i - w/2), \ldots, D_{l=6}(i), \ldots, D_{l=6}(i + w/2 - 1)]
To obtain the final features we normalize the dissimilarity values in equation (4) by dividing each dissimilarity value by the sum of the values in the window. This provides normalized, “magnitude”-independent features:

\tilde{W}^{l=k}(i, j) = \frac{W^{l=k}(i, j)}{\sum_{j'=1}^{w} W^{l=k}(i, j')}, \quad k = 1, 2, 6.   (5)
The size of the window used is w=40. In our experiments we also considered windows of length 50 and 60, in order to capture longer transitions. The 120-dimensional vector resulting from the concatenation of the normalized dissimilarities for the three windows,

F(i) = [\tilde{W}^{l=1}(i) \; \tilde{W}^{l=2}(i) \; \tilde{W}^{l=6}(i)],   (6)

is the feature vector corresponding to frame i. In what follows we show examples of the feature vectors for a hard cut and a dissolve in Fig. 2.
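A minimal sketch of Eqs. (4)–(6), building on the dissimilarity sequences D computed earlier; boundary frames (closer than w/2 to either end) are not handled, for brevity:

```python
def frame_feature_vector(D, i, w=40):
    """F(i) of Eq. (6): for each inter-frame distance l, take the w
    dissimilarities centred on frame i (Eq. (4)), normalize them to sum
    to one (Eq. (5)) and concatenate, giving a 3*w-dimensional vector."""
    feats = []
    for l in (1, 2, 6):
        window = D[l][i - w // 2 : i + w // 2]  # W^{l}(i, 1:w)
        feats.append(window / window.sum())     # Eq. (5)
    return np.concatenate(feats)
```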
Fig. 2. Feature vectors for transitions: (a) hard cut, (b) dissolve
4 Support Vector Machine Classifier

After the feature definition, an appropriate classifier has to be used in order to categorize each frame into three categories: normal sequences, abrupt cuts and gradual transitions. For this purpose we selected the Support Vector Machine (SVM) classifier [6], which provides state-of-the-art performance and scales well with the dimension of the feature vector, which is relatively large (equal to 120) in our problem.

The classical SVM classifier finds an optimal hyperplane which separates data points of two classes. More specifically, suppose we are given a training set of l vectors x_i ∈ R^n, i = 1, \ldots, l, and a vector y ∈ R^l with y_i ∈ {1, -1} denoting the class of vector x_i. We also assume a mapping function φ(x) that maps each training vector to a higher-dimensional space, and the corresponding kernel function (eq. (9)). Then the SVM classifier [6] is obtained by solving the following primal problem:

\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i
subject to \; y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; i = 1, \ldots, l.   (7)
The decision function is:

\mathrm{sgn}\Big( \sum_{i=1}^{l} w_i K(x_i, x) + b \Big), \quad \text{where } K(x_i, x_j) = \phi^T(x_i)\, \phi(x_j).   (8)
A notable characteristic of SVMs is that, after training, usually most of the training patterns x_i have w_i = 0 in eq. (8); in other words, they do not contribute to the decision function. Those x_i for which w_i ≠ 0 are retained in the SVM model and are called Support Vectors (SVs). In our approach the commonly used radial basis function (RBF) kernel is employed:

K(x_i, x_j) = \exp(-\gamma \| x_i - x_j \|^2),   (9)

where γ denotes the width of the kernel. It must be noted that, in order to obtain an efficient SVM classifier, the parameters C (eq. (7)) and γ (eq. (9)) must be carefully selected, usually through cross-validation.
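A sketch of this classifier using scikit-learn (our choice of library; the paper does not name one). The stand-in data and the C and γ values are placeholders; in practice the rows would be the F(i) vectors of labelled frames and the parameters would come from cross-validation, as noted above.

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in data: rows would be 120-dimensional F(i) vectors;
# 0 = normal sequence, 1 = hard cut, 2 = gradual transition.
rng = np.random.default_rng(0)
X_train = rng.random((300, 120))
y_train = rng.integers(0, 3, 300)

# C and gamma correspond to Eqs. (7) and (9); placeholder values here.
clf = SVC(C=10.0, kernel="rbf", gamma=0.01)
clf.fit(X_train, y_train)
print(clf.n_support_)  # number of support vectors retained per class
```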
5 Experiments

5.1 Data and Performance Criteria
The video sequences used for our data set were taken from TV series, documentaries and educational films. Nine videos (70000 frames) were used, containing 355 hard cuts and 142 dissolves, manually annotated. To evaluate the performance of our method we used the following criteria [2]:

\mathrm{Recall} = \frac{N_c}{N_c + N_m}, \quad \mathrm{Precision} = \frac{N_c}{N_c + N_f}, \quad F_1 = \frac{2 \times \mathrm{Rec} \times \mathrm{Prec}}{\mathrm{Rec} + \mathrm{Prec}},   (10)
where N_c stands for the number of correctly detected shot boundaries, N_m for the number of missed ones and N_f for the number of false detections. During our experiments we calculate the F_1 value for the cuts (F_1^C) and the dissolves (F_1^D) separately. The final performance measure is then given by the following equation:

F_1 = \frac{\alpha}{\alpha + b} F_1^C + \frac{b}{\alpha + b} F_1^D,   (11)

where α is the number of true hard cuts and b the number of true dissolves.
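A direct Python transcription of Eqs. (10)–(11), checked against the cut row of Table 1 below:

```python
def shot_metrics(n_c, n_m, n_f):
    """Recall, precision and F1 of Eq. (10) from the counts of correct
    (N_c), missed (N_m) and false (N_f) detections."""
    rec = n_c / (n_c + n_m)
    prec = n_c / (n_c + n_f)
    return rec, prec, 2 * rec * prec / (rec + prec)

def combined_f1(f1_cuts, f1_dissolves, a, b):
    """Weighted F1 of Eq. (11); a and b are the numbers of true hard
    cuts and true dissolves, respectively."""
    return (a * f1_cuts + b * f1_dissolves) / (a + b)

# Cut row of Table 1: 351 correct, 4 missed, 9 false detections.
print(shot_metrics(351, 4, 9))  # ~ (0.9887, 0.9750, 0.9818)
```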
5.2 Results and Comparison
In our experiments, 8 videos are used for training and the 9th for testing; therefore, 9 “rounds” of testing were conducted. In order to obtain good values of the parameters γ and C (in terms of providing high F_1 values), in each round we applied 3-fold cross-validation using the 8 videos of the corresponding training set.

A difficulty of the problem under consideration is that it generates an imbalanced training set, containing few positive examples and a huge number of negative ones. In our approach we sample negative examples uniformly, reducing their number to 3% of the total number of examples. More specifically, in our training set there are on average 440 positive examples (transitions) and 2200 negative examples (no transitions). Each model of the training procedure generated on average 1276 support vectors for normal sequences, 101 support vectors for gradual transitions and 152 support vectors for abrupt transitions.

We also tested our method using larger windows of width w = 50 and w = 60. In Tables 1–3 we provide the classification results using different selections of window length. We notice that the performance improves as the size of the window increases: false boundaries are reduced, since larger windows contain more information, and larger windows also help the detection of dissolves that last longer. In order to reduce the size of our feature vector, we have also considered, as feature vectors used to train the SVM classifier, those obtained from the concatenation of the features extracted for l=2 and l=6 only. It can be observed (Table 4) that even with the shorter feature vector the proposed algorithm gives very good results, only slightly inferior to those obtained with the longer feature vector.
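A sketch of the 3-fold cross-validated selection of (C, γ) described above, again using scikit-learn; the grid values are our assumption, and X_train, y_train are as in the earlier training sketch:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid,
                      scoring="f1_macro", cv=3)
search.fit(X_train, y_train)  # X_train, y_train from the earlier sketch
print(search.best_params_)
```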
Table 1. Performance results for w = 40, l=1, l=2 and l=6

TRANSITION TYPE   Nc    Nm   Nf   Recall    Precision   F1
CUTS              351   4    9    98.87%    97.50%      98.18%
DISSOLVES         127   15   33   89.44%    79.38%      84.11%
AVERAGE           -     -    -    96.18%    92.32%      94.21%

Table 2. Performance results for w = 50, l=1, l=2 and l=6

TRANSITION TYPE   Nc    Nm   Nf   Recall    Precision   F1
CUTS              352   3    8    99.15%    97.78%      98.46%
DISSOLVES         130   12   25   91.55%    83.87%      87.54%
AVERAGE           -     -    -    96.98%    93.80%      95.37%

Table 3. Performance results for w = 60, l=1, l=2 and l=6

TRANSITION TYPE   Nc    Nm   Nf   Recall    Precision   F1
CUTS              353   2    4    99.44%    98.88%      99.16%
DISSOLVES         127   15   25   89.44%    83.55%      86.39%
AVERAGE           -     -    -    96.58%    94.50%      95.53%

Table 4. Performance results for w = 50, l=2 and l=6

TRANSITION TYPE   Nc    Nm   Nf   Recall    Precision   F1
CUTS              350   5    5    98.59%    97.49%      98.04%
DISSOLVES         129   13   21   90.85%    86.00%      88.36%
AVERAGE           -     -    -    96.38%    94.21%      95.28%
To demonstrate the effectiveness of our algorithm and its advantage over threshold-dependent methods, we implemented three methods that use thresholds in different ways: pair-wise comparison of successive frames [22], the likelihood ratio test ([12], [22]) and the twin-comparison method [22]. The first two methods can only detect cuts, while the third can identify both abrupt and gradual transitions. The obtained results indicate that our algorithm outperforms the three threshold-dependent methods. In Table 5 we provide the recall, precision and F_1 values for our algorithm and the three methods under consideration. For our algorithm we present the results using w = 50 and the best values of (C, γ), using all features (l=1, l=2 and l=6) and fewer features (l=2 and l=6). The thresholds used in the three reference methods were calculated in different ways: we used adaptive thresholds in the pair-wise comparison algorithm, cross-validation in the likelihood ratio method and, finally, a global adaptive threshold in the twin-comparison method. Especially for dissolve detection, our algorithm provides far better results than the twin-comparison algorithm.
Table 5. Comparative results using Recall, Precision and F1 measures

                                 CUTS                          DISSOLVES
METHOD                           Recall   Precision   F1       Recall   Precision   F1
w = 50, l=1, l=2 and l=6         99.15%   97.78%      98.46%   91.55%   83.87%      87.54%
w = 50, l=2 and l=6              98.59%   97.49%      98.04%   90.85%   86.00%      88.36%
PAIR-WISE COMPARISON [22]        85.07%   84.83%      84.95%   -        -           -
LIKELIHOOD RATIO [22]            94.37%   86.12%      90.05%   -        -           -
TWIN-COMPARISON [22]             89.30%   88.05%      88.92%   70.42%   64.94%      67.57%
6 Conclusion - Future Work

In this paper we have proposed a method for shot-boundary detection and for discrimination between hard cuts and dissolves. Features that describe the variation between adjacent frames and the contextual information were derived from color histograms using a temporal window. These feature vectors become inputs to an SVM classifier which categorizes frames of the video sequence into normal sequences, hard cuts and gradual transitions. This categorization provides an effective segmentation of any video into shots and is thus a valuable aid to further analysis of the video for indexing and browsing. The main advantage of this method is that it is not threshold-dependent. As future work, we will try to improve the performance of the method by extracting other types of features from the video sequence.
Acknowledgments

This research project (PENED) is co-financed by the E.U.-European Social Fund (75%) and the Greek Ministry of Development-GSRT (25%).
References

1. Bescós, J., Cisneros, G., Martínez, J.M., Menéndez, J.M., Cabrera, J.: A Unified Model for Techniques on Video-Shot Transition Detection. IEEE Trans. Multimedia 7(2), 293–307 (2005)
2. Bimbo, A.D.: Visual Information Retrieval. Morgan Kaufmann Publishers, Inc., San Francisco (1999)
3. Boccignone, G., Chianese, A., Moscato, V., Picariello, A.: Foveated Shot Detection for Video Segmentation. IEEE Trans. Circuits and Systems for Video Technology 15(3), 365–377 (2005)
4. Boreczky, J.S., Rowe, L.A.: Comparison of Video Shot Boundary Detection Techniques. In: Proc. SPIE Storage and Retrieval for Image and Video Databases, vol. 2664, pp. 170–179 (1996)
5. Cernekova, Z., Pitas, I., Nikou, C.: Information Theory-Based Shot Cut/Fade Detection and Video Summarization. IEEE Trans. Circuits and Systems for Video Technology 16(1), 82–91 (2006)
6. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
7. Dalatsi, C., Krinidis, S., Tsekeridou, S., Pitas, I.: Use of Support Vector Machines based on Color and Motion Features for Shot Boundary Detection. In: International Symposium on Telecommunications (2001)
8. Feng, H., Fang, W., Liu, S., Fang, Y.: A new general framework for shot boundary detection and key-frame extraction. In: Proc. 7th ACM SIGMM Int. Workshop on Multimedia Inf. Retrieval, pp. 121–126 (2005)
9. Fernando, W.A.C., Canagarajah, C.N., Bull, D.R.: Fade and dissolve detection in uncompressed and compressed video sequences. In: Proc. IEEE Int. Conf. Image Processing, vol. 3, pp. 299–303 (1999)
10. Gargi, U., Kasturi, R., Strayer, S.H.: Performance characterization of video-shot-detection methods. IEEE Trans. Circuits and Systems for Video Technology 10(1), 1–13 (2000)
11. Hanjalic, A.: Shot-boundary detection: Unraveled and resolved? IEEE Trans. Circuits and Systems for Video Technology 12(2), 90–105 (2002)
12. Kasturi, R., Jain, R.: Dynamic Vision. In: Kasturi, R., Jain, R. (eds.) Computer Vision: Principles, pp. 469–480. IEEE Computer Society Press, Washington (1991)
13. Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: a stepwise procedure for building and training a neural network. In: Fogelman, J. (ed.) Neurocomputing: Algorithms, Architectures and Applications. Springer, Heidelberg (1990)
14. Lelescu, D., Schonfeld, D.: Statistical sequential analysis for real-time video scene change detection on compressed multimedia bitstream. IEEE Trans. Multimedia 5(1), 106–117 (2003)
15. Lienhart, R.: Comparison of automatic shot boundary detection algorithms. In: Proc. SPIE Storage and Retrieval for Image and Video Databases VII, San Jose, CA, vol. 3656, pp. 290–301 (1999)
16. Lienhart, R.: Reliable dissolve detection. In: Proc. SPIE Storage and Retrieval for Media Databases 2001, vol. 4315, pp. 219–230 (2001)
17. Nagasaka, A., Tanaka, Y.: Automatic video indexing and full-video search for object appearances. In: Knuth, E., Wegner, L.M. (eds.) Visual Database Systems II, pp. 113–127. Elsevier, Amsterdam (1995)
18. Ngo, C.W., Pong, T.C., Chin, R.T.: Video partitioning by temporal slice coherence. IEEE Trans. Circuits and Systems for Video Technology 11(8), 941–953 (2001)
19. NIST: Homepage of TRECVID Evaluation. [Online]. http://www-nlpir.nist.gov/projects/trecvid/
20. Yuan, J., Wang, H., Xiao, L., Zheng, W., Li, J., Lin, F., Zhang, B.: A Formal Study of Shot Boundary Detection. IEEE Trans. Circuits and Systems for Video Technology 17(2), 168–186 (2007)
21. Zabih, R., Miller, J., Mai, K.: Feature-Based Algorithms for Detecting and Classifying Production Effects. Multimedia Systems 7(2), 119–128 (1999)
22. Zhang, H.J., Kankanhalli, A., Smoliar, S.W.: Automatic partitioning of full-motion video. Multimedia Systems 1(1), 10–28 (1993)
Comparative Performance Evaluation of Artificial Neural Network-Based vs. Human Facial Expression Classifiers for Facial Expression Recognition

I.-O. Stathopoulou and G.A. Tsihrintzis

Department of Informatics, University of Piraeus, Piraeus 185 34, Greece
{iostath,geoatsi}@unipi.gr
Abstract. Towards building new, friendlier human-computer interaction and multimedia interactive service systems, we developed a neural network-based image processing system (called NEU-FACES), which first determines automatically whether or not there are any faces in given images and, if so, returns the location and extent of each face. Next, NEU-FACES uses neural network-based classifiers which allow the classification of several facial expressions from features that we develop and describe. In the process of building NEU-FACES, we conducted an empirical study in which we specified related design requirements and studied statistically the expression recognition performance of humans. In this paper, we evaluate the performance of our NEU-FACES system against the expression recognition performance of humans.
1 Introduction

Facial expressions are particularly significant in communicating information in human-to-human interaction and interpersonal relations, as they reveal information about the affective state, cognitive activity, personality, intention and psychological state of a person, and this information may in fact be quite difficult to mask. In the design of advanced human-computer interfaces, the variations in the emotions of human users during the interaction should be taken into consideration, and the computer should be made able to react accordingly. Images that contain user faces are thus instrumental in the development of more effective and friendlier methods of human-computer interaction, since facial expressions corresponding to the “neutral”, “smile”, “sad”, “surprise”, “angry”, “disgust” and “bored-sleepy” psychological states arise very commonly during a typical human-computer interaction session.

The task of processing facial images generally consists of two steps: a face detection step, which determines whether or not there are any faces in an image and, if so, returns the location and extent of each face, and a facial expression classification step, which attempts to recognize the expression formed on a detected face. These problems are quite challenging because faces are non-rigid and have a high degree of
variability in size, shape, color and texture. Furthermore, variations in pose, facial expression, image orientation and imaging conditions add to the level of difficulty of the problem. The task of determining the true psychological state of a person from an image of his/her face is complicated further by the problem of pretence, i.e. the case of the person’s facial expression not corresponding to his/her true psychological state. The difficulties in facial expression classification by humans and some indicative classification error percentages are illustrated in [1].

Previous attempts to address similar problems of face detection and facial expression classification in images have followed two main directions in the literature: (1) methods based on face features [2-5] and (2) image-based representations of the face [6-8]. Our system follows the first direction (feature-based approach) and has been developed over a time period of approximately two years; it is, therefore, an evolution of its previous versions [9-13]. Specifically, the face detection algorithm currently used was developed and presented in [9], while the facial expression classification algorithms are evolved and extended versions of those gradually developed and presented in [10-14].

In this paper we present a performance evaluation of NEU-FACES [14], a fully automated neural network-based face detection and facial expression classification system, with emphasis on the facial expression classification subsystem. Our studies identified seven emotions which arise very commonly during a typical human-computer interaction session; vision-based human-computer interaction systems that recognize them could guide the computer to “react” accordingly and attempt to better satisfy its user’s needs. Specifically, the corresponding expressions are: “neutral”, “happy”, “sadness”, “surprise”, “anger”, “disgust” and “bored-sleepy”. NEU-FACES is able to recognize these emotions satisfactorily, and in this paper we present its performance evaluation, compared to the answers given in empirical studies where humans were asked to classify the corresponding emotions from a subject’s image.

More specifically, the paper is organized as follows: in Section 2, we present our empirical studies on human subjects, where we constructed two types of questionnaires, and describe the structure of each of them. In Section 3, we present our NEU-FACES system, concentrating on the facial expression classification module. In Section 4, we evaluate the performance of our system versus the performance of humans. We draw conclusions and point to future work in Sections 5 and 6, respectively.
2 Empirical Studies on Human Subjects

2.1 The Questionnaire Structure

In order to understand how a human classifies someone else’s facial expression and to set a target error rate for automated systems, we developed a questionnaire in which we asked 300 participants to state their thoughts on a number of facial expression-related questions and images. Specifically, the questionnaire consisted of three different parts:

1. In the first part, the observer was asked to identify an emotion from the facial expressions that appeared in 14 images. Each participant could choose from the 7 most common emotions that we pointed out earlier, namely “anger”, “happiness”, “neutral”, “surprise”, “sadness”, “disgust” and “boredom-sleepiness”, or specify any other emotion that he/she thought appropriate. Next, the participant had to state the degree of certainty (from 0-100%) of his/her answer. Finally, he/she had to state which features (such as the eyes, the nose, the mouth, the cheeks etc.) had helped him/her make that decision. A typical question of the first part of the questionnaire is depicted in Figure 1.

2. When filling in the second part of the questionnaire, each participant had to identify an emotion from parts of a face. Specifically, we showed them the “neutral” facial image of a subject and the corresponding image of some other expression. In this latter image, pieces were cut out, leaving only certain parts of the face, namely the “eyes”, the “mouth”, the “forehead”, the “cheeks”, the “chin” and the “brows”. This is typically shown in Figure 2. Again, each participant could choose from the 7 most common emotions, “anger”, “happiness”, “neutral”, “surprise”, “sadness”, “disgust”, “boredom-sleepiness”, or specify any other emotion that he/she thought appropriate. Next, the participant had to state the degree of certainty (from 0-100%) of his/her answer. Finally, the participant had to specify which features had helped him/her make that decision.

3. In the final (third) part of our study, we asked the participants to supply information about their background (e.g. age, interests, etc.). Additionally, each participant was asked to provide information about:
   • The level of difficulty of the questionnaire with regard to the task of emotion recognition from face images
   • Which emotion he/she thought was the most difficult to classify
   • Which emotion he/she thought was the easiest to classify
   • The percentage to which a facial expression maps into an emotion (0-100%)
Fig. 1. The first part of the questionnaire
Fig. 2. The second part of the questionnaire
2.2 The Participant and Subject Backgrounds

There were 300 participants in our study. All the participants were Greek, and thus familiar with Greek culture and the Greek ways of expressing emotions. They were mostly undergraduate or graduate students and faculty at our university, and their ages varied between 19 and 45 years.
3 Facial Expression Classification Subsystem

In order for our system to be fully automated, we first locate and extract the face using the face detection subsystem. The face data is then fed to the facial expression classification subsystem, which preprocesses it in order to extract facial features of high discrimination power, namely: (1) left eye dimension ratio, (2) right eye dimension ratio, (3) mouth dimension ratio, (4) face dimension ratio, (5) forehead texture, (6) texture between the brows, (7) left eyebrow direction, (8) right eyebrow direction, and (9) mouth direction. These features constitute the input to a two-layer neural network. The network produces a 7-dimensional output vector which can be regarded as the degree of membership of the face image in each of the “neutral”, “happiness”, “surprise”, “anger”, “disgust-disapproval”, “sadness” and “bored-sleepy” classes. An illustration of the network architecture can be seen in Figure 3.
Fig. 3. The Facial Expression Neural Network Classifier
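The following is a minimal sketch of a classifier with this interface (9 input features, 7-class membership output), using scikit-learn. The hidden-layer size, solver settings and the stand-in data are our assumptions; the text above specifies only a two-layer network with a 7-dimensional output.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in training data; real inputs would come from the feature
# extraction algorithm of Section 3.2 (9 geometric/texture features).
rng = np.random.default_rng(0)
X = rng.random((1050, 9))
y = rng.integers(0, 7, 1050)  # 7 expression classes

net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000)
net.fit(X, y)
print(net.predict_proba(X[:1]))  # membership-style vector, cf. Table 2
```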
3.1 Discriminating Features for Facial Expressions

For the classification task, we gathered and studied a dataset of 1400 images of facial expressions, corresponding to 200 different persons forming the “neutral” expression and the six emotions: “happiness”, “sadness”, “surprise”, “anger”, “disgust” and “bored-sleepy”. We use the “neutral” expression as a template, which can somehow be deformed into the other expressions. From our study of these images, we identified significant variations between the “neutral” and the other expressions, which can be quantified into a classifying feature vector. Typical such variations are shown in Table 1.

Table 1. Formation of facial expressions via deformation of the neutral expression
Variations between facial expressions:

Happiness
• Bigger-broader mouth
• Slightly narrower eyes
• Changes in the texture of the cheeks
• Occasionally, changes in the orientation of the brows

Surprise
• Longer head
• Bigger-wider eyes
• Open mouth
• Wrinkles in the forehead (changes in the texture)
• Changes in the orientation of the eyebrows (the eyebrows are raised)

Anger
• Wrinkles between the eyebrows (different textures)
• Smaller eyes
• Wrinkles in the chin
• The mouth is tight
• Occasionally, wrinkles over the eyebrows, in the forehead

Boredom-Sleepiness
• Head slightly turned downwards
• Eyes slightly closed
• Occasionally, wrinkles formed in the forehead and a different direction of the brows

Sadness
• Changes in the direction of the mouth
• Wrinkles formed on the chin (different texture)
• Occasionally, wrinkles formed in the forehead and a different direction of the brows

Disgust-Disapproval
• The distance between the nostrils and the eyes is shortened
• Wrinkles between the eyebrows and on the nose
• Wrinkles formed on the chin and the cheeks
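As an illustration of how such geometric deformations can be quantified, the following is a sketch of one of the nine features named above, a dimension ratio. The corner-point layout is our assumption; the text only names the ratios themselves.

```python
import numpy as np

def dimension_ratio(corners):
    """Width-to-height ratio of a facial part (eye or mouth) from its
    corner points; one of the nine features listed in Section 3.
    `corners` maps 'left'/'right'/'top'/'bottom' to (x, y) points."""
    width = np.linalg.norm(np.subtract(corners["right"], corners["left"]))
    height = np.linalg.norm(np.subtract(corners["top"], corners["bottom"]))
    return width / height

# A wide-open eye ("surprise") grows in height, so its ratio drops
# relative to the "neutral" template.
neutral_eye = {"left": (10, 6), "right": (40, 6), "top": (25, 2), "bottom": (25, 10)}
surprised_eye = {"left": (10, 6), "right": (40, 6), "top": (25, 0), "bottom": (25, 14)}
print(dimension_ratio(neutral_eye), dimension_ratio(surprised_eye))
```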
3.2 The Feature Extraction Algorithm

The feature extraction process in NEU-FACES converts pixel data into a higher-level representation of shape, motion, color, texture and spatial configuration of the face and its components. We extract such classification features on the basis of observations of the facial changes that arise during the formation of various facial expressions, as indicated in Table 1. Specifically, we locate and extract the corner points of specific regions of the face, such as the eyes, the mouth and the eyebrows, and compute variations in size or orientation from the “neutral” expression to another one. Also, we extract specific regions of the face, such as the forehead or the region between the eyebrows, so as to compute variations in texture. The extracted features are illustrated in Figure 4.

Fig. 4. The extracted features (gray points), the measured dimensions (gray lines) and the regions (orthogonals) of the face

Specifically, the feature extraction algorithm works as follows:
3.
4.
5.
Search the binary face image and extract its parts (eyes, mouth and brows) into a new image of the same dimensions and coordinates as the original image. In each image of a face part, locate corner points using relationships between neighboring pixel values. This results in the determination of 18 facial points, which are subsequently used to form the classification feature vector. Based on these corner points, extract the specific regions of the faces (e.g. forehead, region between the eyebrows). The extracted corner points and regions can be seen in the third column in Table 3 in the Results Section, as they correspond to the six facial expressions of the same person shown in the first column. Although these regions are located in the binary face image, their texture measurement is computed from the corresponding region of the detected face image (‘window pattern’) in the second column. Compute the Euclidean distances between these points, depicted with gray lines in Figure 1, and certain specific ratios of these distances. Compute the orientation of the brows and the mouth. Finally, compute a measure of the texture for each of the specific regions based on the texture of the corresponding “neutral” expression. The results of the previous steps form the feature vector, which is fed into a neural network.
3.3 Training and Results

After computing the feature vector, we use it as input to an artificial neural network to classify facial images according to the expression they contain. To train the neural network we used a training set of 1050 images, consisting of 150 persons forming the seven facial expressions. During training, the neural network reached an error rate of 10^-2.
Some of the results obtained by our neural network can be seen in Table 2. Specifically, in the first column we see a typical input image, whereas in the second column we see the results of the Face Detection Subsystem. The extracted features are shown in the third column and, finally, the Facial Expression Classification Subsystem’s response is shown in the fourth column. According to the requirements set, when the window pattern represents a “neutral” facial expression, the neural network should produce an output value of [1.00; 0.00; 0.00; 0.00; 0.00; 0.00; 0.00] or so. Similarly, for the “smile” expression, the output must be [0.00; 1.00; 0.00; 0.00; 0.00; 0.00; 0.00], and so on for the other expressions. The output value can be regarded as the degree of membership of the face image in each of the [Neutral; Happiness; Surprise; Anger; Disgust-Disapproval; Sadness; Bored-Sleepy] classes in the corresponding position.

Table 2. Face Detection and Feature Extraction (the input images, detected faces and extracted features are omitted here; the classification outputs are listed per expression)

Expression            Expression Classification output
Neutral               [1.00; 0.00; 0.00; 0.00; 0.00; 0.00; 0.00]
Happiness             [0.12; 0.83; 0.00; 0.00; 0.00; 0.05; 0.00]
Surprise              [0.00; 0.00; 0.93; 0.00; 0.07; 0.00; 0.00]
Anger                 [0.00; 0.13; 0.00; 0.63; 0.34; 0.00; 0.00]
Disgust-Disapproval   [0.00; 0.00; 0.00; 0.22; 0.61; 0.01; 0.16]
Sadness               [0.00; 0.00; 0.00; 0.23; 0.00; 0.66; 0.11]
Bored-Sleepy          [0.00; 0.00; 0.00; 0.29; 0.00; 0.23; 0.58]
4 Evaluation of Performance

The NEU-FACES system managed to classify emotions based on a person’s face quite satisfactorily. We tested NEU-FACES with 20 subjects forming the 7 facial expressions corresponding to the 7 equivalent emotions, giving a total of 140 images. The results are summarized in Table 3. In the first three columns we show the results from our empirical studies on humans: specifically, the first part of the questionnaire in the first column, the second part in the second column, and the mean success rate in the third. In the fourth column we depict the success rate of NEU-FACES for the corresponding emotion.

As we can observe, NEU-FACES achieved higher success rates than humans for most of the emotions, with the exception of the “anger” emotion, where it achieved only 55%. This is mostly because, first, of the pretence that may accompany such an emotion and, secondly, of the difficulty humans have in showing such an emotion fully. The second point is further validated by the fact that the majority of the face images depicting “anger” that were erroneously classified by our system were misclassified as “neutral”. Generally, NEU-FACES achieves very good results for positive emotions, such as “happiness” and “surprise”, where it achieved 90% and 95%, respectively.
Table 3. Success rates in the two parts of the questionnaire vs. the NEU-FACES system

                  Questionnaire results                  NEU-FACES
Emotion           1st Part   2nd Part   Mean Value       System Results
Neutral           39.25%     -          61.74%           80%
Happiness         68.94%     96.21%     82.57%           90%
Sadness           34.09%     82.58%     58.33%           60%
Disgust           18.74%     13.64%     16.19%           65%
Boredom-Sleepy    50.76%     78.03%     64.39%           75%
Anger             76.14%     69.7%      72.92%           55%
Surprise          89.77%     95.45%     92.61%           95%
5 Conclusions

Automatic face detection and expression classification in images is a prerequisite for the development of novel human-computer interaction modalities. However, the development of integrated, fully operational detection/classification systems is known to be non-trivial, a fact that was corroborated by our own statistical results regarding expression classification by humans. Towards building such systems, we developed a neural network-based system, called NEU-FACES, which first determines whether or not there are any faces in given images and, if so, returns the location and extent of each face. Next, we described features which allow the classification of several facial expressions and presented neural network-based classifiers which use them. The proposed system is fully functional and integrated, in that it consists of modules which capture face images, estimate the location and extent of faces, and classify facial expressions. Therefore, the present or improved versions of our system could be incorporated into advanced human-computer interaction systems and multimedia interactive services.
6 Future Work In the future, we will extend this work in the following three directions: (1) we will improve our system by using wider training sets so as to cover a wider range of poses and cases of low quality of images. (2) We will investigate the need for classifying into more than the currently available facial expressions, so as to obtain more accurate estimates of a computer user’s psychological state. In turn, this may require the extraction and tracing of additional facial points and corresponding features. (3) We
plan to apply our system to the expansion of human-computer interaction techniques, such as those that arise in mobile telephony, in which the quality of the input images is too low for existing systems to operate reliably. Another extension of the present work, of longer-term interest, will address several problems of ambiguity concerning the emotional meaning of facial expressions by processing contextual information that a multi-modal human-computer interface may provide. For example, complementary research projects are being developed [15-17] that address the problem of perceiving users' emotions through their actions (mouse, keyboard, commands, system feedback) and through spoken words. This and other related work will be presented on future occasions.
Acknowledgement This work has been sponsored by the General Secretariat for Research and Technology of the Greek Ministry of Development as part of the PENED basic research program.
References
[1] Stathopoulou, I.-O., Tsihrintzis, G.A.: Facial Expression Classification: Specifying Requirements for an Automated System. In: 10th International Conference on Knowledge-Based & Intelligent Information & Engineering Systems, Bournemouth, United Kingdom, October 9-11 (2006)
[2] Ekman, P., Friesen, W.: Unmasking the Face: A Guide to Recognizing Emotions from Facial Expressions. Prentice-Hall, Englewood Cliffs (1975)
[3] Terzopoulos, D., Waters, K.: Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(6), 569–579 (1993)
[4] Essa, I., Pentland, A.: Coding, analysis, interpretation and recognition of facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 757–763 (1997)
[5] Black, M.J., Yacoob, Y.: Recognizing facial expressions under rigid and non-rigid facial motions. In: Proceedings of the International Workshop on Automatic Face and Gesture Recognition, pp. 12–17. IEEE Press, Los Alamitos (1995)
[6] Lisetti, C.L., Schiano, D.J.: Automatic Facial Expression Interpretation: Where Human-Computer Interaction, Artificial Intelligence and Cognitive Science Intersect. Pragmatics and Cognition (Special Issue on Facial Information Processing: A Multidisciplinary Perspective) 8(1), 185–235 (2000)
[7] Dailey, M.N., Cottrell, G.W., Adolphs, R.: A six-unit network is all you need to discover happiness. In: Proceedings of the Twenty-Second Annual Conference of the Cognitive Science Society, pp. 101–106. Erlbaum, Mahwah (2000)
[8] Rosenblum, M., Yacoob, Y., Davis, L.: Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks 7(5), 1121–1138 (1996)
[9] Stathopoulou, I.-O., Tsihrintzis, G.A.: A new neural network-based method for face detection in images and applications in bioinformatics. In: Proceedings of the 6th International Workshop on Mathematical Methods in Scattering Theory and Biomedical Engineering, September 17-21 (2003)
[10] Stathopoulou, I.-O., Tsihrintzis, G.A.: A neural network-based facial analysis system. In: 5th International Workshop on Image Analysis for Multimedia Interactive Services, Lisboa, Portugal, April 21-23 (2004)
[11] Stathopoulou, I.-O., Tsihrintzis, G.A.: An Improved Neural Network-Based Face Detection and Facial Expression Classification System. In: IEEE International Conference on Systems, Man, and Cybernetics 2004, The Hague, The Netherlands, October 10-13 (2004)
[12] Stathopoulou, I.-O., Tsihrintzis, G.A.: Pre-processing and expression classification in low quality face images. In: 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, Smolenice, Slovak Republic, June 29 – July 2 (2005)
[13] Stathopoulou, I.-O., Tsihrintzis, G.A.: Evaluation of the Discrimination Power of Features Extracted from 2-D and 3-D Facial Images for Facial Expression Analysis. In: 13th European Signal Processing Conference, Antalya, Turkey, September 4-8 (2005)
[14] Stathopoulou, I.-O., Tsihrintzis, G.A.: Detection and Expression Classification Systems for Face Images (FADECS). In: IEEE Workshop on Signal Processing Systems (SiPS 2005), Athens, Greece, November 2-4 (2005)
[15] Virvou, M., Alepis, E.: Mobile educational features in authoring tools for personalised tutoring. Computers & Education (to appear, 2004)
[16] Virvou, M., Katsionis, G.: Relating Error Diagnosis and Performance Characteristics for Affect Perception and Empathy in an Educational Software Application. In: Proceedings of the 10th International Conference on Human-Computer Interaction (HCII 2003), Crete, Greece, June 22-27 (2003)
[17] Alepis, E., Virvou, M., Kabassi, K.: Affective student modeling based on microphone and keyboard user actions. In: 6th IEEE International Conference on Advanced Learning Technologies (ICALT 2006), pp. 139–141 (2006). ISBN 0-7695-2632-2
Histographic Steganographic System Constantinos Patsakis and Nikolaos Alexandris Department of Informatics, University of Piraeus Abstract. In this paper we propose a new steganographic algorithm for JPEG images named HSS, with very good statistical properties. The algorithm is based on previous work of Avidan and Shamir on image resizing. One of the key features of the algorithm is its capability of hiding the message according to the cover image properties, making the hidden message as untraceable as possible.
1 Introduction Data hiding has always been an important issue. Data hiding has two meanings: hiding the meaning of the data, or hiding their very existence. Nowadays, due to the growth of technology, two sciences involving data hiding have developed, cryptography and steganography. The main purpose of the first is to hide the contents of data so that only authenticated entities can have access to them. The latter, steganography, has been developed for embedding data in other media, such as text, sound or image, in order to hide the very existence of the hidden data. Perhaps the model that best describes steganography is the prisoners' problem, given by [1] and [2], where two prisoners, Alice and Bob, from now on A and B respectively, want to escape prison. Both of them have to exchange information without their warden, from now on W, knowing that they are up to something. If W finds out that there is something peculiar in the way A and B are talking to each other, then he will put them in separate wards and their escape plan is doomed. The algorithm that we propose, HSS, embeds data in JPEG images so that only A and B are able to know of their existence. The image is processed in such a way that the original image does not differ much from the stego-image visually, and its statistical properties do not reveal that there is data infiltration. In any case, both symmetric and asymmetric algorithms can be used, as we will show later on. The paper is organized as follows: after this short introduction, we give some necessary background material. We then present the algorithm and show some facts about its performance as well as its advantages over other algorithms. Finally, we conclude, giving a summary of what has been achieved and of things that can be done in later work.
2 Background In recent work, Avidan and Shamir [3] proposed a new algorithm for image resizing based on the energy weight of the pixels to be removed. They
Fig. 1. The classic model of steganography
find seams, paths of small energy, and remove them from each row or column, depending on how the image has to be resized. We will now give the necessary definitions and tools for defining our algorithm. For the sake of simplicity, we will use the definitions that Avidan and Shamir have used.

Definition 1. Let I be an n×m image and define a vertical seam to be

s^x = {s^x_i}_{i=1}^n = {(x(i), i)}_{i=1}^n

such that ∀i, |x(i) − x(i−1)| ≤ 1, where x is a mapping x : [1, ..., n] → [1, ..., m].

If we try to imagine the visual meaning of a vertical seam, we can say that it is a vertical path from the top of the image to the bottom. Similarly, we have the horizontal seam; this will of course be a horizontal path from the left side of the picture to the right.

Definition 2. Let I be an n×m image and define a horizontal seam to be

s^y = {s^y_j}_{j=1}^m = {(j, y(j))}_{j=1}^m

such that ∀j, |y(j) − y(j−1)| ≤ 1, where y is a mapping y : [1, ..., m] → [1, ..., n].

We will use the notation I(s) for the pixels of a seam s.

Definition 3. Given an energy function e, we define the cost of a seam as

E(s) = E(I_s) = ∑_{i=1}^n e(I(s_i))
In contrast to Avidan and Shamir, we will regard the optimal seam s* as the seam that maximizes the seam cost:
Fig. 2. Vertical and horizontal seams
Fig. 3. Original picture,its energy map and gradient
s* = max_s E(s) = max_s ∑_{i=1}^n e(I(s_i))
The energy function that we are going to use is

e_HoG(I) = (|∂I/∂x| + |∂I/∂y|) / max(HoG(I(x, y)))
where HoG(I(x, y)) is taken to be a histogram of oriented gradients at every pixel. We use an 8-bin histogram computed over an 11×11 window around the pixel. Taking the maximum of the HoG in the denominator attracts the seams to edges in the image, while the numerator makes sure that the seam runs parallel to an edge and does not cross it. As Avidan and Shamir propose, the optimal seam can be found using dynamic programming.
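Since the paper only points to dynamic programming for the seam search, the following is a minimal numpy sketch of how the maximum-cost vertical seam s* could be computed over a given energy map; the array layout and function name are our own assumptions, not the authors' code, and the recurrence is the usual seam-carving one with max substituted for min.

import numpy as np

def max_cost_vertical_seam(energy):
    # energy: 2-D array of per-pixel energies e(I), shape (n, m)
    n, m = energy.shape
    # cost[i, j] = best accumulated energy of a seam ending at pixel (i, j);
    # a seam moves at most one column left or right between rows.
    cost = energy.astype(float).copy()
    for i in range(1, n):
        left = np.full(m, -np.inf); left[1:] = cost[i - 1, :-1]
        right = np.full(m, -np.inf); right[:-1] = cost[i - 1, 1:]
        cost[i] += np.maximum(cost[i - 1], np.maximum(left, right))
    # Backtrack from the best pixel in the last row.
    seam = np.empty(n, dtype=int)
    seam[-1] = int(np.argmax(cost[-1]))
    for i in range(n - 2, -1, -1):
        j = seam[i + 1]
        lo, hi = max(0, j - 1), min(m, j + 2)
        seam[i] = lo + int(np.argmax(cost[i, lo:hi]))
    return seam

Applying the same routine to the transposed energy map would yield horizontal seams.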
3 The Algorithm The main idea of the algorithm is to hide data in areas of the image that have much energy and where changes do not affect the way the picture is shown. In this way, the
Fig. 4. Original picture and stego image
bias of the DCT will be ignored by W, since it is normal for high-energy areas to show big changes. Moreover, since these areas carry so much energy, they capture the eye, which therefore cannot detect many differences. It is obvious that the resizing algorithm of Avidan and Shamir tries to keep these areas untouched, as they are the true carriers of information. Furthermore, these changes do not distort the picture and do not add noise to the whole of it, which would make it suspicious to W. Finally, since these areas need more information to be stored, the impact of their distortion will pass unnoticed through the compression. We assume that both parties share a secret key K, which they will use in order to exchange their hidden information. We will now divide the picture in parts according to optimal seams. The encryption algorithm that we are going to use is AES [4, 5], so the key sizes that can be used are 128, 192 and 256 bits. The key size depends on the decision of entities A and B. In order to track possible malicious or accidental changes to the image we will use the SHA-1 hash function [6, 7]. The algorithm is presented here in its simplest form, yet it can easily be altered to meet more special needs. Let M be the secret message that A wants to transfer to B and I the image of n by m pixels that will be the carrier of M. Entity A computes C = Enc_K(M || SHA(M)). We pad C in order to obtain a C′ such that len(C′) mod n = 0; for the sake of simplicity, we pad C with zeros. Now we have to compute h = 3·len(C′)/(8n) vertical seams that will hide our data inside them; the same can be done using horizontal seams. Using dynamic programming we find h optimal vertical seams that do not intersect, and we increase their energy by the smallest possible value, so that after the infiltration all h seams have the biggest possible values. We now take these h vertical seams and, using the LSB method, we embed C′ into their last two bits. It is apparent that by reversing the steps of the HSS algorithm one can extract the embedded message from the stego image.
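The payload-preparation steps can be summarized in a short sketch. This is an assumption-laden illustration, not the authors' implementation: pycryptodome is assumed for AES (the paper only mandates AES [4, 5] and SHA-1 [6, 7]), ECB mode and zero padding are arbitrary simplifications, and the seam count follows the paper's expression with len(C′) taken in bits, which is our reading.

import hashlib
import math
# pycryptodome is assumed here; any AES implementation would do.
from Crypto.Cipher import AES

def prepare_payload(message: bytes, key: bytes, n: int) -> bytes:
    """Build C' = pad(Enc_K(M || SHA(M))) with len(C') a multiple of n.

    key must be 16, 24 or 32 bytes (128/192/256-bit AES). ECB mode and
    zero padding are simplifications for this sketch.
    """
    plain = message + hashlib.sha1(message).digest()   # M || SHA(M)
    plain += b"\x00" * (-len(plain) % AES.block_size)  # pad to AES blocks
    c = AES.new(key, AES.MODE_ECB).encrypt(plain)      # C = Enc_K(...)
    c += b"\x00" * (-len(c) % n)                       # pad C to C'
    return c

def seams_needed(c_prime: bytes, n: int) -> int:
    """Number of vertical seams, following the paper's h = 3·len(C')/(8n);
    len(C') is counted in bits here, an assumption on our part."""
    bits = 8 * len(c_prime)
    return math.ceil(3 * bits / (8 * n))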
4 Performance and Security Since the algorithm does not create new images in order to hide the message but detects the parts of the picture that have the most information, the general
Fig. 5. The stego image its energy map and gradient
Fig. 6. The histogram of the original image
performance of the algorithm is rather fast. Furthermore, due to the nature of the algorithm, the stego image and the original picture have almost equal histograms, something that is apparent in figures 6 and 7. Furthermore, the algorithm does not add information that can be detected by the human eye or that can be traced with automated tools. None of the tests of Provos's stegdetect [9] recognized the presence of a hidden message in the tested images.
Fig. 7. The histogram of the stego image
One of the main properties of HSS is its security. The use of AES and SHA-1 as parts of the algorithm improves the performance of the algorithm and the security of the whole infrastructure. The message is encrypted with a well-known algorithm, and we can test whether the message has been altered using the hash function.
5 Conclusion In this paper we introduced the HSS algorithm, a steganographic algorithm that embeds hidden messages in images, adapting each time to the properties of the cover image. This makes the hidden message less traceable by steganographic attacks, as the DCT coefficients remain the same, the histogram of the stego image is almost the same as the histogram of the original image, and the stego image does not have any obvious differences from the original one. HSS uses modern cryptographic algorithms, providing extra security for the embedded message together with good statistical behavior. In some cases, the algorithm can even retrieve a part of the hidden message from images that have been tampered with, using the hash function, or recover everything up to the destroyed area. Perhaps the only drawback compared to other algorithms is the lower steganographic capacity due to the use of seams.
References
1. Kharrazi, M., Sencar, H.T., Memon, N.: Image steganography: Concepts and practice. In: WSPC Lecture Notes Series (2004)
2. Simmons, G.J.: The prisoners' problem and the subliminal channel. In: Advances in Cryptology: Proceedings of Crypto 1983, pp. 51–67. Plenum Press (1984)
3. Avidan, S., Shamir, A.: Seam Carving for Content-Aware Image Resizing. ACM Transactions on Graphics 26(3); SIGGRAPH 2007 (2007)
4. FIPS PUB 197: Advanced Encryption Standard
5. Daemen, J., Rijmen, V.: The Block Cipher Rijndael. In: Schneier, B., Quisquater, J.-J. (eds.) CARDIS 1998. LNCS, vol. 1820, pp. 277–284. Springer, Heidelberg (2000)
6. RFC 3174: US Secure Hash Algorithm 1 (SHA-1)
7. FIPS 180-2: Secure Hash Standard (SHS)
8. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: International Conference on Computer Vision & Pattern Recognition, vol. 2, pp. 886–893 (2005)
9. Provos, N.: Stegdetect, http://www.outguess.org/download.php
Moving Object Detection and Tracking for the Purpose of Multimodal Surveillance System in Urban Areas Andrzej Czyzewski and Piotr Dalka Multimedia Systems Department, Gdansk University of Technology Gdansk, Poland [email protected]
Abstract. A background subtraction method based on a mixture of Gaussians was employed to detect all regions in a video frame denoting moving objects. Kalman filters were used for establishing relations between the regions and real moving objects in a scene and for tracking them continuously. The objects were represented by rectangles. The coupling of objects with adequate regions, including many-to-many relations, was studied experimentally employing Kalman filters. The implemented algorithm provides a part of an advanced audio-video surveillance system for security applications which is described briefly in the paper. Keywords: moving object detection and tracking, background subtraction, mixture of Gaussians, Kalman filters.
1 Introduction Video surveillance systems are very often used for monitoring many public places in every agglomeration. Such systems usually utilize dozens of cameras and produce large amounts of video streams that cannot be analyzed effectively by human operators. Furthermore, commonly found systems do not utilize other rich sources of information, like acoustic signals. At the same time, surveillance systems are required to be very reliable and effective. Thus it is necessary to implement an autonomous system combining visual and acoustic data, which would be able to detect, classify, record and report unusual and potentially dangerous events. Such a multimodal surveillance system would be an invaluable asset for various administrative authorities and safety agencies, as well as for private companies interested in securing their facilities. The outcome of the system would have a positive influence on increasing the global safety of citizens. We started this kind of experiments in Poland by building a system for monitoring urban noise employing a multimedia approach [1][8]. The paper presents a fragment of our research concerning advanced surveillance system elements. We restrict ourselves in this paper to our experiments devoted to detecting visual objects and tracking them in motion, being a part of the more complex surveillance system. The main goal of this multimodal surveillance system is to automatically detect events, classify them and alarm an operator if an unusual or dangerous activity is detected. Events are detected by many universal monitoring units which are placed in the monitored area. Results are sent to the central surveillance server, which stores them, analyses, classifies and notifies an operator if needed.
2 Moving Object Detection Moving object detection and segmentation is an important part of video-based applications, including video surveillance. In the latter application, results of detection and segmentation of objects in video streams are required to determine the type of an object and to classify events regarding the object. Most video segmentation algorithms usually employ spatial and/or temporal information to generate binary masks of objects [2]. Spatial segmentation is basically image segmentation, which partitions the frame into homogeneous regions with respect to their colours or intensities. This method can typically be divided into three approaches. Region-based methods rely on spatial similarity in colour, texture or other pixel statistics to identify separate objects, while boundary-based approaches use primarily a differentiation filter to detect image gradient information and extract edges. In the third, classification-based approach, a classifier trained with a feature vector extracted from the feature space is employed to combine different cues such as colour, texture and depth [3]. Temporal segmentation is based on change detection followed by motion analysis. It utilizes intensity changes produced by moving objects to detect their boundaries and locations. Although temporal segmentation methods are usually more computationally effective than spatial approaches, they are sensitive to noise and lighting variations [2]. There are also methods combining both spatial and temporal video characteristics, thus leading to spatio-temporal video segmentation. The final outcome is a 3-D surface encompassing object position through time, called an object tunnel [4]. The solution presented in the paper utilizes spatial segmentation to detect moving objects in video sequences. The most popular region-based approach is background subtraction [5], which generally consists of three steps. First, a reference (background) image is calculated. Then, the reference image is subtracted from every new image frame. And finally, the resulting difference is threshold filtered. As a result, binary images denoting foreground objects in each frame are obtained. The simplest method to acquire the background image is to calculate a time-averaged image. However, this method suffers from many drawbacks (i.e. limited adaptation capabilities) and cannot be effectively used in a surveillance system. A popular and promising technique of adaptive background subtraction is modelling pixels as mixtures of Gaussians and using an on-line approximation to update the model [6][7]. This method proved to be useful in many applications as it is able to cope with illumination changes and adapt the background model accordingly to the changes in the scene, e.g. motionless foreground objects eventually become a part of the background. Furthermore, the background model can be multi-modal, allowing regular changes in the pixel colour. This makes it possible to model such events as trees swinging in the wind or traffic light sequences. Thus this method was used for moving object detection in our multimodal surveillance system. In this method, each image pixel is described by a mixture of K Gaussian distributions [13]. The probability that a pixel has value x_t at time t is given as

p(x_t) = ∑_{i=1}^K w^i_t η(x_t, μ^i_t, Σ^i_t)    (1)
where w^i_t denotes the weight, μ^i_t and Σ^i_t are the mean vector and the covariance matrix of the i-th distribution at time t, and η is the normal probability density function. The number of distributions K is usually small (3 or 5) and is limited by the available computational power. For simplicity, and to reduce memory consumption, it is assumed that the RGB colour components are independent, but their variances are not restricted to be identical as in [6]. In this way the covariance matrix Σ is a diagonal matrix with the variances of the RGB components on its main diagonal. It is assumed that each Gaussian distribution represents a different background colour of a pixel. The longer a particular colour is present in the video stream, the higher the value of the weight and the lower the values in the covariance matrix of the corresponding distribution. With every new video frame, the parameters of the distributions for each pixel are updated according to the previous values of the parameters, the current pixel value and the model learning rate α. The higher α, the faster the model adjusts to changes in the scene background (e.g. caused by gradual illumination changes), although moving objects remaining still for a longer time (e.g. vehicles waiting at traffic lights) become a part of the background more quickly. When determining whether the current pixel is a part of a moving object, only distributions characterized by high weights and low values in covariance matrices are used as the background model. If the current pixel matches one of the distributions forming the background model, it is classified as the background of the scene; otherwise it is considered a part of a foreground object. Moving object detection is supplemented with a shadow detection module, which is required for every outdoor video processing application, especially in the field of surveillance applications. The shadow of a moving object is always present, moves together with the object, and as such is detected as a foreground object by a background removal application. The shadow detection method is based on the idea that while the chromatic component of a shadowed background object remains generally unchanged, its brightness is significantly lower. This makes it possible to separate the RGB colour space used in the model into chromatic and brightness components. Only pixels recognized as a part of a foreground object during the background subtraction process are checked for whether they are part of a moving shadow. A binary mask denoting pixels recognized as belonging to foreground objects in the current frame is the result of the background subtraction. The mask is morphologically refined in order to allow object segmentation [7][8]. Morphological processing of the binary mask consists of removing regions (connected components) having too few pixels, morphological closing and filling holes in regions.
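For illustration, OpenCV ships a Gaussian-mixture background subtractor with built-in shadow detection that is close in spirit to the model described above; the sketch below is a stand-in rather than the authors' implementation, and the file name, history length, variance threshold and minimum-area value are all assumptions.

import cv2

# MOG2 implements an adaptive Gaussian-mixture background model with
# per-pixel shadow detection, similar to the method described above.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=True)
capture = cv2.VideoCapture("surveillance.avi")   # hypothetical input file
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

while True:
    ok, frame = capture.read()
    if not ok:
        break
    mask = subtractor.apply(frame)   # 255 = foreground, 127 = shadow
    mask[mask == 127] = 0            # discard pixels classified as shadow
    # Morphological refinement: closing, then removal of tiny regions.
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    count, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    for i in range(1, count):
        if stats[i, cv2.CC_STAT_AREA] < 50:   # area threshold (assumed)
            mask[labels == i] = 0
capture.release()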
3 Applying Kalman Filters to Objects Tracking in Motion Kalman filtering [9] provides a useful approach to tracking objects in video, and numerous papers have discussed this subject. A good review of older publications is provided in Funk's study [10]. Yu et al. present the application of Kalman filtering in an advanced framework for multi-moving-target tracking [11]. Many interesting concepts as well as newer literature citations can be found in some
papers published after the PETS 2006 workshop, e.g. [12], concerning tracking with multiple cameras. In the process of tracking, each of the detected moving objects has its own Kalman filter (a so-called tracker) created that represents it. The Kalman filter is most of all used to establish the proper relation between detected regions (blobs) that map to moving objects of the current frame and the real moving objects under analysis. In our initial experiments, we planned to follow some ideas presented by a research team of the Institute for Robotics and Intelligent Systems, University of Southern California [13]; however, the Kalman filter application is only mentioned in that paper, so it was necessary to build the algorithm from scratch. As a result of applying the previously implemented algorithms for moving object extraction [1][8] mentioned in the previous paragraph, we obtained moving objects represented by rectangles. The experiments used two types of trackers (Kalman filters). In the first type, the state of the moving object (vector x^8) is described by 8 parameters (which is denoted by the superscript value in the vector's indicator), and in the second version the state vector x^6 has 6 elements:

x^8 = [x, y, w, h, dx, dy, dw, dh]^T    (2)

x^6 = [x, y, w, h, dx, dy]^T    (3)
where x and y denote the location of the object (actually the coordinates of the upper-left corner of the rectangle that represents it), w and h are the width and height of the rectangle, dx and dy indicate the change in the object's location between subsequent time intervals, and dw and dh are the changes in the width and height of the rectangle between subsequent time intervals. The two additional parameters that distinguish the 8-element state vector from the 6-element one express the dynamics of changes in an object's dimensions. This means that the 8-element Kalman filter can track faster and greater changes in an object's (rectangle's) dimensions than the 6-element one, which allows such changes only in the vector correction phase. In both cases the measurement vector adopts the following form:
z = [x_b, y_b, w_b, h_b]^T    (4)
which includes: the location, width and height of the region holding the pixels of the moving object associated with the current tracker. Transition matrix A and observation matrix H of the Kalman filter were binary matrices in the form appropriate for the state and observation vectors defined above. The applied model does not require any control inputs, which results in an input matrix B equal to 0. The process of tracking a moving object has several phases. Each newly detected object is assigned a new tracker with the state vector that is based on the parameters measured for the blob according to the following equation:
x^8_{-1} = [x^b_{-1}, y^b_{-1}, w^b_{-1}, h^b_{-1}, 0, 0, 0, 0]^T,    x^6_{-1} = [x^b_{-1}, y^b_{-1}, w^b_{-1}, h^b_{-1}, 0, 0]^T    (5)
In the following time interval (namely in the next frame), the state vector is updated once more, based upon the parameters corresponding to the newly-created object:

x^8_0 = [x^b_0, y^b_0, w^b_0, h^b_0, x^b_0 − x^b_{-1}, y^b_0 − y^b_{-1}, w^b_0 − w^b_{-1}, h^b_0 − h^b_{-1}]^T

x^6_0 = [x^b_0, y^b_0, w^b_0, h^b_0, x^b_0 − x^b_{-1}, y^b_0 − y^b_{-1}]^T    (6)
The vector x_0 constitutes the initial estimate x̂_0 of the state vector. In the following time intervals, first the forward prediction of the state vector of all Kalman filters assigned to the currently existing objects is made. This is done in order to obtain the a priori estimate of the location of objects belonging to the current image frame. The next step is to purge trackers whose a priori estimate of the state vector contains non-positive or too small values of the object's width and height (a situation which is possible for the 8-element state vectors if errors occur during background subtraction). Then, it is decided which blob of the current frame is assigned to which one of the tracked objects. In the final phase, the Kalman filter state vectors of each object are corrected. This is done based on the measured parameters of the regions holding the pixels of the respective moving objects detected in the current frame.
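A bare-bones rendition of the 8-element tracker follows. The binary A and H matrices come directly from the state and measurement vectors (2) and (4); the noise covariances Q and R are placeholders of our own choosing, since the paper does not report them.

import numpy as np

# State x8 = [x, y, w, h, dx, dy, dw, dh]^T; measurement z = [xb, yb, wb, hb]^T.
A = np.eye(8)
A[:4, 4:] = np.eye(4)        # x += dx, y += dy, w += dw, h += dh per frame
H = np.hstack([np.eye(4), np.zeros((4, 4))])   # observe position and size only

Q = np.eye(8) * 1e-2         # process noise covariance (assumed values)
R = np.eye(4) * 1e-1         # measurement noise covariance (assumed values)

def predict(x, P):
    """Forward prediction: a priori estimate of state and covariance."""
    return A @ x, A @ P @ A.T + Q

def correct(x, P, z):
    """Measurement update with the blob parameters z."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(8) - K @ H) @ P
    return x, P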
4 Establishing Relations between Moving Objects and Regions The key action of the tracking algorithm is to properly associate trackers with the blobs resulting from background subtraction in the current frame. For this purpose, a binary matrix M that depicts the relations is created. In this matrix, each tracker-blob pair (where the a priori estimate x̂_k^- of the object's state represents the tracker and the measurement vector z_k relates to the blob) is assigned zero or one, depending on whether the rectangles enclosing the region and the estimated object location have a common part (i.e. whether they overlap). As a result, an i × j relations matrix M is created for i trackers and j detected blobs of the current frame. This way of creating the matrix considerably simplifies the procedures used hitherto. A sketch of the construction is given below.
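A possible construction of M, assuming each tracker contributes its predicted a priori rectangle and each blob its bounding rectangle, both as (x, y, w, h) tuples (the representation is our assumption):

import numpy as np

def overlaps(a, b):
    """True if rectangles a and b, given as (x, y, w, h), share a common part."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def relations_matrix(tracker_rects, blob_rects):
    """Binary i x j matrix M: M[t, b] = 1 iff tracker t's a priori rectangle
    and blob b's rectangle overlap."""
    M = np.zeros((len(tracker_rects), len(blob_rects)), dtype=int)
    for t, rect_t in enumerate(tracker_rects):
        for b, rect_b in enumerate(blob_rects):
            M[t, b] = int(overlaps(rect_t, rect_b))
    return M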
There are some basic types of relations between trackers and regions, each of them requiring different actions to be taken [13]. If a certain blob is not associated with any tracker, a new tracker (Kalman filter) is created and initialised in compliance with this region. If a certain tracker has no relation to any of the blobs, then the phase of measurement update is not carried out in the current frame. If the tracker fails to relate to a proper region within several subsequent frames, it is deleted. The predictive nature of trackers assures that moving objects whose detection through background subtraction is temporarily impossible are not “lost” (e.g., when a person passes behind an opaque barrier). One of the most desirable types of relation is a one-to-one relationship between a tracker and a blob. In this case, the tracker is updated with the results of the region measurements. Another type of association is a many-to-one relationship, meaning that in the current image frame there are several trackers related to the same blob. Thus, each of these trackers is updated with the parameter measurements of this same region. Such circumstances correspond, for instance, to the situation in which two humans, previously moving separately (mapping to one-to-one tracker-blob relations), start walking together, possibly one hiding behind the other, which makes their trackers start relating to the same region. Another type of an object-region relation is the one-to-many relation, i.e., one tracker is associated with many blobs. In this case, the tracker is updated with the parameters of the rectangle covering all of these regions. Such an approach is supposed to assure the cohesion of the traced object in case faulty background subtraction divides the object and erroneously couples it with several regions instead of one, or when a moving person temporarily disappears behind another object, e.g., a pillar. If it is actually a situation of two humans entering the camera scope as a group and then parting, or of a person who abandons an object, the distance between the blobs increases, in effect making the current tracker “follow” the object that changes its dimensions and motion vector to a lesser extent, while the other object is assigned a new tracker. The last and most intricate way of coupling objects with regions is the many-to-many relation. It corresponds to the situation of several people who initially move separately, then form a group (thus inducing the relation of many trackers to one blob), and after some time some persons decide to leave the group. Ideally, in such a case, the tracker originally assigned to a leaving person (before the person entered the group) should follow the same person (proving that continuity of viewing is provided all the time the objects remain in the scene). For the algorithm, the situation when two people pass by each other is identical. First, we have to deal with 2 one-to-one relations. Next, a single two(trackers)-to-one(blob) relation is established, then (at the very moment of passing by) a many-to-many relation is formed, which finally transforms again into 2 one-to-one relations. All the time the trackers should be able to follow the proper objects. In order to achieve this, trackers store descriptions of the objects they trace. At moments when many trackers are associated with many blobs (as in the case of a parting group of people), the degree of similarity is calculated in the same way for each object. It is computed based on the description held by the object's tracker and the description derived from the calculations performed for each of the regions. If the maximum degree of similarity (within the group of the many-to-many relationship) exceeds a specified threshold, the corresponding tracker is updated with the measurements of the region that is most similar to it. The obtained tracker-blob pair is excluded from further analysis, and the same process of finding another matching pair repeats. The next pair is updated and excluded. If, finally, only degrees that do not exceed the threshold are left, all the remaining trackers (in the analysed group) are updated with the parameters of the rectangle covering the whole rest of the regions. If there are no blobs left to associate trackers with (which is possible if the many-to-many relation was formed by more trackers than regions), the remaining trackers are not updated. In our experiments, a two-dimensional colour histogram using the chromatic RcGc colour space was applied for each object as its description.
The degree of similarity between the appearance of the object (the RcGc histogram) stored by the tracker, and the appearance (the RcGc histogram) of the analysed region is determined through the measurement of correlation.
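A sketch of such a description and its comparison is given below; the bin count is our assumption, and the chromatic coordinates are taken as r = R/(R+G+B) and g = G/(R+G+B), the usual reading of the RcGc space.

import numpy as np

def rcgc_histogram(bgr_pixels, bins=16):
    """2-D chromatic histogram of an object's pixels in the RcGc space."""
    pix = bgr_pixels.reshape(-1, 3).astype(float)
    s = pix.sum(axis=1) + 1e-9            # avoid division by zero
    r = pix[:, 2] / s                     # chromatic red (OpenCV order: BGR)
    g = pix[:, 1] / s                     # chromatic green
    hist, _, _ = np.histogram2d(r, g, bins=bins, range=[[0, 1], [0, 1]])
    return hist / max(hist.sum(), 1.0)    # normalise to unit mass

def similarity(h1, h2):
    """Correlation between two histograms, in [-1, 1]."""
    return float(np.corrcoef(h1.ravel(), h2.ravel())[0, 1])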
5 Experiments and Results The experiments show that the developed algorithm works correctly. In Fig. 1, sample results of vehicle detection and segmentation are presented. The implemented algorithm for moving object detection correctly determines the scene background and marks the locations of all objects (vehicles), both during day and night. The moving shadow detection and morphological processing turned out to be very useful in separating two vehicles originally labelled as one region (the first row in Fig. 1). The algorithm is also able to detect vehicles in night sequences (the second row in Fig. 1). There is one major drawback of the night environment: car headlights illuminate the road ahead of a car and nearby buildings, which causes the illuminated areas to be classified as foreground objects. A supplementary decision layer (possibly employing an intelligent decision algorithm) needs to be added to the algorithm to prevent such false detections and obtain exact vehicle shapes. Based on the results of moving object detection, the developed algorithm performs tracking of objects. The performance results are satisfactory, especially when the number of objects present in the analysed scene is not large. Fig. 2 demonstrates the effectiveness of the algorithm in a sample situation of a car overtaking another car. A similar situation is shown in Fig. 3 (an event of two people passing by each other). Both examples show that the trackers (marked by different colours) follow the right objects and confirm that the algorithm is able to continuously track objects despite their partial overlapping. The experiments did not clearly prove which of the two types of Kalman filters (8- or 6-element state vector) is better. The 8-element vector filter shows greater
Fig. 1. Examples of vehicle detection during day (first row) and night (second row): a) original frames from recorded video sequences; b) raw results of background removal without any further processing; c) final results of vehicle segmentation
flexibility and better results in the case of more crowded scenes. The 6-element vector filter assures more precise measurements of the shape and location of objects (real, traced objects do not change their size all the time) and performs better in some problematic situations, i.e., when many-to-many relationships are engaged. The decision as to which type of vector to choose will depend on further evaluation of its performance results and on the characteristics of the analysed scene.
Fig. 2. Frames illustrating continuous tracking of two vehicles as one overtakes the other
Fig. 3. Fragments of frames 2266, 2270 and 2277 of the S1-T1-C3 recording from the PETS 2006 [14] set: two humans passing by each other. In the upper row, the 8-element Kalman filter was used; in the lower row, the 6-element one.
The presented solution is the first version of our mobile object tracking algorithm, which even at its initial state of development lets us achieve good results. It is of course advisable to advance and improve it, mostly by using other, more distinctive parameters in object descriptions and by specifying a more precise measure of similarity than correlation. In simple situations the described algorithm performs very well; however, its reliability decreases with the increase of the number of objects interacting with each other.
6 Conclusions The solution for tracking mobile objects applied in our hitherto experiments and considered for future work will constitute an important element of the prototype surveillance system consisting of a set of distributed monitoring units and a central server for data processing. Data regarding moving objects, obtained from trackers, can be directly used to detect unusual or prohibited events such as trespassing or luggage abandonment. In the area of moving object detection, future work will be focused on including spatial and temporal dependencies between pixels in the background model and dynamically adjusting the learning rate, depending on the current scene change rate. A possible area of improvements in the tracking part of the algorithm should address an implementation and examination of trackers that use more advanced algorithms to estimate state vectors of dynamic discrete processes, particularly those employing the Extended Kalman Filter and the Unscented Kalman Filter. These solutions utilize non-linear and/or non-Gaussian models of processes and therefore should estimate motion of real-world objects with greater accuracy.
Acknowledgements Research is subsidized by the Polish Ministry of Science and Higher Education within Grant No. R00-O0005/3 and by the European Commission within FP7 project “INDECT” (Intelligent Information System Supporting Observation, Searching and Detection for Security of Citizens in Urban Environment).
References
1. Czyzewski, A., Dalka, P.: Visual Traffic Noise Monitoring in Urban Areas. International Journal of Multimedia and Ubiquitous Engineering 2(2), 91–101 (2007)
2. Li, H., Ngan, K.: Automatic Video Segmentation and Tracking for Content-Based Applications. IEEE Communication Magazine 45(1), 27–33 (2007)
3. Liu, Y., Zheng, Y.: Video Object Segmentation and Tracking Using ψ-Learning Classification. IEEE Trans. Circuits and Syst. for Video Tech. 15(7), 885–899 (2005)
4. Konrad, J.: Videopsy: Dissecting Visual Data in Space Time. IEEE Communication Magazine 45(1), 34–42 (2007)
5. Yang, T., Li, S., Pan, Q., Li, J.: Real-Time and Accurate Segmentation of Moving Objects in Dynamic Scene. In: ACM Multimedia 2nd International Workshop on Video Surveillance and Sensor Networks, New York, October 10-16 (2004)
6. Stauffer, C., Grimson, W.: Learning patterns of activity using real-time tracking. IEEE Trans. on Pattern Analysis and Machine Intell. 22(8), 747–757 (2000)
7. Elgammal, A., Harwood, D., Davis, L.: Non-parametric Model for Background Subtraction. In: ICCV Frame-rate Workshop (September 1999)
8. Dalka, P.: Detection and Segmentation of Moving Vehicles and Trains Using Gaussian Mixtures, Shadow Detection and Morphological Processing. Machine Graphics and Vision 15(3/4), 339–348 (2006)
9. Welch, G., Bishop, G.: An Introduction to the Kalman Filter. Technical Report TR95-041, University of North Carolina at Chapel Hill (1995)
10. Funk, N.: A Study of the Kalman Filter Applied to Visual Tracking. University of Alberta, Project for CMPUT 652 (December 7, 2003)
11. Yu, H., Wang, Y., Kuang, F., Wan, Q.: Multi-moving Targets Detecting and Tracking in a Surveillance System. In: Proc. of the 5th World Congress on Intelligent Control and Automation, Hangzhou, China, June 15-19 (2004)
12. Martínez-del-Rincón, J., Herrero-Jaraba, J.E., Gómez, J.R., Orrite-Uruñuela, C.: Automatic left luggage detection and tracking using multi-camera UKF. In: Proc. 9th IEEE International Workshop on Performance Evaluation in Tracking and Surveillance (PETS 2006), NY, USA, pp. 59–66 (2006)
13. Lv, F., Song, X., Wu, B., Kumar, V., Nevatia, S.: Left-Luggage Detection using Bayesian Inference. In: Proc. of 9th IEEE Int. Workshop on Performance Evaluation of Tracking and Surveillance, New York, USA, June 2006, pp. 83–90 (2006)
14. PETS 2006 – a collection of test recordings from the Ninth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, New York, USA, June 18 (2006)
Image Similarity Search in Large Databases Using a Fast Machine Learning Approach Smiljan Šinjur and Damjan Zazula Faculty of Electrical Engineering and Computer Science University of Maribor Smetanova ulica 17 2000 Maribor Slovenia {smiljan.sinjur,zazula}@uni-mb.si
Abstract. Today's tendency to protect various copyrighted multimedia contents, such as text, images or video, has resulted in many algorithms for detecting duplicates. If the observed content is identical, the task is easy. But if the content is even slightly changed, the task of identifying the duplicate can be difficult and time consuming. In this paper we develop a fast, two-step algorithm for detecting image duplicates. The algorithm also finds slightly changed images with added noise, translated or scaled content, or images having been compressed and decompressed by various algorithms. The time needed to detect duplicates is kept low by implementing image feature-based searches. To detect all images similar to a given reference image, feature extraction based on convex layers is deployed. The correlation coefficient between two features gives a first hint of similarity to the user, who creates a learning set for support vector machines by simple on-screen selection. Keywords: Image similarity, Convex layer, Correlation coefficient, Machine learning, Support vector machine.
1 Introduction While similarity algorithms for text are more or less established and good in terms of quality and speed, this is not true for images and video. Detecting video duplicates is easily transformed into an image problem: a complete reference video or a part of it is similar to the tested video or a part of it if a sequence of frames of the reference video is similar to a sequence of frames from the tested video. All frames of all tested videos have to be stored locally in a database. Since a video is usually composed of a lot of frames, the storage size of such a database can be large. In this case, it is crucial to store only a smaller amount of information which describes the images uniquely by their feature vectors. Although the database size is then smaller, the number of feature vectors equals the number of images. So, the algorithms that perform any manipulation on the feature vectors have to be very fast. Image similarity has been used in many applications, such as content-based image retrieval, feature tracking, stereo matching, and scene and object recognition. The first image matching algorithms were developed back in the fifties [1]. Cross-correlation in stereo image tracking was used in [2]. Very fast image similarity search can be
performed if an image is described by keywords and metadata [3]. The keywords for searching are generated by the user or by computer, for example from the image name. Metadata are used by Google Image Search [4] or the Tiltomo search engine [5]. Grauman et al. [6] use local descriptors for object recognition. Similarity of the images is measured as a distance between local features, which leads to short computational times. CBIR features, such as colour and texture, are used in [7]. The similarity of two images is defined by a dynamic distance of those features that add value to the comparison. Given a reference image, we developed an algorithm that finds all similar images in a large database. In the first place, the database of image features is constructed from all tested images. To extract the features from an image, convex layers based on monochromatic images are formed. The similarity measure is defined by the correlation coefficient between the two feature vectors of images and falls into the interval [-1, 1]. Correlation coefficients closer to 1 indicate similar images, while those closer to -1 indicate their dissimilarity. However, it is difficult to define a proper threshold, because the feature vectors are not easily separable in general. Therefore, we derived a two-step procedure: first, a coarse thresholding is done for the reference image using the correlation coefficient algorithm, which is then followed by the user's selection of an initial learning set of images. This selection is performed on-screen from three collections of images displayed according to their correlation coefficient: a group of most similar images, a group of border cases, and a group of least similar images with regard to the reference image. The obtained learning set is further used to train a support vector machine (SVM). This paper is organized in 5 sections. Section 2 describes convex layer construction on a grid of points, as well as the correlation coefficient computed on the detrended image features. In Section 3, the selection of the initial learning set for the SVM-based algorithm is given. Section 4 interprets the experimental results, while Section 5 concludes the paper and gives hints for future work.
2 Image Similarity The image similarity measure is defined by a distance between the corresponding pairs of objects from two images. In our case, all the objects from an image constitute one unique feature vector. The similarity of two images is, therefore, measured by a comparison of two feature vectors: one belonging to the reference and the other to the tested image. To extract a feature vector for an image, image background and foreground have to be separated. The foreground is a set of objects that constitute the point of user's interest, is usually placed in the middle of the image, and changes dynamically along the subsequent images (e.g. through video frames), whereas the background is usually static and covered by the objects. Also, the foreground and background usually differ in hue. The background is ignored and the foreground determines a unique feature vector. Separation of image background and foreground can be facilitated by a transformation to the HSV colour model. The human eye is most sensitive to hue, H, so saturation S and value V are set equal to 0. As a result, a greyscale image is obtained. Now, the background and foreground differ only in hue. A thresholding is based on the
minimum value between two local maxima in the greyscale image bimodal histogram [8]. Foreground pixels are afterwards laid over a grid to compute convex layers and use them in building up the image feature vector. 2.1 Convex Layers as Features An image feature vector is created from convex layers based on foreground points [12], [13]. The problem of convex layers is an extension of the convex hull problem. So, to generate convex layers, a set of convex hulls is generated as described by the following algorithm:

procedure ConvexLayers
  set S: all foreground pixels
begin
  while S not empty
    CH = ConvexHull(S);
    S = S \ CH;
  end
end.

The above algorithm shows that convex layers are generated recursively, until the point set S is empty. To compute the convex hull of a point set, various known algorithms can be used [9], [10], [11], [12]. Those algorithms differ in their time complexity and implementation. Best results are achieved by Chazelle's algorithm [12], which attains a time complexity proportional to O(n log(n)).
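For illustration, the peeling loop can be written in a few lines of Python on top of scipy's convex hull routine. This naive version runs in roughly O(n²) time rather than the O(n log n) of Chazelle's algorithm [12], and it assumes the points are in general position (a fully collinear set would need extra handling).

import numpy as np
from scipy.spatial import ConvexHull

def convex_layers(points):
    """Peel convex layers from a set of 2-D points (the foreground pixels)."""
    pts = np.asarray(points, dtype=float)
    layers = []
    while len(pts) >= 3:                  # ConvexHull needs at least 3 points
        hull = ConvexHull(pts)
        layers.append(pts[hull.vertices])
        pts = np.delete(pts, hull.vertices, axis=0)
    if len(pts):                          # 1-2 leftover points: final layer
        layers.append(pts)
    return layers

def feature_vector(points):
    """Numbers of vertices on the consecutive convex layers."""
    return np.array([len(layer) for layer in convex_layers(points)])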
Fig. 1. Example image
Fig. 3 shows that any point can always take part in only one convex hull. Additionally, the upper and lower rows (resp. columns) of the current set S (represented as a matrix) lie on a convex hull. So, the maximum number of convex hulls in an image is ⌈max(p, q)/2⌉, where p and q are the image dimensions. Fig. 1 and Fig. 2 depict an example greyscale and monochromatic image from which the convex layers are generated and shown in Fig. 3. All the points in Fig. 3 lie
Fig. 2. Monochromatic image for the example image from Fig. 1
Fig. 3. Convex layers for the example image from Fig. 2
on a grid. For better comprehension, Fig. 1 was resized to 32 × 32 pixels, because larger convex layers become hard to follow visually. The number of convex hulls in our example is 11. 2.2 Similarity Measure Convex layers already define a feature vector that describes the content of an image. A direct comparison of such a two-dimensional feature is computationally complex, so a reduction of dimensions is performed first. We tested various measures to reduce feature dimensionality, such as the number of hulls or the length of a hull. We found that the most significant information is given by the number of vertices on individual convex hulls. Therefore, we introduced a feature vector whose elements correspond to the numbers of vertices on the consecutive convex layers. Fig. 4 shows the feature vector obtained from the convex layers in Fig. 3. An undesired property of the feature vectors, their decreasing tendency, is evident from Fig. 4. This decrease is not caused by the individual features of an image, but is intrinsic, because the inner layers are always “shorter” than the outer convex layers. It actually disturbs the comparison of an image's individual characteristics and must be
removed. By constructing the regression line of the feature vector, a means is given to eliminate the effect of intrinsically different sizes of convex layers, i.e. to detrend the feature vectors. The regression line t = [t_1, t_2, ..., t_L] is subtracted from the feature vector x = [x_1, x_2, ..., x_L]:

x' = x − t,    (1)

giving a detrended version of features x'.
Fig. 4. Feature vector, regression line and detrended features for the example image in Fig. 1
A regression line component t_i is defined as

t_i = ((L + 1 − i)/L) · b,    (2)
where L stands for the length of x and b for the regression coefficient. This coefficient is calculated as follows:

b = ∑_{i=1}^L (x_i − x̄) · (i − (L+1)/2) / ∑_{i=1}^L (i − (L+1)/2)²    (3)
where x_i stands for the i-th component of x and x̄ for the mean of x. The similarity of two images can be measured by the correlation coefficient of two detrended feature vectors x and y:

d(x, y) = ∑_{i=1}^L (x_i − x̄) · (y_i − ȳ) / √(∑_{i=1}^L (x_i − x̄)² · ∑_{i=1}^L (y_i − ȳ)²)    (4)
In general, the length of vectors x and y is different. To obtain the same vector length the shorter one is padded by zeros.
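Equations (1)–(4) translate directly into a short numpy sketch. Note that the paper does not state whether zero padding precedes or follows detrending; padding first is our assumption here.

import numpy as np

def detrend(x):
    """Subtract the regression line t_i = ((L + 1 - i)/L) * b from x, eqs. (1)-(3)."""
    x = np.asarray(x, dtype=float)
    L = len(x)
    i = np.arange(1, L + 1)
    c = i - (L + 1) / 2.0
    b = np.sum((x - x.mean()) * c) / np.sum(c * c)   # regression coefficient (3)
    t = (L + 1 - i) / L * b                          # regression line (2)
    return x - t

def similarity(x, y):
    """Correlation coefficient d(x, y) of two detrended feature vectors, eq. (4)."""
    L = max(len(x), len(y))
    x = np.pad(np.asarray(x, dtype=float), (0, L - len(x)))  # zero-pad shorter
    y = np.pad(np.asarray(y, dtype=float), (0, L - len(y)))
    x, y = detrend(x), detrend(y)
    num = np.sum((x - x.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
    return num / den if den else 0.0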
3 Learning So far, we explained how to test a pair of images for similarity by using correlation coefficients. The correlation coefficient always returns a value in the interval [-1, 1]. But, of course, the similarity threshold value is not completely general, and it also depends on the reference image contents. There are also drawbacks of convex layers, which cannot cope well with rotated and interpolated image contents.
Fig. 5. User interface for supervised learning
The correlation-coefficient-based thresholding is, therefore, not precise enough and can certainly be refined by more sophisticated approaches. We decided to use a machine learning algorithm, where a model of images similar to the reference one is learned and, afterwards, implemented in a refined similarity search. This means that we introduce a two-step procedure: first, the search in a large database is done using our fast convex-layer approach and the correlation coefficient thresholding, whereas the continuation is supervised by the user as follows. A graphical user interface, as depicted in Fig. 5, offers three sets of most similar images, border cases, and most different images with respect to the reference one. The choice of sets is made automatically by our correlation-coefficient-based algorithm and the feature vectors. The user's task now is to indicate all the displayed similar and different images and, thus, create a set of positive and negative learning examples. Of course, the user is shown the original images, but his or her on-screen selection
enters the learning set as examples in the form of feature vectors. This learning set is further used for the SVM learning and classification [15], [16]. Fig. 5 may also be understood as an illustrative example of images that enter the three sets with different degree of similarity measured by the correlation coefficient. The group of images most similar to the reference one is formed according to the predefined number of highest correlation coefficients, while the group of the most different images exhibits the lowest values of correlation coefficients. The border set consists of a certain number of images with correlation coefficients just below or just above a preselected threshold correlation value. The number of images in any of the three sets and the threshold correlation value are defined by the user for every application run separately. The user has to go through all the displayed images and mark all similar and different images, respecting their own perception and decision. We have also studied the influence of the threshold correlation value as it is set for automated selection of border cases. It is obvious that the learning set can increase if new learning examples are added at different threshold levels. It is also expected that, by increasing the size of the learning set, the accuracy of the SVM learning and classification increases.
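The second step then reduces to standard SVM training on the marked feature vectors. A minimal sketch with scikit-learn follows; the library choice and variable names are our assumptions, not the authors' code.

import numpy as np
from sklearn import svm

def train_classifier(X_pos, X_neg):
    """Train an SVM on user-marked examples.

    X_pos, X_neg -- arrays of feature vectors the user marked as similar /
    different on-screen, zero-padded to a common length.
    """
    X = np.vstack([X_pos, X_neg])
    y = np.hstack([np.ones(len(X_pos)), np.zeros(len(X_neg))])
    clf = svm.SVC(kernel="linear")   # linear kernel, as in the experiments
    clf.fit(X, y)
    return clf

# Refined second-pass search over the stored feature database:
# similar = [i for i, f in enumerate(features) if clf.predict([f])[0] == 1]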
4 Results The proposed method is suitable for searching large databases of images. The best example of a large set of similar and different images is a movie. The frames are expected to be similar within the same scene and different between scenes. However, even the frames within the same scene are not totally equal, because either the camera or the objects move, and some noise is added during image processing. The movement of an object or camera results in translation or scaling of the object, while the noise is caused by lossy compression algorithms whenever applied. To create a database of images, the movie “Star Wreck: In the Pirkinning” [17], published under the Creative Commons licence [18], was chosen. The complete number of frames comprising the movie is 154892, and their resolution is 640 × 272. If the movie is extracted at the best JPEG quality, the size of all its images grows to 10.6 GB. The size of the extracted images is enormous, so a direct search in such a database would be very time consuming. It is reasonable to create a database of features, as we have explained in previous sections. We generated the convex-layer-based feature vectors for all images and stored them together with their labels and sizes. For this particular movie, the size of all feature vectors is 68 MB. Because of the learning process in the second step of our approach (see Section 3), the whole database of original images is also kept. The feature vector database has to be created only once. As the feature extraction for an image takes about 23 milliseconds on average, the complete process for the selected movie takes 3528 seconds. The times were measured on a Pentium Core 2 processor with 2.18 GHz frequency and 2 GB of memory, running under Linux. All code was written in Matlab, except the convex layer routine, which was coded in C. First of all, we are interested in the sensitivity and specificity of the proposed algorithm. We found that the sensitivity was strongly dependent on the learning set
Fig. 6. Sensitivity and specificity of the proposed algorithm versus the learning set size
size. The larger the set, the more similar images we recognized. Fig. 6 shows that an initial learning set of 40 feature vectors leads to only half of the similar images being recognized in the processed movie. When the learning set is increased to 320, the probability of proper recognition rises to 0.83. The learning set size does not affect the specificity, which is 0.99 on average. All experiments were made using the linear SVM kernel. The database searches are fast in both proposed steps, i.e. with the correlation coefficients and with the SVM classification. The average search time per image feature vector is 1.8 microseconds when calculating the correlation and 15 microseconds when using the SVM. This means that the tested database with the Star Wreck movie features was scanned completely in 279 milliseconds by the correlation approach and in 2.3 seconds by the SVM.
5 Conclusion

In this paper, a novel method for searching for similar images in large databases is described. For every image, a unique feature vector is computed. Feature vectors are extracted from image convex layers, which are based on the image foreground pixels; the elements of the vectors represent the numbers of vertices on the corresponding layers. A simple algorithm for convex layers is presented, exploiting the fact that convex layers are an extension of the convex hull construction. The obtained feature vectors can be compared very quickly by computing their correlation coefficients, which enables fast automated searches for reference images in large image databases. However, the correlation-coefficient-based approach is not very accurate. We refined it by an additional intelligent step based on the SVM. A fast correlation-based search extracts the most similar and most dissimilar images and shows them to the user, whose task is to mark the images that, according to his or her perception, can be considered similar. In this way, a learning set of positive and negative examples, i.e. feature vectors, is gathered. This set is used for the SVM learning. The optimal SVM weights obtained are then employed in the refined second database search: all the stored feature vectors are tested by the trained SVM in order to mine the images most similar to the reference one.
An application to create learning sets for SVMs was developed. It offers the user three sets of database images whose feature vectors produced the highest, border, and lowest correlation coefficients when compared to the features of a reference image. The learning set selected by the user can be enlarged in subsequent searches, which leads to better sensitivity. Considering the speed of the feature database scans, the SVM learning times, and the duration of the user's on-screen image selection, the whole procedure takes a few tens of seconds. Combined with the obtained sensitivity of 83%, and possibly more, this makes the proposed image similarity approach worthwhile for further investigation and development.
References
1. Hobrough, G.L.: Automatic stereo plotting. Photogrammetric Engineering & Remote Sensing 25(5), 763–769 (1959)
2. Hannah, M.J.: A system for digital stereo image matching. Photogrammetric Engineering & Remote Sensing 55(12), 1765–1770 (1989)
3. Viitaniemi, V., Laaksonen, J.: Keyword-detection approach to automatic image annotation. In: Proceedings of 2nd European Workshop on the Integration of Knowledge, London, UK, pp. 15–22 (2005)
4. Google Image Search, http://images.google.si
5. Content based Visual Image Search Engine, http://www.tiltomo.com
6. Grauman, K., Darrell, T.: Efficient Image Matching with Distributions of Local Invariant Features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 627–634. IEEE Press, Los Alamitos (2005)
7. Qamra, A., Meng, Y., Chang, E.Y.: Enhanced perceptual distance functions and indexing for image replica recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(3), 379–391 (2005)
8. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis and Machine Vision. Chapman & Hall, London (1994)
9. Graham, R.L.: An efficient algorithm for determining the convex hull of a finite planar set. Information Processing Letters 1(4), 132–133 (1972)
10. Andrew, A.M.: Another Efficient Algorithm for Convex Hulls in Two Dimensions. Information Processing Letters 9(5), 216–219 (1979)
11. Jarvis, R.A.: On the identification of the convex hull of a finite set of points in the plane. Information Processing Letters 2(1), 18–21 (1973)
12. Chazelle, B.: On the convex layers of a point set. IEEE Transactions on Information Theory 31(4), 509–517 (1985)
13. Preparata, F.P., Shamos, M.I.: Computational Geometry: An Introduction. Springer, New York (1985)
14. Bewick, V., Cheek, L., Ball, J.: Statistics review 7: Correlation and regression. Critical Care 7(6), 451–459 (2003)
15. Lenič, M., Cigale, B., Potočnik, B., Zazula, D.: Fast Segmentation of Ovarian Ultrasound Volumes Using Support Vector Machines and Sparse Learning Sets. In: IIMSS 2008 (submitted, 2008)
16. Berthold, M., Hand, D.J.: Intelligent Data Analysis. Springer, Berlin (2003)
17. The film Star Wreck: In the Pirkinning, http://www.starwreck.com/
18. Creative Commons license, http://creativecommons.org/
Fast Segmentation of Ovarian Ultrasound Volumes Using Support Vector Machines and Sparse Learning Sets Mitja Lenič, Boris Cigale, Božidar Potočnik, and Damjan Zazula University of Maribor, Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor, Slovenia {mitja.lenic,boris.cigale,bozo.potocnik,damjan.zazula}@uni-mb.si
Abstract. Ovarian ultrasound imaging has recently drawn attention because of improved ultrasound-based diagnostic methods and because of its application to in-vitro fertilisation and the prediction of women's fertility. Modern ultrasound devices enable frequent examinations and offer sophisticated built-in image processing options. However, precise detection of different ovarian structures, in particular follicles and their growth, still needs additional, mainly off-line processing with highly specialised algorithms. Manual annotation of a whole 3D ultrasound volume consisting of 100 and more slices, i.e. 2D ultrasound images, is a tedious task even when using handy, computer-assisted segmentation tools. Our paper reveals how an application of support vector machines (SVM) can ease follicle detection by speeding up the learning and annotation processes at the same time. An iterative SVM approach is introduced that trains on sparse learning sets only. The recognised follicles are compared to the referential expert readings and to the results obtained after learning on the entire annotated 3D ovarian volume. Keywords: Medical image segmentation, Ultrasound imaging, Ovarian follicles, Support vector machines (SVM), Iterative SVM, Fast learning, Sparse learning sets.
1 Introduction

Ovarian ultrasound imaging has recently drawn attention for several reasons. Its importance grows both because of improved ultrasound-based diagnostic methods and because of its application to in-vitro fertilisation (IVF) and the prediction of women's fertility. However, successful computer-assisted approaches to ovarian follicle analysis are still rare (see [1, 2] and references therein). Most of those analyses focus on 2D ultrasound scans (also referred to as B-mode scans) and measure specific properties of ovarian follicles (e.g. follicle diameter and area). In some cases, simple computational intelligence accompanies the integrated prior knowledge about follicles and about the formation process of ovarian ultrasound images. Only a few recent developments utilize machine learning capabilities in conjunction with either artificial neural networks or other optimised classifiers. In the following introductory paragraphs, a short overview is given of the development of some of those approaches. Muzzolini et al. [1, 5] used the split-and-merge segmentation approach with a texture-based measure to segment 2D ultrasound images of ovarian follicles. The split and merge operations were controlled by means of a simulated annealing algorithm
(Metropolis). The algorithm's efficiency was assessed on several ovarian ultrasound images with single follicles. The mislabelling error (i.e. the percentage of misclassified pixels) was around 30% with original images and around 1% if the pixels' grey-level values were stretched over a predefined interval before the images were processed. Sarty et al. [6] reported a semi-automated knowledge-based approach for detecting the inner and outer follicle walls. A cost function with integrated prior knowledge is minimised by using heuristic graph-searching techniques to detect the walls. 31 ultrasound images were analysed. The Euclidean distance between computer-segmented and observer-annotated follicle boundaries was 1.47 mm on average, with a 0.83 mm standard deviation. A similar semi-automated approach was reported by Krivanek et al. [7]. An automatic approximation of the inner follicle walls using watershed segmentation on the smoothed images is performed first. Then binary mathematical morphology is employed to separate some merged small adjacent follicles. Finally, a knowledge-graph searching method is applied to detect the inner and outer follicle walls. This approach was applied to 36 ovarian ultrasound images. The Euclidean distance between computer-segmented and observer-annotated follicle boundaries was around 1.64 mm on average, with a 0.92 mm standard deviation. Our first attempt was based on an optimal thresholding applied to a coarsely estimated ovary detected by observing the density and spatial distribution of edge pixels [8]. A test on 20 ovarian ultrasound images with 768×576 pixels yielded a recognition rate, defined as the ratio between the number of correctly identified follicles and the number of all follicles in the images, of around 62%. Considering only the dominant follicles, the recognition rate was 89%. The average misidentification rate, defined as the ratio between the number of incorrectly identified regions and the number of all computer-segmented regions, was around 47%. Our most mature classical approach for automatic follicle segmentation in 2D ultrasound images is a three-step knowledge-based algorithm [1, 2]. Firstly, seed points (i.e. follicle centres) are found by a combination of watershed segmentation and several thresholding operations on pre-filtered images. Secondly, accurate follicle boundary detection is performed by region growing from the seed points. The final step uses prior knowledge in order to eliminate detected non-follicle regions. This algorithm was tested on 50 randomly selected cases from an ovarian ultrasound image database. The image dimensions were 640×480 pixels. The obtained recognition rate was around 78%, while the average misidentification rate was around 29%. The reported Euclidean distance between segmented and correct follicle boundaries was 1.1 mm on average, with a 0.4 mm standard deviation. All the abovementioned methods operate in 2D, which mimics clinical practice. A much greater level of computational intelligence can be brought into follicle detection algorithms if they deal with a sequence of 2D ovarian ultrasound images acquired with a classical B-mode ultrascanner during examination, or even with 3D ovarian ultrasound data (e.g. see [1, 9, 10]). In both cases the processing of a sequence of 2D cross-sections through the ovary is applicable.
Instead of focusing just on follicle segmentation in a single 2D image, it is much more advantageous to consider and integrate the information obtained from the analysis of vicinal images in a sequence. We proposed such solutions based on Kalman filter theory [3, 4]. 2D ovarian ultrasound recordings have also been processed by cellular automata in [11] and by cellular neural networks (CNN) in [14] and [15]. The CNN templates were
trained on a learning set of 4 images [15] randomly selected from a database of 1500 sampled images. A genetic learning algorithm (GA) and simulated annealing (SA) were applied. The testing set consisted of 28 images [15], again randomly chosen from the same database. To recognize both the dominant and the smaller follicles, a grid of 3 CNNs was proposed [15]: a rough follicle estimation was done first, the expressive estimates were then expanded by the second CNN and, finally, delineated by the area of the ovary as detected by the third CNN. Actually, two learning sets were necessary for this reason: one with annotated follicles and another with annotated ovaries. In the 28 images of the testing set, 168 follicles were annotated. The proposed detection algorithm recognized 81 regions, of which 63 belonged to follicles. The main disadvantage of the learning processes used is their very slow convergence, which takes at least a few hours. Slow learning when using GA and SA makes the approach very impractical if large learning sets must be applied, and this certainly is the case with 3D imaging. If learning is not accomplished with a representative set of examples, the obtained recognition rates are rather low. There is no way to speed up GA or SA significantly, so another learning approach is necessary. A variety of optimised classifiers exist that run fast even with large data sets and give optimum classification according to the selected criterion. One of them is the well-known SVM approach. Its computational structure is very similar to the CNN model, which inspired us to merge both principles [16]. This led to a new way of forming the CNN templates based on the SVM optimisation. The learning procedure was shortened drastically, as it does not take more than a few minutes, and the recognition rate for the tested ultrasound recordings of ovarian follicles even increased slightly: the proposed detection algorithm recognized 113 regions, of which 97 belonged to follicles (168 follicles were annotated in the 28 images of the testing set). Although the SVM-based learning proved to be a few hundred times faster than GA or SA, an important drawback still remains: a representative and statistically large enough learning set of annotated follicles (and ovaries) must be provided first, which means a lot of routine work for an expert. If the training data are too few, a satisfactory recognition rate cannot be obtained. However, the idea of combining the CNNs with the SVM introduced an iterative application of the SVM. This suggests that learning can also be done in several steps, so that only a few quick and sparse (limited) annotations are made by the user at the beginning, serving for the first SVM optimisation and recognition. The user is presented with the obtained results on-line, so as to supervise the recognised regions and to mark the most evident false positives and negatives. The on-line processing becomes feasible because there is only a small, limited learning set and the speed of learning is boosted by the SVM-based training. The marked false positives and negatives are taken as new learning samples, and the next recognition iteration is performed. As the quality is supervised by the user, he or she can stop immediately once satisfactory outcomes are obtained.
Our novel approach to the recognition of ultrasound images is fast and user-friendly, incorporates a combination of user's and machine intelligence, and imposes a much lighter burden on the user than the need to annotate the entire ultrasound recording (possibly a 3D volume with 100 and more images). We describe its implementation in the following sections.
2 Linear Classification Using Support Vector Machines

SVMs [13] solve classification problems by determining a decision function which separates samples from different classes with the highest sensitivity and specificity. The approach is known for its noise resistance [12]. A learning set must be given with positive and negative instances for each class. In the case of ovarian ultrasound follicles, we deal with two classes: one belongs to the pixels constituting the follicles, while the other stands for the background. Hence, we face a two-class case where the SVM training can be accomplished by giving some positive instances belonging to the follicle structures and some negative instances belonging to the background. In a two-class case, the hyperplane separating positive from negative instances can be defined as

$$\mathbf{w}^{T}\mathbf{x}_i + b = 0, \qquad (1)$$

where w describes the hyperplane (weights), xi stands for the column vector of instance features, and b for the distance from the origin to the hyperplane. The hyperplane is fully described by both w and b: (w, b). Each instance xi must be given a classification (target) value yi ∈ {−1, 1}, where the value 1 designates positive and −1 negative examples. These values correspond to the annotations provided in the learning set. In linearly separable cases, a classification parameter γ can be calculated as follows [17]:

$$\gamma_i = y_i \left( \langle \mathbf{w}, \mathbf{x}_i \rangle + b \right), \qquad (2)$$

where ⟨·,·⟩ denotes the scalar product. If γi > 0, then the classification of xi was correct. The learning phase of the SVM simply looks for the hyperplane with the largest possible margin, i.e. the hyperplane whose distance to both the positive and the negative examples is maximal. The optimum can be obtained by using Lagrange multipliers αi ≥ 0. Classification results may often be improved if the multipliers αi are also limited by an upper bound c, so that 0 ≤ αi ≤ c for all i. The SVM learning results in the set of Lagrange multipliers αi and the distance b of the hyperplane from the origin. The weights w are obtained in the following way:

$$\mathbf{w} = \sum_{i=1}^{m} y_i \alpha_i \mathbf{x}_i. \qquad (3)$$
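As an aside, equation (3) can be checked directly against an off-the-shelf SVM implementation. The following sketch uses scikit-learn (our choice; the experiments reported below used Matlab with libSVM) and random stand-in data; it recovers w both from the library's coefficients and from the Lagrange multipliers of the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Random stand-in data for the vectorized neighbourhoods x_i and
# target values y_i in {-1, 1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 125))              # e.g. 5x5x5 neighbourhoods
y = np.where(X[:, 0] > 0, 1, -1)

clf = SVC(kernel="linear", C=0.001)          # C is the upper bound c
clf.fit(X, y)

# Eq. (3): w = sum_i y_i * alpha_i * x_i. scikit-learn stores the
# products y_i * alpha_i of the support vectors in dual_coef_.
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_[0]
assert np.allclose(w, clf.coef_)             # both routes give the same w
```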
We limit the learning process to the linearly separable case only, since the resulting weights can then be applied directly as an image operator on the 3D volume, usually utilizing hardware-accelerated operations and thus making classification very fast compared to the case where a nonlinear kernel is introduced.

2.1 SVM for Ovarian Ultrasound Imaging
Consider a 2D ultrasound image of ovarian follicles first. A number of pixels belong to the follicular regions, while the others mirror the background and different ovarian structures. Only the pixels belonging to the follicles are treated as positive instances
(yi = 1) in the learning process, while the background pixels are negative instances (yi = −1). Remaining at the single-pixel level, it is quite clear that some parts of the follicles may have grey-levels very similar to the background and vice versa. Such a real situation is usually far from linear separability, especially if other tissues and blood vessels appear in the ultrasound recording, which is, additionally, always corrupted by a high level of speckle noise. This is why the SVM would do a poor job deploying only the information on single pixels. Feature vectors xi must, therefore, involve several pixels, most preferably a certain neighbourhood centred over every observed pixel to be classified. Pixel values from the selected neighbourhood are vectorized along the rows and columns in order to obtain the feature vectors xi. The overlap of the joint distributions of the neighbourhoods belonging to the follicles and those from the background depends on the type of neighbourhood chosen and on the contents of the images, i.e. the distributions of single-pixel grey-levels. In general, it can be shown that separability improves when taking larger neighbourhoods. However, the larger the neighbourhoods, the more severe is the threat of fusing different vicinal regions, so a compromise is usually sought empirically. After the learning phase has completed successfully, the SVM-generated weights w and b are ready to be employed in the classification process. For any unknown feature vector ξi, the expression

$$\langle \mathbf{w}, \boldsymbol{\xi}_i \rangle + b > 1 \qquad (4)$$
indicates that it belongs to the class of positive examples, which, in our ultrasound image recognition, means the class of follicles. This is equivalent to the interpretation with the saturated SVM output function (compare to the CNNs in [15]). The explained approach does not change much when 3D volumes are processed. Again, a learning set of positive and negative examples is constructed by annotating the voxels inside 3D ovarian follicles as positive (yi = 1) and all others as negative examples (yi = −1). 3D neighbourhoods are selected and vectorized into feature vectors xi. Everything else is just the same as in the 2D case.
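A minimal 2D sketch of this classification pass is given below (Python/NumPy, our own formulation): every k × k neighbourhood is vectorized and the decision rule of Eq. (4) is applied to it. The 3D case only changes the window shape.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def classify_pixels(image, w, b, k=5):
    """Apply the decision rule <w, xi> + b > 1 from Eq. (4) to every
    pixel whose k x k neighbourhood fits inside the image."""
    windows = sliding_window_view(image, (k, k))        # (H-k+1, W-k+1, k, k)
    feats = windows.reshape(*windows.shape[:2], k * k)  # vectorized xi
    scores = feats @ w + b                              # <w, xi> + b
    return scores > 1                                   # True = follicle pixel
```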
2.2 Simplification of Image Annotation for the Purpose of SVM Learning

Although some very handy software tools exist for manual image segmentation, such as ITK-SNAP [18], the annotation of even a single 2D recording may be very cumbersome. A keen eye perceives follicular boundaries quickly if the dominant follicles are in question, but smaller and less expressive ones would puzzle even experts. And, finally, precisely encircling a follicle on the screen with the computer mouse is not always trivial. Our goal was to derive a procedure which would be fast enough to run in real time and would need as little user interaction as possible. Yet, the SVM learning process must be given a proper learning set, and this set must be contributed by an expert. Initially, this expert is expected to give just a hint of what he or she considers positive and negative instances. The SVM optimises the weights (w, b) and processes the entire input image or 3D volume. The recognised regions of interest, i.e. follicles, are displayed projected onto the original image or volume slices. The most evident discrepancies are outlined by the user again, and those instances enter the second SVM
iteration as an additional, refined learning set. The user can stop after any iteration and save the most appropriate weights. Denote the initial annotated examples by x0,i, i = 0, …, N−1. They comprise the initial learning set S0 = [x0,0, x0,1, …, x0,N−1]. After the first learning phase, the weights (w0, b0) are obtained for the first recognition. Evident mismatches guide the annotator in building the second learning set S1, which is actually the set S0 appended with new examples x1,i, i = 0, …, M−1, suggested by the outcomes of the first iteration. This yields the second version of the weights (w1, b1) and the second, improved recognition of follicles. The described procedure simply repeats in subsequent iterations.
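The whole loop can be summarised in a few lines. The sketch below is schematic: `annotate_discrepancies` is a hypothetical stand-in for the expert's on-screen marking of false positives and negatives, and scikit-learn replaces the actual SVM implementation used in our experiments.

```python
import numpy as np
from sklearn.svm import SVC

def iterative_svm(X, y, volume_features, annotate_discrepancies, max_iter=8):
    """Iterative SVM with a sparse learning set: train on S_k, show the
    recognition, let the expert mark evident mistakes, build S_{k+1}."""
    clf = None
    for _ in range(max_iter):
        clf = SVC(kernel="linear", C=0.001).fit(X, y)
        labels = clf.predict(volume_features)
        # Hypothetical callback: returns ([], []) once the expert is
        # satisfied with the displayed recognition.
        new_X, new_y = annotate_discrepancies(labels)
        if len(new_X) == 0:
            break
        X = np.vstack([X, new_X])            # S_{k+1} = S_k + new examples
        y = np.concatenate([y, new_y])
    return clf
```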
3 Experiments and Results

The images used in our experiments were extracted from 3D ultrasound volumes of women's ovaries acquired at the University Clinical Centre of Maribor by using a Voluson 730 ultrasound device. The images of 7 patients in different stages of the menstrual cycle were used. All the acquired volumes were transformed from the polar to the Cartesian coordinate system in the first place. The volumes were sliced into separate 2D images along the X, Y and Z axes. Thus, 3117 2D images were obtained from the 7 volumes. All the images contain other structures with characteristics similar to follicles, such as veins and intestines, perceived as dark and monotone regions. Those regions were considered part of the background. No speckle noise or other disturbances were removed prior to the image analysis either. For the purpose of the experiment described in this paper, we focused on just one volume, labelled XX2. The spatial resolution of the volume was 149×103×121 voxels and its contrast (grey-level) resolution was 8 bits. Our main goal was to observe the SVM-based recognition accuracy and the complexity of the procedure in two different approaches. The experiment was, therefore, prepared in two variants:

• First experiment: The entire volume was considered for learning and testing. It was annotated by an expert in the sense that all the follicles perceived by the human eye were outlined. Then the volume was undersampled by a factor of 125 (a factor of 5 along each dimension), in order to keep the number of learning instances and the computational complexity of the SVM learning phase lower (12667 instances). The undersampled voxels were taken as centres of 5×5×5 (undersampled) neighbourhoods. All the neighbourhoods were vectorized and labelled as positive or negative examples, x0,i, according to whether the central voxel lies inside or outside the follicles. The entire learning set was used for a one-step SVM learning procedure, as described at the beginning of Section 2.1. The obtained weights were employed for classification, as described in Section 2, and the recognition rate for the entire volume was verified.
• Second experiment: The same volume was taken, but no a priori annotation was taken into account. We followed the procedure proposed in Section 2.2. An expert was asked to browse the volume slices on the screen and decide on a most significant follicular region and a most significant part of the background on any of the slices. A quick annotation of the two selected regions followed. The volume was presented in its full spatial resolution. The learning examples were formed from 5×5×5 neighbourhoods again, but the learning set was much smaller owing
to the fact that only one expressive follicle and one background area of approximately the same size were annotated. This set was involved in the SVM training without undersampling. The obtained weights were then applied to the classification of the entire volume. The expert was shown individual slices displaying both the original ultrasound image and the recognized regions, presumably follicles, overlapped in a semi-transparent mode. Browsing through the slices, he located the most evident misrecognitions and annotated a region of positive and a region of negative instances. At the same time, the system informed him about the current classification rate on the voxel basis. This rate is expected to grow through successive iterations; when the growth stops, this is a clear sign that no further improvement can be expected. Of course, if the rate is satisfactory even sooner, there is no reason to go on with additional iterations. Finally, the recognition results for the entire volume were verified and also compared to the results obtained after learning on the entirely annotated ultrasound volume.

Verification of the accuracy of the recognition results was performed using two different measures. The first one was based on the percentage of properly classified voxels. This is a measure of the algorithm's learning and classification capabilities, but in the case of image recognition it does not give relevant information on the recognition success for individual objects, such as the ovarian follicles in our experiments. Hence, we resorted to the region-based measures proposed in [2] and [3]. Two ratios were introduced: the percentage of the intersection area between the annotated and recognised regions with respect to the annotated area, ρ1, and the percentage of the intersection area with respect to the recognised area, ρ2. The closer the values of ρ1 and ρ2 to 1, the better the recognition accuracy that has been achieved. It has to be stressed that, when measuring the recognition rate by planar or spatial measures such as ρ1 and ρ2, even a small misalignment of the two compared areas changes their cross-sectional cover by a power of 2; with volumes this is even more pronounced, as misalignments of volumes cause a cubical decrease of the joint volume. A practical interpretation for follicles would be as follows: even if the values of ρ1 and ρ2 are as low as 0.5, or in a particular case even lower, visual inspection shows that the annotated and recognised regions definitely mostly cover each other, so that there can be no doubt they indicate the same phenomenon.
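Written out explicitly (this formulation is ours, but it matches the verbal definitions above), with A denoting an annotated region and R the corresponding recognised region, the two ratios are

$$\rho_1 = \frac{|A \cap R|}{|A|}, \qquad \rho_2 = \frac{|A \cap R|}{|R|}.$$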
3.1 Results of the First Experiment

As mentioned above, our intention in this experiment was to observe the SVM capabilities, its learning speed, and the effort to be put into the annotation and processing of an entire ultrasound volume with ovarian follicles and a resolution of 149×103×121 voxels. The volume was undersampled by a factor of 5 along each dimension, and an expert annotated and cross-checked all the slices in all three dimensions. It took him about 5 hours of very tedious and tough work. Then, positive and negative instances were generated automatically in 5×5×5 (undersampled) voxel neighbourhoods. Their number totalled 12666. This set was used in the SVM learning phase, where the parameter c was set to 0.001. It took 148.53 seconds to complete the learning on an average personal computer (AMD Athlon
3200+, 2 GB RAM). All our algorithms were implemented in Matlab, and the native implementation of the libSVM library was used for the SVM learning. The SVM-based recognition results were as follows: the recognition accuracy on the voxel basis was 96.83%, while the follicle recognition rate was 50.00% when measured with ρ1 and ρ2 both set to 0.2. A follicle was considered recognised when ρ1 and ρ2 were fulfilled for it at least in the slice containing its central (largest) cross-section in any of the three spatial directions (verified manually). These results are shown in Table 1, where they can be compared with the results of the second experiment. Fig. 1 shows some typical examples of the annotated and correspondingly recognised follicles.
Fig. 1. Typical result for an ultrasound volume slice: (a) original ultrasound slice, (b) manual annotation and (c) regions recognized by the SVM with the undersampled learning set
3.2 Results of the Second Experiment
This experiment followed the idea presented in Section 2.2. Our main interest was in finding out whether a supervised learning approach using the SVM can be run in real time and whether it can give results comparable to more exhaustive learning approaches. The same ultrasound volume was processed as in the first experiment, although no undersampling was needed here. The same sizes of voxel neighbourhoods were considered. The initial learning set, chosen by the expert on an expressive follicle and on a typical background region, contained 2000 instances on average. The learning set size increased through subsequent iterations by 60 instances on average. Browsing through the volume slices on the screen and selecting two new regions as refined positive and negative instances was not very time consuming: it took about 3 seconds in each step, on average. The SVM learning phase was extremely fast, and the recognition improved slightly through subsequent iterations. The obtained voxel assignment accuracies in each iteration step are shown in Table 1. It can be observed that the voxel classification accuracy, in terms of the true positive rate and true negative rate, is quite high and even improves in the first few steps, when additional positive and negative learning instances are added to the learning set. The voxel classification rates are even slightly higher than those of the SVM with the undersampled learning set. On the other hand, when the follicle classification rate is taken into account, it is not as high as the voxel classification rates would imply. The problem can be observed in Fig. 2, where some typical images of follicles annotated and correspondingly recognised by the undersampled SVM and the sparse-learning SVM are shown. Some follicles are merged at the top of the image and thus count only as
Fig. 2. Typical segmentation result for an ultrasound volume slice: (a) manual annotation, (b) recognized by the SVM with the undersampled training set, and (c) recognized by the SVM with the sparse training set

Table 1. Overview of segmentation results: iteration (It), training time, voxel classification accuracy (ACC), voxel true positive rate (TPR), voxel true negative rate (TNR), follicle recognition rate (RR), follicle recognition rate with ρ1 and ρ2 set to 0.2 (RR ρ), misclassification rate (MR) and number of instances in the training set (INST)

It | Time (s) | ACC (%) | TPR (%) | TNR (%) | RR (%) | RR ρ (%) | MR (%) | INST
-- Undersampled full learning set --
/  | 148.53   | 96.83   | 82.04   | 98.59   | 52.94  | 50.00    | 14.29  | 12666
-- Sparse learning set --
1  | 0.34     | 85.92   | 98.70   | 84.40   |  8.82  |  5.88    | 62.50  |  2198
2  | 1.72     | 96.78   | 83.69   | 98.35   | 47.06  | 47.06    | 15.79  |  2280
3  | 2.18     | 96.81   | 84.18   | 98.31   | 55.88  | 52.94    | 13.64  |  2354
4  | 2.60     | 96.85   | 84.68   | 98.30   | 52.94  | 52.94    | 14.29  |  2407
5  | 3.16     | 96.88   | 85.89   | 98.19   | 47.06  | 47.06    | 15.79  |  2446
6  | 3.05     | 96.88   | 86.39   | 98.13   | 47.06  | 47.06    | 15.79  |  2474
7  | 3.46     | 96.88   | 86.80   | 98.09   | 47.06  | 47.06    | 11.11  |  2495
8  | 3.36     | 96.87   | 87.20   | 98.02   | 47.06  | 47.06    | 11.11  |  2512
a single recognised follicle, although the region of all the follicles is identified. The SVM was not able to distinguish between the background and follicles that have practically identical learning instances with two different outcomes, one belonging to a follicle and the other to the background.
4 Conclusion

Recognition of ultrasound images still appears to be a delicate problem in general. Non-adaptable approaches cannot achieve satisfactory results. Adaptation, on the other hand, necessitates a certain level of intelligence. We experimented with 3D ovarian recordings and tried to recognise follicles by using SVMs. This implies learning, and with it the problem of selecting and gathering learning sets, the latter being
Fig. 3. 3D segmentation of ultrasound: (a) manual annotation and (b) recognized by the SVM with the sparse training set
dependent on experts' knowledge. As the annotation of 3D ultrasound recordings means dealing with 100 and more slices, it becomes hardly feasible in practice, in particular with applications in a clinical environment. Therefore, we introduced a method based on an iterative SVM and supervised learning with sparse learning sets. Building one of these sets amounts to a very simple selection of two image areas, one providing positive and the other negative examples, and no iteration demands more effort than the initial one. We have shortened the recognition of ultrasound volumes significantly this way. An important speed-up was achieved, first, by introducing intelligent image processing based on the SVM. Learning algorithms known from the field of neural networks, such as GA and SA, run for at least a few hours, and still their convergence may be questionable. SVM-based learning is completed within minutes even with very extensive learning sets, as we have shown in our experiments; the speed-up factor exceeds a few hundred. Yet another acceleration comes with the iterative SVM procedure introduced in this paper: considering the experts' effort to be spent on annotations, a reduction from several hours to a few minutes is encountered again. The follicle recognition rate was not very high but, as can be observed from Fig. 3, the resulting segmentation after only 5 short iterations is very similar to the expert segmentation and can serve as a good starting point for a full annotation, reducing the annotation time by hours. Our recognition system runs practically in real time. A possible next step for improvement is the introduction of ensembles, which would probably improve the classification rate, but might increase the learning time.

Acknowledgments. This research was supported by the Slovenian Ministry of Higher Education, Science and Technology through the programme ''Computer Systems, Methodologies, and Intelligent Services'' (P2-0041). The authors also gratefully acknowledge the indispensable contribution of Prof. Dr. Veljko Vlaisavljević from the University Clinical Centre, Maribor, Slovenia, whose suggestions and imaging material enabled the verification of this research.
References
1. Noble, J.A., Boukerroui, D.: Ultrasound image segmentation: A survey. IEEE Transactions on Medical Imaging 25(8), 987–1010 (2006)
2. Potočnik, B., Zazula, D.: Automated analysis of a sequence of ovarian ultrasound images, Part I: Segmentation of single 2D images. Image and Vision Computing 20(3), 217–225 (2002)
3. Potočnik, B., Zazula, D.: Automated analysis of a sequence of ovarian ultrasound images, Part II: Prediction-based object recognition from a sequence of images. Image and Vision Computing 20(3), 227–235 (2002)
4. Potočnik, B., Zazula, D.: Improved prediction-based ovarian follicle detection from a sequence of ultrasound images. Computer Methods and Programs in Biomedicine 70, 199–213 (2003)
5. Muzzolini, R., Yang, Y.-H., Pierson, R.: Multiresolution texture segmentation with application to diagnostic ultrasound images. IEEE Transactions on Medical Imaging 12(1), 108–123 (1993)
6. Sarty, G.E., Liang, W., Sonka, M., Pierson, R.E.: Semiautomated segmentation of ovarian follicular ultrasound images using a knowledge-based algorithm. Ultrasound in Medicine and Biology 24(1), 27–42 (1998)
7. Krivanek, A., Sonka, M.: Ovarian ultrasound image analysis: follicle segmentation. IEEE Transactions on Medical Imaging 17(6), 935–944 (1998)
8. Potočnik, B., Zazula, D., Korže, D.: Automated computer-assisted detection of follicles in ultrasound images of ovary. Journal of Medical Systems 21(6), 445–457 (1997)
9. Gooding, M.J., Kennedy, S., Noble, J.A.: Volume reconstruction from sparse 3D ultrasonography. In: Ellis, R.E., Peters, T.M. (eds.) MICCAI 2003. LNCS, vol. 2879, pp. 416–423. Springer, Heidelberg (2003)
10. Romeny, B.M.H., Tirtulaer, B., Kalitzin, S., Scheffer, G., Broekmans, F., Staal, J., Velde, E.: Computer assisted human follicle analysis for fertility prospects with 3D ultrasound. In: Kuba, A., Sámal, M., Todd-Pokropek, A. (eds.) IPMI 1999. LNCS, vol. 1613, pp. 56–69. Springer, Heidelberg (1999)
11. Viher, B., Dobnikar, A., Zazula, D.: Cellular automata and follicle recognition problem and possibilities of using cellular automata for image recognition purposes. International Journal of Medical Informatics 49(2), 231–241 (1998)
12. Pankajakshan, P., Kumar, V.: Detail-preserving image information restoration guided by SVM based noise mapping. Digital Signal Processing 17(3), 561–577 (2007)
13. Vapnik, V.: Statistical Learning Theory. Wiley-Interscience, New York (1998)
14. Cigale, B., Zazula, D.: Segmentation of ovarian ultrasound images using cellular neural networks. International Journal of Pattern Recognition and Artificial Intelligence 18(4), 563–581 (2004)
15. Zazula, D., Cigale, B.: Intelligent segmentation of ultrasound images using cellular neural networks. In: Artificial Intelligence in Recognition and Classification of Astrophysical and Medical Images, pp. 247–302. Springer, Heidelberg (2007)
16. Cigale, B., Lenič, M., Zazula, D.: Segmentation of ovarian ultrasound images using cellular neural networks trained by support vector machines. Lecture Notes in Computer Science (part 3), pp. 515–522
17. Berthold, M., Hand, D.J.: Intelligent Data Analysis. Springer, Berlin (2003)
18. Yushkevich, P.A., Piven, J., Hazlett, H.C., Smith, R.G., Ho, S., Gee, J.C., Gerig, G.: User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage 31(3), 1116–1128 (2006)
Fast and Intelligent Determination of Image Segmentation Method Parameters Božidar Potočnik and Mitja Lenič University of Maribor, Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor, Slovenia [email protected], [email protected]
Abstract. An advanced digital image segmentation framework implemented using a service-oriented architecture is presented. The intelligence is incorporated not just in the segmentation method, which is controlled by 11 parameters, but mostly in a routine that eases the determination of the parameters' values. Three different approaches are implemented: 1) manual parameter value selection, 2) interactive step-by-step parameter value selection based on visual image content, and 3) fast and intelligent parameter value determination based on machine learning. The intelligence of the second and third approaches is introduced by end-users in repeated interaction with our prototype in attempts to correctly segment out the structures from an image. Fast and intelligent parameter determination predicts a new set of parameter values for the current image being processed, based on knowledge models constructed from previous successful (positive samples) and unsuccessful (negative samples) parameter selections. Such an approach proved to be very efficient and fast, especially if there are many positive and negative samples in the learning set.
1 Introduction

To design and implement flexible and intelligent machine vision and interactive multimedia systems, the capability of performing the demanding tasks of image processing and pattern recognition becomes of crucial importance. An intelligent image recognition consists of processing algorithms that incorporate a considerable level of adaptability and the ability to learn and infer. Segmentation is one of the most important tasks in digital image processing. With its main aim of dividing images into regions of interest and spurious regions such as background, it plays an imposing role in object detection [1, 10]. Segmentation methods mostly depend on the image type and the characteristics of the searched objects [1, 5, 6, 10]. Usually, they are very well tuned for a specific problem domain; however, without appropriate parameter tuning they are less applicable to other problem domains. Parameter tuning may be a very complex task even for an expert, because in certain examples it is not possible to accurately determine the influence of specific parameters on the final segmentation result. The problem is intensified if parameter tuning is left to an end-user with just shallow knowledge of the segmentation method and of image processing. From all this
we conclude that the majority of segmentation methods are applicable as an efficient utility only for a narrow group of very well-skilled users. To overcome the above-delineated shortcoming, we propose in this paper a prototype for the segmentation of arbitrary digital images, with an integrated module for interactive, and, to some extent, intelligent determination of the segmentation routine parameters. This prototype is implemented in the form of a web service and, consequently, exploits all the advantages of service-oriented architectures (SOA). The SOA combines the ability to invoke remote objects and functions with tools based on dynamic service discovery. It means that an end-user does not have to be concerned with the installation, upgrading or customization of software, nor with ensuring sufficient computer resources (e.g. adequate processor power and storage size), which are essential for the efficient and prompt execution of the commonly demanding segmentation methods. Thus, under such a paradigm, end-users just forward their imaging material to the appropriate services and, at the end, collect the results. The service-oriented principle combined with an efficient and robust segmentation method is the future of the image processing field. The intelligence of our prototype is not gathered just in the segmentation method, which is controlled by 11 parameters, but mostly in a routine for determining the segmentation method parameters' values. We implemented three different methods for parameter value determination: 1) manual parameter value selection, 2) interactive step-by-step parameter value selection, and 3) fast parameter value determination based on machine learning. The intelligence of the second and third approaches is introduced by end-users in repeated interaction with our prototype in attempts to correctly segment out the structures from an image (method 2), understand the meaning and execution of image segmentation (method 2), and assist the machine learning process (methods 2 and 3). This paper is organized as follows. A novel paradigm for the segmentation of digital images is described in Section 2. Section 3 presents fast and intelligent determination of the segmentation method parameters' values by using machine learning, followed by some implementation details covered in Section 4. The paper concludes with some future work directions.
2 Paradigm for Digital Image Segmentation

To avoid the demanding and complex task of tuning the segmentation method parameters, and, moreover, to make the method more universal, a two-part structure of the segmentation method is proposed. The first module is a procedure for determining the segmentation parameters' values with respect to the imaging material (Subsection 2.2), and the second is a module for image segmentation with a known parameter set (Subsection 2.3). The same segmentation method is used in both modules.

2.1 Segmentation Framework
The proposed segmentation framework follows, with some modifications, the segmentation algorithm for follicle detection described in [5]. Our eight-step method
carries results between steps, i.e. the result of the previous step determines the input for the current step. The steps are briefly presented below (see also [2, 5] for details).

1. Filtering. The original grey-level image is filtered by a noise-reduction filter. If a colour image is processed, it is first transformed into grey-levels (i.e. the luminance information of the image represented in the HSL colour model is retained).
2. Binarization. A global thresholding is implemented. Two different inputs can be selected: a) the filtered grey-level image from step 1, or b) an image obtained by calculating the standard deviation of grey-levels for each pixel in its k × k mask.
3. Removal of noisy regions. This optional step removes from the binary image all regions smaller than a preset threshold (i.e. spurious regions).
4. Removal of regions connected to the image border. All regions with at least one pixel in a zone of m pixels from the image border may be removed.
5. Filling region holes. Any region holes are filled by using morphological operators. This step is optional as well.
6. Image negation. If bright objects are searched for, the image must be negated by replacing each pixel with 255 minus its current grey-level (optional step).
7. Region growing. The obtained initial homogeneous regions are grown by using the method from [5] (optional step). Region growing is controlled by two parameters: a) a parameter controlling region compactness and growing strength, and b) a parameter limiting the number of iterations.
8. Post-processing. Post-processing combines the methods from steps 3 to 5. The proposed framework offers this step as an option.

Each step is controlled by a single parameter, with the exception of steps 2 and 7, which are controlled by two and three parameters, respectively. Finally, let us list all eleven parameters used in this method, sorted by step number (see also Fig. 2): 1) filter; 2) segmentation method and its threshold; 3) threshold for small region removal; 4) removal of regions at the image border (boolean); 5) hole filling (boolean); 6) image type (boolean); 7) region growing (boolean), number of iterations, and alpha; and 8) post-processing (boolean).
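For illustration, the complete parameter set can be captured in a single configuration object. The sketch below is a Python rendering with illustrative field names (the actual implementation, described in Section 4, is a set of Matlab routines):

```python
from dataclasses import dataclass

@dataclass
class SegmentationParams:
    """The eleven parameters of the eight-step framework."""
    filter_type: str             # step 1: noise-reduction filter
    binarization_input: str      # step 2: grey-level or std-dev input
    threshold: int               # step 2: global binarization threshold
    min_region_size: int         # step 3: small-region removal threshold
    remove_border_regions: bool  # step 4
    fill_holes: bool             # step 5
    bright_objects: bool         # step 6: negate image if True
    grow_regions: bool           # step 7
    growing_iterations: int      # step 7: iteration limit
    alpha: float                 # step 7: compactness/growing strength
    postprocess: bool            # step 8
```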
2.2 Determination of Segmentation Parameters
A special routine has been developed to facilitate the determination of the parameter values of the proposed segmentation framework. We implemented three parameter value determination methods: 1) interactive step-by-step selection, 2) manual selection, and 3) fast selection based on machine learning. Methods 1 and 2 are presented in the sequel, whereas method 3 is introduced in Section 3.

Interactive step-by-step selection. This routine offers the end-user several partial (intermediate) results in the form of visual information, i.e. images. Partial results are constructed by fixing the parameter values for all steps j except the step i. Combinations of parameter values for this step i are formed by dynamically varying each parameter over some predefined set or interval of values (for details see [2]). The original image segmented with fixed parameters for steps
Fig. 1. Partial results of interactive step-by-step parameter value selection
j and one combination of parameters for step i results in one partial result (image). The number of partial results equals the number of all parameter combinations in step i (always between 2 and 6). All partial results are presented to the end-user. Afterwards, the user selects the result best suiting his or her expectations (i.e. the visually optimal result) among all the offered partial results (see Fig. 1). By choosing a partial result, the user determines the parameter values for step i and then proceeds with the next step. The determined parameter values for step i remain unchanged until the end of this selection method. The user commences the parameter value determination with step 1 and proceeds in ascending order to step 8. Through interactive step-by-step selection, the end-user visually selects among 25 partial results (the number of all alternatives is, however, 4608) and, simultaneously, determines the values of 11 segmentation parameters. In the most extreme case, the user determines the parameter values without knowing their meaning, their numerical values or their influence on the final segmentation result. Parameter value determination in the manner described above belongs to the class of sub-optimal optimization methods [8].
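The gap between 25 inspected results and 4608 alternatives follows from replacing the Cartesian product of the per-step combinations with their sum. The per-step counts below are a hypothetical assignment (each between 2 and 6, as stated above), chosen only so that the figures match; the actual counts depend on the framework's predefined value sets:

```python
from math import prod

combos_per_step = [4, 6, 4, 2, 2, 2, 3, 2]   # hypothetical counts, steps 1-8

exhaustive = prod(combos_per_step)  # full search over all combinations: 4608
greedy = sum(combos_per_step)       # step-by-step inspection: 25 results
print(exhaustive, greedy)           # -> 4608 25
```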
Manual parameter value selection. This option was designed for advanced users. After establishing all 11 parameters, the user has the option to alter the parameters manually. The image should be re-segmented after such parameter tuning. Advanced users can very efficiently fine-tune the segmentation method by using this functionality. Fig. 2 depicts the GUI for manual parameter tuning.

Fig. 2. GUI with a list of all parameters (11) for our segmentation framework
2.3 Ordinary Segmentation Execution
With a known set of parameters, the proposed segmentation method can be applied to an arbitrary single grey-level image or image sequence. To obtain quality segmentation results, the similarity of the image(s) being processed to the image on which the parameter set was established should be ensured. Several indexes exist for measuring the similarity between images [1]. If the image being processed is too dissimilar, the parameter value determination module must be executed once again for this new type of image.
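One simple similarity index of this kind is histogram intersection; the sketch below is our own choice for illustration, not necessarily an index discussed in [1], and the reuse threshold is a guess:

```python
import numpy as np

def histogram_intersection(img_a, img_b, bins=64):
    """Intersection of normalized grey-level histograms: 1.0 means
    identical distributions, 0.0 means disjoint ones."""
    ha = np.histogram(img_a, bins=bins, range=(0, 256))[0].astype(float)
    hb = np.histogram(img_b, bins=bins, range=(0, 256))[0].astype(float)
    return np.minimum(ha / ha.sum(), hb / hb.sum()).sum()

# Hypothetical usage: re-run parameter determination when the new image
# is too dissimilar from the one the parameters were established on.
# if histogram_intersection(new_image, reference_image) < 0.8:
#     determine_parameters(new_image)
```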
3 Fast and Intelligent Segmentation Method Parameter Value Selection by Using Machine Learning

A step forward towards intelligent segmentation and intelligent parameter value selection was the introduction of machine learning into the proposed framework. The fundamental idea of machine learning is to draw conclusions and predictions from positive and negative samples of the observed process [11]. If an analogy with image segmentation is made, then positive samples are correctly and accurately segmented images, while negative samples are falsely segmented images. Accordingly, the idea behind our intelligent segmentation process is as follows: first, an image is statistically described by a set of features; then the parameters of the segmentation framework are predicted by using machine learning; the obtained parameters are optionally refined; and, finally, the segmentation result is evaluated and stored in the database. We detail this idea in the sequel.

3.1 Image Description
Despite the fact that our segmentation method (see Section 2.1) is designed for segmenting grey-level images, we describe the visual image content by using low-level features such as colours, textures, and edges. In this way more visual image information is retained and, consequently, the machine learning is more effective. Every image is described by four histograms: a) a colour histogram, b) a colour autocorrelogram, c) a histogram of Tamura's contrast feature, and d) a colour histogram of maximum edge pixels. We use such an image description because it is simple and can be calculated quickly and easily. Another reason is that histograms have proved to be invariant to some geometric transformations [9]. To speed up the image description process, we divide the RGB space into Υ³ different colours (Υ = 6). Afterwards, we determine four histograms for a colour image I. First, we calculate a normalized colour histogram as

$$\bar{h}_B = \frac{h_B}{\sum_{i=1}^{\Upsilon^3} h_{B,i}}, \qquad (1)$$
where hB is the colour histogram and hB,i is the number of pixels having value (colour) i. Afterwards, we determine a normalized 1D colour autocorrelogram hA. Each autocorrelogram element is determined as
$$h_{A,i} = P\left( I(p_2) = i \mid I(p_1) = i \wedge |p_2 - p_1| = 1 \right), \qquad (2)$$
where p1 and p2 are image pixels, P () is probability function and is a spatial distance measure. Finally, the autocorrelogram is normalized by using equation (1). The third normalized histogram hC –called also texture histogram–was constructed from Tamura’s feature contrast [4]. Contrast is calculated for each image pixel in its 13 × 13 neighborhood as σ (3) Fcon = 1/4 , α4 where α4 is kurtosis defined as α4 = μ4 /σ 4 , μ4 is the fourth moment, and σ is standard deviation in image (region). The final visual information used for image description was ”strong” edges. First, we calculate image gradient by using Sobel operator. Then, we determine κ (κ = 4096) pixels with the highest gradient value. For these pixels we identify their colours (quantified values) and calculate normalized histogram hE . Each image I was, thus, described by a feature vector x from 4Υ 3 features (i.e. 864 in our case) constructed from introduced normalized histograms as x = [hB , hA , hC , hE ]. 3.2
(4)
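A compact sketch of this descriptor is given below (Python/NumPy, our own rendering; the actual implementation is part of the Matlab-based service of Section 4). It shows the colour quantization, the colour histogram of Eq. (1), and a 4-neighbourhood autocorrelogram in the spirit of Eq. (2); the contrast histogram hC and the edge histogram hE are built analogously and are taken as given here:

```python
import numpy as np

UPSILON = 6  # quantization levels per RGB channel, so 6**3 = 216 colours

def quantize(img):
    """Map an 8-bit RGB image (H x W x 3) to colour indices 0..215."""
    q = (img.astype(int) * UPSILON) // 256
    return q[..., 0] * UPSILON**2 + q[..., 1] * UPSILON + q[..., 2]

def colour_histogram(idx):
    """Normalized colour histogram, Eq. (1)."""
    h = np.bincount(idx.ravel(), minlength=UPSILON**3).astype(float)
    return h / h.sum()

def autocorrelogram(idx):
    """Colour autocorrelogram in the spirit of Eq. (2): equal 4-neighbour
    pairs counted per colour, normalized as in Eq. (1)."""
    counts = np.bincount(idx.ravel(), minlength=UPSILON**3).astype(float)
    same = np.zeros(UPSILON**3)
    for a, b in ((idx[:, :-1], idx[:, 1:]), (idx[:-1, :], idx[1:, :])):
        eq = a == b
        same += np.bincount(a[eq], minlength=UPSILON**3)
    h = np.divide(same, counts, out=np.zeros_like(same), where=counts > 0)
    return h / max(h.sum(), 1e-12)

def describe(img, h_C, h_E):
    """Assemble the 4 * UPSILON**3 = 864-dimensional vector of Eq. (4)."""
    idx = quantize(img)
    return np.concatenate([colour_histogram(idx), autocorrelogram(idx),
                           h_C, h_E])
```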
3.2 Machine Learning Method and Segmentation Parameter Value Determination
Each usage of the segmentation service results in a single learning example with all 864 features, the 11 segmentation parameters, and the user's feedback on the satisfaction with the final segmentation result. If the user is authenticated, this information is also stored in the learning sample, which enables service personalization. The learning set is then used to automatically produce 11 classification/regression models, which are calculated offline and stored for the next usage of the segmentation service. For classification and regression, support vector machines (SVMs) [12] are used, since the learning set is large and has a large number of features. Classification/regression is based on all the calculated image features, and the user feedback is set to "TRUE" in order to acquire segmentation parameters with only positive feedback. For every parameter, classification/regression is executed to obtain the segmentation parameters; this is very fast, since the decision/regression models are calculated offline. Until the offline calculation of a new model is completed, the old model is used for classification/regression. The segmentation result based on the predicted parameters is offered as the first choice to the user in the interactive selection. When a segmentation result offered by machine learning is selected, the example in the learning set is duplicated to increase the weight of the correct solution.
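A minimal sketch of this per-parameter prediction is shown below. It substitutes scikit-learn's SVC/SVR for the Weka SVMs actually used (see Section 4), and the function names are ours:

```python
import numpy as np
from sklearn.svm import SVC, SVR

def train_parameter_models(X, params, is_categorical):
    """Offline step: one SVM model per segmentation parameter, trained on
    feature vectors of images that received positive user feedback."""
    models = []
    for j in range(params.shape[1]):        # 11 columns, one per parameter
        model = SVC() if is_categorical[j] else SVR()
        models.append(model.fit(X, params[:, j]))
    return models

def predict_parameters(models, x):
    """Online step: predict a complete parameter set for a new image
    described by its feature vector x (Eq. (4))."""
    return [m.predict(x.reshape(1, -1))[0] for m in models]
```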
3.3 Evaluation of Segmentation Results
In our framework, a segmentation result is evaluated by simple TRUE or FALSE end-user feedback, i.e. "TRUE" if the result is approximately correct,
"FALSE" otherwise. If an end-user has successfully determined the segmentation method parameters' values and afterwards launches an ordinary segmentation on some image or image sequence, we store the original image and the corresponding parameter set among the positive samples of the observed process; otherwise we treat them as negative samples. This type of result evaluation is easy to implement; however, it is not very precise.
4 Implementation

Our segmentation framework is implemented as a set of Matlab routines. The module for interactive parameter value determination is called "Selection", the module for segmenting a single grey-level image is "AnalyzeOne", while the module "AnalyzeAll" segments grey-level image sequences. We convert these routines written for the Matlab environment into executable modules (.exe files) by using the Matlab automatic executable code generator for .m files. By transforming the code this way, we got rid of one additional server, namely the Matlab interpreter, and, simultaneously, simplified the SOA of our segmentation framework. The front-end segmentation service, which utilizes the Matlab segmentation routines, integrates a machine learning service and a logging service. The logging service stores the user interaction and the extracted image features, and serves usage data to the machine learning service, which constructs all the models and performs the prediction of all segmentation parameters. Both the machine learning service and the logging service are also used as independent services in other applications within the scope of the institute. The logging and segmentation services are implemented in the .NET framework. The machine learning service is implemented in Java and uses the Weka machine learning tools [11] for classification and regression. The segmentation service submits usage information to the logging service and stores the extracted features, user information and segmentation parameters as a separate data set, which is then periodically used by the machine learning service to produce the classification/regression models. The segmentation parameters' values are calculated by the machine learning service, which is invoked right after the image features are calculated by the segmentation service. The segmentation service can operate on large images and is invoked in multiple steps with different operations. A stateful service model is used to reduce network traffic. The segmentation service can also operate asynchronously, to enable prompt interaction and status feedback when processing large sequences of images. To enable quick and easy access to the segmentation service, we developed a web user interface that allows the segmentation service to be used with only a web browser. Fig. 3 depicts the GUI of our segmentation framework after the segmentation parameter value selection process. The left image presents the final segmentation result obtained by using the determined parameter set (shown at the bottom of the figure), while the right image presents the original image. A prototype of the presented segmentation framework can be tested using the web tool available at [3].
Fig. 3. GUI of proposed segmentation framework
5 Conclusion

An advanced image segmentation framework implemented using SOA was presented. The main advantage of the proposed approach is the possibility (i.e., a routine) to easily determine and/or fine-tune the parameters used in our framework. We designed three different routines for parameter value determination, ranging from manual selection, over a simple and interactive step-by-step parameter value selection based on visual information, to fast and intelligent parameter selection based on machine learning. The intelligence of the second and third approaches is introduced by end-users through repeated interaction with our prototype in attempts to correctly segment out the structures from an image. Fast and intelligent parameter value selection using machine learning is a step towards more intelligent segmentation. We proposed the following idea: the web service calculates a feature vector for each representative image. A set of feature vectors forms a learning set (positive and negative samples), which is used for the construction of different knowledge-based models. These models serve for predicting a new segmentation method parameter set for the image being processed. Such parameter value selection proved to be very efficient and fast, especially if there are many positive and negative samples in the learning set.
The main drawback of the proposed framework is the simple evaluation of segmentation results, which also influences the machine learning, because positive and negative samples of the observed process are not determined very precisely. In the future we would like to integrate into our framework a more sophisticated evaluation method, which will measure the dissimilarity between the calculated segmentation results and the correct results (i.e., a gold standard) provided by an end-user.
References

1. Forsyth, D.A., Ponce, J.: Computer Vision, a Modern Approach. Prentice-Hall, Englewood Cliffs (2003)
2. Granier, P., Potočnik, B.: Interactive parameter determination for grey-level images segmentation method. In: Proceedings of the 13th Elect. and Comp. Conf., vol. B, pp. 175–178 (2004)
3. Lenič, M., Potočnik, B., Zazula, D.: Prototype of intelligent web service for digital images segmentation, http://www.cs.feri.uni-mb.si/podrocje.aspx?id=30
4. Long, F., Zhang, H., Feng, D.D.: Fundamentals of content-based image retrieval. In: Feng, D., Siu, W.C., Zhang, H.J. (eds.) Multimedia Information Retrieval and Management - Technological Fundamentals and Applications. Springer, Berlin (2005)
5. Potočnik, B., Zazula, D.: Automated analysis of a sequence of ovarian ultrasound images, Part I: Segmentation of single 2D images. Image and Vision Computing 20(3), 217–225 (2002)
6. Potočnik, B., Zazula, D.: Automated analysis of a sequence of ovarian ultrasound images, Part II: Prediction-based object recognition from a sequence of images. Image and Vision Computing 20(3), 227–235 (2002)
7. Potočnik, B., Lenič, M., Zazula, D.: Inteligentna spletna storitev za segmentiranje digitalnih slik (Intelligent web service for digital images segmentation). In: Proceedings of the 14th Elect. and Comp. Conf., vol. A, pp. 193–196 (2005)
8. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge (1992)
9. Saykol, E., Güdükbay, U., Ulusoy, Ö.: A histogram-based approach for object-based query-by-shape-and-color in image and video databases. Image and Vision Computing 23, 1170–1180 (2005)
10. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis and Machine Vision. Chapman and Hall, Boca Raton (1994)
11. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco (2005)
12. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
Fast Image Segmentation Algorithm Using Wavelet Transform

Tomaž Romih and Peter Planinšič
University of Maribor, Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor, Slovenia
{Tomaz.Romih,Peter.Planinsic}@uni-mb.si
Abstract. A fast image segmentation algorithm is discussed, in which significant points for segmentation are determined first. The reduced set of image points is then used in a K-means clustering algorithm for image segmentation. Our method reduces segmentation of the whole image to segmentation of significant points. The reduction of points of interest is achieved by introducing some intelligence in the decision step before the clustering algorithm. The method is numerically less complex and suitable for implementation in low-speed computing devices, such as smart cameras for traffic surveillance systems. Multiscale edge detection and segmentation are discussed in detail in the paper. Keywords: Wavelet transform, multiscale edge detection, image segmentation.
1 Introduction

Computer vision tasks are known for their complexity. To achieve the final goal, which is some sort of description of the image content, several tasks at different complexity levels are required. Among the most important descriptors of image content are edges, since they define the shapes of the objects. In this article, our method, first proposed in [1], is described, in which the set of points used in the task of image segmentation is significantly reduced. This is done by introducing a smart decision algorithm in this step; that is, edge points are used as guide points for the definition of significant points, which are then used in the image segmentation procedure. To achieve compactness of the whole procedure, the wavelet transform is used in both steps of the described procedure. In the first step, finding the edges in the image, an optimal multiscale edge detection method is used, and in the second step, segmentation of the image, wavelet coefficients through the scales are used as input to the clustering algorithm. An improved edge detector was first proposed by Canny [2], where an optimal edge detector with a mathematical background was introduced. The Canny edge detector is fast, reliable, robust and generic. Mallat later extended the Canny edge detector to the time-scale plane [3, 4] and introduced the multiscale edge detector that is incorporated in our method. Using a reduced set of points for image description has already been proposed, for example, by Ma and Grimson [6], where modified SIFT (Scale Invariant Feature Transform)
keypoints [7] are used for vehicle classification. We extended this in such a way that it is generally applicable to image segmentation. Image segmentation using a clustering algorithm often has the drawback that it does not always follow the actual shape of the object: the borders of the segments are not aligned with the actual edges in the image. Our method preserves the edges found in the image and reduces the number of points considered for image segmentation. It is therefore of lower numerical complexity and as such suitable for use in low-speed computing devices. In this paper we focus on image segmentation using a smart selection of significant points. The significant points form a reduced set of image points used for image segmentation. The whole procedure is based on the wavelet transform. The proposed method is evaluated using the K-means clustering algorithm, but can be used with other clustering algorithms too. This paper is organized as follows. Section 2 explains our method and its steps. The multiscale edge detection with significant point selection and the building of multiscale feature vectors for segmentation using the wavelet transform are described in detail. Section 3 discusses the experimental results and the conclusions are presented in Section 4.
2 Defining Significant Points for Image Segmentation

2.1 The Principle of Our Method

Our algorithm has two stages. In the first stage, the edges are detected using the wavelet-based multiscale edge detection approach proposed by Mallat [3, 4]. In the second stage, significant points for image segmentation are selected and described using the local energy values through the scales of the wavelet decomposition. Edges are used as guidelines for the selection of significant points, as shown in Fig. 1. Significant points are selected from each side of the edge. In order to evaluate the properties of the homogeneous area which the edge surrounds, significant points lie a few pixels away from the edge, perpendicular to the edge direction. As we will see in Section 2.3, the local energy in a small neighborhood is used. The distance of a significant point from the edge is therefore chosen as half of the
Fig. 1. Selection of significant points. (a) Test image with added Gaussian noise, (b) found edges using multi-scale approach, (c) magnified section of edge with marked positions of selected significant points on both sides of the edge.
Fig. 2. Steps and data flow of our algorithm. Rows from top to bottom: 1) the original image and the detected edges, 2) the edges define significant points, 3) result of clustering, showing three clusters, for the dark gray area, the white area and the light gray area, 4) initialization of the region-filling process of the empty areas between the edges and 5) the resulting segmented image with three different clusters.
window size that defines the neighborhood. For the experiments a window of size 3x3 pixels has been used, so significant points lie at a distance of 1 pixel from the edge. Assuming that the edge surrounds the object, one can say that the area inside the edge does not show large variations, which is true for most objects. The surface of
the object does not introduce significant variations at different points. Experimental results show that using only points around the edges for segmentation gives satisfactory results. Even more, the results are better compared to segmentation with all image points but without considering edges. The improvement is in the sense that our method keeps the borders of the segments settled on the edges, where possible. After assigning significant points to clusters, the segment regions are defined by filling in the empty space between the edges with the respective cluster values. A simple flooding algorithm is used for this task. Each region is limited by other regions or by collided edges. Fig. 2 figuratively shows the corresponding steps of our method. The number of clusters is not strictly defined; we found five clusters sufficient for testing purposes.

2.2 Multiscale Edge Detection

According to Mallat et al. [3, 4], edges are more accurately and reliably found by using the wavelet transform and analyzing the scale planes to detect singularities in the image by estimating the Lipschitz exponent. The approach proposed by them is used, where the image is first transformed using the wavelet transform and modulus maxima points are calculated and connected in scale space. The wavelet transform is used without decimation, so all scales have full resolution. The wavelet transform is defined as follows. For any real smoothing function $\theta(x)$, $\theta_s(x) = \frac{1}{s}\theta\left(\frac{x}{s}\right)$ is the smoothing function at the scale $s$. For a real function $f(x)$ in $L^2(\mathbb{R})$ and a wavelet function $\psi(x)$ defined as

$$\psi(x) = \frac{d\theta(x)}{dx}, \qquad (1)$$

the wavelet transform is defined as

$$Wf(s, x) = f \ast \psi_s(x). \qquad (2)$$

Further,

$$Wf(s, x) = f \ast \left(s\,\frac{d\theta_s}{dx}\right)(x) = s\,\frac{d}{dx}\left(f \ast \theta_s\right)(x) \qquad (3)$$

and for a specific orientation,

$$Wf(s, i, x) = f \ast \left(s\,\frac{d\theta_s}{dx_i}\right)(x) = s\,\frac{d}{dx_i}\left(f \ast \theta_s\right)(x), \qquad (4)$$

where $i$ specifies the orientation of the wavelet transform (horizontal or vertical).
Modulus maximum points $(s_0, x_0)$ are points in the scale planes which have the following properties:

$$|Wf(s_0, x_m)| < |Wf(s_0, x_0)|, \qquad |Wf(s_0, x_n)| \le |Wf(s_0, x_0)|, \qquad (5)$$

where $x_m$ and $x_n$ belong to opposite neighborhoods (either left or right) of $x_0$. A connected curve of the modulus maxima points in the scale space $(s, x)$ is called a maxima line. The decay of the maxima line over the scales estimates the Lipschitz exponent. By following the maxima line to the finest scale, one can localize the point of singularity. The decay of the maxima line shows whether the point is a regular edge or noise: edge points have a smaller decay through the scales than noise, where the maxima line drops to zero after a few scales. If noise is present in the image, then edges are difficult to detect at finer scales. By using information from coarser scales one can still detect edges; the drawback is poorer localization of the edge. A small sketch of this modulus-maxima computation is given below.
2.3 Segmentation Using the Wavelet Coefficients

Once the edges are found, their locations are used for the selection of significant points. Significant points are chosen on both sides of the edges, as Fig. 1 suggests. Since the edge direction is one of the outputs of the edge detection step, we use it for direction definition: the front of the edge lies in the same direction as the edge direction, while the opposite direction looks behind the edge. One significant point lies in front of the edge and one behind the edge. Therefore, if the number of all edge points is P, then the total number of significant points is 2P. We already have the image decomposed using the wavelet transform, so we use this decomposition for feature extraction. For each significant point we extract a feature from each subimage separately and build a feature vector that reflects scale-dependent properties [10]. The coefficients of the wavelet transform are not directly applicable as texture features, because they exhibit great variability within the same texture. As stated in [8, 10], where several local energy measures were proposed, using the local energy is more appropriate. We use a square flat window to calculate energy values in a small neighborhood of the point of interest through the subimages at various scales and orientations:

$$eng(s, i, x_j, y_j) = \frac{1}{M \times N} \sum_{m}^{M} \sum_{n}^{N} Wf(s, i, x_j + m, y_j + n)^2, \qquad (6)$$
where $M \times N$ is the size of the square window, $s$ and $i$ are the scale and orientation, respectively, and $(x_j, y_j)$ are the image coordinates of the $j$-th significant point, where $j$ is the index of the significant point, $0 \le j < 2P$. The feature vector $v_j$ for the $j$-th significant point is constructed from elements with values of the local energy, measured at a specific scale and orientation for the respective significant point:

$$v_j = \begin{bmatrix} f(x_j, y_j) \\ eng(s_0, i_0, x_j, y_j) \\ eng(s_0, i_1, x_j, y_j) \\ eng(s_1, i_0, x_j, y_j) \\ eng(s_1, i_1, x_j, y_j) \\ \vdots \\ eng(s_l, i_0, x_j, y_j) \\ eng(s_l, i_1, x_j, y_j) \end{bmatrix}, \qquad (7)$$

where $f()$ is the image gray-level value, $(x_j, y_j)$ are the coordinates of the $j$-th significant point, $\{s_0, s_1, \ldots, s_l\}$ are the scales up to the $l$-th scale, and $i_0$, $i_1$ are the horizontal and vertical orientations, respectively. A sketch of this feature construction is given below.
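```python
# A minimal sketch of Eqs. (6)-(7): the local energy of the wavelet
# coefficients is averaged over a small square window in every
# (scale, orientation) sub-image and stacked, together with the grey level,
# into the feature vector of a significant point. The `subbands` dict,
# mapping (scale, orientation) to a full-resolution coefficient array, is an
# assumed interface, not part of the original implementation.
import numpy as np

def local_energy(wf, xj, yj, win=3):
    h = win // 2
    patch = np.asarray(wf, dtype=float)[max(xj - h, 0):xj + h + 1,
                                        max(yj - h, 0):yj + h + 1]
    return float(np.sum(patch ** 2)) / (win * win)            # Eq. (6)

def feature_vector(img, subbands, xj, yj, win=3):
    v = [float(img[xj, yj])]                                  # f(x_j, y_j)
    for (s, i) in sorted(subbands):
        v.append(local_energy(subbands[(s, i)], xj, yj, win))  # eng(s, i, ...)
    return np.array(v)                                        # Eq. (7)
```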
Now each feature vector $v_j$ can be assigned to a specific cluster by some clustering algorithm. We have chosen the K-means clustering algorithm for its simplicity and wide use, although the proposed method is not limited to it and other clustering algorithms can be used too. Using the K-means algorithm, we need to specify the expected number of clusters K. For n array elements, the K-means algorithm then iteratively tries to minimize the measure [9]:

$$J = \sum_{m=1}^{n} \sum_{k=1}^{K} u_{km} \, \| x_m - z_k \|^2, \qquad (8)$$

where $u_{km}$ is the membership of pattern $x_m$ in cluster $C_k$ and forms the partition matrix $U(X) = [u_{km}]_{K \times n}$ for the data, and $z_k$ is the center of the $k$-th cluster.
We emphasize once more that a reduced number of image points is used in the clustering process: not all points of the image are used, but only the selected significant points, as described at the beginning of this section; a sketch of this clustering stage is given below. The wavelet transform is performed on the whole image. The numerical complexity of the fast discrete wavelet transform is $O(N^2 \log N)$ [3], where N is the number of all points of the image, and the numerical complexity of the K-means algorithm is $O(KtM)$ [11], where K is the number of clusters, t is the number of iterations of the K-means algorithm and M is the number of points considered in the clustering process.
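```python
# A minimal sketch of the clustering stage: only the 2P significant-point
# feature vectors enter K-means, which minimises the measure J of Eq. (8);
# each point then carries its cluster label back to its image coordinates as
# a seed for the flooding algorithm. scikit-learn's KMeans is a stand-in for
# the K-means routine used in the paper.
import numpy as np
from sklearn.cluster import KMeans

def cluster_significant_points(vectors, coords, n_clusters=5):
    """vectors: (2P, d) feature vectors; coords: (2P, 2) image positions."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(np.asarray(vectors))
    return [(tuple(c), int(l)) for c, l in zip(coords, labels)]  # flood seeds
```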
3 Experimental Results

To evaluate our method, a computer program has been written in the C language that performs the wavelet decomposition of the image into the subimages, calculates the wavelet modulus maxima points and connects them into chains. For these procedures,
the program calls the routines written by Emmanuel Bacry [5]. The program then calculates the local energy of the surrounding points and uses the K-means clustering algorithm to segment them. The flooding algorithm is then performed to define the surface of the segments. Experiments were made using various images; results for three test images, Lena, Airplane and Baboon, are shown in Figs. 3, 4 and 5. The edges found using the multiscale edge detection approach are shown in Figs. 3b, 4b and 5b. Figs. 3c, 4c and 5c show the final results of the segmentation using our method. Figs. 3d, 4d and 5d show the results of segmentation using the K-means algorithm with all points considered, but without considering edges, referred to here as the common method. Comparison of the results of the segmentation using our method with the results of the common segmentation procedure shows that with our method the edges of the details are better defined and no new clusters are introduced around the edges. The obtained segments follow the edge lines, where possible. Comparison of the numerical
Fig. 3. a) Test image “Lena”, b) found edges, minimum edge length set to 10, amplitude threshold set to 5, c) segmentation using our method and d) common method
Fig. 4. a) Test image “Airplane”, b) found edges, minimum edge length set to 10, amplitude threshold set to 5, c) segmentation using our method and d) common method
Fig. 5. a) Test image “Baboon”, b) found edges, minimum edge length set to 10, amplitude threshold set to 5, c) segmentation using our method and d) common method
Fig. 6. Evaluation of speed up of proposed method. a) Test image "Water in hands", b) segmentation using all 65536 pixels as input for K-means algorithm, c) segmentation with proposed method using 20278 pixels as input for K-means algorithm.
complexity of our method and that of the common method shows that with our method about 1/3 of the points are involved in the K-means clustering procedure compared to the usual approach, as shown in the example in Fig. 6. Execution time measurements confirm our expectations. We measured function execution times using the MS Visual C debugger utility and a high-resolution timer on a PC with a 3.0 GHz Pentium IV processor. The code has not been optimized. Using the proposed method with a reduced set of points for clustering brings time savings of over 20 ms for a 256x256 pixel image (18 ms compared to 41 ms of execution time of the K-means function, using the image from the example in Fig. 6) and of over 1 s for a 512x512 pixel image (24 ms compared to 1.05 s of execution time of the K-means function, using the test image Lena). On the other hand, the multiscale edge detector code is not optimized either and is as such very time consuming; other (faster) edge detection methods could be considered instead. A drawback of our method could be the large number of segments introduced in the hairy parts of images, as can be seen in Lena's hair and Baboon's fur, if this is not desired.
4 Conclusion

The experiments show that our method speeds up the clustering algorithm and gives satisfactory results. It improves the common segmentation method, speeds up the whole process and better describes the objects by considering their edges. Even for structurally complex images, our method uses fewer points for the calculation of the segments than the common method. It outperforms the common method in the definition of object shapes, because it follows the true object edges and does not introduce new segments around edges. This was achieved by a smart selection of the significant points of the image considered for the segmentation. Because of its lower numerical complexity it is suitable for use in low-speed computing devices such as smart cameras for consumer electronics. The method can be extended by using bandelets, contourlets and other second-generation wavelet transforms. Edge detection can be extended by involving parameters for the adjustment of edge intensity and edge fragment length. The flooding algorithm can be improved by introducing a smarter expansion algorithm; false segment regions can be reduced that way.
References

1. Romih, T., Čučej, Ž., Planinšič, P.: Wavelet based edge preserving segmentation algorithm for object recognition and object tracking. In: Proceedings of the IEEE International Conference on Consumer Electronics 2008 (2008)
2. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Machine Intell. 8, 679–698 (1986)
3. Mallat, S., Zhong, S.: Characterization of signals from multiscale edges. IEEE Trans. Pattern Anal. Machine Intell. 14, 710–732 (1992)
4. Mallat, S., Hwang, W.L.: Singularity detection and processing with wavelets. IEEE Trans. Information Theory 38, 617–643 (1992)
5. Bacry, E.: LastWave, http://www.cmap.polytechnique.fr/~bacry/LastWave
6. Ma, X., Grimson, W.E.L.: Edge-based rich representation for vehicle classification. In: Proceedings of the Tenth IEEE International Conference on Computer Vision, vol. 2, pp. 1185–1192 (2005)
7. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 91–110 (2004)
8. Petrou, M., Sevilla, P.G.: Image Processing: Dealing with Texture. John Wiley & Sons Ltd., Chichester (2006)
9. Maulik, U., Bandyopadhyay, S.: Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. Pattern Anal. Machine Intell. 24, 1650–1654 (2002)
10. Livens, S., Scheunders, P., van de Wouwer, G., Van Dyck, D.: Wavelets for texture analysis, an overview. In: Sixth International Conference on Image Processing and Its Applications, vol. 2, pp. 581–585 (July 1997)
11. Hruschka, E.R., Hruschka Jr., E.R., Covoes, T.F., Ebecken, N.F.F.: Feature selection for clustering problems: a hybrid algorithm that iterates between k-means and a Bayesian filter. In: Fifth International Conference on Hybrid Intelligent Systems (November 2005)
Musical Instrument Category Discrimination Using Wavelet-Based Source Separation

P.S. Lampropoulou, A.S. Lampropoulos, and G.A. Tsihrintzis
Department of Informatics, University of Piraeus, Piraeus 185 34, Greece
{vlamp,arislamp,geoatsi}@unipi.gr

Abstract. In this paper, we present and evaluate a new innovative method for quantitative estimation of the musical instrument categories which compose a music piece. The method uses a wavelet-based music source (i.e., musical instrument) separation algorithm and consists of two steps. In the first step, a source separation technique based on wavelet packets is applied to separate the musical instruments which compose a music piece. In the second step, a classification algorithm based on support vector machines is applied to estimate the musical category of each of the musical instruments identified in the first step. The method is evaluated on the publicly available Iowa Musical Instrument Database and found to perform quite successfully.
1 Introduction

Instrumentation is an important high-level descriptor of music, which may provide useful information for many music information retrieval (MIR) related tasks. Furthermore, instrument identification is useful for automatic musical genre recognition, as certain instruments may be more characteristic of specific genres [1]. For example, the electric guitar is quite a dominant instrument in rock music, but is hardly ever used in classical music. Additionally, human listeners may be able to determine the genre of a music signal and, at the same time, identify a number of different musical instruments from a complex sound structure. There are several difficulties in developing automated musical instrument identification procedures, which reside in the fact that some audio signal features depend on pitch and individual instruments. Specifically, the timbre of a musical instrument is obviously affected by the wide pitch range of the instrument. For example, the pitch range of the piano covers over seven octaves. To achieve high performance in musical instrument identification, it is indispensable to cope with the pitch dependence of timbre. Most studies on musical instrument identification, however, have not dealt with the dependence of timbre on pitch [2]. For the identification of the musical instruments which compose an audio signal, many approaches have been proposed. One of the most popular relies on recognition of the musical instruments from the direct signal. Another approach
attempts to separate the sources of an audio signal with source separation techniques borrowed from the signal processing literature and, then, to identify the participating instruments in each separated source [3]. In the current work, we follow the second approach and attempt to identify the instrumentation of an audio signal, as a preprocessing step for a genre classification system. Our method uses a wavelet-based music source (i.e., musical instrument) separation algorithm and consists of two steps. In the first step, a source separation technique based on wavelet packets is applied to separate the musical instruments which compose a music piece. In the second step, a classification algorithm based on support vector machines is applied to estimate the musical category of each of the musical instruments identified in the first step. The method is evaluated on the publicly available Iowa Musical Instrument Database and found to perform quite successfully. Specifically, this paper is organized as follows: Section 2 reviews previous related work, while Section 3 presents the instrument separation step in detail. Section 4 presents the instrument category classification in detail, illustrates the results and evaluates our method. Conclusions are drawn and future research directions are illustrated in Section 5.
2 Previous Related Work

Early work on musical instrument recognition includes the development of a pitch-independent isolated-tone musical instrument recognition system that was tested using the full pitch range of thirty orchestral instruments from the string, brass and woodwind categories, played with different articulations [4]. Later, this work was extended into a classification system and several features were compared with regard to musical instrument recognition performance [5]. Moreover, a new system was developed to recognize musical instruments from isolated notes [6]. In other works, a number of classification techniques were evaluated to determine the one that provided the lowest error rate when classifying monophonic sounds from 27 different musical instruments [7, 8] or to reliably identify up to twelve instruments played under a diverse range of articulations [9]. Other approaches include the classification of musical instrument sounds based on decision tables and knowledge discovery in databases (KDD) for training data analysis [10, 11] and the development of a musical instrument classification system based on a multivariate normal distribution whose mean was represented as a function of the fundamental frequency (F0) [2, 12]. All the above systems are instrument recognition systems which use the signal of one instrument as input. Even in systems fed with input signals coming from a mixture of instruments, each instrument plays an isolated note. In a previous work of ours, we proposed a new approach to musical genre classification based on features extracted from signals that correspond to musical instrument sources [13]. Contrary to previous works, this approach first used a sound source separation method to decompose the audio signal into a number of component signals, each of which corresponded to a different musical instrument source. In this way timbral, rhythmic and pitch features are extracted from separated
sources and used to classify a music clip, detect its various musical instrument sources and classify them into a musical dictionary of instrument sources or instrument teams. The source separation algorithm we used was Convolutive Sparse Coding (CSC) [14]. In that approach, we applied the CSC algorithm to the entire input signal corresponding to a music piece, assuming that all the instruments are active throughout the entire piece. This assumption, however, does not meet realistic music scenarios, in which different instruments may participate only in a part of the music piece. For example, during the introduction of a music piece only two instruments may participate, while a third instrument may be added later on, and so on. In order to address such cases, we propose a new approach to the source separation method based on the CSC algorithm [15].
3 Wavelet-Based Musical Instrument Identification Method

A human listener is able not only to determine the genre of a music signal, but at the same time to distinguish a number of different musical instruments
Fig. 1. Two Steps of the instrument category identification approach
from a complex sound structure. In order to mimic this process, we propose a new approach for instrument category identification that is going to be a preprocessing module in a genre classification system. This approach implements a wavelet-based source separation method, followed by feature extraction from each separated signal. Classifiers are built next to identify the instrument category in each separated source. The procedure is illustrated in Fig. 1.
3.1 Source-Separation Method Based on Wavelet Packets
The problem of separating the component signals that correspond to the musical instruments that generated an audio signal is ill-defined, as there is no prior knowledge about the instrumental sources. It is common for audio signals coming from different sources to exist simultaneously in the form of a mixture. Many existing separation methods exploit multiple observations of such mixtures to extract the required signals without using any other form of information. This is called the Blind Audio Source Separation (BASS) problem. An extreme case of this problem is when there is only one available observation of the signal mixture (i.e., one channel). In the case of audio signals, the most obvious choice for the observation matrix is a time-frequency presentation, so that basis functions are magnitude spectra of sources. This basic approach has already been used in some ICA, ISA, and sparse coding systems [14, 16, 17]. In our previous work, for the source separation we used a data-adaptive algorithm that is similar to ICA, called Convolutive Sparse Coding (CSC) [14] and based on the Independent Subspace Analysis (ISA) method, which can separate individual sources from a single-channel mixture by using sound spectra. Signal independence is the main assumption of both the ICA and ISA methods. In musical signals, however, there exist dependencies in both the time and frequency domains. In this algorithm, the number of sources N was set by hand and should be equal to the number of clearly distinguishable instruments. For the separation of the different sources in this work, we use an approach of sub-band decomposition independent component analysis (SDICA) [18], which is based on decomposition using wavelet packets (WPs) [19]. In order to adaptively select the sub-band with the least dependent sub-components of the source signals, a criterion based on a small-cumulant approximation of the Mutual Information (MI) was introduced. The problem is known as blind source separation (BSS) and is formally described as

$$x = As, \qquad (1)$$

where $x$ represents the vector of measured signals, $A$ represents an unknown mixing matrix and $s$ represents an unknown vector of the source signals. In our source separation method, we use an approach similar to SDICA, which is based on decomposition using wavelet packets (WPs) realized through iterative filter
banks [19]. In order to enable the filter bank to adaptively select the sub-band with the least dependent sub-components of the source signal, we have introduced a criterion based on small cumulant approximation of MI. In order to obtain a sub-band representation of the original wideband BSS problem Eq. 1, we can use any linear operator Tk which will extract a set k of sub-components sk = Tk [s], (2) where Tk can, for example, represent a linear time-invariant bandpass filter. Using Eq. 2 and sub-band representation of the sources, application of the operator Tk on the wideband BSS model Eq. 1 yields xk = Tk [As] = ATk [s] = Ask .
(3)
In this algorithm, a WP transform was used for Tk in order to obtain subband representation of the wideband BSS problem Eq. 1. The main reason was existence of the WP transform in a form of iterative filter bank which allows isolation of the fine details within each decomposition level and enable adaptive
Fig. 2. Wavelet-based Source Separation
Fig. 3. Example of the resulting separation
sub-band decomposition [20, 21]. In the context of SDICA, this means that an arbitrarily narrow independent sub-band can be isolated by progressing to higher decomposition levels. In this implementation of the described SDICA algorithm, a 1D WP transform is used for the separation of audio signals. Let $f_l$ and $c_l$ be constructed from the $l$-th coefficients of the mixtures and sources, respectively. For each component $x_n$ of the signal $x$, the WP transform creates a tree whose nodes correspond to the sub-bands of the appropriate scale. In order to select the sub-band with the least dependent components $s_k$, MI [19] is measured between the same nodes in the WP tree. Once the sub-band with the least dependent components is selected, either an estimate of the inverse of the basis matrix $W$ or an estimate of the basis matrix $A$ is obtained by applying standard ICA algorithms on the model [22]. Mixed signals can be reconstructed through the synthesis part of the WP transform, where sub-bands with a high level of MI are removed from the reconstruction. A diagram of the abstract steps of this separation is illustrated in Fig. 2. The algorithm can be summarized in the following four steps:

1. Perform a multi-scale WP decomposition of each component of the input data x; a wavelet tree is associated with each component of x.
2. Select the sub-band with the least dependent components by estimating MI between the same nodes (sub-bands) in the wavelet trees.
3. Learn the basis matrix A or its inverse W by executing a standard ICA algorithm for the linear instantaneous problem on the selected sub-band.
4. Obtain the recovered sources y by applying W to the data x.

An example of the separation process of an audio signal by means of the 1D wavelet transform with three decomposition levels, together with the separated sources, is presented in Fig. 3.
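The following minimal sketch illustrates the four steps for two observed mixtures of two sources. PyWavelets provides the WP trees, a nonparametric pairwise MI estimate selects the least dependent sub-band, and FastICA [22] learns the unmixing matrix there; the wavelet choice, decomposition depth and MI estimator are our own illustrative assumptions, not those of the cited implementation.

```python
# A hedged sketch of the SDICA steps listed above for a 2x2 mixing problem.
import numpy as np
import pywt
from sklearn.decomposition import FastICA
from sklearn.feature_selection import mutual_info_regression

def separate_sdica(x, wavelet="db4", level=3):
    """x: (2, n_samples) array of two observed mixtures."""
    # Step 1: WP decomposition; one wavelet tree per mixture component.
    trees = [pywt.WaveletPacket(ch, wavelet, maxlevel=level) for ch in x]
    paths = [node.path for node in trees[0].get_level(level, order="natural")]

    # Step 2: pick the sub-band whose components are least dependent (min MI).
    def mi(a, b):
        return mutual_info_regression(a.reshape(-1, 1), b)[0]
    best = min(paths, key=lambda p: mi(trees[0][p].data, trees[1][p].data))

    # Step 3: learn the (un)mixing matrix by standard ICA on that sub-band only.
    sub = np.column_stack([t[best].data for t in trees])
    ica = FastICA(n_components=2, random_state=0).fit(sub)

    # Step 4: apply the learned unmixing operator W to the wideband data x.
    return ica.components_ @ (x - ica.mean_[:, None])
```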
4 Instrument Class Classification - Results

The samples were obtained from the University of Iowa Musical Instrument Samples Database [23] and were all, apart from the piano, recorded in an anechoic chamber and sampled at 44.1 kHz. The instruments we used from this database are categorized into three categories [24], the winds, the brass and the strings, as shown in Table 1.

Table 1. Categories of Instruments

Wind               Brass           String
Alto Saxophone     French Horn     Violin
Alto Flute         Bass Trombone   Viola
Bass Flute         Trumpet         Cello
Bass Clarinet      Tuba            Double Bass
Oboe               Tenor Trombone
Eb Clarinet
Bb Clarinet
Bassoon
Flute
Soprano Saxophone
In this work, we utilize an instrument-class classifier based on Gaussian Mixture Models (GMM). We have 40 audio signals from each instrument class (40 wind, 40 brass, 40 string). We then produced 40 mixtures of audio signals using WavePad Master's Edition, which is a sound editor program. Each mixture signal was separated by the wavelet-based algorithm. The source separation process produced three component signals which correspond to the three instrument teams, i.e., strings, winds, and brass. The component signals from every mixture were labeled by a user into the three instrument classes. Therefore, the dataset (Table 2) consists of 120 initial audio signals and 120 component signals, distributed in a balanced way over the three classes (80 wind, 80 brass, 80 string). From each signal, we extract a specific set of 30 objective features [25]. It is worth mentioning that these features not only provide a low-level representation of the statistical properties of the music signal but also include high-level
Table 2. Dataset

        Initial Signals   Signals from separation process   Total
Wind    40                40                                80
Brass   40                40                                80
String  40                40                                80
Total   120               120                               240
Table 3. Confusion Matrix: Gaussian Mixture Models, K=5, 70.66%

         Wind   Brass   String
Wind     72     23      5
Brass    34     63      3
String   13     10      77
information, extracted by psychoacoustic algorithms, in order to represent the rhythmic content (rhythm, beat and tempo information) and the pitch content describing the melody and harmony of a music signal. In the Gaussian Mixture Model (GMM) classifier, the Probability Distribution Function (PDF) of each instrument class is assumed to consist of a mixture of a specific number K of multidimensional Gaussian distributions, herein K=5. The GMM classifier is initialized using the K-means algorithm with multiple random starting points, and the models are then refined using the Expectation-Maximization (EM) algorithm for 200 cycles. Assuming equal prior likelihoods for each class, the decision rule is that data points (and corresponding signals) in feature space are classified as belonging to the class for which the PDF is largest. The NetLab toolbox was utilized to construct the GMM classifier; a sketch of the corresponding procedure is given below. The classification result was calculated using 10-fold cross-validation, where the dataset to be evaluated was iteratively partitioned so that 90% was used for training and 10% for testing for each class. This process was iterated with different disjoint partitions and the results were averaged. This ensured that the calculated accuracy was not biased by a particular partitioning into training and testing sets. The achieved GMM classification accuracy is 70.66%. In a confusion matrix, the columns correspond to the actual instrument category, while the rows correspond to the predicted instrument category. In Table 3, the cell in row 1 of column 1 has the value 72, which means that 72 signals (out of a total of 100 signals) from the "Wind" category were accurately predicted as "Wind". Similarly, 63 and 77 signals from the "Brass" and "String" categories, correspondingly, were predicted accurately. Therefore, the classifier accuracy is computed to equal (72+63+77)*100/300 = 70.66% for this classifier.
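```python
# A minimal sketch of the classifier described above: one 5-component GMM per
# instrument class, K-means initialisation refined by EM, and an equal-prior
# maximum-likelihood decision rule. scikit-learn's GaussianMixture stands in
# for the NetLab toolbox; extraction of the 30-dimensional feature vectors is
# assumed to have been done already.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_classifier(X_by_class, n_components=5):
    """X_by_class: dict mapping class label -> (n_signals, 30) feature array."""
    models = {}
    for label, X in X_by_class.items():
        gmm = GaussianMixture(n_components=n_components, init_params="kmeans",
                              max_iter=200, n_init=5, random_state=0)
        models[label] = gmm.fit(X)  # EM refinement of the K-means initialisation
    return models

def classify(models, X):
    # Equal priors: pick the class whose mixture PDF is largest for each point.
    labels = list(models)
    log_pdfs = np.column_stack([models[c].score_samples(X) for c in labels])
    return [labels[i] for i in log_pdfs.argmax(axis=1)]
```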
5 Conclusions

We presented a new innovative method for quantitative estimation of the musical instrument sources with the use of a wavelet-based source separation algorithm.
The method consists of two steps: in the first, a source separation technique based on the wavelet transform is applied for the separation of the musical sources, and in the second, a classification process is applied in order to estimate the participation of each musical instrument team in the separated sources. The performance of the method was evaluated.
References

1. Kostek, B.: Musical instrument classification and duet analysis employing music information retrieval techniques
2. Kitahara, T., Goto, M., Okuno, H.G.: Pitch-dependent musical instrument identification and its application to musical sound ontology. In: Chung, P.W.H., Hinde, C.J., Ali, M. (eds.) IEA/AIE 2003. LNCS, vol. 2718, pp. 112–122. Springer, Heidelberg (2003)
3. Martin, K.: Sound-source recognition: A theory and computational model. PhD thesis, MIT (1999)
4. Eronen, A., Klapuri, A.: Musical instrument recognition using cepstral coefficients and temporal features 2, II753–II756 (2000)
5. Eronen, A.: Automatic musical instrument recognition (2001)
6. Tzanetakis, G.: Musescape: A tool for changing music collections into libraries. In: Proc. Seventh International Symposium on Signal Processing and Its Applications, 2003, vol. 2, pp. 133–136 (2003)
7. Agostini, G., Longari, M., Pollastri, E.: Musical instrument timbres classification with spectral features
8. Agostini, G., Longari, M., Pollastri, E.: Musical instrument timbres classification with spectral features. EURASIP J. Appl. Signal Process 2003(1), 5–14 (2003)
9. Czyzewski, A., Szczerba, M., Kostek, B.: Musical phrase representation and recognition by means of neural networks and rough sets 3100, 254–278 (2004)
10. Slezak, D., Synak, P., Wieczorkowska, A., Wroblewski, J.: KDD-based approach to musical instrument sound recognition. In: ISMIR 2002: Proceedings of the 13th International Symposium on Foundations of Intelligent Systems, pp. 28–36. Springer, London (2002)
11. Wieczorkowska, A., Wroblewski, J., Synak, P., Slezak, D.: Application of temporal descriptors to musical instrument sound recognition. Journal of Intelligent Information Systems 21(1), 71–93 (2003)
12. Kitahara, T., Goto, M., Komatani, K., Ogata, T., Okuno, H.G.: Musical instrument recognizer "instrogram" and its application to music retrieval based on instrumentation similarity. In: ISM, pp. 265–274. IEEE Computer Society Press, Los Alamitos (2006)
13. Lampropoulos, A.S., Lampropoulou, P.S., Tsihrintzis, G.A.: Musical genre classification of audio data using source separation techniques. In: Proc. 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, Smolenice, The Slovak Republic (2005)
14. Virtanen, T.: Separation of sound sources by convolutive sparse coding. In: Proc. of Workshop on Statistical and Perceptual Audio Processing (SAPA) (2004)
15. Lampropoulos, A.S., Lampropoulou, P.S., Tsihrintzis, G.A.: Musical genre classification enhanced by source separation techniques. In: Proc. 6th International Conference on Music Information Retrieval, London, UK, pp. 576–581 (2005)
16. Casey, M.A., Westner, A.: Separation of mixed audio sources by independent subspace analysis. In: International Computer Music Conference (ICMC) (2000)
17. Smaragdis, P., Brown, J.: Non-negative matrix factorization for polyphonic music transcription. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003 (2003)
18. Zhang, K., Chan, L.W.: An adaptive method for subband decomposition ICA. Neural Comput. 18(1), 191–223 (2006)
19. Kopriva, I., Sersic, D.: Wavelet packets approach to blind separation of statistically dependent sources. Neurocomputing (2007), doi:10.1016/j.neucom.2007.04.002
20. Wickerhauser, M.V.: Adapted Wavelet Analysis from Theory to Software. A. K. Peters, Ltd., Natick (1994)
21. Mallat, S.: A Wavelet Tour of Signal Processing (Wavelet Analysis & Its Applications), 2nd edn. Academic Press, London (1999)
22. Marchini, J.L., Heaton, C., et al.: The fastICA package - fastICA algorithms to perform ICA and projection pursuit
23. University of Iowa Musical Instrument Samples Database, http://theremin.music.uiowa.edu/
24. Martin, K.: Musical instrument identification: A pattern-recognition approach (1998)
25. Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5) (2002)
Music Perception as Reflected in Bispectral EEG Analysis under a Mirror Neurons-Based Approach

Panagiotis Doulgeris, Stelios Hadjidimitriou, Konstantinos Panoulas, Leontios Hadjileontiadis, and Stavros Panas
Aristotle University of Thessaloniki, Faculty of Technology, Department of Electrical & Computer Engineering, GR – 541 24, Thessaloniki, Greece
[email protected]
Abstract. One important goal of many intelligent interactive systems is dynamic personalization and adaptivity to users. 'Motion' and intention, which are involved in the individual perception of musical structure, combined with mirror neuron (MN) system activation, are studied in this article. The mechanism of MNs involved in the perception of musical structures is seen as a means for cueing the learner on 'known' factors that can be used for his/her knowledge scaffolding. To explore such relationships, EEG recordings, and especially the Mu-rhythm in the premotor cortex that relates to the activation of MNs, were acquired and explored. Three experiments were designed to provide auditory and visual stimuli to a group of subjects, including both musicians and non-musicians. The acquired signals, after appropriate averaging in the time domain, were analysed in the frequency and bifrequency domains, using spectral and bispectral analysis, respectively. Experimental results have shown that an intention-related activity seen in musicians could be associated with Mu-rhythm suppression. Moreover, an underlying ongoing function appearing in the transition from heard sound to imagined sound could be revealed in the bispectrum domain, and a Mu-rhythm modulation provoked by the MNs could cause bispectral fluctuations, especially when visual stimulation is combined with an auditory one in the case of musicians. These results pave the way for transferring the research to the area of blind or visually impaired people, where hearing is the main information sensing tool. Keywords: music, motion, intention, mirror neurons, EEG, Mu-rhythm, spectrum, bispectrum.
1 Introduction

Multimedia services based on multimedia systems arise in various areas including art, business, education, entertainment, engineering, medicine, mathematics, scientific research and spatio-temporal applications. Dynamic personalization and adaptivity to users set an important goal for many intelligent interactive systems. To achieve this, some indicative characteristics can be used to identify underlying mechanisms in many sensory systems that could be taken into account in the modelling procedures of intelligent systems. In this work, the focus is placed upon the mechanisms involved in music perception and knowledge scaffolding. The underlying features of musical sounds play a major role in music perception. Music consists of sounds, but not all sounds are music. It has been suggested that music, like
language, involves an intimate coupling between the perception and production of hierarchically organized sequential information, the structure of which has the ability to communicate meaning [1]. Musical structure consists of basic elements that are combined in patterns whose performance and perception are governed by combinatorial rules or a sort of musical grammar [1]. Auditory features of the musical signal are primarily processed in the superior temporal gyrus (auditory cortex). However, the processing of structural features activates regions of the human brain that have been related to the perception of semantic and motor tasks. A mechanism involved in the structural analysis of communicative signals, like music, is the mirror neuron (MN) system. In 2001, a magnetoencephalogram-based study was presented suggesting that musical syntax is processed in Broca's area, a region of the human brain involved in language perception [2]. Moreover, in 2006, a positron emission tomography-based study suggested that there are common regions in the human brain for language and music processing, such as Brodmann area 44 [3], whereas two studies that used functional magnetic resonance imaging analysis revealed shared networks for auditory and motor processing [4, 5]. Furthermore, in 2006, a model of the integration of motor commands with their associated perceived sounds in vocal production was proposed. This model was based on neurobiologically plausible principles, i.e., receptive fields responding to a subset of all possible stimuli, population-coded representations of auditory stimuli and motor commands, and a simple Hebbian-based weight update [6]. The activation of the MN system in response to music stimulation targeted at 'motion' and intention is evaluated in this work. These qualities of music are inherent in musical performance and creation and could be seen as basic elements for music knowledge scaffolding. 'Motion' is conveyed by augmenting or diminishing pitch, while intention underlies every communicative signal with a hierarchical structure. The activation of the MN system by music tasks targeted at specific structural features of musical creation and performance has not yet been monitored using electroencephalogram (EEG) analysis. Here, fluctuations of the Mu-rhythm, related to the activation of MNs, were explored using EEG recordings and bi/spectral analysis.
2 Background

2.1 Mirror Neuron System

The MN system consists of a whole network of neurons and was initially discovered in macaque monkeys, in the ventral premotor cortex (probably the equivalent of the inferior frontal gyrus in humans) and in the anterior inferior parietal lobule [7]. These neurons are active when the monkeys perform certain tasks, but they also fire when the monkeys watch or hear someone else perform the same specific task. In humans, brain activity consistent with MNs has been found in the premotor cortex (inferior frontal cortex) and the inferior parietal cortex [8]. Cognitive functions, like imitation and the understanding of intentions, have been linked to the activation of the MN system [9-11]. As far as music is concerned, a number of studies suggest the implication of Brodmann area 44 in music cognition [2, 3]. Brodmann area 44 is situated just anterior to the premotor cortex (mirror neuron system) [8]. Together with Brodmann area 45 it comprises Broca's area, a region involved in the processing of hierarchical structures that are inherent in communicative signals, like language and action [12]. Moreover, auditory features of the
musical signal, which are processed primarily in the primary auditory cortex (superior temporal gyrus), are combined with structural features of the 'motion' information conveyed by the musical signal in the posterior inferior frontal gyrus and adjacent premotor cortex (mirror neuron system) [1, 8]. Certain neuroimaging evidence suggests that frontoparietal motor-related regions, including, prominently, the posterior inferior frontal gyrus (Broca's region) as well as the posterior middle premotor cortex, were active during passive listening to music pieces by professional musicians [4, 5]. Figure 1 displays the regions of the brain related to music perception. The activation of MNs can be detected using EEG analysis. The Mu-rhythm could reflect visuomotor integrative processes, and would 'translate seeing and hearing into doing'. The Mu-rhythm is alpha-range activity (8–12 Hz) that is seen over the sensorimotor cortex. The fluctuation of the Mu-rhythm during the observation of a motor action (suppression of the Mu-rhythm) is highly similar to the one seen during the direct performance of the action by the individual (greater suppression of the Mu-rhythm) [13]. The suppression is due to the desynchronization of the underlying cell assemblies, reflecting an increased load in the related group of neurons [13]. This motor resonance mechanism, witnessed by a Mu-rhythm modulation, is provoked by the MNs.
Fig. 1. Regions of the human brain related to the mirror neuron system (premotor cortex) and semantic tasks (Broca's area). Position of the electrodes (C3 and its homologue C4) over the sensorimotor cortex and an example of the acquired signals (Mu-rhythm) across ten trials during Experiment 1, along with the resulting averaged signal.
2.2 Music Structure

Hearing provides us with sensory information arising from our environment. As with any other form of information, the human brain tends to focus on the pieces of greater interest and extract messages. In this way, a communication port is created for the human brain, and these 'messages' are inherent in the different qualities of sound. Pitch underlies sound perception, and it is the structure of the human auditory system (cochlea) that allows us to have a clear view of the frequencies present in a sound. However, this is not the only quality of music that we can perceive. Seven basic elements of music can be defined: pitch, rhythm, tempo, timbre, loudness, tone, and the spatial qualities of sounds. While pitch is defined as the dominant frequency of a tone, another element of music, timbre, is also frequency related: the timbre of a sound depends on the different harmonics present in the tone. The same occurs when we move to the time domain, where rhythm and tempo are closely related. Furthermore, the harmonic progression of music, i.e., the sequence of chords (tones heard simultaneously), can give meaning (communicate a message) to the listener. Finally, style, which is defined partly by the instruments (the timbre) participating and partly by the rhythm and patterns followed during the performance of a musical piece, could also affect music perception. Nevertheless, apart from these elements, several combinations of them seem to affect different groups of people, or people in different ways, thus leading to the definition of a variety of styles.
3 Material and Methods

3.1 Material

3.1.1 Subjects
Seven subjects participated in the experiments. They were divided into musicians and non-musicians. Table 1 presents the age, sex and musical skills of each subject. Subjects who had more than 5 years of musical training are described as intermediates, whereas the rest are described as beginners. The last column refers to the experiments the subjects participated in.

Table 1. Anthropometric characteristics of the subjects, their musical skills, and the related experiments

Subject No.  Sex     Age  Musical Skills           Experiment No.
1            Male    24   Musician (Intermediate)  1,2,3
2            Male    23   Musician (Intermediate)  1,2
3            Male    24   Musician (Intermediate)  1,2,3
4            Female  23   Musician (Beginner)      1,2,3
5            Male    24   Musician (Beginner)      2
6            Female  22   Non-Musician             1,2,3
7            Male    23   Non-Musician             1,2,3
3.1.2 EEG Acquisition Device
EEG recordings were conducted using the g.MOBIlab portable biosignal acquisition system (4 bipolar EEG channels, Filters: 0.5–30 Hz, Sensitivity: 100 μV, Data acquisition:
A/D converter with 16-bit resolution and a sampling frequency of 256 Hz, Data transfer: wireless, Bluetooth 'Class I' technology, meets IEC 60601-1, for research applications, no medical use) (g.tec medical & electrical engineering, Guger Technologies). An example of the acquired EEG signal is shown in Fig. 1.

3.1.3 Interface
The experiments were designed and conducted with the Max/MSP 4.5 software (Cycling '74) on a PC (Pentium 3.2 GHz, RAM 1 GB). In order to precisely synchronize the g.MOBIlab device with the Max/MSP software, an external object for Max/MSP was created in C++, using the g.MOBIlab API. Thus, we were able to open and close the device, start the acquisition and store the acquired data in text files (.txt) through Max/MSP. No visualization of the acquired signal was available during the experiments. Separate interfaces were designed for each experiment, providing the essential auditory and visual stimulation. Figure 2(A) shows the designed interface, whereas Fig. 2(B) depicts the system configuration.
Fig. 2. (A) Experiments interface in Max/MSP. (B) System configuration.
3.2 Methods
3.2.1 Experimental Design
Three experiments were designed. In particular: Experiment 1 (Intention): This experiment consisted of seven consecutive instances. During the first two instances no tone was heard (relax state). At the beginning of the third, fourth, fifth and seventh instances, a tone (A4-440 Hz) was provided to the subjects; the tone was missing in the sixth instance. The importance of the last tone lies in the fact that the subjects should conceive the true ending of the sequence of tones. The time interval between the instances was set at 2 sec and the tone duration was 400 msec. The subjects had their eyes closed. We expected the subjects to conceive the tempo and imagine the missing tone at the beginning of the sixth instance. Experiment 2 (Intention-Harmonic Analysis): This experiment consisted of four acoustic blocks. Each block analyzed the same 4-voice major chord into its four notes, starting
with the sound of the chord itself. During the chord analysis, each block (apart from the first) omitted one note at random, different for each block but the same for each trial. The time interval between the instances was set at 1.5 sec and the tone duration was 500 msec. The major chord chosen was G+, starting with the G4 (392 Hz) note. The subjects had their eyes closed. The subjects, musicians especially, were expected to conceive the sequence of notes and the underlying intention in order to imagine the omitted ones. Experiment 3 ('Motion'): This experiment consisted of six consecutive instances. During the first two instances no tone was heard (relax state). At the beginning of each of the following four instances, a tone of ascending pitch ('motion'; E4-329.6 Hz, F4-349.2 Hz, G4-392 Hz & A4-440 Hz) was heard (auditory stimulus). The subjects could also see the corresponding notes on a music notation scheme on a computer screen (visual stimulus). The time interval between the instances was set at 2 sec and the tone duration was 400 msec. A series of trials was conducted by providing the subjects with auditory and visual stimuli, and another series was conducted by providing the subjects with the auditory stimulus only. The subjects had their eyes open during all trials. Subjects were expected to conceive the 'motion' in both series.
3.2.2 EEG Recordings
The EEG recordings were conducted according to the 10/20 international system of electrode placement. The subjects were wearing the EEG cap provided with the g.MOBIlab device. One bipolar channel was used and the electrodes were placed at the C4 position (if the subject was left-handed) or at the C3 position (if the subject was right-handed) [4], where sensorimotor activity would most likely be present [14] (see Fig. 1). The subjects sat still during all trials. The acoustic stimulus was provided to the subjects by headphones (Firstline Hi-Fi Stereo Headphones FH91), and the visual stimulation in Experiment 3 could be viewed on the computer screen. A series of trials (see below) per subject and per experiment was conducted (see Fig. 1).
3.2.3 Data Analysis
Off-line data analyses were carried out with MATLAB 7 (MathWorks). A bandpass filter (Butterworth IIR, 6th order, lower cut-off frequency 8 Hz and upper cut-off frequency 12 Hz) was designed in order to isolate the alpha range and, at the same time, the Mu-rhythm. The acquired EEG signals per subject and per experiment were synchronized and averaged across all trials in the time domain in order to discard random phenomena, or other artefacts, and produce the underlying evoked potential for each case. The latter was then analyzed using spectral and higher-order spectral analyses, as described below.
3.2.3.1 Bi/Spectral analysis. The filtered signals were scaled and then segmented, using a time window, into several parts according to the experiment's design. The process of segmentation of the signals and the discrimination of the different states, for each experiment, is described below. Experiment 1: Ten trials were conducted for each subject. The filtered signal from each trial was segmented using a 2 sec time window into seven parts corresponding to the seven instances of the experiment. Three states were distinguished: the relax state (first & second instances), the state of auditory stimulation (third, fourth, fifth & seventh instances) and the state during which the subject imagined the missing tone (sixth instance).
Experiment 2: Five trials were conducted for each subject. The filtered signal from each trial was segmented using a 1500 msec time window into 20 parts corresponding to the 20 instances. Three states were distinguished: the state in which the subjects listened to the chord, the state in which the subjects listened to the notes, and the state during which the subject imagined the missing notes. Experiment 3: Five trials with acoustic and visual stimulation and five trials with acoustic and no visual stimulation were conducted for each subject. The filtered signal from each trial was segmented using a 2 sec time window into six parts corresponding to the six instances. Two states were distinguished in both cases (acoustic and visual stimulus; acoustic stimulus and no visual stimulation): the relax state (first & second instances) and the state of auditory stimulation (third, fourth, fifth & sixth instances). The power spectral density (based on the Fast Fourier Transform) and the power of the signal were estimated for each part. The mean value and the standard deviation of the Mu-rhythm power corresponding to the different states of each experiment were estimated across all trials, for each subject. The averaged and filtered EEG data were segmented using a time window, as described in the spectral analysis section. Third-order statistical analysis was conducted and the bispectrum corresponding to each part was estimated. Definitions of third-order statistics and the conventional bispectrum are given in detail in [16].
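To make the processing chain concrete, the following Python fragment sketches the filtering, trial averaging, Mu-rhythm power, and conventional direct (FFT-based) bispectrum estimation steps described above. It is our illustration rather than the authors' MATLAB code; the function names, the FFT length, and the assumption that trials are stored as equal-length arrays are ours.

import numpy as np
from scipy.signal import butter, filtfilt

FS = 256  # g.MOBIlab sampling frequency (Hz)

def mu_filter(x, fs=FS):
    # 6th-order Butterworth IIR band-pass, 8-12 Hz (alpha range / Mu-rhythm).
    # butter() doubles the requested order for band-pass designs, hence N=3.
    b, a = butter(3, [8.0 / (fs / 2), 12.0 / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

def evoked_potential(trials):
    # Time-domain average of synchronized, filtered trials; random activity
    # and artefacts tend to cancel, leaving the underlying evoked potential.
    return np.mean([mu_filter(t) for t in trials], axis=0)

def mu_power(segment):
    # Spectral power of one instance via the FFT-based periodogram
    return np.sum(np.abs(np.fft.rfft(segment)) ** 2) / len(segment) ** 2

def bispectrum(segment, nfft=256):
    # Conventional (direct) third-order estimate over the primary region:
    # B(f1, f2) = X(f1) X(f2) X*(f1 + f2), for f2 <= f1 and f1 + f2 < nfft/2
    X = np.fft.fft(segment - np.mean(segment), nfft)
    half = nfft // 2
    B = np.zeros((half, half), dtype=complex)
    for f1 in range(half):
        for f2 in range(f1 + 1):
            if f1 + f2 < half:
                B[f2, f1] = X[f1] * X[f2] * np.conj(X[f1 + f2])
    return B

# Example (Experiment 1): split the evoked potential into its seven
# 2-second instances and estimate power and bispectrum per instance:
# ep = evoked_potential(trials)
# instances = ep[: 7 * 2 * FS].reshape(7, 2 * FS)
# powers = [mu_power(s) for s in instances]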
4 Results and Discussion
Figure 3 displays, in the form of box plots, the average spectral power of the Mu-rhythm corresponding to the relax state, the state of auditory stimulation and the state during which the subject imagined the missing tone, for all subjects participating in Experiment 1. These results indicate a Mu-rhythm suppression during both the second and the third state. Between these two states no significant modulation of the Mu-rhythm was observed among the subjects. Figure 4 shows an example from the bispectrum analysis of the data from the subjects participating in Experiment 1, for instances (steps) 5 and 6. Figure 4 shows (i) a transition of high bispectral coefficients from frequency pairs located on the two axes towards frequency pairs located on the diagonal during the three heard tones, and (ii) a clear 'echoing' effect that appears in the transition from the third tone towards the missing one. As shown in Fig. 4, bispectral coefficients of the missing tone appear at the same frequency pairs as those of the last tone. Moreover, experimental results have shown that the average spectral power of the Mu-rhythm corresponding to the chord state, the notes state and the state during which the subject imagined the missing tone, for all subjects participating in Experiment 2, appears to be stable. Furthermore, analysis of the average spectral power of the Mu-rhythm corresponding to the relax state and the notes state, for all subjects participating in Experiment 3, has indicated a Mu-rhythm suppression during the second state. It is noteworthy that results between musicians and non-musicians showed no statistical difference whatsoever. In the case of the relax state versus the imagined tone during Experiment 1, Mu-rhythm suppression was observed for all musicians. The non-musicians' responses were contradictory, as subject six showed no suppression whereas subject seven did. Such suppression can be linked directly to mirror neuron system activation, suggesting an
Fig. 3. Experiment 1: Box plots of the estimated average spectral power at each state: Relax state (RS), Auditory stimulation (AS), state in which the subjects imagined the missing tone (Imagined tone)
Fig. 4. Analysis of the EEG signals from Experiment 1; only the primary area of bispectrum is shown: Step 5 (Fifth instance) - Third heard tone; Step 6 (Sixth instance) – Missing tone; Transition of the bispectrum corresponding to the two aforementioned instances and ‘echoing’ effect, for all subjects
underlying procedure of musical intention. As far as bispectral analysis is concerned, the transition of the high bispectral coefficients from frequency pairs on the two axes towards frequency pairs located on the diagonal during the three heard tones shows a shift towards non-linearity that appears at higher frequencies (i.e., from 10 to 20 Hz). This non-linearity reveals a self-quadratic phase coupling between these harmonic pairs, implying the generation of harmonics at the sum and difference of the frequencies of the pair. However, this pattern of bispectral coefficients was very similar between the third heard tone and the missing one for all subjects, thus providing us with an 'echoing' effect during this transition. This suggests a continuous function, i.e., the perception of intention. Data analysis from subjects one, two and four, who belong to the musicians' group, from Experiment 2 showed no fluctuation of the Mu-rhythm average spectral power at any
state. On the contrary, in the case of subjects three and five (also belonging to the same group), the Mu-rhythm average spectral power was higher for the third state (imagined notes). Non-musicians also displayed contradictory results: in the case of subject six the Mu-rhythm power was attenuated for the third state, whereas in the case of subject seven the equivalent spectral power was higher. Consequently, no safe conclusion can be drawn concerning the response of MNs from Experiment 2. This was also evident in the bispectrum domain. Furthermore, during Experiment 3, musicians displayed Mu-rhythm attenuation in the case of the relax state versus auditory and visual stimulation. During the no-visual stimulation trials, subjects one and two displayed responses similar to the previous trials, whereas subject four did not. The non-musicians' responses were once again contradictory in both cases. The aforementioned results suggest that the visual stimulus (visualized as an ascending note on a notation scheme) boosted the perception of motion conveyed by the ascending pitch of the note. In the bispectrum domain, an increase in the bispectrum values is noticed for the case of visual stimulation compared to that of no-visual stimulation. Moreover, no-visual stimulation resulted in diffused bispectral values and in a gradual attenuation across the six states (steps) of Experiment 3, implying a gradual transition from non-Gaussianity and non-linearity towards the Gaussian and linear assumption. On the contrary, when the visual stimulation was employed, less diffused bispectral values were noticed and the attenuation was limited between the second and fourth state (step), exhibiting a gradual increase in the bispectral values when moving from state (step) five to six, hence implying a shift towards non-linear and non-Gaussian behaviour. These results, also observed in the cases of all musicians, justify the Mu-rhythm modulation provoked by the MNs. Nevertheless, for the non-musicians the degree of fluctuation in the intensity of the bispectrum was smaller, both without and with visual stimulation, indicating a smaller sensitivity in the Mu-rhythm modulation provoked by the MNs. This might be explained by the lack of musical experience in distinguishing pitch differences and, hence, in perceiving an ascending motion across the six states of Experiment 3. From the above experimental analysis it is clear that music perception can be viewed as a pattern recognition process by the brain, in analogy to the more familiar pattern recognition processes in the visual system. Our brain carries (or builds through experience) 'templates' against which the signals of incoming sounds are compared; if there is a match with the template that corresponds to a harmonic tone, a musical tone sensation with definite pitch is evoked. In addition, 'motion' and intention in music facilitate this pattern recognition and, in analogy with the optical system, if part of the acoustical stimulus is missing, or if the stimulus is somewhat distorted, the brain is still capable of providing the correct sensation. MNs, as seen through the Mu-rhythm modulation, seem to support such pitch prediction (correction) and probably could even foster a kind of 'acoustical illusion'. The role of MNs in this direction seems to be central, and understanding their relation to the qualities of music could really expand the way we see knowledge scaffolding in music perception.
5 Conclusions The response of MN cells to intention and ‘motion’ involved in musical structures was studied in this paper. EEG recordings from three experiments were conducted on seven subjects, musicians and non–musicians, and spectral and bispectral analyses were
implemented on the data acquired. Experimental results showed that Mu-rhythm suppression reflects intention-related activity in musicians, whereas the bispectral 'echoing' effect supports the idea of an underlying ongoing function appearing in the transition from heard sound to imagined sound. Moreover, a Mu-rhythm modulation provoked by the MNs was linked to bispectral fluctuations, especially when visual stimulation was combined with an auditory one. Further experiments towards the exploration of the role of MNs in music perception for other sonic qualities, such as timbre, spatial motion, spectromorphology and harmonic style, are already under way.
References
1. Molnar-Szakacs, I., Overy, K.: Music and mirror neurons: from motion to 'e'motion. SCAN 1, 234-241 (2006)
2. Maess, B., Koelsch, S., Gunter, T., Friederici, A.: Musical syntax is processed in Broca's area: a MEG study. Nature Neuroscience 4, 540-545 (2001)
3. Brown, S., Martinez, M., Parsons, L.: Music and language side by side in the brain: a PET study of the generation of melodies and sentences. European Journal of Neuroscience 23(10), 2791-2803 (2006)
4. Lahav, A., Saltzman, E., Schlaug, G.: Action representation of sound: audiomotor recognition network while listening to newly acquired sounds. The Journal of Neuroscience 27(2), 308-314 (2007)
5. Bangert, M., Peschel, T., Schlaug, G., Rotte, M., Drescher, D., Hinrichs, H., Heinze, H.J., Altenmuller, E.: Shared networks for auditory and motor processing in professional pianists: evidence from fMRI conjunction. Neuroimage 30, 917-926 (2006)
6. Westerman, G., Miranda, E.R.: Modeling the development of mirror neurons for auditory motor integration. The Journal of New Music Research 31(4), 367-375 (2002)
7. Rizzolatti, G., Craighero, L.: The mirror neuron system. Annual Review of Neuroscience 27, 169-192 (2004)
8. Logothetis, I., Milonas, I.: Logotheti Neurology. University Studio Press, Thessaloniki (2004)
9. Calvo-Merino, B., Grezes, J., Glaser, D., Passingham, R., Haggard, P.: Seeing or doing? Influence of visual and motor familiarity in action observation. Current Biology 16, 1-6 (2006)
10. Rizzolatti, G., Fogassi, L., Gallese, V.: Mirrors in the mind. Scientific American, pp. 30-37 (2006)
11. Grezes, J., Costes, N., Decety, J.: The effects of learning and intention on the neural network involved in the perception of meaningless actions. Brain 122, 1875-1887 (1999)
12. Grossman, M.: A central processor for hierarchically-structured material: evidence from Broca's aphasia. Neuropsychologia 18, 299-308 (1980)
13. Pineda, J., Oberman, L., Hubbard, E., McCleery, J., Altschuler, E., Ramachandran, V.: EEG evidence for mirror neuron dysfunction in autism spectrum disorders. Cognitive Brain Research 24(2), 109-198 (2005)
14. Hadjileontiadis, J., Panas, S.: Higher-order statistics: a robust vehicle for diagnostic assessment and characterization of lung sounds. Technology and Healthcare 5, 359-374 (1997)
Automatic Recognition of Urban Soundscenes

Stavros Ntalampiras 1, Ilyas Potamitis 2, and Nikos Fakotakis 1

1 Wire Communications Laboratory, University of Patras, {dallas,fakotaki}@wcl.ee.upatras.gr
2 Department of Music Technology and Acoustics, Technological Educational Institute of Crete, [email protected]
Abstract. In this paper we propose a novel architecture for environmental sound classification. In the first section we introduce the reader to the current work in this research field. Subsequently, we explore the usage of Mel frequency cepstral coefficients (MFCCs) and MPEG-7 audio features in combination with a classification method based on Gaussian mixture models (GMMs). We provide details concerning the feature extraction process as well as the recognition stage of the proposed methodology. The performance of this implementation is evaluated through experimental tests on six different categories of environmental sounds (aircraft, motorcycle, car, crowd, thunder, train). The proposed method is fast, because it does not require high computational resources, and therefore covers the needs of a real-time application. Keywords: Computer Audition, Automatic audio recognition, MPEG-7 audio, MFCC, Gaussian mixture model (GMM).
1 Introduction
Due to the exponential increase in the amount of data used in computer science over the last decades, the need for a robust and user-friendly way to access data has emerged. Nowadays a great amount of information is produced and, therefore, searching algorithms for large metadata databases become a necessity. Another important fact that we have to take into consideration is the spread of the internet. The global network is becoming faster and larger, eliminating the limitations that existed some years ago concerning data transfers. The result of this situation is the increased need to search and retrieve desired data as fast as possible. Our best allies in these kinds of situations are classification and similarity. Machine learning techniques are widely used in these types of problems, making life easier by proposing automatic methods for the annotation of data collections, which is otherwise a time-consuming process. In this way huge collections of databases become searchable without human interference. This work is considered to be a step forward in this direction and improves the automatic recognition of environmental sounds. The scope of our system is to understand the surrounding environment by exploiting only the acoustic information, just as humans do unconsciously and constantly. Consider, as a paradigm, the situation where one is sitting on a bench near a harbour. Using only the perceived acoustic information, one is able to understand that a boat is leaving, a car is passing by and a dog is barking. This is the exact human property we are trying to capture. A system possessing this human property can be of great
importance for monitoring and understanding the environment, or even for helping humans decide whether or not they should perform an action in a specific environment. The main difficulty is that, unlike in speech processing, it is not possible to create a perfect database of environmental sounds. Nowadays a great deal of research is being conducted in the field of environmental sound classification, resulting in many different implementations. A method based on three MPEG-7 audio low-level descriptors (spectrum centroid, spectrum spread and spectrum flatness) is presented in [1]. As for the classification scheme, a fusion of support vector machines and the k-nearest-neighbour rule is adopted in order to assign a sound to one of several predefined classes of common home environmental sounds. Wold et al. [2] built a system for sound retrieval based on statistical measures of pitch, harmonicity, loudness, brightness and bandwidth. These features are input to a k-nearest-neighbour classifier, which decides the sound class. A different approach is adopted in [3], taking advantage of a multilayered perceptron neural network. The feature taken into account is the one-dimensional combination of the instantaneous spectrum at the power peak and the power pattern in the time domain. Wang et al. [4] show the usage of an MPEG-7-based feature set which is processed by an HMM scheme in order to assign the sound to a defined class. In this work we are trying to find the feature set which includes information that can effectively distinguish among a large variety of different environmental sound categories. As a first step, a comparison is made between the MPEG-7 feature set and the MFCCs, which are typically used in speech recognition. The organization of the paper is as follows. In Section 2 we describe the overall architecture of our methodology and the feature extraction processes. Section 3 contains the details concerning the recognition engine, and the last section presents recognition results as well as possible extensions of this work.
2 System Architecture
The overall architecture of the system is rather simple and is described in Fig. 1. First the sound recording passes through a preprocessing step that prepares it for the feature extraction. After that, Gaussian mixture models are used to compute the probability for each sound class. In the operational phase, the class with the highest probability is assigned to the unknown sound to be recognized.
Fig. 1. System Architecture
2.1 The Feature Extraction Process
As mentioned before, we are going to use two different feature sets in order to evaluate their performance on the task of environmental sound classification. The same parameters are used for both feature sets. The signal is cut into frames of 30 ms with 10 ms time shifts, following the MPEG-7 standard recommendations. It should be noted that the Hamming window is used.
MFCCs
The first feature set consists of the total energy of the frame and the first 12 Mel frequency cepstral coefficients. The MFCC feature extraction is composed of several steps. First the time-domain signal is segmented into overlapping frames. We derive the power of the STFT of these frames, which is then passed through a triangular Mel-scale filterbank that emphasizes the portions of the signal proven to play an important role in human perception. Subsequently the log operator is applied and, for the decorrelation of the features, the discrete cosine transform (DCT) is used. The result of the MFCC feature extraction process is a feature vector consisting of 13 features in total. In Fig. 2 the whole process of the MFCC extraction can be seen.
Fig. 2. MFCC feature extraction process
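As an illustration of the pipeline in Fig. 2, a minimal NumPy sketch follows. It is not the authors' implementation; the filterbank size (23 filters) and FFT length (512) are our assumptions, chosen to suit 16 kHz material.

import numpy as np

def mel_filterbank(n_filt=23, nfft=512, fs=16000):
    # Triangular filters equally spaced on the Mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0), mel(fs / 2), n_filt + 2))
    bins = np.floor((nfft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filt, nfft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc_features(x, fs=16000, frame=0.030, shift=0.010, n_mfcc=12):
    # 30 ms Hamming-windowed frames with 10 ms shift, as in the text
    N, H = int(frame * fs), int(shift * fs)
    win = np.hamming(N)
    fb = mel_filterbank(nfft=512, fs=fs)
    feats = []
    for start in range(0, len(x) - N + 1, H):
        seg = x[start:start + N] * win
        power = np.abs(np.fft.rfft(seg, 512)) ** 2        # power spectrum
        logmel = np.log(fb @ power + 1e-10)               # log Mel energies
        # DCT-II decorrelates the log filterbank outputs; keep 12 coefficients
        n = np.arange(len(logmel))
        dct = np.array([np.sum(logmel * np.cos(np.pi * k * (2 * n + 1)
                        / (2 * len(logmel)))) for k in range(1, n_mfcc + 1)])
        energy = np.log(np.sum(seg ** 2) + 1e-10)         # total frame energy
        feats.append(np.hstack([energy, dct]))
    return np.array(feats)   # shape: (n_frames, 13)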
MPEG-7 feature set
The second feature set includes descriptors that are part of the MPEG-7 standard. In order to be fair in the comparison and to keep the balance between the amounts of data contained in the feature sets, we use the following MPEG-7 descriptors: Audio Spectrum Centroid, Audio Spectrum Spread and Audio Spectrum Flatness, which have been demonstrated to produce good results [1, 4].
Feature extraction methodology
All of these descriptors belong to the basic spectral descriptors provided by the MPEG-7 standard, and the method for their computation is described in the following section.
I. Audio Spectrum Centroid (ASC)
For the calculation of this feature the log-frequency spectrum is computed first; this descriptor gives its centre of gravity. To avoid the effect of a non-zero DC component and/or very low frequency components, all of the power
coefficients below 62.5 Hz are summed and represented by a single coefficient. This process gives as output a modified power spectrum p_i as well as a new representation of the corresponding frequencies f_i. For a given frame the ASC is defined from the modified power coefficients and their frequencies as:
ASC = \frac{\sum_i \log_2(f_i/1000)\, p_i}{\sum_i p_i} \qquad (1)
The derivation of this descriptor provides information about the dominant frequencies (high or low) as well as perceptual information on timbre (i.e., sharpness).
II. Audio Spectrum Spread (ASS)
This feature corresponds to another simple measure of the signal's spectral shape. The ASS (also called instantaneous bandwidth) is defined as the second central moment of the log-frequency spectrum. For a given frame it can be extracted by taking the root-mean-square (RMS) deviation of the spectrum from its centroid ASC:
ASS = \sqrt{\frac{\sum_i \left(\log_2(f_i/1000) - \mathrm{ASC}\right)^2 p_i}{\sum_i p_i}} \qquad (2)
where p_i and f_i represent the same quantities as before. This descriptor indicates how the spectrum of the signal is distributed around its centroid frequency. If its value is low, the spectrum may be concentrated around the centroid, while a high value shows the opposite. Its purpose is to differentiate tone-like and noise-like sounds.
III. Audio Spectrum Flatness (ASF)
The ASF descriptor is designed to expose how flat a particular portion of the signal is. For a given frame it consists of a series of values, each one expressing the deviation of the signal's power spectrum from a flat shape. In order to obtain the values of the descriptor, the power coefficients are computed from non-overlapping frames (window length = time shift). Subsequently the spectrum is divided into 1/4-octave-resolution, overlapping frequency bands which are logarithmically spaced. The ASF of a band is calculated as the ratio of the geometric mean and the arithmetic mean of the spectral power coefficients within the band:
ASF = \left(\prod_{n=1}^{N} c_n\right)^{1/N} \Bigg/ \left(\frac{1}{N}\sum_{n=1}^{N} c_n\right) \qquad (3)
where N is the number of coefficients within a subband and c_n is the n-th spectral power coefficient of the subband. This feature makes it possible to effectively separate noise-like (or impulsive) sounds from harmonic sounds. Psychoacoustics tells us that a large deviation from a flat shape generally indicates a tonal sound. The computation of the MPEG-7 audio features (alternatively called Low Level Descriptors) results in a 21-dimensional feature vector, since the ASF descriptor has 19 coefficients. In Fig. 3 we depict the Mel log filterbank and the MPEG-7 descriptors for a part of the same file belonging to the category Motorcycle, so as to visualize the differences between the two feature sets.
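Equations (1)-(3) translate almost directly into code. The sketch below is our rendering under the stated definitions; the 31.25 Hz centre frequency assigned to the merged low-frequency coefficient and the precomputed band_edges index pairs for the 1/4-octave ASF bands are assumptions, not details taken from the text.

import numpy as np

def asc_ass(power, freqs):
    # ASC and ASS of one frame, following Eqs. (1) and (2). Power
    # coefficients below 62.5 Hz are first merged into one coefficient.
    low = freqs < 62.5
    p = np.concatenate(([power[low].sum()], power[~low]))
    f = np.concatenate(([31.25], freqs[~low]))  # assumed centre of merged band
    lf = np.log2(f / 1000.0)
    asc = np.sum(lf * p) / np.sum(p)
    ass = np.sqrt(np.sum((lf - asc) ** 2 * p) / np.sum(p))
    return asc, ass

def asf(power, band_edges):
    # ASF per band, Eq. (3): geometric mean over arithmetic mean of the
    # spectral power coefficients inside each 1/4-octave band.
    values = []
    for lo, hi in band_edges:
        c = power[lo:hi]
        values.append(np.exp(np.mean(np.log(c + 1e-12))) / (np.mean(c) + 1e-12))
    return np.array(values)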
Fig. 3. Mel Filterbank and feature values against the sound’s frames
3 Recognition Engine
The recognition process, which is based on Gaussian mixture models, is described in Fig. 4. A linear combination of Gaussians represents the probability density function of a mixture model; an unknown sound s is assigned to the class whose model M_i maximizes P(s|M_i), 1 <= i <= n.
Fig. 4. GMM classification process
Concerning the experiments, a standard version of the Expectation-Maximization algorithm was used, with k-means initialization for the training of the models. The number of Gaussian mixtures for all the experiments was 4, while each density is described by a covariance matrix of diagonal form. At the stage of the experimental set-up, the feature streams of each sound are passed to the trained Gaussian mixture models. The probabilities produced by all models are compared and the class with the highest probability represents the system decision. In the test phase we applied the 10-fold cross-validation method and the results were averaged for each category. The procedure that was followed during the whole process was identical for both feature sets (MFCC and MPEG-7 features) so that the comparison of their performance is reliable.
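The training and decision stages map naturally onto an off-the-shelf GMM implementation. The scikit-learn sketch below mirrors the configuration quoted above (4 diagonal-covariance components, EM with k-means initialization); it is our illustration, not the authors' code.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(features_per_class):
    # One 4-component diagonal-covariance GMM per sound class,
    # trained with EM (k-means initialization), as described in the text.
    models = {}
    for label, frames in features_per_class.items():
        gmm = GaussianMixture(n_components=4, covariance_type="diag",
                              init_params="kmeans", max_iter=200,
                              random_state=0)
        gmm.fit(frames)                  # frames: (n_frames, 13 or 21)
        models[label] = gmm
    return models

def classify(models, frames):
    # Frame-based decision: the class whose model maximizes the total
    # log-likelihood of the feature stream wins.
    scores = {label: gmm.score_samples(frames).sum()
              for label, gmm in models.items()}
    return max(scores, key=scores.get)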
4 Results
The data were obtained from recordings found on the internet, due to the unavailability of such a sound corpus. There are six classes of sounds, consisting of aircraft (110), motorcycle (46), car (81), crowd (60), thunder (60) and train (82) recordings, and there is a lot of variability in each of them. All the sounds were downsampled to 16 kHz, 16 bit, while their average length is 25.2 seconds. In Tables 1 and 2 the confusion matrices of the proposed methodology are provided. At this point we must stress that the evaluation was made with a frame-based decision of the class in order to obtain reliable results. The MFCC feature set achieves better recognition rates, as shown in Table 1, indicating that it is able to capture more discriminative information than the MPEG-7 feature set. Furthermore it can be observed that both feature sets tend to confuse the same classes. We conclude that the MPEG-7 descriptors achieve an overall accuracy of 65%, having their best performance in the thunder category, while the MFCCs result in 78.3% overall accuracy, with their best performance in the train category.
Table 1. Confusion Matrix (MFCC); rows: presented class, columns: responded class (%)

             Aircraft  Motorcycle   Car   Crowd  Thunder  Train
Aircraft       64.1       2.2      11.2    6.4    11       5.1
Motorcycle      2.7      81.6       9.6    4.6     1.4     0.1
Car            11.6       5.2      62      9.6     8.5     3.1
Crowd           3.2       1.7      11     82.4     0.8     0.9
Thunder         6.5       0.6       2.1    0.3    88.4     2.1
Train           4.3       1.6       1.2    1.4     0.1    91.4
Table 2. Confusion Matrix (MPEG-7); rows: presented class, columns: responded class (%)
5 Conclusions and Future Work
In this work we evaluated the performance of two well-known feature sets on the task of environmental sound classification, and it was shown that the MFCCs outperform the MPEG-7 descriptors. Our future work includes the incorporation of more sound classes and of signal separation, as well as the usage of techniques such as silence detection that may improve the recognition performance.
References
1. Wang, J.-C., Wang, J.-F., Kuok, W.-H., Hsu, C.-S.: Environmental Sound Classification Using Hybrid SVM/KNN Classifier and MPEG-7 Audio Low-Level Descriptor. In: International Joint Conference on Neural Networks (2006)
2. Wold, E., Blum, T., Keislar, D., Wheaton, J.: Content based classification search and retrieval of audio. IEEE Multimedia Magazine 3, 27-36 (1996)
3. Toyoda, Y., Huang, J., Ding, S., Liu, Y.: Environmental sound recognition by multilayered neural networks. In: Proceedings of the Fourth International Conference on Computer and Information Technology, pp. 123-127 (2004)
4. Wang, J.-F., Wang, J.-C., Huang, T.-H., Hsu, C.-S.: Home environmental sound recognition based on MPEG-7 features. In: Circuits and Systems, MWSCAS 2003, vol. 2, pp. 682-685 (2003)
5. Casey, M.A.: MPEG-7 sound recognition tools. IEEE Transactions on Circuits and Systems for Video Technology 11(6), 737-747 (2001)
6. Kim, H.-G., Moreau, N., Sikora, T.: MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval. Wiley, Chichester (2005)
7. Nabney, I.: Netlab: Algorithms for Pattern Recognition. Springer, London (2002)
Low Bitrate Coding of Spot Audio Signals for Interactive and Immersive Audio Applications

Athanasios Mouchtaris, Christos Tzagkarakis, and Panagiotis Tsakalides

Department of Computer Science, University of Crete, and Institute of Computer Science (FORTH-ICS), Foundation for Research and Technology - Hellas, Heraklion, Crete, Greece, {mouchtar,tzagarak,tsakalid}@ics.forth.gr

Abstract. In the last few years, a revolution has occurred in the area of consumer audio. Similarly to the transition from analog to digital sound that took place during the 80s, we have been experiencing the transition from 2-channel stereophonic sound to multichannel sound (e.g., 5.1 systems). Future audiovisual systems will not make distinctions regarding whether the user will be watching a movie or listening to a music recording; they are envisioned to offer a realistic experience to the user, who will be immersed in the content, implying that the user will be able to interact with the content according to his will. In this paper, an encoding procedure is proposed, focusing on spot microphone signals, which are necessary for providing interactivity between the user and the environment. A model is proposed which achieves high-quality audio reproduction with side information for each spot microphone signal in the order of 19 kbps.
1 Introduction
Similarly to the transition from analog to digital sound that took place during the 80s, in recent years we have been experiencing the transition from 2-channel stereophonic sound to multichannel sound. This transition has shown the potential of multichannel audio to surround the listener with sound and offer a more realistic acoustic scene compared to 2-channel stereo. Current multichannel audio systems place 5 or 7 loudspeakers around the listener in pre-defined positions, and a loudspeaker for low-frequency sounds (5.1 [1] and 7.1 multichannel systems), and are utilised not only for film but also for audio-only content. Multichannel audio offers the advantage of improved realism compared to 2-channel stereo sound at the expense of increased information concerning the storage and transmission of this medium. This is important in many network-based applications, such as Digital Radio and Internet audio. At a point where MPEG Surround (explained in the following paragraph) achieves coding rates for 5.1 multichannel audio that are similar to MP3 coding rates for 2-channel stereo, it seems that the research in audio coding might have no future.
This work has been funded by the Marie Curie TOK “ASPIRE” grant within the 6th European Community Framework Program.
However, this is far from the truth. Current multichannel audio formats will eventually be substituted by more advanced formats. Future audiovisual systems will not distinguish whether the user is watching a movie or listening to a music recording; audiovisual systems of the future are envisioned to offer a realistic experience to the user, who will be immersed in the content. As opposed to listening and watching, the passive voice of 'immersed' implies that the user's environment will be seamlessly transformed into the environment of his/her desire, the user being able to interact with the content according to his/her will. Using a large number of loudspeakers is useless if there is no increase in the content information. Immersive audio is largely based on enhanced audio content, which translates into using a large number of microphones (known as spot recordings) for obtaining a recording containing as many sound sources as possible. These sources offer increased sound directions around the listener, but are also useful for providing interactivity between the user and the audio environment. The increase in audio content, combined with the strict requirements regarding the processing, network delays, and losses in the coding and transmission of immersive audio content, are issues that can be addressed based on the proposed methodology. The proposed approach in this paper is an extension of multichannel audio coding. For 2-channel stereo sound, the importance of decreasing the bitrate in a music recording has been made apparent within the Internet audio domain with the proliferation of MP3 audio coding (MPEG-1 Layer III [2, 3]). MP3 audio coding allows for coding of stereo audio with rates as low as 128 Kbit/sec for high-quality audio (CD-like or transparent quality). Multichannel sound, as the successor of 2-channel stereo, has been in the focus of all audio coding methods since the early 1990s. MPEG-2 Advanced Audio Coding (AAC) [4, 5] and Dolby AC-3 [6] were proposed among others and truly revolutionised the delivery of multichannel sound, allowing for bitrates as low as 320 Kbit/sec for 5.1 audio (transparent quality). These methods were soon adopted by all audio-related applications, such as newer versions of Internet music files (Apple's iTunes) and Digital Television (DTV). In the audio coding methods mentioned in the previous paragraph, the concept of perceptual audio coding has been of central importance. Perceptual audio coding refers to colouring the coding noise in the frequency domain, so that it will be inaudible to the human auditory system. However, early on it was apparent that coding methods that exploit interchannel redundancy (for 2-channel or multichannel audio) were necessary for achieving the best coding results. In MPEG-1 and MPEG-2 audio coding, Mid/Side [7] and Intensity Stereo Coding [8] were employed. The former operated on the audio channels in an approximate Karhunen-Loève-type approach for decorrelation of the channel samples, while the latter was applied to higher frequency bands by exploiting the fact that the auditory image in these bands can be retained by only using the energy envelope of each channel at each short-time audio segment. In early 2007, a new standard for very low bitrate coding of multichannel audio became an International Standard under the name MPEG Surround [9]. MPEG Surround allows for coding of multichannel audio content with rates as low
as 64 Kbit/sec for transparent quality. It is based on Binaural Cue Coding (BCC) [10] and Parametric Stereo (PS) [11]. Both methods operate on the same philosophy, which is to capture (at the encoder) and re-synthesise (at the decoder) the cues needed for sound localisation by the human auditory system. In this manner, it is possible to recreate the original spatial image of the multichannel recording by encoding only one monophonic audio downmix signal (the sum of the various audio channels of a particular recording), as well as the binaural cues, which constitute only a small amount of additional (side) information. MPEG Surround and the (related) AAC+ are expected to replace the current MP3 and AAC formats for Internet audio, and to dominate in broadcasting applications. Immersive audio, as opposed to multichannel audio, is based on providing the listener with the option to interact with the sound environment. This translates, as explained later in this paper, into different objectives regarding the content to be encoded and transmitted, which cannot be fulfilled by current multichannel audio coding approaches. Our goal is to introduce mathematical models specifically directed towards immersive audio, for compressing the content and allowing model-based reconstruction of lost or delayed information. Our aspirations are towards finally implementing long-proposed ideas in the audio community, such as (network-based) telepresence of a user in a concert hall performance in real-time, implying interaction with the environment, e.g., being able to move around in the hall and appreciate the hall acoustics; virtual music performances, where the musicians are located all around the world; collaborative environments for the production of music; and so forth. In this paper, the sinusoids plus noise model (henceforth denoted as SNM for brevity), which has been used extensively for monophonic audio signals, is introduced in the context of low-bitrate coding for immersive audio. As in the SAC method for low-bitrate multichannel audio coding, our approach is to encode only one audio channel (which can be one of the spot signals or a downmix), while for the remaining spot signals we retain only the parameters that allow for resynthesis of the content at the decoder. These parameters are the sinusoidal parameters (harmonic part) of each spot signal, as well as the short-time spectral envelope (estimated using Linear Predictive (LP) analysis) of the sinusoidal noise component of each spot signal. These parameters are not as demanding in coding rates as the true noise part of the SNM model. For this reason, the noise part of only the reference signal is retained; during the resynthesis of each spot signal, its harmonic part is added to the noise part, which is recreated by using the corresponding noise envelope with the noise residual obtained from the reference channel. This procedure has been described in our recent work as noise transplantation [12], and is based on the observation that the noise components of the spot signals of the same multichannel recording are very similar when the harmonic part has been captured with an appropriate number of sinusoids. In this paper, we focus on describing the coding stage of the model parameters, and on defining the lower limits in terms of bitrate that our proposed system can achieve. The coding of the sinusoidal parameters is based
on the high-rate quantization scheme of [13], while the encoding process of the noise envelope is based on the vector quantization method described in [14].
2 Modeling Methodology
Initially, we briefly explain how interactivity can be achieved using the multiple microphone recordings (spot microphone signals) of a particular multichannel recording. The number of these multiple microphone signals is usually higher than the number of available loudspeakers; thus a mixing process is needed when producing a multichannel audio recording. We place emphasis on the mixing of the multimicrophone audio recordings on the decoder side. Remote mixing is imperative for immersive audio applications, since it offers the amount of freedom for the creation of the content that is needed for interactivity. Thus, in immersive audio applications, current multichannel audio coding methods cannot be directly employed. This is due to the fact that, for audio mixing (remote or not), not only the spatial image but the content of each microphone recording must be encoded, so that the audio engineer will have full control of the content. We note that remote mixing, when the user is not an experienced audio engineer, can be accomplished in practice by storing at the decoder a number of predefined mixing "files" that have been created by experts for each specific recording. The limitations of transmitting the microphone recordings through a low-bandwidth medium (e.g., the Internet or wireless channels) are due to: (i) the increase in the audio channels, which translates into the need for high transmission rates which are not available, and (ii) network delays and losses, which are unacceptable in high-quality real-time audio applications. In order to address these problems, we propose using the source/filter and sinusoidal models. The source/filter model [15] segments the signal into short (around 30 ms) segments, and the spectral envelope of each segment is modelled (e.g., by linear prediction) using a small number of coefficients (filter part). The remaining modelling error has the same number of samples as the initial segment (source part), and contains important spectral information. For speech signals, the source part theoretically contains the integer multiples of the pitch, so it can be modelled using a small number of coefficients. Many speech compression methods are based on this concept. However, for audio signals, methods for reducing the dimensionality of the source signal while retaining high quality have not yet been derived. We have recently found that multiresolution estimation of the filter parameters can greatly improve the modelling performance of the filter model. We then showed that the source/filter model can separate the spot microphone signals of a multimicrophone recording into a part that is specific to each microphone (filter) and a part which can be considered common to all signals (source) [16]. Thus, for each spot recording we can encode only its filter part (using around 10 Kbit/sec), while one reference audio signal (which can be a downmix) must be fully encoded, e.g., using MP3. Our aforementioned method introduces an amount of correlation between the recordings (crosstalk), and is not suitable for some audio signals (e.g., transients).
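As a concrete illustration of the source/filter decomposition just described, the sketch below performs autocorrelation-method LP analysis of a single frame and returns the filter part together with the source (residual) part. The function is our simplified rendering, with the order p = 10 matching the LP order reported later in the experiments.

import numpy as np

def lp_analysis(frame, p=10):
    # Fit an order-p all-pole spectral envelope (the "filter" part) and
    # return it together with the prediction residual (the "source" part).
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + p]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R + 1e-9 * np.eye(p), r[1:p + 1])  # normal equations
    pred = np.zeros_like(frame)
    for i in range(1, p + 1):                # e(n) = s(n) - sum_i a_i s(n-i)
        pred[i:] += a[i - 1] * frame[:-i]
    return a, frame - pred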
These problems can be overcome by additional use of the sinusoidal model [17, 18, 19]. It has been applied to speech and audio signals and is based on retaining (for each segment) only the prominent spectral peaks. The sinusoidal parameters alone cannot model audio signals with enough accuracy. Representing the modelling error is an important problem for enhancing the low audio quality of the sinusoids-only model. It has been proposed that the error signal can be modelled by only retaining its spectral envelope (e.g., [19, 20, 21]). The sinusoids plus noise model (SNM) represents a signal s(n), with harmonic nature, as the sum of a predefined number of sinusoids (harmonic part) and a noise term (stochastic part) e(n) (for each short-time analysis frame):

s(n) = \sum_{l=1}^{L} \alpha_l \cos(\omega_l n + \phi_l) + e(n) , \quad n = 0, \ldots, N-1 ,   (1)
where L denotes the number of sinusoids, \{\alpha_l, \omega_l, \phi_l\}_{l=1}^{L} are the constant amplitudes, frequencies and phases, respectively, and N is the length (in samples) of the analysis short-time frame of the signal. The noise component is also needed for representing the noise-like part of audio signals, which is audible and is necessary for high-quality resynthesis. The noise component can be computed by subtracting the harmonic component from the original signal. Modeling the noise component is a challenging task. We follow the popular approach of modeling e(n) as the result of filtering a residual noise component with an autoregressive (AR) filter that models the noise spectral envelope, i.e.,

e(n) = \sum_{i=1}^{p} b(i) \, e(n-i) + r_e(n) ,   (2)
where r_e(n) is the residual of the noise and p is the AR filter order, while the vector b = (1, -b(1), -b(2), \ldots, -b(p))^T represents the spectral envelope of the noise component e(n) and can be obtained by LP analysis. In the remainder of the paper, we refer to e(n) as the (sinusoidal) noise signal, and to r_e(n) as the residual (noise) of e(n). Fully parametric models under the SNM degrade audio quality, since the residual of the original audio signals is discarded and replaced by (filtered) white noise or generated parametrically. Thus, so far the sinusoidal model (like the source/filter model) is considered useful (for audio) only in low-bitrate, low-quality applications (e.g., scalable audio coding in MPEG-4). The idea in our research is to apply our findings on the source/filter model not to the actual audio signal but to the sinusoidal error signal. Our preliminary efforts have shown that this "noise transplantation" procedure is indeed valid and can overcome the problems of crosstalk and transient sounds, since even only a few sinusoidal coefficients can capture the significant components of an audio signal. In fact, by using our approach, the number of sinusoidal coefficients can be greatly decreased compared to current sinusoidal models, due to the improved accuracy in the noise modelling of the proposed multiresolution source/filter model.
In more detail, consider a collection of M microphone signals that correspond to the same multichannel recording and thus have similar acoustical content. We model and encode as a full audio channel only one of the signals (alternatively it can be a downmix, e.g. a sum signal), which is the reference signal. The remaining (side) signals are modeled by the SNM, retaining their sinusoidal components and the noise spectral envelope (filter b in (2)). In order to reconstruct the side signals, we obtain the LP residual of the reference channel's noise signal. Each side microphone signal is reconstructed using its sinusoidal (harmonic) component and its noise LP filter. Specifically, its harmonic component is added to the noise component that is obtained by filtering, with the signal's LP noise shaping filter, the LP residual of the sinusoidal noise from the reference signal. In this manner, we avoid encoding the residual of each of the side signals. This is important, since this signal is of highly stochastic nature and cannot be adequately represented using a small number of parameters (thus, it is highly demanding in bitrates for accurate encoding). We note that modeling this signal with parametric models results in low-quality audio resynthesis; in our previous work [12] we have shown that our noise transplantation method can result in significantly better quality audio modeling compared to parametric models for the residual signal. We obtained subjective scores around 4.0 using as few as 10 sinusoids, which is very important for low bitrate coding. For decoding, the proposed model operates as follows. The reference signal (Signal 1) is fully encoded (e.g. using an MP3 encoder at 64 kbps), while the remaining microphone signals are reconstructed using the quantized sinusoidal and LP parameters, together with the LP residual obtained from the reference channel.
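The decoding path can be sketched as follows. Peak picking and amplitude estimation are deliberately simplified here (a real implementation would locate true spectral peaks with interpolation), the b vectors follow the (1, -b(1), ..., -b(p)) convention of Eq. (2), and all function names are ours.

import numpy as np
from scipy.signal import lfilter

def sinusoidal_part(frame, L=10, nfft=2048):
    # Crude SNM harmonic part: keep the L largest bins of the windowed
    # spectrum and resynthesize them as constant-parameter sinusoids.
    win = np.hanning(len(frame))
    X = np.fft.rfft(frame * win, nfft)
    n = np.arange(len(frame))
    harm = np.zeros(len(frame))
    for k in np.argsort(np.abs(X))[-L:]:
        amp = 2.0 * np.abs(X[k]) / win.sum()
        harm += amp * np.cos(2 * np.pi * k / nfft * n + np.angle(X[k]))
    return harm

def transplant(side_frame, ref_frame, b_side, b_ref, L=10):
    # Noise transplantation: the side channel keeps its own harmonic part
    # and LP noise envelope; its noise residual is borrowed from the
    # reference channel by whitening the reference noise with b_ref.
    side_harm = sinusoidal_part(side_frame, L)
    ref_noise = ref_frame - sinusoidal_part(ref_frame, L)
    residual = lfilter(b_ref, [1.0], ref_noise)    # A_ref(z): whitening
    side_noise = lfilter([1.0], b_side, residual)  # 1/A_side(z): re-shaping
    return side_harm + side_noise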
3 Coding Methodology
The second part of our method is the coding procedure. It can be divided into two tasks: the quantization of the sinusoidal parameters and the quantization of the noise spectral envelopes for each side signal (for each short-time frame).
3.1 Coding of the Sinusoidal Parameters
We adopt the coding scheme of [13], developed for jointly optimal quantization of sinusoidal frequencies, amplitudes and phases. Due to space limitations, and since the details of the coding can be found in [13], we only provide here the final equations for the coding. More specifically, the quantization point densities g_A(\alpha), g_\Omega(\omega) and g_\Phi(\phi) (corresponding to amplitude, frequency, and phase, respectively) are given by the following equations:

g_A(\alpha) = g_A = \frac{w_\alpha^{1/6} \, 2^{\frac{1}{3}(\tilde{H} - b(A))}}{w_g^{1/6} \, (N^2/12)^{1/6}} ,   (3)

g_\Omega(\omega, \alpha) = g_\Omega(\alpha) = \frac{\alpha \, w_\alpha^{1/6} \, (N^2/12)^{1/3} \, 2^{\frac{1}{3}(\tilde{H} - b(A))}}{w_g^{1/6}} ,   (4)

g_\Phi(\phi, \alpha, w_l) = g_\Phi(\alpha, w_l) = \frac{\alpha \, w_l^{1/2} \, 2^{\frac{1}{3}(\tilde{H} - b(A))}}{w_\alpha^{1/3} \, w_g^{1/6} \, (N^2/12)^{1/6}} ,   (5)
The second group of parameters for each spot signal that need to be encoded are the spectral envelopes of the sinusoidal noise. We follow the quantization scheme of [14]. The LP coefficients of each spot signal that model the noise spectral envelope are transformed to LSF’s (Line Spectral Frequencies) which are modeled by means of a Gaussian Mixture Model (GMM). Then, the Karhunen Lo`eve Transform (KLT) decorrelates each LSF vector for each time segment. The decorrelated components can be independently quantized by a non-uniform quantizer (compressor, uniform quantizer and expander). Each LSF vector is classified to only one of the GMM clusters. This classification is performed in an analysis-by-synthesis manner. For each LSF vector, the Log Spectral Distortion (LSD) is computed for each GMM class (the distortion among the spectral envelopes obtained by the original and the quantized LSF vectors), and the vector is classified to the cluster associated with the minimal LSD.
4 Results In this section, we are interested to examine the coding performance of our proposed system, with respect to the resulting audio quality. For this purpose we performed subjective (listening) tests. We employed the Degradation Category Rating (DCR) test, in which listeners grade the coded vs the original waveform using a 5-scale grading system (from 1-“very annoying” audio quality compared to the original, to 5-“not perceived” difference in quality). For our listening tests, we used three signals, referred to as Signals 1-3. These signals are parts of a multichannel recording of a concert hall performance. We used the recordings from two different microphones, one of which captured mainly the female voices of the orchestra chorus, while the second one captured mainly the male voices. The former was used in our experiments as the side channel, and the latter as the reference signal. Thus, the objective is to test whether the side signal can be accurately reproduced when using the residual from the reference signal. We note that in our previous work [12], we showed that the proposed noise transplantation approach results in very good quality (around 4.0 grade in DCR tests in most cases) for various music signals, with the number of sinusoids per frame as low as 10. Thus, in this section our objective is to examine the lower
162
A. Mouchtaris, C. Tzagkarakis, and P. Tsakalides
Not perceived
Perceived, not annoying
Slightly annoying
19 kbps 24.4kbps
Annoying
Very annoying
SIgnal 1
Signal 2
Signal 3
Fig. 1. Results from the quality rating DCR listening tests, corresponding to coding with (a) 24.4 kbps (dotted), (b) 19 kbps (solid). Each frame is modeled with 10 sinusoids and 10 LP parameters.
limit in bitrates which can be achieved by our system without loss of audio quality below the grade achieved by modeling alone (i.e. 4.0 grade for the three signals tested here). Regarding the parameters used for deriving the waveforms used in the tests, the sampling rate for the audio data was 44.1 kHz and the LP order for the AR noise shaping filters was 10. The analysis/synthesis frame for the implementation of the sinusoidal model is 30 msec with 50% overlapping between successive frames. The coding efficiency for the sinusoidal parameters was tested for a given (target) entropy of 28 and 20 bits per sinusoid (amplitudes, frequencies and phases in total), which gives a bitrate of 19.6 kbps and 14.2 kbps respectively. Regarding the coding of the LP parameters (noise spectral envelope), 28 bits were used per LSF vector. With 23 msec frame and 75 % overlapping, this corresponds to 4.8 kbps for the noise envelopes. Thus, the resulting bitrates that were tested are 24.4 kbps and 19 kbps (adding the bitrate of the sinusoidal parameters and the noise envelopes). A training audio dataset of about 100,000 LSF vectors (approximately 9.5 min of audio) was used to estimate the parameters of a 16class GMM. The training database consisted of recordings of the classical music performance (corresponding to the recording from which Signals 1-3 originated, but a different part of the recording than the one used for testing). Details about the implementation of the coding procedure for the LP parameters can be found in our earlier work [16]. Eleven volunteers participated in the DCR tests, using high-quality headphones. The results of the DCR tests are depicted in Fig. 1, where the 95% confidence interval are shown (the vertical lines indicate the confidence limits). The solid line shows the results for the case of coding with a bitrate of 19 kbps, while the dotted line shows the results for the 24.4 kbps case. The results of the
Low Bitrate Coding of Spot Audio Signals
163
figure verify that the quality of the coded audio signals is good and the proposed algorithm offers an encouraging performance, and that this quality can be maintained at as low as 19 kbps per side signal. We note that the reference signal was PCM coded with 16 bits per sample, however similar results were obtained for the side signals when the reference signal was MP3 coded at 64 kbps (monophonic case).
5 Conclusions In this paper a novel modeling approach, namely noise transplantation, was proposed for achieving interactive and immersive audio applications of high-quality audio at low bitrates. The approach was based on applying the sinusoidal model at spot microphone signals, i.e. the multiple audio recordings before performing the mixing process which produces the final multichannel mix. It was shown that these signals can be encoded collectively using a bitrate as low as 19 kbps per spot signal. Further research efforts are necessary in order to achieve even lower bitrates while preserving high audio quality, while a more detailed testing procedure using subjective methods is currently underway.
Acknowledgments
The authors wish to thank Prof. Y. Stylianou for his insightful suggestions and for his help with the implementation of the sinusoidal model algorithm, Prof. C. Kyriakakis for providing the audio recordings used in the experiments, as well as the listening-test volunteers.
References
1. ITU-R BS.1116: Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems. International Telecommunications Union, Geneva, Switzerland (1994)
2. ISO/IEC JTC1/SC29/WG11 (MPEG) International Standard ISO/IEC 11172-3: Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s (1992)
3. Brandenburg, K.: MP3 and AAC explained. In: Proc. 17th International Conference on High Quality Audio Coding of the Audio Engineering Society (AES) (September 1999)
4. ISO/IEC JTC1/SC29/WG11 (MPEG) International Standard ISO/IEC 13818-7: Generic coding of moving pictures and associated audio: Advanced audio coding (1997)
5. Bosi, M., Brandenburg, K., Quackenbush, S., Fielder, L., Akagiri, K., Fuchs, H., Dietz, M., Herre, J., Davidson, G., Oikawa, Y.: ISO/IEC MPEG-2 advanced audio coding. In: Proc. 101st Convention of the Audio Engineering Society (AES), preprint No. 4382, Los Angeles, CA (November 1996)
6. Davis, M.: The AC-3 multichannel coder. In: Proc. 95th Convention of the Audio Engineering Society (AES), preprint No. 3774, New York, NY (October 1993)
7. Johnston, J.D., Ferreira, A.J.: Sum-difference stereo transform coding. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Process (ICASSP), pp. 569–572 (1992) 8. Herre, J., Brandenburg, K., Lederer, D.: Intensity stereo coding. In: Proc. 96th Convention of the Audio Engineering Society (AES), preprint No. 3799 (February 1994) 9. Breebaart, J., Herre, J., Faller, C., Roden, J., Myburg, F., Disch, S., Purnhagen, H., Hotho, G., Neusinger, M., Kjorling, K., Oomen, W.: MPEG Spatial Audio Coding / MPEG Surround: Overview and current status. In: Proc. AES 119th Convention, Paper 6599, New York, NY (October 2005) 10. Baumgarte, F., Faller, C.: Binaural Cue Coding - Part I: Psychoacoustic Fundamentals and Design Principles. IEEE Trans. on Speech and Audio Proc. 11(6), 509–519 (2003) 11. Breebaart, J., van de Par, S., Kohlrausch, A., Schuijers, E.: Parametric coding of stereo audio. EURASIP Journal on Applied Signal Processing 9, 1305–1322 (2005) 12. Tzagkarakis, C., Mouchtaris, A., Tsakalides, P.: Modeling spot microphone signals using the sinusoidal plus noise approach. In: Proc. Workshop on Appl. of Signal Proc. to Audio and Acoust. (October 2007) 13. Vafin, R., Prakash, D., Kleijn, W.B.: On Frequency Quantization in Sinusoidal Audio Coding. IEEE Signal Proc. Letters 12(3), 210–213 (2005) 14. Subramaniam, A.D., Rao, B.D.: PDF optimized parametric vector quantization of speech line spectral frequencies. IEEE Trans. on Speech and Audio Proc. 11, 365–380 (2003) 15. Rabiner, L., Juang, B.-H.: Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs (1993) 16. Karadimou, K., Mouchtaris, A., Tsakalides, P.: Multichannel Audio Modeling and Coding Using a Multiband Source/Filter Model. In: Conf. Record of the ThirtyNinth Asilomar Conf. Signals, Systems and Computers, pp. 907–911 (2005) 17. McAulay, R.J., Quatieri, T.F.: Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoust., Speech, and Signal Process. 34(4), 744–754 (1986) 18. Stylianou, Y.: Applying the harmonic plus noise model in concatenative speech synthesis. IEEE Trans. Speech and Audio Process. 9(1), 21–29 (2001) 19. Serra, X., Smith, J.O.: Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal 14(4), 12–24 (1990) 20. Goodwin, M.: Residual modeling in music analysis-synthesis. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Process (ICASSP), pp. 1005–1008 (May 1996) 21. Hendriks, R.C., Heusdens, R., Jensen, J.: Perceptual linear predictive noise modeling for sinusoid-plus-noise audio coding. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Process (ICASSP), pp. 189–192 (May 2004)
Extracting Input Features and Fuzzy Rules for Forecasting Exchange Rate Using NEWFM Sang-Hong Lee, Hyoung J. Jang, and Joon S. Lim* College of IT, Kyungwon University, Korea {shleedosa,hjjang,jslim}@kyungwon.ac.kr
Abstract. Fuzzy neural networks have been successfully applied to generate predictive rules for exchange rate forecasting. This paper presents a methodology to forecast the daily and weekly GBP/USD exchange rate by extracting fuzzy rules based on the neural network with weighted fuzzy membership functions (NEWFM) and a minimized number of input features selected by the distributed non-overlap area measurement method. NEWFM classifies upward and downward cases of the next day's and next week's GBP/USD exchange rate using the recent 32 days and 32 weeks of CPPn,m (Current Price Position of day n or week n: the percentage difference between the price of day n or week n and the moving average of the past m days or m weeks from day n-1 or week n-1) of the daily and weekly GBP/USD exchange rate, respectively. In this paper, the Haar wavelet function is used as the mother wavelet. The most important five and four input features among CPPn,m and the 38 wavelet-transformed coefficients produced from the recent 32 days and 32 weeks of CPPn,m are selected by the non-overlap area distribution measurement method, respectively. The data sets cover a period of approximately ten years starting from 2 January 1990. The proposed method achieves accuracy rates of 55.19% for the daily data and 72.58% for the weekly data.
Keywords: fuzzy neural networks, wavelet transform, exchange rate, forecasting.
1 Introduction
A fuzzy neural network (FNN) is the combination of a neural network and fuzzy set theory, and provides the interpretation capability of hidden layers using knowledge based on fuzzy set theory [14-17]. Various FNN models with different algorithms for learning, adaptation, and rule extraction have been proposed as adaptive decision support tools in the fields of pattern recognition, classification, and forecasting [4-6][18]. Chai proposed economic turning point forecasting using a fuzzy neural network [13] and Gestel proposed financial time series prediction using least squares support vector machines within the evidence framework [11]. Kim proposed support vector machines (SVMs) to predict a financial time series and compared SVMs with back-propagation neural networks [2], and also proposed a genetic algorithm approach to instance selection in artificial neural networks [12]. Exchange rate forecasting has been studied using AI (artificial intelligence) approaches such as artificial neural networks and rule-based systems. Artificial neural networks are used for training on exchange rate data, and rule-based systems are used to support decision making on the direction (higher or lower) of the daily and weekly changes.
* Corresponding author.
Panda [10] compared the weekly Indian rupee/USD exchange rate forecasting performance of a neural network with the performances of linear autoregressive (LAR) and random walk (RW) models. A new forecasting model based on the neural network with weighted fuzzy membership functions (NEWFM) [3], concerning forecasting of the GBP/USD exchange rate using the Haar wavelet transform (WT), is implemented in this paper. In this paper, five and four extracted input features are presented to forecast the daily and weekly GBP/USD exchange rate, respectively, using the Haar WT, the neural network with weighted fuzzy membership functions (NEWFM), and the non-overlap area distribution measurement method [3]. The method extracts the minimum number of input features, each of which constructs an interpretable fuzzy membership function. All features are interpretably formed as weighted fuzzy membership functions preserving the disjunctive fuzzy information and characteristics. All features are extracted by the non-overlap area measurement method, validated on the wine benchmarking data of the University of California, Irvine (UCI) Machine Learning Repository [7]. This study forecasts the direction (higher or lower) of the daily and weekly changes of the GBP/USD exchange rate. The directions are labeled "1" or "2" in the GBP/USD exchange rate data: "1" means that the next day's (or next week's) value is lower than today's (or this week's) value, and "2" means that it is higher. In this paper, the total numbers of samples are 2800 days for the daily GBP/USD exchange rate and 560 weeks for the weekly GBP/USD exchange rate, as used by Sfetsos [1], covering approximately ten years starting from 2 January 1990. Sfetsos divided the samples into three subsets, namely the training, evaluation, and unknown prediction sets, formed using approximately 70%, 19%, and 11% of the data, respectively. The performance and forecasting ability are measured on the totally unknown prediction sets. Sfetsos compared linear regression (LR) with a feedforward artificial neural network (ANN) for forecasting the daily and weekly GBP/USD exchange rate; the accuracy rates of LR and ANN are 48.86% and 50.62% for the daily data and 63.93% and 65.57% for the weekly data, respectively. In this paper, the most important five and four input features are selected by the non-overlap area measurement method [7]. The five and four generalized features are used to generate the fuzzy rules to forecast the next day's and next week's directions of the daily and weekly changes of the GBP/USD exchange rate, respectively. NEWFM achieves accuracy rates of 55.19% for the daily data and 72.58% for the weekly data.
2 Wavelet Transforms
The wavelet transform (WT) is a transformation to basis functions that are localized in both scale and time. The WT decomposes the original signal into a set of coefficients that describe the frequency content at given times. The continuous wavelet transform (CWT) of a continuous-time signal x(t) is defined as:
T(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} \psi\left(\frac{t-b}{a}\right) x(t)\, dt        (1)
where ψ((t-b)/a) is the analyzing wavelet function. The transform coefficients T(a,b) are found both for specific locations on the signal, t = b, and for specific wavelet periods (which are a function of the scale a). The CWT is referred to as the dyadic wavelet transform (DWT) if a is discretized along the dyadic sequence 2^i, where i = 1, 2, .... The DWT can be defined as [8]:
S_{2^i} x(n) = \sum_{k \in Z} h_k \, S_{2^{i-1}} x(n - 2^{i-1} k)
W_{2^i} x(n) = \sum_{k \in Z} g_k \, S_{2^{i-1}} x(n - 2^{i-1} k)        (2)
where S_{2^i} is a smoothing operator, W_{2^i} x(n) is the wavelet transform (detail signal) of the digital signal x(n), i ∈ Z (Z is the set of integers), and h_k and g_k are the coefficients of the corresponding low-pass and high-pass filters. A filtered signal at level i is down-sampled, reducing the length of the signal at level i-1 by a factor of two and generating approximation (a_i) and detail (d_i) coefficients at level i. This paper proposes CPP_{n,m} (Current Price Position) as a new technical indicator to forecast the next day's and next week's directions of the daily and weekly changes of the GBP/USD exchange rate, respectively. CPP_{n,m} is the current price position of day n (or week n): the percentage difference between the price of day n (or week n) and the moving average of the past m days (or m weeks) from day n-1 (or week n-1). CPP_{n,m} is calculated by
CPP_{n,m} = \frac{C_n - MA_{n-1,n-m}}{MA_{n-1,n-m}} \times 100        (3)
where C_n is the closing price of day n (or week n) and MA_{n-1,n-m} is the moving average of the past m days (or m weeks) from day n-1 (or week n-1). In this paper, the Haar wavelet function is used as the mother wavelet. The Haar wavelet function produces 38 approximation and detail coefficients from CPP_{n,5} to CPP_{n-31,5}, from which input features are extracted. The 38 approximation and detail coefficients consist of 16 detail coefficients at level 1, 8 detail coefficients at level 2, 4 detail coefficients and 4 approximations at level 3, 2 detail coefficients and 2 approximations at level 4, and 1 detail coefficient and 1 approximation at level 5. The neural network with weighted fuzzy membership functions (NEWFM) and the non-overlap area distribution measurement method [3] are used to extract the minimum number of input features from these 39 candidate features. Table 1 shows the extracted minimum input features.

Table 1. Comparisons of Input Features Used for Forecasting the Daily Changes with the Weekly Changes in NEWFM

Input Features for Forecasting the Daily Changes: 5 features, namely
1) d12 among 16 detail coefficients at level 1
2) d13 among 16 detail coefficients at level 1
3) d4 among 8 approximations at level 2
4) d1 among 4 approximations at level 3
5) d2 among 2 approximations at level 4

Input Features for Forecasting the Weekly Changes: 4 features, namely
1) d1 among 8 detail coefficients at level 2
2) a1 among 4 approximations at level 3
3) a1 among 1 approximation at level 5
4) CPPn
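To make the feature pipeline concrete, the following sketch computes CPP_{n,m} from a series of closing prices (eq. 3) and collects 38 Haar coefficients from a 32-sample CPP window. It is illustrative only: the text does not state the Haar normalization, so plain pairwise averaging and differencing is assumed, and the function names are ours.

```python
import numpy as np

def cpp(close, n, m=5):
    """CPP_{n,m}: percentage difference between day n's close and the
    moving average of the m previous closes (days n-m .. n-1), eq. (3)."""
    ma = close[n - m:n].mean()
    return (close[n] - ma) / ma * 100.0

def haar_features(cpp_window):
    """Collect the 38 Haar coefficients described in the text from the
    32 most recent CPP values: details at levels 1-5 plus the retained
    approximations at levels 3-5 (16+8+4+2+1 details, 4+2+1 approximations)."""
    a = np.asarray(cpp_window, dtype=float)   # window length must be 32
    details, approximations = [], []
    for level in range(1, 6):                 # five decomposition levels
        d = (a[0::2] - a[1::2]) / 2.0         # Haar detail coefficients
        a = (a[0::2] + a[1::2]) / 2.0         # Haar approximation coefficients
        details.append(d)
        if level >= 3:                        # approximations kept from level 3 on
            approximations.append(a)
    return np.concatenate(details + approximations)   # 31 + 7 = 38 values
```

Together with the CPP value of the current day or week itself, this yields the 39 candidate features from which NEWFM selects.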
3 Neural Network with Weighted Fuzzy Membership Function (NEWFM)
3.1 The Structure of NEWFM
The neural network with weighted fuzzy membership function (NEWFM) is a supervised classification neuro-fuzzy system using the bounded sum of weighted fuzzy membership functions (BSWFM in Fig. 2) [3][9]. The structure of NEWFM, illustrated in Fig. 1, comprises three layers, namely the input, hyperbox, and class layers. The input layer contains n input nodes for an n-featured input pattern. The hyperbox layer consists of m hyperbox nodes. Each hyperbox node B_l to be connected to a class node contains n BSWFMs for the n input nodes. The output layer is composed of p class nodes. Each class node is connected to one or more hyperbox nodes. The hth input pattern can be recorded as I_h = {A_h = (a_1, a_2, ..., a_n), class}, where class is the result of classification and A_h contains the n features of the input pattern. The connection weight between a hyperbox node B_l and a class node C_i is represented by w_li, which is initially set to 0. From the first input pattern I_h, w_li is set to 1 for the winner hyperbox node B_l and class i in I_h. C_i may have one or more connections to hyperbox nodes, whereas B_l is restricted to one connection to a corresponding class node. B_l can be learned only when B_l is the winner for an input I_h with class i and w_li = 1.
Fig. 1. Structure of NEWFM
3.2 Learning Scheme
A hyperbox node B_l consists of n fuzzy sets. The ith fuzzy set of B_l, represented by B_l^i, has three weighted fuzzy membership functions (WFMs; the grey triangles ω_{l1}^i, ω_{l2}^i, and ω_{l3}^i in Fig. 2), which are randomly constructed before learning. Each ω_{lj}^i originates from the original membership function μ_{lj}^i with its weight W_{lj}^i, as in Fig. 2. The bounded sum of the three weighted fuzzy membership functions (BSWFM, bold line in Fig. 2) of B_l^i combines the fuzzy characteristics of the three WFMs. The BSWFM value of B_l^i, denoted BS_l^i(·), is calculated by formula (4), where a_i is the ith feature value of an input pattern A_h for B_l^i:

BS_l^i(a_i) = \sum_{j=1}^{3} \omega_{lj}^i(a_i)        (4)
The winner hyperbox node B_l is selected by the Output(B_l) operator. Only the B_l that has the maximum value of Output(B_l) for an input I_h with class i and w_li = 1 among the hyperbox nodes can be learned. For the hth input A_h = (a_1, a_2, ..., a_n) with n features presented to the hyperbox B_l, the output of B_l is obtained by formula (5):

Output(B_l) = \frac{1}{n} \sum_{i=1}^{n} BS_l^i(a_i)        (5)
BS li (ai )
Wl i 2 = 0.8
ωli 2
ωli1 vli 0 vli min
μli 3
μli 2
Wl i 3 = 0.3
ωli 3
vli1
vli 2
ai
vli 3
x vli max
vli 4
Fig. 2. An Example of Bounded Sum of Weighted Fuzzy Membership Functions (BSWFM, i i Bold Line) of B l and BSl (ai )
Then, the selected winner hyperbox node B_l is learned by the Adjust(B_l) operation. This operation adjusts all B_l^i s according to the input a_i, where i = 1, 2, ..., n. The membership function weight W_{lj}^i (where 0 ≤ W_{lj}^i ≤ 1 and j = 1, 2, 3) represents the strength of ω_{lj}^i. A WFM ω_{lj}^i can thus be formed by (v_{lj-1}^i, W_{lj}^i, v_{lj+1}^i). As a result
of the Adjust(B_l) operation, the vertices v_{lj}^i and weights W_{lj}^i in Fig. 3 are adjusted by the following expressions (6):

v_{lj}^i = v_{lj}^i + s \times \alpha \times E_{lj}^i \times \omega_{lj}^i(a_i) = v_{lj}^i + s \times \alpha \times E_{lj}^i \times \mu_{lj}^i(a_i) \times W_{lj}^i,

where
s = -1, \; E_{lj}^i = \min(|v_{lj}^i - a_i|, |v_{lj-1}^i - a_i|), \quad \text{if } v_{lj-1}^i \le a_i < v_{lj}^i
s = 1, \; E_{lj}^i = \min(|v_{lj}^i - a_i|, |v_{lj+1}^i - a_i|), \quad \text{if } v_{lj}^i \le a_i < v_{lj+1}^i
E_{lj}^i = 0, \quad \text{otherwise}

W_{lj}^i = W_{lj}^i + \beta \times (\mu_{lj}^i(a_i) - W_{lj}^i)        (6)

where α and β are the learning rates for v_{lj}^i and W_{lj}^i, respectively, both in the range from 0 to 1, and j = 1, 2, 3. Fig. 3 shows the BSWFMs before and after the Adjust(B_l) operation for B_l^i with an input a_i. The weights and the centers of the membership functions are adjusted by the Adjust(B_l) operation; e.g., W_{l1}^i, W_{l2}^i, and W_{l3}^i are moved down, v_{l1}^i and v_{l2}^i are moved toward a_i, and v_{l3}^i remains in the same location. The Adjust(B_l) operations are executed over a set of training data. If the classification rate for a set of test data does not reach a goal rate, the learning scheme with the Adjust(B_l) operation is repeated from the beginning by randomly reconstructing all WFMs in the B_l s and resetting all connection weights to 0 (w_li = 0), until the goal rate is reached.
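The update (6) can be sketched as follows, reusing the triangle() helper from the previous sketch; the fuzzy-set layout (vertex list v[0..4] with v[0] = v_min and v[4] = v_max, weight list W indexed by j = 1..3) and the learning-rate values are assumptions for illustration:

```python
def adjust(fuzzy_set, a_i, alpha=0.1, beta=0.1):
    """One Adjust(B_l) step for a single fuzzy set, following eq. (6)."""
    v, W = fuzzy_set["v"], fuzzy_set["W"]
    for j in (1, 2, 3):
        mu = triangle(a_i, v[j - 1], v[j], v[j + 1])       # mu_{lj}^i(a_i)
        if v[j - 1] <= a_i < v[j]:                          # a_i left of center
            s, E = -1, min(abs(v[j] - a_i), abs(v[j - 1] - a_i))
        elif v[j] <= a_i < v[j + 1]:                        # a_i right of center
            s, E = 1, min(abs(v[j] - a_i), abs(v[j + 1] - a_i))
        else:
            s, E = 0, 0.0
        v[j] += s * alpha * E * mu * W[j]   # move the center vertex toward a_i
        W[j] += beta * (mu - W[j])          # pull the weight toward mu(a_i)
```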
Fig. 3. An Example of Before and After the Adjust(B_l) Operation for B_l^i
4 Experimental Results
In this section, the total numbers of samples are 2800 days for the daily GBP/USD exchange rate and 560 weeks for the weekly GBP/USD exchange rate, as used by Sfetsos [1], covering approximately ten years starting from 2 January 1990. Sfetsos divided the samples into three subsets, namely the training, evaluation, and unknown prediction sets. Table 2 shows that these were formed using approximately 70%, 19%, and 11% of the data, respectively. The performance and forecasting ability are measured on the totally unknown prediction sets.

Table 2. Number of instances used in Sfetsos

                                  Training sets   Evaluation sets   Unknown prediction sets   Total sets
Forecasting the Daily Changes     1960            532               308                       2800
Forecasting the Weekly Changes    392             106               62                        560
Sfetsos compared linear regression (LR) with a feedforward artificial neural network (ANN) for forecasting the daily and weekly GBP/USD exchange rate. The accuracy of NEWFM is evaluated on the same totally unknown prediction sets that were used by Sfetsos. Table 3 displays the comparison of the performance results of Sfetsos's methods with NEWFM, i.e., the accuracy rates on the totally unknown prediction sets.

Table 3. Comparisons of Performance Results for Sfetsos with NEWFM

              Sfetsos's LR    Sfetsos's ANN    NEWFM
              Accuracy (%)    Accuracy (%)     Accuracy (%)
Daily data    48.86           50.62            55.19
Weekly data   63.93           65.57            72.58
In this paper, the most important five and four input features are selected by the non-overlap area measurement method [7]. The five and four generalized features are used to generate the fuzzy rules that forecast the next day's and next week's directions of the daily and weekly changes of the GBP/USD exchange rate, respectively. These features, extracted from the 39 candidate input features, are selected by the non-overlap area distribution measurement method [3]. The method measures the degree of salience of the ith feature from the non-overlapped areas of the two classes' area distributions, by the following equation:

f(i) = \frac{(Area_U^i + Area_L^i)^2}{\max(Area_U^i, Area_L^i)}        (7)

where Area_U^i and Area_L^i are the upper-phase superior area and the lower-phase superior area, respectively.
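A numerical sketch of the salience measure as reconstructed in eq. (7), under the assumption that Area_U and Area_L are the areas where the upper-phase and lower-phase BSWFM curves, respectively, lie above one another; the grid-based integration and the function names are ours:

```python
import numpy as np

def salience(bs_lower, bs_upper, xs):
    """Eq. (7) evaluated on a uniform grid xs of feature values."""
    lo = np.array([bs_lower(x) for x in xs])
    up = np.array([bs_upper(x) for x in xs])
    dx = xs[1] - xs[0]
    area_u = np.maximum(up - lo, 0.0).sum() * dx   # upper phase superior area
    area_l = np.maximum(lo - up, 0.0).sum() * dx   # lower phase superior area
    return (area_u + area_l) ** 2 / max(area_u, area_l)
```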
Fig. 4. Trained BSWFM of the Generalized Five Features for Lower Phase and Upper Phase Classification of the daily GBP/USD exchange rate
Fig. 5. Trained BSWFM of the Generalized Four Features for Lower Phase and Upper Phase Classification of the weekly GBP/USD exchange rate
Fig. 6. AreaU(White) and AreaL(Black) for d1 feature among 8 detail coefficients at level 2
As an example, for the d1 feature among the 8 detail coefficients at level 2, the corresponding Area_U and Area_L are shown in Fig. 6. The larger the value of f(i), the more salient the feature. In this experiment, two hyperboxes are created for classification. The hyperbox containing one set of lines (BSWFM) in Fig. 4 and Fig. 5 is the rule for class 1 (lower
phase), while the other hyperbox, containing the other set of lines (BSWFM), is the rule for class 2 (upper phase). The graphs in Fig. 4 and Fig. 5 are obtained from the training process of the NEWFM program and show graphically the difference between the lower phase and the upper phase for each input feature. Lower phase means that the next day's (or next week's) value is lower than today's (or this week's) value; upper phase means that it is higher.
5 Concluding Remarks
This paper proposes a new forecasting model based on the neural network with weighted fuzzy membership functions (NEWFM). NEWFM is a new neural network model that improves forecasting accuracy rates by using self-adaptive weighted fuzzy membership functions. The degree of classification intensity is obtained from the bounded sum of the weighted fuzzy membership functions extracted by NEWFM. In this paper, the Haar wavelet function is used as the mother wavelet to extract input features. The five and four input features extracted by the non-overlap area distribution measurement method [3] are used to forecast the daily and weekly GBP/USD exchange rate, respectively, using the Haar WT. The accuracy rates are 55.19% for the daily data and 72.58% for the weekly data. To improve the accuracy of the exchange rate forecasting capability, further study of statistical tools such as probability density functions and normal distributions will be needed.
References 1. Sfetsos, A., Siriopoulos, C.: Time Series Forecasting of Averaged Data With Efficient Use of Information. IEEE Trans. on Systems, Man, and Cybernetics—Part A: Systems and Humans 35(5) (September 2005) 2. Kim, K.-j.: Financial time series forecasting using support vector machines. Neurocomputing 55, 307–309 (2003) 3. Lim, J.S., Ryu, T.-W., Kim, H.-J., Gupta, S.: Feature Selection for Specific Antibody Deficiency Syndrome by Neural Network with Weighted Fuzzy Membership Functions. In: Wang, L., Jin, Y. (eds.) FSKD 2005. LNCS (LNAI), vol. 3614, pp. 811–820. Springer, Heidelberg (2005) 4. Ishibuchi, H., Nakashima, T.: Voting in Fuzzy Rule-Based Systems for Pattern Classification Problems. Fuzzy Sets and Systems 103, 223–238 (1999) 5. Nauk, D., Kruse, R.: A Neuro-Fuzzy Method to Learn Fuzzy Classification Rules from Data. Fuzzy Sets and Systems 89, 277–288 (1997) 6. Setnes, M., Roubos, H.: GA-Fuzzy Modeling and Classification: Complexity and Performance. IEEE Trans. Fuzzy Systems 8(5), 509–522 (2000) 7. Lim, J.S., Gupta, S.: Feature Selection Using Weighted Neuro-Fuzzy Membership Functions. In: The 2004 International Conference on Artificial Intelligence (ICAI 2004), Las Vegas, Nevada, USA, June 21-24, vol. 1, pp. 261–266 (2004) 8. Mallat, S.: Zero Crossings of a Wavelet Transform. IEEE Trans. on Information Theory 37, 1019–1033 (1991)
9. Lim, J.S., Wang, D., Kim, Y.-S., Gupta, S.: A neuro-fuzzy approach for diagnosis of antibody deficiency syndrome. Neurocomputing 69(7-9), 969–974 (2006) 10. Panda, C., Narasimhan, V.: Forecasting exchange rate better with artificial neural network. Journal of Policy Modeling 29, 227–236 (2007) 11. Gestel, T.V., et al.: Financial Time Series Prediction Using Least Squares Support Vector Machines Within the Evidence Framework. IEEE Trans. Neural Networks 12(4), 809–821 (2001) 12. Kim, K.-j.: Artificial neural networks with evolutionary instance selection for financial forecasting. Expert System with Applications 30, 519–526 (2006) 13. Chai, S.H., Lim, J.S.: Economic Turning Point Forecasting Using Fuzzy Neural Network and Non-Overlap Area Distribution Measurement Method. The Korean Economic Association 23(1), 111–130 (2007) 14. Carpenter, G.A., Grossberg, S., Reynolds, J.: ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks 4, 565–588 (1991) 15. Jang, R.: ANFIS: Adaptive network-based fuzzy inference system. IEEE Trans. Syst., Man, Cybern. 23, 665–685 (1993) 16. Wang, J.S., Lee, C.S.G.: Self-Adaptive Neuro-Fuzzy Inference System for Classification Applications. IEEE Trans., Fuzzy Systems 10(6), 790–802 (2002) 17. Simpson, P.: Fuzzy min-max neural networks-Part 1: Classification. IEEE Trans., Neural Networks 3, 776–786 (1992) 18. Lim, J.S.: Finding Fuzzy Rules by Neural Network with Weighted Fuzzy Membership Function. International Journal of Fuzzy Logic and Intelligent Systems 4(2), 211–216 (2004)
Forecasting Short-Term KOSPI Time Series Based on NEWFM Sang-Hong Lee, Hyoung J. Jang, and Joon S. Lim* College of IT, Kyungwon University, Korea {shleedosa,hjjang,jslim}@kyungwon.ac.kr
Abstract. Fuzzy neural networks have been successfully applied to generate predictive rules for stock forecasting. This paper presents a methodology to forecast the daily Korea composite stock price index (KOSPI) by extracting fuzzy rules based on the neural network with weighted fuzzy membership functions (NEWFM) and a minimized number of input features selected by the distributed non-overlap area measurement method. NEWFM supports KOSPI time series analysis based on weighted-average defuzzification, the fuzzy model suggested by Takagi and Sugeno. NEWFM classifies upper and lower cases of the next day's KOSPI using the recent 32 days of CPPn,m (Current Price Position of day n: the percentage difference between the price of day n and the moving average of the past m days from day n-1) of the KOSPI. In this paper, the Haar wavelet function is used as the mother wavelet. The most important four input features among CPPn,m and the 38 wavelet-transformed coefficients produced from the recent 32 days of CPPn,m are selected by the non-overlap area distribution measurement method. The total number of samples is 2928 trading days, from January 1989 to December 1998. About 80% of the data is used for training and 20% for testing. The resulting classification rate is 59.0361%.
Keywords: fuzzy neural networks, weighted average defuzzification, wavelet transform, KOSPI, nonlinear time series.
1 Introduction
A fuzzy neural network (FNN) is the combination of a neural network and fuzzy set theory, and provides the interpretation capability of hidden layers using knowledge based on fuzzy set theory [14-17]. Various FNN models with different algorithms for learning, adaptation, and rule extraction have been proposed as adaptive decision support tools in the fields of pattern recognition, classification, and forecasting [4-6][12]. Chai proposed economic turning point forecasting using a fuzzy neural network [7] and Gestel proposed financial time series prediction using least squares support vector machines within the evidence framework [11]. Stock forecasting has been studied using AI (artificial intelligence) approaches such as artificial neural networks and rule-based systems. Artificial neural networks are used for training on stock data, and rule-based systems are used to support decision making on the direction (higher or lower) of the daily change. Bergerson and Wunsch [10] combined a neural network and a rule-based system in the S&P 500 index futures market.
* Corresponding author.
Xiaohua Wang [1] proposed the time delay neural network (TDNN), which explored the usefulness of volume information in explaining the predictability of stock index returns. Kim proposed support vector machines (SVMs) to predict a financial time series and compared SVMs with back-propagation neural networks [2]. In this paper, four extracted input features are presented to forecast the daily Korea composite stock price index (KOSPI) using the Haar WT, the neural network with weighted fuzzy membership functions (NEWFM), and the non-overlap area distribution measurement method [3]. The method extracts the minimum number of input features, each of which constructs an interpretable fuzzy membership function. The four features are interpretably formed as weighted fuzzy membership functions preserving the disjunctive fuzzy information and characteristics, locally related to the time signal, i.e., the patterns of the KOSPI. This study forecasts the direction (higher or lower) of the daily changes of the KOSPI. The directions are labeled "1" or "2" in the KOSPI data: "1" means that the next day's index is lower than today's index, and "2" means that the next day's index is higher than today's index. In this paper, the total number of samples is the 2928 trading days used in Kim [2], from January 1989 to December 1998; about 80% of the trading data are used for training and 20% for testing. Kim used support vector machines (SVMs) to predict this financial time series, obtaining an accuracy rate of 57.8313% [2]. In this paper, the most important four input features are selected by the non-overlap area distribution measurement method [3]. The four generalized features are used to generate the fuzzy rules to forecast the next day's direction of the daily changes of the KOSPI. NEWFM achieves an accuracy rate of 59.0361%. The fuzzy model suggested by Takagi and Sugeno in 1985 can represent nonlinear systems such as stock time series [13] and business cycles [7].
2 Wavelet Transforms
The wavelet transform (WT) is a transformation to basis functions that are localized in both scale and time. The WT decomposes the original signal into a set of coefficients that describe the frequency content at given times. The continuous wavelet transform (CWT) of a continuous-time signal x(t) is defined as:
T(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} \psi\left(\frac{t-b}{a}\right) x(t)\, dt        (1)
where ψ((t-b)/a) is the analyzing wavelet function. The transform coefficients T(a,b) are found both for specific locations on the signal, t = b, and for specific wavelet periods (which are a function of the scale a). The CWT is referred to as the dyadic wavelet transform (DWT) if a is discretized along the dyadic sequence 2^i, where i = 1, 2, .... The DWT can be defined as [8]:

S_{2^i} x(n) = \sum_{k \in Z} h_k \, S_{2^{i-1}} x(n - 2^{i-1} k)
W_{2^i} x(n) = \sum_{k \in Z} g_k \, S_{2^{i-1}} x(n - 2^{i-1} k)        (2)
where S_{2^i} is a smoothing operator, W_{2^i} x(n) is the wavelet transform (detail signal) of the digital signal x(n), i ∈ Z (Z is the set of integers), and h_k and g_k are the coefficients of the corresponding low-pass and high-pass filters. A filtered signal at level i is down-sampled, reducing the length of the signal at level i-1 by a factor of two and generating approximation (a_i) and detail (d_i) coefficients at level i. This paper proposes CPP_{n,m} (Current Price Position) as a new technical indicator to forecast the next day's direction of the daily changes of the KOSPI. CPP_{n,m} is the current price position of day n: the percentage difference between the price of day n and the moving average of the past m days from day n-1. CPP_{n,m} is calculated by
CPP_{n,m} = \frac{C_n - MA_{n-1,n-m}}{MA_{n-1,n-m}} \times 100        (3)
where C_n is the closing price of day n and MA_{n-1,n-m} is the moving average of the past m days from day n-1. In this paper, the Haar wavelet function is used as the mother wavelet. The Haar wavelet function produces 38 approximation and detail coefficients from CPP_{n,5} to CPP_{n-31,5}, from which input features are extracted. The 38 approximation and detail coefficients consist of 16 detail coefficients at level 1, 8 detail coefficients at level 2, 4 detail coefficients and 4 approximations at level 3, 2 detail coefficients and 2 approximations at level 4, and 1 detail coefficient and 1 approximation at level 5. The neural network with weighted fuzzy membership functions (NEWFM) and the non-overlap area distribution measurement method [3] are used to extract the minimum number of input features from these 39 candidate features. The following four minimum input features are extracted:
1) d1 among the 16 detail coefficients at level 1
2) a1 among the 4 approximations at level 3
3) a1 among the 1 approximation at level 5
4) CPP_{n,5}
3 Neural Network with Weighted Fuzzy Membership Function (NEWFM)
3.1 The Structure of NEWFM
The neural network with weighted fuzzy membership function (NEWFM) is a supervised classification neuro-fuzzy system using the bounded sum of weighted fuzzy membership functions (BSWFM in Fig. 2) [3][9]. The structure of NEWFM, illustrated in Fig. 1, comprises three layers, namely the input, hyperbox, and class layers. The input layer contains n input nodes for an n-featured input pattern. The hyperbox layer consists of m hyperbox nodes. Each hyperbox node B_l to be connected to a class node contains n BSWFMs for the n input nodes. The output layer is composed of p class nodes. Each class node is connected to one or more hyperbox nodes. The hth input pattern can be recorded as I_h = {A_h = (a_1, a_2, ..., a_n), class}, where class is the result of classification and A_h contains the n features of the input pattern. The connection weight between a hyperbox node B_l and a class node C_i is represented by w_li, which is initially set to 0. From the first input pattern I_h, w_li is
Fig. 1. Structure of NEWFM
set to 1 for the winner hyperbox node B_l and class i in I_h. C_i may have one or more connections to hyperbox nodes, whereas B_l is restricted to one connection to a corresponding class node. B_l can be learned only when B_l is the winner for an input I_h with class i and w_li = 1.
3.2 Learning Scheme
A hyperbox node B_l consists of n fuzzy sets. The ith fuzzy set of B_l, represented by B_l^i, has three weighted fuzzy membership functions (WFMs; the grey triangles ω_{l1}^i, ω_{l2}^i, and ω_{l3}^i in Fig. 2), which are randomly constructed before learning. Each ω_{lj}^i originates from the original membership function μ_{lj}^i with its weight W_{lj}^i, as in Fig. 2. The bounded sum of the three weighted fuzzy membership functions (BSWFM, bold line in Fig. 2) of B_l^i combines the fuzzy characteristics of the three WFMs. The BSWFM value of B_l^i, denoted BS_l^i(·), is calculated by formula (4), where a_i is the ith feature value of an input pattern A_h for B_l^i:

BS_l^i(a_i) = \sum_{j=1}^{3} \omega_{lj}^i(a_i)        (4)
The winner hyperbox node B_l is selected by the Output(B_l) operator. Only the B_l that has the maximum value of Output(B_l) for an input I_h with class i and w_li = 1 among the hyperbox nodes can be learned. For the hth input A_h = (a_1, a_2, ..., a_n) with n features presented to the hyperbox B_l, the output of B_l is obtained by formula (5):

Output(B_l) = \frac{1}{n} \sum_{i=1}^{n} BS_l^i(a_i)        (5)
Fig. 2. An Example of Bounded Sum of Weighted Fuzzy Membership Functions (BSWFM, Bold Line) of B_l^i and BS_l^i(a_i)
Then, the selected winner hyperbox node B_l is learned by the Adjust(B_l) operation. This operation adjusts all B_l^i s according to the input a_i, where i = 1, 2, ..., n. The membership function weight W_{lj}^i (where 0 ≤ W_{lj}^i ≤ 1 and j = 1, 2, 3) represents the strength of ω_{lj}^i. A WFM ω_{lj}^i can thus be formed by (v_{lj-1}^i, W_{lj}^i, v_{lj+1}^i). As a result of the Adjust(B_l) operation, the vertices v_{lj}^i and weights W_{lj}^i in Fig. 3 are adjusted by the following expressions (6):

v_{lj}^i = v_{lj}^i + s \times \alpha \times E_{lj}^i \times \omega_{lj}^i(a_i) = v_{lj}^i + s \times \alpha \times E_{lj}^i \times \mu_{lj}^i(a_i) \times W_{lj}^i,

where
s = -1, \; E_{lj}^i = \min(|v_{lj}^i - a_i|, |v_{lj-1}^i - a_i|), \quad \text{if } v_{lj-1}^i \le a_i < v_{lj}^i
s = 1, \; E_{lj}^i = \min(|v_{lj}^i - a_i|, |v_{lj+1}^i - a_i|), \quad \text{if } v_{lj}^i \le a_i < v_{lj+1}^i
E_{lj}^i = 0, \quad \text{otherwise}

W_{lj}^i = W_{lj}^i + \beta \times (\mu_{lj}^i(a_i) - W_{lj}^i)        (6)

where α and β are the learning rates for v_{lj}^i and W_{lj}^i, respectively, both in the range from 0 to 1, and j = 1, 2, 3. Fig. 3 shows the BSWFMs before and after the Adjust(B_l) operation for B_l^i with an input a_i. The weights and the centers of the membership functions are adjusted by the Adjust(B_l) operation; e.g., W_{l1}^i, W_{l2}^i, and W_{l3}^i are moved down, v_{l1}^i and v_{l2}^i are moved toward a_i, and v_{l3}^i remains in the same location. The Adjust(B_l) operations are executed over a set of training data. If the classification rate for a set of test data does not reach a goal rate, the learning scheme with the Adjust(B_l) operation is repeated from the beginning by randomly reconstructing all WFMs in the B_l s and resetting all connection weights to 0 (w_li = 0), until the goal rate is reached.
Fig. 3. An Example of Before and After the Adjust(B_l) Operation for B_l^i
4 Experimental Results
In this section, the KOSPI data for 10 years, from January 1989 to December 1998, were used: about 80% of the data for training and about 20% for testing. Table 1 compares the numbers of features used by Kim and by NEWFM. Kim used 12 features, such as CCI, RSI, and Stochastic, whereas NEWFM uses 4 features, which consist of CPP_{n,5} and 3 approximation and detail coefficients produced by the Haar wavelet function. The four generalized features, selected from the 39 candidate input features by the non-overlap area distribution measurement method [3], are used to generate the fuzzy rules (BSWFM) that forecast the KOSPI time series. The method measures the degree of salience of the ith feature from the non-overlapped areas of the two classes' area distributions, by the following equation:

f(i) = \frac{(Area_U^i + Area_L^i)^2}{\max(Area_U^i, Area_L^i)}        (7)
where Area_U^i and Area_L^i are the upper-phase superior area and the lower-phase superior area, respectively. As an example, the Area_U and Area_L for the CPP_{n,5} feature are shown in Fig. 4. The larger the value of f(i), the more salient the feature.
Fig. 4. AreaU (White) and AreaL (Black) for CPP_{n,5}

Table 1. Comparisons of Features of Kim with NEWFM

Kim:     12 features such as CCI, RSI, Stochastic, etc.
NEWFM:   4 features, namely CPP_{n,5} and 3 approximation and detail coefficients from CPP_{n,5} to CPP_{n-31,5}
The accuracy of NEWFM is evaluated on the same data sets that were used by Kim, who proposed support vector machines (SVMs) to predict this financial time series and compared SVMs with back-propagation (BP) neural networks [2]. Table 2 displays the accuracy rates on the test data (about 20% of the data, from January 1989 to December 1998).

Table 2. Comparisons of Performance Results for Kim with NEWFM

                NEWFM      SVM        BP
Accuracy rate   59.0361%   57.8313%   54.7332%
In this experiment, two hyperboxes are created for classification. The hyperbox containing one set of lines (BSWFM) in Fig. 5 is the rule for class 1 (lower phase), while the other hyperbox, containing the other set of lines (BSWFM), is the rule for class 2 (upper phase). The graph in Fig. 5 is obtained from the training process of the NEWFM program and shows graphically the difference between the lower phase and the upper phase for each input feature. Lower phase means that the next day's index is lower than today's index; upper phase means that the next day's index is higher than today's index. The forecasting result of NEWFM can also be represented as a trend line using weighted-average defuzzification (the fuzzy model suggested by Takagi and Sugeno in 1985 [13]). Fig. 6 shows the trend line of the forecasting result from January 1989 to December 1998 together with the KOSPI. This result generally exhibits fluctuations similar to those of the KOSPI.
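A minimal sketch of the weighted-average defuzzification step: the eq. (5) outputs of the two class hyperboxes are treated as rule firing strengths, and the class consequent values are placeholders, since the text does not state the constants it uses.

```python
def trend_value(w_lower, w_upper, y_lower=-1.0, y_upper=1.0):
    """Takagi-Sugeno weighted-average defuzzification: w_lower/w_upper are
    the Output(B_l) values of the lower- and upper-phase hyperboxes, and
    y_lower/y_upper are assumed class consequents (not given in the text)."""
    return (w_lower * y_lower + w_upper * y_upper) / (w_lower + w_upper)
```

Plotting this value over time yields a trend line of the kind shown in Fig. 6.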
Fig. 5. Trained BSWFM of the Generalized Four Features for Lower Phase and Upper Phase Classification of KOSPI

Fig. 6. Comparison of the Original KOSPI and the Fuzzy Model Suggested by Takagi and Sugeno
5 Concluding Remarks
This paper proposes a new forecasting model based on the neural network with weighted fuzzy membership functions (NEWFM) and a KOSPI time series representation based on weighted-average defuzzification, the fuzzy model suggested by Takagi and Sugeno [13]. NEWFM is a new neural network model that improves
forecasting accuracy rates by using self-adaptive weighted fuzzy membership functions. The degree of classification intensity is obtained from the bounded sum of the weighted fuzzy membership functions extracted by NEWFM, and weighted-average defuzzification is then used for forecasting the KOSPI time series. In this paper, the Haar wavelet function is used as the mother wavelet to extract input features. The four input features extracted by the non-overlap area distribution measurement method [3] are used to forecast the KOSPI using the Haar WT. The total number of samples is 2928 trading days, from January 1989 to December 1998; about 80% of the data is used for training and 20% for testing. The resulting classification rate is 59.0361%. As shown in Table 2, NEWFM outperforms SVMs by 1.2048% on the holdout data. Although further study will be necessary to improve the accuracy of the stock forecasting capability, a buy-and-hold investment strategy can be planned using the trend line of the KOSPI. To improve the accuracy of the stock forecasting capability, further study of statistical indicators such as CCI and the normal distribution will be needed.
References 1. Wang, X., Phua, P.K.H., Lin, W.: Stock market prediction using neural networks: Does trading volume help in short-term prediction? In: Proceedings of the International Joint Conference on Neural Networks, 2003, July 20-24, vol. 4, pp. 2438–2442 (2003) 2. Kim, K.-j.: Financial time series forecasting using support vector machines. Neurocomputing 55, 307–309 (2003) 3. Lim, J.S., Ryu, T.-W., Kim, H.-J., Gupta, S.: Feature Selection for Specific Antibody Deficiency Syndrome by Neural Network with Weighted Fuzzy Membership Functions. In: Wang, L., Jin, Y. (eds.) FSKD 2005. LNCS (LNAI), vol. 3614, pp. 811–820. Springer, Heidelberg (2005) 4. Ishibuchi, H., Nakashima, T.: Voting in Fuzzy Rule-Based Systems for Pattern Classification Problems. Fuzzy Sets and Systems 103, 223–238 (1999) 5. Nauk, D., Kruse, R.: A Neuro-Fuzzy Method to Learn Fuzzy Classification Rules from Data. Fuzzy Sets and Systems 89, 277–288 (1997) 6. Setnes, M., Roubos, H.: GA-Fuzzy Modeling and Classification: Complexity and Performance. IEEE Trans. Fuzzy Systems 8(5), 509–522 (2000) 7. Chai, S.H., Lim, J.S.: Economic Turning Point Forecasting Using Fuzzy Neural Network and Non-Overlap Area Distribution Measurement Method. The Korean Economic Association 23(1), 111–130 (2007) 8. Mallat, S.: Zero Crossings of a Wavelet Transform. IEEE Trans. on Information Theory 37, 1019–1033 (1991) 9. Lim, J.S., Wang, D., Kim, Y.-S., Gupta, S.: A neuro-fuzzy approach for diagnosis of antibody deficiency syndrome. Neurocomputing 69(7-9), 969–974 (2006) 10. Bergerson, K., Wunsch, D.C.: A commodity trading model based on a neural networkExpert system hybrid. In: Proceedings of the IEEE International Conference on Neural Networks, pp. I289–I293 (1991) 11. Gestel, T.V., et al.: Financial Time Series Prediction Using Least Squares Support Vector Machines Within the Evidence Framework. IEEE Trans. Neural Networks 12(4), 809–821 (2001)
12. Lim, J.S.: Finding Fuzzy Rules by Neural Network with Weighted Fuzzy Membership Function. International Journal of Fuzzy Logic and Intelligent Systems 4(2), 211–216 (2004) 13. Tagaki, T., Sugeno, M.: Fuzzy Identification of System and Its Applications to Modeling and Control. IEEE Trans. SMC 15, 116–132 (1985) 14. Carpenter, G.A., Grossberg, S., Reynolds, J.: ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks 4, 565–588 (1991) 15. Jang, R.: ANFIS: Adaptive network-based fuzzy inference system. IEEE Trans. Syst., Man, Cybern. 23, 665–685 (1993) 16. Wang, J.S., Lee, C.S.G.: Self-Adaptive Neuro-Fuzzy Inference System for Classification Applications. IEEE Trans., Fuzzy Systems 10(6), 790–802 (2002) 17. Simpson, P.: Fuzzy min-max neural networks-Part 1: Classification. IEEE Trans., Neural Networks 3, 776–786 (1992)
The Convergence Analysis of an Improved Artificial Immune Algorithm for Clustering* Jianhua Tong, Hong-Zhou Tan, and Leiyong Guo School of Information Science & Technology Sun Yat-Sen University Guangzhou, Guangdong, China, 510275 [email protected], [email protected]
Abstract. Immune algorithms have been used widely and successfully in many computational intelligence areas, including clustering. Given the large number of variants of each operator of this class of algorithms, this paper presents a study of the convergence properties of an improved artificial immune algorithm for clustering (the DCAAIN algorithm), which offers better clustering quality and a higher data compression rate than some current clustering algorithms. It is proved, using Markov chain theory, that DCAAIN is completely convergent. The simulation results verify the steady convergence of DCAAIN by comparison with similar algorithms.
Keywords: immune algorithm, complete convergence, Markov chain.
1 Introduction
From the information processing perspective, the immune system is a massively parallel and self-adaptive system which can defend against invading antigens effectively and allow various antigens to coexist [1]. It has become a valuable research area because it exhibits diversity, distributivity, dynamics, self-adaptability, robustness, and so on [2]. Recently, researchers have put forward numerous models and algorithms that solve problems in engineering and science, such as clustering, by emulating the information processing ability of the immune system. De Castro proposed a clonal selection algorithm (aiNet) [2] based on the clonal selection principle and the affinity maturation process; the algorithm is shown to be capable of solving the clustering task. Tang improved the aiNet algorithm [3] by combining it with the k-means and HAC algorithms. Compared with the above algorithms, the DCAAIN algorithm [4] has made great improvements in incremental clustering ability, self-adaptability, and diversity. However, research on the theoretical side, such as convergence analysis, is rather scarce. In fact, such analysis is very helpful in pointing out directions for improving the performance of immune algorithms and in providing insights for immunity-based system applications. In this paper, we adopt DCAAIN as the basis of the mathematical model and analyze its complete convergence. The proof is based on the use of Markov chains and related probability theory.
* This work was supported in part by the National Natural Science Foundation of China under grant No. 60575006.
The remainder of the paper is organized as follows. Section 2 describes the proposed DCAAIN algorithm. In Section 3, we analyze the complete convergence of DCAAIN based on Markov chains. Typical tests are used to validate the proposed algorithm and to verify the correctness of the theoretical analysis in Section 4. Finally, we present some conclusions.
2 The Proposed Algorithm
The algorithm is population based, like any typical evolutionary algorithm. Each individual of the population is a candidate solution belonging to the fitness landscape of a given computational problem. We expand the single population into multiple populations by performing antibody clustering. In each subpopulation, a parallel subspace search is realized by performing competitive cloning and selection. We introduce hypermutation, antibody elimination, and supplement operators in each subpopulation in order to improve the mature progenies and to suppress similar antibodies, keeping only the one with maximum affinity. Thus the remaining individuals have better fitness than the initial population. We also introduce the barycenter of each cluster in order to obtain a high data compression rate and an incremental clustering ability. Finally, we introduce newcomers, which expand the search space so as to find the globally precise solution. The main process of the DCAAIN algorithm is described as follows.
Step 1: Initialization: Create an initial population randomly and get N antibodies Ab ∈ S^{N×L}.
Step 2: Clustering: Cluster the antibody population into M antibody clusters.
Step 3: Competition selection: Perform competitive selection in each cluster and put the currently best antibody (maximum fitness) and the cluster barycenter (representing the cluster center) of each cluster into the elite set, composing the subpopulation Ab_c of size T (T = 2M), Ab_c ∈ S^{T×L}. The restricted population ultimately helps find a local optimum solution for each elite cluster member.
Step 4: Clonal proliferation: Reproduce each individual of Ab_c n times, obtaining the population Ab_c of size N_c (Ab_c ∈ S^{N_c×L}).
Step 5: Hypermutation: Apply hypermutation to some of the expanded individuals, thus obtaining the N_c population Ab_m (Ab_m ∈ S^{N_c×L}).
Step 6: Suppression and supplement: Among individuals whose mutual distances are less than the suppression threshold σ_s, eliminate all but the one with maximum fitness, thus obtaining Ab_d (Ab_d ∈ S^{N_d×L}, N_d ≤ N_c). Create N_r newcomers randomly and choose the N_s individuals with better fitness (N_s ≤ N_r and 5% < N_s/N < 10%) to constitute the next population together with Ab_d.
Step 7: Convergence check: Repeat Steps 3-6 until most solutions are no longer improved (a code sketch of the full loop follows).
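The sketch below illustrates the loop of Steps 1-7. It is not the authors' implementation: the clustering, hypermutation, and suppression operators are simple stand-ins (a single k-means-style assignment, Gaussian perturbation, distance-based pruning), Step 6 is simplified by adding all newcomers, and every parameter value is illustrative.

```python
import numpy as np

def dcaain(fitness, dim, N=100, M=10, n_clones=5, sigma_s=0.1,
           Nr=20, cycles=50, mut=0.1, lo=-2.0, hi=2.0, seed=0):
    """Skeleton of the DCAAIN loop (Steps 1-7) with stand-in operators."""
    rng = np.random.default_rng(seed)
    Ab = rng.uniform(lo, hi, (N, dim))                        # Step 1
    for _ in range(cycles):
        centers = Ab[rng.choice(len(Ab), M, replace=False)]   # Step 2
        labels = ((Ab[:, None, :] - centers) ** 2).sum(-1).argmin(1)
        elites = []
        for c in range(M):                                    # Step 3
            members = Ab[labels == c]
            if len(members):
                best = members[np.argmax([fitness(x) for x in members])]
                elites.extend([best, members.mean(axis=0)])   # best + barycenter
        sub = np.repeat(np.array(elites), n_clones, axis=0)   # Step 4: clone
        mutants = sub + rng.normal(0.0, mut, sub.shape)       # Step 5
        pool = np.vstack([sub, mutants])
        order = np.argsort([-fitness(x) for x in pool])       # best first
        kept = []                                             # Step 6: suppress
        for x in pool[order]:
            if all(np.linalg.norm(x - y) >= sigma_s for y in kept):
                kept.append(x)
        newcomers = rng.uniform(lo, hi, (Nr, dim))            # supplement
        Ab = np.vstack([np.array(kept), newcomers])
    return Ab

# Usage on the test function of Section 4 (eq. 3):
f = lambda p: p[0]*np.sin(4*np.pi*p[0]) - p[1]*np.sin(4*np.pi*p[1] + np.pi) + 1
final_pop = dcaain(f, dim=2)
```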
3 Convergence Analysis
The transformation of the states in the algorithm can be described by the following stochastic process:

T: Ab(t) \xrightarrow{\text{cluster}} Ab'(t) \xrightarrow{\text{clone}} Ab_c(t) \xrightarrow{\text{mutation}} Ab_m(t) \xrightarrow{\text{selection, suppression}} Ab_d(t) \cup N_s \xrightarrow{\text{supplement}} Ab(t+1)        (1)

where N_s denotes the new individuals which are added randomly. Markov chains offer an appropriate model for analyzing probabilistic convergence properties. Obviously, the transformation from state Ab(t) to Ab(t+1) constitutes a Markov chain: the state Ab(t+1) has nothing to do with earlier states and depends only on Ab_d(t) ∪ N_s, so the stochastic process {A(n), n ≥ 1} is a Markov chain. The population sequence {A(n), n ≥ 0} of this algorithm is a finite-state Markov chain. In the algorithm, the initial population size is n and the antibodies are clustered into m subpopulations. Let s_i ∈ S denote the states of the chain, let f be the fitness function of the variable X, and let s' = {x ∈ X | f(x) = max f(x)} be the set of global optima. We define the algorithm to be completely convergent with probability one if

\lim_{t \to \infty} \sum_{s_i \cap s' \neq \emptyset} p\{A_t^i\} = 1        (2)
Proof: Let p_{ij}(t) denote the transition probabilities of the stochastic process {A(t)}, where p_{ij}(t) = p\{A_{t+1}^j \mid A_t^i\} \ge 0, and let I = \{i \mid s_i \cap s' = \emptyset\} be the set of states that contain no global optimum. Writing p_i(t) = p\{A_t^i\} and p_t = \sum_{i \in I} p_i(t), the Markov property gives

p_{t+1} = \sum_{s_i \in S} \sum_{j \in I} p_i(t)\, p_{ij}(t) = \sum_{i \in I} \sum_{j \in I} p_i(t)\, p_{ij}(t),

where the second equality holds because the best solutions are maintained, so a state containing an optimum cannot transit to a state in I. Since

\sum_{i \in I} \sum_{j \in I} p_i(t)\, p_{ij}(t) + \sum_{i \in I} \sum_{j \notin I} p_i(t)\, p_{ij}(t) = \sum_{i \in I} p_i(t) = p_t,

we have

\sum_{i \in I} \sum_{j \in I} p_i(t)\, p_{ij}(t) = p_t - \sum_{i \in I} \sum_{j \notin I} p_i(t)\, p_{ij}(t),

and therefore

0 \le p_{t+1} = p_t - \sum_{i \in I} \sum_{j \notin I} p_i(t)\, p_{ij}(t) \le p_t \le 1.

Hence p_{t+1} \le p_t, and since hypermutation and the random supplement give a strictly positive probability of leaving I at every step, we conclude that \lim_{t \to \infty} p_t = 0. Therefore

1 \ge \lim_{t \to \infty} \sum_{s_i \cap s' \neq \emptyset} p_i(t) \ge 1 - \lim_{t \to \infty} p_t = 1,

so this algorithm is completely convergent with probability one.
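The key step, namely that the non-optimal probability mass p_t vanishes once the optimal states are absorbing (elitism) and reachable (mutation), can be illustrated numerically on a toy transition matrix (all values illustrative):

```python
import numpy as np

# States 0 and 1 form I (no optimum); state 2 contains the optimum and is
# absorbing, while every non-optimal state can reach it with probability > 0.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.6, 0.2],
              [0.0, 0.0, 1.0]])

p = np.array([0.5, 0.5, 0.0])          # initial distribution over states
for _ in range(50):
    p = p @ P                           # one Markov step
print(f"p_t (mass on I) after 50 steps: {p[:2].sum():.2e}")
# -> practically 0, so the probability of the optimal states tends to 1
```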
Because of the antibody clustering operator, the N antibodies of the population are divided into M subpopulations, and within each local cluster the clonal selection, mutation, and suppression operations can be performed in parallel. The time complexity of the algorithm during each cycle is O((N + X) · M · K²), in which the clonal selection and mutation part is O(K²), the clustering part is O(M), and the similarity suppression part is O(N + X), where N and X denote the numbers of antibodies during one cycle and may differ from cycle to cycle.
4 Simulation and Discussion
In order to validate the algorithm on a multimodal optimization problem and to verify the correctness of the theoretical analysis, DCAAIN is executed on a typical multimodal function and compared with aiNet. The function is described as follows:

Max f(x, y) = x · sin(4πx) − y · sin(4πy + π) + 1,  x, y ∈ [−2, 2]        (3)

f is a multimodal function with several local optima and a single global optimum, all distributed non-uniformly. For ease of comparison, we choose the same initial population size N = 100 for DCAAIN and aiNet; see [4] for a detailed description of the other parameters.

Table 1. The results of two algorithms
Algorithm   ItCav   ItCmin   ItCmax   F-measure   Entropy
aiNet       63.5    48       77       0.547       0.121
DCAAIN      32.5    23       38       0.852       0.184

where ItCav is the average number of iterations to locate the global optimum, ItCmin is the smallest number of iterations to locate the global optimum, ItCmax is the largest number of iterations to locate the global
ItC max is the most iterations to local global
optimum; F-measure and Entropy describes the precision of clustering. From table 1, we can learn that DCAAIN locates all the peaks in each experiment and the iteration of convergence is relatively steady. Note that DCAAIN on average,the least and the most, requires less number of iterations to local the global optimum than aiNet. It means that the time of convergence of DCAAIN is less than that of aiNet.. The
The Convergence Analysis of an Improved Artificial Immune Algorithm for Clustering
189
more of F-measure and the less of Entropy means higher quality of convergence. Obviously, DCAAIN is better than aiNet
5 Conclusion
In this paper, we analyze the complete convergence of DCAAIN using Markov chains and related probability theory, and prove that DCAAIN converges completely with probability one. Through theoretical analysis and simulation results, we note that DCAAIN can reach a diverse set of locally optimal solutions thanks to its special mutation and selection methods. All the subpopulations approach the peaks gradually, following the lead of their dynamic centers. The best solution in every subpopulation is maintained by keeping the original individual unmutated, which ensures the convergence of the algorithm. Since Markov chains have progressively been used in the analysis of evolutionary algorithms on combinatorial optimization problems with practical applications, a similar strategy for analyzing other immune algorithms is worth considering.
References

1. Frank, S.A.: The design of natural and artificial adaptive systems. In: Rose, M.R., Lauder, G.V. (eds.). Academic Press, New York (1996)
2. De Castro, L.N., Von Zuben, F.J.: Artificial immune systems: part II - a survey of applications. Technical Report, p. 65 (2000)
3. Tang, N., Rao Vemuri, V.: An Artificial Immune System Approach to Document Clustering. In: ACM Symposium on Applied Computing, pp. 918–922 (2005)
4. Tong, J., Tan, H.-Z.: A Document Clustering Algorithm Based on Artificial Immune Network. Computer Engineering and Science 29(10), 17–19 (2007)
Artificial Immune System-Based Music Genre Classification

D.N. Sotiropoulos, A.S. Lampropoulos, and G.A. Tsihrintzis

University of Piraeus, Department of Informatics, 80 Karaoli and Dimitriou St, Piraeus 18534, Greece
{dsotirop,arislamp,geoatsi}@unipi.gr
Abstract. We present a novel approach to the problem of automated music genre classification, which utilizes an Artificial Immune System (AIS)-based classifier. Our inspiration lies in the observation that the natural immune system has the intrinsic property of self/non-self cell discrimination, especially when the non-self (complementary) space of cells is significantly larger than the class of self cells. The AIS-based classifier that we have built is compared with KNN-, RBF- and SVM-based classifiers in various experiments involving music data. We find that the performance of our classifier is similar to that of the other classifiers when tested on multi-class (e.g., four-class) problems. On the other hand, it exceeds the performance of the other classifiers by a significant margin when tested on two-class problems.
1 Introduction

Recent advances in digital storage technology and the rapid increase in the amount of digital music files have led to the creation of large music collections for use by broad classes of computer users. In turn, this fact gives rise to a need for systems that can manage and organize large collections of stored music files efficiently. Many currently available music search engines and peer-to-peer systems (e.g., Kazaa, eMule, Torrent) rely on textual meta-information, such as file names and ID3 tags, as the retrieval mechanism. This textual description of audio information is subjective and does not make use of the musical content; moreover, the relevant meta-data have to be entered and updated manually, which implies significant effort in both creating and maintaining the music database. Therefore, an automated process that extracts information from the actual music data and, thus, organizes the data automatically could overcome some of the problems that arise in current Music Information Retrieval (MIR) systems. An important and difficult task in MIR systems is musical genre classification. The boundaries between genres are fuzzy, which makes the problem of automatic classification highly non-trivial. For the purpose of automatic music genre classification, we have developed a procedure which relies on Artificial Immune Systems (AIS). In general, AIS provide metaphors for the development of high-level abstractions of functions or mechanisms that can be utilized to solve real-world (pattern recognition) problems. AIS-based clustering [1] and
classification algorithms are characterized by an ability to adapt their behavior so as to cope efficiently with extremely complex and continuously changing environments. Their main advantage over other classifier systems is their intrinsic property of self/non-self discrimination, especially when the class of non-self (complementary) patterns is significantly larger than the class of self patterns. In this paper, we develop an Artificial Immune Network (AIN) for classifying a set of labeled multidimensional music feature vectors extracted from a music database and assess its classification performance. Specifically, the paper is organized as follows: Section 2 is devoted to a review of related work on music genre classification, while Section 3 presents a review of the basic concepts of (natural and artificial) immune systems and relevant learning algorithms. Section 4 describes our experimental results on music data on testing the performance of our AIN vs. the performance of KNN-, RBF- and SVM-based classifiers. Finally, we draw conclusions and point to future related work in Section 5 of the paper.
2 Related Work on Music Genre Classification

There have been several works on automatic musical genre classification. These systems usually consist of two modules, namely a feature extraction module and a classifier module. Existing works have dealt both with the extraction of features from the audio signal and with the performance of classification schemes trained on the extracted features. In [2], three different feature sets are evaluated for music genre classification using 10 genres. The 30-dimensional feature vector represents timbral texture, rhythmic content and pitch content. Experiments are made with a Gaussian classifier, a Gaussian mixture model and a K-nearest neighbor classifier. The best combination of features and classifier achieved a correct classification rate of 61%. On the other hand, Li et al. [3] propose a new feature extraction method, which relies on Daubechies Wavelet Coefficient Histograms (DWCH). The effectiveness of this feature is evaluated using machine learning algorithms such as Support Vector Machines (SVMs), K-Nearest Neighbour (KNN), Gaussian Mixture Models (GMMs) and Linear Discriminant Analysis (LDA). It is shown that DWCHs improve the accuracy of music genre classification significantly: on the dataset provided by [3], the classification accuracy increased from 61% to almost 80%. In [4], short-time features are compared to two novel psychoacoustic feature sets for the classification of five general audio classes as well as seven music genres. It is found that the psychoacoustic features outperform the power spectrum features, and that the temporal evolution of the short-time features improves performance. Support Vector Machines have been used in the context of genre classification in [5] and [6]. In [5], SVMs are used for genre classification with a Kullback-Leibler divergence-based kernel to measure the distance between songs. In [6], genre classification is done with a mixture of SVM experts. A mixture of experts solves a classification problem by using a number of classifiers to decompose it into a series of sub-problems. Not only does it reduce the complexity of each
single task, but it also improves the global accuracy by combining the results of the different classifiers (SVM experts). Individual songs are modelled as GMMs, trained using k-means instead of the Expectation Maximization (EM) algorithm [7]. The authors approximate the KL divergence between GMMs as the earth mover's distance based on the KL divergences of the individual Gaussians in each mixture. Since their system is described as a distance measure, there is no mention of an explicit classifier; finally, they generate playlists with the nearest neighbors of a seed song. The performance of genre classification can be improved by combining spectral similarity with complementary information, as in [8]. In particular, the authors combine spectral similarity with fluctuation patterns and derive two new descriptors named "Focus" and "Gravity". They state that fluctuation patterns describe loudness fluctuations in frequency bands, as well as characteristics which are not described by spectral similarity measures. For classification, the nearest neighbour classifier is used, and an average classification performance increase of 14% is obtained. Artificial Neural Networks have been used for musical genre classification in [9, 10]. In [9], a musical genre classification system was presented which processed audio features extracted from signals corresponding to distinct musical sources. An important difference from previous related works is that a sound source separation method was first applied to decompose the signal into a number of component signals. Then, timbral, rhythmic and pitch features were extracted from the distinct instrument sources and used to classify a music excerpt. The genre classifiers were built as multilayer perceptrons. Results showed that this approach yielded an improvement of 2%-2.5% in correct music genre classification. Finally, Turnbull and Elkan [10] explore radial basis function (RBF) networks for musical genre classification by using a combination of unsupervised and supervised initialization methods. These initialization methods yield classifiers that are as accurate as RBF networks trained with gradient descent (which is hundreds of times slower). The experiments in that paper show that RBF networks initialized with a combination of methods can yield good classification performance without relying on gradient descent. In the present paper, we propose a new approach to musical genre classification based on the construction of an AIS-based classifier. Our AIS-based classifier utilizes the supervised learning algorithm proposed by Watkins and Timmis [11]. More specifically, we aim at exploiting the inherent information processing capabilities of the natural immune system through the implementation of an Artificial Immune Recognition System. The essence of our work is to demonstrate the classification efficiency of the constructed classifier when required to assign multi-dimensional music feature vectors to corresponding music categories. The classification accuracy measurements presented in this paper justify the use of the AIS-based classifier over other classifiers, such as KNN, RBF, or SVM.
3 AIS-Based Classification

AIS-based classification relies on a computational imitation of the biological process of self/non-self discrimination, that is, the capability of the adaptive biological immune system to classify a cell as a "self" or "non-self" cell. Any cell, or even an individual molecule, recognized and classified by the self/non-self discrimination process is called an antigen. A non-self antigen is called a pathogen and, when identified, an immune response (specific to that kind of antigen) is elicited by the adaptive immune system in the form of antibody secretion. The essence of the antigen recognition process is the affinity (molecular complementarity level) between the antigen and antibody molecules. The strength of the antigen-antibody interaction (stimulation level) is measured by the complementarity of their match and, thus, pathogens are not fully recognized, which makes the adaptive immune system tolerant to molecular noise. Learning in the immune system is established by the clonal selection principle [12], which suggests that only those antibodies exhibiting the highest level of affinity with a given antigen will be selected to proliferate and grow in concentration. Moreover, the selected antibodies also undergo a somatic hypermutation process [12], that is, a genetic modification of their molecular receptors which allows them to learn to recognize a given antigen more efficiently. This hypermutation process is termed affinity maturation [12] and results in the development of long-lasting memory cells which guarantee a faster and more accurate immune response when presented with antigenic patterns similar to those they were originally exposed to. This evolutionary procedure of developing memory antibodies lies at the core of the training process of our AIS-based classifier, applied to each class of antigenic patterns. The evolved memory cells provide an alternative problem domain representation, since they constitute points in the original feature space that do not coincide with the original training instances. However, the validity of this alternative representation follows from the fact that the memory antibodies produced recognize the corresponding set of training patterns in each class, in the sense that their average affinity to them is above a predefined threshold. To quantify immune recognition, we consider all immune events as taking place in a shape-space S, constituting a multi-dimensional metric space in which each axis stands for a physico-chemical measure characterizing molecular shape [12]. Specifically, we utilized a real-valued shape-space in which each element of the AIS-based classifier is represented by a real-valued vector of 30 elements, thus S = R^30. The affinity/complementarity level of the interaction between two elements of the constructed immune-inspired classifier was computed on the basis of the Euclidean distance between the corresponding vectors in R^30. The antigenic pattern set to be recognized by the memory antibodies produced during the training phase of the AIS-based classifier is composed of the set of representative antibodies, which maintain the spatial structure of the set of all data in the music database, yet form a minimal representation of them. The AIS-based classifier [11] operates as follows:
1. Initialization Phase
   a) Normalization
   b) Affinity Threshold Estimation
   c) Seeding
2. Training Phase. For each class of training patterns, and for each antigenic pattern of that class:
   a) Matching Memory Cell Identification
   b) Antibodies Generation
   c) while (StoppingCriterion == False):
      - Resource Allocation
      - Removal of Less Stimulated Antibodies
      - Generation of Mutated Antibody Offspring
   d) Candidate Memory Cell Identification
   e) Memory Cell Introduction
3. Classification

The initialization phase of the algorithm constitutes a preprocessing stage combined with a parameter discovery stage. All available data items pertaining to the training set are normalized so that the Euclidean distance between any two feature vectors lies within the [0,1] interval. The affinity threshold computation step consists in estimating the average affinity value over all the training data. The Affinity Threshold, multiplied by an auxiliary control parameter called the Affinity Threshold Scalar (with a value between 0 and 1), provides a cut-off value for the replacement of memory cells during the training phase. The final step of the initialization procedure involves the seeding of the memory cells and the pool of available antibodies by randomly choosing 0 or more antigenic patterns from the training set. Once the initialization phase is completed, training proceeds as a one-shot incremental learning algorithm in which each antigenic pattern of each class is presented to the algorithm only once. This portion of the algorithm focuses on developing a candidate memory antibody, for the antigen currently being processed, from the pool of available antibodies. This is realized through three mechanisms: 1) competition for resources, 2) mutation, and 3) the adoption of an average stimulation threshold as a stopping criterion that determines when the training on a given antigen is completed. Resources are allocated to a given antibody based on its stimulation level with respect to the current antigen, which may be thought of as an indicator of its efficiency as a recognizer. Mutation enforces diversification and shape-space exploration. Matching Memory Cell Identification involves determining the memory cell having the strongest bond to the current training data item, as quantified by its stimulation level. The matching memory cell is subsequently used to generate new mutated versions of the original cell that are placed into the pool of available antibodies; this process constitutes the Antibodies Generation step. The number of mutated clones a given memory cell is allowed to inject into the cell population is controlled by the hypermutation rate. This
number is proportional to the product of its stimulation level and the hypermutation rate. Each antibody in the pool of available antibodies is allocated a finite number of resources during the resource allocation step, based on its stimulation value and the clonal rate. The clonal rate is an additional control parameter of the algorithm which serves as a resource allocation factor when multiplied by the stimulation level of a given antibody. The total number of system-wide resources is limited by the number of resources allowed; if more resources are consumed by the pool of available antibodies, the least stimulated cells are removed from the system during the removal of less stimulated antibodies step. Regardless of whether the stopping criterion has been met, each antibody is given a chance to produce a set of mutated offspring, whose number is determined by multiplying the clonal rate by the stimulation value; this is done during the generation of mutated antibody offspring step. Once the training on a specific antigenic pattern is completed, the learning algorithm proceeds by identifying the candidate memory cell among the maturated antibodies: the candidate memory cell is the feature vector with the maximum stimulation level with respect to the current antigenic pattern. The memory cell identification stage also involves incorporating the candidate memory cell into the pool of available memory cells. Finally, the memory cell introduction step determines whether the matching memory cell is replaced by the candidate memory cell. After the training process has completed, the evolved memory cells are available for the classification of unseen music feature vectors in a k-nearest neighbour fashion: the system classifies new data items by a majority vote of the outputs of the k most stimulated memory antibodies.
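The following sketch illustrates the final classification stage just described. It assumes feature vectors normalized so that pairwise distances lie in [0, 1] (as in the initialization phase) and adopts the common AIRS convention stimulation = 1 - distance, which the text does not spell out; the function and variable names are illustrative, not the authors' implementation:

```python
import numpy as np
from collections import Counter

def stimulation(memory_cell, antigen):
    """1 - Euclidean distance in the 30-dimensional shape-space S = R^30.

    Assumes vectors were normalized during initialization so that all
    pairwise distances lie in [0, 1]; 'stimulation = 1 - affinity' is a
    common AIRS convention, assumed here."""
    return 1.0 - float(np.linalg.norm(memory_cell - antigen))

def classify(antigen, memory_cells, labels, k=10):
    """Majority vote among the k most stimulated evolved memory antibodies."""
    stims = [stimulation(mc, antigen) for mc in memory_cells]
    top_k = np.argsort(stims)[::-1][:k]
    return Counter(labels[i] for i in top_k).most_common(1)[0][0]

# Toy usage with made-up memory cells occupying two genre regions:
rng = np.random.default_rng(0)
cells = np.vstack([rng.random((20, 30)) * 0.02,          # genre A region
                   rng.random((20, 30)) * 0.02 + 0.15])  # genre B region
labels = ["classical"] * 20 + ["metal"] * 20
print(classify(np.full(30, 0.16), cells, labels, k=5))   # -> "metal"
```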
4 Experimental Results on Music Data

An audio signal may be represented in multiple ways, according to the specific features utilized to capture certain of its aspects. More specifically, there has been a significant amount of work in extracting features that are appropriate for describing and modeling music signals. In this paper, we have utilized a specific set of 30 objective features that were originally proposed by Tzanetakis and Cook [2, 13] and have dominated the literature in subsequent approaches in this research area. It is worth mentioning that these features not only provide a low-level representation of the statistical properties of the music signal, but also include high-level information extracted by psychoacoustic algorithms. In summary, these features represent rhythmic content (rhythm, beat and tempo information), as well as pitch content describing the melody and harmony of a music signal. The collection we have utilized in our experiments contains one thousand (1000) pieces from 10 classes of western music. This collection has been used as a test bed for assessing the relative performance of various musical genre
classification algorithms [2, 3]. Specifically, the collection contains one hundred (100) pieces, each of thirty-second duration, from each of the following ten (10) classes of western music:

Table 1. Classes of western music

Class ID   Label
1          Blues
2          Classical
3          Country
4          Disco
5          Hip-Hop
6          Jazz
7          Metal
8          Pop
9          Reggae
10         Rock
In order to evaluate our AIS-based classifier on music genre classification, we compared its classification performance against 1) Radial Basis Function (RBF) neural networks, 2) K-Nearest Neighbour (KNN) classifiers and 3) Support Vector Machines (SVMs). The NetLab toolbox was utilized to construct the RBF network and KNN classifiers, while the SVM classifier was implemented with the OSU-SVM toolbox. The AIS-based classifier was implemented in the MatLab programming environment. The RBF network consisted of fifty (50) neurons in the hidden layer; the number of neurons in the output layer is determined by the number of audio classes to be classified in each experiment. The network was trained with the Expectation Maximization algorithm for two hundred (200) cycles, and its output estimates the degree of membership of the input feature vector in each class; thus, the value at each output necessarily remains between 0 and 1. The KNN classifier was based on the class label prediction of the 10 nearest neighbours. The SVM classifier was based on a Gaussian kernel with the default parameters provided by the OSU-SVM toolbox. Classification results were calculated using 10-fold cross-validation, where the dataset to be evaluated was iteratively partitioned so that 90% was used for training and 10% for testing for each class. This process was iterated with different disjoint partitions and the results were averaged, ensuring that the calculated accuracy was not biased by any particular partitioning into training and testing sets. We conducted three experiments in order to measure the classification accuracy of each of the different classifiers. In the first experiment, a four-class classification problem was considered. The results presented in Table 2 illustrate the competitiveness of the AIS-based classifier, as it ranks high and second only to the SVM classifier.
Table 2. Experiment 1: Four-Class Classification

Trial   Four Classes   AIRS    KNN-10   RBF     SVM
1       1, 2, 3, 4     71.5    66.5     70.25   75.5
2       1, 2, 7, 10    70.75   66.5     65      71.25
3       4, 7, 8, 10    60      54.25    54      61.25
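A minimal sketch of the 10-fold cross-validation protocol described above, using scikit-learn stand-ins for the KNN and SVM classifiers; the actual experiments used the NetLab and OSU-SVM toolboxes, the AIRS classifier is not reproduced here, and the feature matrix below is a random placeholder:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# X: (n_songs, 30) feature matrix; y: genre labels 1..10 (placeholders).
X = np.random.rand(1000, 30)
y = np.random.randint(1, 11, size=1000)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in [
    ("KNN-10", KNeighborsClassifier(n_neighbors=10)),
    ("SVM (RBF kernel)", make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))),
]:
    scores = cross_val_score(clf, X, y, cv=cv)  # 90%/10% splits per fold
    print(f"{name}: {100 * scores.mean():.1f}% +/- {100 * scores.std():.1f}")
```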
In the second experiment, we addressed a two-class classification problem, in which we considered a "self" class (classical music) and a "non-self" class (everything else in the music database). As presented in Table 3, the number of complementary classes was gradually increased in each trial; in this way, we increased the diversity of the training data points belonging to the non-self class. As stated earlier, this experiment was driven by the fact that self/non-self discrimination constitutes an intrinsic property of the natural immune system, especially when the non-self (complementary) space of patterns is significantly larger than the set of patterns belonging to the self class. In this experiment, the AIS-based classifier outperforms the other classifiers in every trial.

Table 3. Experiment 2: Two-Class Classification (Self Class = Classic)

Trial   Self Class   Non-Self Class         AIRS   KNN-10   RBF    SVM
1       2            1                      92     91       89.5   91.5
2       2            1,3                    93.5   91.5     91     92.5
3       2            1,3,4                  92.5   91       88     92
4       2            1,3,4,5                94     89       87.5   90.5
5       2            1,3,4,5,6              92     88.5     88     90
6       2            1,3,4,5,6,7            93.5   89       88.5   90.5
7       2            1,3,4,5,6,7,8          93     88.5     89.5   89.5
8       2            1,3,4,5,6,7,8,9        90.5   89.5     88.5   90
9       2            1,3,4,5,6,7,8,9,10     93.5   90.5     85     91
In the third experiment, we explore the two-class classification problem further by considering only two classes in each trial, as follows. In trials 1 and 2, the music data come from the class pairs classic-rock and classic-metal, respectively, which have a low degree of overlap in feature space, so a higher classification accuracy is expected. In contrast, in trials 3 and 4, the music data come from the class pairs disco-pop and rock-metal, respectively, which have a high degree of overlap in feature space, so a lower classification accuracy is expected. Results of the third experiment are presented in Table 4, in which the AIS-based classifier is seen to outperform all other classifiers in every trial.

Table 4. Experiment 3: Two-Class Classification

Trial              Self Class   Non-Self Class   AIRS   KNN-10   RBF    SVM
1 (Low Overlap)    2            10               93     89.5     91     90
2 (Low Overlap)    2            7                97     95.5     95.5   95
3 (High Overlap)   4            8                79.5   74.5     75     75.6
4 (High Overlap)   10           7                79     74.5     74     77

The experiments described in this section show clearly that the classification accuracy of the AIS-based classifier increases and surpasses that of the other classifiers when the number of classes in the classification problem is reduced to two. Specifically, the results of the second experiment demonstrate that the mean classification accuracy over the 9 trials is higher for the AIS-based classifier (92.72%) than for the SVM (90.83%), while they have the same standard deviation (1.0).
5 Conclusions

In this paper, we propose a new approach to the problem of music genre classification based on the construction of an Artificial Immune Recognition System which incorporates the inherent self/non-self discrimination ability of the natural immune system. The AIS-based classifier that we have built is compared with KNN-, RBF- and SVM-based classifiers in various experiments involving music data. We find that the performance of our classifier is similar to that of the other classifiers when tested on multi-class (e.g., four-class) problems. On the other hand, it exceeds the performance of the other classifiers by a significant margin when tested on two-class problems. In the future, we will improve the AIS-based classifier further and evaluate its performance over other classification scenarios and various types of data sets. We will also investigate the appropriateness of its use in recommender systems. This and other related work is currently under way and will be reported shortly.
References

1. Sotiropoulos, D.N., Lampropoulos, A.S., Tsihrintzis, G.A.: Artificial immune system-based music piece similarity measures and database organization. In: Proc. 5th EURASIP Conference on Speech and Image Processing, Multimedia Communications and Services, Smolenice, Slovak Republic (June 2005)
2. Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10(5) (July 2002)
3. Li, T., Ogihara, M., Li, Q.: A comparative study on content-based music genre classification. In: Proc. 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada (August 2003)
4. McKinney, M.F., Breebaart, J.: Features for audio and music classification. In: Proc. 4th International Conference on Music Information Retrieval, Washington, D.C., USA (October 2003)
5. Mandel, M., Ellis, D.: Song-level features and support vector machines for music classification. In: Proc. 6th International Conference on Music Information Retrieval, London, UK (September 2005)
6. Lidy, T., Rauber, A.: Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In: Proc. 6th International Conference on Music Information Retrieval, London, UK (September 2005)
7. Logan, B., Salomon, A.: A music similarity function based on signal analysis. In: Proc. International Conference on Multimedia and Expo, Tokyo, Japan (2003)
8. Pampalk, E., Flexer, A., Widmer, G.: Improvements of audio-based music similarity and genre classification. In: Proc. 6th International Conference on Music Information Retrieval, London, UK (September 2005)
9. Lampropoulos, A.S., Lampropoulou, P.S., Tsihrintzis, G.A.: Musical genre classification enhanced by source separation techniques. In: Proc. 6th International Conference on Music Information Retrieval, London, UK, September 2005, pp. 576–581 (2005)
10. Turnbull, D., Elkan, C.: Fast recognition of musical genres using RBF networks. IEEE Transactions on Knowledge and Data Engineering 17(4) (2005)
11. Watkins, A., Timmis, J.: Artificial immune recognition system (AIRS): An immune-inspired supervised learning algorithm. Genetic Programming and Evolvable Machines 5, 291–317 (2004)
12. Castro, L.N., Timmis, J.: Artificial Immune Systems: A New Computational Intelligence Approach. Springer, Heidelberg (2002)
13. Tzanetakis, G., Cook, P.: Marsyas: A framework for audio analysis. Organised Sound 4(3) (2000)
Semantic Information Retrieval Dedicated to Multimedia Systems: A Platform Based on Conceptual Graphs

Xavier Aimé and Francky Trichet

LINA, Laboratoire d'Informatique de Nantes Atlantique (UMR-CNRS 6241)
University of Nantes - Team Knowledge and Decision (KOD)
2, rue de la Houssinière - BP 92208 - 44322 Nantes cedex 03, France
{xavier.aime,francky.trichet}@univ-nantes.fr
Abstract. OSIRIS is a web platform dedicated to the development of Ontology-based Systems for Semantic Information Retrieval and Indexation of multimedia resources which are shared within communautary and open web spaces. Based on the use of both heavyweight ontologies and thesauri, OSIRIS allows the end-user (1) to describe the semantic content of his resources by using an intuitive natural-language-based model of annotation founded on the triple (Subject, Verb, Object), and (2) to formally represent these annotations by using Conceptual Graphs. Moreover, each resource can be described from multiple points of view, which usually correspond to different end-users. These different points of view can be defined by using multiple ontologies, which can be related to connected (or unconnected) domains. Developed from the integration of Semantic Web technologies and Web 2.0 technologies, OSIRIS aims at facilitating the deployment of semantic, collaborative, communautary and open web spaces.

Keywords: ontology, heavyweight ontology, thesaurus, semantic annotation, semantic information retrieval, conceptual graphs, semantic web, intelligent multimedia system, collaborative annotation, social tagging, semantic web 2.0.
1 Introduction

Currently, the collective and interactive dimension of Web 2.0, coupled with the lightness of its tools, facilitates the rise of many platforms dedicated to the sharing of multimedia resources, such as Flickr (http://www.flickr.com/) for images or YouTube (http://www.youtube.com) for videos. However, the success of these platforms (in terms of number of listed resources and number of federated users) must be tempered by the poverty of the approach used for Information Retrieval (IR). Indeed, the search engines integrated in such systems are only based on the use of tags, which are usually defined manually by the end-users of the communities (i.e. social tagging, which leads to the creation of folksonomies). In addition to the traditional limits of keyword-based IR systems, in particular the poverty of the semantic description provided by a set of tags and, consequently, the impossibility of implementing a semantic search engine, these systems suffer from a lack of openness, because the tags provided by the end-users remain useful and efficient only inside the platforms: they cannot be exported when the resources are duplicated from one platform to another.
OSIRIS (Ontology-based Systems for Semantic Information Retrieval and Indexation dedicated to communautary and open web Spaces) is a platform dedicated to the development of communautary web spaces which aim at facilitating both the semantic annotating process and the searching process for multimedia resources. Such a communautary space corresponds to an Internet-mediated social and semantic environment, in the sense that the shared resources are not only tagged by the users (who thus construct a folksonomy in a collaborative way) but are also formally described by using one (or several) ontology(ies) shared by all the members of the community. The result is an immediate and rewarding gain in the user's capacity to semantically describe and find related content. Based on the use of heavyweight ontologies [6] coupled with thesauri1, OSIRIS allows the end-users to semantically describe the content of a resource (for instance, this photography of Doisneau represents "A woman who kisses a man in a famous French place located in Paris") and then to formally represent this content by using Conceptual Graphs [13]. Each resource can be described according to multiple points of view (i.e. representations of several contents), which can also be defined according to multiple ontologies, covering connected domains or not. Thus, during the annotating process, OSIRIS allows managing several ontologies, which are used jointly and in a transparent way during the searching process, thanks to the possibility of defining equivalence links between the concepts and/or relations of two ontologies. Moreover, OSIRIS is based on heavyweight ontologies, i.e. ontologies which, in addition to including the concepts and relations (structured within hierarchies based on the Specialisation/Generalisation relation) characterizing the considered domain, also include the axioms (rules and constraints) that govern this domain. This gives OSIRIS the possibility to automatically enrich the annotations (manually associated with a resource) by applying the axioms, which generally correspond to inferential knowledge of the domain. From a technical point of view, OSIRIS is based on the integration of technologies currently developed in the Web 2.0 and Semantic Web areas: it aims at dealing with the Semantic Web 2.0 [11]. In its current version, OSIRIS allows implementing Semantic Web Spaces dedicated to the sharing of images, videos, music files and office documents, respectively in the JPEG, MP3, OpenOffice and Office 2007 formats. The choice of these formats is justified by the fact that it is possible to store the semantic annotations (represented in terms of conceptual graphs) within the files via the use of standards such as IPTC (http://www.iptc.org) for JPEG, ID3 (http://www.id3.org) for MP3 and ODF (http://www.oasis-open.org/) for OpenOffice and Office 2007. These standards make it possible, first, to associate meta-data with images and sounds and, second, to store these meta-data within the files. They are currently used by the majority of the well-known tools dedicated to the management of personal images and sounds, such as Picasa (http://picasa.google.com) or Winamp (http://www.winamp.com). But this use is limited to the association of keywords, which does not make it possible to represent the semantic content of the resources.
1 A thesaurus is a special kind of controlled vocabulary whose terms (which correspond to the entries of the thesaurus) are structured by using linguistic relationships such as synonymy, antonymy, hyponymy or hypernymy. A thesaurus (like WordNet) is not an ontology, because it only deals with terms (i.e. the linguistic level), without considering the concepts and the relations of the domain (i.e. the conceptual or knowledge level).
OSIRIS aims at addressing this lack by integrating the use of domain ontologies (coupled with thesauri) in the indexing and searching processes. In addition, preserving the annotations directly within the files makes our system much more open than current Web 2.0 systems, where the tags cannot be exported from one platform to another. The rest of this paper is structured as follows. Section 2 introduces the basic foundations of our work: heavyweight ontologies coupled with thesauri and represented within the Conceptual Graphs model. Section 3 presents (i) the model of annotation we have adopted, and (ii) the annotating process (manual and automatic) and the searching process we advocate. These different functionalities are illustrated with examples extracted from an application (developed in French) dedicated to the History of Art.
2 Context of the Work

2.1 Heavyweight Ontologies

Currently, ontologies are at the heart of many applications, in particular the Semantic Web, because they facilitate interoperability between human and/or artificial agents [9]. However, most of the current works concerned with ontological engineering are limited to the construction of lightweight ontologies, i.e. ontologies simply composed of a hierarchy of concepts (possibly enriched by properties such as exclusion or abstraction) which is sometimes associated with a hierarchy of relations (possibly enriched by algebraic properties). Reduced in semantics, these ontologies do not make it possible to take all the knowledge of a given domain into account, in particular the rules and the constraints governing this domain and thus fixing the interpretation of the concepts and relations characterising it. This deficit of semantics, which is prejudicial at various levels, is due to the low expressivity of the OWL language (Ontology Web Language, http://www.w3.org/2004/OWL/). Indeed, since 2004, this standard used to represent and share domain ontologies has indirectly influenced most of the works related to ontological engineering, in the sense that the majority have focused on lightweight ontologies, completely forsaking knowledge related to inference (mainly rules and constraints), both from a representation point of view (what kind of primitives can be used to represent this kind of reasoning knowledge?) and from an implementation point of view (how to use this type of knowledge effectively within a Knowledge-Based System?). In our work, we are more particularly interested in heavyweight ontologies (semantically speaking), i.e. ontologies which, in addition to including the concepts and relations (structured within hierarchies) of a domain D, also include the axioms (rules and constraints) which govern D. The use of a heavyweight ontology (which represents all the semantic richness of a domain via the axioms) coupled with a thesaurus (which represents all the linguistic richness of a domain) characterises the originality of the OSIRIS platform. This feature proves to be promising within a keyword-based Information Retrieval system, because the interpretation of the sense of a request expressed by a set of terms becomes more precise.
2.2 The Conceptual Graphs Model and the Language OCGL

The Conceptual Graphs model (CGs), first introduced by J. Sowa [13], is an operational knowledge representation model which belongs to the field of semantic networks. This model is mathematically founded both on logics and on graph theory [3]. To reason with CGs, two approaches can be distinguished: (1) considering CGs as a graphic interface for logics and thus reasoning with logics, or (2) considering CGs as a graph-based knowledge representation and reasoning formalism with its own reasoning capabilities. In the context of our work, we adopt the second approach by using projection (a graph-theoretic operation corresponding to homomorphism) as the main reasoning operator; projection is sound and complete w.r.t. deduction in First Order Logic [3]. OCGL (Ontology Conceptual Graphs Language) [8] is a modeling language based on CGs and dedicated to the representation of heavyweight ontologies. Representing an ontology in OCGL mainly consists in (1) specifying the conceptual vocabulary of the domain under consideration and (2) specifying the semantics of this vocabulary using Axioms. The conceptual vocabulary is composed of a set of Concepts and a set of Relations. These two sets can be structured by using either well-known conceptual properties called Schemata Axioms (covering the current expressivity of OWL-DL, such as the algebraic properties of relations, the disjunction of two concepts, etc.), or Domain Axioms used to represent rules and constraints. Domain Axioms correspond to all the inferential knowledge of the domain which cannot be represented by using Schemata Axioms, and thus which does not correspond to traditional properties attested on the concepts or the relations. A Domain Axiom is composed of an Antecedent graph and a Consequent graph; the formal semantics of such a construction can be intuitively expressed as follows: "if the Antecedent graph is true, then the Consequent graph is true". Figure 1 presents two Domain Axioms expressed in OCGL, respectively dedicated to the representation of the following knowledge related to an ontology called OntoArt and dedicated to the History of Art2: (i) "A cubist is an artist who has created at least one work of art illustrating the artistic movement called cubism" and (ii) "All the works of art created by Claude Monet illustrate the impressionist movement of the 20th century". Note that these axioms are not at the same level (and OCGL makes it possible to take these different levels of representation into account): the first one expresses generic knowledge, while the second one expresses more specific knowledge which involves an instance of the domain: Claude Monet. OCGL is implemented in TooCoM3 (A Tool to Operationalize an Ontology with the Conceptual Graph Model), a tool dedicated to the representation and operationalisation of heavyweight ontologies. TooCoM is based on CoGITaNT [10], a C++ library for developing conceptual graphs applications. An ontology expressed in OCGL is stored in a CGXML4 file.
2 The OntoArt ontology has been created in the context of a French project, so all the concepts and relations are expressed in French. This is why all the figures of this paper are in French; but as the domain of Art is general and (for most people) well known, we think that this situation will not interfere with the understanding of the ideas.
3 TooCoM is available under the GNU GPL license: http://sourceforge.net/projects/toocom/
Fig. 1. Examples of Domain Axioms represented in OCGL (edited with TooCoM)
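As an illustration of the antecedent/consequent structure of a Domain Axiom, the following sketch encodes the second axiom of Fig. 1 over plain triples; the Python representation and the "?w" variable convention are our own simplification, not OCGL:

```python
from dataclasses import dataclass

@dataclass
class DomainAxiom:
    """'If the antecedent graph is true, then the consequent graph is true.'
    Triples with variables (strings starting with '?') stand in for the
    antecedent/consequent conceptual graphs of OCGL."""
    antecedent: list
    consequent: list

# "All the works of art created by Claude Monet illustrate the impressionist
# movement" (the second axiom of Fig. 1), triple-encoded:
monet_axiom = DomainAxiom(
    antecedent=[("artist:claude_monet", "to_create", "?w")],
    consequent=[("?w", "to_illustrate", "movement:impressionism")],
)
```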
3 OSIRIS Framework

3.1 The Annotation Model

The annotation model advocated in OSIRIS is based on the triple {Subject/Verb/Object}. This model allows the end-user to represent the content of simple sentences (expressed in natural language) such as "A man who kisses a woman". In the context of this triple, Subject and Object correspond to concepts, and Verb corresponds to a relation of the ontology under consideration (cf. Figure 2). Thus, each resource can be described semantically by a set of triples, where each triple is defined according to a particular ontology. Note that it is possible to use multiple ontologies (covering overlapping domains or not) in the same OSIRIS application. Each triple can be associated with a member of the community. Figure 2 illustrates the application of this model in the context of the well-known work of art of the French photographer R. Doisneau: « Baiser de l'hôtel de ville ».
4 CGXML is the format used to represent CGs in XML. This format is integrated into CoGITaNT: http://sourceforge.net/projects/cogitant. Note that TooCoM enables importing and exporting lightweight ontologies in OWL, by using the OWL API [1]. Thanks to a specific transformational model from OCGL to OWL [7], most of the properties of classes and relations expressed in OWL are translated into Schemata Axioms in OCGL. However, because of the difference in expressivity of the two languages, the following properties are not yet translated: allValuesFrom, someValuesFrom and hasValue. Conversely, the Domain Axioms of OCGL cannot be translated into OWL as long as OWL does not offer the capability of representing rule-like axioms.
- (man To-Kiss woman)Onto1 / u1 - (man To-Wear beret)Onto1 / u1 - (woman To-Walk)Onto1 / u2 - (building To-Locate town:paris)Onto2 / u2 - (photographer:Doisneau To-Create work_of-art) Onto3 / u3 …
Fig. 2. Annotating process applied to the work of art of Doisneau: “Baiser de l’hôtel de ville”
In this example, the first user u1 annotates the photography by using the ontology Onto1 and describes the following contents: (1) "A man who kisses a woman" (where man is the concept corresponding to the Subject, To-Kiss the relation corresponding to the Verb and woman the concept corresponding to the Object) and (2) "A man who wears a beret". The second user u2 annotates the photography by using two different ontologies, Onto1 and Onto2; he describes the following situations: (1) "A woman who walks", without explicitly defining where (i.e. there is no concept corresponding to the Object), and (2) "A building which is located in a town called Paris" (i.e. the concept town corresponding to the Object is instantiated by Paris). Finally, the last end-user u3 does not annotate the content of the photography but the photography as a work of art, in the sense that he states "A work of art created by the photographer Doisneau". This last point clearly illustrates that our model can be used both to describe the content of a resource and to describe the resource as such, which makes it flexible and open. As shown by this example, our model is intuitive, easily comprehensible and has a strong correspondence with the Conceptual Graphs model. It allows the end-users to describe the content of their resources from several angles, possibly by using several ontologies related to the same domain (which can be developed by different communities) or to different and not necessarily overlapping domains. In this way, OSIRIS is a tool which enables a multi-user, multi-viewpoint and multi-ontology annotation process.

3.2 The Annotating Process

The manual approach. Annotating a resource which has been imported into OSIRIS starts with the selection of an ontology, which must have been imported beforehand by the administrator of the platform. When an ontology O is selected, the annotating process mainly consists in identifying a set of triples (Subject, Verb, Object) where Subject and Object correspond to concepts of O and Verb corresponds to a relation of O5. To state a triple, two approaches can be distinguished: (1) the end-user directly navigates within the hierarchies of concepts and relations of O, or (2) the end-user freely expresses a list of terms which are then compared with the entries of the thesauri associated with O. In the first case, the end-user is guided by the interface, because when he identifies the concept C1 corresponding to the Subject or the Object of his current annotation,
5 Of course, for the same resource, the end-user can repeat this process by using another ontology, in order to give another point of view on the same resource.
then only the relations whose signature involves the selected concept C1 (or concepts more specific than C1) are proposed by the interface. The same holds when the end-user starts with the identification of the relation associated with the Verb: only the compatible concepts (defined by the signature of the relation) are then accessible. In the second case, which aims at offering more freedom and openness from a linguistic point of view, OSIRIS uses the thesauri to find the concepts and relations underlying the set of terms expressed by the end-user. When OSIRIS finds correspondences between the terms of the end-user and the entries of the ontology extended with the thesauri, it proposes the possible triples (Subject, Verb, Object), which are validated (or rejected) by the end-user.
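A minimal sketch of the data carried by one annotation under the (Subject, Verb, Object) model, with attribution to an ontology and a user as in Fig. 2; the field names are illustrative and do not reflect OSIRIS's internal API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Annotation:
    """One (Subject, Verb, Object) annotation, attributed to an ontology
    and a user. Verb and Object may be absent (partial annotations)."""
    subject: str                        # concept, e.g. "man"
    verb: Optional[str] = None          # relation, e.g. "To-Kiss"
    obj: Optional[str] = None           # concept, possibly instantiated
    obj_instance: Optional[str] = None  # e.g. "paris" for town:paris
    ontology: str = "Onto1"
    user: str = "u1"

# The annotations of Fig. 2, expressed with this structure:
annotations = [
    Annotation("man", "To-Kiss", "woman", ontology="Onto1", user="u1"),
    Annotation("woman", "To-Walk", ontology="Onto1", user="u2"),
    Annotation("building", "To-Locate", "town", "paris", "Onto2", "u2"),
]
```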
Fig. 3. Illustration of the annotating process
Figure 3 illustrates this process in the context of the resource "Baiser de l'hôtel de ville" by Doisneau. Before applying the axioms, the annotations of this resource are as follows: "A woman who kisses a man" (in French, [femme:* embrasser homme:*]); "A photography of the artist Robert Doisneau" ([photographie:* photographier artiste:doisneau_robert]); "A man who wears a beret" ([homme:* porter beret:*]). After the automatic application of the axioms, two new annotations are produced: "A photography which is dated from the 20th century" ([photographie:* dater 20_siecle:*]) and "A man who kisses a woman" ([homme:* embrasser femme:*]). It is important to underline that it is also possible to link the annotations in order to specify, for instance, that two instances of the same concept are different (respectively similar). In Figure 3, this allows the end-user to specify that the man who wears the beret is different from the man who kisses the woman. Each annotation is recorded (in CGXML) within the files via the standards IPTC, ID3 and ODF. OSIRIS also permits the automatic extraction of keywords (stored as IPTC, ID3 and ODF meta-data) from the annotations: for each triple (Subject, Verb, Object), OSIRIS computes a set of terms corresponding to the union of all the synonyms of the concepts Subject and Object (and their sub-concepts) and all the synonyms of the relation Verb (and its possible sub-relations).
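The following sketch illustrates, over plain triples rather than conceptual graphs, how applying a symmetry Schemata Axiom (such as the one attached to the To_Kiss relation, discussed in the automatic approach below) enriches an annotation set; the function and its signature are hypothetical:

```python
def apply_symmetry(triples, symmetric_relations=frozenset({"To_Kiss"})):
    """Enrich (subject, verb, object) triples with symmetric counterparts
    for relations declared symmetric in the ontology; a stand-in for
    Schemata Axiom application on plain triples, not on conceptual graphs."""
    enriched = set(triples)
    for s, v, o in triples:
        if v in symmetric_relations and o is not None:
            enriched.add((o, v, s))
    return enriched

before = {("Woman", "To_Kiss", "Man"), ("Man", "To_Wear", "Beret")}
after = apply_symmetry(before)
print(after - before)  # adds ("Man", "To_Kiss", "Woman"); To_Wear untouched
```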
The automatic approach. The heavyweight ontologies manipulated by OSIRIS intrinsically include axioms. The application of these axioms to the annotations previously defined by the end-users enables a process of automatic enrichment of the annotations. Figure 3 illustrates the results of this process after applying the axioms of two ontologies: OntoArt, dedicated to the History of Art and including the axiom "Any work of Doisneau is dated from the 20th century", and OntoCourant, an ontology covering phenomena of everyday life and including (for example) the To_Kiss relation, which is defined between two Human (where the Human concept can be specialized into Man and Woman). The application of the Symmetry algebraic property of the To_Kiss relation (which is represented by a Schemata Axiom in OCGL) produces the new annotation "Man:* To_Kiss Woman:*" (from the original annotation "Woman:* To_Kiss Man:*"), and the application of the axiom "Any work of Doisneau is dated from the 20th century" (represented by a Domain Axiom in OCGL) produces the new annotation "Photography:* To_Date 20th_century:*". In addition, when the end-user imports a new resource, OSIRIS checks whether it already includes keywords (which may have been associated with it in other platforms such as YouTube or Flickr) via the standards IPTC, ID3 or ODF. If so, OSIRIS starts an analysis of these keywords in order to automatically find relevant annotations. This is done by comparing the keywords with the entries of the thesauri coupled with the ontologies. This analysis leads to a set of potential annotations that the end-user must validate.

3.3 The Searching Process

The searching process starts with the expression of a query in terms of (Subject, Verb, Object), or with a set of queries connected by the logical operators AND/OR. To formulate queries, the end-user can either navigate in the hierarchies of the ontologies (cf. Section 3.2) or freely (and directly) express terms which are then compared with the entries of the thesauri in order to find the underlying concepts and relations. Note that it is possible to formulate partial queries, i.e. queries which do not include all the elements of the triple, but only part of it, such as (Subject), (Object), (Subject, Verb) or (Verb, Object). Each query C corresponds to one (or several) conceptual graph(s). The search for the resources which comply with the criterion defined by the query is performed by using the projection operator of the CGs: a resource Ri satisfies a query C if there exists (at least) one projection from the conceptual graph representing C to the graphs representing the annotations associated with Ri. Figure 4 illustrates an example of a query whose criteria are: (1) "a contemporary work of art", represented by the conceptual graph (Work_of_art:* To_Date Contemporary:*) (oeuvre:* dater contemporain:* in French), where Work_of_art and Contemporary are concepts and To_Date is a relation, AND (2) "whose content incarnates a woman", represented by the conceptual graph consisting of only one concept (Woman:*) (femme:* in French). A resource R is considered relevant for the query when there exists a projection from each of these two graphs into (at least) one of the graphs representing the annotations of R.
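As a simplified stand-in for CG projection restricted to triple-shaped graphs, the following sketch checks whether every query triple projects onto some annotation triple, using a toy concept hierarchy for specialization; real projection operates on arbitrary conceptual graphs, and all names here are illustrative:

```python
# Toy concept hierarchy (child -> parent); names not listed are only
# subsumed by themselves.
HIERARCHY = {
    "photography": "work_of_art",
    "painting": "work_of_art",
    "woman": "human",
    "man": "human",
}

def subsumes(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = HIERARCHY.get(specific)
    return False

def satisfies(query, annotations):
    """Every query triple must project onto some annotation triple; None in
    a query triple acts as a wildcard (partial queries)."""
    return all(
        any(all(q is None or subsumes(q, a) for q, a in zip(qt, at))
            for at in annotations)
        for qt in query
    )

resource = [("photography", "To_Date", "contemporary"),
            ("woman", "To_Kiss", "man"),
            ("man", "To_Kiss", "woman")]   # includes the axiom-derived triple
query = [("work_of_art", "To_Date", "contemporary"),  # criterion (1)
         ("woman", None, None)]                        # criterion (2)
print(satisfies(query, resource))          # -> True
```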
Fig. 4. Illustration of the searching process: "Contemporary works of art representing women". This query corresponds to the two triples (Work_of_art:* To_Date Contemporary:*) and (Woman:*). The works of art which are proposed are photographs, sculptures or paintings.
OSIRIS also makes it possible to perform searches on instances of concepts. For example, "What are the paintings created by the artist Picasso?" (represented by the graph "Painting:* To_Create Artist:Picasso") specifies that the searching process must focus on the works of Picasso that are paintings, and not on his other works such as sculptures. OSIRIS also makes it possible to express partial queries which only involve one relation, without further precision on the concepts (for example, "To_Kiss").
4 Conclusion

OSIRIS is a platform that enables the development of collaborative web spaces dedicated to the sharing of multimedia resources. Semantic annotation based on the use of conceptual graphs is not a new approach; indeed, several works have adopted it [2, 4]. Thus, the originality of our work lies not in the adopted approach, but in the context in which this approach is considered. Indeed, contrary to similar works, OSIRIS makes it possible for several ontologies to cohabit within the same communautary space, and these ontologies can be refined by the members of the community. This work is currently progressing towards a thorough study of the tags associated with resources by way of Web 2.0 platforms like YouTube or Flickr, in order to automatically enrich the thesauri and to discover possible gaps in the ontologies under consideration. Our assumption is that the tags defined and shared by the communities (semantic and social tagging) are good indicators of the evolution of the underlying fields of knowledge (we are currently testing this idea in the context of a French project related to Cultural Heritage Preservation through the collaborative collection development of old and popular postcards). In this context, it appears relevant to use this type of material to tackle the problem of ontology evolution, which is one of the key factors in the popularisation of semantic and participative platforms such as OSIRIS.
References

1. Bechhofer, S., Volz, R., Lord, P.: Cooking the Semantic Web with the OWL-API. In: Fensel, D., Sycara, K.P., Mylopoulos, J. (eds.) ISWC 2003. LNCS, vol. 2870, pp. 659–675. Springer, Heidelberg (2003)
2. Bocconi, S., Nack, F., Hardman, L.: Supporting the Generation of Argument Structure within Video Sequence. In: Proceedings of the Sixteenth ACM Conference on Hypertext and Hypermedia, pp. 75–84 (2005)
3. Chein, M., Mugnier, M.L.: Conceptual Graphs: fundamental notions. Revue d'Intelligence Artificielle (RIA) 6(4), 365–406 (1992), Hermès
4. Crampes, M., Ranwez, S.: Ontology-Supported and Ontology-Driven Conceptual Navigation on the World Wide Web. In: Proceedings of the Eleventh ACM Conference on Hypertext and Hypermedia, pp. 191–199 (2000)
5. Euzenat, J., Shvaiko, P.: Ontology Matching, p. 341. Springer, Heidelberg (2007)
6. Fürst, F., Trichet, F.: Heavyweight Ontology Engineering. In: Meersman, R., Tari, Z., Herrero, P. (eds.) OTM 2006 Workshops. LNCS, vol. 4277, pp. 38–39. Springer, Heidelberg (2006a)
7. Fürst, F., Trichet, F.: Reasoning on the Semantic Web needs to reason both on ontology-based assertions and on ontologies themselves. In: Proceedings of the International Workshop on Reasoning on the Web (RoW 2006), co-located with the 15th International World Wide Web Conference (WWW 2006), Edinburgh (2006b)
8. Fürst, F., Leclère, M., Trichet, F.: Operationalizing domain ontologies: a method and a tool. In: Proceedings of the 16th European Conference on Artificial Intelligence (ECAI 2004), pp. 318–322. IOS Press, Amsterdam (2004)
9. Gomez-Perez, A., Fernandez-Lopez, M.: Ontological Engineering. In: Advanced Information and Knowledge Processing (2003)
10. Genest, D., Salvat, E.: A Platform Allowing Typed Nested Graphs: How CoGITo Became CoGITaNT. In: Mugnier, M.-L., Chein, M. (eds.) ICCS 1998. LNCS (LNAI), vol. 1453, pp. 154–161. Springer, Heidelberg (1998)
11. Greaves, M.: Semantic Web 2.0. IEEE Intelligent Systems 22(2), 94–96 (2007)
12. Shvaiko, P., Euzenat, J.: A Survey of Schema-based Matching Approaches. In: Spaccapietra, S. (ed.) Journal on Data Semantics IV. LNCS, vol. 3730, pp. 146–171. Springer, Heidelberg (2005)
13. Sowa, J.: Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley, Reading (1984)
Interactive Cluster-Based Personalized Retrieval on Large Document Collections
Petros Belsis¹, Charalampos Konstantopoulos², Basilis Mamalis¹, Grammati Pantziou¹, and Christos Skourlas¹
¹ Department of Informatics, TEI of Athens
² Department of Informatics, University of Piraeus
[email protected], [email protected], {pantziou,vmamalis,cskourlas}@teiath.gr
Abstract. Lately, many systems and websites add personalization functionalities to their provided services. However, for large document collections it is difficult for the user to pose effective queries from the beginning of his/her search, since accurate query terms may not be known in advance. In this paper we describe a system that applies a hybrid approach to assist a user in identifying the most relevant documents: at the beginning it applies dynamic personalization techniques based on user modeling to initiate the search on a large document and multimedia content collection; next, the query is further refined using a clustering-based approach which, after processing a sub-collection of documents, presents the user with further categories to select from, described by lists of new keywords. We analyze the most prominent implementation choices for the modular components of the proposed architecture: a machine learning approach for personalized services, a clustering-based approach towards user-directed query refinement, and a parallel processing module that supports document clustering in order to decrease the system's response times.
1 Introduction
The continuous growth of data stored in different types of systems, such as information portals and digital libraries, has created an overwhelming amount of information that a user has to deal with. Many query-refinement approaches have emerged to facilitate a more efficient retrieval process with respect to the user's personal interests. A wide variety of systems also integrate personalization features that aim to assist the user in identifying knowledge items that match his/her preferences. Among others, digital libraries, document management systems and multimedia data warehouses focusing on scientific data storage grow significantly in size as new scientific content is gathered daily in different areas of research. Considering that each user has specific areas of expertise or interest, a digital library constitutes a good test-bed domain where personalization techniques may prove to be beneficial. Still, in large document collections it is hard to identify an efficient user model that contains adequate sub-categories to support the user preferences, for two reasons: first, it would be difficult to identify appropriate sub-categories with respect to the number of existing users; second, classification of incoming documents would require a significant overhead for the system.
In this paper, we describe a hybrid approach that utilizes personalization techniques at the initiation of the user's interaction with the system, and then proceeds towards a more user-oriented interaction, where the user participates in the dynamic clustering process by selecting the sub-categories that arise dynamically after processing subsets of the documents. In order to keep the system's response times low, we also apply parallel processing techniques when processing the document sub-clusters selected by the user. The rest of the paper is organized as follows. Section 2 presents related work in context; Section 3 presents the main principles that drive the design of the proposed architecture and discusses the structure of its modular components, while Section 4 concludes the paper.
2 Related Work
A wide variety of research prototype systems, as well as commercial solutions, have emerged lately, offering personalized services to their users. Many of the successful deployments use machine learning methods, which aim at integrating among the system's features the ability to adapt to the user's needs and to perform many of the necessary tasks in an automated way [7].
2.1 User Models, Stereotypes, Communities and Personalization Systems
Personalization technology aims to adapt information systems, information retrieval and filtering systems, etc., to the needs of the individual user. A user model may contain personal details about the user, such as occupation, interests, etc., and information gathered through the interaction of the user with the system. User community models are generic models that apply to the needs of groups of users and usually do not use explicitly provided personal information; if personal information is given, the community models are called stereotypes. Machine learning techniques have been applied to construct all these types of models, which are used in digital library services, personalized news services, etc. For example, the MyView system [1] collects bibliographic data and facilitates the user in browsing digital libraries. MyView supports direct online reorganization, browsing and selection as specified by the user. Among its strong features is support for browsing heterogeneous distributed repositories. It does not store the actual data sources, but metadata pointing to the actual sources. It also supports user-directed browsing. The PNS [4] is a generic system that offers personalized news services to its users. Its architecture consists of sub-modules that collect user-related data, either explicitly inserted by the user or implicitly gathered by monitoring the user's behavior. A personalization module builds the user's model and makes recommendations on topics that fall within the user's interests. The PNS also contains a content management module that collects information about the actual content sources and indexes them, without storing the actual sources, keeping instead the indexing information as collected by special-purpose wrappers.
2.2 Document Clustering and Parallel Techniques
A large number of document clustering algorithms exist. They are usually classified into two main categories: hierarchical algorithms and partitional algorithms. Partitioning assigns every document to a single cluster iteratively [17], in an attempt to determine k partitions that optimize a certain criterion function [18]. Partitional clustering algorithms usually have better time complexity than hierarchical algorithms. The K-means algorithm [21] is a popular clustering method of this category. A hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence. Hierarchical clusterings generally fall into two categories: splitting and agglomerative methods. Splitting methods work in a top-down fashion, splitting clusters until a certain threshold is reached. The more popular agglomerative clustering algorithms use a bottom-up approach to merge documents into a hierarchy of clusters [19]. Agglomerative algorithms typically use a stored-matrix or stored-data approach [20]. There also exist several algorithms that combine the accuracy of the hierarchical approach with the lower time complexity of the partitioning approach to form a hybrid approach. One popular such algorithm is the Buckshot algorithm [8] (see also Section 3.2). A detailed overview of sequential document clustering algorithms can be found in [9] and [16]. Many authors have also examined parallel algorithms for both hierarchical and partitional clustering [22]. In [23], Olson provides a comprehensive review of parallel hierarchical clustering algorithms. Two versions of parallel K-means algorithms have been discussed in the recent literature: in [21], Dhillon and Modha proposed a parallel K-means algorithm for distributed memory multiprocessors, while Xu and Zhang [24] designed a parallel K-means algorithm for clustering high-dimensional document datasets, which has low communication overhead. Besides K-means, some other classical clustering algorithms also have corresponding parallel versions, such as the parallel PDDP algorithm [24] and the parallel Buckshot algorithm (given earlier in [15] and most recently in [9]).
2.3 The Scatter/Gather Approach
Scatter/Gather was first proposed by Cutting et al. [8] as a cluster-based method for browsing large document collections. The method works as follows. In the beginning, the system scatters the initial document collection into a small set of clusters (i.e., document groups) and presents to the user short descriptive summaries of these clusters. The summaries may include text that characterizes the cluster in general, as well as terms that sample the contents of the cluster. Based on these summaries, the user can select one or more of the clusters for further examination. The clusters selected by the user are gathered together into a subcollection. Subsequently, online clustering is applied again to scatter the subcollection into a new small set of clusters, whose summaries are presented to the user. The above process may be repeated, and after each iteration the clusters become smaller and more detailed. With the Scatter/Gather method the user is not forced to provide query terms; instead, from the beginning he/she is presented with a set of clusters. The successive iterations of the method help the user to find the desired information in a large document collection. Therefore, the Scatter/Gather approach is very useful when the user cannot or does not want to express a query formally.
In addition, as Hearst and Pedersen showed in [13], [14], the Scatter/Gather method can also significantly improve the retrieval results over a very large document collection. Since each iteration of the Scatter/Gather method requires online clustering of a large document collection, fast clustering algorithms should be employed. Cutting et al. [8] have proposed and applied to Scatter/Gather two clustering procedures: Buckshot (which is also used in our hybrid approach) and Fractionation. In [12], a scheme is proposed that, after near-linear-time pre-processing (O(kN log N)), requires constant time for the online phase for arbitrarily large document collections. The method involves the construction of a cluster hierarchy. Liu et al. [16] also proposed a new algorithm for Scatter/Gather browsing which achieves constant response time for each Scatter/Gather iteration. Their algorithm requires (like the algorithm in [12]) the construction of a cluster hierarchy.
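To make the interaction loop concrete, the following is a minimal sketch of the Scatter/Gather control flow described above. The toy clustering and summarization routines are illustrative placeholders of our own (a real system would plug in a fast online procedure such as Buckshot), not an API taken from the cited papers.

```python
import random
from collections import Counter

def cluster_into_k(docs, k):
    """Toy scatter step: random partition into k groups; a real system would
    use a fast online clustering algorithm such as Buckshot."""
    docs = list(docs)
    random.shuffle(docs)
    return [docs[i::k] for i in range(k)]

def summarize(cluster):
    """Toy cluster summary: the most frequent words across its documents."""
    words = Counter(w for d in cluster for w in d.split())
    return [w for w, _ in words.most_common(3)]

def scatter_gather(collection, k=3, min_size=4):
    sub = list(collection)
    while len(sub) > min_size:
        groups = cluster_into_k(sub, k)                  # scatter
        for i, g in enumerate(groups):
            print(i, summarize(g))                       # descriptive summaries
        picked = [int(x) for x in input("select clusters: ").split()]
        sub = [d for i in picked for d in groups[i]]     # gather
    return sub                                           # small enough to enumerate
```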
3 System Architecture
The proposed architecture consists of three sub-modules: i) the personalization sub-module, which collects user-related data and initially recommends categories containing documents related to the user's interests; ii) the content repository, which is responsible for storing the documents and facilitates a user-directed search by employing a scatter/gather approach; and iii) the parallel processing module, which is responsible for speeding up online clustering procedures as well as for the real-time preprocessing of documents. In the following sections we explain the main concepts behind the functionality of each sub-module. Fig. 1 shows a generic overview of the proposed architecture.
Fig. 1. Generic overview of the system’s architecture
3.1 Personalization Module
In order to build an accurate and effective user model, there are two main tasks that the system should support: either i) the user, at the time of registration, should be able to provide details about his/her personal preferences so as to easily create his/her model/stereotype, or ii) the sequence of topics that he/she usually selects is monitored and can be used to create a model which directs the system to classify him/her into one community.
In general, the personalization module should provide support for the following operations:
• Provide support with respect to new user registration
• Keep track of users' preferences with respect to the topics of interest that affect their interaction with the system
• Present personalized information to users that have similar interests or, in general, can be classified into a common behavior stereotype.
With respect to the classification model adopted, the system must support the creation of user models/communities using a feature-based representation. Towards this, a user model/community is created from a list of generic sub-categories that the user usually explores, using a machine learning approach. Typical algorithms which have been successfully applied in this direction are the COBWEB algorithm and the Cluster Mining algorithm [3] and its variations [4][6]. Paliouras et al. [5] studied a free-text query based information retrieval service and constructed models of user communities using these two algorithms. They compared the two approaches using two evaluation criteria: 1) coverage, the proportion of features covered by the models, and 2) distinctiveness, the number of distinct features that appear in at least one model divided by the sum of the sizes of all models. Eventually, they concluded that "the cluster mining method is doing consistently better than COBWEB, in terms of coverage, and distinctiveness". The main principle of the Cluster Mining algorithm is to create, from a graph that contains all the possible features, a weighted sub-graph containing all the features associated with a given user model. In other words, the algorithm constructs a weighted graph G(A, E, wA, wE), where the set of vertices A contains all the features and the set of edges E corresponds to the coexistence of two features in the corresponding model. Then, weights are assigned to both the edges E and the vertices A as aggregate usage statistics. In order to lower the complexity of the graph, a threshold can be imposed, which results in rejecting the edges with an assigned value below that threshold. In Fig. 2, assuming a threshold of 0.09, the edge between the categories hardware and databases, which has a lower value, is rejected (this means that there is no strong evidence that the users in this specific stereotype are interested in both categories). The remaining subset of the graph results in the construction of the feature group; a minimal sketch of this pruning step is given after the figure.
Fig. 2. The feature-based graph that allows creation of the personalization model
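The following is a minimal sketch of the edge-pruning step of the Cluster Mining algorithm described above; the feature names and edge weights are invented to mirror the Fig. 2 example (with the 0.09 threshold rejecting the hardware–databases edge) and are not measured usage statistics from the paper.

```python
def prune(edges, threshold):
    """Keep only the edges whose aggregate usage weight reaches the threshold."""
    return {pair: w for pair, w in edges.items() if w >= threshold}

def feature_groups(vertices, edges):
    """Connected components of the pruned graph form the feature groups."""
    parent = {v: v for v in vertices}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v
    for a, b in edges:
        parent[find(a)] = find(b)           # union the two components
    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())

features = ["hardware", "databases", "networks", "programming"]
weights = {("hardware", "databases"): 0.05,   # below 0.09: rejected
           ("hardware", "networks"): 0.20,
           ("databases", "programming"): 0.15}
print(feature_groups(features, prune(weights, threshold=0.09)))
# e.g. [{'hardware', 'networks'}, {'databases', 'programming'}]
```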
3.2 Clustering-Based Browsing for Large Document Collections
In addition to personalization, our system also provides effective automatic browsing using the known Scatter/Gather approach (which is mainly based on the iterative application of document clustering procedures; see Section 2), in order to further facilitate the user's search procedure. Moreover, we apply parallelism over a distributed-memory environment in order to obtain better (and acceptable) overall performance for very large document collections. Specifically, we first follow the typical scatter/gather approach proposed in [8], slightly changed due to the fact that in our system personalized document categorization for each user has already been performed via the personalization module of the system. The predefined categories for each specific user (e.g., based on the user model/stereotype) can serve here as the basic initial clusters for the scatter/gather procedure. Thus, initially, the documents belonging to the specific user-profile categories (in other words, the set of initial clusters assigned to the specific user) are gathered together to form a dynamic (for the specific user) subcollection. An appropriate re-clustering procedure is then applied to scatter the user subcollection into a number of document groups, and short summaries of them are presented to the user. Based on these summaries, the user selects one or more of the groups for further study. The selected groups are gathered together again to form a new (smaller) subcollection. The system then applies clustering (re-clustering via the same procedure as above) again to scatter the new subcollection into a small number of document groups, whose summaries are again presented to the user. The user selects again, and so on. With each successive iteration the groups become smaller, and therefore more detailed. Ultimately, when the groups become small enough, this process bottoms out by enumerating individual documents. Note that, since an initial document recommendation and selection has already been done via the personalization module (assignment of specific categories to each user), an initial heavy (and more accurate) clustering (such as the Fractionation procedure proposed in [8]) is not necessary. This initial personalized filtering can serve as the basic initial clustering step (performed via a different and more accurate procedure) proposed in the above reference. Thus, in our hybrid approach, only fast online re-clustering procedures have to be considered. In this direction, we have used a customized version of the Buckshot algorithm (see [8] for a general description), a typical, quite fast clustering algorithm suitable for the online re-clustering essential for scatter/gather. The Buckshot algorithm is a combination of hierarchical and partitioning algorithms designed to take advantage of the accuracy of hierarchical clustering as well as the low computational complexity of partitioning algorithms. Specifically, it assumes the existence of some (i.e., hierarchical) algorithm which clusters well, but which may run slowly. This procedure is usually called 'the cluster subroutine'. In our system, we use the single-link hierarchical agglomerative clustering method for this subroutine (instead of group-average or complete-link), in order to obtain not very tight initial clusters. Hierarchical conceptual clustering plays an important role in our work because we plan, in the future, to combine knowledge acquisition with machine learning to extract semantics from resources found on the Web [2].
Then, the algorithm takes a random sample of s = √(kn) documents from the collection and uses the specific 'cluster subroutine' (hierarchical single-link) as the high-precision clustering routine to find initial centers from this random sample. The initial centers generated by the hierarchical agglomerative clustering subroutine can then be used as the basis for clustering the entire collection in a high-performance manner, by assigning the remaining documents in the collection to the most appropriate initial center. The original Buckshot algorithm gives no specifics on how best to assign the remaining documents to appropriate centers, although various techniques are given. In our work we use an iterated assign-to-nearest algorithm with two iterations, similar to the one proposed in [9]. The Buckshot algorithm typically requires linear time (since s = √(kn), the total time is O(kn), where k is much smaller than n), which is very satisfactory. This establishes the feasibility of the scatter/gather method for browsing moderately large document collections. However, for very large document collections, the linear time requirement for the online phase makes the scatter/gather browsing method not very efficient. On the other hand, in our system the Buckshot procedure is usually expected to run over controlled-size subcollections (since the user subcollections are the results of personalized filtering procedures). Nevertheless, in order to face this inefficiency in either case, we apply parallelism over a distributed-memory parallel environment, aiming at acceptable performance even for very large document collections.
3.3 Parallel Processing Module
As mentioned above, even the Buckshot algorithm in sequential execution tends to be quite slow for today's very large collections, and even the most simplistic modern clustering techniques tend to be quite slow too. Naturally, a promising approach is parallel processing. In our proposed system, we use such efficient parallel techniques in order to achieve acceptable performance even for very large document collections. Moreover, using a distributed-memory architecture we can reduce the time and memory complexity of the sequential algorithms by a factor of p, where p is the number of nodes used. Specifically, towards an efficient design and implementation of the scatter/gather clustering techniques proposed in the previous section, we follow the parallel approach presented in [9]. First, an efficient implementation of the underlying hierarchical agglomerative clustering subroutine is constructed (initially based on the parallel calculation of the pair-wise document similarity matrix, in a distributed manner over the multiple processors, and then iterating to build the cluster hierarchy using the single-link criterion). Based on the above parallel execution of the underlying clustering subroutine, we build an efficient parallel implementation of the Buckshot algorithm (similar again to the one proposed in [9]). The first phase of the parallel Buckshot algorithm uses the parallel hierarchical clustering subroutine to cluster s random documents. The second phase of the parallel version of the Buckshot algorithm groups the remaining documents in parallel. After the clustering subroutine has finished, k initial clusters have been created from the random sample of s = √(kn) documents.
From the total collection n−s
documents remain that have not yet been assigned to any cluster. The second phase of the Buckshot algorithm assigns these documents according to their similarity to the centroids of the initial clusters. This phase of the algorithm is trivially parallelized via data partitioning. First, the initial cluster centroids are calculated on every node (with the use of appropriate collective parallel functions, aiming at properly reducing the total communication cost). After the centroid calculation is complete, each node is assigned approximately (n − s)/p documents to process. Each node iterates through these documents in place (comparing each document's term vector to each centroid and making the assignment) until all documents are assigned. The second phase is iterated two times: the second iteration recalculates the centroids and reassigns all the documents to one of the k clusters. A minimal sketch of the overall Buckshot procedure, including this assign-to-nearest phase, is given at the end of this section. Moreover, we also apply parallelism during the document preprocessing phase, based on previous works of ours (see [10], [11]). As part of these techniques, a more accurate off-line clustering algorithm (partitional clustering based on the iterative calculation of connected components of the document similarity matrix, as a specialization of the single-link hierarchical agglomerative clustering algorithm) is also provided. This global initial clustering method is quite useful if the user wishes to perform global searches from the beginning (entering natural language keywords, etc.) without using any personalization-based categorization feature. The document indexing process used (essential as part of the off-line setup/preprocessing phase of the system, in order to be able to apply the similarity-based clustering techniques effectively) follows the basics of the Vector Space Model (construction of weighted document vectors, based on the statistical extraction of word stems, phrases and thesaurus classes). For speeding up the similarity calculations we also extract and extensively use a global inverted index (as well as partial local inverted lists when needed for parallel clustering procedures). Some of our parallel processing methods (see [10], [11]) have been extensively tested over a real distributed-memory environment, yielding very good performance. As the underlying distributed-memory platform we use a Beowulf-class Linux cluster with the MPI-2 (MPICH implementation) message passing library. Specifically, our cluster consists of 8 Pentium-4 based processors with 1 GB RAM and a dedicated Myrinet network interface which provides 2 Gbps communication speed. The corresponding experiments have been done over a part of the known TIPSTER/TREC standard document collections.
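As a complement to the description above, here is a minimal single-process sketch of the Buckshot procedure with two assign-to-nearest iterations; the sample size s = √(kn) and the single-link seeding follow the text, while the toy cosine similarity and data layout are our own illustrative assumptions (in the parallel version, the assignment loop would be partitioned over the p nodes as described).

```python
import math, random
from collections import defaultdict

def cosine(u, v):
    """Similarity of two sparse term-weight vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_link(sample, k):
    """'Cluster subroutine': single-link agglomerative clustering down to k groups."""
    clusters = [[d] for d in sample]
    while len(clusters) > k:
        best, pair = -1.0, (0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(cosine(a, b) for a in clusters[i] for b in clusters[j])
                if sim > best:
                    best, pair = sim, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)   # merge the closest pair (single link)
    return clusters

def centroid(cluster):
    c = defaultdict(float)
    for d in cluster:
        for t, w in d.items():
            c[t] += w / len(cluster)
    return dict(c)

def buckshot(docs, k):
    s = max(k, int(math.sqrt(k * len(docs))))          # sample size s = sqrt(kn)
    clusters = single_link(random.sample(docs, s), k)  # high-precision seeding
    for _ in range(2):                                 # two assign-to-nearest passes
        centers = [centroid(c) for c in clusters]
        clusters = [[] for _ in range(k)]
        for d in docs:   # in the parallel version, each node handles ~(n-s)/p docs
            clusters[max(range(k), key=lambda i: cosine(d, centers[i]))].append(d)
    return clusters
```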
4 Conclusion
Adding personalization features to websites and other systems has become a very popular approach, and machine learning algorithms have proved to be an effective solution in this direction. Personalization models base their operation on a limited set of features. In large document collections, though, it is not sufficient to direct the user queries based only on the generic categories that help build up the personalization model. We have presented a hybrid approach that initiates the user-system interaction by making propositions to the user based on a user model created either from user feedback or by a machine learning approach that tracks his/her previous interaction with the system; subsequently, the selected sub-clusters of documents are processed and new keywords arise, which help build up a new set of sub-clusters of the remaining documents.
This process proceeds repetitively, with the user's participation, until an adequately limited number of documents has been refined through the user-directed queries. The benefit of our approach is that it proceeds in a highly dynamic manner, not limiting the number of features that arise in each step of the query process. Thus, the feature set associated with the resources is updated frequently, resulting in an effective and dynamic re-clustering of documents. In order to keep the response times low, parallel processing techniques are employed. We have presented the modular components of a proof-of-concept architecture that encompasses the basic principles of our approach, and we have described good selection choices for the system's implementation, which is still under continuous development; still, based on previous experimentation with some of its sub-modules [10][11], we provide adequate evidence of the validity of our approach.
Acknowledgments. We are grateful to George Paliouras for his helpful comments on an early version of this article.
References
1. Wolff, E., Cremers, A.: The MyVIEW Project: A Data Warehousing Approach to Personalized Digital Libraries. In: Next Generation Information Technologies and Systems, pp. 277–294 (1999)
2. Godoy, D., Amandi, A.: Modeling user interests by conceptual clustering. Information Systems 31, 247–265 (2006)
3. Perkowitz, M., Etzioni, O.: Learning and revising user profiles: The identification of interesting Web sites. Machine Learning 27, 313–331 (1998)
4. Paliouras, G., Papatheodorou, C., Karkaletsis, V., Spyropoulos, C.D.: Clustering the Users of Large Web Sites into Communities. In: Proceedings of the International Conference on Machine Learning (ICML), pp. 719–726 (2000)
5. Paliouras, G., Papatheodorou, C., Karkaletsis, V., Spyropoulos, C.D.: Discovering User Communities on the Internet Using Unsupervised Machine Learning Techniques. Interacting with Computers 14(6), 761–791 (2002)
6. Paliouras, G., Mouzakidis, A., Ntoutsis, C., Alexopoulos, A., Skourlas, C.: PNS: Personalized Multi-source News Delivery. In: Proceedings of the 10th International Conference on Knowledge-Based & Intelligent Information & Engineering Systems (KES), Bournemouth, UK, October 2006, pp. 1152–1161 (2006)
7. Langley, P.: User modeling in adaptive interfaces. In: Proceedings of the 7th International Conference on User Modeling, pp. 357–370. Springer, Heidelberg (1999)
8. Cutting, D.R., Pedersen, J.O., Karger, D., Tukey, J.W.: Scatter/Gather: A cluster-based approach to browsing large document collections. In: Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318–329. ACM, New York (1992)
9. Cathey, R., Jensen, E., Beitzel, S., Frieder, O., Grossman, D.: Exploiting parallelism to support scalable hierarchical clustering. JASIST 58(8), 1207–1221 (2007)
10. Kehagias, D., Mamalis, B., Pantziou, G.: Efficient VSM-based Parallel Text Retrieval on a PC-Cluster Environment using MPI. In: Proceedings of the ISCA 18th Intl. Conf. on Parallel and Distributed Computing Systems (PDCS'05), Las Vegas, Nevada, USA, September 12-14, pp. 334–341 (2005)
11. Gavalas, D., Konstantopoulos, C., Mamalis, B., Pantziou, G.: Efficient BSP/CGM Algorithms for Text Retrieval. In: Proceedings of the 17th IASTED Intl. Conf. on Parallel and Distributed Computing and Systems (PDCS'05), Phoenix, Arizona, USA, November 14-16, pp. 301–306 (2005)
12. Cutting, D.R., Karger, D.R., Pedersen, J.O.: Constant interaction-time Scatter/Gather browsing of very large document collections. In: Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 126–134. ACM, New York (1993)
13. Hearst, M.A., Karger, D., Pedersen, J.O.: Scatter/Gather as a tool for the navigation of retrieval results. In: Burke, R. (ed.) Working Notes of the AAAI Fall Symposium on AI Applications in Knowledge Navigation and Retrieval, Cambridge, MA. AAAI, Menlo Park (1995)
14. Hearst, M.A., Pedersen, J.O.: Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In: SIGIR 1996: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 76–84. ACM Press, New York (1996)
15. Jensen, E.C., Beitzel, S.M., Pilotto, A.J., Goharian, N., Frieder, O.: Parallelizing the Buckshot algorithm for efficient document clustering. In: CIKM 2002: 11th Intl. Conf. on Information and Knowledge Management, pp. 684–686. ACM Press, New York (2002)
16. Liu, Y., Mostafa, J., Ke, W.: A Fast Online Clustering Algorithm for Scatter/Gather Browsing (2006)
17. Hartigan, J.A.: Clustering Algorithms. Wiley, Chichester (1975)
18. Guha, S., Rastogi, R., Shim, K.: CURE: An Efficient Clustering Algorithm for Large Databases. In: Proceedings of the 1998 ACM-SIGMOD, pp. 73–84 (1998)
19. Jardine, N., van Rijsbergen, C.J.: The Use of Hierarchical Clustering in Information Retrieval. Information Storage and Retrieval (1971)
20. Dash, M., Petrutiu, S., Scheuermann, P.: Efficient Parallel Hierarchical Clustering. In: Danelutto, M., Vanneschi, M., Laforenza, D. (eds.) Euro-Par 2004. LNCS, vol. 3149. Springer, Heidelberg (2004)
21. Dhillon, I.S., Modha, D.S.: A data-clustering algorithm on distributed memory multiprocessors. In: Zaki, M.J., Ho, C.-T. (eds.) KDD 1999. LNCS (LNAI), vol. 1759, pp. 245–260. Springer, Heidelberg (2000)
22. Heckel, B., Hamann, B.: Divisive parallel clustering for multiresolution analysis. In: Geometric Modeling for Scientific Visualization, Germany, pp. 345–358 (2004)
23. Olson, C.: Parallel algorithms for hierarchical clustering. Parallel Computing 21 (1995)
24. Xu, S., Zhang, J.: A hybrid parallel web document clustering algorithm and its performance study (366-03) (2003)
Decision Support Services Facilitating Uncertainty Management
Sylvia Encheva
Stord/Haugesund University College, Bjørnsonsg. 45, 5528 Haugesund, Norway
[email protected]
Abstract. This work focuses on taking actions with respect to managing uncertainty in system architectures by employing non-Boolean logic. Particular attention is paid to solving problems arising in situations where recognition of all correct answers is required and some responses contain both correct and incorrect options.
Keywords: decision support services, uncertainty management.
1 Introduction
The importance of uncertainty increases when a long-term view is considered. In order to manage and explore unexpected developments, the introduction of designs that are open to modification and adjustable to changing environments is highly desirable. The need for proper management of uncertainty requires flexibility that can be obtained through system architectures based on non-Boolean logic. While Boolean logic appears to be sufficient for most everyday reasoning, it is certainly unable to provide meaningful conclusions in the presence of inconsistent and/or incomplete input [9]. Many-valued logic, however, offers a solution to this problem. Knowledge assessment is strongly related to the kinds of answer alternatives that should be included when establishing a student's level of mastery of a particular concept. Introducing various types of answer alternatives helps to attain a higher level of certainty in the process of decision making. However, providing meaningful responses to all answer combinations is a difficult task. A possible solution to this problem is first arranging all answer combinations into meaningful sets. The application of many-valued logic for drawing conclusions and providing recommendations is then suggested. This allows the introduction of decision strategies where multi-valued inferences support the comparison of degrees of specificity among contexts. In addition, the involvement of intermediate truth values adds a valuable contribution to the process of comparing degrees of certainty among contexts. The rest of the paper is organized as follows. Related work, basic terms and concepts are presented in Section 2. The management model is described in Section 3. The paper ends with a description of the system in Section 4 and a conclusion in Section 5.
2 Background
Let P be a non-empty ordered set. If sup{x, y} and inf{x, y} exist for all x, y ∈ P, then P is called a lattice [2]. In a lattice illustrating a partial ordering of knowledge values, the logical conjunction is identified with the meet operation and the logical disjunction with the join operation. A lattice L is said to be modular [2] if it satisfies the modular law: (∀a, b, c ∈ L) a ≥ c ⇒ a ∧ (b ∨ c) = (a ∧ b) ∨ c. Nested line diagrams are used for visualizing large concept lattices, emphasizing sub-structures and regularities, and combining conceptual scales [14]. A nested line diagram consists of an outer line diagram, which contains inner diagrams in each node. Both five-valued and seven-valued logics can be obtained from the generalized Lukasiewicz logic [12]. The set of truth values with cardinality n corresponds to the equidistant rational numbers {0, 1/(n−1), 2/(n−1), ..., (n−2)/(n−1), 1}. Seven-valued logic has been employed in reliability measure theory [10], for the verification of switch-level designs in [5], and for verifying circuit connectivity of MOS/LSI mask artwork in [13]. The seven-valued logic presented in [10], known also as seven-valued relevance logic, has the following truth values: true (i.e., valid), false (i.e., invalid), true by default, false by default, unknown, contradiction, and contradiction by default. The authors of [5] define a seven-valued propositional logic called switch-level logic. The truth values are E, D0, D1, DU, S0, S1, SU and are called switch-level values, where S0 and S1 are 'strong' values associated with the supply voltage 'vdd' and ground 'gnd', D0 and D1 are obtained as a result of a degradation effect, SU and DU are undefined values corresponding to a certain strength, and E is the value of all nodes not connected to a source node via a path through the network. These truth values are ordered in a switch-level value lattice (Fig. 1).
Fig. 1. A lattice with switch-level values (nodes SU, S0, S1, DU, D0, D1, E)
We choose to apply the five-valued and the seven-valued logics developed in [3] and [4] because the latter contains all the truth values of the former, which simplifies using the two in combination. The five-valued logic (Fig. 2) introduced in [3] is based on the following truth values:
– uu - unknown or undefined,
– kk - possibly known but consistent,
– ff - false,
– tt - true,
– ww - inconsistent.
Fig. 2. Five-valued logic lattice
Fig. 3. Lattice of the seven-valued logic
A seven-valued logic presented in [4] is based on the following truth values:
– uu - unknown or undefined,
– kk - possibly known but consistent,
– ff - false,
– tt - true,
– ii - inconsistent,
– it - non-false, and
– fi - non-true.
Fig. 4. M3 × 2 lattice
The lattice in Fig. 3 is modular since it is isomorphic to the shaded sublattice of the modular lattice M3 × 2 in Fig. 4, [2].
3 The Test
The main idea is to develop a framework for the automated evaluation of the knowledge and/or skills learned by a student. In real-life situations a person is often presented with several alternatives and has to choose one of them as a solution to a given problem. To prepare future experts for dealing with such situations, we propose the application of multiple choice tests where a student should mark all correct answers. Such a test immediately opens possibilities for inconsistent and incomplete input.
Fig. 5. The truth value tt (answer triples qqq, qqi, qqe, qqu, qqp)
Fig. 6. The truth value ff (answer triples ppp, ppv, ppe, ppu, ppq)
Fig. 7. The truth value ww (answer triples eee, eev, eeq, eeu, eep)
Fig. 8. The truth value kk (answer triples vvv, vvq, vve, vvu, vvp)
Inconsistency occurs if a student's response contains both a correct answer and a wrong answer, and incompleteness occurs when a student does not provide any answer. Such situations cannot be resolved with Boolean logic, because systems based on Boolean logic operate with only two outputs, 'correct' or 'incorrect'. Therefore we suggest the application of many-valued logic.
Fig. 9. Truth value uu (answer triples uuu, uuv, uue, uuq, uup)
Fig. 10. Truth value fi (answer triples qpi, qpu, qpv, qve, pvu)
Fig. 11. Truth value it (answer triples qvu, que, pve, vue, pue)
A test consists of three questions addressing the understanding of a concept or the mastery of a skill. Stem responses can be:
– true (q),
– false (p),
– answer is missing (u),
– incomplete answer but true (v), and
– both true and false (e).
Fig. 12. Possible outcomes from a single test (the answer-triple sets of Figs. 5–11 placed in the nodes of the seven-valued lattice of Fig. 3)
The meaning of the last two notations is as follows:
– responses where a part of the answer is missing but whatever is stated is true (v), and
– responses where one part of the answer is true and another one is false (e).
A student can take the test several times, depending on the size of the pool of questions and answer alternatives.
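As an illustration of how a single stem response could be coded automatically, the following is a minimal sketch under our own reading of the five response codes; the function and its rules are illustrative assumptions, not the paper's implementation.

```python
def stem_response(selected, correct):
    """Code one answer against the set of correct options (illustrative rules):
    q - exactly the correct options; p - only wrong options; u - no answer;
    v - a non-empty proper subset of the correct options (incomplete but true);
    e - a mix of correct and wrong options (both true and false)."""
    selected, correct = set(selected), set(correct)
    if not selected:
        return "u"
    if selected == correct:
        return "q"
    if selected <= correct:
        return "v"
    if selected & correct:
        return "e"
    return "p"

# Three questions yield a triple such as ('q', 'v', 'e'); the lattices in
# Figs. 5-11 then relate such triples to one of the seven truth values.
print(stem_response({"a", "c"}, {"a", "b", "c"}))  # 'v'
```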
Fig. 13. Relations among truth values
Aiming at a simplification of the visualization process, we first group the answer alternatives into sets with five elements. The lattices in Fig. 5, Fig. 6, Fig. 7, Fig. 8, Fig. 9, Fig. 10, and Fig. 11 relate the answer alternatives to the seven truth values. These sets are then placed in the nodes of the lattice in Fig. 3. The outcome can be seen in Fig. 12. In Fig. 13 we show in detail how the results of two tests are related. The seven-valued logic is both associative and commutative. This allows combining the results of tests based on truth Table 1, as well as drawing only half of the truth-value dependencies.
Table 1. The ontological operation ∨ in [4]

 ∨ | uu  kk  fi  ff  ii  tt  it
---+----------------------------
uu | uu  uu  fi  ff  fi  uu  uu
kk | uu  kk  fi  ff  fi  kk  uu
fi | fi  fi  fi  ff  fi  fi  fi
ff | ff  ff  ff  ff  ff  ff  ff
ii | fi  fi  fi  ff  ii  ii  ii
tt | uu  kk  fi  ff  ii  tt  it
it | uu  uu  fi  ff  ii  it  it
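As a minimal sketch of how two test outcomes could be combined with the ∨ operation, the dictionary below simply transcribes the (symmetric) Table 1, and the brute-force checks mirror the associativity and commutativity noted above; the encoding is our own, not the paper's implementation.

```python
from itertools import product

V = ["uu", "kk", "fi", "ff", "ii", "tt", "it"]
ROWS = {  # Table 1: entry for (row value) OR (column value)
    "uu": ["uu", "uu", "fi", "ff", "fi", "uu", "uu"],
    "kk": ["uu", "kk", "fi", "ff", "fi", "kk", "uu"],
    "fi": ["fi", "fi", "fi", "ff", "fi", "fi", "fi"],
    "ff": ["ff", "ff", "ff", "ff", "ff", "ff", "ff"],
    "ii": ["fi", "fi", "fi", "ff", "ii", "ii", "ii"],
    "tt": ["uu", "kk", "fi", "ff", "ii", "tt", "it"],
    "it": ["uu", "uu", "fi", "ff", "ii", "it", "it"],
}

def join(a, b):
    return ROWS[a][V.index(b)]

print(join("tt", "kk"))  # combining two test outcomes -> 'kk'

# brute-force check of the properties used to combine test results:
assert all(join(a, b) == join(b, a) for a, b in product(V, V))
assert all(join(join(a, b), c) == join(a, join(b, c))
           for a, b, c in product(V, V, V))
```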
4 Brief System Description
A web application server architecture is proposed for the system implementation, where the Apache web server deals with the presentation layer, the logic layer is written in Python, and the SQLite database engine is used for implementing the data layer. The back-end SQLite databases are used to store both static and dynamic data.
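As a minimal sketch of the data layer, assuming a hypothetical table layout for storing coded test outcomes (the schema and column names are ours, chosen only to illustrate the Python/SQLite combination named above):

```python
import sqlite3

# Hypothetical schema for coded test responses; not the system's actual tables.
con = sqlite3.connect("assessment.db")
con.execute("""CREATE TABLE IF NOT EXISTS responses (
                   student     TEXT,
                   test        INTEGER,
                   triple      TEXT,   -- e.g. 'qve' (one code per question)
                   truth_value TEXT    -- one of uu, kk, fi, ff, ii, tt, it
               )""")
con.execute("INSERT INTO responses VALUES (?, ?, ?, ?)",
            ("s001", 1, "qqq", "tt"))
con.commit()
for row in con.execute("SELECT * FROM responses WHERE student = ?", ("s001",)):
    print(row)
con.close()
```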
5 Conclusion
This work discusses the assessment of students' understanding of knowledge. The presented framework facilitates the automation of an evaluation process where a student is asked to find all correct answers. Since some of the options presented to the student are incorrect, the system is challenged to provide decisions for cases with incomplete and/or contradictory input. This is resolved by applying many-valued logic.
References
1. Belnap, N.J.: A useful four-valued logic. In: Dunn, J.M., Epstein, G. (eds.) Modern Uses of Multiple-Valued Logic, pp. 8–37. D. Reidel Publishing Co., Dordrecht (1977)
2. Davey, B.A., Priestley, H.A.: Introduction to Lattices and Order. Cambridge University Press, Cambridge (2005)
3. Ferreira, U.: A Five-valued Logic and a System. Journal of Computer Science and Technology 4(3), 134–140 (2004)
4. Ferreira, U.: Uncertainty and a 7-Valued Logic. In: Proceedings of the 2nd International Conference on Computer Science and its Applications (ICCSA 2004), San Diego, CA, USA (June 2004)
5. Hähnle, R., Kernig, W.: Verification of switch-level designs with many-valued logic. In: Voronkov, A. (ed.) LPAR 1993. LNCS, vol. 698, pp. 158–169. Springer, Heidelberg (1993)
6. http://httpd.apache.org/
7. http://www.python.org/
8. http://www.sqlite.org/
9. Immerman, N., Rabinovich, A., Reps, T., Sagiv, M., Yorsh, G.: The boundary between decidability and undecidability of transitive closure logics. In: Marcinkowski, J., Tarlecki, A. (eds.) CSL 2004. LNCS, vol. 3210. Springer, Heidelberg (2004)
10. Kim, M., Maida, A.S.: Reliability measure theory: a nonmonotonic semantics. IEEE Transactions on Knowledge and Data Engineering 5(1), 41–51 (1993)
11. Kleene, S.: Introduction to Metamathematics. D. Van Nostrand Co., Inc., New York (1952)
12. Lukasiewicz, J.: On Three-Valued Logic. Ruch Filozoficzny 5, 170–171 (1920); English translation in: Borkowski, L. (ed.) Jan Lukasiewicz: Selected Works. North-Holland, Amsterdam (1970)
13. Takashima, M., Mitsuhashi, T., Chiba, T., Yoshida, K.: Programs for Verifying Circuit Connectivity of MOS/LSI Mask Artwork. In: 19th Conference on Design Automation, pp. 544–550 (1982)
14. Wille, R.: Concept lattices and conceptual knowledge systems. Computers and Mathematics with Applications 23(6-9), 493–515 (1992)
Efficient Knowledge Transfer by Hearing a Conversation While Doing Something
Eiko Yamamoto and Hitoshi Isahara
Graduate School of Engineering, Kobe University, 1-1 Rokodai-cho, Nada-ku, Kobe, Hyogo 657-8501, Japan
National Institute of Information and Communications Technology, 3-5 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0289, Japan
[email protected], {eiko,isahara}@nict.go.jp
Abstract. We introduce an approach for developing a conversation synthesizer in order to support knowledge acquisition. The system can transfer knowledge to humans even while the person is doing something else or is not concentrating on listening to the voice. Our approach does not create a summary of the key points of what is being read out, but focuses on the knowledge transfer method for supporting knowledge acquisition. Specifically, to provide knowledge efficiently by computer, that is, to transfer knowledge from computers to humans, we consider what kinds of conversation are naturally retained in the brain, as such conversations may enable people to obtain knowledge more easily. We aim to construct an intelligent system which can create such conversations by applying natural language processing technologies.
Keywords: Intelligent narrative environments, Knowledge acquisition support, Natural language processing, Learning by listening.
1 Introduction
One of the most common means of acquiring useful knowledge is reading suitable documents and websites. However, this is time-consuming and cannot be done in parallel with other tasks. Is there a way to acquire knowledge when we cannot read written texts, such as while driving a car, walking around or doing housework? It is not easy to remember the contents of a document simply by listening to it being read aloud from the top, even if we concentrate while listening. In contrast, it is sometimes easier to remember words heard on the radio or television even if we are not concentrating on them. While we are doing something, listening to a conversation is better than listening to a precise reading of a draft or summary for memorizing the contents and turning them into knowledge. We are therefore trying to improve the efficiency of knowledge transfer¹ by "hearing a conversation while doing something."
¹ In this paper, "(knowledge) transfer" is a movement of knowledge/information from a knowledge source, including a human, to a human recipient. That is to say, the term "knowledge transfer" means not only transferring knowledge between people but also transferring knowledge from computers to humans. "Acquisition" is the process of understanding/memorizing knowledge by the human recipient. We focus on the process of synthesizing the conversation being uttered for knowledge transfer, which relates to the "externalization" in the SECI model [1], in order to achieve efficient knowledge acquisition by the recipient, which relates to the "combination" in the model.
In order to support knowledge acquisition by humans, we aim to develop a system which provides people with useful knowledge while they are doing something or not concentrating on listening. We do not try to edit notes to be read out, or to summarize documents; rather, we aim to develop a way of transferring knowledge. Specifically, in order to provide knowledge efficiently with computers, we consider how to turn the content into a dialogue that is easily remembered, and we develop a system to produce dialogue through which one can easily acquire knowledge.
2 Sophisticated Eliza
Recently, thanks to the improvement of natural language processing (NLP) technology, the development of high-performance computers and the availability of huge amounts of stored linguistic data, useful NLP-based systems now exist. There are also practical speech synthesis tools for reading out documents and tools for summarizing documents. These tools do not necessarily use state-of-the-art technologies to achieve deep and accurate language understanding, but are based on huge amounts of linguistic resources that used not to be available. Although current computer systems can collect huge amounts of knowledge from real examples, it is not obvious how to transfer knowledge naturally between such powerful computer systems and humans. We need to develop a novel way to transfer knowledge from computers to humans. We believe that, based on large amounts of text data, it is possible to devise a system which can generate dialogue by a simple mechanism to give people the impression that two intelligent persons are talking. We verified this approach by implementing a system named Sophisticated Eliza [2], which can simulate a conversation between two persons on a computer. Sophisticated Eliza is not a human-computer interaction system; instead, it simulates conversation by two people, and users acquire information by listening to the conversation generated by the system. Concretely, using an encyclopedia in Japanese [3] as a knowledge base, we developed rules to extract information from the knowledge base and create fragments of conversation. We compiled rules with syntactic patterns to make a conversation, for example, "What is A?" "It's B." from "A is B." The system extracts candidate fragments of conversation using these simple scripts, and two voices then read the conversation aloud. This system cannot generate long conversations on one topic as humans do, but it can simulate short conversations from stored linguistic resources and continue conversations while changing topics. Figure 1 shows a screenshot of Sophisticated Eliza, Figure 2 shows its system flow, and Figure 3 gives examples of conversation generated by the system.
Example 1:
Original text in knowledge base: Osaka Castle was a castle in Osaka prefecture from 15th century to 17th century.
Extracted fragment of conversation:
A: What is Osaka Castle?
B: It is a castle in Osaka prefecture from 15th century to 17th century.
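A minimal sketch of the kind of extraction rule described above, using a toy English regular expression in place of the parser-based Japanese templates; the "A is B." → "What is A?" / "It is B." rule is the one named in the text, while the implementation details are illustrative assumptions.

```python
import re

# Toy version of one conversation-extraction template. The real system matches
# parse trees produced by a Japanese parser, not surface regular expressions.
PATTERN = re.compile(r"^(?P<a>[A-Z][\w ]+?) is (?P<b>.+)\.$")

def to_fragment(sentence):
    m = PATTERN.match(sentence)
    if m is None:
        return None
    return (f"A: What is {m.group('a')}?",
            f"B: It is {m.group('b')}.")

for turn in to_fragment("Osaka Castle is a castle in Osaka prefecture."):
    print(turn)
# A: What is Osaka Castle?
# B: It is a castle in Osaka prefecture.
```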
Fig. 1. Screenshot of Sophisticated Eliza
Fig. 2. System Flow of Sophisticated Eliza (components: huge text database, Japanese parser, manual compilation of rules, templates, conversation extraction, conversation fragment database, conversation generator (fragment selection), synthesizer, Speaker 1, Speaker 2)
Example 2: The Japanese government reinforces bilateral relations with African countries and promotes its foreign policy, aiming to establish an environment for solving problems at the United Nations.
What activities are done under the supporting program for Africa?
Fig. 3. Examples of Generated Conversation
The encyclopedia utilized here covers everything about Japan, e.g., history, culture, economy and politics. All sentences in the encyclopedia are analyzed syntactically using a Japanese parser, and we use rules to extract the fragments of conversation from the information in the encyclopedia. As for the manual compilation of rules, we carefully analyzed the output of the Japanese parser and found useful patterns for extracting knowledge from the encyclopedia. The terms extracted during the syntactic analysis are stored in the keyword table and are used for the selection of topics and words during conversation synthesis. Note that in our current system we use Japanese documents as the input; because we use only the syntactic information output by the Japanese parser, our mechanism is also applicable to other languages such as English. We use a rather simple mechanism to generate the actual conversations in the system, which includes rules to select fragments containing similar words and rules to change topics. The contents of the encyclopedia are divided into seven categories: geography, history, politics, economy, society, culture and life. When the topic in a conversation moves from one category to another, the system generates an utterance signaling the move. As for the speech synthesis part, we use the synthesizer developed by Oki Electric Industry Co. Ltd., Japan. The two authors of this paper, one male and one female, recorded 400 sentences each, and the two characters in the system talk to each other by impersonating our voices. The images of the two characters are also based on the authors. Because this system uses simple template-like knowledge, it cannot generate semantically deep conversation on a topic by considering context or by compiling highly precise rules to extract script-like information from text. Thus, the mechanism used in this system has room for improvement with respect to creating conversations for knowledge transfer.
3 Efficient Knowledge Transfer by Hearing a Conversation While Doing Something
In the daily transfer of knowledge, such as in a cooking program on TV, there is not only the reading aloud of recipes by the presenter but also conversation between the cook and an assistant. Through such conversations, the information which viewers want to know, and which they should memorize, is transferred to them naturally. We have started to develop a mechanism that achieves natural knowledge acquisition for humans by turning information written in documents into conversational text. Efficient methods of acquiring knowledge include not only "reading documents" and "listening to passages read aloud," but also "hearing a conversation while doing something," provided that the information is appropriately embedded into the conversation. We believe that we can verify that such "conversation hearing" can assist knowledge acquisition by developing a system that synthesizes conversations from collected fragments of conversation and conducting experiments with it. As a means of transferring information, contents conveyed by an interpretive reading with pronounced intonation are better retained in memory than contents read monotonously from a document or summary. Furthermore, by turning contents into a conversational style, even someone who is not concentrating on listening may become interested in the topic and acquire the contents naturally. This suggests that several factors in conversations, such as throwing in words of agreement, pauses and questions, which may appear to decrease the density of information, are actually effective means of transferring information, matching humans' ability to acquire knowledge with limited concentration. Based on this idea, we propose a novel mechanism for an information transfer system by considering the way knowledge is transferred from computers to humans. Various dialogue systems have already been developed as communication tools between humans and computers [4, 5]. In our novel approach, however, the dialogue system regards the user as an outsider: it presents a conversation between two speakers in the computer which is of interest to the outside user, and thus provides the user with useful knowledge. There are dialogue systems [6, 7, 8] which can join in a conversation between a human and a computer, but they simply create fragments of conversation and so do not sound like intelligent human speakers. One reason is that they do not aim to provide knowledge or transfer information to humans, and few theoretical evaluations have been done in this field. In this research, we consider a way to transfer knowledge and develop a conversation system which generates dialogue by which humans can acquire knowledge from a dialogue conducted by two speakers in the computer. We analyze the way knowledge is transferred to humans with this system. This kind of research is beneficial not only from an engineering viewpoint but also from the viewpoints of cognitive science and cognitive linguistics. Furthermore, a speech synthesis system in which two participants conduct a spoken conversation automatically is rare. In this research, we develop an original information-providing system by assigning conversation to two speakers in the computer in order to transfer knowledge to humans.
4 System Implementation
The principle of Sophisticated Eliza is that, because a large amount of text data is available, we can obtain sufficient information to generate short conversations even if the recall of the information extraction is low. However, the rules still need to be improved by careful analysis of the input texts. As for the information transfer system, although our final target is to handle practically useful topics, such as knowledge from newspapers, encyclopedias and Wikipedia, as a first step we are trying to compile rules for small procedural domains such as cooking recipes. Concretely, we are developing the new system by iterating over the following five steps.
1) Enlargement of the conversational script and templates in order to generate sentences in natural conversation. We have already compiled simple templates for extracting fragments of conversation as a part of Sophisticated Eliza. We are now enlarging the set of templates to handle wider contexts, domain-specific knowledge and the insertion of words. This enlargement is basically being done manually. Here, domain-specific knowledge includes domain documents in a specific format, such as recipes. Insertion of words includes words of agreement and encouragement for the other speaker, part of which is already present in Sophisticated Eliza. An example of synthesized conversation is shown in Figure 4.
A: Let's make boiled scallop with lettuce and cream.
B: It is 244 Kcal for one person.
A: What kinds of materials are needed?
B: Lettuce and scallop. For four persons, four pieces of tomatoes and ……
……………
A: How will we cook lettuce?
B: Pour off the hot water after boiling it. Then cool it.
A: How about tomatoes?
B: Remove seeds and dice them.
Fig. 4. Example conversation
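As a sketch of what such a recipe-domain template could look like in code, the following turns a toy structured recipe into question/answer turns in the spirit of Fig. 4; the recipe fields and the generated phrasing are our own illustrative assumptions, not the authors' actual Japanese templates.

```python
# Illustrative recipe-to-dialogue templates; the field names and phrasing are
# invented for this sketch, not taken from the authors' system.
recipe = {
    "title": "boiled scallop with lettuce and cream",
    "kcal_per_person": 244,
    "ingredients": ["lettuce", "scallop", "tomatoes"],
    "steps": {"lettuce": "Pour off the hot water after boiling it. Then cool it.",
              "tomatoes": "Remove seeds and dice them."},
}

def recipe_dialogue(r):
    yield f"A: Let's make {r['title']}."
    yield f"B: It is {r['kcal_per_person']} Kcal for one person."
    yield "A: What kinds of materials are needed?"
    yield "B: " + ", ".join(r["ingredients"]) + "."
    for item, step in r["steps"].items():
        yield f"A: How will we cook the {item}?"
        yield f"B: {step}"

for turn in recipe_dialogue(recipe):
    print(turn)
```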
2) Implementation of a system in which two speakers (agents/characters) make conversation in a computer, considering dialogue and document contexts. Using the conversational templates extracted based on the contexts, the system continues a conversation between two speakers. Fundamental functions of this kind have already been developed for Sophisticated Eliza. Here, there are two types of "context." One is the context in the documents, i.e., the knowledge base. For the recipe example, cooking heavily depends on the order of each process and on the result of each process.
other type is the context in the conversation. If all subevents included in an event were explicitly uttered in the conversation, it would be very dull and would obstruct understanding. For example, “Make hot water in a pan. Peel potatoes and boil them” is enough, and it is not necessary to say “boil peeled potatoes in the hot water in a pan.” Appropriate use of ellipsis and anaphoric representation based on the context in the conversation are useful tools for easy understanding. Though speech synthesis itself is out of the scope of our research, pauses in utterances are also important in natural communication.
3) Mechanism to extract (fragments of) knowledge from text. Sophisticated Eliza outputs informative short conversations, but the content of the conversation is not consistent as a whole. In this research, we are developing a system to provide people with useful knowledge. We have to recognize the useful parts of the knowledge base and place great importance on the extracted useful parts of the text. We previously reported how to extract an informative set of words using a measure of inclusive relations [9], and will apply a similar method to this conversation system.
4) Improvement of conversation script and template considering “fragments of knowledge.” By considering the useful part of the information written in the knowledge base, we modify the templates to extract conversational text. Contextual information such as ellipsis and anaphora is also treated in this part. As a first step, we will handle anaphora resolution in a specific domain, such as cooking, considering the factors described in 2). We will use domain knowledge about cooking, such as cookware, cooking methods and ingredients.
5) Evaluation. We will conduct tests with participants to evaluate our methodology and verify the effectiveness of our method for transferring knowledge. So far, a small number of participants have reported that the system’s voice is rather easy to listen to; however, objective evaluation remains future work.
5 Conclusion
We introduced an approach for developing an information-providing system to support knowledge acquisition. The system can transfer knowledge to humans even while the person is doing something else or is not concentrating on listening to the voice. Our approach does not create a summary of the key points of what is being read out, but focuses on the knowledge-transfer method. Specifically, to provide knowledge efficiently, we consider what kinds of conversation are naturally retained in the brain, as such conversations may enable people to obtain knowledge more easily. We aim to construct an intelligent system which can create such conversations by applying natural language processing techniques.
References
1. Nonaka, I., Takeuchi, H.: The Knowledge-Creating Company. Oxford University Press, Oxford (1995)
2. Isahara, H., Yamamoto, E., Ikeno, A., Hamaguchi, Y.: Eliza’s daughter. In: Annual Meeting of the Association for Natural Language Processing of Japan (2005)
3. Bilingual Encyclopedia about Japan (in Japanese and English), Kodansha International (1998)
4. Weizenbaum, J.: ELIZA—A Computer Program For the Study of Natural Language Communication Between Man And Machine. Communications of the ACM 9(1), 36–45 (1966)
5. Matsusaka, Y., Tojo, T., Kuota, S., Furukawa, K., Tamiya, D., Hayata, K., Nakano, Y., Kobayashi, T.: Multi-person Conversation via Multimodal Interface – A Robot who Communicate with Multi-user. In: Proceedings of 6th European Conference on Speech Communication Technology (EUROSPEECH 1999), vol. 4, pp. 1723–1726 (1999)
6. Nadamoto, A., Tanaka, K.: Passive viewing of Web contents based on automatic generation of conversational sentences. Japanese Society of Information Processing 2004-DBS-134(1), 183–190 (2004)
7. ALICE: http://alice.pandorabots.com
8. Artificial non-Intelligence UZURA: http://www.din.or.jp/~ohzaki/uzura.htm
9. Yamamoto, E., Kanzaki, K., Isahara, H.: Extraction of hierarchies based on inclusion of co-occurring words with frequency information. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp. 1166–1172 (2005)
On Managing Users’ Attention in Knowledge-Intensive Organizations
Dimitris Apostolou1, Stelios Karapiperis1, and Nenad Stojanovic2
1 University of Piraeus, Karaoli & Dimitriou St. 80, Piraeus, Greece 185 34
[email protected], [email protected]
2 FZI at the University of Karlsruhe, Haid-und-Neu Strasse 10-14, 76131 Karlsruhe, Germany
[email protected]
Abstract. In this paper we present a novel approach for managing users’ attention in knowledge-intensive organizations, which goes beyond informing a user about changes in relevant information towards proactively supporting the user in reacting to changes. The approach is based on an expressive attention model, which is realized by combining ECA rules with ontologies. We present the architecture of the system, describe its main components and present early evaluation results. Keywords: attention management, semantic technologies.
1 Introduction
Success factors in knowledge-intensive and highly dynamic business environments are mostly the ability to rapidly adapt to complex and changing situations and the capacity to deal with a quantity of information of all sorts. For knowledge workers, these new conditions have translated into the acceleration of time, the multiplication of projects in which they are involved, and increased collaboration with colleagues, clients and partners. Knowledge workers are overloaded with potentially useful and continuously changing information originating from a multitude of sources and tools. In order to cope with a complex and dynamic business environment, the attention of knowledge workers must always be focused on the most relevant changes in information. Attention management in an organisational context refers to helping knowledge workers focus their cognitive ability only on up-to-date information that is most relevant for their current business context (e.g. the business process they are involved in and the task they are currently resolving). In particular, support is required for searching, finding, selecting and presenting the most relevant and up-to-date information without distracting them from their activities. Information retrieval systems have provided means for delivering the right information at the right place and time. The main issue with existing systems is that they do not cope explicitly with the information overload, i.e. it might happen that a knowledge worker “overlooks” important information. The goal of attention management systems is to avoid information overload and to provide proactive, effective recommendations for dealing with changed or new information [5]. Moreover, enterprise attention management is not just about receiving notifications proactively, but also about enabling a relevant reaction to this information and to relevant changes in general. Our approach puts forward a comprehensive
reasoning framework that can trigger a knowledge base in order to find the best way to react to a change. We base such a framework on a combination of ECA (event-condition-action) rules and ontologies. The paper is organized as follows: in the second section we analyze a motivating example and derive requirements for an Enterprise Attention Management System (EAMS). In the third section we outline the approach and implementation of the SAKE EAMS, encompassing various functionalities to address relevant attention-related issues. In the fourth section we present related work and compare it to our approach, while in the fifth section we summarise conclusions and future work.
2 Enterprise Attention Management Framework
In this section we give an indicative but generic motivating scenario from the public administration sector, we elaborate requirements for supporting it, and we generalise them into an enterprise attention management framework.
2.1 Motivating Scenario
In order to keep local laws aligned with federal ones, the mayor of a municipality has to react to any changes and updates in the federal laws. In this process, a General Binding Regulation (GBR) is developed in the local administration by a group of experts. Together they discuss ways to adapt the federal law in the local administration. The outcome of this discussion is a draft GBR, which thereafter is open for public deliberation. Citizens can provide comments on the draft GBR, and the head of the legal department together with the experts assess the comments received. Next, the revised GBR is submitted to the local councillors for approval. If the GBR is approved, it is then signed by the mayor and published. Changes in the federal law are announced in the federal governmental web portal; a system for alerting the mayor about updates is needed. Nevertheless, the mayor may not be interested in all possible changes, but rather only in some that are relevant for her/his municipality. Moreover, the mayor should be assisted in starting the public deliberation process (i.e. what additional information should be gathered, what is the timeline for the deliberation, etc.). In order to be supported in this decision-making process, the mayor could use process-specific information such as previous deliberation processes. Moreover, relevant information might be hidden in the relationships between some information artefacts, for example that there are not many experts who are familiar with the selected domain in this municipality. This information is needed to enable the mayor to react properly to the received alert.
2.2 Requirements
We summarise three basic requirements for an EAMS stemming from the aforementioned scenario:
1. Expressive modelling of information in order to enable focusing of attention on the right level of information abstraction as well as on conceptually similar information. For example, a public servant working in the urban planning department
should be alerted to new information about building refurbishment, but the mayor should not.
2. Context-awareness in order to support a particular decision-making process. For example, a new law about sanitation should trigger different alerts in a GBR about pets and in other GBRs less related to sanitary conditions.
3. An expressive formalism for the description of user preferences, covering not only when to alert a user, but also how to react to an alert.
2.3 Enterprise Attention Management Framework
Figure 1 presents an EAMS framework that generalises the aforementioned requirements.
Fig. 1. Enterprise Attention Management Framework
− Information represents all relevant artefacts that can be found in the available information repositories (knowledge sources). In the business environment of an organization, sources of information can be both internal and external to the organization. Moreover, information can be represented either formally (e.g. using information structuring languages such as XML) or informally. Finally, information may be stored in structured repositories such as databases, which can be queried using formal languages, or in unstructured repositories such as discussion forums.
− Context defines the relevance of information for a user. Detection of context is related to the detection of the user’s attentional state, which involves collecting information about users’ current focus of attention, their current goals, and some relevant aspects of users’ current environment. In order to form a complete picture of the user’s attentional state, both sensor-based (e.g., observing cues of users’ current
activity and of the environment) and non-sensor-based (e.g., users explicitly state what they are working on) mechanisms for detecting user attention can be employed [9].
− Preferences enable filtering of relevant information according to its importance/relevance to the given user’s context. In other words, changes to resources are proactively broadcast to the users who may be interested in them, in order to keep them up to date with new information. Users may have different preferences about both the means by which they want to be notified and the relevance of certain types of information in different contexts. User preferences can be defined with formal rules or more informally, e.g., by adding keywords to user profiles. Moreover, even when employing mechanisms capable of formalizing the users’ preferences, a certain level of uncertainty about users’ preferences will always remain. For this reason, dealing with uncertainty is an important aspect of attention management systems. Equally important is the way preferences can be derived: by explicitly specifying them or by machine learning techniques.
3 The SAKE Enterprise Attention Management System
The objective of this section is to describe the SAKE EAMS, which tries to address the problem of keeping corporate users’ attention always focused on their current job.
3.1 Attention Model
Figure 2 presents the conceptual model and technical architecture underlying the SAKE EAMS. The model assumes that interactions between users and external/internal information sources are logged; the same applies to the business process context in which user interactions take place. Some log entries can be defined as Events that cause Alerts, which are related to a user and a problem domain, and associated with a priority level. Every Alert invokes Actions, which can be purely informative (i.e. an information push) or executable (e.g., execute a business process, start a new discussion forum). At the core of the SAKE approach are ECA (Event – Condition – Action) rules; their general form is:
ON event AND additional knowledge, IF condition THEN do something
Relevant events and actions are usually triggered by interactions taking place in organisational systems, such as the SAKE Content Management System (CMS) and the GroupWare System (GWS), or by external change detectors. The latter are implemented with the Change Notification System (CNS), a component that can be configured to monitor web pages, RSS feeds and files stored in file servers for any change, or for specific changes specified by regular expressions (e.g. new web content containing the keyword “sanitary” but not “pets”).
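To make the rule mechanism concrete, here is a minimal sketch of an ECA rule driving a CNS-style change filter. The class layout, rule encoding and function names are illustrative assumptions, not the actual SAKE API.

```python
# Hedged sketch of an ECA (Event-Condition-Action) rule with a CNS-like
# regular-expression filter: alert on content mentioning "sanitary" but
# not "pets". All names here are illustrative assumptions.
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class ECARule:
    event: str                          # event type the rule listens for
    condition: Callable[[dict], bool]   # evaluated against the event data
    action: Callable[[dict], None]      # executed when the condition holds

relevant = re.compile(r"sanitary", re.IGNORECASE)
excluded = re.compile(r"pets", re.IGNORECASE)

rule = ECARule(
    event="web_content_changed",
    condition=lambda e: bool(relevant.search(e["text"])) and not excluded.search(e["text"]),
    action=lambda e: print(f"ALERT: relevant change at {e['url']}"),
)

def dispatch(event_type, data, rules):
    """Fire every rule registered for this event whose condition holds."""
    for r in rules:
        if r.event == event_type and r.condition(data):
            r.action(data)

dispatch("web_content_changed",
         {"url": "http://example.gov/laws", "text": "New sanitary regulation"},
         [rule])
```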
Fig. 2. The SAKE Conceptual Model (left) and High-level Technical Architecture (right)
The log ontology models change events, following a generic and modular design approach so that it is re-usable in other domains as well. Figure 3 shows the four basic types of events: AccessEvent, AddEvent, ChangeEvent and RemoveEvent. The subclasses of AddEvent and RemoveEvent are further differentiated: AddToCMSEvent means the addition of a new resource to the CMS (either by uploading an existing document or by creating a new one). AddToParentEvent and RemoveFromParent refer to the composition of a generic parent/child relationship. For example, the addition of a ForumMessage to a ForumThread, or of a ForumThread to a Forum, is logged using an AddToParentEvent. The Information ontology contains the domain concepts and relations about which we want to express preferences, such as documents and forum messages. At the top level, the Information ontology separates physical types of information resources (e.g. HTML documents, PDF files, etc.) from conceptual types: some information resources are of an abstract nature, such as persons, while others physically exist in the SAKE system, such as CMS documents, GWS forums or e-mails. Preferences are generated using the Preference Editor. Each preference is expressed as a logical rule, represented in SWRL1 (Semantic Web Rule Language). Figure 4 illustrates a preference rule: if userA is in processZ, then userA has a preference of value 1.0 for documents created in 2006. Among the preferred values, preferences include the business context of the user, in order to support context-awareness of the whole system. The Preference Editor supports the creation of preference rules by providing a GUI for step-wise, interactive rule development, as presented in Fig. 5. The user starts by defining a variable whose properties are further specified (including defining new variables) in several subsequent steps.
1 www.w3.org/Submission/SWRL/
Fig. 3. Log ontology: a) Class hierarchy, b) Class diagram
Fig. 4. A Sample Preference Rule Expressed in SWRL
Fig. 5. Preference Editor: Step-wise, interactive rule development
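As a plain-code paraphrase of the rule in Fig. 4, the following sketch assigns a preference value of 1.0 to documents created in 2006 for a user currently in processZ. The dictionary encoding is an illustrative assumption: the actual SAKE system expresses such rules in SWRL and evaluates them with a reasoner.

```python
# Hedged sketch of evaluating the Fig. 4 preference rule in plain code.
def preference(user: dict, document: dict) -> float:
    """Return the preference value the rule assigns, or 0.0 if it does not apply."""
    if user["current_process"] == "processZ" and document["year_created"] == 2006:
        return 1.0
    return 0.0

user_a = {"name": "userA", "current_process": "processZ"}
doc = {"title": "GBR draft", "year_created": 2006}
print(preference(user_a, doc))  # -> 1.0; notifications can be ranked by this value
```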
Rules can react to events upon their detection. The Reasoner then evaluates additional queries against the log to obtain further information, such as the business context of the user, and then evaluates the condition. The business context is derived using the Context Observer, a component that links to enterprise systems (such as workflows and ERPs) and extracts the current business process, activity or task the user is working on. Finally, the Reasoner executes the action part of the rules.
3.2 Technical Implementation and Evaluation
The SAKE prototype is based on J2EE and Java Portlets following a three-tiered architecture. The presentation tier contains Portlets, JavaServer Pages (JSPs) and an auxiliary Servlet. Portlets call business methods on the Enterprise Java Beans (EJBs), pre-process the results and pass them to the JSP pages. The JSPs contain Hypertext Markup Language (HTML) fragments as well as placeholders for dynamic content (such as the results passed from the Portlets). The auxiliary Servlet is used for initializing the connection to the KAON2 ontology management system (http://kaon2.semanticweb.org/, part of the integration tier). The business tier consists mostly of EJBs, which provide the business logic and communicate with the components of the integration tier, which comprise a third-party CMS component (Daisy) and GWS component (Coefficient) as well as the Preference Framework. The interface to these components is represented using EJBs, which all use the Kaon2DAO in order to access the ontologies: the CMSBean and GWSBean enhance the CMS and GWS with semantic meta-data, and the AMSBean manages the preference rules. KAON2 stores the semantic meta-data for these entities with ontologies and provides the facilities for querying them using SPARQL2. The KAON2 reasoner is used for evaluating the user’s preference rules. The integration tier also contains a MySQL relational database, which stores CMS- and GWS-related content, such as forums, discussions, documents etc.
Since the development of the SAKE system has not been completed yet (mainly the integration of components is still pending), a comprehensive user-driven evaluation of the system as a whole is planned but has not been performed yet. Instead, we have performed an early evaluation of the main SAKE components independently. The evaluation was performed in three case studies: two local administrations and one ministry. We validated the usability of these components and their relevance to the knowledge-intensive work of public servants. We collected useful comments for further improvement regarding the functionality and interface of the SAKE components. Early evaluation of the Preference Framework in particular has revealed a noticeable improvement in the relevance of system-generated notifications when user preferences are taken into account. In the future we plan to perform formal experiments to measure the degree of improvement of notifications. Moreover, as soon as the SAKE system is integrated, we plan to test the system’s ability to not only send relevant notifications to users but also execute relevant actions, such as the initiation of a workflow process. From a conceptual point of view, we have ensured that all components are based on a common ontological model for representing information resources and changes, as well as other concepts not presented in this paper, such as context, roles and preferences.
2 www.w3.org/TR/rdf-sparql-query/
From the technical point of view, we ensured standards-based interoperability by using state-of-the-art Semantic Web technologies, such as SWRL and SPARQL.
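To illustrate the kind of metadata query the integration tier issues over the ontologies, here is a hedged sketch using SPARQL via rdflib; the namespace and property names are invented for the example and do not reflect the actual SAKE ontologies.

```python
# Hedged sketch of a SPARQL query over semantic metadata, analogous to
# what the integration tier does through KAON2. Names are assumptions.
from rdflib import Graph, Literal, Namespace, URIRef

SAKE = Namespace("http://example.org/sake#")  # hypothetical ontology namespace
g = Graph()
doc = URIRef("http://example.org/sake#doc1")
g.add((doc, SAKE.yearCreated, Literal(2006)))
g.add((doc, SAKE.title, Literal("GBR draft")))

results = g.query("""
    PREFIX sake: <http://example.org/sake#>
    SELECT ?doc ?title WHERE {
        ?doc sake:yearCreated 2006 ;
             sake:title ?title .
    }""")
for row in results:
    print(row.doc, row.title)  # documents created in 2006, with titles
```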
4 Related Work
There has been considerable research on attention-aware systems that address the information overload problem (e.g., [9]). According to [7], attention-aware or attentive systems are software systems that support users’ information needs by keeping track of what users are writing, reading, looking at or talking to, and suggesting information that might have a beneficial influence on them. SUITOR [7] is an attentive system comprising four modules: a) watching the user’s actions to infer the user’s current mental state and needs, b) storing the user’s actions to create and maintain a user model, c) searching information from the digital world and scanning the user’s local hard disk, and d) ranking and suggesting relevant information sources through a peripheral display. Having enough input information from these modules, SUITOR can infer the user’s current interests and propose relevant information sources from local and remote databases that it has previously gathered and stored. The Attentional User Interface project [6] developed methods for inferring attention from multiple streams of information, and for leveraging these inferences in decision making under uncertainty. The project focused on the design of interfaces that take into account visual attention, gestures and ambient sounds as clues about a user’s attention. These clues can be detected through cameras, accelerometers and microphones or other perceptual sensors and, along with the user’s calendar, current software interaction and data about the history of the user’s interests, they provide valuable information about the status of a user’s attention. The same project incorporated Bayesian models dealing with uncertainty and reasoning about the user’s current or future attention, taking as input all of the above clues. In comparison to attention-aware systems, our system does not include sensor-based mechanisms for detecting the user’s environment. We argue that for enterprise attention management, non-sensor-based mechanisms provide a wealth of attentional cues, such as users’ scheduled activities (e.g. using online calendars), users’ working context (e.g. by querying workflow or enterprise systems) and users’ communication and collaboration patterns (e.g. using groupware and other communication tools). Our approach is more in line with related commercial systems, such as KnowNow Enterprise Syndication Server (KESS), NewsGator Enterprise Server (NES) and Attensa Feed Server (AFS). These systems leverage RSS technology to find and distribute relevant content to employees. KESS (http://www.knownow.com) persistently monitors both RSS and non-RSS enabled data sources, from either outside or inside the enterprise, for predefined criteria and automatically routes notifications of new and updated information to employees, partners and customers in various output formats, like enterprise portals, RSS readers, mobile devices or email. This server can syndicate, aggregate, rank, prioritize and route enterprise content. NES (http://www.newsgator.com) is a centrally managed and administered RSS aggregation platform providing access to RSS and Atom sources, search tools to find relevant feeds, and multiple user options for
feed reading, and it enables users to collaborate and discuss important topics. Besides RSS feeds, NES can aggregate and deliver content from internal portals, enterprise applications and databases (CRM, ERP, HR), premium content providers such as Thomson, blogs, wikis and e-mail applications. AFS (http://www.attensa.com) is a content delivery and notification system with analytics and reporting capabilities that profile user behavior to predict and identify the most effective communication channels between users. Context-based, proactive delivery of information refers to delivering information to users based on context, e.g. activities, organizational role, and work outputs. Maus [8] and Abecker et al. [1] developed workflow management systems that recommend relevant information to users based on the process context. The Watson system [4] provides users with related documents based on the users’ job contents, such as word processing and Web browsing. Ahn et al. [3] provide a knowledge context model, which facilitates the use of contextual information in virtual collaborative work. In [2], organizational context is modelled to provide awareness of other users’ activities in a shared workspace. In our work, context-based delivery of information is coupled to attention-aware delivery of information and is also used for triggering actions.
5 Conclusions
In this paper we presented a novel approach for managing attention in an enterprise context by realising the idea of a reactive system that not only alerts a user that something has changed, but also supports the user in reacting properly to that change. In a nutshell, the corresponding system is an ontology-based platform that logs changes in internal and external information sources, observes the user context and evaluates user attentional preferences represented in the form of ECA rules. The system has been developed targeting the eGovernment domain. The evaluation process is still ongoing, but the first results are very promising. Future work will be directed toward further refinement of ECA rules for preference description and automatic learning of preferences from usage data using machine learning techniques.
References
1. Abecker, A., Bernardi, A., Hinkelmann, K., Kühn, O., Sintek, M.: Context-aware, proactive delivery of task-specific information: the Know-More Project. Information Systems Frontiers 2, 253–276 (2000)
2. Agostini, A., De Michelis, G., Grasso, M.A., Prinz, W., Syri, A.: Contexts, work processes and workspaces. In: Proceedings of the International Workshop on the Design of Cooperative Systems (COOP 1995), INRIA, Antibes, France, pp. 219–238 (1995)
3. Ahn, H.J., Lee, H.J., Cho, K., Park, S.J.: Utilizing knowledge context in virtual collaborative work. Decision Support Systems 39(4), 563–582 (2005)
4. Budzik, J., Hammond, K.: Watson: Anticipating and contextualizing information needs. In: Proceedings of the Sixty-Second Annual Meeting of the American Society for Information Science (1999)
5. Davenport, T., Beck, J.: The Attention Economy: Understanding the New Currency of Business. Harvard Business School Press (2001)
6. Horvitz, E., Kadie, C.M., Paek, T., Hovel, D.: Models of attention in computing and communication: From principles to applications. Communications of the ACM 46(3), 52–59 (2003)
7. Maglio, P.P., Campbell, C.S., Barrett, R., Selker, T.: An architecture for developing attentive information systems. In: Knowledge-Based Systems, vol. 14, pp. 103–110. Elsevier, Amsterdam (2001)
8. Maus, H.: Workflow context as a means for intelligent information support. In: Akman, V., Bouquet, P., Thomason, R.H., Young, R.A. (eds.) CONTEXT 2001. LNCS (LNAI), vol. 2116, pp. 261–274. Springer, Heidelberg (2001)
9. Roda, C., Thomas, J.: Attention Aware Systems: Theory, Application, and Research Agenda. Computers in Human Behaviour 22, 557–587 (2006)
Two Applications of Paraconsistent Logical Controller
Jair Minoro Abe1,2, Kazumi Nakamatsu3, and Seiki Akama4
1 Graduate Program in Production Engineering, ICET - Paulista University, R. Dr. Bacelar, 1212, CEP 04026-002, São Paulo – SP – Brazil
2 Institute For Advanced Studies – University of São Paulo, Brazil
[email protected]
3 School of Human Science and Environment/H.S.E. – University of Hyogo – Japan
[email protected]
4 C-Republic Inc., 1-20-1, Higashi-Yurigaoka, Asao-ku, Kawasaki-shi, 215-0012, Japan
[email protected]
Abstract. In this paper we discuss two applications of the logical controller Paracontrol. Such a controller is based on Paraconsistent Annotated Logic and is capable of manipulating imprecise, inconsistent and paracomplete data.
Keywords: Logical controller, paraconsistent logics, annotated logics, conflicts and automation, temperature sensors.
1 Introduction
A Paraconsistent Logical Controller based on Paraconsistent Annotated Evidential Logic Eτ was introduced in [5]. Such a controller was dubbed Paracontrol. In this paper we sketch two more applications: an electronic device for the locomotion of blind and/or deaf people, and an autonomous mobile robot based on temperature sensors. The Paracontrol is the electric-electronic materialization of the Para-analyzer algorithm [5], which is basically an electronic circuit that treats logical signals in the context of the logic Eτ [1]. The atomic formulae of the logic Eτ are of the type p(μ, λ), where (μ, λ) ∈ [0, 1]² and [0, 1] is the real unit interval with the usual order relation, and p denotes a propositional variable. There is an order relation defined on [0, 1]²: (μ1, λ1) ≤ (μ2, λ2) ⇔ μ1 ≤ μ2 and λ1 ≤ λ2. Such an ordered system constitutes a lattice, which will be symbolized by τ. p(μ, λ) can be intuitively read: “It is believed that p’s favorable evidence is μ and contrary evidence is λ.” So, we have some interesting examples:
• p(1.0, 0.0) can be read as a true proposition.
• p(0.0, 1.0) can be read as a false proposition.
• p(1.0, 1.0) can be read as an inconsistent proposition.
• p(0.0, 0.0) can be read as a paracomplete proposition.
• p(0.5, 0.5) can be read as an indefinite proposition.
Note that the concept of paracompleteness is the “dual” of the concept of paraconsistency. The Paracontrol compares logical values and determines domains of a state lattice corresponding to the output value. Favorable evidence and contrary evidence degrees are represented by voltages. Certainty and contradiction degrees are determined by analogues of operational amplifiers. The Paracontrol comprises both
analog and digital systems, and it can be externally adjusted by applying positive and negative voltages. The Paracontrol was tested in real-life experiments with the autonomous mobile robot Emmy [6], whose favorable/contrary evidence degrees correspond to the values of ultrasonic sensors, with distances represented by continuous voltage values.
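The following sketch encodes the annotation lattice and the intuitive readings listed above. The componentwise order is quoted from the text; the classification of intermediate annotations is deliberately left coarse here, since the finer output states are discussed later.

```python
# Minimal sketch of the annotation lattice of the logic Etau: annotations
# are pairs (mu, lambda) in [0,1]^2 ordered componentwise.

def leq(a: tuple, b: tuple) -> bool:
    """Lattice order: (mu1, l1) <= (mu2, l2) iff mu1 <= mu2 and l1 <= l2."""
    return a[0] <= b[0] and a[1] <= b[1]

def reading(mu: float, lam: float) -> str:
    """Intuitive reading of an annotation, per the examples in the text."""
    if (mu, lam) == (1.0, 0.0): return "true"
    if (mu, lam) == (0.0, 1.0): return "false"
    if (mu, lam) == (1.0, 1.0): return "inconsistent"
    if (mu, lam) == (0.0, 0.0): return "paracomplete"
    return "intermediate"

print(leq((0.2, 0.1), (0.9, 0.3)))  # True: more evidence of both kinds
print(reading(1.0, 1.0))            # inconsistent
```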
2 Paraconsistent Autonomous Mobile Robot Hephaestus1
Recently we tested the Paracontrol in another study, namely to verify its performance in robots with temperature sensors. The prototype studied was dubbed Hephaestus [8]. The robot uses a PIC 16F877A microcontroller, two pairs of LM35 temperature sensors (which can measure 0 °C to 100 °C, with an output voltage variation of 10 mV/°C) and two DC motors with speed reduction. Each 1 °C of temperature variation corresponds to a 10 mV variation of the output voltage signal (i.e., temperatures ranging from 2 °C to 60 °C). A power supply circuit provides the voltages necessary to guarantee the performance of the robot’s circuits; we employed 5 V and 12 V supplies.
Fig. 1. Electric diagram of the feed circuit
Information on the robot’s environment is sent to the microcontroller by a sensor circuit and processed according to the Paracontrol, resulting in a decision to be performed by the robot. Next, we describe the operation of the program located in the memory of the microcontroller. The control circuit is responsible for calculating the temperature where the robot is. It is also responsible for transforming this temperature into favorable and contrary evidence degrees, executing the Para-analyzer algorithm, and generating the signals for starting the motors. Two motors supplied by a DC voltage ranging from 0 to 12 V are responsible for the robot’s movements: going backward, going forward, and speed control. The Paracontrol is also responsible for determining which motor should operate, which way it should turn and which speed should be applied. A D/A circuit was built using an adder chain to convert the incoming digital signals into analog signals, a PWM circuit to help control the speed, and a bridge circuit to control the direction in which the motors turn. Each motor has its own independent circuit operated by the microcontroller outputs. The logical controller Paracontrol quantifies the favorable evidence (μ) and contrary evidence (λ) degrees corresponding to the physical signal values originating from the temperature sensors, resulting in an action.
1 The ancient Greek god of fire.
Fig. 2. Sensors circuit
This analysis is made according to the output states lattice τ (see Fig. 3). Information about the temperature is captured by two pairs of sensors; each pair gives the value of a favorable evidence degree and a contrary evidence degree: (μ1, λ1) and (μ2, λ2). The temperature measuring and control circuit was designed with a reference of 50 °C, that is to say, the robot is required to take an action if the temperature is equal to or greater than 50 °C. The possible actions are going forward, going backward or turning aside. The temperature sensors were placed on the extremities of the base of the robot, in such a manner that different temperatures can be identified in front of, behind, on the left side or on the right side of the robot, which will determine its decision. In this case, the different degrees of favorable or contrary evidence of the process can be submitted to the logical operations NOT, OR and AND in order to obtain resulting values (μR and λR). The logical operation applied will depend on the process that is under control. For each pair of sensors, the logical operation OR (maximization of the terms) was applied, so that if one of the sensors fails, the temperature measurement is performed by the second sensor without loss of efficiency. Table 1 presents the conversion values of the analog input signals.
Table 1. Conversion values of the analog input signals
Voltage signal (V)    0      0.3    0.6    0.9    1.2    1.5    1.8    2.1    2.4    2.7    3.0
Evidence signal       0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
Binary (MSB nibble)   0000   0001   0011   0100   0110   1000   1001   1011   1100   1101   1111
Binary (LSB nibble)   0000   1001   0011   1101   0111   0001   1011   0101   1110   1000   1111
Hexadecimal signal    00     19     33     4D     67     81     9B     B5     CE     D8     FF
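A sketch of the conversion chain suggested by the LM35 specification and Table 1 follows. The linear normalization of the stated 2–60 °C working range onto an evidence degree in [0, 1], and the OR (maximum) combination per sensor pair, follow the text; the exact scaling used in the firmware is an assumption.

```python
# Hedged sketch of the sensor-to-evidence conversion. The LM35 outputs
# 10 mV/degC; the mapping of the 2-60 degC working range onto [0, 1] is
# an assumption, as the paper does not give the firmware's exact scaling.

T_MIN, T_MAX = 2.0, 60.0  # working range stated in the text

def lm35_voltage(temp_c: float) -> float:
    """LM35 output voltage in volts: 10 mV per degree Celsius."""
    return 0.010 * temp_c

def evidence_from_temp(temp_c: float) -> float:
    """Map a temperature linearly onto an evidence degree in [0, 1]."""
    mu = (temp_c - T_MIN) / (T_MAX - T_MIN)
    return max(0.0, min(1.0, mu))

def pair_evidence(t1: float, t2: float) -> float:
    """OR (maximization) over a sensor pair, so one failed sensor is tolerated."""
    return max(evidence_from_temp(t1), evidence_from_temp(t2))

def to_byte(mu: float) -> int:
    """8-bit quantization of an evidence degree (cf. the hex row of Table 1)."""
    return round(mu * 255)

mu_front = pair_evidence(52.0, 51.0)
print(mu_front, hex(to_byte(mu_front)))  # ~0.86 -> 0xdc
```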
Fig. 3. Output states lattice – Extreme and Non-extreme states
Table 2. Symbology of the Extreme and Non-extreme states
Extreme states   Symbol    Non-extreme states                     Symbol
True             V         Quasi-true tending to Inconsistent     D
False            F         Quasi-true tending to Paracomplete     E
Inconsistent     T         Quasi-false tending to Inconsistent    A
Paracomplete     ⊥         Quasi-false tending to Paracomplete    H
                           Quasi-inconsistent tending to True     C
                           Quasi-inconsistent tending to False    B
                           Quasi-paracomplete tending to True     F
                           Quasi-paracomplete tending to False    G
The decision states are defined by the control values. So, the analysis of the input parameters is started after the definition of the control limit values (C1, C2, C3 and C4). The parameters of the control values are defined as follows: C1 = upper certainty control value; C2 = lower certainty control value; C3 = upper contradiction control value; C4 = lower contradiction control value. In the prototype studied we worked with C1 = C2 = C3 = C4 = 0.75.
Table 3. Control limit values
Parameter   C1         C2         C3         C4
Value       11000001   00111111   11000001   00111111
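A sketch of the resulting decision procedure follows. The certainty degree (μ − λ) and contradiction degree (μ + λ − 1) are the usual definitions in the Paracontrol literature; they are not restated in this paper, so treat them, and the comparison against the control limit values, as assumptions of this sketch.

```python
# Hedged sketch of the Para-analyzer decision with control limit values.
C1 = C2 = C3 = C4 = 0.75  # control limit values used in the prototype

def para_analyzer(mu: float, lam: float) -> str:
    """Return the extreme output state, or 'non-extreme' otherwise."""
    certainty = mu - lam            # assumed definition
    contradiction = mu + lam - 1.0  # assumed definition
    if certainty >= C1:
        return "V (true)"
    if certainty <= -C2:
        return "F (false)"
    if contradiction >= C3:
        return "T (inconsistent)"
    if contradiction <= -C4:
        return "⊥ (paracomplete)"
    return "non-extreme state"

print(para_analyzer(0.9, 0.1))  # V (true): strong favorable, weak contrary
print(para_analyzer(0.9, 0.9))  # T (inconsistent): both evidences high
```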
In six tests, lasting from 13 to 60 seconds, the robot was able to avoid a heat source placed in front of it inside a box with an escape opening, at a success rate of 80%. In the remaining cases, the printed circuit board overheated or the robot did not escape from the enclosure.
3 Keller – Electronic Device for Blind and/or Deaf People Locomotion
The Paracontrol was also used in the construction of an electronic device, which we dubbed Keller2, for helping blind and/or deaf people in their locomotion. The main components of Keller are a microcontroller of the 8051 family, two ultrasonic sensors, and two vibracalls. Figure 4 shows Keller’s basic structure. The ultrasonic sensors are responsible for verifying whether there is any obstacle in front of the person within the sonars’ acting area. The signals generated by the sensors are sent to the microcontroller. These signals are used to determine the favorable evidence degree μ and the contrary evidence degree λ regarding the proposition “There is no obstacle in front of the person.” The Paracontrol, recorded in the internal memory of the microcontroller, then uses them to determine the person’s movements, through a signal generated by the vibracalls, also provided by the microcontroller. The vibracalls can be placed wherever is comfortable for the user. In the Keller prototype, the two ultrasonic sensors were placed on the person’s chest (Fig. 4). The main motivation for studying this prototype lies in the very simple implementation of the Paracontrol. Keller is a promising device for aiding blind and/or deaf people in their locomotion. Some tests were made in real situations with promising results. However, we are aware that the problem as a whole requires improvement, with more accurate sensors, as well as solutions for detecting obstacles on the ground and numerous other questions.
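A rough sketch of Keller’s decision flow under stated assumptions: the distance-to-evidence mapping and the vibration policy below are illustrative, as the paper does not specify them.

```python
# Hedged sketch of Keller: two sonar readings yield favorable/contrary
# evidence for "there is no obstacle in front of the person", and the
# decision drives the vibracalls. Mapping and thresholds are assumptions.

SAFE_DISTANCE_CM = 150.0  # assumed sonar range considered "clear"

def no_obstacle_evidence(d1_cm: float, d2_cm: float) -> tuple:
    """Nearer echoes mean more contrary evidence for 'no obstacle'."""
    mu = min(d1_cm, d2_cm) / SAFE_DISTANCE_CM   # favorable: path clear
    mu = max(0.0, min(1.0, mu))
    lam = 1.0 - mu                              # contrary: obstacle near
    return mu, lam

def vibracall_command(mu: float, lam: float) -> str:
    if mu - lam >= 0.75:
        return "vibracalls off: path clear"
    if lam - mu >= 0.75:
        return "strong vibration: obstacle ahead"
    return "weak vibration: uncertain"

print(vibracall_command(*no_obstacle_evidence(40.0, 55.0)))
```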
Fig. 4. Keller: position of the sensors
2 Keller is in homage to Helen Adams Keller.
4 Conclusions
The applications considered show the power and even the beauty of paraconsistent systems. They have provided ways other than the usual electronic theory, opening applications of non-classical logics in order to overcome situations not covered by other systems or techniques. All the topics treated in this paper are being improved, as are many other projects on a variety of themes. We hope to say more in forthcoming papers.
References
1. Abe, J.M.: Fundamentos da Lógica Anotada (Foundations of Annotated Logics), in Portuguese, Ph.D. Thesis, Universidade de São Paulo, São Paulo (1992)
2. Torres, C.R.: Sistema Inteligente Paraconsistente para Controle de Robôs Móveis Autônomos, MSc. Dissertation, Universidade Federal de Itajubá - UNIFEI, Itajubá (2004)
3. Abe, J.M.: Some Aspects of Paraconsistent Systems and Applications. Logique et Analyse 157, 83–96 (1997)
4. Abe, J.M., da Silva Filho, J.I.: Manipulating Conflicts and Uncertainties in Robotics. Multiple-Valued Logic and Soft Computing 9, 147–169 (2003)
5. Da Silva Filho, J.I.: Métodos de Aplicações da Lógica Paraconsistente Anotada de Anotação com Dois Valores LPA2v com Construção de Algoritmo e Implementação de Circuitos Eletrônicos, in Portuguese, Ph.D. Thesis, Universidade de São Paulo, São Paulo (1999)
6. Da Silva Filho, J.I., Abe, J.M.: Emmy: a paraconsistent autonomous mobile robot. In: Abe, J.M., Da Silva Filho, J.I. (eds.) Logic, Artificial Intelligence, and Robotics, Proc. 2nd Congress of Logic Applied to Technology – LAPTEC 2001. Frontiers in Artificial Intelligence and Its Applications, vol. 71, pp. 53–61. IOS Press, Amsterdam (2001)
7. McKerrow, P.: Introduction to Robotics. Addison-Wesley Publishing Company, New York (1992)
8. Berto, M.F.: Aplicação da Lógica Paraconsistente Anotada Evidencial Eτ no Controle de Sensores de Temperatura na Atuação de Robôs Móveis, MSc. Dissertation, in Portuguese, Paulista University, São Paulo (2007)
Encoding Modalities into Extended Petri Net for Analyzing Discrete Event Business Process
Takashi Hattori1,2, Hiroshi Kawakami1, Osamu Katai1, and Takayuki Shiose1
1 Graduate School of Informatics, Kyoto University, Sakyo-ku, Kyoto, 606-8501, Japan
{kawakami,katai,shiose}@i.kyoto-u.ac.jp
http://www.symlab.sys.i.kyoto-u.ac.jp/
2 NTT Communication Science Laboratories, 2-4, Hikaridai, Seika-cho, “Keihanna Science City”, Kyoto, 619-0237, Japan
takashi [email protected]
Abstract. This paper proposes a method for encoding discrete systems together with their tasks into an extended Petri net. The main feature of this method is the interpretation of tasks, which are first described by modal logic formulae, into what we call “task unit graphs” that can be naturally combined with a Petri net. The combination of a normal Petri net and task unit graphs provides a method for detecting conflicts among tasks. Furthermore, we examine a way for resolving such conflicts and attaining the correct behavior of systems.
1 Introduction
To cope with the recent need for constructing sophisticated reliable systems that are governed in a discrete manner, such as IT process models, this paper proposes a theoretical framework for analyzing discrete event-driven systems. Our framework employs a representational method based on a Petri net [1] and a combination of two kinds of modal logics [2]: “temporal logic [3]” and “deontic logic [4, 5].” A Petri net is known as a conventional method for modeling the physical structure and event-driven behavior of systems, with rich mathematical analysis [6]. Recent trends in IT business have revitalized Petri nets as one of the frameworks of process modeling, together with conventional Peterson’s ERM (Entity-Relationship Model), UML (Unified Modeling Language) and so on. For example, the research on translating UML into a Petri net [7] can be seen as work to enhance a Petri net’s visibility. The long history of research on Petri nets has clarified not only their ability but also certain problems, such as their visibility and their suitability for modular techniques [8]. To overcome these problems, this paper employs the strategy, as shown in Fig. 1, of dividing a target system into two portions. The portion that is pre-fixed by physical and structural constraints is represented by a simple Petri net with high visibility, and the portion that is conducted through control and systemic constraints is represented by modal logic formulae. One of the
Fig. 1. Overview of Proposed Framework
promising methods for dealing with the latter portion is to follow the rigid syntax, semantics and proof theory of the combination of Tense logic with Deontic Logic [9]. On the other hand, our strategy translates modal logic formulae into what we call “task unit graphs” that mediate the integration of the two portions into an extended Petri net. In these unit graphs, the deontic part of the modal logic formulae is substantiated as “imperative” arcs between the unit graphs, while the temporal part is represented by the structures of the unit graphs. Among several works combining temporal logic and Petri nets, such as [10, 11], the main feature of our framework is that each logic formula is directly interpreted into a network representation. Some part of the latter portion might be directly translated into a Petri net, but we do not adopt such a strategy because it remains to be determined whether it can be embedded into a Petri net. In addition, keeping the two portions of the system separate is better than encoding different aspects of the system into a complicated single network, for the sake of the system’s modularity. The rest of this paper consists of the following sections. Section 2 gives an overview of a way to model systems by using Petri nets and modal logic formulae. The formulae are then translated into “task unit graphs,” which are defined through a systematic analysis of logical definitions in Section 3. Section 4 describes how this representational framework enables us to elucidate potential “conflicts” among tasks.
2 Modeling Systems by Petri Net and Modal Logic
This section proposes a method for modeling discrete event systems based on a kind of Petri net and modal logic. For illustrating the procedures of system modeling, this paper employs the following example of a simple ID authentication system.
Example System: Generally, ID authentication systems require users’ submission of a set of an ID and a password. The ID authentication system we employ here is unique in that it accepts not only an ID-password set but also a series of IDs and passwords, such series as “an ID and a password submitted alternately twice” or “ID submissions twice followed by a password submission.” The series may be called a “pass-action.”
Fig. 2. Example of Target System
Unlike ordinary systems, this new system cannot store the ID and the password in individual memories. Instead, it has to accept both the ID and the password in any order, and also has to hand over the sequence of submissions to the authentication engine. Figure 2 shows an example of this system that allows the length of pass-actions to be two at most.
2.1 Petri Net Representation of System’s Pre-fixed Aspect
The first step of our framework is to encode the pre-fixed portion of a system into a Petri net. It is known that a k-bounded standard Petri net can be translated into an equivalent 1-bounded Petri net. We employ a 1-bounded one called the condition/event system (C/E system) [12], where a transition can fire only if all its output places are empty. For instance, the example shown in Fig. 2 is the case where the upper bound of the number of stored passwords or IDs is two (2-bounded). Thus, it can be modeled as a C/E system as shown in Fig. 3, where a token in place Pi means that the process Pi is in progress. Places P1/P6 correspond to the idling states of the users and the authentication engine, respectively. A token in P4 means that data are stored in the cache P4, and transition τ5 means the “data transfer from P4 to the memory P5” and the “flushing of P4.”
Fig. 3. Example of Petri Net
By making each place of a C/E system correspond to a proposition, we can represent the true/false value of the proposition by putting/removing a token in/from the place. In this case, each transition leads to a value alteration of the proposition. For instance, in Fig. 3, the firing of τ6 causes P4, P5 and P6 to turn from true to false, and P7 to turn from false to true.
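A minimal sketch of this C/E firing discipline follows, with places doubling as propositions (marked = true, empty = false). The input/output places of τ6 are inferred from the flips described above rather than read off Fig. 3.

```python
# Minimal sketch of a C/E system (1-bounded Petri net). A transition may
# fire only when all its input places are marked AND all its output
# places are empty, as required in the text.

class CESystem:
    def __init__(self, marking: set):
        self.marking = set(marking)   # places currently holding a token (= true)
        self.transitions = {}         # name -> (input places, output places)

    def add_transition(self, name: str, inputs: set, outputs: set) -> None:
        self.transitions[name] = (set(inputs), set(outputs))

    def enabled(self, name: str) -> bool:
        ins, outs = self.transitions[name]
        return ins <= self.marking and not (outs & self.marking)

    def fire(self, name: str) -> None:
        ins, outs = self.transitions[name]
        assert self.enabled(name), f"{name} is not enabled"
        self.marking = (self.marking - ins) | outs  # inputs turn false, outputs true

net = CESystem(marking={"P4", "P5", "P6"})
net.add_transition("tau6", inputs={"P4", "P5", "P6"}, outputs={"P7"})
net.fire("tau6")
print(net.marking)  # {'P7'}: P4, P5, P6 turned false; P7 turned true
```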
2.2 Modal Logic Representation of Tasks
The second step of our framework is to represent each task that states “when the focused state is required to be true” as a proposition by introducing modal logics such as temporal and deontic logics.
Temporal Modalities. A temporal logic is given by the propositional logic, modal operators, and an axiom system. This paper employs the following modal operators:
TA: A will be true at the next state S1,
GA: A will be true from now on (S0, S1, S2, ...),
FA: A will be true at some time in the future Sj (j ≥ 0),
AUB: B is true at the current state S0, or A will be true from now on until the first moment when B will be the case,
where A, B denote logic formulae, and S0 and Si (i > 0) mean the current and future states (worlds), respectively. Axiom systems of temporal logic vary depending on the viewpoint of time. This paper employs one of the discrete and linear axiom systems, KSU [13], which is an extension of the minimal axiom system Kt [3]. Introducing Y (yesterday) as the mirror image of T (tomorrow), the axiom system claims that T¬A ≡ ¬TA, Y¬A ≡ ¬YA, and TYA ≡ YTA ≡ A. Introducing S (since) as the mirror image of U (until), GA ≡ AU⊥, where ⊥ denotes the contradiction, and FA ≡ ¬G¬A, Kt is rewritten as
AUB ≡ B ∨ (A ∧ T(AUB)),   (1)
ASB ≡ B ∨ (A ∧ Y(ASB)),   (2)
{(A ⊃ T(A ∨ B))UC} ⊃ {A ⊃ AU(B ∨ C)},   (3)
{(A ⊃ Y(A ∨ B))SC} ⊃ {A ⊃ AS(B ∨ C)},   (4)
where A, B and C denote logic formulae.
Deontic Modalities. Understanding the system’s behavior by temporal logic gives an “objective” view of the focused proposition. To represent our “subjective” intention or purpose, such as how the propositions should behave, i.e., the control rule (or task), we introduce deontic modalities: OA: A is obligatory; PA: A is permitted. The axiom system we adopt here for O and P is that of SDL (standard deontic logic), which defines OA ≡ ¬P¬A, and claims OA ⊃ PA and O(A ⊃ B) ⊃ (OA ⊃ OB). Control rules and specifications of systems can be translated into combinations of temporal and deontic modes by using “translation templates” such as OFA: A has to be true in the future; PGA: A can always be true.
Translating Tasks into Tentative Obligations. Assume that the provider of the authentication system shown in Fig. 2 must adopt some privacy policies as control rules in order to guarantee the system’s security, such as:
PP1: submission and authentication must not be done simultaneously;
PP2: password submission is always followed by an ID submission;
PP3: submission is accepted only when the cache is empty;
PP4: the system should request a password submission randomly.
Each task is translated into modal logic formulae, which are activated by the firing of the corresponding transition, as follows.
PP1: This task is easily encoded into the C/E system, but in order to maintain high modularity, we translate it into logic formulae first. This task can be broken down into relatively logical sentences, i.e., after an authentication begins (τ6), input methods have to wait (P1) until the authentication finishes (P6), i.e., τ6 activates O(P1UP6). Also, after an input process begins (τ1 or τ3), the authentication has to wait (P6) until the input process finishes (P1), i.e., each of τ1 and τ3 activates O(P6UP1).
PP2: This task means that after a password submission (τ4), passwords are not accepted (¬P3) until an ID input method works (P2), i.e., τ4 activates O(¬P3UP2).
PP3: This task means that after some data are stored in the cache (τ2 or τ4), input processes should wait (P1) until the cache is flushed (¬P4), i.e., each of τ2 and τ4 activates O(P1U(¬P4)).
PP4: This task means that after an ID process starts (τ1), the password input process (P3) has to work in the future, i.e., τ1 activates OFP3.
Translating Tasks into Resident Obligations: Not every task corresponds to a specific transition. Some tasks are translated into logical forms that are not activated by a transition but are always activated. For example, the rule
PP0: Once the authentication process (P7) is finished, a password (P3) should not be accepted until the cache memory (P4) flushes its contents to the memory P5,
is resident and translated into OG({(¬P3)U(¬P4 ∧ P5)}S(¬P7)).
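To connect the logical and operational views, the following sketch checks an obligation O(AUB) against a finite trace of states by unrolling equation (1). Treating the end of a finite trace as a violation of a still-pending obligation is an assumption made for the sketch; the paper works with ongoing behaviors.

```python
# Hedged sketch: evaluate "A until B" over a finite trace via eq. (1):
# AUB = B or (A and T(AUB)).

def holds_until(a, b, trace, i=0):
    """True iff 'a until b' holds at position i of the trace (eq. (1))."""
    if i >= len(trace):
        return False                      # obligation still pending at trace end
    if b(trace[i]):
        return True                       # B is the case now
    return a(trace[i]) and holds_until(a, b, trace, i + 1)

# PP1 after tau6: O(P1 U P6) -- "users idle until the engine is idle again".
trace = [{"P1": True, "P6": False},
         {"P1": True, "P6": False},
         {"P1": True, "P6": True}]
print(holds_until(lambda s: s["P1"], lambda s: s["P6"], trace))  # True
```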
3 Network Representation of Tasks
The third step of our framework is to translate the tasks represented by modal logic into an extended Petri net, which we call a “task unit graph,” by introducing the four types of special arcs shown in Fig. 4. These arcs are placed from a place to a transition. They function whenever the place holds a token and the transition satisfies its firing condition, but they differ from the ordinary arcs of the standard Petri net on the following points. First, they do not transfer tokens from places to transitions. Next, if there are multiple special arcs from the same place, all of them are activated simultaneously. As a result, simultaneous firing of multiple transitions is permitted at the same state.
(a) prohibit firing; (b) force firing at the next step; (c) force firing in the future; (d) force synchronized firing
Fig. 4. Special arcs for control of transition firing
(a) OTA; (b) OGA; (c) OFA; (d) O(AUB)
Fig. 5. Examples of Task Unit Graph
Figure 5 shows task unit graphs, which are derived by a systematic analysis of logical representations of tasks.
OTA: A has to be true at the next state S1. If the current state S0 holds A, OTA forbids the transition from A to ¬A. Otherwise, OTA forces the transition from ¬A to A. In each case, OTA is accomplished at S1.
OGA: A has to be true from now on. If S0 holds A, the transition from A to ¬A is forbidden. Otherwise, OGA cannot be true.
OFA: A has to be true at some time in the future. This task forces the transition from ¬A to A in the future. If A becomes true, OFA is accomplished. So an arc of “compulsion of firing at the next step” is placed from A to the transition transferring the token to the place free at the next state.
O(AUB): A has to be true from now on until B is the case. If B is the case at S0, this task is accomplished; if ¬A ∧ ¬B at S0, this task cannot be accepted, due to Eq. (1). If A ∧ ¬B at S0, A should be maintained and O(AUB) also has to be the case at S1. In each case, O(AUB) at S0 prohibits the alteration from A to ¬A, so an arc of “prohibition of firing” is placed from the place of O(AUB) to the transition of the alteration from A to ¬A.
These general definitions of “task unit graph” are instantiated to specific tasks. Figure 6 shows an extended Petri net where general task unit graphs are instantiated to tasks PP0, ..., PP4. They include transitions that are unified with transitions of the Petri net representation of the pre-fixed portion of the target system in the middle part of Fig. 6. The unifications are indicated by synchronized firing linkages (Fig. 4(d)). It should be noted that, defining Q ≡ ¬P4 ∧ P5 and H ≡ ¬P3UQ, PP0 can be represented as OG(HS(¬P7)), which derives the hierarchical structure of the extended Petri net representation shown in the upper part of Fig. 6.
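As a rough illustration of how the special arcs constrain behavior, the sketch below derives, from a set of active task places, which proposition flips are prohibited and which are compelled. The (kind, proposition) encoding of tasks is an illustrative assumption, not the paper’s formalism.

```python
# Hedged sketch of special-arc semantics: active task places induce
# prohibited and compelled proposition flips for the current step.

def firing_constraints(active_tasks, state):
    """Return (prohibited, compelled) sets of proposition flips."""
    prohibited, compelled = set(), set()
    for kind, a, b in active_tasks:
        if kind == "OG" and state[a]:
            prohibited.add((a, False))       # Fig. 4(a): never let A turn false
        elif kind == "OT":
            if state[a]:
                prohibited.add((a, False))   # hold A through the next step
            else:
                compelled.add((a, True))     # Fig. 4(b): force not-A -> A next
        elif kind == "OF" and not state[a]:
            compelled.add((a, True))         # Fig. 4(c): force A some time ahead
        elif kind == "OU" and state[a] and not state[b]:
            prohibited.add((a, False))       # keep A until B becomes the case
    return prohibited, compelled

# An S5-like situation: O(P1 U not-P4) active with P1 true and not-P4
# false, plus OF P3 active with P3 false.
tasks = [("OU", "P1", "notP4"), ("OF", "P3", None)]
state = {"P1": True, "notP4": False, "P3": False}
print(firing_constraints(tasks, state))
# ({('P1', False)}, {('P3', True)})
```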
Fig. 6. Hierarchical Extended Petri net Representation of the System with Tasks
Fig. 7. Task Unit Graph for S and that for ∧ (conjunction)
In the figure, task unit graphs representing S and ∧ (conjunction) are employed. Their general types are defined as shown in Fig. 7.
4 Detecting Conflict among Systems’ Behavior
The typical conflicts are observed among tasks. Figure 8 shows the state transitions of the target system with the initial state S0, which holds P1 and P6, and is in charge of tasks PP0, ..., PP4. Table 1 shows the markings of each
Fig. 8. State Transition Diagram
Table 1. State Transition Table (markings of the places P1–P7 and free, and of the task propositions P6UP1, FP3, P1U(¬P4), (¬P3)UP2, Q = ¬P4 ∧ P5, H = (¬P3)UQ, HS(¬P7) and G(HS(¬P7)), over the states S0–S7)
We have already clarified the criteria governing this kind of state transition [14], but the criteria indicate only which states involve task conflicts. The framework proposed in this paper detects how a state falls into conflict by tracing the synchronized firing linkages (broken lines in Fig. 6) as mutual interferences among tasks. For example, Fig. 9 shows the conflicts in state S5, where place O(P1 U (¬P4)) has a token, which prohibits the firing of τ1 and τ3. On the other hand, the token in place OF P3 requests the firing of τ3. Therefore, there is a conflict over firing τ3. The only way to resolve this conflict is to turn P4 into ¬P4, which leads the token in O(P1 U (¬P4)) to free. But the establishment of ¬H in this state prohibits turning P4 into ¬P4, as can be seen by tracing the synchronized firing linkages from OG(HS(¬P7)). As a result, this conflict cannot be resolved unless ¬H turns to H.
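Continuing the hypothetical sketch from the previous section (it reuses the Task, forbids, and compels helpers defined there), the conflict check just described could be read as follows. The sets of propositions each transition flips are our simplification: the paper instead traces the synchronized firing linkages of the net.

def find_conflicts(tasks, state, transitions):
    """Report transitions whose firing is both requested and prohibited.

    `transitions` maps a transition name to the set of propositions its
    firing would flip (our simplification of the net structure).
    """
    conflicts = []
    for name, flips in transitions.items():
        requested  = any(compels(t, state, p) for t in tasks for p in flips)
        prohibited = any(forbids(t, state, p) for t in tasks for p in flips)
        if requested and prohibited:
            conflicts.append(name)
    return conflicts

# Rough reading of state S5: firing tau3 would move the token from P1 to
# P3, i.e. flip both P1 and P3.  OF P3 requests the flip of P3, while
# O(P1 U notP4) forbids the flip of P1 as long as notP4 fails to hold,
# so tau3 is reported as conflicting.
tasks = [Task("OF", "P3"), Task("OU", "P1", "notP4")]
state = {"P1", "P6"}                              # notP4 is not established
print(find_conflicts(tasks, state, {"tau3": {"P1", "P3"}}))  # ['tau3']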
Fig. 9. Conflict Detection via Extended Petri Net
5 Conclusions

In this paper, we have proposed a novel modeling framework for systems that are governed in a discrete manner. In the framework, the tasks that conduct a system's behavior are represented by task unit graphs, which can be unified with the extended Petri net. If there are conflicts among multiple tasks, they can be detected visually on the network representation.

Our model also opens the way to discussing the "level of correctness" of the target system. Depending on its design, a system may be consistent under any series of operations and need no governance; it may need adequate control in order to keep itself consistent; or it may be impossible to keep consistent despite all possible governance. We call these systems "strongly correct," "weakly correct," and "incorrect," respectively. We expect that the adequate governance needed to keep weakly correct systems consistent can be derived through analysis of the proposed modeling method.

Such a discussion of correct behavior can be extended to multi-agent systems by assuming that agents are in charge of governing their own tasks. Although conflicts mainly occur among tasks, multiple agents are another major cause of conflicts: agents sometimes interfere with each other's fields of work. Our model has the potential to address such conflicts [14], and their solutions can likewise be derived through analysis of the extended Petri net.
References

1. Peterson, J.L.: Petri Net Theory and the Modeling of Systems. Prentice-Hall, Englewood Cliffs (1981)
2. Hughes, G.E., Cresswell, M.J.: An Introduction to Modal Logic. Methuen, London (1968)
3. Rescher, N., Urquhart, A.: Temporal Logic. Springer, Heidelberg (1971)
4. von Wright, G.H.: Deontic logic. Mind 60, 1–15 (1951)
5. Goble, L.F.: Gentzen systems for modal logic. Notre Dame J. of Formal Logic 15(3), 455–461 (1974)
6. Karatkevich, A.: Dynamic Analysis of Petri Net-Based Discrete Systems. LNCIS, vol. 356. Springer, Heidelberg (2007)
7. Hu, Z., Shatz, S.M.: Mapping UML diagrams to a Petri net notation for system simulation. In: Proc. of the 16th Int. Conf. on Software Engineering and Knowledge Engineering (SEKE), pp. 213–219 (2004)
8. Saldhana, J., Shatz, S.M.: UML diagrams to object Petri net models: An approach for modeling and analysis. In: Proc. of the Int. Conf. on Software Engineering and Knowledge Engineering (SEKE), pp. 103–110 (2000)
9. Åqvist, L.: Combinations of tense and deontic modality. In: Lomuscio, A., Nute, D. (eds.) DEON 2004. LNCS (LNAI), vol. 3065, pp. 3–28. Springer, Heidelberg (2004)
10. Penczek, W., et al.: Advances in Verification of Time Petri Nets and Timed Automata. Springer, Heidelberg (2006)
11. Okugawa, S.: Introduction to Petri Nets (in Japanese). Kyoritsu Shuppan Co., Ltd. (1995)
12. Reisig, W.: Petri Nets. Springer, Heidelberg (1982)
13. Katai, O., Iwai, S.: A design method for concurrent systems based on step diagram and tense logic under incompletely specified design criteria (in Japanese). Systems, Control and Information 27(6), 31–40 (1983)
14. Katai, O., et al.: Decentralized control of discrete event systems based on extended higher order Petri nets. In: Proc. of the Asian Control Conference, pp. 897–900 (1994)
Paraconsistent Before-After Relation Reasoning Based on EVALPSN

Kazumi Nakamatsu (University of Hyogo, Himeji, Japan; [email protected])
Jair Minoro Abe (Paulista University, Sao Paulo, Brazil; [email protected])
Seiki Akama (C-republic, Kawasaki, Japan; [email protected])
Abstract. A paraconsistent annotated logic program called EVALPSN, proposed by Nakamatsu, has been applied to real-time safety verification and control, such as pipeline process safety verification and control. In this paper, we introduce a new interpretation of EVALPSN, named bf-EVALPSN, that can deal with before-after relations between two processes (time intervals) dynamically and in a paraconsistent way. We show a simple example of an EVALPSN based reasoning system that can reason about before-after relations in real time.

Keywords: annotated logic program, EVALPSN, bf-EVALPSN, before-after relation, paraconsistent reasoning system.
1 Introduction

We have already developed a paraconsistent annotated logic program called Extended Vector Annotated Logic Program with Strong Negation (abbr. EVALPSN), which has been applied to various kinds of process safety verification and control, such as pipeline process control [5, 6, 7]. However, the EVALPSN based process control addresses each process itself, not the process order, and there are many systems, such as chemical plants, in which process order control based on safety verification is strongly required.

In this paper, we introduce a newly interpreted EVALPSN named bf (before-after)-EVALPSN and a paraconsistent reasoning system based on it that can deal with before-after relations between processes dynamically. Meaningful process before-after relations are classified into 15 kinds according to the before-after relations of the start/finish times of two processes, and they are represented paraconsistently by vector annotations whose components designate before/after degrees. The vector annotation (m, n) representing a before-after relation can be determined dynamically according to the order of the process start/finish times of the two processes. We also show, with a simple example, how the paraconsistent reasoning system based on bf-EVALPSN can deal with process order correctly in real time.
Fig. 1. Lattice Tv(2) and Lattice Td
This paper is organized in the following manner: firstly, EVALPSN is reviewed briefly; subsequently, bf-EVALPSN is introduced together with a paraconsistent before-after relation reasoning system; lastly, we conclude with the features of bf-EVALPSN and a comparison between bf-EVALPSN and Allen's interval temporal logic [1, 2].
2 EVALPSN

We review EVALPSN briefly [6]. Generally, a truth value called an annotation is explicitly attached to each literal in annotated logic programs [3]. For example, let p be a literal and μ an annotation; then p : μ is called an annotated literal. The set of annotations constitutes a complete lattice. An annotation in EVALPSN has the form [(i, j), μ] and is called an extended vector annotation. The first component (i, j) is called a vector annotation, and the set of vector annotations constitutes the complete lattice

Tv(n) = { (x, y) | 0 ≤ x ≤ n, 0 ≤ y ≤ n, where x, y and n are integers }

shown in Fig. 1. The ordering ⪯v of Tv(n) is defined as follows: let (x1, y1), (x2, y2) ∈ Tv(n); then (x1, y1) ⪯v (x2, y2) iff x1 ≤ x2 and y1 ≤ y2. For each extended vector annotated literal p : [(i, j), μ], the integer i denotes the amount of positive information supporting the literal p and the integer j that of negative information. The second component μ is an index of fact and deontic notions such as obligation, and the set of second components constitutes the complete lattice

Td = {⊥, α, β, γ, ∗1, ∗2, ∗3, ⊤}.

The ordering ⪯d of Td is described by the Hasse diagram in Fig. 1. The intuitive meanings of the members of Td are: ⊥ (unknown), α (fact), β (obligation), γ (non-obligation), ∗1 (fact and obligation), ∗2 (obligation and non-obligation), ∗3 (fact and non-obligation), and ⊤ (inconsistency). The complete lattice Te(n) of extended vector annotations is then defined as the product Tv(n) × Td. The ordering ⪯e of Te(n) is defined as follows: let [(i1, j1), μ1], [(i2, j2), μ2] ∈ Te(n); then

[(i1, j1), μ1] ⪯e [(i2, j2), μ2]  iff  (i1, j1) ⪯v (i2, j2) and μ1 ⪯d μ2.
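As an illustration, the two lattice orderings and their product could be sketched as follows; the encoding (tuples for Tv(n), strings and an explicit covering relation for Td) is ours, not the paper's.

from itertools import product

def tv(n):
    """The complete lattice Tv(n) of vector annotations."""
    return {(x, y) for x, y in product(range(n + 1), repeat=2)}

def leq_v(a, b):
    """(x1, y1) <=_v (x2, y2) iff x1 <= x2 and y1 <= y2."""
    return a[0] <= b[0] and a[1] <= b[1]

# Covering relation of Td, read off the Hasse diagram in Fig. 1.
COVERS = {
    "bot":   {"alpha", "beta", "gamma"},
    "alpha": {"*1", "*3"},
    "beta":  {"*1", "*2"},
    "gamma": {"*2", "*3"},
    "*1": {"top"}, "*2": {"top"}, "*3": {"top"},
    "top": set(),
}

def leq_d(a, b):
    """mu1 <=_d mu2 in Td (reflexive-transitive closure of COVERS)."""
    return a == b or any(leq_d(c, b) for c in COVERS[a])

def leq_e(ann1, ann2):
    """The product ordering on Te(n) = Tv(n) x Td."""
    (v1, d1), (v2, d2) = ann1, ann2
    return leq_v(v1, v2) and leq_d(d1, d2)

print(leq_e(((1, 0), "beta"), ((2, 0), "*1")))  # True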
There are two kinds of epistemic negation (¬1 and ¬2) in EVALPSN, which are defined as mappings over Tv(n) and Td, respectively.

Definition 1. (epistemic negations ¬1 and ¬2 in EVALPSN)
¬1([(i, j), μ]) = [(j, i), μ], ∀μ ∈ Td,
¬2([(i, j), ⊥]) = [(i, j), ⊥],   ¬2([(i, j), α]) = [(i, j), α],
¬2([(i, j), β]) = [(i, j), γ],   ¬2([(i, j), γ]) = [(i, j), β],
¬2([(i, j), ∗1]) = [(i, j), ∗3], ¬2([(i, j), ∗2]) = [(i, j), ∗2],
¬2([(i, j), ∗3]) = [(i, j), ∗1], ¬2([(i, j), ⊤]) = [(i, j), ⊤].
If we regard the epistemic negations as syntactical operations, an epistemic negation in front of a literal can be eliminated by the corresponding operation. For example, ¬1 p : [(2, 0), α] = p : [(0, 2), α] and ¬2 q : [(1, 0), β] = q : [(1, 0), γ]. There is another negation, called strong negation (∼), in EVALPSN, and it is treated as classical negation.

Definition 2. (strong negation ∼) [4] Let F be any formula and ¬ be ¬1 or ¬2. Then ∼F =def F → ((F → F) ∧ ¬(F → F)).

Definition 3. (well extended vector annotated literal) Let p be a literal. p : [(i, 0), μ] and p : [(0, j), μ] are called weva (well extended vector annotated)-literals, where i, j ∈ {1, 2, · · ·, n} and μ ∈ {α, β, γ}.

Definition 4. (EVALPSN) If L0, · · ·, Ln are weva-literals, then

L1 ∧ · · · ∧ Li ∧ ∼Li+1 ∧ · · · ∧ ∼Ln → L0

is called an EVALPSN clause. An EVALPSN is a finite set of EVALPSN clauses.

Fact and the deontic notions "obligation", "forbiddance" and "permission" are represented by the extended vector annotations [(m, 0), α], [(m, 0), β], [(0, m), β], and [(0, m), γ], respectively, where m is a positive integer.
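A minimal sketch of the two epistemic negations and of these deontic readings, under the same hypothetical encoding as above, might look as follows.

def neg1(ann):
    """Epistemic negation ¬1: swap the vector annotation's components."""
    (i, j), mu = ann
    return ((j, i), mu)

# Epistemic negation ¬2 exchanges beta with gamma and *1 with *3,
# leaving bot, alpha, *2 and top fixed (Definition 1).
NEG2 = {"bot": "bot", "alpha": "alpha", "beta": "gamma", "gamma": "beta",
        "*1": "*3", "*2": "*2", "*3": "*1", "top": "top"}

def neg2(ann):
    (i, j), mu = ann
    return ((i, j), NEG2[mu])

# Deontic readings, for a positive integer m:
def fact(m):        return ((m, 0), "alpha")
def obligation(m):  return ((m, 0), "beta")
def forbiddance(m): return ((0, m), "beta")
def permission(m):  return ((0, m), "gamma")

# The worked example in the text:
assert neg1(((2, 0), "alpha")) == ((0, 2), "alpha")
assert neg2(((1, 0), "beta")) == ((1, 0), "gamma")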
3 Before-After Relation in EVALPSN

First of all, we introduce a special literal R(pi, pj, t) whose vector annotation represents the before-after relation between processes Pri (pi) and Prj (pj) at time t; the literal R(pi, pj, t) is called a bf-literal. (Hereafter, the term "before-after" is abbreviated as "bf" in this paper.)

Definition 5. (bf-EVALPSN) An extended vector annotated literal R(pi, pj, t) : [μ1, μ2] is called a bf-EVALP literal, where μ1 is a vector annotation and μ2 ∈ {α, β, γ}. If an EVALPSN clause contains bf-EVALP literals, it is called a bf-EVALPSN clause, or just a bf-EVALP clause if it contains no strong negation. A bf-EVALPSN is a finite set of bf-EVALPSN clauses.

We define vector annotations to represent bf-relations, which are called bf-annotations. Strictly speaking, bf-relations are classified into 15 meaningful kinds according to the order of process start/finish times. Suppose that there are two processes, Pri with start time xs and finish time xf, and Prj with start time ys and finish time yf. Then we have the following 15 kinds of bf-annotations.

Before (be)/After (af): firstly, we define the basic bf-relations before/after according to the bf-relation between the start times of the two processes, which are represented by the bf-annotations be/af, respectively. If one process has started before/after the other, the bf-relations are defined as just 'before (be)/after (af)', respectively. These bf-relations are also depicted in Fig. 2, under the condition that process Pri has started before process Prj starts. The order of their start/finish times is denoted by the inequality {xs < ys}.
Fig. 2. Bf-relations, Before/After and Disjoint Before/After
Disjoint Before (db)/After (da): bf-relations disjoint before (db)/after (da) are described in Fig. 2.

Immediate Before (mb)/After (ma): bf-relations immediate before (mb)/after (ma) are described in Fig. 3.
Fig. 3. Bf-relations, Immediate Before/After and Joint Before/After
Joint Before (jb)/After (ja): bf-relations joint before (jb)/after (ja) are described in Fig. 3.

S-included Before (sb)/After (sa): bf-relations s-included before (sb)/after (sa) are described in Fig. 4.

Included Before (ib)/After (ia): bf-relations included before (ib)/after (ia) are described in Fig. 4.

F-included Before (fb)/After (fa): bf-relations f-included before (fb)/after (fa) are described in Fig. 5.
(If time t1 is earlier than time t2, we conveniently denote their relation by the inequality t1 < t2 in this paper.)
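Assuming both processes have already finished, the thirteen fully determined bf-annotations — the relations enumerated above plus the paraconsistent before-after relation pba discussed at the end of this section; be/af apply while only the start times are known — could be computed from the four time points as sketched below. The boundary conventions, e.g. that immediate before means xf = ys and s-included means equal start times, are read off Figs. 2–5 and should be taken as our assumption.

def bf_annotation(xs, xf, ys, yf):
    """Bf-annotation of process Pri (xs, xf) against process Prj (ys, yf)."""
    if xs == ys and xf == yf:
        return "pba"                       # paraconsistent before-after
    if xs == ys:
        return "sb" if xf < yf else "sa"   # s-included before/after
    if xf == yf:
        return "fb" if xs < ys else "fa"   # f-included before/after
    if xs < ys:                            # Pri started first
        if xf < ys:
            return "db"                    # disjoint before
        if xf == ys:
            return "mb"                    # immediate before
        return "jb" if xf < yf else "ib"   # joint / included before
    # Prj started first: the symmetric "after" cases
    if yf < xs:
        return "da"
    if yf == xs:
        return "ma"
    return "ja" if yf < xf else "ia"

print(bf_annotation(0, 2, 3, 5))  # 'db': Pri finishes before Prj starts
print(bf_annotation(0, 5, 1, 2))  # 'ib': Prj runs strictly inside Pri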
Fig. 4. Bf-relations, S-included Before/After and Included Before/After

Fig. 5. Bf-relations, F-included Before/After and Paraconsistent Before-after
Paraconsistent Before-after (pba): the bf-relation paraconsistent before-after (pba) is described in Fig. 5. If we consider the before-after measure over the 15 bf-annotations, there obviously exists a partial order(