152 29 10MB
English Pages 346 [340] Year 2006
Multimodal Intelligent Information Presentation
Text, Speech and Language Technology VOLUME 27
Series Editors Nancy Ide, Vassar College, New York Jean Véronis, Université de Provence and CNRS, France Editorial Board Harald Baayen, Max Planck Institute for Psycholinguistics, The Netherlands Kenneth W. Church, AT & T Bell Labs, New Jersey, USA Judith Klavans, Columbia University, New York, USA David T. Barnard, University of Regina, Canada Dan Tufis, Romanian Academy of Sciences, Romania Joaquim Llisterri, Universitat Autonòma de Barcelona, Spain Stig Johansson, University of Oslo, Norway Joseph Mariani, LIMSI-CNRS, France
The titles published in this series are listed at the end of this volume
Multimodal Intelligent Information Presentation Edited by
Oliviero Stock ITC-irst, Trento, Italy
and
Massimo Zancanaro ITC-irst, Trento, Italy
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN 1-4020-3050-9 (PB) ISBN 1-4020-3049-5 (HB) ISBN 1-4020-3051-7 (e-book)
Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands. Sold and distributed in North, Central and South America by Springer, 101 Philip Drive, Norwell, MA 02061, U.S.A. In all other countries, sold and distributed by Springer, P.O. Box 322, 3300 AH Dordrecht, The Netherlands.
Printed on acid-free paper
All Rights Reserved © 2005 Springer No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Printed in the Netherlands
TABLE OF CONTENTS INTRODUCTION
vii
Part I: LIFE LIKE CHARACTERS I. POGGI, C. PELACHAUD, F. DE ROSIS, V. CAROFIGLIO AND B. DE CAROLIS / Greta. A Believable Embodied Conversational Agent
3
P. PAGGIO AND B. JONGEJAN / Multimodal Communication in Virtual Environments
27
M. THEUNE, D. HEYLEN AND A. NIJHOLT / Generating Embodied Information Presentations
47
Part II: MOBILE PRESENTATIONS J. BAUS, A. KRÜGER AND C. STAHL / Resource-Adaptive Personal Navigation
71
O. STOCK, M. ZANCANARO AND E. NOT / Intelligent Interactive Information Presentation for Cultural Tourism
95
T. RIST / Supporting Mobile Users Through Adaptive Information Presentation
113
Part III: NATURAL LANGUAGE GENERATION E. ANDRÉ, K. CONCEPCION, I. MANI AND L. VAN GUILDER / AutoBriefer: A System for Authoring Narrated Briefings
143
G. CARENINI AND C. CONATI / Generating Tailored Worked-Out Problem Solutions to Help Students Learn from Examples
159
TABLE OF CONTENTS
VI
J. CALDER, A. C. MELENGOGLOU, C. CALLAWAY, E. NOT, F. PIANESI, I. ANDROUTSOPOULOS, C. D. SPYROPOULOS, G. XYDAS, G. KOUROUPETROGLOU AND M. ROUSSOU/ Multilingual Personalized Information Objects
177
P. PIWEK, R. POWER, D. SCOTT AND K. VAN DEEMTER / Generating Multimedia Presentations from Plain Text to Screen Play
203
C. ZINN, J. D. MOORE AND M. G. CORE / Intelligent Information Presentation for Tutoring Systems
227
Part IV: VIRTUAL AND AUGMENTED REALITY B. BELL, S. FEINER AND T. HÖLLERER / Maintaining Visibility Constraints for View Management in 3D User Interfaces
255
W. SWARTOUT, J. GRATCH, R. HILL, E. HOVY, R. LINDHEIM, S. MARSELLA, J. RICKEL AND D. TRAUM / Simulation Meets Hollywood
279
P.L. WEISS / Presentation Technologies for People With Disabilities
305
Part V: FUTURE DIRECTIONS H. BUNT, M. KIPP, M. MAYBURY AND W. WAHLSTER / Fusion and Coordination for Multimodal Interactive Information Presentation INDEX
325 341
INTRODUCTION
Information Technology often moves slowly and the potential of innovative concepts requires time to be fully understood. We know what happened with the history of the computer and its directions of development in general. Though the achievements of hardware technology were amazing, software development methodology lagged behind and did not consent full exploitation of the hardware. Even more strikingly, it took over three decades, until in the Seventies for the computer to be recognized as an interactive tool aimed not only at automatic computation, but also for assisting human beings in any of their activities. Especially with the appearance of the personal computer, and of graphic, intuitive interfaces including the concept of direct manipulation, the computer grew closer to naïve users. At the same time, it became evident that much further work was needed in human-computer interaction to make computers friendly and usable by a larger public. In the following decades, aside investing on the traditional themes - computing power, infrastructure, system development methodology, etc. - Information Technology put effort into improving the intelligence of systems and in their capability of understanding ways of communicating that are natural for humans. In the Nineties the web era opened a new phase, based substantially on two concepts: connectivity (one single information space, irrespective of physical location of resources) and a simple navigational modality of interaction (inspired by the concept of hypertext, itself about as old as the computer). Mobile computing has added one additional dimension to Information Technology and pervasiveness of computing is coming into our lives. Today everyone realizes that we share many aspects of our routine activities with computers. It is understood that something must be done so to make access to information feasible for all (e.g. interpretation of requests more complex than could be possible with a menu-based interface, selection of existing information so that only the most relevant is displayed, etc.). Researches in the field, however, are realizing that computers have the capability of doing more than just mimicking human-human communication. Presenting information flexibly, integrating various means of communication, providing intelligent feedback are just some examples. Intelligent Information Presentation is a field that is assuming an increasing strategic importance. Current intelligent presentation systems are task specific and refer mainly to some specific technologies, but they have something important in common. They all posit that presentations have to take into account the user’s characteristics, the user’s specific needs and the context in which the interaction takes place. The process of elaborating a presentation starts from some level of internal representation of the content and from some knowledge or experience of previous presentations and exploits the available means of communication for conveying the information in the most appropriate way. In some cases, the system may even have its own communicative goals that may decide which one to pursue as well as when and how. VII O. Stock and M. Zancanaro (eds.), Multimodal Intelligent Information Presentation, VII–XII © 2005 Springer. Printed in the Netherlands.
VIII
INTRODUCTION
At the root of the theme of Intelligent Information Presentation we can consider several scientific areas, but at least three are fundamental. Probably the first to be mentioned is Natural Language Generation, the branch of natural language processing that deals with the automatic production of texts. The field normally is described as investigating communicative goals, the dynamic choice of what to say, the planning of the overall rhetorical structure of the text (called sometime strategic planning), the actual realization of sentences on the basis of grammar and lexicon (sometimes called tactical planning), and so on. With a similar objective but with different means, the field of Adaptive Hypermedia combines hypertext (hypermedia) and user modeling. In contrast to NLG, text or multimedia existing documents are manipulated. Adaptive Hypermedia systems build a model of the goals, preferences and knowledge of the individual user and use this throughout the interaction for adaptation of the hypertext to the needs of the user. In other words, AH takes into account the fact that users vary in knowledge, cognitive skills and reasons for searching information. By keeping a model of some aspects of the user’s characteristics, the system can adapt to and aid the user in navigating and filtering information that best suits his or her goals. A third important field is computer graphics; it has experienced a fundamental passage toward the end of the Eighties, when it was understood that graphics production should start from internal representations and communicative goals in a way similar to language production. This passage has led to the possibility of developing multimodal systems, that in output would consider the available modalities, possibly the context and the user characteristics, and operate so that the message is allocated and realized in a coordinated way on several media. Intelligent Interactive Information Presentation has gone further along that line: it relates to the ability of a computer system to automatically produce multimodal information presentations, taking into account the specifics about the user, such as needs, interests and knowledge, and engaging in a collaborative interaction that helps the retrieval of relevant information and its understanding on the part of the user. It may include dimensions such as entertainment and education, opening important connections to areas that were not related to the world of human-computer interaction, such as broadcasting or cinematography. This vision has led to novel concrete aggregations. This is evident in a number of projects, in Europe as well as in America and in Japan, where the teams have included very diverse expertise. The vision has even caused the establishment of novel institutions: a good example is the Institute for Creative Technologies, created at the University of Southern California with the goal of bringing together researchers in simulation technology to collaborate with people from the entertainment industry. The idea was that “more compelling simulations could be developed if researchers who understood state-ofthe-art simulation technology worked together with writers and directors who knew how to create compelling stories and characters.” (see the chapter by Swartout et al. in this volume). Looking at the market prospects, many areas of IT will benefit from results in Intelligent Interactive Information Presentation. Future developments in the field are expected to have a significant impact on business output by increasing
INTRODUCTION
IX
productivity and improving quality of business documents’ content through rich user feedback. Training of employees at a company is another area that i3p technology can significantly enhance by improving the learning process. Systems able to provide assistance to customers that face, for instance, technical difficulties with an electronic product, can translate into significant savings for businesses, while also providing high-quality and tailored assistance to their customers. They can substantially simplify the presentation of information to nonexpert computer users as well as provide services for individual with various impairments. The tourism industry (about to become the most important business on the web) is an excellent example of the trend toward increasing demand from users for a personalized service. A privileged general area of application is education: Intelligent Information Presentation technology will bring a whole new experience for the user providing personalized information to allow individuals to learn at their own pace while gaining access to relevant information that efficiently complements their knowledge base. Obviously, interactivity in the entertainment industry can also be significantly enhanced by Intelligent Information Presentation technology by creating a personalized experience for the user. And many more areas could be indentified. With this book we have put together a collection of highly representative works that show the advancements and the trends in the field. The book is organized along five clusters. The first cluster encompasses three contributions on information presentation using lifelike characters. The contribution by Poggi, Pelachaud, de Rosis, De Carolis and Carofiglio illustrates the architecture of Greta, a believable embodied conversational agent built in the context of the EU Project MagiCster. Greta engages the users in natural conversations. It is provided with a mind, a dialogue manager and a body. The mind triggers an emotion on the basis of the agent's personality, the occurring events and the current dialog move. The dialog manager selects the appropriate dialog move to perform. The body, which includes a 3D face model is capable of expressing a set of nonverbal communicative functions. The second chapter, by Paggio and Jongejan, describes the multimodal communication components of a virtual world application developed at the Centre for Language Technology within the Danish research project "Staging of 3D Virtual Worlds". The application is a virtual farm in which the user can interact with an autonomous farmer agent to participate in simple conversations and actions concerning work on the farm. Communication is multimodal in that the agent can respond to verbal input combined with a limited number of deictic, iconic, and turntaking or giving hand gestures. Speech and gesture inputs are synchronised and, if the relevant syntactic and semantic constraints are satisfied, a unified interpretation of both is generated by a feature-based multimodal parser. Finally the work of Theune, Heylen and Nijholt discusses the output modalities available for information presentation by embodied, human-like agents which include both language and various nonverbal cues such as pointing and gesturing. Nonverbal modalities can be used to emphasize, extend or even replace the language
X
INTRODUCTION
output produced by the agent. In this chapter, the authors discuss the issues involved in extending a natural language generation system with the generation of nonverbal signals. The second cluster comprises three contributions on the topic of information presentation on mobile devices. Mobile navigation systems are considered one of the possible breakthrough technologies for broad band wireless internet access (i.e. for UMTS). Although quite sophisticated systems already exist for different transportation means, e.g. car navigation systems and more recently also pedestrian navigation systems, these systems only function in a single well defined context. They are designed either to be operated by a driver in a car, or a by pedestrian in an outdoor scenario. In contrast to that personal navigation systems are supposed to work in all of these contexts, regardless of the chosen mean of transportation or other external factors. The first chapter of this cluster, by Baus, Krüger and Stahl, focuses on the conceptual and technical requirements of such a personal navigation system. They compare their approach with classical mobile navigation systems and argue that resource-adaptiveness has to be achieved on three different levels of the navigation task: the wayfinding level, the presentation planning level and the presentation rendering level. Stock, Zancanaro and Not discuss the opportunities and challenges offered by the area of cultural heritage appreciation for innovative, natural–language centered applications. They introduce the PEACH project, aimed at exploring various technologies for enhancing the visitors' experience during their actual visit to a museum. Finally, in his contribution, Rist discusses the problem of the overwhelming variety of new portable computing and communication devices. This adds a new level of complexity to the design of usable information services since different users may not be equipped equally in terms of output and input capabilities. With a special focus on graphical representations, this contribution sketches several application scenarios taken from projects conducted at DFKI and discusses the technical approaches to accomplishing adaptation tasks that result from device restrictions, such as limited screen real estate, lack of high resolution and color. The cluster on Natural Language Generation presents four contributions. First, the chapter by Andrè and colleagues describes a system called AutoBriefer which automatically generates multimedia briefings from high-level outlines. The author of the briefing interacts with the system to specify some of the briefing content. In order to be scalable across domains, AutoBriefer provides a graphical user interface that allows the user to create or edit domain knowledge for use in the selection of some of the briefing content. The system then uses declarative presentation planning strategies to synthesize a narrated multimedia briefing in various presentation formats. The narration employs synthesized audio as well as, optionally, an agent embodying the narrator. Calder and colleagues address a similar issue in their chapter. They introduce The M-PIRO project which targets the concept of personalized information objects. MPIRO's technology allows textual and spoken descriptions of exhibits to be generated automatically from an underlying language-neutral database and existing
INTRODUCTION
XI
free-text descriptions. The resulting descriptions, produced in three different languages (English, Greek, and Italian), are tailored according to the user's interests, background knowledge, and language skills. Carenini and Conati in the last chapter of this cluster describe a framework that helps students learn from examples by generating example problem solutions whose level of detail is tailored to the students’ domain knowledge. When presenting a new example to a student, the framework uses natural language generation techniques and a probabilistic student model to selectively introduce gaps in the example solution, so that the student can practice applying rules learned from previous examples in problem solving episodes of difficulty adequate to her knowledge. Filling in solution gaps is part of the meta-cognitive skill known as self-explanation (generate explanations to oneself to clarify an example solution), which is crucial to effectively learn from examples. Finally, Piwek, Power, Scott and van Deemter explore NLG applications in which extra media are brought into play from the point of view of planning the layout of the document. The main point of the chapter is that the extra media are not simply added to plain text, but integrated with it: thus the use of formatting, or pictures, or dialogue, may require radical rewording of the text itself. The fourth cluster comprises two contributions on the topic of Virtual and Augmented Reality. Bell, Feiner and Höllerer describe a view-management mechanism that addresses dynamically changing visibility relationships among moving objects, and show how to apply it to 3D user interfaces. The focus is on augmented reality systems, in which users wearing head-tracked, head-worn displays can view graphics overlaid directly onto the real world. Examples from an implemented view-management testbed are used to demonstrate some of the different ways in which the system and its users can exert control over what is seen in augmented reality. These examples include automatically laid out annotations that provide desired information about the surrounding environment, and a hand-held tracked physical display and headtracked virtual situation-awareness aid that respond to user control and interact with the annotated environment. The work of Swartout and colleagues presents the Mission Rehearsal Exercise Project, which confronts a soldier trainee with the kinds of dilemmas he might reasonably encounter in a peacekeeping operation. The trainee is immersed in a synthetic world and interacts with virtual humans: artificially intelligent and graphically embodied conversational agents that understand and generate natural language, reason about world events and respond appropriately to the trainee's actions or commands. The final cluster of the book is about the future. It consists of a contribution by Bunt, Kipp, Maybury and Wahlster that introduces a recommended action plan and a roadmap for the topic of multimodal interactive information presentation. The occasion that gave impulse to this volume was the European CLASS Project (IST-1999-12611). CLASS (http://www.class-tech.org) was launched at the request of the European Commission with the purpose of stimulating cross-project
XII
INTRODUCTION
collaboration among EC-sponsored Human Language Technology projects as well as between those projects and relevant projects world-wide. Among the initiatives of CLASS, two workshops were held recently in Verona, and Copenhagen with submissions from Europe, the US, Canada, Australia and Japan. Subsequent to the workshops, the authors of the best selected extended abstracts in the proceedings have been invited to write a long paper for the present volume that was reviewed also by authors of other papers in the collection. Finally, we would like to thank Ivana Alfaro for her help in this enterprise and Erika Belli for her work on the manuscript.
Oliviero Stock Massimo Zancanaro
LIFE LIKE CHARACTERS
The research field on lifelike characters is a promising area for investigating challenging issues in Artificial Intelligence and Human-Computer Interfaces. Lifelike characters are a kind of multimodal interfaces where the modalities are those natural in human conversations: speech, gestures as well as facial expressions and body postures. The hypothesis is that intelligent user interfaces can take advantage of embodied intelligence to facilitate human-machine interaction. From this point of view lifelike characters constitute another argument in the long-term debate on anthropomorphism in the interface. Indeed, human-like embodied interfaces have become popular as the front end to many web sites and as part of many computer applications. However, these interfaces have just the graphical appearance of a body without any internal representation of the communicative functions of a body and they often produce far from satisfactory results. The contributions in this cluster take the opposite approach, they all try to investigate how to properly model the communicative functions of a body in order to produce realistic behavior. The contribution by Poggi, Pelachaud, de Rosis, De Carolis and Carofiglio illustrates the architecture of Greta, a believable embodied conversational agent built in the context of the EU Project MagiCster. Greta engages the users in natural conversations. It is provided with a mind, a dialogue manager and a body. The mind triggers an emotion on the basis of the agent's personality, the occurring events and the current dialog move. The dialog manager selects the appropriate dialog move to perform. The body, which includes a 3D face model is capable of expressing a set of nonverbal communicative functions. The second chapter, by Paggio and JongeJan, describes the multimodal communication components of a virtual world application developed at the Centre for Language Technology within the Danish research project "Staging of 3D Virtual Worlds". The application is a virtual farm in which the user can interact with an autonomous farmer agent to participate in simple conversations and actions concerning work on the farm. Communication is multimodal in that the agent can respond to verbal input combined with a limited number of deictic, iconic, and turntaking or giving hand gestures. Speech and gesture inputs are synchronised and, if the relevant syntactic and semantic constraints are satisfied, a unified interpretation of both is generated by a feature-based multimodal parser. Finally the work of Theune, Heylen and Nijholt discusses the output modalities available for information presentation by embodied, human-like agents which include both language and various nonverbal cues such as pointing and gesturing. Nonverbal modalities can be used to emphasize, extend or even replace the language output produced by the agent. In this chapter, the authors discuss the issues involved in extending a natural language generation system with the generation of nonverbal signals.
I. POGGI, C. PELACHAUD, F. DE ROSIS, V. CAROFIGLIO AND B. DE CAROLIS
GRETA. A BELIEVABLE EMBODIED CONVERSATIONAL AGENT
1. INTELLIGENT BELIEVABLE EMBODIED CONVERSATIONAL AGENTS A wide area of research on Autonomous Agents is presently devoted to the construction of ECAs, Embodied Conversational Agents (Cassell et al. 2000; Pelachaud & Poggi, 2001). An ECA is a virtual Agent that interacts with a User or another Agent through multimodal communicative behavior. It has a realistic or cartoon-like body and it can produce spoken discourse and dialogue, use voice with appropriate prosody and intonation, exhibit the visemes corresponding to the words uttered, make gestures, assume postures, produce facial expression and communicative gaze behavior. An ECA is generally a Believable Agent, that is, one able to express emotion (Bates, 1994) and to exhibit a given personality (Loyall & Bates, 1997). But, according to recent literature (Trappl & Payr, in press; de Rosis et al., in press a), an Agent is even more believable if it can behave in ways typical of given cultures, and if it has a personal communicative style (Canamero & Aylett, in press; Ruttkay et al., in press). This is, in fact, what makes a human a human. More, an ECA must be interactive, that is, take User and context into account, so as to tailor interaction onto the particular User and context at hand. In an ECA that fulfils these constraints the communicative output, that is, the particular combination of multimodal communicative signals displayed (words, prosody, gesture, face, gaze, body posture and movements) is determined by different aspects: a. contents to communicate, b. emotions, c. personality, d. culture, e. style, f. context and User sensitivity. At each moment of a communicative interaction, all of these aspects combine with each other to determine what the Agent will say, and how. In this paper we show how these aspects of an ECA can be modeled in terms of a belief and goal view of human communicative behavior. We then illustrate Greta, an ECA following these principles which is being implemented in the context of the EU project MagiCster1. 2. WHAT AND HOW WE COMMUNICATE Before focusing on the Agent’s dialogue, let us see how a single move can be represented and simulated. An Agent S (Sender) generates a set of beliefs - about 3 O. Stock and M. Zancanaro (eds.), Multimodal Intelligent Information Presentation, 3–25 © 2005 Springer. Printed in the Netherlands
4
POGGI, PELACHAUD, DE ROSIS, CAROFIGLIO AND DE CAROLIS
itself or about external objects and events - and has the goal to make Agent A (the Addressee) believe them. To achieve this, S produces a set of communicative signals (for example words, gestures, gaze, head, face and body behavior), taken from its communicative repertoire and temporally ordered in a particular way. Now, what are the beliefs an Agent may conceive to communicate? What is the structure of its communicative repertoire? What are the rules for the temporal ordering and synchronization of different signals? But also: how does the system go from the input – a set of beliefs – to the output – that particular arrangement of words, voice, hands, body, face signals? In fact, our communicative repertoire is very complex, in order for us to modulate our communication in a very sophisticated way. For example, words have formal and informal variants, connotations, positive or negative, tender or insulting nuances; and not only is the verbal lexicon so rich, but we can also communicate by subtly different intonations, gestures, facial expression, gaze, posture, spatial behavior. In each move of a dialog, how do we choose the best way to communicate, the combination of verbal and non verbal signals that are most fit to express our communicative goal? How do we activate the goal of using that given word, replacing it with a gesture or using both to communicate our meaning? According to a goal and belief model of social action (Conte & Castelfranchi, 1995), choice, that is, the decision to pursue a goal instead of another is determined by the relative values attached to the alternative goals. But the value of a goal in its turn stems from the value of its superordinate goals, or from the algebraic sum of the values of two or more of them. So, whoever discovers his car was stolen might shout. But if this happens to someone who is just starting a work where he needs his car, his shout will be sharper or longer and his utterance more aggressive. In other words, resources we specifically use in a given communicative situation (a gesture in the place of a word, a very colloquial term instead of a more formal one) are determined by a number of permanent and contingent factors. 3. PERMANENT AND CONTINGENT FACTORS As an Agent enters a communicative interaction, having the goal of communicating some meanings, two kinds of factors affect the final aspect of his/her communication: permanent and contingent ones (Table 1). The former are the goals and resources coming from the Senders’ biological and cultural endowment, that are always active in them; the latter are the goals activated and the resources provided by the contingent situation in which the Senders communicate. Long-lasting internal resources are (1) personality, (2) social identity (age, gender, cultural roots) and (3) cognitive traits. Among them, we may distinguish (4) innate and (5) culturally learned features: innate ones may be (6) a higher or lower capacity of making inferences, different reasoning styles (more abstract, intuitive, or imaginative); or (7) the different aptitudes, partly depending on neurological dispositions, towards musical, mathematical, visual or linguistic skills.
GRETA. A BELIEVABLE EMBODIED CONVERSATIONAL AGENT
5
Table 1. Factors affecting the choice of communication resources
contents to communicate PERMANENT personality
1 S’s
2 S’s social identity
3 S’s Cognitive Traits
4 Innate
6 Inference capacity 7 Aptitudes 5 8 Learn- Knowledge ed Base 9 Cultural techniques, norms, values 10 Communicative repertoire
communicative choice CONTINGENT 14 Physical resources 11 Self (motor capacity and energy) 15 Cognitive resources (drunk, concentrated) 12 16 18 Other Physical Sensory capacity resources 19 Media 17 20 Cognitive Knowledge resources Base 21 Inference capacity
22 Communicative repertoire
13 Situation
23 Personality 25 24 Available modalities Physical setting 26 Presence of referents 27 28 Social S-A relation setting (status, role, affect) 29 Type of encounter (service, affective) 30 Relations to others (in public, in private)
Other cognitive capacities are culturally dependent: culture entails beliefs about the environment (8) and about strategies of behavior typical of a population (see Section 8.), but also norms on how to do things, how to behave, what and how to
6
POGGI, PELACHAUD, DE ROSIS, CAROFIGLIO AND DE CAROLIS
communicate (9), and finally the communication repertoires (verbal and nonverbal) one comes to learn since infancy (10). The combination of personality traits, attitudes and culturally learned behavior habits gives rise to style, an idiosyncratic tendency to behave in some peculiar way that allows others to recognize the Agent’s identity. The contingent resources taken into account for the choice of the output communicative behavior are provided, instead, by the intrinsic features of context (Poggi & Pelachaud, 2000). The presence or absence of these resources in oneself (11), the Other (12) or the Situation (13) activate specific goals in Senders as they plan their discourse and then the communicative signals to exploit. If I am very tired (physical resources, 14) I’ll tend to speak low and not to make conspicuous gestures; if my interlocutor is a bit deaf (sensory capacity of other, 18), I’ll speak loud. If he is a tourist (Other’s communicative repertoire, 22) I’ll talk slowly; with a student (Knowledge Base, 20) I’ll explain things at length; with a not so smart person I’ll explain even obvious causal links (inference capacity, 21). And if the interlocutor is a touchy person (personality, 23), I’ll be more indirect in my criticism. Beside the features of oneself and the interlocutor, resources present or absent in the environment trigger different goals about what and how to communicate: in a noisy discotheque (physical setting, 24), I’ll use gestures instead of words (available modality, 25), while I will not do this on the phone (absence of referents, 26). But more than the physical, the social setting (27) is relevant: we use more polite words with a high status Addressee (S-A relation, 28) or in formal situations (type of encounter, 29); less colloquial words in public than in private (relation to others, 30). How do we choose, then, how to communicate in a given situation? Presence or lack of individual resources lead us to choose one signal or combination of signals instead of another. Our hypothesis is that every combination of an Agent's multimodal communicative behavior (that particular word, uttered by a particular intonation, while making that gesture, that gaze, facial expression, and posture) is the result of the final choice of the communicative goal to pursue. This goal is selected among the Agent’s different goals, determined in their turn by contingent events (content to communicate, felt emotions, context and interlocutor) and by long-lasting features (the Agent’s culture, personality and style). In Sections 4 through 9 we overview how the above aspects (meanings to communicate, emotion, personality, culture and style) can be viewed according to a goal and belief model of human communicative interaction, and how they may be represented in an ECA. In Section 10 we describe the architecture of the ECA “Greta”, while showing how some of the above principles can be applied to construct a Believable Embodied Conversational Agent. 4. MEANINGS TO CONVEY Let us first overview the beliefs that may form the content of a communicative act. Three classes of meanings can be distinguished (Poggi, 2002 b): Information on the World, Information on the Speaker’s Identity and Information on the Speaker’s Mind.
GRETA. A BELIEVABLE EMBODIED CONVERSATIONAL AGENT
7
Information on the World. As we communicate, we provide information on concrete or abstract events, their actors and objects and the time and space relations among them. Such information is provided mainly through words, but also by gestures or gaze. To mention the referents of our discourse, we may point at them by deictic gestures or gaze; to refer to some properties of objects we may use iconic or symbolic gestures and even gaze (as when we squeeze eyes to mean ‘small’ or ‘difficult’). Information on the Speaker’s Identity. Physiognomic traits of our face, eyes, lips, the acoustic features of our voice and often our posture provide information on our sex, age, socio-cultural roots, and personality. And, of course, our words can inform on how we want to present ourselves. Information on the Speaker's Mind. While mentioning events of the external world, we also communicate why we want to talk of those events, what we think and feel about them, how we plan to talk of them. We provide information on the beliefs we are mentioning, our own goals concerning how to talk about them and the emotions we feel while talking (Poggi 2002 a). About beliefs, we inform about: 1. degree of certainty: words like ‘perhaps’, ‘certainly’; conditional or subjunctive verb modes; but also frowning, which means ‘I am serious in stating this’; opening hands, which means ‘This is self-evident’; 2. metacognitive information: that is, the source of mentioned beliefs, whether they originate from memory, inference or communication (we look up when trying to make inferences, snap fingers while trying to remember,...) We inform about the following goals: 1. performative of the sentence (by performative verbs, intonation, facial expression); 2. topic-comment distinction (by batons, eyebrow raising, voice intensity or pitch); 3. rhetorical relations: class-example (saying “first…second…third...”; counting on fingers) topic shift (expressed through posture shift); 4. turn-taking and backchannel: raise hand for asking turn; nod to tell the Interlocutor we are following what she says. Finally, we inform on the emotions we feel while talking (by affective words, gestures, intonation, facial expression, gaze and posture). 5. COMMUNICATIVE REPERTOIRE Let us now see what is the structure of the communicative repertoire (Table 1, n.10): the set of innate and learned features and behaviors that we can use as signals in order to convey the meanings specified above. The Human Agent is endowed with different “mode-specific communicative systems”, that is, sets of rules to link meanings to signals that work in different modalities. Several parts of the Human Agent’s body produce communicative signals: so head, face (with eyes and mouth), hands, trunk and legs can be viewed each as the depository of a communicative system. Communicative systems may be of at least
8
POGGI, PELACHAUD, DE ROSIS, CAROFIGLIO AND DE CAROLIS
two kinds: “codified” and “creative”. In a “codified” system, or “lexicon”, the same signal-meaning link is shared and coded in the Agents’ memory. Not only words or symbolic gestures but also gaze, facial expression, posture form lexicons, where each signal represented in an Embodied Agent’s mind corresponds to a specific meaning. In a “creative” system, instead (for instance the system of iconic gestures), what is coded in memory is only a small set of inference rules about how to create a new signal starting from a given meaning, or about how to retrieve a meaning from a given signal (Magno Caldognetto & Poggi, 1995). In our hypothesis, in all communicative systems, meanings may be represented in terms of ‘mental images’ or logical propositions. For example: Let Ai denote the Sender and Aj the Addressee, a an action and g a goal that may be achieved by means of a. The performative of a ‘Peremptory order’ may be represented as follows: 1. Goal Ai (Do Aj a) 2. (Goal Ai g) Bel Ai (Achieve a g)) 3. Goal Ai (Bel Aj (Power-on Ai Aj a)) 4. If (Not (Do Aj a)) then (Feel Ai Angry) while the performative of an ‘Imploration’ may be represented as follows: 1. Goal Ai (Do Aj a) 2. (Goal Ai g) Bel Ai (Achieve a g)) 3. Goal Ai (Bel Aj (Power-on Aj Ai a)) 4. If (Not (Do Aj a)) then (Feel Ai Sad) In both ordering and imploring, the Sender wants the Addressee to do some action. However, the two performatives differ for the power relationship between the two interlocutors (Ai has power over Aj in the former, and the reverse in the latter), and for the potential emotion (anger vs. sadness) in case Aj does not perform the requested action (Poggi & Pelachaud, 2000). A compositional representation can be adopted not only on the side of the meaning, but also for the signals: a signal is represented as a combination of behavioral parameters, each with its number of values. Gestures are decomposed into hand-shape, arm and wrist position, type of movement; gaze into direction of eyes, eyelid aperture, movements of the eyebrows, and so on (Poggi, 2002 b). This compositional representation on the two sides enables achieving a high flexibility in the correspondence between signals and meanings. It opens, as well, the possibility that a combination is not always conveyed by the same signal, but may be expressed through different parameters of signals taken from different communication systems (Pelachaud & Poggi, 2002). The overall communicative repertoire, in fact, manages the combination of signals taken from different communicative systems, the distribution of meanings across them, and their temporal ordering and synchronization.
GRETA. A BELIEVABLE EMBODIED CONVERSATIONAL AGENT
9
6. EMOTIONS Before showing why and how emotions are an important determinant of an Agent’s communication, let us define them according to a goal and belief model of action and social interaction (Conte & Castelfranchi, 1995). 6.1. Emotions and goals Actions in our life are often part of a plan aiming at some goal. Take for example an action of Oetzi, the 5000 B.C. pre-historic man of Similaun, who chooses a stone apt to sharpen well and makes a lance for chasing the wild-pig successfully. The actions of looking for a good stone and to sharpen it are just means for the complex action of chasing. But also chasing is aimed at feeding himself and his group, which in turn aims at the goal of survival. The goals of our everyday plans are not ends in themselves: they all aim at more general goals of biological import that are common to all humans, like the biological goals of survival and reproduction and some subgoals of them, physical well-being, safety, loving and being loved, self-realization, image and self-image. These are terminal goals, that are ends in themselves and ones to which we assign the highest weights. So much that, if two of them are incompatible (as for instance freedom vs. life itself), giving up one of them is a heavy renunciation. With respect to terminal goals, the goals of our everyday life are instrumental goals, in that they directly or indirectly serve our terminal goals. For instance, chasing the wild-pig with a sharpened stone is instrumental to survival: if the lance is not sharp enough and does not hit the wild-pig to death, the wild-pig might aggress and kill Oetzi. Instrumental goals are more or less important to us, depending on the strength of their link with terminal goals: at the extent to which an Instrumental Goal is likely to be the only possible means to reach a Terminal Goal, that Instrumental Goal receives a high weight, just because it inherits its weight from the Terminal Goal it serves. Emotions are a biological device aimed at monitoring the state of reaching or threatening of our most important goals, be they terminal or instrumental (see, for instance, Carbonell, 1980; Oatley & Johnson-Laird, 1987). Anytime something happens (or the Agent believes it happens) that is likely to produce the achieving or threatening of a highly weighted goal, the biological device of emotion is triggered: from the agent's interpretation of the situation, a complex subjective state originates, generally of a short duration and with different degrees of intensity. This state includes physiological, expressive and motivational aspects. If Oetzi throws his lance but sees it has not run into the wild pig’s heart, fear is triggered since his goal of survival is challenged: physiological reactions are activated (blood flowing away from face to limbs) some of which may show in the perceivable state of his body (pale face, tremors); and the specific goal of escaping, that might serve the terminal goal of survival, is activated. There is a strong relationship, then, between goals and emotions: goals both cause emotions and are caused by emotions. They cause emotions since, if an important goal is achieved or threatened, an emotion is triggered: emotions are therefore a feedback device that monitors the reaching or threatening of our high-
10
POGGI, PELACHAUD, DE ROSIS, CAROFIGLIO AND DE CAROLIS
weighted goals. At the same time, emotions activate goals and plans that are functional to re-establishing or preserving the well-being of the individual, challenged by the events that produced the emotions. So, fear triggers flight, anger triggers aggression, guilt triggers the goal of helping the harmed person or of escaping sanction (Castelfranchi, 2000). 6.2. Emotion triggering vs. emotion display Emotions may be implied in communication in at least two ways. 1. they may be the very reason that triggers communication: we activate the goal of communicating just because we want to express our emotion; 2. they may intervene during our communication, as a reaction to what our interlocutor is saying, or to some thought suddenly coming to our mind, either related to the ongoing dialogue or not. In both cases, the triggering of emotion does not necessarily imply that the Agent displays it. There are many reasons why we may refrain from expressing our emotion, and the final (aware or non-aware) decision of displaying it may depend on a number of factors (Prendinger & Ishizuka, 2001; De Carolis et al., 2001). Some of them concern the very nature of the emotion felt (emotion nature), others the interaction of several contextual (scenario) factors. 1. Emotion nature a. Intensity (a more intense emotion might be more likely displayed); b. Valence (it is not the same to display negative or positive emotions); c. Social evaluation (some emotions, like envy or shame, are subject to social sanction: then it is more difficult to express them); d. Addressee (it is different to express an emotion to the one who caused it or to a third person). 2. Scenario Factors a. Agent's Display motive (displaying or not depends on whether you do it to be helped, consoled, or if you want to demonstrate or teach something); b. Agent's personality (an impulsive person is generally more keen to displaying than a reflexive or a shy one); c. Interlocutor’s features (displaying depends on the other's personality, empathy, intelligence...). d. Agent - Interlocutor Role relationship (whether he has power over you or you over him); e. Agent - Interlocutor Personal Relationship (you might not display your being worried to someone you love, if you want to protect him); f. Type of social interaction (being in public makes a difference for emotion display).
GRETA. A BELIEVABLE EMBODIED CONVERSATIONAL AGENT
11
7. PERSONALITY Personality is also linked to goals: it can be viewed in terms of weights people attribute to terminal goals (Carbonell, 1980; Poggi and Pelachaud, 2000). For example, a sociable person gives more importance than others to knowing and staying with other people. A selfish person, when the goals of physical safety and others' care are conflicting, chooses to pursue the former, while an altruist pursues the latter. A proud person attributes a high value to self image and autonomy, while a dependent person cares others’ image more than self-image. Since both emotions and personality have to do with the relative importance of goals, there is also some link between emotion and personality. Some personality traits may be viewed in terms of the general ‘propensity to feel emotions’ (Plutchick, 1980; Poggi and Pelachaud, 1998). Picard (1997) calls ‘temperament’ this subset of personality traits, while other authors relate them directly to one of the factors in the ‘Big-Five’ model: for instance, neuroticism (Mc Crae & John, 1992). These traits imply, in a sense, a lower threshold in emotion feeling (Ortony et al., 1988). For instance, a ‘shy’ person is keener to feel ‘shame’, especially in front of unknown people. A ‘proud’ person, who attributes a high weight to his goals of self-esteem and autonomy, will feel particularly proud of himself every time one of these goals is achieved. And, conversely, every time they are threatened (if, say, he is obliged to ask for help), the person will feel the opposite emotion, shame. Thus, a personality trait (proud) is related to attaching a higher weight to a particular goal (self-esteem, autonomy); and, since that goal is particularly important to that kind of person, the person will feel the corresponding emotions (pride or shame) with a higher frequency or intensity. 8. CULTURE Culture may also be viewed in terms of different weights on goals. Both a Somali shepherd and an Italian housewife have the goal of feeding themselves and their family; but the sub-goal chosen to pursue this goal may be for the former to search the bush, for the latter to go shopping in a store. 8.1. Goals and culture Humans pursue their goals by using external resources (presence of food, characteristics of the territory, climate conditions) and internal resources (physical strength, body agility, manual skill, beliefs and intelligence). Since different physical environments provide different resources, each population, given the environment in which it lives, comes to accumulate a set of beliefs on the instrumental goals that most easily and economically serve the biological terminal goals in that environment. At the extent to which an instrumental Goal is (or is likely to be) the only possible means to reach a terminal Goal, given the external conditions available, that instrumental Goal will receive a higher weight than other possible ones. Thus, the instrumental goal chosen becomes a strategy of survival typical of that culture; and culture, overall, may be defined as a set of beliefs on the
12
POGGI, PELACHAUD, DE ROSIS, CAROFIGLIO AND DE CAROLIS
typical techniques to pursue goals. Of course it also entails beliefs about the external world. And since language is both produced by beliefs and a vehicle of them, culture typically shows up in language. Language is made of the beliefs of a population, and a way to organize them, a set of rules on how to conceptualize and categorize information. Consequently, it also implies a set of settled communicative techniques, that is of settled instrumental goals, stating how to convey information. More, culture entails values and norms. Values are evaluative beliefs about what is good and then has to be pursued as a goal (Miceli & Castelfranchi, 1989). But since particular ways to behave may be good or bad according to the environment, again due to what are the most useful techniques of survival, different populations in different environments may hold different values. Where individualistic behavior proved convenient, individualistic values will develop, while in environments where collectivistic behavior is more fit, values centered on the family or the group will hold. Norms are obligations that rule the relationships among people in a group (Conte & Castelfranchi, 1995). In a culture more centered on interdependency, a norm (then, a highly weighted goal), may prescribe to be cooperative, even when this implies intruding in the other’s affairs; in a culture centered on the individual’s autonomy, the goal of keeping one’s privacy will be more weighted, and a norm will hold of not intruding in others’ affairs and of contrasting others’ intrusions. Now, since values and norms generate goals in people (the goal to pursue that value or to respect that norm), if they are thwarted, they provoke emotions. Not living up to one’s values may induce shame, while violating norms may cause guilt. Therefore, if two populations have different values and norms, they will also feel these emotions out of different events. To sum up, culture is a set of beliefs shared by a population about the environment in which the population lives and the best techniques (the most highly weighted instrumental goals) to reach the biological terminal goals in that environment, given the means-end relations that hold in the given physical conditions and the set of beliefs accumulated. Culture also includes beliefs about how to gather and organize beliefs, and about the norms and values that are functional to techniques of goal achievement that best fit the surrounding environment. According to this definition, let us try to figure out how the way people communicate changes with cultural differences, by trying to distinguish what is universal (biological) and what is culturally determined in the different aspects of communication. These differences may then be simulated in a Believable Embodied Agent, in words or discourse planning, in gesture, gaze, facial expression, body posture and proxemic behavior. 8.2. Semantic rules versus norms of use in communication systems A communication system includes two kinds of rules: semantic rules and norms of use. Semantic rules concern the correspondence between meanings and signals, be they words, signs or sentences in a verbal or sign language, or gesture or gaze items. These are rules of the following type:
GRETA. A BELIEVABLE EMBODIED CONVERSATIONAL AGENT
13
if you want to communicate the meaning ”I greet you”, say “Hello” if you want to communicate the meaning ”I greet you”, raise your eyebrows if you want to communicate the meaning ”I greet you”, wave your hand Norms of use, instead (those studied by Pragmatics and Sociolinguistics) do not state how some meaning has to be conveyed, but if some meaning can, should or should not be conveyed in a given situation. They are rules of this type: if you meet a person you know, apply the rule for the meaning “I greet you”, or if you meet an unknown person, do not apply the rule for the meaning “I greet you”. Now, in some communication systems (for example spoken languages) both semantic rules and norms of use vary across cultures. But in others (typically, the facial expression of primary emotions) semantic rules might be everywhere the same (a grimace of anger is performed in the same way in all cultures), while cultural difference holds in the norms of use (in Japan expressing anger is much more sanctioned than in the USA). 8.3. What is universal and what is cultural in speech, gestures, and face communication? - Words and sentences Even if at a deep level the syntax of all languages might be universal (as argued by the Chomskian Universal Grammar approach), specific words and syntactic rules differ from one language to another. Some mechanisms like iconicity hold in all languages, but cultural variation across languages holds at various levels: in the strategies of discourse planning, the importance of politeness or rhetoric, in the rules defining what to speak about, how much to mention the self, whether to convey new information or just make reference to shared knowledge, and so on. - Gestures Some gestures are culturally codified (e.g. the gestures for ‘OK’ or ‘Victory’), while others are biologically codified (raising fists up to show elation). If we want to simulate the former in a culture-sensitive Agent, they will have to be varied from a culture to another. Non-codified iconic gestures, instead, might be generated all through the same set of inference rules (supposedly universal), whatever the culture the Agent comes from; but if the referent represented is typical of a culture or an action is performed in a way typical of it, then also a creative gesture may be culturally dependent. - Gaze and Facial Expression Facial expression and gaze are more likely to be universally shared than gesture. They can communicate information on the world (we point at things with chin or gaze, squeeze eyes to say that something is little or difficult), information on the Speaker’s beliefs, goals and emotions (we raise eyes while we remember or make
14
POGGI, PELACHAUD, DE ROSIS, CAROFIGLIO AND DE CAROLIS
inferences; we frown to communicate anger, concentration, or an order) and information on the Speaker’s identity (our face and gaze provide information on sex, age, ethnicity, personality, sometimes even social class). Focusing on the expression of emotions, even if the so-called basic emotions (happiness, sadness, anger, fear, surprise and disgust) are felt in all cultures, and everywhere they trigger an innate universal neural program for facial expression (Ekman, 1982), this does not imply that people in different cultures always show their emotions in the same way in the same situation. First, an emotion is triggered by the cognitive categorization of a situation on the part of the subject, and a situation that in a culture, because of its beliefs, norms and values, is categorized as a cause of sadness, in a different culture might be seen as a cause of happiness (take the death of a martyr in the Islamic culture). Second, the emotion display is filtered not only by cognitive and personality traits of the Agent and of the Interlocutor, their relationship and the situation, but also by the cultural norms about the expression or non-expression of given emotions. 9. STYLE Style is an internal feature of an Agent that affects its choice of communicative resources. We define style as the idiosyncratic stable tendency of a specific agent to exhibit specific communicative or non-communicative behaviors. Communicative style is then the tendency one has to choose some signals, or arrangements or aspects of signals, instead of others. Agent X uses formal words also in private informal situations; Y's discourse is always thoroughly explained; Z makes ample gestures; W avoids the Addressee's gaze. In its being a stable tendency to communicative choice, style is the ultimate result of the combination and interaction of the permanent goals that rule an Agent's behavior, namely the goals stemming from its personality, attitudes, and culture. A particular combination of these goals, at the moment of the communicative output, affects the choice of a particular signal or set of signals, or simply the value of some parameters in the production of signals. High level goals implied in a specific personality get instantiated into lower level goals. For example, in an introverted person, the general goal of not being too visible may result in a goal of avoiding too wide gestures. A tendency to visual thought may lead to use metaphors in discourse. The habit of explaining to students may induce to be always didactic. Style can manifest itself both in the choice of whole signals (I may prefer a metaphorical to a literal word) and in the choice of a particular way to produce a signal. For example, people may differ in their style as to various aspects of gesture use (Ruttkay et al., in press). Differences may concern: 1. whether to use a gesture or not, which depends on at least three factors: a. width of repertoire: we cannot convey a meaning by a gesture if our culture or personal gestural competence does not include a gesture for that meaning;
GRETA. A BELIEVABLE EMBODIED CONVERSATIONAL AGENT
15
threshold: the degree of formality, as well as other factors like need for redundancy, may provide a threshold for the use of a gesture; c. redundancy: I may use both a word and the corresponding gesture if I want to stress some meaning or if I want to be very clear. 2. ways of performing gestures. While some parameters and sub-parameters of nonverbal signals like gestures are subject to lexical and sociolinguistic variation, some sub-parameters of movement: tense or relaxed (tension), wide or narrow (amplitude), smooth or angular (manner) might be more affected by style. That is, they are directly determined by the current goal set of the Sender and then by his/her emotional or physiological state. In conclusion, each signal chosen and the combination of particular values in each parameter or sub-parameter of this signal is determined by the combination of goals that are activated at a certain moment in the Sender, all stemming possibly by the goal set described above. b.
10. GRETA’S EMOTIONS, PERSONALITY, AND COMMUNICATION In this Section, we show how some of the principles introduced above have been applied in the construction of an Embodied Agent: Greta. Specifically, the device of emotion triggering and its influence on the Agent’s behavior have been already implemented, along with some aspects of its personality, communicative planning, voice and face communication. Some aspects of gestural communication are still to be thoroughly implemented in Greta such as gesture expressiveness (Chi et al., 2000). But this architectural model may also support adaptation to some features of the user target population, such as culture and style (de Rosis et al., in press a; Ruttkay et al., in press; Hartmann et al., 2002). In order to achieve a high degree of flexibility, necessary to this aim, we base the architecture of our Agent on the following principles: - the Mind and the Body of the agent are separated. This separation, in our opinion, helps in achieving a higher flexibility in adaptation of the Agent’s behavior and appearance. It offers the opportunity to vary its mental state and its reasoning capacity according to the context and to establish the forms of communication to adopt by considering the technical resources available. Considering interdependency between the body and the mind would mean to include context-dependent rules concerning the different aspects of manifesting behavioral differences during communication into the planning process, and thus to make this step very complicated since it has to solve constraints at both inner and outward levels. In our approach (independence of Body and Mind), at the Mind level, only constraints on the meanings associated to a communicative act are represented, leaving to the Body the task of deciding which signal to employ according to the context: - at the Mind level, represented according to the BDI (belief, desire, intention) theory and architecture (Rao and Georgeff, 1991), the way in which its mental state is related to the context may be represented explicitly. This enables the Agent to vary the decision made (including the ‘discourse’ that achieves a given
16
POGGI, PELACHAUD, DE ROSIS, CAROFIGLIO AND DE CAROLIS
‘communicative goal’) and the triggered and displayed emotions, according to its mental state’s structure; - at the Body level, that can use different physical representations, a contextdependent meaning-signal translation is performed in order to render the selected meaning consistently with the chosen physical body and context rules. - relations between the various phases of the generation process have to be conceived in such a way as to facilitate a flexible adaptation to different features that influence communication. For this reasons, the I/O interface between Mind and Body is constituted by an XML specification that is easily wrapped according to available resources and rendering rules. 10.1. The System Architecture These principles have been applied, in the context of the MagiCster project, to develop an Embodied Conversational Agent that appropriately combines verbal and nonverbal signals when delivering information, to establish a natural communication with the User. Our Agent achieves a quite rich expressiveness in conversation, by displaying various types of information typically transmitted in human-human dialogs (syntactic, dialogic, meta-cognitive, performative, deictic, adjectival, and rhetorical relation). It is embodied in a 3D talking head whose name is ‘Greta’ (Pelachaud, 2002); it has a personality and a social role, as well as the capability of expressing emotions, consistently with the context in which the conversation takes place and with its own goals, and to adapt her behavior to some user features relevant for the selected application domain. At present, we simulate advice-giving dialogs, where the main function of Greta is to provide suggestion and useful information to the User, in the application domain. In our ‘mixed-initiative’ system, the User can ask Greta some questions; this opens a question-answering sub-dialog after which, if needed, Greta revises her discourse plan according to the User request. At the moment, Greta provides information and advice in the medical domain adapting her behavior to some user features such as the age (children, teens, adult), the role (patient, nurse, parent of a patient) and some cultural features that are relevant for providing more appropriate suggestions both from the content and the signal rendering points of view. The architecture of our Conversational Agent includes the following main components (Figure 1): i) the Agent’s MIND including i.1) the ‘Emotional Mind’ and i.2) the Dialog Manager, and ii) a Generator of the Agent’s Body including ii.1) a Signal Generator, ii.2) a Signal Wrapper and iii.3) the chosen body (Greta in this case) that renders the specified behavior. Even if we are working in the field of advisory dialogs, the system is domainindependent. A particular application domain is therefore selected as a first step of interaction: a conversational goal in that domain is passed to the Dialog Manager (DM) and the dialog may start. From this goal, an overall discourse plan is produced for the Agent, by retrieving an appropriate ‘recipe’ from a plan library formalized according to DPML (de Rosis et al., in press b). This plan represents how the Agent
GRETA. A BELIEVABLE EMBODIED CONVERSATIONAL AGENT
17
will try to achieve the specified communicative goal during the conversation and is selected according to user features and context. The way the dialog goes on is a function not only of this plan but also of the User Moves and of the Social Context. User Move Agent Model MentalState Affective State
MIND Emotional Mind Dialog Manager
Domain DPML Plans DialogGoal
User Model Features Preferences
MIDAS Plan Enricher
Context Features
Meaning Enriched Agent Move (APML)
BODY SignalGenerator
Body Features Rendering Rules
SignalWrapper
Oh, I’m sorry, that’s a bad situation indeed...
Signal Enriched Agent Move (Greta FDL – Festival)
Figure 1. The Architecture of our Conversational Agent
This context includes variables concerning the topic of the dialog, the personality traits of the two interlocutors and their ‘attitudes’ towards the dialog itself. For instance, the Agent may be defined as more or less ‘empathic’ towards the User, more or less easily affected emotionally, more or less prone to feel specific emotions with a high intensity (for more details, see Carofiglio et al., in press). Once an Agent’s first move has been generated, the User replies with a move in his or her turn. The DM then asks the emotion simulation module (Emotional Mind) whether a particular affective state of the Agent is activated as a consequence of this move and with which intensity. Then, the DM selects the next move according to the goal activated by the user move, the Agent’s new affective state, and the Social Context. When the DM selects a particular Agent’s move, this has to be enriched by the plan enrichment module (MIDAS) that adds tags indicating the meanings to be conveyed. At this point, a further adaptation can be performed by adding meaning
18
POGGI, PELACHAUD, DE ROSIS, CAROFIGLIO AND DE CAROLIS
according to the user features, culture and context. This enriched move is then passed to the Body Generator, that interprets it by converting meaning into signals and by rendering them into an expressive behavior. The selection of signals, at this point, can be adapted to the available communication channel of the chosen body, user features and cultural features. We now describe in some more detail the above mentioned modules. 1. The Emotional Mind is responsible for updating the Agent’s mental state by deciding whether a particular affective state should be activated and with which intensity, and whether the felt emotion should be displayed and how, according to the context variables. Mind is based on a dynamic belief network (DBN), that combines a belief network (BN) representing the agent’s mental state at time T with a network representing its mental state at time T+1 and a network that monitors the triggering of emotions in the interval (T, T+1). Three kinds of nodes can be found in the Agent’s mental state: ‘belief’ nodes, ‘goal nodes’ and ‘goal-achievement’ nodes. A weight is associated with goalachievement nodes, as a function of the agent’s personality. The belief network at time (T+1) is generated according to the network at time (T) and to the events occurred in the interval (T,T+1). These events are modeled, as well, by belief networks. 2. The Dialogue manager (DM) is built on the top of TRINDIKIT (http://www.ling.gu.se/projekt/trindi/). It controls the dialogue flow by iterating the following steps: a. after a 'dialog goal' has been specified, an appropriate discourse plan is selected from the library of plan recipes and the first move is generated according to the first step of this plan. The ‘dialog goal’ becomes the main topic of the conversation; b. at the end of this first move, the initiative is passed to the User, that may ask questions to the agent about any subject concerning the main topic under discussion; c. the User move is translated into a symbolic communicative act (through a simplified interpretation process) and is passed to the DM; d. the DM decides “what to say next” by selecting the sub-plans to execute. At the same time, the information state of Trindi is updated; e. the DM goes on cycling over steps b. to d., until the user leaves the conversation. 3. The Plan Enrichment module, called MIDAS, has the role of translating the symbolic representation of a dialog move into an Agent’s behavior specification. A dialogue move may be very simple (i.e. greet, ask) or it may correspond to a small discourse plan (for instance: describe an object with its properties). In both cases MIDAS generates a tagged output expressed according to a particular XML specification which is interpretable by the Body of any Animated Agent (Affective
GRETA. A BELIEVABLE EMBODIED CONVERSATIONAL AGENT
4.
5.
19
Presentation Markup Language: APML. See De Carolis et al., 2002 for details). APML is used as a middle-ware level, to overcome integration problems between the mind and body components and, at the same time, to allow independence and modularity between them. The move is then represented as an APML string, in which the verbal part of the dialog act is enriched with the tags that are needed by the speech and body generation components of the Agent to produce the required expressions: rhetorical relations, deictic or adjectival information, certainty values, metacognitive or turn-taking expressions. The Body Generator module interprets the APML-tagged dialog move and decides which signal to convey on which channel for each communicative act. We defined a lexicon of facial expressions and gaze: that is, a set of (meaning, signal) pairs. Each meaning in the taxonomy specified in Section 4. (certainty, metacognitive, comment etc.) corresponds to a particular configuration of parameters of gaze or facial expressions that Greta is able to exhibit. For example, among the same class “certainty”, the meaning “certain” corresponds to a “small frown” while the meaning “uncertain” to “raise eyebrow”. In order to adapt this correspondence between meanings and signals, we started to introduce some translation conditions depending on the user and the culture of the target user population. The Body we use is a combination of a 3D face model compliant with the MPEG-4 standard (Pelachaud, 2002) and of the speech synthetizer Festival (Black et al., 2002). The facial model is capable of performing the face and gaze expressions foreseen for our conversational agent. Each signal may be expressed as a set of facial parameters. We have developed a language to describe facial expressions easily (De Carolis et al 2002). The text of each dialogue move with its tags is given as input to the animation module and to Festival, which provides the duration of the phonemes and a wav file (an audio file). Phonemes are the smallest temporal unit considered. Knowing the phoneme duration enables us to retrieve the exact duration of any expression as defined by the tags in the dialogue move, thus ensuring synchrony between speech and other visual activities. Tags get instantiated by their corresponding signals which are in turn transformed into facial parameters’ values. When several tags occur in the same text span, if their corresponding signals act on the same facial regions with different values (e.g. on the eyebrow region the agent should perform a frown and a raised eyebrow, or with the head, a head nod and a head shake), a conflict may arise. In such cases a conflict solver module takes as input the co-occurring meanings and produces as output a complex expression made of a mix of signals from different meanings (Pelachaud & Poggi, 2002). The various components (Mind, MIDAS, Greta and the DM itself) are connected through a Graphical Interface which controls activation, termination and information exchange for the various processes involved in the dialogue management.
20
POGGI, PELACHAUD, DE ROSIS, CAROFIGLIO AND DE CAROLIS
10.2. An Example To give an idea of how the system works, we show a simple example of the output; in this dialog, Greta represents a doctor who explains a drug prescription to a patient (the User). After an initial ‘greeting’, the Agent informs the patient about her disease. Then, the patient can ask any question concerning the main topic of the conversation; for instance, she may ask for clarifications about the disease and its severity, or even about the related therapy to undergo; Greta answers the User question by selecting the adequate plan steps and by showing the appropriate expression that is appropriate to the social context. Figure 2 shows a few annotated moves of the example dialogue. The a moves are directed to a user which is a patient while b moves are directed to a user which is a nurse. These APML moves have been generated by the same DPML plan (informing about Mr. Smith’s disease), it is possible to notice that the verbal and nonverbal meanings to convey are slightly different. For instance, the affect attribute of the performative is not instantiated in move 1b: since the user is a nurse, an empathic attitude toward her has not been generated by Mind, and then it is not to be displayed by the Body Generator. Annotations are translated into signals and then rendered as facial expressions and voice intonations. Figure 3 shows some examples of these expressions (neutral, sorry-for and surprise). Let’s suppose we want to introduce a form of adaptation to the user culture at this level, the Body Generator has to look into a conditional table and select the appropriate signal for the chosen body (King et al., 2003; Ruttkay et al., in press). This translation file is also an XML Meaning-Signal description in which the combination of signals by which every meaning should be rendered is established by rules of the form: where is an APML tag and attj specifies an expression or speech feature that the body is able to perform under the specified conditions (Cond). The following are some examples with the Facial Description Language of Greta. For some cultures, such a French or Italian, we have: For other ones, such as Greek or Turkish, the mapping is:
GRETA. A BELIEVABLE EMBODIED CONVERSATIONAL AGENT
21
Greta move1a:
I'm sorry to tell you that you have been diagnosed as suffering from what we call angina pectoris, which appears to be mild.
Patient User: What is angina pectoris? Greta move 2a :
This is a spasm of chest resulting from
overexertion when heart is diseased.
....
Greta move1b:
Mr. Smith is suffering from angina pectoris, which appears to be mild.
Nurse User: Which is the therapy? Greta move 2b:
To solve the problem, he has to take two drugs.
The first one is Aspirin and the second one is Atenolol
… and so on. Figure 2. An example of dialog annotated with APML.
Meaning-signal translation files define a mapping between the semantics of discourse segments and the expressions the character can show with face, gesture, body and speech. Defining several meaning-signal translation files enables us to
22
POGGI, PELACHAUD, DE ROSIS, CAROFIGLIO AND DE CAROLIS
render the same meanings with different combinations of signals. We may, for instance, implement a character with a ‘fun’ behavior for interacting with children and a more ‘serious’ one for adults. Or we may apply this method to build culturedependent characters (de Rosis et al., in press a). 11. CONCLUSIONS In this paper, we have overviewed some factors determining the communicative behavior of a Human Agent, that should be simulated in a Believable Embodied Agent to make it ‘natural’ (or ‘believable’). In particular, we have considered contents to communicate emotions, personality, culture, style and contextual variables. Then, we have briefly described the architecture of a conversational agent embodied in a 3D face, Greta, that tries to achieve a believable behavior while interacting with the User. Our Agent follows the seminal work of a number of other research groups involved in this domain (to mention only some of them: Arafa et al., 1999; Cassell et al., 1999; Rickel and Johnson, 1999; André et al., 2000; Lester et al., 2000). It builds upon research on how to model emotion activation and regulation (Frjida & Swagerman, 1987; Ortony et al., 1988; Elliott, 1992; Castelfranchi, 2000). The main characteristic of our Conversational Agent is in its ability to dialog with the User in any application domain whose knowledge has been represented with the appropriate formalism that is required by TRINDI and by MIND. Another peculiar feature is in the Mind-Body independence: although the examples we showed in this contribution are all about Greta, the APML tagging of dialog moves enables us to render them with any animated agent (2D or cartoons, for instance) according to the needs of the application domain and of the particular category of users to which the system is addressed.
Figure 3. Three expressions of Greta: neutral, sorry-for and surprise
In the future, we plan to study how the variables considered (in particular, affective factors) influence not only the external behavior of the Agent but also the inner part of it: typically, its reasoning style, its decision making ability and the argumentation process. The independence of the Mind and the Body representations
GRETA. A BELIEVABLE EMBODIED CONVERSATIONAL AGENT
23
allows us to represent communicative information at a high abstract level. The instantiation of a meaning into a set of signals is done by considering factors such as the technical context (on which device is the agent displayed; is it a 2D cartoon-like face, or a 3D realistic-like face; does it have only the face or also the body…) or socio-cultural context (what is the user’s culture, the agent’s culture; what is the user and agent’s age and gender; where is the conversation taking place…). Such an architecture provides great flexibility to create specific agents for each given set of factors. 12. NOTES 1 IST project Magicster IST-1999-29078 with partners: University of Edinburgh, UK (coordination); DFKI, Germany; SICS, Sweden; Univ. of Bari, Italy; Univ. of Rome, Italy; AvartarME, UK.
13. REFERENCES André, E.., Rist, T., Van Mulken, S., Klesen, M., & Baldes, S. (2000). The automated design of believable dialogues for animated presentation teams. In J.Cassell, J.Sullivan, S. Prevost & E.Churchill (eds.). Embodied Conversational Agents. Cambridge, MA: MIT press. Arafa, J., Charlton, P., Mamdani, A., & Fehin, P. (1999). Designing and building Personal Service Assistants with personality. In S. Prevost & E. Churchill (Eds.). Proceedings of the Workshop on Embodied Conversational Characters, Tahoe City. Bates, J. (1994). The role of emotion in believable agents. Communications of the ACM, 37, 7, 122-125. Black, A.W., Taylor, P., Caley, R. & Clark, R. (2002) Festival. http://www.cstr.ed.ac.uk/projects/festival/ Canamero, L. & Aylett, R. (eds.) Animating Expressive Characters for Social Interactions. Amsterdam: John Benjamins. Carbonell, J.G. (1980). Towards a process model of human personality traits. Artificial Intelligence, 15, 49-74. Carofiglio, V., de Rosis, F. & Grassano, R. (in press). Dynamic models of mixed emotion activation. In L Canamero & R.Aylett (eds.): Animating Expressive Characters for Social Interactions. Amsterdam: John Benjamins. Cassell, J., Bickmore, J., Billinghurst, M., Campbell, L., Chang, K., Vilhjalmsson, H. & Yan, H. (1999). Embodiment in conversational interfaces: Rea. In Proceedings of CHI'99, Pittsburgh, PA, pp 520527. Cassell, J., Sullivan, J., Prevost, S. & Churchill, E. (eds.) (2000). Embodied Conversational Agents. Cambridge, MA: MIT press. Castelfranchi, C. (2000). Affective Appraisal Versus Cognitive Evaluation in Social Emotions and Interactions. In A. Paiva (Ed.). Affective Interactions. Springer LNAI 1814, Berlin. pp. 76-106. Chi, D., Costa, M., Zhao, L. & Badler, N. (2000) The EMOTE model of effort and shape. In Proceedings. of Siggraph, 173-182. Conte, R. & Castelfranchi, C. (1995). Cognitive and Social Action. London : University College. De Carolis, B., Carofiglio, V. & Pelachaud C. (2002). From Discourse Plans to Believable Behaviour Generation. Proceedings of the International National Language Generation Conference. New York. De Carolis, B., Pelachaud, C., Poggi, I. & de Rosis, F. (2001). Behavior planning for a reflexive agent. In Proceedings of IJCAI'01. Seattle, August 2001. de Rosis, F., Pelachaud, C. & Poggi, I. (in press a). Transcultural believability in embodied agents: a matter of consistent adaptation. In R.Trappl & S.Payr (eds.) Agent Culture: Designing virtual characters for a multi-cultural world. Dordrecht: Kluwer Academic Publishers. de Rosis, F., Pelachaud, C., Poggi, I., Carofiglio, V. & De Carolis, B. (in press b) From Greta's mind to her face: modeling the dynamics of affective states in a Conversational Embodied Agent. International Journal of Human-Computer Studies. Ekman, P. (1982). Emotion in the human face. Cambridge: Cambridge University Press.
24
POGGI, PELACHAUD, DE ROSIS, CAROFIGLIO AND DE CAROLIS
Elliott, C. (1992). An Affective Reasoner: A process model of emotions in a multiagent system. Technical Report No. 32 of The Institute for the Learning Sciences, Northwestern University. Frjida, N.H. & Swagerman, J. (1987). Can computers feel? Theory and design of an emotional system. Cognition and Emotion. 1(3), pp. 235-257. Hartmann, B. Mancini, M. & Pelachaud, C. (2002). Formational parameters and adaptive prototype instantiation for MPEG-4 compliant gesture synthesis. In Proceedings of Computer Animation ’02, Geneva: IEEE Computer Society Press. King, S., Knott, A. & McCane, B. (2003). Language-driven nonverbal communication in a bilingual conversational agent. In Computer Animation and Social Agents, New Brunswick. Lester, J.C., Stuart, S.G., Callaway, C.B., Voerman, J.L., & Fitzgerald, P.J. (2000). Deictic and emotive communication in animated pedagogical agents. In J.Cassell, J. Sullivan, S. Prevost, & E. Churchill (eds.). Embodied Conversational Agents. Cambridge, MA: MIT press. Loyall, A.B., & Bates, J. (1997). Personality-rich believable agents that use language. In Proceedings of Autonomous Agents 1997, pp. 106-113. Magno Caldognetto, E. & Poggi, I. (1995) Creative Iconic Gestures: some Evidence from Aphasics. In R.Simone (ed.) Iconicity in Language. Amsterdam: John Benjamins. Mc Crae, R. & John, O.P. (1992). An introduction to the Five-Factor model and its applications. Journal of Personality. 60, pp. 175-215. Miceli, M. & Castelfranchi, C. (1989). A cognitive approach to values. Journal for the Theory of Social Behavior, 19, 169-193. Ortony, A., Clore, G.L., & Collins, A. (1988). The Cognitive Structure of Emotions. Cambridge University Press. Oatley, K. & Johnson-Laird, P.N. (1987). Towards a cognitive theory of emotions. Cognition and Emotion, 13, 29-50. Pearl, J.(2000). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufman Publishers. Pelachaud, C. (2002) Visual Text-to-Speech. In I.S. Pandzic, R.Forchheimer (eds.) MPEG-4 Facial Animation - The standard, implementations and applications. New York: John Wiley & Sons. Pelachaud, C., Carofiglio, V., De Carolis, B., de Rosis, F. & Poggi, I. (2002). Embodied Contextual Agent in Information Delivering Application. In Proceeding of the First International Joint Conference on Autonomous Agents & Multi-Agent Systems. Bologna, Italy. Pelachaud, C. & Poggi, I. (eds.) (2001). Multimodal Communication and Context in Embodied Agents. Workshop at the 5th International Conference on Autonomous Agents, Montreal. Pelachaud, C. & Poggi, I. (2002). Subtleties of Facial Expressions in Embodied Agents. Journal of Visualization and Computer Animation. 13, 301-312. Picard, R.W. (1997). Affective Computing. Cambridge, MA: MIT Press. Plutchick, R. (1980). The Psychology and Biology of Emotions. New York: Harper & Collins. Poggi, I. (2002a). Mind Markers. In M.Rector, I.Poggi & N.Trigo (eds.) Gestures. Meaning and Use. Oporto: Universidad Fernando Pessoa Press. Poggi, I. (2002b). Towards the Lexicon and Alphabet of Gesture, Gaze and Touch. www.semioticon.com/virtual Poggi, I. & Pelachaud, C. (1998). Performative Faces". Speech Communication, 26, 5-21. Poggi, I. & Pelachaud, C. (2000). Performative Facial expression in Animated Faces. In J.Cassell, J.Sullinvan, S. Prevost & E.Churchill (eds.). Embodied Conversational Agents. Cambridge, MA: MIT press. Prendinger, H. & Ishizuka, M. (2001). Smart brains in simplistic bodies. In C.Pelachaud & I.Poggi (eds.) Multimodal Communication and Context in Embodied Agents. Workshop at the 5th International Conference on Autonomous Agents, Montreal. Rao, A.S. & Georgeff, M.P. (1991). Modeling Rational Agents within a BDI-Architecture. In J.Allen, R.Fikes, & E.Sandewall (eds.), In Proceedings of the 2nd International Conference on Principles of Knowledge Representation and Reasoning (KR’91). Morgan Kaufmann Publishers, San Francisco, CA. Rickel, J. & Johnson, W.L. (1999). Animated agents for procedural training in virtual reality: Perception, cognition, and motor control. Applied Artificial Intelligence, 13, 343—382. Ruttkay, Z. Pelachaud, C., Poggi, I. & Noot, H. (in press) Excercises of Style for Virtual Humans. In L.Canamero & R.Aylett (eds.) Animating Expressive Characters for Social Interactions. Amsterdam: John Benjamins.
GRETA. A BELIEVABLE EMBODIED CONVERSATIONAL AGENT
25
Trappl, R. & Payr, S. (eds.) (in press). Agent Culture: Designing virtual characters for a multi-cultural world. Dordrecht: Kluwer Academic Publishers.
14. AFFILIATIONS I. Poggi Università di Roma 3 Roma – Italy C. Pelachaud IUT de Montreuil - Universite de Paris 8 Paris – France F. de Rosis V. Carofiglio B. De Carolis University of Bari Bari - Italy
P. PAGGIO AND B. JONGEJAN
MULTIMODAL COMMUNICATION IN VIRTUAL ENVIRONMENTS
1. INTRODUCTION This paper describes research work carried out at the Center for Sprogteknologi (CST) in Copenhagen to develop the multimodal communication components of a 3D virtual world application. This work was part of an interdisciplinary Danish research project – the Staging project – the purpose of which has been to investigate the nature and application potential of 3D virtual worlds, and their impact in theatre and entertainment (see Qvortrup (2000) for an overview). One of the many research activities the project has dealt with, was the design and development of a virtual farm scenario and the exploration of this scenario from a number of different perspectives. Relevant issues were for example the usefulness of theatrical metaphors in 3D applications, the interaction modes possible in them, and last but not least how the user can communicate with the virtual agents. The interaction patterns defined for the agents in the scenario – a farmer and a number of animals – are based on an interactive improvisation defined for the EU-funded Puppet project, which has had an ongoing cooperation with Staging (Klasen et al. 2000). In our work, we have focused on the interaction between the agents and the user, who can communicate his intentions to the farmer and in that way influence the events on the farm. Two characteristics of the application have been challenging from the point of view of how communication between the user and the agents could be modelled. First of all, we wanted communication to be multimodal: the user should be allowed to talk, write and use hand gestures in as natural a way as possible. Secondly, interaction was to accommodate the fact that the virtual world would not be static, so that user and agents had to be able to relate to such changes while communicating with one another. Therefore, communication could not be strictly planned, and a balance had to be found between the agents' wish to engage in a dialogue and their autonomous behaviour. Finally, we wanted to experiment with an environment in which the interaction with the agents in the world had an entertainment value rather than being subordinate to the achievement of specific goals such as finding information or learning how to carry out a task. This has been accomplished by enabling surrealistic actions such as moving trees around, and slightly comical replies such as “The cow is not hungry, but dead”. It is our impression from the feedback obtained so far that users appreciate the system's funny reactions and do not necessarily expect the interaction to have a practical goal. 27 O. Stock and M. Zancanaro (eds.), Multimodal Intelligent Information Presentation, 27–45 © 2005 Springer. Printed in the Netherlands
28
PAGGIO AND JONGEJAN
The multimodal requirements have been met by implementing a multimodal parser which is able to combine input coming from different modality devices and combine it where possible into a merged representation. The parser is neutral as to which specific modalities are used. In the Staging application, it combines speech and gesture inputs. Speech and gestures are paired during chart initialisation, and the grammar then merges the multimodal information into a unified semantic representation where possible. To comply with the flexibility necessary in the interaction with the virtual world, we have defined a dialogue model aimed at dealing with mixed-initiative dialogue. The two building blocks of the model are dialogue obligations modelling valid speech act sequences, and dialogue goals modelling the agents' domain-oriented initiatives in the dialogue. The paper starts by presenting the Staging virtual farm in Section 2. In Section 3, the architecture supporting multimodal communication in the virtual farm is described. Section 4 deals with the issue of how to combine speech and hand gestures in computer interfaces in rather general terms by considering various classifications of hand gestures, while Section 5 discusses the use of hand gestures in Staging and describes the methodology adopted for merging speech and gesture inputs. Section 6 deals with the component in charge of managing a humanoid agent's communication with the user, i.e. its Communication Manager, it describes our dialogue model and discusses a few dialogue examples. Finally, Section 7 contains the conclusion. 2. THE STAGING VIRTUAL FARM Before discussing the treatment of multimodal communication in Staging, a brief introduction to the project's virtual farm is provided in this section. The virtual farm has been developed at the Laboratory for Computer Vision and Media Technology at the University of Aalborg (CVMT/AAU) in Denmark and is described in detail by Paggio et al (2000). At CST we have produced a mock-up version of it in order to experiment with the integration of the multimodal communication components. In this mock-up, which we will refer to as the CST virtual farm from here on, the graphical rendering of the farm is two-dimensional, although the underlying spatial model is 3D. The graphical style is also different from that of the original farm in that we have substituted a simpler, cartoon-like look for the initial 3D computer game style. At the moment, the farm is a square with a tree in the middle in which the user can navigate by means of the keyboard arrows. The farm is populated by a farmer, a few cows and a sheep. Apples and carrots pop up in the grass for the animals to eat. The animals are mainly governed by their hunger level, and get more and more restless the hungrier they become. They are attracted by food if food is not too far from them, and eat it if they come close enough. If they don't find food, they eventually die. The farmer is capable of more complex actions such as moving things around on the farm, feeding the animals, and talking to the user. Finally the user, who can interact with the farm verbally and by means of gestures, is represented in the world by an invisible avatar. This means that the user sees the world from a certain perspective, and that this perspective changes if she
MULTIMODAL COMMUNICATION IN VIRTUAL ENVIRONMENTS
29
moves around the world – that is if she uses keyboard strokes to navigate in the world. Although the user is not represented physically in the world, and cannot manipulate any of the objects directly, the fact that she is positioned in the world by means of the invisible avatar makes it possible for the agents to relate to her in several ways. For example, when the farmer engages in a dialogue with the user, he comes closer to where the user is positioned, and when asked to tell the user where an object is, he will relate the position of the object in question to the position of the avatar.
Figure 1. The CST virtual farm
At the core of the CST virtual farm is the Virtual Environment (VE) server in charge of simulating the world, and to which relate an arbitrary number of autonomous agents as well as the user avatar. The VE provides the agents with sensory information that enables them to move around in the world and react to the presence of other agents, and it processes requests from agents, for example a request to move an object, produce a sound or play an animation. The agents perceive the presence of the user avatar in the same way as they perceive the presence of other agents. Agents have two different levels of functionality. The low level, which all agents share, accounts for spatial memory and motion planning, and enables an agent to perform simple actions such as “follow object X” or “move close to object Y”. The high level, on the other hand, is different from agent to agent, and is responsible for more complex behaviours and action planning. The high level of the farmer agent,
30
PAGGIO AND JONGEJAN
for example, will know that after a certain amount of time has elapsed from the last feeding time, the animals are hungry and must be fed. It is still the high level agent that may have the ability and will to engage in a dialogue. To choose among competing behaviours (only some of which may be the result of a verbal exchange with the user avatar), agents are equipped with a set of Releasing Mechanisms that compute the relevance of specific behaviours at any point in time (Madsen and Granum 2000). Agents continuously attempt to carry out the most relevant kind of behaviour, i.e. the behaviour that displays the strongest activation level. If for example the user (via an avatar) asks the farmer to feed a certain animal, say cow$3, the farmer’s releasing mechanisms will increase the relevance of the “feed object cow$3” behaviour. Whether the agent chooses to perform the action depends on the level of activation of the other behaviours at that moment in time – the farmer may be hungry, tired, or unwilling to cooperate with the user. In fact, other behaviours may be more relevant than “feed object cow$3”. Engaging in a conversation with the user is a type of behaviour in itself: to win over others, it must have a high level of activation. In the current implementation of the system, the dialogue behaviour has generally a high activation level for the farmer agent. 3. AN ARCHITECTURE FOR MULTIMODAL COMMUNICATION In the architecture we have defined for the CST virtual farm, and which is shown in Figure 2, each agent (but not the user avatar) may be equipped with its own multimodal analysis component. In the current implementation, only the farmer (agent B in the figure) is capable of reacting to multimodal messages. In the figure, A, B, C and D are individual agents. A is the user avatar, that is the representation of the user in the world. The avatar mediates the user's communication with the virtual world. We have chosen not to represent the user avatar graphically for the difficulties this would imply in terms of realism, credibility and so on, but the avatar is present in the 3D space so that for example spatial references are interpreted relative to the avatar’s position. The user can send input to the farm via a microphone, a data glove and a touch screen. The touch screen is used to point, whilst hand gestures other than pointing are performed by means of the data glove. The keyboard can be used as an alternative to the microphone should the physical conditions make it preferable. It is also used to navigate in the world. In principle, all the agents that are close enough to hear the avatar speak or to see its movements or both, can react to the input in different ways. In a fictive world (e.g. that of a children's game), it is easy to imagine a dog agent equipped with its own more or less primitive multimodal grammar. Depending on how anthropomorphic the agent is meant to be, it would either be able to react to very few linguistic utterances (Move, Sit), or to participate in a conversation. In the current implementation, however, only one agent has linguistic capabilities. This is the farmer, agent B in the figure, an example of a humanoid agent with more complex communicative skills than the other agents.
MULTIMODAL COMMUNICATION IN VIRTUAL ENVIRONMENTS
☞
C well$1
Gesture recognition A avatar
well$1
(the well)
"Take it to the well"
{
Virtual Environment
Speech recognition
D cow$3
Speech generation
"Okay"
31
B the farmer
}
spatial memory
Speech + Gestures
take it to the well$1|well Parser pred=move arg2=it arg3=well$1
releasing mechanism
ve mo to $1 ? g l n l illi we I w 3 to Am ow$ c am s, I Ye
action planning
Okay resolve("it") = cow$3
salient(cow$3)
Dialogue model
Dialogue history
Text generation
move(cow$3,well$1)
B's communication manager
Figure 2. System architecture
The farmer’s parser can analyse speech input and gesture information. As will be explained in more detail in Section 5, the multimodal input is unified if possible in a feature structure merging syntactic and semantic interpretation. In the figure, the user has entered the speech input Take it to the well while pointing at a well (“well$1”). The result of multimodal parsing is indicated in the figure as the list of features “pred=move, arg2=it, arg3=well$1”, where pred stands for the semantic predicate, arg2 for the direct object, and arg3 for the oblique complement. This is a much simplified representation which, however, serves the purpose of illustrating the fact that the information provided by the gesture is inserted in the structure built by the parser in the appropriate place. In addition to the multimodal parser, the farmer is equipped with a communication manager, which is in charge of managing the dialogue with the user avatar by means of a dialogue model1. It is the job of this component to interpret the user's messages with respect to the ongoing dialogue, to select the appropriate
32
PAGGIO AND JONGEJAN
dialogue moves for the agent, to build a dialogue history, and to generate verbal responses. In addition, the communication manager tries to resolve ambiguous referential expressions by looking at the salience of discourse referents in the dialogue history (in the figure, the pronoun it is resolved to “cow$3”). Microphone Loudspeaker
Dragon Naturally Speaking
SR-wrapper
Data glove Touch screen
Virtual Environment
Communication Manager
Keyboard
SAPI-compliant COM-interface
Multimodal Parser
CORBA-interfaces
Figure 5: Communication protocols in the Staging prototype
However, referential expressions may also be resolved by means of the gestures they are paired with, or because the agent cannot “see” more than one object matching the expression. As we shall see, the dialogue model constructs dialogue trees based on speech act analysis, and selects appropriate dialogue moves that fit in with the dialogue tree. However, the response to a request put forward to the agent in the dialogue also depends on the activation level of the corresponding action in the agent's releasing mechanisms. This is shown in the figure in terms of an internal deliberation as to whether the agent feels like complying with the request. In reality, the agent does not actually asks the question “Am I willing to move cow$3 to well$1?”. If the action can be released – for instance if the agent is not too tired – the request is sent to the agent's action planner. The system's prototype consists of several programs that are chained together using CORBA (omniORB 3.0.4) and Microsoft’s COM as communication protocols (see Figure 5). Dragon Naturally Speaking runs under the Windows 98 operating system, as does the Speech Recognition (SR) wrapper. The VE developed at CST runs under
MULTIMODAL COMMUNICATION IN VIRTUAL ENVIRONMENTS
33
Windows 98. The Communication Manager and the Multimodal Parser are platform independent and have been tested under Sun Solaris and Windows 98. Thanks to the CORBA interfaces all these programs can be distributed over several computers. The SR-wrapper, the VE mock-up, the Communication Manager and the Multimodal Parser are all written in C++. A considerable part of the functionality of some of these modules is defined in external text files: the parser loads lexicon, grammar and semantic mapping rules at start-up, whereas the Communication Manager loads and reloads script files describing virtually all of its functionality. The ability to reload the script files while the prototype is running allows for a rapid development of this component. 4. SPEECH AND HAND GESTURES Many sources argue that users generally prefer multimodal systems, and indicate a preference for one or the other modality in different applications (see e.g. Oviatt 1999). It has also been argued (Cohen et al 2000) that at least in the case of mapbased tasks where being able to manipulate diagrams directly by pointing and drawing is very important, multimodal interfaces are more efficient than graphical user interfaces, and make it easier and quicker for the user to recover from error situations. In dialogue systems, speech (or written text) replaces the mouse clicks we are so accustomed to from menu-driven interfaces. In human dialogue, however, speech is only one of the modalities used to convey meaning: gaze, expressions, head movements, hand gestures and bodily posture are also important channels by means of which we communicate. Thus, insights from studies of human communication are beginning to find their way into research on multimodal interfaces, so that multimodality is considered an interesting option both concerning system input and system output. This development is based on the assumption that a good understanding of human communication, which is fundamentally multimodal in its nature, is useful for the design of better human-computer interfaces: "The design of cooperative dialog systems, new human-computer interfaces [...] and avatars in various types of virtual environments can all be improved by better knowledge of features of real human-human multimodal communication." (Allwood 1999, p.1)
In our work, we are especially concerned with multimodal input and in particular, with the combined use of speech and hand gestures. Most researchers make a first basic distinction between conventionalised gestures like the OK sign, also called emblems, and conversational gestures (Kendon 1981, MacNeill 1992). Whilst emblems have stereotypical shapes (although the shape differs in different cultures) and do not necessarily occur in dialogues, conversational gestures do not take a predictable shape and occur in juxtaposition with speech. Conversational gestures can be categorised in different ways. In Bavelas (1994), they are grouped into topical gestures on the one hand, and interactive gestures on the other. Topical gestures are those directly related to the topic under consideration, e.g. gestures depicting the shape of an object, the speed of
34
PAGGIO AND JONGEJAN
a movement or a time location. Interactive gestures are those used to attend to the dialogue as such, for example by marking the delivery of information (new vs. old information), involving the addressee by acknowledging their contribution or seeking a response, and finally managing the turn in the dialogue. Bavelas describes no less than 12 different specialised interactive gestures. In Allwood (2002), Peirce's basic semiotic categories of indexical, iconic and symbolic information are applied to gestures. This yields the following classification: i. indexical gestures, which establish a causal relation (e.g. pointing) with the information being perceived (the indicated object); ii. iconic gestures, which express information by similarity or homomorphism; iii. symbolic gestures (emblems), where the relation between form and meaning is based on social convention. In Staging, we have focused on three different gesture types which we call deictic, iconic and turn management gestures following i.a. Cassell and Prevost (1996). We have further restricted deictic to pointing gestures, which are indexical according to Allwood's classification, and iconic gestures to gestures expressing the size of the objects on the farm. Turn management gestures are used either to take or to give the turn. The three gesture types interact with speech in different ways, as shown by the following examples:
(1)
a) Please, feed the animals/them. (while pointing at a group of pigs)
b) Move the smaller cow. (with a gesture meaning small)
c) No, I meant the other animal. (while waving with a hand)
In example (1a), the pointing gesture is associated either with a full nominal phrase or a pronominal placeholder. It adds referential information that must be compatible with the semantics of the rest of the speech act. In (1b), the gesture adds a different kind of semantic information, which can usefully be represented as a feature contributing to the semantics of the noun cow. As we shall see, object references coming from pointing gestures, or size information coming from iconic gestures are input to the parser and, if syntactically and semantically compatible, merged with the speech input. Both pointing and iconic gestures are used by the system to resolve nominal phrase reference, and take precedence over reference resolution based on the dialogue history. Example (1c) is different, in that it does not directly contribute meaning to the utterance, and is rather to be interpreted as an independent speech act. A turn-taking
MULTIMODAL COMMUNICATION IN VIRTUAL ENVIRONMENTS
35
action is picked up by the agent's communication manager, which urges the agent to come closer to the user avatar and listen to the user's next utterance, maybe after having interrupted an ongoing action. Turn-giving gestures, on the other hand, cause the agent to take a dialogue initiative. In conclusion, Bavelas's distinction between topical and conversational gestures is reflected in the quite different treatments associated with pointing and iconic gestures on the one hand, and turn management gestures on the other. 5. MERGING MULTIMODAL INPUT An important assumption in our approach to multimodality is the fact that a great deal of research and experience are still needed to determine which particular combination of modalities is the most appropriate for specific tasks and applications. Therefore, we are interested in generic methodologies that allow systems to merge inputs from different modality devices irrespective of the specific application rather than in optimised, low-level implementations of concrete and domain-specific multimodal input combinations. We have chosen to parse multimodal inputs in an integrated fashion, thus taking advantage of unification to treat redundant, complementary ad well as ambiguous inputs. Our approach is different from that used in frame-based systems (as in Vo and Wood 1996), where input from different modalities is integrated by adding information in semantic frames tailored to the domain. It is more similar to the framework adopted in Johnston et al (1997), where the integration of spoken and gestural information is driven by typed-feature structures and carried out via unification. In their approach, multimodal integration takes place after speech and gestures have been independently parsed into separate, partial feature structures. Whilst we also use feature structures to build an integrated semantic representation, in our approach, as already discussed, deictic and iconic gestures are merged with speech by the parser, and turn managing gestures are interpreted later by the communication manager. Input to the system is provided via a microphone, a touch screen and a data glove. The speech input is sent to an off-the-shelf speech recogniser (Dragon NaturallySpeaking Professional) which provides word-based, continuous speech analysis trainable to different speakers' voices and accents. Currently, the speech recogniser's vocabulary is used as provided by the vendor. In other words, we have not limited Dragon only to recognise the words that are known to the parser. This strategy allows us to experiment with unknown words in the parsing algorithm, and has the advantage of not imposing vocabulary limitations on the user. In fact, adaptation of the product has been restricted to user-specific training of the speech files. We have not worked with the internal grammar of the speech recogniser, nor have we experimented with feeding the parser with multiple speech analysis hypotheses. In other words, in our current set-up Dragon sends one and only one analysis per speech input to subsequent processing. A different approach could certainly be adopted, but it was not in focus in the project.
36
PAGGIO AND JONGEJAN
Speech
Hand movements
Speech recognition Gesture recognition pointing, size
Chart initialisation Parsing
turn managing
Semantic mapping Communication management Action
Figure 3: Merging Speech and Hand Gestures
The gesture input is sent to a gesture recogniser developed by Karin Husballe Munk at CVMT (Aalborg University). This component recognises the following hand gestures: x x x x
gesture indicating “small”: closing thumb and forefinger into a circular or oval shape; gesture indicating “big”: stretching thumb and forefinger far from each other, other fingers stretched; turn-taking gesture: lifting hand with palm towards the screen, forefinger and middle finger stretched upwards; turn-giving gesture: from a closed hand (all fingers bent), opening the hand with palm upwards, all fingers stretched in the direction of the screen.
In the preceding section we saw that conversational gestures (both iconic and turn management gestures in our small gesture repertoire) are characterised by not having conventionalised shape. This means for example that the concept of “big” in human conversation can be conveyed in various ways using either one or two hands. The gesture we have decided to implement in the gesture recogniser is iconic in the sense that it is an attempt at imitating the real size of the big animals on the farm – the cows as opposed to the sheep: they are only relatively big, so that the distance between thumb and forefinger when the two fingers are stretched apart from each other seems a reasonable approximation of what “big” is in our virtual world. However, it may be objected that by only allowing two specific gestures to convey the meaning of the two size concepts we are interested in, we turn iconic gestures into emblems. The same objection can in fact be made for turn management gestures: only two of a spectrum of possible gestures are possible. The objection is
MULTIMODAL COMMUNICATION IN VIRTUAL ENVIRONMENTS
37
certainly justified. Given the fact that automatic gesture recognition is an extremely young discipline, it may be expected that future interfaces will be able to react to a much broader range of gestures than those proposed here. The inherent nonconventionality of conversational hand gestures, however, remains a very hard problem, so that systems also in the future will have to strike a difficult balance between naturalness of interaction and implementation feasibility. 5.1. Multimodal parsing Turning again to our implementation, the multimodal parser used in the project is an adaptation of Njal (Music and Povlsen 1993), a bottom-up left-corner chart parser that has been adapted to receive input from several modalities. Njal was chosen because it is fast, and can produce partial parses in case the input is illformed or outside the coverage of the grammar. It can also accommodate unknown words, and apply heuristics to prune multiple parses. Njal applies lexicon entries and grammar rules expressed in an augmented context-free formalism, where category symbols such as S (sentence), NP (nominal phrase) etc. are associated with a feature structure defining the semantics of the sentence or phrase under consideration. For Njal to be able to process speech and gestures together, the speech signal and the signal from the data glove are time-stamped and synchronised. As already mentioned, pointing and size gestures are entered into the initial chart, and if possible paired with relevant words. In addition to the near synchronicity of the two, syntactic and semantic constraints have to be met for the word and the gesture to be paired (see also Paggio and Jongejan 2000 for a discussion). A pointing gesture can be paired with a noun or a pronoun. Furthermore, the semantic type of the noun and the object type indicated by the gesture must be compatible. Let us consider two concrete examples:
(2)
Move that cow$1|animal.
(3)
Feed that cow$1|apple.
In (2), the user asks the farmer to move that animal while pointing at a cow. The string cow$1 is the internal representation of the pointing gesture, where “cow” is the object type and “$1” an internal identifier. Since “cow” is a subtype of “animal” in the system's domain ontology, noun and pointing gesture are semantically compatible. The resulting semantic representation of the multimodal input is the following feature structure: (4)
{act=request, predicate=move, arg3={pred=animal, semtype=animal,
38
PAGGIO AND JONGEJAN objectid=cow$1}}
The interpretation of the gesture is an object identifier added to the recipient (arg3) of a “move” request. In example (3), on the contrary, the semantic type of apple (presumably the result of a speech recognition error) is not compatible with the semantic type associated in the lexicon with the word animal. Speech and gesture in this case cannot be unified. The parser is left with two distinct parse hypotheses, one in which apple is the recipient of the predicate feed, and one in which cow$1 is. Since the former is semantically impossible, the latter is yielded as the only possible parse. The resulting feature structure is shown below: (5)
{act=request, predicate=feed, arg3={semtype=animal, objectid=cow$1}}
Whilst pointing gestures are exploited to resolve the reference of nominal phrases directly, size gestures add semantic information which can be used for the same purpose in a more indirect manner. For instance, the gesture small or big paired with the phrase the animal, in certain contexts can provide enough information to identify a specific animal. In the feature representation, the meaning of a size gesture is expressed as the value of the feature “mod(ifier)” as shown below: (6)
{act=request, predicate=feed, arg3={semtype=animal, objectid=cow$1, mod={pred=small}}}
To conclude, our approach to the merging of multimodal inputs is general enough to accommodate complementary as well as ambiguous and contradictory information. The use of feature structures to represent the content of the resulting multimodal sign is, we believe, a powerful tool which can be exploited also for more complex examples than those implemented so far. 6. THE COMMUNICATION MANAGER The overall goal of interaction in Staging, as already pointed out, is not carrying out specific tasks, but rather “playing around” with the world and making things happen in it. Since the user cannot manipulate objects directly, the farmer has to be resorted to. Thus the user can ask the farmer questions about the situation in the world, e.g. where the animals are or whether they are hungry, and urge the farmer to perform simple actions that have to do with the animals on the farm. The
MULTIMODAL COMMUNICATION IN VIRTUAL ENVIRONMENTS
39
communication manager is in charge of managing the farmer's dialogue with the user in accordance with a dialogue model. This implies interpreting the user messages with respect to an ongoing dialogue, deciding on the agent's dialogue moves, building a dialogue history, and generating a verbal response. Our model aims at accommodating the flexibility required by the application by combining dialogue goals arising from the scenario with dialogue obligations expressed in terms of sequences of speech acts. Dialogue goals and dialogue obligations are modelled based on well-known approaches. However, we believe these have been combined in an interesting way to be applied in a world of autonomous agents in which communicating is one among a set of competing behaviours. Compared to the industrial models of dialogue management reviewed in Zinn et al. (this volume), our approach is more open in the sense that it is not based on explicit coding of all possible sub-dialogues as done for instance in systems based on finite state machines. The construction and interpretation of dialogue trees that is the central mechanism in our approach depends on the definition of general speech act sequences, or dialogue obligations, and the grammar used by the parser to interpret the individual dialogue moves is also very general. The domain knowledge consulted by the communication manager to give meaningful content to the speech acts generated by the agent, is expressed in a library of action templates that are used to generate dialogue goals. This seems similar to what is done in dialogue systems built by computational linguists and speech researchers (e.g. Allen et al 2000, Larsson and Traum 2000), where still according to Zinn et al, dialogue modelling and domain knowledge are kept separate. Dialogue goals and dialogue obligations will now be described in more detail. 6.1. Dialogue goals Examples of informally specified dialogue goals from the farm scenario – some very general, some more specific – are negotiate whether some animal should be moved, negotiate whether some animal should be fed, talk about the weather, and ask for object reference. Dialogue goals are action plans referring to a pre-defined and domain-relevant set of action templates in the spirit of Badler et al (1999). Such templates specify for a given action which semantic objects are involved, the corresponding attribute name in the semantic representation output by the parser, relevant preconditions if any precondition holds, etc. Templates for feeding and moving, for example, may look as follows (somewhat simplified):
(7)
FeedAction(Topic=Feed, Animal=, Food=, Tool=)
(8)
MoveAction(Topic=Move, Object=, Modus=, Pace=(quickly|slowly|normal)
40
PAGGIO AND JONGEJAN
Dialogue goals are generated dynamically either to reflect changes in the VE as experienced by the agent, or to help the agent react properly to a request by the user. For example, if the farmer becomes aware of the fact that a pig is hungry, maybe because a certain amount of time has elapsed since he last fed it, his communication manager will create a dialogue goal urging him to ask the user whether the pig in question should be fed. Thus, the agent can take the initiative in the dialogue, and it will do so if explicitly given the turn by the user, or if the user is silent for too long. Dialogue goals also enable the agent to gather all the information it needs to satisfy a request by the user. For instance if the user (U) asks the agent (A) to feed the cows, the agent will ask which food they should be fed with as shown in the following transcript (where cow$2 is the system internal representation of the reference provided by a pointing gesture): (9)
U: Hi feed an animal please. A: Which animal shall I take? U: Feed cow$2|that cow. A: Which food shall I take? U: An apple.
Of course, dialogue goals are not the only goals an agent will have. The exchange above, for example, will yield the goal of actually feeding the cow. Whether the action is then carried out will depend on facts external to the agent's will, such as in this case the cow's hunger level. In fact, an action may be relevant when the agent sets out to perform it, and become impossible before the agent has managed to carry it out. For example, a cow may die of hunger before the farmer gets there to feed it. In that case, the farmer will simply not perform the action. 6.2. Dialogue obligations and dialogue trees Dialogue obligations are stated as a set of condition/obligation pairs following the model described in Traum and Heeman (1996). For instance, a question is followed by an answer, an assertion by a confirmation or a contradiction, and so on. Dialogue obligations are used by the communication manager to produce a correct reaction to a speech act by the user as well as to interpret a dialogue move by the user either as closing an open dialogue segment, or as initiating a sub-dialogue. Thus, understanding a user move and selecting the appropriate type of verbal response is performed in terms of constructing and interpreting dialogue trees. An example of the analysis this mechanism yields for a short construed dialogue fragment is shown in Figure 4. In addition to a string corresponding to the message, each terminal node in the dialogue tree contains a semantic and pragmatic interpretation of it including speech act type. In the figure, the speech act is shown in italics, together with a node index and the message. Following the DAMSL annotation scheme described in Core and Allen (97), we make a distinction between backward-looking and forward-looking speech acts. A backward-looking speech act
MULTIMODAL COMMUNICATION IN VIRTUAL ENVIRONMENTS
41
constitutes a reaction to a preceding utterance, while a forward-looking one elicits a response (either a verbal response or an action). A given dialogue move may well have both a backward-looking and a forward-looking function. From the communication manager’s perspective, forward-looking speech acts specify rule conditions, while backward-looking speech acts are expected reactions to these conditions – our dialogue obligations. If a different speech act occurs instead of the expected obligation, a sub-dialogue is opened.
D Topic=feed
I/R1
I1 Shall I feed the brown cow? offer
I/R3
R1
I3
R3
Feed it then.
Shall I feed more animals? offer
Yes, that sheep. accept/request
accept /request
I/R2 I2
R2
Where is it? whque
In the cowshed. answer
Figure 4: An agent-initiated dialogue
In the dialogue example under consideration, the first initiative (I1) is taken by the agent. It is a forward-looking speech act of type offer, which the communication manager interprets as opening the first initiative-response pair (I/R1) of dialogue D (Jönsson 1997). The topic of the dialogue is defined as feed. Instead of answering the agent’s question directly by either accepting or rejecting the offer, the user asks for more information. The user’s move is therefore interpreted as a forward-looking whque(stion) opening a new initiative (I2), to which the agent replies conveniently with an answer move (R2). This closes the sub-dialogue consisting of the second I/R pair. The next user’s move is interpreted as a reply to the open I/R1, since it can be interpreted as a (backward-looking) accept(ance) of the agent’s original offer. Note that this move also has a forward-looking function of type request. We shall return to this point below. Before satisfying the user’s request of feeding the brown cow, the agent asks whether there are other possible object candidates for the feeding action. This initiative, the offer in I3, is generated by a dialogue goal based on the
42
PAGGIO AND JONGEJAN
relevant action template. The subsequent response by the user closes I/R3 as well as the whole dialogue. 6.3. Implied and coerced speech acts
Even in simplified and domain-oriented dialogues of the type we are dealing with, we know that the same message can be assigned several speech act labels, and that responses do not always seem to obey our limited set of condition/obligation rules. Here, we would like to discuss two cases in which the rules can be relaxed in a systematic way to make the dialogue model more flexible. One way to relax the condition-obligation rules is by allowing responses to be implied. For example, let us consider the following dialogue: (10)
A: Hi. U: Feed the animals
Since the user does not answer the agent’s greeting right away, the agent should not expect to be greeted back later. In other words, after the first utterance of the human participant U, the agent must consider the greeting procedure complete. If we stick to our model of initiative-response pairs, then we have to allow for implied responses. In the example, the Feed the animals request initiates a new initiativeresponse pair. If we do not want to regard this I/R pair as part of a sub-dialogue of the greeting séance (which would seem unnatural), then we have to conclude the greeting I/R pair with an implied greeting response. Another case in which the condition/obligation rules cannot be applied in a straightforward way is the following apparently unproblematic exchange. (11)
U: Feed the animal. A: Which animal shall I take? U: Feed that cow$1|cow. A: Which food shall I take?
The second request (feed that cow$1|cow) is not a new feeding request and must not be interpreted as an initiative opening a new sub-dialogue. On the contrary, it must be accepted as a valid response to the question (which animal shall I take?), and an elaboration of the action under discussion. In the current implementation a request can be coerced into an answer or an acceptance if there is an action on the action stack that can be enriched by the information provided by the request. More precisely, for the request to be assigned the appropriate backward-looking speech act type, there must be semantic equality between the verb of the request and that of the pending action (“feed” in this case), as well as semantic compatibility between their arguments (a cow is an animal).
MULTIMODAL COMMUNICATION IN VIRTUAL ENVIRONMENTS
43
7. CONCLUSION To meet the requirements of a natural and non-task-oriented communication in a virtual world, we have developed an experimental platform where the user can use speech and hand gestures to interact with a humanoid agent, and user and agent can engage in mixed-initiative dialogues dealing with work on a virtual farm. The repertoire of gestures the user can use in combination with speech includes pointing gestures that can be input via a touch screen, as well as iconic gestures indicating size and turn management gestures that can be entered by means of a data glove. In this paper we have presented the methodology used to combine speech and gestures, which is based on the unification of syntactic and semantic features in feature structures, and discussed the way in which the communication manager makes it possible for an agent to react to the user's multimodal communication verbally and by carrying out actions in the virtual world. The Staging virtual farm has provided a nice opportunity for us to experiment with multimodal human-computer interaction in an application where emphasis was put on entertainment value rather than on the accomplishment of specific tasks. On the Staging farm, it is perfectly appropriate for the farmer, when the user points at a brown cow that is looking very restless and asks whether it is hungry, to answer “The brown cow is not hungry, but starving”. If the cow is lying down, the same question will be met with an absurd “The cow is not hungry, but dead”. The possibility of funny responses like these was not planned from the start: it presented itself as we started to realise what possibilities and limitations the scenario had to offer in terms of communication. From the point of view of multimodal interaction, the farm has made us realise how difficult it is to work not only with speech-based dialogue, but to add the complexity of gestural input. The coverage of the gesture recognition should be improved substantially with respect to what we have managed to do in Staging to enable users to use gestures in a flexible way, and it should be possible to perform all types of gestures by means of the same device – e.g. a data glove, or in future freely in front of cameras. Furthermore, the disambiguating potential of multimodal input can be taken further than it has been done in Staging: in particular, it should be investigated how the pairing of speech input and gestures can be made to work with multiple speech analysis hypotheses. Also in the approach to multimodal dialogue presented here several aspects obviously deserve further study. First of all, a more refined treatment of failures ought to be designed and implemented. For instance, when an action requested by the user becomes impossible, we have seen that the agent's current reaction is simply to abandon his plan to carry out the action. It would be worth pursuing different strategies: for instance, the agent could explain to the user what has happened, or try an alternative path of action. Another way in which the scenario could be made more interesting from a theoretical as well as an application point of view is to have several conversational agents interact with the user at the same time. In principle, the general architecture of the system allows experimenting with multi-party dialogue, but of course this would introduce a great deal of complexity in dialogue management.
44
PAGGIO AND JONGEJAN
In conclusion, we believe that a great deal still has to be learnt on how to process gestural input, how to combine it with other modalities in computer interfaces and how actual users would like to be able to use gestures when interacting with systems for different purposes. The Staging farm has provided an interesting framework to study some of the aspects of multimodal interaction in virtual environments. 8. NOTES 1.
In principle, the communication manager could handle communication between agents as well. Such an extension of its role, however, is by no means trivial, and it is not clear that such a complex apparatus is necessary. The project has chosen not to address this issue.
9. REFERENCES Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., Stent, A. (2000) An architecture for a generic dialogue shell. Natural Language Engineering 6 (3&4), 213–228. Allwood, J. (1999) Cooperation and Flexibility in Multimodal Communication. Gothenburg Papers in Theoretical Linguistics. December. Allwood, J. (2002) Bodily Communication Dimensions of Expression and Content, in B. Granström and D. House (eds.) Multimodality in Language and Speech Systems. Kluwer, Dordrecht. Badler, N.I., Bindiganavale, R., Allbeck, J., Schuler, W., Zhao, L. and Palmer, M. (2000) Parameterized action representation for virtual human agents, in J.Cassell, J.Sullivan, S.Prevost and E.Churchill (eds) Embodied Conversational Agents, MIT Press, 256–284. Bavelas, J. B., Chovil, N. and Roe, L. (1995) Gestures Specialized for Dialogue. Personality and Social Psychology Bulletin. Vol. 21, No. 4, 394–405, April. Cassell, J. and Prevost, S. (1996) Distribution of semantic features across speech & gesture by humans and machines, in Proceedings of the Integration of Gesture in Language and Speech. Cohen, P:, MacGee, D. and Cllow, J. (2000) The efficiency of multimodal interaction for a map-based task, in Proceedings of the 6th Applied Natural Language Conference and the 1 st Meeting of the North American Chapter of the Association for Computational Linguistics, pp. 331-337, Seattle, Washington, USA. Core, M.G. and Allen, J.F. Allen. (1997) Coding dialogues with the DAMSL annotation scheme. In Working Notes of the AAAI Fall 1997 Symposium on Communicative Action in Humans and Machines. Johnston, M., Cohen, P.. MacGee, D., Oviatt, S.L., Pittman, J.A. and Smith, I. (1997) Unification-based multimodal integration. In Proceeding of the 35th Annual Meeting of the Association for Computational Linguistics, Madrid, Spain. Jönsson, A. (1997) A model for habitable and efficient dialogue management for natural language interaction. Natural Language Engineering, Vol.3, 103–122, September. Kendon, A. (1981) Nonverbal Communication. Interaction and gesture. Mouton. The Hague. Klasen, M., Szatkowski, J. and Lehmann, N. (2000) The black sheep – interactive improvisation in a 3D virtual world, in Proceedings of the i3 Annual Conference, Jönköping, 13–15. Larsson, S., Traum, D. (2000) Information state and dialogue management in the TRINDI dialogue move engine toolkit. Natural Language Engineering 6 (3&4), 323–40. Madsen, C.B. and Granum, E. (2001) Aspects of interactive autonomy and perception, in L.Qvortrup (ed.) Inhabited 3D Virtual Spaces, Springer Verlag, chapter 3, 182–208. MacNeill, D. (1992) Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press, Chicago. Oviatt, S.L. (1999) Mutual disambiguation of recognition errors in a multimodal architecture. In Proceedings of the Conference on Human Factors in Computing Systems, pp. 576–583. ACM Press: New York. Paggio, P. and Jongejan, B. (2000) Representing multimodal input in a unification-based system: the Staging project, in Proceedings of the Workshop on Integrating Information from Different Channels in Multi-Media Contexts at ESSLLI 2000, 24–31.
MULTIMODAL COMMUNICATION IN VIRTUAL ENVIRONMENTS
45
Paggio, P., Jongejan, B. and Madsen, C.B. (2000) Unification-based multimodal analysis in a 3D virtual world: the staging project, in Proceedings of the CELE-Twente Workshop on Language Technology: Interacting Agents, 71–82. Qvortrup, L., ed. (2000) Inhabited 3D Virtual Spaces, Springer Verlag. Traum, D.R. and Heeman, P.A. (1996) Utterance units and grounding in spoken dialogue, in Proceedings of ICSLP. Vo and Wood (1996) Building an application framework for speech and pen input integration in multimodal learning interfaces. In Proceedings of the Internation Conference on Acoustics, Speech, and Signal Processing, Atlanta, GA.
10. AFFILIATIONS Patrizia Paggio and Bart Jongejan Center for Sprogteknologi (CST) Copenhagen, Denmark
M. THEUNE, D. HEYLEN AND A. NIJHOLT
GENERATING EMBODIED INFORMATION PRESENTATIONS
1. INTRODUCTION In conversations between human speakers, speech is the main carrier of information, but nonverbal signals such as gestures and facial expressions also play an important role, providing additional information about the content and the structure of the discourse. In order to successfully engage in a natural interaction with a human user, an embodied conversational agent should be able to interpret the user’s speech and nonverbal signals, and to respond with appropriate verbal and nonverbal behaviours of its own. We know from our research on interaction with embodied conversational agents (Nijholt and Heylen, 2002) that extending the agent’s repertoire of nonverbal behaviours can improve the quality of the interaction (Heylen et al., to appear). Embodied conversational agents should therefore not just communicate by words but also nonverbally. Just as in human-human communication, many channels can be used to send a whole range of signals. To give a few examples, facial expressions can be used to express interest or surprise and a whole range of emotional and conversational signals. Posture shifts may indicate, among other things, a readiness to speak. Gestures can be used for several functions, such as accentuating parts of utterances. Gaze provides information about the focus of attention and plays a role in turn taking. Nonverbal signals can thus be used to signal information about the mental and emotional state of the agent, its personality, its understanding of what is said in the conversation and many other things. In the past few years several groups have started to investigate different modes of nonverbal communication in more detail. This research deals with a whole range of questions: what signals are appropriate, what can we learn from human-human interaction in this respect, when do signals occur or when are they planned, how are they synchronized, how are they combined, how are they interpreted or evaluated by humans conversing with the agents, and so on. The variety of channels and kinds of signals that can be examined, together with the large amount of parameters that enter into nonverbal communication means that much of the work is still exploratory, focussing on just a few behaviours and functions for specific settings. In principle one would like embodied agents to be capable not only of producing the appropriate signals, but also of receiving and 47 O. Stock and M. Zancanaro (eds.), Multimodal Intelligent Information Presentation, 47–67 © 2005 Springer. Printed in the Netherlands
48
THEUNE, HEYLEN AND NIJHOLT
interpreting the nonverbal signals of the human interlocutors. Although there are noticeable attempts at this (see e.g., Sowa et al., 2001, Breazeal, 2002, Wahlster, 2002), most work on nonverbal communication has been on the production side. We briefly discuss a few examples to illustrate the kinds of nonverbal signals that have been considered and the kind of research involved. In the tutor agent Steve (Rickel and Johnson, 2000), nonverbal modes of communication have been introduced to help the instructions. Steve uses a number of signalling modes to aid in specific functions that are prominent in teaching situations, like drawing the attention of students to certain objects or actions and giving clues on how they are doing with their tasks. Body orientation, gaze and deictic gestures are used to direct the students’ attention to objects in the virtual world. Steve can provide feedback to the students by shaking his head when telling students that they made an error or by a simple nod of approval when students perform correct actions. Nodding can also be used for back-channelling to acknowledge that Steve has understood the student. In this kind of research, the main task dictates the kind of use that is made of nonverbal communication. Research in the Gestures and Narrative Language group of the MIT Media Lab, headed by Justine Cassell, has been concerned with studying aspects of nonverbal communication in human-human conversations and using these results to build computational models of that behaviour for conversational agents in specific dialogue settings. In these studies special attention is paid to the relation between the nonverbal acts on the one hand and linguistics aspects of the dialogue on the other hand. For instance, in their research on gaze, they looked at the role of gaze in relation to turn-taking and information structure. Other nonverbal behaviours they have investigated are posture shifts and beat gestures (Cassell et al., 2000ab, 2001a). In “face-to-face” conversations, the face is the most expressive part of the body. Ekman (1979) provides an overview of the kind of signals that involve eyebrows. They play a role in common emotional expressions of sadness, surprise, fear and anger or distress; they may help underline certain words or punctuate the discourse; they may function as question markers, indicators of word search, et cetera. People are particularly sensitive to the eyes of others as a source of information about another person’s mental state. Baron-Cohen (1995) speaks about a special language of the eyes. Facial expressions can be classified as either functioning as a conversational signal or as emotional icons. The latter could be expressive of an emotional state experienced by the conversant, but in face-to-face conversation, affective displays mostly have a social, communicative function (Kraut and Johnston, 1979; Chovil, 1992). Despite this, much work on facial expressions is solely concerned with a few emotional expressions. A notable exception is the work by Takeuchi and Nagao (1993), who pay attention to conversational signals as well. A few other authors have looked in more detail at how to link facial expressions to the conversational actions and the intentional state of the agents (Pelachaud et al., 1996, 2002; Poggi et al., 2000). The diversity in nonverbal communication is huge and involves many parameters, while our models of emotion, personality, dialogue and social interaction are still rudimentary. Given this, it is a daunting task to let embodied
GENERATING EMBODIED INFORMATION PRESENTATIONS
49
conversational agents converse naturally with the full repertoire of verbal and nonverbal modes of communication that people use in face-to-face conversations. In this paper we will look at the relation between verbal and nonverbal communication for embodied conversational agents. We argue that because of their mutual dependence, the generation of verbal and nonverbal signals should be closely integrated. We will describe how a standard architecture for language generation can be extended to deal with the specification of nonverbal behaviours, and work out an example that shows how the resulting architecture can be used in a specific situation. First, however, we describe in more detail the role of human nonverbal modalities in conversation and discuss how taking these modalities into account can improve the interaction with conversational agents. 2. NONVERBAL SIGNALS IN HUMAN CONVERSATION Humans engaged in conversation typically produce many different kinds of nonverbal signals, which include visual and audio signals. Gestures and facial expressions are the most noticeable visual signals; others are gaze and posture. Audio signals include non-speech sounds, like snorting and sighing, and prosody, which involves features of the speech signal like volume, tempo, and pitch. Both speakers and listeners produce such nonverbal signals, but with differences in frequency and type. Most nonverbal signals are produced unintentionally. In the case of gestures, it has been argued that their main function is to help the speaker with verbal formulation (Krauss et al., 1991; Rimé and Schiaratura, 1991). This point of view is supported by the fact that people still produce nonverbal signals when they cannot be seen, e.g., on the telephone. Still, even though many nonverbal signals may not be intended for the benefit of the hearer/viewer, it has been shown experimentally that people do make use of the information conveyed by such signals (see Kendon (1994) for an overview). In the following, we discuss three general classes of nonverbal signals, according to the kind of information they convey. Following Pelachaud et al. (1996) and Cassell et al. (2000a), we distinguish between the interpretation of a nonverbal signal (its function, in terms of Cassell et al., 2000a) and its form: nonverbal signals with different forms can have the same interpretation, and the same nonverbal behaviour can be interpreted differently, depending on its context. The next three sections provide an overview of nonverbal signals in interaction between humans, grouped according to the kind of information they express. The last section discusses their relevance for embodied agents. 2.1 Expressing message content
One of the functions of communication is transferring knowledge: information about the state of the world is sent from the speaker to the listener. The information being transferred can be expressed using language, but also using nonverbal signals. All nonverbal signals in this category are in principle ‘interchangeable’ with the corresponding verbal expressions. In normal circumstances, the main part of any message is expressed using language. In specific circumstances, ‘content-bearing’
50
THEUNE, HEYLEN AND NIJHOLT
signals can fully replace language. (Think of communicating with someone when there is too much noise to be heard!) Usually, however, they are used in combination with language, sometimes redundantly, and sometimes non-redundantly. With respect to timing, redundant signals are synchronized with the corresponding verbal expression. In the case of non-redundant signals, they co-occur with the verbal reference to the action, object or property that they provide additional information about. Nonverbal signals that are used to express message content can be divided into the following two classes. Deictic signals are used to identify objects being referred to in the message. These signals usually take the form of a pointing hand gesture, but gaze direction, head and body movements are used as well. The indicated objects do not necessarily need to be visible, or even concrete. Deictic signals are different from other signals that express message content in that they do not inherently represent any meaning; their interpretation is entirely determined by the situational context. In this respect they are like verbal deictic expressions such as ‘here’, ‘we’, ‘now’, et cetera. Representational signals are used to express concepts such as attributes, actions, and relationships between objects. They can be divided into different subclasses. Emblems are like words in that the relation between their shape and meaning is arbitrary and different across cultures. They can replace words or entire messages. Examples are nodding or head shaking to express agreement or disagreement, and the ‘thumbs up’ gesture for OK. Iconic and metaphoric signals illustrate properties of objects and actions. Their shape is not arbitrary, but reflects the meaning being expressed. In the case of iconic signals, there is a direct resemblance between the signal and the concept being depicted, for example hands forming a circular shape. They usually take the form of gestures, but facial expressions can also act as icons. For instance, squeezed eyes can symbolize small size (Poggi et al., 2000), or bad taste can be expressed by pulling a disgusted face. For pantomimes, the relation between shape and meaning is even more direct, whereas for metaphoric gestures some visual metaphor is used, such as depicting a physical container to represent a bearer of information, e.g., a film (McNeill, 1992). 2.2 Reflecting discourse structure
Nonverbal signals that reflect the structure of the ongoing discourse deal with the form rather than the content of the message. For instance, speakers mark focused discourse elements by intonation, quick hand movements (beat gestures), nods, eyebrow raises or combinations of these. Turn taking is associated with a number of nonverbal signals. Avoiding eye contact while speaking can indicate a reluctance to give away the turn whereas making eye contact can indicate a readiness to hand it over. Posture shifts, like leaning forward, are often used to indicate one wants to take the turn. Nonverbal signals like prosody and facial expressions can also mark the kind of speech act that is performed, like the typical prosody associated with questions, or back-channelling vocalisations to acknowledge what has been said. These signals have well-defined relations with the linguistic channel, since what they do is provide information about the verbal discourse. In many cases, the
GENERATING EMBODIED INFORMATION PRESENTATIONS
51
information they convey can also be expressed verbally, but those expressions are often considered as marked. Examples are the use of cleft sentences (‘It was John who…’) and explicit turn taking phrases (‘Do you have anything to add?’). 2.3 Showing speaker/hearer characteristics
Some signals do not convey information about the message, but about the person who is sending or receiving it. Face, gestures and voice may convey information about static characteristics such as personality, age, and gender, and about dynamic characteristics, holding at the moment of speaking: emotion (pleased, irritated), mental state (nervous, confused, paying attention or not), physical state (sick, sleepy, full of energy). This kind of information can be inferred from all kinds of nonverbal signals. In a few cases, the signals ‘stand on their own’ (most notably, facial expressions), but in most cases the information is read from the way other signals are performed, or from their frequency. For instance, a happy or extravert speaker will use broader gestures and more ‘open’ body postures; and if a speaker uses few gestures this can indicate tiredness or lack of enthusiasm, but also age or personality. For most of these signals there is no direct temporal relation with the language being produced, although some dynamic states may be triggered by the content or form of the message (e.g., sadness at bad news, or irritation with long-windd wording). Speaker characteristics that are commonly expressed by nonverbal signals can be described verbally as well, e.g., the speaker can tell the listener about his current emotions. In this case, the information is part of the message content that is expressed intentionally (and which may be inaccurate). Usually, however, nonverbal signals of this type are produced unintentionally, and they are typically those that people will try to conceal or feign in face-to-face conversations (Ekman, 1992). 2.4 Relevance for Embodied Conversational Agents
We have seen that the production of nonverbal signals is an inherent part of human communication. Since nonverbal signals produced by human speakers can help the receiver to understand and remember the information being presented (cf. Kendon, 1994), this can be expected to hold for signals produced by embodied conversational agents as well. On the other hand, it has also been argued that nonverbal signals are usually not attended to, and that when they are, they are distracting and harm the processing of the message. Nevertheless, we believe that to allow for more fluent and natural interactions, embodied agents should be able to produce the same kinds of nonverbal signals as human speakers. Even if they are not always helpful, such nonverbal signals will make the embodied conversational agent more believable and lifelike, i.e., more like a human. This expectation is confirmed by our experiments on gaze (Heylen et al., to appear). For an overview of other experimental results on the effect of nonverbal signals (and agent embodiment in general) on human-agent interaction, see Dehn and van Mulken (2000).
52
THEUNE, HEYLEN AND NIJHOLT 3. AN ARCHITECTURE FOR EMBODIED LANGUAGE GENERATION
In this section we look at the generation of language in combination with nonverbal signals for the presentation of information by an embodied agent. Most existing embodied conversational agents or virtual presenters produce language using canned utterances. However, if the information that has to be presented changes over time, or is highly variable depending on the user’s information needs or other contextual dynamics, this approach is not feasible, and some form of natural language generation is required. So far, research in natural language generation has been aimed primarily at unimodal information presentation, where some underlying message is expressed using natural language only, usually in the form of a written text. However, when the message is to be expressed by an embodied agent an additional modality becomes available in the form of nonverbal behaviour. This raises the question how language generation and the production of nonverbal signals should be combined. A common approach is to specify the agent’s verbal utterance first, and then add mark-up to indicate any accompanying nonverbal behaviour. In this approach, the language used by the agent is not adapted to its nonverbal actions. This may result in information presentations that are less concise than they could have been (since the nonverbal signals that are added can only reflect, but not complement or replace language), and also less natural (for instance, when a full description of an object is made superfluous by pointing). In section 2, we saw that most nonverbal signals have a strong relation with language. Therefore, our approach is to integrate the specification of nonverbal signals with language generation. We focus on those nonverbal signals that are most closely related to language, i.e., deictic and representational signals, which mostly take the form of hand and arm gestures. The specification of other signals in connection to language generation is also discussed, but less extensively. In addition, we take a simplified view on information presentation as purely a ‘monologue’ task, and ignore nonverbal signals that are specific to dialogues, such as those signalling turn taking or dialogue acts. In the following, we first sketch a global architecture for generating embodied information presentations. Then, we discuss in some detail how the production of various kinds of nonverbal signals interacts with related language generation tasks. 3.1 General architecture
A standard architecture for language generation systems is described in Reiter and Dale (2000), who decompose the generation process into the following stages: x
x
Document planning. This involves determining the content and the global structure of the message to be presented. The outcome is an abstract message specification. In dialogue systems the dialogue planner generally carries out these tasks, which are largely domain specific and language independent. Microplanning (or sentence planning). At this stage, the message specification is fleshed out further. This involves the generation of
GENERATING EMBODIED INFORMATION PRESENTATIONS
53
referring expressions, lexicalisation (word choice), and aggregation (grouping information into clauses and sentences). These tasks require both linguistic and domain knowledge. x Realisation. Here, the abstract message specification is converted into real text, using knowledge about syntax, morphology, etc. In addition, mark-up may be added for use by external components. We now discuss how the specification of nonverbal signals can be integrated in this general architecture, so that it can be used to generate texts that contain mark-up for nonverbal signals. In such an integrated architecture, the message specification created at the document planning (or rather, content determination) stage is the basis for specifying both verbal and nonverbal signals. The two language generation tasks that are most closely related to the production of nonverbal signals are lexicalisation and the generation of referring expressions, both of which are carried out at the microplanning level. The main aim of referring expression generation is building a description that distinguishes the intended object from its distractors, i.e., other objects that might be referred to. Typically, such descriptions can be much reduced if they are accompanied by a deictic gesture that rules out most, or even all, distractors. Lexicalisation is the task of choosing words for expressing message concepts, such as actions and object properties. Since many concepts can be expressed by a representational nonverbal signal, the specification of such nonverbal signals can be seen as part of the lexicalisation process. The third microplanning task, aggregation, is related to the generation of nonverbal signals in that it delimits the domain for gesture production: on average, human speakers produce one gesture per clause during information presentation (McNeill, 1992), and to create a natural impression, a virtual presenter should do the same. In the field of language generation, there is no consensus about the order of the three tasks involved in microplanning. Here, we make the following assumptions on this ordering. Referring expression generation comes before lexicalisation, since it may enrich the message specification with additional concepts (i.e., object properties) that must be lexicalised. For instance, to distinguish an object from its distractors it may be necessary to mention a certain property, e.g., its shape. This property will then have to be lexicalised, either verbally or by making an appropriate iconic gesture, or both. Aggregation is sometimes done before and sometimes after lexicalisation. As argued by Reiter and Dale (2000), it should come at least before referring expression generation, to avoid generating a description more often than necessary. For instance, the two messages X is on R and Y is on R can be aggregated to X and Y are on R, so that a reference to R needs to be generated only once. Given that referring expression generation should precede lexicalisation, this means that aggregation is the first microplanning task. Nonverbal signals that reflect discourse structure, including prosody, do not really interact with language generation; rather, they reflect the structure of the result. Such signals are added at the realisation level, after syntactic realisation has been finished. The selection, both in form and frequency, of these and the other types of nonverbal signals should be influenced by different factors, such as the personality type or emotion the agent should convey (see sections 2.3 and 3.5).
54
THEUNE, HEYLEN AND NIJHOLT
Ideally, such factors should also influence the language used by the system, a matter that is ignored in most language generation work (some exceptions are Hovy, 1988, de Rosis and Grasso, 2000, and, for embodied agents, André and Rist, 2000). The output from embodied language generation is annotated text, specifying the nonverbal behaviour that has to be produced in parallel with certain words in the text. Several (XML-based) annotation languages have been proposed for this; e.g., Arafa et al., 2002, DeCarolis et al., 2002, Kranstedt et al., 2002. The annotated text is sent to external speech synthesis and animation modules. Speech synthesis converts the text into a speech signal, taking the prosodic markers into account. Animation takes care of producing the visual nonverbal signals, which have to be synchronised with respect to the pronunciation of the corresponding words, based on timing information from speech synthesis. For descriptions of how this can be done, see Pelachaud et al. (1996) and Cassell et al. (2001b). The animation component also has to determine the actual shape of the signals to be produced, taking relevant speaker characteristics into account (cf. Badler et al., 2002). Figure 1 shows the general architecture for embodied language generation proposed here. We assume that it is a pipeline, i.e., there is no backtracking between modules. This means that nonverbal signals, once they have been added, cannot be removed, and that subsequent generation stages can only add nonverbal signals that do not conflict with those already specified. In other words, deictic signals take precedence over representational ones, and the latter over signals that reflect discourse structure. (We are aware that this is a simplification.) Figure 1 also shows the knowledge sources needed: discourse, domain, speaker, and user models, as well as world knowledge. At least some of these models must be dynamically updated during generation. In the following sections we see how such knowledge is used during the intertwined specification of language and nonverbal signals. 3.2 Deictic signals and the generation of referring expressions
References to objects in the message specification are potential opportunities for deictic nonverbal signals by the agent. A first selection from these candidates will only retain references to objects that are visible to the user. 1Checking this condition is not always trivial; for instance, in applications where the user can move around in a 3D virtual world, as described in section 4, it will require some computation using data from the domain model (listing the positions of virtual objects) and the user model (listing at least the user’s current position and orientation). Since human speakers use deictic gestures and other nonverbal signals mainly to convey new information, another condition is that the object should be new, i.e., not previously mentioned, inferable, or otherwise familiar to the user. We also assume that deictic signals are appropriate for objects that are mentioned in a contrastive context (for ways to detect contrast using a discourse model, see Prevost, 1995, Theune, 2002). In short, first or contrastive references to visible objects will be accompanied by a deictic nonverbal signal. We now discuss how this influences the creation of a verbal reference to the object.
GENERATING EMBODIED INFORMATION PRESENTATIONS
55
Figure 1. Architecture for combined generation of language and nonverbal signals.
Constructing a verbal description of an object, with the aim of allowing the user to uniquely identify it, involves selecting those properties of the object that distinguish it from other potential referents, the distractors. Having the agent point at a virtual object will generally rule out several of its distractors, namely those that are positioned outside the ‘range’ of the pointing finger. (We assume that pointing with the hand or finger is the preferred form of a deictic gesture.) When pointing is quite exact, for instance when the object is nearly touched, or when there are no distractors close to it, all distractors may be ruled out by the pointing gesture, so that no further verbal description is required. In such a situation, generating a syntactic placeholder such as that or this one may be sufficient. In most cases, however, a pointing gesture by the agent will not rule out all distractors. As the distance from the agent to the virtual object increases, pointing becomes less precise (more or less in the fashion of a flashlight) and will rule out fewer distractors. In the case of deictic head movements or even only gaze, which must be reverted to if the agent’s hands are occupied by some task, pointing is always very inaccurate, indicating only a general direction. So, depending on the distance from the agent to the intended referent, and the kind of deictic signal being produced, the system must determine which distractors are ruled out. The remaining distractors will have to be ruled out by the verbal description. The result of referring expression generation is a specification of the content of the verbal expression and
56
THEUNE, HEYLEN AND NIJHOLT
of the deictic signal that should accompany it, if any. This information is added to the message specification, which is passed to lexicalisation. 3.3 Representational signals and lexicalisation
Not all concepts are equally suitable for expression using a representational nonverbal signal. Semantic features that can be easily visualized are the manner and direction of actions, and the shape, size and (relative) location of objects (Kendon, 1980, McNeill, 1992). When generating embodied information presentations, a simple way of checking whether a specific message concept can be expressed using a nonverbal signal is to use a ‘gesture dictionary’ or database linking concepts to nonverbal signals (Cassell et al., 2001b, Sowa et al., 2001). Such a dictionary may be based on a domain specific inventory of nonverbal signals that have been actually produced by human speakers. Nonverbal signals are only retrieved from the dictionary if they are compatible with nonverbal behaviours that were selected during the preceding generation task (i.e., deictic signals). For instance, the agent cannot produce an iconic gesture representing some property of an object if in the preceding stage it was decided that it should be pointing at that object. Similarly, representational head movements (e.g., a head wiggle to indicate dubiousness) are incompatible with deictic head movements. If multiple nonverbal signals are available to express a concept, a probabilistic choice can be made between them, based on their relative frequency in the corpus that was used to create the dictionary of representational signals. Having identified those concepts in the message specification that the agent can express using a representational nonverbal signal, some of these must be selected. Simply grasping all opportunities for generating a nonverbal signal is not an option, as it is likely to produce an unnatural effect, e.g., giving the impression that the agent is “talking to a foreigner” (Cassell and Prevost, 1996). Like human speakers, in a neutral context the agent should not produce more than one or two gestures2 per clause; a target that may be raised or lowered based on information from the speaker and user models. For instance, to convey an introvert personality, the agent should produce fewer nonverbal signals than average, whereas if the user actually is a foreigner, more nonverbal signals may be appropriate. As with deictic signals, the selection of representational signals may be based on newness and contrast. Additional selection criteria may reflect the human tendency to express surprising or uncommon object features using a gesture, as well as information that is ‘relevant to the primary communicative goal’ (Yan, 2000). For cases where too many candidates meet these criteria, we need a ranking in the form of a (domain specific) information hierarchy, where the highest level is occupied by information that is most likely to be expressed using a nonverbal signal, e.g., new and highly relevant information, and where the lowest level is occupied by, e.g., discourse-old information. Based on the current target number of nonverbal expressions to be generated (see above), a selection can be made from the candidates, starting with those on the highest levels.
GENERATING EMBODIED INFORMATION PRESENTATIONS
57
The next decision to be made is which concepts the agent should express only nonverbally, and which ones it should also express using speech, leading to a ‘redundant’ nonverbal signal (Cassell and Prevost, 1996). Since nonverbal signals run a higher risk of being missed (i.e., overlooked) by the user than speech, it seems that the choice between redundant and non-redundant signals should be based on the deemed importance of the information to be expressed. In terms of the information hierarchy mentioned above, only items on the lower levels should be expressed nonredundantly. Other important factors are economy and ease of expression. Some concepts are more easily expressed nonverbally than verbally. For instance, complex motions or shapes can be relatively difficult or inefficient to express using a verbal description. This makes the use of speech less attractive, especially when the presentation is bound to certain time limits. Regardless of the information hierarchy, a non-redundant visual signal will be preferred in such cases. In the choice between redundancy and non-redundancy, the kind of nonverbal signal being produced also plays a role. Emblems that are used non-redundantly seem to run little risk of being overlooked, because they actually replace the words they correspond with: speech is temporarily interrupted when a non-redundant emblem is produced, drawing the hearer’s attention to the emblem. An example is saying She is really . Here, the choice for non-redundancy seems to be a matter of style rather than efficiency or effectiveness. On the other hand, other non-redundant representational signals are usually produced during the verbal description of the action or object of which they express an attribute. In this case, speech is not interrupted, so there is a risk of overlook, especially for subtle behaviours such as head or eye movements. 3.4 Nonverbal signals reflecting discourse structure
Nonverbal signals that reflect discourse structure take the form of both prosodic and visual nonverbal signals. Since these signals do not only mark informational but also syntactic structure, they are specified during the realisation stage, after syntactic realisation has taken place. Based on syntactic information,3 some phrase boundaries can be marked prosodically by specific intonation contours or pauses of different lengths (cf. Theune et al., 2001), and visually by head motions, gaze direction, blinks or eyebrow movements (cf. Pelachaud et al., 1996). Similar visual signals, as well as beat gestures and pitch accents, can be used to mark words expressing new or contrastive information, and posture shifts may be inserted to mark topic changes (cf. Cassell et al., 2001a). As before, appropriate signals can only be selected if they are not in conflict with signals that were previously specified. In other words, we assume that both deictic and representational signals take precedence over signals that reflect discourse structure. This is similar to the rule of gesture class priority used in the BEAT system (Cassell et al., 2001b). Since deictic and representational signals most often take the form of gestures, in our proposed architecture discourse structure is somewhat more likely to be reflected by prosody and facial actions than by beat gestures, which run a higher chance of being incompatible with nonverbal
58
THEUNE, HEYLEN AND NIJHOLT
signals specified during microplanning. 4Rules for coordinating facial motions with intonation have been proposed by Pelachaud et al. (1996). Finally, the influence of speaker characteristics must also be taken into account at this level. For instance, to create an impression of extraversion or enthusiasm, the agent should produce more beat gestures and pitch accents than average, and to convey introversion it should look away more and move less. 3.5 Factors influencing selection preferences and production rate
We have already discussed that the type and frequency of the nonverbal signals produced by the agent can be varied depending on which speaker characteristics should be conveyed, such as personality and emotion. These characteristics also influence the actual shape of the specified nonverbal signals as determined by the speech synthesis and animation modules. In addition, there are several other contextual factors that should be taken into account in the selection of nonverbal signals. Among these are user characteristics such as age, nationality, and level of expertise in the application domain. For instance, the use of more, less subtle nonverbal signals may be appropriate when the agent is speaking to a child as opposed to an adult, to a foreigner as opposed to a native speaker, and to an inexperienced user as opposed to an experienced one. In addition, there may be an influence of the application domain. In some domains, getting the message across correctly may be more important than in others; for example, in an educational setting production rates may be set higher than in social talk, and non-redundant signals may be avoided to ensure that the student does not miss any information. Within domains, message complexity may have to be taken into account. As shown by Cohen (1977) for the presentation of route descriptions, the gesture rate of human speakers increases with the complexity of the message being presented. Presumably, this also holds for other kinds of nonverbal signals and in other domains, and it may be sensible to have the agent copy such behaviour. 3.6 Discussion
In the preceding sections, we have taken a standard language generation architecture (Reiter and Dale, 2000) as the starting point for describing an ‘ideal’ system for the generation of embodied information presentations. However, several language generation systems do not adhere to this standard architecture, or do not perform all of the distinguished tasks (see Cahill et al., 1999). Also, in practice many generation tasks are carried out in an ad-hoc rather than a theoretically motivated manner. Examples are directly mapping concepts to standard lexicalisations instead of ‘real’ word choice, and the use of templates for syntactic realisation. Similarly, in the architecture sketched here all possible kinds of nonverbal signals are specified in a principled manner, but in practical systems, achieving this will be nearly impossible, and often unnecessary. Also, the model for embodied language generation presented here is necessarily simplified. Clearly, the interactions between language generation and the production of nonverbal signals are more intricate than
GENERATING EMBODIED INFORMATION PRESENTATIONS
59
has been presented here. Finally, much of what has been presented rests on assumption: there are many things about the production of nonverbal signals (and of language, for that matter) that we do not know yet. To extend our knowledge, careful study of human speaker behaviour will be required, as well as evaluation of implemented models. User experiments should be conducted to test the naturalness and effectiveness of different types of generated information presentations, as well as their effect on the user’s perception of the agent’s personality and other speaker characteristics. 4. EXAMPLE: THE PRESENTATION OF ROUTE DESCRIPTIONS In this section, we discuss the presentation of route descriptions as an example application of embodied language generation. Human speakers presenting a route description make prominent use of both deictic and representational signals to indicate landmarks and directions. These signals mainly take the form of broad hand and arm gestures; facial expressions play a relatively minor role. The ANGELICA5 project aims at developing an embodied agent that presents route descriptions in the Virtual Music Centre (VMC, Figure 2). The VMC is a 3D virtual building with halls, corridors and different floors (Nijholt and Heylen, 2002). Since visitors to such environments often experience navigation problems (Nijholt et al., to appear), the VMC is a natural environment for the presentation of route descriptions. In previous research we developed a navigation agent that determines the user’s intended destination within the VMC by means of a spoken dialogue, and then computes the shortest route to this destination (van Luin et al., 2001). This agent is not yet embodied, and cannot give a verbal route description: it either presents the computed route on a map or moves the user through the VMC to the target location. In the ANGELICA project we will extend the navigation agent with a component for embodied language generation as sketched in the previous section. In the following, we describe how the embodied guide would generate a simple verbal and nonverbal route description, illustrating the proposed architecture.
Figure 2. Outside view of the VMC.
60
THEUNE, HEYLEN AND NIJHOLT
4.1 Content determination
The target location of the route description is determined in a spoken dialogue between the user and the embodied guide. This involves the generation of several dialogue utterances, accompanied by nonverbal signals. Several of these will be dialogue-specific, e.g., turn taking signals and signals that reflect different speech acts, such as confirmation and verification. Since our focus is on the presentation of the route description, we do not describe this stage of the interaction (but see Heylen et al. (to appear) for our work on gaze as a turn taking signal in spoken dialogues with an embodied agent). Here, we only assume that, after a number of turns, the target location is established as being the balcony of the VMC. The shortest route to this location is computed, and returned in the form of a vector of 3D coordinates.
ROUTE:[ action:[type:walk, direction:up, object:s2], action:[type:turn, direction:sharp_right], action:[type:walk, direction:through, object:d3]] Figure 3. Example route specification
This vector is given as input to content determination, which turns it into a message specification as shown in Figure 3, which describes the route in terms of the actions the user has to carry out to reach the intended destination, for example walking towards some landmark or turning in some direction. For brevity, ‘obvious’ locations and actions, such as the start and end point of the route, have been left out. Creating such a route specification requires a nontrivial amount of domain and world knowledge, which is used to map the vector of 3D coordinates to actions, landmarks and spatial concepts. We do not go into that here. 4.2. Microplanning
The first task performed during microplanning is aggregation. Here, it is determined that the actions specified in our example message can be expressed in separate clauses, which are combined into one sentence. Within this sentence, the performer of each action (the user, left implicit in the message) remains the same, and therefore only needs to be mentioned in the first clause. In the following, we discuss the generation of referring expressions (including deictic signals) and lexicalisation (including representational signals) for each clause/action in turn.
GENERATING EMBODIED INFORMATION PRESENTATIONS
61
Expressing the first action involves referring to the performer (the user) and the object of the action (s2, one of the two stairs in the VMC). For referring to the user, a simple pronoun (‘you’) is sufficient. This expression is not accompanied by a nonverbal signal, as the information being expressed is at the very bottom of the information hierarchy from section 3.3. The second reference is to the stairway, s2. This entity is new to the discourse, essential for the route description, and in the user’s current line of sight. A deictic hand gesture is therefore selected. To make clear that the agent refers to the stairs as a whole, and not to one of its steps or the handrail, the referring expression algorithm specifies that the object’s type property must be added to the verbal part of the expression. In combination with the gesture, this is sufficient to uniquely identify object s2. (If pointing had not been possible, for instance because the user could not see s2, then a more elaborate verbal description distinguishing it from the other stairway in the VMC would have been required.) Now that the content of the referring expressions is known, lexicalisation starts. One candidate for nonverbal expression is s2’s property of being a stairway, which can be easily expressed using, for instance, an iconic gesture representing the shape of the steps (assuming such a gesture is present in the ‘gesture dictionary’). However, a deictic gesture has already been selected for referring to the stairs, leaving no room for any additional hand movements by the agent. Therefore, the type property can only be lexicalised verbally. Another candidate for nonverbal expression is the upward direction of the action. This information is both new and essential for the route to be taken, so an upward moving gesture is selected. The action type, walking, might be expressed by a ‘walking finger’ gesture (cf. Cassell et al., 2000a). However, this information is low in the information hierarchy, since there is no other way to move around in the VMC. Therefore no nonverbal expression is selected to express this aspect of the action, and on the verbal side the general verb ‘go’ is preferred over the more specific, but superfluous, ‘walk’. All in all, two gestures are selected for the first clause: pointing upwards and at the stairs. Describing the second action does not involve references to objects, and thus no deictic signals. During lexicalisation, it is found that the concept of turning in a specific direction can be expressed nonverbally. Since this concept is new and important, a gesture is selected where the hand and arm demonstrate a sharp right turn. On the verbal side, the phrase ‘turn sharply to the right’ is selected. Here, the adverbial ‘sharply’ might have been left out, as it is grammatically optional and the sharpness of the turn is already illustrated by the gesture. However, this aspect is essential for the route, since the user can also make a non-sharp right turn at the top of the stairs. Only using a non-redundant gesture carries the risk of this information being overlooked and the user taking a wrong turn, so an additional verbal expression is preferred. The third action involves d3, one of the doors in the VMC. Because d3 is invisible from the user’s current location a deictic signal is deemed inappropriate, and a distinguishing verbal description must be created. The VMC has several doors, but only one of them (d4) is near to d3 and therefore counts as a distractor. To distinguish d3 from d4, describing it as ‘the first door’ suffices. We assume that the door’s attribute of being first, or nearest, cannot be expressed nonverbally.6 The concept of (walking through) a door may be expressed by a nonverbal signal, e.g., a
62
THEUNE, HEYLEN AND NIJHOLT
pantomimic gesture. However, walking through is an obvious action in connection with a door, and so we assume that no such gesture will be selected and that this information is only expressed verbally. 4.3. Realisation
The final generation stage is realisation. For the first clause, syntactic realisation produces the sentence ‘You go up the stairs’. Then, nonverbal signals reflecting discourse structure are added. The clause is too short for pauses, but the end of the clause is marked prosodically by rising intonation, indicating that the sentence is not yet at its end. The words ‘up’ and ‘stairs’ convey new information, so these receive a pitch accent. They cannot be accompanied by beat gestures, since the agent’s hands are already involved with gestures specified during microplanning: pointing upwards and at the stairs. Available alternatives to mark the focused information are blinking and raising the eyebrows (cf. Pelachaud et al., 1996). The second clause is syntactically realised as ‘turn sharply to the right’, where ‘sharply’ and ‘right’ are accented and accompanied by blinks and eyebrow movements. Again, there is no room left for beat gestures. The final clause is realised as ‘and go through the first door’. Here, ‘first’ is accented and accompanied by a beat gesture, since this word expresses information that is both new and contrastive (distinguishing d3 from d4). The word ‘door’, on the other hand, is not accented, since this property is shared by d3 and d4 and thus not contrastive. The end of the clause is marked by a falling intonation contour, since it is sentence-final. The resulting route description, with simplified mark-up, looks as follows: you go UP the STAIRSturn SHARPLY to the RIGHT and go through the FIRST door
Here, accented words are given in small capital letters, and phrase boundaries of different strengths are indicated by a number of slashes. Of the visual signals, only gestures are indicated using greatly simplified XML style mark-up. The embodied presentation of the first clause is illustrated in Figure 4. 4.4. Remaining issues
It is clear that generating an embodied route description does not only require rules for the generation of verbal and nonverbal signals, but also involves a good deal of reasoning on the basis of (spatial) domain knowledge and general knowledge. The questions that need to be resolved include, which information is more or less essential for describing the route, how can ‘virtual objects’ and their properties be linked to coordinates in the 3D world, which objects are currently visible to the user (or will be from certain points along the route), and which objects are within the range of a deictic signal. In addition, it is not clear how to react to user
GENERATING EMBODIED INFORMATION PRESENTATIONS
63
interruptions or movements, during information presentation. For instance, what if the user moves away from the virtual guide, perhaps to look where she is pointing?
Figure 4. ‘You go UP the STAIRS’
Other remaining problems are at the level of speech synthesis and, in particular, animation. One of these is timing: the generated nonverbal signals have to be synchronized with the pronunciation of the corresponding words. For signals that correspond to phrases of varying lengths, this might require speeding up or slowing down their execution, depending on speech times. If we assume that the guide does not have a fixed location, this means that deictic signals must be animated in real time, based on the current position of the guide relative to the object being indicated. This involves selecting from all possible joint rotations and translations those that will result in a natural looking movement. Another issue is that of realistically blending similar signals (e.g., different gestures or facial expressions) that are performed in close sequence. An example is combining the upwards and deictic gestures from the first clause of our route description. Nonverbal facial signals will also have to be combined with facial movements that are directly related to physical speech production (visemes). Finally, in determining the actual shape of the nonverbal signals, both in animation and in speech synthesis, any relevant speaker characteristics should be taken into account. For many of the research issues listed above, practical solutions are still lacking, so that at the implementation level several of these problems will have to be sidestepped or handled in an ad-hoc manner. 5. RELATED WORK So far, there has been relatively little work on combining language generation with the production of nonverbal signals by embodied conversational agents. Notable exceptions are the work by Cassell et al. (2000ab) and Pelachaud et al. (2002). Cassell et al. focus on the division of labour between speech and
64
THEUNE, HEYLEN AND NIJHOLT
representational gestures, guided by information about the discourse context and the communicative function of the utterance to be generated. Their work is applied in REA, a virtual real estate agent that presents embodied descriptions of houses. In addition to representational signals, REA also generates nonverbal signals that are related to discourse structure. These include posture shifts (Cassell et al., 2001a), gaze, and beat gestures. Pelachaud et al. (2002) developed the Greta agent, a talking face that presents medical diagnoses to patients. Greta can display nonverbal signals that reflect discourse structure and the agent’s cognitive and emotional state, as well as signals that express message content. There is no interaction between language generation and nonverbal signal production, however. Other research in this area is that of André and Rist (2000), who have worked on language generation for virtual presenters, focusing on the problem of projecting different agent personalities. The NECA project (Krenn et al., 2002) builds further on this work, developing components for reasoning about agents’ emotions and generating combined (emotional) speech and nonverbal signals. Work on discourse-related prosody combined with language generation includes that of Williams (1999) (in the route description domain), McKeown and Pan (2000) and Theune et al. (2001). Schmauks (1987), Claassen (1992), Reithinger (1992), and Lester et al. (1999) address the generation of multimodal referring expressions. Their approaches are not aimed at animated agents, with the exception of Lester et al., who combine language generation with the production of pointing gestures by a pedagogical agent. All mentioned approaches are limited in that pointing is assumed to be exact, and hardly influences property selection for the verbal description. Krahmer and van der Sluis (2003) propose a new model for generating multimodal references that improves on this by using various degrees of pointing precision, and relating the selection of distinguishing properties to the kind of pointing gesture being generated. The presentation of route descriptions by an embodied agent is addressed in the REAL project, where an embodied agent shows users the way through a virtual environment (Baus et al., 2000). However, this agent has no language generation capabilities except for limited object references. In the MACK system (Cassell et al., 2002), route descriptions are presented by an embodied conversational agent that is projected in a real-world environment. This agent does not have language generation capabilities, but uses canned text. It produces nonverbal signals using a probabilistic strategy, based on data from route descriptions by human speakers. 6. CONCLUSION Over the years there has been a steady increase of our knowledge of human verbal and nonverbal communication, and how it can be simulated by embodied conversational agents. Nevertheless, the actual development of applications in which all kinds of contextually appropriate nonverbal signals are generated in a principled fashion is still some steps away. In this chapter we have sketched a general architecture for combining language generation with the specification of nonverbal signals for information presentation by embodied conversational agents, and illustrated how it can be used in a specific application. The model for embodied
GENERATING EMBODIED INFORMATION PRESENTATIONS
65
language generation we have presented here is necessarily simplified, and much of it rests on assumption. To extend our knowledge, further studies of spontaneous human speaker behaviour are required, as well as controlled experiments and evaluations of implemented systems. Our ultimate goal is to allow for human-agent interactions that are equally natural, effective, and entertaining as face-to-face interactions between humans. 7. NOTES 1 Actually, this constraint is too strong, because human speakers also use deictic references to objects that are not visible, but occupy an imaginary position projected in the space before the speaker (McNeill, 1992). However, for simplicity we will ignore this here. 2 Most representational signals take the form of gestures; we have no figures available on the average number of other nonverbal behaviours produced by human speakers. 3 Theune et al. (2001) determine prosody on the basis of syntactic structure. On the other hand, Steedman (2000) points out that syntactic and prosodic structure may diverge. 4 We ignore the possibility of overlaying beats over other gestures, as observed by McNeill (1992). 5 A Natural-language Generator for Embodied, Lifelike Conversational Agents. This project is partially funded by NWO, the Dutch Organization for Scientific Research. 6 A Natural-language Generator for Embodied, Lifelike Conversational Agents. This project is partially funded by NWO, the Dutch Organization for Scientific Research.
8. REFERENCES André E. and Rist T. (2000). Presenting through performing: on the use of multiple lifelike characters in knowledge-based presentation systems. Proceedings of the Second International Conference on Intelligent User Interfaces, New Orleans, USA, 1-8. Arafa Y., Kamyab K., Mamdani E., S. Kshirsagar, Magnenat-Thalmann N., Guye-Vuilleme A. and Thalmann D. (2002). Two approaches to scripting character animation. Proceedings of ECA's: Let's specify and evaluate them! Workshop held in conjunction with AAMAS 2002. Badler N., Allbeck J., Zhao L. and Byun M. (2002). Representing and parameterizing agent behaviors. Proceedings of Computer Animation, IEEE Computer Society, Geneva, Switzerland, 133-143. Baron-Cohen S. (1995). Mindblindness. MIT Press. Baus J., Butz A. and Krüger A. (2000). Incorporating a virtual presenter in a resource adaptive navigational help system. Proceedings of the Workshop on Guiding Users through Interactive Experiences, Paderborn, Germany. Breazeal C.L. (2002). Designing Sociable Robots. MIT Press. Cahill L., Doran C., Evans R., Mellish C., Paiva D., Reape M., Scott D. and Tipper N. (1999). In search of a reference architecture for NLG systems. Proceedings of the 7th European Workshop on Natural Language Generation (EWNLG'99), Toulouse, France, 77-85. Cassell J. and Prevost S. (1996). Distribution of semantic features across speech and gesture by humans and computers. Proceedings of the Workshop on the Integration of Gesture in Language and Speech, Newark, USA, 253-270. Cassell J., Bickmore T., Campbell L., Vilhjálmsson H. and H. Yan (2000a). Conversation as a system framework: Designing embodied conversational agents. In J. Cassell, S. Prevost, E. Churchill, and J. Sullivan (eds.), Embodied Conversational Agents. MIT Press, 29-63. Cassell J., Stone M. and Yan H. (2000b). Coordination and context-dependence in the generation of embodied conversation. Proceedings of the First International Conference on Natural Language Generation, Mitzpe Ramon, Israel, 171-178. Cassell J., Nakano Y., Bickmore T., Sidner C. and Rich C. (2001a). Non-verbal cues for discourse structure. Proceedings of the 41st Annual Meeting of the Association of Computational Linguistics (ACL 2001), Toulouse, France, 106-115.
66
THEUNE, HEYLEN AND NIJHOLT
Cassell J., Viljhálmsson H. and Bickmore T. (2001b). BEAT: the Behavior Expression Animation Toolkit.Proceedings of the 28th Int. Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 2001), Los Angeles, USA, 477-486. Cassell J., Stocky T., Bickmore T., Gao Y., Nakano Y., Ryokai K., Tversky D., Vaucelle C. and Vilhjálmsson H. (2002). MACK: Media lab Autonomous Conversational Kiosk. Proceedings of Imagina '02, Monte Carlo, Monaco. Chovil N. (1991/1992). Discourse-oriented facial displays in conversation. Research on Language and Social Interaction 25, 163-194. Claassen W. (1992). Generating referring expressions in a multimodal environment. In R. Dale, E. Hovy, D. Rösner and O. Stock (eds.), Aspects of Automated Natural language Generation. Springer Verlag, Berlin, 247-262. Cohen A. (1977). The communicative functions of hand illustrators. Journal of Communication 27, 5463. Cohen A. (1980). The use of hand illustrators in direction-giving situations. In W. von Raffler-Engel (ed.), Aspects of Nonverbal Communication, Swets and Zeitlinger BV, Lisse, 265-273. DeCarolis B., Carofiglio V., Bilvi M. and Pelachaud C.. APML, a mark-up language for believable behavior generation. Proceedings of ECA's: Let's specify and evaluate them! Workshop held in conjunction with AAMAS 2002, Bologna, Italy. Dehn D. and van Mulken S. (2000). The impact of animated interface research: a review of empirical research. International Journal of Human-Computer Studies 52(1), 1-22. Ekman P. (1979). About brows. In M. von Cranach, K. Foppa, W. Lepenies and D. Ploog (eds.) Human Ethology, Cambridge University Press, 169-202. Ekman P. (1992). Telling Lies. W.W. Norton and Company. Heylen D., van Es I., Nijholt A. and van Dijk B. Controlling the gaze of a conversational agent. In J. van Kuppevelt, L. Dybkjaer and O. Bernsen, Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems. Kluwer Academic Publishers. Hovy E. (1988). Generating Natural Language Under Pragmatic Constraints. Lawrence Erlbaum, Hillsdale, New Jersey. Kendon A. (1980). Gesticulation and speech. In M.R. Key (ed), The Relationship of Verbal and Nonverbal Communication. Mouton, Den Haag, 207-227. Kendon A. (1994). Do gestures communicate? A review. Research on Language and Social Interaction 27(3), 175-200. Krahmer E. and van der Sluis I. (2003). A new model for the generation of multimodal referring expressions. Proceedings of the 9th European Workshop on Natural Language Generation, held in conjunction with EACL2003, Budapest, Hungary, 47-54. Kranstedt A., Kopp S. and Wachsmuth I. (2002). MURML: a Multimodal Utterance Representation Markup Language for conversational agents. Proceedings of ECA's: Let's specify and evaluate them! Workshop held in conjunction with AAMAS 2002, Bologna, Italy. Krauss R.M., Morrel-Samuels P. and Colasante C. (1991). Do conversational hand gestures communicate? Journal of Personality and Social Psychology 28, 389-450. Kraut R.E. and Johnston R.E. (1979). Social and emotional messages of smiling: an ethological approach. Journal of Personality and Social Psychology 37, 1539-1553. Krenn B., Pirker H., Grice M., Baumann S., Piwek P., van Deemter K., Schroeder M., Klesen M. and Gstrein E. (2002). Generation of multimodal dialogue for net environments. In S. Busemann (ed.), Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2002), Saarbrücken, Germany, 91-98. Lester J., Voerman J., Towns S. and Callaway C. (1999). Deictic believability: coordinating gesture, locomotion, and speech in lifelike pedagogical agents. Applied Artificial Intelligence 13 (4-5), 383414. van Luin J., Nijholt A. and op den Akker R. (2001). Natural language navigation support in virtual reality. Proceedings of the International Conference on Augmented, Virtual Environments and Three-dimensional Imaging (ICAV3D). Mykonos, Greece, 263-266. McKeown K. and Pan S. (2000) Prosody modeling in concept-to-speech generation: methodological issues, Philosophical Transactions of the Royal Society 358, 1419-1431. McNeill D. (1992). Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, Chicago. Nijholt A. and Heylen D. (2002). Multimodal communication in inhabited virtual environments. International Journal of Speech Technology 5, 343-354.
GENERATING EMBODIED INFORMATION PRESENTATIONS
67
Nijholt A., J. Zwiers and B. van Dijk (to appear). Maps, agents and dialogue for exploring a virtual world. In J. Aguilar, N. Callaos and E.L. Leiss (eds.), Web Computing. International Institute of Informatics and Systemics (IIIS). C. Pelachaud, N. Badler and M. Steedman (1996). Generating facial expressions for speech. Cognitive Science 20, 1-46. C Pelachaud, V. Carofiglio, B. De Carolis, F. de Rosis and I. Poggi (2002). Embodied contextual agent in information delivering application. Proceedings of the First International Joint Conference on Autonomous Agents & Multi-Agent Systems (AAMAS'02), Bologna, Italy, 758-765. Poggi I., Pelachaud C., and DeRosis F. (2000). Eye communication in a conversational 3D synthetic agent. AI communications 13(3), 169-182. Prevost S. (1995). A Semantics of Contrast and Information Structure for Specifying Intonation in Spoken Language Generation. PhD thesis, University of Pennsylvania. Reiter E. and Dale R. (2000). Building applied natural language generation systems. Cambridge University Press, Cambridge. Reithinger N. (1992). The performance of an incremental generation component for multimodal dialog contributions. In R. Dale, E. Hovy, D. Rösner and O. Stock (eds.), Aspects of Automated Natural language Generation. Springer Verlag, Berlin, 263-276. Rickel J. and Johnson W. L. (2000). Task-oriented collaboration with embodied agents in virtual worlds. In J. Cassell, S. Prevost, E. Churchill and J. Sullivan (eds.), Embodied Conversational Agents. MIT Press, 95-122. Rimé B. and Schiaratura L. (1991). Gesture and speech. In R. Feldman and B. Rimé (eds.), Fundamentals of Nonverbal Behavior. Cambridge University Press, Cambridge, 239-281. de Rosis F. and Grasso F. (2000). Affective natural language generation. In A.M. Paiva (Ed.), Affective Interactions. Springer Lecture Notes in AI 1814, 204–218. Schmauks D. (1987). Natural and simulated pointing. Proceedings of the 3rd Conference of the European Chapter of the Association for Computational Linguistics (EACL'87), 79-185. Sowa T., Kopp S., Latoschik M. E. (2001). A communicative mediator in a virtual environment: processing of multimodal input and output. Proceedings of the International Workshop on Information Presentation and Natural Multimodal Dialogue ( IPNMD-2001),Verona, Italy, 71-74. Steedman M. (2000). Information structure and the syntax-phonology interface. Linguistic Inquiry 31(4), 649-689. Takeuchi A. and Nagao K. (1993). Communicative facial displays as a new conversational modality, InterCHI' 93, Amsterdam, The Netherlands, 187-193. Theune M. (2002). Contrast in concept-to-speech generation. Computer Speech and Language 16(3/4), 491-531. Theune M., Klabbers E., de Pijper J.R., Odijk J. and Krahmer E. (2001). From data to speech: a general approach. Natural Language Engineering 7(1), 47-86. Wahlster W. (2002). SmartKom: fusion and fission of speech, gestures, and facial expressions. Proceedings of the First International Workshop on Man-Machine Symbiotic Systems, Kyoto, Japan, 213-225. Williams S. and Watson C. (1999). A profile of the discourse and intonational structures of route descriptions. Sixth European Conference on Speech Communication and Technology (Eurospeech'99), Budapest, Hungary, 1659-1662. Yan H. (2000). Paired Speech and Gesture Generation in Embodied Conversational Agents. Master’s thesis, Media Lab, MIT.
9. AFFILIATIONS M. Theune D. Heylen A. Nijholt University of Twente Twente, The Netherlands
MOBILE PRESENTATIONS The last few years have seen an increasing interest in the development of applications for mobile systems. The mobile user setting poses new challenges for information presentations especially because of the of constraints derived by the small display dimension and by the conditions of usage, (while walking or driving, with bad light conditions and so on). Technology on the market nowadays, in particular third generation cellular phones and the new hybrid devices that mix cellular phones and PDA, allows the users to be connected on the Internet and to access a wide array of services. Yet there is still a long way to go before these appliances may be effectively used in our everyday activities. Usability experts are working on guidelines and interface principles to build interfaces specifically appropriate for the small dimension of the display or, for instance, that improve the efficiency in browsing the web while walking. The approach taken by the authors of this cluster is different, although not in contrast with the effort toward usability. The underling principle is that intelligent adaptation of content may improve the user experience with mobile devices. The first chapter of this cluster, by Baus Krueger and Stahl argue that resourceadaptiveness is a crucial topic to increase the efficacy of a mobile navigation system. They propose that adaptation has to be achieved on three different levels of the navigation task, the wayfinding level, the presentation planning level and the presentation rendering level. Stock, Zancanaro and Not discuss the opportunities and challenges offered by the area of cultural heritage appreciation for innovative, natural–language centred applications. They introduce the PEACH project, aimed at exploring various technologies for enhancing the visitors' experience during their actual visit to a museum. Finally Rist discusses the problem the overwhelming variety of new portable computing and communication devices. He argues that this add a new level of complexity to the design of usable information services since different users may not be equipped equally in terms of output and input capabilities. With a special focus on graphical representations, his contribution sketches several application scenarios taken from projects conducted at DFKI and discusses the technical approaches to accomplishing adaptation tasks that result from device restrictions.
J. BAUS, A. KRÜGER AND C. STAHL
RESOURCE-ADAPTIVE PERSONAL NAVIGATION
1. INTRODUCTION While car navigation systems are already widely commercialized today, in the near future not only motorists will be guided by a navigation assistance system. Recent progress in mobile computing has provided us with the means to realize pedestrian navigation systems. Personal navigation systems will provide location sensitive information, which can be accessed at any place and any time. Since several different means of transportation typically have to be combined in order to reach a destination, it must be ensured that the user interface reacts to the user's alternating situation. Changing the means of transportation implies that a personal ubiquitous navigation service must adapt smoothly to the associated situational context. The necessary change between different positioning technologies such as GPS or GSM/UMTS radio-cell based technologies in an outdoor scenario to the use of infrared and Bluetooth technologies in buildings should remain unnoticed to the user of such a navigation system. Furthermore, the system should adapt the presentation to different output devices and modalities and to the varying accuracy of positional information according to the technical resources available in the current situation. In addition, cognitive limitations of the user put constraints on the system’s presentation and interaction possibilities. In this article we will discuss the major concepts that help to design and implement navigation systems that are sensitive to technical limitations on one hand and to cognitive constraints of the user on the other hand. We will start by explaining our view on resource-adapted navigation and discussing the various technical and cognitive resources that have to be taken into account. An overview over the relevant literature will provide the basis for the introduction of our navigation system REAL and the adaptation to different limited resources. We will see how REAL adapts graphical way descriptions on three different levels to resource limitations cast by the situational context. The article finishes with an example in which REAL provides the framework for a practical implementation of a fair guide navigation system for CeBIT 2002. 71 O. Stock and M. Zancanaro (eds.), Multimodal Intelligent Information Presentation, 71–93 © 2005 Springer. Printed in the Netherlands
72
BAUS, KRÜGER AND STAHL 2. RESOURCE-ADAPTIVE NAVIGATION
Personal navigation systems that extend beyond today’s use in cars will play a major role in the future. Especially pedestrians will profit from the ability to get localized information on mobile devices everywhere and at anytime. The generated graphical presentations, ranging from complete route descriptions at a stationary information kiosk to incremental route descriptions given by the mobile devices, are used to communicate content (navigational instructions) between a sender and a receiver (in our case the system and the user). Whenever graphics are presented via a certain medium to a human viewer two different types of resources play an important role: on one hand all the technical resources of the electronic navigational aid and on the other the cognitive resource available to the user. 2.1 Technical Resources
Technical resources cover all types of limitations of the presentation platform. Here, three different subtypes can be identified: x x x
Quality and precision of the positional information Restrictions of the output medium Restrictions of the graphics generation process
Since several different means of transportation typically have to be combined in order to reach a destination, it must be ensured that the user interface reacts to the user's changing situation, no matter what kind of transportation is used. When a change of the means of transportation is detected the system has to adapt its interface to the new situational constraints. The essential switch between different positioning technologies, e.g. GPS or GSM/UTMS cells (outdoor) or infrared and Bluetooth (indoor) should ideally remain unnoticed by the user. It is also very important that a mobile pedestrian navigation system takes into account the precision of the location measurements, because sizes of radio cells may vary, GPSsatellites visibility depends on the actual location and the distribution of indoor infrared senders is often not uniform. Our notion of positional information does not only include the absolute or relative location of the user in the actual environment, but also the user’s orientation and speed. The knowledge about the quality and precision of positional information can be used to adapt the graphical navigational instructions, as described later in the article. In our case the output medium is the systems’ visual interface. Typical output media are printed paper and computer displays, but also 3D-Display, such as stereo glasses. All the different media are restricted by an outer and an inner scale (Hake, 1982). These terms from cartography describe the maximum size of the graphics displayed, their resolution and thus the amount of detail that will be visible. Both factors influence the graphical presentations. For example it is not possible to reduce the size of a graphic without loosing details, so if the outer scale of the display is limited, it might be better to enlarge some small, but important details, while omitting others which are not so relevant for the navigational task at hand. In our case this implies that we
RESOURCE – ADAPTIVE PERSONAL NAVIGATION
73
have to adapt our graphical presentations to many devices with different display sizes or resolutions. Another resource which might be limited is the capability of the devices to represent graphical navigational instructions in colour. Colours often provide important information about the depicted objects of the environment. In addition colours could help to draw the observers’ attention to important objects for the navigational task or to focus on certain details of the map. And last but not least, the graphics generation process itself needs resources, especially cpu-time and memory. These restricted resources are often critical in real-time applications. Even more difficult than the adaptation to technical constraints is the adaptation to the cognitive limitations of the user for a given situation. 2.2 Cognitive Resources
One of the most important cognitive resources of the user is the working memory. Unfortunately, there exists no comprehensive theory how tasks are solved in our brains and for the architecture or structure of the working memory different models have been proposed (cf. Baddeley, 1968, Loggie, 1995). In our work we simplify the limitations which are connected to the users’ working memory to three classes: x x x
Time restrictions Domain knowledge Familiarity with the presentation type
Time restrictions can be further divided into two different types. The viewing time is the time the addressee is given to look at the graphical presentation. In contrast to that, the decoding time describes the amount of time needed by the user to decode the presentation and to understand its meaning. Both times may be very restricted, especially when the user is heavily stressed or the navigational task is not the primary goal of the user. Parallel tasks such as carrying heavy luggage or answering a phone call while walking apply additional pressure on the user, which has to be compensated by the navigational aid. In this case complicated presentations that must be decoded under time pressure often lead to vicious circles, where the sense of not understanding the presentations results in stress reducing the ability to decode the presentation and so on. The domain knowledge of the user about the environment is represented in a user model and stores information about known and preferred routes. This enables the system to use different strategies to optimise routes, e.g., it minimises the amount of turning points, thereby reducing the number of possibilities to take wrong turns. And last but not least, the user might be familiar with some types of graphical presentations used in navigation tasks, e.g., ordinary street maps use certain coding conventions to convey their information and they rely on the ability of user to follow these conventions and to read maps.
74
BAUS, KRÜGER AND STAHL
2.3 The navigational task
In order to generate graphical presentations a navigational aid has to perform at least the following steps: determination of the information to present, determination of the presentation structure, and the graphical realization of the actual presentation. In addition, the information should be presented in a way that takes the maximum benefit of particular strengths of each presentation medium, and takes into account the actual context of the user as well as the current environmental context. We define navigation as the coordinated, goal-oriented following of a route through a spatial environment, where the navigational task consists of two parts: wayfinding and movement. In our notion the term wayfinding includes the planning and decision making processes involved in the determination of a route the navigator has to follow from a start to goal location. Such a route will be divided into segments and each segment consists of at least four parts, belonging to different categories: starting point, reorientation/orientation, path progression and ending point. Since routes are continuous the ending point of a segment serves as starting point for the next. The segmentation of a route is based on changes in direction at decision points. Here we have to communicate changes in direction clearly. Such a segment of a route represents a partial view of the environment by integrating different kinds of knowledge, e.g. knowledge about landmarks. In navigation tasks landmarks are used to identify decision points, destination points or provide orientation cues. They are the memorable cues that are selected along a route and also enable the navigator to encode spatial relations between objects along the route and the route itself, enhancing the development of a cognitive map of a region. A landmark may be any element in the environment that is external to the observer and that serves to define the location of other objects, locations or the actual position of the navigator in the environment (cf. Sorrows & Hirtle, 1999). In our case we use landmarks in graphical navigational instructions to help the user to locate herself in the current environment. In the following section we will give a brief overview about some existing mobile navigation systems. 3. MOBILE NAVIGATION SYSTEMS One of the first systems that used location aware information to help tourists was the CYBERGUIDE-system (cf. Long, Kooper, Abowd & Atkeson, 1996). Both an indoor and an outdoor navigation component were designed to assist tourists during their visits. The indoor component relied on infrared beacons broadcasting a unique ID that was used to display an arrow on a map whenever the user entered a new room. Additionally the user's orientation was estimated from her actual walking direction and the topology of the building. The outdoor system instead used GPS to determine the user's position and to display it on a map. Both systems operated independently from each other and could not be combined. The MARS system is an augmented reality system that provides information for the buildings on the Columbia University Campus (cf. Höllerer, Feiner, Terauchi, Rashid & Hallaway, 1999). More recently an additional indoor component has been developed that
RESOURCE – ADAPTIVE PERSONAL NAVIGATION
75
assists the user also in indoor navigational tasks (cf. Höllerer, Hallaway, Tienna & Feiner, 2001). The GUIDE-system is a location aware multimedia tourist guide for the City of Lancaster (cf. Cheverst, Davies, Mitchell, Friday & Efstratiou, 2000). Based on a radio cell infrastructure information is provided that is tailored to the personal preferences of the user, but also to her actual contextual situation. The DEEP MAP project (carried out at the European Media Lab) focuses on a mobile tourist information guide that brings together results from natural language and intelligent graphics generation (cf. Malaka & Zipf, 2000). This allows a multimodal user interface to offer the user a variety of information on the city of Heidelberg. Since GPS is used to determine the position DEEP MAP is at the moment restricted to outdoor scenarios. The MOBIS system is an electronic guide based on a PDA that provides information on the exhibits to a visitor of a museum (cf. Encarnacao & Kirste, 2000). The PDA receives its position from infrared beacons distributed in the environment and uses this position as a pointer to a specific content that is stored in a database on the PDA. A similar approach stands behind the HIPS system (cf. Not, Petrelli, Sarini, Stock, Strapparava & Zancanaro, 1998). In contrast to MOBIS a radio back channel is used for downloading multimedia content. HIPS uses more powerful mobile devices (sub-notebooks) which are heavier but allow the play back of much better animations and sound files. HIPS does not only take into account the absolute position of the user but also the relative distances to objects in the exhibition. The NEXUS system aims to provide a general framework for mobile and location aware computing (cf. Hohl, Kubach, Leornardi, Rothermel & Schwehm, 1999). The concept of an augmented world is used to store information that may be relevant to a user at a certain location. This model forms the basis for so called virtual information towers that connect information objects (i.e. from the WWW) to real places. Nearly all the discussed systems are designed for certain computational and communicational platforms. Either enough computational power has to be available on the mobile device (DEEP MAP) or the information has to be stored locally in a huge database (MOBIS) or can be accessed over an network infrastructure (NEXUS, HIPS). This prevents the distribution of computational tasks into the environment even if the environment is instrumented with enough computational power. Another drawback of the aforementioned systems is their inability to adapt to the restricted cognitive resources of the user. Especially the use of mobile devices implies the need for adaptive presentation generation and interaction, given the fact that the user might be distracted or under time pressure. In addition, none of the discussed systems allows a seamless switch from indoor to outdoor scenarios and vice versa.
76
BAUS, KRÜGER AND STAHL 4. AN ARCHITECTURE FOR RESOURCE-ADAPTATED NAVIGATION
4.1. Resource and location sensitivity
We distinguish three different classes of resource sensitive processes: (a) resource adapted processes, (b) resource adaptive processes and (c) resource adapting processes (cf. Wahlster & Tack, 1997). Resource adapted processes have been optimized in advance for well known and stable restricted resources. The quality of their results remains constant for a given input. In contrast resource adaptive and resource adapting processes can handle varying resources restrictions. Therefore their results depend on the available resources during runtime. Resource adaptive processes are using exactly one strategy that can react to varying input resources, whereas resource adapting processes switch their strategies on a meta cognitive level to comply with different resource situations. This means that under certain conditions several strategies are used to solve a given problem. We distinguish two different kinds of location sensitivity for mobile devices: 1. Active location sensitivity: the mobile device detects actively the actual location on its own. Based on that information the presentation is generated automatically to comply with the user’s requests (e.g. GPS based car navigation system). 2. Passive location sensitivity: the mobile device just passively presents the information that it receives from senders in its local environment (e.g. location based information on traffic status via Radio Data Service). Active location sensitivity puts most of the computational burden on the mobile device. Given the fact that its memory and CPU performance are limited, the adaptation process is even more difficult. In contrast to that passive location sensitivity shifts most of the computations to the instrumented environment and the mobile device reacts only to the changing situations by presenting the localized information. In other words one could say that the needed intelligence for a process that uses active location sensitivity has to be in the mobile device, whereas processes that rely on passive location sensitivity use computational resources of the environment. Therefore it is reasonable to use passive location sensitivity in scenarios with high user frequencies and complex adaptation services and active location sensitivity in scenarios with less users and limited adaptation services. 4.2. The resource sensitive navigation system REAL
REAL uses both resource adaptive and resource adapting processes to generate graphical way descriptions. At the same time REAL combines active and passive location sensitivity. The system has been designed in a way that the switch between both adaptation paradigms is barely noticeable by the user. REAL integrates two subsystems: the system IRREAL (Infrared REAL) for indoor navigation tasks based on infrared senders and the system ARREAL (Augmented Reality REAL) for outdoor navigation based on GPS-satellites. IRREAL uses passive and ARREAL
RESOURCE – ADAPTIVE PERSONAL NAVIGATION
77
uses active location sensitivity. When comparing the IRREAL system with the MOBIS system (Encarnacao & Kirste, 2000) it is worth highlighting that the same underlying technology (both systems use strong infrared senders in the environment and infrared receivers on the mobile devices) can be used for the same task (e.g. a fair navigation guide) with active location sensitivity (MOBIS) and passive location sensitivity (IRREAL). The overall system architecture is presented in figure 1. The input level transforms a specific request for a way description under consideration of the cognitive and technical limited resources into a user specific request. This request is then passed to the resource adaptive way finding module. With the help of domain knowledge and a user model the optimal route is determined and passed on to the presentation planning module. This module optimizes the presentation of the way description not only according to the resolution, screen size and colour capabilities of the output device but also to quality of the given sensor information (i.e. precision of location, orientation and speed of the user). The planning process consists of a media independent and a media dependent part. The media independent component relies on a 3D-model of the environment and domain knowledge to first choose appropriate parts of the 3D-model for visualization according to the range of the infrared senders (cf. Butz, Baus, Krüger & Lohse, 2001). The component then looks at spatial relations that can be used to express the way description in the specific situation based on information gained from experimental studies (cf. Zimmer, Speiser & Baus, 2001). This information is passed to the media dependent planning component, which uses design presentation strategies to adapt the way description to different output media. This approach allows the generation for different output medias by a single centralized server. At the moment the system supports the generation for a large screen (e.g. a Smartboard), for a mobile device (PDA) and for a small monitor that can be clipped to standard glasses. This classifies the graphics generation module of REAL as resource adaptive process. A great part of the cognitive resource limitations of users can be covered since REAL informs the user not only during the navigation task, but also in advance at a stationary information kiosk. This kiosk is able to provide sophisticated adaptation services (e.g. by the use of 3D-graphics, automatically planned camera paths and ego-motion). Both the stationary and the mobile part of the REAL system form a hybrid resource adaptive navigational aid, that provides consistent way descriptions over different output media (cf. Krüger, Baus & Butz 2001).
78
BAUS, KRÜGER AND STAHL Limited resources
Sensor Data
Request
Context of the request
Quality Position
Technical
Cognitive
Orientation User specific request
Speed
Media unspecific
presentation planning
User model
Dynamic localization 3D−Model Determination
Computation of spatial
of dynamic
relations
reference systems
Ontology Domainknowledge
Media specific
Resource adaptive mediaspecific graphic generation presentation planning
Resource adapting presentation planning
Resource adaptive waysearch
Navigational instructions Automatic animated 3D animation presentation agent Sketchrendering
WaveLAN
Mobile outdoor Clip−on display
Preparation IR−Broadcast
LAN
Mobile indoor
Static large display
Palm Pilot
Smart Board
Presentation− strategies
Figure 1: The system architecture of the resource sensitive navigation system REAL
RESOURCE – ADAPTIVE PERSONAL NAVIGATION
79
5. PERSONALISED NAVIGATION This section will explain in more detail the major components of the proposed resource adaptive and adapting architecture for a pedestrian navigation system. The architecture in figure 1 consists of three important levels that have to be sensible to the available resources: wayfinding, planning the presentation and finally rendering the presentation.
Figure 2: The thicker line represents the shortest path between two points of one of our working scenarios: Frankfurt Airport’s Terminal 1.
5.1. Wayfinding
The first step, that a navigation system has to perform, is the task of finding an appropriate path in the environment. Current realisations of navigation systems (e.g. for in-car use) usually try to find the shortest path. Certain flexibility is provided by most systems by taking into account the travelling speed for different types of roads and vehicles. The last generation of car navigation systems is even able to avoid traffic jams, by including real-time traffic information into the computation of the optimal route. Navigation systems for pedestrian have to go one step further. The context of a pedestrian is subject to high variations. Pedestrians can change their walking directions suddenly, enter building and climb stairs, cross places and use public means of transportation. All this has to be taken into account, when computing the shortest path from current position to a target location.
80
BAUS, KRÜGER AND STAHL
For this purpose our approach is based on the well-know Dijkstra-Algorithm with some significant changes that makes the algorithm resource-adaptive (in the sense of the definition in section 4.1.). Figure 2 displays the results of the usual shortest-path Dijkstra-Algorithm for the path network of one of our scenario (Frankfurt Airport). Three additional constraints were found that help to enhance this algorithm to take into account the cognitive resources of the user: the minimisation of turning points, the minimisation of decision points and the maximisation of the familiarity with the environment. There is a trade-off between the length of the path and the other optimisation criteria. Section 5.1.4 will provide a brief empirical study that investigates this trade-off more closely. 5.1.1. Minimisation of turning points
It is well known from relevant psychological literature that users remember way directions best, if they contain only a few decision points. A decision point in this sense is a node in the way net, where the user has to change her walking direction in order to follow the right path. In figure 2 path computed by the algorithm consists of five turning points. The user has to memorize all five turning points in order to be successful. If the goal is to minimize the cognitive load, this can not be considered the optimal solution. When looking at figure 3 we can see that there exists a path that has fewer turning points (namely three instead of five). This is achieved by adding additional costs to nodes that require a turning point. The DijkstraAlgorithm will find this solutions if additional costs are added during the system run for each turning point. 5.1.2. Minimisation of complex decision points
Another way of reducing the cognitive load is to keep the overall complexity of decision points small. The complexity of a decision point can be simply defined by the number of the crossing roads. For a more complex definition have a look at Raubal (2001). Here the hypothesis is, that the smaller the possible choices, the easier it is to remember and then to take the correct decision. Again, the modifications to the Dijkstra-Algorithm consist of additional costs for nodes with higher complexity.
RESOURCE – ADAPTIVE PERSONAL NAVIGATION
81
Figure 3: The thicker line represents a path has the same length as the one in figure 2 while requiring less turns. This makes it easier for the user to remember the way description.
5.1.3. Maximisation of spatial familiarity
The third way of adapting way directions to the cognitive resources of the user is to maximise the number path segments the user is familiar with for a given navigation task. This is only feasible if the user has at least some experiences with the location. For example, let us assume that the user has already visited the Frankfurt Airport a couple of times before and therefore already knows how to move from the check-in area to the area with the airport shops. When determining the way directions from the check-in area to a new location (e.g. the departure gate) the system can make use of the spatial familiarity by guiding the user via the known shopping area. Of course, this might result in an overall path being longer than the direct connection between check-in and gate. However, the length of the unknown path (i.e. the path from the shopping area to the gate) might be much shorter and therefore significantly reduce the cognitive efforts of remembering the right path. Graph nodes that represent known places to the user a rewarded with a cost reduction, enhancing the chance of the Dijkstra-Algorithm to include this area into the final way direction.
82
BAUS, KRÜGER AND STAHL
Figure 4: Trade off between path length and reduction of turning points for the Frankfurt Airport Scenario, Terminal 1
5.1.4. Empirical investigation of the trade-off between path length and cognitive criteria
As described in the foregoing sections the integration of all three cognitive optimization criteria is realised with a modification of the costs that are used by the Dijkstra-Algorithm to determine the Path with the lowest sum of the overall costs. Theoretically, a clever choice of the costs allows considering all three criteria simultaneously. Unfortunately from a practical perspective it is not clear which costs to choose in a given situation. In order to get an idea of the trade-offs between the cognitive criteria and the path length, empirical investigations were carried out. For this purpose we used the way net of figure 2 to calculate more than 1000 paths for randomly chosen start and target locations. Afterwards for each of those paths we looked at 10 different weights that controlled a certain cognitive criteria. The diagram in figure 4 shows the mean effect of the decrease of turning points on the increase of the overall path length. It is observable that there is only a slight chance to push the reduction of turning points for this scenario over 20 percent. As a consequence increasing the weight furthermore would not make much sense. Similar investigations were made for the other criteria providing us with good heuristics for the appropriate choice of weights in a given situation. Although this information is domain dependent and requires therefore a distinct analysis for a new domain, it can be obtained offline and only has to be recalculated in the case of a change in the way net.
RESOURCE – ADAPTIVE PERSONAL NAVIGATION
83
5.2. Planning way presentations
In order to generate structural graphical descriptions for the way directions we extend an efficient hierarchical planning approach presented in Butz (1997) for the generation of 3D animation. The main assumption here is that all generated graphical presentations can be structured in the form of a tree describing parts and subparts of the graphics. Each part or subpart corresponds to a node in the tree. Nodes are either terminal nodes in which case they describe portions of the graphics that will be realized by one of the graphics realization techniques described in the section about graphical techniques, or they are non-terminal nodes. In this case they specify a set of sub-nodes and a logical, spatial or temporal interrelationship between them. Temporal interrelationships only apply to temporal media and include the concepts parallel, sequential and incremental. An example for temporally parallel subparts of a graphics are, for example, a camera motion and an object motion taking place over the same times pan of a 3D animation. A sequential interrelationship describes a sequence of subparts taking place in a temporal order, e.g., one after each other. Specifying the subparts as having an incremental relationship means that after a sub-tree is fully expanded, this sub-tree can be forwarded to the graphics realization component, which is not the case with every sub-tree in a sequential list. The specification of incrementally ordered subsequences allows the graphics realization process to start its work before the structure of the graphics is fully generated, and thus greatly reduces the perceived delay from the start of the whole graphics generation process to the moment the first graphical element is shown. Logical interrelationships include the concepts of alternative, conditional and additional sub-trees in the structure of a graphical presentation. Both alternative and conditional sub-trees specify a list of possibilities for the realization of a certain part of the graphics. Conditional expansion selects one alternative from this list at planning time. In this way we can, for example, specify that a certain type of graphical presentation can be generated either as a line drawing or as a 3D image of the scene depending on media restrictions. Specifying these sub-trees as alternative postpones the decision until presentation time. The strategy here is to first expand the structurally simplest or computationally cheapest part of the tree and then – if time permits – to proceed with more complex alternatives that might be visually more appealing or clearer in the communicative sense. The resulting structure graph contains all of the various alternatives (unless planning was stopped before due to temporal restrictions) and leaves the decision which alternative is chosen to either the presentation process or even the user. An example for an additional relationship between sub-trees of the structure tree is the labeling of a line drawing or the creation of additional viewports for an already running 3D animation. This is based on the assumption that already the first sub-tree yields a ’working’ graphical presentation, while the following subtrees contain presentation elements which will enhance the overall quality of the graphics. Obviously, the different kinds of interrelationships within the structure tree of a graphical presentation leave room for various strategies of adaptation of the
84
BAUS, KRÜGER AND STAHL
generation process to limited resources either in the output medium or in the generation process.
Figure 5: A simplified view on a plan tree for adaptive way descriptions
Figure 5 shows a simplified part of a structure tree for a graphical way description. At the root of the tree we see that the way description can be presented in the form of a sketch or alternatively a 3D walkthrough. One of these alternatives can be chosen at presentation time based on circumstances, media restrictions or user preferences. In the case of a 3D walkthrough the ground plan and the actual
RESOURCE – ADAPTIVE PERSONAL NAVIGATION
85
path have to be shown. Specifying these two actions as having an incremental interrelationship implies that the ground plan can already be drawn by the realization component while planning for the path visualization continues. Had we specified the two actions as being parallel or sequential, planning would have to be finished before the realization starts. The visualization of the path itself consists of showing its starting and ending point and the trajectory in between. Showing the trajectory in turn is nothing else than an incremental loop over all of the trajectory’s segments (see section about route descriptions) and drawing an arrow for each of them. Optionally, after drawing each segment of the trajectory, the corresponding decision point can be shown. Depending on media restrictions this visualization can consist of a thumbnail image from the 3D world at this particular point, a text label or nothing at all. 5.3. Automated rendering of way descriptions for mobile devices
The hybrid navigation system REAL combines several static and mobile components to provide adaptive way instructions. As described in section 4 this includes an information kiosk with a large display and a high resolution as well as a mobile handheld device. The planning formalism discussed in section 5.2 ensures that the generated graphics for both device classes are coherent and contain cross references. The graphics generation process relies on a 3D-model of the environment. If the user is standing at the information kiosk this approach allows for highly sophisticated rendering of 3D ego-perspective views, including a virtual presenter who plays the role of a tour guide explaining the environment (for more details see Baus, Krüger & Wahlster, 2002). Unfortunately, those graphical techniques can not be put to use on the mobile device. Limited computational power and small displays with low resolution ask for more abstracted graphics and information. Therefore we will describe in the following the necessary steps to generate images of a way directions in a building that are adapted to the PDA display capabilities in terms of size and resolution, using the original 3D model from the information booth knowledge base as the source. Since we rely on infrared beacons to determine the actual position of the user, it is necessary to generate an image for each beacon in the environment. This graphics generation process is divided into a separate background and foreground graphics generation step. 5.3.1. Background generation
We derive the background drawings directly from the REAL 3D-knowledge base (see section 4) that consists of a polygon model of the building. In addition, all sender positions and orientations are represented too. By modifying the camera parameters we are now able to achieve graphical results that are similar to floor plans and therefore very suitable for our purpose.
86
BAUS, KRÜGER AND STAHL
Figure 6: The original model (a) is viewed as a parallel projection (b). The sketch renderer generates a floor plan in black and white (c). Setting the near and far clipping planes uncovers doors (d).
In the first step, the clipping area and orientation of the image is derived from the position and direction of the beacon. Instead of the usual perspective projection, we use parallel projection. The camera is placed into the centre of the map rectangle, where it is aligned along the Y-axis to show a top-down view and then turned around the Y-axis to match the orientation of the sender. This leads to an image that is oriented along the user’s walking direction. Since in a parallel projection, the camera distance to the model has no effect on the clipped area, the model is scaled to fit the desired extension. As a rule of thumb we always choose the scale that allows the user to overlook the way until he reaches the next beacon triggering a new position.
RESOURCE – ADAPTIVE PERSONAL NAVIGATION
87
C)
A)
B)
i
Figure 7: Generation of (foreground) route depictions: The background (a) is superposed by the foreground annotations (b) and displayed on the mobile device. Image (C) gives an example of the stationary booth visualisation at the same location.
The background drawing has to show doors and should hide small steps which could be erratically interpreted as walls. A vertical cut through the model, somewhere between the height of the knee and head of typical users yields the desired result. This can be achieved by simply setting the near and far cutting planes of the projection accordingly. A simple sketch renderer (see Butz, Baus, Krüger & Lohse, 2001) is applied to the geometry model to obtain a sketched image of the background. Additionally, the lighting model is changed to flat shading and all materials are set to plain black and white. The image is rendered by the graphics subsystem (here Java 3D) into a hidden frame-buffer with the exact PDA display resolution and then exported as a bitmap. The result low resolution sketch of the ground plan in black and white.
88
BAUS, KRÜGER AND STAHL
5.3.2. Foreground Generation
A)
B)
C)
D)
E)
F)
G)
H)
I)
Bitte kehren Sie auf den Pfad zurueck
Figure 8: The image sequence A to F contains route depictions, which the user receives via IR on their PDA. The images G to I are shown if the user deviates their route, leading them back on the right way.
Annotations of the foreground are generated and stored as 2D vector elements and superimposed to the background bitmap at runtime. The annotation consists of two parts: arrows that indicate directions, and important landmarks. If located inside the clipping region, the starting point of the route is depicted by a circle, whereas the destination point by a cross. Directional changes and the points where the path segment enter and leave the image are depicted by green arrows. The beacon that corresponds to the current map is indicated by a blue dot for better orientation. The generation of the arrows as vector elements is not trivial, since the
RESOURCE – ADAPTIVE PERSONAL NAVIGATION
89
path can arbitrarily start or end in the map region and enter or leave or even cross it many times. This problem is resolved by clipping each path segment with all map boundaries. The last step includes landmark information into the map. Displaying text labels on a low resolution display is hard to accomplish: large fonts may obscure important parts of the picture and small fonts are too difficult to read. Instead, we represent landmarks graphically by an information icon, making use of the PDA touch screen: when tapping on the icon with a stylus or finger, a text label naming the landmark is displayed for a few seconds. For this purpose, all route and landmark coordinates have to be transformed from 3D model space into 2D screen space, so that the vector elements fit exactly into the background bitmap. 5.3.3. A mobile indoor navigation example
The example shown in figure 8 explains how the user is led to a desired destination. In this setup the rooms of our lab have been associated with typical meanings taken from the Frankfurt Airport domain. In the example, the user starts in the VIP-lounge. There he interacts with the information booth and chooses the luggage claim as the destination. We assume that the route is presented by an agent in a complete walkthrough animation beforehand. After the presentation, the user starts to move on with the PDA. The first map (A) shows a rather large area, since the next IR sender is relatively far away. The user has to leave the room and turn right, following the arrows. The next beacon is located in the hall, separated by a door. The user will first receive a plain straightforward arrow (B) and then after a moment, a more detailed route description (C). The next map (D) includes a landmark represented by an “I”-shaped icon. Tapping on the icon will display the name of that landmark. At the destination, the user receives a map (F), telling her that she has reached the baggage-claim. Images (G)-(I) are used to indicate wrong turns to users and tell her how to find the right path again. 6. THE PERSONAL NAVIGATION SYSTEM REAL AT CEBIT 2002 The hybrid navigation system REAL was tested in several different scenarios, one of which was the world-wide largest computer fair CeBIT in 2002. For this purpose we integrated the interior 3D-views of several exhibition halls into the 3Dmodel of the fairground. This allowed users of the REAL system to obtain information about their location not only outdoors, while navigating from hall to hall (see figure 9), but also indoors when looking for a certain booth (see figure 10). Beside the geometrical representation of the fair ground, information on the various companies and their booth locations were also modelled. This CeBIT version of REAL is the first system that integrates the indoor and outdoor environment into a common 3D polygon model, containing buildings and their interior. The integrated representation allows for sophisticated camera paths, i.e. when the user enters or leaves the halls. Smooth transitions from the large scale
90
BAUS, KRÜGER AND STAHL
outdoor and lower scale indoor view prevent irritations of the user and enhance her spatial awareness.
Figure 9: Outside view of the CeBIT fair ground
The fair guide combines two distinct localization techniques. Outdoors a GPS receiver module is used to determine the users’ position. Indoors, where no satellite data is available, infrared beacons are used. Once the user’s position has been determined by one of those beacons a suitable view of the map is generated using the 3D polygon model and OpenGL rendering. For outdoor views the system relies on a more realistic 3D-view of the environment, whereas for indoor navigation a more sketch-like map was generated by means of the techniques described in section 5. The user’s position is indicated by a red dot. Target locations are marked with elevated 3D-boxes. A green line represents the proposed path. Outdoors the user movement is shown continuously and in real time, indoors the positioning appears in discrete steps depending on the distribution of beacons. Nevertheless, in order to make the visualisation more pleasant, the camera smoothly interpolates between those discrete user positions. Currently the visualization supports two adaptive features. First the display adapts to the user’s speed. When the user stops moving, the camera zooms in and the halls and the booths are completely labelled. When walking, the camera starts to zoom out of the scene. Only halls and booths close to the user are labelled. The second feature is the visualization of the GPS signal accuracy by the radius of the
RESOURCE – ADAPTIVE PERSONAL NAVIGATION
91
user’s standpoint marker. A small marker indicates a precise position, a larger marker represents a imprecise position. The CeBIT fair guide was implemented on a Xybernaut wearable computer running the Linux operating system.
Figure 10: Indoor view of an exhibition hall at CebIT 2002
7. SUMMARY AND OUTLOOK In this article we have focussed on the major concepts that help to design and implement a resource sensitive navigation system that adapts not only to various technical restrictions of the output media (e.g. a small screen or poor sensor qualities), but also to the cognitive limitations of the user (e.g. time pressure). We have shown that resource-adaptivity has to be included on three levels of the navigation system architecture: on the wayfinding, presentation planning and rendering level. Our navigation system REAL has proven to be useful in a range of scenarios one of which (the CeBIT fair guide) was introduced in this paper. From this point we would like to continue in two directions: on one hand we are working on multimodal extensions to REAL, allowing the user to use natural language and gestures to interact with the system. We are preparing an extension of the architecture that allows for speech recognition both on a thin client (i.e. a PDA) and a server in order to combine different speech recognition procedures in parallel.
92
BAUS, KRÜGER AND STAHL
On the other hand we want to develop personal travel files that contain all relevant information leading to personalised way descriptions. This includes not only known places and routes, but also personal annotations (e.g. voice memos). Appropriate visualisation and editing techniques will bring us one step further towards a more usable personal and spatial episodic memory. 8. REFERENCES Baddeley, A. D. (1986). Working Memory. Oxford University Press. Baus, J., Krüger, A., Wahlster W. (2002), A resource-adaptive mobile navigation system. In Proceedings of Intelligent User Interfaces 2002, ACM Press. Butz, A. (1997). Anymation with CATHI. In Proceedings of AAAI/IAAI ’97, (p. 957–962). AAAI Press. Butz, A., Baus, J., Krüger, A. & Lohse, M. (2001). A Hybrid Indoor Navigation System. In IUI2001: International Conference on Intelligent User Interfaces, (p. 25–33). ACM. Press. Cheverst, K., Davies, N., Mitchell, K., Friday, A. & Efstratiou, C. (2000). Developing a Context-aware Electronic Tourist Guide: Some Issues and Experiences. In Proceedings of the 2000 Conference on Human Factors in Computing Systems (CHI-00) (p. 17-24). ACM Press. Encarnacao, J.L., & Kirste, T. (2000). Beyond the desktop: Natural interaction and intelligent assistance for everyday life. In Alltag der Zukunft – Informationstechnik verändert unser Leben. (p. 39-57). Schöningh. Hake, G. (1982). Kartographie 1. Walter de Gruyter. Hohl, F., Kubach, U., Leonhardi, A., Rothermel, K. & Schwehm, M. (1999). Next century challenges: NEXUS – an global infrastructure for spatial-aware applications. In International Conference on Mobile Computing and Networking (MobiCom ’99) (p. 249-255). ACM Press. Krüger, A., Baus, J. & Butz, A. (2002). Smart Graphics in Adaptive Way Descriptions. In Proceedings of the working conference on Advanced Visual Interfaces (AVI 2000), (p. 92-97). ACM Press. Höllerer, T., Feiner, S., Terauchi, T., Rashid, G. & Hallway, D. (1999). Exploring mars: Developing indoor and outdoor user interfaces to a mobile augmented reality system. Computer and Graphics, 23(6), 779-785. Höllerer, T., Hallaway, D., Tienna, N. & Feiner, S. (2001). Steps towards accommodating variable position tracking accuracy in a mobile augmented reality system. In IJCAI-2001 Working Notes: AI in Mobile Systems (AIMS) (p. 31-57). AAAI Press. Long, S., Kooper, K., Abowd, G.D. & Atkinson, C. (1996). Rapid Prototyping of Mobile Context-Aware Applications: The Cyberguide Case Study. In Proceedings of the 2nd ACM International Conference on Mobile Computing and Newtorking. (p. 97-107). ACM Press. Malaka, R., & Zipf, A. (2000). Deep Map – challenging IT research in the framework of a tourist information system. In Information and Communication Technologies in Tourism. (p. 15-27). Springer. Not, E., Petrelli, D., Sarini, M., Stock, O., Strapparava, C. & Zancanaro, M. (1998). Hypernavigation in the physical space: Adapting presentations to the user and situational context. The New Review of Hypermedia and Multimedia, 4, 33-45. Raubal, M. (2001). Human wayfinding in unfamiliar buildings: a simulation with a cognizing agent. Cognitive Processing, 2(2-3), 363-388. Sorrows, M.E. & Hirtle, S. (1999). The Nature of Landmarks for Real and Electronic Spaces. In COSIT 99: Spatial Information Theory: cognitive and computational foundations of geographic information sciences. (p. 37-59). Springer. Wahlster, W. & Tack, W. (1997). SFB 378: Ressourcenadaptive Kognitive Prozesse. Informatik ’97: Informatik als Innovationsmotor. (p. 51–57). Springer. Zimmer, H.D., Speiser, H. & Baus, J. (2001). Die Selektion dimensionaler Präpositionen: automatisch und nicht ressourcenadaptierend. Kognitionswissenschaft, 9(3), 115-131.
RESOURCE – ADAPTIVE PERSONAL NAVIGATION 9. AFFILIATIONS J. Baus A. Krüger DFKI Saarbrucken, Germany C. Stahl Saarland University Saarbrucken, Germany
93
O. STOCK, M. ZANCANARO AND E. NOT
INTELLIGENT INTERACTIVE INFORMATION PRESENTATION FOR CULTURAL TOURISM
1. INTRODUCTION Since the second half of the Eighties, we have considered cultural heritage appreciation a privileged area of application for innovative, natural–language centred applications. From the application point of view, we believe this is an area of high interest, for the following principal reasons: a) the “users” of cultural heritage increase in number at a fast pace; b) there is a natural request for a quality shift: from presentation of cultural heritage as a standard mass product, similar to supermarket goods, to a way to provide the single person with the possibility of acquiring information and understanding on things that interests him most, and to assist his cultural development; c) the way in which the cultural experience is carried on has not changed much for centuries; and especially the young seem to require novel modes of being exposed to the cultural material, so that they would engage and entertain them; d) for Italy and Mediterranean countries cultural heritage can be a natural resource that fuels economy (Minghetti et al, 2002); e) humancomputer interface technology can have a decisive role in providing solutions for the individual. From the research point of view in the first phase we have considered this as an opportunity for exploring ideas related to multimodal interfaces. The AlFresco System was a system that integrated language, pointing in input and language and images in output (Stock et al, 1997). But the main aspect is that it integrated in a coherent way different interaction attitudes: the goal-oriented language based modality and the navigation-oriented hypermedia modality. Well before the web era the AlFresco generalized communication act management approach was perhaps anticipating some of the present challenges of web interaction. Subsequently we have begun working on information presentation in the physical environment. This brought in a number of new issues and some constraints (see Stock, 2001). Ideas were experimented in two projects, Hyperaudio (Not et al, 1998) and the European project HIPS (Benelli et al, 1999). We shall present here some new lines of research that we are now carrying on. 95 O. Stock and M. Zancanaro (eds.), Multimodal Intelligent Information Presentation, 95–111 © 2005 Springer. Printed in the Netherlands
96
STOCK, ZANCANARO AND NOT 2. THE PEACH PROJECT
The PEACH (Personal Experience with Active Cultural Heritage) project objective is that of studying and experimenting with various advanced technologies that can enhance cultural heritage appreciation. The project, sponsored by the Trento Autonomous Province, is mainly based on IRST research, with important contributions by the other two partners: DFKI and Giunti Multimedia. The research activity focuses on two technology mainstreams, natural interactivity (encompassing natural language processing, perception, image understanding, intelligent systems etc.) and micro-sensory systems. Throughout the project, synergy and integration of different research sectors will be emphasized. Two general areas of research are highlighted: x The study of techniques for individual-oriented information presentation: (i) use of formalisms and technologies derived from the field of natural language generation in order to build contextual presentations; (ii) use of speech and gestures as input and audio and animated characters as output; (iii) use of multi-agent architectures to provide suggestions and propose new topics. x The study of techniques for multi-sensorial analysis and modeling of physical spaces—that is, the use of visual sensors such as video cameras, laser telemetry, infrared sensors, and audio sensors such as an array of microphones and ultrasonic signals for monitoring a dynamic environment, and collecting information about objects and about the environment for accurate virtual reconstruction. The aim of the project is to significantly increase the quality of cultural heritage appreciation, in such a way as to transform passive objects into active ones that can be manipulated by the observer, and thus aiding to bridge the gap between our past, which they represent, and our future, of which they are the seeds. Extended Appreciation and (Inter)active Objects are facets of an underlying unifying vision we term “Active Cultural Heritage”. 2.1 Extended Appreciation
The traditional modes of cultural heritage appreciation impose numerous limitations that are not always obvious. For instance, in observing a large statue, notwithstanding physical proximity, the observer most likely will be unable to capture details from every angle, as these may be too far from his/her viewpoint. In these cases, direct observation creates limitations that can be overcome with augmented reality, such as by using a palm computer to observe the details of the statue, taken from cameras or reconstructed in a virtual environment. Moreover, access to some objects can be difficult or even impossible for some visitors, such as disabled or elderly people. Creating an accurate virtual representation of the objects would extend fruition of the exhibit to these visitors as well. In general, remote appreciation opens interesting possibilities, also for the study of an artefact that due to its fragile nature must be kept under restricted conditions and is thus not accessible to everyone. The possibility of interacting with an accurate
INTELLIGENT INTERACTIVE INFORMATION PRESENTATION
97
virtual representation, allows the non-invasive access to a work of art in the manner, time, and place most appropriate for the visitor. Objects can be manipulated in an innovative, didactic, and fun way such as by modifying a work of art, partially or in its entirety. 2.2 (Inter)Active Objects
It is particularly important for the individual to be able to “navigate” an independent information course based on individually and dynamically created presentations. One of the scopes of the project is transcending a museum’s restrictive environment by transforming a passive object observed by the visitor into an active subject capable of providing new information in a context-sensitive manner, a kind of hyperlink for accessing additional situation-specific information to be presented coherently. Much of the technology for accessing information in the Internet today (for example, adaptive user profiling, information promotion, database browsing, query by example) has a natural place of application in this environment 3. THE MUSEUM AS A SMART ENVIRONMENT A system that generates presentations of artworks in a museum must mould to the behaviour of a person visiting the museum. On the one hand, the system must facilitate movements within the space by (i) aiding the orientation of the user using appropriate linguistic support such as “to your right you will see…”; (ii) proposing suggestions about the best route for continuing the visit, such as with “…along the same lines, the next room contains an interesting…”. On the other hand, the system must be able to interpret the implicit intentions of the person’s movements. For example, the prolonged observation of one object may be interpreted as a sign of interest. A system of this type will be able to take into consideration the constrains posed by the environment in accessing information (e.g. objects in an adjacent room may be far if the two rooms are not connected) emphasising the emotional impact of seeing the “real” work of art. Such a system will also be able to affect the visitor’s perception within the environment by attracting his/her attention to a particular work or detail; for instance, taking advantage of new technology such as the ability to superimpose computer-generated images to the real scene (via special transparent visors) or by generating verbal presentations based on rhetorical and persuasionoriented strategies. In this way, the museum visit is a full-fledged interaction between the visitor and the museum itself. In order to render possible this interaction, it is necessary that the museum - in fact the underlying information system - (i) knows the physical position of the visitor (and, as much as possible, his focus of visual attention); (ii) communicates individual information on the objects under exhibition—for instance through a portable device, 3D audio, or using a special wearable device that
98
STOCK, ZANCANARO AND NOT
automatically superimposes generated images to the real scene; and (iii) receives requests from the visitor—verbally and/or through gestures. A museum of this type will not be simply reactive, limiting itself to satisfy the questions of the visitors, but will also be proactive, explicitly providing unasked information; for instance, suggesting the visit to particularly interesting or famous objects, or allowing access to a “window” (e.g. a flat screen on the wall) that can deepen the study of the object under observation. Such suggestions can be made based on the observations of the person’s behaviour, for example, the route chosen by a visitor or how much time is spent in front of a work, information noted about the user, such as age and culture, or considerations relative to the environment like rooms that are too crowded or that are temporarily closed. The system should be able to overhear the visitor’s interaction (Busetta et al, 2001) and provide further suggestions on the basis of an internal model of priorities (for example, satisfying visitor’s interests, fulfilling educational goals, or, perhaps, increasing museum bookshop’s sales). Another important dimension is that of attracting the young and keep them hooked to the cultural experience. With children a playful attitude is essential. We are conceiving new technology-based environments, with spoken interaction (see also the NICE project with a similar theme1), where as a side effect children will be motivated to look with attention and learn about the cultural heritage. One of the central aspects is the communication attitude. A humorous interaction is a key resource with children. The role of humor to keep attention, enhance memorization names and help creative thinking is well known. We are now beginning to see some concrete results in modeling some processes of humour production. To this end our initial work in computational humor will find a useful terrain of experimentation here (see Stock and Strapparava, 2002). 4. THE ROLE OF INFORMATION PRESENTATION According to (Bordegoni et al, 1997), a medium is a physical space in which perceptible entities are realized. Indeed, in a museum (as well as in a cultural city, an archaeological site, etc.) the most prominent medium is the environment itself. The main requirement for the presentation of information task is that of integrating the ‘physical’ experience, without competing with the original exhibit items for the visitor’s attention. From a multimedia point of view, this means that additional uses of the visual channel have to be carefully weighed. In this context, the audio channel should play the major role in particular for language-based presentations, although the role of non-speech audio (e.g., music or ambient sounds) should also be investigated. Yet when a visual display is available (for example a PDA or a wall-size flat screen) images can be used to support the visitor in the orientation task (3D or 2D images can be used to support linguistic reference to physical objects). In this latter case, the visual channel is shared between the display and the environment but the goal is still to provide support to environment-related tasks.
INTELLIGENT INTERACTIVE INFORMATION PRESENTATION
99
From a multimodal point of view, different modalities can be employed to focus the visitor’s attention on specific objects or to stimulate interest in other exhibits. For example, the linguistic part of the presentation (through speech audio) can make large use of deictic and cross-modal expressions both with respect to space (such as “here”, “in front of you”, “on the other side of the wall”, etc.) and to time (“as you have seen before”, etc.) (Not and Zancanaro, 2000). The peculiarity of the environment as a medium is its staticity: the system cannot directly intervene on this medium (i.e. the system cannot move or hide exhibits nor change the architecture of a room, as in virtual settings, at least without considering technology-based futuristic extensions.) Therefore, it may appear that a main limitation of the presentation system is the need to adapt the other media in service of the environment. Yet a multimodal approach can get round the staticity constraint in two ways at least: a) Dynamically changing the user's perception of the environment: by exploiting augmented reality techniques (for example as described in Feiner, 1997) it is possible to overlay labels or other images on what the visitor is actually seeing. In this way, for example, the system can plan to highlight some relevant exhibits in the environment or shadow other less relevant ones. 3D audio effects or the selection of characteristic voices or sounds for audio messages can stimulate user's curiosity and attention (Marti et al., 2001). Yet a similar effect can be obtained by exploiting the power of language, as we did: language-based presentations can be carefully planned to attract the visitor's attention to more important exhibits and shadow less relevant ones. The simplest example: when a visitor enters a room for the first time, she usually receives a general room presentation followed by one that directs the visitor’s attention to the exhibit the system hypothesizes most interesting for her. b) Changing the user's physical position: the system can induce the user to change her physical position either by a direct suggestion (e.g. ”go to the other side of the room, the big fresco you'll see on the wall is La Maestà”) or indirectly, for instance by introducing a new topic (e.g., ”La Maestà, one of the absolute masterpieces of European Gothic painting, is located on the wall behind you”). Ultimately, the goal of such system is to support visitors in making their visiting experience meet their own interests; but in some cases a visitor should be encouraged not to miss some particular exhibits (for example, you cannot visit the Louvre for the first time and miss the Mona Lisa). Sometimes this task can be accomplished by direction giving, but there are other ways to promote exhibits: for example, by providing at the beginning of the visit a list of hotspots, or by planning a presentation that, in a coherent way, links the exhibit in sight to other ones through reference to the visitor’s interest. More generally, further research is needed towards implementing pedagogically-motivated systems with meta-goals to pursue, educational strategies to follow and intentions to satisfy. In this respect, the interaction between the visitor and the system must evolve from simple interaction to full-fledged collaboration (for a discussion on this topic applied to cultural tourism see Stock, 2001).
100
STOCK, ZANCANARO AND NOT
5. BUILDING ADAPTIVE PRESENTATION FROM EXISTING REPOSITORY A fundamental prerequisite for using existing information in different discourse and interaction contexts, and for different users, is that the original material is properly organized in a repository of information units that can be flexibly accessed and composed. In the HIPS [2] and HyperAudio [7] projects we experimented with an annotation formalism, called the MacroNode formalism, that allows the dynamic composition of an adaptive hypermedia [8]. In this framework, the available material is fragmented into content units which are annotated with essential semantic and linguistic data, i.e. a description of the main topic of the content unit, the semantic relations existing with other units (in terms of RST rhetorical relations [6]), and the different ways in which the unit can be presented to the user. A content unit is called a macronode and, for textual information, it typically corresponds to a paragraph. More generally, a macronode can correspond to data in different media, for example text, audio, images, or a combination of these.
Figure 1: A sample portion of macronode network displaying the rhetorical links.
The rhetorical links, used to connect macronodes to each other, express which information units can be composed together, and how to compose them to produce a text in the given genre. A rhetorical link specifies the coherence relation existing between two portions of discourse, e.g. whether text span B describes the event that caused state A to occur (CAUSE rhetorical relation), or whether text span B provides background information useful for interpreting the assertion in text span A (BACKGROUND relation). As a starting point for the identification of a set of relevant rhetorical relations, we have taken as a reference the work by Mann and Thompson ([6]). In HIPS, we exploit a set of basic relations considered as the most
INTELLIGENT INTERACTIVE INFORMATION PRESENTATION
101
relevant for our application task: elaboration: when text span B provides more details for what is discussed in text span A; an additional domain-dependent refinement of the elaboration relation (elaboration.authorship, elaboration.direct, elaboration.indirect, elaboration.represented) was introduced to capture different ways to add details to an artwork description; background: when text span B provides information useful for interpreting the assertion in text span A; comparison: when text span B compares the object/concept described in A with another relevant object/concept; navigation: when text span B provides suggestions about where to go/what to look at next. Figure 1 shows a sample fragment of macronode net, where rhetorical links signal the coherence relations between different macronodes. To compose an adaptive hypermedia out of the content units, the macronode approach exploits some text planning techniques and linguistic manipulation rules derived from the field of NLG. 5.1 Assuring Coherence
At generation time, the annotated data is processed by a Presentation Composer that uses some NLG text planning strategies to determine the most appropriate discourse organization for the macronodes relevant for the current discourse context. The module explicitly maintains a representation of the discourse history and is able to exploit a model of user's interests and knowledge to best tailor the output. The schema-based text planning phase, which relies on the rhetorical organization of the macronodes in the repository, guarantees that the composed message displays (i) a coherent rhetorical structure, (ii) a topic flow compatible with discourse and interaction context, (iii) cross-references to material already presented and to user interests. For example, discourse strategies are encoded to avoid presenting already known information, to choose the kind of information for which the user's interest is higher, to present new information when the user goes back to a previous topic. In addition, rules checking the rhetorical links between macronodes provide control for the length of an elaboration chain or for the inclusion of background information to clarify a topic, and so on. All this reasoning, familiar for deep NLG systems, is possible on fragments of existing text thanks to the shallow semantic representation of the macronodes content and to the macronode rhetorical network. 5.2 Assuring Cohesion
In order to assure cohesion, each macronode encodes different realizations of the same content. These alternative realizations are stored as a graph. Each node of the graph represents a minimal unit of content, whereas arcs contain the conditions under which the target node can be realized (given that the source node has been realized). Each path in the graph is a possible realization of the content of the macronode.
102
STOCK, ZANCANARO AND NOT
For example, for a macronode that encodes the content that “The Guidoriccio da Fogliano is a fresco by Simone Martini that dates back to 1328”, a possible realization graph is depicted in Figure 2. In this case, a pronoun is chosen if the concept has just been mentioned in the previous discourse. In addition, if the main topic of the overall current presentation is the fresco itself, a picture is also included. The use of a realization graph with linguistically motivated options does not allow the same flexibility that a deep NLG component would; yet in many contexts this solution is sufficient to assure cohesion in dynamically composed presentations.
Figure 2: Internal representation of a macronode
5.3 Integrating existing and automatically generated text
Although reusing existing repositories of information is definitely valuable in many domains, often the ability to intermix existing text and deeply generated text could significantly improve the effectiveness of an adaptive system. There already exist some systems where this ability has been explored (e.g., ILEX [10]). In many cases, the system provides sophisticated NLG machinery to support generation from scratch (i.e., generation from a knowledge base representing entities and facts related by semantic relations) as well as mechanisms to support the inclusion of flexible templates and stories (paragraph-length pieces of canned text). However, usually stories play a marginal role: no particular support is designed to help authors in systematically editing canned texts nor is reasoning done on the text content (though some adjustments are possible on the linguistic surface form, for example to increase cohesion by properly introducing anaphora). This might result in limitations for those application domains where much interesting information is already available in the form of narrative text, anecdotes, or other complex text (but also images and videoclips) which would be too difficult/expensive to generate from scratch. Figure 3 shows some sample texts written by a human author describing an archaeological exhibit for different users.
INTELLIGENT INTERACTIVE INFORMATION PRESENTATION
103
The figure highlights in light color the portions of text that would be difficult or worthless to generate from scratch, but that still require some flexibility in content selection and linguistic realization to get the desired personalization in the final presentation.
Figure 3. A dynamic page generated by M-PIRO
To overcome these limitations, within the M-PIRO project [1] at ITC-irst we have been exploring the possibility of integrating a deep NLG system, like ILEX, with a flexible approach to the composition of preexisting text as the one supported by the MacroNode architecture, thus achieving a trade-off between information reuse and flexibility. The NLG machinery allows the generation of the most dynamic parts of the presentation - i.e., with the selection of the most appropriate content and realization with respect to the current discourse context and user model; the introduction of sentences comparing different objects the user has seen; the suggestion of other interesting related objects. The AH machinery allows to enrich the content and the phrasing, still allowing for personalization.
104
STOCK, ZANCANARO AND NOT 6. CINEMATIC TECHNIQUES AT THE INTERFACE
Usually, multimedia guides use static images or pre-recorded short video clips about museum exhibits. The advent of more powerful devices has enabled experimentation with new forms of multimedia, in particular time-based media such as animations. In our recent work we introduce the idea of enhancing the audio presentation of a complex fresco by using the PDA screen to travel throughout the fresco itself and identify details. At presentation time a sequence of pictures is synchronized with the audio commentary and the transitions among them are planned according to cinematic techniques. Our hypothesis is that the use of this type of animation to present the description of a painting allows the visitor to better identify the details introduced by the audio part of the presentation. In this manner, both the efficiency and the satisfaction dimensions of the system usability are increased (Nielsen, 1994) while also providing an enhanced learning experience for the visitor.
Figure 4. The Infrared positioning at Torre Aquila
We have experimented this idea in a prototype of a multimedia guide for Torre Aquila, a tower at the Buonconsiglio Castle in Trento where a fresco called “The Cycle of the Months”, a masterpiece of the gothic period, can be found. This fresco, painted in the Fifteenth Century, illustrated the activities of aristocrats and peasants throughout the year. The fresco is composed by eleven panels each one representing one month (the month of March, painted on wood, was destroyed by a fire) and occupies the four walls of the room (see figure 4). Our multimedia guide, implemented on a PDA, detects the position of the visitor by means of infrared emitters placed in front of each panel. The initiative of the interaction is shared between the system and the user. When the system detects that the visitor is in front of a wall, a picture of that wall is displayed on the PDA and, after a few seconds, if the user has not changed its position the panel she is facing is
INTELLIGENT INTERACTIVE INFORMATION PRESENTATION
105
highlighted (see figure 5). Now, the visitor can click on the panel and get a multimedia presentation. The multimedia presentation is composed of an audio commentary accompanied by a sequence of images moved on the PDA display that help the visitor quickly identify the fresco’s details mentioned in the commentary.
Figure 5. Two snapshots of the mobile multimedia guide in Torre Aquila
The language of cinematography (Metz, 1974), including shot segmentation, camera movements and transition effects, is the basic resource for planning the animation and to synchronize the visual and the verbal parts of the presentation. In building the animations, a set of strategies similar to those used in documentaries were thus employed. Two broad classes of strategies have been identified. The first class encompasses constraints, imposed by the grammar of cinematography, while the second deals with conventions normally used in guiding camera movements in the production of documentaries. For instance, a strategy in the first class would discourage a zoom-in immediately followed by a zoom-out, while a different strategy in the second class would recommend the use of sequential scene cuts, rather than a fade-out effect, to visually enumerate different characters in a scene. It is worth noting that in the latter strategy it is often necessary to make reference to the discourse structure of the audio part of the presentation, such as enumeration of properties, background knowledge, and elaboration of related information. In order to formally use
106
STOCK, ZANCANARO AND NOT
discourse structure, we employ the Rhetorical Structure Theory (Mann and Thompson, 1987) mentioned earlier.
Figure 6. A videoclip on the mobile guide.
7. SEVEN CHALLENGES The developed prototype we have illustrated is a step in the direction we have tried to describe at the beginning of this paper. We think the theme discussed constitutes a terrain where several areas of research can yield important contributions. In the following we briefly review some challenges. Visitor Tracking. In our own experience after various investigations we have ended sticking to our initial choice - infrared emitters at fixed positions, sensors on mobile devices. This choice was also combined with a compass, but we are sure shortly there will be more interesting solutions available (e.g. ultrasounds). For the outdoor scenario, we need a combination of GPS and finer localization devices. Other techniques can be envisioned and should be further investigated, for example, beyond the physical position, it would be useful to know the direction of sight of the visitor. For the moment it requires head-mounted displays and complex vision recognition hardware, but one can foresee a future where gaze detection may be possible with less obtrusive hardware, at least in structured and internally represented domains. Representations and reasoning on what is at sight need obviously to develop substantially.
INTELLIGENT INTERACTIVE INFORMATION PRESENTATION
107
Novel devices. Acoustic output has been shown to be here preferable over written language, which instead can be usefully exploited for highlights and followups. Yet improvements on high quality graphics on a small device would be highly appreciated since pictures have been shown to be very helpful in signalling references to objects in the physical space. Wearable devices and head-mounted displays can play a role in specific settings. In particular, head-mounted displays can be very useful if coupled with a technology that can overlay computer generated graphics to real scenes (see Feiner et al, 1997). This technique is particularly interesting for archaeological sites where the visitor would be able to “see” the buildings as they were originally. But often the best device for these kinds of applications is no device at all. Speech recognition in the environment coupled with "spatialised" audio would allow the visitors to experience multisensory and unobtrusive interaction with the environment. The “narration” must develop with individual-oriented characteristics and at personal times, so it cannot just be produced once for all by physically dislocated sound sources. Expressing space and time reference. We need our systems to be able to reason about where things are, what kind of spatial entity they are, how they look like from a given position, how best the visitor can reach them, when they will appear. For example, the system should be able to instruct the visitor to “reach the room at the end of this corridor” rather than “go forward 10 meters and then turn left”. There is a substantial tradition in AI dealing with qualitative temporal reasoning and a somehow less extended one dealing with spatial reasoning (Stock, 1997). Representations must provide us with material at the right level of detail so that we can properly express it in words. Of course we need also that the language we produce is sophisticated in the proper use of word characteristics, for time taking into account concepts like word aspect and tense, or, for space, for example, being able to choose specific spatial prepositions. A newer important element in research is qualitative modelling of movement (see Galton, 1997), particularly relevant here, as we have seen that movement is the most relevant input modality, strictly coupled with our suggested medium - the environment. Beyond descriptive texts. Perhaps the biggest challenge is concerned with keeping the attention of the user high and granting a long term memory effect. We need to be able to device techniques of material presentation that hook the visitor, that continuously build the necessary anticipation and release tension. The "story" (we mean the multimodal story that includes language, graphics and the visible physical environment) must be entertaining, and it should include mechanisms of surprise. The expectation sometimes must be contradicted and this contrast will help in keeping the attention and memorizing the situation. A typical mechanism of this kind is at the basis of various forms of computational humour (see Stock, 1996). Especially with children humour (and play) can be a powerful means for keeping them interested. Another aspect where much more research is needed is concerned with mechanisms of persuasion: i.e. how we can build rhetorical mechanisms aimed at the goal that the hearer/experiencer adopts desired beliefs and goals. It is not only a matter of rational argumentation, a field a bit more developed, but an integration of various aspects, including some modelling of affect. At the end our philosophy is that the user is responsible for what she does and hence for the material that is
108
STOCK, ZANCANARO AND NOT
presented to her, but yet through the presentation some specific goals of the museum curator can be submitted for adoption. New visit modalities. The advent of technology opens the way to new modalities of visit, particularly important with children. A treasure hunt is an obvious example, where the external goals cause the innocent visitor to look for details and come across many different exhibits with "artificially" induced attention. An easier development is that at the end of a visit, a report of the visit is produced electronically, available for successive elaboration. For instance it will allow the user to re-follow on a virtual environment what she has seen and to explore related material at a deeper level through added hyperlinks. Support for group visits. A relevant percentage of visitors come to the museum together with other people. For natural science museums, the typical case is a parent with children, for art museums it is the group of friends. The group dimension is largely unexplored: how best can a family (or other group) be exposed in individually different manners to the material in the environment so that they discuss what they have seen and have a conversation that adds to their individual experience, bringing in new interests and curiosity? Only limited research is devoted to group visit (see for example, Woodruff et al, 2001) and most issues are still open. Of course, we can envisage a big difference in the parent-child case with respect to the friend's scenario. Another interesting issue can be the study of dynamic grouping, for example when grouping extends over time (see for example, Rahlff, 1999) or is dynamically created during the visit. Experimental evaluation. HIPS has been evaluated with real visitors in the Museo Civico in Siena. The results of the evaluation can be found in (Marti, 1999) and (Marti and Lanzi, 2001). The most enthusiastic comments regarded the possibility to move freely during the visit while being assisted by the dynamic guide. The visitors felt comfortable in listening to descriptions without interacting too much with the PDA interface, mainly used in case of poor performance by the system (delay in loading a presentation, lack of information etc.). A feature that was especially appreciated was how information came tailored to the context. The visitors recognized the capability of the tourist guide to follow their movements offering appropriate and overall coherent information at the right moment. However, our community has not become sophisticated enough in evaluating mobile systems for a cultural task. What we really need are techniques as powerful as the Wizard of Oz (simulation by hidden humans of systems that at least in part do not exist yet, and observation of user behaviour with the new means) so that the results will really help decide on the specific design choice. Equally important, as in any educational environment, is to evaluate retention of concepts and vividness of memory after time (hours, weeks, years). These themes impact on the environment seen as a medium: a) finer tracking permits to build better user models and to integrate dynamically appropriate linguistic material to be combined with the present environment perception; b) innovative devices will impact on coordination with the environment medium; c) better referring to space (and time too) will provide on one side more accurate means to change position and focus of attention, and on the other side a better integrated multimedia, with language and environment more tightly coupled; d) text that are able to bring about context-dependent story
INTELLIGENT INTERACTIVE INFORMATION PRESENTATION
109
mechanisms, for instance building expectations and releasing tension, may involve the surprise that an unpredicted scene, or an unpredicted interpretation, will bring: the environment can have a substantial role; e) visit modalities conceived for specific categories of visitors, such as children, will typically involve specific skills, so that attention will not be drawn to the environment medium according to standard, descriptive rhetoric, but will cause new pragmatic interplay within various modalities; f) the social dimension in visits opens the way to a more developed subjective exploitation of the environment medium, to be reconciled through subsequent interaction among human agents; g) evaluation of human-system behaviour in multimodal physical environment-based situations will have an obvious fundamental role for understanding and helping the design of better systems. Some of these challenges will be taken up in the PEACH project, that will deliver its results in a three year period with experimentation at the Castello del Buonconsiglio in Trento. DFKI, partner in the project, will experiment the prototypes at the Voelklinger Huette, a world cultural heritage site dedicated to iron and steel industry near Saarbruecken.
8. CONCLUSIONS There is a strong need for intelligent information presentation in the domain of cultural heritage appreciation and tourism. The way in which the cultural experience is carried on has not changed much for centuries; and especially the young seem to require novel modes of being exposed to the cultural material, so that they would engage and entertain them. In this paper we have mainly presented the PEACH project, aimed exactly at this. The project builds on our previous experience with building location-aware systems able to present information in an adaptive way on one side, and the one with building multimodal interfaces on the other side. In PEACH, one of the salient elements is the emphasis on multimodality in the dynamic presentation. The system is intended to combine the dynamically adapted language-based output with a kind of dynamically produced visual documentaries. While the first part is based on macronodes, a technique we have been working at for years and continue to improve, the second is based on cinematic techniques. Underlying the whole approach is the adoption of rhetorical structure theory and the use of repositories of material to be adapted for the specific presentation case. The input to the system, comes from the location of the visitor and observations the system itself has made about her behaviour and presumed interests and what she has been exposed to and presented so far; the material is presented coherently throughout the visit. At the time of this writing the first phase has been concluded and a prototype for the Castle of the BuonConsiglio, the most important museum of Trento, is available as a proof of concept. In the last part of this chapter we have discussed a number of issues that are critical for the further development of the field. We also believe advancement in technology for cultural tourism will lead to advancement in the world of education,
110
STOCK, ZANCANARO AND NOT
of mobile services and in general to human interfaces. The ultimate goal in fact is to have systems that are nonobtrusive, that allow us to perceive the natural environment and integrate it, if useful, with a number of resources appropriately coordinated and coherent, with a result that is cognitively congenial and - why not? pleasurable. 9. NOTES
1 http://www.niceproject.com
10. REFERENCES Stock. O. (2001) Language-Based Interfaces and Their Application for Cultural Tourism. AI Magazine, Vol.22, n.1, 2001, pp. 85-97, American Association for Artificial Intelligence, Menlo Park, CA. Minghetti, V., Moretti, A., Micelli, S. (2002) Reengineering the Museum’s Role in the Tourism Value Chain: Towards an IT Business Model. In Werthner, H. (Ed.), Information Technology & Tourism. New York: Cognitant Communication Corporation. 4(2), 131-143. Stock O., Strapparava C., Zancanaro M. (1997) Explorations in an Environment for Natural Language Multimodal Information Access in M. Maybury (ed.) Intelligent Multimodal Information Retrieval. AAAI Press, Menlo Park, Ca./MIT Press, Cambridge, Mass. Not E., Petrelli D., Sarini M., Stock O., Strapparava C., Zancanaro M. (1998) Hypernavigation in the Physical Space: Adapting Presentations to the User and the Situational Context. In New Review of Hypermedia and Multimedia, vol. 4. Benelli, G., Bianchi, A., Marti, P., Not, E., Sennati D. (1999) HIPS: Hyper-Interaction within the Physical Space. In Proceedings of IEEE Multimedia System ’99, International Conference on Multimedia Computing and Systems, Firenze Busetta P., Serafini L., Singh D., Zini F. (2001) Extending Multi-Agent Cooperation by Overhearing. In Proceedings of the 9th International Conference on Cooperative Information Systems - CoopIS2001. Lectures in Computer Sciences vol. 2172, Trento, September. Stock O. and Strapparava C. (2002) Humorous Agent for Humorous Acronyms: The HAHAcronym Project. Proceedings of the Fools' Day Workshop on Computational Humor, TLWT-20, Trento Bordegoni M., Faconti G. Maybury M.T. Rist T. Ruggeri S. Trahanias, P. Wilson, M. (1997) A Standard Reference Model for Intelligent Multimedia Presentation Systems. Computer Standards and Interfaces, 18, pp. 477--496, 1997 Not E. and Zancanaro M. (2000) The MacroNode Approach: mediating between adaptive and dynamic hypermedia. In Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-based Systems, AH'2000, Trento, August. Feiner S., MacIntyre B., Hollerer T., Webster A. (1997) A Touring Machine: Prototyping 3D Mobile Augmented Reality Systems for Exploring the Urban Environment. In Proceedings of ISWC 97 (Int. Symp. on Wearable Computing). Cambridge, MA. October. Marti P., Gabrielli L., Pucci F. (2001) Situated Interaction in Art. Personal Technologies, 5:71-74. Baus J., Krüger A. and Wahlster W. (2002) A resource-adaptive mobile navigation system. Proceedings of IUI2002: International Conference on Intelligent User Interfaces 2002,ACM Press. Stock O. ed. (1997) Spatial and Temporal Reasoning. Kluwer Academic Publishers, Dordrecht. Galton A. (1997) Space, Time and Movement. In O. Stock (ed.) Spatial and Temporal Reasoning. Kluwer Academic Publishers, Dordrecht. Rahlff O.-W. (1999) Tracing Father. In Proceedings of i3 Annual Conference, Siena, October . Marti P. and Lanzi P. (2001) I enjoyed that this much! A technique for measuring usability in leisureoriented applications, In Joanna Bawa & Pat Dorazio, The Usability Business: Making the Web Work.
INTELLIGENT INTERACTIVE INFORMATION PRESENTATION
111
Stock. O. (1996) Password Swordfish: Humour in the Interface. In Proceedings of the International Workshop on Computational Humour, TWLT-12 Enschede. Woodruff A., Aoki P.M., Hurst A., Szymanski M. (2001) The Guidebook, the Friend, and the Room: Visitor Experience in a Historic House. Extended Abstract, ACM SIGCHI Conf. on Human Factors in Computing Systems, Seattle, WA, March. Malaka R. and Zipf A. (2000) Deep Map – challenging IT research in the framework of a tourist information system. In D. R. Fesenmaier, S. Klein, D. Buhalis (eds): Information and communication technologies in tourism 2000, pp. 15-27, Springer-Verlag, Wien, New York, 2000. Mann W.C. and Thompson S. (1987) Rhetorical Structure Theory: A Theory of Text Organization. In L. Polanyi (ed.), The Structure of Discourse. Ablex Publishing Corporation. Spohrer J.C.. (1999) Information in Places. IBM Systems Journal Vol. 38 n.4, pp. 602-628.
11. AFFILIATIONS Oliviero Stock Elena Not Massimo Zancanaro ITC-irst Trento, Italy
T. RIST
SUPPORTING MOBILE USERS THROUGH ADAPTIVE INFORMATION PRESENTATION
1. INTRODUCTION Recently, the coalescence of mobile telecommunication and portable computing has brought about a broad variety of new mobile appliances. Cellular phones with built-in micro computers are no longer just mobile telecommunication terminals but also serve as organizers and mobile Internet terminals. In turn, pocket computers, such as PDAs benefit from built-in telecommunication technology since they can now become mobile nodes of computer networks, and some of them work as mobile phones as well. In addition, there is a third strand of technology development motivated by the aim to empower all kinds of portable everyday gadgets, such as wrist watches, clothing, and also jewelry with computing and wireless communication functions. To provide a flavor of the ongoing convergence of technology strands, Fig. 1 illustrates a pedigree of some popular mobile devices. By the end of the last decade, the era of the “mobile internet/internet everywhere” has been announced and the first generation of mobile phones with built-in browser software for the display of data that are retrieved wireless from remote servers. In Europe the de facto standard WAP1 has been the choice of technology for wireless data connection, whereas Japan and other Asian countries introduced proprietary technologies, such as the i-mode system by NTT-DoCoMo, EZ-Web of the KDDI Corp., and J-Sky by J-Phone. In Europe, however, the predicted market penetration of WAP services did not take place for a number of reasons. Firstly, the display capabilities of early WAP devices as well as browser functionality were rather restricted. For instance, in principal it was possible to convert an arbitrary HTML2 web page, that entailed text, into a pile of so-called WML3 pages ready for wireless download and display in a WAP browser of a mobile device. Due to limited screen real estate, however, only a small fragment of the text could be displayed at once. Reading a larger text, say of A4 page size, requires the user to perform extensive scrolling and paging operations. More severe weighs the fact that many users experience such reading conditions as too awkward and therefore hardly use this option. To attract more users several attempts have been made by service providers and portal owners to prepare customized content for WAP clients. Unfortunately such extra efforts caused additional costs which could not be easily charged on top of the relatively high costs WAP-users had to pay already to their telecom providers for the “air-time” when downloading the pages. 113 O. Stock and M. Zancanaro (eds.), Multimodal Intelligent Information Presentation, 113–139 © 2005 Springer. Printed in the Netherlands
114
RIST
Since the launch of the first WAP services some progress has been made concerning both the wireless data transmission systems and the capabilities of new devices. With the introduction of techniques like HSCSD4 and GPRS5 more data can be transferred at significantly reduced download times. In addition GPRS paved the way for a billing structure that is based on data volume rather than on air-time.
Figure 1: Pedigree of some popular mobile devices
Therefore, users can stay connected to the internet as long as they want and get basically charged only for the amount of downloaded data. As to devices, display screens now tend to be slightly larger, have a much higher resolution and support color. This so-called 2.5th generation (2.5G) of mobile phones certainly goes well beyond its predecessor generation – at least in Europe. In fact, the Japanese approach to wireless Internet seemed to skip some of the intermediate stages we had in Europe. With the introduction of i-mode in 1999 their era of 2.5G mobile internet had begun at least 2 years earlier than in Europe By the end of 2003 the third generation messengers will already be in use. The transition from 2G/2.5G to the yet pending introduction of 3G wireless devices and broadband communication infrastructure, such as UMTS6 in Europe, will boost the development of adaptive services for mobile users for at least the following reasons: Availability and fast transfer of rich media formats. Multimedia messaging is currently a buzz word heavily used by marketing people to quicken the user’s appetite for the upcoming era of mobile telecommunication. In fact, many of the new devices will be equipped with high-resolution color screens, some with cameras
SUPPORTING MOBILE USERS THROUGH ADAPTIVE PRESENTATION
115
and some others will even support the simultaneous use of different input modalities, such as voice, touch screen, and keyboard. Improved location awareness. A number of applications – mainly navigation systems - are already able to take into account a user’s current position, e.g., by taking advantage of GPS (Global Positioning System) or phone cell information with which a mobile phone is currently registered. However, GPS only works outdoors and requires either a special device or an extension board. On the other hand, with a resolution between approx. 30 meters under optimal conditions and some hundred meters in worse cases phone cell information is often too imprecise for many location-based services. In contrast, the upcoming 3G devices and infrastructure promise a localisation resolution in the range of less than 10 meters. This will still be worse than GPS but the important point is that it may work in- and outdoors without requiring hardware extensions. Download of executable applications. Perhaps the most important aspect concerns the ability to download and execute software on 3D devices. That is, 3G devices are no longer restricted to built-in firmware but provide open platforms for all kinds of software applications – only restricted by storage and processor performance capabilities. For instance, similar to the introduction of so-called Java applets to enrich web pages in the second half of the 1990s, Java programmers have now started to develop so-called Midlets7 which are small Java programs that can be executed on mobile devices running a mobile edition of a Java virtual machine. Looking 4 to 5 years further into the future, that is moving downwards in the pedigree depicted in Figure 1, we can already anticipate the next but one generation of mobile devices (4G and beyond) that are likely to take advantage of various input sensors, e.g., to measure orientation, acceleration, temperature and so forth. No doubt, this will again open another strand for new services to benefit mobile users. While there is no doubt that upcoming new mobile appliances will provide a high potential for a broad range of new information services that can accommodate specific needs of mobile users, a particular challenge for service providers will remain. It is a matter of fact that any new generation of devices and software tools will hardly substitute the previous generation immediately. This phenomena is commonplace in the IT industries and therefore many software manufacturers and operators of web portals feel obliged to support some downward-compatibility of their products in order not to exclude potential customers. Unfortunately, compared to the “PC world” the situation is more challenging since the spectrum of mobile devices is already much more diverse in terms of hardware and software capabilities and it is likely that this diversity will continue to increase. As to hardware, differences are due to broadly varying display capabilities given by different screen sizes, resolution, and available colors if color display is supported at all. As to software, one has to consider that mobile editions of browser and display programs are usually less powerful than their counterparts running on a PC, and those mobile editions of application software may struggle with dramatic performance differences from one mobile device to another. In the sequel we first refer to some related work and then sketch several different application scenarios to illustrate and motivate the need for techniques that aim to customize information presentations for mobile users.
116
RIST 2. RELATED WORK
Depending on the specific methods to solve adaptation tasks that will be introduced in the proceeding sections, similarities to work done by others exist, especially in the areas of (i) adaptive information delivery and presentation, (ii) navigation services for mobile users, and (iii) shared information representations in distributed systems. Concerning adaptive information delivery and presentation most of the existing approaches are based on proxy solutions. In essence, an adaptation proxy sits in between an ordinary web-server (or any other remote information source) and a client. The proxy performs specific tasks that aim to meet the adaptation goals of the client. This may comprise, for instance, filtering, aggregation, transformation and (re)formatting of fetched web content. Some adaptation proxies have been developed to compensate bandwidth bottlenecks in networks and especially in mobile networks (Brewer et al. 1998). One of the most popular proxies that aim to compensate for device limitations is the Top Gun Wingman, a graphical, split Web browser for the Palm Pilot PDA, developed by Fox et al. (1998). Recently, more complex proxy architectures have been proposed that, besides device adaptations, also address personalization issues. For instance, the APRO proxy server by Kylänpää & Heino (2000) enables a user to configure an own preference profile with an ordinary web interface (e.g., using HTML forms). If this user accesses the proxy with a mobile device, the specific profile will guide the adaptation process at the proxy and thus enables a more personalized adaptation. The iMobile proxy even foresees a personal agent that maintains a user profile that accesses and processes Internet resources on behalf of the user, keeps track of user interaction and performs content transformations according to the user and a device profile (Rao et al. 2001). The DailyLearner (Billsus et al. 2000) is another personalized agent running on a proxy. The system creates a statistical interest profile of the user to determine which information shall be sent to the mobile device. Personalization in this case is intended to save both, data transmission costs, as well as the user’s effort to read news which do not fit her interest profile. The introduction of a standardized version of so-called Composite Capability/Preference Profiles (CC/PP) as well as the use of more fine grained user models (Marucci & Paternò 2002) will further support more personalised adaptations and scalability of mobile access to web content. The adaptation process itself is often driven by meta information that is already coded in the representation of the information source. For instance, Gaedke et al. (1998) proposed a special purpose markup language called WCML (Web Composition Markup Language) for an object-oriented approach to compose presentations from a repository of device specific style components. In contrast, the approach by Hori et al. (2000) relies on a set of special purpose annotations that provide a proxy with information about the relative importance of content in a web page. Their proxy entails a so-called document clipping plug-in to portion content into smaller pages that match the available screen size of the mobile target device. To avoid extra efforts in the content authoring phase, Gonzalez-Castaño et al. (2002) propose an automated mechanism that superimposes a symbolic abstraction
SUPPORTING MOBILE USERS THROUGH ADAPTIVE PRESENTATION
117
hierarchy on the original document. Rather than accessing the original document with a PDA, the mobile user receives a hyper-document and uses the abstraction hierarchy to navigate through the single parts. Han et al. (1998) propose a transcoding policy that a proxy may apply to images. In their work, image transcoding means applying compression mechanisms to reduce the data volume with the ultimate aim to reduce download time. However, since time is also needed for the compression, they try to estimate the trade-off between adaptation costs and actual gain in terms of reduced download time. Mohan and colleagues (1999) describe a transcoding framework with the aim to enable universal access to multimedia web content. They propose a set of content transcoders that transform high-resolution source media objects into target media objects that typically have a lower resolution. In the case of images, they perform a semantic classification of the source image to decide on the transcoding mechanism to apply. While somewhat similar to the ImageTrans system that will be presented in Section 3.3.3, the assignment of image classes and transcoders is done once manually and does not involve learning. Most of the scenarios that will be presented later on in this paper assume a proxy where adaptation is done, too. For instance, the IMAGEN system described in Section 3.1 is essentially a proxy server, and image transformation for WAP clients is also a typical service to be carried out at a proxy. The popularity of proxy solutions for mobile content access has two major reasons. Firstly, so far the communication bandwidth of a wireless link has been either quite restricted or quite expensive. Secondly, most of today’s mobile devices are yet not powerful enough to perform complex adaptation tasks themselves. In the near future, however, this may change and part of the adaptation will be done by the devices themselves. This may lead to heterogeneous approaches of application aware adaptation as proposed in the Odyssey project (Noble 2000). To cope with adaptation issues which arise due to variations in the available bandwidth of a wireless connection, the authors argue that neither the operating system (which might include a proxy) nor an application alone should decide on the adaptation strategies to overcome bandwidth problems. Addressing a basic need of many mobile users, various navigation services have already become available. Most web-based routing services support mobile access and provide presentation variants tailored to different categories of mobile devices (e.g., for WAP phones and PDAs). Companies, such as Fontus Inc.8 even aim at users of 3G wireless devices and have begun to produce sophisticated 3D maps and display software that runs on 3G devices. As to the design of graphical navigation aids for mobile devices with sparse display capabilities Chincholle et al. (2002) note that the download of traditional miniaturised maps to a WAP enabled cell phone contributes little to the mobile user's demands whereas route directions were considered valuable. A comparison of different presentation formats for route instructions on mobile devices can be found in Kray et al. (2003). An interesting approach for the generation of route maps has been presented by Agrawala and Stolte (2001). They have developed a set of generalization techniques to improve route map usability. Taking into account aspects of human cognition their LineDrive system generates route sketches that show distortions in road lengths, angles and shape quite similar to distortions found in sketches produced by humans when asked
118
RIST
to sketch how to drive from one location to another. Compared to conventional maps, sketches produced by LineDrive have the advantage that they need less screen real estate for display and thus are especially suitable for PDAs. The Shopping Companion that we will sketch in Section 4.2 shares some similarities with other special purpose navigation aids, such as mobile museum guides (Not et al. 1998) and other shopping guides. The potential of such guides has been studied, for instance, by Bohnenberger et al. 2002. They conducted an empirical study to compare performance and user appreciation of two shopping conditions, one in which subjects were provided with a PDA-based shopping navigator versus one in which subjects had only a printed map (i.e., a 2D floor plan) at their disposal. They found that the format of assistance did not significantly impact the performance of concrete shopping tasks (i.e., time needed to localise products), however, most of the subjects preferred the PDA version over the classical map. Finally, our work on multi-user systems, such as the shared map application that will be presented in Section 5 relates to work done in the area of CSCW (Computer Supported Cooperative Work). In the context of a CSCW system with shared graphical output, Stefik et al. (1987) have coined the term “relaxed WYSIWIS (What You See Is What I See)”. In contrast to “strict WYSIWIS", where all collaboration partners obtain exactly the same graphical displays, relaxed WYSIWIS allows to be more selective on what is to be shared by other collaboration partners and how shared information gets individually presented. The so-called abstractionlink-view concept by Hill (1992) links different views to a shared data abstraction of the application semantics via constraints. During a multi-user session different views are dynamically updated in a way so that the specified constraints remain satisfied. The CSCW toolkit GroupKit (Roseman & Greenberg 1996) relies on an open protocol approach that allows application programmers to implement and coordinate tailored views on shared data abstractions. The toolkit Weasel (Graham & Urnes 1992) comprises a so-called Relational View Language for specifying customized views and their dynamic behaviors during a session. The notion of so-called tailorable user display agents has been introduced by Bentley et al. (1994). Sitting between the shared data representation and a user’s display, the agent is responsible for the (possibly) user-specific style of information display and also for the continuous update of the display during a session based on a filed mechanism for event-handling. While the purpose of these approaches is view adaptation, their first place intention was not directed towards the support of mobile users. But apart from a few exceptions, the involvement of mobile and heterogeneous communication devices has been rarely addressed in CSCW research. Myers et al. (1998) describe a collaborative environment in which PDA’s are used as remote control devices for the manipulation of a display that is shared by all users. A combination of multiple PDA’s with a shared large-scale whiteboard is also used by Greenberg and Boyle (1999). In their SharedNotes system, the PDA’s are used as input devices for private notes which can, however, also be made public by sending them to the shared whiteboard. Nevertheless, the primary focus of this work has been on maintaining private and public information in a multi-user application but less on view adaptation to account for specific device limitations.
SUPPORTING MOBILE USERS THROUGH ADAPTIVE PRESENTATION
119
Figure 2: Core Components of the IMAGEN Platform
3. ACCESSING THE WORLD-WIDE WEB WITH MOBILE DEVICES While it is debatable whether web-browsing belongs to the most useful applications for the users of mobile devices, there is no doubt that these appliances provide a high potential for a broad range of new information services that can accommodate for the specific needs of mobile users. Let us consider a first example that is taken from the EU-funded IMAGEN9 project. 3.1 Access to a Personalized Art Portal
The objective of IMAGEN is the development of a platform for the set-up and operation of portal services that aim at customized publication and distribution of multimedia content. A first pilot application of IMAGEN is the MyArt online service whose major business objective is to establish a communication channel between a large Italian publisher in the art domain and the community of English speaking Internet users interested in Renaissance arts. IMAGEN can be conceived as a mediator between content authors (1st-level users) and content customers (3rd-level users). In between we have to consider a further user group, usually consisting of publishers, content syndicators/owners and administrators of Web portals. Such 2nd-level users may purchase media assets from a number of different 1st-level users and compile customized content packages for their 3rd-level users. The platform and its core components are roughly sketched in Figure 2. The basic contents, so-called content units, that are to be distributed by IMAGEN are
120
RIST
stored in a number of possibly dispersed content repositories. A content unit consists of a small composition of multimedia assets, such as texts, graphics, images, sound and video clips, and programmed interactive units, e.g., Java applets or Flash animations. The composition is assumed to be self-contained and is specified in an XML format. A graphical editor, such as the eXact Packager10 may be used by authors for the creation of content units.
Figure 3: Delivery of customized art collections to different PC users
In IMAGEN the idea is to provide 3rd-level users with content packages, which are created on the fly based on content that is of interest to them. Therefore, in response to 3rd-level user requests, IMAGEN creates and delivers customized collections of content units. As illustrated in Figure 3, delivered packages usually differ with regards to content and layout style depending on profile information about the users. For the representation of content packages IMAGEN draws on XML-based standards developed by the IMS consortium (see http://www.imsproject.org/) for representing tutorials and multimedia content packages. The process of delivering content from the content provider to the content user is divided into three major steps: Content Management, Layout Management, and Transaction Management – which are implemented by separated components. Details about the single components and processes can be found in Kraus et al. (2002) for personalised content selection, and Kröner et al. (2002) for constraint-based layout. For the purpose of this paper, we restrict ourselves to layout customization for mobile users. 3.2 Content Delivery to Mobile Devices with Small Screens
There are a number of mobile devices, such as pocket computers and PDAs, that feature medium-size screens, typically in the range of [200 to 550] x [200 to 550] pixels. However, there is no standard screen shape and size apart from the fact that all screens are rectangular and that PDA screens tend to have a more balanced length
SUPPORTING MOBILE USERS THROUGH ADAPTIVE PRESENTATION
121
and width ratio than screens of pocket PCs. On the other hand, an important common feature of these devices is that they usually run at least one mobile edition of a PC-style Web browser. Therefore, layout customization for the purpose of the MyArt application can be achieved by providing the Web browsers with suitable settings of style attributes, such as font families and sizes as well as dimension parameters for embedded images.
Figure 4: A MyArt content package customized for the Nokia 9210 Communicator
In IMAGEN, a constraint-based layout manager can take into account parameters, such as screen size and resolution, and availability of colors to compute an optimal assignment of style parameters for a content package. Once a suitable assignment of style attributes has been determined, syntactic transformations can be applied to make further device-specific adaptations concerning factors, such as hardware limitations, design of physical controls, and limitations of the installed web browser. For instance, the design of some devices invite their users to scroll, whereas leafing through a presentation will be more convenient on other devices. Since in IMAGEN both content units and content packages are represented as XML documents, syntactic transformations can be performed easily by applying XSLT11 transformation rules that transform XML documents in other XML documents. In our current experiments, we use a number of different mobile devices with relatively small screens - among others a Nokia 9210 Communicator. In contrast to a typical PDA, this device has a 640 x 200 pixel color display and a built-in web browser that supports basic HTML features. Indeed, our test cases confirm that suitable presentations can be produced with IMAGEN’s layout manager. A showcase of accessing MyArt with the Nokia Communicator is shown in Figure 4. To account for the small screen of this device, package layout has been adapted in several ways. For instance, a constraint has been added that gives priority to larger fonts since due to a higher screen resolution fonts appear smaller on the Nokia 9210 than on an ordinary PC screen. To reduce the need for extensive scrolling, content will be portioned into several pages that are compliant with the available display space of the screen. Finally, an XSLT transformation is applied to refine the page layout and to connect these pages with a navigation bar that appears in the upper left corner of a page. Similarly, the layout of MyArt content packages can be customized for display on a PDA. Customization trials included the Sharp Zaurus SL5500 device, and an HP Jornada handheld.
122
RIST
3.3 Mobile Devices with Tiny Screens
Today, the delivery of content to small mobile devices, such as contemporary cell phones, remains constrained in terms of bandwidth, memory, and processing power. In the near future, however, rapidly improving hardware and infrastructure will make these limitations disappear. Nevertheless one problem will not be solved in this way – a great deal of mobile devices will still come with tiny displays and this will impose strong constraints on how content can be presented and accessed. 3.3.1 Transcoding Content
A popular approach to deliver web content to mobile devices with tiny displays is called web content transcoding, e.g., see Britton et al. (2001). This technology aims to adapt content according to constraints and preferences associated with a specific device. In the case of text, a number of approaches exist to transform text represented in an XML format into so-called stacks of WML (Wireless Markup Language) pages, or into lean versions of HTML / XHTML, such as cHTML (compact HTML), HTML-4-Mobile, and XHTML Basic12. Commercial products (including for example IBM’s WebSphere™, or xCA.suite by xCA.com) are available to achieve this task. However, preliminary tests conducted at DFKI with the aim to transcode MyArt content units into WML pages showed that a mere syntactical transcoding resulted in huge piles of WML pages. Such presentations are cumbersome to handle with a tiny display and weak navigation facilities, and thus hardly meet the idea of providing 3rd-level users with easy to use content access mechanisms. Even worse is the situation for non-text media types, such as images and graphics. Even the transformation of static graphics still remains a challenge. Neglecting graphical representations completely may be acceptable if these representations serve as decorations only. Unfortunately, however, there are many examples of services which can hardly do without graphical representations to encode information. In the case of the MyArt service, it is certainly doubtful whether one should try to adapt images of art works so that they can be displayed on tiny screens. But there are many other services that can benefit mobile users. For instance, at DFKI we have developed a personal travel assistance service (André & Rist 2001) and a service called Personal Picture Finder (Endres et al. 1999). In their original versions both services have been designed for PC users who are assumed to interface with the services via an ordinary WWW browser. In response to a user query the travel assistance service usually returns a mixture of text and graphics, e.g. including images of hotels and resorts as well as maps. Moreover, the Personal Picture Finder frequently returns only pictures as a search result. 3.3.2 Image Transformations
To enable access to graphical web services by means of mobile WAP phones with tiny displays, we have investigated several approaches to the transformation of graphical representations. By a transformation we understand an operation that can
SUPPORTING MOBILE USERS THROUGH ADAPTIVE PRESENTATION
123
be applied to a source image in order to obtain a target image that meets the particular requirements of the target display environment. Transformations can be complex in the sense that they can be described as sequences or configurations of several less complex operators. Targeted towards an adaptation of graphics for tiny displays a graphics transformation usually comprises a down-scaling operation and a reduction of colours. Also, there may be transformations that take into account a number of additional parameters (e.g., height and width of the target graphics) to adjust the transformation process. Unfortunately, a certain transformation T may lead to reasonable results for some source images while it provides unacceptable results for others. The problem is illustrated in Figure 5. Both source images have been converted with one and the same transformation, in this case using the Ginco “Pic2WBMP”13 image converter. However, similar problems arise when using other available products for image manipulation with built-in facilities that automatically convert images into formats that fit onto the display of mobile phones. This observation motivates the configuration of different transformations for different image classes. For a new source image, we need to determine an appropriate image class and apply the transformation which has been associated with that class. This approach raises two issues: (i) how to classify images, and (ii) how to configure appropriate transformations for the specific image classes. Secondly, we compute for each source image a characteristic feature vector that can be used later on to classify the image. An extreme case would be to take each pixel of the image as a single feature. However, this would be impractical as images often have hundred thousands if not millions of pixels. To serve their purpose, a feature vector should be a condensed representation of an image and the features should relate to an image’s color and luminance distribution in one or the other way. In fact, the exact computation of a feature vector is a second-rate issue. It is more important that an image vector covers a sufficient number of features (e.g., several hundreds of features) to ensure that different source images get characterised by distinct vectors.
T
T
Figure 5: Left: Applying a transformation T to a source image to obtain a small black and white target image that can be displayed on a mobile device. Right: Applying the same transformation to another source image produces a less satisfactory result
124
RIST
3.3.3 A Trainable Image Transformation System
Our idea was to deploy machine learning techniques to establish an assignment between recognized features of an image on the one hand, and available transformations on the other hand. That is, in a training phase, a graphics design expert manually assigns transformations to source images and thereby allows the system to recognize and generalize correspondencies between image features and transformation parameters. A screenshot of our trainable image transformation system is shown in Figure 6. When loading a source image into the system, the image is displayed in the left part of the frame and an analysis is carried out to construct a feature vector that can be used to classify images. One extreme case is to consider each pixel as a feature. Unfortunately, such long vectors would be unpractical as input for a learning system. In our current test setting we use about 430 features that result from regional color distributions of an image. To calculate regional distributions we overlay a source image with a grid and determine the histograms for the cells. The size of the current feature vector has been determined experimentally using smaller and larger vector sizes. However, more extensive testing with other feature vectors would be a reasonable strategy to optimize the approach. Our repertoire of transformations comprises eight different transformations and is yet quite small. The result of applying each of the eight available transformations to a source image is displayed on the right-hand side of the screenshot in Figure 6. In the learning or training phase, a graphics design expert would tell the system which of the eight transformations yields the best result for a given source image. In the test phase, the system is asked to recommend one of the eight transformations on its own for a new image that was not part of the training set. To transform the sample source image shown in Figure 6 into an image that can be displayed on a mobile WAP-phone, the system recommends T2. In cases where the system remains undecided which of the eight transformations to recommend, it may not recommend any of them, but tick the “nil” option T9 (see Figure 6). We trained the system with a set of 130 images including portraits of persons, images of landscapes, in- and outdoor images of buildings, images of everyday objects, and photographs of artworks. While we have experimented with several different learning methods provided by a software package for machine learning (Witten & Frank 2000), a standard C4.5 decision tree algorithm has proven most useful for this task. To assess the performance of the trainable image transformation system from a user’s point of view, some feedback was gathered during the Comdex computer faire in Basel in September 01. Stand visitors were invited to participate in an informal evaluation. A web cam was used at the stand to take portrait pictures (size 320 x 240 pixels, 256 colours) of stand visitors. Such portraits were then loaded into the image transformation system in order to obtain small size (60 x 45 pixels) black and white portraits that could be loaded and displayed in a WAP-browser of a mobile phone. A total of 61 visitors participated in the exercise and attested the system a reasonably good average performance of 3.3 points on a 5 points scale (1 = not acceptable, 2 = poor, 3 = acceptable, 4 = good, 5 = very good). However, this
SUPPORTING MOBILE USERS THROUGH ADAPTIVE PRESENTATION
125
ranking was relative to the available 8 transformations that the system could choose. For further details of the evaluation and the ImageTrans System we refer to (Rist & Brandmeier 2002).
Figure 6: Screenshot of the trainable image transformation system ImageTrans. After the training phase the system recommends one of several available transformations for yet unseen source images (left). The transformed image can be displayed in a WAP phone (right).
3.3.3 Semantic-based Translation of Graphics
The ImageTrans system introduced in the previous section aimed to modify a source graphics on the picture level by means of an image transformation. Another approach is inspired by the semantic-based translation of language. That is, a text given in one language is read and analysed to capture its meaning. Once the meaning is understood, a corresponding text can be generated in the target language. Similar, the translation of graphics does not aim at a modification of the source image. Rather, the idea is to first perform a deeper semantic analysis of the source image to obtain a content description, and then generate a new target image that still preserves the essential semantic contents, but at the same time suffices particular style constraints. We illustrate this approach by a translation of a route description as it may be obtained from a routing service available on the web (cf. left frame of
126
RIST
Figure 7) into stylistic variants that can be displayed on a WAP phone (cf. set of frames on the right-hand side of Figure 7). In this particular example the analysis of the map display is relatively easy to perform since a special colour coding scheme is used in the map generation process. For instance, the colour blue is exclusively used to highlight the requested route from its start to its end point and colours used to depict streets are not used for other purposes elsewhere in the map. Therefore, it is possible to apply a scan-line algorithm that identifies the highlighted route segments in the map. It is then also possible to construct a semantic representation of the route that contains all crossings between the start and the end point. Such a semantic description can be represented, for instance, in an XML-compliant format for further processing. Once a semantic representation of the route description has been extracted from the source map, various stylistic variants can be generated that meet resource limitations of the output device and/or a user’s personal style preferences. Since in the example the task was to display the route description on a small display of a WAP-enabled mobile phone, a reasonable approach is to segment the route into single manoeuvre points and to generate a schematic visualization for each corresponding manoeuvre point. However, as illustrated on the right-hand side of Figure 7, a manoeuvre like “turn left on next crossing” can be visualised in many different ways.
Figure 7: Translation of a map showing a route description into stylistic variants for display on a mobile WAP phone (here a Nokia 7110).
Unfortunately, the presented translation approach will only work for the particular map service used in the experiment. However, the development of
SUPPORTING MOBILE USERS THROUGH ADAPTIVE PRESENTATION
127
information extraction algorithms for visual media is an active field of research that may lead to usable results at least for restricted domains. 4. EMERGING NEW SERVICES FOR MOBILE USERS In the previous section, we focused on the aim to provide mobile users with access to web services that are primarily designed for classical use with a stationary PC. However, mobile users do not necessarily have the same information needs and Internet usage behaviour as if they were accessing the web with a PC from home or from the office. Rather, to increase the benefit of services provided to mobile users, it is important to explore the actual information needs of this user group. In our search for dedicated services for mobile users, we observed that added-value for mobile users can emerge when already existing services are combined. In this section. we will present two examples – the first one will again be taken from the IMAGEN project. 4.1 Informing About Local Art Exhibitions
The MyArt portal introduced in Section 3.1 is a typical web service that invites interested users to subscribe. For the purpose of illustration, let us consider a mobile user visiting Florence. When in Florence, the user might get interested in accessing MyArt to receive information about artworks that are on display in the Uffizi museum to decide on whether or not to visit the museum. Having only a WAP phone with her for accessing the MyArt portal it wouldn’t be a good idea to compile an ordinary content package even if it contained only descriptions of pieces that are displayed in the particular museum. As an alternative proposal, we suggest to compile a package that comprises only a few selected artworks and present this to the user in the form of a virtual walk through the museum. In this case it is not only the amount of delivered information but also the form (i.e., an exploratory walk-through) that may attract a mobile user to actually browse through the provided presentation. Figure 8 illustrates this application. As input to presentation generation serves a floor plan of the exhibition, and a set of content units, which correspond to the exposed artworks. The user may browse through the exhibition by following a predefined path. Waypoints of this path are presented as small 3D graphics representing rooms and floors of the exhibition building (see the sequentially aligned screens in Figure 8). Some of the graphics include icons which indicate points of interest, e.g., artworks exposed at this location. By selecting the icon, the exposed content units can be accessed (see the 4th waypoint). In this scenario the user herself initiated a request for information about a certain museum. In a further MyArt version, however, a registered user may also wish to get actively informed about upcoming and ongoing art exhibitions that seem of interest according to an available profile and also her current location. Such an active “push” service is already possible by means of a so-called LBS (Location-Based Services) server provided the user agrees on getting localized (via the cell phone
128
RIST
carrier) and also agrees to receive unsolicited notifications, e.g., in form of an SMS message.
Figure 8: A virtual walk through a museum performed on a mobile WAP phone
A recent survey undertaken by HPI Research in June 2001 deserves mentioning in this context. A total of 3300 people in eleven different global markets were questioned what they think about direct marketing concepts based on mobile phones. Interestingly, 88% of respondents stated that they would be open to receiving ads, while 31% would welcome marketing of this type. A major conclusion of the study was that the success of location-based direct marketing concepts depends on a number of key factors. For instance, even those who appreciated the service envisaged situations in which they want to turn it off. Therefore, it seems important to leave with the user the choice whether or not to receive ads. Also, users must not be enforced to read the messages completely. Rather, they should be given means to bypass and delete messages easily and quickly. Customization is another important aspect. Most users sought a means to customize the notification service so that they would only receive messages that match their interest profiles. Lastly, users expect a mutual benefit. This could be, for instance, an electronic coupon that qualifies the receiver to get a discount when buying products from the advertising party. Also, 86 percent of the respondents stated mobile marketing would be acceptable if it helped keep the cost of service down, that is, they expect a reduction in the cost of mobile service in exchange for getting ads.
SUPPORTING MOBILE USERS THROUGH ADAPTIVE PRESENTATION
129
4.2 Enriching the Traditional Shopping Tour in the City
In Figure 8 we sketched a service for mobile users of WAP phones. This service can be conceived as a combination of a navigation application and an image access service. The idea to combine existing services is currently put forward in the project ORBA that has been initiated by the federal state of the Saarland as an accompanying measure to the regional set-up of the UMTS network with the purpose to stimulate development of UMTS services by local companies. The ORBA project aims at an enrichment of a classical shopping visit in the city. This goal is motivated by the observation that traditional forms of business, such as real shopping tours in the city, are unlikely to get replaced by eCommerce services in the near and midterm future, but could be enhanced in many ways by 3G devices and wireless net infrastructure. In contrast to the afore mentioned virtual museum tour that was designed for 2.5G WAP phones, the upcoming new generation of devices will allow for multitasking to a certain extent. The practical value of this feature to interlock applications can be nicely demonstrated on a pocket computer or a PDA. In the context of the ORBA project, we are currently developing a shopping companion that is composed of an interactive map application and a web browser. As shown in Figure 9, both applications run concurrently and appear side-by-side on the screen. An interesting feature of the shopping companion is that both applications communicate which each other to enable a flexible assignment of control and mutual adaptation of information content and presentation style among the two applications. For instance, if the user selects a certain location on the map display, the selection can force the web-browser application to perform a search for more information about this location. Note that in this context the word “home page” may take on a literal meaning.
Figure 9: A navigation tool combined with a web browser running on a Nokia 9210 (left) or likewise on a Sharp PDA (right). Activities in one application can influence content and presentation style in the other application.
130
RIST
Vice versa, if the user initiates a content-oriented search for a certain product or shop, the web-browser not only provides a list of search results but also provides the map application with the list of selected target locations so that they can be shown and highlighted in the map display. A further example relates to focus and level of detail of the displayed information in both applications. In the map application the user can perform either a zoom-in or a zoom-out operation in order to shift focus and increase or decrease the level of detail. For instance, if the highest level of detail is chosen, the user will be able to tell apart single buildings. On the other hand, only a small region can be shown. When zooming out, less details are recognizable but a larger region comes into focus. Interestingly, it is possible to mirror a rough counterpart of zoom-in and zoom-out operations in the web browser application. That is, if the map is shown at a low level of detail and thus a larger region is in focus, the web browser will provide a simple list of all business units that are located in the displayed map area. Zooming into the region not only updates the map display but also causes the web browser to switch (at a certain zoom level) to a more focused display of the locations that are now in focus. In our current implementation this intermediate display takes on the form of a list of entries as found in the yellow pages of a phone book. Zooming further into the map will eventually cause the web browser to display a homepage of the shop or building in focus. In addition, movements of the user can serve as input that controls both applications. When in “tracking mode” the user’s position will be continuously updated on the map display. Therefore, walking along a path will change the focus of the map display. Again, this information can be used to change the focus of the information display in the web browser as well. For instance, suppose the user walks along a shopping street that is displayed in the map view together with a blinker that indicates the user’s current position. Aligned to this, the web browser display shows a two-column table where each column lists entries for shops that are located to the left-hand and right-hand side of the street respectively (assuming the same orientation as the map display). Since not all entries for the shops in this street fit on a single screen, only shops in the current vicinity of the user are displayed. When moving along the street the map display updates itself to show the user’s new position whereas the web browser updates its display list as if a scroll operation was performed. Since this is work in progress, it is yet an open question to what extent town visitors will actually take advantage of such a Shopping Companion and its suggested operation modes. While our developments demonstrate the technical feasibility of the service ideas, there are more factors that will eventually have an impact on user appreciation. For instance, the examples illustrated above presuppose an extensive and up-to-date content repository that maintains interesting information of business units and sightseeing attractions located in the shopping region under consideration. And, as taught by the WAP experience, pricing policies of the service providers will play a decisive role, too.
SUPPORTING MOBILE USERS THROUGH ADAPTIVE PRESENTATION
131
5. MULTI-USER COLLABORATIONS WITH HETEROGENEOUS DEVICES While the examples in the preceding sections aimed at single users, we now turn to the area of mobile multi-user applications. After all most of the mobile appliances are telecommunication devices and therefore suffice the multi-user criterion per se. However, the combination of portable computers with telecommunication capabilities goes well beyond the classical view of audio-based user-to-user telecommunication but at the same time raises a number of new questions. For instance, how will groups composed of mobile and possibly stationary users communicate with each other using multiple media to express themselves? How will people use combinations of text and speech if both forms are available? How will the style of communication change if people use communication terminals that are not equally rooted in terms of communication media and bandwidth? In the case of a user who missed a previous meeting, could it be useful/possible to provide a structured view of previous exchanges of text and audio contributions made by the participants? In the context of the EU-funded project Magic Lounge*(Bernsen et al. 1998, Rist et al. 2000) we were especially interested in the question of how to overcome communication barriers imposed by the use of heterogeneous communication terminals, e.g, a PC connected with mobile devices, such as a PDA and a 2.5G mobile phone (cf. Figure 10). In the following subsections we present two scenarios that have been investigated within the Magic Lounge project.
Figure 10: A group of geographically dispersed people communicate through the Magic Lounge server using heterogeneous communication devices
4.2 Text and Audio Chat using heterogeneous Communication Terminals
As suggested by its name, the Magic Lounge system has been designed as a virtual meeting space that aims to provide their users with some communication support tools. These tools include: a text-based conversation tool, a multi-party audio tool, and a meeting memory tool that provides a user with various different views on recorded conversations. At a first glance, the text conversation tool is similar to a text chat system. However, in a conventional chat system messages sent by users are basically strings
132
RIST
of text characters that are not interpreted further by the system. Therefore, such systems can only provide little support for structuring conversations and collaborations. In contrast, text messages exchanged through the Magic Lounge conversation tool are labelled with meta information referring to speech acts that are associated with the messages, and information to mark the conversation thread to which a certain message belongs. This meta information forms essential input for the memory tool since it can be used to structure recorded messages among different criteria and thus enable the generation of purpose-specific views on the memory entries. For instance, the views supported by the meeting memory tool comprise sort functions to list messages in a certain order, a view that displays the messages in a tree-like structure to emphasise topic boundaries in a conversation, a “musical score” view that emphasises the temporal order of exchanged messages, and finally a replay-view that deploys animated conversational characters to represent conversation partners for the replay of recorded conversations. The audio tool was built on the pre-existing conferencing tool VAT which has been developed by the Network Research Group of the Information and Computing Sciences Division at Lawrence Berkeley National Laboratory (LBNL) in Berkeley. Quite similar to a modern telephone with conferencing facilities, the audio tool enables its users to send and receive verbal messages either to some selected communication partners or to broadcast messages to all connected partners. All audio messages can be recorded at the communication server and thus can be accessed and replayed after a meeting. The labeling of audio contributions on analogy of the labeling of textual messages would have been desirable too, especially for the purpose of structuring the messages in a meeting record. However, after some initial testing it was concluded that such a labeling effort would disrupt the fluency of any spoken conversation and thus should be hidden from the users. As a consequence, Magic Lounge only produces a log-file of all verbal messages that indicates who has sent a certain message, when it was sent, how long it was and who the recipients were.
Figure 11: Magic Lounge WAP interface for a mobile phone (here a Nokia 7110).
Within the project, mobile access to Magic Lounge mediated meetings has been based partly on WAP which in turn employs the Wireless Markup Language (WML), and partly on a mobile telephone connection. Figure 11 illustrates an interaction sequence on a WAP browser of a mobile phone. When entering the Magic Lounge available services, such as querying the meeting memory, or writing
SUPPORTING MOBILE USERS THROUGH ADAPTIVE PRESENTATION
133
and receiving chat messages, are listed and can be started by selecting them (first and second snapshot of Figure 11). Memory contents can be presented as text (third snapshot of Figure 11), or via the audio channel. Also, chat messages can be submitted in textual form (fourth snapshot of Figure 11). Conducting field tests for the mobile client was not possible within the project, mainly for financial reasons. Among other things, the project was not able to purchase devices for a broader user group and to sign the required contracts with the telecom companies for the WAP service provision. Furthermore, we wanted to exploit the log-files of the WAP-gateway in the evaluation process. If we had used a telecom-provider’s WAP-gateway, these log-files would not have been accessible for us. On the other hand, the research license of our WAP-gateway was restricted to only three concurrent clients. Therefore, the tests where done on a smaller scale at DFKI with a group of nine subjects. The small number of test candidates does not allow to draw conclusions of any statistical significance. However, the observations made during testing helped to identify those parts of the tasks which were most difficult to perform. To sum up, our major observations were: Using the mobile Magic Lounge client seems to be a relatively easy task. We expected inexperienced users would have severe difficulties with the handling of the mobile devices. This seems to be the case. However, we were indeed surprised that only little training seems necessary to drastically improve task performance concerning required time, need for assistance, and deviations from the minimal interaction sequence. In addition, the test persons experienced the exercise as being a bit laborious but relatively easy to perform. Our initial concerns that users would blame the mobile client because of inconveniences concerning the overall handling of the mobile device was not at all supported. On the contrary, most test candidates indicated interest in using such a system, e.g., in order to brief themselves before joining a meeting. The current interaction design of the mobile client is not optimal. The records of the test sessions as well as the subjective responses from the test candidates revealed the most problematic parts of the current interaction structure. Unfortunately, most of the encountered problems are due to the particular mobile phone implementations. For instance, the quality of the displays will certainly improve very fast and down-load time will decrease as well. Another serious problem is that we are currently very much restricted in over-defining the assignments of function keys. As a consequence, any interaction design for an application will be constrained by already assigned functions. This can be very confusing as it is not always obvious whether a certain option is part of the specific application or whether it is related to some other phone functionality, such as the phone’s address book. Fortunately, some of the problems will certainly become obsolete when switching to the next generation of mobile devices. If text input is required, it must become more comfortable to use. As expected, text editing on a tiny number key board requires some training and is experienced as being quite inconvenient. In addition, most test candidates experienced the completion of text editing as unnecessarily complicated. Unfortunately, the implementations of the current WAP-browsers do not allow the introduction of a short-cut so that text-entries can be sent off by pressing only one button right after
134
RIST
editing has been completed. Though only three chat sessions were observed, it seems quite obvious that the implemented version of the WAP chat client was inconvenient to use. Among other aspects, users did not like the additional effort for providing meta information, such as the speech act associated with a message. Users also wanted to receive notification in case new messages arrive. In the provided system they had to inspect the message list periodically to find out whether new messages were received. Also, replying to a message was found inconvenient since after switching to the text editing mode, they could no longer see the message to which they wanted to reply and switching back and forth between the reading mode and the editing mode was found too awkward. 5.2 Using Heterogeneous Graphical Representations in Collaboration Tasks
With a special interest in graphical information presentation, we also conducted a case study in which we used a simulation environment to investigate how a small group of geographically dispersed users can jointly solve localization and route planning tasks while being equipped with different communication terminals. For the purpose of illustration, consider the situation in which a group of three users U1,U2, and U3 meet in the Magic Lounge to discuss the details of an impending trip to downtown Saarbrücken. U1 accesses the virtual meeting space via his PC, U2 via a PDA, and U3 via a cell phone with a tiny LCD display. To clarify how to get from a certain location to another, the participants want to consult a map. U1 is in an advantageous position as his PC can easily display even highly colored and detailed maps. As far as usability is concerned, it doesn’t make much sense to output complex graphics on a small PDA screen. Therefore, U2 receives a more abstract graphical presentation, in this case a network-style map that essentially encodes topological information about the objects in focus. The phone user U3 is certainly in the weakest position as there is only a 90x60 pixels display available on his phone to output graphics. U3 receives on his display a simplified graphical representation of the street which is in the current focus of interest. Other details, such as the street name or attributes which can not be shown are provided verbally using a text-to-speech synthesizer on the server side. Our approach to provide the users with appropriate information displays is to generate presentation variants (i.e. different views) from a shared data representation of the information to be presented (Rist & Hüther 1999). We use a data model in which geographic information is described in hierarchically ordered layers. Of course the details of this hierarchical representation depend on the data which have to be represented. We assume that this representation will be located on a centralized server which is accessible by all participants. The data hierarchy for the “shared map” application is shown in the center of Figure 12. The uppermost level of this hierarchy is just a list of meaningful domain objects, such as the names of streets, buildings, places, bridges and the like. In contrast, the lowest level of the hierarchy comprises a detailed bitmap, but also all the information of the superior layers. The intermediate layers represent abstractions relative to the information of the layers below. In the map example the intermediate
SUPPORTING MOBILE USERS THROUGH ADAPTIVE PRESENTATION
135
layers correspond to representations that subsequently abstract from geometric aspects such as spatial extent, shape, and distance between objects. In addition to the generation of different views on geographical data, the data hierarchy plays also a major role for mirroring object selections/markings from a source view to other target views. Figure 12 illustrates a scenario where a certain path to a particular location has been marked on the PDA display. The marked path is then mirrored on the PC’s map display as well as on the display of a cell phone (CP). The mapping from the PDA to the PC involves the following steps: First, the system determines the interaction type that is performed on the source view. In the example of Figure 12 the user has performed a Mark-action on his PDA. Next, the system determines the object(s) which are affected by the interaction (in the example, the graphical representation of the path leading to LOC#1 has been marked). The system also determines which action on the PC side corresponds best to the action that has been executed on the PDA display. In the example shown Figure 12, the appropriate action is named Mark_PC. Finally, the system must determine the graphical representation of LOC#1 on the PC display and apply the action Mark_PC to this representation. The mapping from the PDA (or the PC) display to that of the cellular phone is more challenging, especially if complex objects are marked, such as a whole path. As already illustrated by Figure 7, one possible approach is to transform the spatial display of the marking into a sequence of phone displays, each showing a small portion of the marked path. To construct such a view for the phone display the server needs to determine the involved path segments and to generate a sequence of displays. Using the scroll buttons on the phone, the user can navigate back and forth through the sequence of single images. While this approach does not allow the user to grasp the marked path at a glance, it may still help the user to build up a mental representation of the path (e.g. how many turns to make, order of turns etc.). Vice versa, a mapping from a marked/selected item on the phone display does not cause severe problems apart from having to provide a convenient means that enables the user to make such selections. To a large extent this is a device-specific issue. However, a fall back solution is to assign numbers to the selectable items shown on the phone display and use the phones keyboard to perform selections. To what extent collaborations over heterogeneous communication channels are useful and usable is an issue that is investigated best by means of concrete application scenarios. For the purpose of such a study we have chosen the domain of collaborative travel planning and implemented a demonstrator system. This demonstrator relied on simulations of the user interfaces for both the PDA and the mobile phone. For example, the phone interface was represented in the simulation through a realistic image. The user could press the depicted number and control buttons. Vice versa, information was presented through combinations of spoken messages and written text and minimalist graphics which are displayed in an area that simulates the tiny LCD display of a real phone. The device simulations were run on separate but networked PCs. In an informal evaluation conducted with colleagues and students at DFKI different settings have been considered: x collaboration between three PC users;
136
RIST x x
collaboration between PC, PDA and mobile phone; collaboration between PC and a mobile user with a NOKIA 7110 WAP phone. In all settings the participating users had to solve some concrete localization and route description tasks, such as showing the others where a particular object is located, or how to get from one place to another. The try runs were made with and without audio/text communication available. Based on our observations, we got a clearer idea on the sort of appropriate information displays and the type of collaboration modes which are likely to support the users in solving their tasks. For example, it turned out that collaborations without any verbal communication channel do not work at all. Some verbal communication (audio or at least textbased) is necessary in order to clarify the purpose of the graphical representations and the markings in relation to the given task and the proposed steps to solve it. On the other hand, the availability of graphical representations – though being different from each other – help to facilitate collaborations on localization and route planning tasks.
Figure 12: An object marking on the PDA display is mirrored on the PC display and on the Cell Phone display. References to objects are resolved via the hierarchical data representation at the centralized server component.
Another observation concerns the way in which users exchange markings which each other. It seems that one should prefer a collaboration principle which leaves the decisions to the individual users when and from where they want to “import” other views. Also, it seems not advisable to import markings from several other users at the same time since this can result in confusing displays.
SUPPORTING MOBILE USERS THROUGH ADAPTIVE PRESENTATION
137
6. CONCLUSIONS In this paper, we presented a number of different scenarios to motivate the need for adaptive information presentation that meets the requirements of mobile users. So far, our focus has been on requirements that are due to limited capabilities of mobile devices for graphical information display. We presented several different adaptation strategies, including image transcoding as well as graphics translation and re-generation from a semantic representation. While new generations of mobile devices will overcome many of the current performance limitations, other restrictions, such as reduced screen real-estate will remain since they are naturally restricted by the overall size of a mobile device. Diversity in devices is also likely to increase further and thus will call for smart adaptation mechanisms. Also, to fully exploit the new technical features of upcoming devices, new research questions arise as well. For instance, new devices will enable multimodal interaction and multimedia information presentation including text, audio/speech, graphics, animation, video, and active content (such as Midlets). However, interaction design and presentation composition have to be tailored to the target device, the user’s information needs and interaction preferences as well as to the particular mobile usage context. 7. ACKNOWLEDGEMENTS The work presented in this paper comprises valuable contributions from colleagues and students who where in involved in the corresponding projects. Special thanks to Patrick Brandmeier who did most of the development work on the image transformation system, the Shopping Companion, and who was responsible for setting up the WAP infrastructure for the Magic Lounge project. Thanks also to Alexander Kröner for his work on the constraint-based Layout Manger used in the MyArt portal, Uwe Sengewald for his work on several map visualization components for tiny displays, and Elisabeth André and a reviewer for comments and corrections on a draft version of this paper. 8. NOTES 1 WAP (Wireless Application Protocol), see http://www.wapforum.org 2 HTML (HyperText Markup Language), see http://www.w3.org/MarkUp/ 3 WML (Wireless Markup Language), see http://www.wapforum.org/DTD/wml_1.1.xml 4 HSCSD (High Speed Circuit Switched Data), see http://www.gsmworld.com/technology/hscsd/ 5 GPRS (General Packet Radio Service), see http://www.gsmworld.com/technology/gprs/ 6 UMTS (Universal Mobile Telecommunications System) 7 See http://wireless.java.sun.com 8 See http://www.fontus.com/ 9 EU project IMAGEN IST-1999-13123. IMAGEN project partners are: Bar-Ilan University Ramat Gan, Israel, CBE Group International, Amsterdam, Netherlands, DFKI Saarbrücken Germany (coordination), GIUNTI Interactive Labs Geonoa Italy, RADICAL MEDIA Lisbon Portugal, Vrije Universiteit Brussel - STAR Lab, Brussels Belgium. 10 The eXact Packager® produces assessments (IMS QTI), multi-device style sheets and is interoperable with any third party Learning Management System (LMS) or information system conforming to international standards. See http://www.learnexact.com/.
138
RIST
11 XSLT (XSL Transformations), see http:// www.w3.org/TR/xslt 12 The Magic Lounge project (1997-2000) was funded under the Future Emerging Technologies (FET) pro-active initiative i3. Project partners were DFKI, SaarbrĦcken Germany (coordination); NIS, Odense University, Denmark; LIMSI-CNRS Paris France; Siemens AG, München Germany; and The Danish Isles - User Community, Denmark. For details on mobile HTML derivatives see http://www.w3.org/Mobile/Activity. 13 Pic2WBMP stands for a conversion that assumes a source image in GIF or JPEG format and produces as output a black and white image in the so-called “wireless bitmap” format. The Ginco graphics converter was made available as shareware by Ginco New Media (www.gingco.de).
The Magic Lounge project (1997-2000) was funded under the Future Emerging Technologies (FET) pro-active initiative i3. Project partners were DFKI, SaarbrĦcken Germany (coordination); NIS, Odense University, Denmark; LIMSI-CNRS Paris France; Siemens AG, München Germany; and The Danish Isles - User Community, Denmark.
9. REFERENCES André, E. and Rist, T. (2001) Controlling the Behavior of Animated Presentation Agents in the Interface: Scripting vs. Instructing. AI Magazine 22(4), pp. 53-66. Agrawala, M. and Stolte, C. (2001) Rendering Effective Route Maps: Improving Usability Through Generalization. In Proc. of ACM SIGGRAPH 2001, pp. 241-250. Bentley, R., Rodden, T., Sawyer, P., and Sommerville, I. (1994) Architectural support for cooperative multiuser interfaces. In IEEE Computer, 37-46. Bernsen, N.O., Rist, T. , Martin, J.C., Hauck, C., Boullier, D., Briffault, X., Dybkjaer, L., Henry, C., Masoodian, M., Néel, F., Profitlich, H.J., André, E., Schweitzer, J., Vapillon, J. (1998) Magic Lounge: A Thematic Inhabited Information Space with "Intelligent" Communication Services. In: La Lettre de l'Intelligence Artificielle following the International Conference on Complex Systems, Intelligent Systems, & Interfaces (NIMES'98). Billsus, D., Pazzani, M.J., and Chen, J. (2000) A learning agent for wireless news access. Intelligent User Interfaces, pp.33-36. Bohnenberger, T., Jameson, A., Krüger, A., and A. Butz. (2002) Location-Aware Shopping Assistance: Evaluation of a Decision-Theoretic Approach. In: Mobile Human-Computer Interaction. LNCS Vol. 2411, Springer Verlag. Brewer, E., Katz, R., Amir, E., Balakrishnan, H., Chawathe, Y., Fox, A. , Gribble, S., Hodes, T., Nguyen, G., Padmanabhan, V., Stemm, M., Seshan, S., and Henderson, T. (1998) A Network Architecture for Heterogeneous Mobile Computing. In: IEEE Personal Communications Magazine 5(5), pp. 8-24. Britton, K.H., Case, R., Citron, A., Floyd, R., Li, Y., Seekamp, C., Topol, B., and Tracey, K. (2001) Transcoding: Extending e-Business to New Environments. In IBM Systems Journal, Vol. 40, No 1. Chincholle, D., Goldstein, M., Nyberg, M., and Eriksson, M. (2002) Lost or Found? A Usability Evaluation of a Mobile Navigation and Location-Based Service. In: Mobile Human-Computer Interaction, LNCS Vol. 2411, Springer Verlag. Endres, C., Meyer, M., and Wahlster, W. (1999) Personal Picture Finder: Ein Internet-Agent zur wissensbasierten Suche nach Personenphotos, Online'99 (in German). Service temporarily available: http://finder.dfki.de:7000/ Fox, A., Goldberg, I.., Gribble, S.D., Lee, D.C., Polito, A., and Brewer, E.A. (1998) Experience With Top Gun Wingman: A Proxy-Based Graphical Web Browser for the 3Com PalmPilot. In Proc. of Middleware '98. Gaedke, M., Beigl, M., Gellersen, h.-W., and Segor, C. (1998) Web Content Delivery to Heterogenous Mobile Platforms. In: Advances in Database Technologies. LNCS Vol. 1552, Springer Verlag. Gonzalez-Castaño, F.J., Rifón, L.A., and Costa-Montenegro, E. (2002) A New Transcoding Technique for PDA Browsers Based on Content Hierarchy. In: Human-Computer Interaction, LNCS Vol. 2411, pp. 69-80. Graham, T.C.N. and Urnes, T. (1992) Relational views as a model for automatic distributed implementation of multi-user applications. In Proc. of CSCW’92. pp. 59-66. Greenberg, S., Boyle, M., and LaBerge, J. (1999) PDAs and Shared Public Displays: Making Personal Information Public, and Public Information Personal. In: Personal Technologies 3(1&2).
SUPPORTING MOBILE USERS THROUGH ADAPTIVE PRESENTATION
139
Han, R., Bhagwat, P., LaMaire, R., Mummert, T., Perret, V., and Rubas, J. (1998) Dynamic adaptation in an Image Transcoding Proxy for Mobile Web Browsing. In: IEEE Personal Communication. pp. 8-17. Hill, R.D. (1992) The Abstraction-link View Paradigm, Using Constraints to Connect user interfaces to applications. In Proc. of CHI '92. pp. 335-342. Hori, M., Kondoh, G., Ono, K., Hirose, S., and Singhal, S., (2000) Annotation –Based Web Content Transcoding. In Proc. of 9th International World Wide Web Conference (WWW9). Kraus, S., Kröner, A., and Tsaban, L. (2002) IMAP - Intelligent Multimedia Authoring Tools for Electronic Publishing. In Proc. of the Second International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems AH 2002, pp 527-530. Kray, C., Laakso, K., Elting, C., and Coors, V. (2003) Presenting Route Instructions on Mobile Devices. In Proc. of IUI'03, Miami, FL. Kröner, A., Bandmeier, P., and Rist, T. (2002) Managing Layout Constraints in a Platform for Customized Multimedia Content Packaging. In Proceedings of the Working Conference on Advanced Visual Interfaces AVI 2002, pp 89-93. Kylänpää, M. and Heino, I. (2000) Personalized Transcoding Proxy - an Approach to Mobile Web Access. In Proc. of 9th International World Wide Web Conference (WWW9). Article available under: http://www9.org/final-posters/poster10.html Marucci, L. and Paternò, F. (2002) Supporting Adaptivity to Heterogeneous Platforms through User Models. In: Mobile Human-Computer Interaction. LNCS Vol. 2411, Springer Verlag. Mohan, R., Smith, J., and Li, C.-S. (1999) Adapting Multimedia Internet Content For Universal Access. In: IEEE Transactions on Multimedia, pp. 104-114. Myers, B.A., Stiel, H., and Gargiulo, R. (1998) Collaboration Using Multiple PDAs Connected to a PC. Proc. of CSCW’98, pp. 285-295. Noble, B. (2000) System Support for Mobile, Adaptive Applications. In: IEEE Personal Communications Vol. 7, No. 1. Not, E., Sarini, M., Stock, O., Strapparava, C., Zancanaro, M. (1998) Information adaptation for physical hypernavigation. Proc. of i3 Annual Conference, Nyborg. Rao, H., Chen, Y. , Chang, D., and Chen, M. (2001) iMobile: A Proxy-based Platform for Mobile Services. In Proc. of the first ACM Workshop on Wireless Mobile Internet. Reynolds, F., Hjelm, J., Dawkins, S., and Singhal, S. (1999) Composite Capability/Preference Profiles (CC/PP): A user side framework for content negotiation. Available as online resource under: W3C Note. http://www.w3.org/TR/NOTE-CCPP/ Rist, T. and Hüther, M. (1999) Adaptive Generation of Graphical Information Presentations in a Heterogeneous Telecooperation Environment. In. Proc. of HCI’99 Vol.1, pp. 331-335. Rist, T. and the Magic Lounge Team. (2000) Magic Lounge: A virtual communication space with intelligent communication services. In: Proc. of the i3 Annual Conference’02, Jönköping, Sweden. Rist, T. and Bandmeier, P. (2002) Customising Graphics for Tiny Displays of Mobile Devices In: Personal and Ubiquitous Computing. No. 6, pp. 260-268. Roseman, M., and Greenberg, S. (1996) Building Real-Time Groupware with GroupKit, A Groupware Toolkit. ACM Trans. on Human-Computer Interaction, pp. 66-106. Stefik, M., Bobrow, D.G., Foster, G., Lanning, and Tatar, D. (1987) WYSIWIS Revisited: Early Experiences With Multiuser Interfaces, ACM Trans. on Office Information Systems 5(2) 147-167. Witten, I.H. and Frank, E. (2000) Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, San Francisco.
10. AFFILIATIONS T. Rist DFKI, Saarbruecken, Germany
NATURAL LANGUAGE GENERATION Natural Language Generation is concerned with the automatic production of texts. This branch of natural language processing for many years has attracted much less interest than language understanding and only in the Eighties a real community of researchers has formed and different schools have appeared. Normally the field is described as investigating communicative goals, the dynamic choice of what to say, the planning of the overall rhetorical structure of the text (called sometime strategic planning), the actual realization of sentences on the basis of grammar and lexicon (sometimes called tactical planning), and so on. Among the different approaches a classical partition consists in a class that includes those based on schemata or templates, and a class based on those that combine finer-grained rhetorical elements. Hybrid approaches of various sorts have also been developed and in recent years, following a trend that is prevailing in language analysis, statistical approaches have been attempted as well. A very important aspect is concerned with the integration between language and other communication media. Many groups are now active in a way or in another in this direction. The cluster on Natural Language Generation presents four contributions. First, the chapter by Andrè and colleagues describes a system called AutoBriefer which automatically generates multimedia briefings from high-level outlines. The author of the briefing interacts with the system to specify some of the briefing content. The system uses declarative presentation planning strategies to synthesize a narrated multimedia briefing in various presentation formats. The narration employs synthesized audio as well as, optionally, an agent embodying the narrator. Calder and colleagues address a similar issue in their chapter. They introduces The MPIRO project which targets the concept of personalized information objects. MPIRO's technology allows textual and spoken descriptions of exhibits to be generated automatically from an underlying language-neutral database and existing free-text descriptions. The resulting descriptions, produced in three different languages (English, Greek, and Italian), are tailored according to the user's interests, background knowledge, and language skills. Carenini and Conati in the last chapter of this cluster describe a framework that helps students learn from examples by generating example problem solutions whose level of detail is tailored to the students’ domain knowledge. When presenting a new example to a student, the framework uses natural language generation techniques and a probabilistic student model to selectively introduce gaps in the example solution, so that the student can practice applying rules learned from previous examples in problem solving episodes of difficulty adequate to her knowledge. Filling in solution gaps is part of the metacognitive skill known as self-explanation (generate explanations to oneself to clarify an example solution), which is crucial to effectively learn from examples. Finally, Piwek, Power, Scott and van Deemter explore NLG applications in which extra media are brought into play from the point of view of planning the layout of the document. The main point of the chapter is that the extra media are not simply added to plain text, but integrated with it: thus the use of formatting, or pictures, or dialogue, may require radical rewording of the text itself.
E. ANDRÉ, K. CONCEPCION, I. MANI AND L. VAN GUILDER
AUTOBRIEFER: A SYSTEM FOR AUTHORING NARRATED BRIEFINGS
1. INTRODUCTION Document production is an important function in many organizations. In addition to products such as technical papers, instruction manuals, system documentation, courseware, etc., briefings are a very common type of multimedia document product. Since so much time is spent by so many people in producing briefings, often under serious time constraints, any method to reduce the amount of time spent on briefing production could yield great gains in productivity. Our work forms part of a larger DARPA-funded project aimed at improving analysis and decision-making in crisis situations by providing tools that allow analysts to collaborate to develop structured arguments in support of particular conclusions, and to help predict likely future scenarios. These arguments, along with background evidence, are packaged together as briefings to high-level decisionmakers. In leveraging automatic methods to generate briefings, our approach needs to allow the analyst to take on as much of the briefing authoring as she wants to (e.g., it may take time for her to adapt to or trust the machine, or she may want the machine to present just part of the briefing). The analyst's organisation usually will instantiate one of several templates dictating the high-level structure of a briefing; for example, a briefing may always have to begin with an executive summary. The methods also need to be relatively easily adapted to different domains, given that the subject matter of crises are somewhat unpredictable. These task requirements force us to focus on a flexible approach to briefing generation. The system must allow the user to take control as much as she wants in the specification of the content and format, as well as in editing the final product. It must also be able to help the user as needed by exploiting any redundancy in the briefing content, including stereotypical structure (dictated in part by the business rules of the organization) as well as material from previous briefings, databases, etc. In both these cases, the system should be able to populate an outline of the briefing. In cases where the user wants to be freed from specifying the low-level details of the animation, narration, and multimedia synchronization of the briefing, the system should be able to plan and synthesize the actual presentation, which the user can then post-edit. As an example, consider a user putting together a briefing about a particular individual, say Victor Polay, a leader of Peru’s Tupac Amaru movement. A person preparing this briefing by hand (using, say, Microsoft PowerPoint) would need to 143 O. Stock and M. Zancanaro (eds.), Multimodal Intelligent Information Presentation, 143–158 © 2005 Springer. Printed in the Netherlands
144
ANDRÉ, CONCEPCION, MANI AND VAN GUILDER
plan the outline, and assemble the various textual and multimedia elements that constitute the core content. This latter activity could involve searching, doing background research, collecting existing material, drafting new text, etc. The person must also design or select a template for the overall briefing, as well as the layout of individual slides, and then produce each slide individually, realizing the content of each slide by embedding or linking the media elements on it, specifying the animation strategies, and then rehearsing and refining the briefing. Finally, the person has to present the briefing in front of the intended audience(s). In this paper, we report on a system called AutoBriefer where the author of the briefing interacts with a system via a graphical user interface (GUI) to specify some of the briefing content. As before, the user can select or create an outline. However, the user doesn’t have to produce each slide, realizing the media elements on each. Instead, the AutoBriefer GUI allows the user to select and associate the media elements directly with the outline. In doing so, the user may specify the temporal layout of the content. Moreover, instead of using pre-existing media elements, the GUI also allows the user to direct personalized information agents to assemble media elements from existing (possibly distributed) databases or text collections. The user can then leave the rest of the briefing preparation to the system, by asking AutoBriefer to generate a PowerPoint briefing. In doing so, the user allows the system to plan how many slides are needed, and to plan and realize the media elements on each slide, synchronizing them spatially and temporally. Further, the user doesn’t need to present the briefing; the system itself can generate a spoken narrative for the briefing, synchronizing the narrative elements it introduces with the other media elements, producing an animated briefing as a result. One challenge in automating this process involves automatically selecting and assembling media elements; here AutoBriefer exploits new technologies for automated summarization. In order to generate a narrative automatically, the system needs knowledge about the domain objects; here, AutoBriefer provides scalability across domains by allowing the user to specify or edit domain knowledge in the form of meta-information attributes associated with media elements. This ability to specify and edit domain knowledge in turn addresses an important bottleneck in current multimedia presentation strategies, which are difficult to adapt across domains. AutoBriefer leverages the ability to edit domain knowledge along with declarative presentation planning strategies to synthesize a narrated multimedia briefing in various presentation formats. 2. APPROACH The design of AutoBriefer is different from most multimedia presentation planners in the way the problem is decomposed. An easy-to-use GUI is used to specify high-level structural and temporal constraints for arranging domain objects in a presentation. Information associated with these domain objects is used to generate a narrative outline. This narrative outline is then fed as data along with a high-level communicative goal to a presentation planning component (PPP). The planner chooses a sequence of plan operators to realize the goals associated with the
AUTOBRIEFER: A SYSTEM FOR AUTHORING NARRATED BRIEFINGS
145
outline. The plan operators constitute a declarative representation of presentation strategies to decide on narrator animation, and spatial and temporal synchronization. The system architecture is shown in Figure 1. The user interacts with the system via a GUI using a simple point-and-click interface. She may load in a domain model of existing domain objects, for possible editing, or she may create new domain objects using information agents.
Briefing GUI
XML Narrative Outline Domain Knowledge
XML2Prop Converter Propositional Narrative Outline
Agents User
Multimedia Player or PowerPoint
SMIL or PowerPoint
User Interface
Narrative Planner
Layout Generator Briefing Generator
Figure 1. System Architecture
As shown in Figure 2, the user arranges the objects along two dimensions. The first, structural dimension, allows objects to be nested under others to create a hierarchical structure. This is shown in the left frame, which uses a directory browser interaction metaphor. The user here has chosen to specify the goal of retrieving an existing media object, namely a picture of “Polay”. The user has also specified goals of creation of a summary of events and a biography; the execution of these goals, of type ‘create’, involve the use of information agents. Finally, the user has also included a particular audio file as a coda to end the briefing. The second, temporal dimension allows objects to be ordered temporally. This is shown in the right frame: objects from the left frame can be dropped into numbered temporal slots in the right frame (the system lays them out, by default, based on the outline). Each object can be edited by filling in meta-information in the form of values for slots (Figure 3). Here the user has typed in a caption, and has required that the system generate the running text. Once a subset of objects in the domain have been selected and arranged appropriately, the user may choose a style sheet for the current briefing, and then ask the system to create the briefing, in either PowerPoint (preferred, since it has a wide
146
ANDRÉ, CONCEPCION, MANI AND VAN GUILDER
user base) or SMIL (Synchronized Multimedia Integration Language), a W3Cdeveloped extension (SMIL 2002) of HTML that can be played by standard multimedia players, such as Real (REAL 2002) and Grins (GRINS 2002).
Figure 2. Briefing GUI
The narrative outline is represented in XML as a tree whose structure represents the rhetorical structure of the briefing. An example is shown in Figure 4, which was generated automatically as a result of GUI interaction. (Here, for readability, we show a schematic representation of the narrative outline rather than the raw XML.) Each node has a label, which offers a brief textual description of the node. Each leaf node has an associated goal, which, when realized, provides content for that node. Content-level goals are of two kinds: retrieve goals, which retrieve pre-existing media objects of a particular type (text, audio, image, audio, video) satisfying some description, and create goals, which create new media objects of these types using personalized information agents for search, summarization, etc. In addition to content-level goals, the narrative outline contains narrative-level goals, which serve to introduce descriptions of content at other nodes, including segues to content nodes, and running text and captions describing media objects at content nodes. Ordering relations reflecting temporal and spatial layout are defined on nodes in the tree. Two coarse-grained relations, seq for precedence, and par for simultaneity, are used to specify a temporal ordering on the nodes in the tree.
AUTOBRIEFER: A SYSTEM FOR AUTHORING NARRATED BRIEFINGS
147
Figure 3. Editing an Object
As shown in Figure 1, once the user of the GUI asks the system to create the briefing, the GUI passes off the narrative outline to a converter which first applies an XML parser to the outline to check for syntactic correctness. The validated outline is then transformed into propositions that can be processed by the Narrative Planner and the Layout Generator. The Narrative Planner takes the input narrative outline and works out a coherent presentation by realizing the corresponding retrieve and create goals. In particular, it fills out some content by invoking specific components for the acquisition and creation of multimedia material. It also makes decisions when needed about output media type. To improve the coherence of the briefing, the narrative planner may introduce narrative-level goals which refer to meta information on the content of a briefing. In the case of create goals, meta information may be easily provided by the narrative planner. For retrieve goals, this information is usually supplied by the user. In either case, narrative elements are generated from the meta-information by using shallow text generation methods (canned text). Finally, the Layout Generator generates the temporal and spatial layout of the presentation. Both the narrative planner and the layout generator are based on the PPP multimedia presentation planning tool (André and Rist 1996). The PPP tool combines two components: an HTN-planner that realizes the Narrative Planner and a component for spatial and temporal reasoning that is used in the Layout Generator. While in the original approach of André and Rist a presentation is planned on the basis of a domain knowledge base, the starting point for the approach presented here
148
ANDRÉ, CONCEPCION, MANI AND VAN GUILDER
is the narrative outline as provided by the user, which is then refined and synchronized. Peru Action Brief 1 Preamble 2 Situation Assessment 2.1 Chronology of Events 2.1.2 Latest Document Summary create(“summarize-generic compression .10/peru/p32.txt”) 2.2 Biographies 2.2.1 Profile of Victor Polay 2.2.1.1 A File Photo of Victor Polay retrieve(“D:\rawdata/polay.jpg”) 2.2.1.2 Profile of Victor Polay create(“summarize –bio –length 350 –span multi – person Victor Polay -out table/peru/”*”) 3 Coda “This briefing has assessed aspects of the situation in Peru. Overall, crisis appears to be worsening.”
1 2.1.2 2.2.1.1 2.2.1.2 3
Figure 4. Narrative Outline
3. INTELLIGENT MULTIMEDIA PRESENTATION PLANNING The author of a briefing may choose to flesh out as much of the narrative as desired. For instance, she may exactly indicate how a certain image should be labeled, but is not required to do so. Further, she may specify certain narrative goals, though she is not required to do so; she may also leave it up to the Narrative Planner component to introduce all the narrative goals. In case domain objects should appear in a certain order in the final presentation, she may also provide the desired temporal ordering relations. Note, however, that the complete temporal layout need not be specified by the user. In addition to the narrative outline, the user may input a number of presentation constraints, such as the degree of detail, the style of presentation (e.g. agent-based versus audio presentation) or the output format, e.g.,
AUTOBRIEFER: A SYSTEM FOR AUTHORING NARRATED BRIEFINGS
149
SMIL or PowerPoint. The task of the presentation planner is then to work out a presentation based on the user’s input. This includes the realization of the create, retrieve and narrative goals as well as the determination of the spatial and temporal layout. The result of the presentation is a document in the required format. 3.1 Basic Technology
To accomplish the tasks sketched above, we rely on our earlier work on presentation planning. The essential idea of this approach is to formalize action sequences for composing multimedia material and designing scripts for presenting this material to the user as operators of a planning system. The header (i.e., effect) of a planning operator refers to a complex communicative goal (e.g. to provide a biography of a certain person) whereas the expressions of the body of the operator indicate which acts have to be executed in order to achieve this goal (e.g. to show a picture of that person). The input goal to the presentation planner is a complex presentation goal, e.g. to provide a briefing. To accomplish this goal, the planner looks for operators whose headers subsume it. If such an operator is found, all expressions in the body of the operator will be set up as new subgoals. During the decomposition process, a presentation is built up by collecting and expanding the nodes of the user’s narrative outline. To accomplish this task, we start from an empty output file that is continuously augmented by SMIL-expressions or expressions in an XML-based markup language that serves as an intermediate format for the PowerPoint slides to be generated. The planning process terminates if all subgoals have been expanded to elementary execution acts or have already been worked out by the user. Elementary execution acts include retrieve or create goals as well as commands, such as SAddString, that augment the output file by expressions in the required language. 3.2 Narrative Planning
As mentioned above, the system can construct a narrative to accompany the briefing. Narrative goals are generated to cover captions, running text, and segues. The captions and running text are realized from information provided in the narrative outline. In the case of create goals, the running text and captions may be easily generated by the system, as a by-product of creating a media object. In the case of retrieve goals, the objects retrieved may not have any meta-information, in which case a default caption and running-text is generated. For example, if a retrieve goal associated with a picture lacked a user-supplied caption, the Narrative Planner would use the narrative outline as the basis for constructing a caption. This flexibility allows the system to fill in what the user left unspecified. Clearly, a system's explanatory narrative will be enhanced by the availability of rich metainformation. The segues are realized by the system. For example, an object with a label slot “A biography of Victor Polay” could result in a generated segue “Here is a biography of Victor Polay”. The Narrative Planner, when providing content for
150
ANDRÉ, CONCEPCION, MANI AND VAN GUILDER
narrative nodes, uses a variety of different canned text patterns. For the above example, the pattern would be “Here is ?label”, where ?label is a variable whose value is the label of the current node. Segue goals are generated automatically by the system and integrated into the narrative outline. We always introduce a segue node at the beginning of the presentation (called a preamble node), which provides a segue covering the root of the narrative outline tree. A segue node is also produced at the end (called a coda). (Both preamble and segue can of course be specified by the author if desired). For introducing intervening segue nodes, we use the following algorithm based on the distance between nodes and the height in the narrative outline tree. We traverse the non-narrative leaves of the tree in their temporal order, evaluating each pair of adjacent nodes A and B where A precedes B temporally. A segue is introduced between nodes A and B if either (a) the maximum of the 2 distances from A and B to their least common ancestor is greater than 3 nodes or (b) the sum of the 2 distances from A and B to the least common ancestor is greater than 4 nodes. This is less intrusive than introducing segues at random or between every pair of successive nodes, and appears to perform better than introducing a segue at each depth of the tree. In case an animated presenter is used, the narrative planner also decides on the non-verbal communicative means to be employed. In particular, we make use of mimics and body gestures to convey the discourse structure of the presentation and the character’s emotions, see also (Poggi et al., this volume) or (Theune et al., this volume). Since a deliberate use of communicative means requires detailed knowledge about the content of the presentation, it is hard to determine appropriate non-verbal gestures for the presentation of content associated with retrieve goals and create goals (that heavily rely on pre-authored multimedia material). The use of nonverbal gestures for the realization of narrative goals essentially depends on their type. For instance, segues that relate to a change in topic are accompanied by a shift in posture, see also (Cassell et al. 2001). Strategies for expanding the narrative content are coded by means of plan operators. In the following, we present a simple plan operator which may be used to work out a narrative goal, namely to present a new PowerPoint slide. The body includes a communicative gesture as well as a canned text pattern which is instantiated with the value for ?label and uttered by the character. It should be obvious how this operator could be used to realize the segues for several of the nodes in the narrative outline in Figure 4. (define-plan-operator :header (A0 (PresentPersona ?id ?SlideNumber)) :constraints (Label ?id ?label) :inferiors ( (A1 (sAddPPCode ("PLAY OneHand;;"))) (A2 (SAddPPCode ("SAY Here is a " ?label ";;"))))
The presentation operators also specify which media type is used to realize a certain goal. By default, we attempt to realize narrative goals as speech media type,
AUTOBRIEFER: A SYSTEM FOR AUTHORING NARRATED BRIEFINGS
151
using rules on text length and truncatability to less than 250 bytes to decide when to use text-to-speech. Captions are always realized as text (i.e., they have a text realization and a possible audio realization). 3.3 Layout Generation
As described above, we are able to generate both SMIL and PowerPoint presentations. The generation of Powerpoint presentations starts from an XMLbased markup language that represents both the contents of a Powerpoint slide as well as the accompanying comments and gestures by the presenter. The XMLrepresentation is then converted into Powerpoint presentations making use of Bridge2Alpha from IBM alphaWorks, a tool that allows Java programs to communicate with ActiveX objects. While the material of a presentation appears on the slides pages, suggestions regarding its presentation are included in the note pages. Essentially, these suggestions are related to AutoBriefer’s narrative goals. The spatial layout of the slides is given by the corresponding PowerPoint style sheet and can be directly inferred from the hierarchical structure as indicated by the human author. When generating SMIL-presentations, we also allow for dynamic media, such as video and audio, all of which have to be displayed in a spatially and temporally coordinated manner and need to be synchronized with the agents’ communicative gestures. For instance, a character may point to a certain object in a video while it is being played. In order to relieve the user from the burden to completely work out a spatial and temporal layout, the following extensions of the presentation planner have become necessary (André and Rist 1996): The formulation of temporal and spatial constraints in the plan operators. We distinguish between metric and qualitative constraints. Metric temporal and spatial constraints appear as metric (in-)equalities, e.g. (5 Bottom VideoRegion – Top VideoRegion). Qualitative temporal constraints are represented in an “Allen-style” fashion which allows for the specification of thirteen temporal relationships between two named intervals, e.g. (SpeakInterval (During) PointInterval). Qualitative spatial constraints are represented by a set of topological relations, such as LeftOf, CenterHor or TopAlign. Note that the plan operators include general temporal constraints that should be satisfied for a certain document type while the temporal constraints the user specifies with the GUI refer to a particular document instance. The development of a mechanism for designing a spatial and temporal layout. We collect all spatial and temporal constraints during the presentation planning process, combine them with the temporal constraints specified by the user for the document to be generated and use the incremental constraint solving toolkit Cassowary (Borning et al. 1997) to determine a consistent spatial and temporal layout which is then represented as a SMIL document. In case of a conflict between the constraints specified by the GUI user and the constraints specified by the author of plan operators, the user’s constraints is given preference, but a warning is produced.
152
ANDRÉ, CONCEPCION, MANI AND VAN GUILDER
In the following, we present a plan operator which may be used to synchronize text, spoken voice and video. The body of the plan operator contains a complex action (PresentVideo) that requires further expansion and two elementary actions (SAddSpeech and SAddSmilCode). SAddSmilCode will extend the SMIL output file with an animation that visualizes the talking. SAddSpeech will cause the Festival speech synthesizer (Taylor et al. 1998) to generate an audio file for ?text which will be included into the SMIL output file as well. The title should be visible while the comment is spoken and the video is run. In particular, the verbal comment starts one time unit after the display of the title which preceeds the video. After the video is played, the title will be visible for one more time unit. The spatial constraints specify where the regions for the title A1 and the video sequence A3 should be located with respect to the region corresponding to the superior presentation segment A0. (define-plan-operator :header (A0 (PresentSegment ?segment)) :constraints (*and* (BELP (Title ?segment ?title)) (BELP (Comment ?segment ?comment)) (BELP (Video ?segment ?video))) :inferiors (A1 (SAddSmilCode (?title))) (A2 (SAddSpeech Festival ?comment)) (A3 (PresentVideo (?video))) :temporal (1 n, the object is removed. Otherwise, the object is displayed at the largest possible size no greater than its maximum size. The state diagram of Figure 7 employs similar tests to make an additional distinction between drawing an object inside (preferred) vs. outside another object, which we use for
MAINTAINING VISIBILITY CONSTRAINTS FOR VIEW MANAGEMENT
269
displaying area feature labels. The values selected for n and the difference between the absolute minimum and sustained minimum sizes help avoid visual discontinuities. 4.5.2. Positional Stability
To compute whether the current frame is a “keyframe” for a particular object, it is necessary to find out whether the object’s position from Section 4.4, which is considered the best placement for the current frame, has moved from the object’s position in the last computation. To do this, an additional operation, computed in pipeline step 4, needs to query for the closest layout to the previous layout. The same visible region is used for this query. For example, if the label was an inside label in the previous frame, the visible region of the object would be queried. However, since the annotation’s related object could have moved because of view specification change (e.g., due to head movement), the last placement is not valid for finding the closest layout. Instead, we compute the change relative to the projection of the related object by transforming the last frame’s placement from the 2D coordinate system of the projected object in the last frame, to the 2D coordinate system of the projected object in the current frame, where the centroid of each rectangle is its own origin. Inside size t min Outside size t min & inside size < min Do not draw
Outside size