The Handbook on Socially Interactive Agents: 20 Years of Research on Embodied Conversational Agents, Intelligent Virtual Agents, and Social Robotics, Volume 1: Methods, Behavior, Cognition 1450387209, 9781450387200

The Handbook on Socially Interactive Agents provides a comprehensive overview of the research fields of Embodied Conversational Agents, Intelligent Virtual Agents, and Social Robotics.


English Pages 517 [538] Year 2021


Table of contents :
Contents
Foreword • Justine Cassell
1 Introduction to Socially Interactive Agents • Birgit Lugrin
PART I: ESTABLISHING SIA RESEARCH
2 Empirical Methods in the Social Science for Researching Socially Interactive Agents • Astrid Rosenthal-von der Pütten and Anna M. H. Abrams
3 Social Reactions to Socially Interactive Agents and Their Ethical Implications • Nicole Krämer and Arne Manzeschke
PART II: APPEARANCE AND BEHAVIOR
4 Appearance • Rachel McDonnell and Bilge Mutlu
5 Natural Language Understanding in Socially Interactive Agents • Roberto Pieraccini
6 Building and Designing Expressive Speech Synthesis • Matthew P. Aylett, Leigh Clark, Benjamin R. Cowan, and Ilaria Torre
7 Gesture Generation • Carolyn Saund and Stacy Marsella
8 Multimodal Behavior Modeling for Socially Interactive Agents • Catherine Pelachaud, Carlos Busso, and Dirk Heylen
PART III: SOCIAL COGNITION—MODELS AND PHENOMENA
9 Theory of Mind and Joint Attention • Jairo Perez-Osorio, Eva Wiese, and Agnieszka Wykowska
10 Emotion • Joost Broekens
11 Empathy and Prosociality in Social Agents • Ana Paiva, Filipa Correia, Raquel Oliveira, Fernando Santos, and Patrícia Arriaga
12 Rapport Between Humans and Socially Interactive Agents • Jonathan Gratch and Gale Lucas
13 Culture for Socially Interactive Agents • Birgit Lugrin and Matthias Rehm
Authors’ Biographies
Index

The Handbook on Socially Interactive Agents

ACM Books

Editors in Chief
Sanjiva Prasad, Indian Institute of Technology (IIT) Delhi, India
Marta Kwiatkowska, University of Oxford, UK
Charu Aggarwal, IBM Corporation, USA

ACM Books is a new series of high-quality books for the computer science community, published by ACM in collaboration with Morgan & Claypool Publishers. ACM Books publications are widely distributed in both print and digital formats through booksellers and to libraries (and library consortia) and individual ACM members via the ACM Digital Library platform.

Probabilistic and Causal Inference: The Works of Judea Pearl Hector Geffner, ICREA and Universitat Pompeu Fabra Rina Dechter, University of California, Irvine Joseph Y. Halpern, Cornell University 2021

Event Mining for Explanatory Modeling Laleh Jalali, University of California, Irvine (UCI), Hitachi America Ltd. Ramesh Jain, University of California, Irvine (UCI) 2021

Intelligent Computing for Interactive System Design: Statistics, Digital Signal Processing, and Machine Learning in Practice Parisa Eslambolchilar, Cardiff University, Wales, UK Andreas Komninos, University of Patras, Greece Mark Dunlop, Strathclyde University, Scotland, UK 2021

Semantic Web for the Working Ontologist: Effective Modeling for Linked Data, RDFS, and OWL, Third Edition Dean Allemang, Working Ontologist LLC Jim Hendler, Rensselaer Polytechnic Institute Fabien Gandon, INRIA 2020

Code Nation: Personal Computing and the Learn to Program Movement in America Michael J. Halvorson, Pacific Lutheran University 2020

Computing and the National Science Foundation, 1950–2016: Building a Foundation for Modern Computing Peter A. Freeman, Georgia Institute of Technology W. Richards Adrion, University of Massachusetts Amherst William Aspray, University of Colorado Boulder 2019

Providing Sound Foundations for Cryptography: On the work of Shafi Goldwasser and Silvio Micali Oded Goldreich, Weizmann Institute of Science 2019

Concurrency: The Works of Leslie Lamport Dahlia Malkhi, VMware Research and Calibra 2019

The Essentials of Modern Software Engineering: Free the Practices from the Method Prisons! Ivar Jacobson, Ivar Jacobson International Harold “Bud” Lawson, Lawson Konsult AB (deceased) Pan-Wei Ng, DBS Singapore Paul E. McMahon, PEM Systems Michael Goedicke, Universität Duisburg–Essen 2019

Data Cleaning Ihab F. Ilyas, University of Waterloo Xu Chu, Georgia Institute of Technology 2019

Conversational UX Design: A Practitioner’s Guide to the Natural Conversation Framework Robert J. Moore, IBM Research–Almaden Raphael Arar, IBM Research–Almaden 2019

Heterogeneous Computing: Hardware and Software Perspectives Mohamed Zahran, New York University 2019

Hardness of Approximation Between P and NP Aviad Rubinstein, Stanford University 2019

The Handbook of Multimodal-Multisensor Interfaces, Volume 3: Language Processing, Software, Commercialization, and Emerging Directions Editors: Sharon Oviatt, Monash University Björn Schuller, Imperial College London and University of Augsburg Philip R. Cohen, Monash University

Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI) Gerasimos Potamianos, University of Thessaly Antonio Krüger, Saarland University and German Research Center for Artificial Intelligence (DFKI) 2019

Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker Editor: Michael L. Brodie, Massachusetts Institute of Technology 2018

The Handbook of Multimodal-Multisensor Interfaces, Volume 2: Signal Processing, Architectures, and Detection of Emotion and Cognition Editors: Sharon Oviatt, Monash University Björn Schuller, University of Augsburg and Imperial College London Philip R. Cohen, Monash University Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI) Gerasimos Potamianos, University of Thessaly Antonio Krüger, Saarland University and German Research Center for Artificial Intelligence (DFKI) 2018

Declarative Logic Programming: Theory, Systems, and Applications Editors: Michael Kifer, Stony Brook University Yanhong Annie Liu, Stony Brook University 2018

The Sparse Fourier Transform: Theory and Practice Haitham Hassanieh, University of Illinois at Urbana-Champaign 2018

The Continuing Arms Race: Code-Reuse Attacks and Defenses Editors: Per Larsen, Immunant, Inc. Ahmad-Reza Sadeghi, Technische Universität Darmstadt 2018

Frontiers of Multimedia Research Editor: Shih-Fu Chang, Columbia University 2018

Shared-Memory Parallelism Can Be Simple, Fast, and Scalable Julian Shun, University of California, Berkeley 2017

Computational Prediction of Protein Complexes from Protein Interaction Networks Sriganesh Srihari, The University of Queensland Institute for Molecular Bioscience Chern Han Yong, Duke-National University of Singapore Medical School Limsoon Wong, National University of Singapore 2017

The Handbook of Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling, and Common Modality Combinations Editors: Sharon Oviatt, Incaa Designs Björn Schuller, University of Passau and Imperial College London Philip R. Cohen, Voicebox Technologies Daniel Sonntag, German Research Center for Artificial Intelligence (DFKI) Gerasimos Potamianos, University of Thessaly Antonio Krüger, Saarland University and German Research Center for Artificial Intelligence (DFKI) 2017

Communities of Computing: Computer Science and Society in the ACM Thomas J. Misa, Editor, University of Minnesota 2017

Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining ChengXiang Zhai, University of Illinois at Urbana–Champaign Sean Massung, University of Illinois at Urbana–Champaign 2016

An Architecture for Fast and General Data Processing on Large Clusters Matei Zaharia, Stanford University 2016

Reactive Internet Programming: State Chart XML in Action Franck Barbier, University of Pau, France 2016

Verified Functional Programming in Agda Aaron Stump, The University of Iowa 2016

The VR Book: Human-Centered Design for Virtual Reality Jason Jerald, NextGen Interactions 2016

Ada’s Legacy: Cultures of Computing from the Victorian to the Digital Age Robin Hammerman, Stevens Institute of Technology Andrew L. Russell, Stevens Institute of Technology 2016

Edmund Berkeley and the Social Responsibility of Computer Professionals Bernadette Longo, New Jersey Institute of Technology 2015

Candidate Multilinear Maps Sanjam Garg, University of California, Berkeley 2015

Smarter Than Their Machines: Oral Histories of Pioneers in Interactive Computing John Cullinane, Northeastern University; Mossavar-Rahmani Center for Business and Government, John F. Kennedy School of Government, Harvard University 2015

A Framework for Scientific Discovery through Video Games Seth Cooper, University of Washington 2014

Trust Extension as a Mechanism for Secure Code Execution on Commodity Computers Bryan Jeffrey Parno, Microsoft Research 2014

Embracing Interference in Wireless Systems Shyamnath Gollakota, University of Washington 2014

The Handbook on Socially Interactive Agents
20 Years of Research on Embodied Conversational Agents, Intelligent Virtual Agents, and Social Robotics
Volume 1: Methods, Behavior, Cognition

Birgit Lugrin, Julius-Maximilians-Universität Würzburg

Catherine Pelachaud CNRS-ISIR, Sorbonne Université

David Traum University of Southern California

ACM Books #37

Copyright © 2021 by Association for Computing Machinery

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews—without the prior permission of the publisher.

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which the Association for Computing Machinery is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

The Handbook on Socially Interactive Agents
Birgit Lugrin, Catherine Pelachaud, and David Traum, Editors

books.acm.org
http://books.acm.org

ISBN: 978-1-4503-8720-0 hardcover
ISBN: 978-1-4503-8721-7 paperback
ISBN: 978-1-4503-8722-4 EPUB
ISBN: 978-1-4503-8723-1 eBook

Series ISSN: 2374-6769 print
2374-6777 electronic

DOIs:
10.1145/3477322 Book
10.1145/3477322.3477323 Foreword
10.1145/3477322.3477324 Chapter 1
10.1145/3477322.3477325 Chapter 2
10.1145/3477322.3477326 Chapter 3
10.1145/3477322.3477327 Chapter 4
10.1145/3477322.3477328 Chapter 5
10.1145/3477322.3477329 Chapter 6
10.1145/3477322.3477330 Chapter 7
10.1145/3477322.3477331 Chapter 8
10.1145/3477322.3477332 Chapter 9
10.1145/3477322.3477333 Chapter 10
10.1145/3477322.3477334 Chapter 11
10.1145/3477322.3477335 Chapter 12
10.1145/3477322.3477336 Chapter 13
10.1145/3477322.3477337 Bios/Index

A publication in the ACM Books series, #37

Editors in Chief:
Sanjiva Prasad, Indian Institute of Technology (IIT) Delhi, India
Marta Kwiatkowska, University of Oxford, UK
Charu Aggarwal, IBM Corporation, USA

This book was typeset in Arnhem Pro 10/14 and Flama using pdfTeX. Front cover design by Birgit Lugrin and Sophia C. Steinhaeusser (copyright remains with the designers).

First Edition

10 9 8 7 6 5 4 3 2 1

Contents

Foreword xvii
Justine Cassell

Chapter 1 Introduction to Socially Interactive Agents 1
Birgit Lugrin
1.1 Potential of SIAs 2
1.2 Terminology 4
1.3 Origin and Embodiment 7
1.4 Purpose of the Book 9
1.5 Structure of the Book 11
1.6 Further Readings 15
Acknowledgments 17
References 17

PART I ESTABLISHING SIA RESEARCH 19

Chapter 2 Empirical Methods in the Social Science for Researching Socially Interactive Agents 21
Astrid Rosenthal-von der Pütten and Anna M. H. Abrams
2.1 Motivation 21
2.2 Models and Approaches 24
2.3 Research Tools 54
2.4 Current Challenges 56
2.5 Future Directions 60
2.6 Summary 61
2.A Appendix A 61
References 68

Chapter 3 Social Reactions to Socially Interactive Agents and Their Ethical Implications 77
Nicole Krämer and Arne Manzeschke
3.1 Motivation 77
3.2 Models and Approaches 80
3.3 History/Overview 84
3.4 Similarities and Differences in IVAs and SRs 89
3.5 Current Challenges 93
3.6 Future Directions 94
3.7 Summary 96
References 96

PART II APPEARANCE AND BEHAVIOR 105

Chapter 4 Appearance 107
Rachel McDonnell and Bilge Mutlu
4.1 Why Appearance? 107
4.2 History 107
4.3 Design 111
4.4 Features 120
4.5 Summary 133
References 135

Chapter 5 Natural Language Understanding in Socially Interactive Agents 147
Roberto Pieraccini
5.1 Natural Language Understanding in Interactive Agents 147
5.2 NLU and Interactive Virtual Agent Features 149
5.3 Developing NLU 153
5.4 NLU and Interaction Modalities 156
5.5 NLU Induction From Examples 163
5.6 The NLU Usability Paradox 166
5.7 Context 167
5.8 Conclusions 169
References 170

Chapter 6 Building and Designing Expressive Speech Synthesis 173
Matthew P. Aylett, Leigh Clark, Benjamin R. Cowan, and Ilaria Torre
6.1 Introduction and Motivation 173
6.2 Expressive Speech—A Working Definition 175
6.3 Building Expressive Synthesis 176
6.4 Fundamental Considerations When Designing Expressive Agents 191
6.5 Current Challenges and Future Directions in Expressive Synthesis 196
6.6 Summary 198
References 199

Chapter 7 Gesture Generation 213
Carolyn Saund and Stacy Marsella
7.1 The Importance of Gesture in Social Interaction 213
7.2 Models and Approaches 222
7.3 Similarities and Differences in Intelligent Virtual Agents and Social Robots 234
7.4 Current Challenges 238
7.5 Future Directions 242
7.6 Summary 244
References 245

Chapter 8 Multimodal Behavior Modeling for Socially Interactive Agents 259
Catherine Pelachaud, Carlos Busso, and Dirk Heylen
8.1 Motivation 259
8.2 Nonverbal Behavior Representation 261
8.3 Models and Approaches 265
8.4 History/Overview 267
8.5 Databases 286
8.6 Similarities and Differences in IVAs and SRs 288
8.7 Current and Future Challenges 290
8.8 Summary and Conclusion 292
References 293

PART III SOCIAL COGNITION—MODELS AND PHENOMENA 311

Chapter 9 Theory of Mind and Joint Attention 313
Jairo Perez-Osorio, Eva Wiese, and Agnieszka Wykowska
9.1 Social Cognitive Neuroscience and SIA 313
9.2 Theory of Mind and Joint Attention—Crucial Mechanisms of Social Cognition 316
9.3 Theory of Mind in Artificial Agents 321
9.4 Modeling Social Cognition 326
9.5 Comparison of IVAs and SRs 333
9.6 Current Challenges 335
References 336

Chapter 10 Emotion 349
Joost Broekens
10.1 Motivation 349
10.2 Computational Models and Approaches 357
10.3 History/Overview 370
10.4 Similarities and Differences in IVAs and SRs 373
10.5 Current Challenges and Future Directions 373
10.6 Summary 374
References 374

Chapter 11 Empathy and Prosociality in Social Agents 385
Ana Paiva, Filipa Correia, Raquel Oliveira, Fernando Santos, and Patrícia Arriaga
11.1 Motivation 385
11.2 Concepts and Framework 388
11.3 Models and Architectures to Build Empathy and Prosociality 396
11.4 Empathy and Prosociality in the Interaction with SIAs 402
11.5 Toward Prosociality in Populations with SIAs 409
11.6 Summary and Current Challenges 417
Acknowledgments 419
References 419

Chapter 12 Rapport Between Humans and Socially Interactive Agents 433
Jonathan Gratch and Gale Lucas
12.1 Introduction 433
12.2 Rapport Theory 435
12.3 History and Overview of Rapport Agents 437
12.4 Empirical Findings 443
12.5 Discussion and Conclusion 452
References 454

Chapter 13 Culture for Socially Interactive Agents 463
Birgit Lugrin and Matthias Rehm
13.1 Motivation 463
13.2 Theories and Approaches 465
13.3 History 473
13.4 Evaluation of SIAs That are Based on Cultural Information 478
13.5 Role of Embodiment 480
13.6 Current Challenges 481
13.7 Future Perspectives 483
13.8 Summary 486
References 487

Authors’ Biographies 495
A.1 Editors 495
A.2 Chapter Authors 496

Index 509

Foreword

In preparation for writing this foreword, I looked through old emails (really old emails) dating back to early 1998, when we were planning the “First Workshop on Embodied Conversational Characters.” In and amongst detailed menu planning for the workshop (I haven’t changed a bit since those days) are emails floating the idea of publishing a book with the best papers from the workshop. We were already starting to see a shift in the literature, away from “lifelike computer characters” (Microsoft’s Clippy was presented at a workshop with that name) and “believable computer characters” (characters whose behavior was believable, but that did not do anything for people), and we wanted the book to reflect that shift. We particularly wanted to highlight the fact that embodied conversational characters did not only talk but also listened. They were capable of understanding as well as generating language and non-verbal behavior, and they did so in the service of humans—they were agents, like travel agents or real estate agents. To that end, I sent an email to the chapter authors with the following tidbits. I wrote:

Next, a note about terminology. After long debate, we’ve decided to call the book Embodied Conversational *Agents*, and not *Characters* (for marketing reasons, in part) so you might want to follow this terminology in your chapter. Finally, do make sure to focus on the *communicative* abilities of your systems, since this is what distinguishes this work—and this book—from previous volumes on believable characters, software agents and so forth.

It’s amusing to read this today when we take for granted the agentive nature of our conversational systems. At this point, we assume that embodied conversational agents (ECAs) are designed primarily to accomplish work for people. We also take for granted that ECAs must both understand and talk. However, when the Embodied Conversational Agents book was conceived, both of those features were only newly possible. In turn, the title of the current volume highlights the most recent technological innovation, which is the ability of the systems not just to do work for


humans but to interact socially with them in the process, in many cases using social interaction as a way to bootstrap task performance. It’s illuminating to look at two other debates that took place during this same period. The first concerns what kinds of data are used to create the most natural behaviors for an ECA. The second concerns whether it is ethical to build natural-acting ECAs. While there was beginning to be consensus in the late 1990s on the idea that conversational characters could do more than just look pretty, there were three schools of thought about the proper inspiration for the conversational behaviors of ECAs (as they were called). Some of the authors in the original volume worked with actors to understand what kinds of language and non-verbal behaviors were most evocative of normal human conversation. These researchers hewed to the belief that ECAs should behave in a somewhat exaggerated fashion, like actors on a stage, in order to seem natural to their human interlocutors. Other authors believed that, being native speakers of their own language, and acculturated to the customs of their own society, the simple intuitions of the researcher were sufficient to design human-like conversational behaviors. A third group believed that psychological and linguistic studies of human conversation were the only proper inspiration for the behaviors of ECAs. Today, while a few researchers still work with actors or rely on their own intuitions, the community of researchers in ECAs (and in today’s socially interactive agents) mostly rely on empirical psychological and linguistic studies of human behavior as their inspiration. Some of these researchers carry out their own studies, and some rely on extant literature, but in both cases they rely on normal everyday humans for inspiration rather than actors or computer scientists. The debate is interesting in the face of today’s focus on big data. In fact, the increasing reliance in the field of artificial intelligence (AI) on machine learning techniques to analyze human behavior has led to a parallel increase in ECA systems that rely on deep learning techniques applied to large corpora, often of naturally produced human conversational behavior, to generate appropriate verbal and non-verbal conversational behavior. In other words, AI has brought us closer to the human-inspired ECAs of the past by bringing a focus on corpora of natural behaviors. At the same time, however, it has taken us further away from those human-inspired ECAs of the past because the corpora are too large to be examined by the human eye. Another debate that evoked heated interchanges in the late 1990s, and that is useful to contemplate today, was whether we should even contemplate deploying ECAs as interfaces to computational systems in the first place. Many if not most of the authors in the 1998 volume believed that ECAs represented a more natural way of interacting with computational systems than a keyboard and a mouse.


Their work was predicated on the assumption that interacting with a humanlike agent was a more intuitive manner of accessing technical systems. To other computer scientists of the era, however, ECAs were downright evil. Perhaps most famously, human–computer interaction researcher Ben Shneiderman saved his strongest invectives for human-like agents and their designers. In 1995, he wrote:

Anthropomorphic terms and concepts have continually been rejected by consumers, yet some designers fail to learn the lesson. Talking cash registers and cars, SmartPhone, SmartHome, Postal Buddy, Intelligent Dishwasher, and variations have all come and gone. Even the recent variation of Personal Digital Assistant had to give way to the more service oriented name now used in the Apple Newton ads: MessagePad. We’ll leave it to the psychoanalysts to fathom why some designers persist in applying human attributes to their creations … But, possibly, just possibly, all this heated debate is excessive and agents will merely become the Pet Rock of the 1990s—everyone knows they’re just for fun (Ben Shneiderman 1995. ACM Interactions 2, 1, 13–15).

Today, fears about whether robots will steal jobs, and whether machine learning will make it hard to tell who is human and who is an AI, have once again launched debates on whether human-like agents are a good or bad influence on society. These debates have led to a stronger focus on transparency in AI, a concern with bias in data, and a much-needed conversation on the ethics of where ECAs should and should not be used. These contemporary debates, however, and contra Shneiderman’s predictions, show that anthropomorphic agents have stood the test of time. The topic has inspired passion and dedication in a whole new generation of researchers. To that end, here 20 years later is a two-volume follow-up to our 1998 Embodied Conversational Agents book, with more than 25 chapters, showing the depth, breadth, innovation, creativity, and, yes, effectiveness, of human-inspired agents.

Justine Cassell

1 Introduction to Socially Interactive Agents

Birgit Lugrin

Since the commercialization of graphical user interfaces in the late 1980s, the way humans interact with computers has been dominated by their interactions through windows, icons, menus, and pointers (WIMP) interfaces, with buttons that can be clicked and information that can be read or watched in separate windows. The research discipline of human–computer interaction (HCI) is constantly developing new and creative systems that go beyond this traditional interaction for a more intuitive usage, for example, with technology such as touch interaction, virtual reality, tangible computing, and many more. Taking a different approach to realize natural and intuitive interaction, the research area of socially interactive agents (SIAs) aims to develop artificial agents that can interact via communication channels that come more natural to human interactants by equipping the interface with a body that interacts multi-modally by using verbal, para-verbal, and non-verbal behaviors. With it, communication styles that are known from human face-to-face interaction can be transferred to interaction with machines. SIAs (see Figure 1.1 for examples) have been developed under different names in different research fields such as intelligent virtual agents (IVAs), embodied conversational agents (ECAs), or social robotics (SRs) (see below for definitions of the respective terms). More than 20 years of research and development in these fields have drastically advanced the state of the art. For this book, we chose to use the term socially interactive agents (or SIAs) as it includes both physical and virtual embodiments, while highlighting their ability for social interaction as well as the need to realize socially intelligent, autonomous behaviors. We define SIAs as follows: SIAs are virtually or physically embodied agents that are capable of autonomously communicating with people and each other in a socially intelligent manner using multi-modal behaviors.

Figure 1.1: Examples of socially interactive agents (SIAs): intelligent virtual agents (left) and social robots (right). SIAs in both figures are located in the same virtual versus physical office space (reflected reality) [Eckstein et al. 2019], used for various research in the Media Informatics Laboratory of Wuerzburg University (left to right: two female agents and a male agent by Autodesk, partly adapted by features such as clothing style; Pepper by SoftBank Robotics; Reeti by Robopec; Nao by SoftBank Robotics).

In order to interact with humans in a socially intelligent manner, underlying concepts such as emotions, empathy, or how to behave in a group are essential for SIAs to interpret. To take part in the interaction, the observed input must be reasoned about and decisions must be taken upon it, resembling a cognitive process. The SIA’s (re)actions then need to be externalized through natural language, expressive speech, and non-verbal behaviors.
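The cycle described in this paragraph (observe multimodal input, reason about it, externalize a multimodal response) can be made concrete with a small piece of code. The following Python sketch is purely illustrative and not taken from the handbook; all class, function, and behavior names are hypothetical, and a real SIA would replace each stub with dedicated perception, dialogue, and behavior-generation components.

# Hypothetical sketch of the sense-reason-act cycle described above.
from dataclasses import dataclass

@dataclass
class Percept:
    """Multimodal observation of the user (e.g., from speech and vision modules)."""
    utterance: str          # recognized speech
    facial_expression: str  # e.g., "smile" or "frown"
    gaze_at_agent: bool     # whether the user is looking at the agent

@dataclass
class Action:
    """Multimodal output to be rendered by the agent's virtual or physical body."""
    speech: str             # text to be synthesized
    gesture: str            # e.g., "nod" or "lean_forward"
    expression: str         # facial display, e.g., "smile"

def appraise(percept: Percept) -> str:
    """Very rough stand-in for the emotional appraisal of the observed input."""
    return "concern" if percept.facial_expression == "frown" else "neutral"

def decide(percept: Percept, affect: str) -> Action:
    """Toy decision step: choose a socially appropriate multimodal response."""
    if affect == "concern":
        return Action(speech="You seem troubled. Can I help?",
                      gesture="lean_forward", expression="concerned")
    return Action(speech=f"I heard you say: {percept.utterance}",
                  gesture="nod", expression="smile")

# One pass through the cycle: observe, reason, externalize.
percept = Percept(utterance="I cannot find the report.",
                  facial_expression="frown", gaze_at_agent=True)
print(decide(percept, appraise(percept)))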

1.1 Potential of SIAs

The right choice of interface is not a simple one. While traditional WIMP interaction is certainly fast and well established, it is well suited mainly for simple, repetitive tasks, for example, in office work. Many of today’s challenging tasks have led to novel solutions, such as sophisticated three-dimensional (3D) interfaces to help visualize 3D problems. Analogously, for scenarios where social sensitivity and conversation are paramount, natural communication with SIAs might be the best solution. Thanks to extensive research, prototypes that include SIAs are used today in many application domains that are helpful for individuals or society, with SIAs serving as companions or assistants in aging support, health education, life-long learning, or training of specific skills. In the long run, SIAs are envisioned to unobtrusively support humans in their daily lives. Figure 1.2 illustrates that vision by extending a well-known humoristic illustration in HCI of how humans had to adapt for interaction with machines, and by adding a future perspective of how technology in the form of a SIA adapts to human-style interaction.

Figure 1.2: Vision of using SIAs in the future: progressing from having to adapt to interact with technology to a more natural communication with SIAs that assist people in their everyday lives (based on Zallinger [1965]; humoristic extensions and silhouette versions of the Zallinger image have become known as “The March of Progress,” which has achieved iconic status, so that many different versions of the March of Progress are in use today).

In some cases, using a SIA might even have advantages over a human communication partner. For example, in a tutoring scenario with a SIA an emotional distance can be kept, and a user might not feel embarrassment, for example, to admit that he or she cannot read. In addition, a training task can be repeated as often as preferred without the risk of annoying a human training partner, or having to pay for each additional lesson, providing individualized sessions to social groups that usually might not have access to private training. Also, there is a dichotomy between appearance and behavior for SIAs, allowing modification of background factors such as age, gender, personality, or ethnic background separately from the implemented role and communicative behavior. This can be useful in personalizing a SIA to provide the best possible solution for each user’s specific requirements or preferences.

Besides the many useful applications SIAs are (envisioned to be) employed in, they can serve as a research paradigm. In perception studies SIAs can serve as stimulus material. With it, they can help learn more about humans, their judgements, preferences, or emotional reactions to artificially created, yet very standardized variations of social situations. In interaction studies SIAs can serve as communication partners, allowing for high control over the experiment, ensuring detailed consistent behavior over many sessions. That way, the social behavior of humans, and the effects of different behaviors they are confronted with, can be studied.

A concern or fear that many researchers in the research area of SIAs are confronted with is the conception that these agents might be developed to replace humans in the workplace or even in social relationships. It is very important to note here that the replacement of humans is not, and has never been, a goal in the development or research on SIAs. On the contrary, SIAs are developed to support humans and assist in situations where no human support can be provided or is not desired, and to offer additional functionalities or support in social domains. As they aim to provide a more human-like interface that is intuitive to understand and interact with, they might be replacing other devices that might appear complicated to certain users for certain tasks. We want to further highlight that, particularly since SIAs enter social domains, development has to follow interdisciplinary approaches and methods, and needs to include, besides the technical know-how, expertise in psychology, sociology, and ethics.

1.2 Terminology

Since research on SIAs is manifold and researchers are coming from different disciplines and research areas, a number of terms exist that can be found in the literature. In the following, we aim to shed light on the terminology (in alphabetical order), and highlight their origin and different foci, albeit you might find some of the definitions quite similar:

Agent: “An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators” [Russell and Norvig 2009]. This very classic and well-known definition looks at agents from the perspective of artificial intelligence (AI), highlighting the autonomy of the artificial entities. These agents can be, but are not necessarily, embodied. Examples include softbots, thermostats, robots, or humans.

Avatar: An avatar represents a game unit that is under the player’s control [Kromand 2007], which is usually the graphical representation of the user in the virtual environment [Trepte and Reinecke 2010]. Unfortunately, this term is often confused with virtual or robotic agents in communities other than SIAs. Note that an avatar is not behaving or interacting autonomously with a user but representing the user in the virtual or real world. The term embodiment also has a different meaning concerning avatars, and describes the physical process to substitute (parts of) a person’s body with a virtual one by the deployment of virtual reality hardware and software [Spanlang et al. 2014].

Embodied conversational agent: “Embodied conversational agents are computer-generated cartoonlike characters that demonstrate many of the same properties as humans in face-to-face conversation, including the ability to produce and respond to verbal and nonverbal communication” [Cassell et al. 2000]. The term was defined by Cassell and colleagues in their same-named book on the topic in 2000.


The authors highlight the importance of the combination of the multi-modal interface, with a software agent and a dialog system, to assure natural conversation. While the original focus was on virtual embodiments, the term also allows robotic embodiments, and is used in both fields.

Intelligent virtual agent: “Intelligent virtual agents are interactive digital characters that exhibit human-like qualities and can communicate with humans and each other using natural human modalities like facial expressions, speech, and gesture. They are capable of real-time perception, cognition, emotion, and action that allow them to participate in dynamic social environments” [IVA 2019]. This term focuses on communicative, digital characters, and is mainly used by researchers that are affiliated with the IVA conference series. An important focus lies on the character’s intelligence that allows them to dynamically interact, as opposed to scripted behavior.

Socially assistive robot: Socially assistive robots were defined by Feil-Seifer and Matarić [2005] as robots that share characteristics with assistive robots, in particular to provide assistance to users, but are distinguished by their focus on social interaction while assisting people.

Socially intelligent agent: “The field of socially intelligent agents is characterized by agent systems that show human style social intelligence” [Dautenhahn et al. 2002]. The term was coined by Dautenhahn in the late 1990s and highlights the specific social intelligence of the agent, relying on “deep models of human cognition and social competence” [Dautenhahn 1998] that needs to comprise strongly interdisciplinary approaches. Different embodiments of these agents are possible, virtual, or robotic.

Socially interactive robot: Socially interactive robots were defined as “robots for which social interaction plays a key role” [Fong et al. 2003] in order to “distinguish these robots from other robots that involve ‘conventional’ human–robot interaction, such as those used in teleoperation scenarios” [Fong et al. 2003]. This term was defined after the definition of socially intelligent agents, to highlight the need for social interaction.

Socially interactive agent: The term socially interactive agent extends the term socially interactive robot by allowing virtual and physical embodiments. This term was used by the AAMAS (autonomous agents and multi-agent systems) community and conference series, where they are described as “capable of interacting with people and each other using social communicative behaviors common to human–human interaction. Example applications include social assistants on mobile devices, pedagogical agents in tutoring systems, characters in interactive games, SRs collaborating with humans and multimodal interface agents for smart appliances and environments” [AAMAS 2019].


Social robot: “Social (or Sociable) robots are designed to interact with people in a natural, interpersonal manner [...] They will need to be able to communicate naturally with people using both verbal and non-verbal signals. They will need to engage us not only on a cognitive level, but on an emotional level as well in order to provide effective social and task-related support to people” [Breazeal et al. 2016]. Social robotics is distinguished from robotics through its socially interactive focus with applications in domains such as education, ageing support, or entertainment. This term is predominantly used by the social robotics community and the same-named conference series and journal.

Virtual character: The term virtual character focuses on a virtual representation of a figure along with its animations. “Virtual characters in animated movies and games can be very expressive and have the ability to convey complex emotions” [McDonnell et al. 2008]. Note that they do not necessarily have to be intelligent or interactive, cf. characters of a movie. Thus, the term is often used by researchers who focus on the character’s appearance, graphics, animation, or background story.

Virtual human: “Virtual humans are artificial characters who look and act like humans but inhabit a simulated environment” [Traum 2008]. The term focuses on human-like appearance and behavior and is frequently used by American authors and research groups. Research on virtual humans often relies on highly realistic graphical representations of the characters and their animations.

Please note that the terms introduced above are the ones most commonly used. Other variations, for example, affective embodied agent, companion robot, conversational robot, relational agent, social embodied agent, socially intelligent robot, socially intelligent virtual agent, virtual agent, and so on, are also found in the literature and address similar research topics.

For the scope of this book, we use the term socially interactive agents (or SIAs) when we talk about both kinds of embodiment, virtual or robotic. We chose this term as we think it highlights the socially interactive nature as well as the intelligent background of the agent. We use the term intelligent virtual agent (or IVA) in instances where we discuss virtual representations of SIAs solely. We use the term social robot (or SR) when we discuss robotic representations of SIAs solely.
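The distinction drawn above between an avatar (driven by its user) and an agent (deciding autonomously) can be made concrete in a few lines of code. The sketch below is illustrative only; the class and method names are hypothetical and not part of any system described in this book.

# Hypothetical contrast between an avatar and an autonomous agent.
class Avatar:
    """Represents the user: every action is dictated by user input."""
    def act(self, user_command: str) -> str:
        return user_command  # e.g., "wave" because the player pressed "wave"

class SociallyInteractiveAgent:
    """Decides on its own actions based on what it perceives."""
    def act(self, perceived_user_behavior: str) -> str:
        # Autonomous decision: respond to the user rather than obey a command.
        return "wave_back" if perceived_user_behavior == "wave" else "idle"

print(Avatar().act("wave"))                    # wave (user-controlled)
print(SociallyInteractiveAgent().act("wave"))  # wave_back (autonomous)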

1.3 Origin and Embodiment

To understand why there is a large variety of terms, and why research in the fields of IVAs and SRs might seem distinct sometimes, it is informative to have a look at their different origins. IVAs originated from the idea of simulating human communication with displays of human-like communication channels such as facial expressions or gestures, and became feasible in the 1990s due to advances in computer graphics. With constant advances in computer graphics and the integration of AI methods and cognitive modeling, the communicative abilities and social behaviors of IVAs have constantly been driven further. SRs, on the other hand, originated from robotics. Since robots are leaving industrial applications and are entering private households, a closer interaction with the user in a private and social domain is imposed. Social behavior and skills that are acceptable for humans are becoming a requirement (referred to as a “robotiquette” by Dautenhahn [2007]). Thus, SRs are robots that are specifically designed to interact with people in an inter-personal manner, including the need to recognize and generate verbal and non-verbal signals, for example [Breazeal 2002]. In simple terms, one could say that in IVA research the virtual body was introduced to be able to simulate human behavior, while in SR research the physical body of the robot was naturally there and needed to be adapted for human-like behavior when interpersonal interaction was desired.

Despite their different origins, the fields of IVAs and SRs today follow the same goals and are employed in similar domains. To a certain extent, they also share common underlying technologies, such as text-to-speech systems, computer vision, or emotion detection. Also, the theoretical background from psychology or the social sciences is shared for the computational modeling of cognitive processes such as empathy. Particularly in these areas, research from IVAs and SRs can benefit greatly from one another. Other aspects are not as easily transferable from one field to the other. The key difference is the environment that the SIA inhabits and with it whether or not they share the same physical space with their human interaction partners. Particularly when it comes to human perception of the SIA or the acceptance thereof, the different type of embodiment seems to be a key factor. Also, the translation from high-level behavior (e.g., show agreement) to the concrete execution with the particular body part (e.g., nod head and smile) differs across embodiments; a small illustration follows below. Furthermore, hybrid versions are available today, where an SR contains a display on the head that shows a virtual face.
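As a rough illustration of the translation step mentioned above, from a high-level behavior such as "show agreement" to its embodiment-specific execution, the following Python sketch maps communicative intents onto concrete behaviors. It is a hypothetical example (the intent names, behavior names, and lookup structure are invented for illustration); real systems typically use richer behavior representations and dedicated behavior realizers.

# Hypothetical mapping from high-level communicative intents to
# embodiment-specific behaviors for a virtual agent ("iva") and a robot.
HIGH_LEVEL_TO_BEHAVIOR = {
    "show_agreement": {"iva": ["nod_head", "smile"],
                       "robot": ["nod_head", "eye_leds_happy"]},
    "gain_attention": {"iva": ["wave", "gaze_at_user"],
                       "robot": ["raise_arm", "turn_towards_user"]},
}

def realize(intent: str, embodiment: str) -> list:
    """Return the concrete behaviors for a given intent on a given embodiment."""
    return HIGH_LEVEL_TO_BEHAVIOR.get(intent, {}).get(embodiment, [])

print(realize("show_agreement", "iva"))    # ['nod_head', 'smile']
print(realize("show_agreement", "robot"))  # ['nod_head', 'eye_leds_happy']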


Either type of embodiment has certain advantages that might be a disadvantage of the other but does not necessarily have to be. Some of the characteristics are listed below:

Characteristics of a virtual embodiment

∙ Appearance: The look of an IVA can be freely customized and adapted for different users, applications, or contexts.

∙ Animation: The virtual face and body can be animated in a very fine-grained and realistic manner and can show a large variety of emotional expressiveness.

∙ Acceptance: IVAs are often described as non-threatening. A “safe setting” can be created through the separation of the environment that is inhabited by the human and the IVA.

∙ Duplication: Applications with IVAs can be duplicated easily and provided to many users.

∙ Easy Access to implementation: Since very good tools are available for free, students and practitioners who want to get acquainted with the research area of SIAs can have easy access to build their own IVA architecture, or application. The only requirement to get started is the availability of a computer.

∙ Easy Access to use: Applications with IVAs can be implemented for usage with traditional computers or mobile phones. This way the IVA can be deployed anywhere, anytime on people’s private devices.

Characteristics of a physical embodiment

∙ Appearance: With commercial SRs, the options to customize the appearance are limited (e.g., by adding stickers or accessories). However, today’s opportunities with 3D printers and single-board computers allow designing individual SRs at a rather low cost.

∙ Animation: The options to animate an SR are dependent on the particular model and its individual degrees of freedom (e.g., whether it has limbs or an animate-able face). Due to hardware limitations, subtle emotional expressions might not be feasible.

∙ Acceptance: It has been widely reported that the physical presence of an SR has a positive effect on the perception of users, and in particular their feeling of social presence, for example [Breazeal et al. 2016].

∙ Mobility: The most dominant advantage of a robot’s physical body lies in its ability to move around in the real world and conduct physical interaction with the environment. An SR can, for example, provide services such as serving food or beverages. However, the physical body also provides challenges, such as the risk of accidentally falling over.

∙ Physical interaction: In addition to conversational interaction, physical interaction with the human user is possible (e.g., by performing social touch). The shared space can additionally be used for conversational purposes (e.g., to gain someone’s attention).

Despite the characteristics that are implied by the embodiment of a SIA, a number of studies have directly compared physical and virtual embodiments to evaluate the outcomes of similar interactions with users; see, for example, Deng et al. [2019] for an overview. In most of these comparisons, the virtual SIA is a direct transfer of an SR into a virtual representation of the same robot. It seems the physical embodiment of an SR outperforms a virtual one, both in task performance and the perception of the users. However, the results are more inconclusive if the concepts of physical presence and embodiment are separated, by either comparing physically present SIAs to virtually present SIAs, or comparing physical SIAs with virtual SIAs both presented on a screen [Li 2015]. While directly comparing virtual and physical representations of SRs is a valid research paradigm that allows comparing between embodiments and the impact of physical presence in very controlled settings, from a practical perspective the design of a “virtual social robot” would not be beneficial. With it, most advantages of an IVA are ruled out, and the virtual representation is artificially bound to non-existing, virtual hardware limits. Advantages such as subtle animations, duplication, or potential usage on smartphones are neglected. To date, studies that compare three or more representations, or that compare a state-of-the-art SR against a state-of-the-art IVA, are rare. It also needs to be noted that moderating factors such as the interaction scenario and task, and the user’s perception of the SIA’s body-related capabilities, seem to play a crucial role in people’s ratings of the SIAs [Hoffmann et al. 2018]. The right choice of embodiment of a SIA is thus highly complex and dependent on many factors such as the situational context, role of the SIA, purpose of the application, or user’s preference.

1.4 Purpose of the Book

The fields of IVAs and SRs face similar research issues and challenges and are further developed in universities and research facilities across the world. Research on


IVAs and SRs can benefit greatly from one another and have contributed to each other’s advancement in the past. However, substantial work in both research fields is sometimes overlooked by researchers in the other area. This is partly due to the fact that different wordings are used and there exists a large number of journals and conferences that publish works on SIAs, making it very difficult to maintain a good overview. The interdisciplinary nature of SIA research also contributes to the very diverse venues where you can find relevant findings on SIAs. While researchers from the cognitive sciences bring expertise in underlying processes, communication, and interaction, computer scientists bring expertise in conceptualizing computational models and implementation. Even within a single discipline, approaches, methods, and wording can be used differently, thereby complicating cooperation. In computer science, for example, many areas are involved in SIA research, such as AI, HCI, robotics, computer graphics, or software engineering. Only through communication and research in interdisciplinary teams can the field be advanced. This constitutes one major challenge by itself, as researchers sometimes do not have enough insights into other areas (or even disciplines), and thus might not appreciate each other’s work enough. We hope that this handbook will help raise the visibility of the research in the fields involved and further close the gap between the IVA and SR communities. At the same time, we hope that in the future reinventing the wheel can be avoided. This comprehensive handbook on SIAs summarizes the research that has taken place over the last 20 years. We are referring to this period, since the first complete book on embodied conversational agents [Cassell et al. 2000, see above] appeared in 2000, although we are aware that research on this topic began earlier. By pointing out current challenges and future directions in the various topics involved, we hope to help directing future research and cooperation. In the book, we include views from an interdisciplinary perspective, containing theoretical backgrounds from human–human interaction, their implementation in computational models, their evaluation with human users, integration into applications, and ethical implications. In a structured and easily accessible way, the book (hopefully) provides a valuable source of information on SIAs for research and education. Researchers in the research area of SIAs will find it a valuable overview of the field. Teaching staff will benefit from the handbook to structure courses for undergraduate or graduate students, and with it train the upcoming generations of young researchers. Particularly now, public interest in SIAs is increasing. The book will also help professionals, and interested lay public readers, to get acquainted with this research area.

1.5 Structure of the Book

This handbook is divided into two volumes, including 28 chapters that are grouped in five major parts, to cover the major topics in the area. For the book, we have relied on our connections to both fields, IVAs and SRs, providing a collection of surveys, each written by (an) acknowledged international expert(s) of their field. Each chapter provides a survey that summarizes the theoretical background, approaches for implementation, history/overview of the topic, alongside current challenges and future directions. All the chapters discuss similarities and differences between IVAs and SRs and highlight important work of both fields. Where applicable, the chapters will follow a common structure to ensure internal consistency and facilitate understanding.

1.5.1 Volume 1 After this first chapter that introduces readers to the handbook, Volume 1 starts with Part I “Establishing SIA Research” that helps understand how research in this area is conducted and discusses the impact thereof on individuals and society. Chapter 2 “Empirical Methods in the Social Science for Researching Socially Interactive Agents,” by Astrid Rosenthal-von der Pütten and Anna M. H. Abrams, introduces the empirical methodology from the social sciences that is necessary for SIA research, particularly when it comes to research experiments including human participants. Chapter 3, “Social Reactions to Socially Interactive Agents and Their Ethical Implications,” by Nicole Krämer and Arne Manzeschke, looks at SIA research from a psychological and ethical perspective. It points to numerous studies demonstrating that people (unconsciously) react socially toward artificial entities, and that as soon as they display social cues people can also be manipulated or socially influenced. Part II “Appearance and Behavior” deals with the impact of the looks of SIAs and the various aspects of multi-modal behavior that need to be taken into account when convincing SIAs behavior is modeled. Chapter 4 “Appearance,” by Rachel McDonnell and Bilge Mutlu, argues that, compared to voice assistants, embodied agents enable the use of appearance-based cues from human–human interaction, such as mutual gaze, that are known to improve social outcomes. The chapter shows that the appearance of a SIA can affect how people perceive, respond to, and interact with it. Chapter 5 “Natural Language Understanding in Socially Interactive Agents,” by Roberto Pieraccini, introduces natural language understanding as an essential part


of any interactive agent and highlights its complexity, particularly for SIAs that need to react to user-initiated interactions across various application areas. Chapter 6 “Building and Designing Expressive Speech Synthesis,” by Matthew Aylett, Leigh Clark, Benjamin R. Cowan and Ilaria Torre, gives an overview of definitions, methods, and state-of-the art in expressive voices, and critically discusses when and where expressive speech is beneficial. Chapter 7 “Gesture Generation,” by Carolyn Saund and Stacy Marsella, discusses the complexity of communicative gestures and how they enhance communication in human–human conversation, and summarizes the research and their challenges in the transfer of this complexity in the implementation with SIAs. Chapter 8 “Multi-modal Behavior Modeling for Socially Interactive Agents,” by Catherine Pelachaud, Carlos Busso, and Dirk Heylen, extends the theme of nonverbal behavior by adding additional modalities such as gaze, smiles, or social touch. Starting from introducing concepts from the social sciences, the chapter has a strong focus on the different computational models that can be employed for the implementation of multi-modal behaviors. Part III “Social Cognition—Models and Phenomena” investigates internal processes known from human cognition that are driving forces in human–human interaction, and demonstrates how they are addressed in SIA systems. Chapter 9 “Theory of Mind and Joint Attention,” by Jairo Perez-Osorio, Eva Wiese, and Agnieszka Wykowska, introduces the two crucial mechanisms of social cognition, and explains how they apply to the interaction between humans and SIAs from two angles: evoking human social cognition and modeling artificial social cognition. Chapter 10 “Emotion,” by Joost Broekens, focuses on the computational representation of emotion and other related affective concepts such as mood, attitude, or appraisal, and highlights how SIAs can make constructive use of them. Chapter 11 “Empathy and Prosociality in Social Agents,” by Ana Paiva, Filipa Correia, Raquel Oliveira, Fernando Santos, and Patrícia Arriaga, focuses on empathy and in particular on the related concept of prosociality (conducting positive and voluntary behavior that should benefit others). With it, the authors provide a framework including the main variables needed to design prosocial agents, for individual or dyadic interactions, or at the society level. Chapter 12 “Rapport Between Humans and Socially Interactive Agents,” by Jonathan Gratch and Gale Lucas, introduces rapport (a fine-grained emotional communicational interplay) in the communication of humans and machines by approaching it from a theoretical, computational, and empirical side, and demonstrating its benefits.


Chapter 13 “Culture for Socially Interactive Agents,” by Birgit Lugrin and Matthias Rehm, introduces culture and its implementation in SIAs, and argues that implementing culture for SIAs can be beneficial not only to raise their acceptance in certain user groups but also to be able to teach about cultural differences and foster cultural diversity.

1.5.2 Volume 2 The second volume of this handbook starts with a preface that recaps the most important aspects and terminology of its introductory chapter. Part IV “Modeling Interactivity” explains how interaction with human users or other SIAs is modeled, and how the many detailed aspects of multi-modal, multi-party, adaptive interactivity are implemented. Chapter 14 “Interaction in Social Space,” by Hannes Högni Vilhjálmsson, deals with the intricate social performance that inevitably takes place when SIAs and human users share the same social space (virtual or physical), regardless of their explicit intentions to connect with one another. Chapter 15 “Dialogue for Socially Interactive Agents,” by David Traum, introduces several approaches to modeling the structure of extended verbal and multimodal interactions, with an emphasis on how different kinds of embodiment impact the communication affordances and requirements for SIA tasks. Chapter 16 “The Fabric of Socially Interactive Agents—Multi-modal Interaction Architectures,” by Stefan Kopp and Teena Hassan, presents different SIA architectures and gives an extensive overview on how SIAs can engage in dynamic and fluid social interaction, discussing different approaches to deal with multi-modality and interactivity. Chapter 17 “Multi-party Interaction Between Humans and Socially Interactive Agents,” by Sarah Gillet, Marynel Vázquez, Christopher Peters, Fangkai Yang, and Iolanda Leite, looks into SIAs that interact with a group of people for which the complex group dynamics need to be understood, and highlights that the SIA can affect and even explicitly influence the group’s dynamics. Chapter 18 “Adaptive Artificial Personalities,” by Kathrin Janowski, Hannes Ritschel, and Elisabeth André, focuses on how a SIA can automatically adapt its personality in accordance with the user’s preferences, and with it make the interaction with them more enjoyable and productive. Chapter 19 “Long-term Interaction with Relational Socially Interactive Agents,” by Jacqueline M. Kory-Westlund, Cynthia Breazeal, Hae Won Park, and Ishaan Grover, argues that strong relationships support people in achieving their goals in various domains, and thus relational SIAs have the potential to scaffold humans in their long-term endeavors.


Chapter 20 “Platforms and Tools for Socially Interactive Agent Research and Development,” by Arno Hartholt and Sharon Mozgai, gives a practical introduction to the history of SIA platforms and tools directing to state-of-the-art technical solutions that support the development and implementation of SIAs. Part V “Areas of Application” gives an overview of the major domains in which SIAs are employed, directing to systems and research findings, highlighting the benefits of SIAs to individuals and society. Chapter 21 “Pedagogical Agents,” by H. Chad Lane and Noah L. Schroeder, introduces work with SIAs in the domain of education, examining social aspects of teaching and learning and summarizing empirical research with pedagogical agents. Chapter 22 “Socially Interactive Agents as Peers,” by Justine Cassell, describes work that uses SIAs that are designed to work or play with children or teenagers at an eye level, discussing the benefits of SIAs that look and act like peers rather than teachers, tutors, or parents. Chapter 23 “Socially Interactive Agents for Supporting Aging,” by Moojan Ghafurian, John Edison Munoz Cardona, Jennifer Boger, Jesse Hoey, and Kerstin Dautenhahn, is centered on work with SIAs located in the area of aging support that aim to improve older adults’ quality of life and wellbeing. The chapter provides methods and suggestions to address the many challenges that arise when designing SIAs that should successfully assist the targeted user group. Chapter 24 “Health-related Applications of Socially Interactive Agents,” by Timothy Bickmore, addresses another area of major societal importance, and highlights the potential of SIAs that have shown to have a positive impact on voluntary changes in health behavior. Chapter 25 “Autism and Socially Interactive Agents,” by Jacqueline Nadel, Ouriel Grynszpan, and Jean-Claude Martin, reviews work that uses SIAs to study or help improve the social skills of people with autism spectrum disorder. The chapter highlights the improvements that have been achieved throughout the last two decades and that, following a multi-disciplinary approach, more can be expected in the future. Chapter 26 “Interactive Narrative and Story-telling,” by Ruth Aylett, introduces narrative and storytelling as fundamental human capabilities, and outlines how SIAs are used in character- or plot-based systems, highlighting the great challenge of interactivity in this domain. Chapter 27 “Socially Interactive Agents in Games,” by Rui Prada and Diogo Rato, discusses the complexity in which SIAs have been used in games, and introduces their different roles alongside with their contributions to gameplay.


Chapter 28 “Serious Games with Socially Interactive Agents,” by Patrick Gebhard, Dimitra Tzovaltzi, Tanja Schneeberger, and Fabrizio Nunnari, focuses on serious games, which can partly be seen as a means to achieve certain goals in various domains (such as education or health-behavior change) using specific methods from games and interactive narratives. The chapter thus considers learning gain as well as individual experience during game play.

1.6 Further Readings

Since the research area of SIAs is interdisciplinary, and researchers approach it from different angles and disciplines, a large number of books, conferences, and journals present work on SIAs. Below, we suggest further readings (in alphabetical order), but do not claim that the list is complete. A few books have appeared that focus on SIAs:

C. Bartneck, T. Belpaeme, F. Eyssel, T. Kanda, M. Keijsers, and S. Sabanovic, Human–Robot Interaction—An Introduction, Cambridge University Press, Cambridge, 2019.



C. Breazeal, Designing Sociable Robots, MIT Press, Cambridge, MA, 2002.



J. Cassell, J. Sullivan, and S. Prevost, Embodied Conversational Agents, MIT Press, Cambridge, MA, 2000.



K. Dautenhahn, A. Bond, L. Canamero, and B. Edmonds, Socially Intelligent Agents, Springer, Boston, 2002.



J. Gratch and S. Marsella, Social Emotions in Nature and Artifact, Oxford University Press, Oxford, 2014.



N. Magnenat-Thalmann and D. Thalmann, Handbook of Virtual Humans, Wiley, West-Sussex (England), 2007.

In addition, chapters on SIAs are part of broader handbooks:

R. Calvo, S. D’Mello, J. Gratch, and A. Kappas, Handbook on Affective Computing, Oxford University Press, Oxford, 2015.
Facial expressions of emotions for virtual characters, M. Ochs, R. Niewiadomski, and C. Pelachaud
Expressing emotion through posture and gesture, M. Lhommet and S. Marsella
Emotion modeling for social robots, A. Paiva, I. Leite, and T. Ribeiro
Preparing emotional agents for intercultural communication, E. André


Affect in human–robot interaction, R. C. Arkin and L. Moshkina
Relational agents in health applications: leveraging affective computing to promote healing and wellness, T. W. Bickmore

B. Siciliano and O. Khatib, Springer Handbook of Robotics, Springer International Publishing, Switzerland, 2016.
Cognitive human–robot interaction, B. Mutlu, N. Roy, and S. Sabanovic
Social robotics, C. Breazeal, K. Dautenhahn, and T. Kanda
Socially assistive robotics, M. Mataric and B. Scassellati

Likewise, a number of conferences address work related to SIAs. Please note that, as opposed to some other research domains, in the area of SIAs, and in computer science in general, conferences and their proceedings are as important as (and sometimes even more important than) journal papers. High-quality conferences have acceptance rates of 15% or lower. In the domain of SIAs, the following conferences are of relevance, albeit they strongly differ in their acceptance rates:

International Conference on Autonomous Agents and Multiagent Systems (AAMAS) (http://www.ifaamas.org/), since 2002, proceedings by IFAAMAS, available in the ACM Digital Library.



International Conference on Computer Animation and Social Agents (CASA) (https://dl.acm.org/conference/casa), since 2004, proceedings by ACM.



International Conference on Human–Agent Interaction (HAI) (http://haiconference.net), since 2013, proceedings by ACM.



International Conference on Human–Robot Interaction (HRI) (https://dl.acm.org/conference/hri), since 2006, proceedings by ACM.



International Conference on Intelligent Virtual Agents (IVA) (https://dl.acm.org/conference/iva), since 1998, proceedings by ACM (Springer until 2017).



International Conference on Robot and Human Interactive Communication (ROMAN) (https://www.ieee-ras.org/conferences-workshops/financially-co-sponsored/roman), since 1992, proceedings by IEEE.



International Conference on Social Robotics (ICSR) (https://link.springer.com/conference/socrob), since 2010, proceedings by Springer.

A number of journals publish work related to SIAs:

ACM Transactions on Human–Robot Interaction (https://dl.acm.org/journal/thri), since 2012, ACM.




ACM Transactions on Interactive Intelligent Systems (https://dl.acm.org/journal/tiis), since 2011, ACM.



Autonomous Agents and Multi-Agent Systems (https://link.springer.com/journal/10458), since 1998, Springer.



Computers in Human Behavior (https://www.journals.elsevier.com/computers-in-human-behavior), since 1985, Elsevier.



Frontiers in Robotics and AI, Section Human–Robot Interaction (https://www.frontiersin.org/journals/robotics-and-ai/sections/human-robot-interaction).



IEEE Transactions on Affective Computing (https://www.computer.org/csdl/journal/ta), since 2010, IEEE.



International Journal of Human–Computer Studies (https://www.journals.elsevier.com/international-journal-of-human-computer-studies), since 1994, Elsevier.



International Journal of Social Robotics (https://link.springer.com/journal/12369), since 2009, Springer.



Journal on Multimodal User Interfaces (https://link.springer.com/journal/12193), since 2007, Springer.

Acknowledgments
Many thanks to Catherine Pelachaud and Astrid Rosenthal-von der Pütten for their very helpful comments on this introductory chapter!




PART I: ESTABLISHING SIA RESEARCH

2 Empirical Methods in the Social Science for Researching Socially Interactive Agents
Astrid Rosenthal-von der Pütten and Anna M. H. Abrams

2.1 Motivation

This introductory methods chapter is meant to be an informative overview for all non-social scientists who work with socially interactive agents (SIAs) and who would like to familiarize themselves with empirical methodologies in psychology and the social sciences. It is primarily written for young scholars, that is, undergraduate or graduate students who are new to this field of research and new to empirical methods in the social sciences. We will clarify the research process and explain methods for studying research questions surrounding human-centered development, testing, and distribution of SIAs. In particular, we will provide answers to the following questions:

What do we mean by methods in empirical social sciences? (Section 2.1.2)



Why do I need methodological knowledge in empirical social sciences? (Section 2.1.1)



Which research questions are addressed in empirical social sciences? (Sections 2.1.2 and 2.2)



Which empirical methods should I use to address my research question? (Section 2.2)



How does the chosen method work, in principle, and what aspects are important to consider when constructing, conducting, and analyzing my study and its results? (Section 2.2)




Where can I find additional resources about methods in the empirical social sciences? (Section 2.3)



What are the hot topics discussed in the community concerning methods? What are the current challenges and future directions? (Sections 2.4 and 2.5)

This chapter will also be useful for established scholars in the field as we provide an overview of different methods that can serve as inspiration. Furthermore, we include helpful material such as lists of online tools, questionnaires, and specialized methods books, and will point you in the right direction for further reading.

2.1.1 Why Do I Need Methodological Knowledge in the Empirical Social Sciences?

Depending on your discipline, you have a specific understanding of the term “methods.” An engineer might understand methods as different systematic approaches that can be followed in order to reach the desired (technical) solution to a problem. The engineering method consists of stages such as idea, concept, planning, design, and then development of the former into a working prototype that demonstrates the solution to the problem [Ertas and Jones 1996]. The solution may be a tangible working prototype or an intangible working simulation. This prototype is being tested and debugged before launch. In computer science, depending on the problem statement, you might use theoretical, experimental, or simulation computer science methods. For instance, the experimental computer science approach [Zelkowitz and Wallace 1998] serves to identify concepts that facilitate solutions to a problem and then evaluate these solutions. One example for this evaluation process would be simulation studies with which researchers can evaluate a technology by executing the product using a model of the real environment, testing whether their hypothesis of the environment’s reaction to the technology is supported. These are examples of methods that need no human in the loop (except for the engineer or computer scientist). In contrast to engineering and computer science, in psychology and the social sciences the human being and its relation to other human beings is the central focus of the research endeavor. Psychology is a scientific attempt to understand and explain human mental processes and behavior. Psychological science includes fields such as perception, cognition, attention, emotion, intelligence, subjective experiences, motivation, brain functioning, and personality. In social psychology this extends to interaction between people, such as interpersonal relationships. The social sciences are concerned with the scientific study of human society and social relationships.


The term SIA already implies why you will need to gain at least some knowledge about social science concepts and methods. SIAs are meant to be “socially” interactive, drawing on social psychological principles of interaction. Moreover, SIAs are developed to be deployed in social settings (rather than caged robot arms in production lines). Thus, their development and deployment involve an additional problem space beyond the technical questions that we have discussed above. For this additional problem it will be useful to know about empirical methods in psychology and the social sciences. Consider that you followed a systematic approach to develop a social robot that helps to gather supplies in a hospital and assists nurses. You have run simulations to test whether it moves correctly and whether speech input is processed as intended. You have benchmarked two different navigation systems and two different natural language processing units and identified which one performs better on your training data. Now, you are ready to give the social robot the go to interact with humans. Will the human, let’s say his name is Ben, find the robot useful? Is the interaction smooth? Does Ben understand the functionality of the robot? Does he like working with it? Does Ben consider the robot a team member? Does the social robot change the way the human team members work with each other, and if so, in what way? When you want to answer these questions, you need to know about the process of studying human perception, human behavior, and human attitude building. Ideally, engineers, computer scientists, and researchers in the field of psychology and social sciences work together in an interdisciplinary team from the start until the end of a development process following a human-centered design approach.

2.1.2 What Are Methods in the Empirical Social Sciences?

There are different methods for the acquisition of knowledge. We consider ourselves social scientists and will therefore apply an empirical approach to acquiring knowledge instead of knowing because we have a “gut feeling,” because it has always been like that, or because an authority said so. We will apply the empirical method that uses observation or direct sensory experience to obtain knowledge and uses evidence for verification of information [Gravetter and Forzano 2012, pp. 13–15]. Within the empirical method, we can follow the hypothetico-deductive model of the scientific method and engage in an “approach to acquiring knowledge that involves formulating specific questions and then systematically finding answers” [Gravetter and Forzano 2012, p. 16]. In contrast, there are also systematic methodologies that are based on empirical data but use inductive reasoning, for example, focusing on the construction of (new) theories through methodical gathering and analysis of data, such as grounded theory. This approach will only be


briefly covered in this chapter (see Sections 2.2.3.1 and 2.4.2), but you will find recommendations for further reading in Section 2.3. Once you have specified your research question or hypothesis, you have to think about your research strategy. In Section 2.2 you will learn more about different research strategies. Most commonly, in the field of SIAs researchers conduct evaluation studies. Evaluation is the process of developing and implementing a plan to assess something (e.g., your SIA) against the background of a specific research question or hypothesis using a systematic approach to assessment through previously defined measures (see Section 2.2.1). These measures can be quantitative or qualitative (see Section 2.4.2). Evaluations serve to determine the merit, worth, or value of something to inform judgements about the relative strengths and weaknesses, and the impact of variables. Since they are so prevalent in SIA research, we will put a focus on evaluation studies, which can be realized in different ways (see Section 2.2.3).

2.2 Models and Approaches

How do you proceed once you have made up your mind that you want to do a study? In the following, we will guide you through the research process step by step. This section includes the research process in eight steps (see Section 2.2.1). Please note: the elaborations regarding the steps and important concepts and factors for each step are limited. In this book chapter, we can only provide a glimpse into the broad topic of empirical social science methods. In addition, you will find recommendations for further reading throughout this section and in Section 2.3. We provide two scenarios to exemplify how researchers derive a study design and which methodological choices they make considering the appropriateness of different methodological options. The following two scenarios are meant to give you concrete examples of methodological options when explaining the research process steps in Section 2.2.1; however, we also go through the full procedure of how to plan, conduct, analyze, and report a study using the two examples in Sections 2.2.2.1 and 2.2.2.2 to provide a more “hands-on” guide.

Example 1—Evaluating a Learning Robot: Imagine there is a competition within your class on social robotics. Using the Keepon robot platform, the students in your class form two teams, each building a social robot to assist children with vocabulary learning for Spanish. The robots differ with regard to the social roles they take on in interaction (see Figure 2.1). The robot of the T-Team acts like a tutor, while the robot of the P-Team acts like a peer. You want to know which robot is better at helping children and which team has won the competition.

Example 2—Developing an Agent for Assisted Living: You are working at a research lab in a third-party-funded project with the aim of developing a virtual assistant for

older adults to be installed in their homes (see Figure 2.1). You are at the beginning of the project and want to know who exactly the target group is for this technology, what the virtual assistant should be capable of, and how it should look. At the end of the project there should be a prototype and an estimate of whether this might be a successful product on the market.

Figure 2.1 Example scenarios for study design.

2.2.1 The Research Process

The textbooks on empirical methods agree on the nature of the research process as involving at least eight steps [e.g., Gravetter and Forzano 2012]:

Step 1—Find a research topic



Step 2—Form a research question or hypothesis



Step 3—Define the research strategy and experiment design



Step 4—Operationalization of variables



Step 5—Define and select sample



Step 6—Conduct the study/data collection



Step 7—Data processing and data analysis



Step 8—Report results


For each step, we will provide an overview of (i) what has to be considered in the step, (ii) which methodological decisions a researcher has to take, and (iii) what the methodological alternatives are for the respective decision.

2.2.1.1 Step 1—Find a Research Topic
The research process begins with identifying the research topic. In our examples, the research topic is given because the lecturer decided to run a competition, or the funding agency provided money to develop an assistant to bring to market. However, when you are about to do a Bachelor’s, Master’s, or PhD thesis, you will have to define your own research topic. You might want to identify a human need and develop a SIA addressing that need. You might be inspired by one of the topics covered in this book and want to contribute to this area, or you might have observed a social phenomenon in interactions with SIAs that in your opinion deserves further investigation. All of these exemplary approaches are valid methods to identify and define a research topic.

2.2.1.2 Step 2—Form a Research Question or Hypothesis
Once you identify a research topic, you will have to review the literature in that field and find the specific research question(s) you want to address. If applicable (and in most cases it is applicable) you should consult theories that are relevant to your research topic. The literature review will help you to define your central concepts and get an overview of which research questions have already been addressed, what empirical evidence is available, and where the research gaps are. This allows you to formulate research questions or derive hypotheses that are based on prior findings regarding that research question (see Section 2.2.2.1 on how to do this based on our examples).

2.2.1.3 Step 3—Define the Research Strategy and Experiment Design
There are many different ways to design a study. Your research strategy and study design depend on the type of research question or hypothesis you have proposed. Remember that a “research strategy is a general approach to research determined by the kind of question that the research study hopes to answer.” [Gravetter and Forzano 2012, p. 159]. We will review different types of research strategies and explain when they are applicable: (i) the descriptive research strategy, (ii) the correlational research strategy, and (iii) experimental or quasi-experimental research strategies. Moreover, regarding the latter research strategy, experiments, we provide additional information on how to design experiments and explain three


types of experiment designs: (i) the within-subjects design, (ii) the between-subjects design, and (iii) the factorial design.

Research Strategies. The descriptive research strategy is “intended to answer questions about the current state of individual variables for a specific group of individuals” [Gravetter and Forzano 2012, p. 160] and is not concerned with relationships between variables. For instance, you could assess how much money people would be willing to spend on a virtual assistant or which functions they would want to have incorporated into the system. If you’re going to examine the relationship between variables, there are two different ways to do so. One approach includes simple observation of variables of interest as they exist naturally for a set of individuals. This is called the correlational research strategy. If you are interested in the amount of money older adults are willing to spend depending on their income, you would choose a correlational strategy. You would assess people’s willingness for investment and their income and run a statistical test on this data to discover a correlation. You would continue to examine whether there is any pattern of relationship between the variables and how strong this relationship is. This strategy can only describe a relationship but cannot explain the relationship because correlation is not causation. Another approach to examining relationships follows an experimental or quasi-experimental research strategy. The experimental research strategy is “intended to answer cause-and-effect questions about the relationship between two variables” [Gravetter and Forzano 2012, p. 163]. You can answer questions such as “does interacting with a robot peer lead to longer attention in a learning task compared to interacting with a robot tutor?” To answer cause-and-effect questions, you manipulate one variable (the independent variable) to create so-called treatment conditions (robot peer vs. robot tutor). In addition, you prepare for measurement of a second variable (the dependent variable) to obtain a set of scores within each treatment condition (attention span while learning). It is of great importance that all other potentially influencing variables are controlled, as far as possible. By controlling all other variables, you can conclude that differences in the scores of your dependent variable between treatment conditions are due to your manipulation of the independent variable. With regard to our example, you would design an experiment in which one group of children is interacting with the peer robot and one group of children is interacting with the tutor robot (two treatment conditions, independent variable: social role of robot) and you measure how long they focus their attention on the task (dependent variable: attention) and compare the scores between the two groups. It is crucial that you randomly assign participants to one of the groups. In the context of our example, this can easily be


done by inviting children into your lab and randomly assigning them to treatment conditions. Sometimes, however, it is not so easy to assign participants randomly to the experimental groups. Imagine you are conducting the experiment in a school. For a whole week, you put the peer robot into one classroom and the tutor robot into another classroom. You cannot resolve the class structure for a week and randomly assign children; thus, you use two naturally existing groups: the classrooms. At the end of the experiment you are comparing learning gains in each class by administering a vocabulary test. This is called a quasi-experimental research strategy. Although quasi-experimental settings use some of the rigor and control of true experiments, they are always flawed to a certain extent and cannot obtain an absolute cause-and-effect answer because there might exist other group factors systematically influencing the outcome. For instance, the human teacher in one classroom might motivate children more to use the social robot for learning vocabulary, thereby generating more learning time and greater learning gain (thus, in this example, the possible confounding variable is motivation by the teacher).

Experiment design. There are, however, many more methodological decisions required when planning an experiment. For instance, you have to decide whether to use a within-subjects design, a between-subjects design, or a factorial design. As for the within-subjects design, you would use a single group of participants who receive or experience all of the treatment conditions. Thus, a within-subjects design looks for differences between treatment conditions within the same group of participants. In this case you would have one group of children who interact with both robots, the peer and the tutor. In contrast, the between-subjects design requires separate independent groups of participants for each condition. In this case you would use two groups of children. Each group interacts with only one version of the robot. Sometimes, researchers want to investigate more than one independent variable. This would require designing a factorial study design. In the context of the current example, you might be interested in the question of whether girls and boys react differently to peer or tutor robots, thereby introducing a second independent variable (in this case a so-called quasi-independent variable because you cannot actively manipulate the gender of your participants, but there are naturally existing groups). When two or more independent variables are combined in a single study, they are called factors. Our example would be a two-factor design in which both factors have two values, resulting in a 2 × 2 factorial design with the factors gender (values: boy or girl) and the robot’s social role (values: peer or tutor). You can design this study as a complete between-subjects design or as a so-called mixed design in which one factor is a between factor (gender) and one is a within factor (robot’s social role).
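To make the ideas of random assignment and the 2 × 2 mixed design more concrete, the following short Python sketch shows one possible way to assign participants. It is only an illustration under assumed variable names and an invented participant list; the chapter itself does not prescribe any implementation.

# Hypothetical sketch: assigning participants in a 2 x 2 mixed design
# (between factor: gender, a quasi-independent variable that is not assigned;
# within factor: robot social role, whose order is randomized).
import random

participants = [
    {"id": 1, "gender": "girl"}, {"id": 2, "gender": "boy"},
    {"id": 3, "gender": "girl"}, {"id": 4, "gender": "boy"},
]

random.seed(42)  # fix the seed so the assignment can be reproduced

for p in participants:
    # Every child experiences both treatment conditions (within factor);
    # only the order is randomized to control for sequence effects.
    order = ["peer", "tutor"]
    random.shuffle(order)
    p["robot_order"] = order

for p in participants:
    print(p["id"], p["gender"], "->", " then ".join(p["robot_order"]))

In a pure between-subjects design, the same idea applies, except that each participant would be randomly assigned to exactly one of the treatment conditions instead of receiving both in a randomized order.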


When planning an experiment, please note that you might want to include a control condition (or a control group). A control condition refers to a non-treatment condition in an experiment where participants do not receive the treatment being evaluated. Here, a reference classroom group that does not interact with a robot but has normal class and is also measured in the dependent variable. 2.2.1.4

Step 4—Operationalization of Variables The next important step in planning your study is the operationalization of your variables. In this step, we explain (i) what operational definitions are, (ii) why it is important to consider different modalities of measurement, and (iii) what scales of measurement exist. In step 2 of the research process, the task was to identify theories relevant to your research question and to define appropriate constructs. The “problem” with constructs is that they are hypothetical attributes or mechanisms that help explain and predict behavior in a theory. Examples of constructs are motivation, knowledge, intelligence, or cognitive load. These constructs cannot be observed or measured directly, but it is possible to observe and measure the external factors and the external behaviors associated with the construct. Constructs can be influenced by external stimuli and in turn can influence external behavior. For instance, the theory of similarity attraction suggests that people are like others who they perceive as being similar to themselves, rather than dissimilar. Attraction is the relevant construct here. Attraction is hard to measure directly because it is a mental process. However, we can manipulate external factors such as similarity of the other person (e.g., similar = same gender/attitude/similar appearance; dissimilar = opposite gender/diverging attitude/diverging appearance). Moreover, we can observe and measure external behavior that might be affected by attraction such as a rating for how much we like that other person. What is needed is an operational definition that “specifies a measurement procedure (a set of operations) for measuring an external, observable behavior, and uses the resulting measurements as a definition and a measurement of the hypothetical construct” [Gravetter and Forzano 2012, p. 105]. This process is also referred to as operationalization. In our example, the construct similarity can be operationally defined in a variety of ways. For instance, for our group of participants evaluating the assistive agent, we created an agent more similar (matching gender) or dissimilar to them (opposite gender). Hence, we are comparing two different levels of similarity that in this case is defined by whether or not the agent has the same gender.


A simple way to come to the operational definition for the variables of interest is to consult previous research that made use of the same variable, because this research should report in detail how the variables have been defined and measured. By adopting these definitions and measurements in your study your results will be directly comparable to the results obtained in previous research. Usually, there are different options for measuring any particular construct and variable. For example, when you want to assess acquired knowledge in a specific language you could use self-report and ask people how much they think they have learned, you could administer language tests (e.g., vocabulary) or observe whether the verbal behavior in that language has changed and is more fluent, more verbose, and contains less grammatical errors than before a treatment. In this example we would use different modalities of measurement: self-report measures such as interviews and questionnaires, and behavioral measures such as performance tests or behavior in interactions. A third modality are physiological measures (e.g., galvanic skin response, heart rate, or brain imaging techniques). All three modalities have certain advantages and disadvantages that can influence the quality of the measurement. There are two criteria for the evaluation of quality of operationalizations of variables and these are validity and reliability. A valid measurement has been demonstrated to actually measure what it claims to be measuring and a reliable measurement is able to produce identical results when it is used repeatedly to measure the same individual under the same conditions (see Gravetter and Forzano [2012, pp. 107–119]). If participants deliberately lie in a self-report this poses a threat to the validity of your measurement. In case you decide to use increased heart rate as a measure for similarity attraction you also might face a validity problem. Heart rate can increase due to a number of causes such as fear, anxiety, arousal, or embarrassment. The question is how can you be sure that measurement of heart rate is in fact a measurement for fear? To determine the validity and reliability of measures you should learn and read more about different types of validity and reliability in a methods book (see Section 2.3 for suggestions, e.g., some types of reliability can be tested for with statistical tests) and consult more closely the discussions in previous work using the variables you are using. Once you have chosen the measures that you want to use in your study, you should be aware of the scale of measurement. Traditionally, there are four types of measurement scales: nominal scales, ordinal scales, and interval and ratio scales. Nominal scales represent qualitative (not quantitative) differences in the variable measured (some are female or male; being female is not superior nor inferior to being male). Categories on an ordinal scale are organized sequentially and consists of a series of ranks (e.g., first, second, third; small, medium, large). With an ordinal scale, you can determine not only differences but also the direction of differences


(not the magnitude of differences). Interval and ratio scales are organized sequentially, and all categories have the same size [e.g., degrees in Celsius, each interval (degree) has the same size]. Hence, interval and ratio scales allow the determination of difference as well as its direction and magnitude. Interval scales have an arbitrary zero point (e.g., Celsius or Fahrenheit have an arbitrary zero point in addition to positive and negative values) while ratio scales have a meaningful zero point. For ratio-scaled variables, zero is the complete absence of something. The scale of measurement of your variables also determines which statistical test you can use when describing your data and when trying to discover relationships between variables. In this regard, please note that so-called Likert scales (explanation can be found below in the examples) that are most frequently used in self-assessments are ordinal scaled but given the robustness of many parametric tests can be used as interval scales in statistical testing (see Norman [2010]). 2.2.1.5 Step 5—De ne and Select Sample Once you have established your study design and measures, you should invest some thought into defining and selecting your sample. In this step we explain (i) what is a population, a target population, and a sample; (ii) different sampling procedures and when to use them; and (iii) how to determine the adequate sample size for your study by using power analysis. We therefore briefly explain statistical hypothesis testing. First, we have to distinguish between the population, being the large group of interest to a researcher, and the sample, the small set of individuals who participate in the study. Very often, you will have a so-called target population that is defined by the researcher’s specific interests. By target population, researchers address a group of individuals in the target population that shares one specific characteristic. For instance, a target population could constitute all German children in fourth grade or all individuals over 70 years living alone in an independent home. Usually, researchers do not have the means to draw a sample from the whole target population (all children in second grade), but from an accessible population (e.g., all children in second grade in one city). However, the goal is always to generalize study results of the sample to the population. Therefore, researchers seek to find a representative sample that closely mirrors or resembles the population and its defined characteristics. When the sample does not closely resemble the population but has different characteristics from those of the population, this is called a biased sample. Researchers have to be careful which sampling procedures they use in order to avoid sampling bias.


The likelihood of the sample being representative or biased depends on the procedure that is used to select participants for your study. There are two types of sampling procedures: probability sampling methods and non-probability sampling methods. Probability sampling methods require that the odds of selecting a particular individual are known and can be calculated. In order to do so, you must (i) know the exact size of the population and all its members, (ii) each individual in the population must have a specified probability of selection, and (iii) selection of individuals must be a random process. For non-probability sampling methods, the odds of selection are not known, the researcher does not know the population size and cannot list all members of the population. In this case, you do not use an unbiased method of selection. Thus, non-probability sampling methods have a greater risk of producing a biased sample. For the research field of SIA, not all population parameters are understood and can be identified. It is, thus, unlikely that you will be able to perform probability sampling methods. You will more likely perform non-probability sampling methods, such as convenience sampling. Convenience sampling means that you will be using those individuals who you have easy access to. Availability and individuals’ willingness to participate are the decisive factors here. These are, for instance, students who are enrolled in one of your classes, or the children of the elementary school where you know teachers who are willing to help you in doing a study, or those people in the mall that happen to be there when you are conducting a field trial with your new social robot. Although convenience samples are obviously convenient, that is, less expensive and easier to get, they are also more prone to be biased. There are, however, ways to handle potential bias. You can ensure that your sample is reasonably representative and not strongly biased; for instance, you can work with schools from different districts of the city and be careful to select a broad cross-section of children (males and females, with siblings and only child, with and without immigration background). Moreover, you should describe your sample in detail in your research report and thus allow other researchers to evaluate how representative or biased your sample might have been and take this into consideration when evaluating the results of your study. Once you know how you want to select your sample you have to determine the required sample size for your study—how large should the sample be in order to be representative? A general principle from statistics is the law of large numbers: the larger the sample size, the more representative the sample. There are, however, also practical limits to the sample size (e.g., time and expenses). Thus, most often you will have to compromise between the benefits and advantages of a large sample size and the costs of running a study with many participants. A rule of thumb is that you need about 25–30 individuals in every group you are testing [Gravetter and


Forzano 2012, p. 142] because accuracy of the sample mean in relation to population mean increases with sample size, but the improvement of accuracy slows dramatically once the sample size is around 30 (per experimental condition!). Because of this limited added accuracy, researchers often opt for a sample size of 25–30 per condition. The sample size is also determined by other statistical factors that can be taken into account in a so-called power analysis, which is a statistical procedure to determine the required sample size for detecting an effect of a given size with a given degree of confidence. Power stands for the probability of finding an existing effect and is influenced by the significance level, the sample size, and the effect size (high power diminishes the risk of false negatives). Given any three of these four components, we can estimate the fourth. Hence, when we know the significance level (e.g., p < 0.05), the assumed effect size of the effect we are looking for (e.g., d = 0.5, which would constitute a medium sized effect in a t-test), and the power we want to use in our study (e.g., 80%), we can calculate the required sample size for a t-test (e.g., 102 participants, 51 in each group). In a t-test, you determine the differences in means of two groups (children interacting with tutor or with peer robot). On the other hand, if you have a given maximal sample size (e.g., you only have access to 40 people with a very specific characteristic and no chance to get access to more individuals of that target population), the power analysis can determine the probability of detecting an effect of a given size with a given level of confidence. If you plan an experiment with two groups, trying to find an effect of medium size with 40 participants, the probability of determining this effect will be extremely low (power = 46%). This means that your study would have a 46% chance of finding a statistically significant effect of treatment condition given there really is an important difference between the treatment conditions. This might lead you to overthink and revise this experiment design. Statistics books often feature lists with examples of power analyses. There are also freely available software tools that help with performing power analyses (e.g., G* Power3; http://www.psycho.uni-duesseldorf.de/abteilungen/aap/ gpower3/). 2.2.1.6 Step 6—Conduct the Study/Data Collection Before conducting a study, the last step should be to critically review everything that you have prepared and decided so far from an ethical viewpoint (see Chapter 3). In this step we explain (i) research ethics, (ii) informed consent and debriefing of participants, and (iii) provide useful tips for conducting a study.
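For readers who prefer to script such calculations, the following sketch shows how the same power analysis could be run with the Python package statsmodels instead of G*Power. It assumes a one-tailed independent-samples t-test, which reproduces the figures given above (about 51 participants per group, and a power of roughly 46% with only 20 per group); with a two-tailed test the required sample would be larger.

# A minimal sketch of the power analysis described above, using statsmodels.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Required sample size per group for a medium effect (d = 0.5),
# alpha = 0.05 and 80% power (one-tailed): roughly 50-51 per group,
# matching the 102 participants mentioned in the text.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                   ratio=1.0, alternative='larger')
print(f"required n per group: {n_per_group:.1f}")

# Achieved power if only 40 participants (20 per group) are available:
# approximately 0.46, i.e., a 46% chance of detecting a true medium-sized effect.
achieved_power = analysis.power(effect_size=0.5, nobs1=20, alpha=0.05,
                                ratio=1.0, alternative='larger')
print(f"achieved power with 20 per group: {achieved_power:.2f}")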

34

Chapter 2 Empirical Methods in the Social Science for Researching Socially Interactive Agents

Considering research ethics is very important and in many countries it is institutionalized with so-called institutional review boards (IRBs) or ethics committees. A common process involves obligatory notifications to the IRB about every study involving human subjects. These reports should include detailed descriptions of the entire study, discussion of potential ethical concerns, and specification of measures to reduce potential harm to human subjects. The IRB commonly reviews study proposals and judges them upon ethical innocuousness. A positive evaluation of the ethics board is the official permission for conducting the study. IRBs often perform a risk–benefit analysis and assess the individual risks a participant is subjected to in a study and the benefits your research provides for society. One might tend to think that most research in SIA does not involve high risk for participants because people will not be physically harmed. Unfortunately, this is a fallacy because psychological harm can result from some studies. You might think that administering an IQ test in your study is a low-risk endeavor. However, when a person participates in this test and receives a low IQ score, this can seriously threaten the person’s self-concept. IRBs usually provide guidelines on how to conduct studies that follow ethical standards. IRBs are governed by Title 45 Code of Federal Regulations Part 46 of the United States. Many other countries have similar rules for the establishment and working processes of ethical review boards. If there is no official regulation of the state, then, very often, universities and academic associations have committed themselves to establish an ethical review board. Even when there is no institution requiring you to do an ethical review, your research integrity should tell you to follow ethical guidelines and seek for guidance in this matter. Most academic journals and conferences will ask you to state whether your research has been IRB reviewed and might reject research that has not. Sometimes this can be avoided when you can explain in detail what measures you have undertaken in order to ensure ethical standards. The American Psychological Association (APA) provides their “Ethical Principles of Psychologists and Code of Conduct” online for your reference (https://www.apa.org). The absolute basics for research are informed consent and debriefing (see section 8 in APA Ethics Code). Informed consent means that you inform the participant that he or she is about to take part in a study and get permission to collect data [.20 2010]. This is especially important when you are collecting data that cannot be anonymized such as audio or video data. In this case, IRBs often require a process for data handling and data protection. The APA Ethics Code describes informed consent as follows: “psychologists inform participants about (1) the purpose of the research, expected duration and procedures; (2) their right to decline to participate and to withdraw from the research once participation has begun; (3) the foreseeable consequences of

2.2 Models and Approaches

35

declining or withdrawing; (4) reasonably foreseeable factors that may be expected to influence their willingness to participate such as potential risks, discomfort or adverse effects; (5) any prospective research benefits; (6) limits of confidentiality; (7) incentives for participation; and (8) whom to contact for questions about the research and research participants’ rights. They provide opportunity for the prospective participants to ask questions and receive answers.” (APA Ethics Code, Section 8.02). Some universities or their IRBs provide examples or guidelines on how to construct an appropriate informed consent form. Moreover, you should debrief participants properly, which means that “psychologists provide a prompt opportunity for participants to obtain appropriate information about the nature, results, and conclusions of the research, and they take reasonable steps to correct any misconceptions that participants may have of which the psychologists are aware. If scientific or humane values justify delaying or withholding this information, psychologists take reasonable measures to reduce the risk of harm. When psychologists become aware that research procedures have harmed a participant, they take reasonable steps to minimize the harm.” (APA Ethics Code, Section 8.08). One specialty that frequently occurs in studies with SIAs is that researchers use a so-called Wizard-of-Oz (WoZ) scenario, [Dahlbäck et al. 1993]. This means that participants ostensibly interact with an autonomous system, but actually the social robot or virtual agent is controlled by a so-called “wizard,” a hidden confederate of the experimenter controlling the actions of the robot or virtual agent. In this setup, participants are deceived about the true nature of the SIA (for a review on WoZ in HRI, see Riek [2012]). Deception is to be avoided unless the researcher has determined that “the use of deceptive techniques is justified by the study’s significant prospective scientific, educational, or applied value and that effective non-deceptive alternative procedures are not feasible” (APA Ethics Code, Section 8.08). If your study setup includes any kind of deception, you are obliged to debrief participants as early as possible about the deception, preferably at the conclusion of their participation but no later than at the conclusion of the data collection, and permit participants to withdraw their data. For further discussion on deception in research, see, for instance, Christensen [1988] and Tai [2012]. When you have received the IRB approval, you can start recruiting participants, conducting the experiment, and collecting your data. Here are some useful tips that you usually do not find in a textbook but are based on experience. When recruiting participants, always recruit more participants than you need. There is always someone who does not show up or your technology is on strike on one day. You will experience that not all test trials produce suitable data to be included in your dataset. Thus, plan to recruit more participants than you need in order to cope for any dropouts. Clearly specify the inclusion and exclusion criteria for study participation.

36

Chapter 2 Empirical Methods in the Social Science for Researching Socially Interactive Agents

For instance, if it is crucial for your study that participants can properly hear the robot, an exclusion criterion will be impaired (and not corrected) hearing. If the study investigates how girls react to a robot, boys cannot participate. Clearly stating the inclusion and exclusion criteria in study advertising helps to avoid frustrating situations for you and your participants. If you are conducting the experiment with more than one experimenter, try to be consistent in how the experimenters are conducting the experiment. Ideally, counterbalance experimenters to experimental conditions so that you do not have unexpected experimenter effects (see Section 2.2.3.1 for experimenter/interviewer effects). It helps when you prepare an experimenter script that lays out what should be said (and how) and what should not be said during the experiment when talking to the participant. For interaction studies in the lab, prepare an experimenter checklist that lists all the steps of your study. This prevents you from forgetting something and risking data loss, for example, ∙

which equipment has to be made ready, switched on, started before data collection?



which documents have to be put out ready? (e.g., informed consent forms, written debriefing)



how and where to store data after the interaction? (e.g., which server, folder)

An additional useful tip is to conduct testing runs with other uninformed lab members in order to see how long the study takes (this also helps in planning time slots for participants), whether everything runs smoothly, and whether participants do understand every task within the experiment. When you debrief participants, ask them first whether they noticed something or found something strange. This is especially important for WoZ settings in order to check whether the deception has been successful or whether it has been detected by the participant (in this case, these data cannot be included in the analysis). However, answers to this question can reveal other flaws in your study setup that can be changed when detected early. Last but not least, keep records of all participants (in anonymized form) where you note relevant information such as technical errors, failed manipulations or deceptions (e.g., in WoZ), or other peculiarities. For more useful tips, consult the paper by Bethel and Murphy [2010]. 2.2.1.7 Step 7—Data Processing and Data Analysis After data collection, the most exciting step follows—data analysis. In this step, we explain (i) when you need to consider processing of data, (ii) how to make a plan for statistical analysis of your data.

2.2 Models and Approaches

37

First, you might have to process some of the data. For instance, you have to extract from a continuous video how often and how long children looked to the peer or tutor robot. In case of a study involving a questionnaire, you usually collapse data of a well-established questionnaire into a sum score or a mean value that is then used in further data analysis. Some processing is not very challenging or hard work. Other procedures require more effort and need quality checks. In the case of behavioral coding (e.g., for signs of attention by children in the interaction with a robot), coding is best done by two people. This procedure serves as a quality check because it enables you to detect the degree of agreement between the two coders to ensure that the coding results are valid (interrater/intercoder reliability, see Sections 2.2.3.3 and 2.2.3.1 for more detailed information). After data processing, you can start with data analysis. In case you want to do a cause-and-effect analysis, you will have to consider your study design (correlational, between-subjects, within-subjects, or mixed-design), the measures for the independent and dependent variables, and their characteristics (e.g., nominal, ordinal, interval scaled), and then chose an appropriate statistical test. For detailed explanations on how to run statistical tests (e.g., the t-test that has been briefly discussed above), please refer to further reading: some books provide decision trees for choosing the appropriate test [Field 2018]. Some statistics book publishers also have companion websites with useful tools such as Andy Field’s Discovering Statistics book series, which also features a “which stats test” online (http:// methods.sagepub.com/which-stats-test). In any case, you will need to familiarize yourself with the most common statistical tests, how to run them, and how to interpret and report their results. We present some examples based on our two scenarios in Section 2.2.2.1. 2.2.1.8 Step 8—Report Results The last step in your research endeavor is to report your results. In this step, we explain (i) the structure of a research report and (ii) general recommendations regarding writing style. The form of the report depends on the addressee. Thus, the form of your report might depend on the funding agency of your research project, your lecturer, or the research community via a scientific journal or conference publication. However, there are some general guidelines that should always be followed. A good research report should describe in detail the research process and the theoretical and methodological decisions that were made during this process. You should provide an objective description of the outcome of your research project, which typically includes the measurements that were taken and the statistical summary

38

Chapter 2 Empirical Methods in the Social Science for Researching Socially Interactive Agents

and interpretation of those measurements. And very importantly, your research study “grows out of an existing body of knowledge and adds to that body of knowledge. The research report should show the connection between the present study and past knowledge.” [Gravetter and Forzano 2012, p. 488]. The basic structure of a paper follows the IMRAD acronym: ∙

Introduction (Which question was asked?)



Methods (How was it studied?)



Results (What was found?), And



Discussion (What do the findings mean?)

Writing a good paper is a science in its own right. There are articles and even books [Hall 2012; Field and Hole 2013] that summarize guidelines and best practices and help with starting to write. They explain which information should be presented in which way in what chapter. When writing up psychological research, the APA publication manual is a very good reference [American Psychological Association 2020]. The APA also provides specialized guides for reporting quantitative [Cooper 2020] and qualitative research [Levitt 2020] in psychology. Most importantly, the APA Publication Manual tells you how to cite properly for psychology journals. However, writing a really excellent paper is a skill, and learning this skill will require experience and practice. The general recommendation for writing style is to write in an impersonal and objective style and avoiding ambiguity, colloquialisms, and jargon. You should also try to avoid biased language (e.g., “older adults” is less biased than “the elderly”). Typically, research reports are written in past tense or past perfect when describing prior work (introduction and theoretical background sections of report), how you decided to set up and conduct the study (method section of report), and when presenting performed analyses and their results (results section of report). When interpreting and discussion the results you should switch to present tense. The work of other researchers must be properly cited in your research report to avoid plagiarism. Journals and academic conferences with proceedings usually have one predefined citation style that has to be followed when submitting your work. For psychology journals and conferences, this is the APA citation style. Proceedings of conferences in computer science are often realized by specialized publishers in that area such as IEEE, ACM, or Springer Nature, which all have their own citation styles. Find out which citation style is to be used before preparing the manuscript. Software for reference management and knowledge organization can be very useful to collect prior work and properly cite this work. Examples


are EndNote, Mendeley, Citavi, and Zotero (a comparison of reference management software can be found here: https://en.wikipedia.org/wiki/Comparison_of_reference_management_software).

2.2.2 Two Exemplary Research Projects

2.2.2.1 Example 1—Evaluating a Learning Robot

For the scenario description, please refer to the introduction (see Section 2.2). In this scenario, your research topic is given (see Section 2.2.1.1): you shall assess the impact of different social roles that are implemented in a social robot that is supposed to help children with learning Spanish vocabulary. In order to formulate a research question or hypothesis (see Section 2.2.1.2), it is advisable to consult research from the fields of education research, pedagogy, and social psychology when studying the impact of different roles in learning situations. For instance, how exactly do you conceptualize the role of a tutor and the role of a peer? How do peers and tutors behave differently in learning situations and what are the impacts of these different roles? You will find conceptualizations and prior evidence in existing literature (e.g., Belpaeme et al. [2018]). Belpaeme and colleagues [2018] review concepts and existing studies in this area and state that a peer has "the potential of being less intimidating than a tutor or teacher, peer-to-peer interactions can have significant advantages over tutor-to-student interactions" (page 6). For instance, in interactions with a peer robot, longer periods of attention on learning tasks, faster responses, and more accurate responses were observable compared to interactions with a tutor robot [Zaga et al. 2015]. There is certainly more evidence to find, but for our example we will use this one prior finding in order to pose the hypothesis that the peer robot will elicit longer attention in the task of learning vocabulary and that, when asked for vocabulary, the children's responses will be faster and more accurate. However, the (very limited) literature we reviewed in our example does not allow us to state a hypothesis on long-term learning gains. Thus, here we can only pose the following research question: what is the influence of the robot's social role on children's learning gain of Spanish vocabulary?

Next, you define the research strategy and experiment design (see Section 2.2.1.3). Your hypothesis suggests a relationship between the robot's social role and children's attention and recall promptness and correctness. Moreover, you assume that the social role might have an influence on long-term learning gain. Hence, your independent variable is the social role of the robot. The dependent variables are children's attention, recall promptness and recall correctness as well as learning gain. You plan to invite children into your lab and let them interact with one of your robots. Thus, you can randomly assign participants to


your groups, making this a true experiment. Suppose you decide on independent groups of participants; hence, you have a between-subjects design.

After you have identified your independent (social role of the robot) and dependent variables (children's attention, vocabulary recall promptness and correctness, and learning gain), it is time to operationalize these variables (see Section 2.2.1.4). For your independent variable, this means that you need to provide definitions of the two social roles as well as definitions of the behavior that is connected to the two roles and that can be implemented into the social robot. For instance, in Zaga et al. [2015] the differences in social role (peer vs. tutor) were established via the style of interaction through the design of gestures, speech, and postures with peer or tutor characteristics, based on literature on teachers' multi-modal expressions and on peer collaboration. With regard to your dependent variables, this means likewise that you need a definition and a specified way to measure the variables. In the work by Zaga and colleagues, focus of attention was measured by gaze behaviors of the participants directed at the robot and at the task (counts and duration of gaze behaviors: behavioral measure, ratio scale). You might also decide to include a self-report measure and ask children whether it was easy for them to be attentive during the task on a scale of 1 to 5 with 1 being "very hard" and 5 being "very easy" (self-report measure, ordinal/interval scale). This type of scale is known as a Likert scale (see Schrum et al. [2020] for an analysis and discussion of the usage of Likert scales in the field of human–robot interaction). Moreover, you want to know how promptly and correctly children can recall the vocabulary during the interaction. Hence, you are measuring the time they take to recall an item (behavioral measure, ratio scale) and the number of correct items (behavioral measure, ratio scale). Lastly, you administer a vocabulary test 1 week after the treatment in order to assess learning gains in Spanish vocabulary.

In the next step, you need to define and select an adequate sample (see Section 2.2.1.5). Your target population consists of elementary school children in the fourth grade. You will probably identify an accessible population of elementary school children in a local elementary school with which you are in contact, making this a convenience sample. In order to avoid sampling bias as much as possible, you could contact a second school in a different district. Your power analysis for a t-test, given an alpha level of 0.05, power of 80%, and a medium effect size (d = 0.5), shows that 102 participants are required for your study.

In order to conduct the study (see Section 2.2.1.6), you need to prepare some documents. In particular, you will need the informed consent of the parents of the participating children because children cannot give informed consent themselves. Since you decided that you will run a laboratory study, children and their parents will probably only be available during the (late) afternoon hours or on weekends.
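As an aside, the a-priori power analysis described above can also be computed in code rather than in G*Power. The following minimal sketch uses Python and the statsmodels package; the one-sided ("larger") alternative is our assumption, inferred from the reported total of 102 participants, and is not stated in the scenario itself.

    import math
    from statsmodels.stats.power import TTestIndPower

    # A-priori power analysis for an independent-samples t-test
    # (alpha = 0.05, power = 0.80, medium effect size d = 0.5).
    n_per_group = TTestIndPower().solve_power(
        effect_size=0.5,       # Cohen's d
        alpha=0.05,            # significance level
        power=0.80,            # desired statistical power
        alternative="larger",  # directional hypothesis (our assumption)
    )
    print(math.ceil(n_per_group) * 2)  # roughly 102 participants in total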


You prepare an experimenter checklist that tells you to have study materials ready and to not forget to switch on the cameras because you need videos to determine the attention allocation of the children. You decide to debrief parents about the manipulation of the study immediately after their children's participation. However, you decide to postpone debriefing for the children until after data collection has been concluded because you fear that the children will talk about the manipulations to others while the study is still ongoing. After data collection and analysis, children can be debriefed and informed about the results together in their classroom. During the study, imagine that two children were so shy that they did not engage in the interaction at all. One interaction was interrupted by a phone call and a second interaction failed because the robot did not produce speech output anymore. You note these four cases in your record list and plan to discuss with the team which cases have to be excluded from the dataset.

After data collection, processing of data and data analysis follows (see Section 2.2.1.7). In order to extract behavioral measures, you will need to process the video data. This means that you will have to do coding on the videos, that is, when and how long participants showed gaze behaviors toward the robot, and how prompt and how correct their reactions were. You should prepare a coding guideline that describes what behavior can be interpreted as attention, and hand this to two coders who code the data. Afterward you calculate to what extent the two coders agree in their coding by assessing the inter-coder reliability. Depending on the measure, you need to calculate kappa statistics, correlation coefficients, or intraclass correlation coefficients [see Cohen 1960; Shrout and Fleiss 1979]. You determine the number of gazes and their duration; the latter is summed up. You also count the number of correct recalls during interaction and determine the mean reaction time to recall. Finally, you have a look at the free recall vocabulary test and get a test score for each child. Then you can perform independent t-tests on attention allocation, recall promptness and accuracy, and learning gain. You find that your hypothesis is supported: children pay greater attention to the peer robot, and direct recall is more often correct in this group than in the tutor group. However, neither recall promptness nor learning gain differed between groups. Finally, you have to write a research report for the social robotics course (see Section 2.2.1.8).

2.2.2.2 Example 2—Developing an Agent for Assisted Living

For the scenario description, please refer to the introduction (see Section 2.2). In this scenario, you have the mission of building an assistive agent for older adults. You want to find out more about your target group in order to build a useful and pleasant intelligent virtual agent (IVA) that people will buy to use at home (see Section 2.2.1.1).


This example is a bit more complex than the Learning Robot Scenario. When you start to develop an IVA from scratch, your development process will actually involve a number of different research questions and thereby different studies (see Section 2.2.1.2). At first you want to find out more about your target group and their needs in order to determine the functionalities of the virtual assistant. Here the research questions might be (amongst others): What needs do older adults express, and which of these, in their view, could be addressed by an IVA? What functionalities do they want to have in an IVA? How do they envision an IVA behaving and looking? Also important for the project would be the question: How much are customers willing to spend on a virtual assistant? Later on, when you start to prototype, you will probably perform a perception-only non-interactive study (see Section 2.2.3.2) showing participants pictures or short videos of different versions of the virtual agent with regard to looks and behavior. By this, you want to determine: Which agent is preferred by the majority of the participants and why? The next stage would then be conducting interaction studies (see Section 2.2.3.3), probably in a laboratory situation and later on in field experiments in future users' homes. Let us assume that yours is one of the first projects ever to try to develop such a virtual assistant. In this case, there would be little prior evidence to build upon and a lot of your work would be exploratory. Thus, you would rather pose research questions than specific hypotheses. In order to keep it simple, we will outline two different studies in the following.

Study 1—Survey. Your first study will be a mix of descriptive and correlational research strategies (see Section 2.2.1.2). You would conduct a survey assessing the demographics of your assumed target group, including their income, in order to examine the assumed relationship between people's willingness to invest in an IVA and their income. Moreover, your survey could ask people about their attitudes toward IVAs and which tasks they want the IVA to perform. In case you already envision some functionalities, you can ask people whether they would use the IVA for these functions. Using the survey method, you can query a large number of people. However, such a survey also has its limitations because people cannot explain why they have specific attitudes or why they would use a virtual assistant for one task but not for another task. A solution to this problem might be to back up your survey with another research approach. You could invite a smaller group of participants to interview them more deeply about this topic. During interviews, participants have the chance to elaborate on the "why." You could also invite a group of people to participate in a so-called focus group, which is a semi-structured interview held with a group of people. Please refer to Section 2.2.3.1 for more detailed information on different types of interviews.


As for operationalization of variables (see Section 2.2.1.4) in the survey, which only uses self-report measures, you could, for example, assess demographics such as gender (male, female, or diverse; nominal scale), age (years; ratio scale), and income (euros or USD; ratio scale) as well as people's willingness to invest in an IVA (euros or USD; ratio scale), their attitude toward virtual assistants, and which tasks they envision the agent for. Let us assume that you did not find a well-established questionnaire to assess attitudes toward virtual assistants. Hence, you decided to create some ad-hoc questionnaire items that you think are valid measures, such as "I think virtual assistants are useful" or "I believe that using a virtual assistant improves my everyday life," and let people rate these items on a scale of 1 to 5 with 1 being "I disagree" and 5 being "I agree" (self-report; ordinal/interval scale). You also provide participants with a list of functionalities and ask whether or not the virtual assistant should be capable of those functionalities (e.g., "virtual agent has access to calendar": yes/no; nominal scale; self-report).

In the next step, you need to define and select an adequate sample (see Section 2.2.1.5). Your target population for the survey study consists of older adults aged over 70 who live in their own homes or in assisted living environments (in contrast to nursing homes). Let us assume that you work together with four cities and have access to the population register of these cities. Thereby, you know the exact size of the target population (in these four cities) and all its members. By this you can assign each individual in the population a specified probability of selection, and randomly select the number of individuals you need. Based on a power analysis, you know you will need at least 134 individuals for performing the correlation analysis between income and willingness to invest. However, it might be advisable to recruit more individuals following the law of large numbers.

When planning to conduct the survey (see Section 2.2.1.6), you will need to include the informed consent on the first page of the online survey (see Perrault and Keating [2018] for recommendations on how to construct informed consent in online studies). You send 1,000 potential participants a written invitation with the link to the online study because you expect that only a fraction of invited persons will actively make the effort to participate. Return rates for surveys are sometimes as low as 10% to 20%. Other recruitment tools many researchers use nowadays are platforms such as Amazon Mechanical Turk (MTurk) or CrowdFlower, which have the advantage of fast completion of the study, access to otherwise hard-to-reach target populations, and more diversity in samples, such as specific occupational groups or people with a specific health condition [see Casler et al. 2013; Smith et al. 2015; Hauser and Schwarz 2016], but also come with disadvantages [see Fleischer et al. 2015; Smith et al. 2016]. The usage of crowdsourcing


websites has been discussed in different disciplines [see Necka et al. 2016; Shank 2016; Follmer et al. 2017].

The survey study will predominantly be analyzed (see Section 2.2.1.7) on a descriptive level, stating the percentage of male and female respondents, their mean age, and their mean income (which can also be presented per group using cross-tables). You can run a correlation analysis on the relation between income and willingness to invest in an IVA, which shows that there is no relationship. Rather, the descriptive data suggests that there is low variability in what people are willing to invest regardless of their income. Moreover, the descriptive data gives you an impression of which tasks the participants judge as suitable for an IVA and which are not. The results of the survey are part of a research report to be delivered to the funding agency (see Section 2.2.1.8).

Study 2—Perception-only non-interactive study. Your second study will probably be a study evaluating the participants' perception of different IVAs (see Section 2.2.3.2), in which you compare different looks of the IVAs. Let us assume that you designed two versions of a virtual assistant: a female version and a male version. Based on work in social psychology on similarity attraction [Montoya et al. 2008], you develop the hypothesis that an agent matching the participants' gender might be preferred over a non-matching agent (see Section 2.2.1.2). In summary, you are using a 2 × 2 mixed factorial design with the quasi-independent between-subjects factor participant gender (male, female) and the within-subjects factor agent gender (male, female). As for operationalization of variables (see Section 2.2.1.4), you use participant gender (male, female; nominal scale) and the virtual assistant gender (male, female; nominal scale) as independent variables. As dependent variables, you want to assess the perceived likability of the agent, its perceived similarity to the participant, and participants' usage intentions. It is advisable to include a so-called manipulation check for similarity. A manipulation check "is an additional measure to assess how the participants perceived and interpreted the manipulation and/or assess the direct effect of the manipulation" [Gravetter and Forzano 2012, p. 268]. In your case, this could be an item asking "How similar is this agent to you?" You could again use a Likert scale ranging from 1 "not at all similar" to 5 "very similar" (self-report, ordinal scale). Your manipulation of similarity would be successful if participants rate the agent matching their gender as significantly more similar to them than the agent not matching their gender. As dependent variables you use the perceived likability of the agent and ask participants "How likable is the agent?" on a scale ranging from 1 "not at all likable" to 5 "very likable"


(self-report, ordinal/interval scale) as well as their usage intention and ask "How likely are you to use this agent?" on a scale ranging from 1 "will definitely not use" to 5 "will definitely use" (self-report, ordinal/interval scale).

In the next step, you need to define and select an adequate sample (see Section 2.2.1.5). The power analysis for your 2 × 2 mixed factorial design tells you that you will need at least 34 individuals, that is, 17 female and 17 male participants, since this is your between-subjects factor. This is probably going to be a convenience sample. For instance, you could launch an advertisement in a local newspaper stating that you are looking for study participants.

You prepare your perception-only non-interactive study in the laboratory (see Section 2.2.1.6). When advertising the study (whose target group consists of older adults), you explicitly state that participants should have normal or corrected-to-normal vision and hearing. Participants give informed consent before starting the experiment and are debriefed after completion of the study.

When analyzing the data (see Section 2.2.1.7), you first do your manipulation check by calculating a mixed-design repeated measures analysis of variance (ANOVA) with the between-subjects factor participant gender and the within-subjects factor agent gender on the dependent variable "perceived similarity." The analysis reveals an interaction effect showing that indeed a gender-matched agent is perceived as being significantly more similar to the participants than a non-matched agent, that is, women rate the female agent as more similar to themselves than the male agent, and men rate the male agent as more similar to themselves than the female agent. However, when performing the same analysis for "likability of the agent" you see that the similarity attraction effect is only observable in men. Men rate the gender-matched male agent as significantly more likable than the female agent, while women do not show a preference for one or the other agent. Moreover, there is no difference in usage intentions. Given your results, it would be advisable to either continue developing both agents or, if this is too costly, to continue developing the male agent only, since women did not show a preference for agent gender and men preferred a male agent.

You plan to submit your research results to a human–technology interaction journal since those journals accept interdisciplinary works (see Section 2.2.1.8). You inform yourself about the journal's guidelines for authors, the required template, and the citation format.
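For illustration, the mixed-design ANOVA described above could be computed as follows. This is a minimal sketch using the pingouin package (one option among several); the long-format data layout, column names, and example ratings are our own assumptions, not part of the scenario.

    import pandas as pd
    import pingouin as pg

    # Long format: one row per participant and rated agent
    # (between-subjects factor: participant gender; within-subjects factor: agent gender).
    df = pd.DataFrame({
        "participant": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
        "participant_gender": ["f"] * 6 + ["m"] * 6,
        "agent_gender": ["f", "m"] * 6,
        "similarity": [4, 2, 5, 3, 4, 3, 2, 4, 1, 5, 2, 4],  # manipulation check ratings
    })

    aov = pg.mixed_anova(
        data=df,
        dv="similarity",
        within="agent_gender",
        subject="participant",
        between="participant_gender",
    )
    print(aov)  # the interaction term is the test of the manipulation check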

2.2.3 Types of Studies Most Commonly Used in SIA Research

As we learned above, the type of study you are conducting depends on your research question and the research strategy you choose in order to answer it. In the example of developing an assistive agent for older adults, we saw that in the early stages of


the development it might be advisable to conduct interviews in order to receive more in-depth information about people's needs and wishes for what an agent should be capable of. Later on, the team developed prototypes of the virtual agent, such as different graphical renderings of the virtual agents in still pictures or maybe animations of behavior in short videos. These can be presented to participants in perception-only non-interactive studies to receive feedback that can be used in further iterations of the development process. In the later stages of the development process you will have an autonomous or semi-autonomous agent that can be used in interaction studies. These three types of studies, interview, perception-only non-interactive study, and interaction study, are most frequently used in SIA research. We will discuss their advantages and disadvantages.

2.2.3.1 Interview

An interview usually follows an exploratory research agenda and is commonly used in ethnographic or field studies [Qu and Dumay 2011]. Mostly, the questions addressed in interview studies are broader research questions rather than narrow hypotheses. Even though interviews are sometimes used to enrich data from quantitative studies, they are more commonly used to get a first but deep understanding of how individuals experience, perceive, think or feel about, and evaluate a certain topic. This leads us to a very fundamental fact about interviews: they always produce data that rely on subjects' introspection. Introspection entails a lot of subjectivity, which researchers usually try to avoid when doing social research, as subjectivity introduces biases. With interviews, this is not the case. In interviews, gathering subjective and introspective data is wanted. In interviews, you do not aim to find quantifiable and generalizable phenomena. Interviews deliver qualitative data, which stands in contrast to most other methods introduced in this chapter. Interviews are described as centered around the interviewee, qualitative, descriptive, presuppositionless, focused, open to ambiguity and change, and taking place in an interpersonal interaction [Kvale 1983]. Whether interviews are a valid method for your research depends on the questions that you are asking. The kind of questions that you can address by applying a qualitative research methodology such as an interview technique are questions concerning the "why." You will never be able to identify every possible answer that people might give to the why-question (e.g., why is this conversational agent appealing to a specific user?), so you will not be able to hand people a questionnaire with multiple-choice items covering all possible answer options. In order to really understand why people feel how they feel or think how they think, you cannot give them predefined answers to choose from but have to let them talk freely. Interviews give you a very deep insight into real needs and demands (careful: you are not


producing statistically significant results generalizable to all other older adults). Thus, interviews are often used in the analysis of needs and requirements for technical systems. Some researchers use interviews as part of a participatory design method along with other qualitative methods such as focus groups, co-creation workshops, and paper prototyping (e.g., Frauenberger et al. [2011], Šabanović et al. [2015], and Lee et al. [2017]).

Types of Interviews. There are many different types of interview techniques; for a complete overview, please refer to one of the recommended textbooks in Section 2.3. Here, we will introduce three different types of interviews concerning their degree of structure [Qu and Dumay 2011]:

∙ Structured: The interviewer asks the interviewee predefined questions according to a rigid interview guide that is followed strictly throughout the interview. The guide gives the exact wording not only for the questions but also for the introduction to the topic and any other information that is given to the interviewee. The underlying assumption is that correctly and unambiguously asked questions will all deliver relevant information. On the one hand, compared with the other interview types, the structured interview produces the most comparable and generalizable data. On the other hand, there is no room for flexibility and spontaneity. Thus, we argue that the structured interview is a bad compromise between the two worlds of qualitative and quantitative research, and we would advise you to use a less structured approach.



∙ Unstructured: The unstructured interview is the least formal and least predefined type of interview. It is rooted in ethnographic research, where an interviewer tries to understand someone's perspective entirely and most data is gathered through conversation rather than a pre-prepared line of questions. In an unstructured interview, the interviewer adapts and reacts dynamically to the topics brought up by the interviewee. The underlying assumption is that researchers cannot know the relevant questions in advance. The advantage of this approach is its openness to any given topic that an interviewee addresses. However, by being an active part of the conversation, the interviewer risks getting involved and shaping and steering the conversation too much. Additionally, data are not easily comparable between interviewees.



∙ Semi-structured: The semi-structured interview technique lies between these two extremes. Methodologically, it is not pinpointed to the exact middle but is interpreted differently (sometimes more structured, sometimes more unstructured) by different researchers and research fields. Semi-structured


interviews make use of systematic, standardized interview guides but allow for interim questions and unplanned exploration of certain topics. Thus, the interviewer aims to ask certain pre-planned questions in the same standardized way to all interviewees to create comparable answers, but gathers individual responses from specific interim questions. Successful execution of a semi-structured interview requires a well-trained interviewer to make sure the interview situation stays under the interviewer's control even though there are free talking passages.

These types of interviews can be applied to many different scenarios. Some interviews are conducted as part of contextual inquiries, which involve placing the interviewee in a relevant context. Instead of inviting older adults into the lab to do an interview on assistive devices, you might instead visit them in their homes or care facilities and ask them questions while they are in their usual habitat. Verbal data are usually combined and analyzed in the context of behavioral observations.

Advantages and Disadvantages.

∙ Time: Qualitative research is not about measuring and numbers. Having a small sample is likely to decrease the time for data gathering and its analysis. (However, sometimes, conducting 10 interviews might result in more work than a quantitative poll with 100 participants.)



∙ Information density and quality: Well-conducted interviews will deliver more in-depth information about the participants. Participants communicate their answers verbally, which typically yields more extensive answers than those of anonymous participants typing answers on their keyboards in front of a computer. There is also added value and potential in combining verbal information with behavioral data analysis in face-to-face interviews.



∙ Trust: During an interview, a good interviewer will build a personal bond with the interviewee, leading the interviewee to trust the interviewer and share more information (refer to the biases below to read about the drawbacks of this situation). In an online survey, by contrast, you will never be able to validate whether an anonymous participant was actually part of the target group or only in it for the incentive.

Even though there are potentially many more biases in interview studies, a selection of four biases is presented and described in Figure 2.2.

Practical Tips: How to Conduct Interviews. We would like to give you some practical tips from our personal experience in conducting interviews.


Figure 2.2 Selection of biases that potentially influence results in an interview.

1. Educate yourself: Try to learn as much as you can about the topic in question in order to formulate relevant questions and prepare your interview guide. Learn as much as you can about the target group in order to pose questions that are sensible and appropriate, and study possible biases and pitfalls in interviews in order to prevent them. Enroll in an interviewer training course.

2. Use this simple principle: You can only get answers to questions that you have asked. Thus, careful preparation is as important as in any other research study. What do you want to know? What do you have to ask? If you have very specific questions, you should choose a more structured approach. If you do not have specific questions but want to have a first glimpse into a topic, choose a more unstructured approach.

3. Embrace pauses: In normal conversations we tend to fill pauses because silence is sometimes considered awkward. As an interviewer, you should not fill these pauses with paraphrasing and repetitive questioning. Be patient. Let interviewees fill pauses and be surprised how much extra information you get (also, spontaneous paraphrasing should be avoided because it influences participants in an unsystematic way and produces biased answers).


Data Analysis. Analysis of data depends heavily on the type of interview. Unstructured interviews do not produce very comparable datasets and should not be interpreted as such. Data from such interviews are usually described per participant and summarized without counting or measuring. The more structured the interview is, the more comparison between participants can be integrated in the analysis, and the more descriptive statistical analyses can be used. Audio and video recordings are usually used for analysis. Audio recordings are transcribed. Coding schemes can be used to cluster answers into categories, for example, positive, negative, and neutral for the general valence of an answer. For coding, two coders can categorize, and inter-coder reliability can be calculated (e.g., Cohen's kappa, see Section 2.2.1.7).

2.2.3.2 Perception-only Non-interactive Studies

In perception-only non-interactive studies (or perception studies for short), participants are presented with stimulus material such as pictures, videos, audio files, or written descriptions of social robots or virtual agents that shall be evaluated, for example with regard to design, appearance, or behavior (i.e., participants view stimulus material without directly interacting with a SIA). The goal is to develop a detailed understanding of how people interpret and reason about a robot's appearance and behavior. Such an understanding is not only crucial for advancing our basic knowledge about human–robot interaction but also for our ability to design a robot's appearance and behavior so that it is easy to recognize and to interpret. Sometimes perception-only studies are combined with in-depth interviews (see Section 2.2.3.1). In this case, participants are presented with stimuli of SIAs and are asked to give short ratings and subsequently elaborate on why they have given such ratings [Rosenthal-von der Pütten and Krämer 2015].

Types of Perception Studies. Many perception studies in development processes are used for the evaluation of different designs to identify the best design option for the SIA. Other perception studies rather explore psychological phenomena by using controlled stimuli such as videos or pictures. This has the advantage of control, as the stimuli are identical for all participants, whereas in interaction studies the interaction unfolds between the interactants (the SIA and the human), giving the researcher less control over what exactly happens.

Advantages and Disadvantages. In contrast to interaction studies, the advantage of perception studies is that they are less error-prone since the stimulus material consists of pictures, descriptions, or videos of SIAs. In interaction studies, dropouts happen regularly due to malfunctions of the social robot or virtual agent. This risk is diminished in perception studies. Moreover, perception studies require less


personnel than interaction studies, where often more than one experimenter is needed to run the study. In fact, perception-only studies can often be performed using online survey platforms and can be completed by a large number of participants simultaneously. Crowdsourcing platforms such as MTurk or CrowdFlower facilitate the recruitment of participants (see Section 2.2.1.6). Lastly, in perception studies researchers can exert greater control over the experimental setting because the stimuli are exactly the same for all participants, whereas in interaction studies the interaction unfolds between the two interactants, thus always generating variability in the flow of interaction. On the downside, perception studies lack external validity because the stimuli are not presented in context, that is, the context of a real interaction situation. For some research questions this is more problematic than for others. For instance, in a perception study the nonverbal behavior shown in short videos might be clearly recognized as dominant or submissive. However, this effect might be diminished when the dominant behavior sequences are presented in a longer interaction phase together with other nonverbal behaviors.

Practical Tips: How to Conduct Perception Studies. We would like to give you some tips on what to consider when designing perception-only non-interactive studies.

1. Experiment design of recognition studies: You have to make important methodological decisions when designing your recognition study, for example, whether each participant is shown only one type of behavior (between-subjects design) or whether each participant is shown multiple types of behavior (within-subjects design). A within-subjects design may cause bias in that participants are prone to engage in more direct comparison between the various stimuli. In case this establishes a confound with regard to your research question, a between-subjects design would be more advisable. Furthermore, you have to decide which response format to use in assessing people's ability to recognize a certain behavior, for instance, a forced-choice or Likert-scale response format. Previous research has shown that such methodological decisions about the study design and response format have large implications for the conclusions we draw; for example, people's ability to reliably distinguish between emotion expressions is highly contingent on the particular response format a study employed [Russell 1996]. These considerations are valid for all types of studies; recognition studies are especially prone to generating distortions of recognition rates based on methodological choices.

2. Framing of the study: Make sure that participants know what their task is and which perspective they should take when making evaluations, especially


when you are using an online study. For some research questions you want the participant to act as an observer of a situation; for other research questions you want the participants to put themselves in the shoes of a person in the portrayed situation.

Data and Data Analysis. Unless you are combining stimulus presentation and rating with an interview (see Section 2.2.3.1 for data analysis tips), which would require a mixed-methods approach to data analysis (see Section 2.4.2), most perception studies will result in self-report data that usually need less preprocessing compared to interaction studies with video coding or transcription of interactions. The descriptive and inferential statistical analysis follows according to the previously defined hypotheses (see Section 2.2.1.7).

2.2.3.3 Interaction Studies

In addition to data gathered from the perception and evaluation of agents based on stimulus material, meaningful data can be gathered from interaction studies. In interaction studies, participants are invited to directly interact with a SIA. Thus, an interaction study always includes at least one participant and at least one SIA joined in some form of interaction. Information is drawn from verbal and behavioral observations made during the interaction and is usually complemented by pre- and/or post-interaction questionnaires where people report, for instance, prior experiences with SIAs, attitudes toward SIAs, and their perception and evaluation of the interaction itself. Studies involving interaction can be of a qualitative nature, where, for example, an interaction is followed by an interview or an interaction is part of a single case study. An interaction study can also be part of an experiment where the kind of interaction itself is varied and serves as the independent variable, or where different kinds of participants are confronted with the same kind of interaction (see Section 2.2.1). The major advantage of a real interaction is the external validity of the results. Only in a real interaction scenario can data be gathered, either in experiments or qualitative studies, that can be transferred and interpreted for real-life interactions. In addition to data from the interaction itself, participants can be asked to rate and evaluate the SIA and their real interaction experience. These data, paired with observational data, will deliver a dataset that can give a very holistic understanding of human–SIA interactions. Mixed methods are powerful tools in understanding phenomena.

Types of Interaction Studies: Interaction Settings and Methods. Interaction studies can be conducted in different settings. Two of the most prominent settings to conduct a study are lab vs. field. Lab studies take place in research institutes and


usually involve highly controlled conditions under which the study is conducted. Field experiments are conducted "in the wild," in natural settings. Commonly it is assumed that the lab provides higher internal validity (more control of the independent variable) and field studies produce higher external validity (higher generalizability due to the natural surroundings) [Reis and Judd 2013]. Even though, in empirical science, it is never that easy (e.g., you can have unexpected influences in the lab that endanger the internal validity of the experiment), we can state that different degrees of internal and external validity have to be considered when choosing the setting of an experiment and that the setting enables you to exert more or less control over the situation. Many interaction studies with socially interactive robots take place in semi-public spaces such as hospitals, airports, and shopping malls. These places have one advantage over public spaces: people are usually prepared and warned about camera surveillance upon entering these areas. Even as a researcher with good intentions, you may not film and record individuals without asking for consent. Thus, careful planning and choice of a data collection site does not only involve concerns about the validity of the experimental result but also requires considerations about ethics, privacy rights, and data security (see Section 2.2.1.6).

Types of Interaction Studies: Types of Interactions. An interaction can either be virtual or with a physically embodied agent. On many occasions, testing and studies are conducted when the agent is still being developed or not at all developed. In these cases, studies in virtual reality can be conducted to gain a first impression of how an interaction might take place. Another option to study not-yet-developed autonomous systems are studies with a WoZ design. A WoZ design involves a person who remotely operates the agent. There are different types of SIA capabilities that are often simulated using a WoZ design, among others: natural language processing, navigation and mobility, and nonverbal behavior [Riek 2012]. A structured guide and training for the wizard controlling the agent are necessary to ensure reliability and consistency. Only reliable testing conditions will deliver data that can be used for analysis.

Advantages and Disadvantages. In contrast to other types of studies, interaction studies' main advantage is the possibility of observing a real interaction and combining different types of data. However, the amount of time and preparation that has to be put into setting up an interaction study sometimes exceeds that of other types of studies. In addition to the usual preparation for each step of an experiment (see Section 2.2.1), research teams have to develop and test the functioning of the interactive agent and make sure the system runs reliably throughout


the experiment. The preparation time and resources that are needed depend on the agent and the type of interaction.

Practical Tips: How to Conduct Interaction Studies. Some practical tips for running an experiment are discussed here. Also consult Section 2.2.1.6 for more tips as well as the paper by Bethel and Murphy [2010].

1. Experimenter script: Prepare a script for the experimenter and confederate (and wizard) that defines what has to be done and said in which way in order to avoid possible confounds to your experiment.

2. Experimenter checklist: The checklist should include all the steps of your study so that you do not forget to switch on a device and risk data loss.

3. Testing runs: Invite other lab members to be pretend-participants in your study to search for possible misunderstandings and misconceptions before you conduct the study with real participants.

Data and Data Analysis. Analysis of data is preceded by preparing your data. Depending on the type of study, preparation includes transcribing video and audio recordings of the interaction, coding and categorizing the transcripts, and transferring the processed data together with data from any additional material (e.g., questionnaires) into your statistics software. Descriptive and inferential statistical analyses follow according to the previously defined hypotheses (see Section 2.2.1.7).
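As a small illustration of this last preparation step, coded behavioral data and questionnaire data can be merged into one analysis dataset, for example with pandas in Python. The file and column names below are illustrative assumptions, not prescriptions from this chapter.

    import pandas as pd

    # Coded behavioral data from the interaction (one row per participant)
    behavior = pd.read_csv("coded_behavior.csv")            # e.g., participant, gaze_count, gaze_duration
    # Post-interaction questionnaire scores (one row per participant)
    questionnaire = pd.read_csv("post_questionnaire.csv")   # e.g., participant, likability, usage_intention

    # Merge on the participant identifier and export for further statistical analysis
    dataset = behavior.merge(questionnaire, on="participant", how="inner")
    dataset.to_csv("analysis_dataset.csv", index=False)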

2.3 Research Tools

There are many (free) resources available that will help you with constructing and conducting your study, analyzing your data, and reporting your results. In this section we provide you with lists of useful online resources, recommendations for further reading on quantitative and qualitative research methods, statistics, and reporting your scientific work, as well as survey and experiment tools. Moreover, we provide you with an overview of ready-to-use questionnaires that could be helpful for your study of SIAs.

Useful Online Resources and Online Research Tools:

Power analysis: G*Power 3; http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/

Statistical test decision tree: http://methods.sagepub.com/which-stats-test

The Ethics Code of the American Psychological Association: https://www.apa.org/ethics/code/




www.surveymonkey.com (commercially available online survey tool in 17 languages)



www.qualtrics.com (commercially available online survey tool in 62 languages)



www.soscisurvey.de (free online survey tool, user interface only in German)



https://www.psytoolkit.org/

Books on Quantitative and Qualitative Research Methodology, Statistics, and Reporting:

F. J. Gravetter, L. A. Forzano. 2012. Research Methods for the Behavioral Sciences. (4th ed.). Wadsworth, Cengage Learning, Belmont, CA.



A. Field and G. Hole. 2013. How to Design and Report Experiments. Repr. SAGE, Los Angeles.



A. Field. 2018. Discovering Statistics Using IBM SPSS Statistics. (5th. Ed.). SAGE. (also available for R and SAS), Los Angeles, London, New Delhi, Singapore, Washington, DC, Melbourne.



C. Jost, B. Le Pévédic, T. Belpaeme, C. Bethel, D. Chrysostomou, N. Crook, M. Grandgeorge, and N. Mirnig. 2020. Human–Robot Interaction. Evaluation Methods and Their Standardization. Springer International Publishing (12), Cham.



E. Lyons, A. Coyle. 2016. Analysing Qualitative Data in Psychology. (2nd. ed.). SAGE, Los Angeles.



P. Leavy (Ed.). 2015. The Oxford Handbook of Qualitative Research. Oxford Library of Psychology. Oxford University Press, Oxford.



George M. Hall. 2012. How to Write a Paper. John Wiley & Sons, Ltd, Chichester, UK.



American Psychological Association. 2020. Publication Manual of the American Psychological Association. The Official Guide to APA Style. 2020. (7th. ed.).



Harris M. Cooper. 2020. Reporting Quantitative Research in Psychology. How to Meet APA Style Journal Article Reporting Standards. (2nd. ed., revised).



H. M. Levitt. 2020. Reporting Qualitative Research in Psychology. How to Meet APA Style Journal Article Reporting Standards. (Revised ed.).

Questionnaires Commonly Used in SIA Research. Although we look back on two decades of research on SIAs, there has long been a lack of standardized measures with regard to the evaluation of interactions with SIAs, especially concerning


the "newer" field of social robots. However, in the last five years a significant effort within the research community has been put into addressing this gap, resulting in a constantly growing body of work around questionnaires or other forms of standardized assessments and tests around interactions with SIAs. To facilitate your research endeavor we collected and systematized questionnaires according to whether they are dependent variables that evaluate the outcome of the interaction in some form or whether they are potential moderating or mediating variables. A moderator variable has the potential to change the strength or direction of an effect between two variables of interest, meaning it affects the relationship between the independent variable and the dependent variable. Gender is often included as a moderator, but different psychological profiles (e.g., low or high loneliness, low or high self-efficacy in HRI) can also have a moderating effect on the relationship. For example, you could find that the usage of an emotionally expressive SIA (in contrast to a non-expressive agent) has a greater impact on acceptance of that agent in female than in male participants. Or you might find that a highly self-disclosing agent (compared to a non-disclosing agent) has a greater impact on perceived likability of the agent for people scoring high in loneliness. In contrast, a mediator variable is a variable that mediates the relationship between the independent variable (IV) and the dependent variable (DV). In other words, it explains the how or why of an observed relationship between the independent and the dependent variable, assuming that the independent variable does not influence the dependent variable directly but instead does so by means of a third variable. This can either be a complete mediation, meaning the full effect of the IV on the DV is caused by the mediator variable, or it can be a partial mediation, in which only a part of the effect of the IV on the DV is caused by the mediator (see Jose [2013] for further reading on statistical moderation and mediation analyses). Moreover, questionnaires assessing dependent variables, that is, the outcome of, for example, an interaction, are systematized according to the aspects of the interaction they are measuring (e.g., system performance, social evaluation of SIAs, acceptance, evaluation of overall interaction). Please find an extensive list of available questionnaires in Appendix A.
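To make the distinction more concrete, a moderation effect such as the expressiveness-by-gender example above is typically tested as an interaction term in a linear model or ANOVA, whereas mediation requires estimating an indirect effect (e.g., with pingouin's mediation_analysis function). The following minimal sketch uses Python and statsmodels; the variable names and values are illustrative assumptions, not taken from any particular questionnaire.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "expressive": [0, 0, 1, 1, 0, 0, 1, 1],                  # IV: agent condition
        "gender": ["f", "m", "f", "m", "f", "m", "f", "m"],      # potential moderator
        "acceptance": [3.1, 3.0, 4.6, 3.4, 2.9, 3.2, 4.8, 3.3],  # DV
    })

    # A significant expressive:gender interaction term would indicate moderation.
    model = smf.ols("acceptance ~ expressive * gender", data=df).fit()
    print(model.summary())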

2.4 Current Challenges

Since SIA is still a quite young research field, research in this field faces many challenges. We decided to address three of these challenges since they are directly related to methods: the replication crisis, the unnecessary conflict between quantitative and qualitative methods, and the lack of long-term and field studies.


2.4.1 Replication Crisis

In a tremendous effort, the Open Science Collaboration made the attempt to replicate 100 psychology studies. In their paper published in 2015 [Open Science Collaboration 2015], the authors reported that only 39% of these studies replicated the original result. This is rooted in the already ongoing debate about the so-called replication crisis that seems to be especially pronounced in psychology and the social sciences. However, researchers from other disciplines likewise report that their studies were not able to reproduce findings by other scholars and even their own prior work [Baker 2016]. Recently, Irfan et al. [2018] discussed how the replication crisis in psychology impacts research in the field of human–robot interaction; their arguments also hold for research on virtual agents. They argue that the consequences of the crisis in psychology also affect SIA research because "we do either use research methods similar to those used in other disciplines (and psychology in particular), or rely directly on insights and results handed down from other disciplines" (p. 14). Many scholars developing SIAs do so in interdisciplinary teams working closely with psychologists, drawing on prior research on social interaction and social relationships in order to design the "socialness" of the SIA. When, however, classical effects from social psychology cannot be replicated with humans, we cannot expect the effect to emerge in interactions with SIAs. Irfan et al.'s good advice is to attempt to replicate the social psychology effect with humans first before running a study with social robots. If the effect can be replicated with humans but not with social robots (or virtual agents), we can classify our null results better; for instance, this would suggest that the social phenomenon is likely not the same or not at all occurring in interactions with SIAs. The takeaway message of Irfan et al. is to be critical and approach the classic effects from social psychology in textbooks with the necessary skepticism that is advisable not only for social psychology but for any discipline. However, we should not be too pessimistic because in fact many classics from social psychology have been shown to exist in interactions with computers, virtual agents, and social robots, as the group of Clifford Nass first, and many researchers later, have demonstrated (e.g., Reeves and Nass [1996], Hoffmann et al. [2009], von der Pütten et al. [2010], and Eyssel and Hegel [2012]). Nevertheless, the SIA research community should avoid replicating the replication crisis, that is, we should avoid letting the same mechanisms, such as the file drawer problem or publication bias, skew the output of our scientific work by reporting only "significant" results. The community has been picking up on this recently. For instance, the International Conference on Human–Robot Interaction launched a new "reproducibility in Human–Robot-Interaction" track that welcomes contributions that "reproduce, replicates, or re-creates prior HRI work (or fails to)" and "provide new HRI artifacts (e.g., datasets, software)"


that facilitate reproducibility. Selected journals welcome pre-registered reports for replication studies, agreeing to publish the paper regardless of the outcome. Moreover, we should all value studies regardless of whether they produced significant results. Studies with null results are just as informative. Withholding these results can inflate the importance of single significant studies: for instance, when a dozen research teams try to replicate a study, fail, and do not report the replication failure, the significant original study gains unwarranted attention given the majority of null effects. Thus, the advice is: even when it is harder to publish non-significant results, try. Or at least publish your work accessibly for other scholars on preprint servers such as www.psycharchives.com.

2.4.2 Quantitative and Qualitative Research Methods

The difference between quantitative and qualitative studies can be easily summarized: quantitative studies produce data that can be counted and measured, qualitative studies produce data that cannot. Both approaches have their drawbacks and advantages, but both should be equally valued as scientific approaches for data collection in order to answer important research questions. As Hammarberg et al. [2016] point out, "qualitative and quantitative research methods are often juxtaposed as representing two different world views. In quantitative circles, qualitative research is commonly viewed with suspicion and considered lightweight because it involves small samples which may not be representative of the broader population, it is seen as not objective, and the results are assessed as biased by the researchers' own experiences or opinions. In qualitative circles, quantitative research can be dismissed as over-simplifying individual experience in the cause of generalization, failing to acknowledge researcher biases and expectations in research design, and requiring guesswork to understand the human meaning of aggregate data." (p. 498). "Choosing sides" in research is misleading. We would like to point out that careful and thorough reflection on the research topic and, especially, on the research question should guide the choice of an appropriate method. "The crucial part is to know when to use what method." [Hammarberg et al. 2016, p. 498].

∙ Quantitative research method: When you have specific hypotheses and can identify, isolate, and operationalize variables, when you want to unravel relationships and differences, and when you want to generalize results and make statements about the population, then choose a quantitative research method (see Section 2.2.1 for a detailed description of how to conduct experiments).



∙ Qualitative research method: When you have a research question that asks about subjective experiences and perspectives, and you have a specific, small


target group from which you want to get background information or you want to have an in-depth understanding of a specific case, then choose a qualitative research method (see Section 2.2.3.1 for an example of a qualitative study method).

Both methodological approaches can be used in combination, either to research different aspects of the same research question or to research the same aspect and complement and enrich the dataset. In consideration of the exemplary study on robots for learning, results from a quantitative study on the learning outcome could be enriched by interviews assessing students' personal experience of learning together with an agent. Theoretically and pragmatically, it is not always clear-cut which of the qualitative and quantitative methods should be used and, sometimes, qualitative and quantitative study approaches are merged together. Strictly speaking, once data are counted and measured, a study is not seen as purely qualitative anymore. However, sometimes qualitative methods are used and data are categorized into clusters that can be analyzed with descriptive statistics. A good example of successfully merging qualitative approaches with quantitative analytical approaches in a virtual agent study is the study by Opfermann and Pitsch [2017]. Special user groups (older adults and individuals with mild cognitive impairments) were confronted with an embodied conversational agent in a WoZ study. The authors studied the influence of continuous reprompts by the agent that indicate non-understanding of users' interaction attempts and reactions. Analysis of data was done by a sequential protocol of qualitative single-case conversational analyses for each participant, followed by quantitative coding including categorizing and frequency counting to compare user behavior and find patterns. The authors conclude with specific advice for reprompts as an error handling strategy: a reprompt should be given once and it should be unambiguous ("Do you mean yes? Say yes or no."; read Opfermann and Pitsch [2017] for details and for more practical implications). Valuing both approaches (quantitative and qualitative), understanding their advantages and disadvantages, and, especially, knowing when to use them (and when not to use them) is very important for researchers in psychology and the social sciences. Scholars in SIA should be open-minded and use either approach when appropriate.

2.4.3 Field Studies and Long-term Studies The third grand challenge in the field of SIA is that we still face a significant lack of field studies and long-term studies. SIAs are envisioned to provide assistance or service, to work together with humans in mixed teams in different working
environments, or to offer some form of companionship. As a matter of fact, SIAs will have to deal with more than one human in complex social environments. In stark contrast to this envisioned future scenario, research on SIAs has primarily focused on laboratory experiments, examining the interaction between a single human and a single social robot or virtual agent, while research on multi-agent systems is still young and research on group HRI has only recently begun. As Jung and Hinds [2018] pointed out, this dyad-based research of HRI in laboratory settings “has helped establish a fundamental understanding about people as they interact with robots,” but “our theories reflect an oversimplified view of HRI” (p. 1). Although the need for a paradigm shift from studying dyadic interactions in laboratory settings to studying (group) interactions in complex environments has been identified and advocated for [Jung and Hinds 2018], research in this regard is still scarce. This is mostly due to the fact that field studies require a robustly running system that can deal with environmental changes and challenges. Running these studies is expensive and time-consuming, and it poses many ethical challenges with regard to informed consent, data protection, and more. Depending on the type of target group, it is challenging to find participants who agree, for example, to try a robot in their homes for a longer period of time and to be under “constant evaluation.” As a result, sample sizes of field and long-term studies are often small, leading to negative reviewer comments about the statistical power of the studies. As a community, we should value the tremendous effort that goes into a field or long-term study. Even when the results are not statistically generalizable, these studies provide us with badly needed insights into how our SIAs perform and are perceived in the complex social environments that we design them for. Field and long-term studies that might result in smaller sample sizes can benefit greatly from combining quantitative and qualitative approaches to assess how successfully a SIA is integrated into the social environment.
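To illustrate why small samples draw such comments, the following short Python sketch (hypothetical numbers; the use of the statsmodels package is our assumption, not part of the chapter) estimates the statistical power of a simple two-group comparison with 12 participants per group, and the sample size that the conventional 80% power criterion would require.

    # Minimal sketch with hypothetical numbers: statistical power of a small
    # field-study sample for detecting a medium-sized group difference (d = 0.5).
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()

    # Achieved power with 12 participants per group, two-sided alpha = .05.
    achieved = analysis.power(effect_size=0.5, nobs1=12, alpha=0.05, ratio=1.0)
    print(f"Achieved power with 12 per group: {achieved:.2f}")  # about 0.2

    # Participants per group needed to reach the conventional 80% power.
    needed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
    print(f"Needed per group for 80% power: {needed:.0f}")  # about 64

Numbers like these show that a field study with a handful of participants cannot be judged by the same significance-testing yardstick as a large laboratory or online experiment, which is exactly why the combination with qualitative approaches described here is so valuable.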

2.5 Future Directions

When reviewing the grand challenges that we face regarding methods, we can directly infer future directions for our research. Since the research field of SIA is still quite young, most empirical work has been pioneering. In order to establish results and effects, we need replication studies that test the robustness of these effects as well as their generalizability to different cultural contexts (see Strait et al. [2020] for an example of a cross-cultural replication). Moreover, we should welcome the diversity of our research community and embrace the potential that it offers for interlacing qualitative and quantitative methods. We hope we were able to illustrate how the two methodological traditions can be of mutual benefit instead of hindrance. Especially when it comes to field and long-term studies that are placed in a social context, qualitative methods are well suited to take this social context into account, granting a more holistic understanding of the interaction situation and its social meaning than quantitative methods alone.

2.6 Summary

In this chapter on methods from the social sciences and psychology that can be used for research on SIAs, we provided you with a broad overview of the relevant concepts you should be aware of and take into account when planning to conduct studies involving human participants. However, keep in mind that methods are an integral part of every discipline and usually make up a significant part of your expert knowledge in a given discipline. This means that although we provided you with the fundamentals, you might need to study methods further to acquire expertise. For most of the methodological considerations that we covered, there are specialized books and many research papers that deal with particular aspects of methodology. It is advisable that, once you have chosen a rough direction, you consult more specialized literature on the specific method of your choice. We hope that our suggestions for further reading and recommendations for support tools will facilitate this process. In addition, we hope you will learn that social science methods are not only a duty to fulfil but can be a pleasure as well.

2.A Appendix A

Name of Questionnaire | Measured Construct(s) | IV, DV, MV | Reference

Evaluation of Agents/Interactions with Agents
General Impressions of Humanoids | Familiarity | DV | Kamide et al. [2013]
General Impressions of Humanoids | Repulsion | DV | Kamide et al. [2013]
General Impressions of Humanoids | Utility | DV | Kamide et al. [2013]
General Impressions of Humanoids | Performance | DV | Kamide et al. [2013]
General Impressions of Humanoids | Motion | DV | Kamide et al. [2013]
General Impressions of Humanoids | Voice | DV | Kamide et al. [2013]
General Impressions of Humanoids | Sound | DV | Kamide et al. [2013]
General Impressions of Humanoids | Humanness | DV | Kamide et al. [2013]
General Impressions of Humanoids | Entitativity | DV | Kamide et al. [2013]
The Human–Robot Interaction Evaluation Scale (HRIES) | Sociability | DV | Spatola et al. [2021]
The Human–Robot Interaction Evaluation Scale (HRIES) | Animacy | DV | Spatola et al. [2021]
The Human–Robot Interaction Evaluation Scale (HRIES) | Agency | DV | Spatola et al. [2021]
The Human–Robot Interaction Evaluation Scale (HRIES) | Disturbance | DV | Spatola et al. [2021]
Godspeed Questionnaire | Animacy | DV | Bartneck et al. [2009]
Godspeed Questionnaire | Anthropomorphism | DV | Bartneck et al. [2009]
Godspeed Questionnaire | Likeability | DV | Bartneck et al. [2009]
Godspeed Questionnaire | Perceived Intelligence | DV | Bartneck et al. [2009]
Godspeed Questionnaire | Perceived Safety | DV | Bartneck et al. [2009]
Animated Character and Interface Evaluation | Anxiety | DV | Rickenberg and Reeves [2000]
Animated Character and Interface Evaluation | Task Performance | DV | Rickenberg and Reeves [2000]
Animated Character and Interface Evaluation | Liking | DV | Rickenberg and Reeves [2000]
The Robotic Social Attributes Scale (RoSAS) | Warmth | DV | Carpinella et al. [2017]
The Robotic Social Attributes Scale (RoSAS) | Competence | DV | Carpinella et al. [2017]
The Robotic Social Attributes Scale (RoSAS) | Discomfort | DV | Carpinella et al. [2017]

Attitudes, Emotions, and Expectations in Interaction
Rapport–Expectation Robot Scale (RERS) | Expectation as a Conversation Partner | IV, DV, MV | Nomura and Kanda [2016]
Rapport–Expectation Robot Scale (RERS) | Expectation for Togetherness | IV, DV, MV | Nomura and Kanda [2016]
Robot Anxiety Scale (RAS) | Anxiety toward Communication Capability of Robots | IV, DV, MV | Nomura et al. [2006b]
Robot Anxiety Scale (RAS) | Anxiety toward Behavioral Characteristics of Robots | IV, DV, MV | Nomura et al. [2006b]
Robot Anxiety Scale (RAS) | Anxiety toward Discourse with Robots | IV, DV, MV | Nomura et al. [2006b]
Assessment of Attitudes Towards Social Robots (ASOR) | Mental Capacities | IV, DV, MV | Damholdt et al. [2020]
Assessment of Attitudes Towards Social Robots (ASOR) | Socio-practical Capacities | IV, DV, MV | Damholdt et al. [2020]
Assessment of Attitudes Towards Social Robots (ASOR) | Socio-moral Status | IV, DV, MV | Damholdt et al. [2020]
Negative Attitudes towards Robots Scale (NARS) | Negative Attitude toward Situations of Interaction with Robots | IV, DV, MV | Nomura et al. [2006a]
Negative Attitudes towards Robots Scale (NARS) | Negative Attitude toward Social Influence of Robots | IV, DV, MV | Nomura et al. [2006a]
Negative Attitudes towards Robots Scale (NARS) | Negative Attitude toward Emotions in Interaction with Robots | IV, DV, MV | Nomura et al. [2006a]
Frankenstein Syndrome Questionnaire (FSQ) | General Anxiety toward Humanoid Robots | DV | Nomura et al. [2012]
Frankenstein Syndrome Questionnaire (FSQ) | Apprehension toward Social Risks of Humanoid Robots | DV | Nomura et al. [2012]
Frankenstein Syndrome Questionnaire (FSQ) | Trustworthiness for Developers of Humanoid Robots | DV | Nomura et al. [2012]
Frankenstein Syndrome Questionnaire (FSQ) | Expectation for Humanoid Robots in Daily Life | DV | Nomura et al. [2012]
Measurement of Moral Concern for Robots | Basic Moral Concern | – | Nomura et al. [2019]
Measurement of Moral Concern for Robots | Concern for Psychological Harm | – | Nomura et al. [2019]
Self-Efficacy in HRI | Self-efficacy Expectations | IV, DV, MV | Rosenthal-von der Pütten and Bock [2018]

Embodiment, Physical Presence, Social Presence, and Co-Presence
Social Presence Survey | Social Presence | DV or MV | Bailenson et al. [2003]
Networked Minds Questionnaire of Social Presence | Social Presence | DV or MV | Biocca et al. [2003]
Networked Minds Questionnaire of Social Presence | Co-presence | DV or MV | Biocca et al. [2003]
Networked Minds Questionnaire of Social Presence | Subjective Symmetry | DV or MV | Biocca et al. [2003]
Networked Minds Questionnaire of Social Presence | Intersubjective Symmetry | DV or MV | Biocca et al. [2003]
Kidd and Breazeal Questionnaire | Perceived Presence | DV or MV | Kidd and Breazeal [2004]
Lombard and Ditton Presence Questionnaire | Presence | DV or MV | Lombard et al. [2000]
Embodiment and Corporeality Questionnaire | Corporeality | DV or MV | Hoffmann et al. [2018]
Embodiment and Corporeality Questionnaire | Mobility and Tactile Interaction | DV or MV | Hoffmann et al. [2018]
Embodiment and Corporeality Questionnaire | Shared Perception | DV or MV | Hoffmann et al. [2018]
Embodiment and Corporeality Questionnaire | Nonverbal Expressiveness | DV or MV | Hoffmann et al. [2018]

Usability and User Experience
Hoonhout Product Enjoyability Scale | Enjoyability | DV | Hoonhout [2002]
User Experience Questionnaire | User Experience | DV | Laugwitz et al. [2008]
System Usability Scale | System Usability | DV | Brooke [1996]

Questionnaires for Children and Adolescents
Children’s Social Behavior Questionnaire | Social Behavior, Children | IV | Hartman et al. [2006]
Emotion Awareness Questionnaire for Children | Emotion, Children | IV | Rieffe et al. [2008]
Technology-Specific Satisfaction Scale (TSSS) (child) | Satisfaction, Children | DV | Alves-Oliveira et al. [2015]
Technology-Specific Expectations Scale (TSES) (child) | Expectations (Capabilities and Fiction), Children | DV | Alves-Oliveira et al. [2015]

Trust in Technology
Human–Robot Trust Scale | Trust (Robot) | IV, DV, MV | Schaefer [2013]
Scale of Trust in Automated Systems | Trust (System) | IV, DV, MV | Jian et al. [2000]
Human–Computer Trust Scale | Trust (Computer; Reliability, Technical Competence, Perceived Understandability, Faith, Personal Attachment) | IV, DV, MV | Madsen and Gregor [2000]

Psychological States, Emotion, Motivation, Satisfaction, and Stress
Self-Assessment Manikin and Semantic Differential | Emotional State | DV or MV | Bradley and Lang [1994]
Positive and Negative Affect Schedule (PANAS) | Affective State | DV or MV | Watson et al. [1988]
Satisfaction with Life Scale | Satisfaction with Life | IV | Diener et al. [1985]
Situational Motivation Scale (SIMS) | Intrinsic Motivation, Identified Regulation, External Regulation, and Amotivation | DV or MV | Guay et al. [2000]
Academic Motivation Scale | Intrinsic, Extrinsic, and Amotivation in Education | IV, DV, MV | Vallerand et al. [1992]
Students’ Motivation Toward Science Learning (SMTSL) | Motivation to Learn Science | IV, DV, MV | Tuan et al. [2005]
English Language Learner Motivation Scale (ELLMS) | Motivation to Learn the English Language | IV, DV, MV | Ardasheva et al. [2012]
UCLA Loneliness Scale | Loneliness | IV | Russell [1996]
Perceived Stress Scale (PSS) | Stress | IV, DV, MV | Cohen et al. [1983]
New General Self-Efficacy Scale | General Self-efficacy | IV, DV, MV | Chen et al. [2001]
Standardized Mini-Mental State Examination | Mental State, Cognitive Abilities Development | IV | Crum et al. [1993]
CES-D Scale: A Self-Report Depression Scale | Depressive Symptomatology in General Population | IV | Radloff [1977]

Psychological Traits and Diagnostic Measurements
Eysenck Personality Questionnaire | Personality | IV | Francis et al. [1992]
Big Five Questionnaire | Personality | IV | Caprara et al. [1993]
Barratt Impulsiveness Scale (BIS 11) | Impulsiveness | IV | Patton et al. [1995]
The Aggression Questionnaire | Physical Aggression, Verbal Aggression, Anger, Hostility | IV | Buss and Perry [1992]
Buss–Perry Aggression Questionnaire short form | Physical Aggression, Verbal Aggression, Anger, Hostility | IV | Bryant and Smith [2001]
Emotion Regulation Questionnaire | Emotion Regulation (Suppression and Reappraisal) | IV | Gross and John [2003]

Task-Related
Cognitive Load Questionnaire for Placement Committees | Cognitive Development | DV | Fridin and Belokopytov [2014]
NASA Task Load Index Questionnaire | Cognitive Load | DV | Hart [2006]
Cognitive Load Questionnaire | Cognitive Load | DV | Sweller [1988]

References

P. Alves-Oliveira, T. Ribeiro, S. Petisca, E. Di Tullio, F. S. Melo, and A. Paiva. 2015. An empathic robotic tutor for school classrooms: Considering expectation and satisfaction of children as end-users. In A. Tapus, E. André, J.-C. Martin, F. Ferland, and M. Ammi (Eds.), Social Robotics. ICSR 2015. Lecture Notes in Computer Science, Vol. 9388. Springer International Publishing, Cham, 21–30. DOI: https://doi.org/10.1007/978-3-319-25554-5_3. American Psychological Association. 2020. Publication Manual of the American Psychological Association: The Official Guide to APA Style (7th. ed.). ISBN: 9781433832154. Y. Ardasheva, S. S. Tong, and T. R. Tretter. 2012. Validating the English language learner motivation scale (ELLMS): Pre-college to measure language learning motivational orientations among young ELLs. Learn. Individ. Differ. 22, 4, 473–483. DOI: https://doi.org/10.1016/j.lindif.2012.03.001. J. N. Bailenson, J. Blascovich, A. C. Beall, and J. M. Loomis. 2003. Interpersonal distance in immersive virtual environments. Pers. Soc. Psychol. Bull. 29, 7, 819–833. DOI: https://doi.org/10.1177/0146167203029007002. M. Baker. 2016. 1,500 scientists lift the lid on reproducibility. Nature 533, 7604, 452–454. DOI: https://doi.org/10.1038/533452a.


C. Bartneck, D. Kuli´ c, E. Croft, and S. Zoghbi. 2009. Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. Int. J. Soc. Robot. 1, 1, 71–81. DOI: https://doi.org/10.1007/s12369-0080001-3. T. Belpaeme, J. Kennedy, A. Ramachandran, B. Scassellati, and F. Tanaka. 2018. Social robots for education: A review. Sci. Rob. 3, 21, eaat5954. DOI: https://doi.org/10.1126/ scirobotics.aat5954. C. L. Bethel and R. R. Murphy. 2010. Review of human studies methods in HRI and recommendations. Int. J. Social Rob. 2, 4, 347–359. ISSN: 1875-4791. DOI: https://doi.org/10.1007/ s12369-010-0064-9. F. Biocca, C. Harms, and J. K. Burgoon. 2003. Toward a more robust theory and measure of social presence: Review and suggested criteria. Presence (Camb). 12, 5, 456–480. https: //doi.org/10.1162/105474603322761270. M. M. Bradley and P. J. Lang. 1994. Measuring emotion: The self-assessment manikin and the semantic differential. J. Behav. Ther. Exp. Psychiatry 25, 1, 49–59. DOI: https://doi.org/ 10.1016/0005-7916(94)90063-9. J. Brooke. 1996. SUS: A ‘quick and dirty’ usability. In P. W. Jordan, B. Thomas, I. L. McClelland, and B. Weerdmeester, (Eds.), Usability Evaluation in Industry. CRC Press, 189–194. F. B. Bryant and B. D. Smith. 2001. Refining the architecture of aggression: A measurement model for the Buss-Perry aggression questionnaire. J. Res. Pers. 35, 2, 138–167. DOI: https: //doi.org/10.1006/jrpe.2000.2302. A. H. Buss and M. Perry. 1992. The aggression questionnaire. J. Pers. Soc. Psychol. 63, 3, 452– 459. DOI: https://doi.org/10.1037/0022-3514.63.3.452. G. V. Caprara, C. Barbaranelli, L. Borgogni, and M. Perugini. 1993. The “big five questionnaire”: A new questionnaire to assess the five factor model. Pers. Individ. Differ. 15, 3, 281–288. DOI: https://doi.org/10.1016/0191-8869(93)90218-R. C. M. Carpinella, A. B. Wyman, M. A. Perez, and S. J. Stroessner. 2017. The robotic social attributes scale (RoSAS). In HRI’17. Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction. IEEE, Piscataway, NJ, 254–262. DOI: https://doi.org/10.1145/ 2909824.3020208. K. Casler, L. Bickel, and E. Hackett. 2013. Separate but equal? A comparison of participants and data gathered via Amazon’s MTurk, social media, and face-to-face behavioral testing. Comput. Hum. Behav. 29, 6, 2156–2160. ISSN: 0747-5632. DOI: https://doi.org/10.1016/ j.chb.2013.05.009. G. Chen, S. M. Gully, and D. Eden. 2001. Validation of a new general self-efficacy scale. Organ. Res. Methods 4, 1, 62–83. DOI: https://doi.org/10.1177/109442810141004. R. J. Chenail. 2011. Interviewing the investigator: Strategies for addressing instrumentation and researcher bias concerns in qualitative research. Qual. Rep. 16, 1, 255–262. DOI: https://doi.org/10.46743/2160-3715/2011.1051. L. Christensen. 1988. Deception in psychological research. Pers. Soc. Psychol. Bull. 14, 4, 664–675. ISSN: 0146-1672. DOI: https://doi.org/10.1177/0146167288144002.


J. Cohen. 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 1, 37–46. ISSN: 0013-1644. DOI: https://doi.org/10.1177/001316446002000104. S. Cohen, T. Kamarck, and R. Mermelstein. 1983. A global measure of perceived stress. J. Health. Soc. Behav. 24, 4, 385–396. DOI: https://doi.org/10.2307/2136404. H. M. Cooper. 2020. Reporting Quantitative Research in Psychology: How to Meet APA Style Journal Article Reporting Standards (2nd. ed.). Revised. APA style products. ISBN: 9781433832833. R. M. Crum, J. C. Anthony, S. S. Bassett, and M. F. Folstein. 1993. Population-based norms for the mini-mental state examination by age and educational level. JAMA 269, 18, 2386–2391. DOI: https://doi.org/10.1001/jama.1993.03500180078038. N. Dahlbäck, A. Jönsson, and L. Ahrenberg. 1993. Wizard of Oz studies: Why and how. In Proceedings of the 1st International Conference on Intelligent User Interfaces. 193–200. M. F. Damholdt, C. Vestergaard, M. Nørskov, R. Hakli, S. Larsen, and J. Seibt. 2020. Towards a new scale for assessing attitudes towards social robots: The attitudes towards social robots scale (ASOR). Interact. Studies 21, 1, 24–56. DOI: https://doi.org/10.1075/ is.18055.fle. E. Diener, R. A. Emmons, R. J. Larsen, and S. Griffin. 1985. The satisfaction with life scale. J. Pers. Assess. 49, 1, 71–75. DOI: https://doi.org/10.1207/s15327752jpa4901_13. A. Ertas and J. C. Jones. 1996. The Engineering Design Process (2nd. ed.). Wiley, New York. ISBN: 0471136999. F. Eyssel and F. Hegel. 2012. (S)he’s got the look: Gender stereotyping of robots. J. Appl. Soc. Psychol. 42, 9, 2213–2230. DOI: https://doi.org/10.1111/j.1559-1816.2012.00937.x. A. Field. 2018. Discovering Statistics Using IBM SPSS Statistics (5th. ed.). SAGE, Los Angeles and London and New Delhi and Singapore and Washington DC and Melbourne. ISBN: 9781526419521. A. Field and G. Hole. 2013. How to Design and Report Experiments, repr. SAGE, Los Angeles. ISBN: 9780761973829. A. Fleischer, A. D. Mead, and J. Huang. 2015. Inattentive responding in MTurk and other online samples. Ind. Organ. Psychol. 8, 2, 196–202. ISSN: 1754-9426. DOI: https://doi.org/ 10.1017/iop.2015.25. D. J. Follmer, R. A. Sperling, and H. K. Suen. 2017. The role of MTurk in education research: Advantages, issues, and future directions. Edu. Res. 46, 6, 329–334. ISSN: 0013-189X. DOI: https://doi.org/10.3102/0013189X17725519. L. J. Francis, L. B. Brown, and R. Philipchalk. 1992. The development of an abbreviated form of the revised Eysenck personality questionnaire (EPQR-A): Its use among students in England, Canada, the USA and Australia. Pers. Individ. Differ. 13, 4, 443–449. DOI: https://doi.org/10.1016/0191-8869(92)90073-X. C. Frauenberger, J. Good, and W. Keay-Bright. 2011. Designing technology for children with special needs: Bridging perspectives through participatory design. CoDesign 7, 1, 1–28. ISSN: 1571-0882. DOI: https://doi.org/10.1080/15710882.2011.587013.


M. Fridin and M. Belokopytov. 2014. Embodied robot versus virtual agent: Involvement of preschool children in motor task performance. Int. J. Hum-Comput. Int. 30, 6, 459–469. DOI: https://doi.org/10.1080/10447318.2014.888500. F. J. Gravetter and L. A. Forzano. 2012. Research Methods for the Behavioral Sciences (4th. ed.). Wadsworth, Cengage Learning, Belmont, CA. J. J. Gross and O. P. John. 2003. Individual differences in two emotion regulation processes: Implications for affect, relationships, and well-being. J. Pers. Soc. Psychol. 85, 2, 348–362. DOI: https://doi.org/10.1037/0022-3514.85.2.348. F. Guay, R. J. Vallerand, and C. Blanchard. 2000. On the assessment of situational intrinsic and extrinsic motivation: The situational motivation scale (SIMS). Motiv. Emot. 24, 3, 175–213. DOI: https://doi.org/10.1023/A:1005614228250. G. M. Hall. 2012. How to Write a Paper. John Wiley & Sons, Ltd, Chichester, UK. ISBN: 9781118488713. DOI: https://doi.org/10.1002/9781118488713. K. Hammarberg, M. Kirkman, and S. de Lacey. 2016. Qualitative research methods: When to use them and how to judge them. Hum. Reprod. 31, 3, 498–501. DOI: https://doi.org/10. 1093/humrep/dev334. S. G. Hart. 2006. Nasa-task load index (NASA-TLX); 20 years later. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 50, 9, 904–908. DOI: https://doi.org/10.1177/154193120605000909. C. A. Hartman, E. Luteijn, M. Serra, and R. Minderaa. 2006. Refinement of the children’s social behavior questionnaire (CSBQ): An instrument that describes the diverse problems seen in milder forms of PDD. J. Autism Dev. Disord. 36, 3, 325–342. https://doi.org/10. 1007/s10803-005-0072-z. D. J. Hauser and N. Schwarz. 2016. Attentive Turkers: MTurk participants perform better on online attention checks than do subject pool participants. Behav. Res. Methods 48, 1, 400–407. DOI: https://doi.org/10.3758/s13428-015-0578-z. L. Hoffmann, N. C. Krämer, A. Lam-chi, and S. Kopp. 2009. Media equation revisited: Do users show polite reactions towards an embodied agent? In Z. Ruttkay, M. Kipp, A. Nijholt, and H. H. Vilhjálmsson (Eds.), Intelligent Virtual Agents. IVA 2009. Lecture Notes in Computer Science, Vol. 5773. Springer, Berlin. DOI: https://doi.org/10.1007/978-3-64204380-2_19. L. Hoffmann, N. Bock, and A. M. Rosenthal-von der Pütten. 2018. The peculiarities of robot embodiment (EmCorp-Scale). In T. Kanda, S. Šabanovi´ c, G. Hoffman, and A. Tapus (Eds.), HRI’18, March 5–8, 2018, Chicago, IL. ACM, New York, NY, 370–378. DOI: https://doi. org/10.1145/3171221.3171242. J. Hoonhout. 2002. Development of a rating scale to determine the enjoyability of user interactions with consumer devices. Technical report, Philips Research. B. Irfan, J. Kennedy, S. Lemaignan, F. Papadopoulos, E. Senft, and T. Belpaeme. 2018. Social psychology and human–robot interaction. In T. Kanda, S. Šabanovi´ c, G. Hoffman, and A. Tapus (Eds.), HRI’18 Companion, March 5–8, 2018, Chicago, IL, USA. ACM, New York, NY, 13–20. ISBN: 9781450356152. DOI: https://doi.org/10.1145/3173386.3173389.


J.-Y. Jian, A. M. Bisantz, and C. G. Drury. 2000. Foundations for an empirically determined scale of trust in automated systems. Int. J. Cogn. Ergon. 4, 1, 53–71. DOI: https://doi.org/ 10.1207/S15327566IJCE0401_04. P. E. Jose. 2013. Doing Statistical Mediation and Moderation. Methodology in the Social Sciences. The Guilford Press, New York. ISBN: 9781462508211. M. Jung and P. Hinds. 2018. Robots in the wild. ACM Trans. Hum. Rob. Interact. 7, 1, 1–5. ISSN: 2573-9522. DOI: https://doi.org/10.1145/3208975. H. Kamide, K. Kawabe, S. Shigemi, and T. Arai. 2013. Development of a psychological scale for general impressions of humanoid. Adv. Robot. 27, 1, 3–17. DOI: https://doi.org/10. 1080/01691864.2013.751159. C. D. Kidd and C. Breazeal. 2004. Effect of a robot on user perceptions. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE Cat. No.04CH37566, Vol. 4. 3559–3564. DOI: https://doi.org/10.1109/IROS.2004.1389967. S. Kvale. 1983. The qualitative research interview. J. Phenomenol. Psychol. 14, 1–2, 171–196. DOI: https://doi.org/10.1163/156916283X00090. B. Laugwitz, T. Held, and M. Schrepp. 2008. Construction and evaluation of a user experience questionnaire. In Symposium of the Austrian HCI and usability engineering group. Springer, Berlin, Heidelberg, 63–76. DOI: https://doi.org/10.1007/978-3-540-89350-9_6. H. R. Lee, S. Šabanovi´ c, W.-L. Chang, S. Nagata, J. Piatt, C. Bennett, and D. Hakken. 2017. Steps toward participatory design of social robots. In B. Mutlu, M. Tscheligi, A. Weiss, and J. E. Young (Eds.), HRI’17. IEEE, Piscataway, NJ, 244–253. ISBN: 9781450343367. DOI: https://doi.org/10.1145/2909824.3020237. H. M. Levitt. 2020. Reporting Qualitative Research in Psychology: How to Meet APA Style Journal Article Reporting Standards, (revised ed.). ISBN: 9781433833434. M. Lombard, T. B. Ditton, D. Crane, B. Davis, G. Gil-Egui, K. Horvath, and S. Park. 2000. Measuring presence: A literature-based approach to the development of a standardized paper-and-pencil instrument. In Third International Workshop on Presence. Delft, The Netherlands. M. Madsen and S. Gregor. 2000. Measuring human-computer trust. In 11th Australasian Conference on Information Systems, Vol. 53. Australasian Association for Information Systems, Brisbane, Australia, 6–8. R. M. Montoya, R. S. Horton, and J. Kirchner. 2008. Is actual similarity necessary for attraction? A meta-analysis of actual and perceived similarity. J. Soc. Pers. Relat. 25, 6, 889–922. ISSN: 0265-4075. DOI: https://doi.org/10.1177/0265407508096700. E. A. Necka, S. Cacioppo, G. J. Norman, and J. T. Cacioppo. 2016. Measuring the prevalence of problematic respondent behaviors among MTurk, campus, and community participants. PLoS One 11, 6, e0157732. DOI: https://doi.org/10.1371/journal.pone. 0157732. A. L. Nichols and J. K. Maner. 2008. The good-subject effect: Investigating participant demand characteristics. J. Gen. Psychol. 135, 2, 151–165. DOI: https://doi.org/10.3200/GE NP.135.2.151-166.


T. Nomura and T. Kanda. 2016. Rapport–expectation with a robot scale. Int. J. Soc. Robot. 8, 1, 21–30. DOI: https://doi.org/10.1007/s12369-015-0293-z. T. Nomura, T. Kanda, and T. Suzuki. 2006a. Experimental investigation into influence of negative attitudes toward robots on human–robot interaction. AI Soc. 20, 2, 138–150. DOI: https://doi.org/10.1007/s00146-005-0012-7. T. Nomura, T. Suzuki, T. Kanda, and K. Kato. 2006b. Measurement of anxiety toward robots. In Proceedings of The 15th IEEE International Symposium on Robot and Human Interactive Communication, 2006: Ro-MAN 2006; 6–8 September 2006. University of Hertfordshire, Hatfield, UK. IEEE, Piscataway, NJ, 372–377. DOI: https://doi.org/10.1109/RO MAN.2006.314462. T. Nomura, K. Sugimoto, D. S. Syrdal, and K. Dautenhahn. 2012. Social acceptance of humanoid robots in Japan: A survey for development of the Frankenstein syndrome questionnaire. In 2012 12th IEEE-RAS International Conference on Humanoid Robots (Humanoids 2012), 242–247. DOI: https://doi.org/10.1109/HUMANOIDS.2012. 6651527. T. Nomura, T. Kanda, and S. Yamada. 2019. Measurement of moral concern for robots. In HRI’19: The 14th ACM/IEEE International Conference on Human-Robot Interaction: March 11–14, 2019. Daegu, South Korea. IEEE, Piscataway, NJ, 540–541. DOI: https://doi.org/10. 1109/HRI.2019.8673095. G. Norman. 2010. Likert scales, levels of measurement and the “laws” of statistics. Adv. Health Sci. Educ. Theory Pract. 15, 5, 625–632. DOI: https://doi.org/10.1007/s10459-0109222-y. Open Science Collaboration. 2015. Estimating the reproducibility of psychological science. Science 349, 6251, aac4716. DOI: https://doi.org/10.1126/science.aac4716. C. Opfermann and K. Pitsch. 2017. Reprompts as error handling strategy in human– agent-dialog? User responses to a system’s display of non-understanding. In Human– Robot Collaboration and Human Assistance for an Improved Quality of Life. IEEE, Piscataway, NJ, 310–316. ISBN: 978-1-5386-3518-6. DOI: https://doi.org/10.1109/ROMAN.2017. 8172319. J. H. Patton, M. S. Stanford, and E. S. Barratt. 1995. Factor structure of the Barrett impulsiveness scale. J. Clin. Psychol. 51, 6, 768–774. DOI: https://doi.org/10.1002/1097-4679 (199511)51:63.0.CO;2-1. E. K. Perrault and D. M. Keating. 2018. Seeking ways to inform the uninformed: Improving the informed consent process in online social science research. J. Empirical Res. Hum. Res. Ethics 13, 1, 50–60. DOI: https://doi.org/10.1177/1556264617738846. E. Pronin. 2009. Chapter 1: The Introspection Illusion. In M. P. Zanna (Ed.), Advances in Experimental Social Psychology, Vol. 41. Academic Press/Elsevier, London, 1–67. DOI: https://doi.org/10.1016/S0065-2601(08)00401-2. S. Q. Qu and J. Dumay. 2011. The qualitative research interview. Qual. Res. Account. Manag. 8, 3, 238–264. ISSN: 1176-6093. DOI: https://doi.org/10.1108/11766091111162070. L. S. Radloff. 1977. The CES-D Scale. Appl. Psychol. Meas. 1, 3, 385–401. DOI: https://doi.org/ 10.1177/014662167700100306.


R. Rickenberg and B. Reeves. 2000. The effects of animated characters on anxiety, task performance, and evaluations of user interfaces. In T. Turner (Ed.), Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, The SIGCHI Conference. The Hague, The Netherlands, 4/1/2000–4/6/2000. ACM Special Interest Group on ComputerHuman Interaction, ACM, New York, NY, 49–56. B. Reeves and C. I. Nass. 1996. The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Cambridge University Press. H. T. Reis and C. M. Judd. 2013. Handbook of Research Methods in Social and Personality Psychology. Cambridge University Press, New York. ISBN: 9780511996481. DOI: https:// doi.org/10.1017/CBO9780511996481. C. Rieffe, P. Oosterveld, A. C. Miers, M. M. Terwogt, and V. Ly. 2008. Emotion awareness and internalising symptoms in children and adolescents: The emotion awareness questionnaire revised. Pers. Individ. Differ. 45, 8, 756–761. DOI: https://doi.org/10.1016/j.paid. 2008.08.001. L. Riek. 2012. Wizard of Oz studies in HRI: A systematic review and new reporting guidelines. J. Hum. Rob. Interact. 1, 119–136. DOI: https://doi.org/10.5898/JHRI.1.1.Riek. A. M. Rosenthal-von der Pütten and N. C. Krämer. 2015. Individuals’ evaluations of and attitudes towards potentially uncanny robots. Int. J. Soc. Rob. 7, 5, 799–824. ISSN: 1875-4791. DOI: https://doi.org/10.1007/s12369-015-0321-z. A. M. Rosenthal-von der Pütten and N. Bock. 2018. Development and validation of the selfefficacy in human-robot-interaction scale (SE-HRI). ACM Trans. Human-Robot Interact. 7, 3, 1–30. DOI: https://doi.org/10.1145/3139352. D. W. Russell. 1996. UCLA loneliness scale (version 3): Reliability, validity, and factor structure. J. Pers. Assess. 66, 1, 20–40. DOI: https://doi.org/10.1207/s15327752jpa 6601_2. S. Šabanovi´ c, W.-L. Chang, C. C. Bennett, J. A. Piatt, and D. Hakken. 2015. A robot of my own: Participatory design of socially assistive robots for independently living older adults diagnosed with depression. In J. Zhou and G. Salvendy (Eds.), Design for Aging, volume 9193 of Lecture Notes in Computer Science Information Systems and Applications, incl. Internet/web, and HCI. Springer, Cham, 104–114. ISBN: 978-3-31920891-6. DOI: https://doi.org/10.1007/978-3-319-20892-311. K. Schaefer. 2013. The Perception and Measurement of Human-Robot Trust. (2013). Doctoral dissertation, University of Central Florida Orlando, Florida. M. L. Schrum, M. Johnson, M. Ghuy, and M. C. Gombolay. 2020. Four years in review: Statistical practices of Likert scales in human-robot interaction studies. In Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction (HRI‘20). Association for Computing Machinery, New York, NY, 43–52. DOI: https://doi.org/10.1145/ 3371382.3380739. D. B. Shank. 2016. Using crowdsourcing websites for sociological research: The case of Amazon Mechanical Turk. Am. Sociol. 47, 1, 47–55. ISSN: 0003-1232. DOI: https://doi.org/ 10.1007/s12108-015-9266-9.


P. E. Shrout and J. L. Fleiss. 1979. Intraclass correlations: Uses in assessing rater reliability. Psychol. Bull. 86, 2, 420–428. ISSN: 0033-2909. DOI: https://doi.org/10.1037/ 0033-2909.86.2.420. N. A. Smith, I. E. Sabat, L. R. Martinez, K. Weaver, and S. Xu. 2015. A convenient solution: Using MTurk to sample from hard-to-reach populations. Ind. Organ. Psychol. 8, 2, 220–228. ISSN: 1754-9426. DOI: https://doi.org/10.1017/iop.2015.29. S. M. Smith, C. A. Roster, L. L. Golden, and G. S. Albaum. 2016. A multi-group analysis of online survey respondent data quality: Comparing a regular USA consumer panel to MTurk samples. J. Bus. Res. 69, 8, 3139–3148. ISSN: 01482963. DOI: https://doi.org/10.1016/ j.jbusres.2015.12.002. M. Strait, F. Lier, J. Bernotat, S. Wachsmuth, F. Eyssel, R. Goldstone, and S. Šabanovi´ c. 2020. A three-site reproduction of the joint Simon effect with the NAO robot. In T. Belpaeme, J. Young, H. Gunes, and L. Riek, (Eds.), HRI’20. Association for Computing Machinery, New York, NY, 103–111. ISBN: 9781450367462. DOI: https://doi.org/10.1145/ 3319502.3374783. N. Spatola, B. Kühnlenz, and G. Cheng. 2021. Perception and evaluation in human–robot interaction: The human–robot interaction evaluation scale (HRIES)—A multicomponent approach of anthropomorphism. Int. J. Soc. Robot. DOI: https://doi.org/10.1007/s12369020-00667-4. J. Sweller. 1988. Cognitive load during problem solving: Effects on learning. Cogn. Sci. 12, 2, 257–285. DOI: https://doi.org/10.1016/0364-0213(88)90023-7. M. C.-T. Tai. 2012. Deception and informed consent in social, behavioral, and educational research (SBER). Tzu Chi Med. J. 24, 4, 218–222. ISSN: 10163190. DOI: https://doi.org/10. 1016/j.tcmj.2012.05.003. H.-L. Tuan, C.-C. Chin, and S.-H. Shieh. 2005. The development of a questionnaire to measure students’ motivation towards science learning. Int. J. Sci. Educ. 27, 6, 639–654. DOI: https://doi.org/10.1080/0950069042000323737. R. J. Vallerand, L. G. Pelletier, M. R. Blais, N. M. Briere, C. Senecal, and E. F. Vallieres. 1992. The academic motivation scale: A measure of intrinsic, extrinsic, and amotivation in education. Educ. Psychol. Meas. 52, 4, 1003–1017. DOI: https://doi.org/10.1177/ 0013164492052004025. A. M. von der Pütten, N. C. Krämer, J. Gratch, and S.-H. Kang. 2010. “It doesn’t matter what you are!” Explaining social effects of agents and avatars. Comput. Hum. Behav. 26, 6, 1641–1650. ISSN: 0747-5632. DOI: https://doi.org/10.1016/j.chb.2010.06.012. D. Watson, L. A. Clark, and A. Tellegen. 1988. Development and validation of brief measures of positive and negative affect: The PANAS scales. J. Pers. Soc. Psychol. 54, 6, 1063–1070. DOI: https://doi.org/10.1037/0022-3514.54.6.1063. C. Zaga, M. Lohse, K. P. Truong, and V. Evers. 2015. The effect of a robot’s social character on children’s task engagement: Peer versus tutor. In A. Tapus, E. André, J.-C. Martin, F. Ferland, and M. Ammi (Eds.), Social Robotics, volume 9388 of Lecture Notes in Computer Science Lecture Notes in Artificial Intelligence. Springer, Cham and
Heidelberg and New York and Dordrecht and London, 704–713. ISBN: 978-3-319-25553-8. DOI: https://doi.org/10.1007/978-3-319-25554-570. M. V. Zelkowitz and D. R. Wallace. 1998. Experimental models for validating technology. Computer 31, 5, 23–31. ISSN: 00189162. DOI: https://doi.org/10.1109/2.675630.

3 Social Reactions to Socially Interactive Agents and Their Ethical Implications
Nicole Krämer and Arne Manzeschke

3.1 Motivation

Socially interactive agents have now been developed and discussed for more than 20 years. While they have aptly been described as a testbed to gain a better understanding of human communication skills, they have also always been developed with a view to supporting users by means of helpful applications and implementations. In this regard, they have from the beginning raised interest from a psychological and ethical point of view: It soon became apparent that early applications—even when they were not yet very sophisticated regarding their interaction abilities—already proved to be able to elicit social reactions from the users. The goal of this chapter is to summarize research on these social effects. While this has already been done from a psychological perspective [Krämer et al. 2015], the combination of a psychological and an ethical view is unique to this chapter. Effects of socially interactive agents can be termed “social” if a participant’s emotional, cognitive, or behavioral reactions are similar to reactions shown during interactions with other human beings (for an early review, see Krämer [2005]) or, in some respect, other animals (e.g., de Waal [2019]). A number of studies demonstrate that those—actually inappropriate—reactions really do occur, sometimes even without the appearance of a human-like character. Studies [Nass et al. 1994, 1997, Fogg and Nass 1997] indicate that in interactions with computers politeness phrases are employed, principles of person perception and gender stereotypes
apply, and liking is triggered in a similar way as within human relationships (computers that “flatter” and give positive feedback are evaluated more positively). Although these studies suggest that a “rich human presentation” [Nass et al. 1994] is not necessary and that speech alone might be sufficient to trigger social reactions [Nass and Moon 2000], there is also evidence that intelligent virtual agents (IVA) might be especially prone to eliciting social reactions. There is long-standing evidence that even subtle social phenomena such as impression management (see, e.g., Leary [1995]) are prevalent in human–agent interaction. When a human-like face is present, participants aim at leaving a favorable impression by, for example, choosing a socially desirable TV program (a documentary about Albert Einstein compared to a James Bond movie) [Krämer et al. 2003] or by presenting themselves in a socially desirable way [Sproull et al. 1996]. Krämer [2005] showed that IVAs affect the way in which users communicate with a TV–VCR system. When an IVA is visible instead of a graphical user interface or a user interface with speech output, users address the system significantly more often using natural speech rather than using a remote control. Additional qualitative analyses of the semantic content of all speech acts indicate that users seem to have a more human-like attitude and behavior toward the system when it is represented by an anthropomorphic agent. Similarly, Hoffmann et al. [2009] replicated the politeness experiment of Nass et al. [1999] with an IVA. Participants had to evaluate the IVA Max (at Bielefeld University, Germany) after a short interaction, whereby the questioning was conducted either by Max itself, via a paper-and-pencil questionnaire in the same room, or via a paper-and-pencil questionnaire in a separate room. The results showed that participants were more polite, that is, provided better evaluations of Max’s competence when Max itself asked for its evaluation compared to the questionnaire in the same room. However, no significant difference was observed compared to the questionnaire in a separate room. It should be noted that not all studies analyzing people’s reactions toward artificial entities found social effects in the manner and to the extent that was demonstrated in the CASA (“Computers are Social Actors”; Nass et al. [1994]) studies and their follow-ups with embodied agents. Based on their criticism that the setting in the CASA studies focused too narrowly on the computer asking for evaluation rather than allowing for and assessing actual interaction, Shechtman and Horowitz [2003] conducted an experiment in which they analyzed the actual conversational discourse and—in order to address a second shortcoming of the CASA approach—added an experimental condition in which the interlocutor was announced to be human. Within the CASA approach, the authors frequently conclude that the behavior toward a computer is the same as that toward fellow humans, although this is not tested directly. In their study, Shechtman and Horowitz [2003] discovered
as a result of the conversational analysis that human–human interaction is indeed different from human–computer interaction. When participants thought that their partner was human, they, for instance, used more words, spent more time, and used more relationship statements. Despite these findings, further studies more compellingly show that people indeed unconsciously react in social ways when confronted with an IVA: Krämer et al. [2013] demonstrate that people mimic the smiling behavior of a virtual agent. Mimicry is a widely cited phenomenon of human–human communication that has been shown to be especially indicative of the sociality of the situation. In a between-subjects design, participants conducted an 8-minute small-talk conversation with an agent that either did not smile, showed occasional smiles, or displayed frequent smiles. Results show that the human interaction partners themselves smiled longer when the agent was smiling. Interestingly, the smiling activity did not have an impact on people’s evaluation of the agent, nor were they able to reliably indicate whether the agent had smiled and whether this was occasional or frequent. Therefore, it can be concluded that the participants’ behavioral reactions were rather unconscious and automatic. Moreover, since mimicry has been shown to be indicative of the sociality of the situation, the fact that participants smiled at the agent can be seen as evidence that they—at least unconsciously—experienced the situation as social. It can therefore be summarized that numerous studies yield evidence that artificial entities lead at least partly to social effects. From a psychological point of view these results are interesting because when we learn what triggers social reactions, we learn something about human nature. In line with this, Parise et al. [1999] suggested: “As computer interfaces can display more life-like qualities such as speech output and personable characters or agents, it becomes important to understand and assess user’s interaction behavior within a social interaction framework rather than only a narrower machine interaction one” (p. 123). Given these social reactions and the necessity of understanding these new forms of human–machine interactions within a social interaction framework, it also became necessary to reflect on these issues from a normative stance. Therefore, from the beginning, ethical questions have been raised. Kiesler and Sproull voiced first concerns about the tendency to use humanoid interfaces as early as 1996: “Many people want computers to be responsive to people. But do we also want people to be responsive to computers?” [Sproull et al. 1996, p. 119]. With continuously more realistic and human-like machines, these ethical questions have neither been solved nor become less important: Human beings are those beings capable of speech, language, crying, and laughter [Plessner 1970] and thus able to communicate in various modes with each other (i.e., rational, emotional, spiritual).
It is this capability that characterizes the human species among all others and enables human beings to recognize each other as such. Encountering machines capable of human language blurs the fundamental distinction between the anthropos zoon logon echon (cf. Aristotle, Politics 1253a, meaning that human beings are political beings in an advanced sense due to their capability of speech. Aristotle discerns between mere voice, which many animals are capable of, and language, which belongs only to human beings.) and all the other beings. It poses the fundamental question of who or what the “speaking machine” should be recognized as. Furthermore, it should be acknowledged that speech does not only transport information from a sender to a receiver; speech also opens a social space accompanied by emotions such as expectations, worries, and—more generally speaking—social negotiations. However, who or what is the counterpart in these social negotiations? Besides the problematic aspect that categories are blurred and the unambiguous assignment to the categories “human” or “thing” is no longer possible, another aspect is the question of whether the human user is manipulated in an undue way. One very basic assumption in featuring machines capable of speech is that it would enhance human–machine interaction and thus productivity. From an ethical point of view, this seems to be at least questionable. Besides the fact that it is not clear whether efficiency actually increases, it should be questioned whether human emotionality should be exploited for such purposes (cf. Manzeschke et al. [2016]).

3.2 Models and Approaches

Based on the first findings on social reactions toward speaking machines and IVAs, diverse models have been formulated that try to explain the origin of these social reactions (for an earlier summary, see Krämer et al. [2015]). The most prominent of them is the media equation approach: Reeves and Nass [1996] postulate that individuals treat computers and other artifacts as social actors, and term this phenomenon “media equation” (media equals real life). To test this assumption, they conducted numerous experiments within the CASA paradigm, in which human subjects had to interact with computers. These experiments all followed a similar pattern: Search for a social science finding, replace “human” with “computer” in the theory statement and method, and observe whether the social rule is still observable [Nass et al. 1994]. The media equation is considered to be verified when the results resemble the findings from interpersonal contexts. Different approaches that try to explain social reactions toward non-social artifacts have been mentioned in the literature. These approaches can generally be divided into (a) approaches which assume that the reactions cannot be considered as truly social as they either result from a deficit on the part of the user or
are conscious reactions due to demand characteristics of the situation and (b) approaches which argue that social reactions toward artifacts occur unconsciously and are even denied by the human interlocutor (mindlessness, ethopoeia, computer as source). Additionally, (c) it has been suggested that social reactions depend on the level of assumed agency (Threshold Model of Social Influence [TMSI]).

3.2.1 Approaches which Assume that Reactions are Not Truly Social Early criticism with regard to the CASA study results stated that users who react in a social way must have deficits resulting from psychological dysfunctions, young age, or lack of experience. However, this can be denied as the participants in the CASA studies were mostly healthy undergraduate students who had extensive experience with computers (e.g., Nass et al. [1996, 1999]). Similarly, the notion that the participants assume that they are interacting rather with the programmer than with the computer, that is, the computer is not seen as a source but merely as a medium, has been refuted and rejected based on empirical data [Sundar 1994, Sundar and Nass 2000]. A further explanation that continues to compete against the explanations below lies in the assumption that the observable reactions are not really “social” but are merely due to demand characteristics of the situation. It is argued that people show “as if” reactions (in the sense that people consciously tell themselves that they will talk to the artificial entity as if it were a person) that merely occur because appropriate scripts are missing when humans interact with computers [Kiesler and Sproull 1997]. However, this assumption can at least be questioned based on some of the results on social reactions that obviously happen unconsciously [Krämer et al. 2013].

3.2.2 Approaches which Assume that Reactions are Social and Unconscious Supporters of the media equation assumption see social reactions to artificial entities as truly social in the sense that “People respond socially and naturally to media even though they believe it is not reasonable to do so, and even though they don’t think that these responses characterize themselves” [Reeves and Nass 1996, p. 7]. Nass and Moon [2000] suggest using the term “ethopoeia” as an explanation for this unconscious and automatic behavior (social reaction) that is inconsistent with one’s conscious opinion (computers do not need social treatment). According to this approach, minimal social cues like a human-sounding voice mindlessly (cf. Langer [1989]) trigger social responses because humans cannot avoid reacting automatically to social cues. The ethopoeia approach is supported by the fact that participants in the studies of Nass et al. obviously did not consciously recognize their social behaviors, and when they were asked in the debriefing they stated that they did not act socially (e.g., polite) toward the computers and that they believe such
behavior to be inappropriate [Nass et al. 1999]. Studies suggest that social responses to computers are indeed moderated by the extent of cognitive busyness [Lee 2010]. Also, it has been demonstrated that anthropomorphism of computers can happen mindlessly, leading to the social reactions that have been described [Kim and Sundar 2012]. In a revised version of the ethopoeia concept authors acknowledge that automatic and unconscious social reactions will be stronger if there are more social cues [Morkes et al. 1999, Nass and Yen 2012]—speaking to the possibility that socially interactive agents might be more prone to triggering social reactions than speaking computers.

3.2.3 Approaches which Assume that Social Reactions Depend on the Level of Assumed Agency

In the course of the development of anthropomorphic characters, Blascovich [2002] established the TMSI, which predicts the social verification of a virtual other depending on the factors of agency and behavioral realism. Agency here means the degree to which the virtual entity is controlled by a real human (low agency is present in the case of a virtual agent that is controlled by an autonomous computer program; high agency is given when a human controls the virtual character—which is then termed avatar). The authors assume a Threshold of Social Influence, which has to be crossed to evoke social reactions by the user. This is only possible when the level of social verification is sufficiently high. When the factor of agency is high (i.e., when the user knows that the virtual character is a representation of a human being), then the factor of behavioral realism does not have to be high in order for social verification to take place and for social effects to occur. Conversely, when the factor of agency is low (i.e., when the user knows that the virtual character is a mere computer program), the factor of behavioral realism has to be very high to compensate for the lack of agency. In sum, it can be derived that according to the TMSI the social influence of real persons will always be high, whereas the influence of an artificial entity depends on the realism of its behavior. In our own research, we were especially interested in testing the ethopoeia and the TMSI models against each other. The major difference is that the TMSI model assumes that there is a fundamental difference between agents and avatars in the sense that users react socially to avatars (i.e., mediated fellow humans) but will only react socially to agents when they show sufficient social cues. The ethopoeia model, on the other hand, assumes that agents will automatically evoke social reactions in the same way as fellow humans do. Therefore, von der Pütten et al. [2010a] empirically tested the TMSI against the ethopoeia approach. With the aim of testing the aforementioned assumptions, agency and behavioral realism of a virtual agent (the Rapport Agent; Gratch et al. [2006]) were experimentally manipulated in
a 2 × 2 between-subjects design. Participants were led to believe that they would be interacting either with another participant mediated by a virtual character or with an autonomous computer program. Moreover, the agent with higher behavioral realism featured responsive non-verbal behavior while participants were interacting with the agent, whereas the agent in the low behavioral realism condition showed only idle behavior (breathing, eye blinking) but no responsive behaviors. According to the TMSI, interaction effects between agency and behavioral realism should occur (in the sense that social reactions are observable in both avatar conditions but only in the agent condition with high behavioral realism). However, if the ethopoeia concept in its revised version (which acknowledges that automatic and unconscious social reactions will be stronger if there are more social cues [Morkes et al. 1999, Nass and Yen 2012]) is more accurate, social reactions should be reinforced when behavioral realism increases and should be independent of assumed agency. During the interaction, the Rapport Agent asked the participants intimate questions so that self-disclosure behavior of the participants could be used as dependent variable. Additionally, self-report scales to evaluate the virtual character as well as the situation were employed. The data analyses revealed that the belief of interacting with either an avatar or an agent resulted in barely any differences with regard to the evaluation of the virtual character or behavioral reactions, whereas higher behavioral realism affected both (e.g., participants experienced more feelings of mutual awareness, and they used more words during the interaction when behavioral realism was high). However, no interaction effects of the factors of agency and behavioral realism emerged. Ultimately, since main effects of behavioral realism, but no interaction effects, were found, the results support the Revised Ethopoeia Concept but not the TMSI. However, it should be noted that a recent meta-analysis by Fox et al. [2015] rather provided evidence for the notion that the perception of agency is decisive when interacting with virtual characters. The analysis revealed that, overall, social reactions were stronger when people thought that they were interacting with another human compared to when they believed they were interacting with a computer program. Therefore, the role of agency in terms of the emergence of social reactions is still unclear. Regarding ethical models, there are several theories and philosophical assumptions that are relevant here. There is, for example, a strong moral claim that human beings should have the control in any socio-technical arrangement (cf. Sturma [2004]). Otherwise, responsibility is hard to assign (cf. Christen [2004]). There is a vivid debate on how to assign responsibility to a technical system. The EU Commission fostered the idea of an electronic personhood for “the most sophisticated autonomous robots” to cover claims for compensation when autonomous systems

84

Chapter 3 Social Reactions to Socially Interactive Agents and Their Ethical Implications

cause any damage. However, given the sometimes subtle effects (as described above) it might be difficult to directly detect damage. Therefore, it has been argued that something called “Robot Ethics 2.0” should be implemented [Lin et al. 2017, Misselhorn 2018].
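Before moving on, a brief sketch may help make the statistical logic of the agent-versus-avatar comparison above concrete. The sketch below is a minimal illustration, not the original analysis: the data, sample sizes, and effect sizes are entirely hypothetical. It shows how a two-way ANOVA separates the prediction of the TMSI (a significant agency by realism interaction) from the prediction of the revised ethopoeia concept (a main effect of behavioral realism without an interaction).

```python
# Hypothetical illustration only: simulated data for a 2 x 2 between-subjects
# design (agency: avatar vs. agent; realism: high vs. low), with word count
# during the interaction as the dependent variable.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 20  # participants per cell (hypothetical)

rows = []
for agency in ["avatar", "agent"]:
    for realism in ["high", "low"]:
        # Ethopoeia-style pattern: realism raises word count regardless of agency.
        mean = 120 + (30 if realism == "high" else 0)
        for words in rng.normal(mean, 25, size=n):
            rows.append({"agency": agency, "realism": realism, "words": words})

df = pd.DataFrame(rows)

# Two-way ANOVA: main effects of agency and realism plus their interaction.
model = smf.ols("words ~ C(agency) * C(realism)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
# A significant C(agency):C(realism) term would favor the TMSI prediction;
# a main effect of realism without an interaction matches the revised
# ethopoeia concept.
```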

3.3 History/Overview

From an ethical point of view, agency is not only behavior but also the ability to set oneself goals and to choose the means to achieve these goals. Furthermore, ethics deals with the justification of means and goals by arguments and reasoning within human society. Reducing agency to "outer" behavior might lead to an impasse with regard to a richer understanding of agency. Expanding the circle of agents, specific animals, especially domestic animals, should be taken into consideration. Reacting to social cues does not necessarily mean interacting in a social way as in human interaction. Humans interact with animals and they know quite well that this social interaction is different from that with human beings. It is a social habit that has been established over thousands of years. From this anthropological and ethical point of view it seems reasonable to contrast human–human, human–animal, and human–technology relations. Thus, a specific difference appears between the first two relations and the latter one: First, there is not yet a practice of interaction between human beings and technological agents, and thus a lack of experience and habitualization. Second, these technological agents are constructed by human beings and depend on human presets regarding design and interaction.

Besides the work focusing on social reactions summarized above, the last 20 years have seen a wealth of studies on factors that affect agent acceptance and ultimately also social reactions (for an earlier review, see Krämer et al. [2015]). Influencing factors include attributes of the agents (agent behavior and appearance) as well as attributes of the user (gender, expertise, personality). In the following, first, agent attributes such as communicative behavior, non-verbal behavior, and appearance will be discussed and, second, the role of user attributes will be reflected on.

3.3.1 Influence of Agent Attributes

3.3.1.1 Communicative Behavior of the Agent

As early as 2000, Rickenberg and Reeves showed that the (communicative) behavior of an agent is decisive. They demonstrated that whether a virtual character on a website monitored the user or ignored him/her had an impact on the user's perceived anxiety and concluded that it is not sufficient "to focus on whether or not an animated character is present. Rather the ultimate evaluation is similar to those for real people, it depends on what the character does, what it says and how it presents itself" (p. 55).

Indeed, further studies indicate that, for example, the quantity of the agent's communicative utterances is influential. It was found that an interview agent's self-disclosure (quality of utterances) led only to minor effects, while the agent's verboseness (quantity of utterances) affected both the participants' verbal behavior (with regard to word usage and intimacy of answers) and their perception of the interview [von der Pütten et al. 2011a]. Participants more often disclosed specific embarrassing situations, their biggest disappointment, and what they felt guilty about to the agent regardless of its previous self-disclosure. Moreover, when the agent was more talkative it was generally evaluated more positively, and the interview was perceived as being more pleasant. It can therefore be assumed that talkativeness led to a more favorable evaluation by the users and subsequently facilitated self-disclosure and thereby social reactions.

3.3.1.2 Non-verbal Communicative Features of Agents

Traditionally, there has been more research on non-verbal compared to verbal behavior. For example, Krämer et al. [2007] demonstrated that when the IVA Max showed self-touching gestures (e.g., touching its arm or face) this had positive effects on the experiences and evaluations of the user, whereas eyebrow-raising evoked less positive experiences and evaluations in contrast to no eyebrow-raising. Based on the notion that gestures influence the perception of competence [Maricchiolo et al. 2009], further research showed that when manipulating the extensiveness of gesture usage and the gender of a leader, a positive impact of extensive non-verbal behavior is observable. Participants were more willing to hire the virtual person who used hand and arm gestures than the more rigid person. The virtual person using gestures was also perceived as exhibiting more leadership skills and general competence than the person in the non-gesture condition [Klatt et al. 2012].

The effects and efficiency of non-verbal behavior have especially been debated in the area of pedagogical agents. Here, researchers have discussed whether non-verbal behavior is decisive for learning experiences and therefore needs to be implemented in pedagogical agents. Baylor and Ryu [2003], for example, suggest that the key advantage of embodied pedagogical agents is that human-likeness creates more positive learning experiences and provides a strong motivating effect. However, Rajan et al. [2001] demonstrated that it is first and foremost the voice that is responsible for these effects. Moreno [2003] further summarized that—in line with results that especially the voice is decisive—there is no evidence for the social cue hypothesis as it has not been shown that the mere presence of social aspects such as a human-like body leads to distinct effects. However, the cognitive guiding functions provided by vocalizations and a program's didactic concept proved to be influential. Also, recent research [Carlotto and Jaques 2016] as well as a recent meta-analysis [Schroeder and Adesope 2014] have supported the notion that voice is more important than non-verbal expressiveness.

Still, as Krämer [2017] argues, these results have to be considered with caution given the fact that the systems that had been evaluated did not (yet) include very sophisticated non-verbal behavior. It needs to be considered that non-verbal behavior is very complex: The dynamics of the movements are important, very subtle movements have distinct effects (e.g., head movements such as a head tilt), and the effects are context-dependent (e.g., a smile leads to a different effect when accompanied by a head tilt). This complexity, however, is mostly not implemented in pedagogical agents. So far, only very few pedagogical agent systems have achieved realistic and sufficiently subtle non-verbal behavior in order to administer a fair test. And indeed, when employing technology that provides realistic, dynamic non-verbal behavior, results show that non-verbal rapport behavior leads to an increase in effort and performance [Krämer et al. 2016]. Therefore, the conclusion that embodiment and non-verbal behavior are less decisive than voice is premature.

More recently, specific non-verbal behaviors of robots have been analyzed [Hoffmann 2017]. In a series of studies, it was investigated whether the positive effects of interpersonal touch are also observable with regard to robot touch. Based on media equation assumptions [Reeves and Nass 1996], an experimental study in which a robot either touched or did not touch a human interaction partner revealed positive emotional reactions toward robot-initiated touch as well as increased compliance with the robot's suggestions.

In conclusion, numerous non-verbal features have been demonstrated to influence the evaluation of the agent. As the results largely mirror those found in human–human interaction, the realm of non-verbal effects can also be seen as an area in which social effects have been corroborated.

3.3.1.3 Physical Appearance

Several studies have shown that the appearance of a virtual character matters, for example, for acceptance and evaluation of the character [van Vugt et al. 2007, Domagk 2010]. In particular, Domagk [2010] shows that when the appearance (and voice) is likeable, a pedagogical agent has more positive effects. Building on this, we compared the impact of a virtual tutor depending on its appearance as either a cartoon-like rabbit character or a realistic anthropomorphic agent [Sträfling et al. 2010]. Results show that the rabbit-like agent was not only preferred, but people exposed themselves to the tutoring session longer when the rabbit provided feedback. However, this was not related to an increase in learning performance.


Other studies, which focus more on credibility rather than learning and likeability, show that characters which are more anthropomorphic are perceived as more credible [Nowak and Rauh 2005]. More recent studies with more sophisticated appearances show that different appearances appeal to different groups of users: While students prefer non-human looking agents, elderly users specifically benefit from the social outcomes of a humanoid appearance [Straßmann 2017]. In addition, results demonstrate that attractive agents were found to be more likeable and were more persuasive. These effects, however, did not increase in a long-term relationship with an agent. In short, there is sufficient evidence to conclude that social effects also depend on aspects of physical appearance. However, given that different studies focus on different dimensions of appearance (e.g., realism, anthropomorphism, attractiveness/likeability), it is still difficult to conclude which physical features are decisive. A first attempt at systematizing the area is presented by Straßmann and Krämer [2017].

3.3.2 Influence of User Attributes

Various user attributes have been considered as potential predictors of the reactions toward socially interactive agents, among them gender, age, computer literacy, and personality. An overview is given in Krämer et al. [2015]. Here, some examples will be summarized in order to demonstrate that these characteristics affect how users (a) experience the interactions and (b) perceive and evaluate artificial entities.

3.3.2.1 Gender

Krämer et al. [2010] revealed in their re-analysis of earlier studies that men and women have different preferences with regard to IVAs. In fact, compared to the effects of age and computer literacy, the influence of gender was more important. In one study, women were found to be more nervous during the interaction with the agent, which contradicts the vision that IVAs will facilitate human–computer interaction for these kinds of users. The data further suggest that female users' interest and acceptance can be increased when non-verbal behaviors are implemented (here: self-touching gestures) and when the agent frequently smiles. Interestingly, Krämer et al. [2016] demonstrate that, with regard to pedagogical agents, non-verbal behaviors communicating rapport were especially beneficial when displayed by agents of the opposite sex. In sum, it can be concluded that women especially benefit from increased non-verbal behavior of the agent, in line with the finding that women are more sensitive to non-verbal behaviors [Hall 1984].

3.3.2.2 Age

It is important to analyze older users' reactions as more and more technology is developed to enable ambient assisted living for seniors. Multiple virtual agent applications are also part of this development (see, e.g., Kopp et al. [2018]). Although the overall goal is that an IVA leads to a facilitation of human–technology interactions, studies suggest that older persons are more nervous when interacting with an IVA than younger persons [Krämer et al. 2007]. Further results show that empathic non-verbal behavior can be helpful [Hosseinpanah et al. 2018]. Interestingly, an agent that behaves in a dominant way leads to more persuasion when interacting with elderly users [Rosenthal-von der Pütten et al. 2019a, 2019b]. With regard to appearance variables, older people seem to prefer more humanoid appearances [Straßmann and Krämer 2017, 2018].

3.3.2.3 Computer Literacy

Computer novices proved to be more nervous when interacting with an IVA than other users [Krämer et al. 2010]. This is in line with previous findings that computer laypeople do not benefit from IVAs in the way that is typically hoped [Krämer et al. 2002]. Additional research will need to demonstrate under which conditions non-computer-literate users can be supported in their interactions with IVAs.

3.3.2.4 Personality

Personality traits (for example, the so-called Big Five personality traits: extraversion, neuroticism, conscientiousness, openness, and agreeableness) have long been discussed as potential influencing factors in human–agent interactions. However, the Big Five themselves seem to have only limited explanatory value—results of a study with the Rapport Agent show that participants' personality traits affected their subjective feelings after the interaction, as well as their evaluation of the agent and their actual behavior [von der Pütten et al. 2010b]. Of the various personality traits, those that relate to withstanding behavioral patterns in social contact (agreeableness, extraversion, approach avoidance, self-monitoring sensitivity, shyness, public self-consciousness) were found to be predictive of the positive and negative feelings participants reported after the interaction, the evaluation of the agent, and the number of words they used during the conversation. However, other personality traits (e.g., openness, neuroticism) as well as gender and age did not affect the evaluation. For instance, the higher one's rating on extraversion and public self-consciousness, the more words were used. Furthermore, the shyer people are, the more negatively they evaluate the agent, whereas agreeableness increases positive feelings after the interaction.


In conclusion, several personality characteristics were shown to influence how people experience interactions with artificial entities. Whether social reactions are also directly influenced by user attributes has not yet been analyzed in detail. The fact that users' experiences are different depending on their attributes does not necessarily mean that social reactions are altered—especially when these are assumed to be unconscious reactions that are deeply rooted in humanity's social nature.

In parallel to these empirical studies, ethical guidelines have been developed. While researchers in psychology are first and foremost interested in understanding the mechanisms, they usually refrain from deriving normative conclusions in the sense of recommending which effects should be avoided and which should be strived for. Psychological research is of course prone to—based on empirical results—describe potential risks and chances but would usually not use this to prescribe normative guidelines. On the contrary, ethics emphasizes the fact that "[t]he things we call 'technologies' are ways of building order in our world. Many technical devices and systems important in everyday life contain possibilities for many different ways of ordering human activity. Consciously or unconsciously, deliberately or inadvertently societies choose structures for technologies that influence how people are going to work, communicate, travel, consume and so forth over a very long time. […] In that sense technological innovations are similar to legislative acts or political foundings that establish a framework for public order that will endure over many generations." [Winner 1980, p. 128f]. Every invention, thus, unfolds its specific normativity. This is why research on ethics is needed to accompany the development of such guidelines and to help implement structures that can assist when developers want to receive ethical advice. Recently, positive experiences have been gathered in the so-called FAT (fairness, accountability, and transparency) community. Here, AI researchers have used ethical expertise to develop guidelines on how to ensure fairness, accountability, and transparency in machine learning (https://www.fatml.org/). Similarly, the IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems has suggested general principles of ethically aligned design. In addition, edited books give an overview of practically usable ethical aspects in the sense of value sensitive design [van den Hoven et al. 2015].

3.4 Similarities and Differences in IVAs and SRs

With increasing progress in the development of social robots, the question of whether IVAs and robots as physically embodied characters differ in their effects has become prevalent. The emergence of social robots is accompanied by a larger public debate which might lead to the conclusion that humans are more fascinated but at the same time more appalled and—potentially—frightened by social robots.


Therefore, it is important to ask what the differences and similarities regarding the effects of virtual agents and robots are. Surprisingly, however, there is not much controlled research on this question that directly compares robots and agents. Most studies have focused on obvious characteristics of robots versus agents: Robots are generally taller than characters on a screen, which might lead to more favorable perceptions of the robot [Powers et al. 2007]. Also, the "physical proximity" [Powers et al. 2007] might play a role. Whereas virtual characters are only graphical two-dimensional (2D) or three-dimensional (3D) representations on a monitor, robots have a material embodiment. Accordingly, they are able to physically touch humans, carry things, and move on the ground. Therefore, it has been assumed that robots are perceived more as autonomous "living" systems than virtual characters, meaning that more social presence might be sensed when facing them [Jung and Lee 2004]. Although some studies indeed demonstrate that robots elicit stronger experiences and social behaviors than agents [Bartneck 2002, Kiesler et al. 2008], this has not been shown consistently [Yamato et al. 2001, Wainer et al. 2007]. Hoffmann and Krämer [2013] identified that one reason for these inconsistencies might be the lack of comparability due to different operationalizations of the embodiments (e.g., different robots, animation, video recordings), dependent variables, and interaction scenarios. In particular, findings by Shinozawa et al. [2005] suggest that the interaction scenario and task might be decisive.

Therefore, Hoffmann and Krämer [2013] conducted a 2 × 2 between-subjects experiment in which embodiment (rabbit-shaped Nabaztag robot vs. virtual version) and the interaction scenario (cognitive task vs. persuasive conversation) were systematically varied. For the purpose of the study, a 3D virtual character was designed that resembles the rabbit-shaped robot that was used and displays the same voice and behavior. Two different interaction scenarios were created: a persuasive conversation about health habits (cf. Kiesler et al. [2008]), and a task-oriented scenario in which participants had to solve a Towers of Hanoi puzzle, which was set on the table, under the guidance of the artificial counterpart. As dependent variables, subjective evaluation criteria (affective state, perceived social presence, attractiveness, and general evaluation of the interaction) and objective measures were assessed. In the conversational scenario, persuasion was analyzed as an objective measure by means of the amount of healthy food participants ate after the interaction. In the task-oriented setting, the number of moves needed to solve the Towers of Hanoi task was counted as a performance measure.

However, contrary to the hypotheses, which assumed that more social presence should be experienced in the presence of the physical robot, no differences emerged. Moreover, no differences caused by the embodiment were observable with respect to the participants' affective state, acceptance of the artificial entity, performance, or persuasion. Still, two main effects of embodiment emerged: First, participants perceived the robot as more competent with regard to the fulfillment of tasks. The second main effect of embodiment occurred for the factor of control. Participants stated that they perceived more control during the interaction with the screen animation than with the robot. Additionally, a significant interaction between embodiment and interaction scenario with regard to the fulfillment of tasks emerged. In line with the findings of Shinozawa et al. [2005], task-oriented attractiveness was higher for the robot in the Towers of Hanoi condition, whereas it was higher for the screen animation in the conversational scenario.

In summary, the results of this study suggest that virtual characters can be used instead of more expensive robots when the aim of the application is of a persuasive nature. For scenarios in which physical manipulation is necessary, at least on the side of the user, robots seem to be beneficial because they share the space of reference. Most importantly, the study underlines the importance of considering different contexts (i.e., task or interaction scenario) while analyzing the impact of different embodiments. Whether social effects (like persuasion by an artificial companion) can be observed will therefore depend not on the form of embodiment alone but also on the appropriateness of the specific embodiment for the specific task or scenario (see Krämer et al. [2015]).

There are other studies—focusing only on robots—which give evidence that robots elicit specific social reactions. For example, Horstmann et al. [2018] investigated people's reactions when they are confronted with a robot that objects to being switched off. When the robot voiced an objection, a number of people let the robot remain switched on. When asked why they decided to do so, the majority of these participants said they felt sorry for the robot since it had told them about its fear of the darkness and that they did not want the robot to be scared ("He asked me to leave him on, because otherwise he would be scared. Fear is a strong emotion and as a good human you do not want to bring anything or anyone in the position to experience fear," male, 21 years). Almost as many participants explained that they did not want to act against the robot's will, which was expressed through the robot's objection to being switched off ("it was fun to interact with him, therefore I would have felt guilty, when I would have done something, what affects him, against his will," male, 21 years). In addition, participants mentioned being surprised or being afraid of doing something wrong. A second condition manipulated whether the style of the interaction was either social (mimicking human behavior) or functional (displaying machinelike behavior). After the functional interaction, people evaluated the robot as less likeable, which in turn led to a reduced stress experience after the switching-off situation. However, individuals hesitated longest when they had experienced a functional interaction in combination with an objecting robot. This unexpected result might be due to the fact that the impression people had formed based on the task-focused behavior of the robot conflicted with the emotional nature of the objection. In sum, the results show that the robot's human-mimicking behavior had a surprisingly strong impact on the participants. Instead of dismissing the objection to being switched off as weird for a machine, they were quite affected emotionally.

Similarly, Rosenthal-von der Pütten et al. [2014] show that humans are emotionally affected when they see that a robot is being "tortured." Here, the reactions are also documented on a neural level: In a functional magnetic resonance imaging (fMRI) study, participants were presented with videos showing a human, the toy dinosaur robot "Pleo," and an inanimate object (a green box) being treated in either an affectionate (e.g., caressing the skin) or a violent way (e.g., hitting, being choked). Self-reported emotional states and functional imaging data revealed that participants indeed reacted emotionally when seeing the affectionate and violent videos. Overall, the patterns were similar for robot and human and differed from people's reactions toward watching the box being caressed or tortured. While no different neural activation patterns emerged for the affectionate interaction toward both the robot and the human, we still found differences in neural activity when comparing only the videos showing abusive behavior, which indicates that participants experienced more emotional distress and showed negative empathetic concern for the human in the abuse condition. This indicates that although robots evoke emotional reactions, there is still a difference from the—more intense—reactions toward humans.

Additional studies investigated whether robots trigger different kinds of reactions depending on their human likeness. Here, according to the uncanny valley hypothesis, humans are expected to prefer anthropomorphic appearance but to reject the artificial entity when it becomes too human-like (but is not yet perfectly human). A study with 40 standardized pictures of robots with varying human-likeness, however, indicated that the relation of robot characteristics and evaluation is not best explained by a cubic function (which would be closest to the uncanny valley thesis) but rather by a linear function—indicating that the evaluation gets more positive with increasing human-likeness [Rosenthal-von der Pütten 2014]. An additional study employing fMRI demonstrates that, on a neural level, areas that encode a linear human-likeness continuum (temporoparietal junction) can be differentiated from areas that encode non-linear representations and a human–nonhuman distinction (dorsomedial prefrontal cortex and fusiform gyrus) [Rosenthal-von der Pütten et al. 2019a]. If uncanny valley reactions happen, in the sense of a selective dislike of highly humanlike agents, they are based on non-linear value-coding in the ventromedial prefrontal cortex, a key component of the brain's reward system. In consequence, it can be concluded that a basic principle known from sensory coding—neural feature selectivity from linear–nonlinear transformation—may also underlie human responses to artificial social partners.

Given these results regarding the social effects of robots, it needs to be concluded that, from the perspective of ethics, social robots might be more problematic than IVAs in the sense that robots offer physical encounter and interaction. Physical interactions inscribe specific patterns of understanding oneself and the world. This is, on the one hand, due to their physical embodiment and the corresponding ability to actually act in a real environment. On the other hand, the results on social reactions suggest that they might be more powerful in eliciting emotional reactions—which might make humans more prone to manipulation. Hence, human beings have to deal with the question of what it means to cooperate with a robot in comparison with the cooperation with other human beings or animals. The answer to the question depends very much on how human beings will design the robots—which is, not least, a normative decision.
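Returning briefly to the cubic-versus-linear comparison reported above, the following sketch illustrates the underlying model-comparison logic. It is a minimal, hypothetical example: the human-likeness scores, ratings, and noise level are simulated for demonstration and do not reproduce the original study's data or analysis.

```python
# Hypothetical illustration: compare a linear and a cubic fit of likability
# ratings against human-likeness scores, mirroring the kind of curve
# comparison used to test the uncanny valley hypothesis.
import numpy as np

rng = np.random.default_rng(1)
humanlikeness = np.linspace(0, 100, 40)                   # e.g., 40 robot pictures
ratings = 0.04 * humanlikeness + rng.normal(0, 0.4, 40)   # roughly linear trend

def adjusted_r2(y, y_hat, n_params):
    # Adjusted R^2 penalizes the extra parameters of the cubic model.
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    n = len(y)
    return 1 - (1 - r2) * (n - 1) / (n - n_params - 1)

for degree in (1, 3):  # linear vs. cubic polynomial
    coeffs = np.polyfit(humanlikeness, ratings, degree)
    fitted = np.polyval(coeffs, humanlikeness)
    print("degree", degree, "adjusted R^2:", round(adjusted_r2(ratings, fitted, degree), 3))
# If the cubic model barely improves on the linear one, the extra curvature
# (and thus an uncanny-valley-shaped dip) is not supported by the data.
```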

3.5 Current Challenges

Current challenges include the necessity of observing how social reactions change when human–machine interaction becomes increasingly "natural." Although current systems—even when built on machine learning with millions of user utterances (see Siri or Alexa)—are not yet comparable with human–human communication, further improvements in the near future are likely. Given that people already react to basic social cues, it is not yet clear whether social reactions will increase in line with better dialogue capabilities. Especially from an ethical perspective, however, it has to be discussed whether the observable social reactions actually entail "full" sociality. Even if empirical reactions are the same as those observable in human–human interactions, it is questionable whether this indeed indicates that interactions with technology can be viewed as being as social as interactions with fellow human beings or even animals.

Another challenge pertains to the question of how to further improve dialogue with machines. Here, psychology should contribute knowledge as well as methods. How cooperation in the area might yield synergies is described in Kopp and Krämer [2021].

When systems become increasingly intelligent it is also important to ask what people know about and expect from human-like machines. At the moment, it is clearly visible that people derive their knowledge mainly from media. Based on uncertainty reduction theory it was demonstrated that people's experiences with robots in the media lead to high expectations regarding the skills of robots, which in turn increase people's general expectations that social robots will be part of society as well as of their personal lives. Furthermore, knowledge of negatively perceived fictional robots increases negative expectancies of robots becoming a threat to humans, while technical affinity reduces general robot anxiety. In sum, it can be concluded that especially fictional media material which depicts mighty robots that compete against or threaten humans leads to skepticism toward artificial intelligence, machine learning, and digitalization.

Based on these results and further evidence which indicates that the average human user is not able to grasp what intelligent algorithms and social robots can and cannot do, let alone how they basically function [DeVito et al. 2018, Krämer et al. 2019], it seems necessary to increase the computational literacy of the general public. In order to enable truly informed consent to use socially interactive agents, including their collecting and processing of personal data, people need to be taught the basic steps of intelligent algorithms. First studies that try to describe the mental models and concepts that people have developed suggest, however, that there is a long way to go to achieve at least a basic understanding that enables informed decisions about whether to take or avoid the risks of using an intelligent interlocutor [Ngo et al. 2020]. Still, from an ethical and also from an empirical point of view, it needs to be asked whether this knowledge will indeed prevent us from misleading perceptions and presuppositions.

This area also poses ethical questions: One can ask whether people should be better informed about the artificial nature of the interaction partner and the algorithms. It might be doubted that, for example, people can protect themselves better against unconscious social reactions if they understand more about the technology's functioning. Also, for other technology, it is well known that people use technology without knowing in detail how it actually functions. There is a way of familiarizing oneself with a technology without knowing its details—and without needing to make the effort to learn something new or to inform oneself. Becoming familiar with a technology means establishing habits and routines that allow interaction with the technology. Hans Blumenberg, a German philosopher, claimed that this acquaintance is one side of the coin; the other is a loss of reflection and even "Sinnverzicht" [renunciation of meaning].

3.6 Future Directions

From a psychological point of view, the most important question for future research is to observe whether social reactions become stronger when machines become increasingly intelligent and show improved dialogue abilities (see above). One methodological strategy that will be vital in these investigations is long-term measurement. Also, it needs to be asked whether people will show fewer social reactions when they understand better what the nature of intelligent algorithms and machines is. Both aspects need to be looked at in (experimental) field studies.

From an anthropological and ethical point of view it will be indispensable to investigate the deeper meaning of "intelligence" when attributing it to human beings as a precondition of agency, decision, responsibility, and so on. On the one hand, we have to take into account that not all human beings have these capacities and still count as a vis-à-vis in human interaction. On the other hand, we should acknowledge that intelligence with regard to machines differs from the intelligence of human beings. The understanding of this difference need not be based on detailed knowledge of algorithms, machine learning, or Bayesian networks; the ontological difference might be illustrated by other, "simpler" means.

Most of the studies conducted so far have been experimental laboratory studies that tried to increase internal validity. However, in order to learn whether social effects of artificial entities are actually relevant in humans' everyday encounters with agents or robots, field studies and long-term studies also have to be conducted, which provide higher external validity. Field studies with an open, exploratory approach are also important in order to be able to identify new aspects of human–agent or human–robot interaction, which can then be checked in controlled laboratory studies. Contrary to mainly quantitative laboratory studies, field studies can also be conceptualized in a way that makes inter-individual differences more visible: Here, one study described the emergence of huge inter-individual differences as one of the most important results of a long-term study in which six elderly participants interacted with a robot serving as a health advisor [von der Pütten et al. 2011b]. Following a multi-methodological approach, the continuous quantitative and qualitative description of user behavior on a very fine-grained level gave insights into when and how people interacted with the robot companion. Post-trial semi-structured interviews explored how the users perceived the companion and revealed their attitudes. Based on this large dataset, it was found that users are willing to start interactions and even conversation with a robot even though its perceptive and expressive capabilities are limited. Although aware of the fact that they are interacting with an artificial being, some of the users built relationships with the robot while others treated the robot as a tool. Some people tended to like the robot as long as it was helping them and they had a feeling of being in control. Others seemed to integrate the companion into their life even though it did not always work properly and was perceived as being of limited use. Thus, the fine-grained analysis showed a particular prevalence of idiosyncratic reactions.

Therefore, future research should extend field and long-term studies in order to identify patterns and factors that determine the conditions under which people show social reactions and whether they ultimately even develop relationships with artificial entities. As an experimental factor, advance information on the functioning of the machine can be provided to only half of the participants. Also, future work needs to foster more interdisciplinary research. In the research realm described here, a combination of psychology, computer science, ethics, and law is especially relevant [Krämer et al. 2019].

3.7 Summary

As a concluding message we would like to repeat that numerous studies demonstrate that people react socially toward artificial entities—as soon as these display social cues. It still has to be analyzed in greater depth whether the number of available social cues affects the degree of social reactions and whether social robots indeed lead to larger effects compared to IVAs. Independent of these questions, the majority of reactions seem to happen unconsciously, and this suggests that reactions are based on human nature and humans' unique sociality. This is in line with assumptions that humans are indeed "free monadic radicals" who are persistently searching for a social partner to interact and bond with [Kappas 2005].

Due to the unconsciousness and probable inevitability of the reactions, it has to be asked from an ethical point of view whether people have to be protected against manipulations based on social influence—potentially by keeping in mind that it is human design which affects people and may manipulate them consciously or inadvertently. Also, it needs to be asked whether computational literacy in the sense of more knowledge about intelligent algorithms and the basic functioning of intelligent conversing machines will help to reduce social effects. This, however, can just as well be doubted as it can be doubted that trying to impose more knowledge is desirable from an ethical point of view.

References

C. Bartneck. 2002. eMuu—An Embodied Emotional Character for the Ambient Intelligent Home. Doctoral dissertation. Eindhoven University of Technology. UC Research Repository. http://www.bartneck.de/publications/2002/eMuu/bartneckPHDThesis2002.pdf. A. L. Baylor and J. Ryu. 2003. The effects of image and animation in enhancing pedagogical agent persona. J. Educ. Comput. Res. 28, 4, 373–394. DOI: https://doi.org/10.2190/V0WQNWGN-JB54-FAT4. J. Blascovich. 2002. A theoretical model of social influence for increasing the utility of collaborative virtual environments. In Proceedings of the 4th International Conference on Collaborative Virtual Environments. Bonn, Germany, September 30–October 2, 25–30. DOI: https://doi.org/10.1145/571878.571883.


T. Carlotto and P. A. Jaques. 2016. The effects of animated pedagogical agents in an English-as-a-foreign-language learning environment. Int. J. Hum.-Comput. Stud. 95, 15–26. DOI: https://doi.org/10.1016/j.ijhcs.2016.06.001. M. Christen. 2004. Schuldige Maschinen? Autonome Systeme als Herausforderung für das Konzept der Verantwortung [Guilty machines? Autonomous systems as a challenge for the concept of responsibility]. In L. Honnefelder and C. Streffer (Eds.), Jahrbuch für Wissenschaft und Ethik, Vol. 9. Walter de Gruyter, Berlin, 163–191. F. de Waal. 2019. Mama’s Last Hug: Animal Emotions and What They Tell Us about Ourselves. Norton & Company, New York. M. A. DeVito, J. Birnholtz, J. T. Hancock, M. French, and S. Liu. 2018. How people form folk theories of social media feeds and what it means for how we study self-presentation. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, New York, 311–314. DOI: https://doi.org/10.1145/3173574.3173694. S. Domagk. 2010. Do pedagogical agents facilitate learner motivation and learning outcomes? The role of the appeal of agent’s appearance and voice. J. Media Psychol. 22, 2, 84–97. DOI: https://doi.org/10.1027/1864-1105/a000011. B. J. Fogg and C. Nass. 1997. Silicon sycophants: The effects of computers that flatter. Int. J. Hum.-Comput. Stud. 46, 5, 551–561. DOI: https://doi.org/10.1006/ijhc.1996.0104. J. Fox, S. J. Ahn, J. H. Janssen, L. Yeykelis, K. Y. Segovia, and J. N. Bailenson. 2015. Avatars versus agents: A meta-analysis quantifying the effects of agency on social influence. Hum.-Comput. Interact. 30, 401–432. DOI: https://doi.org/10.1080/07370024.2014.921494. J. Gratch, A. Okhmatovskaia, F. Lamothe, S. Marsella, M. Morales, R. J. van der Werf, and L.-P. Morency. 2006. Virtual rapport. In J. Gratch, M. Young, R. Aylett, D. Ballin, and P. Olivier (Eds.), Intelligent Virtual Agents, Vol. 4133. Lecture Notes in Computer Science (6th. ed.). Springer, Berlin, 14–27. DOI: https://doi.org/10.1007/11821830_2. J. A. Hall. 1984. Nonverbal Sex Differences. Communication Accuracy and Expressive Style. Johns Hopkins University Press, Baltimore, MD. L. Hoffmann. 2017. That Robot Touch that Means So Much. On the Psychological Effects of Human–Robot Touch. Ph.D. thesis. University Duisburg-Essen. https://duepublico2.unidue.de/receive/duepublico_mods_00043331. L. Hoffmann and N. C. Krämer. 2013. Investigating the effects of physical and virtual embodiment in task-oriented and conversational contexts. Int. J. Hum.-Comput. Stud. 71, 7–8, 763–774. DOI: https://doi.org/10.1016/j.ijhcs.2013.04.007. L. Hoffmann, N. C. Krämer, A. Lam-Chi, and S. Kopp. 2009. Media equation revisited: Do users show polite reactions towards an embodied agent? In Z. Ruttkay, M. Kipp, A. Nijholt, and H. H. Vilhjálmsson (Eds.), Intelligent Virtual Agents, Vol. 5773. Lecture Notes in Computer Science (9th. ed.). Springer, Berlin, 159–165. DOI: https://doi.org/ 10.1007/978-3-642-04380-2_19. A. C. Horstmann, N. Bock, E. Linhuber, J. M. Szczuka, C. Straßmann, and N. C. Krämer. 2018. Do a robot’s social skills and its objection discourage interactants from switching the robot off? PLoS One 13, 7, e0201581. DOI: https://doi.org/10.1371/journal. pone.0201581.


A. Hosseinpanah, N. C. Krämer, and C. Straßmann. 2018. Empathy for everyone? The effect of age when evaluating a virtual agent. In Proceedings of the 6th International Conference on Human–Agent Interaction. Southampton, UK, December, 184–190. DOI: https://doi.org/ 10.1145/3284432.3284442. Y. Jung and K. M. Lee. 2004. Effects of physical embodiment on social presence of social robots. In Proceedings of PRESENCE. Valencia, Spain, October 13–15, 80–87. A. Kappas. July, 2005. My happy vacuum cleaner. Paper presentation. ISRE General Meeting, Symposium on Artificial Emotions, Bari, Italy. S. B. Kiesler and L. Sproull. 1997. Social responses to “social” computers. In B. Friedman (Ed.), Human Values and the Design of Technology. Cambridge University Press/CSLI, Cambridge, 191–200. S. Kiesler, A. Powers, S. R. Fussell, and C. Torrey. 2008. Anthropomorphic interactions with a robot and robot-like agent. Soc. Cogn. 26, 2, 169–181. DOI: https://doi.org/10.1521/soco. 2008.26.2.169. Y. Kim and S. S. Sundar. 2012. Anthropomorphism of computers: Is it mindful or mindless? Comput. Hum. Behav. 28, 1, 241–250. DOI: https://doi.org/10.1016/j.chb.2011. 09.006. J. Klatt, N. Haferkamp, L. Tetzlaff, and N. C. Krämer. June, 2012. How to be…a leader – Examining the impact of gender and nonverbal behavior on the perception of leaders. Paper presentation. International Communication Association 62nd Annual Meeting, Phoenix, AZ. S. Kopp and N. Krämer. 2021. Revisiting human–agent communication: The importance of incremental co-construction and understanding mental states. Front. Psychol.: HumanMedia Interaction 12, 1–15. DOI: https://doi.org/10.3389/fpsyg.2021.580955. S. Kopp, M. Brandt, H. Buschmeier, K. Cyra, F. Freigang, N. Krämer, F. Kummert, C. Opfermann, K. Pitsch, L. Schillingmann, C. Straßmann, E. Wall, and R. Yaghoubzadeh. 2018. Conversational assistants for elderly users – The importance of socially cooperative dialogue. In E. André, T. Bickmore, S. Vrochidis, and L. Wanner (Eds.), CEUR Workshop Proceedings, Vol. 2338. In Proceedings of the AAMAS Workshop on Intelligent Conversation Agents in Home and Geriatric Care Applications co-located with the Federated AI Meeting. RWTH, 10–17. N. C. Krämer. 2005. Social communicative effects of a virtual program guide. In T. Panayiotopoulos, J. Gratch, R. Aylett, D. Ballin, P. Olivier, and T. Rist (Eds.), Intelligent Virtual Agents, Vol. 3661. Lecture Notes in Computer Science (5th. ed.). Springer, Berlin, 442–543. DOI: https://doi.org/10.1007/11550617_37. N. C. Krämer. 2017. The immersive power of social interaction. In D. Liu, C. Dede, R. Huang, and J. Richards (Eds.), Smart Computing and Intelligence, Vol. 1. Virtual, Augmented, and Mixed Realities in Education. Springer, Berlin, 55–70. DOI: https://doi. org/10.1007/978-981-10-5490-7_4. N. C. Krämer, S. Rüggenberg, C. Meyer zu Kniendorf, and G. Bente. 2002. Interface for everyone? Possibilities for adapting anthropomorphic interface agents to different user groups. In M. Herczeg, W. Prinz, and H. Oberquelle (Eds.), Reports of the German Chapter


of the ACM, Vol. 56. Human and Computer. Teubner, Stuttgart, 125–134. DOI: http://doi. org/10.1007/978-3-322-89884-5_13. N. C. Krämer, G. Bente, and J. Piesk. 2003. The ghost in the machine. The influence of embodied conversational agents on user expectations and user behaviour in a TV/VCR application. In G. Bieber and T. Kirste (Eds.), IMC Workshop, Assistance, Mobility, Applications. IRB Verlag, Stuttgart, 121–128. N. C. Krämer, N. Simons, and S. Kopp. 2007. The effects of an embodied conversational agent’s nonverbal behaviour on user’s evaluation and behavioural mimicry. In C. Pelachaud, J.-C. Martin, E. André, G. Chollet, K. Karpouzis, and D. Pelé (Eds.), Intelligent Virtual Agents, Vol. 4722. Lecture Notes in Computer Science (7th. ed.). Springer, Berlin, 238–251. DOI: http://doi.org/10.1007/978-3-540-74997-4_22. N. C. Krämer, L. Hoffmann, and S. Kopp. 2010. Know your users! Empirical results for tailoring an agent’s nonverbal behaviour to different user groups. In J. Allbeck, N. Badler, T. Bickmore, C. Pelachaud, and A. Safonova (Eds.), Intelligent Virtual Agents, Vol. 6356. Lecture Notes in Computer Science (10th. ed.). Springer, Berlin, 468–474. DOI: http://doi. org/10.1007/978-3-642-15892-6_50. N. C. Krämer, S. Kopp, C. Becker-Asano, and N. Sommer. 2013. Smile and the world will smile with you – The effects of a virtual agent’s smile on users’ evaluation and behaviour. Int. J. Hum.-Comput. Stud. 71, 3, 335–349. DOI: http://doi.org/10.1016/j.ijhcs. 2012.09.006. N. C. Krämer, A. M. Rosenthal-von der Pütten, and L. Hoffmann. 2015. Social effects of virtual and robot companions. In S. S. Sundar (Ed.), The Handbook of the Psychology of Communication Technology (1st. ed.). John Wiley & Sons, Hoboken (New Jersey), 137–159. DOI: https://doi.org/10.1002/9781118426456.ch6. N. C. Krämer, B. Karacora, G. Lucas, M. Dehghani, G. Rüther, and J. Gratch. 2016. Closing the gender gap in STEM with friendly male instructors? On the effects of rapport behaviour and gender of a virtual agent in an instructional interaction. Comput. Educ. 99, 1–13. DOI: https://doi.org/10.1016/j.compedu.2016.04.002. N. C. Krämer, A. Artelt, C. Geminn, B. Hammer, S. Kopp, A. Manzeschke, A. Rossnagel, P. Slawik, J. Szczuka, L. Varonina, and C. Weber. 2019. KI-basierte Sprachassistenten im Alltag: Forschungsbedarf aus informatischer, psychologischer, ethischer und rechtlicher Sicht. [AI-based speech assistants in everyday-life: Research gaps from a computer-science, psychology, ethical and law perspective]. Policy paper. Universität Duisburg-Essen. DOI: https://doi.org/10.17185/duepublico/70571. E. J. Langer. 1989. Mindfulness. Addison-Wesley, Boston, MA. M. R. Leary. 1995. Self-presentation: Impression Management and Interpersonal Behavior. Brown & Benchmark Publishers, Madison, WI. E.-J. Lee. 2010. What triggers social responses to flattering computers? Experimental tests of anthropomorphism and mindlessness explanations. Commun. Res. 37, 2, 191–214. DOI: https://doi.org/10.1177/0093650209356389. P. Lin, R. Jenkins, and K. Abney. 2017. Robot Ethics 2.0. From Autonomous Cars to Artificial Intelligence. Oxford University Press, Oxford.


A. Manzeschke, G. Assadi, F. Karsch, and W. Viehöver. 2016. Functional emotions and emotional functionality. About the new role of emotions and emotionality in humantechnology-interaction. In A. Manzeschke and F. Karsch (Eds.), Robots, computers and hybrids. What happens between humans and machines? Nomos, 107–130. DOI: https://doi. org/10.5771/9783845272931-107. F. Maricchiolo, A. Gnisci, M. Bonaiuto, and G. Ficca. 2009. Effects of different types of hand gestures in persuasive speech on receivers’ evaluations. Lang. Cogn. Processes. 24, 2, 239–266. DOI: https://doi.org/10.1080/01690960802159929. C. Misselhorn. 2018. Artificial morality. Concepts, issues and challenges. Society 55, 161–169. DOI: https://doi.org/10.1007/s12115-018-0229-y. R. Moreno. 2003. The role of software agents in multimedia learning environments: When do they help students reduce cognitive load? Paper presentation. European Association for Research on Learning and Instruction Annual Conference, Padova, Italy. J. Morkes, H. K. Kernal, and C. Nass. 1999. Effects of humor in task-oriented human– computer interaction and computer-mediated communication: A direct test of SRCT theory. Hum.-Comput. Int. 14, 4, 395–435. DOI: https://doi.org/10.1207/ S15327051HCI1404_2. C. Nass and Y. Moon. 2000. Machines and mindlessness: Social responses to computers. J. Soc. Issues 56, 1, 81–103. DOI: https://doi.org/10.1111/0022-4537.00153. C. Nass and C. Yen. 2012. The Man Who Lied to His Laptop. What We Can Learn About Ourselves from Our Machines. Penguin, London. C. Nass, J. Steuer, and E. R. Tauber. 1994. Computers are social actors. In Proceedings of the SIGCHI Conference of Human Factors in Computing Systems. Boston, MA, 72–78. DOI: https://doi.org/10.1145/191666.191703. C. Nass, B. J. Fogg, and Y. Moon. 1996. Can computers be teammates? Int. J. Hum. Comput. Stud. 45, 6, 669–678. DOI: https://doi.org/10.1006/ijhc.1996.0073. C. Nass, Y. Moon, J. Morkes, E.-Y. Kim, and B. J. Fogg. 1997. Computers are social actors: A review of current research. In B. Friedman (Ed.), Human Values and the Design of Computer Technology. CSLI Publications, Standford, CT, 137–162. C. Nass, Y. Moon, and P. Carney. 1999. Are people polite to computers? Responses to computer-based interviewing systems. J. Appl. Soc. Psychol. 29, 5, 1093–1110. DOI: https://doi.org/10.1111/j.1559-1816.1999.tb00142.x. T. Ngo, J. Kunkel, and J. Ziegler. 2020. The Netflix experience: Mental models of recommender systems exemplified by Netflix. In Proceedings of the ACM UMAP, 28th Conference on User Modeling, Adaptation and Personalization. ACM, New York, 183–191, https://doi. org/10.1145/3340631.3394841. K. L. Nowak and C. Rauh. 2005. The influence of the avatar on online perceptions of anthropomorphism, androgyny, credibility, homophily, and attraction. J. Comput. Mediat. Commun. 11, 1, 153–178. DOI: https://doi.org/10.1111/j.1083-6101.2006.tb00308.x. S. Parise, S. Kiesler, L. Sproull, and K. Waters. 1999. Cooperating with life-like interface agents. Comput. Hum. Behav. 15, 2, 123–142. DOI: https://doi.org/10.1016/S0747-5632(98) 00035-1.


H. Plessner. 1970. Philosophische Anthropologie. Lachen und Weinen. Das Lächeln. Anthropologie der Sinne [Philosophical Anthropology. Laughing and Crying. Anthropology of the Senses]. Fischer, Frankfurt. A. Powers, S. Kiesler, S. Fussel, and C. Torrey. 2007. Comparing a computer agent with a humanoid robot. In Proceedings of the 2nd ACM/IEEE International Conference on Human– Robot Interaction. Arlington, VA, 145–152. DOI: https://doi.org/10.1145/1228716.1228736. S. Rajan, S. D. Craig, B. Gholson, N. K. Person, A. C. Graesser, and The Tutoring Research Group. 2001. AutoTutor: Incorporating back-channel feedback and other human-like conversational behaviors into an intelligent tutoring system. Int. J. Speech Technol. 4, 117–126. DOI: https://doi.org/10.1023/A:1017319110294. B. Reeves and C. I. Nass. 1996. The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. CSLI Publications, Standford, CT. R. Rickenberg and B. Reeves. 2000. The effects of animated characters on anxiety, task performance, and evaluations of user interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. The Hague, Netherlands, 49–56. DOI: https://doi. org/10.1145/332040.332406. A. M. Rosenthal-von der Pütten, F. P. Schulte, S. C. Eimler, S. Sobieraj, L. Hoffmann, S. Maderwald, M. Brand, and N. C. Krämer. 2014. Investigations on empathy towards humans and robots using fMRI. Comput. Hum. Behav. 33, 201–212. DOI: https://doi.org/ 10.1016/j.chb.2014.01.004. A. Rosenthal-von der Pütten, N. C. Krämer, S. Maderwald, M. Brand, and F. Grabenhorst. 2019a. Neural mechanisms for accepting and rejecting artificial social partners in the uncanny valley. J. Neurosci. 39, 33, 6555–6570. DOI: https://doi.org/10.1523/JNEUROSCI. 2956-18.2019. A. M. Rosenthal-von der Pütten, C. Straßmann, R. Yaghoubzadeh, S. Kopp, and N. C. Krämer, 2019b. Dominant and submissive nonverbal behavior of virtual agents and its effects on evaluation and negotiation outcome in different age groups. Comput. Hum. Behav. 90, 397–409. DOI: https://doi.org/10.1016/j.chb.2018.08.047. N. L. Schroeder and O. O. Adesope. 2014. A systematic review of pedagogical agents’ persona, motivation, and cognitive load implications for learners. J. Res. Technol. Educ. 46, 3, 229–251. DOI: https://doi.org/10.1080/15391523.2014.888265. N. Shechtman and L. M. Horowitz. 2003. Media inequality in conversation: How people behave differently when interacting with computers and people. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Fort Lauderdale, FL, 281–288. DOI: https://doi.org/10.1145/642611.642661. K. Shinozawa, F. Naya, J. Yamato, and K. Kogure. 2005. Differences in effect of robot and screen agent recommendations on human decision-making. Int. J. Hum. Comput. Stud. 62, 2, 267–279. DOI: https://doi.org/10.1016/j.ijhcs.2004.11.003. L. Sproull, M. Subramani, S. Kiesler, J. H. Walker, and K. Waters. 1996. When the interface is a face. Hum. Comput. Interact. 11, 2, 97–124. N. Sträfling, I. Fleischer, C. Polzer, D. Leutner, and N. C. Krämer. 2010. Teaching learning strategies with a pedagogical agent. The effects of a virtual tutor and its appearance on


learning and motivation. J. Media Psychol. 22, 2, 73–83. DOI: https://doi.org/10.1027/18641105/a000010. C. Straßmann. 2017. All Eyes on the Agent’s Appearance?! Investigation of Target-Group-Related Social Effects of a Virtual Agent’s Appearance in Longitudinal Human–Agent Interactions. Ph.D. thesis. University of Duisburg-Essen. https://duepublico2.uni-due.de/receive/ duepublico_mods_00070205. C. Straßmann and N. C. Krämer. 2017. A categorization of virtual agent appearances and a qualitative study on age-related user preferences. In J. Beskow, C. Peters, G. Castellano, C. O’Sullivan, I. Leite, and S. Kopp (Eds.), Intelligent Virtual Agents, Vol. 10498. Lecture Notes in Computer Science (17th. ed.). Springer, Berlin, 413–422. DOI: https://doi.org/ 10.1007/978-3-319-67401-8_51. C. Straßmann and N. C. Krämer. 2018. A two-study approach to explore the effect of user characteristics on users’ perception and evaluation of a virtual assistant’s appearance. Multimodal Technol. Interact. 2, 4, 1–25. DOI: https://doi.org/10.3390/mti2040066. D. Sturma. 2004. Substitutability of human? Robotic and human life forms. In L. Honnefelder and C. Streffer (Eds.), Yearbook for Science and Ethics, Vol. 9. Walter de Gruyter, Berlin, 141–162. S. S. Sundar. 1994, August 10–13. Is human–computer interaction social or parasocial? Paper presentation. Association for Education in Journalism and Mass Communication 77th Annual Meeting, Atlanta, GA. S. S. Sundar and C. Nass. 2000. Source orientation in human–computer interaction: Programmer, networker, or independent social actor. Commun. Res. 27, 6, 683–703. DOI: https://doi.org/10.1177%2F009365000027006001. J. van den Hoven, P. E. Vermaas, and I. van de Poel (Eds.). 2015. Handbook of Ethics, Values, and Technological Design: Sources, Theory, Values and Application Domains. Springer, Berlin. H. C. van Vugt, E. A. Konijn, J. F. Hoorn, I. Keur, and A. Eliéns. 2007. Realism is not all! User engagement with task-related interface characters. Interact. Comput. 19, 2, 267–280. DOI: https://doi.org/10.1016/j.intcom.2006.08.005. A. M. von der Pütten, N. C. Krämer, J. Gratch, and S.-H. Kang. 2010a. “It doesn’t matter what you are!” Explaining social effects of agents and avatars. Comput. Hum. Behav. 26, 6, 1641–1650. DOI: https://doi.org/10.1016/j.chb.2010.06.012. A. M. von der Pütten, N. C. Krämer, and J. Gratch. 2010b. How our personality shapes our interactions with virtual characters – Implications for research and development. In J. Allbeck, N. Badler, T. Bickmore, C. Pelachaud, and A. Safonova (Eds.), Intelligent Virtual Agents, Vol. 6356. Lecture Notes in Computer Science (10th. ed.). Springer, Berlin, 208–221. DOI: https://doi.org/10.1007/978-3-642-15892-6_23. A. M. von der Pütten, L. Hoffmann, J. Klatt, and N. C. Krämer. 2011a. Quid pro quo? Reciprocal self-disclosure and communicative accommodation towards a virtual interviewer. In H. H. Vilhjálmsson, S. Kopp, S. Marsella, and K. R. Thórisson (Eds.), Intelligent Virtual Agents, Vol. 6895. Lecture Notes in Computer Science (11th. ed.). Springer, Berlin, 183–194. DOI: https://doi.org/10.1007/978-3-642-23974-8_20.

References

103

A. M. von der Pütten, N. C. Krämer, and S. C. Eimler. 2011b. Living with a robot companion – Empirical study on the interaction with an artificial health advisor. In Proceedings of the 13th International Conference on Multimodal Interaction. Alicante, Spain, 327–334. DOI: https://doi.org/10.1145/2070481.2070544. J. Wainer, D. J. Feil-Seifer, D. A. Shell, and M. J. Matari´ c. 2007. Embodiment and human– robot interaction: A task-based perspective. In Proceedings of the 16th IEEE International Conference on Robot and Human Interactive Communication. Jeju, Korea, August 26–29, 872–877. DOI: https://doi.org/10.1109/ROMAN.2007.4415207. L. Winner. 1980. Do artifacts have politics? Daedalus 1, 109, 121–136. J. Yamato, K. Shinozawa, F. Naya, and K. Kogure. Evaluation of communication with robot and agent: Are robots better social actors than agents? In Proceedings of IFIP INTERACT01: Human–Computer Interaction. Tokyo, Japan, 690–691.

PART II

APPEARANCE AND BEHAVIOR

4 Appearance

Rachel McDonnell and Bilge Mutlu

4.1 Why Appearance?

One might question why appearance is needed for a metaphor to work, given that voice assistants can effectively express the characteristics of a metaphor solely through behavior. We argue that, although disembodied agents can effectively serve as computer-based assistants in specific scenarios of use, for example, involving driving and visually impaired users, appearance provides a “locus of attention” [Cassell 2001] for the cognitive and interactive faculties of the user of the system. Additionally, human communication mechanisms, such as mutual gaze, turn-taking, and body orientation, necessitate the presence of appropriate visual cues to properly function, making appearance a necessity for agent design. Studies of human–human, human–agent, and human–robot interaction provide strong evidence that such mechanisms work more effectively when parties provide appearance-based cues. The mere presence of a form of embodiment in interacting with an agent improves social outcomes, such as motivation [Mumm and Mutlu 2011]. As the scale and modality of appearance get closer to that of the metaphor, these outcomes further improve; human-scale and physical agents have more perceived presence [Kiesler et al. 2008] and persuasive ability [Bainbridge et al. 2011] than scaled-down and virtual agents.

4.2 History

Agents with virtual and physical embodiments follow different historical trajectories. Virtual agents, also called embodied conversational agents, a term coined by Cassell [2000], are “computer-generated cartoonlike characters that demonstrate many of the same properties as humans in face-to-face conversation, including the ability to produce and respond to verbal and non-verbal communication.” Early visions for virtual agents involved characters that re-played recordings of human


performers, such as the intelligent personal agent included in the Knowledge Navigator concept developed by Apple in 1987 [Colligan 2011] (Figure 4.1). First implementations of virtual agents were stylized non-human or human characters that were generated through three-dimensional (3D) modeling and rendering and were embedded within virtual environments. An example of early non-human characters is Herman the Bug, an animated pedagogical agent embedded within a virtual learning environment [Lester et al. 1997]. Another early example is Rea, a real-estate agent that followed a stylized human-like design and appeared within a simulated home environment [Cassell 2000]. Although these examples represent agents that are controlled and visualized by computer systems, the design of such non-human and human characters have a long history in shadow puppetry, dating back to the first millennium BC [Orr 1974]. These characters were designed for storytelling and entertainment, and the character designs reflected historical or cultural figures as well as characters developed with backstories. The design of the characters also include stylizations and ornamentations that reflect their ethnic and cultural context, such as the character Karagöz that followed a stylized human design with clothing and storyline from the 16th to 19th century Ottoman Empire [Scarce 1983]. The design of agents with robotic embodiments date back to mechanical humanoid automata designed as early as the 10th century BC [Hamet and Tremblay 2017]. Like virtual characters and shadow puppetry, the physical appearance of these early automata also followed stylized human-like forms. Examples, shown in Figure 4.2, include the design of the Mechanical Turk, a covertly humancontrolled chess-playing machine that integrated a humanoid chess player on a wooden chest where the human operator hid [Simon et al. 1999]. Karakuri puppets, mechanical automata designed in 17th- to 19th-century Japan to be used, for example, to ceremonially serve tea, followed a stylized human-like appearance and traditional Japanese clothing [Yokota 2009]. Although the appearance of robotic

Figure 4.1

Early examples of virtual embodiments. Left: The Rea real-estate agent [Cassell 2000]; Right: the personal assistant envisioned for Knowledge Navigator [Sculley 1989].


Figure 4.2


Early physical agents. Left: Mechanical Turk automata by Joseph Racknitz [1789], image courtesy of Humboldt University Library; Right: a tea-serving Karakuri puppet, Karakuri ningyo © 2016 Donostia/San Sebastian.

agents has overwhelmingly followed a human form with some level of stylization, robotic agents also commonly follow non-human morphologies. Examples of non-human appearances include the doglike robot Aibo designed by Sony in 1999 [Pransky 2001], a robotic seal designed for therapy in assisted living settings [Wada et al. 2005], and Keepon, a robot whose appearance resembled that of a chick [Kozima et al. 2009]. Finally, robots have also been envisioned as cartoonish characters that blend features from different sources, such as the design of the WALL-E robot by Pixar, a trash compactor with features that suggested human-like eyes and arms [Whitley 2012]. In the 1960s, the field of computer graphics and animation started to gain momentum, and by the 1970s most of the building blocks of 3D computer animation were laid, such as surface shading by Gouraud [1971] and Phong [1975] and texture mapping by Catmull [1974]. It was not long until computer-generated characters began to appear in feature-films such as Futureworld (1979, Richard T. Heffron), which was the first to showcase a computer-animated hand and face, with both wireframe and 3D shading, while the well-known film Tron (1982, Steven Lisberger) followed soon after with a whole 15 minutes of computer-generated content. Fully animated characters also started to appear in other areas such as music videos (e.g., Mick Jagger’s Hard Woman). Ten years later, the technology was developed even further and adopted in films such as Terminator 2: Judgment Day (1991, James Cameron), The Lawnmower Man (1992, Brett Leonard), and Jurassic Park (1993, Steven Spielberg). This was the start of 3D animation receiving widespread commercial success and it was not long until Pixar Animation Studios released the first entirely computer-animated featurelength film Toy Story (1995, John Lasseter). Toy Story was a massive success, largely


due to the use of appealing cartoon characters with plastic appearance, which computer graphics shading was perfectly suited to at that time. In the 2000s, more technology was being developed to support the growing industry and Pixar’s Monsters Inc. (2001, Pete Docter) showed impressive results with simulated fur depicting the subtle secondary motion on the coats of the monster characters. The Lord of the Rings: The Fellowship of the Ring (2001, Peter Jackson) pushed new boundaries with realistic crowd simulation, while in the same year Final Fantasy: The Spirits Within (2001, Hironobu Sakaguchi) attempted to create the first photorealistic virtual humans. While the near-lifelike appearance of the characters in the film was well received, some commentators felt the character renderings appeared unintentionally creepy. Films The Polar Express (2004, Robert Zemeckis) and Beowulf (2007, Robert Zemeckis) marked further milestones in photorealism, but again received poor audience reactions. Photorealistic rendering was used more successfully for fantasy creatures such as the character Gollum from The Lord of the Rings: The Fellowship of the Ring, the first full CGI character in a liveaction movie. The actor that drove the movements of Gollum (Andy Serkis) even went on to win the first performance-capture Oscar for his acting in later films. Similar success was achieved with the photorealistic fantasy Na’vi characters in Avatar (2009, James Cameron). More recent advancements in 3D scanning, deep learning, and performance capture have allowed actors to play realistic depictions of their younger selves (Bladerunner 2049 [2017, Denis Villeneuve], The Irishman [2019, Martin Scorsese], Gemini Man [2019, Ang Lee]) or even to play virtual roles after they have passed-away (Peter Cushing in Star Wars: Rogue One [2016, Gareth Edwards] and Paul Walker Fast and Furious 7 [2015, James Wan]). In the 1980s and 1990s, there was also a shift toward interactive media such as games, where real-time animation was employed. This posed new challenges for character creation due to the additional requirements of character responsiveness and agency. Game characters were thus less visually complex than film characters of the time due to the higher computation cost. The first attempts in the 1980s were in the form of simple two-dimensional (2D) sprites such as Pac-Man (Namco), Sonic the Hedgehog (Sega), and Mario (Nintendo). With the advent of home console systems and consumer-level graphics processing units, there was a shift from 2D to 3D in games such as Quake (1996, id Software), The Legend of Zelda: Ocarina of Time (1998, Nintendo), Tomb Raider (1996, Core Design), and Star Wars Jedi Knight: Dark Forces II (1997, LucasArts). Characters started to appear more sophisticated and used texture mapping techniques for materials and linear blend skinning for animation.


In the 2000s, many games utilized cutscenes of cinematic sequences that could achieve higher photorealism and conversation while disabling the interactive element of the game (e.g., LA Noire, Heavy Rain). Nowadays, with real-time ray tracing available in game engines, there is no longer a need for photorealism to be restricted to cutscenes, and we are seeing incredibly realistic depictions of humans and environments in real-time (e.g., Detroit: Become Human and Hellblade: Senua’s Sacrifice). Throughout the years, the graphics and game components have developed rapidly, allowing progressively more realistic depictions every year, though characters with advanced facial animation and conversational capabilities are rarely seen. In commercial games, conversing with non-player characters is usually achieved by selecting predefined conversation texts on the screen (to progress the conversation). There is scope in the future for truly conversational non-player characters. Additionally, as virtual reality becomes ever more immersive, we could be about to see the next evolution for the media with higher levels of realism, conversational capabilities, and social presence with non-player characters.

4.3 Design

4.3.1 What Is Appearance? When we say “appearance” for agents, we refer to the virtual or physical embodiment that users can experience using their visual faculties. Most agents, from simple static visual representations that accompany chatbots to human surrogates, follow a metaphoric design, that is, the design of the agent takes inspiration or reference from a familiar and existing or envisioned biological entity (e.g., a human, a dog, a grasshopper) or hybrid entity (e.g., a “trash can” in appearance but a cartoonish human in behavior). The expression of a metaphor involves two key dimensions: appearance and behavior. Metaphoric designs can follow consistent or inconsistent implementations across these two dimensions. For example, an agent that follows the metaphor of a dog and appears and behaves like a dog involves a consistent implementation, whereas a dog that speaks involves an inconsistent implementation, integrating dog-like appearance with human-like behavior. The power of agents as a family of computer interfaces comes from metaphoric design, which jumpstarts user mental models and expectations of the system using a familiar representation. For example, a computer system that uses speech as the mode of user interaction and follows a human-like agent metaphor signals to the user that the system is capable of human mechanisms of communication, such as


speech. Similarly, a robot designed to follow the metaphor of a maid or a butler is expected to be competent in household work. A common approach to designing the appearance of agents is metaphorical design, where the design follows a well-known metaphor to elicit familiarity and jumpstart user mental models of the agent’s capabilities. For example, a virtual agent designed to review hospital discharge procedures with patients followed the metaphor of a nurse, appearing on the screen as a nurse in scrubs [Bickmore et al. 2009]. The design of most agents follow a singular metaphor, such as the ASIMO humanoid robot designed to appear as an astronaut wearing a spacesuit [Sakagami et al. 2002], although some designs blend multiple metaphors [Deng et al. 2019], such as the MiRo robot, which integrates multiple animal features chosen to improve perceptions of its friendliness and feelings of companionship [Prescott et al. 2017]. Metaphorical design provides not only morphological features for the design of the agent but also additional behavioral and physical features such as clothing and environmental context to further support the expression of the metaphor. An example of such features is the design of Valerie the Roboceptionist, a receptionist robot situated in a receptionist’s cubicle, equipped with a backstory that was consistent with the design of the character, and dressed in clothing that was consistent with the backstory and the metaphor that the agent’s design followed [Gockley et al. 2005]. Figure 4.3 illustrates examples of metaphorical design: the Paro, the Keepon, and the iCat robots that followed the metaphors of a seal, a chick, and a cat, respectively. Virtual agents are also designed to follow different metaphors, most frequently of instructors or experts. For example, a digital double replica of a real doctor [Dai and MacDorman 2018] was found to be effective at delivering cues of warmth and competence (Figure 4.4). More importantly, the virtual doctor’s recommendations

Figure 4.3

Example metaphors used in the design of robotic agents. Left to right: PARO Therapeutic Robot © 2014 PARO Robots U.S.; the Keepon robot that followed the metaphor of a chick © 2007 BeatBots LLC [Kozima et al. 2009]; the iCat robot designed to follow the metaphor of a cat [van Breemen et al. 2005].


Figure 4.4


An example of agent design by replicating human experts [Dai and MacDorman 2018].

also significantly influenced the decisions of participants in the same manner as the real doctor, implying effectiveness at persuasion. In an educational context, a study on learning outcomes found that a human lecturer is preferable, but that robotic and virtual agents may be viable alternatives if designed properly [Li et al. 2016]. It was also shown that having a stereotypically knowledgeable appearance of the pedagogical agent influenced learning [Veletsianos 2010]. Virtual agents have also been used extensively as assistants. For example, as a navigation assistant in a crash-landing scenario in a study by Torre et al. [2018, 2019], where they had to persuade participants to accept their recommendations about items required for survival. Participants explicitly preferred interacting with a cartoon-like agent than a photorealistic one, and were more inclined to accept the cartoon-agent’s suggestions. Note that the photorealistic agent was rated low on attractiveness, and since persuasion and attractiveness have been linked in previous work (e.g., Pallak et al. [1983]), it may be the case that a more attractive virtual human may have been more persuasive.


Another study compared digital avatars, humans, and humanoid robots to determine the influence of appearance on trust and identifying expert advice [Pan and Steed 2016]. They found that participants were less likely to choose advice from the avatar, irrespective of whether or not the avatar was an expert. In contrast, experts represented by the robot or by a real person were identified reliably.

4.3.2 Modalities Appearance can be expressed in graphical, virtual, video-mediated, physical, and hybrid modalities (Figure 4.5). Agents in graphical modalities are static or dynamic 2D representations, such as a photo, drawing, or animation of a character. For example, “Laura,” a virtual nurse designed to support low-literacy patients, appeared as a 2D rendering [Bickmore et al. 2009]. Virtual embodiments usually involve 3D simulations that are rendered in real-time or replays of rendered animations. An example of virtual embodiment is MACH, a virtual interview coach that is rendered in real-time in a virtual environment and presented on a 2D display [Hoque et al. 2013]. Such representations can also be presented in virtual reality and mixed reality modalities [Garau et al. 2005], which provide the user with a more immersive experience of the agent’s embodiment. Agents with a physical

Figure 4.5

Modalities in which agents are expressed. Left to right, top to bottom: the nurse agent Laura rendered as a graphical agent [Bickmore et al. 2009]; the MACH virtual interview coach [Hoque et al. 2013]; the hybrid robot Spritebot with a physical body and a graphical face [Deng et al. 2019]; the hybrid Furhat robot with a physical head and a projected face © 2021 Furhat Robotics; the Pepper physical robot © 2021 SoftBank Robotics; the Geminoid F android robot [Watanabe et al. 2015].


appearance involve a robotic embodiment, such as the Robovie robot, designed as a shopping mall assistant [Iwamura et al. 2011], or the Geminoid, designed to serve as a human surrogate [Nishio et al. 2007]. Users of agents with physical embodiments can also experience the appearance of the agent over video [Kiesler et al. 2008]. Finally, hybrid embodiments bring physical and graphical or virtual features together, such as a graphical face appearing on a physical body or graphical features that are projected on the surface of a physical body. Examples of hybrid appearances include the Furhat robot [Al Moubayed et al. 2012] or Valerie/Tank, a receptionist robot [Lee et al. 2010]. The modality in which an agent is presented affects user perceptions of and experience with the agent. A large body of literature has aimed to compare interaction outcomes across different modalities toward testing the “embodiment hypothesis:” that physical embodiment has a measurable effect on user performance and perceptions in interactions with an agent. This body of work shows that, in general, users respond more favorably to agents with stronger embodiments and human-scale sizes. In this context, “strong” embodiment refers to modalities that elicit a strong sense of presence, such as physical or hybrid modalities, and “weak” embodiment describes modalities such as graphical or virtual that may not elicit a sense of presence to such an extent. Deng et al. [2019] systematically analyzed 65 studies that compared virtual and physical agents in measures of perceptions of the agent and task performance. The analysis showed that 78.5% of these studies involved improvements in at least one of these categories of measures, consistent with the embodiment hypothesis, 15.4% involved no change, and 6.1% involved worsening in at least one of the categories of measures. Among the studies included in this analysis, the most comprehensive comparison was performed by Kiesler et al. [2008], who compared a collocated robot, a life-size video projection of a remote robot, a life-size projection of the virtual version of the robot, and the virtual robot on a computer screen. The measured interaction outcomes generally decreased in this order, the participants responding to the robot more favorably than the virtual agent, and the collocated robot more than the projected robot. The modality in which the agent is presented not only affects user interaction with the agent but also presents different sets of affordances. For example, even if the behaviors of a virtual character and a physical robot are controlled by the same algorithm, the behaviors demonstrated by the agents might look very different due to the differences inherent in the modalities. Unlike virtual characters, physical robots are subject to mechanical limitations and bound by the physical properties of the real world, which might affect the speed with which the agent displays a desired behavior (unbounded in virtual characters, bounded by actuator


performance in robots), the sounds that the agent makes (e.g., sound artifacts produced by robots executing motion), the detail with which agent features can be fabricated (bound by modeling and rendering limitations in virtual characters and by physical fabrication limitations in robots), and so on. Physical robots and hybrid agents afford touch interactions and offer texture and material hardness as additional cues. The scale in which the agent is presented is another factor that affects affordances and interaction outcomes. Across all modalities, the closer the agent is presented to human scale, the more likely the agent will support human communication mechanisms. For example, a robot that is expected to be hugged by users must have a size that affords hugging.
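To make the actuator-bound point above concrete, the short Python sketch below contrasts how the same gaze-shift command unfolds on a virtual character (effectively unbounded) and on a robot whose head motor has a rated maximum speed. It is a minimal, hypothetical illustration; the parameter names (max_speed_deg_s, time_step_s) and values are assumptions for exposition, not drawn from any system discussed in this chapter.

```python
from typing import List, Optional

def plan_gaze_shift(start_deg: float, target_deg: float,
                    max_speed_deg_s: Optional[float],
                    time_step_s: float = 1.0 / 30.0) -> List[float]:
    """Return per-frame head-yaw angles moving toward the target angle.

    A virtual character may pass max_speed_deg_s=None (effectively unbounded);
    a physical robot is clamped by its motor's rated speed.
    """
    angles = [start_deg]
    current = start_deg
    while abs(target_deg - current) > 1e-3:
        step = target_deg - current
        if max_speed_deg_s is not None:
            limit = max_speed_deg_s * time_step_s  # largest change allowed per frame
            step = max(-limit, min(limit, step))
        current += step
        angles.append(current)
    return angles

# The virtual agent completes a 60-degree gaze shift in a single frame,
# while a robot limited to 45 deg/s needs well over a second of motion.
print(len(plan_gaze_shift(0.0, 60.0, max_speed_deg_s=None)))   # 2 samples
print(len(plan_gaze_shift(0.0, 60.0, max_speed_deg_s=45.0)))   # 41 samples
```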

4.3.3 Agent Construction An important factor that shapes agent appearance is how agents are constructed, which due to historical as well as practical reasons varies based on the modality of the agent. For example, physical agents are constructed using processes and practices from industrial design, and their designs are affected by factors such as manufacturing limitations, product safety, and material choice. On the other hand, the construction of virtual characters borrows processes and practices from animated filmmaking and game design, and their designs are affected by factors including character backstory, the environment in which the agent will be presented, and the mechanisms with which the agent interacts with its environment, the user, and the user’s environment. The paragraphs below outline some of these processes and practices. 4.3.3.1

Construction of Virtual Characters Virtual characters have fewer constraints in terms of design than robots, and can be programmed to take on a multitude of different appearances, using a variety of modeling and rendering techniques. For modeling, virtual characters are typically visualized in 3D using a mesh of consecutive planar polygons that approximate the surface of the human’s body. Polygons are very simple building blocks, and so can be used to describe many different shapes. They are also very quick to render on graphics hardware. The construction of 3D models is an established industry with many sophisticated packages available for model-building (e.g., 3D Studio Max, Maya, Blender, Houdini). Creating detailed 3D virtual characters using these packages is a highly skilled and labor-intensive task primarily due to the fact that 3D models are created using a 2D display and a high level of geometric detail is required to create convincing virtual characters. Generating 3D data for virtual characters can also be accomplished by scanning real people using a range of


techniques such as photogrammetry, structured light scanning, or laser scanning. Photogrammetry is a type of scanning whereby a collection of still photographs from regular DSLR cameras taken from various angles is all that is required to create a 3D model. Software then analyzes the photographs, matching characteristic points of the object on the images. This creates a point cloud of vertices that can later be converted into a mesh. It is the most commonly used tool nowadays for scanning humans in the visual effects industry, where the number and quality of cameras used in the rig contribute to the accuracy of the recovered mesh. 3D scanning can also be performed using sophisticated 3D scanning devices to project structured patterns of light or lasers onto the surface of the human to reproduce a 3D model that is a copy of the original. For more stylized characters, artists can sculpt characters out of clay and then use one of the mentioned forms of 3D scanning to gather the data onto the computer. Professional grade 3D scanners are expensive, but there are also more affordable, consumer-grade technologies such as depth-sensor based 3D scanning (e.g., Microsoft Kinect) and low-cost photogrammetry, which use regular cameras, but results are generally of lower quality and suitable only for low fidelity non-player characters. In the industry, there are a number of rapid character creation products that only require a single photo and create a virtual human within seconds on a tablet or phone [Didimo 2019, Pinscreen 2019, itSeez3D 2020]. These methods are improving in quality and speed with recent advancements in computer vision and deep learning [Thies et al. 2016, Hu et al. 2017, Saito et al. 2017, Nagano et al. 2018, Yamaguchi et al. 2018]. Once a 3D representation of a human character has been created, a number of different techniques can be utilized in order to add detail and realism. A wide variety of render styles from photorealistic to non-photorealistic can be achieved using rasterization for local illumination or ray tracing for more realistic global illumination [Marschner and Shirley 2016]. While the rasterizer is the current standard for real-time, recent GPU optimization allows for ray tracing in real-time, and we expect to see much higher realism in virtual characters in the future with global illumination. Besides the underlying rendering approach, there are many other methods for adding realism such as texture mapping, and approximating surface reflectance through shading [Masson 2007]. Diffuse texture mapping enhances the character by adding image-based information to its geometry while entailing only a small increase in computation. The basic idea is to map the color of the image or “texture” onto the corresponding color of an object at each pixel [Catmull 1974], which adds the illusion of detail to the model, such as clothing material and skin color.
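The following minimal Python sketch illustrates the basic idea of diffuse texture mapping described above: a (u, v) coordinate selects a color from the texture image, and that color is modulated by a simple Lambertian lighting term. The array shapes, toy texture values, and function names are assumptions made for exposition and do not correspond to any particular rendering package.

```python
import numpy as np

def sample_texture(texture: np.ndarray, u: float, v: float) -> np.ndarray:
    """Nearest-neighbour lookup of an RGB texel; texture has shape (H, W, 3)."""
    h, w, _ = texture.shape
    x = min(int(u * (w - 1)), w - 1)
    y = min(int(v * (h - 1)), h - 1)
    return texture[y, x].astype(float)

def shade_diffuse(texture: np.ndarray, uv: tuple, normal: np.ndarray,
                  light_dir: np.ndarray) -> np.ndarray:
    """Albedo fetched from the texture map, scaled by the Lambert cosine term."""
    albedo = sample_texture(texture, *uv) / 255.0
    n = normal / np.linalg.norm(normal)
    light = light_dir / np.linalg.norm(light_dir)
    lambert = max(float(np.dot(n, light)), 0.0)  # clamp back-facing light to zero
    return albedo * lambert                      # final RGB in [0, 1]

# Toy 2x2 "skin" texture; the surface point faces the light directly.
tex = np.array([[[220, 180, 160], [210, 170, 150]],
                [[205, 165, 145], [200, 160, 140]]], dtype=np.uint8)
print(shade_diffuse(tex, (0.25, 0.75),
                    normal=np.array([0.0, 0.0, 1.0]),
                    light_dir=np.array([0.0, 0.0, 1.0])))
```

In a real renderer this lookup and lighting computation happen per pixel on the GPU, with filtered texture sampling rather than the nearest-neighbour fetch shown here.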


In order to add color detail to virtual characters, diffuse texture maps are used that define the color of diffused light (Figure 4.6). Additionally, there are situations where surfaces are not smooth and roughness needs to be added if it is not present in the geometry. For example, skin is not a smooth surface as it has imperfections such as pores and wrinkles. These details are best added using normal maps that perturb the surface normals to add detail or displacement maps that add geometric detail. In modern computer graphics, surface properties are governed by shaders, the code snippets describing how a surface should react to incident light. Many physically based shaders have been developed to produce realistic materials with different bidirectional reflectance distribution functions (BRDFs) [Nicodemus et al. 1992] (the function that relates the incident to the reflected light). More recently, with the rapid advancements in graphics hardware, more complex shading effects approximating a wide range of BRDFs can now be achieved in real-time. For example, subsurface light transport in translucent materials [Jensen et al. 2001] for realistic scattering of light on the skin was once a technique only used in offline high-end visual effects, but faster methods [Jimenez et al. 2009, 2010] are now used to enhance the realism in real-time. Hair for interactive virtual characters has traditionally been modeled using card-based rendering, where images of chunks of the hair are mapped onto large flat sheets to approximate the shape of a much larger number of individual hairs. Later advancements allowed for modeling each individual hair, which dramatically improves realism. For rendering of hair, physically based fiber reflectance models are used, based on a combination of an anisotropic specular and a diffuse

Figure 4.6

Left: Wireframe render of a character with no texture mapping; center: diffuse textures applied; right: high quality rendering including normal maps, specular map, subsurface scattering, global illumination, and so on.


component [Kajiya and Kay 1989]. More recently, the scattering distribution of the hair fiber is split into different lobes based on the number of internal reflections within the fiber [Marschner et al. 2003]. The use of physically based simulations is ubiquitous for realism in virtual clothing, where fast mass-spring models [Liu et al. 2013] or more complex implicitly integrated continuum techniques [Baraff and Witkin 1998] are used in the state-of-the-art. Implementing realistic cloth and hair dynamics in real-time applications still represents a significant challenge for developers since simulation dynamics need to be solved at run time, and are required to be fast and stable. Based on this, depictions of stiff clothing and hair with little secondary-motion effects are still commonplace for interactive virtual characters across a range of applications from video games to virtual assistants. 4.3.3.2

Industrial Design of Robots The paragraphs above have discussed design approaches to, for example, metaphorical design, and the resources used, for example, facial features, for the development of agent appearance. Another factor that significantly affects agent appearance is the industrial design of physical agents or the physical platforms in which virtual or hybrid agents are presented, including form, material use, scale, color choice, and so on. Although there are no systematic studies of how these factors affect agent appearance or how they must be designed to maximize user experience, the HRI literature includes reports of the design process for the appearance of specific robot platforms. For example, Lee et al. [2009] described the design process for Snackbot, a robot designed to deliver snacks in an academic building, including the form of the housing of the robotic hardware and the snack tray that the robot would carry; the material and colors used to construct the housing and the tray; the height of the robot; and the expressive features of the head and face of the robot. Another example is the design of the Simon humanoid robot, where the research team explored the proportions that the robot’s head and body should follow, the placement of the eyes on the head, facial features that would achieve the appearance of a “friendly doll,” and the interplay between the design of the housing and structural or mechanical elements of the robot’s head [Diana and Thomaz 2011]. Hegel et al. [2010] documented and reported on the industrial design of the social robot Flobi, which included an exploration of the design of the robot’s head to follow a “baby face” schema; effective color combinations of the robot’s face, hair, lips, and eyebrows; and how blushing on the robot’s cheeks could be achieved using LEDs placed behind the surface of the face. A final example is the design of Kip1, a peripheral robotic conversation companion, involving form and material exploration through sketches and mock-ups [Hoffman et al. 2015]. Figure 4.7


Figure 4.7

Sketches and models generated during the industrial design of the Snackbot [Lee et al. 2009] (top-left), Simon [Diana and Thomaz 2011] (top-right), and Kip1 [Hoffman et al. 2015] (bottom) robots.

illustrates the sketches and mock-ups generated in the industrial design of some of these examples. In all of the examples discussed above, the research team engaged professional industrial designers or members of the research team with training in industrial design as well as an iterative design process. The literature does not include any discussion of such considerations for virtual characters, and characters designed for research and commercial use all utilize existing display platforms, such as mobile phones, tablet computers, computer monitors, large displays, or virtual or mixed reality environments. Overall, there is a great need for systematic research on the industrial design of the appearance of agents, including the effects of the physical design of the agent itself and the environment within which virtual agents are presented on user interaction and experience.

4.4 Features

The design approaches described above draw on a rich space of features, shaped by the metaphor followed by the design (e.g., human-like features included in the design of a virtual human), functional requirements of the agent (e.g., light displays placed on physical robots to convey the agent’s status), and/or aesthetic and experiential goals of the design (e.g., material, color, and texture choices for a robot). The paragraphs below provide an overview of this space, focusing on facial and bodily features as well as features that communicate demographic characteristics of virtual and physical agent embodiments.

4.4.1 Facial Features The face of an agent serves as the primary interface between the agent and its user, and facial features make up a substantial portion of the design space for agents. Even when designs lack anthropomorphic or zoomorphic faces, people attribute facial features to them, highlighting the importance of faces in the perception of non-living objects [Kühn et al. 2014]. Designers of virtual and physical agents draw on this human propensity and create faces that can display conversational cues, express affect, and communicate direction of attention. In order to convey a true feeling of life in a character, the appearance of the eye is highly important. Rendering techniques such as adding specular and reflection maps can be very useful for this purpose to increase the appearance of wetness and to reflect the environment. Additionally, more advanced techniques such as ambient occlusion allow for soft shadowing, and refraction to replicate the refraction of light that passes through the eyeball, which is filled with fluid. Creating the geometry of the eye is a difficult task due to the complexity of the surface, but there exist special photogrammetry rigs for capturing the visible parts of the eye—the white sclera, the transparent cornea, and the non-rigidly deforming colored iris [Bérard et al. 2014]. Computer-generated eyes used in computer graphics applications are typically gross approximations of the actual geometry and material of a real eye. This is also true for facial expressions, which typically take a simple approach of linearly blending pre-generated expression meshes (blendshapes) to create new expressions and motion [Anjyo 2018]. However, little is known about how these approximations affect user perception of the appearance of virtual characters. Similar to the studies on real humans, virtual humans with narrow eyes have been rated as more aggressive and less trustworthy for both abstract creatures [Ferstl et al. 2017] and more realistic depictions [Ferstl and McDonnell 2018] (Figure 4.8). It should be noted that for realistic eye size alterations, the size of the eyes themselves should not be scaled as this will be quickly perceived as eerie and artificial [Wang et al. 2013]. Instead, the shape of the eyelids can be changed as protruding eyes appear larger, whereas hooded and monolid eyes appear smaller. In contrast to human face studies, wider faces were not judged as less trustworthy, and were perceived as less aggressive compared to narrow faces for realistic [Wang et al. 2013] and abstract virtual characters [Ferstl et al. 2017], even when a particularly masculine rather than a babyface appearance was presented [Ferstl and McDonnell 2018]. The results of these studies support the notion that virtual faces are perceived differently from real human faces. A potential explanation


could be the tendency of villains in animated movies to be portrayed with narrow, long, sharp facial features (e.g., Captain Hook in Peter Pan (1953, Clyde Geronimi), Scar in The Lion King (1994, Roger Allers), Maleficent in Sleeping Beauty (1959, Clyde Geronimi). This tendency could influence the perception of computergenerated characters toward automatic association of narrow faces with dangerous characters. Other work has addressed the perception of rather unusual facial proportions for realistic characters and their influence on perceived appeal. Seyama and Nagayama [2007] studied eye size by morphing between photographs of real people and dolls, and found that characters were judged as unpleasant if the eyes had strong deviations from their original size. Participants were more sensitive to the alterations for real faces than for artificial faces. Several studies confirmed that altering facial parts lowers perceived appeal, especially for human-like characters. Green et al. [2008] demonstrated that not only proportions but also the placement of facial parts may negatively affect the perceived appeal. The measured effect was greater for the human-like and more attractive faces. Additionally, it has been demonstrated that a mismatch of realism between facial parts negatively affects appeal [MacDorman et al. 2009, Burleigh et al. 2013].
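As a concrete illustration of the linear blendshape approach mentioned above, the following minimal Python sketch computes a new expression as a weighted sum of per-vertex offsets between sculpted expression meshes and a neutral face. The mesh data and shape names ("smile", "brow_raise") are hypothetical, and production rigs add many refinements (for example, corrective shapes) beyond this basic linear form.

```python
from typing import Dict
import numpy as np

def blend_expression(neutral: np.ndarray,
                     blendshapes: Dict[str, np.ndarray],
                     weights: Dict[str, float]) -> np.ndarray:
    """neutral and each blendshape are (num_vertices, 3) arrays of vertex positions."""
    result = neutral.astype(float).copy()
    for name, target in blendshapes.items():
        w = weights.get(name, 0.0)
        result += w * (target - neutral)  # add the weighted per-vertex offset
    return result

# Toy three-vertex "face"; the shape names are purely illustrative.
neutral = np.zeros((3, 3))
shapes = {
    "smile": np.array([[0.0, 0.1, 0.0]] * 3),       # mouth corners raised
    "brow_raise": np.array([[0.0, 0.0, 0.2]] * 3),  # brows pushed forward
}
print(blend_expression(neutral, shapes, {"smile": 0.5}))  # a half-intensity smile
```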

Figure 4.8

Left: Examples of eye and head shape manipulations on abstract characters (based on Ferstl et al. [2017]). Right: More subtle facial feature manipulations on realistic virtual characters (adapted from Ferstl and McDonnell [2018]).


Prior work in HRI includes a large body of literature on the facial features of robotic agents. A number of studies aimed to characterize the design space for robot faces. Blow et al. [2006a] characterized this space as varying across the dimensions of abstraction, from low to high abstraction, and realism, from realistic to iconic, borrowing from literature on the design of cartoon faces [McCloud 1993]. DiSalvo et al. [2002] carried out an analysis of 48 robots and conducted an exploratory survey that resulted in a number of design recommendations to improve human perceptions of human-like robots: (1) the head and the eye space should be wide; (2) facial features should dominate the face with minimal space for a forehead and a chin, (3) the design should include eyes with sufficient complexity; (4) the addition of a nose, mouth, and eyelids improve perceptions of human-likeness; and (5) the head should include a skin or a casing that core the electromechanical components. A similar analysis was carried out by Kalegina et al. [2018] of 157 rendered robot faces—physical robots that are equipped with a screenbased face and facial features that are virtually rendered on the screen—who coded the faces for 76 different features and conducted a survey to understand how each feature affected user perceptions of the robot (Figure 4.9). The study found that faces with no pupils and no mouth were consistently ranked as being unfriendly, machinelike, and unlikable; those with pink or cartoon-style cheeks were perceived as being feminine; and faces with detailed blue eyes were found to be friendly and trustworthy. Survey participants also expressed preferences for robots with specific facial features for specific contexts of use, for example, selecting robots with no pupils and no mouth for security work and faces with detailed blue eyes for entertainment applications. Consistently, Goetz et al. [2003] argued that there is not a universally preferred design for the facial features of a robot, but that people prefer

Figure 4.9

The 157 faces analyzed by Kalegina et al. [2018] (left), their analysis of facial features used in the design of the robot faces (right-top), and the spectrum of facial realism (right-bottom). Copyright information: Images included in this paper under ACM guidelines on Fair Use.


appearances that match the robot’s task. They varied the robot’s appearance across three stylistic dimensions—human versus machine, youth versus adult, and male versus female—and found that user preferences for facial features presented in these styles depended on the robot’s task. In a follow-up study, Powers and Kiesler [2006] showed that the length of the robot’s chin and the fundamental frequency of its voice predicted whether participants expressed interest in following advice from the robot. The literature also includes reports of the process for the design and development of faces for several robot platforms. For example, the design of the iCub social robot primarily involved the mechanical replication of human anatomical mechanisms to achieve realistic eye and head movements and the design of the rest of the face to follow a “toy-like” appearance [Beira et al. 2006]. The design specifications for the face of the KASPAR social robot included a sufficiently expressive but minimal design, an iconic overall design (as opposed to a realistic one), a human-like appearance, and the ability to express autonomy, communicate attention, and display projected expressions [Blow et al. 2006b, Dautenhahn et al. 2009]. The design of the humanoid robot HUBO integrated an abstract body with the overall appearance of an astronaut and a highly human-like face using elastomer-based materials that appeared and moved similar to human skin and a 28-degree-of-freedom mechanism to achieve human-like facial movements [Oh et al. 2006]. The faces of robots including the Flobi [Lutkebohle et al. 2010], Melvin [Shayganfar et al. 2012], and iCat [van Breemen 2004] featured pairs of flexible actuators that served as the robot’s lips and pairs of eyebrows to express emotion. As discussed earlier, the design of the face of the Flobi robot, shown in Figure 4.14, additionally included sophisticated mechanisms for emotion expression, such as lights placed behind the cheeks to enable the appearance of blushing. These reports illustrate how different facial features come together in the design of different robot systems and point to specific examples in the design space of facial features for robots.

4.4.2 Bodily Features While the face serves as the primary interface for human–agent interaction, the remainder of an agent’s body also contributes to the appearance of the agent. The design space for an agent’s body primarily includes several bodily features, how these features come together structurally, and how they are represented. A virtual agent’s body can be presented in a range of different styles, from low-detailed stick-figures or point-light displays to photorealistic bodies or anthropomorphized creatures, and there have been some studies aimed at investigating the effect of the body representation on perception of the agent’s appearance and actions. Most studies apply motion captured animations to a virtual character and


map the motion onto a range of bodies and assess if the different bodies change the meaning of the motion. Typically, factors such as emotion, gender, and biological motion are chosen since these have all been shown to be identifiable solely through motion cues (e.g., Johansson [1973], Cutting and Kozlowski [1977], Kozlowski and Cutting [1977]) thus allowing the contribution of the bodies’ appearance to be assessed. Beginning with a study by Hodgins et al. [1998], the amount of detail in a virtual character’s representation has been studied to investigate the effect on perception. Their study found that viewers’ perception of motion characteristics is affected by the geometric model used for rendering. They observed higher sensitivity to changes in motion when applied to a polygonal model than a stick figure. Chaminade et al. [2007] also found an effect on motion perception, where character anthropomorphism decreased the tendency to report their motion as biological, while another study found that emotions were perceived as less intense on characters with lower geometric detail [McDonnell et al. 2009b]. Body shape has also been investigated—it was found that a virtual character’s body does not affect recognition of body emotions, even for extreme characters such as a zombie with decomposing flesh [McDonnell et al. 2009b] (Figure 4.10). Fleming et al. [2016] evaluated the appeal and realism of female body shapes, which were created as morphs between a realistic character and stylized versions following the design principles of major computer animation studios. Surprisingly,

Figure 4.10

Different structural and material representations for agent body [McDonnell et al. 2009b].


the most appealing characters were in-between morphs, where 33% morphs had the highest scores for realism and appeal and 66% morphs were rated as equally appealing but less realistic (Figure 4.11). The perception of sex of a virtual character’s walking motion has also been found to be affected by body shape. Adding stereotypical indicators of sex to the body shapes of male and female characters influences sex perception. Exaggerated female body shapes influenced sex judgements more than exaggerated male shapes [McDonnell et al. 2009a] (Figure 4.12).
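Morphs of the kind used in these studies are, at their core, linear interpolations between corresponding vertex positions of two meshes that share the same topology. The sketch below is a hypothetical Python illustration of that idea rather than the authors' actual pipeline; the vertex data are invented, and 0.33 and 0.66 correspond to the intermediate stylization levels discussed above.

```python
import numpy as np

def morph(realistic: np.ndarray, stylized: np.ndarray, amount: float) -> np.ndarray:
    """Blend two (num_vertices, 3) meshes; amount=0 is fully realistic, 1 fully stylized."""
    assert realistic.shape == stylized.shape, "meshes must share vertex topology"
    return (1.0 - amount) * realistic + amount * stylized

# Two invented vertices standing in for corresponding points on the two bodies.
realistic_verts = np.array([[0.00, 1.70, 0.00], [0.10, 1.60, 0.05]])
stylized_verts = np.array([[0.00, 1.55, 0.00], [0.14, 1.45, 0.02]])
for amount in (0.33, 0.66, 1.0):
    print(amount, morph(realistic_verts, stylized_verts, amount))
```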

Figure 4.11

Stylization applied at different levels (33%, 66%, 100%) to captured performer body (0%) in Marvel and Disney styles (image based on Fleming et al. [2016]).

Figure 4.12

Six body shapes with indicators of gender [McDonnell et al. 2009a].


In virtual reality, embodiment of virtual characters is where the user is positioned virtually inside the body of a virtual avatar, where they have agency over that virtual body. The character model used for the virtual avatar can affect the behavior of the user, from becoming more confident when embodied in a taller avatar, more friendly as an attractive avatar [Yee and Bailenson 2009], to reducing implicit racial bias by embodying an avatar of a different race [Banakou et al. 2016]. This powerful effect is referred to as the Proteus effect [Yee and Bailenson 2007] (named after the Greek god known for his ability to take on many different physical forms). The use of self-avatars or virtual doppelgangers has also been shown to affect outcomes, with generally a positive influence on aspects such as cognitive load [Steed et al. 2016], pain modulation [Romano et al. 2014], and embodiment [Kilteni et al. 2012, Fribourg et al. 2020]. These effects describe to some extent the dynamism of interactions between users and avatars. The design of a physical robot’s body is shaped by a number of factors, including the metaphor that the design follows, the functional requirements of the robot, and environmental constraints that the design must consider. The first factor, the design metaphor, might dictate how the body of the robot is structured and the features that are articulated in the design. For example, the Paro robot [Wada and Shibata 2007] follows the metaphor of a baby seal, and the design of the robot’s body roughly follows the form of a seal, including fore and hind flippers. The functional requirement of the robot might include specific forms of mobility, such as holonomic movement, climbing stairs, or movement across rough terrain, or prehensile manipulation involving a single arm or two arms. Depending on such design requirements, the design of the body of a robot might follow a humanoid design including human-like limbs attached to a torso, such as the ASIMO robot [Sakagami et al. 2002], or a single arm attached on a mobile base, such as the Fetch robot [Wise et al. 2016]. Finally, the environment that the robot is designed for can dictate the bodily features of the robot, such as requiring that a robot that crawls into tight spaces has a low profile and limbs that can be tucked away, such as a PackBot robot [Yamauchi 2004] used in search-and-rescue scenarios. In addition to bodily features borrowed from the design metaphor, such as the hind flippers of a seal or the legs of a human, the design of physical robots also utilize features that facilitate specific functions. These functions include communication, and features that support communication include lights that communicate the robot’s affective states using different colors [Bethel and Murphy 2007] or light arrays that convey information about the robot’s direction of motion using light patterns [Szafir et al. 2015]. Features of a robot’s body may also support transferring items, such as a tray that the Snackbot robot holds to carry food items


[Lee et al. 2009] and the different configurations of carts that hospital delivery robots pull to transport materials [Ozkil et al. 2009]. An agent’s body can also include bodily features, such as clothing or furniture, designed to support the agent’s character or backstory or eventually improve user experience with the agent. For example, the Roboceptionist robot was placed in a booth that resembled an information booth and wore clothes that were consistent with the gender and the backstory of its character [Gockley et al. 2005]. The Geminoid robot, a highly realistic android developed to serve as a robotic surrogate to support remote communication, was constructed to resemble its creator and dressed in similar fashion [Nishio et al. 2007]. Figure 4.13 illustrates examples of bodily features that support specific functions, such as a tray, and that support the agent’s character, such as clothing.

4.4.3 Features Expressing Demographic Characteristics Agent appearance communicates other attributes of the character of the agent, such as gender, age, race, and ethnicity. Virtual agents are usually designed as distinctive characters, such as the two female nurse characters, one middle-aged Caucasian and one middle-aged African American, designed by Bickmore et al. [2009] to match user patient demographics. Physical agents, on the other hand, are designed as characters with ambiguous features and interchangeable parts that highlight specific character attributes, such as the interchangeable hair and lips of the Flobi robot that communicate a male or female gender [Lutkebohle et al. 2010] (Figure 4.14). A large body of research on human–agent interaction has shown such character attributes to significantly shape interaction outcomes. For example, Siegel et al. [2009] asked participants to make an optional donation to a robot that used

Figure 4.13

Bodily features that support specific functions, such as a tray that the robot uses to deliver snacks [Lee et al. 2009] (left) and light arrays that a flying robot uses to communicate direction [Szafir et al. 2015] (left-center), and that support the agent’s character, such as a booth and clothing for a receptionist robot [Lee et al. 2010] (right-center) and clothing for a surrogate robot [Watanabe et al. 2015] (right).


Figure 4.14


Facial features of the Flobi robot that provide the robot with different demographic characteristics. Left: neutral male (top) and smiling female (bottom) faces; center: the physical parts that represent facial features; right: different hair and lip styles. Adapted from Lutkebohle et al. [2010].

pre-recorded male or female voices, which research has shown to be sufficient to trigger gender stereotypes [Nass et al. 1997], and found a significant interaction between robot and participant gender over the proportion of participants who donated any amount, for example, men consistently donating more to a female robot. Eyssel and Hegel [2012] manipulated the gender of the Flobi robot by varying the robot’s appearance via its interchangeable parts for hair and lips and found that participant perceptions of the male and female robots closely followed gender stereotypes. The male robot was perceived as having more agency and being more suitable for stereotypically male tasks (e.g., repair), and the female robot was perceived as being more communal and being more suitable for stereotypically female tasks (e.g., childcare). The effect of stereotypes has also been studied for virtual characters, mostly in the context of embodiment in virtual reality. The Proteus effect, as mentioned previously, has additionally shown that users conform to stereotypes associated with their avatar’s appearance. For example, embodiment in female avatars made players more likely to conform to female-typed language norms [Palomares and Lee 2010] and made them more likely to engage in healing activities [Yee et al. 2011]. Interestingly, these effects were observed regardless of the actual gender of the player, indicating a tendency to conform to expectations associated with the virtual gender. In other work, Zibrek et al. [2015] explored gender bias on different types of emotions applied on male and female virtual characters. They found that emotion biases gender perception according to gender stereotypes: an angry motion is seen as more male, while fear and sadness are seen as less male motions, and


they observed a contrast effect where anger was seen as more male when viewed on a female model than when viewed on a male model. Similar effects were found for real humans [Hess et al. 2004], indicating that virtual humans follow similar stereotyping effects.

4.4.4 Realism, Appeal, and the Uncanny Valley Metaphorical design involves the application of a familiar metaphor to the design of an agent, such as a virtual human following the metaphor of a human. In practice, metaphors are applied at different levels of abstraction due to technical limitations (e.g., inability to closely replicate the original metaphor) and design choices (e.g., stylization). Deng et al. [2019] argued that designs follow discrete metaphors (e.g., a “baby seal” metaphor) but the realism in which these metaphors are applied to vary along a spectrum of abstraction (e.g., a stylized or abstract household robot vs. a highly realistic robotic surrogate). The design choices of metaphor and abstraction result in differences in user perceptions of the agent and experience with it. In the classic textbook “Disney Animation: The Illusion of Life,” Thomas and Johnston [1995] use the term appeal to describe well designed, interesting, and engaging characters. This is contrary to many face perception studies, which use the term appeal and attractiveness interchangeably. Appeal is an essential ingredient for virtual characters in video games and movies, as well as for avatars, agents, and robots, to ensure audience engagement and positive interactions. Creating highly detailed, photorealistic virtual characters does not necessarily produce appealing results [Geller 2008], and it is often the case that more stylized approximations evoke more positive audience responses and engagement [Zell et al. 2019]. However, additional factors are the context of the interaction and how appropriate the appearance is under the circumstances. For example, having a fun cartoon-appearance may be less appropriate for a more serious application such as a for a business meeting [Junuzovic et al. 2012], medical training [Volante et al. 2016], and so on. Perception of appeal of virtual characters is an ongoing area of research, with the ultimate goal of speeding up or automating the process of producing appealing characters, and avoiding negative reactions from audiences. The term uncanny valley (UV) is often used to describe the negative reactions that can occur toward virtual characters. It is a feeling of repulsion produced by artificial agents that appear close to human form but not quite real. This UV phenomenon was first hypothesized in the 1970s by robotics professor Mori [1970]. Mori’s original hypothesis states that as a robot’s appearance becomes more human, humans evoke more positive and empathetic responses, until a point where the response quickly becomes strongly negative resulting in feelings


of disgust, eeriness, and even fear. Once the robot’s appearance becomes less distinguishable from a human being, the emotional response becomes positive once again. This negative response has been attributed to many causes such as motion errors or lack of familiarity or a mismatch in realism between elements of character design. More recently, the UV hypothesis has been transferred to virtual humans in computer graphics, and has been explored directly in some studies [Bartneck et al. 2009, MacDorman et al. 2009]. Virtual faces in particular are difficult to reproduce as humans are very adept at perceiving and recognizing other faces and facial emotions. As discussed previously, the appearance of a character can be separated into texture, materials, shape, and lighting. Various studies have attempted to isolate these factors and independently examine the effect on appeal and UV. Wallraven et al. [2007] studied the perceived realism, recognition, sincerity, and aesthetics of real and computer-generated facial expressions using 2D filters to provide brush, cartoon, and illustration styles and found that stylization caused differences in recognition accuracy and perceived sincerity of expressions. Additionally, their realistic computer-generated faces scored high aesthetic rankings, which is contrary to the UV theory. Pejsa et al. [2013] additionally found no effect on appeal or lifelikeness between a character with human proportions and one with stylized geometry including large eyes, while other studies found realistic and cartoon depictions to be equally appealing when expressing personality [Ruhland et al. 2015] and when a user had agency over their movements [Kokkinara and McDonnell 2015]. In order to investigate the effect of stylization in more detail, McDonnell et al. [2012] created a range of appearances from abstract to realistic by altering the rendering style (texture, material, and lighting) of a realistically modeled male character while keeping the shape and motion constant. They analyzed subjective ratings of appeal and trustworthiness and found that the most realistic character was often rated as equally appealing or pleasant as the cartoon characters, and equally trustworthy in a truth-telling task. A drop in appeal occurred for characters in the middle of the scale (rated neither abstract nor realistic), which was attributed to the difficulty in categorizing these characters due to their uncommon appearance [Saygin et al. 2012]. Other studies of the UV that used still images generated by morphing between photographs and animated characters also found valleys in participant ratings of uncanniness for intermediate morphs [Hanson 2005, Seyama and Nagayama 2007, Green et al. 2008]. This idea was further developed in the categorization ambiguity hypothesis [Cheetham and Jancke 2013, Yamada et al. 2013], where it was shown that this response is more prominent when the morph is between a real human and an inanimate object or representation of a human.


Studies focusing on neurocognitive mechanisms attribute the negative evaluation to competing visual-category representations during recognition [Ferrey et al. 2015]. This effect was also investigated in a study by Carter et al. [2013], in which they created a realistic, a cartoon, and a robot female character and assessed subjective pleasantness ratings as well as eye-tracking as a psychophysiological measure. Contrary to the UV theory, they found higher ratings of unpleasantness for their cartoon than for their realistic character, and that fixations were affected by subjective perceptions of pleasantness.

Investigating yet more parameters, Zell et al. [2015] independently examined the dimensions of shape, texture, material, and lighting by creating a range of stimuli of characters with various levels of realism and stylization (Figure 4.15, left). Their study identified that the shape of the character's face is the main descriptor of realism, and that material increases realism only for realistic shapes. They also found that strong mismatches in stylization between material and shape made characters unappealing and eerie; in particular, abstract shapes with realistic materials were perceived as highly eerie, validating the design choices of some horror movies featuring living puppets. Finally, blurring or stylizing a realistic texture can achieve a make-up effect, increasing character appeal and attractiveness without reducing realism.

Figure 4.15  Left: Examples of manipulating material (y-axis) and shape (x-axis) to vary character realism and appeal [Zell et al. 2015]. Right: Examples of brightness and shadow alterations on cartoon characters displaying emotion that were shown to change the perceived intensity of emotion [based on Wisessing et al. 2020].


The opposite was found in a study on body stylization, where stylization of the body shape predicted appeal ratings rather than improvements to render quality [Fleming et al. 2016]. More recently, Wisessing et al. [2020] carried out an in-depth analysis of the effect of lighting on appeal, particularly brightness and shadows, and found that increasing the brightness of the key-light or lowering the key-to-fill ratio (lighter shadows) increased appeal ratings (Figure 4.15, right). They also found little effect of key-light brightness on eeriness, but reported reduced eeriness as a consequence of lightening the shadows, which could be used to reduce the UV effects of virtual characters. However, shadow lightening did not improve appeal for characters with a realistic appearance, and thus key-light brightness alone should be used to enhance appeal for such characters.

Several studies in immersive VR have also examined the effect of character appearance on viewer responses, focusing on co-presence, that is, the sense that one is present and engaged in an interpersonal space with the character [Biocca 1997, Garau et al. 2003]. While some evidence confirms the importance of realistic appearance [Nowak 2001, Zibrek and McDonnell 2019], other work places less importance on it [Slater and Steed 2002, Garau et al. 2003]. On the other hand, a mismatch between the realism of behavior and appearance has often been shown to lower the feeling of co-presence [Bailenson et al. 2005]. There are a number of reasons why mismatches may cause negative effects on the viewer. A mismatch between the physical and emotional states of a character violates expectations and can thus result in a breakdown in how users experience agents [Vinayagamoorthy et al. 2006].

4.5 Summary

Technical advancements are increasingly pushing the boundaries of how agents are designed and developed, the capabilities of these agents, and their use in human environments. The rapid development of real-time rendering technologies has enabled incredibly detailed, high-quality virtual character appearances (Figure 4.16), often reaching photorealism [Seymour et al. 2017, Epic Games 2018]. Deep learning is also improving the ease and speed with which characters can be created, even from a single photograph [Yamaguchi et al. 2018]. Additionally, animation and behaviors are starting to become easier and less expensive to create, making virtual human technologies more accessible to a wider audience than ever before. With these advancements comes the increasing use of characters across different domains such as education, sales, therapy, entertainment, social media, and virtual and augmented reality.

New methods are also emerging for the construction of physical robots.


Figure 4.16  State-of-the-art real-time virtual humans in Unreal Engine 4 created by 3Lateral in collaboration with Cubic Motion, Epic Games, Tencent, and Vicon. Left: Siren demo. Right: virtual replica of the actor Andy Serkis. With permission from Epic Games © 2020.

Rapid fabrication methods, such as 3D printing, have led to the development of new robot morphologies, including 3D printable robots inspired by "origami" [Onal et al. 2014] and robots with soft skin that can change appearance and texture to communicate internal states to the user [Hu et al. 2018]. Mixed reality technologies are also being utilized to facilitate human interaction with robots, including displaying cues that communicate the motion intent [Walker et al. 2018] and the field of view [Hedayati et al. 2018] of the robot. Finally, robots are increasingly being integrated into human environments across different domains, including manufacturing [Sauppé and Mutlu 2015], education [Belpaeme et al. 2018, Michaelis and Mutlu 2018], food services [Jennings and Figliozzi 2019], hospitality [Tussyadiah and Park 2018], surveillance [Inbar and Meyer 2019], and healthcare [Mutlu and Forlizzi 2008, Miseikis et al. 2020]. As applications proliferate, we will gain a better understanding of how the design space for agent appearance is utilized to support each application domain, how the features of this space affect user perceptions of and experience with these agents, and how the appearance of robotic agents might be designed to support personalization, customization, and environmental fit.


In this chapter, we have shown that the choice of appearance can have implications for human interactions in a number of ways, including changes to the perception of personality, emotion, trust, and confidence. Studies have shown that the many factors that constitute the final appearance of an agent, such as the design metaphor, the modality of representation, and the methods of agent construction, including modeling, texturing, materials, and even lighting, have different effects on how people perceive and respond to it. This multidimensionality has the drawback that some factors might cancel each other out or amplify each other, leading to inconsistent conclusions. Additionally, more frequent exposure to agents and increasing technological sophistication may continuously change the way we perceive them, much like how we are becoming more and more sensitive to poor visual effects in movies [Tinwell et al. 2011]. The need for understanding the implications of different agent appearances has therefore never been greater.

References

S. Al Moubayed, J. Beskow, G. Skantze, and B. Granström. 2012. Furhat: A back-projected human-like robot head for multiparty human-machine interaction. In A. Esposito, A. M. Esposito, A. Vinciarelli, R. Hoffmann, and V. C. Müller (Eds.), Cognitive Behavioural Systems. Lecture Notes in Computer Science, vol 7403. Springer, Berlin, Heidelberg, 114–130. DOI: https://doi.org/10.1007/978-3-642-34584-5_9. K. Anjyo. 2018. Blendshape facial animation. In Handbook of Human Motion. Springer International Publishing, Cham, IL, 2145–2155. DOI: https://doi.org/10.1007/978-3-319-14418-4_2. J. N. Bailenson, K. R. Swinth, C. L. Hoyt, S. Persky, A. Dimov, and J. Blascovich. 2005. The independent and interactive effects of embodied-agent appearance and behavior on self-report, cognitive, and behavioral markers of copresence in immersive virtual environments. Presence 14, 4, 379–393. W. A. Bainbridge, J. W. Hart, E. S. Kim, and B. Scassellati. 2011. The benefits of interactions with physically present robots over video-displayed agents. Int. J. Soc. Robot. 3, 1, 41–52. DOI: https://doi.org/10.1007/s12369-010-0082-7. D. Banakou, P. D. Hanumanthu, and M. Slater. 2016. Virtual embodiment of white people in a black virtual body leads to a sustained reduction in their implicit racial bias. Front. Hum. Neurosci. 10, 601. ISSN 1662-5161. https://www.frontiersin.org/article/10.3389/fnhum.2016.00601. DOI: https://doi.org/10.3389/fnhum.2016.00601. D. Baraff and A. Witkin. 1998. Large steps in cloth simulation. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’98. Association for Computing Machinery, New York, NY, 43–54. DOI: https://doi.org/10.1145/280814.280821. C. Bartneck, T. Kanda, H. Ishiguro, and N. Hagita. 2009. My robotic doppelgänger—A critical look at the uncanny valley. In Proceedings of Robot and Human Interactive Communication, 269–276. DOI: https://doi.org/10.1109/ROMAN.2009.5326351. R. Beira, M. Lopes, M. Praça, J. Santos-Victor, A. Bernardino, G. Metta, F. Becchi, and R. Saltarén. 2006. Design of the robot-cub (iCub) head. In Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006. IEEE, 94–100. T. Belpaeme, J. Kennedy, A. Ramachandran, B. Scassellati, and F. Tanaka. 2018. Social robots for education: A review. Sci. Robot. 3, 21. DOI: https://doi.org/10.1126/scirobotics.aat5954.


P. Bérard, D. Bradley, M. Nitti, T. Beeler, and M. Gross. 2014. High-quality capture of eyes. ACM Trans. Graph. 33, 6. ISSN 0730-0301. DOI: https://doi.org/10.1145/2661229. 2661285. C. L. Bethel and R. R. Murphy. 2007. Survey of non-facial/non-verbal affective expressions for appearance-constrained robots. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38, 1, 83–92. T. W. Bickmore, L. M. Pfeifer, and B. W. Jack. 2009. Taking the time to care: Empowering low health literacy hospital patients with virtual nurse agents. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1265–1274. DOI: https://doi. org/10.1145/1518701.1518891. F. Biocca. 1997. The cyborg’s dilemma: Progressive embodiment in virtual environments. J. Comput.-Mediat. Commun. 3, 2, 12–27. DOI: https://doi.org/10.1111/j.1083-6101.1997.tb 00070.x. M. Blow, K. Dautenhahn, A. Appleby, C. L. Nehaniv, and D. Lee. 2006a. The art of designing robot faces: Dimensions for human–robot interaction. In Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human–Robot Interaction, 331–332. DOI: https://doi.org/10. 1145/1121241.1121301. M. Blow, K. Dautenhahn, A. Appleby, C. L. Nehaniv, and D. C. Lee. 2006b. Perception of robot smiles and dimensions for human–robot interaction design. In ROMAN 2006—The 15th IEEE International Symposium on Robot and Human Interactive Communication. IEEE, 469–474. T. J. Burleigh, J. R. Schoenherr, and G. L. Lacroix. 2013. Does the uncanny valley exist? An empirical test of the relationship between eeriness and the human likeness of digitally created faces. Comput. Hum. Behav. 29, 3, 759–771. DOI: https://doi.org/10.1016/j.chb.2012. 11.021. E. J. Carter, M. Mahler, and J. K. Hodgins. 2013. Unpleasantness of animated characters increases viewer attention to faces. In Proceedings of the ACM Symposium in Applied Perception, 35–40. DOI: https://doi.org/10.1145/2492494.2502059. J. Cassell. 2000. Embodied conversational interface agents. Commun. ACM 43, 4, 70–78. DOI: https://doi.org/10.1145/332051.332075. J. Cassell. 2001. Embodied conversational agents: Representation and intelligence in user interfaces. AI Magazine 22, 4, 67–87. DOI: https://doi.org/10.1609/aimag.v22i4. 1593. E. Catmull. 1974. A Subdivision Algorithm for Computer Display of Curved Surfaces. Ph.D. thesis, Department of Computer Science, University of Utah. T. Chaminade, J. Hodgins, and M. Kawato. 2007. Anthropomorphism influences perception of computer-animated characters’ actions. Soc. Cogn. Affect. Neurosci. 2, 3, 206–216. DOI: https://doi.org/10.1093/scan/nsm017. M. Cheetham and L. Jancke. 2013. Perceptual and category processing of the uncanny valley hypothesis’ dimension of human likeness: Some methodological issues. J. Vis. Exp. 76, 4375. DOI: https://doi.org/10.3791/4375.


B. Colligan. 2011. How the Knowledge Navigator Video Came About. http://www.dubberly. com/articles/how-the-knowledge-navigator-video-came-about.html (accessed June 30, 2020). J. E. Cutting and L. T. Kozlowski. 1977. Recognizing friends by their walk: Gait perception without familiarity cues. Bullet. Psychon. Soc. 9, 5, 353–356. Z. Dai and K. F. MacDorman. 2018. The doctor’s digital double: How warmth, competence, and animation promote adherence intention. PeerJ Comput. Sci. 4, e168. DOI: https://doi. org/10.7717/peerj-cs.168. K. Dautenhahn, C. L. Nehaniv, M. L. Walters, B. Robins, H. Kose-Bagci, N. Assif Mirza, and 0-M. Blow. 2009. Kaspar—A minimally expressive humanoid robot for human–robot interaction research. Appl. Bionics Biomech. 6, 3, 4, 369–397. DOI: https://doi.org/10.1080/ 11762320903123567. E. Deng, B. Mutlu, and M. J. Mataric. 2019. Embodiment in socially interactive robots. Found. Trends Robot. 7, 4 (Jan 2019), 251–356. DOI: https://doi.org/10.1561/ 2300000056. C. Diana and A. L. Thomaz. 2011. The shape of Simon: Creative design of a humanoid robot shell. In CHI’11 Extended Abstracts on Human Factors in Computing Systems, 283–298. DOI: https://doi.org/10.1145/1979742.1979648. Didimo, 2019. The Breathtaking Reality of Your Digital You. https://mydidimo.com/. C. F. DiSalvo, F. Gemperle, J. Forlizzi, and S. Kiesler. 2002. All robots are not created equal: The design and perception of humanoid robot heads. In Proceedings of the 4th Conference on Designing Interactive Systems: Processes, Practices, Methods, and Techniques, 321–326. DOI: https://doi.org/10.1145/778712.778756. Epic Games, Inc. 2018. Siren. https://www.3lateral.com/projects/siren.html. F. Eyssel and F. Hegel. 2012. (S)he’s got the look: Gender stereotyping of robots. J. Appl. Soc. Psychol. 42, 9, 2213–2230. DOI: https://doi.org/10.1111/j.1559-1816.2012.00937.x. A. E. Ferrey, T. J. Burleigh, and M. J. Fenske. 2015. Stimulus-category competition, inhibition, and affective devaluation: A novel account of the uncanny valley. Front. Psychol. 6, 249. DOI: https://doi.org/10.3389/fpsyg.2015.00249. Y. Ferstl and R. McDonnell. 2018. A perceptual study on the manipulation of facial features for trait portrayal in virtual agents. In Proceedings of International Conference on Intelligent Virtual Agents (IVA), 281–288. DOI: https://doi.org/10.1145/3267851.3267891. Y. Ferstl, E. Kokkinara, and R. McDonnell. 2017. Facial features of non-player creatures can influence moral decisions in video games. ACM Trans. Appl. Percept. 15, 1, 4:1–4:12. ISSN 1544-3558. DOI: https://doi.org/10.1145/3129561. R. Fleming, B. J. Mohler, J. Romero, M. J. Black, and M. Breidt. 2016. Appealing female avatars from 3D body scans: Perceptual effects of stylization. In International Conference on Computer Graphics Theory and Applications (GRAPP). DOI: https://doi.org/10.5220/ 0005683903330343. R. Fribourg, F. Argelaguet, A. Lécuyer, and L. Hoyet. 2020. Avatar and sense of embodiment: Studying the relative preference between appearance, control and point of view.


IEEE Trans. Vis. Comput. Graph. 26, 5, 2062–2072. DOI: https://doi.org/10.1109/TVCG.2020. 2973077. M. Garau, M. Slater, V. Vinayagamoorthy, A. Brogni, A. Steed, and M. A. Sasse. 2003. The impact of avatar realism and eye gaze control on perceived quality of communication in a shared immersive virtual environment. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 529–536. DOI: https://doi.org/10.1145/642611. 642703. M. Garau, M. Slater, D.-P. Pertaub, and S. Razzaque. 2005. The responses of people to virtual humans in an immersive virtual environment. Presence: Teleoperat. Virt. Environ. 14, 1, 104–116. T. Geller. 2008. Overcoming the uncanny valley. IEEE Comput. Graph. Appl. 28, 4, 11–17. DOI: https://doi.org/10.1109/mcg.2008.79. R. Gockley, A. Bruce, J. Forlizzi, M. Michalowski, A. Mundell, S. Rosenthal, B. Sellner, R. Simmons, K. Snipes, A. C. Schultz, and J. Wang. 2005. Designing robots for long-term social interaction. In 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 1338–1343. J. Goetz, S. Kiesler, and A. Powers. 2003. Matching robot appearance and behavior to tasks to improve human–robot cooperation. In The 12th IEEE International Workshop on Robot and Human Interactive Communication, 2003. Proceedings. ROMAN 2003. IEEE, 55–60. DOI: https://doi.org/10.1109/ROMAN.2003.1251796. H. Gouraud. 1971. Continuous shading of curved surfaces. IEEE Trans. Comput. C-20 6, 623–629. DOI: https://doi.org/10.1109/T-C.1971.223313. R. D. Green, K. F. MacDorman, C.-C. Ho, and S. Vasudevan. 2008. Sensitivity to the proportions of faces that vary in human likeness. Comput. Hum. Behav. 24, 5, 2456–2474. DOI: https://doi.org/10.1016/j.chb.2008.02.019. P. Hamet and J. Tremblay. 2017. Artificial intelligence in medicine. Metabolism 69, S36–S40. DOI: https://doi.org/10.1016/j.metabol.2017.01.011. D. Hanson. 2005. Expanding the aesthetics possibilities for humanlike robots. In Proceedings of IEEE Humanoid Robotics Conf., Special Session on the Uncanny Valley. H. Hedayati, M. Walker, and D. Szafir. 2018. Improving collocated robot teleoperation with augmented reality. In Proceedings of the 2018 ACM/IEEE International Conference on Human–Robot Interaction, 78–86. F. Hegel, F. Eyssel, and B. Wrede. 2010. The social robot ‘Flobi’: Key concepts of industrial design. In 19th International Symposium in Robot and Human Interactive Communication. IEEE, 107–112. U. Hess, R. B. Adams, and R. E. Kleck. 2004. Facial appearance, gender, and emotion expression. Emotion 4, 4, 378–388. DOI: https://doi.org/10.1037/1528-3542.4.4.378. J. K. Hodgins, J. F. O’Brien, and J. Tumblin. 1998. Perception of human motion with different geometric models. IEEE Trans. Vis. Comput. Graph. 4, 4, 101–113. http://graphics.cs. berkeley.edu/papers/Hodgins-PHM-1998-12/.


G. Hoffman, O. Zuckerman, G. Hirschberger, M. Luria, and T. Shani-Sherman. 2015. Design and evaluation of a peripheral robotic conversation companion. In 2015 10th ACM/IEEE International Conference on Human–Robot Interaction (HRI). IEEE, 3–10. M. Hoque, M. Courgeon, J.-C. Martin, B. Mutlu, and R. W. Picard. 2013. Mach: My automated conversation coach. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 697–706. L. Hu, S. Saito, L. Wei, K. Nagano, J. Seo, J. Fursund, I. Sadeghi, C. Sun, Y.-C. Chen, and H. Li. Nov. 2017. Avatar digitization from a single image for real-time rendering. ACM Trans. Graph. (TOG), 36, 6, 195:1–195:14. ISSN 0730-0301. Y. Hu, Z. Zhao, A. Vimal, and G. Hoffman. 2018. Soft skin texture modulation for social robotics. In 2018 IEEE International Conference on Soft Robotics (RoboSoft). IEEE, 182–187. O. Inbar and J. Meyer. 2019. Politeness counts: Perceptions of peacekeeping robots. IEEE Trans. Hum.-Mach. Syst. 49, 3, 232–240. DOI: https://doi.org/10.1109/THMS.2019. 2900337. itSeez3D, 2020. Turn your mobile device into a powerful 3D scanner. https://itseez3d.com/. Y. Iwamura, M. Shiomi, T. Kanda, H. Ishiguro, and N. Hagita. 2011. Do elderly people prefer a conversational humanoid as a shopping assistant partner in supermarkets? In Proceedings of the 6th International Conference on Human–Robot Interaction, 449–456. DOI: https://doi.org/10.1145/1957656.1957816. D. Jennings and M. Figliozzi. 2019. Study of sidewalk autonomous delivery robots and their potential impacts on freight efficiency and travel. Transp. Res. Rec. 2673, 6, 317–326. DOI: https://doi.org/10.1177/0361198119849398. H. W. Jensen, S. R. Marschner, M. Levoy, and P. Hanrahan. 2001. A practical model for subsurface light transport. In Proceedings of SIGGRAPH, 511–518. DOI: https://doi.org/10.1145/ 383259.383319. J. Jimenez, V. Sundstedt, and D. Gutierrez. 2009. Screen-space perceptual rendering of human skin. ACM Trans. Appl. Percept. 6, 4, 23, 1–23:15. DOI: https://doi.org/10.1145/ 1609967.1609970. J. Jimenez, T. Scully, N. Barbosa, C. Donner, X. Alvarez, T. Vieira, P. Matts, V. Orvalho, D. Gutierrez, and T. Weyrich. 2010. A practical appearance model for dynamic facial color. ACM Trans. Graph. 29, 6, 141. DOI: https://doi.org/10.1145/1882261.1866167. G. Johansson. 1973. Visual perception of biological motion and a model for its analysis. Percept. Psychophys. 14, 2, 201–211. DOI: https://doi.org/10.3758/BF03212378. S. Junuzovic, K. Inkpen, J. Tang, M. Sedlins, and K. Fisher. 2012. To see or not to see: A study comparing four-way avatar, video, and audio conferencing for work. 31–34. DOI: http://doi.org/10.1145/2389176.2389181. J. T. Kajiya and T. L. Kay. 1989. Rendering fur with three dimensional textures. SIGGRAPH Comput. Graph. 23, 3, 271–280. DOI: https://doi.org/10.1145/74334.74361.


A. Kalegina, G. Schroeder, A. Allchin, K. Berlin, and M. Cakmak. 2018. Characterizing the design space of rendered robot faces. In Proceedings of the 2018 ACM/IEEE International Conference on Human–Robot Interaction, 96–104. DOI: https://doi.org/10.1145/3171221. 3171286. S. Kiesler, A. Powers, S. R. Fussell, and C. Torrey. 2008. Anthropomorphic interactions with a robot and robot-like agent. Soc. Cogn. 26, 2, 169–181. DOI: https://doi.org/10.1521/soco. 2008.26.2.169. K. Kilteni, R. Groten, and M. Slater. 2012. The sense of embodiment in virtual reality. Presence Teleoperat. Virt. Environ. 21. DOI: http://doi.org/10.1162/PRES_a_00124. E. Kokkinara and R. McDonnell. 2015. Animation realism affects perceived character appeal of a self-virtual face. In Proceedings of the 8th ACM SIGGRAPH Conference on Motion in Games. ACM, 221–226. DOI: https://doi.org/10.1145/2822013. 2822035. H. Kozima, M. P. Michalowski, and C. Nakagawa. 2009. Keepon. Int. J. Soc. Robot. 1, 1, 3–18. DOI: https://doi.org/10.1007/s12369-008-0009-8. L. T. Kozlowski and J. E. Cutting. 1977. Recognizing the sex of a walker from a dynamic point-light display. Percept. Psychophys. 21, 6, 575–580. DOI: https://doi.org/10.3758/ BF03198740. S. Kühn, T. R. Brick, B. C. Müller, and J. Gallinat. 2014. Is this car looking at you? How anthropomorphism predicts fusiform face area activation when seeing cars. PLoS One 9, 12, e113885. DOI: https://doi.org/10.1371/journal.pone.0113885. M. K. Lee, J. Forlizzi, P. E. Rybski, F. Crabbe, W. Chung, J. Finkle, E. Glaser, and S. Kiesler. 2009. The Snackbot: Documenting the design of a robot for long-term human–robot interaction. In Proceedings of the 4th ACM/IEEE International Conference on Human Robot Interaction, 7–14. DOI: https://doi.org/10.1145/1514095.1514100. M. K. Lee, S. Kiesler, and J. Forlizzi. 2010. Receptionist or information kiosk: How do people talk with a robot? In Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work, 31–40. DOI: https://doi.org/10.1145/1718918.1718927. J. C. Lester, C. B. Callaway, B. Stone, and S. G. Towns. 1997. Mixed initiative problem solving with animated pedagogical agents. In Workshop on Pedagogical Agents, volume 19. J. Li, R. Kizilcec, J. Bailenson, and W. Ju. 2016. Social robots and virtual agents as lecturers for video instruction. Comput. Hum. Behav. 55, 1222–1230. DOI: https://doi.org/10.1016/ j.chb.2015.04.005. T. Liu, A. W. Bargteil, J. F. O’Brien, and L. Kavan. 2013. Fast simulation of mass-spring systems. ACM Trans. Graph. 32, 6, 209:1–7. http://cg.cis.upenn.edu/publications/Liu-FMS. Proceedings of ACM SIGGRAPH Asia 2013, Hong Kong. I. Lütkebohle, F. Hegel, S. Schulz, M. Hackel, B. Wrede, S. Wachsmuth, and G. Sagerer. 2010. The Bielefeld anthropomorphic robot head “ Flobi.” In 2010 IEEE International Conference on Robotics and Automation. IEEE, 3384–3391.


K. F. MacDorman, R. D. Green, C.-C. Ho, and C. T. Koch. 2009. Too real for comfort? Uncanny responses to computer generated faces. Comput. Human Behav. 25, 3, 695–710. DOI: https://doi.org/10.1016/j.chb.2008.12.026. S. Marschner and P. Shirley. 2016. Fundamentals of Computer Graphics. CRC Press, Boca Raton, FL. S. R. Marschner, H. W. Jensen, M. Cammarano, S. Worley, and P. Hanrahan. 2003. Light scattering from human hair fibers. ACM Trans. Graph. 22, 3, 780–791. ISSN 0730-0301. T. Masson. 2007. CG 101: A Computer Graphics Industry Reference. Digital Fauxtography. S. McCloud. 1993. Understanding Comics: The Invisible Art. Tundra Publishing, Northampton, MA. R. McDonnell, S. Jörg, J. K. Hodgins, F. Newell, and C. O’Sullivan. 2009a. Evaluating the effect of motion and body shape on the perceived sex of virtual characters. ACM Trans. Appl. Percept. (TAP), 5, 4, 20. DOI: https://doi.org/10.1145/1462048.1462051. R. McDonnell, S. Jörg, J. McHugh, F. N. Newell, and C. O’Sullivan. 2009b. Investigating the role of body shape on the perception of emotion. ACM Trans. Appl. Percept. (TAP), 6, 3, 14. DOI: https://doi.org/10.1145/1577755.1577757. R. McDonnell, M. Larkin, B. Hernandez, I. Rudomin, and C. O’Sullivan. 2009c. Eyecatching crowds: Saliency based selective variation. ACM Trans. Graph. 28, 3, 55:1–55:10. DOI: https://doi.org/10.1145/1531326.1531361. R. McDonnell, M. Breidt, and H. H. Bülthoff. 2012. Render me real? Investigating the effect of render style on the perception of animated virtual humans. ACM Trans. Graph. 31, 4, 91. DOI: https://doi.org/10.1145/2185520.2185587. J. E. Michaelis and B. Mutlu. 2018. Reading socially: Transforming the in-home reading experience with a learning-companion robot. Sci. Robot. 3, 21. DOI: https://doi.org/10. 1126/scirobotics.aat5999. J. Miseikis, P. Caroni, P. Duchamp, A. Gasser, R. Marko, N. Miseikiene, F. Zwilling, C. de Castelbajac, L. Eicher, M. Fruh, and H. Fruh. 2020. Lio—A personal robot assistant for human–robot interaction and care applications. IEEE Robot. Autom. Lett. 5, 4, 5339–5346. DOI: https://doi.org/10.1109/LRA.2020.3007462. M. Mori. 1970. The uncanny valley. Energy 7, 4, 33–35. J. Mumm and B. Mutlu. 2011. Designing motivational agents: The role of praise, social comparison, and embodiment in computer feedback. Comput. Hum. Behav. 27, 5, 1643–1650. DOI: https://doi.org/10.1016/j.chb.2011.02.002. B. Mutlu and J. Forlizzi. 2008. Robots in organizations: The role of workflow, social, and environmental factors in human–robot interaction. In 2008 3rd ACM/IEEE International Conference on Human–Robot Interaction (HRI). IEEE, 287–294. DOI: https://doi.org/10.1145/ 1349822.1349860. K. Nagano, J. Seo, J. Xing, L. Wei, Z. Li, S. Saito, A. Agarwal, J. Fursund, and H. Li, 2018. paGAN: Real-time avatars using dynamic textures. ACM Trans. Graph. 37, 6, 258–1. DOI: https://doi.org/10.1145/3272127.3275075.


C. Nass, Y. Moon, and N. Green. 1997. Are machines gender neutral? Gender-stereotypic responses to computers with voices. J. Appl. Soc. Psychol. 27, 10, 864–876. DOI: https://doi. org/10.1111/j.1559-1816.1997.tb00275.x. F. E. Nicodemus, J. C. Richmond, J. J. Hsia, I. W. Ginsberg, and T. Limperis. 1992. Geometrical considerations and nomenclature for reflectance. In Radiometry. Jones and Bartlett Publishers, Inc., Sudbury, MA, 94–145. ISBN 0867202947. S. Nishio, H. Ishiguro, and N. Hagita. 2007. Geminoid: Teleoperated android of an existing person. Humanoid Robots: New Developments, 14, 343–352. K. Nowak. 2001. The influence of anthropomorphism on social judgment in social virtual environments. In Annual Convention of the International Communication Association, Washington, DC. J.-H. Oh, D. Hanson, W.-S. Kim, Y. Han, J.-Y. Kim, and I.-W. Park. 2006. Design of android type humanoid robot Albert HUBO. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 1428–1433. DOI: https://doi.org/10.1109/IROS.2006. 281935. C. D. Onal, M. T. Tolley, R. J. Wood, and D. Rus. 2014. Origami-inspired printed robots. IEEE/ASME Trans. Mechatron. 20, 5, 2214–2221. DOI: https://doi.org/10.1109/TMECH.2014. 2369854. I. C. Orr. 1974. Puppet theatre in Asia. Asian Folklore Studies. 69–84. A. G. Ozkil, Z. Fan, S. Dawids, H. Aanes, J. K. Kristensen, and K. H. Christensen. 2009. Service robots for hospitals: A case study of transportation tasks in a hospital. In 2009 IEEE International Conference on Automation and Logistics. IEEE, 289–294. DOI: https://do i.org/10.1109/ICAL.2009.5262912. N. A. Palomares and E.-J. Lee. 2010. Virtual gender identity: The linguistic assimilation to gendered avatars in computer-mediated communication. J. Lang. Soc. Psychol. 29, 1, 5–23. DOI: http://doi.org/10.1177/0261927X09351675. Y. Pan and A. Steed. 2016. A comparison of avatar-, video-, and robot-mediated interaction on users trust in expertise. Front. Robot. AI 3, 12. DOI: http://doi.org/10.3389/frobt.2016. 00012. T. Pejsa, B. Mutlu, and M. Gleicher. 2013. Stylized and performative gaze for character animation. Comput. Graph. Forum 32, 2pt2, 143–152. https://onlinelibrary.wiley.com/doi/abs/ 10.1111/cgf.12034. DOI: http://doi.org/10.1111/cgf.12034. B. T. Phong. 1975. Illumination for computer generated pictures. Commun. ACM 18, 6, 311–317. DOI: https://doi.org/10.1145/360825.360839. Pinscreen, 2019. The most advanced AI-driven personalized avatars. https://www. pinscreen.com/. A. Powers and S. Kiesler. 2006. The advisor robot: Tracing people’s mental model from a robot’s physical attributes. In Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human–Robot Interaction, 218–225. DOI: https://doi.org/10.1145/1121241.1121280. J. Pransky. 2001. AIBO—The no. 1 selling service robot. Ind. Robot Int. J. 28, 1, 24–26. DOI: https://doi.org/10.1108/01439910110380406.


T. J. Prescott, B. Mitchinson, and S. Conran. 2017. Miro: An animal-like companion robot with a biomimetic brain-based control system. In Proceedings of the Companion of the 2017 ACM/IEEE International Conference on Human–Robot Interaction, 50–51. DOI: https: //doi.org/10.1145/3029798.3036660. D. Romano, C. Pfeiffer, A. Maravita, and O. Blanke. 2014. Illusory self-identification with an avatar reduces arousal responses to painful stimuli. Behav. Brain Res. 261, 275–281. ISSN 01664328. DOI: http://doi.org/10.1016/j.bbr.2013.12.049. K. Ruhland, K. Zibrek, and R. McDonnell. 2015. Perception of personality through eye gaze of realistic and cartoon models. In Proceedings of Symp. on Applied Perception. ACM, 19–23. DOI: https://doi.og/10.1145/2804408.2804424. S. Saito, L. Wei, L. Hu, K. Nagano, and H. Li. 2017. Photorealistic facial texture inference using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5144–5153. DOI: https://doi.org/10.1109/CVPR.2017.250. Y. Sakagami, R. Watanabe, C. Aoyama, S. Matsunaga, N. Higaki, and K. Fujimura. 2002. The intelligent ASIMO: System overview and integration. In IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 3, 2478–2483. DOI: 10.1109/IRDS.2002. 1041641. A. Sauppé and B. Mutlu. 2015. The social impact of a robot co-worker in industrial settings. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3613–3622. DOI: https://doi.org/10.1145/2702123.2702181. A. P. Saygin, T. Chaminade, H. Ishiguro, J. Driver, and C. Frith. 2012. The thing that should not be: Predictive coding and the uncanny valley in perceiving human and humanoid robot actions. Soc. Cogn. Affect. Neurosci. 7, 4, 413–422. DOI: https://doi.org/10.1093/scan/ nsr025. J. Scarce. 1983. Karagoz Shadow Puppets of Turkey. https://openlibrary.org/works/OL 2875191W/Karag%C3%B6z_shadow_puppets_of_Turkey. J. Sculley. 1989. The relationship between business and higher education: A perspective on the 21st century. Commun. ACM 32, 9, 1056–1061. DOI: https://doi.org/10.1145/66451.66452. J. Seyama and R. S. Nagayama. 2007. The uncanny valley: Effect of realism on the impression of artificial human faces. Presence: Teleop. Virt. Environ. 16, 4, 337–351. DOI: https: //doi.org/10.1162/pres.16.4.337. M. Seymour, C. Evans, and K. Libreri. 2017. Meet Mike: Epic avatars. In ACM SIGGRAPH 2017 VR Village, 1–2. M. Shayganfar, C. Rich, and C. L. Sidner. 2012. A design methodology for expressing emotion on robot faces. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 4577–4583. M. Siegel, C. Breazeal, and M. I. Norton. 2009. Persuasive robotics: The influence of robot gender on human behavior. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2563–2568. DOI: https://doi.org/10.1109/IROS.2009.5354116. S. Simon, C. William, and G. Jan. 1999. Enlightened automata. In The Sciences in Enlightened Europe. University of Chicago Press, Chicago and London.


M. Slater and A. Steed. 2002. Meeting people virtually: Experiments in shared virtual environments. In The Social Life of Avatars. Springer, 146–171. DOI: https://doi.org/10.1007/9781-4471-0277-9_9. A. Steed, Y. Pan, F. Zisch, and W. Steptoe. 2016. The impact of a self-avatar on cognitive load in immersive virtual reality. In 2016 IEEE Virtual Reality (VR). 67–76. DOI: https://doi.org/ 10.1109/VR.2016.7504689. S. R. Pallak, E. Murroni, and J. Koch. 1983. Communicator attractiveness and expertise, emotional versus rational appeals, and persuasion: A heuristic versus systematic processing interpretation. Soc. Cogn. 2, 2, 122–141. DOI: https://doi.org/10.1521/soco.1983. 2.2.122. D. Szafir, B. Mutlu, and T. Fong. 2015. Communicating directionality in flying robots. In 2015 10th ACM/IEEE International Conference on Human–Robot Interaction (HRI). IEEE, 19–26. DOI: http://dx.doi.org/10.1145/2696454.2696475. J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. 2016. Face2face: Realtime face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2387–2395. DOI: https://doi.org/10.1109/CVPR .2016.262. F. Thomas and O. Johnston. 1995. The Illusion of Life: Disney Animation. Hyperion: New York. A. Tinwell, M. Grimshaw, D. A. Nabi, and A. Williams. 2011. Facial expression of emotion and perception of the uncanny valley in virtual characters. Comput. Hum. Behav. 27, 2, 741–749. DOI: https://doi.org/10.1016/j.chb.2010.10.018. I. Torre, E. Carrigan, K. McCabe, R. McDonnell, and N. Harte. 2018. Survival at the museum: A cooperation experiment with emotionally expressive virtual characters. In Proceedings of the 2018 on International Conference on Multimodal Interaction. ACM, 423– 427. DOI: https://doi.org/10.1145/3242969.3242984. I. Torre, E. Carrigan, R. McDonnell, K. Domijan, K. McCabe, and N. Harte. 2019. The effect of multimodal emotional expression and agent appearance on trust in human–agent interaction. In Motion, Interaction and Games, MIG ’19. Association for Computing Machinery, New York. DOI: https://doi.org/10.1145/3359566.3360065. I. P. Tussyadiah and S. Park. 2018. Consumer evaluation of hotel service robots. In Information and Communication Technologies in Tourism 2018. Springer, Cham, 308–320. DOI: https://doi.org/10.1007/978-3-319-72923-7_24. A. van Breemen, X. Yan, and B. Meerbeek. 2005. iCat: An animated user-interface robot with personality. In Proceedings of the Fourth International Joint Conference on Autonomous Agents and Multiagent Systems, 143–144. DOI: https://doi.org/10.1145/1082473.1082823. A. J. van Breemen. 2004. Animation engine for believable interactive user-interface robots. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No. 04CH37566), volume 3, IEEE, 2873–2878. DOI: https://doi.org/10.1109/IROS.2004. 1389845.


G. Veletsianos. 2010. Contextually relevant pedagogical agents: Visual appearance, stereotypes, and first impressions and their impact on learning. Comput. Educ. 55, 2, 576–585. ISSN 0360-1315. DOI: https://doi.org/10.1016/j.compedu.2010.02.019. V. Vinayagamoorthy, M. Gillies, A. Steed, E. Tanguy, X. Pan, C. Loscos, and M. Slater. 2006. Building expression into virtual characters. In B. Wyvill and A. Wilkie (Eds.), Eurographics 2006 - State of the Art Reports. The Eurographics Association. DOI: http://do i.org/10.2312/egst.20061052. M. Volante, S. V. Babu, H. Chaturvedi, N. Newsome, E. Ebrahimi, T. Roy, S. B. Daily, and T. Fasolino. 2016. Effects of virtual human appearance fidelity on emotion contagion in affective inter-personal simulations. IEEE Trans. Vis. Comput. Graph. 22, 4, 1326–1335. DOI: https://doi.org/10.1109/TVCG.2016.2518158. K. Wada and T. Shibata. 2007. Living with seal robots—Its sociopsychological and physiological influences on the elderly at a care house. IEEE Trans. Robot. 23, 5, 972–980. DOI: https://doi.org/10.1109/TRO.2007.906261. K. Wada, T. Shibata, T. Saito, K. Sakamoto, and K. Tanie. 2005. Psychological and social effects of one year robot assisted activity on elderly people at a health service facility for the aged. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation. IEEE, 2785–2790. DOI: https://doi.org/10.1109/ROBOT.2005. 1570535. M. Walker, H. Hedayati, J. Lee, and D. Szafir. 2018. Communicating robot motion intent with augmented reality. In Proceedings of the 2018 ACM/IEEE International Conference on Human–Robot Interaction, 316–324. DOI: https://doi.org/10.1145/3171221.3171253. C. Wallraven, H. H. Bülthoff, D. W. Cunningham, J. Fischer, and D. Bartz. 2007. Evaluation of real-world and computer-generated stylized facial expressions. ACM Trans. Appl. Percept. 4, 3. DOI: https://doi.org/10.1145/1278387.1278390. Y. Wang, J. Geigel, and A. Herbert. 2013. Reading personality: Avatar vs. human faces. In Humaine Association Conference on Affective Computing and Intelligent Interaction, 479–484. DOI: http://doi.org/10.1109/ACII.2013.85. M. Watanabe, K. Ogawa, and H. Ishiguro. 2015. Can androids be salespeople in the real world? In Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems, 781–788. DOI: https://doi.org/10.1145/2702613. 2702967. D. Whitley. 2012. The idea of nature in Disney animation: From Snow White to WALL-E. Ashgate Publishing, Ltd. M. Wise, M. Ferguson, D. King, E. Diehr, and D. Dymesich. 2016. Fetch and freight: Standard platforms for service robot applications. In Workshop on Autonomous Mobile Service Robots. P. Wisessing, K. Zibrek, D. W. Cunningham, J. Dingliana, and R. McDonnell. 2020. Enlighten me: Importance of brightness and shadow for character emotion and appeal. ACM Trans. Graph. 39, 3. ISSN 0730-0301. DOI: https://doi.org/10.1145/3383195.


Y. Yamada, T. Kawabe, and K. Ihaya. 2013. Categorization difficulty is associated with negative evaluation in the “uncanny valley” phenomenon. Jpn. Psychol. Res. 55, 20–32. DOI: http://doi.org/10.1111/j.1468-5884.2012.00538.x. S. Yamaguchi, S. Saito, K. Nagano, Y. Zhao, W. Chen, K. Olszewski, S. Morishima, and H. Li. 2018. High-fidelity facial reflectance and geometry inference from an unconstrained image. ACM Trans. Graph. (TOG) 37, 4, 1–14. DOI: https://doi.org/10.1145/ 3197517.3201364. B. M. Yamauchi. 2004. PackBot: A versatile platform for military robotics. In Unmanned Ground Vehicle Technology VI, Vol. 5422. 228–237. International Society for Optics and Photonics. N. Yee and J. Bailenson. 2007. The Proteus effect: The effect of transformed selfrepresentation on behavior. Hum. Commun. Res. 33, 271–290. DOI: http://doi.org/10.1111/ j.1468-2958.2007.00299.x. N. Yee and J. N. Bailenson. 2009. The difference between being and seeing: The relative contribution of self-perception and priming to behavioral changes via digital selfrepresentation. Media Psychol. 12, 2, 195–209. DOI: http://doi.org/10.1080/152132609 02849943. N. Yee, N. Ducheneaut, M. Yao, and L. Nelson. 2011. Do men heal more when in drag? Conflicting identity cues between user and avatar. In CHI ’11: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 773–776. DOI: https://doi.org/10.1145/ 1978942.1979054. Y. Yokota. 2009. A historical overview of Japanese clocks and karakuri. In International Symposium on History of Machines and Mechanisms. Springer, Dordrecht, 175–188. DOI: https://doi.org/10.1007/978-1-4020-9485-9_13. E. Zell, C. Aliaga, A. Jarabo, K. Zibrek, D. Gutierrez, R. McDonnell, and M. Botsch. 2015. To stylize or not to stylize?: The effect of shape and material stylization on the perception of computer-generated faces. ACM Trans. Graph. 34, 6, 184, 1–12. DOI: https://doi.org/10. 1145/2816795.2818126. E. Zell, K. Zibrek, and R. McDonnell. 2019. Perception of virtual characters. In ACM SIGGRAPH 2019 Courses, SIGGRAPH ’19. Association for Computing Machinery, New York. DOI: https://doi.org/10.1145/3305366.3328101. K. Zibrek and R. McDonnell. 2019. Social presence and place illusion are affected by photorealism in embodied VR. In Motion, Interaction and Games, MIG ’19. Association for Computing Machinery, New York. DOI: https://doi.org/10.1145/3359566.3360064. K. Zibrek, L. Hoyet, K. Ruhland, and R. McDonnell. 2015. Exploring the effect of motion type and emotions on the perception of gender in virtual humans. ACM Trans. Appl. Percept. (TAP) 12, 3, 11, 1–20. DOI: http://dx.doi.org/10.1145/2767130.

5 Natural Language Understanding in Socially Interactive Agents

Roberto Pieraccini

5.1 Natural Language Understanding in Interactive Agents

Natural language has fascinated and challenged research scientists and technologists since the early days of computer science. Claude Shannon's 1948 seminal paper on information theory uses language as an example of complex information and discusses its statistical properties based on the modern notion of n-grams [Shannon 1948]. In 1950, Alan Turing [1950] proposed what today is known as the Turing test, that is, a criterion to determine the level of intelligence of a machine based on its capability to converse in natural language with humans. Since then, there has been a proliferation of studies trying to unlock the complexity of natural language for various purposes, including machine translation, automatic summarization, question answering, sentiment analysis, market intelligence, grammar checking, and of course natural language understanding (NLU).

Besides the intellectual and scientific interest in it as a challenging AI problem, NLU occupies a prominent role in the realization of spoken language systems that belong to the category of virtual interactive agents. Figure 5.1 describes a classic conversational architecture for a virtual interactive agent. The user speaks to the system, where speaking may be just one of the modalities of interaction. Touch, gesture, and images or text on a display are other typical input/output modalities that can often be combined with speech. However, in this chapter, we consider only speech input and output. The user's speech is converted by a speech recognizer (ASR) into its textual transcription, that is, the string of words that were presumably spoken.


Figure 5.1  A reference architecture of a virtual interactive agent.

For the sake of completeness, we should note that modern speech recognizers can generate several alternative hypotheses ranked by confidence, known as an N-best list, and even a graph of word hypotheses, generally called a lattice. Even though the techniques described here can be applied to alternative speech recognition transcriptions, which can be re-ranked based on combined speech and language scores, we restrict our considerations to a single, first-best string of word hypotheses.

The NLU system converts a sequence of words into some actionable symbolic representation of the utterance meaning. The representation is actionable in the sense that subsequent processes can use it to interact with the external digital world, for instance through well-defined APIs, with the physical world through actuators (e.g., in home automation applications), or simply to define and generate a response back to the user.

The function of the dialog manager (DM) is to decide the next action that the agent should accomplish, based on the meaning representation of the input utterance and on contextual knowledge, derived, for instance, from the history of the conversation and from knowledge of the user's environment (e.g., their location or the time of day). The generation of a response back to the user is one of the potential actions that the DM can initiate. The response can, for instance, provide a spoken answer or request more information from the user. The DM sends a request to the natural language generation module (NLG) to generate a specific textual representation of the utterance to be spoken. The speech generation or text-to-speech (TTS) system finally generates an utterance based on the textual representation provided by the NLG.

To summarize, NLU is a fundamental module of a conversational interactive agent that converts a raw textual transcription of an utterance into a symbolic representation that is used by the DM to select and perform the next action. In this chapter, we will provide more details on how NLU modules for interactive agents are built, what technologies are involved, and what the problems and possible solutions are.
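To make the data flow of Figure 5.1 concrete, the following minimal sketch wires the components together as a single turn-handling loop. The component names (recognize, understand, decide, generate, synthesize) and the dictionary-based meaning representation are hypothetical placeholders introduced only for this illustration; they are not part of any particular system described in this chapter.

```python
# A minimal sketch of the conversational architecture in Figure 5.1.
# All component implementations are hypothetical stand-ins.

def recognize(audio: bytes) -> str:
    """ASR: convert audio into the first-best string of word hypotheses."""
    raise NotImplementedError  # e.g., a call to a speech recognition engine

def understand(text: str) -> dict:
    """NLU: map the transcription to an actionable meaning representation."""
    raise NotImplementedError  # e.g., intent classification + argument extraction

def decide(meaning: dict, context: dict) -> dict:
    """DM: choose the next action from the meaning and the dialog context."""
    raise NotImplementedError

def generate(action: dict) -> str:
    """NLG: render the chosen action as a textual response."""
    raise NotImplementedError

def synthesize(text: str) -> bytes:
    """TTS: produce audio for the response text."""
    raise NotImplementedError

def handle_turn(audio: bytes, context: dict) -> bytes:
    """One user turn through the ASR -> NLU -> DM -> NLG -> TTS pipeline."""
    transcription = recognize(audio)
    meaning = understand(transcription)
    action = decide(meaning, context)
    response_text = generate(action)
    return synthesize(response_text)
```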



5.2 NLU and Interactive Virtual Agent Features

A virtual agent, within the scope of this chapter, is a machine that is able to carry on a spoken conversational interaction for the fulfillment of a user goal, whether that goal is a well-defined task, the answer to a question, or the satisfaction of a social function. Examples of modern commercial interactive agents include automated telephone customer assistance systems and consumer personal virtual assistants, such as Siri, Alexa, Bixby, and the Google Assistant. While none of these agents has a body, there are a few examples of embodied agents. Embodied agents with a virtual body implemented as graphical avatars have been utilized to answer customer questions on websites, for instance for banking or travel. Research is producing advanced embodied agents, like Zara, developed at the Hong Kong University of Science and Technology [Fung et al. 2016]. Finally, as far as embodied agents with a physical body are concerned, we should mention Jibo, a consumer robot that was shipped in 2017 (see for instance https://www.wsj.com/articles/this-cute-little-robot-made-my-family-mad-1511980911). Other examples of agents with a physical body are the robots produced by SoftBank (see https://www.softbankrobotics.com/emea/en/pepper) and Furhat Robotics (see https://furhatrobotics.com).

In general, there are a couple of considerations that we should keep in mind when implementing NLU for an embodied agent, especially for a physically embodied one. First, depending on its function, a physically embodied agent may require the NLU system to run fully embedded on the device. Imagine, for instance, a robot that may be in situations where the Internet is not available, or where the latency requirements are very strict. Depending on the available computational capabilities of the device, the implementation of an on-device NLU may require some architectural and scope limitations with respect to an NLU that runs in the cloud. Moreover, a physically embodied agent is situated in the physical world, and thus may be exposed to linguistic expressions that refer to the space around it. That is especially true when the device is capable, or presumed capable by the user, of visually understanding objects and gestures. The NLU system needs to be able to spatially ground some referential expressions (e.g., look here, take a picture of that object), in the sense that they need to be resolved with knowledge of the space around the device. Spatial grounding [Guadarrama et al. 2013], and grounding in general, is beyond the scope of this chapter, and it is arguable whether, architecturally, it should be part of the NLU system or belong to a successive interpretation phase.


Besides these specific issues, NLU for embodied and non-embodied agents share similar requirements. Table 5.1 shows a non-exhaustive taxonomy of request types, or features, that are common in today's personal virtual assistants, where we can see a clear distinction between goal-oriented and social interactions. A goal-oriented interaction is generally characterized by a request targeted at the fulfillment of a concrete user need for information or for the execution of an action. Within that scope, different types of requests may have different natural language requirements and may pose different problems for the NLU and the successive stages of processing.

Table 5.1  A taxonomy of the type of requests fulfilled by a modern virtual assistant

Goal Oriented
  Home automation: Turn the lights on in the living room / Start the air conditioning in the kitchen / Lock the front gate
  Media requests: Play Hello by Adele / Play the Shawshank Redemption on my family room TV
  Personal help: Set an alarm for tomorrow at 7 am / Remind me to take the garbage out this evening / Set a timer for 10 minutes
  Personal information: When is my next flight to London? / What is my hotel in Paris next week? / When is my next personal trainer appointment?
  Games: Let's play a trivia game
  Question answering: Who was the 23rd president of the United States?
  Technical support: How do I change my email password? / How do I change privacy settings?
  Third-party applications: Make a hotel reservation in Geneva for this weekend / Call a taxi for going to the airport / Buy tickets for the Joker at the closest movie theater

Social
  Fun: Tell me a joke / Surprise me
  Chitchat: How are you?


In the rest of this section, we will highlight some of the issues arising with the different types of features reported in Table 5.1.

NLU for home automation has some specific issues. The functionality of most of the devices is quite limited (e.g., turning a device on or off and changing some of its characteristics). One of the issues with home automation requests is the identification of the specific device referred to by the request. Users generally give arbitrary names to the devices once they are installed on the home network. For instance, one can distinguish between the "living room TV" and the "bedroom TV," or the "ceiling lights" and the "lights on the wall," or give a device an arbitrary name, like "the foobar lights." The use of arbitrary names to identify a specific device poses problems for the speech recognizer (i.e., the names can be confused by the ASR with other similar-sounding words, or they may not exist in the speech recognition vocabulary). Another problem is that users may not use the names associated with the devices in a consistent way. In any case, the NLU system needs to be able to account for arbitrary, user-specific dynamic entities in its grammars and parsers, and to account for all possible variations that users may apply when they issue a command.

Media is a broad and complex area for a virtual interactive agent. It covers the choice of music, videos, and movies (on the appropriate device), podcasts, audiobooks, and so on. Referring to music as an example, the user may request a specific title (e.g., "play Hello"), a title specifically interpreted by an artist (e.g., "play Hello by Adele"), an album that may have the same name as a song (e.g., "play Let it Be by the Beatles"), a genre (e.g., play classical music), or a category (e.g., play piano pieces by Chopin), and so on. Thus, the NLU issues related to media requests generally stem from the vast number of artists, albums, movies, and titles that need to be accounted for, and from the different selections, some of them ambiguous, that users may request. Ambiguity may be handled best at the level of the DM, so the NLU may propose a number of alternative candidates for the DM to disambiguate by asking. Another serious problem with media is that people generally do not remember the exact title of songs, albums, and movies, but refer to them partially (e.g., "play The Winding Road") or by using a recurring phrase in the lyrics. These are NLU grounding problems that need to be addressed specifically for this task.

Personal assistance, like setting timers and alarms, may be quite straightforward. However, NLU and dialog complications arise with the interpretation of complex commands for changing existing objects, for example:

User: Change my alarm to 7:30 am.
Assistant: Which one? You have 12 alarms.
U: My morning alarm.


A: You have an alarm for 6:30 am, one for 7 am, one for 8:30 am, one for…. Which one do you want to change?

These are typical grounding problems that arise when dealing with potentially large lists of objects. Similar problems arise with personal information:

U: Where is my hotel?
A: Which hotel?
U: The one I am staying at tonight.
A: Sorry, I did not understand you.

General question answering is a special category that can rely on the power of existing search engines, like Google's, which today can provide direct answers for most of the specific questions that users can ask about the vast number of entities on the Web. However, without relying on existing general solutions, building a general question answering system may require the availability of a knowledge repository, such as a knowledge graph, that can either be created manually for small private domains (for instance, using open-source knowledge graph software [Data Cloud, neo4j]; see for instance "The Linked Open Data Cloud" (https://lod-cloud.net/) or DBPedia (https://wiki.dbpedia.org/)) or derived automatically or semi-automatically from an analysis of a collection of documents. The input query needs to be analyzed so as to determine the category of the requested answer (e.g., a who, what, where, or when question) and the attributes of the requested entity (e.g., a person, an animal, a fact). Question answering is a specific area of natural language processing that departs from the classic NLU paradigm used, for instance, in the architecture of Figure 5.1.

Technical support and third-party applications, including the booking of events, travel reservations, orders, and so on, are considerably more structured, and once triggered in a virtual assistant by a simple request, like "I want to make a restaurant reservation," they fall into the category of directed dialog. The solution for directed dialog generally relies on well-crafted prompts, as we will see later in this chapter. Most of these applications require some form-filling logic or form interpretation algorithm [Pieraccini and Huerta 2005, Stoyanchev et al. 2016, https://www.w3.org/TR/voicexml20/].
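As a rough illustration of form-filling logic for a directed dialog (a generic sketch, not the specific algorithms of the cited works), the code below tracks which slots of a restaurant-reservation form are still empty and prompts for the next one. The slot names and prompt wording are invented for this example.

```python
# A minimal form-filling sketch for a directed dialog: prompt for the first
# slot of the form that has not yet been filled. Slot names and prompts are
# invented for this illustration.
from typing import Optional

RESERVATION_FORM = {
    "city": "In which city would you like the reservation?",
    "date": "For which day?",
    "time": "At what time?",
    "party_size": "For how many people?",
}

def next_prompt(filled_slots: dict) -> Optional[str]:
    """Return the prompt for the first unfilled slot, or None when complete."""
    for slot, prompt in RESERVATION_FORM.items():
        if slot not in filled_slots:
            return prompt
    return None  # all slots filled: the back-end request can be executed

# Example: a request like "Make a restaurant reservation in Geneva for this
# weekend" would already fill the city and (after date resolution) the day.
filled = {"city": "Geneva", "date": "2021-06-05"}
print(next_prompt(filled))  # -> "At what time?"
```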

5.2.1 NLU for Non-Goal-Oriented Social Agents

The broad category of social interactions is generally not goal oriented but targeted at providing conversational delight for the user, like jokes, fun facts, games, and conversational exchanges that can be characterized as chitchat. One way to develop agents for social interactions consists in creating a repository of answers, generally written by creative writers.


Table 5.2  Examples of chitchat requests to a social interactive agent

How are you?
When were you born?
Who built you?
Are you listening to me all the time?
Do you celebrate any holidays?
What is your favorite food?
Do you like me?
I love you!
You make me laugh!
May the force be with you

Table 5.2 shows examples of queries among the many that users may ask a social interactive agent. To be interesting, a social interactive agent requires witty answers to thousands of such queries; it also requires content that is fresh, with special seasonal answers, for instance in conjunction with public holidays or large events, such as the Super Bowl. The number of potential queries can be so large that creating and maintaining grammars, or training neural NLU systems, can become prohibitive. A solution to this can be based on typical information retrieval mechanisms. For instance, one can create a repository of all potential queries, with links to the corresponding answers. When a user query arrives, the retrieval algorithm tries to find the closest query in the repository and plays the corresponding answer.

Multiturn social interactions fall into the domain of social chatbots. One of the most recent and successful examples is Meena, a chatbot developed by Google Brain [Adiwardana et al. 2020]. At a high level, Meena is based on an end-to-end neural system trained on a 40-billion-word corpus of public social media conversations. An interesting characteristic of Meena is that it is optimized over a human-like metric that weights the sensibleness and the specificity of each answer, producing quite impressive results.
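A minimal sketch of the retrieval mechanism described above is shown below, using TF-IDF vectors and cosine similarity to find the closest stored query. The tiny repository, the scripted answers, and the use of scikit-learn are illustrative assumptions, not components of any system discussed in this chapter.

```python
# A toy retrieval-based chitchat responder: find the stored query closest to
# the user's input and return its scripted answer. Assumes scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

repository = {
    "how are you": "I'm doing great, thanks for asking!",
    "tell me a joke": "Why did the robot go on vacation? It needed to recharge.",
    "who built you": "A team of engineers and creative writers built me.",
}

queries = list(repository.keys())
vectorizer = TfidfVectorizer()
query_matrix = vectorizer.fit_transform(queries)

def respond(user_query: str) -> str:
    """Return the answer linked to the closest query in the repository."""
    user_vec = vectorizer.transform([user_query])
    similarities = cosine_similarity(user_vec, query_matrix)[0]
    best = similarities.argmax()
    return repository[queries[best]]

print(respond("how are you doing today?"))  # -> "I'm doing great, thanks for asking!"
```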

5.3 Developing NLU

The goal of an NLU module is to transform a natural language textual sentence, possibly the transcription of an utterance generated by a speech recognition system, into a non-ambiguous symbolic representation of its meaning. In other words, the problem of NLU is that of translating from natural language to a defined formal language, and thus it can be functionally considered analogous to that of machine translation.


In fact, some approaches to NLU that leveraged machine translation models have been proposed in the past, as described in Della Pietra et al. [1997]. However, the major and not insignificant difference between NLU and machine translation is that the latter acts upon defined and existing source and target natural languages (e.g., English and Italian), while the target language for NLU is not natural and needs to be created for any domain of use. The specification of the target formal language for an NLU system, in other words the meaning representation, is often a non-negligible part of the development of NLU. A typical approach to the development of a meaning representation for NLU is based on the definition of an ontology of intents and their corresponding arguments. Let's consider a simple example. If we are building a system for home automation, one may consider intents that express the actions a user may want to accomplish, for instance:

Light-control;
Door-control;
TV-control;
Temperature-control;

In order to be able to represent all the possible requests and act upon them to change the status of some home automation device, each one of these intents needs to be complemented by a number of arguments. Let's consider the following examples of user requests:

Turn the dining room TV on.
Switch on the light in the living room and make it blue.

The following expressions are suitable representations of the respective meanings of the above requests. They include, at a symbolic level, all the pieces of information that will be used by the subsequent stages, for instance by the DM, to fulfill the user requests:

TV-control(target-status = on, location = dining-room);
light-control(target-status = on, location = living-room, target-color = blue);

The definition of the intents and their arguments is often called an intent schema.


Although there have been attempts to learn ontologies automatically from large collections of text [Lourdusamy and Abraham 2020], the most reliable approach for practical human–machine interaction systems still relies on crafting the intent schema based on the developer's knowledge of the domain and the corresponding applications. As of today, there is still no universal standard ontology of intent schemas, so each system is generally engineered in a proprietary manner. The design of intents and their level of granularity is somewhat arbitrary. For instance, a different design criterion may have consolidated the four intents defined above into a single one by adding an additional argument that specifies the device to be controlled, together with device-specific arguments. In that case, the semantic representations of the above example sentences would be:

device-control(device = TV, target-status = on, location = dining-room);
device-control(device = light, target-status = on, location = living-room, target-color = blue);

The arbitrary nature of intent schema design poses a problem for large systems, such as modern virtual assistants with an increasing number of features. Crafting and maintaining a collection of thousands of intents requires a well-coordinated team effort. When new features are developed, the team designated to maintain the intent ontology needs to decide whether to create new intents or derive them by specializing existing ones. The intent-specification team needs to dedicate special care to making sure there are no duplications and ambiguities. Incidentally, it is important to consider the ambiguity that may be inherent to certain applications. For instance, consider the following query:

Switch the blue light off.

The term "blue" here could refer to the color as well as the name given by the user to that particular light. So, the parsing of the query alone does not resolve that ambiguity unless the context is taken into consideration, for instance, whether the user actually has a device called "blue," which happens to be a light, or whether there is a light that happens to be of color blue at that particular moment. One way to resolve that is to leave the ambiguity at the NLU level and generate multiple parsing candidates. Successive stages of processing may select the correct candidate based on contextual knowledge (e.g., whether there is actually a light named "blue" or characterized by a "blue" color).
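A minimal sketch of this candidate-generation strategy is shown below; the intent and argument names follow the device-control schema above, while the device registry and the scoring heuristic are hypothetical.

```python
# Sketch: keep the "blue light" ambiguity at the NLU level by emitting several
# candidate parses, then let a later stage pick one using contextual knowledge
# (here, a hypothetical registry of the user's devices).
from dataclasses import dataclass, field

@dataclass
class Parse:
    intent: str
    args: dict = field(default_factory=dict)

def nlu_candidates(utterance):
    """For 'Switch the blue light off', return both plausible readings."""
    return [
        Parse("device-control", {"device": "light", "name": "blue", "target-status": "off"}),
        Parse("device-control", {"device": "light", "color": "blue", "target-status": "off"}),
    ]

def resolve(candidates, device_registry):
    for parse in candidates:
        name = parse.args.get("name")
        color = parse.args.get("color")
        if name and any(d["name"] == name for d in device_registry):
            return parse                       # there really is a device called "blue"
        if color and any(d.get("color") == color for d in device_registry):
            return parse                       # there is a light that is currently blue
    return candidates[0]                       # fall back to the first reading

registry = [{"name": "kitchen lamp", "type": "light", "color": "blue"}]
print(resolve(nlu_candidates("Switch the blue light off"), registry))
```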


When dealing with a large number of intents the creation of annotated corpora for testing and training can be complex. Corpus annotations are generally created via crowdsourcing. For a reasonably small number of intents, crowdsourcers can be easily trained to choose the correct intent for each utterance transcription that needs to be annotated. However, sometimes the number of intents is so large that operation is difficult and requires carefully designed annotation tools and human annotators who become experts of the domain. Using a hierarchical approach, for example, showing the crowdsourcers a choice of high-level intent clusters, and letting them drill down to the specific ones, is a viable but rather costly approach.

5.4 NLU and Interaction Modalities

The interaction modality of the system we are going to build has a strong influence on the type of NLU system we need to use. The rest of this section is devoted to the description of the different types of interaction modality and their implications for NLU.

5.4.1 System-initiated, Directed Prompts Interaction

In this interaction modality, also denoted as directed dialog, the system always initiates the interaction by clearly prompting the user to elicit a specific and desired response. Prompts are designed, typically by conversation designers or voice–user interface designers, with the goal of constraining the user responses so that the NLU has a higher chance of understanding a larger percentage of utterances [Cohen et al. 2004, https://cloud.google.com/speech-to-text/docs/contextstrength]. The directed dialog modality is typical of menu-based systems, and has been used quite extensively in telephone customer care services (also called interactive voice response systems). Directed prompts can be effectively used when the number of choices for the user is limited, generally between three and five, when the response belongs to a known closed set (e.g., city names, dates, times of day), or when it is a well-defined and structured numeric value (e.g., a date, credit card number, telephone number). In all these situations the transcription of an utterance performed by the ASR can be parsed by a well-defined context-free grammar (CFG). CFGs have traditionally been used for directed prompt interactions. In the early systems, grammars were integrated in such a way as to provide a set of strong constraints for the speech recognition systems. In other words, the recognizers were trying to match the input utterance to, and only to, the sentences and phrases described by the associated grammars. If the user said something different, the speech recognizer would still try to match it to what was described by the grammar, and return the closest match, or a no-match. Modern recognizers do not need that constraint since they are capable of recognizing speech with high accuracy over vocabularies of millions of words.


However, that constraint could be used to further reduce the error rate of the ASR by biasing the recognizer towards the expressions expected as a response to a prompt, as represented by a grammar. Commercial cloud ASR systems, like Google's, provide an API to bias the recognizer towards expected responses. See for instance Google's SpeechContext API at "Speech-to-Text basics," Google Cloud Speech-to-Text documentation: https://cloud.google.com/speech-to-text/docs/basics. One could suggest possible phrases that are likely to be spoken by the user based on the prompt. For instance, if the prompt asks for a yes/no response, the recognizer could be biased towards yes, no, and their synonyms. As an example, imagine we want to write a CFG that would act as an NLU parser for the responses to the following prompt:

Do you want flight arrival, departure, or gate information?

A grammar covering the potential responses may look as follows:

$menu-choice = [$initial] [flight] (arrival | departure | gate) [information] [$final];
$initial = (I would like) | (I want) | (please give me);
$final = please | (thank you) | thanks;

The above grammar snippets follow a convention where symbols preceded by a $ sign are non-terminals (e.g., $initial) and square brackets represent optional elements (e.g., [flight]). There are many formats that have been used and are still used to write CFGs. W3C, the World Wide Web Consortium, embraced several standards for interactive commercial systems, and the Speech Recognition Grammar Specification (SRGS) [http://www.w3.org/TR/2004/REC-speech-grammar-20040316/] is the one used almost universally in the industry. The convention used in this chapter is a simplified form inspired by ABNF, which is part of the SRGS specification. The above grammar excerpt can parse the following sentences, and several others:

arrival gate information, please
I would like flight arrival information, thank you.

In order to be able to use a CFG for NLU, we need a way for it to generate symbols that represent the meaning of a sentence, for instance intents and arguments, as the result of a successful parsing. One way to do that, which has been used in practice mostly for system-initiated directed-prompt interactions, consists of assigning external variables within the body of the rules.


In the example above, only the first rule carries a meaning relevant for the application, and thus we can augment it by assigning a proper value to an existing external variable, let's call it choice, that is passed on to the DM, as in the following example:

$menu-choice = [$initial] [flight] ( arrival {choice = arr-info;} | departure {choice = dep-info;} | gate {choice = gate-info;} ) [information] [$final];

The CFG industry standards, such as SRGS, allow building sophisticated grammars with arbitrarily complex code (typically JavaScript) attached to the rules, and not just variable assignments. Code assigned to rules can be used by the developers, for instance, to perform string operations and calculations, or to validate the values resulting from parsing an input utterance transcription. Some typical cases are aggregating natural number expressions into a number (e.g., "forty-five thousand and three hundred twenty-six" into a variable assignment like "amount=45,326"), normalizing date expressions, or validating the correctness of a credit card number based on the digit checksum. Of course, it is important to understand that resolving these issues with code attached to the grammar terminals may not be the best architectural solution. An alternative way is to let the grammar assign intermediate values that are then processed further at the level of the DM, which may use more contextual information. The example above highlights one of the main problems with using CFGs to parse natural language. A slight variation from the responses prescribed by the grammar will not parse and will produce a failure, also known as a no-match. For instance, if the user says:

I would like to get gate information.

The above rules will fail to parse and the NLU will not produce a result, even though the sentence is very close to those accounted for by the grammar. When using CFGs, the best way to reduce the number of no-matches consists of analyzing the utterances that did not parse and making sure that they are represented in the grammar rules. This can typically be done in an experimental phase that precedes the production of the virtual agent, or later on, after the grammar is in production, by analyzing the logs and instituting an operational "continuous improvement cycle" [Suendermann et al. 2009a].
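As an illustration of how such a tagged grammar can be realized outside an SRGS engine, the following is a minimal sketch that encodes the $menu-choice rule as a regular expression and extracts the choice value; the mapping to regular expressions and the behavior on a no-match are simplifying assumptions, not part of the SRGS standard.

```python
# Sketch: the $menu-choice rule rendered as a regular expression, returning the
# semantic value ("choice") on a match and None on a no-match.
import re

INITIAL = r"(?:i would like|i want|please give me)"
FINAL = r"(?:please|thank you|thanks)"
MENU_CHOICE = re.compile(
    rf"^(?:{INITIAL}\s+)?(?:flight\s+)?(arrival|departure|gate)"
    rf"(?:\s+information)?(?:,?\s+{FINAL})?\.?$",
    re.IGNORECASE,
)
CHOICE_VALUE = {"arrival": "arr-info", "departure": "dep-info", "gate": "gate-info"}

def parse_menu_choice(utterance):
    match = MENU_CHOICE.match(utterance.strip().lower())
    if match is None:
        return None                     # no-match: the DM would re-prompt or escalate
    return {"choice": CHOICE_VALUE[match.group(1)]}

print(parse_menu_choice("I would like flight arrival information, thank you."))  # {'choice': 'arr-info'}
print(parse_menu_choice("I would like to get gate information."))                # None (no-match)
```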


Improving CFGs is a labor heavy process that, besides transcribers and annotators, requires specialists who can look at the data and modify the grammars in order to reduce the number of no-matches. In today’s world, transcribing and looking at user utterances is also a delicate process that often conflicts with user privacy. In any case, the use and maintenance of CFGs is problematic for systems with a large number of intents for the reasons discussed above. In order to reduce the amount of labor required by vanilla CFGs and/or increase their generalization, one can consider robust parsing or the use of statistical or neural classifiers that will be discussed in the next sections. Deep neural networks (DNNs), and in particular sequence to sequence or transformer mechanisms, are one of the modern solutions. However, the choice of the NLU technology depends very much on the type of interaction and the scale of the assistant, and whether a developer would prefer an immediate solution, like writing a simple grammar, or start a process of example collection and use a statistical or neural induction process. In fact, for small scale systems CFG can still constitute an advantage since, contrary to neural methods, they can be crafted in a short time without the need of data.

5.4.2 System-initiated, Open Prompt Interaction

Using CFGs is not practical when the response of the user is not directed towards a limited choice or a bound value (like a date or city name) by a specific prompt, but is rather unconstrained, for instance, as the result of a generic open prompt. This was a typical situation that arose in the past in applications such as call routing [Gorin et al. 1997] or technical support [Evanini et al. 2007]. In those situations, the opening prompt of a virtual agent is quite generic, like

Please tell me the reason you are calling about.

The response to such an open prompt can exhibit a vast variety of linguistic expressions and meanings. However, if the application domain is limited, for instance, technical support in a well-defined domain, all the possible linguistic expressions can be clustered into a number of well-defined meanings or intents. Once the set of intents is defined, one can proceed with the collection and annotation of a corpus for the purpose of training and testing the NLU system. In some instances, the corpus can be collected and annotated at the same time using a Wizard of Oz procedure, as in Gorin et al. [1997]. The annotated corpus can then be used to train an ML classifier to produce the right intent based on features of the input utterances, such as word n-grams. In Suendermann et al. [2009a] statistical classifiers were built not only for the initial open prompt responses but also for direct responses that were handled, initially,


with CFG. As the system started to operate, data for each interaction point was collected and sent to transcribers and annotators through an automated procedure [Suendermann et al. 2008]. The automated procedure would also verify the consistency of the transcriptions and annotations, request crowdsourcers to resolve possible conflicts, create training, and test corpora for each dialog state, train the classifiers, and, if the accuracy was higher than those in use, push the new ones to production. All of that was done automatically and in a mostly unsupervised manner. Eventually the overall accuracy was shown to increase with time and become superior to that obtained by using grammars only [Suendermann et al. 2009a]. It is important to consider that the labor required to semantically annotate transcription data for a specific domain does not grow linearly with the size of the corpus since we need to annotate each unique expression only once. Suendermann et al. [2010] shows that the automation rate obtained for an open-prompt interaction (e.g., “What is the reason you are calling about”) reaches 50% (i.e., only one out of two transcriptions need to be annotated) after roughly 500,000 utterances, while for a specific yes/no question it is above 90% after only 1,000 utterances.
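To make the classifier-based approach described in this section concrete, the following is a minimal sketch of an intent classifier trained on word n-gram features; the tiny training set, intent labels, and model choice are purely illustrative assumptions.

```python
# Sketch: open-prompt intent classification with word n-gram features.
# Training utterances and intent labels are hypothetical toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = [
    "my internet is not working", "the wifi keeps dropping",
    "i want to pay my bill", "how much do i owe you",
    "i would like to cancel my subscription", "please close my account",
]
intents = [
    "tech-support", "tech-support",
    "billing", "billing",
    "cancel-service", "cancel-service",
]

classifier = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # unigram and bigram features
    LogisticRegression(max_iter=1000),
)
classifier.fit(utterances, intents)

print(classifier.predict(["there is a problem with my wifi connection"]))  # e.g., ['tech-support']
```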

5.4.3 User-initiated Interaction What characterizes a user-initiated interaction is the spontaneous nature of it. As opposed to the system-initiated interaction, in a user-initiated interaction the system does not prompt or guide the user on what to say. Users interact in a spontaneous manner. NLU for spontaneous interactions requires being able to understand a large number of potential expressions. In restricted domains, this is not much dissimilar to an open-prompt situation. However, for unrestricted domains, especially when the user does not know the capabilities of the system, user-initiated interactions represent a substantial challenge for NLU. This is the typical situation for personal virtual assistants like Siri, Alexa, and the Google Assistant. A virtual assistant is generally idle and the user is never prompted to start an interaction. Users start interactions at their will, typically with a wake-up phrase, for instance Hey Google, or by touching an icon on a display (like, for instance, the assistant logo or the search microphone on an Android phone). Users then follow the invocation of the assistant with a spontaneous query. Often, the assistant responds with an answer or executes a command to fulfill the user’s request. However, a user query may require a follow-up request by the assistant through an open or directed prompt. When the system has all the necessary information, it can proceed to fulfilling the user’s request. When the system is in an idle state, it needs to be able to understand any possible initial user query addressed to invoke a specific feature. At the same time NLU


also needs to be able to understand when a query will not produce any reasonable answer, and either return a punt prompt (e.g., “sorry, I don’t understand your request”), or as most virtual assistants do, pass the request to a search engine as a fallback, hoping it will return something useful for the user. Historically, the first time the spoken dialog community faced the issue of recognizing and understanding spontaneous speech was during the DARPA ATIS [Price 1990] challenge at the end of the 1980s. ATIS (Air Travel Information System) was a project where a corpus of spontaneous requests to a flight database was used for training and testing systems developed by the participating labs. During the first years of the challenge, many of the developed NLU systems used legacy linguistic approaches based on CFGs, and in some cases also on more sophisticated grammars (e.g., context-sensitive grammars). The grammars had been originally developed by linguists for parsing written text. However, in the first experiments, feeding the traditional parsers with the transcriptions obtained by the speech recognition systems on spontaneous user utterances resulted in very low correct understanding rates. On the one hand, the errors produced by the speech recognizers dramatically impacted the accuracy of the understanding systems. On the other hand, the grammars, generally developed by hand by expert linguists to parse written text with high accuracy, could not account for phenomena occurring in spontaneous speech. Even though the grammars in use were quite sophisticated, they could not cope with disfluencies, broken sentences, repetitions, and so on. Some labs started to experiment with less traditional approaches that were cognizant of the problems of spontaneous speech. One of the first non-traditional approaches relied on stochastic models to represent the relationship between a meaning representation and the utterance words. These models, called conceptual hidden Markov models (CHMMs) [Pieraccini et al. 1991], represented sets of concepts (analogous to intents and arguments) as hidden Markov models with state-specific statistical language models based on n-grams. The intuition here is that an utterance representing an intent is composed of a number of conceptual entities that can be mapped directly to segments of a sentence, and that can be effectively represented by statistical models. In fact, the main cause of variability of natural language is due to the large number of equivalent expressions that carry the same meaning. However, we need to consider that a significant part of that variability can be attributed to the combinatorics of different conceptual entities within a sentence that can appear in different order, and by the presence of words that do not contribute directly to the meaning, like pleasantries and fluff words. The advantage that CHMMs had over other systems developed during the ATIS challenge is that they structurally accounted for all


the potential orderings of the conceptual entities while ignoring the non-meaning-carrying phrases. In fact, the AT&T system based on CHMMs was the one with the highest text understanding score at the final DARPA ATIS evaluation in 1994 [Levin and Pieraccini 1995]. CHMMs are a statistically motivated instance of a broader category of parsing strategies that turned out to be best suited for spontaneous speech, known as robust parsers [Seneff 1992]. The concept behind robust parsing is that a grammar does not need to account for all the words in a sentence. In a restricted-domain application, there are phrases that carry most of the information, while additional words can be ignored. A robust parser can be implemented as a CFG with the addition of a wildcard or match-all operator. The following is an example of a robust parsing rule:

$weather-information = \w* weather \w* [$city \w* ];
$city = (boston | new york | san francisco | …);

where the symbol "\w" is a wildcard that matches any word and the symbol "*" is analogous to the Kleene star in regular expressions (i.e., it matches zero or more repetitions of the previous element). The following are examples of sentences accepted by this simple robust parsing grammar:

Weather
Weather in New York City please
Can you please tell me what is the weather like in Boston

However, the following sentence cannot be parsed:

I am traveling tomorrow to Boston. How is the weather there?

In fact, the concepts of "weather" and "city" appear in a different order than the one defined in the previous rules. A simple way to modify the above parsing rule in order to make it independent of the order of the concepts relies on the Kleene plus operator "+", which allows for one or more repetitions of the previous element. Thus:

$weather-information = \w* ($concept \w* )+;
$concept = (weather | $city);
$city = (boston | new york | san francisco | …);

This set of rules allows for any order of the weather qualifier and the city and would parse a vast array of sentences.


The one above is a simple example chosen to illustrate the power of robust parsing, but also the issues associated with it. Overgeneration is one of the main problems of robust parsing rules. While the above rules cover most of the positive examples, they also cover a vast number of negative examples. For instance, all of the following queries are parsed by the same rules:

Weather weather Boston New York San Francisco weather weather.
I don't want to know the weather in New York.
Weather I don't want to know Boston is not my city.

One of the arguments in favor of using a robust parser is that those sentences are quite uncommon, and negation can be dealt with by special rules and heuristics. The advantage of a robust parser is that it is easy to handcraft rules that allow for a high recall, at the expense of low precision. Robust parsing is a practical solution for small-scale systems (i.e., systems with a limited number of intents).
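The order-independent robust rule above maps almost directly onto an ordinary regular expression. The following is a minimal sketch that also exhibits the overgeneration problem, under the assumption that a word-level regular expression is an acceptable stand-in for the grammar notation used in this chapter.

```python
# Sketch: the order-independent robust parsing rule as a regular expression.
# \w* in the grammar (any words) becomes a lazy "skip anything" pattern here.
import re

CITY = r"(?:boston|new york|san francisco)"
CONCEPT = rf"(?:weather|{CITY})"
WEATHER_INFORMATION = re.compile(rf"^(?:.*?\b{CONCEPT}\b)+.*$", re.IGNORECASE)

positives = [
    "Weather",
    "Can you please tell me what is the weather like in Boston",
    "I am traveling tomorrow to Boston. How is the weather there?",  # order now irrelevant
]
overgenerated = [
    "I don't want to know the weather in New York.",   # negation is happily accepted
    "Weather weather Boston New York San Francisco weather weather.",
]

for sentence in positives + overgenerated:
    print(bool(WEATHER_INFORMATION.search(sentence)), sentence)
```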

5.5 NLU Induction From Examples

Rather than creating handcrafted grammars, a different approach consists in learning an NLU system from a number of defined examples. In particular, the use of deep learning and the availability of libraries like TensorFlow and PyTorch have made the process of creating NLU parsers from examples quite accessible to developers who do not necessarily have a deep understanding of the intricacies of machine learning (ML). However, we should not forget that, as we discussed earlier, the complexity of a large NLU system is not just in the training of the parser. As the number of intents grows, the work required for maintaining and creating new intents and annotating data for training the NLU can become a clear bottleneck.

5.5.1 Deep Learning for NLU

In less than a decade deep learning [Courville et al. 2015] has revolutionized the whole field of artificial intelligence (AI). Speech recognition, language generation, text-to-speech synthesis, machine dialog, and NLU are among the most successful applications of deep learning in the field of human–machine communication. Deep learning is based on the notion of DNNs, that is, neural networks with more than one hidden layer. DNNs are being used to solve a number of problems, like modeling, classification, and mapping between input and output strings of discrete elements. More advanced types of DNNs have recently been developed to solve increasingly complex problems, and with higher and higher performance. Those include


recurrent long short-term memory networks (LSTMs) [Sak et al. 2014], Transformers [Polosukhin et al. 2017], and bi-directional transformers like BERT [Devlin et al. 2018]. In general, a neural NLU receives a sequence of words as input and generates a meaning representation as output. We have to keep in mind that because of the nature of the NLU problem both the input sequence of words as well as the output meaning representation do not have a fixed size. The input can have arbitrary length, that is, an arbitrary long sequence of words, and the output can be a structure, for example, an intent/argument set or a serialized structure with an arbitrary number of elements. A general solution for accommodating inputs and outputs of arbitrary size is that of using a recurrent sequence to sequence DNN [Kana 2019] (such as, for instance, an LSTMs). In a recurrent neural network, you feed each input word to a different instance of the same network. However, for each word in the sequence, the hidden neurons of the corresponding instance of the network also receive the output values that were calculated for the instance that processed the immediately preceding word. As a result of this incremental processing, the network internal values for a specific word of the input sentence are influenced by all the preceding words. The final result will thus depend on all the words in the input sentence. The problem of representing the input words in a way that allows for generalization, for instance across synonyms and semantically similar expressions, is common to every neural architectural solution. In modern language-processing neural networks the input sequence is represented by projecting words into a multidimensional space that preserves the semantic relationships between them. In other words, words that are semantically close are also close to each other in that space. The solution to this problem is central to the notion of word embeddings [Mikolov et al. 2013]. Word embeddings are extremely popular in today’s neural NLU and NLP systems in general, and can be reused across them. Multilingual embeddings [Faruqui and Dyer 2014, Chen and Cardie 2018] are an interesting and useful evolution of word embedding. To summarize, the use of DNNs for NLU is still a topic of research and there is not a mainstream solution yet. Whether to use a sequence-to-sequence model or a transformer like BERT is still an open question. Using DNNs we trade the problem of writing a grammar with that of providing labeled training data. While that may seem a quite simpler and effective approach, it is still cumbersome for many reasons. For one, getting realistic data is quite hard. Unless the system is deployed and we have access to a large number of logs, the creation of a large corpus of data may require a collection campaign relying on recruited subjects.


Thus, depending on the collection paradigm, the data may not represent the reality of the language used by real customers in real situations. Even though we may have access to a large number of logs, semantic annotation is quite hard and imprecise. This is especially true if we are in the presence of a large number of meanings, each of them represented by a symbolic meaning representation structure, like intent and arguments as described earlier in this chapter. Exposing a human annotator to unlabeled utterance transcriptions and asking them to select a corresponding intent/argument structure out of a catalog of thousands requires properly designed tools and processes. Moreover, human annotators need to be trained and have a certain familiarity with the ontology of intents and arguments. As discussed above, any hand-crafted ontology may include duplications, ambiguities, and inconsistencies that make the annotation task harder. The precision of the annotation done in this way may not be satisfactory, thus requiring the deployment of multiple human annotators per each utterance transcription and a mechanism to resolve conflicts.
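To make the induction-from-examples idea concrete despite these caveats, here is a minimal sketch of an embedding-based recurrent intent classifier in PyTorch; the vocabulary, intent labels, and network sizes are illustrative assumptions rather than a recommended architecture.

```python
# Sketch: a tiny recurrent (LSTM) intent classifier over word embeddings.
# Real systems would use pretrained embeddings or transformer encoders.
import torch
import torch.nn as nn

VOCAB = {"<pad>": 0, "turn": 1, "the": 2, "light": 3, "on": 4, "off": 5, "tv": 6}
INTENTS = ["light-control", "tv-control"]

class IntentClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, n_intents=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, n_intents)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)          # (batch, seq, embed_dim)
        _, (hidden, _) = self.lstm(embedded)          # final hidden state summarizes the utterance
        return self.output(hidden[-1])                # (batch, n_intents) logits

model = IntentClassifier(len(VOCAB), n_intents=len(INTENTS))
tokens = torch.tensor([[VOCAB["turn"], VOCAB["the"], VOCAB["light"], VOCAB["on"]]])
logits = model(tokens)
print(INTENTS[logits.argmax(dim=-1).item()])          # untrained, so the prediction is arbitrary
```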

5.5.2 Commercial NLU Tools

There are several available tools for creating NLU systems in the cloud. Major companies like Amazon, IBM, Microsoft, and Google offer those tools for experimentation or for commercial use. Google's DialogFlow (https://cloud.google.com/dialogflow) is a typical example. DialogFlow is characterized by a dashboard that allows a developer to build intents and define entities as arguments. The interface allows developers to specify a number of sample sentences that define a certain intent. Known entities, such as numbers, dates, and geographical entities, are automatically detected in the examples and used to generalize across them. So, an intent like "travel-destination" could be created by specifying a number of example sentences, such as:

I want to go to San Francisco
My destination is Boston
Going to Paris

The entities like "San Francisco," "Boston," and "Paris" used in the examples are automatically detected and internally expanded to all entities in the same category, so the system can later classify as "travel-destination" sentences that use different entities, such as "I want to go to New York." The system also has some inbuilt generalization across common expressions, such as "I want" and "I would like." The interface allows entities to be associated with roles or arguments. It also allows the developers to create their own lists of entities.
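The entity-based generalization described above can be approximated in a few lines: the sketch below replaces known city names with a placeholder before matching new utterances against the sample sentences. The city list, placeholder name, and intent name are illustrative, and this is not how DialogFlow is implemented internally.

```python
# Sketch: generalizing sample sentences by abstracting a known entity type.
# A real tool would use large entity catalogs and statistical matching.
CITIES = {"san francisco", "boston", "paris", "new york"}

def abstract_cities(sentence):
    """Replace any known city mention with the placeholder @geo-city."""
    text = sentence.lower()
    for city in CITIES:
        text = text.replace(city, "@geo-city")
    return text

SAMPLES = {
    "travel-destination": [
        abstract_cities(s) for s in
        ["I want to go to San Francisco", "My destination is Boston", "Going to Paris"]
    ]
}

def classify(utterance):
    template = abstract_cities(utterance)
    for intent, patterns in SAMPLES.items():
        if template in patterns:
            return intent
    return None

print(classify("I want to go to New York"))   # 'travel-destination'
```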


5.6 The NLU Usability Paradox

The NLU Usability Paradox was initially stated by Mike Phillips [2006], restated in terms of a "habitability gap" by Roger Moore [2019], and further developed by Bruce Balentine [2020]. Consider the qualitative chart of Figure 5.2, based on the original presentation by Mike Phillips. With structured dialog, where the user is specifically instructed by the system on what to say, one can reach a quite reasonable level of usability. The usability would slightly increase by increasing the flexibility of the system to understand not only the commands suggested by the prompts but also natural language expressions. One can safely increase the flexibility of the system, for example, the number of available commands and the related accepted expressions, up to a point, beyond which the usability will drop without ever reaching the point marked as "??." This is because the dialog cannot be structured with directed prompts when too many options are available, and the user is left to guess what to say. Unfortunately, if the natural language coverage is limited, a high percentage of the user attempts will fail. And even though the user may learn which expressions work, as the flexibility of the system increases users will have a hard time remembering what works and what doesn't. The usability of an NLU system will continue to drop until the coverage of NLU expressions reaches a point that is closer to human capabilities. At that point whatever the user says will be understood and dealt with by the system. Early NLU systems fell into the low-usability minimum of the NLU paradox curve. Systems with a relatively large number of features and low NLU coverage resulted in extremely poor usage, and the lack of feature discovery mechanisms made the systems practically unusable.

Figure 5.2  A qualitative description of the NLU usability paradox [Moore 2019]. (Qualitative plot of usability against flexibility, annotated with "Structured Dialog," "Add NL/Dialog," "Like a Human," and the unreached point marked "??.")


An open question for large NLU systems is that of being able to discern among expressions that the system understands and can fulfill, expressions that the system understands but cannot fulfill, and expressions that the system cannot understand at all. If we could do that, the usability of NLU systems would be much higher than what we have today.

5.7 Context

Humans carry on efficient conversations by relying on contextual information. Let's consider the following dialog:

User: Where is the Empire State building?
Assistant: In New York City
U: How tall is it?
A: 381 meters
U: Give me that in feet
A: 1,250 feet
U: When was it built?
…

This is a very efficient and natural exchange. That's how we talk, and that's how we would like interactive agents and assistants to talk. Each query is grounded, if possible, based on the previous conversation, and previously mentioned entities are referenced by the use of pronouns. Think how cumbersome and unnatural the same conversation would be without context references:

U: Where is the Empire State building?
A: The Empire State Building is in New York City
U: How tall is the Empire State Building?
A: The Empire State Building is 381 meters high
U: How much would 381 meters be in feet?
A: 381 meters correspond to 1,250 feet
U: When was the Empire State Building built?
…

Pronouns are not the only way to reference the context. In fact, sometimes the context is obvious even without any linguistic reference. For instance:


U: What is the weather in Boston?
A: 65 degrees and sunny
U: What about Chicago?
A: 58 degrees and rainy.

In order to build an efficient conversational dialog system, NLU needs to be able to account for these linguistic phenomena, generally referred to as anaphoric references, that is, the use of pronouns, and ellipsis, that is, the use of incomplete sentences. There are many ways to account for anaphora and ellipsis. One method is to deal with those phenomena at the symbolic level after the NLU process has generated a meaning representation. For instance, a contextual resolution module can assume a missing intent to be the same as that of the previous query and simply replace the argument of the previous query with the new argument (e.g., intent = weather-information; city = Boston --> Chicago). One can devise other rules that work most of the time. However, a query may reference not only a previous user query but also a previous agent answer, as in:

U: Who was the 44th president of the United States?
A: Barack Obama
U: How old is he?

Or, in a multimodal interface, for instance a virtual assistant with a display or on a smartphone, the user may refer to a displayed picture and ask questions or refine the search using incomplete utterances. For instance, one may ask the Google Assistant on a smartphone:

U: Show me the pictures I took in Tunisia last year
A: Here they are
U: Only those with camels
A:
U: When was this picture taken?
A: This picture was taken on December 20th, 2019.

Even though commercial virtual assistants, like Alexa, Siri, and the Google Assistant, are able to handle context quite well, this is still an interesting topic of research where ML and deep learning in particular can provide increasing levels of quality.
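A minimal sketch of the symbolic contextual-resolution rule described above (carrying over the previous intent and swapping in the new argument) could look as follows; the intent and slot names reuse the weather example and are otherwise illustrative.

```python
# Sketch: resolve an elliptical follow-up ("What about Chicago?") by copying the
# missing intent from the previous turn and overriding the changed argument.
def resolve_with_context(parse, previous_parse):
    if parse.get("intent") is None and previous_parse is not None:
        resolved = dict(previous_parse)          # start from the previous interpretation
        resolved.update({k: v for k, v in parse.items() if v is not None})
        return resolved
    return parse

turn1 = {"intent": "weather-information", "city": "Boston"}
turn2 = {"intent": None, "city": "Chicago"}      # NLU output for "What about Chicago?"

print(resolve_with_context(turn2, turn1))
# {'intent': 'weather-information', 'city': 'Chicago'}
```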


5.8 Conclusions

We have seen how NLU is an essential part of any interactive agent. For an agent to be able to respond to or fulfill a user's requests, it first needs to understand the request itself. Converting an arbitrary textual representation of an utterance into an actionable representation of its meaning, for instance intents and arguments, is the goal of NLU. In order to develop an NLU, it is important to consider not just the actual parsing mechanism but the whole process, starting from the creation and management of a meaning representation, the annotation of data for testing and training, and the actual training of the NLU module. That whole process can become quite complex as the number of meanings (e.g., intents) increases, which is the case in today's virtual interactive assistants. We have also analyzed how the choice of the NLU system is influenced by the type of interaction. System-initiated interactions characterized by well-defined prompts can be handled quite well by more or less strict CFGs, since the prompts lead the user into a restricted number of expressions. Open prompts, which need to be used in system-initiated applications where a directed prompt may not work, may elicit a large, unbounded number of expressions, even though restricted to a specific domain. In that case an ML classifier trained over large corpora of annotated transcriptions is a better solution. Finally, user-initiated interactions are the ones that require the most sophisticated NLU systems. That is especially true for virtual interactive assistants that span a vast number of domains and features. Robust parsing is a simple extension of CFGs that can be useful in these situations. With the advent of deep learning, epitomized by DNNs, the task of learning NLU from a corpus of annotated transcriptions has become easier and more effective. There is a clear dichotomy between structural NLU, such as grammars and robust parsers, and induced NLU. The choice between the two depends very much on the type of system one wants to build. Structured methods allow a high precision and a high degree of control by the developer. If the NLU does not work for a given number of expressions, accounting for them in a grammar is relatively straightforward, even though labor intensive. On the other hand, ML-based NLU, and especially DNN-based systems, suffer from the problem of interpretability of the results. Besides using more data and larger networks, correcting errors is a process that requires some level of intuition and a lot of experimentation. And there is no guarantee that the process will succeed. Large-scale internationalization is another big issue. Translating grammars is labor intensive and requires specialist grammar writers proficient in different languages.


As machine translation improves, one can use it, for instance, to create an initial corpus in a target language from a corpus in English. In Suendermann et al. [2009b], we used publicly available machine translation to translate an annotated corpus from English to Spanish, and thus create a set of ML-based intent classifiers in the target language. Once the Spanish system went into production, a continuous improvement loop [Suendermann et al. 2009a] continued to improve the performance of the initial classifier. As we have seen in this chapter, NLU for interactive virtual agents is complex, and even though one may find off-the-shelf solutions for some limited domains, there is no general solution to the NLU problem that works on any potential application.

References D. Adiwardana, M.-T. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, and Q. V. Le. 2020. Towards a human-like opendomain chatbot. arXiv:2001.09977v3 [cs.CL] 27 Feb 2020. B. Balentine. 2020. Dagstuhl Seminar 20021. In Spoken Language Interaction with Virtual Agents and Robots (SLIVAR): Towards Effective and Ethical Interactions. January 2020. X. Chen and C. Cardie. 2018. Unsupervised multilingual word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium. October–November 2018. M. Cohen, J. Giangola, and J. Balogh. 2004. Voice User Interface Design. Addison-Wesley Professional, Boston. A. Courville, I. Goodfellow, and Y. Bengio. 2015. Deep Learning. MIT Press, Cambridge, MA. S. Della Pietra, M. Epstein, S. Roukos, and T. Ward. 1997. Fertility models for statistical natural language understanding. In 35th Annual Meeting of the Association Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics. July 1997, Madrid, Spain. DOI: https://doi.org/10.3115/976909.979639. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. 11 October 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805v2. K. Evanini, D. Suendermann, and R. Pieraccini. 2007. Call classification for automated troubleshooting on large corpora. In ASRU 2007. Kyoto, Japan, December 9–13, 2007. M. Faruqui and C. Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics. 462–471. DOI: https://doi.org/10.3115/v1/E141049. P. Fung, A. Dey, F. B. Siddique, R. Lin, Y. Yang, W. Yan, and R. C. H. Yin. 2016. Zara the Supergirl: An empathetic personality recognition system. In Proceedings of NAACL-HLT 2016 (Demonstrations). San Diego, CA, 87–91, June 12-17.


A. Gorin, G. Riccardi, and J. Wright. 1997. How may I help you? Speech Communication. October 1997. S. Guadarrama, L. Riano, D. Golland, D. Gouhring, Y. Jia, D. Klein, P. Abbeel, and T. Darrell. 2013. Grounding spatial relations for human–robot interaction. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE/RSJ International Conference on Intelligent Robots and Systems. 1640–1647. M. Kana. Sep. 5. 2019. Natural language understanding with sequence to sequence models. Medium. 2019. E. Levin and R. Pieraccini. 1995. Concept-based spontaneous speech understanding system. In Proceedings of EUROSPEECH’95. Madrid, Spain. R. Lourdusamy and S. Abraham. 2020. A survey on methods of ontology learning from text. In Learning and Analytics in Intelligent Systems. Vol. 9. LAIS Book Series. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 26 October 2013. R. K. Moore. 2019. A ‘canny’ approach to spoken language interfaces. https://arxiv.org/pdf/ 1908.08131.pdf. M. Phillips. 2006. Applications of spoken language technology and systems. In M. Gilbert and H. Ney (Eds.), IEEE/ACL Workshop on Spoken Language Technology (SLT). IEEE, Aruba. R. Pieraccini and J. Huerta. 2005. Where do we go from here? Research and commercial spoken dialog systems. In Proceedings of the 6th SIGdial Workshop on Discourse and Dialog. Lisbon, Portugal, 2–3 September 2005. R. Pieraccini, E. Levin, and C.-H. Lee. 1991. Stochastic representation of conceptual structure in the ATIS task. In Proceedings of the 4th Joint DARPA Speech and Natural Lang. Workshop, Pacific Grove, CA, Feb. 1991. I. Polosukhin, L. Kaiser, A. N. Gomez, L. Jones, J. Uszkoreit, N. Parmar, N. Shazeer, and A. Vaswani. 2017-06-12. Attention is all you need. arXiv:1706.03762. P. Price. 1990. Evaluation of spoken language systems: The ATIS domain. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania. June 24–27, 1990. Levin. H. Sak, A. Senior, and F. Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2014. 338–342. S. Seneff. 1992. Robust parsing for spoken language systems. In Proceeding of the International Conference on Acoustic, Speech, and Signal Processing 1, 189–192. DOI: https:10.1109/ICASSP.1992.225940. C. Shannon. 1948. A mathematical theory of communication. The Bell System Technical Journal 27, 379–423, 623–656.


D. Suendermann, J. Liscombe, K. Evanini, K. Dayanidhi, and R. Pieraccini. 2008. C5. In Proceedings of 2008 IEEE Workshop on Spoken Language Technology (SLT 08), December 15–18, 2008. Goa, India. D. Suendermann, K. Evanini, J. Liscombe, P. Hunter, K. Dayanidhi, and R. Pieraccini. 2009a. From rule-based to statistical grammars: Continuous improvement of large-scale spoken dialog systems. In Proceedings of the 2009 IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP 2009). Taipei, Taiwan, April 19–24, 2009. D. Suendermann, J. Liscombe, K. Dayanidhi, and R. Pieraccini. 2009b. Localization of speech recognition in spoken dialog systems: How machine translation can make our lives easier. In Proceedings of Interspeech. Brighton, UK. D. Suendermann, J. Liscombe, and R. Pieraccini. 2010. How to drink from a fire hose: One person can annoscribe 693 thousand utterances in one month. In The 11th Annual SIGDIAL Meeting on Discourse and Dialogue, SIGDIAL 2010. September 2010, Tokyo, Japan. S. Stoyanchev, P. Lison, and S. Bangalore. 2016. Rapid prototyping of form-driven dialogue systems using an open-source framework. In Proceedings of the SIGDIAL 2016 Conference. Los Angeles, CA, 13–15 September 2016, 216–219. J. Stern. 2017. This Cute Little Robot Made My Family Mad. The Wall Street Journal, Nov. 29. A. Turing. 1950. Computing machinery and intelligence. Mind 59, 433–460.

6 Building and Designing Expressive Speech Synthesis

Matthew P. Aylett, Leigh Clark, Benjamin R. Cowan, and Ilaria Torre

You all know the test for artificial intelligence—the Turing test. A human judge has a conversation with a human and a computer. If the judge can’t tell the machine apart from the human, the machine has passed the test. I now propose a test for computer voices—the Ebert test. If a computer voice can successfully tell a joke and do the timing and delivery as well as Henny Youngman, then that’s the voice I want. — Roger Ebert.

6.1 Introduction and Motivation

We know there is something special about speech. Our voices are not just a means of communicating. They also give a deep impression of who we are and what we might know. They can betray our upbringing, our emotional state, our state of health. They can be used to persuade and convince, to calm and to excite. As speech systems enter the social domain, they are required to interact with, support, and mediate our social relationships with (1) each other, (2) digital information, and, increasingly, (3) AI-based algorithms and processes. Socially interactive agents (SIAs) are at the forefront of research and innovation in this area. There is an assumption that in the future "spoken language will provide a natural conversational interface between human beings and so-called intelligent systems" [Moore 2017, p. 283]. A considerable amount of previous research work has tested this assumption with mixed results. However, as has been pointed out, "voice interfaces have become notorious for fostering frustration and failure" [Nass and Brave 2005, p. 6].


It is within this context, between our exceptional and intelligent human use of speech to communicate and interact with other humans, and our desire to leverage this means of communication for artificial systems, that the technology, often termed expressive speech synthesis uncomfortably falls. Uncomfortably because it is often overshadowed by issues in interactivity and the underlying intelligence of the system which is something that emerges from the interaction of many of the components in a SIA. This is especially true of what we might term conversational speech, where decoupling how things are spoken, from when and to whom they are spoken, can seem an impossible task. This is an even greater challenge in evaluation and in characterizing full systems which have made use of expressive speech. Furthermore, when designing an interaction with a SIA, we must not only consider how SIAs should speak but how much, and whether they should even speak at all. These considerations cannot be ignored. Any speech synthesis that is used in the context of an artificial agent will have a perceived accent, a vocal style, an underlying emotion, and an intonational model. Dimensions like accent and personality (cross-speaker parameters) as well as vocal style, emotion, and intonation during an interaction (within-speaker parameters) need to be built in the design of a synthetic voice. Even a default or neutral voice has to consider these same expressive speech synthesis components. Such design parameters have a strong influence on how effectively a system will interact, how it is perceived, and its assumed ability to perform a task or function. To ignore these is to blindly accept a set of design decisions that ignores the complex effect speech has on the user’s successful interaction with a system. Thus, expressive speech synthesis is a key design component in SIAs. This chapter explores the world of expressive speech synthesis, aiming to act as a starting point for those interested in the design, building, and evaluation of such artificial speech. The debates and literature within this topic are vast and are fundamentally multidisciplinary in focus, covering a wide range of disciplines such as linguistics, pragmatics, psychology, speech and language technology, robotics and human–computer interaction (HCI), to name a few. It is not our aim to synthesize these areas but to give a scaffold and a starting point for the reader by exploring the critical dimensions and decisions they may need to consider when choosing to use expressive speech. To do this, the chapter explores the building of expressive synthesis, highlighting key decisions and parameters as well as emphasizing future challenges in expressive speech research and development. Yet, before these are expanded upon, we must first try and define what we actually mean by expressive speech.


6.2 Expressive Speech—A Working Definition

The term expressive speech can be used for many features that we see in human speech. It is a problematic term because it can rely on the concept of neutral speech, which assumes there is such a thing as un-expressive speech. This view is often one used in speech technology because implicitly, from an engineering perspective, it is an issue of control. The concept of neutral speech helps to frame synthesis development parameters and challenges. For instance, if a speech synthesis system produces so-called default speech for a series of words, how might we alter the emotion or the emphasis perceived in this speech? What controls would we need? What changes to the synthesis system would be required? This is echoed in the definition used by Govind and Prasanna [2013, p. 237]: "In expressive speech synthesis, along with text, the desired expression also forms an additional input to the text processing stage." Pitrelli et al. [2006, p. 1099] define expressive speech as a system's ability to "reinforce the message with paralinguistic cues, to communicate moods and other content beyond the text itself." However, the term other content is open-ended, making it hard to pin down what exactly the term expressive speech should or should not cover. They go on to use a more concrete working definition: "say the same words differently and appropriately for different situations" (p. 1099). Influenced by early work on emotional speech synthesis (e.g., Schröder [2001]), the term expressive speech has also been used synonymously with the term emotional speech. For example, according to Schröder [2009], the purpose of expressive synthesis is in "making a voice sound happy or subdued, friendly or empathic, authoritative or uncertain" (p. 111). This emotional emphasis is further echoed by Govind and Prasanna [2013, p. 237]: "we have considered different emotions as the expressions and hence emotions and expressions are interchangeably used." This viewpoint on expressive synthesis focused on developing and evaluating a distinct set of between three and nine extreme, sometimes termed primitive, emotional states within an artificial voice, such as disgust, fear, anger, joy, sadness, and surprise [Schröder 2001]. Yet, as computer applications began to take on roles such as that of a trainer or tutor, or attempted to offer emotional support for users, a more nuanced view of emotion in speech was required [Theune et al. 2006, Aylett et al. 2013]. As applications such as personal digital assistants and other forms of SIAs become more commercially deployed, we see the definition of expressive speech also becoming an important part of defining the perceived personality of the system [Aylett et al. 2017], and becoming more dependent on interaction.


Expressive speech "has the potential to provide the user with the choice to select a nuanced tone of voice suited to their intent and to the communicative setting" but that "In an interactive situation however, this does not become a real possibility, until a functional interaction model is available to control aspects of the expressive synthetic speech to ensure timely and effortless delivery" [Székely 2015, p. 4]. This perspective echoes advice from Pitrelli et al. [2006]. We would add to this the design of a voice to convey an intended character and emotional state, and we would widen "different situations" to include interaction situations such as conversation, monologue, and even dramatic performance; see Aylett et al. [2019a]. The definition of expressive synthesis is also hard to decouple from the aim of improving the naturalness of synthetic speech. Recent work in neural or WaveNet-style TTS (e.g., Oord et al. [2016]) has dramatically improved the perceived naturalness of synthetic speech, to the extent that Google was able to create a system to book appointments (i.e., at the hairdresser) that, in an interactive phone call, could not be perceived as artificial.1 Although this is important to consider when conceptualizing expressive synthesis, a blind drive for naturalness can eclipse a requirement for appropriateness needed for artificial systems [Aylett et al. 2019a, 2019b]. Both the drive for naturalness and the appropriateness of such naturalness must therefore be considered.

1. https://www.youtube.com/watch?v=D5VN56jQMWM

6.3 Building Expressive Synthesis

Models and approaches in expressive speech synthesis have been heavily constrained by the historical techniques used to generate speech. Recently, there has been a paradigm shift in speech synthesis technology where recurrent neural networks have successfully been applied to model speech output with unprecedented quality (e.g., Oord et al. [2016], Ping et al. [2017], and Lorenzo-Trueba et al. [2018]). There is often a lag between the state-of-the-art synthesis systems available and the systems used by SIAs. Availability, cost, performance, in-house technical experience, and language coverage are important factors when choosing a system. However, new-paradigm neural text-to-speech (TTS) systems are available commercially and there are extensive open-source repositories. Despite this, many systems will still be operating with unit selection techniques. These systems can also offer very good performance and quality and in many cases could be appropriate for SIAs. Finally, although the speech quality of older types of systems is significantly worse, research systems built for low-resourced languages or low-cost commercial systems such as toys may still use diphone and formant synthesis.


6.3.1 Overview of Major Approaches to Speech Synthesis

Most systems assume text is the input to the system, possibly with XML markup (see Section 6.3.3), and that an audio waveform is the final output, together with information on phone, word, and markup timings that can be used to drive a SIA’s speech. Conventionally, the text is processed by a series of modules. The first set (often termed the front-end) extracts linguistic and pronunciation information. The second set (often termed the back-end) then converts this information into audio. Of particular importance for the control of expressive speech are the linguistic specifications of the front-end that determine stress, emphasis, and phrasing, and the back-end modules which convert these specifications into duration and pitch parameters. Below, we describe the different major approaches to speech synthesis.
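As a rough illustration of this two-stage division of labor, the following sketch shows a front-end producing a linguistic specification and a back-end returning audio plus the timings an embodied agent needs. All class and field names here (FrontEnd, BackEnd, Timing, tts) are hypothetical placeholders, not the API of any particular synthesis system.

# A minimal sketch of the conventional TTS pipeline described above.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Timing:
    unit: str        # "phone", "word", or "markup"
    label: str       # e.g., "AE", "hello", "<emphasis>"
    start_s: float
    end_s: float

class FrontEnd:
    """Text (plus optional markup) -> linguistic specification."""
    def process(self, text: str) -> dict:
        # Real front-ends do normalization, pronunciation lookup, and
        # stress, emphasis, and phrasing prediction.
        return {"phones": [], "stress": [], "phrasing": []}

class BackEnd:
    """Linguistic specification -> waveform plus timings."""
    def synthesize(self, spec: dict) -> Tuple[bytes, List[Timing]]:
        # Real back-ends predict duration/pitch and generate audio
        # (formant rules, unit selection, vocoder, or neural TTS).
        return b"", []

def tts(text: str) -> Tuple[bytes, List[Timing]]:
    spec = FrontEnd().process(text)
    audio, timings = BackEnd().synthesize(spec)
    # The timings are what a SIA uses to align lip-sync, gesture,
    # and gaze with the generated speech.
    return audio, timings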

6.3.1.1 Formant Synthesis

Rule-based formant synthesis is one of the oldest approaches to speech synthesis (e.g., Klatt [1980]). A set of rules dictates parameters for three or four formants. Formants are the key resonant frequencies of the vocal tract; a pulse (voiced) or white noise (un-voiced) source is fed through the resulting resonant filter. Stephen Hawking used a formant synthesizer, and its sound became synonymous with his vocal identity. Although the quality is poor, such systems can easily be pre-built onto a chip and be used in toys and games.
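To make the source-filter idea concrete, the following is a self-contained sketch that passes a voiced pulse train through a cascade of second-order resonators placed at fixed formant frequencies. The pitch, formant, and bandwidth values are illustrative choices for an /a/-like vowel, not taken from any published rule set.

import numpy as np
from scipy.signal import lfilter

fs = 16000                                         # sample rate (Hz)
f0 = 120                                           # pitch of the voiced source (Hz)
formants = [(700, 130), (1220, 70), (2600, 160)]   # (frequency, bandwidth) pairs

# Voiced source: one impulse per pitch period (white noise would give an
# unvoiced source instead).
n = fs                                             # one second of audio
source = np.zeros(n)
source[::fs // f0] = 1.0

# Cascade of second-order resonators: the "filter" of source-filter synthesis.
signal = source
for freq, bw in formants:
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2 * r * np.cos(theta), r * r]
    b = [sum(a)]                                   # normalize for unity gain at DC
    signal = lfilter(b, a, signal)

signal /= np.max(np.abs(signal))                   # scale for playback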

6.3.1.2 Diphone Synthesis

Diphone synthesis depends on a voice talent recording examples of every diphone (one phone sound transitioning into another phone sound) within a carrier phrase; this set then forms the basis of the output speech (e.g., Lenzo and Black [2000]). Here, a complex prosodic model based on front-end features would predict the duration and pitch parameters of each phone. Digital signal processing would alter the pitch and duration of each diphone to achieve these targets and then concatenate the diphones together (often with a smoothing algorithm) to create the output speech.

6.3.1.3 Unit Selection

Unit selection extends diphone synthesis by using hundreds of thousands of diphones (or other speech units) extracted from a very large corpus recorded from a single speaker [Hunt and Black 1996]. Linguistic features and prosodic targets are used in a dynamic programming algorithm to select a good sequence of units. These are concatenated together with both smoothing and some processing to control duration and pitch. This produced a major jump in quality, and many systems still use this approach.
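The selection step can be pictured as a shortest-path search over candidate units, trading off a target cost (fit to the linguistic and prosodic specification) against a join cost (smoothness at the concatenation point). The toy sketch below uses made-up unit features and cost functions purely to illustrate the dynamic programming; it is not Hunt and Black's actual feature set.

def target_cost(spec, unit):
    # Mismatch between the requested and the candidate unit's prosody.
    return abs(spec["pitch"] - unit["pitch"]) + abs(spec["dur"] - unit["dur"])

def join_cost(prev_unit, unit):
    # Pitch discontinuity at the concatenation point.
    return abs(prev_unit["pitch"] - unit["pitch"])

def select_units(targets, candidates):
    """targets: list of specs; candidates[i]: candidate units for slot i."""
    # best[i][j] = (cumulative cost, back-pointer) for candidate j at slot i.
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + join_cost(candidates[i - 1][k], u) + tc, k)
                for k in range(len(candidates[i - 1]))
            )
            row.append((cost, back))
        best.append(row)
    # Trace back the cheapest path.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        if best[i][j][1] is not None:
            j = best[i][j][1]
    return list(reversed(path))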


6.3.1.4 Parametric Synthesis with a Traditional Vocoder

Within this approach, machine learning in the form of Hidden Markov models [Tokuda et al. 2000] or neural networks (e.g., Zen et al. [2013]) trained on linguistic features is used instead of rules to control the spectral and prosodic parameters of the output speech. These parameters are then converted into a waveform using a vocoder, a digital signal processing algorithm which can change spectral specifications into a waveform. The use of machine learning allows voices to be modelled, and these models can be adapted and manipulated. This in turn means that the systems offer significant flexibility not present in unit selection synthesis. However, the modelling and vocoding process reduces the perceived quality and naturalness of the speech output.

6.3.1.5 Neural TTS

By replacing the vocoding algorithm with a recurrent multi-level neural net, these systems are able to significantly improve the quality of parametric-based systems to the extent that the quality exceeds output from unit selection systems (e.g., Oord et al. [2016]). In some cases, the linguistic front-end can also be replaced by sequence-to-sequence neural net models [Wang et al. 2017]. Work on expressive speech exists for all these types of systems, though diphone, formant, and traditionally vocoded systems are now generally regarded as having unacceptably low perceived naturalness for most purposes. How much expressive speech control is available will also depend on whether the system is a research system, an off-the-shelf commercial system, or a bespoke system (based on voice recordings made specifically for the system). Open-source speech synthesis systems are notoriously labor-intensive in terms of building bespoke voices, ensuring sufficient quality, and integrating with a full system (e.g., a SIA). They can also present significant challenges in terms of performance. In contrast, commercial systems vary extensively in terms of price, flexibility, and the amount of expressiveness control they offer.

6.3.2 The Importance of Corpora

In general, the speech corpus used to create a voice will have a direct impact on the expressive variation available in the voice. Both unit selection and neural TTS systems are corpus-based. For unit selection the corpus underlying the voice has a rather rigid effect on the voice produced, as only fragments of the recorded corpus are used to generate new speech output. For neural TTS it is easier to generalize across the corpus and to also generalize across speaker, accent, and even language. Open-source speech synthesis systems will typically be based on corpora that are available to the academic community. Open-source corpora include


Arctic [Kominek and Black 2004], LJ Speech,2 VCTK [Yamagishi et al. 2019], LibriTTS [Zen et al. 2019], and LADS [Braude et al. 2019]. Other corpora are also available but may not be licensed for commercial use. Services such as ELDA3 and LDC4 also have corpora, but currently most are captured to support automatic speech recognition (ASR) research and development. Note that corpora captured for ASR are not typically appropriate for synthesis. Background noise, speech noises, disfluencies, and very casual speech are expected challenges in processing speech for recognition, so ASR corpora tend to include these effects. For speech synthesis, where clear articulate speech is normally desired, such effects can cause significant problems for any synthesis generation technique. Recording a bespoke corpus for a voice is resource intensive and time consuming. Unit selection systems will require upwards of five hours of data, which could take over 30 hours to record. For neural TTS approaches, adaptive voice modelling can be used to reduce the data required. The amount required is changing as modelling approaches become more sophisticated, but at least one hour of recorded data is a sensible starting point. The makeup of a corpus can affect the amount of data required and will also affect the flexibility of any voice based on it. However, when designing for a SIA, a bespoke corpus can have significant benefits. Custom voice styles and accents can be chosen to fit the form and functionality of the target system, adding the required expressive speech to the corpus. For a commercial system, creating a custom voice also has the benefit of unique branding. As we discuss the models, approaches, and parameters of expressive speech synthesis below, we will reflect on the corpus requirements for specific functionalities. Whether to use a custom corpus will often be determined by the resources and access to speech synthesis expertise available to a SIA project. Off-the-shelf systems may not offer the desired functionality, and an informed decision is required as to how to proceed with any given project.
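As a practical aid when weighing a bespoke corpus against an off-the-shelf voice, a simple audit along the following lines can total the recorded hours and flag utterances marked as noisy or disfluent before any voice building starts. The metadata layout (a CSV with duration_s and flags columns) is an assumption made for illustration, not a standard format.

import csv

def audit_corpus(metadata_csv):
    total_s, flagged = 0.0, 0
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            total_s += float(row["duration_s"])
            if row.get("flags"):            # e.g., "noise", "disfluency"
                flagged += 1
    hours = total_s / 3600
    print(f"{hours:.1f} h recorded, {flagged} flagged utterances")
    # Rules of thumb from this section: roughly one hour may be workable
    # for adaptive neural TTS, while unit selection typically wants five
    # hours or more of clean, clearly articulated speech.
    if hours < 1:
        print("Likely too little data, even for adaptive neural TTS.")
    elif hours < 5:
        print("May suit neural TTS with adaptation; thin for unit selection.")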

6.3.3 Controlling Expressiveness—Speech Synthesis Markup

The dominant method for controlling expressive functionality in speech synthesis is through the use of XML markup. The advantage of XML-based systems is that different synthesis systems can implement a subset of functionality, or extend functionality, without damaging output (i.e., reading out the markup instead of the content). SSML5 [Taylor and Isard 1997] is the dominant standard.

2. https://keithito.com/LJ-Speech-Dataset/
3. http://catalogue.elra.info/en-us/
4. https://www.ldc.upenn.edu/
5. https://www.w3.org/TR/speech-synthesis11


Microsoft, however, offers its own XML markup, and Apple’s embedded speech commands do not use XML and are not compatible across non-Apple platforms. Below is an example of SSML markup applied to alter the word emphasis of some synthetic speech.

<speak>I already told you I <emphasis level="strong">really</emphasis> like that person.</speak>

SSML tags can control prosodic elements of speech such as speech rate, pitch, and phrasing. They can also control lexical elements such as the pronunciation of words using the International Phonetic Alphabet (IPA). SSML can even be used to insert pre-recorded audio into speech, to control the interpretation of symbols and digits, and to mark word boundaries so that timing can be extracted after speech is generated to synchronize animation. However, there are a number of issues regarding SSML markup when used to control expressive synthetic speech for SIAs:

1. The standard is 10 years old and was designed without any knowledge of neural TTS approaches; it was also historically designed before unit selection became mainstream. This leads to some problems in interpreting the functionality of some tags in speech synthesis systems.

2. The prosodic control is intentionally simplistic and does not specify how these controls should be implemented. For example, if you change emphasis, what should happen to previously emphasized words? How should emphasis be created in terms of pitch, duration, energy, and spectral characteristics? What sort of intonation should occur at prosodic breaks?

3. Tags can be easily misused. For example, the tag <prosody duration="2s">hello there</prosody> will attempt to say “hello there” in exactly 2 seconds. If implemented, this will produce very unnatural slow speech. To change speech rate, markup such as <prosody rate="slow">hello there</prosody> would be more appropriate.

4. When switching to a different synthesis system, whether a tag is implemented, and the way it is implemented, may vary widely, meaning that the text may not be synthesized correctly.

Nonetheless, SSML is a very useful standard, and, given the very big changes that have occurred in speech synthesis technology, it has remained extremely


useful both as a practical markup system and as a reference for the sort of control that should be available. Part of the strength of SSML is its lack of concrete specification, giving engineers the flexibility to implement tags appropriately in different synthesis systems. Commercial systems have long developed internal tag sets to offer a more precisely specified set of controls, as well as controls not covered by SSML, such as variation in emotion and speech style. For example, Alexa offers a set of amazon:-prefixed tags that can be used with different voices. To conclude, understanding standard markup is important when designing the voice for a SIA, but making a design decision to use a specific synthesis system and its internal markup, given the under-specification of SSML, is also a valid strategy.
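Because tag support varies between engines, markup is often best generated programmatically by the interaction layer, with a plain-text fallback for systems that would otherwise read the tags aloud. The sketch below illustrates this; the function names and the choice of prosody and emphasis tags are illustrative assumptions, not tied to any vendor's tag set.

import re
from xml.sax.saxutils import escape

def to_ssml(text, emphasize=None, rate="medium"):
    # Wrap plain text in SSML, optionally emphasizing one word.
    body = escape(text)
    if emphasize and emphasize in text:
        body = body.replace(
            escape(emphasize),
            '<emphasis level="strong">' + escape(emphasize) + "</emphasis>", 1)
    return f'<speak><prosody rate="{rate}">{body}</prosody></speak>'

def strip_markup(ssml):
    # Fallback for an engine with no SSML support: keep only the content,
    # so the markup is never read out as text.
    return re.sub(r"<[^>]+>", "", ssml)

print(to_ssml("I already told you I really like that person.", emphasize="really"))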

6.3.4 Cross and Within Speaker Variation

When considering models, approaches, and design choices in expressive speech synthesis, it is useful to distinguish between cross speaker features and within speaker features that can be given to the voice. In general, cross speaker features are the basis of speaker identity and perceived personality and can be grouped in terms of language, accent, dialect, and vocal style. Within speaker features are changes across a sentence such as emotion, emphasis, and conversational control. In modern corpus-based speech synthesis, cross speaker features are very heavily dependent on the corpora used to build the voice. For single speaker corpora, the recorded material will completely dictate these features. For multiple speaker corpora, they can potentially be constructed drawing features from different sources. In general, cross speaker features would remain constant throughout an interaction, and normally across whole utterances. In contrast, within speaker features are primarily used to convey the inner state of the speaking agent by altering the way an utterance is spoken. Together with text, movement, or animation, expressive speech synthesis can convey a change in emotion, empathy, or focus, and can appropriately match the interactional context. It is here that speech synthesis XML markup (described in Section 6.3.3) becomes extremely important. Ideally, a language system could automatically insert such markup to control the speech synthesis. There is no clear-cut separation between cross and within speaker features. For example, a multilingual system may switch languages within a sentence, or different voice styles may be used across an utterance to render complex emotions. It is rather the pragmatic design of the system that creates the distinction. Many voices will only have one language, one accent, one dialect, and one voice style. After the corpora are recorded and voices are built, these features cannot be altered. Many systems will offer a limited set of word-by-word XML controls that in turn dictate


the within speaker features that can be applied. When designing a speech system for a SIA it is important to determine what flexibility will be required, whether the voice can produce such flexibility, and whether the TTS system can support it. Over the next few years, as interactive systems make greater use of the expressive functionality of speech synthesis, we will see a further blurring of the cross and within speaker feature boundary. Just as human speakers can mimic other speakers, alter voice style at will, and code switch between multiple languages, we will see future TTS systems offering such functionalities. However, the distinction remains important for understanding the challenges of the current design process, and we will discuss expressive speech design within this context.

6.3.5 Designing Expressive Speech—Cross Speaker Features

6.3.5.1 Language

The language a synthetic voice speaks (or is required to speak) has wide-ranging effects on both the quality and expressive functionality of the system. Despite many years of linguistic research exploring cross-linguistic features of expressive speech, for example, Ohala [1983], within speech synthesis the vast majority of systems are built on language-specific data. Although this is changing with recent work exploring the use of neural networks to generalize across languages, for example, Zhang et al. [2019], from a practical perspective the language(s) a SIA is required to speak will have a key impact on the expressive speech functionality that is available. Research in expressive speech is historically Anglo-centric. For example, in emotion recognition research it has been pointed out that collected databases are “dominated by English language” and “very few databases are collected in languages such as: Russian, Dutch, Slovenian, Swedish, Japanese, and Spanish. There is no reported reference of an emotional speech database in any of the Indian languages” [Koolagudi and Rao 2012, pp. 103–104]. Speech synthesis markup such as SSML is also Anglo-centric, and not all tags will transfer appropriately across languages. In addition, research exploring and controlling expressive speech synthesis across languages is rare. Multilingualism, where a speaker can use more than one language, is believed to be more common than monolingualism [Tucker 1999]. The use of code switching, where a speaker switches language during speech, is often cultural, expressive, and intimately connected to social relationships, for example, Paugh [2005]. Speech synthesis systems normally have very limited multi-language support. This can present a serious problem in systems where foreign words are often used. How an English word should be pronounced in a Spanish system is different from how it should be pronounced in a German system.


Traditional monolingual speech synthesis systems use what is often termed a front-end or G2P system to take a series of marked-up words and convert them into a series of phones (the basic sounds that make up a language), and other linguistic features such as syllable structure, stress, and phrasing (as described throughout this section). Such front-ends depend on a large body of manually constructed data, with the largest component comprising a pronunciation dictionary. Such a dictionary will have an entry for every word in all forms, describing its pronunciation in terms of phones and often stress (e.g., the CMU pronunciation dictionary6). Depending on the language there may also be modules for, amongst others, letter-to-sound generation for out-of-vocabulary words, normalization of symbols and digits, homograph disambiguation, stress and phrasing modelling, part-of-speech tagging, word disambiguation, archiphones (where phone sounds alter in sentence context), and more. More recently, significant work has explored generating synthesis using sequence-to-sequence neural network models which go directly from text to speech without any front-end processing. The most notable example of this work is Tacotron [Wang et al. 2017]. Although such end-to-end synthesis is attractive (especially for less resourced languages), from the perspective of expressive speech it raises the question of how expression can be controlled. For example, if the end-to-end system has no overt model of stress, how might we control emphasis? Recent work [Skerry-Ryan et al. 2018, Hodari et al. 2019] explores this issue and proposes various solutions. However, from a practical perspective, when using an end-to-end speech synthesis system for a SIA it is important to be aware of what expressive speech control is available, if any, as well as the ability to correct errors in the synthesis produced.

6. http://www.speech.cs.cmu.edu

6.3.5.2 Accent and Dialect

Accent can be defined as “The cumulative auditory effect of those features of pronunciation which identify where a person is from, regionally or socially” [Crystal 1997, p. 2]. It acts as a clue to a speaker’s social identity, socio-linguistic background, and geographical origin [Ikeno and Hansen 2007, Crystal 2011], strongly influencing perceptions of a speaker, eliciting specific stereotypes and assumptions associated with a particular accent [Ryan and Giles 1982, Coupland and Bishop 2007]. Accent differs from dialect in that it is concerned with pronunciation. Conversely, dialect refers to language patterns, such as grammar and vocabulary, that we associate with particular geographical regions or social groups [Hughes


et al. 2013]. Like accent, dialect terms used by people can signal cues to their identity. It is possible to model multi-accent voices and use accent shift as an expressive synthesis technique. If the accents can be broadly described by the same phone set, then this can be implemented using a voice style (see Section 6.3.5.3). If not, techniques used for multi-language systems are required (see Section 6.3.5.4). The decision to use accents in synthesis needs to be made with full awareness of the social context in which the voice will be used. For instance, research has shown that people tend to prefer interacting with agents that use standard-accented speech due to their supposed prestigious status [Bishop et al. 2005, Torre and Maguer 2020]. Singaporean users rated British-accented speech in a helpdesk agent more positively compared to Singaporean-accented English [Niculescu et al. 2008], in part due to the perception of prestige given by a British accent within formal interactions in Singaporean society [Niculescu et al. 2008]. Similarly, within the UK, artificial agents speaking with accents traditionally deemed to be prestigious were trusted more than agents with other accents, even when the agents were equally trustworthy [Torre et al. 2015, 2018]. In a study involving social robots speaking with different Arabic accents, Andrist et al. [2015] have also found that social status and credibility interact. People have also shown a tendency to prefer interacting with agents that speak with similar accents to their own [Cargile and Giles 1997, Kinzler et al. 2011], congruent with in-group membership preferences and similarity attraction effects [Dahlbäck et al. 2001]. A system’s accent also has ramifications for the dialect choices people are likely to use in interaction. A recent study found that accent significantly impacted the likelihood of using lexical alternatives [Cowan et al. 2019]. When playing a referential communication game where a number of objects could be described using either US or Hiberno-English dialect alternatives (e.g., wrench and spanner), people were more likely to use US English dialect terms in their descriptions when interacting with a US-accented system compared to an Irish-accented system [Cowan et al. 2019]. This parameter, as well as influencing perceptions, may therefore also play a significant role in shaping the dialect that users use in interaction.

6.3.5.3 Voice Styles

Sub-corpora, where a speaker is recorded speaking in a certain style, can be used to control synthetic voice styles. Often such sub-corpora will be acted, for example, asking a speaker to speak sadly when reading material for recording. But they can also be elicited by asking speakers to read appropriate material, for example, upbeat enthusiastic sentences to encourage an upbeat and enthusiastic voice style. In unit selection approaches, voice styles are most effective when recorded with


appropriate target material. This is because unit selection systems require very high coverage of phonetic and prosodic contexts, and without recording a very large amount of material the sub-corpora will tend to produce less natural results the more they stray from the text used to record them. For neural TTS systems, using voice adaptation (see Section 6.3.5.4) minimizes this problem and allows the construction of voices with many sub-corpora. Initially, work on voice styles was carried out in order to synthesize emotions, for example, Hofer et al. [2005], where Happy, Angry, and Neutral sub-corpora were used. Later work explored sub-corpora based on different voice qualities so that they could be coupled with emotional synthesis techniques based on changing rate and pitch (see Section 6.3.6.1). A stressed (tense) voice quality tends to be rated negatively, while a lax (calm) voice quality tends to be rated positively [Aylett et al. 2013, Potard et al. 2016]. Other examples of work include using sub-corpora of conversational speech [Andersson et al. 2010] and recording voice styles specifically for the use cases of an intelligent virtual agent (IVA), such as a motivating voice style for developing a virtual sports coach [Aylett and Braude 2018]. If resources are available to record bespoke audio for a SIA, then considering a set of voice styles that will support the intended interaction contexts is a fundamental design step. When choosing an off-the-shelf TTS system, it is important to be aware of what voice styles may be available and how they may be used in a SIA.

6.3.5.4 Voice Adaptation

Recent methods for voice cloning (the creation of a synthetic voice that sounds like a specific source speaker) and voice styles depend on the use of voice adaptation techniques. Voice adaptation is a process where speech data collected from other speakers is used to improve a model for a new speaker or a speaker with limited data. The technique was initially applied to speech recognition but generalized to synthesis models in pioneering work on parametric synthesis based on Hidden Markov models [Yamagishi et al. 2006]. Hidden Markov models were later replaced with deep neural networks (DNNs). Until the vocoder, the system that converted model parameters into speech, was also replaced by DNNs in neural TTS, the output quality was much lower than that of unit selection systems and the approach was not widely used. However, neural TTS opens an entire world of possibilities by allowing voice adaptation to be applied to high-quality speech synthesis, as exemplified by recent work [Arik et al. 2018, Bollepalli et al. 2019, Prateek et al. 2019, Luong and Yamagishi 2020, Zhu and Xue 2020]. It is important to note that this field is changing very rapidly. Whether voice adaptation could be used to rapidly generate voice


styles and clone voices for SIAs would depend very much on what neural TTS system is used and to what extent there is collaboration with speech synthesis experts.

6.3.6 Designing Expressive Speech—Within Speaker Parameters

6.3.6.1 Emotional State

The ability to develop speech that can emote has been one of the core pillars of expressive synthesis development. The notion here is that for a voice to be truly natural or human-like, it needs to be able to accurately express emotion, and much work has been dedicated to understanding what parameters in speech are related to emotion and how these can be applied in speech synthesis; see Schröder [2001, 2009], Govind and Prasanna [2013], Gangamohan et al. [2016], and Kamiloğlu et al. [2020] for reviews. The manipulation of prosodic features (e.g., fundamental frequency (F0) contour, level and range, speech tempo, loudness, and voice quality) is key to generating specific emotions; see Schröder [2001, 2009] and Govind and Prasanna [2013] for a detailed breakdown across emotional states. The majority of previous work exploring emotional and expressive speech synthesis has focused on a distinct set of between three and nine extreme emotional states, sometimes termed Darwinian emotions, such as disgust, anger, joy, sadness, and surprise [Schröder 2001]. However, as Schröder [2004, p. 211] points out:

In a dialogue, an emotional state may build up rather gradually, and may change over time as the interaction moves on. Consequently, a speech synthesis system should be able to gradually modify the voice in a series of steps towards an emotional state. In addition, it seems reasonable to assume that most human–machine dialogues will require the machine to express only mild, non-extreme emotional states. Therefore, the need to express full blown emotions is a marginal rather than a central requirement, while the main focus should be on the system’s capability to express a large variety of emotional states of low to medium intensity.

Thus, dimensional models of emotion, such as the circumplex model, might be more appropriate to convey emotion in artificial speech [Rubin and Talarico 2009]. More recently, there has been a growing understanding that the emotional content of synthesis should be dependent on the task and requirements of the system. Given this, we would expect our emotional categorization to be task-dependent and that synthesis researchers and dialogue researchers would work closely together to specify, design, and evaluate the resulting system.


Although task-based schemes for emotional response do exist (for example, the framework proposed by Ortony et al. [1988], known as the OCC model), the emotional categorization used in most emotional synthesis research is typically not task-dependent. Pitch and speech rate contribute strongly to the perception of emotion [Ramakrishnan 2012]. For most commercial speech synthesizers, controlling speech rate and pitch is normally possible using SSML or bespoke XML markup. For example, raising the pitch and increasing the speaking rate can convey cheerful enthusiasm, while lowering both can convey a sense of sadness. However, pitch and rate features are strongly related to other speech features such as voice quality, spectral tilt, and prosodic and phonetic context. Thus, although some variation can be generated using pitch and rate change, it may often sound unnatural, especially if modified by more than 10%. An alternative is to record a sub-corpus of speech acted with a specific emotion. This is similar to using the cross speaker approach of voice styles (see Sections 6.3.5.3 and 6.3.2). This sub-corpus can then be used either as data for unit selection or to model the emotion. Speech output based on the sub-corpus can then be inserted into the speech stream when a change is desired. However, being able to transition between such voice styles is a challenge. Voice adaptation techniques (see Section 6.3.5.4) can potentially be used to produce a graded effect [Zhu and Xue 2020], but how to move gracefully between emotions is still a key issue. As mentioned, when considering the use of emotional synthesis it is important to note that emotional expression is not just relevant to the speech signal itself. The perception of emotion is hugely influenced by the context, speaker, and speech content (e.g., syntax and lexicon) [Erickson 2005]. When creating expressive speech, consideration therefore needs to be paid to linguistic content, sociolinguistic factors, and their interplay with prosodic factors within the speech signal. Expression of emotion through speech is also multimodal in nature. For a very simple example, consider smiling: the act of pulling the muscles around our lips upwards modifies our vocal tract, which in turn will give a different quality to speech when it is smiled than when it is “neutral.” This is reflected in the fact that we can “hear” smiles, for example, when we are on the phone with someone [Tartter and Braun 1994]. Although they are yet to make their way into commercial systems (along with much other expressive voice functionality), work is already underway to create TTS systems that can switch from neutral to smiling [El Haddad et al. 2015a, 2015b] and laughing [Sundaram and Narayanan 2007]. While researchers have not yet agreed on an acoustic definition of a smile, it seems that smiling is reflected in our voice by means of prosodic and spectral changes (namely, increased fundamental and formant frequency, and increased spectral centroid) [El Haddad et al. 2017, Arias et al. 2018]. Synthesizing smiling and laughing speech is not trivial because


of the lack of one-to-one mappings between acoustic and perceptual features, but it is a promising avenue for research that could have a real impact on agent technologies, as currently there are very few “emotional” voices for artificial agents.

6.3.6.2 Emphasis and Question Intonation

Changing the intonation of an utterance to alter its speech style or emotion is typically connected with information content and dialogue context, for example, making an utterance sound like a question or changing the emphasis on a word. Many speech synthesis systems will support question intonation to some extent. In unit selection systems, effective question intonation synthesis is hampered by a requirement for sound coverage. Typically, pre-recorded question tags and the use of non-final intonation present in non-sentence-final utterances can be used to generate question intonation. This has mixed success across languages but is often acceptable, especially as the perception of a question is often driven more strongly by lexical content than by question prosody. In neural TTS systems, the long-range prosodic patterns of questions can be modelled without requiring full sound coverage. However, question intonation across sentences and speakers is very far from homogeneous, making learning such forms from a corpus a challenging task. Research in this area is still at an early stage, with the exploration of different neural net models and feature architectures to improve the variation expected in spoken intonation an active topic [Kenter et al. 2019, Marelli et al. 2019, Sun et al. 2020]. For emphasis, the lack of homogeneity in the way speakers emphasize words, caused by sound and sentence context and individual speaker styles, makes modelling emphasis even more challenging than question intonation. Most current work in speech synthesis assumes that if the intonation modelling is good, appropriate emphasis will be created automatically (e.g., research looking at paragraph intonation to model changes in emphasis across utterances [Aubin et al. 2019]). Although many commercial systems implement the SSML emphasis tag, the implementation can vary and can have various levels of success. For noncommercial systems, markup may not be present at all. For designers and engineers working with SIAs, this common lack of explicit emphasis control can be frustrating. However, if we return to the original phonetics literature (e.g., Cruttenden [1997]), we can see that emphasis is typically manifested by extended phonetic duration and pitch excursions. Therefore, if pitch and rate control is available,


manipulating duration and pitch directly can often be used to create the perception of emphasis as well as rising question intonation.
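Where only generic prosody controls are exposed, a sketch along the following lines can approximate emphasis and a final question rise by wrapping words in pitch and rate adjustments. The exact percentages are illustrative assumptions and, as noted in Section 6.3.6.1, changes much beyond roughly 10% tend to sound unnatural.

from xml.sax.saxutils import escape

def emphasize_word(sentence, word):
    # Approximate emphasis with a small pitch rise and a slight slowing.
    marked = escape(sentence).replace(
        escape(word),
        f'<prosody pitch="+8%" rate="92%">{escape(word)}</prosody>', 1)
    return f"<speak>{marked}</speak>"

def as_question(sentence):
    # Approximate question intonation with a small rise on the final word.
    words = sentence.rstrip("?. ").split()
    head, last = " ".join(words[:-1]), words[-1]
    return (f"<speak>{escape(head)} "
            f'<prosody pitch="+10%">{escape(last)}</prosody>?</speak>')

print(emphasize_word("I already told you I really like that person.", "really"))
print(as_question("You already told them"))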

6.3.6.3 Conversational Speech

Kawahara [2019] presents the challenges and research objectives for a spoken dialogue system for the human-like conversational robot ERICA. He points out that “Speech synthesis should be designed for the conversational style rather than text-reading applications, which are conventional targets of text-to-speech. Moreover, a variety of non-lexical utterances such as back channels, fillers and laughter are needed with a variety of prosody” (p. 5). Thus, as well as requiring the expressive speech techniques described in earlier sections, interactive conversational speech also requires specialized expressive speech functionality. Most commercial systems are not designed for conversational speech, expecting rigid turn-by-turn interaction. This is rapidly changing, with Google’s Duplex having a major impact on the ability of systems to engage in conversation (in this case in a limited appointment/reservation domain) and to use conversational features which meant that, over the telephone, the artificial system was indistinguishable from a human dialogue partner. There is great scope for SIA research to advance the state of the art in speech interaction, but to do so it is important to be aware of the techniques that are available for creating synthetic conversational styles and interaction.

Conversational Speech Style. As we discussed in Section 6.3.2, modern synthesis systems are based on speech corpora. Most corpora are created from read speech. Corpora of spontaneous and conversational speech are rare and present significant challenges for analysis. However, a conversational speech style is a requirement for an interactive system. Whereas unit selection systems face several challenges in creating conversational speech [Andersson et al. 2010], neural TTS systems can potentially model conversational styles of speech (e.g., Székely et al. [2019]). Such systems, with much lower requirements in terms of corpora size, especially with voice adaptation, could also successfully model acted conversational speech. Another feature of conversational speech is prosodic accommodation, where human dialogue partners tend to match each other’s speech rate and style of speech [De Looze et al. 2014]. Using standard markup that implements global pitch and rate change, this can be achieved by most synthetic speech systems.

Back Channels. Giving speech feedback while listening is a fundamental part of conversational interaction. Such speech is termed back-channeling and has significant prosodic differences from the same words spoken in other contexts. There is nothing that prevents most speech synthesis systems from generating


backchannels or stock phrases to support a dialogue partner [Schröder et al. 2011, DeVault et al. 2014].

Disfluencies. Disfluency, often defined as a combination of speech errors and filled pauses, is a normal part of conversational speech. Filled pauses in particular can play an important role in conversational dynamics. Previous research has explored the use of filled pauses in speech synthesis [Adell et al. 2007, Andersson et al. 2010, Wester et al. 2015]; however, standard speech synthesis systems will not generally offer such functionality.

Laughter, Breathing, and Speech Noises. The modelling of non-speech noises is common in speech recognition but rarely used in speech synthesis. Google Duplex did use some of these techniques, but we are not aware of any published work. Synthesizing laughter is extremely challenging [Trouvain and Schröder 2004]. As with sobbing, laughter often merges with speech, making modelling difficult.

Taking, Holding, and Ceding the Floor. In dialogue the act of speaking is mediated by the interaction with other speakers. In order to take the floor and be able to speak, prosodic techniques are used (e.g., abrupt in-breaths to show a desire to speak, raising the voice to prevent interruption, or a very clear prosodic drop to show readiness for another person to speak). Some of these effects can be modelled using current expressive synthesis functionality, but many remain unexplored in the current state of the art.

Architecture. An interactive use of speech synthesis requires a major change in the assumed architecture of a dialogue system. Being able to interrupt speech output in response to an outside event is a fundamental requirement. Work on reactive [Wester et al. 2017] and incremental [Baumann and Schlangen 2012] speech synthesis architectures explores how speech output might change dynamically as external events occur. In complex artificial systems there is a tendency to play audio and move on, meaning that it cannot be interrupted, and if it is interrupted the system does not know what part of the message has already been played. Furthermore, bearing in mind that a response is required within 200 ms, integrating speech recognition, dialogue planning, natural language generation, and speech synthesis to respond effectively is a major challenge. For example, speech synthesis from a cloud service may just not be fast enough.
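As a rough sketch of the architectural point about interruption, the following plays synthesized audio in small chunks and tracks word timings, so that on a barge-in the dialogue manager knows which words were actually delivered. The playback and event hooks (play_chunk, interrupt_requested) are hypothetical placeholders for whatever the agent platform provides.

from typing import Callable, List, Tuple

CHUNK_S = 0.1  # playback granularity in seconds

def speak_interruptibly(
    audio_chunks: List[bytes],
    word_timings: List[Tuple[str, float]],   # (word, end time in seconds)
    play_chunk: Callable[[bytes], None],
    interrupt_requested: Callable[[], bool],
) -> List[str]:
    """Play chunks until finished or interrupted; return the words delivered."""
    played_s = 0.0
    for chunk in audio_chunks:
        if interrupt_requested():
            break
        play_chunk(chunk)
        played_s += CHUNK_S
    # Words whose end time falls within the played audio were heard in full,
    # so the dialogue manager can decide what still needs to be re-planned.
    return [word for word, end in word_timings if end <= played_s]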

6.3.7 Summary

The technology underpinning expressive speech synthesis is rapidly changing. For most SIA work the state of the art is not required, but a good understanding of the control and functionality of any proposed speech synthesis system is crucial for the design, implementation, and integration of synthetic speech into an agent or robot that is required to speak. In the previous sections we have tried to cover the


engineering techniques and types of control that are typically available or that make sense as part of a development project. But we note that the key for any technology is how it is used and integrated into the overall system.

6.4 Fundamental Considerations When Designing Expressive Agents


6.4.1 Should You Use Speech at All?

Although synthetic expressive voices can be built, whether to use expressive synthetic speech is fundamentally a design decision. Speech may not always be necessary, and those who wish to use these voices may need to weigh the benefits and drawbacks of speech as a modality before committing to such a design path.

6.4.1.1 The Benefits of Using Speech

Speech has been touted as a more natural modality for interface interaction [Moore 2017]. Despite the stark differences between human–human and human–machine dialogue and their underlying purposes [Clark et al. 2019b], introducing speech in SIAs can have practical benefits. In hands-busy and eyes-busy scenarios, where people are otherwise occupied in other tasks, incorporating speech in SIAs can make a great deal of sense. A wealth of tasks requires our attention. These include more mundane tasks like reading a newspaper or book, through to safety-critical tasks like driving or performing surgery [Heo et al. 2017, Large et al. 2017]. If systems wish to interact with us (or us with them) when we are otherwise busy, using speech can allow for minimal interference with our primary task—although this might depend on the difficulty of the primary task being conducted [Edwards et al. 2019]. Speech can also play an essential role in making an interface more accessible. For some users, using speech is not only preferable but an essential way to interact with a device [Corbett and Weber 2016]. Speech input and output can be critical in supporting people with limited motor capabilities [Corbett and Weber 2016], those who have visual impairments [Abdolrahmani et al. 2018, Reyes-Cruz et al. 2020], as well as having the potential to support interaction for older adult users [Sayago et al. 2019]. For diverse demographics, speech can provide accessible interaction where other modalities (e.g., through GUIs and tactile interfaces) are difficult, if not impossible [Corbett and Weber 2016]. This allows users to conduct common tasks like web browsing [Sato et al. 2011, Williams et al. 2020]. Speech can also help technology to be more socioeconomically inclusive, supporting people whose levels of literacy may be low [Medhi et al. 2009]. Compared to other modalities,


speech can also allow designers to more easily give the interface a personality. In human–human interaction, speech is a powerful indicator of identity and personality [Cameron 2001, Goffman 2005] and is one of the primary means of social identification [Barthes 1977]. These perceptions are also made in HCI [Nass and Lee 2000, 2001]. Through manipulating accent or speech rate, a specific identity can be created for an interface, which can have a notable impact on user performance, learning, trust, and even purchasing habits [Nass and Brave 2005].

6.4.1.2 The Drawbacks of Using Speech

At times, speech can be a cognitively demanding modality [Aylett et al. 2014]. As such, it may not be appropriate for delivering large amounts of information [Doyle et al. 2019]. SIAs delivering long-form information such as lists could make interactions cognitively taxing for the user [Jung et al. 2020]. Recent research has also found that synthetic speech can impose a higher cognitive load than non-synthetic speech, particularly for lower quality synthesis [Govender and King 2018]. The cognitive demand of interaction is particularly acute for non-native (L2) speakers. In a recent study, users who had to use their non-native language to interact with a voice agent experienced significantly higher levels of cognitive load than those who could interact in their native language [Wu et al. 2020a]. For these users, pairing speech with visual feedback may be critical in supporting the interaction [Wu et al. 2020b]. A consistent challenge when using speech also lies in how people determine what they can actually do through speech with the interface. Facilitating discoverability, the ability for users to discover and use interface commands, needs major consideration when designing a speech interaction. Recent work identified that including a strategy to help users discover functions (either by explicitly prompting users or by letting users initiate a help request) leads to significantly better usability scores than not designing for this at all [Kirschthaler et al. 2020]. Methods can also be used to aid discoverability when using more visual and multimodal forms of interaction, for example, via screens to display common options or functions. Whether through vision- or speech-based means, thought needs to be put into how users will actually know what to do in a speech-enabled interaction. Critically, we also need to consider whether the context within which the interaction occurs is appropriate for speech. The user experience of a speech system is highly dependent not only on the purpose of a system but also on where it will be deployed, and the type of interaction users are expected to have with it in that context. For instance, noisy environments can significantly impact the intelligibility of speech output [Cooke et al. 2013], cancelling out any multitasking


benefits of using speech. Using speech in public environments may pose additional drawbacks. Research has shown that people prefer to use speech in private settings [Begany et al. 2016, Luger and Sellen 2016, Cowan et al. 2017], partly due to fear of social embarrassment. Not all settings carry this potential for embarrassment, however. Multiparty interactions with IPAs are common in home settings [Porcheron et al. 2018], and when designed for that context, they can make social speech interface interaction feel more appropriate [Porcheron et al. 2017].

6.4.1.3 Synthetic or Pre-recorded Speech

After deciding to use speech, it is important to consider whether to use synthetic or pre-recorded speech. Synthetic speech may be accessed relatively cheaply in contrast to pre-recorded speech, both in terms of open-source synthesizers and commercially available voices. Additionally, using synthetic speech allows designers to make rapid changes to the spoken output, for example, in changing entire utterances, the ordering of words, or expressivity. While creating synthetic speech is not without its cost (see Section 6.3.1), pre-recorded speech can incur additional equipment, service, and time costs that are not always feasible or within monetary constraints [Pincus et al. 2015]. These can increase further if aiming for the highest quality pre-recordings, such as the use of voice actors. With both speech options, there are differences in quality to be considered. A prior systematic evaluation comparing utterances of both speech types found that voice actor recordings were rated as more likeable, conversational, and natural than both amateur human recordings and synthesized voices [Georgila et al. 2012]. However, synthetic speech emitted from either a high-quality general purpose or a “good limited domain voice” can outperform amateur human recordings [Georgila et al. 2012, p. 3525]. Designers must also consider what purpose the speech in a system is intended to serve. Pincus et al. [2015] argue that a human voice is more appropriate for evoking an intended reaction from a listener, though only if evoking specific perceptions of a voice warrants the additional costs that human recordings bring. If a system is being designed to have very precise expressive characteristics and quality, a human voice may be more suitable. Additionally, the perception of linguistic content spoken by a system can be related to the voice used to speak it. Clark et al. [2016] argue that phenomena like being polite and using vague language may be better received from a human voice than from a synthetic voice. The purpose and use of a system should be considered when deciding on the type of voice to use for expressive speech. This also requires an understanding of how best to evaluate expressive speech and the context a system is used in (see Section 6.5.3).


Finally, many speech synthesis systems will support a library of pre-recorded prompts that can be accessed with XML markup. Thus, if the resources are available for building a custom voice, both the dynamic benefits of speech synthesis and the subtle quality enhancements from pre-recorded prompts can be used to support expressive output for a SIA.

6.4.2 The Decision to Embody—Intelligent Virtual Agents and Social Robots

Another clear decision point when designing SIAs that use expressive speech is whether the agent being developed will be embodied or not. This can interplay significantly with the voice, influencing perceptions. When considering embodiment, there are two common forms: either a virtual representation of an agent (e.g., an IVA) or a physical robot (e.g., a social robot, SR). Due to their embodied nature, both have the ability to use natural cues (gaze, gestures, etc.) to express emotion and exhibit personality, with their ability to emulate social characteristics being important for acceptance [Fong et al. 2003]. The most obvious difference between the two is that SRs have a physical body that can interact with the environment, whereas IVAs only inhabit a virtual world. This has implications for their ability to be expressive as well as to realistically interact with the environment around them. This can be driven not only by their perceived physical form but also by their built-in sensors and algorithms. For example, some SRs such as Pepper or PR2 have cameras for vision, speakers and microphones for speaking and hearing, and manipulators for touch; the Furhat robot can “see” and “speak” but cannot touch; the Keepon robot can only “hear” (Figure 6.1). These affordances drive their ability to be expressive in interaction, for example through gestures [Kose-Bagci et al. 2009, Bremner et al. 2011, Salem et al. 2013, de Wit et al. 2020] or gaze behaviors [Mumm and Mutlu 2011, Srinivasan and Murphy 2011]. When designing speech output for IVAs and SRs, it is important to be aware that these entities elicit different psychological and behavioral responses from users. People tend to rate robots as more similar to humans, and significantly more engaging than virtual characters, because of the perception that the agent is a real entity, as opposed to a virtual one [Kidd and Breazeal 2004]. A recent meta-review of studies comparing robots and virtual agents [Li 2015] also found that the majority of participants reported more positive attitudes—for example, trust, enjoyment, attraction—towards robots than virtual agents. These differences may be rooted in the differing physical presence afforded by SRs over IVAs. Physical presence may not only lead to more realistic expressiveness, but it also augments an agent's ability to generate rich communication, which in turn leads to a more successful interaction and higher acceptance from the user [Wainer et al. 2006].


Figure 6.1 Social robots with different levels of embodiment: (a) Pepper robot by SoftBank Robotics ©2019 Marco Verch, (b) PR2 robot by Willow Garage ©2019 Ilaria Torre, (c) Furhat robot by Furhat Robotics ©2020 Furhat Robotics, and (d) Keepon robot by BeatBots ©2007 Hideki Kozima, Marek Michalowski/BeatBots LLC.

Physical co-location has also been shown to increase compliance and influence decision-making [Bainbridge et al. 2011]. The discussion on embodiment is integral to the development of expressive speech in artificial agents, as embodiment may significantly impact how expressive speech is perceived. This further ties to the concept of multimodality, as expressivity is multimodal by nature. For example, if we want to highlight a word in an utterance, we use prosodic stress and will often accompany it with so-called “visual beats,” such as hand gestures or eyebrow movements [Krahmer and Swerts 2007, Swerts and Krahmer 2010]. Expressive phenomena (such as highlighting a word, smiling, or showing empathy) can still be conveyed through a single modality, for example, in the case of disembodied artificial agents, and they generally increase positive perceptions [Bretan et al. 2015] and improve decision-making [Torre et al. 2020]. However, expressing the same phenomenon through multiple modalities can


increase communication success, such as in the case of lip-reading [Campanella and Belin 2007]. Not only this, but pairing modalities that do not match, whether accidentally or not, actually interferes with communication, as exemplified by the McGurk effect [McGurk and MacDonald 1976]. In human–agent interaction, it has been shown that agents expressing different emotions in the face and voice elicit different behavioral responses than when the emotion is expressed in only one channel [Antos et al. 2011, Torre et al. 2018]. This suggests that expressivity enhances communication when there is congruence between modalities, but interferes with it when the modalities do not match. This could be an issue when designing an expressive voice for a social robot that cannot make the corresponding facial and bodily gestures, for example, because it lacks the appropriate degrees of freedom. In sum, while expressivity at large can facilitate communication and improve person perception, designers must ensure that all the available modalities on an artificial agent have the same expressive capabilities; otherwise, the resulting expression mismatch could be detrimental to the interaction.

6.5 Current Challenges and Future Directions in Expressive Synthesis

Recent work has highlighted a number of challenges and future directions related to expressive speech that need to be addressed. Clark et al. [2019] provide an overview of speech systems within HCI, noting a number of research challenges alongside methodological and evaluation challenges. Cambre and Kulkarni [2019] highlight the social implications of designing voices for smart devices and provide a research framework for designers to utilize to help shape users’ experiences. Finally, Wagner et al. [2019] discuss the future of evaluating speech synthesis, suggesting a move towards HCI-focused approaches of evaluating speech in appropriate contexts with users. Here, we build upon this existing work and present three challenges for those working in different areas of expressive synthesis.

6.5.1 Considering Why and Where We Need Expressive Synthesis

One of the aims of developing and using expressive synthesis is to emulate human-like qualities within the voice [Akuzawa et al. 2018]. Although this is a seemingly innocuous goal, recent work emphasizes that it may have profound consequences for interaction from the user perspective. The voice of a speech system is likely one of the key reference points when thinking and reasoning about what a system using speech can (and cannot) do. Design decisions in the voice may be important drivers of people’s perceptions of partner competence and ability [Luger and Sellen 2016] (i.e., a user’s partner model [Doyle et al. 2021]). Work (highlighted in


the sections above) demonstrates that expressiveness (e.g., through accent choices) can significantly impact user perceptions [Dahlbäck et al. 2007] and user language choices [Cowan et al. 2019] in interaction. Crucially, making voices more human-like through using expressive synthesis may over-inflate users’ perceptions of what the system being used can achieve in interaction [Luger and Sellen 2016, Moore 2017], significantly affecting the quality of the interaction. Research with both interviews and focus groups exploring the user experience of speech-based IPAs [Luger and Sellen 2016, Cowan et al. 2017] found, unsurprisingly, that users see conversation as the key metaphor for interaction, with the human-like nature of speech output and synthesis being major cues to support this assumption. Yet these clearly do not accurately map to current system capabilities. This leads to a problem of inaccurate expectation setting [Leahu et al. 2013, Luger and Sellen 2016, Moore et al. 2016], resulting in potential communication breakdowns and unsuccessful engagements with systems. These perceptions of humanness also appear to be multidimensional in nature rather than monolithic, and designers may need to treat them as such [Doyle et al. 2019], for example, by matching levels of human-likeness in both visual appearance and voice quality [McGinn and Torre 2019]. Identifying when and whether to use expressive synthesis, as well as evaluating its effect on user interaction, is a significant challenge for future research in this domain.

6.5.2 Towards Gender-neutral Voices in Expressive Speech Design?

A typical decision by most designers of speech systems is to opt for a female voice as the default [Danielescu 2020]. This has led to users widely anthropomorphizing speech agents as female, consistently referring to agents as “her” or “she” when describing their experiences [Luger and Sellen 2016, Cowan et al. 2017]. A recent UNESCO report identified that the current design of speech agents is not gender-neutral, potentially amplifying gender stereotypes, and raised the need to consider how voices can be designed to be more gender-neutral [West et al. 2019]. A key challenge in expressive synthesis is in exploring how we can build expressive synthesis that minimizes this bias. According to UNESCO, such a decision may have significant consequences: “Because the speech of most voice assistants is female, it sends a signal that women are obliging, docile and eager-to-please helpers, available at the touch of a button or with a blunt voice command like ‘hey’ or ‘OK’.” [West et al. 2019, p. 150]. Consequently, it is clear that we need to carefully consider the design rationale for gendered voices within speech and agent-based systems. Seminal work by Nass, Steuer, and Tauber has highlighted that gender stereotypes similar to those in human–human interaction also appear when using male and female agent voices, with male voices being perceived as more dominant and
assertive than female voices [Nass et al. 1994]. As highlighted by Sutton [2020] and Cambre and Kulkarni [2019], we must deal with gender issues in speech interfaces with nuance, avoiding the common conflation of biological sex (i.e., male or female) with perceptions of gender in VUI design, which are influenced by a number of variables.

6.5.3 More and Better User Evaluation Needed Evaluating speech is currently done using three key approaches [Wagner et al. 2019]: objective assessments classifying systems with particular scores or contrasting them with other speech (e.g., through mel-cepstral distortion [MCD] ratings); subjective assessments rating speech on concepts such as intelligibility and naturalness; and behavioral assessments examining user actions like task completion time or physiological arousal. Traditionally, synthesis is evaluated through listening tests, for example, Black and Tokuda [2005], where speech samples are subjectively rated for perceptions of naturalness and intelligibility [Wu et al. 2019]. These assessments are often done in a state of quasi-isolation and contrasted against other samples of synthetic or human speech. Given that speech is rarely listened to in a vacuum, the ecological validity of this approach has been questioned [Mendelson and Aylett 2017]. Instead, speech forms part of a specific application in a particular interaction context [Clark et al. 2019a]. As such, it is critical that expressive speech synthesis (indeed all synthesis) is evaluated in a manner relevant to how, where, and with whom it is deployed. We also need more evaluation of how expressive synthesis parameters impact user experience and user behavior. Future expressive synthesis work should follow recent calls to adopt approaches often seen in HCI literature, where some form of user interaction with a system is crucial [Wagner et al. 2019]. These interactions may include simple prototypes or mock systems, or even Wizard of Oz scenarios, placed within an appropriate interaction context. Following these methods will not only improve ecological validity and remove the burden of focusing on incremental improvements to human-likeness [Aylett et al. 2019a], but will also shed light on how aspects of expressiveness influence the end user and their interaction experience.
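To make the objective strand of evaluation above concrete, the sketch below shows how a mel-cepstral distortion (MCD) score might be computed between a reference and a synthesized utterance. This is a minimal illustration, not a standard implementation: the function name is hypothetical, the arrays stand in for pre-extracted and already time-aligned mel-cepstral coefficients, and practical evaluations typically also handle frame alignment (e.g., with dynamic time warping) and feature extraction with a dedicated speech toolkit.

# Minimal sketch of a mel-cepstral distortion (MCD) score, assuming two
# already time-aligned mel-cepstrum sequences of shape (frames, coefficients).
import numpy as np

def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Mean MCD in dB; the 0th (energy) coefficient is excluded by convention."""
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float((10.0 / np.log(10.0)) * per_frame.mean())

# Toy data standing in for features extracted from reference and synthesized audio.
rng = np.random.default_rng(0)
ref = rng.normal(size=(200, 25))
syn = ref + 0.05 * rng.normal(size=(200, 25))
print(f"MCD: {mel_cepstral_distortion(ref, syn):.2f} dB")

Lower scores indicate a closer spectral match to the reference, which is one reason such objective measures are usually reported alongside, rather than instead of, the subjective and behavioral assessments discussed above.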

6.6 Summary As mentioned in the introduction, we know there is something special about speech, and as speech systems are being interacted with widely, in a varying set of contexts, the expressiveness of the voices they use is a timely topic. It is one at the forefront of research and development in speech systems. Through this chapter we hope that we have given the reader a flavor of some of the key definitions, methods, and considerations when building expressive speech synthesis into any SIA. We
highlight the importance of aspects such as corpora as well as the usefulness and limitations of SSML in the creation of expressive voices. Even though we emphasize the types of parameters that can currently be added, or are being explored in relation to synthetic speech, our chapter is not meant to suggest that these should be used at all times and in all situations. On the contrary, we wish to emphasize to the reader that it is important to think about the interaction being developed and designed for. Is expressive speech going to benefit this interaction? How will it affect users' perceptions and behaviors? Will embodiment be able to support the expressiveness of the speech appropriately? Do you even need speech at all? The lack of focus on user-related concepts in expressive synthesis is a significant omission in the field currently, and addressing it is a major challenge for future researchers looking to influence expressive synthesis research. We hope that this chapter will not only give readers a set of signposts to support them when exploring the world of expressive synthesis, but also act as a catalyst for readers to think critically about the role, nature, and place of expressive synthesis in systems being designed.

References A. Abdolrahmani, R. Kuber, and S. M. Branham. 2018. “Siri talks at you”: An empirical investigation of voice-activated personal assistant (VAPA) usage by individuals who are blind. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility. 249–258. J. Adell, A. Bonafonte, and D. Escudero. 2007. Filled pauses in speech synthesis: Towards conversational speech. In International Conference on Text, Speech and Dialogue. Springer, 358–365. K. Akuzawa, Y. Iwasawa, and Y. Matsuo. 2018. Expressive speech synthesis via modeling expressions with variational autoencoder. In Interspeech 2018. ISCA, 3067–3071. http:// www.isca-speech.org/archive/Interspeech_2018/abstracts/1113.html. DOI: https://doi.org/ 10.21437/Interspeech.2018-1113. S. Andersson, K. Georgila, D. Traum, M. Aylett, and R. A. Clark. 2010. Prediction and realisation of conversational characteristics by utilising spontaneous speech for unit selection. In Speech Prosody 2010-Fifth International Conference. S. Andrist, M. Ziadee, H. Boukaram, B. Mutlu, and M. Sakr. 2015. Effects of culture on the credibility of robot speech. In Proceedings of the 10th Annual International Conference on Human–Robot Interaction, HRI’15. ACM/IEEE, 157–164. DOI: https://doi.org/10.1145/ 2696454.2696464. D. Antos, C. M. De Melo, J. Gratch, and B. J. Grosz. 2011. The influence of emotion expression on perceptions of trustworthiness in negotiation. In Proceedings of the 25th AAAI Conference on Artificial Intelligence. P. Arias, C. Soladie, O. Bouafif, A. Robel, R. Seguier, and J.-J. Aucouturier. 2018. Realistic transformation of facial and vocal smiles in real-time audiovisual streams. IEEE Transactions on Affective Computing.
S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou. 2018. Neural voice cloning with a few samples. In Advances in Neural Information Processing Systems. 10019–10029. A. Aubin, A. Cervone, O. Watts, and S. King. 2019. Improving speech synthesis with discourse relations. In INTERSPEECH. 4470–4474. DOI: http://dx.doi.org/10.21437/ Interspeech.2019-1945. M. P. Aylett and D. A. Braude. 2018. Designing speech interaction for the Sony Xperia Ear and Oakley Radar Pace smartglasses. In Proceedings of the 20th International Conference on Human–Computer Interaction with Mobile Devices and Services Adjunct. 379–384. DOI: https://doi.org/10.1145/3236112.3236171. M. P. Aylett, B. Potard, and C. J. Pidcock. 2013. Expressive speech synthesis: Synthesising ambiguity. In Eighth ISCA Workshop on Speech Synthesis. M. P. Aylett, P. O. Kristensson, S. Whittaker, and Y. Vazquez-Alvarez. 2014. None of a CHInd: Relationship counselling for HCI and speech technology. In Proceedings of the Extended Abstracts of the 32nd Annual ACM Conference on Human Factors in Computing Systems – CHI EA’14. ACM Press, Toronto, Ontario, Canada, 749–760. ISBN: 978-1-4503-2474-8. http://dl.acm.org/citation.cfm?doid=2559206.2578868. DOI: https://doi.org/10.1145/ 2559206.2578868. M. P. Aylett, A. Vinciarelli, and M. Wester. 2017. Speech synthesis for the generation of artificial personality. IEEE Transactions on Affective Computing. DOI: https://doi.org/10.1109/ TAFFC.2017.2763134. M. P. Aylett, B. R. Cowan, and L. Clark. 2019a. Siri, Echo and performance: You have to suffer darling. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, alt08. DOI: https://doi.org/10.1145/3290607.3310422. M. P. Aylett, S. J. Sutton, and Y. Vazquez-Alvarez. 2019b. The right kind of unnatural: Designing a robot voice. In Proceedings of the 1st International Conference on Conversational User Interfaces. 1–2. DOI: https://doi.org/10.1145/3342775.3342806. W. A. Bainbridge, J. W. Hart, E. S. Kim, and B. Scassellati. 2011. The benefits of interactions with physically present robots over video-displayed agents. Int. J. Soc. Rob. 3, 1, 41–52. DOI: https://doi.org/10.1007/s12369-010-0082-7. R. Barthes. 1977. Image–Music–Text. Macmillan. T. Baumann and D. Schlangen. 2012. Inpro_iSS: A component for just-in-time incremental speech synthesis. In Proceedings of the ACL 2012 System Demonstrations. 103–108. G. M. Begany, N. Sa, and X. Yuan. 2016. Factors affecting user perception of a spoken language vs. textual search interface: A content analysis. Interact. Comput. 28, 2, 170–180. DOI: https://doi.org/10.1093/iwc/iwv029. H. Bishop, N. Coupland, and P. Garrett. 2005. Conceptual accent evaluation: Thirty years of accent prejudice in the UK. Acta Linguist. Hafniensia 37, 1, 131–154. DOI: https://doi. org/10.1080/03740463.2005.10416087. A. W. Black and K. Tokuda. 2005. The Blizzard Challenge-2005: Evaluating corpus-based speech synthesis on common datasets. In Ninth European Conference on Speech Communication and Technology.
B. Bollepalli, L. Juvela, P. Alku. 2019. Lombard speech synthesis using transfer learning in a Tacotron text-to-speech system. In Interspeech. 2833–2837. DOI: https://doi.org/10.21437/ Interspeech.2019-1333. D. A. Braude, M. P. Aylett, C. Laoide-Kemp, S. Ashby, K. M. Scott, B. O. Raghallaigh, A. Braudo, A. Brouwer, and A. Stan. 2019. All together now: The living audio dataset. In Interspeech. 1521–1525. P. Bremner, A. G. Pipe, C. Melhuish, M. Fraser, and S. Subramanian. 2011. The effects of robot-performed co-verbal gesture on listener behaviour. In 2011 11th IEEE-RAS International Conference on Humanoid Robots. IEEE, 458–465. DOI: https://doi.org/10.1109/ Humanoids.2011.6100810. M. Bretan, G. Hoffman, and G. Weinberg. 2015. Emotionally expressive dynamic physical behaviors in robots. Int. J. Hum. Comput. Stud. 78, 1–16. DOI: https://doi.org/10.1016/j. ijhcs.2015.01.006. J. Cambre and C. Kulkarni. 2019. One voice fits all? Social implications and research challenges of designing voices for smart devices. Proc. ACM Hum. Comput. Interact. 3, CSCW, 1–19. DOI: https://doi.org/10.1145/3359325. D. Cameron. 2001. Working with Spoken Discourse. Sage. S. Campanella and P. Belin. 2007. Integrating face and voice in person perception. Trends Cogn. Sci. 11, 12, 535–543. DOI: https://doi.org/10.1016/j.tics.2007.10.001. A. C. Cargile and H. Giles. 1997. Understanding language attitudes: Exploring listener affect and identity. Lang. Commun. 17, 3, 195–217. DOI: https://doi.org/10.1016/S02715309(97)00016-5. L. Clark, A. Ofemile, S. Adolphs, and T. Rodden. 2016. A multimodal approach to assessing user experiences with agent helpers. ACM Trans. Interact. Intell. Syst. 6, 4, 29. DOI: https://doi.org/10.1145/2983926. L. Clark, P. Doyle, D. Garaialde, E. Gilmartin, S. Schlögl, J. Edlund, M. Aylett, J. Cabral, C. Munteanu, and J. Edwards. 2019a. The state of speech in HCI: Trends, themes and challenges. Interact. Comput. 31, 4, 349–371. DOI: https://doi.org/10.1093/iwc/iwz016. L. Clark, N. Pantidi, O. Cooney, P. Doyle, D. Garaialde, J. Edwards, B. Spillane, E. Gilmartin, C. Murad, C. Munteanu, V. Wade, and B. R. Cowan. 2019b. What makes a good conversation? Challenges in designing truly conversational agents. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–12. DOI: https://doi.org/10.1145/ 3290605.3300705. M. Cooke, C. Mayo, and C. Valentini-Botinhao. 2013 August. Intelligibility-enhancing speech modifications: The Hurricane Challenge. In Interspeech. 3552–3556. E. Corbett and A. Weber. 2016. What can I say? Addressing user experience challenges of a mobile voice user interface for accessibility. In Proceedings of the 18th International Conference on Human–Computer Interaction with Mobile Devices and Services. 72–82. DOI: https://doi.org/10.1145/2935334.2935386. N. Coupland and H. Bishop. 2007. Ideologised values for British accents. J. Socioling. 11, 1, 74–93. DOI: https://doi.org/10.1111/j.1467-9841.2007.00311.x.
B. R. Cowan, N. Pantidi, D. Coyle, K. Morrissey, P. Clarke, S. Al-Shehri, D. Earley, and N. Bandeira. 2017. What can I help you with?: Infrequent users’ experiences of intelligent personal assistants. In Proceedings of the 19th International Conference on Human– Computer Interaction with Mobile Devices and Services. ACM, 43. DOI: https://doi.org/10. 1145/3098279.3098539. B. R. Cowan, P. Doyle, J. Edwards, D. Garaialde, A. Hayes-Brady, H. P. Branigan, J. A. Cabral, and L. Clark. 2019. What’s in an accent? The impact of accented synthetic speech on lexical choice in human–machine dialogue. In Proceedings of the 1st International Conference on Conversational User Interfaces, CUI’19. Association for Computing Machinery, New York, NY. ISBN: 9781450371872. https://doi-org.ucd.idm.oclc.org/10.1145/3342775.3342786. DOI: https://doi.org/10.1145/3342775.3342786. A. Cruttenden. 1997. Intonation. Cambridge University Press. DOI: https://doi.org/10.1017/ CBO9781139166973. D. Crystal. 1997. A Dictionary of Linguistics and Phonetics. Blackwell, UK. D. Crystal. 2011. A Dictionary of Linguistics and Phonetics, Vol. 30. John Wiley & Sons. N. Dahlbäck, S. Swamy, C. Nass, F. Arvidsson, and J. Skågeby. 2001. Spoken interaction with computers in a native or non-native language—Same or different. In Proceedings of Interact. 294–301. N. Dahlbäck, Q. Wang, C. Nass, and J. Alwin. 2007. Similarity is more important than expertise: Accent effects in speech interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1553–1556. A. Danielescu. 2020. Eschewing gender stereotypes in voice assistants to promote inclusion. In Proceedings of the 2nd Conference on Conversational User Interfaces. 1–3. DOI: https://doi.org/10.1145/3405755.3406151. C. De Looze, S. Scherer, B. Vaughan, and N. Campbell. 2014. Investigating automatic measurements of prosodic accommodation and its dynamics in social interaction. Speech Commun. 58, 11–34. DOI: https://doi.org/10.1016/j.specom.2013.10.002. J. de Wit, A. Brandse, E. Krahmer, and P. Vogt. 2020. Varied human-like gestures for social robots: Investigating the effects on children’s engagement and language learning. In Proceedings of the 2020 ACM/IEEE International Conference on Human–Robot Interaction. 359–367. DOI: https://doi.org/10.1145/3319502.3374815. D. DeVault, R. Artstein, G. Benn, T. Dey, E. Fast, A. Gainer, K. Georgila, J. Gratch, A. Hartholt, M. Lhommet, G. Lucas, S. Marsella, F. Morbini, A. Nazarian, S. Scherer, G. Stratou, A. Suri, D. Traum, R. Wood, Y. Xu, A. Rizzo, and L.-P. Morency. 2014. SimSensei Kiosk: A virtual human interviewer for healthcare decision support. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems. 1061– 1068. P. R. Doyle, J. Edwards, O. Dumbleton, L. Clark, and B. R. Cowan. 2019. Mapping perceptions of humanness in intelligent personal assistant interaction. In Proceedings of the 21st International Conference on Human–Computer Interaction with Mobile Devices and Services, MobileHCI’19. Association for Computing Machinery, New York, NY. ISBN:
9781450368254. https://doi-org.ucd.idm.oclc.org/10.1145/3338286.3340116. DOI: https:// doi.org/10.1145/3338286.3340116. P. R. Doyle, L. Clark, and B. R. Cowan. 2021. What do we see in them? Identifying dimensions of partner models for speech interfaces using a psycholexical approach. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI ’21). Association for Computing Machinery, New York, NY, Article 244, 1–14. DOI: https://doiorg.ucd.idm.oclc.org/10.1145/3411764.3445206. J. Edwards, H. Liu, T. Zhou, S. J. J. Gould, L. Clark, P. Doyle, and B. R. Cowan. 2019. Multitasking with Alexa: How using intelligent personal assistants impacts languagebased primary task performance. In Proceedings of the 1st International Conference on Conversational User Interfaces, CUI’19. Association for Computing Machinery, New York, NY. ISBN: 9781450371872. https://doi-org.ucd.idm.oclc.org/10.1145/3342775.3342785. DOI: https://doi.org/10.1145/3342775.3342785. K. El Haddad, H. Cakmak, A. Moinet, S. Dupont, and T. Dutoit. 2015a. An HMM approach for synthesizing amused speech with a controllable intensity of smile. In 2015 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT). IEEE, 7–11. K. El Haddad, S. Dupont, N. D’Alessandro, and T. Dutoit. 2015b. An HMM-based speech– smile synthesis system: An approach for amusement synthesis. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 5. IEEE, 1–6. K. El Haddad, I. Torre, E. Gilmartin, H. Çakmak, S. Dupont, T. Dutoit, and N. Campbell. 2017. Introducing AmuS: The amused speech database. In N. Camelin, Y. Estève, and C. Martìn-Vide (Eds.), Proceedings of Statistical Language and Speech Processing Conference. Springer International Publishing, 229–240. ISBN: 978-3-319-68456-7. DOI: https:// doi.org/10.1007/978-3-319-68456-7. D. Erickson. 2005. Expressive speech: Production, perception and application to speech synthesis. Acoust. Sci. Technol. 26, 4, 317–325. DOI: https://doi.org/10.1250/ast. 26.317. T. Fong, I. Nourbakhsh, and K. Dautenhahn. 2003. A survey of socially interactive robots. Rob. Auton. Syst. 42, 3–4, 143–166. DOI: https://doi.org/10.1016/S0921-8890(02)00372-X. P. Gangamohan, S. R. Kadiri, and B. Yegnanarayana. 2016. Analysis of emotional speech— A review. In Toward Robotic Socially Believable Behaving Systems—Volume I. Springer, 205–238. DOI: https://doi.org/10.1007/978-3-319-31056-5_11. K. Georgila, A. W. Black, K. Sagae, and D. R. Traum. 2012. Practical evaluation of human and synthesized speech for virtual human dialogue systems. In LREC. 3519–3526. E. Goffman. 2005. Interaction Ritual: Essays in Face-to Face-Behavior. AldineTransaction. A. Govender and S. King. 2018. Using pupillometry to measure the cognitive load of synthetic speech. System 50, 100. DOI: http://dx.doi.org/10.21437/Interspeech.2018-1174. D. Govind and S. M. Prasanna. 2013. Expressive speech synthesis: A review. Int. J. Speech Technol. 16, 2, 237–260. DOI: https://doi.org/10.1007/s10772-012-9180-2.
S. Heo, M. Annett, B. J. Lafreniere, T. Grossman, and G. W. Fitzmaurice. 2017. No need to stop what you’re doing: Exploring no-handed smartwatch interaction. In Graphics Interface. 107–114. Z. Hodari, O. Watts, and S. King. 2019. Using generative modelling to produce varied intonation for speech synthesis. arXiv preprint arXiv:1906.04233. DOI: https://doi.org/10. 21437/SSW.2019-43. G. Hofer, K. Richmond, and R. Clark. 2005. Informed blending of databases for emotional speech synthesis. In Proc. Interspeech. A. Hughes, P. Trudgill, and D. Watt. 2013. English Accents and Dialects: An Introduction to Social and Regional Varieties of English in the British Isles. Routledge. A. J. Hunt and A. W. Black. 1996. Unit selection in a concatenative speech synthesis system using a large speech database. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Vol. 1. IEEE, 373–376. DOI: https://doi.org/ 10.1109/ICASSP.1996.541110. A. Ikeno and J. H. Hansen. Nov 2007. The effect of listener accent background on accent perception and comprehension. EURASIP J Audio Speech Music Processing 2007, 1, 076030. ISSN: 1687-4722. DOI: https://doi.org/10.1155/2007/76030. J. Jung, S. Lee, J. Hong, E. Youn, and G. Lee. 2020. Voice+tactile: Augmenting in-vehicle voice user interface with tactile touchpad interaction. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–12. DOI: https://doi.org/10.1145/ 3313831.3376863. R. G. Kamilo˘ glu, A. H. Fischer, and D. A. Sauter. 2020. Good vibrations: A review of vocal expressions of positive emotions. Psychon. Bull. Rev. 27, 2, 237–265. DOI: https://doi.org/ 10.3758/s13423-019-01701-x. T. Kawahara. 2019. Spoken dialogue system for a human-like conversational robot ERICA. In 9th International Workshop on Spoken Dialogue System Technology. Springer, 65–75. T. Kenter, V. Wan, C.-A. Chan, R. Clark, and J. Vit. 2019. CHiVE: Varying prosody in speech synthesis with a linguistically driven dynamic hierarchical conditional variational network. In International Conference on Machine Learning. 3331–3340. C. D. Kidd and C. Breazeal. 2004. Effect of a robot on user perceptions. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), (IEEE Cat. No. 04CH37566). Vol. 4. IEEE, 3559–3564. K. D. Kinzler, K. H. Corriveau, and P. L. Harris. 2011. Children’s selective trust in nativeaccented speakers. Development. Sci. 14, 1, 106–111. DOI: https://doi.org/10.1111/j.1467-7687.2010.00965.x. P. Kirschthaler, M. Porcheron, and J. E. Fischer. 2020. What can I say? Effects of discoverability in VUIs on task performance and user experience. In Proceedings of the 2nd International Conference on Conversational User Interfaces, CUI’20. Association for Computing Machinery, New York, NY. ISBN: 978-1-4503-7544-3/20/07. DOI: https://doi.org/10. 1145/3405755.3406119. D. H. Klatt. 1980. Software for a cascade/parallel formant synthesizer. J. Acoust. Soc. Am. 67, 3, 971–995. DOI: https://doi.org/10.1121/1.383940.
J. Kominek and A. W. Black. 2004. The CMU Arctic speech databases. In Fifth ISCA Workshop on Speech Synthesis. S. G. Koolagudi and K. S. Rao. 2012. Emotion recognition from speech: A review. Int. J. Speech Technol. 15, 2, 99–117. DOI: https://doi.org/10.1007/s10772-011-9125-1. H. Kose-Bagci, E. Ferrari, K. Dautenhahn, D. S. Syrdal, and C. L. Nehaniv. 2009. Effects of embodiment and gestures on social interaction in drumming games with a humanoid robot. Adv. Rob. 23, 14, 1951–1996. DOI: https://doi.org/10.1163/016918609X125187833 30360. E. Krahmer and M. Swerts. 2007. The effects of visual beats on prosodic prominence: Acoustic analyses, auditory perception and visual perception. J. Mem. Lang. 57, 3, 396– 414. DOI: https://doi.org/10.1016/j.jml.2007.06.005. D. R. Large, L. Clark, A. Quandt, G. Burnett, and L. Skrypchuk. 2017. Steering the conversation: A linguistic exploration of natural language interactions with a digital assistant during simulated driving. Appl. Ergon. 63. 53–61. ISSN: 00036870. https://linkinghub. elsevier.com/retrieve/pii/S0003687017300790. DOI: https://doi.org/10.1016/j.apergo.2017. 04.003. L. Leahu, M. Cohn, and W. March. 2013. How categories come to matter. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems—CHI’13. ACM Press, Paris, France, 3331. ISBN: 978-1-4503-1899-0. http://dl.acm.org/citation.cfm?doid= 2470654.2466455. DOI: https://doi.org/10.1145/2470654.2466455. K. A. Lenzo and A. W. Black. 2000. Diphone collection and synthesis. In Sixth International Conference on Spoken Language Processing. J. Li. 2015. The benefit of being physically present: A survey of experimental works comparing copresent robots, telepresent robots and virtual agents. Int. J. Hum. Comput. Stud. 77, 23–37. DOI: https://doi.org/10.1016/j.ijhcs.2015.01.001. J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, R. Barra-Chicote, Alexis Moinet, and Vatsal Aggarwal. 2018. Towards achieving robust universal neural vocoding. arXiv preprint arXiv:1811.06292. E. Luger and A. Sellen. 2016. Like having a really bad PA: The gulf between user expectation and experience of conversational agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 5286–5297. DOI: http://dx.doi.org/10.1145/ 2858036.2858288. H.-T. Luong and J. Yamagishi. 2020. Nautilus: A versatile voice cloning system. arXiv preprint arXiv:2005.11004. F. Marelli, B. Schnell, H. Bourlard, T. Dutoit, and P. N. Garner. 2019. An end-to-end network to synthesize intonation using a generalized command response model. In ICASSP 20192019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7040–7044. DOI: https://doi.org/10.1109/ICASSP.2019.8683815. C. McGinn and I. Torre. 2019. Can you tell the robot by the voice? An exploratory study on the role of voice in the perception of robots. In 2019 14th ACM/IEEE International Conference on Human–Robot Interaction (HRI). IEEE, 211–221. DOI: https://doi.org/10.1109/ HRI.2019.8673305.
H. McGurk and J. MacDonald. 1976. Hearing lips and seeing voices. Nature 264, 5588, 746–748. DOI: https://doi.org/10.1038/264746a0. I. Medhi, S. N. Gautama, and K. Toyama. 2009. A comparison of mobile money-transfer UIs for non-literate and semi-literate users. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1741–1750. DOI: https://doi.org/10.1145/1518701. 1518970. J. Mendelson and M. P. Aylett. 2017. Beyond the listening test: An interactive approach to TTS evaluation. In Interspeech. 249–253. DOI: https://doi.org/10.21437/Interspeech.20171438. R. K. Moore. 2017. Is spoken language all-or-nothing? Implications for future speech-based human–machine interaction. In Dialogues with Social Robots. Springer, 281–291. R. K. Moore, H. Li, and S.-H. Liao. 2016. Progress and prospects for spoken language technology: What ordinary people think. 3007–3011. http://www.isca-speech.org/archive/ Interspeech_2016/abstracts/0874.html. DOI: https://doi.org/10.21437/Interspeech.2016874. J. Mumm and B. Mutlu. 2011. Human–robot proxemics: Physical and psychological distancing in human–robot interaction. In Proceedings of the 6th ACM/IEEE International Conference on Human–Robot Interaction. 331–338. DOI: https://doi.org/10.1145/1957656. 1957786. C. Nass and K. M. Lee. 2000. Does computer-generated speech manifest personality? An experimental test of similarity-attraction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI’00. ACM, New York, NY, 329–336. ISBN: 9781-58113-216-8. http://doi.acm.org/10.1145/332040.332452. DOI: https://doi.org/10.1145/ 332040.332452. C. Nass and K. M. Lee. 2001. Does computer-synthesized speech manifest personality? Experimental tests of recognition, similarity-attraction, and consistency-attraction. J. Exp. Psychol. Appl. 7, 3, 171. DOI: https://doi.org/10.1037//1076-898X.7.3.171. C. I. Nass and S. Brave. 2005. Wired for Speech: How Voice Activates and Advances the Human– Computer Relationship. MIT Press Cambridge, MA. C. Nass, J. Steuer, and E. R. Tauber. 1994. Computers are social actors. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 72–78. DOI: https://doi.org/ 10.1145/191666.191703. A. Niculescu, G. M. White, S. S. Lan, R. U. Waloejo, and Y. Kawaguchi. 2008. Impact of English regional accents on user acceptance of voice user interfaces. In Proceedings of the 5th Nordic Conference on Human–Computer Interaction: Building Bridges, NordiCHI’08. Association for Computing Machinery, New York, NY, 523–526. ISBN: 9781595937049. https://doi-org.ucd.idm.oclc.org/10.1145/1463160.1463235. DOI: https://doi.org/10.1145/ 1463160.1463235. J. J. Ohala. 1983. Cross-language use of pitch: An ethological view. Phonetica 40, 1, 1–18. DOI: https://doi.org/10.1159/000261678.
A. V. D. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499. A. Ortony, G. Clore, and A. Collins. 1988. The Cognitive Structure of Emotion. CUP, Cambridge. DOI: http://dx.doi.org/10.1017/CBO9780511571299. A. L. Paugh. 2005. Multilingual play: Children’s code-switching, role play, and agency in Dominica, West Indies. Lang. Soc. 34, 1, 63–86. DOI: https://doi.org/10.1017/S00474045 05050037. E. Pincus, K. Georgila, and D. Traum. 2015. Which synthetic voice should I choose for an evocative task? In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue. 105–113. DOI: https://doi.org/10.18653/v1/W15-4613. W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller. 2017. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654. J. F. Pitrelli, R. Bakis, E. M. Eide, R. Fernandez, W. Hamza, and M. A. Picheny. 2006. The IBM expressive text-to-speech synthesis system for American English. IEEE Trans. Audio Speech Lang. Process. 14, 4, 1099–1108. DOI: https://doi.org/10.1109/TASL.2006.876123. M. Porcheron, J. E. Fischer, and S. Sharples. 2017. “Do animals have accents?”: Talking with agents in multi-party conversation. In Proceedings of the 20th ACM Conference on Computer-Supported Cooperative Work & Social Computing, CSCW’17. ACM, New York, NY, 207–219. DOI: https://doi.org/10.1145/2998181.2998298. M. Porcheron, J. E. Fischer, S. Reeves, and S. Sharples. 2018. Voice interfaces in everyday life. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 640. DOI: https://doi.org/10.1145/3173574.3174214. B. Potard, M. P. Aylett, and D. A. Braude. 2016. Cross modal evaluation of high quality emotional speech synthesis with the Virtual Human Toolkit. In International Conference on Intelligent Virtual Agents. Springer, 190–197. N. Prateek, M. Łajszczak, R. Barra-Chicote, T. Drugman, J. Lorenzo-Trueba, T. Merritt, S. Ronanki, and T. Wood. 2019. In other news: A bi-style text-to-speech model for synthesizing newscaster voice with limited data. arXiv preprint arXiv:1904.02790. S. Ramakrishnan. 2012. Recognition of emotion from speech: A review. Speech Enhancement, Modeling and Recognition–Algorithms and Applications 7, 121–137. G. Reyes-Cruz, J. E. Fischer, and S. Reeves. 2020. Reframing disability as competency: Unpacking everyday technology practices of people with visual impairments. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI’20. ACM, New York, NY, 1–13. ISBN: 9781450367080. https://doi.org/10.1145/3313831.3376767. DOI: https://doi.org/10.1145/3313831.3376767. D. C. Rubin and J. M. Talarico. 2009. A comparison of dimensional models of emotion: Evidence from emotions, prototypical events, autobiographical memories, and words. Memory 17, 8, 802–808. DOI: https://doi.org/10.1080/09658210903130764.
E. B. Ryan and H. Giles. 1982. An integrative perspective for the study of attitudes towards language variation. In Attitudes Towards Language Variation: Social and Applied Contexts. Edward Arnold London, 1–19. M. Salem, F. Eyssel, K. Rohlfing, S. Kopp, and F. Joublin. 2013. To err is human (-like): Effects of robot gesture on perceived anthropomorphism and likability. Int. J. Soc. Rob. 5, 3, 313–323. DOI: https://doi.org/10.1007/s12369-013-0196-9. D. Sato, S. Zhu, M. Kobayashi, H. Takagi, and C. Asakawa. 2011. Sasayaki: Augmented voice web browsing experience. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2769–2778. DOI: https://doi.org/10.1145/1978942.1979353. S. Sayago, B. B. Neves, and B. R. Cowan. 2019. Voice assistants and older people: Some open issues. In Proceedings of the 1st International Conference on Conversational User Interfaces, CUI’19. ACM, New York, NY. ISBN: 9781450371872. https://doi.org/10.1145/ 3342775.3342803. DOI: https://doi.org/10.1145/3342775.3342803. M. Schröder. 2001. Emotional speech synthesis: A review. In Proceedings Eurospeech 01. 561–564. M. Schröder. 2004. Dimensional emotion representation as a basis for speech synthesis with non-extreme emotions. In Proceedings Workshop on Affective Dialogue Systems. 209–220. DOI: https://doi.org/10.1007/978-3-540-24842-2_21. M. Schröder. 2009. Expressive speech synthesis: Past, present, and possible futures. In Affective Information Processing. Springer, 111–126. DOI: https://doi.org/10.1007/978-184800-306-4_7. M. Schröder, E. Bevacqua, R. Cowie, F. Eyben, H. Gunes, D. Heylen, M. Ter Maat, G. McKeown, S. Pammi, M. Pantic, C. Pelachaud, B. Schuller, E. de Sevin, M. Valstar, and M. Wöllmer. 2011. Building autonomous sensitive artificial listeners. IEEE Trans. Affective Comput. 3, 2, 165–183. DOI: https://doi.org/10.1109/T-AFFC.2011.34. R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous. 2018. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. arXiv preprint arXiv:1803.09047. V. Srinivasan and R. Murphy. 2011. A survey of social gaze. In Proceedings of the 6th ACM/ IEEE International Conference on Human–Robot Interaction. 253–254. DOI: https://doi.org/ 10.1145/1957656.1957757. G. Sun, Y. Zhang, R. J. Weiss, Y. Cao, H. Zen, and Y. Wu. 2020. Fully-hierarchical finegrained prosody modeling for interpretable speech synthesis. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6264–6268. S. Sundaram and S. Narayanan. 2007. Automatic acoustic synthesis of human-like laughter. J. Acoust. Soc. Am. 121, 1, 527–535. DOI: https://doi.org/10.1121/1.2390679. S. Sutton. 2020. Gender ambiguous, not genderless: Designing gender in voice user interfaces (VUIs) with sensitivity. In Proceedings of the 2nd International Conference on Conversational User Interfaces, CUI’20. Association for Computing Machinery, New York, NY. DOI: https://doi.org/10.1145/3405755.3406123.
M. Swerts and E. Krahmer. 2010. Visual prosody of newsreaders: Effects of information structure, emotional content and intended audience on facial expressions. J. Phon. 38, 2, 197–206. DOI: https://doi.org/10.1016/j.wocn.2009.10.002. E. Székely. 2015. Expressive Speech Synthesis in Human Interaction. Ph.D. thesis, University College Dublin. É. Székely, G. E. Henter, J. Beskow, and J. Gustafson. 2019. Spontaneous conversational speech synthesis from found data. In Interspeech. DOI: http://dx.doi.org/10.21437/ Interspeech.2019-2836. V. C. Tartter and D. Braun. 1994. Hearing smiles and frowns in normal and whisper registers. J. Acoust. Soc. Am. 96, 4, 2101–2107. DOI: https://doi.org/10.1121/1.410151. P. Taylor and A. Isard. 1997. SSML: A speech synthesis markup language. Speech Commun. 21, 1–2, 123–133. DOI: https://doi.org/10.1016/S0167-6393(96)00068-4. M. Theune, K. Meijs, D. Heylen, and R. Ordelman. 2006. Generating expressive speech for storytelling applications. IEEE Trans. Audio Speech Lang. Process. 14, 4, 1137–1144. DOI: https://doi.org/10.1109/TASL.2006.876129. K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. 2000. Speech parameter generation algorithms for HMM-based speech synthesis. In 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), Vol. 3. IEEE, 1315–1318. I. Torre and S. L. Maguer. 2020. Should robots have accents? In Proceedings of the 29th International Workshop on Robot and Human Interactive Communication, RO-MAN’ 20. IEEE. DOI: https://doi.org/10.1109/RO-MAN47096.2020.9223599. I. Torre, J. Goslin, and L. White. 2015. Investing in accents: How does experience mediate trust attributions to different voices? In Proceedings of the 18th International Congress of Phonetic Sciences (ICPhS 2015). I. Torre, E. Carrigan, K. McCabe, R. McDonnell, and N. Harte. 2018. Survival at the museum: A cooperation experiment with emotionally expressive virtual characters. In Proceedings of the 2018 on International Conference on Multimodal Interaction. ACM, 423–427. DOI: https://doi.org/10.1145/3242969.3242984. I. Torre, J. Goslin, and L. White. 2020. If your device could smile: People trust happysounding artificial agents more. Comput. Hum. Behav. 105, 106215. DOI: https://doi.org/ 10.1016/j.chb.2019.106215. J. Trouvain and M. Schröder. 2004. How (not) to add laughter to synthetic speech. In Tutorial and Research Workshop on Affective Dialogue Systems. Springer, 229–232. DOI: https://doi.org/10.1007/978-3-540-24842-2_23. G. R. Tucker. 1999. A Global Perspective on Bilingualism and Bilingual Education. ERIC Digest. P. Wagner, J. Beskow, S. Betz, J. Edlund, J. Gustafson, G. Eje Henter, S. Le Maguer, Z. Malisz, É. Székely, C. Tånnander, and Jana Voße. 2019. Speech synthesis evaluation— State-of-the-art assessment and suggestion for a novel research program. In Proceedings of the 10th Speech Synthesis Workshop (SSW10). DOI: https://doi.org/10.21437/SSW. 2019-19.
J. Wainer, D. J. Feil-Seifer, D. A. Shell, and M. J. Mataric. 2006. The role of physical embodiment in human–robot interaction. In ROMAN 2006—The 15th IEEE International Symposium on Robot and Human Interactive Communication. IEEE, 117–122. DOI: https://doi.org/ 10.1109/ROMAN.2006.314404. Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurou. 2017. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135. M. West, R. Kraut, and H. Chew, 2019. I’d blush if I could: Closing gender divides in digital skills through education. https://unesdoc.unesco.org/ark:/48223/pf0000367416. M. Wester, M. Aylett, M. Tomalin, and R. Dall. 2015. Artificial personality and disfluency. Interspeech 2015. 5. M. Wester, D. A. Braude, B. Potard, M. P. Aylett, and F. Shaw. 2017. Real-time reactive speech synthesis: Incorporating interruptions. In Interspeech. 3996–4000. DOI: https://doi.org/ 10.21437/Interspeech.2017-1250. A. Williams, J. Cambre, I. Bicking, A. Wallin, J. Tsai, and J. Kaye. 2020. Toward voiceassisted browsers: A preliminary study with Firefox Voice. In Proceedings of the 2nd International Conference on Conversational User Interfaces, CUI’20. Association for Computing Machinery, New York, NY. ISBN: 978-1-4503-7544-3/20/07. DOI: https://doi. org/10.1145/3405755.3406154. Z. Wu, Z. Xie, and S. King. 2019. The blizzard challenge 2019. Blizzard Challenge Workshop. http://www.festvox.org/blizzard/bc2019/blizzard2019_overview_paper.pdf. Y. Wu, J. Edwards, O. Cooney, A. Bleakley, P. R. Doyle, L. Clark, D. Rough, and B. R. Cowan. 2020a. Mental workload and language production in non-native speaker IPA interaction. In Proceedings of the 2nd Conference on Conversational User Interfaces, CUI’20. Association for Computing Machinery, New York, NY. ISBN: 9781450375443. https://doi-org.ucd.idm. oclc.org/10.1145/3405755.3406118. DOI: https://doi.org/10.1145/3405755.3406118. Y. Wu, D. Rough, A. Bleakley, J. Edwards, O. Cooney, P. R. Doyle, L. Clark, and B. R. Cowan. 2020b. See what I’m saying? Comparing intelligent personal assistant use for native and non-native language speakers. In Proceedings of the 22nd International Conference on Human–Computer Interaction with Mobile Devices and Services, Mobile HCI’20. Association for Computing Machinery, New York, NY. J. Yamagishi, K. Ogata, Y. Nakano, J. Isogai, and T. Kobayashi. 2006. HSMM-based model adaptation algorithms for average-voice-based speech synthesis. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, Vol. 1. IEEE, I–I. DOI: https://doi.org/10.1109/ICASSP.2006.1659961. J. Yamagishi, C. Veaux, and K. MacDonald. 2019. CSTR VCTK Corpus: English multispeaker corpus for CSTR voice cloning toolkit (version 0.92). DOI: https://doi.org/10. 7488/ds/2645. H. Zen, A. Senior, and M. Schuster. 2013. Statistical parametric speech synthesis using deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 7962–7966.
H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu. 2019. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882. Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Z. Chen, R. Skerry-Ryan, Y. Jia, A. Rosenberg, and B. Ramabhadran. 2019. Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning. arXiv preprint arXiv:1907.04448. X. Zhu and L. Xue. 2020. Building a controllable expressive speech synthesis system with multiple emotion strengths. Cognit. Syst. Res. 59, 151–159. DOI: https://doi.org/10.1016/j. cogsys.2019.09.009.

7 Gesture Generation
Carolyn Saund and Stacy Marsella

Gestures accompany our speech in ways that punctuate, augment, substitute for, and even contradict verbal information. Such co-speech gestures draw listeners’ attention to specific phrases, indicate the speaker’s feelings toward a subject, or even convey “off-the-record” information that is excluded from our spoken words. The study of co-speech gesture stretches at least as far back as the work of Quintilian in 50 AD, and draws from the disciplines of cognitive science, performance arts, politics, and, more recently, computer science and robotics. Gesture is a critical tool to enrich face-to-face communication, of which social artificial agents have yet to take full advantage. In this chapter, we discuss the importance, selection, production, challenges, and future of co-speech gestures for artificial social intelligent agents.

7.1 The Importance of Gesture in Social Interaction

7.1.1 What are Gestures? Gestures as we discuss them here are the spontaneous movements that accompany speech. Generally, these are limited to hand and arm movements [McNeill 1992] but can occasionally extend to the head, feet, or other body parts [Kendon 2000]. Our focus here, however, is on hand and arm movements. Specifically, this chapter focuses on gestures in conversation that usually accompany utterances, commonly referred to as co-speech gestures. This includes gestures that occur during speech in conversational or performative settings, such as interviews and monologues, with or without audiences. These can occur with or without conversational partners as well. As we describe below, gestures serve a remarkably wide variety of communicative functions in conversation, including conveying information to observers as well as aiding in speech production and fluency for the speaker.

Table 7.1   Table of gesture classical types and co-speech properties

Gesture type    Co-speech necessary?    Viewer necessary?
Emblem          No                      Sometimes
Beat            Yes                     No
Iconic          Sometimes               No
Deictic         Sometimes               Sometimes
Metaphoric      Yes                     No

Importantly, the classifications provided here are by no means exhaustive. In this section, in addition to introducing one prevailing taxonomy (Section 7.1.1.1), we discuss weaknesses and alternative proposals to classifying gestures using these dimensions (Sections 7.1.1.2 and 7.1.1.3), as well as many other factors that determine how researchers tend to group gestures, both physically and functionally.

7.1.1.1 Classification Dimensions A common method of classifying co-speech gestures is by the five types or dimensions described in Table 7.1. These correspond not only to differences in the motions used to realize the gesture but more meaningfully to differences in the conversational contexts, their roles in speech production, and the communicative intentions of the speaker.

Emblems are gestures that may essentially be thought of as replacements for spoken language. A prominent example is the "thumbs up" gesture that is common in several cultures, but often with strikingly different meanings. In North American and European cultures, for example, if somebody asks a question, a "thumbs-up" response unambiguously means "yes," with or without verbal affirmation. Emblems carry equivalent meaning to their linguistic counterpart. Importantly, the interpretation of these gestures is culturally and linguistically dependent; the "OK" symbol in Western cultures is a rude insult in Morocco.

Beat gestures, contrarily, are gestures that do not carry semantic content in their movements, but instead "reveal the speaker's conception of the narrative's discourse as a whole" [McNeill 1992] by emphasizing specific words with small motions, often coinciding with the prosody of the spoken utterance. The movement of a beat gesture is short and quick, often takes place only in the periphery of where the speaker uses other gestures [McNeill 1992], and takes a generally similar form regardless of the content of the co-utterance [Levy and McNeill 1992]. Beats may also aid in speech fluency by coinciding rhythmically with the spoken co-utterance, providing prosodic cues to word recall and comprehension [Hadar 1989, Leonard and Cummins 2011].

Iconic gestures are literal representations of real, physical counterparts. For example, if someone utters "we need a knife to cut the cake," they may produce a gesture with one flat palm held horizontally, and the other held vertically in a perpendicular "slicing" motion. In this instance, the hands are literally acting out the motion of a knife cutting something, with the hands embodying literal physical objects in the world. Similarly, an iconic gesture may be a mime of a literal motion. For example, if someone tells a story in which they were "running down the street," they may hold their arms to their sides and swing them up and down to emphasize, exaggerate, or depict their speed.

Deictic gestures are pointing gestures that direct attention toward a referent in the environment. If you have an array of items on a table and tell someone to "pick up that one," the statement makes no sense without a verbal or gestural counterpart to identify the referent. Similarly, if somebody asks, "which way did they go?", a person may simply point in lieu of providing a verbal response.

Metaphoric gestures "present an image of an abstract concept" [McNeill 1992]. For example, one may gesture in a bowl or container shape when describing "all of their ideas." Although the abstract notion of an "idea" can never be physically realized, the metaphoric gesture situates "ideas" in a metaphorical container that can be reliably referenced throughout the conversation by the speaker and viewers.
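As a purely illustrative sketch, the snippet below shows one way the dimensions of Table 7.1 could be recorded when annotating co-speech gestures; the class and field names are hypothetical rather than drawn from this chapter, and a single gesture is deliberately allowed to carry several dimensions at once, anticipating the overlap discussed in the next section.

# Hypothetical annotation structure for the gesture dimensions in Table 7.1.
from dataclasses import dataclass, field
from enum import Enum, auto

class GestureType(Enum):
    EMBLEM = auto()
    BEAT = auto()
    ICONIC = auto()
    DEICTIC = auto()
    METAPHORIC = auto()

@dataclass
class GestureAnnotation:
    start: float                      # seconds into the utterance
    end: float
    co_speech_text: str               # words the gesture accompanies ("" if none)
    types: set = field(default_factory=set)   # a gesture may load on several dimensions

# The "slicing" gesture is iconic with "cut the cake" but can also be read
# metaphorically with "cut to the heart of the issue".
cake = GestureAnnotation(1.2, 1.9, "a knife to cut the cake", {GestureType.ICONIC})
issue = GestureAnnotation(0.8, 1.6, "cut to the heart of the issue",
                          {GestureType.ICONIC, GestureType.METAPHORIC})
print(GestureType.METAPHORIC in issue.types)  # True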

7.1.1.2 Multiple Classifications As McNeill [2006] has argued, these classifications are not strict types but rather dimensions that are overlapping and open to interpretation when considering the use of gestures in interactions. This refers to the notion that a particular gesture, within one particular context, may be interpreted to have different elements of the axes described above. The same physical motion of a gesture may result in different interpretations depending on co-speech context. Consider the "slicing" motion described above. When applied to physical objects ("a knife to cut the cake"), this would be characterized as an iconic gesture. However, consider the same gesture if it accompanies the phrase "we need coordination to cut to the heart of the issue." In this instance, the cutting is metaphorical as "issues" are not physical beings with literal "hearts." Similarly, "coordination" is not a physical object like a knife that can cut. However, the metaphor of "cutting to the heart of an issue" is grounded in physical space insofar as cut is a verb that describes a physical action. In the metaphoric condition, "coordination" may be represented metaphorically as a knife by the fingers falling into stiff, parallel lines. In this case the fingers may further be thought of as representing people falling into line. This motion thus illustrates two distinct utterances in which the same gesture occurs, one where the gesture
Figure 7.1   The motion of the metaphoric gesture accompanying the phrase "Anything at all." (a) The beginning of the movement as she says "Anything at all." (b) The second part of this gesture, creating the space where "anything" may metaphorically be.

is referring to actually cutting a physical object and one where the gesture is used metaphorically. The use of a metaphor in speech is not necessary for the metaphor to be conveyed in the accompanying gestures. Figure 7.1(a) illustrates a metaphoric gesture accompanying the dialog "we can talk about anything at all." There is no metaphor used in the dialog while the gesture is based in metaphors whereby abstract things, such as topics of conversation, can be represented as physical objects and a set of these objects can be held in a physical container that is being depicted by the gesture. Despite this degree of independence between the metaphor use in spoken language and accompanying gestures, the catalog of metaphors used in speech provides a useful resource for researchers. Grady [1997] provides many such metaphors, for which gesture researchers commonly observe gestural counterparts.1 These include similarity is proximity (e.g., "these fabrics aren't quite the same but they're close"), change is motion (e.g., "things have shifted since you were last here."), and moments in time are objects on a path (e.g., "Summer always passes too quickly"). These and many other metaphors often coincide with physical representations of these metaphoric actions [Lakoff and Johnson 2008] represented gesturally.

1. Grady does not propose or consider a framework for gesture analysis in the work cited. Instead, this work considers in depth the many ways in which metaphors permeate our speech, but does not explicitly discuss how we may use bodies to act out these metaphors as we say them.

The above are examples of how gestures may be used to emphasize or induce metaphors. Conversely, consider the straightforward presentation of two options "this or that," with the hands held flat, palm-up in front of the speaker. The speaker may say "this option," and beat with one hand, and then repeat the
phrase "or this option," but move the other hand, clearly indicating that they are providing context for the different options. The indication is made by a beat motion, but it is also a clarification of "which option," giving it attributes of a deictic gesture, referred to as an abstract deictic [McNeill et al. 1993]. Additionally, the laying out of two different ideas in space is metaphoric as it relies on the metaphors that abstract concepts are physical objects and dissimilar concepts are far apart (Grady's [1997] categories/sets are bounded spatial regions), thus incorporating yet another element of the dimensions described above into a single gestural motion.

7.1.1.3 Alternative Classification Schemes In some modern works, gestures are often given multiple classifications, or the classification of gestures is skipped altogether, and gestures are judged solely by their communicative role or perceived intention. For example, Murphy [2003] proposes analyzing gestures not by abstract representation but instead by the production of those representations themselves. That is, gestures can be analyzed exclusively by their body movements as opposed to attempting to interpret what those movements represent. He argues that movement-based analysis is less prone to researcher bias and less likely to leave out body movements that do not fall neatly into the dimensions described above. This is contrary to the idea proposed by Novack and Goldin-Meadow [2017]. They suggest that iconic and deictic gestures are not simulations of actions they intend to portray but instead are consciously representational of abstract versions of those actions. This allows researchers to organize gestures according to their functional role in conversation. By focusing on gesture's function as opposed to its specific form, researchers can begin to focus on why a particular gesture occurs rather than how the intention maps to movement. Still other schemes attempt to address this problem by classifying gestures using principles of both form and function. Saund et al. [2019] discuss the possibility of delineating and classifying gestures according to conversational context (the function of the gesture) in tandem with the novel physical spaces they occupy (the physical form of the gesture). Additionally, because of these overlapping dimensions, the process of describing and classifying the motion of gestures themselves is often decoupled from the meaning the gesture carries [Kipp et al. 2007]. This allows other schemes to break down gesture classification into linguistic and motion sub-problems [Cassell 1998]. It is only by considering the full picture of gesture production, from intention, to function, to physical action, that we can begin to create socially compelling gesture in artificial agents.

7.1.2 Timing These axes of gesture vary as well by the timing of their performance with a co-occurring utterance, ranging from nearly coinciding temporally with speech, to gestural performances many seconds in advance [Calbris 1995, Nobe 2000, Gibbs 2008]. However, perception of appropriateness for different gestures with respect to co-speech timing is not fixed [Leonard and Cummins 2011]. The window of time for gestures to be relevant to corresponding speech is similarly fluid, depending on context [Leonard and Cummins 2011]. Often, gestures anticipate the speech to which they correspond [McNeill 1985, Nobe 2000], indicating that, cognitively, the meaning we attempt to convey is formulated and performed by the body before we are able to form (or at least utter) words for intended communication [Kendon 2000]. This similarly implies that the cognitive processes between communication intention and speech formulation are the same processes that initiate gesture production [Kendon 2000]. While the development of social artificial agents has a long way to go before these artifacts can form rich, coherent conversational speech from a communication intention alone, it is important to keep in mind that a pipeline that truly possesses the spontaneity, creativity, and expressive substance of human gestures must similarly be responsible for producing meaningful co-speech gestures. We discuss current implementations of various gesture generators in relation to speech in Section 7.2.

7.1.2.1 Gestural Phases and Units At the level of individual gestures, there is a complex feature structure. There are the phases of gestural motion including the rest, preparation, stroke, holding, and relax phases, as well as the forms of motion, their locations, and changing hand shapes. However, people often gesture in an overall fluid performance involving gesture sequences (a.k.a. gesture units [Kendon 2004]) in such a way that not all phases may be present in every individual gesture. In sequences, co-articulations between gestures may eliminate the rest or relaxation phase of a gesture [McNeill 1992]. One such name to refer to a sequence of related ideas that can span multiple gestures is an ideational unit [Calbris 1990]. Calbris argues that ideational units structure the discourse and the kinesic segmentation of gestures, and serve to impose requirements on gestural features both within and across ideational units in an overall performance. Within a gesture performance, some features such as hand shape, movement trajectory, or location in space may be coupled across gestures while other features serve at times a key role in distinguishing individual gestures from one another.
This happens both physically and at the level of their meaning. For example, the hands may go into a rest position between gestures to indicate the end of an idea, a change of hand shape can serve to indicate the start of a new idea in the discourse [Calbris 2011], or one gesture’s location may serve to refer to a preceding gesture in an overall gestural scene where, for example, locations in gestural space take on specific meanings that may be referred to by subsequent gestures.
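A rough sketch, assuming only the phase vocabulary described above, of how a gesture unit might be segmented for analysis or animation follows; the names are illustrative rather than an established API, and the example encodes the point that co-articulated gestures within a unit need not each pass through a rest or relax phase.

# Illustrative phase segmentation of a Kendon-style gesture unit.
from dataclasses import dataclass
from enum import Enum, auto

class Phase(Enum):
    REST = auto()
    PREPARATION = auto()
    STROKE = auto()
    HOLD = auto()
    RELAX = auto()

@dataclass
class Segment:
    phase: Phase
    start: float   # seconds
    end: float

@dataclass
class GestureUnit:
    """A fluid sequence of co-articulated gestures; not every phase must appear."""
    segments: list

    def strokes(self):
        # The stroke phases carry the expressive content of the unit.
        return [s for s in self.segments if s.phase is Phase.STROKE]

unit = GestureUnit([
    Segment(Phase.PREPARATION, 0.0, 0.3),
    Segment(Phase.STROKE, 0.3, 0.7),
    Segment(Phase.HOLD, 0.7, 0.9),
    Segment(Phase.STROKE, 0.9, 1.3),   # second gesture co-articulated, no intervening rest
    Segment(Phase.RELAX, 1.3, 1.6),
])
print(len(unit.strokes()))  # 2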

7.1.3 Cultural Relevance Another critical aspect to bear in mind when discussing gestures, especially in the context of artificial agents, is that nearly every aspect of gesturing is culturally dependent [Efron 1941]. Hand shapes [Calbris 2011], gesture size and frequency [Kita 2009], emblematic meaning [Calbris 1990], and timing [Talmy 1985, Kita 2009] are a few examples of components of gesture that rely heavily on the native and contextual culture of the speaker. Some cultures use hardly any beat gestures, whereas some use them to punctuate almost every sentence [Levinson 1996]. As previously mentioned, emblems that are positive signals in one culture may be rude insults in another [Calbris 2011]. But beyond this, different cultures’ concepts of physical space and indeed time inform their gestures as well [DiMaggio 1997]. In North American cultures, when talking about time individuals often gesture along a plane running horizontal to the speaker, with the left in the past and the right in the future. However, in French culture time is often gestured as a plane running parallel to the speaker, as if the speaker is walking along the line of time with the future positioned in front and the past behind the back of the head [Calbris 2011]. But, in other cultures, the future may be referenced behind the speaker, with the past in front of the speaker’s eyes [Núñez and Sweetser 2006]. Contrast this yet again to Chinese culture, in which the vertical axis commonly applies in conceptualizing time where earlier times are viewed as “up” and later times as “down” [Radden 2003]. These different gestures show not only that cultural sensitivity must be taken into account for artificial agents when interpreting and performing gestures, but also that the underlying conceptual representation of time may differ between cultures as well. A further review may be found in Kendon [1997]. For an overview of the implementation of culture in SIAs, please refer to Chapter 13 of this handbook.
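As a small, hedged illustration of how such conventions might be parameterized in an agent, the lookup below simply restates the time-axis examples from this section; the culture labels and values are illustrative placeholders, not a validated model of any community's gesture behavior.

# Hypothetical lookup relating example cultural conventions to the spatial
# axis used when gesturing about time, restating the cases described above.
TIME_AXIS = {
    "north_american":       {"axis": "lateral",  "past": "left",     "future": "right"},
    "french":               {"axis": "sagittal", "past": "behind",   "future": "in front"},
    "future_behind_example": {"axis": "sagittal", "past": "in front", "future": "behind"},
    "chinese":              {"axis": "vertical", "past": "up",       "future": "down"},
}

def time_gesture_direction(culture, when):
    """Return the axis and direction an agent might gesture toward for 'past' or 'future'."""
    convention = TIME_AXIS[culture]
    return convention["axis"], convention[when]

print(time_gesture_direction("french", "future"))   # ('sagittal', 'in front')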

7.1.4 Gesture's Role in Conversation
The influence of gesture permeates social interaction. While we predominantly discuss gesture's role in human–human interaction, it is crucial to note that virtual agents elicit responses consistent with those elicited by humans in many social contexts [Takeuchi and Naito 1995, Poggi and Vincze 2008, McCall et al. 2009, Krämer et al. 2013].


7.1.4.1 Dialog Regulation
Gestures can help regulate conversation, for example, by signaling the desire to hold onto, acquire, or hand over the dialog turn [Bavelas 1994]. Bergmann et al. [2011] explore a non-exhaustive list of the many ways gesture regulates dialog, which can be broadly broken into content-specific and content-agnostic behaviors. Content-specific gestures relate to the specific content being discussed, and include clarification requests, establishing a confidence level in the content of conversation, assessments of relevance, and indications and connections of topical information within the conversation. Content-agnostic behavior, however, has to do with the social rules of the conversation. Content-agnostic gestures may include next-speaker selection or the handling of anti-social or non-canonical discourse behavior, such as interrupting.

7.1.4.2 Observer's Internal Beliefs
The gestures that accompany face-to-face spoken interaction convey a wide variety of information and stand in different relations to the verbal content. For the observer, gestures serve a wide variety of communicative functions, such as commenting, requesting, protesting, directing attention, showing, and rejecting [Jokinen et al. 2008]. In realizing these communicative functions, a gesture can provide information that embellishes, substitutes for, contradicts, or is even independent of the information provided verbally (e.g., Ekman and Friesen [1969b] and Kendon [2000]). As discussed above, gestures are, of course, physical actions, but these actions can convey both physical and abstract concepts. A sideways flip of the hand suggests discarding an object but can also be used to represent the rejection of an idea [Calbris 2011]. Gestures serve a variety of rhetorical functions. Comparisons and contrasts between abstract ideas can be emphasized by abstract deictic (pointing) gestures that point at the opposing ideas as if they each had a distinct physical locus in space [McNeill 1992]. A downward stroke of a gesture is often used to emphasize the significance of a word or phrase in the speech or to enumerate points. Gestures are also used to reinforce and clarify their co-speech utterances. Jamalian and Tversky [2012] show that different gestures in coordination with the same temporally ambiguous utterance ("the meeting was moved forward two days") successfully disambiguate the temporal uncertainty. Similarly, gestures allow observers to interpret statements as questions using the same audio [Kelly et al. 1999], and to disambiguate linguistic homonyms [Holler and Beattie 2003]. It is precisely because gestures are used to clarify speech so often that some researchers suggest that gesture is the first tool humans use to disambiguate basic ideas and requests [Özçalışkan and Goldin-Meadow 2005].


Further evidence suggests that increased gesturing in this manner can lead to positive learning outcomes in teaching scenarios [Goldin-Meadow and Alibali 2013]. Yet the impact of gesture is not always so explicit. For example, gestures are known to influence thought in the viewer. In the same publication, Jamalian and Tversky [2012] showed that using different types of metaphoric gestures changes the way that individuals qualitatively describe certain systems and processes. Gestures can also present information about the speaker's state and views toward the subject of conversation. Pollick et al. [2001] show that viewers are able to read affect from arm motions alone, potentially giving the viewer valuable interpretable information about the gesturer's internal mental state. Similarly, gestures have also been shown to influence memory recall in cases of eye-witness testimony [Gurney et al. 2013], opening up discussion of gestures providing leading answers in a similar off-the-record manner. Seeing gestures used appropriately also bolsters viewers' impression of the speaker. Speakers who gesture in conversation are perceived as more composed, effective, persuasive, and competent than those who do not [Maricchiolo et al. 2009].

7.1.4.3 Revealing the Speaker's Mental States and Traits
Gesture plays a critical role in human interaction, where it is not simply an addition to speech. Rather, it is an independent expression of thought that reveals the underlying beliefs, intentions, and processes of the speaker [Cienki and Koenig 1998]. A wide range of mental states and character traits can be conveyed gesturally. Placing hands on hips can display dominance or displeasure, gestures performed with rapid acceleration can convey arousal or displeasure, and a gesture with the palm facing outward as if suggesting "stop" can convey displeasure at what a conversational partner is saying or doing. Self-touching gestures or self-adaptors [Ekman and Friesen 1969b], such as rubbing a forearm, are also believed to convey information about a person's mental state while also providing self-comfort. In particular, these behaviors can reveal negatively valenced emotional states such as anxiety, fear, or guilt [Ekman and Friesen 1969a]. Gestures may further be used to implicitly convey off-the-record information [Wolff 2015]. For example, a speaker may describe two people "getting together" with a co-speech gesture of either gently intertwining hands or two fists clashing against one another. While the former may suggest harmony between individuals, forcing hands together at high velocity multiple times implies conflict and aggression [Morris 2015] (we discuss the ways in which the form of gesture carries meaning in Section 7.2.1).


However, the speaker may specifically choose to convey this information outside of the speech channel. In doing so, the speaker relays information in a fashion that is off-the-record but still provides context for that information to the viewers.

7.1.4.4 Speaker Impact
While gesture is an invaluable tool for communication, it also acts as an aid for the speaker. Gestures occur regardless of whether a listener can actively view them. Individuals gesture at nearly the same rate when speaking to someone on the phone as in person [Iverson and Goldin-Meadow 1998]. Similarly, individuals gesture when they know that the viewer is blind [Iverson and Goldin-Meadow 1997, 1998]. Even congenitally blind individuals gesture at both sighted and other blind individuals [Iverson and Goldin-Meadow 2001]. This suggests that gesture plays an important role not only in social communication but also in aiding the speaker's own process of conveying information. One hypothesis is that gesturing helps lighten the cognitive load on the speaker [Goldin-Meadow et al. 2001]. While it is impossible to know the full extent of the interaction between gesture and speech without understanding the underlying mechanism of going from thought to communication, we can observe ways in which communication is explicitly aided by gesture, or rather, hindered without gesture. Speakers speak less fluently when they lose the ability to gesture [Lickiss and Wellens 1978]. They also have more trouble recalling words when their hands are bound and they are unable to gesticulate during speech [Rauscher et al. 1996]. This phenomenon points to deep relationships between physical body movements and cognition, discussed in the next section.

7.2 Models and Approaches
While the importance of gesture for both the viewer and the speaker is clear, so too is the extent to which gesturing is a complex, nuanced, and difficult task to perform. Broadly, this difficulty can be broken down into two tasks: selection and execution. This is not to downplay the difficulty of collecting the upstream knowledge on which to base selection, such as modeling or inferring intentions, leakage, dialog regulation, and predicting the effects of gesture performance. These phenomena represent substantial challenges in their own right and have fields of research dedicated to them. For the purposes of gesture generation, we will focus on approaches to these two sub-problems. However, before we go further into how gestures may be generated and acted by socially intelligent agents, we must elaborate on how gestures carry meaning in the first place in order to discuss how the components of gesture may be manipulated based on communicative intent. In this section, we focus on broad approaches and their similarities and differences. While we provide contemporary examples of these various architectures, we do not deal with implementations of computational models or gesture generation mechanisms. For a more extensive look at the implementation of such architectures, please refer to Chapter 16.

7.2.1 How Gestures Carry Meaning
As we saw earlier, gestures play a variety of functions in face-to-face interaction, and further, there may be multiple such functions that are relevant during a specific utterance. However, there is a limit to the complexity of information they can reliably convey [Saund et al. 2019]. In this section, we discuss the traits of gesture that have been shown to carry meaning to viewers. There are many individual components of a gesture that may be responsible for viewer interpretation, and the information and capacity of each component varies by individual and by culture. Broadly, when discussing co-speech gestures, we refer to the shape and trajectory of the hands and all of the parameters that guide those components. Non-exhaustively, this includes the velocity and amplitude of arm motions, the orientation of the speaker toward the subject, the direction and symmetry of the hands, and the timing of hand shape changes relative to conversational context. These components and more are discussed at length by Calbris [2011], who describes how parameters of these components (such as the plane of trajectory of the hands or the orientation of the hand relative to the arm) may augment or vary the communicative function of a gesture. Specifically, she uses the gestural components specified in Zao in Calbris et al. [1986]: movement, localization, body part, orientation, and configuration. Together, these components can be used as a framework to describe and analyze the shape and communicative function of conversational gestures. It is not only the components themselves but also their dynamics (e.g., amplitude, speed, and fluidity of movement) that are integral in conveying these functions [Castellano et al. 2007]. Calbris et al. [1986] also explore how varying the parameters of a gesture may result in multiple gestural representations of a single idea and how, because of the parameter space of gestures, one idea may be presented by many different conceivable gestures.
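To make these components concrete, the following minimal sketch encodes the gestural components listed above (movement, localization, body part, orientation, configuration) together with their dynamics as a parameter set that a generator could manipulate. All field names and value vocabularies are illustrative assumptions, not an established annotation or generation schema.

```python
# A minimal sketch (not from any system cited in this chapter) of how the
# gestural components discussed above, plus their dynamics, might be encoded
# as manipulable parameters. All names are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum


class Handedness(Enum):
    LEFT = "left"
    RIGHT = "right"
    BOTH = "both"


@dataclass
class GestureSpec:
    movement: str = "arc"             # trajectory type, e.g., "arc", "straight", "circular"
    localization: str = "center"      # region of gesture space where the stroke occurs
    body_part: Handedness = Handedness.RIGHT
    orientation: str = "palm_up"      # hand orientation relative to the arm/torso
    configuration: str = "open_hand"  # hand shape
    amplitude: float = 0.5            # dynamics: 0 (minimal) .. 1 (expansive)
    speed: float = 0.5                # dynamics: 0 (slow) .. 1 (rapid)
    fluidity: float = 0.5             # dynamics: 0 (jerky) .. 1 (smooth)

    def scaled(self, factor: float) -> "GestureSpec":
        """Return a copy with amplitude and speed scaled, clamped to [0, 1]."""
        def clamp(x: float) -> float:
            return max(0.0, min(1.0, x))
        return GestureSpec(self.movement, self.localization, self.body_part,
                           self.orientation, self.configuration,
                           clamp(self.amplitude * factor),
                           clamp(self.speed * factor), self.fluidity)


# Example: a calm presenting gesture and a more agitated variant of the same form.
present = GestureSpec(movement="arc", orientation="palm_up", amplitude=0.4, speed=0.3)
agitated_present = present.scaled(1.8)
```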

7.2.2 Challenges of Gesture Generation
The two challenges of selection and execution come with two important constraints that plague all aspects of intelligent social agent research: processing time and realization (animation or hardware) constraints. An acceptable pause between utterances is anywhere from 100–300 ms [Reidsma et al. 2011], during which time an agent must gather or infer the relevant context, select a gesture given that context, plan, and perform the gesture in coordination with speech in order to appear natural. Similarly, choosing the contextually perfect gesture is useless if it cannot be performed on the required hardware. If choosing the optimal gesture would take 5 s, but a close-enough gesture only 0.05 s, that must be accounted for in the selection process. In addition to these theoretical challenges, researchers also face the practical issue of how best to transcribe communicative functions using a common interface across different selection and execution implementations. The dominant framework for this is the SAIBA framework [Kopp et al. 2006], with stages that represent intent planning, behavior planning, and behavior realization. SAIBA interfaces with two markup languages, the Functional Markup Language (FML) and the Behavior Markup Language (BML), to move between these stages. By beginning with the intention of the agent, one can then derive the signals to produce. This decouples intention from the implementations of different gesture generation mechanisms so they may be applied to different social agents, and forces architectures to drive gesture generation by intention and communicative function. Notably, this framework was explicitly developed with the goal of interdisciplinary collaboration in mind. In reality, the major challenges of what motions to perform, how to communicate those motions, and how to finally perform them must be considered in tandem throughout the gesture selection and performance process. Below, we dive deeper into the considerations of the process of going from communicative intent to gesture performance.
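To illustrate the staged, SAIBA-style flow of intent planning, behavior planning, and behavior realization, here is a toy pipeline sketch. The tag names, data structures, and mappings are simplified stand-ins invented for this example; they are not the normative FML or BML schemas, nor the rules of any cited system.

```python
# A toy, SAIBA-style pipeline: intent planning produces FML-like functions,
# behavior planning maps them to BML-like behaviors, and a stand-in realizer
# schedules them. Structures and rules are invented for illustration only.
from dataclasses import dataclass
from typing import List


@dataclass
class CommunicativeFunction:   # output of intent planning (FML-like)
    name: str                  # e.g., "emphasis", "affirmation"
    target_word: str           # word the function is anchored to


@dataclass
class BehaviorSpec:            # output of behavior planning (BML-like)
    behavior: str              # e.g., "beat", "head_nod"
    sync_word: str             # word whose onset the stroke should align with
    amplitude: float


def plan_intent(utterance: str) -> List[CommunicativeFunction]:
    """Toy intent planner: emphasize the word 'really' if present."""
    return [CommunicativeFunction("emphasis", w)
            for w in utterance.split() if w.lower().strip(".,") == "really"]


def plan_behavior(funcs: List[CommunicativeFunction]) -> List[BehaviorSpec]:
    """Toy behavior planner: map each communicative function to a gesture."""
    mapping = {"emphasis": ("beat", 0.7), "affirmation": ("head_nod", 0.5)}
    return [BehaviorSpec(*mapping[f.name], 0.0)._replace if False else
            BehaviorSpec(mapping[f.name][0], f.target_word, mapping[f.name][1])
            for f in funcs if f.name in mapping]


def realize(specs: List[BehaviorSpec]) -> None:
    """Stand-in realizer: print what an animation engine would schedule."""
    for s in specs:
        print(f"schedule {s.behavior} (amplitude={s.amplitude}) at '{s.sync_word}'")


realize(plan_behavior(plan_intent("I really think this will work.")))
```

The point of the sketch is the decoupling: the intent planner knows nothing about joints or animations, and the realizer knows nothing about communicative intent, mirroring the separation that SAIBA is designed to enforce.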

7.2.2.1 Selection
Selecting a gesture comes with a range of considerations. Some driving factors may be the communicative intent of the speaker, from the motivation and subgoal of a particular utterance to any driving goals of the interaction. An agent must then incorporate relevant social context, such as the social status of the user or the user's attentiveness to the conversation. This leads to considering the location of the conversation, both in general and in terms of elements that may be constantly updating, such as people walking by. These factors drive the process of determining how to actually gesture, both with and without speech. Selection must primarily be guided by the conversational goals of an agent. While gestures can be used to build rapport between agents and users [Wilson et al. 2017], this function may be considered unnecessary or even detrimental to an agent whose primary function is to direct or inform users efficiently. It is important that these dialog goals guide gesture selection, as random gesturing is not only confusing for the viewer and unnatural looking [Lhommet and Marsella 2014] but can also lead to critical misunderstandings [Gurney et al. 2013]. As previously discussed, one role that gesture plays in human speech is to convey both explicit and implicit information to conversational partners in a contextually appropriate manner. Depending on the intended communicative function of the gesture, this context can be considered with great depth. One of the fundamental social skills for humans is the attribution of beliefs, goals, and desires to other people, otherwise known as theory of mind [Whiten and Byrne 1988]. In other words, an agent's concern with respect to gesture is not only "what does my gesture mean?" but "what does my gesture mean to them?" Scassellati [2002] provides an overview of how these challenges might be addressed in artificial agents, including implementations that can be used to predict internal state and, consequently, potential user responses. For an overview of theory of mind for SIAs, please refer to Chapter 9 of this handbook. Moreover, what may be still more relevant to an agent's gestures is its own internal emotional state. Gesture can also be used to portray emotion in a way that is detectable by viewers [Pollick et al. 2001, Kipp and Martin 2009]. There is considerable literature dedicated to computational models of emotion, with a summary found in Marsella et al. [2010]. The breadth of this field in the context of gesture research suggests that an agent's own internal state may play a modulating role in gesture generation, with respect to both the type of gesture selected and the way that gesture is performed. Research suggests that agents with understandable and consistent mental states that act predictably are preferable for users [Mubin and Bartneck 2015], making gesture a key potential avenue for facilitating positive social interaction. Yet another consideration is when a gesture performance by an agent is appropriate. If given speech to perform, acoustic features such as emphasis and prosody can be key indicators of when a gesture performance may enhance communication (or hinder it) [Krahmer and Swerts 2007]. Similarly, semantic information in speech may give clues as to when to gesture or provide parameter values to modulate gestures. For instance, it may be advantageous to refrain from gesturing, or to use very low-amplitude gestures, when discussing sensitive topics.

7.2.2.2 Execution
Equally important to the context and content an agent may access and express is the structure of the potential gestures the agent can perform. Given the space of possible human gestures (e.g., the infinite planes on which hands can project and angles at which wrists can move, Section 7.2.1), they can be extremely challenging or impossible to replicate exactly, especially in physical robots with limited degrees of freedom compared to people, or in non-humanoid forms.


One area of concern in terms of the execution of a gesture is temporally aligning motion appropriately with co-speech utterances. Gestures seem to differ in terms of perceivers' sensitivity to their alignment with speech [Bergmann and Kopp 2012]. Depending on the agent implementation, coordination with other relevant body parts, such as the eyes, legs, and mouth, may present challenges for both dynamic animation and robotic movement. While virtual agents may have a limited set of body points that can be controlled, a wide variety of tools, from 3D modeling and animation tools [Autodesk, INC.] to character animation engines [Niewiadomski et al. 2009, USC Institute for Creative Technologies], exist to hand-animate, motion-capture, or procedurally generate gestures on virtual agents. As discussed in Section 7.1.2.1, another challenge in gesture animation concerns the complex structure of gestures and the role of that structure in the performance of sequences of gestures (namely the phases described in Section 7.1.2.1). This includes the challenge of how to integrate individual gestures' features into fluid performances. To do so, virtual agent researchers have taken into account that human gesturing has a hierarchical structure that serves important demarcative, referential, and expressive purposes [Xu et al. 2014]. Xu et al. [2014] lay out an approach that uses this higher level of organization to realize gesture performances. Their approach determines when and which features are common versus which ones must be distinguishable, and addresses issues concerning the physical coordination or co-articulation between gestures within gesture units, including determining whether individual gestures go into relax, rest, or hold phases. The work of Xu et al. drew on Calbris' [2011] concept of an ideational unit. Another challenge concerns the manipulation of the expressivity of gestures. For example, consider a gentle beat gesture that might convey a calm speaker emphasizing a point versus a strong beat gesture with larger, more accelerated motion that conveys a more agitated speaker strongly emphasizing a point. One approach to realizing such variation is to handcraft a suite of beat gestures. The technique of parameterized blending of animations, however, supports smooth variation between those extremes by controlling the amount of each gesture that is used in the blend, so that the resulting gesture can vary the degree to which it emphasizes a point or conveys agitation. Blending presents challenges specifically to animators and graphic designers responsible for the presentation of gestures on virtual agents. A variety of motion blending techniques used specifically in the context of gesture generation are discussed in Feng et al. [2012]. Robots offer their own set of challenges. Often, robots have far fewer degrees of freedom than humans and virtual agents, with hard constraints on the extent and speed of motion.


They are very different from and severely limited compared to graphics-based humanoid models. Specifically, robots suffer from the physical limitations of their own hardware, with body parts that may be too heavy to move quickly without hurting themselves or others around them. Or, in order to alleviate danger to themselves or others, they may have a severely limited range of motion with which to express gestures. These challenges are discussed further in Section 7.3.

7.2.2.3 Gesture Catalogs Versus Dynamic Generation
Broadly, we can characterize approaches to gesture generation as either using a set catalog of gestures or a set of parameters that drives dynamic generation of gestures on the fly. Here we provide an overview of these approaches, while below we instantiate them with existing implementations. Virtual agent designers and social roboticists often take the approach of using a fixed library of gestures. This is beneficial because the agent designer may create gestures specific to the use case of the agent, either by having an animator create gestures using animation software or by using motion capture of an actor. Another benefit is that, with pre-computed animations, the agent does not have to do extra work to actually compute the animation, but can instead act instantaneously with a motion that is guaranteed to satisfy the requirements of its software and hardware. However, while looking smooth and executing quickly are major considerations in social agent research, this approach suffers from a lack of diversity in movements. By selecting only from a library of pre-animated gestures, agents risk looking particularly "artificial" by re-using gestures, by lacking a gesture for a particular social situation, or by being unable to vary expressivity. To address such limitations, research has explored parameterized gesture generation techniques, as mentioned above, that blend animations dynamically, providing a continuous range of variability between a mild beat gesture and a strong beat, or a small frame gesture and a large frame. This can also be done across multiple dimensions so that, for example, a beat may be varied both in intensity and in direction. Alternatively, an option of greater complexity is to allow agents to generate gestures entirely from a more complete parameterization of the motion, such as the hand shapes, the path the wrist takes, and so on. This can be manifested in two ways: by generating gestures on the fly or by finding gestures from a library that satisfy any specified parameters. The first approach must contain a model of how particular elements of the communicative context relate to gestural parameters, where the context might include, for example, whether the agent is trying to convey confusion, how agitated the agent should look, and what hand shape and motion were used in the previous gesture.


The alternative is simply to use a lookup table approach, where the context selects a set of pre-specified parameter values. For example, Poggi et al. [2005] use context to derive hand-crafted parameters (such as amplitude, openness, etc.), which then select from a library of pre-created gestures. The use of pre-animated motions saves the calculation of motion planning during execution, while also supporting manipulation of the dynamics of those motions during execution to provide a level of novelty for the viewer. Importantly, the resulting gesture from any method may still be adjusted through parameter manipulation. Gestures may be sped up, mirrored to adjust direction, or blended to create "mild" or "extreme" versions of a gesture in amplitude, all at run time.
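The run-time manipulations just described, blending two pre-animated versions of a gesture and adjusting amplitude or playback speed, can be sketched as follows. Poses are assumed here to be arrays of joint angles sampled at a fixed frame rate; this is a generic illustration, not the blending scheme of any system cited in this chapter.

```python
# An illustrative sketch of run-time parameter manipulation: blending a mild
# and a strong version of a gesture, scaling amplitude about a rest pose, and
# resampling playback speed. Clips are [frames, joints] arrays of joint angles.
import numpy as np


def blend(mild: np.ndarray, strong: np.ndarray, weight: float) -> np.ndarray:
    """Linearly interpolate, frame by frame, between two versions of a gesture."""
    weight = float(np.clip(weight, 0.0, 1.0))
    return (1.0 - weight) * mild + weight * strong


def scale_amplitude(clip: np.ndarray, rest_pose: np.ndarray, factor: float) -> np.ndarray:
    """Exaggerate or dampen a gesture by scaling its deviation from a rest pose."""
    return rest_pose + factor * (clip - rest_pose)


def resample_speed(clip: np.ndarray, speed: float) -> np.ndarray:
    """Speed up (speed > 1) or slow down (speed < 1) by resampling frames."""
    n_frames = max(2, int(round(clip.shape[0] / speed)))
    src = np.linspace(0, clip.shape[0] - 1, n_frames)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, clip.shape[0] - 1)
    frac = (src - lo)[:, None]
    return (1 - frac) * clip[lo] + frac * clip[hi]


# Example with synthetic data: 60 frames, 10 joint angles.
rest = np.zeros((1, 10))
mild_beat = 0.2 * np.sin(np.linspace(0, np.pi, 60))[:, None] * np.ones((1, 10))
strong_beat = 3.0 * mild_beat
agitated = resample_speed(scale_amplitude(blend(mild_beat, strong_beat, 0.8), rest, 1.2), 1.5)
```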

7.2.3 Broad Approaches in Current Implementations
We have discussed the ways in which gestures carry meaning and the challenges facing researchers who implement generative models of gesture. Now, we present implementations that attempt to overcome these challenges to create compelling gestures in socially intelligent agents. Approaches to co-speech gesture generation can be characterized as existing on a continuum from rule-based to end-to-end machine learning techniques. One issue common to any approach, however, is that of going from mental states to gestural performance. As we noted, human gesturing is influenced by a wide variety of mental states, including communicative intentions within and across utterances, leakage or regulation of affective and cognitive states, traits, and dialog management. The richness of human gesturing arises from this variety of mental state inputs. However, the social agent field currently lacks a cognitive architecture of sufficient complexity to model such a variety of mental states, and has broadly moved away from holistic, all-encompassing behavioral architectures (with notable exceptions [Swartout et al. 2006, Kopp et al. 2014]). Consequently, the proxy input in gesture models is often reduced to the text and/or audio of the utterance that the agent is meant to perform, sometimes along with a limited communicative intent, because these elements are available to agents. This can limit an agent's gesture performance to what is available in these inputs. In other words, if the agent is not modeling emotion, social attitudes like skepticism, or what it wants to say on versus off the record, then its gestures cannot reflect this information. This is true even for systems that use recorded voice, where potentially some of this information may be inferred from the audio, since the agent or agent designer must still model such information when selecting or recording the voice, respectively.


7.2.3.1 Rule-based Models
One of the earliest, if not the earliest, generators is the Behavior Expression Animation Toolkit (BEAT) [Cassell et al. 2004], which works by analyzing the relation between surface text and gestures. Text is parsed to obtain information such as clauses, themes/rhemes, objects, and actions occurring in the discourse. This information is then used in conjunction with a knowledge base containing additional information about the world in which the discourse is taking place in order to map it onto a set of gestures. The Non-verbal Behavior Generator (NVBG) [Chiu and Marsella 2011] extends the BEAT framework by making a clearer distinction between the communicative intent embedded in the surface text (e.g., affirmation, intensification, negation) and the realization of the gestures. This design allows NVBG to generate gestures that are rhetorically relevant even without a well-defined knowledge base. Another approach that utilizes real-world utterance analysis is by Stone et al. [2004]. They proposed a framework to extract utterances and gesture motions from recorded human data, and then generate animations by synthesizing these utterances and motion segments. This framework includes an authoring mechanism to segment utterances and gesture motions and a selection mechanism to compose utterances and gestures. Similarly, Neff et al. [2008] created a comprehensive list of mappings between gesture types and related semantic tags to derive transmission probabilities of motion from sample data. This framework captures the details of human motion and preserves individual gesture style, which can then be generalized to generate gestures with varying forms of input. This leads to a still more sophisticated method of generation, which is to combine this language-based method with making inferences from dialog about the mental state of the agent to determine which gesture to use. Notably, this approach may be effective without mapping to exact gestures. The outcome of different rules may, instead of prescribing an exact gesture, determine specific elements that should be present in a gesture (as seen in Poggi et al. [2005]). Additionally, various contextual information, such as speech prosody or detected listener attention, can determine other elements of gestural performance such as speed (or co-speech timing) and amplitude. This approach has been shown to be effective through multiple prominent examples in virtual agents. Using a combination of acoustic and linguistic elements, Cerebella [Lhommet and Marsella 2013, Marsella et al. 2013] is a system currently in use in both virtual agent and social robotics applications, which dynamically generates gestures that appropriately correspond to speech both auditorily and semantically.
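To give a flavor of the rule-based style, the following toy generator maps lexical cues in the surface text to communicative functions and then to gesture classes. It is written in the spirit of systems like BEAT or NVBG but does not reproduce their actual rule sets; the word lists and mappings are invented for the example.

```python
# A toy illustration of a rule-based generator: lexical cues trigger
# communicative functions, which are mapped to gesture classes. The cue
# lexicons and gesture names are invented, not taken from BEAT or NVBG.
import re
from typing import List, Tuple

CUES = {
    "negation": {"no", "not", "never", "none"},
    "intensification": {"very", "really", "extremely"},
    "affirmation": {"yes", "absolutely", "definitely"},
    "contrast": {"but", "however", "whereas"},
}

FUNCTION_TO_GESTURE = {
    "negation": "horizontal_sweep",
    "intensification": "strong_beat",
    "affirmation": "head_nod",
    "contrast": "two_sided_deictic",
}


def generate(utterance: str) -> List[Tuple[str, str, str]]:
    """Return (word, function, gesture) triples for each cue word found."""
    results = []
    for word in re.findall(r"[a-zA-Z']+", utterance.lower()):
        for function, lexicon in CUES.items():
            if word in lexicon:
                results.append((word, function, FUNCTION_TO_GESTURE[function]))
    return results


print(generate("No, that is really not what I meant, but it is close."))
# e.g., [('no', 'negation', 'horizontal_sweep'), ('really', 'intensification', 'strong_beat'), ...]
```

Real systems, of course, work from parses, knowledge bases, and prosody rather than flat word lists, but the basic shape of mapping communicative function to behavior is the same.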


Figure 7.2 The architectures of two generative gesture models. (a) Cerebella architecture and (b) GRETA architecture.

Greta [Poggi et al. 2005] is another example that typifies how high-level concepts can be used, through external context, to drive the motion of an agent's gestures. The architectures of these two systems, which provide excellent comparative examples of gesture-generating architectures, are shown in Figure 7.2(a) and (b).

7.2.3.2 Data-driven Techniques
At the other end of the spectrum is completely text-agnostic, end-to-end gesture production using deep learning. These models use large amounts of audio and video harvested from online sources like YouTube, and use video parsing tools such as OpenPose [Cao et al. 2019] to extract motion data and correlate audio with speaker movements. Using varying combinations of adversarial networks and regression, models are able to produce extremely natural gestures over a wide variety of speech-audio inputs [Ferstl et al. 2020]. This approach undeniably leads to impressively natural results, particularly in the context of generating gestures based on an individual speaker [Ginosar et al. 2019]. However, this approach lacks the sophistication of including multiple informative aspects of gesturing. By using audio input, these models are largely based exclusively on vocal cues like pitch and prosody. As a result, they fail to learn mappings between motion and semantic and rhetorical structure, and produce gestures that, while more natural, are less nuanced and complex than those we see in human performance. While it has been argued that the middle layers of these networks can derive some of these aspects [Takeuchi et al. 2017], evaluations of gesture meaningfulness or semantic relatedness to co-utterances have not been done with end-to-end machine learning models based on audio.


Recently, end-to-end models have also been developed without audio, exclusively using the co-utterance text of gestures [Yoon et al. 2019]. These have resulted in gestures that are judged as related to the co-utterance, as well as life-like and likeable. This work paves the way for promising avenues in the future of gesture generation, harnessing the power of end-to-end machine learning models with speech qualities derived from both audio and textual cues. The possibility of hybrid systems can offer the best of both worlds in terms of flexibility, novelty, and performance. From the examples above, it is easy to see how these two approaches exist on a continuum. In the rule-based example, recognizing that a particular phrase has a negative intent necessarily requires some aspect of machine learning, and there is a robust body of literature on detecting affect in both written language [Pennebaker et al. 2001, Hutto and Gilbert 2014] and speech [Eyben et al. 2009, Schuller et al. 2011]. Similarly, we can obtain transcripts from audio input and parse these for rhetorical and semantic cues using text parsers (e.g., Charniak [2000], Pedersen et al. [2004], and Joty et al. [2015]), many of which are used in the models above. These can be correlated with gestures and may add crucial extra-auditory elements to deep learning models. The Cerebella system realizes such a hybrid technique. It leverages information about the character's mental state and communicative intent to generate non-verbal behavior when that information is modeled by the agent [Marsella et al. 2013, Lhommet et al. 2015]. In addition, it relies on machine learning methods to derive syntactic structure from the text and prosodic information from the spoken utterance. These sources of information are fed into a rule-based system and lexical database that perform additional lexical, pragmatic, metaphoric, and rhetorical analyses of the agent's utterance text and audio to infer communicative functions that will drive the agent's non-verbal behavior.
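As a rough illustration of the end-to-end idea, the skeleton below regresses a sequence of pose vectors from a sequence of acoustic feature frames. The architecture, feature dimensions, and training step are invented placeholders; this is not the design of Ferstl et al. [2020], Ginosar et al. [2019], Yoon et al. [2019], or any other cited system.

```python
# A minimal skeleton of the data-driven idea: regress pose sequences from
# acoustic feature frames. All dimensions, layers, and data are placeholders.
import torch
import torch.nn as nn


class AudioToGesture(nn.Module):
    def __init__(self, n_audio_features: int = 26, n_joints: int = 15, hidden: int = 128):
        super().__init__()
        # Recurrent encoder over acoustic frames (e.g., MFCCs plus prosodic features).
        self.encoder = nn.GRU(n_audio_features, hidden, num_layers=2, batch_first=True)
        # Frame-wise decoder to 3D joint positions (n_joints * 3 values per frame).
        self.decoder = nn.Linear(hidden, n_joints * 3)

    def forward(self, audio_frames: torch.Tensor) -> torch.Tensor:
        # audio_frames: [batch, time, n_audio_features]
        hidden_states, _ = self.encoder(audio_frames)
        return self.decoder(hidden_states)  # [batch, time, n_joints * 3]


# One illustrative training step on random stand-in data.
model = AudioToGesture()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
audio = torch.randn(8, 200, 26)   # 8 clips, 200 frames of acoustic features each
poses = torch.randn(8, 200, 45)   # matching motion-capture pose targets
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(audio), poses)
loss.backward()
optimizer.step()
```

Published systems typically add adversarial losses, temporal smoothing, and speaker conditioning on top of a regression backbone like this; the sketch only shows the basic input-output relationship the section describes.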

7.2.4 Gesture Collection and Analysis
To study and understand naturally occurring gestures, researchers use a variety of techniques, tools, and analyses. Like many fields of behavioral psychology, gesture research has used natural observation since the 1970s and 1980s. In the lab, however, classical techniques include solving spatial reasoning problems and game play [Alibali and Goldin-Meadow 1993], narrating videos [Kita and Özyürek 2003], or telling written stories to conversational partners [Jacobs and Garnham 2007]. More recently, researchers have begun using more subjective techniques such as conversational scenarios [Ennis et al. 2010] and questions explicitly designed to elicit a variety of metaphoric gestures [Chu et al. 2014].


Some researchers have also used trained actors, either to perform their interpretation of an expression of an emotion or to speak freely in a story-like, monologue fashion [Ferstl and McDonnell 2018]. More recently, platforms like YouTube have provided troves of real-life examples of gestures by a huge variety of speakers in different contexts [Ginosar et al. 2019, Yoon et al. 2019]. A range of tools is then used to dissect and analyze these gestures. Working mainly from audio and video, researchers have developed a variety of annotation schemes for segmenting and assigning meaning to sections of gestures [Chafai et al. 2007, Neff et al. 2010, Kipp 2014]. Such schemes are validated by determining internal consistency and inter-annotator agreement, thereby yielding a reliable metric through which gesture elicitation techniques, as well as gestures themselves, can be compared along many axes. Motion capture has also gained prominence in the gesture-capture space. Motion capture provides precise information on the spatial and temporal aspects of gesture, which can lead to powerful insights into how gesture correlates with speech and other elements of non-verbal behavior [Luo et al. 2009]. However, this equipment is expensive, can be cumbersome or distracting for participants, and still suffers from technical inaccuracies, particularly for capturing hands. Technological advances have made still other tools available, such as gyroscopes, accelerometers, Wii Remotes, and even VR controllers, which are sometimes used to capture information about gestures [Corera and Krishnarajah 2011]. Using these and other technologies, numerous datasets have gained popularity for studying, comparing, and animating gestures. These cover a wide range of visual setups, from over 30 camera angles [Joo et al. 2017] to one central camera [Cooperrider 2014], and from set gestures in tightly controlled staging conditions [Gunes and Piccardi 2006, Hwang et al. 2006] to spontaneous recordings collected completely outside laboratory settings [Ginosar et al. 2019, Yoon et al. 2019]. Along with a growing interest in open science and dataset production, new annotation tools such as the Visual Search Engine for Multimodal Communication Research [Turchyn et al. 2018], which allows researchers to rapidly search datasets for specific types of motion, are becoming more sophisticated and widely used.
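The inter-annotator agreement used to validate annotation schemes is typically reported with a chance-corrected statistic such as Cohen's kappa. The sketch below computes kappa for two annotators' labels over the same gesture segments; the category names and labels are invented, since each cited scheme defines its own inventory and segmentation rules.

```python
# An illustrative computation of Cohen's kappa for two annotators labeling
# the same gesture segments. Labels and categories are invented examples.
from collections import Counter


def cohens_kappa(annotator_a, annotator_b):
    assert len(annotator_a) == len(annotator_b) and annotator_a
    n = len(annotator_a)
    observed = sum(a == b for a, b in zip(annotator_a, annotator_b)) / n
    counts_a, counts_b = Counter(annotator_a), Counter(annotator_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    if expected == 1.0:  # degenerate case: both annotators used one identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Two annotators labeling the same ten gesture segments.
a = ["beat", "deictic", "iconic", "beat", "beat", "metaphoric", "iconic", "beat", "deictic", "beat"]
b = ["beat", "deictic", "iconic", "beat", "iconic", "metaphoric", "iconic", "beat", "beat", "beat"]
print(round(cohens_kappa(a, b), 3))
```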

7.2.5 Evaluation
Evaluations of these models must be as application-driven as the selection and performance of the gestures themselves. And, while some metrics offer the comfort of traditional statistical analysis or straightforward interpretation, the right metrics for evaluating a model might be as difficult to determine as the gestures themselves. Manipulating gesture can impact how viewers perceive an agent's personality traits [Neff et al. 2010] as well as common factors of interest such as trustworthiness, persuasiveness [Poggi and Pelachaud 2008], and naturalness [Maatman et al. 2005], often measured using self-reported subjective techniques.


However, these factors are usually difficult to measure directly. Many individual gestures may be produced over the course of a relatively short utterance, leading to a host of issues around how best to parse and recreate the timing of gestures [Wilson et al. 1996, Wachsmuth and Kopp 2001, Chiu and Marsella 2014]. This is further complicated once a gesture has been selected for evaluation because humans are notoriously bad at consciously discerning what does and does not look natural [Ren et al. 2005], for example. For this reason, a variety of other metrics may be employed to measure the performance of generative models along the axes of interest. Providing a forced choice between the original input gesture and the model's output, and comparing the results against a random production, is one alternative way to allow users to express a preference for gesturing behavior [Lhommet and Marsella 2013]. Mixed methods may also be used, for example, giving users a chance to freely write an utterance that could accompany a gesture and performing a thematic analysis on the generated utterances. Minimally, this method can be used during pilot experiments to determine appropriate terminology for classic fixed-choice responses [Bryman 2017]. Although it may seem intuitive that gestures should be evaluated on interpretability or clarity, this may not always be the case. For instance, an agent may intentionally perform a gesture that contradicts the utterance. The ultimate goal is to evaluate the gesture's consistency with the desired communicative function. That function, though, must be tailored to the particular context and uses of that social agent. As an alternative to subjective measurements, one can evaluate gestures in terms of whether they have the desired effect on behavior. For example, a range of experimental games have been used to explore the effect of an agent's non-verbal behavior on a human participant's behavior; the prisoner's dilemma [De Melo et al. 2009], the ultimatum game [Nishio et al. 2018], and the desert survival task [Khooshabeh et al. 2011] are a few examples. When the physical motion properties of a gesture are available, as in the Biovision Hierarchy (BVH) file format used in motion capture and animation work, objective metrics concerning the physical properties can be used to evaluate gestures. The challenge then becomes relating these properties to communicative functions and non-verbal behavior. Tools to deploy evaluations are also advancing rapidly. Whereas researchers previously required individuals to make in-person evaluations of many gestures, crowdsourcing platforms such as Amazon's Mechanical Turk and Prolific now make it possible to rapidly acquire many "first-impression" measures on many different gestures.


This has the added benefit of reducing the burden on viewers as well as reducing any fatigue effects from rating many different gestures. However, crowdsourcing platforms often yield varying quality in participant responses, and some demographic elements cannot be verified, making precise research on this medium challenging [Breazeal et al. 2013]. Additionally, crowdsourced participants may be non-naive "expert survey-takers," which can skew study results [Downs et al. 2010]. Study design elements such as verifying attentiveness, longitudinal studies, and mixed-method qualitative analyses of free responses can overcome some of these challenges [Chandler et al. 2014, Rouse 2015]. Ultimately, the evaluation of a model must be specific to both its implementation and its application.
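As one example of the objective, motion-based metrics mentioned above, the sketch below computes simple kinematic descriptors (peak speed, mean acceleration, jerk, path length) from a wrist trajectory of the kind that could be extracted from BVH-style motion data. Parsing the BVH file itself is omitted, the trajectory is synthetic, and the particular descriptors are generic choices rather than an established evaluation standard.

```python
# A sketch of objective kinematic descriptors computed from a wrist
# trajectory; the trajectory here is synthetic and the descriptor set is a
# generic illustration, not a standardized gesture-evaluation metric.
import numpy as np


def kinematic_descriptors(positions: np.ndarray, fps: float = 30.0) -> dict:
    """positions: [frames, 3] wrist positions in meters."""
    dt = 1.0 / fps
    velocity = np.gradient(positions, dt, axis=0)
    acceleration = np.gradient(velocity, dt, axis=0)
    jerk = np.gradient(acceleration, dt, axis=0)
    speed = np.linalg.norm(velocity, axis=1)
    return {
        "peak_speed_m_s": float(speed.max()),
        "mean_accel_m_s2": float(np.linalg.norm(acceleration, axis=1).mean()),
        "mean_jerk_m_s3": float(np.linalg.norm(jerk, axis=1).mean()),
        "path_length_m": float(np.sum(np.linalg.norm(np.diff(positions, axis=0), axis=1))),
    }


# Synthetic wrist trajectory: a 2-second arc sampled at 30 fps.
t = np.linspace(0, 2, 60)
wrist = np.stack([0.3 * np.sin(np.pi * t), 0.1 * t, 0.2 * np.cos(np.pi * t)], axis=1)
print(kinematic_descriptors(wrist))
```

Such descriptors only become evaluation metrics once they are related, empirically, to the communicative functions and viewer judgments discussed above.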

7.3 Similarities and Differences in Intelligent Virtual Agents and Social Robots
Both social robots and virtual agents are discussed when considering the future of human–computer interaction. The application domains in which researchers in each field aim to deploy these artificial social agents largely overlap, and include personal assistance, companionship, education, leisure, and clerical work [Riek 2014]. The importance of co-speech gesture in both domains has been strongly established, albeit with discrepancies as to the impact of physical embodiment [Li 2015]. Gesture is widely acknowledged as vital in initiating social conversation [Satake et al. 2009], building rapport [Riek et al. 2010], and increasing human-likeness [Salem et al. 2013] for both virtual agents and social robots. Non-verbal behavior in social robots also increases users' ability to maintain mental models of the robot's internal state [Breazeal et al. 2005], which is vital in cooperative tasks [Hiatt et al. 2011]. So far, the algorithms we have described have been agnostic to the agent that may employ them. In this section we explore the similarities between gesture generation in virtual agents and social robots, but more pressingly the acute challenges that come with realizing gestures on physical devices.

7.3.1 Physical Presence
A significant body of literature suggests that robots gain some benefit to social interaction over virtual agents [Thellman et al. 2016]. Techniques that require physical presence, such as user mimicry and attention-grabbing motions [Fridin and Belokopytov 2014], may give robots an edge over virtual agents in terms of boosting learning outcomes in tutoring settings [Leyzberg et al. 2012, Belpaeme et al. 2018], particularly for children [Jost et al. 2012].


Social robots have also been shown to be more helpful and enjoyable in interactions than their virtual counterparts for adults who are familiar with robots [Wainer et al. 2007]. However, robots also face very high user expectations with respect to physical interaction and the ability to sense the environment [Lee et al. 2006]. Many of these evaluations are task-based or concern physical embodiment alone, rather than the specific movements of gestures on robots versus virtual agents. It remains unclear how these physical properties transfer to gestures' communicative properties.

7.3.2 Challenges of Physicality
Comparing human-like gestures on virtual agents and robots is difficult for many reasons: robot form, function, movement capabilities, environmental limitations, and the high stakes of making movement mistakes on a robot. These limitations require creativity, artistry, and thorough exploration to realize communicative expression in new ways on physically limited robots. Ultimately, individual use cases must be taken into account when determining the tradeoff between utilizing a virtual agent or a social robot for specific purposes. Humans have many more degrees of freedom in motion than most commercially available robots, especially the social robots seen today [Leite et al. 2013]. Robots with high degrees of freedom are costly and more difficult to program than simpler counterparts. While a few humanoid robots with potentially full expression do exist [Robotics 2019, Shigemi et al. 2019], many more have humanoid shapes but severely limited expression [Gouaillier et al. 2009, Robotics 2018], and still more bypass any attempt at humanoid presentation in favor of more abstract forms [Anki, Breazeal 2014, Embodied]. For this reason, most generative algorithms designed for virtual agents must be re-mapped onto a robot's more limited expressive abilities, which can make gestures appear awkward or mis-timed [Bremner et al. 2009, Ng-Thow-Hing et al. 2010]. In most cases, industrial robots are equipped with a set of pre-recorded gestures that are not generated online but simply replayed during human–robot interaction, as seen in Gorostiza et al. [2006], Sidner et al. [2003], or Salem et al. [2012]. Aligning speech to motion is particularly difficult in robots because of the path-planning required for novel gestures [Kopp et al. 2008]. Existing in the physical environment, while potentially more compelling and certainly offering a wider range of physical tasks that may be accomplished, comes with distinct challenges when it comes to gesture. Problems unique to robots extend from motion planning to design, control, sensing, biomimetics, and complex software [De Santis et al. 2008].
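A very simplified sketch of the re-mapping problem follows: a gesture trajectory produced by a generator is clamped to a robot's joint limits and slowed if it would exceed a maximum joint velocity. The joint limits, velocity cap, and the retargeting strategy are invented placeholders, not the specification of any particular robot or the method of any cited system.

```python
# A simplified sketch of retargeting a generated gesture onto a robot with
# range- and speed-limited joints. All limits and the strategy are invented.
import numpy as np

# Hypothetical 4-DOF arm limits (radians) and velocity cap (rad/s).
JOINT_LIMITS = np.array([[-2.0, 2.0], [-0.5, 1.5], [-2.0, 2.0], [0.0, 1.6]])
MAX_JOINT_VELOCITY = 1.0
FPS = 30.0


def retarget(trajectory: np.ndarray) -> np.ndarray:
    """trajectory: [frames, 4] desired joint angles from a gesture generator."""
    # 1. Clamp each joint to its mechanical range.
    clamped = np.clip(trajectory, JOINT_LIMITS[:, 0], JOINT_LIMITS[:, 1])
    # 2. Find the fastest joint velocity the clamped motion would require.
    velocities = np.abs(np.diff(clamped, axis=0)) * FPS
    peak = velocities.max() if len(velocities) else 0.0
    if peak <= MAX_JOINT_VELOCITY:
        return clamped
    # 3. If too fast, stretch the motion in time (uniform resampling), trading
    #    tight speech synchrony for a motion the hardware can safely execute.
    slowdown = peak / MAX_JOINT_VELOCITY
    n_frames = int(np.ceil(clamped.shape[0] * slowdown))
    src = np.linspace(0, clamped.shape[0] - 1, n_frames)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, clamped.shape[0] - 1)
    frac = (src - lo)[:, None]
    return (1 - frac) * clamped[lo] + frac * clamped[hi]


# Example: an over-large, over-fast sweep gets clamped and slowed down.
gesture = np.stack([np.linspace(0, 3.0, 15)] * 4, axis=1)  # 0.5 s sweep
safe = retarget(gesture)
```

The trade-off in step 3 is exactly the one discussed above: making the motion safe and feasible can de-synchronize it from speech, which is one reason speech-gesture alignment is harder on robots than on virtual agents.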


Figure 7.3 Some examples of contemporary social robots, ranging from humanoid with arms and legs, to less humanoid but still distinctly human with torso and arms, to more abstract but retaining a head and torso shape, to completely un-humanoid (and object-like). (a) ASIMO by Honda. (b) Pepper by SoftBank Robotics. (c) Jibo, photo courtesy of NTT DISRUPTION US Inc. (d) Cozmo, photo courtesy of Digital Dream Labs. Photos retrieved from global.honda/innovation/robotics; https://www.softbankrobotics.com/emea/en/pepper; https://www.jibo.com; https://www.anki.com

Additionally, robots must be consistently aware of their environment, including the people with whom they interact. Peri-personal space is a long-studied phenomenon in human–human interactions [Burgoon and Aho 1982, Sussman and Rosenfeld 1982, Burgoon 1991], and is well-documented for virtual agents in AR interactions [Slater et al. 2000, Ennis et al. 2010], but establishing a "social safety zone" seems to be an especially salient issue when heavy or unfamiliar robots are involved [Truong and Ngo 2016]. The problem of keeping robots at a socially acceptable distance from humans during interactions in itself requires knowledge of computer vision, psychology, and robotic path-planning [Gupta et al. 2018]. Despite the importance of proprioception and path-planning, most robots on the market today do not have robust full-body sensors capable of pro-actively avoiding collision, which means that some gestures could put the robot at risk of hurting itself or others. Another ongoing challenge in gesture research for social robotics is the mapping of communicative intent and expression onto the many abstract forms of existing devices (e.g., those found in Figure 7.3) [Hoffman and Ju 2014]. Attribution of internal states to abstract motions has long been chronicled and analyzed [Dittrich et al. 1996, Pollick et al. 2001], but the field is currently in the earliest stages of developing a framework capable of mapping the many elements of expression onto abstract frames [Van de Perre et al. 2018].


The art of mapping communication onto abstract bodily forms that are human-understandable has yet to be mastered.

7.3.3 Reach and Market Penetration
One of the fundamental distinctions between VAs and SRs is the ease of reaching users. VAs have been deployed on computers, web pages, tablets, and phones. Any device with a screen can be used to realize a VA application. The fact that they can be deployed so widely has special relevance for less-wealthy countries, where the market penetration of cell phones is very high due to the limitations of traditional landline telecommunications. For the user, there may be no significant additional hardware cost in using a VA application. SRs, in comparison, require the purchase of the robot and are therefore more of a luxury than a necessity given limited budgets. This is especially true of the current crop of SRs, which can socialize but are incapable of performing useful physical actions that could justify the cost.

7.3.4 Interdisciplinary Collaboration
The fields of social robotics and virtual agents largely overlap. Both attempt to facilitate natural, socially fulfilling, and productive interactions in a wide range of fields, including medicine, teaching, and leisure. Both are concerned with the artificial agent's theory of mind [Breazeal and Scassellati 1999] and see agents as tools to study wider psychological phenomena under tight controls, such as gender effects of gestures in human–computer interactions [Siegel et al. 2009, Feng et al. 2017]. Additionally, some properties known to be important in human interpretation of gesture, such as smoothness, shape, and timing, have been shown to transfer to gestures in robots [Bremner et al. 2009]. The need and call for collaboration is not new [Holz et al. 2009]. Some researchers have begun using generative models originally developed for virtual agents with social robots, notably Salem et al. [2010] and Le and Pelachaud [2011]. This is made possible through common frameworks such as the dominant SAIBA framework [Kopp et al. 2006], described in detail in Chapter 16 [ICMI 2012], which may be combined to create an agent-agnostic generative pipeline [Le et al. 2012]. However, work in this area needs much more exploration. Collaborations need more than experts in robotics and virtual agents; they must include professionals in interaction and aesthetic design, animation, market research, and other arts. Without a holistic team, robots continue to be designed according to physical constraints first, with behaviors, animations, and designs then forced to work within the physical constraints of the robot.


Rather than working as separate disciplines, for commercial success all aspects of a social robot or agent must be considered together for specific use cases and audiences. This is especially true for gestures, where studies of the interpretation of non-humanoid motion are academically limited but such motion is anecdotally extremely expressive. Consider Disney's many non-humanoid and non-verbal characters. In addition to actual robot characters such as Wall-E and Eve, animators use many cues both to portray the character traits of animal characters and to express a wide variety of communicative functions in non-humanoid ways. The transfer of gesture properties onto non-humanoid characters lacking humanoid gesture components (described in Section 7.2.1), both virtual and robotic, is something that seems to have been mastered by artists and storytellers but not yet rigorously harnessed by academic researchers in either robotics or virtual agents.

7.4 Current Challenges
The technology and tools for modeling and generating gestures continue to advance. Further, larger datasets are being captured and new techniques are being used to process those data, further enabling machine learning approaches. These advances will provide new power to address challenges and opportunities. Here, we discuss some of those challenges.

7.4.1 Gestures and the Context that Informs Their Use
One of the key challenges we face in realizing gestures for social agents is the complex relation of gestures to the context of the interaction and the overall structure of the discourse. As has been pointed out repeatedly by gesture researchers (e.g., Kendon [2000]), gestures, specifically their communicative function, are not simply a vivid illustration of the dialog text. For example, pragmatics concerns the context in which the interaction occurs and the impact of that context on deixis, turn-taking, the cross-utterance structure of the interaction, presuppositions, and implicature. These factors have a profound effect on gesture use. An obvious example concerns deictic gestures. Utterances such as "You should talk to Michael" or "Leave by the door on the right" may or may not co-occur with a deictic gesture. Another example is the cross-utterance use of gestural space, where one utterance can locate an abstract concept in gesture space and, in a subsequent utterance, gestures can refer back to that location so as to refer to the original concept. Another example of the extra-utterance factors impacting gestures concerns how the mental state leakage discussed above impacts gesture use and gesture performance.


Further, the roles, cultures, and relational history of the participants impact their gestures. Yet another example is when gestures are used to convey information off the record or even to contradict the content of the utterance. Broadly, a gesture can be a speech act distinct from the speech act realized by the utterance. These examples pose significant challenges to realizing rich gesturing in social agents, regardless of whether the approach is end-to-end machine learning, rule-based, or some hybrid. Fundamentally, capturing the above requires some approach to modeling or inferring this extra-utterance information. In the case of end-to-end machine learning approaches that map an utterance to gesture, the external context of the utterance, the overall structure of the interaction, the off-the-record information to convey gesturally, and arguably even the internal mental states and roles of the participants will not be apparent in the individual utterance text or prosody, making it unlikely that a mapping from utterance to gesture that takes into account just the utterance will capture the richness of human gestures. Even in the case of rule-based methods, there must be some way of modeling this information over the course of the interaction.

7.4.2 Complex Gesturing
A related challenge concerns complex gesturing. As illustrated above, gesture categories are fluid, and a single gesture often combines elements of many different categories, which are related to elements of the interaction through multiple cues. This complexity is compounded by the fact that gestures can both stand alone individually and tie together pragmatic, semantic, and rhetorical elements that span utterances. In order to use these various sources of information to gesture effectively, both for individual turns of dialog and coherently and naturally over an utterance and multiple dialog turns, researchers in gesture as well as conversational AI will need to come together to create a computationally organized model that tracks semantic, environmental, conversational, and spatial context for interactions. This underscores the tight relationship between gesture, speech, and the overarching interaction, and highlights how integrated gesture generation systems need to be with speech production and pragmatics in order for virtual agents to be as human-like as possible.

7.4.3 Role of Participants
A gesture model also needs to consider the participants themselves. In order to gesture appropriately, the social agent should take into account its conversational partner. Humans tailor gestures to the individual to whom they are speaking [de Marchena and Eigsti 2014], which can have significant effects on how the speaker is perceived [Lee et al. 1985].


This can include basic automatic responses like mirroring, but it also encompasses extremely sophisticated modeling of the user's mental state. Adjusting gestures to be smaller or slower when discussing sensitive topics, taking into account the age of the listener, or making large, pointed gestures to persuade a crowd are a few examples of acutely different circumstances in which the context must be detected, and its implications analyzed, to adjust gesture parameters [Poggi and Vincze 2008]. Crucially, this aspect of the context must affect both the selection and the production of gestures. This raises the question of how an agent infers a conversational partner's reactions. Are they, for example, being persuaded or amused by the agent's use of expressive gestures? Clearly, an agent should select a gesture that is relevant and meaningful to its communicative function and, consequently, be able to infer whether that communicative function is being realized in the human partners in the interaction. This brings up issues of detecting user engagement and inferring mental state, as well as a growing issue of concern in gesture research: cross-cultural interpretation. As the world becomes more interconnected and developers of social agents become increasingly interested in international marketplaces, gesturing in a culturally sensitive way is gaining much greater importance. This includes not only the amount or style of gesture but also deeper issues of conceptual organization and the metaphorical hierarchies that exist in different cultures (such as the "time as a line" metaphor discussed in Section 7.1.3). This means that metaphoric gestures that convey a particular meaning in one culture may carry no meaning, or even an opposite meaning, in another, which can result in critical misunderstandings between agents and users.

7.4.4 Ambiguity

On the other hand, one might well argue that human-like or “natural” behaviors may bring ambiguity. Instead of an agent conveying agitation through the dynamics of its gestures, it may be just as effective, or even more so, to display a sign over the agent saying it is agitated, or to alter the agent’s color. Specifically, some work suggests that when gestures are too complex [Saund et al. 2019], in the sense of a single gesture conveying multiple pieces of information, they become less uniformly interpreted across subjects—muddling the message an agent may attempt to convey. As the ability to produce complex gestures increases, researchers will need to consider different ways to measure tradeoffs in the performance of generative models, from speed and complexity to optimizing for user understanding. Finally, one question that remains as an overarching guiding principle is just how human-like the behavior of the agent has to be. If one subscribes to the media naturalness hypothesis, divergence from the naturalness of face-to-face interaction, broadly speaking but here specifically in terms of non-verbal behavior, can lead to an increase in cognitive effort, an increase in communication ambiguity, and a decrease in physiological arousal [Kock 2005].

7.4.5 The Application

Unquestionably, these tradeoffs will be context-dependent, and specifically application-dependent. In a social skills training application that trains doctors to break bad news to patients [Kron et al. 2017, Ochs et al. 2017], naturalness is a paramount consideration, in part because people are being trained to deal with ambiguities. In contrast, a learning application for children that seeks to increase engagement as a child learns to count may forego any attempt at naturalness. Here there are opportunities to draw on a wide range of research. There is animal and human research on supernormal stimuli that can provoke primal responses in people [Barrett 2010]. The performing arts, specifically theatre and dance, can provide more stylized and less ambiguous means of conveying information. Notably, social agent researchers [Marsella et al. 2006, Neff et al. 2008] have looked at Delsarte’s work on gesture, which heavily influenced early silent film acting, as a means of gesture selection and performance, as well as at Laban movement analysis to manipulate the animation of expressive gestures [Chi et al. 2000].

7.4.6 Impact

This discussion underscores the critical challenge of understanding and measuring the impact of gestures on human participants. One way to evaluate this impact across large demographic populations is through increasingly popular crowdsourcing platforms [Breazeal et al. 2013, Morris et al. 2014]. In addition to evaluating a social agent’s gesture performance, crowdsourcing opinions makes a combined approach possible: approaches that use crowd or expert input to create and refine generative models of dialog for a social agent [Feng et al. 2018] could be extended to gesture. Research has begun using crowd feedback in model tuning to adjust gestures according to different social and conversational contexts. By using machine learning to uncover patterns in user preference and to determine salient features in gesture motion, we may be able to increase model performance and produce gestures that are more contextually appropriate and complex than those obtained with purely top-down, expert-driven rule-based techniques or end-to-end deep learning. While this is a relatively new technique in the field of gesture generation, finding ways to seamlessly incorporate human judgements into the generation process is a promising avenue for producing natural, meaningful, and relevant gestures in social artificial agents.
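One simple way to fold crowd feedback into gesture selection is to aggregate pairwise preferences over candidate gesture variants, for example with a Bradley-Terry model, and to favor the highest-scoring variant for a given context. The sketch below is purely illustrative; the variant names and vote counts are made up, and this is only one of many possible preference-learning schemes.

```python
# Illustrative sketch: aggregate crowd preferences over gesture variants with a
# Bradley-Terry model fitted by minorize-maximize updates.
import numpy as np

variants = ["small_slow", "large_fast", "metaphoric_sweep"]
# wins[i, j] = number of times raters preferred variant i over variant j
wins = np.array([[0, 12, 7],
                 [8, 0, 5],
                 [13, 15, 0]], dtype=float)

n = wins + wins.T            # total comparisons per pair
p = np.ones(len(variants))   # Bradley-Terry strengths, initialized uniformly

for _ in range(200):
    for i in range(len(p)):
        denom = sum(n[i, j] / (p[i] + p[j]) for j in range(len(p)) if j != i)
        p[i] = wins[i].sum() / denom
    p /= p.sum()             # fix the overall scale

best = variants[int(np.argmax(p))]
print(dict(zip(variants, np.round(p, 3))), "->", best)
```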

7.5 Future Directions

While this chapter has discussed state-of-the-art implementations of gestures in social agents, there are many promising horizons for future research that will allow still better gesture performances as well as insights into the cognitive processes behind gesture production.

7.5.1 Big Data and Gesture

It is impossible to talk about the future of gesture research without addressing big data. Using neural networks to create generative models of gesture for individual speakers is a present reality. Ginosar et al. [2019] present a model, built on L1 regression and adversarial neural networks, that in many cases produces gestures nearly indistinguishable from those of the original speaker, but which is driven exclusively by audio input. This approach simplifies the inherently cognitively driven and complex nature of gesture: the model generates gesture from audio, not from communicative intent. It attempts to drive gesture behavior from smaller input spaces (e.g., prosody) because the entire space of gesture meaning does not have a neat mapping. Such a model does not, for example, handle the complexity of semantics, rhetoric, or affect (aside from how those elements are expressed in voice qualities). It could be argued that the middle layers of these networks implicitly derive other salient features. However, the gestures that result from these methods have been judged by their naturalness alongside a particular piece of audio, not by the message they communicate. This is problematic, as gestures can change the interpretation of the same audio [Jamalian and Tversky 2012, Lhommet and Marsella 2013]. Without a principled way to deal with semantics, current machine learning techniques take meaning and communicative intention out of the equation when it comes to gesture generation. The challenge, then, is to move to deep learning approaches that have the potential to generate not only extremely natural beat gestures but also more complex, nuanced, and subtle gestures.
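To give a sense of the overall shape of such audio-driven approaches (and not of Ginosar et al.’s specific architecture), the sketch below maps a sequence of audio features to a sequence of poses and trains it with an L1 regression loss only; the adversarial realism term and any semantic conditioning are omitted, and all layer sizes and data are arbitrary placeholders.

```python
# Minimal sketch of audio-driven gesture regression: audio features in, pose
# sequence out, trained with an L1 loss. Not any published architecture.
import torch
import torch.nn as nn

N_AUDIO_FEATS = 64    # e.g., log-mel bands per frame (placeholder)
N_JOINTS = 2 * 49     # e.g., 2D coordinates of 49 upper-body/hand keypoints


class Audio2Gesture(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(N_AUDIO_FEATS, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.decoder = nn.Linear(2 * hidden, N_JOINTS)

    def forward(self, audio):            # audio: (batch, frames, feats)
        h, _ = self.encoder(audio)
        return self.decoder(h)           # poses: (batch, frames, N_JOINTS)


model = Audio2Gesture()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
l1 = nn.L1Loss()

# Toy batch standing in for aligned (audio, motion-capture) training pairs.
audio = torch.randn(8, 120, N_AUDIO_FEATS)
target_poses = torch.randn(8, 120, N_JOINTS)

optimizer.zero_grad()
pred = model(audio)
loss = l1(pred, target_poses)            # regression term only
loss.backward()
optimizer.step()
```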

7.5.2 Using Gesture to Make Inferences About Cognition

Using deep learning to generate gestures, however, misses the deeper complexity of gesture research: the cognitive relationship between thought and behavior. While neural networks given sufficient data may produce extremely high-quality behavior, they shed less light on the way humans actually store, process, generate, and then transmit thoughts. For artificial social agents to be truly human in their expression, an alternative view is to assume that they must abide by the same cognitive processes and limitations as we do.2 This possibility is eloquently expressed by the theory of embodied cognition [Hostetter and Alibali 2008], which states that many features of cognition are shaped by the human experience of a physical body. This includes both high-level mental constructs (such as concepts and categories) and performance on various cognitive tasks (such as reasoning or judgment). According to this hypothesis, the organization of human thought is limited by the constraints of our body, not only neurologically but also by our mental incapacity to imagine what it would be like to exist without our body. This drives our physical metaphors, both gestural and in language, and indeed may be reflected in a hierarchy of metaphors in our own thoughts. With this in mind, it may be impossible to create a perfectly human-like gestural model for social artificial agents unless their thoughts are organized like ours. In this view, part of the goal of modeling gestures is to make inferences about our own cognition that may then be applied to social artificial agents. By demonstrating correlations between expressed thoughts and physical motions, we may uncover elements of this mental hierarchy and learn about the structure and organization of our own thoughts. These insights can propel both the field of cognitive science and human–computer social interaction.

7.5.3 “Better than Human”

One of the common assumptions in the design of virtual agents is that human appearance and behavior are the gold standard for effective face-to-face interaction. This assumption is based on several factors. The non-verbal behaviors of human–human interaction are both our evolutionary heritage and socially learned. An agent using these behaviors can therefore leverage the various deliberate inferences and automatic processes that are in play when we perceive them. Human–human interaction is also often a guiding principle informing the design of social robots. Of course, the behaviors invariably get distilled down when realized in a robot, often due to mechanical constraints. For example, subtlety in dynamics may be lost, degrees of freedom may be removed (such as not having fully functional hands), and some channels may be eliminated altogether, such as eyebrows.

2. Although it is left to context whether the goal of an agent is to be human-like, or communicatively efficient, or agreeable to talk to, etc.


Figure 7.4 Examples of interactive wearables from Behnaz Farahi. (a) Iridescence. (b) Caress of the Gaze.

The use of human–human interaction as a design goal, or even as a guiding principle, risks ignoring several factors. We are very adaptive, and in a persistent relationship we could adapt to an artificial agent’s behavior. That adaptation in turn may even help to build a stronger bond with the agent, for example, as a child acquires a shared, secret mode of interaction with it. Additionally, human non-verbal behavior is often ambiguous, and we may want to avoid that ambiguity in a particular application; the focus may instead be on the most effective way to communicate the information, where “most effective” is defined by the application’s goals. Finally, by limiting ourselves to human non-verbal modalities, we ignore the possibility of employing novel non-human modalities. For example, the work of Behnaz Farahi [2016, 2018, 2019] investigates novel modalities in interaction. Her “bio-inspired” work on the interactive installation Iridescence (Figure 7.4(a), [Farahi 2019]) draws inspiration from the gorget of the male Anna’s hummingbird, which changes color during courtship. Iridescence changes colors and makes patterns in response to an observer’s movements and facial expressions. Similarly, Caress of the Gaze is a wearable that explores how “clothing could interact with other people as a primary interface” [Farahi 2016]. It uses eye-gaze tracking technologies to respond to the observer’s gaze. Such work explores the potential of opening up new modalities in face-to-face interaction.

7.6 Summary

In this chapter we discussed the many ways that gesture enhances communication. Gesture acts as a guide for dialog, an influence on the observer, and a reflection of the speaker’s internal beliefs. We briefly summarized a long history of gesture studies, including myriad ways to classify gesture by both motion and communicative function. We discussed how these functions, combined with individual and cultural context, may reveal information about the speaker’s attitudes and mental states, as well as more complex information about an individual’s cognition. We then discussed current implementations of gestures in virtual agents. There are many ways to realize compelling gestures in social agents, but these must be centered on the communicative function of the gesture. Using frameworks that abstract implementation from communicative function allows researchers to separate the problems of gesture selection and animation. Both machine learning and rule-based techniques offer promising solutions to these difficulties but face similar challenges in terms of gesture collection and model evaluation. These models may be deployed on either virtual agents or social robots, with the latter presenting great physical challenges but offering potentially greater impact on the viewer. Abstractions over gesture architectures are necessary to foster interdisciplinary collaboration between these two closely related mediums. Despite recent advancements, gesture generation still faces many challenges, such as generating conversationally (semantically) relevant movements, incorporating complex or ambiguous gestures, and considering the role of the viewer when modulating gesture behavior. These must all be taken into consideration in order to achieve the greatest impact of gesture on an agent’s audience. New technology constantly advances techniques for studying gesture, both for data collection and for computational modeling of the physical gesture performance. In particular, supernormal stimuli offer unique avenues through which to study the impact of gesture, going beyond the possibilities of human–human studies. Additionally, collaborations in machine learning and advances in computational hardware and infrastructure make more resources available for big data and end-to-end modeling of gesture behavior. These new technologies present opportunities to understand gesture’s relationship to the semantic context in which it is produced, which will lead to new insights into human behavior, communication, and cognition.

References

M. W. Alibali and S. Goldin-Meadow. 1993. Gesture–speech mismatch and mechanisms of learning: What the hands reveal about a child’s state of mind. Cogn. Psychol. 25, 4, 468–523. DOI: https://doi.org/10.1006/cogp.1993.1012. Anki. Cozmo. https://anki.com/en-gb/cozmo.html. Autodesk, INC. Maya. https://autodesk.com/maya.


D. Barrett. 2010. Supernormal Stimuli: How Primal Urges Overran their Evolutionary Purpose. WW Norton & Company. J. B. Bavelas. 1994. Gestures as part of speech: Methodological implications. Res. Lang. Soc. Interact. 27, 3, 201–221. DOI: https://doi.org/10.1207/s15327973rlsi2703_3. T. Belpaeme, J. Kennedy, A. Ramachandran, B. Scassellati, and F. Tanaka. 2018. Social robots for education: A review. Sci. Robot. 3, 21, eaat5954. DOI: https://doi.org/10.1126/sc irobotics.aat5954. K. Bergmann and S. Kopp. 2012. Gestural alignment in natural dialogue. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 34. K. Bergmann, H. Rieser, and S. Kopp. 2011. Regulating dialogue with gestures—Towards an empirically grounded simulation with conversational agents. In Proceedings of the SIGDIAL 2011 Conference. 88–97. C. L. Breazeal. 2014. Jibo, the world’s first social robot for the home. Indiegogo. https://www.indiegogo.com/projects/jibo-the-world-s-first-socialrobot-for-the-home, checked on, 1, 22, 2019. C. Breazeal and B. Scassellati. 1999. How to build robots that make friends and influence people. In Proceedings 1999 IEEE/RSJ International Conference on Intelligent Robots and Systems. Human and Environment Friendly Robots with High Intelligence and Emotional Quotients (Cat. No. 99CH36289), Vol. 2. IEEE, 858–863. DOI: http://doi.org/10.1109/IROS. 1999.812787. C. Breazeal, C. D. Kidd, A. L. Thomaz, G. Hoffman, and M. Berlin. 2005. Effects of nonverbal communication on efficiency and robustness in human–robot teamwork. In 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 708–713. C. Breazeal, N. DePalma, J. Orkin, S. Chernova, and M. Jung. 2013. Crowdsourcing human– robot interaction: New methods and system evaluation in a public environment. J. Hum.-Robot Interact. 2, 1, 82–111. DOI: https://doi.org/10.5898/JHRI.2.1. P. Bremner, A. Pipe, C. Melhuish, M. Fraser, and S. Subramanian. 2009. Conversational gestures in human–robot interaction. In 2009 IEEE International Conference on Systems, Man and Cybernetics. IEEE, 1645–1649. A. Bryman. 2017. Quantitative and qualitative research: Further reflections on their integration. In Mixing Methods: Qualitative and Quantitative Research. Routledge, 57–78. J. K. Burgoon. 1991. Relational message interpretations of touch, conversational distance, and posture. J. Nonverbal Behav. 15, 4, 233–259. DOI: https://doi.org/10.1007/BF00986924. J. K. Burgoon and L. Aho. 1982. Three field experiments on the effects of violations of conversational distance. Commun. Monogr. 49, 2, 71–88. DOI: https://doi.org/10.1080/ 03637758209376073. G. Calbris. 1990. The Semiotics of French Gestures, Vol. 1900. Indiana University Press. G. Calbris. 1995. Anticipation du geste sur la parole. Dins Verbal/Non Verbal, Frères juneaux de la parole. Actes de la journée d’études ANEFLE. Besançon, Université de Franche-Comte, 12–18. G. Calbris. 2011. Elements of Meaning in Gesture, Vol. 5. John Benjamins Publishing.


G. Calbris, J. Montredon, and P. W. Zaü. 1986. Des gestes et des mots pour le dire. Clé International, Paris, 145. Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence. J. Cassell. 1998. A framework for gesture generation and interpretation. Computer Vision in Human–Machine Interaction. 191–215. J. Cassell, H. H. Vilhjálmsson, and T. Bickmore. 2004. BEAT: The behavior expression animation toolkit. In Life-Like Characters. Springer, 163–185. G. Castellano, S. D. Villalba, and A. Camurri. 2007. Recognising human emotions from body movement and gesture dynamics. In International Conference on Affective Computing and Intelligent Interaction. Springer, 71–82. DOI: https://doi.org/10.1007/978-3-54074889-2_7. N. E. Chafai, C. Pelachaud, and D. Pelé. 2007. A case study of gesture expressivity breaks. Lang. Resour. Eval. 41, 3–4, 341–365. DOI: https://doi.org/10.1007/s10579-007-9051-7. J. Chandler, P. Mueller, and G. Paolacci. 2014. Nonnaïveté among Amazon Mechanical Turk workers: Consequences and solutions for behavioral researchers. Behav. Res. Methods 46, 1, 112–130. DOI: http://doi.org/10.3758/s13428-013-0365-7. E. Charniak. 2000. A maximum-entropy-inspired parser. In 1st Meeting of the North American Chapter of the Association for Computational Linguistics. D. Chi, M. Costa, L. Zhao, and N. Badler. 2000. The emote model for effort and shape. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques. 173–182. DOI: https://doi.org/10.1145/344779.352172. C.-C. Chiu and S. Marsella. 2011. How to train your avatar: A data driven approach to gesture generation. In International Workshop on Intelligent Virtual Agents. Springer, 127–140. C.-C. Chiu and S. Marsella. 2014. Gesture generation with low-dimensional embeddings. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 781–788. M. Chu, A. Meyer, L. Foulkes, and S. Kita. 2014. Individual differences in frequency and saliency of speech-accompanying gestures: The role of cognitive abilities and empathy. J. Exp. Psychol. Gen. 143, 2, 694. DOI: http://dx.doi.org/10.1037/a0033861. A. J. Cienki and J.-P. Koenig. 1998. Metaphoric gestures and some of their relations to verbal metaphoric expressions. Discourse and Cognition: Bridging the Gap. 189–204. K. Cooperrider. 2014. Body-directed gestures: Pointing to the self and beyond. J. Pragmat. 71, 1–16. DOI: https://doi.org/10.1016/j.pragma.2014.07.003. S. Corera and N. Krishnarajah. 2011. Capturing hand gesture movement: A survey on tools, techniques and logical considerations. In Proceedings of Chi Sparks. A. B. de Marchena and I.-M. Eigsti. 2014. Context counts: The impact of social context on gesture rate in verbally fluent adolescents with autism spectrum disorder. Gesture 14, 3, 375–393. DOI: https://doi.org/10.1075/gest.14.3.05mar.


C. M. De Melo, L. Zheng, and J. Gratch. 2009. Expression of moral emotions in cooperating agents. In International Workshop on Intelligent Virtual Agents. Springer, 301–307. A. De Santis, B. Siciliano, A. de Luca, and A. Bicchi. 2008. An atlas of physical human–robot interaction. Mech. Mach. Theory 43, 3, 253–270. DOI: https://doi.org/10.1016/j.mechmach theory.2007.03.003. P. DiMaggio. 1997. Culture and cognition. Annu. Rev. Sociol. 23, 1, 263–287. DOI: https://doi. org/10.1146/annurev.soc.23.1.263. W. H. Dittrich, T. Troscianko, S. E. Lea, and D. Morgan. 1996. Perception of emotion from dynamic point-light displays represented in dance. Perception 25, 6, 727–738. DOI: https: //doi.org/10.1068/p250727. J. S. Downs, M. B. Holbrook, S. Sheng, and L. F. Cranor. 2010. Are your participants gaming the system? Screening Mechanical Turk workers. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2399–2402. D. Efron. 1941. Gesture and Environment. King’s Crown Press. DOI: https://doi.org/10.1177/ 000271624222000197. P. Ekman and W. V. Friesen. 1969a. Nonverbal leakage and clues to deception. Psychiatry 32, 1, 88–106. DOI: https://doi.org/10.1080/00332747.1969.11023575. P. Ekman and W. V. Friesen. 1969b. The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Nonverbal Communication, Interaction, and Gesture. 57–106. DOI: https://doi.org/10.1515/semi.1969.1.1.49. I. Embodied. Moxie. https://embodied.com/products/moxie-reservation. C. Ennis, R. McDonnell, and C. O’Sullivan. 2010. Seeing is believing: Body motion dominates in multisensory conversations. ACM Tran. Graph. (TOG) 29, 4, 1–9. DOI: https://doi. org/10.1145/1778765.1778828. F. Eyben, M. Wöllmer, and B. Schuller. 2009. OpenEAR—Introducing the Munich opensource emotion and affect recognition toolkit. In 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops. IEEE, 1–6. DOI: http://doi. org/10.1109/ACII.2009.5349350. B. Farahi. 2016. Caress of the gaze: A gaze actuated garment. In ACADIA 2016: Posthuman Frontiers, published in Proceedings of the 36th Annual Conference. USA. B. Farahi. 2018. Heart of the matter: Affective computing in fashion and architecture. In ACADIA 2018: Recalibration: Imprecision and Infidelity, Published in proceedings of the 38th Annual Conference. Mexico City, Mexico. B. Farahi. 2019. Iridescence: Bio-inspired emotive matter. In ACADIA 2019: Ubiquity and Autonomy, Published in Proceedings of the 39th Annual Conference. Austin. A. Feng, Y. Huang, M. Kallmann, and A. Shapiro. 2012. An analysis of motion blending techniques. In International Conference on Motion in Games. Springer, 232–243. D. Feng, D. C. Jeong, N. C. Krämer, L. C. Miller, and S. Marsella. 2017. “Is it just me?” Evaluating attribution of negative feedback as a function of virtual instructor’s gender and proxemics. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems. 810–818.


D. Feng, P. Sequeira, E. Carstensdottir, M. S. El-Nasr, and S. Marsella. 2018. Learning generative models of social interactions with humans-in-the-loop. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE, 509–516. Y. Ferstl and R. McDonnell. 2018. Investigating the use of recurrent motion modelling for speech gesture generation. In Proceedings of the 18th International Conference on Intelligent Virtual Agents. 93–98. DOI: https://doi.org/10.1145/3267851.3267898. Y. Ferstl, M. Neff, and R. McDonnell. 2020. Adversarial gesture generation with realistic gesture phasing. Comput. Graph. 89, 117–130. DOI: https://doi.org/10.1016/j.cag.2020. 04.007. M. Fridin and M. Belokopytov. 2014. Embodied robot versus virtual agent: Involvement of preschool children in motor task performance. Int. J. Hum.-Comput. Int. 30, 6, 459–469. DOI: https://doi.org/10.1080/10447318.2014.888500. R. W. Gibbs Jr. 2008. The Cambridge Handbook of Metaphor and Thought. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9780511816802. S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens, and J. Malik. 2019. Learning individual styles of conversational gesture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3497–3506. DOI: https://doi.org/10.1109/CVPR.2019.00361. S. Goldin-Meadow and M. W. Alibali. 2013. Gesture’s role in speaking, learning, and creating language. Annu. Rev. Psychol. 64, 257–283. DOI: https://doi.org/10.1146/annurev-psyc h-113011-143802. S. Goldin-Meadow, H. Nusbaum, S. D. Kelly, and S. Wagner. 2001. Explaining math: Gesturing lightens the load. Psychol Sci. 12, 6, 516–522. DOI: https://doi.org/10.1111/14679280.00395. J. F. Gorostiza, R. Barber, A. M. Khamis, M. Malfaz, R. Pacheco, R. Rivas, A. Corrales, E. Delgado, and M. A. Salichs. 2006. Multimodal human–robot interaction framework for a personal robot. In ROMAN 2006—The 15th IEEE International Symposium on Robot and Human Interactive Communication. IEEE, 39–44. DOI: https://doi.org/10.1109/ROMAN. 2006.314392. D. Gouaillier, V. Hugel, P. Blazevic, C. Kilner, J. Monceaux, P. Lafourcade, B. Marnier, J. Serre, and B. Maisonnier. 2009. Mechatronic design of NAO humanoid. In 2009 IEEE International Conference on Robotics and Automation. IEEE, 769–774. DOI: https://doi.org/ 10.1109/ROBOT.2009.5152516. J. Grady. 1997. Foundations of Meaning: Primary Metaphors and Primary Scenes. University of California, Berkeley. H. Gunes and M. Piccardi. 2006. A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior. In 18th International Conference on Pattern Recognition (ICPR’06), Vol. 1. IEEE, 1148–1153. A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. 2018. Social GAN: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2255–2264.


D. J. Gurney, K. J. Pine, and R. Wiseman. 2013. The gestural misinformation effect: Skewing eyewitness testimony through gesture. Am J. Psychol. 126, 3, 301–314. DOI: https://doi.org/ 10.5406/amerjpsyc.126.3.0301. U. Hadar. 1989. Two types of gesture and their role in speech production. J. Lang. Soc. Psychol. 8, 3–4, 221–228. DOI: https://doi.org/10.1177/0261927X8983004. L. M. Hiatt, A. M. Harrison, and J. G. Trafton. 2011. Accommodating human variability in human–robot teams through theory of mind. In Twenty-Second International Joint Conference on Artificial Intelligence. G. Hoffman and W. Ju. 2014. Designing robots with movement in mind. J. Hum.-Rob. Interact. 3, 1, 91–122. DOI: https://doi.org/10.5898/JHRI.3.1.Hoffman. J. Holler and G. Beattie. 2003. Pragmatic aspects of representational gestures: Do speakers use them to clarify verbal ambiguity for the listener? Gesture 3, 2, 127–154. DOI: https://doi.org/10.1075/gest.3.2.02hol. T. Holz, M. Dragone, and G. M. O’Hare. 2009. Where robots and virtual agents meet. Int. J. Soc. Robot. 1, 1, 83–93. DOI: http://dx.doi.org/10.1007/s12369-008-0002-2. A. B. Hostetter and M. W. Alibali. 2008. Visible embodiment: Gestures as simulated action. Psychon. Bull. Rev. 15, 3, 495–514. DOI: http://dx.doi.org/10.3758/pbr.15.3.495. C. J. Hutto and E. Gilbert. 2014. Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media. B.-W. Hwang, S. Kim, and S.-W. Lee. 2006. A full-body gesture database for automatic gesture recognition. In 7th International Conference on Automatic Face and Gesture Recognition (FGR06). IEEE, 243–248. DOI: https://doi.org/10.1109/FGR.2006.8. J. M. Iverson and S. Goldin-Meadow. 1997. What’s communication got to do with it? Gesture in children blind from birth. Dev. Psychol. 33, 3, 453. DOI: https://doi.org/10.1037/ 0012-1649.33.3.453. J. M. Iverson and S. Goldin-Meadow. 1998. Why people gesture when they speak. Nature 396, 6708, 228–228. DOI: https://doi.org/10.1038/24300. J. M. Iverson and S. Goldin-Meadow. 2001. The resilience of gesture in talk: Gesture in blind speakers and listeners. Dev. Sci. 4, 4, 416–422. DOI: https://doi.org/10.1111/14677687.00183. N. Jacobs and A. Garnham. 2007. The role of conversational hand gestures in a narrative task. J. Mem. Lang. 56, 2, 291–303. DOI: https://doi.org/10.1016/j.jml.2006.07.011. A. Jamalian and B. Tversky. 2012. Gestures alter thinking about time. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 34. K. Jokinen, C. Navarretta, and P. Paggio. 2008. Distinguishing the communicative functions of gestures. In International Workshop on Machine Learning for Multimodal Interaction. Springer, 38–49. DOI: https://doi.org/10.1007/978-3-540-85853-9_4. H. Joo, T. Simon, X. Li, H. Liu, L. Tan, L. Gui, S. Banerjee, T. Godisart, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh. 2017. Panoptic Studio: A massively


multiview system for social interaction capture. IEEE Trans. Pattern Anal. Mach. Intell. 41, 1, 190–204. C. Jost, V. André, B. Le Pévédic, A. Lemasson, M. Hausberger, and D. Duhaut. 2012. Ethological evaluation of human–robot interaction: Are children more efficient and motivated with computer, virtual agent or robots? In 2012 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 1368–1373. DOI: https://doi.org/10.1109/ ROBIO.2012.6491159. S. Joty, G. Carenini, and R. T. Ng. 2015. CODRA: A novel discriminative framework for rhetorical analysis. Comput. Linguist. 41, 3, 385–435. DOI: https://doi.org/10.1162/COLI_a _00226. S. D. Kelly, D. J. Barr, R. B. Church, and K. Lynch. 1999. Offering a hand to pragmatic understanding: The role of speech and gesture in comprehension and memory. J Mem. Lang. 40, 4, 577–592. DOI: https://doi.org/10.1006/JMLA.1999.2634. A. Kendon. 1997. Gesture. Annu. Rev. Anthropol. 26, 1, 109–128. DOI: https://doi.org/10.1146/ annurev.anthro.26.1.109. A. Kendon. 2000. Language and gesture: Unity or duality. Lang. Gesture 2, 47–63. DOI: https://doi.org/10.1017/CBO9780511620850.004. A. Kendon. 2004. Gesture: Visible Action as Utterance. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9780511807572. P. Khooshabeh, C. McCall, S. Gandhe, J. Gratch, and J. Blascovich. 2011. Does it matter if a computer jokes. In CHI’11 Extended Abstracts on Human Factors in Computing Systems. 77–86. DOI: https://doi.org/10.1145/1979742.1979604. M. Kipp. 2014. ANVIL: A universal video research tool. In Handbook of Corpus Phonology. 420–436. M. Kipp and J.-C. Martin. 2009. Gesture and emotion: Can basic gestural form features discriminate emotions? In 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops. IEEE, 1–8. M. Kipp, M. Neff, and I. Albrecht. 2007. An annotation scheme for conversational gestures: How to economically capture timing and form. Lang. Resour. Eval. 41, 3–4, 325–339. DOI: https://doi.org/10.1007/s10579-007-9053-5. S. Kita. 2009. Cross-cultural variation of speech-accompanying gesture: A review. Lang. Cogn. Process. 24, 2, 145–167. DOI: https://doi.org/10.1080/01690960802586188. S. Kita and A. Özyürek. 2003. What does cross-linguistic variation in semantic coordination of speech and gesture reveal?: Evidence for an interface representation of spatial thinking and speaking. J. Mem. Lang. 48, 1, 16–32. DOI: https://doi.org/10.1016/S0749-596X (02)00505-3. N. Kock. 2005. Media richness or media naturalness? The evolution of our biological communication apparatus and its influence on our behavior toward e-communication tools. IEEE Trans. Prof. Commun. 48, 2, 117–130. DOI: https://doi.org/10.1109/TPC.2005. 849649.


S. Kopp, B. Krenn, S. Marsella, A. N. Marshall, C. Pelachaud, H. Pirker, K. R. Thórisson, and H. Vilhjálmsson. 2006. Towards a common framework for multimodal generation: The behavior markup language. In International Workshop on Intelligent Virtual Agents. Springer, 205–217. DOI: https://doi.org/10.1007/11821830_17. S. Kopp, K. Bergmann, and I. Wachsmuth. 2008. Multimodal communication from multimodal thinking—towards an integrated model of speech and gesture production. Int. J. Semant. Comput. 2, 01, 115–136. DOI: https://doi.org/10.1142/S1793351X08000361. S. Kopp, H. van Welbergen, R. Yaghoubzadeh, and H. Buschmeier. 2014. An architecture for fluid real-time conversational agents: Integrating incremental output generation and input processing. J. Multimodal User Interfaces 8, 1, 97–108. DOI: https://doi.org/10.1007/ s12193-013-0130-3. E. Krahmer and M. Swerts. 2007. The effects of visual beats on prosodic prominence: Acoustic analyses, auditory perception and visual perception. J. Mem. Lang. 57, 3, 396–414. DOI: https://doi.org/10.1016/j.jml.2007.06.005. N. Krämer, S. Kopp, C. Becker-Asano, and N. Sommer. 2013. Smile and the world will smile with you—The effects of a virtual agent’s smile on users’ evaluation and behavior. Int. J. Hum. Comput. Stud. 71, 3, 335–349. DOI: https://doi.org/10.1016/j.ijhcs.2012.09.006. F. W. Kron, M. D. Fetters, M. W. Scerbo, C. B. White, M. L. Lypson, M. A. Padilla, G. A. Gliva-McConvey, L. A. Belfore II, T. West, A. M. Wallace, T. C. Guetterman, L. S. Schleicher, R. A. Kennedy, R. S. Mangrulkar, J. F. Cleary, S. C. Marsella, and D. M. Becker. 2017. Using a computer simulation for teaching communication skills: A blinded multisite mixed methods randomized controlled trial. Patient Educ Couns. 100, 4, 748–759. DOI: https://doi.org/10.1016/j.pec.2016.10.024. G. Lakoff and M. Johnson. 2008. Metaphors We Live By. University of Chicago Press. Q. Le, J. Huang, and C. Pelachaud. 2012. A common gesture and speech production framework for virtual and physical agents. In ACM International Conference on Multimodal Interaction. ICMI. 2012. Workshop on Speech and Gesture Production in Virtually and Physically Embodied Conversational Agents, October 26, 2012, SantaMonica, CA. ACM. Q. A. Le and C. Pelachaud. 2011. Generating co-speech gestures for the humanoid robot NAO through BML. In International Gesture Workshop. Springer, 228–237. DOI: https:// doi.org/10.1007/978-3-642-34182-3_21. D. Y. Lee, M. R. Uhlemann, and R. F. Haase. 1985. Counselor verbal and nonverbal responses and perceived expertness, trustworthiness, and attractiveness. J. Couns. Psychol. 32, 2, 181. DOI: https://doi.org/10.1037/0022-0167.32.2.181. K. M. Lee, Y. Jung, J. Kim, and S. R. Kim. 2006. Are physically embodied social agents better than disembodied social agents?: The effects of physical embodiment, tactile interaction, and people’s loneliness in human–robot interaction. Int. J. Hum. Comput. Stud. 64, 10, 962–973. DOI: https://doi.org/10.1016/j.ijhcs.2006.05.002. I. Leite, C. Martinho, and A. Paiva. 2013. Social robots for long-term interaction: A survey. Int. J. Soc. Robot. 5, 2, 291–308. DOI: https://doi.org/10.1007/s12369-013-0178-y.


T. Leonard and F. Cummins. 2011. The temporal relation between beat gestures and speech. Lang. Cogn. Neurosci. 26, 10, 1457–1471. DOI: https://doi.org/10.1080/01690965. 2010.500218. S. C. Levinson. 1996. Language and space. Annu. Rev. Anthropol. 25, 1, 353–382. DOI: https://doi.org/10.1146/annurev.anthro.25.1.353. E. T. Levy and D. McNeill. 1992. Speech, gesture, and discourse. Discourse Process. 15, 3, 277–301. DOI: https://doi.org/10.1080/01638539209544813. D. Leyzberg, S. Spaulding, M. Toneva, and B. Scassellati. 2012. The physical presence of a robot tutor increases cognitive learning gains. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 34. M. Lhommet and S. C. Marsella. 2013. Gesture with meaning. In International Conference on Intelligent Virtual Agents. Springer, 303–312. DOI: https://doi.org/10.1007/978-3-64240415-3_27. M. Lhommet and S. Marsella. 2014. Metaphoric gestures: Towards grounded mental spaces. In International Conference on Intelligent Virtual Agents. September. http://www. ccs.neu.edu/marsella/publications/pdf/Lhommet_IVA2014.pdf. M. Lhommet, Y. Xu, and S. Marsella. 2015. Cerebella: Automatic generation of nonverbal behavior for virtual humans. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. 4303–4304. J. Li. 2015. The benefit of being physically present: A survey of experimental works comparing copresent robots, telepresent robots and virtual agents. Int. J. Hum. Comput. Stud. 77, 23–37. DOI: https://doi.org/10.1016/j.ijhcs.2015.01.001. K. P. Lickiss and A. R. Wellens. 1978. Effects of visual accessibility and hand restraint on fluency of gesticulator and effectiveness of message. Percept. Mot. Ski. 46, 3, 925–926. DOI: https://doi.org/10.2466/pms.1978.46.3.925. P. Luo, M. Kipp, and M. Neff. 2009. Augmenting gesture animation with motion capture data to provide full-body engagement. In International Workshop on Intelligent Virtual Agents. Springer, 405–417. DOI: https://doi.org/10.1007/978-3-642-04380-2_44. R. Maatman, J. Gratch, and S. Marsella. 2005. Natural behavior of a listening agent. In International Workshop on Intelligent Virtual Agents. Springer, 25–36. DOI: https://doi.org/ 10.1007/11550617_3. F. Maricchiolo, A. Gnisci, M. Bonaiuto, and G. Ficca. 2009. Effects of different types of hand gestures in persuasive speech on receivers’ evaluations. Lang. Cogn. Neurosci. 24, 2, 239–266. DOI: https://doi.org/10.1080/01690960802159929. S. C. Marsella, S. M. Carnicke, J. Gratch, A. Okhmatovskaia, and A. Rizzo. 2006. An exploration of Delsarte’s structural acting system. In International Workshop on Intelligent Virtual Agents. Springer, 80–92. DOI: https://doi.org/10.1007/11821830_7. S. Marsella, J. Gratch, and P. Petta. 2010. Computational models of emotion. A Blueprint for Affective Computing—A Sourcebook and Manual 11, 1, 21–46. S. Marsella, Y. Xu, M. Lhommet, A. Feng, S. Scherer, and A. Shapiro. 2013. Virtual character performance from speech. In Proceedings of the 12th ACM SIGGRAPH/Eurographics


Symposium on Computer Animation, SCA ’13. ACM, New York, NY, 25–35. ISBN: 978-1-45032132-7. DOI: http://doi.acm.org/10.1145/2485895.2485900. C. McCall, D. P. Bunyan, J. N. Bailenson, J. Blascovich, and A. C. Beall. 2009. Leveraging collaborative virtual environment technology for inter-population research on persuasion in a classroom setting. PRESENCE Teleop. Virt. Environ. 18, 5, 361–369. DOI: https://doi. org/10.1162/pres.18.5.361. D. McNeill. 1985. So you think gestures are nonverbal? Psychol. Rev. 92, 3, 350. DOI: https://doi.org/10.1037/0033-295X.92.3.350. D. McNeill. 1992. Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press. D. McNeill. 2006. Gesture: A psycholinguistic approach. The Encyclopedia of Language and Linguistics. 58–66. DOI: https://doi.org/10.1016/B0-08-044854-2/00798-7. D. McNeill, J. Cassell, and E. T. Levy. 1993. Abstract deixis. Semiotica 95, 1–2, 5–20. DOI: https://doi.org/10.1515/semi.1993.95.1-2.5. D. Morris. 2015. Bodytalk: A World Guide to Gestures. Random House. R. Morris, D. McDuff, and R. Calvo. 2014. Crowdsourcing techniques for affective computing. In The Oxford Handbook of Affective Computing. Oxford University Press, 384–394. DOI: https://doi.org/10.1093/oxfordhb/9780199942237.013.003. O. Mubin and C. Bartneck. 2015. Do as I say: Exploring human response to a predictable and unpredictable robot. In Proceedings of the 2015 British HCI Conference. 110–116. DOI: https://doi.org/10.1145/2783446.2783582. K. M. Murphy. 2003. Building meaning in interaction: Rethinking gesture classifications. Crossroads of Language, Interaction, and Culture 5, 29–47. M. Neff, M. Kipp, I. Albrecht, and H.-P. Seidel. 2008. Gesture modeling and animation based on a probabilistic re-creation of speaker style. ACM Trans. Graph. (TOG) 27, 1, 1–24. DOI: https://doi.org/10.1145/1330511.1330516. M. Neff, Y. Wang, R. Abbott, and M. Walker. 2010. Evaluating the effect of gesture and language on personality perception in conversational agents. In International Conference on Intelligent Virtual Agents. Springer, 222–235. DOI: https://doi.org/10.1007/978-3-642-158926_24. V. Ng-Thow-Hing, P. Luo, and S. Okita. 2010. Synchronized gesture and speech production for humanoid robots. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 4617–4624. DOI: https://doi.org/10.1109/IROS.2010.5654322. R. Niewiadomski, E. Bevacqua, M. Mancini, and C. Pelachaud. January. 2009. Greta: An Interactive Expressive ECA System, Vol. 2. 1399–1400. DOI: https://doi.org/10.1145/1558109. 1558314. S. Nishio, K. Ogawa, Y. Kanakogi, S. Itakura, and H. Ishiguro. 2018. Do robot appearance and speech affect people’s attitude? Evaluation through the ultimatum game. In Geminoid Studies. Springer, 263–277. DOI: https://doi.org/10.1007/978-981-10-8702-8_16.


S. Nobe. 2000. Where do most spontaneous representational gestures actually occur with respect to speech. Language and Gesture 2, 186. https://doi.org/10.1017/CBO9780511 620850.012. M. A. Novack and S. Goldin-Meadow. 2017. Gesture as representational action: A paper about function. Psychon. Bull. Rev. 24, 3, 652–665. DOI: https://doi.org/10.3758/s13423-0161145-z. R. E. Núñez and E. Sweetser. 2006. With the future behind them: Convergent evidence from Aymara language and gesture in the crosslinguistic comparison of spatial construals of time. Cogn. Sci. 30, 3, 401–450. DOI: https://doi.org/10.1207/s15516709cog 0000_62. M. Ochs, G. de Montcheuil, J.-M. Pergandi, J. Saubesty, C. Pelachaud, D. Mestre, and P. Blache. 2017. An architecture of virtual patient simulation platform to train doctors to break bad news. In Conference on Computer Animation and Social Agents (CASA). ¸ S. Özçalı¸skan and S. Goldin-Meadow. 2005. Gesture is at the cutting edge of early language development. Cognition 96, 3, B101–B113. DOI: https://doi.org/10.1016/j.cognition.2005. 01.001. T. Pedersen, S. Patwardhan, and J. Michelizzi. 2004. WordNet::Similarity—Measuring the relatedness of concepts. In AAAI, Vol. 4. 25–29. J. W. Pennebaker, M. E. Francis, and R. J. Booth. 2001. Linguistic Inquiry and Word Count: LIWC 2001. Lawrence Erlbaum Associates, Mahway, NJ. 71, 2001. I. Poggi and C. Pelachaud. 2008. Persuasion and the expressivity of gestures in humans and machines. Embodied Communication in Humans and Machines. 391–424. DOI: https://doi.org/10.1093/acprof:oso/9780199231751.003.0017. I. Poggi and L. Vincze. 2008. Gesture, gaze and persuasive strategies in political discourse. In International LREC Workshop on Multimodal Corpora. Springer, 73–92. DOI: https://doi. org/10.1007/978-3-642-04793-0_5. I. Poggi, C. Pelachaud, F. de Rosis, V. Carofiglio, and B. De Carolis. 2005. Greta. A believable embodied conversational agent. In Multimodal Intelligent Information Presentation. Springer, 3–25. https://doi.org/10.1007/1-4020-3051-7_1. F. E. Pollick, H. M. Paterson, A. Bruderlin, and A. J. Sanford. 2001. Perceiving affect from arm movement. Cognition 82, 2, B51–B61. DOI: https://doi.org/10.1016/s0010-0277(01) 00147-0. G. Radden. 2003. The metaphor time as space across languages. Zeitschrift für interkulturellen Fremdsprachenunterricht, 8, 2. F. H. Rauscher, R. M. Krauss, and Y. Chen. 1996. Gesture, speech, and lexical access: The role of lexical movements in speech production. Psychol. Sci. 7, 4, 226–231. DOI: https:// doi.org/10.1111/j.1467-9280.1996.tb00364.x. D. Reidsma, I. de Kok, D. Neiberg, S. C. Pammi, B. van Straalen, K. Truong, and H. van Welbergen. 2011. Continuous interaction with a virtual human. J. Multimodal User Interfaces 4, 2, 97–118. DOI: https://doi.org/10.1007/s12193-011-0060-x.


L. Ren, A. Patrick, A. A. Efros, J. K. Hodgins, and J. M. Rehg. 2005. A data-driven approach to quantifying natural human motion. ACM Trans. Graph. (TOG) 24, 3, 1090–1097. DOI: https://doi.org/10.1145/1073204.1073316. L. D. Riek. 2014. The social co-robotics problem space: Six key challenges. Robotics Challenges and Vision (RCV2013). L. D. Riek, P. C. Paul, and P. Robinson. 2010. When my robot smiles at me: Enabling human–robot rapport via real-time head gesture mimicry. J. Multimodal User Interfaces 3, 1–2, 99–108. DOI: https://doi.org/10.1007/s12193-009-0028-2. S. Robotics. 2018. Pepper. https://www.softbankrobotics.com/emea/en/pepper. H. Robotics. 2019. Sophia. https://www.hansonrobotics.com/sophia. S. V. Rouse. 2015. A reliability analysis of Mechanical Turk data. Comput. Hum. Behav. 43, 304–307. DOI: https://doi.org/10.1016/j.chb.2014.11.004. M. Salem, S. Kopp, I. Wachsmuth, and F. Joublin. 2010. Generating robot gesture using a virtual agent framework. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 3592–3597. DOI: https://doi.org/10.1109/iros.2010.5650572. M. Salem, S. Kopp, I. Wachsmuth, K. Rohlfing, and F. Joublin. 2012. Generation and evaluation of communicative robot gesture. Int. J. Soc. Robot. 4, 2, 201–217. DOI: https://doi.org/ 10.1007/s12369-011-0124-9. M. Salem, F. Eyssel, K. Rohlfing, S. Kopp, and F. Joublin. 2013. To err is human (-like): Effects of robot gesture on perceived anthropomorphism and likability. Int. J. Soc. Robot. 5, 3, 313–323. DOI: https://doi.org/10.1007/s12369-013-0196-9. S. Satake, T. Kanda, D. F. Glas, M. Imai, H. Ishiguro, and N. Hagita. 2009. How to approach humans? Strategies for social robots to initiate interaction. In Proceedings of the 4th ACM/IEEE International Conference on Human Robot Interaction. 109–116. DOI: https://doi. org/10.1145/1514095.1514117. C. Saund, M. Roth, M. Chollet, and S. Marsella. 2019. Multiple metaphors in metaphoric gesturing. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 524–530. DOI: https://doi.org/10.1109/ACII.2019.8925435. B. Scassellati. 2002. Theory of mind for a humanoid robot. Auton. Robots 12, 1, 13–24. DOI: https://doi.org/10.1023/A:1013298507114. B. Schuller, A. Batliner, S. Steidl, and D. Seppi. 2011. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Commun. 53, 9–10, 1062–1087. DOI: https://doi.org/10.1016/j.specom.2011.01.011. S. Shigemi, A. Goswami, and P. Vadakkepat. 2019. ASIMO and humanoid robot research at Honda. In Humanoid Robotics: A Reference. Springer, 55–90. C. L. Sidner, C. Lee, and N. Lesh. 2003. The role of dialog in human robot interaction. In International Workshop on Language Understanding and Agents for Real World Interaction. M. Siegel, C. Breazeal, and M. I. Norton. 2009. Persuasive robotics: The influence of robot gender on human behavior. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2563–2568. DOI: https://doi.org/10.1109/IROS.2009.5354116.


M. Slater, A. Sadagic, M. Usoh, and R. Schroeder. 2000. Small-group behavior in a virtual and real environment: A comparative study. Presence Teleop. Virt Environ. 9, 1, 37–51. DOI: https://doi.org/10.1162/105474600566600. M. Stone, D. DeCarlo, I. Oh, C. Rodriguez, A. Stere, A. Lees, and C. Bregler. 2004. Speaking with hands: Creating animated conversational characters from recordings of human performance. ACM Trans. Graph. (TOG) 23, 3, 506–513. DOI: https://doi.org/10.1145/ 1186562.1015753. N. M. Sussman and H. M. Rosenfeld. 1982. Influence of culture, language, and sex on conversational distance. J. Pers. Soc. Psychol. 42, 1, 66–74. DOI: https://doi.org/10.1037/00223514.42.1.66. W. R. Swartout, J. Gratch, R. W. Hill Jr, E. Hovy, S. Marsella, J. Rickel, and D. Traum. 2006. Toward virtual humans. AI Magazine 27, 2, 96–96. DOI: https://doi.org/10.1609/aimag. v27i2.1883. A. Takeuchi and T. Naito. 1995. Situated facial displays: Towards social interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 450–455. DOI: https://doi.org/10.1145/223904.223965. K. Takeuchi, D. Hasegawa, S. Shirakawa, N. Kaneko, H. Sakuta, and K. Sumi. 2017. Speechto-gesture generation: A challenge in deep learning approach with bi-directional LSTM. In Proceedings of the 5th International Conference on Human Agent Interaction. 365–369. DOI: https://doi.org/10.1145/3125739.3132594. L. Talmy. 1985. Grammatical categories and the lexicon. Language Typology and Syntactic Description, Vol. 3. 57–149. S. Thellman, A. Silvervarg, A. Gulz, and T. Ziemke. 2016. Physical vs. virtual agent embodiment and effects on social interaction. In International Conference on Intelligent Virtual Agents. Springer, 412–415. X.-T. Truong and T.-D. Ngo. 2016. Dynamic social zone based mobile robot navigation for human comfortable safety in social environments. Int. J. Soc. Robot. 8, 5, 663–684. https://doi.org/10.1007/s12369-016-0352-0. S. Turchyn, I. O. Moreno, C. P. Cánovas, F. F. Steen, M. Turner, J. Valenzuela, and S. Ray. 2018. Gesture annotation with a visual search engine for multimodal communication research. In Thirty-Second AAAI Conference on Artificial Intelligence. USC Institute for Creative Technologies. SmartBody. https://smartbody.ict.usc.edu/ download2. G. Van de Perre, H.-L. Cao, A. De Beir, P. G. Esteban, D. Lefeber, and B. Vanderborght. 2018. Generic method for generating blended gestures and affective functional behaviors for social robots. Auton. Robots 42, 3, 569–580. DOI: https://doi.org/10.1007/s10514-017-9650-0. I. Wachsmuth and S. Kopp. 2001. Lifelike gesture synthesis and timing for conversational agents. In International Gesture Workshop. Springer, 120–133. J. Wainer, D. J. Feil-Seifer, D. A. Shell, and M. J. Mataric. 2007. Embodiment and human– robot interaction: A task-based perspective. In RO-MAN 2007—The 16th IEEE International Symposium on Robot and Human Interactive Communication. IEEE, 872–877. DOI: https:// doi.org/10.1109/ROMAN.2007.4415207.


A. Whiten and R. W. Byrne. 1988. The Machiavellian Intelligence Hypotheses. DOI: https://doi. org/10.1007/978-1-4419-1428-6_1048. A. D. Wilson, A. F. Bobick, and J. Cassell. 1996. Recovering the temporal structure of natural gesture. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition. IEEE, 66–71. DOI: https://doi.org/10.1109/AFGR.1996.557245. J. R. Wilson, N. Y. Lee, A. Saechao, S. Hershenson, M. Scheutz, and L. Tickle-Degnen. 2017. Hand gestures and verbal acknowledgments improve human–robot rapport. In International Conference on Social Robotics. Springer, 334–344. C. Wolff. 2015. A Psychology of Gesture. Routledge. Y. Xu, C. Pelachaud, and S. Marsella. 2014. Compound gesture generation: A model based on ideational units. In International Conference on Intelligent Virtual Agents. Springer, 477–491. Y. Yoon, W.-R. Ko, M. Jang, J. Lee, J. Kim, and G. Lee. 2019. Robots learn social skills: Endto-end learning of co-speech gesture generation for humanoid robots. In 2019 International Conference on Robotics and Automation (ICRA). IEEE, 4303–4309.

8 Multimodal Behavior Modeling for Socially Interactive Agents

Catherine Pelachaud, Carlos Busso, and Dirk Heylen

8.1 Motivation

Imagine you start a conversation with the newest socially interactive agent (SIA), a humanoid robot or life-size embodied virtual agent. It talks to you and you talk back, but the only parts moving are the lips. It does not smile when greeting you. There are no nods and shakes with its head, and when it says “look over there” it does not point with its eyes or hands. You think to yourself: “What is wrong with it?” Is it working properly? Should I restart it? Or, who made this SIA? Did the makers not read the chapter on multimodal behavior in the Handbook on Socially Interactive Agents in which studies about the role of nonverbal communication in human–human communication are discussed and how such behaviors have been implemented in several generations of SIAs? Given that these SIAs are embodied conversational agents, it is expected that they mimic as best as they can the nonverbal behaviors that humans use in their face-to-face conversations. Take the case of gaze. The eyes are the organs that—as part of our visual system—make us see the world and the people around us. We look at what attracts us or at objects that we manipulate and where vision helps us to successfully perform the action. Sometimes we might stare without looking at anything in particular; our mind lost in thought. Most of the time, however, we look because we need to pay attention to what is happening and because we need to act. We look at the world and the people around us for a reason. As do the people around us. As we can see the other looking at us, we might start wondering why we are being looked at and perhaps also


what the other person thinks is the intention behind the reason why we are looking at her/him. We look at each other in different ways. We have learned and we have been taught what intents lie behind a glance or a stare and what conventions govern interactions in particular situations. We know how somebody tries to direct our attention to something by pointing to it with her/his eyes. We look at the specific person that we are addressing when we are together in a group and know that we are being talked to when somebody talks and looks at us. From the other side, we pay attention to the person who is talking to us by looking at the person. All these behaviors help us to regulate the flow of interaction. We have also learned to see how somebody is angry at us by a fierce stare and when somebody wants to seduce us with a flirtatious look—accompanied by a slight head tilt and a squeezing of the eyes. The same kind of interpretations apply when we look into the eyes of an embodied SIA as Fukayama et al. [2002] demonstrated by showing interface agents that merely consisted of eyes that displayed different gaze patterns. By changing parameters such as the amount of gaze and the mean duration of gaze, they showed how the eyes conveyed different impressions related to liking, warmth, and potency. Based on the suggestions to link gaze patterns with turn-taking and discourse structure by Torres et al. [1997], Heylen et al. [2005] experimented with an agent displaying different gaze patterns that were designed to provide smooth interaction patterns by providing the right turn-taking cues (see Figure 8.1). Besides showing that using the appropriate patterns made a difference in task performance, they also showed an effect on the trustworthiness of the agent and the appreciation of warmth. This goes to show that simple nonverbal behaviors have multiple effects and are important to get right not only in order to be efficient and effective but also to have the agents give the desired impression in terms of emotion and interpersonal relationship as this will have an effect on how someone will want to keep being engaged with the agent.

Figure 8.1 Different “look-away” behaviors [Heylen et al. 2005].

Figure 8.2 Different gaze and facial behaviors (adapted from Heylen [2010]).

Gaze is just one of the modes of nonverbal communication that one needs to pay attention to in building SIAs, as we will see in this chapter. Facial expressions, head movements, gestures, and posture all play an important part in the impressions the agents convey and the way a person engages with them (see Figure 8.2). In this chapter, we present studies on the generation of multimodal nonverbal behaviors for SIAs. We first present some works on nonverbal communication in human–human interaction and next move to computational models. In the history section, we present a variety of topics that received attention at a particular time. We go in some depth on machine learning approaches to the generation of nonverbal behaviors, after which we conclude the chapter by discussing some current and future challenges.

8.2 Nonverbal Behavior Representation

In this section, we start by presenting different taxonomies of nonverbal behaviors as well as coding schemes that have been proposed in the literature of human and social sciences.

8.2.1 Classi cation of Nonverbal Behavior As introduced above, nonverbal behaviors convey a variety of meanings. To study them, scholars have proposed to cluster them depending on their characteristics. Several attempts have been made to classify them depending on their communicative, emotional, or pragmatic functions. Classifications may be geared toward one

262

Chapter 8 Multimodal Behavior Modeling for Socially Interactive Agents

modality (cf. the gesture taxonomy by McNeill [1992] or the facial expression taxonomy by Chovil [1991]) or encompass multimodality (e.g., Bavelas et al. [2014], Ekman [2004], or Poggi [2007]).

Ekman [2004] distinguishes between facial and body movements used as emotional or conversational signals. He refers to the latter as illustrators since these signals illustrate speech. Signals such as hand movements or brow movements often accompany a change of loudness in the voice. Deixis, the act of indicating a point in space, is also part of the illustrator cluster. Deixis can be conveyed by an extended hand gesture, a head and eye direction, or even a chin upward movement. Pictographs embed signals imitating an object (referred to as iconic by other scholars [McNeill 1992]). Another cluster gathers emblems, including emotional emblems, which are produced to replace common verbal expressions and correspond to movements whose meanings are very well known and culturally dependent (e.g., nodding instead of saying “yes”). The remaining clusters embrace manipulators (such as touching one's face), regulators (which maintain and regulate speaking turns), and emotional expressions.

McNeill [1992] focuses mainly on communicative gestures, which he classified by the type of information they convey: iconics, where the hand gesture describes a concrete object; metaphorics, for which the motion of the hand represents an abstract idea; deictics, where finger pointing designates a point in space; beats, which correspond to rhythmic movements of the hand; and emblems, which include gestures whose meanings are culturally coded.

Another example of a taxonomy has been proposed by Poggi [2007]. She characterizes behaviors based on the semantic functions they convey and considers three main categories, each of them divided into subcategories. Speakers may communicate information about the world (e.g., deictic and iconic gestures), their own identity (age, culture, personality, gender), and their mind (beliefs [certainty, etc.], goals [performative, intonational structure…], and emotions). To capture the polysemy of nonverbal behaviors, she defines each element of these categories as a pair where the first element is the meaning and the second lists the multimodal signals that convey this meaning. As such, the same signal may be present in different pairs and linked to different meanings.

As we can see from these few examples, taxonomies show similarities and differences. They differ in the type of information that is used to distinguish signals or in the granularity of the taxonomy. Moreover, before choosing a taxonomy to cluster behaviors, we would like to raise some concerns. One has to be aware of the simplification that may arise when using taxonomies. The same behavior may have different meanings depending on its context of occurrence: for example, a smile can be a sign of happiness or politeness, or serve as a backchannel;
a hand moving upward can be an iconic gesture and can also be considered a deictic gesture. Likewise, a communicative function may be expressed by various signals: to communicate refusal, one can shake the head or shake the index finger. Context is a crucial factor when interpreting a social signal, yet it is not encoded in these taxonomies. Some scholars, in particular Bavelas and Chovil [2000], have also warned against using taxonomies: a taxonomy defines classes that are mutually exclusive, which is incompatible with the polysemous character of nonverbal behaviors. Poggi [2007] takes into account the polysemy of nonverbal behaviors, but her taxonomy does not include any contextual information. Rather than defining a taxonomy, Bavelas and Chovil [2000] proposed to use the term “visible acts of meaning” and to include behaviors that are visible, communicative, and linked to the ongoing speech. They highlight the need to look at nonverbal behaviors through their communicative functions in the context of the speech.

8.2.2 Coding System

To be able to discuss a behavior, one needs to be sure that the term being used is commonly understood and that everyone attaches the same meaning to it. Let us consider the smile as an example. A smile corresponds to a well-defined facial expression, often associated with happiness. However, smiles can take very diverse forms associated with a large number of meanings. Just think of the smile of the Joker, of sportswomen and sportsmen arriving in second place, or even of politicians; they all smile, but their smiles differ greatly in their appearance and communicative functions. So, the label “smile” is confusing: it cannot be associated with a single common meaning. To analyze and describe nonverbal behaviors, it is important to have a common representation language. To this aim, several coding schemes have been proposed; some focus on one modality, others cover multiple ones. We list here the most popular ones. The different schemes differ by the modality they focus on and by their granularity.

The Facial Action Coding System (FACS) was developed by Ekman, Friesen, and Hager to describe facial expressions [Ekman et al. 2002]. The authors defined an Action Unit (AU) as the minimal muscular contraction that is visible. A facial expression corresponds to a combination of AUs. Forty-four AUs are defined, divided across the facial regions. An expression is also defined by its temporal course, where onset is the time of appearance of the expression, offset the time of its disappearance, and apex the time where the expression is at its maximum. There exist variants of FACS such as EM-FACS for facial expressions of emotion [Friesen et al. 1983] and babyFACS for the expressions of babies [Oster 2006].
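
To make the FACS notation concrete, the sketch below shows one possible machine-readable encoding of a FACS-coded expression together with its temporal course. The AU numbers and the A–E intensity scores follow the published FACS conventions, but the class and field names are our own illustration, not part of any standard annotation tool.

```python
from dataclasses import dataclass

@dataclass
class ActionUnit:
    """One FACS Action Unit with its temporal course (times in seconds)."""
    au: int          # FACS Action Unit number, e.g., 12 = lip corner puller
    intensity: str   # FACS intensity score, from "A" (trace) to "E" (maximum)
    onset: float     # time at which the AU starts to appear
    apex: float      # time at which the AU reaches its maximal contraction
    offset: float    # time at which the AU has disappeared

# A felt ("Duchenne") smile is commonly coded as AU6 (cheek raiser) + AU12 (lip corner puller).
duchenne_smile = [
    ActionUnit(au=6,  intensity="C", onset=0.2, apex=0.8, offset=2.0),
    ActionUnit(au=12, intensity="D", onset=0.1, apex=0.7, offset=2.2),
]
```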

Several annotation schemes have been designed to code bodily expressions of emotion. The Body Action Coding System (BACS) proposed by Huis In't Veld et al. [2014] follows a concept similar to FACS. Contrary to Kendon [1980]'s approach, which describes the shape and the movement that constitute a gesture, BACS describes the muscular activation, measured through electromyography (EMG), involved in a movement. The two approaches thus capture complementary information. The aim of BACS is to describe bodily expressions of emotion. Other schemes have been proposed to describe body movement through a multilevel labeling approach [Dael et al. 2012, Fourati and Pelachaud 2014]. Dael et al. [2012] developed the Body Action and Posture coding system (BAP). BAP was designed to describe the visible body movements of emotions. It describes body movement along three main levels: (1) an anatomical level working at the level of body articulations; (2) a form level encoding the direction and orientation of body movement; and (3) a functional level that corresponds to communicative and self-regulatory functions. BAP also offers mechanisms to describe temporal relationships between body parts. Fourati and Pelachaud [2014] also proposed a multilevel annotation scheme to describe body actions. Their scheme encompasses descriptions at the anatomical, directional, and posture/movement levels.

Regarding communicative gestures, to our knowledge there is no annotation scheme that is commonly used among scholars (see Chapter 7 on communicative gestures). However, a communicative gesture is often defined by the position of the wrist in 3D space, the palm orientation, and the finger shape. Kendon [1980] defined the temporal course of hand gestures along different phases: preparation, pre-stroke, stroke, post-stroke, hold, and retraction, where the stroke is the phase that carries the gesture meaning. Between consecutive gestures that are close enough temporally, hand movement may co-articulate from the last phase of the preceding gesture into the first phase of the following one. We can note that this classification and decomposition into phases can be applied to other modalities such as facial expressions and head movements. For example, the Behavior Markup Language [Kopp et al. 2006] follows these ideas (see Chapter 16 on multimodal architectures).

Lately, the touch modality has been receiving more and more attention from the SIA community (see Section 8.4.6). To our knowledge, there is no common annotation scheme for touch that is shared by various researchers. Instead, researchers have devised their own annotation schemes, which share several features. To define the conformational parameters of social touch gestures, Teyssier et al. [2020] and Atkinson et al. [2013] relied on human–human interaction studies [Hertenstein et al. 2009]. These parameters encompass the action of the touch gesture [Hertenstein et al. 2009, Atkinson et al. 2013], for example, hit, pat, stroke,
caress, or shake, to name a few. Each action can be further represented by the characteristics of the movement itself: the velocity, amplitude, and force of the touch movement. The temperature of the body part giving the touch and of the one receiving it can also be encoded. Such a scheme can also be applied to self-touch (often called adaptors by scholars [Ekman 2004]).

While the schemes we have just presented focus on one modality only, the annotation scheme MUMIN [Allwood et al. 2007] was developed to annotate multimodal communication. MUMIN has a multilayer structure that encodes low- and high-level information. The lower layers correspond to the list of signals of each modality. For example, the item facial display (face modality) encompasses signals of five facial regions: Eyes, Gaze, Eyebrows, Mouth, and Head. Each region can have different values: Eyebrows can take three possible values (Frowning, Raising, Other) and Gaze five (Toward Interlocutor, Up, Down, Sideways, Other). Communicative functions are annotated at the highest layer. Three main classes of communicative functions are considered: Feedback (Give/Elicit), Turn-Managing, and Sequencing. Each class can take many forms that are instantiated into several values. To illustrate the different levels of granularity of this annotation scheme, let us consider the function Feedback-Elicit. It can be instantiated into three sub-classes: Basic, Acceptance, and Additional Emotion/Attitude; each of these sub-classes can receive different values. Such a multimodal and multilayer coding system allows learning about the role of gesture, facial expression, head movement, and so on in conveying communicative functions. Different analysis methods, for example statistical or sequence-based ones, can be applied to build models mapping communicative functions onto multimodal behaviors. Further information on coding schemes can be found in Jokinen and Pelachaud [2013].
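
As an illustration of how such a multilayer scheme translates into data, the fragment below sketches one annotated listener response with a low-level signal layer and a high-level function layer. It is a simplified, hypothetical record in the spirit of MUMIN, not the official MUMIN file format.

```python
# Hypothetical MUMIN-style annotation of one listener response: the low-level
# layer lists the observed signals per region/modality, the high-level layer
# the communicative function they serve. Field names are ours, not MUMIN's.
annotation = {
    "time": (12.4, 13.1),                  # start and end of the behavior, in seconds
    "signals": {                           # low-level layer
        "Eyebrows": "Raising",
        "Gaze": "Toward Interlocutor",
        "Head": "Nod",
    },
    "functions": {                         # high-level layer
        "Feedback": "Give",                # as opposed to, e.g., Feedback-Elicit
        "Turn-Managing": None,
        "Sequencing": None,
    },
}
```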

8.3 Models and Approaches

We present here the broad categories of models that have been proposed so far to model IVAs' nonverbal behaviors. Over the last decades, the research community has investigated different computational approaches to add nonverbal behaviors to SIAs. The early formulations were based on carefully designed rules to communicate particular discourse meanings. Examples of these approaches include the studies of Beskow and McGlashan [1997], Cassell et al. [1994, 1999, 2001], Kopp and Wachsmuth [2002, 2004], Marsella et al. [2013], and Poggi and Pelachaud [2000]. The rules were often obtained by studying the relationships between communication channels. For example, these methods leverage what we know about the discourse functions of hand gestures [McNeill 1992], head motion [Heylen 2005, Sadoughi and Busso 2017b],
and facial displays [Chovil 1991]. The text is often analyzed to select synchronization points where nonverbal gestures can be included to emphasize or clarify the message. For example, Kopp and Wachsmuth [2004] proposed to identify prominent words or phrases, which were used to anchor specific nonverbal gestures.

Performance-based nonverbal generation is another popular approach, which concatenates human recordings that are re-purposed to create the behavior in the animation [Williams 1990, Rizzo et al. 2004, Stone et al. 2004, Kipp et al. 2007, Neff et al. 2008]. The advantage of these approaches is that the original synchronization between speech and gestures is preserved, resulting in natural renditions. Examples of this technique include the work of Arikan and Forsyth [2002] and Lee et al. [2002], where motion capture recordings were concatenated to create a novel sequence. Performance-based generation has been used for lip synchronization, where frames are carefully concatenated to match the phone sequence [Bregler et al. 1997]. The key challenge with this approach is to generalize the system to new sentences that are not in the database. It is important to create flexible approaches to combine and smooth the transitions between pasted sequences [Lee et al. 2002].

Recent studies have also proposed to incorporate nonverbal behaviors by using machine-learning approaches, learning the relationships between modalities directly from the data. For example, these approaches leverage the rich relationship between speech and gestures [Brand 1999, Graf et al. 2002, Valbonesi et al. 2002, Kettebekov et al. 2005, Busso and Narayanan 2007]. Studies to generate nonverbal behaviors have used graphical models, including hidden Markov models (HMMs) [Busso et al. 2007, Le et al. 2012], dynamic Bayesian networks (DBNs) [Mariooryad and Busso 2012, Sadoughi et al. 2017], hidden conditional random fields [Levine et al. 2010], fully parameterized hidden Markov models [Ding et al. 2013a], and hidden semi-Markov models [Bozkurt et al. 2008]. Other machine-learning approaches to generate nonverbal gestures have relied on solutions based on deep neural networks (DNNs). Examples of implementations of nonverbal behavior prediction models using DNNs include architectures with fully connected layers [Taylor et al. 2016, Parker et al. 2017, Taylor et al. 2017, Sadoughi and Busso 2018b], long short-term memory (LSTM) networks [Fan et al. 2016, Li et al. 2016, Sadoughi and Busso 2017a], generative adversarial networks (GANs) [Huang and Khan 2017, Sadoughi and Busso 2018a, Sadoughi and Busso 2020], and convolutional neural networks [Karras et al. 2017].

Some studies have proposed hybrid approaches that combine rule-based systems with machine learning models. One approach is to constrain the data-driven model to a specific gesture (e.g., head nod) or discourse function (e.g., asking questions) [Sadoughi and Busso 2015, 2019, Sadoughi et al. 2014, 2017]. These hybrid methods
have the advantage of providing novel realizations of the target gesture or behavior that can be intrinsically synchronized with other modalities (e.g., speech), preserving the meaning intended for the message. These models can operate between the behavior planning and behavior realization modules of the SAIBA architecture (for further details, see Chapter 16 presenting different agent architectures).
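
To illustrate the rule-based family in miniature, the sketch below scans an utterance for simple discourse cues and attaches behaviors, anchored on the words that triggered them. The cue lists and behavior names are invented for illustration and are far coarser than the rule sets used in the systems cited above.

```python
import re

# Illustrative rule set: discourse cue -> nonverbal behavior (names are invented).
RULES = [
    (re.compile(r"\b(no|not|never)\b", re.I),         "head_shake"),
    (re.compile(r"\b(this|that|there|here)\b", re.I), "deictic_point"),
    (re.compile(r"\?\s*$"),                           "brow_raise"),
]

def plan_behaviors(utterance: str):
    """Return a list of (word_index, behavior) pairs anchored on the triggering words."""
    plan = []
    words = utterance.split()
    for i, word in enumerate(words):
        for pattern, behavior in RULES:
            if pattern.search(word):
                plan.append((i, behavior))
    return plan

print(plan_behaviors("No, the house is not there."))
# [(0, 'head_shake'), (4, 'head_shake'), (5, 'deictic_point')]
```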

8.4 History/Overview

We now turn our attention to existing work on modeling nonverbal behaviors for SIAs. The focus of this section is primarily on IVAs' behavior models. We start from the earliest works and follow a roughly chronological order. However, studies are also grouped by the functions they target, such as the expression of emotion or the modeling of rapport. A purely historical presentation is not possible, as the same topics have been addressed over the years with very different methods.

8.4.1 Early Works

In the 1970s, the first facial models appeared [Parke 1972]. Research focused not only on modeling lip movement and co-articulation effects during speech [Parke 1975, Cohen and Massaro 1993, Beskow 1997] but also on the facial expressions of emotion [Pelachaud et al. 1996]. The first system in which two virtual agents entered into a dialogue together and showed nonverbal behaviors was GestureJack [Cassell et al. 1994] (see Figure 8.3). Gesture, facial expression, head movement, and gaze were aligned automatically with speech. Different rules were designed to drive the agents' behavior. Rules came from the human and social sciences literature. They were also extracted from corpus analyses of human interactions. They specified the type of a behavior (e.g., an iconic gesture, a backchannel), its form (such as a writing gesture or a head nod), and its timing (when it starts and how long it lasts). The GestureJack system also included a dialog model as well as a voice synthesizer. It laid the foundation of embodied conversational agents, as all the nonverbal behaviors for the agent acting as a speaker or as a listener were automatically generated. Later on, Cassell and her colleagues developed the human–agent interaction system REA [Cassell et al. 1999] (see Figure 8.4). REA stands for Real-Estate Agent. Users could converse with REA through spoken dialog. The user's gaze direction and speech were extracted in real time. To establish some form of rapport, the REA agent chatted with the user and exchanged small talk. It used multimodal behavior to communicate; for example, it could display iconic gestures to describe houses as well as interactional gestures to handle turn-taking. Turn-taking mechanisms were further developed by Thórisson [1997] in the Ymir architecture. The perception module included several modalities. Gaze behavior and deictic gestures of the

Figure 8.3 GestureJack: the first complete IVA system; two fully autonomous agents interacting together [Cassell et al. 1994].

user were detected and used to understand which object(s) the user was interested in discussing. Other agent models integrating verbal and nonverbal behaviors were proposed by the late nineties [André et al. 1998, Pelachaud and Poggi 1998, Rickel and Johnson 1999]. The agents were integrated into a system architecture consisting of multilayer components, where one component made plans and took decisions and another one instantiated these decisions as nonverbal behaviors. Communication is multimodal, and nonverbal behaviors are displayed with complex and sophisticated relations among them. Several approaches were developed at the turn of the century to capture these relations. Cassell and Stone [1999] designed a multimodal manager whose role was to supervise the distribution of behaviors across the verbal and nonverbal channels (verbal, face, gesture, head, and gaze). The Behavior Expression Animation Toolkit (BEAT) [Cassell et al. 2001] is a tool that can generate and synchronize facial expressions and gestures from

Figure 8.4 REA performs an iconic gesture.

text (see Figure 8.5). Rules were designed from human behavior studies to indicate where and which gestures could plausibly fit. The text to be spoken by the agent was further decomposed into theme and rheme, which correspond, respectively, to the known versus new information carried by the utterance [Halliday 1967]. This allows placing gestures on the rheme, that is, when the new information of the utterance is said by the agent.
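
A toy version of this placement rule is sketched below, under the simplifying assumption that the theme/rheme boundary is already known; the function and data are ours, not BEAT's.

```python
def place_gesture(words, rheme_start):
    """Anchor a gesture stroke on the first content word of the rheme (illustrative)."""
    function_words = {"a", "an", "the", "is", "are", "was", "were", "of", "to"}
    for i in range(rheme_start, len(words)):
        if words[i].lower() not in function_words:
            return {"type": "beat", "stroke_word_index": i}
    return None

# "It is" = theme (given information), "a sunny two-bedroom apartment" = rheme (new information).
words = ["It", "is", "a", "sunny", "two-bedroom", "apartment"]
print(place_gesture(words, rheme_start=2))  # {'type': 'beat', 'stroke_word_index': 3}
```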

Figure 8.5 Nonverbal behaviors computed by the BEAT system.

Poggi and Pelachaud [2000] considered some contextual information to instantiate communicative intentions into behaviors. They focused on the notion of performative, the illocutionary force of an utterance [Austin 1962], and considered as context the relationship between the speaker and the listener. They applied a meaning-to-face approach to compute the appropriate facial expression driven by semantic analysis. Deictic behaviors also received particular attention for the creation of pedagogical agents [André et al. 1996, Lester et al. 2000]. These agents would indicate objects in 2D or 3D space that were the focus of the discussion. Indicating a point in space can be done through a pointing gesture, walking to this point, gazing at it, turning one's head toward it, or even making a chin-up movement in its direction. Choosing the behaviors to display requires knowing where the object is in space in relation to the agent's position, as well as whether this object was already mentioned and is thus known to the learner or not. Lee and Marsella [2006] applied such a methodology when developing the Nonverbal Behavior Generation (NVBG) system. They annotated corpora on several levels: dialog acts (e.g., affirmation, obligation, listing), text, and behaviors (head movement, eyebrow movement, gaze, and others). These analyses allowed the authors to draw rules that mapped a dialog act onto instances of text and nonverbal behaviors [Lee and Marsella 2006]. The dialog acts are encoded with the Function Markup Language (FML) [Heylen et al. 2008] and the output of the NVBG system in the Behavior Markup Language [Vilhjámsson et al. 2007] of the Situation, Agent, Intention, Behavior, and Animation (SAIBA) architecture [Kopp et al. 2006].

8.4.1.1 Expressions of Emotions

Emotions are crucial in everyday life. Chapter 10 on emotion presents different theories in psychology that have led to computational models. The latter trigger an emotion, or a blend of emotions, that the agent ought to convey through its choice of words, prosody, facial expression, gaze, and body movement.

Static representations
At first, researchers [Pelachaud et al. 1996, Thalmann et al. 1998] relied on the literature, and in particular on the so-called “six basic” expressions of emotion that Ekman claimed to be universally recognized and produced [Ekman and Friesen 1975]. The expressions are described in terms of AUs from the FACS coding system [Ekman et al. 2002]. The MPEG-4 standard [Ostermann 2002] also supports the encoding of these expressions. Further computational models were proposed to create a greater variety of facial expressions of emotion. They followed the dimensional approach promoted by the core affect theory (see Chapter 10 for an overview of these approaches). Ruttkay et al. [2003] created
the EmotionDisc, a disc acting as an interface where expressions of emotion are spread around the neutral expression located at its center. A new expression is obtained as the bilinear interpolation of the facial expressions of two emotions that can have different intensities. This approach was extended to a 3D space, where a new expression is computed as the linear interpolation of two neighboring expressions that are already defined [Tsapatsoulis et al. 2002, Albrecht et al. 2005]. Courgeon et al. [2008] also applied interpolation between expressions of emotions defined in the 3D space of pleasure (P), arousal (A), and dominance (D) [Mehrabian 1996]. The authors made use of a joystick whose displacement between points (i.e., expressions) in the PAD space was used to compute the intermediate expressions between the points. These models view the facial expression as a whole. Other models proposed a compositional approach where a facial expression is defined as a set of regions, and a new expression is computed as the combination of expressions in its different regions. This approach was used to create blends of expressions such as the superposition of two expressions or the masking of an expression by another one. Fuzzy rules were designed to determine the blending of expressions over the facial regions [Bui 2004, Niewiadomski and Pelachaud 2007]. Arya and DiPaola [2007] used fuzzy rules on perceptually validated believable expressions of emotion. Rehm and André [2005] modeled the expression of non-felt emotions as asymmetric and lacking the reliable features defined by Ekman and Friesen [1975].

Temporal representations
The previous models describe expressions of emotion by their apex, where expressions are defined at the maximal muscular contraction. Other models view the expressions of emotion as temporal sequences of facial actions. While some of these models [Paleari and Lisetti 2006, Malatesta et al. 2009, Courgeon et al. 2014] link these sequences with appraisal checks [Scherer 2001], others do not explicitly do so [Pan et al. 2007, Niewiadomski et al. 2011, Jack et al. 2012]. The former models have implemented how facial expressions of emotion are built up based on the appraisal model proposed by Scherer and Ellgring [2007]. Facial expressions arise from the temporal evaluation of an event along a set of stimulus evaluation checks (SECs) (see Chapter 10 on emotion). Scherer [2001] has defined a mapping between the evaluation of these SECs and facial signals. The facial expression of an emotion is, thus, a sequence of dynamic facial signals. These computational models differ in how the facial signals are merged on the face; the signals can either add up on the face or follow a given temporal course. These computational models highlighted some missing information in the theoretical model (such as the intensity of facial signals or their duration). The studies by Niewiadomski et al. [2011] and Pan et al. [2007] relied on
the analysis and annotation of videos of humans expressing emotions. Pan et al. [2007] built a motion graph where the arcs of the graph correspond to motion clips and the nodes to possible transition clips. A new sequence of facial signals can be created by choosing a different path through the motion graph. After annotating facial actions using FACS, as well as head, gesture, and body motion, Niewiadomski et al. [2011] designed a set of temporal and spatial constraints. These constraints were defined to capture the relationship between the multimodal signals. The expression of an emotion is a sequence of signals that respects these constraints (see Figure 8.6). An advantage of both methods is the ability to create different sequences of behaviors while maintaining their meaning (the conveying of a given emotion), which allows the agent to display a greater variety of behaviors. More recently, Jack et al. [2012] developed a more accurate model able to capture the “High Dimensional Dynamic Information Space” [Jack and Schyns 2017]. A very high number of stimuli were created that correspond to sequences of AUs in which the temporal course of each AU and its intensity vary. Naive participants evaluated each stimulus in terms of emotion labels. Using reverse correlation methods allowed Jack and colleagues to build the mapping between the stimuli and the perception of the facial expressions of emotions. Interestingly, the authors found cultural differences in the perception of the stimuli, highlighting that parts of the face did not play the same role in the perception of the emotion across cultures [Jack et al. 2012].

Multimodal behaviors
Emotions are not expressed solely through facial expressions but also through the body, voice, and touch. Body postures and expressivity contribute to conveying emotions [Wallbott 1998, Pollick et al. 2001, Schindler et al. 2008]. Studies on emotion recognition have used different stimuli, ranging from still images of body poses [Ekman and Friesen 1967] to short videos of actors whose face has been blurred or not, or even videos of actors saying nonsense but phonetically balanced sentences [Bänziger et al. 2006]. While earlier studies reported that the face was more prominent in conveying emotional signals, more recent work found similar results in emotion recognition tasks from face and body stimuli [Kleinsmith and Bianchi-Berthouze 2012, de Gelder et al. 2015]. The first computational models of expressions of emotions for SIAs focused on facial expressions. Later on, several models were proposed for bodily expressions of emotions. To measure the impact of the face and the body, recognition tests were conducted with different stimuli using virtual agents. Stimuli could be static, an expression at its apex, or short videos [Buisine et al. 2014]. They were defined as displaying expressions of emotion solely through the face, solely through the body, or through the face and body. In the latter case, the stimuli could be either congruent, where the face and body displayed the same emotion, or incongruent, where these modalities displayed

Figure 8.6 Expression of regret as sequences of multimodal behaviors.

different emotions [Clavel et al. 2009]. Body expressivity can also characterize the expressions of emotions [Wallbott 1998, Kleinsmith and Bianchi-Berthouze 2012]. It can be defined through different features, such as the amplitude, dynamics, or
fluidity of the movements [Wallbott 1998], or using dance annotation such as Laban's [Laban and Lawrence 1974]. The Laban movement analysis (LMA) scheme considers four factors: body, effort, shape, and space. Each of these factors can have various values. For example, effort, which relates to the quality of the movement, is further defined by the parameters time, flow, weight, and space. The Expressive MOTion Engine (EMOTE) [Chi et al. 2000] implemented two of the four LMA factors, effort and shape. This model acts on the movement coordination of the body parts (e.g., the relation of the wrist to the trunk) as well as on the dynamic quality of the movement (such as velocity and acceleration) to compute the expressive motion. Likewise, Hartmann et al. [2005] relied on six motion factors proposed by Wallbott [1998]. These factors were implemented as modifiers of the trajectory of an end effector of an articulated chain, such as the wrist. For example, the wrist could change position in space (spatial parameter) and move in a faster (temporal parameter), stronger (power parameter), and more fluid (fluidity parameter) manner. Later on, this model was extended to also include a tension parameter [Huang and Pelachaud 2012].

Encoding–decoding
A distinction can also be made with respect to the encoding and decoding processes. An emotion is expressed, thus encoded, through multimodal behaviors; it is recognized, thus decoded, by the recipients who view these displayed signals. Several studies have highlighted that there may be differences between the encoding and decoding processes due to cultural or individual differences (cf. the work by Jack and Schyns [2017] presented above, and Chapter 13 on culture for SIAs for further details). Scherer and colleagues have applied a modified Brunswikian lens model to explain the different mechanisms involved in the perception and encoding of the expression of emotions [Scherer et al. 2011]. Lhommet and Marsella [2014] report the necessity for designers to take such differences into account when creating expressions of emotions for IVAs.
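
Returning to the expressivity parameters discussed above (spatial, temporal, power, fluidity), the sketch below gives a flavor of how such parameters can modulate a gesture by rescaling a wrist trajectory, in the spirit of Hartmann et al. [2005]. The formulas are a deliberately simple illustration, not the published model.

```python
import numpy as np

def apply_expressivity(trajectory, spatial=1.0, temporal=1.0, power=1.0):
    """Modulate a wrist trajectory (T x 3 array of positions over time).

    spatial  > 1 widens the movement around its mean position (larger amplitude),
    temporal > 1 speeds it up by resampling to fewer frames,
    power    > 1 exaggerates frame-to-frame changes (stronger accelerations).
    All three are simplistic stand-ins for the published parameters.
    """
    traj = np.asarray(trajectory, dtype=float)
    center = traj.mean(axis=0)
    traj = center + spatial * (traj - center)            # spatial extent

    n_out = max(2, int(round(len(traj) / temporal)))     # temporal: fewer frames = faster
    idx = np.linspace(0, len(traj) - 1, n_out)
    traj = np.array([np.interp(idx, np.arange(len(traj)), traj[:, d]) for d in range(3)]).T

    deltas = np.diff(traj, axis=0) * power                # power: exaggerate velocity
    return np.vstack([traj[:1], traj[:1] + np.cumsum(deltas, axis=0)])

# Example: a slow, small gesture made larger, faster, and more forceful.
t = np.linspace(0, 1, 50)
wrist = np.stack([0.2 * np.sin(2 * np.pi * t), 0.1 * t, np.zeros_like(t)], axis=1)
expressive = apply_expressivity(wrist, spatial=1.5, temporal=2.0, power=1.3)
```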

8.4.2 Rapport

Besides the idea that SIAs should be able to enter into a conversation and perform a well-designed task, the research community started to focus on the social dimensions of an interaction. SIAs should be able to build up a short- or long-term relationship with their users. The REA agent [Cassell et al. 1999] mentioned before was, for instance, given the task of a real-estate agent because in this job it is important to build rapport with the customers through small talk (see also Chapter 12). Research on the social aspects of interaction with virtual agents considers many different angles. Various dimensions of social interaction started to be investigated: rapport, friendship, affiliation, impression management, engagement, or
intimacy. On the basis of human–human interactions, Kang et al. [2011] implemented particular patterns of head tilts, pauses, and gaze aversion behaviors in an agent to convey more intimate self-disclosure (see Figure 8.7). Besides studies into which behaviors best express the desired relational attitude, work on these so-called “relational agents” has also looked at the appropriate computational models of mind and emotion for such agents. One of the ways in which the computational models were extended was through mechanisms that simulate the concept of “theory of mind” (a person's beliefs about what other persons are feeling and thinking), such as PsychSim [Pynadath and Marsella 2005]. FAtiMA (Fearnot AffecTIve Mind Architecture) [Dias et al. 2014] is another example of a framework modeling an agent's mental and emotional states. It has been applied to the development of a serious game for children to learn about bullying [Aylett et al. 2007]. Scenarios were designed that involved three main virtual characters: the bully, the victim, and a friend of the victim. The behaviors of the characters were driven by FAtiMA, which computed their emotional states and coping behaviors. The growing attention to relational aspects led not only to more studies on the behaviors of agents to convey interpersonal stances such as intimacy and computational models such as PsychSim, but also to implementations and studies in which the actions of the agents were more closely aligned and contingent with the verbal and nonverbal behaviors of the human interlocutor. The virtual rapport agent [Gratch et al. 2006, Huang et al. 2011] is based on the idea that the contingency of feedback is more important than its frequency [Gratch et al. 2007]. One way in which attention to contingent human–agent behavior was given form was through closed-loop interactions. By paying close attention to detecting the end of a user's turn [de Kok and Heylen 2009], the timing of the agent's response can be manipulated to mark differences in personality and attitude

Figure 8.7 Humans are more prone to disclosing information when they believe they are interacting with an autonomous IVA than with an IVA driven by a human operator.

[ter Maat et al. 2010]. Another way to have tighter interactions is to have the agent respond while it is listening to the human interlocutor, through backchannels or mimicry (see Figure 8.8). Backchannels can be both verbal (“uhuh,” “mmm”) and nonverbal: a head nod or shake, or a smile in response. Choosing the appropriate backchannel and having the SIA perform it at the right moment requires paying close attention to the human interlocutor, not only by analyzing the speech and prosody but also the nonverbal expressions of the face through computer vision. The SEMAINE project [Schroder et al. 2011] worked extensively on building listener models for agents that would be able to detect appropriate places where an agent should backchannel [de Kok et al. 2013], what strategies to employ [Poppe et al. 2010], and what the effects of variations in backchanneling strategies are on conveying specific relational behaviors or personality [Bevacqua et al. 2012]. With respect to backchannels, Buschmeier and Kopp [2014] have studied the reverse direction: how an agent can elicit feedback from a user depending on the agent's information need (see Figure 8.9).
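
A minimal sketch of such a listener model is given below: the agent monitors the speaker's voice activity and pitch, and triggers a nod when a pause follows a falling pitch contour. The thresholds and feature names are invented; the systems cited above rely on learned models over many more features.

```python
def backchannel_decision(pitch_hz, is_speaking, min_pause_frames=10):
    """Decide, frame by frame, whether to trigger a listener head nod.

    pitch_hz:    per-frame fundamental frequency of the speaker (0 when unvoiced)
    is_speaking: per-frame voice activity of the speaker (True/False)
    Returns the indices of frames where a nod would be triggered.
    """
    nods, silence = [], 0
    for i in range(1, len(pitch_hz)):
        if is_speaking[i]:
            silence = 0
            continue
        silence += 1
        # Trigger once per pause: enough silence and a falling pitch just before it.
        if silence == min_pause_frames:
            voiced = [p for p in pitch_hz[max(0, i - 40):i] if p > 0]
            if len(voiced) >= 2 and voiced[-1] < voiced[0]:   # pitch-drop heuristic
                nods.append(i)
    return nods
```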

8.4.3 Personality

People do not all behave in the same way. Their personality may affect their decisions, intentions, and behaviors. As with many complex concepts, there is not yet a definition and representation of personality that receives the consensus of the psychology community [Bergner 2020]. Several models have been proposed to characterize personality along different dimensions. We can name a few models such as OCEAN (Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism), proposed by McCrae and Costa Jr [2008], or the model proposed by Eysenck [2012] (see Chapter 18 on adaptive personality). While there is no consensus on the definition of personality, there is a consensus that behaviors can be

Figure 8.8 IVA imitating human expression during a split–steal game.

Figure 8.9 IVA eliciting feedback from the user [Buschmeier and Kopp 2014] (Copyright CC-BY Hendrik Buschmeier).

linked to different personality traits. Studies have shown how some gestures and postures can be associated with degrees of dominance (from dominant to submissive). Facial expressions, gaze behavior, and turn-taking mechanisms are also markers of specific traits. To create SIAs with different personalities, computational models have been proposed that act on the type of behaviors the SIAs display and on their expressivity. One of the first models was developed by Magnenat-Thalmann and colleagues. Kshirsagar and Magnenat-Thalmann [2002] relied on the OCEAN representation of personality. The authors link mood, emotion, and personality and model the effect of personality traits on the display of an emotion through the introduction of a layer representing the mood of the agent. The links between these
three layers, personality, mood, and emotions, are represented through a Bayesian belief network. From the same lab, Egges et al. [2003] focused on idle movements performed while not engaged in a specific task (e.g., movements done while waiting for a bus). An animation model of posture shifts based on motion capture data allows the simulation of different idle behaviors. Other models were proposed for IVAs interacting with a user. In the already mentioned SEMAINE project, four characters were created to act as sensitive artificial listeners [Schroder et al. 2011]. Each character is characterized by a personality trait and a specific emotional agenda [McRorie et al. 2011, Bevacqua et al. 2012]. In this work, personality is defined by three dimensions following Eysenck [2012], namely extraversion–introversion, neuroticism–emotional stability, and psychoticism. Each dimension is associated with different behavior types and expressivity, which were implemented using the notions of baseline and modality preference. The baseline represents the agent's tendency to perform behaviors in a certain manner [Mancini and Pelachaud 2008], for example, whether they should be displayed with large amplitude and fast, strong movements. This model was applied to IVAs acting both as speakers and as listeners.

8.4.4 Social Attitudes

Social attitudes affect how one relates to the other members of an interaction. They are often described along the two axes, dominance and liking, of the Argyle Interpersonal Circumplex [Argyle 1988]. Scherer [2005] defines social attitude as “an affective style that spontaneously develops or is strategically employed in the interaction with a person or a group of persons, coloring the interpersonal exchange in that situation.” To capture this coloring aspect, Chollet et al. [2017] and then Dermouche and Pelachaud [2016] view the expression of social attitudes as sequences of multimodal behaviors (see Figure 8.10). Both approaches rely on the analysis of a video corpus that was annotated on two main levels: the dimensions of the social attitudes and the multimodal behaviors. The authors aimed to measure which multimodal signals triggered a change in the perception of social attitudes. To this aim, they proposed computational models to extract the common features that occur during a change in perception. Sequence mining [Dermouche and Pelachaud 2016] is applied to capture not only the multimodal signals but also their temporal relationships. These models allow simulating an increase or decrease of either dominance or friendliness, or both, through nonverbal behaviors. Callejas et al. [2014] integrate verbal and nonverbal behaviors in their social attitude model. The authors implemented rules from the PERSONAGE model [Mairesse and Walker 2011] that indicate how some verbal cues, such as the use of pronouns, the choice of nouns versus verbs, and the degree of formality, are linked with the expression of personality traits. On the nonverbal side, Callejas et al. [2014] applied a

Figure 8.10 Expression of social attitudes using sequence mining.

user-centered approach where users selected, on an interactive interface, which gesture shapes and expressivity, facial expressions, and head nods would best express a social attitude for a given dialog act. From these data, they built a Bayesian model to select the multimodal behaviors with the highest probability.
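
The selection step can be illustrated with a small sketch: from user-provided annotations, estimate the probability of each candidate behavior given the intended attitude and dialog act, and pick the most probable one. This is only a schematic stand-in for the model of Callejas et al. [2014]; the labels and counts are invented.

```python
from collections import Counter

# Invented example annotations: ((attitude, dialog_act), behavior chosen by a user).
annotations = [
    (("friendly", "greet"), "smile+nod"),
    (("friendly", "greet"), "smile+nod"),
    (("friendly", "greet"), "open_palm_gesture"),
    (("dominant", "greet"), "raised_chin"),
]

counts = Counter(annotations)

def select_behavior(attitude, dialog_act):
    """Pick the behavior with the highest estimated probability for this context."""
    candidates = {beh: c for (ctx, beh), c in counts.items() if ctx == (attitude, dialog_act)}
    if not candidates:
        return None
    total = sum(candidates.values())
    return max(candidates, key=lambda b: candidates[b] / total)

print(select_behavior("friendly", "greet"))  # "smile+nod" (2 of the 3 annotations)
```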

8.4.5 Adaptation in Interaction

During an interaction, participants adapt to each other. Adaptation may arise at different levels and may take different forms, such as imitation, synchronization, or coordination. Participants may align linguistically, using similar levels of politeness, vocabulary, and grammatical structure [Pickering and Garrod 2004]. They may also imitate a body posture or head movement; they may respond to a smile with a smile, or to laughter with laughter [Burgoon et al. 2007]. Adaptation may have several functions in the interaction. It is a strong marker of rapport building, affiliation, engagement, and empathy [Lakin and Chartrand 2003]. Its role is very important in social interaction (see Chapter 18 on adaptive personality). Several models have been proposed to enhance the social capabilities of virtual agents. Bailenson and Yee [2005] studied the chameleon effect on social influence. The authors manipulated the behaviors of a virtual agent that would imitate, or not, the head movement of the human interlocutor with delays of up to 4 seconds. An agent that imitated its interlocutor's behavior was found to be more persuasive and was rated more positively than an agent that did not display imitated behaviors. Biancardi, Dermouche, and colleagues have conducted several studies to measure the influence of different adaptation mechanisms [Biancardi et al. 2019b, Dermouche and Pelachaud 2019]. They have developed an architecture where an agent converses with a user while adapting its behavior [Biancardi et al. 2019b, Mancini et al. 2019]. Three adaptation mechanisms were implemented. For two of them, the agent would dynamically adapt its conversational
strategies [Biancardi et al. 2019a] and its multimodal behaviors [Biancardi et al. 2019b] to appear warmer or more competent. Reinforcement learning approaches were used to learn which strategies and which behaviors would optimize either the user's engagement or the user's impression of the agent. The third adaptation mechanism operated at the signal level [Dermouche and Pelachaud 2019]. It took as input the behaviors of both the user and the agent over a certain time window to compute the adapted behavior of the agent. In this last model, a sequence mining model was applied. These three models were evaluated in a science museum following a similar protocol where the agent acted as a guide to a video game exhibit. Results of the studies showed that agents that can adapt to users are perceived either as more competent or as warmer. Users also had a better experience of the interaction when interacting with an agent that adapts in some form.
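
The reinforcement learning flavor of these adaptation mechanisms can be conveyed with a toy epsilon-greedy bandit that keeps choosing the conversational strategy that has produced the best engagement feedback so far. It is a deliberately reduced sketch with invented strategy names and a simulated user; the cited systems use richer state and reward definitions.

```python
import random

class StrategyBandit:
    """Epsilon-greedy selection among conversational strategies (illustrative only)."""

    def __init__(self, strategies, epsilon=0.1):
        self.epsilon = epsilon
        self.totals = {s: 0.0 for s in strategies}   # cumulative reward per strategy
        self.counts = {s: 0 for s in strategies}

    def choose(self):
        if random.random() < self.epsilon:            # explore
            return random.choice(list(self.totals))
        # exploit: highest average engagement observed so far
        return max(self.totals, key=lambda s: self.totals[s] / max(1, self.counts[s]))

    def update(self, strategy, engagement):           # engagement in [0, 1], e.g., gaze on agent
        self.totals[strategy] += engagement
        self.counts[strategy] += 1

bandit = StrategyBandit(["small_talk", "task_talk", "humor"])
for _ in range(100):
    s = bandit.choose()
    reward = {"small_talk": 0.6, "task_talk": 0.4, "humor": 0.7}[s]  # simulated user feedback
    bandit.update(s, reward)
print(bandit.choose())  # most likely "humor" after learning
```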

8.4.6 Social Touch

We see IVAs talking and gesturing to us on a screen, or we look at them through our virtual reality glasses. We can hear them talk or see them communicate verbal messages through text balloons or other graphics. IVAs can be seen and heard, but their screen-based existence does not seem to allow us to get into physical contact with them. We cannot touch them. They do not touch us. This is in contrast to the affordances that physical robots offer. In human–human communication, touch plays an important role. As we greet each other through a handshake, we establish contact but also show how we appreciate the relation. How long we shake, how much pressure we apply, the temperature of the other hand, and how dry or sweaty the hand is convey many impressions. Are we friends? Is this just formal? Are we welcome? These and more are all questions that might be answered through how we experience the handshake. A handshake might be refused at a greeting or replaced by one, two, three, or four kisses on the cheek. Or, instead of the handshake and kisses, a hug might be chosen as the way to greet. Cultural parameters, the level of intimacy, how well we are acquainted, and the situation (are we meeting friends or meeting for business, at a wedding party, or at a funeral) will all be conveyed through whether and how we shake hands, kiss, or hug. And greetings are just one type of interaction in which touch plays a role. An overview of the science of interpersonal touch can be found in Gallace and Spence [2010]. Dibiasi and Gunnoe [2004] is one of the studies on how gender and culture differences play a role in touch. The way touch communicates meaning is discussed in Jones and Yarbrough [1985], and how it communicates emotion is presented in Hertenstein et al. [2009]. The studies on touch in SIAs are almost all motivated by its potential to establish positive effects in the relationship with the human interacting with them.

A typical example is the design of the Paro robot [Wada et al. 2006], the robotic seal that responds to a person’s touch, such as gentle strokes, to foster the comfort and feeling of well-being. Also illustrative is the work by Yohanan and MacLean [2012], in which a touch-sensitive robot, the Haptic Creature, is presented. Important for a touch-sensitive robot is that it can recognize a human touch, reason about the meaning, and show a reaction. In the paper, they investigate the typical ways in which people try to communicate an emotion through touch and what type of reaction they expect back. But what about IVAs that are present only through images and sounds? How can we touch a virtual agent? Nguyen et al. [2007] explored the development of a virtual skin for their agent placed in a virtual reality environment. Through motion tracking of the hand of a person, the agent can then feel where a person is touching it and react accordingly. This touch is not complete, however, as normally touch is reciprocal and both the toucher and the touched touch each other. In order for an agent to touch us they need to become hybrids and be extended with a physical device that could conversely also serve as an input device when equipped with touch sensors to touch the IVA [Huisman et al. 2013]. One of the first studies on extending virtual agents with touching devices was aimed at seeing how a gentle squeeze of a user’s hand by an agent using an air bladder could express empathy to users in distress [Bickmore et al. 2010]. The authors pointed out that it is important to recognize the relevance of the different parameters involved in a squeeze such as the pressure, the duration, or the number of squeezes. These influence how people perceive the affect arousal of the agent. Another aspect to take into consideration is the interaction of the touch modality with the other verbal and nonverbal communication modes. In their case, they looked at the interaction of touch with facial displays and prosody and found, for instance, that facial displays dominate the perception of affect valence. Also important is the difference between people in their sensitivity to touch. In the study by Huisman et al. [2014], it was examined how a simulated social touch by an agent would influence the perceived trustworthiness, warmth, and politeness of the agent. A participant in the experiment would play either a cooperative or competitive game with two agents in augmented reality. After the game, one of the agents would touch the participant’s arm. The agent would approach the participant who would see the agent’s arm tapping on the shoulder and simultaneously a tactile display would give a tactile sensation on the participant’s arm. The study found no differences between the cooperative and the competitive condition, but the agent touching the participants was observed to score higher on warmth than the non-touching agent.

Both of these studies connected the virtual and the physical environment through physical actuators: an air bladder in the case of Bickmore et al. [2010] and vibration motors in the case of Huisman et al. [2014]. Some effort has been made to simulate touch through thermal feedback [Tewell et al. 2017] and force feedback actuators [Bailenson et al. 2007]. These devices are limited in replicating the full dimensions of human touch, which involves texture, pressure, temperature, and moistness. More involved devices that can simulate human-like touch are being engineered by several researchers, in particular as part of robotic devices. An example of this effort is the work by Teyssier et al. [2018, 2020], who among other things attempt to make artificial skin that looks and feels like human skin. The research on social touch for SIAs is still in its infancy. There are limitations in terms of the technology, and there are methodological issues because of the complex factors that are involved in touch. Willemse et al. [2017] reflect on the opportunities and limitations of touch in human–robot interaction. They point out the difficulties in mimicking human–human touch behavior because of the many variables that make it different, such as the appearance of the robot, the location where touch is applied, and differences in social context. Touch is a very intimate way to communicate and is only appropriate in particular social contexts governed by cultural and personal practices. It might take some time before robots and agents can take part in these practices.

8.4.7 Machine Learning and Data Driven

An important development in the area of generating nonverbal behaviors is the use of machine learning algorithms to learn the relation between the different modalities, for instance between gestures and speech. These models have been designed and built by relying on an important effort to collect and annotate multimodal databases in order to learn directly the complex relationship between nonverbal behaviors and speech (see Section 8.5 for a discussion of databases). During human interactions, nonverbal behaviors and speech are intrinsically synchronized to convey a message [Cassell et al. 1994]. Hand movements [McNeill 1992], head motion [Heylen 2005, Sadoughi and Busso 2017b], and facial displays [Chovil 1991] play important roles during a conversation, emphasizing a message, clarifying ambiguities, and parsing sentences. As a result, the different behaviors are highly correlated not only with speech but also among themselves [Mariooryad and Busso 2012, Ding et al. 2013a, Sadoughi and Busso 2017a]. These relationships can be learned with machine learning algorithms.

Graphical Models
Given the temporal relationship between the various modalities of nonverbal behavior, studies have used different variations of graphical models
with explicit connections between frames. An appealing approach to synthesize nonverbal behaviors is the HMM. Busso et al. [2007] clustered the head orientation space, learning models for each of the clusters. The transitions between clusters were learned using an HMM formulation. Ding et al. [2013b] explored alternative variations of HMMs to capture the relationship between speech and eyebrow motion. Hofer and Shimodaira [2007] proposed the concept of motion units for head motion, which were generated with HMMs. Le et al. [2012] modeled the kinematic features of head motion together with prosodic features using Gaussian mixture models (GMMs). Ding et al. [2013a] used a variation of HMMs to model the relation between eyebrow movements, head movements, and speech prosody. The approach incorporates contextual information into the HMM framework. This formulation directly modeled the relationship between head and eyebrow motion. Other popular graphical models are dynamic Bayesian networks (DBNs). DBNs offer the flexibility to explicitly model connections not only between speech and behaviors but also across behaviors. For example, Mariooryad and Busso [2012] proposed a speech-driven DBN that explicitly models the relation between head and eyebrow movements. These models also allow us to add explicit discourse function constraints to generate meaningful behaviors [Sadoughi et al. 2017, Sadoughi and Busso 2019] (e.g., head shakes for negations, specific gestures for questions). Another popular graphical model used to synthesize nonverbal behaviors is the conditional random field (CRF). Unlike the HMM, which is a generative model with directed graphs, the CRF is a discriminative model with undirected graphs. Levine et al. [2010] argued that prosodic features provide valuable information to derive the kinematics of the behavior rather than its shape. They proposed a gesture controller that learns the kinematics of gestures with a speech-driven model based on CRFs. Lee and Marsella [2017] considered a formulation based on latent-dynamic conditional random fields (LDCRFs). This model incorporates hidden states to capture dynamics within and across gestures. They demonstrated that this approach led to better performance for head nod and eyebrow motion than other models such as HMMs and CRFs.

Deep learning models
Recent studies have relied on deep learning solutions to generate nonverbal behaviors. Ding et al. [2014a, 2015a] proposed to synthesize head movement with fully connected DNNs, showing that this formulation led to better performance than HMMs [Ding et al. 2014a]. Similar findings were reported by Parker et al. [2017], who proposed a text-driven model implemented with DNNs to synthesize facial expressions. The approach simultaneously generated facial expressions for different emotions. The results outperformed a baseline implemented with HMMs. Taylor et al. [2016] proposed the sliding window deep
neural networks (SW-DNNs) to generate lip motion. The approach synthesized facial parameters representing lip movements driven from speech. It generated overlapping visual predictions that were later averaged. The approach was later adapted to synthesize lip movements from phoneme labels [Taylor et al. 2017]. Kucherenko et al. [2019] proposed a deep learning approach based on a denoising autoencoder (DAE) to generate hand gestures from speech. The approach creates a bottleneck representation by training an encoder and decoder for hand gesture motions. Then, it creates a mapping between speech features and the motion bottleneck representation.

Recurrent Neural Networks
Studies have also relied on LSTMs to capture the temporal relationships in the generation of nonverbal behaviors [Hasegawa et al. 2018]. Fan et al. [2016] proposed a speech-driven approach to synthesize facial parameters associated with the lower facial area. The approach relied on two layers of bidirectional long short-term memory (BLSTM) cells. While BLSTMs are not causal models, they offer improved contextual information by incorporating past and future frames, leading to better models. Ding et al. [2015b] demonstrated that BLSTM models led to better performance in generating head movements from speech than fully connected DNNs. They obtained the best performance by combining fully connected layers with BLSTM layers. Similar results were reported by Haag and Shimodaira [2016], who proposed to use the bottleneck layer of one speech-driven network as the input of a second BLSTM-based architecture that predicts head motion. The results were better than with a single DNN framework. Sadoughi and Busso [2018b] proposed a multitask learning formulation based on BLSTMs to synthesize nonverbal behaviors. The primary task was generating expressive lip motions. The secondary tasks were recognizing the phonemes and the emotion of the sentence. The LSTM layers create a shared feature representation aiming to generate realistic lip motions with the right emotion, properly synchronized with the lexical content. Li et al. [2016] also used BLSTMs to map speech features to lip movements and facial expressions. Their study explored the generation of articulatory movements related to lexical content (i.e., matching the correct phone) and emotional content (i.e., matching expressions associated with the right emotion). The BLSTM layers capture this relationship driven by acoustic features. Sadoughi and Busso [2017a] proposed a multitask formulation based on BLSTMs to create facial movements, where the lower, middle, and upper face areas were jointly predicted. This speech-driven formulation considered the interrelationship between facial expressions across different parts of the face. Yunus et al. [2019] use speech-driven recurrent networks with an attention mechanism to determine when to generate a nonverbal behavior and its corresponding stroke. The approach focused on beat movements,
which are related to the speech rhythm. The addition of an attention mechanism was intended to increase the interpretability of the network, helping to understand the relationship between the prosodic features and the decisions made by the model. Ferstl and McDonnell [2018] proposed an autoencoder implemented with gated recurrent units to model the relation between speech and behaviors. Other methods to generate gestures or facial expressions have explored temporal modeling using convolutional neural networks [Karras et al. 2017] and variational autoencoders implemented with BLSTMs [Greenwood et al. 2017a, 2017b].

Generative models
A popular generative model that has recently revolutionized several areas is the GAN. This framework has a generator and a discriminator that are adversarially trained. The generator is trained to create realizations that are similar to real examples. The discriminator has to decide whether its input is real or a fake instance created by the generator. After training, the generator is expected to create instances that the discriminator cannot distinguish from real samples. Sadoughi and Busso [2018a] proposed to use conditional generative adversarial networks to synthesize head motion, where the discriminator and generator were constrained by the acoustic features (Figure 8.11). The approach was also used to synthesize expressive lip movements [Sadoughi and Busso 2020]. Huang and Khan [2017] proposed the DyadGAN model, which is a two-stage framework based on GANs. The

Figure 8.11 Conditional generative adversarial network (GAN) for head movement driven by speech, based on Sadoughi and Busso [2018a]. The GAN model was constrained by speech, creating head movements that were temporally synchronized with speech. A similar architecture was also used to synthesize expressive lip movements, where the models were constrained not only by speech but also by the emotions [Sadoughi and Busso 2020].

first step generates facial sketches using GANs, where the facial expressions of the interlocutor are used to constrain the models. This approach leverages the mutual influence between interlocutors [Mariooryad and Busso 2013]. The second stage also uses GANs to synthesize facial expressions from the sketches. Ferstl et al. [2020] also use generative adversarial training to model the relation between speech and nonverbal behaviors. They divided the process into smaller problems aiming at correct behavior dynamics, plausible joint configurations, and diverse, smooth trajectories for nonverbal behaviors. A classifier was used to determine the phase of the behavior, addressing the behavior dynamics. The study found that an adversarial loss provides better results than conventional regression losses in mapping the non-deterministic relationship between nonverbal behaviors and speech.
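
To make the data-driven family concrete, here is a minimal speech-driven regression model in PyTorch: a bidirectional LSTM maps per-frame acoustic features to per-frame head rotations and is trained with a regression loss on paired data. It is a bare-bones sketch with random placeholder data, not a reimplementation of any of the cited systems, which add adversarial losses, multitask objectives, or discourse constraints on top of such a backbone.

```python
import torch
import torch.nn as nn

class SpeechToHeadMotion(nn.Module):
    """BLSTM regressor: acoustic features (B, T, F) -> head rotations (B, T, 3)."""

    def __init__(self, n_features=26, hidden=64):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 3)   # pitch, yaw, roll per frame

    def forward(self, x):
        out, _ = self.blstm(x)
        return self.head(out)

model = SpeechToHeadMotion()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholder data standing in for aligned (speech features, motion capture) pairs.
speech = torch.randn(8, 100, 26)    # batch of 8 clips, 100 frames, 26 acoustic features
motion = torch.randn(8, 100, 3)     # target head rotations for the same frames

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(speech), motion)
    loss.backward()
    optimizer.step()
```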

8.5 Databases

Computational models of multimodal behaviors have been relying on databases of human data. Over the years, there has been a shift in the type of databases that are used. At first, models relied on observational studies. Databases were annotated manually, as automatic tools were not yet available. Often, researchers recorded specific databases needed for their own research, and many of these early databases were not made available. Later on, databases gathering larger numbers of items were created and made accessible to the research community. These databases were mainly used for specific research purposes. We can name the JAFFE [Lyons et al. 1998] and Cohn–Kanade [Lucey et al. 2010] databases. These databases contain photos of facial expressions of emotions, mainly the six basic emotions. Their purpose was mainly facial expression recognition, but they were also used for facial expression synthesis. Later on, video corpora were used, which provide multimodal, synchronized data. The SEMAINE database [McKeown et al. 2011] is one of the few databases that contain videos of human participants interacting with IVAs, the sensitive artificial listener (SAL) agents with different personality traits [McRorie et al. 2011]. Other corpora were gathered for specific research purposes, such as the Distress Analysis Interview corpus collected by Gratch et al. [2014] or the negotiation task corpus [Gratch et al. 2016]. The GEneva Multimodal Emotion Portrayals (GEMEP) database [Bänziger et al. 2006] contains videos of 10 professional actors saying nonsense sentences with 18 affective states. The actors followed an induction technique by acting out simple scenarios. Bergmann and Kopp [2009] aimed to study iconic gesture production (see Figure 8.12(b)). To this aim, the authors gathered a corpus, called Gesture Net for Iconic Gestures (GNetIc), of participants giving directions to another person. Their gestures were captured with video and using a 3D sensor on the hands of the speakers (see Figure 8.12(a)). Rehm et al. [2007] focused on cultural

Figure 8.12

(a) Motion capture used to gather the GNetIc corpus; (b) the Max agent performing an iconic gesture.

Rehm et al. [2007] focused on cultural differences in nonverbal behaviors; to this aim, the authors recorded the Cube-G corpus of dyadic human interactions in Japan and in Germany. To have direct access to the signal data and their dynamics, motion capture can be used. Some mocap databases focus on multimodal behavior during an interaction, others on body motion quality. For the former, the IEMOCAP database recorded by Busso et al. [2008] gathered data of multiple actors interacting in dyads and displaying a variety of emotions; it contains 12 hours of mocap and video data. The MSP-AVATAR corpus [Sadoughi et al. 2015] contains 6 actors performing improvisation scenarios in dyads; its purpose is to study the role of discourse functions in view of modeling SIAs. The CMU Graphics Lab Motion Capture Database1 contains many examples of actions and behaviors executed with different emotions. Along the same lines, Fourati and Pelachaud [2016] gathered data of expressive action movements; each action is performed three times by 11 actors with eight emotions. These last two databases allow, in particular, studying body motion quality. Lately, with the development of models relying on deep learning techniques, the need for huge quantities of data has become pressing. Collecting videos or motion capture data in very large quantities requires considerable human resources and infrastructure, and can be very time consuming; it is not always feasible in research labs. Thus, many researchers have turned their attention to data from the web, such as YouTube and TEDx videos. For example, the CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI) corpus [Zadeh et al. 2016] contains a large number of movie-review videos from YouTube, made by 93 participants, giving access to a variety of speaker styles.
1. http://mocap.cs.cmu.edu/


The YouTube Gesture Database [Yoon et al. 2019] is made of TEDx videos. The authors provide tools to segment the videos into scenes that are pertinent for the study (here, hand gestures need to be clearly visible in the video segments). Ginosar et al. [2019] also aim to model communicative gestures, paying particular attention to gesture style. The authors gathered 144 hours of videos from 10 speakers, which are made available on their website2; they have also made available the code to extract, train on, and validate the data. Collecting data is not an easy task: it is time consuming and requires setting up the protocol very carefully. Data recorded in the lab rely on actors, be they professional or not, and it is much more difficult to gather naturalistic data in the lab. To this aim, induction techniques based on scenarios have been used [Bänziger et al. 2006]. Interested readers can refer to Cowie et al. [2011] on issues in collecting data, and to Cowie et al. [2011] and Jokinen and Pelachaud [2013] on issues in annotating data.
2. https://github.com/amirbar/speech2gesture
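As a toy illustration of the kind of automatic filtering mentioned above for web-video corpora (keeping only segments where the hands are clearly visible), the sketch below assumes 2D keypoints and confidence scores have already been produced by some pose estimator. The joint names, confidence threshold, and minimum segment length are invented for the example and are not taken from the cited datasets.

```python
def hands_visible(frame_keypoints, min_conf=0.5):
    """frame_keypoints: dict mapping joint name -> (x, y, confidence)."""
    hands = ["left_wrist", "right_wrist"]
    return all(frame_keypoints.get(j, (0, 0, 0.0))[2] >= min_conf for j in hands)

def usable_segments(frames, min_len=90):
    """Yield (start, end) frame ranges where both hands stay visible for at least min_len frames."""
    start = None
    for i, kp in enumerate(frames):
        if hands_visible(kp):
            start = i if start is None else start
        else:
            if start is not None and i - start >= min_len:
                yield (start, i)
            start = None
    if start is not None and len(frames) - start >= min_len:
        yield (start, len(frames))
```

Filters of this kind are cheap to run over hours of downloaded video and discard material that would otherwise add noise when training gesture models.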

8.6 Similarities and Differences in IVAs and SRs

Several studies have explored similarities and differences between virtual and physical agents. The 3D animation models of IVAs and the mechanical models of SRs have an impact on their behavioral capacities. Moreover, the more realistic SIAs are, the higher the expectation that they behave like humans; this phenomenon is referred to as the uncanny valley [Mori et al. 2012]. Studies have been conducted to understand the impact of appearance, of the degree of realism in rendering, and of behavior on users' perception. The chapter on appearance (Chapter 4) presents in detail studies on rendering styles on a 2D screen and in virtual reality; we can name the works by Rachel McDonnell's group and Bilge Mutlu's group. Some research has been conducted to reach a high degree of realism in physical robots. The work by Hiroshi Ishiguro and his colleagues has achieved impressive results; these researchers do not compare their robots with virtual agents but with humans [MacDorman and Ishiguro 2006]. The ERICA robot is another example of a human-like robot that has been involved in comparison studies [Inoue et al. 2020]. In contrast, some studies have looked at minimalist representations of agents (e.g., just eyes or a microphone) and examined how such minimalist representations can act and be perceived as social agents in an interaction [Tennent et al. 2019]. Other studies, reported in the Introduction chapter, compared the physical and virtual representations of robots with respect to their impact on task performance and users' perception [Deng et al. 2019]. We can name one line of research that merges both virtual and physical representations: Furhat [Al Moubayed et al. 2012], a robot head with a projected


virtual face, is one such example (Figure 8.13). It can display subtle facial expressions and gaze in the 3D world. Currently, it is just a head with a face; it has no body and cannot perform hand gestures.

Figure 8.13

First and second generation of Furhat robots.

Virtual and physical SIAs may share similar decision and emotion models, and intention and behavior planners in the SAIBA sense. The previous sections of this chapter presented computational models of behaviors that can be implemented in IVAs or in SRs. Their behavior realization varies due to their different embodiment capacities. However, similarities and differences between socially interactive agents with virtual or physical embodiment may also arise from other directions:

- expressivity capacity, which arises from the different degrees of freedom in the face and the body, the dynamism of movements, and the sense of gravity;

- available modalities, such as the possibility to touch and to be touched, or to perform facial expressions or not;

- a priori of the SIA, since people's a priori conception of a VA or of an SR may affect how they perceive and interact with a SIA; as such, the choice of nonverbal behaviors for a SIA may need to take this a priori into consideration;

- spatial and social presence, which also affect the perception of personal space and proxemic distance, thus influencing the proxemics and gaze behaviors of SIAs;

- exploration capability, which allows moving and gazing around; SRs have a much greater capability for exploring their surroundings than an IVA displayed on a 2D screen;

- availability, depending on the type of display, be it a large screen, a mobile device, a tiny device to be held in one hand, or a human-size robot; the choice of a behavior may need to be adapted accordingly, since the visibility of a behavior is affected when it is displayed on a large or a tiny screen, and, similarly, large arm movements executed by human-size robots or by tiny robots will not have the same impact on the user's perception;

- degree of realism of the virtual or robotic appearance, at the level of morphology, rendering of the skin, and artifacts (hair, eyelids, etc.), which can have an impact on the perception of the SIAs.

8.7 Current and Future Challenges

In the previous sections we have presented the large advances made in generating multimodal behaviors. However, much more remains to be done to obtain natural and expressive behaviors for SIAs. We list here a few challenges; this is not an exhaustive list. Other challenges are tied to modeling emotion, adaptive personality, appearance, long-term interaction, and foreseen applications, which are presented in other chapters of this book. It would be highly desirable to tackle all challenges in a holistic manner; here, however, we focus on challenges in generating multimodal expressive behaviors, which already represent difficult tasks to tackle.

8.7.1 Automatic Generation of Multimodal Behaviors

The algorithms that generate behaviors need to be designed by considering the intention of the message. The gestures need to respond to the communicative functions at play during an interaction. A key challenge is to automatically generate gestures that carry meaning and can be easily integrated with the other modules of a socially interactive agent framework. The most powerful generative models today are adversarial networks, including variations of GANs. The straightforward approach to constrain these models to specific communicative functions is to add the constraints as inputs of the generator and the discriminator. This approach may not scale well when multiple constraints are needed (e.g., hand gestures of a sad SIA while asking a question). As databases contain only partial information for training these types of systems, the training strategies of these algorithms will have to cope with partial information, leveraging the knowledge extracted from multiple databases.
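As a concrete illustration of adding constraints as inputs, the sketch below simply concatenates one-hot discourse-function and emotion labels to the acoustic features at every frame, producing the conditioned input that a generator and a discriminator would receive. The label sets, dimensions, and names are illustrative assumptions; as noted above, real corpora only cover such constraint combinations partially.

```python
import torch

DISCOURSE = ["question", "negation", "confirmation", "uncertainty"]
EMOTIONS = ["neutral", "happy", "sad", "angry"]

def constraint_vector(discourse, emotion):
    """One-hot encode a discourse function and an emotion into a single constraint vector."""
    v = torch.zeros(len(DISCOURSE) + len(EMOTIONS))
    v[DISCOURSE.index(discourse)] = 1.0
    v[len(DISCOURSE) + EMOTIONS.index(emotion)] = 1.0
    return v

def condition(audio_feats, discourse, emotion):
    """Append the constraint vector to every frame of a (frames, n_audio) acoustic feature matrix."""
    c = constraint_vector(discourse, emotion).expand(audio_feats.shape[0], -1)
    return torch.cat([audio_feats, c], dim=-1)

# A sad question: the same conditioned sequence would be fed to both the generator
# and the discriminator of a conditional GAN.
conditioned = condition(torch.randn(120, 26), "question", "sad")
print(conditioned.shape)  # torch.Size([120, 34])
```

The obvious limitation, and the reason this does not scale gracefully, is that every new constraint grows the input and multiplies the label combinations that the training data must cover.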

8.7.2 Databases

Another challenge, related to creating gestures with a clear meaning, is to design databases with clear annotations of discourse functions.


The MSP-AVATAR corpus [Sadoughi et al. 2015] was collected explicitly for this purpose, providing annotations for a broad range of discourse functions (e.g., contrasting, confirmation, negation, questioning, uncertainty, suggestion, giving orders, and warning). Increasing the number of databases with these labels can serve as a starting point to train deep learning models that are constrained by appropriate discourse functions. Some of the relevant communicative functions needed to generate meaningful nonverbal behaviors can be directly obtained with advances in natural language processing and automatic speech recognition, without requiring manual annotations; this approach can facilitate the collection of large-scale databases. For example, databases collected from video-sharing websites can be very effective in training algorithms for nonverbal behaviors [Zadeh et al. 2018, Vidal et al. 2020].

Social signals

Most of the works presented above have focused on communicative gestures, facial expressions, and gaze behaviors. However, there exists a large set of signals that are fully part of social interaction but that have barely been considered so far. We can name laughter, yawning, cries, hesitations, sighs, etc. These signals can carry a variety of communicative functions and be linked to social stances [Scott et al. 2014, Curran et al. 2018]. Mazzocconi et al. [2020] have proposed a taxonomy highlighting their propositional content. Considering these signals requires understanding their communicative functions and also simulating their animation. Several attempts exist regarding laughter [El Haddad et al. 2016, Ding et al. 2017]; however, these works focused on hilarious laughter only. Modeling laughter animation is quite complex: laughter involves synchronized torso and head movements and a large variety of facial expressions, and breathing and inhalation are also an important part of laughter. Speech laughter is a whole other issue; to our knowledge, few if any studies have investigated lip movements during simultaneous speech and laughter at the acoustic level (see Chapter 6 on speech synthesis and Chapter 20 on platforms and tools). Another issue is to understand where to place a given laugh. How do we respond to laughter? What triggers laughter? Which type of laughter should we produce? Should it carry specific content? Should it be an answer to the interlocutor's behavior? Should it arise from contagion? These are some of the questions to answer in order to endow agents with laughing capabilities. Producing signals such as laughter, sighs, and cries also requires taking into account their potential impact on the interaction. One needs to understand how these signals, produced by SIAs in a given social context, affect the perception of the SIA and of the interaction. Preliminary studies on smiles and hilarious laughter [Ochs and Pelachaud 2013, Ding et al. 2014b] have shown how human users changed their perception of the SIAs depending on the signals the SIAs produced and when they displayed them. These signals also modulate how users perceive what the SIAs are saying.
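Returning to the idea, mentioned at the start of this subsection, of deriving discourse-function labels automatically from transcripts, the following is a deliberately naive toy illustration. A real pipeline would rely on proper NLP models over ASR output; the keyword lists and label names here are invented for the example.

```python
# Coarse discourse-function labels from a transcript, using naive keyword rules.
NEGATION_WORDS = {"no", "not", "never"}
UNCERTAINTY_WORDS = {"maybe", "perhaps", "possibly", "might"}

def label_utterance(text):
    tokens = text.lower().split()
    labels = set()
    if text.strip().endswith("?"):
        labels.add("question")
    if any(t in NEGATION_WORDS or t.endswith("n't") for t in tokens):
        labels.add("negation")
    if any(t in UNCERTAINTY_WORDS for t in tokens):
        labels.add("uncertainty")
    return labels or {"statement"}

print(label_utterance("Maybe we shouldn't do that?"))
# -> question, negation, and uncertainty (set order may vary)
```

Even labels this coarse, applied to large web corpora, could provide the weak supervision that conditional generation models need when manual annotation is not feasible.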


8.7.3 Going Beyond Generic Models to Simulate SIAs with an Identity

Defining what makes an individual is extremely complex. It involves intricate patterns, as so many features are interwoven. We can name some that shape an individual: socio-cultural background, previous personal history, personality traits, emotional tendencies, social attitudes, interpersonal relationships, competences, knowledge, beliefs, etc. At the behavior level, an individual may have a specific style, a given behavior expressivity, or idiosyncratic behaviors. The list of features that make a person unique could go on; it is huge and very diverse. In this chapter, we have presented several approaches that model one or more of these features. These models make it possible to simulate SIAs with some specificities, but not yet with an identity. In early works, Hayes-Roth and Doyle [1998] created a backstory for the synthetic actors they placed in an application. The backstory would define the role of the agents and their personality profile, but also their family life, hobbies, jobs, and so on. The characters would act based on this background information, which offered motivations for their acting. Noot and Ruttkay [2005] proposed a representation language, GESTYLE, to capture the behavior "style" of a person. Tags range from culture to profession, and dictionaries were created to map these tags into behaviors. These works were a first step toward creating agents with identities. However, they tend to create stereotypes, and they model neither how the different features characterizing an identity influence each other nor the processes involved in computing motivations and deciding which actions to perform. Defining SIAs with mental and affective states [Pynadath and Marsella 2005, Marsella and Gratch 2009] also gives coherence to SIAs' decisions, actions, and emotional states, which are the premises for giving SIAs a sense of identity. In most cases, SIAs are defined by their role in the interaction. Most of them are young adults. Their cultural background is often not well specified; even though there are studies on modeling cultural agents, this is still not the case for the majority of agents (see Chapter 13 on culture). SIAs may have a specific appearance but do not correspond to a specific person; one cannot attach an identity to them. Much more research needs to be done to simulate an identity for SIAs that is reflected in their nonverbal behavior.
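The dictionary-based idea behind style languages such as GESTYLE can be sketched very simply: style tags select behavior mappings, and later tags can override earlier ones. The tag names and behavior parameters below are invented for illustration only and are not taken from Noot and Ruttkay [2005]; they also make visible why such mappings tend to produce stereotypes.

```python
# Hypothetical style dictionaries: each tag contributes behavior parameters.
STYLE_DICTIONARIES = {
    "culture:italian": {"beat_gesture_rate": "high", "gesture_space": "wide"},
    "profession:teacher": {"deictic_gestures": "frequent", "gaze_at_listener": "long"},
    "personality:introvert": {"gesture_space": "narrow", "beat_gesture_rate": "low"},
}

def resolve_style(tags):
    """Merge the behavior dictionaries of the given tags; later tags override earlier ones."""
    style = {}
    for tag in tags:
        style.update(STYLE_DICTIONARIES.get(tag, {}))
    return style

print(resolve_style(["culture:italian", "personality:introvert"]))
# -> gesture space narrowed and beat rate lowered by the later 'introvert' tag
```

A fixed lookup of this kind captures surface regularities of a group or role, but, as noted above, it does not model how the features interact or how motivations and action choices emerge from them.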

8.8 Summary and Conclusion

Endowed with virtual or physical eyes, mouths, hands, and arms, SIAs that do not use them to communicate nonverbally in ways similar to humans will raise eyebrows with their human interlocutors. In this chapter we have tried to provide an overview of studies on generating nonverbal expressions for SIAs. We started by introducing concepts from the social sciences and humanities that formed


the basis of the computational models and approaches. Interestingly, there has also been a line of research in the social sciences using VAs to study aspects of how humans communicate nonverbally [Jack and Schyns 2017, de Gelder et al. 2018]. In the first phase, the behaviors of SIAs were informed by studies and theories on nonverbal behavior in humans, or by analyses of small corpora of recordings of interacting humans. This led to rule-based systems that decided when to perform which nonverbal behavior aligned with speech. Currently, larger corpora are used by machine learning approaches for data-driven generation of nonverbal behaviors. When implementing nonverbal behaviors for SIAs, we need to pay attention not only to the quality of realization in terms of, for instance, expressivity, but also to their timing, be it in combination with the other expressive modalities they make use of, such as speech, or in closed-loop interaction with the human interlocutor when providing feedback in the form of backchannels. Nonverbal communication takes many forms, and over the course of 20+ years eye movements, facial expressions, hand and arm gestures, and posture have all received elaborate attention, with studies on both the repertoire of behaviors that SIAs should be capable of performing and the various functions they serve, from visual prosody, such as head nods used to emphasize part of the speech, to expressions of emotions and interpersonal stance. This chapter focuses to a large extent on IVAs. For a long time, the social robotics community and the virtual agents community went their separate ways, but as more researchers have become active in both fields, the studies carried out in one field have also become known in the other. In the area of virtual agents, there is a history of collaboration between research labs, resulting in a common language to talk about nonverbal behavior, the Behavior Markup Language being one example. With the fields of robotics and agents starting to talk to each other, such collaborations will grow and the challenges will be met together. So, when you start talking to the newest humanoid robot and it does not smile in greeting or show any of the other nonverbal behaviors that you expect, it is unlikely that the makers did not read the chapter on Multimodal Behavior in the Handbook on Socially Interactive Agents; it is more likely that a simple reboot will do the trick.

References

S. Al Moubayed, J. Beskow, G. Skantze, and B. Granström. 2012. Furhat: A back-projected human-like robot head for multiparty human–machine interaction. In Cognitive Behavioural Systems. Springer, 114–130. DOI: https://doi.org/10.1007/978-3-642-34584-5_9.


I. Albrecht, M. Schröder, J. Haber, and H. Seidel. August. 2005. Mixed feelings—Expression of non-basic emotions in a muscle-based talking head. Virtual Real. 8, 4, 201–212. DOI: https://doi.org/10.1007/s10055-005-0153-5. J. Allwood, L. Cerrato, K. Jokinen, C. Navarretta, and P. Paggio. 2007. The MUMIN coding scheme for the annotation of feedback, turn management and sequencing phenomena. Lang. Resour. Eval. 41, 3–4, 273–287. DOI: https://doi.org/10.1007/s10579-007-9061-5. E. André, J. Müller, and T. Rist. 1996. The PPP persona: A multipurpose animated presentation agent. In Proceedings of the Workshop on Advanced Visual Interfaces. 245–247. DOI: https://doi.org/10.1145/948449.948486. E. André, T. Rist, and J. Mueller. 1998. Integrating reactive and scripted behaviors in a life-like presentation agent. In Proceedings of the Second International Conference on Autonomous Agents. 261–268. DOI: https://doi.org/10.1145/280765.280842. M. Argyle. 1988. Bodily Communication. (2nd. ed.). Methuen & Co., London. DOI: https:// doi.org/10.4324/9780203753835. O. Arikan and D. Forsyth. July. 2002. Interactive motion generation from examples. ACM Trans. Graph. 21, 3, 483–490. DOI: https://doi.org/10.1145/566654.566606. A. Arya and S. DiPaola. 2007. Multispace behavioral model for face-based affective social agents. EURASIP J. Image Video Proc. 2007, 1–12. DOI: https://doi.org/10.1155/2007/48757. D. Atkinson, P. Orzechowski, B. Petreca, N. Bianchi-Berthouze, P. Watkins, S. Baurley, S. Padilla, and M. Chantler. 2013. Tactile perceptions of digital textiles: A design research approach. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1669–1678. DOI: https://doi.org/10.1145/2470654.2466221. J. Austin. 1962. How to Do Things with Words. Oxford University Press, London. R. Aylett, M. Vala, P. Sequeira, and A. Paiva. 2007. FearNot!—An emergent narrative approach to virtual dramas for anti-bullying education. In International Conference on Virtual Storytelling. 202–205. Springer. DOI: https://doi.org/10.1007/978-3-540-77039-8_19. J. N. Bailenson and N. Yee. 2005. Digital chameleons: Automatic assimilation of nonverbal gestures in immersive virtual environments. Psychol Sci. 16, 10, 814–819. DOI: https://doi. org/10.1111/j.1467-9280.2005.01619.x. J. N. Bailenson, N. Yee, S. Brave, D. Merget, and D. Koslow. 2007. Virtual interpersonal touch: Expressing and recognizing emotions through haptic devices. Hum. Comput. Interact. 22, 3, 325–353. DOI: https://doi.org/10.1080/07370020701493509. T. Bänziger, H. Pirker, and K. Scherer. May. 2006. GEMEP—Geneva multimodal emotion portrayals: A corpus for the study of multimodal emotional expressions. In First International Workshop on Emotion: Corpora for Research on Emotion and Affect (International Conference on Language Resources and Evaluation (LREC 2006)). Genoa, Italy, 15–19. J. Bavelas, J. Gerwing, and S. Healing. 2014. Hand and facial gestures in conversational interaction. Oxford Handb. Lang. Soc Psychol. 111, 130. DOI: https://doi.org/10.1093/ oxfordhb/9780199838639.013.008.


J. B. Bavelas and N. Chovil. 2000. Visible acts of meaning: An integrated message model of language in face-to-face dialogue. J. Lang. Soc. Psychol. 19, 2, 163–194. DOI: https://doi. org/10.1177/0261927X00019002001. K. Bergmann and S. Kopp. 2009. GNetIc–Using Bayesian decision networks for iconic gesture generation. In International Workshop on Intelligent Virtual Agents. Springer, 76–89. R. M. Bergner. 2020. What is personality? Two myths and a definition. N. Ideas Psychol. 57, 100759. DOI: https://doi.org/10.1016/j.newideapsych.2019.100759. J. Beskow. September. 1997. Animation of talking agents. In C. Benoit and R. Campbell (Eds.), Proceedings of the ESCA Workshop on Audio-Visual Speech Processing. Rhodes, Greece, 149–152. J. Beskow and S. McGlashan. August. 1997. Olga—A conversational agent with gestures. In Proceedings of the IJCAI 1997 Workshop on Animated Interface Agents: Making Them Intelligent. Nagoya, Japan. E. Bevacqua, E. De Sevin, S. J. Hyniewska, and C. Pelachaud. 2012. A listener model: Introducing personality traits. J. Multimodal User Interfaces 6, 1–2, 27–38. DOI: https:// doi.org/10.1007/s12193-012-0094-8. B. Biancardi, M. Mancini, P. Lerner, and C. Pelachaud. 2019a. Managing an agent’s selfpresentational strategies during an interaction. Front. Rob. AI 6, 93. DOI: https://doi.org/ 10.3389/frobt.2019.00093. B. Biancardi, C. Wang, M. Mancini, A. Cafaro, G. Chanel, and C. Pelachaud. 2019b. A computational model for managing impressions of an embodied conversational agent in real-time. In 2019 International Conference on Affective Computing and Intelligent Interaction (ACII). DOI: https://doi.org/10.1109/ACII.2019.8925495. T. W. Bickmore, R. Fernando, L. Ring, and D. Schulman. 2010. Empathic touch by relational agents. IEEE Trans. Affective Comput. 1, 1, 60–71. DOI: https://doi.org/10.1109/T-AFFC. 2010.4. E. Bozkurt, C. E. Erdem, E. Erzin, T. Erdem, M. Özkan, and A. M. Tekalp. May. 2008. Speech-driven automatic facial expression synthesis. In 3DTV Conference 2008: The True Vision—Capture, Transmission and Display of 3D Video. 273–276. DOI: https://doi.org/ 10.1109/3DTV.2008.4547861. M. Brand. 1999. Voice puppetry. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 1999). (August 1999). Los Angeles, CA, 21–28. DOI: https://doi.org/10.1145/311535.311537. C. Bregler, M. Covell, and M. Slaney. August. 1997. Video rewrite: Driving visual speech with audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 1997). Los Angeles, CA, 353–360. DOI: https://doi.org/ 10.1145/258734.258880. T. D. Bui. 2004. Creating Emotions and Facial Expressions for Embodied Agents. Ph.D. thesis, University of Twente, Department of Computer Science.


S. Buisine, M. Courgeon, A. Charles, C. Clavel, J.-C. Martin, N. Tan, and O. Grynszpan. 2014. The role of body postures in the recognition of emotions in contextually rich scenarios. Int. J. Hum. Comput. Int. 30, 1, 52–62. DOI: https://doi.org/10.1080/10447318.2013. 802200. J. K. Burgoon, L. A. Stern, and L. Dillman. 2007. Interpersonal Adaptation: Dyadic Interaction Patterns. Cambridge University Press. H. Buschmeier and S. Kopp. 2014. When to elicit feedback in dialogue: Towards a model based on the information needs of speakers. In T. W. Bickmore, S. Marsella, and C. L. Sidner (Eds.), Intelligent Virtual Agents—14th International Conference, IVA 2014, Boston, MA, August 27–29, 2014. Proceedings, Vol. 8637 of Lecture Notes in Computer Science. Springer, 71–80. DOI: https://doi.org/10.1007/978-3-319-09767-1_10. C. Busso and S. Narayanan. November. 2007. Interrelation between speech and facial gestures in emotional utterances: A single subject study. IEEE Trans. Audio Speech Lang. Process. 15, 8, 2331–2347. DOI: https://doi.org/10.1109/TASL.2007.905145. C. Busso, Z. Deng, M. Grimm, U. Neumann, and S. Narayanan. March. 2007. Rigid head motion in expressive speech animation: Analysis and synthesis. IEEE Trans. Audio Speech Lang. Process. 15, 3, 1075–1086. DOI: https://doi.org/10.1109/TASL.2006.885910. C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. Chang, S. Lee, and S. Narayanan. December. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. J. Lang. Res. Eval. 42, 4, 335–359. DOI: https://doi.org/10.1007/s10579-008-9076-6. Z. Callejas, B. Ravenet, M. Ochs, and C. Pelachaud. 2014. A computational model of social attitudes for a virtual recruiter. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems. 93–100. J. Cassell and M. Stone. 1999. Living hand and mouth. Psychological theories about speech and gestures in interactive dialogue systems. In AAAI99 Fall Symposium on Psychological Models of Communication in Collaborative Systems. J. Cassell, C. Pelachaud, N. Badler, M. Steedman, B. Achorn, T. Bechet, B. Douville, S. Prevost, and M. Stone. 1994. Animated conversation: Rule-based generation of facial expression gesture and spoken intonation for multiple conversational agents. In Computer Graphics (Proc. of ACM SIGGRAPH’94). Orlando, FL, 413–420. DOI: https://doi. org/10.1145/192161.192272. J. Cassell, T. Bickmore, M. Billinghurst, L. Campbell, K. Chang, H. Vilhjálmsson, and H. Yan. May. 1999. Embodiment in conversational interfaces: Rea. In International Conference on Human Factors in Computing Systems (CHI-99). Pittsburgh, PA, 520–527. DOI: http s://doi.org/10.1145/302979.303150. J. Cassell, H. Vilhjálmsson, and T. Bickmore. 2001. BEAT: The behavior expression animation toolkit. In Computer Graphics Proceedings, Annual Conference Series. ACM SIGGRAPH. DOI: https://doi.org/10.1145/383259.383315. D. Chi, M. Costa, L. Zhao, and N. Badler. 2000. The EMOTE model for effort and shape. In K. Akeley, (Ed.), Siggraph 2000, Computer Graphics Proceedings. ACM Press/ACM SIGGRAPH/Addison Wesley Longman, 173–182. DOI: https://doi.org/10.1145/344779. 352172.


M. Chollet, M. Ochs, and C. Pelachaud. 2017. A methodology for the automatic extraction and generation of non-verbal signals sequences conveying interpersonal attitudes. IEEE Trans. Affective Comput. DOI: https://doi.org/10.1109/TAFFC.2017.2753777. N. Chovil. 1991. Discourse-oriented facial displays in conversation. Res. Lang. Soc. Interact. 25, 1–4, 163–194. DOI: https://doi.org/10.1080/08351819109389361. C. Clavel, J. Plessier, J.-C. Martin, L. Ach, and B. Morel. 2009. Combining facial and postural expressions of emotions in a virtual character. In International Workshop on Intelligent Virtual Agents. Springer, 287–300. M. M. Cohen and D. W. Massaro. 1993. Modeling coarticulation in synthetic visual speech. In M. Magnenat-Thalmann and D. Thalmann (Eds.), Models and Techniques in Computer Animation. Springer-Verlag, Tokyo, 139–156. M. Courgeon, J.-C. Martin, and C. Jacquemin. 2008. User’s gestural exploration of different virtual agents’ expressive profiles. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems—Volume 3. International Foundation for Autonomous Agents and Multiagent Systems. 1237–1240. DOI: https://doi.org/10.1145/ 1402821.1402840. M. Courgeon, C. Céline, and J.-C. Martin. 2014. Modeling facial signs of appraisal during interaction : Impact on users’ perception and behavior. Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems. 765–772. http://dl.acm.org/citation.cfm?id=2615731.2615855. R. Cowie, E. Douglas-Cowie, M. McRorie, I. Sneddon, L. Devillers, and N. Amir. 2011. Issues in data collection. In Emotion-Oriented Systems. Springer, 197–212. W. Curran, G. J. McKeown, M. Rychlowska, E. André, J. Wagner, and F. Lingenfelser. 2018. Social context disambiguates the interpretation of laughter. Front. Psychol. 8, 2342. DOI: https://doi.org/10.3389/fpsyg.2017.02342. N. Dael, M. Mortillaro, and K. R. Scherer. 2012. The body action and posture coding system (BAP): Development and reliability. J. Nonverbal Behav. 36, 2, 97–121. DOI: https://doi.org/ 10.1007/s10919-012-0130-0. B. de Gelder, A. De Borst, and R. Watson. 2015. The perception of emotion in body expressions. Wiley Interdiscip. Rev. Cogn. Sci. 6, 2, 149–158. DOI: https://doi.org/10.1002/wcs.1335. B. de Gelder, J. Kätsyri, and A. W. de Borst. 2018. Virtual reality and the new psychophysics. Br. J. Psychol. 109, 421–426. DOI: https://doi.org/10.1111/bjop.12308. I. de Kok and D. Heylen. 2009. Multimodal end-of-turn prediction in multi-party meetings. In Proceedings of the 2009 International Conference on Multimodal Interfaces, ICMI-MLMI’ 09. Association for Computing Machinery, New York, NY, 91–98. ISBN 9781605587721. DOI: https://doi.org/10.1145/1647314.1647332. I. de Kok, D. Heylen, and L. Morency. 2013. Speaker-adaptive multimodal prediction model for listener responses. In J. Epps, F. Chen, S. L. Oviatt, K. Mase, A. Sears, K. Jokinen, and B. W. Schuller (Eds.), 2013 International Conference on Multimodal Interaction, ICMI ’13, Sydney, NSW, Australia, December 9–13, 2013. ACM, 51–58. DOI: https://doi.org/10.1145/ 2522848.2522866.


E. Deng, B. Mutlu, and M. J. Mataric. 2019. Embodiment in socially interactive robots. Found. Trends Rob. 7, 4, 251–356. DOI: https://doi.org/10.1561/2300000056. S. Dermouche and C. Pelachaud. 2016. Sequence-based multimodal behavior modeling for social agents. In Proceedings of the 18th ACM International Conference on Multimodal Interaction. 29–36. DOI: https://doi.org/10.1145/2993148.2993180. S. Dermouche and C. Pelachaud. 2019. Generative model of agent’s behaviors in human– agent interaction. In Proceedings of the 21st ACM International Conference on Multimodal Interaction (ICMI 2019). ACM, Suzhou, Jiangsu, China. DOI: https://doi.org/10.1145/ 3340555.3353758. J. Dias, S. Mascarenhas, and A. Paiva. 2014. FAtiMA modular: Towards an agent architecture with a generic appraisal framework. In Emotion Modeling. Springer, 44–56. DOI: https://doi.org/10.1007/978-3-319-12973-0_3. R. Dibiasi and J. Gunnoe. 2004. Gender and culture differences in touching behavior. J. Soc. Psychol. 144, 1, 49–62. DOI: https://doi.org/10.3200/SOCP.144.1.49-62. Y. Ding, C. Pelachaud, and T. Artieres. August. 2013a. Modeling multimodal behaviors from speech prosody. In R. Aylett, B. Krenn, C. Pelachaud, and H. Shimodaira (Eds.), International Conference on Intelligent Virtual Agents (IVA 2013), Vol. 8108 of Lecture Notes in Computer Science. Springer Berlin, Heidelberg, Edinburgh, UK, 198–207. ISBN 978-3642-40415-3. DOI: https://doi.org/10.1007/978-3-642-40415-3_19. Y. Ding, M. Radenen, T. Artières, and C. Pelachaud. May. 2013b. Speech-driven eyebrow motion synthesis with contextual Markovian models. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013). Vancouver, BC, Canada, 3756–3760. DOI: https://doi.org/10.1109/ICASSP.2013.6638360. C. Ding, P. Zhu, L. Xie, D. Jiang, and Z. Fu. September. 2014a. Speech-driven head motion synthesis using neural networks. In Interspeech 2014. Singapore, 2303–2307. Y. Ding, K. Prepin, J. Huang, C. Pelachaud, and T. Artières. 2014b. Laughter animation synthesis. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems. 773–780. C. Ding, L. Xie, and P. Zhu. 2015a. Head motion synthesis from speech using deep neural networks. Multimed. Tools Appl. 74, 22, 9871–9888. C. Ding, P. Zhu, and L. Xie. September. 2015b. BLSTM neural networks for speech driven head motion synthesis. In Interspeech 2015. Dresden, Germany, 3345–3349. Y. Ding, J. Huang, and C. Pelachaud. 2017. Audio-driven laughter behavior controller. IEEE Trans. Affective Comput. 8, 4, 546–558. DOI: https://doi.org/10.1109/TAFFC.2017. 2754365. A. Egges, S. Kshirsagar, and N. Magnenat-Thalmann. 2003. A model for personality and emotion simulation. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems. Springer, 453–461. DOI: https://doi.org/10.1007/9783-540-45224-9_63. P. Ekman. 2004. Emotional and conversational nonverbal signals. In Language, Knowledge, and Representation. Springer, 39–50. DOI: https://doi.org/10.1007/978-1-4020-2783-3_3.


P. Ekman and W. Friesen. 1975. Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues. Prentice-Hall, Inc. P. Ekman and W. V. Friesen. 1967. Head and body cues in the judgment of emotion: A reformulation. Percept. Mot. Skills 24, 3 PT 1, 711–724. DOI: https://doi.org/10.2466/pms.1967. 24.3.711. P. Ekman, W. Friesen, and J. Hager. 2002. Facial action Coding System (FACS). A Human Face. Research Nexus, Salt Lake City, UT. K. El Haddad, H. Çakmak, E. Gilmartin, S. Dupont, and T. Dutoit. 2016. Towards a listening agent: A system generating audiovisual laughs and smiles to show interest. In Proceedings of the 18th ACM International Conference on Multimodal Interaction. 248–255. DOI: https://doi.org/10.1145/2993148.2993182. H. J. Eysenck. 2012. A Model for Personality. Springer Science & Business Media. DOI: https://doi.org/10.1007/978-3-642-67783-0. B. Fan, L. Xie, S. Yang, L. Wang, and F. K. Soong. May. 2016. A deep bidirectional LSTM approach for video-realistic talking head. Multimed. Tools Appl. 75, 9, 5287–5309. DOI: https://doi.org/10.1007/s11042-015-2944-3. Y. Ferstl and R. McDonnell. November. 2018. Investigating the use of recurrent motion modelling for speech gesture generation. In Intelligent Virtual Agents (IVA 2018). Sydney, NSW, Australia, 93–98. DOI: https://doi.org/10.1145/3267851.3267898. Y. Ferstl, M. Neff, and R. McDonnell. June. 2020. Adversarial gesture generation with realistic gesture phasing. Comput. Graphics 89, 117–130. DOI: https://doi.org/10.1016/j.cag.2020. 04.007. N. Fourati and C. Pelachaud. 2014. Collection and characterization of emotional body behaviors. In Proceedings of the 2014 International Workshop on Movement and Computing. 49–54. DOI: https://doi.org/10.1145/2617995.2618004. N. Fourati and C. Pelachaud. 2016. Perception of emotions and body movement in the Emilya database. IEEE Trans. Affective Comput. 9, 1, 90–101. DOI: https://doi.org/10.1109/ TAFFC.2016.2591039. W. V. Friesen and P. Ekman. 1983. EMFACS-7: Emotional Facial Action Coding System. Unpublished Manuscript, University of California at San Francisco 2, 36, 1. A. Fukayama, T. Ohno, N. Mukawa, M. Sawaki, and N. Hagita. 2002. Messages embedded in gaze of interface agents—Impression management with agent’s gaze. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’02. Association for Computing Machinery, New York, NY, 41–48. ISBN 1581134533. DOI: https://doi.org/ 10.1145/503376.503385. A. Gallace and C. Spence. 2010. The science of interpersonal touch: An overview. Neurosci. Biobehav. Rev. 34, 2, 246–259. DOI: https://doi.org/10.1016/j.neubiorev.2008.10.004. S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens, and J. Malik. 2019. Learning individual styles of conversational gesture. In Computer Vision and Pattern Recognition (CVPR). H. P. Graf, E. Cosatto, V. Strom, and F. J. Huang. 2002. Visual prosody: Facial movements accompanying speech. In Proceedings of IEEE International Conference on Automatic Faces


and Gesture Recognition. (May 2002) Washington, DC, 396–401. DOI: https://doi.org/10. 1109/AFGR.2002.1004186. J. Gratch, A. Okhmatovskaia, F. Lamothe, S. Marsella, M. Morales, R. van der Werf, and L. Morency. August. 2006. Virtual rapport. In J. Gratch, M. Young, R. Aylett, D. Ballin, and P. Olivier (Eds.), International Conference on Intelligent Virtual Agents (IVA 2006), Vol. 4133 of Lecture Notes in Computer Science. Springer-Verlag Berlin, Heidelberg, Marina del Rey, CA, 14–27. ISBN 978-3-540-37593-7. DOI: https://dx.doi.org/10.1007/11821830_2. J. Gratch, N. Wang, J. Gerten, E. Fast, and R. Duffy. 2007. Creating rapport with virtual agents. In Intelligent Virtual Agents. Springer, Berlin, Heidelberg. DOI: https://doi.org/10. 1007/978-3-540-74997-4_12. J. Gratch, R. Artstein, G. M. Lucas, G. Stratou, S. Scherer, A. Nazarian, R. Wood, J. Boberg, D. DeVault, S. Marsella, D. Traum, S. Rizzo, and L.-P. Morency. 2014. The distress analysis interview corpus of human and computer interviews. In LREC. 3123–3128. J. Gratch, D. DeVault, and G. Lucas. 2016. The benefits of virtual humans for teaching negotiation. In International Conference on Intelligent Virtual Agents. Springer, 283–294. DOI: https://doi.org/10.1007/978-3-319-47665-0_25. D. Greenwood, S. Laycock, and I. Matthews. August. 2017a. Predicting head pose in dyadic conversation. In J. Beskow, C. Peters, G. Castellano, C. O’Sullivan, I. Leite, and S. Kopp (Eds.), International Conference on Intelligent Virtual Agents (IVA 2017), Vol. 10498 of Lecture Notes in Computer Science. Springer Berlin, Heidelberg, Stockholm, Sweden, 160–169. ISBN 978-3-319-67400-1. DOI: https://doi.org/10.1007/978-3-319-67401-8_18. D. Greenwood, S. Laycock, and I. Matthews. August. 2017b. Predicting head pose from speech with a conditional variational autoencoder. In Interspeech 2017. Stockholm, Sweden, 3991–3995. DOI: https://doi.org/10.21437/Interspeech.2017-894. K. Haag and H. Shimodaira. September. 2016. Bidirectional LSTM networks employing stacked bottleneck features for expressive speech-driven head motion synthesis. In D. Traum, W. Swartout, P. Khooshabeh, S. Kopp, S. Scherer, and A. Leuski (Eds.), International Conference on Intelligent Virtual Agents (IVA 2016), Vol. 10011 of Lecture Notes in Computer Science. Springer Berlin, Heidelberg, Los Angeles, CA, 198–207. ISBN 978-3319-47664-3. DOI: https://doi.org/10.1007/978-3-319-47665-0_18. M. Halliday. 1967. Intonation and Grammar in British English. Mouton, The Hague. DOI: https://doi.org/10.1515/9783111357447. B. Hartmann, M. Mancini, and C. Pelachaud. 2005. Implementing expressive gesture synthesis for embodied conversational agents. In International Gesture Workshop. Springer, 188–199. DOI: https://doi.org/10.1007/11678816_22. D. Hasegawa, N. Kaneko, S. Shirakawa, H. Sakuta, and K. Sumi. November. 2018. Evaluation of speech-to-gesture generation using bi-directional LSTM network. In Intelligent Virtual Agents (IVA 2018). Sydney, NSW, Australia, 79–86. DOI: https://doi.org/10. 1145/3267851.3267878. B. Hayes-Roth and P. Doyle. 1998. Animate characters. Auton. Agents Multi-Agent Syst. 1, 2, 195–230. DOI: https://doi.org/10.1023/A:1010019818773.


M. J. Hertenstein, R. Holmes, M. McCullough, and D. Keltner. 2009. The communication of emotion via touch. Emotion 9, 4, 566–573. DOI: https://doi.org/10.1037/a0016108. D. Heylen. April. 2005. Challenges ahead: Head movements and other social acts in conversation. In Artificial Intelligence and Simulation of Behaviour (AISB 2005), Social Presence Cues for Virtual Humanoids Symposium. Hertfordshire, UK, 8. D. Heylen. 2010. Ubiquitous gaze: Using gaze at the interface. In Human-Centric Interfaces for Ambient Intelligence. Elsevier, 49–70. DOI: https://doi.org/10.1016/B978-0-12-374708-2. 00003-6. D. Heylen, I. van Es, A. Nijholt, and B. van Dijk. 2005. Controlling the gaze of conversational agents. In J. van Kuppevelt, L. Dybkjær, and N. Bernsen (Eds.), Advances in Natural Multimodal Dialogue Systems, Vol. 30 of Text, Speech and Language Technology. Springer, Dordrecht, 245–262. ISBN 978-1-4020-3933-1. DOI: https://doi.org/10.1007/1-4020-39336_11. D. Heylen, S. Kopp, S. C. Marsella, C. Pelachaud, and H. Vilhjálmsson. 2008. The next step towards a Function Markup Language. In Intelligent Virtual Agents. Springer, 270–280. DOI: https://doi.org/10.1007/978-3-540-85483-8_2. G. Hofer and H. Shimodaira. August. 2007. Automatic head motion prediction from speech data. In Interspeech 2007. Antwerp, Belgium, 758–761. J. Huang and C. Pelachaud. 2012. Expressive body animation pipeline for virtual agent. In Proceedings of 12th International Conference of Intelligent Virtual Agents—IVA. 355–362. DOI: https://doi.org/10.1007/978-3-642-33197-8_36. L. Huang, L. Morency, and J. Gratch. September. 2011. Virtual rapport 2.0. In H. Vilhjálmsson, S. Kopp, S. Marsella, and K. Thórisson (Eds.), Intelligent Virtual Agents, Vol. 6895 of Lecture Notes in Computer Science. Springer Berlin, Heidelberg, Reykjavik, Iceland, 68–79. ISBN 978-3-642-23974-8. DOI: https://doi.org/10.1007/978-3-642-23974-8_8. Y. Huang and S. M. Khan. July. 2017. DyadGAN: Generating facial expressions in dyadic interactions. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2017). Honolulu, HI, 2259–2266. DOI: https://doi.org/10.1109/CVPRW.2017.280. E. M. Huis In’t Veld, G. J. van Boxtel, and B. de Gelder. 2014. The Body Action Coding System II: Muscle activations during the perception and expression of emotion. Front. Behav. Neurosci. 8, 330. DOI: https://doi.org/10.3389/fnbeh.2014.00330. G. Huisman, A. Frederiks, B. V. Dijk, D. Hevlen, and B. Kröse. 2013. The TaSST: Tactile sleeve for social touch. In World Haptics Conference (WHC). 211–216. DOI: https://doi.org/ 10.1109/WHC.2013.6548410. G. Huisman, J. Kolkmeier, and D. Heylen. 2014. With us or against us: Simulated social touch by virtual agents in a cooperative or competitive setting. In T. W. Bickmore, S. Marsella, and C. Sidner (Eds.), Intelligent Virtual Agents—14th International Conference, IVA 2014, Boston, MA, August 27–29, 2014. Proceedings, volume 8637 of Lecture Notes in Computer Science. Springer, 204–213. DOI: https://doi.org/10.1007/978-3-319-09767-1_25. K. Inoue, D. Lala, K. Yamamoto, S. Nakamura, K. Takanashi, and T. Kawahara. 2020. An attentive listening system with android ERICA: Comparison of autonomous and WOZ


interactions. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue. 118–127. R. E. Jack and P. G. Schyns. 2017. Toward a social psychophysics of face communication. Ann. Rev. Psychol. 68, 269–297. DOI: https://doi.org/10.1146/annurev-psych-010416-044242. R. E. Jack, O. G. Garrod, H. Yu, R. Caldara, and P. G. Schyns. 2012. Facial expressions of emotion are not culturally universal. Proc. Nat. Acad. Sci. 109, 19, 7241–7244. DOI: https:// doi.org/10.1073/pnas.1200155109. K. Jokinen and C. Pelachaud. 2013. From annotation to multimodal behaviour. In Coverbal Synchrony in Human–Machine Interaction. CRC Press, Taylor & Francis Group, 203–222. S. E. Jones and A. E. Yarbrough. 1985. A naturalistic study of the meanings of touch. Commun. Monogr. 52, 19–56. DOI: https://doi.org/10.1080/03637758509376094. S.-H. Kang, C. Sidner, J. Gratch, R. Artstein, L. Huang, and L.-P. Morency. 2011. Modeling nonverbal behavior of a virtual counselor during intimate self-disclosure. In International Workshop on Intelligent Virtual Agents. Vol. 6895. 455–457. DOI: https://doi. org/10.1007/978-3-642-23974-8_60. T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen. July. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Trans. Graph. (TOG), 36, 494. DOI: https://doi.org/10.1145/3072959.3073658. A. Kendon. 1980. Gesticulation and speech: Two aspects of the process of utterance. In M.R. Key, (Ed.), The Relation between Verbal and Nonverbal Communication. Mouton, 207–227. DOI: https://doi.org/10.1515/9783110813098.207. S. Kettebekov, M. Yeasin, and R. Sharma. April. 2005. Prosody based audiovisual coanalysis for coverbal gesture recognition. IEEE Trans. Multimedia 7, 2234–242. DOI: https://doi. org/10.1109/TMM.2004.840590. M. Kipp, M. Neff, K. Kipp, and I. Albrecht. September. 2007. Towards natural gesture synthesis: Evaluating gesture units in a data-driven approach to gesture synthesis. In C. Pelachaud, J. Martin, E. André, G. Chollet, K. Karpouzis, and D. Pelé (Eds.), International Workshop on Intelligent Virtual Agents (IVA2007), Vol. 4722 of Lecture Notes in Computer Science. Springer Berlin, Heidelberg, Paris, France, 15–28. ISBN 978-3-54074997-4. DOI: https://doi.org/10.1007/978-3-540-74997-4_2. A. Kleinsmith and N. Bianchi-Berthouze. 2012. Affective body expression perception and recognition: A survey. IEEE Trans. Affective Comput. 4, 1, 15–33. DOI: https://doi.org/10. 1109/T-AFFC.2012.16. S. Kopp and I. Wachsmuth. June. 2002. Model-based animation of co-verbal gesture. In Proceedings of Computer Animation. Geneva, Switzerland, 252–257. DOI: https://doi.org/10. 1109/CA.2002.1017547. S. Kopp and I. Wachsmuth. March. 2004. Synthesizing multimodal utterances for conversational agents. Comput. Animat. Virtual Worlds 15, 139–52. DOI: https://doi.org/10.1002/ cav.6. S. Kopp, B. Krenn, S. Marsella, A. Marshall, C. Pelachaud, H. Pirker, K. Thorisson, and H. Vilhjálmsson. 2006. Towards a common framework for multimodal generation: The


Behavior Markup Language. In Intelligent Virtual Agents—IVA. 205–217. DOI: https://doi. org/10.1007/11821830_17. S. Kshirsagar and N. Magnenat-Thalmann. June. 2002. A multilayer personality model. In International Symposium on Smart graphics (SMARTGRAPH 2002). Hawthorne, NY, 107–115. DOI: https://doi.org/10.1145/569005.569021. T. Kucherenko, D. Hasegawa, G. Henter, N. Kaneko, and H. Kjellström. July. 2019. Analyzing input and output representations for speech-driven gesture generation. In International Conference on Intelligent Virtual Agents (IVA 2019). Paris, France, 97–104. DOI: https://doi.org/10.1145/3308532.3329472. R. Laban and F. C. Lawrence. 1974. Effort: Economy in Body Movement. Plays, Inc., Boston. J. L. Lakin and T. L. Chartrand. 2003. Using nonconscious behavioral mimicry to create affiliation and rapport. Psychol. Sci. 14, 4, 334–339. DOI: https://doi.org/10.1111/1467-9280. 14481. B. H. Le, X. Ma, and Z. Deng. November. 2012. Live speech driven head-and-eye motion generators. IEEE Trans. Visual. Comput. Graphics 18, 11, 1902–1914. DOI: https://doi.org/10. 1109/TVCG.2012.74. J. Lee and S. Marsella. 2006. Nonverbal behavior generator for embodied conversational agents. In International Workshop on Intelligent Virtual Agents. Springer, 243–255. DOI: https://doi.org/10.1007/11821830_20. J. Lee and S. Marsella. September. 2017. Modeling speaker behavior: A comparison of two approaches. In Y. Nakano, M. Neff, A. Paiva, and M. Walker (Eds.), International Conference on Intelligent Virtual Agents (IVA 2012), Vol. 7502 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, Santa Cruz, CA, 160–169. ISBN 978-3-642-33197-8. DOI: https://doi.org/10.1007/978-3-642-33197-8_17. J. Lee, J. Chai, P. Reitsma, J. Hodgins, and N. Pollard. July. 2002. Interactive control of avatars animated with human motion data. In Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 2002). San Antonio, Texas, 491–500. DOI: https://doi.org/10.1145/566570.566607. J. Lester, S. Towns, C. Callaway, J. Voerman, and P. Fitzgerald. 2000. Deictic and emotive communication in animated pedagogical agents. In S. P. J. Cassell, J. Sullivan and E. Churchill (Eds.), Embodied Conversational Characters. MIT Press, Cambridge, MA, 123–154. DOI: https://doi.org/10.7551/mitpress/2697.003.0007. S. Levine, P. Krähenbühl, S. Thrun, and V. Koltun. July. 2010. Gesture controllers. ACM Trans. Graphics 29, 4124, 1–124:11. DOI: https://doi.org/10.1145/1778765.1778861. M. Lhommet and S. C. Marsella. 2014. Expressing emotion through posture. The Oxford Handbook of Affective Computing. 273. DOI: https://doi.org/10.1093/oxfordhb/ 9780199942237.013.039. X. Li, Z. Wu, H. Meng, J. Jia, X. Lou, and L. Cai. September. 2016. Expressive speech driven talking avatar synthesis with DBLSTM using limited amount of emotional bimodal data. In Interspeech 2016. San Francisco, CA, 1477–1481. DOI: https://doi.org/10.21437/ Interspeech.2016-364.


P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. 2010. The extended Cohn–Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops. IEEE, 94–101. DOI: https://doi.org/10.1109/CVPRW.2010. 5543262. M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba. 1998. Coding facial expressions with Gabor wavelets. In Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 200–205. DOI: https://doi.org/10.1109/AFGR.1998.670949. K. F. MacDorman and H. Ishiguro. 2006. The uncanny advantage of using androids in cognitive and social science research. Interact. Stud. 7, 3, 297–337. DOI: https://doi.org/10. 1075/is.7.3.03mac. F. Mairesse and M. A. Walker. 2011. Controlling user perceptions of linguistic style: Trainable generation of personality traits. Comput. Linguist. 37, 3, 455–488. DOI: https:// doi.org/10.1162/COLI_a_00063. L. Malatesta, A. Raouzaiou, K. Karpouzis, and S. Kollias. 2009. Towards modeling embodied conversational agent character profiles using appraisal theory predictions in expression synthesis. Appl. Intell. 30, 1, 58–64. DOI: https://doi.org/10.1007/s10489-0070076-9. M. Mancini and C. Pelachaud. 2008. Distinctiveness in multimodal behaviors. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems—Vol. 1. Citeseer, 159–166. M. Mancini, B. Biancardi, S. Dermouche, P. Lerner, and C. Pelachaud. 2019. An architecture for agent’s impression management based on user’s engagement. In International Conference on Intelligent Virtual Agents. Springer. DOI: https://doi.org/10.1145/3308532. 3329442. S. Mariooryad and C. Busso. October. 2012. Generating human-like behaviors using joint, speech-driven models for conversational agents. IEEE Trans. Audio, Speech Language Process. 20, 8, 2329–2340. DOI: https://doi.org/10.1109/TASL.2012.2201476. S. Mariooryad and C. Busso. April–June. 2013. Exploring cross-modality affective reactions for audiovisual emotion recognition. IEEE Trans. Affect. Comput. 4, 2 (April-June 2013), 183–196. DOI: https://doi.org/10.1109/T-AFFC.2013.11. S. Marsella, Y. Xu, M. Lhommet, A. Feng, S. Scherer, and A. Shapiro. July. 2013. Virtual character performance from speech. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA 2013). Anaheim, CA, 25–35. DOI: https://doi.org/10.1145/2485895.2485900. S. C. Marsella and J. Gratch. 2009. EMA: A process model of appraisal dynamics. Cogn. Syst. Res. 10, 1, 70–90. DOI: https://doi.org/10.1016/j.cogsys.2008.03.005. C. Mazzocconi, Y. Tian, and J. Ginzburg. 2020. What’s your laughter doing there? A taxonomy of the pragmatic functions of laughter. IEEE Trans. Affect. Comput. DOI: https://doi. org/10.1109/TAFFC.2020.2994533. R. R. McCrae and P. T. Costa Jr. 2008. The five-factor theory of personality. In L. P. O. P. John and R.W. Robins (Eds.), Handbook of Personality: Theory and Research. Guilford Press, 159–181.


G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schroder. 2011. The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Trans. Affect. Comput. 3, 1, 5–17. DOI: https://doi.org/ 10.1109/T-AFFC.2011.20. D. McNeill. 1992. Hand and Mind: What Gestures Reveal about Thought. The University of Chicago Press, Chicago, IL, ISBN 0-226-56132-1. M. McRorie, I. Sneddon, G. McKeown, E. Bevacqua, E. de Sevin, and C. Pelachaud. 2011. Evaluation of four designed virtual agent personalities. IEEE Trans. Affect. Comput. 3, 3, 311–322. DOI: https://doi.org/10.1109/T-AFFC.2011.38. A. Mehrabian. 1996. Pleasure–arousal–dominance: A general framework for describing and measuring individual differences in temperament. Curr. Psychol. Dev. Learn. Pers. Soc. 14, 261–292. DOI: http://dx.doi.org/10.1007/BF02686918. M. Mori, K. F. MacDorman, and N. Kageki. 2012. The uncanny valley [from the field]. IEEE Robot. Autom. Mag. 19, 2, 98–100. DOI: http://dx.doi.org/10.1109/MRA.2012.2192811. M. Neff, M. Kipp, I. Albrecht, and H. Seidel. March. 2008. Gesture modeling and animation based on a probabilistic re-creation of speaker style. ACM Trans. Graph. (TOG) 27, 1, 1–24. DOI: https://dx.doi.org/10.1145/1330511.1330516. N. Nguyen, I. Wachsmuth, and S. Kopp. 2007. Touch perception and emotional appraisal for a virtual agent. In Proceedings Workshop Emotion and Computing—Current Research and Future Impact, KI. 17–22. R. Niewiadomski and C. Pelachaud. 2007. Fuzzy similarity of facial expressions of embodied agents. In C. Pelachaud, J.-C. Martin, E. André, G. Chollet, K. Karpouzis, and D. Pelé (Eds.), Proceedings of the 7th International Conference on Intelligent Virtual Agents (IVA). Springer, 86–98. R. Niewiadomski, S. J. Hyniewska, and C. Pelachaud. 2011. Constraint-based model for synthesis of multimodal sequential expressions of emotions. IEEE Trans. Affect. Comput. 2, 3, 134–146. DOI: https://doi.org/10.1109/T-AFFC.2011.5. H. Noot and Z. Ruttkay. 2005. Variations in gesturing and speech by GESTYLE. Int. J. Hum.Comput. Stud. 62, 2, 211–229. ISSN 1071-5819. DOI: https://doi.org/10.1016/j.ijhcs.2004. 11.007. M. Ochs and C. Pelachaud. 2013. Socially aware virtual characters: The social signal of smiles. IEEE Signal Process. Mag. 30, 2, 128–132. DOI: https://doi.org/10.1109/MSP.2012. 2230541. H. Oster. 2006. Baby FACS: Facial action coding system for infants and young children. In Unpublished Monograph and Coding Manual. New York University. J. Ostermann. 2002. Face animation in MPEG-4. In I. Pandzic and R. Forchheimer (Eds.), MPEG-4 Facial Animation : The Standard, Implementation and Applications. Wiley, England, 17–55. DOI: https://doi.org/10.1002/0470854626. M. Paleari and C. Lisetti. 2006. Psychologically grounded avatars expressions. In First Workshop on Emotion and Computing at KI 2006, 29th Annual Conference on Artificial Intelligence. Bremen, Germany.


X. Pan, M. Gillies, T. M. Sezgin, and C. Loscos. 2007. Expressing complex mental states through facial expressions. In Second International Conference on Affective Computing and Intelligent Interaction (ACII). Springer, 745–746. F. Parke. 1975. A model for human faces that allows speech synchronized animation. Computer and Graphics, Pergamon Press. 1, 1, 3–4. DOI: https://doi.org/10.1016/0097-8493 (75)90024-2. F. I. Parke. 1972. Computer generated animation of faces. In Proceedings of the ACM Annual Conference—Vol. 1. 451–457. DOI: https://doi.org/10.1145/800193.569955. J. Parker, R. Maia, Y. Stylianou, and R. Cipolla. March. 2017. Expressive visual text to speech and expression adaptation using deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017). New Orleans, LA, 4920–4924. DOI: https://doi.org/10.1109/ICASSP.2017.7953092. C. Pelachaud and I. Poggi. 1998. Multimodal communication between synthetic agents. In Advanced Visual Interface. Aquila, Italy. DOI: https://doi.org/10.1145/948496.948518. C. Pelachaud, N. I. Badler, and M. Steedman. January–March. 1996. Generating facial expressions for speech. Cogn. Sci. 20, 1, 1–46. DOI: https://doi.org/10.1016/S0364-0213(99) 80001-9. M. J. Pickering and S. Garrod. 2004. Toward a mechanistic psychology of dialogue. Behav. Brain Sci. 27, 2, 169–190. DOI: https://doi.org/10.1017/S0140525X04000056. I. Poggi. 2007. Mind, Hands, Face and Body. A Goal and Belief View of Multimodal Communication, Volume Körper, Zeichen, Kultur (19). Weidler Verlag. I. Poggi and C. Pelachaud. 2000. Performative facial expressions in animated faces. In J. Cassell, J. Sullivan, S. Prevost, and E. Churchill (Eds.), Embodied Conversational Agents. MIT Press, Cambridge, MA, 154–188. DOI: https://doi.org/10.7551/mitpress/2697.003. 0008. F. Pollick, H. Paterson, A. Bruderlin, and A. Sanford. 2001. Perceiving affect from arm movement. Cognition 82, 51–61. DOI: https://doi.org/10.1016/S0010-0277(01)00147-0. R. Poppe, K. P. Truong, D. Reidsma, and D. Heylen. 2010. Backchannel strategies for artificial listeners. In J. M. Allbeck, N. I. Badler, T. W. Bickmore, C. Pelachaud, and A. Safonova (Eds.), Intelligent Virtual Agents, 10th International Conference, IVA 2010. September 20–22, 2010. Proceedings, Vol. 6356 of Lecture Notes in Computer Science. Springer, Philadelphia, PA, 146–158. DOI: https://doi.org/10.1007/978-3-64215892-6_16. D. V. Pynadath and S. C. Marsella. 2005. PsychSim: Modeling theory of mind with decisiontheoretic agents. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, IJCAI’05. Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1181–1186. M. Rehm and E. André. 2005. Catch me if you can—Exploring lying agents in social settings. In F. Dignum, V. Dignum, S. Koenig, S. Kraus, M. P. Singh, and M. Wooldridge (Eds.), Proceedings of International Joint Conference on Autonomous Agents and Multi-Agent Systems (AAMAS). ACM, Utrecht, The Netherlands, 937–944. DOI: https://doi.org/10.1145/ 1082473.1082615.

References

307

M. Rehm, E. André, N. Bee, B. Endrass, M. Wissner, Y. Nakano, T. Nishida, and H. Huang. 2007. The CUBE-G approach: Coaching culture-specific nonverbal behavior by virtual agents. Organizing and Learning Through Gaming and Simulation: Proceedings of Isaga. 313. J. Rickel and W. Johnson. 1999. Animated agents for procedural training in virtual reality: Perception, cognition, and motor control. Appl. Artif. Intell. 13, 343–382. DOI: https://doi. org/10.1080/088395199117315. A. Rizzo, U. Neumann, R. Enciso, D. Fidaleo, and J. Noh. July. 2004. Performance-driven facial animation: Basic research on human judgments of emotional state in facial avatars. CyberPsychol. Behav. 4, 4, 471–487. DOI: https://doi.org/10.1089/109493101 750527033. Z. Ruttkay, H. Noot, and P. T. Hagen. 2003. Emotion disc and emotion squares: Tools to explore the facial expression face. Comput. Graph. Forum. 22, 1, 49–53. DOI: https://doi. org/10.1111/1467-8659.t01-1-00645. N. Sadoughi and C. Busso. November. 2015. Retrieving target gestures toward speech driven animation with meaningful behaviors. In International Conference on Multimodal Interaction (ICMI 2015) Seattle, WA, 115–122. DOI: https://doi.org/10.1145/2818346.2820750. N. Sadoughi and C. Busso. August. 2017a. Joint learning of speech-driven facial motion with bidirectional long-short term memory. In J. Beskow, C. Peters, G. Castellano, C. O’Sullivan, I. Leite, and S. Kopp (Eds.), International Conference on Intelligent Virtual Agents (IVA 2017), Vol. 10498 of Lecture Notes in Computer Science Springer Berlin, Heidelberg, Stockholm, Sweden, 389–402. ISBN 978-3-319-67400-1. DOI: https://doi.org/ 10.1007/978-3-319-67401-8_49. N. Sadoughi and C. Busso. January. 2017b. Head motion generation. In B. Müller, S. Wolf, G.-P. Brueggemann, Z. Deng, A. McIntosh, F. Miller, and W. Scott Selbie (Eds.), Handbook of Human Motion. Springer International Publishing, 1–25. ISBN 978-3-319-30808-1. DOI: https://doi.org/10.1007/978-3-319-30808-1_4-1. N. Sadoughi and C. Busso. April. 2018a. Novel realizations of speech-driven head movements with generative adversarial networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018) Calgary, AB, Canada, 6169–6173. DOI: https://doi.org/10.1109/ICASSP.2018.8461967. N. Sadoughi and C. Busso. May. 2018b. Expressive speech-driven lip movements with multitask learning. In IEEE Conference on Automatic Face and Gesture Recognition (FG 2018) Xi’an, China, 409–415. DOI: https://doi.org/10.1109/FG.2018.00066. N. Sadoughi and C. Busso. July. 2019. Speech-driven animation with meaningful behaviors. Speech Commun. 110, 90–100. DOI: https://doi.org/10.1016/j.specom.2019.04.005. N. Sadoughi and C. Busso. 2020. Speech-driven expressive talking lips with conditional sequential generative adversarial networks. IEEE Trans. Affect. Comput. To appear. DOI: https://doi.org/10.1109/TAFFC.2019.2916031. N. Sadoughi, Y. Liu, and C. Busso. November. 2014. Speech-driven animation constrained by appropriate discourse functions. In International Conference on Multimodal Interaction (ICMI 2014). Istanbul, Turkey, 148–155. DOI: https://doi.org/10.1145/2663204.2663252.

308

Chapter 8 Multimodal Behavior Modeling for Socially Interactive Agents

N. Sadoughi, Y. Liu, and C. Busso. May. 2015. MSP-AVATAR corpus: Motion capture recordings to study the role of discourse functions in the design of intelligent virtual agents. In 1st International Workshop on Understanding Human Activities through 3D Sensors (UHA3DS 2015) Ljubljana, Slovenia, 1–6. DOI: https://doi.org/10.1109/FG.2015. 7284885. N. Sadoughi, Y. Liu, and C. Busso. December. 2017. Meaningful head movements driven by emotional synthetic speech. Speech Commun. 95, 87–99. DOI: https://doi.org/10.1016/ j.specom.2017.07.004. K. R. Scherer. 2001. Appraisal considered as a process of multilevel sequential checking. In K. Scherer, A. Schorr, and T. Johnstone (Eds.), Appraisal Processes in Emotion: Theory, Methods, Research. Oxford University Press, 92–119. K. R. Scherer. 2005. What are emotions? And how can they be measured? Soc. Sci. Inf. 44, 4, 695–729. DOI: https://doi.org/10.1177/0539018405058216. K. R. Scherer and H. Ellgring. 2007. Are facial expressions of emotion produced by categorical affect programs or dynamically driven by appraisal? Emotion 7, 1, 113–130. DOI: https://doi.org/10.1037/1528-3542.7.1.113. K. R. Scherer, E. Clark-Polner, and M. Mortillaro. 2011. In the eye of the beholder? Universality and cultural specificity in the expression and perception of emotion. Int. J. Psychol. 46, 6, 401–435. DOI: https://doi.org/10.1080/00207594.2011.626049. K. Schindler, L. van Gool, and B. de Gelder. 2008. Recognizing emotions expressed by body pose: A biologically inspired neural model. Neural Netw. 21, 9, 1238–1246. DOI: https:// doi.org/10.1016/j.neunet.2008.05.003. M. Schroder, E. Bevacqua, R. Cowie, F. Eyben, H. Gunes, D. Heylen, M. ter Maat, G. McKeown, S. Pammi, M. Pantic, C. Pelachaud, B. Schuller, E. de Sevin, M. Valstar, and M. Wollmer. 2011. Building autonomous sensitive artificial listeners. IEEE Trans. Affect. Comput. 3, 2, 165–183. DOI: https://doi.org/10.1109/T-AFFC.2011.34. S. K. Scott, N. Lavan, S. Chen, and C. McGettigan. 2014. The social life of laughter. Trends Cogn. Sci. 18, 12, 618–620. DOI: https://doi.org/10.1016/j.tics.2014.09.002. M. Stone, D. DeCarlo, I. Oh, C. Rodriguez, A. Stere, A. Lees, and C. Bregler. August. 2004. Speaking with hands: Creating animated conversational characters from recordings of human performance. ACM Trans. Graph. (TOG). 23, 3, 506–513. DOI: https://doi.org/10. 1145/1015706.1015753. S. Taylor, A. Kato, I. Matthews, and B. Milner. September. 2016. Audio-to-visual speech conversion using deep neural networks. In Interspeech 2016. San Francisco, CA, 1482–1486. DOI: https://doi.org/10.21437/Interspeech.2016-483. S. Taylor, T. Kim, Y. Yue, M. Mahler, J. Krahe, A. Rodriguez, J. Hodgins, and I. Matthews. July. 2017. A deep learning approach for generalized speech animation. ACM Trans. Graph. (TOG). 36, 4. DOI: https://doi.org/10.1145/3072959.3073699. H. Tennent, S. Shen, and M. Jung. 2019. Micbot: A peripheral robotic object to shape conversational dynamics and team performance. In 2019 14th ACM/IEEE International Conference on Human–Robot Interaction (HRI). IEEE, 133–142.

References

309

M. ter Maat, K. P. Truong, and D. K. G. Heylen. 2010. How turn-taking strategies influence users’ impressions of an agent. In Intelligent Virtual Agents. IVA 2010. Springer, Berlin, Heidelberg. DOI: https://doi.org/10.1007/978-3-642-15892-6_48. J. Tewell, J. Bird, and G. R. Buchanan. 2017. The heat is on: A temperature display for conveying affective feedback. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 1756–1767. DOI: https://doi.org/10.1145/3025453.3025844. M. Teyssier, G. Bailly, C. Pelachaud, and E. Lecolinet. 2018. MobiLimb: Augmenting mobile devices with a robotic limb. In P. Baudisch, A. Schmidt, and A. Wilson (Eds.), The 31st Annual ACM Symposium on User Interface Software and Technology, UIST 2018, Berlin, Germany, October 14–17, 2018. ACM, 53–63. DOI: https://doi.org/10.1145/3242587.3242626. M. Teyssier, G. Bailly, C. Pelachaud, and E. Lecolinet. 2020. Conveying emotions through device-initiated touch. IEEE Trans. Affect. Comput. 01. DOI: https://doi.org/10.1109/TAFFC. 2020.3008693. N. M. Thalmann, P. Kalra, and M. Escher. 1998. Face to virtual face. Proc. IEEE 86, 5, 870–883. DOI: https://doi.org/10.1109/5.664277. K. Thórisson. 1997. Layered modular action control for communicative humanoids. In Computer Animation’97. IEEE Computer Society Press, Geneva, Switzerland. DOI: https:// doi.org/10.1109/CA.1997.601055. O. Torres, J. Cassell, and S. Prevost. 1997. Modeling gaze behavior as a function of discourse structure. In In Proceedings of the First International Workshop on Human– Computer Conversations. N. Tsapatsoulis, A. Raouzaiou, S. Kollias, R. Cowie, and E. Douglas-Cowie. 2002. Emotion recognition and synthesis based on MPEG-4 FAPs. In I. S. Pandzic and R. Forcheimer (Eds.), MPEG-4 Facial Animation—The Standard, Implementation and Applications. John Wiley & Sons. DOI: https://doi.org/10.1002/0470854626.ch9. L. Valbonesi, R. Ansari, D. McNeill, F. Quek, S. Duncan, K. McCullough, and R. Bryll. September. 2002. Multimodal signal analysis of prosody and hand motion: Temporal correlation of speech and gestures. In European Signal Processing Conference (EUSIPCO 02). Toulouse, France, 75–78. A. Vidal, A. Salman, W.-C. Lin, and C. Busso. October. 2020. MSP-Face corpus: A natural audiovisual emotional database. In ACM International Conference on Multimodal Interaction (ICMI 2020). Utrecht, The Netherlands. DOI: https://doi.org/10.1145/3382507. 3418872. H. Vilhjálmsson, N. Cantelmo, J. Cassell, N. E. Chafai, M. Kipp, S. Kopp, M. Mancini, S. Marsella, A. N. Marshall, C. Pelachaud, Z. Ruttkay, K. R. Thórisson, H. van Welbergen, and R. J. van der Werf. 2007. The Behavior Markup Language: Recent developments and challenges. In International Workshop on Intelligent Virtual Agents. Springer, 99–111. DOI: https://doi.org/10.1007/978-3-540-74997-4_10. K. Wada, T. Shibata, K. Sakamoto, and K. Tanie. 2006. Long-term interaction between seal robots and elderly people—Robot assisted activity at a health service facility for the aged. In K. Murase, K. Sekiyama, T. Naniwa, N. Kubota, and J. Sitte (Eds.), Proceedings of

310

Chapter 8 Multimodal Behavior Modeling for Socially Interactive Agents

the 3rd International Symposium on Autonomous Minirobots for Research and Edutainment (AMiRE 2005). Springer Berlin, Heidelberg, 325–330. H. Wallbott. 1998. Bodily expression of emotion. Eur. J. Soc. Psychol. 28, 879–896. DOI: https://doi.org/10.1002/(SICI)1099-0992(1998110)28:63.0.CO;2-W. C. Willemse, A. Toet, and J. van Erp. 2017. Affective and behavioral responses to robotinitiated social touch: Toward understanding the opportunities and limitations of physical contact in human–robot interaction. Frontiers in ICT. DOI: https://doi.org/10.3389/ fict.2017.00012. L. Williams. August. 1990. Performance-driven facial animation. Comput. Graph. 24, 4, 235–242. DOI: https://doi.org/10.1145/97879.97906. S. Yohanan and K. MacLean. 2012. The role of affective touch in human–robot interaction: Human intent and expectations in touching the haptic creature. Int. J. Soc. Robot. 4, 163–180. DOI: https://doi.org/10.1007/s12369-011-0126-7. Y. Yoon, W.-R. Ko, M. Jang, J. Lee, J. Kim, and G. Lee. 2019. Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots. In 2019 International Conference on Robotics and Automation (ICRA). IEEE, 4303–4309. DOI: https://doi.org/10.1109/ICRA.2019.8793720. F. Yunus, C. Clavel, and C. Pelachaud. July. 2019. Gesture class prediction by recurrent neural network and attention mechanism. In International Conference on Intelligent Virtual Agents (IVA 2019). Paris, France, 233–235. DOI: https://doi.org/10.1145/3308532.3329458. A. Zadeh, R. Zellers, E. Pincus, and L.-P. Morency. 2016. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intell. Syst. 31, 6, 82–88. DOI: https://doi.org/10.1109/MIS.2016.94. A. Zadeh, P. Liang, J. Vanbriesen, S. Poria, E. Tong, E. Cambria, M. Chen, and L.-P. Morency. July. 2018. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In ACM Association for Computational Linguistics (ACL 2004). Vol. 1. Melbourne, Australia, 2236–2246. DOI: https://doi.org/10.18653/v1/P18-1208.

PART III
SOCIAL COGNITION—MODELS AND PHENOMENA

9 Theory of Mind and Joint Attention
Jairo Perez-Osorio, Eva Wiese, and Agnieszka Wykowska

9.1 Social Cognitive Neuroscience and SIA

Increasing technological progress in the last decades has facilitated the development of artificial agents. Social robots, virtual agents, and smart assistants have been introduced slowly but firmly into our daily lives. From entertainment to education, from healthcare to the conquest of the solar system, artificial agents are becoming increasingly essential to the human social landscape. However, before they can become fully integrated into our lives, it is important to consider how to measure the impact that interactions with these agents might have on human cognition, and how to evaluate whether the behavior of artificial agents has the desired effects on everyday life. In order to attain an overarching comprehension of the dynamics of social interactions with artificial agents, research in human–agent interaction (HAI) would immensely benefit from the methods and approaches used in social cognitive neuroscience (SCN). This discipline studies the intricate interplay between social and neurophysiological aspects when the brain is engaged in social-cognitive processing during interactions with others, using objective behavioral and neurophysiological measures in carefully designed and controlled experiments. SCN is characterized by hypothesis-driven experiments that manipulate experimental variables targeting specific cognitive processes, and it aims at interpreting behavioral responses and their neural correlates in the context of theoretical models of social cognition. Adopting methods of SCN for the study of social interactions with artificial agents offers multiple advantages.
First, in addition to other methods commonly used in HAI studies, SCN methods allow examining cognitive mechanisms that are not necessarily explicit or available to introspection. Specifically, although HAI methods such as subjective ratings, surveys, and questionnaires allow the assessment of attitudes, experiences, or perceptions during interactions with artificial agents, using only those methods does not cover the more implicit processing of information in social interactions (e.g., gaze, posture, voice pitch, turn-taking). Furthermore, the normative interpretation and understanding of these signals is typically learned implicitly through experience, which can make it difficult for participants to describe them verbally when explicit assessment techniques are used. For instance, could you say how long mutual gaze, or a handshake with a stranger, can last before you start to experience it as uncomfortable? You probably have not thought about this before, or measured it empirically, which makes it difficult to give a precise answer. Thus, although self-reported measures are easily obtained and suitable for the assessment of some aspects of social cognition, they may be insufficient to evaluate all the different layers of social interactions with natural and artificial agents.

Another challenge in the assessment of social cognition is to implement paradigms that are capable of capturing the dynamic and proactive nature of social interactions, involving predictive processes that rely on inferences regarding others' intentions and mental states. SCN has revealed that when interacting with the world, human brains constantly select, process, and compare sensory inputs to previous representations of knowledge and experiences to build accurate representations of the world, in a dynamic cycle that updates priors and adjusts predictions of future events [Friston 2005]. Given that these processes also unfold when interacting with the social world, any paradigm or method that does not appropriately elicit or allow for the dynamics of this process to unfold may not accurately assess the underlying social-cognitive mechanisms [Schilbach et al. 2013]. For instance, mutual gaze during conversations varies depending on the context and the interlocutor, and people are usually unaware of those variations: a cognitively demanding conversation elicits less mutual eye contact than chitchatting, and people prefer to look longer into the eyes of familiar than unfamiliar people [Beattie and Ellis 2017]. Coming back to the original question of how long mutual gaze can last before starting to feel uncomfortable, this example shows that it is impossible to define an ideal time window, as such behavior is dynamic and linked to social context and personal experience. However, we can examine how humans engage in mutual gaze using well-designed experimental protocols that use objective behavioral and neurophysiological measures and that do not restrict the natural dynamics of social interactions.

Another way in which SCN informs HAI is by elucidating the brain structures involved in social cognition using diverse methods such as electroencephalography, functional magnetic resonance imaging, and functional near-infrared spectroscopy. The areas and networks involved in the processing of social stimuli have been collectively termed "the social brain," which includes structures such as the medial prefrontal cortex, the temporoparietal junction, the superior temporal sulcus, the fusiform area, and the inferior temporal sulcus, among others (see Figure 9.1). These networks show characteristic patterns of activation during the processing of social signals commonly used to communicate interest, highlight the relevance of events or objects in the environment, or coordinate interactions, such as biological motion, facial expressions, and eye and head movements. Furthermore, those signals are tightly linked to higher cognitive processes such as recognizing others' feelings and internal states, identifying others' intentions, or deciding whether they are friend or foe. Understanding the neurobiological basis of behavior is crucial for cognitive, developmental, clinical, comparative, and social psychology, as well as philosophy and evolutionary anthropology [Singer 2012]. In this context, examining the engagement of social brain areas in interactions with artificial agents seems like a natural step for HAI as well. By combining subjective/explicit measures such as self-reports and questionnaires with objective measures such as metrics of performance, behavior, psychophysiology, and neuroimaging, HAI can obtain a more comprehensive view of the behavioral and brain mechanisms involved in social interactions with human and non-human agents.

Figure 9.1 Areas of the brain associated with processing social information. The brain areas linked with social processing are divided into four cognitive processes: (1) the perception of basic social stimuli, such as biological motion (V5), parts of the body (extra-striate body area, EBA), and faces (fusiform face area, FFA); (2) emotional and motivational appraisal, where the amygdala (AMY), the anterior insula (AI), and the subgenual and perigenual anterior cingulate cortex (ACC), together with the orbitofrontal cortex (OFC), are closely linked with subcortical structures such as the ventral striatum (VS) and the hypothalamus (HTH); (3) goal-directed, adaptive behaviors and categorization processes, in which the emotional and motivational appraisal areas work closely with regions such as the dorsolateral and medial prefrontal cortex (dlPFC, mPFC) and the ACC; and (4) social attribution, in which areas like the ventral premotor cortex (vPMC), the superior temporal sulcus (STS), the AI, the posterior cingulate cortex (PCC), and the precuneus (PC) participate in more automatic, bottom-up inferences of other people's mental states, whereas structures like the mPFC and the temporoparietal junction (TPJ) are involved in more cognitive theory of mind skills. Adapted from Billeke and Aboitiz [2013].

The present chapter provides an overview, from the perspective of SCN, of theory of mind (ToM) and joint attention (JA) as crucial mechanisms of social cognition and discusses how these mechanisms have been investigated in social interaction with artificial agents. In the final sections, the chapter reviews computational models of ToM and JA in social robots (SRs) and intelligent virtual agents (IVAs) and discusses the current challenges and future directions.

9.2 Theory of Mind and Joint Attention—Crucial Mechanisms of Social Cognition

9.2.1 Theory of Mind

Imagine that as you are walking down the street someone is coming toward you from the opposite direction. At some point, before you pass each other, the person all of a sudden puts the palm of their hand on their forehead, stops walking, turns around, and goes back to where they came from. How would you explain this behavior? Probably, you would guess that they might have forgotten something and decided to go back. And you might be right. Importantly, most of the explanations you would choose for the observed behavior would refer to mental states, such as thoughts, preferences, intentions, or emotions. This is based on the ability to perceive and understand that others have beliefs, desires, goals, and knowledge different from your own and that others' behavior is driven by their internal representations of the world. Social interaction heavily relies on awareness of our counterpart's knowledge of the world [Frith et al. 1991, Baron-Cohen et al. 1995]. The ability to refer to others' mental states when explaining their behavior
has been termed mentalizing [Frith and Frith 2006] or using a ToM [Baron-Cohen 1997]. ToM is the basis for a wide range of social processes such as competitive and cooperative joint actions, language, action execution, imagination, and even humor. The strategy of referring to mental states to predict others' behavior has also been called adopting the intentional stance [Dennett 1971, 1987]. Please note, however, that we distinguish the concept of "intentional stance" from the concept of "theory of mind." The former refers to the general strategy that one adopts when explaining the behavior of another agent, based on assumptions regarding the agent's rationality and capacity for having mental states. The latter, on the other hand, is the active process of inferring a particular mental state in a particular context. One can infer a wrong mental state based on observation of the other's behavior (and thus fail a ToM test, see below), but still adopt the intentional stance toward the agent in general.

Now, imagine that you are asking a humanoid robot for directions at the counter of a train station. You are interested in knowing about restaurants nearby, and the robot manages to answer your questions. After an effortless dialogue, the robot says: "I will give you a map," turns around, and looks for something inside a shelf. All of a sudden, the robot stops every movement, turns back toward you, and says: "Good morning, how can I help you?" How would you explain this behavior to someone else? You would probably describe it in terms of the robot's presumed internal states and say something like: "The robot forgot that I was there waiting for the map." Most people facing a situation like this would also use mentalistic terms to describe the behavior of the robot. In fact, it is very intuitive for humans to attribute human intentions, preferences, capacities, and emotions to non-human agents and to interact as if those agents actually had a mind. However, it might be that attributions of more cognitive states, such as thoughts or intentions, are more likely than attributions of more phenomenal or affective states, such as pain or happiness [Huebner 2010].

The tendency to attribute mental states to non-human agents is part of the process called anthropomorphism. When an entity is anthropomorphized, its behavior, as well as inferences drawn from it, is interpreted in human-centric terms [Epley et al. 2007]. If paper constantly jams in a printer, for instance, an anthropomorphic interpretation would be that "the printer refuses to work," or even "that stubborn printer." A technical or more detailed explanation of the same behavior could feel artificial and be more complex, preventing effective communication. Humans have extensive experience understanding other humans' minds, which is probably why mentalistic explanations for the behavior of non-human entities often seem intuitive. Note, however, that typically people do not believe that cars or other complex non-human systems have internal mental states, but they often explain behavior and natural phenomena within
their "mentalistic comfort zone." In summary, humans have a tendency to interact with non-human agents as if they had mental states. This ability facilitates prediction, understanding, and interaction with social counterparts.

9.2.1.1 Developing a Theory of Mind

From very early stages of development, we learn to read others' minds. This capability is critical for cognitive development: it provides foundations for language acquisition, allows differentiating between self and others, and is crucial for social interactions. Interestingly, representing the mental states of others occurs effortlessly, automatically, and unconsciously, which explains why others' behavior is often intuitively explained in mentalistic terms even when such attributions are questionable (as in the case of mindless artificial systems, such as a printer or a car). In the past century, several disciplines have extensively studied the effects and origins of human ToM. Initial cognitive approaches by Heider [1958], for instance, suggested that people have a general understanding of others' ideas and actions in particular situations—a "common-sense or folk psychology" that helps them deal effectively with social situations. This ability is firmly anchored in the assumption that beliefs and intentions play an active role in others' behavior, together with subjective experiences and perceptions of the environment. Such inferences transcend the observed behavior and prove useful in predicting and understanding others' actions. Subsequently, Premack and Woodruff [1978] defined ToM as the ability to reason about others' behavior and mental states, based on observations during experiments with chimpanzees.

9.2.1.2 Tasks Used to Assess Theory of Mind Capabilities during Development

Various tasks have traditionally been used to measure ToM capabilities during development, focusing on different aspects of the process. Some tasks have been created to evaluate the capability of participants to infer others' mental states (i.e., mentalizing). In such tasks, participants are usually exposed to descriptions of social situations and are asked to predict the mental states of the characters involved. Examples for this category are the false-belief task [Wimmer and Perner 1983] or the strange stories task [Happé 1994]. The level of difficulty of these tasks depends on the target population; that is, more complex versions of these tasks exist for adults versus children, and cognitive demands, culture, education, and language skills may also affect the test results. For instance, in the Sally–Anne false-belief task [Wimmer and Perner 1983], one of the most widely used tasks to evaluate ToM, children are told a story in which a character, Sally, puts an object inside a basket before leaving the room. Once Sally is out of the room, another character, Anne, moves the object to a box. At this point, the children are asked
where Sally would look for the object upon her return. Initial findings revealed that four-year-olds can compute the perspective or the (false) beliefs of others, revealing that ToM abilities have developed.

Another group of tests evaluates the capability to detect and interpret social signals, as these are pivotal in mindreading. Gaze following [Bayliss et al. 2007, Frischen et al. 2007] or identification of emotions from visual [e.g., Baron-Cohen et al. 1995, De Sonneville et al. 2002, Bayliss and Tipper 2006] or auditory stimuli [Nowicki and Carton 1993, Scherer and Scherer 2011] are examples of this category. Similarly, tests like Reading the Mind in the Eyes [Baron-Cohen 1997] or Reading the Mind in Film [Golan et al. 2006] measure whether participants can accurately identify internal states from social signals.

While these traditional tasks offer highly controlled measures of participants' ToM abilities, they lack ecological validity. One major challenge with these paradigms is that the evaluation of ToM is based on experimental protocols with a spectatorial perspective, meaning that participants are placed in the role of a passive observer rather than partaking in a dynamic social interaction. Furthermore, the evaluation of ToM in those paradigms takes place in non-social contexts, potentially delivering a biased assessment of social cognition compared to real-life situations. More recent approaches, such as the second-person neuroscience framework [Schilbach et al. 2013], stress the importance of allowing natural, reciprocal interactions in experimental paradigms when trying to understand the mechanisms of social cognition. Evidence from clinical studies supports the postulates of the second-person neuroscience framework, showing that although children with an autism spectrum condition are capable of passing the traditional lab-based false-belief tasks at a mental age of six years [Happé and Frith 2014], adolescents and adults with Asperger syndrome still have difficulties in mentalizing with others during naturalistic interactions [Ponnet et al. 2004].

9.2.1.3 Neural Correlates of Theory of Mind

Recent studies have revealed two main brain regions (see Figure 9.1) involved in ToM: the paracingulate cortex, involved in processing one's own and others' mental states, and the temporoparietal junction, linked to identifying actions and intentions produced by biological agents. The wider network of brain areas involved in ToM tasks includes the dorsomedial prefrontal cortex, temporoparietal junction, superior temporal sulcus, ventromedial prefrontal cortex, and posterior cingulate cortex [Amodio and Frith 2006, Frith and Frith 2006, Blakemore 2008, Van Overwalle and Baetens 2009]. These areas have been reported to be activated during various mentalizing tasks, such as making inferences about others' preferences [Mitchell et al. 2002, Jenkins et al. 2008], reading stories about others' mental states
[Saxe and Kanwisher 2003], interactive games that require reasoning about intentions [Hampton et al. 2008, Chang et al. 2011, Sul et al. 2017], or watching movies that require inferring characters’ mental states [Pantelis et al. 2015, Richardson et al. 2018].

9.2.2 Joint Attention

Cognitive science has also extensively investigated the phenomenon of JA, which is closely related to, and a precursor of, ToM. JA takes place when two individuals coordinate their attentional processes to conjointly attend to the same object or situation in the environment. JA is fundamental for the acquisition of language: the caregiver says a word for a given object out loud and uses their gaze to guide the child's attention to the object in the environment, establishing an association between the spoken word and the object it represents [Baron-Cohen 1997]. JA is also a pivotal precursor of joint action and mental state attribution, as attentional cues such as gaze communicate a partner's focus of attention and allow inferences about their intentions (e.g., looking at a food item might mean that the gazer is hungry) and action goals (e.g., looking at a coffee cup might predict an upcoming grasping action).

9.2.2.1 Tasks for Measuring JA Capabilities

To examine attentional processes underlying JA, the gaze-cueing paradigm has been used extensively: participants observe a centrally presented face on the screen that first looks at them and then changes its gaze direction to the left or right side of the screen, thereby either validly or invalidly cueing a subsequent target probe [Friesen and Kingstone 1998, Driver et al. 1999, Emery 2000, Frischen et al. 2007]. The standard observation is that targets appearing at the gazed-at location (validly cued trials) are processed faster and more accurately than targets appearing elsewhere (invalidly cued trials); the resulting difference in reaction times between these two conditions is termed the gaze-cueing effect. The gaze-cueing effect is explained as the consequence of enhanced attentional orienting in response to the change in gaze direction, which functions as a spatial cue: when the gaze is directed to a location, the observer's attentional focus is shifted there, and this facilitates the sensory processing of stimuli that subsequently appear at the attended location. When a stimulus appears at an uncued location, on the other hand, the observer's attentional focus first needs to be shifted from the cued location to the target location; this additional time for shifting the attentional focus to the target is reflected in reaction times.
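
For readers who want to see how this measure is computed in practice, the short sketch below derives a gaze-cueing effect from trial-level reaction times. It is a minimal illustration: the data layout, column names, and toy values are assumptions made for this example and are not taken from any particular study.

```python
# Minimal sketch of computing a gaze-cueing effect from trial-level data.
# The DataFrame layout and column names ("validity", "rt", "correct") are
# illustrative assumptions, not the format of any specific experiment.
import pandas as pd

def gaze_cueing_effect(trials: pd.DataFrame) -> float:
    """Mean reaction-time difference (invalid minus valid) on correct trials."""
    correct = trials[trials["correct"]]                 # error trials are typically excluded
    mean_rt = correct.groupby("validity")["rt"].mean()  # mean RT per cue condition
    return mean_rt["invalid"] - mean_rt["valid"]        # positive value = benefit of valid cueing

# Toy data: reaction times in milliseconds
trials = pd.DataFrame({
    "validity": ["valid", "invalid", "valid", "invalid", "valid", "invalid"],
    "rt":       [305.0,   342.0,     298.0,   351.0,     311.0,   339.0],
    "correct":  [True,    True,      True,    True,      False,   True],
})
print(f"Gaze-cueing effect: {gaze_cueing_effect(trials):.1f} ms")
```

In practice, analyses of gaze-cueing data typically also exclude outlier reaction times and compare the size of this effect across experimental conditions, but the core dependent measure remains this valid versus invalid difference.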

Although it has been argued that attention is reflexively oriented by social stimuli, such as changes in gaze direction (see Friesen and Kingstone [1998]), multiple studies (e.g., Wiese et al. [2014] and Kuhn et al. [2018]) have shown that attentional orienting to gaze cues can be controlled in a top-down manner when the context in which the interaction takes place is sufficiently social and provides information about the social relevance of the cue [Wiese et al. 2013]. For example, attentional orienting to gaze cues is enhanced when they are believed to be intentional as opposed to random or unpredictable [Teufel et al. 2010, Wiese et al. 2012, 2014, Perez-Osorio et al. 2015, 2017; see Capozzi and Ristic 2020, for a review].

9.2.2.2 Neural Correlates of JA

Several specialized cerebral mechanisms have been postulated as the basis of socio-cognitive mechanisms related to gaze processing and gaze-induced JA. Neuroimaging studies in humans show that the superior temporal sulcus region is implicated in processing various face signals, such as changes in gaze direction or facial expression [Puce et al. 1998, Hooker et al. 2003, Pelphrey et al. 2003; see Allison et al. 2000, for a review]. The intraparietal sulcus (IPS), which is generally activated during covert shifts of attention [Nobre et al. 1997, Corbetta et al. 1998], is also involved in JA, specifically in shifting the observer's attention to the gazed-at location [Puce et al. 1998, Wicker et al. 1998, Hoffman and Haxby 2000, George et al. 2001, Hooker et al. 2003, Pelphrey et al. 2003]. In support of this notion, a functional magnetic resonance imaging study by Hoffman and Haxby [2000] reported that passive viewing of faces with averted gaze elicited a significantly more robust response in bilateral IPS and left superior temporal sulcus than passive viewing of faces with direct gaze, indicating that these brain areas are specialized in processing averted gaze. Further, Puce and colleagues [1998] showed that the inferior temporal sulcus is particularly sensitive to eye movements. Hence, the inferior temporal sulcus and superior temporal sulcus seem to be specialized in processing gaze direction, while the IPS may be preferentially involved in attentional orienting in response to gaze cues.

9.3 Theory of Mind in Artificial Agents

9.3.1 Evoking Human Social Cognition Mechanisms during HRI

9.3.1.1 Theory of Mind in HRI

The implementation of social cognition in SIA can be characterized by the simulation of social interactions. This approach relies on scripted social behavior, usually created based on human-like behavior, to endow artificial agents with social signals,
actions, and language. Through pre-scripted behavior that is contingent on the experimental conditions exhibited during social interactions, or through experimenter-controlled protocols referred to as Wizard-of-Oz studies [Dahlbäck et al. 1993], the social behavior of artificial agents becomes dynamic and interactive within well-defined experimental scenarios. Multiple studies have shown that simulating social interactions with SRs can trigger ToM-related processes in HRI [Hegel et al. 2008, Byom and Mutlu 2013, De Graaf and Malle 2019]. Manipulating beliefs about an agent's capacity for having internal states, or exploring the effects of communicative gestures in cooperative or competitive interactions, for instance, have been used to understand whether and under which conditions people mentalize with artificial agents. These studies help to clarify whether people spontaneously adopt a mentalistic approach toward artificial agents and what factors contribute to the likelihood of attributing mental states to SIA.

Many studies have examined to what extent humans interpret social signals displayed by SRs in mentalistic terms and whether these signals are used to infer a robot's mental state. As an example, Mutlu and colleagues [2009] investigated whether non-verbal cues during an interactive game would elicit ToM inferences in participants, leading them to follow the cues of the SR. In this experiment, the participants' task was to guess which object the robot had chosen from objects depicted on top of a table. The results showed that participants were faster and more accurate when the robot used non-verbal social cues relative to no social cues. Interestingly, when asked afterwards, most of the participants reported that they had not explicitly paid attention to the cues or actively used them to complete the task. Other interactive protocols showed that humans tend to take into account the behavior and, arguably, the internal states of SRs when trying to coordinate or synchronize their actions with them during joint task execution [Xu et al. 2016, Ciardo et al. 2020]. Schreck et al. [2019], for instance, evaluated whether an SR's social behavior, the type of social signals it displayed, and its proxemics (how close the SR would get to people) affected the likelihood of ToM-related interpretations. They found that increased experience with a robot through continued interaction decreased the likelihood of mental state attributions, unless the robot showed more socially active behaviors (getting close to people when interacting) as well as more human-like expressions, which triggered stable levels of mental state attributions across the experiment.

Instead of looking at the effect of social signals on mentalizing, more recent approaches ask a broader question, namely which physical and behavioral features SRs need to display in order to be perceived as an entity "with a mind," capable of displaying internal states [Epley et al. 2007, Gray et al. 2007, Złotowski et al. 2015]. When non-human agents are treated as agents with a mind, humans adopt
the intentional stance [Dennett 1987] toward them and interpret their behavior based on the assumption that it is motivated by internal states such as beliefs, desires, or intentions. Given the general capacity of SIAs to display physical and behavioral signs of human-likeness, and the human tendency to anthropomorphize non-human entities, it is plausible to assume that humans would use mentalistic strategies to explain and predict the behaviors of artificial agents [Perez-Osorio and Wykowska 2019, for a review]. While this assumption is theoretically plausible, empirical evidence is mixed. The seminal study by Gallagher and colleagues [2002] observed differential activation in the anterior cingulate cortex, previously linked with mentalizing tasks, when participants believed that they were playing rock-paper-scissors against a human compared to playing against a rule-solving program or a random response generator. Importantly, all the conditions were controlled by the same algorithm, and the only difference between them was the stance of the participants toward the opponent. However, Chaminade et al. [2012] used a similar paradigm in which the rule-solving opponent was replaced with a humanoid robot and reported no differences in brain activation between the humanoid robot and the random response generator. Furthermore, Krach and colleagues [2008], who used the Prisoner's Dilemma instead of rock-paper-scissors, found that the medial prefrontal cortex and left temporoparietal junction, associated with the attribution of the intentional stance and part of the ToM network, were only activated in response to humans and not during the interaction with artificial agents (a humanoid robot, a functional robot without human-like appearance, and a computer).

More recent evidence suggests that in addition to belief manipulations, other cues that manipulate social context information also have the potential to trigger participants to adopt the intentional stance toward SIA. For example, a recent study manipulated the intentional stance by presenting human and SR agents embedded in different scenes and asking participants to score the plausibility of different explanations for the agents' behaviors; it showed no difference in participants' tendencies to adopt mentalistic explanations toward humans versus SRs [Thellman et al. 2017]. In a similar study, Marchesi et al. [2019] showed that people spontaneously adopt the intentional stance toward humanoid robots in some contexts, using a novel questionnaire, the InStance Questionnaire (ISQ), that was specifically developed to measure people's tendency to adopt the intentional stance toward a robot [specifically, the humanoid robot iCub, Metta et al. 2010]. In the questionnaire, participants observe a series of pictures showing a sequence of events involving iCub and are then asked to judge whether its behavior was motivated by a mechanical (e.g., malfunction, calibration) or mentalistic reason (e.g., desire, curiosity), with the latter explanation indicating adoption of the intentional stance (see Figure 9.2). Results showed that although participants more often gave
mechanistic explanations for iCub's behaviors, some behaviors evoked mentalistic interpretations. Interestingly, inter-individual differences might have also played a role in the likelihood of adopting the intentional stance toward the robot.

Figure 9.2 Examples of scenarios from the InStance Questionnaire. Under each scenario, participants chose the explanation that better described the behavior of the robot, either a mentalistic or a mechanistic statement. For example, in Panel (a) the options were "iCub was trying to cheat by looking at the opponent's cards" for the mentalistic description and "iCub was unbalanced for a moment" for the mechanistic description. Panels (b) and (c) show other examples of scenarios included in the questionnaire. Copyright © 2019 Marchesi, Ghiglino, Ciardo, Perez-Osorio, Baykara and Wykowska, Istituto Italiano di Tecnologia (CC BY 4.0).

Another crucial factor that has been hypothesized to influence adopting the intentional stance, and that can be directly influenced via robot design, is whether the behavior of a robot seems human-like [Złotowski et al. 2015]. Wykowska et al. [2015], for instance, showed that variable temporal characteristics of gaze behavior lead participants to judge a robot's behavior as more human-like compared with less variable eye movements (see also Ghiglino et al. [2020]). Willemse et al. [2018] also reported that participants anthropomorphized and liked robots more
that followed the participants' gaze during an interactive experiment, a behavior that exhibited typical human-like reciprocity. Perez-Osorio et al. [2019] further showed that when participants had high expectations regarding the behavior of the robot, their scores on the ISQ increased after a brief observation of the robot; for participants with lower expectations, ISQ scores decreased after the observation. Collectively, the evidence suggests that people can (and sometimes do) attribute mental states to SRs and employ these attributions during social interactions. Human-like shape and behavior might facilitate the attribution of mental states to robots but might not be sufficient; rather, the type of interaction and the social signals exhibited play a crucial role in this process, as do individual attitudes [De Graaf et al. 2016] and imageries of robots [De Graaf and Malle 2019].

9.3.1.2 Joint Attention in HRI

Several studies have examined JA in HAI and have shown that people can identify and follow non-human gaze and discriminate whether the SR is looking at them or at a different location (see Admoni and Scassellati [2017] for a review). Findings also suggest that robot gaze can be used to communicate information about relevant events and targets in the world—similar to human gaze. Although some studies have shown that the directional gaze of two different robots failed to elicit reflexive attentional orienting (e.g., Admoni et al. [2011] and Okumura et al. [2013]), several other studies have consistently shown engagement in JA with artificial agents. This occurs in screen-based studies [Wiese et al. 2012, 2019], but also when using embodied humanoid agents in interactive protocols [Wykowska et al. 2015, Kompatsiari et al. 2018, Chevalier et al. 2020; Kompatsiari et al. 2021 for a review]. Various experimental conditions can modulate certain aspects of social attention, which results in empirical findings that vary depending on which paradigm has been used and how it has been implemented. For instance, the robot's believed intentionality appears to modulate gaze effectiveness in orienting attention (i.e., the extent to which gaze orients attention) [Wiese et al. 2012, Wykowska et al. 2014]. Participants also respond more favorably to robots that display socially engaging gaze: for example, in the series of studies by Kompatsiari and colleagues [2017], gaze-cueing effects were modulated by whether or not the robot engaged in mutual gaze with the participants prior to directing their attention to one of the locations where the target could appear (the gaze-cueing procedure). Willemse et al. [2018] as well as Willemse and Wykowska [2019] showed that the degree of contingency of the robot's gaze on the participants' gaze direction also influenced JA. Finally, Perez-Osorio et al. [2018] showed that action expectations also affect the magnitude of the gaze-cueing effect. Huang and Mutlu [2012]
also found that participants recalled the details of a story better when the SR used congruent speech and gaze cues than when the cues were spatially or temporally incongruent. Similarly, Mutlu et al. [2013] found that participants responded faster and understood instructions better when the SR used verbal and visual cues. These findings indicate that humans can engage in JA with artificial agents such as robots and that non-verbal social cues can be beneficial for human–robot interaction.

9.4 Modeling Social Cognition

9.4.1 Implementing Theory of Mind in SR

Considering the strong bias that humans have to interpret others' behaviors in anthropomorphic or mentalistic terms, it is natural to assume that in social interactions with artificial agents people would employ the same social-cognitive mechanisms as in interactions with humans. Roboticists might have followed a similar intuition when deciding how best to design SRs: they aim to create robots that can communicate using human-like social signals and to equip robots with social skills and cognitive capabilities comparable to humans in order to facilitate social interactions. For example, Scassellati [2002] suggested that endowing a robot with ToM would be very beneficial for social interactions, as robots could use such a model not only to understand human behavior and communicate efficiently with humans, but also to learn from social interactions the same way that infants learn from their parents. Endowing an SR with ToM would not only allow robots to generate internal representations of humans' mental states and to respond appropriately to these mental states, but it would also help robots to interact smoothly and fluently with humans. For that purpose, Scassellati extracted the most relevant aspects of traditional psychological models of ToM from developmental cognitive science [e.g., Baron-Cohen et al. 1995, see Figure 9.3] and aimed to create analogous structures in artificial robot systems.

Figure 9.3 Theory of mind model based on Baron-Cohen et al. [1995]. Detection of stimuli in the Intentionality Detector (ID) module and the Eye Direction Detector (EDD) module constitutes the basic input for the model. Representations created in this first layer feed the Shared Attention Mechanism (SAM) to build triadic representations. In the final layer, the Theory of Mind Mechanism encodes and stores representations to create a theory about others' mental states and beliefs. The levels increase in complexity and mature sequentially during development (based on Baron-Cohen et al. [1995]).

The ToM model formulated by Baron-Cohen et al. [1995] proposes that humans develop a mindreading system consisting of four different modules: the intentionality detector (ID), the eye direction detector (EDD), the shared attention mechanism (SAM), and the theory of mind mechanism (ToMM). The ID recognizes entities in the environment that exhibit biological motion and is able to detect self-propelled motion and goal-oriented behaviors; it can thus identify an organism with volition or agency [Premack 1990]. The EDD automatically detects the presence of eyes/faces in the visual field and decides whether eye-like stimuli are "looking at me" or "looking at something else," and thus whether an agent shows mutual gaze (signaling
readiness to engage) or averted gaze (trying to shift the observer's attention to potential objects of interest). ID and EDD become functional earlier in development and precede the maturation of SAM and ToMM. SAM receives input from both ID and EDD to determine whether two biological interaction partners conjointly attend to the same event or object in the environment, thus creating a triadic representation (self, other, object) out of different dyadic representations (self/other, self/object, other/object). SAM usually develops at 9 to 14 months of age and allows JA behaviors, such as proto-declarative pointing and gaze monitoring. Importantly, SAM allows the agent to interpret the gaze shifts of others as intentional and, consequently, to form intentional representations (e.g., "she wants to..."), which highlights the importance of gaze perception for the successful inference of others' mental states. The most advanced module, the ToMM, enables representing and integrating the full set of mental state concepts into a "theory"; it creates representations of others' beliefs and desires but also allows for formulating knowledge states that are neither necessarily true nor match the knowledge of the agent (i.e., imagination and creativity). The ToMM forms later in development (between 2 and 4 years of
age) and allows pretend-play [Leslie 1987] as well as understanding false beliefs [Wimmer and Perner 1983] and the relationship between mental states [Wellman 1990]. Baron-Cohen's model has proven useful in interpreting typical and atypical development of social skills in humans, autism spectrum condition in particular. An important feature of this model is that it is hierarchical, with representations of different levels of complexity, starting with precursor functions like the detection of biological agents (ID) and gaze signals (EDD), continuing with the maturation of shared attention (SAM), and ending with the successful representation and inference of others' mental states (ToMM).

In the process of implementing a ToM model in artificial agents, Scassellati first introduced the EDD and ID modules, and over the years additional modules have been proposed: modules to distinguish animate from inanimate motion [Scassellati 2001], to share attention [Nagai et al. 2002, Scassellati 2002], to imitate actions as a method for learning motor skills and recognizing human actions [e.g., Schaal 1997, Dautenhahn and Nehaniv 2002, Fod et al. 2002, Billard et al. 2004, Breazeal et al. 2005, Gray et al. 2005, Johnson and Demiris 2005], and to take others' perspective [Gray et al. 2005, Trafton et al. 2005, 2006]. The main challenge consists, however, in generating and integrating different motor and social skills into an articulated architecture able to cope with changing environmental demands and to adapt to multiple and variable social contexts.

Since this early work, the implementation of ToM models in socially interactive agents has progressed considerably. The most recent advancements (for a review, see Bianco and Ognibene [2019]) have resulted in the formulation of more complex cognitive architectures that aim at providing social skills to SIA. In recent years, computational models of social cognition have been used to understand cognitive mechanisms of ToM by simulating a functioning ToM [Newell 1994, O'Reilly et al. 2012, 2016]. These models of ToM vary in their characteristics but typically highlight the importance of detecting social signals conveyed by others (with most of these signals being conveyed by eyes and faces), identifying goals and motivations (i.e., task-related goals), and creating beliefs (based on the state of the world and the estimation of others' knowledge).

For example, Baker et al. [2009] propose a Bayesian theory of mind (BToM), a model that formalizes action understanding as a Bayesian inference problem. This approach models beliefs, goals, and desires as rational probabilistic planning in Markov decision problems (MDPs), and goal inference is performed by Bayesian inversion of this planning model. MDPs are a normative framework for modeling sequential decision-making under uncertainty, commonly used to model human planning and reinforcement learning [Dayan and Daw 2008]. MDPs allow creating representations of an agent's interaction with the environment and encode all relevant information about the
configuration of the world and the agent as state variables, which allows capturing mental models of intentional agents' goal- and environment-based planning. Further, MDPs represent the actions permitted in the environment and determine a causal model of the implications of these actions for the state of the world; they also represent subjective rewards or costs caused by the agent's actions in each state. Using Bayesian inference, the model creates hypothetical representations of the beliefs and desires that could have caused an agent's behavior within a given environment; each hypothesis is associated with a particular goal. For each hypothesis, the model evaluates the likelihood of generating the observed behavior given the hypothesized belief or desire. Then, the model integrates this likelihood with the prior over mental states to infer the agent's joint belief and desire [Baker et al. 2009, 2017]. Although the model integrates beliefs, desires, and goals, it depends strongly on priors regarding the action goals (in contrast with other models of ToM). Importantly, and unlike other models of ToM, BToM performs Bayesian inference over beliefs and desires simultaneously. The cognitive model incorporates the current perceptual states and belief updates in order to modify the initial hypothesis, and it then generates new adjusted predictions in each iteration.

To evaluate the model, the authors tested whether it is able to predict the mental states of an agent performing a decision-making task (i.e., choosing a food truck) displayed in three action frames, and then contrasted these predictions with both human performance and alternative models' performance on the same task. On each trial, an agent was looking for a food truck in three frames (starting point, transition, and goal) in different spatial configurations (layouts). After each trial, participants (and models) were asked to predict the agent's preferences regarding the food trucks and to rate how confident they were in their assessment. The authors contrasted the predictions of the BToM model with humans' performance and assessments and against two model alternatives—TrueBelief (a special case of BToM with a prior that assigns probability 1 to the true world state) and NoCost (another special case of BToM that tests the contribution of the principle of efficiency to ToM reasoning by assuming that the agent's cost of action is zero), as well as one cue-based alternative—Motion Heuristic (which tests whether social inferences are derived from the processing of bottom-up perceptual features). They found that the proposed model successfully predicts the mental states of an agent and generates mental-state judgments similar to those of human participants in a wide variety of environment configurations. These findings, obtained with Bayesian inversion of models of rational agents, suggest that the brain might use similar principles to handle social information, infer others' mental states, and predict their actions. Thus, Bayesian computational models offer a powerful tool to evaluate the inherently predictive functioning of the brain.
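
To make the idea of Bayesian inversion concrete, the sketch below infers a hidden goal from an observed action sequence by combining a prior over goals with the likelihood of each action under a simple softmax-rational action model, that is, P(goal | actions) is proportional to P(goal) times the product over time of P(action_t | state_t, goal). It is a minimal illustration in the spirit of inverse planning: the grid world, the goal locations, the rationality parameter, and the distance-based stand-in for an MDP value function are all invented for this example and are far simpler than the full BToM model.

```python
# Minimal sketch of Bayesian goal inference ("inverse planning"):
# P(goal | actions) ∝ P(goal) * Π_t P(action_t | state_t, goal).
# The grid world, goals, and softmax action model are invented for illustration.
import math

GOALS = {"truck_A": (4, 0), "truck_B": (4, 4)}   # hypothetical goal locations
ACTIONS = {"up": (0, 1), "down": (0, -1), "right": (1, 0), "left": (-1, 0)}
BETA = 2.0                                        # rationality: higher = more goal-directed

def action_likelihood(state, action, goal):
    """Softmax probability of an action, assuming the agent prefers moves that
    reduce Manhattan distance to its goal (a crude stand-in for an MDP value function)."""
    def score(a):
        nx, ny = state[0] + ACTIONS[a][0], state[1] + ACTIONS[a][1]
        return -BETA * (abs(goal[0] - nx) + abs(goal[1] - ny))
    z = sum(math.exp(score(a)) for a in ACTIONS)
    return math.exp(score(action)) / z

def infer_goal(start, observed_actions, prior=None):
    """Posterior over goals given an observed action sequence."""
    prior = prior or {g: 1.0 / len(GOALS) for g in GOALS}
    posterior = dict(prior)
    state = start
    for action in observed_actions:
        for g, loc in GOALS.items():
            posterior[g] *= action_likelihood(state, action, loc)   # multiply in likelihood
        state = (state[0] + ACTIONS[action][0], state[1] + ACTIONS[action][1])
    z = sum(posterior.values())
    return {g: p / z for g, p in posterior.items()}

# An agent starting at (0, 2) that keeps moving up is probably heading to truck_B
print(infer_goal((0, 2), ["right", "up", "up"]))
```

Replacing the distance heuristic with the value function of a solved MDP, and adding explicit belief states over possible world configurations, leads toward the full BToM formulation described above.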


Cangelosi and colleagues have also been developing cognitive architectures incorporating ToM (e.g., Vinanzi et al. [2019]). The authors designed and implemented a biologically inspired artificial cognitive system that incorporates trust and ToM, is supported by an episodic memory system, and is based on developmental robotics. The cognitive system integrates multimodal perception (visual and auditory stimuli) together with a motor module. The visual module detects and recognizes faces through the machine learning algorithms Haar Cascade [Viola and Jones 2001] and Local Binary Pattern Histogram [Ojala et al. 2002]. This cognitive system also has a belief module based on Bayesian belief networks. Representations are stored in the episodic memory module to be retrieved and included in future interactions with new users. Interestingly, the architecture was tested using an experimental paradigm from developmental psychology that was developed to evaluate how much children trust an interaction partner [Vanderbilt et al. 2011]. Only children 5 years of age and older are typically able to pass this test, thanks to the emergence of ToM. To pass the test, children need to differentiate people who give useful cues (helpers) from people who are lying (trickers). The proposed architecture satisfactorily distinguished helpers from trickers, thereby passing the test.

The creation of neurocognitive models of ToM mechanisms with computational simulations and architectures, and the further observation of the effects of these models in interactive situations, has two main advantages. As mentioned above, the implementation of ToM models in SIAs would facilitate interaction with humans, as the models anchor the agents' behavior in predictive identification of action and proactive generation of responses. Furthermore, implementation of such models in SIAs also provides a new tool for understanding the ToM mechanisms in humans during social interactions.

Finally, some cognitive architectures, like the work proposed by Rabinowitz et al. [2018], use ToM neural networks to infer mental states online based on a meta-learning approach. The application of strong priors results in inferences that require only a few observations, adapting quickly to different tasks and behaviors, which brings the architecture closer to human performance. It is important to mention that although this neural network receives only visual inputs, it can solve false belief tasks. Another notable example of biologically inspired architectures is the work of Kahl and Kopp [2018]. The authors propose a mentalizing system for attributing and inferring mental states together with a hierarchical predictive model for online action perception and production that represents the mirror system. While the mentalizing subsystem allows differentiating mental perspectives for "me," "you," and "us," the mirror subsystem adopts the Empirical Bayesian Belief Update model (EBBU) [Sadeghipour and Kopp 2011] for


action observation and production. The architecture allows second-order ToM, that is, representations of beliefs about beliefs, in actual simulations of dynamic interaction.
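As a purely illustrative data-structure sketch of what "representations of beliefs about beliefs" can look like in such a system, the following Python fragment keeps three belief stores: my beliefs, my model of your beliefs, and my model of your beliefs about my beliefs. The Sally-Anne-style object-relocation scenario is an assumed example; this is not Kahl and Kopp's implementation, nor the EBBU model.

```python
from dataclasses import dataclass, field
from typing import Dict

# Minimal illustration of first- and second-order belief bookkeeping.
# The scenario (an object moved while the other agent is not looking) is an
# assumed Sally-Anne-style example, not taken from Kahl and Kopp [2018].

@dataclass
class BeliefStore:
    facts: Dict[str, str] = field(default_factory=dict)  # e.g. {"ball": "basket"}

@dataclass
class Mentalizer:
    me: BeliefStore = field(default_factory=BeliefStore)            # my beliefs
    you: BeliefStore = field(default_factory=BeliefStore)           # my model of your beliefs
    you_about_me: BeliefStore = field(default_factory=BeliefStore)  # my model of your beliefs about my beliefs

    def observe(self, obj: str, location: str, other_sees_it: bool) -> None:
        # Update my own beliefs; propagate to the model of the other agent
        # (and to their model of me) only if the other agent witnessed the event.
        self.me.facts[obj] = location
        if other_sees_it:
            self.you.facts[obj] = location
            self.you_about_me.facts[obj] = location

m = Mentalizer()
m.observe("ball", "basket", other_sees_it=True)   # both of us saw it placed
m.observe("ball", "box", other_sees_it=False)     # moved while you were away
print(m.me.facts["ball"])            # 'box'    (what I believe)
print(m.you.facts["ball"])           # 'basket' (what I think you believe)
print(m.you_about_me.facts["ball"])  # 'basket' (what I think you think I believe)
```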

9.4.2 Simulating Other Social Cognition Mechanisms that Support Theory of Mind in Artificial Agents

Simulation of higher-order socio-cognitive capabilities includes action recognition, imitation, memory, and learning. All these modules contribute to modeling a functional ToM. The model can create, store, retrieve, and track the counterpart's mental states and compare them with its own internal states online. This type of simulation allows making inferences about goals and predicting actions, and it also facilitates learning [Byom and Mutlu 2013]. Furthermore, these cognitive simulations allow (i) making inferences about the perspective of humans [Trafton et al. 2005] and (ii) distinguishing and storing particular sets of beliefs to help robots plan actions and learn based on imitation [Breazeal et al. 2006], or activating motor-resonance mechanisms that facilitate the generation of inferences about subsequent steps of action sequences [Blakemore and Decety 2001]. Another example of cognitive simulation in the motor domain is a system developed by Gray et al. [2005] that monitors human behavior in a collaborative task by simulating the observed behavior within the robot's own generative mechanisms. This enables the robot to perform task-level simulations, track the participant's progress, and anticipate what is needed to accomplish the action goal.

A considerable part of the initial work on ToM in robotics was focused on implementing the visual perspective-taking mechanism and the so-called "belief management system" in the SR in order to infer the human's mental states (e.g., Scassellati [2002] and Berlin et al. [2006]). A prominent example of the integration of these modules was presented by Breazeal et al. [2009]. The authors aimed at incorporating mechanisms based on simulation theory as a principle to support mindreading skills and abilities. The proposed model is characterized by two modes of operation: one that generates the mental states of the SR using the current state of the world, and a second one that constructs and represents the mental states of the human counterpart. Importantly, both modes share the perception, belief, motor, intention, and body representation modules. The perception system can estimate what the other can see and transforms that information into the point of view of the robot. The motor system maps and represents the body positions of the human in terms of the SR joint space to perform action recognition. The belief system combined with perspective taking represents possible beliefs of the human. Finally, the intention system predicts the ideal action sequence to achieve a goal.


This information is combined with the perceptual cues and the current state of the environment to create inferences about mental states and to predict the actions of the human counterpart. According to the authors, the system that represents the human's mental states builds beliefs based on perceptual states using an embodied simulation together with higher-level knowledge about task goals [Breazeal et al. 2009]. Their architecture was tested in a collaborative task and a learning-from-demonstration task. The SR was able to anticipate and generate inferences about human behavior and to point at relevant objects during the collaborative task. It was also capable of recognizing rules demonstrated by the teacher in the learning task. The physical limitations of the SR platform (i.e., the social robot Leonardo) prevented physical interaction with the environment. For that reason, the capabilities of the architecture were also tested in a virtual reality environment, with successful results. The authors concluded that the system can infer and predict the beliefs of the interaction partner, although the range of these beliefs was limited compared to the capabilities of humans.

More recent cognitive architectures attempt to develop more flexible artificial ToM systems that enhance robots' capabilities and improve human–robot interactions, allowing SRs to take others' perspectives, generate predictive actions, support active perception, and reduce the dependence on external datasets to infer actions and mental states [Bianco and Ognibene 2019]. There are several types of cognitive architectures. For instance, multimodal architectures rely on collecting inputs from different modalities (i.e., visual, auditory, and proprioceptive) to predict and understand the behavior of the human and to reproduce movements. Inputs that include posture, location, facial expressions, visual perspective, and movements of the human can be used to determine whether actions are intended or not. These inputs are also integrated with verbal commands and proprioceptive information to perform collaborative tasks, like cleaning a table in the most efficient manner. The biomimetic architecture for situated social intelligence systems (BASSIS) proposed by Petit and colleagues [2013] provides a robot with real-time adaptation during collaborative scenarios, using multimodal inputs (visual, verbal, and proprioceptive) to infer the mental states of the human counterpart. This architecture is organized at three different levels of control: reactive, adaptive, and contextual, which are all based on the physical instantiation of the agent through its body. It is based on the Distributed Adaptive Control Architecture [Verschure et al. 2003] and was employed for multimodal learning with the NAO and iCub platforms. It has shown great potential for collaborative environments. However, it is limited by the quality of teaching (as it does not tolerate errors from the tutor) and by limited long-term storage of the acquired learning [Verschure et al. 2003].
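The visual perspective-taking component mentioned above ("estimate what the other can see") can be sketched in a deliberately simplified geometric form: attribute a belief about an object to the human partner only if the object falls within their field of view. The 2D poses, field-of-view angle, and range below are illustrative assumptions, not parameters of Breazeal et al.'s or Petit et al.'s systems.

```python
import math

# Simplified 2D visual perspective taking: decide whether the human partner
# can currently see an object, and only then attribute a belief about it.
# Field-of-view angle and sensing range are illustrative assumptions.
FOV_DEG = 120.0
MAX_RANGE = 3.0  # meters

def can_see(observer_xy, observer_heading_rad, object_xy,
            fov_deg=FOV_DEG, max_range=MAX_RANGE):
    dx = object_xy[0] - observer_xy[0]
    dy = object_xy[1] - observer_xy[1]
    if math.hypot(dx, dy) > max_range:
        return False
    # Angle between the observer's heading and the direction to the object.
    angle_to_obj = math.atan2(dy, dx)
    diff = abs((angle_to_obj - observer_heading_rad + math.pi) % (2 * math.pi) - math.pi)
    return math.degrees(diff) <= fov_deg / 2

# Robot-side belief attribution: the robot perceives everything on the table,
# but only attributes knowledge of an object to the human if it is visible to them.
human_xy, human_heading = (0.0, 0.0), 0.0  # human at origin, facing +x
objects = {"mug": (1.0, 0.2), "box_behind_human": (-1.0, 0.0)}

human_beliefs = {name: pos for name, pos in objects.items()
                 if can_see(human_xy, human_heading, pos)}
print(human_beliefs)  # {'mug': (1.0, 0.2)}; the box behind the human is excluded
```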


Another architecture that uses multimodal estimations was proposed by Görür et al. [2017]. In contrast with BASSIS, it integrates a ToM model into decision-making tasks. This architecture is composed of three modules: Sensing, Action State Estimation (ASE), and the Human–Robot Shared Planner (HRSP). The architecture primarily receives sensory data (visual and auditory/verbal) and generates stochastic policies in the form of human–robot shared decisions from the robot's point of view. A combination of sensory data and generated policies produces an input that drives the remainder of the architecture. The ASE and the HRSP constitute the ToM part, but they rely closely on the Sensing module. The stochastic planner of the HRSP depends on a partially observable Markov decision process (POMDP), following a Bayesian ToM model inspired by Baker and Tenenbaum [2014]. Similarly, Devin and Alami [2016] designed a model that employs multiple inputs to estimate and maintain representations about the environment and the action goals of the partner. It includes representations of the previous goal, plans, and actions, and holds them online to decrease unnecessary or redundant verbal communication with the human counterpart.

A related earlier architecture, HAMMER (Hierarchical Attentive Multiple Models for Execution and Recognition), was proposed by Demiris and Khadhouri [2006]. This architecture is an example of multimodal estimation and hypothesis simulation. It has been designed to identify and execute goal-oriented actions and has an inverse model paired with a forward model. The inverse model processes the current state of the system and the target goal(s) and produces the control commands that are needed to achieve or maintain those goal(s); the forward model takes as input the current state of the system and a control command to be applied to it, and outputs the predicted next state of the controlled system. The architecture is based on the visual perception of another agent's movements and is controlled by top-down signals that orient the robot's attention toward information necessary to confirm its hypothesis concerning the demonstrator's action. The architecture was implemented and tested on a robot that conducted an action recognition task while observing a human demonstrator performing an object-oriented action. The robot successfully performed the task, and the attentional mechanism acting over the inverse model was suggested to reduce the robot's computational costs.
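The inverse/forward-model pairing described for HAMMER can be illustrated with a toy example: for each hypothesized goal, an inverse model proposes the command that goal would require, a forward model predicts the resulting next state, and the hypothesis whose predictions best match the observed movement accumulates confidence. The one-dimensional reaching task, candidate goals, and error metric below are assumptions for illustration, not Demiris and Khadhouri's implementation.

```python
import numpy as np

# Toy HAMMER-style action recognition on a 1D reaching task.
# The goal positions and the step size are illustrative assumptions.
CANDIDATE_GOALS = {"cup": 2.0, "phone": -1.5}
STEP = 0.5  # maximum hand displacement per time step

def inverse_model(state, goal):
    # Command needed to move the hand toward the hypothesized goal.
    return np.clip(goal - state, -STEP, STEP)

def forward_model(state, command):
    # Predicted next hand position given the command.
    return state + command

def recognize(observed_states):
    # Run every goal hypothesis through its inverse/forward models and score
    # how well the predicted next states match the observed trajectory.
    confidences = {g: 0.0 for g in CANDIDATE_GOALS}
    for s_now, s_next in zip(observed_states, observed_states[1:]):
        for goal_name, goal_pos in CANDIDATE_GOALS.items():
            command = inverse_model(s_now, goal_pos)
            predicted = forward_model(s_now, command)
            confidences[goal_name] += -abs(predicted - s_next)
    return max(confidences, key=confidences.get), confidences

# Demonstrator's hand moves from 0.0 toward +2.0, consistent with the 'cup' goal.
trajectory = [0.0, 0.5, 1.0, 1.5]
print(recognize(trajectory))  # ('cup', {...}): the 'cup' hypothesis wins
```

In the full architecture, an attentional mechanism additionally decides which sensory evidence to gather next so that competing hypotheses can be disambiguated at lower computational cost.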

9.5 Comparison of IVAs and SRs

The physical embodiment of interaction partners is known to impact social interactions (see Li [2015] for a review). However, there is no consensus regarding the


question of whether artificial agents' embodiment has an effect on ToM and JA specifically. Users typically find physically embodied SRs more engaging, enjoyable, informative, and credible than virtual agents [Kidd and Breazeal 2004]. Physically co-present embodied systems improve interactions over virtual systems [Bainbridge et al. 2011]. In general, a wide variety of studies in HRI support the idea that implementing human-like characteristics in SRs facilitates social interaction. However, the literature that supports a systematic comparison of SRs and IVAs is sparse. Studies showed that SRs with human-like appearance and behavior are judged as more pleasant [Axelrod and Hone 2005], more usable [Riek et al. 2009], more accepted [Venkatesh and Davis 2000, Kiesler and Goetz 2002, Duffy 2003], easier to get acquainted with [Krach et al. 2008], and more engaging [Bartneck and Forlizzi 2004]. Further, SRs that communicate using social signals, such as facial expressions [Eyssel et al. 2010], other emotion displays like ear or fin movements in human-like robots [Gonsior et al. 2011], or turn-taking in conversations [Fussell et al. 2008], evoke stronger emotional responses and are preferred by users over SRs that do not show social signals.

IVAs with these or similar characteristics would also be expected to facilitate the attribution of mental capacities to them, and studies indeed suggest that people can read intentions into IVAs' behaviors during social interactions. For instance, using the principles of animation, Takayama et al. [2011] designed the behavior of an IVA such that it either displayed intentions (i.e., hinting with its gaze whether it was aiming at opening a door) or not, and was reactive (or not) to events in the environment during action execution. They found that when an IVA showed forethought (time to "think") before executing an action, the agent was judged as more competent and intelligent, and was perceived as more appealing. This suggests that people are sensitive to intentional hints from IVAs and can interpret them accordingly.

Several studies have focused on whether virtual agents can communicate using gaze and engage participants during interaction. For example, Andrist and colleagues [2012] used a model to control the gaze shifts of a virtual character; two main conditions, mutual and referential gaze, were developed. Predominant mutual gaze elicited subjective positive feelings of connection, and referential gaze improved participants' recall of information in the environment. This suggests that, similar to findings with SRs, participants seem to follow the gaze of IVAs and engage in mutual eye gaze with IVAs [Andrist et al. 2012]. In a similar study, Wilms et al. [2010] showed that when IVAs' gaze was contingent with participants' gaze, JA evoked higher activity in the medial prefrontal cortex and posterior cingulate cortex relative to disjoint conditions. More recently, Willemse and colleagues [2018], using a gaze-leading task on screen (the iCub either followed participants' gaze or did not), found that participants preferred the robot that exhibited JA behavior relative to the robot


with a disjoint attention behavior. The robot with JA behavior was also rated as more human-like and as more likeable. These results showed a similar pattern to findings obtained with a physically embodied robot [Willemse and Wykowska 2019].

The development of cognitive architectures now makes it possible to implement ToM in IVAs. For instance, Buchsbaumm et al. [2005] proposed a framework inspired by simulation theory and hierarchical action structures to help IVAs understand human actions and emotions. The framework includes a motivational system in which certain actions (get/search/find) are associated with certain drives (feeding, self-defense, or socializing), and increasingly specific, sequentially organized actions can be defined for satisfying associated drives (e.g., feeding → eat/search/get food → jump/reach item). The system is designed to learn by imitation, that is, to associate an observed action with a particular goal, and is able to identify goals, learn, and predict actions. Other approaches have implemented BToM to predict the behavior of users, aiming at facilitating navigation and increasing user satisfaction within an immersive virtual environment with multiple agents, rather than in a one-to-one interaction in a social context. The algorithm proposed by Narang et al. [2019] uses a probabilistic model that integrates observed social cues and actions with statistical priors regarding the user's mental states. The model, used in a real-time algorithm, endows multiple virtual agents with a ToM model. The algorithm perceives proxemics and gaze-based social cues from the users to reliably infer their underlying implicit intentions. For instance, the algorithm can differentiate between a user who is passing by and a user who is aiming to talk or interact with a virtual agent.

Altogether, this suggests that both IVAs and SRs can communicate using social signals and language. Further research employing the methods of cognitive neuroscience should be carried out to evaluate the added value of physical embodiment in SRs as compared to IVAs.
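In the spirit of the gaze- and proxemics-based intent inference described for Narang et al. [2019] above, the following sketch combines two discretized cues (gaze on the agent, and whether the user is approaching) with a prior over intents in a naive-Bayes fashion. The intents, thresholds, and likelihood tables are illustrative assumptions only, not the published model.

```python
# Minimal Bayesian sketch of inferring whether a user intends to interact with
# a virtual agent or is just passing by, from two observed cues. The intents,
# cue discretization, and likelihood tables are illustrative assumptions.

INTENTS = ("interact", "pass_by")
PRIOR = {"interact": 0.3, "pass_by": 0.7}

# P(cue value | intent): gaze_on_agent is the fraction of time the user's gaze
# rests on the agent; "approaching" means the user is closing the distance.
LIKELIHOOD = {
    "gaze_high":        {"interact": 0.8, "pass_by": 0.2},
    "gaze_low":         {"interact": 0.2, "pass_by": 0.8},
    "approaching":      {"interact": 0.7, "pass_by": 0.4},
    "not_approaching":  {"interact": 0.3, "pass_by": 0.6},
}

def infer_intent(gaze_on_agent: float, approaching: bool):
    cues = ["gaze_high" if gaze_on_agent > 0.5 else "gaze_low",
            "approaching" if approaching else "not_approaching"]
    posterior = dict(PRIOR)
    for cue in cues:  # naive-Bayes style combination of (assumed) independent cues
        for intent in INTENTS:
            posterior[intent] *= LIKELIHOOD[cue][intent]
    z = sum(posterior.values())
    return {intent: p / z for intent, p in posterior.items()}

print(infer_intent(gaze_on_agent=0.8, approaching=True))
# posterior mass shifts toward 'interact'
```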

9.6 Current Challenges

One of the central questions in HRI is: What are the necessary conditions for a robot to evoke similar social-cognitive mechanisms as in human–human interaction [Wykowska 2020, 2021]? Which robot features make us socially attune to or synchronize with it, and represent its actions and interpret them in mentalistic terms? Which features are more impactful, physical or behavioral, and would their impact be different in short-term versus long-term interactions? What role do preexisting expectations, stereotypes, and individual differences play when interpreting and reacting to observed robot behaviors (see Marchesi et al. [2019], Spatola and Wykowska [2021], and Bossi et al. [2020])? Answering these questions by no means suggests that it is desirable under all circumstances to design robots that


evoke socio-cognitive mechanisms to the same extent as other humans. Whether this is ethically and morally acceptable, and in which application contexts, is open to ethical debate. On the one hand, robots that evoke similar social schemes as human interaction partners may have positive effects, as such SRs are perceived as more friendly, which might lead to higher acceptance, for instance, in elderly care. In fact, it might be that a senior person is more likely to, for example, follow medical recommendations (e.g., taking pills at a prescribed time of the day) when a robot elicits social attunement, as compared to when the robot is perceived as a machine. Whether this is indeed the case remains to be tested in future research. On the other hand, there might be application scenarios where social attunement is not desirable, for example, when a person is working side by side with a robot in a factory or wants to use the robot for a specific service. Studies that have examined the fit between a robot and the task it is supposed to execute suggest that anthropomorphic design features might help for tasks that require "core" human skills, like reading emotions, but might be disadvantageous for tasks that require the robot to execute actions that a human would not want to execute (e.g., Smith et al. [2016] and Hertz and Wiese [2018]). Finally, it is also important to evaluate the ethical implications of creating the "illusion" that a robot is a social being "with a mind," similar or equivalent to another human. Humans should always remain aware of the difference between a robot, which is just a machine, and another human. The challenge is to make sure that the ethical debate goes hand in hand with technical and scientific development and research.

References

H. Admoni and B. Scassellati. 2017. Social eye gaze in human–robot interaction: A review. J. Hum. Robot Interact. 6, 1, 25–63. DOI: https://doi.org/10.5898/JHRI.6.1.Admoni. H. Admoni, C. Bank, J. Tan, M. Toneva, and B. Scassellati. 2011. Robot gaze does not reflexively cue human attention. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 33, No. 33). T. Allison, A. Puce, and G. McCarthy. 2000. Social perception from visual cues: Role of the STS region. Trends Cogn. Sci. 4, 267–278. DOI: https://doi.org/10.1016/S1364-6613(00)01501-1. D. M. Amodio and C. D. Frith. 2006. Meeting of minds: The medial frontal cortex and social cognition. In Discovering the Social Mind. Psychology Press, 183–207. S. Andrist, T. Pejsa, B. Mutlu, and M. Gleicher. 2012. Designing effective gaze mechanisms for virtual agents. In Proceedings of the ACM Annual Conference on Human Factors in Computing Systems (CHI). ACM Press, Austin, TX, 705–714. L. Axelrod and K. Hone. 2005. Uncharted passions: User displays of positive affect with an adaptive affective system. In Lecture Notes in Computer Science (including subseries Lecture


Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). DOI: https://doi.org/10. 1007/11573548_114. W. A. Bainbridge, J. W. Hart, E. S. Kim, and B. Scassellati. 2011. The benefits of interactions with physically present robots over video-displayed agents. Int. J. Soc. Robotics 3, 1, 41–52. DOI: https://doi.org/10.1007/s12369-010-0082-7. C. L. Baker and J. B. Tenenbaum. 2014. Modeling human plan recognition using bayesian theory of mind. In Plan, Activity, and Intent Recognition: Theory and Practice, 177–204. C. L. Baker, R. Saxe, and J. B. Tenenbaum. 2009. Action understanding as inverse planning. Cognition 113, 3, 329–349. DOI: https://doi.org/10.1016/j.cognition.2009.07.005. C. L. Baker, J. Jara-Ettinger, R. Saxe, and J. B. Tenenbaum. 2017. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nat. Hum. Behav. 1. DOI: https://doi.org/10.1038/s41562-017-0064. S. Baron-Cohen. 1997. Mindblindness: An Essay on Autism and Theory of Mind. MIT Press, Cambridge, MA. DOI: https://doi.org/10.7551/mitpress/4635.001.0001. S. Baron-Cohen, R. Campbell, A. Karmiloff-Smith, J. Grant, and J. Walker. 1995. Are children with autism blind to the mentalistic significance of the eyes? Br. J. Dev. Psychol. 13, 379–398. DOI: https://doi.org/10.1111/j.2044-835x.1995.tb00687.x. C. Bartneck and J. Forlizzi. 2004. A design-centred framework for social human–robot interaction. In Proceedings—IEEE International Workshop on Robot and Human Interactive Communication. DOI: https://doi.org/10.1109/roman.2004.1374827. A. P. Bayliss and S. P. Tipper. 2006. Gaze cues evoke both spatial and object-centered shifts of attention. Percept. Psychophys. 68, 2, 310–318. DOI: https://doi.org/10.3758/BF03193678. A. P. Bayliss, A. Frischen, M. J. Fenske, and S. P. Tipper. 2007. Affective evaluations of objects are influenced by observed gaze direction and emotional expression. Cognition 104, 3, 644–653. DOI: https://doi.org/10.1016/j.cognition.2006.07.012. G. Beattie and A. W. Ellis. 2017. The Psychology of Language and Communication. Taylor & Francis. DOI: https://doi.org/10.4324/9781315187198. M. Berlin, J. Gray, A. L. Thomaz, and C. Breazeal. 2006. Perspective taking: An organizing principle for learning in human–robot interaction. In Proceedings of the National Conference on Artificial Intelligence. F. Bianco and D. Ognibene. 2019. Functional advantages of an adaptive theory of mind for robotics: A review of current architectures. 2019 11th Computer Science and Electronic Engineering Conference, CEEC 2019—Proceedings. 139–143. DOI: https://doi.org/10.1109/ CEEC47804.2019.8974334. A. Billard, Y. Epars, S. Calinon, S. Schaal, and G. Cheng. 2004. Discovering optimal imitation strategies. Rob. Auton. Syst. 47, 2-3, 69–77. DOI: https://doi.org/10.1016/j.robot.2004. 03.002. P. Billeke and F. Aboitiz. 2013. Social cognition in schizophrenia: From social stimuli processing to social engagement. Front. Psychiatry 4, 4. DOI: https://doi.org/10.3389/fpsyt. 2013.00004.


S. J. Blakemore. 2008. The social brain in adolescence. Nat. Rev. Neurosci. 9, 4, 267–277. DOI: https://doi.org/10.1038/nrn2353. S. J. Blakemore and J. Decety. 2001. From the perception of action to the understanding of intention. Nat. Rev. Neurosci. 2, 8, 561–567. DOI: https://doi.org/10.1038/35086023. F. Bossi, C. Willemse, J. Cavazza, S. Marchesi, V. Murino, and A. Wykowska. 2020. The human brain reveals resting state activity patterns that are predictive of biases in attitudes toward robots. Sci. Robot. 5, 46. DOI: https://doi.org/10.1126/scirobotics.abb6652. C. Breazeal, D. Buchsbaum, J. Gray, D. Gatenby, and B. Blumberg. 2005. Learning from and about others: Towards using imitation to bootstrap the social understanding of others by robots. Artif. life 11, 1–2, 31–62. DOI: https://doi.org/10.1162/1064546053278955. C. Breazeal, M. Berlin, A. Brooks, J. Gray, and A. L. Thomaz. 2006. Using perspective taking to learn from ambiguous demonstrations. Rob. Auton. Syst. 54, 5, 385–393. DOI: https://doi.org/10.1016/j.robot.2006.02.004. C. Breazeal, J. Gray, and M. Berlin. 2009. An embodied cognition approach to mindreading skills for socially intelligent robots. Int. J. Rob. Res. 28, 5, 656–680. DOI: https://doi.org/10. 1177/0278364909102796. D. Buchsbaumm, B. Blumberg, C. Breazeal, and A. N. Meltzoff. 2005. A simulation-theory inspired social learning system for interactive characters. IEEE International Workshop on Robot and Human Interactive Communication, 2005. Nashville, TN, 2005, 85–90. DOI: https://doi.org/10.1109/ROMAN.2005.1513761. L. J. Byom and B. Mutlu. 2013. Theory of mind: Mechanisms, methods, and new directions. Front. Hum. Neurosci. 7, 413. DOI: https://doi.org/10.3389/fnhum.2013.00413. F. Capozzi and J. Ristic. 2020. Attention AND mentalizing? Reframing a debate on social orienting of attention. Visual Cognit. 28, 97–105. DOI: https://doi.org/10.1080/13506285. 2020.1725206. T. Chaminade, D. Rosset, D. Da Fonseca, B. Nazarian, E. Lutcher, G. Cheng, and C. Deruelle. 2012. How do we think machines think? An fMRI study of alleged competition with an artificial intelligence. Front. Hum. Neurosci. 6, 103. DOI: https://doi.org/10. 3389/fnhum.2012.00103. L. J. Chang, A. Smith, M. Dufwenberg, and A. G. Sanfey. 2011. Triangulating the neural, psychological, and economic bases of guilt aversion. Neuron 70, 3, 560–572. DOI: https://doi. org/10.1016/j.neuron.2011.02.056. P. Chevalier, K. Kompatsiari, F. Ciardo, and A. Wykowska. 2020. Examining joint attention with the use of humanoid robots—A new approach to study fundamental mechanisms of social cognition. Psychon. Bull. Rev. 27, 2, 217–236. DOI: https://doi.org/10.3758/s13423019-01689-4. F. Ciardo, F. Beyer, D. De Tommaso, and A. Wykowska. 2020. Attribution of intentional agency towards robots reduces one’s own sense of agency. Cognition 194, 104109. DOI: https://doi.org/10.1016/j.cognition.2019.104109. M. Corbetta. 1998. Frontoparietal cortical networks for directing attention and the eye to visual locations: Identical, independent, or overlapping neural systems? Proc. Natl. Acad. Sci. U. S. A. 95, 3, 831–838. DOI: https://doi.org/10.1073/pnas.95.3.831.


M. Corbetta, E. Akbudak, T. E. Conturo, A. Z. Snyder, J. M. Ollinger, H. A. Drury, M. R. Linenweber, S. E. Petersen, M. E. Raichle, D. C. Van Essen, and G. L. Shulman. 1998. A common network of functional areas for attention and eye movements. Neuron 21, 4, 761–773. DOI: https://doi.org/10.1016/S0896-6273(00)80593-0. N. Dahlbäck, A. Jönsson, and L. Ahrenberg. 1993. Wizard of Oz studies: Why and how. In Proceedings of the 1st International Conference on Intelligent user Interfaces. (1993 February) 193–200. K. Dautenhahn and C. L. Nehaniv. 2002. Imitation as a dual-route process featuring predictive and learning components: A biologically plausible computational model. In Imitation in Animals and Artifacts. MIT Press, 327–361. P. Dayan and N. D. Daw. 2008. Decision theory, reinforcement learning, and the brain. Cogn. Affect. Behav. Neurosci. 8, 429–453. DOI: https://doi.org/10.3758/CABN.8.4.429. M. M. A. De Graaf and B. F. Malle. 2019. People’s explanations of robot behavior subtly reveal mental state inferences, 2019 14th ACM/IEEE International Conference on Human– Robot Interaction (HRI), Daegu, Korea (South). 239–248. DOI: https://doi.org/10.1109/HRI. 2019.8673308. M. M. A. De Graaf, S. B. Allouch, and S. Lutfi. 2016. What are people’s associations of domestic robots? Comparing implicit and explicit measures. In 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 1077– 1083. L. M. J. De Sonneville, C. A. Verschoor, C. Njiokiktjien, V. Op het Veld, N. Toorenaar, and M. Vranken. 2002. Facial identity and facial emotions: Speed, accuracy, and processing strategies in children and adults. J. Clin. Exp. Neuropsychol. 24, 2, 200–213. DOI: https://doi.org/10.1076/jcen.24.2.200.989. Y. Demiris and B. Khadhouri. 2006. Hierarchical attentive multiple models for execution and recognition of actions. Rob. Auton. Syst. 54, 361–369. DOI: https://doi.org/10.1016/j. robot.2006.02.003. D. C. Dennett. 1971. Intentional systems. J. Philos. 68, 87–106. DOI: https://doi.org/10.2307/ 2025382. D. C. Dennett. 1987. The Intentional Stance. MIT Press. S. Devin and R. Alami. 2016. An implemented theory of mind to improve human–robot shared plans execution. In ACM/IEEE International Conference on Human–Robot Interaction. DOI: https://doi.org/10.1109/HRI.2016.7451768. J. Driver, G. Davis, P. Ricciardelli, P. Kidd, E. Maxwell, and S. Baron-Cohen. 1999. Gaze perception triggers reflexive visuospatial orienting. Vis. Cogn. 6, 5, 509–540. DOI: https://doi. org/10.1080/135062899394920. B. R. Duffy. 2003. Anthropomorphism and the social robot. In Robotics and Autonomous Systems. DOI: https://doi.org/10.1016/S0921-8890(02)00374-3. N. J. Emery. 2000. The eyes have it: The neuroethology, function and evolution of social gaze. Neurosci. Biobehav. Rev. 24, 581–604. DOI: https://doi.org/10.1016/S0149-7634(00) 00025-7.


N. Epley, A. Waytz, and J. T. Cacioppo. 2007. On seeing human: A three-factor theory of anthropomorphism. Psychol. Rev. 114, 864–886. DOI: https://doi.org/10.1037/0033-295X. 114.4.864. F. Eyssel, F. Hegel, G. Horstmann, and C. Wagner. 2010. Anthropomorphic inferences from emotional nonverbal cues: A case study. In Proceedings—IEEE International Workshop on Robot and Human Interactive Communication. DOI: https://doi.org/10.1109/ROMAN.2010. 5598687. A. Fod, M. J. Matari´ c, and O. C. Jenkins. 2002. Automated derivation of primitives for movement classification. Auton. Rob. 12, 39–54. DOI: https://doi.org/10.1023/A:1013254724861. C. K. Friesen and A. Kingstone. 1998. The eyes have it! Reflexive orienting is triggered by nonpredictive gaze. Psychon. Bull. Rev. 5, 3, 490–495. DOI: https://doi.org/10.3758/BF 03208827. A. Frischen, A. P. Bayliss, and S. P. Tipper. 2007. Gaze cueing of attention: Visual attention, social cognition, and individual differences. Psychol. Bull. 133, 4, 694–724. DOI: https://do i.org/10.1037/0033-2909.133.4.694. K. Friston. 2005. A theory of cortical responses. Philos. Trans. R. Soc. Lond. B Biol. Sci. 360, 1456, 815–836. DOI: https://doi.org/10.1098/rstb.2005.1622. C. D. Frith and U. Frith. 2006. The neural basis of mentalizing. Neuron 50, 4, 531–534. DOI: https://doi.org/10.1016/j.neuron.2006.05.001. U. Frith, J. Morton, and A. M. Leslie. 1991. The cognitive basis of a biological disorder: Autism. Trends Neurosci. 14, 10, 433–438. DOI: https://doi.org/10.1016/0166-2236(91)90041r. S. R. Fussell, S. Kiesler, L. D. Setlock, and V. Yew. 2008. How people anthropomorphize robots. In HRI 2008—Proceedings of the 3rd ACM/IEEE International Conference on Human– Robot Interaction: Living with Robots. DOI: https: / /doi.org/10.1145/1349822.1349842. H. L. Gallagher, A. I. Jack, A. Roepstorff, and C. D. Frith. 2002. Imaging the intentional stance in a competitive game. NeuroImage 16, 814–821. DOI: https://doi.org/10.1006/nimg .2017. N. George, J. Driver, and R. J. Dolan. 2001. Seen gaze-direction modulates fusiform activity and its coupling with other brain areas during face processing. NeuroImage 13, 1102–1112. DOI: https://doi.org/10.1006/nimg.2001.0769. D. Ghiglino, C. Willemse, D. De Tommaso, F. Bossi, and A. Wykowska. 2020. At first sight: Robots’ subtle eye movement parameters affect human attentional engagement, spontaneous attunement and perceived human-likeness. Paladyn 11, 31–39. DOI: https://doi. org/10.1515/pjbr-2020-0004. O. Golan, S. Baron-Cohen, J. J. Hill, and Y. Golan. 2006. The “reading the mind in films” task: Complex emotion recognition in adults with and without autism spectrum conditions. Soc. Neurosci. 1, 2, 111–123. DOI: https://doi.org/10.1080/17470910600980986. B. Gonsior, S. Sosnowski, C. Mayer, J. Blume, B. Radig, D. Wollherr, and K. Kuhnlenz. 2011. Improving aspects of empathy and subjective performance for HRI through mirroring facial expressions. In Proceedings—IEEE International Workshop on Robot and Human Interactive Communication. DOI: https://doi.org/10.1109/ROMAN.2011.6005294.


O. C. Görür, B. S. Rosman, G. Hoffman, and S. Albayrak. 2017. Toward integrating theory of mind into adaptive decision-making of social robots to understand human intention. 12th ACM/IEEE International Conference on Human–Robot Interaction (HRI). J. Gray, C. Breazeal, M. Berlin, A. Brooks, and J. Lieberman. 2005. Action parsing and goal inference using self as simulator. In ROMAN 2005. IEEE International Workshop on Robot and Human Interactive Communication, (August 2005). IEEE, 202–209. H. M. Gray, K. Gray, D. M. Wegner. 2007. Dimensions of mind perception. Science 315, 5812, 619. DOI: https://doi.org/10.1126/science.1134475. A. N. Hampton, P. Bossaerts, and J. P. O’Doherty. 2008. Neural correlates of mentalizingrelated computations during strategic interactions in humans. Proc. Nat. Acad. Sci. 105, 18, 6741–6746. DOI: https://doi.org/10.1073/pnas.0711099105. F. G. Happé. 1994. An advanced test of theory of mind: Understanding of story characters’ thoughts and feelings by able autistic, mentally handicapped, and normal children and adults. J. Autism Dev. Disord. 24, 2, 129–154. DOI: https://doi.org/10.1007/BF02172093. F. Happé, and U. Frith. 2014. Annual research review: Towards a developmental neuroscience of atypical social cognition. J. Child Psychol. Psychiatry 55, 553–577. DOI: https://doi.org/10.1111/jcpp.12162. F. Hegel, S. Krach, T. Kircher, B. Wrede and G. Sagerer. 2008. Theory of mind (ToM) on robots: A functional neuroimaging study, 2008 3rd ACM/IEEE International Conference on Human–Robot Interaction (HRI), Amsterdam, 335–342. DOI: https://doi.org/110.1145/ 1349822.1349866. F. Heider. 1958. The Psychology of Interpersonal Relations. Psychology Press. DOI: https://doi. org/10.4324/9780203781159. N. Hertz and E. Wiese. 2018. Under pressure: Examining social conformity with computer and robot groups. Hum. Factors 60, 8, 1207–1218. DOI: https://doi.org/10.1177/ 0018720818788473. E. A. Hoffman and J. V. Haxby. 2000. Distinct representations of eye gaze and identity in the distributed human neural system for face perception. Nat. Neurosci. 3, 80–84. DOI: https://doi.org/10.1038/71152. C. I. Hooker, K. A. Paller, D. R. Gitelman, T. B. Parrish, M. M. Mesulam, and P. J. Reber. 2003. Brain networks for analyzing eye gaze. Cogn. Brain Res. 17, 2, 406–418. DOI: https: //doi.org/10.1016/S0926-6410(03)00143-5. C. M. Huang and B. Mutlu. 2012. Robot behavior toolkit: Generating effective social behaviors for robots. In HRI’12—Proceedings of the 7th Annual ACM/IEEE International Conference on Human–Robot Interaction. DOI: https://doi.org/10.1145/2157689.2157694. B. Huebner. 2010. Commonsense concepts of phenomenal consciousness: Does anyone care about functional zombies? Phenomenol. Cogn. Sci. 9, 133–155. DOI: https://doi.org/ 10.1007/s11097-009-9126-6. A. C. Jenkins, C. N. Macrae, and J. P. Mitchell. 2008. Repetition suppression of ventromedial prefrontal activity during judgments of self and others. Proc. Natl. Acad. Sci. U. S. A. 105, 11, 4507–4512. DOI: https://doi.org/10.1073/pnas.0708785105.


M. Johnson and Y. Demiris. 2005. Perceptual perspective taking and action recognition. Int. J. Adv. Rob. Syst. 2, 4, 32. DOI: https://doi.org/10.5772/5775. S. Kahl and S. Kopp. 2018. A predictive processing model of perception and action for selfother distinction. Front. psychol. 9, 2421. DOI: https://doi.org/10.3389/fpsyg.2018.02421. C. D. Kidd and C. Breazeal. 2004. Effect of a robot on user perceptions. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). DOI: https://doi.org/10. 1109/iros.2004.1389967. S. Kiesler and J. Goetz. 2002. Mental models and cooperation with robotic assistants. CHI’02 Extended Abstracts on Human Factors in Computing Systems. DOI: https://doi.org/ 10.1145/506443.506491. K. Kompatsiari, V. Tikhanoff, F. Ciardo, G. Metta, and A. Wykowska. 2017. The importance of mutual gaze in human–robot interaction. In A. Kheddar, E. Yoshida, S. S. Ge, K. Suzuki, J.-J. Cabibihan, F. Eyssel, and H. He (Eds.), Social Robotics. Springer International Publishing, Cham, 443–452. K. Kompatsiari, J. Perez-Osorio, D. De Tommaso, G. Metta, and A. Wykowska. 2018. Neuroscientifically-grounded research for improved human–robot interaction. In IEEE International Conference on Intelligent Robots and Systems. DOI: https://doi.org/10.1109/ IROS.2018.8594441. K. Kompatsiari, F. Bossi, and A. Wykowska. 2021. Eye contact during joint attention with a humanoid robot modulates oscillatory brain activity. Soc. Cogn. Affect. Neurosci. 16, 383–392. DOI: https://doi.org/10.1093/scan/nsab001. G. Kuhn, I. Vacaityte, A. D. C. D’Souza, A. C. Millett, and G. G. Cole. 2018. Mental states modulate gaze following, but not automatically. Cognition 80, 1–9. DOI: https://doi.org/ 10.1016/j.cognition.2018.05.020. S. Krach, F. Hegel, B. Wrede, G. Sagerer, F. Binkofski, and T. Kircher. 2008. Can machines think? Interaction and perspective taking with robots investigated via fMRI. PLoS One 3, 7, e2597. DOI: https://doi.org/10.1371/journal.pone.0002597. A. M. Leslie. 1987. Pretense and representation: The origins of “theory of mind.” Psychol. Rev. 94, 4, 412-426. DOI: https://doi.org/10.1037/0033-295X.94.4.412. J. Li. 2015. The benefit of being physically present: A survey of experimental works comparing copresent robots, telepresent robots and virtual agents. Int. J. Hum. Comput. Stud. 77, 23-37. DOI: https://doi.org/10.1016/j.ijhcs.2015.01.001. S. Marchesi, D. Ghiglino, F. Ciardo, J. Perez-Osorio, E. Baykara, and A. Wykowska. 2019. Do we adopt the intentional stance toward humanoid robots? Front. Psychol. 10. DOI: https://doi.org/10.3389/fpsyg.2019.00450. G. Metta, L. Natale, F. Nori, G. Sandini, D. Vernon, L. Fadiga, C. von Hofsten, K. Rosander, M. Lopes, J. Santos-Victor, A. Bernardino, and L. Montesano. 2010. The iCub humanoid robot: An open-systems platform for research in cognitive development. Neural Netw. 23, 1125–1134. DOI: https://doi.org/10.1016/j.neunet.2010.08.010. J. P. Mitchell, T. F. Heatherton, and C. N. Macrae. 2002. Distinct neural systems subserve person and object knowledge. Proc. Natl. Acad. Sci. U. S. A. 99, 23, 15238–15243. DOI: https://doi.org/10.1073/pnas.232395699.


B. Mutlu, F. Yamaoka, T. Kanda, H. Ishiguro, and N. Hagita. 2009. Nonverbal leakage in robots: Communication of intentions through seemingly unintentional behavior. In Proceedings of the 4th ACM/IEEE International Conference on Human Robot Interaction. (March 2009). 69–76. B. Mutlu, A. Terrell, and C. Huang. 2013. Coordination mechanisms in human–robot collaboration. In Proceedings of the HRI 2013 Workshop on Collaborative Manipulation. Y. Nagai, M. Asada, and K. Hosoda. 2002. Developmental learning model for joint attention. In: IEEE International Conference on Intelligent Robots and Systems. S. Narang, A. Best, and D. Manocha. 2019. Inferring user intent using Bayesian theory of mind in shared avatar–agent virtual environments. IEEE Trans. Vis. Comput. Graph. 25, 5, (May 2019). 2113–2122. DOI: https://doi.org/10.1109/TVCG.2019.2898800. A. Newell. 1994. Unified Theories of Cognition. Harvard University Press. A. C. Nobre, G. N. Sebestyen, D. R. Gitelman, M. M. Mesulam, R. S. Frackowiak, and C. D. Frith. 1997. Functional localization of the system for visuospatial attention using positron emission tomography. Brain 120, 3, 515–533. DOI: https://doi.org/10.1093/brain/ 120.3.515. S. Nowicki Jr, and J. Carton. 1993. The measurement of emotional intensity from facial expressions. J. Soc. Psychol. 133, 5, 749–750. DOI: https://doi.org/10.1080/00224545.1993. 9713934. R. O’Reilly, T. Hazy, and S. Herd. 2012. The Leabra Cognitive Architecture: How to Play 20 Principles with Nature and Win! The Oxford Handbook of Cognitive Science. DOI: https: //doi.org/10.1093/oxfordhb/9780199842193.013.8. R. C. O’Reilly, T. E. Hazy, and S. A. Herd. 2016. The leabra cognitive architecture: How to play 20 principles with nature and win! In S. Chipman (Ed.), Oxford Handbook of Cognitive Science. Oxford, Oxford University Press. T. Ojala, M. Pietikainen, and T. Maenpaa. 2002. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24, 971–987. DOI: https://doi.org/10.1109/TPAMI.2002.1017623. Y. Okumura, Y. Kanakogi, T. Kanda, H. Ishiguro, and S. Itakura. 2013. Infants understand the referential nature of human gaze but not robot gaze. J. Exp. Child Psychol. 116, 1, 86–95. DOI: https://doi.org/10.1016/j.jecp.2013.02.007. P. C. Pantelis, L. Byrge, J. M. Tyszka, R. Adolphs, and D. P. Kennedy. 2015. A specific hypoactivation of right temporo-parietal junction/posterior superior temporal sulcus in response to socially awkward situations in autism. Soc. Cogn. Affect. Neurosci. 10, 10, 1348–1356. DOI: https://doi.org/10.1093/scan/nsv021. K. Pelphrey, J. Singerman, T. Allison, and G. McCarthy. 2003. Brain activation evoked by perception of gaze shifts: The influence of context. Neuropsychologia 41, 156–170. DOI: https://doi.org/10.1016/s0028-3932(02)00146-x. J. Perez-Osorio and A. Wykowska. 2019. Adopting the intentional stance toward natural and artificial agents. Philos. Psychol. 33, 1-27. DOI: https://doi.org/10.1080/09515089.2019. 1688778.


J. Perez-Osorio, H. J. Müller, E. Wiese, and A. Wykowska. 2015. Gaze following is modulated by expectations regarding others’ action goals. PLoS One 10, e0143614. DOI: https://doi. org/10.1371/journal.pone.0143614. J.Perez-Osorio, H. J. Möller, and A. Wykowska. 2017. Expectations regarding action sequences modulate electrophysiological correlates of the gaze-cueing effect. Psychophysiology 54, 942–954. DOI: https://doi.org/10.1111/psyp.12854. J. Perez-Osorio, D. De Tommaso, E. Baykara, and A. Wykowska. 2018. Joint action with iCub: A successful adaptation of a paradigm of cognitive neuroscience to HRI. In 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), Nanjing – Tai’an, 6 Pages. DOI: https://doi.org/10.1109/ROMAN.2018.8525536. J. Perez-Osorio, S. Marchesi, D. Ghiglino, M. Ince, and A. Wykowska. 2019. More than you expect: Priors influence on the adoption of intentional stance toward humanoid robots. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). DOI: https://doi.org/10.1007/978-3-03035888-4_12. M. Petit, S. Lallée, J. D. Boucher, G. Pointeau, P. Cheminade, D. Ognibene, E. Chinellato, U. Pattacini, I. Gori, U. Martinez-Hernandez, H. Barron-Gonzalez, M. Inderbitzin, A. Luvizotto, V. Vouloutsi, Y. Demiris, G. Metta, and P. F. Dominey. 2013. The coordinating role of language in real-time multimodal learning of cooperative tasks. IEEE Trans. Auton. Ment. Dev. 5, 3–17. DOI: https://doi.org/10.1109/TAMD.2012.2209880. K. S. Ponnet, H. Roeyers, A. Buysse, A. de Clercq, and E. van der Heyden. 2004. Advanced mind-reading in adults with Asperger syndrome. Autism 8, 249–266. DOI: https://doi.org/ 10.1177/1362361304045214. D. Premack. 1990. The infant’s theory of self-propelled objects. Cognition 36, 1, 1–16. DOI: https://doi.org/10.1016/0010-0277(90)90051-k. D. Premack and G. Woodruff. 1978. Premack and Woodruff: Chimpanzee theory of mind. Behav. Brain Sci. 4, 1978, 515–526. DOI: http://dx.doi.org/10.1017/S0140525X00076512. A. Puce, T. Allison, S. Bentin, J. C. Gore, and G. McCarthy. 1998. Temporal cortex activation in humans viewing eye and mouth movements. J. Neurosci. 18, 6, 2188–2199. DOI: https://doi.org/10.1523/JNEUROSCI.18-06-02188.1998. N. C. Rabinowitz, F. Perbet, H. F. Song, C. Zhang, and M. Botvinick. 2018. Machine theory of mind. In 35th International Conference on Machine Learning, ICML 2018. H. Richardson, G. Lisandrelli, A. Riobueno-Naylor, and R. Saxe. 2018. Development of the social brain from age three to twelve years. Nat. Commun. 9, 1, 1–12. DOI: https://doi.org/ 10.1038/s41467-018-03399-2. L. D. Riek, T. C. Rabinowitch, B. Chakrabartiz, and P. Robinson. 2009. Empathizing with robots: Fellow feeling along the anthropomorphic spectrum. In 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, ACII 2009. DOI: https://doi.org/10.1109/ACII.2009.5349423. A. Sadeghipour and S. Kopp. 2011. Embodied gesture processing: Motor-based integration of perception and action in social artificial agents. Cognit. Comput. 3, 3, 419–435. DOI: https://doi.org/10.1007/s12559-010-9082-z.


R. Saxe and N. Kanwisher. 2003. People thinking about thinking people: The role of the temporo-parietal junction in “theory of mind.” NeuroImage 19, 4, 1835–1842. DOI: https://doi.org/10.1016/s1053-8119(03)00230-1. B. M. Scassellati. 2001. Foundations for a Theory of Mind for a Humanoid Robot. MIT Press. DOI: https://doi.org/10.1037/e446982006-001. B. Scassellati. 2002. Theory of mind for a humanoid robot. Auton. Robots 12, 1, 13–24. DOI: https://doi.org/10.1023/A:1013298507114. S. Schaal. 1997. Learning from demonstration. In Advances in Neural Information Processing Systems. DOI: https://doi.org/10.1007/978-1-4419-1428-6_4646. K. R. Scherer and U. Scherer. 2011. Assessing the ability to recognize facial and vocal expressions of emotion: Construction and validation of the Emotion Recognition Index. J. Nonverbal Behav. 35, 4, 305. DOI: https://doi.org/10.1007/s10919-011-0115-4. L. Schilbach, B. Timmermans, V. Reddy, A. Costall, G. Bente, T. Schlicht, and K. Vogeley. 2013. Toward a second-person neuroscience. Behav. Brain Sci. 36, 4, 393–414. DOI: https: //doi.org/10.1017/S0140525X12000660. J. L. Schreck, O. B. Newton, J. Song, and S. M. Fiore. 2019. Reading the mind in robots: How theory of mind ability alters mental state attributions during human–robot interactions. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 63, 1, 1550–1554. DOI: https://doi.org/10. 1177/1071181319631414. T. Singer. 2012. The past, present and future of social neuroscience: A European perspective. NeuroImage 61, 2, 437–449. DOI: https://doi.org/10.1016/j.neuroimage.2012. 01.109. M. A. Smith, M. M. Allaham, and E. Wiese. 2016. Trust in automated agents is modulated by the combined influence of agent and task type. Proc. Hum. Factors Ergon. Soc. Annu. Meet. 60, 1, 206–210. DOI: https://doi.org/10.1177/1541931213601046. N. Spatola and A. Wykowska. 2021. The personality of anthropomorphism: How the need for cognition and the need for closure define attitudes and anthropomorphic attributions toward robots. Comput. Hum. Behav. 122. DOI: https://doi.org/10.1016/j.chb.2021. 106841. S. Sul, B. Güro˘ glu, E. A. Crone, and L. J. Chang. 2017. Medial prefrontal cortical thinning mediates shifts in other-regarding preferences during adolescence. Sci. Rep. 7, 1, 1–10. DOI: https://doi.org/10.1038/s41598-017-08692-6. L. Takayama, D. Dooley, and W. Ju. 2011. Expressing thought: Improving robot readability with animation principles. In Proceedings of the ACM/IEEE International Conference on Human–Robot Interaction (HRI). (March 2011). ACM Press. Lausanne, Switzerland, 69–76. C. Teufel, P. C. Fletcher, and G. Davis. 2010. Seeing other minds: Attributed mental states influence perception. Trends Cognit. Sci. 14, 8, 376–382. DOI: https://doi.org/10.1016/j.tics .2010.05.005. S. Thellman, A. Silvervarg, and T. Ziemke. 2017. Folk-psychological interpretation of human vs. humanoid robot behavior: Exploring the intentional stance toward robots. Front. Psychol. 8, 1962. DOI: https://doi.org/10.3389/fpsyg.2017.01962.


J. G. Trafton, N. L. Cassimatis, M. D. Bugajska, D. P. Brock, F. E. Mintz, and A. C. Schultz. 2005. Enabling effective human–robot interaction using perspective-taking in robots. IEEE Trans. Syst. Man Cybern. 35, 4, 460–470. DOI: https://doi.org/10.1109/TSMCA.2005. 850592. Trafton, J. G., Schultz, A. C., Cassimatis, N. L., Hiatt, L. M., Perzanowski, D., Brock, D. P., Magdalena D. Bugajska, and W. Adams. 2006. Cognition and multi-agent interaction: From cognitive modeling to social simulation. In R. Sun (Ed.), Communicating and Collaborating with Robotic Agents. 252–278. DOI: https://doi.org/10.1017/CBO9780511 610721.011. F. Van Overwalle and K. Baetens. 2009. Understanding others’ actions and goals by mirror and mentalizing systems: A meta-analysis. NeuroImage 48, 3, 564–584. DOI: https://doi. org/10.1016/j.neuroimage.2009.06.009. K. E. Vanderbilt, D. Liu, and G. D. Heyman. 2011. The development of distrust. Child Dev. 82, 5, 1372–1380. DOI: https://doi.org/10.1111/j.1467-8624.2011.01629.x. V. Venkatesh and F. D. Davis. 2000. A theoretical extension of the technology acceptance model: Four longitudinal field studies. Manage. Sci. 46, 2, 186–204. DOI: https://doi.org/ 10.1287/mnsc.46.2.186.11926. P. F. Verschure, T. Voegtlin, and R. J. Douglas. 2003. Environmentally mediated synergy between perception and behaviour in mobile robots. Nature 425, 620–624. DOI: https:// doi.org/10.1038/nature02024. S. Vinanzi, M. Patacchiola, A. Chella, and A. Cangelosi. 2019. Would a robot trust you? Developmental robotics model of trust and theory of mind. Philos. Trans. R. Soc. B. 374, 1771, 20180032. DOI: https://doi.org/10.1098/rstb.2018.0032. P. Viola and M. Jones. 2001. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. Proceedings of the 2001 IEEE Computer Society Conference on CVPR 2001, Vol. 1. IEEE, New York, NY, I–511. H. Wellman. 1990. Children’s Theories of Mind. MIT Press, Cambridge, MA. B. Wicker, F. Michel, M. A. Henaff, and J. Decety. 1998. Brain regions involved in the perception of gaze: A PET study. NeuroImage 8, 2, 221–227. DOI: https://doi.org/10.1006/nimg. 1998.0357. E. Wiese, A. Wykowska, J. Zwickel, and H. J. Möller. 2012. I see what you mean: How attentional selection is shaped by ascribing intentions to others. PLoS One 7, 9, e45391. DOI: https://doi.org/10.1371/journal.pone.0045391. E. Wiese, J. Zwickel, and H. J. Müller. 2013. The importance of context information for the spatial specificity of gaze cueing. Atten. Percept. Psychophys. 75, 967–982. DOI: https://doi. org/10.3758/s13414-013-0444-y. E. Wiese, A. Wykowska, and H. J. Müller. 2014. What we observe is biased by what other people tell us: Beliefs about the reliability of gaze behavior modulate attentional orienting to gaze cues. PLoS One 9, 4. DOI: https://doi.org/10.1371/journal.pone.0094529. E. Wiese, A. Abubshait, B. Azarian, and E. J. Blumberg. 2019. Brain stimulation to left prefrontal cortex modulates attentional orienting to gaze cues. Philos. Trans. R. Soc. B 374, 1771, 20180430. DOI: https://doi.org/10.1098/rstb.2018.0430.


C. Willemse and A. Wykowska. 2019. In natural interaction with embodied robots, we prefer it when they follow our gaze: A gaze-contingent mobile eyetracking study. Philos. Trans. R. Soc. B 374, 1771, 20180036. DOI: https://doi.org/10.1098/rstb.2018.0036. C. Willemse, S. Marchesi, and A. Wykowska. 2018. Robot faces that follow gaze facilitate attentional engagement and increase their likeability. Front. Psychol. 9, 70. DOI: https: //doi.org/10.3389/fpsyg.2018.00070. M. Wilms, L. Schilbach, U. Pfeiffer, G. Bente, G. R. Fink, and K. Vogeley. 2010. It’s in your eyes—Using gaze-contingent stimuli to create truly interactive paradigms for social cognitive and affective neuro- science. Soc. Cogn. Affect. Neurosci. 5, 98–107. DOI: https://doi. org/10.1093/scan/nsq024. H. Wimmer and J. Perner. 1983. Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children’s understanding of deception. Cognition 13, 1, 103–128. DOI: https://doi.org/10.1016/0010-0277(83)90004-5. A. Wykowska. 2020. Social robots to test flexibility of human social cognition. Int. J. Soc. Robot. 12, 1203–1211. DOI: https://doi.org/10.1007/s12369-020-00674-5. A. Wykowska. 2021. Robots as mirrors of the human mind. Curr. Dir. Psychol. Sci. 30, 1, 34–40. DOI: https://doi.org/10.1177/0963721420978609. A. Wykowska, E. Wiese, A. Prosser, and H. J. Müller. 2014. Beliefs about the minds of others influence how we process sensory information. PLoS One 9, 4, e94339. A. Wykowska, J. Kajopoulos, M. Obando-Leitón, S. S. Chauhan, J. J. Cabibihan, and G. Cheng. 2015. Humans are well tuned to detecting agents among non-agents: Examining the sensitivity of human perception to behavioral characteristics of intentional systems. Int. J. Soc. Rob. 7. DOI: https://doi.org/10.1007/s12369-015-0299-6. T. Xu, H. Zhang, and C. Yu. 2016. See you see me: The role of eye contact in multimodal human–robot interaction. ACM Trans. Interact. Intell. Syst. (TiiS) 6, 1, 1–22. DOI: https:// doi.org/10.1145/2882970. J. Złotowski, D. Proudfoot, K. Yogeeswaran, and C. Bartneck. 2015. Anthropomorphism: Opportunities and challenges in human–robot interaction. Int. J. Soc. Rob. 7, 347–360. DOI: https://doi.org/10.1007/s12369-014-0267-6.

10 Emotion

Joost Broekens

In this chapter, I cover computational modeling of emotion in socially interactive agents (SIAs). I focus on the computational representation of emotion and other affective concepts such as mood and attitude, and on computational modeling of appraisal, that is, the assessment of personal relevance of a situation. In Section 10.1, I define essential affective concepts in the study and modeling of emotion and discuss three different psychological perspectives toward studying emotion that have strongly influenced emotion modeling in SIAs. I will motivate why SIAs can make constructive use of emotions. In Section 10.2, I cover computational representation of affective concepts and four approaches toward computational modeling of appraisal. I also give four working examples. In Sections 10.3 to 10.5, I cover the history, state of the art, and outlook of emotion modeling in SIAs.

10.1 Motivation

10.1.1 What are Emotions?

Emotions are about feelings. Emotions tell us how a situation matters to us. Emotions also motivate us to do something about that situation, and we express them to let others know how we feel. Emotion is a multifaceted phenomenon involving a relation between action, motivation, expression, information processing, language, feelings, and social interaction, as well as showing complex interactions with other affective and cognitive phenomena such as mood, attitudes, beliefs, and decision making [Damasio 1994, Frijda et al. 2000, Barrett et al. 2007, Fischer and Manstead 2008]. An instance of an emotion is a specific combination of jointly active bodily and mental features, including expression, arousal, assessment of the situation in terms of personal relevance, often with an associated (learned) label [Broekens et al. 2013, Hoemann et al. 2019]. The core of an emotion is an assessment of the personal relevance of a situation, thereby in some way providing feedback on the suitability of past, current, or future behavior [Lazarus 1991, Van Reekum and


Scherer 1997, Frijda 2004, Baumeister et al. 2007, Broekens et al. 2013, Moors et al. 2013]. Even in modern constructionist views this is an important underlying mechanism ("emotional events ... fundamentally occur within a brain that anticipates the body's energy needs in relation to the current context.") [Hoemann et al. 2019]. In this chapter, I will use the term appraisal for this process of assessment, independent of how this process is implemented in agent or human, or what the potential consequences on behavior or further information processing are. This is useful for our discussion of emotion simulation in SIAs later on as it distances us from debates around the nature of the appraisal process (in SIAs this is always grounded in binary information processing), its causal role (in SIAs appraisal always causes the emotion), and the exact information processing involved in this assessment (in SIAs this is computationally implemented in many different ways).

However, what about mood, attitude, and other affective phenomena? To facilitate a clear discussion of emotion in SIAs, in this section I define essential concepts in the study and modeling of emotion. I cover affect, emotion, mood, attitude, relation, and personality (the latter two to a limited extent as Chapters 11, 12, 18, and 19 cover these topics in more depth). Then I discuss three different psychological perspectives toward studying emotion that have strongly influenced emotion modeling in SIAs: the categorical view, the dimensional-constructionist view, and the cognitive appraisal view. I will explain their main differences but also their commonalities. Please note that cognitive appraisal refers to this particular perspective on emotion elicitation, while appraisal is used as stated above. Finally, I briefly highlight the interplay between affect and cognition in humans.

10.1.1.1 Definitions

The study of emotion falls within the field of affective sciences. Affect in affective science is an umbrella term that refers to anything related to emotion, emotion processing, and emotion in social interaction. Affective science thus deals with the study of emotion in the broadest sense. For the purpose of this chapter, we introduce the most commonly used terminology when it comes to emotion modeling in SIAs. I will borrow some of the terminology from Scherer [2005].

Affect also refers to the positiveness and negativeness associated with an emotion or other psychological construct (an attitude, a mood, a thought, a relation, etc.): for example, in mood induction studies [Dreisbach and Goschke 2004] (valence), in core affect [Russell 1980] (valence-arousal), and in affect associated with textual stimuli [Bradley and Lang 2007] (valence-arousal-dominance).

Emotion refers to an event-related affective reaction (it is about something), typically of short duration and relatively intense (one feels the emotion and is conscious of it). For example, joy is a strong and short-term reaction resulting from

an event with an associated positive feeling. In psychology and neuroscience, a distinction is made between emotions and feelings, where feeling is sometimes reserved for the subjective experience of the emotion [Damasio 1994, p. 139], while other times emotion is reserved for the mental categorization of the affective experience [Barrett 2005]. In this chapter I refer to feeling when I mean experience. Mood refers to the longer-term affective state an individual is in, is usually less intense, unrelated to a specific event, and less differentiated [Beedie et al. 2005]. For example, I can be in a cheerful mood, in which case I feel positive (i.e., positive associated affect) for no particular reason (it is not directed at something specific), and although I feel good I am not necessarily laughing all the time (it is not intense). Mood influences emotion elicitation; pre-existing moods intensify congruent emotional responses [Neumann et al. 2001]. For example, being in a positive mood makes me more, and more easily, joyful, just like a grumpy mood will make me more easily angry. Moods can be caused by psychological and biological events including repeated emotions (e.g., repeated exposure to negative events), thoughts (e.g., when ruminating or mind wandering), and changes in physiological state (e.g., hunger). Moods are also difficult to identify for people and can be unconscious. Attitude refers to affect that has been associated with something or someone. For example, I like Chinese food (I have a positive association with Chinese food). Attitudes form due to repeated exposure to and appraisal of a stimulus. Attitude is also referred to as opinion or sentiment. For example, in opinion mining and sentiment analysis [Liu and Zhang 2012], one tries to automatically identify the attitude the public has for a particular thing or person based on text data. Related to attitudes are relations (interpersonal stance [Scherer 2005]), which are social attitudes attributed to other agents, typically other people but not exclusively. For example, I like my boss (i.e., I have a positive attitude toward my boss), and I love my children (i.e., I have a positive attitude toward my children and I feel a bond). Relations are complex social constructs (see related chapters), but when it comes to emotion modeling in SIAs this definition is sufficient. Personality (affect dispositions [Scherer 2005]) refers to generic and stable characteristics of a person in terms of behavior, emotion, and thought. Usually, a person’s personality is expressed as values on several traits. These personality traits are the result of large factor analysis studies of personality adjectives with the aim of expressing as much variation as possible in as little number of factors. Then such factors are transformed into questionnaires and validated for measuring personality traits. A well-known personality model that is used often in SIAs is the big five factor model consisting of Openness, Conscientiousness, Extraversion, Agreeableness, and Neurotism (a.k.a. OCEAN) [McCrae and Costa 1987, Goldberg 1990].

A newer, related personality model that puts more emphasis on emotional and relational factors by introducing a new factor, Honesty-Humility, is HEXACO [Lee and Ashton 2004]. 10.1.1.2

Emotion Now that we have defined the most commonly used terms related to affect, we move on to the three most influential perspectives on emotion in psychology that have influenced emotion modeling in SIAs: the categorical perspective, the dimensional-constructionist perspective, and, the cognitive appraisal perspective (for an excellent comparison of the fundamentals behind categorical versus dimensional perspectives, see Zachar and Ellis [2012]). Emotion can be studied from these different, complementary perspectives. Each of these bring unique insights into what emotion is, and how emotions are produced and represented as psychological constructs in human minds. Furthermore, these perspectives offer opportunities but also present limitations to computational modeling of emotion. It is therefore important to understand these perspectives before modeling emotion in a SIA because these perspectives ultimately shape what you can expect from (interaction with) your emotional agent. When emotions are studied from a categorical perspective, an emotion is a specific multi-modal response resulting from an assessment of the situation in terms of survival potential for the individual. All humans have the same survival needs, and many related animals as well. As a result, many emotions are similar in different individuals of one species, and probably even between species [Bekoff 2008, De Waal 2019]. The modalities of this reflex typically consist of an affective assessment of the situation (is it good or bad), a specific feeling (how does this feel), a specific action tendency (what do I do), and if evolution had a need for it, a typical expression pattern (how do I show this internal state). For example, anger is a negative feeling due to someone doing you harm. Anger has an associated tendency to act aggressively, a particular facial grimace, and an approach posture. The categorical view emphasizes the evolutionary roots of emotion and its role in shaping behavior and communication. Key historical theories that represent this view include Darwin’s emotions as serviceable habits (see Barrett [2011] for a critical analysis), Ekman’s basic emotions [Ekman and Friesen 1971], and Frijda’s action tendencies [Frijda 1988]. Jack et al. [2014] present more recent work in this line, refining the notion of basic emotion categories into biologically plausible hierarchies by studying perception of computationally generated dynamic facial expressions. Indeed, there is developmental evidence that emotional categories and the labeling thereof develops over time and becomes more refined when children grow older [Widen and Russell 2008]. The categorical perspective

is useful when one is interested in communication, labeling of emotions, emotion specificity, and embodied approaches. When emotions are studied from a dimensional perspective, an emotion is the person’s interpretation of currently felt core affect, where core affect is described in terms of affective dimensions [Russell 1980]. This relates to the constructionist view [Barrett 2005] which emphasizes that many important emotions that we experience are not related to any expression or action tendency, even though we do have words and feelings that clearly identify these emotions as specific mental constructs with an affective feeling. We learn to classify core affect, together with the context of its emergence, just like we learn to classify colors or car brands. These affective dimensions typically include valence (a.k.a. pleasure) and arousal (not the same as emotional intensity), and sometimes a third dimension called dominance (related to motivational stance and social verticality [Mast and Hall 2017]). Valence refers to the positive and negative aspect of the emotion, arousal to the associated physiological activation, and dominance to the amount of influence and control the individual feels over the situation. Emotions are the labels we learn to attach to specific values of core affect together with the context. For example, sadness is a label we identify with a feeling of low valence, low arousal, and low dominance when something happens that is irreversible; while elation (extreme joy) is a label we identify with a feeling of high valence, high arousal, and high dominance [Mehrabian 1980]. Affective dimensions can also be associated with other psychological constructs including moods, thoughts, opinions, and even representations of objects. It is important to keep in mind that while it is possible to describe an emotion in terms of its affective dimensions, these dimensions have low specificity and without knowing what a PAD value triplet represents in context it is hard to deduce what it means. For example, interpreting high valence, high arousal, and high dominance as elated is not necessarily correct; the individual might feel extreme pride instead. More contextual information is needed to find the “correct” emotion label because many emotion labels map to similar PAD values. The dimensional perspective is useful in SIAs when one is interested in a common representation for different affective phenomena (e.g., when modeling emotion, mood, and attitudes) or when emotional continuity is important (e.g., when modeling emotions that dynamically change from one into the other). When emotions are studied from a cognitive appraisal perspective, an emotion is the result of the evaluation of the situation on a set of cognitive dimensions in light of the individual’s concerns in order to motivate the individual into appropriate action [Arnold 1960, Ortony et al. 1988, Smith and Lazarus 1990, Van Reekum and Scherer 1997, Roseman and Smith 2001, Scherer 2001]. In short, emotion results from concern-based reasoning. Some evaluations are simple assessments

of stimulus properties, for example, the suddenness of a stimulus or the intrinsic pleasantness of that stimulus. Others are complex assessments of the consequences and causes of the stimulus, for example, goal congruence and attribution of responsibility. However, the core of this view is that emotion is largely the result of a cognitive evaluation of the situation. As mentioned, this cognitive appraisal process is organized into different processes often referred as appraisal dimensions. For example, if a car is nearing me at great speed, this is a sudden stimulus that is of personal relevance, not conducive to my concern of survival, and I have limited control. This combination of appraisal dimension “activations” (sudden, high relevance, low goal congruence, and low control) is typically associated with the emotion we would label as “fear” [Scherer 2001]. Cognitive appraisal theory thus links cognitive processing to the elicitation of emotion. Note, however, that modern appraisal theory does not claim that all assessments are due to conscious reasoning, for example, when the taste of candy is assessed as pleasant. Cognitive appraisal theories are less concerned with exact emotion labeling of the appraisal outcome, nor are they with the resulting specific behavior. Although many SIA researchers have interpreted appraisal as a goal/belief-derived reasoning process aimed at action planning (see, e.g., Rosis et al. [2003] and Gratch and Marsella [2004]), this view has been advocated explicitly only recently by a subset of cognitive emotion theorists [Reisenzein 2009a, Moors et al. 2017]. The cognitive appraisal view is useful when one is interested in modeling the emotion elicitation process but lacks a precise mapping between the appraisal and the resulting emotion label. An exception to the latter is the Ortony–Clore–Collins (OCC) model [Ortony et al. 1988] (see Bartneck [2002] for an analysis of the model), explaining its popularity in SIAs [Rosis et al. 2003, Popescu et al. 2014] as well as formal modeling of emotion [Meyer 2006, Steunebrink et al. 2007, Adam et al. 2009]. The cognitive appraisal perspective is helpful when emotions are needed for agents that have a cognitive basis for their AI. These three perspectives are complementary. Affective dimensionconstructionist views give us a generic representation of affect for a wide variety of affective phenomena while emphasizing a common emotional core and statistical categorization principles explaining individual emotional development and variation. Categorical emotion research brings us structure in affective expression and communication while emphasizing the biological roots of emotions as coordinated behaviors to address immediate concerns. Cognitive appraisal theory brings us information processing mechanisms for the elicitation of emotion while emphasizing that specific emotions are elicited by thought processes that are mental, individual, and contextual. There are also important similarities. First of all, all views emphasize that emotion is about experiencing the positiveness

and negativeness of a situation related to the well-being of the organism. All views therefore acknowledge that emotions at the core are about assessing the utility of the current situation with respect to survival of the individual. The valence dimension represents positiveness/negativeness in the dimensional view; categorical emotions are hierarchically structured around positive and negative emotions; and cognitive appraisal resolves around an affective evaluation related to personal concerns. Second, the different views reserve an important role for power, including social power. The dominance dimension represents the extent to which one influences or is influenced by the external environment (including others), an important aspect of categorical emotions is whether the emotion is an approach versus avoidance emotion, and many cognitive appraisal theories propose coping-related appraisal processes related to the perception of power and influence [Scherer 2001]. Finally, all views emphasize the importance of bodily activation. The arousal dimension represents the extent to which a stimulus, thought, relation, etc., has associated bodily activation; categorical emotions are strongly tied to action and bodily activation through action tendencies [Frijda 1988]; and cognitive appraisal theories have processes related to the urgency and novelty of stimuli (e.g., Scherer [2001]) that predict alertness of the individual. 10.1.1.3

Affect and Cognition Modern psychology, neuroscience, and computational modeling research strongly suggests that affective processing and cognitive processing are interdependent [Damasio 1994, Marsella and Gratch 2009, Reisenzein 2009a, Rolls 2014, Moors et al. 2017, Broekens 2018, Hoemann et al. 2019]. In fact, many emotion theorists nowadays suspect that a hard distinction between cognition and emotion is arbitrary and not helping in advancing our understanding of affect and cognition. However, here I will not go into that debate but simply list several well-known interactions between affect and cognition. First, mood influences information processing in that positive moods typically favor high-level processing of information and creative problem solving while negative moods favor attention to detail and critical reflection [Vosburg 1998, Dreisbach and Goschke 2004], and certain information processing styles are more prone to influences of mood than others [Forgas 2000]. Second, memory recall is mood congruent [Matt et al. 1992]. Third, emotional processing needs to be intact for decision making [Damasio 1994], and affect influences how decisions are made, for example, in negotiations [Kleef et al. 2004, Broekens et al. 2010]. Forth, memories with strong associated emotions are easier to remember [Reisberg and Hertel 2003], affect-based attitudes are relatively stable compared to cognition-based attitudes [Edwards 1990], and affect plays an important role in how attitudes [Maio et al. 2018] and judgments [Greifeneder et

al. 2010] are formed. Many mechanisms have been proposed that might be responsible for this, including current affect as a source of information about something, current affect that triggers associations with congruent attitudes, arousal as intensity measure for the importance of beliefs and memories, emotion processing as a way to value alternative outcomes, etc. It would go too far to review this field here, but I hope to have convinced you that cognition and emotion are intertwined.

10.1.2 Why Do SIAs Need Emotions? Now that we have covered some background in affective science, I explain why SIAs, particularly those that need to interact with humans, need some form of artificial emotional intelligence [Picard 1997, Schuller and Schuller 2018]. Emotional intelligence is defined as the ability to carry out accurate reasoning about emotions and the ability to use emotions and emotional knowledge to enhance thought [Mayer et al. 2008]. This ability should—for now—be considered a “holy grail” for SIA research as this is still a long way to go. For this to be possible, many things need to be in place including proper recognition of emotion in humans, plausible emotion elicitation simulation and incorporation of affective information in the agent’s information processing, and finally reliable emotion expression synthesis. Further, agent reasoning, machine learning, and pattern recognition is needed as well for a solid understanding of the context. However, why would we want this in the first place? In general, there are two main reasons for using affect in SIAs: expression of affect and recognition of affect can be used as a means to enhance SIAs’ communication abilities, and the modeling of affective processes can enhance the SIAs’ decision making abilities. This closely follows the function of emotion in humans. On an interpersonal level, emotion has a communicative function: the expression of an emotion is used to communicate social feedback as well as empathy (or distance) (see, e.g., Fischer and Manstead [2008]). On an intra-personal level emotion has a reflective function [Oatley 2010]: emotions shape behavior by providing feedback on past, current, and future situations [Baumeister et al. 2007] as well as help to make important decisions [Damasio 1994]. Communication of affect is essential for the development of children [Klinnert 1984, Chong et al. 2003, Buss and Kiel 2004, Saint-Georges et al. 2013]. If SIAs are implemented either as tutor/coach or as teachable robot or agent, then it is likely that both roles are difficult to fulfil without the use of affective processing in the SIA (see, e.g., Heylen et al. [2003] and Castellano et al. [2013]). Indeed, evidence indicates that emotion helps for both roles [Broekens 2007, Leyzberg et al. 2011]. Communication of affect is also essential for the development of relationships

and the building of trust [Weber et al. 2004], and the communication of empathy [Dimberg et al. 2011]. These are important elements in the building of rapport with a conversational agent (see Chapters 11 and 12 on empathy and rapport, respectively). Indeed, there is evidence that SIAs that express emotions and mood as part of their behavior are perceived to be more empathic, resulting in higher trust [Cramer et al. 2010] than those that do not, and that emotions expressed by agents influence how the users respond to the agent [de Melo et al. 2011, 2014]. Also, there is a long line of research in emotional and sociable robots showing that users generally attribute all kinds of human abilities and characteristics to the robots (see, e.g., Turkle et al. [2006]). Emotions are essential in decision making [Damasio 1994]. Classically, artificial reasoning and decision making is approached from the perspective of optimality: the process should give the best possible outcome given the input data/knowledge base. However, in many cases, reaching a “good enough” solution (i.e., satisficing solution) is fine as well. Further, sometimes the goal is to represent human decision making, not to reach optimality per se [Baarslag et al. 2017]. Indeed, work on embedding emotions in decision making architectures clearly showed diverse benefits for reaching better solutions or good enough solutions faster [Belavkin 2001, Salichs and Malfaz 2012]. A long history of research into cognitive-affective architectures shows that emotions can play a useful role in agent learning, exploration, and reasoning [Velasquez 1998, Franklin and Graesser 1999, Marinier III and Laird 2004, Hogewoning et al. 2007, Franklin et al. 2014] (see historical section for more references). In short, emotions are in various ways crucial for SIAs. We will next look at the computational methods to implement emotions in such agents.

10.2 Computational Models and Approaches

This section focuses on computational modeling of affect, excluding both expression synthesis (see Section II in this book, and Section III in Calvo et al. [2014]) and automated affect detection (see Section II in Calvo et al. [2014] and Section II in Burgoon et al. [2017]). I discuss the most commonly used approaches to modeling emotion and directly related affective phenomena including mood, attitude, relation, and personality. First, I cover computational representations of emotion, mood, attitude, relation, and personality. These representations are the computational constructs maintaining the values of the affective variables. Then, I cover four approaches to the computational modeling of appraisal (i.e., emotion elicitation), the mechanism responsible for simulating the values of artificial emotions when the SIA evaluates its situation. Finally, I cover four working examples, one for each approach.

10.2.1 Computational Representations of Affect

In this section, I cover the most commonly used computational representations of the affective phenomena introduced in the definition sections. Note that SIA architectures do not need to implement all of these phenomena. The emotion of the agent (a.k.a. affective state) is usually represented as a vector with intensities e = [i_{E_1}, ..., i_{E_n}] for each represented emotion E = {E_1, ..., E_n} [Ochs et al. 2009, Kaptein et al. 2016]. Vector elements typically represent categorical emotions or dimensions. For example, if an agent models the six Ekman emotions, then E = {joy, surprise, anger, fear, disgust, sadness} with, for example, an emotional state equal to e = [1, 0.5, 0, 0, 0, 0] denoting maximum intensity for joy and half intensity for surprise. If the emotion is represented as emotion dimensions, then, for example, E = {valence, arousal, dominance} and the emotional state could be e = [1, 1, 0]. Note that when the agent needs to express or reason upon this state, there are many different ways to do this. For example, one could take Express(Max(e)) to express only the emotion with the highest intensity, or one could express the interpolation of the expressions based on all intensities in ExpressInterpolate(e), if the expression/rendering allows this. For reasoning upon the state, similar choices need to be made. The emotional state is changed due to appraisals (covered in detail below). Appraisals eventually result in emotion intensities, which are integrated into the emotional state. In the dynamics below we assume an appraisal sets the emotional state; however, one can also add the appraisal to it and use a bound for the emotion intensities. Further, in the absence of appraisals the emotional state typically decays over time. So, the complete abstract dynamics for the emotional state can be written as follows:

e_{t+1} = a_t(situation) ∨ {e_t * 𝛾}    (10.1)

with 𝛾 ∈ [0, 1] and a_t(situation) the outcome of the appraisal process at time t, represented as an emotion vector with dimensions E. The mood of the agent is also commonly represented as a vector with intensities m = [i_{M_1}, ..., i_{M_n}] for each represented mood component M = {M_1, ..., M_n}. Moods are typically represented as a vector of dimensions [Peña et al. 2011, Jones and Sabouret 2013], but one also sees approaches with a categorical mood state. The mood state typically is a function of the history of the last n emotional states, so:

m_t = f(g(e_{t−n}), ..., g(e_t))    (10.2)

Usually, f() is some form of averaging, that is, f() = avg(), with g(x) → x, that is, function g() does nothing. Sometimes the mood representation is different from
the emotion representation, for example, when the emotional state is represented as a vector of basic emotion intensities while the mood is represented as a vector of affective dimensions. In those cases, function g() maps the emotional state to a different representation first. We will use g() to denote a mapping function between affective dimensions throughout this chapter. For example, when M = {valence, arousal, dominance} and the emotional state is again based on Ekman, E = {joy, surprise, anger, fear, disgust, sadness}, then a possible element from this mapping could be g([1, 0.5, 0, 0, 0, 0]) → [1, 1, 1]. Such mappings are usually continuous functions based on findings from the literature (see, e.g., the word-affect lists from Bradley and Lang [2007] or Mehrabian [1980]). Moods can also influence the emotional state. One of the more common ways is to have the emotional state decay to the mood state:

e_{t+1} = a_t(situation) ∨ {g^{−1}(m) * (1 − 𝛾) + e_t * 𝛾}    (10.3)

with g^{−1}(m) the inverse mapping from the agent's mood representation to the emotional state representation. Note that such an inverse mapping is not trivial if the emotional state has a higher dimensionality; it requires a representation of the emotions in the lower-dimensional mood space, for example, as points representing prototype emotions, and a distance function defining the intensities [Breazeal 1998]. An attitude is usually represented as an association between a piece of affective information and a piece of knowledge (a belief, a state, an entry in a knowledge base, a "chunk," a thought, an image, a word, etc.). The affective information typically originates from the emotional state or the appraisal process. Again, the actual representation varies depending on the affective dimensions used, but attitudes are commonly represented with the valence dimension alone. Computationally this means that an attitude at_k = i is a tuple of an intensity i and a piece of knowledge k. Attitudes form due to the attribution of an appraisal or emotion to a piece of knowledge k. To do this, the agent appraises the current situation, potentially identifies salient aspects of the situation, and stores the result of the appraisal, or the resulting emotional state, as an association with the situation or one or more of its aspects. For example, if a virtual math coach identifies that the currently proposed exercise is too hard for a child, and the child reacts with anger, it could appraise based on the OCC model [Ortony et al. 1988] that its own action is blameworthy, resulting in the emotion of guilt and storing a negative attitude for that exercise, at_{exercise} = g([joy = 0, ..., guilt = 1]) with g([joy = 0, ..., guilt = 1]) = −1, assuming that attitudes get mapped from the OCC emotions to a one-dimensional valence representation by a mapping function g() taking either an emotional state
e or appraisal a as input [Jones and Sabouret 2013]. Also, attitudes have dynamics, with a simple abstract form equal to:

at_{k,t+1} = g(e ∨ a) * (1 − 𝛾) + at_{k,t} * 𝛾    (10.4)

with t the time dimension, usually moments of attribution so that attitudes are averaged over attributions, and 𝛾 another discount factor. Attitudes can also be stored as fuzzy or probabilistic links between beliefs and emotions in, for example, a Belief Net. When it comes to computational representations of relations, we have to distinguish relations from social emotions. Relations are typically represented as r_{agent} = i, with agent being the other agent and i again being the intensity of the relation on some affective dimensions (see, e.g., Ochs et al. [2009]). The rest is similar to attitudes: relations develop over time due to emotions or appraisals and can be represented on different dimensions than the emotional state. Social emotions are emotions that exist by virtue of relations with other agents, are attributed to other agents, and influence relations with other agents. To highlight the difference: one can have a positive relation with someone and feel proud of, envious of, angry at, disappointed with, or thankful for that person. This might influence the relation (following the attribution as explained above), but it is something different. To give a concrete example: if I have a positive relation with Peter, r_{Peter,t} = 1, and Peter does something that makes me feel disappointed, then following Equation 10.4 my relation could become:

r_{Peter,t+1} = g([..., disappointment = 1, ...]) * 0.2 + r_{Peter,t} * 0.8
r_{Peter,t+1} = −1 * 0.2 + 1 * 0.8    (10.5)
r_{Peter,t+1} = 0.6

with g([..., disappointment = 1, ...]) = −1 and 𝛾 = 0.8. If Peter repeatedly disappoints me, my relation toward him will gradually move to −1 with a speed depending on 𝛾. Personality is usually represented as a vector of personality trait values p = [i_{T_1}, ..., i_{T_n}], with T = {T_1, ..., T_n} being the traits, for example, based on OCEAN [Goldberg 1990]. Contrary to the above affective constructs, the personality is assumed to be stable, so once set, p does not change for an individual agent. There are many ways in which personality can influence the emotional state, including changing the way information is processed, changing the sensitivity of particular emotions by having, for example, an emotion-specific personality-dependent 𝛾, changing the weight for particular appraisal processes in the calculation of the appraisal of
the current situation, and using the personality as a default agent mood [Gebhard 2005]. In most cases some static mapping g(p) is introduced linking p to the factors that change the emotional outcome (appraisal weights, emotion sensitivity) [Jones and Sabouret 2013].
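
To make these representations concrete, the following is a minimal sketch in Python of the bookkeeping just described: an emotion vector that is set by appraisals and otherwise decays (Equation 10.1), a mood computed as the average of recent emotional states (Equation 10.2 with f = avg and g the identity), and attitude and relation updates following Equations 10.4 and 10.5. The class and variable names are illustrative only and do not correspond to any particular published system.

```python
import numpy as np

EMOTIONS = ["joy", "surprise", "anger", "fear", "disgust", "sadness"]  # Ekman-style labels


class AffectState:
    """Minimal affect bookkeeping: emotion vector, mood, attitudes, and relations."""

    def __init__(self, gamma=0.8, history_length=10):
        self.gamma = gamma                                   # decay/discount factor
        self.e = np.zeros(len(EMOTIONS))                     # emotional state e
        self.history = [self.e.copy() for _ in range(history_length)]
        self.attitudes = {}                                  # piece of knowledge -> valence
        self.relations = {}                                  # agent name -> valence

    def appraise(self, appraisal):
        """Equation 10.1, appraisal branch: the appraisal sets the emotional state."""
        self.e = np.asarray(appraisal, dtype=float)
        self.history = self.history[1:] + [self.e.copy()]

    def decay(self):
        """Equation 10.1, no-appraisal branch: the emotional state decays over time."""
        self.e = self.e * self.gamma
        self.history = self.history[1:] + [self.e.copy()]

    def mood(self):
        """Equation 10.2 with f = avg and g = identity."""
        return np.mean(self.history, axis=0)

    def attribute_attitude(self, knowledge, valence):
        """Equation 10.4: blend a new valenced attribution into the stored attitude."""
        old = self.attitudes.get(knowledge, 0.0)
        self.attitudes[knowledge] = valence * (1 - self.gamma) + old * self.gamma

    def attribute_relation(self, other_agent, valence):
        """Equation 10.5 example: relations follow the same update rule as attitudes."""
        old = self.relations.get(other_agent, 0.0)
        self.relations[other_agent] = valence * (1 - self.gamma) + old * self.gamma


# The Peter example: a relation of 1, a disappointment attributed as valence -1, gamma = 0.8.
state = AffectState(gamma=0.8)
state.relations["Peter"] = 1.0
state.attribute_relation("Peter", -1.0)  # state.relations["Peter"] is now 0.6
```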

10.2.2 Emotion Elicitation The core of an emotion is the assessment of personal relevance of a situation, thereby in some way providing feedback on the suitability of past, current, or future behavior [Van Reekum and Scherer 1997, Baumeister et al. 2007, Broekens et al. 2013, Moors et al. 2013]. As mentioned, in this chapter I will use the term appraisal for this process of assessment. An emotion occurs when something happens that is personally meaningful to the agent. In this chapter, I cover four approaches toward computational modeling of appraisal: cognitive-agent based modeling, embodied modeling, reinforcement-learning modeling, and hard-wired appraisal. Each approach is based on different modeling principles, in particular with respect to the concept of goal, and the representation of the agent– environment relation. Cognitive-agent based appraisal models use some form of cognitive agent formalism—such as BDI logic or utility-based planning—to represent the agent–environment relation, with an explicit representation of a goal that can be used in planning and reasoning (for review, see Gratch and Marsella [2014]). Embodied modeling uses a goal representation derived from homeostasis and places less emphasis on symbolic processing to assess the agent–environment relation (for a recent treatise see Canamero [2019]). Reinforcement learning-based modeling proposes that goals are derived from some value or reward signal, and the agent–environment relation is based on (models of) state-action-rewardstate-action sequences (see Moerland et al. [2018] for a recent review). Hard-wired appraisal directly or indirectly encodes the appraisal outcome in the stimuli perceived by the agent. Although each of these approaches are different, they all share a utility-based view of appraisal. Therefore, from a theoretical perspective all of these approaches are related to the psychological concept of appraisal. There is always some agent’s need and some form of discrepancy or distance between the current state and that need. Emotion is derived from this discrepancy and the intensity of the need. 10.2.2.1 Cognitive-agent Based Modeling Cognitive-agent based appraisal models are based on cognitive theories of emotion. In cognitive appraisal theory emotion is often defined as a valenced reaction resulting from the assessment of personal relevance of an event [Ortony et al. 1988, Scherer 2001, Moors et al. 2013]. The assessment is based on what the agent believes

to be true and what it aims to achieve as well as its perspective on what is desirable for others. The basis is that a collection of computational processes analyze the current situation in terms of desirability for and impact on the agent [Dias and Paiva 2005, Steunebrink et al. 2007, Marsella and Gratch 2009, Popescu et al. 2014]. Cognitive appraisal models in SIAs mostly come in two flavors: those developed from theories that propose specific appraisals to assess the affective outcome of stimuli, we refer to this as stimulus appraisal, and those developed from theories which propose that emotions result from belief–desire structures, we refer to this as belief–desire theory of emotion (BDTE). We first cover stimulus appraisal. Each appraisal process is responsible for a particular aspect of the analysis of the stimulus and together they result in an emotion. Most computational models are based on Ortony, Clore, and Collins’ model [Ortony et al. 1988]. This OCC model proposes a tree-structure of evaluations, resulting in one or more specific emotions for a given situation. The appraisal process is described in abstract computational terms including statements such as: if desirable(event) ∧ approveAction(otherAgent) → emotion(gratitude)

(10.6)

Although this does not tell us how to calculate desirable(event) or approveAction(otherAgent) or the intensity of emotion(gratitude), it does give a precise structure to implement appraisal rules and the resulting emotions. Many computational models use this structure as a guideline for the emotion elicitation process of virtual agents and robots. Further, due to this clarity it facilitates selecting an expression when interacting with humans. Computational models need some appraisal logic to decide how to implement the appraisal processes and the emotional intensities. For example, based on modal logic, desirable can be expressed as accomplishing a goal: if 𝜅 ∈ K ∧ 𝜅 = true → desirable(𝜅)

(10.7)

This still needs to be extended with proper intensities and a logic for consequences of events, see, for example, Steunebrink et al. [2008] for more detail on such an approach. Other approaches are more agent-logic agnostic and focus on the appraisal framework [Jiang et al. 2007, Ochs et al. 2009, Dias et al. 2014, Popescu et al. 2014], or use a fuzzy-logic approach based on OCC to determine the emotional state [El-Nasr et al. 2000]. Several models are available as open-source appraisal engines, including FAtiMA [Dias et al. 2014] and GAMYGDALA [Popescu et al. 2014]. A second influential stimulus appraisal idea is that appraisal processes evaluate stimuli in order to motivate appropriate behavior, with a looser connection to the specific emotion that results from those processes. This can be found in Smith and

Lazarus’ appraisal theory as well as in Scherer’s stimulus evaluation checks (SEC). Both propose that situations are checked by specific appraisal processes, in line with OCC, but the appraisals are different and follow a different process. SEC, for example, proposes that simple appraisals assess the stimulus first, including relevance and pleasantness of that stimulus, after which more complex processes kick in, including goal congruence and even later coping. Computational models based on SEC follow similar lines as those based on OCC, namely, they have to select and implement specific appraisals, which together give an indication of the resulting emotion. For example, when a stimulus is sudden and unpleasant, fear is likely to be the emotional result. A computational model will now have to implement if sudden(event) ∧ unpleasant(event) → emotion(fear)

(10.8)

While Scherer’s model goes into quite some depth on the appraisal processes, the link between appraisal activation and emotion is less clear. Therefore, it is harder for the agent designer to decide based on this theory what specific emotion comes out of the reasoning process. This is not problematic when interested in simulating appraisal (such as in the work by Marinier III and Laird [2004]), but may become a problem if clear emotion signals need to be sent to the user of the SIA. A wellknown computational model that is inspired by this idea of sequential checking in light of an organism’s adaptation and functioning is EMA [Marsella and Gratch 2009]. EMA implements appraisals as fast automatic and parallel evaluators of the current cognitive state. As such the appraisal has no causal influence (although the information may be used for coping later on) but provides a moment-to-moment affective summary of the situation, an idea that also resonates with the ideas of BDTE discussed next. We now cover the cognitive-motivational view also known as belief–desire theory of emotion (BDTE). BDTE assumes that appraisal of beliefs and desires, rather than stimuli, is the core of what emotions are. The key difference is that BDTE proposes that emotions result from an assessment of the current belief–desire structure of an agent, while previous theories propose that perceived stimuli are assessed with a set of appraisal processes in the context of desires of the agent. There are subtle differences in the psychological and philosophical underpinnings and ramifications of different BDTE approaches (see Reisenzein [2009b]), but from a SIA engineer’s perspective all BDTE approaches place important emphasis on the concepts of beliefs and desires (goal states in particular). To explain this view, we take Reisenzein’s cognitive model as an example. In this model there are two core appraisal processes: belief–belief and belief–desire congruence. Reisenzein proposes that these processes are sufficient to explain eight basic emotions [Reisenzein 2009a].

For example, joy results from the belief that a state s is true and desired (i.e., s is a goal state). Fear would result from the belief that a state s becomes more true but not entirely, and s is not desired. And so on... In practice, though, the difference between BDTE models and stimulus appraisal models is not so big from a computational point of view. In both cases, the computational model needs to explicate the appraisal processes, and this is usually done with a form of utility planning/goal-based agent formalism. For example, while Steunebrink uses OCC as basis [Steunebrink et al. 2007] and Kaptein uses Reisenzein’s BDTE [Kaptein et al. 2016], both use similar agent logic to decide upon the desirability of events (whether or not a belief helps in achieving a desired goal, that is). Further, in both approaches, typical agent implementations will not reason upon just a holistic state representation but instead will reason over beliefs, plans, and goals [Reilly 1996, Castelfranchi 1999, Meyer 2006, Steunebrink et al. 2007, Marsella and Gratch 2009, Kaptein et al. 2016]. For example, if a particular belief brings a particular goal closer (e.g., in terms of time or effort of the agent), then hope could result. As the final actions of these agents are often also informed by this same process of reasoning, action and emotion are in line, which is consistent with current cognitive views on emotion [Moors et al. 2017]. 10.2.2.2 Embodied Models In theories that emphasize biology, behavior, and evolutionary benefit [Panksepp 1982, Frijda 2004], or core-affect [Russell 1980], the emotion is more directly related to action selection, the body, hormones, biological drives, and particular behaviors but the core of the appraisal is similar: an assessment of harm versus benefit resulting in action aimed at adapting the behavior of the agent. Computationally, embodied models of emotion emphasize that agents have drives, limited decisionmaking resources, an artificial body, and an action selection problem. Core in most of these models is that at some point the agent needs to select an action in real time and this action needs to be consistent with the need to keep a set of homeostatic variables in check [Arkin et al. 2003, Cañamero 2003, Cos et al. 2013]. This process of homeostasis is the basis for the goal representations. For example, the emotion from the homeostasis process may be used as additional—or modulation of the—reward signal (e.g., Cos et al. [2013] motivated actor–critic approach). Such emotion models are thus implemented on top of homeostatic machines [Man and Damasio 2019]. Most implementations of such models are used in studying how robots can solve relatively simple resource gathering-like tasks [Avila-Garcia and Canamero 2005, Kiryazov et al. 2013], although these principles can be applied to human–robot and human–agent interaction as well [Breazeal 1998, Arkin et al. 2003, Verdijk et al. 2015].
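
To illustrate the homeostatic scheme just sketched, the toy Python fragment below couples two homeostatic deficits to competing drives and reads a crude pleasure/arousal signal off the resulting state. All names and numbers are invented for illustration, and a fuller worked example of this approach follows in Section 10.2.3.2.

```python
class HomeostaticAgent:
    """Toy homeostatic machine: competing drives select behavior, affect is read off the state."""

    def __init__(self):
        # Homeostatic deficits in [0, 1]; higher means further from the set point.
        self.hunger = 0.5
        self.thirst = 0.5

    def drives(self, food_visible, water_visible):
        """Drive intensities combine internal deficits with currently perceived stimuli."""
        return {
            "eat": self.hunger * (1.0 if food_visible else 0.0),
            "drink": self.thirst * (1.0 if water_visible else 0.0),
            "search": self.hunger * self.thirst,
        }

    def select_behavior(self, food_visible, water_visible):
        """Winner-take-all action selection over the behavioral urges."""
        d = self.drives(food_visible, water_visible)
        return max(d, key=d.get)

    def affect(self, food_visible, water_visible):
        """Hedonic reading: pleasure as homeostatic well-being, arousal as overall urge."""
        d = self.drives(food_visible, water_visible)
        pleasure = 1.0 - (self.hunger + self.thirst) / 2.0
        arousal = sum(d.values()) / len(d)
        return pleasure, arousal


agent = HomeostaticAgent()
agent.select_behavior(food_visible=True, water_visible=False)  # -> "eat" when hungry and food is present
```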

An important concept in embodied models of emotion is grounding: emotion is emerging from the organism’s assessment of (and is functionally meaningful to) its bodily state and well-being. For example, fear is the anticipation of bodily harm, resulting in avoidance behavior. If a robot has sensors to detect body integrity (which is a homeostatic variable it wants to keep up), and it has an association between a certain stimulus and a decrease in body integrity, then the perception that body integrity is anticipated to drop triggers a drive to do something about that, eventually resulting in an action to move away from the stimulus. Notice that in this process we could add that fear is triggered with an intensity equal to the predicted drop in body integrity, but in this embodied example this does not add much to the whole process. Indeed, in many embodied approaches the emotions are considered emergent phenomena, consisting of the collective activation of processes including affect grounded in the robot’s body and activity to meet robotic needs. As such, they fit well with core-affect and constructionist views as well. For example, in the work by Kiryazov et al. [2013], arousal is a representation of the robot’s electrical energy processes. In the work by Avila-Garcia and Canamero [2005] the “emotion” of fear can be observed when the robot’s subsystems trigger behavior to avoid a competitor robot in a resource gathering task when in a high-risk health state. 10.2.2.3 Reinforcement Learning Models Most reinforcement learning (RL) models of emotion are in essence cognitive appraisal theories implemented on top of the RL paradigm. With RL an agent tries to solve a Markov decision problem by effective exploration, receiving after each action it takes as only feedback the reward and the next state it arrives in [Kaelbling et al. 1996, Sutton and Barto 2018]. The goal of the agent is to learn an action selection policy that will maximize utility, expressed as the sum of future rewards. The reward is a scalar R(s, a), the utility of a state is expressed as the value V(s), the value of an action is Q(s, a) and the Markov decision process model is usually represented as conditional transition probabilities T(s′ |s, a). RL models of emotion have been extensively surveyed in Moerland et al. [2018]. Here we summarize two of the four main approaches toward emotion elicitation (not how the simulated emotion is subsequently used). The other two are very similar to either embodied modeling or hard-wired appraisal. In the first approach, the agent learns and acts in the environment and the emotions are derived from the reward, the value function or the temporal difference signal (the update to the value of the state, see below in Section 10.2.3). The default assumption is similar to that of cognitive appraisal theory, namely, that affect is related to an assessment of utility. First ideas emerged as early as the 1980s with the work of Bozinovski [1982] interpreting the state-value as the emotion associated to

that state. Also, in Broekens et al. [2015] the value of the state is used as a signal for fear and hope, while in Moerland et al. [2016] the temporal difference signal is taken as basis for the simulation of joy and distress, hope and fear. Salichs and Malfaz [2012] model fear for a particular state as the worst historical Q-value associated with that state (remembering a particular bad situation that it should fear). Other approaches compute a mood-like signal from normalized averages of rewards over time [Schweighofer and Doya 2003, Hogewoning et al. 2007]. In the second approach, the agent appraises situations based on its model and environment states. Typically, appraisal processes are implemented based on a cognitive appraisal theory that take state and model information as input, and output either emotion intensities or appraisal intensities (e.g., novelty, desirability). These emotions can then be used as meta-learning parameters or as additional reward signals. Key approaches include Marinier and Lairds’ [2008] approach and Sequeira et al.’s [2014], both based on Scherer [Scherer 2001]. Notice that when it comes to emotion elicitation, this approach is in fact a cognitive appraisal-based approach, albeit using RL state and model as input. 10.2.2.4 Hard-wired Appraisal Finally, we briefly cover an emotion elicitation approach that I would refer to as hard-wired appraisal. Here, events or environmental stimuli have a predetermined meaning in terms of the appraisal processes or even in terms of emotion. For example, for the stimulus snake there would be a predetermined emotional outcome fear(snake) = 1. These primary emotional responses are often related to the low route of LeDoux’s emotion processing proposal, whereby primary emotions are evolutionary shaped complex responses [LeDoux 1996]. Secondary emotions are more difficult to simulate because each event has a predetermined emotional meaning, and for these secondary emotions cognitive processing is assumed to be needed. To add some flexibility, some approaches do not directly annotate events with emotional or appraisal consequences but indirectly use the input needed for the appraisal process. For example, in both Ochs et al. [2009] and Popescu et al. [2014] the events in a simulation can be annotated with appraisal-relevant information including to which goal the event is contributing and the likelihood of the event being true, after which the “black-box” appraisal engine will interpret the event and compute emotional consequence and effects on, for example, relations between agents.

10.2.3 Examples of Cognitive-affective Architectures In this section, we will go through the design of four fictional cognitive-affective architectures for a SIA inspired by the four modeling approaches just described.

We focus on the appraisal (not the expression). It is important to keep in mind that emotions are added to an agent for a particular purpose. This can be theoretical exploration, but also a practical purpose such as enabling a robot to express simulated emotions to children. These design goals change the way the model is developed and evaluated.

10.2.3.1 Cognitive-agent Based Appraisal

Assume we want a virtual agent that can help tutor children with math problems. Inspired by D'Mello et al. [2007] and Castellano et al. [2013], we assume that the children want to learn and we assume that the robot is empathic (i.e., its own emotions will mimic those of the child). As such we assume the robot has the same goal as the child, namely, goal(understand(X)). The robot can give exercises to the child in the form of actions pushed to a tablet interface, action(exercise(X, nr)), and perceive the answers, percept(answer(X, nr)), pushing answer(X, nr) into the belief base of the agent. It further has a knowledge base of correct answers correct(X, nr) and some rules stating that:

if answer(X, nr) = correct(X, nr) → likelihood(X, l + 1) else likelihood(X, l − 1)    (10.9)
if likelihood(X) > 10 → understand(X)    (10.10)
if answer(X, nr) ∧ ¬understand(X) → action(exercise(X, nr + 1))    (10.11)

which keeps pushing actions. Granted, this is a simple agent, but it will start pushing actions as long as the child does not understand a particular goal X, which we add in the following way to the goal base: goal(understand(fractions)). Now we implement a simple emotion model based on Reisenzein's ideas that emotions are belief–belief and belief–desire comparators. In fact, we cheat a bit because one of the comparators, the likelihood of the goal being true, is built into the logic in the form of the likelihood(X) predicate. We can now express the appraisal process for joy as follows:

if understand(X) ∧ goal(understand(X)) → joy(X)    (10.12)

The agent is happy when it believes the child reaches the learning goal X. Now this is not a very interactive agent, and it would be helpful to also express some hope when the child is doing a good job. For this the agent needs to know if the situation improved or not. We add:

if answer(X, nr) = correct(X, nr) → improved(X)    (10.13)

Now we can simply appraise this as follows:

if improved(X) ∧ goal(understand(X)) → hope(X)    (10.14)
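
Rules (10.9)–(10.14) are compact enough to prototype directly. The following Python sketch mirrors them with plain dictionaries standing in for the belief, goal, and knowledge bases; the class, the method names, and the stand-in answer check are illustrative assumptions rather than part of the example above.

```python
class TutorAgent:
    """Toy belief-desire tutor implementing rules (10.9)-(10.14)."""

    def __init__(self, goal_topic):
        self.goal_topic = goal_topic          # goal(understand(X))
        self.likelihood = {goal_topic: 0}     # likelihood(X), as in rule (10.9)
        self.improved = False                 # improved(X), as in rule (10.13)
        self.nr = 0

    def correct(self, topic, nr, answer):
        """Stand-in for the knowledge base of correct answers correct(X, nr)."""
        return answer == f"answer-{topic}-{nr}"

    def perceive_answer(self, topic, nr, answer):
        """Rules (10.9) and (10.13): adjust likelihood and track improvement."""
        if self.correct(topic, nr, answer):
            self.likelihood[topic] += 1
            self.improved = True
        else:
            self.likelihood[topic] -= 1
            self.improved = False

    def understand(self, topic):
        """Rule (10.10)."""
        return self.likelihood[topic] > 10

    def next_action(self, topic):
        """Rule (10.11): keep pushing exercises while the child does not yet understand."""
        if not self.understand(topic):
            self.nr += 1
            return ("exercise", topic, self.nr)
        return None

    def appraise(self, topic):
        """Rules (10.12) and (10.14): joy when the goal is believed reached, hope on improvement."""
        emotions = []
        if topic == self.goal_topic and self.understand(topic):
            emotions.append("joy")
        if topic == self.goal_topic and self.improved:
            emotions.append("hope")
        return emotions
```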

The agent is hopeful (and of course expresses this to the child) when improvement is made. I leave the formalization of distress and fear as well as an actual working simulation of this system as an exercise to the reader. 10.2.3.2 Embodied Appraisal Assume we want to investigate the relation between resource gathering, survival, and emotions. Inspired by Avila-Garcia and Canamero [2005] and Kiryazov et al. [2013], consider the following homeostatic machine (animat) with homeostatic variables H = {hunger, thirst}, behavioral urges B = {search, drink, eat}, potential stimuli S = {food, water}, and drives Deat = (hunger * food), Ddrink = (thirst * water), Dsearch = (hunger * thirst)

(10.15)

where we assume that the intensity of the drive Db equals the behavioral urge B and behavior selection is based on bactive = argmax(B). The problem this animat needs to solve is how to survive by keeping hunger and thirst low, while food and water are scattered around the world. It needs to solve an action selection problem and balance searching, eating, and drinking. If hunger or thirst goes up, search will be triggered and will become the argmax(B) resulting in searching behavior. When food is found, this triggers eating. When hunger is lower again, the animat will start searching because thirst is triggering search behavior but the hunger is gone due to eating. Upon finding water it will start drinking, lowering thirst after which it will start searching again and so on. Now what could emotion be in this system? Emotion can be simulated in different ways (the following are examples). First, we can assume a hedonic approach, and interpret the homeostatic state as pleasure, that is, pleasure = 1 − (avg(H)). This means that whenever the animat is doing well in terms of its homeostatic variables it is also feeling good, reflecting the idea of “core affect” [Russell 1980]. Second, we can assume an emotion as feedback approach, and interpret changes to the homeostatic state as signals of joy and distress, that is, pleasure = 𝛿(avg(H)). This means that whenever something happens that moves the homeostatic variables in the desired direction, the animat will feel good (and vice versa), reflecting the idea that emotions are abstract feedback signals about the appropriateness of actions for the individual’s well-being [Baumeister et al. 2007]. With respect to arousal, there are also different choices

to be made. For example, we can interpret the overall behavioral urge as arousal, that is, arousal = avg(B), reflecting the idea that arousal is related to (preparation of) physiological activity [Russell 1980, Frijda 2004]. Second, we can interpret arousal in a more holistic way such that arousal is “all activity including information processing,” that is, arousal = avg(B, S), presence of stimuli also increases arousal. Let’s pick an emotion as feedback and physiological activity approach. This means that activity is linked to arousal, and changes in the homeostatic state are linked to pleasure. If the animat is hungry, it will search and have high drives for eating. In the absence of food, the animat will feel aroused and on top of that it feels displeasure every time avg(H) decreases. When food is found, it will switch to eating behavior. The animat will still feel aroused (nothing changed there yet) but will feel pleasure due to the first bite of food reducing the hunger drive. While eating, hunger goes down and eventually vanishes. At this point, the animat will either stay there or start wandering around a bit, feels low arousal and neutral pleasure (no changes). Emotionally, we can thus observe the following: high arousal and displeasure when hungry and searching (fear?); high arousal and pleasure when food is found (excitement?); low arousal and neutral pleasure when finished eating (relaxed?). It is left as an exercise to the reader to implement this, perhaps use the emotion as an actual feedback signal for the agent and define additional evaluation criteria. 10.2.3.3 Reinforcement Learning Appraisal Assume we want an adaptive household service robot able to communicate to us the extent to which the learning process is converging and whether or not consequences of events were anticipated. Inspired by Thrun et al. [1999], Moerland et al. [2016], and Broekens and Chetouani [2019], consider an RL service robot. The robot receives rewards when the user praises the robot and for the amount of dust and dirt it collects. While it is learning, it experiences TD errors and updates Q(s, a) accordingly. Temporal difference errors are interpreted as signals of joy and distress [Broekens 2018]. For Q-learning this would mean that Joy and Distress are defined as follows: if (TD > 0) → Joy = TD

(10.16)

if (TD < 0) → Distress = TD

(10.17)

With the TD error defined in the standard way for Q-learning:

TD = r + 𝛾 max_{a′} Q(s′, a′) − Q(s, a)_{old}    (10.18)
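
As an illustration, the following Python fragment implements a standard tabular Q-learning update and reads joy and distress off the TD error as in Equations (10.16)–(10.18); the environment interface (states, actions, rewards) and all names are assumptions made for the sketch.

```python
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update; returns the TD error of Equation (10.18)."""
    td = r + gamma * max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)]
    Q[(s, a)] += alpha * td
    return td

def emotions_from_td(td):
    """Equations (10.16) and (10.17): joy for positive TD errors, distress for negative ones."""
    joy = td if td > 0 else 0.0
    distress = td if td < 0 else 0.0  # kept signed, as in Equation (10.17)
    return joy, distress

# Hypothetical household-robot step: a reward of 1 for dirt collected in the kitchen.
Q = defaultdict(float)
actions = ["vacuum", "dock"]
td = q_learning_step(Q, s="kitchen", a="vacuum", r=1.0, s_next="hallway", actions=actions)
joy, distress = emotions_from_td(td)  # positive TD error -> joy
```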

In terms of additional emotion dynamics, whenever an emotion is triggered it is added to the current emotional state intensity for that emotion using a logarithmic function with decay [Reilly 2006] to not saturate the emotion but keep gain at low intensities and allow for decay over time. At every point in time the agent thus has a vector E = [ijoy , idistress ]. It expresses this vector continuously. While learning the particular tasks in a household it will experience positive and negative TDs, expressing joy and distress to the user. By the time the tasks are converted to known RL policies, the robot will have become emotionally neutral (and thus show that the learning has converged) and only express emotions when TDs occur due to unexpected outcomes. It is left to the reader to implement a simulation of this, and again, define additional evaluation criteria. 10.2.3.4 Hard-wired Appraisal The last example is simple. Inspired by Ochs et al. [2009] and Popescu et al. [2014], assume we want a non-player character, a villager, in a video game to be able to simulate rudimentary emotions based on events in the game. If the villager can perceive the events S = {player_near, monster_near, gold_stolen, thief _near}, and has the following emotions E = {joy, fear, sadness, anger}, then we can annotate the events as appraisals as follows: A = {a(player_near) = [1, 0, 0, 0], a(monster_near) = [0, 1, 0, 0], a(gold_stolen) = [0, 0, 1, 0], a(thief _near) = [0, 0, 0, 1]}

(10.19)

Upon perceiving an event, the emotional state can be updated according to Equation (10.1). When more flexibility is needed, one can annotate the event with input needed for the appraisal process instead, for example:

likelihood(gold_stolen) = 0.5, conduciveness_{get_rich}(gold_stolen) = −1    (10.20)

This can now be fed into an appraisal model and leave the emotion calculation to the model, in the spirit of Ochs et al. [2009] and Popescu et al. [2014]. We leave playing around with this as an exercise to the reader.
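
One possible rendering of this annotation scheme in Python is sketched below; the event names and emotion labels come from the example above, while the class, the decay constant, and the update loop are illustrative assumptions (the update follows Equation 10.1).

```python
import numpy as np

EMOTIONS = ["joy", "fear", "sadness", "anger"]

# Equation (10.19): each perceivable event is annotated with a fixed appraisal vector.
EVENT_APPRAISALS = {
    "player_near":  np.array([1.0, 0.0, 0.0, 0.0]),
    "monster_near": np.array([0.0, 1.0, 0.0, 0.0]),
    "gold_stolen":  np.array([0.0, 0.0, 1.0, 0.0]),
    "thief_near":   np.array([0.0, 0.0, 0.0, 1.0]),
}


class Villager:
    """Non-player character with hard-wired appraisal of game events."""

    def __init__(self, gamma=0.9):
        self.gamma = gamma
        self.e = np.zeros(len(EMOTIONS))

    def perceive(self, event=None):
        """Equation (10.1): an annotated event sets the state; otherwise the state decays."""
        if event in EVENT_APPRAISALS:
            self.e = EVENT_APPRAISALS[event].copy()
        else:
            self.e = self.e * self.gamma
        return dict(zip(EMOTIONS, self.e))


villager = Villager()
villager.perceive("monster_near")  # -> fear at full intensity
villager.perceive()                # no event: emotions decay toward neutral
```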

10.3 History/Overview

In this section I give a short overview of the history of computational modeling of emotion, including important milestones that influenced the different approaches introduced in the previous section. It will be a brief history, enhanced with some recent work from the last five years. For more history on this field, including an excellent taxonomy of models pre-2014, readers are referred to Gratch and Marsella
[2014], while readers should look at Pfeifer [1988] for a review of the treatment of emotion and affect in computer models before the field of emotion modeling existed.

10.3.1 The Early Period

The computational study of emotion was initiated by the cognitive revolution in psychology. It was first explicitly suggested in the early 1980s by Rolf Pfeifer [1982] and, around the same time, by Aaron Sloman and Monica Croucher [1981], who wrote the influential paper "Why Robots Will Have Emotions." In the late 1980s, psychologists such as Nico Frijda, with his student Jaap Swagerman [Frijda and Swagerman 1987], started formalizing Frijda's action tendencies theory, and around that same time the influential OCC model was developed by Andrew Ortony, Gerald Clore, and Allan Collins [Ortony et al. 1988]. These developments spurred agent-oriented research into emotion simulation, resulting in the famous work by Clark Elliott, the Affective Reasoner [Elliott 1992], which was the first full-blown cognitive appraisal-based implementation of the OCC model using goal-based agent reasoning. Emotion simulation work soon began to be applied in intelligent virtual agents, for example, in the work on believable agents in the Oz project by Scott Reilly and Joseph Bates [Reilly and Bates 1992, Reilly 1996]. Fueled by Damasio's ideas on the importance of emotion in decision making, emotions also made their appearance in the first social robots, such as Kismet [Breazeal 1998], whose emotional system was based on Velasquez's [1998] work on modeling emotion-based decision making.

10.3.2 The Diversification Period

During the 2000s, a surge of interest was seen in trying to understand the role of emotion in interaction with agents [Paiva 2000, Hudlicka 2003, Conati et al. 2005]. This was not least due to the book Affective Computing by Rosalind Picard [1997], who for the first time defined emotion modeling as part of a field. Simulated emotions were added to social robots (iCat) and virtual agents (Greta, Steve) and applied to different settings, including pedagogical agents [Gratch and Marsella 2001], negotiations [Core et al. 2006], game characters [Ochs et al. 2009], and human–robot interactions [Leite et al. 2008]. We see the development of virtual humans (now we would call these intelligent virtual agents or SIAs) that included emotions in their reasoning, their decision making, and their expressive repertoire [Allbeck and Badler 2002, Gratch et al. 2002]. Also, we see that agent researchers started to look at how to structure the appraisal process based on different formalisms, including planning [Gratch and Marsella 2004], BDI logic [Meyer 2006], and set-theoretic approaches [Broekens 2007], as well as how to embed emotions in complete cognitive agent architectures [Marinier III and Laird 2004, Dias and Paiva 2005, Hudlicka 2005]. We also see emotion being modeled in relation to complete agents with personality, mood, and expression [Rosis et al. 2003], especially in embodied conversational agents [Egges et al. 2004]. Other appraisal theories, including Scherer's, were also being modeled [Broekens 2007, Marinier and Laird 2008]. In parallel during the 2000s, different approaches to emotion modeling appeared that emphasized not cognitive appraisal but the interplay between emotion and cognition in agents, as well as more embodied (cybernetic) approaches [Belavkin 2001, Cañamero 2001], together with first attempts at linking emotion to reinforcement learning, metalearning, and optimization [Schweighofer and Doya 2003, Hogewoning et al. 2007]. However, most of the work remained focused on the computational investigation and application of cognitive appraisal, and in particular the OCC model, in interactive agents.

10.3.3 Current Work

By the end of the 2000s it became clear that the field needed to focus more on the evaluation of models of emotion [Gratch et al. 2009, Broekens et al. 2013]. This shifted the focus to the question of why emotions were added in the first place. We see that validity (is the emotion theoretically valid?) and user experience (how does the model impact the human in the loop?) became important evaluation criteria. With regard to user experience, we see, for example, a focus on applying emotions in robots and agents for specific reasons, including robot empathy [Paiva et al. 2017] in human–robot interaction (see Chapter 11), the building of rapport between SIAs and humans (see Chapter 12), enhancing non-player character flexibility in games [Chowanda et al. 2016] (for a review, see Yannakakis and Paiva [2014] and Chapter 27), and enhancing cognitive-assistive technologies [Robillard and Hoey 2018]. With regard to validity, we see approaches that focus on studying particular theoretical aspects of artificial emotion elicitation, such as how emotions can result from temporal difference reinforcement learning [Moerland et al. 2016, Broekens and Dai 2019], how appraisal can be modeled on top of reinforcement learning state-value information [Sequeira et al. 2014], how emotion can be grounded in robot physiology [Kiryazov et al. 2013, Lowe et al. 2016] and in interaction with robots [Jung 2017], how appraisal can be conceptualized as an iterative affective summary process [Marsella and Gratch 2009], and how Damasio's as-if loop can be simulated using a dynamical systems approach [Bosse et al. 2008]. Finally, we see a large body of research working toward integrating emotion and other affective phenomena, such as personality, relations, and user emotions (user modeling), into the decision-making process of SIAs.

10.4 Similarities and Differences in IVAs and SRs

The main approach for emotion elicitation in SIAs is cognitive appraisal, although models of emotion in social robots tend to also investigate embodied approaches (such as grounding affective dimensions in the robot's physiology). Further, in both fields, emotions are often used in interaction with people. Other chapters go into more detail on this (such as Chapters 8, 11, and 12). Overall, the two fields are relatively well aware of each other's work when it comes to modeling emotion elicitation through cognitive appraisal. RL-based and embodied modeling are seen more in robots and in more theoretical agent simulation studies that do not involve interaction but mainly task-based agent learning or adaptation scenarios. These modeling approaches are at this point still more theoretical in nature.

10.5 Current Challenges and Future Directions

In this section, we discuss seven (somewhat arbitrary) challenges in the modeling of emotion, originating both from the core of emotion simulation and from applying emotions in SIAs. First, it is still not clear how to select the appropriate frame of reasoning for generating emotions based on cognitive appraisal theory. As most theories assume emotions are triggered by the appraisal of a stimulus or of the belief–desire structure, the question for AI agents remains: which goals do I take into account, and do I focus on the hope or the fear side of things? Many emotions may result from a single percept, and it is not clear whether all of these arise, only the strongest, only the last, and so on. It remains to be seen if this issue can be solved at all. Second, the intensity of emotions remains a difficult issue. Up until now there is no widely accepted model for emotion intensity based on appraisal-theoretic simulations. Third, how can we ground and user-test emotions in architectures other than classical cognitive agents, including adaptive agents and simple robots? Will we be stuck with human perception studies only, or is there more to be done, for example, replicating animal studies of emotion? Fourth, how do we incorporate recent evidence that emotions follow a hierarchy with a few basic emotions [Jack et al. 2014]? Should this perspective change the way we develop computational appraisal models? Fifth, we need to test the plausibility and effects of SIA emotion, as generated by an elicitation process, in non-trivial and longer-term interaction domains. For example, an emotional agent that is always supportive and empathic might be counterproductive in the long run and raise frustration. Sixth, there is still a lack of standard benchmarks for testing (often quite complex) emotion models. For this, human–agent negotiation can be a good basis, as it involves many aspects of emotion, including reactive emotions, appraisal, utility, norms, values, and the strategic use of emotions [Gratch et al. 2015]. Related to this is the fact that there are many models that implement emotion, mood, personality, and relations, but there is in fact no way to test and compare these complex models. We need simple interaction effects between affective phenomena to be replicated (such as the impact of mood on emotion and vice versa), including what this might bring to the user in terms of experience. Seventh, emotions are used by humans to explain their point of view and perspective. In AI there is the potential to investigate the use of emotion modeling in explanation, transparency, ethics, and simulated emotions. Emotional expression grounded in the decision-making process of the agent could be a form of transparency about the SIA's functioning [Kaptein et al. 2017, Broekens and Chetouani 2019].

10.6 Summary

We have covered important affective concepts, including emotion, mood, attitude, personality, and relations, and how these concepts are computationally represented. We then covered four approaches to the simulation of appraisal in SIAs and gave practical working examples of these approaches. Finally, we surveyed the history of the field and pointed out current challenges in emotion modeling.

References C. Adam, A. Herzig, and D. Longin. 2009. A logical formalization of the OCC theory of emotions. Synthese 168, 2, 201–248. ISSN 1573-0964. DOI: https://doi.org/10.1007/s11229-0099460-9. J. Allbeck and N. Badler. 2002. Toward representing agent behaviors modified by personality and emotion. Embodied Conversational Agents at AAMAS 2, 15–19. R. C. Arkin, M. Fujita, T. Takagi, and R. Hasegawa. 2003. An ethological and emotional basis for human–robot interaction. Robot. Auton. Syst. 42, 3–4, 191–201. DOI: https://doi. org/10.1016/S0921-8890(02)00375-5. ISSN 0921-8890. M. B. Arnold. 1960. Emotion and Personality. Vol. I: Psychological Aspects. Columbia University Press, New York. O. Avila-Garcia and L. Canamero. 2005. Hormonal modulation of perception in motivationbased action selection architectures. SSAISB. T. Baarslag, M. Kaisers, E. Gerding, C. M. Jonker, and J. Gratch. August 2017. When will negotiation agents be able to represent us? The challenges and opportunities for autonomous negotiators. In C. Sierra (Ed.), Proceedings of the 26th International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence, 4684–4690. DOI: https://doi.org/10.24963/ijcai.2017/653. L. F. Barrett. 2005. Feeling is perceiving: Core affect and conceptualization in the experience of emotion. Guilford Press, New York, 255–284. ISBN 1-59385-188-X (Hardcover).


L. F. Barrett. 2011. Was Darwin wrong about emotional expressions? Curr. Direct. Psychol. Sci. 20, 6, 400–406. DOI: https://doi.org/10.1177/0963721411429125. L. F. Barrett, B. Mesquita, K. N. Ochsner, and J. J. Gross. 2007. The experience of emotion. Ann. Rev. Psychol. 58, 1, 373–403. DOI: https://doi.org/10.1146/annurev.psych.58.110405. 085709. C. Bartneck. 2002. Integrating the OCC model of emotions in embodied characters. 39–48. R. F. Baumeister, K. D. Vohs, and C. Nathan DeWall. 2007. How emotion shapes behavior: Feedback, anticipation, and reflection, rather than direct causation. Pers. Soc. Psychol. Rev. 11, 2, 167. ISSN 1088-8683. DOI: https://doi.org/10.1177/1088868307301033. C. Beedie, P. Terry, and A. Lane. 2005. Distinctions between emotion and mood. Cogn. Emot. 19, 6, 847–878. ISSN 0269-9931. DOI: https://doi.org/10.1080/02699930541000057. M. Bekoff. 2008. The Emotional Lives of Animals: A Leading Scientist Explores Animal Joy, Sorrow, and Empathy–and Why They Matter. New World Library. ISBN 1577316290. R. V. Belavkin. 2001. The role of emotion in problem solving. In Proceedings of the AISB’01 Symposium on Emotion, Cognition and Affective Computing. Heslington, York, UK, 49–57. T. Bosse, C. M. Jonker, and J. Treur. 2008. Formalisation of Damasio’s theory of emotion, feeling and core consciousness. Conscious. Cogn. 17, 1, 94–113. ISSN 1053-8100. http://ww w.sciencedirect.com/science/article/pii/S1053810007000633. DOI: https://doi.org/10.1016/ j.concog.2007.06.006. S. Bozinovski. 1982. A self-learning system using secondary reinforcement. In E. Trappl (Ed.), Cybernetics and Systems Research. North-Holland Publishing Company, 397–402. M. Bradley and P. Lang. 2007. Affective Norms for English Text (ANET): Affective Ratings of Text and Instruction Manual. Technical Report. D-1, University of Florida, Gainesville, FL. C. Breazeal. 1998. A motivational system for regulating human–robot interaction. In AAAI. 54–61. J. Broekens. 2007. Emotion and reinforcement: Affective facial expressions facilitate robot learning. 113–132. DOI: http://dx.doi.org/10.1007/978-3-540-72348-6_6. J. Broekens. 2018. A temporal difference reinforcement learning theory of emotion. arXiv preprint arXiv:1807.08941. J. Broekens and M. Chetouani. 2019. Towards transparent robot learning through TDRL-based emotional expressions. IEEE Transactions on Affective Computing 12, 2, 352–362. J. Broekens and L. Dai. 2019. A TDRL model for the emotion of regret. In 8th International Conference on Affective Computing and Intelligent Interaction (ACII). 150–156. IEEE. ISBN 1728138884. J. Broekens, C. M. Jonker, and J.-J. C. Meyer. 2010. Affective negotiation support systems. J. Ambient Intell. Smart Environ. 2, 2, 121–144. ISSN 1876-1364. DOI: http://dx.doi.org/10.3233/ AIS-2010-0065. J. Broekens, T. Bosse, and S. C. Marsella. 2013. Challenges in computational modeling of affective processes. Affect. Comput. IEEE Trans. 4, 3, 242–245. ISSN 1949-3045. DOI: 10.1109/T-AFFC.2013.23.


J. Broekens, E. Jacobs, and C. M. Jonker. 2015. A reinforcement learning model of joy, distress, hope and fear. Connect. Sci. 1–19. ISSN 0954-0091. DOI: http://dx.doi.org/10.1080/ 09540091.2015.1031081. J. K. Burgoon, N. Magnenat-Thalmann, M. Pantic, and A. Vinciarelli. 2017. Social Signal Processing. Cambridge University Press. ISBN 1108124585. K. A. Buss and E. J. Kiel. 2004. Comparison of sadness, anger, and fear facial expressions when toddlers look at their mothers. Child Develop. 75, 6, 1761–1773. ISSN 1467-8624. DOI: http://dx.doi.org/10.1111/j.1467-8624.2004.00815.x. R. A. Calvo, S. D’Mello, J. Gratch, and A. Kappas. 2014. The Oxford Handbook of Affective Computing. Oxford University Press. ISBN 0199942234. L. Cañamero. 2001. Emotions and adaptation in autonomous agents: A design perspective. Cybern. Syst. 32, 5, 507–529. ISSN 0196-9722. DOI: http://dx.doi.org/10.1080/ 019697201750257757. L. Cañamero. 2003. Designing emotions for activity selection in autonomous agents. In R. Trappl, P. Petta, and S. Payr (Eds.), Emotions in Humans and Artifacts. MIT Press, Cambridge, MA, 115–148. L. Canamero. 2019. Embodied robot models for interdisciplinary emotion research. IEEE Trans. Affect. Comput. 1–1. ISSN 2371-9850. DOI: http://dx.doi.org/10.1109/TAFFC.2019. 2908162. C. Castelfranchi. 1999. Affective appraisal versus cognitive evaluation in social emotions and interactions. In International Workshop on Affective Interactions. Springer, 76–106. G. Castellano, A. Paiva, A. Kappas, R. Aylett, H. Hastie, W. Barendregt, F. Nabais, and S. Bull. 2013. Towards Empathic Virtual and Robotic Tutors. Springer Berlin, Heidelberg, 733–736. ISBN 978-3-642-39112-5. S. Chong, J. F. Werker, J. A. Russell, and J. M. Carroll. 2003. Three facial expressions mothers direct to their infants. Infant Child Dev. 12, 3, 211–232. ISSN 1522-7219. DOI: http://dx.d oi.org/10.1002/icd.286. A. Chowanda, M. Flintham, P. Blanchfield, and M. Valstar. 2016. Playing with social and emotional game companions. Intelligent Virtual Agents. Springer International Publishing, 85–95. ISBN 978-3-319-47665-0. C. Conati, S. Marsella, and A. Paiva. 2005. Affective Interactions: The Computer in the Affective Loop. ACM, San Diego, CA, 7–7. ISBN 1581138946. M. Core, D. Traum, H. C. Lane, W. Swartout, J. Gratch, M. van Lent, and S. Marsella. 2006. Teaching negotiation skills through practice and reflection with virtual humans. Simulation 82, 11, 685–701. http://sim.sagepub.com/cgi/content/abstract/82/11/685. DOI: http:// dx.doi.org/10.1177/0037549706075542. I. Cos, L. Cañamero, G. M. Hayes, and A. Gillies. 2013. Hedonic value: Enhancing adaptation for motivated agents. Adapt. Behav. 21, 6, 465–483. http://adb.sagepub.com/content/ 21/6/465.abstract. DOI: http://dx.doi.org/10.1177/1059712313486817.


H. Cramer, J. Goddijn, B. Wielinga, and V. Evers. 2010. Effects of (in)accurate empathy and situational valence on attitudes towards robots. 141–142. ISBN 2167-2148. DOI: http://dx. doi.org/10.1109/HRI.2010.5453224. A. R. Damasio. 1994. Descartes’ Error: Emotion Reason and the Human Brain. Putnam, New York. C. M. de Melo, P. Carnevale, and J. Gratch. 2011. The effect of expression of anger and happiness in computer agents on negotiations with humans. International Foundation for Autonomous Agents and Multiagent Systems, 937–944. ISBN 098265717X. C. M. de Melo, P. J. Carnevale, S. J. Read, and J. Gratch. 2014. Reading people’s minds from emotion expressions in interdependent decision making. J. Pers. Soc. Psychol. 106, 1, 73. ISSN 1939-1315. DOI: http://dx.doi.org/10.1037/a0034251. F. De Waal. 2019. Mama’s Last Hug: Animal Emotions and What They Tell Us about Ourselves. WW Norton and Company. ISBN 0393635074. J. Dias and A. Paiva. 2005. Feeling and Reasoning: A Computational Model for Emotional Characters. Springer, 127–140. ISBN 3540307370. DOI: https://doi.org/10.1007/11595014_13. J. Dias, S. Mascarenhas, and A. Paiva. 2014. FAtiMA modular: Towards an agent architecture with a generic appraisal framework. Springer, 44–56. DOI: https://doi.org/10.1007/ 978-3-319-12973-0_3. U. Dimberg, P. Andréasson, and M. Thunberg. 2011. Emotional empathy and facial reactions to facial expressions. J. Psychophysiol. 25, 1, 26–31. https://econtent.hogrefe.com/d oi/abs/10.1027/0269-8803/a000029. DOI: http://dx.doi.org/10.1027/0269-8803/a000029. S. D’Mello, R. W. Picard, and A. Graesser. 2007. Toward an affect-sensitive autotutor. IEEE Intell. Syst. 22, 53–61. ISSN 1541-1672. http://doi.ieeecomputersociety.org/10.1109/MIS. 2007.79. G. Dreisbach and K. Goschke. 2004. How positive affect modulates cognitive control: Reduced perseveration at the cost of increased distractibility. J. Exp. Psychol. Learn. Mem. Cogn. 30, 2, 343–353. DOI: http://dx.doi.org/10.1037/0278-7393.30.2.343. K. Edwards. 1990. The interplay of affect and cognition in attitude formation and change. J. Pers. Soc. Psychol. 59, 2, 202–216. ISSN 1939-1315 (Electronic), 0022-3514 (Print). DOI: http://dx.doi.org/10.1037/0022-3514.59.2.202. A. Egges, S. Kshirsagar, and N. Magnenat-Thalmann. 2004. Generic personality and emotion simulation for conversational agents. Comput. Anim. Virtual Worlds 15, 1, 1–13. ISSN 1546-4261. https://onlinelibrary.wiley.com/doi/abs/10.1002/cav.3. DOI: http://dx.doi.org /10.1002/cav.3. P. Ekman and W. Friesen. 1971. Constants across cultures in the face and emotion. J. Pers. Soc. Psychol. 17, 2, 124. ISSN 1939-1315. DOI: https://doi.org/10.1037/h0030377. M. S. El-Nasr, J. Yen, and T. R. Ioerger. 2000. FLAME—Fuzzy logic adaptive model of emotions. Auton. Agents Multi-Agent Syst. 3, 219–257. ISSN 1387-2532. DOI: http://dx.doi.org/10. 1023/A:1010030809960. C. D. Elliott. 1992. The Affective Reasoner: A Process Model of Emotions in a Multi-Agent System. PhD thesis, Northwestern University.


A. H. Fischer and A. Manstead. 2008. Social functions of emotion. Guilford Press, 456–468. J. P. Forgas. 2000. Feeling is believing? The role of processing strategies in mediating affective influences in beliefs. Cambridge University Press, 108–143. DOI: https://doi.org/10. 1017/CBO9780511659904.005. S. Franklin and A. Graesser. 1999. A software agent model of consciousness. Conscious. Cogn. 8, 3, 285–301. ISSN 1053-8100. DOI: http://dx.doi.org/10.1006/ccog.1999.0391. S. Franklin, T. Madl, S. D’Mello, and J. Snaider. 2014. LIDA: A systems-level architecture for cognition, emotion, and learning. IEEE Trans. Auton. Ment. Dev. 6, 1, 19–41. ISSN 19430604. DOI: http://dx.doi.org/10.1109/TAMD.2013.2277589. N. H. Frijda. 1988. The laws of emotion. Am. Psychol. 43, 5, 349–358. ISSN 1935-990X. DOI: https://doi.org/10.1037/0003-066X.43.5.349. N. H. Frijda. 2004. Emotions and action. Cambridge University Press, 158–173. DOI: https: //doi.org/10.1017/CBO9780511806582.010. N. H. Frijda and J. Swagerman. 1987. Can computers feel? Theory and design of an emotional system. Cogn. Emot. 1, 3, 235–257. ISSN 0269-9931. DOI: https://doi.org/10.1080/ 02699938708408050. N. H. Frijda, A. S. R. Manstead, and S. Bem. 2000. Emotions and Beliefs: How Feelings Influence Thoughts. Cambridge University Press. DOI: https://doi.org/10.1017/CBO97805 11659904. P. Gebhard. 2005. ALMA: A layered model of affect. In Proceedings of the 4th International Joint Conference on Autonomous Agents and Multiagent Systems. 29–36. DOI: https://doi.or g/10.1145/1082473.1082478. L. R. Goldberg. 1990. An alternative “description of personality”: The big-five factor structure. J. Pers. Soc. Psychol. 59, 6, 1216–1229. ISSN 1939-1315 (Electronic), 0022-3514 (Print). DOI: https://doi.org/10.1037//0022-3514.59.6.1216. J. Gratch and S. Marsella. 2001. Tears and fears: Modeling emotions and emotional behaviors in synthetic agents. ACM, Montreal, Quebec, Canada, 278–285. DOI: http://doi.acm. org/10.1145/375735.376309. J. Gratch and S. Marsella. 2004. A domain-independent framework for modeling emotion. Cogn. Syst. Res. 5, 4, 269–306. ISSN 1389-0417. http://www.sciencedirect.com/science/arti cle/B6W6C-4C56KYY-1/2/e21f759dcf674531f63aa07c171a0f31. DOI: https://doi.org/10.1016/ j.cogsys.2004.02.002. J. Gratch and S. Marsella. 2014. Appraisal models. 54–67. ISBN 0199942234. DOI: https://doi. org/10.1093/oxfordhb/9780199942237.013.015. J. Gratch, J. Rickel, E. André, J. Cassell, E. Petajan, and N. Badler. 2002. Creating interactive virtual humans: Some assembly required. IEEE Intell. Syst. 17, 4, 54–63. ISSN 1541-1672. DOI: https://doi.org/10.1109/mis.2002.1024753. J. Gratch, S. Marsella, N. Wang, and B. Stankovic. 2009. Assessing the validity of appraisalbased models of emotion. DOI: https://doi.org/10.1109/ACII.2009.5349443. J. Gratch, D. DeVault, G. M. Lucas, and S. Marsella. 2015. Negotiation as a challenge problem for virtual humans. In W.-P. Brinkman, J. Broekens, and D. Heylen (Eds.), Intelligent


Virtual Agents. Springer International Publishing, Cham, 201–215. DOI: https://doi.org/ 10.1007/978-3-319-21996-7_21. R. Greifeneder, H. Bless, and M. T. Pham. 2010. When do people rely on affective and cognitive feelings in judgment? A review. Pers. Soc. Psychol. Rev. 15, 2, 107–141. ISSN 10888683. DOI: https://doi.org/10.1177/1088868310367640. D. Heylen, A. Nijholt, R. o. d. Akker, and M. Vissers. 2003. Socially intelligent tutor agents. Lecture Notes in Artificial Intelligence, 341–347. DOI: https://doi.org/10.1007/978-3-54039396-2_56. K. Hoemann, F. Xu, and L. F. Barrett. 2019. Emotion words, emotion concepts, and emotional development in children: A constructionist hypothesis. Dev. Psychol. 55, 9, 1830. DOI: https://doi.org/10.1037/dev0000686. E. Hogewoning, J. Broekens, J. Eggermont, and E. Bovenkamp. 2007. Strategies for affectcontrolled action-selection in Soar-RL. Springer, Berlin, 501–510. E. Hudlicka. 2003. To feel or not to feel: The role of affect in human–computer interaction. Int. J. Hum. Comput. Stud. 59, 1–2, 1–32. ISSN 1071-5819. http://www.sciencedirect.com/sc ience/article/B6WGR-48SNKXG-2/2/c172889a95be64cb4419763a5cefa852. E. Hudlicka. 2005. Modeling interactions between metacognition and emotion in a cognitive architecture. 55–61. R. E. Jack, O. G. Garrod, and P. G. Schyns. 2014. Dynamic facial expressions of emotion transmit an evolving hierarchy of signals over time. Curr. Biol. 24, 1–6. DOI: https://doi.or g/10.1016/j.cub.2013.11.064. H. Jiang, J. M. Vidal, and M. N. Huhns. 2007. EBDI: An architecture for emotional agents. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems. 1–3. DOI: https://doi.org/10.1145/1329125.1329139. H. Jones and N. Sabouret. 2013. TARDIS—A simulation platform with an affective virtual recruiter for job interviews. In Conference on IDGEI (Intelligent Digital Games for Empowerment and Inclusion). M. F. Jung. 2017. Affective grounding in human–robot interaction. In 2017 12th ACM/IEEE International Conference on Human–Robot Interaction (HRI). 263–273. DOI: https://doi.org/ 10.1145/2909824.3020224. L. P. Kaelbling, M. L. Littman, and A. W. Moore. 1996. Reinforcement learning: A survey. J. Artif. Intell. Res. 4, 237–285. DOI: https://doi.org/10.1613/jair.301. F. Kaptein, J. Broekens, K. V. Hindriks, and M. Neerincx. 2016. CAAF: A cognitive affective agent programming framework. Springer International Publishing. 317–330. F. Kaptein, J. Broekens, K. Hindriks, and M. Neerincx. 2017. The role of emotion in selfexplanations by cognitive agents. In 2017 7th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW). IEEE, 88–93. K. Kiryazov, R. Lowe, C. Becker-Asano, and M. Randazzo. 2013. The role of arousal in tworesource problem tasks for humanoid service robots. In 2013 IEEE RO-MAN. 62–69. ISBN 1944-9437.


G. v. Kleef, C. De Dreu, and A. Manstead. 2004. The interpersonal effects of emotions in negotiations: A motivated information processing approach. J. Pers. Soc. Psychol. 87, 4, 510–528. DOI: https://doi.org/10.1037/0022-3514.87.4.510. M. D. Klinnert. 1984. The regulation of infant behavior by maternal facial expression. Infant Behav. Dev. 7, 4, 447–465. ISSN 0163-6383. http://www.sciencedirect.com/science/ar ticle/pii/S0163638384800053. DOI: http://dx.doi.org/10.1016/S0163-6383(84)80005-3. R. S. Lazarus. 1991. Cognition and motivation in emotion. Am. Psychol. 46, 4, 352. DOI: http: //dx.doi.org/10.1037//0003-066x.46.4.352. J. LeDoux. 1996. The Emotional Brain. Simon and Shuster, New York. K. Lee and M. C. Ashton. 2004. Psychometric properties of the HEXACO personality inventory. Multivar. Behav. Res. 39, 2, 329–358. ISSN 0027-3171. DOI: https://doi.org/10.1207/ s15327906mbr3902_8. I. Leite, C. Martinho, A. Pereira, and A. Paiva. 2008. iCat: An affective game buddy based on anticipatory mechanisms. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems. Vol. 3. International Foundation for Autonomous Agents and Multiagent Systems. 1229–1232. DOI: https://doi.org/10.1145/ 1402821.1402838. D. Leyzberg, E. Avrunin, J. Liu, and B. Scassellati. 2011. Robots that express emotion elicit better human teaching. 347–354. ISBN 2167-2121. DOI: https://doi.org/10.1145/1957656. 1957789. B. Liu and L. Zhang. 2012. A survey of opinion mining and sentiment analysis. Springer US, Boston, MA, 415–463. ISBN 978-1-4614-3223-4. DOI: https://doi.org/10.1007/978-1-46143223-4_13. R. Lowe, E. Barakova, E. Billing, and J. Broekens. 2016. Grounding emotions in robots— An introduction to the special issue. Adapt. Behav. 24, 5, 263–266. DOI: https://doi.org/10. 1177/1059712316668239. ISSN 1059-7123. G. R. Maio, G. Haddock, and B. Verplanken. 2018. The Psychology of Attitudes and Attitude Change. Sage Publications Limited. ISBN 1526454122. K. Man and A. Damasio. 2019. Homeostasis and soft robotics in the design of feeling machines. Nat. Mach. Intell. 1, 10, 446–452. DOI: https://doi.org/10.1038/s42256-019-01037. ISSN 2522-5839. R. Marinier and J. E. Laird. 2008. Emotion-driven reinforcement learning. 115–120. R. P. Marinier III and J. E. Laird. 2004. Toward a Comprehensive Computational Model of Emotions and Feelings. 172–177. S. Marsella and J. Gratch. 2009. EMA: A process model of appraisal dynamics. Cogn. Syst. Res. 10, 1, 70–90. ISSN 1389-0417. http://www.sciencedirect.com/science/article/B6W6C-4S X9G35-1/2/484cdc830f9bfa8c5d12b87ddd5bace7. M. S. Mast and J. A. Hall. 2017. The vertical dimension of social signaling. Cambridge University Press. DOI: https://doi.org/10.1017/9781316676202.004. G. E. Matt, C. Vázquez, and W. K. Campbell. 1992. Mood-congruent recall of affectively toned stimuli: A meta-analytic review. Clin. Psychol. Rev. 12, 2, 227–255. ISSN 0272-7358.


http://www.sciencedirect.com/science/article/pii/027273589290116P. DOI: https://doi.org/ 10.1016/0272-7358(92)90116-P. J. D. Mayer, R. D. Roberts, and S. G. Barsade. 2008. Human abilities: Emotional intelligence. Annu. Rev. Psychol. 59, 507–536. DOI: https://doi.org/10.1146/annurev.psych.59. 103006.093646. ISSN 0066-4308. R. R. McCrae and P. T. Costa. 1987. Validation of the five-factor model of personality across instruments and observers. J. Pers. Soc. Psychol. 52, 1, 81. DOI: https://doi.org/10.1016/10. 1037/0022-3514.52.1.81. ISSN 1939-1315. A. Mehrabian. 1980. Basic Dimensions for a General Psychological Theory. Oelgeschlager, Gunn, Hain Inc, Cambridge, MA. J.-J. C. Meyer. 2006. Reasoning about emotional agents. Int. J. Intell. Syst. 21, 6, 601–619. DOI: http://dx.doi.org/10.1002/int.20150. ISSN 1098-111X. T. Moerland, J. Broekens, and C. Jonker. 2016. Fear and hope emerge from anticipation in model-based reinforcement learning. AAAI Press, 848–854. T. M. Moerland, J. Broekens, and C. M. Jonker. 2018. Emotion in reinforcement learning agents and robots: A survey. Mach. Lear. 107, 2, 443–480. DOI: https://doi.org/10.1007/ s10994-017-5666-0. ISSN 1573-0565. A. Moors, P. C. Ellsworth, K. R. Scherer, and N. H. Frijda. 2013. Appraisal theories of emotion: State of the art and future development. Emot. Rev. 5, 2, 119–124. DOI: https://doi.or g/10.1177/1754073912468165. A. Moors, Y. Boddez, and J. De Houwer. 2017. The power of goal-directed processes in the causation of emotional and other actions. Emot. Rev. 9, 4, 310–318. ISSN 1754-0739. DOI: https://doi.org/10.1177/1754073916669595. R. Neumann, B. Seibt, and F. Strack. 2001. The influence of mood on the intensity of emotional responses: Disentangling feeling and knowing. Cogn. Emot. 15, 6, 725–747. DOI: https://doi.org/10.1080/02699930143000266. ISSN 0269-9931. K. Oatley. 2010. Two movements in emotions: Communication and reflection. Emot. Rev. 2, 1, 29–35. http://emr.sagepub.com/content/2/1/29.abstract. DOI: https://doi.org/10.1177/ 1754073909345542. M. Ochs, N. Sabouret, and V. Corruble. 2009. Simulation of the dynamics of nonplayer characters’ emotions and social relations in games. Comput. Intell. AI Games IEEE Trans. 1, 4, 281–297. ISSN 1943-068X. DOI: https://doi.org/10.1109/tciaig.2009. 2036247. A. Ortony, G. L. Clore, and A. Collins. 1988. The Cognitive Structure of Emotions. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9780511571299. A. Paiva. 2000. Affective interactions: Toward a new generation of computer interfaces? 1–8. DOI: http://dx.doi.org/10.1007/10720296_1. A. Paiva, I. Leite, H. Boukricha, and I. Wachsmuth. 2017. Empathy in virtual agents and robots: A survey. ACM Trans. Interact. Intell. Syst. 7, 3, Article 11. DOI: https://doi.org/10. 1145/2912150. ISSN 2160-6455.


J. Panksepp. 1982. Toward a general psychobiological theory of emotions. Behav. Brain Sci. 5, 03, 407–422. DOI: https://doi.org/10.1017/S0140525X00012759. ISSN 1469-1825. L. Peña, J.-M. Peña, and S. Ossowski. 2011. Representing emotion and mood states for virtual agents. In F. Klügl and S. Ossowski (Eds.), Multiagent System Technologies. Springer Berlin Heidelberg, 181–188. ISBN 978-3-642-24603-6. DOI: https://doi.org/10.1007/978-3642-24603-6_19. R. Pfeifer. 1982. Cognition and emotion: An information processing approach. CIP Working Paper 436. R. Pfeifer. 1988. Artificial intelligence models of emotion. Springer Netherlands, Dordrecht, 287–320. ISBN 978-94-009-2792-6. R. W. Picard. 1997. Affective Computing. MIT Press. A. Popescu, J. Broekens, and M. v. Someren. 2014. GAMYGDALA: An emotion engine for games. IEEE Trans. Affect. Comput. 5, 1, 32–44. DOI: https://doi.org/10.1109/T-AFFC.2013. 24. ISSN 1949-3045. http://doi.ieeecomputersociety.org/10.1109/T-AFFC.2013.24. S. N. Reilly. 2006. Modeling what happens between emotional antecedents and emotional consequents. ACE 2006, 19. W. S. Reilly. 1996. Believable Social and Emotional Agents. Report, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA. W. S. Reilly and J. Bates. 1992. Building Emotional Agents. Report, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. D. Reisberg and P. Hertel. 2003. Memory and Emotion. Oxford University Press. ISBN 019534796X. R. Reisenzein. 2009a. Emotional experience in the computational belief–desire theory of emotion. Emot. Rev. 1, 3, 214–222. http://emr.sagepub.com/cgi/content/abstract/1/3/214. DOI: https://doi.org/10.1177/1754073909103589. R. Reisenzein. 2009b. Emotions as metarepresentational states of mind: Naturalizing the belief–desire theory of emotion. Cogn. Syst. Res. 10, 1, 6–20. ISSN 1389-0417. http:// www.sciencedirect.com/science/article/B6W6C-4SSNDC1-1/2/6c65804e7c6115a7e230a17e 0285bfd1. J. M. Robillard and J. Hoey. 2018. Emotion and motivation in cognitive assistive technologies for dementia. Computer 51, 3, 24–34. DOI: https://doi.org/10.1109/MC.2018.1731059. ISSN 1558-0814. E. T. Rolls. 2014. Emotion and decision-making explained: A précis. Cortex 59, 185–193. DOI: https://doi.org/10.1016/j.cortex.2014.01.020. ISSN 0010-9452. I. J. Roseman and C. A. Smith. 2001. In K. R. Scherer, A. Schorr, and T. Johnstone (Eds.), Series in Affective Science. Appraisal Processes in Emotion: Theory, Methods, Research. Oxford University Press, Oxford, 3–19. F. d. Rosis, C. Pelachaud, I. Poggi, V. Carofiglio, and B. D. Carolis. 2003. From Greta’s mind to her face: Modelling the dynamics of affective states in a conversational embodied agent. Int. J. Hum. Comput. Stud. 59, 1–2, 81–118. http://www.sciencedirect.co m/science/article/B6WGR-487DHWN-1/2/125e0d74aaa75243f67ca1711d527201. ISSN 10715819.


J. A. Russell. 1980. A circumplex model of affect. J. Pers. Soc. Psychol. 39, 6, 1161–1178. DOI: https://doi.org/10.1037/h0077714. C. Saint-Georges, M. Chetouani, R. Cassel, F. Apicella, A. Mahdhaoui, F. Muratori, M. Laznik, and D. Cohen. 2013. Motherese in interaction: At the cross-road of emotion and cognition? (A systematic review). PLoS One 8, 10, e78103. DOI: https://doi.org/10.1371/jour nal.pone.0078103. M. A. Salichs and M. Malfaz. 2012. A new approach to modeling emotions and their use on a decision-making system for artificial agents. IEEE Trans. Affect. Comput. 3, 1, 56–68. ISSN 2371-9850. DOI: https://doi.org/10.1109/T-AFFC.2011.32. K. Scherer. 2001. Appraisal considered as a process of multilevel sequential checking. In K. R. Sherer, A. Schorr, and T. Johnstone (Eds.), Appraisal Processes in Emotion: Theory, Method, Research. Oxford University Press, Oxford, 92–120. K. R. Scherer. 2005. What are emotions? And how can they be measured? Soc. Sci. Inform. 44, 4, 695–729. DOI: https://doi.org/10.1177/0539018405058216. D. Schuller and B. W. Schuller. 2018. The age of artificial emotional intelligence. Computer. 51, 9, 38–46. ISSN 1558-0814. DOI: https://doi.org/10.1109/MC.2018.3620963. N. Schweighofer and K. Doya. 2003. Meta-learning in reinforcement learning. Neural Netw. 16, 1, 5–9. ISSN 0893-6080. http://www.sciencedirect.com/science/article/pii/ S0893608002002289. DOI: https://doi.org/10.1016/s0893-6080(02)00228-9. P. Sequeira, F. S. Melo, and A. Paiva. 2014. Learning by appraising: An emotion-based approach to intrinsic reward design. Adapt. Behav. 22, 5, 330–349. https://journals.sagep ub.com/doi/abs/10.1177/1059712314543837. DOI: https://doi.org/10.1177/1059712314543837. A. Sloman and M. Croucher. 1981. Why Robots will have Emotions. Report, Sussex University. C. A. Smith and R. S. Lazarus. 1990. Emotion and adaptation. The Guilford Press, New York, NY, 609–637. B. R. Steunebrink, M. Dastani, and J.-J. C. Meyer. 2007. A logic of emotions for intelligent agents. AAAI Press, 142–147. B. R. Steunebrink, M. Dastani, and J.-J. C. Meyer. 2008. A formal model of emotions: Integrating qualitative and quantitative aspects. IOS Press, 256–260. R. S. Sutton and A. G. Barto. 2018. Reinforcement Learning: An Introduction. MIT Press. ISBN 0262352702. S. Thrun, M. Bennewitz, W. Burgard, A. Cremers, F. Dellaert, D. Fox, D. Hähnel, C. Rosenberg, N. Roy, J. Schulte, and D. Schulz. 1999. MINERVA: A tour-guide robot that learns. 696–696. DOI: http://dx.doi.org/10.1007/3-540-48238-5_2. S. Turkle, C. Breazeal, O. Dasté, and B. Scassellati. 2006. Encounters with Kismet and Cog: Children respond to relational artifacts. Digital Media: Transformations in Human Communication, 120. C. M. Van Reekum and K. R. Scherer. 1997. Levels of processing in emotion-antecedent appraisal. In Advances in Psychology. Elsevier, 259–300, 124. DOI: https://doi.org/10.1016/ S0166-4115(97)80123-9. J. Velasquez. 1998. Modeling emotion-based decision making. AAAI Press.


J. W. Verdijk, D. Oldenhof, D. Krijnen, and J. Broekens. 2015. Growing emotions: Using affect to help children understand a plant’s needs. In International Conference on Affective Computing and Intelligent Interaction (ACII), 160–165. DOI: http://dx.doi.org/10. 1109/ACII.2015.7344566. S. K. Vosburg. 1998. The effects of positive and negative mood on divergent-thinking performance. Creat. Res. J. 11, 2, 165–172. DOI: https://doi.org/10.1207/s15326934crj1102_6. ISSN 1040-0419. K. Weber, A. Johnson, and M. Corrigan. 2004. Communicating emotional support and its relationship to feelings of being understood, trust, and self-disclosure. Commun. Res. Rep. 21, 3, 316–323. DOI: https://doi.org/10.1080/08824090409359994. ISSN 0882-4096. S. C. Widen and J. A. Russell. 2008. Children acquire emotion categories gradually. Cogn. Dev. 23, 2, 291–312. http://www.sciencedirect.com/science/article/pii/S0885201408000038. DOI: https://doi.org/10.1016/j.cogdev.2008.01.002. ISSN 0885-2014. G. N. Yannakakis and A. Paiva. 2014. Emotion in games. In R. A. Calvo, S. D’Mello, J. Gratch, and A. Kappas (Eds.), Handbook of Affective Computing. Oxford University Press, Oxford, 459–471. P. Zachar and R. D. Ellis. 2012. Categorical Versus Dimensional Models of Affect: A Seminar on the Theories of Panksepp and Russell, Vol. 7. John Benjamins Publishing. ISBN 9027274754. DOI: https://doi.org/10.1075/ceb.7.

11 Empathy and Prosociality in Social Agents

Ana Paiva, Filipa Correia, Raquel Oliveira, Fernando Santos, and Patrícia Arriaga

11.1 Motivation

Although some scientists might disagree about the exact role and importance of emotions in our daily lives, virtually all people (including scientists) admit to occasionally experiencing sadness, joy, and other emotions and can recognize how each one feels and how it affects them [Izard 2013]. From an evolutionary point of view, emotions carry significant advantages in terms of survival and, due to their universality, they allow for the non-verbal communication of inner psychological states and the easy recognition of those states in others. For this reason, many argue that emotions are essential features of complex social interactions and that they have adaptive functions [Ekman 1999, Izard 2013, Gloria and Steinhardt 2016]. So, the role that emotions play in the interactions between humans and technology, and in particular social agents, is not surprising. Much of the work conducted in this area seems to clearly and repeatedly find that people interact with technological artefacts as more than mere tools; users often apply their schemes for social and emotional interaction with other humans to their interaction with machines (see Chapter 3). For this reason, emotions are now a central part of the design, development, and evaluation of new technologies [Picard et al. 2002], including social agents and robots. But, as we consider the development of emotional behaviors in social agents, as seen in the previous chapter (see Chapter 10), we also need to reflect upon the effects that emotions have on both humans and the agents as a result. We cannot detach how emotional behavior experienced by one person affects another, and the responses that arise as a result. Indeed, many of our emo-
tions are social and related to how the others feel, and empathy processes fit in this realm of responses. In general terms, empathy can be considered as the response to some other person’s emotional state, where such response is more congruent to the others’ emotional state than one’s own. For example, if an agent reports a gloomy event, the human viewer may respond by feeling sad or even try to comfort or provide some advice to the agent. Several efforts to create empathic agents have shown that the display of empathy can positively impact the user’s feelings of trust and friendship toward such agents. Although an interesting finding, one might ask why (and if) we should trust and nourish these emotional connections with technological items [Reeves and Nass 1996]. From a utilitarian point of view, the answer to this question is rooted in the assumption that positive feelings toward machines can lead to higher overall satisfaction and engagement in long-term interactions with technological devices or agents. This can have a positive impact in the commercial success of such devices, as modulated by users’ intention to interact with them in the future and their attitudes and acceptance of such items in their daily lives. On the other hand, empathy, and empathic responses to agents, may also result in more awareness of someone else’s emotions, fostering perspective taking and reasoning about others, often competencies we seek to promote. We believe that our interactions with machines, and agents, can have a deep impact in our behaviors (both in actions directed at those machines and toward other humans). In particular, positive social behaviors (such as cooperation and prosocial behaviors) may be elicited through the interaction with socially interactive agents (SIAs) that can invoke positive emotions from their users. These complex social behaviors, as many studies have proposed, can transcend the limited domain of interaction and lead to actual cooperation and prosocial behaviors directed at other humans. In this chapter, we focus on empathy, prosociality, and the benefits that may accrue from such dispositions, both at the individual and societal level, when humans and agents interact. At an individual level, for example, prosocial spending seems to make individuals happier and result in higher levels of positive emotions [Dunn et al. 2014]. Similarly, individuals who engage in volunteering activities frequently report higher levels of happiness and health than those who do not [Borgonovi 2008], and some authors have even observed a link between the offer of instrumental support to close family and friends (i.e., informal caretaking) with a lowered mortality rate [Brown et al. 2003]. In addition, engaging in prosocial behaviors seems to result in greater well-being to the prosocial actor due to its ability to satisfy the psychological needs of autonomy, competence, and relatedness [Martela and Ryan 2016].


At a societal level, although hard to quantify, prosocial behaviors also bring many advantages. For example, in the United States of America alone, it was estimated that in 2018 around 30% of American citizens were engaged in some type of volunteering activity, the equivalent of over 75 million volunteers countrywide, whose total efforts account for work and services valued at around 167 billion dollars. Volunteers are the foundation of essential organizations, such as the national disaster response system, which provides vital aid to victims of hurricanes and other catastrophic events. However, the official number of volunteers does not include the many kinds of prosocial acts that are not considered formal volunteering, such as informal caretaking and the prosocial actions people perform toward friends and family. Moreover, the widespread movement of solidarity that originated during the COVID-19 lockdowns is one example of how prosociality can act as a motivator for social support in hard times. Many different acts of kindness and help were directed at those in need, especially toward the elderly, care workers, and, most often, strangers. The use of hashtags such as #viralkindness was high, and a sense of unanimity emerged as our unknown neighbors became our friends. The COVID-19 lockdowns were a time for empathy and prosociality. In different countries and continents, prosocial acts emerged, ranging from giving away free milk through a "kindness cooler" in Wisconsin and cosmetic factories producing and giving away hand sanitizer, to the creation of community kitchens worldwide. Moreover, Pfattheicher et al. [2020] have shown that empathic concern for those most vulnerable to COVID-19 predicted and promoted adherence to physical distancing and the wearing of face masks, two important behavioral measures recommended by the World Health Organization (WHO) to control the spread of SARS-CoV-2, which in turn contribute to making health systems work and allow better treatment for those infected. In the same line, Campos-Mercade et al. [2021] found that prosocial motivations were related to following these and other WHO health behavior guidelines, and also to donating to the fight against COVID-19.

Given the benefits of prosocial behavior, in this chapter we stand by the idea that, as we build agents that interact with humans, we need to go beyond social interaction and think about the effects that those agents can have on humans' well-being. We argue that effective SIAs should not only be social but also prosocial; SIAs should be able to act in a prosocial manner and evoke prosocial behaviors from their users, directed at the agent, at other humans, and eventually at society as a whole. The idea that interaction with SIAs can contribute to the development of prosocial skills and interactions is not new. This idea is supported by psychological models of learning that propose that we learn and develop skills based on our
interaction with other people and other social agents, in different contexts (e.g., direct interaction, playing games). In particular, according to the general learning model, people can extract information and learn from different situations and environmental interactions through the employment of a wide range of cognitive mechanisms [Anderson and Bushman 2002, Buckley and Anderson 2006, Barlett and Anderson 2012]. This model attempts to explain how different life experiences can have an impact on a person’s beliefs, attitudes, and cognitions [Gentile et al. 2009]. To address this challenge, we build on previous work about empathy and prosociality in SIAs [Paiva et al. 2017] by providing a framework that accounts for the main variables that can be used to design prosocial agents, for individual, group, and society level interactions. This chapter takes a step in examining how empathy in the interaction between humans and agents can be achieved, and the role it plays in fostering prosocial and altruistic behavior in general. This ultimate aim is the basis of the area of “prosocial computing,” as initially described in Paiva et al. [2018]. In this chapter, we first start by elaborating on the definition of the relevant concepts implicated in our work and by presenting a framework that captures both the potential effects of these concepts (empathy and prosociality) as well as the interactions among them, which are expected to produce prosocial behavior. Second, we present an oriented and selective review of the literature regarding the currently existing models and architectures to build prosociality in agents, followed by a review of user studies that have been conducted involving SIA. Finally, we present a selective review of prosociality models in populations, and we conclude by outlining possible future avenues of research and discussion.

11.2 Concepts and Framework

Concepts such as prosociality, cooperation, and altruism are important in many fields of psychology and other social sciences, as they underline the role that certain behaviors have in our daily lives, with important effects on how we behave toward each other and toward society in general. For this reason, the search for the causes or antecedents that explain why people act prosocially, and what conditions facilitate that choice, has a long and fascinating history that gathers multidisciplinary contributions from scientists across many areas of study, yielding interesting and sometimes even contradictory results. In particular, the pervasiveness of empathy, altruism, cooperation, and prosociality in humans has long puzzled biologists, economists, psychologists, and researchers from multiple other disciplines [Hamilton 1964, Trivers 1971, De Waal 2008, Rand and Nowak 2013]. Although altruistic behaviors are commonly seen as "heroic human acts," they are also observed in non-human species that exhibit complex
social structures [Carter et al. 2017]. For example, in some species of birds, one can observe unrelated individuals protecting little fledglings from predators, thus helping the breeding parents [Brown 1978]. Social insects (e.g., worker bees) give up their reproductive function in order to benefit their colonies [Hamilton 1972]. Some of these examples, where animals reveal apparent selfless behaviors, were a conundrum for Darwin: in a world where only the fittest survive, it is certainly puzzling that those sacrificing their own fitness—to benefit others—manage to win the contest of natural selection. Notwithstanding, Darwin himself advanced some explanations for the selection of altruistic behaviors, suggesting incipient notions of kin selection and reciprocal altruism. Some of these ideas were later elaborated. In the 1960s, Hamilton developed ideas on kin selection, coining what today is called Hamilton’s rule. This rule postulates that altruistic cooperation evolves if the genetic relatedness between the cooperator and the recipient of the altruistic act, times the reproductive benefit gained by the recipient, outweighs the cost of altruism [Hamilton 1964]. Later on, Trivers formalized the idea of reciprocal altruism, proposing that altruistic cooperation can evolve if a cooperator helping today will be helped tomorrow [Trivers 1971]. Other mechanisms, such as indirect reciprocity, spatial selection, and multi-level selection, were more recently studied [Nowak 2006, Rand and Nowak 2013]. These mechanisms can be seen as interaction structures that allow natural selection to choose, in the long-run, cooperative behaviors. In other words, these mechanisms constitute ultimate causes for cooperation. In parallel, research has advanced our knowledge on the proximate causes of cooperation, often rooted in psychological mechanisms. Empathy appears, in this context, as a prime justification for altruism. In particular, the empathy–altruism hypothesis of Batson suggests that cooperation is triggered, regardless of the costs and benefits involved, if someone feels empathy toward another individual [Batson et al. 1995]. Similarly, Frans de Waal suggests that empathy is the ideal candidate mechanism that underlies altruism, especially altruism that arises in response to another person’s pain, need, and distress [De Waal 2008]. The mysteries of cooperation are not solved. In fact, explaining the evolution of cooperation was pointed out as a grand challenge for the XXI century [Pennisi 2005]. The vast scope of the factors involved in the study of prosociality requires on our part an initial clarification of different concepts required to describe the various approaches we will discuss in this chapter. They are: ∙

Empathy is defined by Hoffman [Hoffman 2001] as a psychological process that makes a person have “feelings that are more congruent with another’s situation than with his own situation.” Empathy is a multidimensional concept usually distinguished in terms of cognitive and affective empathy
[Maibom 2017, Davis 2018]. Cognitive empathy is the capacity to put oneself in the other’s position by being able to see and understand what the recipient thinks and/or feels, also named perspective taking, and requires having a theory of another’s mind (theory of mind). Affective empathy involves affect from the actor (the empathizer). Examples include “vicarious” affect, resonance, or mirroring similar emotions of the recipient (also considered basic affective empathy). When applied to situations in which a recipient is in need or suffering, two different affective empathy dimensions have been proposed: empathic concern (also named sympathy or compassion) and personal distress. Empathic concern is the ability to feel other-oriented concerns, that is, sympathy for the welfare of others by resonating with others’ negative emotions, and often gives rise to prosocial behaviors. Personal distress involves feeling distress for oneself (self-oriented concern) and for the recipient in need [Maibom 2017]. ∙

Prosocial behavior is a multidimensional concept that can broadly be defined as a voluntary behavior intended to benefit another [Eisenberg and Spinrad 2014, Coyne et al. 2018]. Examples include altruism, solidarity, sharing, caregiving, and comforting. It can vary from high-cost (e.g., altruism, caregiving, volunteer, sacrificing) to very low-cost behaviors (e.g., comforting), and is intimately related with other constructs such as cooperation, reciprocity, empathy, generosity, trust, and fairness. The underlying motives to act prosocially can vary from being motivated to increase another’s welfare (other-oriented) to increase one’s welfare (self-oriented) [Eisenberg et al. 2016]. When there is no expectation of self-gain the behavior is considered altruistic, but when enacted because of the request of others or internalized social norms, it is associated with compliance [Xiao et al. 2019], suggesting moral reasons such as gratitude [Ma et al. 2017]. Specific emotions may also play an important role in prosocial actions. Emotions often associated with prosocial behaviors include sympathy, compassion, guilt, and regret, but their role is also highly dependent on the type of prosocial behavior, underlying motivation(s), context, individual differences, and group factors.



Altruism is an unconditional prosocial tendency for an agent to act to benefit the recipient and increase his/her welfare [Batson 2011] without the expectation of any self-gain (thus opposed to egoism) [Van Lange et al. 2014]. However, while some authors consider that altruism does not preclude the agent from benefiting from the behavior, other authors argue that even altruistic actions benefit the agent in some way. For this reason, they maintain that there is no such thing as authentic, genuine, or “true” altruism. In contrast,
for other authors, authentic altruism occurs when the altruistic action has some cost to the agent’s personal interest (for a review, see Schramme [2017]). In fact, in biology, altruism often refers to behaviors that are costly (in terms of reproductive fitness) to an organism and beneficial to the recipient [West et al. 2007]. Similarly, in economics, altruism is defined as costly behaviors that confer economic benefits on other individuals [Fehr and Fischbacher 2003]. ∙

Reciprocity is defined in broad terms as “treating another in the same way as the one is treated” [Kolm 2008]. Many forms of reciprocity have been described, but the most common is direct and indirect reciprocity. Direct reciprocity involves a mutual and direct exchange between an agent (A) and a recipient (B). Indirect reciprocity occurs when the reciprocal acts involve another person (C) who is not the initial recipient (B) and can be divided into upstream and downstream reciprocity. Upstream reciprocity occurs when the agent (A) acts prosocially toward a person (B) after receiving some prosocial behavior from the recipient (C). Downstream reciprocity corresponds to an increase in the likelihood that the agent (A) will be a recipient of prosocial behavior from another person (C) after acting prosocially toward a former recipient (B), and this likelihood is expected because it benefits the agent reputation [Nowak and Roch 2007, Ma et al. 2017]. Reciprocity—both direct and indirect—have been pointed as fundamental mechanisms to explain the evolutionary origins of altruistic cooperation [Trivers 1971, Nowak and Sigmund 1998, Nowak 2006, Rand and Nowak 2013].



∙ Cooperation is a type of prosocial behavior involving efforts to enhance joint positive outcomes for both the agent and recipient(s). However, cooperation also has different forms depending on the motivation. The most common distinction is between instrumental and non-instrumental (or elementary) cooperation. Instrumental cooperation refers to all behavior by which individuals contribute to the quality of a system that rewards cooperation and punishes non-cooperators. Yet, these actions are performed as a means to obtain self-benefit, that is, the agent performs the cooperative action(s) because it will enable the achievement of certain outcomes, including positive outcomes for the recipient(s). Cooperation is also often referred to, in biology, as behavior that provides benefits to another individual [West et al. 2007]. In this regard, one may distinguish between altruistic cooperation—referring to altruism, defined above—and collaboration [Tomasello and Vaish 2013] or mutualism [West et al. 2007]—when both the agent and the recipient benefit from the cooperative relation. Across different disciplines [Fehr and Fischbacher 2003, Rand and Nowak 2013, Wu et al. 2020], cooperation has been used to refer to altruistic cooperation, that is, costly behavior that confers benefits on other individuals.

∙ Selfishness is considered the motivation for self-benefit (egoistic or self-oriented concern) without concern for others’ interests and well-being [Crocker et al. 2017]. It underlies most of the current approaches to the creation of “rational agents” [Wooldridge 2003], very much inspired by the homo economicus notion. The idea that humans act in purely self-interested ways, trying to optimize their gains while disregarding the other’s welfare, as adopted in many economic theories, is the root of many approaches for designing rational agents [Wooldridge 2003]. However, humans do not act in a completely selfish way, as many studies involving social dilemmas played by humans have shown. Instead, humans cooperate and act altruistically at their own personal cost. For example, in the well-known prisoner’s dilemma, where the rational strategy is to defect, a meta-analysis found that humans cooperated on average in 47% of the interactions [Sally 1995].
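
To make concrete why defection is the individually rational choice in such a dilemma, the following minimal sketch encodes a one-shot prisoner’s dilemma with illustrative payoff values (the numbers are not taken from any of the cited studies):

```python
# Minimal one-shot prisoner's dilemma with illustrative payoffs (T > R > P > S).
# Values are hypothetical and only meant to show why defection dominates.
PAYOFFS = {  # (my_move, other_move) -> my payoff
    ("C", "C"): 3,  # R: reward for mutual cooperation
    ("C", "D"): 0,  # S: sucker's payoff
    ("D", "C"): 5,  # T: temptation to defect
    ("D", "D"): 1,  # P: punishment for mutual defection
}

def best_response(other_move: str) -> str:
    """Return the move that maximizes my payoff against a fixed opponent move."""
    return max(("C", "D"), key=lambda my_move: PAYOFFS[(my_move, other_move)])

# Defection is the best response to either opponent move, so a purely
# self-interested ("homo economicus") player always defects ...
assert best_response("C") == "D" and best_response("D") == "D"
# ... even though mutual cooperation (3, 3) beats mutual defection (1, 1),
# which is the gap that the roughly 47% human cooperation rate highlights.
```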

11.2.1 From Empathy to Prosociality

According to Hoffman [2001], empathy is the “spark of human concern for others, the glue that makes social life possible,” underlying the strong effect that empathy has toward the establishment of social bonds. More specifically, empathy has been widely thought of as an “other-centered” emotion that facilitates the understanding of other people’s situations by allowing us to put ourselves in another person’s shoes (i.e., perspective-taking) [Batson et al. 1991, Rumble et al. 2010]. So, it is not surprising that many studies have reported positive relations between empathy and prosocial behavior. In particular, Batson, by means of a set of experimental designs, tested the hypothesis that empathy (in particular, empathic concern) is a strong predictor of altruistic motivation and behavior, also known as the empathy–altruism hypothesis [Batson 2011, 2014]. In human–human interactions, empathy has been linked to a number of prosocial behaviors such as helping and cooperating in contexts in which such behaviors do not serve the individuals’ immediate selfish objectives [Batson et al. 1995, Batson and Ahmad 2001]. Empathy is thought to sustain these types of prosocial behavior by increasing the positive weight assigned to the other’s outcomes, consequently increasing generous behavior from the former toward the latter [Rumble et al. 2010]. Regarding empathy-related traits, a recent meta-analysis on the predictive role of personality on prosocial behavior across several interdependent situations [Thielmann et al. 2020] has shown that one of the strongest positive predictors of prosocial behavior was the trait of unconditional concern for others. In fact, in many studies priming empathic concern toward a recipient (usually framed as a victim), participants showed more altruism, even when these behaviors were against their personal interest [Batson 2014]. Less consistent is the role of personal distress in prosocial behavior. Although personal distress also tends to co-occur with empathic concern toward a recipient expressing distress, actions vary depending on how much distress the recipient feels. When the distress is experienced as overwhelming, it is often associated with a tendency to withdraw from the distressing context, thereby compromising prosocial acts toward the recipient in need [Maibom 2017]. In spite of this, affective empathy, and empathic concern in particular, seems to be one of the main predictors of prosocial behavior. However, cognitive empathy also seems to play an important role. For example, in three studies, Galinsky et al. [2008] have shown that perspective taking (understanding others’ interests and motives) was more useful in negotiation processes than affective empathy. Thus, it is important to understand which empathic dimensions arise, and under which circumstances, to establish their relationship with prosocial behaviors. Furthermore, when agents are mixed in these types of interactions, all these dimensions need to be articulated and somehow engineered.

11.2.2 Prosocial Agents: Dimensions of the Current Analysis

The goal of this chapter is to provide an overview of the area of SIAs that act in situations where empathic processes and prosocial behaviors exist. Thus, we should consider the multitude of roles and situations in which agents can participate and the different ways by which humans may interact with them. As a way to characterize these scenarios, let us consider that there are two humans/agents: a recipient and a subject. The recipient is the agent experiencing an emotionally charged situation (for example, when one is given some bad news) and potentially expressing it to others. This expression can be displayed through facial expressions or even by uttering the sentiment felt in natural language. As a result, the subject responds to the situation and the feelings of the recipient, experiencing an empathic response (see the empathic phase in Figure 11.1), and eventually acting in a prosocial manner (see the prosocial phase in Figure 11.1). We can say that we are witnessing prosocial behavior when the subject incurs some cost (Cs) as she/he acts to provide some gain (Gr) to the recipient. As mentioned before, the subject her/himself may also obtain some gain from the prosocial action. In fact, in many scenarios that is the case, as the positive effects of prosociality are enormous, as already mentioned.

Figure 11.1

Generic situations of empathic and prosocial behaviors between two agents.

The characterization of the roles that SIAs and humans play in this analysis framework allows for the following possibilities:

∙ SIAs that act as the subject in the empathic phase: the agent is in a situation in which it perceives others (agents or humans), and its internal mechanisms allow it to respond empathically;

∙ SIAs that act as a subject in the prosocial phase: the agent acts toward the recipient in a prosocial manner;

∙ SIAs that act as recipients in an empathic phase: agents act in scenarios to evoke empathy in others (including users);

∙ SIAs that act as recipients in a prosocial phase: the agents promote prosocial behaviors in others.

These different types of roles for the agents require from them a myriad of design features and computational processes. An agent acting as the subject needs to be equipped with mechanisms that allow it to perceive and appraise the situation, be able to reason about the others, and respond adequately. Features such as emotional recognition or perspective taking may be essential for SIAs to act as subjects, but not necessary for acting as recipients. Other features such as the SIA’s embodiment (disembodied, virtually embodied, or physically embodied) may also be more important in one type of context than another. For example, an SIA acting as a recipient may use aspects of its embodiment (its gaze, lights, posture) that may be vital to conveying the emotional state in a situation.

The roles of SIAs can be extended when we move from traditional dyadic interactions, as portrayed in Figure 11.1, to groups featuring both humans and agents (see Chapter 17). In many situations we can also have agents acting as bystanders witnessing empathic and prosocial situations. For example, agents may operate in a group context where other agents or humans act in a manner that fires some emotional and empathic responses. Creating agents as bystanders can be inspired in the bystander intervention model (BIM) proposed by Latané and Darley [1970] that was developed to examine bystander behavior in emergency situations. This model describes a set of successive phases which an individual must experience to intervene in a situation, namely, the perception of the event, the interpretation of the degree of emergency of the event, the recognition that it is the agent’s responsibility to intervene, knowing what to do, and intervening. This model has actually been observed to characterize the behavior of bystander adolescents in cyberbullying cases [Ferreira et al. 2020], opening doors to agent-based interventions with agents acting as bystanders. This move from dyadic interactions to groups and societies is of paramount importance when analyzing the role that SIAs may have in real-world scenarios. As proposed by Penner et al. [2005] the analysis of prosocial behavior can be done at three different levels: micro-level, meso-level, and macro-level. At the micro-level, the study of prosociality is done around the origins of prosocial tendencies in general and sources of variation for these tendencies. In the meso-level, the study is done around helper–recipient situations (similar to what is shown in Figure 11.1). Finally, at the macro-level, prosociality is studied in the context of groups and societies. These three levels are interconnected, and if we place agents to interact with humans, we need to consider different levels of analysis. In this chapter, we will explore the role of empathy and prosociality in social agents along three different levels (see Figure 11.2). The first one is the individual level (A), where we will detail the internal processes that lead to empathy and prosociality, and how those processes can be integrated and engineered in SIAs (see Section 11.3). The second level (B) is the interaction level, where we examine dyadic interactions and where we review the mechanisms and processes that affect how humans interact with SIAs, particularly focusing on the effects that SIAs have on human prosociality (Section 11.4). Finally, we believe that prosociality in human–agent interactions needs to be examined beyond isolated encounters, that is, embedded in dynamic populations and acknowledging possible long-term effects. Therefore, at the third level of analysis (C) we will explore the role of prosociality at the macro-level, that is, in groups and in (hybrid) populations of humans and social agents (Section 11.5).

Figure 11.2

Dimensions of analysis: (A) agent architectures: focus on the internal processes that lead to empathy and prosociality, and how those processes can be integrated and engineered in social agents; (B) social agents: focus on the mechanisms and processes that affect how humans interact with social agents, particularly focusing on the effects that SIAs have on human prosociality; (C) social agents within populations: focus on the role of SIAs in stabilizing prosociality in (hybrid) populations of humans and agents.

11.3 Models and Architectures to Build Empathy and Prosociality

In order for SIAs to act in empathic and prosocial situations, they need to be equipped with computational mechanisms that include perception, decision making, and action execution, underpinning a traditional agent modeling approach. Moreover, models and architectures to create empathic agents are also inspired by existing theories of empathy in humans, providing ways to identify and structure the computational processes in social agents. In general terms, these theories may follow two distinct approaches: on the one hand, categorical approaches carefully distinguish affective empathy from cognitive empathy; on the other hand, dimensional approaches propose that both affective and cognitive mechanisms can be integrated into a multidimensional system. This implies that architectures may feature different computational processes accordingly. In a recent survey, Yalçın and DiPaola [2019b] have systematically analyzed the literature on empathic agent architectures by separating affective mechanisms, also referred to as low-level functions, from cognitive mechanisms, also referred to as high-level functions. Their framework and, in particular, the distinction between affective and cognitive mechanisms of empathy aims to highlight the similarities and overlaps between the existing models and how some functions of these models can be functionally integrated. In contrast, the framework proposed by Boukricha et al. [2013], and extended in Paiva et al. [2017], proposes the following components in a general architecture: (1) empathy mechanisms: “the process by which an empathic emotion arises”; (2) empathy modulation: “the process by which both an empathic emotion is modulated and a degree of empathy is determined”; and (3) empathic responses: “the process by which an empathic emotion is expressed/communicated and actions are taken” [Paiva et al. 2017]. This framework specifically aggregates low- and high-level functions into the empathy mechanisms, as well as merges their outputs into empathic responses. In other words, it acknowledges that empathic responses by artificial agents may still occur regardless of whether the mechanisms behind them are affective and/or cognitive. We are particularly interested in this framework as we postulate that an empathic response by artificial agents, independently of their theoretical and methodological approach, can lead to prosocial behaviors (see Figure 11.3). Finally, to bridge artificial empathy and prosociality in SIAs, we will first overview existing architectures and computational models to create empathic agents and then discuss how artificial empathy may lead to prosocial behaviors. We will base this connection on Batson’s hypothesis that empathy (in particular, empathic concern) is a strong predictor of altruistic motivation and behaviors. However, from an architectural point of view, we assume that agents must have different cognitive and affective mechanisms, often inspired by humans. Thus, perception, cognition, motivation, emotions, and interpersonal behaviors ought to be considered, as they are essential for creating intelligent and social behaviors in agents.
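
As a rough illustration of how the three components of this framework could be organized computationally, the sketch below chains an empathy mechanism, a modulation step, and an empathic response in a single loop. All class and function names, factors, and thresholds are illustrative assumptions and do not correspond to any published architecture.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """What the agent perceives about the recipient and the situation."""
    expressed_emotion: str   # e.g., "sadness", inferred from face, voice, or text
    situation: str           # e.g., "received bad news"

class EmpathicAgent:
    """Illustrative pipeline following the mechanism -> modulation -> response split."""

    def __init__(self, mood: float, affective_link: float):
        self.mood = mood                       # self-related modulation factor
        self.affective_link = affective_link   # relationship-related modulation factor

    def empathy_mechanism(self, obs: Observation) -> str:
        # Simplest possible mechanism: affective matching of the perceived emotion.
        return obs.expressed_emotion

    def empathy_modulation(self, emotion: str) -> float:
        # Degree of empathy shaped by mood and the link to the recipient (0..1).
        return max(0.0, min(1.0, 0.5 * (self.mood + self.affective_link)))

    def empathic_response(self, emotion: str, degree: float) -> str:
        # Expression and/or action; a stronger degree can lead to prosocial action.
        if degree > 0.7:
            return f"express {emotion} and offer help"
        return f"acknowledge {emotion}"

    def step(self, obs: Observation) -> str:
        emotion = self.empathy_mechanism(obs)
        degree = self.empathy_modulation(emotion)
        return self.empathic_response(emotion, degree)

agent = EmpathicAgent(mood=0.6, affective_link=0.9)
print(agent.step(Observation("sadness", "received bad news")))
```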

Figure 11.3

Empathy processes and mechanisms in SIAs leading to prosocial behavior.

11.3.1 Empathy Mechanisms for Empathic Agents

Empathy mechanisms constitute the internal processes that lead an empathic emotion to arise. Hence, empathy mechanisms for artificial agents are closely
related to their perceptive skills. Generally, empathic agents are required to be aware of others, either by recognizing their emotions or their actions within a certain context, from which they can then infer or interpret the others’ goals, intentions, and/or affective state. Additionally, empathy mechanisms may also depend on the modalities that agents use to interact, which in turn may increase the complexity of modeling empathy. For instance, there are rule-based systems in which the agents are able to produce empathic behaviors, such as sympathetic or encouraging utterances, by interpreting the context or by performing a situational appraisal [Becker et al. 2005, Bickmore and Picard 2005, Prendinger and Ishizuka 2005, Lisetti et al. 2013, Leite et al. 2014]. Other more complex behaviors may present sophisticated models or architectures according to different methodological approaches. One of the most used approaches is the analytical or theory-driven one, in which computational models are based on theoretical models of empathy established in psychological and neuropsychological research. Mimicry is considered a fundamental mechanism for empathy and is supported by both the perception-action hypothesis [Preston and De Waal 2002] and the shared affective neural networks [De Vignemont and Singer 2006]. Some examples have explored such affective mechanisms using motor mimicry [Riek et al. 2010, Gonsior et al. 2011], or affective matching techniques [Boukricha et al. 2013, Lisetti et al. 2013, Leite et al. 2014]. Regarding cognitive mechanisms, four works should be highlighted. Firstly, both Leite et al. [2013] and Rodrigues et al. [2015] use perspective-taking through self-projection. On the one hand, Leite et al. have used the appraisal mechanism of the robotic agent to appraise its companion’s situation for that particular game context [Leite et al. 2013]. On the other hand, Rodrigues et al. go a step further by proposing a general model for the agent to appraise the target’s situation using its own belief system and goals to appraise the other’s situation as its own [Rodrigues et al. 2015]. Similarly, Boukricha et al. used regression functions that map the activation of action units into the pleasure–arousal–dominance space as a shared representational system (i.e., both to animate the agent and to infer the emotional state of other agents). Recently, Yalçın and DiPaola [2018] proposed another computational model of empathy, which is inspired by the Russian doll model of empathy [De Waal 2007]. This last approach not only allows for other-oriented perspective-taking, such as theory of mind, but also for isolated information processing between low- and high-level mechanisms of empathy. Another methodological approach to create computational models of empathy is the empirical or data-driven one, in which the models are obtained from collected data and constitute generalizations of empathic behaviors and/or empathic situations. Within this methodological approach, McQuiggan et al. [2008] used data
from human–human social interactions in a virtual environment to create two classifiers: one to learn when and another to learn how the agent can act empathetically. Similarly, Ochs et al. [2012] created a model to express empathic emotions based on an empirical analysis of human–agent dialogues. The collected dialogues were annotated according to their conditions of elicitation, which matched the appraisal theory they used [Scherer 1988]. The last methodological approach is considered hybrid as it includes both empirical and theoretical processes for the agent to learn and/or express empathy. Within this methodological approach, we would like to highlight two works that both follow a developmental perspective. Firstly, Lim et al. explored the learning process of mirroring mechanisms as an emergent empathic behavior and, therefore, their work was mostly focused on low-level empathy [Lim and Okuno 2015]. Secondly, the work by Asada et al. proposed that empathic development can emerge from a parallel between imitation (such as motor mimicry or emotional contagion) and other cognitive mechanisms (such as self–other distinction) [Asada 2015]. Another example can be found in the EMOTE project [Alves-Oliveira et al. 2019] where an autonomous robot was designed with empathic competencies to foster collaborative learning in adolescents, in particular toward sustaining positive educational outcomes in long-term collaborative learning. The computational model built to drive the behavior of the SIA (embodied as the NAO robot) was a “hybrid behavior controller” combining a rule-based component and a data-driven one. The data-driven component was built with a dataset created using a restricted perception Wizard-of-Oz study [Sequeira et al. 2016]. The final system was tested on the robot, showing its capability to foster meaningful discussions among students interacting with the robot and among themselves.
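
The data-driven idea of learning both when and how to respond empathically from logged interactions can be illustrated with a toy pair of classifiers. The features, labels, and data below are invented for illustration and are not taken from McQuiggan et al. [2008] or any other cited system:

```python
# Illustrative sketch: learn from logged interactions (features of the situation)
# both *when* to react empathically and *how* to react.
from sklearn.tree import DecisionTreeClassifier

# Each row: [recipient_valence, recipient_arousal, task_progress] (invented features)
X = [[-0.8, 0.7, 0.2], [0.5, 0.3, 0.9], [-0.6, 0.4, 0.5], [0.2, 0.1, 0.7]]
y_when = [1, 0, 1, 0]                              # 1 = respond empathically
y_how = ["comfort", "none", "encourage", "none"]   # which empathic behavior to show

when_clf = DecisionTreeClassifier().fit(X, y_when)
how_clf = DecisionTreeClassifier().fit(X, y_how)

situation = [[-0.7, 0.6, 0.3]]
if when_clf.predict(situation)[0] == 1:
    print("empathic behavior:", how_clf.predict(situation)[0])
```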

11.3.2 Empathy Modulation in Empathic Agents

Empathy modulation is the process by which the empathic emotion or the degree of empathy is shaped by features of the agents and the situation. Empathy modulation is inherently coupled to the empathy mechanisms, as it shapes them, changing the result of the process. In humans, this modulation reflects the individual differences found as well as the type of relationship that exists between subject and recipient. Paul Bloom [2017] discusses the negative effects of empathy due to modulation, arguing that “empathy is biased,” and may “push us in the direction of parochialism and racism.” However, despite the importance of this aspect for studying empathy, so far only a few computational models of empathy have included modulation factors in their architectures. For instance, McQuiggan et al. [2008] have considered in their data-driven model the following features of the observer: gender, age, user empathetic nature, and goal orientation. Boukricha
et al. have not only included features of the subject (observer), such as the mood, but also liking and familiarity to represent the social relationship with the recipient, as well as the desirability of the observed emotion [Boukricha et al. 2013]. Similarly, the model proposed by Rodrigues et al. [2015] supports the following modulation factors: mood, personality, affective link, and similarity (see Figure 11.4). Another factor that modulates the empathic responses is the strength of the emotional situation, the context, and the valence and intensity of the emotions exhibited by the target. These different categories of empathy modulators for computational models of empathy [Paiva et al. 2017] need further investigation, in particular in what concerns situational or context-related factors.
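
A minimal sketch of how such modulation factors could be combined into a single degree of empathy is given below; the particular weights and the linear form are arbitrary assumptions made only to illustrate the idea of modulation as a scaling of the empathic emotion:

```python
def degree_of_empathy(mood, personality_empathy, affective_link, similarity,
                      emotion_intensity):
    """Toy modulation rule: weighted combination of the factors named above.

    All inputs are assumed to lie in [0, 1]; the weights are arbitrary choices
    made only to illustrate how modulation could scale an empathic emotion.
    """
    dispositional = 0.4 * personality_empathy + 0.1 * mood        # self-related factors
    relational = 0.3 * affective_link + 0.2 * similarity          # relationship-related factors
    return min(1.0, (dispositional + relational) * emotion_intensity)

# A close, similar recipient expressing an intense emotion yields a high degree:
print(degree_of_empathy(0.6, 0.8, 0.9, 0.7, emotion_intensity=1.0))
```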

Figure 11.4

Examples of architectures for empathic SIAs taking modulation into account: (a) from Rodrigues et al. [2015] and (b) from Yalçın and DiPaola [2018].

11.3.3 Empathic Responses in Empathic Agents

Empathic responses can include both the expression of attitudes as well as actions and action tendencies. In fact, prosocial acts can result from an empathic emotion. Furthermore, the expression of empathy in SIAs can be displayed through different channels or modalities, according to the social affordances of the agent.

The most common empathic response in SIAs is body expression [Riek et al. 2010] and, in particular, facial expression [Becker et al. 2005, Bickmore and Picard 2005, McQuiggan et al. 2008, Ochs et al. 2012, Boukricha et al. 2013, Lisetti et al. 2013, Rodrigues et al. 2015, Yalçın and DiPaola 2019a]. This means that the body of the SIA must include some form of “face” or eyes. A few examples have also conveyed the empathic emotion in conversational settings through language [Brave et al. 2005, Bickmore and Picard 2005, Prendinger and Ishizuka 2005, McQuiggan et al. 2008]. Finally, an additional empathic response identified by Paiva et al. [2017] is the action tendency, which is the readiness or urge to carry out a behavior upon being prompted by a certain stimulus. In the next section, we will discuss existing theories that address how action tendencies and empathic responses, in general, may precede prosocial behavior.

11.3.4 From Artificial Empathic Responses to Prosocial Behavior

According to Batson et al., the major source of altruistic motivation is empathy, an other-oriented emotional response. This emotion is “elicited by and congruent with the perceived welfare of a person in need” [Batson et al. 2015] and is frequently reported as pity, compassion, or sympathy. Note that not all empathic emotion leads to altruistic motivation. For example, one may feel joy for another who received some good news, and that is still considered an empathic response. However, altruism and prosociality result from empathy felt when another is perceived to be in need. The empathy–altruism hypothesis claims that empathic concern is one of the main drivers of altruistic motivation, and thus conducive to prosocial actions. Recently, Costantini et al. [2019] proposed a simulation model of prosocial behaviors that integrates both descriptive and normative approaches. When analyzing literature that explores basic processes and determinant variables of prosocial behaviors, the authors distinguish between descriptive-emotive approaches and normative-evolutionary approaches. While the first mainly aim at explaining the psychological motivations of prosociality, the latter look for an evolutionary benefit of prosocial acts. We will leave the discussion of the normative-evolutionary approaches to the meta-level analysis of social agents in societies (see Section 11.5). Thus, in the same vein as Batson et al. [2015] and Costantini et al. [2019], the role of emotions and empathy is prominent as an antecedent of prosocial behaviors. Nevertheless, so far, little work has been done in reflecting this link in the SIA community, and in particular in agents’ architectures. Considering the models and architectures we have previously reviewed, they may present distinct mechanisms to produce an empathic response, but some of them may even trigger hierarchically more than one empathic response [Yalçın
and DiPaola 2018]. In both cases, regardless of the particular mechanism(s) being used, how can an agent act prosocially upon having an empathic response? One concrete framework from psychology, the SAVE framework (sociocultural appraisals, values, and emotions), is a good example from which to draw the first steps for creating prosocial behavior, as it provides an equation that tries to mirror the complex deliberative processes that occur during prosocial decisions by humans [Keltner et al. 2014]. If such an equation can be used by an agent to consider prosocial acts, empathy mechanisms can also be used to calculate some of its parameters. Cognitive mechanisms can contribute to inferring the benefits of an action for someone else, referred to as B_recipient. Similarly, empathic agents might also determine their own benefit of performing prosocial behaviors, referred to as B_self, upon their empathic responses. For instance, the negative state relief theory [Baumann et al. 1981] suggests that helping behaviors can reduce the negative states of the actor.
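
Since the SAVE equation itself is not reproduced here, the following sketch uses a generic weighted cost-benefit rule only to illustrate how empathy-derived estimates of B_recipient and B_self could enter a prosocial decision; the weighting by empathic concern, the threshold, and all values are assumptions rather than the published formulation:

```python
def decide_prosocial(b_self, b_recipient, cost, empathic_concern):
    """Generic cost-benefit rule (not the SAVE equation itself).

    empathic_concern in [0, 1] scales how much the recipient's benefit counts,
    in the spirit of the empathy-altruism hypothesis; all values are illustrative.
    """
    utility = b_self + empathic_concern * b_recipient - cost
    return utility > 0

# With high empathic concern, a costly helping act with little self-benefit
# can still be selected; with low concern, it is not.
print(decide_prosocial(b_self=0.1, b_recipient=0.8, cost=0.5, empathic_concern=0.9))  # True
print(decide_prosocial(b_self=0.1, b_recipient=0.8, cost=0.5, empathic_concern=0.2))  # False
```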

11.4 Empathy and Prosociality in the Interaction with SIAs

Within the multitude of environmental factors that can have an impact on the interaction with technology, empathy and prosociality seem to be gaining increasingly more relevance. At the same time, the use of SIAs is now more widespread as online virtual interactions become ever more common, virtual characters begin to be used more and more in different applications, and social robots start to gain ground as actors in our daily activities. As such, the question of whether and how these SIAs interact with and affect people’s behaviors, both in real-life and virtual scenarios, has received a considerable amount of attention recently. While in the previous section we delved into the internal computational mechanisms that are required for empathic and prosocial SIAs, in this section we consider how empathic and prosocial interactions between humans and SIAs unfold (see Figure 11.2). So, we will discuss some of the issues underpinning the different factors that affect the emergence of empathic and prosocial responses in the interaction between humans and SIAs.

11.4.1 Research and Application Scenarios

The concept of prosociality is important in many fields of psychology, biology, and economics as it underlies many behaviors that are central components of our daily lives and have important effects on how we behave toward each other and toward society in general. For this reason, the search for the causes or antecedents that explain why people choose to act prosocially, and what conditions facilitate that choice, has a long and fascinating history that gathers the multi-disciplinary contributions of many scientists and yields many interesting results. As such, many
studies have attempted to determine the factors that influence human prosociality not only in human–human scenarios but also toward agents and toward other humans in virtual spaces. Studies in this area make use of an array of social games or dilemmas to model important aspects of social interaction and prosocial behaviors, such as the Prisoner’s Dilemma, Trust Game [Berg et al. 1995], or Public Goods Game (see Gotts et al. [2003], Dawes [1980], and Van Lange et al. [2013]). Social dilemmas can be broadly defined as situations in which short-term self-interest is at odds with longer-term collective interests [Van Lange et al. 2013], and they are particularly important as research settings to explore the strategic interactions between agents. In the context of social and prosocial interaction studies, they have been widely used given that they usually present a scenario in which participants are asked to decide between taking a selfish course of action that serves their own immediate interest, or a prosocial course of action that serves the collective interests. The flexibility and widespread use of these games have allowed for the emergence of a vast, interdisciplinary body of research (see Berg et al. [1995], Sally [1995], Tavoni et al. [2011], Rand et al. [2012], Rand and Nowak [2013]), that has, to some extent, provided counter-evidence to the thesis of homo economicus (i.e., the argument that people are mostly guided by external, selfish individualistic interests) [Gotts et al. 2003]. In addition, these research settings grow in relevance as they can be used to represent and model several collective, complex group situations that are often dependent on the actions of large groups of independent agents (e.g., climate change or resource depletion) [Gotts et al. 2003] and humans, as will be discussed in Section 11.5. Furthermore, this type of scenario constitutes a basis for studying which characteristics (e.g., behavior, embodiment) of social agents can be manipulated to foster prosocial behavior. Despite the widespread use of social dilemmas as settings for studying empathy and prosociality, they are nevertheless artefacts where the complexity of real-world cases is reduced, allowing researchers to pinpoint the exact elements to study and draw conclusions from very controlled situations. However, they are often too simple for real-world applications. Real-world scenarios are messier and involve many more variables, but SIAs may offer the potential for change in real-life interactions with humans. As examples, consider the use of an SIA acting as an empathic virtual nurse to promote behavior change in health-related issues [Bickmore and Picard 2005], or the recent work by Morris et al. [2018] that created an empathic conversational agent to help people with mental health problems. In fact, health and education are application domains where the use of empathic SIAs has been quite prominent. In the particular case of education applications, SIAs can be used in several topics and for distinct target users. The TARDIS system, for instance, uses an SIA to coach young adults in the context of job interviews
[Anderson et al. 2013]. SIAs have also been used in the game Crystal Island in the domain of microbiology for middle school students [Sabourin et al. 2011] or embodied as robots exhibiting aspects of empathy processes to train children and adolescents to understand geography and sustainability [Castellano et al. 2013, Alves-Oliveira et al. 2019]. In fact, agents can also be used to foster the development of prosocial skills, and many interventions aimed at triggering prosociality have been developed in the past few years, with varying degrees of success [Goldstein et al. 1994, Leiberg et al. 2011, Schellenberg et al. 2015, Lukinova and Myagkov 2016]. Some of these interventions have become technology-based [Ibrahim and Ang 2018], opening doors for real-world cases for SIAs. Some of these interventions are discrete in time and highly targeted at providing intentional prosocial training (e.g., Lukinova and Myagkov [2016]), whereas others seek to invoke prosocial skills in a more continuous manner through interaction with computerized agents. In addition, as demonstrated by the study conducted by Kozlov and Johansen [2010], prosocial behaviors in virtual environments seem to obey the same influences as prosocial behavior in real-life environments. That is, virtual environments constitute a good setting to explore some of the human-to-human studies on empathy and prosociality. For example, the existence of a large group of bystanders and the imposition of time constraints to help both seem to hinder participants’ helping behaviors toward virtual agents [Kozlov and Johansen 2010]. This transference of the psychological determinants of prosocial behavior and the subsequent prosocial responses within virtual environments (and toward social agents) falls in line with the predictions of the media equation theory [Reeves and Nass 1996], which broadly states that technology can elicit social responses from humans similar to those elicited by other humans in the same social situations. These can include, much like in-person prosocial behavior, user-related variables (such as personality [Graziano et al. 2007, Pursell et al. 2008, Hilbig et al. 2014, Habashi et al. 2016], dispositional compassion and empathy [Rameson et al. 2012, Lim and DeSteno 2016, Lupoli et al. 2017], or emotions [Batson 2014]), virtual environment-related variables (such as the presence of bystanders [King et al. 2008, Kozlov and Johansen 2010, Slater et al. 2013]), and agent-related variables (such as ethnicity [Gamberini et al. 2015] and gaze behavior [Slater et al. 2013]). In other words, exposure to prosocial content in virtual environments (often with the presence of SIAs) is expected to have both short-term (e.g., by increasing positive affect [Saleem et al. 2012]) and long-term impacts (e.g., through changes in trait empathy [Prot et al. 2014]) on people’s prosocial behaviors, motivations, and tendencies [Coyne et al. 2018]. For example, Gentile et al. [2009] demonstrated that repeated engagement with prosocial games can result in players transferring and generalizing prosocial motivations in real-life scenarios that are similar to those
presented in the game, consequently resulting in greater helpful and cooperative behaviors. This effect seems to be culturally robust and remain stable across different age ranges and levels of exposure (long- and short-term) [Saleem et al. 2012, Greitemeyer and Mugge 2014, Gentile et al. 2009]. Recently, Ferreira et al. [2021] examined whether experiencing a multiplayer serious game could foster cognitive empathy and prosociality in adolescent bystanders of cyberbullying (see Figure 11.5). The game uses SIAs acting as victims, bullies, and bystanders in the game, and the results suggest an effect in increasing prosociality when compared with a control group. In a similar context, the FearNot! game [Aylett et al. 2005] was developed to foster empathy toward a victim of bullying and promote behavior change (see Figure 11.5). The game, featuring SIAs in a storytelling environment, was designed to help children experience effective strategies for dealing with bullying. The results of a large-scale evaluation showed a short-term effect on escaping victimization for a priori identified victims [Sapouna et al. 2010]. Indeed, studies using virtual reality as a method to create more lifelike and ecologically valid scenarios to observe prosocial behavior suggest that prosociality can be elicited through a number of factors. For example, one study that manipulated the affordances (superhero flight or riding as a passenger in a helicopter) given to players in the context of a simulated search and rescue activity showed

Figure 11.5

Examples of scenarios of use of SIAs to address problems of bullying and cyberbullying (a) FearNot! and (b) Com@Viver.

that participants given a superpower were more likely to engage in real-life prosocial behavior immediately after the study [Rosenberg et al. 2013]. These results are in line with previous studies priming superhero concepts to influence prosocial behavior, which have found that priming was effective not only at increasing participants’ immediate likelihood of helping in hypothetical situations but also their engagement in prosocial activities (specifically, volunteering) three months later [Nelson and Norton 2005]. Given the distinct research scenarios and different areas that take advantage of SIAs to support learning or promote prosocial behaviors and attitudes, the question that arises, namely what specific factors can contribute to eliciting such behaviors, still needs further investigation.

11.4.2 Agent’s Characteristics, Empathy, Prosocial Outcomes, and Measures

Although research on prosociality in the context of SIAs is still quite new, many studies have already been conducted to investigate which agents’ characteristics can impact empathic responses from users and nudge them toward prosocial courses of action. In fact, some studies have shown that empathy in SIAs seems to have a positive impact on cooperation and prosocial behavior, a result that is in line with the empathy–altruism hypothesis [Baumann et al. 1981]. Nevertheless, in agents the display of empathy requires the agent to be able to recognize or modulate the user’s emotional state at a given time or as a response to a given situation, and to be able to communicate effectively in response [Paiva et al. 2017]. This effective communication might include the agent’s ability to convey its emotions through some “embodiment.” In addition, some studies have suggested that the embodiment or appearance of the SIA can affect how users respond to it. However, SIAs can be embodied in many different ways. They can be portrayed as a 3D virtual character in a virtual world; as a 2D character on a screen; as a conversational system such as Alexa; disembodied like Cortana or Siri; or even physically embodied as a very realistic social robot like Erica [Glas et al. 2016]. This wide variety of embodiment possibilities leads us to question whether the degree of embodiment may act just as a mere facilitator of the social interaction or also have some impact on the empathic responses as well as the prosocial actions of people interacting with them. In fact, we may question whether a physical body is better for the SIA than a virtual body or no body at all. In a study by Seo et al. [2015], empathy responses to a physical or a virtual “robot” were compared. The main question addressed was: how do people empathize with a physical or a virtual (simulated) robot when something bad happens to it? The results reported suggest that people may empathize more with a physical robot than a virtual one. Indeed, it has been shown that empathy display may lead to improved interpersonal relations, with users who consider an
empathic robot more as a friend in comparison to a robot not displaying that feature [Pereira et al. 2010]. In a recent study comparing embodied agents (a robot) versus disembodied ones, people interacted with a prosocial agent and a selfish agent in a variant of a public goods game [Correia et al. 2020] (see Figure 11.6 for the illustration of the robotic embodiment in this study). The study showed that when the agents were “disembodied,” prosocial agents were rated more positively and selfish agents rated more negatively, which is what one would expect. However, when agents were “embodied” this effect did not occur, which means that although the social aspects achieved by embodiment can positively affect the emotional responses to agents (as is usually the case), the “embodiment” itself may mask selfish behaviors from the agents. That is, embodied affordances of the agents seem to lead people to consider additional aspects during the interaction, and the behavior itself, such as acting selfishly, becomes less salient when compared with other features associated with embodiment. Thus, the “type” as well as the existence of embodiment in SIAs matter in prosocial contexts [Correia et al. 2020]. Surprisingly, robots that display lower levels of human resemblance seem to be more effective at triggering prosocial behaviors from their human users [De Kleijn et al. 2019]. The study reported in De Kleijn et al. [2019] suggests that, although appearance might have an effect on fairness, it nevertheless fails to affect prosocial behavior. Other studies looking at prosocial behavior in the context of HRI have used a variety of different robots, making their results hard to compare and leaving the issue of embodiment still largely unresolved. A few authors have, however, already developed social robots especially for this purpose [Sarabia et al. 2013], showing DONA [Kim et al. 2010], a robot developed for the purpose of collecting money from kind passersby to donate it to charity. Designing social agents that can successfully exhibit and evoke empathy (and the resulting prosocial outcomes) requires paying special attention to various factors, such as the characteristics of the agent (e.g., its embodiment, or physical

Figure 11.6

Example of robotic embodiment: EMYS from Correia et al. [2020].

appearance, see Chapter 4 of this book), the dialogue that the agent is able to establish, the social and emotional responses, the non-verbal behaviors, the characteristics of the user, the details of the situation, and the mechanisms and modulation processes that can affect the empathic response (e.g., degree of familiarity with the agent, signaling of the need for help, prior social relation with the users) [Paiva et al. 2017]. In fact, emotions have been shown to positively impact prosocial behavior toward social agents, including context-based amalgamations of positive and negative emotions, such as gratitude (expressed when both the human and the social agent cooperate), shame or guilt (expressed when the social agent defects), or anger (expressed when the human agent defects) (e.g., De Melo et al. [2009, 2010]). In particular, the display of context-based emotions aligned with prosocial motivations by a social agent seems to have a positive role in prosociality by increasing the participant’s level of trust in the agent [Riegelsberger et al. 2003] and perceived likability [Straßmann et al. 2018], although more research on these mechanisms is needed before reaching a final verdict. Research comparing the effect of context-based emotions on prosociality and cooperation between virtual agents controlled by humans (i.e., avatars) versus agents controlled by algorithms yields similar results according to the type of emotion displayed, with the expression of cooperative intentions by the agents having a superior effect on cooperative behavior toward avatars (compared to virtual agents), but with signaled competition intentions resulting in similar competitive behavior toward both avatars and virtual agents [de Melo et al. 2018].

11.5 Toward Prosociality in Populations with SIAs

409

toward the robot was a predictor of the users’ likelihood of offering assistance to the robot. One of the important factors that influence prosociality is the agent’s own behavior during the interaction. This is particularly relevant in scenarios with social dilemmas where the interactions seem to require a certain level of reciprocity or interactive consideration of the other player’s strategy [Straßmann et al. 2018]. However, the behavior of an agent after a prosocial (or antisocial) decision can also influence the user’s subsequent actions toward the agent. For example, when the agent does not cooperate, studies have found that negative responses (decreased trust) from the user can be diminished when the virtual agent blushes [Dijk et al. 2011]. Similarly, the non-verbal behavior adopted by the robot can also help the SIA to evoke prosocial behavior. For example, one study demonstrated that receiving a reciprocal hug from a robot might lead individuals to donate more money to charity [Shiomi et al. 2017]. Another study found a tendency for participants to help a robot complete a task more often after the robot introduced itself with a handshake [Avelino et al. 2018]. Some research also suggests that the goal-orientation manifested by the social agent (i.e., cooperative vs. selfish) can have a positive effect on the participant’s own goal orientation, in the context of social dilemmas [de Melo et al. 2013, Kulms et al. 2014]. In particular, people are more likely to exhibit cooperative behavior, guided by the collective profit, when the virtual agent displays similar behavior, whether that expression is objective (e.g., demonstrated or stated through verbal utterances) or subjective (e.g., demonstrated by the agent’s behavior [de Melo et al. 2013]), which falls in line with social psychology predictions about the role of social values orientations (for a review, see Bogaert et al. [2008]). Similarly, agents who display cooperative or prosocial emotions are also more likely to evoke prosocial behavior in social dilemmas, in some cases regardless of the actual game strategy employed by the agent [De Melo et al. 2010]. Some authors suggest that the interdependence of agents’ roles (in this case human and SIA) should also be taken into consideration; however, research in this area is still insufficient to draw conclusions [Vásquez and Weretka 2019].

11.5

Toward Prosociality in Populations with SIAs So far, we have read that specific agent architectures can be handily implemented to create SIAs that interact with humans—through empathic processing and providing empathic responses. Those goals can be achieved, respectively, through empathy mechanisms, modulation, and responses (Section 11.3). We have also pointed out that SIAs can trigger altruism in social contexts, which is evidenced by experiments resorting to social dilemmas and economic games involving humans

410

Chapter 11 Empathy and Prosociality in Social Agents

and agents in both physical and virtual environments (Section 11.4). Social agents’ embodiment, empathy, personality, and emotion expression were determined to positively impact prosociality. Beyond single and short-term interactions (see Chapter 19 on long-term interactions), a question, however, remains: how can empathic SIAs, embedded in dynamic populations of humans and agents, be used to trigger and stabilize long-term prosociality? In this section, we build on the works that previously speculated about the role of SIAs in sustaining prosocial populations of humans and social agents [Paiva et al. 2018]. As defined above, prosocial behavior can be defined as behaviors that intend to benefit one or more people other than the self. Here, we will mainly focus on prosocial behavior that involves a cost to the actor—that is, altruistic cooperation—as these acts are particularly hard to trigger. Hopefully, reducing the costs involved in altruism will further facilitate prosocial action. We will discuss SIAs through the lens of evolutionary game theory [Weibull 1997], emphasizing populations and whether certain behavior can evolve and become evolutionarily stable. Within that framework, we will figure out how SIAs can be used to operationalize several cooperation mechanisms [Nowak 2006] that were singled out to guarantee the stability of altruistic strategies in social dilemmas. We envision scenarios in which SIAs can work as instruments to leverage longterm prosociality. To do that, we will discuss the behaviors (strategies) of the agents and discuss four possible classes of agents, with increasing complexity and given the function that they may play in social interactions: (1) resilient agents; (2) reciprocal agents; (3) information-sharing agents, and (4) emotional-signaling agents. As will be noted, research on SIAs (done in the past 20 years and hopefully in the future) plays a vital role in designing all classes of agents encompassing empathy mechanisms, modulation, and responses, and supported by tools such as automated learning, planning, verbal communication, emotional expression, and emotional recognition. All these domains are likely to fundamentally impact the design of hybrid populations of (prosocial) SIAs.

11.5.1 Resilient Agents One of the positive effects of SIAs can simply accrue from revealing a fixed prosocial behavior over time. An agent that acts systematically the same way, thereby showing the others its prosocial behavior. We call these resilient agents. These are, naturally, some of the simplest agents one can think of. In fact, no sophisticated agent architecture is needed to generate such type of behavior. However, in the context of a population, having a fixed behavior can affect the overall population dynamics in at least two ways. First, a small fraction of prosocial agents may suffice to reach a critical mass of cooperators above which a population can self-organize

11.5 Toward Prosociality in Populations with SIAs

411

toward full cooperation. Second, the existence of agents revealing a fixed prosocial behavior can incentivize others to follow a similar strategy, thus triggering cascades of cooperation through conformist learning or social contagion. Regarding the first point, we shall refer to some previous work that show, precisely, how fixed behavior can trigger long-term prosociality in a population. Pacheco et al. showed that, in a population of adaptive agents, the existence of a small fraction of obstinate cooperators—defined as those who never change their behavior over time—is able to change the evolutionary dynamics of a population toward the coexistence of a majority of cooperators [Pacheco and Santos 2011]. The igniting effect of resilient cooperators can be extended to interactions that entail coordination dynamics. Take the example of situations in which a minimal fraction of cooperators is required to achieve a collective goal—a dilemma said to capture the perils of climate change negotiations or simpler mundane tasks such as taking part in a band or team project [Santos et al. 2020]. In those dilemmas, a minimal fraction of cooperators may facilitate collective success and, as such, provide extra incentives for cooperation. Resilient agents may contribute to reaching such thresholds. In the flavor of simple agents contributing to potentiate human coordination, Shirado and Christakis [2017] showed that simple artificial agents with random behavior, placed in central locations of a social network, can facilitate coordination in human groups. Several contexts where cooperation requires extra incentives, however, may not configure the coordination dilemma that resilient unconditional agents may be suitable to solve. More complex interaction paradigms require more complex agents that make better use of the current capabilities of the SIAs (see below). Regarding the potential effects of resilient agents through social contagion, we mention that imitation was suggested as a relevant enabler of cooperation evolution in humans, leading to cascades of cooperation [Fowler and Christakis 2010]. Experiments by Fowler et al. show that individuals who cooperate tend to influence positively the cooperation level of individuals up to three degrees of separation— that is, a cooperator contributes to increasing the chances that a (1) direct friend, (2) a friend of that friend, and (3) a friend of a friend of a friend also cooperate. This reveals the potential overreaching effect that an agent with a prosocial behavior can have in a networked population. In general, the prospective benefits of resilient prosocial agents is highlighted by works showing that, in human social networks, altruist individuals tend to be connected with altruist neighbors [Leider et al. 2009]. We shall also mention that conformism—that is, adopting the most common behavior in a population—is a form of learning also pointed as fundamental in the evolution of cooperation in human societies [Guzmán et al. 2007]. In situations where humans resort to conformism to adapt their behavior,

412

Chapter 11 Empathy and Prosociality in Social Agents

yet again, observing prosocial agents may increase the chances of behaving altruistically due to the increase of prosocial models to conform with. As far as we know, it remains an open question knowing whether contagious or conformist cooperation from virtual or robotic agents to humans has similar characteristics as those observed in human social networks—for example, three degrees of separation in positive influence—and which social capabilities are required by the former for that purpose. One may question if these agents are actually SIAs, or agents at all. We, however, believe that in this context the term agent and SIA can and should be used to represent the automatic artificial entities that will exist in a society, allowing us to simulate the effects of different behaviors and thus analyze at a macro-level the emergence of prosociality in hybrid societies of humans and technology.

11.5.2 Reciprocal Agents Reciprocal agents introduce a layer of complexity when compared with resilient (unconditional) agents. These agents have memory, are able to recognize their peers’ strategies, and respond accordingly. Reciprocity (namely direct reciprocity) is known as an important cooperation mechanism [Nowak 2006]. Tit-for-tat (TFT) is a prototypical example of strategy that can be used by these agents and sustain high levels of cooperation, as identified by Axelrod and Rapoport in the 1980s [Axelrod and Hamilton 1981]. This strategy postulates that, in the context of repeated altruistic interactions, individuals should start by cooperating and defect after an opponent defects. If a significant number of TFT agents are introduced in a population, cooperation will stabilize [Imhof et al. 2005]. As a result, a certain fraction of artificial agents, with a judicious choice of reciprocal behavior, may render cooperation a stable strategy in a population of humans and agents. More recently, Mao et al. [2017] showed that, in fact, a small fraction of reciprocal agents with a resilient behavior—that cooperate until an opponent defects, always defecting afterwards, a strategy also coined grim trigger—is able to significantly increase cooperation levels in a population at large. Interactions in the real-world are not constrained to pairwise interactions, where agents decide to cooperate or not with a single opponent. In the context of multiplayer interactions, the decision-making principles encapsulated in TFT can be extended to account for information about a distribution of strategies in a group [Pinheiro et al. 2014, Hilbe et al. 2017]. Also, in scenarios where a critical number of prosocial agents is needed for a collective goal to be achieved, reciprocal agents can be employed to sustain cooperation. In this context, reciprocal agents can use information about their own strategy, on top of information about

11.5 Toward Prosociality in Populations with SIAs

413

opponents' previous strategies. Recent work shows that high levels of cooperation and group success can be achieved if agents reciprocate based on their success history and on the strategies anticipated to be played by the others in a group [Santos et al. 2020]. In the context of multiplayer ultimatum games, it was also shown that a small fraction of resilient prosocial agents—which give up their payoff to sustain fair outcomes when sharing a given resource—can significantly alter the dynamics in a population of adaptive agents such that prosocial strategies become stable and prevalent in the long run [Santos et al. 2019]. We foresee research on empathy and prosociality in SIAs being employed in the context of reciprocal agents along at least two lines. First, empathy mechanisms—as introduced in Section 11.3—are required so that agents are able to recognize others' emotions, goals, and intentions. As discussed above, anticipating the intentions of agents in the context of cooperation dilemmas is central to devising conditional strategies that effectively support prosociality. Recent work stresses that individuals' behavior can be anticipated through non-verbal expressions, which allows the anticipation of humans' reactions to negotiation offers [Park et al. 2013]. Naturally, besides being able to recognize the prosocial intentions of their opponents, SIAs can use empathic response channels (also alluded to in Section 11.3) to convey their intentions to humans so that the latter can themselves reciprocate. The use of empathy mechanisms and empathic responses is particularly important in situations where the past behavior of individuals cannot be directly accessed, either because the information is unavailable or unreliable, or because individuals are interacting for the first time with a specific SIA. Second, research on SIAs, again in the area of empathy mechanisms and user modeling, can prove fundamental in developing agents that avoid getting stuck in long defection periods after erroneous moves by humans or other agents. One well-known drawback of TFT, when a population at large is using it, is the inability to recover from an isolated defection move (possibly made by mistake or misinterpreted). If this occurs, an opponent using TFT will also defect, which will lead to a subsequent wave of defections. Alternative strategies, such as win–stay, lose–shift [Nowak and Sigmund 1993] or tit-for-two-tats (TF2T) [Axelrod and Hamilton 1981], were proposed precisely to help solve this drawback—while introducing others, such as, in the case of TF2T, being prone to exploitation by more aggressive strategies. In this realm, SIAs can be used to devise agents that successfully convey their real intentions, thus avoiding loops of defection after a mistake. Likewise, an SIA can be used to recognize errors by other humans or agents, thus being forgiving while remaining non-exploitable. In this context, research on having artificial agents justify their erroneous moves [Correia et al. 2018a] may provide important advances.
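
For readers who prefer a concrete formulation, the sketch below, an illustrative implementation rather than code from the works cited above, encodes the reciprocal strategies named in this section (TFT, grim trigger, TF2T, and win–stay, lose–shift) in an iterated prisoner's dilemma with optional execution noise; the payoff values are standard illustrative choices.

import random

R, S, T, P = 3, 0, 5, 1   # reward, sucker, temptation, punishment (T > R > P > S)
PAYOFFS = {('C', 'C'): (R, R), ('C', 'D'): (S, T),
           ('D', 'C'): (T, S), ('D', 'D'): (P, P)}

def tit_for_tat(my_hist, opp_hist):
    return 'C' if not opp_hist else opp_hist[-1]

def grim_trigger(my_hist, opp_hist):
    # Cooperate until the opponent defects once, then defect forever.
    return 'D' if 'D' in opp_hist else 'C'

def tit_for_two_tats(my_hist, opp_hist):
    # More forgiving: defect only after two consecutive defections.
    return 'D' if opp_hist[-2:] == ['D', 'D'] else 'C'

def win_stay_lose_shift(my_hist, opp_hist):
    if not my_hist:
        return 'C'
    last_payoff = PAYOFFS[(my_hist[-1], opp_hist[-1])][0]
    # Repeat the last move after a "win" (R or T); switch after a "loss" (S or P).
    if last_payoff in (R, T):
        return my_hist[-1]
    return 'C' if my_hist[-1] == 'D' else 'D'

def play(strategy_a, strategy_b, rounds=200, noise=0.05, seed=0):
    """Play an iterated prisoner's dilemma; `noise` flips player A's intended
    move with a small probability, modeling an erroneous or misread move."""
    rng = random.Random(seed)
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        if rng.random() < noise:
            move_a = 'D' if move_a == 'C' else 'C'
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

# With a little noise, two TFT players fall into cycles of retaliation after a
# mistaken defection, whereas the more forgiving TF2T absorbs the error, which
# is the drawback (and remedy) discussed in the text.
print('TFT  vs. TFT :', play(tit_for_tat, tit_for_tat))
print('TF2T vs. TF2T:', play(tit_for_two_tats, tit_for_two_tats))
print('WSLS vs. WSLS:', play(win_stay_lose_shift, win_stay_lose_shift))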


11.5.3 Information-sharing Agents

Increasing, once again, the complexity of the considered agents, we foresee potential benefits in employing SIAs that can, in their interactions with humans, remember others in a population, recognize interactions, and share information about the interacting individuals. We call these information-sharing agents. The information obtained and shared can be the result of internal mechanisms as discussed previously. In the context of hybrid populations of humans and artificial agents, such SIAs can handily be used in the context of reputation systems [Resnick et al. 2000, Sabater and Sierra 2005] and indirect reciprocity [Nowak and Sigmund 1998], particularly in situations where it is costly for humans to share such information about others [Santos et al. 2018a]. Systems of indirect reciprocity occur when individuals condition their actions on what peers did to others in the past. The simplest indirect reciprocity systems build upon the concept of image score [Nowak and Sigmund 1998]: individuals who cooperate gain a positive reputation, and others then use that reputational uplift to cooperate back. We shall note that these simple reputation systems support a myriad of e-commerce and sharing-economy platforms. More complex indirect reciprocity systems—in which altruistic cooperation can become prevalent over time—consider that, for example, reputations are attributed based on the reputation of the cooperating individual and the historic reputation of the individual being helped [Santos et al. 2018b]. In this context, an SIA would need to keep a record of the interacting individuals' reputations, identify the valence of the employed action, and attribute a new reputation based on a given reputation update rule (for example, a rule stating that if an individual with a bad reputation helps an opponent with a good reputation, the helping individual deserves to recover a good reputation). After this process, agents would need to share the new information about the observed individuals with other SIAs or, potentially, humans. If information about interacting individuals is not readily available or can only be accessed with a high degree of noise, SIAs might be called upon to identify the intentions of individuals through emotion recognition technologies. Likewise, the reputations of interacting individuals might be communicated through numerical scores, natural language, and also emotion expression (e.g., revealing an angry face whenever an individual deserves a bad reputation [de Melo et al. 2021]). An insightful connection between empathy and the evolution of cooperation was recently suggested, again in the context of indirect reciprocity systems [Radzvilavicius et al. 2019]. In these systems, one rule to attribute reputations—named stern-judging [Pacheco et al. 2006]—was shown to combine simplicity with high cooperation levels, guaranteeing the stability of altruism in populations with
different sizes and composed of agents with simple cognitive abilities [Santos et al. 2018b]. Stern-judging states that agents should be considered good when they cooperate with a good opponent or defect against a bad opponent; everything else is considered bad. While this norm promotes high levels of cooperation whenever reputations spread fast in a population and become public (e.g., through gossip), there are some drawbacks when considering private reputations. In particular, two individuals may disagree on how they regard a third peer, and these incongruities can hamper cooperation when stern-judging is the reputation rule prevailing in a population [Hilbe et al. 2018]. Radzvilavicius et al. showed that, in this case, cooperation requires empathic individuals. In Radzvilavicius et al. [2019], the authors suggest that, when individual A judges the behavior of individual B (after B plays against a third individual, C), A can use information in an empathic or egocentric fashion: agent A is empathic when she places herself in the position of B, taking into account the intentions of B in order to judge her behavior; in this regard, even if A and B have different opinions about C, A will use the information that B had. On the other hand, agent A is egocentric when judging B without considering that B may have an opinion on C that differs from A's own. Radzvilavicius et al. show, mathematically, how empathy can open new routes for the stability of prosociality in populations of adaptive agents. Yet again, tools developed to build SIAs, specifically concerning empathy mechanisms, can be handily used to create agents that judge others and share information in an empathic way on a large scale, thus rendering altruism stable in a population. In particular, an open question in this context relates to understanding which mechanisms enable individuals to know how another agent's reputation is perceived by others [Masuda and Santos 2019]. We believe that solutions to this challenge may be inspired by the emotional expression and communication that characterize SIAs. On the other hand, many of the techniques and tools used to model large populations of disembodied agents can also influence the design and creation of SIAs.
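
As a concrete illustration of the stern-judging rule and the empathic versus egocentric judgment just described, the following minimal sketch (an assumption-laden toy example, not code from the cited papers) keeps private reputations as per-observer opinions and shows how the two judgment modes can reach opposite verdicts when the observer and the donor disagree about a third party.

GOOD, BAD = True, False

def stern_judging(recipient_is_good, donor_cooperated):
    """Stern-judging: a donor is good if she cooperates with a good recipient
    or defects against a bad one; any other combination makes her bad."""
    return GOOD if donor_cooperated == recipient_is_good else BAD

def judge(views, observer, donor, recipient, donor_cooperated, empathic):
    """The observer updates her private opinion of the donor after the donor
    acts toward the recipient.

    Egocentric: the observer evaluates the act using her own opinion of the
    recipient.  Empathic: she uses the donor's opinion of the recipient
    (perspective taking).  views[x][y] is x's current opinion of y.
    """
    recipient_reputation = (views[donor][recipient] if empathic
                            else views[observer][recipient])
    views[observer][donor] = stern_judging(recipient_reputation, donor_cooperated)

# Toy example: A thinks C is good, B thinks C is bad, and B defects against C.
views = {'A': {'B': GOOD, 'C': GOOD},
         'B': {'A': GOOD, 'C': BAD}}

judge(views, observer='A', donor='B', recipient='C',
      donor_cooperated=False, empathic=False)
print('egocentric verdict on B:', 'good' if views['A']['B'] else 'bad')  # -> bad

views['A']['B'] = GOOD  # reset A's opinion of B before the second judgment
judge(views, observer='A', donor='B', recipient='C',
      donor_cooperated=False, empathic=True)
print('empathic verdict on B:', 'good' if views['A']['B'] else 'bad')    # -> good

Under the egocentric rule, A condemns B for defecting against someone A considers good; under the empathic rule, A adopts B's view of C and judges the same act as justified, which is the mechanism Radzvilavicius et al. [2019] show can restore cooperation under private reputations.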

11.5.4 Emotion-signaling Agents

Finally, and for the purpose of simulating these societies of agents and humans, we foresee potential benefits in designing emotional or social-signaling agents. These are agents that, on top of having the ability to discriminate based on pre-play signals of their opponents, are themselves able to communicate their intentions before (and after) an interaction, resorting to social signals such as emotional expression. First, let us introduce the relation between signaling, economic games, and (altruistic) prosociality. In coordination games with multiple equilibria, arbitrary
signals can disrupt the equilibrium of payoff-inferior strategies—that is, strategies that constitute a stable equilibrium yet lead to lower payoffs than other stable strategies [Robson 1990]. Let us say that strategy A forms a payoff-inferior equilibrium and strategy B forms a payoff-superior equilibrium. In a population fully composed of individuals playing strategy A or strategy B, no mutant strategy can invade and fixate (respectively, mutant strategy B or mutant strategy A). This means that strategy A can be stable despite the fact that it leads to lower payoffs than B. The stability of strategy A can, however, be disrupted by arbitrary signals. This can occur through so-called secret handshakes: we can conceive of a third strategy, C, that develops a signal recognized only by other individuals adopting C; this strategy behaves as strategy B whenever it encounters someone also signaling and behaves as strategy A otherwise. C will invade a population fully composed of (payoff-inferior) strategy A, thus leading to payoff-superior outcomes. Signaling suggests an idyllic scenario in coordination games. One can then speculate whether the same mechanisms can be used in cooperation scenarios, assuming that cooperators can signal their cooperative intentions (for example, by smiling) and only cooperate with those that also signal. In the case of altruistic cooperation—which leads to prisoner's dilemma-type interactions—there is a catch, however: defectors, that is, those refusing to take prosocial (altruistic) actions, can learn how to fake the signals sustaining cooperation. A population fully composed of cooperators that emit an arbitrary signal before playing, and only cooperate with those using the same signal, can be easily exploited by defectors that use the same signal as cooperators. The interrelation of signaling and cooperation has long been known [Robson 1990]. More recent models, however, show that even if cooperation cannot be stabilized through signaling, the possibility that multiple signals can be used allows cooperation to still become prevalent over time [Santos et al. 2011]. The more signals available, the better, and SIAs are indeed agents that use social signals to communicate. The advantages and limitations of signaling in sustaining cooperation within populations can again illuminate, in our opinion, some future applications of SIAs. As mentioned before, emotion expression (see Section 11.3.4) can be conceived of as a sophisticated form of pre-play signaling. Thus, on the one hand, emotion expression can disrupt defective equilibria and trigger the evolution of prosocial (altruistic) actions. On the other hand, novel emotion recognition tools can allow the implementation of improved ways of anticipating the trustworthiness of expressed signals [Lucas et al. 2016], which can contribute to alleviating the biggest peril of signaling: the possibility that malicious agents fake signals and exploit cooperators.
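
The secret-handshake argument can be made concrete with a small sketch. The following toy example is our own illustration, with made-up payoff values: it shows a handshake strategy invading a population stuck at the payoff-inferior equilibrium of a coordination game. As noted above, the same trick fails in a prisoner's dilemma because defectors can fake the signal.

# Coordination game: both play A -> 2 each (payoff-inferior equilibrium);
# both play B -> 4 each (payoff-superior equilibrium); a mismatch pays 0.
PAYOFF = {('A', 'A'): 2, ('B', 'B'): 4, ('A', 'B'): 0, ('B', 'A'): 0}

def resident(opponent_signals):
    """Resident strategy: never signals and always plays the established action A."""
    return 'A'

def handshake(opponent_signals):
    """Secret-handshake strategy: emits a signal, plays the superior action B
    only when the opponent also signals, and falls back to A otherwise."""
    return 'B' if opponent_signals else 'A'

def average_payoff(me, my_signal, population, signals):
    """Average payoff of `me` against every (strategy, signal) pair in the population."""
    total = 0
    for other, other_signal in zip(population, signals):
        my_move = me(other_signal)
        other_move = other(my_signal)
        total += PAYOFF[(my_move, other_move)]
    return total / len(population)

# A population of residents with a few handshake mutants.
population = [resident] * 90 + [handshake] * 10
signals = [False] * 90 + [True] * 10

print('resident payoff :', average_payoff(resident, False, population, signals))   # 2.0
print('handshake payoff:', average_payoff(handshake, True, population, signals))   # 2.2
# Handshakers do as well as residents against residents (both end up playing A)
# and strictly better against each other (both play B), so they can invade,
# which is exactly the dynamic sketched above.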


11.5.5 Environment and Networks

So far, we have discussed the potential role of SIAs at different levels without placing much emphasis on the characteristics of the environment in which human–agent interactions take place. In Section 11.4 we discussed the types of scenarios in which these agents can be used. At the level of populations, we should further consider one particular set of important characteristics related to the network topology in which individuals interact. At the population level, people do not interact with everyone; they are arranged in networks. Some networks, particularly those in which individuals are highly heterogeneous in the number of contacts they have, have been shown to facilitate the evolution of cooperation [Santos and Pacheco 2005]. The node diversity implied by these networks opens up the possibility of thinking not only about designing SIAs that take it into account, but also about where to place them in a network—in particular, if future networks comprise both SIAs and humans in a hybrid population. Future approaches may combine particular SIA architectures with knowledge about the centrality of the network positions where those agents should be deployed.
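
As a concrete, deliberately simplified illustration of this last point, the sketch below uses the networkx library to pick candidate SIA positions by centrality in a synthetic scale-free network; the graph model, the centrality measure, and the number of agents are illustrative assumptions rather than recommendations from the literature.

import networkx as nx

def sia_placement(graph, n_agents=3, measure=nx.betweenness_centrality):
    """Return the n_agents most central nodes as candidate SIA positions."""
    scores = measure(graph)
    return sorted(scores, key=scores.get, reverse=True)[:n_agents]

# A synthetic scale-free network: the heterogeneous-degree topology mentioned above.
social_network = nx.barabasi_albert_graph(n=100, m=2, seed=42)

print('by betweenness:', sia_placement(social_network))
print('by degree     :', sia_placement(social_network, measure=nx.degree_centrality))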

11.6 Summary and Current Challenges

In this chapter, we explored how to create SIAs that not only exhibit empathy but also evoke empathy from others and, as a consequence, foster prosocial behavior. Before investigating the approaches to building such agents, we started by clarifying some of the major concepts in empathic and prosocial agents and established a framework for thinking about, studying, and engineering these agents. This framework allows reasoning about the key elements that agents should include to behave prosocially, or to trigger prosociality in interaction groups or in populations at large. Then, we analyzed SIAs at three different levels. At the micro-level, we discussed the computational mechanisms that are needed to build empathic and prosocial SIAs. We reviewed models and architectures that agents can include to be prosocial, and we elaborated on the different approaches that have been taken by researchers in the field. We then delved more deeply into how empathic and prosocial SIAs interact with humans and the effects that such types of interactions have. Finally, we investigated how these SIAs can be embedded into a large society and explored empathy and prosociality at that macro, societal level. We believe that this last step is critical in bringing humans and agents together in large hybrid groups, and that studying empathy and prosociality in such agents will allow us to face particular societal challenges, such as inequality, tribalism, and sustainability. Naturally, several challenges lie ahead on the route to (1) designing, (2) deploying, and (3) evaluating SIAs that promote prosociality in the real world.


There are natural challenges associated with the deployment of such technologies. Foremost, individuals may have concerns about being influenced by technological artefacts. One may well remember the Facebook emotional contagion experiment [Kramer et al. 2014] and the ethical debates it prompted [Fiske and Hauser 2014, Verma 2014]. In fact, nudging individuals toward more prosocial behaviors should be done under high standards of transparency and with attention to privacy concerns. Experiments of this kind should be conducted under the careful guidance of institutional review boards, and users should be allowed to opt out of using the suggested technologies. One should also be clear that SIAs intended to promote prosociality could be tuned to trigger harmful behaviors; in that case, this technology should allow for mechanisms that limit its influence. At the same time, there are several challenges related to designing and deploying SIAs to trigger prosociality in specific contexts. While we here discuss the potential role of SIAs in generic situations, we foresee that specific scenarios may call for specific details of SIAs to be tuned. The diversity of situations, especially real-world problems, to which we envision prosocial SIAs being applied is fascinating. We can think of agents having a real impact on our society: agents that sustain environmentally friendly behaviors, that support diversity and helping behaviors toward out-group members, or that promote sharing actions mitigating the effects of inequality. As an extra example, SIAs can be used to nudge individuals toward adopting responsible behaviors in the midst of the current COVID-19 pandemic. Wearing a mask, keeping social distance, refraining from hoarding, avoiding tempting yet crowded places, and helping vulnerable people in risk groups are all behaviors that, even if they present a small cost to oneself, are a necessary step toward collective success. SIAs can be used to highlight how such behaviors can become effective (stressing the collective benefits they lead to) or to highlight how others may benefit from them (nurturing empathic concerns in users). In these particular scenarios, one can foresee immense challenges: To start with, how to incentivize individuals to interact, in the first place, with such SIAs? Second, how to design and deploy SIAs in a timely fashion that makes this technology useful while, at the same time, guaranteeing that sufficient effort has been devoted to designing effective agents? How to know which users should be targeted first in order to achieve fast and actually beneficial outcomes—again, in a situation as time-sensitive as an ongoing pandemic? How to incentivize mass usage while making sure that the privacy of each individual is protected? How to deploy effective SIAs that comply with international regulations on data protection? These are natural societal challenges—beyond the technical ones—that may lie ahead when deploying prosocial SIAs.


For centuries, the investigation into human nature has tried to answer whether humans are primarily good or bad. Fortunately, even though human nature is guided largely by self-serving motivations, it is also known that we help each other at our own cost [Paiva et al. 2018]. Empathic and prosocial SIAs can leverage this characteristic of human nature to foster prosociality in groups and societies, thus contributing to the establishment of the area of prosocial computing [Paiva et al. 2018].

Acknowledgments

This work was supported by FCT scholarships (SFRH/BD/118031/2016 and PD/BD/150570/2020), the AGENTS Project (CMU/TIC/0055/2019), TAILOR project (H2020-ICT-48-2020/952215), and HumanE-AI-Net Project (H2020-ICT-48-2020/952026). Fernando Santos acknowledges support from the James S. McDonnell Foundation Postdoctoral Fellowship Award. Ana Paiva is the Katherine Hampson Bessell Fellow of the Radcliffe Institute for Advanced Study at Harvard University, and has been partially funded by the fellowship program.

References P. Alves-Oliveira, P. Sequeira, F. S. Melo, G. Castellano, and A. Paiva. 2019. Empathic robot for group learning: A field study. ACM Trans. Hum.-Rob. Interact. (THRI) 8, 1, 1–34. DOI: https://doi.org/10.1145/3300188. C. A. Anderson and B. J. Bushman. 2002. Human aggression. Annu. Rev. Psychol. 53, 27–51. DOI: https://doi.org/10.1146/annurev.psych.53.100901.135231. K. Anderson, E. André, T. Baur, S. Bernardini, M. Chollet, E. Chryssafidou, I. Damian, C. Ennis, A. Egges, P. Gebhard, H. Jones, M. Ochs, C. Pelachaud, K. Porayska-Pomsta, P. Rizzo, and N. Sabouret. 2013. The TARDIS framework: Intelligent virtual agents for social coaching in job interviews. In International Conference on Advances in Computer Entertainment Technology. Springer, 476–491. DOI: https://doi.org/10.1007/978-3-31903161-3_35. M. Asada. 2015. Towards artificial empathy. Int. J. Soc. Rob. 7, 1, 19–33. DOI: https://doi.org/ 10.1007/s12369-014-0253-z. J. Avelino, F. Correia, J. Catarino, P. Ribeiro, P. Moreno, A. Bernardino, and A. Paiva. 2018. The power of a hand-shake in human–robot interactions. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 1864–1869. DOI: https://doi.org/ 10.1109/IROS.2018.8593980. R. Axelrod and W. D. Hamilton. 1981. The evolution of cooperation. Science 211, 4489, 1390– 1396. DOI: https://doi.org/10.1126/science.7466396.


R. S. Aylett, S. Louchart, J. Dias, A. Paiva, and M. Vala. 2005. FearNot!—An experiment in emergent narrative. In International Workshop on Intelligent Virtual Agents. Springer, 305– 316. DOI: https://doi.org/10.1007/11550617_26. C. P. Barlett and C. A. Anderson. 2012. Examining media effects: The general aggression and general learning models. In The International Encyclopedia of Media Studies. DOI: https://doi.org/10.1002/9781444361506.wbiems110. C. D. Batson. 2011. Altruism in Humans. Oxford University Press. DOI: https://doi.org/ 10.1093/acprof:oso/9780195341065.001.0001. C. D. Batson. 2014. The Altruism Question: Toward a Social-Psychological Answer. Psychology Press. DOI: https://doi.org/10.4324/9781315808048.s C. D. Batson and N. Ahmad. 2001. Empathy-induced altruism in a prisoner’s dilemma. II: What if the target of empathy has defected? Eur. J. Soc. Psychol. 31, 1, 25–36. DOI: https: //doi.org/10.1002/ejsp.26. C. D. Batson, J. G. Batson, J. K. Slingsby, K. L. Harrell, H. M. Peekna, and R. M. Todd. 1991. Empathic joy and the empathy–altruism hypothesis. J. Pers. Soc. Psychol. 61, 3, 413–426. DOI: https://doi.org/10.1037/0022-3514.61.3.413. C. D. Batson, J. G. Batson, R. M. Todd, B. H. Brummett, L. L. Shaw, and C. M. Aldeguer. 1995. Empathy and the collective good: Caring for one of the others in a social dilemma. J. Pers. Soc. Psychol. 68, 4, 619–631. DOI: https://doi.org/10.1037/0022-3514.68.4.619. C. D. Batson, D. A. Lishner, and E. L. Stocks. 2015. The empathy–altruism hypothesis. In The Oxford Handbook of Prosocial Behavior. 259–268. DOI: https://doi.org/10.1093/ oxfordhb/9780195399813.013.023. D. J. Baumann, R. B. Cialdini, and D. T. Kendrick. 1981. Altruism as hedonism: Helping and self-gratification as equivalent responses. J. Pers. Soc. Psychol. 40, 6, 1039–1046. DOI: https://doi.org/10.1037/0022-3514.40.6.1039. C. Becker, H. Prendinger, M. Ishizuka, and I. Wachsmuth. 2005. Evaluating affective feedback of the 3D agent Max in a competitive cards game. In International Conference on Affective Computing and Intelligent Interaction. Springer, 466–473. DOI: https://doi.org/10. 1007/11573548_60. J. Berg, J. Dickhaut, and K. McCabe. 1995. Trust, reciprocity, and social history. Games Econ. Behav. 10, 1, 122–142. DOI: https://doi.org/10.1006/game.1995.1027. T. W. Bickmore and R. W. Picard. 2005. Establishing and maintaining long-term human– computer relationships. ACM Trans. Comput.-Hum. Interact. (TOCHI). 12, 2, 293–327. DOI: https://doi.org/10.1145/1067860.1067867. P. Bloom. 2017. Against Empathy: The Case for Rational Compassion. Random House. S. Bogaert, C. Boone, and C. Declerck. 2008. Social value orientation and cooperation in social dilemmas: A review and conceptual model. Br. J. Soc. Psychol. 47, 3, 453–480. DOI: https://doi.org/10.1348/014466607X244970. F. Borgonovi. 2008. Doing well by doing good. The relationship between formal volunteering and self-reported health and happiness. Soc. Sci. Med. 66, 11, 2321–2334. DOI: https: //doi.org/10.1016/j.socscimed.2008.01.011.


H. Boukricha, I. Wachsmuth, M. N. Carminati, and P. Knoeferle. 2013. A computational model of empathy: Empirical evaluation. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. IEEE, 1–6. DOI: https://doi.org/10.1109/AC II.2013.7. S. Brave, C. Nass, and K. Hutchinson. 2005. Computers that care: Investigating the effects of orientation of emotion exhibited by an embodied computer agent. In. J. Hum.Comput. Stud. 62, 2, 161–178. DOI: https://doi.org/10.1016/j.ijhcs.2004.11.002. J. L. Brown. 1978. Avian communal breeding systems. Annu. Rev. Ecol. Syst. 9, 1, 123–155. DOI: https://doi.org/10.1146/annurev.es.09.110178.001011. S. L. Brown, R. M. Nesse, A. D. Vinokur, and D. M. Smith. 2003. Providing social support may be more beneficial than receiving it: Results from a prospective study of mortality. Psychol. Sci. 14, 4, 320–327. DOI: https://doi.org/10.1111/1467-9280.14461. K. E. Buckley and C. A. Anderson. 2006. A theoretical model of the effects and consequences of playing video games. In Playing Video Games: Motives, Responses, and Consequences. 363–378. P. Campos-Mercade, A. Meier, F. Schneider, and E. Wengström. 2021. Prosociality predicts health behaviors during the Covid-19 pandemic. J. Public Econ. 195, 104367. DOI: https: //doi.org/10.1016/j.jpubeco.2021.104367. C. S. Carter, I. B.-A. Bartal, and E. C. Porges. 2017. The roots of compassion: An evolutionary and neurobiological perspective. In The Oxford Handbook of Compassion Science. 173. DOI: https://doi.org/10.1093/oxfordhb/9780190464684.013.14. G. Castellano, A. Paiva, A. Kappas, R. Aylett, H. Hastie, W. Barendregt, F. Nabais, and S. Bull. 2013. Towards empathic virtual and robotic tutors. In International Conference on Artificial Intelligence in Education. Springer, 733–736. DOI: https://doi.org/10.1007/978-3642-39112-5_100. F. Correia, C. Guerra, S. Mascarenhas, F. S. Melo, and A. Paiva. 2018a. Exploring the impact of fault justification in human–robot trust. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 507–513. F. Correia, S. Mascarenhas, R. Prada, F. S. Melo, and A. Paiva. 2018b. Group-based emotions in teams of humans and robots. In Proceedings of the 2018 ACM/IEEE International Conference on Human–Robot Interaction. 261–269. DOI: https://doi.org/10.1145/3171221.3171252. F. Correia, S. Gomes, S. Mascarenhas, F. S. Melo, and A. Paiva. 2020. The dark side of embodiment: Teaming up with robots vs disembodied agents. In Proceedings of the Robotics Science and Systems RSS’2020. DOI: https://doi.org/10.15607/RSS.2020.XVI.010. A. Costantini, A. Scalco, R. Sartori, E. M. Tur, and A. Ceschi. 2019. Theories for computing prosocial behavior. Nonlinear Dynamics Psychol. Life Sci. 23, 297–313. S. M. Coyne, L. M. Padilla-Walker, H. G. Holmgren, E. J. Davis, K. M. Collier, M. K. Memmott-Elison, and A. J. Hawkins. 2018. A meta-analysis of prosocial media on prosocial behavior, aggression, and empathic concern: A multidimensional approach. Dev. Psychol. 54, 2, 331–347. DOI: https://doi.org/10.1037/dev0000412.


J. Crocker, A. Canevello, and A. A. Brown. 2017. Social motivation: Costs and benefits of selfishness and otherishness. Annu. Rev. Psychol. 68, 299–325. DOI: https://doi.org/10. 1146/annurev-psych-010416-044145. M. H. Davis. 2018. Empathy: A Social Psychological Approach. Routledge. R. M. Dawes. 1980. Social dilemmas. Annu. Rev. Psychol. 31, 1, 169–193. DOI: https://doi.org/ 10.1146/annurev.ps.31.020180.001125. R. de Kleijn, L. van Es, G. Kachergis, and B. Hommel. 2019. Anthropomorphization of artificial agents leads to fair and strategic, but not altruistic behavior. Int. J. Hum.-Comput. Stud. 122, 168–173. DOI: https://doi.org/doi:10.1016/j.ijhcs.2018.09.008. C. M. de Melo, L. Zheng, and J. Gratch. 2009. Expression of moral emotions in cooperating agents. In International Workshop on Intelligent Virtual Agents. Springer, 301–307. DOI: https://doi.org/10.1007/978-3-642-04380-2_32. C. M. de Melo, P. Carnevale, and J. Gratch. 2010. The influence of emotions in embodied agents on human decision-making. In International Conference on Intelligent Virtual Agents. Springer, 357–370. DOI: https://doi.org/10.1007/978-3-642-15892-6_38. C. de Melo, P. Carnevale, and J. Gratch. 2013. People’s biased decisions to trust and cooperate with agents that express emotions. In Proc. AAMAS. C. M. de Melo, P. Khooshabeh, O. Amir, and J. Gratch. 2018. Shaping cooperation between humans and agents with emotion expressions and framing. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2224–2226. C. M. de Melo, K. Terada, and F. C. Santos. 2021. Emotion expressions shape human social norms and reputations. iScience 24, 3, 102141. DOI: https://doi.org/10.1016/j.isci.2021. 102141. F. de Vignemont and T. Singer. 2006. The empathic brain: How, when and why? Trends Cogn. Sci. 10, 10, 435–441. DOI: https://doi.org/10.1016/j.tics.2006.08.008. F. B. de Waal. 2007. The ‘Russian doll’ model of empathy and imitation. In On Being Moved: From Mirror Neurons to Empathy. 35–48. DOI: https://doi.org/10.1075/aicr.68.06waa. F. B. de Waal. 2008. Putting the altruism back into altruism: The evolution of empathy. Annu. Rev. Psychol. 59, 279–300. DOI: https://doi.org/10.1146/annurev.psych.59.103006. 093625. C. Dijk, B. Koenig, T. Ketelaar, and P. J. de Jong. 2011. Saved by the blush: Being trusted despite defecting. Emotion 11, 2, 313–319. DOI: https://doi.org/10.1037/a0022774. E. W. Dunn, L. B. Aknin, and M. I. Norton. 2014. Prosocial spending and happiness: Using money to benefit others pays off. Curr. Dir. Psychol. Sci. 23, 1, 41–47. DOI: https://doi.org/ 10.1177/0963721413512503. N. Eisenberg and T. L. Spinrad. 2014. Multidimensionality of prosocial behavior: Rethinking the conceptualization and development of prosocial behavior. 17–39. DOI: https:// doi.org/10.1093/acprof:oso/9780199964772.001.0001.


N. Eisenberg, S. K. VanSchyndel, and T. L. Spinrad. 2016. Prosocial motivation: Inferences from an opaque body of work. Child Dev. 87, 6, 1668–1678. DOI: https://doi.org/10.1111/cd ev.12638. P. Ekman. 1999. Basic emotions. In T. Dalgleish and M. J. Power (Eds.), Handbook of Cognition and Emotion. John Wiley, 98, 45–60, 16. E. Fehr and U. Fischbacher. 2003. The nature of human altruism. Nature 425, 6960, 785–791. DOI: https://doi.org/10.1038/nature02043. P. C. Ferreira, A. V. Simão, A. Paiva, and A. Ferreira. 2020. Responsive bystander behaviour in cyberbullying: A path through self-efficacy. Behav. Inform. Technol. 39, 5, 511–524. DOI: https://doi.org/10.1080/0144929X.2019.1602671. P. C. Ferreira, A. Simão, A. Paiva, C. Martinho, R. Prada, A. Ferreira, and F. Santos. 2021. Exploring empathy in cyberbullying with serious games. Comput. Educ. 166, 104155. DOI: https://doi.org/10.1016/j.compedu.2021.104155. S. T. Fiske and R. M. Hauser, 2014. Protecting human research participants in the age of big data. Proc. Natl. Acad. Sci. U. S. A. 111, 38, 13675–13676. DOI: https://doi.org/10.1073/pnas .1414626111. J. H. Fowler and N. A. Christakis. 2010. Cooperative behavior cascades in human social networks. Proc. Natl. Acad. Sci. U. S. A. 107, 12, 5334–5338. DOI: https://doi.org/10.1073/pnas .0913149107. A. D. Galinsky, W. W. Maddux, D. Gilin, and J. B. White. 2008. Why it pays to get inside the head of your opponent: The differential effects of perspective taking and empathy in negotiations. Psychol. Sci. 19, 4, 378–384. DOI: https://doi.org/10.1111/j.1467-9280.2008. 02096.x. L. Gamberini, L. Chittaro, A. Spagnolli, and C. Carlesso. 2015. Psychological response to an emergency in virtual reality: Effects of victim ethnicity and emergency type on helping behavior and navigation. Comput. Hum. Behav. 48, 104–113. DOI: http://dx.doi.org/10. 1016/j.chb.2015.01.040. D. A. Gentile, C. A. Anderson, S. Yukawa, N. Ihori, M. Saleem, L. K. Ming, A. Shibuya, A. K. Liau, A. Khoo, B. J. Bushman, L. R. Huesmann, and A. Sakamoto. 2009. The effects of prosocial video games on prosocial behaviors: International evidence from correlational, longitudinal, and experimental studies. Pers. Soc. Psychol. Bul. 35, 6, 752–763. DOI: http://dx.doi.org/10.1177/0146167209333045. D. F. Glas, T. Minato, C. T. Ishi, T. Kawahara, and H. Ishiguro. 2016. ERICA: The ERATO intelligent conversational android. In 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 22–29. DOI: https://doi.org/10. 1109/ROMAN.2016.7745086. C. T. Gloria and M. A. Steinhardt. 2016. Relationships among positive emotions, coping, resilience and mental health. Stress Health. 32, 2, 145–156. DOI: https://doi.org/10.1002/ smi.2589. A. P. Goldstein, B. Glick, M. Carthan, and D. Blancero. 1994. The Prosocial Gang: Implementing Aggression Replacement Training. ERIC.


B. Gonsior, S. Sosnowski, C. Mayer, J. Blume, B. Radig, D. Wollherr, and K. Kühnlenz. 2011. Improving aspects of empathy and subjective performance for HRI through mirroring facial expressions. In 2011 RO-MAN. IEEE, 350–356. DOI: https://doi.org/10.1109/ROMAN. 2011.6005294. N. M. Gotts, J. G. Polhill, and A. N. R. Law. 2003. Agent-based simulation in the study of social dilemmas. Artif. Intell. Rev. 19, 1, 3–92. DOI: https://doi.org/10.1023/A: 1022120928602. W. G. Graziano, M. M. Habashi, B. E. Sheese, and R. M. Tobin. 2007. Agreeableness, empathy, and helping: A person × situation perspective. J. Pers. Soc. Psychol. 93, 4, 583. DOI: https://doi.org/10.1037/0022-3514.93.4.583. T. Greitemeyer and D. O. Mügge. 2014. Video games do affect social outcomes: A metaanalytic review of the effects of violent and prosocial video game play. Pers. Soc. Psychol. Bull. 40, 5, 578–589. DOI: https://doi.org/10.1177/0146167213520459. R. A. Guzmán, C. Rodríguez-Sickert, and R. Rowthorn. 2007. When in Rome, do as the Romans do: The coevolution of altruistic punishment, conformist learning, and cooperation. Evol. Hum. Behav. 28, 2, 112–117. DOI: https://doi.org/10.1016/j.evolhumbehav.2006. 08.002. M. M. Habashi, W. G. Graziano, and A. E. Hoover. 2016. Searching for the prosocial personality: A big five approach to linking personality and prosocial behavior. Pers. Soc. Psychol. Bull. 42, 9, 1177–1192. DOI: https://doi.org/10.1177/0146167216652859. W. D. Hamilton. 1964. The genetical evolution of social behaviour. II. J. Theor. Biol. 7, 1, 17– 52. DOI: https://doi.org/10.1016/0022-5193(64)90039-6. W. D. Hamilton. 1972. Altruism and related phenomena, mainly in social insects. Annu. Rev. Ecol. Syst. 3, 1, 193–232. DOI: https://doi.org/10.1146/annurev.es.03.110172.001205. B. Hayes, D. Ullman, E. Alexander, C. Bank, and B. Scassellati. 2014. People help robots who help others, not robots who help themselves. In The 23rd IEEE International Symposium on Robot and Human Interactive Communication. IEEE, 255–260. DOI: https://doi.org/ 10.1109/ROMAN.2014.6926262. C. Hilbe, L. A. Martinez-Vaquero, K. Chatterjee, and M. A. Nowak. 2017. Memory-n strategies of direct reciprocity. Proc. Natl. Acad. Sci. U. S. A. 114, 18, 4715–4720. DOI: https://doi. org/10.1073/pnas.1621239114. C. Hilbe, L. Schmid, J. Tkadlec, K. Chatterjee, and M. A. Nowak. 2018. Indirect reciprocity with private, noisy, and incomplete information. Proc. Natl. Acad. Sci. U. S. A. 115, 48, 12241–12246. DOI: https://doi.org/10.1073/pnas.1810565115. B. E. Hilbig, A. Glöckner, and I. Zettler. 2014. Personality and prosocial behavior: Linking basic traits and social value orientations. J. Pers. Soc. Psychol. 107, 3, 529. DOI: https://do i.org/10.1037/a0036074. M. L. Hoffman. 2001. Empathy and Moral Development: Implications for Caring and Justice. Cambridge University Press. E. N. M. Ibrahim and C. S. Ang. 2018. Communicating empathy: Can technology intervention promote pro-social behavior? Review and perspectives. Adv. Sci. Lett. 24, 3, 1643– 1646. DOI: https://doi.org/10.1166/asl.2018.11127.


M. Imai and M. Narumi. 2004. Robot behavior for encouraging immersion in interaction. In Proceedings of Complex Systems Intelligence and Modern Technological Applications (CSIMTA 2004). Cherbourg, France, 591–598. L. A. Imhof, D. Fudenberg, and M. A. Nowak. 2005. Evolutionary cycles of cooperation and defection. Proc. Natl. Acad. Sci. U. S. A. 102, 31, 10797–10800. DOI: https://doi.org/10.1073/ pnas.0502589102. C. E. Izard. 2013. Human Emotions. Springer Science & Business Media. DOI: https://doi.or g/10.1007/978-1-4899-2209-0. D. Keltner, A. Kogan, P. K. Piff, and S. R. Saturn. 2014. The sociocultural appraisals, values, and emotions (save) framework of prosociality: Core processes from gene to meme. Annu. Rev. Psychol. 65, 425–460. DOI: https://doi.org/10.1146/annurev-psych-010213115054. E. H. Kim, S. S. Kwak, and Y. K. Kwak. 2009. Can robotic emotional expressions induce a human to empathize with a robot? In RO-MAN 2009-The 18th IEEE International Symposium on Robot and Human Interactive Communication. IEEE, 358–362. DOI: https://doi.org/ 10.1109/ROMAN.2009.5326282. M. S. Kim, B. K. Cha, D. M. Park, S. M. Lee, S. Kwak, and M. K. Lee. 2010. DONA: Urban donation motivating robot. In 2010 5th ACM/IEEE International Conference on Human– Robot Interaction (HRI). IEEE, 159–160. DOI: https://doi.org/10.1109/HRI.2010.5453217. T. J. King, I. Warren, and D. Palmer. 2008. Would Kitty Genovese have been murdered in Second Life? Researching the “bystander effect” using online technologies. In TASA 2008: Re-imagining sociology: The Annual Conference of The Australian Sociological Association. University of Melbourne, 1–23. S.-C. Kolm. 2008. Reciprocity: An Economics of Social Relations. Cambridge University Press. DOI: https://doi.org/10.1007/s00712-009-0072-0. M. D. Kozlov and M. K. Johansen. 2010. Real behavior in virtual environments: Psychology experiments in a simple virtual-reality paradigm using video games. Cyberpsychol. Behav. Soc. Netw. 13, 6, 711–714. DOI: https://doi.org/10.1089/cyber.2009.0310. A. D. Kramer, J. E. Guillory, and J. T. Hancock. 2014. Experimental evidence of massivescale emotional contagion through social networks. Proc. Natl. Acad. Sci. U. S. A. 111, 24, 8788–8790. DOI: https://doi.org/10.1073/pnas.1320040111. P. Kulms, S. Kopp, and N. C. Krämer. 2014. Let’s be serious and have a laugh: Can humor support cooperation with a virtual agent? In International Conference on Intelligent Virtual Agents. Springer, 250–259. DOI: https://doi.org/10.1007/978-3-319-09767-1_32. B. Latané and J. M. Darley. 1970. The Unresponsive Bystander: Why Doesn’t He Help? AppletonCentury-Crofts. S. Leiberg, O. Klimecki, and T. Singer. 2011. Short-term compassion training increases prosocial behavior in a newly developed prosocial game. PLoS One 6, 3. DOI: https://doi. org/10.1371/journal.pone.0017798. S. Leider, M. M. Möbius, T. Rosenblat, and Q.-A. Do. 2009. Directed altruism and enforced reciprocity in social networks. Q. J. Econ. 124, 4, 1815–1851. DOI: https://doi.org/10.1162/qj ec.2009.124.4.1815.


I. Leite, A. Pereira, S. Mascarenhas, C. Martinho, R. Prada, and A. Paiva. 2013. The influence of empathy in human–robot relations. Int. J. Hum. Comput. Stud. 71, 3, 250–260. DOI: https://doi.org/10.1016/j.ijhcs.2012.09.005. I. Leite, G. Castellano, A. Pereira, C. Martinho, and A. Paiva. 2014. Empathic robots for long-term interaction. Int. J. Soc. Rob. 6, 3, 329–341. DOI: https://doi.org/10.1007/s12369014-0227-1. A. Lim and H. G. Okuno. 2015. A recipe for empathy. Int. J. Soc. Rob. 7, 1, 35–49. DOI: https: //doi.org/10.1007/s12369-014-0262-y. D. Lim and D. DeSteno. 2016. Suffering and compassion: The links among adverse life experiences, empathy, compassion, and prosocial behavior. Emotion 16, 2, 175–182. DOI: https://doi.org/10.1037/emo0000144. C. Lisetti, R. Amini, U. Yasavur, and N. Rishe. 2013. I can help you change! An empathic virtual agent delivers behavior change health interventions. ACM Trans. Manage. Inf. Syst. 4, 4, 1–28. DOI: https://doi.org/10.1145/2544103. G. Lucas, G. Stratou, S. Lieblich, and J. Gratch. 2016. Trust me: Multimodal signals of trustworthiness. In Proceedings of the 18th ACM International Conference on Multimodal Interaction. 5–12. DOI: https://doi.org/10.1145/2993148.2993178. E. Lukinova and M. Myagkov. 2016. Impact of short social training on prosocial behaviors: An fMRI study. Front. Syst. Neurosci. 10, 60. DOI: https://doi.org/10.3389/fnsys.2016. 00060. M. J. Lupoli, L. Jampol, and C. Oveis. 2017. Lying because we care: Compassion increases prosocial lying. J. Exp. Psychol. Gen. 146, 7, 1026–1042. DOI: https://doi.org/10.1037/xg e0000315. L. K. Ma, R. J. Tunney, and E. Ferguson. 2017. Does gratitude enhance prosociality? A metaanalytic review. Psychol. Bull. 143, 6, 601–635. DOI: https://doi.org/10.1037/bul0000103. H. L. Maibom. 2017. Introduction to philosophy of empathy. In H. L. Maibom (Ed.), The Routledge Handbook to Philosophy of Empathy. Routledge, New York, 1–10. A. Mao, L. Dworkin, S. Suri, and D. J. Watts. 2017. Resilient cooperators stabilize long-run cooperation in the finitely repeated prisoner’s dilemma. Nat. Commun. 8, 1, 1–10. DOI: https://doi.org/10.1038/ncomms13800. F. Martela and R. M. Ryan. 2016. The benefits of benevolence: Basic psychological needs, beneficence, and the enhancement of well-being. J. Pers. 84, 6, 750–764. DOI: https://doi. org/10.1111/jopy.12215. N. Masuda and F. C. Santos. 2019. A mathematical look at empathy. eLife 8. DOI: https://do i.org/10.7554/eLife.47036. S. W. McQuiggan, J. L. Robison, R. Phillips, and J. C. Lester. 2008. Modeling parallel and reactive empathy in virtual agents: An inductive approach. In AAMAS (1). CiteSeer, 167– 174. DOI: https://doi.org/10.1145/1402383.1402411. R. R. Morris, K. Kouddous, R. Kshirsagar, and S. M. Schueller. 2018. Towards an artificially empathic conversational agent for mental health applications: System design and user perceptions. J. Med. Int. Res. 20, 6, e10148. DOI: https://doi.org/10.2196/10148.


L. D. Nelson and M. I. Norton. 2005. From student to superhero: Situational primes shape future helping. J. Exp. Soc. Psychol. 41, 4, 423–430. DOI: https://doi.org/10.1016/j.jesp.2004. 08.003. M. A. Nowak. 2006. Five rules for the evolution of cooperation. Science 314, 5805, 1560–1563. DOI: https://doi.org/10.1126/science.1133755. M. Nowak and K. Sigmund. 1993. A strategy of win-stay, lose-shift that outperforms tit-fortat in the prisoner’s dilemma game. Nature 364, 6432, 56–58. DOI: https://doi.org/10.1038/ 364056a0. M. A. Nowak and K. Sigmund. 1998. Evolution of indirect reciprocity by image scoring. Nature 393, 6685, 573–577. DOI: https://doi.org/10.1038/31225. M. A. Nowak and S. Roch. 2007. Upstream reciprocity and the evolution of gratitude. Proc. Biol. Sci. 274, 1610, 605–610. DOI: https://doi.org/10.1098/rspb.2006.0125. M. Ochs, D. Sadek, and C. Pelachaud. 2012. A formal model of emotions for an empathic rational dialog agent. Auton. Agent Multi Agent Syst. 24, 3, 410–440. DOI: https://doi.org/ 10.1007/s10458-010-9156-z. J. M. Pacheco and F. C. Santos. 2011. The messianic effect of pathological altruism. Pathol. Altruism, 300. J. M. Pacheco, F. C. Santos, and F. A. C. Chalub. 2006. Stern-judging: A simple, successful norm which promotes cooperation under indirect reciprocity. PLoS Comput. Biol. 2, 12. DOI: https://doi.org/10.1371/journal.pcbi.0020178. A. Paiva, I. Leite, H. Boukricha, and I. Wachsmuth. 2017. Empathy in virtual agents and robots: A survey. ACM Trans. Interact. Intell. Syst. 7, 3, 1–40. DOI: https://doi.org/10.1145/ 2912150. A. Paiva, F. P. Santos, and F. C. Santos. 2018. Engineering pro-sociality with autonomous agents. In Thirty-Second AAAI Conference on Artificial Intelligence. S. Park, S. Scherer, J. Gratch, P. Carnevale, and L.-P. Morency. 2013. Mutual behaviors during dyadic negotiation: Automatic prediction of respondent reactions. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. IEEE, 423–428. DOI: https://doi.org/10.1109/ACII.2013.76. L. A. Penner, J. F. Dovidio, J. A. Piliavin, and D. A. Schroeder. 2005. Prosocial behavior: Multilevel perspectives. Annu. Rev. Psychol. 56, 365–392. DOI: https://doi.org/10.1146/annurev. psych.56.091103.070141. E. Pennisi. 2005. How did cooperative behavior evolve? Science 309, 5731, 93–93. DOI: https: //doi.org/10.1126/science.309.5731.93. A. Pereira, I. Leite, S. Mascarenhas, C. Martinho, and A. Paiva. 2010. Using empathy to improve human–robot relationships. In International Conference on Human–Robot Personal Relationship. Springer, 130–138. DOI: https://doi.org/10.1007/978-3-64219385-9_17. S. Pfattheicher, L. Nockur, R. Böhm, C. Sassenrath, and M. B. Petersen. 2020. The emotional path to action: Empathy promotes physical distancing during the


COVID-19 pandemic. Psychol. Sci. 31, 11, 1363–1373. DOI: https://doi.org/10.1177/0956797620 964422. R. W. Picard, A. Wexelblat, and C. I. N. I. Clifford I. Nass. 2002. Future interfaces: Social and emotional. In CHI’02 Extended Abstracts on Human Factors in Computing Systems. 698–699, DOI: https://doi.org/10.1145/506443.506552. F. L. Pinheiro, V. V. Vasconcelos, F. C. Santos, and J. M. Pacheco. 2014. Evolution of allor-none strategies in repeated public goods dilemmas. PLoS Comput. Biol. 10, 11. DOI: https://doi.org/10.1371/journal.pcbi.1003945. H. Prendinger and M. Ishizuka. 2005. The empathic companion: A character-based interface that addresses users’ affective states. Appl. Artif. Intell. 19, 3–4, 267–285. DOI: https: //doi.org/10.1080/08839510590910174. S. D. Preston and F. B. De Waal. 2002. Empathy: Its ultimate and proximate bases. Behav. Brain Sci. 25, 1, 1–20. DOI: https://doi.org/10.1017/s0140525x02000018. S. Prot, D. A. Gentile, C. A. Anderson, K. Suzuki, E. Swing, K. M. Lim, Y. Horiuchi, M. Jelic, B. Krahé, W. Liuqing, A. K. Liau, A. Khoo, P. D. Petrescu, A. Sakamoto, S. Tajima, R. A. Toma, W. Warburton, X. Zhang, and B. C. P. Lam. 2014. Long-term relations among prosocial-media use, empathy, and prosocial behavior. Psychol. Sci. 25, 2, 358–368. DOI: https://doi.org/10.1177/0956797613503854. G. R. Pursell, B. Laursen, K. H. Rubin, C. Booth-LaForce, and L. Rose-Krasnor. 2008. Gender differences in patterns of association between prosocial behavior, personality, and externalizing problems. J. Res. Pers. 42, 2, 472–481. DOI: https://doi.org/10.1016/j.jrp.2007. 06.003. A. L. Radzvilavicius, A. J. Stewart, and J. B. Plotkin. 2019. Evolution of empathetic moral evaluation. eLife 8, e44269. DOI: https://doi.org/10.7554/eLife.44269. L. T. Rameson, S. A. Morelli, and M. D. Lieberman. 2012. The neural correlates of empathy: Experience, automaticity, and prosocial behavior. J. Cogn. Neurosci. 24, 1, 235–245. DOI: https://doi.org/10.1162/jocn_a_00130. D. G. Rand and M. A. Nowak. 2013. Human cooperation. Trends Cogn. Sci. 17, 8, 413–425. DOI: https://doi.org/10.1016/j.tics.2013.06.003. D. G. Rand, J. D. Greene, and M. A. Nowak. 2012. Spontaneous giving and calculated greed. Nature 489, 7416, 427–430. DOI: https://doi.org/10.1038/nature11467. B. Reeves and C. I. Nass. 1996. The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Cambridge University Press. P. Resnick, K. Kuwabara, R. Zeckhauser, and E. Friedman. 2000. Reputation systems. Commun. ACM 43, 12, 45–48. DOI: https://doi.org/10.1145/355112.355122. J. Riegelsberger, M. A. Sasse, and J. D. McCarthy. 2003. The researcher’s dilemma: Evaluating trust in computer-mediated communication. Int. J. Hum. Comput. Stud. 58, 6, 759– 781. DOI: https://doi.org/10.1016/S1071-5819(03)00042-9. L. D. Riek, P. C. Paul, and P. Robinson. 2010. When my robot smiles at me: Enabling human–robot rapport via real-time head gesture mimicry. J. Multimodal User Interfaces 3, 1-2, 99–108. DOI: https://doi.org/10.1007/s12193-009-0028-2.


A. J. Robson. 1990. Efficiency in evolutionary games: Darwin, Nash and the secret handshake. J. Theor. Biol. 144, 3, 379–396. DOI: https://doi.org/10.1016/s0022-5193(05)80082-7. S. H. Rodrigues, S. Mascarenhas, J. Dias, and A. Paiva. 2015. A process model of empathy for virtual agents. Interact. Comput. 27, 4, 371–391. DOI: https://doi.org/10.1093/iwc/iw u001. R. S. Rosenberg, S. L. Baughman, and J. N. Bailenson. 2013. Virtual superheroes: Using superpowers in virtual reality to encourage prosocial behavior. PLoS One 8, 1. DOI: https: //doi.org/10.1371/journal.pone.0055003. A. C. Rumble, P. A. Van Lange, and C. D. Parks. 2010. The benefits of empathy: When empathy may sustain cooperation in social dilemmas. Eur. J. Soc. Psychol. 40, 5, 856–866. DOI: https://doi.org/10.1002/ejsp.659. J. Sabater and C. Sierra. 2005. Review on computational trust and reputation models. Artif. Intell. Rev. 24, 1, 33–60. DOI: https://doi.org/10.1007/s10462-004-0041-5. J. Sabourin, B. Mott, and J. Lester. 2011. Computational models of affect and empathy for pedagogical virtual agents. In Standards in Emotion Modeling, Lorentz Center International Center for Workshops in the Sciences. CiteSeer. M. Saleem, C. A. Anderson, and D. A. Gentile. 2012. Effects of prosocial, neutral, and violent video games on children’s helpful and hurtful behaviors. Aggress. Behav. 38, 4, 281– 287. DOI: https://doi.org/10.1002/ab.21428. D. Sally. 1995. Conversation and cooperation in social dilemmas: A meta-analysis of experiments from 1958 to 1992. Ration. Soc. 7, 1, 58–92. DOI: https://doi.org/10.1177/ 1043463195007001004. F. C. Santos and J. M. Pacheco. 2005. Scale-free networks provide a unifying framework for the emergence of cooperation. Phys. Rev. Lett. 95, 9, 098104. DOI: https://doi.org/10.1103/ PhysRevLett.95.098104. F. C. Santos, J. M. Pacheco, and B. Skyrms. 2011. Co-evolution of pre-play signaling and cooperation. J. Theo. Biol. 274, 1, 30–35. DOI: https://doi.org/10.1016/j.jtbi.2011.01.004. F. P. Santos, J. M. Pacheco, and F. C. Santos. 2018a. Social norms of cooperation with costly reputation building. In Thirty-Second AAAI Conference on Artificial Intelligence. F. P. Santos, F. C. Santos, and J. M. Pacheco. 2018b. Social norm complexity and past reputations in the evolution of cooperation. Nature 555, 7695, 242–245. DOI: https://doi.org/ 10.1038/nature25763. F. P. Santos, J. M. Pacheco, A. Paiva, and F. C. Santos. 2019. Evolution of collective fairness in hybrid populations of humans and agents. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6146–6153. DOI: https://doi.org/10.1609/aaai.v33i01.33016146. F. P. Santos, S. F. Mascarenhas, F. C. Santos, F. Correia, S. Gomes, and A. Paiva. 2020. Picky losers and carefree winners prevail in collective risk dilemmas with partner selection. Auton. Agents Multi-Agent Syst. 34, 40. DOI: https://doi.org/10.1007/s10458-020-09463-w. M. Sapouna, D. Wolke, N. Vannini, S. Watson, S. Woods, W. Schneider, S. Enz, L. Hall, A. Paiva, E. André, K. Dautenhahn, and R. Aylett. 2010. Virtual learning intervention to


reduce bullying victimization in primary school: A controlled trial. J. Child Psychol. Psychiatry 51, 1, 104–112. DOI: https://doi.org/10.1111/j.1469-7610.2009.02137.x. M. Sarabia, T. Le Mau, H. Soh, S. Naruse, C. Poon, Z. Liao, K. C. Tan, Z. J. Lai, and Y. Demiris. 2013. iCharibot: Design and field trials of a fundraising robot. In International Conference on Social Robotics. Springer, 412–421. DOI: https://doi.org/10.1007/978-3-31902675-6_41. E. G. Schellenberg, K. A. Corrigall, S. P. Dys, and T. Malti. 2015. Group music training and children’s prosocial skills. PLoS One 10, 10. DOI: https://doi.org/10.1371/journal.pone .0141449. K. R. Scherer. 1988. Criteria for emotion-antecedent appraisal: A review. In Cognitive Perspectives on Emotion and Motivation. Springer, 89–126. DOI: https://doi.org/10.1007/97894-009-2792-6_4. T. Schramme. 2017. Empathy and altruism. In H. L. Maibom (Ed.), The Routledge Handbook of Philosophy of Empathy, Chapter 18. 203–214. DOI: https://doi.org/10.4324/ 9781315282015. S. H. Seo, D. Geiskkovitch, M. Nakane, C. King, and J. E. Young. 2015. Poor thing! Would you feel sorry for a simulated robot? A comparison of empathy toward a physical and a simulated robot. In 2015 10th ACM/IEEE International Conference on Human–Robot Interaction (HRI). IEEE, 125–132. DOI: https://doi.org/10.1145/2696454.2696471. P. Sequeira, P. Alves-Oliveira, T. Ribeiro, E. Di Tullio, S. Petisca, F. S. Melo, G. Castellano, and A. Paiva. 2016. Discovering social interaction strategies for robots from restrictedperception Wizard-of-Oz studies. In 2016 11th ACM/IEEE International Conference on Human–Robot Interaction (HRI). IEEE, 197–204. DOI: https://doi.org/10.1109/HRI.2016. 7451752. M. Shiomi, A. Nakata, M. Kanbara, and N. Hagita. 2017. A hug from a robot encourages prosocial behavior. In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 418–423. H. Shirado and N. A. Christakis. 2017. Locally noisy autonomous agents improve global human coordination in network experiments. Nature 545, 7654, 370–374. DOI: https://do i.org/10.1038/nature22332. M. Slater, A. Rovira, R. Southern, D. Swapp, J. J. Zhang, C. Campbell, and M. Levine. 2013. Bystander responses to a violent incident in an immersive virtual environment. PLoS One 8, 1, e52766. DOI: https://doi.org/10.1371/journal.pone.0052766. C. Straßmann, A. M. Rosenthal-von der Pütten, and N. C. Krämer. 2018. With or against each other? The influence of a virtual agent’s (non) cooperative behavior on user’s cooperation behavior in the prisoners’ dilemma. Adv. Hum.-Comput. Interact. 2018. DOI: https://doi.org/10.1155/2018/2589542. A. Tavoni, A. Dannenberg, G. Kallis, and A. Löschel. 2011. Inequality, communication, and the avoidance of disastrous climate change in a public goods game. Proc. Natl. Acad. Sci. U. S. A. 108, 29, 11825–11829. DOI: https://doi.org/10.1073/pnas.1102493108.


I. Thielmann, G. Spadaro, and D. Balliet. 2020. Personality and prosocial behavior: A theoretical framework and meta-analysis. Psychol. Bull. 146, 1, 30–90. DOI: https://doi.org/10. 1037/bul0000217. M. Tomasello and A. Vaish. 2013. Origins of human cooperation and morality. Annu. Rev. Psychol. 64, 231–255. DOI: https://doi.org/10.1146/annurev-psych-113011-143812. R. L. Trivers. 1971. The evolution of reciprocal altruism. Q. Rev. Biol. 46, 1, 35–57. DOI: https: //doi.org/10.1086/406755. P. A. Van Lange, J. Joireman, C. D. Parks, and E. Van Dijk. 2013. The psychology of social dilemmas: A review. Organ. Behav. Hum. Decis. Process. 120, 2, 125–141. DOI: https://doi.or g/10.1016/j.obhdp.2012.11.003. P. A. Van Lange, D. P. Balliet, C. D. Parks, and M. Van Vugt. 2014. Social Dilemmas: Understanding Human Cooperation. Oxford University Press. J. Vásquez and M. Weretka. 2019. Affective empathy in non-cooperative games. Games Econ. Behav. 121. DOI: https://doi.org/10.1016/j.geb.2019.10.005. I. M. Verma. 2014. Editorial expression of concern: Experimental evidence of massivescale emotional contagion through social networks. Proc. Natl. Acad. Sci. U. S. A. 111, 10779. DOI: https://doi.org/10.1073/pnas.1412469111. J. W. Weibull. 1997. Evolutionary Game Theory. MIT Press. S. A. West, A. S. Griffin, and A. Gardner. 2007. Social semantics: Altruism, cooperation, mutualism, strong reciprocity and group selection. J. Evol. Biol. 20, 2, 415–432. DOI: https://doi.org/10.1111/j.1420-9101.2006.01258.x. M. Wooldridge. 2003. Reasoning about Rational Agents. MIT Press. DOI: https://doi.org/10. 1023/A:1024162303279. J. Wu, D. Balliet, L. S. Peperkoorn, A. Romano, and P. A. Van Lange. 2020. Cooperation in groups of different sizes: The effects of punishment and reputation-based partner choice. Front. Psychol. 10, 2956. DOI: https://doi.org/10.3389/fpsyg.2019.02956. S. X. Xiao, E. C. Hashi, K. M. Korous, and N. Eisenberg. 2019. Gender differences across multiple types of prosocial behavior in adolescence: A meta-analysis of the prosocial tendency measure-revised (PTM-R). J. Adolesc. 77, 41–58. DOI: https://doi.org/10.1016/j.ad olescence.2019.09.003. Ö. N. Yalçn and S. DiPaola. 2018. A computational model of empathy for interactive agents. Biol. Inspired Cogn. Archit. 26, 20–25. DOI: https://doi.org/10.1016/j.bica.2018.07.010. Ö. Yalçn and S. DiPaola. 2019a. Evaluating levels of emotional contagion with an embodied conversational agent. In Proceedings of the 41st Annual Conference of the Cognitive Science Society. Ö. N. Yalçn and S. DiPaola. 2019b. Modeling empathy: Building a link between affective and cognitive processes. Artif. Intell. Rev. 53, 1–24. DOI: https://doi.org/10.1007/s10462019-09753-0.

12 Rapport Between Humans and Socially Interactive Agents

Jonathan Gratch and Gale Lucas

12.1 Introduction

Think back on a time when you saw speakers engrossed in conversation. Even without hearing their words you can sense their connection. The participants seem tightly enmeshed in something like a dance. They rapidly resonate to each other’s gestures, facial expressions, gaze, and patterns of speech. Such behavior has been studied by many names including social resonance [Duncan et al. 2007, Kopp 2010], interpersonal adaptation [Burgoon et al. 1995], entrainment [Levitan et al. 2011], interactional synchrony [Bernieri and Rosenthal 1991], social glue [Lakin et al. 2003], immediacy behaviors [Julien et al. 2000], and positivity resonance [Fredrickson 2016]. In this chapter, we follow the terminology of Tickle-Degnen and Rosenthal [1990] and refer to this seemingly automatic attunement of positive emotional displays, gaze, and gestures as rapport. Regardless of the difference in focus or definition, these bodies of research agree that, once established, rapport influences a wide range of interpersonal processes. It plays a crucial role in the establishment of social bonds [Tickle-Degnen and Rosenthal 1990], promoting or diffusing conflict [Lanzetta and Englis 1989], persuasion [Fuchs 1987, Bailenson and Yee 2005], and in the establishment of identity [Mead 1934]. Perhaps more importantly, rapport leads to beneficial outcomes across a wide range of practical interpersonal problems. It fosters success in negotiations [Drolet and Morris 2000, Goldberg 2005], improves workplace cohesion [Cogger 1982], enhances psychotherapeutic effectiveness [Tsui and Schultz 1985], elevates test performance in classrooms [Fuchs 1987], and raises

the quality of childcare [Burns 1984]. To paraphrase Tickle-Degnen and Rosenthal [1990], this chapter explores how intelligent virtual clinicians can develop rapport with patients, robot sales personnel can use it to make a deal, and socially interactive agents can try to predict from it the future of their relationship with a user. In people, such unfolding and resonating patterns of behavior arise without conscious deliberation [Bargh et al. 1996, Lakin et al. 2003] but they are hardly fixed reflexes. The same individual who smiles reflexively to the smile of a friend may frown to the smiles of his opponent [Lanzetta and Englis 1989]. The way such interpersonal patterns unfold depends on a host of intra- and interpersonal factors including the relative power of individuals in an interaction—with weaker partners tending to mimic the more powerful but not vice versa [Tiedens and Fragale 2003]; prior expectations—with unexpectedly positive behaviors producing more favorable outcomes than expected ones [Burgoon 1983]; and conformity to social norms such as reciprocity [Roloff and Campion 1987] or the appropriate use of expressing negative emotions in a particular setting [Adam et al. 2010]. More generally, these patterns both arise from and help to redefine an evolving affective relationship between individuals [Parkinson 2013]. Building “rapport agents” that can engage people in this unfolding emotional dance is a fascinating prospect with potentially profound practical and scientific consequences. In terms of applications, socially interactive agents can use rapport to enhance disclosure in mental health screenings [Lucas et al. 2014] and robots can use it to improve learning in primary school students [Lubold et al. 2018]. In terms of science, computational models of rapport can enhance social theory by allowing systematic manipulation of nonverbal patterns in ways that are difficult or impossible for human confederates in social science experiments. Such experimental control enables studies that can definitively establish if such patterns cause these effects or merely reflect them (see Bente et al. [2001], Bailenson et al. [2004], Gratch et al. [2006], Forgas [2007], de Melo et al. [2014a], and Hoegen et al. [2018]). Indeed, this chapter reviews findings that synthesized patterns of nonverbal behavior cause predicted changes in the impressions and behaviors of human subjects in laboratory studies. These studies give insight into the factors that promote these effects and show how these insights can translate into computer-based applications to health, commerce, and entertainment. We first review rapport theory and why it is important for human–machine systems. We next highlight some of the technical and methodological advances that have allowed the creation of machines that establish rapport with people. Many groups have explored these capabilities in both virtual and robotic agents and this chapter will not give full justice to this vibrant body of research. Rather, we

will focus on our own modest contributions to this field. Finally, we will review empirical findings that human–machine rapport has important benefits, that computational models can inform psychological theories of rapport, and that rapport supports important societal outcomes, such as addressing human mental health.

12.2

Rapport Theory Rapport is studied across a range of scientific disciplines and domains for its role in fostering emotional bonds and prosocial behavior. This includes research on teaching [Bernieri 1988], psychotherapy [Charny 1966], sales [Gremler and Gwinner 2000], and interrogations [Abbe and Brandon 2013], to name but a few. Research on rapport is also diverse in the behaviors and modalities addressed. Most research, and the focus of this chapter, emphasizes nonverbal behaviors such as smiles and postures. But other work explores how feelings of rapport arise from, and shape the content of conversations [Friedberg et al. 2012], conversational mechanisms such as turn-taking [Cassell et al. 2007], or even the convergence of physiological processes such as respiration [McFarland 2001] and pupil dilation [Kang and Wheatley 2017]. In their seminal article, Tickle-Degnen and Rosenthal [1990] equate rapport with subjective experience (e.g., individual report that they “clicked” or have a sense of “chemistry”), but also with three essential elements of nonverbal behavior. First are signals of positivity such as smiles, nods of agreement, forward trunk lean, uncrossed arms, and open body posture. These behaviors indicate mutual liking and approval. The second nonverbal element is signals of mutual attentiveness such as shared gaze, direct body orientation, and “listening behaviors” such as backchannels (verbal “uh-huh” or quick nods) that convey that participants are actively attending and understanding each other. The third essential element is coordination. This refers to dyadic/group behaviors that convey parties are functioning as a coordinated unit, such as postural mimicry or interactional synchrony. Tickle-Degnen and Rosenthal further emphasize that feelings and behavior are tightly linked. They argue for strong consistency between self-reported rapport from each participant in a conversation, the presence of nonverbal components, and third-party judgments. In this sense, rapport is easy to assess—observers can reliably predict if people will report feelings of rapport by watching their nonverbal behavior [Ambady et al. 2000]—and these judgments and feelings predict positive interpersonal outcomes. Rapport also has a dynamic aspect. In the short term, as people begin to establish rapport in a conversation, there is a tendency for behaviors to become more coordinated over time through a process sometimes called entrainment (e.g.,


Figure 12.1 The importance of elements of rapport as relationships deepen (based on Tickle-Degnen and Rosenthal [1990]).

Borrie et al. [2019]). This can include increased postural and gestural mimicry [Bergmann and Kopp 2012], convergence of acoustic and prosodic features [Levitan and Hirschberg 2011], and shared syntax and word use [Brennan 1996]. Over longer time periods (i.e., across multiple conversations), Tickle-Degnen and Rosenthal posit that the nonverbal elements of rapport shift in importance as the relationship between interlocutors deepens and strengthens (see Figure 12.1). In particular, the need to signal positivity becomes less important as mutual-liking becomes assumed. Although not reflected in Tickle-Degnen and Rosenthal’s model, other research suggests that the appearance of coordination may diminish (at least from the perspective of classical turn-taking models). For example, as participants become more comfortable with each other, barging in and overlapping speech becomes far more common [Coates 1989]. Most psychological research on rapport examines contexts where both parties are actively speaking, but rapport can be established even during a monolog. During monologues, listeners can convey their positivity, mutual attention, and coordination through backchannels [Yngve 1970]. In linguistics, these are behaviors such as verbal interjections (e.g., “yeah,” “right,” “uh-huh”) or nonverbal signals such as nods, smiles, and other facial expressions such as eyebrow raises or expressions of sympathy. These behaviors can increase speaker comfort, fluidity, and even change the nature of the monolog. Indeed, Bavelas et al. [2000] has emphasized that, through these behaviors, listeners become co-narrators.

When it comes to computational models, most systems have emphasized “early rapport” (as shown in Figure 12.1) and active-listening behaviors such as backchannels. This is probably due to limitations in the current state-of-the-art in conversational agents. Although rapidly advancing, dialog systems are still limited especially with regard to the sort of continuous and predictive parsing of speech needed to match human conversational fluidity [DeVault et al. 2011]. Thus, many systems focus on the illusion of understanding by responding to easier to recognize surface features of speech [Ward and Tsukahara 2000] and rarely entrain their behavior to participants across multiple interactions. For this reason, we will focus the remainder of this chapter on systems that emphasize active listening approaches to establishing rapport.

12.3 History and Overview of Rapport Agents

The practical benefit of rapport between people has spurred a wide array of research into techniques that can reproduce this interpersonal state within human–machine interactions. We trace the evolution of a 15-year effort within our laboratory to endow machines with the elements of rapport, beginning with the “Rapport Agent” (which engaged in simple listening behaviors) and ending with SimSensei, an intelligent virtual agent that integrates multimodal rapport behaviors with natural language dialog. During this period, the technology underlying such rapport agents advanced from handcrafted rules (based on theoretical observations) to machine learning approaches, and we devised new ways to collect and annotate the data that such approaches demand, such as parasocial consensus sampling. In terms of applications, this project began as a theoretical exercise to study whether human–machine rapport was even possible. As these methods have matured, they have increasingly been applied to practical problems, particularly health interviews [Lucas et al. 2014] and behavior change (see Chapter 25).

12.3.1 A Rule-based Approach

The Rapport Agent [Gratch et al. 2006], illustrated in Figure 12.2, was designed to evoke subjective feelings and behavioral consequences of rapport with human participants in a restricted set of social interactions we refer to as quasi-monologs. In a quasi-monolog, the human does most of the talking and the agent primarily prompts human speech and provides attentive listening feedback. The Rapport Agent was originally developed to replicate findings by Duncan and colleagues on listening behaviors [Welji and Duncan 2004]. In their studies, a speaker (the narrator) retells some previously observed series of events (e.g., the events in a recently watched video) to some listener. In our case, the listener is a computer-generated

Figure 12.2 The first author's daughter interacting with the Rapport Agent.

character that has been programmed to produce (or fail to produce) the type of nonverbal feedback seen in rapportful conversations. The agent was crafted by three visiting interns, Francois Lamothe and Mathieu Morales from the French Military Academy and Rick van der Werf from Twente University. The implementation focused on the observation that, when it comes to active listening, feelings of rapport correlate with simple contingent listening behaviors such as the use of backchannels (nods, elicited by speaker prosodic cues, which signify that the communication is working), postural mirroring, and mimicry of certain head gestures, for example, gaze shifts and head nods [Yngve 1970, Chartrand and Bargh 1999, Ward and Tsukahara 2000]. To identify backchannel opportunity points, Lamothe and Morales created an acoustic tool called Luan, building on techniques proposed by Nigel Ward [Ward and Tsukahara 2000]. To identify nonverbal behaviors, van der Werf utilized the Watson software package developed by Morency et al. [2005]. This component uses a seated participant’s head position to detect posture shifts and use head orientation to recognize head nods and shakes. Figure 12.3 illustrates the Rapport Agent architecture. Participants told stories to a virtual agent rendered in a game engine and animated with the SmartBody animation system [Kallmann and Marsella 2005, Thiebaux et al. 2008]. Features recognized from speech and vision were passed to a simple authorable rule-based system that communicated messages in the behavior markup language [Kopp et al. 2006] to SmartBody. The original character, illustrated in Figure 12.2, was intended to look like Brad Pitt, an obvious failure, but this was the origin of a long line of characters named Brad emanating from our laboratory.
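To give a concrete flavor of what such an authorable rule-based mapping might look like, the sketch below turns a few per-frame speaker features into listener behaviors. The feature names, thresholds, and the send_bml helper are illustrative assumptions, not the actual Rapport Agent rules or the SmartBody/BML interface.

```python
# Illustrative sketch of a rule-based listener-feedback mapping (not the actual
# Rapport Agent rules). Each rule inspects per-frame features from speech and
# vision and, when it fires, emits a BML-like message for the animation system.
from dataclasses import dataclass

@dataclass
class Features:
    """Per-frame observations of the human speaker (hypothetical feature set)."""
    lowered_pitch: bool   # prosodic backchannel cue (cf. Ward and Tsukahara)
    pause_ms: float       # silence since the last speech frame
    speaker_nod: bool     # head nod detected by the vision component
    posture_shift: bool   # posture shift detected by the vision component

def listener_rules(f: Features) -> list[str]:
    """Map speaker features to listener behaviors (behavior names are placeholders)."""
    behaviors = []
    if f.lowered_pitch and f.pause_ms > 200:   # backchannel opportunity point
        behaviors.append("head_nod")
    if f.speaker_nod:                          # mimic head gestures
        behaviors.append("head_nod")
    if f.posture_shift:                        # mirror posture shifts
        behaviors.append("posture_shift")
    return behaviors

def send_bml(behavior: str) -> str:
    """Stand-in for sending a BML request to the character animation system."""
    return f'<bml><gesture type="{behavior}"/></bml>'

if __name__ == "__main__":
    frame = Features(lowered_pitch=True, pause_ms=350,
                     speaker_nod=False, posture_shift=True)
    for b in listener_rules(frame):
        print(send_bml(b))
```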

Figure 12.3 Rapport Agent architecture.

Though simple and based on handcrafted rules, the Rapport Agent proved remarkably effective. Other groups reported success with alternative rule-based systems for virtual [Truong et al. 2010] or robotic agents [Fujie et al. 2004]. As discussed in the next section, such agents create subjective feelings of rapport and several beneficial changes in participant behavior (fostering more fluent speech and eliciting longer and more intimate stories). Such systems also served as an important empirical tool, helping to establish the importance of all three theoretically posited nonverbal elements of rapport.

12.3.2 Machine Learning

The Rapport Agent, and similar agents at the time, relied on handcrafted rules due to the lack of good annotated datasets on rapport. One consequence of the project was the creation of a reasonably large corpus (by the standards of the time) of human-to-human conversations that were rated by both parties for perceptions of rapport and annotated (both manually and automatically) for nonverbal behaviors associated with rapport (this data is publicly available at https://rapport.ict.usc.edu/). This data afforded the opportunity to explore machine learning approaches to replace the handcrafted rules.

Our initial machine learning efforts focused on predicting backchannel opportunity points. Figure 12.4 illustrates our approach, which was spearheaded by Louis-Philippe Morency and another University of Twente intern, Iwan de Kok [Morency et al. 2008, 2010]. The approach encoded multimodal features from a speaker using an encoding dictionary to represent different possible temporal relationships between speaker behaviors and listener responses. For example, a step function hypothesizes that a backchannel is likely whenever the speaker's feature is present, whereas a ramp function hypothesizes that a backchannel is most likely when the speaker's feature is first present. These features were then passed


Figure 12.4 Architecture for multimodal backchannel prediction. The system (1) senses audiovisual information from a speaker. These features are (2) encoded using a dictionary that represents alternative possible temporal relationships between speaker behaviors and listener responses. After feature selection (3), they are passed to a machine learning classifier (4) that generates probabilistic predictions (5).

to a supervised machine learning approach based on conditional random fields [Lafferty et al. 2001]. The learning approach yielded considerable improvement over our handcrafted rules. An analysis of the learned model emphasized the importance of multimodality: backchannels were elicited by prosodic features, as predicted by Ward and


Tsukahara [2000], but also depended on visual features (listeners would tend to backchannel when the speaker paused and looked at them). Thus, at least in terms of backchannels, this research showed that elements of rapport could be automatically learned from data.
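To illustrate the encoding-dictionary idea, the sketch below expands binary speaker features into step and ramp encodings and trains a frame-level classifier on them. The window width, feature names, toy data, and the use of scikit-learn's logistic regression in place of the conditional random fields used in the original work are all simplifying assumptions.

```python
# Sketch of encoding-dictionary features for backchannel prediction (assumed
# feature names and window sizes; a linear classifier stands in for the CRF).
import numpy as np
from sklearn.linear_model import LogisticRegression

def step_encoding(feature: np.ndarray) -> np.ndarray:
    """Backchannel hypothesized likely whenever the speaker feature is present."""
    return feature.astype(float)

def ramp_encoding(feature: np.ndarray, width: int = 20) -> np.ndarray:
    """Backchannel hypothesized most likely when the feature first appears, decaying after."""
    out = np.zeros_like(feature, dtype=float)
    onsets = np.flatnonzero(np.diff(np.concatenate([[0], feature])) == 1)
    for t in onsets:
        span = min(width, len(feature) - t)
        out[t:t + span] = np.maximum(out[t:t + span], np.linspace(1.0, 0.0, span))
    return out

def encode(pause: np.ndarray, gaze_at_listener: np.ndarray) -> np.ndarray:
    """Stack encoded multimodal features into a frames x features matrix."""
    return np.column_stack([
        step_encoding(pause), ramp_encoding(pause),
        step_encoding(gaze_at_listener), ramp_encoding(gaze_at_listener),
    ])

# Toy training example: binary per-frame speaker features and backchannel labels
# (the labels here are fabricated purely for the demo).
rng = np.random.default_rng(0)
pause = (rng.random(500) > 0.8).astype(int)
gaze = (rng.random(500) > 0.7).astype(int)
labels = ((pause == 1) & (gaze == 1)).astype(int)

clf = LogisticRegression(max_iter=1000).fit(encode(pause, gaze), labels)
probs = clf.predict_proba(encode(pause, gaze))[:, 1]  # per-frame backchannel probability
```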

12.3.3 Parasocial Consensus Sampling

If the same storyteller tells the same story to different audiences, they may get quite different reactions. Some audiences may be engaged and show active listening, which in turn will feed back and help the speaker tell a better story [Bavelas et al. 2000]. Others may sit impassively. Even when engaged, there may be considerable variability in how feedback is offered. As learning models began to capture the basics of rapport, an emerging question became how to explain this variability. Lixing Huang, a PhD student in our laboratory, proposed an interesting approach to examine and analyze variability in listening behavior. Rather than focusing on dyads, he asked whether it was possible to get the responses of many listeners to a single speaker. Around the same time, de Kok and Heylen were raising the same question [de Kok and Heylen 2011]. But Dr. Huang further wondered if people really needed to be in a real conversation to provide useful data about how they would respond. In communication theory, Horton and Wohl [1956] introduced the term parasocial interaction to refer to the observation that people often respond to media, like cinema or television, as if they are really engaged in a social interaction with them (think of the time you yelled at a TV pundit you disagreed with). Building on this idea, he created a framework where crowdworkers could watch a storyteller and “act out” the types of responses they would show if they were in a real social interaction. The technique turned out to be surprisingly effective and allowed the inexpensive creation of large corpora of rapport behavior [Huang 2013].

Huang termed his method parasocial consensus sampling. The idea was to have multiple coders signal some form of feedback (he began with simple backchannels but then moved to turn-taking and emotional feedback). This would elicit an entire distribution of responses to each speaker. As seen in Figure 12.5, coders highlighted multiple opportunities to provide feedback but only reached consensus on the third opportunity. The consensus response could be used to train a classifier [Huang et al. 2010]. However, the variability was interesting in its own right. One could use the amount of consensus as a measure of how salient certain features were in eliciting feedback. For example, the word “bothering” seemed to trigger a stronger response. Further, one can examine variability across different “types” of individuals. For example, extraverts are more likely to provide feedback during opportunity points than introverts [Huang and Gratch 2013].


Figure 12.5 Illustration of the parasocial consensus sample of a particular storyteller. The first line shows the speaker's transcript. The second line shows the backchannels provided by the original "real" listener. The bottom line shows the distribution of responses by parasocial consensus sampling coders.

The parasocial consensus sampling technique was applied not only to backchannels but also to turn-taking and emotional feedback. The resulting learned models yielded insight into these conversational processes. For example, listeners allow the speaker to keep the turn longer during a pause if the speaker looks away. In terms of emotional feedback, mimicry was far more likely to occur during backchannel opportunity points. The resulting models were incorporated into an improved version of the Rapport Agent that could interview people with a fixed script of questions, pause the appropriate length of time before moving on, and provide a semblance of active and empathetic feedback [Huang et al. 2011].
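A minimal sketch of the consensus computation follows: each coder's feedback timestamps are binned on a common timeline, summed into a consensus level, and thresholded to yield training labels. The bin width, threshold, and timestamps are illustrative assumptions, not the settings used in the studies cited above.

```python
# Sketch of parasocial consensus sampling aggregation (assumed bin width and
# consensus threshold; the timestamps are fabricated for illustration).
import numpy as np

def consensus_level(coder_times: list[list[float]], duration: float,
                    bin_s: float = 0.5) -> np.ndarray:
    """Count how many coders gave feedback in each time bin of the story."""
    bins = np.arange(0.0, duration + bin_s, bin_s)
    level = np.zeros(len(bins) - 1, dtype=int)
    for times in coder_times:
        hit, _ = np.histogram(times, bins=bins)
        level += (hit > 0).astype(int)   # each coder counts at most once per bin
    return level

# Feedback times (seconds) from several crowdworkers "listening" to one speaker.
coders = [[2.1, 6.8, 11.4], [2.3, 11.6], [6.5, 11.2], [11.5]]
level = consensus_level(coders, duration=15.0)

# Bins where a majority of coders agree become positive backchannel labels.
threshold = len(coders) / 2
labels = (level >= threshold).astype(int)
print(level, labels, sep="\n")
```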

12.3.4 SimSensei

Perhaps the biggest limitation (and concern) about the abovementioned rapport agents is that they can create the illusion of understanding without any actual understanding. While this can be a useful experimental tool for studying rapport, it can be problematic in an actual application. Our most recent research on the SimSensei agent attempted to correct this by integrating the capabilities of the Rapport Agent with a state-of-the-art dialog system [DeVault et al. 2014]. SimSensei was designed to create interactional situations favorable to the automatic assessment of distress indicators, defined as verbal and nonverbal behaviors correlated with depression, anxiety, or post-traumatic stress disorder (PTSD). The agent would simulate a mental health screening by interviewing participants about the challenges they may be facing in life. The agent implemented dialog management using the FLoReS dialog manager [Morbini et al. 2014]. This grouped interview questions into several phases, including initial chitchat designed to enhance rapport, followed by a set of diagnostic questions, and ending with a “cool-down” phase.

Active dialog management allowed the agent to ask follow-up questions. It also allowed the agent to incorporate verbal and valenced backchannel feedback. For example, rather than simply nodding at backchannel opportunity points, the agent could respond “that's great” or “I'm sorry” depending on what it recognized. In practice, this feedback had to be used sparingly due to the current limits of natural language understanding. Nonverbal feedback and gestures were substantially improved over the abovementioned systems by using the Cerebella nonverbal behavior generation system [Marsella et al. 2013]. As discussed in our empirical findings, incorporating rapport behaviors into a computer program proved an effective way to reduce participants' fear of being judged and increased their comfort in disclosing symptoms of mental illness.
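As a toy illustration of this kind of phase-structured interview with sparing, valenced verbal feedback, the sketch below walks through a fixed question list and chooses a short verbal response only when the valence of an answer seems clear. The phases, questions, phrases, and keyword-based valence check are invented for illustration; this is not the FLoReS dialog manager or the SimSensei question set.

```python
# Toy phase-structured interview with sparing, valenced verbal backchannels
# (all questions, phrases, and the naive keyword valence check are assumptions).
import re
from typing import Callable, Optional

PHASES = {
    "rapport":    ["Where are you from originally?"],
    "diagnostic": ["How have you been sleeping lately?",
                   "What are you like when you don't get enough sleep?"],
    "cool_down":  ["What are you most proud of?"],
}

POSITIVE = {"great", "good", "proud", "happy"}
NEGATIVE = {"tired", "sad", "awful", "stressed"}

def valenced_feedback(answer: str) -> Optional[str]:
    """Return a short verbal backchannel only when valence seems clear."""
    words = set(re.findall(r"[a-z']+", answer.lower()))
    if words & POSITIVE:
        return "That's great."
    if words & NEGATIVE:
        return "I'm sorry to hear that."
    return None   # otherwise fall back on a nonverbal nod

def run_interview(get_answer: Callable[[str], str]) -> None:
    """Walk through the interview phases in order, giving sparing feedback."""
    for phase, questions in PHASES.items():
        for question in questions:
            feedback = valenced_feedback(get_answer(question))
            print(f"[{phase}] {question} -> {feedback or '(nod)'}")

if __name__ == "__main__":
    canned = iter(["I'm from Ohio.", "Not well, I feel stressed.",
                   "Irritable, mostly.", "Finishing my degree, I'm proud of that."])
    run_interview(lambda q: next(canned))
```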

12.4 Empirical Findings

The above-described rapport agents exhibit the nonverbal behaviors that occur in natural rapportful conversation, but are people influenced by computer-generated behaviors? In a series of empirical studies, we have used rapport agents as methodological tools to explore the role of nonverbal behaviors in human and human–computer interaction. These studies clearly demonstrate that synthetic rapport behaviors alter participant feelings, impressions, and behavior, and that the strength of such effects is mediated by the same elements of positivity, contingency, and mutual attention that Tickle-Degnen and Rosenthal have posited for face-to-face interactions between people. Further, as posed by Reeves and Nass [1996] in their media equation, such effects seem to occur even when participants know that such behaviors are “merely” produced by a computer.

In the remainder of this chapter, we outline the basic experimental paradigm by which we explore such questions and then summarize the findings of several studies. We highlight different aspects of these findings. First, we describe the consequences, subjective and behavioral, that such contingent nonverbal behaviors have on human participants. Next, we discuss some of the factors that seem to mediate these effects. These include properties of the agent's behavior, dispositional factors on the part of participants, and situational factors surrounding the interaction, such as whether participants believe they are interacting with another participant or a computer program.

12.4.1 Experimental Paradigm

Figure 12.2 illustrates a typical experimental setup. Participants would sit in front of a rapport agent and be prompted (either by the experimenter or by the agent itself) to retell some previously experienced situation—in one series


of experiments, participants watched a short video and were then instructed to retell it to the agent in as much detail as possible. More recent experiments, for example with the SimSensei system, involved more interactive dialog, interviewing participants about their real-life experiences. In either situation, the agent displays some form of nonverbal feedback while the participant speaks, with the exact nature of the feedback being dictated by the specific experimental manipulation.

In these studies, participant rapport was assessed through a variety of subjective and behavioral measures. Subjective measures included scales assessing rapport, social presence [Biocca and Harms 2002], helpfulness, distraction, and naturalness (for the evolution of our own rapport scales, see https://rapport.ict.usc.edu/data/rapport-2006/RapportScales_summary_sKang_updated.pdf). Behavioral measures included the length people speak (as a measure of engagement), the fluency of their speech (e.g., how many repeated words, broken words, and filled pauses per minute), the depth of their disclosure, as well as facial expressions and eye-gaze patterns. Behavioral measures were assessed through a mixture of automatic annotation techniques and hand annotations by multiple coders.

Part of the power of computational models is that we can systematically manipulate aspects of the agent's appearance and nonverbal behavior, as well as prior beliefs about the situation. For example, Figure 12.6 (from Gratch et al. [2007]) illustrates a study that examined the impact of appearance (human vs. computer-generated human), behavior (human-generated vs. computer-generated), and contingency (“properly” timed vs. random listener feedback). Going clockwise in this figure from the upper left, this experiment compared face-to-face interaction (in which a visible human participant displayed natural listening behaviors), the Rapport Agent (which exhibited computer-generated behavior and appearance), a noncontingent version of the Rapport Agent (which exhibited behaviors identical to the Rapport Agent in terms of their frequency and dynamics, but not contingent on the behaviors of the speaker), and a “mediated agent” (in which a real participant's listening behaviors were displayed on a computer-generated avatar). Other research has manipulated beliefs about the situation, such as whether the participant believes they are interacting with another human (represented by an avatar) or with an autonomous agent (e.g., von der Pütten et al. [2009b] and Lucas et al. [2014]). These beliefs were shaped by prior instruction. For example, in an experiment involving SimSensei, participants were provided a detailed explanation of how the avatar or agent software worked, as seen in Figure 12.7.
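Several of these behavioral measures lend themselves to simple automatic computation. The sketch below derives word count, filled pauses per minute, and immediate word repetitions per minute from a transcript; the filled-pause lexicon and the exact operationalizations are assumptions made for illustration rather than the definitions used in the studies.

```python
# Sketch of simple engagement/fluency measures from a transcript
# (the filled-pause lexicon and definitions are illustrative assumptions).
import re

FILLED_PAUSES = {"um", "uh", "er", "hmm"}

def speech_measures(transcript: str, duration_min: float) -> dict:
    """Compute rough engagement and disfluency rates for one speaker turn."""
    words = re.findall(r"[a-zA-Z']+", transcript.lower())
    filled = sum(w in FILLED_PAUSES for w in words)
    repeats = sum(a == b for a, b in zip(words, words[1:]))   # immediate repetitions
    return {
        "word_count": len(words),                      # rough engagement measure
        "words_per_min": len(words) / duration_min,
        "filled_pauses_per_min": filled / duration_min,
        "repeated_words_per_min": repeats / duration_min,
    }

print(speech_measures("So um the the coyote uh falls off the cliff again", 0.2))
```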

Figure 12.6 Graphical depiction of the four conditions (face-to-face, contingent agent, non-contingent agent, and mediated). The actual face-to-face condition is illustrated on the lower left and the setup for the other three conditions on the lower right.

12.4.2 Social Effects of the Rapport Agent

The picture emerging from a series of studies is that the Rapport Agent elicits beneficial social effects from human participants similar to what can be found in rapportful face-to-face interactions. Broadly, a rapport agent's behavior shapes subjective feelings:

∙ Greater feelings of self-efficacy [Kang et al. 2008a]
∙ Less tension [Wang and Gratch 2010] and less embarrassment [Kang et al. 2008a]
∙ Greater feelings of rapport [Wang and Gratch 2010]
∙ A greater sense of mutual awareness [von der Pütten et al. 2009a]

∙ Greater feelings of trustworthiness on the part of the agent [Kang et al. 2008a]

Figure 12.7 Graphical depiction of the two belief conditions. Participants were told that they were interacting with a virtual agent controlled by human operators (left) or told it was driven by a computer program that used speech and vision (right).

And a rapport agent's behavior shapes human behavior:

∙ More disclosure of information including longer interaction times and more words elicited [Gratch et al. 2006, 2007, von der Pütten et al. 2009a, Wang and Gratch 2010]
∙ More fluent speech [Gratch et al. 2006, 2007, von der Pütten et al. 2009a, Wang and Gratch 2010]
∙ More mutual gaze [Wang and Gratch 2010]
∙ Fewer negative facial expressions [Wang and Gratch 2009b]
∙ Improvement in performance when the agent supervises them taking an academic test [Karacora et al. 2012, Krämer et al. 2016]
∙ Satiates belonging needs among those with chronically high need to belong, thus reducing intentions to seek social interaction with others after interacting with the Rapport Agent [Krämer et al. 2018]

But besides verifying these general effects, our research has sought to illuminate factors that mediate or moderate these relationships and, more generally, to explore the validity of alternative theoretical constructs for interpreting these results and guiding future agent design. We organize this research review around three basic questions. First, what properties of agents are necessary or sufficient

for promoting rapport? Second, what characteristics of people lead them to be more or less influenced by the agent’s behavior? Finally, we consider the more general question of the usefulness of social psychological theory (which was developed to explain human-to-human interaction) as a framework for guiding the design of computer systems. Blascovich [2014] suggests that interactions might unfold very differently depending on whether people believe they are interacting with computers or through computers (i.e., the Rapport Agent might have different social effects depending on if participants believed its behavior was generated by a computer or if they believed the behavior corresponded to the movements of an actual human listener). This last question, depending on the answer, could have profound effects for the value of interdisciplinary research on social artifacts. 12.4.2.1 Characteristics of the Agent that Impact Rapport If we adopt the former perspective and apply social psychological theory directly, Tickle-Degnen and Rosenthal’s theory argues that three broad characteristic of agent behavior should promote the establishment of rapport between participants and agents. These factors include positivity, meaning that rapport will be enhanced by positive nonverbal cues including positive facial expressions and encouraging feedback such as head nods; coordination, meaning that rapport will increase as the behavior of one participant is perceived as contingent upon (i.e., causally related to) the behavior of the other; and mutual attentiveness, meaning that rapport will increase as participants attend to each other nonverbally, for example, through mutual gaze. We also consider an additional factor: anonymity, the sense that one’s identity is protected. While direct face-to-face human interactions does not—as a rule—allow for anonymity (and it was therefore not pertinent for Tickle-Degnen and Rosenthal), research with other means of communication suggest that it should also promote the establishment of rapport between participants and agents. Thus, our empirical studies have sought to manipulate these four factors independently and observe their impact on rapport and participants’ subsequent behaviors, for example, disclosure of personal information. Positivity In face-to-face conversations, positivity is conveyed through a variety of nonverbal signals such as facial expressions and head nods. To explore the impact of positivity, we have operationalized this dimension in terms of the presence or absence of head nods and facial expressions. Our findings illustrate that the presence of listener nods significantly enhances feelings and behavioral manifestations of rapport [Wang and Gratch 2010]. The strength of this effect seems moderated by the perceived contingency of the nods


[Kang et al. 2008a] and dispositional factors of the participant [von der Pütten et al. 2010], as will be discussed below. We also explored whether participant facial expressions could be indicators of rapport, specifically via communicating positivity [Wang and Gratch 2009b]. We looked at participants’ facial expressions that were analyzed using the Facial Action Coding System and the Computer Expression Recognition Toolbox [Whitehill et al. 2008]. In both of human-to-human and human-to-agent interactions, more positive facial expressions were associated with greater rapport. Most recently, we have examined how an agent that mimics facial expressions in the context of a prisoner’s dilemma game can foster rapport [Hoegen et al. 2018]. Collectively, these findings are in line with Tickle-Degnen and Rosenthal’s predictions. Coordination Coordination occurs when two participants’ behaviors are mutually related in a timely and positive manner. We have operationalized this factor by manipulating whether behaviors (such as nods or posture shifts) were generated in response to the participant’s behavior (a coordinated condition) or in response to some unrelated factor (an uncoordinated condition). For example, in one study, we created non-contingent behavior by showing a participant the same nonverbal behavior that was generated for the previous participant. This “yoked” experimental design only breaks contingency, and still ensures a similar frequency and distribution of behaviors are generated. Overall, participants exhibit more subjective and behavior characteristics of rapport when coordination is present. For example, breaking the contingency of nonverbal feedback leads participants to talk less and produced more disfluent speech [Gratch et al. 2007]. These effects were especially strong in participants that scored high in a scale of social anxiety, as such participants, in addition to these behavioral effects, feel less subjective rapport and greater embarrassment [Kang et al. 2008a]. We further explored the effects of coordination on learning and found some evidence that coordination helps improve a speaker’s retention of the event they are discussing [Wang and Gratch 2009a]. Collectively, these findings are in line with Tickle-Degnen and Rosenthal’s predictions. Mutual attentiveness Mutual attentiveness occurs in a conversation when both participants attend closely to each other’s words or movements. In our empirical research, we have operationalized this concept in terms of gaze—for example, are participants looking at each other during a conversation. Prior research in gaze suggests there should be a curvilinear relationship between gaze and feelings of rapport. In other words, continuously staring at another person or completely avoiding their gaze would tend to be disruptive but, short of these extremes, rapport should be enhanced by more visual attention. Consistent with this, we found

a similar curvilinear relationship between gaze and rapport. For example, an agent that continuously gazed at the participant without other accompanying timely positive gestures (e.g., nodding) caused more distractions, less rapport, and more speech disfluency in storytelling interaction [Wang and Gratch 2010]. Similarly, an agent that failed to maintain gaze with the participant was equally disruptive. Collectively, these findings are in line with Tickle-Degnen and Rosenthal’s predictions. Anonymity People feel a sense of anonymity when they believe that their identity is protected. In our empirical research, we have found that manipulating a sense of anonymity can encourage positive downstream consequences of rapport such as disclosure of personal information. Indeed, research has shown that greater feelings of rapport lead people to disclose more [Miller et al. 1983, Burgoon et al. 2016, Hall et al. 1995, Gratch et al. 2007]. Like rapport, anonymity also increases disclosure of personal information (for a review, see Weisband and Kiesler [1996]). Because they are anonymous, computer programs give people a “sense of invulnerability to criticism, an illusion of privacy, the impression that responses ‘disappear’ into the computer” [Weisband and Kiesler 1996], and therefore, because rapport agents might be perceived as anonymous, they have the potential to increase disclosure of personal information as well [Sebestik et al. 1988, Thornberry et al. 1991, Baker 1992, Beckenbach 1995, Joinson 2001]. Rapport agents therefore have the potential to use both their rapport-building capabilities and the sense of anonymity they foster as two “routes” to encourage people to open up and disclose more personal information. Indeed, when rapport agents use rapport-building behaviors contingently, they are able to prompt disclosure from interviewees [Gratch et al. 2013]. Even when used as interviewers in clinical settings, virtual agents get users to share more personal information when people believe they are operated by a computer than a human [Lucas et al. 2014, Pickard et al. 2016, Mell et al. 2017]. Because conversation with intelligent virtual agents are typically experienced as more anonymous than similar conversations with humans, users seem to be more comfortable disclosing highly sensitive information, like their mental health, and on questions that could lead them to admit something stigmatized or otherwise negative. For example, during a clinical interview with a virtual character, participants disclose more personal details when they are told that the character is controlled by artificial intelligence than when they are told that the character is operated by a person in another room [Lucas et al. 2014]. In 2016, Pickard and colleagues reported that individuals are more comfortable disclosing to an automated interviewer than a human interviewer.


However, additional research in this line demonstrates the value of having the anonymous agent build rapport. We also examined [Lucas et al. 2017] whether a rapport agent could increase disclosure of mental health symptoms among activeduty service members and veterans. Replicating prior work showing that service members report more symptoms of PTSD when they anonymously answer the Post-Deployment Health Assessment (PDHA) symptom checklist compared to the official PDHA that goes on their permanent record [Warner et al. 2011], service members reported more symptoms during a conversation with an intelligent virtual interviewer than on the official PDHA. This demonstrated the importance of anonymity for disclosure of personal information. However, across two studies, active duty and retired service members also reported more symptoms to a rapport agent (SimSensei) than on an anonymized version of the PDHA. Given that the virtual intelligent interviewer and anonymized version of the PDHA were equally anonymous but only the virtual intelligent interviewer could evoke rapport, this work establishes the idea that rapport has an impact on self-disclosure above and beyond anonymity. Therefore, socially intelligent interviewers that build rapport may provide a superior option to encourage disclosure of personal information. Pragmatically, this finding makes the case for taking advantage of the value that rapport-building holds for increasing disclosure of information rather than just relying on anonymity. An important qualification to the impact of anonymity may be that it depends on the social situation. Specifically, in contexts where they might be judged, people show more social effects with an agent than an avatar. Indeed, as described above, research has found that people respond more socially to agents than avatars in settings where they feel judged: clinical interviews about their mental health [Slack and Van Cura 1968, Lucas et al. 2014, 2017, Pickard et al. 2016], but they also do so when interviewed about their personal financial situation [Mell et al. 2017]. Some of this work shows that people are more comfortable with agents only if they are concerned about being judged [Pickard et al. 2016]. In contrast, when rapport agents engage with users on other tasks where judgement is less likely, the results are more complicated. First, during persuasive conversations, rapport agents are more persuasive than human-operated agents as long as they give off cues that they are competent [Khooshabeh and Lucas 2018, Lucas et al. 2019]. Indeed, if their competence is called into question (e.g., by making repeated errors), rapport agents have less social influence than agents without rapport-building capabilities [Lucas et al. 2018a, 2018b]. However, this only seems to occur if rapport-building occurs before the persuasive conversation but not after [Lucas et al. 2018b], and only among participants acculturated in the U.S. [Lucas et al. 2018a]. Additionally, we did not find that people were more comfortable

with agents than human-operated agents in a personal training context [Lucas et al. 2018c]. While we expected participants using our virtual trainer to be concerned about being judged, like in the clinical and financial interview contexts, and thus respond more positively to the virtual intelligent agent than human-operated agents, they did not. In this work, there are also implications for rapport with virtual trainers: people felt more rapport when they believed the virtual trainer was operated by a human than when they believed it was a virtual intelligent agent. Some studies in other contexts where people could be concerned about being judged, such as when practicing negotiation, have shown that people are more comfortable with virtual intelligent agents than human-operated agents [Gratch et al. 2016], while others have not [Gratch et al. 2015]. 12.4.2.2

Characteristics of Participants that Impact Rapport The previous studies and findings emphasized the impact of differences in the agent’s behavior, but numerous social psychological studies emphasize that the trajectory of a social interaction is heavily shaped by the “baggage” people bring to a situation. In human-to-human interactions, people who are extroverted will more easily establish rapport than introverts and we might expect that these dispositional tendencies will carry over into interactions with rapport agents. Indeed, in a series of studies we have found that dispositional factors shape interactions with socially intelligent agents in similar ways to how they influence face-to-face interactions between people. Indeed, several dispositional factors have been found to influence the effectiveness of rapport agents. With respect to the Big Five, extroversion and agreeableness influence interactions in ways that are consistent with their impact in human-tohuman interactions. Extroverts tend to talk more, more fluently, and feel better about their interaction, and similar findings hold for participants that score high in agreeableness [Kang et al. 2008b, von der Pütten et al. 2010]. Social anxiety also plays a moderating role: interestingly, we found dispositionally anxious subjects felt more trust toward a rapport agent than in their interactions with human conversational partners [Kang et al. 2008a]. Dispositional factors don’t uniquely determine the outcome of a social interaction, but rather interact with aspects of the situation. For example, someone who is confident in social situations might perform well regardless of the behavior of their conversational partner. However, someone who is less secure might seek constant reassurance by carefully monitoring their partner’s nonverbal feedback: if this feedback is positive, they may perform well and report positive feelings; if this feedback is negative, the opposite may occur. We see similar interaction effects with rapport agents. For example, extroverts seem insensitive to manipulations


that impact the quality of a rapport agent’s nonverbal feedback, whereas participants who score high in social anxiety are quite disrupted when they fail to receive positive and coordinated nonverbal feedback [Kang et al. 2008a]. Overall, our studies suggest that both agent behavior and dispositional factors interact to determine the overall quality of an experience with a socially intelligent agent. Further, these effects are largely consistent with predictions from the social psychological literature on human-to-human interactions.

12.5 Discussion and Conclusion

Across a wide range of studies, we have consistently shown that simple nonverbal cues on the part of a computer program can provoke a wide range of beneficial subjective and behavioral outcomes. Overall, our studies and related findings provide substantial evidence that the nonverbal behavior of socially intelligent agents influences the behavior of the humans who interact with them in ways that are consistent with psychological findings on human-to-human interaction. Further, these effects increase as a virtual agent exhibits more human-like behavioral characteristics. More specifically, studies support Tickle-Degnen and Rosenthal's claims that rapport is promoted by social behavioral cues that are positive, contingent, and convey mutual attention, and that these effects are moderated by the personality traits of the human user.

Despite the apparent success of rapport agents, one should be cautious before concluding that people will always be so easily manipulated by simple nonverbal behaviors. Behaviors such as nods or smiles might trigger automatic responses and simple social inferences but are otherwise limited. Our experimental settings (storytelling and interviews) are, by design, simple for agents to navigate. For example, the Rapport Agent conveys understanding without actual understanding, a behavior that most of us have engaged in from time to time (for example, when carrying on a conversation in an unfamiliar language or in a noisy room), but such a charade only goes so far before it ends in embarrassment. In a similar storytelling paradigm, Janet Bavelas illustrated that “generic” feedback (similar to what the Rapport Agent provides) is easy to produce without actually attending to the meaning of a conversation—she had participants listen while solving complex mathematical problems—but at certain points speakers need more meaningful “specific” feedback. Bavelas et al. [2000] found that when speakers were telling personally emotional stories they expected emotional feedback at key dramatic moments. When they failed to receive it, they felt embarrassment and had difficulty proceeding with their story. Even more generic feedback requires some level of natural language sophistication. According to theories of conversational grounding, speakers in a conversation expect frequent and incremental feedback

from listeners that their communication is understood [Traum 1994, Nakano et al. 2003]. When listeners provide grounding feedback, speech can proceed fluently and presumably with a greater sense of rapport. Such feedback often takes the form of nods, such as those produced by our rapport agents, and presumably speakers are (incorrectly) interpreting these nods as grounding cues. This illusion can be maintained to an extent, but it will eventually backfire and lead participants to view such feedback with suspicion. SimSensei, with its incorporation of a dialog manager, does have the ability to provide some cognitive and emotional feedback, but in practice this had to be used sparingly due to recognition errors. Thus, the ability to provide rapid and meaningful feedback remains a challenge for rapport agents.

Part of our research can be seen as pushing the boundary of just how far one can go with simple contingent feedback. Our early studies explored “safe” and impersonal content such as cartoons. Over time we took on more challenging settings such as mental health screenings. At each stage, we continue to show robust subjective and behavioral effects of contingent positive feedback; however, we must be careful before concluding that the agent is performing well. What we've shown is that the agent performs about as well when discussing personal matters with a stranger (something many people are uncomfortable with) and better than an agent that provides no or negative feedback. While important, this is a low bar, and much can be done to improve the performance and effects of such agents. Future research must extend beyond such “mindless” feedback (i.e., feedback without deep understanding) to ensure that responses are aligned with the underlying grounding and inferential mechanisms of social agents. Indeed, research on emotional expressions emphasizes that people assume these expressions reflect an agent's “appraisal” of the current social situation [Hareli and Hess 2010, de Melo et al. 2014b]. An agent that fails to make these expressions contingent on the meaning of conversational actions will certainly reduce rapport. Thus, future work must better align nonverbal behaviors with the underlying cognitive machinery of agents. Alas, this point has long been recognized by the intelligent virtual agent community (see Gratch et al. [2002]), but it has proven stubbornly hard to achieve.


such as greater disclosure in mental health interviews, but such artifacts also provide the means to experimentally tease apart these factors in ways that avoid the disadvantages and potential confounds introduced by more traditional psychological methods (such as the use of human confederates). Thus, social theory has allowed us to build a better computer and return the favor through experimental support for theory. This is a true partnership between the social and computational sciences of social emotions.

References A. Abbe and S. E. Brandon. 2013. The role of rapport in investigative interviewing: A review. J. Investig. Psychol. Offender Profiling. 10, 3, 237–249. DOI: https://doi.org/10.1002/jip.1386. H. Adam, A. Shirako, and W. W. Maddux. 2010. Cultural variance in the interpersonal effects of anger in negotiations. Psychol. Sci. 21, 6, 882–889. DOI: https://doi.org/10.1177/ 0956797610370755. N. Ambady, F. J. Bernieri, and J. A. Richeson. 2000. Toward a histology of social behavior: Judgment accuracy from thin slices of the behavioral stream. In Advances in Experimental Social Psychology. Vol. 32. Academic Press, San Diego, CA, 201–271. J. Bailenson and N. Yee. 2005. Digital chameleons: Automatic assimilation of nonverbal gestures in immersive virtual environments. Psychol. Sci. 16, 814–819. J. Bailenson, A. Beall, J. Loomis, J. Blascovich, and M. Turk. 2004. Transformed social interaction: Decoupling representation from behavior and form in collaborative virtual environments. Presence: Teleop. Virt. Environ. 13, 4, 428–441. DOI: https://doi.org/10.1162/ 1054746041944803. R. P. Baker. 1992. New technology in survey research: Computer-assisted personal interviewing (CAPI). Soc. Sci. Comput. Rev. 10, 2, 145–157. DOI: https://doi.org/10.1177/08944393 9201000202. J. A. Bargh, M. Chen, and L. Burrows. 1996. Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. J. Pers. Soc. Psychol. 71, 2, 230–244. DOI: https://doi.org/10.1037/0022-3514.71.2.230. J. B. Bavelas, L. Coates, and T. Johnson. 2000. Listeners as co-narrators. J. Pers. Soc. Psychol. 79, 6, 941–952. DOI: https://doi.org/10.1037/0022-3514.79.6.941. A. Beckenbach. 1995. Computer-assisted questioning: The new survey methods in the perception of the respondents. Bull. Socioll Methodol/Bulletin de Méthodologie Sociologique. 48, 1, 82–100. DOI: https://doi.org/10.1177/075910639504800111. G. Bente, N. C. Kraemer, A. Petersen, and J. P. de Ruiter. 2001. Computer animated movement and person perception: Methodological advances in nonverbal behavior research. J. Nonverbal. Behav. 25, 3, 151–166. K. Bergmann and S. Kopp. 2012. Gestural alignment in natural dialogue. In Proceedings of the Annual Meeting of the Cognitive Science Society. F. J. Bernieri. 1988. Coordinated movement and rapport in teacher–student interactions. J. Nonverbal. Behav. 12, 2, 120–138. DOI: https://doi.org/10.1007/BF00986930.


F. J. Bernieri and R. Rosenthal. 1991. Interpersonal coordination: Behavior matching and interactional synchrony. In R. S. Feldman and B. Rimé (Eds.), Fundamentals of Nonverbal Behavior. Cambridge University Press, Cambridge. F. Biocca and C. Harms. 2002. Defining and measuring social presence: Contribution to the networked minds theory and measure. In Proceedings of the 5th International Workshop on Presence. J. Blascovich. 2014. Challenge, threat, and social influence in digital immersive virtual environments. Social Emotions in Nature and Artefact. Oxford University Press, New York, 44–54. S. A. Borrie, T. S. Barrett, M. M. Willi, and V. Berisha. 2019. Syncing up for a good conversation: A clinically meaningful methodology for capturing conversational entrainment in the speech domain. J. Speech Lang. Hear. Res. 62, 2, 283–296. DOI: https://doi.org/10.1044/ 2018_JSLHR-S-18-0210. S. E. Brennan. 1996. Lexical entrainment in spontaneous dialog. Proc. ISSD, 96, 41–44. J. K. Burgoon. 1983. Nonverbal violations of expectations. In J. M. Wiemann and R. P. Harrison (Eds.), Nonverbal Interaction. Sage, Beverly Hills, CA, 11–77. J. K. Burgoon, L. A. Stern, and L. Dillman. 1995. Interpersonal Adaptation: Dyadic Interaction Patterns. Cambridge University Press, Cambridge. J. K. Burgoon, L. K. Guerrero, and V. Manusov. 2016. Nonverbal Communication. Routledge. M. Burns. 1984. Rapport and relationships: The basis of child care. J. Child Care. 2, 47–57. J. Cassell, A. Gill, and P. Tepper.2007. Conversational coordination and rapport. In Proceedings of Workshop on Embodied Language, Processing at ACL 2007. Prague, CZ. J. E. Charny. 1966. Psychosomatic manifestations of rapport in psychotherapy. Psychosom. Med. 28, 4, 305–315. DOI: https://doi.org/10.1097/00006842-196607000-00002. T. L. Chartrand and J. A. Bargh. 1999. The chameleon effect: The perception–behavior link and social interaction. J. Pers. Soc. Psychol. 76, 6, 893–910. DOI: https://doi.org/10.1037/ 0022-3514.76.6.893. J. Coates. 1989. Gossip revisited: Language in all-female groups. Women in Their Speech Communities. 94–122. J. W. Cogger. 1982. Are you a skilled interviewer? Pers. J. 61, 840–843. I. de Kok and D. Heylen. 2011. The MultiLis corpus—Dealing with individual differences in nonverbal listening behavior. In Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces, Theoretical and Practical Issues. Springer, 362–375. C. de Melo, P. J. Carnevale, and J. Gratch. 2014a. Using virtual confederates to research intergroup bias and conflict. In 74th Annual Meeting of the Academy of Management. Philadelphia, PA. DOI: https://doi.org/i.org/10.5465/ambpp.2014.62. C. de Melo, P. J. Carnevale, S. J. Read, and J. Gratch. 2014b. Reading people’s minds from emotion expressions in interdependent decision making. J. Pers. Soc. Psychol. 106, 1, 73–88. DOI: https://doi.org/10.1037/a0034251. D. DeVault, K. Sagae, and D. Traum. 2011. Incremental interpretation and prediction of utterance meaning for interactive dialogue. Dialogue Discourse. 2, 1, 143–170.

456

Chapter 12 Rapport Between Humans and Socially Interactive Agents

D. DeVault, R. Artstein, G. Benn, T. Dey, E. Fast, A. Gainer, and L.-P. Morency. 2014. SimSensei kiosk: A virtual human interviewer for healthcare decision support. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems. Paris, France. A. L. Drolet and M. W. Morris. 2000. Rapport in conflict resolution: Accounting for how face-to-face contact fosters mutual cooperation in mixed-motive conflicts. Exp. Soc. Psychol. 36, 26–50. DOI: https://doi.org/10.1006/JESP.1999.1395. S. Duncan, A. Franklin, F. Parrill, H. Welji, I. Kimbara, and R. Webb. 2007. Cognitive processing effects of ‘social resonance’ in interaction. In Proceedings Gesture 2007—The Conference of the Int. Society of Gesture Studies. J. P. Forgas. 2007. The use of computer-mediated interaction in exploring affective influences on strategic interpersonal behaviours. Comput. Hum. Behav. 23, 2, 901–919. DOI: https://doi.org/10.1016/j.chb.2005.08.010. B. L. Fredrickson. 2016. Love: Positivity resonance as a fresh, evidence-based perspective on an age-old topic. In Handbook of Emotions. 847–858. H. Friedberg, D. Litman, and S. B. F. Paletz. 2012. Lexical entrainment and success in student engineering groups. In 2012 IEEE Spoken Language Technology Workshop (SLT). DOI: https://doi.org/10.1109/slt.2012.6424258. D. Fuchs. 1987. Examiner familiarity effects on test performance: Implications for training and practice. Top. Early Child. Spec. Educ. 7, 90–104. DOI: https://doi.org/10.1177/ 027112148700700309. S. Fujie, Y. Ejiri, K. Nakajima, Y. Matsusaka, and T. Kobayashi. 2004. A conversation robot using head gesture recognition as para-linguistic information. In 13th IEEE International Workshop on Robot and Human Communication. DOI: https://doi.org/10.1109/ROMAN. 2004.1374748. S. B. Goldberg. 2005. The secrets of successful mediators. Negot. J. 21, 3, 365–376. DOI: http s://doi.org/10.1111/j.1571-9979.2005.00069.x. J. Gratch, J. Rickel, E. André, J. Cassell, E. Petajan, and N. Badler. July/August. 2002. Creating interactive virtual humans: Some assembly required. IEEE Intell. Syst. 17, 4, 54–61. J. Gratch, A. Okhmatovskaia, F. Lamothe, S. Marsella, M. Morales, R. van der Werf, and L.-P. Morency. 2006. Virtual rapport. In 6th International Conference on Intelligent Virtual Agents. Marina del Rey, CA. DOI: https://doi.org/10.1007/11821830_2. J. Gratch, N. Wang, J. Gerten, and E. Fast. 2007. Creating rapport with virtual agents. In 7th International Conference on Intelligent Virtual Agents. Paris, France. DOI: https://doi.org/ 10.1007/978-3-540-74997-4_12. J. Gratch, S.-H. Kang, and N. Wang. 2013. Using social agents to explore theories of rapport and emotional resonance. In J. Gratch and S. Marsella (Eds.), Social Emotions in Nature and Artifact. Oxford University Press. DOI: https://doi.org/10.1093/acprof:oso/ 9780195387643.003.0012.

References

457

J. Gratch, D. Devault, G. Lucas, and S. Marsella. 2015. Negotiation as a challenge problem for virtual humans. In 15th International Conference on Intelligent Virtual Agents. Delft, The Netherlands. DOI: https:/doi.org/10.1007/978-3-319-21996-7_21. J. Gratch, D. DeVault, and G. Lucas. 2016. The benefits of virtual humans for teaching negotiation. In 16th International Conference on Intelligent Virtual Agents. Los Angeles, CA. DOI: https://doi.org/10.1007/978-3-319-47665-0_25. D. D. Gremler and K. P. Gwinner. 2000. Customer–employee rapport in service relationships. J. Serv. Res. 3, 1, 82–104. DOI: https://doi.org/10.1177/109467050031006. J. A. Hall, J. A. Harrigan, and R. Rosenthal. 1995. Nonverbal behavior in clinician–patient interaction. Appl. Prevent. Psychol. 4, 1, 21–37. DOI: https://doi.org/10.1016/s0962-1849(05) 80049-6. S. Hareli and U. Hess. 2010. What emotional reactions can tell us about the nature of others: An appraisal perspective on person perception. Cogn. Emot. 24, 1, 128–140. DOI: https://doi.org/10.1080/02699930802613828. R. Hoegen, J. V. D. Schalk, G. Lucas, and J. Gratch. 2018. The impact of agent facial mimicry on social behavior in a prisoner’s dilemma. In Proceedings of the 18th International Conference on Intelligent Virtual Agents. Sydney, NSW, Australia. DOI: https://doi.org/10.1145/ 3267851.3267911. D. Horton and R. R. Wohl. 1956. Mass communication and para-social interaction: Observations on intimacy at a distance. Psychiatry 19, 215–229. DOI: https://doi.org/10.1080/ 00332747.1956.11023049. L. Huang. 2013. Parasocial Consensus Sampling: Modeling Human Nonverbal Behaviors from Multiple Perspectives. University of Southern California, Los Angles, CA. L. Huang and J. Gratch. 2013. Explaining the variability of human nonverbal behaviors in face-to-face interaction. In International Workshop on Intelligent Virtual Agents. DOI: https://doi.org/10.1007/978-3-642-40415-3_24. L. Huang, L.-P. Morency, and J. Gratch. 2010. Learning backchannel prediction model from parasocial consensus sampling: A subjective evaluation. In J. Allbeck, N. Badler, T. Bickmore, C. Pelachaud, and A. Safonova (Eds.), Intelligent Virtual Agents, Vol. 6356. Springer Berlin, Heidelberg, 159–172. DOI: https://doi.org/10.1007/978-3-64215892-6_17. L. Huang, L.-P. Morency, and J. Gratch. 2011. Virtual rapport 2.0. In Proceedings of the 10th International Conference on Intelligent Virtual Agents. Reykjavik, Iceland. DOI: https://doi. org/10.1007/978-3-642-23974-8_8. A. N. Joinson. 2001. Self-disclosure in computer-mediated communication: The role of selfawareness and visual anonymity. Eur. J. Soc. Psychol. 31, 177–192. DOI: https://doi.org/10. 1002/ejsp.36. D. Julien, M. Brault, É. Chartrand, and J. Bégin. 2000. Immediacy behaviours and synchrony in satisfied and dissatisfied couples. Can. J. Behav. Sci./Revue canadienne des sciences du comportement 32, 2, 84. DOI: https://doi.org/10.1037/h0087103.

458

Chapter 12 Rapport Between Humans and Socially Interactive Agents

M. Kallmann and S. Marsella. 2005. Hierarchical motion controllers for real-time autonomous virtual humans. In 5th International Working Conference on Intelligent Virtual Agents. Kos, Greece. O. Kang and T. Wheatley. 2017. Pupil dilation patterns spontaneously synchronize across individuals during shared attention. J. Exp. Psychol. Gen. 146, 4, 569–576. DOI: https://doi. org/10.1037/xge0000271. S.-H. Kang, J. Gratch, N. Wang, and J. Watt. 2008a. Does contingency of agents’ nonverbal feedback affect users’ social anxiety? In 7th International Conference on Autonomous Agents and Multiagent Systems. Estoril, Portugal. S.-H. Kang, J. Gratch, N. Wang, and J. Watts. 2008b. Agreeable people like agreeable virtual humans. In 8th International Conference on Intelligent Virtual Agents. Tokyo. DOI: http:// dx.doi.org/10.1007/978-3-540-85483-8_26. B. Karacora, M. Dehghani, N. C. Krämer, and J. Gratch. 2012. The influence of virtual agents’ gender and rapport on enhancing math performance. In Proceedings of the Annual Meeting of the Cognitive Science Society. P. Khooshabeh and G. Lucas. 2018. Virtual human role players for studying social factors in organizational decision making. Front. Psychol. 9, 194. DOI: https://10.3389/fpsyg.2018. 00194. S. Kopp. 2010. Social resonance and embodied coordination in face-to-face conversation with artificial interlocutors. Speech Commun. 52, 6, 587–597. DOI: https://doi.org/10.1016/ j.specom.2010.02.007. S. Kopp, B. Krenn, S. Marsella, A. Marshall, C. Pelachaud, H. Pirker, K. R. Thórisson, and H. Vilhjálmsson. 2006. Towards a common framework for multimodal generation in ECAs: The behavior markup language. In Proceedings of 6th International Conference on Virtual Agents. Marina del Rey, CA. DOI: https://doi.org/10.1007/11821830_17. N. C. Krämer, B. Karacora, G. Lucas, M. Dehghani, G. Rüther, and J. Gratch. 2016. Closing the gender gap in STEM with friendly male instructors? On the effects of rapport behavior and gender of a virtual agent in an instructional interaction. Comput. Educ. 99, 1–13. DOI: https://doi.org/10.1016/j.compedu.2016.04.002. N. C. Krämer, G. Lucas, L. Schmitt, and J. Gratch. 2018. Social snacking with a virtual agent—On the interrelation of need to belong and effects of social responsiveness when interacting with artificial entities. Int. J. Hum. Comput. Stud. 109, Suppl. C, 112–121. DOI: https://doi.org/10.1016/j.ijhcs.2017.09.001. J. Lafferty, A. McCallum, and F. C. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning. J. L. Lakin, V. A. Jefferis, C. M. Cheng, and T. L. Chartrand. 2003. Chameleon effect as social glue: Evidence for the evolutionary significance of nonconscious mimicry. J. Nonverbal Behav. 27, 3, 145–162. DOI: https://doi.org/10.1023/A:1025389814290. J. T. Lanzetta and B. G. Englis. 1989. Expectations of cooperation and competition and their effects on observers’ vicarious emotional responses. J. Pers. Soc. Psychol. 56, 543– 554. DOI: https://doi.org/10.1037/0022-3514.56.4.543.

References

459

R. Levitan and J. Hirschberg. 2011. Measuring acoustic-prosodic entrainment with respect to multiple levels and dimensions. In 12th Annual Conference of the International Speech Communication Association. DOI: https://doi.org/10.7916/D8V12D8F. R. Levitan, A. Gravano, and J. Hirschberg. 2011. Entrainment in speech preceding backchannels. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers, Vol. 2. N. Lubold, E. Walker, H. Pon-Barry, and A. Ogan. 2018. Automated pitch convergence improves learning in a social, teachable robot for middle school mathematics. In Artificial Intelligence in Education. Cham. DOI: https://doi.org/10.1007/978-3-319-93843-1_21. G. Lucas, J. Gratch, A. King, and L.-P. Morency. 2014. It’s only a computer: Virtual humans increase willingness to disclose. Comput. Hum. Behav. 37, 94–100. DOI: https://doi.org/10. 1016/j.chb.2014.04.043. G. Lucas, A. Rizzo, J. Gratch, S. Scherer, G. Stratou, J. Boberg, and L.-P. Morency. 2017. Reporting mental health symptoms: Breaking down barriers to care with virtual human interviewers. Front. Robot. AI 4, 51. DOI: https://doi.org/10.3389/frobt.2017.00051. G. Lucas, J. Boberg, D. Traum, R. Artstein, J. Gratch, A. Gainer, E. Johnson, A. Leuski, and M. Nakano. 2018a. Culture, errors, and rapport-building dialogue in social agents. In Proceedings of the 18th International Conference on Intelligent Virtual Agents. Sydney, NSW, Australia. DOI: http://dx.doi.org/10.1145/3171221.3171258. G. Lucas, J. Boberg, D. Traum, R. Artstein, J. Gratch, A. Gainer, E. Johnson, A. Leuski, and M. Nakano. 2018b. Getting to know each other: The role of social dialogue in recovery from errors in social robots. In Proceedings of the 2018 ACM/IEEE International Conference on Human–Robot Interaction. Chicago, IL, USA. DOI: http://dx.doi.org/10.1145/3171221. 3171258. Lucas, N. C. Krämer, C. Peters, L.-S. Taesch, J. Mell, and J. Gratch. 2018c. Effects of perceived agency and message tone in responding to a virtual personal trainer. In Proceedings of the 18th International Conference on Intelligent Virtual Agents. Sydney, NSW, Australia. DOI: https://doi.org/10.1145/3267851.3267855. G. Lucas, J. Lehr, N. C. Krämer, and J. Gratch. 2019. The effectiveness of social influence tactics when used by a virtual agent. In Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents. Paris, France. S. Marsella, Y. Xu, M. Lhommet, A. Feng, S. Scherer, and A. Shapiro. 2013. Virtual character performance from speech. In Proceedings of the 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation. D. H. McFarland. 2001. Respiratory markers of conversational interaction. J. Speech Lang. Hear. Res. 44, 128–143. DOI: https://doi.org/10.1044/1092-4388(2001/012). G. H. Mead. 1934. Mind, Self and Society. University of Chicago Press, Chicago. J. Mell, G. Lucas, and J. Gratch. 2017. Prestige questions, online agents, and genderdriven differences in disclosure. In J. Beskow, C. Peters, G. Castellano, C. O’Sullivan, I. Leite, and S. Kopp (Eds.), Intelligent Virtual Agents. IVA 2017. Lecture Notes in Computer Science, Vol. 10498. Springer, Cham, IL. https://doi.org/10.1007/978-3-319-674018_36.

460

Chapter 12 Rapport Between Humans and Socially Interactive Agents

L. C. Miller, J. H. Berg, and R. L. Archer. 1983. Openers: Individuals who elicit intimate selfdisclosure. J. Personal. Soc. Psychol. 44, 6, 1234–1244. DOI: https://doi.org/10.1037/00223514.44.6.1234. F. Morbini, D. DeVault, K. Sagae, J. Gerten, A. Nazarian, and D. Traum. 2014. FLoReS: A forward looking, reward seeking, dialogue manager. In Natural Interaction with Robots, Knowbots and Smartphones. Springer, 313–325. DOI: http://dx.doi.org/10.3115/v1/W144334. L.-P. Morency, C. Sidner, C. Lee, and T. Darrell. 2005. Contextual recognition of head gestures. In 7th International Conference on Multimodal Interactions. Torento, Italy. DOI: http s://doi.org/10.1145/1088463.1088470. L.-P. Morency, I. de Kok, and J. Gratch. 2008. Predicting listener backchannels: A probabilistic multimodal approach. In 8th International Conference on Intelligent Virtual Agents. Tokyo. L.-P. Morency, I. de Kok, and J. Gratch. 2010. A probabilistic multimodal approach for predicting listener backchannels. Auton. Agent. Multi Agent Syst. 20, 1, 70–84. DOI: https://doi.org/10.1007/s10458-009-9092-y. Y. Nakano, G. Reinstein, T. Stocky, and J. Cassell. 2003. Towards a model of face-to-face grounding. In Annual Meeting of the Association for Computational Linguistics. Sapporo, Japan. DOI: http://dx.doi.org/10.3115/1075096.1075166. B. Parkinson. 2013. Processes of emotional meaning and response coordination. In Social Emotions in Nature and Artifact. 29–43. DOI: https://doi.org/10.1093/acprof:oso/ 9780195387643.003.0003. M. D. Pickard, C. A. Roster, and Y. Chen. 2016. Revealing sensitive information in personal interviews: Is self-disclosure easier with humans or avatars and under what conditions? Comput. Hum. Behav. 65, 23–30. DOI: https://doi.org/10.1016/j.chb.2016.08.004. B. Reeves and C. Nass. 1996. The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Cambridge University Press, New York. M. E. Roloff and D. E. Campion. 1987. Communication and reciprocity within intimate relationships. In M. E. Roloff and G. R. Miller (Eds.), Interpersonal Processes: New Directions in Communication Research. Sage, Beverly Hills, CA, 11–38. J. Sebestik, H. Zelon, D. DeWitt, J. O’Reilly, and K. McGowan. 1988. Initial experiences with CAPI. In Proceedings of the Bureau of the Census Fourth Annual Research Conference. H. Simon. 1969. The Sciences of the Artificial. MIT Press, Cambridge, MA. W. V. Slack and L. J. van Cura. 1968. Patient reaction to computer-based medical interviewing. Comput. Biomed. Res. 1, 5, 527–531. DOI: https://doi.org/10.1016/0010-4809(68)90018-9. M. Thiebaux, A. Marshall, S. Marsella, and M. Kallmann. 2008. SmartBody: Behavior realization for embodied conversational agents. In International Conference on Autonomous Agents and Multi-Agent Systems. Portugal. DOI: https://doi.org/10.1145/1402383.1402409. O. Thornberry, B. Rowe, and R. Biggar. 1991. Use of CAPI with the U.S. National Health Interview Survey. Bull. Méthodol. Sociol. 30, 1, 27–43. DOI: https://doi.org/10.1177/07591063 9103000103.

References

461

L. Tickle-Degnen and R. Rosenthal. 1990. The nature of rapport and its nonverbal correlates. Psychol. Inq. 1, 4, 285–293. DOI: https://doi.org/10.1207/s15327965pli0104_1. L. Z. Tiedens and A. R. Fragale. 2003. Power moves: Complementarity in dominant and submissive nonverbal behavior. J. Pers. Soc. Psychol. 84, 3, 558–568. DOI: https://doi.org/ 10.1037/0022-3514.84.3.558. D. Traum. 1994. A Computational Theory of Grounding in Natural Language Conversation. Ph.D. thesis. University of Rochester, Rochester, NY. K. P. Truong, R. Poppe, and D. Heylen. 2010. A rule-based backchannel prediction model using pitch and pause information. In 11th Annual Conference of the International Speech Communication Association. P. Tsui and G. L. Schultz. 1985. Failure of rapport: Why psychotherapeutic engagement fails in the treatment of Asian clients. Am. J. Orthopsychiatry 55, 561–569. DOI: https://doi.org/ 10.1111/j.1939-0025.1985.tb02706.x. A. von der Pütten, N. C. Krämer, and J. Gratch. 2009a. Who’s there? Can a virtual agent really elicit social presence? In 12th Annual International Workshop on Presence. Los Angeles. A. von der Pütten, N. C. Krämer, J. Gratch, and S.-H. Kang. 2009b. It doesn’t matter what you are! Comparing interacting with an autonomous virtual person with interacting with a virtually represented human. In Proceedings of the 6th Conference of the Media Psychology Division of the German Psychological Society. Lengerich. A. von der Pütten, N. C. Krämer, and J. Gratch. 2010. How our personality shapes our interactions with virtual characters—Implications for research and development. In J. Allbeck, N. Badler, T. Bickmore, C. Pelachaud, and A. Safonova (Eds.), Intelligent Virtual Agents, Vol. 6356. Springer, Berlin, Heidelberg, 208–221. N. Wang and J. Gratch. 2009a. Can a virtual human build rapport and promote learning? In 14th International Conference on Artificial Intelligence in Education. Brighton. DOI: https://doi.org/10.3233/978-1-60750-028-5-737. N. Wang and J. Gratch. 2009b. Rapport and facial expression. In International Conference on Affective Computing and Intelligent Interaction. Amsterdam. N. Wang and J. Gratch. 2010. Don’t just stare at me. In 28th Annual CHI Conference on Human Factors in Computing Systems. Atlanta. DOI: https://doi.org/10.1145/1753326. 1753513. N. Ward and W. Tsukahara. 2000. Prosodic features which cue back-channel responses in English and Japanese. 23, 1177–1207. DOI: https://doi.org/10.1016/S0378-2166(99)00109-5. C. H. Warner, G. N. Appenzeller, T. Grieger, S. Belenkiy, J. Breitbach, J. Parker, C. Hoge, C. M. Warner. 2011. Importance of anonymity to encourage honest reporting in mental health screening after combat deployment. Arch. Gen. Psychiatry 68, 10, 1065–1071. DOI: https://doi.org/10.1001/archgenpsychiatry.2011.112. S. Weisband and S. Kiesler. 1996. Self disclosure on computer forms: Meta-analysis and implications. In CHI ’96: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

462

Chapter 12 Rapport Between Humans and Socially Interactive Agents

H. Welji and S. Duncan. 2004. Characteristics of face-to-face interactions, with and without rapport: Friends vs. strangers. In Symposium on Cognitive Processing Effects of ‘Social Resonance’ in Interaction, 26th Annual Meeting of the Cognitive Science Society. J. Whitehill, M. S. Bartlett, and J. R. Movellan. 2008. Automatic facial expression recognition for intelligent tutoring systems. In CVPR 2008 Workshop on Human Communicative Behavior Analysis. Anchorage, Alaska. V. H. Yngve. 1970. On getting a word in edgewise. In 6th Regional Meeting of the Chicago Linguistic Society.

13 Culture for Socially Interactive Agents
Birgit Lugrin and Matthias Rehm

13.1 Motivation

Culture is well known to be a driving force in social cognition (e.g., Aronson et al. [2013] and Fiske and Taylor [2016]). Humans tend to construct their schemes and models of the world, as well as evaluate the behavior of others, based on their cultural upbringing. Although culture might not be the first factor that comes to mind when designing socially interactive agents (SIAs), there are important reasons why it should be considered.

∙ A SIA cannot be without cultural background. That is, if culture is not explicitly considered, the SIA will implicitly carry the cultural cues of its designer, as (s)he is the one who judges the SIA's naturalness. These cues can lie on the surface, such as clothing style, or manifest themselves more subtly, such as the choice of an "appropriate" spatial extent of conversational gestures for a young female agent.
∙ A mismatch of cultural backgrounds between the SIA and the user can cause misunderstandings. Consider, for example, communication management behaviors. In many cultures, such as Germany, interrupting a speaker is considered impolite, while in others, such as Hungary, interruptions are judged positively as a sign of increased interest in the ongoing conversation [Ting-Toomey 1999]. A conversation in which the SIA interrupts its user could therefore be considered impolite by users of certain cultural backgrounds, while not interrupting might leave the impression of a lack of interest in the user's elaborations in other cultures. Since these judgements usually happen subconsciously, people might not be aware of the reason for their impression but simply feel a general dislike for their interlocutor (and potentially refuse to interact with the particular SIA in the future).

∙ Matching culture-specific cues can, vice versa, raise the acceptance of a SIA. If a SIA aims at establishing a positive relation with a user, proxemics, for example, might be important. The well-known interpersonal distance zones by Hall [1966] quite concretely suggest appropriate interpersonal distances for either a personal zone (often kept by friends and family) or a social zone (commonly kept by acquaintances or during social get-togethers). However, these zones are not culturally universal [Ting-Toomey 1999]. When different interpersonal distances were introduced for SIAs, observers from different cultural backgrounds judged their appropriateness differently, both for virtual SIAs [Jan et al. 2007] and for physical SIAs [Eresha et al. 2013]. A minimal sketch of how such culture-dependent distance preferences could be parameterized is given after this list.
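To make the proxemics example concrete, the following Python sketch (a minimal illustration, not a validated model) picks a target interpersonal distance from Hall's zones and scales it by a culture-specific contact preference. The zone boundaries follow Hall's commonly cited values; the culture labels and scaling factors are invented placeholders that would have to be replaced by empirically grounded parameters.

    # Minimal sketch: culture-dependent interpersonal distance selection.
    # Zone boundaries (in meters) follow Hall's commonly cited values; the
    # per-culture scaling factors are hypothetical placeholders.
    HALL_ZONES_M = {
        "intimate": (0.0, 0.45),
        "personal": (0.45, 1.2),
        "social": (1.2, 3.6),
    }

    # >1.0 widens distances (low-contact preference), <1.0 narrows them.
    CONTACT_FACTOR = {
        "low_contact_example": 1.15,
        "high_contact_example": 0.85,
    }

    def target_distance(zone: str, culture: str) -> float:
        """Return a target distance (m) for the given Hall zone and culture."""
        lower, upper = HALL_ZONES_M[zone]
        midpoint = (lower + upper) / 2.0
        scaled = midpoint * CONTACT_FACTOR.get(culture, 1.0)
        # Clamp to the chosen zone so cultural scaling never crosses zone limits.
        return max(lower, min(upper, scaled))

    print(target_distance("social", "low_contact_example"))   # ~2.76
    print(target_distance("social", "high_contact_example"))  # ~2.04

In an interactive system, such a lookup would only be a starting point, since comfortable distances also depend on the relationship between the interactants, the agent's embodiment, and the situation.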

It is also important to define which cultural influences should be realized with a SIA. These can be seen as layers of culture [Trompenaars and Hampden-Turner 1997], ranging from a very concrete explicit layer (e.g., language or clothing), through norms and values (what is right or wrong, and good or bad), to the subconscious implicit layers that constitute one's basic assumptions. When incorporating culture for (the behavior of) SIAs, these layers can include visible features, conscious interactions, subconscious details, or underlying processes (cf. Figure 13.1).

Figure 13.1  Cultural influences and their potential realization with SIAs.

Besides the potential to enhance the SIA's acceptance in a target user group, modeling cultural background can serve another purpose: to teach about cultures and culture-specific differences. With it, SIAs can help foster cross-cultural understanding in a multi-cultural world or reduce subconscious biases. Using SIAs to raise cultural awareness can have several advantages, even over human training partners. First, with a SIA as communication partner, the task can be repeated as often as necessary without the risk of annoying a human training partner or paying for each additional lesson. Another advantage is that an emotional distance is kept. On the one hand, the trainee might feel embarrassed when training behavior routines
with a real human; on the other hand, he or she does not need to be afraid of embarrassing the SIA by treating it in a culturally inappropriate way. Cultural differences are often subtle and thus hard to recognize. Using SIAs, these differences can be acted out in an exaggerated manner or shown in isolation. In addition, SIAs can simply change their culture. In that way, one and the same agent can simulate the behaviors of different cultures and point out the differences.

This chapter gives an extensive overview of how culture was, and can be, implemented for SIAs. After this general introduction highlighting the potential of this endeavor, the remainder of the chapter is organized as follows: Section 13.2 first introduces theories of culture from the social sciences that we consider suited for implementation with SIAs, and then provides guidance on which questions to address when aiming to technically realize culture within a SIA system. Section 13.3 constitutes the core of the chapter, providing an extensive literature review of the various approaches taken by research groups around the world over the last two decades to implement culture for SIAs. We thereby refer back to the theories of culture and approaches for implementation introduced before. Since there has been confusion about the different notions of culture-related studies (e.g., differentiating cross-cultural from inter-cultural or mixed-cultural settings), Section 13.4 summarizes potential research settings with SIAs. Subsequently, Section 13.5 discusses the role of the SIA's embodiment when implementing culture, highlighting where the research fields of intelligent virtual agents (IVAs) and social robots (SRs) can benefit from one another and in which cases complementary studies and computational approaches are needed. Section 13.6 points out some concrete current challenges that we derive from former research and the current state of the art in SIA research, before turning to future perspectives in Section 13.7, where we argue for more dynamic and fine-grained approaches. The chapter concludes with a brief summary of the various perspectives on culture in SIA research and development introduced in this chapter.

13.2 Theories and Approaches

In this section, we first introduce theories of culture from various disciplines that are commonly used in the domain of SIAs, before describing different approaches for the implementation of culture in SIAs.

13.2.1 Theories of Culture

When we aim to integrate a social or psychological phenomenon into our systems, we usually base this integration on theories from the corresponding sciences, thus ensuring that we have a theoretical and empirical foundation (cf. Chapter 10 on emotion for SIAs in this handbook).
In the case of culture, we face a challenge here because many disciplines deal with culture to a greater or lesser degree. This results in multiple definitions that are, in addition, conflated with a layman's everyday understanding of culture. A simple yet classical way to distinguish cultures is on a geographical (e.g., Western vs. Eastern cultures) or national (e.g., Italian vs. Chinese) level. However, one needs to pay attention to subcultures (e.g., depending on age, social upbringing, or political interests) or regional variations (e.g., northern Italians vs. Romans vs. Sicilians), particularly when comparing cultures on these levels. Also, ethnicity might play a crucial role in some (national) cultures, describing a group with a shared cultural identity based on shared ancestry, language, or traditions. It is thus important to define very precisely how the concept of culture is understood when it is to be integrated into a system. We can safely assume that there is no one-size-fits-all cultural theory that covers the full spectrum. In the following, we describe some theories that are used in SIA research or that might be worth a closer look for future implementations.

Dimensional theories of culture are descriptive frameworks whose essential perspective lies in the identification of dimensions, value systems, and various constructs that help in categorizing cultures and thus facilitate systematic comparisons between cultural groups. One of the most used theories is Hofstede's system of values [Hofstede 1980]: culture is defined by six dimensions (the original study presented four dimensions; later extensions with additional data resulted in two more, long-term/short-term orientation and indulgence/restraint), and a given culture is thus a point in a six-dimensional space. The current six dimensions are power distance, individualism/collectivism, uncertainty avoidance, masculinity/femininity, long-term/short-term orientation, and indulgence/restraint. All dimensions are linked to specific ways of thinking, interpreting, and managing interactions between people. If a culture is, for example, high on the uncertainty avoidance dimension, one can expect lives to be governed by an intricate system of rules that minimize uncertainty in new situations. Another frequently used theory relies on Hall's anthropological work [Hall 1959, 1966], which defines different dichotomies on the dimensions of space, context, and time. The space dimension is closely linked with Hall's concept of proxemics, which refers to human spatial behavior, where immediacy is interpreted differently across cultures and influences communication patterns. On the space dimension we can distinguish between high- and low-contact cultures, where the latter can be defined by being more comfortable with larger distances in interpersonal encounters. The context dimension distinguishes between high- and low-context cultures. Context refers to the amount of information that has to be encoded explicitly in a communication. Members of high-context cultures are thus good at inferring meaning without being explicitly told. The time dimension refers to the perception of time and the ordering of actions. Monochronic cultures value clock time and prefer finishing tasks before starting new ones, whereas polychronic cultures are comfortable with multi-tasking and might have different time perceptions. Other popular theories apart from Hofstede and Hall include Kluckhohn and Strodtbeck's [1961] values orientation theory, which assumes a stable core of universal adaptations to social interaction and that cultures can differ in their preferences on which of those to adapt, and Trompenaars and Hampden-Turner's [1997] dimensional theory that features seven dimensions including, for example, individualism versus communitarianism and neutral versus emotional.

Developing SIAs based on dimensional theories: Dimensional theories are frequently used as theoretical background for SIAs that behave in a culture-specific manner. Culture in these theories can be described as a point in an n-dimensional space or, in some cases, as a vector with binary values for each dimension. Thus, they offer a computationally friendly design and some empirical evidence for behaviors connected with the dimensions, for example, by Hall, where spatial behavior or the amount and detail of necessary information in communication can be linked to the end points of the corresponding dimensions. But as the theories are descriptive, they do not allow for precise predictions of behavior from a culture's position on the dimensions. Another challenge is the uncertainty about the precedence of dimensions if a culture's positions on different dimensions suggest conflicting behavior. Therefore, some systems use only one dimension to demonstrate prototypical differences rather than implementing the whole multi-dimensional space.
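To illustrate how computationally friendly such a dimensional representation is, the following Python sketch describes a culture as a vector of Hofstede-style dimension scores, compares two cultures by their distance in that space, and maps one dimension onto a toy behavior parameter. The dimension names follow Hofstede's model, but the numeric scores, culture labels, and the behavior mapping are invented placeholders rather than values taken from Hofstede's data.

    # Minimal sketch: a culture as a point in a Hofstede-style six-dimensional space.
    # Scores are hypothetical placeholders on a 0-100 scale, not Hofstede's data.
    from math import dist

    DIMENSIONS = ("power_distance", "individualism", "uncertainty_avoidance",
                  "masculinity", "long_term_orientation", "indulgence")

    CULTURES = {
        "culture_A": (35, 67, 65, 66, 83, 40),
        "culture_B": (54, 46, 92, 95, 88, 42),
    }

    def cultural_distance(a: str, b: str) -> float:
        """Euclidean distance between two cultures in dimension space."""
        return dist(CULTURES[a], CULTURES[b])

    def pause_tolerance(culture: str) -> float:
        """Toy behavior parameter: map uncertainty avoidance (0-100) onto a
        tolerated silence duration in seconds (higher avoidance, shorter pauses)."""
        idx = DIMENSIONS.index("uncertainty_avoidance")
        return 2.0 * (1.0 - CULTURES[culture][idx] / 100.0)

    print(cultural_distance("culture_A", "culture_B"))  # 49.0
    print(pause_tolerance("culture_B"))                 # ~0.16 seconds

A vector representation like this also makes the precedence problem mentioned above tangible: two dimensions may pull the same behavior parameter in opposite directions, and the theory itself does not say which one should win.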

Cognitive theories of culture discuss the innate nature of culture, its emergence in groups, as well as its epigenetic evolution. Dual inheritance theory [Cavalli-Sforza and Feldman 1981, Boyd and Richerson 1985] posits that there exists a similarity between human biological and cultural evolution and defines three characteristics. (1) Cultural adaptation means that the current cultural capacities of human beings (i.e., abilities to socially acquire ideas, beliefs, values, practices, mental models, and strategies by observations and inferences) are the result of genetically evolved psychological adaptations. (2) Cultural evolution is a system of inheritance based on human cultural learning, human cognition, and human social interactions. (3) Culture–gene coevolution assumes that cultural evolution produces distinctive effects on social and physiological environments (e.g., cultural norms influence the perception of aesthetic beauty, which in turn influences the spread of genes among populations). Sperber's epidemiology of representations [Sperber 1985, 1996, 2012] shifts the focus from the general question of cultural evolution to the more specific one of how culture manifests in an individual and, through interaction, in the group in which the individual is embedded. He thus gives an account of how culture is an emergent phenomenon of a group of individual members. From Sperber's standpoint, this emergence is explained by the intertwining of mental representations and cultural productions by transformation processes such as imitation and communication. These processes must not be understood as replication processes since the resulting element is very likely to differ from the original element (i.e., it is transformed). A main element of this theory is the learning processes that allow an individual to build up mental representations that are similar to those of the social group s/he is embedded in. Thus, while belonging to the same cultural group, the mental representations of individual group members are likely to differ to a greater or lesser degree.

Developing SIAs based on cognitive theories: While such cognitive theories have so far not been employed for the development of SIAs, they offer great potential in explaining the emergence of shared, or at least similar, knowledge in groups (see also communities of practice below) and could be simulated with appropriate machine learning methods. The integration of cognitive theories of culture could foster long-term interactions, transmission of (cultural) knowledge, as well as dynamic adaptation to different cultural environments. While descriptive theories such as dimensional approaches lack a dynamic component to explain an individual's idiosyncratic cultural expression due to his or her experiential history, for example, as an expat or as a member of a specific subculture such as scientists, cognitive theories provide this dynamic component through the described learning mechanisms.

Theories for cultural training focus on the acquisition of inter-cultural skills, capabilities that are especially sought after in the globalized world we are living in. Bennett presents a theory for training inter-cultural sensitivity, which has been developed from a practitioner's point of view [Bennett 1986]. Trainees progress from an ethnocentric perspective (taking one's own culture as the gold standard) to an ethnorelative one (being sensitive about different value systems). There are three stages in each perspective that a trainee is supposed to pass through: denial, defense, and minimization in the ethnocentric perspective; and acceptance, adaptation, and integration in the ethnorelative perspective.


Cultural intelligence [Earley 2002, Earley and Ang 2003, Earley and Mosakowski 2004] refers to the ability to accurately assess a cultural situation and determine a culturally adequate way to respond to it. Improving cultural intelligence (through experience or training) will lower the risk of improper cultural adaptation. Cultural intelligence consists of three building blocks (called facets): a cognitive component (having knowledge about cultural differences), a motivational component (wanting to overcome inter-cultural communication problems), and a behavioral component (being able to act in a culturally appropriate way). Research on cultural intelligence shifts the focus from specific differences between cultures to a more general understanding of the strategies underlying the discovery of knowledge, the acquisition of behavior, and the skills for problem solving that allow an individual to adjust to inter-cultural situations over time.

Inter-cultural communication abilities can be obtained in three steps following Hofstede's [1991] training model.

1. Awareness: The first step includes being aware of culture-related differences, and the acceptance that there are differences but that one's own behavior routines are not superior to others.

2. Knowledge: Gaining knowledge implies learning about the target culture's symbols and rituals. This does not necessarily mean sharing the values of a culture, but having an idea of where these values differ from one's own.

3. Skills: Acquiring skills includes recognizing the symbols of the other culture, and practicing its rituals.

While (1) and (2) might be sufficient to avoid most of the obvious misunderstandings in inter-cultural communication, the last step requires practice.

Developing SIAs based on theories for cultural training: In inter-cultural or educational contexts, theories of cultural training present a good basis for developing mechanisms for adapting to the user's perspective. These theories encompass different layers in a SIA's architecture, including knowledge, goals, and behavior routines. For example, SIAs can be employed in all three steps of Hofstede's training model described above: The first step, gaining a general awareness, can simply be achieved by observing SIAs that demonstrate certain culture-specific routines. For gaining knowledge about another culture, additional information is required about the culture-specific differences demonstrated by the SIAs. These explanations can either be given before observing the SIAs or afterwards in a debriefing. For the third step of obtaining cultural skills, the learner needs to be able to interact with the SIAs that represent members of another culture. Through their reactions and behavioral suggestions, learning can be implemented in an interesting, interactive way, making SIAs a powerful medium for gaining inter-cultural competencies.
Other theories are mentioned here that are useful in the domain of culture. These theories offer alternative views on culture but do not necessarily focus thereon or investigate very specific cultural features.

The similarity-attraction principle [Byrne 1971] suggests that communication partners who perceive themselves as being similar are more likely to like each other. This principle has, among many other factors, been successfully applied to ethnic similarity (e.g., Hu et al. [2008]). Similarly, in-group favoritism (also known as "in-group bias" or "in-group/out-group bias") describes the phenomenon that people evaluate and treat others preferentially if they are perceived to be in the same in-group. This effect has been widely researched and often been applied to cultural or ethnic groups (e.g., Efferson et al. [2008]).

Politeness theory [Brown and Levinson 1987] was not developed as a theory of culture but addresses an important aspect, that is, how politeness is linguistically realized in different language groups. A central notion of the theory is the face threatening act. Brown and Levinson describe a universal hierarchy of politeness strategies that are tailored to the different types of face threatening acts. Cultural variations occur in the way different language groups have developed different ways of expressing these strategies. Additionally, they describe cultural parameters that have an influence on the realization of strategies such as social distance, power relation, and ranking of the imposition.

Communities of practice [Lave and Wenger 1991, Wenger 1998] are centered on the question of what leads an arbitrary group to turn into a community with shared goals and practices, which could be called a (sub-)culture. They describe three steps in constituting a community of practice: (1) Mutual engagement: The members of a community of practice have to engage together in the practices that are the constituting elements of the community's culture. (2) Joint enterprise: Communities of practice self-develop to achieve their joint goals. (3) Shared repertoire: In order to create a common meaning of practices, a shared repertoire evolves over time for a community.

Developing SIAs based on these other theories of culture: The similarity-attraction principle and in-group favoritism have widely been used to motivate the implementation of culture in SIAs and to evaluate these SIAs with certain user groups, suggesting that agents that simulate the cultural background of the observer are preferred. Politeness is an important feature in interpersonal communication and an element of each interaction of a SIA with a user. It describes, for example, how to realize requests and commands and when and how to apologize. Communities of practice are well-suited to describe cultural phenomena that are encountered in enculturated technologies, which are designed for a certain task or to achieve a certain goal.

13.2.2 Approaches for the Implementation of Culture in SIAs

In the previous subsection, we have introduced theories of culture alongside each theory's potential to be integrated into SIA systems. However, the theoretical foundation is only one question that needs to be addressed when aiming at integrating culture into SIAs. To give an overview of the different approaches, we introduce cornerstones that provide guidance for the various potential technical realizations, along with examples.

Theory of culture As outlined above, culture can be defined and understood in several different ways. So, the first decision that needs to be taken is on which theoretical foundation culture should be implemented: for example, as a specific national culture's subgroup or as an abstract culture based on one or more cultural dimensions. Culture on a national level has, for example, been addressed by Rehm et al. [2007], who pointed out behavioral differences between German and Japanese interactants. Such an approach can be useful when aiming to raise awareness of prototypical behavior in a certain culture that one wants to visit, for example, to understand the typical conversational flow of a first-time meeting [Endrass et al. 2011b]. Thereby, it might be useful to model a certain subgroup within that national culture, for example, typical small talk conversations among undergraduate students [Endrass et al. 2011a] or appropriate behavior during a military negotiation [Johnson et al. 2011]. Within a national culture, ethnicity can further be addressed. Finkelstein et al. [2013], for example, modeled different dialects spoken by different US American subgroups such as African American Vernacular English. Such an approach can be particularly useful to support socially disadvantaged groups, for example, by reducing the gap between supervisor and learner. To avert stereotyping, avoiding existing cultures is another approach. Aylett et al. [2014], for example, implemented fantasy cultures based on a theoretical model to raise a general cultural awareness and demonstrate that "different is not dangerous."

Foundation of approach Another fundamental question is what the implementation of culture will be based on and, with it, what determines how cultural background and the resulting culture-specific output are linked. In principle, there are two approaches: theory-based or data-driven. Theory-based approaches model culture-specific behaviors based on theories from the literature. The Traveller application [Mascarenhas et al. 2013], for example, models different cultures based on Hofstede's dimensional model. Dependent on the assigned cultural background, an agent evaluates the behavior of others differently. If, for example, an interlocutor is considered out-group, his or her behavior might be judged inappropriate by a collectivistic agent, although the same behavior is considered reasonable if conducted by someone from the in-group. Data-driven approaches extract culture-specific patterns from human behavioral data to inform empirically grounded computational models. A multi-modal video corpus was, for example, analyzed in Endrass et al. [2010], where participants from different cultures interacted in a standardized scenario to allow later comparison. There are also attempts that follow a hybrid approach, combining the advantages of theory-based and data-driven approaches. In Lugrin et al. [2018c], for example, a probabilistic model for culture-specific behavior for SIAs was developed for which the dependencies were modeled based on theory, while the weighting was based on empirical data.
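To make the data-driven route more tangible, the following Python sketch estimates per-culture frequencies of annotated non-verbal behaviors from a toy corpus and samples a behavior for an agent of a given culture. The corpus entries, gesture categories, and culture labels are invented for illustration; an actual model such as the probabilistic one in Lugrin et al. [2018c] is learned from a real annotated corpus and conditions on considerably more context.

    # Minimal sketch of a data-driven behavior model learned from annotations.
    # The "corpus" below is a toy stand-in for a real annotated video corpus.
    import random
    from collections import Counter

    corpus = [
        ("culture_A", "iconic"), ("culture_A", "beat"), ("culture_A", "beat"),
        ("culture_B", "beat"), ("culture_B", "deictic"), ("culture_B", "beat"),
        ("culture_B", "iconic"),
    ]

    def behavior_distribution(culture: str) -> dict:
        """Relative frequencies of gesture categories observed for one culture."""
        counts = Counter(gesture for c, gesture in corpus if c == culture)
        total = sum(counts.values())
        return {gesture: n / total for gesture, n in counts.items()}

    def sample_behavior(culture: str) -> str:
        """Sample a gesture category according to the culture-specific frequencies."""
        distribution = behavior_distribution(culture)
        gestures, weights = zip(*distribution.items())
        return random.choices(gestures, weights=weights, k=1)[0]

    print(behavior_distribution("culture_B"))  # {'beat': 0.5, 'deictic': 0.25, 'iconic': 0.25}
    print(sample_behavior("culture_B"))        # e.g., 'beat'

Conditioning these frequencies on further context variables (e.g., the type of utterance) moves such a model toward the Bayesian network approach mentioned under Method of implementation below.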


Features of culture Culture can manifest itself in many different ways, so one also needs to decide which features of culture should actually be implemented. Basically, these features can be grouped into external features that are observable on the surface and internal features that constitute one's basic assumptions, such as whether all people are equal (e.g., Trompenaars and Hampden-Turner [1997]). External features contain all aspects of the SIA's appearance, such as skin color, eye shape, clothing, and the like (cf. Chapter 4 on Appearance in this handbook for an overview). Also, the language the SIA speaks constitutes a cultural factor. Please note that factors of expressive speech, for example, intonation or emotional speech, and dialogue behavior such as turn taking are also dependent on cultural background (see Chapters 6 and 15 on Expressive Speech and Dialogue in this and the second volume of this handbook). Similarly, non-verbal behaviors are external features of culture. Thus, all aspects described in Chapters 7 and 8 of this volume, such as gaze, facial expressions, and conversational gestures, vary with cultural background and can help scaffold mutual understanding (or might lead to misinterpretations). External features have largely been addressed in various systems, including Koda et al. [2008], Eresha et al. [2013], Finkelstein et al. [2013], and Endrass et al. [2013]. Internal features of culture are included in the decision-making process of the SIA. In that vein, one might manipulate meanings and values, how emotions are evoked, or how the behavior of others is interpreted. Mascarenhas et al. [2013] and Nouri and Traum [2014] are examples of integrating culture into the decision-making process of the agent's mind. Particularly the decision of which features of culture should be integrated is heavily intertwined with the method of implementation.


Method of implementation Outwardly observable features of culture can be manipulated to demonstrate cultural differences. Features such as facial expressions [Koda et al. 2009], gesture performance, or typical body postures [Endrass et al. 2013] have been investigated. In that manner, a SIA can be designed that implicitly communicates its cultural background, for example, by taking a tight body posture with folded hands, performing fewer gestures with a small spatial extent, or showing frequent smiles. While external features of culture can be implemented in a behavioral way of thinking, internal features need a cognitive approach. For their technical implementation, several methods from artificial intelligence have been applied. Approaches that aim to modulate behaviors based on culture-specific norms and values typically start from existing belief–desire–intention architectures and extend them by adding culture-driven interpretations of actions and their appraisal. Mascarenhas et al. [2009], for example, extended their agent mind architecture FAtiMA, which implements a cognitive model of appraisal (for further details see Chapter 10 of this volume), with representations of Hofstede's cultural dimensions. In their model, an agent's alleged culture determines its decision processes (the selection of goals and plans) and its appraisal processes (how an action is evaluated). For example, an action that is of benefit to others is more praiseworthy for members of a collectivistic culture. Other approaches have used Bayesian networks to select the most probable non-verbal behaviors based on the cultural background of the speaker and the current verbal utterance [Lugrin et al. 2018c]. The use of Bayesian networks bears a number of advantages, such as making predictions based on conditional probabilities (e.g., to model how likely it is that a person makes use of very large gestures given the cultural background) or mitigating the risk of over-stereotyping (e.g., by not continuously repeating the same non-verbal behavior for a given cultural background and behavioral sequence). Another example is work by Nouri and Traum [2014] that makes use of a data-driven approach to map statistical data onto culture-specific computational models for decision making. In particular, culture-specific decision making is simulated based on values such as selfishness. Bruno et al. [2019] make use of ontologies to represent knowledge of cultural information. The resulting framework is intended to allow robots to adapt to the user's culture-driven habits.
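As a rough illustration of the appraisal-modulation idea described above (and not of FAtiMA's actual implementation), the following Python sketch weights the praiseworthiness of an action by a collectivism score taken from the agent's culture profile. The culture profiles, action attributes, and weighting formula are invented assumptions, chosen only to show the principle that the same action is appraised differently under different cultural parameters.

    # Minimal sketch: culture-modulated appraisal of an action's praiseworthiness.
    # Profiles, attributes, and the weighting formula are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class CultureProfile:
        collectivism: float  # 0.0 = fully individualistic, 1.0 = fully collectivistic

    @dataclass
    class Action:
        benefit_to_self: float    # 0..1
        benefit_to_others: float  # 0..1

    def praiseworthiness(action: Action, culture: CultureProfile) -> float:
        """Collectivistic agents weight benefit to others more strongly;
        individualistic agents weight benefit to self more strongly."""
        w_others = culture.collectivism
        w_self = 1.0 - culture.collectivism
        return w_others * action.benefit_to_others + w_self * action.benefit_to_self

    sharing = Action(benefit_to_self=0.2, benefit_to_others=0.9)
    print(praiseworthiness(sharing, CultureProfile(collectivism=0.8)))  # ~0.76
    print(praiseworthiness(sharing, CultureProfile(collectivism=0.2)))  # ~0.34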

13.3 History

Investigating culture in SIA research started with virtually embodied SIAs, while integrating culture in physically embodied SIAs was considered somewhat later. Thus, this review section starts with work on IVAs before summarizing work on SRs.


At the very origin of IVAs, Cassell [2000] discussed the impact of cultural background on (non-verbal) behavior and on users' evaluations of IVAs in the introductory chapter of her book on embodied conversational agents [Cassell et al. 2000]. At that early stage, other factors such as the display of emotion were the focus of IVA research. However, cultural differences were highlighted as "an important one for future research" [Cassell 2000, page 17]. In 2004, the first comprehensive book on "Agent Culture" appeared, which discussed human–agent interaction in a multi-cultural world [Payr and Trappl 2004]. In three parts, containing 12 chapters, the book discusses (1) that technology is always embedded in culture and thus an agent cannot be without culture, (2) the potential of adaptive agents while maintaining consistency, and (3) the potential of agents as mediators in inter-cultural communication. In particular, the second part of the book includes concrete and innovative directions for potential implementations of culture in SIAs (e.g., Allbeck and Badler [2004] and de Rossis et al. [2004]).

One of the earliest systems that implements cultural background and models culture-specific behaviors for IVAs is the Tactical Language Training System [Johnson et al. 2004a, 2004b]. The system originally focused on the acquisition of basic communicative skills for American soldiers in foreign languages and cultures. Learners communicate via a multi-modal interface that permits them to speak and choose gestures for their character. Different technologies such as speech recognition, motivational dialogue, learner modeling, and multi-agent simulations are implemented. Tactical Language is based on an architecture for social behavior that implements a version of theory of mind and supports IVAs that understand and follow culture-specific social norms. While the user converses with an IVA, the underlying architecture Thespian tracks the affinity between the IVA and the human user depending on the appropriateness of the user's behavior. To date, it has formed the basis of a variety of products for language and culture training, including Iraqi, Chinese, and Danish [Alelo Inc.]. In the same line of research, the VECTOR system aimed to teach face-to-face inter-cultural communication skills for the Arab culture in the military domain [Barba et al. 2005], and the BiLAT system focuses on negotiation skills, where the learner has to adapt to some Iraqi interaction rules for a successful negotiation [Kim et al. 2009]. Both use internal features of culture to drive the interpretation of actions and the decisions about the IVAs' reactions. For example, a character might be more willing to interact in a simulated mission if the user removes his/her sunglasses for the communication.

Modeling and studying culture in the domain of IVAs gained increasing interest in 2006/2007, with a number of research groups and international projects
investigating culture from different angles, be it national cultures, ethnicity, or synthetic cultures, using different approaches regarding their foundation and methods of implementation. Most of the approaches initiated then were continued for a long time (if not up to now) in follow-up projects and further studies.

In 2006, the German–Japanese bi-nationally funded project Cube-G started and explored whether and how the non-verbal behavior of IVAs can be generated from a parametrized computational model [Rehm et al. 2007]. In the scope of the project, a multi-cultural corpus for the German and Japanese cultures was recorded [Rehm et al. 2008] and a computational model initialized [Rehm et al. 2009]. Based on the data of the corpus analysis, different aspects of behavior such as the use of gestures and postures [Endrass et al. 2011c], communication management [Endrass et al. 2009], or the choice of conversational topics [Endrass et al. 2011a] were modeled for IVAs and tested in isolation in perception studies [Endrass et al. 2013]. Their results suggest that users seem to prefer agent dialogues that reflect behavioral patterns observed for their own cultural background. The (scripted) agent behavior was also transferred to interactive settings and evaluated in interaction studies [Kistler et al. 2012]. At a later stage, an extended computational model that was learned from the annotated video data was implemented [Lugrin et al. 2015] and combined the various behavioral differences in a demonstrator with IVA dialogues [Lugrin et al. 2018c]. Also investigating non-verbal behavior for national cultures, Jan et al. [2007] modeled differences in gaze, proxemics, and turn-taking behaviors, and simulated these using a group of IVAs for the American and Arabic cultures. An evaluation of their demonstrator suggests that observers are able to distinguish the differences in relation to cultural appropriateness.

Also in 2006, the European project eCIRCUS, which featured IVAs with emotional intelligence and role-playing capabilities, was initiated, with a focus on social interaction. Their educational application ORIENT [Aylett et al. 2009] aims to develop inter-cultural empathy in 13–14-year-old students. To avoid cultural stereotyping, their demonstrator makes use of fantasy cultures and a culture-specific agent architecture based on theories of cultural dimensions. Following up on the ORIENT application, the European project eCute (2010–2013) developed two applications that aim at increasing cultural awareness through role-play with IVAs. To this end, the agent architecture FAtiMA was enriched with culture-sensitive theory of mind mechanisms (cf. Chapter 9 in this volume for detailed information on theory of mind) [Mascarenhas et al. 2009] that determine both the agent's decisions as well as its appraisal processes (how the actions of others are evaluated) dependent on the agent's allocated culture. In the MIXER [Aylett et al. 2014, Hall et al. 2015] and Traveller [Mascarenhas et al. 2013] applications, IVAs
with synthetic cultures are employed in virtual learning environments to establish empathy and raise cultural awareness in 9–11-year-old school children and young adults from 18 to 25, respectively. In 2011, the group organized a workshop on "Culturally Motivated Virtual Characters" at the International Conference on Intelligent Virtual Agents (IVA), bringing together researchers from different disciplines to drive the topic further.

Focusing on the expression of emotion via facial expressions of IVAs, Koda and colleagues performed a number of cross-cultural studies. In these studies, they showed that facial expressions of IVAs are interpreted differently in different cultures [Koda and Ishida 2006], independent of whether the agent was designed in Asian [Koda 2007] or Western countries [Koda et al. 2008]. Overall, their results indicate that there is an in-group favoritism for the correct interpretation of emotional displays; for example, emotions designed by a Japanese designer are more likely to be interpreted correctly by Japanese participants. They further found that positive emotional displays have a wider cultural variance in interpretation than negative ones [Koda et al. 2009] and that, although the mouth region seems to be more effective for conveying the emotions of facial expressions than the eye region, Japanese observers weighted facial cues in the eye region more heavily compared to Hungarians, who in turn weighted facial cues in the mouth region more heavily [Koda et al. 2010]. In a later investigation, the authors examined culture-specific gaze patterns and found that Japanese participants preferred agents that showed familiar gaze [Koda et al. 2017].

Instead of using fantasy cultures or modeling stereotypical behavioral patterns for national cultures, Iacobelli and Cassell [2007] focused on the impact of ethnicity on verbal and non-verbal behavior within the US American national culture. Children were able to identify an agent's ethnic identity correctly when interacting with it. Thus, it appeared possible to alter an agent's perceived background by changing its behavior only. In a later approach [Finkelstein et al. 2013], the authors showed that speakers of African American Vernacular English (AAVE) benefited from an IVA that consistently showed linguistic features of AAVE, resulting in better performance in a science class.

Work by Nouri and Traum [2014] investigated culture-specific decision making in negotiations. They made use of a data-driven approach to integrate statistical data on the ultimatum game into a computational model based on values such as selfishness. Evaluating their model with IVAs that engage with different verbal strategies in the ultimatum game, Nouri et al. [2017] show that weights learned for one culture outperform weights learned for other cultures when playing against opponents of the first culture.


In the domain of SRs, research on culture started around 2005 by exploring the attitudes of members from different cultural backgrounds toward robots, showing that these attitudes differ noticeably across cultures [Bartneck et al. 2005, 2007]. For example, the authors found that American participants were least negative toward robots; interestingly, even less negative than the Japanese participants, who did not report a particularly positive attitude toward robots.

Manipulating individual aspects of a robot's behavior or appearance in a culture-specific manner started in 2009. Rau et al. [2009] investigated the communication style (i.e., implicit or explicit) of a robot and found that Chinese participants preferred an implicit communication style and rated this version more positively, while German participants rated this version more negatively. In another study, the authors demonstrated that the different communication styles of a robot influenced the decision making of users with a Chinese or American cultural background in the intended direction [Wang et al. 2010]. The in-group advantage for SRs was investigated in 2012 by Trovato et al. [2012], who created different versions of emotional expressions for the Western or Japanese cultures and showed that the emotional expressions are recognized better when the culture of the observer matches the simulated culture of the SR. Eyssel and Kuchenbrandt [2012] simply changed the name and background story of a robot (the robot being presented as developed in Germany or in Turkey) and found that the German robot was evaluated more positively by German participants. Looking into cultural differences in proxemics behavior, Eresha et al. [2013] found that Arabs and Germans have different expectations regarding the interpersonal distance between themselves and robots. These expectations are in line with culture-specific proxemics behavior known from the literature on human–human interaction.

In 2017, the European–Japanese project CARESSES started, which aimed at designing the first socially assistive robot for supporting ageing that is able to adapt to the cultural background of its users. In the scope of the project, Sgorbissa et al. [2018] developed guidelines for culturally competent robots in elderly care, and a knowledge representation framework that aims to help robots adapt their behavior [Bruno et al. 2019]. In 2018, the group initiated a special session on "Cultural Factors in Human–Robot Interaction" at the International Conference on Intelligent Robots and Systems (IROS), where different contributions focusing on culture for robots were discussed, including embedding ethics in robot design [Battistuzzi et al. 2018], implementing local cultural practices [Rehm et al. 2018], transferring the similarity-attraction principle for cultural background to HRI [Lugrin et al. 2018a], and adapting greeting rituals [Khaliq et al. 2018].


13.4 Evaluation of SIAs That are Based on Cultural Information

This section has to start with a disambiguation of several notions regarding the use of culture in interactive systems. Some notions relate to the whole system (culturally aware, enculturated) and some relate to the interaction with the system (multi-cultural, cross-cultural, inter-cultural). Unfortunately, these notions are often not defined or are used interchangeably. This makes it difficult to talk about the different elements of a SIA system that are affected by culture.

System specific We often read about culturally aware systems, meaning systems that have integrated culture in one way or another. The term cultural awareness, though, is a concept from the social sciences, where it is used to describe a process an individual goes through to become aware of one's own value system and then, as a subsequent result, be able to recognize one's own biases, prejudices, and assumptions in interactions with members of other cultures (e.g., Campinha-Bacote [2002]). Thus, it entails self-reflection as a necessary ingredient. Therefore, describing systems in which we have merely integrated culture as one parameter of the interaction with the user as "culturally aware" might be overselling the system. We therefore suggested the term enculturated system earlier [Rehm 2010], which is used to describe any system where culture has been part of the design and/or development process, influences the decision-making process, or affects the observable interaction with the user.

Interaction specific When SIAs are evaluated, we can often find claims about multi-cultural, cross-cultural, or inter-cultural communication. Again, it is a good idea to disambiguate these notions, as they actually signify different concepts [Gudykunst 2003]. Multi-cultural is used in relation to societies that contain more than one cultural group, where the different cultural groups live in co-existence but do not necessarily engage in interactions. Cross-cultural is used when cultural norms, values, or behaviors are compared between two or more cultural groups. Thus, it makes sense to speak of a cross-cultural study when we test a SIA with two different cultural groups. Inter-cultural, finally, is used to denote interaction between members of different cultural groups. Thus, a SIA system might, for example, train inter-cultural competences or act as a mediator in inter-cultural communication between users from different cultures.

Evaluating enculturated SIAs is a challenging task. In principle, we have to answer two questions when setting up an evaluation study in this domain: (1) What is the experimental design? (2) Who are the participants?

What is the experimental design? In general, the following experimental designs are possible, where the first two designs aim at verifying that the SIA behaves in a


culturally appropriate way, and the latter two designs aim at evaluating SIAs in the context of their application:

(1) Testing a SIA with a target culture: In this case, we would be interested in the effect of culturally appropriate behavior, and the experiment would thus compare the behavior of a SIA that is based on cultural information with a SIA without this characteristic or with different cultural characteristics.

(2) Comparing SIA performance in two different cultures (cross-cultural study): The main goal of a cross-cultural study is to measure the differences in perception or interaction of members of different cultures for the same system. In general, three different types of comparisons are possible: (a) comparing performance of a "standard" or universal behavior in two different cultures; (b) comparing performance of the culturally adequate behavior in each target culture (in-group comparison); and/or (c) comparing performance of a culturally inadequate behavior in each target culture. A minimal code sketch of such a comparison is given at the end of this section.

(3) Evaluating a SIA in inter-cultural training: If the SIA's behavior has been shown to be culturally appropriate, it can be employed for training purposes, where the experimental design then aims at evaluating the learning effect of training with the SIA as compared to traditional methods.

(4) Testing a SIA simultaneously with more than one target culture (inter-cultural study): In the case of an inter-cultural study, the agent might serve as a moderator in inter-cultural communication, or it might adapt to the different target cultures to enhance efficiency.

Who are the participants? Depending on the experimental design, you will need to test one particular target culture or several. But in the section on cultural models we have seen that it might be too simple to just equate culture with country. It might, for example, be troublesome to test a guiding robot on two university campuses in Japan and Germany and attribute any difference to culture. In this context, the target culture might actually be university students, and nationality might not play a crucial role in the interaction with the robot, as student culture may not be so different in the two countries. In a more extreme example, you might unconsciously test in a school in a low-income area in one country while testing in a highly ranked facility in the other, and attribute behavioral differences to culture, where they might in fact be attributed to other demographic factors of your participants.
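To make the in-group comparison of design (2) more concrete, the following minimal sketch (in Python, with entirely hypothetical ratings and culture labels, not taken from any of the studies cited in this chapter) shows how data from a 2×2 cross-cultural study — participant culture crossed with the culture the agent's behavior is modeled on — could be organized and summarized. A real evaluation would, of course, add inferential statistics and proper counterbalancing.

    from statistics import mean

    # Hypothetical ratings (1-7 Likert scale); keys are
    # (participant_culture, agent_behavior_culture).
    ratings = {
        ("A", "A"): [6, 5, 6, 7, 5],   # in-group cell
        ("A", "B"): [4, 3, 5, 4, 4],   # out-group cell
        ("B", "B"): [6, 6, 5, 7, 6],   # in-group cell
        ("B", "A"): [4, 5, 3, 4, 4],   # out-group cell
    }

    def cell_means(data):
        """Mean rating per (participant culture, agent culture) cell."""
        return {cell: mean(values) for cell, values in data.items()}

    def in_group_preference(data):
        """Difference between in-group and out-group cell means; a positive
        value is consistent with participants preferring agent behavior
        modeled on their own cultural background."""
        in_group = [r for (p, a), vs in data.items() if p == a for r in vs]
        out_group = [r for (p, a), vs in data.items() if p != a for r in vs]
        return mean(in_group) - mean(out_group)

    print(cell_means(ratings))
    print("In-group preference:", in_group_preference(ratings))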


13.5 Role of Embodiment

Table 13.1 Similarities and differences caused by embodiment when implementing culture for SIAs

Feature of Culture   | IVAs                                                                     | SRs
Appearance           | Arbitrarily configurable incl. subtle expressive features               | Limited modifications through attached accessories (e.g., stickers)
Cognition            | Same models apply for both embodiments
Verbal Behavior      | Same models apply for both embodiments
Non-verbal Behavior  | Arbitrarily configurable, restricted spatial behavior and collaboration | Limited degrees of freedom and expressive features, shared space and collaboration

As can be concluded from the history section, the implementation of culture in SRs is relatively new, especially in larger or long-term endeavors, compared to research on culture with IVAs. This might be partly due to the fact that an SR cannot be animated as subtly as an IVA, so that the sometimes subtle cultural differences in external features such as appearance or non-verbal behavior are difficult to demonstrate (see Table 13.1 for a comparison). However, for both types of embodiment, culture should be considered in their decision making and the execution of their behaviors to avoid a cultural clash between the SIA's (implicitly) implemented culture and the user's culture-inclined preferences (see also Section 13.1). Specifically, for the various implementations of internal features of culture, for example, a cognitive architecture that models how an action of another agent or human is evaluated based on cultural settings, the two fields can largely benefit from one another, as these models are based on theoretical knowledge or human data and do not depend on the hardware or the externalization of culture. Work by Correia et al. [2018] is one of the few examples of such reuse: the agent architecture FAtiMA, which had previously been extended with cultural factors for IVAs, was reused with SRs to model group emotion based on in-group factors. Since a major difference between SRs and IVAs lies in the fact that SRs inhabit the same physical space as their human users, it is not surprising that proxemics was among the first culture-specific aspects to be studied for SRs, while in the domain of IVAs facial expressions and differences in the execution of gestures were the initial focus. The concrete implementation of external features


of culture can only partly be transferred: while the preparation of the underlying knowledge about culture-specific differences, for example, in gestural expressivity, can be reused, the concrete translation into animations or joint movements needs to be redeveloped (e.g., Rehm [2018]). Moreover, the appearance of the SR, including crucial features such as height or movement characteristics, is static and determined by the hardware design of the robot. Thus, it might be impossible to adjust such external features to mimic a given target culture. Also, the movement quality might remain unnatural, and asymmetrical to the user's non-verbal abilities, due to the robot's different degrees of freedom, its movement characteristics (for example, omnidirectional movement on wheels), or the rigidity of its body. Another difference between SRs and IVAs regarding culture is its perception by human users. As pointed out by previous research (cf. Section 13.3), there are, on the one hand, different a priori attitudes toward SIAs across cultures; on the other hand, one and the same behavior of a SIA might be evaluated differently across cultures.
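The division of labor implied here — a shared, embodiment-independent model of culture-dependent decision making, combined with embodiment-specific realization of the selected behavior — can be illustrated with a small, purely hypothetical sketch. It is not a description of FAtiMA or of any other architecture cited in this chapter; all class names and parameters are placeholders.

    from abc import ABC, abstractmethod

    class CultureModel:
        """Embodiment-independent layer: selects behavior parameters
        (e.g., preferred interpersonal distance, gestural expressivity)
        from a culture-specific profile that IVAs and SRs could share."""
        def __init__(self, profile):
            self.profile = profile  # e.g., {"distance_m": 1.2, "expressivity": 0.4}

        def select_greeting(self):
            return dict(self.profile)

    class BehaviorRealizer(ABC):
        """Embodiment-specific layer: has to be re-implemented per embodiment."""
        @abstractmethod
        def realize(self, behavior): ...

    class IVARealizer(BehaviorRealizer):
        def realize(self, behavior):
            # Map expressivity to animation amplitude; spatial behavior is limited.
            print(f"IVA: play greeting animation, amplitude {behavior['expressivity']:.1f}")

    class RobotRealizer(BehaviorRealizer):
        def realize(self, behavior):
            # Map the same parameters to navigation and joint targets.
            print(f"Robot: approach to {behavior['distance_m']} m, "
                  f"scale arm joint range by {behavior['expressivity']:.1f}")

    model = CultureModel({"distance_m": 1.2, "expressivity": 0.4})
    for realizer in (IVARealizer(), RobotRealizer()):
        realizer.realize(model.select_greeting())

In such a design, only the realizer classes would have to be rewritten when moving between an IVA and a robot, while the culture-specific profile and decision logic could be shared.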

13.6 Current Challenges

We have seen in the history section that culture has been the focus of attention in SIA research for over two decades. While there has been impressive progress during that time, research is still far from being able to present a common solution for integrating culture into the interactive behavior of SIAs, or to provide a general computational model that covers external and internal features of culture and is applicable across different socio-cultural contexts. From our perspective, there are three main challenges that should be further addressed to help advance the current state of the art in enculturating SIAs.

(1) External features: Part II of this handbook focuses on Appearance and Behavior (Chapters 4, 5, 6, 7, and 8), which is closely related to what we call external features in this chapter. We have not written much on the appearance of enculturated agents, but some projects tried to prevent stereotyping by specifically using non-human characters [Aylett et al. 2009], while others tried to depict ethnic diversity [Cassell 2009]. The design of the agent obviously plays an important role if we do not want to just reproduce cultural stereotypes. But is it better to have an agent that looks similar to citizens from the target culture, or should it be a "neutral" agent? What role does the gender of the agent play in relation to the culture-specific application? Or is it perhaps better to use a non-human design instead? Thus, all aspects discussed in the "Appearance" chapter of this handbook (Chapter 4) should be considered in a culture-related context as well. Similarly, observable behavior, as


discussed in Chapter 5 (natural language generation), Chapter 6 (expressive speech), Chapter 7 (gesture generation), and Chapter 8 (multimodal behavior), is culture-specific and has been the focus of research in enculturating the observable behavior of SIAs, both verbally (e.g., accent, grammar) and non-verbally (proxemics, gestures). To this end, research has had to rely on (often incomplete) information from the literature, or on specifically collected data. In particular, the second approach is restricted to specific cultures and contexts for which the necessary information to model the SIA's behavior happens to be available.

(2) Internal features: Interacting with enculturated SIAs means that the system has to interpret the behavior of the user and generate culturally appropriate reactions for the agents. This relates strongly to work presented in Part III (Social Cognition and Phenomena) and Part IV (Interaction) of this handbook. Both tasks (interpretation and generation) are highly contextual and dependent on the application. The behavior of a SIA used for cultural training purposes, for instance, should differ from that of a SIA serving as a museum guide for international visitors. Some approaches that integrate culture as an internal feature work with underlying universal representations of interaction and interpret and translate incoming data [Mascarenhas et al. 2009]. Others model this knowledge explicitly, making it difficult to extend the system to new cultures [Rehm et al. 2009]. The challenge that remains is to integrate cultural knowledge into the system in a way that is flexible enough to handle several cultural backgrounds, or that can be extended to other cultural backgrounds.

(3) Applications with enculturated SIAs: Learning about appropriate behavior in a given target culture has been at the core of research on enculturated systems, with a number of different motivations ranging from military training for expatriate missions [Johnson et al. 2011] and increasing cultural awareness of school children [Aylett et al. 2009] to preventing in-class discrimination [Cassell 2009]. While there are a number of prototypes described in the history section of this chapter, there is limited generalizability and validity to these approaches. We frankly do not know exactly what users learn from these systems, or which aspects were responsible for the learning. What would be needed are comparative longitudinal studies, which of course are costly and difficult to organize. A major challenge is thus to show the effect of these systems in a statistically valid manner. It is also very likely that systems that focus on learning and training in other domains, for example, health-related systems (see also Part V of this handbook


on Applications), could benefit from integrating internal and external features of culture into their SIAs, for example, to meet the basic assumptions and preferences of users from a certain cultural background, or to simulate (cultural) similarity with the user to increase acceptance. Culture should thus be a part of the research and development cycles of many systems that do not explicitly address culture.

The challenges described in this section are based on the work that has been conducted so far. In summary, one could say that all of the presented approaches that aim at enculturating SIAs serve as a great basis for further research addressing internal or external features and for building interactive applications. Existing work can be extended to simulate a broader spectrum of culture-related features in several socio-cultural contexts. Therefore, a combination of the methods used for implementing different aspects of culture becomes paramount to allow integration into complete interactive systems that address the whole range of potential enculturated SIAs: from recognizing (culture-specific) user input, internally interpreting actions in a culture-related manner, reasoning about appropriate reactions, and externalizing these into concrete verbal and non-verbal actions, to measuring their impact on the perception, behavior, and learning of users.
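The coupling of interpretation and generation called for in challenge (2) and in the pipeline above can be illustrated with a deliberately simplified sketch. A single cultural profile parameterizes both how a user action is appraised and which reaction is selected; the dimension names, thresholds, and strategies below are hypothetical placeholders, not taken from any of the cited architectures.

    # Hypothetical cultural profiles expressed as dimension scores in [0, 1].
    PROFILES = {
        "culture_X": {"directness": 0.8, "formality": 0.3},
        "culture_Y": {"directness": 0.3, "formality": 0.8},
    }

    def interpret(user_request_is_blunt, profile):
        """Internal feature: appraise the same user action differently
        depending on the assumed cultural background."""
        if user_request_is_blunt and profile["directness"] < 0.5:
            return "face-threatening"
        return "neutral"

    def select_reaction(appraisal, profile):
        """External feature: choose a verbal strategy that is consistent
        with the same profile used for interpretation."""
        if appraisal == "face-threatening":
            return "polite_repair" if profile["formality"] > 0.5 else "direct_clarification"
        return "continue_topic"

    for culture, profile in PROFILES.items():
        appraisal = interpret(True, profile)
        print(culture, appraisal, select_reaction(appraisal, profile))

Extending such a system to a new cultural background would then ideally require only a new profile rather than new interpretation or generation code, which is exactly the kind of flexibility the challenge above asks for.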

13.7 Future Perspectives

In this section, we look into future perspectives for enculturated systems, challenging the notion of culture and opting for more dynamic approaches. We argue that national culture might not be the right level of granularity, that culture is not a monolithic, stable concept, and that individuals cannot easily be classified by a single cultural background.

Granularity Many approaches to integrating culture into SIAs have focused on national culture or on cultural dichotomies to model decisions or behavior. But SIAs are supposed to be used in specific contexts, for example, training for business negotiations, support of well-being, or as personal care assistants. Thus, they are embedded in specific social contexts with their own accepted behavior and practices, which might or might not be similar to the heuristics derived from the underlying (national) cultural concept. More fine-grained notions of culture, for example, on a regional, social, or institutional level, might thus be necessary. The following two examples illustrate this point. (1) Rehm et al. [2018] have worked on assistive technologies in institutional care in Denmark. What became apparent was that the workplace culture at the individual institutions was the determining factor in the success or failure of their system. Thus, they needed to employ individual development processes for socially assistive robots that were rooted in


the (cultural) practices of the respective institutions. (2) Even in contexts where a national culture would be assumed, the challenge of granularity can be found. Schank [1975] used a birthday party scenario to make the point that there exists a "predetermined causal chain of conceptualizations that describe the normal sequence of things." Arguing with Sperber [1996] and Wenger [1998] (see Section 13.2.1 Theories of Culture), we could instead claim that there is no normal sequence of things and no predetermined causal chain. Instead, there is constant negotiation and reaffirmation of dynamic sequences of action and behavior, which could be called cultural practices. In Denmark, for example, it is customary that the Danish flag is shown when someone celebrates a birthday. Children are expected to invite the whole class as an anti-bullying strategy, there are specific songs, and a "cake man" covered with sweets is served. This seems to be a cultural practice across the whole of Denmark, but at the same time there are a lot of aspects of the birthday practice that are negotiated locally, for example, the maximum amount of money the present is supposed to cost, and whether it should be money (so the child can buy something bigger afterwards), individual small presents, or a present from the class as a whole. Other aspects are where the party takes place, how long it usually lasts, whether it takes place after school or on the weekend, etc. These practices emerge in each class context out of the negotiation of the group of parents. Current models of culture do not take this difference in granularity into account and do not distinguish between more stable and more fluid aspects of (sub-)cultural phenomena.

Dynamics Apart from local or sub-cultural realities, culture is also a moving target that changes over time and generations. Looking back 30 years, to a time without the Internet and mobile phones, many cultural practices that we take for granted now did not exist. For instance, while people were once very vigilant toward governmental surveillance and privacy violations, nowadays there seem to be fewer barriers to providing sensitive information to private companies. While these changes happen gradually and are often only noticed in hindsight, we are currently experiencing a more radical change in cultural behavior through the coronavirus crisis. Typical examples of culturally adequate, and expected, behaviors have suddenly been challenged and have changed basically overnight, for example, banning handshakes when greeting or standing close together in social interactions. To date, there are no solutions for SIAs on how to handle such gradual or disruptive changes in cultural behavior. Quite the contrary: research on enculturated SIAs has concentrated on identifying cultural behavior to be modeled as a stable aspect of the SIAs' behavioral repertoire.


This dynamic nature of culture is both a challenge and an opportunity for SIAs. The challenge lies in how models can be created that keep adapting to changing cultural practices. One obvious direction to look at is current machine learning trends that would allow the SIA to become an actual member of the cultural group it is supposed to represent, resulting in behavior tailored to this specific subgroup. In such a scenario, the focus would be on culturally (or practically) appropriate behavior in a given (sub-)cultural group. In order to achieve such a vision, it will be necessary to define relevant learning mechanisms for the interpretation of events, the accumulation of knowledge, and the selection and performance of actions, to allow the SIA to adapt to its host culture. Since culture constantly reshapes itself, which in the long run might lead to culture-specific details getting outdated, another scope for the application of enculturated SIAs becomes plausible: cultural heritage. While most applications using virtual environments target tangible cultural heritage and aim at simulating places, buildings, or artefacts, SIAs could be used for intangible cultural heritage applications and demonstrate culture-specific behaviors, rituals, or social practices.

Mixed-cultural membership Globalization and immigration have led to increased contact with people from different cultural backgrounds [Ting-Toomey 1999]. For instance, in 2014 the number of first- and second-generation immigrants in the EU was 55 million,2 with an overall population of 507 million. Due to long-term or close contact with a new culture, customs are often adopted and communication is adapted. Thus, a large number of people today have mixed-cultural memberships, which means that different cultural beliefs, values, and communication styles might be present in one and the same person at the same time. In the future, SIAs should therefore not only be able to recognize and display culturally appropriate behavior for an assumed national culture but must also be able to reflect and react to cultural behavioral traits from several origins. Little research with SIAs has taken mixed-cultural settings into account. IVAs speaking different accents within the English language were, for example, considered by Khooshabeh et al. [2017]. They found that the cultural background of the user plays a crucial role in the perception of these agents. IVAs that speak with an accent (e.g., Middle Eastern English) were perceived as being foreign by people who do not share the agent's simulated mixed background. Vice versa, a positive impact was observed for people who share a mixed background (e.g., being bi-cultural), resulting in an increased perceived shared social identity.

2. https://ec.europa.eu/eurostat/statistics-explained/index.php/First_and_second-generation_immigrants_-_statistics_on_main_characteristics


Work by Obremski et al. [2019] implemented IVAs that speak the language of one country (German) but include grammatical mistakes typically made by foreign speakers of that language. Native speakers of German categorized these IVAs as being foreign. The same finding held true for a reimplementation and follow-up study in the English language [Obremski et al. 2021]. In an interaction study, they observed that human communication partners also adapted their verbal and non-verbal behavior toward the non-native IVA, compared to interacting with a native-speaking IVA [Lugrin et al. 2018b]. In particular, interlocutors spoke more slowly and used more gestures in conversations with the agent that was perceived as being foreign. Triggering the impression of foreign/non-native SIAs can serve several purposes. Such SIAs can enhance tools for cultural training by allowing an agent to speak the language of the learner while consistently behaving like a non-native speaker. They can also be employed to further study the social phenomena occurring in typical interactions with non-native interlocutors, potentially reducing cultural bias through guided (positive) interaction. Last but not least, the implementation of convincing mixed-cultural SIAs would foster the cultural diversity of enculturated systems and better reflect the members of our modern society.
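As a purely illustrative sketch of this idea — not the actual generation method used by Obremski et al. — non-native-sounding utterances can be produced by applying a small set of rule-based grammatical deviations to a native utterance before it is passed to speech synthesis. The error patterns below are hypothetical examples; in the cited studies, the error types were derived from systematic analyses of non-native speech.

    import re

    # Illustrative, hypothetical error patterns for English output.
    ERROR_RULES = [
        (r"\b(a|an|the)\s+", ""),      # omit articles
        (r"\bdoesn't\b", "don't"),      # simplify verb agreement
        (r"\bwent\b", "goed"),          # over-regularize past tense
    ]

    def to_non_native(utterance, rules=ERROR_RULES):
        """Apply simple rule-based grammatical deviations so that an
        otherwise fluent text-to-speech IVA is perceived as non-native."""
        for pattern, replacement in rules:
            utterance = re.sub(pattern, replacement, utterance, flags=re.IGNORECASE)
        return utterance

    print(to_non_native("The robot doesn't know where the exhibition went."))
    # -> "robot don't know where exhibition goed."

The amount and type of such deviations matter: as reported above, they influence whether the agent is categorized as foreign and how interlocutors adapt their own behavior.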

13.8 Summary

In this chapter, we have investigated culture for SIAs. We first discussed that, even when culture is not actively modeled, the SIA will still have a culture, which is based on the background of the designer/programmer. We introduced a variety of theories from different disciplines that explain culture from different angles, and with them outlined the possible approaches that can be taken to computationally implement culture for SIAs. We then summarized the research field of culture for SIAs by providing a historical overview of the major systems, applications, and approaches that have focused on enculturating SIAs over the last two decades. We considered the different types of embodiment for SIAs and outlined that the modeling of internal features of culture can be (partly) reused between IVAs and SRs, while external features as well as the perception of SIAs cannot be transferred directly. We complemented this review chapter by outlining some current challenges and largely unaddressed future perspectives when aiming at modeling the dynamic phenomenon of culture. As a concluding remark, we believe that implementing culture for SIAs has great potential to not only increase their acceptance but also to teach about cultural differences, scaffold cultural diversity, and support cultural understanding.


References Alelo Inc. Alelo: Play. Learn. Communicate. https://www.alelo.com/about-us/. Last visited 27.01.2020. J. Allbeck and N. Badler. 2004. Creating embodied agents with cultural context. In Agent Culture: Human–Agent Interaction in a Multicultural World. Lawrence Erlbaum Associates, 107–126. E. Aronson, T. D. Wilson, and R. M. Akerta. 2013. Social cognition. In Social Psychology (8th ed.). Pearson. R. Aylett, N. Vannini, E. Andre, A. Paiva, S. Enz, and L. Hall. 2009. But that was in another country: Agents and intercultural empathy. In 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2009). IFAAMAS, 329–336. R. Aylett, L. Hall, S. Tazzymann, B. Endrass, E. André, C. Ritter, A. Nazir, A. Paiva, G. J. Hofstede, and A. Kappas. 2014. Werewolves, cheats, and cultural sensitivity. In Proc. of 13th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2014). C. Barba, J.-E. Deaton, T. Santarelli, B. Knerr, M. Singer, and J. Belanich. 2005. Virtual environment composable training for operational readiness (vector). In 25th Army Science Conference. C. Bartneck, T. Nomura, T. Kanda, T. Suzuki, and K. Kato. 2005. Cultural differences in attitudes towards robots. In Proceedings of the Symposium on Robot Companions: Hard Problems and Open Challenges in Robot–Human Interaction, 1–4. C. Bartneck, T. Suzuki, T. Kanda, and T. Nomura. 2007. The influence of people’s culture and prior experiences with Aibo on their attitude towards robots. AI Soc. 21, 217–230. DOI: https://doi.org/10.1007/s00146-006-0052-7. L. Battistuzzi, C. Papadopoulos, I. Papadopoulos, C. Koulouglioti, and A. Sgorbissa. 2018. Embedding ethics in the design of culturally competent socially assistive robots. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 1996–2001. DOI: https://doi.org/10.1109/IROS.2018.8594361. M. J. Bennett. 1986. A developmental approach to training for intercultural sensitivity. Int. J. Intercult. Relat. 10, 2, 179–195. DOI: https://doi.org/10.1016/0147-1767(86)90005-2. R. Boyd and P. J. Richerson. 1985. Culture and the Evolutionary Process. University of Chicago Press, Chicago. P. Brown and S. C. Levinson. 1987. Politeness—Some Universals in Language Usage. Cambridge University Press, Cambridge. B. Bruno, C. T. Recchiuto, I. Papadopoulos, A. Saffiotti, C. Koulouglioti, R. Menicatti, F. Mastrogiovanni, R. Zaccaria, and A. Sgorbissa. 2019. Knowledge representation for culturally competent personal robots: Requirements, design principles, implementation, and assessment. Int. J. Soc. Robot. 11, 515–538. D. E. Byrne. 1971. The Attraction Paradigm. Academic Press. J. Campinha-Bacote. 2002. The process of cultural competence in the delivery of healthcare services: A model of care. J. Transcult. Nurs. 13, 3, 181–184. DOI: https://doi.org/10. 1177/10459602013003003.


J. Cassell. 2000. Nudge nudge wink wink: Elements of face-to-face conversation for embodied conversational agents. In Embodied Conversational Agents. MIT Press, 1–27. J. Cassell. 2009. Social practice: Becoming enculturated in human–computer interaction. In C. Stephanidis (Ed.), Universal Access in Human–Computer Interaction. Applications and Services. Springer Berlin, Heidelberg, 303–313. DOI: https://doi.org/10.1007/978-3-64202713-0_32. J. Cassell, J. Sullivan, and S. Prevost. 2000. Embodied Conversational Agents. The MIT Press. L. L. Cavalli-Sforza and M. W. Feldman. 1981. Cultural Transmission and Evolution. Princeton University Press, Princeton, NJ. F. Correia, S. Mascarenhas, R. Prada, F. S. Melo, and A. Paiva. 2018. Group-based emotions in teams of humans and robots. In International Conference on Human–Robot Interaction (HRI 18). ACM, 261–269. DOI: https://doi.org/10.1145/3171221.3171252. F. de Rossis, C. Pelachaud, and I. Poggi. 2004. Transcultural believability in embodied agents: A matter of consistent adaptation. In Agent Culture: Human–Agent Interaction in a Multicultural World. Lawrence Erlbaum Associates, 75–106. P. C. Earley. 2002. Redefining interactions across cultures and organizations: Moving forward with cultural intelligence. Res. Organ. Behav. 24, 271–299. DOI: https://doi.org/10. 1016/S0191-3085(02)24008-3. P. C. Earley and S. Ang. 2003. Cultural Intelligence: Individual Interactions Across Cultures. Stanford University Press. P. C. Earley and E. Mosakowski. 2004. Cultural intelligence. Harv. Bus. Rev. 82, 10, 139–146. C. Efferson, R. Lalive, and E. Fehr. 2008. The coevolution of cultural groups and ingroup favoritism. Science 321, 5897, 1844–1849. DOI: https://doi.org/10.1126/science.1155805. B. Endrass, M. Rehm, and E. André. 2009. Culture-specific communication management for virtual agents. In S. Decker, S. Sierra, and Castelfranchi (Eds.), Proceedings of 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2009). Budapest, Hungary. DOI: https://doi.org/10.1145/1558013.1558052. B. Endrass, E. André, L. Huang, and J. Gratch. 2010. A data-driven approach to model culture-specific communication management styles for virtual agents. In W. van der Hoek, G. A. Kaminka, Y. Lespérance, M. Luck, and S. Sen (Eds.), 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2010). IFAAMAS, 99–108. B. Endrass, Y. Nakano, A. Lipi, M. Rehm, and E. André. 2011a. Culture-related topic selection in small talk conversations across Germany and Japan. In H. H. Vilhjálmsson, S. Kopp, S. Marsella, and K. R. Thórisson (Eds.), Proceedings of 11th International Conference on Intelligent Virtual Agents (IVA 2011). Springer, 1–13. B. Endrass, M. Rehm, and E. André. 2011b. Planning small talk behavior with cultural influences for multiagent systems. Comput. Speech Lang. 25, 2, 158–174. DOI: https:// doi.org/10.1016/j.csl.2010.04.001. http://dblp.uni-trier.de/db/journals/csl/csl25.html# EndrassRA11. B. Endrass, M. Rehm, A.-A. Lipi, Y. Nakano, and E. André. 2011c. Culture-related differences in aspects of behavior for virtual characters across Germany and Japan. In Proceedings of


10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2011), 441–448. B. Endrass, E. André, M. Rehm, and Y. Nakano. 2013. Investigating culture-related aspects of behavior for virtual characters. Auton. Agent. Multi-Agent Syst. 27, 2, 277–304. DOI: https://doi.org/10.1007/S10458-012-9218-5. G. Eresha, M. Häring, B. Endrass, E. André, and M. Obaid. 2013. Investigating the influence of culture on proxemic behaviors for humanoid robots. In International Symposium on Robot and Human Interactive Communication (RO-MAN 2013), 430–435. IEEE. ISBN 978-14799-0507-2. http://dblp.uni-trier.de/db/conf/ro-man/ro-man2013.html#EreshaHEAO13. F. Eyssel and D. Kuchenbrandt. 2012. Social categorization of social robots: Anthropomorphism as a function of robot group membership. Br. J. Soc. Psychol. 51, 4, 724–731. DOI: https://doi.org/10.1111/j.2044-8309.2011.02082.x. S. Finkelstein, E. Yarzebinski, C. Vaughn, A. Ogan, and J. Cassel. 2013. The effects of culturally congruent educational technologies on student achievement. In International Conference on Artificial Intelligence in Education. Springer, 493–502. S. T. Fiske and S. E. Taylor. 2016. Social Cognition: From Brains to Culture (3rd ed.). Sage Publications. W. B. Gudykunst (Ed.). 2003. Cross-Cultural and Intercultural Communication. Sage Publications. E. T. Hall. 1959. The Silent Language. Doubleday, New York. E. T. Hall. 1966. The Hidden Dimension. Doubleday, New York. L. Hall, S. Tazzyman, C. Hume, B. Endrass, M. Lim, G. Hofstede, A. Paiva, E. André, A. Kappas, and R. Aylett. 2015. Learning to overcome cultural conflict through engaging with intelligent agents in synthetic cultures. Int. J. Artif. Intell. Educ. 25, 2, 291–317. DOI: https://doi.org/10.1007/s40593-014-0031-y. G. Hofstede. 1980. Culture’s Consequences: International Differences in Work-Related Values. Sage, Newbury Park, CA. G. Hofstede. 1991. Cultures and Organisations—Intercultural Cooperation and its Importance for Survival, Software of the Mind. Profile Books, London, UK. C. Hu, K. Thomas, and C. Lance. 2008. Intentions to initiate mentoring relationships: Understanding the impact of race, proactivity, feelings of deprivation, and relationship roles. J. Soc. Psychol. 148, 6, 727–744. DOI: https://doi.org/10.3200/SOCP.148.6.727-744. F. Iacobelli and J. Cassell. 2007. Ethnic identity and engagement in embodied conversational agents. In 7th International Conference on Intelligent Virtual Agents (IVA 2007). Springer, 57–63. DOI: https://doi.org/10.1007/978-3-540-74997-4_6. D. Jan, D. Herrera, B. Martinovski, D. Novick, and D. Traum. 2007. A computational model of culture-specific conversational behavior. In C. Pelachaud, et al., (Ed.). Intelligent Virtual Agents 2007. Springer, 45–56. DOI: https://doi.org/10.1007/978-3-540-74997-4_5. W. L. Johnson, C. Beal, A. Fowles-Winkler, U. Lauper, S. Marsella, S. Narayanan, and D. P. abd Hannes Vilhjálmsson. 2004a. Tactical language training system: An interim


report. In International Conference on Intelligent Tutoring Systems (ITS 2004). Springer, 336–334. W. L. Johnson, S. Marsella, and H. Vilhjálmsson. 2004b. The DARWARS Tactical Language Training System. In Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC 2004). W. L. Johnson, L. Friedland, P. J. Schrider, and A. Valente. 2011. The virtual cultural awareness trainer (VCAT): Joint Knowledge Online’s (JKO’s) solution to the individual operational culture and language training gap. In Proceedings of ITEC 2011. A. A. Khaliq, U. Köckemann, F. Pecora, A. Saffiotti, B. Bruno, C. T. Recchiuto, A. Sgorbissa, H.-D. Bui, and N. Y. Chong. 2018. Culturally aware planning and execution of robot actions. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 326–332. DOI: https://doi.org/10.1109/IROS.2018.8593570. P. Khooshabeh, M. Dehghani, A. Nazarian, and J. Gratch. 2017. The cultural influence model: When accented natural language spoken by virtual characters matters. AI Soc. 32, 9–16. DOI: https://doi.org/10.1007/s00146-014-0568-1. J. Kim, R.-W. Hill, P. Durlach, H.-C. Lane, E. Forbell, M. Core, S. Marsella, D. Pynadath, and J. Hart. 2009. Bilat: A game-based environment for practicing negotiation in a cultural context. Int. J. Artif. Intell. Educ. 19, 289–308. F. Kistler, B. Endrass, I. Damian, C. T. Dang, and E. André. 2012. Natural interaction with culturally adaptive virtual characters. J. Multimodal User Interfaces 6, 1, 39–47. DOI: https://doi.org/10.1007/s12193-011-0087-z. K. Kluckhohn and F. Strodtbeck. 1961. Variations in Value Orientations. Row, Peterson, New York. T. Koda. 2007. Cross-cultural study of avatars’ facial expressions and design considerations within Asian countries. In International Workshop on Intercultural Collaboration. Springer, 207–220. DOI: https://doi.org/10.1007/978-3-540-74000-1_16. T. Koda and T. Ishida. 2006. Cross-cultural study of avatar expression interpretations. In International Symposium on Applications and the Internet (SAINT’06). DOI: https://doi.org/ 10.1109/SAINT.2006.19. T. Koda, M. Rehm, and E. André. 2008. Cross-cultural evaluations of avatar facial expressions designed by Western designers. In International Conference on Intelligent Virtual Agents (IVA 2008). Springer, 245–252. DOI: https://doi.org/10.1007/978-3-540-85483-8_25. T. Koda, T. Ishida, M. Rehm, and E. André. 2009. Avatar culture: Cross-cultural evaluations of avatar facial expressions. AI Soc. 24, 3, 237–250. T. Koda, Z. Ruttkay, Y. Nakagawa, and K. Tabuchi. 2010. Cross-cultural study on facial regions as cues to recognize emotions of virtual agents. In International Conference on Culture and Computing. Springer, 16–27. DOI: https://doi.org/10.1007/978-3-642-17184-0_2. T. Koda, T. Hirano, and T. Ishioh. 2017. Development and perception evaluation of culturespecific gaze behaviors of virtual agents. In International Conference on Intelligent Virtual Agents (IVA 17). Springer, 213–222. DOI: https://doi.org/10.1007/978-3-319-67401-8_25. J. Lave and E. Wenger. 1991. Situated Learning: Legitimate Peripheral Participation. Cambridge University Press, Cambridge.


B. Lugrin, J. Frommel, and E. André. 2015. Modeling and evaluating a Bayesian network of culture-dependent behaviors. In International Conference on Culture and Computing. IEEE (winner of the best paper award). B. Lugrin, A. Bartl, H. Striepe, J. Lax, and T. Toriizuka. 2018a. Do I act familiar? Investigating the similarity-attraction principle on culture-specific communicative behaviour for social robots. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2033–2039. B. Lugrin, B. Eckstein, K. Bergmann, and C. Heindl. 2018b. Adapted foreigner-directed communication towards virtual agents. In 18th International Conference on Intelligent Virtual Agents (IVA 18). ACM, 59–64. DOI: https://doi.org/10.1145/3267851.3267859. B. Lugrin, J. Frommel, and E. André. 2018c. Combining a data-driven and a theory-based approach to generate culture-dependent behaviours for virtual characters. In Advances in Culturally-Aware Intelligent Systems and in Cross-Cultural Psychological Studies. Springer, 111–142. DOI: https://doi.org/10.1007/978-3-319-67024-9_6. S. Mascarenhas, J. Dias, R. Prada, and A. Paiva. 2009. One for all or one for one? The influence of cultural dimensions in virtual agents’ behaviour. In Z. Ruttkay, M. Kipp, A. Nijholt, and H. H. Vilhjálmsson (Eds.), Proceedings 9th International Conference on Intelligent Virtual Agents, IVA 2009. Amsterdam, The Netherlands, September 14–16, 2009. Volume 5773 of Lecture Notes in Computer Science. Springer, 272–286. S. F. Mascarenhas, A. Silva, A. Paiva, R. Aylett, F. Kistler, E. André, N. Degens, G. J. Hofstede, and A. Kappas. 2013. Traveller: An intercultural training system with intelligent agents. In 12th International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2013). 1387–1388. E. Nouri and D. R. Traum. 2014. Generative models of cultural decision making for virtual agents based on user’s reported values. In T. W. Bickmore, S. Marsella, and C. L. Sidner (Eds.), Proceedings 14th International Conference on Intelligent Virtual Agents, IVA 2014, Boston, MA, USA, August 27–29, 2014. Volume 8637 of Lecture Notes in Computer Science. Springer, 310–315. E. Nouri, K. Georgila, and D. Traum. 2017. Culture-specific models of negotiation for virtual characters: Multi-attribute decision-making based on culture-specific values. AI Soc. 32, 1, 51–63. DOI: https://doi.org/10.1007/s00146-014-0570-7. D. Obremski, J.-L. Lugrin, P. Schaper, and B. Lugrin. 2019. Non-native speaker generation and perception for mixed-cultural settings. In ACM International Conference on Intelligent Virtual Agents (IVA 2019). ACM, 105–107. D. Obremski, J. L. Lugrin, P. Schaper, and B. Lugrin. 2021. Non-native speaker perception of Intelligent Virtual Agents in two languages: the impact of amount and type of grammatical mistakes. J. Multimodal User Interfaces 15, 229–238. DOI: https://doi.org/10.1007/ s12193-021-00369-9. S. Payr and R. Trappl (Eds.). 2004. Agent Culture: Human–Agent Interaction in a Multicultural World. Lawrence Erlbaum Associates.


P. Rau, Y. Li, and D. Li. 2009. Effects of communication style and culture on ability to accept recommendations from robots. Comput. Hum, Behav. 25, 2, 587–595. DOI: https://doi.org/10.1016/j.chb.2008.12.025. M. Rehm. 2010. Developing Enculturated Agents: Pitfalls and Strategies. Idea Group Publishing, 362–386. M. Rehm. 2018. Affective body movements (for robots) across cultures. In C. Faucher (Eds.), Advances in Culturally-Aware Intelligent Systems and in Cross-Cultural Psychological Studies. Intelligent Systems Reference Library, Vol. 134. Springer, Cham. DOI: https://doi.or g/10.1007/978-3-319-67024-9_8. M. Rehm, E. André, Y. Nakano, T. Nishida, N. Bee, B. Endrass, H.-H. Huan, and M. Wissner. 2007. The CUBE-G approach—Coaching culture-specific nonverbal behavior by virtual agents. In I. Mayer and H. Mastik (Eds.). ISAGA 2007: Organizing and Learning through Gaming and Simulation. M. Rehm, E. André, N. Bee, B. Endrass, M. Wissner, Y. Nakano, A. A. Lipi, T. Nishida, and H.-H. Huang. 2008. Creating standardized video recordings of multimodal interactions across cultures. In International LREC Workshop on Multimodal Corpora. Springer, 138–159. M. Rehm, Y. Nakano, E. André, T. Nishida, N. Bee, B. Endrass, M. Wissner, A.-A. Lipi, and H.-H. Huang. 2009. From observation to simulation: Generating culture-specific behavior for interactive systems. AI Soc. 24, 3, 267–280. DOI: https://doi.org/10.1007/s00146-0090216-3. M. Rehm, A. Krummheuer, and K. Rodil. 2018. Developing a new brand of culturally-aware personal robots based on local cultural practices in the Danish health care system. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2002–2007. DOI: https://doi.org/10.1109/IROS.2018.8594478. R. C. Schank. 1975. Conceptual Information Processing. Elsevier, New York, USA. A. Sgorbissa, I. Papadopoulos, B. Bruno, C. Koulouglioti, and C. Recchiuto. 2018. Encoding guidelines for a culturally competent robot for elderly care. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). DOI: https://doi.org/10. 1109/IROS.2018.8594089. D. Sperber. 1985. Anthropology and psychology: Towards an epidemiology of representations. Man 20, 1, 73–89. DOI: https://doi.org/10.2307/2802222. D. Sperber. 1996. Explaining Culture: A Naturalistic Approach. Blackwell, Oxford. D. Sperber. 2012. Cultural attractors. In J. Brockman (Ed.), This Will Make You Smarter. Harper, New York, 180–183. S. Ting-Toomey. 1999. Communicating Across Cultures. The Guilford Press, New York. F. Trompenaars and C. Hampden-Turner. 1997. Riding the Waves of Culture—Understanding Cultural Diversity in Business. Nicholas Brealey Publishing, London. G. Trovato, T. Kishi, N. Endo, K. Hashimoto, and A. Takanishi. 2012. A cross-cultural study on generation of culture dependent facial expressions of humanoid social robot. In


International Conference on Social Robotics (ICSR 2012). 35–44. DOI: https://doi.org/10. 1007/978-3-642-34103-8_4. L. Wang, P.-L. P. Rau, V. Evers, B. K. Robinson, and P. Hinds. 2010. When in Rome: The role of culture & context in adherence to robot recommendations. In 5th ACM/IEEE International Conference on Human–Robot Interaction (HRI). 359–366. E. Wenger. 1998. Communities of Practice: Learning, Meaning, and Identity. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9780511803932.

Authors' Biographies

A.1 Editors

Birgit Lugrin Birgit Lugrin (maiden name Birgit Endrass) is a professor of media informatics at the University of Würzburg, Germany. Since her first contact with a socially interactive agent (the Greta agent) in 2003, she has been fascinated by the research area. Ten years later, she received the prestigious IFAAMAS Victor Lesser Distinguished Dissertation Award and the research award from Augsburg University for her doctoral thesis titled "Cultural Diversity for Virtual Characters." Today she could not be happier about the chance to co-edit this handbook and work with all the great researchers who have contributed to make this happen.

Catherine Pelachaud Catherine Pelachaud is Director of research at CNRS in the ISIR Laboratory, Sorbonne University. She received her PhD in Computer Graphics at the University of Pennsylvania, Philadelphia, PA, USA, in 1991. She participated in the elaboration of the first embodied conversation agent system, GestureJack, with Justine Cassell, Norman Badler, and Mark Steedman when she was a post-doctoral researcher at the University of Pennsylvania. With her research team, she has been developing an interactive virtual agent platform, Greta, that can display emotional and communicative behaviors.


David Traum David Traum is the Director for Natural Language Research at the Institute for Creative Technologies (ICT) and Research Professor in the Department of Computer Science at the University of Southern California (USC). He leads the Natural Language Dialogue Group at ICT. Traum’s research focuses on dialogue communication between humans and artificial agents. He has engaged in theoretical, implementational, and empirical approaches to the problem, studying human–human natural language and multimodal dialogue, as well as building a number of dialogue systems to communicate with human users. Traum earned his PhD in Computer Science at the University of Rochester in 1994.

A.2 Chapter Authors

Anna Abrams Anna M. H. Abrams is a research associate and doctoral candidate at the Chair of Individual and Technology (iTec) at RWTH Aachen University. She received a bachelor's degree in psychology from Maastricht University, the Netherlands, and a master's degree from TU Darmstadt, Germany. Her research interests include group dynamics in human–robot interaction (HRI) under consideration of fundamental social–psychological models and theories and research methods in psychology and HRI. Previously, she worked as a research associate at the Institute for Mechatronic Systems in Mechanical Engineering at TU Darmstadt, and as a consultant for usability and user experience research.


Patrícia Arriaga Patrícia Arriaga is a psychologist and received her PhD in social and organizational psychology in 2006. Currently, she is an assistant professor and the director of a master’s course in science on emotions at ISCTE-University Institute of Lisbon, Portugal. Her main research focus is on emotions applied to topics in both social and health psychology, being currently involved in projects related to the study of human–robot interaction and the development of tools (e.g., board games, exergames, multimedia games) with the ultimate goal of promoting wellbeing and prosocial behaviors.

Matthew Aylett Matthew Aylett has been working for over two decades in speech synthesis, in both commercial and academic roles. He has published widely on the theme of putting character and emotion into speech synthesis, and he has significant media engagement experience in the areas of voice cloning and expressive speech synthesis, appearing on national and international news from the BBC to Good Morning America. He has worked on high-profile projects such as recreating JFK's voice to give his last speech and creating a voice for Hanson Robotics' Sophia, as well as the first ever robot–human duet.

Joost Broekens Joost Broekens is president-elect of the Association for the Advancement of Affective Computing (AAAC). He is associate professor of Affective Computing and Human Robot Interaction at the Leiden Institute of Advanced Computer Science (LIACS) of Leiden University. He heads the [a]social creatures lab, which focuses on understanding social interaction with and between artificial creatures. He is co-founder and CTO of Interactive Robotics. His


expertise is in human–agent and human–robot interaction and the computational modeling of emotion.

Carlos Busso Carlos Busso is the director of the Multimodal Signal Processing (MSP) laboratory at The University of Texas at Dallas (UTD), where he is a professor at the Electrical and Computer Engineering Department. He received his PhD degree in electrical engineering from the University of Southern California (USC), Los Angeles, in 2008. In 2014, he received the ICMI Ten-Year Technical Impact Award. He also received the Hewlett-Packard Best Paper Award at the IEEE ICME 2011 (with J. Jain), and the Best Paper Award at the AAAC ACII 2017 (with Yannakakis and Cowie). His research interest is in human-centered multimodal machine intelligence and applications. His current research includes the development of nonverbal behaviors for intelligent conversational agents, focusing of data-driven models and machine learning. His group has developed novel solutions to synthesize realistic, human-like facial expression and head and lip movements for social interactive agents. He has served as the general chair of ACII 2017 and ICMI 2021. He has also served as an associate editor of the IEEE/ACM Transactions on Audio, Speech, and Language Processing, and as a senior editor of the IEEE Signal Processing Letters.

Justine Cassell Justine Cassell is the dean’s professor of language technologies in the School of Computer Science at Carnegie Mellon University. She also holds a chair at the PRAIRIE AI Institute in Paris. Previously, Cassell founded the Technology and Social Behavior Program at Northwestern University, and was a tenured professor at MIT. Cassell has received the MIT Edgerton Prize, the Anita Borg Women of Vision award, the AAMAS Test of Time award, and a National Academy of Sciences Prize for Behavioral Science applicable to policy. She is a fellow of the AAAS, Royal Academy of Scotland, and the Association for Computing Machinery (ACM).


Leigh Clark Leigh Clark is a lecturer in computer science at the Computational Foundry in Swansea University. His research examines the effects of voice and language design on speech interface interactions and how linguistic theories can be implemented and redefined in this context. His work also explores the concept of trust in these interactions and how speech interface design can be improved to better accommodate people who stammer. He is co-founder of the international Conversational User Interfaces (CUI) conference series.

Filipa Correia Filipa Correia received a MSc in computer science from the University of Lisbon, Portugal in 2015. She is currently pursuing a PhD on human–robot interaction at the University of Lisbon. Her research is focused on the group dynamics within mixed teams of humans and robots. In the past, she was a teaching assistant in courses on artificial intelligence, multi-agents systems, society and computing, and social robots and human–robot interaction. She was also part of the EU-FP7 EMOTE project.

Benjamin R. Cowan Benjamin R. Cowan is Associate Professor at University College Dublin School of Information & Communication Studies. He completed his undergraduate studies in psychology and business studies (2006) as well as his PhD in Usability Engineering (2011) at the University of Edinburgh. His research lies at the juncture between psychology, human– computer interaction, and communication systems in investigating how design impacts aspects of user behavior in social, collaborative, and communicative technology interactions. His recent research focuses specifically on how theory and quantitative methods from psychological science can be applied to understand


and design speech and language technologies. Dr. Cowan is the co-founder and codirector of the HCI@UCD group, one of the largest HCI groups in Ireland and is a funded investigator in Science Foundation Ireland’s ADAPT Centre.

Dirk Heylen Dirk Heylen is full professor at the University of Twente in the Netherlands. His research interests cover both the machine analysis of human (conversational) behavior and the generation of humanlike behavior by virtual agents and robots. He is especially interested in the nonverbal and paraverbal aspects of dialogue. He has been involved in many national and European projects on SIAs. Besides his duties as associate editor of journals such as Transactions on Affective Computing, Human– Computer Studies, and Frontiers in Human–Media Interaction, he has been involved in the organization of multiple workshops and conferences related to SIA research (IVA, ICMI, ACII, Persuasive). He has been president of the Association for the Advancement of Affective Computing.

Jonathan Gratch Jonathan Gratch is a research full professor of computer science and psychology at the University of Southern California (USC) and director for virtual human research at USC's Institute for Creative Technologies. He completed his PhD in Computer Science at the University of Illinois in Urbana-Champaign in 1995. Dr. Gratch's research focuses on computational models of human cognitive and social processes, especially emotion, and explores these models' role in advancing psychological theory and in shaping human–machine interaction. He is the founding editor-in-chief (retired) of IEEE's Transactions on Affective Computing, founding associate editor of Affective Science, associate editor of Emotion Review, and the Journal of Autonomous Agents and Multiagent Systems.


Nicole Krämer Nicole Krämer is full professor of social psychology: media and communication at the University of Duisburg-Essen, Germany. She completed her PhD in psychology at the University of Cologne, Germany, in 2001, and received the venia legendi for psychology in 2006. Dr. Krämer's research focuses on social psychological aspects of human–machine interaction (especially social effects of robots and virtual agents) and computer-mediated communication (CMC).

Gale Lucas Gale Lucas is a research assistant professor at the University of Southern California's Institute for Creative Technologies (USC's ICT) and Department of Computer Science; she also holds appointments in the Department of Civil and Environmental Engineering and the Department of Psychology at USC. She received her PhD in psychology from Northwestern University, Evanston, IL, USA, in 2010. She participated in the development of Rapport Agents, especially SimSensei, as a postdoctoral researcher at USC's ICT. She works in the areas of human–computer interaction, affective computing, and trust in automation. Her research focuses on rapport, disclosure, trust, persuasion, and negotiation with virtual agents and social robots.

Arne Manzeschke Arne Manzeschke is a professor of anthropology and ethics at Lutheran University Nuremberg (50% part time) and director of the Department of Ethics and Anthropology in Health Care (50% part time). His background includes training in computer science (Siemens, Munich) and academic studies in theology and philosophy. He has been president of Societas Ethica, the European Society for Research in Ethics, since 2018, and he is vice chairman of the Ethics Commission on Pre-Implantation Diagnostics in Bavaria and director of the Specialist Committee Health Care Technology and Society of the German Society of Biomedical Technology (DGBMT).

Stacy Marsella Stacy Marsella is a professor at Northeastern University in the Khoury College of Computer Sciences, with a joint appointment in psychology. Professor Marsella's multidisciplinary research uses artificial intelligence techniques to model human cognition, emotion, and social behavior. Beyond its relevance to understanding human behavior, the work has seen numerous applications, including health interventions, social skills training, and planning operations. His more applied work includes frameworks for large-scale social simulations and the creation of virtual humans, embodied facsimiles of people that can engage users in face-to-face interactions using verbal and nonverbal behavior.

Rachel McDonnell Rachel McDonnell is an associate professor at the School of Computer Science, Trinity College Dublin. She received her PhD in computer graphics from Trinity College Dublin in 2006. She combines cutting-edge computer graphics research with perceptual studies of virtual characters, both to deepen our understanding of how virtual humans are perceived and to provide new algorithms and guidelines for industry developers on where to focus their efforts.

Bilge Mutlu Bilge Mutlu is an associate professor of computer science, psychology, and industrial engineering at the University of Wisconsin-Madison, where he directs the People and Robots Laboratory. His research program focuses on building human-centered methods and principles to enable the design of robotic technologies and their successful integration into human environments. Dr. Mutlu has an interdisciplinary background that combines design, computer science, and social and cognitive psychology, and a PhD in human–computer interaction from Carnegie Mellon University. He is a former Fulbright fellow. His research has received 18 best paper awards or nominations and recognition in the international press.

Raquel Oliveira Raquel Oliveira is a researcher and PhD candidate at Iscte-Instituto Universitário de Lisboa (CIS-IUL). She has a bachelor's degree in psychology and a master's degree in social and organizational psychology. Her research interests include human–robot interaction and human–computer interaction in group entertainment settings. Her current work focuses on the role of humor in enhancing human–robot interactions.

Ana Paiva Ana Paiva is a professor of computer science at Instituto Superior Técnico, University of Lisbon, investigating the creation of intelligent interactive systems by designing "social agents" that can interact with humans in a natural and social manner. She is also a fellow at the Radcliffe Institute for Advanced Study at Harvard University. Over the years, she has addressed this problem by engineering social agents that exhibit specific capabilities, including emotions, personality, culture, non-verbal behavior, empathy, and collaboration, among others. Her more recent research combines methods from artificial intelligence with social modeling to study hybrid societies of humans and machines. In particular, she is investigating how to engineer agents that lead to more prosocial and altruistic societies. She has published extensively and received best paper awards at several conferences; notably, she won the Blue Sky Award at AAAI in 2018. She has further advanced the area of artificial intelligence and social agents worldwide, having served on the Global Agenda Council on Artificial Intelligence and Robotics of the World Economic Forum and as a member of the Scientific Advisory Board of Science Europe. She is a EurAI fellow.

Jairo Perez-Osorio Jairo Perez-Osorio obtained a PhD in systemic neurosciences from Ludwig-Maximilians-University Munich (LMU) in 2016. His background is in cognitive neuroscience (MSc in neuro-cognitive psychology, LMU, 2010) and psychology (BA in psychology, National University of Colombia, 2005). His research interests include social neuroscience, the link between action prediction and visual attention, and the attribution of mental states to artificial agents. His work focuses on the impact of prediction on the deployment of attention in social interactions, using behavioral, EEG/ERP, and eye-tracking methods. He is a managing editor of the International Journal of Social Robotics.

Roberto Pieraccini Since 1981, Roberto Pieraccini has pursued a career in research and development on human–machine spoken interaction at institutions in Italy, the US, and Switzerland. He is best known for his pioneering work on statistical natural language understanding and reinforcement learning for automated dialogue systems. Currently, he leads the Google Assistant's Natural Language Understanding team in Zurich. He is the author of The Voice in the Machine (MIT Press, 2012) and a fellow of IEEE and ISCA. He received the Primi Dieci award from the Italian–American Chamber of Commerce in 2016 and an honorary Doctor of Science degree from Heriot-Watt University in Edinburgh, UK, in 2019.

Matthias Rehm Matthias Rehm is the head of the Human Machine Interaction Group and the director of the cross-disciplinary HRI lab at Aalborg University, Denmark. He received his doctoral degree with honors from Bielefeld University, Germany, in 2001 and has since worked on socially intelligent agents in both virtual and physical form. While working with Elisabeth André at Augsburg University, he started one of the first projects to investigate culture as a parameter in interaction with virtual characters, the German–Japanese collaboration CUBE-G with Yukiko Nakano (Seikei University) and Toyoaki Nishida (Kyoto University).

Astrid Rosenthal-von der Pütten Astrid Marieke Rosenthal-von der Pütten is a full professor and director of the Chair Individual and Technology at the Department of Society, Technology, and Human Factors at RWTH Aachen University. She received her BS and MS degrees in applied cognitive and media science and her PhD in psychology, all from the University of Duisburg-Essen, where she also worked as a postdoctoral researcher until 2018. Her research interests include social effects of artificial entities, HRI, linguistic alignment with artificial entities, and communication in social media. In her PhD studies, she paid particular attention to the uncanny valley phenomenon in HRI.

Fernando Santos Fernando Santos is an assistant professor at the University of Amsterdam and a member of the Socially Intelligent Artificial Systems group at the Informatics Institute. His research aims to understand collective dynamics in complex systems, explain the evolution of cooperation, and design fair and prosocial AI. He is particularly interested in the emergence of cooperation in populations where humans and artificial agents interact. Fernando received his PhD in computer science and engineering at Instituto Superior Técnico (Lisbon, Portugal). Before joining the University of Amsterdam, he was a James S. McDonnell postdoctoral fellow at Princeton University.

Carolyn Saund Carolyn Saund is a PhD student in the schools of neuroscience and psychology, and computing science, at the University of Glasgow. Previously, she worked in software development for social robotics and affective AI systems at industry start-ups. Her interdisciplinary approach to social robotics incorporates elements of philosophy, cognitive neuroscience, and machine learning to bring inspiring and engaging social experiences to the virtual world. Her current work focuses on semantically aware social gesture generation and on modeling social spatial cognition.

Ilaria Torre Ilaria Torre is a postdoctoral researcher at KTH Royal Institute of Technology, funded by the Wallenberg AI, Autonomous Systems and Software Program. She studies how different characteristics of an artificial agent (e.g., voice, emotional expression, human-likeness, behavior) affect human behavior, specifically in terms of trust. She has a background in speech, psychology, and human–agent interaction. She obtained a PhD from the University of Plymouth, in the EU-funded CogNovo doctoral training center, in 2017. Prior to joining KTH, she was a Marie Sklodowska-Curie postdoctoral researcher at Trinity College Dublin, School of Electronic and Electrical Engineering.

Eva Wiese Eva Wiese is the head of the Social and Cognitive Interactions (SCI) lab at George Mason University (Fairfax, VA, USA). She graduated in psychology in 2008 from Otto-Friedrich-University Bamberg and obtained her PhD in neuroscience in 2013 at the International Graduate School for Systemic Neurosciences (GSN) at Ludwig Maximilian University Munich. She is on the editorial board of the Journal of Experimental Psychology: Applied and an associate editor of the International Journal of Social Robotics. Her research focuses on the neural and behavioral processes involved in human–human and human–robot social interactions.

Agnieszka Wykowska Agnieszka Wykowska is a principal investigator at the Italian Institute of Technology, where she leads the unit "Social Cognition in Human–Robot Interaction," and an adjunct professor of engineering psychology at Luleå University of Technology. She graduated in neuro-cognitive psychology (2006, LMU Munich) and obtained her PhD in psychology (2008) and her habilitation (2013) from LMU Munich. In 2016, she was awarded the ERC Starting Grant "InStance." She is editor-in-chief of the International Journal of Social Robotics, a board member of the European Society for Cognitive and Affective Neuroscience (ESCAN), and a member of the core faculty of the European Laboratory for Learning and Intelligent Systems (ELLIS).

Index

AAMAS. See Autonomous agents and multi-agent systems (AAMAS)
AAVE. See African American Vernacular English (AAVE)
Accent, 183–184, 485
Action State Estimation (ASE), 333
Action Unit (AU), 263
Adaptation, 279–280
Affect, 350, 355–356
Agent for assisted living, 41–45
Appearance, 107
Appeal, 130–133
construction, 116
communicative behavior, 84–85
non-verbal communicative features, 85–86
physical appearance, 86–87
Aibo (doglike robot), 109
Altruism, 388, 390–391
Amazon Mechanical Turk, 43
American Psychological Association (APA), 34
Analysis of variance (ANOVA), 45
ANOVA. See Analysis of variance (ANOVA)
Anthropomorphism, 317
APA. See American Psychological Association (APA)
Appraisal, 350, 354
Arousal, 353
Artificial intelligence (AI), xviii, 4, 163
ASE. See Action State Estimation (ASE)
ASIMO humanoid robot, 112
ASR. See Automatic speech recognition (ASR)
Attitude, 351, 359
AU. See Action Unit (AU)
Automatic speech recognition (ASR), 147, 179
Autonomous agents and multi-agent systems (AAMAS), 5
Avatar, 4
AVATAR corpus, 287
Back channels, 189–190, 436
BACS. See Body Action Coding System (BACS)
BASSIS. See Biomimetic architecture for situated social intelligence systems (BASSIS)
Bayesian theory of mind (BToM), 328
BDTE. See Belief–desire theory of emotion (BDTE)
BEAT. See Behavior expression animation toolkit (BEAT)
Behavior
Communicative
Computational model
Nonverbal
Behavior expression animation toolkit (BEAT), 229, 268–269
Behavioral assessments, 198
Behavioral measures, 30, 444
Belief management system, 331
Belief system, 331
Belief–desire theory of emotion (BDTE), 362–364
Bidirectional long short-term memory cells (BLSTM cells), 284
Bidirectional reflectance distribution functions (BRDFs), 118
Big data, 242
Big Five personality traits, 88–89
BiLAT system, 474
BIM. See Bystander intervention model (BIM)
Biomimetic architecture for situated social intelligence systems (BASSIS), 332
BLSTM cells. See Bidirectional long short-term memory cells (BLSTM cells)
Body Action Coding System (BACS), 264
Bounding Volume Hierarchy (BVH), 233
BRDFs. See Bidirectional reflectance distribution functions (BRDFs)
BToM. See Bayesian theory of mind (BToM)
BVH. See Bounding Volume Hierarchy (BVH)
Bystander intervention model (BIM), 395
CARESSES (European–Japanese Project), 477
CASA. See Computers are Social Actors (CASA)
Cerebella, 229, 231
CFG. See Context-free grammar (CFG)
CHMMs. See Conceptual hidden Markov models (CHMMs)
CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI), 287
CMU-MOSI. See CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI)
Coding system, 263–265
Cognition, 355–356
Cognitive appraisal, 350
theory, 354, 361
Cognitive empathy, 390, 393
Cognitive-affective architectures
cognitive-agent based appraisal, 367–368
embodied appraisal, 368–369
reinforcement learning appraisal, 369–370
Cognitive-agent based appraisal, 367–368
Cognitive-agent based modeling, 361–364
Computers are Social Actors (CASA), 78
Conceptual hidden Markov models (CHMMs), 161–162
Conditional random fields (CRF), 283
Context-free grammar (CFG), 156
Conversational speech, 189–190
Cooperation, 388, 391–392
Coordination, 435, 448
Correlational research strategy, 27
COVID-19 lockdowns, 387
CRF. See Conditional random fields (CRF)
Cross speaker features, 182
accent and dialect, 183–184
language, 182–183
voice adaptation, 185–186
voice styles, 184–185
Cross-cultural communication, 478
CrowdFlower, 43
Cube-G corpus, 287, 475
Culture, 463
Cultural adaptation, 467
Cultural awareness, 464, 471, 475–478, 482
culture in SIAs, 471–473
evolution, 467
intelligence, 469
Theories of Culture awareness, 478
DAE. See Denoising autoencoder (DAE)
Data processing and analysis, 36–37
Data-driven techniques, 230–231, 282–286
Databases, 286–288, 290–291
DBNs. See Dynamic Bayesian networks (DBNs)
DBPedia, 152
Deep learning. See also Reinforcement learning (RL)
models, 283–284
for NLU, 163–165
Deep neural networks (DNNs), 159, 185, 266
Deixis, 262
Denoising autoencoder (DAE), 284
Dialect, 183–184
Dialog manager (DM), 148
Dialog regulation, 220
DialogFlow, 165
Diphone synthesis, 177
Disfluencies, 190
DM. See Dialog manager (DM)
DNNs. See Deep neural networks (DNNs)
Dual inheritance theory, 467–468
DV. See Dependent variable (DV)
DyadGAN model, 285
Dynamic Bayesian networks (DBNs), 266, 283
EBBU. See Empirical Bayesian Belief Update model (EBBU)
eCIRCUS European project, 475
EDD. See Eye direction detector (EDD)
ELDA, 179
Electromyogram (EMG), 264
Embodied appraisal, 368–369
Embodied cognition, 243
Embodied models, 364–365
Embodiment, 7–9, 480–481
EMG. See Electromyogram (EMG)
EMOTE. See Expressive MOTion Engine (EMOTE)
Emotion elicitation, 361
Emotional state, 186–188
Emotions, 270–274, 349–356, 385
affect and cognition, 355–356
Empathic responses, 396
Empathy, 386, 389–390
empathy–altruism hypothesis, 392
modulation in empathic agents, 399–400
Emphasis intonation, 188–189
Empirical Bayesian Belief Update model (EBBU), 330
Empirical social sciences, 21. See also Social cognition modeling
Encoding–decoding, 274
End-to-end models, 231
Entrainment, 435
Erica robot, 288
Ethics committees, 34
Expressive MOTion Engine (EMOTE), 274, 399
Expressive speech synthesis, 173–176
External validity, 52
Eye direction detector (EDD), 326
Facial Action Coding System (FACS), 263
FACS. See Facial Action Coding System (FACS)
Fairness, accountability, and transparency (FAT), 89
FAT. See Fairness, accountability, and transparency (FAT)
FAtiMA. See Fearnot AffecTIve Mind Architecture (FAtiMA)
Fearnot AffecTIve Mind Architecture (FAtiMA), 275, 362, 473, 480
Flobi social robot, 119
FML. See Function Markup Language (FML)
fMRI. See Functional magnetic resonance imaging (fMRI)
Formant synthesis, 177
Function Markup Language (FML), 270
Functional magnetic resonance imaging (fMRI), 92
Furhat robot, 288–289
Fuzzy rules, 271
G2P system, 183
GAMYGDALA, 362
GAN. See Generative adversarial network (GAN)
Gaussian mixture models (GMM), 283
Gaze, 261
GEMEP database. See GEneva Multimodal Emotion Portrayals database (GEMEP database)
General learning model, 388
General question answering, 152
Generative adversarial network (GAN), 266
Generative models, 285–286
GEneva Multimodal Emotion Portrayals database (GEMEP database), 286
Gestures, 213–214
Beat
Co-speech, 213
Deictic, 215
Gestural phases and units, 218–219
Iconic, 262
Metaphoric
timing, 218–219
Gesture Net for Iconic Gestures (GNetIc), 286
GestureJack, 267
GMM. See Gaussian mixture models (GMM)
GNetIc. See Gesture Net for Iconic Gestures (GNetIc)
Graphical models, 282–283
Greta, 230
Grim trigger, 412
Haar Cascade, 330
Habitability gap, 166
Hamilton's rule, 389
HAMMER. See Hierarchical Attentive Multiple Models for Execution and Recognition (HAMMER)
Hard-wired appraisal, 366, 370
Hidden Markov models (HMMs), 185, 266
Hierarchical Attentive Multiple Models for Execution and Recognition (HAMMER), 333
HMMs. See Hidden Markov models (HMMs)
Hofstede's system of values, 466
HRSP. See Human–Robot Shared Planner (HRSP)
Human–Robot Shared Planner (HRSP), 333
Hybrid approaches, 266
Hypothetico-deductive model, 23
ID. See Intentionality detector (ID)
In-group bias. See In-group favoritism
In-group favoritism, 470
In-group/outgroup bias. See In-group favoritism
Inductive reasoning, 23
Intelligent systems, 173
Intentionality detector (ID), 326
Inter-cultural communication, 478
abilities, 469
Interaction studies, 52–54
Interdisciplinary collaboration, 237–238
International Phonetic Alphabet (IPA), 180
Intraparietal sulcus (IPS), 321
IPA. See International Phonetic Alphabet (IPA)
IPS. See Intraparietal sulcus (IPS)
IRBs. See Institutional review boards (IRBs)
IROS. See Intelligent Robots and Systems (IROS)
ISQ. See Instance Questionnaire (ISQ)
IV. See Independent variable (IV)
JA. See Joint attention (JA)
Joint attention (JA), 316, 320
in HRI, 325–326
neural correlates of JA, 321
KASPAR social robot, 124
Knowledge Navigator concept, 108
Laban movement analysis (LMA), 274
LADS, 179
Latent-dynamic conditional random fields (LDCRF), 283
Lattice, 148
Laughter, 291
LDC, 179
LDCRF. See Latent-dynamic conditional random fields (LDCRF)
LibriTTS, 179
Likert scale, 40
Linked Open Data Cloud, 152
LMA. See Laban movement analysis (LMA)
Local Binary Pattern Histogram, 330
Long-short term memory (LSTM), 164, 266
Long-term studies, 59–60
Lose–shift strategy, 413
LSTM. See Long-short term memory (LSTM)
Machine learning (ML), 163, 266, 282–286, 439–441. See also Deep learning
Manipulation check, 44
Markov decision problems (MDPs), 328
MCD. See Mel-cepstral distortion (MCD)
MDPs. See Markov decision problems (MDPs)
Mechanical Turk, 108
Media, 151
equation approach, 80
Mel-cepstral distortion (MCD), 198
Metaphorical design, 112, 130
Metaphorics, 262
Mimicry, 79, 398
MiRo robot, 112
Mixed-cultural membership, 485–486
ML. See Machine learning (ML)
Moderator variable, 56
Motion capture, 232
Motion Heuristic, 329
Motor system, 331
MSP-AVATAR corpus, 290–291
Multilingualism, 182
MUMIN, 265
Mutual attentiveness, 435, 448–449
Mutual engagement, 470
N-best, 148
National disaster response system, 387
Natural language module (NLG), 148
Natural language understanding (NLU), 147
Usability Paradox, 166–167
Neural correlates
of JA, 321
of theory of mind, 319–320
Neural TTS, 176, 178
Neuroticism, 88–89
NLG. See Natural language module (NLG)
NLU. See Natural language understanding (NLU)
NoCost, 329
Non-probability sampling methods, 32
Non-verbal behavior generator (NVBG), 229, 270
Nonverbal behavior representation, 261
classification, 261–263
coding system, 263–265
NVBG. See Non-verbal behavior generator (NVBG)
OCC model. See Ortony–Clore–Collins model (OCC model)
OCEAN. See Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism (OCEAN)
Open Science Collaboration, 57
Open-source corpora, 178–179
Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism (OCEAN), 276, 351
ORIENT application, 475
Ortony–Clore–Collins model (OCC model), 354
Parasocial consensus sampling, 437, 441–442
Parasocial interaction, 441
Paro robot, 281
Partially observable Markov decision process (POMDP), 333
PDHA. See Post-Deployment Health Assessment (PDHA)
PERSONAGE model, 278
Personality, 88–89, 276–278, 351, 360
Photogrammetry, 117
Politeness theory, 470
POMDP. See Partially observable Markov decision process (POMDP)
Positivity, 435, 447–448
Post-Deployment Health Assessment (PDHA), 450
Post-traumatic stress disorder (PTSD), 442
Power analysis, 33
Primitive emotional states, 175
Prisoner's Dilemma, 403
Probability sampling methods, 32
Prosocial agents, 393–396
Prosocial behavior, 401–402
Prosocial behaviors, 387, 390, 410
Prosocial computing, 388
Prosociality, 386, 388
Proteus effect, 127, 129
PTSD. See Post-traumatic stress disorder (PTSD)
Public Goods Game, 403
PyTorch, 163
Qualitative research methods, 58–59
Quasi-experimental research strategy, 27–28
Rapport, 274–276, 433
agents, 434, 437–443
theory, 434–437
Rasterizer, 117
REA. See Real-Estate Agent (REA)
Real-Estate Agent (REA), 267
Realism, 130–133
Reciprocal agents, 412–413
Reciprocity, 391, 412
Recurrent neural networks, 284–285
Reinforcement learning (RL), 365
appraisal, 369–370
models, 365–366
Relational agents, 275
Relations, 351, 360
Reliability, 30
Replication crisis, 57–58
Research ethics, 34
Research process, 25
Research tools, 54–56
Resilient agents, 410–412
RL. See Reinforcement learning (RL)
Robotiquette, 7
Rule-based models, 229–230, 398, 437–439
SAIBA. See Situation, Agent, Intention, Behavior, and Animation (SAIBA)
SAIBA framework, 224
SAM. See Shared attention mechanism (SAM)
Sampling procedures, 32
SARS-CoV-2 contagion, 387
SAVE framework. See Sociocultural appraisals, values, and emotions framework (SAVE framework)
SCN. See Social cognitive neuroscience (SCN)
SEC. See Stimulus evaluation checks (SEC)
Self-report measures, 30
Selfishness, 392
SEMAINE project, 276, 278
Sensing, 333
Sentiment. See Attitude
Shared attention mechanism (SAM), 326
Signals, 262
Similarity-attraction principle, 470
Simon humanoid robot, 119
SimSensei, 442–443
Situation, Agent, Intention, Behavior, and Animation (SAIBA), 270
Sliding window deep neural networks (SW-DNNs), 283–284
Social attitudes, 278–279
Social brain, 315
Social cognition modeling, 326
Social cognitive neuroscience (SCN), 313–316
Social dilemmas, 403
Social reactions, 77, 81–84
Social signals, 291
Social touch, 280–282
Sociocultural appraisals, values, and emotions framework (SAVE framework), 402
Speech, 191
Speech Recognition Grammar Specification (SRGS), 157–158
Speech VCTK, 179
Sperber's epidemiology of representations, 468
SRGS. See Speech Recognition Grammar Specification (SRGS)
Static representations, 270–271
Stimulus appraisal, 362
Stimulus evaluation checks (SEC), 271, 363
"Strong" embodiment, 115
Subjective assessments, 198
Subjective measures, 444
SW-DNNs. See Sliding window deep neural networks (SW-DNNs)
Synthetic speech, 193–194
Tactical Language Training System, 474
TARDIS system, 403–404
Taxonomy, 263
Temporal representations, 271–272
TensorFlow, 163
Text-to-speech system (TTS system), 148, 176
TF2T. See Tit-for-two-tats (TF2T)
TFT. See Tit-for-tat (TFT)
Theory of mind (ToM), 275, 316
Theory of mind mechanism (ToMM), 326
Theory-driven approaches, 398
Three-dimension (3D)
interfaces, 2
modeling, 108
representations, 90
scanning, 117
Threshold Model of Social Influence (TMSI), 81–83
Tit-for-tat (TFT), 412
Tit-for-two-tats (TF2T), 413
TMSI. See Threshold Model of Social Influence (TMSI)
ToM. See Theory of mind (ToM)
ToMM. See Theory of mind mechanism (ToMM)
Touch, 280
Training inter-cultural sensitivity, 468
TTS system. See Text-to-speech system (TTS system)
Turing test, 147
Two-dimensional representations (2D representations), 90
Un-expressive speech, 175
Uncanny valley (UV), 130–133, 288
Unit selection, 177
Unstructured interview, 47
User attributes, 87. See also Agent attributes
User-initiated interaction, 160–163
Utterances, 238
UV. See Uncanny valley (UV)
Valence, 353
Validity, 30
VECTOR system, 474
Visual beats, 195
Voice. See also Speech
adaptation, 185–186
styles, 184–185
WaveNet-style TTS, 176
"Weak" embodiment, 115
WHO. See World Health Organization (WHO)
WIMP interfaces. See Windows, icons, menus, and pointers interfaces (WIMP interfaces)
Win–stay strategy, 413
Windows, icons, menus, and pointers interfaces (WIMP interfaces), 1
Within-subjects design, 28
Wizard-of-Oz scenario (WoZ scenario), 35
World Health Organization (WHO), 387
WoZ scenario. See Wizard-of-Oz scenario (WoZ scenario)
XML markup, 179–181