
Gaze in Human-Robot Communication

Benjamins Current Topics issn 1874-0081 Special issues of established journals tend to circulate within the orbit of the subscribers of those journals. For the Benjamins Current Topics series a number of special issues of various journals have been selected containing salient topics of research with the aim of finding new audiences for topically interesting material, bringing such material to a wider readership in book format. For an overview of all books published in this series, please see http://benjamins.com/catalog/bct

Volume 81 Gaze in Human-Robot Communication Edited by Frank Broz, Hagen Lehmann, Bilge Mutlu and Yukiko Nakano These materials were previously published in Interaction Studies 14:3 (2013).

Gaze in Human-Robot Communication Edited by

Frank Broz Heriot-Watt University

Hagen Lehmann Istituto Italiano di Tecnologia

Bilge Mutlu University of Wisconsin-Madison

Yukiko Nakano Seikei University

John Benjamins Publishing Company Amsterdam / Philadelphia


The paper used in this publication meets the minimum requirements of the American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ansi z39.48-1984.

doi 10.1075/bct.81 Cataloging-in-Publication Data available from Library of Congress: lccn 2015040609 (print) / 2015036657 (e-book) isbn 978 90 272 4269 3 (Hb) isbn 978 90 272 6764 1 (e-book)

© 2015 – John Benjamins B.V. No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher. John Benjamins Publishing Company · https://benjamins.com

Table of contents

Introduction
Frank Broz, Hagen Lehmann, Bilge Mutlu & Yukiko Nakano  vii

Design of a gaze behavior at a small mistake moment for a robot
Masahiro Shiomi, Kayako Nakagawa & Norihiro Hagita  1

Robots can be perceived as goal-oriented agents
Alessandra Sciutti, Ambra Bisio, Francesco Nori, Giorgio Metta, Luciano Fadiga & Giulio Sandini  13

Can infants use robot gaze for object learning? The effect of verbalization
Yuko Okumura, Yasuhiro Kanakogi, Takayuki Kanda, Hiroshi Ishiguro & Shoji Itakura  33

Interactions between a quiz robot and multiple participants: Focusing on speech, gaze and bodily conduct in Japanese and English speakers
Akiko Yamazaki, Keiichi Yamazaki, Keiko Ikeda, Matthew Burdelski, Mihoko Fukushima, Tomoyuki Suzuki, Miyuki Kurihara, Yoshinori Kuno & Yoshinori Kobayashi  47

Cooperative gazing behaviors in human multi-robot interaction
Tian Xu, Hui Zhang & Chen Yu  71

Learning where to look: Autonomous development of gaze behavior for natural human-robot interaction
Yasser Mohammad & Toyoaki Nishida  99

Designing robot eyes for communicating gaze
Tomomi Onuki, Takafumi Ishinoda, Emi Tsuburaya, Yuki Miyata, Yoshinori Kobayashi & Yoshinori Kuno  131

Index  159

Introduction

Frank Broz1, Hagen Lehmann2, Bilge Mutlu3 & Yukiko Nakano4

1 Heriot-Watt University, Dept. of Computer Science
2 Istituto Italiano di Tecnologia, iCub Facility
3 University of Wisconsin-Madison, Dept. of Computer Science
4 Seikei University, Dept. of Computer and Information Science

1.  Introduction

Advancements in robot design and supporting technologies such as computer vision and speech recognition increasingly enable robots to interact with humans in a robust and natural manner. These interaction capabilities place new expectations on robots to correctly produce and interpret social behaviors that humans use in face-to-face communication. One of the most salient behaviors that this type of interaction involves is gaze. Current research on gaze in robotics is informed by a history of research in psychology and related fields and primarily comprised of work from the burgeoning fields of human-robot interaction and social robotics. The exploration of the role of gaze in human-robot communication is the topic of this book. The articles presented illustrate various approaches toward a better understanding of how to utilize the human predisposition for interpreting social cues transmitted via gaze to build better social robots.

1.1  Gaze in human communication

Eye gaze is one of the most important non-verbal cues helping humans to understand the intention of other social agents. Compared with the eyes of non-human primate species, human eyes are very visible due to a strong contrast between the sclera and the iris (Kobayashi & Kohshima 1997, 2001). This makes it possible to easily recognize gaze direction and hints at the evolution of a new function of the human eye in close-range social interactions as an additional source of information about the intention of a potential interaction partner (Tomasello et al. 2007). Human children start to use this additional information obtained from the eye movements of their caregivers from the age of 10 months and are able to follow the eye gaze of others by the age of 12 months (Meltzoff & Brooks 2007).

doi 10.1075/bct.81.001int © 2015 John Benjamins Publishing Company


Especially during cooperative social interactions, humans rely heavily on eye gaze information to achieve a common goal. The importance of this information becomes evident when it is inaccessible for one of the interacting agents. Humans with autism spectrum condition have immense difficulties in understanding the intentions of others that could be inferred from information contained in the eye region of the other's face, due to the avoidance of direct eye contact (Baron-Cohen et al. 1995, 1997). The movements of our eyes also signal relevant emotional states, enabling us to interact empathically (Baron-Cohen et al. 2001). The absence of contingent eye gaze creates an eerie feeling and makes humans feel uncomfortable in social situations. For most social interactions it is essential to coordinate one's behavior with one or more interaction partners. It is therefore not only necessary to transmit information, but also to jointly regulate eye contact in a continuous ongoing process with one another, known as mutual gaze (Argyle 1988). Being able to interact with others in this fashion is of great social importance from an early developmental stage and seems to be the basis of and precursor to more complex gaze behaviors such as visual joint attention (Farroni 2003). Gaze is also important for face-to-face communication. It is a component of turn-taking "proto-conversations" between infants and caregivers that set the stage for language learning (Trevarthen & Aitken 2001) and is known to play a role in regulating conversational turn-taking in adults (Kleinke 1986; Kendon 1967). Gaze interaction in the field of psychology is traditionally analyzed using manual coding of video data. However, researchers in robotics and related fields are increasingly using new technologies such as gaze-tracking systems to automatically analyze gaze interaction at a granularity that would be difficult or impossible using manual video coding (Yu et al. 2012; Broz et al. 2012).

1.2  Gaze in human-agent interaction

Gaze has been an increasingly studied topic in human-robot interaction in recent years, so much so that it is impossible to give an overview of this body of research here in this brief introduction. We believe this influx of gaze-related HRI research is due to the field's recognition of the fundamental role of gaze in "face-to-face" embodied interactions. Much of this research investigates how manipulating a robot's gaze behavior influences a person's impressions of the robot as well as of the interaction and their role in it. Mutlu's work on the impact of robot gaze on participant "footing" in multiparty conversation is an influential example of these types of HRI studies (Mutlu et al. 2009). Prior to most of the work on gaze in HRI, gaze had been studied in the context of interaction with Embodied Conversational Agents (ECA) (Cassell et al. 2000).


In order to produce gaze behaviors for conversational agents, the communicative gaze behaviors of humans during conversation are analyzed and modelled (Nakano et al. 2003). While there are important differences between interactions with computer agents and with physical robots, researchers in HRI can benefit from the knowledge gained in this research area.

1.3  Gaze and human-robot communication

This book was inspired by a series of workshops on gaze and human-robot interaction co-organized by the editors. The first of the two workshops, held at the ACM/IEEE Conference on Human-Robot Interaction (HRI) 2012, focused on the topic of gaze in human and human-robot interaction. Gaze was rapidly growing in popularity as a research topic in the field at this time. The second gaze workshop at HRI focused on gaze and speech interaction and the importance of these tightly coupled behaviors for face-to-face communication. Many of the papers in this volume were invited based on the authors' submissions to these workshops. The papers included deal with various functions of human and robot gaze and with the role of gaze in communication between humans and robots. This diverse and interesting collection of articles is representative of the state of the art in research on gaze in human-robot interaction.

2.  Contents

Common themes in the articles of this volume include (1) experimental investigation of human responses to robot gaze, (2) investigation of the impact of coordinating gaze acts with speech, and (3) development of hardware and software technologies for enabling robot gaze. We begin with a short article on an experiment investigating the social impact of robot gaze by Shiomi, Nakagawa, and Hagita. In this research report, the authors investigate the effect of changes in a robot's gaze direction immediately after a mistake on humans' feelings about the robot. They use questionnaire-based interviews to identify a set of common gaze behaviors that people engage in after making a mistake. They then implement these behaviors for a robot during a short interaction task with a human. Post-interaction questionnaires determined that people felt that the robot was most apologetic and friendliest when it looked toward the human after a mistake, as opposed to the gaze aversion behavior that the questionnaire responses reported when people were asked about their behavior when they accepted responsibility for a mistake.




The next article by Sciutti et al. compares human gaze following of robot action to gaze following of human action and explores what the similarities may tell us about how robot action is understood by observers. This article describes an experiment where human gaze is measured in order to infer the mental processes activated when people witness robot action. The authors introduce how anticipatory gaze can be interpreted as evidence of motor resonance of the mirror neuron system when humans observe the actions of other humans and propose that the presence of anticipatory gaze when humans observe the same type of actions by humanoid robots may indicate that the same mental processes are at work. Experiment participants watched a grasp and transport action performed both by a human and by a humanoid robot. The anticipatory gaze to the object’s goal location in both conditions was highly correlated. These results suggest that people unconsciously attribute intentionality to the robot’s actions and process its motion similarly to how they process human action. The next two articles address our second theme, analyzing interactions with robots that coordinate their gaze behavior with speech and exploring the impact of the coupling of verbal and non-verbal behavior. In the article by Okumura et al., the impact of combining robot gaze with speech is investigated in an infant word-learning task. In the experiments described, infants interacted with a robot, which used different amounts and types of verbalizations in coordination with its gaze while presenting novel objects. Verbalizations led infants to gaze longer at cued objects, and additional results suggest that previously cued objects were recognized as more familiar in the condition with the most accompanying verbalizations. A follow-up study investigated whether non-language-based coordinated sound cues would also produce these effects. The sound cues did not produce an effect on infant gaze behavior. The authors suggest that verbalizations coordinated with gaze may be recognized by infants as communicative acts, causing the infants to attribute intentionality to the robot and changing how they process and respond to its gaze cues. Yamazaki et al. analyze multiparty human-robot interaction in cross-cultural settings. Groups of native Japanese and English speakers played a quiz game with a robot that used the same non-verbal behaviors coordinated with significant speech events in each language. The robot’s behaviors were designed based on ethnographic research conducted in Japan and the US on strategies expert tour guides use to engage visitors. Their video analysis focuses on question-response sequences in the interactions and is based on detailed transcription of verbal and nonverbal behavior by the humans and the robot. Their results show more gaze shifts and nodding by English speakers. The increased nodding is the opposite result found in comparisons of Japanese and English speakers in analysis of human-human interaction, where Japanese speakers have been found to nod more frequently.


They also found that the coordination of gaze and the placement of keywords in an utterance have an impact on the success of that utterance in eliciting a response. These results highlight the importance of considering cultural expectations and language-specific differences in designing robot behavior for face-to-face interaction with people of different cultures. The final three papers of this volume address technical approaches to enabling robot gaze. The next two articles evaluate autonomous robot gaze controllers during interactions with humans. Xu, Zhang, and Yu explore an interaction between a human and multiple robots. Their article compares the impact of different gaze behavior strategies by groups of robot learners on the gaze behavior of a human teacher. In the experimental task, a human teacher teaches object names to a pair of robots. In both of the two conditions, one of the robots in the pair exhibits the same gaze policy, following the gaze of the human to relevant objects and returning gaze. In each condition, the other robot's behavior differs when the human's attention is focused on the default gaze policy robot. In the active case, the other robot looks to the human when the human attends to that robot, and in the passive case the other robot also looks at that robot when the human is attending to it. Gaze frequency and duration to the robots and objects during the interaction were measured. People modified their behavior based on the different cooperative gaze behaviors of the robot pairs. They looked at the passive gaze robot significantly less than the other robots and looked more at the active gaze robot during naming events. They also produced more utterances during the active gaze condition. These results suggest that the robots in the active gaze condition may have had more learning opportunities than the pair in the passive gaze condition. Mohammad and Nishida present a machine learning-based approach to producing robot gaze controllers for interaction. Their article describes and evaluates a robot gaze controller that is learned directly from human gaze data rather than designed by hand. In the unsupervised, hierarchical learning method described, a set of gaze patterns are first identified and then a controller is learned which switches among these patterns. The controller was learned using a dataset from a human listening to an explanation of a novel object and evaluated in a similar interaction in which a robot played the role of the listener. Particularly interesting is that most of the learned patterns corresponded to socially meaningful gaze actions without the system making use of any modeling or design based on theories of gaze behavior. The controller was evaluated in comparison with a controller that was learned in a supervised manner with its structure designed in order to produce natural, human-like gaze behavior. Experiment participants watched and evaluated videos of humans interacting with robots driven by the different controllers. The controller from the unsupervised approach was judged


to be more natural, more human-like, and to be more comfortable for the robot's human partner. These results demonstrate that data-driven approaches can be an effective alternative to design-based approaches that implement controllers based on theories of gaze behavior. The final article of the volume concerns the design and evaluation of novel hardware for enabling robot gaze. Onuki et al.'s article describes a series of experiments conducted to identify good design features for the eyes of service robots to achieve a pleasing appearance and better gaze communication with people. They manipulated the shape of the eyes (from flat to round) and the size of the irises (from small to large) using a pair of rear-projected eyes, producing nine robot eye designs based on these combinations of features. They then evaluated people's impressions of the friendliness of the eye designs. Designs with a rounder eye shape and larger irises were rated as friendlier. The designs were also compared based on how easy it was for humans to identify the gaze target by observing the eyes. Round eyes with large irises were also found to produce the most legible gaze. An additional experiment compared gazes that dynamically shifted to the gaze target versus static gaze at a target. People were more able to accurately judge the direction of dynamic gaze. Based on the results of these studies, the most positively evaluated design (round eyes with large irises) was installed in a mobile robot and these projection-based eyes were compared to mechanical eyes. This experiment found that the eyes using projection were judged similarly to the mechanical eyes, though people reported that they found the projected eyes more expressive. The ability of the rear-projected eyes to support the experimental comparison of so many designs demonstrates the flexibility of this technology for manipulating gaze appearance. These papers are an excellent representation of the depth and breadth of work on gaze in the fields of HRI and social robotics. They highlight the interdisciplinary nature of this body of research, drawing on diverse techniques from conversational analysis to machine learning in order to analyze and enable human-robot gaze communication. We would like to thank all of the authors for their strong submissions and hope that readers find these articles as interesting and thought-provoking as we do.

References

Argyle, M. (1988). Bodily communication (2nd ed.). London: Routledge.
Baron-Cohen, S., Wheelwright, S., & Jolliffe, T. (1997). Is there a "language of the eyes"? Evidence from normal adults, and adults with autism or Asperger syndrome. Visual Cognition, 4, 311–331. DOI: 10.1080/713756761
Baron-Cohen, S., Campbell, R., Karmiloff-Smith, A., Grant, J., & Walker, J. (1995). Are children with autism blind to the mentalistic significance of the eyes? British Journal of Developmental Psychology, 13, 379–398. DOI: 10.1111/j.2044-835X.1995.tb00687.x
Baron-Cohen, S., Wheelwright, S., Hill, J., Raste, Y., & Plumb, I. (2001). The 'reading the mind in the eyes' test revised version: A study with normal adults, and adults with Asperger syndrome or high-functioning autism. Journal of Child Psychology and Psychiatry, 42, 241–252. DOI: 10.1111/1469-7610.00715
Broz, F., Lehmann, H., Nehaniv, C.L., & Dautenhahn, K. (2012). Mutual gaze, personality, and familiarity: Dual eye-tracking during conversation. In Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication (Ro-Man).
Cassell, J., Sullivan, J., Prevost, S., & Churchill, E.F. (Eds.). (2000). Embodied conversational agents. MIT Press.
Farroni, T. (2003). Infants perceiving and acting on the eyes: Tests of an evolutionary hypothesis. Journal of Experimental Child Psychology, 85(3), 199–212. DOI: 10.1016/S0022-0965(03)00022-5
Kendon, A. (1967). Some functions of gaze-direction in social interaction. Acta Psychologica, 26, 22–63.
Kleinke, C. (1986). Gaze and eye contact: A research review. Psychological Bulletin, 100(1), 78–100. DOI: 10.1037/0033-2909.100.1.78
Kobayashi, H., & Kohshima, S. (1997). Unique morphology of the human eye. Nature, 387, 767–768. DOI: 10.1038/42842
Kobayashi, H., & Kohshima, S. (2001). Unique morphology of the human eye and its adaptive meaning: Comparative studies on external morphology of the primate eye. Journal of Human Evolution, 40, 419–435. DOI: 10.1006/jhev.2001.0468
Meltzoff, A.N., & Brooks, R. (2007). Eyes wide shut: The importance of eyes in infant gaze following and understanding other minds. In R. Flom, K. Lee, & D. Muir (Eds.), Gaze following: Its development and significance (pp. 217–241). Mahwah, NJ: Erlbaum.
Mutlu, B., Shiwa, T., Kanda, T., Ishiguro, H., & Hagita, N. (2009). Footing in human-robot conversations: How robots might shape participant roles using gaze cues. In Proceedings of the 4th ACM/IEEE Conference on Human-Robot Interaction (HRI 2009), San Diego, CA. DOI: 10.1145/1514095.1514109
Nakano, Y.I., Reinstein, G., Stocky, T., & Cassell, J. (2003). Towards a model of face-to-face grounding. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Volume 1 (pp. 553–561). Association for Computational Linguistics.
Tomasello, M., Hare, B., Lehmann, H., & Call, J. (2007). Reliance on head versus eyes in the gaze following of great apes and human infants: The cooperative eye hypothesis. Journal of Human Evolution, 52, 314–320. DOI: 10.1016/j.jhevol.2006.10.001
Trevarthen, C., & Aitken, K.J. (2001). Infant intersubjectivity: Research, theory, and clinical applications. The Journal of Child Psychology and Psychiatry and Allied Disciplines, 42(1), 3–48. DOI: 10.1111/1469-7610.00701
Yu, C., Schermerhorn, P., & Scheutz, M. (2012). Adaptive eye gaze patterns in interactions with human and artificial agents. ACM Transactions on Interactive Intelligent Systems, 1(2), 13:1–13:25.


Editors' addresses

Frank Broz
Heriot-Watt University
Dept. of Computer Science
Edinburgh, UK
[email protected]

Hagen Lehmann
Istituto Italiano di Tecnologia
iCub Facility
Genova, Italy
[email protected]

Bilge Mutlu
University of Wisconsin-Madison
Dept. of Computer Science
Madison, USA
[email protected]

Yukiko Nakano
Seikei University
Dept. of Computer and Information Science
Tokyo, Japan
[email protected]

Editors' biographical notes

Frank Broz is an assistant professor of computer science at Heriot-Watt University. His recent research involves HRI for eldercare, using multimodal interaction to foster social acceptance and provide user feedback, as part of the Robot-Era project. Previously, he studied mutual gaze in human conversational pairs as part of the ITALK project while a research fellow at the University of Hertfordshire. He received his B.S. in computer science from Carnegie Mellon University and his Ph.D. in robotics from the CMU Robotics Institute. He has organized the AAAI spring symposium "It's All in the Timing: Representing and Reasoning About Time in Interactive Behavior" in 2010, the ICDL-EpiRob special session on "Social Gaze: From Human-Human to Human-Robot Interaction" in 2011, and the HRI workshops "Gaze in HRI: From Modeling to Communication" in 2012 and "HRI Face-to-Face: Gaze and Speech Communication" in 2013.

Hagen Lehmann is a Marie Curie experienced researcher at the Istituto Italiano di Tecnologia. Within the AURORA project he worked on the development of a therapeutic robot to help children with autism to enhance their social skills. Within the ACCOMPANY project he worked on the development of interaction scenarios for robot home companions for elderly people. As part of the ITALK project he worked on mutual gaze in human conversational pairs. He received his M.S. in Psychology from the Technical University Dresden and a Ph.D. in Computer Science from the University of Bath. At the Max Planck Institute for Evolutionary Anthropology in Leipzig he researched social gaze behaviour and its role in human social evolution in different non-human primate species and human infants. He co-organized the ICDL-EpiRob special session on "Social Gaze: From Human-Human to Human-Robot Interaction" in 2011 and the HRI workshops "Gaze in HRI: From Modeling to Communication" in 2012 and "HRI Face-to-Face: Gaze and Speech Communication" in 2013.

Bilge Mutlu is an assistant professor of computer science, psychology, and industrial engineering at the University of Wisconsin-Madison and the director of the Wisconsin Human-Computer Interaction Laboratory. His research program bridges human-computer interaction (HCI), robotics, and computer-supported collaborative work (CSCW), exploring how robotic technologies might be designed to help people learn, communicate, and work. Dr. Mutlu is a former Fulbright Scholar and the recipient of the National Science Foundation's CAREER award and several Best Paper awards and nominations including HRI 2008, HRI 2009, HRI 2011, UbiComp 2013, IVA 2013, RSS 2013, and HRI 2014. His research has been covered by national and international press including the NewScientist, MIT Technology Review, Discovery News, Science Nation, and Voice of America. He has served on the Steering Committee of the HRI Conference and Program Committees of the HRI, CHI, and ICMI conferences, co-chairing the Program Committee for HRI 2015 and ICSR 2011 and the Program Sub-committee on Design at CHI 2013 and CHI 2014. He is also serving as the Managing Technical Editor of the Journal of Human-Robot Interaction and an Associate Editor of IEEE Transactions on Affective Computing and the Entertainment Robotics section of Entertainment Computing. Dr. Mutlu received his Ph.D. degree from Carnegie Mellon University's Human-Computer Interaction Institute in 2009. His interdisciplinary background combines training in interaction design, computer science, and cognitive and behavioral sciences with industry experience in product design and development.

Yukiko Nakano is an associate professor in the Department of Computer and Information Science at Seikei University, Japan, and leads the Intelligent User Interface Laboratory. She received her M.S. in Media Arts and Sciences from the Massachusetts Institute of Technology and her Ph.D. in Information Science and Technology from the University of Tokyo. In the research field of Conversational Agents, she has been working on establishing models of multimodal grounding and conversational engagement based on attentional information, and applying the models to communication between users and conversational agents. She organized the 4th, 3rd, and 2nd workshops on Eye Gaze in Intelligent Human Machine Interaction at ICMI 2012, HRI 2012, and IUI 2011, as well as an AISB symposium on "Conversational Informatics for Supporting Social Intelligence & Interaction: Situational and environmental information enforcing involvement in conversation" in 2005.

Design of a gaze behavior at a small mistake moment for a robot

Masahiro Shiomi, Kayako Nakagawa & Norihiro Hagita
ATR Intelligent Robotics and Communication Laboratory

A change of gaze behavior at a small mistake moment is a natural response that reveals our own mistakes and suggests an apology to others with whom we are working or interacting. In this paper we investigate how robot gaze behaviors at small mistake moments change the impressions of others. To prepare gaze behaviors for a robot, we first identified by questionnaire how human gaze behaviors change in such situations and extracted three kinds: looking at the other, looking down, and looking away. We prepared each gaze behavior, added a no-gaze behavior, and investigated how a robot's gaze behavior at a small mistake moment changes the impressions of the interacting people in a simple cooperative task. Experiment results show that the 'looking at the other' gaze behavior outperforms the other gaze behaviors in the degrees of perceived apologeticness and friendliness.

Keywords:  Communication robots; gaze; mistake; mitigation

1.  Introduction

Since human beings are fallible, completely preventing failures and mistakes is impossible. When we make mistakes, we have various strategies to mitigate them, such as apologizing or offering compensation. Even if the failure is minor, people do something to mitigate the unhappiness of others. Based on these considerations, we assume that since a robot will also experience failure and make mistakes, it will also need mitigation strategies. Because of the increasing development of robots that work with people (Sara et al. 2012; Mutlu & Forlizzi 2008), such a strategy for failures will become especially important. How should a robot behave at the moment of a mistake? Even if current robots do not make serious mistakes because of their limited uses, several researchers have started to investigate error recovery strategies from small mistake situations for robots (Lee, Kiesler & Forlizzi 2010; Lee et al. 2010; Peltason & Wrede 2011). For example, Lee et al. investigated appropriate robotic service recovery strategies

doi 10.1075/bct.81.01shi © 2015 John Benjamins Publishing Company




using delivery robots: apologies, compensation, and user options (Lee et al. 2010). Other research implemented an apology behavior for robots that interact with people at small mistakes during interaction (Lee, Kiesler & Forlizzi 2010; Peltason & Wrede 2011). However, these works focused only on error recovery during conversations, i.e. verbal behavior. In interactions between people, they often change their nonverbal behavior, e.g. looking at the other, looking down, and so on. Since nonverbal behavior is essential in interactions (Birdwhistell 1970; Vargas 1986), we believe that its design is also important for robot error recovery strategies that resemble those of humans. Gaze is one essential nonverbal behavior to indicate a robot's intention (Srinivasan & Murphy 2011). In particular, eye contact is an important social cue that positively improves impressions toward robots in various situations (Yamazaki et al. 2008; Yonezawa et al. 2007). These previous works show the significance of the gaze behavior of robots in human-robot interaction, but they do not focus on the effects of gaze behaviors from the view of mitigation during mistake situations. It remains unknown how a robot's gaze behavior at mistake moments changes the impressions of others.

Figure 1.  Mistake: how should the robot change its gaze?

2.  Data collection

Even though many works exist on gaze behavior in human-robot interaction (Birdwhistell 1970; Kirchner et al. 2011; Mutlu et al. 2012; Srinivasan & Murphy 2011; Vargas 1986; Yamazaki et al. 2008; Yonezawa et al. 2007), we found none that focused on the design of gaze behavior at a robot's mistake moment. So we investigated how people change their gaze behaviors at mistake moments by conducting a questionnaire-based data collection. We can empirically observe that people change their gaze behavior depending on two factors: social relationships and where the culpability lies. We asked participants to imagine a situation where they made a mistake in a cooperative task.




They assumed that their social position is lower than the other's (e.g. a superior's, because people perceive robots as having relatively lower social positions (Hayashi et al. 2011; Kahn 2007)) and that they are responsible for the mistake. They considered the following situation: "Describe how you change your gaze when you make a mistake in front of a superior." Ten Japanese people (five women and five men, whose average age was 20.5 years, standard deviation (S.D.) 2.0) participated in our data collection. They freely described the changes of their gaze behaviors at mistake moments when they themselves are responsible for the mistake. We gathered 20 sentences from the data collection (ten sentences for each of the two assumptions about responsibility). Two coders analyzed and classified the transcribed results. Cohen's kappa coefficient from their classifications was 0.964, and the coding yielded the following classifications:

a. Looking at the other: they looked at the face of the person with whom they were working
b. Looking down: they looked down because they could not face the other
c. Looking away: they looked away to avoid eye contact with the other

Table 1 shows each category for each assumption. The results indicated that people look down when they themselves feel responsible for the mistake; on the other hand, people look at the other when they blame that person for the mistake. Based on these results, we used four kinds of gaze behavior (three gaze behaviors and a no-gaze behavior) in our experiments.

Table 1.  Number of observations in each category under different assumptions

                        Responsibility
                        on myself    on the other
Looking at other            1              8
Looking down                9              1
Looking away                0              1
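The paper reports inter-coder agreement as Cohen's kappa here and again for the free-description coding in the discussion, but does not show how the statistic was computed. Purely for reference, a minimal Python sketch of Cohen's kappa for two coders follows; the label sequences are hypothetical illustrations, not the study's transcripts.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' category labels (equal-length sequences)."""
    assert len(coder_a) == len(coder_b) and len(coder_a) > 0
    n = len(coder_a)
    # Observed agreement: proportion of items both coders labelled identically.
    p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: expected overlap given each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in set(coder_a) | set(coder_b)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical labels for the three gaze categories (not the study's data).
coder_1 = ["down", "down", "other", "down", "away", "down", "down", "other", "down", "down"]
coder_2 = ["down", "down", "other", "down", "down", "down", "down", "other", "down", "down"]
print(round(cohens_kappa(coder_1, coder_2), 3))
```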

3.  Experiments

3.1  Hypotheses and predictions about apologies
An apology signals the admission of one's own mistake and suggests remorse and regret for the action. For example, people who fail in cooperative tasks provide






such signals. In fact, our data collection suggests that people change their gaze behaviors at such moments. This strategy is used not only in daily situations but also in medical malpractice litigation, political contexts, and corporate culture (Ho 2009). A robot that works with people, e.g. doing a cooperative task, must be able to admit its own mistakes and offer apologies at appropriate moments. Past research has also reported the importance of apologies from robots (Groom et al. 2010; Lee et al. 2010). To admit mistakes and apologize to others, gaze behavior at mistake moments would be useful. Based on these considerations, we made the following hypothesis:

Prediction 1: Gaze behavior at a small mistake moment will increase the degree of perceived apologeticness.

3.2  Hypotheses and prediction for friendliness and dissatisfaction
Indicating an apology increases the degree of perceived friendliness and decreases the dissatisfaction toward the robot. However, as observed in the data collection, people change their gaze behavior depending on their feelings about the cause of the mistake. Since phenomena related to gaze behaviors at small mistake moments remain basically unexplored, we established two contradictory hypotheses toward friendliness and dissatisfaction.

3.2.1  Hypothesis that assumes advantages of looking down
Based on the trend from the data collection, we assume that when a person accepts responsibility for a mistake, he uses looking down behaviors to signal such intentions as "I'm sorry." This attitude increases the degree of perceived friendliness and decreases dissatisfaction. From these considerations, we made the following hypothesis:

Prediction 2-a: Looking down during small mistake moments will increase the degree of perceived friendliness toward the robot more than other gaze behaviors.
Prediction 3-a: Looking down during small mistake moments will decrease dissatisfaction toward the robot more than other gaze behaviors.

3.2.2  Hypothesis that assumes advantages of looking at the other
We also have an opposite assumption: looking at others is friendlier than other gaze behaviors, because past research on gaze behaviors suggests that eye contact improves impressions of robots (Srinivasan & Murphy 2011; Yamazaki et al. 2008; Yonezawa et al. 2007). Since looking down and looking away might be more ambiguous for displaying intentions than looking at others, we made the following hypotheses:

Prediction 2-b: Looking at the other at small mistake moments will increase the degree of perceived friendliness toward the robot more than other gaze behaviors.




Prediction 3-b: Looking at the other at small mistake moments will decrease dissatisfaction toward the robot more than other gaze behaviors.

3.3  Participants
Sixteen university students (eight women and eight men, whose age averaged 20.5, SD 2.13) participated.

3.4  Tasks
To investigate the effects of the gaze behaviors, we adopted a simple cooperation task in which the robot first asks the participant to put an object into its hand. After taking it from the participant, the robot puts it into a box, rotates its body and releases its arms. This task was conducted twice in each condition. The robot successfully does the first trial, but in the second trial, it fails to put the object into the box.

3.5  Robot
In the experiment, we used Robovie-mR2, an interactive humanoid robot. To conduct a cooperation task with participants, we prepared four kinds of motions: (1) taking an object from a participant, (2) putting it into a box, (3) dropping it, and (4) changing gaze behaviors. Figure 2 shows both motions used in the first trial in each task. In the third motion, the robot also turns to its right as in the second motion, but it slightly spreads its arms in the middle of the turn. The object falls without going into the box (left side of Figure 3). The robot changes its gaze behavior, depending on the condition, from the mistake moment; it waited one second after the mistake moment, based on research on robot response times (Shiwa et al. 2009).

Figure 2.  Success

Figure 3.  Gaze behaviors during a mistake: a. Looking down; b. Looking away; c. Looking at the other

3.6  Conditions
We used a within-participant experiment design to evaluate and compare the effects of the three gaze behaviors from our data collection and the no-gaze behavior. Note that the participants reported that in the down/avoid/look conditions, the robot looked appropriately at the mistake moments.

–  No-gaze: when the robot fails, it does not change its gaze.
–  Down: when the robot fails, it looks down (Figure 3-a).
–  Avoid: when the robot fails, it looks away (Figure 3-b).
–  Look: when the robot fails, it looks at the participant (Figure 3-c).
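The manipulated behavior described above amounts to a single event handler: when the drop occurs, wait one second and then play the gaze motion assigned to the current condition. The sketch below is purely illustrative of that logic; the motion names and the `robot.play_motion` call are hypothetical placeholders, not the Robovie-mR2 control API used by the authors.

```python
import time
from enum import Enum

class GazeCondition(Enum):
    NO_GAZE = "no-gaze"   # keep the current gaze
    DOWN = "down"         # look down (Figure 3-a)
    AVOID = "avoid"       # look away (Figure 3-b)
    LOOK = "look"         # look at the participant (Figure 3-c)

def on_mistake(robot, condition, delay_s=1.0):
    """React to the dropped object: wait one second (cf. Shiwa et al. 2009),
    then perform the gaze behavior assigned to this session's condition."""
    time.sleep(delay_s)
    if condition is GazeCondition.NO_GAZE:
        return                                  # no change of gaze
    motion = {GazeCondition.DOWN: "gaze_down",
              GazeCondition.AVOID: "gaze_away",
              GazeCondition.LOOK: "gaze_at_participant"}[condition]
    robot.play_motion(motion)                   # hypothetical motion interface
```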

3.7  Procedure
Robovie-mR2 was placed on a low table, and a cube was placed diagonally in front of it as a stand (Figure 4). The box in which to drop the object was placed on the robot's right. Participants sat in front of the robot. In the interaction, the robot asks for the object twice. The signal to start the trial was sent by the operator.

Figure 4.  Experiment environment (robot, object stand, box, participant, and operator with control PC)




Before the first session, the participants were given a brief description of the experiment's purpose and procedure. Each participant participated in four sessions. They filled out questionnaires after each session. The order of these conditions was counterbalanced.

3.8  Measurement
To measure the subjective impressions, we prepared a questionnaire that addressed the perceived apologeticness (the robot seemed to apologize), friendliness (I felt that the robot was friendly), and dissatisfaction (I felt dissatisfaction when the robot failed). After each session, participants answered on a 1-to-7 point scale, where "7" is the most positive and "1" is the most negative. They could freely describe their impressions of each gaze behavior.

4.  Results

4.1  Verification of prediction 1
We analyzed the apologeticness (Figure 5) by conducting a one-factor within subject ANOVA. Since Mauchly's test indicated that the assumption of sphericity was violated (chi-square = 16.174, p = .007), we corrected the degrees of freedom using the Huynh-Feldt estimates of sphericity (epsilon = 0.863). We found a significant difference among the conditions (F(2.589, 38.838) = 7.265, p = .001). Multiple comparisons with the Scheffe method revealed significant differences: look > no-gaze (p = .001), down > no-gaze (p = .003), and avoid > no-gaze (p < .001). Prediction 1 was supported.

Figure 5.  Perceived apologeticness (1–7 rating by condition: No-gaze, Down, Avoid, Look)
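The analyses in this results section are one-factor repeated-measures ANOVAs with sphericity-corrected degrees of freedom; the paper gives only the resulting statistics. As a reference, the sketch below reproduces that type of analysis directly in NumPy/SciPy, reporting uncorrected, Greenhouse-Geisser-corrected, and Huynh-Feldt-corrected tests. The 16 x 4 `ratings` matrix is randomly generated stand-in data, not the experimental ratings, and the Scheffe post-hoc comparisons reported in the paper are omitted.

```python
import numpy as np
from scipy import stats

def rm_anova(data):
    """One-factor repeated-measures ANOVA; data: (n_subjects, k_conditions)."""
    n, k = data.shape
    grand = data.mean()
    ss_cond = n * ((data.mean(axis=0) - grand) ** 2).sum()
    ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()
    ss_err = ((data - grand) ** 2).sum() - ss_cond - ss_subj
    df1, df2 = k - 1, (k - 1) * (n - 1)
    f_val = (ss_cond / df1) / (ss_err / df2)

    # Greenhouse-Geisser epsilon from the double-centred covariance matrix,
    # then the Huynh-Feldt adjustment of that estimate (capped at 1).
    c = np.eye(k) - 1.0 / k
    sc = c @ np.cov(data, rowvar=False) @ c
    eps_gg = np.trace(sc) ** 2 / ((k - 1) * np.trace(sc @ sc))
    eps_hf = min(1.0, (n * (k - 1) * eps_gg - 2) /
                 ((k - 1) * (n - 1 - (k - 1) * eps_gg)))

    return {name: (f_val, df1 * e, df2 * e, stats.f.sf(f_val, df1 * e, df2 * e))
            for name, e in [("uncorrected", 1.0), ("GG", eps_gg), ("HF", eps_hf)]}

# Hypothetical 1-to-7 ratings: 16 participants x 4 conditions
# (No-gaze, Down, Avoid, Look); replace with real questionnaire data.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(16, 4)).astype(float)
for label, (f, d1, d2, p) in rm_anova(ratings).items():
    print(f"{label}: F({d1:.3f}, {d2:.3f}) = {f:.3f}, p = {p:.3f}")
```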






4.2  Verification of prediction 2
We analyzed the perceived friendliness toward the robot (Figure 6) by conducting a one-factor within subject ANOVA. We found a significant difference among the conditions (F(3, 45) = 15.496, p < .001). Multiple comparisons with the Scheffe method revealed significant differences: look > no-gaze (p < .001), look > down (p = .004), look > avoid (p = .005), down > no-gaze (p = .011), and avoid > no-gaze (p = .001). Prediction 2-b was supported, but not prediction 2-a.

Figure 6.  Perceived friendliness (1–7 rating by condition: No-gaze, Down, Avoid, Look)

4.3  Verification of prediction 3
We analyzed dissatisfaction with the robot (Figure 7) by conducting a one-factor within subject ANOVA. Since Mauchly's test indicated that the assumption of sphericity was violated (chi-square = 15.083, p = .010), we corrected the degrees of freedom using the Greenhouse-Geisser estimates of sphericity (epsilon = 0.604). We found no significant differences among the conditions (F(1.811, 27.164) = 2.853, p = .080). Neither prediction 3-a nor 3-b was supported.

Figure 7.  Perceived dissatisfaction (1–7 rating by condition: No-gaze, Down, Avoid, Look)




5.  Discussion

5.1  Analysis of free descriptions
Our results showed that the 'looking at the other' condition outperformed looking down. We analyzed the free comments from the participants to determine why. In our experiment, we gathered 64 sentences (16 for each gaze behavior). Two coders analyzed and classified the transcribed results from them. Cohen's kappa coefficient from the coder classifications was 0.625, which showed moderate agreement. They yielded the following classifications:

a. The robot seemed to notice and to reflect on its own mistake.
b. The robot seemed to notice without reflecting on its own mistake.
c. The robot didn't seem to notice its own mistake.

Table 2 shows each category for each gaze behavior. The results indicate that the participants felt that the 'looking at the other' condition provides more reflection.

Table 2.  Number of observations in each category of gaze impressions

            a    b    c
No-gaze     2    7    7
Down        6    6    4
Avoid       6    8    2
Look       11    5    0

5.2  Responsiveness to mistakes
In this study, we situated the responsibility for the mistake on the robot in a cooperative task. However, in cooperative tasks in real situations, a person might fail even if the robots worked correctly, i.e. placing responsibility for the mistake on the person. In such situations, the gaze behaviors of the robot at mistake moments might give different impressions, e.g. blaming a person. Groom et al. investigated the effects of blame from a robot in cooperation situations and argued that blame negatively affected impressions of it (Groom et al. 2010). Therefore, designing a gaze behavior at the mistake moments of people who engage in cooperative tasks with robots is fruitful future work.

6.  Conclusion

We conducted an experiment during small mistake moments on four gaze behaviors: looking at the other, looking down, looking away, and no-gaze. Experimental




results revealed that participants perceived apologies from the gaze behaviors at mistake moments and reported that the looking at the other condition increased the perceived friendliness more than the other conditions. This study provides useful knowledge for designing social robots that work with people.

Acknowledgements This work was supported by the Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research No. 21118003.

References

Birdwhistell, R.L. (1970). Kinesics and context: Essays on body motion communication. Philadelphia: University of Pennsylvania Press.
Groom, V., Chen, J., Johnson, T., Kara, F.A., & Nass, C. (2010). Critic, compatriot, or chump?: Responses to robot blame attribution. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction (pp. 211–218).
Hayashi, K., Shiomi, M., Kanda, T., & Hagita, N. (2011). Are robots appropriate for troublesome and communicative tasks in a city environment? IEEE Transactions on Autonomous Mental Development, 4, 150–160. DOI: 10.1109/TAMD.2011.2178846
Ho, B. (2009). Apologies as signals: With evidence from a trust game. Management Science, 58(1), 141–158. DOI: 10.1287/mnsc.1110.1410
Kirchner, N., Alempijevic, A., & Dissanayake, G. (2011). Nonverbal robot-group interaction using an imitated gaze cue. In Proceedings of the 6th International Conference on Human-Robot Interaction (pp. 497–504).
Lee, M.K., Kiesler, S., & Forlizzi, J. (2010). Receptionist or information kiosk: How do people talk with a robot? In Proceedings of the Conference on Computer-Supported Cooperative Work (pp. 31–40).
Lee, M.K., Kiesler, S., Forlizzi, J., Srinivasa, S., & Rybski, P. (2010). Gracefully mitigating breakdowns in robotic services. In Proceedings of the 5th ACM/IEEE Conference on Human-Robot Interaction (pp. 203–210).
Mutlu, B., & Forlizzi, J. (2008). Robots in organizations: Workflow, social, and environmental factors in human-robot interaction. In Proceedings of the 3rd ACM/IEEE Conference on Human-Robot Interaction (pp. 287–294).
Mutlu, B., Kanda, T., Forlizzi, J., Hodgins, J., & Ishiguro, H. (2012). Conversational gaze mechanisms for humanlike robots. ACM Transactions on Interactive Intelligent Systems, 1(2), 12. DOI: 10.1145/2070719.2070725
Kahn, P.H., Ishiguro, H., Friedman, B., & Kanda, T. (2007). What is a human? Toward psychological benchmarks in the field of human-robot interaction. Interaction Studies, 8(3), 363–390. DOI: 10.1075/is.8.3.04kah




Peltason, L., & Wrede, B. (2011). The curious robot as a case-study for comparing dialog systems. AI Magazine, 32(4), 85–99.
Sara, L., Jirina, K., Mattias, J., Henriette, C., & Karol, N. (2012). Hospital robot at work: Something alien or an intelligent colleague? In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work (pp. 13–15).
Shiwa, T., Kanda, T., Imai, M., Ishiguro, H., & Hagita, N. (2009). How quickly should a communication robot respond? Delaying strategies and habituation effects. International Journal of Social Robotics, 1(2), 141–155. DOI: 10.1007/s12369-009-0012-8
Srinivasan, V., & Murphy, R.R. (2011). A survey of social gaze. In Proceedings of Human-Robot Interaction (pp. 253–254).
Vargas, M.F. (1986). Louder than words: An introduction to nonverbal communication. Ames: Iowa State University Press.
Yamazaki, A., Yamazaki, K., Kuno, Y., Burdelski, M., Kawashima, M., & Kuzuoka, H. (2008). Precision timing in human-robot interaction: Coordination of head movement and utterance. In Proceedings of CHI '08 (pp. 131–140). DOI: 10.1145/1357054.1357077
Yonezawa, T., Yamazoe, H., Utsumi, A., & Abe, S. (2007). Gaze-communicative behavior of stuffed-toy robot with joint attention and eye contact based on ambient gaze-tracking. In Proceedings of the 9th International Conference on Multimodal Interfaces (pp. 140–145). DOI: 10.1145/1322192.1322218



Robots can be perceived as goal-oriented agents

Alessandra Sciutti1, Ambra Bisio1,2, Francesco Nori1, Giorgio Metta1,3, Luciano Fadiga1,4 & Giulio Sandini1

1 Robotics, Brain and Cognitive Sciences Dept., Istituto Italiano di Tecnologia
2 Department of Experimental Medicine, Section of Human Physiology, University of Genova
3 Center for Robotics and Neural Systems, Plymouth University, Plymouth
4 Section of Human Physiology, University of Ferrara, Ferrara

Understanding the goals of others is fundamental for any kind of interpersonal interaction and collaboration. From a neurocognitive perspective, intention understanding has been proposed to depend on an involvement of the observer’s motor system in the prediction of the observed actions (Nyström et al. 2011; Rizzolatti & Sinigaglia 2010; Southgate et al. 2009). An open question is if a similar understanding of the goal mediated by motor resonance can occur not only between humans, but also for humanoid robots. In this study we investigated whether goal-oriented robotic actions can induce motor resonance by measuring the appearance of anticipatory gaze shifts to the goal during action observation. Our results indicate a similar implicit processing of humans’ and robots’ actions and propose to use anticipatory gaze behaviour as a tool for the evaluation of human-robot interactions. Keywords:  Humanoid robot; motor resonance; anticipation; proactive gaze; action understanding

1.  Introduction The ability to understand others’ actions and to attribute them mental states and intentionality is crucial for the development of a theory of mind and of the ability to interact and collaborate. Indeed, the comprehension of the goal-directed nature of the actions performed by other people is something that we learn very early in infancy, already during our first year of life (e.g. Falck-Ytter et al. 2006; Kanakogi & Itakura 2011; Woodward 1998).

doi 10.1075/bct.81.02sci © 2015 John Benjamins Publishing Company


One of the main difficulties in investigating goal understanding – especially in preverbal children – is to tap into this mechanism without relying on high level cognitive evaluations or complex linguistic skills. A backdoor to action comprehension has been traditionally represented by the measurement of gaze behaviour in habituation paradigms and during action observation. The analysis of gaze has been used to assess action planning (e.g. Johansson et al. 2001) and action understanding both in adults (e.g. Ambrosini et al. 2011a; Flanagan & Johansson 2003), and in preverbal children (e.g. Falck-Ytter et al. 2006; Gredebäck et al. 2009; Kanakogi & Itakura 2011; Rosander & von Hofsten 2011; Woodward 1998), sometimes even allowing to differentiate between implicit and explicit levels of social cognition (e.g. Senju et al. 2009). Gaze can therefore represent an important tool to communicate implicit or covert understanding of other agents’ goals. One specific aspect of gaze behaviour is represented by its anticipatory nature. When humans observe others performing goal directed actions, they shift their gaze to the target of their movement before it is completed (Flanagan & Johansson 2003). Interestingly, Flanagan and Johansson (2003) showed in a manipulation task that when the hands moving an object could not be seen, observers’ gaze did not anticipate object motion anymore, but started tracking it and being reactive rather than predictive. It has been therefore suggested that anticipatory gaze is linked to a motor resonance mechanism: the anticipatory shift of the gaze toward the goal of actions performed by someone else would be a part of the action plan covertly activated by the observation. In fact, the anticipatory looking measured during the observation of the actions of others reflects the same visuo-manual coordination exhibited during our own action execution, where gaze is directed anticipatorily to the relevant landmarks of that act, e.g. obstacles to be avoided, targets to be reached or places where the objects are going to be grasped or released (Johansson et al. 2001). Much evidence confirms that the perception of others’ action is dependent on motor system activation (Stadler et al. 2012; Stadler et al. 2011; Urgesi et al. 2010) and is based on a matching between the observed act and the observer’s motor representation of the same action: the observer would implement covert action plans corresponding to the action executed by the other agent (the direct matching hypothesis, Rizzolatti & Craighero 2004; Rizzolatti et al. 2001). This link between perception and action known as motor resonance has been indicated as the behavioural expression of the mirror neurons system (MNS, R ­ izzolatti et al. 1999), which is activated both when individuals perform a given motor act and when they observe others performing the same motor act (in monkeys: G ­ allese et al. 1996; Rizzolatti et al. 1996; in humans: see Fabbri-Destro & Rizzolatti 2008; ­Rizzolatti  & Craighero 2004 for reviews). The activation of a set of neurons both ­during e­ xecution and observation of an action would provide a common




description of one's own and others' behaviours, thus allowing for the anticipation of others' goals. The hypothesis of anticipatory gaze being dependent on motor resonance has been supported by several studies on young infants (Falck-Ytter et al. 2006; Gredebäck & Kochukhova 2010; Gredebäck & Melinder 2010; Kanakogi & Itakura 2011). For example, Falck-Ytter et al. (2006) tested the assumption that if anticipation is the result of a direct match between the observed action and the observer's motor repertoire, the anticipation of a given action can occur only after that specific action has been mastered by the infant. This hypothesis has been confirmed by the authors, who demonstrated that only 12-month-old children, who were able to perform a transport action, exhibited anticipatory gaze shifts when observing others transporting an object, while younger children, who were unable to grasp and transport, tracked the actor's movement. The fact that the tendency to anticipate others' goals with the gaze emerges in correspondence with the development of the infant's own motor ability to execute the same action has been proven also for the observation of more structured actions, such as solving a puzzle (Gredebäck & Kochukhova 2010), feeding another person (Gredebäck & Melinder 2010) or moving objects into multiple target positions (Gredebäck et al. 2009). A recent study further reduced the minimum age at which infants anticipate the action goals of others with their gaze. In particular, infants as young as 6 months of age anticipatorily shift their gaze to the target of a reaching-to-grasp, an action that they have already mastered (Kanakogi & Itakura 2011). In addition, children who perform an action more efficiently are also more proficient at anticipating the goal of similar acts executed by others (Gredebäck & Kochukhova 2010; Kanakogi & Itakura 2011). These findings are in favour of the theory that goal anticipation is facilitated by a matching process that maps the observed action onto one's own motor representation of that action. A recent Transcranial Magnetic Stimulation (TMS) study provided further support for a motor-based origin of anticipatory gaze shifts during action observation. Elsner et al. (2012) used TMS to disrupt activity in the "hand region" of the primary motor cortex and found that such selective disruption causes a delay in predictive gaze shifts relative to trials without TMS stimulation. This finding indicates a functional connection between the activation of the observer's motor system and anticipatory gazing, as the stimulation of the motor cortex has been shown to directly impact the ability to predict others' goals with the gaze. This relationship between anticipatory gaze shifts and motor resonance implies that the occurrence of goal anticipation with the eyes is dependent on matching the observed behaviour and the observer's motor repertoire. It can be interpreted as an indication that the observer has implicitly (or motorically) recognized the actor as an agent who shares the same actions and the same goals.


This recognition becomes particularly relevant in the domain of human-robot interaction. Monitoring anticipatory gaze behaviour during the observation of a robot could therefore become a measure to evaluate whether this motor resonance mechanism at the basis of the interpretation of actions is extendable also to robots. This may be useful in building robots whose behaviour is easily predictable by the user and that elicit a natural confidence in their acting. Whether, and under which conditions, motor resonance can be evoked also by the observation of non-human agents is still an open issue (see Chaminade & Cheng 2009; Sciutti et al. 2012 for reviews). Although the first neuroimaging studies (Perani et al. 2001; Tai et al. 2004) excluded activation of the mirror neurons system and hence motor resonance when the action was performed by a virtual or non-biological agent, subsequent researches (Chaminade et al. 2010; Gazzola et al. 2007; Oberman et al. 2007a; Shimada 2010) have indicated that robotic agents evoke a similar MNS activity as humans do (or even stronger, Cross et al. 2011). Also in the context of behavioural experiments, a few studies found either the absence or a quantitative reduction in the resonance for the observation of robotic agents (Kilner et al. 2003; Oztop et al. 2005; Press et al. 2005) while other researchers observed conditions where the motor resonance effect was the same for human and non human agents observation (Liepelt et al. 2010; Press et al. 2007). In summary robotic agents can, to a certain degree, evoke motor resonance as a function of both their shape, the context in which they are immersed and the way they move. If at the neurophysiological level the MNS activation seems to be present also for very non-biological stimuli (i.e. when the non-biological agent moves with a non-biological kinematics, e.g. Cross et al. 2011; Gazzola et al. 2007) some behavioural effects require a higher degree of human resemblance also in terms of robot motion (Chaminade & Cheng 2009). In the context of assessing humans’ implicit perception of the robot, the analysis of the occurrence of anticipatory gaze shifts to the robotic action goal could tell something more: i.e. not only whether a resonance mechanism can be activated by an artificial visual model, but also if it can be exploited by the observer to predict the goal of a non-human agent, as it happens during human action observation. Previous studies failed to find anticipatory gaze shifts toward the spatial destination of an object moving by itself (Falck-Ytter et al. 2006; ­Flanagan & Johansson 2003), even if the object movement followed biological rules and the target position was unambiguous. Adult observers exhibited anticipatory gaze behaviour in the presence of non-biological agent when the latter could be interpreted as a tool they could use (a mechanical claw), while anticipation was not exhibited by young infants (4 to 10 months old), not as familiar with that tool (Kanakogi & Itakura 2011).




In this work we evaluated whether the observation of a robotic actor might evoke anticipatory gaze shifts in the observer. In particular, we replicated an “object – transport” task similar to the one described by Falck-Ytter (2006) and we replaced the human action demonstrator with a humanoid robot (the iCub robotic platform, Metta et al. 2010). We considered two alternative hypotheses. Either the human subject could implicitly recognize the robot as an agent and motorically match its action with his/her own, thus evoking the ability to anticipate with the gaze the goal of the robot’s action or alternatively, the robot could be perceived just as a very complex moving object, rather than a goal-oriented agent. In this latter case we would expect a significant reduction of the predictive gaze shift to the target and a stronger tendency to track the moving object. The results would allow us to provide hints to the robot designers for improving the robot overall behaviour.

2.  Methods 2.1  Subjects Ten right-handed subjects (2 women and 8 men, M = 31 years, SD = 13) took part in the experiment. All subjects were healthy, with normal or corrected to normal vision, and did not present any neurological, muscular or cognitive disorder. The participants gave informed consent prior to testing. All experiments were conducted in accordance with legal requirements and international norms (­Declaration of Helsinki 1964, 2008). 2.2  Action demonstrators 2.2.1  The human demonstrator In the “human” condition, a human demonstrator presented repeatedly a grasp and transport action. The person who acted as model was a woman and was the same in all the experiments. She was previously trained to make movements at a steady, slow pace and her movement trajectory and timing was recorded prior to the experiment to program robot motion (see below for details). 2.3  The humanoid robot In the main experimental condition we used the humanoid robot iCub as action demonstrator and we made it repeatedly perform grasp-transport and release actions in front of the observer. iCub is a humanoid robot developed as part of the EU project RobotCub. It is approximately 1m tall with the appearance of a 3.5 years
old child (Metta et al. 2010; Sandini et al. 2007). Its hands have nine degrees of freedom each, with five fingers, three of which are independently driven. All motors are placed remotely in the forearm and the hands are completely tendon-driven. As we wanted to use the robot's right arm and hand to produce grasping, release and transport movements, we commanded only the right arm and the torso joints to generate the movement. To generate the transport movement the robot had to track the end-point Cartesian trajectories captured from human motion. The grasp and release actions were instead realized with a fast, position-controlled, stereotyped closing and opening of the fingers. To produce the robot's hand trajectories, we collected the human transport movement at 250 Hz by means of an infrared marker positioned on the hand of a human actor (Optotrak Certus System, NDI). Then, we downsampled the data by a factor of 5 and roto-translated them to match the coordinate system of the iCub hand, so that the points in the trajectory belonged to the workspace of the robot. The end-effector coordinates were then transformed into torso and arm joint angles by solving the inverse kinematics by means of nonlinear constrained optimization (Pattacini et al. 2010). A velocity-based controller was used to track the transformed trajectories. Tracking was satisfactory for our purposes, as the robot movements were human-like for a human observer (Figure 1B).

Figure 1.  Experimental setup. (A) Pictures of the setup: subjects wearing an Eyelink II helmet sat in front of the action plane, with their chin positioned on a chin rest. (B) Schema of the subjective view of the participant with – superimposed – the rectangular zones representing the Areas of Interest (AOI) used for the analysis, labelled hand start, hand stop and goal area, and (in blue) the sample trajectories of the robotic hand. The superimposed graphical information was not visible to the subject
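As an illustration of the trajectory preparation described above, the short sketch below shows how a 250 Hz motion-capture recording might be downsampled by a factor of 5 and roto-translated into the robot's hand frame. This is a minimal sketch under our own assumptions (the function name, calibration matrices and synthetic data are ours, not the authors'); the subsequent inverse-kinematics step and the velocity-based tracking controller are assumed to be handled by the robot's own Cartesian solver and are not reproduced here.

import numpy as np

def prepare_robot_trajectory(capture_xyz_250hz, R, t, factor=5):
    # Downsample a 250 Hz motion-capture trajectory (N x 3 array of marker
    # positions) and roto-translate it into the robot's hand frame.
    # R (3x3 rotation) and t (3-vector translation) are assumed to come from
    # a prior calibration between the capture frame and the robot frame.
    downsampled = capture_xyz_250hz[::factor]   # 250 Hz -> 50 Hz
    return downsampled @ R.T + t                # rigid transform of each point

# Synthetic example: a 3 s straight-line transport recorded at 250 Hz.
capture = np.linspace([0.0, 0.0, 0.0], [0.30, 0.10, 0.05], 750)
R = np.eye(3)                        # placeholder rotation (identity)
t = np.array([-0.35, 0.10, 0.05])    # placeholder offset into the robot workspace
targets = prepare_robot_trajectory(capture, R, t)
print(targets.shape)                 # (150, 3) end-effector targets for the controller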

2.4  Experimental paradigm Subjects sat comfortably on a chair at about 75 cm from the action plane, with their chin positioned on a chin rest. They wore an Eyelink II helmet, provided of a scene camera located in correspondence to the centre of their forehead. The scene
was a table top on which an object (a little plush octopus) and a vase (the target) were placed at a distance of about 40 cm. The work area was delimited by two vertical bars, which also held the four infrared markers needed by the Eyelink system to compensate for head movements. At the beginning of each trial the object was placed in a predefined starting position on the side of the scene. Then the demonstrator, either a human actor or the robot iCub, grasped the object with the right hand and transported and released it into the target (Figure 2).

Figure 2.  Experimental procedure. Sample transporting action in the robot (left) or human (right) condition, from an external (top row) or the subject's (bottom row) point of view. The rectangles representing the Areas of Interest and the cross indicating gaze fixation were not visible during the experiment

During the whole movement the eyes of the demonstrator were hidden from view, to avoid the presence of any additional cue about action goal except object motion. A screen behind the demonstrator provided a uniform background. To replicate the setup described in Falck-Ytter et al. (2006) and maximize the probability to obtain gaze proactivity, as suggested by the results by Eshuis et al. (2009), we attached a little toy to the vase. The toy produced a sound at the arrival of the object. The Eyelink system recorded binocularly the gaze motion at 250 Hz and projected the gaze position in real-time on the video recorded by the scene camera. The camera was arranged in order to oversee the working plane. An alignment procedure assured that the gaze and camera images were correctly superimposed. Before each recording session a standard 9 points calibration procedure was performed on the movement plane. The calibration was then validated and repeated in case the average error was larger than 0.8 visual degrees (the average was computed for each eye separately and repeated for both eyes even if only one exceeded the threshold). In addition, a correction procedure for depth was realized to map
eye motion on the camera scene also in presence of fixations outside the calibration plane. All these procedures were performed through the SceneLink application (SR Research) provided with the Eyelink system. The experiment consisted of two sessions. During the first session the demonstrator was the iCub robot who repeated the grasp-transport and release action 8 times, from the same starting position to the vase, with a biological motion (see Section “The Humanoid Robot”). During the second session the robotic demonstrator was replaced by the experimenter, who repeated the same task. The choice to have always the robotic demonstrator first was taken to avoid people being forced to perceive the robot as human due to a pre-exposure to the human actions. The robotic movement was recorded on the video, while, simultaneously, the coordinates of the end effector were saved on a file. 2.5  Data analysis The analysis was mainly based on videos which recorded both actors’ movements and the observers’ gaze position automatically overlaid by the Eyelink software. Each video (720 × 480 pixel size, 29.97 fps) was manually segmented into different parts (one for each transport movement) in Adobe Premiere 6.5. The Eyelink Data Viewer software was adopted to analyze gaze movements in detail and to define Areas of Interests (AOI), which were afterward overlaid on the video. We individuated three AOIs (90 × 126 pixel each, corresponding to about 9.3 × 13°); one covering the objects starting position (hand start), one covering the end position of the hand before leaving the object (hand stop) and one covering the vase (goal area). Gaze was measured during each movement of an object to the target. Data were included in the analysis if subjects fixated at least once the goal area up to 1000 ms after the object disappeared into the vase. Two subjects never looked at the goal area, either keeping an almost stable fixation during the whole experiment or continuously tracking the demonstrator’s hand without ever looking at the object. Their data were therefore discarded from all subsequent analyses, which were conducted on a total of 8 subjects. To compute gaze anticipation or delay, the timing of subjects’ fixation shift to the goal area was compared to the arrival time of the object. If gaze arrived at the goal area before the object, the trial was considered predictive (positive values). To evaluate the amount of anticipation, for each trial we computed the proportion of anticipation, i.e. the difference between the times of object and gaze arrival on target, divided by movement duration. Movement duration was computed as the time between object exit from the area hand start and its entrance into the goal area. While the arrival to the hand stop area was almost simultaneous with the entering of the object into the vase (goal area) in the human condition, the ­opening of
the robotic hand required a longer time. To compensate for the difference in the timing of the release action between human and robot hand, in the robotic condition the arrival time of the object into the goal area was replaced by the time when the hand stopped over the vase (hand stop). It should be noted here that such a criterion tends to reduce the estimate of anticipation in the robotic condition, as gaze needs to arrive to the goal area already before the stopping of the hand to be counted as anticipatory. To statistically compare the degree of anticipation during human and robot observation, the proportion of anticipation and the percentage of anticipatory trials in the two conditions were subjected to paired-sample t-test analysis. Moreover, one sample t-tests on the amount of anticipation were performed to determine whether gaze behaviour was significantly different from a tracking (i.e. 0 anticipation, corresponding to a simultaneous arrival of gaze and object to the goal). To evaluate the relation between the anticipatory behaviour exhibited by subjects in the human and the robot condition, a regression analysis was performed on both the percentage of anticipatory trials and the amount of anticipation in the two conditions. In addition, the percentage of variation in anticipation between conditions was computed for each subject as 100 * (1 – proportion_ anticipation_robot/proportion_anticipation_human). Lastly, to assess whether the amount of anticipation changed over multiple presentations of the transport action because of a habituation effect, a linear fit of proportion of anticipation as a function of repetition number was performed for each condition and subject. 3.  Results The first aim of this study was verifying which kind of gaze behaviour is associated to the observation of goal directed actions performed by a humanoid robot. The robotic movement lasted around 3s (M = 2.8s, SD = 0.09s). On average subjects showed an anticipatory behaviour, with a mean percentage of 70% (M = 70%, SD = 33%) of anticipatory trials and with gaze anticipating actor’s hand on average 30% of trial duration (M = 27%, SD = 25%, see Figure 3A). If the robot had been perceived as an inanimate device, we would have expected a tracking behaviour, with the eyes of the observer following the movement of the hand (Flanagan & Johansson 2003). This would have been translated into negative or near zero values of the measured anticipation (see Data Analysis in the Methods section). Instead, average anticipation (normalized by movement duration) was significantly greater than 0 (one-tailed one-sample t-test, t(7) = 3.05, p = 0.009), indicating the presence of proactive gaze behaviour, at least in the majority of subjects.
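To make the anticipation measure concrete, the following sketch computes it for a single trial, following the definitions given in the Data analysis section: the reference event is the object's arrival in the goal area (or, for the robot, the moment the hand stops over the vase), and a trial counts as predictive when gaze reaches the goal area before that event. This is an illustration with placeholder names and event times, not the study's analysis code.

from dataclasses import dataclass

@dataclass
class Trial:
    t_exit_hand_start: float   # object leaves the hand start AOI (s)
    t_goal_reference: float    # object enters the goal AOI, or the robot hand stops (s)
    t_gaze_on_goal: float      # gaze first enters the goal AOI (s)

def proportion_of_anticipation(trial):
    # Time by which gaze precedes the reference event, normalized by movement
    # duration; positive values are predictive, values <= 0 indicate tracking.
    duration = trial.t_goal_reference - trial.t_exit_hand_start
    return (trial.t_goal_reference - trial.t_gaze_on_goal) / duration

def is_anticipatory(trial):
    return proportion_of_anticipation(trial) > 0.0

# Placeholder trial: a 2.8 s transport with gaze reaching the goal 0.8 s early.
example = Trial(t_exit_hand_start=0.0, t_goal_reference=2.8, t_gaze_on_goal=2.0)
print(round(proportion_of_anticipation(example), 2), is_anticipatory(example))  # 0.29 True

Averaging this quantity over trials, and counting the share of trials for which it is positive, yields the two per-subject measures plotted in Figure 3.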

Figure 3.  Experimental Results. Gaze behaviour during the observation of robotic (A) and human (B) actions. Percentage of anticipatory trials (trials in which gaze arrived at the target before the actor's hand) plotted against anticipation measured as a proportion of movement duration. Different small symbols represent different subjects. The larger sphere indicates the population average. Error bars correspond to standard errors. The dashed line indicates zero anticipation, approximately corresponding to tracking gaze behaviour

Some individuals, however, showed a tracking behaviour. To understand whether the absence of anticipation was due to the presence of the robot as actor, we analyzed the proportion of anticipation and the percentage of anticipatory trials during the observation of a human actor executing actions similar to the ones previously performed by the robotic demonstrator. In Figure 3B the percentage of anticipatory trials and the amount of anticipation (in proportion to movement duration) are shown for each subject in this “human” condition. The pattern was similar to the one measured for the observation of robotic actions: those few subjects who assumed a tracking behaviour during robot observation did the same also during human observation. This suggests that the disappearance of anticipation was not due to the presence of a robotic artefact as demonstrator but rather to other factors, which modulate gazing behaviour also during human action observation. This finding replicates the results by Gesierich et al. (2008), which showed that half of their sample presented in tendency a tracking behaviour during action observation (moving virtual blocks on a computer screen, see Discussion). To assess whether the presence of a robotic demonstrator caused a quantitative difference in gazing strategy with respect to a human actor we compared the percentage of anticipatory trials in the “human” and the “robotic” condition. Though a tendency to increase prediction appeared for the human condition, no significant difference was present (paired sample t-test, t(7) = 1.43, p = 0.194).
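The group-level analyses reported here and in the following paragraphs (the t-tests, the regression between conditions, the percentage of variation and its bootstrap) could be run along the lines sketched below. The data and variable names are placeholders of our own; the snippet only illustrates the procedure with standard numpy/scipy routines and is not the authors' script.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder per-subject proportions of anticipation (8 subjects).
prop_human = np.array([0.35, 0.10, 0.42, 0.05, 0.30, 0.28, 0.15, 0.38])
prop_robot = np.array([0.30, 0.08, 0.40, 0.02, 0.33, 0.25, 0.12, 0.36])

# Is anticipation in the robot condition greater than zero (i.e. not mere
# tracking), and does it differ from the human condition?
t_zero, p_zero = stats.ttest_1samp(prop_robot, 0.0)
t_pair, p_pair = stats.ttest_rel(prop_human, prop_robot)

# Regression between conditions: a slope close to 1 indicates that subjects
# behave similarly with the two demonstrators.
slope, intercept, r, p_reg, stderr = stats.linregress(prop_human, prop_robot)

# Percentage of variation and a 10000-iteration bootstrap of its mean.
variation = 100 * (1 - prop_robot / prop_human)
boot_means = np.array([rng.choice(variation, size=variation.size, replace=True).mean()
                       for _ in range(10000)])
share_large_drop = np.mean(boot_means > 20)   # fraction of resamples with a >20% mean drop

print(p_zero, p_pair, slope, share_large_drop)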




Analyzing in more detail the relationship between anticipation in the human and the robot condition, we found a strong correlation between subjects' behaviours in the two tasks. The linear regression of the amount of anticipation in the human versus the robot condition was highly significant (p = 0.003), with a slope not significantly different from 1 (0.89, 95% confidence interval: [0.43–1.36]; adjusted R2 = 0.75). Additionally, the percentage of anticipatory trials was highly correlated in the two conditions (p = 0.018, slope = 0.78, 95% confidence interval: [0.18–1.37]; adjusted R2 = 0.57). Hence, a similar gazing behaviour was adopted by subjects during the observation of both actor types: participants with a higher tendency to anticipate during human actions also tended to anticipate more than others during the robotic ones. To assess whether the presence of a robotic actor produced a quantitative reduction in anticipatory behaviour, we evaluated the individual variation in anticipation between the human and the robot condition (i.e. 100 * (1 – proportion_anticipation_robot/proportion_anticipation_human); see Data analysis). By considering the percentage of variation with respect to the "human" condition, we compensate for inter-individual differences in the natural tendency to anticipate, independently of the nature of the actor. A positive number would indicate that the introduction of the robot as actor determined a decrease in anticipation. The average percentage of variation was not significantly different from zero (one-sample t-test, t(7) = -0.529, p = 0.613), indicating that robot observation did not significantly modify the natural anticipatory behaviour exhibited during human action observation. Additionally, we ran 10000 iterations of a bootstrap simulation (Efron & Tibshirani 1993) on the percentages of variation. With this resampling technique we aimed at approximating the distribution of the average of this parameter from our sample. More precisely, on each iteration the percentages of variation were independently sampled with replacement to form a new 8-element sample, which was then averaged. Of the 10000 average percentages of variation computed, only a minority (27%) resulted in a decrease of anticipation larger than 20% of the one measured in the "human" condition, providing no evidence for a reduction in anticipation associated with robot observation. We also wanted to be sure that the similarity between human and robot observation did not derive from a habituation effect, i.e. did not depend on a progressive reduction in anticipatory gaze associated with repeated exposure to similar stimuli. This phenomenon would have particularly influenced the human condition, because it was the last to be presented. To check whether there was any learning or habituation effect we linearly fitted the proportion of anticipation with respect to repetition number for each subject in the robot and human conditions separately. No significant trend of change in anticipation as a function of repetition number emerged for any subject, neither for the robot nor for the human condition (all p > 0.05, with a slope not significantly different from 0 in a one-sample
t-test – p = 0.854 in the "robot" condition and p = 0.166 in the "human" one – and an average adjusted R2 of M = 0.09, SD = 0.14 in the "robot" condition and M = 0.08, SD = 0.11 in the "human" condition).

Figure 4.  Human – Robot comparison. Amount of anticipation (measured as a proportion of movement duration). A: Single subjects' proportion of anticipation during the observation of robotic actions plotted against the corresponding proportion of anticipation during the observation of human actions. "Human" values have been corrected for movement duration differences between the robotic and human conditions (see text for details). Error bars represent within-subject standard error. Different symbols represent different subjects. The dashed line marks the identity line: if a data point lies under this line, the proportion of anticipation for that subject is higher in the "human" condition than in the "robot" one. B: Box plot of the anticipation proportion in the "robot" and the "human" actor conditions. Each box is determined by the 25th and 75th percentiles of the distribution, while the whiskers are determined by the 5th and 95th percentiles. The small squares indicate the sample averages, while the horizontal lines represent their medians

Another possible confound derived from the fact that the timing of the human action was more variable than the robotic one and generally shorter: average human movement duration was around 2.5 s (M = 2.4, SD = 0.4), while average robot movement duration was a little less than 3 s (M = 2.8, SD = 0.09). To compensate for a possible subject-specific effect of this difference on the amount of anticipation between human and robotic action observation, we linearly fitted the proportion of anticipation against movement duration for all trials in the human condition, for each subject. Then, we extrapolated the proportion of anticipation in the human condition for a trial duration corresponding to the average robotic movement duration for that subject. Lastly, we replaced the anticipation measured in the "human" condition with this corrected estimate. The results are plotted in Figure 4. As the figure shows, even after this correction no difference in gazing behaviour appears when the actor is a human or a humanoid robot. Indeed, replicating the previous analysis on the corrected data we obtained similar results, with an average percentage of variation between "human" and "robot"
condition not significantly different from 0 (t(7) = -0.757, p = 0.474 in a onesample t-test) and a strong correlation between the anticipation exhibited in the two conditions (p = 0.007 for the linear regression, with a slope not significantly different from 1 (0.95, 95% confidence interval: [0.37–1.52]; adjusted R2 = 0.68). Thus, our results suggest that motor resonance, in the form of anticipatory gaze behaviour, occurs during humanoid robot observation as much as during human agent observation. 4.  Discussion When we observe someone performing an action, we usually interpret the action as goal directed. This “goal-centric” understanding is also reflected by the way we move our eyes. So if we look at someone fetching a bottle of wine, our gaze will tend to shift anticipatorily to the glass into which the wine will be poured. Such anticipatory mechanism is triggered by action observation and in particular by an automatic matching between the observed act and our own motor repertoire: the actor is “motorically” interpreted as an agent, which shares with us similar motor representations and thus similar goals (e.g. drinking a glass of good wine). Since anticipatory eyes movements belong to the motor programme associated to action execution, they are similarly activated also when we just “resonate” to the action of others. Much evidence exists in favour of a tight link between gaze and action control. For instance gaze has been shown to signal the intention behind one’s action as much as the action itself, as demonstrated by a similar neural response in an observer when witnessing a gazing or a reach-to-grasp action toward the same object (Pierno et al. 2006). Gaze movements become therefore a backdoor to access the action processing mechanism of our partners. Eyes represent a communication channel during interaction, telling not only where others’ are focusing their attention, but also providing a direct connection to the very basic mechanism of mutual understanding. In this study we monitored subjects’ gaze during the observation of a humanoid robot performing a goal-directed action to evaluate whether a similar implicit mutual understanding can occur also between humans and robots. More precisely, the question was whether a robotic model is able to induce motor resonance as a human actor would or, on the contrary, it is perceived as a non goal-oriented, self moving object, which does not evoke mirror neurons system activation and anticipatory gaze. Our results show that subjects exhibited the tendency to ­anticipatorily shift their gaze toward action goal similarly when the actor was human or robot. The humanoid robot is therefore implicitly interpreted as a goal-oriented, predictable agent, able to evoke the same motor resonance as a human actor.


Before accepting this conclusion, one needs to examine whether alternative explanations are equally plausible. One alternative explanation is that being the spatial goal of the action evident, the anticipation of the action goal could have been performed with no need of the activation of the motor resonance mechanism sub serving anticipatory gaze. However, the request to the subject was not to concentrate on (or anticipate) the action goal, but rather just to look at the action. Previous studies have shown that even in presence of unambiguous spatial goals, automatic anticipatory gaze shifts to the goal occur significantly more often when an action (and an action belonging to the motor repertoire of the observer) is witnessed, while otherwise a tracking behaviour is more prominent (e.g. F ­ alck-Ytter et al. 2006 in infants; Flanagan & Johansson 2003 in adults). Therefore, we are confident that the occurrence of anticipatory gaze shifts to action goal in presence of the robot’s action represents evidence in favour of the activation of a direct matching mechanism between the human observer and the robot, as it has been proved with neurophysiological studies for human observation (Elsner et al. 2012). Hence, at least at the level of a motor matching the robot is perceived almost as a conspecific, i.e. as a goal-oriented agent, sharing a common motor vocabulary. Another possibility is that subjects show anticipatory gaze behaviour because they have misunderstood experimenter’s instructions as if their task was to gaze as soon as possible to the action goal. We are however prone to exclude this alternative, because – as the spatial target of the action was clear already from the beginning of the movement – the gaze would have arrived immediately to the goal since the beginning of action presentation and would have not shown the variability here reported. However, we recognize that in some cases a misunderstanding of task requirements might have actually occurred. In fact, although subjects presented on average anticipatory gaze behaviour, some individuals followed the demonstrator’s hand movements all the time, irrespective of agent’s nature (human or robot). This result seems to confirm the hypothesis formulated by Gesierich et al. (2008) in relation to an experiment where they monitored eye tracking during the observation of virtual block stacking task on a computer. They failed to measure anticipation in about half of their sample of subjects and suggested that it could have been due to a misunderstanding of the task. The experimental setup and the calibration procedure may in fact have let subjects understand that eye movements were the relevant element in the study and thus erroneously infer that their task was to track the moving effector. This explanation seems to be confirmed by the behaviour of one of our subjects. He was discarded from the analysis because during action observation his gaze never entered the goal area, as it remained always aligned with demonstrator’s (human or robot) hand. To check whether he behaved this way for a lack of motor resonance or for a misunderstanding of the task, we analyzed his gaze behaviour not only during the transport-to-target action but
also in the phase of putting the object back to the start position. Interestingly in this case, which to the subject did not appear as a part of the task but rather as a functional movement necessary to begin with the next trial, subject’s gaze arrived on the target object in advance 75% of the times and with an anticipation of about 32% of the whole trial duration. This suggests that indeed some false belief about the task can modulate or even cancel the natural anticipatory behaviour that subjects would have shown in an ecological situation. The progressive improvement of eye tracking devices, which are becoming less invasive and require faster (or a posteriori) calibration procedures, will maybe simplify the design of more ecological testing situation, in which the subjects are not induced to monitor their own gazing behaviour during the experiment. However, the majority of our subjects were not confused by the task instructions and exhibited a clear anticipation for both human and robot action presentation. A possible alternative explanation of the results could be that subjects were somehow forced to perceive the robot as a human agent because it replicated an action previously performed by a human. To avoid this potential issue we presented always the robot as the first demonstrator, so that no immediate association between the observed action performed by the robot and a human agent could be made. This conservative choice assured that anticipation could not be ascribed to the attribution of human-like behaviours to the robot because of previous exposure to human actions. However, it did not allow for testing the existence of an effect of the order of the presentation of the two agents. Future research should be dedicated to measure whether witnessing robot actions has a significant impact on the subsequent observation of human actions and vice versa. It should be also noted that the robot was physically present, executing the action in front of the subjects’ eyes. Such concrete presence made it clear for the subject that the robot was a real artefact and not just an animated character or a human actor in disguise, which could have in turn possibly evoked a response similar to human observation by analogy. Therefore, all precautions were taken in order to avoid a high-level, cognitive “humanization” of the robotic platform. Of course, we cannot exclude that some subjects have anyway explicitly attributed a human-like nature to the iCub, however such explicit (non motor) attribution would not have automatically led to the occurrence of anticipatory gaze shifts (Gredebäck & Melinder 2010). Another hypothesis is that subjects could get habituated to the multiple presentations and reduced their attention to the stimuli and possibly their anticipatory gaze shift. Hence, the anticipation measured for the human condition would be lower than what it really is, because it was always performed as the last condition. Indeed, the repetitive presentation of exactly the same action has been suggested by some authors to inhibit the firing of the mirror neuron system
(e.g. Gazzola et al. 2007), one of the neural mechanisms connected to the occurrence of anticipatory gaze (Elsner et al. 2012). However, in our case we would reject this hypothesis, because no trend of habituation was observed during either condition for all subjects. Indeed, no clear change in anticipatory gaze shifts was individuated as a function of the number of repetitions, suggesting that habituation did not play a relevant role in this experiment. Moreover, the test was conducted with real presentations of the stimuli, rather than videos. Probably the slight variations always present from movement to movement kept the resonance mechanism active during all the motion repetitions. In summary, during the observation of a humanoid robot performing a goal directed action as transporting an object into a container, subjects anticipatorily gazed to the goal of its action the same way they would for the presentation of a human action, suggesting that the robot has been implicitly interpreted as a goaloriented agent and not just as a complex moving object. This occurrence of anticipatory gaze shifts implies a motor-based understanding of the action goal, which does not require inferential or teleological reasoning, but is rather based on an implicit, covert matching between the observer’s and the agents’ actions (Elsner et al. 2012; Flanagan & Johansson 2003). Such motor-based mutual understanding constitutes one of the principal bases of human social interaction abilities (­Gallese et al. 2004; Oberman et al. 2007b) and there is evidence that the manifestation of such resonance – e.g. when two agents imitate each other – can lead to an increased acceptance and sense of comfort in the interaction (Chartrand & Bargh 1999), an increased sense of closeness to other people and even evoke the occurrence of more prosocial behaviours (van Baaren et al. 2003). Therefore, these results are promising for HRI, in that they suggest that interaction with robots could be based on the same basic social implicit mechanisms on which human-human interaction is rooted. Several findings suggest that multiple subtle cues play a fundamental role to elicit mutual understanding both in interaction between humans and between human and robot, ranging from the way the robot moves (biological motion, e.g. Chaminade et al. 2005; Kupferberg et al. 2011), to robot appearance (humanoid shape, e.g. Moriguchi et al. 2010) and robot social signals (gaze behaviour, autonomous movements, e.g. Itakura et al. 2008; Sciutti et al. 2013). In this work we focussed on the role of agent’s nature (human versus robot) controlling all other parameters, i.e. maintaining in both conditions a biological velocity profile of the motion, hiding actor’s gaze direction from subjects’ view and comparing human agents with a humanoid robot. Future research will be needed to determine which parameters are actually relevant in making a robot more or less likely to engage implicit social mechanisms, possibly disentangling the role of motion and form of the robot (as already suggested by Oztop et al. 2005; ­Chaminade & Cheng 2009).




This study introduces the measure of anticipatory gaze behaviour as a powerful tool to understand which elements in the robotic implementation let the robot be perceived as an interactive agent rather than a mechanical tool. In addition to motor resonance, several other factors affect the perception of the robot by a person, e.g. attention, the emotional state, the action context, previous experience and cultural background. However, the measure of motor resonance through the monitoring of anticipatory gaze behaviour could be an important source of information about the unconscious perception of robots behaviour, as it plays such a basic role in human interactions (Gallese et al. 2004). Therefore, we suggest that the combination of this measure with physiological (Dehais et al. 2011; Rani et al. 2002; Wada et al. 2005) and qualitative information (Bartneck et al. 2009; Kamide et al. 2012) would provide a comprehensive description of HRI, encompassing the conscious judgment provided by the human agent about the robot, as well as the quantification of its automatic response. This ensemble of techniques could therefore represent an innovative test of the basic predisposition to the interaction with humanoid robots, also useful to give guidelines on how to build new interactive robots. Indeed, the question of how robots are perceived by humans is becoming more central: the progressive introduction of robots in a wide range of common applications, as for instance home appliances, entertainment security or rehabilitation, is reducing the engineers’ control on who will interact with the robot and how. Consequently, the design of the robot has to take into account its interactive skills and the impact that its own behaviour has on its human partners. The monitoring of anticipatory gaze could tell us under which conditions humans unconsciously interpret robots as a predictable interaction partner, sharing their same action representations and their same goal-directed attitude.

Acknowledgments The authors would like to thank Marco Jacono for his help in building the setup and preparing the experiments. The work has been conducted in the framework of the European projects ITALK (Grant ICT-FP7-214668), POETICON++ (Grant ICT-FP7-288382) and CODEFROR (PIRSES-2013-612555).

References Ambrosini, E., Costantini, M., & Sinigaglia, C. (2011a). Grasping with the eyes. Journal of Neurophysiology, 106, 1437–1442. Bartneck, C., Kulic, D., Croft, E., & Zoghbi, S. (2009). Measurement instruments for the anthropomorphism, animacy, likeability, perceived safety of robots. International Journal of Social Robotics, 1, 71–81.

 Alessandra Sciutti et al. Chaminade, T., & Cheng, G. (2009). Social cognitive neuroscience and humanoid robotics. Journal of physiology, Paris, 103, 286–295. Chaminade, T., Franklin, D., Oztop, E., & Cheng, G. (2005). Motor interference between humans and humanoid robots: Effect of biological and artifical motion. In International Conference on Development and Learning (pp. 96–101). Chaminade, T., Zecca, M., Blakemore, S–J., Takanishi, A., Frith, C.D., Micera, S., Dario, P., ­Rizzolatti, G., Gallese, V., & Umiltà, M.A. (2010). Brain response to a humanoid robot in areas implicated in the perception of human emotional gestures. PLoS One, 5, e11577. Chartrand, T.L., & Bargh, J.A. (1999). The chameleon effect: The perception-behavior link and social interaction. Journal of Personality and Social Psychology, 76, 893–910. Cross, E.S., Liepelt, R., de CHAF, Parkinson, J., Ramsey, R., Stadler, W., & Prinz, W. (2011). Robotic movement preferentially engages the action observation network. Human Brain Mapping, 33, 2238–2254. DOI: 10.1002/hbm.21361 Dehais, F., Sisbot, E.A., Alami, R., & Causse, M. (2011). Physiological and subjective evaluation of a human-robot object hand-over task. Applied Ergonomics, 42(6), 785–791. Efron, B., & Tibshirani, R.J. (1993). An introduction to the bootstrap. New York, NY: Chapman & Hall. Elsner, C., D’Ausilio, A., Gredebäck, G., Falck-Ytter, T., & Fadiga, L. (2012). The motor cortex is causally related to predictive eye movements during action observation. Neuropsychologia, 51, 488–492. Eshuis, R., Coventry, K.R., & Vulchanova, M. (2009). Predictive eye movements are driven by goals, not by the mirror neuron system. Psychological Science, 20, 438–40. Fabbri-Destro, M., & Rizzolatti, G. (2008). Mirror neurons and mirror systems in monkeys and humans. Physiology (Bethesda), 23, 171–179. Falck-Ytter, T., Gredebaeck, G., & von Hofsten, C. (2006). Infants predict other people’s action goals. Nature Neuroscience, 9, 878–879. Flanagan, J.R., & Johansson, R.S. (2003). Action plans used in action observation. Nature, 424, 769–771. DOI: 10.1038/nature01861 Gallese, V., Fadiga, L., Fogassi, L., & Rizzolatti, G. (1996). Action recognition in the premotor cortex. Brain, 119(Pt 2), 593–609. Gallese, V., Keysers, C., & Rizzolatti, G. (2004). A unifying view of the basis of social cognition. Trends in Cognitive Sciences, 8, 396–403. Gazzola, V., Rizzolatti, G., Wicker, B., & Keysers, C. (2007). The anthropomorphic brain: The mirror neuron system responds to human and robotic actions. Neuroimage, 35, 1674–1684. Gesierich, B., Bruzzo, A., Ottoboni, G., & Finos, L. (2008). Human gaze behaviour during action execution and observation. Acta psychological, 128, 324–330. Gredebäck, G., & Kochukhova, O. (2010). Goal anticipation during action observation is influenced by synonymous action capabilities, a puzzling developmental study. Experimental Brain Research, 202, 493–497. Gredebäck, G., & Melinder, A. (2010). Infants’ understanding of everyday social interactions: A dual process account. Cognition, 114, 197–206. Gredebäck, G., Stasiewicz, D., Falck-Ytter, T., von Hofsten, C., & Rosander, K. (2009). Action type and goal type modulate goal-directed gaze shifts in 14-month-old infants. Developmental Psychology, 45, 1190–1194. Helskinki (1964, 2008) World Medical Association, Declaration of Helsinki. Ethical principles for medical research involving human subjects. URL: http://www.wma.net/ en/30publications/10policies/b3/. Last accessed 24/4/2014.




Itakura, S., Ishida, H., Kanda, T., Shimada, Y., Ishiguro, H., & Lee, K. (2008). How to build an intentional android: Infants’ imitation of a robot’s goal-directed actions. Infancy, 3, 519–532. Johansson, R.S., Westling, G., Bäckström, A., & Flanagan, J.R. (2001). Eye-hand coordination in object manipulation. The Journal of Neuroscience, 21, 6917–6932. Kamide, H., Mae, Y., Kawabe, K., Shigemi, S., & Arai, T. (2012). A psychological scale for general impressions of humanoids In IEEE International Conference on Robotics and Automation (ICRA, pp. 4030–4037). Kanakogi, Y., & Itakura, S. (2011). Developmental correspondence between action prediction and motor ability in early infancy. Nature Communications, 2, 341. Kilner, J.M., Paulignan, Y., & Blakemore, S.J. (2003). An interference effect of observed biological movement on action. Current Biology, 13, 522–525. Kupferberg, A., Glasauer, S., Huber, M., Rickert, M., Knoll, A., & Brandt, T. (2011). Biological movement increases acceptance of humanoid robots as human partners in motor interaction. AI & Society, 26, 339–345. Liepelt, R., Prinz, W., & Brass, M. (2010). When do we simulate non-human agents? Dissociating communicative and non-communicative actions. Cognition, 115, 426–434. Metta, G., Natale, L., Nori, F., Sandini, G., Vernon, D., Fadiga, L., von Hofsten, C., Rosander, K., Lopes, M., Santos-Victor, J., Bernardino, A., & Montesano, L. (2010). The iCub humanoid robot: An open-systems platform for research in cognitive development. Neural Networks, 23, 1125–1134. Moriguchi, Y., Minato, T., Ishiguro, H., Shinohara, I., & Itakura, S. (2010). Cues that trigger social transmission of disinhibition in young children. Journal of Experimental Child Psychology, 107, 181–187. Nyström, P., Ljunghammar, T., Rosander, K., & von Hofsten, C. (2011). Using mu rhythm desynchronization to measure mirror neuron activity in infants. Developmental Science, 14, 327–335. Oberman, L.M., McCleery, J.P., Ramachandran, V.S., & Pineda, J.A. (2007a). EEG evidence for mirror neuron activity during the observation of human and robot actions: Toward an analysis of the human qualities of interactive robots. Neurocomputing, 70, 2194–2203. Oberman, L.M., Pineda, J.A., & Ramachandran, V.S. (2007b). The human mirror neuron system: A link between action observation and social skills. Social Cognitive and Affective Neuroscience, 2, 62–66. Oztop, E., Franklin, D., Chaminade, T., & Cheng, G. (2005). Human-humanoid interaction: Is a humanoid robot perceived as a human? International Journal of Humanoid Robotics, 2, 537–559. Pattacini, U., Nori, F., Natale, L., Metta, G., & Sandini, G. (2010). An experimental evaluation of a novel minimum-jerk cartesian controller for humanoid robots. IEEE International Conference on Intelligent Robots and Systems (pp. 1668–1674). Perani, D., Fazio, F., Borghese, N.A., Tettamanti, M., Ferrari, S., Decety, J., & Gilardi, M.C. (2001). Different brain correlates for watching real and virtual hand actions. Neuroimage, 14, 749–758. Pierno, A.C., Becchio C., Wall, M.B., Smith, A.T, Turella, L., & Castiello, U. (2006). When gaze turns into grasp. Journal of Cognitive Neuroscience, 18(12), 2130–2137. Press, C., Bird, G., Flach, R., & Heyes, C. (2005). Robotic movement elicits automatic imitation. Brain Research. Cognitive Brain Research, 25, 632–640. DOI: 10.1016/j.cogbrainres.2005.08.020 Press, C., Gillmeister, H., & Heyes, C. (2007). Sensorimotor experience enhances automatic imitation of robotic action. Proceedings. 
Biological sciences, 274, 2509–2514. DOI: 10.1098/rspb.2007.0774 Rani, P., Sims, J., Brackin, R., & Sarkar, N. (2002). Online stress detection using psychophysiological signal for implicit human-robot cooperation. Robotica, 20(6), 673–686. DOI: 10.1017/S0263574702004484

 Alessandra Sciutti et al. Rizzolatti, G., & Craighero, L. (2004). The mirror-neuron system. Annual Review of Neuroscience, 27, 169–192. Rizzolatti, G., Fadiga, L., Fogassi, L., & Gallese, V. (1999). Resonance behaviors and mirror neurons. Archives Italiennes de Biologie, 137, 85–100. Rizzolatti, G., Fadiga, L., Gallese, V., & Fogassi, L. (1996). Premotor cortex and the recognition of motor actions. Brain Research. Cognitive Brain Research, 3, 131–141. Rizzolatti, G., Fogassi, L., & Gallese, V. (2001). Neurophysiological mechanisms underlying the understanding and imitation of action. Nature Reviews Neuroscience, 2, 661–670. Rizzolatti, G., & Sinigaglia, C. (2010). The functional role of the parieto-frontal mirror circuit: Interpretations and misinterpretations. Nature Reviews Neuroscience, 11, 264–274. Rosander, K., & von Hofsten, C. (2011). Predictive gaze shifts elicited during observed and performed actions in 10-month-old infants and adults. Neuropsychologia, 49, 2911–2917. Sandini, G., Metta, G., & Vernon, D. (2007). The iCub cognitive humanoid robot: An opensystem research platform for enactive cognition. In 50 years of artificial intelligence (pp. 358–369). Springer Berlin: Heidelberg. Sciutti, A., Bisio, A., Nori, F., Metta, G., Fadiga, L., Pozzo, T., & Sandini, G. (2012). Measuring human-robot interaction through motor resonance. International Journal of Social ­Robotics, 4(3), 223–234. Sciutti, A., Del Prete, A., Natale, L., Burr, D.C., Sandini, G., & Gori, M. (2013). Perception during interaction is not based on statistical context. IEEE/ACM Proceedings of the Human Robot Interaction Conference 2013. p. 225–226. Senju, A, Southgate, V, White, S, & Frith, U. (2009). Mindblind eyes: An absence of spontaneous theory of mind in Asperger syndrome. Science, 325, 883–885. Shimada, S. (2010). Deactivation in the sensorimotor area during observation of a human agent performing robotic actions. Brain Cognitive, 72, 394–399. DOI: 10.1016/j.bandc.2009.11.005 Southgate, V., Johnson, M.H., Osborne, T., & Csibra, G. (2009). Predictive motor activation during action observation in human infants. Biology Letters, 5, 769–772. DOI: 10.1098/rsbl.2009.0474 Stadler, W., Ott, D.V., Springer, A., Schubotz, R.I., Schutz-Bosbach, S., & Prinz, W. (2012). Repetitive TMS suggests a role of the human dorsal premotor cortex in action prediction. Frontiers in Human Neuroscience, 6. Stadler, W., Schubotz, R.I., von Cramon, D.Y., Springer, A., Graf, M., & Prinz, W. (2011). Predicting and memorizing observed action: Differential premotor cortex involvement. Human Brain Mapping, 32, 677–687. Tai, Y.F., Scherfler, C., Brooks, D.J., Sawamoto, N., & Castiello, U. (2004). The human premotor cortex is ‘mirror’ only for biological actions. Current Biology, 14, 117–120. DOI: 10.1016/j.cub.2004.01.005 Urgesi, C., Maieron, M., Avenanti, A., Tidoni, E., Fabbro, F., & Aglioti, S.M. (2010). Simulating the future of actions in the human corticospinal system. Cerebral Cortex, 20, 2511–2521. van Baaren, R.B., Holland, R.W., Steenaert, B., & van Knippenberg, A. (2003). Mimicry for money: Behavioral consequences of imitation. Journal of Experimental Social Psychology, 39, 393–398. DOI: 10.1016/S0022-1031(03)00014-3 Wada, K., Shibata, T., Musha, T., & Kimura, S. (2005). Effects of robot therapy for demented patients evaluated by EEG. In Proceedings IEEE/RSJ International Conference Intelligent Robots and Systems (IROS, pp.1552–1557). Woodward, A.L. (1998). Infants selectively encode the goal object of an actor’s reach. 
Cognition, 69, 1–34. DOI: 10.1016/S0010-0277(98)00058-4

Can infants use robot gaze for object learning? The effect of verbalization Yuko Okumura1, Yasuhiro Kanakogi2, Takayuki Kanda3, Hiroshi Ishiguro3,4 & Shoji Itakura1 1Graduate

School of Letters, Kyoto University, Japan / 2Graduate School of Education, Kyoto University, Japan / 3ATR Intelligent Robotics and Communication Laboratories, Japan / 4Graduate School of Engineering Science, Osaka University, Japan Previous research has shown that although infants follow the gaze direction of robots, robot gaze does not facilitate infants’ learning for objects. The present study examined whether robot gaze affects infants’ object learning when the gaze behavior was accompanied by verbalizations. Twelve-month-old infants were shown videos in which a robot with accompanying verbalizations gazed at an object. The results showed that infants not only followed the robot’s gaze direction but also preferentially attended to the cued object when the ostensive verbal signal was present. Moreover, infants showed enhanced processing of the cued object when ostensive and referential verbal signals were increasingly present. These effects were not observed when mere nonverbal sound stimuli instead of verbalizations were added. Taken together, our findings indicate that robot gaze accompanying verbalizations facilitates infants’ object learning, suggesting that verbalizations are important in the design of robot agents from which infants can learn. Keywords:  gaze following; humanoid robot; infant learning; verbalization; cognitive development

1.  Introduction Observing others' behavior enables infants to learn new information about their surroundings. Human eye-gaze points to the target of others' attention and conveys rich information. In interactions between infants and caregivers, infants can acquire various types of information by following gaze direction. Therefore, gaze-following behavior is an aspect of social learning and has an important role in language acquisition (Baldwin 1991; Tomasello 1999), emotional regulation (Moses, Baldwin, Rosicky & Tidball 2001), and theory of mind (Baron-Cohen 1994).


Previous studies have shown that infants follow the gaze direction of humans (Farroni, Massaccesi, Pividori & Johnson 2004; Flom, Lee & Muir 2007; ­Gredebäck, Theuring, Hauf & Kenward 2008) and nonhuman agents or robots (Johnson, Slaughter & Carey 1998; Meltzoff, Brooks, Shon & Rao 2010; O’Connell, Poulin-Dubois, Demke & Guay 2009). A recent study compared the influence of following gaze between humans and robots. In that study, Okumura, Kanakogi, Kanda, Ishiguro and Itakura (2013a) focused on infants’ ability to use the gaze of others in learning about objects and compared the influence of human and robot gaze on infants’ object learning. That experiment considered two different effects of gaze on object learning using two distinct test methodologies. First, the manner in which infants process object information was assessed by measuring their time spent looking at objects. This assessment relied on previous studies indicating that infants process a previously cued object as more familiar than an uncued object, and these studies consider object processing as a kind of object learning (Cleveland, Schug & Striano 2007; Cleveland & Striano 2008; Reid & Striano 2005; Theuring, Gredebäck & Hauf 2007). Second, whether infants evaluated a cued object as more likeable than an uncued object (for research on adults, see Bayliss, Frischen, Fenske & Tipper 2007; Bayliss, Paul, Cannon & Tipper 2006) was assessed by measuring object preferences through their reaching behavior. Selective preference toward a cued object is regarded as evidence that the gaze of others can have an impact on the affective appraisal of objects in the environment (Becchio, Bertone & Castiello 2008). Thus, selective choice based on others’ gaze seems to reflect social learning, especially in the context of social referencing. Therefore, object processing and object preferences can be considered to be a kind of object learning. In the experiments by Okumura et al. (2013a), 12-month-old infants were shown videos in which a human or a robot gazed at an object. The results showed that 12-month-old infants followed the gaze direction of both humans and robots, but only human gaze facilitated their object learning. Infants showed enhanced processing of, and preferences for, the target object gazed at by a human but not by a robot. Okumura et al. (2013a) concluded that human gaze has a powerful influence on infants’ object learning compared with robot gaze. Likewise, O’Connell et al. (2009) demonstrated that although 18-month-olds followed the gaze of a robot, they did not use robot gaze to learn new words. Furthermore, Okumura, ­Kanakogi, Kanda, Ishiguro, and Itakura (2013b) addressed the mechanism underlying the difference in gaze nature between humans and robots, showing that 12-month-old infants hold referential expectations specifically from gaze shifts of humans but not from robots. However, robots may offer some potential to affect infant learning. There is growing evidence that factors such as social signals or experience with robots can
modulate how infants recognize and interact with these nonhuman agents (Arita, Hiraki, Kanda & Ishiguro 2005; Itakura et al. 2008a; Johnson et al. 1998; ­Johnson, Booth & O’Hearn 2001; Meltzoff et al. 2010; Tanaka, Cicourel & Movellan 2007). For example, Arita et al. (2005) found that 10-month-old infants interpret an interactive robot as a communicative agent, a kind of human being, but regard a non-interactive robot as an object. Similarly, Itakura et al. (2008a) demonstrated that toddlers can correctly infer the intentions of a robot’s actions and imitate the intended goal when the robot displays social signals like eye contact with others. These studies indicate that when robots are equipped with social properties, infants’ recognition of them could be altered. Thus, it is possible that infants learn and acquire information from a robot when the robot displays social and communicative signals. The present study addressed whether adding signals to robots can help in turning them into agents for infant learning, using the paradigm of Okumura et al. (2013a). To examine whether robots can influence infant learning, we added verbalizations to the robot used in the study by Okumura et al. (2013a). In that study, the robot was self-propelled but did not have explicit communicative signals except direct eye-gaze. Communicative signals (e.g. direct eye contact, infant-directed speech, having one’s name called, or contingent reactivity) indicate someone’s communicative intention and play a primary role in facilitating social learning in young infants, as suggested by the theory of natural pedagogy (Csibra & Gergely 2009, 2011). Specifically, the way that infants process events can be altered through communicative signals by conveying to infants that they are the addressees of informative intentions (Senju & Csibra 2008; Topál, Gergely, Miklósi, Erdohegyi & Csibra 2008; Yoon, Johnson & Csibra 2008). We focused on infant-directed verbalizations as a communicative signal because in the Okumura et al. (2013a) study, only direct eye-gaze did not facilitate infant learning from robots. The importance of verbalizations has been demonstrated for establishing joint attention in infantadult interactions (Parise, Cleveland, Costabile & Striano 2007) and for guiding infants’ information gathering behavior in a social referencing situation (Mumme, Fernald & Herrera 1996; Vaish & Striano 2004). The present study tested whether robot gaze affects infants’ object learning when verbalizations are present. We combined the verbalizations in a naturalistic way with the robot’s gaze behavior in the procedure used by Okumura et al. (2013a). The verbal signal “Hello Baby!” was uttered when the robot made a direct eye-gaze. Moreover, previous studies have shown that multiple and additive cues have a greater effect than a single cue on infants’ gaze-following behavior (Flom, Deák, Phill & Pick 2004) as well as on infants’ perception of goal-directed actions (Biro & Leslie 2007). Therefore, our study focused on additive effects and investigated whether infant learning was facilitated when the robot’s verbalizations were


quantitatively increased. The additional verbal signal “there is a toy” was combined with the robot’s shifting gaze. As a consequence, the nature of the two verbal signals was different: the first verbal signal was an ostensive signal to get infants’ attention, and the second was a referential signal to indicate an object. Then, we directly compared our current results with the original data in the robot condition from Okumura et al. (2013a) to assess the influence of verbalizations combined with robots.

2.  Experiment 1

2.1  Method

2.1.1  Participants

Informed consent was obtained from the parents prior to participants’ involvement in the study. The ethics review board at the Department of Psychology, Kyoto University, approved the study design. Thirty-two 12-month-old infants (18 males, 14 females; mean age 366.9 days; range 351–381 days) were randomly assigned to either an ostensive (n = 16) or ostensive-referential signal (n = 16) condition. An additional seven infants were tested but excluded from analyses because of fussiness (n = 3), completing fewer than three trials of gazing at one of the objects in the initial phase and thereby failing to meet the inclusion criterion (n = 2), or refusing to reach for an object (n = 2).

2.1.2  Apparatus

The apparatus was the same as in the study by Okumura et al. (2013a). A Tobii T60 eye tracker (Tobii Technology) with a 17-inch TFT monitor was used to record infants’ looking behavior. The robot used as a model was developed at the ATR Intelligence Robotics Laboratory in Japan. It is an autonomous humanoid robot (1.2 m in height, 50 cm in diameter, and 40 kg in weight) with human-like eyes, hands, and body (see Figure 1). The robot’s movement is controlled through a computer program, and the robot can move independently.

2.1.3  Stimuli and procedure

The procedure was the same as the one employed by Okumura et al. (2013a). Infants were seated on their parent’s lap with their eyes approximately 60 cm from the monitor. A five-point calibration was conducted prior to eye movement recording. In the initial phase, infants viewed six video clips in which a robot gazed at one of two objects (Figure 1a-c). Each video began with a scene in which the robot looked down at the table (2 seconds). Next, the robot looked up (1 second) and



fixated straight ahead (2 seconds). The robot then turned toward (1 second) and fixated on (5 seconds) one of the two objects. The robot gaze was always directed toward the same object, but the object’s location changed in an ABBABA order (i.e. A = left and B = right or vice versa). The object at which the robot gazed and the initial direction of robot gaze (leftward or rightward) were counterbalanced across participants. This robot stimulus was identical to that in the robot condition from Okumura et al. (2013a), except for the addition of verbalizations. In the ostensive condition, we added the ostensive verbal signal, “Hello Baby!”, as the robot fixated straight ahead before shifting gaze (see Figure 1b). This ostensive signal, intended to capture infants’ attention, was delivered in an infant-directed manner (2 seconds) by a human voice. In the ostensive-referential condition, in addition to the ostensive verbal signal, we further provided the referential verbal signal, “There is a toy” (2 seconds), in the same human voice as the robot looked at the object (the verbalizations occurred in the first 2 of the 5 seconds of this phase; see Figure 1c). This referential signal was descriptive, indicating the referent. After this initial phase, two tasks were conducted in the test phase: one trial for the looking time test and one trial for the object choice test. First, in the looking time test that assessed object processing, the two objects on a black background were presented for 30 seconds (Figure 1d). Second, in the object choice test, infants were shown the two real objects by a researcher who was unaware of the identity of the cued object. The presentation position (left or right) of the two objects was counterbalanced across participants.

Figure 1.  Selected frames from the gaze-shift stimuli (initial phase) and the test stimuli (test phase). (a) Each presentation begins with a robot looking down at a table. (b) The robot looks up and fixates straight ahead. (c) The robot turns toward and fixates on one of two objects. (d) The two objects are presented in the looking time test.

2.1.4  Data analysis

To assess gaze following in the initial phase, we mainly measured (1) infants’ first eye movements toward the object and (2) looking time for the two objects during the 5 seconds in which the robot fixated on the object. Infants needed to gaze at one of the two objects in at least three out of six trials to be included in the final analysis. For each measurement, the proportions (cued object/cued + uncued objects) were calculated and tested against a chance level of 0.5. Additionally, we coded the looking times for the video clips for the 11 seconds of the entire stimulus presentation of the initial phase. In the test phase, infants’ looking times for each object were coded for 30 seconds in the looking time test, and proportions of looking time for the cued and uncued objects were calculated. Because attention toward the uncued object is regarded as evidence that processing of the cued object is enhanced (Okumura et al. 2013a; Reid & Striano 2005; Theuring et al. 2007), the proportion of time directed at the uncued object was tested against chance level. In the object choice test, infants’ first touch within 60 seconds was defined as the choice. A researcher and a coder who were unaware of the experimental condition judged the choices of all infants. There was perfect agreement between these two individuals (Cohen’s κ = 1.00).

2.2  Results and discussion

Our current results (ostensive and ostensive-referential conditions) were directly compared with the original data (robot condition) from Okumura et al. (2013a). In the initial phase, total time looking at the video clip did not differ significantly across the robot, ostensive, and ostensive-referential conditions (MRobot = 9.51 of 11 s, MOstensive = 9.68 of 11 s, MOstensive-Referential = 9.97 of 11 s; F(2,45) = 0.38, p = .69, η2 = 0.02), indicating that infants kept their attention on the video equally in all conditions regardless of the presence of verbalizations. In the initial phase, infants’ gaze-following behavior showed that their first eye movements were directed toward the cued object at a frequency significantly greater than chance (0.5) in the robot, ostensive, and ostensive-referential conditions (one-sample t tests), t(15) = 3.66, p = .002, d = 0.92; t(15) = 3.47, p = .003, d = 0.87; and t(15) = 3.04, p = .008, d = 0.76, respectively. A one-way analysis of




variance (ANOVA) with condition (robot, ostensive, and ostensive-referential) as a between-participants factor revealed no significant effect, F(2,45) = 0.18, p = .84, η2 = 0.01 (Figure 2a). In contrast, when we subsequently analyzed proportions of looking time directed at the two objects after the robot’s gaze shift, the proportion directed toward the cued object was not significantly above chance level in the robot condition (one-sample t tests), t(15) = 0.92, p = .37, d = 0.23. However, when the ostensive verbal signal was combined with the robot, performance was significantly greater than chance in both the ostensive and ostensive-referential conditions, t(15) = 4.34, p = .001, d = 1.08; and t(15) = 3.88, p = .001, d = 0.97, respectively. These analyses indicate that infants looked longer at the cued object when the ostensive verbal signal was present. An ANOVA with condition (robot, ostensive, and ostensive-referential) as a between-participants factor revealed a significant effect, F(2,45) = 3.79, p = .03, η2 = 0.14. Post hoc testing (Tukey’s HSD) showed a significant difference between the robot and ostensive conditions (p = .04), marginal significance between the robot and ostensive-referential conditions (p = .06), and no significant difference between the ostensive and ostensive-referential conditions (p = .97) (Figure 2b). Together, these findings indicate that the ostensive verbal signal facilitated infants’ attention toward the cued object after they initially followed the gaze direction of the robot.

Figure 2.  Results of gaze following in the initial phase. (a) Proportion of first eye movements directed toward the cued object in each condition. (b) Proportion of looking time directed at the cued object in each condition. Error bars represent standard errors. Chance level is 0.5. Asterisks indicate statistical significance, p < .05; NS, not significant.

In analysis of the looking time test in the test phase, we found that the proportion of looking time directed at the uncued object was significantly different from chance level (0.5) only in the ostensive-referential condition (one-sample t tests), t(15) = 3.37, p = .004, d = 0.84. There were no significant differences in the robot and ostensive conditions, t(15) = -0.96, p = .35, d = 0.24; and t(15) = -0.44, p = .67, d = 0.11, respectively. An ANOVA with condition (robot, ostensive, and ostensive-referential) as a between-participants factor revealed a significant effect, F(2,45) = 3.50, p = .03, η2 = 0.14. Post hoc testing (Tukey’s HSD) revealed a significant difference between the robot and ostensive-referential conditions (p = .04), marginal significance between the ostensive and ostensive-referential conditions (p = .09), and no significant difference between the robot and ostensive conditions (p = .95) (Figure 3a). These results revealed that infants showed a novelty preference for the uncued object only in the ostensive-referential condition, indicating that infants process a previously cued object as more familiar than an uncued object (Okumura et al. 2013a; Reid & Striano 2005; Theuring et al. 2007). Therefore, infants showed enhanced processing of the object cued by robot gaze when both the ostensive and referential verbal signals were present. However, in the object choice test, infants did not robustly choose the cued object in any condition (10 of 16 infants for robot, binomial test, one-tailed, p = .23; 9 of 16 infants, p = .40 for ostensive; 8 of 16 infants, p = .60 for ostensive-referential) (Figure 3b). Together with the findings in the initial and test phases, we found that although infants in all conditions followed the gaze direction of the robot, infants who observed the robot with the ostensive verbal signal were more likely to look longer at the cued object after following the gaze. Moreover, infants showed enhanced processing of the cued object when the ostensive and referential verbal signals were increasingly present. Our results showed that combining verbalizations with the movement of the robot facilitated infants’ object learning from robot gaze. These findings suggest that infants can use robot gaze accompanying verbalizations for object learning.

Figure 3.  Results in the test phase. (a) Proportion of looking time directed at the cued and uncued objects in each condition in the looking time test. (b) Proportion of infants choosing each object in the object choice test. Error bars represent standard errors. Asterisks indicate statistical significance, p < .05; NS, not significant.
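For readers who wish to apply the same style of analysis, the following minimal sketch (in Python, and not the authors’ analysis code) illustrates the procedure used throughout these results: each infant contributes a proportion score (cued/(cued + uncued)), each condition is tested against the chance level of 0.5 with a one-sample t test, and conditions are compared with a one-way ANOVA. The arrays are random placeholders rather than the study’s data.

# Minimal sketch (not the authors' analysis code) of the statistics reported
# above: per-infant proportion scores are tested against chance (0.5) with
# one-sample t tests, and conditions are compared with a one-way ANOVA.
# The arrays are random placeholders, not the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One proportion score per infant (cued / (cued + uncued)), n = 16 per condition.
robot = rng.uniform(0.3, 0.8, size=16)
ostensive = rng.uniform(0.4, 0.9, size=16)
ostensive_referential = rng.uniform(0.4, 0.9, size=16)

for name, scores in [("robot", robot),
                     ("ostensive", ostensive),
                     ("ostensive-referential", ostensive_referential)]:
    t, p = stats.ttest_1samp(scores, popmean=0.5)      # against-chance test
    d = (scores.mean() - 0.5) / scores.std(ddof=1)     # Cohen's d for a one-sample test
    print(f"{name}: t({len(scores) - 1}) = {t:.2f}, p = {p:.3f}, d = {d:.2f}")

F, p = stats.f_oneway(robot, ostensive, ostensive_referential)  # between-condition test
print(f"ANOVA: F(2, {3 * 16 - 3}) = {F:.2f}, p = {p:.3f}")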




It should be noted that infants observed exactly the same visual stimuli across the three conditions, although there were differences in verbalizations. Thus, these results cannot be explained by low-level perceptual factors such as the movement speed or kinetic momentum of the robot stimuli. However, the question remains whether a sound stimulus that does not include verbalizations is sufficient to affect infant learning. In Experiment 2, to confirm that verbalizations affected infants’ attitudes toward the robot, different groups of infants observed exactly the same robot stimulus, but we substituted a nonverbal beeping sound for the verbalizations at the same point in the stimulus as in the ostensive-referential condition.

3.  Experiment 2

3.1  Method

3.1.1  Participants

Sixteen 12-month-old infants (10 males, 6 females; mean age 366.1 days; range 349–381 days) completed the study. An additional three infants were excluded from the analysis because of fussiness (n = 1), a failure to meet the inclusion criterion (n = 1), or parental interference (n = 1).

3.1.2  Stimuli and procedure

The procedure and analysis were identical to those in Experiment 1, except that in the initial phase a beeping sound instead of verbalizations was added to the robot (Sound condition). The beeping sound was inserted at the same time and was of the same duration as the verbalizations in the ostensive-referential condition.

3.1.3  Results and discussion

In the initial phase, infants’ first eye movements were directed toward the cued object significantly more than chance (one-sample t tests), t(15) = 3.34, p = .004, d = 0.85 (Figure 2a). However, the proportion of looking time for the cued object did not differ significantly from chance level, t(15) = 0.78, p = .45, d = 0.19 (Figure 2b). Moreover, in the test phase, the proportion of looking time for the uncued object was not significantly above chance in the looking time test, t(15) = 0.33, p = .75, d = 0.08 (Figure 3a), and infants showed no significant preference for the cued object in the choice test (8 of 16 infants, p = .60) (Figure 3b). These results contrast with our findings in the ostensive and ostensive-referential conditions of Experiment 1. Mere nonverbal sound stimuli did not influence infants’ attention toward, or processing of, the cued object. Hence, these findings confirm that adding verbalizations to the robot is critical for affecting infants’ object learning from robots.
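The object choice test relies on a one-tailed binomial test against chance (p = 0.5): how likely is it that at least the observed number of 16 infants would choose the cued object if each choice were random? The snippet below is an added illustration rather than the authors’ code; it reproduces the one-tailed values reported in Experiments 1 and 2 (e.g. p ≈ .23 for 10 of 16 infants and p ≈ .60 for 8 of 16).

# One-tailed binomial test against chance for the object choice test.
# Illustrative only; the counts are those reported in the text above.
from scipy.stats import binomtest

for chose_cued, n in [(10, 16), (9, 16), (8, 16)]:
    result = binomtest(chose_cued, n, p=0.5, alternative="greater")
    print(f"{chose_cued} of {n} infants chose the cued object: "
          f"one-tailed p = {result.pvalue:.2f}")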


4.  General discussion

Previous research has shown that although infants follow the gaze direction of a robot, robot gaze does not maintain infants’ attention toward the cued object or facilitate their object learning (Okumura et al. 2013a). In the current study, first steps were undertaken to investigate whether verbalizations, as communicative signals, can help turn robots into agents from which infants can learn. Our experiment used two verbalizations. The first, ostensive verbal signal (“Hello Baby!”) was combined with direct eye-gaze, while the second, referential verbal signal (“There is a toy”) was combined with the robot’s gaze shifting. The results showed that 12-month-old infants not only followed the gaze direction of the robot but also preferentially attended to the cued object when the ostensive signal was present. Moreover, infants showed enhanced processing of the cued object when the referential signal was present in addition to the ostensive signal. This type of enhanced processing indicates that infants processed a previously cued object as more familiar than an uncued object. Therefore, this processing can be considered to be a kind of object learning, as previous studies suggest (Cleveland et al. 2007; Cleveland & Striano 2008; Okumura et al. 2013a). Importantly, these effects were not observed when mere nonverbal sound stimuli instead of verbalizations were added to the robot. Taken together, our study demonstrated that infants could use robot gaze accompanying verbalizations for object learning.

The present study used two verbal signals. How might these verbal signals work at 12 months old? It is likely that the ostensive verbal signal served to attract infants’ attention, while the referential verbal signal guided their attention to the referent. From this perspective, infants might infer the specific expectation of the robot from ostensive signals, and thereby not only respond to the direction of the robot but also preferentially attend to the cued object. When both the ostensive and referential signals were present, infants might recognize the indication toward the object from the referential signal and show enhanced processing of the cued object. However, the ostensive-referential condition was qualitatively (i.e. semantically) and quantitatively different from the other conditions. That is, the ostensive-referential condition included confounding factors. Therefore, from our findings alone, we cannot accurately determine which aspects (e.g. the specific combination of signals, the quantitative increase of signals, etc.) are critical for object processing. Although future research is needed to investigate which aspects of the verbal signals affect infants’ object processing, including verbalizations is clearly important when designing robot agents from which infants can learn.

How do verbalizations influence infant learning? One possibility is that




by attributing intentional states to the verbalizing robot, infants regarded it as an agent for learning. Indeed, previous studies have shown that adding social factors to a robot may facilitate infants’ sensitivity to its intentions (e.g. Itakura et al. 2008a; Johnson et al. 1998). It has been suggested that a specific intention inferred from the gaze of others is essential for its influence toward an object (Becchio et al. 2008). In accordance with this proposal, the infants in our study might have recognized the robot with verbalizations as an intentional agent, and thus were affected by the gaze behavior of the robot. A second possibility is that combining the verbalizations with gaze behavior facilitated infants’ identification and interpretation of the robot’s actions as communicative acts specifically addressed to them. According to the theory of natural pedagogy, communicative signals play a primary role in facilitating social learning in young infants, and make it possible to efficiently convey information because the infants recognize the addresser’s actions as communicative demonstrations (Csibra & Gergely 2009, 2011). Moreover, researchers have recently demonstrated that 12-month-old infants can understand the communicative function of speech and recognize that the speaker is providing information to the listener (Martin, Onishi & Vouloumanos 2012; Vouloumanos, Onishi & Pogue 2012). Therefore, verbalizations might have provoked infants’ motivation for learning and induced them to attend to the robot’s actions. With any of these possibilities, adding verbalizations to the robot modulates how infants interpret it as an agent from which they can learn. Research regarding child-robot interactions has accumulated growing evidence and is sparking current interest in fields including education and therapy. Robotics researchers have designed and developed programs to enrich childhood education by introducing robots into educational environments such as elementary schools (Kanda, Hirano, Eaton & Ishiguro 2004; Kanda, Sato, Saiwaki & ­Ishiguro 2007) and nursery schools (Tanaka et al. 2007). In addition, developing social robots may be helpful for children with autism by guiding their social learning. Given that children with autism can make contact with a robot and continue an interaction with it (Werry, Dautenhahn & Harwin 2001), there is some hope that robotic systems will be effective therapeutic aids for increasing these children’s access to learning. In particular, the emerging research field of developmental cybernetics examines the interaction and integration between children and robots, and builds theoretical frameworks regarding characteristics that facilitate these interactions in areas such as teaching and learning (Itakura 2008; Itakura, Okanda & Moriguchi 2008b). This approach can be helpful in the formation of new learning strategies through integration with robot technology (Meltzoff, Kuhl, Movellan, & Sejnowski 2009). Therefore, our findings are important not only for their contribution to basic research, but also for practical use.


At the same time, our findings raise the possibility that learning from robots has its limitations. Infants do not show enhanced preference for the target object gazed at by a robot, although infants of the same age robustly prefer the object gazed at by a human (Okumura et al. 2013a). In our experiment, we showed a video of a robot doing things. If we showed a real robot to the infants, the real embodiment might produce a different effect for infants and facilitate the influence of robot gaze on affective appraisal of the cued object. In addition, it is possible that combining additional types of signals with the robots will have an additive effect on infant learning. Future studies are necessary to examine how robots can influence infant affective learning and to clarify which factors facilitate infant learning from these nonhuman agents. Such research can guide future directions for the design of humanoid robots in the field of social robotics and lead to new learning strategies.

Acknowledgements

This research was supported by a research fellowship from the Japan Society for the Promotion of Science (JSPS) for young scientists to Yuko Okumura and by grants from JSPS (21220005, 20220002, 20220004 and 25245067) and MEXT (21118005) to Shoji Itakura. We are grateful to the parents and infants who participated in this study.

References Arita, A., Hiraki, K., Kanda, T., & Ishiguro, H. (2005). Can we talk to robots? Ten-month-old infants expected interactive humanoid robots to be talked to by persons. Cognition, 95, B49–B57. DOI: 10.1016/j.cognition.2004.08.001 Baldwin, D.A. (1991). Infants’ contribution to the achievement of joint reference. Child Development, 62, 875–890. DOI: 10.2307/1131140 Baron-Cohen, S. (1994). How to build a baby that can read minds: Cognitive mechanisms in mindreading. Cahiers de Psychologie Cognitive/Current Psychology of Cognition, 13, 513–552. Bayliss, A.P., Frischen, A., Fenske, M.J., & Tipper, S.P. (2007). Affective evaluations of objects are influenced by observed gaze direction and emotional expression. Cognition, 104, 644–653. DOI: 10.1016/j.cognition.2006.07.012 Bayliss, A.P., Paul, M.A., Cannon, P.R., & Tipper, S.P. (2006). Gaze cuing and affective judgments of objects: I like what you look at. Psychonomic Bulletin & Review, 13, 1061–1066. DOI: 10.3758/BF03213926 Becchio, C., Bertone, C., & Castiello, U. (2008). How the gaze of others influences object processing. Trends in Cognitive Sciences, 12, 254–258. DOI: 10.1016/j.tics.2008.04.005 Biro, S., & Leslie, A.M. (2007). Infants’ perception of goal-directed actions: Development through cue-based bootstrapping. Developmental Science, 10, 379–398. DOI: 10.1111/j.1467-7687.2006.00544.x




Cleveland, A., Schug, M., & Striano, T. (2007). Joint attention and object learning in 5- and 7-month-old infants. Infant and Child Development, 16, 295–306. DOI: 10.1002/icd.508 Cleveland, A. & Striano, T. (2008). Televised social interaction and object learning in 14- and 18-month-old infants. Infant Behavior & Development, 31, 326–331. DOI: 10.1016/j.infbeh.2007.12.019 Csibra, G., & Gergely, G. (2009). Natural pedagogy. Trends in Cognitive Sciences, 13, 148–153. DOI: 10.1016/j.tics.2009.01.005 Csibra, G., & Gergely, G. (2011). Natural pedagogy as evolutionary adaptation. Philosophical Transactions of the Royal Society B, 366, 1149–1157. DOI: 10.1098/rstb.2010.0319 Farroni, T., Massaccesi, S., Pividori, D., & Johnson, M.H. (2004). Gaze following in newborns. Infancy, 5, 39–60. DOI: 10.1207/s15327078in0501_2 Flom, R., Deák, G.O., Phill, C.G., & Pick, A.D. (2004). Nine-month-olds’ shared visual attention as a function of gesture and object location. Infant Behavior & Development, 27, 181–194. DOI: 10.1016/j.infbeh.2003.09.007 Flom, R., Lee, K., & Muir, D. (2007). Gaze-following: Its development and significance. Mahwah, NJ: Lawrence Erlbaum Associates, Inc. Gredebäck, G., Theuring, C., Hauf, P., & Kenward, B. (2008). The microstructure of infants’ gaze as they view adult shifts in overt attention. Infancy, 13, 533–543. DOI: 10.1080/15250000802329529 Itakura, S. (2008). Development of mentalizing and communication: From viewpoint of developmental cybernetics and developmental cognitive neuroscience. IEICE Transactions on Communications, E91-B, 2109–2117. DOI: 10.1093/ietcom/e91-b.7.2109 Itakura, S., Ishida, H., Kanda, T., Shimada, Y., Ishiguro, H., & Lee, K. (2008a). How to build an intentional android: Infants’ imitation of a robot’s goal-directed actions. Infancy, 13, 519–532. DOI: 10.1080/15250000802329503 Itakura, S., Okanda, M., & Moriguchi, Y. (2008b). Discovering mind: Development of mentalizing in human children. In S. Itakura & K. Fujita (Eds.), Origins of social mind: Evolutionary and developmental view (pp.179–198). Springer. DOI: 10.1007/978-4-431-75179-3_9 Johnson, S.C., Booth, A., & O’Hearn, K. (2001). Inferring the goals of a nonhuman agent. Cognitive Development, 16, 637–656. DOI: 10.1016/S0885-2014(01)00043-0 Johnson, S., Slaughter, V., & Carey, S. (1998). Whose gaze will infants follow? The elicitation of gaze following in 12-month-olds. Developmental Science, 1, 233–238. DOI: 10.1111/1467-7687.00036 Kanda, T., Hirano, T., Eaton, D., & Ishiguro, H. (2004). Interactive robots as social partners and peer tutors for children: A field trial. Human-Computer Interaction, 19, 61–84. DOI: 10.1207/s15327051hci1901&2_4 Kanda, T., Sato, R., Saiwaki, N., & Ishiguro, H. (2007). A two-month field trial in an elementary school for long-term human-robot interaction. IEEE Transactions on Robotics, 23, 962–971. DOI: 10.1109/TRO.2007.904904 Martin, A., Onishi, K.H., & Vouloumanos, A. (2012). Understanding the abstract role of speech in communication at 12 months. Cognition, 123, 50–60. DOI: 10.1016/j.cognition.2011.12.003 Meltzoff, A.N., Brooks, R., Shon, A.P., & Rao, R.P.N. (2010). “Social” robots are psychological agents for infants: A test of gaze following. Neural Networks, 23, 966–972. DOI: 10.1016/j.neunet.2010.09.005 Meltzoff, A.N., Kuhl, P.K., Movellan, J., & Sejnowski, T.J. (2009). Foundations for a new science of learning. Science, 325, 284–288. DOI: 10.1126/science.1175626 Moses, L.J., Baldwin, D.A., Rosicky, J.G., & Tidball, G. (2001). 
Evidence for referential understanding in the emotions domain at twelve and eighteen months. Child Development, 72, 718–735. DOI: 10.1111/1467-8624.00311

 Yuko Okumura et al. Mumme, D.L., Fernald, A., & Herrera, C. (1996). Infants’ responses to facial and vocal emotional signals in a social referencing paradigm. Child Development, 67, 3219–3237. DOI: 10.2307/1131775 O’Connell, L., Poulin-Dubois, D., Demke, T., & Guay, A. (2009). Can infants use a nonhuman agent’s gaze direction to establish word-object relations? Infancy, 14, 414–438. DOI: 10.1080/15250000902994073 Okumura, Y., Kanakogi, Y., Kanda, T., Ishiguro, H., & Itakura, S. (2013a). The power of human gaze on infant learning. Cognition, 128, 127–133. DOI: 10.1016/j.cognition.2013.03.011 Okumura, Y., Kanakogi, Y., Kanda, T., Ishiguro, H., & Itakura, S. (2013b). Infants understand the referential nature of human gaze but not robot gaze. Journal of Experimental Child Psychology, 116, 86–95. DOI: 10.1016/j.jecp.2013.02.007 Parise, E., Cleveland, A., Costabile, A., & Striano, T. (2007). Influence of vocal cues on learning about objects in joint attention contexts. Infant Behavior & Development, 30, 380–384. DOI: 10.1016/j.infbeh.2006.10.006 Reid, V.M., & Striano, T. (2005). Adult gaze influences infant attention and object processing: Implications for cognitive neuroscience. European Journal of Neuroscience, 21, 1763–1766. DOI: 10.1111/j.1460-9568.2005.03986.x Senju, A., & Csibra, G. (2008). Gaze following in human infants depends on communicative signals. Current Biology, 18, 668–671. DOI: 10.1016/j.cub.2008.03.059 Tanaka, F., Cicourel, A., & Movellan, J.R. (2007). Socialization between toddlers and robots at an early childhood education center. Proceedings of the National Academy of Sciences, USA, 104, 17954–17958. DOI: 10.1073/pnas.0707769104 Theuring, C., Gredebäck, G., & Hauf, P. (2007). Object processing during a joint gaze following task. European Journal of Developmental Psychology, 4, 65–79. DOI: 10.1080/17405620601051246 Tomasello, M. (1999). The cultural origins of human cognition. Cambridge, MA: Harvard University Press. Topál, J., Gergely, G., Miklósi, Á., Erdohegyi, Á., & Csibra, G. (2008). Infants’ perseverative search errors are induced by pragmatic misinterpretation. Science, 321, 1831–1834. DOI: 10.1126/science.1161437 Vaish, A., & Striano, T. (2004). Is visual reference necessary? Contributions of facial versus vocal cues in 12-month-olds’ social referencing behavior. Developmental Science, 7, 261–269. DOI: 10.1111/j.1467-7687.2004.00344.x Vouloumanos, A., Onishi, K.H., & Pogue, A. (2012). Twelve-month-old infants recognize that speech can communicate unobservable intentions. Proceedings of the National Academy of Sciences, USA, 109, 12933–12937. DOI: 10.1073/pnas.1121057109 Werry, I., Dautenhahn, K., & Harwin, W. (2001). Investigating a robot as a therapy partner for children with autism. In Proceedings of the 6th European Conference for the Advancement of Assistive Technology. Yoon, J.M.D., Johnson, M.H., & Csibra, G. (2008). Communication-induced memory biases in preverbal infants. Proceedings of the National Academy of Sciences, USA, 105, 13690–13695. DOI: 10.1073/pnas.0804388105

Interactions between a quiz robot and multiple participants: Focusing on speech, gaze and bodily conduct in Japanese and English speakers

Akiko Yamazaki1, Keiichi Yamazaki4, Keiko Ikeda2, Matthew Burdelski3, Mihoko Fukushima4, Tomoyuki Suzuki4, Miyuki Kurihara4, Yoshinori Kuno4 & Yoshinori Kobayashi4

1Tokyo University of Technology / 2Kansai University / 3Osaka University / 4Saitama University

This paper reports on a quiz robot experiment in which we explore similarities and differences in human participant speech, gaze, and bodily conduct in responding to a robot’s speech, gaze, and bodily conduct across two languages. Our experiment involved three-person groups of Japanese and English-speaking participants who stood facing the robot and a projection screen that displayed pictures related to the robot’s questions. The robot was programmed so that its speech was coordinated with its gaze, body position, and gestures in relation to transition relevance places (TRPs), key words, and deictic words and expressions (e.g. this, this picture) in both languages. Contrary to findings on human interaction, we found that the frequency of English speakers’ head nodding was higher than that of Japanese speakers in human-robot interaction (HRI). Our findings suggest that the coordination of the robot’s verbal and non-verbal actions surrounding TRPs, key words, and deictic words and expressions is important for facilitating HRI irrespective of participants’ native language. Keywords:  coordination of verbal and non-verbal actions; robot gaze comparison between English and Japanese; human-robot interaction (HRI); transition relevance place (TRP); conversation analysis

1.  Introduction

“Were an ethologist from Mars to take a preliminary look at the dominant animal on this planet, he would be immediately struck by how much of its behavior, within a rather extraordinary array of situations and settings (from camps in the tropical rain forest to meetings in Manhattan skyscrapers), was organized through face-to-face interaction with other members of its species” (M. Goodwin 1990: p. 1).


The study of face-to-face interaction has long been a central concern across various social scientific disciplines. More recently it has become an important focus within fields related to technology and computer-mediated discourse (e.g. Heath & Luff 2000; Suchman 2006). This is especially true of research adopting ethnomethodology (Garfinkel 1967) and conversation analysis (hereafter abbreviated as CA) (Sacks, Schegloff & Jefferson 1974) that examines humanrobot interaction (HRI) (e.g. Pitsch et al. 2013; A. Yamazaki et al. 2010 and A. Yamazaki et al. 2008). One of the central issues in this area is multicultural and inter-cultural patterns in human-robot and human-virtual agents. Many of these corpora are of human-human interaction, but were collected for the purposes of HRI or human-agent interaction, such as the C ­ UBE-G corpus (e.g. Nakano & Rehm 2009), corpora of the natural language dialogue group of USC (e.g. Traum et al. 2012), and the CMU cross-cultural receptionist corpus (e.g. Makatchev, Simmons & Sakr 2012). These works also focus on verbal and nonverbal behavior of human interactions in order to develop virtual agents. While we have a similar interest in examining and developing technology that can be employed in a real-world environment across various cross-cultural and crosslinguistic settings, our research project utilizes a robot that is able to verbalize and display its bodily orientation towards objects in the immediate vicinity and multiparty participants in order to facilitate participants’ understanding of, and engagement in, the robot’s talk. Thus, a key feature of the present research is not only the use of speech and non-verbal language, but also the coordination of these resources in order to facilitate visitors’ engagement in human-robot interaction (A. Yamazaki et al. 2010). In the present paper we focus on how non-verbal actions (e.g. gaze, torso, gesture) are related to question-response sequences in multi-party HRI. A rich literature on human social interaction can be found in studies on CA. A main focus of these studies is to identify the underlying social organization in constructing sequences of interaction. In particular, a growing number of studies examines question-response sequences. For example, as Stivers and Rossano (2010) point out, a question typically elicits a response from the recipient of the question turn. Thus asking a question is a technique for selecting a next speaker (Sacks, Schegloff & Jefferson 1974). Since a question forms the first part of an a­ djacency pair, it calls for a specific type of second pair part (cf. Sacks 1987; ­Schegloff 2007). Rossano (2013) points out that in all question-­response sequences there is a transition relevance place (TRP). A TRP is defined as: “The first possible completion of a first such unit constitutes an initial transition-­ relevance place” (Sacks, Schegloff & Jefferson 1974: p. 703) and it is a place where turn transfer or speaker change may occur. At a TRP, a hearer can p ­ rovide




a response to the speaker, but may not necessarily take the turn (e.g. verbal ­continuers, head nods). When somebody asks a question, a hearer has a normative obligation to answer (Rossano 2013). Stivers and her colleagues (2010) examined question-response sequences of naturally occurring dyadic interactions among ten different languages. While there are some variations in the ways speakers produce question formats, what is common among them is that the speaker typically gazes towards the addressee(s) when asking a question to a multiparty audience (e.g. Hayashi 2010). A number of researchers have studied gaze in human interaction in various settings and sequential contexts. In particular, Kendon (1967) points out that gaze patterns were systematically related to the particular feature of talk. C. Goodwin (1981) clarified that hearers display their engagement towards the speaker by using gaze. In addition, Bavelas and his colleagues (2002) describe how mutual gaze plays a role in speaker-hearer interactions. A recent study of dyadic interactions in cross-cultural settings reveals similarities and differences of gaze behaviors among different language and cultures (Rossano, Levinson & Brown 2009). While gaze has been given much thought and acknowledged as an important resource in human-human interaction, a question still remains as to how gaze is deployed and can be employed in multiparty HRI. Within the current research on gaze behavior in HRI (e.g. Knight & ­Simmons 2012; Mutlu et al. 2009) there has not yet been discussion on multiparty ­question-response sequences in cross-cultural settings. The present paper begins to fill this gap by comparing human-robot interaction in Japanese and English within a quiz robot experiment. The use of a robot allows us to ask the same questions employing the same bodily gestures, and to compare the responses of participants under the same conditions. Utilizing videotaped recordings, our analysis involves detailed transcriptions of human-robot interaction and a quantitative summary of the kinds of participants’ responses. We show that a main difference between the two language groups is that the frequency of nodding of English speakers is significantly higher than that of Japanese speakers. This is contrary to research on human-human interaction that argues that Japanese speakers nod more often than English speakers (e.g. Maynard 1990). Participants show their engagement in interaction with a robot when the robot’s utterance and bodily behavior such as gaze are coordinated appropriately. This paper is organized in the following manner. In Section 2, we discuss the background of this study. In Section 3, we explain the setup for a quiz robot experiment. In Section 4, we offer initial analysis. In Section 5, we provide detailed analysis of participants’ responses in regard to the robot’s gaze and talk. Discussion and concluding remarks will follow in Section 6.


2.  Background of this study

2.1  Cross-cultural communicative differences: Word order

A rich literature on multicultural and inter-cultural variations in interaction can be found in the literature on ‘interactional linguistics’ (e.g. Tanaka 1999; Iwasaki 2009), in which scholars with training in CA and related fields tackle cultural differences from a cross-linguistic perspective. They do not follow the traditional syntactical approach to defining “differences” among languages used in interaction. Rather, they reveal an interactional word order associated with a specific social interactional activity. For instance, one study focused on a cross-cultural comparison between Japanese and English (Fox, Hayashi & Jasperson 1996). Differences between interactions involving these two languages are particularly interesting due to their respective word order. There is a distinctive difference in ‘projection’ in regard to the timing of completion of a current turn-constructional unit (e.g. a sentential unit), which is defined as ‘projectability’ in CA. In regard to question formats, which include interrogatives, declaratives and tag-questions (Stivers 2010), English and Japanese word order exhibits differences and similarities in interaction. For interrogatives, the sentence structure is nearly the opposite between Japanese and English. As Tanaka (1999: p. 103) states, “[i]n a crude sense, the structures of Japanese and English can be regarded as polar opposites. This is reflected in differences in participant orientations to turn-construction and projection.” For declaratives, there are no dramatic differences between the two languages in terms of word order. However, the placement of the question word in the sentence is different. For tag-questions, the sentence structure is similar between Japanese and English as a question-format ‘tagged on’ to the end of a statement.

2.2  Coordination of verbal and non-verbal actions and questioning strategy

We have conducted ethnographic research in several museums and exhibitions in Japan and the United States in order to explore ways that expert human guides engage visitors in the exhibits. We analyzed video recordings by adopting CA and have applied those findings to a guide robot that can be employed in museums. Based on the following findings, we employed two central design principles in the robot for the current project. First, as the coordination of verbal actions and non-verbal actions is an essential part of human interaction (C. Goodwin 2000), we observed that guides often turn their heads toward the visitors when they mention a key word in their talk, and point towards a display when they use a deictic word or expression (e.g. this painting). We also found that visitors display their engagement by nodding and




shifting their gaze, in particular at key words, deictic words and expressions, and sentence endings (one place of a TRP). Adopting these findings we programmed the coordination of verbal and non-verbal actions into a museum guide robot, and found that in dyadic robot guide-human interactions participants frequently nodded and shifted their gaze towards the object (A. Yamazaki et al. 2010). Second, we observed that human guides often use a question-answer strategy in engaging visitors. In particular, guides use questions aimed at engaging visitors while coordinating their gaze (K. Yamazaki et al. 2009; Diginfonews 2010). In particular, guides ask a pre-question (first question) regarding an exhibit, and at the same time monitor the visitors’ responses (during a pause) to check whether visitors are displaying engagement (e.g. nodding). Then the guide asks a main question (second question) towards a particular visitor who displays engagement. The first question serves as a “clue” to the main question. We programmed this questioning strategy into the robot’s talk, and the results showed that the robot could select an appropriate visitor who has displayed “knowledge” in order to provide an answer to the main question (A. Yamazaki et al. 2012). In relation to the second finding, we found that guides use a combination of three types of pre-question and main question formats towards multiple visitors: (1) Guide begins with a pre-question as an interrogative, and then asks a main question, (2) guide begins with a pre-question as a tag-question, and then asks a main-question, and (3) guide begins with a declarative sentence without telling the name or attribution of the referent and then asks a question regarding the referent. We implemented these three types of combinations of pre-question and main question in the current quiz robot, as we are interested in how coordination of the robot’s question (verbal) and gaze (non-verbal actions) is effective in multiparty interaction in English and Japanese.
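To make the first design principle more concrete, the sketch below shows one way a quiz question could be scripted so that gaze and gesture cues are tied to key words, deictic expressions, and the sentence-final TRP. It is a schematic Python illustration, not the actual Robovie-R control software: the class names, cue targets, and the ‘<sentence-end>’ marker are illustrative assumptions, and the question wording is borrowed from the castle example (Q4) described in Section 3.2.

# Schematic sketch (not the actual robot control code) of pairing a question
# script with gaze/gesture cues at key words, deictic expressions, and the
# TRP at the end of a sentence. All names here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Cue:
    trigger: str        # word, expression, or "<sentence-end>" (the TRP)
    gaze: str           # e.g. "screen", "participant_2", "all_participants"
    gesture: str = ""   # e.g. "point_at_screen", "open_hands"

@dataclass
class QuestionScript:
    pre_question: str                      # first question, used as a "clue"
    main_question: str                     # second question, put to one selected visitor
    cues: List[Cue] = field(default_factory=list)

q4 = QuestionScript(
    pre_question="Do you know this castle?",
    main_question="Do you know the name of the famous person in this photo "
                  "who had this castle built?",
    cues=[
        Cue("this castle", gaze="screen", gesture="point_at_screen"),  # deictic expression
        Cue("famous person", gaze="all_participants"),                 # key word
        Cue("<sentence-end>", gaze="participant_2"),                   # TRP: gaze to a hearer
    ],
)

def cue_for(trigger: str, script: QuestionScript) -> Optional[Cue]:
    """Return the gaze/gesture cue attached to a trigger, if any."""
    return next((c for c in script.cues if c.trigger == trigger), None)

print(cue_for("this castle", q4))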

3.  The present experiment: A quiz robot in Japanese and English

In this experiment, we implemented the robot’s movement based on the ethnography described above in regard to verbal actions in both English and Japanese.

3.1  Robot system

We used Robovie-R ver.3. The experimental system was designed to provide explanations to three participants. The system has three pan-tilt-zoom (PTZ) cameras and three PCs, which are each dedicated to processing images from one PTZ camera observing one participant. Figure 1 presents an overview of the robot system. In addition to the three cameras outside of the body of the robot and three PCs for


Figure 1.  Quiz robot system

image processing, we used a laser range sensor (Hokuyo UTM-30LX) and another PC for processing the range sensor data and for integrating the sensor processing results. The system detects and tracks the bodies of multiple visitors in the range sensor data. A PTZ camera is assigned to each detected body. The system is controlled to enable it to turn toward the observed bodies of participants. For detecting and tracking a face and computing its direction, we used Face API (http:// www.seeingmachines.com/product/faceapi/). The pan, tilt, and zoom of each PTZ camera are automatically adjusted based on the face detection results, so that the face remains focused in the center of the image. The system can locate a human body with a margin of error of 6 cm for position and 6 degrees for orientation. It can measure 3D face direction within 3 degrees of error at 30 frames per second. The head orientation is measured around three coordinate axes (roll, pitch and yaw) with the origin at the center of the head. From the face direction results, the system can recognize the following behaviors of participants: nodding, shaking, cocking and tilting the head, and gazing away, and it can choose an appropriate answerer based on such recognition results (A. Yamazaki et al. 2012). However, in the experiments described later, we did not use these autonomous functions so as to prevent potential recognition errors from influencing human behaviors. The system detected and tracked the participants’ positions using the laser range sensor to turn its head precisely toward them. A human experimenter, who was seated in back of the screen and could see the participants’ faces, controlled the robot to point to and ask one of three participants to answer the question. In other words, we adopted a WOZ (Wizard of Oz) method




for selecting an answerer by employing a human experimenter who controlled the robot’s body movement. Following the findings of our ethnographic research on expert human guides, we programmed the robot to move its gaze by moving its head (its eyes can be moved as well) and its arms/hands in relation to its speech, using canned phrases (a built-in text-to-speech system can be used as well). The robot can open and close its hands and move its forefingers to point towards a target projected on the screen, similar to a human guide. The robot speaks English towards English-speaking participants and Japanese towards Japanese-speaking participants. A rough outline of the sequence of speech and movement is as follows:

(1) Before the robot begins to talk, it looks towards the participants.
(2) When the robot says the first word, it moves its gaze and hands to a picture projected on the projection screen in back.
(3) During its speech, the robot moves its gaze and hand when it utters deictic words and expressions such as ‘here’ and ‘this picture’.
(4) At the end of each sentence, the robot moves its gaze towards the three participants one at a time, or it looks at a particular participant depending on the length of the sentence, as expert human guides do.
(5) When the robot asks the main question, it moves its gaze and hand towards a particular participant (selected by the experimenter).
(6) If the participant gives the correct answer to the main question, the robot makes a clapping gesture (an experimenter operates a PC to make a sound of clapping hands) and repeats the answer to the question. When the participant gives an incorrect answer, the robot says, “That’s incorrect” (Chigaimasu, in Japanese) and then produces the correct answer.

3.2  Experimental setup

The following are the details of the experiment we conducted in English and Japanese with the quiz robot. As described above, each group consisted of three participants.

1. Experiment 1 (in English): Kansai University (Osaka, Japan), 15 June 2012; 21 participants (7 groups): 18 male and 3 female native speakers of English (mainly Americans, New Zealanders, and Australians). All participants are international undergraduate/graduate students or researchers who either are presently studying Japanese language and culture or have in-depth knowledge of Japan.
2. Experiment 2 (in Japanese): Saitama University (Saitama, Japan, near Tokyo), 4 July 2012; 51 participants (27 groups): 31 male and 20 female native speakers of Japanese. All participants were undergraduate students of Saitama University.
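As a concrete illustration of how the head-tracking output described in Section 3.1 could be turned into the behavior labels mentioned there (nodding, shaking, gazing away), the sketch below thresholds a short window of head pitch and yaw angles. It is a simplified stand-in, not the system’s actual recognition code; the thresholds, window length, and function name are illustrative assumptions.

# Simplified sketch (not the system's actual code) of labeling head gestures
# from a 30 fps stream of head orientation angles (degrees). A nod shows up
# as a pitch swing, a head shake as a yaw swing, and gazing away as a
# sustained yaw offset. The thresholds below are illustrative, not calibrated.
from typing import Sequence

def label_head_gesture(pitch: Sequence[float], yaw: Sequence[float],
                       swing_deg: float = 10.0, away_deg: float = 25.0) -> str:
    """Classify a roughly one-second window (about 30 samples) of head angles."""
    pitch_swing = max(pitch) - min(pitch)
    yaw_swing = max(yaw) - min(yaw)
    mean_yaw = sum(yaw) / len(yaw)

    if abs(mean_yaw) > away_deg:
        return "gazing away"
    if pitch_swing > swing_deg and pitch_swing >= yaw_swing:
        return "nodding"
    if yaw_swing > swing_deg:
        return "shaking"
    return "still"

# Example: a synthetic nod (pitch dips and returns) while facing the robot.
pitch = [0, -4, -9, -13, -9, -4, 0, -3, -8, -12, -8, -3, 0]
yaw = [1, 1, 2, 1, 0, 1, 1, 2, 1, 1, 0, 1, 1]
print(label_head_gesture(pitch, yaw))  # -> "nodding"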


During the experiment, we used three video cameras. The position of the robot (R), the participants (1, 2 and 3), and the three video cameras (A, B, C) are shown (Figure 2).

Figure 2.  Bird’s eye view of the experimental setup (display image size: 180 cm width × 120 cm height).

Before the experiments, our staff asked the participants to answer the robot’s quiz questions. Then the robot asked each group six questions related to six different pictures projected on a screen in back of the robot. The content of these questions (Q1–Q6) was the following:

Q1: Name of the war portrayed in Picasso’s painting ‘Guernica’ (Screen: Guernica painting) (Answer: Spanish civil war).
Q2: Name of a Japanese puppet play (Screen: picture of a famous playwright and a person operating a puppet) (Answer: Bunraku).
Q3: Name of the prefecture in which the city of Kobe is located (Screen: picture of Kobe port and a Christmas light show called ‘Luminarie’) (Answer: Hyogo prefecture) (Figure 3).
Q4: Name (either first, last, or both) of the lord of Osaka Castle (Screen: picture of Osaka castle and Lord Hideyoshi Toyotomi) (Answer: Hideyoshi Toyotomi) (Figure 4).
Q5-a (English speakers only): Full name of a Japanese baseball player in the American major leagues (Screen: photo of Ichiro Suzuki) (Answer: Ichiro Suzuki).
Q5-b (Japanese speakers only): Full name of the chief cabinet secretary in former Japanese Prime Minister Kan’s cabinet (Screen: photo of Cabinet Secretary Yukio Edano) (Answer: Yukio Edano).
Q6: Name of the former governor of California who was the husband of the niece of a former president of the United States (Screen: map of California, John F. Kennedy) (Answer: Arnold Schwarzenegger) (Figure 5).

Figure 3.  Image projected on screen at Q3 (the left is Kobe port and the right is “Luminarie”)

Figure 4.  Image projected on screen at Q4 (the left is Osaka castle and the right is Hideyoshi Toyotomi)

As an image is projected on the screen (as in Figures 3, 4 and 5), the robot poses a pre-question to the visitors (e.g. “Do you know this castle?”), and then provides an explanation of the image before asking a main question (e.g. “Do you know the name of the famous person in this photo who had this castle built?”).


Figure 5.  Image projected on screen at Q6 (President Kennedy is on the left and map of the United States with California highlighted is on the right)

Due to differences in the topics of Q5 in English and Japanese and the low frequency of correct answers by participants for Q1 and Q2, here we do not analyze these three questions in detail. Rather we focus on Q3, Q4, and Q6. We will also report on some of the results of a questionnaire we asked participants to fill out after the experiment on whether they knew the answers to the pre-question and main question of Q3, Q4 and Q6.

3.3  Experimental stimuli

In what follows we explain the three question types in English and Japanese in relation to robot gaze and discuss similarities and differences in terms of word order.

1. Declarative question: Q3

For each utterance there are three lines of transcript. Rh stands for the robot’s hand motion and Rg represents the robot’s head motion. The third line, R, represents the robot’s speech. Transcription symbols are as follows: f = robot facing forward towards the participants; a comma-like mark (,) represents the robot moving its hand/head; ‘i’ indicates the robot in a still position facing towards the screen; d = robot’s hand/head down; ‘1’ indicates the robot in a still position facing towards Participant 1, who stands furthest to the right of the three participants in regard to the robot position; ‘2’ means the robot is in a still position facing towards Participant 2, who stands in between the other two participants; ‘3’ represents the robot in a still position facing towards Participant 3, who stands furthest to the left of the three




participants; ‘o’ indicates the robot spreading its hands and arms outward; ‘m’ represents the robot moving its hands. We adopt the Jefferson (1984) transcription notation for describing the robot’s speech. The transcription symbols are as follows: > < represents the portion of an utterance delivered at a pace noticeably quicker than the surrounding talk, and < > noticeably slower. (.) represents a brief pause. ↑ marks a sharp rise in pitch register. ↓ marks a noticeable fall in pitch register. _ Underlining represents vocalic emphasis.

Q3 in English

01. Rh: f,,,,,iiiiiiiiiiiiiiiiiiiiiiiiiiiii
02. Rg: f,,,,,iiiiiiiiiiiiiiiiiii,,,,,fffff
03. R  : This well-known(.)port in Japan has
04. Rh: iiiiiiiiiiiiiiiiiiii,,,,,,ooooommmmmmmm
05. Rg: ffffffffffffffffffff,,,,,,111111,,,,,,,
06. R  : some famous tourist sites↓ (.)such as
07. Rh: mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
08. Rg: ,,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3,,,
09. R  : a;nd Arima (.)>hot springs↑man on the left
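Because the three transcript lines are aligned character by character, each column can be read as one time slice of hand state, head state, and speech. The short sketch below is an added illustration (not part of the original analysis) of how such a segment can be read column by column; the Rh and Rg strings are shortened stand-ins rather than exact copies of the lines above.

# Illustration of how the aligned three-line transcript encodes synchronization:
# column k of the Rh and Rg tracks gives the robot's hand and head state while
# the k-th character of the speech line is produced. The Rh/Rg strings below
# are shortened stand-ins, not the exact transcript lines.
rh = "f" + "," * 5 + "i" * 30                      # hand track
rg = "f" + "," * 5 + "i" * 20 + "," * 5 + "f" * 5  # head (gaze) track
speech = "This well-known(.)port in Japan has "    # speech line (36 characters)

assert len(rh) == len(rg) == len(speech), "tracks must be column-aligned"

previous = rg[0]
for column, head in enumerate(rg):
    if head != previous:                           # report head-state changes
        said_so_far = speech[:column].strip()
        print(f"column {column}: head state -> '{head}' after saying '{said_so_far}'")
        previous = head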