Springer Proceedings in Advanced Robotics 14 Series Editors: Bruno Siciliano · Oussama Khatib
Marco Morales Lydia Tapia Gildardo Sánchez-Ante Seth Hutchinson Editors
Algorithmic Foundations of Robotics XIII Proceedings of the 13th Workshop on the Algorithmic Foundations of Robotics
Springer Proceedings in Advanced Robotics 14

Series Editors
Bruno Siciliano, Dipartimento di Ingegneria Elettrica e Tecnologie dell'Informazione, Università degli Studi di Napoli Federico II, Napoli, Italy
Oussama Khatib, Robotics Laboratory, Department of Computer Science, Stanford University, Stanford, CA, USA
Editorial Advisory Board
Gianluca Antonelli, Department of Electrical and Information Engineering, University of Cassino and Southern Lazio, Cassino, Italy
Dieter Fox, Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA
Kensuke Harada, Engineering Science, Osaka University, Toyonaka, Japan
M. Ani Hsieh, GRASP Laboratory, University of Pennsylvania, Philadelphia, PA, USA
Torsten Kröger, Karlsruhe Institute of Technology, Karlsruhe, Germany
Dana Kulic, University of Waterloo, Waterloo, ON, Canada
Jaeheung Park, Department of Transdisciplinary Studies, Seoul National University, Suwon, Korea (Republic of)
The Springer Proceedings in Advanced Robotics (SPAR) publishes new developments and advances in the fields of robotics research, rapidly and informally but with a high quality. The intent is to cover all the technical contents, applications, and multidisciplinary aspects of robotics, embedded in the fields of Mechanical Engineering, Computer Science, Electrical Engineering, Mechatronics, Control, and Life Sciences, as well as the methodologies behind them. The publications within the “Springer Proceedings in Advanced Robotics” are primarily proceedings and post-proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. Also considered for publication are edited monographs, contributed volumes and lecture notes of exceptionally high quality and interest. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results.
More information about this series at http://www.springer.com/series/15556
Marco Morales · Lydia Tapia · Gildardo Sánchez-Ante · Seth Hutchinson
Editors

Algorithmic Foundations of Robotics XIII
Proceedings of the 13th Workshop on the Algorithmic Foundations of Robotics
Editors Marco Morales Departamento de Sistemas Digitales Instituto Tecnológico Autónomo de México México, México
Lydia Tapia Department of Computer Science University of New Mexico Albuquerque, NM, USA
Gildardo Sánchez-Ante Universidad Politécnica de Yucatán Yucatán, México
Seth Hutchinson Georgia Tech Atlanta, GA, USA
ISSN 2511-1256 ISSN 2511-1264 (electronic) Springer Proceedings in Advanced Robotics ISBN 978-3-030-44050-3 ISBN 978-3-030-44051-0 (eBook) https://doi.org/10.1007/978-3-030-44051-0 © Springer Nature Switzerland AG 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword
At the dawn of the century's third decade, robotics is reaching an elevated level of maturity and continues to benefit from the advances and innovations in its enabling technologies. These are all contributing to an unprecedented effort to bring robots into human environments, in hospitals and homes, factories and schools, and in the field, where robots fight fires, make goods and products, pick fruit, water farmland, and save time and lives. Robots today hold the promise of making a considerable impact in a wide range of real-world applications, from industrial manufacturing to health care, transportation, and exploration of the deep space and sea. Tomorrow, robots will become pervasive and touch upon many aspects of modern life. The Springer Tracts in Advanced Robotics (STAR) was launched in 2002 with the goal of bringing to the research community the latest advances in the robotics field based on their significance and quality. Over the past fifteen years, the STAR series has featured the publication of both monographs and edited collections. Among the latter, the proceedings of thematic symposia devoted to excellence in robotics research, such as ISRR, ISER, FSR, and WAFR, have been regularly included in STAR. The expansion of our field as well as the emergence of new research areas has motivated us to enlarge the pool of proceedings in the STAR series in the past few years. This has ultimately led to launching a sister series in parallel to STAR. The Springer Proceedings in Advanced Robotics (SPAR) is dedicated to the timely dissemination of the latest research results presented in selected symposia and workshops. This volume of the SPAR series brings the proceedings of the thirteenth edition of the Workshop on the Algorithmic Foundations of Robotics (WAFR). This edition was hosted by Universidad Politécnica de Yucatán in Mérida, México, from December 9 to 11, 2018. The volume edited by Marco Morales, Lydia Tapia, Gildardo Sánchez-Ante, and Seth Hutchinson is a collection of 54 contributions organized in 11 parts, highlighting cutting-edge algorithmic research in planning, learning, navigation, control,
manipulation, optimality, completeness, and complexity. Applications span a wide variety of areas including multi-robot systems, perception, and contact manipulation. Rich in topics and authoritative contributors, WAFR culminates with this unique reference on the current developments and new directions in the field of algorithmic foundations. A very fine addition to the series! January 2020
Bruno Siciliano
Oussama Khatib
SPAR Editors
Preface
Since its inception in 1994, the Workshop on the Algorithmic Foundations of Robotics (WAFR) has distinguished itself as the premier venue for presenting and discussing cutting-edge research on algorithmic problems in robotics. The thirteenth WAFR was held on December 9–11, 2018, at Universidad Politécnica de Yucatán in the city of Mérida, the capital city of the state of Yucatán in the southeast of México. We were delighted by the strong participation in WAFR 2018, with 160 conference attendees and 15 virtual undergraduate participants. The program of WAFR 2018 was driven by the Program Committee, 69 experts in algorithmic robotics, and an additional 51 outside reviewers. The 95 strong submissions went through a competitive review process in which the majority of papers received at least three independent reviews. Each paper's review was moderated by the co-chairs together with the PC members. The result of this process was the selection of 54 papers for presentation at the workshop. Since discussion is of primary importance at WAFR, the papers were organized by subtopic into 11 sessions, where a chair and two respondents, all selected from the Program Committee, were in charge of leading the discussion. To promote further discussion, one of the most popular features of WAFR was employed: the inclusion of a "Dirty Laundry" slide discussing the limitations or weaknesses of the presented work. We also had one session focused on Open Problems with lively participation. This volume of Springer Proceedings in Advanced Robotics contains the papers presented at WAFR 2018. These contributions highlight cutting-edge algorithmic research in planning, learning, navigation, control, manipulation, optimality, completeness, and complexity. Applications span a wide variety of areas including multi-robot systems, perception, and contact manipulation. About two-fifths of the accepted papers have been forwarded for further review for dedicated special issues of the International Journal of Robotics Research and IEEE Transactions on Automation Science and Engineering.
Our invited speakers highlighted the best in academic and industrial research. First, Steve LaValle (University of Oulu) spoke on "The Path to (Human) Perception Engineering." Next, Enrique Sucar (Instituto Nacional de Astrofísica, Óptica y Electrónica) gave a talk on "Planning under Uncertainty in Robotics: From MDPs to Causal Models." Following this, Vikas Sindhwani (Google Brain Robotics) discussed "The Learning Continuum for Robotics: from Classical Optimal Control to Blackbox Policy Search." Afterward, Leslie Kaelbling (Massachusetts Institute of Technology) discussed her views on "Doing for our Robots what Evolution did for Us." Finally, Elon Rimon (Technion - Israel Institute of Technology) provided "Perspectives on Minimalistic Robot Hand Design and a New Class of Caging-to-Grasping Algorithms." A special feature of WAFR 2018 was a Google-funded broadening-participation-in-robotics workshop, Becoming a Robot Guru. Scholarships were awarded to selected graduate students, the WAFR Robot Guru Fellows, who mentored teams of undergraduates (some in person and some virtual) also invited to participate in WAFR. Additionally, speakers and participants from WAFR graciously spent the day with these students and addressed topics critical for success in applying for and attending graduate school, including: the role of graduate studies in building a career in robotics research, overcoming the interdisciplinary nature of robotics, and strategies to overcome challenges. Nancy Amato (University of Illinois Urbana-Champaign) and Matt Mason (Carnegie Mellon University) were keynote speakers, and Neil Dantam (Colorado School of Mines), Michael Otte (University of Maryland, College Park), Jason O'Kane (University of South Carolina), Vikas Sindhwani (Google Brain Robotics), and the WAFR Co-Chairs served as invited speakers and faculty mentors. A highlight of Becoming a Robot Guru was the live-streaming of the WAFR technical program to the registered virtual undergraduate participants. This virtual component of WAFR was very successful, as indicated by a statistically significant increase in interest and excitement about graduate education between the pre- and post-workshop evaluations performed by Google. Additionally, students from Robot Guru pursued robotics research through Research Experiences for Undergraduates (REU) programs. We would be remiss not to clearly acknowledge the role of UPY in hosting WAFR 2018. From the faculty and staff to the student volunteers, the people at UPY were always ready to lend a helping hand or find a solution to a technical challenge. We even had our own WAFR 2018 unofficial mascot, GUPY (Gato de —Cat of— UPY)! During WAFR 2018, UPY made it possible to discuss technical challenges in the jungle and to network by the Mayan pyramids. WAFR 2018 was also a success thanks to the authors for the high-quality work they submitted, to the PC members and auxiliary reviewers for all of their careful evaluations, to the session chairs and respondents for the insightful discussions they led, and to the participants. Also, thanks for the support of the Parasol Laboratory from the Department of Computer Science and Engineering of Texas A&M
University, of the Department of Computer Science of the University of New Mexico, and of the Department of Digital Systems of Instituto Tecnológico Autónomo de México. December 2018
Marco Morales Lydia Tapia Gildardo Sánchez-Ante Seth Hutchinson
Organization
Co-chairs
Marco Morales, Instituto Tecnológico Autónomo de México
Lydia Tapia, University of New Mexico
Gildardo Sánchez-Ante, Universidad Politécnica de Yucatán
Seth Hutchinson, Georgia Institute of Technology
Steering Committee
Nancy Amato, Texas A&M University and University of Illinois at Urbana-Champaign
Oliver Brock, Technische Universität Berlin
Ken Goldberg, UC Berkeley
Dan Halperin, Tel Aviv University
Seth Hutchinson, Georgia Institute of Technology
David Hsu, National University of Singapore
Lydia Kavraki, Rice University
Daniela Rus, Massachusetts Institute of Technology
Frank van der Stappen, Utrecht University
Advisory Committee
Jean-Claude Latombe, Stanford University
Jean-Paul Laumond, LAAS-CNRS
Matt Mason, Carnegie Mellon University
Program Committee
Srinivas Akella, University of North Carolina at Charlotte
Ron Alterovitz, University of North Carolina at Chapel Hill
Nancy Amato, Texas A&M University and University of Illinois at Urbana-Champaign
Devin Balkcom, Dartmouth College
Kostas Bekris, Rutgers University
Dmitry Berenson, University of Michigan, Ann Arbor
Timothy Bretl, University of Illinois at Urbana-Champaign
Nilanjan Chakraborty, Stony Brook University
Suman Chakravorty, Texas A&M University
Juan Cortes, LAAS-CNRS
Neil T. Dantam, Colorado School of Mines
Jory Denny, University of Richmond
Mehmet Dogar, University of Leeds
Chinwe Ekenna, University at Albany
Brendan Englot, Stevens Institute of Technology
Esra Erdem, Sabanci University
Claudia Esteves, Universidad de Guanajuato
Aleksandra Faust, Google Brain Robotics
Anthony Francis, Google Brain Robotics
Emilio Frazzoli, ETH Zürich, nuTonomy
Stephen Guy, University of Minnesota
Kris Hauser, Duke University
Jean-Bernard Hayet, Centro de Investigación en Matemáticas
Geoffrey Hollinger, Oregon State University
David Hsu, National University of Singapore
Seth Hutchinson, Georgia Institute of Technology
Volkan Isler, University of Minnesota
Leslie Kaelbling, Massachusetts Institute of Technology
Marcelo Kallmann, University of California, Merced
Sertac Karaman, Massachusetts Institute of Technology
Lydia Kavraki, Rice University
Ross Knepper, Cornell University
Sven Koenig, University of Southern California
Hadas Kress-Gazit, Cornell University
Torsten Kroeger, Karlsruhe Institute of Technology (KIT)
Hanna Kurniawati, University of Queensland
Jyh-Ming Lien, George Mason University
Ming Lin, University of Maryland, College Park
Tomas Lozano-Perez, Massachusetts Institute of Technology
Dinesh Manocha, University of Maryland, College Park
Nicolas Mansard, LAAS-CNRS
Jose Martínez-Carranza, Instituto Nacional de Astrofísica, Óptica y Electrónica
Troy McMahon, University of Queensland
Mark Moll, Rice University
Marco Morales, Instituto Tecnológico Autónomo de México
Rafael Murrieta-Cid, Centro de Investigación en Matemáticas
Jason O'Kane, University of South Carolina
Songhwai Oh, Seoul National University
Jia Pan, City University of Hong Kong
Marco Pavone, Stanford University
Florian Pokorny, KTH Royal Institute of Technology
Elon Rimon, Technion - Israel Institute of Technology
Sam Rodriguez, Texas Wesleyan University
Alberto Rodriguez, Massachusetts Institute of Technology
José Guadalupe Romero, Instituto Tecnológico Autónomo de México
Nicholas Roy, Massachusetts Institute of Technology
Dylan Shell, Texas A&M University
Stephen L. Smith, University of Waterloo
Stephen F. Smith, Carnegie Mellon University
Dezhen Song, Texas A&M University
Enrique Sucar, Instituto Nacional de Astrofísica, Óptica y Electrónica
Subhash Suri, University of California, Santa Barbara
Lydia Tapia, University of New Mexico
Shawna Thomas, Texas A&M University
Pratap Tokekar, Virginia Tech
Chee Yap, New York University
Jingjin Yu, Rutgers University
Additional Reviewers
Marco Antonio Arteaga Pérez, Efstathios Bakolas, Giovanni Beltrame, Lars Berscheid, Graeme Best, Saverio Bolognani, Justin Carpentier, Alexander Cebulla, Ching-An Cheng, Liron Cohen, Sourav Dutta, Renato Farias, Benjamin Fine, J. Octavio Gutiérrez-García, Yue Hao, James Harrison, Bradley Hayes, Antonio Hernández-Garduño, Thomas Howard, Jasmine Hsu, Jeffrey Ichnowski, Dylan Jones, Chase Kew, Jonas Kiemel, Gim Hee Lee, Shuang Luan, Hang Ma, Zachary Manchester, Christian Marzi, Christoforos Mavrogiannis, Dale McConachie, Pascal Meißner, Michael Otte, Zherong Pan, Andrew Price, Dilshad Raihan, Rodolfo Reyes-Báez, Angel Rodríguez Castaño, Hugo Rodríguez Cortés, Brad Saund, Olivier Stasse, Yoonchang Sung, Varun Suryan, Will Thomason, Aakriti Upadhyay, Tansel Uras, Anastasiia Varava, Valerio Varricchio, Kevin Yu, Zhongshun Zhang, Lifeng Zhou
Session Chairs and Respondents

General Tools for Motion Planning
Chair: Dinesh Manocha
Respondents: Frank van der Stappen, Liangjun Zhang

Data Structures for Motion Planning, Planning Under Uncertainty
Chair: David Hsu
Respondents: Dmitri Berenson, Michael Otte

Imitation Learning
Chair: Tomás Lozano-Pérez
Respondents: Scott Livingston, Dorsa Sadigh

Learning in Planning
Chair: Ken Goldberg
Respondents: Byron Boots, Nilanjan Chakraborty

Sensor-Based Planning, Navigation, and Control
Chair: Kostas Bekris
Respondents: Jason O'Kane, Alberto Quattrini Li

Dynamics of Contact in Manipulation and Locomotion
Chair: Mehmet Dogar
Respondents: Devin Balkcom, Jonathan DeCastro

Optimal Motion Generation
Chair: Dan Halperin
Respondents: Neil Dantam, Michael Posa

Tools for Perception, Kinematics and Dynamics
Chair: Kris Hauser
Respondents: Alen Alempijevic, Stefano Carpin

Multi-robot and Multi-body Systems
Chair: Dylan Shell
Respondents: Alberto Rodríguez, Ajay Tanwani

Optimality, Completeness, and Complexity in Planning
Chair: Todd Murphey
Respondents: Ross Knepper, Jingjin Yu

Human–Robot Interaction
Chair: Ron Alterovitz

Open Problems
Chair: Matt Mason
Respondents: Israel Becerra, Rafael Murrieta
Sponsoring Institutions
Universidad Politécnica de Yucatán
Instituto Tecnológico Autónomo de México
Parasol Laboratory, Department of Computer Science and Engineering, Texas A&M University
Google
Computer Science Department, University of New Mexico
Contents
General Tools for Motion Planning Path Planning for Ellipsoidal Robots and General Obstacles via Closed-Form Characterization of Minkowski Operations . . . . . . . . . Sipu Ruan, Qianli Ma, Karen L. Poblete, Yan Yan, and Gregory S. Chirikjian Free Space of Rigid Objects: Caging, Path Non-existence, and Narrow Passage Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Anastasiia Varava, J. Frederico Carvalho, Florian T. Pokorny, and Danica Kragic Fast Collision Detection for Motion Planning Using Shape Primitive Skeletons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mukulika Ghosh, Shawna Thomas, and Nancy M. Amato Fast Swept Volume Estimation with Deep Learning . . . . . . . . . . . . . . . Hao-Tien Lewis Chiang, Aleksandra Faust, Satomi Sugaya, and Lydia Tapia Concurrent Nearest-Neighbor Searching for Parallel Sampling-Based Motion Planning in SO(3), SE(3), and Euclidean Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jeffrey Ichnowski and Ron Alterovitz
3
19
36 52
69
Data Structures for Motion Planning A Visibility-Based Approach to Computing Nondeterministic Bouncing Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Alexandra Q. Nilles, Yingying Ren, Israel Becerra, and Steven M. LaValle
89
xvii
xviii
Contents
Finding Plans Subject to Stipulations on What Information They Divulge . . . 106
Yulin Zhang, Dylan A. Shell, and Jason M. O'Kane

Planning Under Uncertainty

Online Partial Conditional Plan Synthesis for POMDPs with Safe-Reachability Objectives . . . 127
Yue Wang, Swarat Chaudhuri, and Lydia E. Kavraki

An Approximation Algorithm for Risk-Averse Submodular Optimization . . . 144
Lifeng Zhou and Pratap Tokekar

Pushing Fast and Slow: Task-Adaptive Planning for Non-prehensile Manipulation Under Uncertainty . . . 160
Wisdom C. Agboh and Mehmet R. Dogar

Imitation Learning

Learning Stabilizable Dynamical Systems via Control Contraction Metrics . . . 179
Sumeet Singh, Vikas Sindhwani, Jean-Jacques E. Slotine, and Marco Pavone

Generalizing Robot Imitation Learning with Invariant Hidden Semi-Markov Models . . . 196
Ajay Kumar Tanwani, Jonathan Lee, Brijen Thananjeyan, Michael Laskey, Sanjay Krishnan, Roy Fox, Ken Goldberg, and Sylvain Calinon

A Dynamic Regret Analysis and Adaptive Regularization Algorithm for On-Policy Robot Imitation Learning . . . 212
Jonathan N. Lee, Michael Laskey, Ajay Kumar Tanwani, Anil Aswani, and Ken Goldberg

Learning Constraints from Demonstrations . . . 228
Glen Chou, Dmitry Berenson, and Necmiye Ozay

Probability-Weighted Temporal Registration for Improving Robot Motion Planning and Control Learned from Demonstrations . . . 246
Chris Bowen and Ron Alterovitz

Learning in Planning

SACBP: Belief Space Planning for Continuous-Time Dynamical Systems via Stochastic Sequential Action Control . . . 267
Haruki Nishimura and Mac Schwager
Active Area Coverage from Equilibrium . . . 284
Ian Abraham, Ahalya Prabhakar, and Todd D. Murphey

Learning a Spatial Field with Gaussian Process Regression in Minimum Time . . . 301
Varun Suryan and Pratap Tokekar

Meta-learning Priors for Efficient Online Bayesian Regression . . . 318
James Harrison, Apoorva Sharma, and Marco Pavone

Deep BOO! Automating Beam Orientation Optimization in Intensity-Modulated Radiation Therapy . . . 338
Olalekan Ogunmolu, Michael Folkerts, Dan Nguyen, Nicholas Gans, and Steve Jiang

Sensor-Based Planning, Navigation, and Control

Continuous Optimization Framework for Depth Sensor Viewpoint Selection . . . 357
Behnam Maleki, Alen Alempijevic, and Teresa Vidal-Calleja

Look Before You Sweep: Visibility-Aware Motion Planning . . . 373
Gustavo Goretkin, Leslie Pack Kaelbling, and Tomás Lozano-Pérez

Path Planning for Information Gathering with Lethal Hazards and No Communication . . . 389
Michael Otte and Donald Sofge

Reactive Navigation in Partially Known Non-convex Environments . . . 406
Vasileios Vasilopoulos and Daniel E. Koditschek

Resource-Aware Algorithms for Distributed Loop Closure Detection with Provable Performance Guarantees . . . 422
Yulun Tian, Kasra Khosoussi, and Jonathan P. How

Dynamics of Contact in Manipulation and Locomotion

RMPflow: A Computational Graph for Automatic Motion Policy Generation . . . 441
Ching-An Cheng, Mustafa Mukadam, Jan Issac, Stan Birchfield, Dieter Fox, Byron Boots, and Nathan Ratliff

Dynamic Model of Planar Sliding . . . 458
Jiayin Xie and Nilanjan Chakraborty

A Constrained Kalman Filter for Rigid Body Systems with Frictional Contact . . . 474
Patrick Varin and Scott Kuindersma
A Quasi-static Model and Simulation Approach for Pushing, Grasping, and Jamming . . . 491
Mathew Halm and Michael Posa

GP-SUM. Gaussian Process Filtering of non-Gaussian Beliefs . . . 508
Maria Bauza and Alberto Rodriguez

Optimal Motion Generation

Time-Optimal Motion of Spatial Dubins Systems . . . 529
Weifu Wang and Devin Balkcom

Robust Tracking with Model Mismatch for Fast and Safe Planning: An SOS Optimization Approach . . . 545
Sumeet Singh, Mo Chen, Sylvia L. Herbert, Claire J. Tomlin, and Marco Pavone

Semi-infinite Programming for Trajectory Optimization with Nonconvex Obstacles . . . 565
Kris Hauser

Signal Temporal Logic Meets Reachability: Connections and Applications . . . 581
Mo Chen, Qizhan Tam, Scott C. Livingston, and Marco Pavone

Low Complexity Control Policy Synthesis for Embodied Computation in Synthetic Cells . . . 602
Ana Pervan and Todd D. Murphey

Tools for Perception, Kinematics and Dynamics

Parallax Bundle Adjustment on Manifold with Improved Global Initialization . . . 621
Liyang Liu, Teng Zhang, Yi Liu, Brenton Leighton, Liang Zhao, Shoudong Huang, and Gamini Dissanayake

Practical Exponential Coordinates Using Implicit Dual Quaternions . . . 639
Neil T. Dantam

A Performance Analysis of Parallel Differential Dynamic Programming on a GPU . . . 656
Brian Plancher and Scott Kuindersma

Time Integrating Articulated Body Dynamics Using Position-Based Collocation Methods . . . 673
Zherong Pan and Dinesh Manocha

Efficient Computation of Higher-Order Variational Integrators in Robotic Simulation and Trajectory Optimization . . . 689
Taosha Fan, Jarvis Schultz, and Todd Murphey
Multi-Robot and Multi-Body Systems

Interlocking Block Assembly . . . 709
Yinan Zhang and Devin Balkcom

CADbots: Algorithmic Aspects of Manipulating Programmable Matter with Finite Automata . . . 727
Sándor P. Fekete, Robert Gmyr, Sabrina Hugo, Phillip Keldenich, Christian Scheffer, and Arne Schmidt

Multi-agent Trajectory Prediction and Generation with Topological Invariants Enforced by Hamiltonian Dynamics . . . 744
Christoforos I. Mavrogiannis and Ross A. Knepper

Balancing Unpredictability and Coverage in Adversarial Patrolling Settings . . . 762
Nicola Basilico and Stefano Carpin

Fast, High-Quality Dual-Arm Rearrangement in Synchronous, Monotone Tabletop Setups . . . 778
Rahul Shome, Kiril Solovey, Jingjin Yu, Kostas Bekris, and Dan Halperin

Optimality, Completeness, and Complexity in Planning

Motion Planning for Multiple Unit-Ball Robots in Rd . . . 799
Israela Solomon and Dan Halperin

Coordinating the Motion of Labeled Discs with Optimality Guarantees under Extreme Density . . . 817
Rupesh Chinta, Shuai D. Han, and Jingjin Yu

An Experimental Analysis on the Necessary and Sufficient Conditions for the RRT* Applied to Dynamical Systems . . . 835
Israel Becerra, Heikel Yervilla-Herrera, and Rafael Murrieta-Cid

Hardness of 3D Motion Planning Under Obstacle Uncertainty . . . 852
Luke Shimanuki and Brian Axelrod

The Hardness of Minimizing Design Cost Subject to Planning Problems . . . 868
Fatemeh Zahra Saberifar, Jason M. O'Kane, and Dylan A. Shell

Human-Robot Interaction

Altruistic Autonomy: Beating Congestion on Shared Roads . . . 887
Erdem Bıyık, Daniel A. Lazar, Ramtin Pedarsani, and Dorsa Sadigh

Operation and Imitation Under Safety-Aware Shared Control . . . 905
Alexander Broad, Todd Murphey, and Brenna Argall
Guided Exploration of Human Intentions for Human-Robot Interaction . . . 921
Min Chen, David Hsu, and Wee Sun Lee

Counterexample-Guided Safety Contracts for Autonomous Driving . . . 939
Jonathan DeCastro, Lucas Liebenwein, Cristian-Ioan Vasile, Russ Tedrake, Sertac Karaman, and Daniela Rus

Author Index . . . 957
General Tools for Motion Planning
Path Planning for Ellipsoidal Robots and General Obstacles via Closed-Form Characterization of Minkowski Operations

Sipu Ruan1, Qianli Ma2, Karen L. Poblete1, Yan Yan3, and Gregory S. Chirikjian1,4

1 Laboratory for Computational Sensing and Robotics, Johns Hopkins University, Baltimore, MD 21218, USA, {ruansp,kpoblet1,gchirik1}@jhu.edu
2 Aptiv, Pittsburgh, PA 15238, USA, [email protected]
3 Amperity Inc., Seattle, WA 98134, USA, [email protected]
4 Department of Mechanical Engineering, National University of Singapore, 9 Engineering Drive 1, Singapore 117575, Singapore, [email protected]

Abstract. Path planning has long been one of the major research areas in robotics, with PRM and RRT being two of the most effective path planners. Though they are generally very efficient, these two sample-based planners can become computationally expensive in the important special case of narrow passage problems. This paper develops a path planning paradigm which uses ellipsoids and superquadrics to respectively encapsulate the rigid parts of the robot and obstacles. The main benefit in doing this is that configuration-space obstacles can be parameterized in closed form, thereby allowing prior knowledge to be used to avoid sampling infeasible configurations, in order to solve the narrow passage problem efficiently. Benchmark results for single-body robots show that, remarkably, the proposed method outperforms the sample-based planners in terms of the computational time in searching for a path through narrow corridors. Feasible extensions that integrate with sample-based planners to further solve the high dimensional multi-body problems are discussed, which will require substantial additional theoretical development in the future.
1 Introduction
Sample-based planners such as PRMs [1] and RRTs [2] (and a multitude of their extensions, e.g. [3,4]) have demonstrated remarkable success. These methods are usually based on polyhedral representations of robots and obstacles and perform explicit collision detection to assess the feasibility of samples. These methods have had a profound impact both within robotics and across other fields such as molecular docking, urban planning, and assembly automation.
It is well known that despite the great success of these methods, the "narrow passage problem" remains a significant challenge. The reason is that, generally speaking, sample-based approaches use a strategy of sampling states in the whole configuration space, followed by collision checking. When a robot and an obstacle are found to be in collision, the corresponding sample is discarded. Then, valid state configurations are connected by edges, where each edge is sub-sampled and collision checking is done along the edge. If any of the states on the edge corresponds to a collision, the whole edge is discarded. This approach works extremely well when the obstacles in the environment are sparse. But when there is a narrow passage, an inordinate amount of computational time is spent on the samples and edges that eventually will be discarded. To increase the probability of sampling valid configurations in a narrow passage, various methods have been proposed such as [5–8]. However, there is still no guarantee of finding valid vertices efficiently within the corridor due to the probabilistic nature of sampling and collision checking. Therefore the first goal of this paper is:

1. Extend our previous method of parameterizing a priori the free space [9], and develop a planner so as to avoid traditional collision checking computations.

In the planner developed here, the basic objects are unions of ellipsoids for the robot and superquadrics for environmental features. One reason to use superquadrics is that they can characterize a wide range of complex shapes while requiring only a few parameters [10–12]. It is well known that for a single rigid body with a fixed orientation in n-dimensional space, a "slice" of the configuration-space (C-space) obstacle corresponding to this orientation is the Minkowski sum of the rigid body and the obstacles in the workspace [13–17], denoted as a "C-layer" [18,19]. There is substantial literature on the computational complexity of Minkowski sums of polyhedra and faceted approximations of ellipsoids [20–24]. Recently an exact closed-form formula for n-dimensional ellipsoids was introduced and discussed [25]. As a generalization of that, the closed-form Minkowski sums of an ellipsoid and an arbitrary convex differentiable shape embedded in n-dimensional Euclidean space are presented here, with superquadrics being a typical example. This is another essential reason for the choice of superquadric objects as environmental features in our new planner.

Minkowski sums characterize the C-space obstacles for the individual rigid components in an articulated robot, and the feasibility of a robot's configuration corresponds to each rigid component of the robot lying in the complement of the union of C-space obstacles. Consequently, collision-free samples can be generated. However, if one seeks to connect such samples using current sample-based planning paradigms like PRM or RRT, then collision checking is still required. Therefore, the second goal of this paper is:

2. Develop guaranteed safe and efficient methods for connecting configurations with different rotational components without performing traditional collision checking.

This applies the idea of the Kinematics of Containment for convex objects [26], as well as detailed derivations for ellipsoids [27,28]. It is shown in this paper
that one can enclose the robot by a slightly larger ellipsoid, so that the allowable motions of the robot being fully contained can be characterized as a convex polyhedron, which is denoted as the "Local C-space" for a specific configuration. As an evaluation, the performance of the proposed planner is compared with the sample-based algorithms from the well-known Open Motion Planning Library (OMPL) [29]. Path planning for planar motions, i.e. those on SE(2), is demonstrated, with the robot being an ellipse and the obstacles being superellipses. The rest of the paper is organized as follows. Section 2 reviews the concepts of Minkowski sum and difference, and extends the closed-form expressions for Minkowski operations of an ellipsoid and a convex differentiable surface embedded in n-dimensional Euclidean space so that the free space of individual robot components can be parameterized. Section 3 reviews our previously proposed Highway RoadMap algorithm for an elliptical robot and obstacles, and extends the idea to the ellipse-superellipse case. The idea of the "Local C-space" is then proposed to connect configurations with different rotational components. We then evaluate the efficiency and effectiveness of our algorithm on SE(2) by benchmarking with probabilistic planners from OMPL, especially for the "narrow passage" problem, in Sect. 4. We conclude with discussions in Sect. 5.
2 Mathematical Preliminaries
This section provides the mathematical preliminaries for the path planning algorithm. First, the closed-form Minkowski sum and difference between an ellipsoid and any surface with implicit and parametric expressions are derived. As a concrete representation of the environment, a surface in the form of a superellipse is studied, where the closed-form Minkowski sum and difference between an ellipse and a superellipse are derived explicitly.

2.1 Minkowski Sum and Difference Between Two Convex Objects
The Minkowski sum and difference of two convex point sets (or bodies) each centered at the origin, $P_1$ and $P_2$ in $\mathbb{R}^n$, are defined respectively as [30]

$$P_1 \oplus P_2 \doteq \{p_1 + p_2 \mid p_1 \in P_1,\ p_2 \in P_2\} \quad \text{and} \quad P_1 \ominus P_2 \doteq \{p \mid p + P_2 \subseteq P_1\}. \tag{1}$$

Alternatively, the Minkowski difference of two convex bodies can be defined relative to the Minkowski sum as the body $P_1 = P_1' \ominus P_2$, where $P_1' = P_1 \oplus P_2$. While it is relatively simple to define the Minkowski operations mathematically, calculating useful representations of Minkowski sums or differences can be difficult and computationally expensive, especially when the boundary of these regions requires an explicit representation.
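As a point of intuition only (our own illustrative sketch, not part of the paper's method), definition (1) can be realized directly on finite point samplings of two bodies; the closed-form results derived below exist precisely to avoid this kind of expensive enumeration:

```cpp
#include <vector>

// A point in the plane.
struct Pt { double x, y; };

// Brute-force Minkowski sum of two convex bodies given as finite point
// samplings: every pairwise sum p1 + p2 is a point of P1 (+) P2, per Eq. (1).
std::vector<Pt> minkowskiSumSamples(const std::vector<Pt>& P1,
                                    const std::vector<Pt>& P2) {
    std::vector<Pt> sum;
    sum.reserve(P1.size() * P2.size());
    for (const Pt& p1 : P1)
        for (const Pt& p2 : P2)
            sum.push_back({p1.x + p2.x, p1.y + p2.y});
    return sum;  // the boundary of P1 (+) P2 is the convex hull of these points
}
```

The quadratic cost of this enumeration (and the hull computation it still requires) is what makes a closed-form boundary parameterization attractive.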
2.2 Closed-Form Minkowski Operations Between an Ellipsoid and a General Convex Differentiable Surface Embedded in $\mathbb{R}^n$
It has been observed that the Minkowski sum and difference between two ellipsoids can be computed in closed-form. The computational procedure can be
further extended when one ellipsoid is substituted by an arbitrary convex differentiable surface embedded in n-dimensional Euclidean space ($\mathbb{R}^n$). Assume that $S_1$ is a hyper-surface embedded in $\mathbb{R}^n$, with implicit and parametric forms

$$\Phi(x_1) = 1 \quad \text{and} \quad x_1 = f(\psi), \tag{2}$$

where $\Phi(\cdot)$ is a real-valued differentiable function of $x_1 \in \mathbb{R}^n$ and $f$ is a differentiable n-dimensional vector-valued function of $\psi \in \mathbb{R}^{n-1}$. Let $E_2$ be an arbitrary ellipsoid in $\mathbb{R}^n$, with semi-axis lengths $a_2 = [a_1, a_2, \ldots, a_n]^\top$. Then, the implicit and explicit equations are of the form $x^\top A_2^{-2}\, x = 1$ and $x = A_2\, u(\psi)$, where $A_2 = R_2 \Lambda(a_2) R_2^\top$ is the shape matrix of $E_2$, $R_2 \in SO(n)$ denotes the orientation of the ellipsoid, and $\Lambda(\cdot)$ is a diagonal matrix. Here $u(\psi)$ is the standard parameterization of the n-dimensional hyper-sphere with parameters $\psi = [\psi_1, \psi_2, \ldots, \psi_{n-1}]^\top$.

The affine transformation that shrinks the ellipsoid into a sphere with radius $r \doteq \min\{a_1, a_2, \ldots, a_n\}$, acting also on the surface $S_1$, can be expressed as $x' = Tx$, where $T = R_2 \Lambda(r/a_2) R_2^\top$ denotes the "shrinking" affine transformation, which is symmetric and positive definite since $\Lambda(r/a_2)$ is diagonal and positive definite. The implicit expression for the "shrunk" $S_1$, denoted as $S_1'$, is $\Phi(T^{-1}x') = 1$. Then the Minkowski sum between $S_1'$ and the shrunk $E_2$ (now a sphere) is obtained by computing the boundary of the offset surface with offset radius $r$ as

$$x_{ofs} = x' + r\, n, \quad \text{where} \quad n = \frac{\nabla\Phi(T^{-1}x')}{\lVert \nabla\Phi(T^{-1}x') \rVert}$$

is the outward normal of the surface, and $\nabla\Phi(T^{-1}x') = T^{-\top}\nabla\Phi(x)$ with $T^{-\top} = (T^{-1})^\top = (T^\top)^{-1} = T^{-1}$. The Minkowski sum between the original surface $S_1$ and ellipsoid $E_2$ can then be given by "stretching" the transformed space back, using the inverse affine transformation, as

$$x_{eb} = T^{-1} x_{ofs} = T^{-1}\!\left(Tx + r\,\frac{T^{-\top}\nabla\Phi(x)}{\lVert T^{-\top}\nabla\Phi(x)\rVert}\right) = x + r\,\frac{T^{-2}\,\nabla\Phi(x)}{\lVert T^{-\top}\nabla\Phi(x)\rVert}. \tag{3}$$
The Minkowski difference $S_1 \ominus E_2$ can therefore be obtained by switching the plus sign in Eq. (3) to minus. However, for the Minkowski difference, a "curvature constraint" should be satisfied: after the "shrinking" operation, the curvature at every point on the transformed surface $S_1'$ should be smaller than the curvature of the transformed ellipsoid (i.e., of the sphere of radius $r$).
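Since the "stretching" step is the crux of Eq. (3), it may help to spell out the simplification it uses (our own verification, relying only on the symmetry of $T$):

$$T^{-1}\!\left(Tx + r\,\frac{T^{-\top}\nabla\Phi(x)}{\lVert T^{-\top}\nabla\Phi(x)\rVert}\right) = x + r\,\frac{T^{-1}T^{-1}\,\nabla\Phi(x)}{\lVert T^{-1}\nabla\Phi(x)\rVert} = x + r\,\frac{T^{-2}\,\nabla\Phi(x)}{\lVert T^{-\top}\nabla\Phi(x)\rVert},$$

since $T^\top = T$ implies $T^{-\top} = T^{-1}$, so applying $T^{-1}$ to the numerator produces $T^{-2}\nabla\Phi(x)$ while the scalar norm in the denominator is unaffected.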
2.3 Explicit Expressions of the Closed-Form Minkowski Operations Between an Ellipse and a Superellipse
We now give a concrete example of the closed-form Minkowski operations when the surface is a 2D superellipse. This formulates a mathematical representation for the implementation in this paper. The implicit and explicit equations for a superellipse $S_1$ in $\mathbb{R}^2$ are defined as

$$\Phi(x) = \left(\frac{x_1}{a_1}\right)^{2/\epsilon} + \left(\frac{y_1}{b_1}\right)^{2/\epsilon} = 1 \quad \text{and} \quad x = \begin{bmatrix} x_1 \\ y_1 \end{bmatrix} = \begin{bmatrix} a_1 \cos^{\epsilon}\theta \\ b_1 \sin^{\epsilon}\theta \end{bmatrix}, \quad -\pi \le \theta \le \pi, \tag{4}$$
Path Planning for Ellipsoidal Robots and General Obstacles
(a) Both bodies are rotated clockwise with an angle θ2 .
(c) The Minkowski sum in the shrunk space is an offset surface (the shaded area).
7
(b) Both of the bodies are shrunk until ∂E2 becomes a circle.
(d) Transform back to the original space. The shaded region represents S1 ⊕ E2 .
Fig. 1. Algorithm for obtaining the characterizations of the Minkowski sum between a superellipse S1 and an ellipse E2 .
respectively. The shape described by the above function changes with $\epsilon$; we only consider the case of $0 < \epsilon < 2$ to ensure the convexity of the corresponding shape. The normal vector of the superellipse can then be obtained by calculating the gradient of $\Phi(x_1, y_1)$ as

$$\nabla\Phi(x_1(\theta), y_1(\theta)) = \frac{2}{\epsilon}\begin{bmatrix} \cos^{2-\epsilon}\theta \,/\, a_1 \\ \sin^{2-\epsilon}\theta \,/\, b_1 \end{bmatrix}.$$

Suppose an ellipse is defined by the parameters $a_2 = [a_2, b_2]^\top$ and its orientation is characterized by $R_2 = R(\theta_2) \in SO(2)$. If we further define $r = \min\{a_2, b_2\}$, then the "shrinking" transformation can be computed as $T = R_2 \Lambda(r/a_2) R_2^\top$. Now we have all the information to calculate the closed-form Minkowski sum and difference between an ellipse and a superellipse, i.e. $S_1 \oplus E_2$ and $S_1 \ominus E_2$ respectively. Figure 1 illustrates the computational process.
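To show how Eqs. (3) and (4) combine in practice, the following self-contained C++ sketch (our own; the function names are ours, and the signed-power quadrant handling is an implementation choice, not stated in the paper) evaluates one boundary point of $S_1 \oplus E_2$:

```cpp
#include <array>
#include <cmath>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<std::array<double, 2>, 2>;

// Signed power, so the parameterization in Eq. (4) covers all four quadrants.
static double spow(double c, double e) {
    return (c < 0 ? -1.0 : 1.0) * std::pow(std::fabs(c), e);
}

static Mat2 inverse(const Mat2& M) {
    double d = M[0][0] * M[1][1] - M[0][1] * M[1][0];
    return {{{ M[1][1] / d, -M[0][1] / d}, {-M[1][0] / d, M[0][0] / d}}};
}

static Vec2 mul(const Mat2& M, const Vec2& v) {
    return {M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]};
}

// One boundary point of S1 (+) E2 via Eq. (3): superellipse S1 = (a1, b1, eps),
// ellipse E2 = (a2, b2) rotated by th2, evaluated at surface parameter th.
Vec2 minkSumBoundary(double a1, double b1, double eps,
                     double a2, double b2, double th2, double th) {
    const double r = std::fmin(a2, b2);
    const double c = std::cos(th2), s = std::sin(th2);
    // Shrinking map T = R2 * diag(r/a2, r/b2) * R2^T (symmetric).
    Mat2 T = {{{c * c * r / a2 + s * s * r / b2, c * s * (r / a2 - r / b2)},
               {c * s * (r / a2 - r / b2), s * s * r / a2 + c * c * r / b2}}};
    Mat2 Tinv = inverse(T);
    // Point on S1 and the gradient of its implicit function, per Eq. (4).
    Vec2 x = {a1 * spow(std::cos(th), eps), b1 * spow(std::sin(th), eps)};
    Vec2 g = {2.0 / eps * spow(std::cos(th), 2.0 - eps) / a1,
              2.0 / eps * spow(std::sin(th), 2.0 - eps) / b1};
    Vec2 Tg = mul(Tinv, g);        // T^{-T} grad = T^{-1} grad, since T = T^T
    double nrm = std::hypot(Tg[0], Tg[1]);
    Vec2 N = mul(Tinv, Tg);        // T^{-2} grad
    return {x[0] + r * N[0] / nrm, x[1] + r * N[1] / nrm};
}
```

Sweeping the parameter th over $[-\pi, \pi]$ traces the full Minkowski-sum boundary; flipping the plus to a minus in the return statement gives the Minkowski difference, subject to the curvature constraint above.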
3 The Highway RoadMap Path Planning Algorithm for Ellipsoidal Robots: Reviews and Extensions
With the definition of the exact closed-form contact boundary of an ellipsoid and a superquadric obtained in Sect. 2.3, path planning problems can be solved effectively. In this section, the "Highway RoadMap" algorithm [9,31], which takes
advantage of knowledge of the collision-free C-space a priori, is briefly reviewed. An extension for connecting configurations with different rotational components is then developed.
Algorithm 1. Highway RoadMap Algorithm
Input: Robot; Obstacle; Arena; EndPoint
Output: Graph; Path

    Rot <- Discretized Orientations;
    foreach Rot do
        C_Obs, C_Arena <- Minkowski Operations between Robot and Obstacle, Arena;
        Cell_CF <- Cell Decomposition by Sweep-Line Process;
        Graph <- Vertex Generations and Edge Connections within the C-Layer;
    end
    Graph <- Connect Closest Vertices among Adjacent C-Layers;
    Path <- Graph Search Process;
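To make the control flow of Algorithm 1 concrete, the following is a minimal C++ skeleton (our own sketch; the type and routine names are placeholders, not the authors' implementation):

```cpp
#include <cstddef>
#include <vector>

// Placeholder types: a vertex is a configuration (x, y, theta) and an edge
// joins two collision-free vertices.
struct Vertex { double x, y, theta; };
struct Edge   { std::size_t v1, v2; };
struct Graph  { std::vector<Vertex> vertices; std::vector<Edge> edges; };

// Skeleton of Algorithm 1. The commented steps correspond to the closed-form
// Minkowski operations (Sect. 2), the sweep-line cell decomposition
// (Sect. 3.2), and the inter-layer connections (Sect. 3.3).
Graph buildHighwayRoadMap(int numLayers) {
    Graph graph;
    for (int i = 0; i < numLayers; ++i) {
        double theta = 2.0 * 3.14159265358979 * i / numLayers;  // one C-layer
        (void)theta;
        // 1. C-obstacle / C-arena boundaries at this orientation (Eq. (3)).
        // 2. Sweep-line decomposition into disjoint collision-free cells.
        // 3. Vertices at cell-boundary midpoints; edges within this C-layer.
    }
    // 4. Connect closest vertices among adjacent C-layers (local C-space test).
    return graph;
}
```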
3.1 Overview of the Highway RoadMap Algorithm
The Highway RoadMap system is built on the idea of cell decomposition at each orientation of the robot. First, the orientations are discretized from the configuration space. Then, at each fixed orientation, a subset of the C-space that only contains translational motions is built, denoted as a "C-layer". The roadmap system is built by first decomposing each C-layer into disjoint collision-free cells, and then connecting vertices between adjacent C-layers. At each C-layer, the closed-form Minkowski sum and difference are computed between the robot and the obstacles and arenas, respectively.¹ Note that for the sake of simplicity, one can always make the arena large enough in comparison to the robot, so that the "curvature constraint" of the Minkowski difference is not activated. Once the Minkowski operations are applied, the configuration space with C-obstacles is constructed; the free space can then be characterized and decomposed into disjoint collision-free cells. A subset of the roadmap is constructed by taking the middle point of each boundary between adjacent cells as a vertex and connecting vertices with edges. The entire roadmap system can then be constructed by connecting vertices among adjacent C-layers. This procedure is based on our previously published work which uses two ellipsoids [9]. In this paper, we extend this description by using superquadrics, and propose a new method to connect vertices among different C-layers. The planning algorithm is illustrated in Algorithm 1.
¹ Here the word "arena" denotes the bounded area in which the robot and obstacles are contained.
Fig. 2. The cell decomposition at one C-layer. Black superelliptical objects: obstacles; blue curves: the closed-form Minkowski sum and difference boundaries; thin black line segments: the sweep lines parallel to the x-axis; black dots: collision-free vertices within the C-layer; thick black line segments: collision-free edges that connect vertices.
3.2 A Sweep-Line Process for the Cell Decomposition Within One C-Layer
Within one C-layer, the motion of the robot is restricted to translations only, and the Minkowski operations are applied to obtain the C-space obstacles and arenas. To address this, a process similar to the "sweep line" method has been proposed to decompose the collision-free space into disjoint cells [32].

Suppose that the robot is constructed as a finite union of $M$ ellipsoids $E_1, E_2, \ldots, E_M$, and the transformations between the first ellipsoid $E_1$ and the other ellipsoids $E_2, E_3, \ldots, E_M$ are defined as $g_2, g_3, \ldots, g_M$ respectively. Let the collision-free C-space for each ellipsoid $E_i$ be denoted as $C_i$ ($i = 1, \ldots, M$). Then the collision-free space for the whole robot can be characterized as the intersection of these spaces as viewed in the body frame of the first ellipsoid, i.e. $C = C_1 \cap (g_2 \cdot C_2) \cap (g_3 \cdot C_3) \cap \cdots \cap (g_M \cdot C_M)$, where $g_i \cdot C_j \doteq R_i C_j + t_i$ ($i, j = 1, \ldots, M$).

In order to detect those regions, a set of sweep lines parallel to the x-axis is generated. Each sweep line intersects all the curves, with the intersecting points saved as pairs, or intervals. Denoting the line segments within the obstacles as $P_{O_i}$, and those within the arenas as $P_{A_i}$, the collision-free line segment $P_{CF}$ for each sweep line can be represented as

$$P_{CF} = \bigcap_{i=1}^{M_A \times M} P_{A_i} \;-\; \bigcup_{j=1}^{M_O \times M} P_{O_j}, \tag{5}$$

where $M_A$ and $M_O$ are the numbers of superquadrics that represent arenas and obstacles respectively. Figure 2 shows the decomposed C-space at one layer with collision-free cells highlighted by horizontal raster lines.
3.3 A Local Planner for Vertex Connections Between C-Layers
Since each C-layer only represents one orientation of the robot, one must connect the subgraphs of different C-layers so that the robot can transition between layers by rotations. It is always beneficial to find a continuous collision-free space that encloses all the steps along the edge between two vertices. Therefore, we propose a local planner that builds a continuous convex C-space allowing the robot to move between two configurations in different C-layers. The basic idea is to enclose the actual robot within a slightly larger ellipsoid, i.e. to scale the actual robot up by an inflation factor $\epsilon = 0.1$. The robot is then allowed to move by small amounts inside the larger ellipsoid without collisions. Such motions can be described locally in the C-space, denoted as the "Local C-space". The local C-space becomes collision-free if the Minkowski operations are performed using the larger ellipsoid, and the descriptions of the local C-space can be computed a priori, before building the roadmap. Once the local C-spaces of two vertices intersect, a new vertex can be generated within the intersecting area and connected to the two vertices. The following subsections introduce in detail the characterization of the local C-space and the procedure to connect two vertices by a collision-free path.

Characterization of the Local C-Space. We first review the allowable motions of a smaller ellipsoid inside a larger one, both of which can be described in the n-dimensional case. The related study traces back to the concept of the "Kinematics of Containment", which deals with any convex body that moves inside another [26]. A recent study of a special case, in which both bodies are n-dimensional ellipsoids, can be applied here [27]. Given two n-dimensional ellipsoids $E_a$ and $E_b$, with $E_a \subseteq E_b$, we let $a = [a_1, a_2, \ldots, a_n]^\top,\ b = [b_1, b_2, \ldots, b_n]^\top \in \mathbb{R}^n$ denote the semi-axes of $E_a$ and $E_b$ respectively. By substituting the explicit expression of the moving ellipsoid $E_a$ into the implicit expression of the fixed ellipsoid $E_b$ that is aligned with the world frame, the algebraic condition for $E_a$ to move inside $E_b$ without collision can be written as

$$(R_a \Lambda(a)\, u + t_a)^\top \Lambda^{-2}(b)\, (R_a \Lambda(a)\, u + t_a) \le 1. \tag{6}$$

This highly nonlinear expression can be simplified by a small-angle approximation. If $E_a$ is restricted to infinitesimal motions, the rotation part calculated by the exponential map can be approximated to first order as

$$R_a = \exp(\hat{\omega}_a) \approx I + \hat{\omega}_a, \tag{7}$$

where $\hat{\omega}_a \in so(n)$. Grouping parameters ($u$) and variables ($\omega$ and $t$) gives the first-order approximation of the left-hand side of the algebraic condition of containment as

$$C_u(\xi) = \xi^\top H(u)\, \xi + h^\top(u)\, \xi + c(u), \tag{8}$$

where $H(u) \in \mathbb{R}^{n(n+1)/2 \times n(n+1)/2}$, $h(u) \in \mathbb{R}^{n(n+1)/2}$ and $c(u) \in \mathbb{R}$. The first-order algebraic condition of containment is then defined as $C_u(\xi) \le 1$.
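For concreteness, here is a minimal sketch of checking the first-order containment condition in the planar case (our own illustration; $E_b$ is taken axis-aligned, and instead of assembling $H(u)$, $h(u)$, $c(u)$ explicitly we evaluate the linearized condition at sampled boundary parameters $u(\varphi)$):

```cpp
#include <cmath>

// First-order containment test for an ellipse Ea = (a1, a2) moving inside an
// axis-aligned ellipse Eb = (b1, b2), both initially centered at the origin.
// xi = (omega, t1, t2) is a small planar motion; we check C_u(xi) <= 1 of
// Eq. (8) at sampled boundary parameters, using Ra ~ I + omega_hat (Eq. (7)).
bool insideFirstOrder(double a1, double a2, double b1, double b2,
                      double omega, double t1, double t2, int samples = 360) {
    const double kPi = 3.14159265358979323846;
    for (int k = 0; k < samples; ++k) {
        double phi = 2.0 * kPi * k / samples;
        double v1 = a1 * std::cos(phi), v2 = a2 * std::sin(phi);
        // Linearized boundary point: (I + omega_hat) * Lambda(a) * u + t.
        double p1 = v1 - omega * v2 + t1;
        double p2 = v2 + omega * v1 + t2;
        if ((p1 / b1) * (p1 / b1) + (p2 / b2) * (p2 / b2) > 1.0) return false;
    }
    return true;
}
```

Collecting motions xi that pass this test and taking their convex hull gives exactly the convex-polyhedron lower bound of the local C-space described next.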
Fig. 3. (a) Edges between C-layers in the "Local C-space": a collision-free vertex connection scheme between adjacent C-layers, where the blue dots are the two vertices, $V_1$ and $V_2$, with convex polyhedra as their local C-spaces, and the green line segments connect $V_3$ at the intersection to the two vertices. (b) A complete Highway RoadMap in C-space: the fully connected graph structure, where the vertical axis represents the rotational angle, black dots are valid vertices, and line segments are collision-free edges.
It can be further proved that Eq. (8) is a family of convex functions, with parameters $u$ and unknown variables $\xi = [\omega^\top, t^\top]^\top \in \mathbb{R}^{n(n+1)/2}$.² As a result, enclosing several valid configurations by a convex hull gives a convex polyhedron in the C-space of the smaller ellipsoid. This convex polyhedron describes a collision-free subspace, and any path inside it is guaranteed to be collision-free. The convex polyhedron is a lower bound of the actual local C-space, where the first-order approximation works well when the rotational motion is small. Therefore, not only the inflation factor $\epsilon$, but also the roundness of the ellipsoid affects the approximation. With the increase of the aspect ratio, which quantifies the roundness, the convex lower bound takes a larger portion of the volume relative to the actual local C-space. For more details about the performance of the convex lower bound of the local C-space, see [28].

Vertex Connections Based on the Convex Polyhedron Local C-Space. To connect two vertices, $V_1$ and $V_2$, by a collision-free path $Path_{12}$, one can first define a new middle vertex $V_3$ inside the intersection of the local C-spaces of $V_1$ and $V_2$. Connecting $V_1, V_3$ and $V_2, V_3$ by line segments then gives $Path_{13}$ and $Path_{23}$ respectively. These two path segments are guaranteed to be collision-free since each is fully inside the convex polyhedron of $V_1$ or $V_2$. Finally, the desired collision-free path is the combination of the two segments, i.e. $Path_{12} = Path_{13} \cup Path_{23}$. Figure 3 demonstrates the proposed connection scheme for vertices in different C-layers, and a complete Highway RoadMap system in the C-space with valid vertices connected.
² Explicitly, $\omega = \log^{\vee}(R)$, $R \in SO(n)$ and $t \in \mathbb{R}^n$ ($t$ is the actual translation as seen in the world reference frame). The pair $(R, t)$ forms the "Pose Change Group", i.e. $PCG(n) \doteq SO(n) \times \mathbb{R}^n$, with the group operation being a direct product, which is different from $SE(n) \doteq SO(n) \ltimes \mathbb{R}^n$ (for more details, see [33]).
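A minimal sketch of the resulting connection test (ours; tryConnect and inLocalCspace are hypothetical names, and taking V3 as the midpoint is only one simple choice of a point in the intersection):

```cpp
#include <array>
#include <optional>

using Config = std::array<double, 3>;  // (x, y, theta) on SE(2)

// Propose V3 between two vertices on adjacent C-layers and validate each half
// of the path V1-V3-V2; inLocalCspace stands for the first-order containment
// test above, evaluated relative to the inflated ellipsoid of each vertex.
std::optional<Config> tryConnect(const Config& V1, const Config& V2,
                                 bool (*inLocalCspace)(const Config& from,
                                                       const Config& to)) {
    Config V3 = {0.5 * (V1[0] + V2[0]), 0.5 * (V1[1] + V2[1]),
                 0.5 * (V1[2] + V2[2])};
    if (inLocalCspace(V1, V3) && inLocalCspace(V2, V3))
        return V3;        // path V1 -> V3 -> V2 is guaranteed collision-free
    return std::nullopt;  // no connection for this vertex pair
}
```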
4 Experiments on Path Planning for a 2D Elliptical Robot
As a demonstration of the extended Highway RoadMap algorithm, we consider a 2D single-body planning problem, where the robot is an ellipse and the arena and obstacles are modeled as superellipses. To evaluate the performance of the proposed algorithm, benchmarks are performed with three commonly used sample-based planners (PRM, RRT, RRT-Connect) from the well-known Open Motion Planning Library (OMPL). Moreover, three different sampling methods are also compared for the probabilistic planners: uniform random sampling (Uniform), obstacle-based sampling (OB) [6], and the bridge test (Bridge) [7]. This section describes the benchmark process and results, followed by some discussion and potential extensions for future work. Our algorithm is implemented in C++ and all the comparisons are performed on an Intel Core i7 CPU at 3.60 GHz.

Table 1. Benchmark parameters for Highway RoadMap and sample-based planners from OMPL.

Parameters   Explanations
Maps         Sparse, Cluttered, Maze
Samplers     Uniform, Obstacle-based (OB), Bridge
Shapes       Ellipse, SuperEllipse
NL           Number of C-layers for Highway RoadMap
NSL          Number of sweep lines for Highway RoadMap
NV           Number of valid configurations
NE           Number of valid edges that connect two valid configurations
NP           Number of vertices in the collision-free path

4.1 Experimental Parameters
Our experimental parameters include two types of shapes for representing the obstacles and different kinds of maps for the benchmark environments. The obstacles are defined as ellipses and superellipses, and for each shape, three maps are considered. We deal with a sparse map, where there are only a few obstacles and the free space occupies a majority of the area; a cluttered map, where more space is occupied by obstacles placed in different orientations; and a maze map, where only some narrow passages are free. These maps capture different real-life scenarios the robot might face, allowing us to evaluate the performance of the different algorithms. The comparisons include the running time for solving the problem and the success rate across multiple experimental trials. For each map and each algorithm, we perform 50 planning trials and compute the average planning time and success rate. Note that for the probabilistic algorithms, a time limit of 60 s is set for each planning trial, which means the planning fails if the time exceeds the limit. The list of parameters for our comparisons is shown in Table 1.
Table 2. Implementation details of the parameters for the Highway RoadMap planner.

Shape         Map        NL   NSL   NV      NE      NP
Ellipse       Sparse     15   17    723     916     25
Ellipse       Cluttered  20   35    3621    4828    37
Ellipse       Maze       50   40    10636   14699   486
SuperEllipse  Sparse     14   10    493     672     21
SuperEllipse  Cluttered  14   25    2009    2547    72
SuperEllipse  Maze       55   30    9782    13450   554
Table 2 provides the implementation details for the Highway RoadMap planner, giving the numbers of C-layers and sweep lines and the resulting numbers of valid vertices, edges, and vertices on the collision-free path.

4.2 Collision Checking Methods for the Sample-Based Algorithms
In sample-based methods, the majority of the time is consumed in collision checking, so the choice of a relatively fast collision checker is a priority. The standard and widely used collision checking library, the Flexible Collision Library (FCL), has been applied in many scenarios [34], therefore we include FCL in our benchmark process. In particular, a special and efficient ellipsoidal method from FCL is applied for the case of elliptical obstacles. Since the input parameters describing the ellipses are the same for all the planners used in our experiments, the comparison is fair. For the case of superellipses, the objects are treated as polygons, and traditional collision checking for polygons is applied. To make the checking process efficient while keeping the comparison relatively fair, we characterize each superellipse object by 10 discrete points on its boundary and construct triangulation meshes as inputs for the polygonal collision checking. We select 10 points on the boundary because a superelliptical object can be adequately characterized this way while the speed of collision checking is still fast.
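A sketch of the boundary sampling used to build the polygonal models (our own reconstruction under the stated 10-point assumption; the triangulation step and the FCL calls themselves are omitted):

```cpp
#include <cmath>
#include <vector>

struct Pt { double x, y; };

// Signed power, covering all four quadrants of the superellipse.
static double spow(double c, double e) {
    return (c < 0 ? -1.0 : 1.0) * std::pow(std::fabs(c), e);
}

// Sample N boundary points of a superellipse (a, b, eps) posed at (cx, cy, th);
// the resulting polygon is what feeds the triangulation-based checker.
std::vector<Pt> superellipsePolygon(double a, double b, double eps,
                                    double cx, double cy, double th,
                                    int N = 10) {
    const double kPi = 3.14159265358979323846;
    std::vector<Pt> poly;
    poly.reserve(N);
    for (int k = 0; k < N; ++k) {
        double t = -kPi + 2.0 * kPi * k / N;
        double x = a * spow(std::cos(t), eps), y = b * spow(std::sin(t), eps);
        poly.push_back({cx + x * std::cos(th) - y * std::sin(th),
                        cy + x * std::sin(th) + y * std::cos(th)});
    }
    return poly;
}
```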
4.3 Experimental Results
This section describes the experimental results for the benchmarking, followed by some discussion and potential extensions for future work. Figure 4 shows the maps used for benchmarking as well as the valid paths Highway RoadMap finds, which verifies the correctness of our implementation. As a comparison, Fig. 5 shows the valid paths found by the three sample-based planners from OMPL. Different sampling methods are also compared, shown in the bottom row. It is evident that both the bridge and the obstacle-based methods sample more valid configurations inside the narrow corridor.
14
S. Ruan et al.
(a) Ellipse, sparse
(d) SuperEllipse, sparse
(b) Ellipse, cluttered
(e) SuperEllipse, cluttered
(c) Ellipse, maze
(f) SuperEllipse, maze
Fig. 4. The maps with elliptical and superelliptical obstacles for benchmarking the algorithms. The red and green ellipses are the robot at the start and goal configurations respectively.
Figure 6 shows the planning time for different algorithms on different maps. The “errorbox” plots demonstrate the statistics of the planning time among the 50 trials of the experiments. Although in the sparse map the Highway RoadMap is almost twice as slow as the probabilistic planners, as the map becomes more dense and cluttered, its speed increases slightly and it starts to take the lead among other planners. It becomes more obvious in the maze map that the speed of Highway RoadMap is much faster than the RRT-connect planner, which is the most efficient among the three sample-based planners from the comparisons. Furthermore, because of deterministic nature of the Highway RoadMap, it behaves much more stable among different trials, while sample-based planners have relatively large variance. The choice of collision checking methods matters, since the sample-based planners becomes slower when the obstacles are superellipses and polygonal collision checking is applied. For the superellipse maze map, almost all the trials for the sample-based planners fail to find a solution within the time limit. Therefore we only show the planning time comparisons for the RRT-Connect planner, which illustrates that Highway RoadMap is more advantageous when solving the “narrow passage” problem by having a higher quantity of valid vertices in less computational time. 4.4
Discussion
The extended Highway RoadMap planner solves the path planning problems on SE(2) in different scenarios efficiently. This is due to usage of a closed-form expression of Minkowski sum and difference that explicitly characterizes the C-space, in addition to the effectiveness of the cell decomposition method to solve the “narrow passage” problem. On the other hand, the sample-based
Path Planning for Ellipsoidal Robots and General Obstacles
15
(a) PRM/Uniform
(b) RRT/Uniform
(c) RRT-C/Uniform
(d) PRM/Bridge
(e) RRT/Obstacle-based
(f) RRT-C/Obstacle-based
Fig. 5. A demonstration of the paths found by sample-based planners with different sampling strategies. The environment is the superellipse cluttered map. The blue dots and line segments are valid configurations and edges.
planners are very efficient when the environment is sparse, but start to drop speed as the space occupied by obstacles increases. This is because most of the sampled configurations are discarded, so the “sample-check-discard-resample” process iterates much longer than in sparse environments. The idea of “local C-space” gives a connection strategy for vertices on adjacent C-layers, which requires the existence of the intersection volume between the local C-space of the two vertices. In the SE(2) case, a necessary condition for this existence requirement is that the “gaps” between adjacent C-layers should be smaller that the largest rotational angle, therefore there is a trade-off between selecting the inflation factor and the number of layers. The largest angle of rotation can be computed in closed-form, when fixing the smaller ellipse to the center of the larger one. The major current limitation of the Highway RoadMap planner, however, is that its extension to solve high degree-of-freedom multi-body problems is not yet clear. The computational complexity of a naive implementation would grow exponentially as the dimension of the C-space increases. Addressing this problem is a challenge for future work. One of the direct extensions to the current implementation is to solve the SE(3) planning problem, using a similar computational process. In fact, hybrid planners might be a promising solution, where one can either construct a whole graph structure by sampling in SO(3) and then generate valid vertices and edges in a similar fashion to the PRM method. A second option is to build a tree structure by repeatedly sampling orientations with associated C-layers until reaching the goal, similar to RRT. These possibilities potentially allow the Highway RoadMap planner to solve higher dimensional planning problems with narrow passages.
16
S. Ruan et al.
(a) Ellipse, sparse
(d) Superellipse, sparse
(b) Ellipse, cluttered
(c) Ellipse, maze
(e) Superellipse, cluttered
(f) Superellipse, maze
Fig. 6. Planning time comparisons between different algorithms.
5
Conclusion
This paper proposes a closed-form characterization of Minkowski sums and differences between an ellipsoid and a general convex differentiable surface embedded in Rn . As a specific demonstration, the Minkowski operations between a 2D ellipse and a superellipse are explicitly derived. These formulate the mathematical basis for describing the configuration space of an elliptical robot with superelliptical obstacles. With prior explicit knowledge of the C-space, a path planning problem can be solved by the proposed extended “Highway RoadMap” planner, where configurations with different rotational components can be connected via the procedure of generating the “Local C-space” for each configuration. A benchmark scheme with an elliptical robot and superelliptical obstacles is performed using the Highway RoadMap and PRM, RRT and RRT-Connect planners from the Open Motion Planning Library. Different sampling strategies for the narrow passage problem are also applied to sample-based planners. The results show that the extended Highway RoadMap outperforms the sample-based planners on the cluttered and maze maps, where obstacles occupy much space. Finally, potential extensions of the Highway RoadMap to higher dimensional planning problems with narrow passages are discussed, which the authors plan to explore
Path Planning for Ellipsoidal Robots and General Obstacles
17
more in the future. By combining the efficient explicit descriptions of the configuration space presented here and the effectiveness of the sample-based planners on high dimensional C-space, hybrid path planners can thereby potentially achieve better performance in higher dimensional cluttered environment. Acknowledgements. The authors would like to thank Dr. Fan Yang, Mr. Thomas W. Mitchel and Mr. Zeyi Wang for useful discussions. This work was performed under National Science Foundation grant IIS-1619050 and Office of Naval Research Award N00014-17-1-2142. The ideas expressed in this paper are solely those of the authors.
References 1. Kavraki, L.E., Svestka, P., Latombe, J.C., Overmars, M.H.: Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Trans. Robot. Autom. 12(4), 566–580 (1996) 2. LaValle, S.M.: Rapidly-exploring random trees: a new tool for path planning (1998) 3. Kuffner, J.J., LaValle, S.M.: RRT-connect: an efficient approach to single-query path planning. In: Proceedings of the IEEE International Conference on Robotics and Automation, ICRA 2000, vol. 2, pp. 995–1001. IEEE (2000) 4. Bohlin, R., Kavraki, L.E.: Path planning using lazy PRM. In: Proceedings of the IEEE International Conference on Robotics and Automation, ICRA 2000, vol. 1, pp. 521–528. IEEE (2000) 5. Varadhan, G., Manocha, D.: Star-shaped roadmaps - a deterministic sampling approach for complete motion planning. In: Robotics: Science and Systems, vol. 173. Citeseer (2005) 6. Rodriguez, S., Tang, X., Lien, J.M., Amato, N.M.: An obstacle-based rapidlyexploring random tree. In: Proceedings 2006 IEEE International Conference on Robotics and Automation, ICRA 2006, pp. 895–900. IEEE (2006) 7. Hsu, D., Jiang, T., Reif, J., Sun, Z.: The bridge test for sampling narrow passages with probabilistic roadmap planners. In: Proceedings of the IEEE International Conference on Robotics and Automation, ICRA 2003, vol. 3, pp. 4420–4426. IEEE (2003) 8. Shi, K., Denny, J., Amato, N.M.: Spark PRM: using RRTs within PRMs to efficiently explore narrow passages. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 4659–4666. IEEE (2014) 9. Yan, Y., Ma, Q., Chirikjian, G.S.: Path planning based on closed-form characterization of collision-free configuration-spaces for ellipsoidal bodies obstacles and environments. In: Proceedings of the 1st International Workshop on Robot Learning and Planning, pp. 13–19 (2016) 10. Barr, A.H.: Superquadrics and angle-preserving transformations. IEEE Comput. Graphics Appl. 1(1), 11–23 (1981) 11. Jaklic, A., Leonardis, A., Solina, F.: Segmentation and Recovery of Superquadrics, vol. 20. Springer, Dordrecht (2013) 12. Agba, E.I., Wong, T.L., Huang, M.Z., Clark, A.M.: Objects interaction using superquadrics for telemanipulation system simulation. J. Robot. Syst. 10(1), 1–22 (1993) 13. Guibas, L., Ramshaw, L., Stolfi, J.: A kinetic framework for computational geometry. In: 24th Annual Symposium on Foundations of Computer Science, pp. 100–111. IEEE (1983)
18
S. Ruan et al.
14. Latombe, J.C.: Robot Motion Planning, vol. 124. Springer, Boston (2012) 15. Kavraki, L.E.: Computation of configuration-space obstacles using the fast Fourier transform. IEEE Trans. Robot. Autom. 11(3), 408–413 (1995) 16. Agarwal, P.K., Flato, E., Halperin, D.: Polygon decomposition for efficient construction of Minkowski sums. Comput. Geom. 21(1), 39–61 (2002) 17. Nelaturi, S., Shapiro, V.: Configuration products and quotients in geometric modeling. Comput. Aided Des. 43(7), 781–794 (2011) 18. Lien, J.M.: Hybrid motion planning using Minkowski sums. In: Proceedings of Robotics: Science and Systems IV (2008) 19. Nelaturi, S., Shapiro, V.: Solving inverse configuration space problems by adaptive sampling. Comput. Aided Des. 45(2), 373–382 (2013) 20. Fogel, E., Halperin, D.: Exact and efficient construction of Minkowski sums of convex polyhedra with applications. Comput. Aided Des. 39(11), 929–940 (2007) 21. Hachenberger, P.: Exact Minkowksi sums of polyhedra and exact and efficient decomposition of polyhedra into convex pieces. Algorithmica 55(2), 329–345 (2009) 22. Behar, E., Lien, J.M.: Dynamic Minkowski sum of convex shapes. In: 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 3463–3468. IEEE (2011) 23. Kurzhanskiy, A.A., Varaiya, P.: Ellipsoidal Toolbox (ET). In: Proceedings of the 45th IEEE Conference on Decision and Control, pp. 1498–1503 (2006). https:// doi.org/10.1109/CDC.2006.377036 24. Hartquist, E.E., Menon, J., Suresh, K., Voelcker, H.B., Zagajac, J.: A computing strategy for applications involving offsets, sweeps, and Minkowski operations. Comput. Aided Des. 31(3), 175–183 (1999) 25. Yan, Y., Chirikjian, G.S.: Closed-form characterization of the Minkowski sum and difference of two ellipsoids. Geom. Dedicata 177(1), 103–128 (2015) 26. Chirikjian, G.S., Yan, Y.: The kinematics of containment. In: Lenarˇciˇc, J., Khatib, O. (eds.) Advances in Robot Kinematics, pp. 355–364. Springer, Cham (2014) 27. Ma, Q., Chirikjian, G.S.: A closed-form lower bound on the allowable motion for an ellipsoidal body and environment. In: Proceedings of 2015 IDETC/CIE, pp. V05CT08A055–V05CT08A055. American Society of Mechanical Engineers (2015) 28. Ruan, S., Ding, J., Ma, Q., Chirikjian, G.S.: The kinematics of containment for N-dimensional ellipsoids. J. Mech. Rob. 11(4) (2019) 29. Sucan, I.A., Moll, M., Kavraki, L.E.: The open motion planning library. IEEE Robot. Autom. Mag. 19(4), 72–82 (2012) 30. de Berg, M., Cheong, O., van Kreveld, M., Overmars, M.: Computational Geometry: Algorithms and Applications. Springer, Heidelberg (2008) 31. Yan, Y.: Geometric motion planning methods for robotics and biological crystallography. Ph.D. thesis (2014) 32. Choset, H.M., Hutchinson, S., Lynch, K.M., Kantor, G., Burgard, W., Kavraki, L.E., Thrun, S.: Principles of Robot Motion: Theory, Algorithms, and Implementation. MIT Press, Cambridge (2005) 33. Chirikjian, G.S., Mahony, R., Ruan, S., Trumpf, J.: Pose changes from a different point of view. J. Mech. Robot. 10(2), 021008 (2018) 34. Pan, J., Chitta, S., Manocha, D.: FCL: a general purpose library for collision and proximity queries. In: 2012 IEEE International Conference on Robotics and Automation (ICRA), pp. 3859–3866. IEEE (2012)
Free Space of Rigid Objects: Caging, Path Non-existence, and Narrow Passage Detection Anastasiia Varava(B) , J. Frederico Carvalho, Florian T. Pokorny, and Danica Kragic KTH Royal Institute of Technology, Stockholm, Sweden {varava,jfpbdc,fpokorny,dani}@kth.se
Abstract. In this paper, we present an approach towards approximating configuration spaces of 2D and 3D rigid objects. The approximation can be used to identify caging configurations and establish path non-existence between given pairs of configurations. We prove correctness and analyse completeness of our approach. Using dual diagrams of unions of balls and uniform grids on SO(3), we provide a way to approximate a 6D configuration space of a rigid object. Depending on the desired level of guaranteed approximation accuracy, the experiments with our single core implementation show runtime between 5–21 s and 463–1558 s. Finally, we establish a connection between robotic caging and molecular caging from organic chemistry, and demonstrate that our approach is applicable to 3D molecular models. The supplementary material for this paper can be found at https://anvarava.github.io/publications/ wafr-2018-supplementary-material.pdf.
Keywords: Caging
1
· Path non-existence · Computational geometry
Introduction
Understanding the global topological and geometric properties of the free space is a fundamental problem in several fields. In robotics, it can be applied to manipulation and motion planning. In computational chemistry and biology, it may be used to predict how molecules can restrict each other’s mobility, which is important for sorting molecules, drug delivery, etc. [14,22]. In robotic manipulation, the mobility of an object may be constrained by manipulators and/or obstacles. When the object cannot escape arbitrarily far from an initial configuration, we say that the object is caged. Formally, this means that it is located in a compact path-connected component of its free space. One of the biggest challenges in caging is verification – i.e., designing efficient algorithms which are able to provide theoretical guarantees that an object in a given configuration is caged. One approach towards caging verification relies on A. Varava and J. Frederico Carvalho—These two authors contributed equally. c Springer Nature Switzerland AG 2020 M. Morales et al. (Eds.): WAFR 2018, SPAR 14, pp. 19–35, 2020. https://doi.org/10.1007/978-3-030-44051-0_2
20
A. Varava et al.
particular geometric and topological features of the object under consideration. This approach can be computationally efficient, but is limited to objects with certain shape properties, such as the existence of narrow parts or handles. To deal with objects and obstacles of arbitrary shape, one needs to show that an object is located in a bounded path-connected component of its free space. Direct reconstruction of a 6D configuration space is computationally expensive, and to the best of our knowledge no fast and rigorous approach has been proposed so far. In this paper, we address this problem.
Fig. 1. Diagram of our method for different application scenarios. We approximate the collision space of an object by choosing a finite set of fixed object’s orientations and considering the corresponding projections of the collision space to Rn (n ∈ {2, 3}).
We consider caging a special case of the problem of proving path nonexistence between a pair of configurations. To show that two configurations are disconnected, we construct an approximation of the object’s collision space. Intuitively, we construct a set of projections of a subset of the object’s configuration space to subspaces corresponding to fixed orientations of the object, see Fig. 1. By construction, our approximation is a proper subset of the object’s collision space, which implies that when our algorithm reports that the two configurations are disconnected, then there is no path between them in the real free space. However, for the same reason, our algorithm is not guaranteed to find all possible caging configurations, since we do not reconstruct the entire collision space. In [27], we presented our initial work on this approach without analyzing its theoretical properties and discussing practical details of its implementation. The core contributions of the present paper with respect to [27] can be summarized as follows: (i) we present a way to implement our algorithm using geometric constructions from [4] and [34], and run experiments in a 2D and 3D workspace; (ii) we provide an accurate bound for the displacement of a 3D object induced by a given rotation, and discuss how to practically discretize SO(3) using a technique from [33]; (iii) we generalize our approach to narrow passage detection and construct δ-clearance configuration space approximations; (iv) we provide a correctness proof of our algorithm by showing that if in our approximation of
Free Space of Rigid Objects
21
the free space two configurations are disconnected, then they are disconnected in the actual free space; (v) we prove δ-completeness of the algorithm: if two configurations are disconnected, we can construct a good enough approximation of the free space to show that these configurations are either disconnected or connected by a narrow passage; (vi) we establish a connection between robotic caging and molecular caging in organic chemistry and discuss possible future research in this direction; as a proof of concept, we run our algorithm on a pair of molecules studied in [14] and obtain the same result as the authors in [14] obtained in a chemical laboratory.
2
Related Work
In manipulation, caging can be considered as an alternative to a force-closure grasp [9,10,18,26], as well as an intermediate step on the way towards a formclosure grasp [21]. Unlike classical grasping, caging can be formulated as a purely geometric problem, and therefore one can derive sufficient conditions for an object to be caged. To prove that a rigid object is caged, it is enough to prove this for any subset (part) of the object. This allows one to consider caging a subset of the object instead of the whole object, and makes caging robust to noise and uncertainties arising from shape reconstruction and position detection. The notion of a planar cage was initially introduced by Kuperberg in 1990 [5] as a set of n points lying in the complement of a polygon and preventing it from escaping arbitrarily far from its initial position. In robotics, it was subsequently studied in the context of point-based caging in 2D by Rimon and Blake [19], Pipattanasomporn and Sudsang [16], Vahedi and van der Stappen [28], and others. A similar approach has also been adopted for caging 3D objects. For instance, Pipattanasomporn and Sudsang [17] proposed an algorithm for computing all two-finger cages for non-convex polytopes. In [25,29] the authors present a set of 2D caging-based algorithms enabling a group of mobile robots to cooperatively drag a trapped object to the desired goal. In the above mentioned works fingertips are represented as points or spheres. Later, more complex shapes of caging tools were taken into account by Pokorny et al. [18], Stork et al. [23,24], Varava et al. [26], Makita et al. [9,10]. In these works, sufficient conditions for caging were derived for objects with particular shape features. Makapunyo et al. [12] proposed a heuristic metric for partial caging based on the length and curvature of escape paths generated by a motion planner. The authors suggested that configurations that allow only rare escape motions may be successful in practice. We address caging as a special case of the path non-existence problem: an object is caged if there is no path leading it to an unbounded path-connected component. The problem of proving path non-existence has been addressed by Basch et al. [2] in the context of motion planning, motivated by the fact that most modern sampling-based planning algorithms do not guarantee that two configurations are disconnected, and rely on stopping heuristics in such situations [6]. Basch et al. prove that two configurations are disconnected when the
22
A. Varava et al.
object is ‘too big’ or ‘too long’ to pass through a ‘gate’ between them. There are also some related results on approximating configuration spaces of 2D objects. In [31], Zhang et al. use approximate cell decomposition and prove path nonexistence for 2D rigid objects. They decompose a configuration space into a set of cells and for each cell decide if it lies in the collision space. In [13] McCarthy et al. propose a related approach. There, they randomly sample the configuration space of a planar rigid object and reconstruct its approximation as an alpha complex. They later use it to check the connectivity between pairs of configurations. This approach has been later extended to planar energy-bounded caging [8]. The problem of explicit construction (either exact or approximate) of configuration spaces has been studied for several decades in the context of motion planning, and a summary of early results can be found in [30]. In [7] the idea of slicing along the rotational axis was introduced. To connect two consecutive slices, the authors proposed to use the area swept by the robot rotating between two consecutive orientation values. In [32], this idea was extended to using both outer and inner swept areas to construct a subset and a superset of the collision space of polygonal robots. The outer and inner swept areas are represented as generalized polygons defined as the union and intersection of all polygons representing robot’s shape rotating in a certain interval of orientation values, respectively. Several recent works propose methods for exact computation of configuration spaces of planar objects [3,15]. In [3], a method towards exact computation of the boundary of the collision space is proposed. In [15], the authors explicitly compute the free space for complete motion planning. As we can see, several approaches to representing configuration spaces of 2D objects, both exact and approximate, have been proposed and successfully implemented in the past. The problem however is more difficult if we consider a 3D object, as its configuration space is 6-dimensional. In the recent survey on caging [11], the authors hypothesis that recovering a 6D configuration space and understanding caged subspaces is computationally infeasible. We present a provably-correct algorithm for approximating 3D and 6D configuration spaces, which has a reasonable runtime on a single core of Intel Core i7 processor. We study path-connectivity of the free space of 2D and 3D rigid objects. We do this by decomposing the configuration space into a finite set of lower dimensional slices. Although the idea of slicing is not novel and appears in the literature as early as in 1983 [7], recent advances in computational geometry and topology, as well as a significant increase in processing power, have made it easier to approximate a 6D configuration space on a common laptop. When dealing with slicing a 6D configuration space, we identify two main challenges: how to quickly compute 3D slices, and how to efficiently discretize the orientation space. For slice approximation, our method relies on simple geometric primitives— unions of balls, which makes our algorithm easy to implement and generalizable to 6D configuration spaces. Given a subset of an object, we determine how densely the set of orientations should be sampled in order to approximate the collision space of the object, and use dual diagrams of a union of balls [4] to approximate the free space. This way, we do not need to use generalized polygons,
Free Space of Rigid Objects
23
which makes previous approaches more difficult in 2D and very hard to generalize to 3D workspaces. For SO(3) discretization, we use the method from [33], which provides a uniform grid representation of the space. The confluence of these factors result in overcoming the dimensionality problem without losing necessary information about the topology of the configuration space, and achieving both practical feasibility and theoretical guarantees at the same time. Finally, our method does not require precise information about the shape of objects and obstacles, and the only requirement is that balls must be located strictly inside them, which makes our approach robust to noisy and incomplete sensor data. The focus of this paper is on proving path non-existence rather than complete motion planning. In practice, it is often enough to have a rough approximation of the free space to find its principal connected components, which is easier to compute than its exact representation.
3
Free Space Decomposition
Since our work is related both to object manipulation and motion planning, we use the general term ‘object’ without loss of generality when talking about objects and autonomous rigid robots (e.g., disc robots) moving in n-dimensional workspaces1 . Definition 1. A rigid object is a compact connected non-empty subset of Rn . A set of obstacles is a compact non-empty subset of Rn . We want to represent both the object and the obstacles as a set of ndimensional balls. Therefore, we do not allow them to have ‘thin parts’. Formally, we assume that they can be represented as regular sets [20]: Definition 2. A set U is regular if it is equal to the closure of its interior: U = cl(int(U )). We approximate both the obstacles and the object as unions of balls lying in their interior, S = {BR1 (X1 ), . . . , BRn (Xn )} and O = {Br1 (Y1 ), . . . , Brm (Ym )} of radii R1 , . . . , Rn and r1 , . . . , rm respectively. Let C(O) = SE(n) denote the configuration space of the object. We define its collision space 2 C col (O) as the set of the objects configurations in which the object penetrates the obstacles: Definition 3. C col (O) = {c ∈ C | [int c(O)] ∩ [int S] = ∅}, where c(O) denotes the object in a configuration c. The free space C f ree (O) is the complement of the collision space: C f ree (O) = C(O) − C col (O). In [27], we suggested that configuration space decomposition might be a more computationally efficient alternative to its direct construction. Namely, we 1 2
Throughout the paper n ∈ {2, 3}. A theoretical analysis of different ways to define the free space can be found in [20].
24
A. Varava et al.
represent the configuration of an object space as a product C(O) = Rn × SO(n), and consider a finite covering of SO(n) by open sets (this is always possible, since SO(n) is compact): SO(n) = i∈{1,...,s} Ui . We recall the notion of a slice, introduced in [27]: Definition 4. A slice of the configuration space C(O) of a rigid object, is a subset of C(O) defined as follows: SlU (O) = Rn × U , where U is an subset of SO(n). f ree col (O) (SlU (O)). For We denote a slice of the collision (free) space by SlU col (O) of its collision space in such each slice we construct an approximation aSlU a way that our approximation lies inside the real collision space of the slice, col col (O) ⊂ SlU (O). aSlU This way, we approximate the entire collision space by a subset aC col (O):
⎛ aC col (O) = ⎝
⎞ col aSlU (O)⎠ ⊂ C col (O) i
i∈{1,...,s}
An Object’s ε-Core. First of all observe that by Definition 3, if a subset O of an object O placed in configuration c ∈ C(O) is in collision, then the entire object O is in collision. Therefore, the collision space of O is completely contained within the collision space of O . This allows us to make the following observation: Observation 1. Consider an object O and a set of obstacles S. Let c1 , c2 ∈ C f ree (O) be two collision-free configurations of the object. If there is no collisionfree path between these configurations for its subset O ⊂ O, then there is no collision-free path connecting these configurations for O. Therefore, if some subset O of O in configuration c is caged, then the entire object O in the same configuration c is caged. This means that if we construct col col in such a way that for any configuration c ∈ aSlU there exists a subset aSlU O of c(O) such that O is in collision, then c(O) is also in collision. In [27] we defined an ε-core of an object as follows: Definition 5. The ε-core of an object O is the set Oε comprising the points of O which lie at a distance3 of at least ε from the boundary of O: Oε = {p ∈ O | d(p, ∂O) ≥ ε}. Now, for an object O and its ε-core Oε , we write Oφ and Oεφ respectively to mean that their orientation is fixed at φ ∈ SO(n). So, let C col (Oεφ ) denote the collision space of Oε with a fixed orientation φ. Note that since the orientation is fixed, we can identify C col (Oεφ ) with a Fig. 2. An ε-core remains inside the object when we slightly subset of Rn . 3
rotate it By distance here we mean Euclidean distance in Rn .
Free Space of Rigid Objects
25
In [27], we showed that for an object O, ε > 0 and a fixed orientation φ ∈ SO(n) there exists a non-empty neighbourhood U (φ, ε) of φ such that for any θ ∈ U (φ, ε), Oεφ is contained in Oθ , see Fig. 2. We approximate the collision space as follows: we pick an ε > 0 and a set of orientation values {φ1 , . . . , φs } so that U (φ1 , ε), . . . , U (φs , ε) cover SO(n), and so that for any θ ∈ U (φi , ε) we have Oεφi ⊂ Oθ . Finally, for each φi ∈ {φ1 , . . . , φs }, we compute collision space slice approxcol φi col col imations aSlU (Oεφi ) × (φi ,ε) as the collision space of Oε , aSlU (φi ,ε) = C U which results in our approximation of the collision space: aCεcol (O) = (φi , ε)col φi aSlU (φi ,ε) . Discretization of SO(n): Elements of SO(n) can be seen as parametrizing rotations in Rn , so for any q ∈ SO(n) we define Rq as the associated rotation. Let D(Rq ) denote the maximal displacement of any point p ∈ O after applying Rq , i.e. D(Rq ) = maxp∈O (d(p, Rq (p))), then Oε ⊂ Rq (O) if D(Rq ) < ε. In [27] we have derived the following upper bound for the displacement of a two-dimensional object: D(Rq ) ≤ 2| sin(q/2)| · rad(O) assuming that we rotate the object around its geometric center and rad(O) denotes the maximum distance from it to any point of the object, and q is the rotation angle. In the two-dimensional case, discretization of the rotation space is trivial: given an ε we pick q ∈ SO(2) such that D(Rq ) < ε, and compute a set of orientation samples {φ1 = 0, φ2 = 2q, . . . , φs = 2(s − 1)q}, where s = π/q . This gives us a covering {U (φ1 , ε), . . . , U (φs , ε)} of SO(2), where for each i ∈ {1, . . . , s} we define U (φi , ε) = [φi − q, φi + q]4 . Let us now discuss the three-dimensional case. Similarly to the previous case, our goal is to cover SO(3) with balls of fixed radius. To parametrize SO(3) we use unit quaternions. For simplicity, we extend the real and imaginary part notation from complex numbers to quaternions, where q and q denote the real and “imaginary” parts of the quaternion q. Further, we identify q = qi i + qj j + qk k with the vector (qi , qj , qk ) ∈ R3 ; and we write q¯, and |q| to mean the conjugate √ q − q and the norm q q¯, respectively. We use angular distance [33] to define the distance between two orientations: ρ(x, y) = arccos(|x, y|), where x and y are two elements of SO(3) represented as unit quaternions which are regarded as vectors in R4 , and x, y is their inner product. A unit quaternion q defines q . a rotation Rq as the rotation of angle θq = 2 cos−1 (q) around axis wq = |q| This allows one to calculate the displacement of the rotation D(q) = D(Rq ) as: θ D(q) = 2 sin( 2q ) = 2 sin(cos−1 (q)) = 2| q|. Define the angle distance from a point to a set S ⊆ SO(3) in the usual way as ρ(S, x) = miny∈S ρ(y, x). In [33], the authors provide a deterministic sampling scheme to minimize the dispersion δ(S) = maxx∈SO(3) ρ(S, x). Intuitively, the dispersion of a set δ(S) determines how far a point can be from S, and in this way it determines how well the set S covers SO(3). Now, assume we are given a set of samples S ⊆ SO(3) such that δ(S) < δ. Then for any point p ∈ SO(3) 4
This is a cover by closed sets, but given q > q satisfying D(Rq ) < ε we can use instead U (φi , ε) = (φi − q , φi + q ) which results on the same graph.
26
A. Varava et al.
¯ = maxq∈S D(p¯ denote D(pS) q ), we want to show that in these conditions, there ¯ < ε. This would imply that exists some small ε such that maxp∈SO(3) D(pS) there exists some δ > δ such that if we take U = {q ∈ SO(3) | ρ(1, q) < δ }, then denoting Up = {qp | q ∈ U }, the family {Up }p∈S fully covers SO(3) and for any q ∈ Up satisfies D(q p¯) < ε. Now, Proposition 1 allows us to establish the relation between the distance between two quaternions and displacement associated to the rotation between them. Proposition 1. Given two unit quaternions p, q, the following equation holds: D(p¯ q ) = 2 sin(ρ(p, q)).
(1)
The proof of Proposition 1 can be found in the supplementary material. This means that if we want to cover SO(3) with patches Ui centered at a point pi such that D(Rqp¯i ) is smaller than some ε, we can use any deterministic sampling scheme on SO(3) (e.g. [33]) to obtain a set S with dispersion δ(S) < arcsin( 2ε ). Note that if S ⊆ SO(3) has a dispersion δ(S) < arcsin( 2ε ) then the family of patches {U (s, ε) | s ∈ S} forms a cover of SO(3). Finally, given such a cover of SO(3), recall that we want to approximately f ree reconstruct the full free space of the object C f ree as the union of slices aSlU (s,ε) . This requires us to test the whether two slices overlap. To make this efficient, we create a slice adjacency graph, which is simply a graph with vertices S and edges (s, s ) if U (s, ε) ∩ U (s , ε) = ∅. To compute this graph, we use the fact that U (s, ε) = {p ∈ SO(3)|ρ(s, p) < δ} for some δ, and so if two slices U (s, ε), U (s , ε) are adjacent, there must exist some p such that ρ(p, s), ρ(p, s ) < δ, which implies ρ(s, s ) < ρ(s, p) + ρ(s , p) < 2δ. We use this fact to calculate the slice adjacency graph in Algorithm 1.
Algorithm 1: ComputeAdjacentOrientations 1 2 3 4 5 6 7 8 9 10 11 12
input : S /* Set of points in SO(3). output: G /* Patch adjacency graph. V ← S; δ ← δ(S); T ← KDT ree(V ); E ← {}; for p ∈ V do P ← T.query(p, dist ≤ 2 sin(δ) + τ ) \ {p} ; P ← L ∪ T.query(−p, dist ≤ 2 sin(δ) + τ ); for q ∈ L do add (p, q) to E; end end return G = (V, E);
*/ */
The algorithm starts by setting V = S, as this is the set of vertices in the graph, and we put these vertices in a KDT ree in order to quickly perform nearest
Free Space of Rigid Objects
27
neighbor queries. Now, to compute the set of edges E, we locate for each p ∈ S the points at an angle distance smaller than 2δ 5 in line 6. Finally, in line 7 we also add edges to the points at an angle distance smaller than 2δ of −p, as both p and −p represent the same orientation. One of the prerequisites of Algorithm 1 is the availability of an estimate of the dispersion of δ(S). In our implementation we used the algorithm in [33] to compute S where the authors provide a provable upper bound for dispersion. However because this is an upper bound, and we want a tight estimate of the dispersion, in our implementation we employed a random sampling method to estimate the dispersion. At each step we draw a uniform sample p and compute its two nearest neighbors q, q . We estimate the displacement at point p to be half the (angle) distance between q and q . The final estimate is taken to be the maximum over the estimate for every sample taken. Construction of Slices. In [27], wederive the following representation for the collision space of Oεφi : C col (Oεφ ) = i,j (BRj +ri −ε (Xj − GYi )), where G is the origin chosen as the geometric center of the object, and GYi denotes the vector from G to Yi . Now, let us discuss how we construct path-connected components of C f ree (Oεφ ), see Algorithm 2. For this, we first need to approximate the complement of C col (Oεφ ), which is represented as a finite collection of balls. We do this by constructing its dual diagram [4]. A dualdiagram Dual( Bi ) of a union of balls Bi is a finite collection of balls such that the complement of its interior iscontained in and is homotopy equivalent to Bi . It is convenient to approximate C f ree (Oεφ ) as a dual diagram of collision space balls for several reasons. First, balls are simple geometric primitives which make intersection checks trivial. Second, the complement of the dual diagram is guaranteed to lie strictly inside the collision space, which provides us with a conserva- Fig. 3. Dual diagram of a set of tive approximation of the free space. Finally, circles. Red circles with blue cenhomotopy equivalence between its comple- ters represent the collision space, ment and the approximate collision space the yellow circles approximate the implies that our approximation preserves the free space. The black edges form the associated weighted Voronoi diaconnectivity of the free space that we want gram. to approximate. Another advantage of a dual diagram is that it is easy to construct. First, recall that a weighted Voronoi diagram is a special case of a Voronoi diagram, where instead of Euclidean distance between points a weighted distance 5
Since the KDT ree T uses the Euclidean distance in R4 we employ the formula ). p − q = 2 sin( ρ(p,q) 2
28
A. Varava et al.
function is used. In our case, the weighted distance of a point x ∈ Rn from 2 Bri (zi ) is equal to dw (x, B ri (zi )) = ||x − zi || − ri . Consider a weighted Voronoi diagram of a set of balls i Bri (zi ). For each vertex yj of the Voronoi diagram, let Bqj (yj ) whose radius qj is equal to the square root of the weighted distance of yj to any of the four (three in the two-dimensional case) balls from i Bri (zi ) generating yj . Then, take each infinite edge of the Voronoi diagram, and add an degenerate “infinitely large” ball (a half-space) with center at the infinite end of this edge, see Fig. 3. After constructing the dual diagram, it is easy to compute connected components of the slice. Note that there always is one infinite component corresponding to the outside world, and caging configurations are contained in bounded components. Algorithm 2: Compute-Slice-Connectivity 1 2 3 4 5 6 7 8 9 10 11
Data: A union of d-dimensional balls C col (Oεφ ) Result: A set of connected components aC0 , . . . , aCn V (C col (Oεφ )) ← Weighted-Voronoi-Diagram(C col (Oεφ )) aC f ree ← Dual-Diagram(V (C col (Oεφ ))) aC0 ← Compute-Infinite-Component() i←0 foreach ball ∈ aC f ree do if Connected-Component(ball) = ∅ then i←i+1 aCi ← Compute-Component(ball) end end return {aC0 , . . . , aCn }
The Connectivity Graph. A naive algorithm described in [26] requires us to first compute all the slices and their connected components, and only after that compute the connectivity graph. This strategy leads to suboptimal usage of memory—in practice, when one needs to compute around 30000 slices, storing all their free space representations at the same time is not feasible on a common laptop. However, we notice that this is not necessary, since the representation of each slice is needed only to check whether its connected components overlap with the connected components of adjacent slices. Therefore, in our implementation each slice is deleted as soon as the connectivity check between it and the slices adjacent to it is performed. Let us now discuss how we check whether two connected components overlap. Let G(aCεcol (O)) = (V, E) be a graph approximating the free space. The vertices of G correspond to the connected components {aC1i , . . . , aCni i } of each slice, i ∈ {1, . . . , s}, and are denoted by v = (aC, U ), where aC and U are the corresponding component and orientation interval. Two vertices representf ree f ree and Cq ⊂ aSlU , i = j, are connected by an edge ing components Cp ⊂ aSlU i j if the object can directly move between them. For that, both the sets Ui , Uj , and aCq , aCp must overlap: Ui ∩ Uj , aCq ∩ aCp = ∅. G(aCεcol (O)) approximates the free space of the object: if there is no path in G(aCεcol (O)) between the vertices associated to the path components of two configurations c1 , c2 , then they
Free Space of Rigid Objects
29
are disconnected in C f ree (O). Finally, to check whether two connected components in adjacent slices intersect, we recall that they are just finite unions of balls. Instead of computing all pairwise intersections of balls, we approximate each ball by its bounding box and then use the CGAL implementation of Zomorodian’s algorithm [34], which efficiently computes the intersection of two sequences of three-dimensional boxes. Every time it finds an overlapping pair of boxes, we check whether the respective balls also intersect. Remark 1. Recall that in each slice the aC f ree (Oεφi ) are constructed as the dual of the collision space C col (Oεφi ), which entails that aC f ree (Oεφi ) has the same connectivity as C f ree (Oεφi ). However, it also entails that any connected component of aC f ree (Oεφi ) partially overlaps with the collision space C col (Oεφi ). This means that for two connected components Cji , Cji of adjacent slices which do not overlap, it may occur that the corresponding approximations aCji , aCji do overlap. In this case the resulting graph G(aCεcol (O)) would contain an edge between the corresponding vertices. This effect can be mitigated by verifying whether the overlap between the approximations occurs within the collision space of both slices. This can be done for example by covering the intersection aCji ∩ aCji with a union of balls and checking if it is contained inside the collision space φ C col (Oεφi ) ∪ C col (Oε i ). Existence of δ-Clearance Paths. For safety reasons, in path planning applications a path is often required to keep some minimum amount of clearance to obstacles. The notion of clearance of an escaping path can also be applied to caging: one can say that an object is partially caged if there exist escaping paths, but their clearance is small and therefore the object is unlikely to escape. Consider a superset of our object O, defined as a set of points lying at most at distance δ from O, and let us call it a δ-offset of the object: O+δ = {p ∈ Rn | d(p, O) ≤ δ}. Now observe that two configurations are δ-connected if and only if there exists a collision-free path connecting these configurations in the configuration space of the δ-offset O+δ of the object. This means that two configurations c1 and c2 are not δ-connected in C f ree (O) if and only if they are not path-connected in C f ree (O+δ ). Therefore, to understand the δ-connectivity of the free space it is enough to compute path-connected components as previously described, for the δ-offset O+δ . One application of this observation is narrow passage detection. One can use our free space approximation to identify narrow passages: if two configurations f ree (O+δ ), but connected in aCεf ree (O), then they are are disconnected in aCε+δ connected by a narrow passage of width at most δ. Our approximation then can be used to localize this passage, so that probabilistic path planning algorithm can sample in this location. Furthermore, we can view δ as the level of accuracy of the approximation: if δ = 0, and our algorithm reports path non-existence, then this result is provably correct; for δ > 0, we can guarantee non-existence of paths whose clearance does not exceed δ. This means that depending on the desired level of accuracy
30
A. Varava et al.
one can construct a coarse approximation of the free space by looking at the f ree (O+δ ). This approximation requires fewer slices, and path connectivity of aCε+δ Sect. 5 demonstrates that this significantly reduces the computation time.
4
Theoretical Properties of Our Approach
In this section, we discuss correctness, completeness and computational complexity of our approach. The proofs of Propositions 2 and 3, as well as the discussion about computational complexity, can be found in the supplementary material. First of all, let us show that our algorithm is correct: i.e., if there is no collision-free path between two configurations in our approximation of the free space, then these configurations are also disconnected in the actual free space. Proposition 2 (correctness). Consider an object O and a set of obstacles S. Let c1 , c2 be two collision-free configurations of the object. If c1 and c2 are not path-connected in G(aCεf ree (O)), then they are not path-connected in C f ree (O). Now, we show that if two configurations are not path-connected in C f ree (O), we can construct an approximation of C f ree (O) in which these configurations are either disconnected or connected by a narrow passage. Proposition 3 (δ-completeness). Let c1 , c2 be two configurations in C f ree (O). If they are not path-connected in C f ree (O), then for any δ > 0 there exists ε > 0 such that the corresponding configurations are not path-connected in G(aCεf ree (O+δ )), where the graph is produced according to the procedure outlined in Remark 1. Computational Complexity. The complexity of the first part of the algorithm (slice construction) is O(n2 m2 ), where n and m are the number of balls in the object’s and the obstacle’s spherical representation; the complexity of the second part (connectivity graph construction) is O(sl(b log3 (b) + k)), where s is the number of slices, l is the number of neighbours of each slice, b is the maximum number of balls in each pair of connected components, and k the number of pairwise intersections of the balls. The details can be found in the supplementary material.
5
Examples and Experiments
We implemented our algorithm for 2D and 3D objects. All experiments were run on a single core of a computer with Intel Core i7 processor and 16 GB GPU. Different Approximations of the Workspace. In this experiment, we consider how the accuracy of a workspace spherical approximation affects the output and execution time of the algorithm, see Fig. 4. For our experiments, we generate a workspace and an object as polygons, and approximate them with unions of balls of equal radii lying strictly inside the polygons. Note that the choice of the
Free Space of Rigid Objects
31
Fig. 4. In the left we present a set of 2D obstacles in blue, and an object in green. The numbers represent the connected components of the free space. In red are three approximations of the obstacles by sets of balls of radius 15, 10, and 4, respectively. Note that the smaller the radius the more features from the original configuration space are preserved.
radius is important: when it is small, we get more balls, which increases the computation time of our algorithm; on the other hand, when the radius is too large, we lose some important information about the shape of the obstacles, because narrow parts cannot be approximated by large balls, see Fig. 4. We consider a simple object whose approximation consists of 5 balls. We run our algorithm for all the 3 approximations of the workspace, and take 5 different values of ε, see Table 1. We can observe that as we increase the ε the computation time decreases. This happens because we are using fewer slices. However, we can also observe that when the ε is too large, our approximation of the collision space becomes too small, and we are not able to find one connected component (see the last column of Table 1). Table 1. We report the number of path-connected components we found and the computation time for each case. When we were using our first approximation of the workspace, we were able to distinguish only between components 4 and 2 (see Fig. 4), and therefore prove path non-existence between them. For a more accurate approximation, we were also able to detect component 3. Finally, the third approximation of the workspace allows us to prove path non-existence between every pair of the four components. ε
R = 15
R = 10
R=4
0.30 · r 2 c 741 ms 3 c 1287 ms 4 c 1354 ms 0.33 · r 2 c 688 ms 3 c 1208 ms 4 c 1363 ms 0.37 · r 2 c 647 ms 3 c 1079 ms 4 c 1287 ms 0.40 · r 2 c 571 ms 3 c 986 ms
3 c 1156 ms
0.43 · r 2 c 554 ms 3 c 950 ms
3 c 1203 ms
Different Types of 3D Caging. As we have mentioned in Sect. 2, a number of approaches towards 3D caging is based on identifying shape features of objects and deriving sufficient caging conditions based on that. By contrast, our method is not restricted to a specific class of shapes, and the aim of this section is to
32
A. Varava et al.
consider examples of objects of different types and run our algorithm on them. The examples are depicted on Fig. 5, and Table 2 reports execution time for different resolutions of the SO(3) grid. Table 2. Results from running 3D experiments in the 4 different scenarios described in Fig. 5 and the molecular model in Fig. 6 using 3 different resolutions for the SO(3) grid. The number of balls used to approximate the collision space of each model is indicated in parenthesis next to the model name. We report the computational time and the value of guaranteed clearance δ for each case. #Slices Rings (320)
Narrow part (480) Surrounding (266) Gravity (348) Molecules (2268)
576
5 s δ = 2.4
6 s δ = 5.9
10 s δ = 1.2
4 s δ = 1.2
21 s δ = 1.02
4608
49 s δ = 0.7
70 s δ = 2.4
108 s δ = 0.4
41 s δ = 0.4
181 s δ = 0.34
540 s δ = 0.0 785 s δ = 0.0
1073 s δ = 0.0
36864
463 s δ = 0.0 1558 s δ = 0.0
Fig. 5. Different 3D caging scenarios. From left to right: linking-based caging linkingbased caging [18, 23, 24], narrow part-based caging [27, 35], surrounding-based caging, and caging with gravity [8]. In linking-based and narrow part-based caging the obstacles form a loop (not necessarily closed) around a handle or narrow part of the object. In surrounding-based the object is surrounded by obstacles so as not to escape. In caging with gravity, gravity complements the physical obstacles in restricting the object’s mobility. Here, we set a fixed level of the potential energy that the object can achieve and consider every configuration with higher potential energy as “forbidden”.
Observe that depending on the desired accuracy of the resulting approximation, different numbers of slices can be used. In our current implementation, we precompute grids on SO(3) using the algorithm from [33], and in our experiments we consider 3 different predefined resolutions. However, given a concrete clearance value δ, one can construct a grid on SO(3) with the necessary resolution based on δ. As we see, our algorithm can be applied to different classes of shapes. It certainly cannot replace previous shape-based algorithms, as in some applications global geometric features of objects can be known a priori, in which case it might be faster and easier to make use of the available information. However, in those
Free Space of Rigid Objects
33
applications where one deals with a wide class of objects without any knowledge about shape features they might have, our algorithm provides a reasonable alternative to shape-based methods. Molecular Cages. In organic chemistry, the notion of caging was introduced independently of the concept of [14]. Big molecules or their ensembles can form hollow structures with a cavity large enough to envelope a smaller molecule. Simple organic cages are relatively small in size and Fig. 6. In this example, we consider one therefore can be used for selective big and two small molecules to see whether our algorithm can predict their abilencapsulation of tiny molecules based ity to form a cage. Atomic coordinates on their shape [14]. In particular, this and (van der Waals) radii were retrieved property can be used for storage and from the Cambridge Crystallographic Data delivery of small drug molecules in Centre (CCDC) database. The depicted living systems [22]. By designing the molecules are mesitylene (left, CCDC caging molecules one is able to con- 907758), CC1 (middle, CCDC 707056), and trol the formation and breakdown of 4-ethyltoluene (right, 1449823) [14]. Our a molecular cage, in such a way as to algorithm reported that CC1 cages mesityremain intact until it reaches a spe- lene, but does not cage 4-ethyltoluene. cific target where it will disassemble These results coincide with the experimenrelease the drug molecule. An exam- tal results reported in [14]. ple on Fig. 6 shows that in principle, our algorithm can be applied to molecular models, assuming they can be treated as rigid bodies. In our future work, we aim to use our algorithm to test a set of potential molecular carriers and ligands to find those pairs which are likely to form cages. This prediction can later be used as a hypothesis for further laboratory experiments. Possible Parallel Implementation. Our algorithm has a lot of potential for parallelization, and in the future we are planning on taking advantage of it. For the examples considered above, our current implementation requires between 40 and 60 ms to compute each slice together with its connected components, and the connectivity check for a pair of slices is even faster. The bottleneck of the algorithm is the number of slices. However, most of the computations can be performed in parallel: indeed, we can first compute the approximation of the collision space for each orientation s in separate threads, and store them in shared memory. After that, we can again run one process per slice to compute the intersections between connected components of adjacent slices. In the future, we are planning on implementing a parallel version of our algorithm.
34
6
A. Varava et al.
Conclusion
In this paper, we provide a computationally feasible and provably-correct algorithm for 3D and 6D configuration space approximation and use it to prove caging and path non-existence. In the future, we will look into how we can make our algorithm faster and applicable to real-time scenarios. In particular, we will work on a parallel implementation, and investigate how the number of slice can be reduced. We are also interested in integrating our narrow passage detection and path non-existence techniques with probabilistic motion planners. Finally, we are going to use our approach to find molecular cages. Acknowledgements. The authors are grateful to D. Devaurs, L. Kavraki, and O. Kravchenko for their insights into molecular caging. This work has been supported by the Knut and Alice Wallenberg Foundation and Swedish Research Council.
References 1. Barraquand, J., Kavraki, L., Latombe, J.-C., Motwani, R., Li, T.-Y., Raghavan, P.: A random sampling scheme for path planning. IJRR 16(6), 759–774 (1997) 2. Basch, J., Guibas, L.J., Hsu, D., Nguyen, A.T.: Disconnection proofs for motion planning. In: IEEE ICRA, pp. 1765–1772 (2001) 3. Behar, E., Lien, J.-M.: Mapping the configuration space of polygons using reduced convolution. In: IEEE/RSJ Intelligent Robots and Systems, pp. 1242–1248 (2013) 4. Edelsbrunner, H.: Deformable smooth surface design. Discrete Comput. Geom. 21(1), 87–115 (1999) 5. Kuperberg, W.: Problems on polytopes and convex sets. In: DIMACS Workshop on Polytopes, pp. 584–589 (1990) 6. Latombe, J.-C.: Robot Motion Planning. Kluwer Academic Publishers, Norwell (1991) 7. Lozano-Perez, T.: Spatial planning: a configuration space approach. IEEE Trans. Comput. 2, 108–120 (1983) 8. Mahler, J., Pokorny, F.T., McCarthy, Z., van der Stappen, A.F., Goldberg, K.: Energy-bounded caging: formal definition and 2-D energy lower bound algorithm based on weighted alpha shapes. IEEE RA-L 1(1), 508–515 (2016) 9. Makita, S., Maeda, Y.: 3D multifingered caging: basic formulation and planning. In: IEEE IROS, pp. 2697–2702 (2008) 10. Makita, S., Okita, K., Maeda, Y.: 3D two-fingered caging for two types of objects: sufficient conditions and planning. Int. J. Mechatron. Autom. 3(4), 263–277 (2013) 11. Makita, S., Wan, W.: A survey of robotic caging and its applications. Adv. Robot. 31(19–20), 1071–1085 (2017) 12. Makapunyo, T., Phoka, T., Pipattanasomporn, P., Niparnan, N., Sudsang, A.: Measurement framework of partial cage quality based on probabilistic motion planning. In: IEEE ICRA, pp. 1574–1579 (2013) 13. McCarthy, Z., Bretl, T., Hutchinson, S.: Proving path non-existence using sampling and alpha shapes. In: IEEE ICRA, pp. 2563–2569 (2012) 14. Mitra, T., Jelfs, K.E., Schmidtmann, M., Ahmed, A., Chong, S.Y., Adams, D.J., Cooper, A.I.: Molecular shape sorting using molecular organic cages. Nat. Chem. 5, 276 (2013)
Free Space of Rigid Objects
35
15. Milenkovic, V., Sacks, E., Trac, S.: Robust complete path planning in the plane. In: Frazzoli, E., Lozano-Perez, T., Roy, N., Rus, D. (eds.) Algorithmic Foundations of Robotics X, pp. 37–52. Springer, Heidelberg (2013) 16. Pipattanasomporn, P., Sudsang, A.: Two-finger caging of concave polygon. In: IEEE ICRA, pp. 2137–2142 (2006) 17. Pipattanasomporn, P., Sudsang, A.: Two-finger caging of nonconvex polytopes. IEEE T-RO 27(2), 324–333 (2011) 18. Pokorny, F.T., Stork, J.A., Kragic, D.: Grasping objects with holes: a topological approach. In: IEEE ICRA, pp. 1100–1107 (2013) 19. Rimon, E., Blake, A.: Caging planar bodies by one-parameter two-fingered gripping systems. IJRR 18(3), 299–318 (1999) 20. Rodriguez, A., Mason, M.T.: Path connectivity of the free space. IEEE T-RO 28(5), 1177–1180 (2012) 21. Rodriguez, A., Mason, M.T., Ferry, S.: From caging to grasping. IJRR 31(7), 886– 900 (2012) 22. Rother, M., Nussbaumer, M.G., Rengglic, K., Bruns, N.: Protein cages and synthetic polymers: a fruitful symbiosis for drug delivery applications, bionanotechnology and materials science. Chem. Soc. Rev. 45, 6213 (2016) 23. Stork, J.A., Pokorny, F.T., Kragic, D.: Integrated motion and clasp planning with virtual linking. In: IEEE/RSJ IROS, pp. 3007–3014 (2013) 24. Stork, J.A., Pokorny, F.T., Kragic, D.: A topology-based object representation for clasping, latching and hooking. In: IEEE-RAS International Conference on Humanoid Robots, pp. 138–145 (2013) 25. Pereira, G.A.S., Campos, M.F.M., Kumar, V.: Decentralized algorithms for multirobot manipulation via caging. IJRR 23(7–8), 783–795 (2004) 26. Varava, A., Kragic, D., Pokorny, F.T.: Caging grasps of rigid and partially deformable 3-d objects with double fork and neck features. IEEE T-RO 32(6), 1479–1497 (2016) 27. Varava, A., Carvalho, J.F., Pokorny, F.T., Kragic, D.: Caging and path nonexistence: a deterministic sampling-based verification algorithm. In: ISRR (2017). Preprint: https://www.csc.kth.se/∼jfpbdc/path non ex.pdf 28. Vahedi, M., van der Stappen, A.F.: Caging polygons with two and three fingers. IJRR 27(11–12), 1308–1324 (2008) 29. Wang, Z., Kumar, V.: Object closure and manipulation by multiple cooperating mobile robots. In: IEEE ICRA, pp. 394–399 (2002) 30. Wise, K., Bowyer, A.: A survey of global configuration-space mapping techniques for a single robot in a static environment. IJRR 19(8), 762–779 (2000) 31. Zhang, L., Young, J.K., Manocha, D.: Efficient cell labelling and path non-existence computation using C-obstacle query. IJRR 27(11–12), 1246–1257 (2008) 32. Zhu, D., Latombe, J.: New heuristic algorithms for efficient hierarchical path planning. IEEE T-RO 7(1), 9–20 (1991) 33. Yershova, A., Jain, S., LaValle, S.M., Mitchell, J.C.: Generating uniform incremental grids on SO(3) using the Hopf fibration. IJRR 29, 801–812 (2009) 34. Zomorodian, A., Edelsbrunner, H.: Fast software for box intersections. In: Proceedings of the 16th Annual Symposium on Computational Geometry, pp. 129–138 (2000) 35. Liu, J., Xin, S., Gao, Z., Xu, K., Tu, C., Chen, B.: Caging loops in shape embedding space: theory and computation. In: ICRA (2018)
Fast Collision Detection for Motion Planning Using Shape Primitive Skeletons

Mukulika Ghosh, Shawna Thomas, and Nancy M. Amato

Parasol Laboratory, Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843, USA
{mukulika,sthomas,amato}@tamu.edu
Abstract. In many robotics applications, the environment (robots and obstacles) often has very complex geometry. This makes primitive computations such as collision detection expensive, which in turn affects the overall performance of these applications. Approximating the geometry is a common approach to optimizing such computations. Unlike other applications of geometric approximation, where it is applied to one space (usually obstacle space), we approximate both the obstacle and the free workspace with a set of geometric shape primitives that are completely contained within the space and represent its topology (skeleton). We use these "shape primitive skeletons" to improve collision detection performance in motion planning algorithms. Our results show that the use of shape primitive skeletons improves the performance of standard collision detection methods in motion planning problems by 20–70% in our 2D and 3D test environments, regardless of motion planning strategy. We also show how the same shape primitive skeletons can be used with robots of different sizes to improve the performance of the collision detection operation.
1 Introduction
In robotics applications such as navigation and manipulation, the environment (robot and obstacles) has complex geometries. Collision detection is an important primitive operation in these applications and is directly affected by this complexity. Due to its high computational cost, collision detection is often used as a measure of performance in these applications [12,27]. Geometric approximation of a complex shape by a set of simple geometric shapes (e.g., spheres or boxes) is a popular approach to handling this complexity [3,12]. As many of the obstacles in the environment are static (such as buildings in a city environment), a single approximation can be used to check the collision of different robots with the static portions of the environment. In this paper, we propose approximating both the obstacle and the free workspace by a connected set of geometric shape primitives, or shape primitive skeletons, to improve the performance of collision detection (and, in turn, of motion planning applications). Generally, geometric approximations such as bounding volume hierarchies are applied to only one space (usually the obstacle space). This
requires traversing the depths of the hierarchies to determine the collision status if the object lies in the space (on which the hierarchy is defined). As we approximate both the free and obstacle workspaces, we utilize the information from both spaces (e.g., the volume covered by the approximation) to make quick decisions about collision status. In our method, the data structure storing the set of geometric shape primitives is called a shape primitive skeleton. The shape primitive skeleton of a bounded space is defined as a graph annotated with geometric shape primitives that represents the connectivity of the space. Figure 1(a) shows an example of a shape primitive skeleton of the bounded free space and Fig. 1(b) shows the same for obstacle space.
(a) Free Space Primitive Skeleton (b) Obstacle Space Primitive Skeleton (c) Collision Checker
Fig. 1. (a)–(b) Shape primitive skeletons of the free space (white colored) and object space (object with blue colored boundary). The underlying skeleton graph is shown in green (free space) and red (obstacle space) and primitives are shown with black boundaries. (c) Collision status for different robot placements (red: in collision, green: collision free, yellow: undecided). The robot is shown as a star-shaped object.
Unlike bounding volume hierarchies, where the objects are bounded by geometric shape primitives, the shape primitives in a shape primitive skeleton are completely contained inside the object (as shown in Figs. 1(a) and (b)). Hence, these shapes can be used to perform fast inclusion tests (i.e., if an object is inside a shape primitive, it is inside the object containing the primitive) as opposed to exclusion tests (i.e., an object is outside another object if the former is outside all bounding shape primitives of the latter). Similar to exclusion tests, inclusion tests can be used to quickly resolve collision statuses. We use the shape primitive skeletons of both the obstacle space and the free space to make a quick decision about the collision status of an object. If an object is completely inside any shape primitive from any of the shape primitive skeletons, then the object is marked as collision-free or in-collision depending on which space the shape primitive belongs to (i.e., free space or obstacle space, respectively); see Fig. 1(c). If not, we check whether the object lies across two primitives. This improves the performance of the collision checker when the primitives are skinny (such as those in narrow passages) compared to the size of the robot. If the collision status cannot be determined by shape primitive inclusion, a full
collision detection computation is performed. Hence, our inclusion-based collision checker is meant to be used with any existing collision detector to improve its performance, not as a stand-alone collision checker. As our shape primitive skeleton is defined in the workspace, our method is applicable to motion planning problems that involve checking collision of the robot's geometry with workspace obstacles. For robots with linkages or joints, the geometry of each link or moveable part is checked for collision with the workspace obstacles. Our method is applicable to problems with a priori knowledge of the environment, as our shape primitive skeleton construction method is for static spaces (i.e., with static obstacles). We evaluate the impact on collision detection time with the use of shape primitive skeletons in our 2D and 3D test environments. Our results show that the use of the shape primitive skeletons improves the collision detection time (ranging from 20% to 70% in most cases) of two collision detection libraries (RAPID [11] and PQP [17]) when compared to the same libraries without the use of skeletons. We observe the improvement in performance in our test environments irrespective of the underlying motion planning strategy. However, the improvement depends on the coverage provided by the shape primitive skeletons and the size of the primitives (compared to the size of the robot). We also show how the same shape primitive skeletons can be used with robots of different sizes to improve the performance of the collision detection operation. In summary, our major contributions in this work are:
– Creation of a geometric shape primitive skeleton of any bounded space.
– Application of these shape primitive skeletons in both free and obstacle space to improve the performance of collision detection in motion planning.
2 Related Work
Here we review other approximation methods for collision checking and planning.
2.1 Bounding Volume Hierarchies
Bounding volume hierarchies approximate polygonal or polyhedral objects for use in collision detection algorithms with primitives such as spheres [12], Axis-Aligned Bounding Boxes (AABB) [24], Oriented Bounding Boxes (OBB) [11,24], and Discrete Orientation Polytopes (k-DOPs) [15]. Bounding volume hierarchies follow two major steps: decomposition of the object into regions, and bounding the regions with geometric shape primitives. The choice of bounding volume depends on various criteria: tightness of fit and simplicity of the intersection check. In all of these bounding volume hierarchy methods, the shape primitives overhang beyond the boundary of the object. The collision detection method uses the hierarchy to filter out potential candidates that are not in collision through exclusion tests, i.e., if an object is outside the bounding volume of another object, the objects are not in collision. In contrast, our method uses an
approximation where the geometric shape primitives are completely inside the object. Hence it is used for inclusion tests, i.e., if an object is completely inside a shape primitive inscribed in another object, then the object is in collision with the other. Schneider et al. improved the performance of bounding volume hierarchy collision detection for objects in narrow passages by early detection of collision using distance fields and oct-trees in a modified swap algorithm [20]. In our method, we use the approximation of the obstacle and free space to avoid the expensive penetration distance computation.
2.2 Inner Volume Approximations
Packing algorithms "pack" geometric shape primitives inside a given object (or objects). Similar to bounding volume hierarchies, various packing algorithms exist for various shape primitives [8,13,19]. Their objective is to provide a good coverage of the space. Generally, packing algorithms use same-sized primitives. Weller and Zachmann [25] proposed an irregular sphere packing method. Their approach uses a discretized distance map of the inner volume which is updated after the placement of each sphere. Stolpner et al. [22] provide an approach for generating an inner sphere tree by approximating the medial axis. The spheres are internal to the object and are "well distributed", which is achieved by placing the spheres on the intersection of the grid cells and the medial axis. However, the distribution of the spheres is dependent on the grid cell size. Most of these approximation methods (including the bounding volume hierarchies) use a single type of geometric shape primitive. Our method of approximation uses a variety of geometric shape primitives to provide a good coverage of the space with a small number of primitives. Also, all these methods are used to approximate one space (mainly the obstacle space). We use the approximation method to approximate both obstacle and free space to improve the performance of collision detection methods.
2.3 Workspace Certificates
Our work closely relates to the concept of workspace safety certificates [5,21]. Safety certificates are regions in free space (mainly convex) that are guaranteed to be collision free and hence help in certifying the validity of robot configurations or path segments. In [16], Lacevic et al. provide a way to generate an approximation of C-free certificates (called burs) and use them for efficient path planning. These safety certificate methods bear resemblance to kinetic data structures, which are constructed or evolve with time, whereas the safety certificates (i.e., shape primitives) in our method are statically placed as a pre-processing step. Hence the shape and size of the certificates do not depend on the shape and size of the robot but on those of the obstacles in the environment. Deits and Tedrake [9] provide a method for computing the largest collision-free convex region (polytope or ellipsoid) around an input seed point. Getting a good coverage of the obstacle-free space requires a good distribution of the seed points. Brock and Kavraki [7] use a set of spheres in the free space as a guide for
high-dimensional problems. Yang and Brock [26] use a set of spheres in the free space to improve the sampling distribution (to be near the medial axis and hence with high clearance). All of these methods use only one type of primitive, and the primitives are created in free space. Exploitation of information from both free and obstacle space is motivated by works such as Denny and Amato [10], where planning is done in both spaces, and Bialkowski et al. [4], where spatial indexing is augmented with information about the obstacle density.
3 Fast Collision Detection
We use the shape primitive skeleton(s) of the obstacle and/or free workspace to improve the performance of collision detection operations in motion planning applications. The details about the construction of the shape primitive skeletons are provided in Sect. 4. We construct a collision checker which makes quick decisions (early rejection or acceptance) about the collision status of a moveable object with the surrounding environment (with static obstacles). If the checker cannot decide the collision status, then another full collision detector is called. We first construct the shape primitive skeleton(s) of the obstacle and/or free workspace in the environment. The shape primitive skeleton of a bounded space is a set of geometric shape primitives placed along the topological skeleton of the space. Unlike bounding volumes, the shape primitives in the shape primitive skeleton are completely contained in the space and may be of different types. Figure 1(a) and (b) are examples of the shape primitive skeletons of the free and obstacle space of a given environment.
Fig. 2. Flow chart for the algorithm
Figure 2 provides an overview of the algorithm. Given the object’s location, we check whether the object is completely contained in any shape primitive. If the object is completely or partially contained in a shape primitive in obstacle space, the object is marked as in collision (placement of the red robot in Fig. 1(c)). If the object is completely contained in a shape primitive in free space, the object is marked as collision-free (placement of the green robot in Fig. 1(c)). If the object is partially contained in a shape primitive in free space, the object is clipped and the part outside the primitive is further processed. If the collision status cannot be determined (placement of the yellow robot in Fig. 1(c)), a full
collision detector is called. This is done to determine the collision status of the object when part of it lies in a space not covered by the shape primitive skeletons. A brief sketch of the algorithm is given in Algorithm 1.
Algorithm 1. Fast Collision Detection (FastCD)
Input: Shape Primitive Skeletons PS (obstacle OS and free FS), Moveable object M, Collision Detector CD, Partial Validation Flag PV (False by default).
Output: True (not in collision) or False (in collision)
1: P = FindShapePrimitives(PS, M.currentLocation)
2: for each p in P do
3:   if M is completely/partially inside p then
4:     if p is in OS then
5:       return False
6:     else if p is in FS then
7:       if M is completely inside p then
8:         return True
9:       else if PV is True then
10:        M = ClipRobot(M, p)
11:        return FastCD(PS, M, CD, False)
12: return CD(M)
As robot placements are marked collision free or in collision only if they are contained in shape primitives of the shape primitive skeletons, skeletons that provide good coverage of the space will result in fewer indecisive states and fewer calls to the full collision detector. A large number of primitives might provide good coverage, but it will also create primitives of size comparable to or smaller than the robot, which again results in an increased number of indecisive states. In such circumstances, partial validation reduces the number of indecisive states. Some of the implementation details used in this work are: (1) a containment check of an object against a primitive is done by checking the clearance of the object's center with respect to the primitive; (2) to clip the robot, we clip the robot's bounding box with the primitive; (3) axis-aligned bounding boxes of the primitives are used for spatial indexing of the shape primitives, i.e., to find candidate shape primitives for the containment check.
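To make the containment-based filtering concrete, the following is a minimal, self-contained Python sketch of the FastCD control flow, restricted to a disk-shaped robot and sphere (disk) primitives in 2D. The names and the 2D simplification are ours for illustration; partial validation and the spatial indexing of primitives are omitted.

```python
import math
from dataclasses import dataclass

@dataclass
class Disk:
    """A 2D sphere primitive belonging to either the free or the obstacle skeleton."""
    x: float
    y: float
    r: float
    space: str  # "free" or "obstacle"

def fully_inside(p: Disk, cx: float, cy: float, cr: float) -> bool:
    # Containment via the clearance of the object's center to the primitive boundary.
    return math.hypot(cx - p.x, cy - p.y) + cr <= p.r

def overlaps(p: Disk, cx: float, cy: float, cr: float) -> bool:
    return math.hypot(cx - p.x, cy - p.y) < p.r + cr

def fast_cd(primitives, cx, cy, cr, full_cd):
    """Returns True (collision-free) or False (in collision), as in Algorithm 1."""
    for p in primitives:
        if p.space == "obstacle" and overlaps(p, cx, cy, cr):
            return False            # partially inside an obstacle primitive
        if p.space == "free" and fully_inside(p, cx, cy, cr):
            return True             # completely inside a free-space primitive
    return full_cd(cx, cy, cr)      # undecided: call the full collision detector

prims = [Disk(0, 0, 5, "free"), Disk(10, 0, 2, "obstacle")]
print(fast_cd(prims, 1, 1, 1, lambda *_: True))    # True: inside the free disk
print(fast_cd(prims, 10, 0, 1, lambda *_: True))   # False: overlaps the obstacle disk
```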
4 Shape Primitive Skeleton Construction
In this section, we provide an overview of the shape primitive skeleton construction method. We use a topological skeleton of a bounded space and annotate it with shape primitives. The annotation is done by finding a best fit shape primitive from a candidate set.
4.1 Topological Skeleton for Shape Primitive Placement
The topological skeleton of a bounded space is a graph embedded in the space that represents the topology of the space (i.e., it has the same number of connected components and holes as the space). The skeleton vertices are points in the space and the skeleton edges are curves connecting the skeleton vertices (for example, the green graph in Fig. 1(a)). An example of a skeleton is the medial axis [6]. In order to place shape primitives that provide a good coverage of the space, the topological skeleton should be medially placed and annotated with clearance and boundary witness points. As it is not easy to place primitives on curved edges, the skeleton is simplified to a line skeleton.
4.2 Shape Primitive Placement
Given a skeleton, we place the shape primitives on each skeleton component (vertices and edges). We iterate through each skeletal component based on a priority score. A best-fit geometric shape primitive from a set of candidate shape primitives is placed if the skeleton component is not already covered by another shape primitive. An overview of the algorithm is provided in Algorithm 2.

Algorithm 2. Shape Primitive Skeleton
Input: Object O, Vertex Shape Primitive Candidates (VP), Edge Shape Primitive Candidates (EP), Scoring Function (SCORE(el))
Output: Set of Shape Primitives P
1: G(V, E) ← Skeleton(O)
2: Let skeleton elements, EL = V ∪ E
3: Sort EL based on SCORE(el)
4: for every el in EL do
5:   if el is in V then
6:     p = BEST-FIT-SHAPE-PRIMITIVE(VP, el)
7:   else
8:     p = BEST-FIT-SHAPE-PRIMITIVE(EP, el)
9:   if p does not overlap significantly with a previously placed shape primitive then
10:    Insert p in P
In our implementation, the skeleton components are scored based on the unoccupied volume surrounding the skeletal component, which is approximated using a grid map of the environment. Scores are updated for neighboring skeleton components after every shape primitive placement. We use spheres and boxes as vertex candidate shape primitives, and capsules, boxes and ellipsoids as edge candidate shape primitives. The following describes how to place each shape primitive (a toy sketch of the placements follows the list):
– Sphere: The vertex is the center and its minimum clearance is the radius.
– Capsule: The edge is the backbone line segment of the capsule and the minimum clearance along the edge is the radius.
– Box: The x-axis is aligned along the edge with the edge length as its range. For a vertex, the x-axis is along the minimum clearance witness with the semi-range as the minimum clearance. The other axes are determined by aligning each axis along the minimum clearance witness among the boundary points which are within the range of the already determined axes.
– Ellipsoid: The ellipsoid is computed similarly to the box on the edges: the x-axis lies along the edge with its range as the edge length. The y-axis is aligned along the minimum clearance witness with a range value that does not go beyond the clearance at any point on the edge. The ellipsoid is then expanded along the z-axis without overhanging the boundary.
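The sketch below illustrates, under our own simplifying assumptions, the two simplest placements above (sphere and capsule) together with a greedy loop in the spirit of Algorithm 2; score, best_fit, and overlaps_much stand in for the problem-specific routines described in this section.

```python
def place_sphere(vertex, clearance):
    """Sphere at a skeleton vertex: center = vertex, radius = its minimum clearance."""
    return ("sphere", vertex, clearance)

def place_capsule(v0, v1, clearance_along_edge):
    """Capsule on a skeleton edge: the edge is the backbone segment,
    radius = minimum clearance along the edge."""
    return ("capsule", (v0, v1), min(clearance_along_edge))

def build_skeleton_primitives(elements, score, best_fit, overlaps_much):
    """Greedy placement mirroring Algorithm 2: visit skeleton elements in
    decreasing score order and keep a best-fit primitive unless it overlaps
    significantly with one already placed. The three callables are assumed,
    problem-specific hooks, not the paper's implementation."""
    placed = []
    for el in sorted(elements, key=score, reverse=True):
        p = best_fit(el)
        if p is not None and not any(overlaps_much(p, q) for q in placed):
            placed.append(p)
    return placed

print(place_sphere((0.0, 0.0), 1.5))
print(place_capsule((0.0, 0.0), (4.0, 0.0), [1.2, 0.9, 1.1]))
```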
5 Results
In this section, we evaluate shape primitive skeletons in 2D and 3D environments and show their impact on the performance of collision detectors. Figure 3 shows the environments considered: (a) Office (2D) with a rigid body robot with 3 degrees of freedom (DOFs), (b) Nuclear power plant (3D) with a rigid body robot (6 DOFs), (c) Tunnels (3D) with a rigid body robot (6 DOFs), (d) Bronchi (3D) of lungs with a free-flying linkage robot with 4 links (9 DOFs), and (e) Room (3D) with a multilink robot with 8 DOFs (2 positional, 1 orientational and 5 joints). For the Tunnels and Bronchi environments, the inside of the tunnel or bronchi is considered free space and the outside is considered obstacle space.
(a) Office (b) Nuclear Power Plant (c) Tunnels (d) Bronchi (e) Room
Fig. 3. Environments studied in the experiments. The robot is shown in green.
(a) Office-Free (b) Office-Obstacle (c) Nuclear-Free (d) Nuclear-Obstacle (e) Tunnels-Free (f) Tunnels-Obstacle (g) Bronchi-Free (h) Bronchi-Obstacle (i) Room-Free (j) Room-Obstacle
Fig. 4. Underlying topological skeleton graphs and shape primitive skeletons for both spaces in different environments.
All experiments are conducted on a Dell Optiplex 9010 computer running CentOS 7 with an Intel Core i7-3770 3.4 GHz CPU, using the GNU gcc compiler version 4.7. We used the CGAL library [1] for our skeleton implementations: Segment Delaunay Graph (2D) to generate the medial axis as a skeleton, and Triangulated Surface Skeletonization (3D), which is the CGAL implementation of the Mean Curvature Skeleton [23].
5.1 Shape Primitive Skeleton and Coverage
In this section, we evaluate the space covered by the shape primitive skeletons. Figure 4 shows the shape primitive skeletons generated by our method for the environments in Fig. 3. We observe that the shape primitives generated in each environment cover most of the space except at the corners. Note that the placement of the shape primitives is dependent on the underlying structure and coverage of the skeleton. We evaluate the coverage of the space through the ratio of the volume occupied by the shape primitives to the total space volume. We approximate the volume computation by placing a grid (with grid cell side length 1% of the shortest length of the bounding box of the entire environment) over the environment and adding up the volume of the cells occupied by the shape primitives. Table 1 lists the number of shape primitives and the coverage measure for each environment; a toy sketch of this grid-based coverage estimate follows the table.

Table 1. Number of shape primitives and coverage measure (coverage in % of the space volume).

Environment | Free space                                     | Obstacle space
            | Spheres Boxes Capsules Ellipsoids Coverage     | Spheres Boxes Capsules Ellipsoids Coverage
Office      | 53      83    39       38         92.02        | 4       84    10       3          99.27
Nuclear     | 222     32    76       10         73.56        | 280     30    24       25         83.71
Tunnels     | 54      89    33       3          90.49        | 152     61    106      14         53.16
Bronchi     | 95      65    57       33         87.50        | 352     96    156      84         61.59
Room        | 151     24    51       23         75.49        | 32      89    37       35         88.52
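The following toy Python sketch illustrates the grid-based coverage measure used for Table 1, reduced to disk primitives in a 2D region; the actual computation described above uses a 3D grid whose cell side is 1% of the shortest bounding-box side.

```python
import numpy as np

def coverage_ratio(disks, bounds, n=200):
    """Occupancy-grid estimate of the fraction of a 2D region covered by disk
    primitives (a toy stand-in for the paper's 3D grid-based volume measure)."""
    xmin, xmax, ymin, ymax = bounds
    X, Y = np.meshgrid(np.linspace(xmin, xmax, n), np.linspace(ymin, ymax, n))
    covered = np.zeros_like(X, dtype=bool)
    for cx, cy, r in disks:
        covered |= (X - cx) ** 2 + (Y - cy) ** 2 <= r ** 2
    return covered.mean()

# Two overlapping free-space disks in a 5 x 4 region.
print(round(coverage_ratio([(0, 0, 1.0), (1.5, 0, 1.0)], (-2, 3, -2, 2)), 3))
```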
In general, spheres outnumber the other shape primitives in these environments. This can be attributed to the curvilinear nature of the underlying topological skeleton, which creates short edges during the line approximation. In Tunnels, on the other hand, the free space has long rectangular regions with long edges in the underlying skeleton, causing the creation of more boxes and capsules than other shape primitives. Ellipsoids are generally placed along the corner edges in all the environments. This is due to a drastic decrease in clearance at the end vertices. The coverage measure shows that more than 70% of the space (with the exception of the obstacle space in Tunnels and Bronchi) is covered by shape primitives. The coverage is dependent on the underlying skeleton on which the shape primitives are placed. If the underlying topological skeleton does not span certain regions (the left corners in the Tunnels obstacle space), no shape primitives will be placed in these regions.
(a) PRM-Time (b) PRM-Call Ratio (c) OBPRM-Time (d) OBPRM-Call Ratio (e) RRT-Time (f) RRT-Call Ratio [bar charts; y-axes: Normalized Time and Call Ratio; x-axis: Office, Nuclear, Tunnels, Bronchi, Room; legends: RAPID/PQP w/ and w/o Partial, Complete vs. Partial]
Fig. 5. Normalized collision detection time (time taken by our method to time taken by the underlying detector alone) and call ratio (number of calls whose status is determined by complete/partial inclusion test to total number of calls) to build roadmaps for each environment and sampling strategy.
Also, if the skeleton is not medially placed, the clearance information along the edges is affected, which in turn causes the placement of skinny primitives in otherwise clear regions. For 3D environments, the curvilinear nature of the underlying skeleton affects the size and fitting of the primitives.
5.2 Impact on Collision Detectors
We study the impact of using our shape primitive skeletons on the performance of collision detection methods (refer to Algorithm 1). We use two different underlying collision detectors, RAPID [11] (which uses an OBB hierarchy) and PQP [17] (which uses swept sphere volumes), to show the generality of our approach. For each environment, we use a sampling-based motion planning approach: generate a set of random placements of the robot (500 in our experiments) and then connect them (to 3 nearest neighbors in our experiments) using a straight-line local planner to build a roadmap. Both the placements of robots (samples) and the edges in the roadmap (intermediates along the edge) are checked for collision with and without the use of the shape primitive skeleton approach, and with and without the partial intersection check in free space (as described in Sect. 3). We use three different strategies for roadmap generation: Probabilistic Roadmap with Uniform Sampling (PRM) [14], Probabilistic Roadmap with Obstacle-based Sampling (OBPRM) [2] and Rapidly-exploring Random Trees with Uniform Sampling (RRT) [18] to study how the distribution of samples affects the performance impact of using the shape primitive skeleton. For the Bronchi and Room environments, the collision check is done for each moveable part (or link). The self-collision check is performed using the underlying collision detection library and is done at the end for samples which are not colliding with obstacles. The shape primitive skeleton construction time is less than 10% of the total planning time without the use of the skeleton for most of the test environments. For large environments such as the Nuclear power plant, the skeleton construction takes less than 20% of the total planning time without the use of the skeleton. We analyze the performance with respect to the total time taken in collision detection. The collision detection time reported includes the collision detection time for both sampling and local planning (validation of edges). Figure 5(a, c, e) shows the collision detection time with the use of the shape primitive skeleton normalized to the collision detection time of the underlying collision detection library without the use of the skeleton for each environment and planning strategy. Figure 5(b, d, f) shows the respective call ratio, measured as the ratio of the number of collision detection calls whose collision status is determined by containment checking with the shape primitives to the total number of collision detection calls. ("Complete" denotes those without the partial check with free space primitives and "Partial" denotes the same with the partial check with free space primitives.) We observe an improvement in the performance of collision detection irrespective of the underlying collision detector and the underlying planning strategy. The improvement in performance varies from marginal to 70% in some cases, both with and without the partial checking feature. As observed in Fig. 5(a, c, e), the use of the shape primitive skeleton (even without the partial check in free space) improves the collision detection time as compared to the underlying CD library. In the Office environment, the status of most of the collision detection calls (>90%) is determined through inclusion tests with the shapes in the shape primitive skeletons, and thereby we observe maximum
improvement in collision detection performance over the collision detection method without the use of the shape primitive skeleton. Skinny shape primitives (compared to the size of each body of the robot) and low coverage of either space by the shape primitive skeleton cause more indecisive states, which increases the number of full collision detector calls and lessens the improvement in performance (as observed in the Tunnels and the Bronchi environments); see Fig. 5(b, d, f). The motion planning strategy also affects the improvement provided by the use of the shape primitive skeleton. If the robots are placed near the boundary (such as in OBPRM, where the samples are generated closer to the obstacle surface), they are near the boundary of the shape primitives, causing an increased number of indecisive states (see Fig. 5(d)) and hence an increased number of calls to the full collision detector. The performance is further improved when the shape primitive skeleton is used with partial checking with free space primitives, with the exception of the Room environment. This is most noticeable in the Tunnels environment, where primitives in the free space are skinny compared to the robot size. For the Office environment, where most of the samples are already checked through complete inclusion tests in PRM, the improvement with partial checking is more noticeable in OBPRM than in PRM. In the Room environment, the size of each part of the robot is often large enough for a containment check with one or two primitives. This causes the partial checking to be a computational overhead compared to the improvement in performance it provides.
5.3 Use of Skeletons of both Spaces
Figure 6(a) compares the collision detection time when using the skeleton of a single space (free or obstacle) with that when using the skeletons of both spaces. The collision detection time reported is for a roadmap with 500 samples and no connections, and is normalized to the collision detection time without the use of skeletons. We observe that the use of skeletons of both spaces provides a better performance improvement than the use of a skeleton of one space. The performance improvement from using a skeleton of one space depends on the volume of that space with respect to the total volume. This can be observed in the Tunnels and Bronchi environments with the use of the obstacle space skeleton only, and in the Office and Room environments with the use of the free space skeleton only.
5.4 Performance with Different Robot Sizes
Figure 6(b) compares the performance of the collision detector with three different robot sizes. We uniformly scaled the robot used in the above experiments using three scaling factors: 0.5 (reduced), 1 (original) and 2 (enlarged). For the Bronchi environment, we scale each moveable part by the scaling factor. We used the PRM strategy to analyze the performance impact for a roadmap with 500 samples. The collision detection time stated in the figure is normalized to the collision detection time without the use of the shape primitive skeleton.
(a) Skeletons of different spaces (b) Different robot size [bar charts; y-axis: Normalized Time; (a) grouped by RAPID/PQP per environment (Office, Nuclear, Tunnels, Bronchi, Room) with legend Free Space / Obstacle Space / Both Spaces; (b) robot scales 0.5, 1, 2 per environment with legend RAPID/PQP w/ and w/o Partial]
Fig. 6. Normalized collision detection time with (a) use of skeletons of free space only, obstacle space only and both spaces, and (b) use of different robot sizes.
With increased robot size, the performance improvement is reduced (see Fig. 6(b)). This is due to an increase in robot placements near the shape primitive boundary and an increase in calls to the full collision detector. The same shape primitive skeletons can be used for different robots.
6 Conclusion
In this paper, we propose a fast collision detection method that uses shape primitive skeletons of both the free and obstacle space to make quick decisions about the collision status of robot placements in motion planning algorithms. Our results show that the shape primitive skeletons cover more than 70% of the space in most cases. We also observe an improvement in collision detection time for 2D and 3D environments (from 20% to 70% in most cases). However, the trend in improvement depends on the coverage provided by the shape primitive skeletons and on the size of the shape primitives in the skeleton compared to the size of the robot.
References
1. CGAL, Computational Geometry Algorithms Library (1997). https://www.cgal.org/
2. Amato, N.M., Wu, Y.: A randomized roadmap method for path and manipulation planning. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 113–120 (1996)
3. Bayani, H., Masouleh, M.T., Karimi, A., Cardou, P., Ebrahimi, M.: On the determination of the maximal inscribed ellipsoid in the wrench-feasible workspace of the cable-driven parallel robots. In: 2014 Second RSI/ISM International Conference on Robotics and Mechatronics (ICRoM), pp. 422–427 (2014)
4. Bialkowski, J., Otte, M., Frazzoli, E.: Free-configuration biased sampling for motion planning. In: IEEE International Conference on Intelligent Robots and Systems (IROS), pp. 1272–1279 (2013)
5. Bialkowski, J., Otte, M., Karaman, S., Frazzoli, E.: Efficient collision checking in sampling-based motion planning via safety certificates. Int. J. Robot. Res. 35(7), 767–796 (2016)
6. Blum, H.: A transformation for extracting new descriptors of shape. In: Wathen-Dunn, W. (ed.) Models for the Perception of Speech and Visual Form, pp. 362–380. MIT Press, Cambridge (1967)
7. Brock, O., Kavraki, L.: Decomposition-based motion planning: a framework for real-time motion planning in high-dimensional configuration spaces. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 1469–1475 (2001)
8. Conway, J.H., Sloane, N.J.H.: Sphere Packings, Lattices and Groups, 3rd edn. Springer Science and Business Media, New York (1998)
9. Deits, R., Tedrake, R.: Computing large convex regions of obstacle-free space through semi-definite programming. In: Akin, H., Amato, N., Isler, V., van der Stappen, A. (eds.) Algorithmic Foundations of Robotics XI. Springer, Cham (2014)
10. Denny, J., Amato, N.M.: Toggle PRM: a coordinated mapping of C-free and C-obstacle in arbitrary dimension. In: Frazzoli, E., Lozano-Perez, T., Roy, N., Rus, D. (eds.) Algorithmic Foundations of Robotics X. Springer Tracts in Advanced Robotics, vol. 86, pp. 297–312. Springer, Heidelberg (2013)
11. Gottschalk, S., Lin, M.C., Manocha, D.: OBB-tree: a hierarchical structure for rapid interference detection. Comput. Graph. 30, 171–180 (1996). Proceedings of SIGGRAPH 1996
12. Hubbard, P.M.: Approximating polyhedra with spheres for time-critical collision detection. ACM Trans. Graph. 15(3), 179–210 (1996)
13. Kallrath, J.: Packing ellipsoids into volume-minimizing rectangular boxes. J. Global Optim. 67, 151–185 (2017)
14. Kavraki, L.E., Švestka, P., Latombe, J.C., Overmars, M.H.: Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Trans. Robot. Autom. 12(4), 566–580 (1996)
15. Klosowski, J., Held, M., Mitchell, J.S.B., Zikan, K., Sowizral, H.: Efficient collision detection using bounding volume hierarchies of k-DOPs. IEEE Trans. Vis. Comput. Graph. 4(1), 21–36 (1998)
16. Lacevic, B., Osmankovic, D., Ademovic, A.: Burs of free C-space: a novel structure for path planning. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 70–76 (2016)
17. Larsen, E., Gottschalk, S., Lin, M.C., Manocha, D.: Fast distance queries with rectangular swept sphere volumes. In: IEEE International Conference on Robotics and Automation (ICRA), vol. 4, pp. 3719–3726 (2000)
18. LaValle, S.M., Kuffner, J.J.: Randomized kinodynamic planning. Int. J. Robot. Res. 20(5), 378–400 (2001)
19. Li, S., Zhao, J., Lu, P., Xie, Y.: Maximum packing densities of basic 3D objects. Chin. Sci. Bull. 55(2), 114–119 (2010)
20. Schneider, D., Schömer, E., Wolpert, N.: Collision detection for 3D rigid body motion planning with narrow passages. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4365–4370 (2017)
21. Schwarzer, F., Saha, M., Latombe, J.C.: Exact collision checking of robot paths. In: Boissonnat, J.D., Burdick, J., Goldberg, K., Hutchinson, S. (eds.) Algorithmic Foundations of Robotics V, pp. 25–41. Springer, Heidelberg (2004)
22. Stolpner, S., Kry, P., Siddiqi, K.: Medial spheres for shape approximation. IEEE Trans. Pattern Anal. Mach. Intell. 34(6), 1234–1240 (2012)
23. Tagliasacchi, A., Alhashim, I., Olson, M., Zhang, H.: Mean curvature skeletons. Comput. Graph. Forum 31(5), 1735–1744 (2012)
24. Tu, C., Yu, L.: Research on collision detection algorithm based on AABB-OBB bounding volume. In: 2009 First International Workshop on Education Technology and Computer Science, vol. 1, pp. 331–333 (2009)
25. Weller, R., Zachmann, G.: Inner sphere trees for proximity and penetration queries. In: Robotics: Science and Systems, vol. 2 (2009)
26. Yang, Y., Brock, O.: Adapting the sampling distribution in PRM planners based on an approximated medial axis. In: IEEE International Conference on Robotics and Automation (ICRA), vol. 5, pp. 4405–4410 (2004)
27. Yershova, A., Jaillet, L., Simeon, T., LaValle, S.M.: Dynamic-domain RRTs: efficient exploration by controlling the sampling domain. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 3856–3861 (2005)
Fast Swept Volume Estimation with Deep Learning

Hao-Tien Lewis Chiang1,2, Aleksandra Faust2, Satomi Sugaya1, and Lydia Tapia1

1 Department of Computer Science, University of New Mexico, MSC01 11301, Albuquerque, NM 87131, USA
{lewispro,satomi}@unm.edu, [email protected]
2 Google Brain, Mountain View, CA 94043, USA
[email protected], [email protected]
Abstract. Swept volume, the volume displaced by a moving object, is an ideal distance metric for sampling-based motion planning because it directly correlates to the amount of motion between two configurations. However, even approximate algorithms are computationally prohibitive. Our fundamental approach is the application of deep learning to efficiently estimate swept volume within a 5%–10% error for all robots tested, from rigid bodies to manipulators. However, even inference via the trained network can be computationally costly given the often hundreds of thousands of computations required by sampling-based motion planning. To address this, we demonstrate an efficient hierarchical approach for applying our trained estimator. This approach first prefilters samples using a weighted Euclidean estimator trained via swept volume. Then, it selectively applies the deep neural network estimator. The first estimator, although less accurate, has metric space properties. The second estimator is a high-fidelity unbiased estimator without metric space properties. We integrate the hierarchical selection approach in both roadmap-based and tree-based sampling motion planners. Empirical evaluation on the robot set demonstrates that hierarchical application of the metrics yields up to 5000 times faster planning than state-of-the-art swept volume approximation and up to five times higher probability of finding a collision-free trajectory under a fixed time budget than the traditional Euclidean metric.
1 Introduction
Illustrated in Fig. 1, swept volume, SV(c1, c2), is the measure of the volume displaced by an object moving between two configurations, c1 and c2 [1,2]. Swept volume computation has proven useful in several applications including machining verification, geometric modeling, collision detection, mechanical assembly, ergonomic studies, and product lifecycle management [3]. Additionally, swept volume was identified as an ideal sampling-based planning distance metric because it directly correlates to the amount of motion required to transition between
(a) Start configuration (b) End configuration (c) Swept volume
Fig. 1. Example of start (a) and end (b) configurations and the corresponding volume swept (c) for a Kuka LBR IIWA 14 R820 manipulator.
two collision-free configurations [4]. Yet, swept volume has not been commonly used in sampling-based planning due to the complexity of its computation. Exact swept volume computation is intractable, since it often requires constructing a complex non-linear geometry. As a result, most algorithms focus on generating approximations [5], such as occupation grid-based and boundary-based methods [1,3,5–7]. However, even these approximations are too computationally expensive to be used as a practical distance metric in sampling-based motion planners [4,8], since distance metric calls are one of the most common operations [9]. Sampling-based planners select a subset of configurations to attempt expensive local planning operations that make connections in roadmap-based planners, e.g., a probabilistic roadmap method (PRM) [10], or tree extensions in tree-based planners, e.g., a rapidly-exploring random tree (RRT) [11]. The configurations are typically selected w.r.t. some distance metric. Intuitively, a good metric limits the local planning operations to those most likely to succeed, i.e., those with the lowest collision probability between the two configurations, without prior knowledge of the obstacle distribution. Yet, the commonly chosen metrics suffer from several issues. For example, the configuration space Euclidean distance metric does not represent collision probability well [4,9], and the weighted Euclidean distance metric is difficult to tune [9]. Our fundamental approach lies in the ability of a deep neural network (DNN) to approximate swept volume. Trained off-line with swept volume examples for a given robot in an obstacle-free space, the DNN captures the complex and nonlinear relationship between trajectories in the robot's configuration space. In this paper, we show that successful learning is feasible since DNNs have been proven to be capable of approximating any continuous bounded function [12], and swept volume possesses these properties in a finite configuration space. Also, through empirical evaluation on a variety of robot types, from a rigid body to manipulators, we find that the DNN estimators are able to capture swept volumes within a 5–10% error at speeds 3500 to 5000 times faster than traditionally used approximate swept volume computation.
Despite the increased speedup from using a DNN for swept volume estimation, inference via the DNN can remain prohibitive due to the often hundreds of thousands or more of inference calls required in difficult sampling-based motion planning problems. In addition, SV(c1, c2) does not form a metric space as it does not satisfy the triangle inequality. This prevents utilization of efficient nearest neighbor data structures, such as GNAT [13], common in sampling-based planners. Therefore, we also propose an efficient hierarchical approach for applying our trained estimator, Hierarchical Neighbor Search (HNS). Specifically, during nearest neighbor selection in sampling-based planning, when comparison via a metric is applied, we pre-filter configurations using a weighted Euclidean estimator trained via swept volume. This first estimator is a single layer neural network that has metric properties but less accuracy. Then, we can selectively apply the DNN, a high-fidelity estimator without metric properties. The hierarchical combination in HNS provides both metric properties and high fidelity, thus enabling efficient selection of configurations with low swept volume using existing nearest neighbor data structures that require metric space properties. We achieve feasible swept volume integration with sampling-based planning in three steps. First, we generate the training data using an occupation grid-based [6] swept volume approximation. This is by far the most computationally intensive step, as the swept volume is computed for random configuration pairs. Second, we train the models needed for swept volume estimation. The training is also computationally intensive, but less so than the data generation. Both steps are done once, prior to planning. Lastly, sampling-based planners such as RRT and PRM use nearest neighbors computed with the metrics to attempt the connections between configurations. We use an efficient nearest neighbor data structure, GNAT [13], with the metric estimators. We evaluate the metrics and HNS on three robots, a 6 Degree Of Freedom (DOF) rigid body, a 15 DOF fixed-base planar manipulator, and a 7 DOF Kuka LBR IIWA 14 R820, in a cluttered, a narrow corridor, and a real-application inspired environment. Two popular sampling-based planners, PRM and RRT, are integrated with our metrics. Planners that use HNS are: (1) up to five times more likely to identify a collision-free path on a fixed time budget and (2) able to return paths with a smaller swept volume. These advantages are consistent for all three robots and are particularly significant for more complex robots with a highly articulated body. The contributions of this paper include learning a high-precision swept volume estimator, datasets for learning, and demonstrations of the metrics within planning. Specifically, we highlight the utility of high-fidelity DNN swept volume models and the efficient hierarchical combination of learned metrics, HNS, for sample neighbor selection during roadmap connection and tree expansion. The weighted Euclidean metric and datasets are publicly available [14] for the three robots used in this paper, and can be readily used by other researchers without any modifications. In addition, this paper might be of interest to the larger motion planning community as an example of machine learning's role within motion planning. It is a proof of concept that a computation of a
highly complex and non-linear, yet Lipschitz continuous and bounded, metric can be learned ahead of time off-line from examples, reducing the complexity of planning.
2 Related Work
Modern approximate swept volume algorithms can be roughly classified into occupation grid-based and boundary-based methods. Occupation grid-based approaches decompose the workspace, e.g., into voxels, in order to record the robot's occupation of the workspace as it executes a trajectory [3,6]. The resulting approximation has a resolution-tunable accuracy and is conservative, which can be critical for applications such as collision avoidance [15]. The boundary-based methods extract the boundary surface [1,5,7]. Despite more than four decades of study, swept volume computation is still too slow to be used online by sampling-based motion planners [4,8]. Swept volume has been used in motion planning in various ways. In [15], the swept volumes of a bipedal robot for 276 step patterns are computed offline and queried online by an RRT-based planner to speed up collision detection for robot footstep planning. Similarly, in this paper we compute swept volume approximations offline. However, our learned estimators can generalize to unseen configuration pairs. Swept volume has also been used directly as a distance metric in [8]. However, due to the exceedingly high swept volume computation cost, the performance is reported to be orders of magnitude worse than weighted Euclidean distance metrics. A distance metric that accurately predicts a local planner's success rate is critical for sampling-based motion planners [16]. On the other hand, distance metric calculations also need to be fast since they are one of the most numerous sampling-based planner operations [9]. Carefully tuned weighted Euclidean distance metrics have been empirically shown to outperform other metrics [9]. This conclusion is echoed in [17], where a weighted Euclidean distance metric is tuned to approximate swept volume. However, a weighted Euclidean distance metric may not be expressive enough to approximate swept volume, as swept volume is clearly nonlinear, i.e., the joint DOFs of an articulated body affect one another. Machine learning has been used to learn distance metrics for kinodynamic robots [18,19]. In such systems, the design goal of distance metrics is often quite different from planning in configuration space due to the constrained reachability. Good distance metrics typically approximate the minimum time to reach between states [18]. For example, regression learning algorithms are trained offline in [18] to approximate the time to reach between states. The training data is generated by a near-optimal controller for unicycle robots. An RRT-based planner then uses the learned distance metric during online planning. A similar method replaces the near-optimal controller with an indirect controller to learn both the time to reach and the control inputs [19]. These methods differ from ours in that our methods identify neighboring configurations that are likely to succeed in the connect or extend operations due to the expected distance of a trajectory, while
distance metrics in [18,19] approximate minimum time to reach. Another example of integration between deep machine learning and sampling-based planning is PRM-RL [20]. Similar to our method, it uses an offline, once-per-robot training to connect the PRM nodes that are most likely to succeed. Unlike our learned distance estimators, PRM-RL learns the local planner for a physical robot moving via sensor information; the nodes that are most likely to succeed under that local planner are then connected.
3 Methods
Swept volume, $SV : \mathbb{R}^{2d_f} \to \mathbb{R}$, for a trajectory in configuration space is
\[ SV(c_1, c_2) = \bigcup_{t \in [0,1]} V\big((1-t)\,c_1 + t\,c_2\big), \tag{1} \]
where $c_1, c_2 \in \mathbb{R}^{d_f}$ are the start and end configurations of a robot with $d_f$ degrees of freedom, and $V(c)$ is the workspace occupied by the robot in configuration $c$. We consider the trajectory between $c_1$ and $c_2$ to be a straight line in configuration space. $SV(c_1, c_2)$ can be highly complex and nonlinear due to rotational degrees of freedom, especially in cases where the robot has an articulated body. Swept volume estimator models are trained offline in three steps: (1) $\widehat{SV}$ training dataset generation, described in Sect. 3.1, (2) learning a weighted Euclidean distance metric, $D^{sv}_{we}$, using a single layer network in Sect. 3.2, and (3) training a deep swept volume estimator in Sect. 3.3. After the off-line learning, Sect. 3.4 describes how HNS efficiently combines the two estimators into a hierarchical neighbor selector that selects the most promising nodes in sampling-based planning.
3.1 Training Dataset Generation
The training data $(X, y)$ has size $n$, where $X = [x_1, \cdots, x_n]^\top$ and each training sample $x_i = [c_{1,i}\ c_{2,i}]$ consists of two uniformly-randomly sampled points from the configuration space. The ground truth labels, $y$, match the swept volume between the two corresponding configurations with respect to the straight-line planner. Since $SV$ is only related to the kinematics of the robot and is independent of the environments it operates in, we do not consider obstacles during the generation of training data. Ideally, the labels would be computed with (1), but computing the exact $SV$ is intractable. Instead, we use labels,
(2)
approximated with a state of the art octree-based swept volume algorithm [6], where the robot trajectory is represented by Nlerp intermediate configurations. The details of the octree-based algorithm and the training datasets are in supplementary material [14].
Fast Swept Volume Estimation with Deep Learning
57
sv Training Weighted Euclidean Distance Estimator, Dwe df 2 Weighted Euclidean distance metric, dw (c1 , c2 ) = j=1 wj (c1,j − c2,j ) , often df requires manual tuning of the vector w ∈ R . ∗ sv We learn the weights, wD sv , with a single layer network, Dwe , that models the we weighted Euclidean distance metric, dw (c1 , c2 ), w.r.t. the training dataset (2),
3.2
sv ∗ sv , c1 , c2 ) = dw ∗ (c1 , c2 ). Dwe (wD we
The details of the network are depicted in Fig. 2(a). We use a stochastic gradient ∗ descent optimizer to find wD sv that minimizes L2 loss we
∗ sv = arg minw wD we
n
sv 1,i , c2,i ))2 . (Dwe (w, c1,i , c2,i ) − SV(c
(3)
i=1
The network is trained once per robot, and like the analytical representation of sv the network, Dwe , forms a metric space when the weights are positive. 3.3
sv Training Deep Swept Volume Distance Estimator, Ddnn
well because The weighted Euclidean distance metric cannot approximate SV the model may not be expressive enough to capture non-linearities. Therefore, sv , to learn a non-linear swept volume we also use a deep neural network, Ddnn sv model. Ddnn is a fully-connected feed-forward DNN. The inputs, outputs, and sv are described in Fig. 2 (b). The inputs are 2df input the architecture of the Ddnn neurons, corresponding to c1 and c2 . The k hidden layers consist of ReLu [21] neurons. Finally, the output is a neuron estimating the swept volume between
1 , c2 ). c1 and c2 are the Fig. 2. Neural network architectures used to estimate SV(c start and end configurations of the trajectory, respectively. (a) Weighted Euclidean sv (w, c1 , c2 ). The input, c1 − c2 , is fed to df neurons (one per DOF) Metric model, Dwe with a square activation function, i.e., f (x) = x2 . The output is the square-root of absolute value of the weighted sum of activations. (b) Deep Swept Volume Model, sv (w, c1 , c2 ). The inputs are 2df . The activation function of the first layer are idenDdnn tities. The input layer is connected to the k hidden layers each with Ni ReLU neurons. The output layer has one neuron corresponding to the swept volume estimate.
58
H.-T. L. Chiang et al.
two configuration points and outputs zero if the network prediction is negative. Stochastic gradient descent backprop finds the weights and biases w.r.t. L2 loss and the dataset (2), sv (W ∗ , b∗ )Ddnn = arg min(W ,b)
n
sv 1,i , c2,i ))2 . (Ddnn ((W , b), c1,i , c2,i ) − SV(c
i=1
3.4
(4)
Hierarchical Neighbor Search, HNS
In this section we propose Hierarchical Neighbor Search, HNS, that combines the trained swept volume estimators introduced above to be used for neighbor selection within sampling-based planning. This hierarchical combination efficiently selects neighbors with low swept volume distance, by first filtering all candidates sv and then filtering this smaller subset with using the extremely fast learned Dwe sv Ddnn . In this paper, we implement this filtering by using the k-closest neighbor selection method at each level, but other neighbor connection strategies can be used, such as a distance cutoff [22]. Our implementation first identifies kc cansv (output of the weighted didate nearest neighbors of configuration c using Dwe sv Euclidean metric model). Next, HNS uses the Ddnn (output of the deep swept volume model) to choose the final knn < kc nearest neighbors among the candidates. Using the following notation to denote selecting nearest neighbors and the corresponding swept volume estimates from a given start configuration c to any element in a given set of configurations X:
and
sv ∗ sv (X) = {(x, D N NDwe sv , c, x))|x ∈ X}, we (wDwe
(5)
sv ∗ ∗ sv (X) = {(x, D sv , c, x))|x ∈ X}, N NDdnn dnn ((W , b )Ddnn
(6)
The HNS algorithm is found in [14]. The hierarchical combination of the metrics within neighbor selection has several benefits. First, it enables the use of any efficient nearest neighbor data sv structure. Second, it greatly reduces the number of Ddnn queries, which are slower than computing the weighted Euclidean distance. Finally, it employs metrics trained specifically for swept volume prediction.
4
Swept Volume Properties
DNNs are a universal approximator for any bounded continuous function [12]. In this section we formalize the proposition that swept volume is Lipschitz continuous and bounded along a continuous trajectory, justifying using DNNs as approximators.
Fast Swept Volume Estimation with Deep Learning
59
Proposition 1. SV(c1 , c2 ) along a continuous trajectory between two configuration points c1 , c2 in the configuration space above is Lipschitz continuous, i.e. (7) SV(c1 , c2 ) − SV(c1 + Δc1 , c2 + Δc2 ) ≤ KΔc1 + Δc2 , where Δ is a small increment along paths. The proof of the proposition can be found in [14].
5 Evaluations
Fig. 3. Start (left column) and goal (right column) configurations of the 15 DOF planar manipulator (a, b), free-floating rigid body (c, d), and Kuka LBR IIWA 14 R820 manipulator in the Retrieve task (e, f) and the Shuffle task (g, h).
We evaluate our method on a 15 DOF planar manipulator, a free-floating rigid body, and a fixed-base Kuka manipulator. A D^sv_we model and a D^sv_dnn model are trained to learn SV for each robot. We compare PRMs and RRTs that use HNS to ones that use Euclidean distance, D_E, and D^sv_we, the most widely used distance metrics for sampling-based planners [9,18]. We highlight important settings used in our evaluation below and provide full details in the supplementary material [14]. The three DNNs used to learn SV for the robots share the same hyper-parameters and training dataset construction parameters. The dataset is composed of one hundred thousand training samples per robot. An additional ten thousand evaluation samples are generated in the same manner as the training samples but are unseen by the estimators. We use the PRM and RRT implemented in OMPL [23]. PRM with HNS identifies k_nn = 5 nearest neighbors among k_c = 10 candidates to connect to, while RRT with HNS finds k_c = 5 candidates in order to identify the nearest configuration in the tree. With a single metric, PRM simply uses k_nn = 5, and RRT finds the nearest configuration without pre-filtering. Figure 3 shows the starts and goals of the three robots evaluated in four environments. The Kuka manipulator is evaluated in two pick-and-place inspired tasks with complex environments: Retrieve (Fig. 3(e, f)) and Shuffle (Fig. 3(g, h)). For the Kuka and the 15 DOF manipulators, the joint angles describe a configuration. The position and rotation quaternion describe a configuration of the free-floating rigid body.

5.1 Learning Results
The weights of D^sv_we for the 15 DOF and Kuka manipulators are approximately [214, 97, 80, 73, 60, 43, 31, 28, 23, 19, 8, 5, 9, 7, 5] and [1506, 2371, 181, 482, 1, 170, 30], respectively. As expected, these weights indicate that the joints near the base impact SV more. The weights of the free-floating rigid body are [120, 140, 140] for x, y, z and [86, 110, 82, 88] for the quaternion. These weights indicate that the translational degrees of freedom have a higher impact on SV than rotation. Figure 4 shows the evaluation loss of D^sv_dnn and D^sv_we at each epoch. It is clear that learning converges for both D^sv_we and D^sv_dnn, but D^sv_dnn has a much smaller loss than D^sv_we across all robots. This means that D^sv_dnn learns to approximate SV much better than D^sv_we.
Fig. 4. Learning curves of D^sv_dnn and D^sv_we for (a) the 15 DOF manipulator, (b) the rigid body, and (c) the Kuka manipulator.
Fig. 5. Histograms of D_E (left column), D^sv_we (middle column) and D^sv_dnn (right column) for the 15 DOF manipulator (top row), free-floating rigid body (middle row) and Kuka manipulator (bottom row). The gray shade is the histogram of SV.
Figure 5 shows the histograms of D_E, D^sv_we and D^sv_dnn for the evaluation data compared to the ground truth SV (gray shade). Note that in order to compare D_E to SV, we scale the value of D_E such that its average matches the average SV of the evaluation data. The last column of Fig. 5 (c, f, i) shows striking similarities between D^sv_dnn and SV, indicating the DNNs learned to approximate SV well for all robots. In contrast, the highly nonlinear SV of the 15 DOF (Fig. 5 (a, b)) and Kuka manipulators (Fig. 5 (g, h)) is not approximated well by D_E and D^sv_we. These linear metrics only approximate the free-floating rigid body well (Fig. 5 (d, e)). This is expected as the joint angles in the articulated bodies nonlinearly impact each other. D_E and D^sv_we provide better approximations for the 7 DOF Kuka manipulator in a 3D workspace than for the 15 DOF planar manipulator. This is likely due to the fact that the Kuka manipulator has fewer DOF and similar lengths between joints. Similar trends can be found in Table 1, which shows the L1 norm of the error ratio for various distance metrics and robots w.r.t. the evaluation dataset. The average error ratio of D^sv_dnn is 4.38 to 11.67 times smaller than D_E and 2.44 to 8.73 times smaller than D^sv_we.
Table 1. L1 norm of the error ratio ((D − SV)/SV) for various distance metrics and robots. The metric with the lowest error ratio is highlighted for each robot.

              15 DOF manipulator   Rigid body   Kuka manipulator
  D_E         76.9%                19.7%        60.7%
  D^sv_we     70.7%                11.0%        29.9%
  D^sv_dnn    8.1%                 4.5%         5.2%
Fig. 6. Scatter plots of SV and the distance estimated by D_E (red circles), D^sv_we (blue squares) and the DNN (D^sv_dnn) (black diamonds) for the 15 DOF manipulator (a), free-floating rigid body (b) and Kuka manipulator (c).
We further explored the distance metric performance by comparing SV against each distance metric (Fig. 6). In this figure, the black diamonds, representing D^sv_dnn, closely track SV; they are clustered along the diagonal for all robots. It is clear that neither D_E (red circles) nor D^sv_we (blue squares) correlates to SV well for the 15 DOF or Kuka manipulator (Fig. 6 (a, c)), especially when SV is large or small. We also investigated the learning performance of the DNNs as impacted by the number of neurons in the hidden layer and the quantity of training samples. Figure 7 shows the L2 loss over the evaluation dataset as a function of training epochs for combinations of training sample and DNN sizes. It is clear that networks trained with 25,000 training samples (1/4 the original quantity of samples, shown as dashed lines) have higher loss across all robots and exhibit over-fitting, as the loss increases after the initial decrease. When trained with the full training dataset (solid curves), large networks (networks with 1024 and 512 neurons in the first hidden layer) perform similarly across all robots, while small networks demonstrate a larger loss for the Kuka and rigid body robots. These results indicate that a large network and training dataset size are important to accurately approximate SV.
Fig. 7. The L2 evaluation loss of DNNs with varied numbers of neurons in the hidden layer as shown in the legend (solid curves) across the 15 DOF manipulator (a), free-floating rigid body (b) and Kuka manipulator (c). The dashed curves show the same network trained with 25,000 training samples (1/4 of the full sample size).
5.2 Planning Results
Next we compare the impact of various distance metrics on PRM and RRT. The top two rows of Fig. 8 show the cumulative success rate of identifying a collision-free motion plan as a function of time for PRM and RRT using the D_E (red), D^sv_we (blue) and HNS (black) distance metrics in various environments. For all scenarios with PRMs, using the HNS distance metric is more likely to successfully find a solution within the time budget. The gain in success rate at 200 s (the maximum planning time allowed) compared to D_E ranges from 1.27 to 5 times, while the gain over D^sv_we ranges from 1.08 to 2.14 times. In addition, the bottom two rows of Fig. 8 show that paths identified by HNS have a smaller swept volume. Comparing across robots, the results demonstrate that the advantage of HNS is much less prominent for the rigid body robot. This is expected since D_E and D^sv_we both approximate SV reasonably well for this system. HNS shows a similar performance gain for both the 3D Kuka and the 2D 15 DOF manipulators. In the RRT case, HNS also enhances planning by identifying solutions with lower swept volume and identifying solutions where other metrics failed. For example, Fig. 8(a) clearly shows that the 15 DOF manipulator in a narrow corridor is very difficult for RRT, as neither D_E nor D^sv_we found a solution in 20 runs. In contrast, RRTs using HNS were able to identify a solution in 2 runs, likely due to the goal bias mechanism of RRT, which has been shown to significantly increase the performance of RRT [11]. In the planning scenario shown in Fig. 1(a, b), the start and goal have the same joint angles except for the joint at the base. This means D_E between the start and goal is relatively small. However, the robot must curl towards the base and then extend in order to reach the goal. These curled configurations require a large D_E change from the goal configuration and therefore are unlikely to be selected by an RRT using goal bias. As a result, the goal bias is ineffective for the Euclidean-based metrics as it mostly selects configurations near the start. In contrast, HNS does not have this problem since the SV between the start and goal is larger than the SV between any curled configuration and the goal.
Fig. 8. The cumulative success rate of identifying a path (top two rows) and the path length (in units of SV) of successful runs (bottom two rows) for PRM (dotted lines) and RRT (solid lines) evaluated on the 15 DOF manipulator (a, e), free-floating rigid body (b, f), and the Kuka manipulator in the Retrieve task (c, g) and in the Shuffle task (d, h). The color of bars and curves represents the distance metric (red: D_E, blue: D^sv_we, black: D^sv_dnn (our method)). The path length data is not available for RRTs using D_E or D^sv_we since there were zero successful runs.
6 Distance Metric Trade-Offs
In Sect. 5, the advantages of HNS are clear, particularly when D_E or D^sv_we cannot capture SV well, i.e., when the robot has a highly articulated body. Here we investigate the advantages further by empirically evaluating the computational cost and the quality of returned nearest neighbors for each distance metric.
Table 2. Computation time of various operations, broken down by training (Training) and a single usage as a primitive operation in motion planning (Distance call).

  Robot                      Training:                                          Distance call:
                             Data generation  D^sv_we     D^sv_dnn              D^sv_we    D^sv_dnn   SV
  15 DOF manipulator         31 h             630.02 s    4360.03 s             0.081 µs   175.1 µs   8.85 s
  Free-floating rigid body   2 h              601.53 s    4001.53 s             0.053 µs   164.3 µs   0.58 s
  Kuka manipulator           14 h             629.33 s    4023.35 s             0.055 µs   164.3 µs   4.06 s
The computation time required by the distance metrics is shown in Table 2. Recall that D^sv_we and D^sv_dnn are trained once per robot (columns 3 and 4) and utilized one hundred thousand training samples (column 2). After training, the computation time for a single inference of D^sv_dnn is at or just under 175 µs (column 6). Comparing a single inference to the times required to generate SV for each robot shows that D^sv_dnn inference is 3500 to 50,000 times faster than state-of-the-art SV computation. In addition, the computation time of D^sv_dnn is only slightly affected by the DOF of the robot and is independent of the robot's 3D model complexity. On the other hand, computing D^sv_we (column 5) is about 2000 to 3000 times faster than querying D^sv_dnn. These results suggest HNS can reduce computation time, as it identifies candidate nearest neighbors using the fast D^sv_we before the slower, more accurate, D^sv_dnn. It is clear from Sect. 5 that nearest neighboring configurations selected w.r.t. D_E and D^sv_we are very different from ones selected by HNS. However, little is known about the quality of neighboring configurations returned by the distance metrics. We evaluate this by comparing the neighbors returned by each metric to those returned by SV. Since a full SV comparison during a planning run would be computationally prohibitive, we randomly sample 100 starting configurations (c1) and 100 potential neighbor configurations (c2) for each robot. For each c1, the five nearest configurations among c2 are identified using D_E, D^sv_we, D^sv_dnn and HNS (k_c = 10). Then, these configurations are compared to the configurations selected by SV. Table 3 captures the quantity and quality differences in the returned neighbor configurations. First, the percentage of configurations returned by each metric that do not match those returned by SV is shown. Next, the quality of the returned configurations for each metric is demonstrated by tallying the additional volume swept by the returned neighbors over the baseline provided by the configurations returned by SV.
Table 3. Comparison of neighboring configurations selected by various distance metrics as compared to those selected by SV. The Percent of Non-Matching Neighbors columns demonstrate the quantity of neighboring configurations that do not match those selected by SV. The Additional Swept Volume columns capture the amount of additional volume swept by the neighboring configurations selected by the metrics over that of the SV configurations. The best metric of each robot is highlighted.

  Robot        Percent of non-matching neighbors      Percent additional volume swept
               D_E    D^sv_we   D^sv_dnn   HNS        D_E    D^sv_we   D^sv_dnn   HNS
  15 DOF       87%    80%       32%        65%        180%   160%      14%        61%
  Rigid body   65%    22%       11%        11%        38%    7%        2%         2%
  Kuka         87%    33%       12%        15%        56%    8%        1%         2%
These values demonstrate that D_E selects very different neighboring configurations than SV, in one example incurring a 171% increase in swept volume. In contrast, D^sv_we, with weights optimized to mimic SV, chooses neighboring configurations with only up to a 7% increase in additional volume swept for the L-shaped rigid body and the Kuka manipulator. However, the simple weights face difficulty capturing the highly nonlinear SV of the 15 DOF manipulator well, resulting in a 166% increase in volume swept. In contrast, HNS chooses neighboring configurations close to those of SV, and the additional volume swept is much lower than for any other Euclidean-based metric, i.e., 2.7 to 3.5 times smaller than D^sv_we. As expected, D^sv_dnn selects neighboring configurations closest to SV for all robots tested. However, computing D^sv_dnn is much slower than HNS, and efficient nearest neighbor data structures cannot be used since D^sv_dnn does not form a metric space.
7 Conclusion
The ability of DNNs to approximate any continuous bounded function makes them especially well suited to estimating swept volume. We demonstrated this ability for several multibody systems, from a rigid body to manipulator systems. To further enhance efficiency for use in complex sampling-based motion planning scenarios, we integrated the DNN with a trained weighted Euclidean metric. This hierarchical combination of learned metrics retains metric space properties along with high fidelity, and it improved the performance of both RRT and PRM planners in all scenarios tested, particularly when the robot has a highly articulated body.

Acknowledgement. Tapia, Chiang, and Sugaya are partially supported by the National Science Foundation under Grant Numbers IIS-1528047 and IIS-1553266 (Tapia, CAREER). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References
1. Abrams, S., Allen, P.K.: Computing swept volumes. J. Visual. Comput. Animat. 11(2), 69–82 (2000)
2. Abdel-Malek, K., Yang, J., Blackmore, D., Joy, K.: Swept volumes: foundation, perspectives, and applications. Int. J. Shape Model. 12(01), 87–127 (2006)
3. Himmelstein, J.C., Ferre, E., Laumond, J.P.: Swept volume approximation of polygon soups. IEEE Trans. Autom. Sci. Eng. 7(1), 177–183 (2010)
4. Kuffner, J.J.: Effective sampling and distance metrics for 3D rigid body path planning. In: Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pp. 3993–3998 (2004)
5. Kim, Y.J., Varadhan, G., Lin, M.C., Manocha, D.: Fast swept volume approximation of complex polyhedral models. Comput.-Aided Des. 36(11), 1013–1027 (2004)
6. Von Dziegielewski, A., Hemmer, M., Schömer, E.: High precision conservative surface mesh generation for swept volumes. IEEE Trans. Autom. Sci. Eng. 12(1), 183–191 (2015)
7. Campen, M., Kobbelt, L.: Polygonal boundary evaluation of Minkowski sums and swept volumes. Comput. Graph. Forum 29, 1613–1622 (2010)
8. Ekenna, C., Uwacu, D., Thomas, S., Amato, N.M.: Improved roadmap connection via local learning for sampling based planners. In: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3227–3234 (2015)
9. Amato, N.M., Bayazit, O.B., Dale, L.K., Jones, C., Vallejo, D.: Choosing good distance metrics and local planners for probabilistic roadmap methods. In: Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pp. 630–637 (1998)
10. Kavraki, L., Svestka, P., Latombe, J.-C., Overmars, M.: Probabilistic roadmaps for path planning in high-dimensional configuration spaces. In: Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pp. 566–580 (1996)
11. LaValle, S.M., Kuffner, J.J.: Randomized kinodynamic planning. Int. J. Robot. Res. 20(5), 378–400 (2001)
12. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251–257 (1991)
13. Brin, S.: Near neighbor search in large metric spaces. In: Proceedings of International Conference on Very Large Data Bases, pp. 574–584 (1995)
14. http://cs.unm.edu/tapialab/Publications/60appendix.pdf
15. Perrin, N., Stasse, O., Baudouin, L., Lamiraux, F., Yoshida, E.: Fast humanoid robot collision-free footstep planning using swept volume approximations. IEEE Trans. Robot. 28(2), 427–439 (2012)
16. Elbanhawi, M., Simic, M.: Sampling-based robot motion planning: a review. IEEE Access 2, 56–77 (2014)
17. Voelz, A., Graichen, K.: Distance metrics for path planning with dynamic roadmaps. In: Proceedings of International Symposium on Robotics, pp. 126–132 (2016)
18. Palmieri, L., Arras, K.O.: Distance metric learning for RRT-based motion planning with constant-time inference. In: Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pp. 637–643 (2015)
19. Wolfslag, W.J., Bharatheesha, M., Moerland, T.M., Wisse, M.: RRT-CoLearn: towards kinodynamic planning without numerical trajectory optimization. IEEE Robot. Autom. Lett. 3(3), 1655–1662 (2018)
20. Faust, A., Oslund, K., Ramirez, O., Francis, A., Tapia, L., Fiser, M., Davidson, J.: PRM-RL: long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In: Proceedings of IEEE International Conference on Robotics and Automation (ICRA), pp. 5113–5120 (2018)
21. Hahnloser, R.H., Sarpeshkar, R., Mahowald, M.A., Douglas, R.J., Seung, H.S.: Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405(6789), 947–951 (2000)
22. McMahon, T., Jacobs, S., Boyd, B., Tapia, L., Amato, N.M.: Local randomization in neighbor selection improves PRM roadmap quality. In: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4441–4448 (2012)
23. Şucan, I.A., Moll, M., Kavraki, L.E.: The open motion planning library. IEEE Robot. Autom. Mag. 19(4), 72–82 (2012)
Concurrent Nearest-Neighbor Searching for Parallel Sampling-Based Motion Planning in SO(3), SE(3), and Euclidean Spaces

Jeffrey Ichnowski and Ron Alterovitz

University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
{jeffi,ron}@cs.unc.edu
Abstract. This paper presents a fast exact nearest neighbor searching data structure and method that is designed to operate under highly-concurrent parallel operation on modern multi-core processors. Based on a kd-tree, the proposed method is fast, supports metric spaces common to robot motion planning, and supports nearest, k-nearest, and radius-based queries. But unlike traditional approaches using kd-trees, our approach supports simultaneous queries and insertions under concurrency, supports wait-free queries, and provides asymptotically diminishing expected wait-times for random concurrent inserts. We provide proofs of correctness under concurrency, and we demonstrate the proposed method's performance in a parallelized asymptotically-optimal sampling-based motion planner.

Keywords: Nearest neighbors · Concurrent data structure · Sampling-based motion planning

1 Introduction
Nearest neighbor searching data structures are a fundamental building block for many algorithms in robotics. Algorithms such as sampling-based robot motion planners [7] typically need to repeatedly search and insert data into a nearest neighbor data structure, and thus their performance benefits from nearest neighbor operations that are fast. However, with the trend of modern CPUs towards increasing computational parallelism in the form of multiple processor cores, it is no longer sufficient for a data structure to just enable operations to be fast. To harness the full computational power of a multi-core processor, algorithms must also allow for concurrent operations across multiple cores without slowdown. Slowdown is unfortunately worsened by increasing parallelism when the data structure requires concurrent operations to wait for mutually exclusive access to data to ensure correct operation. A concurrent data structure, on the other hand, avoids this source of slowdown by minimizing or eliminating the requirement for mutual exclusion and the associated wait time. In this paper we
propose a concurrent data structure, and associated algorithms, for fast exact nearest neighbor searching that is geared towards robotics applications. The proposed nearest neighbor data structure is based on a kd-tree [5], and thus it provides for fast insert and search operations on metric spaces important to many robotics applications—including Minkowski spaces (a common example being Euclidean), SO(3) [11], and Cartesian products thereof [20]. In the proposed data structure, as with kd-trees, branching nodes partition space into spatially separated sub-trees. Fast insertion of a new point (e.g., a robot configuration for a sampling-based motion planner) into the data structure comes from the ability to quickly traverse the partitions to an insertion point. Fast searches for a set of nearest neighbors to a query point come from the ability to use the partitions to confine traversal to a spatially relevant portion of the tree. With minor modifications to the searching algorithm, searches can also produce nearest neighbor sets that are bounded in cardinality or bounded to be within a radius from the query point. The data structure we propose supports provably correct concurrent operations. This is in contrast to the traditional approach to kd-trees, in which concurrent operation without mutual exclusion leads to data structure corruption. Corruption occurs when concurrent operations interleave mutations that invalidate the computations of each other. The problem is only exacerbated by modern compilers and CPUs as they often automatically and unpredictably change the order of memory accesses to improve the performance of non-concurrent operations. For example, if one operation writes to memory location ‘A’ and then to ‘B’, a concurrent operation may see the change to ‘B’ before it sees the change to ‘A’. While the reordered memory writes do not affect the correctness of the operation in which they occur, they may become problematic for the correctness of an operation running concurrently. An effective way to prevent corruption caused by interleaved mutations and reordering of memory writes, is to only allow one insert operation to happen at any moment in time by using a mutual exclusion locking mechanism. But, by definition, locking prevents concurrent operations, and thus all but one attempted concurrent insert operation will spend time waiting. When an algorithm spends time waiting instead of computing, it effectively slows down. To avoid this slowdown, the data structure we propose supports concurrent wait-free queries, and it also supports inserts that wait with asymptotic probability of zero. In this paper we improve upon the performance of the concurrent nearest neighbor data structure that we introduced in prior work [10]. In that work, a lock-free kd-tree provided wait-free queries and lock-free inserts. The fast lockfree inserts of that approach reduced the likelihood of insert waiting, but come at the expense of increased search times due to imbalances in the resulting tree. In this paper, insert operations produce a more balanced tree resulting in faster queries, and we provide proofs of correct operation and the low probability of waits. We embed the proposed method in a parallelized sampling-based motion planning algorithm to demonstrate its performance and ability to operate under
concurrency on a 32-core computer. The improvements proposed in this paper double nearest-neighbor search performance when compared to our prior lock-free nearest neighbor search data structure, and lead to up to 30% faster convergence rates of the motion planner. Sampling-based motion planners parallelize well [1], but as the results show, contention over exclusive access to non-concurrent nearest neighbor data structures can slow them down significantly. The concurrent data structure we propose allows the parallelized motion planner to find solutions and converge faster by generating orders of magnitude more samples when compared to locked data structures.
2 Related Work
Our proposed nearest neighbor searching approach loosely follows that of a kd-tree [5,9,19]. A kd-tree is a space-partitioning binary tree that splits branching nodes along axis-aligned hyperplanes in R^n. When splitting hyperplanes occur at the median of values in the subtrees, it creates perfectly balanced trees. However, as originally proposed, kd-trees are limited to R^n with a Minkowski metric. Yershova and LaValle [20] extended the metric spaces supported by kd-trees to include SO(2), SO(3), and Cartesian products thereof and with R^n. The SO(3) partitions of this approach are along axis-aligned hyperplanes in R^n. Our prior work [11] proposes partitioning SO(3) using hyperplanes that wrap around the 3-sphere manifold obtained from a quaternion representation of SO(3) rotations. While the data structure we propose in this paper works with either SO(3) partitioning scheme, we expand upon the latter to address the special handling required when inserting values under concurrency.

Generalized nearest-neighbor approaches, such as the Geometric Near-neighbor Access Tree (GNAT) [6], only require a well-behaved metric and thus support a broader set of topologies than kd-trees. The generalized nature of such structures does not take advantage of knowledge of the underlying topology as kd-trees do, and thus they may not be as efficient as kd-trees. Additional work is also required to make such structures support concurrent and wait-free operations.

Approximate nearest neighbor searching approaches gain search efficiency by sacrificing accuracy. Methods include locality sensitive hashing (LSH) [2] and randomized kd-trees [18]. Our focus is on exact nearest neighbor searching, as the proofs of many sampling-based motion planners' asymptotic feasibility (e.g., RRT [16]) and asymptotic optimality (e.g., RRT* [13]) implicitly rely on the nearest neighbor structure being exact. However, if the trade-off of accuracy for speed is appropriate, methods such as those proposed by Arya et al. [3] and Beis et al. [4] shorten the kd-tree search process, producing approximate results faster. We believe similar methods could be readily applied to our proposed method to allow for approximate nearest neighbor searching under concurrency.

Concurrent data structures, such as the binary tree proposed by Kung and Lehman [15], allow correct operation while avoiding contention by having threads lock only the node that they are manipulating. In prior work [10], we proposed a kd-tree that allows concurrent modification and searching while avoiding contention through the use of a lock-free atomic update. When inserting into this
kd-tree, the algorithm makes partitioning choices at the leaf of the kd-tree based upon the bounds of the region and/or the value in the leaf. Empirically this approach works well for the random insertion order of the associated sampling-based planner. However, better search performance is possible with a balanced kd-tree as would be created by median splits. To better approximate median splits in this work, we incorporate the approach described by Sherlock et al. [17] that accumulates a predetermined number of values into leaves before performing a median split on the values within the leaf.
3 Problem Definition
The problem definition is stated in two key parts: (1) correct concurrent operation, and (2) nearest neighbor searching.

Correct Concurrent Operation requires that memory writes of one operation must not adversely affect the memory reads or writes of a concurrent operation, while minimizing the time concurrent operations wait on each other. Once an operation running on a CPU core inserts a point into the data structure, the inserted point will eventually be reachable to all other cores. Once an operation running on a CPU core reaches a point in the data structure, all subsequent operations on that core must continue to reach the point.

Nearest Neighbor Searching finds all the nearest neighbors of a query point. Let C be a topological space which is composed of the Cartesian product of one or more sub-topologies in R^n and SO(3). Let q ∈ C be a single configuration in the topological space with components from each sub-topology, e.g., q = {p_i, ..., r_j, ...}, with p_i ∈ R^{n_i} and r_j ∈ S^3 for each i and j. Each SO(3) component is specified using the coefficients of a unit quaternion representing a rotation in 3D space [14]. Let d(q_1, q_2) be the distance between two configurations, such that it is the weighted sum of the distances of each sub-topology's component:

    d(q_a, q_b) = \sum_i \alpha_i d^p_{R^n}(p_{a_i}, p_{b_i}) + \sum_j \alpha_j d_{SO(3)}(r_{a_j}, r_{b_j}),

where α_i and α_j are positive real weight values, d^p_{R^n}(·,·) is an L^p distance metric on R^n, and d_{SO(3)}(·,·) is the length of the shorter of the two angles subtended along the great arc. Thus:

    d^p_{R^n}(p_a, p_b) = \left( \sum_{i=1}^{n} | p_{a,i} - p_{b,i} |^p \right)^{1/p},
    d_{SO(3)}(r_a, r_b) = \cos^{-1} | r_a \cdot r_b |.
If appropriate to the application, a similar effect to weighting the distance metric can also be obtained by scaling the Rn coefficients instead.
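The following C++ sketch implements this composite metric for the R^3 × SO(3) case evaluated later (with p = 2 assumed for the R^n component); it is a minimal illustration under the definitions above, and the configuration layout is an assumption made here for concreteness. The default α_SO(3) = 100 mirrors the value used in the results section.

  #include <algorithm>
  #include <array>
  #include <cmath>

  struct Config {
    std::array<double, 3> p;  // translation component in R^3
    std::array<double, 4> r;  // unit quaternion for the SO(3) component
  };

  // L2 (p = 2) metric on the R^3 component.
  double d_rn(const Config& a, const Config& b) {
    double s = 0.0;
    for (int i = 0; i < 3; ++i) {
      const double d = a.p[i] - b.p[i];
      s += d * d;
    }
    return std::sqrt(s);
  }

  // Shorter great-arc angle between rotations; the absolute value of the
  // dot product handles double coverage (q and -q are the same rotation).
  double d_so3(const Config& a, const Config& b) {
    double dot = 0.0;
    for (int i = 0; i < 4; ++i) dot += a.r[i] * b.r[i];
    dot = std::min(1.0, std::fabs(dot));  // clamp for numerical safety
    return std::acos(dot);
  }

  // Weighted sum of the sub-topology distances.
  double dist(const Config& a, const Config& b,
              double alpha_rn = 1.0, double alpha_so3 = 100.0) {
    return alpha_rn * d_rn(a, b) + alpha_so3 * d_so3(a, b);
  }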
Given a set Q = {q_1, q_2, ..., q_n} where q_i ∈ C, and a query point q_search ∈ C for some topological space C, the objective of k-nearest neighbors search is to find the set N ⊆ Q, such that |N| = min(k, |Q|), and:

    \max_{q_i \in N} d(q_i, q_{search}) \le \min_{q_j \in Q \setminus N} d(q_j, q_{search}),

where k is a positive integer. With k = 1 it thus finds the nearest neighbor. The objective of r-nearest neighbors search, where r is a non-negative real value, is to find N ⊆ Q, such that:

    N = \{ q_i \in Q \mid d(q_i, q_{search}) \le r \}.
Fig. 1. Lower-dimensional analog of the SO(3) partitioning scheme [11]. In SO(3), quaternions are partitioned into 4 non-overlapping bounded regions of a 3-sphere, with the negative axis mapped onto the positive axis due to the double-coverage property. The 2-sphere analog shown here is partitioned into 3 bounded regions, with the x-centered bounded region highlighted in red. Within the bounded region, evenly separated partitioning hyperplanes are shown in green for one axis and in blue for the other.
4 Method
The proposed method is based upon a kd-tree. A kd-tree is a binary tree data structure in which each branch successively partitions space by an axis-aligned hyperplane, and the leaf nodes contain the points to search. Searching a kd-tree for a query point begins at the root of the tree. When the search encounters a branch, it recurses to the child on the same side of the branch's splitting hyperplane as the query point. When the search encounters a leaf, it checks the distance between the leaf's point and the query point, and adds the point to the result set if the distance is small enough. When returning from recursion, the search then checks the distance between the query point and the closest point on the splitting hyperplane. If the distance between the points is small enough to be added to the result set, then the algorithm recurses to search the other child of the branch. The partitioning approach for SO(3) [11] requires special handling for the top-level SO(3) branch (see the lower-dimensional analog in Fig. 1). Unlike other branches, this branch partitions space into 4 top-level volumes, one for each of
the four components of a quaternion. (See SO3Root in Fig. 2). Once the algorithm has partitioned a value to a top-level SO(3) volume, the branches in the subtree are binary splits—similar to branches in Rn , but with a hyperplane through the origin and defined by a constrained normal (see [11]).
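A sketch of how a value can be assigned to one of the four top-level SO(3) volumes; this is an assumed implementation consistent with the description above (each volume is centered on one quaternion axis, and double coverage is handled by sign), not necessarily the exact scheme of [11].

  #include <array>
  #include <cmath>

  // Assign a unit quaternion to one of the 4 top-level SO(3) volumes:
  // the volume centered on the axis of its largest-magnitude component.
  // Negating the quaternion when that component is negative maps the
  // negative axis onto the positive one (double coverage).
  int so3_volume_index(std::array<double, 4>& q) {
    int best = 0;
    for (int i = 1; i < 4; ++i)
      if (std::fabs(q[i]) > std::fabs(q[best])) best = i;
    if (q[best] < 0.0)
      for (double& c : q) c = -c;  // canonicalize the representative
    return best;
  }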
  Node      { region : Region }
  Leaf      : Node   { size : int, values : T[N] }
  Branch    : Node   { axis : int, prev : Leaf }
  RnBranch  : Branch { split : real, child : Node[2] }
  SO3Branch : Branch { split : vec2d, child : Node[2] }
  SO3Root   : Branch { child : Node[4] }
Fig. 2. Diagram of a possible node design needed to implement the proposed data structure. Each box represents a type of node that can be in the tree, with its name (top) and its data members (below the separating line). Data members are listed as name : type. Array types have their capacity listed in square brackets. Nodes inherit all members from their ancestors (shown with open arrows), thus all node types have a region data member. The three node types that inherit from Branch include a split axis and prev pointer to the Leaf node that the branch replaced. The root of an SO(3) subtree has four children, while the other branch types have a split plane definition and two children. The Leaf node has a current size, and fixed capacity (N ) array of values of the type (T ) stored in the data structure.
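Rendered as C++, the node design of Fig. 2 might look like the following. This is a sketch for orientation only; the atomic members anticipate the concurrency discussion in Sects. 4.2–5, and Region, T, and the leaf capacity are placeholders (a capacity of 8 is the value used in the results).

  #include <array>
  #include <atomic>

  struct Region;                     // axis-aligned bounds (placeholder)
  using T = int;                     // stored value type (placeholder)
  constexpr int N = 8;               // leaf capacity; 8 is used in Sect. 6

  struct Node { Region* region; };

  struct Leaf : Node {
    std::atomic<int> size{0};        // linearization point for appends
    std::array<T, N> values;
    std::atomic<bool> locked{false}; // spin lock taken only by INSERT
  };

  struct Branch : Node {
    int axis;                        // split dimension
    Leaf* prev;                      // the leaf this branch replaced
  };

  struct RnBranch : Branch {
    double split;                    // axis-aligned hyperplane offset
    std::array<std::atomic<Node*>, 2> child;
  };

  struct SO3Branch : Branch {
    std::array<double, 2> split;     // hyperplane on the 3-sphere
    std::array<std::atomic<Node*>, 2> child;
  };

  struct SO3Root : Branch {
    std::array<std::atomic<Node*>, 4> child;  // one per quaternion axis
  };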
4.1 Data Storage
In previous work [10], we proposed a lock-free kd-tree that created a new branch every time a leaf was inserted. That approach has the benefit of making insertions quick and lock-free, but introduces an expense to search performance from two factors: (1) there is little information from which to choose a splitting hyperplane, leading to suboptimal tree balancing, and (2) traversing a branch is more time consuming than a simple point-to-point distance check of a leaf. This performance issue is further exacerbated in algorithms that search more frequently than they insert (e.g., sampling-based motion planning algorithms such as [13,16] that reject samples after checking the validity of paths to nearest neighbors). In the approach proposed herein, we address these two factors to improve search performance by batching many points into leaves before splitting them into branches [17]. In our implementation, the leaf node's batch size is a fixed, tunable parameter of the data structure.
Algorithm 1. INSERT(T, u)
Require: T is the kd-tree, u is the value to insert
 1: p ← root of T
 2: loop
 3:   n ← load(p)
 4:   if n is a branch then
 5:     update n's region to contain u
 6:     p ← FOLLOW(n, u)
 7:   else if not try_lock(n) then        /* n is a leaf */
 8:     relax CPU
 9:   else                                /* acquired lock on n */
10:     m ← load(n.size)
11:     if m < leaf capacity then
12:       update n's region to contain u
13:       append u to leaf n
14:       store(n.size, m + 1)
15:       unlock(n)
16:       return
17:     c ← SPLIT(n, u)
18:     store(p, c)
Algorithm 2. FOLLOW(n, u)
Require: n is a branch
 1: if n is SO(3) root then
 2:   i ← so3_volume_index(u)
 3:   return n.child[i]
 4: else if n is SO(3) branch then
 5:   return n.child[H(u[axis] · n.split)]
 6: else if n is R^n branch then
 7:   return n.child[H(u[axis] − n.split)]
4.2 Inserting Data
Inserting a value into a concurrent batched kd-tree (Algorithm 1) starts at the kd-tree’s root node (line 1) and traverses down the tree until it finds a leaf into which it will insert the new point. At each level of the tree, the current node is checked to see if it is a branch or a leaf. Empty trees and children are stored as leaf nodes with 0 size, and thus do not require special handling. When the algorithm encounters a branch node (line 4), it updates the branch node’s region and traverses to the child under which the new value will be inserted. When the algorithm encounters a leaf it first attempts to lock the leaf (line 7) using a fast spin locking mechanism such as compare-and-swap (CAS) on a boolean flag. If the algorithm fails to lock the node, it issues an optional CPU-specific instruction (line 8) for efficient spin locking, and then it loops to try again. Once the algorithm successfully acquires the lock, it appends the value to the leaf if there is room (line 11), or splits the leaf (line 17) otherwise. When appending
to a leaf, the algorithm ensures the new value is fully initialized before updating the leaf's size (line 14). The size update is a linearization point for making the inserted value reachable to other cores. When splitting the leaf, the algorithm replaces the leaf with the new branch (line 18), and then loops to insert the value into one of the branch's children.

Algorithm 3. SPLIT(n, u)
Require: n is a Leaf, u is the value to insert
 1: axis ← best_axis(n's region)
 2: if axis is first SO(3) then                      /* SO(3) */
 3:   c ← new SO3Root
 4:   for all v ∈ n.values do
 5:     i ← so3_volume_index(v)
 6:     append v to c.child[i]
 7:   return c
 8: else                                             /* Euclidean */
 9:   b0, b1 ← median_split(n, axis)
10:   split ← (max(b0.values) + min(b1.values)) / 2
11:   return new branch with axis, split, b0, b1
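The try_lock/relax pair on lines 7–8 of Algorithm 1 can be realized with a compare-and-swap on a boolean flag, as the text describes. A minimal C++ sketch (using the atomic flag from the hypothetical node layout above) follows; it illustrates the mechanism, not the authors' source.

  #include <atomic>
  #if defined(__x86_64__) || defined(__i386__)
  #include <immintrin.h>  // for _mm_pause
  #endif

  // One attempt at the leaf's spin lock (Algorithm 1, line 7).
  inline bool try_lock(std::atomic<bool>& locked) {
    bool expected = false;
    return locked.compare_exchange_weak(expected, true,
                                        std::memory_order_acquire);
  }

  // CPU hint issued while spinning (Algorithm 1, line 8).
  inline void relax_cpu() {
  #if defined(__x86_64__) || defined(__i386__)
    _mm_pause();
  #endif
  }

  // Release the lock after the append completes (Algorithm 1, line 15).
  inline void unlock(std::atomic<bool>& locked) {
    locked.store(false, std::memory_order_release);
  }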
Fig. 3. Steps of splitting a leaf while operating under concurrency. In (a) the INSERT algorithm traversed to the leaf to add the 'x', finds the leaf is full, and thus calls SPLIT to create a branch. SPLIT partitions the branch along the leaf's horizontal dimension, resulting in the branch shown in (b). After SPLIT returns, INSERT then traverses to the right side, adds to the leaf, and updates the leaf's region (c). During the SPLIT process, concurrent nearest neighbor searches traverse the old leaf. Once the INSERT replaces the leaf with the branch, searches will traverse the branch instead.
Traversing to Insertion Point. FOLLOW (Algorithm 2) implements the branch traversal required by the INSERT algorithm. When it encounters an SO(3) root node, it traverses to the child whose partition contains the sample. When it encounters an SO(3) branch or an Rn branch, it computes the signed
distance between the point to insert and the splitting hyperplane. The sign of the distance selects the child using a Heaviside step function H(·) defined as:

    H(x) = \begin{cases} 0, & x < 0 \\ 1, & x \ge 0. \end{cases}

Splitting Leaf Nodes. When inserting into a full leaf, the INSERT algorithm uses SPLIT (Algorithm 3) to create a branch from the values in the full leaf. For an efficient kd-tree the splitting process will choose a partition that: (1) minimizes the maximum distance between points in the resulting subdivision and (2) divides the values into equal leaf nodes with the same number of elements (median split). To that end, the SPLIT algorithm first selects the best axis for partitioning as the one with the greatest extent between the region bounds of the leaf. The region bounds are maintained by INSERT. For R^n axes, the extent is the difference between the minimum and maximum along each dimension of the bounds. For SO(3) root nodes, the extent is π/2. For SO(3) branch nodes, the extent is the arccosine of the dot product of the minimum and maximum normalized bounds for the axis [11]. If the selected axis is the SO(3) root, the SPLIT algorithm creates a new SO3Root branch node, and copies the old leaf's values into the appropriate child of the SO3Root (lines 3 to 6). Otherwise, for the remaining axis types, median_split (line 9) partitions the values of the old leaf evenly into two new leaf nodes (see Fig. 3(a) and (b)) using an efficient selection algorithm. The SPLIT algorithm returns with a new branch that is split halfway between the maximum of one child and the minimum of the other (line 11). In the presence of concurrency, concurrent nearest neighbor searches will continue to traverse the old leaf until INSERT atomically replaces the old leaf with the new branch. This means that INSERT does not know if the old leaf is being concurrently accessed, and thus cannot release the memory associated with the leaf without risking a program error. The SPLIT algorithm presented here stores a reference to the old leaf to allow the memory associated with the leaf to be safely deallocated later.

4.3 Searching Operations
NEAREST (Algorithm 4) implements k-nearest neighbor search (with k as appropriate and r = ∞) and r-nearest neighbor search (with k = ∞ and r as appropriate). Traversal when searching for a nearest neighbor is similar to that of FOLLOW. The primary difference is that after searching one child of a branch, NEAREST may need to search the other children of the branch. The algorithm starts with a pointer n to the root node of the kd-tree, and an empty set N of nearest neighbors. It terminates recursion if the node's region (as maintained by INSERT) is too far away from the query point to be added to the nearest neighbor set. If the node is a leaf (lines 3 to 7), it iterates through each value in the leaf, updating N as appropriate. Here it first loads the node's size, ensuring that it will only visit consistent values in the leaf based upon the linearization point in INSERT.
Algorithm 4. NEAREST(N, n, q, k, r)
Require: N is the set of nearest neighbor results so far, n is a pointer to the current node, q is the query, k is the maximum |N| to return, r is the maximum radius
 1: if |N| < k or dist(n.region, q) ≤ min(r, max N) then
 2:   if n is leaf then
 3:     for all i ∈ {0, ..., load(n.size)} do
 4:       if |N| < k or dist(n.values[i], q) < min(r, max N) then
 5:         if |N| = k then
 6:           N ← N \ (max N)
 7:         N ← N ∪ n.values[i]
 8:   else if n is SO(3) root then
 9:     i ← so3_volume_index(q)
10:     N ← NEAREST(N, load(n.child[i]), q, k, r)
11:     for all v ∈ {0, 1, 2, 3} \ i do
12:       N ← NEAREST(N, load(n.child[v]), q, k, r)
13:   else
14:     if n is SO(3) branch then
15:       c ← H(q · n.split)
16:     else /* n is R^n branch */
17:       c ← H(q[axis] − n.split)
18:     N ← NEAREST(N, load(n.child[c]), q, k, r)
19:     N ← NEAREST(N, load(n.child[1 − c]), q, k, r)
20: return N
When traversing an SO3Root node, NEAREST navigates the search key’s SO(3) axis-major volume first (lines 9 and 10). It then searches the remaining volumes in an arbitrary order (lines 11 to 12). When traversing an SO3Branch node or RnBranch node (lines 15 to 19), the algorithm first traverses a child in the same order as FOLLOW does. After returning from recursion on that child, it then traverses the other child. By recursing on the closer child first, updates to N will cause the traversal on the farther child to terminate quickly on line 1.
5 Correctness and Analysis
In this section we prove that NEAREST is wait-free and correct with concurrent INSERTS (Lemma 2), and provide analysis on the probability that INSERT waits (Lemma 4). Correct operation relies upon linearizable operations which appear to occur instantaneously at a linearization point from the perspective of concurrent operations. Thus, before the linearization point, the linearizable operation has not occurred, and after the linearization point, the operation has occurred—there is no intermediate point in which the operation partially occurs. We prove that INSERT is linearizable (Lemma 1) and that once a value is reachable it remains reachable (Lemma 3). The following proofs depend upon release and acquire ordering semantics where noted in the algorithms. These semantics ensure that all memory writes that happen before the release-ordered
store (via store(a, ·)) become visible side-effects of an acquire-ordered load (via load(a)). Implementations must explicitly ensure this ordering. Lemma 1. The INSERT operation is linearizable. Proof. INSERT can modify a leaf in one of two ways: (1) by appending a value to a leaf, or (2) splitting the leaf into a branch. As such, there are two linearization point cases to make INSERT linearizable. Case (1): INSERTs do not store new values until they have exclusive write access to a leaf, and thus no two INSERT operations will concurrently store a value into the same leaf. INSERT stores the new value one past the leaf’s size limit before incrementing the size with a release-order store. Concurrent operations do not read values in a leaf past the leaf’s size limit, thus storing the incremented size is the linearization point for this case. Case (2): INSERT splits a leaf by replacing it with a new branch node with children populated from the values from the leaf. As INSERT locks the leaf before populating the branch’s children, the same values will be present in both leaf and branch. INSERT replaces the pointer to the leaf with the pointer to the branch using a release-order store. Since concurrent operations will either load a pointer to the leaf before the store, or to the branch after the store, the store is the linearization point for this case. Both cases have linearization points, and thus INSERT is linearizable. In case (2), unlike case (1), the leaf is not (necessarily) unlocked, as concurrent INSERT operations waiting for the leaf will load the new branch after the linearization point, and recurse to operate on a child of the new branch. Lemma 2. The NEAREST operation is wait-free, and concurrent INSERT operations do not cause incorrect operation. Proof. The NEAREST operation contains no blocking waits or retry loops, and thus will not wait on other operations. Correct operation under concurrency results from the two linearization points of NEAREST. In case (1), when NEAREST visits a leaf, it first performs an acquire-order load of the leaf’s size before iterating through the values in the leaf. As incrementing the size is the linearization point, NEAREST will only iterate through values in the leaf stored before the linearization point, and thus it will only traverse consistent data. In case (2), when NEAREST recurses to search a child of a branch, it performs an acquire-order load of a pointer to the child. NEAREST will either load the pointer before or after the corresponding linearization point of INSERT. If NEAREST loads the child before the linearization point, it will recurse to visit the leaf. If NEAREST loads the child after the linearization point, it will recurse to visit the branch. All nodes remain valid once reached, including leaf nodes that have been replaced by branch nodes, thus NEAREST will operate correctly under concurrency. Lemma 3. Once a value is reachable by NEAREST, the value will remain reachable to all subsequent NEAREST operations.
Proof. A value is first reachable after linearization point case (1) of INSERT. The leaf in which the value resides remains reachable until linearization point case (2) of INSERT. After the linearization point case (2), all values from the original leaf reside in the child nodes of the branch that replaced the original leaf. The originally reachable value thus remains reachable before and after linearization point case (2), and the value will thus always remain reachable to subsequent NEAREST operations.

Lemma 4. With uniform random insertion, an INSERT operation waits, or causes a wait, with probability (1 − ((n − 1)/n)^{p−1}), where p is the number of concurrent INSERT operations and n is the number of leaf nodes in the tree. INSERT asymptotically almost surely does not wait.

Proof. An INSERT will loop, and thus effectively wait on line 7, if a concurrent INSERT had a successful try_lock on the same leaf. Leaf nodes represent bounded subregions of the space with uniform distribution. We cast this as the generalized birthday problem, and follow its derivation. Let P(A_n) be the probability that an INSERT concurrently updates a leaf of the same bounded subregion as any of the other (p − 1) concurrent INSERTs. This is equivalent to 1 − P(Ā_n), where P(Ā_n) is the probability that no other INSERT concurrently updates the same bounded region. We compute P(Ā_n) as the joint probability that the (p − 1) other INSERT operations are updating different regions. Thus,

    P(A_n) = 1 - P(\overline{A}_n) = 1 - \left( \frac{n-1}{n} \right)^{p-1}.

It follows that lim_{n→∞} P(Ā_n) = 1, and thus INSERT asymptotically almost surely does not wait. (For example, with p = 32 concurrent inserts and n = 10^4 leaves, the wait probability is 1 − (9999/10000)^{31} ≈ 0.3%.)
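To make the linearization points concrete, here is a C++ sketch of case (1) of Lemma 1 — the release-ordered size increment in INSERT — and the matching acquire-ordered load in NEAREST, using the hypothetical Leaf layout sketched earlier. It illustrates the ordering discipline the proofs rely on, not the authors' source.

  #include <atomic>

  // Case (1) of Lemma 1: append while holding the leaf lock. The value
  // is fully written before the release-store of size, so a reader that
  // acquire-loads the new size is guaranteed to observe the value.
  template <typename LeafT, typename ValueT>
  bool append(LeafT& leaf, const ValueT& v, int capacity) {
    const int m = leaf.size.load(std::memory_order_relaxed);  // lock held
    if (m >= capacity) return false;       // caller must SPLIT instead
    leaf.values[m] = v;                    // initialize the value first
    leaf.size.store(m + 1, std::memory_order_release);  // linearization
    return true;
  }

  // NEAREST's leaf visit (Lemma 2): acquire-load the size once, then
  // read only the first `size` values, which are consistent by Lemma 1.
  template <typename LeafT, typename Visit>
  void visit_leaf(const LeafT& leaf, Visit visit) {
    const int m = leaf.size.load(std::memory_order_acquire);
    for (int i = 0; i < m; ++i) visit(leaf.values[i]);
  }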
6 Results
We evaluate the proposed data structure by embedding it in PRRT* [10], a lock-free parallelized asymptotically optimal sampling-based motion planner. The data structure and planner implementations use the standard C++ atomic library [12] for memory operations that require release and acquire semantics. PRRT* uses the proposed data structure for concurrent insert, nearest, and k-nearest operations. We have PRRT* compute motion plans in two SE(3) rigid-body scenarios from OMPL [8] on a computer with four Intel x7550 2.0-GHz 8-core Nehalem-EX processors, using all 32 cores. The experiments compare both concurrent and locked nearest neighbor data structures to show the benefit of using data structures designed for concurrency. Locking on the data structure makes use of an efficient reader/writer lock, under the observation that insertions are relatively fast and infrequent compared to time spent nearest neighbor searching. Thus the locked version of the data structure is exclusively write-locked when inserting, and shared read-locked when searching. This prevents searches from traversing an inconsistent
data structure that would result from partial mutations and reordered memory writes of a concurrent insert. It also allows multiple concurrent searches that only block when there is a concurrent insert.

Fig. 4. The proposed data structure speeds up parallelized motion planning in the "Home" SE(3) scenario from OMPL. In this scenario, the motion planner finds a path for the red table to move from the lower right to the upper right configuration. The graph in (a) shows the time in milliseconds spent performing nearest neighbor operations (insert, nearest, and k-nearest) relative to the size of the nearest neighbor structure. To illustrate relative impact on overall planner performance, the graph also shows the time spent in collision detection, which is typically the other dominant time consumer in sampling-based motion planners. For the locked versions of the nearest-neighbor structures, the time spent waiting for the lock is shown in the shaded area—the lower boundary of the region is the time spent performing a nearest neighbor operation, and the height of the region is the time the planner must wait for the nearest neighbor operation including the lock. Locked structures (dotted lines) become prohibitively expensive to benchmark past a graph size of 10^5. With the concurrent data structures (solid lines) the total wall-clock time to generate a data structure with 10^6 points averaged between 1 and 2 min. The graph in (b) shows the average path cost relative to the estimated optimal path cost as it converges over wall-clock time. The graph in (c) shows the time relative to the proposed method to compute the same solution cost—the proposed method finds the same solution 10% to 30% faster than previous lock-free methods.
Fig. 5. Speeding up planning on OMPL’s “Cubicles” scenario. See description in Fig. 4.
We compare our proposed method to the linear (brute-force) implementation included in OMPL, the GNAT implementation included in OMPL, the dynamically rebalanced median-split kd-tree from prior work [11], and the original lock-free kd-trees in PRRT*. The OMPL methods and the median-split kd-tree method are read/write locked. The concurrent methods are also evaluated in locked form. The implementations of the locked versions of the kd-trees do not make use of memory-ordering operations, and thus run slightly faster in the absence of concurrency. In all experiments, the leaf nodes of the proposed method are configured to have a capacity of 8. Figures 4 and 5 show two evaluated scenarios involving motion planning for a robot in SE(3). In both scenarios the motion planner must find, and asymptotically optimize, a path for a rigid body robot through a 3D environment. The topological space for nearest neighbor searching is thus R^3 × SO(3). We set the SO(3) distance scale factor to α_SO(3) = 100, and leave α_{R^3} = 1 (the default). The R^3 space extends for hundreds of units, so this makes the two sub-topologies approximately evenly weighted. This weighting has two effects: (1) it makes rotations more expensive, thus as the motion planner converges, the robot rotates less freely than otherwise, and (2) it ensures that the kd-tree splits both R^3 and SO(3) axes.
Figures 4(a) and 5(a) show the time spent in nearest neighbor operations (both inserts and searches) per sampling iteration based upon the size of the PRRT* graph (which is equivalent to the number of points in the nearest neighbor data structure). These graphs show both the time spent searching (bottom line of shaded regions) and the time spent waiting on a lock (shaded regions). We generate data structures with sizes up to 10^5 for the locked versions, stopping then because it becomes too time consuming to continue to the next order of magnitude. The concurrent versions of the kd-tree continue to 1 million. In both graphs we observe that our proposed method performs better than the alternatives, even under high concurrency, with roughly half the time (the graph is log scaled) spent compared to the best alternatives. To demonstrate the relative impact on the motion planner, the graph includes the time spent in collision detection, which typically is the other most time consuming part of a sampling-based motion planner. From the graphs, we observe the time spent in collision detection shrinks as its computation time is a function of the shrinking expected distance between random samples. We observe that nearest neighbor operations eventually dominate the per-iteration time. Figures 4(b) and 5(b) show the overall effect on the convergence rate of the asymptotically-optimal sampling-based planner. Due to the acceleration of each iteration, the motion planner is able to find lower-cost paths faster. The alternate presentation of the same data in Figs. 4(c) and 5(c) shows that the proposed method results in approximately 20% to 30% faster convergence of PRRT*.
7 Conclusion
This paper proposes and evaluates an exact nearest neighbor data structure that handles concurrent inserts and queries. Based on a kd-tree, the proposed data structure supports searching nearest neighbors on topologies relevant to robotics. The paper described how to support Cartesian products of an arbitrary number of Euclidean and SO(3) spaces with a distance metric that is the weighted sum of sub-topology components within the concurrent data structure. In evaluation, a parallelized asymptotically-optimal sampling-based motion planner used the proposed data structure to accelerate motion planning. The proposed data structure enabled higher performance than the motion planner’s default lock-free kd-tree. Furthermore, the faster performance relative to the lock-based alternatives demonstrates the importance of having a concurrent data structure in parallel processing algorithms such as sampling-based motion planning that depend heavily on nearest-neighbor searching. In future work, we plan to explore using the proposed data structure with approximate nearest neighbor searching while maintaining its design for concurrency. Observing that the batching approach contributes significantly to performance, we also plan on exploring tuning the batch-size parameter. The batch-size might be optimized based upon static configuration, such as the topology; or on dynamic data, such as the leaf node’s position in the tree and region in space.
Acknowledgments. This research was supported in part by the U.S. National Science Foundation (NSF) under Awards IIS-1149965 and CCF-1533844.
References
1. Amato, N.M., Dale, L.K.: Probabilistic roadmap methods are embarrassingly parallel. In: Proceedings of IEEE International Conference on Robotics and Automation, pp. 688–694 (1999)
2. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In: 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 459–468. IEEE (2006)
3. Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A.Y.: An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J. ACM 45(6), 891–923 (1998)
4. Beis, J.S., Lowe, D.G.: Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1000–1006. IEEE (1997)
5. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)
6. Brin, S.: Near neighbor search in large metric spaces. In: Proceedings of 21st Conference on Very Large Databases (VLDB), Zurich, Switzerland, pp. 574–584 (1995)
7. Choset, H., et al.: Principles of Robot Motion: Theory, Algorithms, and Implementations. MIT Press, Cambridge (2005)
8. Şucan, I.A., Moll, M., Kavraki, L.E.: The open motion planning library. IEEE Robot. Autom. Mag. 19(4), 72–82 (2012). ompl.kavrakilab.org
9. Friedman, J.H., Bentley, J.L., Finkel, R.A.: An algorithm for finding best matches in logarithmic expected time. ACM Trans. Math. Softw. (TOMS) 3(3), 209–226 (1977)
10. Ichnowski, J., Alterovitz, R.: Scalable multicore motion planning using lock-free concurrency. IEEE Trans. Robot. 30(5), 1123–1136 (2014)
11. Ichnowski, J., Alterovitz, R.: Fast nearest neighbor search in SE(3) for sampling-based motion planning. In: Algorithmic Foundations of Robotics XI, pp. 197–214. Springer (2015)
12. ISO/IEC: ISO international standard ISO/IEC 14882:2017(E)—programming languages—C++. Standard, International Organization for Standards (ISO), Geneva, Switzerland, December 2017
13. Karaman, S., Frazzoli, E.: Sampling-based algorithms for optimal motion planning. Int. J. Robot. Res. 30(7), 846–894 (2011)
14. Kuipers, J.B., et al.: Quaternions and Rotation Sequences, vol. 66. Princeton University Press, Princeton (1999)
15. Kung, H., Lehman, P.L.: Concurrent manipulation of binary search trees. ACM Trans. Database Syst. (TODS) 5(3), 354–382 (1980)
16. LaValle, S.M., Kuffner, J.J.: Rapidly-exploring random trees: progress and prospects. In: Donald, B.R., et al. (eds.) Algorithmic and Computational Robotics: New Directions, pp. 293–308. AK Peters, Natick (2001)
17. Sherlock, C., Golightly, A., Henderson, D.A.: Adaptive, delayed-acceptance MCMC for targets with expensive likelihoods. J. Comput. Graph. Stat. 26(2), 434–444 (2017)
Concurrent Nearest-Neighbor Searching
85
18. Silpa-Anan, C., Hartley, R.: Optimised kd-trees for fast image descriptor matching. In: IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008, pp. 1–8. IEEE (2008) 19. Sproull, R.F.: Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica 6(1–6), 579–589 (1991) 20. Yershova, A., LaValle, S.M.: Improving motion planning algorithms by efficient nearest-neighbor searching. IEEE Trans. Robot. 23(1), 151–157 (2007)
Data Structures for Motion Planning
A Visibility-Based Approach to Computing Nondeterministic Bouncing Strategies

Alexandra Q. Nilles¹, Yingying Ren¹, Israel Becerra¹, and Steven M. LaValle¹,²

¹ Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
{nilles2,yren17,israelb4,lavalle}@illinois.edu
² Faculty of Information Technology and Electrical Engineering, University of Oulu, Oulu, Finland
Abstract. Inspired by motion patterns of some commercially available mobile robots, we investigate the power of robots that move forward in straight lines until colliding with an environment boundary, at which point they can rotate in place and move forward again; we visualize this as the robot "bouncing" off boundaries. Different boundary interaction rules can be defined for such robots, such as one that orients the robot relative to its heading prior to collision, or relative to the normal of the boundary. We introduce a new data structure, the bounce visibility graph, which is generated from a polygonal environment definition. The bounce visibility graph can be queried to determine the feasibility of path-based tasks such as navigation and patrolling, assuming we have unavoidable nondeterminism in our actuation. If the task is feasible, then this approach synthesizes a strategy (a sequence of nondeterministic rotations). We also show how to compute stable cyclic trajectories and use these to limit uncertainty in the robot's position (Software implementation at https://github.com/alexandroid000/bounce_viz).
1 Introduction
Mobile robots have rolled smoothly into our everyday lives, and can now be spotted vacuuming our floors, cleaning our pools, mowing our grass, and moving goods in our warehouses. These tasks require that the robot's path achieve certain properties; for example, a vacuuming robot should cover the whole space while not visiting any particular area more frequently than others. A robot that is monitoring humidity or temperature in a warehouse should repeat its path consistently so data can be compared over time. In many such examples, the strategies for controlling the robot's path may be decoupled from the specific application, so we envision building a library of useful high-level path controllers based on desired dynamical system properties. Current algorithmic approaches to these tasks take two flavors: (1) maximizing the information available to the robot through high-fidelity sensors and
map-generating algorithms such as SLAM, or (2) minimizing the information needed by the robot, such as the largely random navigation strategies of the early robot vacuums. The first approach is powerful and well-suited to dynamic environments, but also resource-intensive in terms of energy, computation, and storage space. The second approach is more resource efficient, but does not immediately provide general-purpose strategies for navigation and loop-closure. We propose a combined approach: first, global geometry of environment boundaries is provided to the system, either a priori or calculated online by an algorithm such as SLAM. The global geometry is then processed to produce a strategy providing strong formal guarantees and which can be executed with minimal processing power and only low bandwidth local sensors, such as bump and proximity sensors, instead of high bandwidth sensors such as cameras. We consider simple robots with “bouncing” behaviors: robots that travel in straight lines in the plane, until encountering an environment boundary, at which point they rotate in place and set off again. The change in robot state at the boundary is modelled by what we call bounce rules. These interactions may be mechanical (the robot actually makes contact with a surface), or may be simulated with virtual boundaries (such as the perimeter wire systems used by lawn mowing robots [26]). Physical implementations of the bouncing maneuver have been validated experimentally [2,14]. Often these lines of work consider a subset of the strategy space, such as an iterated fixed bounce rule. In this work, we present a tractable approach to reasoning over all possible bounce strategies, generalizing previous tools for analyzing a few given bounce rules. This paper presents three ideas for simplifying the characterization and generation of paths of such mobile robots. The first crucial aspect of our approach is that we assume that the robot has some intrinsic nondeterminism, so we generate nondeterministic strategies such that if the robot executes any action in a family at each stage, the strategy will succeed. We can then analyze the size of the bounce rule sets at each stage to determine the minimum required accuracy of a successful robot design. Our second contribution is that by using the geometric structure of a polygonal environment, we can create a combinatorial representation of the environment that lets us reason over a finite number of families of paths, instead of the infinite collection of all possible paths. Finally, we provide results on when transitions in this robotic system are contracting: two potential robot states under the transition move closer together. This property can be used to control uncertainty through actuation, and we provide the first steps toward leveraging this property (Fig. 1).
2 Related Work
We incorporate techniques from computational geometry, specifically visibility [11]. Visibility has been considered extensively in robotics, but usually with the goal of avoiding obstacles [16,27]. To plan over collisions, we use the edge visibility graph, analyzed in [24], and shown to be strictly more powerful than the vertex visibility graph. Our work is also related to problems that consider what
Fig. 1. Two paths produced by different sequences of bounces, which visit different points of ∂P , yet have the same sequence of edge collisions and high-level dynamical behavior (escape the room on the left, travel through hallway, then patrol the room on the right in a periodic orbit).
parts of a polygon will be illuminated by a light source after multiple reflections (as if the edges of the polygon are mirrors) [3], or with diffuse reflections [25], which are related to our nondeterministic bounces. Our robot motion model is related to dynamical billiards [31]. Modified billiard systems have attracted recent interest [7,18,31]. One similar work was inspired by the dynamics of microorganisms; in [28], the authors show that Chlamydomonas reinhardtii "bounce" off boundaries at a specific angle determined by their body morphology. They characterize periodic and chaotic trajectories of such agents in regular polygons, planar curves, and other environments. Our motion model is a form of compliant motion, in which task constraints are used to guide task completion, even when the exact system state is not known. Our use of contraction mappings and nondeterministic reasoning is related to the idea of funnelling: using the attraction regions of a dynamical system to guide states into a goal region. These ideas have been developed in the context of manipulation and fine motion control by Whitney [32], Mason [19], Erdmann [9], Goldberg [12], Lozano-Pérez et al. [15], Lynch and Mason [17], and Burridge et al. [5], among many others. Our intentional use of collisions with environment boundaries is enabled by the advent of more robust, lightweight mobile robots. Collisions as information sources have also been recently explored for multi-robot systems [20]. The first-class study of the dynamical properties of bouncing was proposed in [10] and continued in [23]. Here we extend and improve these analysis tools, and incorporate visibility properties to discretize the strategy space. In [1] and [2], the authors describe navigation, coverage, and localization algorithms for a robot that rotates a fixed amount relative to the robot's prior heading. Due to the chosen state space discretization, periodic trajectories may exist which require bounce angles not allowed by the discretization. By using a discretization induced by the environment geometry, we are able to find all possible limit cycles, and the controllers necessary to achieve them. Another closely related work is [14], which considered navigation for a robot that has our same motion model, with uncertainty given as input to the algorithm. Instead
of taking error bounds as input, our approach considers all possible amounts of uncertainty, and returns bounds on the required accuracy for strategies. We are working toward a generalization and hierarchy of robot models, in a similar spirit to [4]. Their simple combinatorial robot is able to detect all visible vertices and any edges between them, and move straight toward vertices. By augmenting this basic robot model with sensors they construct a hierarchy of these robot models. Our questions are related but different: if the robot is given a compact, purely combinatorial environment representation, what tasks can it accomplish with minimal sensing?
3 Model and Definitions
We consider the movement of a point robot in a simple polygonal environment, potentially with polygonal obstacles. All index arithmetic for polygons with n vertices is mod n throughout this paper. We do not require polygons in general position. We do not consider trajectories that intersect polygon vertices. We define our robot to move forward in a straight line, until encountering an environment boundary (such as when a bump or range sensor is triggered). Once at a boundary, the robot is able to rotate in place. More formally, the model is:
– The configuration space X = ∂P × S¹. P is a simple polygon, potentially with holes. P has boundary ∂P, the union of external and obstacle boundaries. S¹ is the robot's orientation in the plane. Let s refer to a point in ∂P without an associated robot orientation.
– The action space U = [0, π], where u ∈ U is the orientation the robot takes when it reaches an environment boundary, before moving forward again. u is measured counterclockwise relative to the boundary tangent.
– The state transition function f : X × U → X, which describes how actions change the state of the robot. We will often lift this function to act nondeterministically over sets of states and actions, propagating each state forward under each possible action, and unioning the resulting set of states.
– We model time as proceeding in stages; at each stage k, the robot is in contact with the environment boundary, executes an action u_k, and then moves forward until the next contact with the boundary at stage k + 1.

Definition 1. Let α ∈ (0, π) be the robot's incoming heading relative to the environment boundary at the point of contact. Let θ ∈ (0, π) be a control parameter. A bounce rule b maps α and θ to an action u, and determines how the robot will reorient itself when it collides with a boundary. Bounce rules are defined in the frame in which the environment normal is aligned with the positive y axis and the robot's point of contact with the boundary is the origin.

Definition 2. A nondeterministic bounce rule is a bounce rule lifted to return a set of actions. We will restrict nondeterministic bounce rules to return a convex set of actions (a single interval in (0, π)).
Fig. 2. Examples of different “bounce rules” that can be implemented on mobile robots. In the first row, b(α, θ) = θ, which we refer to as a fixed bounce rule. In the second row, we have a monotonic fixed bounce rule, in which b(α, θ) = θ or π − θ, depending on what is necessary to preserve monotonicity of travel direction along the x axis. In the third row, we have a relative bounce rule, b(α, θ) = α − θ, rotating α through θ in the clockwise direction. If this rotation causes the heading of the robot to still be facing into the boundary, the robot performs the rotation again until its heading points into the free space.
For example, a specular bounce (laser beams, billiard balls) has bounce rule b(α, θ) = π − α. See Fig. 2 for more examples of bounce rules.

Definition 3. A bounce strategy is a sequence of nondeterministic bounce rules.

Of course, robots rarely move perfectly, so our analysis will assume the robot has some nondeterminism in its motion execution. Instead of modelling explicit distributions, we assume the robot may execute any action in a set. Planning over such nondeterminism can result in design constraints for robots; for example, if a robot has an uncertainty distribution over its actions in the range u ± ε, the largest allowable ε will be half the width of the smallest convex bounce rule interval in a strategy.
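As an illustration (a hypothetical Python encoding of the rules in Fig. 2 and Definition 1, not code from the authors' implementation), the bounce rules can be written as functions of the incoming heading α and control parameter θ, both measured counterclockwise from the boundary tangent:

```python
import math

def fixed_bounce(alpha, theta):
    # Fixed bounce rule: b(alpha, theta) = theta, independent of alpha.
    return theta

def specular_bounce(alpha, theta=None):
    # Specular bounce (laser beams, billiard balls): b(alpha, theta) = pi - alpha.
    return math.pi - alpha

def relative_bounce(alpha, theta):
    # Relative bounce rule: b(alpha, theta) = alpha - theta (a clockwise
    # rotation); if the result still points into the boundary, i.e., lies
    # outside (0, pi), rotate again (terminates for generic theta).
    u = (alpha - theta) % (2 * math.pi)
    while not 0 < u < math.pi:
        u = (u - theta) % (2 * math.pi)
    return u

def nondeterministic_fixed(theta, eps):
    # A nondeterministic bounce rule returning a convex action set:
    # the interval theta +/- eps, clipped to (0, pi).
    return (max(0.0, theta - eps), min(math.pi, theta + eps))
```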
4 Visibility-Based Boundary Partitioning
Given a polygonal environment (such as the floor plan of a warehouse or office space), we would like to synthesize bounce strategies that allow a robot to perform a given task (such as navigation or patrolling). To do so, we first discretize the continuous space of all possible transitions between points on ∂P . We find a visibility-based partition that encodes the idea of some points on ∂P having different available transitions. Definition 4. The visibility polygon of a point s in a polygon P is the polygon formed by all the points in P that can be connected to s with a line segment that lies entirely within P . Imagine a robot sliding along the boundary of a polygon, calculating the visibility polygon as it moves. In a convex polygon, nothing exciting happens.
In a nonconvex polygon, the reflex vertices (vertices with an internal angle greater than π) cause interesting behavior. As the robot slides, its visibility polygon mostly changes continuously. Edges shrink or grow, but the combinatorial structure of the polygon remains the same, until it aligns with a reflex vertex r and another vertex v (visible from r). At this point, either v will become visible to the robot, adding an edge to the visibility polygon, or v will disappear behind r, removing an edge from the visibility polygon. To compute all such points at which the visibility polygon changes structure, we compute the partial local sequence, defined in [24]. Each point in the partial local sequence marks the point at which a visible vertex appears or disappears behind a reflex vertex. The sequence is constructed by shooting a ray through each reflex vertex r from every visible vertex and keeping the resulting sequence of intersections with ∂P. See Fig. 3 for an example of the vertices in the partial local sequence of v_0. Once all the partial local sequences have been inserted into the original polygon, the resulting segments have the property that any two points in the segment can see the same edge set of the original polygon (though they may see different portions of those edges). See Algorithm 1 for a pseudocode description of this partitioning process. Algorithm 1 applies to polygons with or without holes; holes require more bookkeeping to correctly find visible vertices and shoot rays. See Fig. 5 for an example partition of a polygon with holes. Let P′ be the polygon P after application of Algorithm 1.

Algorithm 1. PartitionPoly(P)
Input: A polygon P as a list of vertices in counterclockwise order.
Output: P′: P with all partial local sequence points added as new vertices.
1: v_new ← {}
2: reflex_verts ← GetReflexVerts(P)
3: for v_r in reflex_verts do
4:   for v_vis in VisibleVerts(P, v_r) do
5:     v_new ← v_new ∪ ShootRay(v_vis, v_r)
6:   end for
7: end for
8: P′ ← InsertVerts(v_new, P)
9: return P′
The ShootRay function takes two visible vertices v_1 and v_2 and computes the first intersection of ∂P and the ray v_1 v_2. This operation will take O(log n) time after preprocessing the polygon P in O(n) [30]. The VisibleVerts function computes all visible vertices in the polygon given an input query vertex, and takes O(n) [8]. So the total runtime of Algorithm 1 is O(n² log n).

Definition 5. The edge visibility graph of a polygon P has a node for each edge of P, and has an arc between two nodes (e_i, e_j) if and only if there is a point s_i in the open edge e_i and a point s_j on the open edge e_j such that s_i and s_j can be connected with a line segment which is entirely within the interior of P.
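A brute-force sketch of the ShootRay primitive follows (Python; O(n) per query rather than the O(log n) structure of [30], and written here only to fix ideas — the names match Algorithm 1 but the code is not the authors'):

```python
import numpy as np

def _ray_hits_segment(o, d, a, b):
    # Solve o + t*d = a + s*(b - a) for t > 0 and 0 <= s <= 1.
    e = b - a
    denom = d[0] * e[1] - d[1] * e[0]
    if abs(denom) < 1e-12:
        return None                      # ray parallel to segment
    ao = a - o
    t = (ao[0] * e[1] - ao[1] * e[0]) / denom
    s = (ao[0] * d[1] - ao[1] * d[0]) / denom
    return (t, o + t * d) if (t > 1e-9 and 0.0 <= s <= 1.0) else None

def shoot_ray(v_vis, v_r, polygon):
    # Cast a ray from visible vertex v_vis through reflex vertex v_r and
    # return the first boundary intersection strictly beyond v_r
    # (t = 1 corresponds to v_r itself).
    o = np.asarray(v_vis, float)
    d = np.asarray(v_r, float) - o
    n, best = len(polygon), None
    for i in range(n):
        a = np.asarray(polygon[i], float)
        b = np.asarray(polygon[(i + 1) % n], float)
        hit = _ray_hits_segment(o, d, a, b)
        if hit and hit[0] > 1.0 + 1e-9 and (best is None or hit[0] < best[0]):
            best = hit
    return None if best is None else best[1]
```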
Fig. 3. On the left, the partial local sequence for v_0. On the right, an example polygon for which the bounce visibility graph has O(n⁴) edges.
Definition 6. Let the bounce visibility graph be the directed edge visibility graph of P′.

Although visibility is a symmetric property, we use directed edges in the bounce visibility graph so that we can model the geometric constraints on the visibility from one edge to another, which are not symmetric and govern what actions allow the robot to accomplish that transition. See Sect. 5 for further exploration of this idea. We now introduce the bounce visibility diagram, a helpful representation of the vertex-edge visibility structure of a partitioned polygon P′. If |∂P| is the perimeter length of ∂P, let the x axis of the bounce visibility diagram be the interval [0, |∂P|), in which the endpoints of the interval are identified. Let the y axis of the diagram be a parameter θ, which is an angle between 0 and π. For a point s ∈ ∂P, we can compute all the visible vertices and the angle at which they are visible. The graph of all angles at which vertices are visible from points in ∂P is the bounce visibility diagram. The bounce visibility diagram gives us some visual insight into the structure of the problem. First, it shows our motivation for the specific partitioning we have chosen: at each inserted vertex, another vertex appears or disappears from view. Of course, such transitions also occur at the original vertices of P. Thus, the partition induced by the vertices of P and our inserted partial local sequences captures all combinatorial changes in the visible vertices. Slicing the diagram vertically at a specific s ∈ ∂P produces a list of vertices visible from that point on the boundary (the combinatorial visibility vector [29]), and vertices of P′ are points where this vector changes. This diagram gives us some insight into which segment-to-segment transitions are possible under uncertainty. Within a segment (v_i, v_{i+1}), if two successive elements of the cvv are adjacent vertices, then some transition exists from (v_i, v_{i+1}) to a point on the edge between those vertices. Also, if within an entire segment there is an interval of θ such that no vertex lines are crossed from left to right,
Fig. 4. A partitioned polygon and its corresponding bounce visibility diagram.
this corresponds to our notion of safe actions; the robot can transition from segment to segment under some amount of nondeterministic error in position and action from anywhere on the originating segment. For example, in Fig. 4, we see that a band of intervals exists at the top and bottom of the diagram with no "crossings" as you move from left to right (moving the robot around ∂P). These bands correspond to the safe actions to be formally described in Proposition 3.

Proposition 1. The bounce visibility graph for a polygon with n vertices has O(n²) vertices and O(n⁴) edges.

Proof. Consider a polygon P with n vertices operated on by Algorithm 1 to form P′. Each convex vertex will not add any new vertices; however, a reflex vertex can add O(n) new vertices. Up to half of the vertices in the polygon can be reflex, so the number of vertices in P′ is O(n²). Each vertex indexes a node in the edge visibility graph of P′. A vertex in P′ may be visible to all other vertices, so in the worst case, the bounce visibility graph will have O(n⁴) edges.
Fig. 5. An example partition for a polygon with holes; our discretization scheme extends naturally to handle visibility events caused by static obstacles.
5 Safe Actions
To characterize some families of paths, we will use the boundary partition technique defined in Sect. 4, then define safe actions between segments in the partition that are guaranteed to transition to the same edge from anywhere in the originating edge. Such actions define transitions which keep the robot state in one partition under nondeterministic actions.

Definition 7. Two edges e_i, e_j of a polygon are entirely visible to each other if and only if every pair of points s_i ∈ e_i and s_j ∈ e_j are visible (the shortest path between s_i and s_j lies entirely within P).

Definition 8. A safe action from edge e_i to edge e_j in a polygon is an action u such that f(s, u) ∈ e_j for any s ∈ e_i and u in some interval of actions θ̃ ⊆ (0, π).

Proposition 2. Given two entirely visible line segments e_i = (v_i, v_{i+1}) and e_j = (v_j, v_{j+1}) in ∂P, if a safe action exists from e_i to e_j, the maximum interval of safe actions is θ̃ = [θ_r, θ_l] such that θ_r = π − ∠v_j v_{i+1} v_i and θ_l = ∠v_{j+1} v_i v_{i+1}.

Here we will only sketch the proof. Geometric intuition comes from drawing the quadrilateral v_i v_{i+1} v_j v_{j+1}. If θ_r as defined is smaller than θ_l, every ray shot from e_i at θ ∈ [θ_r, θ_l] must intersect with e_j; else, there is no interval of safe actions between the segments. We largely ignore the case where θ_r = θ_l, as in practice we will not find these action intervals useful, since they require very accurate robot actuation. Note that this definition of a safe action is similar to the definition of an interval of safe actions from [14]; the main differences in approach are the generation of boundary segments, methods of searching the resulting graph, and how we generate constraints on robot uncertainty instead of assuming uncertainty bounds are an input to the algorithm.

Lemma 1. If two edges in ∂P are entirely visible to each other, then there will be at least one safe action between them.

Corollary 1. If two edges in ∂P share a vertex that is not reflex, and the two edges are not collinear, then there exist safe actions from one to the other in both directions.

Lemma 1 holds because the geometric condition in Proposition 2 must hold in one direction or the other, except in the case where θ_l = θ_r, which we do not consider in this work as it is a rare geometric condition. For a proof of Corollary 1, see Fig. 6a. Algorithm 1 guarantees that such neighboring segments are entirely visible.

Definition 9. Given two entirely visible segments e_i and e_j, rotate frame such that e_i is aligned with the x-axis with its normal pointing along the positive y-axis, such that segment e_j is above segment e_i. If the intersection of segment e_i and e_j would be on the left of segment e_i, then call the transition from e_i to e_j a left transition; if the intersection would be on the right of segment e_i, then call the transition a right transition.
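Proposition 2 translates directly into a small angle computation. The sketch below (illustrative Python, not the bounce_viz implementation) returns the maximal safe interval θ̃ = [θ_r, θ_l] for two entirely visible segments, or None when no interval exists:

```python
import numpy as np

def interior_angle(a, b, c):
    # Angle at vertex b between rays b->a and b->c, in [0, pi].
    u = np.asarray(a, float) - np.asarray(b, float)
    v = np.asarray(c, float) - np.asarray(b, float)
    cos_ang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos_ang, -1.0, 1.0)))

def safe_action_interval(vi, vi1, vj, vj1):
    # Proposition 2: theta_r = pi - angle(v_j, v_{i+1}, v_i) and
    # theta_l = angle(v_{j+1}, v_i, v_{i+1}); a safe interval exists
    # only when theta_r < theta_l.
    theta_r = np.pi - interior_angle(vj, vi1, vi)
    theta_l = interior_angle(vj1, vi, vi1)
    return (theta_r, theta_l) if theta_r < theta_l else None
```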
Proposition 3. For every polygon P and the resulting partitioned polygon P′ under Algorithm 1, each edge e ∈ P′ has at least two safe actions which allow transitions away from e.

Proof. Let e_i = (v_i, v_{i+1}). Consider right transitions from e_i to some e_k, where the safe action interval θ̃ = (0, θ_l) for some nonzero θ_l. We will show that such a transition must exist. By Corollary 1, if an edge e_i has an adjacent edge e_{i+1} which is not collinear or separated by a reflex angle, e_k = e_{i+1} and a safe transition exists between e_i and e_{i+1} with θ_l = ∠v_{i+2} v_i v_{i+1}. See Fig. 6a. If the adjacent edge is at a reflex angle, Algorithm 1 will insert a vertex in line with e_i on the closest visible edge, forming edge e_k. If v_i is an original vertex of P, e_k will be entirely visible from e_i, since Algorithm 1 will otherwise insert a point in v_i's partial local sequence. If v_i is itself inserted by Algorithm 1, e_k will still be entirely visible. There must be some original vertex of P clockwise from v_i and collinear with e_i, which would insert a vertex visible to e_i through Algorithm 1 if there were any reflex vertices blocking transitions to e_k. See Fig. 6b for an example of the geometry. If the adjacent edge e_{i+1} is collinear with e_i, apply the above reasoning to the first non-collinear edge to find e_k. The above arguments extend symmetrically to left transitions, which will have safe actions of the form θ̃ = [θ_r, π). Thus, each edge will have two guaranteed safe actions leading away from it.
Fig. 6. Examples of geometrical setup for some guaranteed safe actions.
6 Dynamical Properties of Paths
Many tasks required of mobile robots can be specified in terms of dynamical properties of the path the robot takes through space: stability, ergodicity, etc. Our motion strategy allows us to disregard the robot’s motion in the interior of P . Instead, the robot’s state is an interval along the boundary ∂P , and we can formulate transitions as a discrete dynamical system under the transition function f . The properties of this dynamical system can be used to engineer paths corresponding to different robot task requirements.
One generally useful property of mapping functions is contraction: when two points under the function always become closer together. We can use this property to control uncertainty in the robot's position.

Definition 10. Let φ_{i,j} be the interior angle between two edges e_i, e_j ∈ ∂P.

Definition 11. A function g that maps a metric space M to itself is a contraction mapping if for all x, y ∈ M, |g(x) − g(y)| ≤ c|x − y|, and 0 ≤ c < 1.

Lemma 2. If the transition from segment e_i to segment e_j is a left transition, then the transition function f(x, θ) between segments e_i and e_j is a contraction mapping if and only if θ > π/2 + φ_{i,j}/2; if a right transition, the transition function f(x, θ) is a contraction mapping if and only if θ < π/2 − φ_{i,j}/2.

In our case, we will always take M to be an interval in R and we will use the L1 norm. The proof is a straightforward application of the law of sines to find the conditions for the contraction coefficient |f(x, θ) − f(y, θ)| / |x − y| to be less than one, which is independent of choice of interval [x, y] for transitions between linear segments, and is very similar to the calculations in [23].

Corollary 2. For all pairs of adjacent segments with internal angle less than π, there exists a range of actions for which f is a contraction mapping.

Definition 12. The contraction coefficient of a mapping is the ratio of the distance between points before and after the mapping is applied. Let C(θ, φ_{i,j}) be the contraction coefficient of a transition from e_i to e_j in ∂P. For a left transition, C(θ, φ_{i,j}) = |sin(θ) / sin(θ − φ_{i,j})|; for a right transition, C(θ, φ_{i,j}) = |sin(θ) / sin(θ + φ_{i,j})|.

Given C from Definition 12, for a sequence of transitions f_0, …, f_k, we can construct the overall mapping from the domain of f_0 to the range of f_k through function composition. Since f is a linear mapping, the contraction coefficient of a composition of multiple bounces can be determined by multiplying the contraction coefficients of each bounce.

Definition 13. Given a sequence of m feasible transitions F = {f_0, f_1, …, f_{m−1}}, at stage k the robot will be located on edge e(k) and will depart the edge with action θ_k; the contraction coefficient of the overall robot trajectory f_{m−1} ∘ … ∘ f_0 is C(F) = ∏_{k=0}^{m−1} C(θ_k, φ_{e(k),e(k+1)}).

If C(F) < 1, the composition of F is a contraction mapping. This is true even if some transition along the way is not a contraction mapping, since C(F) is simply the ratio of distances between points before and after the mapping is applied. Furthermore, the value of C(F) tells us the exact ratio by which the size of the set of possible robot states has changed.

Limit Cycles: A contraction mapping that takes an interval back to itself has a unique fixed point, by the Banach fixed point theorem [13]. By composing individual transition functions, we can create a self-mapping by finding transitions which take the robot back to its originating edge. A fixed point of this
mapping corresponds to a stable limit cycle. Such trajectories in regular polygons were characterized in [23]. Here, we present a more general proof for the existence of limit cycles in all convex polygons.

Definition 14. φ_{P,min} is the smallest interior angle in a polygon P.

Theorem 1. For all convex polygons with n edges, there exist constant fixed-angle bouncing strategies which result in a period n limit cycle regardless of the robot's start position.

Proof. Without loss of generality, assume θ ∈ (0, π/2]. The robot will always bounce to the next adjacent edge if and only if θ ∈ (0, min_i(∠v_{i+2} v_i v_{i+1})). Suppose we have two start positions s_1 and s_2 on edge e_0 and a constant fixed bounce rule b(α, θ) = θ. At stage k, s_1 and s_2 will be located at f^k(s_1, θ) and f^k(s_2, θ). By Definition 13, the distance between s_1 and s_2 changes after one orbit of the polygon by the ratio |f^n(s_1, θ) − f^n(s_2, θ)| / |s_1 − s_2| = ∏_{i=0}^{n−1} C(θ, φ_{i,i+1}). If this ratio is less than one, then f^n(s, θ) has a unique fixed point by the Banach fixed-point theorem [13]. By Lemma 2, this constraint is satisfied if θ < π/2 − φ_{i,i+1}/2 for all i. We can guarantee that this condition holds for the orbit by requiring

θ ∈ (0, min( min_{i=0,1,…,n−1} (∠v_{i+2} v_i v_{i+1}), min_{i=0,1,…,n−1} (π/2 − φ_{i−1,i}/2) )),

in which case the fixed-angle bouncing strategy with b(α, θ) = θ leads to a convex cycle which visits each edge of P sequentially, regardless of the robot's start position. Symmetry applies for orbits in the opposite direction.
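The contraction conditions of Lemma 2 and the coefficients of Definitions 12 and 13 are easy to evaluate numerically; the sketch below (illustrative Python) multiplies per-bounce coefficients to test whether a composed trajectory, such as the period-n orbit of Theorem 1, is contracting:

```python
import math

def contraction_coefficient(theta, phi, left):
    # Definition 12: |sin(theta)/sin(theta - phi)| for a left transition,
    # |sin(theta)/sin(theta + phi)| for a right transition.
    denom = math.sin(theta - phi) if left else math.sin(theta + phi)
    return abs(math.sin(theta) / denom)

def composed_coefficient(transitions):
    # Definition 13: coefficients multiply under composition; the overall
    # map is a contraction iff the product C(F) < 1, even when some
    # individual factor exceeds 1.
    c = 1.0
    for theta, phi, left in transitions:
        c *= contraction_coefficient(theta, phi, left)
    return c

# For a square (phi = pi/2) and theta = pi/5 with all right transitions,
# composed_coefficient([(math.pi/5, math.pi/2, False)] * 4) is about 0.28,
# so the period-4 orbit of Theorem 1 converges.
```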
Proposition 4. For all points s on the boundary of all polygons, a constant fixed-angle controller exists which will cause the robot's trajectory to enter a stable limit cycle.

Proof. First we observe that by Proposition 3, for every segment e ∈ ∂P, safe actions always exist for two action intervals. These intervals are the ones bordering the segment itself: by staying close enough to the boundary, the robot may guarantee a safe transition. By Lemma 2, these safe actions will also admit contraction mappings. Thus, we may choose a constant fixed-angle controller such that it results in safe, contracting transitions from all segments in P′. Since there are a finite number of segments in P′, this controller must result in a limit cycle from every point in ∂P. If s is on a hole of the polygon, this procedure will cause the robot to leave the hole and enter a cycle on the exterior boundary.
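To see Proposition 4 and Theorem 1 in action, one can simulate fixed-angle bouncing directly. The following sketch (illustrative Python, assuming a convex polygon with counterclockwise vertices; this is not the authors' bounce_viz code) traces boundary hits, which converge toward the predicted limit cycle for suitably small θ:

```python
import numpy as np

def bounce_path(poly, edge, p, theta, steps=50):
    # Depart each boundary point at angle theta, measured CCW from the
    # direction of the current edge (interior lies to the left for a
    # CCW polygon), and record successive boundary hits.
    poly = [np.asarray(v, float) for v in poly]
    p, n, hits = np.asarray(p, float), len(poly), []
    for _ in range(steps):
        a, b = poly[edge], poly[(edge + 1) % n]
        t = (b - a) / np.linalg.norm(b - a)
        c, s = np.cos(theta), np.sin(theta)
        d = np.array([c * t[0] - s * t[1], s * t[0] + c * t[1]])  # rotate t by theta
        best = None
        for i in range(n):                       # find the next boundary hit
            if i == edge:
                continue
            q, e = poly[i], poly[(i + 1) % n] - poly[i]
            den = d[0] * e[1] - d[1] * e[0]
            if abs(den) < 1e-12:
                continue
            w = q - p
            tt = (w[0] * e[1] - w[1] * e[0]) / den
            ss = (w[0] * d[1] - w[1] * d[0]) / den
            if tt > 1e-9 and 0.0 <= ss <= 1.0 and (best is None or tt < best[0]):
                best = (tt, i)
        tt, edge = best
        p = p + tt * d
        hits.append((edge, p))
    return hits
```

Running this from two different start points on the same edge of a convex polygon, with θ below the bound of Theorem 1, shows both traces collapsing onto the same period-n orbit.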
7 Applications
7.1 Navigation
Here, we define the navigation task, assuming that the robot has unavoidable uncertainty in its actuation. A navigation strategy should take a robot from any point in a starting set to some point in a goal set, as long as some action in the bounce rule set is successfully executed at each stage.
Definition 15. Navigation: Given a polygonal environment P, a convex starting set S ⊂ ∂P of possible robot positions along the environment boundary, and a convex goal set G ⊂ ∂P, determine a strategy π which will be guaranteed to take the robot from any point in S to a point in G, or determine that no such strategy exists.

Using the formalisms built up so far, we can perform the navigation task as a search over a discrete bounce visibility graph. Shortest paths in the graph will correspond to paths with the fewest number of bounces. It is important to note that an arc e_i → e_k in the bounce visibility graph only implies that for each point x ∈ e_i there exists an action taking x to some point in e_k. For our navigation task, we require a range of angles which can take any point in e_i to a point on e_k, so we restrict the arcs of the bounce visibility graph to those corresponding to safe actions only. As may be expected, not all edges of ∂P are reachable using safe actions.

Proposition 5. There exist simple polygons such that under Algorithm 1, there exist edges in the partitioned polygon P′ that are not reachable by a safe action from any other edge in P′.

Proof. The only such edges will be edges for which both endpoints are reflex vertices or vertices inserted by Algorithm 1, since by Corollary 1, edges adjacent to other vertices of P will be reachable by a safe action.

Thus, navigation using paths in G_safe is not complete: we cannot get everywhere safely, at least under this partitioning of ∂P. Figure 7 is an example where the edge v_{10} v_{11} is not reachable via safe actions. Equivalently, node 10 in G_safe has no incoming arcs.
Fig. 7. A polygon after Algorithm 1 and its safe bounce visibility graph BVG_safe.
However, it is still useful to explore what is possible under nondeterministic control of such a robot. Here, we will show how to search for a constant fixed-angle bounce controller, where at every stage the robot executes a fixed-angle bounce in some range θ̃. This is an extension of the controllers analyzed in [23] for regular polygons. We define a few helper functions:
– mkBVG: uses the visibility information generated in Algorithm 1 to generate the bounce visibility graph in O(n⁴) time.
– mkSafeBVG: using Definition 8 and Proposition 2, we can create a safe roadmap, G_safe, out of the bounce visibility graph by traversing all edges, removing edges with an empty θ̃, and labelling the remaining edges with the interval of safe actions.
– SearchConstantFixedStrats: performs breadth-first search from start to goal, starting with θ̃ = (0, π) and intersecting θ̃ with the safe action intervals along each path, terminating when all start states reach a goal state at the same stage with overlapping θ̃ intervals. Returns the resulting constant controller θ̃ or that no such strategy is possible.

Algorithm 2. SafeConstantFixedNavigate(P, S, G, k)
Input: A polygon P, intervals S and G in ∂P, and an integer bound k.
Output: A nondeterministic bounce rule, or None if no strategy can be found.
1: P′ ← PartitionPoly(P)
2: BVG ← mkBVG(P′)
3: BVG_safe ← mkSafeBVG(BVG)
4: θ̃ ← SearchConstantFixedStrats(BVG_safe, S, G, k)
5: return θ̃
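A simplified sketch of SearchConstantFixedStrats is given below (illustrative Python, not the authors' code). It propagates a single start node and narrows θ̃ by interval intersection along each path; the full algorithm additionally requires all start states to reach the goal at the same stage with overlapping intervals:

```python
import math
from collections import deque

def search_constant_fixed(bvg_safe, start, goals, k_max):
    # bvg_safe: dict mapping a node to a list of (neighbor, (lo, hi))
    # pairs, where (lo, hi) is the safe action interval on that arc.
    frontier = deque([(start, (0.0, math.pi), 0)])
    while frontier:
        node, (lo, hi), depth = frontier.popleft()
        if node in goals:
            return (lo, hi)          # any constant theta in (lo, hi) works
        if depth == k_max:
            continue
        for nbr, (slo, shi) in bvg_safe.get(node, []):
            nlo, nhi = max(lo, slo), min(hi, shi)
            if nlo < nhi:            # keep only nonempty intersections
                frontier.append((nbr, (nlo, nhi), depth + 1))
    return None
```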
7.2 Patrolling
Note that there are exponentially many possible limit cycles; a cycle may contain transitions which are not contraction mappings, as long as the overall contraction coefficient is less than one. As more becomes understood about the structure and robustness of such limit cycles, we can begin to formalize robotic patrolling tasks as a search over possible cycles. Definition 16. Given an environment P , a set of possible starting states S, and a sequence of edges of the environment E = {e1 , . . . , ek }, determine a strategy which causes the robot to enter a stable cycle visiting each edge of the sequence in order from any point in S. This task is related to the Aquarium Keeper’s Problem in computational geometry [6]. It may be solved by coloring nodes in the bounce visibility graph by which edge of P they belong to, and then searching for safe cycles which visit edges in the correct order. If a cycle is contracting, it will result in a converging stable trajectory. Interesting open questions remain on how to incorporate other useful properties of a patrolling cycle, such as coverage of a space with certain sensors, or guarantees about detecting evaders while patrolling. Leveraging Cycles to Reduce Uncertainty: Say no safe path exists between two subsets of ∂P . We may use limit cycles to reduce the uncertainty in the robot’s position enough to create a safe transition. By Proposition 4, limit cycles are reachable from any point s ∈ ∂P under a constant fixed-angle
controller (and we know the range of angles from which this constant controller may be drawn). Say the contraction coefficient of a given limit cycle is C, and the length of edge e_i is ℓ_i. After k iterations of the cycle, the distance between the robot's position and the fixed point of the cycle's transition function is less than C^k ℓ_i. Thus, we may reduce the uncertainty in the robot's position to less than ε after log(ε/ℓ_i)/log(C) iterations of the cycle. If instead of a deterministic fixed-angle controller, we consider a nondeterministic controller, we need to reason over a range of C coefficients and resulting attractor structure. The mathematical theory of these attractors and the best way to choose such nondeterministic controllers is still an open question.
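The iteration count in the passage above is a one-line computation; for example (illustrative Python):

```python
import math

def iterations_to_shrink(C, edge_len, eps):
    # C**k * edge_len <= eps  =>  k >= log(eps / edge_len) / log(C),
    # valid for a contraction coefficient 0 < C < 1 and eps < edge_len.
    return math.ceil(math.log(eps / edge_len) / math.log(C))

# e.g. C = 0.7 and edge length 1.0: reaching eps = 0.01 takes 13 iterations.
```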
8 Open Questions and Future Work
We have presented a visibility-based approach to reasoning about paths and strategies for a class of mobile robots. We are moving toward understanding nondeterministic control of such robots; however, many open questions remain!

Comparing Bounce Rules. Our approach can be used to compare different families of bounce strategies in a given polygon, by comparing the reachability of the transition systems induced by each strategy family. This would require either reducing the full bounce visibility graph given constraints on possible bounces, or constructing a more appropriate transition system. The best way to analyze relative bounce rules, in which the outgoing angle of a robot after a collision is a function of the incoming angle, is not yet clear.

Optimal Strategies. Say we wish to find the strategy taking the robot from region S to region G with the maximum amount of allowed uncertainty in the bounce rules (the sum of the interval sizes of each action set along the path). This problem can be framed as an optimization problem over the space of all transition function compositions.

Localization. A localization strategy is a nondeterministic strategy that produces paths which reduce uncertainty in the robot's position to below some desired threshold, from arbitrarily large starting sets. The use of limit cycles to produce localizing strategies has been explored in [2], and it would be interesting to take a similar approach with our environment discretization.

Ergodic Trajectories. We are interested in strategies that produce ergodic motion, where the robot's trajectory "evenly" covers the state space. Measures of ergodicity have recently been used in exploration tasks [21]. Chaotic dynamical systems have also been used directly as controllers for mobile robots [22].

Acknowledgement. This work was supported by NSF grants 1035345 and 1328018, and CONACyT post-doctoral fellowship 277028.
References
1. Alam, T., Bobadilla, L., Shell, D.A.: Minimalist robot navigation and coverage using a dynamical system approach. In: IEEE IRC (2017)
2. Alam, T., Bobadilla, L., Shell, D.A.: Space-efficient filters for mobile robot localization from discrete limit cycles. IEEE Robot. Autom. Lett. 3(1), 257–264 (2018)
3. Aronov, B., Davis, A.R., Dey, T.K., Pal, S.P., Prasad, D.C.: Visibility with Multiple Reflections. Springer, Heidelberg (1996)
4. Brunner, J., Mihalák, M., Suri, S., Vicari, E., Widmayer, P.: Simple robots in polygonal environments: a hierarchy. In: Algosensors (2008)
5. Burridge, R.R., Rizzi, A.A., Koditschek, D.E.: Sequential composition of dynamically dexterous robot behaviors. Int. J. Robot. Res. 18(6), 534–555 (1999)
6. Czyzowicz, J., et al.: The aquarium keeper's problem. In: Proceedings of the Second Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM (1991)
7. Del Magno, G., Lopes Dias, J., Duarte, P., Gaivão, J.P., Pinheiro, D.: SRB measures for polygonal billiards with contracting reflection laws. Commun. Math. Phys. 329(2), 687–723 (2014)
8. ElGindy, H.A., Avis, D.: A linear algorithm for computing the visibility polygon from a point. J. Algorithms 2(2), 186–197 (1981)
9. Erdmann, M.A.: Using backprojections for fine motion planning with uncertainty. Int. J. Robot. Res. 5(1), 19–45 (1986)
10. Erickson, L.H., LaValle, S.M.: Toward the design and analysis of blind, bouncing robots. In: IEEE ICRA (2013)
11. Ghosh, S.K.: Visibility Algorithms in the Plane. Cambridge University Press, Cambridge (2007)
12. Goldberg, K.Y.: Orienting polygonal parts without sensors. Algorithmica 10(2–4), 201–225 (1993)
13. Granas, A., Dugundji, J.: Elementary Fixed Point Theorems, pp. 9–84. Springer, New York (2003)
14. Lewis, J.S., O'Kane, J.M.: Planning for provably reliable navigation using an unreliable, nearly sensorless robot. Int. J. Robot. Res. 32(11), 1342–1357 (2013)
15. Lozano-Pérez, T., Mason, M.T., Taylor, R.H.: Automatic synthesis of fine-motion strategies for robots. Int. J. Robot. Res. 3(1), 3–24 (1984)
16. Lozano-Pérez, T., Wesley, M.A.: An algorithm for planning collision-free paths among polyhedral obstacles. Commun. ACM 22(10), 560–570 (1979)
17. Lynch, K.M., Mason, M.T.: Pulling by pushing, slip with infinite friction, and perfectly rough surfaces. Int. J. Robot. Res. 14(2), 174–183 (1995)
18. Markarian, R., Pujals, E., Sambarino, M.: Pinball billiards with dominated splitting. Ergodic Theory Dyn. Syst. 30, 1757–1786 (2010)
19. Mason, M.T.: The mechanics of manipulation. In: Proceedings IEEE International Conference on Robotics & Automation, pp. 544–548 (1985)
20. Mayya, S., Pierpaoli, P., Nair, G., Egerstedt, M.: Localization in densely packed swarms using interrobot collisions as a sensing modality. IEEE Trans. Robot. 35(1), 21–34 (2018)
21. Miller, L.M., Silverman, Y., MacIver, M.A., Murphey, T.D.: Ergodic exploration of distributed information. IEEE Trans. Robot. 32(1), 36–52 (2016)
22. Nakamura, Y., Sekiguchi, A.: The chaotic mobile robot. IEEE Trans. Robot. Autom. 17(6), 898–904 (2001)
23. Nilles, A., Becerra, I., LaValle, S.M.: Periodic trajectories of mobile robots. In: IROS (2017)
24. O'Rourke, J., Streinu, I.: The vertex-edge visibility graph of a polygon. Comput. Geom.: Theory Appl. 10(2), 105–120 (1998)
25. Prasad, D.C., Pal, S.P., Dey, T.K.: Visibility with multiple diffuse reflections. Comput. Geom. 10(3), 187–196 (1998)
26. Sahin, H., Guvenc, L.: Household robotics: autonomous devices for vacuuming and lawn mowing [applications of control]. IEEE Control Syst. 27(2), 20–96 (2007)
27. Siméon, T., Laumond, J.P., Nissoux, C.: Visibility based probabilistic roadmaps for motion planning. Adv. Robot. 14(6), 477–493 (2000)
28. Spagnolie, S.E., Wahl, C., Lukasik, J., Thiffeault, J.L.: Microorganism billiards. Phys. D Nonlinear Phenom. 341, 33–44 (2017)
29. Suri, S., Vicari, E., Widmayer, P.: Simple robots with minimal sensing: from local visibility to global geometry. IJRR 27(9), 1055–1067 (2008)
30. Szirmay-Kalos, L., Márton, G.: Worst-case versus average case complexity of ray-shooting. Computing 61(2), 103–131 (1998)
31. Tabachnikov, S.: Geometry and Billiards. American Mathematical Society, Providence (2005)
32. Whitney, D.: Force feedback control of manipulator fine motions. Trans. ASME J. Dyn. Syst. Measure. Control 91–97 (1977)
Finding Plans Subject to Stipulations on What Information They Divulge

Yulin Zhang¹, Dylan A. Shell¹, and Jason M. O'Kane²

¹ Texas A&M University, College Station, TX, USA
{yulinzhang,dshell}@tamu.edu
² University of South Carolina, Columbia, SC, USA
[email protected]
Abstract. Motivated by applications where privacy is important, we study planning problems for robots acting in the presence of an observer. We first formulate and then solve planning problems subject to stipulations on the information divulged during plan execution—the appropriate solution concept being both a plan and an information disclosure policy. We pose this class of problem under a worst-case model within the framework of procrustean graphs, formulating the disclosure policy as a particular type of map on edge labels. We devise algorithms that, given a planning problem supplemented with an information stipulation, can find a plan, associated disclosure policy, or both jointly, if and only if one exists. The pair together, comprising the plan and associated disclosure policy, may depend subtly on additional information available to the observer, such as whether the observer knows the robot's plan (e.g., leaked via a side-channel). Our implementation finds a plan and a suitable disclosure policy, jointly, when any such pair exists, albeit for small problem instances.
1 Introduction
In 2017, iRobot announced that they intended to sell maps of people's homes, as generated by their robot vacuum cleaners. The result was a public outcry [1]. It is increasingly clear that, as robots become part of our everyday lives, the information they could collect (indeed, may need to collect to function) can be both sensitive and valuable. Information about a robot's internal state and its estimates of the world's state are leaked by status displays, logged data, actions executed, and information outputted—often what the robot is tasked with doing. The tension between utility and privacy is fundamental. Typically, robots strive to decrease uncertainty. Some prior work, albeit limited, has illustrated how to cultivate uncertainty, examining how to constrain a robot's beliefs so that it never learns sensitive information (cf. [2–4]). In so doing, one precludes sensitive information being disclosed to any adversary. But not disclosing secrets by simply never knowing any limits the applicability of the
approach severely. This paper proposes a more general, wider-reaching model for privacy, beyond mere ingénue robots. This article posits a potentially adversarial observer and then stipulates properties of what shall be divulged. The stipulation describes information that must be communicated (being required to perform the task) as well as information (confidential information potentially violating the user's privacy) that shouldn't be. Practical scenarios where this model applies include: (i) privacy-aware care robots that assist the housebound, providing nursing care; (ii) inspection of sensitive facilities by robots to certify compliance with regulatory agreements, whilst protecting other proprietary or secret details; (iii) sending data remotely to computing services on untrusted cloud infrastructure.
This work was supported by NSF awards IIS-1453652, IIS-1527436, and IIS-1526862.
Fig. 1. Nuclear Site Inspection (the two panels show a pebble bed facility and a breeder reactor). A robot inspects a nuclear facility by taking a measurement at the '?' location, which depends on the facility type. But the type of the facility is sensitive information that must not be divulged to any external observers.
Figure 1 illustrates a scenario which, though simplistic, is rich enough to depict several aspects of the problem. The task requires that a robot determine whether some facility’s processing of raw radioactive material meets international treaty requirements or not. The measurement procedure itself depends on the type of facility as the differing physical arrangements of ‘pebble bed’ and ‘breeder’ reactors necessitate different actions. First, the robot must actively determine the facility type (checking for the presence of the telltale blue light in the correct spot). Then it can go to a location to make the measurement, the measurement location corresponding with the facility type. But the facility type is sensitive information and the robot must ascertain the radioactivity state while ensuring that the facility type is not disclosed. What makes this scenario interesting is that the task is rendered infeasible immediately if one prescribes a policy to ensure that the robot never gains sensitive information. Over and above the (classical) question of how to balance information-gathering and progress-making actions, the robot must control what it divulges, strategically increasing uncertainty as needed, precisely limiting and reasoning about the ‘knowledge gap’ between the external observer and itself. To solve such problems, the robot needs a carefully constructed plan and must establish a policy characterizing what information it divulges, the
former achieving the goals set for the robot, the latter respecting all stipulated constraints—and, of course, each depending on the other.

1.1 Contributions and Itinerary
This paper contributes the first formulation, to our knowledge, of planning where solutions can be constrained so as to require that some information be communicated and other information obscured subject to an adversarial model of an observer. Nor do we know of other work where a plan and some notion of an interface (the disclosure policy, in our terminology) can be solved for jointly. The paper is organized as follows: after discussion of related work, Sect. 3 develops the preliminaries, notation, and formalism, Sect. 4 addresses an important technical detail regarding an observer's background knowledge, and Sect. 5 turns to finding plans that satisfy the stipulations. The last section reports experiments conducted with our implementation.
2 Related Work
An important topic in HRI is expressive action (e.g., see [5]). In recent years there has been a great deal of interest in mathematical models that enable generation of communicative plans. Important formulations include those of [6,7], proposing plausible models for human observers (from the perspectives of presumed cost efficiency, surprisal, or generalizations thereof). In this prior work, conveying information becomes part of an optimization objective, whereas we treat it as a constraint instead. Both [6] and [7] are probabilistic in nature; here we consider a worst-case model that is arguably more suitable for privacy considerations: we ask what an observer can plausibly infer via the history of its received observations. In doing so, we are influenced by the philosophy of LaValle [8], following his use of the term information state (I-state) to refer to a representation of information derived from a history of observations. Finally, since parts of our stipulations may require concealing information, we point out there is also recent work in deception (see [9,10]) and also obfuscation [11].
3 The Model: Worlds, Robots and Observers
Figure 2 illustrates the three-way relationships underlying the setting we examine. Most fundamentally, a robot executes a plan to achieve some goal in the world, and the coupling of these two elements generates a stream of observations and actions. Both the plan and the action–observation stream are disclosed, though potentially only partially, to a third party, which we term the observer. The observer uses the stream, its knowledge of the plan, and also other known structure to infer properties about the interaction. Additionally, a stipulation is provided specifying particular properties that can be learned by the observer. We formalize these elements in terms of p-graphs and label maps (see [12]).
Fig. 2. An overview of the setting: the robot is modeled abstractly as realizing a plan to achieve some goal in the world and a third party observer, modeled as a filter. All three, the world, plan, and filter have concrete representations as p-graphs.
3.1 P-Graph and Its Interaction Language
We will start with the definition of p-graphs [12] and related properties:

Definition 1 (p-graph). A p-graph is an edge-labelled directed bipartite graph with G = (V_y ∪ V_u, Y, U, V_0), where (1) the finite vertex set V(G) := V_y ∪ V_u, whose elements are also called states, comprises two disjoint subsets: the observation vertices V_y and the action vertices V_u, (2) each edge e originating at an observation vertex bears a set of observations Y(e) ⊆ Y, containing observation labels, and leads to an action vertex, (3) each edge e originating at an action vertex bears a set of actions U(e) ⊆ U, containing action labels, and leads to an observation vertex, and (4) a non-empty set of states V_0 are designated as initial states, which may be either exclusively action states (V_0 ⊆ V_u) or exclusively observation states (V_0 ⊆ V_y).

An event is an action or an observation. Respectively, they make up the sets U and Y, which are called the p-graph's action space and observation space. We will also write Y(G) and U(G) for the observation space and action space of G. Though that is a slight abuse of notation, the initial states will be written V_0(G), similarly. Intuitively, a p-graph abstractly represents a (potentially non-deterministic) transition system where transitions are either of type 'action' or 'observation' and these two alternate. The following definitions make this idea precise.

Definition 2 (transitions to). For a given p-graph G and two states v, w ∈ V(G), a sequence of events ε_1, …, ε_k transitions in G from v to w if there exists a sequence of states v_1, …, v_{k+1}, such that v_1 = v, v_{k+1} = w, and for each i = 1, …, k, there exists an edge v_i →^{E_i} v_{i+1} for which ε_i ∈ E_i, and E_i is a subset of Y(G) if v_i is in V_y, or a subset of U(G) if v_i is in V_u.
Concisely, we let the predicate TransTo(v →^s w)_G hold if there is some way of tracing s on G from v to w, i.e., it is True iff v transitions to w under execution s. Note, when G has non-deterministic transitions, v may transition to multiple vertices under the same execution. We only require that w be one of them.
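To make Definitions 1–2 concrete, here is a minimal Python sketch of a p-graph and nondeterministic event tracing (illustrative only; the class and method names are ours, not from the p-graph library of [12]):

```python
from dataclasses import dataclass, field

@dataclass
class PGraph:
    # A minimal p-graph: bipartite vertex sets, edges labelled with
    # event sets, and a set of initial states (Definition 1).
    action_vertices: set = field(default_factory=set)
    obs_vertices: set = field(default_factory=set)
    edges: list = field(default_factory=list)   # (src, label_set, dst)
    initial: set = field(default_factory=set)

    def reached(self, events):
        # V_s^G: trace the event sequence nondeterministically from all
        # initial states; each step follows every edge whose label set
        # contains the current event (Definitions 2 and 5).  The
        # action/observation alternation is implicit in the edges.
        current = set(self.initial)
        for ev in events:
            current = {dst for (src, labels, dst) in self.edges
                       if src in current and ev in labels}
            if not current:
                return set()   # the sequence is not an execution of G
        return current
```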
Definition 3 (executions and interaction language). An execution on a p-graph G is a finite sequence of events s, if there exists some v ∈ V_0(G) and some w ∈ V(G) for which TransTo(v →^s w)_G. The set of all executions on G is called the interaction language (or, briefly, just language) of G and is written L(G). Given any edge e, if U(e) = L_e or Y(e) = L_e, we speak of e bearing the set L_e.

Definition 4 (joint-execution). A joint-execution on two p-graphs G_1 and G_2 is a sequence of events s that is an execution of both G_1 and G_2, written as s ∈ L(G_1) ∩ L(G_2). The p-graph producing all the joint-executions of G_1 and G_2 is their tensor product graph with initial states V_0(G_1) × V_0(G_2), which we denote G_1 ⊗ G_2. A vertex from G_1 ⊗ G_2 is a pair (v_1, v_2), where v_1 ∈ V(G_1) and v_2 ∈ V(G_2).

Next, the relationship between the executions and vertices is established.

Definition 5. The set of vertices reached by execution s in G, denoted V_s^G, are the vertices to which the execution s ∈ L(G) transitions, starting at an initial state. Symbolically, V_s^G := {v ∈ V(G) | ∃v_0 ∈ V_0(G), TransTo(v_0 →^s v)_G}. Further, the set of executions reaching vertex v in G is written as S_v^G := {s ∈ L(G) | v ∈ V_s^G}.

The naming here serves to remind that V describes sets of vertices, S describes sets of strings/executions. The collection of sets {S_{v_0}^G, S_{v_1}^G, …, S_{v_i}^G, …} can be used to form an equivalence relation ∼_G over executions, under which s_1 ∼_G s_2 if and only if V_{s_1}^G = V_{s_2}^G. This equivalence relation partitions the executions in L(G) into a set of non-empty equivalence classes: L(G)/∼_G = {[r_0]_G, [r_1]_G, [r_2]_G, …}, where each equivalence class is [r_i]_G = {s ∈ L(G) | r_i ∼_G s} and r_i is a representative execution in [r_i]_G. The intuition is that
any two executions that transition to identical sets of vertices are, in an important sense, indistinguishable. We shall consider systems where the vertices of a p-graph constitute the state that is stored, acted upon, and/or represented—they are, thus, akin to a ‘sufficient statistic’. Definition 6 (state-determined). A p-graph G is in a state-determined presentation, or is in state-determined form, if ∀s ∈ L(G), |VsG | = 1. The procedure to expand any p-graph G into a state-determined presentation Sde(G) can be found in Algorithm 2 of [12]. The language of p-graphs is not affected by state-determined expansion, i.e., L(G) = L(Sde(G)). Next, one may start with vertices and ask about the executions reaching those vertices. (Later, this will be part of how an observer makes inferences about the world.)
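A rough sketch of state-determined expansion, in the spirit of Sde(G), follows (using the PGraph sketch above; this simplified powerset construction expands per event and ignores the action/observation alternation, so it only fixes intuition — Algorithm 2 of [12] is the authoritative procedure):

```python
def state_determined_expansion(pg):
    # Vertices of the expanded graph are the reachable sets V_s^G, so
    # every execution reaches exactly one expanded vertex (Definition 6).
    start = frozenset(pg.initial)
    vertices, work, edges = {start}, [start], []
    while work:
        vset = work.pop()
        events = {ev for (s, labels, d) in pg.edges
                  if s in vset for ev in labels}
        for ev in events:
            succ = frozenset(d for (s, labels, d) in pg.edges
                             if s in vset and ev in labels)
            if not succ:
                continue
            edges.append((vset, ev, succ))
            if succ not in vertices:
                vertices.add(succ)
                work.append(succ)
    return vertices, edges
```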
Definition 7. Given any set of vertices B ⊆ V(G) in p-graph G, the set of executions that reach exactly (i.e. reach and reach only) B is S_B^G := (∩_{v∈B} S_v^G) \ (∪_{v∈V(G)\B} S_v^G).

Above, the ∩_{v∈B} S_v^G represents the set of executions that reach every vertex in B. By subtracting the ones that also reach the vertices outside B, S_B^G describes the set of executions that reach exactly B. In Fig. 3, the executions reaching w_3 are represented as S_{w_3}^G = {a_1 o_1, a_2 o_1}. But the executions reaching and reaching only {w_3} are S_{{w_3}}^G = {a_1 o_1}, since a_2 o_1 also reaches w_4. Specifically, the equivalence class [r_i]_G contains the executions that reach exactly V_{r_i}^G, so we have [r_i]_G = S_{V_{r_i}^G}^G.
Fig. 3. An example showing the difference between 'reaches' and 'reaches exactly', as distinguished in notation by $S_w^G$ and $S_{\{w\}}^G$. (From $w_0$, edges bearing $\{a_1, a_2\}$ lead to $w_1$ and edges bearing $\{a_2\}$ lead to $w_2$; $w_1$ leads to $w_3$ under $\{o_1\}$, and $w_2$ leads to $w_4$ under $\{o_1, o_2\}$.)
3.2 Planning Problems and Plans
In the p-graph formalism, planning problems and plans are defined as follows [12].

Definition 8 (planning problems and plans). A planning problem is a p-graph W along with a goal region $V_{goal} \subseteq V(W)$; a plan is a p-graph P equipped with a termination region $V_{term} \subseteq V(P)$.

Planning problem $(W, V_{goal})$ is solved by some plan $(P, V_{term})$ if the plan always terminates (i.e., reaches $V_{term}$) and only terminates at a goal. Said with more precision:

Definition 9 (solves). A plan $(P, V_{term})$ solves a planning problem $(W, V_{goal})$ if there is some integer which bounds the length of all joint-executions, and for each joint-execution and any pair of nodes $(v \in V(P), w \in V(W))$ reached by that execution simultaneously, the following conditions hold: (1) if v and w are both action nodes then, for every label borne by each edge originating at v, there exist edges originating at w bearing the same action label; (2) if v and w are both observation nodes then, for every label borne by each edge originating at w, there exist edges originating at v bearing the same observation label; (3) if $v \in V_{term}$ then $w \in V_{goal}$;
(4) if $v \notin V_{term}$ then some extended joint-execution exists, continuing from v and w, that does reach the termination region.

In the above, properties (1) and (2) describe a notion of safety; property (3) of correctness; and (4) of liveness. In the previous definition, there is an upper bound on joint-execution length. We say that plan $(P, V_{term})$ is c-bounded if $\forall s \in L(P)$, $|s| \le c$.

3.3 Information Disclosure Policy, Divulged Plan, and Observer
The agent who is the observer sees a stream of the robot's actions and observations, and uses them to build estimates (or to compute general properties) of the robot's interaction with the world. But the observer's access to this information will usually be imperfect: either by design, as a consequence of real-world imperfections, or some combination of both. Conceptually, this is a form of partial observability in which the stream of symbols emitted as part of the robot's execution is distorted into the symbols seen by the observer (see Fig. 4). For example, if some pairs of actions are indistinguishable from the perspective of the observer, this may be expressed with a function that maps those pairs of actions to the same value. In this paper, this barrier is what we have been referring to (informally, thus far) with the phrase information disclosure policy. It is formalized as a mapping from the events in the robot's true execution in the world p-graph to the events received by the observer.
Fig. 4. The information disclosure policy, divulged plan and information stipulation. Even when the observer is a strong adversary, the disclosure policy and divulged plan can limit the observer’s capabilities effectively.
Definition 10 (Information disclosure policy). An information disclosure policy is a label map h on p-graph G, mapping from elements in the combined observation and action space $Y(G) \cup U(G)$ to some set of events X.

The word 'policy' hints at two interpretations: first, as something given as a predetermined arrangement (that is, as a rule); second, as something to be sought (together with a plan). Both senses apply in the present work; the exact transformation describing the disclosure of information will be used first (in Sect. 5.1) as a specification and then, later (in Sect. 5.2), as something which planning algorithms can produce. How the information disclosure policy is
realized in some setting depends on which sense is apt: it can be interpreted as describing observers (showing that, for those observers unable to tell $y_i$ from $y_j$, the stipulations can be met), or it can inform robot operation (the stipulations require that the robot obfuscate $u_\ell$ and $u_m$ via means such as explicit concealment, sleight-of-hand, misdirection, etc.).

The observer, in addition, may also have imperfect knowledge of the robot's plan, which is leaked or communicated via a side-channel. The disclosed plan is also modeled as a p-graph, which may be weaker than knowing the actual plan. A variety of different types of divulged plan are introduced later (in Sect. 4) to model different prior knowledge available to an observer; as we will show, despite their differences, they can be treated in a single unified way.

The next step is to provide formal definitions for the ideas just described. In the following, we refer to h as the map from the set $Y \cup U$ to some set X, and refer to its preimage $h^{-1}$ as the map from X to subsets of $Y \cup U$. The notation for a label map h and its preimage $h^{-1}$ is extended in the usual way to sequences and sets: we consider sets of events, executions (being sequences), and sets of executions. They are also extended to p-graphs in the obvious way, by applying the function to all edges. For brevity's sake, the outputs of h will be referred to simply as 'the image space.' The function h may either preserve information (when a bijection) or lose information (with multiple inputs mapped to one output). The loss of information is felt in $Y \cup U$ by the extent to which some element of $Y \cup U$ grows under $h^{-1} \circ h$: for all $\ell \in Y \cup U$, $h^{-1} \circ h(\ell) \supseteq \{\ell\}$. In contrast, starting from $x \in X$, the uncertainty, apparent via set cardinality under $h^{-1}$, is washed out again when pushed forward to the image space X via $h \circ h^{-1}$, i.e., $\forall x \in X$, $h \circ h^{-1}(x) = \{x\}$.

Definition 11 (I-state graph). For planning problem $(W, V_{goal})$, plan $(P, V_{term})$ and information disclosure policy $h : Y(W) \cup U(W) \to X$, an observer's I-state graph I is a p-graph whose inputs are from the image space of h (i.e., $Y(I) \cup U(I) = X$), with $L(I) \supseteq h[L(W)]$. The action space and observation space of I are also written as $X_u = U(I)$ and $X_y = Y(I)$.

Inherited from the property of $h \circ h^{-1}$, for any I-state graph I, we have $h[h^{-1}I] = h \circ h^{-1} I = I$, and $\forall B \subseteq V(I)$, $h^{-1}[S_B^I] = S_B^{h^{-1}I}$.

The observer's I-state graph is a p-graph with events in the image space X. By having $L(I) \supseteq h[L(W)]$, we are requiring that strings generated in the world can be safely traced on I. Next, we formalize the crucial connection from the interaction of the robot and world, via the stream of symbols generated, to the state tracked by the observer. Inference proceeds from the observer back to the world, though causality runs the other way (glance again at Fig. 2). We begin, accordingly, with that latter direction.
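The asymmetry between $h^{-1} \circ h$ and $h \circ h^{-1}$ is easy to see concretely. Below is a small sketch (our own illustration; event names are assumptions) of a label map and its preimage, extended to executions as described above.

    h = {'a1': 'x1', 'a2': 'x1', 'o1': 'x2', 'o2': 'x3'}   # a1, a2 are conflated

    def h_inv(x):
        return {e for e, image in h.items() if image == x}

    def h_exec(execution):                 # extend h to sequences of events
        return tuple(h[e] for e in execution)

    print(h_exec(('a1', 'o1')))            # ('x1', 'x2')
    print(h_exec(('a2', 'o1')))            # ('x1', 'x2'): indistinguishable images
    print(h_inv('x1'))                     # {'a1', 'a2'}: h_inv∘h grows the input
    print({h[e] for e in h_inv('x1')})     # {'x1'}: h∘h_inv washes uncertainty out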
w is compatible with the set of I-states $B \subseteq V(I)$ if $\exists s \in L(W)$ such that

$$s \in \underbrace{h^{-1}[S_B^I]}_{(1)} \cap \underbrace{L(P)}_{(2)} \cap \underbrace{S_w^W}_{(3)}.$$

Informally, each of the three terms can be interpreted as:

(1) An observer with I-state graph I may ask which sequences are responsible for having arrived at states B. The answer is the set $S_B^I$, being the executions contained in equivalence classes that are indistinguishable up to states in I. Those strings are in the image space X, so, to obtain an answer in $Y \cup U$, we take their preimages. Every execution in $h^{-1}[S_B^I]$ leads the observer to B. Note that information may be degraded by either h, I, or both.
(2) The set of executions that may be executed by the robot is represented by L(P). If the observer knows that the robot's plan takes, say, the high road, this information allows the observer to remove executions involving the robot along the low road.
(3) The set of executions reaching world state w is represented by $S_w^W$. Two world states $w, w' \in V(W)$ are essentially indiscernible or indistinguishable if $S_w^W = S_{w'}^W$, as the sets capture the intrinsic uncertainty of world W.

When an observer is in B, and w is compatible with B, there exists some execution, a certificate, that the world could plausibly be in w subject to (1) the current information summarized in I; (2) the robot's plan; (3) the structure of the world. The set of all world states that are compatible with B is denoted $W_B^{I,P}$, which is the observer's estimate of the world states when known information about W, P and I have all been incorporated.

A typical observer may know less about the robot's future behavior than the robot's full plan. Weaker knowledge of how the robot will behave can be expressed in terms of some p-graph D, such that $L(D) \supseteq L(P)$. (Here the mnemonic is that it is the divulged information about the robot's plan, which one might imagine as leaked or communicated via a side-channel.) Notice that the information divulged to the observer about the robot's execution is in the preimage space. The key reason for this modeling decision is that information may be lost under label map h; an observer gains the greatest information when the plan is disclosed in the preimage space and, as we consider worst-case conditions, we are interested in what the strongest (even adversarial) observers might infer. Thus, we study divulgence where the observer obtains as much as possible. When only D is given, Definition 12 requires the substitution of the second term in the intersection with L(D); the most precise inference then replaces $W_B^{I,P}$ with $W_B^{I,D}$:

Definition 13 (estimated world states). Given an I-state graph I, divulged plan p-graph D, world p-graph W, and label map h, the set of estimated world states for I-states $B \subseteq V(I)$ is $W_B^{I,D} := \left\{w \in V(W) \mid \left(S_B^{h^{-1}I} \cap L(D) \cap S_w^W\right) \neq \emptyset\right\}$.
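Definition 13 is pure set algebra once the execution sets are in hand. The sketch below (our own illustration; p-graph languages are infinite in general, so this only works on a finite sample of executions) mirrors the definition directly.

    def estimated_world_states(preimage_execs, divulged_execs, reaching, world_states):
        """W_B^{I,D}: world states w with (S_B^{h^-1 I} ∩ L(D) ∩ S_w^W) nonempty.
        preimage_execs = h^{-1}[S_B^I], divulged_execs = L(D), reaching[w] = S_w^W."""
        candidates = preimage_execs & divulged_execs
        return {w for w in world_states if candidates & reaching[w]}

    # Tiny demo with executions encoded as tuples of events:
    reaching = {'w3': {('a1', 'o1'), ('a2', 'o1')}, 'w4': {('a2', 'o1'), ('a2', 'o2')}}
    print(estimated_world_states({('a1', 'o1')}, {('a1', 'o1'), ('a2', 'o1')},
                                 reaching, {'w3', 'w4'}))   # {'w3'}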
Observe that $h^{-1}[S_B^I]$ has been replaced with $S_B^{h^{-1}I}$, since $h^{-1}[S_B^I] = S_B^{h^{-1}I}$.

The last remaining element in Fig. 4 that needs to be addressed is the stipulation of information. We do that next.

3.4 Information Stipulations
We prescribe properties of the information that an observer may extract from its input by imposing constraints on the sets of estimated world states. The observer, filtering a stream of inputs sequentially, forms a correspondence between its I-states and world states. We write propositional formulas with semantics defined in terms of this correspondence; in this model the stipulations are written to hold over every reachable set of associated states.¹

¹ We foresee other variants which are straightforward modifications to consider, but we report only on our current implementation.
Fig. 5. The syntax and natural semantics of the information stipulations, where $c_i$, $\ell_i$, $v_i$ represent a clause, literal, and symbol, respectively, and $w_i$ is the result of the evaluation. The judgment $e \Downarrow w$ denotes an evaluation, where e is any expression defined by the grammar and w is the value yielded by the expression.
First, however, we must delineate the scope of the estimated world states to be constrained. Some states, in inherently non-deterministic worlds, may be inseparable because they are reached by the same execution. In Fig. 3, both $w_3$ and $w_4$ will be reached (non-deterministically) by execution $a_2 o_1$. Since this is intrinsic to the world, even when the observer has perfect observations, they remain indistinguishable. In the remainder of this paper, we will assume that the world graph W is in state-determined form, and we may affix stipulations to the world states knowing that no two vertices will be non-deterministically reached by the same execution.

Second, we write propositional formulae to constrain the observer's estimate. Formula Φ is written in conjunctive normal form, consisting of symbols, literals and clauses as shown in Fig. 5. Firstly, an atomic symbol $v_i$ is associated with each world state $v_i \in V(W)$. If $v_i$ is contained in the observer's estimate $W_B^{I,D}$,
we will evaluate the corresponding symbol $v_i$ as True; it evaluates as False otherwise. With each symbol grounded in this way, we evaluate literals and clauses compositionally, using the logic operators not, and, or. These are defined in the standard way, eventually enabling evaluation of Φ on the observer's estimate $W_B^{I,D}$.

Let the predicate satfd(B, Φ) denote whether the stipulation Φ holds for I-states B. Then a plan P satisfies the stipulations if and only if

$$\forall s \in L(P) \cap L(W),\ B = V_{h(s)}^{I}:\quad \mathrm{satfd}(B, \Phi).$$
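The grounding rule makes CNF evaluation on an estimate a few lines of code. Here is a sketch (our own, not the paper's implementation; the symbol names are assumptions inspired by the nuclear inspection scenario). Note that keeping a fact ambiguous is expressed by requiring both candidate states to remain in the estimate.

    def eval_literal(literal, estimate):
        negated = literal.startswith('not ')
        symbol = literal[4:] if negated else literal
        return (symbol in estimate) != negated

    def satfd(estimate, phi_cnf):
        """phi_cnf: list of clauses; each clause is a list of literals (CNF)."""
        return all(any(eval_literal(lit, estimate) for lit in clause)
                   for clause in phi_cnf)

    # Keep the facility type ambiguous (both candidates must stay in the
    # estimate) while some radioactivity value remains representable:
    phi = [['pebble_bed'], ['conventional'], ['rad_high', 'rad_low']]
    print(satfd({'pebble_bed', 'rad_high'}, phi))                  # False: type leaked
    print(satfd({'pebble_bed', 'conventional', 'rad_high'}, phi))  # True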
4 The Observer's Knowledge of the Robot's Plan
Above, we hinted that observers may differ depending on the prior knowledge that has been revealed to them; next we bring this idea into sharper focus. The information associated with an observer is contained in a pair (I, D): the I-state graph I that acts as a filter, succinctly tracking state from a stream of inputs, and knowledge of the robot's plan in the form of a p-graph D. These two elements, through Definition 13, allow the observer to form a correspondence with the external world W. The I-state graph I induces $\sim_I$ over its set of executions and hence over the joint-executions with the world, or, more precisely, the image of those through h. By comparing the fineness of the relations induced by two I-state graphs, one obtains a sense of the relative coarseness of the two I-state graphs. As the present paper describes methods motivated by applications to robotic privacy, we model the most capable adversary, taking the finest observer, that is, one whose equivalence classes are as small as possible.

Definition 14 (finest observer). Given world graph W and the divulged plan D, an I-state graph $I_f$ is a finest observer if for any I-state graph I, we have $\forall s \in L(W)$, $W_{h(s)}^{I_f,D} \subseteq W_{h(s)}^{I,D}$.

Lemma 1. hW is a finest observer.

By way of a proof sketch, note that the observer only ever sees the image of the world under the label map h, i.e., hW. The p-graph hW serves as a natural I-state graph for a finest observer as it allows the observer to have sufficient internal structure to keep track of every world state.

The second element in the observer pair is D, information disclosed about the plan, and presumed to be known, a priori, to the observer. Depending on how much the observer knows, there are multiple possibilities here, from most- to least-informed:

I. The observer knows the exact plan P to be executed.
II. The plan to be executed is among a finite collection of plans $\{P_1, P_2, \ldots, P_n\}$.
III. The observer may only know that the robot is executing some plan, that is, the robot is goal directed and aims to achieve some state in $V_{goal}$.
IV. The observer knows nothing about the robot's execution other than that it is on W.

It turns out that a p-graph exists whose language expresses knowledge for each of those cases (we omit the details here). Furthermore, Sect. 3.3 details how the observer's knowledge of the world state ($W_B^{I,D}$) from I-states B depends on $S_B^{h^{-1}I} \cap L(D) \cap L(W)$, a set of executions that arrive at B in the I-state graph I. Because the observer uses D to refine $S_B^{h^{-1}I}$, when $L(P) \subsetneq L(D)$ the gap between the two sets of executions represents a form of uncertainty. The ordering of the four cases, thus, can be stated precisely in terms of language inclusion. Now, using the D appropriate for each case, one may examine whether a given plan and disclosure policy solves the planning problem (i.e., achieves desired goals in the world) while meeting the stipulations on information communicated. Hence, we see that describing disclosed information via a p-graph D is in fact rather expressive. This section has also illustrated the benefits of being able to use both interaction language and graph presentation views of the same structure.
5 Searching for Plans and Disclosure Policy: The Seek problems
In this section, we will show how to search for a plan (together with the label map).
Problem: $\mathrm{Seek}_x\langle (W, V_{goal}), x, (I, D), h, \Phi \rangle$, where D is a divulged plan. Variable to solve for: x, a plan.
Problem: $\mathrm{Seek}_{x,\lambda}\langle (W, V_{goal}), x, (I, x), \lambda, \Phi \rangle$. Variables to solve for: x, a plan, and λ, a label map (the plan x itself serves as the divulged plan).
Input: A planning problem $(W, V_{goal})$, a finest observer I, p-graph D, information disclosure policy h and information stipulation Φ.
Output: A plan $x = (P, V_{term})$ and/or label map $\lambda = h$ such that plan $(P, V_{term})$ solves the problem $(W, V_{goal})$ and, $\forall s \in L(W) \cap L(P)$ with $B = V_{h(s)}^{I}$, the information stipulation Φ always evaluates to True on $W_B^{I,D}$ (i.e., satfd(B, Φ) = True).
Of the two versions of Seek, the first searches for a plan, the second for a plan and a label map, jointly. We consider each in turn.

5.1 Finding a Plan Given Some Predetermined D
For $\mathrm{Seek}_x$, first we must consider the search space of plans. Prior work [12] showed that, although planning problems can have stranger solutions than people usually contemplate, there is a core of well-structured plans (called homomorphic solutions) that suffice to determine solvability. As an example, there may exist
plans which loop around the environment before achieving the goal, but they showed that, in seeking plans, one need only consider plans that short-circuit the loops. The situation is rather different when a plan must satisfy more than mere goal achievement: information stipulations may actually require a plan to loop in order to ensure that the disclosed stream of events is appropriate for the observer's eyes. (A concrete example appears in Fig. 7(c).) The argument in [12] needs modification for our problem; a different construction can save the result even under disclosure constraints. This fact is key to being able to implement a solution. In this paper, without loss of generality, we focus on finding plans in state-determined form. Next, we will examine the solution space closely.

Definition 15. A plan P is congruent on the world graph W if and only if for every pair of executions $s_1, s_2 \in L(P)$ we have $s_1 \sim_P s_2 \implies s_1 \sim_W s_2$.

In other words, a plan that respects the equivalence classes of the world graph is defined as a congruent plan. Next, our search space is narrowed further still.

Lemma 2. Given any plan $(P, V_{term})$, there exists a plan $(P', V'_{term})$ that is congruent on the world graph W and $L(P') = L(P)$.

Proof. We give a construction from P of P′ as a tree, and show that it meets the conditions. To construct P′, perform a BFS on P. Starting from $V_0(P)$, build a starting vertex $v'_0$ in P′, and keep a correspondence between it and $V_0(P)$. Mark $v'_0$ as unexpanded. Now, for every unexpanded vertex v′ in P′, mark the set of all outgoing labels for its corresponding vertices in P as $L_{v'}$; create a new vertex in P′ for each label $l \in L_{v'}$, build an edge from v′ to that vertex with label l in P′, and mark v′ as expanded. Repeat this process until all vertices in P′ have been expanded. Mark the vertices corresponding to vertices in $V_{term}$ as $V'_{term}$. In the new plan $(P', V'_{term})$, no two distinct executions reach the same vertex; that is, each equivalence class of $\sim_{P'}$ is a singleton, so P′ is trivially congruent on W. In addition, since no new executions are introduced and no executions in P are eliminated during the construction of P′, we have $L(P') = L(P)$.
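The BFS unrolling in the proof of Lemma 2 is mechanical. Below is a sketch of it (our own paraphrase of the construction; a depth cutoff is an added assumption, needed because a cyclic P unrolls to an infinite tree).

    from collections import deque

    def unroll_to_tree(initial, plan_edges, max_depth):
        """BFS-unroll a plan p-graph P into a tree P' whose language is that of P
        truncated at max_depth. plan_edges: vertex -> {label: set(successors)}."""
        tree = {0: {}}                     # tree vertex -> {label: child vertex}
        corr = {0: frozenset(initial)}     # tree vertex -> P-vertices it stands for
        next_id, queue = 1, deque([(0, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue
            labels = set().union(*(plan_edges.get(v, {}).keys() for v in corr[node]))
            for label in labels:
                succ = frozenset().union(
                    *(plan_edges.get(v, {}).get(label, set()) for v in corr[node]))
                child, next_id = next_id, next_id + 1
                tree[node][label] = child
                tree[child], corr[child] = {}, succ
                queue.append((child, depth + 1))
        return tree, corr

    P = {'p0': {'a': {'p0', 'p1'}}, 'p1': {'o': {'p0'}}}
    tree, corr = unroll_to_tree({'p0'}, P, max_depth=2)
    print(corr)   # each tree vertex corresponds to a set of P-vertices

Because each tree vertex is reached by exactly one execution, the equivalence classes of the unrolled plan are singletons, which is what makes congruence trivial.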
Theorem 1. For problem $\mathrm{Seek}_x\langle (W, V_{goal}), x, (I, D), h, \Phi \rangle$, if there exists a solution $(P, V_{term})$, then there exists a solution $(P', V'_{term})$ that is both c-bounded and congruent on W, where $c = |V(W)| \cdot |V(D)| \cdot |V(h^{-1}\mathrm{Sde}(I))|$.

Proof. Suppose $\mathrm{Seek}_x$ has a solution $(P, V_{term})$. Then the existence of a solution $(P', V'_{term})$ which is congruent on W is implied by Lemma 2. Moreover, we have

$$\mathrm{Check}\langle (W, V_{goal}), (P, V_{term}), D, I, h, \Phi \rangle \implies \mathrm{Check}\langle (W, V_{goal}), (P', V'_{term}), D, I, h, \Phi \rangle,$$

following from two observations: (i) if $(P, V_{term})$ solves $(W, V_{goal})$ then the means of construction ensures $(P', V'_{term})$ does as well, and (ii) in checking Φ, the set of estimated world states $W_{\{v\}}^{I,D}$ does not change for each vertex $v \in V(\mathrm{Sde}(I))$, since the triple graph is independent of the plan to be searched. The set of I-states to be evaluated by Φ in Sde(I) is $\bigcup_{s' \in h[L(P') \cap L(W)]} V_{s'}^{\mathrm{Sde}(I)}$. Since $L(P') = L(P)$, the set of I-states to be evaluated is not altered and the truth of Φ along the plan is preserved.

The final step is to prove that if there exists a congruent solution $(P', V'_{term})$, then there exists a solution $(P'', V''_{term})$ that is c-bounded. First, build a product graph T of W, D, and $h^{-1}\mathrm{Sde}(I)$, with vertex set $V(W) \times V(D) \times V(h^{-1}\mathrm{Sde}(I))$. Then trace every execution s in P′ on T. If s visits the same vertex $(v^W, v^D, v^{h^{-1}\mathrm{Sde}(I)})$ multiple times, then $v^W$, $v^D$, and $v^{h^{-1}\mathrm{Sde}(I)}$ have to be action vertices, for otherwise P′ can loop forever and is not a solution (since P′ is finite on W). Next, record the action taken at the last visit of $(v^W, v^D, v^{h^{-1}\mathrm{Sde}(I)})$ as $a_{last}$. Finally, build a new plan $(P'', V''_{term})$ by bypassing unnecessary transitions on P′ as follows. For each vertex $(v^W, v^D, v^{h^{-1}\mathrm{Sde}(I)})$ that is visited multiple times, P′′ takes action $a_{last}$ when $(v^W, v^D, v^{h^{-1}\mathrm{Sde}(I)})$ is first visited. P′′ terminates at the goal states without violating any stipulations, since it takes a shortcut in the executions of P′ but, crucially, without visiting any new observer I-states. In addition, P′′ will visit each vertex in T at most once, and the maximum length of its executions is $|V(W)| \times |V(D)| \times |V(h^{-1}\mathrm{Sde}(I))|$. Since P′′ preserves the structure of P′ during this construction, P′′ is also congruent.
The intuition, and the underlying reason for considering congruent plans, is that modifying the plan will not affect the stipulations if the underlying languages are preserved. The bound on the length then takes this further, modifying the language by truncating long executions in the triple graph, thereby shortcutting visits to I-states that do not affect goal achievement. Accordingly, it suffices to look for congruent plans in the (very specific) form of trees, since any plan has a counterpart that is congruent and in the form of a tree (see Lemma 2 for detail). Theorem 1 states that the depth of the tree is at most $c = |V(W)| \cdot |V(D)| \cdot |V(h^{-1}\mathrm{Sde}(I))|$. Therefore, we can limit the search space to trees of a specific bounded depth. To search for a c-bounded solution, first we mark the vertex $(v^W, v^D, v^{h^{-1}\mathrm{Sde}(I)})$ as: (i) a goal state if $v^W$ is a goal state in the world graph; (ii) as satisfying Φ when all the world states appearing together with $v^{h^{-1}\mathrm{Sde}(I)}$ together satisfy Φ. Then we will conduct an and–or search [13] on the triple graph (a sketch appears after this list):

• Each action vertex serves as an or node, and an action should be chosen for the action vertex such that it will eventually terminate at the goal states and all the vertices satisfy Φ along the way.
• Each observation vertex is treated as an and node, and there must exist a plan that satisfies Φ for all its outgoing observation vertices.
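For concreteness, here is a generic depth-bounded and–or search in the spirit of [13] (a sketch under our own assumptions: the triple graph is presented via callables, and cycle handling is left to the depth bound; this is not the paper's implementation).

    def and_or_search(vertex, depth, is_goal, satisfies_phi, actions, outcomes):
        """Return a dict vertex -> chosen action, or None if no plan exists."""
        if not satisfies_phi(vertex):
            return None
        if is_goal(vertex):
            return {}
        if depth == 0:
            return None
        for action in actions(vertex):                 # or node: pick one action
            plan = {vertex: action}
            for succ in outcomes(vertex, action):      # and node: cover all outcomes
                sub = and_or_search(succ, depth - 1, is_goal, satisfies_phi,
                                    actions, outcomes)
                if sub is None:
                    break                              # this action fails; try next
                plan.update(sub)
            else:
                return plan                            # every outcome has a subplan
        return None

    # Demo: reach vertex 'g' where action 'a' from 's' has two possible outcomes.
    succ = {('s', 'a'): ['g', 'm'], ('m', 'a'): ['g']}
    plan = and_or_search('s', 3, lambda v: v == 'g', lambda v: True,
                         lambda v: ['a'] if (v, 'a') in succ else [],
                         lambda v, a: succ[(v, a)])
    print(plan)   # {'s': 'a', 'm': 'a'}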
5.2 Search for Plan and Label Map for the Finest Observer, Disclosing the Same
It is not merely the joint search that makes this second problem more interesting. Whereas the first has a divulged plan D that is a priori fixed, the
second uses x, the plan that was found, as D. This latter fact makes the second problem substantially more difficult. At a high level, it is not hard to see why: the definitions in the previous section show that both P and D play a role in determining whether a plan satisfies a stipulation. Where D is known and fixed beforehand (for example, in Case IV, D = W, or Case III, D = P*), a solution can proceed by building a correspondence in the triple graph $W \otimes D \otimes h^{-1}\mathrm{Sde}(I)$ and searching in this graph for a plan. In $\mathrm{Seek}_{x,\lambda}$, however, one is interested in the case where D = P, where the divulged plan is tight, being the robot's plan exactly. We cannot search in the same product graph, because we cannot make the correspondence since D has yet to be discovered, being determined only after P has been found. Crucially, the feasibility of P depends on D, that is, on itself! Finding such a solution requires an approach capable of building incremental correspondences from partial plans. A key result of this paper is that $\mathrm{Seek}_{x,\lambda}$ is actually solvable without resorting to mere generate-and-check.

Lemma 3. Let $\widetilde{W}$ be the estimated world states for the finest observer hW, and let w be the world state, which is observable to the robot. If there exists a solution for $\mathrm{Seek}_{x,\lambda}$, then there exists a solution that visits each pair $(w, \widetilde{W})$ at most once.

Proof. Let $(P, V_{term})$ and h be a solution for $\mathrm{Seek}_{x,\lambda}$. Suppose P visits $(w, \widetilde{W})$ n times. Let the set of actions taken at the i-th visit be $A_i$. Then we can construct a new plan $(P', V'_{term})$ which always takes $A_n$ at $(w, \widetilde{W})$. If P does not violate the stipulations, then P′ never will, since P′ is a shortcut of P and never visits more I-states than P does. In addition, P′ will also terminate at the goal region if P does.
Theorem 2. If there exists a solution for $\mathrm{Seek}_{x,\lambda}\langle (W, V_{goal}), x, (I_f, x), \lambda, \Phi \rangle$, then there exists a plan P that takes $(w, W_B^{hW,P})$ as its plan state, where w is the world state and the set $W_B^{hW,P}$ consists of the estimated world states for I-states B. Furthermore, if $(w, W_B^{hW,P}) \in V(P)$, then $\forall w' \in W_B^{hW,P}$, $(w', W_B^{hW,P}) \in V(P)$.

Proof. Lemma 3 shows that we can treat $(w, W_B^{hW,P})$ as the plan state for the plan to be searched for. Since $w' \in W_B^{hW,P}$, we have $\exists s \in S_{w'}^W \cap L(P) \cap h^{-1}[S_B^{hW}]$. Since $s \in L(P)$, s reaches w′ and h(s) reaches B, we have that s reaches the tuple $(w', W_B^{hW,P})$. Hence, $(w', W_B^{hW,P}) \in V(P)$.

In searching for $(P, V_{term})$, for any action state $v_p = (w, W_B^{hW,P})$, we determine:

• $w \in V_{goal}$: we must decide whether $v_p \in V_{term}$ holds or not;
• $w \notin V_{goal}$: we must choose the set of nonempty actions to be taken at $v_p$. It has to be a set of actions, since these chosen actions are not only aiming for the goal but also obfuscating each other under the label map.
A state $v_p = (w, W_B^{hW,P})$ is a terminating state in the plan when $W_B^{hW,P} \subseteq V_{goal}$.

With action choices for each plan state $(w, W_B^{hW,P})$ and label map h, we are able to maintain transitions of the estimated world states for B after observing the image x. Now, if $(w, W_B^{hW,P})$ is an action state, let the set of actions taken at w be $A_w$. Then the label map h partitions the actions in $\bigcup_{w \in W_B^{hW,P}} A_w$ into groups, each of which shares the same image. The estimated world states for B transition in terms of groups:

$$W_{B'}^{hW,P} = \Big\{ w' \in V(W) \;\Big|\; (w, W_B^{hW,P}) \notin V_{term},\ w \in W_B^{hW,P},\ \exists a \in A_w,\ h(a) = x,\ \mathrm{TransTo}(w \xrightarrow{a} w')_W \Big\}.$$

Conversely, if $(w, W_B^{hW,P})$ is an observation state, let the observations available at w be $O_w$. Then h also partitions the observations in $\bigcup_{(w, W_B^{hW,P}) \notin V_{term}} O_w$, and the estimated world states for B transition as

$$W_{B'}^{hW,P} = \Big\{ w' \in V(W) \;\Big|\; (w, W_B^{hW,P}) \notin V_{term},\ w \in W_B^{hW,P},\ \exists o' \in O_w,\ h(o') = x,\ \mathrm{TransTo}(w \xrightarrow{o'} w')_W \Big\}.$$
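The action-state transition above reduces to a grouped successor computation. A small sketch (our own illustration; the encoding of the terminated set and edge maps are assumptions):

    def next_estimate(estimate, terminated, actions_at, h, world_edges, x):
        """Advance the estimate under image x: only non-terminating states
        transition, and only along events whose image under h equals x."""
        successors = set()
        frozen = frozenset(estimate)
        for w in estimate:
            if (w, frozen) in terminated:   # terminating plan states stop here
                continue
            for a in actions_at.get(w, ()):
                if h[a] == x:               # same event group (shared image)
                    successors |= world_edges.get(w, {}).get(a, set())
        return successors

    # Demo: two estimated states whose distinct actions are conflated by h.
    h = {'left': 'move', 'right': 'move'}
    world = {'u': {'left': {'u1'}}, 'v': {'right': {'v1'}}}
    print(next_estimate({'u', 'v'}, set(), {'u': ['left'], 'v': ['right']},
                        h, world, 'move'))   # {'u1', 'v1'}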
Instead of searching for the label map over the set of all actions and observations in W, we will first seek a partial label map for all observations or chosen actions for world states in $W_B^{hW,P}$, and then incrementally consolidate them. Each partial label map is a partition of the events, making it easy to check whether two partial maps conflict when they are consolidated. If two partial partitions disagree on a value, we backtrack in the search to try another partition (partial label map).

Putting it all together as detailed in Fig. 6, we can build a type of and–or search tree to incorporate these choices. For a set of actions comprising a vertex $\widetilde{W}^0$, two tiers of or nodes are generated. The first is over subsets $(A_1^0, A_2^0, \ldots, A_m^0)$, being possible actions to take; the second chooses specific partitions of values $P_i = \{X_1, X_2, \ldots\}$ (i.e., partial label maps). A given partition is expanded as an and node with each outgoing edge bearing a group of events sharing the same image under the partial label map. Observation vertices $\widetilde{W}^1$ are expanded in a similar way, but are simpler since we forgo the step involving choosing actions. If there exists a plan and label map, then for each $\widetilde{W}$ in the tree there exists an action choice under which there exists a safe partition, such that there exists a plan for all of its children.

Let the number of actions and observations in W be |U| and |Y|, and the number of vertices be |V|. There are $2^{|U||V|}$ action choices to consider, in the worst case, for all the world states in W. The total number of partitions is a
Fig. 6. Solving the $\mathrm{Seek}_{x,\lambda}$ problem via generalized and–or search.
Bell number $B_{|U|}$, where $B_{n+1} = \sum_{k=0}^{n} \binom{n}{k} B_k$ and $B_0 = 1$. For each partition, the number of groups we must consider is |U|. To expand an action vertex in the search tree, the computational complexity is $2^{|U||V|}\,|U|\,B_{|U|}$. Similarly, the complexity to expand an observation vertex is $|Y|\,B_{|Y|}$. If the depth of the tree is d, then the computational complexity is $O(2^{|U||V|d})$.
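The Bell-number recurrence is simple to evaluate, and a quick computation shows how fast the number of partitions (i.e., candidate partial label maps) blows up; this sketch is our own illustration.

    from math import comb

    def bell(n):
        B = [1]                                   # B_0 = 1
        for m in range(n):
            B.append(sum(comb(m, k) * B[k] for k in range(m + 1)))
        return B[n]

    print([bell(n) for n in range(8)])  # [1, 1, 2, 5, 15, 52, 203, 877]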
6 Experimental Results
We implemented all the algorithms in this paper, mainly using Python. The problem $\mathrm{Seek}_x$ was implemented both with the algorithm we propose and via specification in computation tree logic (CTL) (then utilizing the nuXmv model checker). All executions in this section used an OS X laptop with a 2.4 GHz Intel Core i5 processor.

To experiment, we constructed a 3 × 4 grid for the nuclear inspection scenario of Fig. 1. Including the differing facility types and radioactivity status, the world graph is a p-graph with 96 vertices before state-determined expansion (154 vertices for the state-determined form). The robot can move left, right, up, or down, one block at a time. After the robot's movement, it receives one of 5 possible observations: pebble bed facility or not (only when located at the blue star), radioactivity high or low (when located at one of the '?' cells), and cell is an exit. But the observer only knows the image of the actions and observations under a label map. The stipulation requires that the observer should learn the radioactivity strength, but should never know the facility type.

Firstly, we seek the plan in the nuclear inspection scenario with the label map shown in Fig. 7a. A plan can be found (with the world graph disclosed, D = W).
Fig. 7. The scenario and results for the $\mathrm{Seek}_x$ and $\mathrm{Seek}_{x,\lambda}$ problems: (a) shows the plan found in the nuclear inspection scenario when the observer knows nothing about the robot's plan (the robot traces the gray arrow, then the blue one if the blue light is seen, the red one otherwise); (b) shows the pentagonal world in $\mathrm{Seek}_{x,\lambda}$, where the robot moves along the gray lines.
It takes 11 s for the and–or search and 24 s for the CTL-based implementation to find their solutions. The CTL solver takes longer, but it prioritizes finding the plan of shortest length first. The plan found by CTL is shown in Fig. 7a. As the plan found by the and–or search is lengthy, we omit it.

Since, for the nuclear inspection scenario, $\mathrm{Seek}_{x,\lambda}$ does not return any result within a reasonable time, we opted to examine a smaller problem. Here a robot moves in the pentagonal world shown in Fig. 7b. The robot can either decide to loop in the world ($a_1$) or exit the loop at some point ($a_2$ or $a_3$). We wish to find a plan and label map pair so that the robot can reach some charging station. The observer should not be able to distinguish the robot's position when it is at either of the top two charging locations. $\mathrm{Seek}_{x,\lambda}$ gives a plan which moves forward 6 times and then exits at the next time step. Additionally, to disguise the actions and observations after the exit, it maps $h(a_2) = h(a_3)$ and $h(o_1) = h(o_3)$. Note that in this problem, the robot reaches a goal, without considering the stipulations, by taking the exit at the next time step. The stipulations force the robot to navigate at least one loop in the world to conflate state for the sake of the observer.
7 Conclusion
This paper continues a line of work on planning with constraints imposed on knowledge- or belief-states. Our contribution is a substantial generalization of prior models, though, as we see in the section reporting experiments, with grim implications for computational requirements. Future work might consider techniques that incorporate costs, informed methods (with appropriate heuristics), and other ways to solve certain instances quickly.
References

1. Vaas, L.: Privacy dust-up as Roomba maker mulls selling maps of users' homes (2017). https://nakedsecurity.sophos.com/2017/07/26/
2. O'Kane, J.M.: On the value of ignorance: balancing tracking and privacy using a two-bit sensor. In: WAFR 2008, pp. 235–249 (2008)
3. O'Kane, J.M., Shell, D.A.: Automatic design of discreet discrete filters. In: Proceedings of IEEE International Conference on Robotics and Automation, pp. 353–360 (2015)
4. Zhang, Y., Shell, D.A.: Complete characterization of a class of privacy-preserving tracking problems. Int. J. Robot. Res. (2018). In WAFR 2016 special issue
5. Takayama, L., Dooley, D., Ju, W.: Expressing thought: improving robot readability with animation principles. In: Proceedings of the International Conference on Human-Robot Interaction, HRI 2011, Lausanne, Switzerland, March 2011, pp. 69–76 (2011)
6. Dragan, A.D.: Robot planning with mathematical models of human state and action (2017). arXiv preprint arXiv:1705.04226
7. Knepper, R.A., Mavrogiannis, C.I., Proft, J., Liang, C.: Implicit communication in a joint action. In: Proceedings of the International Conference on Human-Robot Interaction, HRI 2017, Vienna, Austria, March 2017, pp. 283–292 (2017)
8. LaValle, S.M.: Planning Algorithms. Cambridge University Press, Cambridge (2006)
9. Masters, P., Sardina, S.: Deceptive path-planning. In: Proceedings of the International Joint Conference on Artificial Intelligence, Melbourne, Australia, August 2017, pp. 4368–4375 (2017)
10. Dragan, A.D., Holladay, R., Srinivasa, S.S.: Deceptive robot motion: synthesis, analysis and experiments. Auton. Robots 39(3), 331–345 (2015)
11. Wu, Y.-C., Raman, V., Lafortune, S., Seshia, S.A.: Obfuscator synthesis for privacy and utility. In: NASA Formal Methods Symposium, pp. 133–149. Springer (2016)
12. Saberifar, F.Z., Ghasemlou, S., Shell, D.A., O'Kane, J.M.: Toward a language-theoretic foundation for planning and filtering. Int. J. Robot. Res. (2018). In WAFR 2016 special issue
13. Pearl, J.: Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, Boston (1984)
Planning Under Uncertainty
Online Partial Conditional Plan Synthesis for POMDPs with Safe-Reachability Objectives

Yue Wang, Swarat Chaudhuri, and Lydia E. Kavraki

Department of Computer Science, Rice University, Houston, USA
{yw27,swarat,kavraki}@rice.edu
Abstract. The framework of Partially Observable Markov Decision Processes (POMDPs) offers a standard approach to model uncertainty in many robot tasks. Traditionally, POMDPs are formulated with optimality objectives. However, for robotic domains that require a correctness guarantee of accomplishing tasks, boolean objectives are natural formulations. We study POMDPs with a common boolean objective: safe-reachability, which requires that, with a probability above a threshold, the robot eventually reaches a goal state while keeping the probability of visiting unsafe states below a different threshold. The solutions to POMDPs are policies or conditional plans that specify the action to take contingent on every possible event. A full policy or conditional plan that covers all possible events is generally expensive to compute. To improve efficiency, we introduce the notion of partial conditional plans that only cover a sampled subset of all possible events. Our approach constructs a partial conditional plan parameterized by a replanning probability. We prove that the probability of the constructed partial conditional plan failing is bounded by the replanning probability. Our approach allows users to specify an appropriate bound on the replanning probability to balance efficiency and correctness. We validate our approach in several robotic domains. The results show that our approach outperforms a previous approach for POMDPs with safe-reachability objectives in these domains.
1 Introduction
Planning robust executions under uncertainty is a fundamental concern in robotics. POMDPs [29] provide a standard framework for modeling many robot tasks under uncertainty. The solutions to POMDPs are policies [29] or conditional plans [9] that specify the actions to take under all possible events during execution. Traditionally, the goal of solving POMDPs is to find optimal solutions that maximize (discounted) rewards [1,3,9,12,16,17,23,24,30]. While this purely quantitative formulation is suitable for many applications, some robotic settings demand synthesis concerning boolean requirements. For example, consider
Fig. 1. A robot with imperfect actuation and perception needs to pick up the blue can on the table, while avoiding collisions with uncertain obstacles such as floor signs and file cabinets.

Fig. 2. A full conditional plan $\gamma_k$ contains both solid and dotted branches. $a_1, a_2^0, \ldots$ are actions. $o_1$ and $o_2$ are observations. A partial conditional plan $\gamma_k^p$ contains only solid branches.
a robot with imperfect actuation and perception working in an office environment with uncertain obstacles such as floor signs and furniture (Fig. 1). Due to uncertainty, the locations of the robot and the obstacles are partially observable, and the robot’s action effects and observations are both probabilistic. In this probabilistic setting, a reasonable task requirement for the robot is to eventually pick up the target object with a probability above a threshold while keeping the probability of collision below a different threshold. This task requirement is naturally formulated as a boolean objective written in a temporal logic. Moreover, formulating boolean requirements implicitly as quantitative objectives does not always yield good solutions for certain domains [32]. Therefore, POMDPs with explicit boolean objectives are better formulations than quantitative POMDPs in these domains. Policy synthesis for POMDPs with boolean objectives has been studied in previous works [4,5,31], where the goal is to satisfy a temporal property with probability 1 (almost-sure satisfaction). A more general policy synthesis for POMDPs with boolean objectives is to synthesize policies that satisfy a temporal property with a probability above a threshold. In this work, we study this problem for the special case of safe-reachability objectives, which require that with a probability above a threshold, a goal state is eventually reached while keeping the probability of visiting unsafe states below a different threshold. Many robot tasks such as the one in Fig. 1 can be formulated as a safe-reachability objective. Our previous work [33] has presented a method called Bounded Policy Synthesis (BPS) for POMDPs with safe-reachability objectives. BPS computes a valid policy over the goal-constrained belief space rather than the entire belief space to improve efficiency. The goal-constrained belief space only contains beliefs visited by desired executions achieving the safe-reachability objective and is generally much smaller than the original belief space. BPS is an offline synthesis method that computes a full policy before execution. Another category of approaches to planning under uncertainty is online planning that interleaves planning and execution [3,8,13,14,17,28,30]. Offline synthesis offers a strong
correctness guarantee, but it is difficult to scale. Online planning is much more scalable and works well when replanning is likely to succeed, but it often fails when replanning is difficult or infeasible in some states, making it hard to ensure correctness. In this work, our goal is to scale up our previous BPS method further through online planning and achieve a good balance between efficiency and correctness. Specifically, we present a method called Online Partial Conditional Plan Synthesis (OPCPS) for POMDPs with safe-reachability objectives, based on the new notion of partial conditional plans. A partial conditional plan only contains a sampled subset of all possible events and approximates a full policy. OPCPS computes a partial conditional plan parameterized by a replanning probability, i.e., the probability of the robot encountering an event not covered by the partial conditional plan, thus requiring replanning. We offer a theoretical analysis of this replanning probability framework, showing that the probability of the constructed partial conditional plan failing is bounded by the replanning probability. OPCPS allows users to specify an appropriate bound on the replanning probability to balance efficiency and correctness: for domains where replanning is likely to succeed, increasing the bound usually leads to better scalability, and for domains where replanning is difficult or impossible in some states, users can decrease the bound and allocate more computation time to achieve a higher success rate. To further improve performance, OPCPS updates the replanning probability bound instead of using the same bound during computation. This bound update enables quicker detection of the current partial conditional plan meeting the bound and avoids unnecessary computation. For a better safety guarantee, OPCPS checks whether the successor belief of every uncovered observation of the constructed partial conditional plan satisfies the safety requirement. Thus OPCPS guarantees that the robot still satisfies the safety requirement when replanning fails. Section 3.1 has more details on the bound update and the safety guarantee of OPCPS. We evaluate OPCPS in the kitchen domain presented in [33] and the Tag domain [23]. We also validate OPCPS on a Fetch robot for the domain shown in Fig. 1. The results demonstrate that OPCPS scales better than BPS and can solve problems that are beyond the capabilities of BPS within the time limit. Related Work. The analysis of POMDPs can be divided into three categories. In the first category, the objective is to find optimal solutions concerning quantitative rewards. Many previous POMDP algorithms [1,3,9,16,17,23,30] focus on maximizing (discounted) rewards. In the second category, the objective combines the quantitative rewards of the traditional POMDPs with notions of risk and cost. Recently, there has been a growing interest in constrained POMDPs [11,15,25,32], chance-constrained POMDP [27], and risk-sensitive POMDPs [10,19] that handle cost/risk constraints explicitly. The third category consists of POMDPs with high-level boolean requirements written in a temporal logic. Recent work [4,5,31] has investigated the almost-sure satisfaction of POMDPs
with temporal properties, where the goal is to check whether a given temporal property can be satisfied with probability 1. A more general policy synthesis problem of POMDPs with safe-reachability objectives has been introduced in our previous work [33]. It has been shown that for robotic domains that require a correctness guarantee of accomplishing tasks, POMDPs with safe-reachability provide a better guarantee of safety and reachability than the quantitative POMDP formulations [33]. In this work, we focus on POMDPs with safe-reachability objectives and evaluate our previous BPS approach [33]. While BPS synthesizes a full policy (conditional plan) offline that covers all possible events, our approach is an online method that interleaves the computation of a partial conditional plan and execution. Since a partial conditional plan only contains a sampled subset of all possible events, our method achieves better scalability than BPS and can solve problems that are beyond the capabilities of BPS within the time limit. This idea of partial conditional plans resembles the state-of-the-art online POMDP algorithm based on Determinized Sparse Partially Observable Tree (DESPOT) [3,30]. Both DESPOT and our partial conditional plans contain a subset of all possible observations to improve efficiency. There are two major differences between our method and DESPOT: first, DESPOT handles POMDPs with (discounted) rewards while our approach solves POMDPs with safe-reachability objectives. Second, DESPOT contains all action branches while our approach constructs partial conditional plans (Fig. 2) that contain only one action per step, which is part of the desired execution satisfying the safe-reachability objective.
2 Problem Formulation
We follow the notation in [33] for POMDPs with safe-reachability objectives.

Definition 1 (POMDP). A POMDP is a tuple P = (S, A, T, O, Z), where S is a finite set of states, A is a finite set of actions, T is a probabilistic transition function, O is a finite set of observations, and Z is a probabilistic observation function. $T(s, a, s') = \Pr(s'|s, a)$ specifies the probability of moving to state $s' \in S$ after taking action $a \in A$ in state $s \in S$. $Z(s', a, o) = \Pr(o|s', a)$ specifies the probability of observing $o \in O$ after taking action $a \in A$ and reaching $s' \in S$.

Due to uncertainty, states are partially observable and typically we maintain a probability distribution (belief) over all states, $b : S \to [0, 1]$ with $\sum_{s \in S} b(s) = 1$. The set of all beliefs $B = \{b : S \to [0, 1] \mid \sum_{s \in S} b(s) = 1\}$ is the belief space.
The belief space transition function $T_B : B \times A \times O \to B$ is deterministic. $b_a^o = T_B(b, a, o)$ is the successor belief for a belief $b \in B$ after taking an action $a \in A$ and receiving an observation $o \in O$, defined according to Bayes' rule:

$$\forall s' \in S,\quad b_a^o(s') = \frac{Z(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid b, a)},$$

where $\Pr(o \mid b, a) = \sum_{s' \in S} Z(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)$ is the probability of receiving the observation o after taking the action a in the belief b.
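The belief update is a direct transcription of the Bayes rule above. Here is a sketch (our own encoding choice: T[s][a] and Z[s'][a] are dictionaries mapping successor state and observation to probability).

    def belief_update(b, a, o, states, T, Z):
        """Successor belief b_a^o and Pr(o | b, a)."""
        unnorm = {s2: Z[s2][a].get(o, 0.0) *
                      sum(T[s][a].get(s2, 0.0) * b.get(s, 0.0) for s in states)
                  for s2 in states}
        pr_o = sum(unnorm.values())             # Pr(o | b, a)
        if pr_o == 0.0:
            raise ValueError("observation has zero probability under (b, a)")
        return {s2: p / pr_o for s2, p in unnorm.items()}, pr_o

    # Two-state demo: noisy move, noisy sensor.
    T = {'s0': {'go': {'s0': 0.2, 's1': 0.8}}, 's1': {'go': {'s1': 1.0}}}
    Z = {'s0': {'go': {'near': 0.9, 'far': 0.1}},
         's1': {'go': {'near': 0.1, 'far': 0.9}}}
    b1, p = belief_update({'s0': 1.0}, 'go', 'far', ['s0', 's1'], T, Z)
    print(b1, p)   # belief mass shifts to s1; p = Pr(far | b, go)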
Definition 2 (Plan). A k-step plan is a sequence $\sigma = (b_0, a_1, o_1, \ldots, a_k, o_k, b_k)$ such that for all $i \in (0, k]$, the belief updates satisfy the transition function $T_B$, i.e., $b_i = T_B(b_{i-1}, a_i, o_i)$, where $a_i \in A$ is an action and $o_i \in O$ is an observation.

Definition 3 (Safe-Reachability Objective). A safe-reachability objective is a tuple G = (Dest, Safe), where $\mathrm{Safe} = \{b \in B \mid \sum_{s\ \text{violates safety}} b(s) < \delta_2\}$ is a set of safe beliefs and $\mathrm{Dest} = \{b \in \mathrm{Safe} \mid \sum_{s\ \text{is a goal state}} b(s) > 1 - \delta_1\} \subseteq \mathrm{Safe}$ is a set of goal beliefs. $\delta_1$ and $\delta_2$ are small values that represent tolerance.

G compactly represents the set $\Omega_G$ of valid plans:

Definition 4 (Valid Plan). A k-step plan $\sigma = (b_0, a_1, o_1, \ldots, a_k, o_k, b_k)$ is valid w.r.t. a safe-reachability objective G = (Dest, Safe) if $b_k$ is a goal belief ($b_k \in \mathrm{Dest}$) and all beliefs visited before step k are safe beliefs ($\forall i \in [0, k)$, $b_i \in \mathrm{Safe}$).

Partial Conditional Plan. Computing an exact policy over the entire belief space B is intractable, due to the curse of dimensionality [21]: B is a high-dimensional space with an infinite number of beliefs. To make the problem tractable, we can exploit the reachable belief space $B_{b_0}$ [16,23]. $B_{b_0}$ only contains beliefs reachable from the initial belief $b_0$ and is generally much smaller than B. Our previous BPS work [33] has shown that the performance of policy synthesis for POMDPs with safe-reachability objectives can be further improved based on the notion of a goal-constrained belief space $B_G$. $B_G$ combines the reachable belief space $B_{b_0}$ and the set $\Omega_G$ of valid plans defined by the safe-reachability objective G. $B_G$ only contains beliefs reachable from the initial belief $b_0$ under a valid plan $\sigma \in \Omega_G$ and is generally much smaller than the reachable belief space $B_{b_0}$.

Previous results [6,18,22] have shown that the problem of policy synthesis for POMDPs is generally undecidable. However, when restricted to a bounded horizon, this problem becomes PSPACE-complete [20,21]. Therefore, BPS computes a bounded policy π over the goal-constrained belief space $B_G$ within a bounded horizon h. This bounded policy π is essentially a set of conditional plans [9]:

Definition 5 (Conditional Plan). A k-step conditional plan $\gamma_k \in \Gamma_k$ is a tuple $\gamma_k = (b, a, \nu_k)$, where $b \in B$ is a belief, $a \in A$ is an action and $\nu_k : O \to \Gamma_{k-1}$ is an observation strategy that maps an observation $o \in O$ to a (k−1)-step conditional plan $\gamma_{k-1} = (b', a', \nu_{k-1}) \in \Gamma_{k-1}$, where $b' = T_B(b, a, o)$ is the successor belief.

Figure 2 shows an example k-step conditional plan $\gamma_k = (b_0, a_1, \nu_k)$. $\gamma_k$ defines a set $\Omega_{\gamma_k}$ of k-step plans $\sigma_k = (b_0, a_1, o_1, \ldots, a_k, o_k, b_k)$. For each plan $\sigma_k \in \Omega_{\gamma_k}$, the action $a_1$ at step 1 is chosen by the k-step conditional plan $\gamma_k$, the action $a_2$ at step 2 is chosen by the (k−1)-step conditional plan $\gamma_{k-1} = \nu_k(o_1)$, ..., and the action $a_k$ at step k is chosen by the one-step conditional plan $\gamma_1 = \nu_2(o_{k-1})$.
Definition 6 (Valid Conditional Plan). A k-step conditional plan $\gamma_k$ is valid w.r.t. a safe-reachability objective G if every plan in $\Omega_{\gamma_k}$ is valid ($\Omega_{\gamma_k} \subseteq \Omega_G$).

It is clear that the number of valid plans in a valid k-step conditional plan $\gamma_k$ grows exponentially as the horizon k increases. To address this challenge, our method computes partial conditional plans to approximate full conditional plans:

Definition 7 (Partial Conditional Plan). A k-step partial conditional plan is a tuple $\gamma_k^p = (b, a, O_k^p, \nu_k^p)$, where $b \in B$ is a belief, $a \in A$ is an action, $O_k^p \subseteq O$ is a subset of the observation set O, and $\nu_k^p : O_k^p \to \Gamma_{k-1}^p$ is a partial observation strategy that maps an observation $o \in O_k^p$ to a (k−1)-step partial conditional plan $\gamma_{k-1}^p = (b', a', O_{k-1}^p, \nu_{k-1}^p)$, where $b' = T_B(b, a, o)$ is the successor belief.

Similarly, $\gamma_k^p$ defines a set $\Omega_{\gamma_k^p}$ of k-step plans $\sigma_k$ in belief space.

Definition 8 (Valid Partial Conditional Plan). A k-step partial conditional plan $\gamma_k^p$ is valid w.r.t. a safe-reachability objective G if every plan in $\Omega_{\gamma_k^p}$ is valid.

Replanning Probability. Since a partial conditional plan $\gamma_k^p = (b, a, O_k^p, \nu_k^p)$ only contains a subset of all observation branches at each step (see Fig. 2), during online execution it is possible that an observation branch $o \in O - O_k^p$ that is not part of the partial conditional plan is visited. In this case, we need to recursively compute a new partial conditional plan for this new branch o. However, since $\gamma_k^p$ does not consider all possible observation branches, it is possible that the action chosen by $\gamma_k^p$ is invalid for the new observation branch o. As a result, there are no partial conditional plans for the new observation branch o and execution fails.

To preserve correctness, we would like to bound the failure probability $p_{fail}(\gamma_k^p) = \Pr(\text{failure} \mid \gamma_k^p)$ measured under a valid partial conditional plan $\gamma_k^p = (b, a, O_k^p, \nu_k^p)$. However, computing $p_{fail}$ is costly because it requires checking whether the action a chosen by $\gamma_k^p$ is valid for every uncovered observation branch $o \in O - O_k^p$, which essentially computes a full conditional plan. Alternatively, we can easily compute the replanning probability $p_{replan}(\gamma_k^p) = \Pr(\text{replanning} \mid \gamma_k^p)$ of reaching an uncovered observation branch $o \in O - O_k^p$ and requiring replanning:
$$p_{replan}(\gamma_k^p) = \sum_{o \in O_k^p} \Pr(o \mid b, a)\, p_{replan}(\nu_k^p(o)) + \sum_{o \in O - O_k^p} \Pr(o \mid b, a). \qquad (1)$$

For the base case k = 1, $p_{replan}(\gamma_1^p) = \sum_{o \in O - O_1^p} \Pr(o \mid b, a)$.
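Equation (1) is a one-pass recursion over the plan tree. The sketch below (our own data-structure choice, not the paper's code) pairs a direct transcription of Definition 7 with that recursion; pr_obs(b, a, o) is an assumed helper returning $\Pr(o \mid b, a)$.

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class PartialPlan:                       # Definition 7 as a data structure
        belief: Dict[str, float]             # b
        action: Optional[str]                # a (None at a terminal belief)
        nu: Dict[str, "PartialPlan"] = field(default_factory=dict)  # covered O_k^p

    def replanning_probability(plan, all_obs, pr_obs):
        """Equation (1): uncovered branch mass plus recursive covered mass."""
        if plan.action is None:              # terminal belief: nothing below
            return 0.0
        total = 0.0
        for o in all_obs:
            po = pr_obs(plan.belief, plan.action, o)
            if o in plan.nu:                 # covered branch: recurse
                total += po * replanning_probability(plan.nu[o], all_obs, pr_obs)
            else:                            # uncovered branch: replanning needed
                total += po
        return total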
The following theorem states that for a valid partial conditional plan $\gamma_k^p$, the failure probability $p_{fail}(\gamma_k^p)$ is bounded by the replanning probability $p_{replan}(\gamma_k^p)$:

Theorem 1. For any valid partial conditional plan $\gamma_k^p$, $p_{fail}(\gamma_k^p) \le p_{replan}(\gamma_k^p)$.

Theorem 1 can be proved by induction on k: execution can only fail after reaching an uncovered branch, every uncovered branch contributes its full probability mass to $p_{replan}$, and the covered branches satisfy the bound by the inductive hypothesis.
Problem Statement. Given a POMDP P, an initial belief $b_0$, a replanning probability bound $\delta_{p_{replan}}$, a safe-reachability objective G and a horizon bound h, our goal is to synthesize a valid k-step (k ≤ h) partial conditional plan $\gamma_k^p = (b_0, a, O_k^p, \nu_k^p)$ with a replanning probability $p_{replan}(\gamma_k^p) \le \delta_{p_{replan}}$. Since the replanning probability $p_{replan}(\gamma_k^p)$ is bounded by $\delta_{p_{replan}}$, by Theorem 1, $\gamma_k^p$ guarantees achieving the given safe-reachability objective with a probability of at least $1 - \delta_{p_{replan}}$. Note that when $p_{replan}(\gamma_k^p) = 0$, $\gamma_k^p$ is a full conditional plan.
3 Online Partial Conditional Plan Synthesis
Figure 3 shows the overall structure of OPCPS. OPCPS follows the typical online planning paradigm [26] that interleaves synthesis of valid partial conditional plans and execution. If there are no valid partial conditional plans within the horizon bound, execution fails. Otherwise, OPCPS follows the generated partial conditional plan until a goal belief is reached or a new observation $o \in O - O_k^p$ is received. In the latter case, OPCPS recursively replans for the observation o. In partial conditional plan synthesis (Fig. 4), we replace the policy generation component in BPS [33] with a new partial conditional plan generation (the green dashed component). For completeness, we offer a brief summary of the constraint generation and plan generation components in BPS. See [33] for more details.
Fig. 3. OPCPS

Fig. 4. Partial conditional plan synthesis
In constraint generation (Fig. 4), given a POMDP P, an initial belief $b_0$ and a safe-reachability objective G = (Dest, Safe), we first construct a constraint $\Phi_k$ to symbolically encode the goal-constrained belief space over a bounded horizon k, based on the encoding from Bounded Model Checking [2]. $\Phi_k$ compactly represents the requirement of reaching a goal belief $b \in \mathrm{Dest}$ safely in k steps. Then in plan generation (Fig. 4), we compute a valid plan $\sigma_k$ by checking the satisfiability of $\Phi_k$ through an SMT solver [7]. If $\Phi_k$ is satisfiable, the SMT solver returns a valid plan $\sigma_k = (b_0^{\sigma_k}, a_1^{\sigma_k}, o_1^{\sigma_k}, \ldots, b_k^{\sigma_k})$. This valid plan $\sigma_k$ only
covers a particular observation $o_i^{\sigma_k}$ at step i. In partial conditional plan generation (Fig. 4), we generate a valid partial conditional plan $\gamma_k^p$ with a replanning probability $p_{replan}(\gamma_k^p) \le \delta_{p_{replan}}$ from this valid plan $\sigma_k$ by sampling a subset $O_k^p \subseteq O$ of observations at each step, where $\delta_{p_{replan}}$ is the given replanning probability bound. If this partial conditional plan generation fails, we construct an additional constraint φ to block invalid plans and force the SMT solver to generate another, better plan. If $\Phi_k$ is unsatisfiable, we increase the horizon and repeat the above steps until a valid partial conditional plan is found or a given horizon bound is reached. Next we describe the new partial conditional plan generation component.

3.1 Partial Conditional Plan Generation
In partial conditional plan generation (Algorithm 1), we construct a valid partial conditional plan $\gamma_k^p$ that satisfies the given bound $\delta_{p_{replan}}$ from a valid plan $\sigma_k$. For each step i, we first recursively construct a next-step conditional plan $\gamma_{next}^p$ for $o_i^{\sigma_k}$ (line 5). If the replanning probability $p_{replan}(\gamma_k^p)$ is greater than the bound $\delta_{p_{replan}}$ (line 9), we add more observation branches to $\gamma_k^p$ by sampling a new observation o′ according to the probability of occurrence (line 11) and recursively constructing a next-step partial conditional plan $\gamma_{next}^p$ for o′ (line 13). This is another partial conditional plan synthesis problem with a new initial belief b′ (line 12), and can be solved recursively using the algorithm shown in Fig. 4.

If we successfully construct a valid $\gamma_{next}^p$ for o′, we add o′ to $\gamma_k^p$ (line 8 or 16). Otherwise, this input plan $\sigma_k$ cannot be an element of a valid partial conditional plan $\gamma_k^p$ ($\sigma_k \in \Omega_{\gamma_k^p}$). Therefore, the prefix $(b_0^{\sigma_k}, a_1^{\sigma_k}, o_1^{\sigma_k}, \ldots, b_{i-1}^{\sigma_k}, a_i^{\sigma_k})$ of the input plan $\sigma_k$ is invalid for the current horizon k and we construct the following additional constraint φ to block invalid plans:

$$\neg\Big((b_0 = b_0^{\sigma_k}) \wedge (a_i = a_i^{\sigma_k}) \wedge \bigwedge_{m=1}^{i-1} \big((a_m = a_m^{\sigma_k}) \wedge (o_m = o_m^{\sigma_k}) \wedge (b_m = b_m^{\sigma_k})\big)\Big) \qquad (2)$$
φ blocks the invalid plans that have this prefix and avoids unnecessary checks of these plans (checking $\sigma_k$ has already shown that these plans are invalid).

Updating Replanning Probability Bound. As we add more observation branches to the current partial conditional plan $\gamma_k^p = (b, a, O_k^p, \nu_k^p)$, we update the replanning probability bound $\delta'_{p_{replan}}$ (line 10) for the remaining uncovered observation branches $O - O_k^p$ to avoid unnecessary computation. Initially, $O_k^p$ is empty and $\delta'_{p_{replan}}$ is the input bound $\delta_{p_{replan}}$ (line 4). $\delta'_{p_{replan}}$ bounds the replanning probability $p_{replan}(\nu_k^p(o))$ of the next-step partial conditional plan $\nu_k^p(o)$ for every remaining uncovered observation $o \in O - O_k^p$. $\delta'_{p_{replan}}$ guarantees that the replanning probability $p_{replan}(\gamma_k^p)$ satisfies the original bound $\delta_{p_{replan}}$, i.e., $p_{replan}(\gamma_k^p) = \sum_{o \in O} \Pr(o|b,a)\, p_{replan}(\nu_k^p(o)) \le \sum_{o \in O} \Pr(o|b,a)\, \delta'_{p_{replan}} = \delta'_{p_{replan}} = \delta_{p_{replan}}$, since $p_{replan}(\nu_k^p(o)) \le \delta'_{p_{replan}}$ based on the definition of $\delta'_{p_{replan}}$.
Algorithm 1: PartialConditionalPlanGeneration

Input: POMDP P = (S, A, T, O, Z), replanning probability bound δ_preplan, safe-reachability objective G = (Dest, Safe), valid k-step plan σ_k = (b_0^{σ_k}, a_1^{σ_k}, o_1^{σ_k}, ..., b_k^{σ_k}), step i, horizon bound h
Output: Valid partial conditional plan γ_k^p with replanning probability p_replan(γ_k^p) ≤ δ_preplan, constraint φ for blocking invalid plans

1:  if i > k then                                    /* reach end of the plan */
2:      γ_k^p ← (b_k^{σ_k}, ∅, ∅, ∅)      /* terminal belief: γ_k^p specifies nothing */
3:      return γ_k^p, ∅
4:  O_k^p ← ∅, δ'_preplan ← δ_preplan, b ← b_{i−1}^{σ_k}, a ← a_i^{σ_k}, o' ← o_i^{σ_k}    /* initialize */
5:  γ_next^p ← PartialConditionalPlanGeneration(P, δ'_preplan, G, σ_k, i + 1, h)    /* recursively process next step */
6:  if γ_next^p = ∅ then                                 /* construction failed */
7:      Construct φ using Formula 2, return ∅, φ
8:  O_k^p ← O_k^p ∪ {o'}, ν_k^p(o') ← γ_next^p, γ_k^p ← (b, a, O_k^p, ν_k^p)    /* add o' to γ_k^p */
9:  while p_replan(γ_k^p) > δ_preplan do
10:     δ'_preplan ← δ'_preplan + Pr(o' | b, a)(δ'_preplan − p_replan(ν_k^p(o'))) / Σ_{o ∈ O−O_k^p−{o'}} Pr(o | b, a)    /* bound update */
11:     o' ← sampled observation in O − O_k^p based on the probability of occurrence
12:     b' ← T_B(b, a, o')                             /* get new initial belief */
13:     γ_next^p ← PartialConditionalPlanSynthesis(P, b', δ'_preplan, G, i, h)    /* recursively construct a next-step partial conditional plan */
14:     if γ_next^p = ∅ then                             /* construction failed */
15:         Construct φ using Formula 2, return ∅, φ
16:     O_k^p ← O_k^p ∪ {o'}, ν_k^p(o') ← γ_next^p       /* add o' to γ_k^p */
17: foreach observation o ∈ O − O_k^p do                /* final safety check */
18:     b ← T_B(b, a, o)                                 /* try observation o */
19:     if b ∉ Safe then                                 /* violates safety */
20:         Construct φ using Formula 2, return ∅, φ
21: return γ_k^p, ∅
During partial conditional plan generation, after adding a new observation o' ∈ O − O_k^p to the partial conditional plan γ_k^p (line 8 or 16), we update δ'_preplan to avoid unnecessary computation. Suppose we construct a new next-step partial conditional plan γ_next^p with the same replanning probability α for every remaining uncovered observation o ∈ O − O_k^p − {o'}. Then the replanning probability of the observation branches O − O_k^p is Pr(o' | b, a) p_replan(ν_k^p(o')) + α Σ_{o ∈ O−O_k^p−{o'}} Pr(o | b, a) ≤ Σ_{o ∈ O−O_k^p} Pr(o | b, a) δ'_preplan. Therefore α ≤ δ'_preplan + Pr(o' | b, a)(δ'_preplan − p_replan(ν_k^p(o'))) / Σ_{o ∈ O−O_k^p−{o'}} Pr(o | b, a). Then the new bound for the remaining uncovered observations o ∈ O − O_k^p − {o'} should be δ'_preplan + Pr(o' | b, a)(δ'_preplan − p_replan(ν_k^p(o'))) / Σ_{o ∈ O−O_k^p−{o'}} Pr(o | b, a), and this new bound is at least δ'_preplan since p_replan(ν_k^p(o')) ≤ δ'_preplan according to the definition of δ'_preplan.
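The update on line 10 is plain arithmetic; a direct transcription with hypothetical variable names reads:

def updated_bound(delta_prime, pr_o_prime, p_replan_nu, pr_uncovered_rest):
    # Bound update of Algorithm 1, line 10.
    # delta_prime:       current bound delta'_preplan
    # pr_o_prime:        Pr(o' | b, a) of the newly covered observation o'
    # p_replan_nu:       replanning probability of the next-step plan nu_k^p(o')
    # pr_uncovered_rest: values Pr(o | b, a) over O - O_k^p - {o'}
    return delta_prime + (pr_o_prime * (delta_prime - p_replan_nu)
                          / sum(pr_uncovered_rest))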
When the replanning probability bound becomes larger, computing a partial conditional plan is usually less expensive. Therefore, updating the replanning probability bound (line 10) usually improves efficiency while still making the current partial conditional plan γ_k^p satisfy the original bound δ_preplan.

Safety Guarantee. After we construct a valid partial conditional plan γ_k^p = (b, a, O_k^p, ν_k^p), if the uncovered observation set is not empty (O − O_k^p ≠ ∅), then the replanning probability p_replan(γ_k^p) > 0. This replanning probability is bounded by the given bound δ_preplan, and by Theorem 1 the execution failure probability p_fail(γ_k^p) is also bounded by δ_preplan. However, if p_replan(γ_k^p) > 0, during execution the robot might receive an uncovered observation o ∈ O − O_k^p for which there is no valid partial conditional plan. Then execution fails due to unsuccessful replanning. In this case, though we cannot achieve the safe-reachability objective, a guarantee that the robot still satisfies the safety requirement is preferable to the situation where the robot violates it. Our approach OPCPS provides this safety guarantee by checking whether the successor belief of every uncovered observation o ∈ O − O_k^p of the constructed partial conditional plan γ_k^p is a safe belief (lines 17–20).
4 Experiments
We test OPCPS on the kitchen domain (horizon bound h = 30) presented in [33] and the classic Tag domain [23] (h = 100). We also validate OPCPS on a Fetch robot for the scenario shown in Fig. 1 (h = 20). We use Z3 [7] as our backend SMT solver. All experiments were conducted on a 3.0 GHz Intel processor with 32 GB of memory. We set the time-out to be 1800 s. For all the tests of the kitchen and Tag domains, the results are averaged over 50 independent runs. In the kitchen domain [33], a robot needs to eventually pick up a cup from the storage while avoiding collisions with M uncertain obstacles. This kitchen domain is an example scenario that requires a correctness guarantee of accomplishing tasks, and POMDPs with safe-reachability objectives provide a better correctness guarantee than the quantitative POMDP formulations [33]. The kitchen environment is discretized into N = 36 regions. The actuation and perception of the robot are imperfect, modeled as ten uncertain robot actions: move and look in four directions, and pick-up using the left or right hand. We assume that the robot starts at a known initial location. However, due to the robot's imperfect perception, the location of the robot and the locations of obstacles are all partially observable during execution. This kitchen domain has a large state space |S| = C(N, M) · N, where C(N, M) is the number of M-combinations from the set of N regions. In the largest test (M = 7) there are more than 10^8 states.
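As a quick sanity check of that state count (a two-line computation, not from the paper):

from math import comb

N, M = 36, 7                  # regions and uncertain obstacles
print(comb(N, M) * N)         # C(N, M) * N = 300516480 > 10^8 states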
Fig. 5. Performance results for the kitchen domain as the bound δ_preplan increases: (a) computation time of one synthesis call, (b) number of synthesis calls, (c) total computation time, (d) computation time per step. Different plots correspond to tests with different numbers M of obstacles. Missing data points in a plot indicate time-out. The red dashed line is the plot of time = 1800 s (time-out). The blue dashed line passes through the data points generated by BPS. All the results are averaged over 50 independent runs.
Fig. 6. Success rate as δ_preplan increases. The green dotted line shows the plot of success rate = 1.0 − δ_preplan. The red dashed line is the plot of success rate = 1.0. The blue dashed line passes through the data points generated by BPS.
Performance. We evaluate our previous BPS method [33] and OPCPS (with the replanning probability bound δ_preplan ranging from 0.1 to 0.9) in the kitchen domain with various numbers of obstacles. BPS computes a full conditional plan that covers all observation branches and is equivalent to OPCPS with δ_preplan = 0. Figure 5a, b, c and d show the average computation time of one synthesis call, the average number of synthesis calls, the average total computation time and the
average computation time per step as the bound δ_preplan increases, respectively. As shown in Fig. 5a (semi-log scale) and Fig. 5b, the computation time of one synthesis call decreases very quickly while the number of calls to partial conditional plan synthesis does not increase much as δ_preplan increases. Therefore, the total computation time (Fig. 5c) keeps decreasing as δ_preplan increases. Additionally, as we can see from Fig. 5c (semi-log scale), with a small bound δ_preplan = 0.1, we observe a big performance gain compared to BPS: for the test case with M = 4 obstacles, the speedup is around 5 times, and for the test case with M = 5 obstacles, BPS times out while OPCPS with δ_preplan = 0.1 can solve the test in around 9 min. Therefore, OPCPS achieves better performance than BPS in these tests by computing partial conditional plans that approximate full conditional plans. The results for the average computation time per step (Fig. 5d) show the same trend. These results suggest that for domains where replanning is easy, increasing the replanning probability bound usually leads to better scalability.
Success Rate. For all the previous performance tests, the partial conditional plans constructed by OPCPS with different bounds δ_preplan always achieve the safe-reachability objective (success rate = 100%) because the robot can move in four directions. When the robot enters a region surrounded by obstacles in three directions, it can always move back to its previous position, which means replanning is always possible. However, in some domains such as autonomous driving and robot chefs, when the robot commits to an action and finds something wrong, it is difficult or impossible to reverse the action's effects and replan. To evaluate how OPCPS performs in these scenarios, we test OPCPS in the kitchen domain with different numbers M of obstacles (M ≤ 4 since BPS times out for tests with more than four obstacles), but we disable the robot's move-north action. Therefore, when the robot performs move-south and enters a region surrounded by obstacles in three directions, replanning fails. However, the robot still satisfies the safety requirement, thanks to the safety guarantee of OPCPS.
Fig. 7. Replanning probability and total computation time as the bound δ_preplan increases (M = 4). The green dotted line shows the plot of replanning probability = δ_preplan. The blue dashed line passes through the data points generated by BPS.
Figure 6 shows the success rate as the bound δ_preplan increases. For all the tests, the success rate is always greater than 1.0 − δ_preplan (all data points are above the plot of success rate = 1.0 − δ_preplan). This matches Theorem 1: the failure probability of a valid partial conditional plan is bounded by the replanning probability. Moreover, as the bound δ_preplan decreases to 0, OPCPS produces a valid full conditional plan with 100% success rate. These results suggest that for domains where we anticipate that replanning is difficult, users can decrease the bound δ_preplan and allocate more computational resources for a high success rate. Note that the replanning probability bound is a conservative upper bound on the failure probability since it pessimistically assumes that all the uncovered observation branches that require replanning will fail, which is rare in practice. As we can see from Fig. 6, even with a high replanning probability bound δ_preplan = 0.9, the success rate is still at least 70% and the failure rate is at most 30%, much smaller than the given bound δ_preplan = 0.9.

Gains from Updating Replanning Probability Bound. As we discussed in Sect. 3.1, updating the replanning probability bound during partial conditional plan generation is important for avoiding unnecessary computation and improving efficiency. To evaluate the gains from this bound update step, we test OPCPS with and without the bound update in the kitchen domain with M = 4 obstacles. Figure 7a and b (semi-log scale) show the average replanning probability of the constructed partial conditional plans and the average total computation time as the bound δ_preplan increases, respectively. As shown in Fig. 7a, with both settings (with and without bound update) OPCPS constructs a partial conditional plan with a replanning probability smaller than δ_preplan. However, OPCPS without the bound update constructs a partial conditional plan with a lower replanning probability than OPCPS with the bound update, i.e., it covers more branches than necessary. Therefore, OPCPS without the bound update performs unnecessary computation and spends more time, as shown in Fig. 7b. For the tests with δ_preplan = 0.1, 0.2, 0.3, which take more time to solve than those with δ_preplan > 0.3, OPCPS with the bound update achieves a 2–5 times speedup.
Fig. 8. Physical validation of OPCPS for the domain shown in Fig. 1.
Physical Validation. We validate OPCPS on a Fetch robot for the domain shown in Fig. 1. The setup of this domain is similar to the kitchen domain. The Fetch needs to pick up a target object (the blue can on the table) while avoiding collisions with uncertain obstacles such as floor signs and file cabinets, which can be placed in different locations. The POMDP's state space consists of robot locations and object locations. We use a Vicon system to detect object locations, which is usually accurate but can still produce false negatives and false positives due to occlusion or inappropriate Vicon marker configurations on objects. We estimate the false negative and false positive probabilities by counting the false negative and false positive events during 100 Vicon detections. The POMDP's probabilistic observation function is defined based on these probabilities. To test the effects of different replanning probability bounds, we only allow the Fetch to move in three directions (west, east and south), similar to the setup in the previous success rate tests. Sometimes the Fetch may fail to move its base when given a move action command and stay in the same place. We estimate the failure probability of these move actions by counting the failure events during 100 move action executions. The POMDP's probabilistic transition function is defined based on this failure probability. Figure 8a shows the initial state. There are two uncertain obstacles (a wet-floor sign and a file cabinet). We test OPCPS with two bounds, δ_preplan = 0.9 and δ_preplan = 0.1. With δ_preplan = 0.9, after observing no obstacle in the south direction, the Fetch decides to move south (Fig. 8b) because the partial conditional plan constructed with a high replanning probability bound does not cover the case where the Fetch is surrounded by obstacles and the wall. Then replanning fails but the Fetch still satisfies the safety requirement as shown in Fig. 8b, thanks to the safety guarantee provided by OPCPS. However, with δ_preplan = 0.1, after observing no obstacles in the south direction, the Fetch decides to move west (Fig. 8c) because the partial conditional plan constructed with a low replanning probability bound covers the case where the robot is surrounded by obstacles. In order to avoid this situation, the Fetch
needs to move west and gather more information. Then the Fetch observes an obstacle in the south direction and decides to move west again (Fig. 8d). Next, the Fetch observes no obstacle in the south direction, and now it can move south. Unlike the case shown in Fig. 8b where the robot is surrounded by two obstacles and the wall, in the situation shown in Fig. 8d, if there is another obstacle in the south direction, the Fetch can still move west since there are only two obstacles. Finally, the Fetch moves to the table and picks up the target object (Fig. 8e).

Tag Domain. To further demonstrate the advantage of OPCPS over our previous BPS method, we evaluate OPCPS on a classic POMDP domain [23]. The task for the robot is to search for and tag a moving agent in a grid with 29 locations. The agent follows a fixed strategy that intentionally moves away from the robot. Both the robot and the agent can move in four directions or stay. The robot's location is fully observable while the agent's location is unobservable unless the robot and the agent are in the same location. This Tag domain is challenging for BPS because of its large number of observations (|O| = 30) and, more importantly, the huge planning horizon required for computing a full conditional plan. However, computing a full conditional plan is unnecessary since replanning is easy in this domain. Figure 9a and b show the average total computation time and the average computation time per step for the Tag domain as the bound δ_preplan increases. These results show a similar trend to the previous kitchen domain tests: with a small bound δ_preplan = 0.1, we observe a big performance gain compared to BPS. BPS cannot solve this test within the 1800-s time limit while OPCPS with δ_preplan = 0.1 can solve it in around 40 s with a computation time per step of less than 1 s.

Fig. 9. Performance results for the Tag domain as the replanning probability bound δ_preplan increases: (a) total computation time; (b) computation time per step. All the results are averaged over 50 independent runs.
5 Discussion
We presented a new approach, called OPCPS, to policy synthesis for POMDPs with safe-reachability objectives. We introduced the notion of a partial conditional plan to improve computational efficiency. Rather than explicitly enumerating all possible observations to construct a full conditional plan, OPCPS samples a subset of all observations to ensure a bounded replanning probability. Our theoretical and empirical results show that the failure probability of a valid partial conditional plan is bounded by the replanning probability. Moreover, OPCPS guarantees that the robot still satisfies the safety requirement when replanning fails. Compared to our previous BPS method [33], OPCPS with a proper replanning probability bound scales better in the tested domains and can solve problems that are beyond the capabilities of BPS within the time limit. The results also suggest that for domains where replanning is easy, increasing the replanning probability bound usually leads to better scalability, and for domains where replanning is difficult or impossible in some states, we can decrease the replanning probability bound and allocate more computation time to achieve a higher success rate. Our results also indicate that by updating the replanning probability bound during partial conditional plan generation, we can quickly detect if
the current partial conditional plan satisfies the bound and avoid unnecessary computation. In this work, we focus on discrete POMDPs. While many robot applications can be modeled using this discrete representation, discretization often suffers from the curse of dimensionality. Investigating how to deal with continuous POMDPs [1,9,24,28] directly is a promising future direction. The current implementation of OPCPS constructs partial conditional plans by sampling observations according to the probability of occurrence (Algorithm 1, line 11), which does not consider the importance of observations [17]. How to extend OPCPS to handle critical observations is another important ongoing question.

Acknowledgments. This work was supported in part by NSF CCF 1139011, NSF CCF 1514372, NSF CCF 1162076 and NSF IIS 1317849. We thank the reviewers for their insightful comments, and Juan David Hernández, Bryce Willey and Constantinos Chamzas for their assistance in the physical experiments.
References

1. Bai, H., Hsu, D., Lee, W.S.: Integrated perception and planning in the continuous space: a POMDP approach. Int. J. Robot. Res. 33(9), 1288–1302 (2014)
2. Biere, A., Cimatti, A., Clarke, E.M., Strichman, O., Zhu, Y.: Bounded model checking. Adv. Comput. 58, 117–148 (2003)
3. Cai, P., Luo, Y., Hsu, D., Lee, W.S.: HyP-DESPOT: a hybrid parallel algorithm for online planning under uncertainty. In: RSS (2018)
4. Chatterjee, K., Chmelík, M., Gupta, R., Kanodia, A.: Qualitative analysis of POMDPs with temporal logic specifications for robotics applications. In: ICRA, pp. 325–330 (2015)
5. Chatterjee, K., Chmelík, M., Gupta, R., Kanodia, A.: Optimal cost almost-sure reachability in POMDPs. Artif. Intell. 234(C), 26–48 (2016)
6. Chatterjee, K., Chmelík, M., Tracol, M.: What is decidable about partially observable Markov decision processes with ω-regular objectives. J. Comput. Syst. Sci. 82(5), 878–911 (2016)
7. de Moura, L., Bjørner, N.: Z3: an efficient SMT solver. In: TACAS, pp. 337–340 (2008)
8. Hadfield-Menell, D., Groshev, E., Chitnis, R., Abbeel, P.: Modular task and motion planning in belief space. In: IROS, pp. 4991–4998 (2015)
9. Hoey, J., Poupart, P.: Solving POMDPs with continuous or large discrete observation spaces. In: IJCAI, pp. 1332–1338 (2005)
10. Hou, P., Yeoh, W., Varakantham, P.: Solving risk-sensitive POMDPs with and without cost observations. In: AAAI, pp. 3138–3144 (2016)
11. Isom, J.D., Meyn, S.P., Braatz, R.D.: Piecewise linear dynamic programming for constrained POMDPs. In: AAAI, pp. 291–296 (2008)
12. Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artif. Intell. 101(1–2), 99–134 (1998)
13. Kaelbling, L.P., Lozano-Pérez, T.: Integrated task and motion planning in belief space. Int. J. Robot. Res. 32(9–10), 1194–1227 (2013)
14. Kaelbling, L.P., Lozano-Pérez, T.: Implicit belief-space pre-images for hierarchical planning and execution. In: ICRA, pp. 5455–5462 (2016)
15. Kim, D., Lee, J., Kim, K.E., Poupart, P.: Point-based value iteration for constrained POMDPs. In: IJCAI, pp. 1968–1974 (2011)
16. Kurniawati, H., Hsu, D., Lee, W.S.: SARSOP: efficient point-based POMDP planning by approximating optimally reachable belief spaces. In: RSS (2008)
17. Luo, Y., Bai, H., Hsu, D., Lee, W.S.: Importance sampling for online planning under uncertainty. In: WAFR (2016)
18. Madani, O., Hanks, S., Condon, A.: On the undecidability of probabilistic planning and related stochastic optimization problems. Artif. Intell. 147, 5–34 (2003)
19. Marecki, J., Varakantham, P.: Risk-sensitive planning in partially observable environments. In: AAMAS, pp. 1357–1368 (2010)
20. Mundhenk, M., Goldsmith, J., Lusena, C., Allender, E.: Complexity of finite-horizon Markov decision process problems. J. ACM 47(4), 681–720 (2000)
21. Papadimitriou, C., Tsitsiklis, J.N.: The complexity of Markov decision processes. Math. Oper. Res. 12(3), 441–450 (1987)
22. Paz, A.: Introduction to Probabilistic Automata. Academic Press Inc., Cambridge (1971)
23. Pineau, J., Gordon, G., Thrun, S.: Point-based value iteration: an anytime algorithm for POMDPs. In: IJCAI, pp. 1025–1030 (2003)
24. Porta, J.M., Vlassis, N., Spaan, M.T.J., Poupart, P.: Point-based value iteration for continuous POMDPs. J. Mach. Learn. Res. 7, 2329–2367 (2006)
25. Poupart, P., Malhotra, A., Pei, P., Kim, K.E., Goh, B., Bowling, M.: Approximate linear programming for constrained partially observable Markov decision processes. In: AAAI, pp. 3342–3348 (2015)
26. Ross, S., Pineau, J., Paquet, S., Chaib-Draa, B.: Online planning algorithms for POMDPs. J. Artif. Intell. Res. 32(1), 663–704 (2008)
27. Santana, P., Thiébaux, S., Williams, B.: RAO*: an algorithm for chance-constrained POMDP's. In: AAAI, pp. 3308–3314 (2016)
28. Seiler, K.M., Kurniawati, H., Singh, S.P.N.: An online and approximate solver for POMDPs with continuous action space. In: ICRA, pp. 2290–2297 (2015)
29. Smallwood, R.D., Sondik, E.J.: The optimal control of partially observable Markov processes over a finite horizon. Oper. Res. 21(5), 1071–1088 (1973)
30. Somani, A., Ye, N., Hsu, D., Lee, W.S.: DESPOT: online POMDP planning with regularization. In: NIPS, pp. 1772–1780 (2013)
31. Švoreňová, M., Chmelík, M., Leahy, K., Eniser, H.F., Chatterjee, K., Černá, I., Belta, C.: Temporal logic motion planning using POMDPs with parity objectives: case study paper. In: HSCC, pp. 233–238 (2015)
32. Undurti, A., How, J.P.: An online algorithm for constrained POMDPs. In: ICRA, pp. 3966–3973 (2010)
33. Wang, Y., Chaudhuri, S., Kavraki, L.E.: Bounded policy synthesis for POMDPs with safe-reachability objectives. In: AAMAS, pp. 238–246 (2018)
An Approximation Algorithm for Risk-Averse Submodular Optimization

Lifeng Zhou¹ and Pratap Tokekar²

¹ Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, USA. [email protected]
² Department of Computer Science, University of Maryland, College Park, USA. [email protected]
Abstract. We study the problem of incorporating risk while making combinatorial decisions under uncertainty. We formulate a discrete submodular maximization problem for selecting a set using Conditional-Value-at-Risk (CVaR), a risk metric commonly used in financial analysis. While CVaR has recently been used in optimization of linear cost functions in robotics, we take the first steps towards extending this to discrete submodular optimization and provide several positive results. Specifically, we propose the Sequential Greedy Algorithm, which provides an approximation guarantee on finding the maxima of the CVaR cost function under a matroidal constraint. The approximation guarantee shows that the solution produced by our algorithm is within a constant factor of the optimal and an additive term that depends on the optimal. Our analysis uses the curvature of the submodular set function and proves that the algorithm runs in polynomial time. This formulation captures a number of combinatorial optimization problems that appear in robotics. We use two such problems, vehicle assignment under uncertainty for mobility-on-demand and sensor selection with failures for environmental monitoring, as case studies to demonstrate the efficacy of our formulation.
1 Introduction
Combinatorial optimization problems find a variety of applications in robotics. Typical examples include:

• Sensor placement: Where to place sensors to maximally cover the environment [1] or reduce the uncertainty in the environment [2]?
• Task allocation: How to allocate tasks to robots to maximize the overall utility gained by the robots [3]?
• Combinatorial auction: How to choose a combination of items for each player to maximize the total rewards [4]?

This material is based upon work supported by the National Science Foundation under Grant 1637915 and Office of Naval Research Award N00014-18-1-2829.
Algorithms for solving such problems find use in sensor placement for environment monitoring [1,2], robot-target assignment and tracking [5–7], and informative path planning [8]. The underlying optimization problem is:

max_{S ⊆ X, S ∈ I} f(S),    (1)
where X denotes a ground set from which a subset of elements S must be chosen. f is a monotone submodular utility function [9,10]. Submodularity is the property of diminishing returns. Many information-theoretic measures, such as mutual information [2], and geometric measures, such as the visible area [11], are known to be submodular. I denotes a matroidal constraint [9,10]. Matroids are a powerful combinatorial tool that can represent constraints on the solution set, e.g., cardinality constraints ("place no more than k sensors") and connectivity constraints ("the communication graph of the robots must be connected") [12]. The objective is to find a set S satisfying a matroidal constraint I and maximizing the utility f(S). The general form of this problem is NP-complete. However, a greedy algorithm yields a constant-factor approximation guarantee [9,10].

In practice, sensors can fail or get compromised [13] or robots may not know the exact positions of the targets [14]. Hence, the utility f(S) is not necessarily deterministic but can have uncertainty. Our main contribution is to extend the traditional formulation given in Eq. 1 to also account for the uncertainty in the actual cost function. We model the uncertainty by assuming that the utility function is of the form f(S, y), where S ⊆ X is the decision variable and y ∈ Y represents a random variable which is independent of S. We focus on the case where f(S, y) is monotone submodular in S and integrable in y. The traditional way of stochastic optimization is to optimize the expected utility: max_{S ⊆ X, S ∈ I} E_y[f(S, y)]. Since the sum of monotone submodular functions is monotone submodular, E_y[f(S, y)] is still monotone submodular in S. Thus, the greedy algorithm retains its constant-factor performance guarantee [9,10]. Examples of this approach include influence maximization [15], moving target detection and tracking [14], and robot assignment with travel-time uncertainty [16].

Fig. 1. Mobility on demand with travel time uncertainty of self-driving vehicles.

While optimizing the expected utility has its uses, it also has its pitfalls. Consider the example of mobility-on-demand where two self-driving vehicles, v1 and v2, are available to pick up the passengers at a demand location (Fig. 1). v1 is closer to the demand location, but it needs to cross an intersection where it may need to stop and wait. v2 is further from the demand location but there is no intersection along the path. The travel time for v1 follows a bimodal distribution (with and without a traffic stop) whereas that for v2 follows a unimodal distribution with a higher mean but lower uncertainty. Clearly, if the passenger uses the expected travel time as the objective,
they would choose v1. However, they risk waiting a much longer time, i.e., 17–20 min, about half of the time. A more risk-averse passenger would choose v2, which has a higher expected waiting time of 16 min but a lower risk of waiting longer. Thus, in these scenarios, it is natural to go beyond expectation and focus on a risk-averse measure. One popular coherent risk measure is Conditional-Value-at-Risk (CVaR) [17,18]. CVaR takes a risk level α which is the probability of the worst α-tail cases. Loosely speaking, maximizing CVaR is equivalent to maximizing the expectation of the worst α-tail scenarios.¹ This risk-averse decision is rational especially when failures can lead to unrecoverable consequences, such as a sensor failure.

Related Work. Yang and Chakraborty studied a chance-constrained combinatorial optimization problem that takes into account the risk in multi-robot assignment [19]. They later extended this to knapsack problems [20]. They solved the problem by transforming it to a risk-averse problem with a mean-variance measure [21]. Chance-constrained optimization is similar to optimizing the Value-at-Risk (VaR), which is another popular risk measure in finance [22]. However, Majumdar and Pavone argued that CVaR is a better measure to quantify risk than VaR or mean-variance, based on six proposed axioms in the context of robotics [23]. Several works have focused on optimizing CVaR. In their seminal work [18], Rockafellar and Uryasev presented an algorithm for CVaR minimization for reducing the risk in financial portfolio optimization with a large number of instruments. Note that, in portfolio optimization, we select a distribution over available decision variables, instead of selecting a single one. Later, they showed the advantage of optimizing CVaR for general loss distributions in finance [24]. When the utility is a discrete submodular set function, i.e., f(S, y), Maehara presented a negative result for maximizing CVaR [25]: there is no polynomial-time multiplicative approximation algorithm for this problem under some reasonable assumptions in computational complexity. To avoid this difficulty, Ohsaka and Yoshida [26] used the same idea from portfolio optimization and proposed a method of selecting a distribution over available sets rather than selecting a single set, and gave a provable guarantee. Following this line, Wilder considered CVaR maximization of a continuous submodular function instead of submodular set functions [27]. They gave a (1 − 1/e)-approximation algorithm for continuous submodular functions and also evaluated the algorithm for discrete submodular functions using portfolio optimization [26].

Contributions. We focus on the problem of selecting a single set, similar to [25], to maximize CVaR rather than on portfolio optimization [26,27]. This is because we are motivated by applications where a one-shot decision (placing sensors and assigning vehicles) must be taken. Our contributions are as follows:

• We propose the Sequential Greedy Algorithm (SGA), which uses the deterministic greedy algorithm [9,10] as a subroutine to find the maximum value of CVaR (Algorithm 1).
¹ We formally review CVaR and other related concepts in Sect. 2.1.
• We prove that the solution found by SGA is within a constant factor of the optimal performance along with an additive term which depends on the optimal value. We also prove that SGA runs in polynomial time (Theorem 1) and that its performance improves as the running time increases.
• We demonstrate the utility of the proposed CVaR maximization problem through two case studies (Sect. 3.2). We evaluate the performance of SGA through simulations (Sect. 5).

Organization of the Rest of the Paper. We give the necessary background knowledge for the rest of the paper in Sect. 2. We formulate the CVaR submodular maximization problem with two case studies in Sect. 3. We present SGA along with the analysis of its computational complexity and approximation ratio in Sect. 4. We illustrate the performance of SGA on the two case studies in Sect. 5. We conclude the paper in Sect. 6.
2 Background and Preliminaries
We start by defining the conventions used in the paper. Calligraphic font denotes a set (e.g., A). Given a set A, 2^A denotes its power set and |A| denotes its cardinality. Given a set B, A \ B denotes the set of elements in A that are not in B. Pr[·] denotes the probability of an event and E[·] denotes the expectation of a random variable. ⌈x⌉ = min{n ∈ Z | x ≤ n}, where Z denotes the set of integers. Next, we give the background on set functions (in the appendix file) and risk measures.

Fig. 2. An illustration of risk measures: VaR and CVaR.
2.1 Risk Measures
Let f(S, y) be a utility function with decision set S and random variable y. For each S, the utility f(S, y) is also a random variable with a distribution induced by that of y. First, we define the Value-at-Risk at risk level α ∈ (0, 1]:

VaR_α(S) = inf{τ ∈ ℝ : Pr[f(S, y) ≤ τ] ≥ α}.    (2)

Thus, VaR_α(S) denotes the left endpoint of the α-quantile(s) of the random variable f(S, y). The Conditional-Value-at-Risk is the expectation over this set of α-worst cases of f(S, y), defined as:

CVaR_α(S) = E_y[f(S, y) | f(S, y) ≤ VaR_α(S)].    (3)
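Both quantities are straightforward to estimate from Monte Carlo samples of f(S, y); the short sketch below (ours, not from the paper) makes the definitions concrete.

import numpy as np

def empirical_var_cvar(utilities, alpha):
    # utilities: samples of f(S, y) for a fixed set S; alpha in (0, 1].
    u = np.sort(np.asarray(utilities, dtype=float))
    var = u[int(np.ceil(alpha * len(u))) - 1]   # left endpoint of alpha-quantile
    cvar = u[u <= var].mean()                   # expectation of alpha-worst cases
    return var, cvar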
Figure 2 shows an illustration of VaR_α(S) and CVaR_α(S). CVaR_α(S) is more popular than VaR_α(S) since it has better properties [18], such as coherence [28]. When optimizing CVaR_α(S), we usually resort to an auxiliary function:

H(S, τ) = τ − (1/α) E[(τ − f(S, y))^+].

We know that optimizing CVaR_α(S) over S is equivalent to optimizing the auxiliary function H(S, τ) over S and τ [18]. The following lemmas give useful properties of the auxiliary function H(S, τ).
Lemma 1. If f(S, y) is normalized, monotone increasing and submodular in set S for any realization of y, then the auxiliary function H(S, τ) is monotone increasing and submodular, but not necessarily normalized, in set S for any given τ.

We provide the proofs of all the lemmas and the theorem in the arXiv version [29].

Lemma 2. The auxiliary function H(S, τ) is concave in τ for any given set S.

Lemma 3. For any given set S, the gradient of the auxiliary function H(S, τ) with respect to τ satisfies −(1/α − 1) ≤ ∂H(S, τ)/∂τ ≤ 1.
3 Problem Formulation and Case Studies
We first formulate the CVaR submodular maximization problem and then present two applications which we use as case studies.
3.1 Problem Formulation
CVaR Submodular Maximization: We consider the problem of maximizing CVaR_α(S) over a decision set S ⊆ X under a matroid constraint S ∈ I. Maximizing CVaR_α(S) over S is equivalent to maximizing the auxiliary function H(S, τ) over S and τ [18]. Thus, we pose the maximization problem as:

Problem 1.

max_{S, τ}  τ − (1/α) E[(τ − f(S, y))^+]
s.t.  S ∈ I, S ⊆ X, τ ∈ [0, Γ],    (4)
where Γ is the upper bound of the parameter τ. Problem 1 gives a risk-averse version of maximizing submodular set functions.
3.2 Case Studies
Risk-averse submodular maximization has many applications. We describe two specific applications which we will use in the simulations.
Resilient Mobility-on-Demand. Consider a mobility-on-demand problem where we assign R vehicles to N demand locations under arrival-time uncertainty. An example is shown in Fig. 3 where seven self-driving vehicles must be assigned to three demand locations to pick up passengers. We follow the same constraint setting as in [16]: each vehicle can be assigned to at most one demand but multiple vehicles can be assigned to the same demand. Only the vehicle that arrives first is chosen for picking up the passengers. Note that the advantage of the redundant assignment to each demand is that it counters the effect of uncertainty and reduces the waiting time at demand locations [16]. This may be too conservative for consumer mobility-on-demand services but can be crucial for urgent and time-critical tasks such as delivering medical supplies [30].

Fig. 3. Mobility-on-demand with multiple demands and multiple self-driving vehicles.

Assume the time for a robot to arrive at a demand location is a random variable whose distribution can depend on the mean arrival time. For example, it is possible to have a shorter path that passes through many intersections, which leads to uncertainty in the arrival time, while a longer road (possibly a highway) has lower arrival-time uncertainty. Note that for each demand location there is a set of robots assigned to it, and the vehicle selected at the demand location is the one that arrives first. Then this problem becomes a minimization one, since we would like to minimize the arrival time at all demand locations. We convert it into a maximization one by taking the reciprocal of the arrival time. Specifically, we use the arrival efficiency, which is the reciprocal of the arrival time. Instead of selecting the vehicle at the demand location with minimum arrival time, we select the vehicle with maximum arrival efficiency. The arrival efficiency is also a random variable with a distribution depending on the mean arrival efficiency. Denote the arrival efficiency for robot j ∈ {1, ..., R} arriving at demand location i ∈ {1, ..., N} as e_ij. Denote the assignment utility as the total arrival efficiency over all locations, that is,

f(S, y) = Σ_{i=1}^{N} max_{j ∈ S_i} e_ij    (5)
with ⋃_{i=1}^{N} S_i = S and S_i ∩ S_k = ∅ for i ≠ k, i, k ∈ {1, ..., N}. The condition S_i ∩ S_k = ∅ indicates that the selected set S satisfies a partition matroid constraint, S ∈ I, which represents that each robot can be assigned to at most one demand. The assignment utility f(S, y) is monotone submodular in S due to the "max" function, and it is normalized since f(∅, y) = 0. Here, we regard the uncertainty as a risk. Our risk-averse assignment problem is a trade-off between efficiency and uncertainty: the goal is to maximize the total efficiency at the demand locations while considering the risk from uncertainty.
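For concreteness, one realization of Eq. (5) can be evaluated as in the following sketch (ours; the variable names are hypothetical):

def assignment_utility(parts, e):
    # parts: list of N disjoint vehicle-index sets S_i, one per demand;
    # e:     sampled efficiency matrix for this realization, e[i][j] = e_ij.
    return sum(max(e[i][j] for j in S_i)
               for i, S_i in enumerate(parts) if S_i)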
Fig. 4. Campus monitoring using a set of sensors with visibility regions: (a) part of the Virginia Tech campus from Google Earth; (b) top view of part of the campus and a ground sensor's visibility region.
Robust Environment Monitoring. Consider an environment monitoring problem where we monitor part of a campus with a group of ground sensors (Fig. 4). Given a set of N candidate positions X, we would like to choose a subset of M positions S ⊆ X, M ≤ N, to place visibility-based sensors to maximally cover the environment. The visibility regions of the ground sensors are obstructed by the buildings in the environment (Fig. 4(b)). Consider a scenario where the probability of failure of a sensor depends on the area it covers: a sensor covering a larger area has a larger risk of failure associated with it. This may be because the same number of pixels is used to cover a larger area, so each pixel covers a proportionally smaller footprint; as a result, the sensor risks missing small objects. Denote the probability of success and the visibility region of each sensor i ∈ {1, ..., N} as p_i and v_i, respectively. The polygon each sensor i monitors is then also a random variable; denote it as A_i. Denote the selection utility as the joint coverage area of the set of sensors S, that is,

f(S, y) = area(⋃_{i ∈ S} A_i), S ∈ I.    (6)
The selection utility f(S, y) is monotone submodular in S due to the overlapping areas, and it is normalized since f(∅, y) = 0. Here, we regard sensor failure as a risk. Our robust environment monitoring problem is a trade-off between area coverage and sensor failure: the goal is to maximize the joint area covered while considering the risk from sensor failure.
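A discretized sketch of Eq. (6), with a grid of cells standing in for the polygon areas (our simplification; names are hypothetical):

def coverage_utility(S, visible_cells, failed):
    # S:             selected sensor indices
    # visible_cells: dict mapping sensor i to the set of grid cells it covers (its v_i)
    # failed:        sensors that failed in this realization of y
    covered = set()
    for i in S:
        if i not in failed:        # sensor i works with probability p_i
            covered |= visible_cells[i]
    return len(covered)            # cell count stands in for area of the union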
4 Algorithm and Analysis
We present the Sequential Greedy Algorithm (SGA) for solving Problem 1 by leveraging useful properties of the auxiliary function. The pseudo-code is given in Algorithm 1. SGA mainly consists of searching for the appropriate value of τ by
solving a subproblem for a fixed τ under a matroid constraint. Even for a fixed τ, the subproblem of optimizing the auxiliary function is NP-complete. Nevertheless, we can employ the greedy algorithm for the subproblem, and sequentially apply it while searching over all τ. We explain each stage in detail next.
4.1 Sequential Greedy Algorithm
There are four stages in SGA:

(a) Initialization (line 1): Algorithm 1 defines a storage set M and initializes it to the empty set. Note that, for each specific τ, we can use the greedy algorithm to obtain a near-optimal solution S^G based on the monotonicity and submodularity of the auxiliary function H(S, τ). M stores all the (S^G, τ) pairs generated while searching over the possible values of τ.

(b) Searching for τ (for loop in lines 2–10): We use a user-defined separation Δ (line 3) to sequentially search over the possible values of τ within [0, Γ]. Γ is an upper bound on τ and can be set by the user based on the specific problem at hand. We show how to find Γ for the specific cases in Sect. 5.

(c) Greedy algorithm (lines 4–8): For a specific τ, say τ_i, we use the greedy approach to choose a set S_i^G. We first initialize S_i^G to the empty set (line 4). Subject to the matroid constraint S_i^G ∈ I (line 5), in each round we add to S_i^G (line 7) the new element s that gives the maximum marginal gain of H(S_i^G, τ_i) (line 6).

(d) Find the best pair (line 11): Based on the collection of pairs M (line 9), we pick the pair (S_i^G, τ_i) ∈ M that maximizes H(S_i^G, τ_i) as the final solution S^G. We denote this value of τ by τ^G.

Designing an Oracle: Note that an oracle O is used to calculate the value of H(S, τ). We use a sampling-based method to approximate this oracle. Specifically, we sample n_s realizations ỹ from the distribution of y and approximate H(S, τ) as H(S, τ) ≈ τ − (1/(n_s α)) Σ_ỹ (τ − f(S, ỹ))^+. According to [26, Lemma 4.1], if the number of samples is n_s = O((1/ε²) log(1/δ)), with δ, ε ∈ (0, 1), the CVaR approximation error is less than ε with probability at least 1 − δ.
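A minimal sketch of such a sampling oracle (ours; names are hypothetical):

import numpy as np

def estimate_H(f, S, tau, y_samples, alpha):
    # f:         callable returning f(S, y) for one realization y
    # y_samples: n_s realizations drawn from the distribution of y
    vals = np.array([f(S, y) for y in y_samples])
    return tau - np.maximum(tau - vals, 0.0).mean() / alpha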
4.2 Performance Analysis of SGA
Theorem 1. Let S^G and τ^G be the set and the scalar chosen by SGA, and let S* and τ* be the set and the scalar chosen by OPT. We have

H(S^G, τ^G) ≥ (1/(1 + k_f)) (H(S*, τ*) − Δ) − (k_f/(1 + k_f)) Γ (1/α − 1),    (7)

where k_f ∈ [0, 1] is the curvature of H(S, τ) in set S (see the appendix for the detailed definition of the curvature). The computational time is O((Γ/Δ) |X|² n_s), where Γ and Δ are the upper bound on τ and the searching separation parameter, |X| is the cardinality of the ground set X, and n_s is the number of samples used by the oracle.
Algorithm 1: Sequential Greedy Algorithm for Problem 1.

Input:
• Ground set X and matroid I
• User-defined risk level α ∈ [0, 1]
• Range of the parameter τ ∈ [0, Γ] and discretization step Δ ∈ (0, Γ]
• An oracle O that evaluates H(S, τ)
Output:
• Selected set S^G and corresponding parameter τ^G

1:  M ← ∅
2:  for i = {0, 1, ..., ⌈Γ/Δ⌉} do
3:      τ_i = iΔ
4:      S_i^G ← ∅
5:      while S_i^G ∈ I do
6:          s = argmax_{s ∈ X \ S_i^G, S_i^G ∪ {s} ∈ I} [H(S_i^G ∪ {s}, τ_i) − H(S_i^G, τ_i)]
7:          S_i^G ← S_i^G ∪ {s}
8:      end while
9:      M = M ∪ {(S_i^G, τ_i)}
10: end for
11: (S^G, τ^G) = argmax_{(S_i^G, τ_i) ∈ M} H(S_i^G, τ_i)
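For illustration, a compact Python rendering of Algorithm 1 (our sketch; independent is the membership test for the matroid I and H is the oracle, e.g. the estimate_H sketch above):

import numpy as np

def sga(ground_set, independent, H, Gamma, Delta):
    best_S, best_tau, best_val = frozenset(), 0.0, -np.inf
    for i in range(int(np.ceil(Gamma / Delta)) + 1):   # search over tau
        tau = i * Delta
        S = set()
        while True:                                    # greedy subroutine
            cands = [s for s in ground_set
                     if s not in S and independent(S | {s})]
            if not cands:
                break
            s = max(cands, key=lambda t: H(S | {t}, tau) - H(S, tau))
            S.add(s)
        if H(S, tau) > best_val:                       # keep the best pair
            best_S, best_tau, best_val = frozenset(S), tau, H(S, tau)
    return best_S, best_tau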
SGA gives a 1/(1 + k_f) approximation of the optimal with two approximation errors. The first comes from the searching separation Δ. We can make this error very small by setting Δ close to zero, at the expense of computational time. The second approximation error comes from the additive term

H^add = (k_f/(1 + k_f)) Γ (1/α − 1),    (8)
which depends on the curvature k_f and the risk level α. When the risk level α is very small, this error is very large, which means SGA may not give a good performance guarantee relative to the optimal. However, if the function H(S, τ) is close to modular in S (k_f → 0), this error is close to zero. Notably, when k_f → 0 and Δ → 0, SGA gives a near-optimal solution (H(S^G, τ^G) → H(S*, τ*)). Next, we prove Theorem 1. We start with the proof of the approximation ratio, then turn to the analysis of the computational time. We first present the necessary lemmas for the proof of the approximation ratio.

Lemma 4. Let S_i* be the optimal set for a specific τ_i that maximizes H(S, τ = τ_i). By sequentially searching for τ ∈ [0, Γ] with a separation Δ, we have

max_{i ∈ {0, 1, ..., ⌈Γ/Δ⌉}} H(S_i*, τ_i) ≥ H(S*, τ*) − Δ.    (9)
Next, we build the relationship between the set S_i^G selected by the greedy approach and the optimal set S_i* for τ_i.
Lemma 5. Let S_i^G and S_i* be the sets selected by the greedy algorithm and the optimal approach, respectively, for a fixed τ_i that maximizes H(S, τ = τ_i). We have

H(S_i^G, τ_i) ≥ (1/(1 + k_f)) H(S_i*, τ_i) − (k_f/(1 + k_f)) Γ (1/α − 1),    (10)

where k_f is the curvature of the function H(S, τ) in S under the matroid constraint I, and Γ is the upper bound on the parameter τ.
5 Simulations
We perform simulations to verify the performance of SGA in resilient mobility-on-demand and robust environment monitoring. Our code is available online.²
5.1 Resilient Mobility-on-Demand Under Arrival Time Uncertainty
We consider assigning R = 6 supply vehicles to N = 4 demand locations in a 2D environment. The positions of the demand locations and the supply vehicles are randomly generated within a square environment of side length 10 units. Denote the Euclidean distance between demand location i ∈ {1, ..., N} and vehicle position j ∈ {1, ..., R} as d_ij. Based on the discussion of the arrival efficiency distribution in Sect. 3.2, we assume each arrival efficiency e_ij has a uniform distribution with its mean proportional to the reciprocal of the distance between demand i and vehicle j, and with higher uncertainty when the mean efficiency is higher. Note that the algorithm can handle other, more complex distributions for arrival times; we use a uniform distribution for ease of exposition. Specifically, denote the mean of e_ij as ē_ij and set ē_ij = 10/d_ij. We model the arrival efficiency as uniformly distributed over e_ij = [ē_ij − ē_ij^{2.5}/max{ē_ij}, ē_ij + ē_ij^{2.5}/max{ē_ij}], where max{ē_ij} = max_{i,j} ē_ij, i ∈ {1, ..., N}, j ∈ {1, ..., R}. From the assignment utility function (Eq. 5), for any realization of y, say ỹ, f(S, ỹ) := Σ_{i=1}^{N} max_{j ∈ S_i} ẽ_ij, where ẽ_ij indicates one realization of e_ij. If all vehicle-demand pairs are independent from each other, y models a multi-independent uniform distribution. We sample n_s times from the underlying multi-independent uniform distribution of y and approximate the auxiliary function as H(S, τ) ≈ τ − (1/(n_s α)) Σ_ỹ (τ − Σ_{i=1}^{N} max_{j ∈ S_i} ẽ_ij)^+. We set the upper bound of the parameter τ as Γ = N max(ẽ_ij), i ∈ {1, ..., N}, j ∈ {1, ..., R}, to make sure Γ − f(S, y) ≥ 0. We set the searching separation for τ as Δ = 1. After receiving the pair (S^G, τ^G) from SGA, we plot the values of H(S^G, τ^G) and H(S^G, τ) with respect to different risk levels α in Fig. 5. Figure 5(a) shows that H(S^G, τ^G) increases as α increases, which suggests that SGA correctly maximizes H(S, τ). Figure 5(b) shows that H(S^G, τ) is concave or piecewise concave, consistent with the properties of H(S, τ). We plot the distribution of the assignment utility f(S^G, y) = Σ_{i=1}^{N} max_{j ∈ S_i^G} e_ij in Fig. 6 by sampling n_s = 1000 times from the underlying distribution of y. S_i^G is the set of vehicles
² https://github.com/raaslab/risk_averse_submodular_selection.git
Fig. 5. The value of H(S, τ ) by SGA with respect to different risk confidence levels.
Fig. 6. Distribution of the assignment utility f (S G , y) by SGA.
Fig. 7. Additive term in the approximation ratio with respect to risk level α.
assigned to demand i by SGA, and S^G = ⋃_{i=1}^{N} S_i^G. When the risk level α is small, vehicle-demand pairs with low efficiencies (equivalently, low uncertainties) are selected. This is because a small risk level indicates that the assignment is conservative and only willing to take a little risk; lower efficiency with lower uncertainty is assigned to avoid the risk induced by the uncertainty. In contrast, when α is large, the assignment is allowed to take more risk to gain more assignment utility, and vehicle-demand pairs with high efficiencies (equivalently, high uncertainties) are selected. Note that, when the risk level is close to zero, SGA may not give a correct solution because of a large approximation error (Fig. 7). However, this error decreases quickly to zero as the risk level increases. We also compare SGA with the CVaR measure to the greedy algorithm with the expectation (the risk-neutral measure [16]) in Fig. 8. Note that the risk-neutral measure is a special case of the CVaR measure with α = 1. We give an illustrative example of the assignment by SGA for two extreme risk levels, α = 0.1 and α = 1. When α is small (α = 0.1), the assignment is conservative and thus farther vehicles (with lower efficiency and lower uncertainty) are assigned to each demand (Fig. 8(a)).
Fig. 8. Assignments and utility distributions by SGA with two extreme risk level values. The red solid star represents the demand location. The black solid square represents the vehicle position. The line between the vehicle and the demand represents an assignment.
In contrast, when α = 1, nearby vehicles (with higher efficiency and higher uncertainty) are selected for the demands (Fig. 8(b)). Even though the mean value of the assignment utility distribution is larger at α = 1 than at α = 0.1, the selection is exposed to the risk of receiving lower utility, since the mean-std bar at α = 1 has a smaller left endpoint than the mean-std bar at α = 0.1 (Fig. 8(c)).
5.2 Robust Environment Monitoring
We consider selecting M = 4 locations from N = 8 candidate locations to place sensors in the environment (Fig. 4). Denote the area of the free space as v^free. The positions of the N candidate locations are randomly generated within the free space v^free. We calculate the visibility region v_i of each sensor by using the VisiLibity library [31]. Based on the sensor model discussed in Sect. 3.2, we set the probability of success of each sensor i as p_i = 1 − v_i/v^free, and model the working of each sensor as a Bernoulli distribution with probability p_i of success and 1 − p_i of failure. Thus the random polygon A_i monitored by each sensor follows the distribution {Pr[A_i = v_i] = p_i, Pr[A_i = 0] = 1 − p_i}. From the selection utility function (Eq. 6), for any realization of y, say ỹ, f(S, ỹ) = area(⋃_{i ∈ S} Ã_i), where Ã_i indicates one realization of A_i obtained by sampling y. If all sensors are independent of each other, we can model the working of a set of sensors as a multi-independent Bernoulli distribution. We sample n_s = 1000 times from the underlying multi-independent Bernoulli distribution of y and approximate the auxiliary function as H(S, τ) ≈ τ − (1/(n_s α)) Σ_ỹ (τ − area(⋃_{i ∈ S} Ã_i))^+. We set the upper bound of the parameter τ as the area of all the free space v^free and set the searching separation for τ as Δ = 1.
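One realization of this failure model can be drawn as in the sketch below (ours, pairing with the coverage_utility sketch in Sect. 3.2; names are hypothetical):

import numpy as np

def sample_failures(p_success, rng):
    # Sensor i fails with probability 1 - p_i, where p_i = 1 - v_i / v_free.
    return {i for i, p in enumerate(p_success) if rng.random() > p}

# e.g.: failed = sample_failures(p, np.random.default_rng(0))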
156
L. Zhou and P. Tokekar
Fig. 9. The value of H(S, τ ) by SGA with respect to different risk confidence levels.
maximizes H(S, τ ) with respect to the risk level α. Moreover, we plot functions H(S G , τ ) for several risk levels α in Fig. 9(b). Note that S G is computed by SGA at each τ . For each α, H(S G , τ ) shows the concavity or piecewise concavity of function H(S, τ ).
Fig. 10. Distribution of the selection utility f (S G , y) by SGA.
Fig. 11. Additive term in the approximation ratio with respect to risk level α.
Based on the S^G calculated by SGA, we sample n_s = 1000 times from the underlying distribution of y and plot the distribution of the selection utility f(S^G, y) = area(⋃_{i ∈ S^G} A_i) in Fig. 10. Note that, when the risk level α is small, sensors with smaller visibility regions and a higher probability of success should be selected: a lower risk level suggests a conservative selection, so sensors with a higher probability of success are selected to avoid the risk induced by sensor failure. In contrast, when α is large, the selection is willing to take more risk to gain more monitoring utility, so sensors with larger visibility regions and a lower probability of success should be selected. Figure 10 demonstrates this behavior except between α = 0.001 and α = 0.01. This is because when α is very small, the approximation error (Eq. 8) is very large, as shown in Fig. 11, and thus SGA may not give a good solution.
Fig. 12. Sensor selection and utility distributions by SGA with two extreme risk level values. The red solid circle represents the sensor selected by SGA.
We also compare SGA using the CVaR measure with the greedy algorithm using the expectation (the risk-neutral measure [2, Section 6.1]) in Fig. 12. In fact, the risk-neutral measure is equivalent to the case of CVaR_α(S) with α = 1. We give an illustrative example of the sensor selection by SGA for two extreme risk levels, α = 0.1 and α = 1. When the risk level α is small (α = 0.1), the selection is conservative and thus sensors with small visibility regions are selected (Fig. 12(a)). In contrast, when α = 1, the risk is neutral and the selection is more adventurous, and thus sensors with large visibility regions are selected (Fig. 12(b)). The mean-std bars of the selection utility distributions in Fig. 12(c) show that the selection utility at the expectation (α = 1) has a larger mean value than the selection at α = 0.1. However, the selection at α = 1 runs the risk of gaining lower utility since the left endpoint of the mean-std bar at α = 1 is smaller than the left endpoint of the mean-std bar at α = 0.1.
6 Conclusion and Discussion
We studied a risk-averse discrete submodular maximization problem. We provide the first positive results for discrete CVaR submodular maximization for selecting a set under matroidal constraints. In particular, we proposed the Sequential Greedy Algorithm and analyzed its approximation ratio and running time. We demonstrated two practical use-cases of the CVaR submodular maximization problem. Notably, our SGA works for any matroid constraint. In particular, the multiplicative approximation ratio can be improved to (1/k_f)(1 − e^{−k_f}) if we know that the constraint is a uniform matroid [32, Theorem 5.4]. The additive term in our analysis depends on α and can be large when the risk level α is very small. Our ongoing work is to remove this dependence on α, perhaps by designing another algorithm specifically for low risk levels. We note that if we use an optimal algorithm instead of the greedy algorithm as a subroutine, then the additive term disappears from the approximation guarantee. The algorithm also requires knowing Γ. We showed how to find Γ (or an upper bound for it)
for the two case studies considered in this paper. Devising a general strategy for finding Γ is part of our ongoing work. Our second line of ongoing work focuses on applying the risk-averse strategy to multi-vehicle routing, patrolling, and informative path planning in dangerous environments [33], and to mobility on demand with real-world data sets (2014 NYC Taxicab Factbook³).
References

1. O’Rourke, J.: Art Gallery Theorems and Algorithms, vol. 57. Oxford University Press, Oxford (1987)
2. Krause, A., Singh, A., Guestrin, C.: Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. J. Mach. Learn. Res. 9(Feb), 235–284 (2008)
3. Gerkey, B.P., Matarić, M.J.: A formal analysis and taxonomy of task allocation in multi-robot systems. Int. J. Robot. Res. 23(9), 939–954 (2004)
4. Vondrák, J.: Optimal approximation for the submodular welfare problem in the value oracle model. In: Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 67–74. ACM (2008)
5. Spletzer, J.R., Taylor, C.J.: Dynamic sensor planning and control for optimally tracking targets. Int. J. Robot. Res. 22(1), 7–20 (2003)
6. Tekdas, O., Isler, V.: Sensor placement for triangulation-based localization. IEEE Trans. Autom. Sci. Eng. 7(3), 681–685 (2010)
7. Tokekar, P., Isler, V., Franchi, A.: Multi-target visual tracking with aerial robots. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2014, pp. 3067–3072. IEEE (2014)
8. Singh, A., Krause, A., Guestrin, C., Kaiser, W.J.: Efficient informative sensing using multiple robots. J. Artif. Intell. Res. 34, 707–755 (2009)
9. Nemhauser, G.L., Wolsey, L.A., Fisher, M.L.: An analysis of approximations for maximizing submodular set functions-I. Math. Program. 14(1), 265–294 (1978)
10. Fisher, M.L., Nemhauser, G.L., Wolsey, L.A.: An analysis of approximations for maximizing submodular set functions-II. In: Polyhedral Combinatorics, pp. 73–87. Springer (1978)
11. Ding, H., Castañón, D.: Multi-agent discrete search with limited visibility. In: 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pp. 108–113. IEEE (2017)
12. Williams, R.K., Gasparri, A., Ulivi, G.: Decentralized matroid optimization for topology constraints in multi-robot allocation problems. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 293–300. IEEE (2017)
13. Wood, A.D., Stankovic, J.A.: Denial of service in sensor networks. Computer 35(10), 54–62 (2002)
14. Dames, P., Tokekar, P., Kumar, V.: Detecting, localizing, and tracking an unknown number of moving targets using a team of mobile robots. Int. J. Robot. Res. 36(13–14), 1540–1553 (2017)
15. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through a social network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 137–146. ACM (2003)
³ http://www.nyc.gov/html/tlc/downloads/pdf/2014_taxicab_fact_book.pdf.
16. Prorok, A.: Supermodular optimization for redundant robot assignment under travel-time uncertainty. arXiv:1804.04986 (2018)
17. Pflug, G.Ch.: Some remarks on the value-at-risk and the conditional value-at-risk. In: Probabilistic Constrained Optimization, pp. 272–281. Springer (2000)
18. Rockafellar, R.T., Uryasev, S.: Optimization of conditional value-at-risk. J. Risk 2, 21–42 (2000)
19. Yang, F., Chakraborty, N.: Algorithm for optimal chance constrained linear assignment. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 801–808. IEEE (2017)
20. Yang, F., Chakraborty, N.: Algorithm for optimal chance constrained knapsack problem with applications to multi-robot teaming. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1043–1049. IEEE (2018)
21. Markowitz, H.: Portfolio selection. J. Finance 7(1), 77–91 (1952)
22. Morgan, J.P.: RiskMetrics Technical Document. Morgan Guaranty Trust Company of New York, New York (1996)
23. Majumdar, A., Pavone, M.: How should a robot assess risk? Towards an axiomatic theory of risk in robotics. In: Robotics Research, pp. 75–84. Springer (2020)
24. Rockafellar, R.T., Uryasev, S.: Conditional value-at-risk for general loss distributions. J. Bank. Finance 26(7), 1443–1471 (2002)
25. Maehara, T.: Risk averse submodular utility maximization. Oper. Res. Lett. 43(5), 526–529 (2015)
26. Ohsaka, N., Yoshida, Y.: Portfolio optimization for influence spread. In: Proceedings of the 26th International Conference on World Wide Web, pp. 977–985. International World Wide Web Conferences Steering Committee (2017)
27. Wilder, B.: Risk-sensitive submodular optimization. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence (2018)
28. Artzner, P., Delbaen, F., Eber, J.-M., Heath, D.: Coherent measures of risk. Math. Finance 9(3), 203–228 (1999)
29. Zhou, L., Tokekar, P.: An approximation algorithm for risk-averse submodular optimization. arXiv:1807.09358 (2018)
30. Ackerman, E., Strickland, E.: Medical delivery drones take flight in East Africa. IEEE Spectr. 55(1), 34–35 (2018)
31. Obermeyer, K.J., Contributors: VisiLibity: a C++ library for visibility computations in planar polygonal environments (2008). http://www.VisiLibity.org
32. Conforti, M., Cornuéjols, G.: Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete Appl. Math. 7(3), 251–274 (1984)
33. Jorgensen, S., Chen, R.H., Milam, M.B., Pavone, M.: The team surviving orienteers problem: routing teams of robots in uncertain environments with survival constraints. Auton. Robots 42(4), 927–952 (2018)
Pushing Fast and Slow: Task-Adaptive Planning for Non-prehensile Manipulation Under Uncertainty

Wisdom C. Agboh(B) and Mehmet R. Dogar

School of Computing, University of Leeds, Leeds LS2 9JT, UK
{scwca,m.r.dogar}@leeds.ac.uk

Abstract. We propose a planning and control approach to physics-based manipulation. The key feature of the algorithm is that it can adapt to the accuracy requirements of a task, by slowing down and generating “careful” motion when the task requires high accuracy, and by speeding up and moving fast when the task tolerates inaccuracy. We formulate the problem as an MDP with action-dependent stochasticity and propose an approximate online solution to it. We use a trajectory optimizer with a deterministic model to suggest promising actions to the MDP, to reduce the computation time spent on evaluating different actions. We conducted experiments in simulation and on a real robotic system. Our results show that with a task-adaptive planning and control approach, a robot can choose fast or slow actions depending on the task accuracy and uncertainty level. The robot makes these decisions online and is able to maintain high success rates while completing manipulation tasks as fast as possible.

Keywords: Manipulation & grasping · Motion and path planning
1 Introduction

We propose a planning and control algorithm for non-prehensile manipulation. The key feature of our algorithm is task-adaptivity: the planner can adapt to the accuracy requirements of a task, performing fast or slow pushes. For example, in Fig. 1 (top), the robot is pushing an object on a narrow strip. The task requires high accuracy during pushing; otherwise, the object can fall down. The controller therefore generates slow pushing actions that make small but careful progress to the goal pose of the object. In Fig. 1 (bottom), however, the object is on a wide table and the goal region for the object is large (circle drawn on the table). In this case, the controller generates only a small number of fast pushing actions to reach the goal quickly: even if this creates more uncertainty about the object's pose after each action, the task can still be completed successfully. We present a controller that can adapt to tasks with different accuracy requirements, such as in these examples.

There has been significant recent interest in non-prehensile pushing-based manipulation. Most existing works use motion planning and open-loop execution to address the problem of generating a sequence of actions to complete a non-prehensile manipulation task [7, 11, 16, 17, 23]. Others have developed closed-loop approaches. Ruiz-Ugalde et al. [26] proposed a compact predictive model for sliding objects and a feedback
Fig. 1. Task-adaptive pushing with 21 slow actions for a high accuracy task (top) and a single fast action for a low accuracy task (bottom).
controller to push an object to a desired location. Hogan and Rodriguez [9] proposed a model predictive controller that uses integer programming to handle the different contact modes associated with the pusher-slider system. Arruda et al. [2] considered the task of pushing an object to a goal region while avoiding regions of high predictive uncertainty through model predictive path integral control. We take a similar model predictive control (MPC) approach and propose a closed-loop planning and control algorithm for pushing tasks.

A common feature of existing work is the reliance on the quasi-static model of pushing [10, 21]. While one reason for the popularity of the quasi-static model may be the simpler analytic equations of motion it enables one to derive, another reason is the slow nature of quasi-static interactions, which keeps the uncertainty during pushing tightly bounded and therefore easier to control accurately. However, accuracy is not the main criterion for every task, as we illustrate in Fig. 1. Fast motions, even if inaccurate, may be desired during some tasks. We humans also adapt our actions to the task (Fitts's law [8]). Imagine reaching into a fridge shelf that is crowded with fragile objects, such as glass jars and containers. You move slowly and carefully. However, if the fridge shelf is almost empty, only with a few plastic containers that are difficult to break, you move faster with less care.

The major requirements to build a task-adaptive planner/controller are:

1. The planner must consider a variety of actions (different pushing speeds): The robot should not be limited to moving at quasi-static speeds. It must consider dynamic actions wherever possible to complete a given task as fast as possible.
2. The planner must consider action-dependent uncertainty: Different actions can induce different amounts of uncertainty into the system. For example, pushing an object for a longer distance (or equivalently pushing faster for a fixed amount of time) would induce more uncertainty than pushing a short distance (or equivalently pushing slower for a fixed amount of time) [31].
One way to build such a controller is to model the problem as a Markov Decision Process (MDP) with stochastic dynamics, where the stochasticity is action-dependent. If this MDP is solved for an optimal policy under a cost that includes the time to reach the goal, the resulting policy will use fast actions when it can, and fall back to slow actions for tasks that require higher accuracy. In this paper, we model the problem as an MDP. However, we do not search for a globally optimal policy, as this would be prohibitively computationally expensive. Instead, we solve the MDP online [15, 24] with an approximate solution. Even in this online setting, evaluating the value of all possible actions (including actions of a wide variety of speeds) proves computationally expensive, since the cost of physics-based predictions is high. Therefore, instead of evaluating all possible actions at any given state, we first use a fast trajectory optimizer to suggest a reduced set of promising actions, i.e., actions that are known to drive the system to the goal under the deterministic setting. We then evaluate these actions under uncertainty to pick the best one. Our specific contributions include a task-adaptive online solution to the stochastic MDP for pushing-based manipulation and a trajectory optimizer to generate actions for evaluation under the MDP setting. Additionally, we compare our task-adaptive planner with a standard model predictive control approach using only slow actions, for high and low accuracy tasks under different levels of uncertainty. We show that our approach achieves higher success rates in significantly smaller amounts of time, especially for tasks that do not require high accuracy. Finally, we implement our approach on a real robotic system for tasks requiring different accuracy levels and compare it with standard MPC. Results can be found in the video at https://youtu.be/8rz_f_V0WJA.
2 Problem Formulation

We consider the problem where a robot must plan a sequence of non-prehensile actions to take an environment from an initial configuration to a desired goal configuration. We consider two task categories: in the pushing task the goal is to push a target object into a goal region; and in the grasping in clutter task the goal is to bring a target object, among other objects, into the robot's hand. Our scenes contain D dynamic objects. q_i refers to the full pose of each dynamic object, for i = 1, ..., D. We assume a flat working surface, and the robot is not allowed to drop objects off the edges. Our robot is planar with a 1-DOF gripper. The robot's configuration is defined by a vector of joint values q_R = {θ_x, θ_y, θ_rotation, θ_gripper}. We represent the complete state of our system as x_t at time t. This includes the configuration and velocity of the robot and all dynamic objects: x_t = {q_R, q_1, ..., q_D, q̇_R, q̇_1, ..., q̇_D}. Our control inputs are velocities u_t = q̇_R applied to the robot's joints. We then define the stochastic discrete-time dynamics of our system as:

x_{t+1} = f(x_t, u_t) + ζ(u_t)    (1)
where f is a deterministic function that describes the evolution of state x_t given the action u_t. We induce stochasticity in the system dynamics through ζ(u_t) ∝ ||u_t||, which is proportional to the magnitude of the action u_t. When we push an object over a long distance, there is a large number of interactions/contacts, especially in
cluttered environments. This implies that the uncertainty in the resulting state at the end of a long push should be larger than that for a shorter push. We assume an initial state of the system x_0. Our goal is to generate a sequence of actions for the robot such that the desired final goal configuration of the environment is reached as quickly as possible without dropping objects off the edge of our working surface. In what follows, U = {u_0, u_1, ..., u_{n−1}} denotes a sequence of control signals of fixed duration Δ_t applied over n time steps, and we use brackets to refer to the control at a certain time step, i.e., U[t] = u_t. Similarly, X is a sequence of states. To build a task-adaptive controller, we formulate our problem as an MDP, and we provide an approximate solution to it. An MDP is defined by a tuple ⟨S, A, P, L⟩, where S is the set of states, A is the set of actions, P is the probabilistic transition function, and L defines the costs. In our problem, S is given by all possible values of x_t. Similarly, A is given by all possible values of u_t, and P can be computed using the stochastic transition function in Eq. 1. The optimal policy for an MDP is given by:

π*(x_t) = argmin_{u_t ∈ A} [ L̄(x_t, u_t) + γ · ∫_S P(x_{t+1} | x_t, u_t) · V*(x_{t+1}) dx_{t+1} ]    (2)

where 0 < γ < 1 is the discount factor, and V* is the optimal value function. An online one-step lookahead approximate solution to the MDP problem can be found by sampling and evaluating the average value over samples as in [15, 24]:

π̂(x_t) = argmin_{u_t ∈ A} [ L̄(x_t, u_t) + (1/Q) · Σ_{x_{t+1} ∈ S(x_t, u_t, Q)} V̂(x_{t+1}) ]    (3)
where S(x_t, u_t, Q) is the set of Q samples found by stochastically propagating (x_t, u_t), and V̂ is an approximation of the value function. For our problem, to compute the cost L̄(x_t, u_t), we use a cost function L which also takes into account the next state x_{t+1}:

L(x_t, x_{t+1}, u_t) = Σ_{i=1}^{D} { w_e · e^{k·d_P^i} + w_s · (x_{t+1}^i − x_t^i)² } + k_act    (4)

The first term in the cost, which we call the edge cost, penalizes pushing an object close to the table's boundaries or static obstacles. We show the edge cost in Fig. 4, where we define a safe zone smaller than the table's boundaries. If an object is pushed out of this safe region as a result of an action between t and t + 1, we compute the pushed distance d_P^i. Also note that k is a constant term and no edge costs are computed for objects in the safe zone. The second term is the environment disturbance cost, which penalizes moving dynamic objects away from their current states. The third term, k_act, is a constant cost incurred for each action taken by the robot. We use w_e and w_s to represent weights for the edge and environment disturbance costs, respectively. Then, we compute L̄(x_t, u_t) using the same set of Q samples:

L̄(x_t, u_t) = (1/Q) · Σ_{x_{t+1} ∈ S(x_t, u_t, Q)} L(x_t, x_{t+1}, u_t)
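As an illustration of how Eqs. 3 and 4 combine, the sampled one-step lookahead for a single candidate action can be written as follows. This is a minimal sketch under our own naming: step stands in for one stochastic evaluation of Eq. 1, running_cost for Eq. 4, and value_est for the approximate value function V̂; none of these names are from the paper.

import numpy as np

def lookahead_value(x0, u, step, running_cost, value_est, Q=8):
    # Propagate (x0, u) through the stochastic dynamics Q times and
    # average the immediate cost plus the approximate value (Eq. 3).
    totals = []
    for _ in range(Q):
        x1 = step(x0, u)                  # one stochastic rollout of Eq. 1
        totals.append(running_cost(x0, x1, u) + value_est(x1))
    return np.mean(totals)

# The controller then picks the minimizer over the candidate actions:
# best_u = min(candidates, key=lambda u: lookahead_value(x0, u, step, running_cost, value_est))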
Fig. 2. First column: Initialization of our task-adaptive approach with control sequences including fast (top) and slow (bottom) actions. Second column: Stochastic trajectory optimization of the initial candidate control sequences. Last column: Action evaluation under uncertainty through sampling.
This solution requires propagating Q samples for every possible action u_t to find the one with the minimum total cost. Performing this for all actions u_t ∈ A is not feasible for our purposes for a variety of reasons. First, we are interested in actions that span a wide range of speed profiles (i.e., fast and slow), which makes our potential action set large; second, each propagation in our domain is a physics simulation, which is computationally expensive; and third, our goal is closed-loop pushing behaviour at close to real-time speeds.
3 Proposed Approach

Instead of considering a large action set in our online MDP solution (Eq. 3), we propose to use a small set of promising actions, including both fast and slow actions. We identify such a set of actions using a trajectory optimizer based on the deterministic dynamics function f of the system. Note that our approach does not discretize the action space or the state space a priori. We adaptively sample the action space using a trajectory optimizer to find high-value actions to consider at a given current state. We also get stochastic next-state samples by applying these actions through a physics simulator. In Algorithm 1, we present our online approximate solution to the MDP. Consider the scene in Fig. 2, where we have a planar gripper and an object. Our task is to push the object to a desired goal location (the red spot) while avoiding the rectangular black obstacle. We begin by generating N candidate action sequences
Algorithm 1: Online MDP solver
Input: x_0: initial state
Parameters: Q: number of stochastic samples; N: number of initial candidate control sequences; n_min: minimum number of control actions in a control sequence; n_max: maximum number of control actions in a control sequence
1:  {U^0, ..., U^{N−1}} ← GetActionSequences(x_0, n_min, n_max, N)
2:  while task not complete do
3:      {U^{0*}, ..., U^{(N−1)*}}, {V̂^0, ..., V̂^{N−1}} ← GetOptActionSequences(x_0, {U^0, ..., U^{N−1}})
4:      for i ← 0 to N−1 do
5:          V^i = 0
6:          for each sample in Q do
7:              x_1 = StochasticExecution(x_0, U^{i*}[0])
8:              V^i = V^i + L(x_0, x_1, U^{i*}[0]) + V̂^i[0]
9:          V^i = V^i / Q
10:     i_min = argmin_i V^i
11:     x_1 ← execute U^{i_min}[0]
12:     check task completion
13:     x_0 ← x_1
14:     {U^0, ..., U^{N−1}} ← {GetActionSequences(n_min, n_max, N−1), U^{i_min}[1 : n−1]}
{U^0, ..., U^{N−1}} to the goal by using the GetActionSequences procedure (line 1). The number of actions in each sequence varies between n_min and n_max, and each action is of fixed duration Δ_t. In the example task (Fig. 2), we show the candidate action sequences in the first column, where N = 2, n_min = 2, and n_max = 4. Since each action is of fixed duration, the set of action sequences contains both fast (top) and slow (bottom) actions. Details of how we generate these candidate control sequences are explained in Sect. 4. Using a trajectory optimizer (Sect. 5), the procedure GetOptActionSequences returns N optimized control sequences {U^{0*}, ..., U^{(N−1)*}} and an approximation of the value function {V̂^0, ..., V̂^{N−1}} along the optimal trajectories. We visualize the optimized trajectories for the example task in the second column. We seek a one-step lookahead solution to the MDP; hence, we take the first action U^{i*}[0] from each of the optimized control sequences. Our task is now to select the best among these N actions even under uncertainty. We apply each action Q times to the stochastic state transition function in Eq. 1 to yield Q next-state samples (line 7). More specifically, using our system dynamics model, we apply the controls (velocities) to the robot for the control duration Δ_t, and thereafter we wait for a fixed extra time t_rest for objects to come to rest before returning the next state and computing the cost. We can see the Q samples in the third column (Fig. 2) for the example pushing task. Thereafter, on line 8, for each sample we add the immediate cost L to the approximate value V̂^i[0] for the resulting state. We use the same approximate value for all Q samples. We then compute an average value for each action (line 9), select the best one, and execute it (line 11). If the task is not yet complete, we repeat the whole process, but also re-use the remaining portion of the best control sequence U^{i_min}[1 : n−1] from the current iteration. This algorithm chooses slow, low-velocity actions for tasks that require high accuracy but faster actions for tasks that allow inaccuracies.
In the example pushing task (Fig. 2, third column), we see a wider distribution in the resulting state for the fast action (top) compared to the slower action (bottom), as a result of our uncertainty model. Such a wide distribution in the resulting state increases the probability of undesired events, especially in high accuracy tasks. For example, we see that some samples for the fast action result in collisions between the robot and the obstacle. This implies high costs relative to the slower action; hence, our planner chooses the slow action in this case, executes it, and starts the whole process again from the resulting state.
Algorithm 2: GetActionSequences(x_0, n_min, n_max, N)
Output: {U^0, ..., U^{N−1}}: a set of candidate action sequences
Parameters: Δ_t: control duration for each action in an action sequence
1:  for k ← 0 to N−1 do
2:      n_k = k · (n_max − n_min) / (N−1) + n_min
3:      U^k[0 : n_k−1] = { (distance to goal) / (n_k · Δ_t) }
4:  return {U^0, ..., U^{N−1}}

Algorithm 3: GetOptActionSequences(x_0, {U^0, ..., U^{N−1}})
Output: {U^{0*}, ..., U^{(N−1)*}}: optimized set of action sequences; {V̂^0, ..., V̂^{N−1}}: approx. value function along the N optimal trajectories
1:  for i ← 0 to N−1 do
2:      U^{i*}, V̂^i ← TrajectoryOptimization(x_0, U^i)
3:  return {U^{0*}, ..., U^{(N−1)*}}, {V̂^0, ..., V̂^{N−1}}
4 Generating a Variety of Actions

At each iteration of our online MDP solver, we provide N actions that are evaluated under uncertainty. First, using the GetActionSequences procedure in Algorithm 2, we generate N candidate action sequences where the number of actions in an action sequence increases linearly from n_min to n_max (line 2). On line 3, each of the action sequences is set to a straight-line, constant-velocity profile to the goal. This is a simple approach; other, more complicated velocity profiles could also be used here. Furthermore, in the GetOptActionSequences procedure (Algorithm 3), we use these candidate action sequences to initialize a stochastic trajectory optimization algorithm (Sect. 5). The algorithm quickly finds and returns a locally optimal solution for each of the candidate control sequences. It also returns an approximation of the value function along the N optimal trajectories. We use this approximate value while evaluating actions in our online MDP solver.
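A minimal Python rendering of Algorithm 2 is given below; it is our own sketch rather than the authors' code, and dist_to_goal (the displacement vector from the pushed object to the goal) is an assumed input.

import numpy as np

def get_action_sequences(dist_to_goal, n_min, n_max, N, dt):
    # N straight-line, constant-velocity candidate sequences whose lengths
    # grow linearly from n_min to n_max actions (Algorithm 2).
    sequences = []
    for k in range(N):
        n_k = round(k * (n_max - n_min) / (N - 1)) + n_min
        u = np.asarray(dist_to_goal) / (n_k * dt)  # constant velocity per action
        sequences.append([u] * n_k)                # fewer actions => faster push
    return sequences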
5 Trajectory Optimization

Trajectory optimization involves finding an optimal control sequence U* for a planning horizon n, given an initial state x_0, an initial candidate control sequence U, and an objective J, which can be written as:

J(X, U) = Σ_{t=0}^{n−1} L(x_t, x_{t+1}, u_t) + w_f · L_f(x_n)    (5)
Algorithm 4: Stochastic Trajectory Optimization
Input: x_0: initial state; U: candidate control sequence containing n actions
Output: U*: optimal control sequence
Parameters: K: number of noisy trajectory rollouts; ν: sampling variance vector; C_thresh: success definition in terms of cost; I_max: maximum number of iterations
1:  X, C ← TrajectoryRollout(x_0, U)
2:  while I_max not reached and Sum(C) > C_thresh do
3:      for k ← 0 to K−1 do
4:          δU^k ← N(0, ν)                              // random control sequence variation
5:          U^k = U + δU^k
6:          X^k, C^k ← TrajectoryRollout(x_0, U^k)
7:      U* ← UpdateTrajectory(U, {δU^0, ..., δU^{K−1}}, {C^0, ..., C^{K−1}})
8:      X*, C* ← TrajectoryRollout(x_0, U*)
9:      if Sum(C*) < Sum(C) then
10:         U ← U*, X ← X*, C ← C*
11: for j ← 0 to n−1 do
12:     V̂[j] = Σ_{h=j}^{n−1} C*[h]                      // approx. value function for state x*_j along the optimal trajectory
13: return U*, V̂
J is obtained by applying the control sequence U starting from a given initial state, and includes the sum of running costs L and a final cost L_f. We use the constant w_f to weight the terminal cost with respect to the running cost. We consider a deterministic environment defined by the state transition function x_{t+1} = f(x_t, u_t). This is a constraint that must be satisfied at all times. The output of trajectory optimization is then the minimizing control sequence:

U* = argmin_U J(X, U)    (6)
In this work, we consider stochastic trajectory optimization methods such as STOMP [13] and MPPI [30]. With parallel rollouts on multiple cores of a PC, these methods show impressive speed. They also easily accept arbitrary cost functions that
do not have to be differentiable. In contrast with sampling-based methods such as RRTs and PRMs [16, 23], optimization approaches are able to produce lower-cost trajectories within a time limit even if they do not take the system to the desired goal state. These benefits make stochastic trajectory optimization very attractive. In this work, we propose Algorithm 4, which adapts the STOMP algorithm [13] for non-prehensile object manipulation. We begin with an initial candidate control sequence U and iteratively seek lower-cost trajectories (lines 2–10) until the cost reaches a threshold or until the maximum number of iterations is reached (line 2). We add random control sequence variations δU^k to the candidate control sequence to generate K new control sequences at each iteration (line 5). Thereafter, on line 6, we do a trajectory rollout for each sample control sequence using the TrajectoryRollout procedure in Algorithm 5. It returns the corresponding state sequence X and costs C calculated for each state along the resulting trajectory, i.e., C[t] is the cost of applying action u_t in state x_t. After generating K sample control sequences and their corresponding costs, the next step is to update the candidate control sequence using the UpdateTrajectory procedure. One way to do this is a straightforward greedy approach where the minimum-cost trajectory is selected as the update:

k* = argmin_k Σ_{t=0}^{n−1} C^k[t],    U* = U + δU^{k*}    (7)

Algorithm 5: TrajectoryRollout
Input: x_0: initial state; U: control sequence with n actions
Output: X: state sequence; C: costs along the trajectory
1:  for t ← 0 to n−1 do
2:      x_{t+1} = f(x_t, u_t)
3:      C[t] = L(x_t, x_{t+1}, u_t)         // calculate cost using Eq. 4
4:      if t == n−1 then
5:          C[t] = C[t] + L_f(x_{t+1})      // add final cost
6:  return X, C
Another approach is a cost-weighted convex combination similar to [30]:

U*[t] = U[t] + ( Σ_{k=0}^{K−1} exp(−(1/λ) · C^k[t]) · δU^k[t] ) / ( Σ_{k=0}^{K−1} exp(−(1/λ) · C^k[t]) )    (8)

where λ is a parameter that regulates the sensitivity of the exponentiated cost.
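Both update rules translate into a few lines of NumPy; this illustrative version is ours, with U of shape (n, m), dUs of shape (K, n, m), and Cs of shape (K, n).

import numpy as np

def update_trajectory(U, dUs, Cs, greedy=True, lam=1.0):
    U, dUs, Cs = np.asarray(U), np.asarray(dUs), np.asarray(Cs)
    if greedy:                                   # Eq. 7: keep the lowest-cost rollout
        return U + dUs[np.argmin(Cs.sum(axis=1))]
    w = np.exp(-Cs / lam)                        # Eq. 8: cost-weighted combination
    w = w / w.sum(axis=0)                        # normalize over the K rollouts, per timestep
    return U + np.einsum('kt,ktm->tm', w, dUs)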
In our experiments, for a small number of noisy trajectory rollouts K (e.g., K = 8), the greedy update performs better; hence, we use the greedy update in all our experiments. Once the trajectory update step is complete, we update the candidate control sequence only if the new sequence has a lower cost (line 9, Algorithm 4). The trajectory optimization algorithm then returns the locally optimal control sequence and an
approximation of the value function for each state along the trajectory, where the value function is approximated using the sum of costs starting from that state. The cost terms for the state-action sequences in this algorithm are equal to the running costs in Eq. 4, with the addition of a terminal cost on the final state depending on the task. The terminal cost for the pushing task is given by:

L_f = 0 if R_o − R_g < 0, and L_f = (R_o − R_g)² if R_o − R_g > 0,

where R_o is the distance between the pushed object and the center of a circular goal region of radius R_g. The terminal cost term for the task of grasping in clutter is given by: L_f = d_T² + w_φ · φ_T². We show how the distance d_T and the angle φ_T are computed in Fig. 3: first, create a vector from a fixed point in the gripper to the target object; then d_T is the length of this vector and φ_T is the angle between the forward direction of the gripper and the vector. We use w_φ to weight angles relative to distances.
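For illustration, the two terminal costs translate directly into code. The sketch below is ours, and the default weight w_phi is a placeholder rather than the value used in the experiments.

def terminal_cost_push(R_o, R_g):
    # Zero inside the circular goal region, quadratic in the overshoot outside.
    return max(0.0, R_o - R_g) ** 2

def terminal_cost_grasp(d_T, phi_T, w_phi=1.0):
    # Distance and heading error of the gripper relative to the target object.
    return d_T ** 2 + w_phi * phi_T ** 2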
Fig. 3. Goal cost terms
Fig. 4. Edge cost terms
6 Baseline Approach

We implement a standard model predictive control algorithm (MPC) as a baseline approach. It involves repeatedly solving a finite-horizon optimal control problem using the stochastic trajectory optimizer presented in Algorithm 4, and proceeds as follows: optimize a finite-horizon control sequence, execute the first action, get the resulting state of the environment, and then re-optimize to find a new control sequence. When re-optimizing, we warm-start the optimization with the remaining portion of the control sequence from the previous iteration, such that optimization now becomes faster. We initialize the trajectory optimizer for standard MPC with actions at the quasi-static speed. In addition, we propose another baseline approach to compare against in this work: uncertainty-aware model predictive control (UAMPC). This is a version of our online MDP solver where only low-speed actions are considered. Specifically, all the candidate control sequences are generated with the maximum number of actions, i.e., n_min ← n_max.
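The baseline loop can be summarized as follows; this is a schematic sketch with our own stand-in callables (optimize wraps Algorithm 4, execute applies one action on the stochastic system, done tests task completion), and re-padding of the control sequence when it runs short is omitted.

def standard_mpc(x0, init_controls, optimize, execute, done):
    x, U = x0, init_controls          # initialized at quasi-static speeds
    while not done(x):
        U = optimize(x, U)            # re-optimize the finite-horizon sequence
        x = execute(x, U[0])          # apply only the first action
        U = U[1:]                     # warm-start the next re-optimization
    return x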
7 Experiments

We call our planning approach task-adaptive model predictive control (TAMPC). We verify how well our approach is able to handle uncertainty and adapt to varying tasks. First, we compare the performance of TAMPC with a standard model predictive control (MPC) approach. Here we hypothesize that TAMPC will complete a given pushing task within a significantly shorter period of time and will be able to adapt to different tasks, maintaining a high success rate under varying levels of uncertainty. Next, we compare the performance of our approach with uncertainty-aware MPC (UAMPC). Here we hypothesize that UAMPC will have a similar success rate but will take a longer amount of time to complete the task in comparison with TAMPC. We conduct experiments in simulation and on a real robotic system. We consider the tasks of pushing an object to a goal region and grasping an object in clutter. Given an environment for planning, we create two instantiations:
Fig. 5. Success rate and total elapsed time versus uncertainty level for low and high accuracy tasks: (a) success rates for low accuracy tasks; (b) success rates for high accuracy tasks; (c) total elapsed time for low accuracy tasks; (d) total elapsed time for high accuracy tasks.
Planning environment: The robot generates plans in the simulated planning environment. The trajectory optimizer (Algorithm 4) uses deterministic physics during planning, and our online MDP solver uses stochastic physics to evaluate actions.

Execution environment: Here, the robot executes actions and observes the state evolution. It is the physical world for real-robot experiments, but it is simulated for simulation experiments. The execution environment is stochastic.

For the planning environment, and for the execution environment when it is simulated, we use a physics engine, MuJoCo [28], to model the deterministic state transition function f in Eq. 1. We model stochasticity in the physics engine by adding Gaussian noise on the velocities {q̇_R, q̇_1, ..., q̇_D} of the robot and objects at every simulation time step:

{q̇'_R, q̇'_1, ..., q̇'_D} = {q̇_R, q̇_1, ..., q̇_D} + μ,    μ ∼ N(0, β(u_t))    (9)
where N is the Gaussian distribution and β is an action-dependent variance function. We create a linear model for the variance function, with constant slope b:

β(u_t) = b · ||u_t||    (10)
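A sketch of this noise model (ours, not the simulation code; b = 0.075 corresponds to the medium uncertainty level used later) is:

import numpy as np

def noisy_velocities(qdots, u, b=0.075, rng=None):
    # Eqs. 9-10: zero-mean Gaussian noise on all robot/object velocities,
    # with variance growing linearly in the action magnitude.
    rng = rng or np.random.default_rng()
    var = b * np.linalg.norm(u)                 # beta(u_t) = b * ||u_t||
    return qdots + rng.normal(0.0, np.sqrt(var), size=np.shape(qdots))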
In our simulation experiments, the uncertainty level refers to the degree of stochasticity dictated by the slope b of the variance function β used to generate the Gaussian noise μ injected at every simulation time step in Eq. 9.

7.1 Push Planning Simulation Experiments

We present a high accuracy task in Fig. 1 (top). It is made up of a thin strip and a small goal region. We also define a low accuracy task in Fig. 1 (bottom), which is a much larger table with a wider goal region. We create 200 such planning environments for each of the high and low accuracy tasks. For each environment:
Fig. 6. Push planning in a changing environment (top) using a single fast push initially and then slow pushes later on due to the narrow strip. For the L-shaped environment (bottom), the robot executes many actions to successfully navigate the edge.
1. We randomly select the shape (box or cylinder) of the pushed object.
2. For each object, we randomly¹ select shape dimensions (radius and height for the cylinder, extents for the boxes), mass, and coefficient of friction.
3. We randomly² select a position on the working surface for the pushed object.

We create four uncertainty levels: no uncertainty, low uncertainty, medium uncertainty and high uncertainty. For the no-uncertainty case, no extra noise was added to the physics engine. For low, medium and high levels of uncertainty, b = {0.05, 0.075, 0.1}, respectively. We test the different planning and control approaches and specify a timeout of 3 min including all planning, re-planning and execution.

Success Rates: We declare success when the robot is able to push an object to the target region without dropping it off the edge of the table within the specified time limit. We plot the results in Fig. 5a and b. For the low accuracy push planning task (Fig. 5a), TAMPC and UAMPC were able to maintain a 100% success rate, while MPC showed a slight decrease in success rates as uncertainty grew. For the high accuracy pushing task (Fig. 5b), TAMPC and UAMPC were also able to maintain a good average success rate. MPC, on the other hand, maintains a poor success rate. The major reason for this is uncertainty.

Total Time: The total time in our experiments includes all planning and execution time. Figure 5c and d show the average over 200 scenes with the 95% confidence interval of the mean. For the low accuracy task, our TAMPC planner is able to achieve the goal in under 5 s (Fig. 5c), while UAMPC and MPC took significantly more time to complete the task. This clearly shows that our method is able to generate successful fast actions while maintaining a high success rate. For the high accuracy task (Fig. 5d), our planner is able to generate as many small actions as needed as the uncertainty grew.
Fig. 7. Grasping in clutter: The robot uses fast actions initially but chooses slower actions as it gets closer to the goal object near the edge of the table.

¹ The uniform range used for each parameter is given here. Box x-y extents: [0.05 m, 0.075 m]; box height: [0.036 m, 0.05 m]; cylinder radius: [0.04 m, 0.07 m]; cylinder height: [0.04 m, 0.05 m]; mass: [0.2 kg, 0.8 kg]; coef. fric.: [0.2, 0.6].
² The random position for the pushed object is sampled from a Gaussian with a mean at the lower end of the table (0.1 m from the edge of a 0.6 m long table, along the center axis) and a variance of 0.01 m.
Hence, it was able to maintain a high success rate and still complete the task within a very small amount of time in comparison with the baseline approach. Furthermore, we also test the adaptive behavior of our approach for the environments in Fig. 6. In the changing environment (Fig. 6, top), the robot begins with a fast push due to a large initial area. Thereafter, it naturally switches to slow pushes on the thin strip to complete the task. For pushing in the L-shaped environment (Fig. 6, bottom), the robot generally pushes slowly. However, it spends a lot of time navigating the corner.

7.2 Grasping in Clutter Simulation Experiments

We conducted simulation experiments for grasping in clutter in scenes similar to Fig. 7. Our scenes are randomly generated, containing boxes and cylinders. In addition, our robot now has four control inputs (including the gripper). We tested our task-adaptive planner in clutter to observe how the planner adapts given different environment configurations. We see that the robot manipulates clutter and is able to grasp the target object. An example scene is shown in Fig. 7, where the aim is to grasp the target object in green without pushing any other objects off the edge of the table. The robot initially begins with fast actions to push obstacles out of the way. However, as the robot gets closer to the target object, it chooses slower actions due to a higher probability of task failure in that region.

7.3 Real Robot Experiments

In the real robot experiments we use a UR5 robot with a Robotiq 2-finger gripper. We restrict the motion of the gripper to a plane parallel to the working surface such that we have a planar robot. We use OpenRave [6] to find kinematics solutions at every time step. For the push planning experiments, the gripper is completely open such that the robot receives three control inputs u_t = (θ̇_x, θ̇_y, θ̇_rotation) at every time step. We use a medium uncertainty level to model the real-world stochasticity. We place markers on the pushed object and track its full pose with a motion capture system (OptiTrack). We manually replicated three execution worlds for each task accuracy level from the randomly generated environments we created during the push planning simulation experiments, and we tested our planners in these environments. We show snapshots from our real robot experiments. In Fig. 8 (top), we have a low task-accuracy environment where the standard MPC approach is successful after 20 actions. However, by using a single dynamic push in Fig. 1 (bottom), our task-adaptive control approach is able to complete the push planning task in under 2 s. Moreover, for the high task-accuracy problem, MPC was unable to push the target object to the desired goal location (Fig. 8 (bottom)). It executes actions without reasoning about uncertainty and pushes the goal object off the edge. Our task-adaptive controller was able to consider uncertainty while generating small pushes (Fig. 1 (top)) to complete the task. These results can be found in the accompanying video at https://youtu.be/8rz_f_V0WJA.
Fig. 8. MPC using a large number of actions to complete a low accuracy level task (top), and causing the pushed object to fall off for a high accuracy level task (bottom).
8 Related Work

Robots are becoming increasingly capable of performing complex manipulation tasks [33] by leveraging the environment's physics [29]. A traditional approach to completing manipulation tasks is motion planning [14, 18], followed by open-loop execution. However, these plans are likely to fail due to uncertainty. Others [11, 19, 27, 32] have proposed using conservative open-loop motion plans that are robust to uncertainty. Recently, MPC algorithms for manipulation [1, 2, 4, 9] have been developed. They use feedback and can handle bounded uncertainty [22]. We take a similar closed-loop approach to non-prehensile manipulation in this paper.

There are recent works that develop learning-based uncertainty-aware controllers in robotics, mainly for navigation and collision avoidance. Richter and Roy [25] proposed safe visual navigation with deep learning. Given that current deep learning methods produce unsafe predictions when faced with out-of-distribution inputs, they detect these novel inputs using the reconstruction loss of an autoencoder and switch to a rule-based safe policy. Similarly, Choi et al. [5] switch between a learned policy and a rule-based policy for autonomous driving. However, they estimate uncertainty using a single mixture density network without Monte Carlo sampling. We focus on non-prehensile manipulation, where designing a rule-based policy for each new task is not feasible. Furthermore, Kahn et al. [12] proposed an uncertainty-aware controller for autonomous navigation with collision avoidance. First, they estimate uncertainty from multiple bootstrapped neural networks using dropout. Thereafter, they consider a very large number of fixed action sequences at a given current state. They evaluate these action sequences under an uncertainty-dependent cost function within an MPC framework. The resulting policy chooses low-speed actions in unfamiliar environments and naturally chooses high-speed actions in familiar environments. While this approach
relieves the burden of designing a rule-based policy, it requires the evaluation of a large number of a priori fixed action sequences. Our approach of adaptively sampling the action space given a current state is similar to prior work [3, 20], which proposes sampling-based planning in continuous-action MDPs. We use a trajectory optimizer to guide our action sampling process.
9 Discussion and Future Work

We presented a closed-loop planning and control algorithm capable of adapting to the accuracy requirements of a task by generating both fast and slow actions. This is an exciting first step toward realizing task-adaptive manipulation planners. In this work, we use a stochastic trajectory optimizer that outputs locally optimal control sequences. Thus, the resulting policy can get stuck in a local optimum. Moreover, the trajectory optimizer may not return a good control sequence that reaches the goal for a given task if some design parameters (e.g., n_max) are chosen poorly. Furthermore, our uncertainty model is simple and is only an approximation of real-world stochastic phenomena. We will work toward finding uncertainty models that better describe the stochasticity of real-world physics, especially for manipulation in clutter. Finally, we will investigate the generalization of our task-adaptive approach to other manipulation primitives.

Acknowledgements. This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska Curie grant agreement No. 746143, and from the UK Engineering and Physical Sciences Research Council under grants EP/P019560/1 and EP/R031193/1.
References

1. Agboh, W.C., Dogar, M.R.: Real-time online re-planning for grasping under clutter and uncertainty. In: IEEE-RAS Humanoids (2018)
2. Arruda, E., Mathew, M.J., Kopicki, M., Mistry, M., Azad, M., Wyatt, J.L.: Uncertainty averse pushing with model predictive path integral control. In: Humanoids (2017)
3. Bubeck, S., Stoltz, G., Szepesvári, C., Munos, R.: Online optimization in X-armed bandits. In: NIPS (2009)
4. Calli, B., Dollar, A.M.: Vision-based model predictive control for within-hand precision manipulation with underactuated grippers. In: ICRA (2017)
5. Choi, S., Lee, K., Lim, S., Oh, S.: Uncertainty-aware learning from demonstration using mixture density networks with sampling-free variance modeling. In: ICRA (2018)
6. Diankov, R., Srinivasa, S.S., Ferguson, D., Kuffner, J.: Manipulation planning with caging grasps. In: Humanoids (2008)
7. Dogar, M.R., Hsiao, K., Ciocarlie, M., Srinivasa, S.: Physics-based grasp planning through clutter. In: Robotics: Science and Systems (2012)
8. Fitts, P.M.: The information capacity of the human motor system in controlling the amplitude of movement. Exp. Psychol. 47, 381 (1954)
9. Hogan, F.R., Rodriguez, A.: Feedback control of the pusher-slider system: a story of hybrid and underactuated contact dynamics. In: WAFR (2016)
10. Howe, R.D., Cutkosky, M.R.: Practical force-motion models for sliding manipulation. IJRR 15(6), 557–572 (1996)
11. Johnson, A.M., King, J., Srinivasa, S.: Convergent planning. IEEE RA-L 1, 1044–1051 (2016)
12. Kahn, G., Villaflor, A., Pong, V., Abbeel, P., Levine, S.: Uncertainty-aware reinforcement learning for collision avoidance. CoRR (2017)
13. Kalakrishnan, M., Chitta, S., Theodorou, E., Pastor, P., Schaal, S.: STOMP: stochastic trajectory optimization for motion planning. In: ICRA (2011)
14. Kavraki, L., Latombe, J.C.: Randomized preprocessing of configuration for fast path planning. In: ICRA (1994)
15. Kearns, M.J., Mansour, Y., Ng, A.Y.: A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In: IJCAI (1999)
16. King, J.E., Haustein, J.A., Srinivasa, S., Asfour, T.: Nonprehensile whole arm rearrangement planning on physics manifolds. In: ICRA (2015)
17. Kitaev, N., Mordatch, I., Patil, S., Abbeel, P.: Physics-based trajectory optimization for grasping in cluttered environments. In: ICRA (2015)
18. Li, W., Todorov, E.: Iterative linear quadratic regulator design for nonlinear biological movement systems. In: ICINCO (2004)
19. Luders, B., Kothari, M., How, J.P.: Chance constrained RRT for probabilistic robustness to environmental uncertainty. In: AIAA Guidance, Navigation, and Control Conference (2010)
20. Mansley, C., Weinstein, A., Littman, M.L.: Sample-based planning for continuous action Markov decision processes. In: ICAPS (2011)
21. Mason, M.T.: Mechanics and planning of manipulator pushing operations. Int. J. Robot. Res. 5(3), 53–71 (1986)
22. Mayne, D.Q., Rawlings, J.B., Rao, C.V., Scokaert, P.O.M.: Constrained model predictive control: stability and optimality. Automatica 36(6), 789–814 (2000)
23. Muhayyuddin, Moll, M., Kavraki, L., Rosell, J.: Randomized physics-based motion planning for grasping in cluttered and uncertain environments. IEEE RA-L 3(2), 712–719 (2018)
24. Péret, L., Garcia, F.: On-line search for solving Markov decision processes via heuristic sampling. In: ECAI. IOS Press (2004)
25. Richter, C., Roy, N.: Safe visual navigation via deep learning and novelty detection. In: RSS (2017)
26. Ruiz-Ugalde, F., Cheng, G., Beetz, M.: Fast adaptation for effect-aware pushing. In: Humanoids (2011)
27. Sieverling, A., Eppner, C., Wolff, F., Brock, O.: Interleaving motion in contact and in free space for planning under uncertainty. In: IROS (2017)
28. Todorov, E., Erez, T., Tassa, Y.: MuJoCo: a physics engine for model-based control. In: IROS (2012)
29. Toussaint, M., Allen, K., Smith, K., Tenenbaum, J.: Differentiable physics and stable modes for tool-use and manipulation planning. In: RSS (2018)
30. Williams, G., Aldrich, A., Theodorou, E.: Model predictive path integral control using covariance variable importance sampling. CoRR (2015)
31. Yu, K.T., Bauza, M., Fazeli, N., Rodriguez, A.: More than a million ways to be pushed: a high-fidelity experimental dataset of planar pushing. In: IROS (2016)
32. Zhou, J., Paolini, R., Johnson, A.M., Bagnell, J.A., Mason, M.T.: A probabilistic planning framework for planar grasping under uncertainty. IEEE RA-L 2(4), 2111–2118 (2017)
33. Zhu, Y., Wang, Z., Merel, J., Rusu, A., Erez, T., Cabi, S., Tunyasuvunakool, S., Kramár, J., Hadsell, R., de Freitas, N., Heess, N.: Reinforcement and imitation learning for diverse visuomotor skills. In: RSS (2018)
Imitation Learning
Learning Stabilizable Dynamical Systems via Control Contraction Metrics

Sumeet Singh¹(B), Vikas Sindhwani², Jean-Jacques E. Slotine³, and Marco Pavone¹

¹ Department of Aeronautics and Astronautics, Stanford University, Stanford, USA
[email protected], [email protected]
² Google Brain Robotics, New York, USA
[email protected]
³ Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, USA
[email protected]
Abstract. We propose a novel framework for learning stabilizable nonlinear dynamical systems for continuous control tasks in robotics. The key idea is to develop a new control-theoretic regularizer for dynamics fitting rooted in the notion of stabilizability, which guarantees that the learned system can be accompanied by a robust controller capable of stabilizing any open-loop trajectory that the system may generate. By leveraging tools from contraction theory, statistical learning, and convex optimization, we provide a general and tractable semisupervised algorithm to learn stabilizable dynamics, which can be applied to complex underactuated systems. We validated the proposed algorithm on a simulated planar quadrotor system and observed notably improved trajectory generation and tracking performance with the control-theoretic regularized model over models learned using traditional regression techniques, especially when using a small number of demonstration examples. The results presented illustrate the need to infuse standard model-based reinforcement learning algorithms with concepts drawn from nonlinear control theory for improved reliability. Keywords: Model-based reinforcement learning · Contraction theory · Robotics
1 Introduction

The problem of efficiently and accurately estimating an unknown dynamical system,

ẋ(t) = f(x(t), u(t)),    (1)

from a small set of sampled trajectories, where x ∈ Rⁿ is the state and u ∈ Rᵐ is the control input, is the central task in model-based Reinforcement Learning (RL).

M. Pavone: This work was supported by NASA under the Space Technology Research Grants Program, Grant NNX12AQ43G, and by the King Abdulaziz City for Science and Technology (KACST).

In this
setting, a robotic agent strives to pair an estimated dynamics model with a feedback policy in order to act optimally in a dynamic and uncertain environment. The model of the dynamical system can be continuously updated as the robot experiences the consequences of its actions, and the improved model can be leveraged for different tasks, affording a natural form of transfer learning. When it works, model-based Reinforcement Learning typically offers major improvements in sample efficiency in comparison to state-of-the-art RL methods such as Policy Gradients [2, 17] that do not explicitly estimate the underlying system. Yet, all too often, when standard supervised learning with powerful function approximators such as Deep Neural Networks and Kernel Methods is applied to model complex dynamics, the resulting controllers do not perform on par with model-free RL methods in the limit of increasing sample size, due to compounding errors across long time horizons. The main goal of this paper is to develop a new control-theoretic regularizer for dynamics fitting rooted in the notion of stabilizability, which guarantees that the learned system can be accompanied by a robust controller capable of stabilizing any trajectory that the system may generate. Formally, a reference state-input trajectory pair (x*(t), u*(t)), t ∈ [0, T] for system (1) is termed exponentially stabilizable at rate λ > 0 if there exists a feedback controller k : Rⁿ × Rⁿ → Rᵐ such that the solution x(t) of the system ẋ(t) = f(x(t), u*(t) + k(x*(t), x(t))) converges exponentially to x*(t) at rate λ. That is,

‖x(t) − x*(t)‖₂ ≤ C ‖x(0) − x*(0)‖₂ e^{−λt}    (2)

for some constant C > 0. The system (1) is termed exponentially stabilizable at rate λ in an open, connected, bounded region X ⊂ Rⁿ if all state trajectories x*(t) satisfying x*(t) ∈ X, ∀t ∈ [0, T] are exponentially stabilizable at rate λ.
Problem Statement: In this work, we assume that the dynamics function f(x, u) is unknown to us and we are instead provided with a dataset of tuples {(x_i, u_i, ẋ_i)}_{i=1}^{N} taken from a collection of observed trajectories (e.g., expert demonstrations) on the robot. Our objective is to solve the problem:

min_{f̂ ∈ H}  Σ_{i=1}^{N} ‖f̂(x_i, u_i) − ẋ_i‖₂² + μ ‖f̂‖²_H    (3)
s.t.  f̂ is stabilizable,    (4)

where H is an appropriate normed function space and μ > 0 is a regularization parameter. Note that we use (̂·) to differentiate the learned dynamics from the true dynamics.
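Ignoring the constraint (4) for a moment, the objective (3) is ordinary regularized regression; a schematic rendering (ours, with f_hat a candidate model and norm_sq its squared H-norm supplied by the chosen function class) is:

import numpy as np

def regression_objective(f_hat, norm_sq, data, mu):
    # data: iterable of (x, u, xdot) tuples; Eq. 3 without the
    # stabilizability constraint, which Sects. 3-5 handle via CCMs.
    fit = sum(np.sum((f_hat(x, u) - xdot) ** 2) for x, u, xdot in data)
    return fit + mu * norm_sq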
We expect that for systems that are indeed stabilizable, enforcing such a constraint may drastically prune the hypothesis space, thereby playing the role of a “control-theoretic” regularizer that is potentially more powerful and, ultimately, more pertinent for the downstream control task of generating and tracking new trajectories.

Related Work: The simplest approach to learning dynamics is to ignore stabilizability and treat the problem as a standard one-step time-series regression task [2, 4, 17].
However, coarse dynamics models trained on limited training data typically generate trajectories that rapidly diverge from expected paths, inducing controllers that are ineffective when applied to the true system. This divergence can be reduced by expanding the training data with corrections to boost multi-step prediction accuracy [24, 25]. In recent work on uncertainty-aware model-based RL, policies [2, 17] are optimized with respect to stochastic rollouts from probabilistic dynamics models that are iteratively improved in a model predictive control loop. Despite being effective, these methods are still heuristic in the sense that the existence of a stabilizing feedback controller is not explicitly guaranteed. Learning dynamical systems satisfying some desirable stability properties (such as asymptotic stability about an equilibrium point, e.g., for point-to-point motion) has been studied in the autonomous case, ẋ(t) = f(x(t)), in the context of imitation learning. In this line of work, one assumes perfect knowledge and invertibility of the robot's controlled dynamics to solve for the input that realizes this desirable closed-loop motion [7, 8, 10, 16, 18, 20]. Crucially, in our work, we do not require knowledge, or invertibility, of the robot's controlled dynamics. We seek to learn the full controlled dynamics of the robot, under the constraint that the resulting learned dynamics generate dynamically feasible and, most importantly, stabilizable trajectories. Thus, this work generalizes existing literature by additionally incorporating the controllability limitations of the robot within the learning problem. In that sense, it is in the spirit of recent model-based RL techniques that exploit control-theoretic notions of stability to guarantee model safety during the learning process [1]. However, unlike the work in [1], which aims to maintain a local region of attraction near a known safe operating point, we consider a stronger notion of safety: that of stabilizability, that is, the ability to keep the system within a bounded region of any exploratory open-loop trajectory. Potentially, the tools we develop may also be used to extend standard adaptive robot control design, such as [23], a technique which achieves stable concurrent learning and control using a combination of physical basis functions and general mathematical expansions, e.g., radial basis function approximations [19]. Notably, our work allows us to handle complex underactuated systems, a consequence of the significantly more powerful function approximation framework developed herein, as well as of the use of a differential (rather than classical) Lyapunov-like setting, as we shall detail.

Statement of Contributions: Stabilizability of trajectories is a complex task in nonlinear control. In this work, we leverage recent advances in contraction theory for control design through the use of control contraction metrics (CCMs) [13], which turn stabilizability constraints into convex Linear Matrix Inequalities (LMIs). Contraction theory [12] is a method of analyzing nonlinear systems in a differential framework, i.e., via the associated variational system [3, Chp 3], and is focused on the study of convergence between pairs of state trajectories towards each other. Thus, at its core, contraction explores a stronger notion of stability: that of incremental stability between solution trajectories, instead of the stability of an equilibrium point or invariant set.
Importantly, we harness recent results in [13, 15, 21] that illustrate how to use contraction theory to obtain a certificate for trajectory stabilizability and an accompanying tracking controller with exponential stability properties. In Sect. 2, we provide a brief summary of these results, which in turn will form the foundation of this work.
182
S. Singh et al.
Our paper makes four primary contributions. First, we formulate the learning stabilizable dynamics problem through the lens of control contraction metrics (Sect. 3). Second, under an arguably weak assumption on the sparsity of the true dynamics model, we present a finite-dimensional optimization-based solution to this problem by leveraging the powerful framework of vector-valued Reproducing Kernel Hilbert Spaces (Sect. 4.2). We further motivate this solution from a standpoint of viewing the stabilizability constraint as a novel control-theoretic regularizer for dynamics learning. Third, we develop a tractable algorithm leveraging alternating convex optimization problems and adaptive sampling to iteratively solve the finite-dimensional optimization problem (Sect. 5). Finally, we verify the proposed approach on a 6-state, 2-input planar quadrotor model where we demonstrate that naive regression-based dynamics learning can yield estimated models that generate completely unstabilizable trajectories. In contrast, the control-theoretic regularized model generates vastly superior quality, trackable trajectories, especially for smaller training sets (Sect. 6).
2 Review of Contraction Theory The core principle behind contraction theory [12] is to study the evolution of distance between any two arbitrarily close neighboring trajectories and drawing conclusions on the distance between any pair of trajectories. Given an autonomous system of the form: x(t) ˙ = f (x(t)), consider two neighboring trajectories separated by an infinitesimal (virtual) displacement δx (formally, δx is a vector in the tangent space Tx X at x). The dynamics of this virtual displacement are given by: ∂f δx , δ˙x = ∂x where ∂f /∂x is the Jacobian of f . The dynamics of the infinitesimal squared distance δxT δx between these two trajectories is then given by: d T ∂f δx δx = 2δxT δx . dt ∂x Then, if the (symmetric part) of the Jacobian matrix ∂f /∂x is uniformly negative definite, i.e., 1 ∂f (x) ≤ −λ < 0, sup λmax 2 ∂x x
where (·) := (·) + (·)T , λ > 0, one has that the squared infinitesimal length δxT δx is exponentially convergent to zero at rate 2λ. By path integration of δx between any pair of trajectories, one has that the distance between any two trajectories shrinks exponentially to zero. The vector field is thereby referred to be contracting at rate λ. Contraction metrics generalize this observation by considering as infinitesimal squared length distance, a symmetric positive definite function V (x, δx ) = δxT M (x)δx , where M : X → S>0 n , is a mapping from X to the set of uniformly positive-definite
Learning Stabilizable Dynamical Systems via Control Contraction Metrics
183
n × n symmetric matrices. Formally, M (x) may be interpreted as a Riemannian metric tensor, endowing the space X with the Riemannian squared length element V (x, δx ). A fundamental result in contraction theory [12] is that any contracting system admits a contraction metric M (x) such that the associated function V (x, δx ) satisfies: V˙ (x, δx ) ≤ −2λV (x, δx ),
∀(x, δx ) ∈ T X ,
for some λ > 0. Thus, the function V (x, δx ) may be interpreted as a differential Lyapunov function. 2.1 Control Contraction Metrics Control contraction metrics (CCMs) generalize contraction analysis to the controlled dynamical setting, in the sense that the analysis searches jointly for a controller design and the metric that describes the contraction properties of the resulting closed-loop system. Consider dynamics of the form: x(t) ˙ = f (x(t)) + B(x(t))u(t),
(5)
where B : X → Rn×m is the input matrix, and denote B in column form as (b1 , . . . , bm ) and u in component form as (u1 , . . . , um ). To define a CCM, analogously to the previous section, we first analyze the variational dynamics, i.e., the dynamics of an infinitesimal displacement δx :
δ˙x =
:=A(x,u)
m ∂f (x) j ∂bj (x) + u δx + B(x)δu , ∂x ∂x j=1
(6)
where δu is an infinitesimal (virtual) control vector at u (i.e., δu is a vector in the control input tangent space, i.e., Rm ). A CCM for the system {f, B} is a uniformly positivedefinite symmetric matrix function M (x) such that there exists a function δu (x, δx , u) so that the function V (x, δx ) = δxT M (x)δx satisfies V˙ (x, δx , u) = δxT ∂f +Bu M (x) + M (x)A(x, u) δx + 2δxT M (x)B(x)δu
≤ −2λV (x, δx ),
∀(x, δx ) ∈ T X , u ∈ Rm ,
(7)
where ∂g M (x) is the matrix with element (i, j) given by Lie derivative of Mij (x) along the vector g. Given the existence of a valid CCM, one then constructs a stabilizing (in the sense of Eq. (2)) feedback controller k(x∗ , x) as described in the extended version of this work [22]. Some important observations are in order. First, the function V (x, δx ) may be interpreted as a differential control Lyapunov function, in that, there exists a stabilizing differential controller δu that stabilizes the variational dynamics (6) in the sense of Eq. (7). Second, and more importantly, we see that by stabilizing the variational dynamics (essentially an infinite family of linear dynamics in (δx , δu )) pointwise, everywhere in the state-space, we obtain a stabilizing controller for the original nonlinear system.
184
S. Singh et al.
Crucially, this is an exact stabilization result, not one based on local linearization. Consequently, one can show several useful properties, such as invariance to state-space transformations [13] and robustness [14, 21]. Third, the CCM approach only requires a weak form of controllability, and therefore is not restricted to feedback linearizable (i.e., invertible) systems.
3 Problem Formulation Using the characterization of stabilizability using CCMs, we can now formalize our problem as follows. Given our dataset of tuples {(xi , ui , x˙ i )}N i=1 , the objective of this work is to learn the dynamics functions f (x) and B(x) in Eq. (5), subject to the constraint that there exists a valid CCM M (x) for the learned dynamics. That is, the CCM M (x) plays the role of a certificate of stabilizability for the learned dynamics. As shown in [13], a necessary and sufficient characterization of a CCM M (x) is given in terms of its dual W (x) := M (x)−1 by the following two conditions: ∂bj (x) T W (x) B⊥ = 0, j = 1, . . . , m ∀x ∈ X , (8) B⊥ ∂bj W (x) − ∂x
∂f (x) W (x) + 2λW (x) B⊥ (x) ≺ 0, −∂f W (x) + ∂x
B⊥ (x)T
∀x ∈ X ,
(9)
:=F (x;f,W,λ)
where B⊥ is the annihilator matrix for B, i.e., B(x)T B⊥ (x) = 0 for all x. In the definition above, we write F (x; W, f, λ) since {W, f, λ} will be optimization variables in our formulation. Thus, our learning task reduces to finding the functions {f, B, W } and constant λ that jointly satisfy the above constraints, while minimizing an appropriate regularized regression loss function. Formally, problem (3) can be re-stated as: ˆ :=Jd (fˆ,B)
min
bj ∈HB ,j=1,...,m fˆ∈Hf ,ˆ W ∈HW w,w,λ∈R>0
N m 2 ˆ ˆ i )ui − x˙ i ˆbj 2HB f (xi ) + B(x + µf fˆ2Hf + µb 2
i=1
+ (w − w) + µw W 2HW
j=1
(10a)
:=Jm (W,w,w)
subject to
eqs. (8), (9)
∀x ∈ X ,
wIn W (x) wIn ,
∀x ∈ X ,
(10b) (10c)
where Hf and HB are appropriately chosen Rn -valued function classes on X for fˆ and ˆbj respectively, and HW is a suitable S>0 -valued function space on X . The objective n
Learning Stabilizable Dynamical Systems via Control Contraction Metrics
185
is composed of a dynamics term Jd – consisting of regression loss and regularization terms, and a metric term Jm – consisting of a condition number surrogate loss on the metric W (x) and a regularization term. The metric cost term w − w is motivated by the observation that the state tracking error (i.e., x(t) − x∗ (t)2 ) in the presence of bounded additive disturbances is proportional to the ratio w/w (see [21]). Notice that the coupling constraint (9) is a bi-linear matrix inequality in the decision variables sets {fˆ, λ} and W . Thus at a high-level, a solution algorithm must consist of alternating between two convex sub-problems, defined by the objective/decision variˆ λ}) and (Jm , {W, w, w}). able pairs (Jd , {fˆ, B,
4 Solution Formulation When performing dynamics learning on a system that is a priori known to be exponentially stabilizable at some strictly positive rate λ, the constrained problem formulation in (10a–10c) follows naturally given the assured existence of a CCM. Albeit, the infinite-dimensional nature of the constraints is a considerable technical challenge, broadly falling under the class of semi-infinite optimization [6]. Alternatively, for systems that are not globally exponentially stabilizable in X , one can imagine that such a constrained formulation may lead to adverse effects on the learned dynamics model. Thus, in this section we propose a relaxation of problem (10a–10c) motivated by the concept of regularization. Specifically, constraints (8) and (9) capture this notion of stability of infinitesimal deviations at all points in the space X . They stem from requiring that V˙ ≤ −2λV (x, δx ) in Eq. (7) when δxT M (x)B(x) = 0, i.e., when δu can have no effect on V˙ . This is nothing but the standard control Lyapunov inequality, applied to the differential setting. Constraint (8) sets to zero, the terms in (7) affine in u, while constraint (9) enforces this “natural” stability condition. The simplifications we make are (i) relax constraints (9) and (10c) to hold pointwise over some finite constraint set Xc ∈ X , and (ii) assume a specific sparsity structure ˆ for input matrix estimate B(x). We discuss the pointwise relaxation here; the sparsity ˆ assumption on B(x) is discussed in the following section and [22]. First, from a purely mathematical standpoint, the pointwise relaxation of (9) and (10c) is motivated by the observation that as the CCM-based controller is exponentially stabilizing, we only require the differential stability condition to hold locally (in a tubelike region) with respect to the provided demonstrations. By continuity of eigenvalues for continuously parameterized entries of a matrix, it is sufficient to enforce the matrix inequalities at a sampled set of points [9]. Second, enforcing the existence of such an “approximate” CCM seems to have an impressive regularization effect on the learned dynamics that is more meaningful than standard regularization techniques used in for instance, ridge or lasso regression. Specifically, problem (10a–10c), and more generally, problem (3) can be viewed as the projection of the best-fit dynamics onto the set of stabilizable systems. This results in dynamics models that jointly balance regression performance and stabilizablity, ultimately yielding systems whose generated trajectories are notably easier to track. This effect of regularization is discussed in detail in our experiments in Sect. 6. Practically, the finite constraint set Xc , with cardinality Nc , includes all xi in the regression training set of {(xi , ui , x˙ i )}N i=1 tuples. However, as the LMI constraints are
186
S. Singh et al.
independent of ui , x˙ i , the set Xc is chosen as a strict superset of {xi }N i=1 (i.e., Nc > N ) by randomly sampling additional points within X , drawing parallels with semisupervised learning. 4.1
ˆ Sparsity of Input Matrix Estimate B
While a pointwise relaxation for the matrix inequalities is reasonable, one cannot apply such a relaxation to the exact equality condition in (8). Thus, the second simplification made is the following assumption, reminiscent of control normal form equations. ˆ Assumption 1. Assume B(x) to take the following sparse representation: O(n−m)×m ˆ B(x) = , b(x)
(11)
where b(x) is an invertible m × m matrix for all x ∈ X . ˆ For the assumed structure of B(x), a valid B⊥ matrix is then given by: In−m . B⊥ = Om×(n−m)
(12)
Therefore, constraint (8) simply becomes: ∂ˆbj W⊥ (x) = 0,
j = 1, . . . , m.
where W⊥ is the upper-left (n − m) × (n − m) block of W (x). Assembling these constraints for the (p, q) entry of W⊥ , i.e., w⊥pq , we obtain: ∂w⊥pq (x) ∂w⊥pq (x) b(x) = 0. ··· ∂xn ∂x(n−m)+1 Since the matrix b(x) in (11) is assumed to be invertible, the only solution to this equation is ∂w⊥pq /∂xi = 0 for i = (n − m) + 1, . . . , n, and all (p, q) ∈ {1, . . . , (n − m)}. That is, W⊥ cannot be a function of the last m components of x – an elegant simplification of constraint (8). Due to space limitations, justification for this sparsity assumption is provided in [22]. 4.2
Finite-Dimensional Optimization
We now present a tractable finite-dimensional optimization for solving problem (10a– 10c) under the two simplifying assumptions introduced in the previous sections. The derivation of the solution algorithm itself is presented in [22], and relies extensively on vector-valued Reproducing Kernel Hilbert Spaces.
Learning Stabilizable Dynamical Systems via Control Contraction Metrics
187
n ˆ Step 1: Parametrize the functions fˆ, the columns of B(x): {ˆbj }m j=1 , and {wij }i,j=1 as a linear combination of features. That is,
fˆ(x) = Φf (x)T α, ˆbj (x) = Φb (x)T βj j ∈ {1, . . . , m}, φˆw (x)T θˆij if (i, j) ∈ {1, . . . , n − m}, wij (x) = φw (x)T θij else,
(13) (14) (15)
where α ∈ Rdf , βj ∈ Rdb , θˆij , θij ∈ Rdw are constant vectors to be optimized over, and Φf : X → Rdf ×n , Φb : X → Rdb ×n , φˆw : X → Rdw and φw : X → Rdw are a priori chosen feature mappings. To enforce the sparsity structure in (11), the feature matrix Φb must have all 0s in its first n − m columns. The features φˆw are distinct from φw in that the former are only a function of the first n − m components of x (as per Sect. 4.1). While one can use any function approximator (e.g., neural nets), we motivate this parameterization from a perspective of Reproducing Kernel Hilbert Spaces (RKHS); see [22]. Step 2: Given positive regularization constants μf , μb , μw and positive tolerances (δλ , λ ) and (δw , w ), solve:
min
N
α,βj ,θˆij ,θij ,w,w,λ
:=Jd
ˆ i )ui − x˙ i 22 + μf α22 + μb fˆ(xi ) + B(x
k=1
+ (w − w) + μw
βj 22
j=1
θ˜ij 22
i,j
:=Jm
s.t. F (xi ; α, θ˜ij , λ + λ ) 0,
(16a)
∀xi ∈ Xc ,
(w + w )In W (xi ) wIn , θij = θji , θˆij = θˆji λ ≥ δλ ,
m
∀xi ∈ Xc ,
w ≥ δw ,
(16b) (16c) (16d) (16e)
where θ˜ij is used as a placeholder for θij and θˆij to simplify notation. We wish to highlight the following key points regarding problem (16a–16e). Constraints (16b) and (16c) are the pointwise relaxations of (9) and (10c) respectively. Constraint (16d) captures the fact that W (x) is a symmetric matrix. Finally, constraint (16e) imposes some tolerance requirements to ensure a well conditioned solution. Additionally, the tolerances δ and w are used to account for the pointwise relaxations of the matrix inequalities. A key challenge is to efficiently solve this constrained optimization problem, given a potentially large number of constraint points in Xc . In the next section, we present an iterative algorithm and an adaptive constraint sampling technique to solve problem (16a–16e).
188
S. Singh et al.
5 Solution Algorithm The fundamental structure of the solution algorithm consists of alternating between the dynamics and metric sub-problems derived from problem (16a–16e). We also make a few additional modifications to aid tractability, most notable of which is the use of (k) a dynamically updating set of constraint points Xc at which the LMI constraints are (k) (k) (k) enforced at the k th iteration. In particular Xc ⊂ Xc with Nc := |Xc | being ideally much less than Nc , the cardinality of the full constraint set Xc . Formally, each major iteration k is characterized by three minor steps (sub-problems): 1. Finite-dimensional dynamics sub-problem at iteration k: min
α,βj ,j=1,...,m, λ s≥0
Jd (α, β) + μs s1
(17a)
(k−1) s.t. F (xi ; α, θ˜ij , λ + λ ) s(xi )In−m (k−1)
s(xi ) ≤ s¯ λ ≥ δλ ,
∀xi ∈
∀xi ∈ Xc(k)
Xc(k)
(17b) (17c) (17d)
(k)
where μs is an additional regularization parameter for s – an Nc dimensional nonnegative slack vector. The quantity s¯(k−1) is defined as s¯(k−1) := max λmax F (k−1) (xi ) , where xi ∈Xc
F
(k−1)
(k−1) (xi ) := F (xi ; α(k−1) , θ˜ij , λ(k−1) + λ ).
That is, s¯(k−1) captures the worst violation for the F (·) LMI over the entire constraint set Xc , given the parameters at the end of iteration k − 1. 2. Finite-dimensional metric sub-problem at iteration k: min
θ˜ij ,w,w,s≥0
Jm (θ˜ij , w, w) + (1/μs )s1
(18a)
s.t. F (xi ; α(k) , θ˜ij , λ(k) + λ ) s(xi )In−m s(xi ) ≤ s¯(k−1)
∀xi ∈ Xc(k)
(w + w )In W (xi ) wIn , w ≥ δw .
∀xi ∈ Xc(k)
(18b) (18c)
∀xi ∈
Xc(k) ,
(18d) (18e)
(k)
sub-problem. Choose a tolerance parameter δ > 0. Then, define ν (k) (xi ) := max λmax F k (xi ) , λmax (δw + δ )In − W (xi ) , ∀xi ∈ Xc ,
3. Update Xc
and set
xi ∈ Xc \ Xc(k) : ν (k) (xi ) > 0 . Xc(k+1) := xi ∈ Xc(k) : ν (k) (xi ) > −δ (19)
Learning Stabilizable Dynamical Systems via Control Contraction Metrics
189
(k)
Thus, in the update Xc step, we balance addressing points where constraints are being violated (ν (k) > 0) and discarding points where constraints are satisfied with sufficient strict inequality (ν (k) ≤ −δ). This prevents overfitting to any specific subset of the constraint points. A potential variation to the union above is to only add up to say K (k) constraint violating points from Xc \ Xc (e.g., corresponding to the K worst violators), where K is a fixed positive integer. Indeed this is the variation used in our experi(k) ments and was found to be extremely efficient in balancing the size of the set Xc and thus, the complexity of each iteration. This adaptive sampling technique is inspired by exchange algorithms for semi-infinite optimization, as the one proposed in [26] where one is trying to enforce the constraints at all points in a compact set X . Note that after the first major iteration, we replace the regularization terms in Jd (k) (k−1) 2 (k) (k−1) 2 and Jm with α(k) − α(k−1) 22 , βj − βj 2 , and θ˜ij − θ˜ij 2 . This is done to prevent large updates to the parameters, particularly due to the dynamically updating (k) constraint set Xc . The full pseudocode is summarized below in Algorithm 1. Algorithm 1. Stabilizable Non-Linear Dynamics Learning (SNDL) {xi , ui , x˙ i }N i=1 , constraint set Xc , regularization constants {μf , μb , μw }, constraint tolerances {δλ , λ , δw , w }, discard tolerance parameter δ , Initial # of constraint points: Nc(0) , Max # iterations: Nmax , termination tolerance ε. k ← 0, converged ← false, W (x) ← In . Xc(0) ← R AND S AMPLE(Xc , Nc(0) ) while ¬converged ∧ k < Nmax do (k) {α(k) , βj , λ(k) } ← S OLVE (17)
1: Input: Dataset
2: 3: 4: 5: (k) 6: {θ˜ij , w, w} ← S OLVE (18) (k+1) (k) 7: Xc , s¯(k) ← U PDATE Xc(k) using (19) ,ν (k) (k−1) (k) (k−1) (k) 8: Δ ← max α − α(k−1) ∞ , βj − βj ∞ , θ˜ij − θ˜ij ∞ , λ(k) − λ(k−1) ∞ 9: if Δ < ε or ν (k) (xi ) < ε ∀xi ∈ Xc then 10: converged ← true. 11: end if 12: k ← k + 1. 13: end while
Some comments are in order. First, convergence in Algorithm 1 is declared if either progress in the solution variables stalls or all constraints are satisfied within tolerance. Due to the semi-supervised nature of the algorithm in that the number of constraint points Nc can be significantly larger than the number of supervisory regression tuples N , it is impractical to enforce constraints at all Nc points in any one iteration. Two key consequences of this are: (i) the matrix function W (x) at iteration k resulting from variables θ˜(k) does not have to correspond to a valid dual CCM for the interim learned dynamics at iteration k, and (ii) convergence based on constraint satisfaction at all Nc points is justified by the fact that at each iteration, we are solving relaxed sub-problems that collectively generate a sequence of lower-bounds on the overall objective. Potential future topics in this regard are: (i) investigate the properties of the converged dynamics for models that are a priori unknown unstabilizable, and (ii) derive sufficient conditions for convergence for both the infinitely- and finitely- constrained versions of problem (10a–10c).
190
S. Singh et al.
Second, as a consequence of this iterative procedure, the dual metric and contraction rate pair {W (x), λ} do not possess any sort of “control-theoretic” optimality. For instance, in [21], for a known stabilizable dynamics model, both these quantities are optimized for robust control performance. In this work, these quantities are used solely as regularizers to promote stabilizability of the learned model. A potential future topic to explore in this regard is how to further optimize {W (x), λ} for control performance.
6 Experimental Results In this section we validate our algorithms by benchmarking our results on a known dynamics model. Specifically, we consider the 6-state planar vertical-takeoff-vertical˙ where landing (PVTOL) model. The system is defined by the state: (px , pz , φ, vx , vz , φ) ˙ (px , pz ) is the position in the 2D plane, (vx , vz ) is the body-reference velocity, (φ, φ) are the roll and angular rate respectively, and 2-dimensional control input u corresponding to the motor thrusts. The true dynamics are given by: ⎤ ⎡ ⎤ ⎡ vx cos φ − vz sin φ 0 0 ⎢vx sin φ + vz cos φ⎥ ⎢ 0 0 ⎥ ⎥ ⎢ ⎥ ⎢ ˙ ⎢ ⎢ ⎥ φ 0 ⎥ ⎥ u, ⎥+⎢ 0 x(t) ˙ =⎢ ⎢ vz φ˙ − g sin φ ⎥ ⎢ 0 0 ⎥ ⎥ ⎢ ⎥ ⎢ ⎣ −vx φ˙ − g cos φ ⎦ ⎣(1/m) (1/m) ⎦ l/J (−l/J) 0 where g is the acceleration due to gravity, m is the mass, l is the moment-arm of the thrusters, and J is the moment of inertia about the roll axis. We note that typical benchmarks in this area of work either present results on the 2D LASA handwriting dataset [7] or other low-dimensional motion primitive spaces, with the assumption of full robot dynamics invertibility. The planar quadrotor on the other hand is a complex non-minimum phase dynamical system that has been heavily featured within the acrobatic robotics literature and therefore serves as a suitable case-study. 6.1
Generation of Datasets
The training dataset was generated in 3 steps. First, a fixed set of waypoint paths in (px , pz ) were randomly generated. Second, for each waypoint path, multiple smooth polynomial splines were fitted using a minimum-snap algorithm. To create variation amongst the splines, the waypoints were perturbed within Gaussian balls and the time durations for the polynomial segments were also randomly perturbed. Third, the PVTOL system was simulated with perturbed initial conditions and the polynomial trajectories as references, and tracked using a sub-optimally tuned PD controller; thereby emulating a noisy/imperfect demonstrator. These final simulated paths were sub-sampled at 0.1s resolution to create the datasets. The variations created at each step of this process were sufficient to generate a rich exploration of the state-space for training. Due to space constraints, we provide details of the solution parameterization (number of features, etc) in [22].
Learning Stabilizable Dynamical Systems via Control Contraction Metrics
191
6.2 Models Using the same feature space, we trained three separate models with varying training dataset (i.e., (xi , ui , x˙ s ) tuples) sizes of N ∈ {100, 250, 500, 1000}. The first model, N-R was an unconstrained and un-regularized model, trained by solving problem (17a– 17d) without constraints or l2 regularization (i.e., just least-squares). The second model, R-R was an unconstrained ridge-regression model, trained by solving problem (17a– 17d) without any constraints (i.e., least-squares plus l2 regularization). The third model, CCM-R is the CCM-regularized model, trained using Algorithm 1. We enforced the CCM regularizing constraints for the CCM-R model at Nc = 2400 points in the statespace, composed of the N demonstration points in the training dataset and randomly sampled points from X (recall that the CCM constraints do not require samples of u, x). ˙ As the CCM constraints were relaxed to hold pointwise on the finite constraint set Xc as opposed to everywhere on X , in the spirit of viewing these constraints as regularizers for the model (see Sect. 4), we simulated both the R-R and CCM-R models using the time-varying Linear-Quadratic-Regulator (TV-LQR) feedback controller. This also helped ensure a more direct comparison of the quality of the learned models themselves, independently of the tracking feedback controller. The results are virtually identical using a tracking MPC controller and yield no additional insight. 6.3 Validation and Comparison The validation tests were conducted by gridding the (px , pz ) plane to create a set of 120 initial conditions between 4 m and 12 m away from (0, 0) and randomly sampling the other states for the rest of the initial conditions. These conditions were held fixed for both models and for all training dataset sizes to evaluate model improvement. For each model at each value of N , the evaluation task was to (i) solve a trajectory optimization problem to compute a dynamically feasible trajectory for the learned model to go from initial state x0 to the goal state - a stable hover at (0, 0) at nearzero velocity; and (ii) track this trajectory using the TV-LQR controller. As a baseline, all simulations without any feedback controller (i.e., open-loop control rollouts) led to the PVTOL crashing. This is understandable since the dynamics fitting objective is not optimizing for multi-step error. The trajectory optimization step was solved as a fixedendpoint, fixed final time optimal control problem using the Chebyshev pseudospectral T method [5] with the objective of minimizing 0 u(t)2 dt. The final time T for a given initial condition was held fixed between all models. Note that 120 trajectory optimization problems were solved for each model and each value of N . Figure 1 shows a boxplot comparison of the trajectory-wise RMS full state errors (x(t)−x∗ (t)2 where x∗ (t) is the reference trajectory obtained from the optimizer and x(t) is the actual realized trajectory) for each model and all training dataset sizes. As N increases, the spread of the RMS errors decreases for both R-R and CCM-R models as expected. However, we see that the N-R model generates several unstable trajectories for N = 100, 500 and 1000, indicating the need for some form of regularization. The CCM-R model consistently achieves a lower RMS error distribution than both the N-R and R-R models for all training dataset sizes. Most notable however, is its performance when the number of training samples is small (i.e., N ∈ {100, 250}) when there is
192
S. Singh et al.
Fig. 1. Box-whisker plot comparison of trajectory-wise RMS state-tracking errors over all 120 trajectories for each model and all training dataset sizes. Top row, left-to-right: N = 100, 250, 500, 1000; Bottom row, left-to-right: N = 100, 500, 1000 (zoomed in). The box edges correspond to the 25th, median, and 75th percentiles; the whiskers extend beyond the box for an additional 1.5 times the interquartile range; outliers, classified as trajectories with RMS errors past this range, are marked with red crosses. Notice the presence of unstable trajectories for N-R at all values of N and for R-R at N = 100, 250. The CCM-R model dominates the other two at all values of N , particularly for N = 100, 250.
considerable risk of overfitting. It appears the CCM constraints have a notable effect on the stabilizability of the resulting model trajectories (recall that the initial conditions of the trajectories and the tracking controllers are held fixed between the models). For N = 100 (which is really at the extreme lower limit of necessary number of samples since there are effectively 97 features for each dimension of the dynamics function), both N-R and R-R models generate a large number of unstable trajectories. In contrast, out of the 120 generated test trajectories, the CCM-R model generates one mildly (in that the quadrotor diverged from the nominal trajectory but did not crash) unstable trajectory. No instabilities were observed with CCM-R for N ∈ {250, 500, 1000}. Figure 2a compares the (px , pz ) traces between R-R and CCM-R corresponding to the five worst performing trajectories for the R-R N = 100 model. Similarly, Fig. 2b compares the (px , pz ) traces corresponding to the five worst performing trajectories for the CCM-R N = 100 model. Notice the large number of unstable trajectories generated using the R-R model. Indeed, it is in this low sample training regime where the controltheoretic regularization effects of the CCM-R model are most noticeable. Finally, in Fig. 3, we highlight two trajectories, starting from the same initial conditions, one generated and tracked using the R-R model, the other using the CCM model, for N = 250. Overlaid on the plot are the snapshots of the vehicle outline itself, illustrating the quite aggressive flight-regime of the trajectories (the initial starting bank angle is 40o ). While tracking the R-R model generated trajectory eventually ends in complete loss of control, the system successfully tracks the CCM-R model generated trajectory to the stable hover at (0, 0). An interesting area of future work here is to investigate how to tune the regularization parameters μf , μb , μw . Indeed, the R-R model appears to be extremely sensitive to μf , yielding drastically worse results with a small change in this parameter. On the other hand, the CCM-R model appears to be quite robust to variations in this parameter.
Learning Stabilizable Dynamical Systems via Control Contraction Metrics
193
Fig. 2. (px , pz ) traces for R-R (left column) and CCM-R (right column) corresponding to the 5 worst performing trajectories for (a) R-R, and (b) CCM-R models at N = 100. Colored circles indicate start of trajectory. Red circles indicate end of trajectory. All except one of the R-R trajectories are unstable. One trajectory for CCM-R is slightly unstable.
Fig. 3. Comparison of reference and tracked trajectories in the (px , pz ) plane for R-R and CCMR models starting at same initial conditions with N = 250. Red (dashed): nominal, Blue (solid): actual, Green dot: start, black dot: nominal endpoint, blue dot: actual endpoint; Top: CCM-R, Bottom: R-R. The vehicle successfully tracks the CCM-R model generated trajectory to the stable hover at (0, 0) while losing control when attempting to track the R-R model generated trajectory.
Standard cross-validation techniques using regression quality as a metric are unsuitable as a tuning technique here; indeed, recent results even advocate for “ridgeless” regression [11]. However, as observed in Fig. 1, un-regularized model fitting is clearly unsuitable. The effect of regularization on how the trajectory optimizer leverages the learned dynamics is a non-trivial relationship that merits further study.
194
S. Singh et al.
7 Conclusions In this paper, we presented a framework for learning controlled dynamics from demonstrations for the purpose of trajectory optimization and control for continuous robotic tasks. By leveraging tools from nonlinear control theory, chiefly, contraction theory, we introduced the concept of learning stabilizable dynamics, a notion which guarantees the existence of feedback controllers for the learned dynamics model that ensures trajectory trackability. Borrowing tools from Reproducing Kernel Hilbert Spaces and convex optimization, we proposed a bi-convex semi-supervised algorithm for learning stabilizable dynamics for complex underactuated and inherently unstable systems. The algorithm was validated on a simulated planar quadrotor system where it was observed that our control-theoretic dynamics learning algorithm notably outperformed traditional ridge-regression based model learning. There are several interesting avenues for future work. First, it is unclear how the algorithm would perform for systems that are fundamentally unstabilizable and how the resulting learned dynamics could be used for “approximate” control. Second, we will explore sufficient conditions for convergence for the iterative algorithm under the finiteand infinite-constrained formulations. Third, we will address extending the algorithm to work on higher-dimensional spaces through functional parameterization of the controltheoretic regularizing constraints. Fourth, we will address the limitations imposed by the sparsity assumption on the input matrix B using the proposed alternating algorithm proposed in Sect. 4.1. Finally, we will incorporate data gathered on a physical system subject to noise and other difficult to capture nonlinear effects (e.g., drag, friction, backlash) and validate the resulting dynamics model and tracking ontrollers on the system itself to evaluate the robustness of the learned models.
References 1. Berkenkamp, F., Turchetta, M., Schoellig, A., Krause, A.: Safe model-based reinforcement learning with stability guarantees. In: Conference on Neural Information Processing Systems (2017) 2. Chua, K., Calandra, R., McAllister, R., Levine, S.: Deep reinforcement learning in a handful of trials using probabilistic dynamics models. arXiv preprint arXiv:1805.12114 (2018) 3. Crouch, P.E., van der Schaft, A.J.: Variational and Hamiltonian Control Systems. Springer, Heidelberg (1987) 4. Deisenroth, M.P., Rasmussen, C.E.: PILCO: a model-based and data-efficient approach to policy search. In: International Conference on Machine Learning, pp. 465–472 (2011) 5. Fahroo, F., Ross, I.M.: Direct trajectory optimization by a chebyshev pseudospectral method. AIAA J. Guidance Control Dyn. 25(1), 160–166 (2002) 6. Hettich, R., Kortanek, K.O.: Semi-infinite programming: theory, methods, and applications. SIAM Rev. 35(3), 380–429 (1993) 7. Khansari-Zadeh, S.M., Billard, A.: Learning stable nonlinear dynamical systems with Gaussian mixture models. IEEE Trans. Rob. 27(5), 943–957 (2011) 8. Khansari-Zadeh, S.M., Khatib, O.: Learning potential functions from human demonstrations with encapsulated dynamic and compliant behaviors. Auton. Robots 41(1), 45–69 (2017) 9. Lax, P.: Linear Algebra and its Applications, 2nd edn. Wiley, Hoboken (2007)
Learning Stabilizable Dynamical Systems via Control Contraction Metrics
195
10. Lemme, A., Neumann, K., Reinhart, R.F., Steil, J.J.: Neural learning of vector fields for encoding stable dynamical systems. Neurocomputing 141(1), 3–14 (2014) 11. Liang, T., Rakhlin, A.: Just interpolate: kernel ridgeless regression can generalize. arXiv preprint arXiv:1808.00387v1 (2018) 12. Lohmiller, W., Slotine, J.J.E.: On contraction analysis for non-linear systems. Automatica 34(6), 683–696 (1998) 13. Manchester, I.R., Slotine, J.J.E.: Control contraction metrics: convex and intrinsic criteria for nonlinear feedback design. IEEE Trans. Autom. Control 62, 3046–3053 (2017) 14. Manchester, I.R., Slotine, J.J.E.: Robust control contraction metrics: a convex approach to nonlinear state-feedback H − ∞ control. IEEE Control Syst. Lett. 2(2), 333–338 (2018) 15. Manchester, I., Tang, J.Z., Slotine, J.J.E.: Unifying classical and optimization-based methods for robot tracking control with control contraction metrics. In: International Symposium on Robotics Research (2015) 16. Medina, J.R., Billard, A.: Learning stable task sequences from demonstration with linear parameter varying systems and hidden Markov models. In: Conference on Robot Learning, pp. 175–184 (2017) 17. Nagabandi, A., Kahn, G., Fearing, R.S., Levine, S.: Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596 (2017) 18. Ravichandar, H., Salehi, I., Dani, A.: Learning partially contracting dynamical systems from demonstrations. In: Conference on Robot Learning (2017) 19. Sanner, R.M., Slotine, J.J.E.: Gaussian networks for direct adaptive control. IEEE Trans. Neural Netw. 3(6), 837–863 (1992) 20. Sindhwani, V., Tu, S., Khansari, M.: Learning contracting vector fields for stable imitation learning. arXiv preprint arXiv:1804.04878 (2018) 21. Singh, S., Majumdar, A., Slotine, J.J.E., Pavone, M.: Robust online motion planning via contraction theory and convex optimization. In: Proceedings of the IEEE Conference on Robotics and Automation (2017). Extended Version: http://asl.stanford.edu/wp-content/ papercite-data/pdf/Singh.Majumdar.Slotine.Pavone.ICRA17.pdf 22. Singh, S., Sindhwani, V., Slotine, J.J., Pavone, M.: Learning stabilizable dynamical systems via control contraction metrics. In: Workshop on Algorithmic Foundations of Robotics (2018, in Press). https://arxiv.org/abs/1808.00113 23. Slotine, J.J.E., Li, W.: On the adaptive control of robot manipulators. Int. J. Robot. Res. 6(3), 49–59 (1987) 24. Venkatraman, A., Capobianco, R., Pinto, L., Hebert, M., Nardi, D., Bagnell, A.: Improved learning of dynamics models for control. In: International Symposium on Experimental Robotics, pp. 703–713. Springer (2016) 25. Venkatraman, A., Hebert, M., Bagnell, J.A.: Improving multi-step prediction of learned time series models. In: Proceedings of the AAAI Conference on Artificial Intelligence (2015) 26. Zhang, L., Wu, S.Y., L´opez, M.A.: A new exchange method for convex semi-infinite programming. SIAM J. Optim. 20(6), 2959–2977 (2010)
Generalizing Robot Imitation Learning with Invariant Hidden Semi-Markov Models Ajay Kumar Tanwani1,2(B) , Jonathan Lee1 , Brijen Thananjeyan1 , Michael Laskey1 , Sanjay Krishnan1 , Roy Fox1 , Ken Goldberg1 , and Sylvain Calinon2 1
2
University of California, Berkeley, USA [email protected] Idiap Research Institute, Martigny, Switzerland
Abstract. Generalizing manipulation skills to new situations requires extracting invariant patterns from demonstrations. For example, the robot needs to understand the demonstrations at a higher level while being invariant to the appearance of the objects, geometric aspects of objects such as its position, size, orientation and viewpoint of the observer in the demonstrations. In this paper, we propose an algorithm that learns a joint probability density function of the demonstrations with invariant formulations of hidden semi-Markov models to extract invariant segments (also called sub-goals or options), and smoothly follow the generated sequence of states with a linear quadratic tracking controller. The algorithm takes as input the demonstrations observed with respect to different coordinate systems describing virtual landmarks or objects of interest, and adapts the segments according to the environmental changes in a systematic manner. We present variants of this algorithm in latent space with low-rank covariance decompositions, semi-tied covariances, and non-parametric online estimation of model parameters under small variance asymptotics; yielding considerably low sample and model complexity for acquiring new manipulation skills. The algorithm allows a Baxter robot to learn a pick-and-place task while avoiding a movable obstacle based on only 4 kinesthetic demonstrations. Keywords: Hidden Markov models · Imitation learning · Adaptive systems
1 Introduction Generative models are widely used in robot imitation learning to estimate the distribution of the data and regenerate samples from the learned model [1]. Common applications include probability density function estimation, image regeneration, and dimensionality reduction. The parameters of the model encode the task structure which is inferred from the demonstrations. In contrast to direct trajectory learning from demonstrations, many problems arise in robotic applications that require contextual understanding of the environment at a higher level. This requires learning invariant mappings in the demonstrations that can generalize across different environmental situations such as size, position, orientation of objects, and viewpoint of the observer. Recent trend in imitation leaning is forgoing such a task structure for end-to-end supervised learning which requires a large number of training demonstrations. c Springer Nature Switzerland AG 2020 M. Morales et al. (Eds.): WAFR 2018, SPAR 14, pp. 196–211, 2020. https://doi.org/10.1007/978-3-030-44051-0_12
Robot Imitation Learning with Hidden Semi-Markov Models
197
Fig. 1. Conceptual illustration of hidden semi-Markov model (HSMM) for imitation learning: (left) 3-dimensional Z-shaped demonstrations composed of 5 equally spaced trajectory samples, (middle) demonstrations are encoded with a 3 state HMM represented by Gaussians (shown as ellipsoids) that represent the blue, green and red segments respectively. The transition graph shows a duration model (Gaussian) next to each node, (right) the generative model is combined with linear quadratic tracking (LQT) to synthesize motion in performing robot manipulation tasks from 5 different initial conditions marked with orange squares (see also Fig. 2).
The focus of this paper is to learn the joint probability density function of the human demonstrations with a family of Hidden Markov Models (HMMs) in an unsupervised manner [21]. We combine tools from statistical machine learning and optimal control to segment the demonstrations into different components or sub-goals that are sequenced together to perform manipulation tasks in a smooth manner. We first present a simple algorithm for imitation learning that combines the decoded state sequence of a hidden semi-Markov model [21, 32] with a linear quadratic tracking controller to follow the demonstrated movement [2] (see Fig. 1). We then augment the model with a taskparameterized formulation such that it can be systematically adapted to changing situations such as pose/size of the objects in the environment [4, 25, 29]. We present latent space formulations of our approach to exploit the task structure using: (1) mixture of factor analyzers decomposition of the covariance matrix [15], (2) semi-tied covariance matrices of the mixture model [25], and (3) Bayesian non-parametric formulation of the model with Hierarchical Dirichlet process (HDP) for online learning under small variance asymptotics [26]. The paper unifies and extends our previous work on encoding manipulation skills in a task-adaptive manner [24–26]. Our objective is to reduce the number of demonstrations required for learning a new task, while ensuring effective generalization in new environmental situations. 1.1 Related Work Imitation learning provides a promising approach to facilitate robot learning in the most ‘natural’ way. The main challenges in imitation learning include [17]: (1) what-to-learn – acquiring meaningful features of the task from demonstrations for imitation, and (2) how-to-learn – learning a control policy from the features to reproduce the demonstrated behaviour. Imitation learning algorithms typically fall into behaviour cloning or inverse reinforcement learning (IRL) approaches. IRL aims to recover the unknown reward function that is being optimized in the demonstrations, while behaviour cloning
198
A. K. Tanwani et al.
approaches directly learn from human demonstrations in a supervised manner. Prominent approaches to imitation learning include Dynamic Movement Primitives [10], Generative Adversarial Imitation Learning [9], one-shot imitation learning [5] and so on [19]. This paper emphasizes learning manipulation skills from human demonstrations in an unsupervised manner using a family of hidden Markov models by sequencing the atomic movement segments or primitives. HMMs have been typically used for recognition and generation of movement skills in robotics [14]. Other related application contexts in imitation learning include options framework [11], sequencing primitives [16], temporal alignment [23], and neural task programs [7, 31]. A number of variants of HMMs have been proposed to address some of its shortcomings, including: (1) how to bias learning towards models with longer self-dwelling states, (2) how to robustly estimate the parameters with high-dimensional noisy data, (3) how to adapt the model with newly observed data, and (4) how to estimate the number of states that the model should possess. For example, [12] used HMMs to incrementally group whole-body motions based on their relative distance in HMM space. [14] presented an iterative motion primitive refinement approach with HMMs. [18] used the Beta Process Autoregressive HMM for learning from unstructured demonstrations. Figueroa et al. used the transformation invariant covariance matrix for encoding tasks with a Bayesian non-parametric HMM [6]. In this paper, we address these shortcomings with an algorithm that learns a hidden semi-Markov model [21, 32] from a few human demonstrations for segmentation, recognition, and synthesis of robot manipulation tasks (see Sect. 2). The algorithm observes the demonstrations with respect to different coordinate systems describing virtual landmarks or objects of interest, and adapts the model according to the environmental changes in a systematic manner in Sect. 3. Capturing such invariant representations allows us to compactly encode the task variations than using a standard regression problem. We present variants of the algorithm in latent space to exploit the task structure in Sect. 4. In Sect. 5, we show the application of our approach to learning a pick-and-place task from a few demonstrations, with an outlook to our future work.
2 Hidden Markov Models Hidden Markov models (HMMs) encapsulate the spatio-temporal information by augmenting a mixture model with latent states that sequentially evolve over time in the demonstrations [21]. HMM is thus defined as a doubly stochastic process, one with sequence of hidden states and another with sequence of observations/emissions. Spatiotemporal encoding with HMMs can handle movements with variable durations, recurring patterns, options in the movement, or partial/unaligned demonstrations. Without loss of generality, we will present our formulation with semi-Markov models for the remainder of the paper. Semi-Markov models relax the Markovian structure of state transitions by relying not only upon the current state but also on the duration/elapsed time in the current state, i.e., the underlying process is defined by a semi-Markov chain with a variable duration time for each state. The state duration stay is a random integer variable that assumes values in the set {1, 2, . . . , smax }. The value corresponds to the
Robot Imitation Learning with Hidden Semi-Markov Models
199
number of observations produced in a given state, before transitioning to the next state. Hidden Semi-Markov Models (HSMMs) associate an observable output distribution with each state in a semi-Markov chain [32], similar to how we associated a sequence of observations with a Markov chain in a HMM. Let {ξ t }Tt=1 denote the sequence of observations with ξ t ∈ RD collected while demonstrating a manipulation task. The state may represent the visual observation, kinesthetic data such as the pose and the velocities of the end-effector of the human arm, haptic information, or any arbitrary features defining the task variables of the environment. The observation sequence is associated with a hidden state sequence {zt }Tt=1 with zt ∈ {1 . . . K} belonging to the discrete set of K cluster indices. The cluster indices correspond to different segments of the task such as reach, grasp, move etc. We want to learn the joint probability density of the observation sequence and the hidden state sequence. The transition between one segment i to another segment j is denoted by the transition matrix a ∈ RK×K with ai,j P (zt = j|zt−1 = i). The parameters {μSj , ΣSj } represent the mean and the standard deviation of staying s consecutive time steps in state j as p(s) estimated by a Gaussian N (s|μSj , ΣSj ). The hidden state follows a categorical distribution with zt ∼ Cat(π zt−1 ) where π zt−1 ∈ RK is the next state transition distribution over state zt−1 with Πi as the initial probability, and the observation ξ t is drawn from the output distribution of state j, described by a multivariate Gaussian with parameters {μj , Σj }. The overall parameter set for an HSMM is defined K S S . by Πi , {ai,m }K m=1 , μi , Σi , μi , Σi i=1
2.1 Encoding with HSMM For learning and inference in a HMM [21], we make use of the intermediary variables HMM P (zt = i, ξ 1 . . . ξ t |θ): probability of a datapoint as: (1) forward variable, αt,i ξ t to be in state i at time step t given the partial observation sequence {ξ 1 , . . . , ξ t }, HMM P (ξ t+1 . . . ξ T |zt = i, θ): probability of the partial (2) backward variable, βt,i observation sequence {ξ t+1 , . . . , ξ T } given that we are in the i-th state at time step t, (3) smoothed node marginal γ HMM t,i P (zt = i|ξ 1 . . . ξ T , θ): probability of ξ t to be in state i at time step t given the full observation sequence ξ, and (4) smoothed edge HMM P (zt = i, zt+1 = j|ξ 1 . . . ξ T , θ): probability of ξ t to be in state marginal ζt,i,j i at time step t and in state j at time step t + 1 given the full observation sequence K ξ. Parameters Πi , {ai,m }K , μ , Σ are estimated using the EM algorithm for i i m=1 i=1
HMMs, and the duration parameters {μSi , ΣSi }K i=1 are estimated empirically from the data after training using the most likely hidden state sequence z t = {z1 . . . zT }. 2.2 Decoding from HSMM Given the learned model parameters, the probability of the observed sequence {ξ 1 . . . ξ t } to be in a hidden state zt = i at the end of the sequence (also known as filtering problem) is computed with the help of the forward variable as
200
A. K. Tanwani et al. HMM αt,i πi N (ξ t |μi , Σi ) P (zt | ξ 1 , . . . , ξ t ) = hHMM = K . t,i = K HMM k=1 αt,k k=1 πk N (ξ t |μk , Σk )
(1)
Sampling from the model for predicting the sequence of states over the next time horizon P (zt , zt+1 , . . . , zTp | ξ 1 , . . . , ξ t ) can be done in two ways: (1) stochastic sampling: the sequence of states is sampled in a probabilistic manner given the state duration and the state transition probabilities. By stochastic sampling, motions that contain different options and do not evolve only on a single path can also be represented. Starting from the initial state zt = i, the s duration steps are sampled from {μSi , ΣSi }, after which the next transition state is sampled zt+s+1 ∼ π zt+s . The procedure is repeated for the given time horizon in a receding horizon manner; (2) deterministic sampling: the most likely sequence of states is sampled and remains unchanged in successive sampling trials. We use the forward variable of HSMM for deterministic sampling from the HSMM P (zt = i, ξ 1 . . . ξ t |θ) requires marginalizing over model. The forward variable αt,i the duration steps along with all possible state sequences. The probability of a datapoint ξ t to be in state i at time step t given the partial observation sequence {ξ 1 , . . . , ξ t } is now specified as [32] HSMM
αt,i
=
min(smax ,t−1) K s=1
HSMM αt−s,j aj,i N (s|μSi , ΣSi )
t
N ξ c | μi , Σ i ,
(2)
c=t−s+1
j=1
HSMM where the initialization is given by α1,i = Πi N (1|μSi , ΣSi ) N ξ 1 | μi , Σi , and the output in state i is conditionally independent for the s duration steps given t distribution as c=t−s+1 N ξ c | μi , Σi . Note that for t < smax , the sum over duration steps is computed for t − 1 steps, instead of smax . Without the observation sequence for the next time steps, the forward variable simplifies to HSMM
αt,i
=
min(smax ,t−1) K s=1
HSMM αt−s,j aj,i N (s|μSi , ΣSi ).
(3)
j=1
The forward variable is used to plan the movement sequence for the next Tp steps with t = t + 1 . . . Tp . During prediction, we only use the transition matrix and the duration model to plan the future evolution of the initial/current state and omit the influence of the spatial data that we cannot observe, i.e., N (ξ t |μi , Σi ) = 1 for t > 1. This is used ˆ t ) from a given state sequence z t to retrieve a step-wise reference trajectory N (ˆ μt , Σ computed from the forward variable with, HSMM z t = {zt , . . . , zTp } = arg max αt,i ,
i
μ ˆ t = μzt ,
ˆ t = Σzt . Σ
(4)
Figure 2 shows a conceptual representation of the step-wise sequence of states generated by deterministically sampling from HSMM encoding of the Z-shaped data. In the next section, we show how to synthesise robot movement from this step-wise sequence of states in a smooth manner.
Robot Imitation Learning with Hidden Semi-Markov Models
201
Fig. 2. Sampling from HSMM from an unseen initial state ξ 0 over the next time horizon and ˆ t ) with a linear quadratic tracking tracking the step-wise desired sequence of states N (ˆ μt , Σ controller. Note that this converges although ξ 0 was not previously encountered.
2.3 Motion Generation with Linear Quadratic Tracking We formulate the motion generation problem given the step-wise desired sequence of ˆ t )}Tp as sequential optimization of a scalar cost function with a linear states {N (ˆ μt , Σ t=1 quadratic tracker (LQT) [2]. The control policy ut at each time step is obtained by minimizing the cost function over the finite time horizon Tp , ct (ξ t , ut ) =
Tp t=1
s.t.
(ξ t − μ ˆ t ) Qt (ξ t − μ ˆ t ) + ut Rt ut ,
(5)
ξ t+1 = Ad ξ t + B d ut ,
starting from the initial state ξ 1 and following the discrete linear dynamical system specified by Ad and B d . We consider a linear time-invariant double integrator system to describe the system dynamics. Alternatively, a time-varying linearization of the system dynamics along the reference trajectory can also be used to model the system dynamics without loss of generality. Both discrete and continuous time linear quadratic regulator/tracker can be used to follow the desired trajectory. The discrete time formulation, however, gives numerically stable results for a wide range of values of R. The control law u∗t that minimizes the cost function in Eq. (5) under finite horizon subject to the linear dynamics in discrete time is given as, μt − ξ t ) + uFF u∗t = K t (ˆ t ,
(6)
where K t = [K Pt , K Vt ] are the full stiffness and damping matrices for the feedback term, and uFF t is the feedforward term. Figure 2 shows the results of applying discrete
202
A. K. Tanwani et al.
Fig. 3. Task-parameterized formulation of HSMM: four demonstrations on left are observed from two coordinate systems that define the start and end position of the demonstration (starting in purple position and ending in green position for each demonstration). The generative model is learned in the respective coordinate systems. The model parameters in respective coordinate systems are adapted to the new unseen object positions by computing the products of linearly transformed Gaussian mixture components. The resulting HSMM is combined with LQT for smooth retrieval of manipulation tasks.
LQT on the desired step-wise sequence of states sampled from an HSMM encoding of the Z-shaped demonstrations. Note that the gains can be precomputed before simulating the system if the reference trajectory does not change during the reproduction of the ˆt task. The resulting trajectory ξ ∗t smoothly tracks the step-wise reference trajectory μ and the gains K Pt , K Vt locally stabilize ξ t along ξ ∗t in accordance with the precision required during the task.
3 Invariant Task-Parameterized HSMMs Conventional approaches to encode task variations such as change in the pose of the object is to augment the state of the environment with the policy parameters [20]. Such an encoding, however, does not capture the geometric structure of the problem. Our approach exploits the problem structure by introducing the task parameters in the form of coordinate systems that observe the demonstrations from multiple perspectives. Taskparameterization enables the model parameters to adapt in accordance with the external task parameters that describe the environmental situation, instead of hard coding the solution for each new situation or handling it in an ad hoc manner [29]. When a different situation occurs (pose of the object changes), changes in the task parameters/reference frames are used to modulate the model parameters in order to adapt the robot movement to the new situation. 3.1
Model Learning
We represent the task parameters with F coordinate systems, defined by {Aj , bj }F j=1 , where Aj denotes the orientation of the frame as a rotation matrix and bj represents the origin of the frame. We assume that the coordinate frames are specified by the user, based on prior knowledge about the carried out task. Typically, coordinate frames will be attached to objects, tools or locations that could be relevant in the execution of a task. Each datapoint ξ t is observed from the viewpoint of F different experts/frames,
Robot Imitation Learning with Hidden Semi-Markov Models
203
with $\xi_t^{(j)} = A_j^{-1}(\xi_t - b_j)$ denoting the datapoint observed with respect to frame $j$. The parameters of the task-parameterized HSMM are defined by

$$\theta = \left\{ \left\{ \mu_i^{(j)}, \Sigma_i^{(j)} \right\}_{j=1}^{F}, \left\{ a_{i,m} \right\}_{m=1}^{K}, \mu_i^{S}, \Sigma_i^{S} \right\}_{i=1}^{K},$$

where $\mu_i^{(j)}$ and $\Sigma_i^{(j)}$ define the mean and the covariance matrix of the $i$-th mixture component in frame $j$. Parameter updates of the task-parameterized HSMM algorithm remain the same as for the HSMM, except that the computation of the mean and the covariance matrix is repeated for each coordinate system separately. The emission distribution of the $i$-th state is represented by the product of the probabilities of the datapoint belonging to the $i$-th Gaussian in the corresponding $j$-th coordinate systems. The forward variable of the HMM in the task-parameterized formulation is described as

$$\alpha_{t,i}^{\text{TP-HMM}} = \sum_{j=1}^{K} \alpha_{t-1,j}^{\text{HMM}}\, a_{j,i} \prod_{j=1}^{F} \mathcal{N}\!\left(\xi_t^{(j)} \,\middle|\, \mu_i^{(j)}, \Sigma_i^{(j)}\right). \tag{7}$$

Similarly, the backward variable $\beta_{t,i}^{\text{TP-HMM}}$, the smoothed node marginal $\gamma_{t,i}^{\text{TP-HMM}}$, and the smoothed edge marginal $\zeta_{t,i,j}^{\text{TP-HMM}}$ can be computed by replacing the emission distribution $\mathcal{N}(\xi_t \mid \mu_i, \Sigma_i)$ with the product of probabilities of the datapoint in each frame, $\prod_{j=1}^{F} \mathcal{N}\big(\xi_t^{(j)} \mid \mu_i^{(j)}, \Sigma_i^{(j)}\big)$. The duration model $\mathcal{N}(s \mid \mu_i^{S}, \Sigma_i^{S})$ is used as a replacement for the self-transition probabilities $a_{i,i}$. The hidden state sequence over all demonstrations is used to define the duration model parameters $\{\mu_i^{S}, \Sigma_i^{S}\}$ as the mean and the standard deviation of staying $s$ consecutive time steps in the $i$-th state.
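As an illustration of Eq. (7), the sketch below computes the rescaled forward variable with the product-of-frames emission. It implements the HMM variant (without the duration model) and assumes the local means `mu[j][i]` and covariances `sigma[j][i]` are indexed first by frame; these conventions are assumptions of the sketch, not of the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def tp_forward_variable(pi, a, mu, sigma, xi, A_frames, b_frames):
    # Rescaled forward pass of Eq. (7): the emission probability of state i
    # is the product over the F frames of the likelihood of the locally
    # projected datapoint xi^(j) = A_j^{-1} (xi_t - b_j).
    T, K, F = len(xi), len(pi), len(A_frames)
    alpha = np.zeros((T, K))
    for t in range(T):
        emission = np.ones(K)
        for j in range(F):
            xi_j = np.linalg.solve(A_frames[j], xi[t] - b_frames[j])
            for i in range(K):
                emission[i] *= multivariate_normal.pdf(xi_j, mu[j][i], sigma[j][i])
        alpha[t] = pi * emission if t == 0 else (alpha[t - 1] @ a) * emission
        alpha[t] /= alpha[t].sum()      # rescaling for numerical stability
    return alpha
```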
3.2 Model Adaptation in New Situations

In order to combine the output of the coordinate frames of reference for an unseen situation represented by the frames $\{\tilde{A}_j, \tilde{b}_j\}_{j=1}^{F}$, we linearly transform the Gaussians back to the global coordinates with $\{\tilde{A}_j, \tilde{b}_j\}_{j=1}^{F}$, and retrieve the new model parameters $\{\tilde{\mu}_i, \tilde{\Sigma}_i\}$ of the $i$-th mixture component by computing the products of the linearly transformed Gaussians (see Fig. 3),

$$\mathcal{N}(\tilde{\mu}_i, \tilde{\Sigma}_i) \propto \prod_{j=1}^{F} \mathcal{N}\!\left(\tilde{A}_j \mu_i^{(j)} + \tilde{b}_j,\; \tilde{A}_j \Sigma_i^{(j)} \tilde{A}_j^{\top}\right). \tag{8}$$

Evaluating this product of Gaussians yields the observation distribution of the HSMM, whose output sequence is decoded and combined with LQT for smooth motion generation as shown in the previous section:

$$\tilde{\Sigma}_i = \left( \sum_{j=1}^{F} \left( \tilde{A}_j \Sigma_i^{(j)} \tilde{A}_j^{\top} \right)^{-1} \right)^{-1}, \qquad \tilde{\mu}_i = \tilde{\Sigma}_i \sum_{j=1}^{F} \left( \tilde{A}_j \Sigma_i^{(j)} \tilde{A}_j^{\top} \right)^{-1} \left( \tilde{A}_j \mu_i^{(j)} + \tilde{b}_j \right). \tag{9}$$
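A minimal sketch of this adaptation step: each frame's local Gaussian is mapped into world coordinates with the new task parameters, and the product of the $F$ Gaussians is evaluated in closed form via the sum of precisions, exactly as in Eq. (9). Function and variable names are illustrative.

```python
import numpy as np

def adapt_gaussian(mu_local, sigma_local, A_new, b_new):
    # Eqs. (8)-(9): map each frame's local Gaussian into world coordinates
    # with the new frame pose, then take the product of the F Gaussians in
    # closed form (sum of precisions, precision-weighted mean).
    D = mu_local[0].shape[0]
    precision, info_mean = np.zeros((D, D)), np.zeros(D)
    for mu_j, sigma_j, A_j, b_j in zip(mu_local, sigma_local, A_new, b_new):
        mu_world = A_j @ mu_j + b_j                 # transformed mean
        sigma_world = A_j @ sigma_j @ A_j.T         # transformed covariance
        lam = np.linalg.inv(sigma_world)
        precision += lam
        info_mean += lam @ mu_world
    sigma_tilde = np.linalg.inv(precision)
    return sigma_tilde @ info_mean, sigma_tilde     # (mu_tilde, sigma_tilde)
```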
Fig. 4. Parameter representation of the diagonal, full, and mixture-of-factor-analyzers decompositions of the covariance matrix. Filled blocks represent non-zero entries.
4 Latent Space Representations

Dimensionality reduction has long been recognized as a fundamental problem in unsupervised learning. Model-based generative models such as HSMMs tend to suffer from the curse of dimensionality when few datapoints are available. To address this problem, we use statistical subspace clustering methods that reduce the number of parameters that must be robustly estimated. A simple way to reduce the number of parameters is to constrain the covariance structure to a diagonal or spherical/isotropic matrix, at the cost of treating each dimension separately. Such decoupling, however, cannot encode the important motor control principles of coordination, synergies and action-perception couplings [30]. Consequently, we seek a latent feature space in the high-dimensional data that reduces the number of model parameters to be robustly estimated. We consider three formulations to this end: (1) low-rank decomposition of the covariance matrix using the Mixture of Factor Analyzers (MFA) approach [15]; (2) partial tying of the covariance matrices of the mixture model, which share the same set of basis vectors, albeit with different scales, via semi-tied covariance matrices [8, 25]; and (3) Bayesian non-parametric sequence clustering under small variance asymptotics [13, 22, 26]. All the decompositions can readily be combined with the invariant task-parameterized HSMM and LQT for encapsulating reactive autonomous behaviour as shown in the previous section.

4.1 Mixture of Factor Analyzers
The basic idea of MFA is to perform subspace clustering by assuming a covariance structure for each component of the form

$$\Sigma_i = \Lambda_i \Lambda_i^{\top} + \Psi_i, \tag{10}$$
where $\Lambda_i \in \mathbb{R}^{D \times d}$ is the factor loadings matrix with $d < D$ for a parsimonious representation of the data, and $\Psi_i$ is the diagonal noise matrix (see Fig. 4 for the MFA representation in comparison to a diagonal and a full covariance matrix). Note that the mixture of probabilistic principal component analyzers (MPPCA) model is a special case of MFA with the distribution of the errors assumed to be isotropic, $\Psi_i = I\sigma_i^2$ [28]. The MFA model assumes that $\xi_t$ is generated by a linear transformation of a $d$-dimensional vector of latent (unobserved) factors $f_t$,

$$\xi_t = \Lambda_i f_t + \mu_i + \epsilon, \tag{11}$$

where $\mu_i \in \mathbb{R}^{D}$ is the mean vector of the $i$-th factor analyzer, $f_t \sim \mathcal{N}(0, I)$ is a normally distributed factor, and $\epsilon \sim \mathcal{N}(0, \Psi_i)$ is zero-mean Gaussian noise with diagonal covariance $\Psi_i$. The diagonal assumption implies that the observed variables are independent given the factors.
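A small numerical sketch of Eqs. (10)-(11), with illustrative dimensions; it builds the low-rank-plus-diagonal covariance and draws one sample from the generative model.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 16, 4                                 # observed and latent dimensions
Lambda = rng.standard_normal((D, d))         # factor loadings, d < D
Psi = np.diag(rng.uniform(0.01, 0.1, D))     # diagonal noise matrix

Sigma = Lambda @ Lambda.T + Psi              # Eq. (10): low-rank + diagonal

# Generative view of Eq. (11): xi = Lambda f + mu + eps.
mu = rng.standard_normal(D)
f = rng.standard_normal(d)                               # f ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(D), Psi)          # eps ~ N(0, Psi)
xi = Lambda @ f + mu + eps

# Covariance parameters per component: D*d + D instead of D*(D+1)/2.
```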
Note that the subspace of each cluster need not be spanned by orthogonal vectors, whereas orthogonality is a necessary condition in models based on eigendecomposition such as PCA. Each covariance matrix of the mixture has its own subspace spanned by the basis vectors of $\Sigma_i$. As the number of components increases to encode more complex skills, an increasingly large number of potentially redundant parameters is used to fit the data. Consequently, there is a need to share the basis vectors across the mixture components, as shown below by semi-tying the covariance matrices of the mixture model.

4.2 Semi-tied Mixture Model

When the covariance matrices of the mixture model share the same set of parameters for the latent feature space, we call the model a semi-tied mixture model [25]. The main idea behind semi-tied mixture models is to decompose the covariance matrix $\Sigma_i$ into two terms: a common latent feature matrix $H \in \mathbb{R}^{D \times D}$ and a component-specific diagonal matrix $\Sigma_i^{(\text{diag})} \in \mathbb{R}^{D \times D}$, i.e.,

$$\Sigma_i = H \Sigma_i^{(\text{diag})} H^{\top}. \tag{12}$$
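A minimal numerical sketch of the shared-basis structure in Eq. (12): all components reuse one latent feature matrix $H$ and differ only in their diagonal scalings. The random values are placeholders for the quantities learned by the EM updates.

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 16, 7
H = rng.standard_normal((D, D))              # shared non-orthogonal basis
scales = rng.uniform(0.0, 1.0, (K, D))       # component-specific diagonals

# Eq. (12): every component reuses the same H; only the diagonal scaling
# differs, so each Sigma_i selects a subspace from one global set of
# basis vectors.
Sigmas = [H @ np.diag(scales[i]) @ H.T for i in range(K)]

# Tied parameter count: D*D for H plus K*D diagonals, versus K*D*(D+1)/2
# for K full covariance matrices.
```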
The latent feature matrix encodes the locally important synergistic directions, represented by $D$ non-orthogonal basis vectors that are shared across all the mixture components, while the diagonal matrix selects the appropriate subspace of each mixture component as a convex combination of a subset of the basis vectors of $H$. Note that the eigendecomposition $\Sigma_i = U_i \Sigma_i^{(\text{diag})} U_i^{\top}$ contains $D$ basis vectors of $\Sigma_i$ in $U_i$. In comparison, the semi-tied mixture model gives $D$ globally representative basis vectors that are shared across all the mixture components. The parameters $H$ and $\Sigma_i^{(\text{diag})}$ are updated in closed form with the EM updates of the HSMM [8]. The underlying hypothesis in semi-tying the model parameters is that similar coordination patterns occur at different phases of a manipulation task. By exploiting the spatial and temporal correlation in the demonstrations, we reduce the number of parameters to be estimated while locking the most important synergies to cope with perturbations. This allows the discovered synergies to be reused in different parts of the task that have similar coordination patterns. In contrast, the MFA decomposition of each covariance matrix separately cannot exploit the temporal synergies, but has more flexibility in locally encoding the data.

4.3 Bayesian Non-parametrics Under Small Variance Asymptotics

Specifying the number of latent states in a mixture model is often difficult. Model selection methods such as cross-validation or the Bayesian Information Criterion (BIC) are typically used to determine the number of states. Bayesian non-parametric approaches comprising Hierarchical Dirichlet Processes (HDPs) provide a principled model selection procedure by Bayesian inference in an HMM with an infinite number of states [27].
Fig. 5. Bayesian non-parametric clustering of Z-shaped streaming data under small variance asymptotics with: (left) online DP-GMM, (right) online DP-MPPCA. Note that the number of clusters and the subspace dimension of each cluster are adapted in a non-parametric manner.
These approaches provide flexibility in model selection; however, their widespread use is limited by the computational overhead of existing sampling-based and variational techniques for inference. We take a small variance asymptotics approximation of the Bayesian non-parametric model that collapses the posterior to a simple deterministic model, while retaining the non-parametric characteristics of the algorithm. Small variance asymptotic (SVA) analysis implies that the covariance matrix $\Sigma_i$ of all the Gaussians is set to the isotropic noise $\sigma^2$, i.e., $\Sigma_i \approx \lim_{\sigma^2 \to 0} \sigma^2 I$, in the likelihood function and the prior distribution [3, 13]. The analysis yields simple deterministic models, while retaining the non-parametric nature. For example, SVA analysis of the Bayesian non-parametric GMM leads to the DP-means algorithm [13]. Similarly, SVA analysis of the Bayesian non-parametric HMM under a Hierarchical Dirichlet Process (HDP) yields the segmental k-means problem [22]. Restricting the covariance matrix to isotropic/spherical noise, however, fails to encode the coordination patterns in the demonstrations. Consequently, we model the covariance matrix in its intrinsic affine subspace of dimension $d_i$ with projection matrix $\Lambda_i^{d_i} \in \mathbb{R}^{D \times d_i}$, such that $d_i < D$ and $\Sigma_i = \lim_{\sigma^2 \to 0} \Lambda_i^{d_i} \Lambda_i^{d_i \top} + \sigma^2 I$ (akin to the DP-MPPCA model). Under this assumption, we apply the small variance asymptotic limit on the remaining $(D - d_i)$ dimensions to encode the most important coordination patterns while being parsimonious in the number of parameters (see Fig. 5). Performing small-variance asymptotics of the joint likelihood of the HDP-HMM yields maximum a posteriori estimates of the parameters by iteratively minimizing the loss function¹

$$\mathcal{L}(z, d, \mu, U, a) = \sum_{t=1}^{T} \mathrm{dist}\!\left(\xi_t, \mu_{z_t}, U_{z_t}^{d}\right)^2 + \lambda (K - 1) + \lambda_1 \sum_{i=1}^{K} d_i - \lambda_2 \sum_{t=1}^{T-1} \log(a_{z_t, z_{t+1}}) + \lambda_3 \sum_{i=1}^{K} (\tau_i - 1),$$

where $\mathrm{dist}(\xi_t, \mu_{z_t}, U_{z_t}^{d})^2$ represents the distance of the datapoint $\xi_t$ to the subspace of cluster $z_t$ defined by the mean $\mu_{z_t}$ and the unit eigenvectors of the covariance matrix $U_{z_t}^{d}$.

¹ Setting $d_i = 0$ by choosing $\lambda_1 \gg 0$ gives the loss function formulation with an isotropic Gaussian under small variance asymptotics [22].
Fig. 6. (left) Baxter robot picks the glass plate with a suction lever and places it on the cross after avoiding an obstacle of varying height, (centre-left) reproduction for a previously unseen object and obstacle position, (centre-right) left-right HSMM encoding of the task with the duration model shown next to each state (smax = 100), (right) evolution of the rescaled forward variable over time.
The algorithm optimizes the number of clusters and the subspace dimension of each cluster while minimizing the distance of the datapoints to the respective subspaces of each cluster. The $\lambda_2$ term favours transitions to states with higher transition probability (states which have been visited more often before), $\lambda_3$ penalizes transitions to unvisited states, with $\tau_i$ denoting the number of distinct transitions out of state $i$, while $\lambda$ and $\lambda_1$ are the penalty terms for increasing the number of states and the subspace dimension of each output state distribution, respectively. The analysis is used here for scalable online sequence clustering that is non-parametric in the number of clusters and the subspace dimension of each cluster. The resulting algorithm groups the data in its low-dimensional subspace with a non-parametric mixture of probabilistic principal component analyzers based on a Dirichlet process, and captures the state transition and state duration information in an HDP-HSMM. The cluster assignment and the parameter updates at each iteration minimize the loss function, thereby increasing the model fitness while penalizing new transitions, new dimensions and/or new clusters. The interested reader can find more details of the algorithm in [26].
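The objective above can be evaluated directly for a candidate segmentation. The sketch below does so under the assumptions that the state sequence $z$ is a hard assignment and each $U_i$ has orthonormal columns; it is an illustration of the loss, not the paper's optimization algorithm.

```python
import numpy as np

def sva_loss(xi, z, mus, Us, a, lam, lam1, lam2, lam3):
    # Evaluates the small-variance-asymptotics objective: subspace
    # distances plus penalties on the number of states (lam), subspace
    # dimensions (lam1), unlikely transitions (lam2) and distinct
    # outgoing transitions tau_i (lam3).
    T, K = len(xi), len(mus)
    dist_term = 0.0
    for t in range(T):
        r = xi[t] - mus[z[t]]
        U = Us[z[t]]                         # D x d_i orthonormal eigenvectors
        r_perp = r - U @ (U.T @ r)           # residual outside the subspace
        dist_term += r_perp @ r_perp
    trans_term = -lam2 * sum(np.log(a[z[t], z[t + 1]]) for t in range(T - 1))
    tau = [len({z[t + 1] for t in range(T - 1) if z[t] == i}) for i in range(K)]
    dims = sum(U_i.shape[1] for U_i in Us)
    return (dist_term + lam * (K - 1) + lam1 * dims + trans_term
            + lam3 * sum(tau_i - 1 for tau_i in tau))
```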
5 Experiments, Results and Discussion

We now show how our proposed work enables a Baxter robot to learn a pick-and-place task from a few human demonstrations. The objective of the task is to place the object at a desired target position by picking it up from different initial poses, while adapting the movement to avoid an obstacle. The setup of the pick-and-place task with obstacle avoidance is shown in Fig. 6. The Baxter robot is required to grasp the glass plate with a suction lever placed in an initial configuration as marked on the setup. The obstacle can be vertically displaced to one of 8 target configurations. We describe the task with two frames: one frame for the object's initial configuration with $\{A_1, b_1\}$, and the other frame for the obstacle with $\{A_2, b_2\}$, where $A_2 = I$ and $b_2$ specifies the centre of the obstacle. We collect 8 kinesthetic demonstrations with different initial configurations of the object and with the obstacle successively displaced upwards as marked with the visual tags in the figure. Alternate demonstrations are used for the training set, while the rest are used for the test set. Each observation comprises the end-effector Cartesian
position, quaternion orientation, gripper status (open/closed), linear velocity, quaternion derivative, and gripper status derivative, with $D = 16$, $P = 2$, and a total of 200 datapoints per demonstration. We represent the frame $\{A_1, b_1\}$ as

$$A_1^{(n)} = \begin{bmatrix} R_1^{(n)} & 0 & 0 & 0 & 0 \\ 0 & E_1^{(n)} & 0 & 0 & 0 \\ 0 & 0 & R_1^{(n)} & 0 & 0 \\ 0 & 0 & 0 & E_1^{(n)} & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}, \qquad b_1^{(n)} = \begin{bmatrix} p_1^{(n)} \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \tag{13}$$

where $p_1^{(n)} \in \mathbb{R}^{3}$, $R_1^{(n)} \in \mathbb{R}^{3 \times 3}$ and $E_1^{(n)} \in \mathbb{R}^{4 \times 4}$ denote the Cartesian position, the rotation matrix and the quaternion matrix in the $n$-th demonstration, respectively. Note that we do not consider time as an explicit variable, as the duration model in the HSMM encapsulates the timing information locally. The settings in our experiments are as follows: $\{\pi_i, \mu_i, \Sigma_i\}_{i=1}^{K}$ are initialized using the k-means clustering algorithm; $R = 9I$, where $I$ is the identity matrix; and learning converges when the difference in log-likelihood between successive iterations is less than $1 \times 10^{-4}$.

Fig. 7. Task-parameterized HSMM performance on the pick-and-place with obstacle avoidance task: (top) training set reproductions, (bottom) testing set reproductions.

Results of regenerating the movements with 7 mixture components are shown in Fig. 7. For a given initial configuration of the object, the model parameters are adapted by evaluating the product of Gaussians for the new frame configuration. The reference trajectory is then computed from the initial position of the robot arm using the forward variable of the HSMM, and tracked using LQT. The robot arm moves from its initial configuration to align itself with the first frame $\{A_1, b_1\}$ to grasp the object, then follows the movement to avoid the obstacle, aligns with the second frame $\{A_2, b_2\}$ to place the object, and returns to a neutral position. The model exploits the variability in the observed demonstrations to statistically encode the different phases of the task, such as reach, grasp, move, place and return.
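To make the frame construction in Eq. (13) concrete, the following sketch assembles the block-diagonal task parameters from a position, rotation matrix and quaternion. The quaternion-matrix construction and the collapsed final identity block (standing in for the frame-independent dimensions) are assumptions of this sketch, not details confirmed by the paper.

```python
import numpy as np
from scipy.linalg import block_diag

def quaternion_matrix(q):
    # One standard matrix form E of a unit quaternion q = (w, x, y, z);
    # an assumption of this sketch.
    w, x, y, z = q
    return np.array([[w, -x, -y, -z],
                     [x,  w, -z,  y],
                     [y,  z,  w, -x],
                     [z, -y,  x,  w]])

def object_frame(p, R, q):
    # Eq. (13): block-diagonal task parameters acting on position,
    # orientation, linear velocity and quaternion derivative, with a
    # final identity block for the frame-independent dimensions.
    E = quaternion_matrix(q)
    A1 = block_diag(R, E, R, E, np.eye(1))
    b1 = np.concatenate([p, np.zeros(A1.shape[0] - 3)])
    return A1, b1
```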
The imposed structure with task parameters and the HSMM allows us to acquire a new task from a few human demonstrations, and to generalize effectively in picking and placing the object.

Fig. 8. Latent space representations of the invariant task-parameterized HSMM for a randomly chosen demonstration from the test set. Black dotted lines show the human demonstration, while the grey line shows the reproduction from the model.

Table 1. Performance analysis of invariant hidden Markov models with training MSE, testing MSE, and number of parameters for the pick-and-place task. MSE (in meters) is computed between the demonstrated trajectories and the generated trajectories (lower is better). Latent space formulations give comparable task performance with far fewer parameters.

Pick-and-place via obstacle avoidance (K = 7, F = 2, D = 16)

| Model | Training MSE | Testing MSE | Number of parameters |
|---|---|---|---|
| HSMM | 0.0026 ± 0.0009 | 0.0140 ± 0.0085 | 2198 |
| Semi-tied HSMM | 0.0033 ± 0.0016 | 0.0131 ± 0.0077 | 1030 |
| MFA HSMM (dk = 1) | 0.0037 ± 0.0011 | 0.0109 ± 0.0068 | 742 |
| MFA HSMM (dk = 4) | 0.0025 ± 0.0007 | 0.0119 ± 0.0077 | 1414 |
| MFA HSMM (dk = 7) | 0.0023 ± 0.0009 | 0.0123 ± 0.0084 | 2086 |
| SVA HDP HSMM (K = 8, d̄k = 3.94) | 0.0073 ± 0.0024 | 0.0149 ± 0.0072 | 1352 |

Table 1 evaluates the performance of the invariant task-parameterized HSMM with latent space representations. We observe a significant reduction in the number of model parameters, while achieving better generalization on unseen situations compared to the task-parameterized HSMM with full covariance matrices (see Fig. 8 for a comparison across models). The MFA decomposition gives the best performance on the test set with far fewer parameters.
6 Conclusions

Learning from demonstrations is a promising approach to teach manipulation skills to robots. In contrast to deep learning approaches that require extensive training data, generative mixture models are useful for learning from a few examples that are not explicitly labelled. The presented formulations are motivated by the need to make generative mixture models easy to use for robot learning in a variety of applications, while requiring considerably less learning time.
We have presented formulations for learning invariant task representations with hidden semi-Markov models for recognition, prediction and reproduction of manipulation tasks, along with learning in latent space representations for robust parameter estimation of mixture models with high-dimensional data. By sampling the sequence of states from the model and following them with a linear quadratic tracking controller, we are able to perform manipulation tasks autonomously and smoothly. This has enabled a Baxter robot to solve a pick-and-place via obstacle avoidance problem from previously unseen configurations of the environment. A relevant direction for future work is to avoid specifying the task parameters manually, and instead infer generalized task representations from videos of the demonstrations when learning the invariant segments. Moreover, learning the task model from a small set of labelled demonstrations in a semi-supervised manner is an important aspect of extracting meaningful segments from demonstrations.

Acknowledgements. This work was, in large part, carried out at the Idiap Research Institute and Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland. This work was in part supported by the DexROV project through the EC Horizon 2020 programme (Grant 635491), and the NSF National Robotics Initiative Award 1734633 on Scalable Collaborative Human-Robot Learning (SCHooL). The information, data, comments, and views detailed herein may not necessarily reflect the endorsements of the sponsors.
References

1. Argall, B.D., Chernova, S., Veloso, M., Browning, B.: A survey of robot learning from demonstration. Robot. Auton. Syst. 57(5), 469–483 (2009)
2. Borrelli, F., Bemporad, A., Morari, M.: Predictive Control for Linear and Hybrid Systems. Cambridge University Press, Cambridge (2011)
3. Broderick, T., Kulis, B., Jordan, M.I.: MAD-Bayes: MAP-based asymptotic derivations from Bayes. In: International Conference on Machine Learning, ICML, pp. 226–234 (2013)
4. Calinon, S.: A tutorial on task-parameterized movement learning and retrieval. Intell. Serv. Robot. 9(1), 1–29 (2016)
5. Duan, Y., Andrychowicz, M., Stadie, B.C., Ho, J., Schneider, J., Sutskever, I., Abbeel, P., Zaremba, W.: One-shot imitation learning. CoRR, abs/1703.07326 (2017)
6. Figueroa, N., Billard, A.: Transform-invariant non-parametric clustering of covariance matrices and its application to unsupervised joint segmentation and action discovery. CoRR, abs/1710.10060 (2017)
7. Fox, R., Shin, R., Krishnan, S., Goldberg, K., Song, D., Stoica, I.: Parametrized hierarchical procedures for neural programming. In: The International Conference on Learning Representations, ICLR 2018 (2018)
8. Gales, M.J.F.: Semi-tied covariance matrices for hidden Markov models. IEEE Trans. Speech Audio Process. 7(3), 272–281 (1999)
9. Ho, J., Ermon, S.: Generative adversarial imitation learning. CoRR, abs/1606.03476 (2016)
10. Ijspeert, A., Nakanishi, J., Pastor, P., Hoffmann, H., Schaal, S.: Dynamical movement primitives: learning attractor models for motor behaviors. Neural Comput. 25, 328–373 (2013)
11. Krishnan, S., Fox, R., Stoica, I., Goldberg, K.: DDCO: discovery of deep continuous options for robot learning from demonstrations. CoRR (2017)
12. Kulic, D., Takano, W., Nakamura, Y.: Incremental learning, clustering and hierarchy formation of whole body motion patterns using adaptive hidden Markov chains. Int. J. Robot. Res. 27(7), 761–784 (2008)
13. Kulis, B., Jordan, M.I.: Revisiting k-means: new algorithms via Bayesian nonparametrics. In: International Conference on Machine Learning, ICML, pp. 513–520 (2012)
14. Lee, D., Ott, C.: Incremental motion primitive learning by physical coaching using impedance control. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Taipei, Taiwan, pp. 4133–4140, October 2010
15. McLachlan, G.J., Peel, D., Bean, R.W.: Modelling high-dimensional data by mixtures of factor analyzers. Comput. Stat. Data Anal. 41(3–4), 379–388 (2003)
16. Medina, J.R., Billard, A.: Learning stable task sequences from demonstration with linear parameter varying systems and hidden Markov models. In: Conference on Robot Learning (CoRL) (2017)
17. Nehaniv, C.L., Dautenhahn, K. (eds.): Imitation and Social Learning in Robots, Humans, and Animals: Behavioural, Social and Communicative Dimensions. Cambridge University Press, Cambridge (2004)
18. Niekum, S., Osentoski, S., Konidaris, G., Barto, A.G.: Learning and generalization of complex tasks from unstructured demonstrations. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5239–5246 (2012)
19. Osa, T., Pajarinen, J., Neumann, G., Bagnell, A., Abbeel, P., Peters, J.: An Algorithmic Perspective on Imitation Learning. Now Publishers Inc. (2018)
20. Paraschos, A., Daniel, C., Peters, J.R., Neumann, G.: Probabilistic movement primitives. In: Advances in Neural Information Processing Systems 26, pp. 2616–2624 (2013)
21. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–285 (1989)
22. Roychowdhury, A., Jiang, K., Kulis, B.: Small-variance asymptotics for hidden Markov models. In: Advances in Neural Information Processing Systems 26, pp. 2103–2111. Curran Associates, Inc. (2013)
23. Shiarlis, K., Wulfmeier, M., Salter, S., Whiteson, S., Posner, I.: TACO: learning task decomposition via temporal alignment for control. In: International Conference on Machine Learning, ICML 2018 (2018)
24. Tanwani, A.K.: Generative models for learning robot manipulation skills from humans. Ph.D. thesis, Ecole Polytechnique Federale de Lausanne, Switzerland (2018)
25. Tanwani, A.K., Calinon, S.: Learning robot manipulation tasks with task-parameterized semi-tied hidden semi-Markov model. IEEE Robot. Autom. Lett. 1(1), 235–242 (2016)
26. Tanwani, A.K., Calinon, S.: Small-variance asymptotics for non-parametric online robot learning. Int. J. Robot. Res. 38(1), 3–22 (2019)
27. Teh, Y.W., Jordan, M.I., Beal, M.J., Blei, D.M.: Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101(476), 1566–1581 (2006)
28. Tipping, M.E., Bishop, C.M.: Mixtures of probabilistic principal component analyzers. Neural Comput. 11(2), 443–482 (1999)
29. Wilson, A.D., Bobick, A.F.: Parametric hidden Markov models for gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell. 21(9), 884–900 (1999)
30. Wolpert, D.M., Diedrichsen, J., Flanagan, J.R.: Principles of sensorimotor learning. Nat. Rev. Neurosci. 12, 739–751 (2011)
31. Xu, D., Nair, S., Zhu, Y., Gao, J., Garg, A., Fei-Fei, L., Savarese, S.: Neural task programming: learning to generalize across hierarchical tasks. CoRR, abs/1710.01813 (2017)
32. Yu, S.-Z.: Hidden semi-Markov models. Artif. Intell. 174, 215–243 (2010)
A Dynamic Regret Analysis and Adaptive Regularization Algorithm for On-Policy Robot Imitation Learning

Jonathan N. Lee, Michael Laskey, Ajay Kumar Tanwani, Anil Aswani, and Ken Goldberg

University of California, Berkeley, USA
{jonathan lee,laskeymd,ajay.tanwani,aaswani,goldberg}@berkeley.edu
Abstract. On-policy imitation learning algorithms such as Dagger evolve a robot control policy by executing it, measuring performance (loss), obtaining corrective feedback from a supervisor, and generating the next policy. As the loss between iterations can vary unpredictably, a fundamental question is under what conditions this process will eventually achieve a converged policy. If one assumes the underlying trajectory distribution is static (stationary), it is possible to prove convergence for Dagger. Cheng and Boots (2018) consider the more realistic model for robotics where the underlying trajectory distribution, which is a function of the policy, is dynamic, and show that it is possible to prove convergence when a condition on the rate of change of the trajectory distributions is satisfied. In this paper, we reframe that result using dynamic regret theory from the field of Online Optimization to prove convergence to locally optimal policies for Dagger, Imitation Gradient, and Multiple Imitation Gradient. These results inspire a new algorithm, Adaptive On-Policy Regularization (AOR), that ensures the conditions for convergence. We present simulation results with cart-pole balancing and walker locomotion benchmarks that suggest AOR can significantly decrease dynamic regret and chattering. To our knowledge, this is the first application of dynamic regret theory to imitation learning.
Keywords: Imitation learning · Machine learning · Optimization

1 Introduction
In imitation learning, the robot observes states and control labels from a supervisor and estimates a mapping from states to controls. A fundamental problem in imitation learning is covariate shift [1], where the distribution of trajectories experienced by the robot at run time differs from the distributions experienced during training time. For example, consider an autonomous vehicle trained to drive on a road from demonstrations of humans driving safely on the center of the same road. If the vehicle makes slight errors when it is deployed, it may drift
towards the sides of the road where it had not previously experienced data from human supervisors, resulting in situations from which it cannot recover. On-policy imitation learning algorithms such as Dagger [12], Aggrevated [13], and Loki [4] have been proposed to mitigate this issue. As opposed to learning only from supervisor demonstrations, these algorithms roll out the robot's current policy at each iteration, allowing it to make errors and observe new states. The supervisor then provides corrective control labels for these new states retroactively. For this reason, these algorithms are often referred to as on-policy imitation learning algorithms, because the robot iteratively learns from its current policy [10]. This is in contrast to off-policy algorithms, which learn from the supervisor's demonstrations. Recently, there has been interest in determining when on-policy algorithms are guaranteed to converge to good policies, because practitioners often settle on the policy from the final iteration of the algorithm. Cheng and Boots [2] proved that Dagger converges under the condition that a sensitivity parameter related to the rate of change of the trajectory distributions is small. On-policy algorithms can be viewed as variants of algorithms from the field of Online Optimization [6]. Online Optimization is often used for problems such as portfolio management and network routing analysis, where environments change over time [7]. At iteration $n$, an agent plays some parameter $\theta_n$ and then some loss function $f_n$ is presented. The agent then incurs loss $f_n(\theta_n)$. In on-policy imitation learning, $\theta_n$ corresponds to policy parameters, and $f_n$ is a supervised learning loss function obtained by rolling out $\theta_n$ and observing corrective labels from the supervisor. At each iteration the supervised learning loss function is different, because the distribution of trajectories induced by the robot changes as the policy is updated. The goal is to continually find the optimal $\theta_n$ at each iteration based on the sequence of past loss functions. A common choice in Online Optimization to measure the performance of an algorithm is the static regret over $N$ iterations, defined as

$$R_S(\theta_1, \ldots, \theta_N) := \sum_{n=1}^{N} f_n(\theta_n) - \min_{\theta} \sum_{n=1}^{N} f_n(\theta). \tag{1}$$
In imitation learning, this would mean the robot is compared against the best it could have done on the average of its trajectory distributions. In this paper, we focus on determining when on-policy algorithms can and cannot converge and specifically when the policy is performing optimally on its own distribution. We draw a connection between this very natural objective and an alternative metric in Online Optimization known as dynamic regret [5,9,14]. As opposed to the well known static regret in (1), dynamic regret measures performance of a policy at each instantaneous iteration: RD (θ1 , . . . , θN ) :=
N n=1
fn (θn ) −
N n=1
min fn (θ). θ
(2)
The difference between static and dynamic regret is that, for dynamic regret, the minimum goes inside the summation, meaning that the regret is an
instantaneous difference at each iteration. Proving that static regret is low implies that the policy on average is at least as good as a single fixed policy that does well on the average of the distributions seen during training. Proving that dynamic regret is low implies that the average policy is optimal on its own distribution. Achieving low dynamic regret is inherently harder than achieving low static regret because $R_S \le R_D$, but dynamic regret is more relevant as a theoretical metric in robotics, where the trajectory distributions are changing. In Online Optimization, it is well known that it is not possible to achieve low dynamic regret in general, due to the possibility of adversarial loss functions [9]. However, in imitation learning, the key insight is that the loss functions at each iteration represent the trajectory distributions as a function of the policy parameters [2]. Therefore, we can leverage information about how the trajectory distribution changes in response to changing policy parameters to obtain specific dynamic regret rates. This paper makes four contributions:

1. Introduces a dynamic regret analysis to evaluate the convergence of on-policy imitation learning algorithms.
2. Presents average dynamic regret rates and conditions for convergence for Dagger, Imitation Gradient, and Multiple Imitation Gradient.
3. Introduces Adaptive On-Policy Regularization (AOR), a novel algorithm that adaptively regularizes on-policy algorithms to improve dynamic regret and cause convergence.
4. Presents empirical evidence of non-convergent on-policy algorithms and shows that AOR can ensure convergence in a cart-pole balancing task and a walker locomotion task.
2 Preliminaries

In this section we introduce the notation, problem statement, and assumptions for imitation learning by supervised learning. Let $s_t \in \mathcal{S}$ and $u_t \in \mathcal{U}$ be the state and control in a Markov decision process. The probability of a trajectory $\tau$ of length $T$ under policy $\pi \in \Pi : \mathcal{S} \to \mathcal{U}$ is given by

$$p(\tau; \pi) = p(s_0) \prod_{t=1}^{T-1} p_\pi(u_t \mid s_t)\, p(s_{t+1} \mid s_t, u_t).$$
In this paper, we consider parametric policies: there is a convex, compact subset of parameters $\Theta \subset \mathbb{R}^d$ with diameter $D := \max_{\theta_1, \theta_2 \in \Theta} \|\theta_1 - \theta_2\|$, and our goal is to find a parameterized policy $\pi_\theta$ that minimizes some loss with respect to the supervisor policy $\pi^*$, which may not necessarily be attainable within $\Theta$. The loss of a policy $\pi$ along a trajectory $\tau$ is a non-negative function $J$ such that

$$J(\tau, \pi) = \sum_{t=1}^{T-1} \ell\big(\pi(s_t), \pi^*(s_t)\big),$$

where $\ell : \mathcal{U} \times \mathcal{U} \to \mathbb{R}_{\ge 0}$ is a per-time-step loss.
The general optimization problem of imitation learning can be written as

$$\min_{\theta \in \Theta} \; \mathbb{E}_{p(\tau; \pi_\theta)} J(\tau, \pi_\theta).$$

The expectation is taken over the distribution of trajectories that $\pi_\theta$ induces, and then $\pi_\theta$ is evaluated on the trajectories sampled from that distribution. This problem reflects the goal of having the policy do well on its own induced distribution. However, this is challenging and cannot be solved with regular optimization methods, since the distribution of trajectories is unknown. It also cannot be solved with regular supervised learning by sampling supervisor trajectories, because the sampling distribution is a function of the policy [1]. Existing algorithms relax the problem by fixing the trajectory distribution and then optimizing over the evaluation parameter. This decouples the sampling from the supervised learning problem. For example, in behavior cloning, one sets the trajectory distribution to the supervisor's trajectory distribution and finds the policy that minimizes loss on that distribution. Formally, we consider the average loss of a parameter $\theta \in \Theta$ over the distribution of trajectories generated by a possibly different policy parameter $\theta' \in \Theta$:

$$f_{\theta'}(\theta) := \mathbb{E}_{p(\tau; \pi_{\theta'})} J(\tau, \pi_\theta). \tag{3}$$
Here, $\theta'$ controls the trajectory distribution and $\theta$ controls the predictions used to compute the loss on that distribution. For this reason, we refer to $\theta'$ as the distribution-generating parameter and $\theta$ as the evaluation parameter. To better convey this notion of the policy parameters being decoupled, we use this concise $f$ notation as in [2]. Optimization problems in this paper will be of the form $\min_{\theta \in \Theta} f_{\theta'}(\theta)$ for some fixed and known $\theta' \in \Theta$. Because the trajectory distribution no longer depends on the variable $\theta$, the supervised learning problem can now be feasibly solved by sampling from $p(\tau; \pi_{\theta'})$, which corresponds to rolling out trajectories under the fixed policy $\pi_{\theta'}$. Specifically, in this paper we consider iterative on-policy algorithms over $N \in \mathbb{N}$ iterations. At any iteration $n$, $1 \le n \le N$, the policy parameter $\theta_n$ is rolled out as the distribution-generating parameter and the loss $f_{\theta_n}(\theta) = \mathbb{E}_{p(\tau; \pi_{\theta_n})} J(\tau, \pi_\theta)$ is observed, where $\theta$ is the free variable and $\theta_n$ is the fixed per-iteration distribution-generating parameter. As in prior work [2, 12], for convenience, we write $f_n(\theta) := f_{\theta_n}(\theta)$. These loss functions form the sequence of losses used in the regret metrics (1) and (2). Next, we briefly describe the main assumptions of this paper; they are stated formally in Sect. 6. As in prior work in both imitation learning and Online Optimization, we assume strong convexity and smoothness of the loss function in the evaluation parameter. Strong convexity ensures the loss is curved at least quadratically, while smoothness guarantees it is not too curved. As in [2], we also make a regularity assumption on the $f_n$ sequence, which bounds the sensitivity of the trajectory distribution in response to changes in the distribution-generating parameter.
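The decoupling of the distribution-generating and evaluation parameters can be summarized in code: the sketch below rolls out a linear policy under $\theta'$ and scores a possibly different $\theta$ against supervisor labels on the visited states. `env` and `supervisor` are illustrative stand-ins, not a real API.

```python
import numpy as np

def f(theta_dist, theta_eval, supervisor, env, T=100):
    # The decoupled loss f_{theta'}(theta) of Eq. (3): roll out the
    # distribution-generating parameter theta_dist, then score the
    # evaluation parameter theta_eval against supervisor labels on the
    # states visited along the way.
    s = env.reset()
    loss = 0.0
    for _ in range(T):
        loss += np.sum((s @ theta_eval - supervisor(s)) ** 2)
        s = env.step(s @ theta_dist)     # dynamics driven only by theta_dist
    return loss / T
```

Because the rollout depends only on `theta_dist`, the inner minimization over `theta_eval` is an ordinary supervised learning problem, which is exactly the relaxation described above.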
3 Related Work
The challenge of covariate shift in imitation learning by supervised learning is the subject of significant research in robotics. It is especially prevalent when the robot's policy cannot fully represent the supervisor [8]. In robotics, many algorithms have been proposed to mitigate covariate shift for imitation learning. Ross et al. [12] introduced Dagger, an on-policy algorithm that allows the robot to make mistakes and then observe corrective labels from the supervisor in new states that might not be seen from ideal supervisor demonstrations alone. Gradient-based on-policy methods for imitation learning have gained interest due to their similarity to policy gradient algorithms and their computational efficiency; these are also known as Imitation Gradient methods. Aggrevated [13] was proposed for fast policy updates designed for deep neural network policies. Loki [4] uses a mirror descent algorithm on an imitation learning loss to bootstrap reinforcement learning. Ross et al. [12] first introduced a static regret-style analysis for Dagger, showing that with strongly convex losses, running Dagger results in low static regret in all cases. This means that a Dagger policy is on average at least as good as one policy that does well on the average of the trajectory distributions seen during training. However, the performance on the average of trajectory distributions is not always informative, as shown by Laskey et al. [8], because the average may contain irrelevant distributions resulting from rolling out suboptimal policies. Cheng and Boots [2] recently expanded the Dagger analysis, showing that despite guaranteed convergence in static regret, the algorithm may not always converge to a low-loss policy. Furthermore, they identified conditions sufficient for convergence to local optima. This work extends the results of Cheng and Boots [2] by drawing a connection with dynamic regret theory to identify conditions for convergence for Dagger and other on-policy algorithms, and by using this connection as the basis for a new algorithm. To the best of our knowledge, this is the first application of dynamic regret theory to imitation learning.
4 On-Policy Algorithms

We now review three on-policy algorithms that will be the focus of the main theoretical and empirical results of the paper.

4.1 Dagger
Dagger is a variant of the follow-the-leader algorithm from Online Optimization. For a detailed discussion of implementation, we refer the reader to [6, 12]. Dagger proceeds by rolling out the current policy and observing a loss based on the induced trajectory distribution. The next policy parameter is computed by aggregating all observed losses and minimizing over them. An example of this for the $l_2$-regularized linear regression problem is

$$\min_{\theta} \sum_{m=1}^{n} f_m(\theta) = \min_{\theta} \sum_{m=1}^{n} \left( \mathbb{E} \left\| S_m \theta - U_m \right\|^2 + \frac{\alpha_0}{2} \|\theta\|^2 \right), \tag{4}$$
where $S_m$ is the matrix of state vectors observed from rolling out $\pi_m$ and $U_m$ is the matrix of control labels from the supervisor at the $m$-th iteration.

Algorithm 1: Dagger [12]
Input: Initial policy parameter θ1, max iterations N.
for n = 1 to N − 1 do
    Roll out θn and collect τn.
    Form loss fn(θ) = fθn(θ) from supervisor feedback on τn.
    θn+1 ← arg minθ∈Θ Σ(m=1..n) fm(θ).
end for
Algorithm 2: (Multiple) Imitation Gradient [4,13,15]
Input: Initial policy parameter θ1, max iterations N, updates per iteration K, stepsize η.
for n = 1 to N − 1 do
    Roll out θn and collect τn.
    Form loss fn(θ) = fθn(θ) from τn.
    θn(1) ← θn
    for k = 1 to K do
        θn(k+1) ← PΘ(θn(k) − η∇fn(θn(k)))
    end for
    θn+1 ← θn(K+1)
end for
Left: Dagger minimizes over all observed loss functions which are represented by a supervised learning loss over all observed data. Right: Multiple Imitation Gradient computes a gradient on only the most recent data collected and applies K gradient steps. Imitation Gradient is a special case where K = 1. PΘ is a projection operation.
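For the ridge-regression instance of Eq. (4), the follow-the-leader step has a closed form over the aggregated data. The sketch below applies the regularizer once over the aggregate, a simplification of the per-iteration sum in Eq. (4); names are illustrative.

```python
import numpy as np

def dagger_ridge_update(S_list, U_list, alpha0):
    # Follow-the-leader step for the ridge-regression loss of Eq. (4):
    # aggregate all (state, label) data observed so far and solve the
    # regularized least squares in closed form.
    S = np.vstack(S_list)                # states from all rollouts so far
    U = np.vstack(U_list)                # corresponding supervisor controls
    d = S.shape[1]
    return np.linalg.solve(S.T @ S + (alpha0 / 2) * np.eye(d), S.T @ U)
```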
4.2 Imitation Gradient and Multiple Imitation Gradient
Recently there has been interest in "Imitation Gradient" algorithms; Aggrevated and Loki fall into this family. The online gradient descent algorithm from Online Optimization underlies such algorithms and their variants. While the aforementioned methods have explored more sophisticated gradient-based algorithms, such as those inspired by the natural gradient and mirror descent, the analysis in this paper focuses on the basic online gradient descent algorithm. Online gradient descent proceeds by observing $f_n$ at each iteration and taking a weighted gradient step, $\theta_n - \eta \nabla f_n(\theta_n)$. In the event that the resulting parameter lies outside of $\Theta$, it is projected back onto the set with the projection $P_\Theta(\theta) = \arg\min_{\theta' \in \Theta} \|\theta - \theta'\|$. We also consider a related algorithm, termed Multiple Imitation Gradient, based on the online multiple gradient descent algorithm introduced by Zhang et al. [15]. This algorithm is identical to Imitation Gradient, but at each iteration it updates the policy parameters $K$ times, recomputing the gradient each time. Imitation Gradient is the special case of Multiple Imitation Gradient where $K = 1$. The algorithms are shown together in Algorithm 2.
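A minimal sketch of the update in Algorithm 2, assuming $\Theta$ is a Euclidean ball so that the projection $P_\Theta$ has a closed form; setting K = 1 recovers the plain Imitation Gradient update.

```python
import numpy as np

def project(theta, radius=10.0):
    # Euclidean projection P_Theta onto a ball of the given radius, a
    # stand-in for the convex compact set Theta.
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def mig_update(theta, grad_fn, eta, K=1):
    # One iteration of (Multiple) Imitation Gradient: K projected gradient
    # steps on the most recent loss f_n; K = 1 is plain Imitation Gradient.
    for _ in range(K):
        theta = project(theta - eta * grad_fn(theta))
    return theta
```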
5 Dynamic Regret
We are interested in showing that the policies generated by these algorithms perform well on the loss on their own induced trajectory distributions and, furthermore, that they converge. This means that we would like the difference $f_n(\theta_n) - \min_{\theta \in \Theta} f_n(\theta)$ to be as small as possible for every iteration $n$. This difference represents the instantaneous regret the algorithm has for playing $\theta_n$ instead of $\theta_n^*$ at iteration $n$, where $\theta_n^* := \arg\min_{\theta \in \Theta} f_n(\theta)$. Summing over $n$ from 1 to $N$, we have the definition of dynamic regret given in Eq. (2). The better-known static regret, shown in Eq. (1), compares an algorithm's sequence of policies to the minimizer over the average of losses on all trajectory distributions seen during training. Static regret has been shown to be low for strongly convex losses regardless of any external conditions such as the robot or system dynamics or the distributions induced [12]. However, proving that a policy has low regret compared to the average of the trajectory distributions gives little indication of whether the policy actually performs well on its own induced distribution. The reason is that some distributions observed during training can be irrelevant to the task due to bad initialization or extremely sensitive dynamics, yet they are still included in the average. As a result, static regret can be low regardless of actual policy performance or of whether the algorithm leads to convergent policies. Prior work has shown in experiments and theoretical examples that on-policy algorithms do indeed fail on "hard" problems [2, 8]. In order to characterize this notion of problem difficulty, we turn to dynamic regret. In the Online Optimization literature, dynamic regret has become increasingly popular for analyzing online learning problems where the objective is constantly shifting [5, 7, 9, 14–16]. For example, in portfolio management, $\theta_n$ represents an investment strategy while $f_n$ represents some negative return from the market. If the market is shifting over time, i.e., prices are changing, we are interested in playing the best strategy for the given state of the market at each time step. This can be represented as a dynamic regret problem. Dynamic regret compares the $n$-th policy to the instantaneous best policy on the $n$-th distribution, which means it is a stricter metric than static regret. The advantage of the dynamic regret metric is that a policy's performance at any iteration is always evaluated with respect to the most relevant trajectory distribution: the current one. Proving that dynamic regret is low implies that each policy on average is as good as the instantaneous best on its own distribution. Furthermore, we can examine the convergence properties of an algorithm by proving that dynamic regret is low. The dynamic regret of an algorithm is fundamentally dependent on the change in the loss functions over iterations, often expressed in terms of quantities called variations. If the loss functions change in an unpredictable manner, we can expect large variation terms, leading to large regret and suboptimal policies. This is also the reason that sublinear dynamic regret bounds cannot be obtained in general using only the assumptions commonly used for static regret [14]. In this paper, we consider the commonly used path variation and a squared variant of it introduced by [15].
Definition 1 (Path Variation and Squared Path Variation). For a sequence of optimal parameters from $m$ to $n$ given by $\theta_{m:n}^* := (\theta_i^*)_{m \le i \le n}$, the path variation and the squared path variation are defined respectively as:

$$V(\theta_{m:n}^*) := \sum_{i=m}^{n-1} \left\| \theta_i^* - \theta_{i+1}^* \right\| \quad \text{and} \quad S(\theta_{m:n}^*) := \sum_{i=m}^{n-1} \left\| \theta_i^* - \theta_{i+1}^* \right\|^2.$$
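These quantities, along with both regret metrics, can be computed directly once the per-iteration losses and optimal parameters are in hand. In the sketch below, the static comparator is approximated by searching over a finite candidate set rather than solving the inner minimization exactly; this approximation is an assumption of the sketch.

```python
import numpy as np

def regrets_and_variations(losses, thetas, theta_stars):
    # losses[n] is the callable f_n; thetas and theta_stars are the played
    # and per-iteration optimal parameters.  Returns static regret (1),
    # dynamic regret (2) and the (squared) path variation of Definition 1.
    N = len(losses)
    played = sum(losses[n](thetas[n]) for n in range(N))
    candidates = list(thetas) + list(theta_stars)   # approximate comparator
    best_fixed = min(sum(losses[n](c) for n in range(N)) for c in candidates)
    static_regret = played - best_fixed
    dynamic_regret = played - sum(losses[n](theta_stars[n]) for n in range(N))
    V = sum(np.linalg.norm(theta_stars[i] - theta_stars[i + 1])
            for i in range(N - 1))
    S = sum(np.linalg.norm(theta_stars[i] - theta_stars[i + 1]) ** 2
            for i in range(N - 1))
    return static_regret, dynamic_regret, V, S
```

Note that the dynamic regret computed this way is always at least the static regret, consistent with $R_S \le R_D$.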
Many algorithms have been proposed and analyzed in this regret framework in terms of variation measures of the loss functions. For example, Zinkevich [16] proved a dynamic regret rate of $O(\sqrt{N}(1 + V(\theta_{1:N}^*)))$ for online gradient descent with convex losses. Here, the regret rate depends on the rate of the path variation, which might itself be a function of $N$. Therein lies the difficulty of dynamic regret problems: no matter the algorithm, rates depend on the variation, which can be large for arbitrary sequences of loss functions. Dynamic regret analyses offer a promising framework for the theoretical analysis of on-policy imitation learning algorithms, but only as long as the variation can be characterized. In imitation learning, the variation of the loss functions is related to the amount of change in the trajectory distributions induced by the sequence of policies. Ultimately these changes in trajectory distributions depend on the dynamics of the system. We will show in the next section that a single assumption on the dynamics, introduced by Cheng and Boots [2], can aid in characterizing the variation.
6 Main Results
In this section, we present the primary theoretical results of the paper, showing that convergence in average dynamic regret can be guaranteed for all three algorithms under certain conditions on the sensitivity of the trajectory distribution. We show that it is possible for all algorithms to achieve $O(N^{-1})$ average dynamic regret when these conditions are met. Since dynamic regret upper bounds static regret, these results suggest that we can also improve the static regret rates of [12] by a logarithmic factor. Proofs and details can be found in the supplementary material available at https://arxiv.org/pdf/1811.02184.pdf. We first formally state the assumptions introduced in Sect. 2, beginning with common convex optimization assumptions on the loss in the evaluation parameter. Note that all gradients of $f$ in this section are taken with respect to the evaluation parameter, i.e., $\nabla f_{\theta'}(\theta)$ is the gradient with respect to $\theta$, not $\theta'$.

Assumption 1 (Strong Convexity). For all $\theta_1, \theta_2, \theta' \in \Theta$, $\exists \alpha > 0$ such that

$$f_{\theta'}(\theta_2) \ge f_{\theta'}(\theta_1) + \left\langle \nabla f_{\theta'}(\theta_1), \theta_2 - \theta_1 \right\rangle + \frac{\alpha}{2} \|\theta_1 - \theta_2\|^2.$$

Assumption 2 (Smoothness and Bounded Gradient). For all $\theta_1, \theta_2, \theta' \in \Theta$, $\exists \gamma > 0$ such that $\|\nabla f_{\theta'}(\theta_1) - \nabla f_{\theta'}(\theta_2)\| \le \gamma \|\theta_1 - \theta_2\|$, and $\exists G > 0$ such that $\|\nabla f_{\theta'}(\theta_1)\| \le G$.
Assumption 3 (Stationary Optimum). For all $\theta' \in \Theta$, $\theta'^* = \arg\min_{\theta \in \Theta} f_{\theta'}(\theta)$ is in the relative interior of $\Theta$; that is, $\nabla f_{\theta'}(\theta'^*) = 0$.

In practice the above conditions are not difficult to satisfy. For example, running Dagger with $l_2$-regularized linear regression, i.e., ridge regression, would simultaneously satisfy all three. Finally, we state the regularity constraint on the loss as a function of the distribution-generating parameter.

Assumption 4 (Regularity). For all $\theta'_1, \theta'_2, \theta \in \Theta$, $\exists \beta > 0$ such that $\|\nabla f_{\theta'_1}(\theta) - \nabla f_{\theta'_2}(\theta)\| \le \beta \|\theta'_1 - \theta'_2\|$.

This assumption is introduced as a form of prior knowledge of the dynamics. It is essentially a sensitivity constraint: it implies that small changes in the policy parameters guarantee small changes in the induced trajectory distributions. To prove low dynamic regret, some prior knowledge must be incorporated in order to simplify or regulate the variation measures. In this analysis, we use Assumption 4 because of its precedent in prior work [2]. We now present the main novel results of this paper for the infinite sample or deterministic case, as in [2, 12]. Let $\theta_n^* = \arg\min_{\theta \in \Theta} f_n(\theta)$ be the optimal parameter at iteration $n$. We begin with a result concerning the stability constant $\lambda := \frac{\beta}{\alpha}$ [2], which represents the ratio of the sensitivity and the strong convexity constants.

Proposition 1. Given the assumptions, the following inequality holds on the difference between optimal parameters and the corresponding policy parameters for any $n, m \in \mathbb{N}$:

$$\|\theta_m^* - \theta_n^*\| \le \lambda \|\theta_m - \theta_n\|.$$
Consider the case where $m = n + 1$. This proposition suggests that when $\lambda < 1$, we know with certainty that $\|\theta_{n+1}^* - \theta_n^*\| < \|\theta_{n+1} - \theta_n\|$. In other words, the optimal parameters cannot run away faster than the algorithm's parameters. This intuition is also consistent with the findings of prior work [2], which shows that convergence of the $N$-th policy can be guaranteed when $\lambda < 1$ for Dagger.

6.1 Dagger
We now introduce a dynamic regret corollary to Theorem 2 of [2].

Corollary 1. For Dagger under the assumptions, if $\lambda < 1$, then the average dynamic regret $\frac{1}{N} R_D$ tends towards zero in $N$ with rate $O(\max(N^{-1}, N^{2\lambda - 2}))$.

Proof. The proof is immediate from the result of Theorem 2 of [2]. We have $f_n(\theta_n) - f_n(\theta_n^*) \le \frac{(\lambda e^{1-\lambda} G)^2}{2 \alpha n^{2(1-\lambda)}}$. Summing from 1 to $N$, we get $\sum_{n=1}^{N} f_n(\theta_n) - \sum_{n=1}^{N} f_n(\theta_n^*) \le \sum_{n=1}^{N} \frac{(\lambda e^{1-\lambda} G)^2}{2 \alpha n^{2(1-\lambda)}} = O(\max(1, N^{2\lambda - 1}))$. Then the average dynamic regret is $\frac{1}{N} R_D = O(\max(N^{-1}, N^{2\lambda - 2}))$, which goes to zero.
The corollary reveals that the convergence result for Dagger proved by Cheng and Boots can be reframed as a dynamic regret analysis. The rate depends on the stability constant $\lambda < 1$. The dynamic regret shows that policies generated by Dagger on average achieve local optimality, and that for sufficiently small $\lambda$, the regret grows no more than a finite amount.

6.2 Imitation Gradient
For the analysis of dynamic regret bounds for the Imitation Gradient algorithm, we require the stronger condition $\alpha^2 > 2\gamma\beta$. Written another way, the condition is $2\lambda < \psi$, where $\lambda$ is the stability constant and $\psi := \frac{\alpha}{\gamma}$ is the condition number of $f_n$. So we require that the problem is both stable and well-conditioned. In this proof, we make use of the path variation.

Theorem 1. For the Imitation Gradient algorithm under the assumptions, if $\lambda < 1$, $2\lambda < \psi$ and $\eta = \frac{\alpha(\alpha^2 - 2\gamma\beta)}{2\gamma^2(\alpha^2 - \beta^2)}$, then the average dynamic regret $\frac{1}{N} R_D$ tends towards zero in $N$ with rate $O(N^{-1})$. Furthermore, $\|\theta_n - \theta_n^*\|$ converges to zero and the sequence of policies $(\theta_n)_{n=1}^{\infty}$ is convergent.

Intuitively, the theorem states that if the conditions are met, the dynamic regret grows no more than a constant value. If we again interpret the variation as describing the difficulty of a problem, this theorem suggests that, under the appropriate conditions, solving an imitation learning problem with Imitation Gradient algorithms is as easy as solving a general dynamic regret problem with path variation $V(\theta_{1:N}^*) = O(1)$. In other words, the equivalent dynamic regret problem is stationary in the limit: the optimal parameters cumulatively move no more than a finite distance. The reason is that the change in the loss functions is so closely tied to the policy parameters in imitation learning.

6.3 Multiple Imitation Gradient
We now present a similar dynamic regret rate for the Multiple Imitation Gradient algorithm. This theorem makes use of the squared path variation as opposed to the path variation. The squared path variation is especially amenable to a conversion from the standard dynamic regret bound given in [15] to an imitation learning analysis, as we will see in the proof of the theorem. A straightforward conversion to characterize the measure of variation can be established by simply bounding the squared path variation from above by a quantity proportional to the dynamic regret using Assumption 4. We refer to this technique as establishing the reverse relationship between the measure of variation and the dynamic regret. To illustrate this conversion in detail, we present the proof here, which leverages the general dynamic regret rate for online multiple gradient descent first given by Zhang et al. [15].
Lemma 1 (Zhang et al. [15], Theorem 3). If Assumptions 1–3 hold, $\eta < 1/\gamma$ and $K = \frac{1/\eta + \alpha}{2\alpha} \log 4$, then the following is true for online multiple gradient descent:

$$R_D(\theta_1, \ldots, \theta_N) \le 2\gamma S(\theta_{1:N}^*) + \gamma \|\theta_1 - \theta_1^*\|^2.$$
From this lemma, a specific result for imitation learning can be obtained in a straightforward manner by incorporating the regularity and the strong convexity of the loss functions.

Theorem 2. For the Multiple Imitation Gradient algorithm under the assumptions, if $\lambda < 1$, $\lambda \log 4 < \psi^{3/2}$, $\eta < \min\!\left( 1/\gamma,\; \frac{\alpha^{5/2} - \gamma^{3/2} \beta \log 4}{2 \gamma^{3/2} \alpha \beta \log 4} \right)$ and $K = \frac{1/\eta + \alpha}{2\alpha} \log 4$, then the average dynamic regret $\frac{1}{N} R_D$ tends towards zero in $N$ with rate $O(N^{-1})$. Furthermore, $\|\theta_n - \theta_n^*\|$ converges to zero and the sequence of policies $(\theta_n)_{n=1}^{\infty}$ is convergent.
Proof. In the interest of brevity, we skip some of the initial steps. We begin by establishing the reverse relationship; that is, we bound the squared path variation from above by a quantity proportional to the dynamic regret using Proposition 1 and Assumption 4:

$$S(\theta_{1:N}^*) \le \frac{2 \beta^2 \eta^2 \gamma^2 K^2}{\alpha^3} \sum_{n=1}^{N} \left( f_n(\theta_n) - f_n(\theta_n^*) \right). \tag{5}$$

Then, substituting into Lemma 1, we have

$$R_D(\theta_1, \ldots, \theta_N) \le \gamma \|\theta_1 - \theta_1^*\|^2 + \frac{4 \beta^2 \eta^2 \gamma^3 K^2}{\alpha^3} \sum_{n=1}^{N} \left( f_n(\theta_n) - f_n(\theta_n^*) \right) \le \frac{\gamma \|\theta_1 - \theta_1^*\|^2}{1 - \frac{4 \beta^2 \eta^2 \gamma^3 K^2}{\alpha^3}} \le \frac{\gamma D^2}{1 - \frac{4 \beta^2 \eta^2 \gamma^3 K^2}{\alpha^3}}.$$

It can be verified that for $\eta < \min\!\left( 1/\gamma,\; \frac{\alpha^{5/2} - \gamma^{3/2} \beta \log 4}{2 \gamma^{3/2} \alpha \beta \log 4} \right)$ and $K = \frac{1/\eta + \alpha}{2\alpha} \log 4$, the denominator is positive.

The above theorem demonstrates that, again, a constant upper bound on the dynamic regret can be obtained with this algorithm. Interestingly, the conditions sufficient to guarantee convergence are different. Instead of requiring $2\lambda < \psi$, this theorem requires $\lambda \log 4 < \psi^{3/2}$. While $\log 4 < 2$, we also have that $\psi^{3/2} \le \psi$, because the condition number, which is the ratio of $\alpha$ and $\gamma$, is at most 1. The implication of this observation is that there is a trade-off. For the Multiple Imitation Gradient algorithm, we can guarantee $O(N^{-1})$ regret for higher values of $\lambda$, i.e., more sensitive systems, but we must then require that our loss functions are better conditioned. Conversely, the Imitation Gradient algorithm achieves a similar guarantee using losses with lower condition numbers, but $\lambda$ must be smaller.
7 Adaptive On-Policy Regularization
We now apply these theoretical results to motivate an adaptive regularization algorithm to help ensure convergence. As noted by Cheng et al. [2], regularization can lead to convergent policies for Dagger.
Algorithm 3: Adaptive On-Policy Regularization (AOR)
Input: Initial parameters θ1, max iterations N, initial regularization α̂1.
for n = 1 to N − 1 do
    Roll out θn and collect trajectory τn.
    Observe loss fn(θ) = fθn(θ) from supervisor feedback on τn.
    θn+1 ← Update(θn, α̂n).
    λ̂n ← Estimate(f1, ..., fn).
    α̂n+1 ← Tune(α̂n, λ̂n).
end for

Adaptive On-Policy Regularization adaptively increases α, the regularization parameter for strong convexity, to stabilize any regularized on-policy algorithm.
In Dagger, Imitation Gradient, and Multiple Imitation Gradient, a key sufficient condition is that $\lambda < 1$, meaning that the strong convexity constant $\alpha$ must be greater than the regularity constant $\beta$. While $\beta$ is a fixed property of the dynamics, $\alpha$ is controllable by the user and robot: $\alpha$ can be increased by increasing the regularization of the supervised learning loss. By Proposition 1, a lower bound on $\lambda$ may be estimated from the ratio of the distance between optimal parameters to the distance between policy parameters: $\hat{\lambda} = \frac{\|\theta_{n+1}^* - \theta_n^*\|}{\|\theta_{n+1} - \theta_n\|}$. Therefore, we can propose an adaptive algorithm to compute a new regularization term at each iteration $n$. One caveat of adaptively updating $\alpha$ is that we do not want it to be too large: while the regret will converge, policy performance can suffer, as the regularization term will dominate the loss function and the convergent solution will simply be the one that minimizes the regularizer. This subtlety and the theoretical motivation in the previous subsections are the basis for Algorithm 3, which we call Adaptive On-Policy Regularization, an algorithm for making conservative updates to $\alpha$ that can be applied to any regularized on-policy algorithm. At each iteration, the policy is updated according to a given on-policy algorithm, such as Dagger, using subroutine Update(θn, α̂n), which depends on the current regularization. Then $\hat{\lambda}_n$ is computed with subroutine Estimate; we use a mean of the observed $\lambda$ values with a linear weighting over iterations. Finally, $\hat{\alpha}_{n+1}$ is tuned based on $\hat{\lambda}_n$ using a conservative update rule: $\hat{\alpha}_{n+1} = t \hat{\lambda}_n \hat{\alpha}_n + (1 - t) \hat{\alpha}_n$ for $t \in (0, 1)$, where $\hat{\lambda}_n \hat{\alpha}_n$ is a conservative estimate of $\beta$.
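A sketch of the Estimate and Tune subroutines described above; the linearly weighted mean over the observed ratios is one reading of the paper's "linear weighting over iterations", and the function names are illustrative.

```python
import numpy as np

def estimate_lambda(thetas, theta_stars):
    # Proposition 1 lower bound: ratios of consecutive optimal-parameter
    # distances to consecutive policy-parameter distances, averaged with a
    # linear weighting over iterations.
    ratios = [np.linalg.norm(theta_stars[i + 1] - theta_stars[i])
              / np.linalg.norm(thetas[i + 1] - thetas[i])
              for i in range(len(thetas) - 1)]
    return float(np.average(ratios, weights=np.arange(1, len(ratios) + 1)))

def aor_tune(alpha_hat, lambda_hat, t=0.01):
    # Conservative update: alpha_{n+1} = t * lambda_hat * alpha_hat
    #                                  + (1 - t) * alpha_hat.
    return t * lambda_hat * alpha_hat + (1 - t) * alpha_hat
```

With $t$ close to zero, the regularization grows (or shrinks) slowly toward the estimate of $\beta$, which matches the conservative behaviour intended for the update rule.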
8 Empirical Evaluation
We study the empirical properties of Dagger, Imitation Gradient (IG) and Multiple Imitation Gradient (MIG), showing that even in low-dimensional and convex settings, the implications of the convergence of policies become apparent. We intentionally sought out cases and chose parameters such that these on-policy algorithms do not achieve convergence, in order to better understand their properties. We consider deterministic domains for the sake of accurately measuring the true dynamic regret. In this evaluation, we attempt to address the following questions: (1) How are policy performance and dynamic regret affected by changing system parameters? (2) Can Adaptive On-Policy Regularization improve convergence of the average dynamic regret and policy performance?

8.1 Cart-Pole Balancing
First we consider a task where the robot learns to push a cart left or right with a fixed force magnitude in order to balance a pole upright over 100 iterations. The control space is discrete {left, right} and the state space consists of cart location and velocity and pole angle and angular velocity. We measure the absolute performance of a policy as the angular deviation from the upright position. We obtained a nonlinear algorithmic supervisor via reinforcement learning. The robot's policy was learned using l2-regularized least squares as in (4). In this setting, we vary the difficulty of the problem, i.e., controlling β, by setting the force magnitude to either low or high values, corresponding to easy and hard settings, respectively. For all algorithms, the regularization was initially set to α̂_1 = 0.1. Stepsizes η for IG and MIG were set to 0.0001 and 0.01, respectively. The instantaneous regrets for all three imitation learning algorithms, measured as f_n(θ_n) − min_θ f_n(θ), are shown in the left column of the top and middle rows of Fig. 1 for the easy and hard settings, respectively. In the right column are the actual angular deviations of the pole. In the easy setting, the algorithms converge to costs similar to the supervisor's. In the hard setting, there is chattering during the learning process. We observe that regret does not always converge, indicating a discrepancy between the supervisor and the learner, which is consistent with the theoretical results suggesting that difficulty influences convergence to the best policy. The bottom row of Fig. 1 shows that Adaptive On-Policy Regularization (AOR) can be used to improve dynamic regret. In our implementation of AOR, we set t = 0.01 and updated λ̂_n and α̂_n every ten iterations. For IG and MIG with AOR, we adjusted the stepsize η as a function of α̂_n, as motivated by the conditions in the theorems; in practice this counteracts potentially large gradients caused by large α̂ values. Dagger without AOR exhibits severe chattering. With AOR, the performance is stabilized, leading to a converged policy. A similar result is observed for MIG. We note that IG did not initially exhibit chattering and its learning curve was unaffected. We attribute this to the discrete nature of the control space; even if the policy parameters have different regret rates, the resulting trajectories can be the same.
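For reference, the l2-regularized least-squares policy fit has a one-line closed form; a sketch in numpy, with hypothetical arrays X (stacked states) and U (stacked supervisor labels):

    import numpy as np

    def fit_ridge_policy(X, U, alpha):
        # theta = argmin ||X @ theta - U||^2 + alpha * ||theta||^2
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ U)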
Fig. 1. Regret is shown in the right column while true system cost, measured as angular deviation, is shown in the left column. All three on-policy algorithms without adaptive regularization are shown on both the easy (top row) and hard (middle, bottom rows) settings of cart-pole. The result shows that a system can be difficult enough to induce unstable learning curves. The bottom row shows the algorithms again on the hard setting but using AOR. The dynamic regret tends towards zero and the chattering in Dagger and MIG is reduced.
8.2 Walker Locomotion
Next, we consider a 2-dimensional walker from OpenAI Gym, where the objective is to move the farthest distance. Again we induced difficulty and suboptimal policies by increasing the force magnitude of controls. Ridge regression was used for the robot's policy. Here, we set α̂_1 = 1.0 and t = 0.1 for AOR. We used the same initial η values for IG and MIG and adjusted the stepsize as a function of α̂_n when using AOR. The results are shown in Fig. 2. Without adaptive regularization, the average dynamic regret fails to converge and all distance curves exhibit severe chattering with no stable learning. With AOR, the average dynamic regret converges to zero and all distance curves are stabilized.
Fig. 2. The instantaneous regrets fail to converge to zero without AOR (top row) and the distance curves exhibit chattering. With AOR (bottom row), the average dynamic regret converges and chattering is reduced after 200 iterations.
9 Discussion and Future Work
Dynamic regret theory offers a promising approach to theoretical analysis in imitation learning. The question of whether an online algorithm leads to converged or stable policies is inherently captured by dynamic regret, in contrast to static regret, which captures policy performance on the average of the distributions. Theoretical analyses suggest new conditions to guarantee convergence. The simulation results suggest that the questions of convergence and optimality must be addressed when designing imitation learning robotics systems, because stable performance is not always guaranteed. By modeling the problem with Assumption 4, we constrained the dynamics to aid the analysis. If other properties were known about the dynamics, then additional information could improve regret rates and conditions for optimality. For example, similar problem statements were studied in [5] and [11] in general online optimization. It was recently shown that augmenting the mostly model-free analyses with model-based learning can improve static regret bounds and performance [3]. We hypothesize that dynamic regret rates can also be improved.

Acknowledgments. This research was performed at the AUTOLAB at UC Berkeley in affiliation with the Berkeley AI Research (BAIR) Lab, Berkeley Deep Drive (BDD), the Real-Time Intelligent Secure Execution (RISE) Lab, and the CITRIS "People and Robots" (CPAR) Initiative and by the Scalable Collaborative Human-Robot Learning (SCHooL) Project, NSF National Robotics Initiative Award 1734633. The authors were supported in part by donations from Siemens, Google, Amazon Robotics, Toyota Research Institute, Autodesk, ABB, Samsung, Knapp, Loccioni, Honda, Intel, Comcast, Cisco, and Hewlett-Packard. We thank the WAFR community for their valuable comments and our colleagues, in particular Jeffrey Mahler, Ching-An Cheng, and Brijen Thananjeyan for their insights.
References

1. Bagnell, J.A.: An invitation to imitation. Technical report, Carnegie Mellon University, Robotics Institute, Pittsburgh (2015)
2. Cheng, C.-A., Boots, B.: Convergence of value aggregation for imitation learning. In: International Conference on Artificial Intelligence and Statistics (2018)
3. Cheng, C.-A., Yan, X., Theodorou, E., Boots, B.: Accelerating imitation learning with predictive models. arXiv preprint arXiv:1806.04642 (2018)
4. Cheng, C.-A., Yan, X., Wagener, N., Boots, B.: Fast policy learning through imitation and reinforcement. In: Conference on Uncertainty in Artificial Intelligence (2018)
5. Hall, E.C., Willett, R.M.: Online convex optimization in dynamic environments. IEEE J. Sel. Top. Signal Process. 9(4), 647–662 (2015)
6. Hazan, E.: Introduction to online convex optimization. Found. Trends Optim. 2(3–4), 157–325 (2016)
7. Hazan, E., Seshadhri, C.: Adaptive algorithms for online decision problems. In: Electronic Colloquium on Computational Complexity (ECCC) (2007)
8. Laskey, M., Chuck, C., Lee, J., Mahler, J., Krishnan, S., Jamieson, K., Dragan, A., Goldberg, K.: Comparing human-centric and robot-centric sampling for robot deep learning from demonstrations. In: IEEE International Conference on Robotics and Automation (ICRA) (2017)
9. Mokhtari, A., Shahrampour, S., Jadbabaie, A., Ribeiro, A.: Online optimization in dynamic environments: improved regret rates for strongly convex problems. In: 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 7195–7201. IEEE (2016)
10. Osa, T., Pajarinen, J., Neumann, G., Bagnell, J.A., Abbeel, P., Peters, J.: An algorithmic perspective on imitation learning. Found. Trends Robot. 7(1–2), 1–179 (2018)
11. Rakhlin, S., Sridharan, K.: Optimization, learning, and games with predictable sequences. In: Advances in Neural Information Processing Systems, pp. 3066–3074 (2013)
12. Ross, S., Gordon, G.J., Bagnell, J.A.: A reduction of imitation learning and structured prediction to no-regret online learning. In: International Conference on Artificial Intelligence and Statistics (2011)
13. Sun, W., Venkatraman, A., Gordon, G.J., Boots, B., Bagnell, J.A.: Deeply AggreVaTeD: differentiable imitation learning for sequential prediction. In: International Conference on Machine Learning (2017)
14. Yang, T., Zhang, L., Jin, R., Yi, J.: Tracking slowly moving clairvoyant: optimal dynamic regret of online learning with true and noisy gradient. In: International Conference on Machine Learning (2016)
15. Zhang, L., Yang, T., Yi, J., Rong, J., Zhou, Z.-H.: Improved dynamic regret for non-degenerate functions. In: Advances in Neural Information Processing Systems (2017)
16. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th International Conference on Machine Learning (ICML 2003), pp. 928–936 (2003)
Learning Constraints from Demonstrations

Glen Chou(B), Dmitry Berenson, and Necmiye Ozay
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA
{gchou,dmitryb,necmiye}@umich.edu
Abstract. We extend the learning from demonstration paradigm by providing a method for learning unknown constraints shared across tasks, using demonstrations of the tasks, their cost functions, and knowledge of the system dynamics and control constraints. Given safe demonstrations, our method uses hit-and-run sampling to obtain lower cost, and thus unsafe, trajectories. Both safe and unsafe trajectories are used to obtain a consistent representation of the unsafe set via solving an integer program. Our method generalizes across system dynamics and learns a guaranteed subset of the constraint. We also provide theoretical analysis on what subset of the constraint can be learnable from safe demonstrations. We demonstrate our method on linear and nonlinear system dynamics, show that it can be modified to work with suboptimal demonstrations, and that it can also be used to learn constraints in a feature space.

Keywords: Learning from demonstration · Machine learning · Motion and path planning

1 Introduction
Inverse optimal control and inverse reinforcement learning (IOC/IRL) [2,6,23,27] have proven to be powerful tools in enabling robots to perform complex goal-directed tasks. These methods learn a cost function that replicates the behavior of an expert demonstrator when optimized. However, planning for many robotics and automation tasks also requires knowing constraints, which define what states or trajectories are safe. For example, the task of safely and efficiently navigating an autonomous vehicle can naturally be described by a cost function trading off user comfort and efficiency and by the constraints of collision avoidance and executing only legal driving behaviors. In some situations, constraints can provide a more interpretable representation of a behavior than cost functions. For example, in safety-critical environments, recovering a hard constraint or an explicit representation of an unsafe set in the environment is more useful than learning a "softened" cost function representation of the constraint as a penalty term in the Lagrangian. Consider the autonomous vehicle, which absolutely must avoid collision, not simply give collisions a cost penalty. Furthermore, learning global constraints shared across many tasks can be useful for generalization. Again consider the autonomous vehicle, which must avoid the scene of a car accident: a shared constraint that holds regardless of the task it is trying to complete.
While constraints are important, it can be impractical for a user to exhaustively program into a robot all the possible constraints it should obey when performing its repertoire of tasks. To avoid this, we consider in this paper the problem of recovering the latent constraints within expert demonstrations that are shared across tasks in the environment. Our method is based on the key insight that each safe, optimal demonstration induces a set of lower-cost trajectories that must be unsafe due to violation of an unknown constraint. Our method samples these unsafe trajectories, ensuring they are also consistent with the known constraints (system dynamics, control constraints, and start/goal constraints), and uses these unsafe trajectories together with the safe demonstrations as constraints in an “inverse” integer program which recovers a consistent unsafe set. Our contributions are fourfold: – We pose the novel problem of learning a shared constraint across tasks. – We propose an algorithm that, given known constraints and boundedly suboptimal demonstrations of state-control sequences, extracts unknown constraints defined in a wide range of constraint spaces (not limited to the trajectory or state spaces) shared across demonstrations of different tasks. – We provide theoretical analysis on the limits of what subsets of a constraint can be learned, depending on the demonstrations, the system dynamics, and the trajectory discretization. We also show that our method can recover a guaranteed underapproximation of the constraint. – We provide experiments that justify our theory and show that our algorithm can recover an unsafe set with few demonstrations, across different types of linear and nonlinear dynamics, and can be adapted to work with boundedly suboptimal demonstrations. We also demonstrate that our method can learn constraints in the state space and a feature space.
2 Related Work
Inverse optimal control [15,17] (IOC) and inverse reinforcement learning (IRL) [23] aim to recover an objective function consistent with the received expert demonstrations, in the sense that the demonstrations (approximately) optimize the cost function. Our method is complementary to these approaches; if the demonstration is solving a constrained optimization problem, we are finding its constraints, given the objective function; IOC/IRL finds the objective function, given its constraints. For example, [13] attempts to learn the cost function of a constrained optimization problem from optimal demonstrations by minimizing the residuals of the KKT conditions, but the constraints themselves are assumed known. Another approach [5] can represent a state-space constraint shared across tasks as a penalty term in the reward function of an MDP. However, when viewing a constraint as a penalty, it becomes unclear if a demonstrated motion was performed to avoid a penalty or to improve the cost of the trajectory in terms of the true cost function (or both). Thus, learning a constraint which generalizes between tasks with different cost functions becomes difficult. To avoid this issue, we assume a known cost function to explicitly reason about the constraint.
One branch of safe reinforcement learning aims to perform exploration while minimizing visitation of unsafe states. Several methods for safe exploration in the state space [3,29,30] use a Gaussian process (GP) to explore safe regions in the state space. These approaches differ from ours in that they use exploration instead of demonstrations. Some drawbacks to these methods include that unsafe states can still be visited, Lipschitz continuity of the safety function is assumed, or the dynamics are unknown but the safe set is known. Furthermore, states themselves are required to be explicitly labeled as safe or unsafe, while we only require the labeling of whole trajectories. Our method is capable of learning a binary constraint defined in other spaces, using only state-control trajectories. There exists prior work in learning geometric constraints in the workspace. In [7], a method is proposed for learning Pfaffian constraints, recovering a linear constraint parametrization. In [26], a method is proposed to learn geometric constraints which can be described by the classes of considered constraint templates. Our method generalizes these methods by being able to learn a nonlinear constraint defined in any constraint space (not limited to the state space). Learning local trajectory-based constraints has also been explored in the literature. The method in [20] samples feasible poses around waypoints of a single demonstration; areas where few feasible poses can be sampled are assumed to be constrained. Similarly, [21] performs online constraint inference in the feature space from a single trajectory, and then learns a mapping to the task space. The methods in [9,10,24,31] also learn constraints in a single task. These methods are inherently local since only one trajectory or task is provided, unlike our method, which aims to learn a global constraint shared across tasks. Our work is also relevant to human-robot interaction. In [19], implicit communication of facts between agents is modeled as an interplay between demonstration and inference, where “surprising” demonstrations trigger inference of the implied fact. Our method can be seen as an inference algorithm which infers an unknown constraint implied by a “surprising” demonstration.
3 Preliminaries and Problem Statement
The goal of this work is to recover unknown constraints shared across a collection of optimization problems, given boundedly suboptimal solutions, the cost functions, and knowledge of the dynamics, control constraints, and start/goal constraints. We discuss the forward problem, which generates the demonstrations, and the inverse problem: the core of this work, which recovers the constraints.

3.1 Forward Optimal Control Problem
Consider an agent described by a state in some state space x ∈ X. It can take control actions u ∈ U to change its state. The agent performs tasks Π drawn from a set of tasks P, where each task Π can be written as a constrained optimization problem over state trajectories ξ_x ∈ T_x in state trajectory space and control trajectories ξ_u ∈ T_u in control trajectory space:
Problem 1 (Forward problem/"task" Π).

$$
\begin{aligned}
\underset{\xi_x,\,\xi_u}{\text{minimize}} \quad & c_\Pi(\xi_x, \xi_u) \\
\text{subject to} \quad & \phi(\xi_x, \xi_u) \in \mathcal{S} \subseteq \mathcal{C} \\
& \bar{\phi}(\xi_x, \xi_u) \in \bar{\mathcal{S}} \subseteq \bar{\mathcal{C}} \\
& \phi_\Pi(\xi_x, \xi_u) \in \mathcal{S}_\Pi \subseteq \mathcal{C}_\Pi
\end{aligned}
\tag{1}
$$
where c_Π(·): T_x × T_u → R is a cost function for task Π and φ(·,·): T_x × T_u → C is a known feature function mapping state-control trajectories to some constraint space C. φ̄(·,·): T_x × T_u → C̄ and φ_Π(·,·): T_x × T_u → C_Π are known and map to potentially different constraint spaces C̄ and C_Π, containing a known shared safe set S̄ and a known task-dependent safe set S_Π, respectively. S is an unknown safe set, and the inverse problem aims to recover its complement, A ≐ S^c, the "unsafe" set. In this paper, we focus on constraints separable in time: φ(ξ_x, ξ_u) ∈ A ⇔ ∃t ∈ {1, ..., T} such that φ(ξ_x(t), ξ_u(t)) ∈ A, where we overload φ so it applies to the instantaneous values of the state and the input. An analogous definition holds for the continuous time case. Our method easily learns nonseparable trajectory constraints as well¹.

A demonstration, ξ_xu ≐ (ξ_x, ξ_u), is a state-control trajectory which is a boundedly suboptimal solution to Problem 1, i.e., the demonstration satisfies all constraints and its cost is within a factor of (1 + δ) of the cost of the optimal solution ξ*_xu, i.e., c(ξ*_x, ξ*_u) ≤ c(ξ_x, ξ_u) ≤ (1 + δ)c(ξ*_x, ξ*_u). Furthermore, let T be a finite time horizon which is allowed to vary. If ξ_xu is a discrete-time trajectory (ξ_x = {x_1, ..., x_T}, ξ_u = {u_1, ..., u_T}), Problem 1 is a finite-dimensional optimization problem, while Problem 1 becomes a functional optimization problem if ξ_xu is a continuous-time trajectory (ξ_x: [0, T] → X, ξ_u: [0, T] → U). We emphasize this setup does not restrict the unknown constraint to be defined on the trajectory space; it allows for constraints to be defined on any space described by the range of some known feature function φ. We assume the trajectories are generated by a dynamical system ẋ = f(x, u, t) or x_{t+1} = f(x_t, u_t, t) with control constraints u_t ∈ U for all t, and that the dynamics, control constraints, and start/goal constraints are known. We further denote the sets of state-control trajectories satisfying the unknown shared constraint, the known shared constraint, and the known task-dependent constraint as T_S, T_S̄, and T_{S_Π}, respectively. Lastly, we also denote the set of trajectories satisfying all known constraints but violating the unknown constraint as T_A.
¹ Write Problem 2 constraints as sums over partially separable/inseparable feature components instead of completely separable components.
3.2 Inverse Constraint Learning Problem
The goal of the inverse constraint learning problem is to recover an unsafe set, A ⊆ C, using N_s provided safe demonstrations ξ*_{s_j}, j = 1, ..., N_s, known constraints, and N_¬s inferred unsafe trajectories, ξ_{¬s_k}, k = 1, ..., N_¬s, generated by our method, which can come from multiple tasks. These trajectories can together be thought of as a set of constraints on the possible assignments of unsafe elements in C. To recover a gridded approximation of the unsafe set A that is consistent with these trajectories, we first discretize C into a finite set of G discrete cells Z ≐ {z_1, ..., z_G} and define an occupancy function, O(·): Z → {0, 1}, which maps each cell to its safeness: O(z_i) = 1 if z_i ∈ A, and 0 otherwise. Continuous space trajectories are gridded by concatenating the set of grid cells z_i that φ(x_1), ..., φ(x_T) lie in, which is graphically shown in Fig. 1 with a non-uniform grid. Then, the problem can be written down as an integer feasibility problem:

Fig. 1. Discretized constraint space with cells z_1, ..., z_10. The trajectory's constraint values are assigned to the red cells.

Problem 2 (Inverse feasibility problem).

$$
\begin{aligned}
\text{find} \quad & O(z_1), \ldots, O(z_G) \in \{0,1\}^G \\
\text{subject to} \quad & \textstyle\sum_{z_i \in \{\phi(\xi^*_{s_j}(1)), \ldots, \phi(\xi^*_{s_j}(T_j))\}} O(z_i) = 0, \quad \forall j = 1, \ldots, N_s \\
& \textstyle\sum_{z_i \in \{\phi(\xi_{\neg s_k}(1)), \ldots, \phi(\xi_{\neg s_k}(T_k))\}} O(z_i) \geq 1, \quad \forall k = 1, \ldots, N_{\neg s}
\end{aligned}
\tag{2}
$$
Inferring unsafe trajectories, i.e. obtaining ξ¬sk , k = 1, . . . , N¬s , is the most difficult part of this problem, since finding lower-cost trajectories consistent with known constraints that complete a task is essentially a planning problem. Much of the next section shows how to efficiently obtain ξ¬sk . Further details on Problem 2, including conservativeness guarantees, incorporating a prior on the constraint, and a continuous relaxation can be found in Sect. 4.4.
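As one concrete (but not the authors') instantiation, Problem 2 can be handed to an off-the-shelf integer programming library; in this sketch, `safe_traj_cells` and `unsafe_traj_cells` are hypothetical lists of grid-cell index sets, one per safe and unsafe trajectory, respectively:

    import pulp

    def inverse_feasibility(G, safe_traj_cells, unsafe_traj_cells):
        prob = pulp.LpProblem("inverse_feasibility")            # pure feasibility, no objective
        O = [pulp.LpVariable(f"O_{i}", cat="Binary") for i in range(G)]
        for cells in safe_traj_cells:                           # safe demos touch no unsafe cell
            for i in cells:
                prob += O[i] == 0
        for cells in unsafe_traj_cells:                         # each unsafe traj hits >= 1 unsafe cell
            prob += pulp.lpSum(O[i] for i in cells) >= 1
        prob.solve()
        return [int(v.value()) for v in O] if prob.status == pulp.LpStatusOptimal else None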
4 Method
The key to our method lies in finding lower-cost trajectories that do not violate the known constraints, given a demonstration with boundedly-suboptimal cost satisfying all constraints. Such trajectories must then violate the unknown constraint. Our goal is to determine an unsafe set in the constraint space from these trajectories using Problem 2. In the following, Sect. 4.1 describes lower-cost trajectories consistent with the known constraints; Sect. 4.2 describes how to sample
such trajectories; Sect. 4.3 describes how to get more information from unsafe trajectories; Sect. 4.4 describes details and extensions to Problem 2; Sect. 4.5 discusses how to extend our method to suboptimal demonstrations. The complete flow of our method is described in Algorithm 2.

4.1 Trajectories Satisfying Known Constraints
Consider the forward problem (Problem 1). We define the set of unsafe state-control trajectories induced by an optimal, safe demonstration ξ*_xu, T_A^{ξ*_xu}, as the set of state-control trajectories of lower cost that obey the known constraints:

$$
\mathcal{T}_A^{\xi^*_{xu}} \doteq \{\xi_{xu} \mid c(\xi_x, \xi_u) < c(\xi^*_x, \xi^*_u),\ \xi_{xu} \in \mathcal{T}_{\bar{S}},\ \xi_{xu} \in \mathcal{T}_{S_\Pi}\}.
\tag{3}
$$

In this paper, we deal with the known constraints from the system dynamics, the control limits, and task-dependent start and goal state constraints. Hence, T_S̄ = D^{ξ_xu} ∩ U^{ξ_xu}, where D^{ξ_xu} denotes the set of dynamically feasible trajectories and U^{ξ_xu} denotes the set of trajectories using controls in U at each time-step. T_{S_Π} denotes trajectories satisfying start and goal constraints. We develop the method for discrete time trajectories, but analogous definitions hold in continuous time. For discrete time, length-T trajectories, U^{ξ_xu}, D^{ξ_xu}, and T_{S_Π} are explicitly:

$$
\begin{aligned}
\mathcal{U}^{\xi_{xu}} &\doteq \{\xi_{xu} \mid u_t \in \mathcal{U},\ \forall t \in \{1, \ldots, T-1\}\}, \\
\mathcal{D}^{\xi_{xu}} &\doteq \{\xi_{xu} \mid x_{t+1} = f(x_t, u_t),\ \forall t \in \{1, \ldots, T-1\}\}, \\
\mathcal{T}_{S_\Pi} &\doteq \{\xi_{xu} \mid x_1 = x_s,\ x_T = x_g\}.
\end{aligned}
\tag{4}
$$
Table 1. Sampling methods for different dynamics/costs/feasible controls.

Dynamics   Cost function   Control constraints   Sampling method
Linear     Quadratic       Convex                Ellipsoid hit-and-run (Sect. 4.2.1)
Linear     Convex          Convex                Convex hit-and-run (Sect. 4.2.2)
Else       —               —                     Non-convex hit-and-run (Sect. 4.2.3)

4.2 Sampling Trajectories Satisfying Known Constraints
We sample from T_A^{ξ*_xu} to obtain lower-cost trajectories obeying the known constraints using hit-and-run sampling [18] over the set T_A^{ξ*_xu}, a method guaranteeing convergence to a uniform distribution of samples over T_A^{ξ*_xu} in the limit; the method is detailed in Algorithm 1 and an illustration is shown in Fig. 2.
Hit-and-run starts from an initial point within the set, chooses a direction uniformly at random, moves a random amount in that direction such that the new point remains within the set, and repeats. Depending on the convexity of the cost function and the control constraints and on the form of the dynamics, different sampling techniques can be used, organized in Table 1. The following sections describe each sampling method.
Algorithm 1: Hit-and-run
Input: T_A^{ξ*_xu}, ξ*_xu, N_¬s
Output: out ≐ {ξ_1, ..., ξ_{N_¬s}}
1   out ← {}; ξ_xu ← ξ*_xu
2   for i = 1 : N_¬s do
3       r ← sampleRandDirection
4       L ← T_A^{ξ*_xu} ∩ {ξ'_xu ∈ T | ξ'_xu = ξ_xu + βr}
5       L⁻, L⁺ ← endpoints(L)
6       ξ_xu ∼ Uniform(L⁻, L⁺)
7       out ← out ∪ ξ_xu
8   end

Fig. 2. Illustration of hit-and-run. Blue lines denote sampled random directions, black dots denote samples.
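A Python sketch of Algorithm 1 for the convex case; `in_set` is a hypothetical membership test for T_A^{ξ*_xu}, and we assume the set is bounded so that steps of size `beta_max` always exit it:

    import numpy as np

    def hit_and_run(xi0, in_set, n_samples, beta_max=10.0, tol=1e-6):
        xi, out = xi0.copy(), []
        for _ in range(n_samples):
            r = np.random.randn(*xi.shape)
            r /= np.linalg.norm(r)                            # sampleRandDirection
            lo = -_chord_end(xi, -r, in_set, beta_max, tol)   # L^-
            hi = _chord_end(xi, r, in_set, beta_max, tol)     # L^+
            xi = xi + np.random.uniform(lo, hi) * r           # xi ~ Uniform(L^-, L^+)
            out.append(xi.copy())
        return out

    def _chord_end(xi, r, in_set, beta_max, tol):
        lo, hi = 0.0, beta_max              # in_set(xi) holds; xi + beta_max*r is outside
        while hi - lo > tol:                # bisect to the set boundary along direction r
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if in_set(xi + mid * r) else (lo, mid)
        return lo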
4.2.1 Ellipsoid Hit-and-Run
When we have a linear system with quadratic cost and convex control constraints, a very common setup in the optimal control literature, T_A^{ξ*_xu} ≐ {ξ_xu | c(ξ_xu) < c(ξ*_xu)} ∩ D^{ξ_xu} ≡ {ξ_xu | ξ_xu^⊤ V ξ_xu < ξ*_xu^⊤ V ξ*_xu} ∩ D^{ξ_xu} is an ellipsoid in the trajectory space, which can be efficiently sampled via a specially-tailored hit-and-run method. Here, the quadratic cost is written as c(ξ_xu) ≐ ξ_xu^⊤ V ξ_xu, where V is a matrix of cost parameters, and we omit the control and task constraints for now. Without dynamics, the endpoints of the line L, L⁻, L⁺ (c.f. Algorithm 1) can be found by solving a quadratic equation (ξ*_xu + βr)^⊤ V (ξ*_xu + βr) = ξ*_xu^⊤ V ξ*_xu. We show that this can still be done with linear dynamics by writing T_A^{ξ*_xu} in a special way. D^{ξ_xu} can be written as an eigenspace of a singular "dynamics consistency" matrix, D_1, which converts any arbitrary state-control trajectory to one that satisfies the dynamics, one time-step at a time. Precisely, if the dynamics can be written as x_{t+1} = Ax_t + Bu_t, we can write a matrix D_1:

$$
\underbrace{\begin{bmatrix} x_1 \\ u_1 \\ x_2 \\ u_2 \\ \tilde{x}_3 \\ \vdots \\ u_{T-1} \\ \tilde{x}_T \end{bmatrix}}_{\hat{\xi}_{xu}}
=
\underbrace{\begin{bmatrix}
I & 0 & 0 & 0 & 0 & \cdots & \cdots & 0 \\
0 & I & 0 & 0 & 0 & \cdots & \cdots & 0 \\
A & B & 0 & 0 & 0 & \cdots & \cdots & 0 \\
0 & 0 & 0 & I & 0 & \cdots & \cdots & 0 \\
0 & 0 & A & B & 0 & \cdots & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \cdots & 0 & I & 0 \\
0 & 0 & 0 & \cdots & \cdots & A & B & 0
\end{bmatrix}}_{D_1}
\underbrace{\begin{bmatrix} x_1 \\ u_1 \\ \tilde{x}_2 \\ u_2 \\ \tilde{x}_3 \\ \vdots \\ u_{T-1} \\ \tilde{x}_T \end{bmatrix}}_{\xi_{xu}}
\tag{5}
$$
that fixes the controls and the initial state and performs a one-step rollout, replacing the second state with the dynamically correct state. In Eq. 5, we denote by x̃_{t+1} a state that cannot be reached by applying control u_t to state x_t. Multiplying the one-step corrected trajectory ξ̂_xu by D_1 again changes x̃_3 to the dynamically reachable state x_3. Applying D_1 to the original T-time-step infeasible trajectory T − 1 times results in a dynamically feasible trajectory, ξ^feas_xu = D_1^{T−1} ξ_xu. Further, note that the set of dynamically feasible trajectories is D^{ξ_xu} ≐ {ξ_xu | D_1 ξ_xu = ξ_xu}, which is the span of the eigenvectors of D_1 associated with eigenvalue 1. Thus, obtaining a feasible trajectory via repeated multiplication is akin to finding the eigenspace via power iteration [14]. One can also interpret this as propagating through the dynamics with a fixed control sequence. Now, we can write T_A^{ξ*_xu} as another ellipsoid, which can be efficiently sampled by finding L⁻, L⁺ via solving a quadratic equation:

$$
\mathcal{T}_A^{\xi^*_{xu}} \doteq \{\xi_{xu} \mid \xi_{xu}^\top (D_1^{T-1})^\top V D_1^{T-1} \xi_{xu} \leq \xi_{xu}^{*\top} V \xi^*_{xu}\}.
\tag{6}
$$
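In the ellipsoid case the endpoints come from a scalar quadratic; a sketch, assuming M = (D_1^{T−1})^⊤ V D_1^{T−1} has been precomputed and c_star = ξ*_xu^⊤ V ξ*_xu:

    import numpy as np

    def ellipsoid_endpoints(xi, r, M, c_star):
        # Solve (xi + beta*r)^T M (xi + beta*r) = c_star for beta.
        a = r @ M @ r
        b = 2.0 * (xi @ M @ r)
        c = xi @ M @ xi - c_star
        disc = b * b - 4.0 * a * c
        assert disc >= 0.0, "current point must lie inside the ellipsoid"
        sq = np.sqrt(disc)
        return (-b - sq) / (2.0 * a), (-b + sq) / (2.0 * a)  # (L^-, L^+)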
We deal with control constraints separately, as the intersection of U^{ξ_xu} and Eq. 6 is in general not an ellipsoid. To ensure control constraint satisfaction, we reject samples with controls outside of U^{ξ_xu}; this works if U^{ξ_xu} is not measure zero. For task constraints, we ensure all sampled rollouts obey the goal constraints by adding a large penalty term to the cost function: c̃(·) ≐ c(·) + α_c ‖x_g − x_T‖²₂, where α_c is a large scalar, which can be incorporated into Eq. 6 by modifying V and including x_g in ξ_xu; all trajectories sampled in this modified set satisfy the goal constraints to an arbitrarily small tolerance ε, depending on the value of α_c. The start constraint is satisfied trivially: all rollouts start at x_s. Note the demonstration cost remains the same, since the demonstration satisfies the start and goal constraints; this modification is made purely to ensure these constraints hold for sampled trajectories.

4.2.2 Convex Hit-and-Run
For general convex cost functions, the same sampling method holds, but L⁺, L⁻ cannot be found by solving a quadratic equation. Instead, we solve c(ξ_xu + βr) = c(ξ*_xu) via a root-finding algorithm or line search.

4.2.3 Non-convex Hit-and-Run
If T_A^{ξ*_xu} is non-convex, L can now in general be a union of disjoint line segments. In this scenario, we perform a "backtracking" line search by setting β to lie in some initial range β ∈ [β_min, β_max], sampling β_s within this range, and then evaluating the cost function to see whether or not ξ_xu + β_s r lies within the intersection. If it does, the sample is kept and hit-and-run proceeds normally; if not, then the range of possible β values is restricted to [β_s, β_max] if β_s is negative, and [β_min, β_s] otherwise. Then, new β_s are re-sampled until either the interval length shrinks below a threshold or a feasible sample is found. This altered hit-and-run technique still converges to a uniform distribution on the set in the limit, but has a slower mixing time than in the convex case, where mixing time describes the number of samples needed until the total variation distance to the steady state distribution
is less than a small threshold [1]. Further, we accelerate sampling spread by relaxing the goal constraint to a larger tolerance ε̂ > ε but keeping only the trajectories reaching within ε of the goal.
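A sketch of this backtracking step (names illustrative; `in_set` is the same hypothetical membership test as before):

    import numpy as np

    def backtracking_step(xi, r, in_set, beta_lo, beta_hi, min_len=1e-6):
        # Shrink [beta_lo, beta_hi] until a feasible step is sampled or the interval collapses.
        while beta_hi - beta_lo > min_len:
            beta = np.random.uniform(beta_lo, beta_hi)
            if in_set(xi + beta * r):
                return xi + beta * r          # feasible sample found
            if beta < 0:
                beta_lo = beta                # infeasible on the negative side: [beta, beta_hi]
            else:
                beta_hi = beta                # infeasible on the positive side: [beta_lo, beta]
        return xi                             # interval collapsed; keep the current point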
4.3 Improving Learnability Using Cost Function Structure
Naïvely, the sampled unsafe trajectories may provide little information. Consider an unsafe, length-T discrete-time trajectory ξ, with start and end states in the safe set. This only says there exists at least one intermediate unsafe state in the trajectory, but says nothing directly about which state was unsafe. The weakness of this information can be made concrete using the notion of a version space. In machine learning, the version space is the set of consistent hypotheses given a set of examples [28]. In our setting, hypotheses are possible unsafe sets, and examples are the safe and unsafe trajectories. Knowing ξ is unsafe only disallows unsafe sets that mark every element of the constraint space that ξ traverses as safe: (O(z_2) = 0) ∧ ... ∧ (O(z_{T−1}) = 0). If C is gridded into G cells, this information invalidates at most 2^{G−T+2} out of 2^G possible unsafe sets. We could do exponentially better if we reduced the number of cells that ξ implies could be unsafe. We can achieve this by sampling sub-segments (or sub-trajectories) of the larger demonstrations, holding other portions of the demonstration fixed. For example, say we fix all but one of the points on ξ when sampling unsafe lower-cost trajectories. Since only one state can be different from the known safe demonstration, the unsafeness of the trajectory can be uniquely localized to whatever new point was sampled: then, this trajectory will reduce the version space by at most a factor of 2, invalidating at most 2^G − 2^{G−1} = 2^{G−1} unsafe sets. One can sample these sub-trajectories in the full-length trajectory space by fixing appropriate waypoints during sampling: this ensures the full trajectory has lower cost and only perturbs desired waypoints. However, to speed up sampling, sub-trajectories can be sampled directly in the lower dimensional sub-trajectory space if the cost function c(·) that is being optimized is strictly monotone [22]: for any costs c_1, c_2 ∈ R, control u ∈ U, and state x ∈ X, c_1 < c_2 ⇒ h(c_1, x, u) < h(c_2, x, u), where h(c, x, u) represents the cost of starting with initial cost c at state x and taking control u. Strictly monotone cost functions include separable cost functions with additive or multiplicative stage costs, which are common
in motion planning and optimal control. If the cost function is strictly monotone, we can sample lower-cost trajectories from sub-segments of the optimal path; otherwise it is possible that even if a new sub-segment with lower cost than the original sub-segment were sampled, the full trajectory containing the sub-segment could have a higher cost than the demonstration.

Algorithm 2: Overall method
Input: ξ_s ≐ {ξ*_1, ..., ξ*_{N_s}}, c_Π(·), known constraints
Output: O ≐ O(1), ..., O(G)
1    ξ_u ← {}
     /* Sample unsafe ξ */
2    for i = 1 : N_s do
3        if lin., quad., conv. then
4            ξ_u ← ξ_u ∪ ellipsoidHNR(ξ*_i)
5        else if lin., conv., conv. then
6            ξ_u ← ξ_u ∪ convexHNR(ξ*_i)
7        else
8            ξ_u ← ξ_u ∪ nonconvexHNR(ξ*_i)
9    end
     /* Constraint recovery */
10   if prior, continuous then
11       O ← Problem 4(ξ_s, ξ_u)
12   else if prior, binary then
13       O ← Problem 3(ξ_s, ξ_u)
14   else
15       O ← Problem 2(ξ_s, ξ_u)
16   end
4.4 Integer Program Formulation
After sampling, we can solve Problem 2 to find an unsafe set consistent with the safe and unsafe trajectories. We now discuss the details of this process.

Conservative Estimate: One can obtain a conservative estimate of the unsafe set A from Problem 2 by intersecting all possible solutions: if the unsafeness of a cell is shared across all feasible solutions, that cell must be occupied. In practice, it may be difficult to directly find all solutions to the feasibility problem, as in the worst case, finding the set of all feasible solutions is equivalent to exhaustive search in the full gridded space [25]. A more efficient method is to loop over all G grid cells, set each one to be safe in turn, and see if the optimizer can still find a feasible solution. Cells where there exists no feasible solution are guaranteed unsafe. This amounts to solving G binary integer feasibility problems, which can be trivially parallelized. Furthermore, any cells that are known safe (from demonstrations) do not need to be checked. We use this method to compute the "learned guaranteed unsafe set", A^rec_l, in Sect. 6.

A Prior on the Constraint: As will be further discussed in Sect. 5.1, it may be fundamentally impossible to recover a unique unsafe set. If we have some prior on the nature of the unsafe set, such as it being simply connected, or that certain regions of the constraint space are unlikely to be unsafe, we can make the constraint learning problem more well-posed. Assume that this prior knowledge can be encoded in some "energy" function E(·, ..., ·): {0,1}^G → R mapping the set of binary occupancies to a scalar value, which indicates the desirability of a particular unsafe set configuration. Using E as the objective function in Problem 2 results in a binary integer program, which finds an unsafe set consistent with the safe and unsafe trajectories and minimizes the energy:

Problem 3 (Inverse binary minimization constraint recovery).

$$
\begin{aligned}
\underset{O(z_1),\ldots,O(z_G)\in\{0,1\}^G}{\text{minimize}} \quad & E(O(z_1), \ldots, O(z_G)) \\
\text{subject to} \quad & \textstyle\sum_{z_i \in \{\phi(\xi^*_{s_j}(1)), \ldots, \phi(\xi^*_{s_j}(T_j))\}} O(z_i) = 0, \quad \forall j = 1, \ldots, N_s \\
& \textstyle\sum_{z_i \in \{\phi(\xi_{\neg s_k}(1)), \ldots, \phi(\xi_{\neg s_k}(T_k))\}} O(z_i) \geq 1, \quad \forall k = 1, \ldots, N_{\neg s}
\end{aligned}
\tag{7}
$$
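The conservative estimate described above reuses the feasibility solver once per cell; a sketch, assuming a hypothetical `feasible(force_safe)` wrapper that re-solves Problem 2 with the given cells pinned to O(z_i) = 0 and reports whether a solution exists:

    def guaranteed_unsafe_cells(G, feasible, demo_cells):
        # A cell is guaranteed unsafe iff pinning it safe makes Problem 2 infeasible.
        known_safe = set().union(*demo_cells) if demo_cells else set()
        unsafe = []
        for i in range(G):
            if i in known_safe:
                continue                  # cells touched by demonstrations are already safe
            if not feasible(force_safe=[i]):
                unsafe.append(i)          # no consistent unsafe set leaves cell i safe
        return unsafe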
Probabilistic Setting and Continuous Relaxation: A similar problem can be posed for a probabilistic setting, where grid cell occupancies represent beliefs over unsafeness: instead of the occupancy of a cell being an indicator variable,
it is instead a random variable Z_i, where Z_i takes value 1 with probability Õ(Z_i) and value 0 with probability 1 − Õ(Z_i). Here, the occupancy probability function Õ(·): Z → [0, 1] maps cells to occupancy probabilities.

Trajectories can now be unsafe with some probability. We obtain analogous constraints from the integer program in Sect. 4.4 in the probabilistic setting. Known safe trajectories traverse cells that are unsafe with probability 0; we enforce this with the constraint Σ_{Z_i ∈ φ(ξ*_{s_j})} Õ(Z_i) = 0: if the unsafeness probabilities are all zero along a trajectory, then the trajectory must be safe. Trajectories that are unsafe with probability p_k satisfy Σ_{Z_i ∈ φ(ξ_{¬s_k})} Õ(Z_i) = E[Σ_{Z_i ∈ φ(ξ_{¬s_k})} Z_i] = (1 − p_k) · 0 + p_k · S_k ≥ p_k, where we denote the number of unsafe grid cells φ(ξ_{¬s_k}) traverses when the trajectory is unsafe as S_k, where S_k ≥ 1. The following problem directly optimizes over occupancy probabilities:

Problem 4 (Inverse continuous minimization constraint recovery).

$$
\begin{aligned}
\underset{\tilde{O}(Z_1),\ldots,\tilde{O}(Z_G)\in[0,1]^G}{\text{minimize}} \quad & E(\tilde{O}(Z_1), \ldots, \tilde{O}(Z_G)) \\
\text{subject to} \quad & \textstyle\sum_{Z_i \in \{\phi(\xi^*_{s_j}(1)), \ldots, \phi(\xi^*_{s_j}(T_j))\}} \tilde{O}(Z_i) = 0, \quad \forall j = 1, \ldots, N_s \\
& \textstyle\sum_{Z_i \in \{\phi(\xi_{\neg s_k}(1)), \ldots, \phi(\xi_{\neg s_k}(T_k))\}} \tilde{O}(Z_i) \geq p_k, \quad \forall k = 1, \ldots, N_{\neg s}
\end{aligned}
\tag{8}
$$
When p_k = 1 for all k (i.e., all unsafe trajectories are unsafe for sure), this probabilistic formulation coincides with the continuous relaxation of Problem 3. This justifies interpreting the solution of the continuous relaxation as occupancy probabilities for each cell. Note that Problems 3 and 4 have no conservativeness guarantees and use prior assumptions to make the problem more well-posed. However, we observe that they improve constraint recovery in our experiments.
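A sketch of Problem 4 with the smoothness energy used later in the experiments, written with cvxpy; `neighbors` is a hypothetical adjacency list over grid cells and `unsafe_with_p` pairs each unsafe trajectory's cell set with its probability p_k:

    import cvxpy as cp

    def solve_problem4(G, safe_cells, unsafe_with_p, neighbors):
        O = cp.Variable(G)
        cons = [O >= 0, O <= 1]
        for cells in safe_cells:
            cons += [O[i] == 0 for i in cells]             # safe demos: zero occupancy
        for cells, p in unsafe_with_p:
            cons.append(sum(O[i] for i in cells) >= p)     # unsafe with probability p_k
        energy = sum(cp.square(O[i] - O[j]) for i in range(G) for j in neighbors[i])
        cp.Problem(cp.Minimize(energy), cons).solve()
        return O.value                                     # occupancy probabilities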
4.5 Bounded Suboptimality of Demonstrations
If we are given a δ-suboptimal demonstration ξ̂, where c(ξ*) ≤ c(ξ̂) ≤ (1 + δ)c(ξ*) and ξ* is an optimal demonstration, we can still apply the sampling techniques discussed in earlier sections, but we must ensure that sampled unsafe trajectories are truly unsafe: a sampled trajectory ξ′ of cost c(ξ′) ≥ c(ξ*) can be potentially safe. Two options follow: one is to only keep trajectories with cost less than c(ξ̂)/(1 + δ), but this can cause little to be learned if δ is large. Instead, if we assume a distribution on suboptimality, i.e., given a trajectory of cost c(ξ̂), we know that a trajectory of cost c(ξ′) ∈ [c(ξ̂)/(1 + δ), c(ξ̂)] is unsafe with probability p_k, we can then use these values of p_k to solve Problem 4. We implement this in the experiments.
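One simple such suboptimality model, matching the linear form used later in the experiments, maps a sampled trajectory's cost to an unsafe probability (a sketch):

    def unsafe_probability(c_sample, c_demo, delta):
        lo = c_demo / (1.0 + delta)
        if c_sample < lo:
            return 1.0                                   # cheaper than any optimal candidate
        if c_sample >= c_demo:
            return 0.0
        return 1.0 - (c_sample - lo) / (c_demo - lo)     # linear over [c_demo/(1+delta), c_demo]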
5 Analysis
Due to space, the proofs and further remarks can be found in the extended paper [11].

5.1 Learnability
We provide analysis on the learnability of unsafe sets, given the known constraints and cost function. Most analysis assumes unsafe sets defined over the state space, A ⊆ X, but we extend it to the feature space in Corollary 3. We provide some definitions and state a result bounding A_l, the set of all states that can be learned guaranteed unsafe. We first define the signed distance:

Definition 1 (Signed distance). Signed distance from point p ∈ R^m to set S ⊆ R^m: sd(p, S) = −inf_{y∈∂S} ‖p − y‖ if p ∈ S; inf_{y∈∂S} ‖p − y‖ if p ∈ S^c.

Theorem 1 (Learnability (discrete time)). For trajectories generated by a discrete time dynamical system satisfying ‖x_{t+1} − x_t‖ ≤ Δx for all t, the set of learnable guaranteed unsafe states is a subset of the outermost Δx shell of the unsafe set: A_l ⊆ {x ∈ A | −Δx ≤ sd(x, A) ≤ 0} (see Section A.1 in [11] for an illustration).

Corollary 1 (Learnability (continuous time)). For continuous trajectories ξ(·): [0, T] → X, the set of learnable guaranteed unsafe states shrinks to the boundary of the unsafe set: A_l ⊆ {x ∈ A | sd(x, A) = 0}.

Depending on the cost function, A_l can become arbitrarily small: some cost functions are not very informative for recovering a constraint. For example, the path length cost function used in many of the experiments (which was chosen due to its common use in the motion planning community) prevents any lower-cost sub-trajectories from being sampled from straight sub-trajectories. The system's controllability also impacts learnability: the more controllable the system, the more of the Δx shell is reachable. We present a theorem quantifying when the dynamics allow unsafe trajectories to be sampled in Theorem A.2 in [11].
5.2 Conservativeness
We discuss conditions on A and the discretization which ensure our method provides a conservative estimate of A. For analysis, we assume A has a Lipschitz boundary [12]. We begin with notation (explanatory illustrations are in Section A.2 in [11]):

Definition 2 (Set thickness). Denote the outward-pointing normal vector at a point p ∈ ∂A as n̂(p). Furthermore, at non-differentiable points on ∂A, n̂(p) is replaced by the set of normal vectors for the sub-gradient of the Lipschitz function describing ∂A at that point [4]. The set A has a thickness larger than d_thick if ∀x ∈ ∂A, ∀d ∈ [0, d_thick], sd(x − d n̂(x), A) ≤ 0.

Definition 3 (γ-offset padding). Define the γ-offset padding ∂A^γ as: ∂A^γ = {x ∈ X | x = y + d n̂(y), d ∈ [0, γ], y ∈ ∂A}.
Definition 4 (γ-padded set). We define the γ-padded set of the unsafe set A, A(γ), as the union of the γ-offset padding and A: A(γ) ≐ ∂A^γ ∪ A.

Corollary 2 (Conservative recovery of unsafe set). For a discrete-time system, a sufficient condition ensuring that the set of learned guaranteed unsafe states A^rec_l is contained in A is that A has a set thickness greater than or equal to Δx (c.f. Definition 1).

If we use continuous trajectories directly, the guaranteed learnable set A_l shrinks to a subset of the boundary of the unsafe set, ∂A (c.f. Corollary 1). However, if we discretize these trajectories, we can learn unsafe states lying in the interior, at the cost of conservativeness holding only for a padded unsafe set. For the following results, we make two assumptions, which are illustrated in Figs. 10 and 11 of [11] for clarity:

Assumption 1: The unsafe set A is aligned with the grid (i.e., there does not exist a grid cell z containing both safe and unsafe states).

Assumption 2: The time discretization of the unsafe trajectory ξ: [0, T] → X, {t_1, ..., t_N}, t_i ∈ [0, T] for all i, is chosen such that there exists at least one discretization point for each cell that the continuous trajectory passes through (i.e., if ∃t ∈ [0, T] such that ξ(t) ∈ z, then ∃t_i ∈ {t_1, ..., t_N} such that ξ(t_i) ∈ z).

Theorem 2 (Continuous-to-discrete time conservativeness). Suppose that both Assumptions 1 and 2 hold. Then, the learned guaranteed unsafe set A^rec_l, defined in Sect. 4.4, is contained within the true unsafe set A. Now, suppose that only Assumption 1 holds. Furthermore, suppose that Problems 2–4 use M sub-trajectories sampled with Algorithm 1 as unsafe trajectories, and that each sub-trajectory is defined over the time interval [a_i, b_i], i = 1, ..., M. Denote f_Δx([t_1, t_2]) ≐ sup_{x∈X, u∈U, t∈[t_1,t_2]} ‖f(x, u, t)‖ · (t_2 − t_1), and denote [a*, b*] ≐ [a_j, b_j], where j = argmax_i f_Δx([a_i, b_i]). Then, the learned guaranteed unsafe set A^rec_l is contained within the f_Δx([a*, b*])-padded unsafe set, A(f_Δx([a*, b*])).

Corollary 3 (Continuous-to-discrete feature space conservativeness). Let the feature mapping φ(x) from the state space to the constraint space be Lipschitz continuous with Lipschitz constant L. Then, under Assumptions 1 and 2 used in Theorem 2, our method recovers a subset of the L·f_Δx([a*, b*])-padded unsafe set in the feature space, A(L·f_Δx([a*, b*])), where [a*, b*] is as defined in Theorem 2.
6 Evaluations
We provide an example showing the importance of using unsafe trajectories, and experiments showing that our method generalizes across system dynamics, that it works with discretization and suboptimal demonstrations, and that it learns a constraint in a feature space from a single demonstration. See Appendix B in [11] for parameters, cost functions, the dynamics, control constraints, and timings.
Version Space Example: Consider a simple 5 × 5 8-connected grid world in which the tasks are to go from a start to a goal, minimizing Euclidean path length while staying out of the unsafe "U-shape", the outline of which is drawn in black (Fig. 3). Four demonstrations are provided, shown in Fig. 3 on the far left. Initially, the version space contains 2^25 possible unsafe sets. Each safe trajectory of length T reduces the version space at most by a factor of 2^T, invalidating at most 2^25 − 2^{25−T} possible unsafe sets. Unsafe trajectories are computed by enumerating the set of trajectories going from the start to the goal at lower cost than the demonstration. The numbers of unsafe sets consistent with the safe and unsafe trajectories for varying numbers of safe trajectories are given in Table 2.

Table 2. Number of consistent unsafe sets, varying the number of demonstrations, using/not using unsafe trajectories.

Demonstrations    1        2      3      4
Safe              262144   4096   1024   256
Safe & unsafe     11648    48     12     3

Ultimately, it is impossible to distinguish between the three unsafe sets on the right in Fig. 3. This is because there exists no task where a trajectory with cost lower than the demonstration can be sampled which only goes through one of the two uncertain states. Further, though the uncertain states are in the Δx shell of the constraint, due to the limitations of the cost function, we can only learn a subset of that shell (c.f. Theorem 1). There are two main takeaways from this experiment. First, by generating unsafe trajectories, we can reduce the uncertainty arising from the ill-posedness of constraint learning: after 4 demonstrations, using unsafe trajectories enables us to reduce the number of possible constraints by nearly a factor of 100, from 256 to 3. Second, due to limitations in the cost function, it may be impossible to recover a unique unsafe set, but the version space can be reduced substantially by sampling unsafe trajectories.
Fig. 3. Leftmost: Demonstrations and unsafe set. Rest: Set of possible constraints. Postulated unsafe cells are plotted in red, safe states in blue.
Dynamics and Discretization: Experiments in Fig. 4 show that our method can be applied to several types of system dynamics, can learn non-convex/multiple unsafe sets, and can use continuous trajectories. The dynamics, control constraints, and cost functions for each experiment are given in Table 5 in Appendix B in [11]. All unsafe sets A are open sets. We solve Problems 3 and 4, with an energy function promoting smoothness by penalizing squared deviations of the occupancy of a grid cell z_i from its 4-connected neighbors N(z_i): $\sum_{i=1}^{G} \sum_{z_j \in N(z_i)} \|O(z_i) - O(z_j)\|_2^2$. In all experiments, the mean squared
error (MSE) is computed as $\frac{1}{G}\sum_{i=1}^{G} \|O(z_i) - O^*(z_i)\|_2^2$, where O*(z_i) is the ground truth occupancy. The demonstrations are color-matched with their corresponding number on the x-axis of the MSE plots. For experiments with more demonstrations, only those causing a notable change in the MSE were color-coded. The learned guaranteed unsafe states A^rec_l are colored red in the left column. We recover a non-convex "U-shaped" unsafe set in the state space using trivial 2D single-integrator dynamics (row 1 of Fig. 4). The solutions to both Problems 4 and 3 return reasonable results, and the solution of Problem 3 achieves zero error. The second row shows learning two polyhedral unsafe sets in the state space with 4D double integrator linear dynamics, yielding similar results. We note the linear interpolation of some demonstrations in rows 1 and 2 enters A; this is because both sets of dynamics are in discrete time and only the discrete waypoints must stay out of A. The third row shows learning a polyhedral unsafe set in the state space, with time-discretized continuous, nonlinear Dubins' car dynamics, which has a 3D state x = [χ, y, θ]^⊤. These dynamics are more constrained than the previous cases, so sampling lower cost trajectories becomes more difficult, but despite this we can still achieve near zero error solving Problem 3. Some over-approximation results from some sampled unsafe trajectories entering regions not covered by the safe trajectories. For example, the cluster of red blocks to the top left of A is generated by lower-cost trajectories that trade off the increased cost of entering the top left region by entering A. This phenomenon is consistent with Theorem 2; we recover a set that is contained within A(f_Δx([0, T_max])) (the maximum trajectory length T_max was 14.1 s). Learning curve spikes occur when overapproximation occurs. Overall, we note A^rec_l tends to be a significant underapproximation of A due to the chosen cost function and limited demonstrations. For example, in row 1 of Fig. 4, A^rec_l cannot contain the portion of A near long straight edges, since there exists no shorter path going from any start to any goal with only one state within that region. For row 3 of Fig. 4, we learn less of the bottom part of A due to most demonstrations' start and goal locations making it harder to sample feasible control trajectories going through that region; with more demonstrations, this issue becomes less pronounced.

Suboptimal Human Demonstrations: We demonstrate our method on suboptimal demonstrations collected via a driving simulator, using a car model with continuous-time Dubins' car dynamics. Human steering commands were recorded as demonstrations, where the task was to navigate around the orange box and drive between the trees (Fig. 5). For a demonstration of cost c, trajectories with cost less than 0.9c were believed unsafe with probability 1. Trajectories with cost c′ in the interval [0.9c, c] were believed unsafe with probability 1 − ((c′ − 0.9c)/0.1c). MSE for Problem 4 is shown in Fig. 5 (Problem 3 is not solved since the probabilistic interpretation is needed). The maximum trajectory length T_max is 19.1 seconds; hence, despite suboptimality, the learned guaranteed unsafe set is a subset of A(f_Δx([0, T_max])). While the MSE is highest here of all experiments, this is expected, as trajectories may be incorrectly labeled safe/unsafe with some probability.
Fig. 4. Results across dynamics and discretization. Rows (top to bottom): single integrator; double integrator; Dubins' car (continuous time). Columns (left to right): demonstrations, A, A^rec_l; MSE; Problem 4 solution, all demos; Problem 3 solution, all demos.
Fig. 5. Suboptimal demonstrations: left: setup; center: demonstrations, A, A^rec_l; center-right: MSE; right: solution to Problem 4.
Feature Space Constraint: We demonstrate that our framework is not limited to the state space by learning a constraint in a feature space. Consider the scenario of planning a safe path for a mobile robot with continuous Dubins' car dynamics through hilly terrain, where the magnitude of the terrain's slope is given as a feature map (i.e., φ(x) = ‖∂H(x̂)/∂x̂‖₂, where x̂ = [χ, y]^⊤ and H(x̂) is the elevation map). The robot will slip if the magnitude of the terrain slope is too large, so we generate a demonstration which obeys the ground truth constraint φ(x) < 0.05; hence, the ground truth unsafe set is A ≐ {x | φ(x) ≥ 0.05}. From one safe trajectory (Fig. 6) generated by RRT* [16] and gridding the feature space as {0, 0.005, ..., 0.145, 0.15}, we recover the constraint φ(x) < 0.05 exactly.
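The slope-magnitude feature can be computed numerically from a gridded elevation map; a sketch with numpy, where H is a hypothetical 2-D elevation array with grid spacing dx:

    import numpy as np

    def slope_feature(H, dx):
        # phi(x) = || dH/dx_hat ||_2, evaluated on the elevation grid
        gy, gx = np.gradient(H, dx)
        return np.hypot(gx, gy)

    # Ground-truth unsafe set for this experiment: A = {x | phi(x) >= 0.05}
    # unsafe_mask = slope_feature(H, dx) >= 0.05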
7 Conclusion
In this paper we propose an algorithm that learns constraints from demonstrations, which acts as a complementary method to IOC/IRL algorithms. We analyze the properties of our algorithm as well as the theoretical limits of what subset of an unsafe set can be learned from safe demonstrations. The method works well on a variety of system dynamics and can be adapted to work with suboptimal demonstrations. We further show that our method can also learn constraints in a feature space. The largest shortcoming of our method is the constraint space gridding, which yields a complex constraint representation and causes the method to scale poorly to higher dimensional constraints. We aim to remedy this issue in future work by developing a grid-free counterpart of our method for convex unsafe sets, which can directly describe standard pose constraints like task space regions [8].

Fig. 6. Demonstration (red: start, green: goal). Unsafe set A is plotted in orange. Terrain isocontours H(x) = const are overlaid.
Acknowledgements. This work was supported in part by a Rackham first-year graduate fellowship, ONR grants N00014-18-1-2501 and N00014-17-1-2050, and NSF grants CNS-1446298, ECCS-1553873, and IIS-1750489.
References

1. Abbasi-Yadkori, Y., Bartlett, P.L., Gabillon, V., Malek, A.: Hit-and-run for sampling and planning in non-convex spaces. In: AISTATS 2017 (2017)
2. Abbeel, P., Ng, A.Y.: Apprenticeship learning via inverse reinforcement learning. In: ICML (2004)
3. Akametalu, A., Fisac, J., Gillula, J., Kaynama, S., Zeilinger, M., Tomlin, C.: Reachability-based safe learning with Gaussian processes. In: CDC, December 2014
4. Allaire, G., Jouve, F., Michailidis, G.: Thickness control in structural optimization via a level set method. Struct. Multidisc. Optim. 53, 1349–1382 (2016)
5. Amin, K., Jiang, N., Singh, S.P.: Repeated inverse reinforcement learning. In: NIPS, pp. 1813–1822 (2017)
6. Argall, B., Chernova, S., Veloso, M.M., Browning, B.: A survey of robot learning from demonstration. Robot. Auton. Syst. 57(5), 469–483 (2009)
7. Armesto, L., Bosga, J., Ivan, V., Vijayakumar, S.: Efficient learning of constraints and generic null space policies. In: ICRA, pp. 1520–1526. IEEE (2017)
8. Berenson, D., Srinivasa, S.S., Kuffner, J.J.: Task space regions: a framework for pose-constrained manipulation planning. IJRR 30(12), 1435–1460 (2011)
9. Calinon, S., Billard, A.: Incremental learning of gestures by imitation in a humanoid robot. In: HRI 2007, pp. 255–262 (2007)
10. Calinon, S., Billard, A.: A probabilistic programming by demonstration framework handling constraints in joint space and task space. In: RSJ (2008)
11. Chou, G., Berenson, D., Ozay, N.: Learning constraints from demonstrations. CoRR, abs/1812.07084 (2018)
12. Dacorogna, B.: Introduction to the Calculus of Variations. Imperial College Press, London (2015)
13. Englert, P., Vien, N.A., Toussaint, M.: Inverse KKT: learning cost functions of manipulation tasks from demonstrations. IJRR 36(13–14), 1474–1488 (2017)
14. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore (1996)
15. Kalman, R.E.: When is a linear control system optimal? J. Basic Eng. 86(1), 51–60 (1964)
16. Karaman, S., Frazzoli, E.: Incremental sampling-based algorithms for optimal motion planning. In: RSS (2010)
17. Keshavarz, A., Wang, Y., Boyd, S.P.: Imputing a convex objective function. In: ISIC, pp. 613–619. IEEE (2011)
18. Kiatsupaibul, S., Smith, R.L., Zabinsky, Z.B.: An analysis of a variation of hit-and-run for uniform sampling from general regions. TOMACS 21, 1–11 (2011)
19. Knepper, R.A., Mavrogiannis, C.I., Proft, J., Liang, C.: Implicit communication in a joint action. In: HRI, pp. 283–292 (2017)
20. Li, C., Berenson, D.: Learning object orientation constraints and guiding constraints for narrow passages from one demonstration. In: ISER. Springer (2016)
21. Mehr, N., Horowitz, R., Dragan, A.D.: Inferring and assisting with constraints in shared autonomy. In: CDC, pp. 6689–6696, December 2016
22. Morin, T.L.: Monotonicity and the principle of optimality. J. Math. Anal. Appl. 88(2), 665–674 (1982)
23. Ng, A.Y., Russell, S.J.: Algorithms for inverse reinforcement learning. In: ICML 2000, San Francisco, CA, USA, pp. 663–670 (2000)
24. Pais, A.L., Umezawa, K., Nakamura, Y., Billard, A.: Learning robot skills through motion segmentation and constraints extraction. In: HRI (2013)
25. Papadimitriou, C.H., Steiglitz, K.: Combinatorial Optimization: Algorithms and Complexity. Prentice Hall, Englewood Cliffs (1982)
26. Pérez-D'Arpino, C., Shah, J.A.: C-LEARN: learning geometric constraints from demonstrations for multi-step manipulation in shared autonomy. In: ICRA (2017)
27. Ratliff, N.D., Bagnell, J.A., Zinkevich, M.: Maximum margin planning. In: ICML 2006, pp. 729–736 (2006)
28. Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach (2003)
29. Schreiter, J., Nguyen-Tuong, D., Eberts, M., Bischoff, B., Markert, H., Toussaint, M.: Safe exploration for active learning with Gaussian processes. In: ECML (2015)
30. Turchetta, M., Berkenkamp, F., Krause, A.: Safe exploration in finite Markov decision processes with Gaussian processes. In: NIPS, pp. 4305–4313 (2016)
31. Ye, G., Alterovitz, R.: Demonstration-guided motion planning. In: ISRR (2011)
Probability-Weighted Temporal Registration for Improving Robot Motion Planning and Control Learned from Demonstrations Chris Bowen and Ron Alterovitz University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA {cbbowen,ron}@cs.unc.edu Abstract. Many existing methods that learn robot motion planning task models or control policies from demonstrations require that the demonstrations be temporally aligned. Temporal registration involves an assignment of individual observations from a demonstration to the ordered steps in some reference model, which facilitates learning features of the motion over time. We introduce probability-weighted temporal registration (PTR), a general form of temporal registration that includes two useful features for motion planning and control policy learning. First, PTR explicitly captures uncertainty in the temporal registration. Second, PTR avoids degenerate registrations in which too few observations are aligned to a time step. Our approach is based on the forward-backward algorithm. We show how to apply PTR to two task model learning methods from prior work, one which learns a control policy and another which learns costs for a sampling-based motion planner. We show that incorporating PTR yields higher-quality learned task models that enable faster task executions and higher task success rates.
1
Introduction
Registering a time sequence of observations to a reference model is a common subproblem in many robotics algorithms, including algorithms for learning motion planning task models and control policies from demonstrations as in Fig. 1 (e.g., [7,8,14]). This subproblem may be referred to as time alignment, time warping, or temporal registration (the term we use in this paper). Formally, temporal registration is an assignment of individual observations (from a sequence of observations) to the ordered steps in some underlying reference model. In the domain of robot learning from demonstrations, the observations correspond to the time-ordered data collected during each demonstration, where each demonstration must be temporally aligned to an underlying task model (e.g., a reference demonstration or task representation) to facilitate learning features of the motion over time from a set of demonstrations. Temporal registration abstracts away differences in execution speed both between and within demonstrated trajectories and is often a critical step to learning effective task models.
Fig. 1. The Baxter robot performing a knot-tying task learned from demonstrations. The temporal registration of demonstrations can have a significant impact on the quality of the learned task model, and better registration can improve task performance or reduce the number of required demonstrations.
[Fig. 2 diagrams: (a) Traditional approach (DTW); (b) Our approach (PTR) — registrations of observations y1, ..., y8 to time steps θ1, ..., θ5.]
Fig. 2. Comparison between traditional temporal registration (e.g., using dynamic time warping (DTW)) and probability-weighted temporal registration (using our approach, PTR) to register a sequence of observations {y1 , . . . , y8 } to a reference model {θ1 , . . . , θ5 }. In our approach, we register all samples to the task model, avoid degenerate registrations, and assign probabilities to the alignment of each observation, which can be utilized by a learning algorithm to improve the quality of a learned task model.
Many prior methods for learning synthesizable models rely on temporal registration using dynamic time-warping (DTW) [19] or its variants (discussed in Sect. 2), either as a preprocessing step (e.g., [8,23,25]) or interleaved with model estimation (e.g., [4,7]). DTW can be viewed as a maximum-likelihood approach to temporal registration. However, maximum-likelihood approaches are inherently prone to becoming caught in local minima, yielding suboptimal registrations when a different initialization would produce better results. Furthermore, DTW and many of its variants may drop some or even most observations from the temporal registration, which can produce degenerate temporal registrations in which a time step of the reference model has too few observations aligned to it. Most importantly, DTW does not consider uncertainty in the temporal registration. We introduce probability-weighted temporal registration (PTR), a more general form of temporal registration that explicitly captures uncertainty in the temporal registration. Instead of assuming each observation is registered to
(at most) one time step of the reference model, we instead compute probabilityweighted assignments (as shown in Fig. 2). These weighted assignments can be leveraged in robot learning algorithms to yield more robust task models. Specifically, we propose to use the classical forward-backward algorithm [2] to compute probability-weighted temporal registration. This approach is not new, but we present a framework based on a tensor product graph in which DTW and the forward-backward algorithm are in fact remarkably similar in terms of implementation. Using this graphical approach, we describe a novel modification for avoiding degenerate registrations that improves registrations across methods in practical applications. We apply PTR to two existing task learning methods from prior work, one which learns a task model for a control policy [4] and another which learns a task model that informs edge costs for a sampling-based motion planner [7]. We show that incorporating PTR yields higher-quality learned task models that enable faster task executions and higher task success rates on challenging tasks both in simulation and on the Baxter robot.
2
Related Work
Temporal registration is necessary for solving many estimation, classification, and clustering problems in robotics, signal processing, and other fields. The most commonly used method for temporal registration is dynamic time warping (DTW), which has been successfully applied to problems such as gesture recognition [6] and robot task learning using Gaussian mixture models [8,25] as a preprocessing step. Similar approaches have been applied to the problem of learning to manipulate deformable objects [14]. Other methods instead integrate DTW or similar methods into an iterative estimation process [4,7]. DTW has also been used to learn and subsequently execute tasks in the presence of external perturbations [5]. Collaborative tasks add an additional cause for variable demonstration speed, which has been addressed using related modifications to DTW [1] or gradient descent to overcome the non-smooth nature of DTW [17]. That weakness of DTW, combined with the desire for a global solution rather than one which relies on local optimization, is a motivator for our work. Along these same lines, a probabilistic interpretation of DTW has been shown to improve modeling accuracy and data efficiency [18]. When viewed more specifically as the problem of registering demonstrations to an underlying hidden Markov model (HMM) [23] or semi-Markov model [21], DTW is quite similar to the well-known Viterbi algorithm [24] for inference of the hidden model variables. Work has been done to produce variants of the Viterbi algorithm for finding registrations with specific properties [22], lower risk [15], or approximations for switching linear dynamic systems (SLDSs). For SLDSs in particular, alternative approaches to estimation have been explored [20], and SLDSs have been combined with reinforcement learning to good effect [13].
Estimation of an HMM using the Viterbi algorithm in a simple expectation-maximization framework yields the Viterbi path-counting algorithm [10]. However, the classical forward-backward algorithm has nicer properties. Prior work has explored efficient forward-backward algorithm computation for HMMs [27]. We explore how use of the forward-backward algorithm versus DTW or the Viterbi algorithm impacts learning and subsequent execution for HMM-based task models. In this paper, we describe and compare existing and new variants of temporal registration on two robot learning methods from prior work [4,7], and we find measurable improvements by using a temporal registration approach that captures uncertainty and enforces non-degeneracy constraints.
3
Problem Definition
Consider a user-provided demonstration given by a sequence Y = {y_1, ..., y_n} where each y_i is an observation which might be in the state space of the robot or some arbitrary feature space. It is often necessary for learning or recognition algorithms to register such a demonstration to a different sequence Θ = {θ_1, ..., θ_T} of time steps (e.g. in a learned task model) while abstracting away differences in demonstration speed. We are generally interested in a registration which minimizes some measure of loss L(y_i, θ_t) between matched elements of each sequence and where T ≤ n. The most common approach to this problem in the domain of robotics is DTW, which produces for each θ_t an assignment i = ι_t (where ι is the Greek letter iota) corresponding to y_i (see Fig. 2(a)). In particular, this assignment minimizes ∑_{t=1}^{T} L(y_{ι_t}, θ_t) subject to a strict monotonicity constraint ι_t < ι_{t+1}. Stronger variants of this constraint may also be enforced as shown in Sect. 4.4. However, this approach discards much of the demonstration and fails to encode uncertainty in the temporal registration. To address these weaknesses, we consider probability-weighted temporal registrations which assign to each y_i a weight ω_{i,t} for each θ_t (see Fig. 2(b)). These weights sum to one and so form a distribution. Here, we minimize ∑_{i,t}^{n,T} ω_{i,t} L(y_i, θ_t) subject to a monotonicity constraint similar to that above generalized to distributions (see Sect. 4.1). To make these temporal registrations truly probability-weighted, we impose a restriction on the loss L, namely that it be defined as the negative log likelihood for some statistical model L(θ_t | y_i): L(y_i, θ_t) ≡ −log L(θ_t | y_i). This model is relevant to a variety of learning approaches including those we consider in Sect. 5. The problem we consider of temporal registration of one user-provided demonstration to a sequence of time steps readily extends to the common problem of registering multiple demonstrations. Each demonstration can be temporally registered to the model independently, the model can then be re-estimated given the resulting registrations, and the process repeats in an expectation-maximization loop until convergence, as done in other approaches (e.g., [4,7,9]).
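For concreteness, a common instance of this restriction is a Gaussian observation model, where each θ_t is a mean–covariance pair and the loss is the Gaussian negative log-likelihood. The Python sketch below is ours, not the authors' code; the function name and interface are assumptions for illustration only.

    import numpy as np

    def gaussian_nll(y, mu, cov):
        """L(y_i, theta_t) = -log L(theta_t | y_i) for a Gaussian model theta_t = (mu, cov)."""
        d = y - mu
        _, logdet = np.linalg.slogdet(cov)
        k = y.shape[0]
        # 0.5 * (Mahalanobis distance + log|cov| + k * log(2*pi))
        return 0.5 * (d @ np.linalg.solve(cov, d) + logdet + k * np.log(2.0 * np.pi))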
250
4
C. Bowen and R. Alterovitz
Method
In this section, we describe a graphical approach to probability-weighted temporal registration using the forward-backward algorithm along with a practical improvement to any temporal registration method that fits into this graphical framework. In Sects. 4.1 and 4.2, we discuss two maximum-likelihood temporal registration methods: DTW and the Viterbi algorithm. We recast both approaches as shortest paths on a tensor product graph. In Sect. 4.3, we next discuss a true expectation-maximization algorithm that produces a temporal registration with probability-weighted assignments for the observations. Again, we reformulate this approach using the tensor product graph and show that it is in fact nearly the same algorithm as the Viterbi algorithm. Finally, in Sect. 4.4 we show that any of these methods can be improved by modifying the tensor product graph to enforce a non-degeneracy constraint, which is particularly valuable when integrated with robot learning and using very few demonstrations as input, as shown in Sect. 6. To our knowledge, the reformulation of these algorithms in terms of a tensor product graph is novel, as is the modification to enforce the non-degeneracy constraint. The combination of using the forward-backward algorithm and enforcing the non-degeneracy constraint gives us our complete method for probability-weighted temporal registration (PTR). 4.1
Dynamic Time Warping as a Graph Algorithm
The structure of the DTW algorithm follows the familiar dynamic programming approach of iteratively solving each subproblem in a table. In particular, let C[t, i] denote the cost of the best registration of the first t time steps {θ_1, ..., θ_t} to the first i observations {y_1, ..., y_i}. We then have for all t ≤ T and i ≤ n:

    C[0, 0] = 0,   C[t, 0] = C[0, i] = ∞,
    C[t, i] = min( C[t, i−1] + c_del(i),  C[t−1, i] + c_ins(t),  C[t−1, i−1] + c_match(t, i) ),   (1)

where c_ins(t) denotes the cost of not matching θ_t to any observation, c_del(i) the cost of not matching observation y_i to any time step, and c_match(t, i) the cost of matching θ_t with y_i. The actual registration ι may then be constructed by traversing this table. The best temporal registration is the one which maximizes L(ι | Y, Θ), that is, the likelihood of the entire registration given both sequences. Because the loss function we assumed depends only on y_i and θ_t where ι_t = i, it satisfies the Markov property, and thus the joint likelihood is simply ∏_{t=1}^{T} L(θ_t | y_{ι_t}), and maximizing this is equivalent to minimizing ∑_{t=1}^{T} −log L(θ_t | y_{ι_t}). Substituting

    c_ins(t) = ∞,   c_match(t, i) = −log L(θ_t | y_i)   (2)

in the recurrence above yields C[t, i] = −log L(Θ | Y, ι_{1...t}) where ι_t ≤ i. By monotonicity of log, minimizing C[T, n] maximizes L(Θ | Y, ι).
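The recurrence translates directly into a small dynamic program. The following Python sketch is our own illustrative code, not the authors' implementation; it fills the table under the substitution (2) and backtracks to recover ι. The c_ins branch is omitted because its cost is infinite.

    import numpy as np

    def dtw_register(loss, T, n):
        """DTW table of Eq. (1) with c_ins = inf (every time step must match),
        c_del = 0 (observations may be dropped), and
        c_match(t, i) = -log L(theta_t | y_i) = loss(t, i). Indices are 1-based."""
        C = np.full((T + 1, n + 1), np.inf)
        C[0, 0] = 0.0
        matched = np.zeros((T + 1, n + 1), dtype=bool)
        for t in range(1, T + 1):
            for i in range(1, n + 1):
                c_del = C[t, i - 1]                       # drop observation y_i
                c_match = C[t - 1, i - 1] + loss(t, i)    # match theta_t with y_i
                C[t, i] = min(c_del, c_match)
                matched[t, i] = c_match <= c_del
        iota, t, i = {}, T, n
        while t > 0 and i > 0:          # backtrack to recover the assignment iota_t = i
            if matched[t, i]:
                iota[t] = i
                t -= 1
            i -= 1
        return C[T, n], iota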
[Fig. 3 diagrams: (a) Bayesian network; (b) Restricted transitions (time graph).]
Fig. 3. Discrete-time hidden Markov model in which time steps comprise the discrete states and observations are those from the demonstrations.
Fig. 4. (left) Tensor product of the time graph (vertical axis) and demonstration graph (horizontal axis). Darker edges are more probable and blue vertices show the DTW temporal registration. (right) Blue vertices show the maximum likelihood temporal registration, which corresponds to the shortest path in the graph.
Letting c_ins(t) = ∞ ensures that every time step is registered to an observation, but much of the prior work in robotics further assumes c_del(i) = 0 [19]. That is, not every observation needs to be considered during registration, and only the best T observations contribute to model estimation. Not only does this not match the underlying model, it discards information which could otherwise be useful for robust parameter estimation. In the next section, we will consider an alternative, the Viterbi algorithm. But first, it will be convenient to recast the DTW algorithm as a graph algorithm. This view will become an important step in unifying and extending all the temporal registration algorithms we consider. To view DTW as a graph algorithm, we first construct graphs representing both the time steps and the demonstration. The demonstration simply becomes a sequential graph of its observations (the demonstration graph). Time steps encode a more complex relation, but one which can be described by a hidden Markov model (HMM). To see this, consider the time steps to be the discrete states, denoted τ_i, of a time-homogeneous hidden Markov model (HMM) with observation distributions parameterized by {θ_1, ..., θ_m} and transition probabilities given by p_{t,t′}. To enforce a monotonicity constraint, we restrict the state transitions as shown in Fig. 3(b), and this forms the time graph. Finally, we then consider the tensor product [26] of these two graphs (see Fig. 4). Note that unlike the time graph, we omit self-edges in the demonstration graph, which, similar to setting c_ins(t) = ∞ above, ensures that every time step is matched with an observation. We will generalize this constraint in Sect. 4.4.
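The construction can be written down in a few lines. The sketch below is our own rendering of it (names are ours): because the demonstration graph has no self-edges, every edge advances the observation index by exactly one, and listing edges in increasing (t, i) order happens to give a valid topological order.

    def tensor_product_edges(T, n):
        """Edges of the tensor product of the time graph (states 1..T with
        self-edges and forward edges t -> t+1) and the demonstration graph
        (a chain y_1 -> ... -> y_n with no self-edges)."""
        edges = []
        for t in range(1, T + 1):
            for i in range(1, n):
                edges.append(((t, i), (t, i + 1)))          # stay in time step t
                if t < T:
                    edges.append(((t, i), (t + 1, i + 1)))  # advance to t+1
        return edges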
We next assign to each edge from (t, i) to (t′, i′) a cost as follows:

    c_DTW((t, i) → (t′, i′)) = { −log L(θ_{t′} | y_{i′}),  if t′ ≠ t;   0,  if t′ = t }.   (3)
Under these edge costs, the DTW registration can be recovered from the shortest path from (1, 1) to (T, n), where a match occurs whenever the time step changes along the path. We note that the cost itself will differ from the dynamic programming approach by the cost of (1, 1), but this constant factor is unimportant for the purposes of minimization. Going forward, we will ignore similar discrepancies by constant factors without note. 4.2
Temporal Registration Using the Viterbi Algorithm
As we alluded to previously, this variant of DTW does not accurately reflect the underlying HMM because the result may not include some of the observations. However, there is a similar algorithm which does find the true maximum-likelihood registration of a sequence of observations to an HMM, namely the Viterbi algorithm [24], which has previously been used in task learning and recognition [9,16]. This algorithm can be viewed as a specific instance of DTW with an appropriate choice of costs, but it is more illuminating to describe using the graph we introduced in the prior section. To do so, we first need to reparameterize the registration, so that instead of mapping time steps to observations via ι_t, we map observations to time steps via τ_i. Then we need only change the edge costs as follows:

    c((t, i) → (t′, i′)) = −log L(τ_{i′} = t′ | τ_i = t, Y, Θ) = −log p_{t,t′} − log L(θ_{t′} | y_{i′}).   (4)
Recall that p_{t,t′} is the probability of transitioning from time step t to t′, so this is not only arguably simpler than the DTW edge costs (3), but correctly considers transition probabilities (estimated in Sect. 5). Under these edge costs, the Viterbi registration is again the shortest path from (1, 1) to (T, n), but we consider θ_t to be matched with y_i whenever (t, i) occurs in this path. In particular, a time step may be matched with multiple observations, while every observation will be matched with exactly one time step (see Fig. 4). This coincides with the underlying HMM model. To see that the shortest path indeed corresponds to the maximum-likelihood registration, we again rely on monotonicity and the Markov property as follows:

    argmin_τ ∑_{i=1}^{n−1} −log L(τ_{i+1} | τ_i, Y, Θ) = argmax_τ ∏_{i=1}^{n−1} L(τ_{i+1} | τ_i, Y, Θ) = argmax_τ L(τ | Y, Θ).   (5)
However, this approach still suffers from the issues associated with maximum likelihood approaches, namely local minima and sensitivity to noise.
Fig. 5. (left) Tensor product of the time graph (vertical axis) and demonstration graph (horizontal axis). Darker edges are more probable and shaded vertices show the distribution of likely registrations in PTR. (right) Modified to exclude degenerate registrations (KΔ = 2).
One question that arises is what shortest path algorithm to use. Because the edge costs may be negative, Bellman-Ford [3] is a reasonable choice. However, we can relate this better to DTW and the forward-backward algorithm by noting that because the demonstration graph is a directed acyclic graph (DAG), so must be the tensor product graph. Because the graph is acyclic, we need not perform the multiple passes usually required by the Bellman-Ford algorithm thanks to the availability of a topological ordering. With this simplification, the Bellman-Ford algorithm may be described by a recurrence:

    C(v) = ⊕_{e: u→v} C(u) ⊗ c(e),   (6)

where a ⊕_ML b = min(a, b) and a ⊗_ML b = a + b, with the initial condition on the source vertex s ∈ V set to the identity of ⊗, that is C(s) = 1_⊗. Expanding this out again yields the general DTW algorithm. The ML subscripts on these operators indicate that this choice produces maximum-likelihood estimates, and in the next section, we will show that simply changing our choice of semiring [11] (associative ⊕ and ⊗ with the distributive property) effectively yields the forward-backward algorithm.
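Read this way, one generic pass suffices for all the algorithms in this section. The Python sketch below is our own code (it reuses tensor_product_edges from the earlier sketch): it evaluates the recurrence (6) over a topologically ordered edge list, parameterized by the choice of ⊕. With ⊕ = min and ⊗ = + it produces the maximum-likelihood (DTW/Viterbi) values, and the ⊕_FB of the next section yields the forward pass of the forward-backward algorithm.

    import math

    def dag_values(edges, source, oplus, cost):
        """Evaluate C(v) = oplus over edges e: u -> v of C(u) + c(e)
        (Eq. (6) with a (x) b = a + b). `edges` must be topologically sorted
        by source vertex; unreached vertices implicitly carry +inf."""
        C = {source: 0.0}   # 0 is the identity of (x) = +, i.e., probability one
        for u, v in edges:
            if u in C:
                val = C[u] + cost(u, v)
                C[v] = val if v not in C else oplus(C[v], val)
        return C

    def oplus_ml(a, b):     # maximum-likelihood semiring: shortest path / Viterbi
        return min(a, b)

    def oplus_fb(a, b):     # -log(exp(-a) + exp(-b)), computed stably
        m, M = min(a, b), max(a, b)
        return m - math.log1p(math.exp(m - M))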
4.3
PTR Using the Forward-Backward Algorithm
In contrast to the Viterbi algorithm, the forward-backward algorithm [2] computes not only the most likely temporal registrations, but the posterior probabilities of all possible temporal registrations. Rather than presenting this approach in its original terms, however, we present an equivalent formulation using a graph algorithm that is more amenable to further modifications. The graph is the same as the one used in the previous section and shown in Fig. 4. However, to implement the forward-backward algorithm, we instead use:

    a ⊕_FB b = −log(e^{−a} + e^{−b}),   a ⊗_FB b = a + b,   (7)
which are simply the sum and product of the probabilities in negative log-space, producing (unnormalized) distributions over registrations given prior observations. Specifically,

    L(τ_i = t | Θ, y_1, ..., y_{i−1}) ∝ e^{−C(t,i)}.   (8)

To extend this to distributions over registrations given the entire sequence of observations, one need only compute the value C′(v) of each vertex in the reverse graph. The likelihood of an observation y_i registering to time step t is then:

    L(τ_i = t | Θ, Y) ∝ e^{−(C(t,i) + C′(t,i))}.   (9)
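Computationally, the posterior (9) needs one forward pass and one pass over the reversed graph. A sketch, again our own code, reusing dag_values and oplus_fb from the previous snippet:

    import math

    def posterior_weights(edges, source, sink, cost, T, n):
        """Weights proportional to exp(-(C(t,i) + C'(t,i))) (Eq. (9)),
        normalized over time steps for each observation."""
        C = dag_values(edges, source, oplus_fb, cost)
        # C'(v): same recurrence from the sink over reversed edges,
        # keeping each edge's original cost.
        rev = [(v, u) for (u, v) in reversed(edges)]
        Cp = dag_values(rev, sink, oplus_fb, lambda a, b: cost(b, a))
        omega = {}
        for i in range(1, n + 1):
            s = [math.exp(-(C.get((t, i), math.inf) + Cp.get((t, i), math.inf)))
                 for t in range(1, T + 1)]
            Z = sum(s)
            omega[i] = [x / Z for x in s] if Z > 0 else s  # guard against underflow
        return omega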
This view of these algorithms as recurrences over a tensor product graph enables us to further improve registrations by modifying the graph in Sect. 4.4. 4.4
Non-degenerate Temporal Registration
For many practical use cases, it is desirable to ensure that a sufficient number K_Δ of observations are registered to each time step. We say that such registrations are non-degenerate because if they are subsequently used to estimate covariance matrices as in prior learning methods [4,7], this guarantee is necessary for the problem to be well-posed. Even with other choices of parameters or estimators, it may be desirable to enforce such a constraint because human demonstrators naturally perform precise parts of a task more slowly. This can be thought of as a stronger version of the Itakura parallelogram constraint used for DTW [12]. This constraint can be enforced by modifying the structure of the tensor product graph to ensure that all paths satisfy the K_Δ constraint. An example of this modified graph structure is shown in Fig. 5 (right). The edge costs must then be modified similarly for the Viterbi and forward-backward algorithms:
    c((t, i) → (t′, i′)) = −(i′ − i − 1) · log p_{t,t} − log p_{t,t′} − ∑_{k=i+1}^{i′−1} log L(θ_t | y_k) − log L(θ_{t′} | y_{i′}).   (10)

Note that this cost is identical to that given in (4) when i′ = i + 1, that is, when the time step does not change. The observation weights may then be computed from the vertex values using a similar approach to that used previously for the forward-backward algorithm:

    L(τ_i = t | Θ, Y) = e^{−C(t,i) − C′(t,i)} + ∑_{i′=i+1−K_Δ}^{i−1} e^{−C(t,i′) − c((t,i′) → (t+1,i′+K_Δ)) − C′(t+1,i′+K_Δ)}.   (11)
The combination of using the forward-backward algorithm and enforcing the non-degeneracy constraint gives us our complete method for PTR.
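Structurally, the modification only changes which edges exist: self-edges are kept, while time-advancing edges must jump K_Δ observations. A minimal sketch of the modified edge set (our own code, mirroring Fig. 5, right):

    def nondegenerate_edges(T, n, K):
        """Modified tensor product graph: staying in a time step advances one
        observation, while advancing the time step consumes K observations at
        once, so no time step on any path receives fewer than K observations."""
        edges = []
        for t in range(1, T + 1):
            for i in range(1, n):
                edges.append(((t, i), (t, i + 1)))           # remain in time step t
                if t < T and i + K <= n:
                    edges.append(((t, i), (t + 1, i + K)))   # skip edge
        return edges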
5
Application to Robot Learning from Demonstrations
We apply PTR to the estimation of two different task models from prior work on robot learning from demonstrations [4,7]. These models were selected because they both explicitly require temporal registration. The method of van den Berg et al. [4] learns a task model encoding parameters of a control policy. The method of Bowen et al. [7] learns a task model that is used by a sampling-based motion planner to compute costs for edges in a roadmap such that the shortest path in the roadmap will accomplish the learned task. Consider a set of user-provided demonstrations {Y^(1), ..., Y^(m)} where each Y^(j) is a sequence {y_1^(j), ..., y_{n_j}^(j)}, where we use a parenthesized superscript to indicate per-demonstration parameters. We reproduce the core learning algorithm of van den Berg et al. [4] in Algorithm 1 with small notational changes for parity and to highlight the underlying HMM model. Next consider the moderately abridged implementation of the method of Bowen et al. [7] shown in Algorithm 2. We focus here on learning the task models; details on implementing the controllers and motion planners using these models are in [4,7]. Although each of these algorithms operates in a different space, we note that both fit the general mold of expectation-maximization (EM) methods, interleaving estimation of model parameters θ and latent parameters (τ or ι and R^(j)).

Algorithm 1. EstimateVanDenBerg2010(T, Y^(1), ..., Y^(m))
    Initialize R^(j) = I and ι_t^(j) = |Y^(j)|·t / T.
    while not converged do
        for t = 1 to T do
            Z_t = { (y_i^(j), R^(j)) | i = ι_t^(j) }
        Θ̂ ← KalmanSmoother(Z_1, ..., Z_T)
        for j = 1 to m do
            R̂^(j) ← argmax_R L(R | Y^(j), Θ̂)
            ι̂^(j) ← argmax_ι L(ι | Y^(j), Θ̂)      ▷ Dynamic time-warping

Algorithm 2. EstimateBowen2015(T, Y^(1), ..., Y^(m))
    Initialize τ̂_i^(j) = T·i / |Y^(j)|.
    while not converged do
        for t = 1 to T do
            Z_t = { y_i^(j) | τ̂_i^(j) = t }
            θ̂_t ← MaximumLikelihoodEstimator(Z_t)
        for j = 1 to m do
            τ̂^(j) ← argmax_τ L(τ | Y^(j), Θ̂)      ▷ Viterbi algorithm

Fig. 6. Learning a drawing task with reference trajectory shown in gray. (left) Example demonstrations of the simulated drawing task with medium Gaussian observation noise. Trajectories learned from these two demonstrations using the DTW-1 (center) and PTR-3 (right) temporal registration algorithms.

More accurately, however, these are maximization-maximization
methods, because their respective registration steps compute the best (in some sense) single registration given θ rather than a distribution over all possible registrations. To apply PTR to the methods of van den Berg et al. and Bowen et al. we need only replace DTW and extend the model parameter expectation steps of these methods to handle weighted observations to accommodate probability-weighted temporal registrations. More specifically, when estimating θ_t, we use all y_i^(j), but weight each by its posterior likelihood of being registered to time step t, ω_{t,i}^(j) = L(τ_i = t | Y^(j)), which is given by (9). This produces a true expectation-maximization method, and in the method of Bowen et al. yields the Baum-Welch algorithm, which has the general effect of smoothing the estimation compared to the maximum likelihood approach, reducing (but not eliminating) local minima and improving robustness (see results in Sect. 6). The transition probabilities in the HMM (Fig. 3(b)) may be estimated by applying Bayes' rule, yielding:

    p_{t,t′} = L(τ_{i+1} = t′ | τ_i = t, Y) = ( ∑_{i,j} ω_{t,i}^(j) ω_{t′,i+1}^(j) ) / ( ∑_{i,j} ω_{t,i}^(j) ).   (12)

In the method of van den Berg et al. at each time step t the Kalman update step can be performed once for each observation y_i^(j), with weight ω_{t,i}^(j) applied by scaling the covariance matrix R^(j) of the observation by 1/ω_{t,i}^(j). Equivalently, and more efficiently, a single Kalman update step can be done at each time step t by combining the weighted observations as follows:

    Σ̂_t = ( ∑_{i,j} ω_{t,i}^(j) (R^(j))^{−1} )^{−1},   μ̂_t = Σ̂_t ( ∑_{i,j} ω_{t,i}^(j) (R^(j))^{−1} y_i^(j) ).
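In code, the combined update above is a weighted information-form fusion of the per-demonstration observations. A sketch (ours; illustrative only):

    import numpy as np

    def combine_weighted_observations(ys, Rs, weights):
        """Compute Sigma_t and mu_t from observations ys[k] with covariances
        Rs[k] and registration weights omega = weights[k] for one time step t."""
        info = sum(w * np.linalg.inv(R) for w, R in zip(weights, Rs))
        Sigma = np.linalg.inv(info)
        mu = Sigma @ sum(w * np.linalg.inv(R) @ y for w, y, R in zip(weights, ys, Rs))
        return mu, Sigma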
6
Results
We evaluate the impact of PTR by applying it to two previously-developed robot learning methods that require temporal registration of demonstrations. In
Fig. 7. Results for the simulated drawing task with Gaussian observation noise. PTR exhibited lower error relative to DTW, particularly when only a small number of demonstrations are available.
Fig. 8. (left) Example demonstrations of the simulated drawing task with high Brownian motion noise. Trajectories learned from these three demonstrations using the DTW-1 (center) and PTR-3 (right) temporal registration algorithms.
particular, we apply our temporal registration to the robot learning methods of van den Berg et al. [4] and Bowen et al. [7]. We use Name-KΔ to indicate the various temporal registration methods (e.g., DTW-1 or PTR-12). Note that when KΔ = 1, as in DTW-1 or PTR-1, these are equivalent to their unconstrained variants, DTW and the forward-backward algorithm respectively. Computation was performed on an Intel Xeon E5-1680 CPU with 8 cores running at 3.40 GHz. 6.1
PTR Applied to the van den Berg et al. Method
We first apply PTR to the method of van den Berg et al. [4], which learns a task model encoding parameters of a control policy and enables speedups on task performance. We consider a drawing task and a knot-tying task. In both tasks, the number of time steps T was the length of the shortest demonstration divided by K_Δ, which is the maximum number for which a temporal registration exists and matches the original paper for K_Δ = 1. Simulated Drawing Task. For the simulated drawing task, instead of using human demonstrations, we perturbed a canonical figure eight using one of two noise models. The goal, then, was to recover this canonical motion, allowing
Fig. 9. Results for the simulated drawing task with Brownian motion noise. PTR exhibited lower error relative to DTW, particularly when only a small number of demonstrations are available.
Fig. 10. Using the method of Bowen et al. [7] we learned a task wherein the Baxter robot scooped powder from the yellow container and transferred it into the magenta one without spilling while avoiding obstacles in the environment.
us to empirically evaluate different temporal registration approaches both in terms of learning time and error. The first noise model is that assumed by the underlying model, where observations within a demonstration are corrupted by i.i.d. Gaussian noise as seen in Fig. 6. Results for low, medium, and high noise amplitudes with various numbers of demonstrations are shown in Fig. 7. We performed the same experiments with Brownian motion noise, effectively adding Gaussian velocity noise, as shown in Fig. 8. While this does not match the assumptions of the underlying model, it produces demonstrations much more similar to what a human might. Results are shown in Fig. 9. Under both noise models, our method of PTR exhibited lower error relative to DTW, particularly with high noise and when only a small number of demonstrations are available. The improvements were most dramatic when the noise model matched the underlying method assumptions because noise which violates the independence assumptions introduces bias that temporal registration alone cannot correct. Both approaches required comparable learning time.
Table 1. Results for the knot-tying task using the method of van den Berg et al. with different temporal registration approaches. Motion speedup indicates the maximum multiple of the average demonstration speed at which the task was still performed successfully.

    Phase | Registration | Max motion speedup | Learning time (s)
    ------+--------------+--------------------+------------------
    1     | DTW-1        | 2.5                | 0.40
          | PTR-1        | 2.3                | 0.64
          | DTW-3        | 2.4                | 0.33
          | PTR-3        | 3.3                | 0.52
    2     | DTW-1        | 2.2                | 0.05
          | PTR-1        | 3.0                | 0.35
          | DTW-3        | 2.7                | 0.04
          | PTR-3        | 3.1                | 0.19
    3     | DTW-1        | 1.1                | 0.85
          | PTR-1        | –                  | 0.57
          | DTW-3        | 2.0                | 0.35
          | PTR-3        | 3.2                | 1.20
Table 2. Results for the powder transfer task using the method of Bowen et al. [7] with different demonstration counts and temporal registration approaches. PTR-12 had the highest success rate, regardless of the number of demonstrations.

    Demonstrations | Registration | Success | Learning time (s)
    ---------------+--------------+---------+------------------
    10             | Viterbi-1    | 80%     | 2.86
                   | PTR-1        | 90%     | 78.64
                   | PTR-12       | 100%    | 87.31
    5              | Viterbi-1    | 60%     | 3.16
                   | PTR-1        | 80%     | 36.37
                   | PTR-12       | 90%     | 36.12
Physical Knot Tying Task. Our physical knot tying task was similar to that described by van den Berg et al. However, it was demonstrated and executed on the Baxter robot, which has more restrictive dynamics limitations than the Berkeley Surgical Robot on which the original experiments were performed. As in the original paper, we divided the task into three phases: an initial loop, a grasp, and an extraction (see Fig. 1). Unlike in the original paper, we learned models for all three phases rather than only the first and third. We performed five demonstrations at 20 Hz of the first two phases and three of the third. These demonstrations ranged from 16 to 30 s in length. For an execution to be considered successful, we required the robot to tie a slip knot without exceeding its kinematic or dynamic limitations, including avoiding self-collisions. To evaluate
methods in the context of van den Berg et al.’s superhuman performance, we executed the algorithms at progressively greater speeds until the method failed and recorded the maximum multiple of the average demonstration speed at which the task was still performed successfully. Results separated by task phase are shown in Table 1. A minimum of three samples was used for the non-degenerate variant, which corresponds to a 0.15 s window. In the first phase of the task, DTW-1 failed first due to incorrect placement of the rope, while PTR-1 and DTW-3 encountered self-collisions. Only PTR-3 reached the velocity limits of the robot. In the second and easiest phase, the cause of failure for every temporal registration method was reaching the velocity limits of the robot. However, some methods resulted in smoother registrations and subsequent motions, permitting overall faster execution. In the third and most difficult phase, DTW-1 exceeded the velocity limits of the robot while the other three methods failed to extract the arm through the newly-formed loop in the rope. In the case of PTR-1, even at 1x demonstration speed, this was the cause of failure because the registration failed to isolate a crucial part of the task. PTR-3 achieved the highest speedup for all phases of the knot-tying task, showing the benefits of probability-weighted temporal registration with the non-degenerate registration feature. 6.2
PTR Applied to the Bowen et al. Method
We next apply PTR to the method of Bowen et al. [7], which learns a task model that is used by a sampling-based motion planner to compute costs for edges in a roadmap such that the shortest path in the roadmap will accomplish the learned task. For task model learning, [7] used the Viterbi algorithm for temporal registration, so it is this approach we compare against. Similarly, we set T = 24 to match the original paper. We performed a powder transfer task on the Baxter robot shown in Fig. 10, as specified in [7]. In this task, the robot is to scoop powder onto a spoon from a source container (the yellow bucket) and transfer it to a destination container (the magenta thermos) while avoiding obstacles (e.g., the plant on the table and the white hanging lamp shade). The robot learned the task using the method and features of Bowen et al. [7] from 10 kinesthetic demonstrations (in which no obstacles were present). Task models were learned using three different temporal registration approaches. To evaluate each model, we introduced the obstacles and randomly sampled scenarios with container positions such that the transfer would always cross the centerline of the table to ensure each scenario was challenging. An execution was considered successful if it transferred powder from one container to the other without spilling. Results are shown in Table 2. As with the van den Berg method, the use of PTR yielded a measurably better task model in terms of success rate relative to the Viterbi algorithm for temporal registration. Learning times for PTR were substantially longer, but still very reasonable for an off-line process. The primary cause of the failures that occurred during task execution was slightly missing the cup during the dumping motion, resulting in spilled powder. We observe that the number of
demonstrations had a marginally greater impact when using the Viterbi algorithm for registration than when using PTR. As with the method of van den Berg et al., PTR-12 performed best.
7
Conclusion
Many existing methods for robot learning from demonstrations require registering a time sequence of observations to a reference model, either for aligning demonstrations during preprocessing or as an integral part of task model estimation. We introduced probability-weighted temporal registration (PTR), a more general form of temporal registration that explicitly captures uncertainty in the registration. Instead of assuming each observation is registered to (at most) one time step of the reference model like DTW, we use the forward-backward algorithm to compute probability-weighted assignments and avoid degenerate registrations. We applied PTR to two learning methods from prior work on both simulated and physical tasks and showed that incorporating PTR into robot learning algorithms can yield higher-quality task models that enable faster task executions and higher task success rates. In future work, we would like to apply PTR to other robotics algorithms that require temporal registration and to automatically determine the best non-degeneracy parameter. Acknowledgments. We thank Armaan Sethi for his assistance evaluating methods. This research was supported in part by the U.S. National Science Foundation (NSF) under Awards IIS-1149965 and CCF-1533844.
References
1. Amor, H.B., Neumann, G., Kamthe, S., Kroemer, O., Peters, J.: Interaction primitives for human-robot cooperation tasks. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 2831–2837 (2014)
2. Baum, L.E., Petrie, T., Soules, G., Weiss, N.: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat. 41(1), 164–171 (1970)
3. Bellman, R.: On a routing problem. Technical report, DTIC Document (1956)
4. van den Berg, J., Miller, S., Duckworth, D., Hu, H., Wan, A., Goldberg, K., Abbeel, P.: Superhuman performance of surgical tasks by robots using iterative learning from human-guided demonstrations. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 2074–2081 (2010)
5. Berger, E., Sastuba, M., Vogt, D., Jung, B., Ben Amor, H.: Estimation of perturbations in robotic behavior using dynamic mode decomposition. Adv. Robot. 29(5), 331–343 (2015)
6. Bodiroža, S., Doisy, G., Hafner, V.V.: Position-invariant, real-time gesture recognition based on dynamic time warping. In: ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 87–88 (2013)
7. Bowen, C., Ye, G., Alterovitz, R.: Asymptotically-optimal motion planning for learned tasks using time-dependent cost maps. IEEE Trans. Autom. Sci. Eng. 12(1), 171–182 (2015)
8. Calinon, S.: A tutorial on task-parameterized movement learning and retrieval. Intell. Serv. Robot. 9(1), 1–29 (2016)
9. Calinon, S., Guenter, F., Billard, A.: On learning, representing, and generalizing a task in a humanoid robot. IEEE Trans. Syst. Man Cybern.–Part B 37(2), 286–298 (2007)
10. Davis, R.I., Lovell, B.C.: Comparing and evaluating HMM ensemble training algorithms using train and test and condition number criteria. Formal Pattern Anal. Appl. 6(4), 327–335 (2004)
11. Herstein, I.N.: Topics in Algebra. Wiley, Hoboken (2006)
12. Itakura, F.: Minimum prediction residual principle applied to speech recognition. IEEE Trans. Acoust. Speech Signal Process. 23(1), 67–72 (1975)
13. Krishnan, S., Garg, A., Liaw, R., Thananjeyan, B., Miller, L., Pokorny, F.T., Goldberg, K.: SWIRL: a sequential windowed inverse reinforcement learning algorithm for robot tasks with delayed rewards. In: WAFR (2016)
14. Lee, A.X., Lu, H., Gupta, A., Levine, S., Abbeel, P.: Learning force-based manipulation of deformable objects from multiple demonstrations. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 177–184 (2015)
15. Lember, J., Koloydenko, A.A.: Bridging Viterbi and posterior decoding: a generalized risk approach to hidden path inference based on hidden Markov models. J. Mach. Learn. Res. 15(1), 1–58 (2014)
16. Lv, F., Nevatia, R.: Single view human action recognition using key pose matching and Viterbi path searching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
17. Maeda, G., Ewerton, M., Lioutikov, R., Amor, H.B., Peters, J., Neumann, G.: Learning interaction for collaborative tasks with probabilistic movement primitives. In: IEEE-RAS International Conference on Humanoid Robots, pp. 527–534 (2014)
18. Marteau, P.: Times series averaging and denoising from a probabilistic perspective on time-elastic kernels (2016). http://arxiv.org/abs/1611.09194
19. Müller, M.: Dynamic time warping. In: Information Retrieval for Music and Motion, pp. 69–84 (2007)
20. Pavlovic, V., Rehg, J.M., MacCormick, J.: Learning switching linear models of human motion. In: Advances in Neural Information Processing Systems, pp. 981–987 (2001)
21. Tanwani, A.K., Calinon, S.: Learning robot manipulation tasks with task-parameterized semitied hidden semi-Markov model. IEEE Robot. Autom. Lett. 1(1), 235–242 (2016)
22. Titsias, M.K., Holmes, C.C., Yau, C.: Statistical inference in hidden Markov models using k-segment constraints. J. Am. Stat. Assoc. 111(513), 200–215 (2016)
23. Vakanski, A., Mantegh, I., Irish, A., Janabi-Sharifi, F.: Trajectory learning for robot programming by demonstration using hidden Markov model and dynamic time warping. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 42(4), 1039–1052 (2012)
24. Viterbi, A.J.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13(2), 260–269 (1967)
25. Vuković, N., Mitić, M., Miljković, Z.: Trajectory learning and reproduction for differential drive mobile robots based on GMM/HMM and dynamic time warping using learning from demonstration framework. Eng. Appl. Artif. Intell. 45, 388–404 (2015)
26. Weichsel, P.M.: The Kronecker product of graphs. Proc. Am. Math. Soc. 13(1), 47–52 (1962)
27. Yu, S.Z., Kobayashi, H.: An efficient forward-backward algorithm for an explicit-duration hidden Markov model. IEEE Signal Process. Lett. 10(1), 11–14 (2003)
Learning in Planning
SACBP: Belief Space Planning for Continuous-Time Dynamical Systems via Stochastic Sequential Action Control Haruki Nishimura and Mac Schwager Stanford University, Stanford, CA 94305, USA {hnishimura,schwager}@stanford.edu
Abstract. We propose a novel belief space planning technique for continuous dynamics by viewing the belief system as a hybrid dynamical system with time-driven switching. Our approach is based on the perturbation theory of differential equations and extends Sequential Action Control [1] to stochastic belief dynamics. The resulting algorithm, which we name SACBP, does not require discretization of spaces or time and synthesizes control signals in near real-time. SACBP is an anytime algorithm that can handle general parametric Bayesian filters under certain assumptions. We demonstrate the effectiveness of our approach in an active sensing scenario and a model-based Bayesian reinforcement learning problem. In these challenging problems, we show that the algorithm significantly outperforms other existing solution techniques including approximate dynamic programming and local trajectory optimization. Keywords: Mobile robots · Optimization and optimal control · Probabilistic reasoning · Vision and sensor-based control
1
Introduction
Planning under uncertainty still remains a challenge for robotic systems. Various types of uncertainty, including unmodeled dynamics, stochastic disturbances, and imperfect sensing, significantly complicate problems that are otherwise easy. For example, suppose that a robot needs to move an object from some initial state to a desired goal. If the mass properties of the object are not known beforehand, the robot needs to simultaneously estimate these parameters and perform control, while taking into account the effects of their uncertainty; the exploration and exploitation trade-off needs to be resolved [2]. On the other hand, uncertainty is quite fundamental in motivating some problems. For instance, a noisy sensor may encourage the robot to carefully plan a trajectory so the observations taken along it are sufficiently informative. This type of problem concerns pure information gathering and is often referred to as active sensing [3]. A principled approach to address all those problems is to form plans in the belief space, where the planner chooses sequential control inputs based on the
evolution of the belief state. This approach enables the robot to appropriately execute controls under stochasticity and partial observability since they are both incorporated into the belief state. Belief space planning is also well suited for generating information gathering actions [4]. This paper proposes a novel online belief space planning algorithm. It does not require discretization of the state space or the action space, and can directly handle continuous-time system dynamics. The algorithm optimizes the expected value of the first-order cost reduction with respect to a nominal control policy at every re-planning time, proceeding in a receding horizon fashion. We are inspired by the Sequential Action Control (SAC) [1] algorithm recently proposed by Ansari and Murphey for model-based deterministic optimal control problems. SAC is an online method to synthesize control signals in real time for challenging (but deterministic) physical systems such as a cart pendulum and a spring-loaded inverted pendulum. Based on the concept of SAC, this paper develops an algorithmic framework to control stochastic belief systems whose dynamics are governed by parametric Bayesian filters. 1.1
Related Work in Belief Space Planning
Greedy Strategies. Belief space planning is known to be challenging for a few reasons. First, the belief state is continuous and can be high-dimensional even if the underlying state space is small or discrete. Second, the dynamics that govern the belief state transitions are stochastic due to unknown future observations. Greedy approaches alleviate the complexity by ignoring long-term effects and solve single-shot decision making problems. Despite their suboptimality for long-term planning, these methods are often employed to find computationally tractable solutions and achieve reasonable performance in different problems [5–7], especially in the active sensing domain. Trajectory Optimization Methods. In contrast to the greedy approaches, trajectory optimization methods take into account multiple timesteps at once and find non-myopic solutions. In doing so, it is often assumed that the maximum likelihood observation (MLO) will always occur at the planning phase [4,8,9]. This heuristic assumption results in a deterministic optimal control problem, whereby various nonlinear trajectory optimization algorithms are applicable. However, ignoring the effects of stochastic future observations can degrade the performance [10]. Other methods [10,11] that do not rely on the MLO assumption are advantageous in that regard. In particular, belief iLQG [10] performs iterative local optimization in the Gaussian belief space by quadratically approximating the value function and linearizing the dynamics to obtain a time-variant linear feedback policy. However, this method as well as many other solution techniques in this category result in multiple iterations of intensive computation and require a significant amount of time until convergence. Belief MDP and POMDP Approaches. Belief space planning can be modeled as a Markov decision process (MDP) in the belief space, given that the belief
state transition is Markovian. If in addition the reward (or cost) is defined as an explicit function of the state and the control, the problem is equivalent to a partially observable Markov decision process (POMDP) [12]. A key challenge in POMDPs and belief MDPs has been to address problems with large state spaces. This is particularly important in belief MDPs since the state space for a belief MDP is a continuous belief space. To handle continuous spaces, Couëtoux et al. [13] introduce double progressive widening (DPW) for Monte Carlo Tree Search (MCTS) [14]. In [2], this MCTS-DPW algorithm is run in the belief space to solve the object manipulation problem mentioned in Sect. 1. We have also presented a motion-based communication algorithm in our prior work [15], which uses MCTS-DPW for active intent inference with monocular vision. While MCTS-DPW as well as other general purpose POMDP methods [16,17] are capable of handling continuous state spaces, their algorithmic concepts are rooted in dynamic programming and tree search, requiring a sufficient amount of exploration in the tree. The tree search technique also implicitly assumes discrete-time transition models. In fact, most prior works discussed above are intended for discrete-time systems. There still remains a need for an efficient and high-performance belief space planning algorithm that is capable of directly handling systems with inherently continuous-space, continuous-time dynamics, such as maneuvering micro-aerial vehicles, or autonomous cars at freeway speeds. 1.2
Contributions
Our approach presented in this paper is significantly different from the previous approaches discussed above. We view the stochastic belief dynamics as a hybrid system with time-driven switching [18], where the controls are applied in continuous time and the observations are made in discrete time. A discrete-time observation creates a jump discontinuity in the belief state trajectory due to a sudden Bayesian update of the belief state. This view of belief space planning yields a continuous-time optimal control problem of a high-dimensional hybrid system. We then propose a model-based control algorithm to efficiently compute the control signals in a receding-horizon fashion. The algorithm is based on Sequential Action Control (SAC) [1]. SAC in its original form is a deterministic, model-based hybrid control algorithm, which “perturbs” a nominal control trajectory in a structured way so that the cost functional is optimally reduced up to the first order. The key to this approach is the use of perturbation theory of differential equations that is often discussed in the mode scheduling literature [19,20]. As a result, SAC derives the optimal perturbation in closed form and synthesizes control signals at a high frequency to achieve a significant improvement over other optimal control methods based on local trajectory optimization. We apply the perturbation theory to parametric Bayesian filters and derive the optimal control perturbation using the framework of SAC. To account for stochasticity, we also extend the original algorithm by incorporating Monte Carlo sampling of nominal belief trajectories. Our key contribution is the resulting
continuous belief space planning algorithm, which we name SACBP. The algorithm has the following desirable properties:
1. SACBP optimizes the expected value of the first-order reduction of the cost functional with respect to some nominal control in near real-time.
2. SACBP does not require discretization of the state space, the observation space, or the control space. It also does not require discretization of time other than for numerical integration purposes.
3. General nonlinear parametric Bayesian filters can be used for state estimation as long as the system is control-affine and the control cost is quadratic.
4. Stochasticity in the future observations is fully considered.
5. SACBP is an anytime algorithm. Furthermore, the Monte Carlo sampling part of the algorithm is naturally parallelizable.
6. Even though SACBP is inherently suboptimal for the original stochastic optimal control problem, empirical results suggest that it is highly sample-efficient and outperforms other approaches when near real-time performance is required.
Although there exists prior work [21] that uses SAC for active sensing, its problem formulation relies on the ergodic control framework, which is significantly different from the belief space planning framework we propose here. We show that our SACBP outperforms ergodic trajectory optimization, MCTS-DPW, and a greedy method on a multi-target tracking example. We also show that SACBP outperforms belief iLQG and MCTS-DPW on a manipulation scenario. In the next section we derive relevant equations and present the SACBP algorithm along with a running time analysis. The simulation results are summarized in Sect. 3, followed by conclusions and future work in Sect. 4.
2
SACBP Algorithm
We first consider the case where some components of the state are fully observable. This mixed observability assumption is common in various active sensing problems [7,22,23] where the state of the robot is perfectly known, but some external variable of interest (e.g. a target’s location) is stochastic. In addition, deterministic state transitions are often assumed for the robot. Therefore, in Sect. 2.1 we derive the SACBP control update formulae for this case. The general belief space planning where none of the state is fully observable or deterministically controlled is discussed in Sect. 2.2. An extension to use a closed-loop policy as the nominal control is presented in Sect. 2.3. The computational time complexity is discussed in Sect. 2.4. 2.1
Problems with Mixed Observability
Suppose that a robot can fully observe and deterministically control some state p(t) ∈ R^n. Other states are not known to the robot and are estimated with the belief vector b(t) ∈ R^m. We define the augmented state as s ≜ (p^T, b^T)^T ∈ R^{n+m}.
Dynamics Model. The physical state p is described by the following ODE:

    ṗ(t) = f(p(t), u(t)),   (1)
where u(t) ∈ R^l is the control signal. On the other hand, suppose that the belief state changes in discrete time upon arrival of a new observation from the sensors. We will discuss the more general continuous-discrete filtering in Sect. 2.2. Let t_k be the time when the k-th observation becomes available to the robot. The belief state transition is given by

    b(t_k^+) = g(p(t_k^−), b(t_k^−), y_k),   b(t) = b(t_k^+) for t ∈ [t_k^+, t_{k+1}^−],   (2)

where t_k^+ is infinitesimally larger than t_k and t_k^− smaller. Nonlinear function g corresponds to a parametric Bayesian filter (e.g., Kalman Filter, Extended Kalman Filter, Discrete Bayesian Filter, etc.) that takes the new observation y_k ∈ R^q and returns the updated belief state. The concrete choice of the filter depends on the instance of the problem. We demand the Bayesian filter to be differentiable in p and b, which we assume hereafter.¹ Equations (1) and (2) constitute a hybrid system with time-driven switching [18]. This hybrid system representation is practical since it captures the fact that the observation updates occur less frequently than the control actuation in general, due to expensive information processing of sensor readings. Furthermore, with this representation one can naturally handle agile systems as they are without coarse discretization in time. Given the initial state s(t_0) = (p(t_0)^T, b(t_0)^T)^T and a control trajectory from t_0 to t_f denoted as u(t_0 → t_f), the system evolves stochastically according to the hybrid dynamics equations. The stochasticity is due to a sequence of stochastic future observations that will be taken by t_f.²
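As a concrete illustration of this hybrid structure, the sketch below (our own Python; f, g, u, and observe are user-supplied callables, and forward-Euler is a placeholder integration scheme) propagates the augmented state by integrating (1) between observation arrivals and applying the filter jump (2) at each arrival.

    import numpy as np

    def simulate_hybrid(p0, b0, f, g, u, observe, t0, tf, dt_obs, dt=1e-3):
        """Propagate s = (p, b): p flows continuously under Eq. (1), while b
        jumps at each observation time t_k via the Bayesian filter g, Eq. (2)."""
        p, b, t = np.asarray(p0, float), np.asarray(b0, float), t0
        t_next = t0 + dt_obs                 # next observation time t_k
        while t < tf:
            p = p + dt * f(p, u(t))          # continuous physical dynamics
            t += dt
            if t >= t_next:                  # k-th observation arrives
                y = observe(p)               # sampled or recorded observation
                b = g(p, b, y)               # discrete belief jump
                t_next += dt_obs
        return p, b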
Perturbed Dynamics. The control synthesis of SACBP begins with a given nominal control trajectory u_n(t_0 → t_f). Suppose that the nominal control is applied to the system and a sequence of K observations (y_(1), ..., y_(K)) is obtained. Conditioned on the observation sequence, the augmented state evolves deterministically. Let s_n = (p_n^T, b_n^T)^T be the nominal trajectory of the augmented state induced by (y_(1), ..., y_(K)). Now let us consider perturbing the nominal trajectory at a fixed time τ for a short duration ε > 0. The perturbed control trajectory u_w is defined as

    u_w(t) ≜ { w,  if t ∈ [τ − ε, τ];   u_n(t),  otherwise }.   (3)
¹ Prior work such as [4,10] also assumes this differentiability property.
² We assume that the observation interval t_{k+1} − t_k ≜ Δt_o is fixed, and the control signals are recomputed when a new observation is incorporated in the belief.
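In code, the needle perturbation (3) is just a wrapper around the nominal control signal (a sketch; the names are ours):

    def perturbed_control(u_nominal, w, tau, eps):
        """u_w of Eq. (3): apply the constant control w on [tau - eps, tau]
        and the nominal control everywhere else."""
        def u_w(t):
            return w if tau - eps <= t <= tau else u_nominal(t)
        return u_w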
The resulting perturbed system trajectory can be written as

    p_w(t, ε) = p_n(t) + εΨ_p(t) + o(ε),   b_w(t, ε) = b_n(t) + εΨ_b(t) + o(ε),   (4)
where Ψp (t) ∂∂+ pw (t, )=0 and Ψb (t) ∂∂+ bw (t, )=0 are the state variations that are linear in the perturbation duration . The notation ∂∂+ represents the right derivative with respect to . The state variations at perturbation time τ satisfy Ψp (τ ) = f (pn (τ ), w) − f (pn (τ ), un (τ )) (5) Ψb (τ ) = 0, assuming that τ does not exactly correspond to one of the switching times tk [24]. For t ≥ τ , the physical state variation Ψp evolves according to the following first-order ODE: d ∂+ ˙ pw (t, ) Ψp (t) = (6) dt ∂ =0 ∂+ f (pw (t, ), un (t)) (7) = ∂ =0 ∂ f (pn (t), un (t)) Ψp (t), (8) = ∂p where the chain rule of differentiation and pw (t, 0) = pn (t) are used in (8). The − dynamics of the belief state variation Ψb in the continuous region t ∈ [t+ k , tk+1 ] are Ψ˙ b (t) = 0 since the belief vector b(t) is constant according to (2). However, across the discrete jumps the belief state variation Ψb changes discontinuously and satisfies ∂+ + b Ψb (t+ ) = (t , ) (9) w k k ∂ =0 ∂+ − = g pw (t− , ), b (t , ), y (10) w k k k ∂ =0 ∂ ∂ − − − − g pn (t− g pn (t− = k ), bn (tk ), yk Ψp (tk ) + k ), bn (tk ), yk Ψb (tk ). ∂p ∂b (11) Perturbed Cost Functional. Let us consider the cost functional of the form tf c (p(t), b(t), u(t)) dt + h(p(tf ), b(tf )), (12) J(p, b, u) = t0
where c is the running cost and h is the terminal cost. Following the discussion above on the perturbed dynamics, let Jn be the total cost of the nominal trajectory conditioned on the given observation sequence (y(1) , . . . , y(K) ). Under the fixed perturbation time τ , we represent the perturbed cost Jw in terms of Jn as
$$J_w(p_w, b_w, \epsilon) \triangleq J_n + \epsilon\, \nu(t_f) + o(\epsilon), \tag{13}$$
where $\nu(t_f) \triangleq \frac{\partial^+}{\partial \epsilon} J_w(p_w, b_w, \epsilon)\big|_{\epsilon=0}$ is the variation of the cost functional linear in ε. For further analysis it is convenient to express the running cost in the Mayer form [24]. Let s^0(t) be a new state variable defined by $\dot{s}^0(t) = c\big(p(t), b(t), u(t)\big)$ and s^0(t_0) = 0. Then the total cost is a function of the appended augmented state $\bar{s} \triangleq (s^0, s^\top)^\top \in \mathbb{R}^{1+n+m}$ at time t_f, which is given by

$$J = s^0(t_f) + h\big(s(t_f)\big). \tag{14}$$
Using this form of the total cost J, the perturbed cost (13) is

$$J_w = J_n + \epsilon \begin{bmatrix} 1 \\ \frac{\partial}{\partial p} h\big(p_n(t_f), b_n(t_f)\big) \\ \frac{\partial}{\partial b} h\big(p_n(t_f), b_n(t_f)\big) \end{bmatrix}^\top \bar{\Psi}(t_f) + o(\epsilon), \tag{15}$$

where $\bar{\Psi}(t_f) \triangleq \big(\Psi^0(t_f)^\top, \Psi_p(t_f)^\top, \Psi_b(t_f)^\top\big)^\top$. Note that the dot product in (15) corresponds to ν(t_f) in (13). The variation of the appended augmented state Ψ⁰ follows the variational equation

$$\dot{\Psi}^0(t) = \frac{d}{dt}\frac{\partial^+}{\partial \epsilon} s^0_w(t, \epsilon)\Big|_{\epsilon=0} \tag{16}$$
$$= \frac{\partial}{\partial p} c\big(p_n(t), b_n(t), u_n(t)\big)^\top \Psi_p(t) + \frac{\partial}{\partial b} c\big(p_n(t), b_n(t), u_n(t)\big)^\top \Psi_b(t). \tag{17}$$

The perturbed cost Eq. (15), especially the dot product expressing ν(t_f), is consequential; it tells us how the total cost functional changes due to the perturbation at some fixed time τ, up to the first order with respect to the perturbation duration ε. At this point, one could compute the value of ν(t_f) for a control perturbation with a specific value of (w, τ) by simulating the nominal dynamics and integrating the variational Eqs. (8) (11) (17) up to t_f.

Adjoint Equations. Unfortunately, this forward integration of ν(t_f) is not so useful by itself since we are interested in finding the value of (w, τ) that leads to the minimum value of ν(t_f), if it exists; it would be computationally intensive to apply the control perturbation at different application times τ with different values of w and re-simulate the state variation Ψ̄. To avoid this computationally expensive search, Ansari and Murphey [1] introduced the adjoint system ρ̄ with which the dot product is invariant: $\frac{d}{dt}\big(\bar{\rho}(t)^\top \bar{\Psi}(t)\big) = 0 \;\; \forall t \in [t_0, t_f]$. If we let $\bar{\rho}(t_f) = \big(1, \frac{\partial}{\partial p} h(p_n(t_f), b_n(t_f))^\top, \frac{\partial}{\partial b} h(p_n(t_f), b_n(t_f))^\top\big)^\top$ so that its dot product with Ψ̄(t_f) equals ν(t_f) as in (15), the time invariance gives

$$\nu(t_f) = \bar{\rho}(\tau)^\top \bar{\Psi}(\tau) \tag{18}$$
$$= \bar{\rho}(\tau)^\top \begin{bmatrix} c\big(p_n(\tau), b_n(\tau), w\big) - c\big(p_n(\tau), b_n(\tau), u_n(\tau)\big) \\ f\big(p_n(\tau), w\big) - f\big(p_n(\tau), u_n(\tau)\big) \\ 0 \end{bmatrix}. \tag{19}$$
Therefore, we can compute the first-order cost change ν(t_f) for different values of τ once the adjoint trajectory is derived. For t ∈ [t_k^+, t_{k+1}^-] the time derivative of Ψ_b exists, and the invariance property requires that $\dot{\bar{\rho}}(t)^\top \bar{\Psi}(t) + \bar{\rho}(t)^\top \dot{\bar{\Psi}}(t) = 0$ is enforced. It can be verified that the following system satisfies this equation:

$$\begin{cases} \dot{\rho}^0(t) = 0 \\ \dot{\rho}_p(t) = -\frac{\partial}{\partial p} c\big(p_n(t), b_n(t), u_n(t)\big) - \frac{\partial}{\partial p} f\big(p_n(t), u_n(t)\big)^\top \rho_p(t) \\ \dot{\rho}_b(t) = -\frac{\partial}{\partial b} c\big(p_n(t), b_n(t), u_n(t)\big). \end{cases} \tag{20}$$

Analogously, at discrete jumps we can still enforce the invariance by setting $\bar{\rho}(t_k^+)^\top \bar{\Psi}(t_k^+) = \bar{\rho}(t_k^-)^\top \bar{\Psi}(t_k^-)$, which holds for the following adjoint equations:

$$\begin{cases} \rho^0(t_k^-) = \rho^0(t_k^+) \\ \rho_p(t_k^-) = \rho_p(t_k^+) + \frac{\partial}{\partial p} g\big(p_n(t_k^-), b_n(t_k^-), y_k\big)^\top \rho_b(t_k^+) \\ \rho_b(t_k^-) = \frac{\partial}{\partial b} g\big(p_n(t_k^-), b_n(t_k^-), y_k\big)^\top \rho_b(t_k^+). \end{cases} \tag{21}$$
Note that the adjoint system integrates backward in time as it has the boundary condition defined at t_f. More importantly, the adjoint dynamics (20) (21) only depend on the nominal trajectory of the system (p_n, b_n) and the observation sequence (y_(1), ..., y_(K)). The linear variation term ν(t_f) is finally given by

$$\nu(t_f) = c\big(p_n(\tau), b_n(\tau), w\big) - c\big(p_n(\tau), b_n(\tau), u_n(\tau)\big) + \rho_p(\tau)^\top \big(f(p_n(\tau), w) - f(p_n(\tau), u_n(\tau))\big). \tag{22}$$
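To illustrate how (20)-(22) might be evaluated numerically, the sketch below backward-integrates the adjoint with an Euler scheme along a stored nominal trajectory and then scores ν(t_f) for a candidate pair (w, τ). The Jacobian callbacks (dc_dp, df_dp, and so on) are hypothetical placeholders for an automatic differentiation tool; this is an illustrative reading of the equations, not the authors' code.

import numpy as np

def backward_adjoint(rho_f, times, nominal, jac, obs):
    """Backward Euler integration of the adjoint (20) with jumps (21).
    rho_f   : (rho_p, rho_b) boundary condition at t_f from the gradient of h
    times   : decreasing array of time stamps along the nominal trajectory
    nominal : dict t -> (p_n, b_n, u_n)
    jac     : dict of Jacobian callbacks, e.g. jac['df_dp'](p, u)
    obs     : dict mapping observation times t_k to samples y_k
    """
    rho_p, rho_b = rho_f
    out = {times[0]: (rho_p.copy(), rho_b.copy())}
    for t_next, t in zip(times[:-1], times[1:]):
        dt = t_next - t
        p, b, u = nominal[t]
        # Continuous adjoint flow (20); rho_b only accumulates the cost gradient.
        rho_p = rho_p + dt * (jac['dc_dp'](p, b, u) + jac['df_dp'](p, u).T @ rho_p)
        rho_b = rho_b + dt * jac['dc_db'](p, b, u)
        if t in obs:
            # Discrete adjoint jump (21) across an observation update.
            y = obs[t]
            rho_p = rho_p + jac['dg_dp'](p, b, y).T @ rho_b
            rho_b = jac['dg_db'](p, b, y).T @ rho_b
        out[t] = (rho_p.copy(), rho_b.copy())
    return out

def nu(tau, w, nominal, rho_traj, c, f):
    """First-order cost change (22) for a perturbation (w, tau)."""
    p, b, u_n = nominal[tau]
    rho_p, _ = rho_traj[tau]
    return c(p, b, w) - c(p, b, u_n) + rho_p @ (f(p, w) - f(p, u_n))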
Control Optimization. In order to efficiently optimize (22) with respect to (w, τ), the rest of the paper assumes that the control cost is quadratic, $\frac{1}{2} u^\top C_u u$, and that the dynamics model f(p, u) is control-affine with linear term H(p)u. Although the control-affine assumption may appear restrictive, many physical systems possess this property in engineering practice. As a result of these assumptions, (22) becomes

$$\nu(t_f) = \frac{1}{2} w^\top C_u w + \rho_p(\tau)^\top H(p_n(\tau))\big(w - u_n(\tau)\big) - \frac{1}{2} u_n(\tau)^\top C_u u_n(\tau). \tag{23}$$
So far we have treated the observation sequence (y_(1), ..., y_(K)) as given and fixed. However, in practice it is a random process that we have to take into account. Fortunately, our control optimization is all based on the nominal control u_n(t_0 → t_f), with which we can both simulate the augmented dynamics and sample the observations. To see this, let us rewrite ν(t_f) in (23) as ν(t_f, y_(1), ..., y_(K)) to clarify the dependence on the observations. The expected value of the first-order cost variation is given by

$$\mathbb{E}[\nu(t_f)] = \int \nu\big(t_f, y_{(1)}, \dots, y_{(K)}\big)\, p\big(y_{(1)}, \dots, y_{(K)} \mid u_n(t_0 \to t_f)\big)\, dy_{(1)} \dots dy_{(K)}. \tag{24}$$
Even though we do not know the values of the distribution above, we have the generative model; we can simulate the augmented state trajectory using the nominal control u_n(t_0 → t_f) and sample the stochastic observations from the belief states along the trajectory. Using the linearity of expectation for (23), we have

$$\mathbb{E}[\nu(t_f)] = \frac{1}{2} w^\top C_u w + \mathbb{E}[\rho_p(\tau)]^\top H(p_n(\tau))\big(w - u_n(\tau)\big) - \frac{1}{2} u_n(\tau)^\top C_u u_n(\tau). \tag{25}$$

Notice that only the adjoint trajectory is stochastic. We can employ Monte Carlo sampling to sample a sufficient number of observation sequences to approximate the expected adjoint trajectory. Now (25) becomes a convex quadratic in w for $C_u \succ 0$. An existing convex solver efficiently solves the following convex program with a box saturation constraint:

$$\begin{aligned} \underset{w}{\text{minimize}} \quad & \mathbb{E}[\nu(t_f)] \\ \text{subject to} \quad & a \preceq w \preceq b. \end{aligned} \tag{26}$$

Furthermore, analytical solutions are available if C_u is diagonal, since the coordinates of w are completely decoupled in this case. This optimization is solved for different values of τ ∈ (t_0 + t_calc + ε, t_0 + Δt_o), where t_calc is the pre-allocated computational time budget and Δt_o is the time interval between two successive observations and control updates. We then search for the optimal perturbation time τ* to globally minimize E[ν(t_f)] over (w*(τ), τ). There is only a finite number of τ to consider since in practice we use numerical integration such as the Euler scheme with some step size Δt_c to compute the trajectories. In [1] the finite perturbation duration ε is also optimized using line search, but in this work we set ε as a tunable parameter to reduce the computational complexity. The complete algorithm is summarized in Algorithm 1, which is called every Δt_o [s] as the new observation is incorporated in the belief.
2.2 General Belief Space Planning Problems
If none of the state is fully observable, the same stochastic SAC framework still applies almost as-is to the belief state b. In this case we consider a continuous-discrete filter [25] where the prediction step follows an ODE and the update step provides an instantaneous discrete jump. The hybrid dynamics for the belief vector are given by

$$b(t_k^+) = g\big(b(t_k^-), y_k\big), \qquad \dot{b}(t) = f_b\big(b(t), u(t)\big) \quad t \in [t_k^+, t_{k+1}^-]. \tag{27}$$

Mirroring the approach in Sect. 2.1, one can verify that the linear cost variation term ν(t_f) has the same form as (22):

$$\nu(t_f) = c\big(b_n(\tau), w\big) - c\big(b_n(\tau), u_n(\tau)\big) + \rho(\tau)^\top \big(f_b(b_n(\tau), w) - f_b(b_n(\tau), u_n(\tau))\big). \tag{28}$$
Algorithm 1. SACBP Control Update
INPUT: Current augmented state s(t_0) = (p(t_0)^⊤, b(t_0)^⊤)^⊤ or belief state b(t_0), nominal control trajectory u_n(t_0 → t_f), perturbation duration ε > 0
OUTPUT: Optimally perturbed control trajectory u_w(t_0 → t_f)
1: for i = 1:N do
2:   Forward-simulate the augmented state trajectory (1)(2) or belief state trajectory (27) and sample an observation sequence (y_(1)^i, ..., y_(K)^i) from the belief states.
3:   Backward-simulate the adjoint trajectory ρ^i(t_0 → t_f) (20)(21) or (29) with the sampled observations.
4: end for
5: Monte Carlo estimate E[ρ_p] ≈ (1/N) Σ_{i=1}^N ρ_p^i or E[ρ^⊤ H_b(b_n)] ≈ (1/N) Σ_{i=1}^N ρ^{i⊤} H_b(b_n^i).
6: for (τ = t_0 + t_calc + ε; τ ≤ t_0 + Δt_o; τ ← τ + Δt_c) do
7:   Solve the convex program (26). Store the optimal value v*(τ) and optimizer w*(τ).
8: end for
9: τ* ← arg min v*(τ), w* ← w*(τ*)
10: u_w(t_0 → t_f) ← PerturbControlTrajectory(u_n, w*, τ*, ε) (3)
11: return u_w(t_0 → t_f)
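A compact Python rendering of Algorithm 1 may help tie the steps together. The callables passed in stand for the forward simulation, adjoint computation, and convex program of the previous sections; this skeleton is a hypothetical sketch, not the authors' implementation.

import numpy as np

def sacbp_update(forward_sim, backward_adj, solve_qp, u_n, eps, N, taus):
    """One SACBP control update following Algorithm 1.
    forward_sim  : () -> nominal trajectory with a sampled observation sequence
    backward_adj : nominal data -> adjoint trajectory {tau: rho_p(tau)}
    solve_qp     : (tau, E_rho_p) -> (optimal value v*, optimizer w*), i.e. (26)
    u_n          : nominal control, t -> u
    """
    # Lines 1-4: sample N observation sequences and adjoint trajectories.
    adjoints = [backward_adj(forward_sim()) for _ in range(N)]
    # Line 5: Monte Carlo expectation of the adjoint at each candidate time.
    E_rho = {tau: np.mean([rho[tau] for rho in adjoints], axis=0) for tau in taus}
    # Lines 6-9: optimize w at each tau, then pick the globally best tau.
    vals = {tau: solve_qp(tau, E_rho[tau]) for tau in taus}
    tau_s = min(vals, key=lambda tau: vals[tau][0])
    w_s = vals[tau_s][1]
    # Line 10: the perturbed control (3) splices w* over [tau*-eps, tau*].
    return lambda t: w_s if tau_s - eps <= t <= tau_s else u_n(t)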
The adjoint variable ρ now has the same dimension as b and follows the adjoint dynamics

$$\rho(t_k^-) = \frac{\partial}{\partial b} g\big(b_n(t_k^-), y_k\big)^\top \rho(t_k^+), \qquad \dot{\rho}(t) = -\frac{\partial}{\partial b} c\big(b_n(t), u_n(t)\big) - \frac{\partial}{\partial b} f_b\big(b_n(t), u_n(t)\big)^\top \rho(t), \tag{29}$$

with the boundary condition $\rho(t_f) = \frac{\partial}{\partial b} h(b_n(t_f))$. Under the control-affine assumption for f_b and the quadratic control cost,³ the expected first-order cost variation (28) yields

$$\mathbb{E}[\nu(t_f)] = \frac{1}{2} w^\top C_u w + \mathbb{E}\big[\rho(\tau)^\top H_b(b_n(\tau))\big]\big(w - u_n(\tau)\big) - \frac{1}{2} u_n(\tau)^\top C_u u_n(\tau), \tag{30}$$
where H_b is the control coefficient term in f_b. We can sample ρ(τ)^⊤ H_b(b_n(τ)) via the forward-backward simulation of the dynamics.

2.3 Closed-Loop Nominal Policy
In Sects. 2.1 and 2.2 we assumed that the nominal control u_n was an open-loop control trajectory. However, one can think of a scenario where a nominal control is a closed-loop policy computed off-line, possibly using a POMDP algorithm in a discretized space. Indeed, SACBP can also handle closed-loop nominal policies. Let π_n be a closed-loop nominal policy, which is a mapping from either an augmented state s or a belief state b to a control value u. Due to the stochastic belief dynamics, the control values returned by π_n in the future are also stochastic. This is reflected when we take expectations over the nominal dynamics. Specifically, the terms dependent on u_n in (25) and (30) now become stochastic and thus need to be sampled. However, the equations are still convex quadratic in w as it is decoupled from the nominal control. Therefore, only lines 5 and 7 of Algorithm 1 are affected.

³ Although it is difficult to state the general conditions under which this control-affine assumption holds, it can be verified that the continuous-discrete EKF [25] satisfies this property if the underlying state dynamics are control-affine.

2.4 Computational Time Analysis
Let us analyze the time complexity of the SACBP algorithm. The bottleneck of the computation is the forward-backward simulation performed multiple times (lines 1–5 of Algorithm 1). The asymptotic complexity of this part is given by $O\big(N \frac{t_f - t_0}{\Delta t_o}(M_{\text{forward}} + M_{\text{backward}})\big)$, where M_forward and M_backward are the times to respectively integrate the forward and backward dynamics between two successive observations. For a more concrete analysis let us use the Gaussian belief dynamics given by the EKF as an example. For simplicity we assume the same dimension n for the state, control, and observation. The belief state has dimension O(n²). Using the Euler scheme, the forward integration takes $M_{\text{forward}} = O\big((\frac{\Delta t_o}{\Delta t_c} + 1) n^3\big)$ since evaluating the continuous and discrete EKF equations is O(n³). Evaluating the continuous part of the costate dynamics (29) is dominated by the computation of the Jacobian $\frac{\partial f_b}{\partial b}$, which is O(n⁵) because O(n³) operations to evaluate f_b are carried out O(n²) times. The discrete part is also O(n⁵). Therefore, $M_{\text{backward}} = O\big((\frac{\Delta t_o}{\Delta t_c} + 1) n^5\big)$. Overall, the time complexity is $O\big(N \frac{t_f - t_0}{\Delta t_o}(\frac{\Delta t_o}{\Delta t_c} + 1) n^5\big)$. This is asymptotically smaller in n than belief iLQG, which is O(n⁶). See [11] for a comparison of time complexity among different belief space planning algorithms. We also remind the reader that SACBP is an online method and a naive implementation already achieves near real-time performance, computing control in less than 0.4[s]. By near real-time we mean that a naive implementation of SACBP requires approximately 3 × t_calc to 7 × t_calc time to compute an action that must be applied t_calc in the future. We expect that parallelization on a GPU and a more efficient implementation will result in real-time computation for SACBP.
3 Simulation Results
We evaluated the performance of SACBP in the following simulation studies: (i) active multi-target tracking with range-only observations; (ii) object manipulation under model uncertainty. All the computation was performed on a desktop computer with Intel Core i7-6800K CPU and 62.8 GB RAM. The Monte Carlo sampling of SACBP was parallelized on the CPU.
3.1 Active Multi-target Tracking with Range-Only Observations
This problem focuses on pure information gathering, namely identifying where the moving targets are in the environment. In doing so, the surveillance robot, modeled as a single integrator, can only use relative distance observations. The robot's position p is fully observable and transitions deterministically. Assuming perfect data association, the observation for target i is $d_i = \|q_i - p + v_i\|_2$, where q_i is the true target position and v_i is zero-mean Gaussian white noise with state-dependent covariance $R(p, q_i) = R_0 + \|q_i - p\|^2 R_1$. We used 0.01I_{2×2} for the nominal noise R_0. The range-dependent noise R_1 = 0.001I_{2×2} degrades the observation quality as the robot gets farther from the target. The discrete-time UKF was employed for state estimation in tracking 20 independent targets. The target dynamics are modeled by a 2D Brownian motion with covariance Q = 0.1I_{2×2}. Similarly to [26], an approximated observation covariance R(p, μ_i) was used in the filter to obtain tractable estimation results, where μ_i is the most recent mean estimate of q_i. The SACBP algorithm generated the continuous robot trajectory over 200[s] with planning horizon t_f − t_0 = 2[s], update interval Δt_o = 0.2[s], perturbation duration ε = 0.16[s], and N = 10 Monte Carlo samples. The Euler scheme was used for integration with Δt_c = 0.01[s]. The Jacobians and gradients were computed either analytically or using an automatic differentiation tool [27] to retain both speed and precision. In this simulation t_calc = 0.05[s] was assumed no matter how long the actual control update took. We used $c(p, b, u) = 0.05\, u^\top u$ for the running cost and $h(p, b) = \sum_{i=1}^{20} \exp(\text{entropy}(b_i))$ for the terminal cost, with the intention of reducing the worst-case uncertainty among the targets. The nominal control was constantly 0. We compared SACBP against three benchmarks: (i) a greedy algorithm based on gradient descent of the terminal cost h, similar to [7]; (ii) MCTS-DPW [13,28] in the Gaussian belief space; (iii) projection-based trajectory optimization for ergodic exploration [29–31]. We also implemented the belief iLQG algorithm, but the policy did not converge for this problem. We speculate that the non-convex terminal cost h contributed to this behavior, which in fact violates one of the underlying assumptions made in the paper [10]. MCTS-DPW used the same planning horizon as SACBP; however, it drew N = 15 samples from the belief tree so that the computation times of the two algorithms approximately match. Ergodic trajectory optimization is not a belief space planning approach but has been used in the active sensing literature. Beginning with nominal control 0, it locally optimized the ergodicity of the trajectory with respect to the spatial information distribution based on Fisher information. This optimization was open-loop since future observations were not considered. As a new observation became available, the distribution and the trajectory were recomputed. All the controllers were saturated at the same limit. The results presented in Fig. 1 clearly indicate a significant performance improvement of SACBP while achieving near real-time computation. More notably, SACBP generated a trajectory that periodically revisited the two groups of targets whereas the other methods failed to do so (Fig. 2).
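To make the observation model concrete, the sketch below implements the range measurement with its state-dependent noise covariance exactly as described above; the sampling helper itself is a minimal illustration, not the paper's code.

import numpy as np

R0 = 0.01 * np.eye(2)    # nominal noise covariance
R1 = 0.001 * np.eye(2)   # range-dependent degradation

def obs_cov(p, q):
    """State-dependent covariance R(p, q) = R0 + ||q - p||^2 R1."""
    return R0 + np.dot(q - p, q - p) * R1

def range_obs(p, q, rng=np.random.default_rng()):
    """Sample the range-only observation d_i for one target."""
    v = rng.multivariate_normal(np.zeros(2), obs_cov(p, q))
    return np.linalg.norm(q - p + v)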
Fig. 1. (Left) Simulation environment with 20 targets and a surveillance robot. (Middle) The history of the worst entropy value among the targets averaged over 20 runs with the standard deviation. With the budget of 10 Monte Carlo samples, SACBP had small variance and consistently outperformed the other benchmarks on average. (Right) Computational time of SACBP achieved a reasonable value compared with the benchmarks, only 0.11[s] slower than the targeted value, i.e., simulated tcalc .
Fig. 2. Sample robot trajectories (depicted in red) generated by each algorithm. Greedy, MCTS-DPW, and Ergodic did not result in a trajectory that fully covers the two groups of the targets, whereas SACBP periodically revisited both of them. The blue lines are the target trajectories. The ellipses are 99% error ellipses.
3.2 Object Manipulation Under Model Uncertainty
This problem is the model-based Bayesian reinforcement learning problem studied in [2]; therefore the description of the nonlinear dynamics and observation models is omitted. See Fig. 3 for an illustration of the environment. The robot applies a force and a torque to move the object whose mass, moment of inertia, moment arm lengths, and linear friction coefficient are unknown. The robot's state also needs to be estimated. The same values for t_f − t_0, Δt_o, Δt_c, t_calc as in the previous problem were assumed. SACBP used ε = 0.04[s] and N = 10. The nominal control was a closed-loop position controller whose input was the mean x-y position and rotation estimates. The cost function was quadratic in the true state x and control u, given by $\frac{1}{2} x^\top C_x x + \frac{1}{2} u^\top C_u u$. Taking expectations yielded the equivalent cost in the Gaussian belief space $c(b, u) = \frac{1}{2}\mu^\top C_x \mu + \frac{1}{2}\mathrm{tr}(C_x \Sigma) + \frac{1}{2} u^\top C_u u$, where Σ is the covariance matrix. We let the terminal cost h be the same as c without the control term.
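The belief-space cost above follows from a standard identity for quadratic forms under a Gaussian; the short derivation below is added for completeness and is not in the original text. For $x \sim \mathcal{N}(\mu, \Sigma)$,

$$\mathbb{E}\Big[\tfrac{1}{2} x^\top C_x x\Big] = \tfrac{1}{2}\,\mathbb{E}[x]^\top C_x\, \mathbb{E}[x] + \tfrac{1}{2}\operatorname{tr}\big(C_x \operatorname{Cov}(x)\big) = \tfrac{1}{2}\mu^\top C_x \mu + \tfrac{1}{2}\operatorname{tr}(C_x \Sigma),$$

which, together with the deterministic control term $\frac{1}{2} u^\top C_u u$, gives the expression for c(b, u).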
Fig. 3. (Left) The robot is attached to the rectangular object. (Middle) The history of the true running cost $\frac{1}{2} x^\top C_x x + \frac{1}{2} u^\top C_u u$ averaged over 20 cases. SACBP with N = 10 samples successfully brought the cost to almost 0, meaning that the goal was reached. MCTS-DPW with N = 190 was not as successful. Belief iLQG resulted in large overshoots around 2[s] and 11[s]. (Right) Computational time of SACBP increased from the multi-target tracking problem due to additional computation related to the closed-loop nominal policy. Note that belief iLQG took 5945[s] to derive the policy off-line, although the average online execution time was only 4.0 × 10⁻⁵[s] per iteration.
We compared SACBP against (i) MCTS-DPW in the Gaussian belief space and (ii) belief iLQG. We allowed MCTS-DPW to draw N = 190 samples to make its computation time comparable to SACBP's. As suggested in [2], MCTS-DPW used the position controller mentioned above as the rollout policy. Similarly, belief iLQG was initialized with a nominal trajectory generated by the same position controller. Note that both MCTS-DPW and belief iLQG computed controls for discrete-time models whereas SACBP directly used the continuous-time model. However, the simulation was all performed in continuous time, which could explain the large overshoot of the belief iLQG trajectory in Fig. 3. Overall, the results presented above demonstrate that SACBP succeeded in this task with only 10 Monte Carlo samples, reducing the running cost to almost 0 within 10[s]. Although the computation time increased from the previous problem due to the use of a closed-loop nominal policy, SACBP still achieved near real-time performance, with far less computation than belief iLQG, which took 5945[s] until convergence in our implementation.
4 Conclusions and Future Work
In this paper we have presented SACBP, a novel belief space planning algorithm for continuous-time dynamical systems. We have viewed the stochastic belief dynamics as a hybrid system with time-driven switching and derived the optimal control perturbation based on the perturbation theory of differential equations. The resulting algorithm extends the framework of SAC to stochastic belief dynamics and is highly parallelizable to run in near real-time. Through the extensive simulation study we have confirmed that SACBP outperforms other
algorithms including a greedy algorithm, a local trajectory optimization method, and an approximate dynamic programming approach. In future work we will consider a distributed multi-robot version of SACBP as well as problems where both discrete and continuous actions exist. We also plan to address questions regarding the theoretical guarantees of this algorithm and provide additional case studies with more complex belief distributions. Acknowledgements. This work was supported by the Toyota Research Institute (TRI). This article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. This work was also supported in part by NSF grant CMMI-1562335, and a JASSO fellowship. The authors are grateful for this support.
References
1. Ansari, A.R., Murphey, T.D.: Sequential action control: closed-form optimal control for nonlinear and nonsmooth systems. IEEE Trans. Rob. 32(5), 1196–1214 (2016)
2. Slade, P., Culbertson, P., Sunberg, Z., Kochenderfer, M.: Simultaneous active parameter estimation and control using sampling-based Bayesian reinforcement learning. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 804–810 (2017)
3. Mihaylova, L., Lefebvre, T., Bruyninckx, H., Gadeyne, K., De Schutter, J.: Active sensing for robotics - a survey. In: Proceedings of 5th International Conference on Numerical Methods and Applications, pp. 316–324 (2002)
4. Platt, R., Tedrake, R., Kaelbling, L., Lozano-Perez, T.: Belief space planning assuming maximum likelihood observations. In: Robotics Science and Systems Conference (RSS) (2010)
5. Bourgault, F., Makarenko, A., Williams, S., Grocholsky, B., Durrant-Whyte, H.: Information based adaptive robotic exploration. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 1, pp. 540–545. IEEE (2002)
6. Seekircher, A., Laue, T., Röfer, T.: Entropy-based active vision for a humanoid soccer robot. In: RoboCup 2010: Robot Soccer World Cup XIV. LNCS, vol. 6556, pp. 1–12. Springer, Heidelberg (2011)
7. Schwager, M., Dames, P., Rus, D., Kumar, V.: A multi-robot control policy for information gathering in the presence of unknown hazards. In: Robotics Research: The 15th International Symposium ISRR, pp. 455–472. Springer (2017)
8. Erez, T., Smart, W.D.: A scalable method for solving high-dimensional continuous POMDPs using local approximation. In: Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI 2010, pp. 160–167. AUAI Press, Arlington (2010)
9. Patil, S., Kahn, G., Laskey, M., Schulman, J., Goldberg, K., Abbeel, P.: Scaling up Gaussian belief space planning through covariance-free trajectory optimization and automatic differentiation. In: WAFR. Springer Tracts in Advanced Robotics, vol. 107, pp. 515–533. Springer (2014)
10. van den Berg, J., Patil, S., Alterovitz, R.: Motion planning under uncertainty using iterative local optimization in belief space. Int. J. Robot. Res. 31(11), 1263–1278 (2012)
11. Rafieisakhaei, M., Chakravorty, S., Kumar, P.R.: T-LQG: closed-loop belief space planning via trajectory-optimized LQG. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 649–656 (2017)
12. Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable stochastic domains. Artif. Intell. 101(1–2), 99–134 (1998)
13. Couëtoux, A., Hoock, J.B., Sokolovska, N., Teytaud, O., Bonnard, N.: Continuous upper confidence trees. In: 2011 International Conference on Learning and Intelligent Optimization, pp. 433–445. Springer, Heidelberg (2011)
14. Browne, C.B., Powley, E., Whitehouse, D., Lucas, S.M., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., Colton, S.: A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 4(1), 1–43 (2012)
15. Nishimura, H., Schwager, M.: Active motion-based communication for robots with monocular vision. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 2948–2955 (2018)
16. Somani, A., Ye, N., Hsu, D., Lee, W.S.: DESPOT: online POMDP planning with regularization. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS 2013, pp. 1772–1780. Curran Associates Inc., USA (2013)
17. Sunberg, Z., Kochenderfer, M.J.: POMCPOW: an online algorithm for POMDPs with continuous state, action, and observation spaces. CoRR abs/1709.06196 (2017)
18. Heemels, W., Lehmann, D., Lunze, J., De Schutter, B.: Introduction to hybrid systems (chap. 1). In: Lunze, J., Lamnabhi-Lagarrigue, F. (eds.) Handbook of Hybrid Systems Control - Theory, Tools, Applications, pp. 3–30. Cambridge University Press, Cambridge (2009)
19. Egerstedt, M., Wardi, Y., Axelsson, H.: Transition-time optimization for switched-mode dynamical systems. IEEE Trans. Autom. Control 51(1), 110–115 (2006)
20. Wardi, Y., Egerstedt, M.: Algorithm for optimal mode scheduling in switched systems. In: 2012 American Control Conference (ACC), pp. 4546–4551 (2012)
21. Mavrommati, A., Tzorakoleftherakis, E., Abraham, I., Murphey, T.D.: Real-time area coverage and target localization using receding-horizon ergodic exploration. IEEE Trans. Rob., 62–80 (2018)
22. Le Ny, J., Pappas, G.J.: On trajectory optimization for active sensing in Gaussian process models. In: Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, pp. 6286–6292. IEEE (2009)
23. Popović, M., Hitz, G., Nieto, J., Sa, I., Siegwart, R., Galceran, E.: Online informative path planning for active classification using UAVs. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 5753–5758 (2017)
24. Liberzon, D.: Calculus of Variations and Optimal Control Theory: A Concise Introduction. Princeton University Press, Princeton (2011)
25. Xie, L., Popa, D., Lewis, F.L.: Optimal and Robust Estimation: With an Introduction to Stochastic Control Theory. CRC Press, Boca Raton (2007)
26. Spinello, D., Stilwell, D.J.: Nonlinear estimation with state-dependent Gaussian observation noise. IEEE Trans. Autom. Control 55(6), 1358–1366 (2010)
27. Revels, J., Lubin, M., Papamarkou, T.: Forward-mode automatic differentiation in Julia. arXiv:1607.07892 [cs.MS] (2016)
28. Egorov, M., Sunberg, Z.N., Balaban, E., Wheeler, T.A., Gupta, J.K., Kochenderfer, M.J.: POMDPs.jl: a framework for sequential decision making under uncertainty. J. Mach. Learn. Res. 18(26), 1–5 (2017)
29. Miller, L.M., Murphey, T.D.: Trajectory optimization for continuous ergodic exploration. In: 2013 American Control Conference (ACC), pp. 4196–4201. IEEE (2013)
30. Miller, L.M., Silverman, Y., MacIver, M.A., Murphey, T.D.: Ergodic exploration of distributed information. IEEE Trans. Rob. 32(1), 36–52 (2016)
31. Dressel, L., Kochenderfer, M.J.: Tutorial on the generation of ergodic trajectories with projection-based gradient descent. IET Cyber-Phys. Syst.: Theory Appl. (2018)
Active Area Coverage from Equilibrium

Ian Abraham(B), Ahalya Prabhakar, and Todd D. Murphey

Department of Mechanical Engineering, Northwestern University, 2145 Sheridan Road, Evanston, IL 60208, USA
{i-abr,a-prabhakar}@u.northwestern.edu, [email protected]
Abstract. This paper develops a method for robots to integrate stability into actively seeking out informative measurements through coverage. We derive a controller using hybrid systems theory that allows us to consider safe equilibrium policies during active data collection. We show that our method is able to maintain Lyapunov attractiveness while still actively seeking out data. Using incremental sparse Gaussian processes, we define distributions which allow a robot to actively seek out informative measurements. We illustrate our methods for shape estimation using a cart double pendulum, dynamic model learning of a hovering quadrotor, and generating galloping gaits starting from stationary equilibrium by learning a dynamics model for the half-cheetah system from the Roboschool environment.
Keywords: Active exploration · Safe learning · Active learning

1 Introduction
Robot learning has proven to be a challenge in real-world applications. This is partially due to the ineffectiveness of passive data acquisition for learning and the necessity of action in order to generate informative data. What makes this problem even more challenging is that active data gathering is not a stable process. It involves exciting states in order to acquire new information. Safe exploration then becomes a challenge for modern-day robotics. The problem is exacerbated when memory and task constraints (i.e., actively collecting data after deployment) are imposed on the robot. If the structures that compose the dynamics of the robot change over time, the robot will need to explore its own dynamics in a manner that is systematic and informative, avoiding damage to the underlying structures (and humans) in the environment. In this paper, we address these fundamental issues by developing an algorithm that is inspired by hybrid systems theory. This algorithm enables robots to actively pursue informative data by generating area coverage while guaranteeing Lyapunov attractiveness during exploration.

T. Murphey—This material is based upon work supported by the National Science Foundation under Grants CNS 1837515. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the aforementioned institutions.
Active data acquisition and learning are often considered part of the same problem of learning from experience [1,2]. This is generally seen in the field of reinforcement learning (RL) where attempts at a task, as well as learning from the outcome of actions, are used to both learn policies and predictive models [1,3]. As a result, generalizing these methods to real-world application has been a topic of research [1,3,4] where data-inefficiency dominates much of the progress. A solution to the problem of data-inefficiency is to simulate robots in a realistic virtual environment and subsequently use the large amount of synthetic data to solve a learning problem before applying the results on a real robot [5]. This leads to issues such as the “reality-gap” where finer modelling details such as motor delays lead to poor quality data for learning. Existing work addresses the data-inefficiency problem by actively seeking out informative data using information maximization [6] or by pruning a dataset based on some information measure [7]. These methods still suffer from the problem of local minima due to a lack of exploration or non-convex information objectives [8]. Safety in the task is also a concern when actively seeking out informative measurements. Methods typically provide some bound on the worst outcome model using probabilistic approaches [9], but often only consider the safety with respect to the task and not with respect to the data collection process. We focus on problems where data collection involves exploring the state-space of robots where safe generation of informative data is important. In treating data acquisition as a dynamic area coverage problem—where the time spent during the trajectory of the robot is proportional to regions where there is an expectation of informative data—we are able to uncover more informative data that is not already expected. With this approach, we can provide attractiveness guarantees—that the robot will eventually return to a stable state—while providing control authority that allows the robot to actively seek out informative data in order to later solve a learning task. Thus, our contribution is an approach to dynamic area coverage for active data collection that starts from equilibrium policies for robots. We structure the paper as follows: Sect. 2 provides a list of related work, Sect. 3 defines the problem statement for this work. Section 4 formulates the algorithm for active data acquisition from equilibrium. Section 5 provides simulated and experimental examples of our method. Last, Sect. 6 provides concluding remarks on our method and future directions.
2 Related Work
Existing work generally formulates problems of active data acquisition as information maximization with respect to a known parameterized model [10,11]. The problem with this approach is that robots need to address local optima [11,12], resulting in insufficient data collection. Other approaches have sought to solve this problem by treating information maximization as an area coverage problem [12,13]. Ergodic exploration, in particular, has remedied the issue of local extrema by using the ergodic metric to minimize the Sobolev distance [14] from
the time-averaged statistics of the robot's trajectory to the expected information in the explored region. This enables both exploration (quickly in low information regions) and exploitation (spending a significant amount of time in highly informative regions) in order to avoid local extrema and collect informative measurements. The major downside is that this method assumes that the model of the robot is fully known. Moreover, there is little guarantee that the robot will not destabilize during the exploration process. This becomes an issue when the robot must explore part of its own state space (i.e., velocity space) in order to generate informative data. To the authors' best knowledge this has not been done to date. Another issue is that these methods do not scale well with the dimensionality of the search space, making experimental applications with this approach challenging due to computational limitations. Our approach overcomes these issues by using a sample-based KL-divergence measure [13] as a replacement for the ergodic metric. This form of measure has been used previously; however, it relied on motion primitives in order to compute control actions [13]. We avoid this issue by using hybrid systems theory in order to compute a controller that sufficiently reduces the KL-divergence measure from an equilibrium stable policy. As a result, we can use approximate models of dynamical systems instead of complete dynamic reconstructions in order to actively collect data while ensuring safety in the exploration process through a notion of attractiveness. The following section formulates the problem statement that our method solves.
3 Problem Statement
Modeling Assumptions and Stable Policies. Assume we have a robot whose approximate dynamics can be modeled using

$$\dot{x} = f(x, u) = g(x) + h(x)u, \tag{1}$$

where x ∈ R^n is the state of the robot, u ∈ R^m is a control vector applied to the robot, g(x) : R^n → R^n is the free unactuated dynamics, h(x) : R^n → R^{n×m} is the actuated dynamics, and ẋ is the time rate of change of the robot at state x subject to the control u. Moreover, let us assume that there exists a Lyapunov function V(x) such that under a policy μ(x) : R^n → R^m, V̇(x) < 0 ∀x ∈ B, where B = {x ∈ R^n | ‖x‖ < r} for r > 0. For the rest of the paper, we will refer to μ(x) as an equilibrium policy.

KL-Divergence and Area Coverage. Given the assumptions of known approximate dynamics and the equilibrium policy, we can define active exploration for informative data acquisition as automating safe switching between μ(x) and some control authority μ⋆(t) that generates actions that actively seek out informative data. This is accomplished by specifying the active data acquisition task using an area coverage objective where we minimize the KL-divergence
between the time-averaged statistics of the robot along a trajectory and a spatial distribution defining the current coverage requirement. We can then define an approximation to the spatial statistics of the robot as follows:

Definition 1. Given a search domain X^v ⊂ R^{n+m} where v ≤ n + m, the Σ-approximated time-averaged statistics of the robot, i.e., the time the robot spends in regions of the search domain X^v, is defined by

$$q(s \mid x(t), \mu(t)) = \frac{\eta}{T_r} \int_{t_i - t_r}^{t_i + T} \exp\Big(-\frac{1}{2}\big(s - x_v(t)\big)^\top \Sigma^{-1} \big(s - x_v(t)\big)\Big)\, dt \tag{2}$$

where s ∈ X^v ⊂ R^{n+m} is a point in the search domain X^v, x_v(t) is the component of the robot's trajectory x(t) and actions μ(t) that intersects the search domain X^v, Σ ∈ R^{v×v} is a positive definite matrix parameter that specifies the width of the Gaussian, η is a normalization constant such that q(s) > 0 and $\int_{X^v} q(s)\, ds = 1$, t_i is the ith sampling time, and T_r = T + t_r is the sum of the time horizon T and the amount of time t_r the robot remembers x_v(t) into the past.

This is an approximation because the true time-averaged statistics, as described in [12], is a collection of delta functions parameterized by time. We approximate the delta function as a Gaussian distribution with covariance Σ, converging as Σ → 0. Using this approximation, we are able to relax the ergodic area-coverage objective in [12] and use the following KL-divergence objective [13]:

$$D_{KL}(p \,\|\, q) = \int_{X^v} p(s) \ln \frac{p(s)}{q(s)}\, ds = \mathbb{E}_{p(s)}\big[\ln p(s) - \ln q(s)\big],$$

where E is the expectation operator, q(s) = q(s | x(t), μ(t)), and p(s), with p(s) > 0 and $\int_{X^v} p(s)\, ds = 1$, is a distribution that describes where in the search domain an informative measurement is likely to be acquired. We can further approximate the KL-divergence via sampling, where we approximate the expectation operator as

$$D_{KL}(p \,\|\, q) = \mathbb{E}_{p(s)}\big[\ln p(s) - \ln q(s)\big] \approx \sum_{i=1}^{N} p(s_i) \ln p(s_i) - p(s_i) \ln q(s_i), \tag{3}$$

where N is the number of samples in the search domain drawn from a uniform distribution. With this formulation, we can approximate the ergodic coverage metric using (3). In addition to the KL-divergence, we can add a task objective

$$J_{\text{task}} = \int_{t_i}^{t_i + T} \ell\big(x(t), \mu(x(t))\big)\, dt + m\big(x(t_i + T)\big) \tag{4}$$
where t_i is the ith sampling time, T is the time horizon, ℓ(x, u) : R^n × R^m → R is the running cost, m(x) : R^n → R is the terminal cost, and x(t) : R → R^n is
the state of the robot at time t. This additional objective will typically encode some other task, in addition to the KL-divergence objective. By summing the KL-divergence objective and the task objective (4), we can then pose active data acquisition as an optimal control problem subject to the initial approximate dynamic model of the robot. More formally, the objective is written as

$$J = D_{KL}(p \,\|\, q) + \int_{t_i}^{t_i + T} \ell\big(x(t), \mu(x(t))\big)\, dt + m\big(x(t_i + T)\big), \tag{5}$$
where the goal is to generate a control μ⋆(t) that minimizes (5) subject to the approximate dynamics (1). Because we are including the equilibrium policy in the objective, we are able to synthesize controllers that take into account the equilibrium policy μ(x) and the desire to actively seek out measurements. For the rest of the paper, we assume the following:

– We have an initial approximate model ẋ = f(x, u) of the robot.
– We also have an initial policy μ(x) that maintains the robot at equilibrium, for which there is a Lyapunov function.

These two assumptions are reasonable in that robots are often designed around stable states and typically have locally stable policies. The following section uses the fact that we have an initial policy μ(x) in order to synthesize control vectors μ⋆(t) : R → R^m that reduce (5). Specifically, we want to generate a hybrid composition of control actions that enable active data collection and actions that stabilize the robotic system. That way, it is possible to quantify how much the robot is deviating from a stable equilibrium. Thus, we motivate using hybrid systems theory in order to consider how much the objective (5) changes from switching from the equilibrium policy μ(x(t)) to the control μ⋆(t). By quantifying the change, we specify an unconstrained optimization which solves for a control μ⋆ that applies actions that retain Lyapunov attractiveness.
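To ground (2)-(3) numerically, the following minimal sketch approximates q(s) by discretizing the trajectory integral and evaluates the sampled KL-divergence at uniformly drawn points. The trajectory array and the callable p are illustrative assumptions, not code from the paper.

import numpy as np

def q_stats(s, xv_traj, Sigma_inv, dt, Tr):
    """Sigma-approximated time-averaged statistics (2) at a point s,
    up to the normalization constant eta; xv_traj has shape (T, v)."""
    diff = s - xv_traj
    expo = -0.5 * np.einsum('ti,ij,tj->t', diff, Sigma_inv, diff)
    return np.sum(np.exp(expo)) * dt / Tr

def sampled_kl(p, xv_traj, Sigma_inv, dt, Tr, samples):
    """Sample-based KL-divergence approximation (3) over uniformly
    drawn points in the search domain; p is a callable density."""
    q_vals = np.array([q_stats(s, xv_traj, Sigma_inv, dt, Tr)
                       for s in samples]) + 1e-12   # guard against log(0)
    p_vals = np.array([p(s) for s in samples])
    return np.sum(p_vals * (np.log(p_vals + 1e-12) - np.log(q_vals)))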
4 Algorithm
Our algorithm starts by considering the objective defined in (5) subject to the approximate dynamic constraints (1) and the policy μ(x). We want to quantify how sensitive the objective is to switching from the policy μ(x(t)) to the control vector μ⋆(t) for time τ ∈ [t_i, t_i + T] for an infinitesimally small time duration λ. This sensitivity will be a function of μ⋆ and inform us of the most influential time to apply μ⋆(t). Thus, we can use the sensitivity to write an objective whose minimizer is the schedule of control vectors μ⋆(t) that reduces the objective (5).

Proposition 1. The sensitivity of the objective (5) with respect to the duration time λ of switching from the policy μ(x) to the control μ⋆(t) at time τ is

$$\frac{\partial J}{\partial \lambda}\Big|_{t=\tau} = \rho(\tau)^\top (f_2 - f_1), \tag{6}$$
where f_2 = f(x(t), μ⋆(t)) and f_1 = f(x(t), μ(x(t))), and ρ(t) ∈ R^n is the adjoint, or co-state variable, which is the solution of the following differential equation:

$$\dot{\rho} = -\left(\frac{\partial \ell}{\partial x} + \frac{\partial \mu}{\partial x}^\top \frac{\partial \ell}{\partial u} - \frac{\eta}{T_r}\sum_i \frac{p(s_i)}{q(s_i)}\left(\frac{\partial g}{\partial x} + \frac{\partial \mu}{\partial x}^\top \frac{\partial g}{\partial u}\right)\right) - \left(\frac{\partial f}{\partial x} + \frac{\partial f}{\partial u}\frac{\partial \mu}{\partial x}\right)^\top \rho \tag{7}$$

subject to the terminal constraint $\rho(t_i + T) = \frac{\partial}{\partial x} m(x(t_i + T))$.
∂ ∂λ DKL
t=τ
is calculated by
p(si ) η ti +T ∂g ∂x ∂μ ∂g + dt =− q(si ) Tr τ +λ ∂x ∂x ∂u ∂λ i p(si ) η ti +T ∂g ∂μ ∂g + =− Φ(t, τ )dt (f2 − f1 ) (8) q(si ) Tr τ +λ ∂x ∂x ∂u i
where g = g(si | x(t), μ(x(t))) = exp − 12 (si − xs (t)) Σ−1 (si − xs (t)) , and Φ(t, τ ) is the state transition matrix for the integral equation ∂x = (f2 − f1 ) + ∂λ
ti +T
τ +λ
∂f ∂f ∂μ + ∂x ∂u ∂x
∂x dt ∂λ
(9)
where f2 = f (x(τ ), μ (τ )) and f1 = f (x(τ ), μ(x(τ )). ∂ Jtask is given by We can similarly show that the term ∂λ ∂ Jtask ∂λ
ti +T
= τ +λ
t=τ
ti +T
= τ +λ
∂μ ∂ ∂ + ∂x ∂x ∂u ∂μ ∂ ∂ + ∂x ∂x ∂u
∂x dt. ∂λ
Φ(t, τ )dt (f2 − f1 )
(10)
using the same expression in (9). Combining (8) and (10) and taking the limit as λ → 0 gives ∂ J= ∂λ
ti +T
τ
∂ ∂μ ∂ η p(si ) + − ∂x ∂x ∂u Tr i q(si )
∂g ∂μ ∂g + ∂x ∂x ∂u
Φ(t, τ )dt (f2 − f1 ) .
(11) Setting ρ(τ ) =
τ
ti +T
∂μ ∂ η p(si ) ∂ + − ∂x ∂x ∂u Tr i q(si )
∂μ ∂g ∂g + ∂x ∂x ∂u
Φ(t, τ )dt
290
I. Abraham et al.
and from [15] we can show that (11) can be written as ∂ J = ρ(τ ) (f2 − f1 ) ∂λ t=τ where ρ˙ = −
∂ ∂μ ∂ η p(si ) + − ∂x ∂x ∂u Tr i q(si )
∂g ∂μ ∂g + ∂x ∂x ∂u
subject to the terminal condition ρ(ti + T ) =
∂ ∂x m(x(ti
−
∂f ∂f ∂μ + ∂x ∂u ∂x
+ T )).
ρ.
∂ The sensitivity ∂λ J is known as the mode insertion gradient [15]. We can directly compute the mode insertion gradient for any control μ that we choose. However, our goal is to find one such control μ that reduces the objective (5) but still maintains its value near the equilibrium policy μ(x). To solve for this control, we formulate the following objective function ti +T ∂ 1 J + μ (t) − μ(x(t))2R (12) J2 = ∂λ t=τ 2 ti
where R ∈ Rm×m is a positive definite matrix that penalizes the deviation from the policy μ(x). Proposition 2. The control vector that minimizes J2 is given by μ (t) = −R−1 h(x(t)) ρ(t) + μ(x(t)).
(13)
Proof. Taking the derivative of (12) with respect to μ gives ∂ J2 = ∂μ
ti +T
ti ti +T
= ti
∂ ρ(t) (f2 − f1 ) + R(μ (t) − μ(x(t)))dt ∂μ h(x(t)) ρ(t) + R(μ (t) − μ(x(t)))dt.
(14)
Since J2 is convex in μ , we set the expression in (14) to zero and solve for μ which gives us μ (t) = −R−1 h(x(t)) ρ(t) + μ(x(t)) which is a schedule of control values that reduce the objective for time t ∈ [ti , ti + T ].
This controller reduces (5) for λ > 0 that is sufficiently small. The reduction in ∂ (5), ΔJ, by applying μ (τ ) can be approximated as ΔJ ≈ ∂λ Jλ |t=τ . Ensuring ∂ that ∂λ J < 0 is an indicator that the robot is always actively pursuing data and reducing the objective (5). ∂ H = 0 ∀t ∈ [ti , ti + T ], where H is the Corollary 1. Let us assume that ∂μ ∂ control Hamiltonian. Then ∂λ J < 0 ∀μ (t) ∈ U where U is the control space.
Active Area Coverage from Equilibrium
291
Proof. Inserting (13) into (6) gives ∂ J = ρ(t) (f2 − f1 ) ∂λ = ρ(t) (g(x(t)) + h(x(t))μ (t) − g(x(t)) − h(x(t))μ(x(t))) .
(15)
Because of the manner in which we chose to solve for μ (t), g(x) and μ(x(t)) ∂ ∂ H = 0 implies that ∂λ J = 0 and the policy cancel out in (15). In addition, ∂μ ∂ J without μ(x(t)) is not an optimizer of (5). As a result, we can further analyze ∂λ the need to consider the policy μ(x). This gives us the following expression ∂ J = −ρ(t) h(x(t))R−1 h(x(t)) ρ(t) ∂λ which we can rewrite as ∂ J = −h(x(t)) ρ2R−1 < 0. ∂λ
(16)
∂ J is always negative subject to the schedule of Thus, (16) shows us that ∂λ
control vectors (13) and the objective is being reduced when μ (t) is applied.
We automate the switching between μ(x(t)) and μ (t) by choosing a τ and ∂ J is most negative and ΔJ < 0. This is done through the combiλ such that ∂λ nation of choosing τ with a 1-dimensional optimization and solving for λ using a line search until ΔJ < 0 [16,17]. By choosing λ < T we can place a bound on how much our algorithm excites the dynamical system through Lyapunov analysis (Theorem 1). Theorem 1. Assume there exists a Lyapunov function V (x) for (1) such that under the policy μ(x), x(t) is asymptotically stable. That is, V˙ (x) < 0 ∀μ(x), x ∈ B where B = {x ∈ Rn |x < r} for r > 0. Then, given the schedule of control vectors (13) μ (t) ∀t ∈ [τ, τ + λ], V (x(t)) − V (x(t), μ(x(t))) ≤ λβ, where V (x(t), μ(x(t))) is the Lyapunov function subject to the policy μ(x), and −1 h(x(t)) ρ(t). β = supt∈[τ,τ +λ] − ∂V ∂x h(x(t))R Proof. Writing the integral form of the Lyapunov function switching between μ(x(t)) and μ (t) at time τ for a duration of time λ starting at x(0) gives V (x(t)) = V (x(0))
t
+ 0 τ
= V (x(0))
+
V˙ (x(s), μ(x(s)))ds V˙ (x(s), μ(x(s)))ds
0 τ +λ
+ τ
(17)
V˙ (x(s), μ (s))ds +
t
V˙ (x(s), μ(x(s)))ds,
τ +λ
where we explicitly write the dependency on μ(x(t)) in V˙ . Using chain rule, we can write ∂V ∂V ∂V f (x, u) = g(x) + h(x)u. (18) V˙ (x, u) = ∂x ∂x ∂x
292
I. Abraham et al.
By inserting (13) in (18) we can show the following identity: ∂V h(x)μ ∂x ∂V h(x) −R−1 h(x) ρ + μ(x) ∂x ∂V h(x)R−1 h(x) ρ. = V˙ (x, μ(x)) − ∂x
∂V V˙ (x, μ ) = g(x) + ∂x ∂V g(x) + = ∂x
(19)
Using (19) in (17), we can show that V (x(t)) = V (x(0)) + 0
t
V˙ (x(s), μ(x(s)))ds −
= V (x(t), μ(x(t))) −
τ +λ
τ
τ
τ +λ
∂V h(x(s))R−1 h(x(s)) ρ(s)ds ∂x
∂V h(x(s))R−1 h(x(s)) ρ(s)ds ∂x
(20)
t where V (x(t), μ(x(t))) = V (x(0)) + 0 V˙ (x(s), μ(x(s)))ds. −1 Letting the largest value of ∂V h(x(s)) ρ(s) be given by β = ∂x h(x(s))R −1 h(x(t)) ρ(t) > 0, we can approximate (20) as supt∈[τ,τ +λ] − ∂V ∂x h(x(t))R V (x(t)) = V (x(t), μ(x(t)) −
τ +λ
τ
∂V h(x(s))R−1 h(x(s)) ρ(s)ds ∂x
≤ V (x(t), μ(x(t))) + βλ.
(21) (22)
Subtracting both side by V (x(t), μ(x(t))) gives the upper bound on instability V (x(t)) − V (x(t), μ(x(t))) ≤ βλ for the active data collection process.
(23)
By fixing the maximum value of λ, we can provide an upper bound to the change of the Lyapunov function during active data acquisition. Moreover, we can tune our control vector μ (t) using the regularization value R such that as R → ∞, β → 0 and μ (t) → μ(x(t)). With this bound, we can guarantee Lyapunov attractiveness [18], where the system (1) is not Lyapunov stable, but rather there exists a time t such that the system (1) is guaranteed to return to a region of attraction where the system can be guided towards a stable equilibrium state x0 . This property will play an important role in examples in Sect. 5. Definition 2. A dynamical system (1) is Lyapunov attractive if at some time t, the trajectory of the system x(t) ∈ C(t) ⊂ B where C(t) = {x(t) ∈ Rn |V (x) ≤ c, V˙ (x(t)) < 0} and limt→∞ x(t) → x0 such that x0 is an equilibrium state. Theorem 2. Given the schedule of control vectors (13) μ (t) ∀t ∈ [τ, τ + λ], the robotic system governed by the dynamics in (1) is Lyapunov attractive such that limt→∞ x(t, τ, λ) → x0 , where
Active Area Coverage from Equilibrium x(t, τ, λ) = x(0) +
τ 0
f (x(s), μ(x(s))ds +
τ
τ +λ
f (x(s), μ (s)ds +
t
τ +λ
293
f (x(s), μ(x(s))ds,
is the solution to switching between stable and exploratory motions for duration λ starting at time τ . Proof. Assume there exists a Lyapunov function such that V˙ (x) < 0 under the policy μ(x). Moreover, assume that subject to the control vector μ (t), the trajectory x(τ +λ) ∈ C(τ +λ) ⊂ B where C(t) = {x(t) ∈ Rn |V (x) ≤ c, V˙ (x(t), μ(x(t)) < 0} where c > 0. Using Theorem 1, the integral form of the Lyapunov function (17), and the identity (19), we can write V (x(t)) = V (x(0)) + −
τ
t
V˙ (x(s), μ(x(s)))ds
(24)
0 τ +λ
∂V h(x(s))R−1 h(x(s)) ρ(s)ds ≤ V (x(0)) − γt + βλ, ∂x
(25)
where −γ = sups∈[0,t] V˙ (x(s), μ(x(s))) < 0. Since λ is fixed and β can be tuned by the matrix weight R, we can choose a t such that γt βλ. Thus, limt→∞ V (x(t)) → V (x0 ) and limt→∞ x(t, τ, λ) → x0 , implying Lyapunov attractiveness, where V (x0 ) is the minimum of the Lyapunov function at the
equilibrium state x0 . Asymptotic attractiveness shows us that the robot will return to a region where V (x) will return to a minimum under policy μ(x), allowing the robot to actively explore and collect data safely. Moreover, we can choose the value of λ and μ in automating the active data acquisition such that attractiveness always holds, giving us an algorithm that is safe for active data collection. All that is left is to define a spatial distribution that actively selects which measurements are more informative to the learning task. Measure of Data Importance for Model Learning. Our goal is to provide a method that is general to any form of learning that requires a robot to actively seek out measurements through action. This may include area mapping or learning the dynamics of the robot. Thus, we use measures that allow the robot to quantify where in the search domain there exists useful data that needs to be collected. While there exists many measures that can select important data subject to a learning task, we use a measure of linear independence [7,19,20]. This measure is often used in sparse Gaussian processes [7,20] where a data v set D = {xi , yi }M i=1 is comprised of M input measurements xi ∈ R and M
output measurements y_i ∈ R^c such that each data point maximizes the measure of linear independence. We use this measure of independence, also known as a measure of importance, to create a distribution with which the robot will provide area coverage in the search domain for active data collection. As illustrated in [7], this is done by evaluating a new measurement x_{M+1}, y_{M+1} against the existing data points in D given the structure of the model that is being learned.

Definition 3. The importance measure δ ∈ R^+ for a new measurement pair {x_{M+1}, y_{M+1}} is given by

$$\delta = k(x_{M+1}, x_{M+1}) - \boldsymbol{k}^\top \boldsymbol{a}, \tag{26}$$

which is the solution to $\delta = \big\|\sum_{i=1}^{M} a_i \phi(x_i) - \phi(x_{M+1})\big\|^2$, where φ(x) are the basis functions (also known as feature vectors)¹, a_i is the coefficient of linear dependence, the matrix K ∈ R^{M×M} is known as the kernel matrix with elements K_{i,j} = k(x_i, x_j) such that k : R^v × R^v → R is the kernel function given by the inner product k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩, $\boldsymbol{k} = [k(x_1, x_{M+1}), k(x_2, x_{M+1}), \dots, k(x_M, x_{M+1})]^\top$, and $\boldsymbol{a} = K^{-1}\boldsymbol{k}$.

The value δ provides a measure of how well the point x_{M+1} can be represented given the existing data set and the structure of the model being learned. Note that this measure will be computationally intractable for very large M. Instead, other measures like the expected information density derived from the Fisher information matrix [12,21] can be used if the learning task has a model that is parameterized by a set of parameters θ. Since δ > 0, we define an importance distribution which the robot will use to generate area coverage.

Definition 4. The importance distribution is

$$p(s) = \frac{1}{\eta}\big(k(s, s) - \boldsymbol{k}(s)^\top \boldsymbol{a}(s)\big) \tag{27}$$

where $\eta = \int_{X^v} k(s, s) - \boldsymbol{k}(s)^\top \boldsymbol{a}(s)\, ds$, and k, a are functions of points s ∈ X^v.

Note that p(s) will change as D is augmented or pruned. If at any time δ > δ_i for some i ∈ [1, ..., M], we remove the ith point with the lowest δ value and add in the new data point. We provide an outline of our method in Algorithm 1 for online data acquisition. The following section evaluates our method on various simulated environments.
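As an illustration of Definition 3, the sketch below computes δ for a candidate point with an RBF kernel and applies the prune-and-add rule just described; the kernel choice and helper names are assumptions for the example, not the paper's code.

import numpy as np

def rbf(x1, x2, ell=1.0):
    """Radial basis function kernel k(x1, x2)."""
    d = np.asarray(x1, float) - np.asarray(x2, float)
    return float(np.exp(-0.5 * np.dot(d, d) / ell**2))

def importance(x_new, X, ell=1.0):
    """Importance measure (26): delta = k(x, x) - k^T K^{-1} k."""
    K = np.array([[rbf(a, b, ell) for b in X] for a in X])
    k_vec = np.array([rbf(a, x_new, ell) for a in X])
    a_coef = np.linalg.solve(K, k_vec)            # a = K^{-1} k
    return rbf(x_new, x_new, ell) - k_vec @ a_coef

def maybe_insert(x_new, X, deltas, ell=1.0):
    """Replace the least important stored point when the candidate is
    more informative than some existing point (delta > delta_i)."""
    d_new = importance(x_new, X, ell)
    if d_new > min(deltas):
        i = int(np.argmin(deltas))
        X[i], deltas[i] = x_new, d_new
    return X, deltas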
¹ This feature vector can be anything from a Fourier set of basis functions to a neural network. In addition, we can parameterize the functions φ(x) = φ(x, θ) and have the functions change over time.
Algorithm 1. Active Data Acquisition from Equilibrium
1: initialize: local dynamics model, initial condition x(t_0), initial equilibrium policy μ(x), learning task model structure φ(x).
2: for i = 0, . . . , ∞ do
3:   simulate x(t) with μ(x(t)) from x(t_i) using the dynamics model f(x, u)
4:   calculate ρ(t) and ∂J/∂λ
5:   compute the control μ⋆(t) for t ∈ [t_i, t_i + T]
6:   choose τ, λ that minimize ∂J/∂λ
7:   apply μ⋆(τ) if t ∈ [τ, τ + λ], else apply μ(x(t)) to the robot
8:   sample the state x(t_{i+1}) and measurement y(t_{i+1})
9:   verify the importance δ and update p(s) if δ > δ_i for any i ∈ [1, . . . , M]
10: end for
5 Simulated Examples
In this section, we illustrate examples of Algorithm 1 for different examples that may be encountered in robotics. Figure 1 depicts three robotic systems on which we base our examples. In the first example, we use a cart double pendulum for use in area coverage for shape estimation. In the second and third example, we use a 22 dimensional quadrotor [22] and a 26 dimensional half-cheetah model from Roboschool [23] for learning a dynamics model of the robotic systems by exploring in the state-space. For implementation details, including parameters used, we refer the reader to the appendix.
Fig. 1. Simulated experimental environments: (a) cart double pendulum, (b) quadcopter, and (c) half-cheetah. The results for each system may be seen in the accompanying multimedia supplement.
Shape Estimation While Stabilizing Cart Double Pendulum. Our first example demonstrates the functionality of our algorithm for estimating a sinusoidal shape while simultaneously balancing a cart double pendulum in its upright position. The purpose of this example is to show that our method can synthesize actions that ensures the cart double pendulum is maintained upright
while actively collecting data for estimating the shape. This example also serves the purpose of illustrating that our method can safely automate choosing when to stabilize and when to explore for data using approximate linear models of the robot dynamics and stabilizing policies derived from the approximate models.
Fig. 2. Time series snap-shots of cart double pendulum actively sampling and estimating the shape underneath. The uncertainty (dashed gray lines) calculated from the collected data set drives the exploratory motion of the cart double pendulum while our method ensures that the cart double pendulum is maintained in its upright equilibrium state.
The measurements of the height of the sinusoidal shape are collected through the x position of the cart (illustrated in Fig. 2 as the magenta crosshair underneath the cart). A Gaussian process with a radial basis function (RBF) kernel [7] is then used to estimate the function and provide the distribution used for exploration. The underlying importance distribution (27) is updated as the data set is pruned to include new informative measurements. As a result of Algorithm 1, the robot will spend time where there is a high probability of acquiring informative data. This results in the shape reconstruction shown in Fig. 2 using a limited fixed set of data (M = 50). We analyze our algorithm by observing a candidate Lyapunov function (energy). Figure 3 depicts the value of the Lyapunov function over the time window of the cart double pendulum collecting data for estimating the shape. The control vector μ⋆(t) over the application time t ∈ [τ, τ + λ] increases the overall energy in the system (due to exploration). Since we include a regularization term R that ensures μ⋆ does not deviate too far from the equilibrium policy μ(x), the cart double pendulum is able to stabilize itself, eventually returning to an equilibrium state and ensuring stability, illustrating the Lyapunov attractiveness property proven in Theorem 2.
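For readers unfamiliar with the estimation step, a minimal Gaussian process regression over the cart's sampled x positions might look like the following; the hyperparameters and noise level are illustrative assumptions, not values from the experiment.

import numpy as np

def gp_shape_estimate(X, y, X_query, ell=0.5, sig_n=0.05):
    """GP posterior mean and variance of the shape height with an RBF
    kernel, from height measurements y taken at cart positions X."""
    def k(A, B):
        d = A[:, None] - B[None, :]
        return np.exp(-0.5 * d**2 / ell**2)
    K = k(X, X) + sig_n**2 * np.eye(len(X))
    Ks = k(X_query, X)
    mean = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mean, var   # high variance marks where to sample next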
Learning Dynamics of Quadrotor. Our next example illustrates active data acquisition in the state-space of a 22-degree-of-freedom quadrotor vehicle shown in Fig. 1(b). The results are averaged across 30 trials with random initial conditions sampled uniformly in the body angular and linear velocities ω, v ∼ U[−0.01, 0.01], where U is a uniform distribution. The goal for this quadrotor is to maintain hovering height while collecting data in order to learn the dynamics model f(x, u). In this example, a linear approximation of the dynamics centered at the hovering height is used as the local dynamics approximation on which Algorithm 1 is based. We then generate an LQR controller with the approximate dynamics, which we use as the equilibrium policy. The input data we collect are the state x(ti) = xi and control u(ti) = ui, and the output data are (xi+1 − xi)/(ti+1 − ti), which approximates the function Δx/Δt ≈ ẋ = f(x, u) [24]. An incremental sparse Gaussian process [7] with a radial basis function kernel is used to generate a learned model of the dynamics using a data set of M = 80 and to specify the importance measure (3).

Fig. 3. Lyapunov function for the cart double pendulum with upright equilibrium. The red line indicates when the active exploration control is applied. The Lyapunov attractiveness property is illustrated through automatic switching of the exploration process.

Figure 4(a) and (b) illustrate the modeling error and the minimum importance value within the data set using our method and the equilibrium policy with uniformly added noise at 33% of the saturation limit. Our method sequences and automates the process of choosing
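The input-output construction described here (state-control pairs mapped to a finite-difference approximation of ẋ) can be written compactly. The snippet below is a minimal sketch with synthetic logs and assumed dimensions.

```python
import numpy as np

# Build (input, output) pairs for dynamics learning from a logged
# trajectory, following Delta x / Delta t ~ xdot = f(x, u).
T, nx, nu = 100, 12, 4                             # assumed sizes
t = np.cumsum(0.01 + 0.001 * np.random.rand(T))    # (possibly uneven) times
xs = np.random.randn(T, nx)                        # logged states x(t_i)
us = np.random.randn(T, nu)                        # logged controls u(t_i)

inputs = np.hstack([xs[:-1], us[:-1]])             # (x_i, u_i)
outputs = (xs[1:] - xs[:-1]) / (t[1:] - t[:-1])[:, None]  # ~ xdot targets
```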
Fig. 4. (a) Average L2 error of the learned dynamics model evaluated on 10 uniformly distributed random samples between [−1, 1] in the state and action space. (b) Minimum importance measure from the collected data set using our method compared against injected noise. (c) Minimum value of the mode insertion gradient evaluated at the chosen μ⋆(τ) value. (d) Calculated duration time of control μ⋆(τ).
when it is best to explore and when to stabilize by taking into account the approximate dynamics and the equilibrium policy. As a result, a robot is capable of acquiring informative data that improves the prediction of the nonlinear dynamic model of the quadrotor. In contrast, adding noise to the control input (often referred to as "motor babble" [24]) does not have temporal dependencies. That is, each new sample does not have information from the previous samples and cannot effectively explore the state-space. As the robot continues to explore, the value of the mode insertion gradient (6) decreases, as does the duration time λ, as shown in Fig. 4(c) and (d). This implies that the robot is sufficiently reducing the objective for area coverage, and the equilibrium policy begins to take over to stabilize the robot. This is a result of taking into account the local stability of the robotic system while generating exploratory actions.

Learning to Gallop. In this last example, we consider applications of Algorithm 1 for systems with dynamic models and policies that are learned. We use the half-cheetah from the Roboschool environment [23] for the task of learning a dynamics model in order to control the robot to gallop forward. We first learn a simple standing upright policy using the augmented random search (ARS) method [25]. In that process, we collect the state and action data to compute a linear approximation using least-squares for the local dynamics. Then Algorithm 1 is applied using an incremental sparse Gaussian process with an RBF kernel to generate a dynamics model from data as well as provide the importance measure using a set of M = 40 data points. The input-output data structure maps the input (x(ti), u(ti)) to the change in state Δx/Δt. Our running cost ℓ(x, u) is set to maintain the half-cheetah upright. After the Gaussian process model is learned, we use the generated model in the prediction of the forward dynamics as a replacement for the initial dynamics model.

Fig. 5. (a) Comparison of the minimum importance measure of the data set for the half-cheetah example from a stable standing policy with added 20% saturation limit noise and our approach for active data acquisition. (b) Integrated forward velocity values from using the learned half-cheetah dynamics model in a model-predictive control setting with standard deviation illustrated for 5 trials. Our method is shown to collect data which provides a better dynamic model and a positive net forward velocity with reduced variance.

As shown in Fig. 5, our method collects informative data while respecting the standing upright policy when compared to noisy inputs. We compared the two learned models using our controller with DKL = 0 and the running cost ℓ(x, u) set to maximize the forward velocity of the half-cheetah. We show those results
in Fig. 5 over 5 runs of our algorithm at different initial states. Our method provides a learned model that has overall positive integrated velocity (forward movement). While our method is more complex than simply adding noise, it provides stability guarantees based on known policies in order to explore and collect data.
6
Conclusion
Algorithm 1 enables robots to actively seek out informative data based on the learning task while maintaining stability using equilibrium policies. Our method generates area coverage using a KL-divergence measure in order to direct exploration toward informative regions. Moreover, by using a hybrid systems theory approach to generating area coverage, we were able to incorporate equilibrium policies in order to provide stability guarantees even when the model of the robot dynamics is only locally known. Last, we provide examples that illustrate the benefits of our approach for active data acquisition for learning tasks.
References

1. Kormushev, P., Calinon, S., Caldwell, D.G.: Robot motor skill coordination with EM-based reinforcement learning. In: International Conference on Intelligent Robots and Systems, pp. 3232–3237 (2010)
2. Reinhart, R.F.: Autonomous exploration of motor skills by skill babbling. Auton. Robot. 41(7), 1521–1537 (2017)
3. McKinnon, C.D., Schoellig, A.P.: Learning multimodal models for robot dynamics online with a mixture of Gaussian process experts. In: International Conference on Robotics and Automation, pp. 322–328 (2017)
4. Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., Bohez, S., Vanhoucke, V.: Sim-to-real: learning agile locomotion for quadruped robots. In: Proceedings of Robotics: Science and Systems (2018). https://doi.org/10.15607/RSS.2018.XIV.010
5. Marco, A., Berkenkamp, F., Hennig, P., Schoellig, A.P., Krause, A., Schaal, S., Trimpe, S.: Virtual vs. real: trading off simulations and physical experiments in reinforcement learning with Bayesian optimization. In: International Conference on Robotics and Automation, pp. 1557–1563 (2017)
6. Schwager, M., Dames, P., Rus, D., Kumar, V.: A multi-robot control policy for information gathering in the presence of unknown hazards. In: Robotics Research, pp. 455–472 (2017)
7. Nguyen-Tuong, D., Peters, J.: Incremental online sparsification for model learning in real-time robot control. Neurocomputing 74(11), 1859–1867 (2011)
8. Ucinski, D.: Optimal Measurement Methods for Distributed Parameter System Identification. CRC Press (2004)
9. Berkenkamp, F., Turchetta, M., Schoellig, A., Krause, A.: Safe model-based reinforcement learning with stability guarantees. In: Advances in Neural Information Processing Systems, pp. 908–918 (2017)
10. Lin, T.-C., Liu, Y.-C.: Direct learning coverage control based on expectation maximization in wireless sensor and robot network. In: Conference on Control Technology and Applications, pp. 1784–1790 (2017)
11. Bourgault, F., Makarenko, A.A., Williams, S.B., Grocholsky, B., Durrant-Whyte, H.F.: Information based adaptive robotic exploration. In: International Conference on Intelligent Robots and Systems, vol. 1, pp. 540–545 (2002)
12. Miller, L.M., Silverman, Y., MacIver, M.A., Murphey, T.D.: Ergodic exploration of distributed information. IEEE Trans. Robot. 32(1), 36–52 (2016)
13. Ayvali, E., Salman, H., Choset, H.: Ergodic coverage in constrained environments using stochastic trajectory optimization. In: International Conference on Intelligent Robots and Systems, pp. 5204–5210 (2017)
14. Arnold, R., Wellerding, A.: On the Sobolev distance of convex bodies. Aequationes Mathematicae 44(1), 72–83 (1992)
15. Axelsson, H., Wardi, Y., Egerstedt, M., Verriest, E.I.: Gradient descent approach to optimal mode scheduling in hybrid dynamical systems. J. Optim. Theory Appl. 136(2), 167–186 (2008)
16. Mavrommati, A., Tzorakoleftherakis, E., Abraham, I., Murphey, T.D.: Real-time area coverage and target localization using receding-horizon ergodic exploration. IEEE Trans. Robot. 34(1), 62–80 (2018)
17. Abraham, I., Murphey, T.D.: Decentralized ergodic control: distribution-driven sensing and exploration for multiagent systems. IEEE Robot. Autom. Lett. 3(4), 2987–2994 (2018)
18. Polyakov, A., Fridman, L.: Stability notions and Lyapunov functions for sliding mode control systems. J. Franklin Inst. 351(4), 1831–1865 (2014)
19. Schölkopf, B., Mika, S., Burges, C.J., Knirsch, P., Müller, K.R., Rätsch, G., Smola, A.J.: Input space versus feature space in kernel-based methods. IEEE Trans. Neural Netw. 10(5), 1000–1017 (1999)
20. Yan, X., Indelman, V., Boots, B.: Incremental sparse GP regression for continuous-time trajectory estimation and mapping. Robot. Auton. Syst. 87, 120–132 (2017)
21. Emery, A.F., Nenarokomov, A.V.: Optimal experiment design. Meas. Sci. Technol. 9(6), 864 (1998)
22. Fan, T., Murphey, T.: Online feedback control for input-saturated robotic systems on Lie groups. In: Proceedings of Robotics: Science and Systems, June 2016. https://doi.org/10.15607/RSS.2016.XII.027
23. Klimov, O., Schulman, J.: Roboschool (2017). https://github.com/openai/roboschool
24. Nagabandi, A., Kahn, G., Fearing, R.S., Levine, S.: Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596 (2017)
25. Mania, H., Guy, A., Recht, B.: Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055 (2018)
Learning a Spatial Field with Gaussian Process Regression in Minimum Time Varun Suryan(B) and Pratap Tokekar Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, USA {suryan,tokekar}@vt.edu
Abstract. We study an informative path planning problem where the goal is to minimize the time required to learn a spatial field using Gaussian Process (GP) regression. Specifically, given parameters 0 < ε, δ < 1, our goal is to ensure that the predicted value at all points in an environment lies within ±ε of the true value with probability at least δ. We study two versions of the problem. In the sensor placement version, the objective is to minimize the number of sensors placed. In the mobile sensing version, the objective is to minimize the total travel time required to visit the sensing locations. The total time is given by the time spent obtaining measurements as well as the time to travel between measurement locations. By exploiting the smoothness properties of GP regression, we present constant-factor approximation algorithms for both problems that make accurate predictions at each point. Our algorithm is deterministic and non-adaptive, and is based on the Traveling Salesperson Problem. In addition to theoretical results, we also compare the empirical performance using a real-world dataset with other baseline strategies.
1
Introduction
The problem of planning adaptive, informative paths for mobile sensors to estimate a spatial field1 has received a lot of attention in the community recently [1–6]. The underlying field is typically modeled as a Gaussian Process (GP) [7] and estimated through GP regression [2] or GP classification [6]. An information-theoretic measure such as mutual information, entropy, or variance is used as the criterion to drive the robot to sampling locations [8]. Optimizing such measures leads to adaptive sampling of the field. Typically an approximation guarantee on the information-theoretic measure can be given, especially if the measure used is a submodular one [1]. Such measures are indirect since they are based on the variance of the prediction but do not consider the accuracy of the predicted mean. Unlike these works, we study how to ensure that the predicted
In this paper, a spatial field is a function, f (x), x ∈ U , that is defined over a spatial domain, U ⊂ R2 .
This material is based upon work supported by the National Science Foundation under Grant number 1637915 and NIFA grant 2018-67007-28380.
mean2 is accurate, and we present a constant-factor approximation algorithm if the hyper-parameters of the kernel do not change. We use the notion of chance-constrained optimization [9] to find the optimal trajectory that ensures that the probability of incorrect estimation is less than a user-specified threshold. Let f(x) be the true function value and μ̂(x) be the estimated value at a point x. The estimated value is said to be chance-constrained if Pr{|μ̂(x) − f(x)| ≤ ε} > δ for all x. Here, 0 < δ, ε < 1 are the desired confidence and accuracy parameters. Note that this is a stronger requirement in the sense that the prediction at all points must satisfy the condition. This is in contrast to aggregate measures such as entropy or mutual information that consider the sum or average measure over all the points. Furthermore, we specifically consider the mean predicted value rather than just minimizing the uncertainty in prediction. We study two related problems, that of finding measurement locations to place sensors and planning a tour for a single mobile sensor to ensure chance-constrained GP regression. The objective functions are to minimize the number of measurement locations and the total tour time, respectively. The total time is given by the measurement time and the travel time between measurement locations. The measurement time depends on the number of measurements taken at each location as well as the time to take a single measurement. Depending on the sensor, the measurement time can be zero (e.g., cameras) or non-zero (e.g., soil probes measuring organic content). We show that a non-adaptive algorithm suffices to solve the problem and yields a polynomial-time constant-factor approximation to the optimal. While other algorithms have been proposed in the past for estimating spatial fields, this is the first result that gives guarantees on the total travel time for ensuring predictive accuracy at all points. The rest of the paper is organized as follows. We first present the work related to our problem (Sect. 2). Next, we formally define the actual problems we study in this paper (Sect. 3). The two algorithms, DiskCover and DiskCoverTour, along with their theoretical analysis, are given in Sect. 4. We compare the performance of the algorithms with baseline strategies on a real-world dataset [6] in Sect. 5.
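For a Gaussian prediction with an unbiased mean, the chance constraint above reduces to an error-function identity, which the following sketch evaluates for illustrative numbers.

```python
from math import erf, sqrt

def prob_within_eps(sigma_hat, eps):
    # Pr(|mu_hat - f| <= eps) for a Gaussian prediction, assuming the
    # posterior mean is unbiased: the erf identity used throughout.
    return erf(eps / (sqrt(2) * sigma_hat))

# Example: posterior std 0.8 and tolerance eps = 2
print(prob_within_eps(0.8, 2.0))   # ~0.988 -> satisfies delta = 0.8
print(prob_within_eps(2.0, 2.0))   # ~0.683 -> violates delta = 0.8
```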
2
Related Work
We begin by reviewing the related work in the areas of sensor placement, where the goal is to cover a given environment using sensors placed at fixed locations, and mobile sensing, where sensors are allowed to move and collect measurements from different locations.

2.1 Stationary Sensor Placement
Sensor placement algorithms can be classified broadly as geometric and model-based approaches [1], based upon the assumptions made about the environment.
We use predicted mean and estimated mean interchangeably since the function is independent of time.
In geometric approaches, typically the goal is to find a set of measurement locations which cover the environment [10,11]. In model-based approaches, one takes a model of the environment (generally a probabilistic one) and finds measurement locations to optimize an objective function of that model. Information-theoretic measures, notably entropy and mutual information, are frequently used as the optimization criteria [12,13]. A general approach is to greedily place sensors at locations where the uncertainty about the phenomena is highest. The entropy criterion tends to place sensors along the boundaries of the environment [14], which can be wasteful. An alternative criterion of mutual information seeks to find placements that are most informative about locations where no sensors are placed [1,12]. This optimization criterion directly measures the effect of sensor placements on the posterior uncertainty of the GP. In their seminal work, Krause et al. [1] studied the combinatorial optimization problem of placing sensors to maximize mutual information. The problem is NP-complete. They showed conditions under which mutual information is monotone and submodular [15] and exploited this result to prove that a greedy placement algorithm yields a constant-factor approximation in polynomial time. Unfortunately, entropy and mutual information cannot directly make any guarantees on the predictive accuracy at specific points in the environment. Instead, we design a sensor placement algorithm that guarantees predictive accuracy at every point in the environment.

2.2 Mobile Sensing
The problem of designing optimal trajectories and the related problem of selecting measurement locations has recently received considerable attention. Low et al. [16] presented a control law to minimize the probability of misclassification in a field estimated by GP regression. The authors enforced measurements to be taken continuously, and forced the sensors to only move along a four-connected grid. Song et al. [17] presented an algorithm to localize multiple radio sources using a mobile robot. They presented upper bounds on the time required to localize the sources up to a desired probability. Otte et al. [18] studied the problem of navigating to a global goal, while foraging for interesting locations along the way. They analyzed two greedy strategies to bound the total distance traveled and verified the bounds through simulations. Krause et al. proposed mutual information as a measure of uncertainty [1]. An algorithm to place sensing locations was given which can closely approximate the optimal increase in mutual information. The work was extended to mobile sensor routing in [19], and multiple robots in [2]. Tan et al. developed an adaptive sampling strategy using receding-horizon cross-entropy trajectory optimization on a GP [20]. Tokekar et al. [6] presented a constant-factor approximation algorithm for the case of accurately classifying each point in a spatial field. The first step in the algorithm is to determine potentially misclassified points and then to find a tour visiting neighborhoods of each potentially misclassified point. In this paper, we study a regression version of the problem where every point is of interest.
304
V. Suryan and P. Tokekar
We exploit the properties of GP and squared-exponential kernel to find a constant-factor approximation algorithm.
3
Problem Formulation
In this section, we formally define the terminology and the two problems studied. We assume that the environment is a compact domain U ⊂ R². We make the following assumptions about the spatial function f_x ≡ f(x):

Assumption 1 (Smoothness). The true function is smooth in the sense of Lipschitz [21], i.e., ∀x_i, x_j ∈ U: |f_{x_i} − f_{x_j}| ≤ L‖x_i − x_j‖₂, where L is the Lipschitz smoothness constant.

Assumption 2 (Kernel). The GP regression uses a squared-exponential kernel [7]. The hyperparameters of the kernel (length scale and signal variance) are known a priori.

Assumption 3 (Bound on Measurements). Optimal algorithms for both problems satisfy the chance constraints using a finite number of measurements, no more than N.

Let X denote the set of measurement locations within U produced by an algorithm. In the placement problem, our goal is to minimize the cardinality of X, whereas in the path planning problem the goal is to minimize the time required to visit and obtain measurements at X. The two problems are formally defined below.

Problem 1 (Placement). Given Assumptions 1–3, find the minimum number of measurement locations in U such that the posterior GP prediction for any point in U is within ±ε of the actual value with probability at least δ, i.e.,

$$\text{minimize} \quad |X| \quad \text{(the number of measurement locations)}$$
$$\text{subject to} \quad \Pr\{|\hat{\mu}(x) - f(x)| \leq \epsilon\} \geq \delta, \quad \forall x \in U,$$

where f(x) is the actual function value that is to be estimated at a point x ∈ U, and μ̂(x) is the predicted value using measurements obtained at locations X.

Problem 2 (Mobile). Given Assumptions 1–3, find a minimum-time trajectory for a mobile sensor that obtains a finite set of measurements at one or more locations in U, such that the posterior GP prediction for any point in U is within ±ε of the true value with probability at least δ, i.e.,

$$\text{minimize} \quad \mathrm{len}(\tau) + \eta\,\beta(X)$$
$$\text{subject to} \quad \Pr\{|\hat{\mu}(x) - f(x)| \leq \epsilon\} \geq \delta, \quad \forall x \in U.$$
τ denotes the tour of the robot. Assume that the robot travels at unit speed, obtains one measurement in η units of time and obtains β(X) total measurements at the measurement locations X. Here, we use the function β(X) to take into account the fact that the robot may obtain more than one measurement at a specific location. Therefore, the number of measurements can be more than |X|.
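The objective of Problem 2 can be evaluated directly for a candidate tour. The sketch below assumes a closed tour traversed at unit speed and a hypothetical per-location measurement count.

```python
import numpy as np

def total_time(tour, measurements_per_location, eta):
    # len(tau): closed-tour length at unit speed, plus measurement time
    # eta * beta(X). `tour` is an ordered array of 2-D locations.
    legs = np.linalg.norm(np.diff(tour, axis=0, append=tour[:1]), axis=1)
    return legs.sum() + eta * sum(measurements_per_location)

tour = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 5.0]])
print(total_time(tour, [3, 3, 5], eta=2.0))  # travel + 2.0 * 11 samples
```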
4
Algorithm
We present the details of our algorithm for both problems. The solution to Problem 1 is a subset of the solution to Problem 2. Our algorithm consists of two main stages: (1) finding a finite number of measurement locations for the robot; and (2) finding a tour to visit all the sensing locations. We exploit the Lipschitz smoothness property to find the measurement locations. By knowing the value at a certain point within some tolerance, we can predict values at nearby points, albeit up to a larger tolerance. Our algorithm can be generalized to any type of sensor as long as it satisfies the assumptions listed previously. The algorithms can also be generalized to higher dimensional environments, even though we illustrate using 2D examples. Before we present the detailed algorithms, we review some relevant background and useful properties of GPs.

4.1 Useful Properties of GPs
In GP regression, the posterior variance at a particular location x conditioned on a set of observations at locations X = {x₁, . . . , xₙ} does not depend on the data gathered but only on the locations where the data is gathered. The posterior variance at any location x is given by

$$\hat{\sigma}^2_{x|X} = k(x, x) - k(x, X)\left[K(X, X) + \omega^2 I\right]^{-1} k(X, x), \tag{1}$$
where k is the covariance (kernel) function with K_{pq} = k(x_p, x_q), and ω² is the variance of the additive i.i.d. Gaussian measurement noise [7]. The kernel is a function that measures the similarity between two measurement locations [7]. For our analysis and experiments, we use a squared-exponential kernel with signal variance and length scale equal to σ₀² and l, respectively. Since the posterior variance is a function of only the sensing locations, the posterior variance for all points in the environment can be computed a priori, if the measurement locations are known, even without making any observations. In many implementations [8,19,22], the hyperparameters for the kernel k are tuned online as more data is gathered. As such, the hyperparameters may change with the observed data and the posterior variance will depend on the data observed, which may require adaptive planning. We assume that the hyperparameters are estimated a priori. Nevertheless, one can perform sensitivity analysis of the presented algorithms by varying the hyperparameters [23,24]. The objectives for the two problems depend not only on the posterior variance but also on the posterior mean. The posterior mean μ̂_{x|X} at a location x is given by a weighted linear combination of the observed data plus a bias term,

$$\hat{\mu}_{x|X} = \mu(x) + k(x, X)\left[K(X, X) + \omega^2 I\right]^{-1} y, \tag{2}$$
where y = {y1 , . . . , yn } denotes the observations at locations X = {x1 , . . . , xn }.
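A direct implementation of Eqs. 1 and 2 is a few lines of linear algebra. The sketch below uses an assumed kernel (squared-exponential with unit hyperparameters) and toy data.

```python
import numpy as np

def gp_posterior(x_test, X, y, k, omega2, mu_prior=0.0):
    # Eqs. 1 and 2: the variance depends only on the locations X; the
    # mean is a weighted combination of the observations y plus a bias.
    Kxx = k(X, X) + omega2 * np.eye(len(X))
    kxX = k(x_test, X)
    alpha = np.linalg.solve(Kxx, kxX.T)
    mean = mu_prior + kxX @ np.linalg.solve(Kxx, y - mu_prior)
    var = k(x_test, x_test).diagonal() - np.sum(kxX * alpha.T, axis=1)
    return mean, var

k = lambda a, b: np.exp(-(a[:, None] - b[None, :])**2 / 2)  # SE, sig0=l=1
X, y = np.array([0.0, 1.0, 2.0]), np.array([0.1, 0.9, 0.2])
m, v = gp_posterior(np.linspace(0, 2, 5), X, y, k, omega2=0.25)
```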
4.2 Necessary and Sufficient Conditions
We start by deriving necessary conditions on how far a test location can be from the nearest measurement location. A test location corresponds to a point in the environment where we would like to make a prediction. Next, we prove a sufficient condition: if every point in the environment where no measurement is obtained (test location) is sufficiently close to a measurement location, then we can make accurate predictions at each point.

Lemma 1 (Necessary Condition). For any test location x, if the nearest measurement location is at a distance r_max away, and

$$r_{\max} \geq l\sqrt{-\log\left(\frac{\sigma_0^2+\omega^2}{N\sigma_0^2}\left(1-\left(\frac{\epsilon}{\sqrt{2}\,\sigma_0\,\mathrm{erf}^{-1}(\delta)}\right)^{2}\right)\right)}, \tag{3}$$

then it is not possible to make ε-accurate predictions at that test location with desired probability δ.

Proof. The GP prediction at x is a normal random variable with mean μ̂_{x|X} (Eq. 2) and variance σ̂²_{x|X} (Eq. 1). Even if the posterior mean were exactly equal to the true value of the function, the posterior variance has to be small enough to ensure chance-constrained regression, i.e., the total probability mass in the ±ε-band around the mean μ̂_{x|X} has to be more than δ. A lower bound on σ̂²_{x|X} can be obtained using the fact that independent measurements result in the highest decrease in uncertainty at x [25], i.e.,

$$\hat{\sigma}^2_{x|X} \geq \hat{\sigma}^2_{x|X_I}, \tag{4}$$

where σ̂²_{x|X_I} is the posterior variance at x assuming all the measurements were independent. The posterior variance at a test location x, assuming all n measurements are independent, is given by

$$\hat{\sigma}^2_{x|X_I} = \sigma_0^2 - \begin{bmatrix} k(x, x_1) & \cdots & k(x, x_n) \end{bmatrix} \begin{bmatrix} \sigma_0^2+\omega^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_0^2+\omega^2 \end{bmatrix}^{-1} \begin{bmatrix} k(x, x_1) \\ \vdots \\ k(x, x_n) \end{bmatrix}. \tag{5}$$

The non-diagonal entries of the covariance matrix in Eq. 5 are zero (independence among the measurement locations). Assuming the nearest measurement location to be r_max away from the test location x, we have

$$k(x, x_i) \leq \sigma_0^2 \exp\left(-\frac{r_{\max}^2}{2l^2}\right), \quad \forall x_i \in X, \tag{6}$$

which, combined with Eqs. 4 and 5, results in

$$\hat{\sigma}^2_{x|X} \geq \hat{\sigma}^2_{x|X_I} \geq \sigma_0^2\left(1 - \exp\left(-\frac{r_{\max}^2}{l^2}\right)\frac{n\sigma_0^2}{\sigma_0^2+\omega^2}\right). \tag{7}$$
Now, by lower bounding the probability mass, we have,
$$\int_{\hat{\mu}_{x|X}-\epsilon}^{\hat{\mu}_{x|X}+\epsilon} \mathcal{N}\left(\hat{\mu}_{x|X}, \hat{\sigma}^2_{x|X}\right) \geq \delta \implies \frac{\epsilon}{\sqrt{2}\,\mathrm{erf}^{-1}(\delta)} \geq \hat{\sigma}_{x|X}, \tag{8}$$
where erf⁻¹ is the inverse error function [26]. Equation 8 gives us a necessary bound on the posterior variance at x. Combining Eq. 8 with Eq. 7 and using the fact that N > n, if r_max satisfies the inequality in Lemma 1, then it is not possible to satisfy Eq. 8. Hence, it is impossible to ensure the chance constraints for the GP regression. That is, if the nearest measurement location from a test location is more than r_max away, then it is impossible to make ε-accurate predictions at the test location with probability δ.
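The bound in Lemma 1 can be evaluated numerically. The function below implements Eq. 3 as reconstructed above (the equation is reassembled from the proof, so treat the exact form as our reading of it); the parameter values are illustrative.

```python
import numpy as np
from scipy.special import erfinv

def r_max(eps, delta, l, sig0_2, omega2, N):
    # Numerical form of Eq. 3 as reconstructed above; a non-positive log
    # argument means the bound never binds for these parameter values.
    inner = (sig0_2 + omega2) / (N * sig0_2) * \
            (1 - (eps / (np.sqrt(2 * sig0_2) * erfinv(delta)))**2)
    return l * np.sqrt(-np.log(inner))

print(r_max(eps=2.0, delta=0.8, l=12.24, sig0_2=25.0, omega2=0.25, N=100))
```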
Lemma 2 (Sufficient Condition). For a test location x ∈ U, if there exists a measurement location x_i ∈ X with n measurements at x_i such that

$$d(x, x_i) \leq \frac{1}{L+\sqrt{2}L_1}\left(\epsilon - \frac{\sqrt{2}\,\omega\,\mathrm{erf}^{-1}\left(\frac{\delta}{1-a\exp\left(-L_1^2/b^2\right)}\right)}{\sqrt{n+\frac{\omega^2}{\sigma_0^2}}}\right), \tag{9}$$

then GP predictions at x will be ε-accurate with high confidence, i.e., Pr(|f̂_{x|X} − f_x| ≤ ε) ≥ δ, where a, b, and L₁ are constants.

Proof. Consider a measurement location, say x_i, that is a distance d(x, x_i) away from the point x. From the Lipschitz smoothness assumption we have

$$|f_x - f_{x_i}| \leq L\, d(x, x_i). \tag{10}$$

A bound on the GP posterior variance σ̂²_{x_i|X} at x_i after making n measurements at x_i can be obtained using Eq. 1,

$$\hat{\sigma}^2_{x_i|X} \leq \frac{\omega^2}{n + \frac{\omega^2}{\sigma_0^2}}. \tag{11}$$
The inequality in Eq. 11 follows from the fact that the right-hand side does not consider the effect of measurement locations other than x_i, which cannot increase the posterior variance at x_i. GPs are consistent function estimators [7], and hence the GP mean prediction at any measurement location converges to the actual value of the function given a sufficient number of measurements at that location. The GP prediction f̂_{x_i|X} at x_i is a normal random variable with mean μ̂_{x_i|X} and variance σ̂²_{x_i|X}; hence,

$$\frac{\hat{f}_{x_i|X} - \hat{\mu}_{x_i|X}}{\hat{\sigma}_{x_i|X}} \sim \mathcal{N}(0, 1). \tag{12}$$

Using the consistency of GPs, Eq. 12 can be written as

$$\frac{\hat{f}_{x_i|X} - f_{x_i}}{\hat{\sigma}_{x_i|X}} \sim \mathcal{N}(0, 1) \implies \Pr\left(\left|\hat{f}_{x_i|X} - f_{x_i}\right| \leq \frac{k\omega}{\sqrt{n+\frac{\omega^2}{\sigma_0^2}}}\right) = \mathrm{erf}\left(\frac{k}{\sqrt{2}}\right). \tag{13}$$
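The bound of Eq. 11 is easy to check numerically: with n repeated measurements at a single location, the exact GP posterior variance there matches the right-hand side. The values below are illustrative.

```python
import numpy as np

# Empirical check of Eq. 11: n repeated measurements at the same x_i
# drive the posterior variance toward omega^2 / (n + omega^2/sig0^2).
sig0_2, omega2, n = 1.0, 0.25, 10
K = sig0_2 * np.ones((n, n))        # k(x_i, x_i) = sig0^2 for all pairs
k_vec = sig0_2 * np.ones(n)
post_var = sig0_2 - k_vec @ np.linalg.solve(K + omega2 * np.eye(n), k_vec)
bound = omega2 / (n + omega2 / sig0_2)
print(post_var, bound)               # both ~0.0244 in this degenerate case
```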
The squared-exponential kernel for one-dimensional input satisfies the following high-probability bound [21] on the derivatives of GP sample paths f̂_{x|X}: for some constants a, b > 0,

$$\Pr\left(\sup_{x \in U}\left|\partial \hat{f}_{x|X}/\partial x\right| < L_1\right) \geq 1 - a\exp\left(-\frac{L_1^2}{b^2}\right). \tag{14}$$

Equation 14 bounds the first-order derivative of the sampled functions with high probability. Hence, the sampled functions are Lipschitz smooth with constant L₁. As a result, GP predictions at any locations x and x_i satisfy

$$\Pr\left(\left|\hat{f}_x - \hat{f}_{x_i}\right| \leq \sqrt{2}\,L_1\, d(x, x_i)\right) \geq 1 - a\exp\left(-\frac{L_1^2}{b^2}\right). \tag{15}$$

Note that a factor of √2 appears in Eq. 15 due to the l₂-distance between x and x_i (Cauchy-Schwarz inequality) [27]. Combining Eqs. 10, 13, and 15 using the triangle inequality results in

$$\Pr\left(\left|\hat{f}_{x|X} - f_x\right| \leq \left(L+\sqrt{2}L_1\right) d(x, x_i) + \frac{k\omega}{\sqrt{n+\frac{\omega^2}{\sigma_0^2}}}\right) \geq \mathrm{erf}\left(\frac{k}{\sqrt{2}}\right)\left(1 - a\exp\left(-\frac{L_1^2}{b^2}\right)\right), \tag{16}$$
for any positive constant k. Note that Eq. 16 holds for any measurement location x_i which is a distance d(x, x_i) away from the test location x, and can be satisfied by choosing d(x, x_i) as stated in Lemma 2 for a given δ. Lemma 2 gives a sufficient condition for GP predictions to be accurate at any given test location x ∈ U. The following lemma shows that a finite number of measurements n_α is sufficient to ensure predictive accuracy with high probability δ in a smaller disk of radius (1/α) r_max around x_i, where α > 1 (Fig. 1).
Fig. 1. Collecting nα measurements at O suffices to make accurate predictions at all points inside disk D2 (Sufficient condition). No number of measurements at O can ensure predictive accuracy on points outside disk D1 (Necessary condition).
Lemma 3. Given a disk of radius (1/α) r_max centered at x_i, with α > 1, with high probability δ, n_α measurements at x_i suffice to make ε-accurate predictions for all points inside the disk, where

$$n_\alpha \geq \left(\frac{\sqrt{2}\,\omega\,\mathrm{erf}^{-1}\left(\frac{\delta}{1-a\exp\left(-L_1^2/b^2\right)}\right)}{\epsilon - \frac{l\left(L+\sqrt{2}L_1\right)}{\alpha}\sqrt{-\log\left(\frac{\sigma_0^2+\omega^2}{N\sigma_0^2}\left(1-\left(\frac{\epsilon}{\sqrt{2}\,\sigma_0\,\mathrm{erf}^{-1}(\delta)}\right)^{2}\right)\right)}}\right)^{2} - \frac{\omega^2}{\sigma_0^2}. \tag{17}$$
Proof. We want a sufficiency condition on the number of measurements n_α inside a disk of radius (1/α) r_max. Lemma 2 gives an upper bound on the radius of a disk such that all points inside the disk will be chance-constrained after n_α measurements at the center. We construct a disk (D2 in Fig. 1) whose radius is equal to (1/α) r_max such that

$$\frac{1}{\alpha} r_{\max} \leq \frac{1}{L+\sqrt{2}L_1}\left(\epsilon - \frac{\sqrt{2}\,\omega\,\mathrm{erf}^{-1}\left(\frac{\delta}{1-a\exp\left(-L_1^2/b^2\right)}\right)}{\sqrt{n_\alpha+\frac{\omega^2}{\sigma_0^2}}}\right). \tag{18}$$
This ensures all points in D2 will be chance-constrained. Substituting the value of rmax from Lemma 1 in Eq. 18 and re-arranging for nα gives the required bound stated in Lemma 3.
A packing of disks of radius r_max gives a lower bound on the number of measurements required to ensure predictive accuracy. On the other hand, a covering of disks of radius (1/α) r_max gives us an upper bound on the number of measurements required. To solve Problem 1, what remains is to relate the upper and lower bounds and present an algorithm to actually place the disks of radii (1/α) r_max.

4.3 Placement of Sensors for Problem 1
We use an algorithm similar to the one presented by Tekdas and Isler [28]. The exact procedure is outlined in Algorithm 1. We choose α = 2.

Algorithm 1: DiskCover
input: An arbitrary environment
output: Measurement locations
1 Given an arbitrary set X of disks of radii r_max covering the given environment, calculate a Maximal Independent Set (MIS) I of X greedily, i.e., I = MIS(X);
2 Place disks of radii 3r_max at the center of each disk in I. Denote the set of 3r_max-radius disks X̄;
3 Cover each disk in X̄ with disks of radii (1/2) r_max as shown in Fig. 2, and label the centers of all disks of radii (1/2) r_max;
4 Return all the labeled points in the previous step as measurement locations;

Theorem 1. DiskCover (Algorithm 1) gives a 72-approximation for Problem 1 in polynomial time.
Proof. Denote the optimal set of measurement locations to solve Problem 1 by X*. The function MIS in Step 1 of Algorithm 1 computes a maximal independent set of disks: the disks in I are mutually non-intersecting (independent), and any disk in X\I intersects some disk in I (maximal). The set I can be computed by a simple polynomial greedy procedure: choose an arbitrary disk d from X, add it to I, remove all disks in X which intersect d, and repeat the procedure until no such d exists. An optimal algorithm will have to collect measurements from at least as many measurement locations as the cardinality of I. This can be proved by contradiction. Suppose an algorithm visits fewer measurement locations than the number of disks in I. In that case, there will exist at least one disk of radius r_max in I which does not contain a measurement location. This means that there will be at least one point in that disk which is more than r_max away from every measurement location. From Lemma 1, the robot can never make accurate predictions at that point, hence violating the constraint in Problem 1. Hence,

$$|I| \leq |X^*|. \tag{19}$$

The union of the X̄ disks covers the entire environment. We prove this by contradiction as well. Consider a test point x in the environment and assume that it does not lie in the union of the X̄ disks. Since no disk in X̄ contains x, x lies at least 3r_max away from the center of each disk in X̄ (also the centers of the disks in I). Hence, the nearest disk to x in I is at least 2r_max away from x, as a result of which a disk of radius r_max around x can be constructed without intersecting any disk in I, which violates the fact that I is an MIS. Hence, the union of the X̄ disks covers the entire environment. Collecting measurements from 72 locations inside a 3r_max disk suffices to make accurate predictions in that disk (satisfying the Problem 1 constraint for points lying in that disk), as illustrated in Fig. 2. DiskCover collects measurements from 72 such locations per disk in X̄, i.e., from a total of 72|X̄| locations, hence satisfying the constraint for all points in the area covered by the union of the X̄ disks. Since the union of the X̄ disks covers the entire environment, DiskCover satisfies the constraint for all points in the environment. Using Eq. 19 and noting that |X̄| = |I|,

$$72|\bar{X}| \leq 72|X^*|, \tag{20}$$
$$n_{\mathrm{DiskCover}} \leq 72|X^*|, \tag{21}$$

where n_DiskCover is the number of measurement locations for DiskCover.
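The greedy MIS computation used in Step 1 of Algorithm 1 and referenced in this proof can be sketched directly; the candidate disk set and toy environment below are assumptions for illustration.

```python
import numpy as np

def greedy_mis(centers, r):
    # Greedy maximal independent set of equal-radius disks: keep a disk,
    # drop every disk intersecting it (center distance < 2r), repeat.
    keep, remaining = [], list(range(len(centers)))
    while remaining:
        i = remaining.pop(0)
        keep.append(i)
        remaining = [j for j in remaining
                     if np.linalg.norm(centers[i] - centers[j]) >= 2 * r]
    return keep

# Disks of radius r_max covering a toy rectangular environment on a grid.
r_max = 1.0
xs, ys = np.meshgrid(np.arange(0, 10, r_max), np.arange(0, 5, r_max))
centers = np.stack([xs.ravel(), ys.ravel()], axis=1)
I = greedy_mis(centers, r_max)
print(len(centers), "candidate disks ->", len(I), "independent disks")
```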
4.4 Finding Optimal Trajectory for Problem 2
The algorithm for Problem 2 builds on the algorithm presented in the previous section. The locations where measurements are to be made become the locations visited by the robot. The robot must obtain at least n₂ samples (Lemma 3 with α = 2) at the center of each disk of radius (1/2) r_max. A pseudo-code of the algorithm is presented in Algorithm 2.
Fig. 2. We cover the disk of radius 3r_max using disks of radii (1/2) r_max in a lawnmower pattern, which requires no more than double the minimum number of (1/2) r_max disks required to cover a 3r_max disk. 72 disks suffice to cover the bigger disk, whereas the minimum number of (1/2) r_max disks required is π(3r_max)²/π(0.5 r_max)² = 36. The locations of the disks of radii (1/2) r_max inside the disk of radius 3r_max are obtained by covering a square circumscribing the bigger disk with smaller squares inscribed in the smaller disks. The centers of the smaller squares coincide with the centers of the smaller disks.
Algorithm 2: DiskCoverTour
input: A set of measurement locations calculated from Algorithm 1.
output: Optimal path visiting all measurement locations.
1 Cover each X̄ disk in a lawnmower pattern visiting the centers of the corresponding disks of radius (1/2) r_max, and make n₂ measurements at each center point;
2 Calculate an approximate polynomial-time TSP tour visiting the centers of the X̄ disks;
3 Return the union of the paths calculated in the steps above;

Theorem 2. Algorithm 2 yields a constant-factor approximation algorithm for Problem 2 in polynomial time.

Proof. From Theorem 1 we already have a constant approximation bound on the number of measurement locations. Let the time (travel and measurement time) taken by the optimal algorithm be T*. Using notation from Theorem 1, let the optimal TSPN time to visit the disks in I be T*_I. The optimal algorithm will visit all disks in I at least once, which gives the following lower bounds on the optimal travel time (T*_travel) and optimal measurement time (T*_measure):

$$T_I^* \leq T_{\mathrm{travel}}^*, \qquad \eta|I| \leq T_{\mathrm{measure}}^*. \tag{22}$$
Let the optimal time to visit the centers of the disks in I be T*_{I_C}. The time T*_{I_C} to visit the centers can be bounded by a path with a maximum detour of 2r_max|I| from T*_I. As a result, T*_{I_C} ≤ T*_I + 2r_max|I|. Using the inequality from
Eq. 22, T*_{I_C} ≤ T*_travel + 2r_max|I|. For any disk in X̄, the length of the lawnmower path starting from its center and returning back (Fig. 2) after visiting all center points of the (1/2) r_max disks is 82r_max. Hence, the total travel time for DiskCoverTour is T_C + 82r_max|I|, where T_C is the time returned by the (1+ε)-approximation algorithm for the optimal TSP tour visiting the centers of the disks in X̄ (or the disks in I, since they are concentric). T_C can be calculated in polynomial time [29] and satisfies T_C ≤ (1+ε) T*_{I_C}, with T*_{I_C} being the optimal TSP time to visit the centers of the X̄ disks. The total measurement time for DiskCoverTour is 72ηn₂|I|. Hence, the total time T_alg for DiskCoverTour is

$$T_{alg} = T_C + 82r_{\max}|I| + 72\eta n_2|I| \leq (1+\epsilon)\, T^*_{I_C} + 82 r_{\max}|I| + 72\eta n_2|I| \tag{23}$$
$$\leq (1+\epsilon)\left(T^*_{\mathrm{travel}} + 2r_{\max}|I|\right) + 82 r_{\max}|I| + 72\eta n_2|I|. \tag{24}$$

The length of any tour that visits k non-overlapping equal-size disks of radius r is at least 0.24kr [30], which gives 0.24 r_max|I| ≤ T*_I. Combining this result with Eq. 22 modifies the bounds in Eq. 24 as

$$T_{alg} \leq \left((1+\epsilon)\left(1 + \frac{2}{0.24}\right) + \frac{82}{0.24}\right) T^*_{\mathrm{travel}} + 72 n_2\, T^*_{\mathrm{measure}} \tag{25}$$
$$\leq \max\left\{9.33(1+\epsilon) + \frac{82}{0.24},\; 72 n_2\right\}\left(T^*_{\mathrm{travel}} + T^*_{\mathrm{measure}}\right) \tag{26}$$
$$\leq \max\left\{9.33(1+\epsilon) + \frac{82}{0.24},\; 72 n_2\right\} T^* \leq c\, T^*, \tag{27}$$

where c denotes the corresponding maximum value on the right-hand side of Eq. 26, and is a constant.
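For concreteness, the constant in Eq. 26 can be evaluated for sample values of ε and n₂ (both illustrative):

```python
# Numerical value of the constant c from Eq. 26 for a given TSP
# approximation parameter eps and per-disk sample count n2.
def approx_constant(eps, n2):
    return max(9.33 * (1 + eps) + 82 / 0.24, 72 * n2)

print(approx_constant(eps=0.1, n2=5))   # 360: measurement term dominates
print(approx_constant(eps=0.1, n2=1))   # ~351.9: travel term dominates
```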
Fig. 3. Actual and predicted OM content comparison.
5
Simulations
We evaluate our algorithm against other baseline strategies through simulations based on a real-world dataset [31] of a farm (Fig. 3(a)). The dataset [31] consists of organic matter (OM) measurements collected manually from 350 locations in a farm. Since the algorithm may need to obtain measurements at locations other than the original 350 locations, we use cubic interpolation to generate the ground truth values at any location in the environment. The simulated sensor returns a noisy version of this ground truth measurement with additive Gaussian noise of variance ω² = 0.25. We compare the performance of our algorithm against mutual information (MI) and entropy-based planning strategies [1] using three criteria. The squared-exponential kernel has two hyperparameters, namely the length scale (l) and the signal variance (σ₀²). The values of l and σ₀² were estimated by minimizing the negative log-marginal likelihood for the 350 locations where we know the OM content. We discretized the farm with a resolution of 3l and calculated the values of the required metrics at the discretized locations. We used δ = 0.8, ε = 2, ω² = 0.25, and l was estimated to be equal to 12.24 m by log-marginal likelihood minimization. Figure 3(b) presents the predicted OM content in the farm by the DiskCover algorithm. Figure 4 shows the covering of the given environment using disks of radii 3r_max concentric with the disks in the MIS calculated by the DiskCover algorithm.
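Hyperparameter estimation by minimizing the negative log-marginal likelihood can be sketched as follows; the data here is a synthetic 1-D stand-in for the 350 surveyed OM locations, and the zero-mean GP evidence is the textbook form rather than anything specific to this paper.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(params, X, y, omega2=0.25):
    # Standard GP evidence for a squared-exponential kernel; params are
    # log(l) and log(sig0^2) to keep them positive during optimization.
    l, sig0_2 = np.exp(params)
    d2 = (X[:, None] - X[None, :])**2
    K = sig0_2 * np.exp(-d2 / (2 * l**2)) + omega2 * np.eye(len(X))
    sign, logdet = np.linalg.slogdet(K)
    return 0.5 * (y @ np.linalg.solve(K, y) + logdet
                  + len(X) * np.log(2 * np.pi))

# Toy stand-in for the surveyed locations (1-D here for brevity).
X = np.linspace(0, 600, 50)
y = 2 * np.sin(X / 40.0) + 0.5 * np.random.randn(50)
res = minimize(neg_log_marginal_likelihood,
               x0=np.log([10.0, 4.0]), args=(X, y))
l_hat, sig0_2_hat = np.exp(res.x)
```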
Fig. 4. Disk placement covering the farm calculated by the DiskCover algorithm. The disks are of radii 3r_max and are concentric with the disks of radii r_max in I. The optimal TSP tour visiting the centers and the farm boundaries are shown in red and blue, respectively. Lawnmower detours have been omitted to make the figure more legible.
The total time taken by the robot is the sum of the travel time and the time taken to perform measurements. For MI and entropy-based planning, we calculate the measurement locations greedily [1]. After calculating the measurement locations for both, we find the optimal TSP tour visiting all the measurement locations and subsequently calculate the total time.
Figure 5 presents the percentage of points with prediction error more than 2 ppm as a function of the time spent by the robot in the farm. DiskCoverTour outperforms MI and entropy-based planning in the beginning, but as the robot spends more time in the farm, MI and entropy-based planning perform better than DiskCoverTour. Asymptotically, DiskCoverTour converges with fewer points (than MI and entropy-based planning) which have high prediction error.

Fig. 5. Percentage of points with prediction error more than the maximum allowed error (2 ppm) as a function of tour time.

DiskCoverTour being outperformed by the other strategies in the middle can be explained by the observation that entropy and MI-based planners tend to spread the measurement locations as far as possible from each other, thereby covering a large part of the environment quicker than DiskCoverTour. A plot of Root Mean Square Error (RMSE) vs time is shown in Fig. 6(a). We observe a similar trend as in Fig. 5. This trend can be attributed to the fact that MI and entropy-based planning methods are indirect methods and tend to collect measurements from locations which are far away from each other. We plot entropy (a measure of posterior variance) to quantify the confidence in the learned OM content variation. For our choice of δ = 0.8, the entropy for the DiskCoverTour algorithm converges to 230, as shown in Fig. 6(b). For δ ≈ 1, it converges to the same values as the MI and entropy-based strategies. Entropy-based planning achieves the maximum reduction in the entropy of the distribution because of its diminishing-returns property [1]. However, a reduction in an indirect measure like entropy does not guarantee a reduction in prediction error. For this reason, entropy and MI-based strategies may cease to perform better than DiskCoverTour in achieving accuracy in posterior predictions.
(a) Comparison of total error between actual organic matter content and predicted content. (b) Entropy estimate for the posterior distribution.
Fig. 6. Predictive error and posterior entropy comparisons.
6
Conclusion
We study the problem of minimizing the time taken by a single traveling robot to learn a spatial field. We provide a lower bound on the number of measurement locations required to estimate the field sufficiently accurately with high confidence. We also provide a polynomial-time approximation algorithm whose performance is within a constant factor of the lower bound. We show that it is possible to learn a given spatial field accurately with high confidence without planning adaptively. Note that if the kernel parameters are optimized online, then one would require an adaptive strategy. Our ongoing work is on developing competitive strategies for such cases as well as for spatio-temporal learning. As immediate future work, we aim to remove the dependence of n_α on N and to optimize the value of α to improve the approximation ratio.
References

1. Krause, A., Singh, A., Guestrin, C.: Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. J. Mach. Learn. Res. 9(Feb), 235–284 (2008)
2. Singh, A., Krause, A., Guestrin, C., Kaiser, W.J.: Efficient informative sensing using multiple robots. J. Artif. Intell. Res. 34, 707–755 (2009)
3. Ouyang, R., Low, K.H., Chen, J., Jaillet, P.: Multi-robot active sensing of nonstationary Gaussian process-based environmental phenomena. In: Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, pp. 573–580. International Foundation for Autonomous Agents and Multiagent Systems (2014)
4. Hollinger, G.A., Sukhatme, G.S.: Sampling-based robotic information gathering algorithms. Int. J. Robot. Res. 33(9), 1271–1287 (2014)
5. Ling, C.K., Low, K.H., Jaillet, P.: Gaussian process planning with Lipschitz continuous reward functions: towards unifying Bayesian optimization, active learning, and beyond. In: AAAI, pp. 1860–1866 (2016)
6. Tokekar, P., Vander Hook, J., Mulla, D., Isler, V.: Sensor planning for a symbiotic UAV and UGV system for precision agriculture. IEEE Trans. Robot. 32(6), 1498–1511 (2016)
7. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. The MIT Press, Cambridge (2006)
8. Krause, A., Leskovec, J., Guestrin, C., VanBriesen, J., Faloutsos, C.: Efficient sensor placement optimization for securing large water distribution networks. J. Water Resour. Plann. Manag. 134(6), 516–526 (2008)
9. Blackmore, L., Ono, M., Williams, B.C.: Chance-constrained optimal path planning with obstacles. IEEE Trans. Robot. 27(6), 1080–1094 (2011)
10. González-Banos, H.: A randomized art-gallery algorithm for sensor placement. In: Proceedings of the Seventeenth Annual Symposium on Computational Geometry, SCG 2001, pp. 232–240. ACM, New York (2001)
11. Hochbaum, D.S., Maass, W.: Approximation schemes for covering and packing problems in image processing and VLSI. J. ACM (JACM) 32(1), 130–136 (1985)
12. Caselton, W.F., Zidek, J.V.: Optimal monitoring network designs. Stat. Probab. Lett. 2(4), 223–227 (1984)
13. Zimmerman, D.L.: Optimal network design for spatial prediction, covariance parameter estimation, and empirical prediction. Environmetrics 17(6), 635–652 (2006)
14. Ramakrishnan, N., Bailey-Kellogg, C., Tadepalli, S., Pandey, V.N.: Gaussian processes for active data mining of spatial aggregates. In: Proceedings of the 2005 SIAM International Conference on Data Mining, pp. 427–438. SIAM (2005)
15. Nemhauser, G.L., Wolsey, L.A., Fisher, M.L.: An analysis of approximations for maximizing submodular set functions—I. Math. Program. 14(1), 265–294 (1978)
16. Low, K.H., Chen, J., Dolan, J.M., Chien, S., Thompson, D.R.: Decentralized active robotic exploration and mapping for probabilistic field classification in environmental sensing. In: Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, pp. 105–112. International Foundation for Autonomous Agents and Multiagent Systems (2012)
17. Song, D., Kim, C.-Y., Yi, J.: Simultaneous localization of multiple unknown and transient radio sources using a mobile robot. IEEE Trans. Rob. 28(3), 668–680 (2012)
18. Otte, M., Correll, N., Frazzoli, E.: Navigation with foraging. In: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3150–3157. IEEE (2013)
19. Meliou, A., Krause, A., Guestrin, C., Hellerstein, J.M.: Nonmyopic informative path planning in spatio-temporal models. In: Proceedings of the 22nd National Conference on Artificial Intelligence - Volume 1, AAAI 2007, pp. 602–607. AAAI Press (2007)
20. Tan, Y.T., Kunapareddy, A., Kobilarov, M.: Gaussian process adaptive sampling using the cross-entropy method for environmental sensing and monitoring. In: IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia (2018)
21. Srinivas, N., Krause, A., Kakade, S.M., Seeger, M.W.: Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Inf. Theory 58(5), 3250–3265 (2012)
22. Krause, A., Guestrin, C.: Nonmyopic active learning of Gaussian processes: an exploration-exploitation approach. In: Proceedings of the 24th International Conference on Machine Learning, pp. 449–456. ACM (2007)
23. Kennedy, M.C., O'Hagan, A.: Bayesian calibration of computer models. J. Roy. Stat. Soc.: Ser. B (Stat. Methodol.) 63(3), 425–464 (2001)
24. Oakley, J.E., O'Hagan, A.: Probabilistic sensitivity analysis of complex models: a Bayesian approach. J. Roy. Stat. Soc.: Ser. B (Stat. Methodol.) 66(3), 751–769 (2004)
25. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, Hoboken (2012)
26. Giles, M.: Approximating the erfinv function. In: GPU Computing Gems Jade Edition, pp. 109–116. Elsevier (2011)
27. Axler, S.J.: Linear Algebra Done Right, vol. 2. Springer, Heidelberg (1997)
28. Tekdas, O., Isler, V.: Sensor placement for triangulation-based localization. IEEE Trans. Autom. Sci. Eng. 7(3), 681–685 (2010)
29. Arora, S.: Nearly linear time approximation schemes for Euclidean TSP and other geometric problems. In: Proceedings 38th Annual Symposium on Foundations of Computer Science, pp. 554–563, October 1997
30. Tekdas, O., Bhadauria, D., Isler, V.: Efficient data collection from wireless nodes under the two-ring communication model. Int. J. Robot. Res. 31(6), 774–784 (2012)
31. Mulla, D.J., Sekely, A.C., Beatty, M.: Evaluation of remote sensing and targeted soil sampling for variable rate application of nitrogen, pp. 1–15. American Society of Agronomy, Madison (2000)
Meta-learning Priors for Efficient Online Bayesian Regression James Harrison(B) , Apoorva Sharma, and Marco Pavone Stanford University, Stanford, CA, USA {jharrison,apoorva,pavone}@stanford.edu
Abstract. Gaussian Process (GP) regression has seen widespread use in robotics due to its generality, simplicity of use, and the utility of Bayesian predictions. The predominant implementation of GP regression is a nonparametric kernel-based approach, as it enables fitting of arbitrary nonlinear functions. However, this approach suffers from two main drawbacks: (1) it is computationally inefficient, as computation scales poorly with the number of samples; and (2) it can be data inefficient, as encoding prior knowledge that can aid the model through the choice of kernel and associated hyperparameters is often challenging and unintuitive. In this work, we propose ALPaCA, an algorithm for efficient Bayesian regression which addresses these issues. ALPaCA uses a dataset of sample functions to learn a domain-specific, finite-dimensional feature encoding, as well as a prior over the associated weights, such that Bayesian linear regression in this feature space yields accurate online predictions of the posterior predictive density. These features are neural networks, which are trained via a meta-learning (or "learning-to-learn") approach. ALPaCA extracts all prior information directly from the dataset, rather than restricting prior information to the choice of kernel hyperparameters. Furthermore, by operating in the weight space, it substantially reduces sample complexity. We investigate the performance of ALPaCA on two simple regression problems, two simulated robotic systems, and on a lane-change driving task performed by humans. We find our approach outperforms kernel-based GP regression, as well as state-of-the-art meta-learning approaches, thereby providing a promising plug-in tool for many regression tasks in robotics where scalability and data-efficiency are important.
Keywords: Machine learning · Probabilistic reasoning

1
Introduction
Gaussian Process (GP) regression has become a workhorse in robotics due to its generality, simplicity of use, and the utility of Bayesian predictions. As such, it has been used in robotic applications as diverse as model identification for control and reinforcement learning [1,2], contact modeling [3], terrain and environment modeling [4,5], prediction of human behavior [6,7], and trajectory
optimization [8]. Moreover, it has been used effectively for essential tools in robotics such as Bayesian filtering [9], SLAM [10], and Bayesian optimization [11]. While GP regression is a useful tool, standard approaches have several well-known limitations. GP regression is predominantly performed via kernel-based methods [12], but incorporating physical priors into these models is difficult, and done implicitly through the choice of kernel function. For example, in modeling a physical system with a GP equipped with the widely-used squared exponential (SE) kernel, the system designer may vary the prior variance and the length scale of the kernel, which controls smoothness [12]. These choices of priors are neither expressive nor intuitive enough for complex applications in robotics. Moreover, kernel-based GP regression algorithms scale cubically with the amount of data [13]. This high sample complexity prevents application beyond relatively small problems [14]. Several works have attempted to incorporate prior knowledge into GP models and to remove the limitations imposed by kernel methods. Methods mapping input data to alternative feature spaces which are then used as inputs to kernel GP models have been shown to allow incorporation of prior knowledge, and reduce the complexity by projecting to a lower dimensional space. For example, Hinton and Salakhutdinov [15] use a Deep Belief Network to learn a feature representation in an unsupervised fashion, but this separates the process of representation learning and regression, which may be suboptimal. Manifold GPs [16] jointly learn a neural network transformation of the data such that GP methods applied to this feature space achieve good predictions. However, this approach relies on the system designer having access to data generated from the exact system being modeled, and it is unclear how well this approach will perform on a regression problem that differs from that used to generate the training dataset. Good performance on a general class of tasks (as opposed to one specific task within this family) is necessary for effective autonomous operation in unstructured environments. Moreover, these approaches still suffer from the poor sample complexity associated with kernel methods. A more extensive discussion of methods for incorporating prior knowledge is presented in Sect. 5. A variety of sparse and approximate methods have been proposed to reduce the sample complexity of kernel methods, but these at best scale quadratically [17,18]. A popular approach to avoiding the complexity of kernel methods is to instead turn to random features for approximating kernel functions, and solving the linear regression problem using these features instead [19]. The approach taken in this work is similar, but we learn features that are well-suited to representing posterior densities, based on available prior information. Contributions. In this work, we propose ALPaCA (Adaptive Learning for Probabilistic Connectionist Architectures), an algorithm for incorporating prior information into GP regression that increases computational and sample efficiency, and provides Bayesian posteriors reflecting confidence in predictions. We propose to learn a neural network model (i.e. connectionist architecture) offline that is trained to be able to rapidly incorporate online information to predict posterior distributions. Online adaptation is achieved by performing Bayesian
linear regression on the last layer of the network, and thus our approach is capable of providing confidence intervals for predictions. Fundamentally, our network learns priors over the last layer weights, as well as network weights, such that it can optimally predict the posterior over the label, conditioned on the online samples observed so far. This is performed by leveraging the “learning-to-learn” approach of meta-learning [20], where model training is done to minimize divergence between the predicted and true posterior, for any amount of observed online context data. By performing regression on meta-learned basis functions, ALPaCA achieves complexity during inference that is linear in the amount of data. As such, ALPaCA offers a general, plug-in alternative to kernel GP regression which can improve scalability and online data-efficiency in a wide variety of robotics applications. Organization. In Sect. 2, we provide a formal problem statement. We present background information on Bayesian linear regression and GP regression in Sect. 3. In Sect. 4 we introduce the ALPaCA algorithm for generic supervised learning tasks. We contrast our approach to the literature on neural network regression models, current meta-learning approaches, as well as popular GP models in Sect. 5. Finally in Sect. 6, we demonstrate ALPaCA on simple problems that are standard benchmarks in meta-learning [20], as well as high-dimensional robotic systems involving contact, and prediction of human decision making in a driving task1 . ALPaCA is shown to outperform kernel GP regression, as well as meta-learning approaches such as MAML [20].
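The online half of this scheme, Bayesian linear regression on the last layer, can be sketched independently of the meta-learned network. Below, the features φ are hand-picked polynomials standing in for the trained network outputs, and the prior (K₀, Λ₀) and noise variance are illustrative assumptions rather than learned quantities.

```python
import numpy as np

sigma2 = 0.1                                   # assumed noise variance

def posterior(Phi, Y, Lambda0, K0):
    # Standard Bayesian linear regression update over features Phi
    # (rows are phi(x_t)); the prior precision/mean play the role of
    # ALPaCA's offline-learned quantities.
    Lam = Phi.T @ Phi / sigma2 + Lambda0       # posterior precision
    K = np.linalg.solve(Lam, Phi.T @ Y / sigma2 + Lambda0 @ K0)
    return K, Lam

def predict(phi_x, K, Lam):
    mean = K.T @ phi_x                         # posterior predictive mean
    var = sigma2 * (1.0 + phi_x @ np.linalg.solve(Lam, phi_x))
    return mean, var

phi = lambda x: np.array([1.0, x, x ** 2])     # toy fixed features
X = np.array([0.0, 0.5, 1.0])
Y = (X ** 2 - X)[:, None]                      # scalar targets
K, Lam = posterior(np.stack([phi(x) for x in X]), Y,
                   np.eye(3), np.zeros((3, 1)))
print(predict(phi(0.25), K, Lam))              # mean and predictive variance
```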
2 Problem Formulation
In this work, we focus on the problem of Bayesian function regression. In this setting, we start with a prior over possible functions, observe samples from an unknown test function, and use these samples to compute a posterior predictive density over the unknown function. Formally, let f : R^{n_x} → R^{n_y} denote an arbitrary, possibly nonlinear and nonsmooth function with latent parameters θ. We consider a setting where f is fixed and captures problem-specific structure, while θ is uncertain and explains intra-domain variation. For example, in the context of prediction of human driving behavior, uncertainty in the state evolution stems from uncertainty in the driver's intent, while the physics governing the car's dynamics are known. We assume the samples of f observed online may be corrupted with additive Gaussian noise. That is, we observe a sequence of samples (x, y) such that

p(y | x, θ) = N(f(x; θ), Σ_ε),  (1)
where N(·, ·) denotes the multivariate Gaussian distribution, Σ_ε is the noise covariance, and θ is unknown.
¹ The code for all of our experiments is available at https://github.com/StanfordASL/ALPaCA.
Suppose we observe a collection of τ sample points D*_τ = {(x_t, y_t)}_{t=1}^τ coming from an unknown function with latent parameter θ*. Given a prior over model parameters p(θ) and this context data D*_τ, the posterior predictive density over y given x can be computed as

p(y | x, D*_τ) = ∫ p(y | x, θ) p(θ | D*_τ) dθ.  (2)

Computing this posterior predictive density, however, requires having analytic expressions for f(x; θ) and p(θ), and even then, computing the posterior p(θ | D*_τ) exactly may remain intractable. Instead, we propose using a surrogate model for which computing the posterior predictive density is analytically tractable, and optimizing this surrogate model offline such that its posterior predictive density is close to the true posterior predictive density for all likely settings of θ* and the resulting observations D*_τ. Let q_ξ(y | x, D*_τ) represent the posterior predictive density under our surrogate model, parameterized by ξ. Let ŷ represent a sample from this density q. We consider a scenario where data comes in as a stream: at each timestep, the agent is presented with an input x_t, and after estimating the output ŷ_t, the agent is provided with the true value y_t. An example of this is a Markovian dynamical system, in which an agent wishes to predict the distribution over the next state, and at the next time step, this state is observed. In this context, the problem we wish to solve can be written as

min_ξ D_KL( p(y | x_{t+1}, D*_t) ‖ q_ξ(y | x_{t+1}, D*_t) ),  (3)

where D_KL(·‖·) denotes the Kullback–Leibler (KL) divergence. Note that this objective is the conditional KL divergence; as such, the objective is to minimize this divergence in expectation over t, x_{t+1}, D*_t, and θ*. Directly optimizing this objective requires access to p(θ) and p(D*_t | θ*). Instead of assuming access to these distributions directly, we only require a dataset of samples drawn from different settings of the latent parameter θ. Specifically, we will write D^{(j)}_τ := {(x_t, y_t)}_{t=1}^τ to represent a dataset of τ samples generated by sampling θ_j ∼ p(θ) and then, i.i.d., x_t ∼ p(x) and y_t ∼ p(y | x_t, θ_j); that is, a set of samples from one function instantiation. Let D := {D^{(j)}_τ}_{j=1}^M represent a collection of these datasets. Note that by sampling t ∼ p(t), sampling D^{(j)}_τ uniformly from D, and then truncating D^{(j)}_τ to t < τ elements, we can approximate the expectations in (3) through a Monte Carlo scheme.
3 Bayesian Regression
In this work, we utilize Bayesian multivariate linear regression as a tool for computing q(y | x, D*_τ), as it allows for analytic computation of the posterior. In this section, we offer a brief review of the technique. For a more in-depth treatment, we refer the reader to [12], [21], or [22]. Let x_t ∈ R^{n_x} and y_t ∈ R^{n_y}
denote the independent and dependent variables, respectively. We will consider regression over basis functions φ : R^{n_x} → R^{n_φ}. Then, y_t^T = φ^T(x_t) K + ε_t, where K ∈ R^{n_φ × n_y} is a coefficient matrix and ε ∼ N(0, Σ_ε), wherein N(·, ·) denotes the multivariate Gaussian distribution and Σ_ε denotes the noise variance. We will assume Σ_ε is known throughout this work. The case in which this assumption is relaxed is discussed in the extended version of this paper, available at [23]. Let Y^T = [y_1, . . . , y_τ] and Φ^T = [φ(x_1), . . . , φ(x_τ)], where τ is the number of data points. Stacking samples t = 1, . . . , τ, we may equivalently write Y = ΦK + E. Given this formulation, the conditional probability of the data Y is

p(Y | Φ, K, Σ_ε) ∝ |Σ_ε|^{-n_y/2} exp( -(1/2) tr[ (Y - ΦK) Σ_ε^{-1} (Y - ΦK)^T ] ),  (4)

where tr(·) denotes the trace operation. A natural conjugate prior for K is p(K) = MN(K̄_0, Λ_0^{-1}, Σ_ε), where MN denotes the matrix normal distribution and Λ_0 is a precision matrix (the inverse of the variance). The matrix normal distribution is equivalent to the multivariate normal distribution with mean vec(K̄_0) and variance Σ_ε ⊗ Λ_0^{-1}, where vec(K̄_0) represents the vectorization of K̄_0 and ⊗ denotes the Kronecker product. The posterior, conditioned on Y and Φ, is then p(K | Y, Φ) = MN(K̄_τ, Λ_τ^{-1}, Σ_ε), where

Λ_τ = Φ^T Φ + Λ_0,  (5)
K̄_τ = Λ_τ^{-1} (Φ^T Y + Λ_0 K̄_0).  (6)
Given the above, the posterior predictive distribution is

p(y_{τ+1} | φ(x_{τ+1}), Φ, Y) = N(K̄_τ^T φ(x_{τ+1}), Σ_{τ+1}),  (7)

where

Σ_{τ+1} = (1 + φ^T(x_{τ+1}) Λ_τ^{-1} φ(x_{τ+1})) Σ_ε.  (8)
For a derivation of this term, the reader may refer to [21]². Note that several common forms of linear regression, such as ridge regression, are special cases of this general formulation [22]. Bayesian linear regression has strong connections to kernel methods for GP regression. With the choice of Gaussian priors discussed here, Bayesian linear regression implicitly defines a GP with mean and kernel functions that are functions of the prior on the weights, p(K), and the basis functions used (Fig. 1). Note that the terms involving the basis functions φ are always weighted inner products of φ(x) and φ(x′) (with the same weighting, which we will denote Σ). Thus, we define the kernel function k(x, x′) = φ^T(x) Σ φ(x′). In contrast to Bayesian linear regression, kernel-based GP regression employs the well-known "kernel trick" to replace occurrences of inner products with the kernel function [12]. In this way, it can effectively operate in an infinite-dimensional feature space such as that defined by the popular squared exponential kernel. This added model capacity comes at the cost of poor computational complexity as the number of samples in the dataset grows.
² While the expressions in this work and [21] are not obviously equivalent, they can be shown to be the same by applying the Woodbury identity.
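As a concrete reference for (5)–(8), the following is a minimal NumPy sketch of the batch posterior and posterior predictive computations; the variable names are ours and the dimensions are illustrative, not taken from the authors' released code.

```python
import numpy as np

def blr_posterior(Phi, Y, K0_bar, Lam0):
    """Batch posterior for K ~ MN(K_bar, Lam^{-1}, Sigma_eps); eqs. (5)-(6).
    Phi: (tau, n_phi) stacked features, Y: (tau, n_y) stacked labels."""
    Lam_tau = Phi.T @ Phi + Lam0                                   # eq. (5)
    K_tau = np.linalg.solve(Lam_tau, Phi.T @ Y + Lam0 @ K0_bar)    # eq. (6)
    return K_tau, Lam_tau

def blr_predict(phi_new, K_tau, Lam_tau, Sigma_eps):
    """Posterior predictive mean and covariance at a new input; eqs. (7)-(8)."""
    mean = K_tau.T @ phi_new                                       # eq. (7)
    scale = 1.0 + phi_new @ np.linalg.solve(Lam_tau, phi_new)
    return mean, scale * Sigma_eps                                 # eq. (8)

# Example with n_phi = 3 basis functions, n_y = 2 outputs, tau = 10 samples.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 3))            # rows are phi(x_t)^T
Y = rng.normal(size=(10, 2))              # rows are y_t^T
K_tau, Lam_tau = blr_posterior(Phi, Y, np.zeros((3, 2)), np.eye(3))
mean, cov = blr_predict(rng.normal(size=3), K_tau, Lam_tau, 0.1 * np.eye(2))
```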
Fig. 1. Overview of the ALPaCA algorithm. Rather than compute the intractable posterior p(y_{t+1} | x_{t+1}, D*_t) directly, we approximate it through a surrogate model for which computing the posterior q_ξ(ŷ_{t+1} | x_{t+1}, D*_t) is analytically tractable via Bayesian linear regression. Offline, we optimize the parameters ξ of this surrogate model on a dataset D of functions drawn from a prior distribution, aligning the surrogate posterior with the true empirical posterior in the dataset through a meta-learning procedure. Blue nodes are observed offline, green are online inputs, and purple the targets. White nodes are computed, and gray nodes are not observed.
4 Approach
In this work, we choose to represent the approximate density q(y | x_{t+1}, D*_t) as a neural network model with a Bayesian last layer, which allows employing Bayesian linear regression to compute q(y | x_{t+1}, D*_t) analytically. Section 4.1 provides details of this model and other preliminaries. In Sect. 4.2, we outline the full ALPaCA algorithm.

4.1 Algorithmic Preliminaries
Network Model. Recall that ŷ_t denotes the random variable distributed according to q(y | x, D*_τ). In this work we will learn a parametric model for ŷ_t,

ŷ_t = K^T φ(x_t; w) + ε,  (9)

where φ is a neural network with output dimensionality n_φ and weights w, ε ∼ N(0, Σ_ε), and K is a matrix which may be thought of as the last layer of the neural network. Learning this model based on a dataset for fixed θ is relatively simple, and is in line with the majority of the literature on neural network regression models. However, we wish to learn a prior distribution over K such that, given data sampled from (1) with the test parameter θ*, we can generate accurate Gaussian posteriors. Specifically, we wish to learn the parameters K̄_0, Λ_0 of a matrix normal prior p(K) = MN(K̄_0, Λ_0^{-1}, Σ_ε), so that we may apply tools from Bayesian linear regression to analytically compute posterior distributions given online data.
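To fix this model in code, the following PyTorch sketch (our own construction, not the authors' released implementation; the layer widths are illustrative assumptions) collects the learnable quantities ξ = (K̄_0, Λ_0, w): a feature network φ(x; w), the prior mean K̄_0, and a Cholesky factor L_0 with Λ_0 = L_0 L_0^T, which keeps the precision matrix positive semidefinite by construction (a trick used later in Sect. 4.2).

```python
import torch
import torch.nn as nn

class ALPaCAModel(nn.Module):
    """Feature network phi(x; w) plus matrix-normal prior parameters (K0_bar, L0)."""
    def __init__(self, n_x, n_y, n_phi=32, hidden=128):
        super().__init__()
        self.phi = nn.Sequential(                 # basis functions phi(x; w)
            nn.Linear(n_x, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_phi), nn.Tanh())
        self.K0_bar = nn.Parameter(torch.zeros(n_phi, n_y))  # prior mean over K
        self.L0 = nn.Parameter(torch.eye(n_phi))             # Lambda_0 = L0 @ L0.T

    def prior_precision(self):
        L = torch.tril(self.L0)                   # lower-triangular Cholesky factor
        return L @ L.T
```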
While a fully Bayesian approach would also aim to estimate the posterior over the weights w, we will assume a delta prior over these weights, and thus they are not updated online. However, note that in comparison to previous work that utilizes networks with Bayesian output layers [24,25], we will not train the network and learn the prior over K in separate phases; instead, they are learned jointly. This is an important distinction: as opposed to simply learning a network to produce good predictions in expectation (with no context data) and adding in a Bayesian last layer afterwards, we aim to learn a network that is composed of basis functions capable of representing the posterior distribution given any context data. Given fixed basis functions, learning the prior terms in a data-driven fashion is known as empirical Bayes [26]. In contrast to the empirical Bayes approach, we also learn the basis functions in addition to the matrix normal prior. Thus, in summary, the parameters of the posterior predictive model q_ξ we wish to optimize are ξ = (K̄_0, Λ_0, w).

Optimization Objective. We will denote the process generating D*, data x, and label y as p(x, y, D* | θ*), for online test parameters θ*. Note that this may equivalently be written as p(y | x, θ*) p(x, D* | θ*), as y is conditionally independent of D*, given θ*. Making explicit the dependence of the posterior on the prior terms, the inner expectation of (3), wherein the expectation over θ* and t is temporarily ignored, may be written

E_{x,D* ∼ p(x,D*|θ*)} E_{y ∼ p(y|x,θ*)} [ log p(y | x, D*) − log q_ξ(y | x, D*) | x, D* ].

We can then write the above divergence, dropping the terms that do not affect the optimization over the prior terms, as

L(K̄_0, Λ_0, w, θ*) = −E_{x,y,D* ∼ p(x,y,D*|θ*)} [ log q_ξ(y | x, D*) ].

Rewriting this negative log likelihood using (7), discarding terms not relevant to the optimization problem over the prior terms, and making the time dependency explicit, we obtain

L(K̄_0, Λ_0, w, θ*, t) = E_{x_{t+1}, y_{t+1}, D*_t ∼ p(x,y,D*|θ*)} [ log det(Σ_{t+1}) + (y_{t+1} − K̄_t^T φ_{t+1})^T Σ_{t+1}^{-1} (y_{t+1} − K̄_t^T φ_{t+1}) ].

Thus, (3) may be written as

ℓ(K̄_0, Λ_0, w) = E_{t ∼ p(t), θ* ∼ p(θ)} [ L(K̄_0, Λ_0, w, θ*, t) ],  (10)

where Λ_t, K̄_t, and Σ_t are as in (5), (6), and (8). These terms are functions of the prior terms K̄_0, Λ_0, and w, as well as of D*_t. Here, we write φ_t to denote φ(x_t). This equivalence between minimizing KL divergence and maximizing log likelihood is well known [22], but we emphasize it to improve clarity regarding the expectations and the role of the prior terms in the likelihood.
Algorithm 1. ALPaCA: Offline
Require: training data D, noise variance Σ_ε
1: Randomly initialize K̄_0, Λ_0, w
2: while not converged do
3:   for j = 1 to J do
4:     Sample t_j from the uniform distribution over {1, . . . , τ}
5:     Sample dataset D^{(j)} uniformly from D
6:     Compute K̄_{t_j}, Λ_{t_j}, Σ_{t_j+1} via (12), using dataset D^{(j)}_{t_j+1} \ {(x_{t_j+1}, y_{t_j+1})}
7:   end for
8:   Update K̄_0, Λ_0, w via a gradient step on the optimization problem (12)
9: end while
10: return K̄_0, Λ_0, w
4.2 ALPaCA Overview
Offline Training. Given (10), we are now able to write the full optimization problem we wish to solve. We sample J datasets D_τ of τ samples each uniformly from D. Let j denote the index of these datasets. For each of these datasets, we sample a horizon t_j ∼ p(t) and clip each dataset to this horizon to obtain {D^{(j)}_{t_j+1}}_{j=1}^J. Because D contains datasets with θ ∼ p(θ), evaluating the loss on this minibatch approximates the expectations in (10). We use the superscript (j) to denote the terms within each dataset D^{(j)}_{t_j+1}. Note that

log det(Σ_{t_j+1}) = log det((1 + φ^T(x_{t_j+1}) Λ_{t_j}^{-1} φ(x_{t_j+1})) Σ_ε)
                  = n_y log(1 + φ^T(x_{t_j+1}) Λ_{t_j}^{-1} φ(x_{t_j+1})) + log det Σ_ε.

Since it does not depend on the optimization variables, log det Σ_ε may be ignored and we may write the Monte Carlo estimate of (10) as

ℓ̂(K̄_0, Λ_0, w) := (1/J) Σ_{j=1}^J [ n_y log(1 + φ^{(j),T}_{t_j+1} Λ^{(j),-1}_{t_j} φ^{(j)}_{t_j+1}) + (y^{(j)}_{t_j+1} − K̄^{(j),T}_{t_j} φ^{(j)}_{t_j+1})^T Σ^{(j),-1}_{t_j+1} (y^{(j)}_{t_j+1} − K̄^{(j),T}_{t_j} φ^{(j)}_{t_j+1}) ].  (11)

Thus, our problem formulation is

min_{K̄_0, Λ_0, w}  ℓ̂(K̄_0, Λ_0, w)
s.t.  Σ^{(j)}_{t_j+1} = (1 + φ^T(x_{t_j+1}) Λ^{(j),-1}_{t_j} φ(x_{t_j+1})) Σ_ε,  j = 1, . . . , J
      Λ^{(j)}_{t_j} = Φ^{(j),T}_{t_j} Φ^{(j)}_{t_j} + Λ_0,  j = 1, . . . , J
      K̄^{(j)}_{t_j} = (Λ^{(j)}_{t_j})^{-1} (Φ^{(j),T}_{t_j} Y^{(j)}_{t_j} + Λ_0 K̄_0),  j = 1, . . . , J
      Λ_0 ⪰ 0,  (12)
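Under the assumed ALPaCAModel interface sketched earlier, a minimal version of one Monte Carlo term of (11) can be written as follows; the names and shapes are ours, and t_j must be less than the number of rows in X.

```python
def alpaca_loss(model, X, Y, t_j, Sigma_eps):
    """One term of (11) for a single dataset D^(j): X is (tau, n_x), Y is (tau, n_y)."""
    n_y = Y.shape[-1]
    Phi = model.phi(X[:t_j])                             # context features, (t_j, n_phi)
    Lam0 = model.prior_precision()
    Lam_t = Phi.T @ Phi + Lam0                           # eq. (5)
    K_t = torch.linalg.solve(Lam_t, Phi.T @ Y[:t_j] + Lam0 @ model.K0_bar)  # eq. (6)
    phi_next = model.phi(X[t_j])                         # held-out input x_{t_j + 1}
    scale = 1.0 + phi_next @ torch.linalg.solve(Lam_t, phi_next)
    resid = Y[t_j] - K_t.T @ phi_next
    mahal = resid @ torch.linalg.solve(scale * Sigma_eps, resid)
    return n_y * torch.log(scale) + mahal                # eq. (11), log det Sigma_eps dropped
```

Averaging this loss over J sampled datasets and sampled horizons t_j, and taking stochastic gradient steps on (K̄_0, L_0, w), yields the training procedure of Algorithm 1.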
Algorithm 2. ALPaCA: Online
Require: prior terms Λ_0, K̄_0, w, noise variance Σ_ε, data {(x_t, y_t)}_{t=1}^τ
1: Q_0 ← Λ_0 K̄_0
2: Compute Λ_0^{-1}
3: for t = 1, . . . , τ do
4:   Λ_t^{-1} ← Λ_{t-1}^{-1} − (1 / (1 + φ_t^T Λ_{t-1}^{-1} φ_t)) (Λ_{t-1}^{-1} φ_t)(Λ_{t-1}^{-1} φ_t)^T
5:   Q_t ← φ_t y_t^T + Q_{t-1}
6:   K̄_t ← Λ_t^{-1} Q_t
7:   ȳ_{t+1} ← K̄_t^T φ_{t+1}
8:   Σ_{t+1} ← (1 + φ_{t+1}^T Λ_t^{-1} φ_{t+1}) Σ_ε
9: end for
10: return ȳ_{τ+1}, Σ_{τ+1}
where Φ^{(j)}_{t_j} and Y^{(j)}_{t_j} contain data from dataset j, from timesteps 1, . . . , t_j. The full algorithm is detailed in Algorithm 1. The semidefinite constraint on Λ_0 is easily enforced by optimizing instead over L_0, the Cholesky factor (Λ_0 = L_0 L_0^T). By using this Cholesky decomposition and substituting in the equality constraints, (12) is made unconstrained, and may be solved by stochastic gradient descent. Note that in practice, rather than evaluating the loss only on (x_{t_j+1}, y_{t_j+1}), we minimize the average loss evaluated on {(x_t, y_t)}_{t=t_j+1}^τ.

Online Phase. After learning the prior parameters (K̄_0, Λ_0) and the weights for the basis functions w, ALPaCA computes the online posterior by performing Bayesian linear regression using the data observed online, D*. Since the number of basis functions used may be large for complex functions, we introduce a recursive Bayesian update that improves computational efficiency. This result is well known in linear system identification and online regression [27]. Let Q_0 := Λ_0 K̄_0. Then, the online updates may be written as

Λ_t^{-1} = Λ_{t-1}^{-1} − (1 / (1 + φ_t^T Λ_{t-1}^{-1} φ_t)) (Λ_{t-1}^{-1} φ_t)(Λ_{t-1}^{-1} φ_t)^T,  (13)
Q_t = φ_t y_t^T + Q_{t-1},  (14)

where φ_t = φ(x_t). The first update comes from the Woodbury identity [27]. Importantly, this recursive update rule decreases the complexity from the original formulation's O(n_φ^3) (resulting from inverting the precision matrix directly) to O(n_φ^2). The full recursive online algorithm is presented in Algorithm 2. Note that φ is not updated online, in contrast to some other approaches to meta-learning such as MAML [20]. Moving from gradient updates to least squares yields a simpler algorithm online, which is useful for debugging or verifying autonomous systems. Because ALPaCA performs GP regression in the weight space as opposed to computing the kernel functions, it achieves linear complexity in the amount of data. In particular, for n context data points and m test data points, ALPaCA has complexity O(n + m). One interpretation of ALPaCA is as an approximation method for kernel-based GP regression, similar to randomized
kernels [19]. However, instead of randomizing basis functions to approximate well-known kernels, we compute basis functions to maximize predictive power with respect to the prior implicit in the training data.
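For concreteness, a NumPy sketch of the recursive updates (13)–(14) and the prediction step of Algorithm 2 follows; the class and method names are ours.

```python
import numpy as np

class ALPaCAOnline:
    """Recursive posterior updates, O(n_phi^2) per step (Algorithm 2)."""
    def __init__(self, K0_bar, Lam0, Sigma_eps, phi):
        self.Q = Lam0 @ K0_bar                    # Q_0 = Lambda_0 K0_bar
        self.Lam_inv = np.linalg.inv(Lam0)        # computed once, offline
        self.Sigma_eps, self.phi = Sigma_eps, phi

    def update(self, x, y):
        f = self.phi(x)
        v = self.Lam_inv @ f
        self.Lam_inv -= np.outer(v, v) / (1.0 + f @ v)  # eq. (13), Woodbury identity
        self.Q += np.outer(f, y)                        # eq. (14)

    def predict(self, x_next):
        f = self.phi(x_next)
        K = self.Lam_inv @ self.Q                       # K_t = Lambda_t^{-1} Q_t
        mean = K.T @ f
        cov = (1.0 + f @ self.Lam_inv @ f) * self.Sigma_eps
        return mean, cov
```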
5 Related Work
Thus far we have introduced ALPaCA in the context of GP regression. In this section, we discuss connections to other related works in meta-learning and sample-efficient regression.

Meta-learning. In this work, we present a Bayesian perspective on meta-learning. Recently, model-agnostic meta-learning (MAML) presented a simple approach to meta-learning with impressive performance. MAML was originally introduced from the perspective of learning an initial set of neural network weights that could efficiently reach the desired weights for a task [20], given a handful of gradient steps on samples from that task. Subsequent work then recast this as a form of hierarchical Bayes [28]. Specifically, indexing tasks with j, the marginal likelihood of the data X is written

p(X | w) = ∏_j ∫ p(x^j_1, . . . , x^j_τ | w^j) p(w^j | w) dw^j,
where x^j_t denotes data points from within task j, w^j denotes the post-meta-update weights for task and loss function j, and w denotes the initial weights. They note that MAML estimates the likelihood of the data drawn from task j with a point estimate of w^j, computed via gradient descent. In the case of linear regression, Santos [29] showed that early stopping of gradient descent to compute w^j results in a Maximum a Posteriori (MAP) estimate of w^j, regularized by ‖w^j − w‖²_Q, where Q depends on the gradient step size and the number of iterations. In contrast to these formulations, we assume a delta prior on the network weights, and so the weights are not updated online. Additionally, our formulation of a Bayesian last layer allows the posterior (given the context data) to be computed directly. Because we perform this analytical Bayesian update, we do not suffer the effects of the regularizing term, and instead may exactly compute the Gaussian approximation of the posterior. Generally, in the case of linear filtering, recursive least squares approaches have been shown to outperform gradient-based approaches [30]. These results from the linear case may explain in part why ALPaCA makes more efficient use of the online data provided, and can perform faster updates. The interpretation of MAML as hierarchical Bayes allowed the authors to incorporate a Laplace approximation into the weight update [28]; the authors refer to the resulting algorithm as LLAMA. This approximation results in the point estimate of the updated weights being replaced with a second-order approximation of the posterior. Thus, sampling-based methods can be used to approximate the posterior distribution of the data. More recently, PLATIPUS [31] and
Bayesian MAML [32] have used variational approaches to move from point estimates to Bayesian posteriors. We wish to emphasize that these approaches are capable of capturing a larger family of distributions (such as multimodal distributions), while we achieve improved sample efficiency and better efficiency in terms of computation. We believe these trade-offs are justified and necessary in many robotics applications.

Efficient Stochastic Process Regression. In this work we have considered meta-learning for GP regression. A relaxation of this problem formulation to estimation of general stochastic processes was recently presented in [18]. This approach is based on amortized variational inference (as in PLATIPUS [31]). In particular, it is a form of conditional variational autoencoder (CVAE) [33,34], where the conditioning variable is a function of the context data. As with other autoencoder-based approaches, this method requires sampling-based estimates of the moments of the distribution. In contrast, in ALPaCA, the mean and variance may be computed analytically. Moreover, [18] was only tested on problems with two-dimensional output spaces, and it is unclear how well it performs for larger problems, whereas ALPaCA performs well even for high-dimensional systems (such as the 12D hopper system). Within the context of meta-learning for GPs, [35] took a similar approach to the one presented here. However, while they learn hyperparameter priors based on a distribution over tasks, they do not jointly learn a set of basis functions capable of accurately representing posterior distributions.
Fig. 2. Comparison of ALPaCA (left) to ALPaCA without meta-training (center) and Gaussian process regression (right) on a simple sinusoid problem from [20]. The plots on each row contain 0, 1, 2, 3, and 5 context samples, respectively. Confidence intervals are 95%.
Bayesian Output Layer Models. The approach of affixing a Bayesian last layer to a standard neural network has seen application across machine learning. In [24], the authors use this network structure with an arbitrary prior for Bayesian optimization, and [36] use it to improve exploration efficiency in reinforcement learning. In the context of meta-learning, [25] use a similar approach to the one presented here. However, they first train their network without considering meta-updated last layers and losses during training, as we do here. They then perform meta-learning by updating the last layer using approximate methods. In contrast to this approach, we learn network weights such that they form useful basis functions for estimating the posterior, for any amount of data accrued online. Our work also shares common features with recent work on meta-learning with closed-form solvers [37], in which the authors do meta-learning with a ridge regression last layer. Generally speaking, these approaches all make arbitrary assumptions regarding the output layer prior, whereas we map our prior information (available via the training data) to a prior over weights.
6 Experiments
In this section, we investigate the performance of ALPaCA on a variety of regression problems, ranging from simple toy problems that help in building intuition for the algorithm, to challenging prediction and dynamics modeling problems. We compare ALPaCA against several other algorithms that are capable of rapid
online adaptation. First, as an ablation experiment, we compare to ALPaCA without meta-training. That is, the network is trained to predict the prior for varying datasets, but is not explicitly conditioned on random selections of data. This is roughly equivalent to standard methods for training neural networks with Bayesian last layers [24,36]. In these works, a network is typically trained to minimize, e.g., MSE loss over all of the data, and then the prior over the last layer is chosen arbitrarily. For the dynamical systems that we investigate, we also compare to the predictive performance of ALPaCA with no online updating. This network is just trained in expectation over the sampled values of the model parameters θ. For our toy examples, we compare to GP regression. We chose a zero-mean GP with a squared exponential kernel, which is often the default choice in GP modeling [11]. The three approaches above are compared in terms of negative log likelihood to assess the accuracy of their posterior predictions. In our experiments we use tanh nonlinearities, due to favorable properties in terms of variance smoothness and the behavior of the variance far from the observed data. This is discussed in more detail in the extended online version of this paper [23]. In addition to these comparisons, we also compare to MAML [20], which has rapidly become one of the most common approaches for meta-learning. Indeed, MAML was recently used for online system identification [38] on relatively complex dynamical systems. Because MAML generates point predictions rather than full posteriors, we compare in terms of MSE loss (for ALPaCA, we compute the MSE loss on the mean of the posterior). We were not able to compare to PLATIPUS [31] or Neural Processes [18] because, to our knowledge, their code was not publicly available at the time of writing. Throughout our experiments, plotted confidence intervals are 95%.

Fig. 3. Comparison of ALPaCA (left) to ALPaCA without meta-training (center) and Gaussian process regression (right) on a function exhibiting discrete switching between y = −1 and 1 for x ∈ [−2.5, 2.5]. Three switching points are randomly sampled. The plots on each row contain 0, 5, 10, 15, and 25 context samples.

6.1 Toy Experiments
We first investigate two simple regression examples that allow both quantitative and qualitative evaluation of performance. First, we investigate a regression problem over a family of sinusoids, where the amplitude and phase were sampled uniformly from [0.1, 5.0] and [0, π], respectively. This sinusoid problem was used as a simple test problem in [20]. We also evaluate performance on a new, simple benchmark that consists of a family of two-valued functions that begin at y = −1 for x < −2.5, end at y = 1 for x > 2.5, and switch three times between these values for x ∈ [−2.5, 2.5]. These switching points are sampled uniformly from this range. This function was designed to be challenging for our approach relative to its apparent simplicity. Because we are performing basis function regression, the discrete switches between −1 and 1, which can occur anywhere within the range specified above, are hard to capture. This is also challenging for GP approaches, as the structure of the function is hard to capture via a single global smoothness parameter in the kernel. For example, in [39], the authors learn separate GPs for each piecewise continuous segment, as opposed to loosening their smoothness priors.
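A sketch of how the sinusoid meta-dataset D can be generated follows; this is our own construction matching the stated amplitude and phase ranges, and the input range and dataset count M are assumptions (noise per (1) is omitted).

```python
import numpy as np

def sample_sinusoid_dataset(tau=20, rng=None):
    """One dataset D^(j): tau noiseless samples of a random sinusoid."""
    rng = rng or np.random.default_rng()
    amp = rng.uniform(0.1, 5.0)        # amplitude ~ U[0.1, 5.0]
    phase = rng.uniform(0.0, np.pi)    # phase ~ U[0, pi]
    x = rng.uniform(-5.0, 5.0, size=(tau, 1))   # input range is our assumption
    y = amp * np.sin(x + phase)
    return x, y

# D is a collection of M such datasets, each from one function instantiation.
D = [sample_sinusoid_dataset() for _ in range(1000)]
```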
Fig. 4. Negative Log Likelihood (center) and Mean Squared Error (right) for the pendulum system. The image on the left shows the limits of the uncertainty over the pendulum length.
The outputs of ALPaCA, ALPaCA without meta-training, and Gaussian process regression (GPR) on these two problems are plotted in Figs. 2 and 3. First, note that ALPaCA is capable of making predictions that incorporate structural knowledge of the problem from a single datapoint, compared to the local predictions that are common for GPs with the SE kernel. Secondly, on the sinusoid problem, the first sample allows ALPaCA to correctly reduce uncertainty at points at a distance of a scalar multiple of a half wavelength from the sampled point. Moreover, one sample is enough to have a good estimate of the sinusoid nearly everywhere, and within five samples the estimated variance has dropped to nearly zero. This reduction in variance is observed for ALPaCA without meta-training as well, but its predictions are incorrect, highlighting the utility of our meta-learning approach. Further experiments with varying noise covariance are presented in the online version of this paper [23]. On the step function, ALPaCA can capture the discrete steps as well as or better than GP methods. Indeed, ALPaCA does not exhibit the large overcorrection below the function that GPR exhibits. Again, ALPaCA without meta-training reduces the predicted variance incorrectly. In contrast, the predicted confidence intervals generated by ALPaCA appear to be well-calibrated. ALPaCA with and without meta-training successfully learn that the function always takes a value of −1 and 1 for x below −2.5 and above 2.5, respectively. This automatic incorporation of prior information is a powerful tool in the context of robotics, where a wealth of physical information is available but often hard to encode into standard GP regression. The average negative log likelihood and the mean squared error (based on the mean prediction) of the test set as a function of the number of context samples is plotted for both problems in the extended paper [23]. We find that while the prior (no samples observed) for ALPaCA and ALPaCA without meta-training is nearly identical, ALPaCA nearly monotonically improves in performance as more samples are gathered online. We wish to note that GPR performance could likely be improved by a better choice of prior, but as we discussed above, this tuning of the prior is often difficult for complex systems. We also compare ALPaCA to
MAML in terms of MSE. We find that we outperform MAML for all context data sizes, but this difference in performance is especially acute for a small number of context datapoints. Indeed, because MAML performs gradient steps on context data, it performs poorly with a single sample.

Fig. 5. Negative Log Likelihood (center) and Mean Squared Error (right) for the hopper system. The image on the left depicts bounds on the uncertain torso size of the hopper.
6.2 Dynamical Systems
We investigate the performance of ALPaCA in fitting the transition model for dynamical systems. In these tasks, we wish to fit a model that uses the input x = (s_t, a_t) to predict y = s_{t+1} − s_t, where (s_t, a_t, s_{t+1}) represents the state, action, and next state of a dynamical system. The process generating these (x, y) pairs is given by the true dynamics of the system, s_{t+1} − s_t = f(s_t, a_t; θ), where θ here represents model parameters. We generate a dataset of trajectories under different settings of θ sampled from a prior, and study whether ALPaCA can capture this structured uncertainty to enable sample-efficient Bayesian regression online. We test ALPaCA on the pendulum and the hopper from OpenAI's Gym [40]. For these examples, the dynamics parameters were varied. For the pendulum, the mass and length were sampled uniformly from [0.5, 1.5]. The hopper's foot friction was sampled uniformly from [1.7, 2.0] and the torso size was sampled uniformly from [0.02, 0.08]. Plotted rollouts for these systems are provided in our extended paper [23]. The performance of ALPaCA on the pendulum and the hopper is plotted in Figs. 4 and 5. Again, when the amount of context data is low, MAML provides little performance increase. ALPaCA, in contrast, achieves rapid performance improvement within the first few samples. These results also highlight an issue with approaches such as MAML, where taking more gradient steps can lead to overfitting when the amount of context data is small. The selection of the number of gradient steps to take with a MAML-style approach, as well as the batch size to be used for online updates, are hyperparameters that are relatively difficult to tune. By computing the posterior directly via recursive least squares, ALPaCA
avoids this ambiguity in the number of gradient steps to take. Kernel-based methods for GP regression are known to be prohibitively slow for high-dimensional dynamical systems; the time required to evaluate ALPaCA, compared to GPR, is plotted and discussed in [23].
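As a sketch of the data-generation setup described at the start of this subsection, the snippet below builds (x, y) = ((s_t, a_t), s_{t+1} − s_t) regression pairs from random rollouts. It uses the current Gymnasium API rather than the original OpenAI Gym [40], and it omits the resampling of latent dynamics parameters (e.g., pendulum mass and length) between rollouts.

```python
import gymnasium as gym
import numpy as np

def collect_transitions(env_name="Pendulum-v1", episodes=10, horizon=50):
    """Build (x, y) = ((s_t, a_t), s_{t+1} - s_t) pairs from random rollouts."""
    env = gym.make(env_name)
    X, Y = [], []
    for _ in range(episodes):
        s, _ = env.reset()
        for _ in range(horizon):
            a = env.action_space.sample()                 # random exploration policy
            s_next, _, terminated, truncated, _ = env.step(a)
            X.append(np.concatenate([s, np.atleast_1d(a)]))
            Y.append(s_next - s)                          # predict the state delta
            s = s_next
            if terminated or truncated:
                break
    return np.array(X), np.array(Y)
```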
6.3 Vehicle Lane Change
Finally, we investigate the performance of ALPaCA on human lane change experiments³. These experiments were based on the dataset presented in [41]. The data come from 19 pairs of participants driving against each other. The objective of the task was to switch lanes with each other without communicating verbally, and without colliding. Further details on the use of this dataset are presented in [23]. Here, we treat the system as an autonomous dynamical system, and use ALPaCA to predict the transitions. In this experiment, the underlying parameter θ is essentially capturing the driving style of the human. ALPaCA aims to identify the underlying style of the drivers from observations of the first few transitions of a particular trial to reduce uncertainty over future predictions. Numerical results are presented in Fig. 6. Interestingly, the rate of performance improvement on this problem was slower than on previous problems. On previous problems, rapid performance improvement occurred early in the episode and tailed off toward the end of the episode. A possible explanation is that the transition data is often bimodal, with a mode corresponding to driving straight and a mode corresponding to switching lanes. In these experiments, ALPaCA performance improves nearly linearly. Surprisingly, MAML does not experience an initial performance decrease as in previous experiments, and shows rapid improvement that quickly slows.
Fig. 6. Negative Log Likelihood (center) and Mean Squared Error (right) for the lane change experiment. The image on the left shows the view for the drivers, and the aim of the task (reproduced with permission from [41]).
³ Data available at: https://github.com/StanfordASL/TrafficWeavingCVAE.
While building models to identify dynamics parameters as above is reasonably simple, identifying latent variables such as driver intent and style is difficult to do via non-data-driven models. Critically, we did not specify in any way that we wish to identify human driving behavior over the course of the episode. Generally, pre-specification of the latent variables to identify is not necessary: ALPaCA, if trained from real operational data, will learn to identify existing latent parameters.
7 Discussion and Conclusions
In this work we have presented ALPaCA, a sample-efficient probabilistic neural network model that enables efficient adaptive online learning. We have demonstrated that ALPaCA outperforms kernel GP regression, which is a tool used throughout robotics and engineering as a whole, as well as MAML [20] (which itself has outperformed many meta-learning approaches). In contrast to kernel GP regression, our approach enables incorporation of an arbitrarily large amount of prior information, and scales to large data regimes. In contrast to MAML, ALPaCA maintains a full analytic posterior, which is useful for applications such as Bayesian optimization and stochastic optimal control. Moreover, it is simple to implement and a plug-and-play tool for any regression problem. We believe its blend of high-capacity regression models, together with sample efficiency, has the potential to improve robotic autonomy in many scenarios.

Limitations. While the ALPaCA model has shown potential for efficient online Bayesian learning, it has several limitations. First and foremost, the use of analytic update formulas comes with the standard limitations of Gaussian models – in particular, the assumption of Gaussian noise. Indeed, in this paper we assume Σ_ε is known. We discuss the case where this assumption is relaxed in [23]. Relative to MAML-style models [20], in which all parameters of the model are adapted online, ALPaCA has lower model capacity. Our experiments demonstrate that the reduction in capacity results in an improvement in stability (via near-monotonic MSE improvement relative to MAML), but it may be possible to achieve both of these objectives. Finally, fundamental machine learning considerations such as data requirements and meta-overfitting have largely not been addressed in this paper, or in the meta-learning literature broadly.

Future Work. The literature on kernel methods for GP regression is large, and it was impossible to compare to all of the flavors of GP methods that are used in practice. We believe further comparisons of the relative merits of ALPaCA and GP methods are useful future work. Beyond this, several avenues of future work exist. We detail them here briefly; an extended discussion is presented in [23]. First, because GPs are closed under addition, our method may be combined with kernel GP methods. Additionally, ALPaCA is easily extended to tracking time-varying θ using forgetting methods, which we have not discussed in this work. Generally speaking, many of the tools previously developed within basis function
system identification may be applied with ALPaCA. Finally, characterization of the data requirements of meta-learning algorithms, as well as of meta-overfitting, is needed.

Acknowledgments. This work was supported by the Office of Naval Research YIP program (Grant N00014-17-1-2433), by DARPA under the Assured Autonomy program, and by the Toyota Research Institute ("TRI"). This article solely reflects the opinions and conclusions of its authors and not ONR, DARPA, TRI or any other Toyota entity. James Harrison was supported in part by the Stanford Graduate Fellowship and the National Sciences and Engineering Research Council (NSERC).
References

1. Deisenroth, M., Rasmussen, C.E.: PILCO: a model-based and data-efficient approach to policy search. In: International Conference on Machine Learning (ICML) (2011)
2. Berkenkamp, F., Turchetta, M., Schoellig, A., Krause, A.: Safe model-based reinforcement learning with stability guarantees. In: Neural Information Processing Systems (NIPS) (2017)
3. Bauza, M., Rodriguez, A.: A probabilistic data-driven model for planar pushing. In: IEEE International Conference on Robotics and Automation (ICRA) (2017)
4. Vasudevan, S., Ramos, F., Nettleton, E., Durrant-Whyte, H.: Gaussian process modeling of large-scale terrain. J. Field Robot. 26, 812–840 (2009)
5. O'Callaghan, S.T., Ramos, F.T.: Gaussian process occupancy maps. Int. J. Robot. Res. 31, 42–62 (2012)
6. Wang, J.M., Fleet, D.J., Hertzmann, A.: Gaussian process dynamical models for human motion. IEEE Trans. Pattern Anal. Mach. Intell. 30, 283–298 (2008)
7. Urtasun, R., Fleet, D.J., Fua, P.: 3D people tracking with Gaussian process dynamical models. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2006)
8. Mukadam, M., Yan, X., Boots, B.: Gaussian process motion planning. In: IEEE International Conference on Robotics and Automation (ICRA) (2016)
9. Ko, J., Fox, D.: GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Auton. Robots 27, 75–90 (2009)
10. Ferris, B., Fox, D., Lawrence, N.: WiFi-SLAM using Gaussian process latent variable models. In: International Joint Conference on Artificial Intelligence (IJCAI) (2007)
11. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Neural Information Processing Systems (NIPS) (2012)
12. Rasmussen, C.E.: Gaussian processes in machine learning. Springer (2004)
13. Hensman, J., Fusi, N., Lawrence, N.D.: Gaussian processes for big data. In: Uncertainty in Artificial Intelligence (UAI) (2013)
14. Smola, A.J., Bartlett, P.L.: Sparse greedy Gaussian process regression. In: Neural Information Processing Systems (NIPS) (2001)
15. Hinton, G.E., Salakhutdinov, R.R.: Using deep belief nets to learn covariance kernels for Gaussian processes. In: Neural Information Processing Systems (NIPS) (2008)
16. Calandra, R., Peters, J., Rasmussen, C.E., Deisenroth, M.P.: Manifold Gaussian processes for regression. In: International Joint Conference on Neural Networks (IJCNN) (2016)
17. Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In: Neural Information Processing Systems (NIPS) (2006)
18. Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D.J., Eslami, S.A., Teh, Y.W.: Neural processes. In: International Conference on Machine Learning (ICML) (2018)
19. Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Neural Information Processing Systems (NIPS) (2008)
20. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning (ICML) (2017)
21. Minka, T.: Bayesian linear regression. MIT Technical Report (2000)
22. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge (2012)
23. Harrison, J., Sharma, A., Pavone, M.: Meta-learning priors for efficient online Bayesian regression (extended version). arXiv:1807.08912 (2018)
24. Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Prabhat, Adams, R.: Scalable Bayesian optimization using deep neural networks. In: International Conference on Machine Learning (ICML) (2015)
25. Bauer, M., Rojas-Carulla, M., Swiatkowski, J.B., Schölkopf, B., Turner, R.E.: Discriminative k-shot learning using probabilistic models. arXiv:1706.00326 (2017)
26. Morris, C.N.: Parametric empirical Bayes inference: theory and applications. J. Am. Stat. Assoc. 78, 47–55 (1983)
27. Ljung, L., Söderström, T.: Theory and Practice of Recursive Identification. MIT Press, Cambridge (1983)
28. Grant, E., Finn, C., Levine, S., Darrell, T., Griffiths, T.: Recasting gradient-based meta-learning as hierarchical Bayes. In: International Conference on Learning Representations (ICLR) (2018)
29. Santos, R.J.: Equivalence of regularization and truncated iteration for general ill-posed problems. Linear Algebra Appl. 236, 25–33 (1996)
30. Cioffi, J., Kailath, T.: Fast, recursive-least-squares transversal filters for adaptive filtering. IEEE Trans. Acoust. Speech Sig. Process. 32, 304–337 (1984)
31. Finn, C., Xu, K., Levine, S.: Probabilistic model-agnostic meta-learning. In: Neural Information Processing Systems (NIPS) (2018)
32. Kim, T., Yoon, J., Dia, O., Kim, S., Bengio, Y., Ahn, S.: Bayesian model-agnostic meta-learning. In: Neural Information Processing Systems (NIPS) (2018)
33. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: International Conference on Learning Representations (ICLR) (2014)
34. Yan, X., Yang, J., Sohn, K., Lee, H.: Attribute2Image: conditional image generation from visual attributes. In: European Conference on Computer Vision (ECCV) (2016)
35. Yu, K., Tresp, V., Schwaighofer, A.: Learning Gaussian processes from multiple tasks. In: International Conference on Machine Learning (ICML) (2005)
36. Azizzadenesheli, K., Brunskill, E., Anandkumar, A.: Efficient exploration through Bayesian deep Q-networks. arXiv:1802.04412 (2018)
37. Bertinetto, L., Henriques, J.F., Torr, P.H., Vedaldi, A.: Meta-learning with differentiable closed-form solvers. arXiv:1805.08136 (2018)
38. Clavera, I., Nagabandi, A., Fearing, R.S., Abbeel, P., Levine, S., Finn, C.: Learning to adapt: meta-learning for model-based control. arXiv:1803.11347 (2018)
39. Svensson, A., Schön, T.B.: A flexible state-space model for learning nonlinear dynamical systems. Automatica 80, 189–199 (2017)
40. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI gym. arXiv:1606.01540 (2016)
41. Schmerling, E., Leung, K., Vollprecht, W., Pavone, M.: Multimodal probabilistic model-based planning for human-robot interaction. In: IEEE International Conference on Robotics and Automation (ICRA) (2018)
Deep BOO! Automating Beam Orientation Optimization in Intensity-Modulated Radiation Therapy

Olalekan Ogunmolu¹, Michael Folkerts¹, Dan Nguyen¹, Nicholas Gans², and Steve Jiang¹
¹ Department of Radiation Oncology, UT Southwestern Medical Center, 2280 Inwood Road, Dallas, TX 75390-9303, USA
{olalekan.ogunmolu,dan.nguyen,steve.jiang}@utsouthwestern.edu, [email protected]
² Department of Electrical Engineering, University of Texas at Dallas, Richardson, TX 75080, USA
[email protected]

Abstract. Intensity-Modulated Radiation Therapy (IMRT) is a method for treating cancers by aiming radiation at the cancer tumor while minimizing radiation to organs-at-risk. Usually, radiation is aimed from a particle accelerator mounted on a robot manipulator. Computationally finding the correct treatment plan for a target volume is often an exhaustive combinatorial search problem, and traditional optimization methods have not yielded real-time feasible results. Aiming to automate the beam orientation and intensity-modulation process, we introduce a novel set of techniques leveraging (i) pattern recognition, (ii) Monte Carlo evaluations, (iii) game theory, and (iv) neuro-dynamic programming. We optimize a deep neural network policy that guides Monte Carlo simulations of promising beamlets. Seeking a saddle equilibrium, we let two fictitious neural network players, within a zero-sum Markov game framework, alternatingly play a best response to their opponent's mixed strategy profile. During inference, the optimized policy predicts feasible beam angles on test target volumes. This work merges the beam orientation and fluence map optimization subproblems of the IMRT sequential treatment planning system into one pipeline. We formally introduce our approach, and present numerical results for coplanar beam angles on prostate cases.
1 Introduction

In this paper, we will present the preliminary results of a multi-disciplinary research project to design a real-time feasible treatment planning optimization in intensity-modulated radiation therapy (IMRT). IMRT is a cancer treatment method that delivers geometrically-shaped, high-precision x-rays or electron beams to tumors by modulating the intensity of the radiation beam. A multileaf collimator shapes a conventional geometrical field, and the intensity of the geometric field shape is varied bixel-wise in order to modulate the "fluence" profile around a tumor. This is done while the patient lies in a supine position on a treatment table. Before treatment, a doctor contours the critical structures (or tumors) and organs-at-risk (OARs) within a target volume (the region of the patient's computed tomography (CT) or magnetic resonance (MRI) scan that
contains the tumor and other organs) and then prescribes doses that must be delivered. Each beam to be delivered consists of beamlets, aimed from the same angle, where each beamlet may have an intensity different from that of its neighbors. Radiation intensities may be delivered from about 5–15 different beam orientations with multiple collimator units. The process of choosing which beam angle is best for delivering beamlet intensities is termed beam orientation optimization (BOO), while the process of determining what intensity meets a fluence profile prescribed by a doctor is termed fluence map optimization (FMO). Commonly, the set of beams is constrained to lie in a single plane passing through the critical structures. When just the robot's tool frame is used to adjust the fluence intensity, we have coplanar beams. In this work, we focus on finding good coplanar beams in BOO problems since, commonly, only coplanar beams are employed [1]. We consider fictitious self-play as a practical application for developing an effective beam orientation selection strategy in a scenario involving two rational decision-making agents that: (i) do not communicate their policies to each other (i.e. the game is non-cooperative), and (ii) behave reactively in order to adequately explore the state space.

Contributions: We transform the BOO problem into a game planning strategy. Coupled with neural fictitious self-play, we refine the predictions from a neural network policy to drive the weights of the policy to a saddle equilibrium [2]. To this end,

• we devise a tree lookout strategy for games with large state spaces to guide the transition from one beam angle set to another;
• the sparse tree lookout strategy is guided by a deep neural network policy, which produces a utility (or value) function that characterizes the policy's preference for an outcome, and a subjective probability distribution, which describes the policy's belief about all relevant unknown factors at each time step;
• in a zero-sum two-player Markov decision game of perfect information, either player finds an alternating best response to the current player's average strategy; this drives the policy's weights toward an approximate saddle equilibrium [3];
• this aids the network in finding an approximately optimal beam angle candidate set that meets the doctor's dosimetric requirements.

This work presents the first comprehensive description of Monte-Carlo Tree Search (MCTS) within the framework of BOO. It adds new pseudocode that transforms BOO into a game planning strategy. The game recovers an approximately optimal set of beamlets and an optimized fluence during TP. There has been related work on beam orientation optimization. Craft [4] locally tuned a beam angle set within the system's continuous state space using linear programming duality theory, and found that the BOO problem for a simple 2D pancreatic case has about 100 minima when the optimization is warm-started from 100 different beam angles. Building on Craft's work, Bertsimas et al. [5] settled on a two meta-step algorithm: dynamically selecting beam angle candidates within the phase space via local minimum search with gradient descent, then exploring a different portion of the solution space by taking finite steps of simulated annealing.
While [5] uses global and local search, improving on solutions obtained from equispaced angles, the method has the drawback of treating the objective function as convex and assuming the doctor's preferences can be represented as a set of linear constraints. Jia et al. [6] split the
problem into two subproblems: first progressively identifying non-consequential beam angles by evaluating a multi-objective function, and then optimizing the fluence on beam angles that are assigned a high confidence by a dosimetric objective function. Heuristic search strategies have also previously been developed, e.g. [7–9]. Li et al. [10] used a genetic algorithm fitness metric to select a subset of beams from a discrete candidate set of beam angles; a conjugate-gradient dose objective then optimized the intensity of the chosen beams. Other works treat IMRT treatment planning (TP) as an inverse optimization problem, with techniques ranging from adaptive ℓ2,1 optimization [6] to mixed integer linear programming [11–13] and simulated annealing [1, 14].
2 Methods and Materials

For games with perfect information, there is an optimal value function, v*(s), that decides the game's outcome for every possible state, s ∈ S, under perfect play. One could devise a planning strategy that guides the search for optimistic beam angle configurations within the setup's phase space by using a probability distribution, p(s, a), over a set of deterministic pure strategies for the tree. The search for an approximately optimal beam angle set is performed by optimizing the parameters of a function approximator, ψ (here, a deep neural network with multiple residual blocks as in [15]), that approximates a policy π. The policy guides simulations of 'best-first' beam angle combinations for a sufficiently large number of iterations – essentially a sparse lookout simulation that selectively adjusts beams that contribute the least to an optimal fluence. Successor nodes beneath a terminal node are approximated with a value, v(s), to assure efficient selectivity. We maintain a probability distribution over possible states, based on a set of observation probabilities for the underlying Markov Decision Process (MDP). Let us now formalize the definitions and notation used throughout the rest of this document.

Fig. 1. [Left]: Concatenation of the target volume masks and the beam angles before feeding the input planes to the residual tower neural network. The first six planes (top-most mask of the left figure) contain the delineated organs and the PTV. This is concatenated with a block of m beams from the current time step, regressed to the previous 5 time steps (here, 5 was heuristically chosen). [Right]: each beam angle in a beam block is represented as shown. Together with the target volume, these form an input plane of size 36 × N × W × H to the policy/value neural network tower of residual blocks.

2.1 Notations and Definitions

The state of the dynamical system will be denoted by s ∈ S; it is to be controlled by a discrete action a ∈ A. States evolve according to the (unknown) dynamics p(s_{t+1} | s_t, a_t), which we want to learn. The learning problem is posed within a discrete finite-time horizon, T, while a beam angle combination search task can be defined by a reward function, r(s_t, a_t), which can be found by recovering a policy, p(a_t | s_t; ψ), that specifies a distribution over actions conditioned on the state, parameterized by the weights of a neural network, a tensor ψ. Without loss of generality, we denote the action conditional p(a_t | s_t, ψ) as π_ψ(a_t | s_t). Recovering the optimal weights may consist of the maximization problem
ψ* = arg max_ψ Σ_{t=1}^T E_{(s_t, a_t) ∼ p(s_t, a_t | ψ)} [ r(s_t, a_t) ].
Definition 1. A beam block is a concatenation of beams, {θ_1, θ_2, . . . , θ_m}, as a tensor of dimension m × N × H × W (see Fig. 1 and Table 1) that, together with the patient's CT mask, forms the state s_t at time step t.

Table 1. Table of notations commonly used in this article

m: dimensionality of a node's beam set, e.g. m = 5
n: dimension of discretized beam angles, e.g. n = 180 for a 4° angular resolution
Θ: discretized beam angle set, e.g. equally spaced angles between 0° and 360°, spaced apart at 4°
a_t ∈ A: control or action at time step t ∈ [1, T], used in determining the probability of transitioning from one beam angle subset to another within Θ
θ_j ⊆ Θ: beam angles selected from Θ, e.g. θ_k ∈ R^m
s_t ∈ S: Markovian system state at time step t ∈ [1, T], e.g. patient contour and beam angle candidates; dimensionality 2,727,936 to 3,563,520
γ: discount factor, e.g. 0.99
f_ψ: parametric function approximator (deep neural network policy) for state s_t
v_ψ(s): value estimate of state s_t, as predicted by f_ψ
p(s): probability distribution over the current state s, generated by the neural network policy
Q(s, a): action-state values that encode the "goodness" of a beam-angle set θ_k ∈ R^m, where m is the number of beams considered for a fluence generation, e.g. m = 5
B_Xt: a concatenation of the beams in consideration at time step t, as a block of beams fed to the neural network policy
D_ij(θ_k): dose matrix containing the dose influence to voxel i from beam angle θ_k, ∀ k ∈ {1, 2, . . . , n}, where n is the range of the beam set B
D_t: dose mask for the target volume in consideration at time step t
Definition 2. Every node of the tree, x, has the following fields: (i) a pointer to the parent that led to it, x.p; (ii) the beamlets, x_b, stored at that node, where b = {1, . . . , m}; (iii) a prior over move probabilities, p(s, a); (iv) a pointer, x.r, to the reward, r_t, for the state s_t; (v) a pointer to the state-action value Q(s, a) and its upper confidence bound U(s, a) (7); (vi) a visit count, N(s, a), that indicates the number of times that node was visited; and (vii) a pointer x.child_i to each of its child nodes. We want an adaptive allocation rule for determining the transition between states since we do not know which node may yield the best bandit, as a player might be biased towards always selecting the beam set with the maximum value. Therefore, we define the state broadly enough to capture all subjective unknowns that might influence the payoff/reward to be received by a rational decision-making agent; we then leverage the upper confidence bound algorithm of Agrawal et al. [16] to assure asymptotic logarithmic regret behavior. We attach a regret term U(n(s)) to the Q-value so as to ensure the optimal beam does not evade the simulation, i.e., Q̂(s, a) − U(n(s)) ≤ Q(s, a) ≤ Q̂(s, a) + U(n(s)) (with Q̂ the running estimate); the width of this confidence bound guides the exploration strategy for states that are momentarily unpromising in value but may later emerge as promising states. Other notations used in the article are delineated in Table 1.
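A minimal sketch of the node record in Definition 2 is given below; all names are ours, and the generic UCB1-style exploration bonus stands in for the paper's U(s, a) (whose exact form is given in the paper's Eq. (7), not reproduced here).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
import numpy as np

@dataclass
class Node:
    """Search-tree node holding the fields listed in Definition 2."""
    parent: Optional["Node"] = None                           # x.p
    beams: List[int] = field(default_factory=list)            # x_b, the m beamlets
    prior: Dict[int, float] = field(default_factory=dict)     # p(s, a)
    reward: float = 0.0                                       # x.r
    Q: Dict[int, float] = field(default_factory=dict)         # Q(s, a)
    N: Dict[int, int] = field(default_factory=dict)           # visit counts N(s, a)
    children: Dict[int, "Node"] = field(default_factory=dict) # x.child_i

def ucb_score(node: Node, a: int, c: float = 1.0) -> float:
    """Q(s, a) plus a prior-weighted UCB1-style bonus (placeholder for U(s, a))."""
    n_total = sum(node.N.values()) or 1
    u = c * node.prior.get(a, 0.0) * np.sqrt(np.log(n_total) / (1 + node.N.get(a, 0)))
    return node.Q.get(a, 0.0) + u
```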
2.2 Data Preprocessing
Patient Mask. We obtained 77 anonymized patient CT scans and their associated dose matrices. The scans relate to prostate cases used in previous treatment plans. Six structures are present within the scans: the patient's body, bladder, left and right femoral heads, rectum, and the planning target volume (PTV) or tumor. Each patient's scan, D, is represented in 3D as N × W × H, where N is the total number of slices, and W and H are the respective slice width and height. We resized each slice to a square-shaped 2D matrix of size 64 × 64. We generate 3D images that represent the orientation of the robot with respect to the patient for each discretized beam angle.
Neural Network Architecture
In addition to the resized masks, D, we define five feature planes, Xt, as a block of beam configurations, BXt, where BXt denotes the beam angles that generate the current fluence. For example, with five beams forming the fluence's geometric shape, BXt would contain the 3D images of the beams being considered at time step t. We augment the state with a memory of the five previously used beam blocks, {BXt, . . . , BXt−5}, in order to mitigate the uncertainty of decisions under this incomplete MDP formulation. The dose masks and beam blocks are as shown in Fig. 1. The input planes to the network are sized as T × N × H × W, where T is the total number of input planes (T = 6 structures + 5 beams + 5 × 5 regressed beams = 36). Thus, the input to the network is arranged as: st = [Dt, BXt, BXt−1, BXt−2, BXt−3, BXt−4, BXt−5]. We use a modern neural network architecture with many residual blocks [15] so that each layer of the network fits a residual nonlinear mapping to its input data. We end up with a deeply stacked network whose input features, st, are processed by 34 residual blocks
described as follows: (i) a 3D convolution with 64 × l filters, a square kernel of width 7, and double-strided convolutions, where l is the depth of the stack in the network; (ii) a 3D batch normalization layer [17]; (iii) nonlinear rectifiers [18]; (iv) a 3D convolution of 64 × l filters; (v) a 3D batch normalization layer; (vi) a skip connection from the input to the block, in order to facilitate efficient gradient propagation; and (vii) nonlinear rectifiers.

We split the output of the network into two heads: (i) the first head is a probability distribution over which angle in the current beam block contributes the least to an optimal fluence cost at the current time step; (ii) the second head estimates the value of the subtree beneath the current node. The probability head comprises two residual blocks, each containing the following modules: (i) a 3D convolution of 64 × l filters with a square kernel of size 1 and a single-strided convolution; (ii) a 3D batch normalization layer; (iii) nonlinear rectifiers; (iv) a 3D convolution of 64 × l filters; (v) a 3D batch normalization layer; (vi) a skip connection from the input to the block; (vii) nonlinear rectifiers; (viii) a fully connected layer that maps the resulting output to the total number of discretized beam angle grids; and (ix) a softmax layer that maps the neuron units to logit probabilities pi(s|a) for all beam angles.

The value head applies the following transformation modules: (i) a 3D convolution of 64 × l filters with a square kernel of size 1 and a single-strided convolution; (ii) a 3D batch normalization layer; (iii) nonlinear rectifiers; (iv) a second 3D convolution of 64 × l filters; (v) a 3D batch normalization layer; (vi) a skip connection from the input to the block; (vii) nonlinear rectifiers; and a sequential layer consisting of

• a linear layer mapping the output of the above connections to a hidden layer of 512 units,
• followed by a linear layer mapping to 256 hidden units, then rectified nonlinearities,
• followed by a linear layer mapping to a scalar value, then rectified nonlinearities,
• and a tanh nonlinearity that maps the output to the closed interval [−1, 1].
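One way to read the block structure above is the following PyTorch-style sketch (an assumption on our part: the class names and the 3-wide kernel are illustrative, and the paper's first block uses 7-wide, double-strided convolutions, which would additionally require a strided projection on the skip path):

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Items (i)-(vii): conv -> BN -> ReLU -> conv -> BN -> skip -> ReLU."""
    def __init__(self, channels, kernel=3):
        super().__init__()
        pad = kernel // 2  # keep the spatial shape so the skip connection matches
        self.conv1 = nn.Conv3d(channels, channels, kernel, padding=pad)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel, padding=pad)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection eases gradient propagation

class ValueHead(nn.Module):
    """The 512 -> 256 -> 1 -> tanh stack enumerated in the bullet list."""
    def __init__(self, in_features):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_features, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),  # value estimate in [-1, 1]
        )

    def forward(self, x):
        return self.mlp(torch.flatten(x, 1))
```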
The network's parameters were initialized using the proposal in [19]. The value and probability distribution heads are inspired by Bayesian decision theory, where it is expected that a rational decision-maker's behavior is describable by a utility function (or value function) – a quantitative characterization of the policy's preferences for an outcome – and a subjective probability distribution, which describes the policy's belief about all relevant unknown factors. When new information is presented to the decision-maker, the subjective probability distribution is revised. Decisions about the optimal beam angle combination at the current time step are made under uncertainty, so we use a probability model to choose among lotteries (i.e., probability distributions over all discretized beam angles in the setup). Each state during our learning process is constructed by appending the beam block at the current time step to a history of beam blocks from the previous five time steps using a FIFO policy. Specifically, when we transition to a new state, the beam block that has been in the state set for the longest time (i.e., at the head of the queue) is deleted first, and the new state's beam block is enqueued at the tail, as in a queue data structure (see the sketch below). This minimizes the partial observability of the system.
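A minimal sketch of that FIFO buffer (names and constants are ours, matching the six blocks {BXt, . . . , BXt−5} described above):

```python
from collections import deque

HISTORY = 6  # the current beam block plus the five previous ones

beam_history = deque(maxlen=HISTORY)  # a bounded deque evicts the head automatically

def push_beam_block(block):
    """Enqueue the newest beam block at the tail; the oldest is dropped when full."""
    beam_history.append(block)

def make_state(dose_mask):
    """Assemble s_t = [D_t, BX_t, BX_{t-1}, ..., BX_{t-5}], newest block first."""
    return [dose_mask] + list(reversed(beam_history))
```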
2.4 Fluence Map Optimization
Suppose X is the total number of discretized voxels of interest (VOIs) in a target volume, and B1 ∪ B2 ∪ . . . ∪ Bn ⊆ B represents the partition subset of a beam B, where n is the total number of beams from which radiation can be delivered. Let Dij(θk) be the matrix that describes each dose influence, di, delivered to a discretized voxel, i, in a volume of interest, VOIh (h = 1, . . . , X), from a beam angle, θk, k ∈ {1, . . . , n}. One can compute the matrix Dij(θk) by calculating each di for every bixel, j, at every φ° resolution, where j ∈ Bk. Doing this, we end up with an ill-conditioned sparse matrix, Dij(θk), which consists of the dose to every voxel, i, incident from a beam angle, θk, at every 360°/φ° (in our implementation, we set φ to 4°).

For a decision variable, xj, representing the intensities of beamlets, it is straightforward to find the dose influence, di, that relates the bixel intensities, xj, to the voxels of interest, VOIh. The fluence problem is to find the values of xj for which the di to the tumor is maximized while the di to critical structures is simultaneously minimized. For the voxels in the target volume, a weighted quadratic objective minimizes the ℓ2 distance between a pre-calculated dose Ax (where x represents the vectorized bixels, xj) and a doctor's prescribed dose, b, while a weighted quadratic objective maximizes the ℓ2 distance between Ax and b for the critical structures. The pre-calculated dose term is given by

\[ Ax = \Big\{ \sum_s \frac{w_s}{v_s} D_{ij}^s x_s \;\Big|\; D_{ij} \in \mathbb{R}^{n \times l},\; n > l \Big\}, \]

which is a combination of the dose components that belong to OARs and those that belong to PTVs. Let \( w_s = \{w_s, \bar{w}_s\} \) represent the respective underdosing and overdosing weights for OARs and PTVs, and let \( v_s \) represent the total number of voxels in each structure. We define the following cost:

\[ \sum_{s \in \mathrm{PTVs}} \frac{1}{v_s} \big\| (b_s - w_s D_{ij} x_s)_+ \big\|_2^2 \;+\; \sum_{s \in \mathrm{OARs}} \frac{1}{v_s} \big\| (\bar{w}_s D_{ij} x_s - b_s)_+ \big\|_2^2, \tag{1} \]

where the underdosing weights are typically set as \( w_s = 0 \) to deliver minimal dose to critical structures, while the overdosing weights are chosen to deliver the prescribed dose to the tumor; \( (\cdot)_+ \) is a Euclidean projection onto the nonnegative orthant \( \mathbb{R}_+ \). We can rewrite the above objective, subject to nonnegative bixel intensity constraints, as the minimization problem

\[ \min_x \; \frac{1}{2}\lVert Ax - b \rVert_2^2 \quad \text{subject to } x \geq 0. \]

The Lagrangian thus becomes

\[ L(x, \lambda) = \frac{1}{2}\lVert Ax - b \rVert_2^2 - \lambda^T x, \]

where \( \lambda \in \mathbb{R}^n \) is a multiplier. This problem can be solved with dual gradient descent (DGD), but DGD has the drawback that the primal and dual updates are not robust to the objective's constraints [20]. The alternating direction method of multipliers (ADMM) [20] tackles the robustness problem by adding a quadratic penalty term to the Lagrangian and alternately updating the x and λ variables in a "broadcast and gather" process. This is attractive since we will be solving a large-scale learning problem for the optimal beam angle set combination. Introducing an auxiliary variable z, we have

\[ \min_x \; \frac{1}{2}\lVert Ax - b \rVert_2^2, \quad \text{subject to } z = x,\; z \geq 0, \]

so that the (augmented) Lagrangian can be written as

\[ \min_{x,z} \; \frac{1}{2}\lVert Ax - b \rVert_2^2 - \lambda^T (z - x) + \frac{\rho}{2}\lVert z - x \rVert_2^2, \tag{2} \]

where ρ > 0 is an ADMM penalty parameter. Minimizing (2) w.r.t. x, the x-subproblem of (2) yields

\[ \min_x \; \frac{1}{2} x^T (A^T A + \rho I) x + (\lambda - A^T b - \rho z)^T x, \]

so that the x-update (due to the convex quadratic nature of the problem) becomes

\[ x^{k+1} = \big(A^T A + \rho I\big)^{-1}\big(A^T b + \rho z^k - \lambda^k\big). \tag{3} \]

Similarly, the z-update for (2) can be found from the z-minimization subproblem

\[ \min_z \; -\lambda^T z + \frac{\rho}{2}\lVert z - x \rVert_2^2 \;:=\; \min_z \; \frac{\rho}{2}\Big\lVert z - x - \frac{\lambda}{\rho} \Big\rVert_2^2. \]

Using the soft-thresholding operator, \( S_{\lambda/\rho} \), we find that

\[ z^{k+1} = S_{\lambda/\rho}\big(x^{k+1} + \lambda^k\big), \tag{4} \]

where \( S_{\lambda/\rho}(\tau) = (\tau - \lambda/\rho)_+ - (-\tau - \lambda/\rho)_+ \). λ is updated as

\[ \lambda^{k+1} = \lambda^k - \gamma\big(z^{k+1} - x^{k+1}\big), \tag{5} \]

where γ controls the step length. The inverse operation in (3) can be carried out with any iterative solver, e.g., conjugate gradient. We use an over-relaxation parameter, \( \alpha^k = 1.5 \), and set the quadratic penalty to ρ = 1.5, using \( \alpha^k A x^{k+1} - (1 - \alpha^k) z^k \) in the z and λ updates. The stopping criterion is met when the primal and dual residuals are sufficiently small, i.e., \( \lVert r^k \rVert = \lVert x^k - z^k \rVert_2 \leq \epsilon^{\mathrm{pri}} \) and \( \lVert s^k \rVert = \lVert -\rho (z^{k+1} - z^k) \rVert_2 \leq \epsilon^{\mathrm{dual}} \), with

\[ \epsilon^{\mathrm{pri}} = \sqrt{\rho}\,\epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}} \max\{\lVert x^k \rVert_2, \lVert -z \rVert_2\}, \quad \text{and} \quad \epsilon^{\mathrm{dual}} = \sqrt{n}\,\epsilon^{\mathrm{abs}} + \epsilon^{\mathrm{rel}} \lVert \rho \lambda^k \rVert, \tag{6} \]

where \( \epsilon^{\mathrm{pri}} > 0 \), \( \epsilon^{\mathrm{dual}} > 0 \) are the primal and dual feasibility tolerances for the primal and dual feasibility conditions (see [20, Sect. 3.3]). In this work, we set \( \epsilon^{\mathrm{abs}} = 10^{-4} \) and \( \epsilon^{\mathrm{rel}} = 10^{-2} \).
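To make the update sequence (3)-(5) and the stopping test (6) concrete, the following is a minimal NumPy sketch (assumed code, not the authors' implementation): over-relaxation is omitted, the dual step length is fixed at γ = ρ, and the tolerance scaling follows the standard recipe of [20].

```python
import numpy as np

def admm_nnls(A, b, rho=1.5, eps_abs=1e-4, eps_rel=1e-2, max_iter=500):
    """ADMM for min 0.5*||Ax - b||_2^2 s.t. x >= 0, following (3)-(5).

    Factors (A^T A + rho*I) directly for clarity; for large problems the
    inverse in (3) would instead be applied with, e.g., conjugate gradient.
    """
    n = A.shape[1]
    x, z, lam = np.zeros(n), np.zeros(n), np.zeros(n)
    AtA = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    for _ in range(max_iter):
        x = np.linalg.solve(AtA, Atb + rho * z - lam)   # x-update (3)
        z_old = z
        z = np.maximum(x + lam / rho, 0.0)              # z-update (4): project onto z >= 0
        lam = lam - rho * (z - x)                       # dual update (5) with gamma = rho
        r = np.linalg.norm(x - z)                       # primal residual
        s = np.linalg.norm(rho * (z - z_old))           # dual residual
        eps_pri = np.sqrt(n) * eps_abs + eps_rel * max(np.linalg.norm(x), np.linalg.norm(z))
        eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(lam)
        if r <= eps_pri and s <= eps_dual:              # stopping criterion, cf. (6)
            break
    return z
```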
2.5 Game Tree Simulation
Consider b^d possible move sequences of a robot–patient setup, where b is the number of discretized candidate beam angles and d is the number of beam angles chosen to construct a fluence. With b = 180 and d = 5, we have 180^5 possible search directions, rendering exhaustive search infeasible. Therefore, we leverage Monte Carlo simulations, encouraged by their recent success in large games [21–23], to break the curse of dimensionality [24]. MCTS combines traditional min-max evaluation of a tree's leaves with Monte Carlo evaluations [25]. MCTS iteratively runs random simulations from a tree's root position through its children nodes by randomly choosing actions until arrival at a terminal state. A back-up operator progressively averages the outcomes of many random simulations toward min-max as the amount of simulation expands. Thus MCTS solves a min-max [2] problem by default.

We iteratively sample beam angles – performing a lookahead search from the current state at a fixed depth. We restrict samples to 90 discretized beams in Θ. We then progressively add children nodes using an expand_policy (Algorithm 1), guided by move probabilities p(s, a) generated by a network policy fψ, which either recursively expands the current node or rolls out the current simulation to completion. As we recursively traverse the edges of the tree, we need to prevent "angle collisions". To avoid these, we introduce a minimum pairwise distance, d̄i ∈ R+, between beamlets, defined as ‖θi − θj‖ ≥ d̄i, ∀ {j ∈ m \ i}, with d̄i = 20°.

Repeatedly performing roll-outs, a history of state-action value pairs along the tree's edges is kept. This aids faster convergence if the same state is encountered more than once, because we can bias an action selection based on old actions that were chosen. We compute the mean outcome of every simulation through state s in which action a is selected, i.e., the tree's Q(s, a)-value, as

\[ Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{n(s)} I_i(s, a)\, \zeta_i, \]

where \( I_i(s, a) \) is an indicator function given by

\[ I_i(s, a) = \begin{cases} 1, & \text{if } a \text{ was selected on the } i\text{th policy rollout} \\ 0, & \text{otherwise,} \end{cases} \]

\( N(s, a) = \sum_{i=1}^{n(s)} I_i(s, a) \) is the total number of simulations in which action a was selected in state s, n(s) is the total number of times a game is played through state s, and ζi is the outcome of the ith simulation played out from s. Specifically, during simulation, each state and action in the search tree is updated as: n(st) ← n(st) + 1; N(st, at) ← N(st, at) + 1; Q(st, at) ← Q(st, at) ± r(st, at), where r(st, at) is the reward/cost gained or incurred by the agent after action at in state st.

After each simulation, a 'best move' for the current beam block is selected. We exponentiate the move probabilities by a temperature slightly larger than unity to encourage diversity in early play as follows:

\[ p(a|s_0; \psi) = \frac{N(s_0, a)^{1/\tau}}{\sum_b N(s_0, b)^{1/\tau}}, \]

where τ is the temperature factor that diversifies the move probabilities. The modified UCT algorithm applied to optimal beam angle selection is presented in Algorithm 1.
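For instance, the temperature-exponentiated move probabilities can be computed as in the following NumPy sketch (the value of τ is illustrative):

```python
import numpy as np

def move_probabilities(visit_counts, tau=1.1):
    """p(a|s0) proportional to N(s0, a)^(1/tau); tau slightly above 1 flattens
    the distribution and diversifies early play."""
    counts = np.asarray(visit_counts, dtype=float)
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()
```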
Algorithm 1. Deep BOO Monte Carlo Tree Search

function MCTS(s0, c)
    x0 ← x0(s0)
    while search_time < budget do
        x̄ ← EXPAND_POLICY(x0, c)
        x̄.r ← FMO_POLICY(x̄)
        BACK_UP(x̄, x̄.r)
    end while
    return BEST_CHILD(x0)
end function

function SELECT_MOVE(x, c)
    if p1 to play then
        return argmax_{x̄ ∈ x} Q(x̄) + K(x̄)
    else
        return argmin_{x̄ ∈ x} Q(x̄) − K(x̄)
    end if
end function

function EXPAND_POLICY(x, c)
    while x nonterminal do
        if x not fully expanded then
            return EXPAND(x, c)
        else
            x ← x[BEST_CHILD(x)]
        end if
    end while
    return x
end function

function FMO_POLICY(x)
    return r = −h(x(s)|·)
end function

function FULLY_EXPANDED(x, d̄)
    d ← pairwise_distance(x.s)
    min_elem ← min(d)
    if min_elem < d̄ then
        return True
    else
        return False
    end if
end function

function EXPAND(x, c)
    ā ← SELECT_MOVE(x, c)
    sample θ̄ with x.p(s, a)
    update θ̄ ← θ̄ + ā
    with π_{t−1}, create x̄.p(s̄, ā)
    while not x̄ ∈ x do
        add x̄ to x
    end while
    return x̄
end function

function BACK_UP(x̄, x̄.r)
    while x̄ not null do
        N(x̄) ← N(x̄) + 1
        Q(x̄) ← Q(x̄) + x̄.r
        x̄ ← parent of x̄
    end while
end function

function BEST_CHILD(x)
    if p1 to play then
        return argmin over children of x
    else
        return argmax over children of x
    end if
end function

where \( K(\bar{x}) = c\sqrt{\frac{2 \ln n(\bar{x}.s)}{N(\bar{x}.s, a)}} \), and x̄ ∈ x means that x̄ is a child of x.
Definition 3. We define an upper confidence bound, U(s, a), on Q(s, a) that adds an exploration bonus that is highest for seldom-visited state-action pairs, so that the tree expansion policy selects the action a that maximizes the augmented value:

\[ \bar{Q}(s, a) = Q_j(s, a) + c\sqrt{\frac{2 \ln n(s)}{N(s, a)}}, \quad \text{where } a = \arg\max_a \bar{Q}(s, a). \tag{7} \]
Q̄(s, a) is the highest average observed reward from node j – encouraging exploitation of the current node – and ln n(s) is the natural logarithm of the total number of roll-outs through state s. The second term in (7) encourages exploration of other beam angles, and c is a scalar exploration constant. Note that (7) is a version of the UCB1 algorithm [26]. We continually update the weights of the neural network policy in a separate thread, writing the weights to a shared memory buffer for the MCTS to read from; i.e., the search thread uses the previous iteration of the trained network policy to run the policy improvement procedure. When angles are at the edges, i.e., 0° or 360°, and an angle change outside the range 0 ≤ θ ≤ 360 is recommended, we "wrap" around to enforce cyclicity. Note that the EXPAND_POLICY and FMO_POLICY procedures of Algorithm 1 can be seen as a form of Add/Drop simulated annealing as described in [1]. While the FMO_POLICY procedure returns the node with the optimal fluence cost, the BEST_CHILD procedure compares the quality of all beam angle sets in the children of the tree's root node.
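As a concrete reading of (7) and the wrap-around rule, here is a small sketch (the function names and exploration constant are illustrative assumptions):

```python
import math

def augmented_value(q_bar, n_s, n_sa, c=1.0):
    """UCB1-style score from (7): average observed reward plus exploration bonus."""
    return q_bar + c * math.sqrt(2.0 * math.log(n_s) / n_sa)

def wrap_angle(theta_deg):
    """Enforce cyclicity at the 0/360 degree boundary."""
    return theta_deg % 360.0
```

Actions with few visits (small N(s, a)) receive a large bonus, so momentarily unpromising beam sets continue to be explored.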
2.6 Self-play Neuro-Dynamic Programming
We continually play a zero-sum FSP game between two neural networks. Without loss of generality, we call the first player p1 and the second player p2. Player p1 chooses its actions under a (stochastic) strategy, π^{p1} = {π0^{p1}, π1^{p1}, . . . , πT^{p1}} ⊆ Π^{p1}, that seeks to minimize the outcome ζ, while p2's actions are governed by a policy π^{p2} = {π0^{p2}, π1^{p2}, . . . , πT^{p2}} ⊆ Π^{p2} that seeks to maximize ζ, in order to guarantee an equilibrium solution for a game without a saddle point. Π^{pi} is the set of all possible non-stationary Markovian policies. Each player bases its decision on a random event's outcome – obtained from a mixed strategy determined by averaging the outcomes of individual plays. Together, both players constitute a two-player stochastic action selection strategy, π(s, a) = {π^{p1}, π^{p2}}, that gives the probability of selecting moves in any given state. Supposing the game simulation starts from an initial condition s0, one may write the optimal reward-to-go value function for state s in stage t, with horizon length T, as

\[ V_t^*(s) = \inf_{\pi^{p_1} \in \Pi^{p_1}} \sup_{\pi^{p_2} \in \Pi^{p_2}} \mathbb{E}\left[ \sum_{i=t}^{T-1} V_i\big(s_0, f(s_i, \pi^{p_1}, \pi^{p_2})\big) \right], \qquad s \in S,\; t = 0, \ldots, T-1, \]

where the terminal value \( V_T^*(s) = 0 \), ∀ s ∈ S; π^{p1} and π^{p2} contain the action/control sequences \( \{a_t^{p_1}\}_{0 \leq t \leq T} \) and \( \{a_t^{p_2}\}_{0 \leq t \leq T} \). The saddle-point strategies for an optimal control sequence pair \( \{a_t^{p_1*}, a_t^{p_2*}\} \) can be recursively obtained by optimizing a state-action value cost, Qt(s, a), as follows:

\[ V_t^*(s) = Q_t^*(s_t, \pi_t^{p_1}, \pi_t^{p_2}) = \min_{\pi^{p_1} \in \Pi^{p_1}} \max_{\pi^{p_2} \in \Pi^{p_2}} Q_t(s_t, \pi^{p_1}, \pi^{p_2}), \quad \forall s_t \in S, \tag{8} \]

such that \( v^{p_1} \leq v \leq v^{p_2} \), ∀ \( \{\pi_t^{p_1}, \pi_t^{p_2}\}_{0 \leq t \leq T} \),
where \( v^{p_i} \) are the respective optimal values for each player. Q(s, a) can be recovered from the cumulative reward function, R(s, a), and the probability transition function, P(s, a), as

\[ Q_t^*(s_t, \pi_t^{p_1}, \pi_t^{p_2}) = R(s, a) + \gamma \sum_{x \in S} P(s, a)(x)\, V_{t+1}(x). \]
Under ideal conditions, it is desirable to determine the optimal value function under perfect play; however, given the curse of dimensionality for BOO problems, the best we can hope for is an approximately optimal value vψ(s), obtained by continually estimating the value function vψ^p(s), e.g., using a policy parameterized by a large function approximator such as a deep neural network fψ, so that vψ(s) ≈ vψ^p(s) ≈ v*(s). Here ψ are Lipschitz basis functions that are parameters of the function approximator.

The network, fψ, predicts a probability distribution over all beam angle configurations, pa = p(s, a), and a value, vψ(s) – an estimate that the current beam angle set θ may be the optimal beam set. For a game, Γ, suppose that \( y = \{y_1, \ldots, y_m \mid \sum_{i=1}^m y_i = 1\} \) and \( z = \{z_1, \ldots, z_n \mid \sum_{i=1}^n z_i = 1\} \) are the respective probability distributions for players p1 and p2, defined on the m- and n-dimensional simplices, respectively. The average value of the game will correspond to player p1 minimizing a cost J(y, z) = y^T Γ z and player p2 maximizing J(y, z). Each player's action is governed by a mixed strategy – obtained by adding a Gaussian random walk sequence with standard deviation 2 to the prior probability distribution predicted by the neural network policy or computed by the tree policy; this is then normalized by the sum of the resulting noised distribution. Players p1's and p2's strategies are independent random variables, repeatedly implemented during game simulations. As the number of times the game is played gets larger, the frequency with which different actions for p1 and p2 are chosen will converge to the probability distributions that characterize their random strategies [27, p. 24].

The network policy, π(·|ψt), and search tree, Γ(πψ(·)), are optimized in separate concurrent threads; to assure non-blocking of the search and network optimization processes, the network's weights are written to a shared memory map, where they are asynchronously updated by gradient descent, while the tree search runs in a parallel thread from a previous iteration of the network policy, π(·|ψt−1). At the end of a full MDP iteration, we compare the values predicted by either player, average their mixing strategies, and update the gradients of the loss function with respect to the values. We train the probability distribution over current beams by maximizing the similarity between the computed search probabilities π and the predicted distribution p with the cross-entropy loss, \( \Delta\psi_p = \frac{\partial \log p_\psi(a|s)}{\partial \psi} (\pi - p) \), and we take the network weights' induced tensor norm (given its robustness to the asymmetrical network modular weights). Altogether, we minimize the combined loss

\[ l = (\zeta - v)^2 - \pi^T \log(p) + \lambda \lVert \psi \rVert_2^2, \]

where λ (set to 1.5) controls regularization of the network weights to avoid overfitting. The cross-entropy term was weighted by 0.9, and the mean squared error (MSE) loss by 0.01, to keep the overall loss in the direction of persistent reduction in training error. These values were found by empirical evaluations of the loss surface.
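A hedged PyTorch-style sketch of the combined loss follows (the function signature is an assumption on our part; the weights are those quoted above):

```python
import torch

def boo_loss(value_pred, outcome, policy_logp, search_pi, params,
             w_ce=0.9, w_mse=0.01, lam=1.5):
    """l = w_mse*(zeta - v)^2 - w_ce * pi^T log(p) + lam * ||psi||^2."""
    mse = (outcome - value_pred).pow(2).mean()                 # value-head error
    xent = -(search_pi * policy_logp).sum(dim=-1).mean()       # cross-entropy to search probabilities
    reg = sum(p.pow(2).sum() for p in params)                  # weight regularization
    return w_mse * mse + w_ce * xent + lam * reg
```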
Fig. 2. Separated target volume organs and PTV masks.
3 Results

This section presents results on 3D prostate cases. We start the training process by randomly adding five beam blocks to the state queue, as described in Sect. 2. The input planes are then passed through the tower residual network, from which probability distributions and a value are predicted. We add a random walk sequence to this pure strategy, generating a mixed strategy, and subsequently construct the tree. This mixed strategy guides the search for approximately optimal beam angles. When a move is selected, the current node is expanded, resulting in a new set of beamlets. The fluence map is found via (1), and the reward of the maximizing player, or the cost of the minimizing player, is updated. We continue expanding the leaves of the tree until we reach a terminal leaf – when more than one angle is at the same node. We then compute new search probabilities and propagate the outcome of the game through the ancestors of the terminal node. Figure 2 depicts the arrangement of the respective structures within a prostate case example that we consider. Notice that the PTV is at the center of the tumor, which is consistent with the cases in this work. What follows in Tables 2 and 3 are the dose
Table 2. Dose wash plots for select patients after training the self-play network
washes obtained by superimposing the calculated dose (based on the beams that the network selects) on the CT scan map. They provide a qualitative evaluation of the computed dose for a particular case. Represented as a heat map, regions with minimal or no radiation are dark blue, while regions that receive the highest doses are red. Dose increases in intensity from blue to green to yellow to red. The intersection of beams delivers a heavy dose to the tumors (center of the slices) while largely sparing surrounding tissues. The line overlays on the plots are the angles of the incident radiation. The advantage of this policy is that finding the right beam angles is orders of magnitude faster than the current way these angles are found in the clinic. At test time, we pick the last checkpoint during training and use it to find feasible beam angles. The search process typically takes 2–3 min before we settle on a good candidate beam angle set. This is significant time saved compared to the manual tuning of beam angles by dosimetrists or radiation therapists in clinics.
Table 3. Dose wash plots for select patients during testing of self-play network
4 Conclusions

In modern clinics, solving the BOO problem involves many hours of planning, tuning, and refinement – usually by experienced treatment planners. To reduce typical classical optimization time, we adopt a hybrid approach: leveraging Monte Carlo simulation of high-dimensional state spaces, neuro-dynamic programming, and convex optimization of fluence profiles to arrive at good candidate beamlets. Our work is the first, to the best of our knowledge, that transforms the BOO problem into an MCTS strategy. One could envisage improving this formulation by training on a prior beam selection procedure based on clinician experience, or on a linear programming-based simplex procedure for pre-selecting beam candidates before further refinement by the tree search.
References

1. Aleman, D.M., Kumar, A., Ahuja, R.K., Romeijn, H.E., Dempsey, J.F.: Neighborhood search approaches to beam orientation optimization in intensity modulated radiation therapy treatment planning. J. Glob. Optim. 42(4), 587–607 (2008)
2. Ogunmolu, O., Gans, N., Summers, T.: Minimax iterative dynamic game: application to nonlinear robot control tasks. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6919–6925 (2018). https://doi.org/10.1109/IROS.2018.8594037
3. Heinrich, J., Lanctot, M., Silver, D.: Fictitious self-play in extensive-form games. In: International Conference on Machine Learning, pp. 805–813 (2015)
4. Craft, D.: Local beam angle optimization with linear programming and gradient search. Phys. Med. Biol. 52(7), N127 (2007)
5. Bertsimas, D., Cacchiani, V., Craft, D., Nohadani, O.: A hybrid approach to beam angle optimization in intensity-modulated radiation therapy. Comput. Oper. Res. 40(9), 2187–2197 (2013)
6. Jia, X., Men, C., Lou, Y., Jiang, S.B.: Beam orientation optimization for intensity modulated radiation therapy using adaptive L2,1-minimization. Phys. Med. Biol. 56(19), 6205–6222 (2011)
7. Bortfeld, T., Schlegel, W.: Optimization of beam orientations in radiation therapy: some theoretical considerations. Phys. Med. Biol. 38(2), 291 (1993)
8. Djajaputra, D., Wu, Q., Wu, Y., Mohan, R.: Algorithm and performance of a clinical IMRT beam-angle optimization system. Phys. Med. Biol. 48(19), 3191 (2003)
9. Pugachev, A., Xing, L.: Computer-assisted selection of coplanar beam orientations in intensity-modulated radiation therapy. Phys. Med. Biol. 46(9), 2467 (2001)
10. Li, Y., Yao, J., Yao, D.: Automatic beam angle selection in IMRT planning using genetic algorithm. Phys. Med. Biol. 49(10), 1915 (2004)
11. Wang, C., Dai, J., Hu, Y.: Optimization of beam orientations and beam weights for conformal radiotherapy using mixed integer programming. Phys. Med. Biol. 48(24), 4065 (2003)
12. Lim, G.J., Ferris, M.C., Wright, S.J., Shepard, D.M., Earl, M.A.: An optimization framework for conformal radiation treatment planning. INFORMS J. Comput. 19(3), 366–380 (2007)
13. D'Souza, W.D., Meyer, R.R., Shi, L.: Selection of beam orientations in intensity-modulated radiation therapy using single-beam indices and integer programming. Phys. Med. Biol. 49(15), 3465 (2004)
14. Hou, Q., Wang, J., Chen, Y., Galvin, J.M.: Beam orientation optimization for IMRT by a hybrid method of the genetic algorithm and the simulated dynamics. Med. Phys. 30(9), 2360–2367 (2003)
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
16. Agrawal, R.: Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Adv. Appl. Probab. 27(4), 1054–1078 (1995)
17. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
18. Hahnloser, R.H., Sarpeshkar, R., Mahowald, M.A., Douglas, R.J., Seung, H.S.: Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405(6789), 947 (2000)
19. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
20. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
21. Chung, M., Buro, M., Schaeffer, J.: Monte Carlo planning in RTS games. In: CIG. Citeseer (2005)
22. Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
23. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of Go without human knowledge. Nature 550(7676), 354 (2017)
24. Bellman, R.: Dynamic Programming. Princeton University Press, Princeton (1957)
25. Coulom, R.: Efficient selectivity and backup operators in Monte-Carlo tree search. In: International Conference on Computers and Games (2006)
26. Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. In: European Conference on Machine Learning (2006)
27. Basar, T., Olsder, G.J.: Dynamic Noncooperative Game Theory, vol. 23. SIAM, Philadelphia (1999)
Sensor-Based Planning, Navigation, and Control
Continuous Optimization Framework for Depth Sensor Viewpoint Selection

Behnam Maleki(B), Alen Alempijevic, and Teresa Vidal-Calleja

Centre for Autonomous Systems, University of Technology Sydney, 15 Broadway, Ultimo, NSW 2007, Australia
[email protected]
Abstract. Distinguishing differences between areas represented with point cloud data is generally approached by choosing an optimal viewpoint. The most informative view of a scene ultimately enables optimal coverage of distinct points both locally and globally while accounting for the distance to the foci of attention. Measures of surface saliency, related to curvature inconsistency, accentuate differences in shape and are coupled with viewpoint selection approaches. As there is no analytical solution for optimal viewpoint selection, candidate viewpoints are generally discretely sampled and evaluated for information, requiring (near) exhaustive combinatorial searches. We present a consolidated optimization framework for optimal viewpoint selection with a continuous cost function and analytically derived Jacobian that incorporates view angle, vertex normals, and measures of task-related surface information relative to the viewpoint. We provide a mechanism in the cost function to incorporate sensor attributes such as operating range, field of view, and angular resolution. The framework is shown to compete favourably with state-of-the-art approaches to viewpoint selection while significantly reducing the number of viewpoints to be evaluated in the process.

Keywords: Perception · Optimal viewpoint · Depth-sensors

1 Introduction
Humans are inherently capable of determining discriminative viewpoints when perceiving 3D objects; generally, the view selected emphasizes distinguishing attributes of an object relative to the task at hand. Assuming that the position and orientation of the viewer constitute the viewpoint (the same as the pose), the definition of an optimal viewpoint is highly dependent on the concept of information required for a task. Examples include optimal views to discern an object from others in clutter [23], to achieve complete coverage of space [9,19], and to determine distinctiveness between objects of the same class [6]. Our particular interest is in using a depth sensor as the viewer of objects from the same class (e.g., human faces, single-species body attributes such as the Angus cattle muscle score [14]), as they exhibit limited variations over the entire class.
When examining a 3D object for discriminating tasks in dynamic scenes (e.g., moving animals), a limited number of embedded fixed-pose depth sensors in the scene are often confronted with restricted time and angles for capturing data. Systems that perform dense reconstruction [15] continuously add measurements into an underlying representation; the rapid dynamic nature of animal motion and the inability to reasonably constrain animals for any period of time render these approaches infeasible at scale. However, it is reasonable to obtain dense representative point clouds of exemplars (an object, an animal, or a part thereof). Then, the problem of discerning an object can inherently be approached by leveraging this information and devising methods to acquire the discriminative properties with a limited number of informative viewpoints (in the context of our work referred to as optimal viewpoints). Assigning the information content conveyed by each point of a 3D surface is achieved with saliency measures related to local or global curvature [8,13,20]. Accordingly, the optimal viewpoint should enable a sensor (depth camera) to have optimal coverage of the most informative or distinct points, both locally and globally, while also accounting for the distance to the foci of attention. The viewpoint quality and ultimate selection are generally decoupled: candidate viewpoints are discretely sampled on a lattice [12,13] and evaluated for information.

In this paper we tackle the problem of finding the optimal viewpoint for a depth sensor by presenting a consolidated framework based on optimization on the manifold. A continuous cost function that incorporates view angle, surface normals, and surface quality, with an analytically derived Jacobian, is leveraged via an optimization framework to determine the optimal viewpoint. We further provide, via a single coefficient embedded in the cost function, a mechanism to consider knowledge of the entire object or limited fields of view (FoV). By ray-tracing the point cloud, the framework is able to consider sensor attributes such as optimal sensing range, field of view, and angular resolution. The proposed framework is extendable to configurations of multiple depth sensors.

The organization of this paper is as follows: Sect. 2 discusses related work on exploiting surface information, with emphasis on viewpoint selection to discriminate objects. Section 3 details our approach, which incorporates surface and viewpoint quality in a unified optimization framework. An empirical evaluation of the approach is presented and we compare our results with the state of the art in Sect. 4. Finally, conclusions are drawn and future work proposed in Sect. 5.
2 Related Work
Assessment of a 3D object exploits stereopsis, which supports the perception of a 3D world, including discriminating differences in depth, judging slant or curvature, and ascertaining surface properties. Previous research demonstrated that an observer's perception of the convexity and concavity of surfaces, even with partial occlusions, allows task-irrelevant regions and shapes of a 3D surface to be ignored [5,16].
In order to define regions of interest, measures of surface saliency have been proposed [20]. Centre-surround operators have been used, such as maximum curvature [8] or Gaussian-weighted mean curvatures [11]. Alternatively, local descriptors are used, such as Scale Invariant Features [17] and Spin Images [10]. These methods identify regions where curvature is inconsistent with its immediate surroundings and where human vision tends to be drawn when taking differences in shape into account. Additionally, grouping regions of interest (ROI) and exemplifying the influence of extrema points has also been researched [13]. All of these approaches assign weights per vertex or face of a 3D object based on the computed saliency; thereafter this information serves as the utility for determining optimal viewpoints. Irrespective of the method assigning surface saliency attributes, the process of determining the optimal viewpoints involves a search over the space of viewpoint poses. This search has generally been performed by discretely sampling the space [12], commonly with viewpoints on a sphere encompassing the object, where the orientation of each viewpoint is aimed towards the centroid of the object. While techniques to group the points and limit the search space have been introduced [7,12,13], a unified framework that exploits a cost function associated with the surface quality in an optimization framework has not been proposed. We address this with an approach leveraging a cost function that incorporates the finite fields of view and spatial resolution of a 3D sensor and surface quality relative to the viewpoint. An analytical Jacobian is used in an optimization approach to guarantee convergence.
3 Methodology
Given a noisy point cloud obtained by depth sensors, the aim of this work is to find the optimal viewpoint – a 3D pose (R, t) ∈ SE(3), where t is the translation and R is an orthonormal vector base, i.e. the rotation matrix – for the sensor by computing the geometrical properties of the object–sensor setup so as to capture the most informative region(s) for accurate perception tasks. Our approach considers the field of view and the depth range of the camera as part of the optimization problem. Surface curvature in this work is extended to point clouds by considering the curvature embedded at each vertex (point) of the surface. The normal curvature at each vertex on the surface is the curvature of the planar curve created by intersecting the surface at that vertex with the plane spanned by the tangent vector and the surface normal. By rotating the tangent vector around the surface normal (and subsequently varying the normal curvature), the curvature property of the surface (and vertices) is acquired as two distinct extremal values called principal curvatures, while their average is the mean curvature [3]. In this work, in addition to mean curvature, we take advantage of the noise associated with the positions and normals of vertices as the information of interest and utilise them in the framework as vertex weights.
The scalar value of the weight associated with each vertex can be used as an information metric to drive viewpoint selection. Hence, after a pre-processing stage we define a 7-tuple for each vertex of the surface that contains the 3D position, surface normal, and an information weight, (pj, nj, cj), where pj = (xj, yj, zj) and nj = (nxj, nyj, nzj), respectively. In this work we solely utilize the depth information of an RGB-D camera; per the manufacturer's specifications [1], depth RMSE is related to horizontal offset from the principal point. Therefore, the most accurate measurement of depth is achieved when the sensor is gazing perpendicularly at the points of interest. With this defined, the main axis of the sensor should be collinear with the vertex normals of the ROI. Specifically, given the afore-mentioned tuple and the current depth sensor pose, the proposed framework iteratively minimizes the objective function formed by the angle between the principal axis of the sensor and the surface normals, and the vector that joins the vertex and the sensor, as shown in Fig. 1. This is done while applying the FOV and visibility range constraints. The flowchart of our proposed algorithm is shown in Fig. 2. Note that in this work the terms 'point' and 'vertex' are used interchangeably.
Fig. 1. Angle definitions for the camera–object setup, with a magnified view of the camera coordinate frame in the top right. pj: j-th vertex on the surface. ti and Ri: camera position and orientation in the i-th state, respectively. αji: angle between the camera main axis (z axis of the camera) in state i and the vector hij (= ti − pj). βji: angle between the camera main axis and the flipped vertex normal (−nj). hij: distance of the camera in the i-th state from vertex pj.
Fig. 2. System flowchart.
3.1 Angular Terms of Objective Function
Let the camera position t be at the origin of the world coordinate system, and let the unit vector of the camera axis be aligned with the Z-axis of the world reference frame. This means that the roll, pitch, and yaw components of the camera orientation are zero, which corresponds to a 3×3 identity matrix, i.e. t0 = (0, 0, 0), R0 = I3×3. If the camera moves to a new pose (position and orientation), ti and Ri, according to an SE(3) transformation, the updated orientation of the unit vector associated with the camera axis is Ri z. In Fig. 1, pj denotes the coordinate of the j-th point of the point cloud with respect to the world reference frame. By considering the vector hij formed between the point pj and the sensor position ti, two angles have to be minimized: (1) α, the angle defined between the camera's main axis Ri z and hij; and (2) β, the angle between the vertex normal, n, and the camera axis, Ri z. Note that the variable z represents the unit vector aligned with the Z-axis of the reference frame, hence z = [0, 0, 1]. The surface normal at point pj is defined by the unit vector nj. Since the angle between the vertex normal and the camera axis is independent of their positions, nj can be translated to the camera position. As the minimization procedure involves the angle between −nj and the camera axis in terms of the four-quadrant inverse tangent (atan2), the flipped surface normal, −nj, is used in computations. Thus, α and β are given by:
\[ \alpha_j^i := \angle(R^i z,\; t^i p_j) = \operatorname{atan2}\!\big(\lVert R^i z \times (p_j - t^i) \rVert,\; R^i z \bullet (p_j - t^i)\big), \tag{1} \]

\[ \beta_j^i := \angle(R^i z,\; n_j) = \operatorname{atan2}\!\big(\lVert R^i z \times (-n_j) \rVert,\; R^i z \bullet (-n_j)\big), \tag{2} \]
where ‖·‖ is defined as the L2-norm of a vector and Ri z is the unit vector defining the camera axis in the i-th pose. In our computations, αji and βji are confined to the interval [−π, π].
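For reference, (1) and (2) can be evaluated directly with NumPy (a small illustrative sketch; the variable names are ours):

```python
import numpy as np

def view_angles(R_i, t_i, p_j, n_j):
    """Angles (1)-(2): alpha between the camera axis and the camera-to-vertex
    vector, beta between the camera axis and the flipped vertex normal."""
    z = np.array([0.0, 0.0, 1.0])
    axis = R_i @ z                        # camera main axis in the world frame
    h = p_j - t_i                         # vector from camera to vertex
    alpha = np.arctan2(np.linalg.norm(np.cross(axis, h)), axis @ h)
    beta = np.arctan2(np.linalg.norm(np.cross(axis, -n_j)), axis @ (-n_j))
    return alpha, beta
```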
3.2 Distance Term of Objective Function
The optimal sensing distance at minimum noise of common RGB-D cameras is very limited [1,18]; therefore, the function should include a term that restricts the search for the optimal pose to an ideal distance, η, within the nominal depth sensing range of the sensor (e.g. 0.5,

continuous optimization is dependent on the initial guess caused by symmetry (mean curvature). To provide a comparison with the similar discrete sampling approach presented in [13], we tested our framework on the Igea (Venus) dataset with the parameters stated in the fifth experiment of Table 1. The optimal pose is presented in Table 1, while Fig. 6(a), (b) and (c) depict the optimal pose from three different viewing angles. Similar to the cow back dataset, the mean curvature values are color-coded at the vertices, and the transition of the sensor during the optimization process is indicated
Fig. 5. (a) The color map of cost-function values for the sensor positioned on the first point set (at a distance of η = 1.4 m from the vertices, in the direction of the vertex normals) and oriented towards the flipped vertex normal at each position. (b) The color map of cost-function values for the sensor positioned on a hemisphere lattice (15,000 points at a distance of 1.4 m from the centroid of the object) and oriented towards the centroid of the object.
with a color scheme from light orange (initial pose) to black (optimal pose); a subset of the interim states is shown. Regardless of the variation of the initial guess (as long as it is not in the back half of the head), the optimal pose is unique. The conditions and assumptions used in this experiment are similar to [13], and their optimal viewpoint for this dataset is shown in Fig. 6d. The viewpoint determined by our proposed algorithm is similar to that of [22], which was used as a benchmark for the work reported by Leifman [13]. The angle displayed in Fig. 6(c) highlights the similarity of the optimal viewpoint rendered by our approach with their result shown in Fig. 6d (for more details refer to [13]). A quantitative comparison of the computed viewpoint was not possible: computer graphics publications such as [22] and [13] only illustrate the viewpoint, without reporting actual values. The overall performance of our algorithm depends mainly on the variance of the information measure on the surface, the value of λ, the topology of the object, and the number of vertices. However, the number of iterations in our experiments has proven to be fewer than 150; considering that the gradient computation runs in parallel with the cost function computation, the average computational complexity of our algorithm, given its continuous nature, is considerably lower than that of discrete approaches.
Fig. 6. (a), (b) and (c) three different viewing angles of one optimization experiment (with λ = 0.51) over the Igea dataset (color-coded based on mean curvature values per vertex) after going through some interim poses (shown by color-scaled cameras) and ending at the optimal pose (black camera). (d) optimal viewpoint of Igea illustrated in [13].
5 Conclusions
To address the problem of finding an informative viewpoint for a depth camera, we have shown analytically that by aligning the concept of information with inherent characteristics of the object surface (such as mean curvature, sparseness, normal noise), the optimal pose is achievable at much lower computational cost compared to numeric approaches. The considered optimal pose of the camera consists of 6 parameters (denoting 6 DoF) and is obtained by minimizing a novel cost function based on the geometry of the sensor–object setup through steepest descent defined by its gradient projected over the appropriate manifold in SE(3). The experiments demonstrate that the optimal pose found by our
approach is consistent with the optimal viewpoint of the object obtained via numerical methods and state-of-the-art viewpoint selection approaches. Future work will incorporate a probabilistic framework to deal with the uncertainty of acquiring data from a sensor, and will update the objective function to continuously optimize the distance of the sensor with respect to the surface. We will combine viewpoint selection with machine learning approaches to ascertain the possibility of viewpoint invariance when entire cohorts of animals are evaluated for phenotypic trait estimation purposes. Finally, we aim to integrate our framework for viewpoint selection based on surface quality for the inspection of parts developed in additive manufacturing.

Acknowledgments. This work was possible due to the financial and in-kind support, and efforts of many individuals from NSW Department of Primary Industries, University of Technology Sydney, and Meat and Livestock Australia.
A Appendix
Lemma: Let t, p and z be three column vectors in \( \mathbb{R}^n \) and let R be an n × n matrix. Then for

\[ u(R, t) = \frac{\lVert Rz \times (p - t) \rVert}{Rz \bullet (p - t)}, \]

we have:

\[ \frac{\partial u}{\partial t} = \frac{(Rz \cdot Rz)}{\big((p - t) \cdot Rz\big)^3\, u} \Big[ \big((p - t) \cdot (p - t)\big) Rz - \big((p - t) \cdot Rz\big)(p - t) \Big] \tag{25} \]

\[ \frac{\partial u}{\partial R} = \frac{\big((p - t) \cdot (p - t)\big)}{\big(Rz \cdot (p - t)\big)^3\, u} \Big[ \big(Rz \cdot (p - t)\big) Rz - (Rz \cdot Rz)(p - t) \Big] z^T \tag{26} \]

Note: To avoid notational confusion, u denotes the value of the function u(R, t).

Proof. Assume a and b are two column vectors in \( \mathbb{R}^n \). The auxiliary variable κ is defined as a scalar function of a and b, in which the first vector, a, is a constant:

\[ \kappa = \frac{\lVert a \times b \rVert^2}{(a \cdot b)^2} = \frac{(b \cdot b)(a \cdot a) - (b \cdot a)^2}{(b \cdot a)^2} = \frac{(b \cdot b)(a \cdot a)}{(b \cdot a)^2} - 1. \tag{27} \]

The differential of κ is:

\[ d\kappa = \frac{2(b \cdot db)(a \cdot a)}{(b \cdot a)^2} - \frac{2(b \cdot b)(a \cdot a)(a \cdot db)}{(b \cdot a)^3} = \frac{2(a \cdot a)}{(b \cdot a)^3} \big[(b \cdot a)\, b - (b \cdot b)\, a\big] \cdot db. \tag{28} \]

\[ \kappa = u^2 \implies d\kappa = 2u\, du. \tag{29} \]

By substituting

\[ a = Rz, \tag{30} \]
\[ b = (p - t), \tag{31} \]
\[ db = -dt, \tag{32} \]

we obtain

\[ du = \frac{d\kappa}{2u} = \frac{(a \cdot a)}{(b \cdot a)^3\, u} \big[(b \cdot a)\, b - (b \cdot b)\, a\big] \cdot (-dt) \tag{33} \]

\[ \frac{\partial u}{\partial t} = \frac{(a \cdot a)}{(b \cdot a)^3\, u} \big[(b \cdot b)\, a - (b \cdot a)\, b\big]. \tag{34} \]

And also if

\[ a = (p - t), \tag{35} \]
\[ b = Rz, \tag{36} \]
\[ db = dR\, z, \tag{37} \]

then

\[ du = \frac{(a \cdot a)}{(b \cdot a)^3\, u} \big[(b \cdot a)\, b - (b \cdot b)\, a\big] \cdot dR\, z = \left\{ \frac{(a \cdot a)}{(b \cdot a)^3\, u} \big[(b \cdot a)\, b - (b \cdot b)\, a\big] \right\} z \cdot dR \tag{38} \]

\[ \frac{\partial u}{\partial R} = \frac{(a \cdot a)}{(b \cdot a)^3\, u} \big[(b \cdot a)\, b - (b \cdot b)\, a\big] z^T. \tag{39} \]
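As a sanity check on (25), the analytic gradient can be compared against central finite differences (illustrative values; not part of the paper):

```python
import numpy as np

def u_val(R, t, p, z):
    """u(R, t) = ||Rz x (p - t)|| / (Rz . (p - t)) from the Lemma."""
    a, b = R @ z, p - t
    return np.linalg.norm(np.cross(a, b)) / (a @ b)

def grad_t(R, t, p, z):
    """Analytic gradient (25) of u with respect to the camera position t."""
    a, b = R @ z, p - t
    u = u_val(R, t, p, z)
    return (a @ a) / ((b @ a) ** 3 * u) * ((b @ b) * a - (b @ a) * b)

R, t, z = np.eye(3), np.zeros(3), np.array([0.0, 0.0, 1.0])
p = np.array([0.3, -0.2, 2.0])  # in front of the camera, so Rz . (p - t) > 0
eps = 1e-6
num = np.array([(u_val(R, t + eps * e, p, z) - u_val(R, t - eps * e, p, z)) / (2 * eps)
                for e in np.eye(3)])
assert np.allclose(num, grad_t(R, t, p, z), rtol=1e-5, atol=1e-8)
```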
References

1. Anders, G.J., Mies, W., Sweetser, J.N., Woodfill, J.: Best-known-methods for tuning Intel RealSense D400 depth cameras for best performance. Intel (2018)
2. Andersen, M.R., Jensen, T., Lisouski, P., Mortensen, A.K., Hansen, M.K., Gregersen, T., Ahrendt, P.: Kinect depth sensor evaluation for computer vision applications. Tech. Rep. Electron. Comput. Eng. 1(6) (2012)
3. Botsch, M., Kobbelt, L., Pauly, M., Alliez, P., Lévy, B.: Polygon Mesh Processing. CRC Press, Boca Raton (2010)
4. Boumal, N., Mishra, B., Absil, P.A., Sepulchre, R.: Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res. 15(1), 1455–1459 (2014)
5. Cate, A.D., Behrmann, M.: Perceiving parts and shapes from concave surfaces. Atten. Percept. Psychophys. 72(1), 153–167 (2010)
6. Chen, X., Saparov, A., Pang, B., Funkhouser, T.: Schelling points on 3D surface meshes. ACM Trans. Graph. (TOG) 31(4), 29 (2012)
7. Foissotte, T., Stasse, O., Wieber, P.B., Escande, A., Kheddar, A.: Autonomous 3D object modeling by a humanoid using an optimization-driven next-best-view formulation. Int. J. Hum. Rob. 7(03), 407–428 (2010)
8. Gal, R., Cohen-Or, D.: Salient geometric features for partial shape matching and similarity. ACM Trans. Graph. (TOG) 25(1), 130–150 (2006)
9. Gonzalez-Barbosa, J.J., García-Ramírez, T., Salas, J., Hurtado-Ramos, J.B., et al.: Optimal camera placement for total coverage. In: IEEE International Conference on Robotics and Automation 2009, ICRA 2009, pp. 844–848. IEEE (2009)
10. Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Trans. Pattern Anal. Mach. Intell. 21(5), 433–449 (1999)
11. Lee, C.H., Varshney, A., Jacobs, D.W.: Mesh saliency. ACM Trans. Graph. (TOG) 24, 659–666 (2005)
12. Lee, J., Moghaddam, B., Pfister, H., Machiraju, R.: Finding optimal views for 3D face shape modeling. In: Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition 2004, pp. 31–36. IEEE (2004)
13. Leifman, G., Shtrom, E., Tal, A.: Surface regions of interest for viewpoint selection. IEEE Trans. Pattern Anal. Mach. Intell. 38(12), 2544–2556 (2016)
14. May, S., Mies, W., Edwards, J., Williams, F., Wise, J., Morgan, J., Savell, J., Cross, H.: Beef carcass composition of slaughter cattle differing in frame size, muscle score, and external fatness. J. Anim. Sci. 70(8), 2431–2445 (1992)
15. Newcombe, R.A., Fox, D., Seitz, S.M.: DynamicFusion: reconstruction and tracking of non-rigid scenes in real-time. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 343–352 (2015)
16. Norman, J.F., Norman, H.F., Lee, Y.L., Stockton, D., Lappin, J.S.: The visual perception of length along intrinsically curved surfaces. Percept. Psychophys. 66(1), 77–88 (2004)
17. Ohbuchi, R., Osada, K., Furuya, T., Banno, T.: Salient local visual features for shape-based 3D model retrieval. In: IEEE International Conference on Shape Modeling and Applications 2008, SMI 2008, pp. 93–102. IEEE (2008)
18. Pece, F., Kautz, J., Weyrich, T.: Three depth-camera technologies compared. In: First BEAMING Workshop, Barcelona, vol. 2011, p. 9 (2011)
19. Quin, P., Paul, G., Alempijevic, A., Liu, D., Dissanayake, G.: Efficient neighbourhood-based information gain approach for exploration of complex 3D environments. In: 2013 IEEE International Conference on Robotics and Automation (ICRA), pp. 1343–1348. IEEE (2013)
20. Shilane, P., Funkhouser, T.: Distinctive regions of 3D surfaces. ACM Trans. Graph. (TOG) 26(2), 7 (2007)
21. Skinner, B., Vidal Calleja, T., Valls Miro, J., De Bruijn, F., Falque, R.: 3D point cloud upsampling for accurate reconstruction of dense 2.5D thickness maps. In: Australasian Conference on Robotics and Automation (2014)
22. Vieira, T., Bordignon, A., Peixoto, A., Tavares, G., Lopes, H., Velho, L., Lewiner, T.: Learning good views through intelligent galleries. In: Computer Graphics Forum, vol. 28, pp. 717–726. Wiley Online Library (2009)
23. Wu, K., Ranasinghe, R., Dissanayake, G.: Active recognition and pose estimation of household objects in clutter. In: 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 4230–4237. IEEE (2015)
Look Before You Sweep: Visibility-Aware Motion Planning

Gustavo Goretkin(B), Leslie Pack Kaelbling, and Tomás Lozano-Pérez

Massachusetts Institute of Technology, Cambridge, MA 02139, USA
{goretkin,lpk,tlp}@csail.mit.edu
Abstract. This paper addresses the problem of planning for a robot with a directional obstacle-detection sensor that must move through a cluttered environment. The planning objective is to remain safe by finding a path for the complete robot, including sensor, that guarantees that the robot will not move into any part of the workspace before it has been seen by the sensor. Although a great deal of work has addressed a version of this problem in which the “field of view” of the sensor is a sphere around the robot, there is very little work addressing robots with a narrow or occluded field of view. We give a formal definition of the problem, several solution methods with different computational trade-offs, and experimental results in illustrative domains.
1 Introduction
Consider a mobile-manipulation robot that must move through a crowded environment. If the location of all obstacles in the environment is known, then the problem it faces is a familiar motion-planning problem. But if the environment can contain unknown obstacles, then the robot must incorporate sensing into its plan in order to guarantee that it will not collide with anything. In one extreme version of this problem, the environment is entirely unknown, and would best be treated with a combination of map-building and exploration. We will focus on a different regime, that arises in the case of a household robot: the primary obstacles in the domain (e.g. walls, refrigerators) are known but there are other temporary obstacles (e.g., toys, trash cans, lightweight chairs). In this case, it is worthwhile to plan a path to a target configuration, but the path must take visibility into account, making sure that it never moves into a region of space that it has not already observed. Should the robot encounter an unexpected object, it could then make a new plan taking it into account, or move it out of the way. When we speak of visibility, we mean any robot-mounted ability to gather information about the locations of obstacles in its neighborhood. It could be based on vision, lidar, or even the ability to reach out slowly with a hand and sense contact or lack thereof. If a robot has visibility of a sphere around it in workspace, and it is quasi-static, then the problem of safe movement is simple: the robot must just move in small enough steps that it never exits the sphere it
saw from its previous configuration. A great deal of work has addressed exploration problems in this visibility model. However, many robots have less encompassing sensing. Figure 1 illustrates several different sensor configurations for a simple planar robot. Case (a) reflects the most common assumption about sensing: that is, that the robot can perceive a ball of some radius in the workspace (although this is often described as perceiving a ball in configuration space, which is not necessarily sensible); case (b) shows a wide field of view as might occur with some steerable sensors; case (c) shows a narrow view as might occur in some vision sensors; case (d) occurs for many humanoid robots with a camera mounted on the head: although they can see a view cone in front of them, it is occluded by the body and so there is a region of space immediately in front of the robot that cannot be seen; and case (e) illustrates a situation in which, for example, a humanoid is carrying a large box in front of it, so its field of view is split into two narrow cones. Our approach handles a general mapping from robot configurations to “viewed” regions of workspace, encompassing these examples, and even more complex ones, including cameras mounted on the arms of a mobile manipulator.
Fig. 1. Some possible sensed volumes with respect to robot configuration.
Fig. 2. In the HallwayEasy domain, a robot with a narrow (30°) field of view must steer carefully around a corner.
For a large robot with a limited view that is navigating in a cluttered environment, the problems of moving to the goal and of observing to guarantee safety are generally inextricably linked. Figure 2 shows a plan for a robot with a narrow view cone to enter a hallway. The goal is depicted in the dashed outline. The yellow shading indicates the region of workspace that has been seen by the plan. Note that the robot has to rotate back and forth as it moves, and swing wide around the corner, in order to guarantee safety; this path was obtained with the
Fig. 3. In the HallwayHard domain, a robot with a wide field of view must enter the hallway camera-first, back out, re-orient, and back in.
Vamp Backchain method, described in Sect. 4.4. Figure 3 shows a robot with a wider field of view that must enter a narrow hallway backwards. Because it cannot turn around inside the hallway, it must first look down the hallway then back out and turn around. Finally, Fig. 4a shows a particularly difficult environment, in which the robot must enter one hallway to look through a narrow opening to gain visibility of a different hallway before entering it. The contributions of this paper are twofold. First, we clearly frame the problem of planning a path that is safe, in the sense that it never moves through previously unobserved space, for a general visibility function that maps robot configurations to regions of workspace that are observed. We call this problem class vamp, for visibility-aware motion planning. Second, we supply several algorithms that are complete for this problem, which occupy different points in the trade-off space between path length and planning time. In particular, we present in Sect. 4.4 an algorithm that is well-suited to solving vamp problems. The other algorithms we present either provide conceptual clarity to the belief-space nature of the vamp problem, or serve as useful subprocedures for the algorithm in Sect. 4.4. The problem is quite computationally complex, because the state of the planning problem includes what space has already been observed, and so the solutions include paths that a traditional motion-planner would never generate, because the solution paths revisit the same robot configuration with different visibility states. Although the examples presented here are for a robot with 3D configuration space in a 2D workspace, there is nothing in the formulation or algorithms that restricts their applicability to this case. Our algorithms are provably always correct (will not generate an illegal plan); they are complete for holonomic robots, as well as for non-holonomic robots that can always reverse a path they have traveled, without increasing the total swept volume.
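The safety condition that makes a vamp plan valid can be sketched as follows (our notation, over a discretized workspace of voxels rather than the general formulation in the paper):

```python
def plan_is_safe(path, vis_fn, sweep_fn):
    """Check the vamp invariant: the robot never sweeps unobserved workspace.

    vis_fn(q)        -> set of workspace voxels observed from configuration q
    sweep_fn(q0, q1) -> set of voxels swept by the full robot moving q0 -> q1
    """
    seen = vis_fn(path[0])                 # observe from the start configuration
    for q_prev, q in zip(path, path[1:]):
        if not sweep_fn(q_prev, q) <= seen:
            return False                   # this step would enter unobserved space
        seen |= vis_fn(q)                  # accumulate newly observed workspace
    return True
```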
2 Related Work
There is a body of work, related to ours, that addresses the problem of robot motion planning in the presence of uncertainty about the environment. This work varies considerably in its objectives and assumptions about the problem. We will lay out the space of approaches and situate our work within it.

Problem Variations: There are many possible objectives for the robot's motion. Coverage problems generally assume a known map of obstacles but require
planning a path that will observe all parts of the reachable space, for example, to inspect a structure [1–3]. Exploration problems generally assume no prior knowledge of the obstacles and desire that the robot eventually observe all parts of the space, possibly while building a map of the obstacles [4–10]. Navigation problems seek to reach a specified goal region in an incompletely known environment [11–13]. An additional source of variation is the notion of safety: one might wish to guarantee safety (non-collision with any obstacle) absolutely, e.g. [3,8]; with high probability with respect to a distribution over obstacles, e.g. [13,14]; or with respect to obstacles that may move, e.g. [15].

Formulations vary in their assumptions about observations. Sometimes no observations are assumed during path execution, as in coverage problems [1]. In other cases, an observation is made from some or all states of the robot during execution; the observation depends on the underlying true world map as well as on the robot's state and may be an arbitrary function of those inputs. Typically, observations are assumed to be in the plane and take the form of either a fixed cone or a circle centered on the robot's location, although more general 3D observations have been considered [7]. In addition, the robot is typically assumed to be small relative to the obstacles and the environment uncrowded, so that the robot can be approximated as a point. These simplifying assumptions blur the distinction between workspace and configuration space and limit the application of these algorithms to more general settings, such as those arising in mobile manipulation. Robot motion is typically assumed to be planar. However, during mobile manipulation, all the degrees of freedom of the robot may affect observations, e.g. the arms may partially block the sensor; this setting motivates our work although our experiments are in lower dimensions.

Previous work has considered a variety of motion models: kinematic, whether holonomic or non-holonomic [9], or kino-dynamic [11,16], and with deterministic or noisy actuation [10,12]. Robot dynamics introduce additional difficulty because the robot is not able to stop instantly when an obstacle is detected; instead, it must ensure that it never enters an inevitable collision state (ics) [9,11] with respect to its current known free space and a motion model of possible obstacles.

It is worth highlighting a harder case of the coverage problem, known as visibility-based pursuit-evasion [17,18]. A solution to this problem is a path, which the pursuer follows, that guarantees that any evaders will be detected. Solution paths to this problem necessarily cover the whole space. In general, the paths must revisit the same robot configuration with different information states.

Solution Strategies: When there is no map uncertainty, as in coverage problems, the solution is a (minimum-length) path that covers all parts of the space with its sensors. Problems with map uncertainty, as in our case, can all be cast as some version of a partially observed Markov decision process (pomdp). For our version of the problem, the state space would be the cross product of the robot's configuration
space with the space of all possible arrangements of free/occupied space defined by the unknown obstacles, the actions would be finite linear motions in configuration space, and the observations would be determined by the visibility function of the sensor. The objective would be to minimize path length, but with the hard constraint of not moving into non-viewed areas (making this not a standard pomdp). Seeing the problem this way is often clarifying, but it does not immediately lead to a solution strategy, since the optimal solution of pomdps is highly computationally intractable even in discrete state and action spaces.

Practical approximation strategies for pomdps almost all rely on some form of receding-horizon control [8]. The system makes a plan for a sub-problem under some assumptions, and begins executing it, gathering further information about the map as it goes. When it receives an observation that invalidates its plan (e.g. an object in its path) or reaches its subgoal, the system makes a new plan based on the current robot configuration and map state. When the objective is exploration, a typical strategy is some form of frontier-based or next-best-view planning method [4,7,9]. On each replanning iteration, a subgoal configuration is selected (a) that is reachable within known free space from the robot's current configuration and (b) from which some previously-unobserved parts of the workspace can be viewed. When the objective is navigation, a typical replanning strategy is to be optimistic, planning a path to the goal that makes some assumptions about the true map and replanning if that assumption is invalidated. Our contribution addresses the planning component of this approach.

Our Work: We address a version of the problem in which we assume that the robot knows a map in advance. This map is assumed to be accurate in the sense that the obstacles it contains are definitely present in the world. We assume that the robot has some form of obstacle sensor, but make no assumptions about it except that, for any configuration of the robot, it can observe some (possibly disjoint) subset of the workspace, and that this visibility function is known in advance. We assume that observation and control are deterministic and the robot always knows its configuration. The algorithms in this paper assume a robot that can reverse a path without sweeping through additional workspace: this is true of holonomic robots, but also of round differential-drive robots, for example.

The planner we present in Sect. 4.4 would be used in a "trust but verify" replanning framework, in which we assume, optimistically, for the purposes of planning, that the obstacles in our current map are, in fact, the only obstacles. This assumption makes it worthwhile to try to plan a complete path to the goal. However, because we are not certain that these are the only obstacles and because we wish to guarantee the robot's safety, we will seek a visibility-aware path to the goal, in which the robot never moves into space that has not been verified to be free during some previous part of its path (we formalize this criterion more carefully in Sect. 3). The robot could then execute this path until it observes an obstacle that invalidates the path. At that point, it would insert that obstacle
into its map and re-plan. The focus of this paper is on methods for planning optimistic visibility-aware trajectories. We provide several planning algorithms that take a sampled feasible-motion graph as input (e.g., a PRM or state lattice). They are guaranteed to be correct and complete on the given graph, and may be resolution-complete or probabilistically complete depending on the strategy used to augment the samples in the graph in case of failure. We do not yet make a claim about the completeness of the entire receding-horizon control policy.
3 Definitions
We will focus on problems in which the robot is holonomic or reversible, so the state of the robot can be modeled only in terms of its configuration. We will use P(X) to denote the powerset of set X (the set of all possible subsets). A vamp problem instance is a tuple (W, C, V, S, Wobs, q0, Qgoal, v0) where: W = R² or R³ is the workspace; C is the configuration space of the robot; V : C → P(W) is a visibility function, mapping robot configurations into (possibly disconnected) subsets of workspace that are visible from that configuration, conditioned on the known obstacles; S : C → P(W) is a swept volume function, mapping robot configurations into subsets of workspace occupied by the robot in that configuration; Wobs ∈ P(W) is the subset of workspace that is known to contain obstacles; q0 ∈ C is the initial robot configuration; Qgoal : C → Bool is a function from configurations to Booleans indicating whether the configuration satisfies the goal criteria; and v0 ⊆ W is a region of workspace that has already been viewed and confirmed free (by default, it will be equal to V(q0), but for some robots it will need to be more, in order to allow any initial movement). We are assuming that motion is continuous-time (so all configurations along a path must be previously viewed) but that perception is discrete-time (so new views are only gained at the end of each primitive trajectory).

We will define some additional useful quantities, in terms of the basic elements of a vamp problem instance. We extend the definition of swept volume to a path segment, so S(qi, qj) ⊆ W is the region swept by moving from qi to qj via a primitive trajectory (such as a straight line in C-space). We further extend the definitions of S and V to paths: S([q1, . . . , qn]) = ⋃_{i=1}^{n−1} S(qi, qi+1) and V([q1, . . . , qn]) = ⋃_{i=1}^{n} V(qi). An edge between configurations qi and qj is optimistically traversable if its swept volume does not intersect a workspace obstacle; the set of optimistic edges, then, is Eofree = {(q1, q2) | S(q1, q2) ∩ Wobs = ∅}. A path [q1, . . . , qn] is feasible for the problem if and only if every edge is optimistically traversable and only moves through parts of the workspace that have been previously viewed: (qi, qi+1) ∈ Eofree and S(qi, qi+1) ⊆ v0 ∪ V([q1, . . . , qi]), for all i ∈ {1, . . . , n − 1}. We refer to this second condition as the visibility constraint.
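Because the feasibility test is purely set-theoretic, it is easy to state concretely. Below is a minimal Python sketch of the visibility-constraint check, assuming workspace regions are represented as sets of discretized cells; the names (S_edge, V) and the cell representation are our illustrative assumptions, not the authors' implementation.

    def is_feasible(path, S_edge, V, v0, W_obs):
        # Sketch of the vamp feasibility check for a candidate path [q1, ..., qn].
        # S_edge(qi, qj): cells swept moving qi -> qj along a primitive trajectory
        # V(q):           cells visible from configuration q
        # v0:             cells already viewed and confirmed free
        # W_obs:          cells known to contain obstacles
        viewed = set(v0)
        for qi, qj in zip(path, path[1:]):
            swept = S_edge(qi, qj)
            if swept & W_obs:        # edge must be optimistically traversable
                return False
            if not swept <= viewed:  # visibility constraint: only sweep viewed space
                return False
            viewed |= V(qj)          # perception is discrete-time: views at edge ends
        return True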
4 Planning Algorithms
We present several algorithmic approaches to the vamp problem, beginning with a computationally inefficient method that produces very high quality plans, and then exploring alternative strategies that are more computationally tractable. In all these algorithms, we assume a given finite graph (Q, E) where Q ⊂ C is a set of configurations and all edges (q1, q2) ∈ E are collision-free with respect to Wobs, so S(q1, q2) ∩ Wobs = ∅. This graph may be, for example, a fixed-resolution grid or a probabilistic road-map [19]. Any of our vamp algorithms can be augmented by an outer loop that increases the resolution or sampling density of the graph.

4.1 Belief-Space Search
The most conceptually straightforward approach to this problem is to perform a tree-search in belief space. Roy et al. [16,20] have pioneered techniques of this type in uncertain robot motion planning problems. The basic idea is that a state of the whole system consists of a robot configuration and a current belief state, with the robot configurations drawn from the set Q. In this problem, the belief state v ∈ P(W) is the region of the workspace that has been observed by the robot to be collision-free.

Algorithm 1. Vamp Bel((Q, E), V, q0, Qgoal, v0)
  s0 ← (q0, v0)                                      ▷ initial state
  g((q, v)) ← Qgoal(q)                               ▷ goal test
  A((q, v)) ← {q′ | (q, q′) ∈ E and S(q, q′) ⊂ v}    ▷ legal actions in state (q, v)
  T((q, v), q′) ← (q′, v ∪ V(q′))                    ▷ state transition function
  H((q, v)) ← α |S(mp((Q, E), q, Qgoal)) \ v|        ▷ heuristic
  return A∗(s0, g, A, T, H)
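To make the belief-space state concrete, here is a minimal Python sketch of the same search; the state is a (configuration, viewed-region) pair, with the viewed region stored as a frozenset of workspace cells. The names and representations are our illustrative assumptions, not the authors' implementation.

    import heapq
    from itertools import count

    def vamp_bel(edges, V, S_edge, q0, is_goal, v0, heuristic):
        # States are (q, viewed) pairs; `viewed` is a frozenset of workspace cells.
        tie = count()  # tie-breaker so the heap never compares states directly
        start = (q0, frozenset(v0))
        frontier = [(heuristic(start), next(tie), 0.0, start, [q0])]
        expanded = set()
        while frontier:
            _, _, cost, (q, viewed), path = heapq.heappop(frontier)
            if is_goal(q):
                return path
            if (q, viewed) in expanded:
                continue
            expanded.add((q, viewed))
            for q2, edge_cost in edges(q):
                if not S_edge(q, q2) <= viewed:   # action legal only in viewed space
                    continue
                state2 = (q2, viewed | V(q2))     # new views gained at the edge's end
                g2 = cost + edge_cost
                heapq.heappush(frontier, (g2 + heuristic(state2), next(tie),
                                          g2, state2, path + [q2]))
        return None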
Procedure Vamp Bel provides an implementation of the belief-space search via a call to A∗ . The procedure is given, as input, the graph (Q, E), visibility function V , initial configuration q0 , goal test Qgoal , and initial visible workspace v0 . The set of legal actions A that can be taken from state (q, v) is the set of outgoing edges from configuration q that have the property that their swept volume is contained in the previously-viewed region of the workspace v. The transition function T moves along the edge to a new configuration and augments v with the region of configuration space visible from the new configuration. In order to drive the search toward a goal state, we define a heuristic which is based on a visibility-unaware path mp((Q, E), q, Qgoal ) obtained by solving the underlying motion-planning problem to the goal. The size of the swept volume of that path that has not yet been viewed is used as a measure of the difficulty of the remaining problem; α is a constant that makes the units match. Note that, in this search, it is possible for an optimal path to visit the same configuration
more than once (with different visibility states v). Nonetheless, the search space is finite given finite Q, because only finitely many possible visibility states can be reached (at most one for each set of configurations in Q).

Theorem 1. Algorithm Vamp Bel is correct and complete with respect to configuration space graph (Q, E).

Proof. The proof can be found in the Supplementary Materials [21].

This algorithm is computationally very complex even on modest graphs because the search must consider distinct paths that reach a given robot configuration. The search can be pruned by using a domination criterion: state (q, v1) dominates (q, v2) if v2 ⊆ v1, which means that if the search visits a state that is dominated by a state that has already been expanded, it can discard the dominated state. In our experiments, this condition did not occur frequently enough to be useful; different paths will see slightly different regions. On the other hand, in the setting of Bry and Roy [16], the belief space is lower dimensional (covariance matrices of the dynamical state space) and so domination happens much more frequently and makes the search tractable. We implemented this algorithm with a very computationally cheap domination criterion that eliminates paths that revisit configurations without having visited any new configurations since the last visit (this eliminates looping paths, among others). For HallwayEasy, Fig. 2, the heuristic is very effective at guiding the search—a solution is found in under 10 s after expanding 500 search nodes. However, on HallwayHard, Fig. 3, no solution was found after expanding 440K nodes, with 2 million nodes on the queue, after a computation time of over 2 h.

4.2 Local-Visibility Searches
At the opposite end of the spectrum of approaches to the vamp problem are methods that perform search directly in configuration space, as opposed to the problem's natural state space, which must include information about regions of observed space. These local approaches are not complete in general, but may be complete for some robots; they will prove useful as a subroutine in later algorithms.

Vamp Step Vis is defined in Algorithm 2. The basic version of the method has the same arguments as Vamp Bel, but it may also be used in relaxed mode, which is signaled by parameter relaxed = true, and makes use of an additional argument O ⊂ W, which is a workspace region considered to be out of bounds. In any mode, it may optionally be given a heuristic function. When it is not relaxed, it allows traversal of any edge (q, q′) ∈ E whose swept volume is entirely contained in the union of the initial visibility space v0 and the region of workspace visible from q, V(q). For some combinations of robot kinematics and visibility, this algorithm will be complete. For example, a robot with a wide field of view will always be able to see the space it is about to move into. However, this method does not suffice for robots that can move into space that is not immediately visible to them.

Algorithm 2. Vamp Step Vis((Q, E), V, q0, Qgoal, v0, H = 0, relaxed = False, O = ∅)
  s0 ← q0                                              ▷ initial state
  if relaxed then                                      ▷ legal actions and cost
    A(q) ← {q′ | (q, q′) ∈ E and S(q, q′) ∩ O = ∅}
    C(q, q′) ← ‖q − q′‖₂ · (if S(q, q′) ⊆ (v0 ∪ V(q)) then 1 else |S(q, q′) \ (v0 ∪ V(q))|)
  else
    A(q) ← {q′ | (q, q′) ∈ E and S(q, q′) ⊆ (v0 ∪ V(q))}
    C(q, q′) ← ‖q − q′‖₂
  T(q, q′) ← q′                                        ▷ state transition function
  return A∗(s0, Qgoal, A, T, H, C)

Relaxed mode is used only to compute intermediate subgoals, but never for executable paths; in it, the robot is allowed to move into areas of the workspace that have not yet been seen, but these motions incur an extra cost. It is not, however, allowed to collide with the out-of-bounds region under any circumstance, a feature used in the Tourist algorithm of Sect. 4.4. Ideally, the relaxed planner would solve a minimum-constraint-removal problem [22], keeping track of regions that have ever been violated, and not double-counting the regions that experience repeat violations. This is a very computationally difficult sub-problem, so we simply penalize total distance traversed through unviewed regions.

A very useful extension of Vamp Step Vis is the algorithm Vamp Path Vis. It is, fundamentally, a search in the space of configurations, as in Vamp Step Vis, and so, for example, the tests for whether a node has been visited inside A∗ are made on the configuration alone. However, as in Vamp Bel, we "decorate" each node that is added to the tree with the visibility v computed along the path to that node. Whatever visibility we have the first time we expand a configuration q is the visibility that will be permanently associated with it. This method can solve more problem instances than Vamp Step Vis, and is always correct, but it is incomplete (because it might commit to a path to some configuration that is not the best in terms of visibility, and it cannot contemplate paths that must revisit the same configuration). We will use it extensively as a subroutine in our final algorithm in Sect. 4.4; a sketch of its bookkeeping appears below.
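As promised above, a Python sketch of the Vamp Path Vis bookkeeping (illustrative names only; edges(q) yields (q2, cost) pairs, and V/S_edge return sets of workspace cells). Note that the visited test is on the configuration alone, while each node carries the visibility of the path that first reached it.

    import heapq
    from itertools import count

    def vamp_path_vis(edges, V, S_edge, q0, is_goal, v0, heuristic):
        tie = count()
        frontier = [(heuristic(q0), next(tie), 0.0, q0, frozenset(v0), [q0])]
        visited = set()  # configurations only, unlike full belief-space search
        while frontier:
            _, _, cost, q, v, path = heapq.heappop(frontier)
            if q in visited:
                continue
            visited.add(q)            # first expansion fixes q's visibility forever
            if is_goal(q):
                return path
            for q2, c in edges(q):
                if q2 in visited or not S_edge(q, q2) <= v:
                    continue
                heapq.heappush(frontier, (cost + c + heuristic(q2), next(tie),
                                          cost + c, q2, v | V(q2), path + [q2]))
        return None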
4.3 Tree-Visibility Tour
One observation is that, under the vamp assumptions, visibility is monotonic: that is, as the robot moves through the world, after any discrete path step and observation, vt−1 ⊆ vt . In addition, it is order-independent: V ([q0 , . . . , qn ]) = V (perm[q0 , . . . , qn ]) where perm is any permutation of the sequence of configurations. These observations lead us to an algorithm that is complete and much more efficient at finding solution paths for vamp problems than belief-space search, although we will find that it will generally be unsuitable in practice.
Fig. 4. (a) The TwoHallway domain with some hand-generated subgoals. (b) Level sets and heat-map of the field F used to compute the heuristic for the Tourist algorithm. (c, d) Two levels of Vavp.
Rather than associating a new visibility region v with each state in the search, we will maintain a single, global v ∈ P(W) and carry out a search directly in Q. The search can only traverse an edge if its swept volume is contained in the workspace that has been viewed during the entire search process up until that time. Once this process reaches a goal state, the tree, in the order it was constructed, is used to construct a solution path.

Pseudo-code is shown in Algorithm 3. It proceeds in two phases. First, it constructs a search tree, where the extension from a point in the tree is made only within the region that has been visible from any configuration previously visited in the search. Second, it constructs a path that visits all of the configurations, in the order in which they were added to the tree, and returns that path. The tree search is slightly unusual, because which edges in the graph can be traversed depends globally on all search nodes in the tree. For this reason, we perform a queue-based search, keeping an agenda of edges, rather than nodes. If an edge is selected for expansion, but is not yet traversable, it is added back to the end of the agenda for reconsideration after some more of the tree has been grown. When a goal state has been reached, we extract a path from the tree. This path will visit the configurations in the same order that they were visited by the search, but they must be connected together via paths in the tree that existed at the point in the search when the configuration was visited.

Theorem 2. Vamp Tree is correct and complete with respect to the configuration-space graph (Q, E) for any robot such that S(q1, q2) = S(q2, q1) for all q1, q2.

Proof. The proof can be found in the Supplementary Materials [21].
Algorithm 3. Vamp Tree((Q, E), V, q0, Qgoal, v0)
  agenda ← [(q0, q′) for (q0, q′) ∈ E]
  visited ← [q0]; T ← [ ]; v ← v0
  while agenda is not empty do
    (qs, qe) ← pop(agenda)
    if qe ∈ visited then continue
    if S(qs, qe) ⊆ v then
      visited.append(qe)                               ▷ add conf to path
      T.append((qs, qe))                               ▷ add edge to tree
      if Qgoal(qe) then break
      v ← v ∪ V(qe)                                    ▷ add new visibility
      agenda.extend([(qe, q′) for (qe, q′) ∈ E])       ▷ add outgoing edges to agenda
    else
      agenda.append((qs, qe))                          ▷ save edge for reconsideration
  if not Qgoal(qe) then return Failed
  p ← [q0]; qcurr ← q0
  for i ∈ [1..len(visited)] do                         ▷ link configurations using previously-enabled edges
    qnext ← visited[i]
    p.extend(shortest_undirected_path(qcurr, qnext, T[0 : i])[1 :])
    qcurr ← qnext
  return p
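The second phase's path extraction can be made concrete with a short sketch: stitch the visit order together using breadth-first search restricted, at step i, to the tree edges T[0:i] that existed when the i-th configuration was added. All names below are illustrative assumptions of ours.

    from collections import deque

    def shortest_undirected_path(src, dst, edges):
        # BFS over the undirected graph formed by `edges` (a list of (a, b) pairs)
        adj = {}
        for a, b in edges:
            adj.setdefault(a, []).append(b)
            adj.setdefault(b, []).append(a)
        parent, frontier = {src: None}, deque([src])
        while frontier:
            u = frontier.popleft()
            if u == dst:
                path = []
                while u is not None:
                    path.append(u)
                    u = parent[u]
                return path[::-1]
            for w in adj.get(u, []):
                if w not in parent:
                    parent[w] = u
                    frontier.append(w)
        return None

    def extract_tour(visited, T, q0):
        # Visit configurations in discovery order, linking each consecutive pair
        # through only the tree edges enabled when the later one was added.
        path, q_curr = [q0], q0
        for i in range(1, len(visited)):
            q_next = visited[i]
            segment = shortest_undirected_path(q_curr, q_next, T[:i])
            path.extend(segment[1:])
            q_curr = q_next
        return path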
4.4 Visibility Preimage Backchaining
Our final approach to this problem is to perform a much more goal-driven search to observe parts of the workspace that will make desired paths feasible. This algorithm is motivated by the observation that goals can be decomposed into subgoals. Figure 4a shows a goal configuration marked 4. To make this configuration visible, the robot must stand at 2 and 3. Finally, to make 2 visible, the robot must stand at 1. It is interesting to note that Vamp Path Vis can plan efficiently to visit these subgoals in order, if they are provided. With this motivation in mind, we describe a general algorithm that has several special cases of interest, described in Sect. 5.

We make use of the Tourist algorithm, whose goal is to see some part of a given previously-unobserved region of workspace. It uses a local-visibility algorithm to find a path, but where the goal test for a configuration is that it is possible to observe some previously unobserved part of the workspace from there. A critical aspect to making this search effective is to use a heuristic that drives progress toward the objective of observing part of a region of interest, R. We begin by computing a scalar field, F, in workspace, of the shortest distance from location x to a point in R. Then, the heuristic is H(q) = min_{x∈V(q)} F(x), which assigns 0 heuristic value to a configuration that can see part of R (because it will be able to see a workspace point x with F(x) = 0) and increasingly higher heuristic values to configurations that can only see points that are "far" in the sense of F from R. Computing F is relatively inexpensive, and it effectively models the fact that visibility does not go through walls. This heuristic is illustrated in Fig. 4b: the black nodes are the workspace target region R. The figure illustrates the level sets of F.

Algorithm 4. Tourist((Q, E), V, q0, R, v0, relaxed = false, O = ∅)
  H(q) = min_{x∈V(q)} F(x), where F is the distance field
  return Vamp Path Vis((Q, E), V, q0, λq. (V(q) ∩ R) ≠ ∅, v0, H, relaxed, O)
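On a grid workspace, F can be computed with a breadth-first sweep that is blocked by obstacle cells, which is what makes the heuristic wall-aware. The sketch below assumes that representation; the names and the 4-connected neighbor rule are ours, not the authors'.

    from collections import deque

    def distance_field(R, free_cells):
        # BFS distances from the target region R through free workspace cells,
        # so distances grow around walls rather than through them.
        F = {c: 0 for c in R}
        frontier = deque(R)
        while frontier:
            (x, y) = frontier.popleft()
            for n in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]:
                if n in free_cells and n not in F:
                    F[n] = F[(x, y)] + 1
                    frontier.append(n)
        return F

    def tourist_heuristic(q, V, F):
        # H(q) = min over visible cells x of F(x); 0 iff q already sees part of R
        visible = V(q)
        return min((F[x] for x in visible if x in F), default=float("inf"))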
Now we can describe the Vamp Backchain algorithm, with pseudo-code shown in Algorithm 5. The main loop of the algorithm is in lines 11–18. It keeps track of p, the solution path it is constructing; v, the region of workspace that has been viewed by p; and q, the configuration at the end of p. On every iteration, it checks to see whether a goal state is reachable from q with the current visibility v. If so, it appends the path that does so to p and returns a final solution. If that test fails, then it generates a path that is guaranteed to increase v (if the problem is feasible), ideally in a way that makes it easier to reach a goal configuration. In line 15, we find a relaxed plan p_relaxed that reaches a goal state, preferring to stay inside v, but allowing excursions if necessary. Now, our objective is to find a path p_vis that will observe some previously-unobserved part of the swept volume of p_relaxed, by calling procedure Vavp. If that call fails, then we fall back on an undirected exploration strategy, to view any part of the unviewed workspace. Once we have found a view path, we update p, v, and q based on p_vis, test to see if we can now find a path to the goal, etc.

The Vavp sub-procedure takes a configuration q, region of interest R, previously viewed volume v, and out-of-bounds area O, and returns a path that will view some part of R without colliding with O, or, in a recursive call, view an unviewed part of the swept volume of a path that will view some part of R without colliding with either O or the new target region, etc. If it cannot find any such recursively useful view path, it fails. For visual simplicity we are omitting the arguments ((Q, E), V) from calls to Tourist and Vamp Path Vis.

Algorithm 5. Vamp Backchain((Q, E), W, V, q0, Qgoal, v0)
 1: procedure Vavp(q, R, v, O = ∅)
 2:   p_vis ← Tourist(q, R, v)
 3:   if p_vis ≠ Failed then return p_vis
 4:   O_new ← O ∪ R
 5:   p_relaxed ← Tourist(q, R, v, relaxed = true, O = O_new)
 6:   if p_relaxed ≠ Failed then
 7:     p_vis ← Vavp(q, S(p_relaxed) \ v, v, O = O_new)
 8:     if p_vis ≠ Failed then return p_vis
 9:   return Failed
10:
11: p ← [ ]; v ← v0; q ← q0
12: while True do
13:   p_final ← Vamp Path Vis(q, Qgoal, v)
14:   if p_final ≠ Failed then return p + p_final
15:   p_relaxed ← Vamp Path Vis(q, Qgoal, v, relaxed = true)
16:   p_vis ← Vavp(q, S(p_relaxed) \ v, v)
17:   if p_vis = Failed then p_vis ← Tourist(q, W \ v, v)
18:   if p_vis = Failed then return Failed
19:   p ← p + p_vis; v ← v ∪ V(p_vis); q ← p_vis[−1]

Figure 4 illustrates the operation of Vamp Backchain in the TwoHallway domain. It is given a goal to see the region at the end of the vertical hallway (red points in Fig. 4c). The hallway is keyed, and the robot can only see the region through the peephole. Vavp generates a relaxed plan (gray) to see these points. This relaxed path is in violation, and requires that regions be made visible (blue points in Fig. 4d) before it can be executed. Vavp recurses one level, with the blue region as the goal, and both marked "out of bounds", and generates the gray path in Fig. 4d. This path satisfies the full constraints, and it is returned by Vavp. Note that the returned path does not satisfy the original goal, but achieves visibility to enable solving the original goal in a later call to Vavp.

Two difficult examples that motivate the structure of the Vamp Backchain algorithm are illustrated in Fig. 5. In Fig. 5a, the robot must move to the dashed outline on the right. It cannot do so with step-wise visibility (line 13), so it makes a relaxed plan (line 15) to slide horizontally to the goal. However, none of the swept volume of that relaxed plan can be viewed (line 2) under normal visibility constraints, nor can we even generate a relaxed plan to view it (line 5), so Vavp fails. We fall back on simply generating a path that views some part of the un-viewed workspace (line 17), which yields the path shown by the unfilled robot outlines. The ultimate solution to the problem is indicated by the robot outlines. In Fig. 5b, we see an example illustrating the potential need for arbitrary recursive nesting. In this case, the inner walls are transparent (so the robot can see through them, but it cannot move through them). The solution requires moving forward into the bottom-most hallway to clear it, then moving into it again sideways to look through the windows, thus clearing the hallway above it, and so on.
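For concreteness, the Vavp sub-procedure (lines 1–9 of Algorithm 5) transcribes almost directly into Python. In this sketch the planners are passed in as functions that return None on failure, and S_path gives a path's swept volume; both conventions are assumptions of ours.

    def vavp(q, R, v, tourist, S_path, O=frozenset()):
        p_vis = tourist(q, R, v)                 # try to view part of R directly
        if p_vis is not None:
            return p_vis
        O_new = O | R                            # R joins the out-of-bounds region
        p_relaxed = tourist(q, R, v, relaxed=True, O=O_new)
        if p_relaxed is not None:
            # recurse: view some unviewed part of the relaxed plan's swept volume
            return vavp(q, S_path(p_relaxed) - v, v, tourist, S_path, O=O_new)
        return None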
Fig. 5. Difficult examples for the Vamp Backchain algorithm.
Theorem 3. The algorithm Vamp Backchain is correct and complete with respect to the configuration-space graph (Q, E) for any robot such that S(q1, q2) = S(q2, q1) for all q1, q2.

Proof. The proof can be found in the Supplementary Materials [21].
5 Experiments
In our experiments, we consider a planar robot (1 m × 1 m) operating in a 2D workspace. For all of the experiments, we discretize the robot motions and search on a 6-connected lattice (Δx = Δy = 0.125 m, Δθ = 2π/16). The depth of view of the visible region is 2.5 m. All swept volume and containment computations were performed by sampling points along the boundary of the robot.

We ran two versions of Vamp Backchain. VB∞ corresponds to the algorithm as presented in Sect. 4.4, with the difference that all sub-calls to the planners are relaxed versions. We never call the un-relaxed planner; however, we still verify the feasibility of paths before incorporating them into the final path. This choice has the benefit of accelerating the search procedure, since we do not have to wait for Tourist to return failure in situations where the search is incomplete. In practice, the relaxed planners often return a feasible path if one exists, but occasionally they produce a violating path, which means subsequent searches may do unnecessary work to provide visibility in the violated region. VB1 corresponds to Vamp Backchain, but with a recursion depth limit. In this variation, in the first call to Vavp, lines 4–5 are skipped. In recursive calls, line 4 is executed and its result is returned. Furthermore, instead of waiting for this call to Tourist to fail, we set a timeout to trigger the subsequent call to Tourist.

We run experiments on many instances of vamp problems. Instances vary in obstacle layout, start and goal states, and field of view of the vision sensor. There are three combinations of obstacle layout and start/goal states, each exhibiting increasing problem difficulty: HallwayEasy, depicted in Fig. 2; HallwayHard, depicted in Fig. 3; and TwoHallway, depicted in Fig. 4a, which contains a "keyed" vertical hallway that can only be entered backwards. For each experiment, we report search time, path length, and total number of nodes expanded in any subroutine searches.

TwoHallway is designed to demonstrate the recursion capabilities of Vamp Backchain, so VB∞ noticeably outperforms VB1 on it. Because VB1 does not perform backchaining more than once, it relies on line 17 of Vamp Backchain. In practice, for problems exhibiting the nested dependency as in TwoHallway, VB1 generates paths that view the whole space because the search cannot be guided through nested dependencies. Situations in which VB∞ performs worse are due to sub-optimal relaxed paths, which incur violations that could be avoided. We also collected search times and tree size for the TreeVis algorithm. For TwoHallway, it expands 62,000 nodes and searches for 60 s. Note that this
                  fov = 50°        fov = 200°       fov = 350°
                  VB1     VB∞      VB1     VB∞      VB1     VB∞
Search time (s)
  HallwayEasy     5.1     16.4     1.2     1.2      1.0     2.2
  HallwayHard     23.4    315.3    9.4     13.9     2.4     3.5
  TwoHallway      2868.3  281.3    638.5   220.9    1952.1  134.8
Path length (m)
  HallwayEasy     12.3    13.3     8.4     8.4      8.4     8.4
  HallwayHard     14.3    16.9     12.5    12.5     11.4    11.4
  TwoHallway      63.9    43.4     47.6    43.2     40.6    34.3
Closed nodes
  HallwayEasy     2578    9241     377     377      137     137
  HallwayHard     7667    40428    3436    4469     604     604
  TwoHallway      139484  64145    76083   62586    92184   44188
does not include any time for generating a path. The naïve path would include every edge in the tree, visiting every node in search order, which would never be a practical path. Note additionally that TreeVis is not a directed search, and so in domains where the workspace is large, it is unlikely that TreeVis will be practical.
6 Discussion
vamp instances are challenging when the domain requires plans that achieve visibility in order to perform a motion in the future. We present two small instances, HallwayHard and TwoHallway, that have this property. Solving the problem directly in the belief space is computationally intractable. We, instead, direct the search by relying on calls to constraint-relaxed plans. The setting we consider in this paper is fully deterministic; in future work we are interested in handling uncertainty in the pose of "known" objects, and uncertainty in the pose of the robot due to odometry and localization errors.

Acknowledgements. We gratefully acknowledge support from NSF grants 1523767 and 1723381; from AFOSR grant FA9550-17-1-0165; from ONR grant N00014-18-1-2847; from Honda Research; and from Draper Laboratory. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.
References

1. Galceran, E., Carreras, M.: A survey on coverage path planning for robotics. Robot. Auton. Syst. 61(12), 1258–1276 (2013)
2. Davis, B., Karamouzas, I., Guy, S.J.: C-OPT: coverage-aware trajectory optimization under uncertainty. RA Lett. 1(2), 1020–1027 (2016)
3. Englot, B., Hover, F.: Sampling-based coverage path planning for inspection of complex structures. In: ICAPS (2012)
4. Yamauchi, B.: A frontier-based approach for autonomous exploration. In: CIRA (1997)
5. Heng, L., Gotovos, A., Krause, A., Pollefeys, M.: Efficient visual exploration and coverage with a micro aerial vehicle in unknown environments. In: ICRA (2015)
6. Oriolo, G., Vendittelli, M., Freda, L., Troso, G.: The SRT method: randomized strategies for exploration. In: ICRA (2004)
7. Dornhege, C., Kleiner, A.: A frontier-void-based approach for autonomous exploration in 3D. Adv. Robot. 27(6), 459–468 (2013)
8. Bircher, A., Kamel, M., Alexis, K., Oleynikova, H., Siegwart, R.: Receding horizon "next-best-view" planner for 3D exploration. In: ICRA (2016)
9. Bekris, K.E., Kavraki, L.: Greedy but safe replanning under kinodynamic constraints. In: ICRA (2007)
10. Lauri, M., Ritala, R.: Planning for robotic exploration based on forward simulation. Robot. Auton. Syst. 83, 15–31 (2016)
11. Janson, L., Hu, T., Pavone, M.: Safe motion planning in unknown environments: optimality benchmarks and tractable policies. In: RSS (2018)
12. Arvanitakis, I., Tzes, A., Giannousakis, K.: Synergistic exploration and navigation of mobile robots under pose uncertainty in unknown environments. Int. J. Adv. Robot. Syst. 15(1), 1729881417750785 (2018)
13. Richter, C., Vega-Brown, W., Roy, N.: Bayesian learning for safe high-speed navigation in unknown environments. In: ISRR (2015)
14. Axelrod, B., Kaelbling, L., Lozano-Pérez, T.: Provably safe robot navigation with obstacle uncertainty. IJRR 37, 1760–1774 (2018)
15. Bouraine, S., Fraichard, T., Salhi, H.: Provably safe navigation for mobile robots with limited field-of-views in dynamic environments. Auton. Robot. 32(3), 267–283 (2012)
16. Bry, A., Roy, N.: Rapidly-exploring random belief trees for motion planning under uncertainty. In: ICRA (2011)
17. Gerkey, B.P., Thrun, S., Gordon, G.: Visibility-based pursuit-evasion with limited field of view. Int. J. Robot. Res. 25(4), 299–315 (2006)
18. Stiffler, N.M., O'Kane, J.M.: Complete and optimal visibility-based pursuit-evasion. Int. J. Robot. Res. 36(8), 923–946 (2017)
19. Kavraki, L.E., Svestka, P., Latombe, J.-C., Overmars, M.H.: Probabilistic roadmaps for path planning in high-dimensional configuration spaces. Trans. Robot. Autom. 12(4), 566–580 (1996)
20. Prentice, S., Roy, N.: The belief roadmap: efficient planning in belief space by factoring the covariance. IJRR 28(11–12), 1448–1465 (2009)
21. Goretkin, G., Kaelbling, L.P., Lozano-Pérez, T.: Look before you sweep: visibility-aware motion planning. arXiv e-prints arXiv:1901.06109 (2019)
22. Hauser, K.: The minimum constraint removal problem with three robotics applications. IJRR 33(1), 5–17 (2014)
Path Planning for Information Gathering with Lethal Hazards and No Communication

Michael Otte¹(B) and Donald Sofge²

¹ University of Maryland, College Park, MD 20742, USA
[email protected]
² U.S. Naval Research Laboratory, Washington, DC 20375, USA
Abstract. We consider a scenario where agents search for targets in a hazardous environment that prevents communication. Agents in the field cannot communicate, and hazards are only directly observable by the agents that are destroyed by them. Thus, beliefs about hazard locations must be inferred by sending agents to travel along various paths and then observing which agents survive. In other words, agent survival along a path can be used as a sensor for hazard detection; we call this form of sensor a “path-based sensor”. We present a recursive Bayesian update for path-based sensors, and leverage it to calculate the expected information gained about both hazards and targets along a particular path. We formalize the resulting iterative information based path planning problem that results from this scenario, and present an algorithm to solve it. Agents iteratively foray into the field. The next path each agent follows is calculated to maximize a weighted combination of the expected information gained about targets and hazards (where the weighting is defined by user preferences). The method is evaluated in Monte Carlo simulations, and we observe that it outperforms other techniques.
1 Introduction
We are motivated by the following scenario: Autonomous agents are used to help search for human survivors ("targets") in a hazardous environment, but wireless communication is prohibited. As agents gather information about survivors' whereabouts, they must physically visit special "uplink sites" to upload their information for use by humans and other agents (uplink sites may be, e.g., naval ships or bases). Information gathered by a particular agent is lost if that agent is destroyed before reaching an uplink site. Hazards exist in the environment but are invisible. An agent cannot upload data about direct positive hazard observations because the only way to positively "observe" a hazard is to be destroyed by it. Luckily, indirect information about hazards can be inferred by remembering which path an agent plans to take, and then observing whether or not the agent survives a journey along that path. Agents are less likely to return from paths containing adversaries than paths that are adversary free. Thus, we
can use path traversal as a sensor for detecting hazards, albeit indirectly—we call this form of sensor a “path-based sensor”. In this paper we show how observations from path-based sensors can be used in a recursive Bayesian framework to refine our beliefs about whether or not hazards exist at various locations in the environment. We also show how to calculate the expected information gain that will result from sending an agent along a particular path and then observing whether or not it survives. Expected information gain is defined as the expected reduction in the Shannon entropy of our beliefs integrated over the distribution of possible events. In other words, we seek paths that maximize mutual information [2,14] between sensor readings and our existing beliefs regarding hazards and targets (Fig. 1).
Fig. 1. A-G: search agents attempt seven paths ending at two uplink points (communication is impossible elsewhere). Three are destroyed at unknown locations (A, D, E). Each agent’s path, combined with the observation of whether or not that agent survived, is a hazard sensor. H: the maximum information is gained by visiting a location with 50% hazard probability (and not the location with the highest hazard probability). On-board sensors simultaneously detect targets (e.g., human survivors, not shown).
The path-based sensor Bayesian update and expected information gain can be used to plan paths that maximize: (A) the expected information gained about hazard locations; (B) the expected information gained about target locations while accounting for the agent's survival given our belief about hazards; and (C) a weighted linear combination of (A) and (B) as specified by a user. The path-based sensor may produce false positives and/or false negatives. False positives are possible if the agent malfunctions for reasons unrelated to hazards. False negatives are possible because hazards may not destroy every agent they encounter. We assume that, each time they move, agents malfunction with a known probability pmalfunc ∈ [0, 1), and that hazards destroy agents they
encounter with a known probability pkill ∈ (0, 1]. If a hazard fails to destroy an agent, then the hazard remains unobserved by that agent.

The recursive path-based sensor update and information theoretic equations that we derive in this paper can be used to solve a variety of problems. A path-sensor update is the appropriate tool to use whenever a vehicle follows a path carrying a "one-time-use" sensor and yet we have no direct way of observing when the sensor is triggered. Other scenarios of this form include: radiation or chemical source localization using single-use dosimeters that can be replaced at "safe locations" (e.g., dosimeters that change color when encountering a particular chemical as a result of a chemical reaction); presence mapping of ecological species, geological minerals, etc., where multiple specimens are collected and mixed in a single container that must be returned before its contents can be analyzed; and a variety of other search and rescue settings (for example, in a civilian setting the targets, hazards, and uplink points may be the survivors of a nuclear disaster, radiation sources, and fixed communication links, respectively).

The rest of this paper is organized as follows: Related Work appears in Sect. 2. Nomenclature is introduced in Sect. 3. The recursive Bayesian update for the path-based sensor and the information theoretic calculations are described in Sect. 4. The problem we solve is formally defined in Sect. 5, and an algorithm to solve it is described in Sect. 6. In Sect. 7 we run a number of Monte Carlo simulations to evaluate our algorithm and compare it to other methods that have previously been used to solve related problems. Our conclusions are presented in Sect. 8.
2 Related Work
Surveys of previous work on target search can be found in [1] and [11]. The main difference between our work and previous work is the scenario considered: path planning to maximize information gain about targets and hazards in an environment where hazards can destroy agents and communication is impossible. Most previous work investigating target search in hazardous environments has assumed agents are able to communicate from any location in the environment. The approach presented in [13] shares two important similarities with our work: recursive Bayesian filters are used to update estimates of both target and hazard locations, and the probability that an agent may be destroyed at different locations is used for calculating the mutual information that can be gained about targets. Unlike the problem we consider, [13] assumes that robots can always communicate, which makes agent failure locations directly observable. Our assumption that agents cannot communicate from the field means it is impossible to know where agents are destroyed; and therefore we must use a path-based sensor to gather indirect evidence of hazard positions. Other differences between our work and [13] include: (i) we are interested in planning multi-step paths subject to fuel constraints (while [13] uses information
surfing¹, a greedy 1-step look-ahead approach that does not consider fuel constraints); (ii) we explore a family of techniques that enable path optimization for any weighted combination of the expected information gained about targets and hazards²; (iii) a continuous space formulation is used in [13], and targets are assumed to emit a random signal that is a function over the entire search space. We consider a discrete formulation in which each map cell either does or does not contain a target and/or hazard.

The target information gathering control problem is studied in [9]; the mutual information associated with (only) an environmental monitoring mission is considered, and information is gathered about environmental state without the possibility of agent failures. The first formal derivation of the gradient of mutual information is also presented in [9], and it is proven that a multi-agent control policy of gradient ascent will converge (gather all information), in the limit. The authors consider multi-agent mutual-information target-localization control in [3], again using 1-step look-ahead information surfing. In [5] the method from [3] was implemented on a test-bed and used to demonstrate localization of magnetic sources by quad-rotor aircraft. Two of the authors present multi-agent information gathering control policies in [4], using both 1-step look-ahead and a 3-step receding horizon control. The main focus of [3–5] is a decentralized multi-agent control formulation. Our work differs from the aforementioned works [3–5,9] in that we consider a scenario in which hazards can destroy agents.

Receding horizon control for gathering information about a moving target following a random walk in the underwater domain is presented in [8]. Similarities to our work include an assumption that long breaks in communication may occur (in [8] this happens when agents are submerged). Our work differs from [8] both in the scenario considered ([8] considers a moving target and assumes false positives are negligible), and in the type of solution that is computed ([8] focuses on distributed multi-agent control—testing horizons of 1 and 4).

A number of other works consider the risk of agent loss within a search or target tracking task. A branch-and-bound technique is used to find paths for search over graphs that attempt to simultaneously optimize cost functions defined over fuel constraints, time, risk, and (optionally) the path endpoint [12]. A neural network based strategy is proven to be robust to partial loss of UAVs; individual planners maximize a heuristic function while learning how other agents tend to behave in various situations [17]. In [7] threats are considered and the robot moves in order to decrease a custom heuristic value based on a probability map, while preliminary work by the same authors that does not consider threats appears in [6]. In [16] agents consider other vehicles to be soft obstacles that present dynamic threats, and use "approximate" dynamic programming to plan movements that maximize a heuristic combination of target confirmation,
¹ In "information surfing" methods the agent(s) continually move "up" a gradient of mutual information using a greedy 1-step look-ahead.
² Although [13] uses hazard probabilities to help calculate the expected information gain regarding targets, hazard information is not directly considered as part of the objective function.
environment exploration, and threat avoidance given updated values of threat probability, target probability, and the certainty of these beliefs. Agent loss in hazardous environments with moving (and hostile) targets is considered in [15]. Targets are assumed to move at a constant speed, and agents use a sweeping formation that is adjusted as agents are destroyed. In [10] the probability that at least k of n robots survive a hazardous multipath is maximized.
3 Nomenclature
The search space is denoted S and is assumed to be a subset of two-dimensional Euclidean space, S ⊂ R². The vector s ∈ S describes a point in S. Let X denote the state of the environment with respect to target presence. In general, X is a discrete time random variable that takes values on alphabet X. For the stationary target task that we consider, X is constant over time. Y is a sensor observation (also a random variable) of a portion of the environment that takes values on an alphabet Y. We assume that environmental sensor measurements occur at discrete times and are indexed by the variable t = 1, 2, 3, . . .. Let Z denote the state of the environment with respect to hazard presence, where Z is a discrete time random variable that takes values on alphabet Z. Let Q be an observation of a failure event at a location in the environment, where Q takes values on alphabet Q. Even though we cannot directly observe agent failures, we can still reason about the probability that they occur; thus, defining Q is useful.

We assume the environment is discretized such that paths are broken into a finite number of edges between nodes. The traversal of path segments from node to node is modeled (for the purpose of hazard inference) by a global discrete time counter that uses the variable τ = 1, 2, 3, . . .. Target sensor measurements may happen independently of path segment traversals such that τ ≠ t, in general. We assume the ability to track both τ and t (independently). We are provided a set W of k fixed uplink sites, W = {w1, . . . , wk}, where agents can upload the data they collect, e.g., for future use.

A path ζ is a mapping from the interval [0, 1] to the state space S. We will assume that paths are piecewise continuous curves that start and end at uplink sites. Formally, ζ : [0, 1] → S such that ζ(0) = sstart = wa and ζ(1) = sgoal = wb for wa, wb ∈ W. We allow both wa = wb and wa ≠ wb. With an abuse of notation (made to improve the overall clarity of our presentation) we overload the symbol ζ to additionally represent the set of points contained in the path that it defines.

The robot is assumed to take sensor measurements regarding target presence as it travels along the path. In this paper we consider the discrete case where one observation is made at each node in the path. Given this assumption, and assuming that t measurements have already been taken before an agent starts moving along its path, then the successful completion of a path provides an ordered finite set of sensor observations {yt+1, . . . , yt+ℓ}, where yk is taken at position sk ∈ ζ and t + 1 ≤ k ≤ t + ℓ, where ℓ is the number of sensor readings taken along the path ζ.

Given a hazardous environment (as well as a nominal probability of agent failure), the successful traversal of the path is itself a random event that depends
on both the path taken, and the hazards in the environment. Let θζ,alive and θζ,dead denote the complementary events that the robot survives the path ζ or does not, respectively. Let Θζ be the random variable associated with survival of a path ζ. In general, the probability of these events is defined by a functional that accounts for motion along the path, and environmental hazards. P (Θζ = θζ,alive |Z) = f (ζ, Z) and P (Θζ = θζ,dead |Z) = 1 − P (Θζ = θζ,alive |Z). The particular form of f depends on the way that the environment is modeled. In our experiments we assume that hazards destroy agents transiting their map cells with probability pkill ∈ (0, 1], and that agents also malfunction with probability pmalfunc ∈ [0, 1) in each cell.
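Under the grid model used in the experiments, one concrete choice of f multiplies independent per-cell survival probabilities. The sketch below assumes hazard beliefs are stored as a per-cell probability map; the representation and names are our illustrative assumptions.

    def survival_probability(path_cells, hazard, p_kill, p_malfunc):
        # Expected P(alive) for a path through `path_cells`: in each cell the
        # agent survives a possible malfunction (prob. 1 - p_malfunc) and,
        # weighting by the believed hazard probability hazard[c], also
        # survives any hazard there (prob. 1 - p_kill when a hazard is present).
        p_alive = 1.0
        for c in path_cells:
            p_alive *= (1.0 - p_malfunc) * (1.0 - p_kill * hazard[c])
        return p_alive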
4 Bayesian Belief Updates and Expected Information Gain
The recursive Bayesian update for the path-based sensor requires a discrete time model. This model supports continuous paths, so long as the path can be partitioned into a finite number of path segments in order to reason about the location of the map cell in which an agent may have been destroyed. In a positive path-based sensor "observation" of a hazard the observing agent is destroyed, which prevents it from relaying direct knowledge about the hazard to other agents. However, a belief update is possible by considering separately each possibility (the finite set of mutually exclusive events that the agent was destroyed while traveling along each segment k ∈ [1, ℓ]), and then combining the resulting separate belief maps weighted by the relative likelihood of each occurring. Whenever the agent survives the path we can directly update our belief map based on the negative hazard observations occurring along each of the segments in the path.

4.1 Target Updates (Assuming a Standard Sensor)
Let X0 denote the prior belief, defined over S, that each point s ∈ S contains a target. For notational convenience, we increment the time index t based on the number of successfully communicated sensor measurements, i.e., t ordered sensor observations have been delivered to the uplink points by time t. Given sensor measurements y1, . . . , yt (which may have been taken across multiple paths of varying lengths) and X0, an iterative Bayesian update can be used to compute P(Xt | y1, . . . , yt), the posterior probability of X given the t sensor readings delivered to the uplink points by time t:

P(Xt | Y1 = y1, . . . , Yt = yt) = [P(Yt = yt | Xt−1) P(Xt−1 | Y1 = y1, . . . , Yt−1 = yt−1)] / P(Yt = yt | Y1 = y1, . . . , Yt−1 = yt−1)    (1)
As is standard practice, the denominator need not be explicitly calculated; rather, we calculate the numerators of Eq. 1 for all events Xt = x ∈ X and then normalize so that they sum to 1.
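For a map of independent cells with binary occupancy, the normalized form of Eq. 1 is a two-line computation per cell. This sketch assumes a sensor model with known true-positive and false-positive rates; both the independence assumption and the names are ours.

    def bayes_cell_update(prior, y, p_true_pos, p_false_pos):
        # Recursive Bayesian update for one cell: prior = P(target present),
        # y in {0, 1} is the sensor reading for that cell.
        like_target = p_true_pos if y == 1 else 1.0 - p_true_pos
        like_empty = p_false_pos if y == 1 else 1.0 - p_false_pos
        num_target = like_target * prior        # numerator of Eq. 1, X = target
        num_empty = like_empty * (1.0 - prior)  # numerator of Eq. 1, X = no target
        # normalize instead of computing the denominator explicitly
        return num_target / (num_target + num_empty)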
The information entropy of Xt is denoted H(Xt) and defined:

H(Xt) = − ∫_{x∈X} P(Xt = x) log P(Xt = x) dx
and provides a measure of the unpredictability of Xt . As entropy increases, Xt is essentially “worse” at being able to predict the presence or absence of a target (in other words, its values are closer to a uniformly random process). The conditional information entropy H(Xt+1 |Yt+1 ) is the updated entropy of the environmental state X (w.r.t. target presence) given a new observation Yt+1 , averaged over all possible values that Yt+1 may take. The difference between the entropy H(Xt ) and the conditional entropy H(Xt+1 |Yt+1 ) is called mutual information, defined I(Xt ; Yt+1 ) = H(Xt ) − H(Xt+1 |Yt+1 ). Mutual information quantifies the expected reduction in the unpredictability of our estimation of X given the new measurement Yt+1 . It is useful to calculate the mutual information of a new (target sensor) measurement Yt+1 before it is taken, so that we may compare the expected benefits of sampling various locations. The mutual information of a new observation (assuming it is delivered to a communication point) is calculated:
I(Xt; Yt+1) = ∫_{y∈Y} ∫_{x∈X} P(Yt+1 = y, Xt = x) log [ P(Yt+1 = y, Xt = x) / (P(Yt+1 = y) P(Xt = x)) ] dx dy
where P(Yt+1 = y, Xt = x) = P(Yt+1 = y | Xt = x) P(Xt = x).

We want to plan paths that gather as much mutual information along the path as possible, given fuel constraints and other goals. Another goal, for example, is to also gather information about hazard locations (hazards are discussed in the next subsection). Given a path ζ that enables sensor observations yt+1, . . . , yt+ℓ if and only if it is completed successfully, the expected cumulative information gained along that path (given all measurements so far) is calculated:

I(Xt; Yt+1, . . . , Yt+ℓ) = Σ_{k=t+1}^{t+ℓ} I(Xk−1; Yk | Yt+1, . . . , Yk−1)
where the notation I(A; B | C) denotes the conditional mutual information of A and B, integrated over all possible outcomes in the event space of C (and weighted by their relative likelihood). That is, I(A; B | C) = E_C(I(A; B) | C). In the most general case (in which targets at any locations in the environment may affect sensor readings at any other location) the calculation of I(Xt; Yt+1, . . . , Yt+ℓ) can become intractable, because the number of terms involved in the computation of the inner I(Xk−1; Yk | Yt+1, . . . , Yk−1) scales according to |Y|^k. However, this complexity can be reduced, e.g., to a small constant, by assuming that each target only affects sensor observations in its own local neighborhood.

From the point of view of the planning system, no information about target locations is actually gained until the robot reaches an uplink point. Hence, no information about targets is gathered in the event that the robot is destroyed
along its path. Consequently, the expected mutual information along a particular path (assuming the robot may or may not be destroyed along that path) is: I(Xt; Yt+1, . . . , Yt+ℓ | Θζ) = P(Θζ = θζ,alive) I(Xt; Yt+1, . . . , Yt+ℓ).

4.2 Hazard Updates (Assuming a Path-Based Sensor)
Environmental hazards may prohibit agents from reaching their intended destinations. Thus, we can update our belief about environmental hazards Z by observing whether or not agents reach their destinations. Both events Θζ = θζ,alive and Θζ = θζ,dead can be used to perform an iterative Bayesian update of Z based on ζ. However, the iterative updates to Z based on Θζ take different forms depending on whether Θζ = θζ,alive or Θζ = θζ,dead.

We begin by noting that if we had access to data of hazard observations at all cells along the path, then a straightforward belief update is as follows:

P(Zτ+j | Zτ, Qτ+1 = qτ+1, . . . , Qτ+j = qτ+j) = [P(Qτ+j = qτ+j | Zτ+j−1) P(Zτ+j−1 | Zτ, Qτ+1 = qτ+1, . . . , Qτ+j−1 = qτ+j−1)] / P(Qτ+j = qτ+j | Zτ, Qτ+1 = qτ+1, . . . , Qτ+j−1 = qτ+j−1)    (2)

Next, we observe that whenever an agent survives we do have direct access to all "observations" of hazards along the path, and they are Qj = qj = 0 by construction (since the agent survived). Formally, Θζ = θζ,alive ⟺ qτ+1 = 0, . . . , qτ+l = 0. Thus, we simply perform the standard update P(Zτ+l | Θζ = θζ,alive) = P(Zτ+l | Zτ, Qτ+1 = 0, . . . , Qτ+l = 0), which can be computed iteratively, for each j = 1, . . . , l, as follows:

P(Zτ+j | Zτ, Qτ+1 = 0, . . . , Qτ+j = 0) = [P(Qτ+j = 0 | Zτ+j−1) P(Zτ+j−1 | Zτ, Qτ+1 = 0, . . . , Qτ+j−1 = 0)] / P(Qτ+j = 0 | Zτ, Qτ+1 = 0, . . . , Qτ+j−1 = 0)
In contrast, when an agent does not survive (Θζ = θζ,dead ) the recursive Bayesian update of Z must take a different form. Given a path with l segments, with the first segment starting at time τ , the event Qτ +j = 1 is equivalent to the statement “the agent was killed along the j-th segment of the path”. Given Θζ = θζ,dead , we know that the agent was killed somewhere along ζ, but we do not know where. However, we can integrate over all l possibilities, i.e., considering each possibility that the robot was killed on path segment j for all j such that 1 ≤ j ≤ l, and then summing these results weighted by the relative probability of each (given our current hazard beliefs). It is convenient to use the metaphor of a multiverse. We simultaneously assume the existence of j different universes, such that in the j-th universe the agent was killed along the j-th path segment. Assuming we are in a particular j-th universe, we can calculate the iterative Bayesian update to Z by applying Eq. 2 exactly j times, assuming that on the k-th application: 0 if k < j Qτ +k = qτ +k = 1 if k = j
Planning for Information Gathering with Hazards and No Communication
397
and where no observations are made for k > j in the j-th universe. Let Zτj +l denote the version of Zτ +l that is calculated in the j-th universe. The final overall update to the “real” Zτ +l is the expected value of Zτ +l in the multiverse, found by combining all Zτ +l weighted by P (Qτ +j = 1|Zτ , Θζ = θζ,dead ), the probability of being in the j-th universe. Zτ +l =
l j=1
P (Qτ +j = 1|Zτ , Θζ = θζ,dead ) Zτj +l
(3)
The quantity P (Qτ +j = 1|Zτ , Θζ = θζ,dead ) can be obtained by calculating the probability that the agent survives to the j-th path segment given Zτ and is then destroyed there, and then normalizing such that the probabilities of all l possibilities sum to 1. j−1 P (Qτ +j = 1, Zτ ) k=1 P (Qτ +k = 0, Zτ ) P (Qτ +j = 1, Zτ , Θζ = θζ,dead ) = l j−1 j=1 P (Qτ +j = 1, Zτ ) k=1 P (Qτ +k = 0, Zτ ) where P (Qτ +k = q, Zτ ) = z∈Z P (Qτ +k = q|Zτ = zτ ) P (Zτ = zτ ) dz for q ∈ {0, 1}. The expected decrease in entropy about hazard locations gained from sending an agent along path ζ can be calculated by first calculating the conditional decrease in entropy assuming either possibility of Θζ = θζ,alive and Θζ = θζ,dead (i.e, independently), and then combining the results weighted by the probability of each event given Zτ . θζ,alive be the value of Zτ +l that results if the agent survives the path Let Zτ +l θ
ζ,dead be the result if the agent (as calculated according to Eq. 2). Similarly, let Zτ +l does not survive (as calculated by Eq. 3). The mutual information regarding hazards (the expected information gained from a path-based sensor observation) is given by the expected reduction in entropy: I(Zτ ; Θζ ) = H(Zτ ) − H(Zτ +l |Θζ , Zτ ), where H(Zτ +l |Θζ , Zτ ) = θ P (Θζ = θ|Zτ ) H(Zτ +l |Θζ = θ, Zτ )dθ
l is the conditional entropy, and P (Θζ = θζ,alive |Zτ ) = j=1 P (Qτ +j = 0, Zτ ) and P (Θζ = θζ,dead |Zτ ) = 1 − P (Θζ = θζ,alive |Zτ ) and H(Zτ +l |Θζ = θζ,alive , Zτ ) = θζ,alive θζ,dead Zτ +l and H(Zτ +l |Θζ = θζ,dead , Zτ ) = Zτ +l .
5
Formal Problem Definitions
Problem Definition. Mutual information path planning for targets and hazards without communication: Given a search space S, and a set of uplink points {w1 , . . . , wk } = W ⊂ S, and assuming an agent can start at any wstart ∈ S and end at any wgoal ∈ S, and assuming an agent has a noisy target sensor that provides observations Y
398
M. Otte and D. Sofge
about targets X. Find the path ζ ∗ that maximizes the expected information gain about both targets X and hazards Z: ζ ∗ = arg max cX I(Xt ; Yt+1 , . . . , Yt+ |Θζ ) + cZ I(Zτ ; Θζ ) ζ
where cX , cZ ∈ [0, 1] are weights that represent user preference for either type of information, and where for all ζ it is true that ζ : [0, 1] → S s.t. ζ(0) = wstart and ζ(1) = wgoal , subject to distance constraints ζ < , and Θζ is the space of observations about whether or not the agent successfully completes the path. Problem Definition. Iterative mutual information path planning for targets and hazards without communication: Repeatedly solve the mutual information path planning for targets and hazards problem to continually refine our belief about targets X and hazards Z given Y and Θζ , respectively; assuming we are able to replace agents that are lost (once their failure to appear at their destinations as been observed).
6
Algorithms
The environment is modeled as discrete map M of non-overlapping cells Mi ⊂ M, where 1 ≤ i ≤ m and Mi ∩ Mj = ∅ for i = j. Target and adversary effects are assumed local to the map cells containing those targets and adversaries, respectively. These assumptions are useful in practice because they reduce computational complexity. To simplify our presentation the same map M = i∈[1,m] Mi is used to reason about both targets and hazard. Because target and hazard effects are local to each cell, our beliefs about targets (and hazards) are stored in arrays X (and Z), where X[i] (and Z[i]) are our current belief that map cell Mi contains a target3 (respectively, a hazard). Connectivity information is stored in a graph GS = (VS , ES ). Each map cell Mi has a corresponding node vi ∈ VS , and an edge (vi , vj ) ∈ ES indicates it is possible to move directly between map cells Mi and Mj . Self transitions (vi , vi ) ∈ ES are allowed, but can be removed in cases where agents must remain in motion. Mutual information is sub-modular—there are diminishing returns for visiting the same cell again and again. Therefore, we plan in GS×T = (VS×T , ES×T ) the space-time extension of GS , to track cell visit counts along a path. 3
In the most general discrete formulation of the ideas presented in Sects. 3, 4 and 5, target existence across all cells in the map is represented by a single random variable X that takes one of the 2m different possible values x (depending on which cells contain targets and which do not). The set of all 2m possibilities forms the alphabet X . However, if each target only affects target sensor readings in its own cell, then the resulting independence between cells allows us to consider each of the m dimensions of X separately. In other words, we can consider X as a joint event over a collection of independent random variables X1 , . . . , Xm because P (X = x) = m i=1 P (Xi = xi ). We store our current estimate of P (Xi = xi ) in X[i].
Planning for Information Gathering with Hazards and No Communication
399
Agents have fuel for moves, so GS×T is created by placing a “clone” VS,t ≡ VS at each of the 0 ≤ t ≤ + 1 time steps that must be considered, i.e., VS×T = VS,0 ∪ . . . ∪ VS, . Edges in ES×T move forward in time, and exist vi,t−1 , vˆj,t ) ∈ ES×T for all according to the following rule: (vi , vj ) ∈ ES =⇒ (ˆ t ∈ [1, ]. A valid path ζvalid is a sequence of edges that starts at some uplink site wstart = vˆj,0 at time t = 0 and moves from node to node along edges in space-time until reaching a (goal) uplink site wgoal = vˆj, at time t = . If β is a belief that one of two complementary events has occurred, its entropy is calculated: H(β) = −(β log(β) + (1 − β) log(1 − β)). Given our assumption of m cell independence, total entropy regarding targets is H(X) = i=1 H(X[i]) and m alive ≡ P (θζ,alive ) total entropy regarding hazards is H(Z) = i=1 H(Z[i]). Let pζ dead and pζ ≡ P (θζ,dead ). The outer loop of the iterative planning approach appears in Algorithm 1, and the algorithm used to plan each path appears in Algorithm 3. We track the expected entropy that results from attempting paths using the arrays Xζ and Zζ (for targets and hazards, respectively). Xζ ≡ Eζ (X), where Eζ (X|θζ,alive ) + pdead Eζ (X|θζ,dead ). All sensor readings about Eζ (X) = palive ζ ζ targets are lost if the agent is killed, thus Eζ (X|θζ,dead ) = 0 and so Eζ (X) = Eζ (X|θζ,alive ). In contrast, information about hazard existence is gained palive ζ both if the agent survives or if the agent is destroyed (though different amounts in ≡ Eζ (Z|θζ,alive ) and Zdead ≡ Eζ (Z|θζ,dead ) either case). We use the vectors Zalive ζ ζ to track the conditional expectations of Zζ that will result if the agent survives or is killed along path ζ. This allows us to compute Zζ ≡ Eζ (Z) = Zalive + pdead Zdead . palive ζ ζ ζ ζ
Algorithm 1. Iterative information path planning for targets and hazards 1: for r = 1, 2, . . . do 2: ζ = calculatePath(X, Z) 3: Robot r attempts path ζ 4: if θζ,alive then 5: X ← BayesianCellUpdates(X, Yζ ) 6: Z ← BayesianCellUpdates(Z, [0, . . . , 0]) 7: else 8: Z ← KilledOnPathUpdate(Z)
Algorithm 2. KilledOnPathUpdate(Z, ζ) o 1: psurvivedT ←1 1 2: for k ← 1, . . . , do 3: i ← index of cell in which k-th observation was made ← (pkill + pmalfunc (1 − pkill ))Z[i] 4: pkilledInGivenAt k +pmalfunc (1 − Z[i]) o o ← psurvivedT (1 − pkilledInGivenAt ) 5: psurvivedT k+1 k k 6: Zk ← Z BayesianCellUpdates(Zk , [01:k−1 , 1]) 7: Zk ← o ← k=1 psurvivedT 8: pdead ζ k psurvivedT o 9: Z ← k=1 k pdead Zk ζ
10: return (Z, (1 − pdead ) ζ
Algorithm 3. calculatePath(X, Z) 1: for all uplink points w ∈ Wgoal do 2: ζw ← ∅ 3: InsertFIFOQueue(w) 4: while vˆj ← PopFIFOQueue do 5: hvˆj = −∞ 6: for all (ˆ vi , vˆj ) ∈ ES×T do 7: ζ ← (ˆ vi , vˆj ) + ζvˆj X ˆ live ← 8: h H(Xlive )dx x∈X 9: (Zlive , palive ) ← KilledOnPathUpdate(Z, ζ) ζ 10: Zkilled ← BayesianCellUpdates(Z, [0, . . . , 0]) ˆ this ← cZ (palive H(Zlive ) + (1 − 11: h ζ ˆX )H(Zkilled )) + cX palive h palive ζ ζ live ˆ this > hvˆ then 12: if h j 13: ζvˆi ← ζ 14: hvˆj ← hthis 15: wstart ← arg minw∈Wgoal vˆj 16: return ζwstart
¯ Algorithm 4. BayesianCellUpdates(B, β) 1: for k = 0, . . . , do 2: i ← index of cell in which k-th observation was made ¯ 3: B[i] ← P Bi | B[i], β[k] 4: return B
400
M. Otte and D. Sofge
Cell-wise target and hazard observations made along path ζ are stored in the vectors Yζ and Qζ , where Yζ [k] and Qζ [k] are the observations made in the k-th cell along the path. Yζ [k] ∈ {1, 0} where 1 denotes that a sensor reading was positive and 0 denotes that a sensor reading was negative. Similarily, Qζ [k] ∈ {1, 0}, where 1 represents a hazard observation and 1 denotes a negative (cellwise) reading. Even though it is impossible for an agent to directly report a hazard observation, Qζ is used for two things: first, Qζ [k] = 0 for all k ∈ [0, ] when an agent survives. Second, the algorithm tracks a different version of Qζ for each member of the set of possible events, when reasoning about the relative probabilities of survival to different places along a path. Algorithm 4 shows the recursive Bayesian update that is used for the belief vector β¯ given the observation vector Bi . Depending on context β¯ may represent Yζ or Qζ , and B may represent X and Z. Line 3 performs the recursive update yielding the posterior probability Bi regarding existence in the i-th map cell.
Fig. 2. Top to bottom: adversary and target existence beliefs and their entropies over time. Left two columns relate to targets. Right two columns to hazards. Center column is the sum of target and hazard entropies. Paths are calculated to maximize the (user weighted) reduction in entropy over the set of complementary events that the agent survives or is destroyed, weighted by their probability. At t = 1 the prior beliefs of hazard and target existence are 0.01 and 0.05 in each map cell. At t = 300 the agent has accurate beliefs of all 10 hazards and 19 of 20 targets, and is working (at high risk of being destroyed) to gather target existence information in a cell that contains a known hazard. In this trial hazards have a 50% kill rate.
Planning for Information Gathering with Hazards and No Communication
7
401
Experiments
We use Monte Carlo trials in simulation to evaluate the performance of the algorithm presented in Sect. 6, and compare it to other approaches. For the experiments presented in this section, the environment is represented by a 15 × 15 grid map. Movement is defined by a 9-grid of connectivity (8-grid neighbors plus self transitions) where each move takes the agent one time-step further in time. The agent has enough endurance to make 25 moves. Agent malfunction rate is pmalfunc = .01 per time step (thus, on average agents arbitrarily malfunction in 1/4 of all forays of length 25). In each trial the start and goal uplink points are placed uniformly at random. 10 non-start/goal locations are picked uniformly at random (no replacement) and populated with hazards. This is repeated for 20 non-start/goal locations that are populated with targets. We test our algorithm using three different objectives: weighting information from targets and hazards equally cX , cZ = {1, 1}; gathering only target information cX , cZ = {1, 0}; and gathering only hazard information cX , cZ = {0, 1}. We compare to three other ideas: (1) 1-step look ahead information surfing4 ; (2) a Markov random walk; and (3) planning paths to gather target information while ignoring hazards altogether (by not accounting for the probability of being destroyed when evaluating the expected information gain, and assuming a cX , cZ = {1, 0} objective). Our method and information surfing both track and update target and hazard beliefs, and use the probability of hazard existence to weight the expected information that will be gained about targets and/or hazards. The path of the random walk is calculated before the agent departs such that the resulting-path sensor can be used to infer hazard presence based on whether or not it survives. In all methods agent movement is only allowed in directions from which the agent can still reach the goal given its fuel reserves. Figure 2 shows examples paths for an experiment in the same environment with adversary lethality of 0.5. To generate performance statistics, each method is tested on (the same) 30 randomly generated configurations, and across six different adversary lethality levels (0.01, 0.2, 0.4, 0.6, 0.8, and 0.99). Due to space 4
In “information surfing” the path is greedily computed—the path is initially unknown when that agent leaves the start and then the path is computed on-the-fly. Hazard existence belief is tracked and used to determine the expected information that will be gained about targets. In practice, the destruction of an agent eliminates direct knowledge of the path taken by the agent. While it may be possible to integrate over all possible paths the agent could have taken to obtain a valid update, this computation is at least as hard as the algorithm we present for planning the optimal path. Instead, for the purposes of comparison we choose to be overly generous to “information surfing” and (unrealistically) assume that if the agent is destroyed, then we still know the path that it would have taken to the goal had it not been destroyed. Using this path to refine hazard beliefs (by calculating the relative likelihood the agent was destroyed on each segment) provides a performance bound such that the results for “information surfing” are better than what is expected in practice.
402
M. Otte and D. Sofge
Fig. 3. Hazard and target entropies vs. search round (mean over 30 random trials) when hazards have a 60% kill ratio. Left: an equally weighted (C = [11]) combination of hazard and target entropy. Center: target entropy. Right: hazard entropy.
Fig. 4. Left: target statistics. Center: hazard statistics. Right: total agent losses over time. Statistics include true/false positives and false negatives of target locations. Likelihood ≥.95 is defined as a positive detection. All plots show mean results over 30 random trials. In these experiments hazards have a 60% kill success rate. (Agents are expendable such that any costs associated with their losses are negligible compared to the information that is gained from their loss).
Planning for Information Gathering with Hazards and No Communication
403
limitations we only present results for the 0.6 case. In Figs. 3 and 4 and results for other cases appear in supplementary material available on the first author’s website. Regarding the calculation of true/false positives, if target existence belief is ≥0.95 in a cell, then we declare that we think there is a target in that cell. Likewise, for hazards. The relative performance of different methods is similar across hazard lethality rates, though increasing lethality rates make the detection of hazards easier and the detection of targets harder for all methods. The method we present consistently outperforms the other methods on the objective it seeks to maximize. However, in the case cX , cZ = {0, 1} for which our method seeks information about only hazards and not targets, then all of the comparison methods initially have more accurate beliefs regarding target existence (however they also all eventually fall behind our method at later iterations). This makes sense given that our algorithm is completely ignoring target information in its mission objective in that particular case. In general, “Information surfing” performs better than the random walk, but not as well as full information based path planning. Information based target search that completely ignores the possibly of being destroyed by a hazard has the worst performance of all methods tested. This is because target beliefs remain unchanged in the event that agents are destroyed—if hazards are ignored then subsequent agents will continue to attempt the same dangerous path until it is successfully completed.
8
Summary and Conclusion
An agent’s path can be used as a binary sensor (to detect the occurrence of at least one event along that path), and we show how to compute recursive Bayesian updates given such path-based sensor observations. In hazardous environments where communication with an agent is impossible until it physically returns, this allows the existence of lethal hazards to be inferred based on whether or not agents survive forays along paths. By calculating the expected information that will be gained along different paths, we are able to maximize the information about hazards that is expected to be gained along each foray. This idea is combined with standard Bayesian target search to provide a family of algorithms for solving the problem of iterative path planing for target search in hazardous environments without communication. In Monte Carlo simulations presented in Sect. 7 we find that the algorithms perform favorably vs. three related ideas including: (1) “information surfing” which has been shown to work well for the related problem of target search in a hazardous environment with communication; (2) performing informative path planning for targets while ignoring hazards; and (3) a Markov random walk that is computed before the agent leaves. When the mission objective is set to maximize the information that is collected regarding targets, hazards, or a weighted combination of both, then the detection of targets, hazards, or both are respectively maximized (based on 95% belief defining a positive observation). The Bayesian belief updates for a path-based sensor can be useful even if agents do not use hazard information in their mission objectives. For example,
404
M. Otte and D. Sofge
using random walks will eventually result in accurate beliefs of hazard existence (at least, almost surely in the limit). Thus, even if agents perform a variety of other missions in the environment, we can still use observations of their survival vs. destruction to refine our beliefs of hazard locations. Acknowledgments. This work was performed at the Naval Research Laboratory (NRL) and funded by Office of Naval Research (ONR) grant N0001416WX01271. The views, positions and conclusions expressed herein reflect only the authors’ opinions and expressly do not reflect those of ONR, nor those of NRL.
References 1. Chung, T.H., Hollinger, G.A., Isler, V.: Search and pursuit-evasion in mobile robotics. Auton. Robot. 31(4), 299–316 (2011) 2. Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. WilleyInterscience, Hoboken (2006) 3. Dames, P., Schwager, M., Kumar, V., Rus, D.: A decentralized control policy for adaptive information gathering in hazardous environments. In: IEEE Conference on Decision and Control, pp. 2807–2813, December 2012 4. Dames, P., Kumar, V.: Autonomous localization of an unknown number of targets without data association using teams of mobile sensors. IEEE Trans. Autom. Sci. Eng. 12(3), 850–864 (2015) 5. Dames, P.M., Schwager, M., Rus, D., Kumar, V.: Active magnetic anomaly detection using multiple micro aerial vehicles. IEEE Robot. Autom. Lett. 1(1), 153–160 (2016) 6. Flint, M., Polycarpou, M., Fernandez-Gaucherand, E.: Cooperative control for multiple autonomous UAV’s searching for targets. In: IEEE Conference on Decision and Control, vol. 3, pp. 2823–2828, December 2002 7. Flint, M., Fern´ andez-Gaucherand, E., Polycarpou, M.: Cooperative control for UAV’s searching risky environments for targets. In: IEEE Conference on Decision and Control, vol. 4, pp. 3567–3572. IEEE (2003) 8. Hollinger, G.A., Yerramalli, S., Singh, S., Mitra, U., Sukhatme, G.S.: Distributed data fusion for multirobot search. Trans. Robot. 31(1), 55–66 (2015) 9. Julian, B.J., Angermann, M., Schwager, M., Rus, D.: Distributed robotic sensor networks: an information-theoretic approach. Int. J. Robot. Res. 31(10), 1134–1154 (2012) 10. Lyu, Y., Chen, Y., Balkcom, D.: k-survivability: diversity and survival of expendable robots. IEEE Robot. Autom. Lett. 1(2), 1164–1171 (2016) 11. Robin, C., Lacroix, S.: Taxonomy on multi-robot target detection and tracking. In: Workshop on Multi-Agent Coordination in Robotic Exploration (2014) 12. Sato, H., Royset, J.O.: Path optimization for the resource-constrained searcher. Naval Res. Logist. 57(5), 422–440 (2010) 13. Schwager, M., Dames, P., Rus, D., Kumar, V.: A Multi-robot control policy for information gathering in the presence of unknown hazards, pp. 455–472. Springer, Cham (2017) 14. Shannon, C.E.: A mathematical theory of communication. ACM SIGMOBILE Mob. Comput. Commun. Rev. 5(1), 3–55 (2001) 15. Vincent, P., Rubin, I.: A framework and analysis for cooperative search using UAV swarms. In: Proceedings of the 2004 ACM Symposium on Applied Computing, SAC 2004, pp. 79–86. ACM, New York (2004)
Planning for Information Gathering with Hazards and No Communication
405
16. Yang, Y., Minai, A.A., Polycarpou, M.M.: Decentralized cooperative search by networked UAVs in an uncertain environment. In: American Control Conference, vol. 6, pp. 5558–5563. IEEE (2004) 17. Yang, Y., Polycarpou, M.M., Minai, A.A.: Multi-uav cooperative search using an opportunistic learning method. J. Dyn. Syst. Meas. Contr. 129(5), 716–728 (2007)
Reactive Navigation in Partially Known Non-convex Environments Vasileios Vasilopoulos1(B) and Daniel E. Koditschek2 1
2
Department of Mechanical Engineering and Applied Mechanics, University of Pennsylvania, Philadelphia, PA 19104, USA [email protected] Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA [email protected]
Abstract. This paper presents a provably correct method for robot navigation in 2D environments cluttered with familiar but unexpected nonconvex, star-shaped obstacles as well as completely unknown, convex obstacles. We presuppose a limited range onboard sensor, capable of recognizing, localizing and (leveraging ideas from constructive solid geometry) generating online from its catalogue of the familiar, non-convex shapes an implicit representation of each one. These representations underlie an online change of coordinates to a completely convex model planning space wherein a previously developed online construction yields a provably correct reactive controller that is pulled back to the physically sensed representation to generate the actual robot commands. We extend the construction to differential drive robots, and suggest the empirical utility of the proposed control architecture using both formal proofs and numerical simulations. Keywords: Motion and path planning and sensor-based control
1 1.1
· Collision avoidance · Vision
Introduction Motivation and Prior Work
Recent advances in the theory of sensor-based reactive navigation [2] and its application to wheeled [3] and legged [28] robots promote its central role in provably correct architectures for increasingly complicated mobile manipulation tasks [29,31]. The advance of the new theory [2] over prior sensor-based collision avoidance schemes [5–8,10,11,15,20,26] was the additional guaranteed convergence to a designated goal which had theretofore only been established This work was supported by AFRL grant FA865015D1845 (subcontract 6697371). The authors thank Dr. Omur Arslan for many formative discussions and for sharing his simulation and presentation infrastructure. c Springer Nature Switzerland AG 2020 M. Morales et al. (Eds.): WAFR 2018, SPAR 14, pp. 406–421, 2020. https://doi.org/10.1007/978-3-030-44051-0_24
Reactive Navigation in Partially Known Non-convex Environments
407
for reactive planners possessing substantial prior knowledge about the environment [19,24]. A key feature of these new (and other recent parallel [14,21]) approaches is that they trade away prior knowledge for the presumption of simplicity: unknown obstacles can be successfully negotiated in real time without losing global convergence guarantees if they are “round” (i.e., very strongly convex in a sense made precise in [3]). The likely necessity of such simple geometry for guaranteed safe convergence by a completely uninformed “greedy” reactive navigation planner is suggested by the result that a collision avoiding, distancediminishing reactive navigation policy can reach arbitrarily placed goals in an unknown freespace only if all obstacles are “round” [3, Proposition 14]. This paper offers a step toward elucidating the manner in which partial knowledge may suffice to inform safe, convergent, reactive navigation in geometrically more interesting environments. Growing experience negotiating learned [12] or estimated [17,27] environments suggests that reasonable statistical priors may go a long way toward provable stochastic navigation. But in this work we are interested in what sort of deterministic guarantees may be possible. Recent developments in semantic SLAM [9] and object pose and triangular mesh extraction using convolutional neural net architectures [16,18,22] now provide an avenue for incorporating partial prior knowledge within a deterministic framework well suited to the vector field planning methods reviewed above. 1.2
Contributions and Organization of the Paper
We consider the navigation problem in a 2D workspace cluttered with unknown convex obstacles, along with “familiar” non-convex, star-shaped obstacles [23] that belong to classes of known geometries, but whose number and placement are unknown, awaiting discovery at execution time. We assume a limited range onboard sensor, a sufficient margin separating all obstacles from each other and the goal, and a catalogue of known star-shaped sets, along with a “mapping oracle” for their online identification and localization in the physical workspace. These ingredients suggest a representation of the environment taking the form of a “multi-layer” triple of topological spaces whose realtime interaction can be exploited to integrate the geometrically naive sensor driven methods of [2] with the offline memorized geometry sensitive methods of [24]. Specifically, we adapt the construction of [23] to generate a realtime smooth change of coordinates (a diffeomorphism) of the mapped layer of the environment into a (locally) topologically equivalent but geometrically more favorable model layer relative to which the reactive methods of [2] can be directly applied. We prove that the conjugate vector field defined by pulling back the reactive model space planner through this diffeomorphism induces a vector field on the robot’s physical configuration space that inherits the same formal guarantees of obstacle avoidance and convergence. We extend the construction to the case of a differential drive robot, by pulling back the extended field over planar rigid transformations introduced for this purpose in [2] through a suitable polar coordinate transformation of the tangent lift of our original planar diffeomorphism and demonstrate, once again, that the
408
V. Vasilopoulos and D. E. Koditschek
physical differential drive robot inherits the same obstacle avoidance and convergence properties as those guaranteed for the geometrically simple model robot [2]. Finally, to better support online implementation of these constructions, we adopt modular methods for implicit description of geometric shape [25]. The paper is organized as follows. Section 2 describes the problem and establishes our assumptions. Section 3 describes the physical, mapped and model planning layers used in the constructed diffeomorphism between the mapped and model layers, whose properties are established next. Based on these results, Sect. 4 describes our control approach both for fully actuated and differential drive robots. Section 5 presents a variety of illustrative numerical studies and Sect. 6 concludes by summarizing our findings and presenting ideas for future work. Finally, the reader is referred to Appendix A and Appendix B of the accompanying technical report [30] for the proofs of our main results and a sketch of the ideas from computational geometry [25] underlying our modular construction of implicit representations of polygonal obstacles, respectively.
2
Problem Formulation
We consider a disk-shaped robot with radius r > 0, centered at x ∈ R2 , navigating a closed, compact workspace W ⊂ R2 , with known convex boundary ∂W. The robot is assumed to possess a sensor with fixed range R, capable of recognizing “familiar” objects, as well as estimating the distance of the robot to nearby obstacles1 . The workspace is cluttered by an unknown number of fixed, disjoint obstacles, denoted by O := (O1 , O2 , . . .). We adopt the notation in [2] and define the freespace as F := x ∈ W B(x, r) ⊆ W \ Oi (1) i
where B(x, r) is the open ball centered at x with radius r, and B(x, r) denotes its closure. To simplify our notation, we neglect the robot dimensions, by dilating each obstacle in O by r, and assume that the robot operates in F. We denote ˜ the set of dilated obstacles by O. ˜ are `a-priori known, a Although none of the positions of any obstacles in O ˜ of these obstacles is assumed to be “familiar” in the sense of ˜∗ ⊆ O subset O having an a`-priori known, readily recognizable star-shaped geometry [23] (i.e., belonging to a known catalogue of star-shaped geometry classes), which the robot can efficiently identify and localize instantaneously from online sensory measurement. Although the implementation of such a sensory apparatus lies well beyond the scope of the present paper, recent work on semantic SLAM [9] provides an excellent example with empirically demonstrated technology for achieving this need for localizing, identifying and keeping track of all the familiar obstacles 1
We refer the reader to an example of existing technology [1] generating 2D LIDAR scans from 3D point clouds for such an approach.
Reactive Navigation in Partially Known Non-convex Environments
409
encountered in the otherwise unknown environment. The a`-priori unknown cen˜ ∗ is denoted x∗ . Similarly to [24], ter of each catalogued star-shaped obstacle O i i ∗ ∗ ˜ can be described by an obstacle function, a ˜ ∈O each star-shaped obstacle O i real-valued map providing an implicit representation of the form ˜ ∗ = {x ∈ R2 | βi (x) ≤ 0} O i
(2)
which the robot must construct online from the catalogued geometry, after it ˜ convex := O\ ˜ O ˜ ∗ are are assumed to ˜ ∗ . The remaining obstacles O has localized O i be strictly convex but are in all other regards (location and specific shape) completely unknown to the robot, while nevertheless satisfying a curvature condition given in [2, Assumption 2]. For the obstacle functions, we require the technical assumptions introduced in [24, Appendix III], outlined as follows. Assumption 1. The obstacle functions satisfy the following requirements ˜ ∗ , there exists ε1i > 0 such that for any two obstacles ˜∗ ∈ O (a) For each O i ∗ ˜∗ ∗ ˜ ˜ Oi , Oj ∈ O {x | βi (x) ≤ ε1i } {x | βj (x) ≤ ε1j } = ∅ (3) i.e., the “thickened boundaries” of any two stars still do not overlap. ˜ ∗ , there exists ε2i > 0 such that the set {x | βi (x) ≤ ε2i } does ˜∗ ∈ O (b) For each O i not contain the goal xd ∈ F and does not intersect with any other obstacle ˜ convex . in O (c) For each obstacle function βi , there exists a pair of positive constants (δi , ε3i ) satisfying the inner product condition2 (x − x∗i ) ∇βi (x) ≥ δi
(4)
for all x ∈ R2 such that βi (x) ≤ ε3i . ˜ ∗ , we then define εi = min{ε1i , ε2i , ε3i }. Finally, we ˜∗ ∈ O For each obstacle O i will assume that the range of the sensor R satisfies R >> εi for all i. Based on these assumptions and further positing first-order, fully-actuated robot dynamics x˙ = u(x), the problem consists of finding a Lipschitz continuous controller u : F → R2 , that leaves the freespace F positively invariant and asymptotically steers almost all configurations in F to the given goal xd ∈ F.
3
Multi-layer Representation of the Environment and Its Associated Transformations
In this Section, we introduce associated notation for, and transformations between three distinct representations of the environment that we will refer to as planning “layers” and use in the construction of our algorithm. Figure 1 illustrates the role 2
A brief discussion on this condition is given in [30, Appendix B].
410
V. Vasilopoulos and D. E. Koditschek
Fig. 1. Snapshot illustration of key ideas. The robot in the physical layer (left frame, depicting in blue the robot’s placement in the workspace along with the prior trajectory of its centroid) containing both familiar objects of known geometry but unknown location (dark grey) and unknown obstacles (light grey), moves towards a goal and discovers obstacles (black) with an onboard sensor of limited range (orange disk). These obstacles are localized and stored permanently in the mapped layer (middle frame, depicting in blue the robot’s placement as a point in freespace rather than its body in the workspace) if they have familiar geometry or temporarily, with just the corresponding sensed fragments, if they are unknown. An online map h(x) is then constructed (Sect. 3), from the mapped layer to a geometrically simple model layer (right frame, now depicting the robot’s placement and prior tractory amongst the h-deformed convex images of the mapped obstacles). A doubly reactive control scheme for convex environments [2] defines a vector field on the model layer which is pulled back in realtime through the diffeomorphism to generate the input in the physical layer (Sect. 4).
of these layers and the transformations that relate them in constructing and analyzing a realtime generated vector field that guarantees safe passage to the goal. The new technical contribution is an adaptation of the methods of [24] to the construction of a diffeomorphism, h, where the requirement for fast, online performance demands an algorithm that is as simple as possible and with few tunable parameters. Hence, since the reactive controller in [2] is designed to (provably) handle convex shapes, sensed obstacles not recognized by the semantic SLAM process are simply assumed to be convex (implemented by designing h to resolve to the identity transformation in the neighborhood of “unfamiliar” objects) and the control response defaults to that prior construction. 3.1
Description of Planning Layers
Physical Layer. The physical layer is a complete description of the geometry of the unknown actual world and while inaccessible to the robot is used for purposes of analysis. It describes the actual workspace W, punctured with the obstacles O. This gives rise to the freespace F, given in (1), consisting of all placements of the robot’s centroid that entail no intersections of its body with
Reactive Navigation in Partially Known Non-convex Environments
411
any obstacles. The robot navigates this layer, and discovers and localizes new obstacles, which are then stored in its semantic map if their geometry is familiar. Mapped Layer. The mapped layer Fmap has the same boundary as F (i.e. ∂Fmap := ∂F) and records the robot’s evolving information about the environment aggregated from the raw sensor data about the observable portions of ˜N } ⊆ ˜1 , . . . , O N ≥ 0 unrecognized (and therefore, presumed convex) obstacles {O ∗ ˜ Oconvex , together with the inferred star centers xj and obstacle functions βj of ˜ ∗ , that are instantiated at the ˜∗ } ⊆ O ˜∗, . . . , O M ≥ 0 star-shaped obstacles {O 1 M moment the sensory data triggers the “memory” that identifies and localizes a familiar obstacle. It is important to note that the star environment is constantly updated, both by discovering and storing new star-shaped obstacles in the semantic map and by discarding old information and storing new informa˜ convex . In this representation, the robot is treated tion regarding obstacles in O as a point particle, since all obstacles are dilated by r in the passage from the workspace to the freespace representation of valid placements. Model Layer. The model layer Fmodel has the same boundary as F (i.e. ∂Fmodel := ∂F) and consists of a collection of M Euclidean disks, each centered at one of the mapped star centers, x∗j , j = 1, . . . , M , and copies of the ˜ convex . The sensed fragments of the N unrecognized visible convex obstacles in O radii {ρ1 , . . . , ρM } of the M disks are chosen so that B(x∗j , ρj ) ⊆ {x | βj (x) < 0}, as in [24]. This metric convex sphere world comprises the data generating the doubly reactive algorithm of [2], which will be applied to the physical robot via the online generated change of coordinates between the mapped layer and the model layer to be now constructed. 3.2
Description of the C ∞ Switches
In order to simplify the diffeomorphism construction, we depart from the construction of analytic switches [23] and rely instead on the C ∞ function ζ : R → R [13] described by −1/χ e , χ>0 ζ(χ) = (5) 0, χ≤0
with derivative
ζ (χ) =
ζ(χ) χ2 ,
0,
χ>0 χ≤0
(6)
Based on that function, we can then define the C ∞ switches for each star-shaped ˜ ∗ in the semantic map as obstacle O j σj (x) = ηj ◦ βj (x),
j = 1, . . . , M
(7)
412
V. Vasilopoulos and D. E. Koditschek
with ηj (χ) = ζ(εj − χ)/ζ(εj ) and εj given according to Assumption 1. The gradient of the switch σj is given by ∇σj (x) = (ηj ◦ βj (x)) · ∇βj (x)
(8)
Finally, we define σd (x) = 1 −
M
σj (x)
(9)
j=1
Using the above construction, it is easy to see that σj (x) = 1 on the boundary of the j-th obstacle and σj (x) = 0 when βj (x) > εj for each j = 1, . . . , M . Based on Assumption 1 and the choice of εj for each j, we are, therefore, led to the following results. Lemma 1. At any point x ∈ Fmap , at most one of the switches {σ1 , . . . , σM } can be nonzero. Corollary 1. The set {σ1 , . . . , σM , σd } defines a partition of unity over Fmap . 3.3
Description of the Star Deforming Factors
The deforming factors are the functions νj (x) : Fmap → R, j = 1, . . . , M , responsible for transforming each star-shaped obstacle into a disk in R2 . Once again, we use here a slightly different construction than [23], in that the value of each deforming factor νj at a point x does not depend on the value of βj (x). Namely, the deforming factors are given based on the desired final radii ρj , j = 1, . . . , M as ρj (10) νj (x) = ||x − x∗j || We also get ∇νj (x) = −
3.4
ρj (x − x∗j ) ||x − x∗j ||3
(11)
The Map Between the Mapped and the Model Layer
Construction. The map for M star-shaped obstacles centered at x∗j , j = 1, . . . , M is described by a function h : Fmap → Fmodel given by h(x) =
M
σj (x) νj (x)(x − x∗j ) + x∗j + σd (x)x
(12)
j=1
˜ convex are not ˜1 , . . . , O ˜N } ⊆ O Note that the N visible convex obstacles {O considered in the construction of the map. Since the reactive controller used in the model space Fmodel can handle convex obstacles and there is enough separation between convex and star-shaped obstacles according to Assumption 1-(b), we can “transfer” the geometry of those obstacles directly in the model space using the identity transformation.
Reactive Navigation in Partially Known Non-convex Environments
413
Finally, note that Assumption 1-(b) implies that h(xd ) = xd , since the target location is assumed to be sufficiently far from all star-shaped obstacles. Based on the construction of the map h, the jacobian Dx h at any point x ∈ Fmap is given by Dx h =
M j=1
σj (x)νj (x)I + (x − x∗j ) σj (x)∇νj (x) + (νj (x) − 1)∇σj (x) +σd (x)I
(13)
Qualitative Properties of the Map. We first verify that the construction is a smooth change of coordinates between the mapped and the model layers. Lemma 2. The map h from Fmap to Fmodel is smooth. Proof. Included in Appendix A.1 of the accompanying technical report [30]. Proposition 1. The map h is a C ∞ diffeomorphism between Fmap and Fmodel . Proof. Included in Appendix A.1 of the accompanying technical report [30]. Implicit Representation of Obstacles. To implement the diffeomorphism between Fmap and Fmodel , shown in (12), we rely on the existence of a smooth obstacle function βj (x) for each star-shaped obstacle j = 1, . . . , M stored in the semantic map. Since recently developed technology [16,18,22] provides means of performing obstacle identification in the form of triangular meshes, in this work we focus on polygonal obstacles on the plane and derive implicit representations using so called “R-functions” from the constructive solid geometry literature [25]. In [30, Appendix B], we describe the method used for the construction of such implicit functions for polygonal obstacles that have the desired property of being analytic everywhere except for the polygon vertices. For the construction, we assume that the sensor has already identified, localized and included each discovered star-shaped obstacle in Fmap ; i.e., it has determined its pose in Fmap , given as a rotation Rj of its vertices on the plane followed by a translation of its center x∗j , and that the corresponding polygon has already been dilated by r for inclusion in Fmap .
4
Reactive Controller
4.1
Reactive Controller for Fully Actuated Robots
Construction. First, we consider a fully actuated particle with state x ∈ Fmap , whose dynamics are described by x˙ = u
(14)
The dynamics of the fully actuated particle in Fmodel with state y ∈ Fmodel are described by y˙ = v(y) with the control v(y) given in [2] as3 (15) v(y) = −k y − ΠLF(y) (xd ) 3
Here ΠC (q) denotes the metric projection of q on a convex set C.
414
V. Vasilopoulos and D. E. Koditschek
Here, the convex local freespace for y, LF(y) ⊂ Fmodel , is defined as in [2, Eq. (30)]. Using the diffeomorphism construction in (12) and its jacobian in (13), we construct our controller as the vector field u : Fmap → T Fmap given by u(x) = [Dx h]−1 · (v ◦ h(x))
(16)
Qualitative Properties. First of all, if the range of the virtual LIDAR sensor used to construct LF(y) in the model layer is smaller than R, the vector field u is Lipschitz continuous since v(y) is shown to be Lipschitz continuous in [2] and y = h(x) is a smooth change of coordinates. We are led to the following result. Corollary 2. The vector field u : Fmap → T Fmap generates a unique continuously differentiable partial flow. To ensure completeness (i.e. absence of finite time escape through boundaries in Fmap ) we must verify that the robot never collides with any obstacle in the environment, i.e., leaves its freespace positively invariant. Proposition 2. The freespace Fmap is positively invariant under the law (16). Proof. Included in Appendix A.2 of the accompanying technical report [30]. Lemma 3. 1. The set of stationary points of control law (16) is given as
N
{xd } {h−1 (sj )}j∈{1,...,M } i=1 Gi , where4 sj = x∗j − ρj
(17a)
(q − ΠOi (q)) (q − xd ) =1 (17b) q ∈ Fmap d(q, Oi ) = r, ||q − ΠOi (q)|| ||q − xd ||
Gi :=
xd − x∗j ||xd − x∗j ||
with j spanning the M star-shaped obstacles in Fmap and i spanning the N convex obstacles in Fmap . 2. The goal xd is the only locally stable equilibrium of control law (16) and all
N the other stationary points {h−1 (sj )}j∈{1,...,M } i=1 Gi , each associated with an obstacle, are nondegenerate saddles. Proof. Included in Appendix A.2 of the accompanying technical report [30]. Proposition 3. The goal location xd is an asymptotically stable equilibrium of (16), whose region of attraction includes the freespace Fmap excepting a set of measure zero. Proof. Included in Appendix A.2 of the accompanying technical report [30]. 4
Here d(A, B) = inf{||a − b|| | a ∈ A, b ∈ B} denotes the distance between two sets A, B.
Reactive Navigation in Partially Known Non-convex Environments
415
We can now immediately conclude the following central summary statement. Theorem 1. The reactive controller in (16) leaves the freespace Fmap positively invariant, and its unique continuously differentiable flow, starting at almost any robot placement x ∈ Fmap , asymptotically reaches the goal location xd , while strictly decreasing ||h(x) − xd || along the way. 4.2
Reactive Controller for Differential Drive Robots
In this Section, we extend our reactive controller to the case of a differential drive robot, whose state is x := (x, ψ) ∈ Fmap × S 1 ⊂ SE(2), and its dynamics are given by5 x˙ = B(ψ)u (18) cos ψ sin ψ 0 and u = (v, ω) with v ∈ R and ω ∈ R the linear with B(ψ) = 0 0 1 and angular input respectively. We will follow a similar procedure to the fully actuated case; we begin by describing a smooth diffeomorphism h : Fmap ×S 1 → Fmodel × S 1 and then we establish the results about the controller. Construction and Properties of the SE(2) Diffeomorphism. We construct our map h from Fmap × S 1 to Fmodel × S 1 as y = (y, ϕ) = h(x) := (h(x), ξ(x))
(19)
with x = (x, ψ) ∈ Fmap × S 1 , y := (y, ϕ) ∈ Fmodel × S 1 and ϕ = ξ(x) := ∠ (e(x))
(20)
Here, ∠e := atan2(e2 , e1 ) and 1 cos ψ = Dx h e(x) = Πy · Dx h · B(ψ) · 0 sin ψ
(21)
with Πy denoting the projection onto the first two components. The reason for choosing ϕ as in (20) will become evident in the next paragraph, in our effort to control the equivalent differential drive robot dynamics in Fmodel . Proposition 4. The map h in (19) is a C ∞ diffeomorphism from Fmap × S 1 to Fmodel × S 1 . Proof. Included in Appendix A.2 of the accompanying technical report [30].
5
We use the ordered set notation (∗, ∗, . . .) and the matrix notation ∗ ∗ . . . for vectors interchangeably.
416
V. Vasilopoulos and D. E. Koditschek
Construction of the Reactive Controller. Using (19), we can find the pushforward of the differential drive robot dynamics in (18) as d h(x) −1 −1 y˙ = y˙ = (22) = Dx h ◦ h (y) · B ◦ h (y) · u ϕ˙ dt ξ(x) Based on the above, we can then write d h(x) y˙ = = B(ϕ)v y˙ = ϕ˙ dt ξ(x)
(23)
v, ω ˆ ), and the inputs (ˆ v, ω ˆ ) related to (v, ω) through with v = (ˆ
with Dx ξ =
vˆ = ||e(x)|| v ∂ξ cos ψ ω + ω ˆ = vDx ξ sin ψ ∂ψ
∂ξ ∂ξ ∂x ∂y
(24) (25)
. The calculation of Dx ξ can be tedious, since it involves
derivatives of elements of Dx h, and is included in [30, Appendix C]. Hence, we have found equivalent differential drive robot dynamics, defined on Fmodel × S 1 . The idea now is to use the control strategy in [2] for the dynamical system in (23) to find reference inputs vˆ, ω ˆ , and then use (24), (25) to find the actual inputs v, ω that achieve those reference inputs as v= ω=
∂ξ ∂ψ
−1
vˆ ||e(x)|| ω ˆ − vDx ξ
(26a)
cos ψ sin ψ
(26b)
Namely, our reference inputs vˆ and ω ˆ inspired by [2,4] are given as6 cos ϕ y − ΠLF(y)∩H (xd ) vˆ = −k (27a) sin ϕ ⎛ ⎞ ΠLF(y)∩HG (xd ) + ΠLF(y) (xd ) − sin ϕ y− ⎜ cos ϕ ⎟ 2 ⎜ ⎟ ω ˆ = k atan ⎜ ⎟ (27b) ⎝ cos ϕ ΠLF(y)∩HG (xd ) + ΠLF(y) (xd ) ⎠ y− sin ϕ 2 6
In (19), we construct a diffeomorphism h between Fmap × S 1 and Fmodel × S 1 . However, for practical purposes, we deal only with one specific chart of S 1 in our control structure, described by the angles (−π, π]. As shown in [4], the discontinuity at ±π does not induce a discontinuity in our controller due to the use of the atan function in (27b). On the contrary, with the use of (27b) as in [2, 4], the robot never changes heading in Fmodel , which implies that the generated trajectories both in Fmodel and (by the properties of the diffeomorphism h) in Fmap have no cusps, even though the robot might change heading in Fmap because of the more complicated nature of the function ξ in (20).
Reactive Navigation in Partially Known Non-convex Environments
417
with k > 0 a fixed gain, LF(y) ⊂ Fmodel the convex polygon defining the local freespace at y = h(x), and H and HG the lines defined in [2] as − sin ϕ (z − y) = 0 (28) H = z ∈ Fmodel cos ϕ HG = {αy + (1 − α)xd ∈ Fmodel | α ∈ R}
(29)
Fig. 2. Navigation around a U-shaped obstacle: (1) Fully actuated particle: (a) Original doubly reactive algorithm [2], (b) Our algorithm, (2) Differential drive robot: (a) Original doubly reactive algorithm [2], (b) Our algorithm.
Qualitative Properties. The properties of the differential drive robot control law given in (26) can be summarized in the following theorem. Theorem 2. The reactive controller for differential drive robots, given in (26), leaves the freespace Fmap × S 1 positively invariant, and its unique continuously differentiable flow, starting at almost any robot configuration (x, ψ) ∈ Fmap × S 1 , asymptotically steers the robot to the goal location x∗ , without increasing ||h(x) − xd || along the way. Proof. Included in Appendix A.2 of the accompanying technical report [30].
5
Numerical Experiments
In this Section, we present numerical experiments that verify our formal results. All simulations were run in MATLAB using ode45, with control gain k = 0.4 and p = 20 for the R-function construction (see [30, Appendix B]). The reader is also referred to our video attachment for a visualization of the examples presented here and more numerical simulations.
418
5.1
V. Vasilopoulos and D. E. Koditschek
Comparison with Original Doubly Reactive Algorithm
We begin with a comparison of our algorithm performance with the standalone version of the doubly reactive algorithm in [2], that we use in our construction. Figure 2 demonstrates the basic limitation of this algorithm; in the presence of a non-convex obstacle or a flat surface, whose curvature violates [2, Assumption 2], the robot gets stuck in undesired local minima. On the contrary, our algorithm is capable of overcoming this limitation, on the premise that the robot can recognize the obstacle with star-shaped geometry at hand. The robot radius is 0.2 m and the value of ε used for the obstacle is 0.3. 5.2
Navigation in a Cluttered Non-convex Environment
In the next set of numerical experiments, we evaluate the performance of our algorithm in a cluttered environment, packed with instances of the same Ushaped obstacle, with star-shaped geometry, we use in Fig. 2. Both the fully actuated and the differential drive robot are capable of converging to the desired goal from a variety of initial conditions, as shown in Fig. 3. In the same figure, we also focus on a particular initial condition and include the trajectories observed in the physical, mapped and model layers. The robot radius is 0.25 m and value of ε used for all the star-shaped obstacles in the environment is 0.3.
Fig. 3. Navigation in a cluttered environment with U-shaped obstacles. Top - trajectories in the physical, mapped and model layers from a particular initial condition. Bottom - convergence to the goal from several initial conditions: left - fully actuated robot, right - differential drive robot.
Reactive Navigation in Partially Known Non-convex Environments
419
Fig. 4. Navigating a room cluttered with known star-shaped and unknown convex obstacles. Top - trajectories in the physical, mapped and model layers from a particular initial condition. Bottom - convergence to the goal from several initial conditions: left - fully actuated robot, right - differential drive robot. Mapped obstacles are shown in black, known obstacles in dark grey and unknown obstacles in light grey.
5.3
Navigation Among Mixed Star-Shaped and Convex Obstacles
Finally, we report experiments in an environment cluttered with both starshaped obstacles (with known geometry) and unknown convex obstacles. We consider a robot of radius 0.2 m navigating a room towards a goal. The robot can recognize familiar star-shaped obstacles (e.g., the couch, tables, armchair, chairs) but is unaware of several other convex obstacles in the environment. Figure 4 summarizes our results for several initial conditions. We also include trajectories observed in the physical, mapped and model layers during a single run. The value of ε used for all the star-shaped obstacles in the environment is 0.3.
6
Conclusion and Future Work
In this paper, we present a provably correct method for robot navigation in 2D environments cluttered with familiar but unexpected non-convex, star-shaped obstacles as well as completely unknown, convex obstacles. The robot uses a limited range onboard sensor, capable of recognizing, localizing and generating online from its catalogue of the familiar, non-convex shapes an implicit representation of each one. These sensory data and their interpreted representations underlie an online change of coordinates to a completely convex model planning space wherein a previously developed online construction yields a provably
420
V. Vasilopoulos and D. E. Koditschek
correct reactive controller that is pulled back to the physically sensed representation to generate the actual robot commands. Using a modified change of coordinates, the construction is also extended to differential drive robots, and numerical simulations further verify the validity of our formal results. Experimental validation of our algorithm with deep learning techniques for object pose and triangular mesh recognition [22] is currently underway. Next steps target environments presenting geometry more complicated than starshaped obstacles, by appropriately modifying the purging transformation algorithm for trees-of-stars, presented in [24]. Future work aims to relax the required degree of partial knowledge and the separation assumptions needed for our formal results, by merging the “implicit representation trees” (e.g. see [30, Fig. 5]) online, when needed.
References 1. http://wiki.ros.org/pointcloud to laserscan 2. Arslan, O., Koditschek, D.E.: Sensor-based reactive navigation in unknown convex sphere worlds. In: The 12th International Workshop on the Algorithmic Foundations of Robotics (2016) 3. Arslan, O., Koditschek, D.E.: Sensor-based reactive navigation in unknown convex sphere worlds. Int. J. Robot. Res. 38, 196–223 (2019) 4. Astolfi, A.: Exponential stabilization of a wheeled mobile robot via discontinuous control. J. Dyn. Syst. Meas. Contr. 121(1), 121–126 (1999) 5. van den Berg, J., Lin, M., Manocha, D.: Reciprocal velocity obstacles for realtime multi-agent navigation. In: IEEE International Conference on Robotics and Automation, pp. 1928–1935 (2008) 6. van den Berg, J., Guy, S.J., Lin, M., Manocha, D.: Reciprocal n-body collision avoidance, pp. 3–19. Springer, Heidelberg (2011) 7. Borenstein, J., Koren, Y.: Real-time obstacle avoidance for fast mobile robots. IEEE Trans. Syst. Man Cybern. 19(5), 1179–1187 (1989) 8. Borenstein, J., Koren, Y.: The vector field histogram-fast obstacle avoidance for mobile robots. IEEE Trans. Robot. Autom. 7(3), 278–288 (1991) 9. Bowman, S.L., Atanasov, N., Daniilidis, K., Pappas, G.J.: Probabilistic data association for semantic SLAM. In: IEEE International Conference on Robotics and Automation, pp. 1722–1729, May 2017 10. Brock, O., Khatib, O.: High-speed navigation using the global dynamic window approach. In: IEEE International Conference on Robotics and Automation, pp. 341–346 (1999) 11. Fiorini, P., Shiller, Z.: Motion planning in dynamic environments using velocity obstacles. Int. J. Robot. Res. 17(7), 760–772 (1998) 12. Henry, P., Vollmer, C., Ferris, B., Fox, D.: Learning to navigate through crowded environments. In: IEEE International Conference on Robotics and Automation, pp. 981–986 (2010) 13. Hirsch, M.W.: Differential Topology. Springer, Heidelberg (1976) 14. Ilhan, B.D., Johnson, A.M., Koditschek, D.E.: Autonomous legged hill ascent. J. Field Robot. 35(5), 802–832 (2018). https://doi.org/10.1002/rob.21779 15. Johnson, A.M., Hale, M.T., Haynes, G.C., Koditschek, D.E.: Autonomous legged hill and stairwell ascent. In: IEEE International Symposium on Safety, Security, and Rescue Robotics, pp. 134–142 (2011)
16. Kar, A., Tulsiani, S., Carreira, J., Malik, J.: Category-specific object reconstruction from a single image. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1966–1974 (2015)
17. Karaman, S., Frazzoli, E.: High-speed flight in an ergodic forest. In: IEEE International Conference on Robotics and Automation, pp. 2899–2906 (2012)
18. Kong, C., Lin, C.H., Lucey, S.: Using locally corresponding CAD models for dense 3D reconstructions from a single image. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 5603–5611 (2017)
19. Majumdar, A., Tedrake, R.: Funnel libraries for real-time robust feedback motion planning. Int. J. Robot. Res. 36(8), 947–982 (2017)
20. Paranjape, A.A., Meier, K.C., Shi, X., Chung, S.J., Hutchinson, S.: Motion primitives and 3D path planning for fast flight through a forest. Int. J. Robot. Res. 34(3), 357–377 (2015)
21. Paternain, S., Koditschek, D.E., Ribeiro, A.: Navigation functions for convex potentials in a space with convex obstacles. IEEE Trans. Autom. Control (2017). https://doi.org/10.1109/TAC.2017.2775046
22. Pavlakos, G., Zhou, X., Chan, A., Derpanis, K.G., Daniilidis, K.: 6-DoF object pose from semantic keypoints. In: IEEE International Conference on Robotics and Automation, pp. 2011–2018, May 2017
23. Rimon, E., Koditschek, D.E.: The construction of analytic diffeomorphisms for exact robot navigation on star worlds. Trans. Am. Math. Soc. 327(1), 71–116 (1989)
24. Rimon, E., Koditschek, D.E.: Exact robot navigation using artificial potential functions. IEEE Trans. Robot. Autom. 8(5), 501–518 (1992)
25. Shapiro, V.: Semi-analytic geometry with R-functions. Acta Numer. 16, 239–303 (2007)
26. Simmons, R.: The curvature-velocity method for local obstacle avoidance. In: IEEE International Conference on Robotics and Automation, vol. 4, pp. 3375–3382 (1996)
27. Trautman, P., Ma, J., Murray, R.M., Krause, A.: Robot navigation in dense human crowds: statistical models and experimental studies of human-robot cooperation. Int. J. Robot. Res. 34(3), 335–356 (2015)
28. Vasilopoulos, V., Arslan, O., De, A., Koditschek, D.E.: Sensor-based legged robot homing using range-only target localization. In: IEEE International Conference on Robotics and Biomimetics, pp. 2630–2637 (2017)
29. Vasilopoulos, V., Vega-Brown, W., Arslan, O., Roy, N., Koditschek, D.E.: Sensor-based reactive symbolic planning in partially known environments. In: IEEE International Conference on Robotics and Automation, pp. 5683–5690 (2018)
30. Vasilopoulos, V., Koditschek, D.E.: Technical report: reactive navigation in partially known non-convex environments. arXiv preprint (2018). https://arxiv.org/abs/1807.08432
31. Vasilopoulos, V., Topping, T.T., Vega-Brown, W., Roy, N., Koditschek, D.E.: Sensor-based reactive execution of symbolic rearrangement plans by a legged mobile manipulator. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3298–3305 (2018)
Resource-Aware Algorithms for Distributed Loop Closure Detection with Provable Performance Guarantees

Yulun Tian, Kasra Khosoussi(B), and Jonathan P. How

Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, USA
{yulun,kasra,jhow}@mit.edu

Y. Tian and K. Khosoussi: Equal contribution.
Abstract. Inter-robot loop closure detection, e.g., for collaborative simultaneous localization and mapping (CSLAM), is a fundamental capability for many multirobot applications in GPS-denied regimes. In real-world scenarios, this is a resource-intensive process that involves exchanging observations and verifying potential matches. This poses severe challenges especially for small-size and low-cost robots with various operational and resource constraints that limit, e.g., energy consumption, communication bandwidth, and computation capacity. This paper presents resource-aware algorithms for distributed inter-robot loop closure detection. In particular, we seek to select a subset of potential interrobot loop closures that maximizes a monotone submodular performance metric without exceeding computation and communication budgets. We demonstrate that this problem is in general NP-hard, and present efficient approximation algorithms with provable performance guarantees. A convex relaxation scheme is used to certify near-optimal performance of the proposed framework in real and synthetic SLAM benchmarks.
1 Introduction
Multirobot systems provide efficient and sustainable solutions to many large-scale missions. Inter-robot loop closure detection, e.g., for collaborative simultaneous localization and mapping (CSLAM), is a fundamental capability necessary for many such applications in GPS-denied environments. Discovering inter-robot loop closures requires (i) exchanging observations between rendezvousing robots, and (ii) collectively verifying potential matches. In many real-world scenarios, this is a resource-intensive process with a large search space due to, e.g., perceptual ambiguity, infrequent rendezvous, and long-term missions [4,10,12,15]. This task becomes especially challenging for prevalent small-size and low-cost platforms that are subject to various operational or resource constraints such as limited battery, low-bandwidth communication, and limited computation capacity. It is thus crucial for such robots to be able to seamlessly adapt to such constraints and intelligently utilize available on-board resources. Such flexibility also
enables robots to explore the underlying trade-off between resource consumption and performance, which ultimately can be exploited to save mission-critical resources. Typical ad hoc schemes and heuristics only offer partial remedies and often suffer from arbitrarily poor worst-case performance. This motivates the design of reliable resource-aware frameworks that provide performance guarantees.

This paper presents such a resource-aware framework for distributed inter-robot loop closure detection. More specifically, given budgets on computation (i.e., number of verification attempts) and communication (i.e., total amount of data transmission), we seek to select a budget-feasible subset of potential loop closures that maximizes a monotone submodular performance metric. To verify a potential inter-robot loop closure, at least one of the corresponding two robots must share its observation (e.g., image keypoints or laser scan) with the other robot. Thus, we need to address the following questions simultaneously: (i) which feasible subset of observations should robots share with each other? And (ii) which feasible subset of potential loop closures should be selected for verification? This problem generalizes previously studied NP-hard problems that only consider budgeted computation (unbounded communication) [14,15] or vice versa [29], and is therefore NP-hard. Furthermore, algorithms proposed in these works are incapable of accounting for the impact of their decisions on both resources, and thus suffer from arbitrarily poor worst-case performance in real-world scenarios. In this paper, we provide simple, efficient approximation algorithms with provable performance guarantees under budgeted computation and communication.

Contributions. Our algorithmic contributions for the resource-aware distributed loop closure detection problem are the following:
1. A constant-factor approximation algorithm for maximizing the expected number of true loop closures subject to computation and communication budgets. The established performance guarantee carries over to any modular objective.
2. Approximation algorithms for general monotone submodular performance metrics. The established performance guarantees depend on available/allocated resources as well as the extent of perceptual ambiguity and problem size.
We perform extensive evaluations of the proposed algorithms using real and synthetic SLAM benchmarks, and use a convex relaxation scheme to certify near-optimality in experiments where computing the optimal solution by exhaustive search is infeasible. The detailed description of algorithms, proofs, and additional experimental results is provided in an extended version of this paper [30].

Related Work. Resource-efficient CSLAM in general [3,24], and data-efficient distributed inter-robot loop closure detection [3–5,10,29] in particular, have been active areas of research in recent years. Cieslewski and Scaramuzza [4] and Cieslewski et al. [5] propose effective heuristics to reduce data transmission
in the so-called "online query" phase of the search for potential inter-robot loop closures, during which robots exchange partial queries and use them to search for promising potential matches. Efficiency of the more resource-intensive phase of full query exchange (e.g., full image keypoints, point clouds) is addressed in [10]. Giamou et al. [10] study the optimal lossless data exchange problem between a pair of rendezvousing robots. In particular, they show that the optimal exchange policy is closely related to the minimum weighted vertex cover of the so-called exchange graph (Fig. 1a); this was later extended to general r-rendezvous [29]. None of the abovementioned works, however, consider explicit budget constraints. In [29] we consider the budgeted data exchange problem (b-DEP), in which robots are subject to a communication budget. Specifically, [29] provides provably near-optimal approximation algorithms for maximizing a monotone submodular performance metric subject to such communication budgets. On the other hand, prior works on measurement selection in SLAM [2,14,15] (rooted in [6,12]) under computation budgets provide similar performance guarantees for cases where robots are subject to a cardinality constraint on, e.g., the number of edges added to the pose graph, or the number of verification attempts to discover loop closures [14,15]. Note that bounding either quantity also helps to reduce the computational cost of solving the underlying inference problem. The abovementioned works assume unbounded communication or computation. In real-world applications, however, robots need to seamlessly adapt to both budgets. The present work addresses this need by presenting resource-aware algorithms with provable performance guarantees that can operate in such regimes.

Notation and Preliminaries. Bold lower-case and upper-case letters are reserved for vectors and matrices, respectively. A set function f : 2^W → R≥0 for a finite W is normalized, monotone, and submodular (NMS) if it satisfies the following properties: (i) normalized: f(∅) = 0; (ii) monotone: for any A ⊆ B ⊆ W, f(A) ≤ f(B); and (iii) submodular: f(A) + f(B) ≥ f(A ∪ B) + f(A ∩ B) for any A, B ⊆ W. In addition, f is called modular if both f and −f are submodular. For any set of edges E, cover(E) denotes the set of all vertex covers of E. For any set of vertices V, edges(V) denotes the set of all edges incident to at least one vertex in V. Finally, ⊔ denotes the union of disjoint sets, i.e., writing A ⊔ B = A ∪ B implies A ∩ B = ∅.
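To make these definitions concrete, the following minimal sketch (ours, not from the paper; the edge names and probabilities are hypothetical) verifies the NMS properties of a small modular objective by exhaustive enumeration:

```python
from itertools import combinations

# Toy modular objective f(E) = sum of edge probabilities p(e); we check the
# NMS properties (normalized, monotone, submodular) over all subset pairs.
# Probabilities are exact binary fractions so the assertions are float-safe.
p = {'e1': 0.75, 'e2': 0.5, 'e3': 0.25}
f = lambda E: sum(p[e] for e in E)

ground_set = set(p)
subsets = [set(c) for r in range(len(ground_set) + 1)
           for c in combinations(sorted(ground_set), r)]

assert f(set()) == 0                                                  # normalized
assert all(f(A) <= f(B) for A in subsets for B in subsets if A <= B)  # monotone
assert all(f(A) + f(B) >= f(A | B) + f(A & B)
           for A in subsets for B in subsets)                         # submodular
```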
2 Problem Statement
We consider the distributed loop closure detection problem during a multi-robot rendezvous. Formally, an r-rendezvous [29] refers to a configuration where r ≥ 2 robots are situated such that every robot can receive data broadcasted by every other robot in the team. Each robot arrives at the rendezvous with a collection of sensory observations (e.g., images or laser scans) acquired throughout its mission at different times and locations. Our goal is to discover associations (loop closures) between observations owned by different robots. A distributed framework
Fig. 1. (a) An example exchange graph Gx in a 3-rendezvous where each robot owns three observations (vertices). Each potential inter-robot loop closure can be verified, if at least one robot shares its observation with the other robot. (b) In the absence of any resource constraint, the robots can collectively verify all potential loop closures. The optimal lossless exchange policy [10] corresponds to sharing vertices of a minimum vertex cover, in this case the 3 vertices marked in red. (c) Now if the robots are only permitted to exchange at most 2 vertices (b = 2) and verify at most 3 edges (k = 3), they must decide which subset of observations to share (marked in red), and which subset of potential loop closures to verify (marked in blue). Note that these two subproblems are tightly coupled, as the selected edges must be covered by the selected vertices.
for inter-robot loop closure detection divides the computational burden among different robots, and thus enjoys several advantages over centralized schemes (i.e., sending all observations to one node), including reduced data transmission, improved flexibility, and robustness; see, e.g., [4,10]. In these frameworks, robots first exchange a compact representation (metadata in [10]) of their observations in the form of, e.g., appearance-based feature vectors [4,10] or spatial clues (i.e., estimated location with uncertainty) [10]. Metadata help robots to efficiently identify a set of potential inter-robot loop closures by searching among their collected observations. For example, in many state-of-the-art SLAM systems (e.g., [22]), a pair of observations is declared a potential loop closure if the similarity score computed from matching the corresponding metadata (e.g., bag-of-words vectors) is above a threshold.¹ The similarity score can also be used to estimate the probability that a potential match is a true loop closure. The set of potential loop closures identified from metadata exchange can be naturally represented using an r-partite exchange graph [10,29].

Definition 1 (Exchange Graph). An exchange graph [10,29] between r robots is a simple undirected r-partite graph Gx = (Vx, Ex) where each vertex v ∈ Vx corresponds to an observation collected by one robot at a particular time.
¹ As the fidelity of metadata increases, the similarity score becomes more effective in identifying true loop closures. However, this typically comes at the cost of increased data transmission during metadata exchange. In this paper we do not take into account the cost of forming the exchange graph, which is inevitable for optimal data exchange [10].
The vertex set can be partitioned into r (self-independent) sets Vx = V1 ⊔ · · · ⊔ Vr. Each edge {u, v} ∈ Ex denotes a potential inter-robot loop closure identified by matching the corresponding metadata (here, u ∈ Vi and v ∈ Vj). Gx is endowed with w : Vx → R>0 and p : Ex → [0, 1] that quantify the size of each observation (e.g., in bytes, or the number of keypoints in an image) and the probability that an edge corresponds to a true loop closure (independent of other edges), respectively.

Even with high-fidelity metadata, the set of potential loop closures Ex typically contains many false positives. Hence, it is essential that the robots collectively verify all potential loop closures. Furthermore, for each true loop closure that passes the verification step, we also need to compute the relative transformation between the corresponding poses for the purpose of, e.g., pose-graph optimization in CSLAM. For visual observations, these can be done by performing so-called geometric verification, which typically involves RANSAC iterations to obtain keypoint correspondences and an initial transformation, followed by an optimization step to refine the initial estimate; see, e.g., [22]. Although geometric verification can be performed relatively efficiently, it can still become the computational bottleneck of the entire system in large problem instances [11,26]. In our case, this corresponds to a large exchange graph (e.g., due to perceptual ambiguity and infrequent rendezvous). Verifying all potential loop closures in this case can exceed resource budgets on, e.g., energy consumption or CPU time. Thus, from a resource scheduling perspective, it is natural to select an information-rich subset of potential loop closures E ⊆ Ex for verification. Assuming that the cost of geometric verification is uniform across different potential matches, we impose a computation budget by requiring that the selected subset (of edges in Gx) must contain no more than k edges, i.e., |E| ≤ k.

In addition to the computation cost of geometric verification, robots also incur communication cost when establishing inter-robot loop closures. More specifically, before two robots can verify a potential loop closure, at least one of them must share its observation with the other robot. It has been shown that the minimum data transmission required to verify any subset of edges E ⊆ Ex is determined by the minimum weighted vertex cover of the subgraph induced by E [10,29].² Based on this insight, we consider three different models for communication budgets in Table 1. First, in Total-Uniform (TU), robots are allowed to exchange at most b observations. This is justified under the assumption of uniform vertex weight (i.e., observation size) w. This assumption is relaxed in Total-Nonuniform (TN), where total data transmission must be at most b. Finally, in Individual-Uniform (IU), we assume Vx is partitioned into p blocks and robots are allowed to broadcast at most bi observations from the ith block for all i ∈ [p]. A natural partitioning of Vx is given by V1 ⊔ · · · ⊔ Vr. In this case, IU permits robot i to broadcast at most bi of its observations for all i ∈ [r]. This model captures the heterogeneous nature of the team.
² Selecting a vertex is equivalent to broadcasting the corresponding observation; see Fig. 1b.
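The coupling between the two resources can be seen in even a tiny feasibility check. The sketch below is our illustration (the vertex names and helper name `feasible_TU` are ours): it tests by brute force whether an edge subset can be covered within the TU budget, which takes exponential time, consistent with the NP-completeness of this decision problem noted in Sect. 3.

```python
from itertools import combinations

# Brute-force TU-feasibility of an edge subset E: does some vertex cover of E
# broadcast at most b observations? (Exponential in b; for intuition only.)
def feasible_TU(E, Vx, b):
    for r in range(b + 1):
        for V in combinations(sorted(Vx), r):
            V = set(V)
            if all(u in V or v in V for (u, v) in E):
                return True
    return False

Vx = {'a1', 'a2', 'b1', 'b2', 'c1'}
assert feasible_TU({('a1', 'b1'), ('b1', 'c1')}, Vx, 1)      # cover {b1} suffices
assert not feasible_TU({('a1', 'b1'), ('a2', 'b2')}, Vx, 1)  # needs two vertices
```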
Given the computation and communication budgets described above, robots must decide which budget-feasible subset of potential edges to verify in order to maximize a collective performance metric f : 2^Ex → R≥0. f(E) aims to quantify the expected utility gained by verifying the potential loop closures in E. For concrete examples of f, see Sects. 3 and 4, where we introduce three choices borrowed from [29]. This problem is formally defined below.

Problem 1. Let CB ∈ {TUb, TNb, IUb1:p}.

    maximize_{E ⊆ Ex}   f(E)                                          (P1)
    subject to   |E| ≤ k,   (# of verifications)
                 ∃ V ∈ cover(E) satisfying CB.   (data transmission)
Table 1. A subset of edges is feasible with regard to the communication budget if there exists a V ⊆ Vx that covers that subset and V satisfies the corresponding constraint; see P1.

    Type    | Constraint                  | Constraint class
    TUb     | |V| ≤ b                     | Cardinality
    TNb     | Σ_{v∈V} w(v) ≤ b            | Knapsack
    IUb1:p  | |V ∩ Vi| ≤ bi for i ∈ [p]   | Partition matroid
P1 generalizes NP-hard problems, and thus is NP-hard in general. In particular, for an NMS f and when b or the bi's are sufficiently large (i.e., unbounded communication), P1 becomes an instance of general NMS maximization under a cardinality constraint. Similarly, for an NMS f and a sufficiently large k (i.e., unbounded computation), this problem reduces to a variant of b-DEP [29, Section 3]. For general NMS maximization under a cardinality constraint, no polynomial-time approximation algorithm can provide a constant-factor approximation better than 1 − 1/e, unless P = NP; see [16] and the references therein for results on hardness of approximation. This immediately implies that 1 − 1/e is also the approximation barrier for the general case of P1 (i.e., a general NMS objective). In Sects. 3 and 4 we present approximation algorithms with provable performance guarantees for variants of this problem.
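To make the problem statement concrete, an exhaustive solver for a toy TU instance can be sketched as follows (ours, reusing `feasible_TU` from the sketch above; it is exponential, which is precisely why the paper develops approximation algorithms instead):

```python
from itertools import combinations

# Exhaustive OPT for a small P1 instance under TU with the modular objective
# f(E) = sum of p(e). Illustrative only: the search space grows exponentially.
def opt_brute_force(Vx, Ex, p, b, k):
    best = 0.0
    for r in range(k + 1):                      # |E| <= k
        for E in combinations(sorted(Ex), r):
            if feasible_TU(set(E), Vx, b):      # a b-vertex cover must exist
                best = max(best, sum(p[e] for e in E))
    return best
```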
3 Modular Performance Metrics
In this section, we consider a special case of P1 where f is normalized, monotone, and modular. This immediately implies that f(∅) = 0 and f(E) = Σ_{e∈E} f(e) for all non-empty E ⊆ Ex, where f(e) ≥ 0 for all e ∈ Ex. Without loss of generality, we focus on the case where f(E) gives the expected number of true inter-robot loop closures within E, i.e., f : E ↦ E[number of correct matches within E] = Σ_{e∈E} p(e); see Definition 1 and [29, Eq. 4]. This generic objective is applicable
to a broad range of scenarios where maximizing the expected number of "true associations" is desired (e.g., in distributed place recognition). P1 with modular objectives generalizes the maximum coverage problem on graphs, and thus is NP-hard in general; see [29]. In what follows, we present an efficient constant-factor approximation scheme for P1 with modular objectives under the communication cost regimes listed in Table 1. Note that merely deciding whether a given E ⊆ Ex is feasible for P1 under TU is an instance of the vertex cover problem [29], which is NP-complete. We thus first transform P1 into the following nested optimization problem:

    maximize_{V ⊆ Vx}   [ max_{E ⊆ edges(V), |E| ≤ k}  f(E) ]          (P2)
    subject to   V satisfies CB.
Let g : 2^Vx → R≥0 return the optimal value of the inner optimization problem (the bracketed term) and define g(∅) = 0. Note that g(V) gives the maximum expected number of true inter-robot loop closures achieved by broadcasting the observations associated to V and verifying at most k potential inter-robot loop closures. In contrast to P1, in P2 we explicitly maximize f over both vertices and edges. This transformation reveals the inherent structure of our problem; i.e., one needs to jointly decide which observations (vertices) to share (outer problem), and which potential loop closures (edges) to verify among the set of verifiable potential loop closures given the shared observations (inner problem). For a modular f, it is easy to see that the inner problem admits a trivial solution and hence g can be efficiently computed: if |edges(V)| > k, return the sum of the top k edge probabilities in edges(V); otherwise return the sum of all probabilities in edges(V).

Theorem 1. g is normalized, monotone, and submodular.

The proof is provided in [30]. Theorem 1 implies that P2 is an instance of classical monotone submodular maximization subject to a cardinality (TU), a knapsack (TN), or a partition matroid (IU) constraint. These problems admit constant-factor approximation algorithms. The best performance guarantee in all cases is 1 − 1/e (Table 2); i.e., in the worst case, the expected number of correct loop closures discovered by such algorithms is no less than 1 − 1/e ≈ 63% of that of an optimal solution. Among these algorithms, variants of the natural greedy algorithm are particularly well suited for our application due to their computational efficiency and incremental nature; see also [29]. These simple algorithms enjoy constant-factor approximation guarantees, albeit with a performance guarantee weaker than 1 − 1/e in the case of TN and IU; see the first row of Table 2 and [16]. More precisely, under the TU regime, the standard greedy algorithm that simply selects (i.e., broadcasts) the next remaining vertex v with the maximum marginal gain over the expected number of true loop closures provides the optimal approximation ratio [23]; see [30, Algorithm 1]. A naïve implementation of this algorithm requires O(b · |Vx|) evaluations of g, where each evaluation g(V) takes O(|edges(V)| × log k) time. We note that the number of
evaluations can be reduced by using the so-called lazy greedy method [16,21]. Under TN, the same greedy algorithm, together with one of its variants that normalizes marginal gains by vertex weights, provides a performance guarantee of 1/2 · (1 − 1/e) [19]. Finally, in the case of IU, selecting the next feasible vertex according to the standard greedy algorithm leads to a performance guarantee of 1/2 [7]. The following theorem provides an approximation-preserving reduction from P1 to P2.

Theorem 2. Given ALG, an α-approximation algorithm for P2 for an α ∈ (0, 1), the following is an α-approximation algorithm for P1:
1. Run ALG on the corresponding instance of P2 to produce V.
2. If |edges(V)| > k, return the k edges with the highest probabilities in edges(V); otherwise return the entire edges(V).

The proof is provided in [30]. Theorem 2 implies that any near-optimal solution for P2 can be used to construct an equally good near-optimal solution for our original problem P1 with the same approximation ratio. The first row of Table 2 summarizes the approximation factors of the abovementioned greedy algorithms for the various communication regimes. Note that this reduction also holds for the more sophisticated (1 − 1/e)-approximation algorithms (Table 2).

Table 2. Approximation ratio for modular objectives

    Algorithm | TU           | TN                   | IU
    Greedy    | 1 − 1/e [23] | 1/2 · (1 − 1/e) [19] | 1/2 [7]
    Best      | 1 − 1/e [23] | 1 − 1/e [28]         | 1 − 1/e [1]
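A compact sketch of this pipeline under TU, written by us after the description of [30, Algorithm 1] (the helper names `edges_of` and `p` are ours): the inner problem g is a top-k sum, the outer problem is greedy vertex selection, and the final step applies the reduction of Theorem 2.

```python
def g(V, edges_of, p, k):
    """Inner problem of P2: expected number of true loop closures obtained by
    broadcasting V and verifying at most k covered edges (top-k sum)."""
    probs = sorted((p[e] for e in edges_of(V)), reverse=True)
    return sum(probs[:k])

def m_greedy(Vx, edges_of, p, b, k):
    """Standard greedy for P2 under TU, then the Theorem 2 reduction to P1."""
    V = set()
    for _ in range(b):  # broadcast at most b observations
        # argmax of g(V + {v}) equals argmax of the marginal gain over V
        best = max(Vx - V, key=lambda v: g(V | {v}, edges_of, p, k), default=None)
        if best is None:
            break
        V.add(best)
    # Theorem 2: keep the (at most) k most probable covered edges.
    return sorted(edges_of(V), key=p.__getitem__, reverse=True)[:k]
```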
Remark 1. Kulik et al. [17] study the so-called maximum coverage with packing constraint (MCP), which includes P2 under TU (i.e., cardinality constraint) as a special case. Our approach differs from [17] in the following three ways. Firstly, the algorithm proposed in [17] achieves a performance guarantee of 1 − 1/e for MCP by applying partial enumeration, which is computationally expensive in practice. This additional complexity is due to a knapsack constraint on edges (“items” according to [17]) which is unnecessary in our application. As a result, the standard greedy algorithm retains the optimal performance guarantee for P2 without any need for partial enumeration. Secondly, in addition to TU, we study other models of communication budgets (i.e., TN and IU), which leads to more general classes of constraints (i.e., knapsack and partition matroid) that MCP does not consider. Lastly, besides providing efficient approximation algorithms for P2 , we further demonstrate how such algorithms can be leveraged to solve the original problem P1 by establishing an approximation-preserving reduction (Theorem 2).
4 Submodular Performance Metrics
Now we assume f : 2^Ex → R≥0 can be an arbitrary NMS objective, while limiting our communication cost regime to TU. From Sect. 2 recall that this problem is NP-hard in general. To the best of our knowledge, no prior work exists on approximation algorithms for Problem 1 with a general NMS objective. Furthermore, the approach presented for the modular case cannot be immediately extended to the more general case of submodular objectives (e.g., evaluating the generalized g will be NP-hard). It is thus unclear whether any constant-factor (ideally, 1 − 1/e) approximation can be achieved for an arbitrary NMS objective. Before presenting our main results and approximation algorithms, let us briefly introduce two such objectives suitable for our application; see also [29]. The D-optimality design criterion (hereafter, D-criterion), defined as the log-determinant of the Fisher information matrix, is a popular monotone submodular objective originating from the theory of optimal experimental design [25]. This objective has been widely used across many domains (including SLAM) and enjoys rich geometric and information-theoretic interpretations; see, e.g., [13]. Assuming random edges of Gx occur independently (i.e., potential loop closures "realize" independently), one can use the approximate expected gain in the D-criterion [2] [29, Eq. 2] as the objective in P1. This is known to be NMS if the information matrix prior to rendezvous is positive definite [27,29]. In addition, in the case of 2D SLAM, Khosoussi et al. [14,15] show that the expected weighted number of spanning trees (or tree-connectivity) [29, Eq. 3] in the underlying pose graph provides a graphical surrogate for the D-criterion, with the advantage of being cheaper to evaluate and requiring no metric knowledge about the robots' trajectories [29]. Similar to the D-criterion, the expected tree-connectivity is monotone submodular if the underlying pose graph is connected before selecting any potential loop closure [14,15]. Now let us revisit P1. It is easy to see that for a sufficiently large communication budget b, P1 becomes an instance of NMS maximization under only a cardinality constraint on edges (i.e., the computation budget). Indeed, when k < b, the communication constraint becomes redundant, because k or fewer edges can trivially be covered by selecting at most k < b vertices. Therefore, in such cases the standard greedy algorithm operating on edges Ex achieves a 1 − 1/e performance guarantee [23]; see, e.g., [2,14,15] for similar works. Similarly, for a sufficiently large computation budget k (e.g., when b < k/Δ, where Δ is the maximum degree of Gx), P1 becomes an instance of the budgeted data exchange problem, for which there exist constant-factor (e.g., 1 − 1/e under TU) approximation algorithms [29]. Such algorithms greedily select vertices, i.e., broadcast the corresponding observations and select all edges incident to them [29]. Greedy edge (resp., vertex) selection thus provides 1 − 1/e approximation guarantees for cases where k is sufficiently smaller than b and vice versa. One can apply these algorithms on arbitrary instances of P1 by picking edges/vertices greedily and stopping whenever at least one of the two budget constraints is violated. Let us call the corresponding algorithms Edge-Greedy (E-Greedy) and Vertex-Greedy (V-Greedy), respectively; see [30, Algorithms 2 and 3]. Moreover, let
Submodular-Greedy (S-Greedy) be the algorithm according to which one runs both V-Greedy and E-Greedy and returns the better of the two solutions. A naïve implementation of S-Greedy thus needs O(b · |Vx| + k · |Ex|) evaluations
of the objective. For the D-criterion and tree-connectivity, each evaluation in principle takes O(d³) time, where d denotes the total number of robot poses. In practice, the cubic dependence on d can be eliminated by leveraging the sparse structure of the global pose graph. This complexity can be further reduced by cleverly reusing Cholesky factors in each round and utilizing rank-one updates; see [15,29]. Once again, the number of function calls can be reduced by using the lazy greedy method. The following theorem provides a performance guarantee for S-Greedy in terms of b, k, and Δ.

Theorem 3. Submodular-Greedy is an α(b, k, Δ)-approximation algorithm for P1, where α(b, k, Δ) ≜ 1 − exp(−min{1, γ}), in which γ ≜ max{b/k, ⌊k/Δ⌋/b}.
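For intuition about how the bound behaves, a two-line helper (ours, not part of the paper's code) evaluates α directly:

```python
import math

# Evaluate the Theorem 3 guarantee alpha(b, k, Delta) for given budgets.
def alpha(b, k, delta):
    gamma = max(b / k, (k // delta) / b)   # floor(k / Delta) / b
    return 1.0 - math.exp(-min(1.0, gamma))

print(alpha(50, 100, 5))    # ~0.39: neither greedy regime dominates here
print(alpha(200, 100, 5))   # ~0.63: communication is ample, so E-Greedy excels
```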
Fig. 2. (a) S-Greedy approximation factor in KITTI 00 with Δ = 41. The approximation factor in this case varies between 0.18 and 0.63. (b) S-Greedy approximation factor in KITTI 00 with the maximum degree capped at Δ = 5. The approximation factor varies between 0.36 and 0.63. (c) The approximate S-Greedy performance guarantee α̃(κ, Δ) as a function of κ ≜ b/k for different values of Δ.
The complementary nature of E-Greedy and V-Greedy is reflected in the performance guarantee α(b, k, Δ) presented above. Intuitively, E-Greedy (resp., V-Greedy) is expected to perform well when the computation (resp., communication) budget is scarce compared to the communication (resp., computation) budget. It is worth noting that for a specific instance of Gx, the actual performance guarantee of S-Greedy can be higher than α(b, k, Δ). This potentially stronger performance guarantee can be computed a posteriori, i.e., after running S-Greedy on the given instance of Gx; see [30, Lemmas 3 and 4]. As an example, Fig. 2a shows the a posteriori approximation factors of S-Greedy on the KITTI 00 dataset (Sect. 5) for each combination of budgets (b, k). Theorem 3 indicates that reducing Δ enhances the performance guarantee α(b, k, Δ). This is demonstrated in Fig. 2b: after capping Δ at 5, the minimum approximation factor increases from about 0.18 to 0.36. In practice, Δ may be large due to high uncertainty
in the initial set of potential inter-robot loop closures; e.g., in situations with high perceptual ambiguity, an observation could potentially be matched to many other observations during the initial phase of metadata exchange. This issue can be mitigated by bounding Δ or increasing the fidelity of metadata. To gain more intuition, let us approximate ⌊k/Δ⌋ in α(b, k, Δ) with k/Δ.³ With this simplification, the performance guarantee can be represented as α̃(κ, Δ), i.e., a function of the budget ratio κ ≜ b/k and Δ. Figure 2c shows α̃(κ, Δ) as a function of κ for different maximum degrees Δ (independent of any specific Gx).

Remark 2. It is worth noting that α(b, k, Δ) can be bounded from below by a function of Δ, i.e., independent of b and k. More precisely, after some algebraic manipulation,⁴ it can be shown that α(b, k, Δ) ≥ 1 − exp(−c(Δ)) where 1/(Δ + 1) ≤ c(Δ) ≤ 1/√Δ. This implies that for a bounded Δ ≤ Δmax, S-Greedy is a constant-factor approximation algorithm for P1.
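The two building blocks of S-Greedy can be sketched as follows. This is our rendering of the descriptions of [30, Algorithms 2 and 3], not the paper's code; `edges_of` (returning the set of edges covered by a vertex set, with `edges_of(set())` empty), `min_cover_size` (an oracle for the size of a minimum vertex cover, itself the hard subproblem discussed in Sect. 2), and the stopping rule are simplified placeholders.

```python
def v_greedy(Vx, edges_of, f, b, k):
    """Greedily broadcast vertices, keeping all covered edges, within budgets."""
    V = set()
    while len(V) < b:
        gain = lambda v: f(edges_of(V | {v})) - f(edges_of(V))
        best = max(Vx - V, key=gain, default=None)
        if best is None or len(edges_of(V | {best})) > k:
            break  # stop when the edge budget would be violated
        V.add(best)
    return edges_of(V)

def e_greedy(Ex, f, k, b, min_cover_size):
    """Greedily add edges by marginal gain while a b-vertex cover still exists."""
    E = set()
    while len(E) < k:
        best = max(Ex - E, key=lambda e: f(E | {e}) - f(E), default=None)
        if best is None or min_cover_size(E | {best}) > b:
            break  # stop when the communication budget would be violated
        E.add(best)
    return E

def s_greedy(Vx, Ex, edges_of, f, b, k, min_cover_size):
    """Return the better of the two greedy solutions (cf. Theorem 3)."""
    return max(v_greedy(Vx, edges_of, f, b, k),
               e_greedy(Ex, f, k, b, min_cover_size), key=f)
```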
Fig. 3. Left: KITTI 00; Right: 2D simulation [29]. Each figure shows simulated trajectories of five robots (robots 1–5). The KITTI trajectories shown are estimated purely using prior beliefs and odometry measurements (hence the drift). The simulated trajectories shown are the exact ground truth.
5 Experimental Results
We evaluate the proposed algorithms using sequence 00 of the KITTI odometry benchmark [9] and a synthetic Manhattan-like dataset [29]. Each dataset is divided into multiple trajectories to simulate individual robots' paths (Fig. 3). For the KITTI sequence, we project the trajectories onto the 2D plane in order to use the tree-connectivity objective [14]. Visual odometry measurements and potential loop closures are obtained from a modified version of ORB-SLAM2 [22]. We estimate the probability of each potential loop closure by normalizing

³ This is a reasonable approximation when, e.g., b is sufficiently large (b ≥ b0), since k/(Δb) − 1/b < ⌊k/Δ⌋/b ≤ k/(Δb), and thus the introduced error in the exponent will be at most 1/b0.
⁴ Omitted due to space limitation.
the corresponding DBoW2 score [8]. For any specific environment, a better mapping from the similarity score to the corresponding probability can be learned offline. Nonetheless, these estimated probabilities are merely used to encourage the selection of more promising potential matches, and the exact mapping used is orthogonal to the evaluation of the proposed approximation algorithms. For simulation, noisy odometry and loop closures are generated using the 2D simulator of g2o [18]. Each loop closure in the original dataset is considered a potential match with an occurrence probability generated randomly according to the uniform distribution U(0, 1). Then, we decide whether each potential match is a "true" loop closure by sampling from a Bernoulli distribution with the corresponding occurrence probability. This process thus provides unbiased estimates of the actual occurrence probabilities. In our experiments, each visual observation contains about 2000 keypoints. We ignore the insignificant variation in observation sizes and design all test cases under the TU communication cost regime. Assuming each keypoint (consisting of a descriptor and coordinates) uses 40 bytes of data, a communication budget of 50, for example, translates to about 4 MB of data transmission [29].
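As a quick sanity check of that last figure:

```python
# Back-of-the-envelope cost of the TU communication budget used below.
keypoints_per_observation = 2000
bytes_per_keypoint = 40        # descriptor plus coordinates
b = 50                         # budget: number of observations broadcast
print(b * keypoints_per_observation * bytes_per_keypoint / 1e6)  # 4.0 (MB)
```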
5.1 Certifying Near-Optimality via Convex Relaxation
Evaluating the proposed algorithms ideally requires access to the optimal value (OPT) of P1. However, computing OPT by brute force is challenging even in relatively small instances. Following [29], we compute an upper bound UPT ≥ OPT by solving a natural convex relaxation of P1, and use UPT as a surrogate for OPT. Comparing with UPT provides an a posteriori certificate of near-optimality for solutions returned by the proposed algorithms. Let π ≜ [π1, ..., πn] and ℓ ≜ [ℓ1, ..., ℓm] be indicator variables corresponding to the vertices and edges of Gx, respectively. Let A be the undirected incidence matrix of Gx. P1 can then be formulated in terms of π and ℓ. For example, for modular objectives (Sect. 3) under the TU communication model, P1 is equivalent to maximizing f(ℓ) ≜ Σ_{e∈Ex} p(e) · ℓe subject to (π, ℓ) ∈ Fint, where Fint ≜ {(π, ℓ) ∈ {0, 1}^n × {0, 1}^m : 1⊤π ≤ b, 1⊤ℓ ≤ k, A⊤π ≥ ℓ}. Relaxing Fint to F ≜ {(π, ℓ) ∈ [0, 1]^n × [0, 1]^m : 1⊤π ≤ b, 1⊤ℓ ≤ k, A⊤π ≥ ℓ} gives the natural LP relaxation, whose optimal value is an upper bound on OPT. Note that in this special case, we can also compute OPT directly by solving the original ILP (this is not practical for real-world applications). On the other hand, for maximizing the D-criterion and tree-connectivity (Sect. 4), convex relaxation produces determinant maximization (maxdet) problems [32] subject to affine constraints. In our experiments, all LP and ILP instances are solved using built-in solvers in MATLAB. All maxdet problems are modeled using the YALMIP toolbox [20] and solved using SDPT3 [31] in MATLAB.
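For the modular/TU case the relaxation is an ordinary LP and fits in a few lines of any LP solver. Below is a minimal sketch of ours, using SciPy rather than the MATLAB solvers used in the paper; A is the n × m incidence matrix and p the array of edge probabilities.

```python
import numpy as np
from scipy.optimize import linprog

def upt_lp(A, p, b, k):
    """Upper bound UPT >= OPT via the natural LP relaxation of P1 (TU regime)."""
    n, m = A.shape
    # variables x = [pi; ell]; linprog minimizes, so negate the objective p'ell
    c = np.concatenate([np.zeros(n), -np.asarray(p, dtype=float)])
    A_ub = np.zeros((2 + m, n + m))
    b_ub = np.zeros(2 + m)
    A_ub[0, :n] = 1.0; b_ub[0] = b     # 1' pi <= b
    A_ub[1, n:] = 1.0; b_ub[1] = k     # 1' ell <= k
    A_ub[2:, :n] = -A.T                # ell_e - (A' pi)_e <= 0, i.e. A' pi >= ell
    A_ub[2:, n:] = np.eye(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0.0, 1.0)] * (n + m))
    return -res.fun
```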
5.2 Results with Modular Objectives
For brevity, we refer to the proposed greedy algorithm in Sect. 3 under TU (see [30, Algorithm 1]) as M-Greedy. Figure 4 shows the optimality gap of M-Greedy
Fig. 4. Optimality gap of M-Greedy in KITTI 00 under TU. In this case, the objective is to maximize the expected number of loop closures ([29, Eq. 4]). For each problem instance specified by a pair of budgets (b, k), we calculate the difference between the achieved value and the optimal value obtained by solving the ILP. Each value is then normalized by the maximum achievable value given infinite budgets and shown in percentage.
evaluated on the KITTI 00 dataset. In this case, the objective is to maximize the expected number of true loop closures ([29, Eq. 4]). Given the exchange graph, we vary the communication budget b and computation budget k to produce different instances of P1. For each instance, we compute the difference between the achieved value and the optimal value obtained by solving the corresponding ILP (Sect. 5.1). The computed difference is then normalized by the maximum achievable value given infinite budgets and converted into a percentage. The result for each instance is shown as an individual cell in Fig. 4. In all instances, the optimality gaps are close to zero. In fact, the maximum unnormalized difference across all instances is about 1.35, i.e., the achieved value and the optimal value differ by only 1.35 expected loop closures. These results clearly confirm the near-optimal performance of the proposed algorithm. Furthermore, in many instances, M-Greedy finds optimal solutions (shown in dark blue). We note that this result is expected when b > k (top-left region), as in this case P1 reduces to a degenerate instance of the knapsack problem, for which the greedy algorithm is known to be optimal.
5.3 Results with Submodular Objectives
Figure 5 shows the performance of S-Greedy in KITTI 00 under the TU regime. Figures 5a–c use the D-criterion objective [29, Eq. 2], and 5d–f use the tree-connectivity objective [29, Eq. 3]. Each figure corresponds to one scenario with a fixed communication budget b and a varying computation budget k. We compare the proposed algorithm with a baseline that randomly selects b vertices, and then selects k edges at random from the set of edges incident to the selected vertices. When using the tree-connectivity objective, we also plot the upper bound
Fig. 5. Performance of Submodular-Greedy in KITTI 00 under TU, with the D-criterion objective in (a)–(c) and the tree-connectivity objective in (d)–(f). Each figure shows one scenario with a fixed communication budget b (20, 70, or 150) and a varying computation budget k. The proposed algorithm is compared against a random baseline. With the tree-connectivity objective, we also show the upper bound (UPT) computed using convex relaxation. All values are normalized by the maximum achievable value given infinite budgets.
(UPT) computed using convex relaxation (Sect. 5.1).⁵ All values shown are normalized by the maximum achievable value given infinite budgets. In all instances, S-Greedy clearly outperforms the random baseline. Moreover, in 5d–f, the achieved objective values are close to UPT. In particular, using the fact that UPT ≥ OPT, we compute a lower bound on the empirical approximation ratio in 5d–f, which varies between 0.58 and 0.99. This confirms the near-optimality of the approximate solutions found by S-Greedy. The inflection point of each S-Greedy curve in Fig. 5a–f corresponds to the point where the algorithm switches from greedily selecting edges (E-Greedy) to greedily selecting vertices (V-Greedy). Note that in all scenarios, the achieved value of S-Greedy eventually saturates. This is because as k increases, S-Greedy eventually spends the entire communication budget and halts. In these cases, since we already select the maximum number (i.e., b) of vertices, the algorithm achieves a constant approximation factor of 1 − 1/e (see [30, Lemma 4]). Due to space limitation, additional results on the simulated datasets and results comparing M-Greedy with S-Greedy in the case of modular objectives are reported in [30, Appendix C].
⁵ For KITTI 00 we do not show UPT when using the D-criterion objective, because solving the convex relaxation in this case is too time-consuming.
6 Conclusion
Inter-robot loop closure detection is a critical component for many multirobot applications, including those that require collaborative localization and mapping. The resource-intensive nature of this process, together with the limited nature of mission-critical resources available on-board, necessitates intelligent utilization of available/allocated resources. This paper studied distributed inter-robot loop closure detection under computation and communication budgets. More specifically, we sought to maximize monotone submodular performance metrics by selecting a subset of potential inter-robot loop closures that is feasible with respect to both computation and communication budgets. This problem generalizes previously studied NP-hard problems that assume either unbounded computation or communication. In particular, for monotone modular objectives (e.g., the expected number of true loop closures) we presented a family of greedy algorithms with constant-factor approximation ratios under multiple communication cost regimes. This was made possible by establishing an approximation-factor-preserving reduction to well-studied instances of monotone submodular maximization problems under cardinality, knapsack, and partition matroid constraints. More generally, for any monotone submodular objective, we presented an approximation algorithm that exploits the complementary nature of greedily selecting potential loop closures for verification and greedily broadcasting observations, in order to establish a performance guarantee. The performance guarantee in this more general case depends on the resource budgets and the extent of perceptual ambiguity. It remains an open problem whether a constant-factor approximation for any monotone submodular objective is possible. We plan to study this open problem as part of our future work. Additionally, although the burden of verifying potential loop closures is distributed among the robots, our current framework still relies on a centralized scheme for evaluating the performance metric and running the approximation algorithms presented for P1. In the future, we aim to eliminate such reliance on centralized computation.

Acknowledgments. This work was supported in part by the NASA Convergent Aeronautics Solutions project Design Environment for Novel Vertical Lift Vehicles (DELIVER), by ONR under BRC award N000141712072, and by ARL DCIST under Cooperative Agreement Number W911NF-17-2-0181.
References
1. Calinescu, G., Chekuri, C., Pál, M., Vondrák, J.: Maximizing a monotone submodular function subject to a matroid constraint. SIAM J. Comput. 40(6), 1740–1766 (2011). https://doi.org/10.1137/080733991
2. Carlone, L., Karaman, S.: Attention and anticipation in fast visual-inertial navigation. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3886–3893. IEEE (2017)
3. Choudhary, S., Carlone, L., Nieto, C., Rogers, J., Christensen, H.I., Dellaert, F.: Distributed mapping with privacy and communication constraints: lightweight algorithms and object-based models. Int. J. Rob. Res. 36(12), 1286–1311 (2017). https://doi.org/10.1177/0278364917732640
4. Cieslewski, T., Scaramuzza, D.: Efficient decentralized visual place recognition using a distributed inverted index. IEEE Rob. Autom. Lett. 2(2), 640–647 (2017)
5. Cieslewski, T., Choudhary, S., Scaramuzza, D.: Data-efficient decentralized visual SLAM. CoRR, abs/1710.05772 (2017)
6. Davison, A.J.: Active search for real-time vision. In: Tenth IEEE International Conference on Computer Vision, ICCV 2005, vol. 1, pp. 66–73 (2005)
7. Fisher, M.L., Nemhauser, G.L., Wolsey, L.A.: An analysis of approximations for maximizing submodular set functions - II. In: Polyhedral Combinatorics, pp. 73–87. Springer, Heidelberg (1978)
8. Gálvez-López, D., Tardós, J.D.: Bags of binary words for fast place recognition in image sequences. IEEE Trans. Rob. 28(5), 1188–1197 (2012). https://doi.org/10.1109/TRO.2012.2197158. ISSN 1552-3098
9. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
10. Giamou, M., Khosoussi, K., How, J.P.: Talk resource-efficiently to me: optimal communication planning for distributed loop closure detection. In: IEEE International Conference on Robotics and Automation (ICRA) (2018)
11. Heinly, J., Schönberger, J.L., Dunn, E., Frahm, J.M.: Reconstructing the world* in six days *(as captured by the Yahoo 100 million image dataset). In: Computer Vision and Pattern Recognition (CVPR) (2015)
12. Ila, V., Porta, J.M., Andrade-Cetto, J.: Information-based compact pose SLAM. IEEE Trans. Rob. 26(1), 78–93 (2010)
13. Joshi, S., Boyd, S.: Sensor selection via convex optimization. IEEE Trans. Signal Process. 57(2), 451–462 (2009)
14. Khosoussi, K., Sukhatme, G.S., Huang, S., Dissanayake, G.: Designing sparse reliable pose-graph SLAM: a graph-theoretic approach. In: International Workshop on the Algorithmic Foundations of Robotics (2016)
15. Khosoussi, K., Giamou, M., Sukhatme, G.S., Huang, S., Dissanayake, G., How, J.P.: Reliable graphs for SLAM. Int. J. Rob. Res. (2019, to appear)
16. Krause, A., Golovin, D.: Submodular function maximization. In: Bordeaux, L., Hamadi, Y., Kohli, P. (eds.) Tractability: Practical Approaches to Hard Problems, pp. 71–104. Cambridge University Press, Cambridge (2014). ISBN 9781139177801
17. Kulik, A., Shachnai, H., Tamir, T.: Maximizing submodular set functions subject to multiple linear constraints. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2009, Philadelphia, PA, USA, pp. 545–554. Society for Industrial and Applied Mathematics (2009)
18. Kümmerle, R., Grisetti, G., Strasdat, H., Konolige, K., Burgard, W.: g2o: a general framework for graph optimization. In: 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 3607–3613. IEEE (2011)
19. Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J., Glance, N.: Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420–429. ACM (2007)
20. Löfberg, J.: YALMIP: a toolbox for modeling and optimization in MATLAB. In: Proceedings of the CACSD Conference, Taipei, Taiwan (2004)
21. Minoux, M.: Accelerated greedy algorithms for maximizing submodular set functions. In: Optimization Techniques, pp. 234–243. Springer, Heidelberg (1978)
22. Mur-Artal, R., Tardós, J.D.: ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Trans. Rob. 33(5), 1255–1262 (2017). https://doi.org/10.1109/TRO.2017.2705103
23. Nemhauser, G.L., Wolsey, L.A., Fisher, M.L.: An analysis of approximations for maximizing submodular set functions - I. Math. Program. 14(1), 265–294 (1978)
24. Paull, L., Huang, G., Leonard, J.J.: A unified resource-constrained framework for graph SLAM. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1346–1353. IEEE (2016)
25. Pukelsheim, F.: Optimal Design of Experiments, vol. 50. SIAM (1993)
26. Raguram, R., Tighe, J., Frahm, J.M.: Improved geometric verification for large scale landmark image collections. In: BMVC 2012 - Electronic Proceedings of the British Machine Vision Conference 2012. British Machine Vision Association, BMVA (2012). https://doi.org/10.5244/C.26.77
27. Shamaiah, M., Banerjee, S., Vikalo, H.: Greedy sensor selection: leveraging submodularity. In: 49th IEEE Conference on Decision and Control (CDC), pp. 2572–2577. IEEE (2010)
28. Sviridenko, M.: A note on maximizing a submodular set function subject to a knapsack constraint. Oper. Res. Lett. 32(1), 41–43 (2004)
29. Tian, Y., Khosoussi, K., Giamou, M., How, J.P., Kelly, J.: Near-optimal budgeted data exchange for distributed loop closure detection. In: Proceedings of Robotics: Science and Systems, Pittsburgh, USA (2018)
30. Tian, Y., Khosoussi, K., How, J.P.: Resource-aware algorithms for distributed loop closure detection with provable performance guarantees. arXiv preprint arXiv:1901.05925 (2019)
31. Toh, K.C., Todd, M.J., Tütüncü, R.H.: SDPT3 - a MATLAB software package for semidefinite programming. Optim. Meth. Softw. 11, 545–581 (1999)
32. Vandenberghe, L., Boyd, S., Shao-Po, W.: Determinant maximization with linear matrix inequality constraints. SIAM J. Matrix Anal. Appl. 19(2), 499–533 (1998)
Dynamics of Contact in Manipulation and Locomotion
RMPflow: A Computational Graph for Automatic Motion Policy Generation

Ching-An Cheng1,2(B), Mustafa Mukadam1,2, Jan Issac1, Stan Birchfield1, Dieter Fox1,3, Byron Boots1,2, and Nathan Ratliff1

1 Seattle Robotics Lab, NVIDIA, Seattle, WA, USA
2 Robot Learning Lab, Georgia Institute of Technology, Atlanta, GA, USA
[email protected]
3 Robotics and State Estimation Lab, University of Washington, Seattle, WA, USA
Abstract. We develop a novel policy synthesis algorithm, RMPflow, based on geometrically consistent transformations of Riemannian Motion Policies (RMPs). RMPs are a class of reactive motion policies designed to parameterize non-Euclidean behaviors as dynamical systems in intrinsically nonlinear task spaces. Given a set of RMPs designed for individual tasks, RMPflow can consistently combine these local policies to generate an expressive global policy, while simultaneously exploiting sparse structure for computational efficiency. We study the geometric properties of RMPflow and provide sufficient conditions for stability. Finally, we experimentally demonstrate that accounting for the geometry of task policies can simplify classically difficult problems, such as planning through clutter on high-DOF manipulation systems.

Keywords: Motion and path planning · Collision avoidance · Dynamics
1 Introduction
Introduction
In this work, we develop a new motion generation and control framework that enables globally stable controller design within intrinsically non-Euclidean spaces.1 Non-Euclidean geometries are not often modeled explicitly in robotics, but are nonetheless common in the natural world. One important example is the apparent non-Euclidean behavior of obstacle avoidance. Obstacles become holes in this setting. As a result, straight lines are no longer a reasonable definition of shortest distance—geodesics must, therefore, naturally flow around them. This behavior implies a form of non-Euclidean geometry: the space is naturally curved by the presence of obstacles. The planning literature has made substantial progress in modeling nonEuclidean task-space behaviors, but at the expense of efficiency and reactivity. Starting with early differential geometric models of obstacle avoidance [1] 1
Spaces defined by non-constant Riemannian metrics with non-trivial curvature.
The appendices of this paper can be found at https://arxiv.org/abs/1811.07049. c Springer Nature Switzerland AG 2020 M. Morales et al. (Eds.): WAFR 2018, SPAR 14, pp. 441–457, 2020. https://doi.org/10.1007/978-3-030-44051-0_26
442
C.-A. Cheng et al.
and building toward modern planning algorithms and optimization techniques [2–9], these techniques can calculate highly nonlinear trajectories. However, they are often computationally intensive, sensitive to noise, and unresponsive to perturbation. In addition, the internal nonlinearities of robots due to kinematic constraints are sometimes simplified in the optimization. At the same time, a separate thread of literature, emphasizing fast reactive control over computationally expensive planning, developed efficient closed-loop control techniques such as Operational Space Control (OSC) [10]. But while these techniques account for internal geometries from the robot’s kinematic structure, they assume simple Euclidean geometry in task spaces [11,12], failing to provide a complete treatment of the external geometries. As a result, obstacle avoidance, e.g., has to rely on extrinsic potential functions, leading to undesirable deacceleartion behavior when the robot is close to the obstacle. If the non-Euclidean geometry can be intrinsically considered, then fast obstacle avoidance motion would naturally arise as traveling along the induced geodesic. The need for a holistic solution to motion generation and control has motivated a number of recent system architectures tightly integrating planning and control [13,14]. We develop a new approach to synthesizing control policies that can accommodate and leverage the modeling capacity of intrinsically non-Euclidean robotics tasks. Taking inspiration from Geometric Control Theory [15],2 we design a novel recursive algorithm, RMPflow, based on a recently proposed mathematical object for representing nonlinear policies known as the Riemannian Motion Policy (RMP) [16]. This algorithm enables the geometrically consistent fusion of many component policies defined across non-Euclidean task spaces that are related through a tree structure. We show that RMPflow, which generates behavior by calculating how the robot should accelerate, mimics the Recursive Newton-Euler algorithm [17] in structure, but generalizes it beyond rigid-body systems to a broader class of highly-nonlinear transformations and spaces. In contrast to existing frameworks, our framework naturally models nonEuclidean task spaces with Riemannian metrics that are not only configuration dependent, but also velocity dependent. This allows RMPflow to consider, e.g., the direction a robot travels to define the importance weights in combing policies. For example, an obstacle, despite being close to the robot, can usually be ignored if robot is heading away from it. This new class of policies leads to an extension of Geometric Control Theory, building on a new class of non-physical mechanical systems we call Geometric Dynamical Systems (GDS). We also show that RMPflow is Lyapunov-stable and coordinate-free. In particular, when using RMPflow, robots can be viewed each as different parameterizations of the same task space, defining a precise notion of behavioral consistency between robots. Additionally, under this framework, the implicit curvature arising from non-constant Riemannian metrics (which may be roughly viewed as position-velocity dependent inertia matrices in OSC) produces nontrivial and intuitive policy contributions that are critical to guaranteeing stability and 2
See Appendix A.1 for a discussion of why geometric mechanics and geometric control theory constitute a good starting point.
RMPflow
443
generalization across embodiments. Our experimental results illustrate how these curvature terms can be impactful in practice, generating nonlinear geodesics that result in curving or orbiting around obstacles. Finally, we demonstrate the utility of our framework with a fully reactive real-world system on multiple dual-arm manipulation problems.
2
Motion Generation and Control
Motion generation and control can be formulated as the problem of transforming curves from the configuration space C to the task space T . Specifically, let C be a d-dimensional smooth manifold. A robot’s motion can be described as a curve q : [0, ∞) → C such that the robot’s configuration at time t is a point q(t) ∈ C. Without loss of generality, suppose C has a global coordinate q : C → Rd , called the generalized coordinate; for short, we would identify the curve q with its coordinate and write q(q(t)) as q(t) ∈ Rd . A typical example of the generalized coordinate is the joint angles of a d-DOF (degrees-of-freedom) robot: we ˙ ¨ (t) as the joint velocities and denote q(t) as the joint angles at time t and q(t), q accelerations. To describe the tasks, we consider another manifold T , the task space, which is related to the configuration space C through a smooth task map ψ : C → T . The task space T can be the end-effector position/orientation [10,18], or more generally can be a space that describes whole-body robot motion, e.g., in simultaneous tracking and collision avoidance [19,20]. Thus, the goal of motion generation and control is to design the curve q so that the transformed curve ψ ◦ q exhibits desired behaviors on the task space T . Notation. For clarity, we use boldface to distinguish the coordinate-dependent representations from abstract objects; e.g. we write q(t) ∈ C and q(t) ∈ Rd . In addition, we will often omit the time- and input-dependency of objects unless ˙ q ¨ ). For derivatives, we use both necessary; e.g. we may write q ∈ C and (q, q, symbols ∇ and ∂, with a transpose relationship: for x ∈ Rm and a differential map y : Rm → Rn , we write ∇x y(x) = ∂x y(x) ∈ Rm×n . For a matrix M ∈ Rm×m , we denote mi = (M)i as its ith column and Mij = (M)ij as its (i, j) element. To compose a matrix, we use (·)·· for vertical (or matrix) concatenation and [·]·· for horizontal concatenation. For example, we write M = [mi ]m i=1 = m×m m×m m m and M = (m ) = (M ) . We use R and R to (Mij )m ji + ++ i,j=1 i i=1 i,j=1 denote the symmetric, positive semi-definite/definite matrices, respectively. 2.1
2.1 Motion Policies and the Geometry of Motion
We model motion as a second-order differential equation³ q̈ = π(q, q̇), where we call π a motion policy and (q, q̇) the state. In contrast to a single trajectory, which forms the basis of many motion planners, a
³ We assume the system has been feedback linearized. A torque-based setup can be similarly derived by setting the robot inertia matrix as the intrinsic metric on C [11].
Fig. 1. Tree-structured task maps: (a) unstructured task space; (b) product manifold; (c) kinematic tree.
motion policy expresses the entire continuous collection of its integral trajectories and therefore is robust to perturbations. Motion policies can model many adaptive behaviors, such as reactive obstacle avoidance [13,21] or responses driven by planned Q-functions [22], and their second-order formulation enables rich behavior that cannot be realized by the velocity-based approach [23].

The geometry of motion has been considered by many planning and control algorithms. Geometrical modeling of task spaces is used in topological motion planning [3], and motion optimization has leveraged the Hessian to exploit the natural geometry of costs [5,24–26]. Ratliff et al. [2], e.g., use the workspace geometry inside a Gauss-Newton optimizer and generate natural obstacle-avoiding reaching motion through traveling along geodesics of curved spaces. Geometry-aware motion policies were also developed in parallel in controls, OSC being the best example [10]. Unlike the planning approaches, OSC focuses on the internal geometry of the robot and considers only simple task-space geometry. It reshapes the workspace dynamics into a simple spring-mass-damper system with a constant inertia matrix, enforcing a form of Euclidean geometry in the task space. Variants of OSC have been proposed to consider different metrics [11,20,27], task hierarchies [19,28], and non-stationary inputs [29].

While these algorithms have led to many advances, we argue that their isolated focus on either the internal or the external geometry limits their performance. The planning approach fails to consider reactive dynamic behavior; the control approach cannot model the effects of velocity-dependent metrics, which are critical to generating sensible obstacle avoidance motions, as discussed in the introduction. While the benefits of velocity-dependent metrics were recently explored using RMPs [16], a systematic understanding is still an open question.
3 Automatic Motion Policy Generation with RMPflow
RMPflow is an efficient manifold-oriented computational graph for the automatic generation of motion policies. It is aimed at problems with a task space T = {T_li} that is related to the configuration space C through a tree-structured task map ψ, where T_li is the ith subtask space. Given user-specified motion policies {π_li} on {T_li} as RMPs, RMPflow is designed to consistently combine these subtask policies into a global policy π on C. To this end, RMPflow introduces (1) a data structure, called the RMP-tree, to describe the tree-structured task map ψ and the policies, and (2) a set of operators, called the RMP-algebra, to propagate
information across the RMP-tree. To compute π(q(t), q̇(t)) at time t, RMPflow operates in two steps: it first performs a forward pass to propagate the state from the root node (i.e. C) to the leaf nodes (i.e. {T_li}); then it performs a backward pass to propagate the RMPs from the leaf nodes to the root node. These two steps are realized by recursive use of the RMP-algebra, exploiting the shared computation paths arising from the tree structure to maximize efficiency.
3.1 Structured Task Maps
In most cases, the task-space manifold T is structured. In this paper, we consider the case where the task map ψ can be expressed through a tree-structured composition of transformations {ψ_ei}, where ψ_ei is the ith transformation. Figure 1 illustrates some common examples. Each node denotes a manifold and each edge denotes a transformation. This family trivially includes the unstructured task space T (Fig. 1a) and the product manifold T = T_l1 × · · · × T_lK (Fig. 1b), where K is the number of subtasks. A more interesting example is the kinematic tree (Fig. 1c), where, e.g., the subtask spaces on the leaf nodes can describe the tracking and obstacle avoidance tasks along a multi-DOF robot. The main motivation for explicitly handling the structure in the task map ψ is two-fold. First, it allows RMPflow to exploit computation shared across different subtask maps. Second, it allows the user to focus on designing motion policies for each subtask individually, which is easier than directly designing a global policy for the entire task space T. For example, T may describe the problem of humanoid walking, which includes staying balanced, scheduling contacts, and avoiding collisions. Directly parameterizing a policy to satisfy all these objectives can be daunting, whereas designing a policy for each subtask is more feasible.
3.2 Riemannian Motion Policies (RMPs)
Knowing the structure of the task map is not sufficient for consistently combining subtask policies: we also require some geometric information about the motion policies' behaviors [16]. Toward this end, we adopt an abstract description of motion policies, called RMPs [16], for the nodes of the RMP-tree. Specifically, let M be an m-dimensional manifold with coordinate x ∈ R^m. The canonical form of an RMP on M is a pair (a, M)^M, where a : R^m × R^m → R^m is a continuous motion policy and M : R^m × R^m → R^{m×m}_+ is a differentiable map. Borrowing terminology from mechanics, we call a(x, ẋ) the desired acceleration and M(x, ẋ) the inertia matrix at (x, ẋ), respectively.⁴ M defines the directional importance of a when it is combined with other motion policies. Later, in Sect. 4, we will show that M is closely related to the Riemannian metric, which describes how the space is stretched along the curve generated by a; when M depends on the state, the space becomes non-Euclidean. We additionally introduce a new RMP form, called the natural form. Given an RMP in its canonical form
⁴ Here we adopt a slightly different terminology from [16]. We note that M and f do not necessarily correspond to the inertia and force of a physical mechanical system.
(a, M)^M, the natural form is a pair [f, M]^M, where f = Ma is the desired force map. While the transformation between these two forms may look trivial, their distinction will be useful later when we introduce the RMP-algebra.
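As a small illustration of the two forms (our own sketch, not the authors' code; a pseudoinverse is used since M may be only positive semi-definite):

```python
import numpy as np

def to_natural(a, M):
    """Canonical (a, M) -> natural [f, M], with f = M a."""
    return M @ a, M

def to_canonical(f, M):
    """Natural [f, M] -> canonical (a, M), with a = M^+ f."""
    return np.linalg.pinv(M) @ f, M
```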
3.3 RMP-tree
The RMP-tree is the core data structure used by RMPflow. An RMP-tree is a directed tree, in which each node represents an RMP and its state, and each edge corresponds to a transformation between manifolds. The root node of the RMP-tree describes the global policy π on C, and the leaf nodes describe the local policies {π_li} on {T_li}. To illustrate, let us consider a node u and its K child nodes {v_i}_{i=1}^K. Suppose u describes an RMP [f, M]^M and v_i describes an RMP [f_i, M_i]^{N_i}, where N_i = ψ_ei(M) for some transformation ψ_ei. Then we write u = ((x, ẋ), [f, M]^M) and v_i = ((y_i, ẏ_i), [f_i, M_i]^{N_i}); the edge connecting u and v_i points from u to v_i along ψ_ei. We will continue to use this example to illustrate how the RMP-algebra propagates information across the RMP-tree.
3.4 RMP-algebra
The RMP-algebra consists of three operators (pushforward, pullback, and resolve) to propagate information.⁵ They form the basis of the forward and backward passes for automatic policy generation, described in the next section.

1. pushforward is the operator to forward propagate the state from a parent node to its child nodes. Using the previous example, given (x, ẋ) from u, it computes (y_i, ẏ_i) = (ψ_ei(x), J_i(x)ẋ) for each child node v_i, where J_i = ∂_x ψ_ei is a Jacobian matrix. The name "pushforward" comes from the linear transformation of the tangent vector ẋ to the image tangent vector ẏ_i.

2. pullback is the operator to backward propagate the natural-formed RMPs from the child nodes to the parent node. It is done by setting [f, M]^M with

f = Σ_{i=1}^K J_iᵀ (f_i − M_i J̇_i ẋ)  and  M = Σ_{i=1}^K J_iᵀ M_i J_i   (1)

The name "pullback" comes from the linear transformations of the cotangent vector (1-form) f_i − M_i J̇_i ẋ and the inertia matrix (2-form) M_i. In summary, velocities can be pushforwarded along the direction of ψ_i, and forces and inertia matrices can be pullbacked in the opposite direction. To gain more intuition about pullback, we write pullback in the canonical form of RMPs. It can be shown that the canonical form (a, M)^M of the natural form [f, M]^M above is the solution to a least-squares problem:

a = argmin_a ½ Σ_{i=1}^K ‖J_i a + J̇_i ẋ − a_i‖²_{M_i}   (2)

where a_i = M_i† f_i and ‖·‖²_{M_i} = ⟨·, M_i ·⟩. Because ÿ_i = J_i ẍ + J̇_i ẋ, pullback attempts to find an a that can realize the desired accelerations {a_i} while
⁵ Precisely, it propagates the numerical values of RMPs and states at a particular time.
trading off approximation errors with an importance weight defined by the inertia matrix M_i(y_i, ẏ_i). The use of state-dependent importance weights is a distinctive feature of RMPflow. It allows RMPflow to activate different RMPs according to both configuration and velocity (see Sect. 3.6 for examples). Finally, we note that the pullback operator defined in this paper is slightly different from the original definition given in [16], which ignores the term J̇_i ẋ in (2). While ignoring J̇_i ẋ does not necessarily destabilize the system [20], its inclusion is critical to implementing consistent policy behaviors.

3. resolve is the last operator of the RMP-algebra. It maps an RMP from its natural form to its canonical form. Given [f, M]^M, it outputs (a, M)^M with a = M† f, where † denotes the Moore-Penrose inverse. The pseudo-inverse is used because, in general, the inertia matrix is only positive semi-definite. Therefore, we also call the natural form [f, M]^M the unresolved form, as potentially it can be realized by multiple RMPs in the canonical form.
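A minimal NumPy sketch of the three operators is given below. It is our own illustration, not the authors' implementation: leaf RMPs are assumed already evaluated, and J̇ is approximated by a finite difference rather than an analytic derivative.

```python
import numpy as np

def pushforward(x, x_dot, psi, J):
    """Propagate the state from a parent to a child along an edge
    with task map psi and Jacobian function J(x)."""
    return psi(x), J(x) @ x_dot

def pullback(edges, x, x_dot, h=1e-6):
    """Combine child RMPs [f_i, M_i] into the parent's natural-form
    RMP via Eq. (1). Each edge is (psi, J, f_i, M_i); J_dot is
    approximated numerically here for simplicity."""
    d = x.shape[0]
    f, M = np.zeros(d), np.zeros((d, d))
    for psi, J, fi, Mi in edges:
        Ji = J(x)
        Ji_dot = (J(x + h * x_dot) - Ji) / h   # finite-difference J_dot
        f += Ji.T @ (fi - Mi @ Ji_dot @ x_dot)
        M += Ji.T @ Mi @ Ji
    return f, M

def resolve(f, M):
    """Natural form -> canonical form: a = M^+ f."""
    return np.linalg.pinv(M) @ f
```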
3.5 Algorithm: Motion Policy Generation
Now we show how RMPflow uses the RMP-tree and RMP-algebra to generate a global policy π on C. Suppose each subtask policy is provided as an RMP. First, we construct an RMP-tree with the same structure as ψ, where we assign the subtask RMPs as the leaf nodes and the global RMP [f_r, M_r]^C as the root node. With the RMP-tree specified, RMPflow can perform automatic policy generation. At every time instance, it first performs a forward pass: it recursively calls pushforward from the root node to the leaf nodes to update the state information in each node of the RMP-tree. Second, it performs a backward pass: it recursively calls pullback from the leaf nodes to the root node to back-propagate the values of the RMPs in the natural form, and finally calls resolve at the root node to transform the global RMP [f_r, M_r]^C into its canonical form (a_r, M_r)^C for policy execution (i.e. setting π(q, q̇) = a_r).

The policy generation process of RMPflow uses the tree structure for computational efficiency. For K subtasks, it has time complexity O(K) in the worst case, as opposed to the O(K log K) of a naive implementation which does not exploit the tree structure. Furthermore, all computations of RMPflow are carried out using matrix multiplications, except for the final resolve call, because the RMPs are expressed in the natural form in pullback instead of the canonical form suggested originally in [16]. This design makes RMPflow numerically stable, as only one matrix inversion M_r† f_r is performed at the root node, with both f_r and M_r in the span of the same Jacobian matrix due to pullback.
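Building on the operator sketch above, the two passes can be written as a single recursion over a hypothetical node structure (again our own illustration under our own naming, not the authors' code). Note that resolve is called only once, at the root, matching the single-inversion design described above.

```python
class RMPNode:
    """One node of an RMP-tree. Leaves hold a user-specified RMP
    eval_rmp(y, y_dot) -> (f, M); interior nodes combine children."""
    def __init__(self, psi=None, J=None, children=(), eval_rmp=None):
        self.psi, self.J = psi, J              # edge map from the parent
        self.children, self.eval_rmp = list(children), eval_rmp

def backward(node, x, x_dot):
    """Forward pass (pushforward) and backward pass (pullback) in one
    recursion; returns the node's natural-form RMP [f, M]."""
    if node.eval_rmp is not None:              # leaf: user-specified policy
        return node.eval_rmp(x, x_dot)
    edges = []
    for c in node.children:
        y, y_dot = pushforward(x, x_dot, c.psi, c.J)   # forward pass
        fc, Mc = backward(c, y, y_dot)                 # recurse on subtree
        edges.append((c.psi, c.J, fc, Mc))
    return pullback(edges, x, x_dot)

def rmpflow_policy(root, q, q_dot):
    f, M = backward(root, q, q_dot)
    return resolve(f, M)                       # q_ddot = pi(q, q_dot)
```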
3.6 Example RMPs
We give a quick overview of some RMPs useful in practice (a complete discussion of these RMPs is postponed to Appendix D). We recall from (2) that M dictates the directional importance of an RMP.
Collision/Joint-Limit Avoidance. Barrier-type RMPs are examples that use velocity-dependent inertia matrices, which can express importance as a function of the robot's heading (a property that traditional mechanical principles fail to capture). Here we demonstrate a collision avoidance policy in the 1D distance space x = d(q) to an obstacle. Let g(x, ẋ) = w(x)u(ẋ) > 0 for some functions w and u. We consider a motion policy such that m(x, ẋ)ẍ + ½ẋ²∂_x g(x, ẋ) = −∂_x Φ(x) − bẋ and define its inertia matrix m(x, ẋ) = g(x, ẋ) + ½ẋ∂_ẋ g(x, ẋ), where Φ is a potential and b > 0 is a damper. We choose w(x) to increase as x decreases (close to the obstacle), u(ẋ) to increase when ẋ < 0 (moving toward the obstacle), and u(ẋ) to be constant when ẋ ≥ 0. This motion policy is a GDS and g is its metric (cf. Sect. 4.1); the terms ½ẋ∂_ẋ g(x, ẋ) and ½ẋ²∂_x g(x, ẋ) are due to non-Euclidean geometry and produce natural repulsive behaviors.

Target Attractors. Designing an attractor policy is relatively straightforward. For a task space with coordinate x, we can consider an inertia matrix M(x) ≻ 0 and a motion policy such that ẍ = −∇Φ − β(x)ẋ − M⁻¹ξ_M, where Φ(x) ≈ ‖x‖ is a smooth attractor potential, β(x) ≥ 0 is a damper, and ξ_M is a curvature term. It can be shown that this differential equation is also a GDS (see Appendix D.4).

Orientations. As RMPflow directly works with manifold objects, orientation controllers become straightforward to design, independent of the choice of coordinate (cf. Sect. 4.4). For example, we can define RMPs on a robotic link's surface in any preferred coordinate (e.g. in one or two axes attached to an arbitrary point) with the above-described attractor to control the orientation. This follows a similar idea outlined in the Appendix of [16].

Q-Functions. Perhaps surprisingly, RMPs can be constructed using Q-functions as metrics (we invite readers to read [16] for details on how motion optimizers can be reduced to Q-functions and the corresponding RMPs). While these RMPs may not satisfy the conditions of a GDS that we later analyze, they represent a broader class of RMPs that leads to substantial benefits (e.g. escaping local minima) in practice. Also, Q-functions are closely related to Lyapunov functions and geometric control [30]; we will further explore this direction in future work.
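As a concrete instance of the barrier-type policy above, here is a sketch (ours) of a 1D collision-avoidance leaf RMP in natural form, with w(x) = 1/x⁴ and u(ẋ) = ε + min(0, ẋ)ẋ as later used in Sect. 5.1; the gains eps, b, and alpha are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def barrier_rmp(x, x_dot, eps=0.2, b=1.0, alpha=1e-3):
    """1D collision-avoidance GDS in the distance space x = d(q):
    metric g = w(x) u(x_dot), inertia m = g + 0.5*x_dot*dg/dx_dot,
    curvature xi = 0.5*x_dot^2*dg/dx, potential Phi = 0.5*alpha*w^2."""
    w, dw = 1.0 / x**4, -4.0 / x**5
    u = eps + min(0.0, x_dot) * x_dot
    du = 2.0 * x_dot if x_dot < 0 else 0.0   # d u / d x_dot
    g = w * u
    m = g + 0.5 * x_dot * w * du             # velocity-dependent inertia
    xi = 0.5 * x_dot**2 * u * dw             # curvature term
    dPhi = alpha * w * dw                    # d Phi / d x
    f = -dPhi - b * x_dot - xi               # desired force (natural form)
    return np.array([f]), np.array([[m]])
```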
4 Theoretical Analysis of RMPflow
We investigate the properties of RMPflow when the child-node motion policies belong to a class of differential equations which we call structured geometric dynamical systems (structured GDSs). We present the following results.

1. Closure: We show that the class of structured GDSs is closed under the pullback operator: when the child-node motion policies are structured GDSs, the parent-node dynamics also belong to the same class.
2. Stability: Using the closure property, we provide sufficient conditions for the feedback policy of RMPflow to be stable. In particular, we cover a class of dynamics with velocity-dependent metrics that are new to the literature.

3. Invariance: As its name suggests, RMPflow is closely related to differential geometry. We show that RMPflow is intrinsically coordinate-free. This means that a set of subtask RMPs designed for one robot can be transferred to another robot while maintaining the same task-space behaviors.

Setup. We assume that all manifolds and maps are sufficiently smooth. For now, we also assume that each manifold has a single chart; the coordinate-free analysis is postponed to Sect. 4.4. All the proofs are provided in Appendix B.
4.1 Geometric Dynamical Systems (GDSs)
We define a new family of dynamics useful to specify RMPs on manifolds. Let manifold M be m-dimensional with chart (M, x). Let G : R^m × R^m → R^{m×m}_+, B : R^m × R^m → R^{m×m}_+, and Φ : R^m → R. The tuple (M, G, B, Φ) is called a GDS if and only if

(G(x, ẋ) + Ξ_G(x, ẋ)) ẍ + ξ_G(x, ẋ) = −∇_x Φ(x) − B(x, ẋ)ẋ,   (3)

where Ξ_G(x, ẋ) := ½ Σ_{i=1}^m ẋ_i ∂_ẋ g_i(x, ẋ), ξ_G(x, ẋ) := G̊(x, ẋ)ẋ − ½∇_x(ẋᵀG(x, ẋ)ẋ), and G̊(x, ẋ) := [∂_x g_i(x, ẋ)ẋ]_{i=1}^m. We refer to G(x, ẋ) as the metric matrix, B(x, ẋ) as the damping matrix, and Φ(x) as the potential function, which is lower-bounded. In addition, we define M(x, ẋ) := G(x, ẋ) + Ξ_G(x, ẋ) as the inertia matrix, which can be asymmetric. We say a GDS is non-degenerate if M(x, ẋ) is nonsingular. We will assume (3) is non-degenerate so that it uniquely defines a differential equation, and discuss the general case in Appendix A. G(x, ẋ) induces a metric of ẋ, measuring its length as ½ẋᵀG(x, ẋ)ẋ. When G(x, ẋ) depends on x and ẋ, it also induces the curvature terms Ξ_G(x, ẋ) and ξ_G(x, ẋ). In the particular case when G(x, ẋ) = G(x), the GDSs reduce to the widely studied simple mechanical systems (SMSs) [15], M(x)ẍ + C(x, ẋ)ẋ + ∇_x Φ(x) = −B(x, ẋ)ẋ; in this case M(x) = G(x) and the Coriolis force C(x, ẋ)ẋ is equal to ξ_G(x, ẋ). The extension to velocity-dependent G(x, ẋ) is important and non-trivial. As discussed in Sect. 3.6, it generalizes the dynamics of classical rigid-body systems, allowing the space to morph according to the velocity direction.

As its name suggests, GDSs possess geometric properties. Particularly, when G(x, ẋ) is invertible, the left-hand side of (3) is related to the quantity a_G = ẍ + G(x, ẋ)⁻¹(Ξ_G(x, ẋ)ẍ + ξ_G(x, ẋ)), known as the geometric acceleration (cf. Sect. 4.4). In short, we can think of (3) as setting a_G along the negative natural gradient −G(x, ẋ)⁻¹∇_x Φ(x) while imposing damping −G(x, ẋ)⁻¹B(x, ẋ)ẋ.
Earlier, we mentioned that by tracking the geometry in pullback in (1), the task properties can be preserved. Here, we formalize the consistency of RMPflow as
450
C.-A. Cheng et al.
a closure of differential equations, named structured GDSs. Structured GDSs augment GDSs with information on how the metric matrix factorizes. Suppose ˙ ˙ = J(x) H(y, y)J(x), where y : G has a structure S that factorizes G(x, x) n×n n n n x → y(x) ∈ R and H : R × R → R+ , and J(x) = ∂x y. We say the tuple (M, G, B, Φ)S is a structured GDS if and only if ˙ x ¨ + ηG;S (x, x) ˙ = −∇x Φ(x) − B(x, x) ˙ x˙ ˙ + ΞG (x, x)) (G(x, x)
(4)
˙ ˙ := J(x) (ξH (y, y) ˙ + (H(y, y) ˙ + ΞH (y, y)) ˙ J(x, ˙ x). ˙ Note the where ηG;S (x, x) x) metric and factorization in combination defines ηG;S . As a special case, GDSs are structured GDSs with a trivial structure (i.e. y = x). Also, structured GDSs ˙ = G(x), reduce to GDSs (i.e. the structure offers no extra information) if G(x, x) or if n, m = 1 (cf. Appendix B.1). Given two structures, we say Sa preserves Sb if Sa has the factorization (of H) made by Sb . In Sect. 4.4, we will show that structured GDSs are related to a geometric object, pullback connection, which turns out to be the coordinate-free version of pullback. To show the closure property, we consider a parent node on M with K child ˙ i. nodes on {Ni }K i=1 .We note that Gi and Bi can be functions of both yi and y Theorem 1. Let the ith child node follow (Ni , Gi , Bi , Φi )Si and have coordinate yi . Let fi = −ηGi ;Si −∇yi Φi −Bi y˙ i and Mi = Gi + ΞGi . If [f , M]M of the parent the parent node is given by pullback with {[fi , Mi ]Ni }K i=1 and M is non-singular, K K node follows (M, G, B, Φ)S , where G = i=1 J G J , B = J B i i i Ji , Φ = i i=1 i K Φ ◦ y , S preserves S , and J = ∂ y . Particularly, if G is velocity-free i i i x i i i=1 i and the child nodes are GDSs, the parent node follows (M, G, B, Φ). Theorem 1 shows structured GDSs are closed under pullback. It means that the differential equation of a structured GDS with a tree-structured task map can be computed by recursively applying pullback from the leaves to the root. Corollary 1. If all leaf nodes follow GDSs and Mr at the root node is nonsingular, then the root node follows (C, G, B, Φ)S as recursively defined by Theorem 1. 4.3
4.3 Stability
By the closure property above, we analyze the stability of RMPflow when the leaf nodes are (structured) GDSs. For compactness, we will abuse notation and write M = M_r. Suppose M is nonsingular and let (C, G, B, Φ)_S be the resultant structured GDS at the root node. We consider a Lyapunov candidate V(q, q̇) = ½q̇ᵀG(q, q̇)q̇ + Φ(q) and derive its rate using properties of structured GDSs.

Proposition 1. For (C, G, B, Φ)_S, V̇(q, q̇) = −q̇ᵀB(q, q̇)q̇.

Proposition 1 directly implies the stability of structured GDSs by invoking LaSalle's invariance principle [31]. Here we summarize the result without proof.

Corollary 2. For (C, G, B, Φ)_S, if G(q, q̇), B(q, q̇) ≻ 0, the system converges to a forward invariant set C∞ := {(q, q̇) : ∇_q Φ(q) = 0, q̇ = 0}.
To show the stability of RMPflow, we need to further check when the assumptions in Corollary 2 hold. The condition B(q, q̇) ≻ 0 is easy to satisfy: by Theorem 1, B(q, q̇) ⪰ 0; to strictly ensure definiteness, we can copy C into an additional child node with a (small) positive-definite damping matrix. The condition G(q, q̇) ≻ 0 can be satisfied similarly. In addition, we need to verify the assumption that M is nonsingular. Here we provide a sufficient condition. When satisfied, it implies the global stability of RMPflow.

Theorem 2. Suppose every leaf node is a GDS with a metric matrix in the form R(x) + L(x)ᵀD(x, ẋ)L(x) for differentiable functions R, L, and D satisfying R(x) ⪰ 0, D(x, ẋ) = diag((d_i(x, ẏ_i))_{i=1}^n) ⪰ 0, and ẏ_i ∂_{ẏ_i} d_i(x, ẏ_i) ≥ 0, where x is the coordinate of the leaf-node manifold and ẏ = Lẋ ∈ R^n. Then Ξ_G(q, q̇) ⪰ 0. If further G(q, q̇), B(q, q̇) ≻ 0, then M ∈ R^{d×d}_{++}, and the global RMP generated by RMPflow converges to the forward invariant set C∞ in Corollary 2.

A particular case of Theorem 2 is when all the leaf nodes with velocity-dependent metrics are 1D. Suppose x ∈ R is such a leaf's coordinate and g(x, ẋ) is its metric matrix. The sufficient condition essentially boils down to g(x, ẋ) ≥ 0 and ẋ∂_ẋ g(x, ẋ) ≥ 0. This means that, given any x ∈ R, g(x, ẋ) is non-decreasing when ẋ > 0 and non-increasing when ẋ < 0, so that g(x, ·) attains its minimum at ẋ = 0. This condition is satisfied by the collision avoidance policy in Sect. 3.6.
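The 1D condition above is easy to test numerically. A small sketch (ours; the tolerance and grid are arbitrary choices) that checks g ≥ 0 and ẋ∂_ẋ g ≥ 0 on a state grid, applied to the Sect. 3.6 metric g = w(x)u(ẋ):

```python
import numpy as np

def check_1d_metric(g, xs, xds, h=1e-6):
    """Verify the 1D sufficient condition of Theorem 2 on a grid:
    g(x, xd) >= 0 and xd * (d g / d xd) >= 0."""
    for x in xs:
        for xd in xds:
            dg = (g(x, xd + h) - g(x, xd)) / h
            assert g(x, xd) >= 0 and xd * dg >= -1e-8, (x, xd)

# The collision-avoidance metric of Sect. 3.6 passes the check:
w = lambda x: 1.0 / x**4
u = lambda xd: 1e-2 + min(0.0, xd) * xd
check_1d_metric(lambda x, xd: w(x) * u(xd),
                xs=np.linspace(0.5, 3.0, 20), xds=np.linspace(-2, 2, 21))
```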
4.4 Invariance
We now discuss the coordinate-free geometric properties of (C, G, B, Φ)_S generated by RMPflow. Due to space constraints, we only summarize the results (please see Appendix B.4 and, e.g., [32]). Here we assume that G is positive-definite. We first present the coordinate-free version of GDSs (i.e. with the trivial structure) by using a geometric object called an affine connection, which defines how tangent spaces on a manifold are related. Let TC denote the tangent bundle of C, which is a natural manifold to describe the state space. We first show that a GDS on C can be written in terms of a unique, asymmetric affine connection ᴳ∇ that is compatible with a Riemannian metric G (defined by G) on TC. It is important to note that G is defined on TC, not on the original manifold C. As the metric matrix in a GDS can be velocity dependent, we need this larger manifold.

Theorem 3. Let G be a Riemannian metric on TC such that, for s = (q, v) ∈ TC, G(s) = G^v_{ij}(s) dq^i ⊗ dq^j + G^a_{ij} dv^i ⊗ dv^j, where G^v_{ij}(s) and G^a_{ij} are symmetric and positive-definite, and G^v_{ij}(·) is differentiable. Then there is a unique affine connection ᴳ∇ that is compatible with G and satisfies Γ^k_{i,j} = Γ^k_{j,i}, Γ^k_{i,j+d} = 0, and Γ^k_{i+d,j+d} = Γ^k_{j+d,i+d}, for i, j = 1, . . . , d and k = 1, . . . , 2d. In coordinates, if G^v_{ij}(q̇) is identified as G(q, q̇), then pr₃(ᴳ∇_q̇ q̇) can be written as a_G := q̈ + G(q, q̇)⁻¹(ξ_G(q, q̇) + Ξ_G(q, q̇)q̈), where pr₃ : (q, v, u, a) ↦ u is a projection.

We call pr₃(ᴳ∇_q̇ q̇) the geometric acceleration of q(t) with respect to ᴳ∇. It is a coordinate-free object, because pr₃ is defined independent of the choice of
chart of C. By Theorem 3, it is clear that a GDS can be written abstractly as pr₃(ᴳ∇_q̇ q̇) = (pr₃ ∘ G♯ ∘ F)(s), where F : s ↦ −dΦ(s) − B(s) defines the covectors due to the potential function and damping, and G♯ : T*TC → TTC denotes the inverse of G. In coordinates, it reads as q̈ + G(q, q̇)⁻¹(ξ_G(q, q̇) + Ξ_G(q, q̇)q̈) = −G(q, q̇)⁻¹(∇_q Φ(q) + B(q, q̇)q̇), which is exactly (3).

Next we present a coordinate-free representation of RMPflow.

Theorem 4. Suppose C is related to K leaf-node task spaces by maps {ψ_i : C → T_i}_{i=1}^K, and the ith task space T_i has an affine connection ᴳⁱ∇ on TT_i, as defined in Theorem 3, and a covector function F_i defined by some potential and damping as described above. Let ∇̄ = Σ_{i=1}^K Tψ_i* ᴳⁱ∇ be the pullback connection, G = Σ_{i=1}^K Tψ_i* G_i be the pullback metric, and F = Σ_{i=1}^K Tψ_i* F_i be the pullback covector, where Tψ_i* : T*TT_i → T*TC. Then ∇̄ is compatible with G, and pr₃(∇̄_q̇ q̇) = (pr₃ ∘ G♯ ∘ F)(s) can be written as q̈ + G(q, q̇)⁻¹(η_{G;S}(q, q̇) + Ξ_G(q, q̇)q̈) = −G(q, q̇)⁻¹(∇_q Φ(q) + B(q, q̇)q̇). In particular, if G is velocity-independent, then ∇̄ = ᴳ∇.

Theorem 4 says that the structured GDS (C, G, B, Φ)_S can be written abstractly, without coordinates, using the pullback of task-space covectors, metrics, and asymmetric affine connections (as defined in Theorem 3). In other words, the recursive calls of pullback in the backward pass of RMPflow are indeed performing the "pullback" of geometric objects. Theorem 4 also shows that, when G is velocity-independent, the pullback of the connection and the pullback of the metric commute. In this case, ∇̄ = ᴳ∇, which is equivalent to the Levi-Civita connection of G. The loss of commutativity in general is due to the asymmetric definition of the connection in Theorem 3, which however is necessary to derive a control law of acceleration without further referring to higher-order time derivatives.
4.5 Related Approaches
While here we focus on the special case of RMPflow with GDSs, this family already covers a wide range of reactive policies commonly used in practice. For example, when the task metric is Euclidean (i.e. constant), RMPflow recovers OSC (and its variants) [10–12,19,20]. When the task metric is only configuration dependent, RMPflow can be viewed as performing energy shaping to combine multiple SMSs in geometric control [15]. Further, RMPflow allows using velocity-dependent metrics, generating behaviors that all of those previous rigid-mechanics-based approaches fail to model. We also note that RMPflow can be easily modified to incorporate exogenous time-varying inputs (e.g. forces to realize impedance control [18] or learned perturbations as in DMPs [29]). In computation, the structure of RMPflow in natural-formed RMPs resembles the classical Recursive Newton-Euler algorithm [17,33] (see Appendix C). Alternatively, the canonical form of RMPflow in (2) resembles Gauss's Principle [11,12], but with a curvature correction Ξ_G on the inertia matrix (suggested by Theorem 1) to account for velocity-dependent metrics. Thus, we can view RMPflow as a natural generalization of these approaches to a broader class of non-Euclidean behaviors.
Fig. 2. Phase portraits (gray) and integral curves (blue; from black circles to red crosses) of 1D example. (a) Desired behavior. (b) With curvature terms. (c) Without curvature terms. (d) Without curvature terms but with nonlinear damping.
5 Experiments
We perform controlled experiments to study the curvature effects of nonlinear metrics, which are important for stability and collision avoidance. We then perform several full-body experiments (video: https://youtu.be/Fl4WvsXQDzo) to demonstrate the capabilities of RMPflow on high-DOF manipulation problems in clutter, and implement an integrated vision-and-motion system on two physical robots.
5.1 Controlled Experiments
1D Example. Let q ∈ R. We consider a barrier-type task map x = 1/q and define a GDS in (3) with G = 1, Φ(x) = ½(x − x₀)², and B = 1 + 1/x, where x₀ > 0. Using the GDS, we can define an RMP [−∇_x Φ − Bẋ − ξ_G, M]^R, where M and ξ_G are defined according to Sect. 4.1. We use this example to study the effects of J̇q̇ in pullback (1), where we define J = ∂_q x. Figure 2 compares the desired behavior (Fig. 2a) and the behaviors of correct/incorrect pullback. If pullback is performed correctly with J̇q̇, the behavior matches the designed one (Fig. 2b). By contrast, if J̇q̇ is ignored, the observed behavior becomes inconsistent and unstable (Fig. 2c). While the instability of neglecting J̇q̇ can be alleviated with a damping B = 1 + ẋ²/x nonlinear in ẋ (suggested in [20]), the behavior remains inconsistent (Fig. 2d).

2D Example. We consider a 2D goal-reaching task with collision avoidance and study the effects of velocity-dependent metrics. First, we define an RMP (a GDS as in Sect. 3.6) in x = d(q) (the 1D task space of the distance to the obstacle). We pick a metric G(x, ẋ) = w(x)u(ẋ), where w(x) = 1/x⁴ increases as the particle gets close to the obstacle, and u(ẋ) = ε + min(0, ẋ)ẋ (where ε ≥ 0) increases when it moves toward the obstacle. As this metric is non-constant, the GDS has curvature terms Ξ_G = ½ẋw(x)∂_ẋ u(ẋ) and ξ_G = ½ẋ²u(ẋ)∂_x w(x). These curvature terms, along with J̇q̇, produce an acceleration that leads to natural obstacle avoidance behavior, coaxing the system toward isocontours of the obstacle (Fig. 3b). On the other hand, when the curvature terms are ignored, the particle travels in
Fig. 3. 2D example; initial positions (small circle) and velocities (arrows). (a-d) Obstacle (circle) avoidance: (a) w/o curvature terms and w/o potential. (b) w/ curvature terms and w/o potential. (c) w/o curvature terms and w/ potential. (d) w/ curvature terms and w/ potential. (e) Combined obstacle avoidance and goal (square) reaching.
Fig. 4. Results for reaching experiments. Though some methods achieve a shorter goal distance than RMPflow in successful trials, they end up in collision in most of the trials.
straight lines with constant velocity (Fig. 3a). To define the full collision avoidance RMP, we introduce a barrier-type potential Φ(x) = ½αw(x)² to create extra repulsive forces, where α ≥ 0. A comparison of the curvature effects in this setting is shown in Fig. 3c and d (with α = 1). Next, we use RMPflow to combine the collision avoidance RMP above (with α = 0.001) and an attractor RMP. Let q_g be the goal. The attractor RMP is a GDS in the task space y = q − q_g with a metric w(y)I, a damping ηw(y)I, and a potential that is zero at y = 0, where η > 0 (see Appendix D.4). Figure 3e shows the trajectories of the combined RMP. The combined non-constant metrics generate a behavior that transitions smoothly toward the goal while heading away from the obstacle. When the curvature terms are ignored (for both RMPs), the trajectories oscillate near the obstacle. In practice, this can result in jittery behavior on manipulators. When the metric is not velocity-based (G(x) = w(x)), the behavior is less efficient in breaking free from the obstacle to go toward the goal.
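For reference, the 2D experiment can be reproduced qualitatively with a short Euler integration that pulls the 1D barrier RMP (sketched after Sect. 3.6) back through the distance map and combines it with a simplified attractor. This is our own sketch: the attractor below omits the curvature term of Appendix D.4, the obstacle is a unit circle, and all gains are our own choices.

```python
import numpy as np

def simulate(q0, qd0, q_goal, obstacle_c, steps=2000, dt=0.005):
    """Euler-integrate the combined RMP for a 2D particle."""
    q, qd = q0.astype(float), qd0.astype(float)
    for _ in range(steps):
        r = q - obstacle_c
        x = np.linalg.norm(r) - 1.0            # distance to the unit circle
        J = (r / np.linalg.norm(r))[None, :]   # 1 x 2 Jacobian of d(q)
        h = 1e-6
        r2 = (q + h * qd) - obstacle_c
        Jdot = ((r2 / np.linalg.norm(r2))[None, :] - J) / h
        f1, M1 = barrier_rmp(x, float(J @ qd))           # 1D leaf RMP
        y = q - q_goal                                   # simplified attractor
        w_a = 1.0 / (1.0 + np.linalg.norm(y))
        f2, M2 = w_a * (-2.0 * y - 1.5 * qd), w_a * np.eye(2)
        f = J.T @ (f1 - M1 @ (Jdot @ qd)) + f2           # pullback, Eq. (1)
        M = J.T @ M1 @ J + M2
        qdd = np.linalg.pinv(M) @ f                      # resolve at the root
        qd, q = qd + dt * qdd, q + dt * qd
    return q
```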
5.2 System Experiments
Reaching-Through-Clutter Experiments. We compare RMPflow with OSC (i.e., potential fields (PF) with dynamics reshaping), denoted PF-basic, and a variant, denoted PF-nonlinear, which scales the collision-avoidance weights nonlinearly as a function of obstacle proximity. We highlight the results here;
Fig. 5. Two of the six simulated worlds in the reaching experiments (left), and the two physical dual-arm platforms in the full system experiment (right).
Appendix E provides additional details, and the supplementary video shows footage of the trials. In both baselines, the collision-avoidance task spaces are specified by control points along the robot's body (rather than the distance space used in RMPflow) with an isotropic metric G = w(x)I (here w(x) = w_o ∈ R₊ for PF-basic and w(x) ∈ [0, w_o] for PF-nonlinear, where w_o is the max metric size used in RMPflow). The task-space policies of both variants follow GDSs, but without the curvature terms (see Appendix E). Figure 4 summarizes their performance. We measure time-to-goal, C-space path length (assessing economy of motion), achievable distance-to-goal (efficacy in solving the problem), collision intensity (percent time in collision given a collision), and collision failures (percent trials with collisions). The isotropic metrics, across multiple settings, fail to match the speed and precision achieved by RMPflow. Higher-weight settings tend to have fewer collisions and better economy of motion, but at the expense of efficiency. Additionally, adding nonlinear weights as in PF-nonlinear does not seem to help. The decisive factor in RMPflow's performance is rather its non-isotropic metric, which encodes directional importance around obstacles when combining policies.

System Integration for Real-Time Reactive Motion Generation. We present an integrated system for vision-driven dual-arm manipulation on two robotic platforms, the ABB YuMi robot and the Rethink Baxter robot (Fig. 5) (see the supplementary video). Our system uses the real-time optimization-based tracking algorithm DART [34] to communicate with the RMP system, receiving prior information on robot configuration and sending tracking updates of the world state. The system is tested in multiple real-world manipulation problems, like picking up trash in clutter, reactive manipulation of a cabinet with human perturbation, active lead-through (compliant guiding of the arms with world-aware collision controllers), and pick-and-place of objects into a drawer which the robot opens and closes. Please see Appendix F for the details of the experiments.
6 Conclusion
We propose an efficient policy synthesis framework, RMPflow, for generating policies with non-Euclidean behavior, including motion with velocity dependent
metrics that are new to the literature. In design, RMPflow is implemented as a computational graph that combines subtask policies into a global policy for the robot in a geometrically consistent manner. In theory, we provide conditions for stability and show that RMPflow is intrinsically coordinate-free. In the experiments, we demonstrate that RMPflow can generate smooth and natural motion for various tasks, when proper subtask RMPs are specified. Future work is to further relax the requirements on the quality of the designed subtask RMPs by introducing learning components into RMPflow for additional flexibility.
References

1. Rimon, E., Koditschek, D.: The construction of analytic diffeomorphisms for exact robot navigation on star worlds. Trans. Am. Math. Soc. 327(1), 71–116 (1991)
2. Ratliff, N., Toussaint, M., Schaal, S.: Understanding the geometry of workspace obstacles in motion optimization. In: IEEE International Conference on Robotics and Automation (ICRA) (2015)
3. Ivan, V., Zarubin, D., Toussaint, M., Komura, T., Vijayakumar, S.: Topology-based representations for motion planning and generalization in dynamic environments with interactions. Int. J. Rob. Res. (IJRR) 32(9–10), 1151–1163 (2013)
4. Watterson, M., Liu, S., Sun, K., Smith, T., Kumar, V.: Trajectory optimization on manifolds with applications to SO(3) and R3×S2. In: Robotics: Science and Systems (RSS) (2018)
5. Toussaint, M.: Robot trajectory optimization using approximate inference. In: ICML, pp. 1049–1056 (2009)
6. LaValle, S.M.: Planning Algorithms. Cambridge University Press, Cambridge (2006). http://planning.cs.uiuc.edu/
7. Karaman, S., Frazzoli, E.: Sampling-based algorithms for optimal motion planning. Int. J. Rob. Res. (IJRR) 30(7), 846–894 (2011). http://arxiv.org/abs/1105.1186
8. Gammell, J.D., Srinivasa, S.S., Barfoot, T.D.: Batch Informed Trees (BIT*): sampling-based optimal planning via the heuristically guided search of implicit random geometric graphs. In: IEEE International Conference on Robotics and Automation (ICRA) (2015)
9. Mukadam, M., Dong, J., Yan, X., Dellaert, F., Boots, B.: Continuous-time Gaussian process motion planning via probabilistic inference. arXiv preprint arXiv:1707.07383 (2017)
10. Khatib, O.: A unified approach for motion and force control of robot manipulators: the operational space formulation. IEEE J. Rob. Autom. 3(1), 43–53 (1987)
11. Peters, J., Mistry, M., Udwadia, F.E., Nakanishi, J., Schaal, S.: A unifying framework for robot control with redundant DOFs. Auton. Rob. 1, 1–12 (2008)
12. Udwadia, F.E.: A new perspective on the tracking control of nonlinear structural and mechanical systems. Proc. Roy. Soc. Lond. A Math. Phys. Eng. Sci. 459(2035), 1783–1800 (2003). http://rspa.royalsocietypublishing.org/content/459/2035/1783
13. Kappler, D., Meier, F., Issac, J., Mainprice, J., Garcia Cifuentes, C., Wüthrich, M., Berenz, V., Schaal, S., Ratliff, N., Bohg, J.: Real-time perception meets reactive motion generation. IEEE Rob. Autom. Lett. 3(3), 1864–1871 (2018). https://arxiv.org/abs/1703.03512
14. Mukadam, M., Cheng, C.A., Yan, X., Boots, B.: Approximately optimal continuous-time motion planning and control via probabilistic inference. In: IEEE International Conference on Robotics and Automation (ICRA) (2017)
15. Bullo, F., Lewis, A.D.: Geometric Control of Mechanical Systems: Modeling, Analysis, and Design for Simple Mechanical Control Systems, vol. 49. Springer, Heidelberg (2004)
16. Ratliff, N.D., Issac, J., Kappler, D., Birchfield, S., Fox, D.: Riemannian motion policies. arXiv preprint arXiv:1801.02854 (2018)
17. Walker, M.W., Orin, D.E.: Efficient dynamic computer simulation of robotic mechanisms. J. Dyn. Syst. Meas. Control 104(3), 205–211 (1982)
18. Albu-Schaffer, A., Hirzinger, G.: Cartesian impedance control techniques for torque controlled light-weight robots. In: IEEE International Conference on Robotics and Automation (ICRA), vol. 1, pp. 657–663 (2002)
19. Sentis, L., Khatib, O.: A whole-body control framework for humanoids operating in human environments. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 2641–2648 (2006)
20. Lo, S.Y., Cheng, C.A., Huang, H.P.: Virtual impedance control for safe human-robot interaction. J. Intell. Rob. Syst. 82(1), 3–19 (2016)
21. Erez, T., Lowrey, K., Tassa, Y., Kumar, V., Kolev, S., Todorov, E.: An integrated system for real-time model-predictive control of humanoid robots. In: IEEE/RAS International Conference on Humanoid Robots (2013)
22. Todorov, E.: Optimal control theory. In: Bayesian Brain: Probabilistic Approaches to Neural Coding, pp. 269–298 (2006)
23. Liegeois, A.: Automatic supervisory control of the configuration and behaviour of multibody mechanisms. IEEE Trans. Syst. Man Cybern. 7(12), 868–871 (1977)
24. Ratliff, N., Zucker, M., Bagnell, J.A.D., Srinivasa, S.: CHOMP: gradient optimization techniques for efficient motion planning. In: IEEE International Conference on Robotics and Automation (ICRA) (2009)
25. Mukadam, M., Yan, X., Boots, B.: Gaussian process motion planning. In: IEEE Conference on Robotics and Automation (ICRA) (2016)
26. Dong, J., Mukadam, M., Dellaert, F., Boots, B.: Motion planning as probabilistic inference using Gaussian processes and factor graphs. In: Robotics: Science and Systems (RSS) (2016)
27. Nakanishi, J., Cory, R., Mistry, M., Peters, J., Schaal, S.: Operational space control: a theoretical and empirical comparison. Int. J. Rob. Res. (IJRR) 6, 737–757 (2008)
28. Platt, R., Abdallah, M.E., Wampler, C.W.: Multiple-priority impedance control. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 6033–6038 (2011)
29. Ijspeert, A.J., Nakanishi, J., Hoffmann, H., Pastor, P., Schaal, S.: Dynamical movement primitives: learning attractor models for motor behaviors. Neural Comput. 25(2), 328–373 (2013)
30. Lewis, A.D.: The geometry of the maximum principle for affine connection control systems (2000)
31. Khalil, H.K.: Nonlinear Systems. Prentice-Hall, New Jersey (1996)
32. Lee, J.M., Chow, B., Chu, S.C., Glickenstein, D., Guenther, C., Isenberg, J., Ivey, T., Knopf, D., Lu, P., Luo, F., et al.: Manifolds and differential geometry. Topology 643, 658 (2009)
33. Featherstone, R.: Rigid Body Dynamics Algorithms. Springer, Heidelberg (2008)
34. Schmidt, T., Newcombe, R., Fox, D.: DART: dense articulated real-time tracking with consumer depth cameras. Auton. Rob. 39(3), 239–258 (2015)
Dynamic Model of Planar Sliding

Jiayin Xie and Nilanjan Chakraborty

Stony Brook University, Stony Brook, NY 11733, USA
{jiayin.xie,nilanjan.chakraborty}@stonybrook.edu
Abstract. In this paper, we present a principled method to model general planar sliding motion with a distributed patch contact between two objects undergoing relative sliding motion. The effect of the contact patch can be equivalently modeled as a contact wrench acting at a single point, which we call the equivalent contact point (ECP). Our dynamic model embeds the ECP within the Newton-Euler equations of the slider's motion and the friction model. The discrete-time motion model that we derive consists of a system of quadratic equations relating the contact wrench and the slip speed. This discrete-time dynamic model allows us to solve for the two components of the tangential friction impulse, the friction moment, and the slip speed. The state of the slider as well as the ECP can then be computed by solving a system of linear equations once the contact impulses are computed. In addition, we derive closed-form solutions for the state of the slider for quasi-static motion. Furthermore, for pure translation, based on the discrete-time model, we present closed-form expressions for the friction impulses that act on the slider and the state of the slider at each time step. Our results depend on the rigid-body assumption and a generalized Coulomb friction model, which assumes that the contact force and moment lie within a quadratic convex cone and that the friction force is independent of the contact area. The results do not depend on exact knowledge of the contact geometry or the pressure distribution over the contact patch. Simulation examples are shown with both convex and non-convex contact patches to demonstrate the validity of our approach.
1 Introduction
In robotic manipulation, a key problem is how to move an object from one configuration to another. There are two ways of manipulating objects, namely, prehensile manipulation and non-prehensile manipulation. In prehensile manipulation, the robot grasps the object and moves it such that all the external wrenches acting on the object during the motion are resisted through the manipulator or gripper. In non-prehensile manipulation, one manipulates an object without grasping it. Examples of non-prehensile manipulation include throwing [1–3], batting [4,5], pushing [6–9], and vibratory motion [10,11]. For non-prehensile manipulation, where the object being manipulated slides over a support surface, a key aspect of designing planning and control algorithms is the ability to predict the motion of the sliding object. In this paper, our goal is to study the problem of motion prediction of an object sliding on a support surface.
A key aspect of planar sliding motion is that there is usually a non-point patch contact between the slider (or sliding object) and the support. The state of the slider depends on the applied external forces and the friction forces, which are distributed over the contact patch. The effect of the contact patch can be modeled equivalently by the sum of the total distributed normal and tangential force acting at one point and the net moment about this point due to the tangential contact forces. This point is called the center of friction in [9]. If we assume that the motion of the slider is quasi-static (i.e., the inertial forces are negligible and thus the friction forces balance the applied forces) and the center of mass lies on the support plane, the center of friction coincides with the center of mass, and closed-form expressions can be developed for motion prediction [6,8,9]. However, for dynamic sliding, when the inertial forces cannot be neglected and the center of gravity is above the support plane, the center of friction can vary within the convex hull of the contact patch, and there is no method in the literature for computing it. The existing approach is to use a dynamic simulation algorithm with the contact patch usually approximated by three support points (chosen in an ad hoc manner). The reason for choosing three support points is that most dynamic simulation algorithms used for motion prediction implicitly assume a point contact model, and choosing three support points ensures that the force distribution at the points is unique (if four or more points are used, the force distribution is not unique for the same equivalent force and moment acting on the object). If the center of friction lies within the convex hull of the chosen support points, the predicted motion will be accurate. However, if the center of friction is outside the convex hull, then the predicted motion will not be accurate. Furthermore, since we do not know the center of friction, we will not know when the predicted motion is inaccurate. Note that the accuracy issue arises here from the ad hoc approximation of the patch contact and not from other sources of inaccuracy like the contact friction model or model parameters.

In this paper, based on our previous work on a nonlinear complementarity problem-based dynamic time-stepper with convex contact patches [12], we present a dynamic model for sliding, where no ad hoc approximation is made for the patch contact. We model the patch contact with a point contact, called the equivalent contact point (ECP). The ECP is defined as a unique point in the contact patch where the net moment due to the normal contact force is zero [12]. We show that the computation of the contact forces/impulses and the state of the object and ECP can be decoupled for planar sliding. The contact impulses can be computed by solving four quadratic equations in four variables, namely, the two components of the tangential impulse, the frictional moment, and the slip speed. The state variables, namely, the position, orientation, and linear and angular velocities of the object, as well as the ECP, can be computed by solving a system of linear equations once the contact impulses are computed. Note that the ECP as defined here is the same as the center of friction. The presentation of the decoupled set of quadratic and linear equations for computing the contact impulse, the ECP, and the state of the slider is the key contribution of this
paper. We show that closed-form solutions for the state of the slider can be derived for quasi-static motion (which are the same as those previously obtained in the literature). For pure translation also, closed-form solutions can be derived for the contact impulse, the state of the object, and the ECP. We also present numerical simulation results comparing the model that we derive to the solution of the NCP model from [12]. Our results depend on the rigid-body assumption and a generalized Coulomb friction model, which assumes that (a) the contact force and moment lie within a quadratic convex cone and (b) the friction force is independent of the contact area and depends only on the magnitude of the normal force. The results do not depend on exact knowledge of the contact geometry or the pressure distribution over the contact patch.
2 Related Work
During sliding, friction plays an important role in determining the motion of the object. Coulomb's friction law (also called the Amontons, da Vinci, or dry friction law) [13], which states that the friction force is proportional to the normal force and opposed to the direction of sliding, is a popular friction model. There have been many efforts to extend Coulomb's law to general planar sliding motion [9,14–16], where one has to consider both the force and the moment due to the contact. There are two key aspects to the generalized Coulomb friction law [14]: (a) the characterization of the set of possible magnitudes of the friction wrench (i.e., force and moment) based on the magnitude of the normal contact force (whose boundary surface is known as the limit surface); (b) the maximum power dissipation law, which states that among all the possible contact velocities that can be generated, the generated contact velocity maximizes the power loss. In [14], the authors presented a detailed study with examples of situations where the maximum power dissipation law is valid. They also showed that for a known pressure distribution and a finite number of contact support points, the set of realizable contact wrenches is a convex set and can be computed as a Minkowski sum of the contact wrenches for each support point. Building on [14], in [15], the authors theoretically characterize limit surfaces for different continuous pressure distributions and also present multiple approximate models of the limit surface based on experimental results. A key result is that the shape of the limit surface depends on the pressure distribution. The above-mentioned works on understanding the limit surface provide great insight into the problem of modeling sliding motion. However, they depend on knowledge of the pressure distribution or of the actual contact or support points (generated due to microscopic asperities). In general, the support points as well as the pressure distribution are unknown and may change during dynamic sliding. Therefore, it is difficult to use these results directly in a motion prediction or simulation framework. In this paper, we assume that the maximum power dissipation law is valid and that the limit surface for the entire patch is modeled by a convex quadratic function or ellipsoid. This model is quite widely used in complementarity-based dynamic simulation [17] for single point contact. Therefore, we do not make any
assumptions about the pressure distribution or the exact geometry of the contact region in terms of the actual contact supports between the two objects.

Recent attempts to apply data-driven techniques to the problem of sliding motion include [18,19]. Building on [14] and using experimental data, in [19], the authors compared multiple data-driven modeling approaches to approximate the limit surface (multiple support points were chosen randomly on the support surfaces and the motion was assumed to be quasi-static).

From basic physics, for patch contact, there exists a unique point on the contact patch where the net moment due to the normal contact force is zero. This point is called the center of friction. For pure translation, the system of frictional forces arising in the contact patch may be reduced to a single force acting through the center of friction [16]. In general, computing the center of friction from its definition is not possible without knowing the pressure distribution between the two sliding objects. Therefore, previous models for sliding motion [6,8] make assumptions like quasi-static motion, uniform pressure distribution in the contact patch, and isotropic friction, so that the center of friction is computable.

In our previous work [12], we developed a principled method to model line or surface contact between objects, in which the effect of the contact patch is modeled equivalently as a point contact. We called this point the equivalent contact point (ECP). Although this is the same as the center of friction, it was conceptualized to prevent penetration between contacting objects. We showed that this point can be computed along with the contact wrenches if we formulate a nonlinear complementarity problem that simultaneously solves the contact detection problem along with the numerical integration of the equations of motion. Note that this is different from the current paradigm of dynamic simulation, where contact detection and the numerical integration of the equations of motion are decoupled and done in sequence. Consequently, for non-point contact, the contact detection problem is ill-posed, as there are infinitely many points that are valid solutions. In this paper, based on our previous general model of the equations of motion for bodies in intermittent contact, we derive a dynamic model for sliding motion, where the contact patch between the slider and the ground is equivalently modeled with an ECP. As stated before, we assume a friction model that is based on the maximum power dissipation principle and assumes that all possible contact forces and moments lie within an ellipsoid.
3 Dynamics of Bodies in Contact
In this section, we present the general equations of motion of rigid bodies moving with respect to each other with non-point contact. For simplicity of exposition, we assume one body to be static. The general equations of motion of the moving body have three key parts: (a) Newton-Euler differential equations of motion giving the state update, (b) algebraic and complementarity constraints modeling the fact that two rigid bodies cannot penetrate each other, and (c) a model of the frictional forces and moments acting on the contact patch. Let ν = [vᵀ ωᵀ]ᵀ be the generalized velocity of the rigid body, where v ∈ R³ is the linear velocity and
ω ∈ R³ is the angular velocity of the rigid body. Let q be the configuration of the rigid body, which is a concatenated vector of the position and a parameterization of the orientation of the rigid body.

Newton-Euler Equations of Motion: The Newton-Euler equations of motion of the rigid body are:

M(q)ν̇ = W_n λ_n + W_t λ_t + W_o λ_o + W_r λ_r + λ_app + λ_vp   (1)
where M(q) is the inertia tensor, λ_app is the vector of external forces and moments (including gravity), and λ_vp is the vector of centripetal and Coriolis forces. The magnitude of the normal contact force is λ_n. The magnitudes of the tangential contact forces are λ_t and λ_o. The magnitude of the moment due to the tangential contact forces about the contact normal is λ_r. The vectors W_n, W_t, W_o, and W_r map the contact forces and moments from the contact point to the center of mass of the robot. The expressions for W_n, W_t, W_o, and W_r are:

W_n = [nᵀ (r×n)ᵀ]ᵀ,  W_t = [tᵀ (r×t)ᵀ]ᵀ,  W_o = [oᵀ (r×o)ᵀ]ᵀ,  W_r = [0ᵀ nᵀ]ᵀ   (2)

where (n, t, o) ∈ R³ are the axes of the contact frame and 0 ∈ R³ is a column vector with each entry equal to zero. The vector r = [a_x − q_x, a_y − q_y, a_z − q_z] is the vector from the center of mass (CM), located at (q_x, q_y, q_z), to the ECP, a. Please note that Eq. (1) is a system of 6 differential equations.

Modeling Rigid Body Contact Constraints: The contact model that we use is a complementarity-based contact model as described in [12,20]. In [12], we introduced the notion of an equivalent contact point (ECP) to model non-point contact between objects. The equivalent contact point is a unique point on the contact surface that can be used to model a surface (or line) contact as a point contact, where the integral of the total moment (about the point) due to the distributed normal force on the contact patch is zero. The ECP defined here is the same as the center of friction. However, we believe that ECP is an apt name, because it allows us to enforce the constraints of non-penetration between two rigid bodies. For the special case of planar sliding motion, since there is always contact, we do not need to write down the equations coming from the collision detection constraints as done in [12,20] for computing the ECP. These constraints are trivially satisfied. However, we do need to use the ECP in the equations of motion, as we do in the later sections.

Friction Model: We use a friction model based on the maximum power dissipation principle that generalizes Coulomb's friction law. The maximum power dissipation principle states that among all the possible contact forces and moments that lie within the friction ellipsoid, the forces that maximize the power dissipation in the contact patch are selected. Mathematically,
max −(v_t λ_t + v_o λ_o + v_r λ_r)
s.t. (λ_t/e_t)² + (λ_o/e_o)² + (λ_r/e_r)² − μ²λ_n² ≤ 0   (3)
where λ_t, λ_o, and λ_r are the optimization variables. The parameters e_t, e_o, and e_r are positive constants defining the friction ellipsoid, and μ is the coefficient of friction at the contact [15,21]. We use the contact wrench at the ECP to model the effect of the entire distributed contact patch. Therefore, v_t and v_o are the tangential components of the velocity at the ECP, and v_r is the relative angular velocity about the normal at the ECP. Note that the ellipsoid constraint in our friction model is the constraint on the friction force and moment that act on the slider during the motion. This friction model has been previously proposed in the literature [14] and has some experimental justification [15]. There is no assumption made on the nature of the pressure distribution between the two surfaces. Using the Fritz-John optimality conditions of Eq. (3), we can write [22]:

0 = e_t² μλ_n v_t + λ_t σ   (4)
0 = e_o² μλ_n v_o + λ_o σ   (5)
0 = e_r² μλ_n v_r + λ_r σ   (6)
0 ≤ μ²λ_n² − λ_t²/e_t² − λ_o²/e_o² − λ_r²/e_r²  ⊥  σ ≥ 0   (7)
where σ is a Lagrange multiplier corresponding to the inequality constraint in (3).
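To make the friction model concrete, the optimization in Eq. (3) can also be solved numerically and checked against conditions (4) to (7). The following is a minimal illustrative sketch (ours, not the authors' code), assuming SciPy is available; all function and variable names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: solve the maximum power dissipation problem (3) for a given
# relative velocity (vt, vo, vr) at the ECP. Maximizing -(vt*lt + vo*lo
# + vr*lr) is written as minimizing the dissipated power directly.
def max_dissipation_wrench(vt, vo, vr, mu, ln, et, eo, er):
    cost = lambda lam: vt * lam[0] + vo * lam[1] + vr * lam[2]
    ellipsoid = {"type": "ineq",  # mu^2*ln^2 - sum_i (lam_i/e_i)^2 >= 0
                 "fun": lambda lam: mu**2 * ln**2 - (lam[0] / et)**2
                                    - (lam[1] / eo)**2 - (lam[2] / er)**2}
    res = minimize(cost, x0=np.zeros(3), constraints=[ellipsoid])
    return res.x  # (lambda_t, lambda_o, lambda_r) on the friction ellipsoid
```

For a nonzero sliding velocity, the returned wrench lies on the boundary of the ellipsoid and opposes the motion, consistent with σ > 0 in Eq. (7).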
4 Equations of Motion for Planar Sliding
The dynamic model presented in the previous section is a general model for an object moving on a planar surface with intermittent contact (that can be non-point) between the object and the surface. In this section, we assume that the motion between the two objects is planar sliding and derive a simpler set of equations that are valid for planar sliding. Figure 1 shows a schematic sketch of a slider (assumed to be a rigid body) that has planar surface contact with the support surface. We assume that the motion of the slider is planar, i.e., the slider can rotate and translate along the planar support surface but cannot topple or lose contact with the support surface. Let F_w with origin O_w be the world frame fixed on the support surface. Let F_s with origin O_s be the slider frame attached to the slider's center of mass (CM). Note that the coordinates of the CM in the world frame F_w are (q_x, q_y, q_z). Since the slider undergoes planar motion, the configuration of the slider is q = [q_x, q_y, θ_z], where θ_z is the orientation of F_s relative to F_w. Let F_c with origin O_c be the contact frame. The origin O_c is the equivalent contact point (ECP) of the contact patch, and we denote the position of O_c in the world frame F_w by a. The axes of F_c are chosen to be parallel to F_w. The generalized velocity of the slider is ν = [v_x, v_y, w_z], where v_x and v_y are the x and y components of the velocity of the center of mass, O_s, and w_z is the angular velocity about the z-axis (normal to the plane of the motion). The external forces acting on the slider include the applied force, gravity, the normal (support) force, and the friction force. The generalized applied force is λ_app = [λ_x, λ_y, λ_z, λ_xτ, λ_yτ, λ_zτ], which includes tangential and normal forces and moments. The gravitational force is assumed to act at the CM.
Fig. 1. Slider with square contact patch on the horizontal support plane.
4.1 Newton-Euler Equations for Planar Sliding
For planar motion, the inertia tensor is M(q) = diag(m, m, I_z), where m is the mass of the slider and I_z is the moment of inertia about the z-axis. As mentioned in Sect. 4, the configuration of the slider is q = [q_x, q_y, θ_z], and the generalized velocity is ν = [v_x, v_y, w_z]. The generalized applied force is λ_app = [λ_x, λ_y, λ_z, λ_xτ, λ_yτ, λ_zτ]. Without loss of generality, we let the unit vectors of the contact frame be n = [0, 0, 1], t = [1, 0, 0], o = [0, 1, 0]. Consequently, W_n, W_t, W_o and W_r (Eq. (2)) can be written as:

$$W_n = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \quad W_t = \begin{bmatrix} 1 \\ 0 \\ -(a_y - q_y) \end{bmatrix}, \quad W_o = \begin{bmatrix} 0 \\ 1 \\ a_x - q_x \end{bmatrix}, \quad W_r = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \quad (8)$$

Using (8) and the discussion above, the first, second, and sixth equations in (1) can be written as:

$$m\dot{v}_x = \lambda_t + \lambda_x \quad (9)$$
$$m\dot{v}_y = \lambda_o + \lambda_y \quad (10)$$
$$I_z \dot{w}_z = \lambda_r + \lambda_{z\tau} - \lambda_t (a_y - q_y) + \lambda_o (a_x - q_x) \quad (11)$$
Since we are assuming that contact is always maintained, the third equation in (1) becomes λ_n + λ_z − mg = 0. Furthermore, based on Eqs. (13) and (14) below, we can derive [λ_xτ(a_x − q_x) + λ_yτ(a_y − q_y)]/(q_z − a_z) = −λ_t(a_y − q_y) + λ_o(a_x − q_x). This reduces Eq. (11) to:

$$I_z \dot{w}_z = \lambda_r + \lambda_{z\tau} + \frac{\lambda_{x\tau}(a_x - q_x) + \lambda_{y\tau}(a_y - q_y)}{q_z - a_z} \quad (12)$$
4.2 Expressions for the ECP
The fourth and fifth equations in (1), which govern the angular accelerations ẇ_x and ẇ_y, are

$$I_x \dot{w}_x = \lambda_{x\tau} + \lambda_n (a_y - q_y) - \lambda_o (a_z - q_z) + I_y w_y w_z - I_z w_y w_z \quad (13)$$
$$I_y \dot{w}_y = \lambda_{y\tau} - \lambda_n (a_x - q_x) + \lambda_t (a_z - q_z) - I_x w_x w_z + I_z w_x w_z \quad (14)$$
where I_x and I_y are the moments of inertia about the x and y axes respectively. Since the motion is planar, the slider can only rotate about the z-axis, i.e., ẇ_x = ẇ_y = 0. Equations (13) and (14) provide the expressions for the ECP:

$$a_x = (\lambda_{y\tau} - \lambda_t q_z)/\lambda_n + q_x \quad (15)$$
$$a_y = (-\lambda_{x\tau} - \lambda_o q_z)/\lambda_n + q_y \quad (16)$$
From these equations one can deduce that when q_z = 0, the ECP (a_x, a_y) is directly beneath the CM, i.e., at (q_x, q_y). When q_z > 0, i.e., the CM is above the support plane, the ECP may shift from the projection of the CM on the plane. Furthermore, variation of the tangential friction forces (λ_t and λ_o) or of the applied moments (λ_xτ and λ_yτ) shifts the ECP during the motion. Note that since we have assumed no toppling, we always get a solution for the ECP. However, we can also use the computed ECP to check whether the assumption of no toppling is valid: if the ECP lies outside the convex hull of the contact region between the two objects, the sliding assumption is no longer valid. This can be used even in the discrete-time setting to verify that there is no toppling.
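As a hedged illustration of this check (our sketch, not the authors' code), the ECP from Eqs. (15) and (16) can be computed and tested against the patch geometry. Here the patch is assumed to be an axis-aligned rectangle centered under the CM; all names are ours.

```python
def ecp_and_no_topple_check(lam_n, lam_t, lam_o, lam_xt, lam_yt,
                            qx, qy, qz, hx, hy):
    """ECP from Eqs. (15)-(16) plus the no-toppling validity check."""
    ax = (lam_yt - lam_t * qz) / lam_n + qx    # Eq. (15)
    ay = (-lam_xt - lam_o * qz) / lam_n + qy   # Eq. (16)
    # The sliding (no-toppling) assumption is valid only while the ECP
    # stays inside the convex hull of the patch; for a rectangle of
    # half-extents (hx, hy) this reduces to a bounds check.
    valid = abs(ax - qx) <= hx and abs(ay - qy) <= hy
    return ax, ay, valid
```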
4.3 Friction Model
For planar sliding, the friction force and moment have to be at the boundary of the friction ellipsoid. Thus, σ > 0 in the complementarity Eq. (7). Also, the tangential velocity at the ECP is [v_t, v_o]^T = v + w × r, and v_r = w_z. Thus:

$$v_t = v_x - w_z(a_y - q_y), \quad v_o = v_y + w_z(a_x - q_x), \quad v_r = w_z \quad (17)$$

Using Eq. (17) together with the fact that σ > 0, we can rewrite Eqs. (4) to (7) as

$$0 = \mu \lambda_n e_t^2 [v_x - w_z(a_y - q_y)] + \lambda_t \sigma \quad (18)$$
$$0 = \mu \lambda_n e_o^2 [v_y + w_z(a_x - q_x)] + \lambda_o \sigma \quad (19)$$
$$0 = \mu \lambda_n e_r^2 w_z + \lambda_r \sigma \quad (20)$$
$$0 = \mu^2 \lambda_n^2 - \lambda_t^2/e_t^2 - \lambda_o^2/e_o^2 - \lambda_r^2/e_r^2 \quad (21)$$

4.4 Continuous Time Dynamic Model for Planar Sliding
The complete continuous time equations of motion for planar sliding are given by (a) the Newton-Euler equations (Eqs. (9), (10) and (12)), (b) the expressions for the ECP (Eqs. (15) and (16)), and (c) the friction model (Eqs. (18) to (21)). Note that the kinematic map q̇ = ν is also required.

4.5 Discrete-Time Dynamic Model
We use a backward Euler time-stepping scheme to discretize the continuous equations of planar sliding. Let t_u denote the current time and h be the duration of the
time step. The superscript u represents the beginning of the current time step and the superscript u + 1 represents its end. Let ν̇ ≈ (ν^{u+1} − ν^u)/h and define the impulse p_(.) = hλ_(.). Equations (9), (10) and (12) become

$$m(v_x^{u+1} - v_x^u) = p_t^{u+1} + p_x^u \quad (22)$$
$$m(v_y^{u+1} - v_y^u) = p_o^{u+1} + p_y^u \quad (23)$$
$$I_z(w_z^{u+1} - w_z^u) = p_r^{u+1} + p_{z\tau}^u \quad (24)$$

Using Eqs. (22) to (24) and the backward Euler discretization, we can rewrite Eqs. (18) to (21) as

$$-p_t^{u+1}\sigma^{u+1} = \mu p_n e_t^2 \left[ v_x^u + \frac{p_t^{u+1} + p_x^u}{m} + \frac{(p_{x\tau}^u + p_o^{u+1} q_z)\left[w_z^u + (p_r^{u+1} + p_{z\tau}^u)/I_z\right]}{p_n} \right] \quad (25)$$

$$-p_o^{u+1}\sigma^{u+1} = \mu p_n e_o^2 \left[ v_y^u + \frac{p_o^{u+1} + p_y^u}{m} + \frac{(p_{y\tau}^u - p_t^{u+1} q_z)\left[w_z^u + (p_r^{u+1} + p_{z\tau}^u)/I_z\right]}{p_n} \right] \quad (26)$$

$$0 = \mu p_n e_r^2 \left[w_z^u + (p_r^{u+1} + p_{z\tau}^u)/I_z\right] + p_r^{u+1}\sigma^{u+1} \quad (27)$$

$$0 = \mu^2 p_n^2 - (p_t^{u+1}/e_t)^2 - (p_o^{u+1}/e_o)^2 - (p_r^{u+1}/e_r)^2 \quad (28)$$
Equations (25) to (28) form a system of four quadratic equations in four unknowns, p_t, p_o, p_r, and σ, at the end of the time-step (with superscript u + 1). After solving this system of equations we can obtain the velocities at the end of the time step, v_x, v_y, w_z, from the linear equations (22) to (24). The ECP can then be found from Eqs. (15) and (16). Thus, the solution of the dynamic time-stepping problem essentially reduces to the solution of 4 quadratic equations in 4 variables.
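For concreteness, one time-step of this model can be implemented with a generic root finder. The sketch below is ours, not the authors' MATLAB implementation: it uses SciPy's fsolve in place of MATLAB's fsolve, the normal impulse p_n = h·m·g is known in advance because contact is maintained, and all names are hypothetical.

```python
import numpy as np
from scipy.optimize import fsolve

def planar_sliding_step(v, p_app, params, h=0.01):
    """One backward-Euler step: solve Eqs. (25)-(28), then Eqs. (22)-(24)."""
    vx, vy, wz = v                        # velocities at start of the step
    px, py, pzt, pxt, pyt = p_app         # applied impulses h*lambda_app
    m, Iz, mu, et, eo, er, qz, g = params
    pn = h * m * g                        # normal impulse (contact maintained)

    def residual(unknowns):
        pt, po, pr, s = unknowns          # impulses and multiplier at u+1
        wz1 = wz + (pr + pzt) / Iz
        vt = vx + (pt + px) / m + (pxt + po * qz) * wz1 / pn   # Eq. (25)
        vo = vy + (po + py) / m + (pyt - pt * qz) * wz1 / pn   # Eq. (26)
        return [mu * pn * et**2 * vt + pt * s,
                mu * pn * eo**2 * vo + po * s,
                mu * pn * er**2 * wz1 + pr * s,                # Eq. (27)
                mu**2 * pn**2 - (pt/et)**2 - (po/eo)**2 - (pr/er)**2]

    pt, po, pr, s = fsolve(residual, x0=[0.0, 0.0, 0.0, 1.0])
    # Linear velocity update, Eqs. (22)-(24)
    return vx + (pt + px) / m, vy + (po + py) / m, wz + (pr + pzt) / Iz
```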
5 Closed Form Equations for Planar Sliding Motion
In this section, we study some special cases of planar sliding motion where we can obtain a closed form solution for the motion as well as the contact wrenches. The two special cases are quasi-static sliding, where we know the velocity of the contact point between the slider and the pusher, and pure translation.

Quasi-Static Sliding Motion: In quasi-static sliding, the inertial forces can be neglected and the frictional forces dominate the motion of the slider. We assume that the quasi-static sliding is due to an applied force with components λ_x and λ_y acting on the boundary of the slider at position (x_c, y_c). The associated applied torque about the z-axis is λ_zτ. Based on Eqs. (9) to (11), the quasi-static motion assumption implies that the friction force and the applied force must balance each other (i.e., λ_t = −λ_x, λ_o = −λ_y, λ_r = −λ_zτ). We take v_cx, v_cy, the velocity components at (x_c, y_c), as the input. This allows the point of application of the force to vary during the motion. Thus, the motion of the slider depends on (v_cx, v_cy). Now,

$$v_x = v_{cx} + w_z(y_c - q_y), \quad v_y = v_{cy} - w_z(x_c - q_x) \quad (29)$$
Furthermore, the friction moment about the normal axis balances the applied moment, and it can be expressed in terms of the components of the friction force:

$$\lambda_r = (x_c - q_x)\lambda_o - (y_c - q_y)\lambda_t \quad (30)$$

We assume that the ECP is directly beneath the CM, i.e., a_x = q_x, a_y = q_y. In addition, the model assumes isotropic friction, which implies e_t = e_o. We define the parameter c = e_r/e_t. From Eqs. (18) to (20), we get:

$$\frac{v_x}{w_z} = c^2 \frac{\lambda_t}{\lambda_r}, \quad \frac{v_y}{w_z} = c^2 \frac{\lambda_o}{\lambda_r} \quad (31)$$
From the above discussion, using Eqs. (29) to (31), we can get the closed form expressions for the velocity of the slider (v_x, v_y, w_z):

$$v_x = \frac{[c^2 + (x_c - q_x)^2]v_{cx} + (x_c - q_x)(y_c - q_y)v_{cy}}{c^2 + (x_c - q_x)^2 + (y_c - q_y)^2} \quad (32)$$

$$v_y = \frac{[c^2 + (y_c - q_y)^2]v_{cy} + (x_c - q_x)(y_c - q_y)v_{cx}}{c^2 + (x_c - q_x)^2 + (y_c - q_y)^2} \quad (33)$$

$$w_z = \frac{(x_c - q_x)v_y - (y_c - q_y)v_x}{c^2} \quad (34)$$
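A direct transcription of Eqs. (32) to (34) is shown below as an illustrative sketch (ours; the function and argument names are hypothetical).

```python
def quasi_static_velocity(xc, yc, vcx, vcy, qx, qy, c):
    """Closed-form slider velocity for quasi-static pushing, Eqs. (32)-(34)."""
    dx, dy = xc - qx, yc - qy
    den = c**2 + dx**2 + dy**2
    vx = ((c**2 + dx**2) * vcx + dx * dy * vcy) / den   # Eq. (32)
    vy = ((c**2 + dy**2) * vcy + dx * dy * vcx) / den   # Eq. (33)
    wz = (dx * vy - dy * vx) / c**2                     # Eq. (34)
    return vx, vy, wz
```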
In [8], the authors also present closed form solutions for computing the velocity of the slider under quasi-static motion. Note that if we place the origin at the CM, i.e., q_x = q_y = 0, Eqs. (32) and (33) are equivalent to the equations of quasi-static motion in [8].

Pure Translation: During pure translation, all the points of the slider move in the same direction. Thus, the slider's angular velocity is zero, i.e., w_z = 0. From Eqs. (25) to (28), using w_z = 0 for each time step, Eq. (27) implies that the frictional angular impulse p_r = 0. Therefore, Eqs. (25), (26) and (28) simplify to:

$$p_t^{u+1} = \frac{-e_t^2 \mu p_n (m v_x^u + p_x^u)}{m\sigma^{u+1} + e_t^2 \mu p_n}, \quad p_o^{u+1} = \frac{-e_o^2 \mu p_n (m v_y^u + p_y^u)}{m\sigma^{u+1} + e_o^2 \mu p_n} \quad (35)$$

$$\mu^2 p_n^2 = (p_t^{u+1}/e_t)^2 + (p_o^{u+1}/e_o)^2 \quad (36)$$
Then, substituting Eq. (35) into (36), we get:

$$\frac{(m v_x^u + p_x^u)^2}{(m\sigma^{u+1} + e_t^2 \mu p_n)^2} + \frac{(m v_y^u + p_y^u)^2}{(m\sigma^{u+1} + e_o^2 \mu p_n)^2} = 1 \quad (37)$$

Given the isotropic friction assumption, i.e., e_t = e_o, we get:

$$m\sigma^{u+1} + e_t^2 \mu p_n = \sqrt{(m v_x^u + p_x^u)^2 + (m v_y^u + p_y^u)^2} \quad (38)$$
Thus, the analytical solutions for the friction impulses are:

$$p_t^{u+1} = \frac{-e_t^2 \mu p_n (m v_x^u + p_x^u)}{\sqrt{(m v_x^u + p_x^u)^2 + (m v_y^u + p_y^u)^2}} \quad (39)$$

$$p_o^{u+1} = \frac{-e_o^2 \mu p_n (m v_y^u + p_y^u)}{\sqrt{(m v_x^u + p_x^u)^2 + (m v_y^u + p_y^u)^2}} \quad (40)$$
Therefore, we can solve for the velocities from:

$$v_x^{u+1} = (p_t^{u+1} + p_x^u)/m + v_x^u, \quad v_y^{u+1} = (p_o^{u+1} + p_y^u)/m + v_y^u \quad (41)$$

where p_t^{u+1} and p_o^{u+1} are given by Eqs. (39) and (40).
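The pure-translation update is equally direct; the following hedged sketch (ours, with hypothetical names) applies Eqs. (39) to (41), guarding the degenerate case where the slider is already at rest.

```python
import math

def pure_translation_step(vx, vy, px, py, m, mu, pn, et, eo):
    """Closed-form impulse and velocity update for pure translation."""
    sx, sy = m * vx + px, m * vy + py
    norm = math.hypot(sx, sy)
    if norm == 0.0:                    # at rest with no applied impulse
        return vx, vy
    pt = -et**2 * mu * pn * sx / norm             # Eq. (39)
    po = -eo**2 * mu * pn * sy / norm             # Eq. (40)
    return (pt + px) / m + vx, (po + py) / m + vy  # Eq. (41)
```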
6 Numerical Results
In the preceding section, we derived closed form expressions for computing the velocities of the slider for quasi-static motion and for pure translation under the isotropic friction assumption. For general planar sliding motion there is no analytical solution. Therefore, here we present numerical solutions based on the discrete-time model of quadratic equations that we developed, and compare the results with the full nonlinear complementarity problem (NCP) formulation from [12], where we do not assume a priori that the motion is planar. The first example is of a slider with a square contact patch sliding on a horizontal surface. We compare solutions from the scheme in this paper to those from [12] to validate our technique against our previous NCP-based approach, which gives the correct solution. In our second example, we simulate sliding motion with a ring-shaped contact patch. This example demonstrates that the quadratic model presented in this paper can simulate sliding motion with a wide range of contact shapes (either convex or non-convex). Furthermore, it is not possible to use a few (say four) contact points chosen in an ad-hoc fashion to model such a contact patch. In the third example, we provide a scenario of the slider being pushed by an external force. The external force acts on one side of the slider and its position is fixed. The magnitude of the external force is periodic and it is always perpendicular to the slider. We use this example to show that a slider with an external force can be modeled with our scheme. We use 'fsolve' in MATLAB to solve the quadratic model. We use the PATH complementarity solver [23] to solve the NCP-based model as well as the quadratic model derived in this paper. We compare the average time taken per time-step for the different models on all the examples. The average times per time-step are shown in Table 1. Since the algorithm in [12] assumes a convex contact patch, the second example could not be solved with this approach. Hence there is no data for this example in Table 1. As can be seen from the examples, the quadratic model solved in PATH gives consistently better performance. The duration of each simulation is shown in parentheses beside the example number in Table 1.
Table 1. Average time taken for each step based on different methods

| Methods | Example 1 (0.45 s) | Example 2 (0.6 s) | Example 3 (3 s) |
|---|---|---|---|
| NCP-based model (PATH) | 0.0064 s | – | 0.0036 s |
| Quadratic model (PATH) | 0.0024 s | 0.0026 s | 0.0014 s |
| Quadratic model (‘fsolve’) | 0.0099 s | 0.0083 s | 0.0058 s |
All the examples are implemented in MATLAB and run times are obtained on a MacBook Pro with a 2.6 GHz processor and 16 GB RAM.

Slider on a Plane Without External Forcing: We consider a slider with a square contact patch sliding on a planar surface. The time step chosen is h = 0.01 s and the simulation time is 0.45 s. The coefficient of friction between the slider and the support plane is μ = 0.31, and the friction ellipsoid's parameters are et = eo = 1, er = 0.01 m. The mass of the slider is m = 0.5 kg, and the acceleration due to gravity is g = 9.8 m/s². The slider slides on the surface without external force. The initial position of the CM is qx = qy = 0 m, qz = 0.08 m, and the initial orientation is θ = 0°. The initial state of the slider is vx = 0.7 m/s, vy = 0.9 m/s and wz = 10 rad/s.
Fig. 2. Slider with square contact patch slides on the surface without external forces. During motion, the ECP varies within the contact patch.
The forces that act on the slider are friction forces and moments. In Fig. 2a, we compute the velocities of the slider (vx and vy) numerically based on our discrete-time dynamic model (Eqs. (25) to (28)) and compare them with the results (vx^p and vy^p) from the NCP-based model. There is no difference between the two results (within a numerical tolerance of 1e−6), which validates our method. In addition, the average time that the NCP-based model spends on each time-step is 0.0064 s, while our quadratic model's average time is 0.0024 s. This is because our
Fig. 3. (a) Non-convex contact patch (shaded yellow) between slider and the support. The black dot is the projection of slider’s CM on the plane and the red dot is the ECP, which may not be in the contact patch, but is within the convex hull of the patch. (b) Snapshot of slider’s motion based on the quadratic model. Red circle and black star represent ECP and CM. During motion, the ECP separates from CM and varies within the contact patch.
model essentially is a system of 4 quadratic equations in 4 variables (there are also a few linear equations afterwards, but their computational cost is negligible), and thus the size and complexity of the system are much smaller than for the NCP-based model (in the sliding case, that system is composed of 24 nonlinear equations and unknowns). In Fig. 2b, we plot snapshots of the slider's contact patch with the CM and ECP at each time step. The contact patch is a square with length L = 0.05 m and width W = 0.05 m. During the motion, we observe that the position of the ECP always separates from the position of the CM, and its position relative to the slider frame Fs changes within the contact patch. This observation confirms that when the center of mass is above the support plane, acceleration of the object causes the ECP to shift [24].

Sliding Motion with Ring-Shaped Contact Patch: In this example, the slider has a ring-shaped contact patch with the support. As Fig. 3a illustrates, the contact patch is the ring in yellow, which is non-convex and cannot be represented as the convex hull of three chosen support points. If the ECP or center of friction is outside the convex hull, then the predicted motion would be inaccurate. We use the quadratic model presented in this paper to simulate the motion of the slider. The time step and friction parameters are the same as in the first example. The mass of the slider is m = 1 kg. The initial configuration of the slider is also the same as in the first example. The initial state is chosen to be vx = 1.3 m/s, vy = 0.8 m/s, wz = 11 rad/s. The total simulation time is 0.65 s. In Fig. 3b, we plot the snapshot of the ring-shaped contact patch with the ECP and CM at each time step. The radius of the outer circle is 0.1 m, while the radius of the inner circle is 0.05 m. During the sliding motion, the ECP always separates from the CM. When the slider stops, the ECP is directly beneath the CM.
Fig. 4. Slider with square contact patch being pushed by applied force.
Slider Being Pushed on a Plane: In this example, we let the slider be pushed by an applied force on the horizontal plane. The dimensions and mass of the slider as well as the friction parameters are the same as in the previous example. The initial state of the slider is vx = 0.2 m/s, vy = 0.3 m/s and wz = 0 rad/s. The time-step is 0.01 s and the total simulation time is 3 s. The force is applied on the left edge of the square, 2.5 mm below the middle of the edge, and is always perpendicular to that side during the motion. As Fig. 4a illustrates, the magnitude of the applied force is periodic, i.e., F_push = 2.2 + 2 cos(2πt/T) N, where the period is T = 0.1 s. In Fig. 4b, we show the angular velocity wz based on our quadratic model and wz^p based on the NCP-based model. In Fig. 4c, we show the linear components of velocity vx, vy based on our quadratic model, and vx^p, vy^p based on the NCP-based model. For both angular and tangential velocities, the difference between the quadratic and NCP solutions is within the numerical tolerance of 1e−6. Figure 4d plots snapshots of the slider at each time step between t = 0 s and t = 0.6 s.
7 Conclusions
We presented a quadratic discrete-time dynamic model for solving the problem of general planar sliding with a distributed convex contact patch. Previous methods assume quasi-static motion or choose multiple contact points (usually three) in an ad-hoc manner to approximate the entire contact patch. In our dynamic model, the effect of the contact patch is equivalently modeled as the contact wrench at the equivalent contact point. We show that we can solve for the tangential friction impulses and the friction moment by solving a system of quadratic equations. The state of the slider as well as the ECP can be computed by solving a system of linear equations once the contact impulses are computed. In addition, we provide closed form expressions for quasi-static motion and pure translation. We also present numerical results based on our quadratic model and the NCP model for general planar motion of the slider with and without applied force. Note that, in our approach, we do not need knowledge of the exact contact geometry or the pressure distribution. We use a friction model based on the maximum power dissipation law and the assumption that the limit surface is an ellipsoid. Thus our approach is valid for any application where the ellipsoid is a good approximation of the limit surface for the net contact force/moment in the contact patch. In principle, the ellipsoid assumption can be relaxed to any convex function (e.g., a superquadric model could be used for the limit surface). This would change the derived equations of motion, and exploring more general limit surface models is part of future work. In the future, we would also like to use the quadratic model developed here for estimating the contact parameters.
References

1. Mason, M.T., Lynch, K.M.: Dynamic manipulation. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), vol. 1, pp. 152–159 (1993)
2. Huang, W.H., Krotkov, E.P., Mason, M.T.: Impulsive manipulation. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), vol. 1, pp. 120–125 (1995)
3. Zhu, C., Aiyama, Y., Chawanya, T., Arai, T.: Releasing manipulation. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), vol. 2, pp. 911–916 (1996)
4. Andersson, R.L.: A Robot Ping-Pong Player: Experiment in Real-Time. MIT Press, Cambridge (1988)
5. Liu, C., Hayakawa, Y., Nakashima, A.: Racket control and its experiments for robot playing table tennis. In: IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 241–246 (2012)
6. Lynch, K.M., Mason, M.T.: Stable pushing: mechanics, controllability, and planning. Int. J. Robot. Res. 15(6), 533–556 (1996)
7. Lynch, K.M.: Locally controllable manipulation by stable pushing. IEEE Trans. Robot. Autom. 15(2), 318–327 (1999)
8. Lynch, K.M., Maekawa, H., Tanie, K.: Manipulation and active sensing by pushing using tactile feedback. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 416–421 (1992)
9. Mason, M.T.: Mechanics and planning of manipulator pushing operations. Int. J. Robot. Res. 5(3), 53–71 (1986)
10. Vose, T.H., Umbanhowar, P., Lynch, K.M.: Vibration-induced frictional force fields on a rigid plate. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 660–667 (2007)
11. Vose, T.H., Umbanhowar, P., Lynch, K.M.: Friction-induced lines of attraction and repulsion for parts sliding on an oscillated plate. IEEE Trans. Autom. Sci. Eng. 6(4), 685–699 (2009)
12. Xie, J., Chakraborty, N.: Rigid body dynamic simulation with line and surface contact. In: IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR), pp. 9–15 (2016)
13. Coulomb, C.A.: Théorie des machines simples en ayant égard au frottement de leurs parties et à la roideur des cordages. Bachelier (1821)
14. Goyal, S., Ruina, A., Papadopoulos, J.: Planar sliding with dry friction part 1. Limit surface and moment function. Wear 143(2), 307–330 (1991)
15. Howe, R.D., Cutkosky, M.R.: Practical force-motion models for sliding manipulation. Int. J. Robot. Res. 15(6), 557–572 (1996)
16. MacMillan, W.D.: Dynamics of Rigid Body. Dover Publications Inc., New York (1936)
17. Moreau, J.J.: Unilateral contact and dry friction in finite freedom dynamics. In: Nonsmooth Mechanics and Applications, pp. 1–82. Springer (1988)
18. Yu, K.-T., Bauza, M., Fazeli, N., Rodriguez, A.: More than a million ways to be pushed: a high-fidelity experimental dataset of planar pushing. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 30–37. IEEE (2016)
19. Zhou, J., Mason, M.T., Paolini, R., Bagnell, D.: A convex polynomial model for planar sliding mechanics. Int. J. Robot. Res. 37(2–3), 249–265 (2018)
20. Chakraborty, N., Berard, S., Akella, S., Trinkle, J.C.: A geometrically implicit time-stepping method for multibody systems with intermittent contact. Int. J. Robot. Res. 33(3), 426–445 (2014)
21. Trinkle, J.C., Pang, J.-S., Sudarsky, S., Lo, G.: On dynamic multi-rigid-body contact problems with Coulomb friction. ZAMM – Zeitschrift für Angewandte Mathematik und Mechanik 77(4), 267–279 (1997)
22. Trinkle, J.C., Tzitzouris, J., Pang, J.-S.: Dynamic multi-rigid-body systems with concurrent distributed contacts. Philos. Trans. Roy. Soc. Lond. A Math. Phys. Eng. Sci. 359(1789), 2575–2593 (2001)
23. Dirkse, S.P., Ferris, M.C.: The PATH solver: a non-monotone stabilization scheme for mixed complementarity problems. Optim. Methods Softw. 5(2), 123–156 (1995)
24. Mason, M.T.: Mechanics of Robotic Manipulation. MIT Press, Cambridge (2001)
A Constrained Kalman Filter for Rigid Body Systems with Frictional Contact

Patrick Varin and Scott Kuindersma

Harvard University, Cambridge, MA 02138, USA
[email protected], [email protected]
Abstract. Contact interactions are central to robot manipulation and locomotion behaviors. State estimation techniques that explicitly capture the dynamics of contact offer the potential to reduce estimation errors from unplanned contact events and improve closed-loop control performance. This is particularly true in highly dynamic situations where common simplifications like no-slip or quasi-static sliding are violated. Incorporating contact constraints requires care to address the numerical challenges associated with discontinuous dynamics, which make straightforward application of derivative-based techniques such as the Extended Kalman Filter impossible. In this paper, we derive an approximate maximum a posteriori estimator that can handle rigid body contact by explicitly imposing contact constraints in the observation update. We compare the performance of this estimator to an existing state-of-the-art Unscented Kalman Filter designed for estimation through contact and demonstrate the scalability of the approach by estimating the state of a 20-DOF bipedal robot in realtime.

Keywords: State estimation · Dynamics · Kalman Filter · Rigid body contact · Optimization

1 Introduction
Achieving reliable state estimation in the presence of uncertain dynamics and noisy measurements is a prerequisite for developing robust feedback controllers. This is particularly true for robots that experience impacts or transitions between static and sliding contact. When these discontinuous events are not explicitly accounted for, they can lead to large estimation errors and catastrophic failures, highlighting the need for efficient and practical estimation algorithms that reason about whole-body contact interactions. Although state estimation for systems with differentiable dynamics has been studied extensively, the problem changes dramatically when robots interact with their environment through rigid body contact. For example, a rigid object colliding with, then sliding across a flat surface is subject to non-linear manifold constraints, discontinuous changes in velocity, and Coulomb friction. This requires machinery beyond standard recursive estimation approaches such as the Extended Kalman Filter (EKF) [1].
In this paper, we develop an approximate maximum a posteriori (MAP) estimator using ideas from the constrained Kalman filter [2] and the complementarity constraints proposed originally by Stewart and Trinkle [3]. Our approach simultaneously estimates the system state as well as the contact forces during each update, and guarantees that they are physically consistent with our timestepping model of rigid body contact. We compare this formulation directly to a state-of-the-art Unscented Kalman Filter (UKF) [4] using a simulated example where the UKF fails to describe the qualitative behavior of a single rigid body system. We also evaluate the proposed filter on a publicly accessible physical planar pushing dataset and compare estimation performance against motion capture measurements using the Cassie bipedal robot walking overground.
2 Related Work
Traditional approaches to estimation in the presence of rigid body contact, such as locomotion, often use simplified dynamic models to perform estimation updates. Successful approaches for estimation on walking robots have used models such as the linear inverted pendulum to perform estimation on the floating base alone [5–7], while others factor the body dynamics into the floating base and the joint angles, estimating them separately, and use information from the foot kinematics to help localize the floating base [8–12]. These approaches sacrifice accuracy by ignoring the dynamic interactions between the joints and the floating base, but more importantly they often rely on simplifying assumptions like no slip at the contact points and the presence of accurate contact sensors to resolve the active contact mode. As a result, the robot is restricted to conservative behaviors that avoid slipping; violating this assumption quickly causes the filter to fail. Another approach is to use a hybrid model to perform state estimation [13]; unfortunately, the reliance on accurate contact sensors makes the filter sensitive to unmeasured contacts that can cause large estimation errors. Other work estimates external forces applied to articulated rigid body systems, so that contact forces can be inferred without direct measurement from a contact sensor [14]. This is a big step towards full body estimation through contact by reasoning about external forces applied at arbitrary locations, but it does not address the dynamics of contact that are responsible for producing these forces. Several estimation techniques have been explored that attempt to explicitly model contact dynamics. For estimation in low dimensional state spaces, Koval et al. developed the Contact Particle Filter (CPF) [15] and the Manifold Particle Filter (MPF) [16]. The main insight behind the CPF and the MPF was that contact constrains the dynamics to evolve on a lower-dimensional manifold of the state space (i.e., the contact manifold). This problem is inherent to sampling-based approaches because the probability of sampling a lower dimensional manifold is zero, resulting in particle starvation. The CPF gets around this by sampling
Fig. 1. Two illustrations of fundamental problems associated with the UKF in the presence of the inequalities associated with contact. When sample points are generated (a and d) samples are either infeasible or have different contact modes than the mean estimate. In the first sequence (a–c) the resulting estimate (c) is infeasible even though all of the samples are feasible. In the second sequence (d–f) the resulting estimate (f) is feasible, but has a contact mode that is different from any of the individual sample points. In our experience this is the more common behavior, biasing the estimate away from the contact manifold.
from a mixture model over all of the contact manifolds. The MPF takes advantage of this low dimensional contact manifold to focus sampling efforts, allowing the particle filter to extend to higher dimensions. Unfortunately, the curse of dimensionality makes the particle filter intractable as the state dimension grows, because the number of necessary particles grows exponentially with the state dimension. To allow the estimation problem to scale to higher dimensions, Lowrey et al. [4] developed a UKF to perform estimation through contact. The UKF avoids the curse of dimensionality by approximating the state distribution as Gaussian, which affords an efficient representative sample set that grows linearly, rather than exponentially, with the state dimension. In our experience the UKF fails to accurately describe contact dynamics for the same reasons that cause particle starvation in the particle filter; it is inaccurate to draw sample points from a full Gaussian distribution when the dynamics are evolving on a lower dimensional contact manifold. This causes the UKF to sample infeasible states that can produce non-physical behavior. Figure 1 illustrates two situations that cause the UKF to produce poor estimates in the presence of contact. In practice, the most common behavior is that the UKF produces estimates that are biased away from the contact manifold. There have been a few examples of derivative-based approaches to estimation through contact. Earlier work by Lowrey et al. developed an EKF using a smooth approximation to the contact dynamics [17]. This filter was prone to divergence during contact events because of the large derivatives involved when approximating the discontinuous dynamics with a smooth model. A more robust approach to derivative-based estimation is based on constrained optimization [2], which does not sacrifice the integrity of the model by making a smooth approximation
to the dynamics. Xinjilefu developed a variation of a constrained dynamic estimator for bipedal walking [10]. Xinjilefu's approach uses contact sensors on the feet to infer the contact mode, and solves for the estimated state and contact forces with a quadratic program by assuming that the feet cannot slip. These assumptions limit the use of this filter to conservative walking on robots with contact sensors on the feet, and it is not easily extensible to different domains, e.g., manipulation, where sliding contact is more prevalent.
3 Background
The filter we present here is most reminiscent of the constrained optimization work by Xinjilefu [10], but provides a more general framework for estimation through contact and can be applied to arbitrary rigid body systems. In order to develop the Contact Constrained Kalman Filter (CCKF), we first describe our model of rigid body contact and the constraints imposed by this model; then we incorporate these constraints into the constrained Kalman filtering framework.
3.1 Contact Constraints
Stewart and Trinkle [3] developed a time-stepping rigid body model for contact as a constraint satisfaction problem. These dynamics compute the state at the next timestep as the solution to a linear complementarity problem (LCP). This problem can be solved with a variety of techniques, such as Dantzig's algorithm [18], Lemke's algorithm [19], PATH [20], etc. Anitescu and Potra [21] later proved that this LCP is solvable. These time-stepping dynamics form the basis for many popular physics engines including Bullet [22], Open Dynamics Engine [23], and Dynamic Animation and Robotics Toolkit [24]. We reintroduce these constraints and develop their physical intuition here. We can approximate the continuous dynamics of a rigid body system with a discrete time system using a semi-implicit Euler update,

$$v_{k+1} = v_k + \Delta t H_k^{-1}\left(B_k u_k - C_k - G_k + J_k^T \lambda_k\right) \quad (1)$$
$$q_{k+1} = q_k + \Delta t v_{k+1} \quad (2)$$

where q_k and v_k are the generalized position and velocity of the system, H, C, and G represent the mass matrix, Coriolis terms, and gravitational forces, respectively, B maps control inputs to generalized forces, and J is the Jacobian that maps external forces, λ, to generalized forces. In the absence of contact, λ = 0, so Eqs. (1) and (2) are sufficient to compute the dynamics. During contact, however, we compute λ by considering contact constraints. The first constraints relate to collisions and the normal force required to prevent interpenetration. Both the signed distance between two bodies, φ(q), and the normal force, λ_n, must be non-negative. Furthermore, the contact forces can only be nonzero when two bodies are in contact. These constraints can be written as

$$\phi(q) \ge 0, \quad \lambda_n \ge 0, \quad \phi(q)^T \lambda_n = 0 \quad (3)$$
This last constraint says that if the bodies are touching, φ = 0, the force is allowed to be non-negative, λ_n ≥ 0, but if the bodies are not touching, φ > 0, then there can be no contact force between them, λ_n = 0. This is known as a complementarity constraint and can be written succinctly as

$$\phi(q) \ge 0 \;\perp\; \lambda_n \ge 0 \quad (4)$$

Because φ is generally a nonlinear function of q, this is a nonlinear complementarity constraint. For computational reasons we will instead use the linear complementarity constraint

$$\phi_k + \nabla\phi\,(q_{k+1} - q_k) \ge 0 \;\perp\; \lambda_{n,k} \ge 0 \quad (5)$$

In addition to non-penetration, we also want to satisfy constraints on tangential friction. Rather than using the usual second order Coulomb friction cone, we use a polyhedral approximation to the friction cone, allowing us to remain within the LCP framework. We construct the contact force, λ, by decomposing it into the normal and tangential components,

$$\lambda = n\lambda_n + D\beta, \quad \beta \ge 0 \quad (6)$$

where n ∈ R^{3×1} is the surface normal, and the columns of D ∈ R^{3×d} are unit vectors that positively span the tangent plane. In practice we use d = 4, but any d ≥ 3 would work, where larger values provide a more accurate approximation of the friction cone at the expense of additional decision variables. We can then write the polyhedral friction cone constraint as

$$\mu\lambda_n - e^T\beta \ge 0 \quad (7)$$
It is assumed that the tangential friction respects the maximum dissipation principle (i.e., that contact forces are chosen to maximize dissipation of kinetic energy), so the tangential contact force must be an optimal point of the optimization problem

$$\lambda_t = \arg\min_{\lambda_t} \; v^T J^T \lambda \quad (8)$$
$$\text{subject to} \quad \mu\lambda_n - e^T\beta \ge 0 \quad (9)$$
$$\beta \ge 0 \quad (10)$$
The first-order necessary conditions, i.e., the Karush-Kuhn-Tucker (KKT) conditions, for this problem can be written as

$$D^T J v + \eta e \ge 0 \;\perp\; \beta \ge 0 \quad (11)$$
$$\mu\lambda_n - e^T\beta \ge 0 \;\perp\; \eta \ge 0 \quad (12)$$
where η is a Lagrange multiplier that can be interpreted as the magnitude of the tangential velocity. Intuitively the first condition enforces that the force of
Fig. 2. Contact constraints that govern rigid body contact dynamics. Left: Non-penetration and positive normal force; the normal force can only be non-zero if the signed distance function, φ, is zero. Center: Polyhedral friction cone; we can approximate the Coulomb friction cone with a polyhedral cone. Right: Complementarity between tangential velocity and tangential friction; there can only be tangential velocity when the force of friction is on the edge of the friction cone.
friction opposes the tangential velocity, while the second condition ensures that there can only be tangential velocity (sliding) when the force of friction is on the edge of the friction cone. Together, Eqs. (1)–(2), (5), and (11)–(12) specify Stewart and Trinkle's time-stepping rigid body dynamics, and they will be the foundation of the filter we develop here (Fig. 2).
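As a hedged illustration (our sketch, not the paper's implementation), one step of this update looks as follows once the contact impulses λ_k have been obtained from the LCP. The dynamics terms would normally come from a rigid-body library; here they are plain arrays and all names are ours.

```python
import numpy as np

def semi_implicit_step(q, v, u, lam, H, C, G, B, J, dt):
    """Semi-implicit Euler update of Eqs. (1)-(2) given contact forces lam."""
    v_next = v + dt * np.linalg.solve(H, B @ u - C - G + J.T @ lam)  # Eq. (1)
    q_next = q + dt * v_next                                         # Eq. (2)
    return q_next, v_next
```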
3.2 Constrained Kalman Filter
We can build these contact constraints into a MAP estimator using the constrained Kalman Filter [2]. Consider the discrete time stochastic system:

$$x_{k+1} = f(x_k, u_k) + w_k, \quad w_k \sim \mathcal{N}(0, Q) \quad (13)$$
$$y_k = h(x_k) + v_k, \quad v_k \sim \mathcal{N}(0, R) \quad (14)$$
where x is the state, u is a control input, and y is a noisy measurement. If the prior on x is Gaussian with mean x_{k−1} and covariance P_{k−1}, then we can compute the mean and covariance of an approximate posterior by linearizing the dynamics, F = ∂f/∂x, and the observation function, H = ∂h/∂x, and using the Extended Kalman Filter (EKF) updates:

$$x_k^- = f(x_{k-1}, u_{k-1}) \quad (15)$$
$$P_k^- = F P_{k-1} F^T + Q \quad (16)$$
$$x_k = x_k^- + K(y_k - h(x_k^-)) \quad (17)$$
$$P_k = (I - K H_k) P_k^- \quad (18)$$
where K is the Kalman gain. Because the mean of the Gaussian distribution is also the mode, the observation update (17) can be interpreted as an approximate maximum a posteriori (MAP) estimate—up to the linearization of f (·) and h(·)—that can be computed in closed form because the negative logarithm of a Gaussian pdf is a convex quadratic function.
This interpretation of the Kalman filter as the MAP estimator allows us to naturally incorporate dynamic constraints into the filter by writing the observation update as a constrained optimization problem:

$$\underset{x}{\text{minimize}} \quad \Delta x^T P^{-1} \Delta x + (\tilde{y} - H\Delta x)^T R^{-1} (\tilde{y} - H\Delta x) \quad (19)$$
$$\text{subject to} \quad g_{eq}(x) = 0 \quad (20)$$
$$g_{in}(x) \le 0 \quad (21)$$
where Δx = x_k − x_k^- is the state correction and ỹ_k = y_k − h(x_k^-) is the innovation. If the constraints g_eq(·) and g_in(·) are linear, then this can be solved efficiently as a quadratic program (QP). This was the approach used by Xinjilefu [10] in developing the estimator for the Atlas humanoid robot. The technique that we will develop here is similar in spirit, but does not rely on contact sensors to resolve the active contacts, and allows for dynamic contact interactions such as slipping by taking a fundamentally different approach to contact.
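For intuition, the equality-constrained special case of (19)-(20) has a closed-form solution via its KKT system. The sketch below is ours, not the authors' code; it assumes linear constraints of the form A(x_prior + Δx) = b, and all names are hypothetical.

```python
import numpy as np

def constrained_observation_update(x_prior, P, H, R, y_tilde, A, b):
    """MAP correction from (19)-(20) with linear equality constraints only."""
    Pinv = np.linalg.inv(P)
    Rinv = np.linalg.inv(R)
    Qh = Pinv + H.T @ Rinv @ H          # Hessian of the quadratic cost
    g = H.T @ Rinv @ y_tilde            # linear term from the innovation
    n, m = Qh.shape[0], A.shape[0]
    # Stationarity and feasibility stacked into one linear (KKT) system
    KKT = np.block([[Qh, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([g, b - A @ x_prior])
    dx = np.linalg.solve(KKT, rhs)[:n]
    return x_prior + dx
```

Inequality constraints such as (21) turn this into a QP, and with the complementarity constraints of the next section the problem becomes the QPCC solved in Sect. 4.1.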
4 Contact Constrained Kalman Filter
The contact constraints developed in Sect. 3.1 suggest that in order to perform state estimation through contact we need to consider both the state, x, and the contact forces, λ. While the Stewart-Trinkle dynamics compute the contact forces as an implicit function of the state, this mapping is discontinuous and cannot be linearized, precluding its use in the EKF framework. To develop the Contact Constrained Kalman Filter (CCKF) we apply the constrained Kalman filtering approach and explicitly optimize over both the state and the contact forces. We can rewrite the discrete time dynamics (Eqs. 1 and 2) as

$$x_k = \hat{x}_k + \hat{J}_{k-1}^T \lambda_k, \quad \hat{J}^T = \begin{bmatrix} \Delta t^2 H^{-1} J^T \\ \Delta t H^{-1} J^T \end{bmatrix} \quad (22)$$

where x̂_k is the state of the unconstrained system at time k (capturing the state changes from the inputs, gravity, etc.) and Ĵ^T maps contact forces to changes in the generalized coordinates. The objective for the optimization then becomes

$$\left(\begin{bmatrix} I & -\hat{J}_{k-1}^T \end{bmatrix}\begin{bmatrix} x_k \\ \lambda_k \end{bmatrix} - \hat{x}_k\right)^T (P_k^-)^{-1}\left(\begin{bmatrix} I & -\hat{J}_{k-1}^T \end{bmatrix}\begin{bmatrix} x_k \\ \lambda_k \end{bmatrix} - \hat{x}_k\right) + (\tilde{y}_k - H_x\Delta x - H_\lambda\lambda)^T R^{-1}(\tilde{y}_k - H_x\Delta x - H_\lambda\lambda) \quad (23)$$

and the contact constraints become

$$\begin{bmatrix} nJ & 0 & 0 \\ 0 & DJ & E \\ 0 & 0 & I \end{bmatrix}\begin{bmatrix} q_k \\ v_k \\ \eta_k \end{bmatrix} + \begin{bmatrix} \phi - nJq_{k-1} \\ 0 \\ 0 \end{bmatrix} \ge 0 \;\perp\; \begin{bmatrix} I & 0 \\ 0 & I \\ \tilde{\mu} & -E^T \end{bmatrix}\begin{bmatrix} \lambda_k^n \\ \beta_k \end{bmatrix} \ge 0 \quad (24)$$
Algorithm 1: The Contact Constrained Kalman Filter

1: initialize x and P
2: f(x) is the unconstrained dynamics from (1) and (2) with λ = 0
3: h(x, λ) is the observation function
4: while true do
5:   /* compute the process update with no contact forces */
6:   x̂ = f(x)
7:   F = ∂f/∂x
8:   P⁻ = F P Fᵀ + Q
9:   /* compute the observation update via constrained minimization */
10:  ỹ = y − h(x̂); Hₓ = ∂h/∂x and H_λ = ∂h/∂λ
11:  x, λ = arg min (23) subject to (24)
12:  P = P⁻ − P⁻ Hₓᵀ (Hₓ P⁻ Hₓᵀ + R)⁻¹ Hₓ P⁻
13: end
Note that the complementarity constraints separate the decision variables nicely into two sets that together form a bilinear constraint. These can be loosely interpreted as the primal variables, x, η, and the dual variables, λ_n, β, associated with contact dynamics. Defining the new variables x̃ = [x^T, η^T]^T and z̃ = [λ_n^T, β^T]^T, we can write down the full optimization problem for the observation update concisely as

$$\underset{\tilde{x}, \tilde{z}}{\text{minimize}} \quad \begin{bmatrix} \tilde{x} \\ \tilde{z} \end{bmatrix}^T Q \begin{bmatrix} \tilde{x} \\ \tilde{z} \end{bmatrix} - l^T \begin{bmatrix} \tilde{x} \\ \tilde{z} \end{bmatrix} \quad (25)$$
$$\text{subject to} \quad A\tilde{x} + b \ge 0 \;\perp\; C\tilde{z} \ge 0 \quad (26)$$
As we noted, this optimization problem doesn’t require contact sensors to resolve the contact mode, but information from contact sensors can be naturally incorporated via the contact force dependence in the observation function h(x, λ). 4.1
Quadratic Program with Complementarity Constraints
In order to implement the CCKF we must solve the optimization problem defined by (23) and (24). This problem is a special case of a quadratic program with complementarity constraints (QPCC). In our experiments, we chose to solve this problem via mixed-integer optimization. We can formulate the complementarity constraint by introducing a binary variable, y, that determines which of the equality constraints are active. This results in the mixed integer quadratic program (MIQP):
482
P. Varin and S. Kuindersma
minimize x,y,z
T T x x x Q − lT z z z
subject to Ax + b ≥ 0 Cy ≥ 0
(27) (28) (29)
y T (Ax + b) = 0
(30)
(1 − y) Cz = 0
(31)
y ∈ {0, 1} .
(32)
T
N
We solve this MIQP using the commercial optimization package Gurobi [25] that employs a branch-and-bound method to find the exact solution. Warm starting the binary variable at each time step allows the optimizer to quickly prune the branch-and-bound tree. As we show in our results, this has a positive impact on both the mean and variance of update times.
5
Results
We evaluate the Contact Constrained Kalman Filter in three scenarios. 1. We demonstrate the practical shortcomings of the state-of-the-art UKF in a simple single 6-DOF rigid body scenario where the UKF fails to describe the system’s behavior, but the CCKF succeeds. 2. We validate the performance of the CCKF during sliding using examples from a large physical planar pushing dataset. 3. We demonstrate that the filter scales to more complex systems by evaluating estimation and timing performance on a 20-DoF physical biped. The filter parameters for each of these experiments are detailed in Table 1. Because the UKF doesn’t successfully capture the behavior of the simple toy example in the first experiment, we do not use the UKF as a baseline in subsequent experiments. For example, in the planar pushing experiment the UKF predicts states that are non-planar, hovering above the pushing surface, whereas the CCKF correctly predicts planar states. 5.1
Simulated Data
One predicted advantage of the CCKF over the comparable state-of-the-art UKF approach [4] is that the UKF tends to produce non-physical behavior near the contact manifold (Fig. 1) while the CCKF handles inequalities arising from the contact manifold by optimizing over a truncated Gaussian. To compare the performance of the proposed filter and the UKF on a simple system, we evaluated both filters on simulated data of a rectangular prism rotating and falling onto flat ground. We approximated the sensor data that would be obtained from system with an IMU and visual or proprioceptive sensors by generating noisy position, gyroscope, and accelerometer measurements. Since both approaches use
Contact Constrained Kalman Filter
483
a different formulation of rigid body dynamics, we chose a simulator that uses neither. Forward simulation was performed using a compliant contact model in Drake [26] that approximates rigid body contact but allows for slight interpenetration according to a user specified Young’s Modulus. In our experiments we used all of the default parameters except for the friction coefficients for which we used μ = 1.0 to be consistent with our implementation of the UKF. Both filters used the same parameters, operating at 100 Hz. The covariance parameters R and P0 were chosen to reflect the true measurement noise and initial state error, while Q was chosen so that both filters converged to steady state before the first contact event. Figure 3 illustrates the performance of both filters from 20 randomly initialized state estimates for a single forward simulation. The UKF performs well up until the second collision, at which point the contact manifold constrains two degrees of freedom, corresponding to at least two infeasible sigma points, and the filter begins to diverge. After the brick has settled, we can see that the state estimate is biased above the ground, away from the contact manifold. The CCKF, however, shows good tracking performance throughout the entire trajectory.
Fig. 3. Simulation results of a brick falling on flat ground. Top: the CCKF accurately estimates the brick trajectory through the contacts, while the UKF struggles near the contact manifold.
5.2
Planar Pushing Dataset
In order to demonstrate the performance of the filter during frictional sliding, we evaluated the filter on the PUSH dataset, a planar pushing dataset collected by the MCube lab at MIT [27]. This dataset contains pose data of various shapes being pushed on a number of surfaces. While many of the pushing examples from
484
P. Varin and S. Kuindersma
Fig. 4. The mean state estimates from an example sliding trajectory with various coefficients of friction. μ = 0.14 has the best agreement and was the mean coefficient of friction reported with the experimental data.
this dataset involve only quasi-static motion, which does not highlight the full capability of this filter, there are a number of examples of aggressive pushing where the object continues to slide after contact with the manipulator is broken. We selected the most aggressive (largest acceleration) pushes of the rect1 object, a rectangular object, on delrin, which was reported to have the most consistent coefficient of friction of the surfaces used in the dataset (0.14 ± 0.016). Detailed geometric and inertial parameters for this object are provided with the dataset. The coefficient of friction was set to the experimentally reported mean. We ran the filter on pose data that was corrupted with zero mean Gaussian noise. The filter parameters were empirically chosen to maximize the filter performance, except for R which was chosen to reflect the true added measurement noise. Figure 4 illustrates the filter performance on an example sliding trajectory. When the coefficient of friction is set correctly the filter exhibits good performance, correctly estimating the time at which the object comes to rest. When the friction coefficient is too large the velocity estimate to drops to zero prematurely, and when the friction is too small the estimate overshoots the true state. The observation update is able to correct the position and orientation for all examples with zero steady state error, except in the frictionless example where the dynamics are sufficiently different. It is interesting to note that while the observation update is able to correct errors in the position, it cannot correct the velocity estimate because it is not directly observed. Figure 4, illustrates that a misspecified model, such as incorrect friction, can produce velocity trajectories that are inconsistent with the position trajectories. Such inconsistencies may be improved with a richer
Contact Constrained Kalman Filter
485
Fig. 5. The filter running on a Cassie series robot in a motion capture lab. Top: an example of a 30 s walking trajectory, we see good agreement between the motion capture data and the filtered estimate. Bottom: the filter performance during a typical step, starting from a 20 random initial estimates. Note that most of the initial estimates that underestimate the height are immediately corrected to the true height, preventing penetration with the floor. The one example that does not immediately track the correct height has a corresponding pitch/roll error that keep the feet above the ground.
measurement model that is able to correct the inconsistent velocities in the observation update, or a longer filter horizon that can use multiple sequential position measurements to inform the velocity estimate. 5.3
State Estimation for a Biped
To demonstrate that this filter scales to complex robots, we collected data using a Cassie bipedal robot walking in two scenarios: a scenario that is indicative of stable walking on flat ground and a scenario where we lubricated the walking surface, causing the robot to slip. Cassie has 20 degrees of freedom (DOF): 6 DOF arise from the floating base, 10 DOF in the joints are directly actuated, and the 4 remaining DOF are stabilized with stiff springs. Additionally, each leg contains a kinematic loop that are represented as additional constraints in the filter. We used the inertial parameters and the spring constants provided by the manufacturer. Cassie is
486
P. Varin and S. Kuindersma
outfitted with a 6-axis IMU and 14 joint encoders that measure the position and velocity of the joints. The prepackaged software provides an orientation estimate from the IMU, but does not log the accelerometer data, so we used the orientation estimate as a sensor in our experiments in lieu of the underlying accelerometer data. Walking. The walking experiment was conducted in a motion capture studio to gather ground truth position data. The robot walked around the motion capture studio for approximately 5 min, then onboard sensor data was synchronized in time with the motion capture measurements. Figure 5 illustrates the performance of the estimator during a typical step and over a large step sequence. Starting from a 20 random initial conditions we can see that the filter converges reliably to the true state within one step. The 30 s walking sequence demonstrates that the filter maintains good performance through many contact mode changes. On a large robot like Cassie, the filter operates at an average of 174 Hz. Although the underlying optimization problem is an MIQP, the contact mode changes much more slowly than the filter update frequency, so we can achieve fast performance by warm starting the contact mode in the optimization problem. This allows the optimizer to quickly prune the branch-and-bound tree and arrive at the globally optimal solution in fewer iterations. Figure 6 illustrates the advantage of warm starting the optimization with the last contact mode.
Fig. 6. The timing distribution for the filter running on Cassie walking data. Warm starting tightens the timing distribution and significantly decreases the mean update time.
Slipping. Estimation while walking in ideal scenarios—e.g., where the probability of slipping is very small—is important, but it is also critical to maintain good state estimation during unplanned slipping events. To test the performance of the estimator in a highly dynamic contact scenario we lubricated one of the feet of our bipedal robot to encourage slipping. We executed a nominal walking
Contact Constrained Kalman Filter
487
controller that does not implement slip-recovery, causing the robot to fall on the first step. We compare the performance of the CCKF against the performance of a more traditional walking estimator based on the estimator developed for the Atlas robot during the DARPA Robotics Challenge [11]. For safety we conducted this test while the robot was attached to a harness outside of the motion capture lab. As a result we can only make qualitative comparisons between the filter results and a video of the experiment. However, these results clearly show that the walking estimator that relies on a no-slip assumption quickly accumulates large errors in floating base position, whereas the CCKF correctly predicts that the stance foot will slip while the floating base has minimal lateral motion. Repeating the slipping experiment while varying the orientation of the approximate friction polyhedron used by the filter produces virtually indistinguishable estimation results (Fig. 7).
Fig. 7. Top: Video frames at key moments of the slip. Middle: The state estimated by the CCKF. Bottom: The state estimated by a filter that factors out the joints from the floating base and uses a no-slip assumption at the feet.
488
P. Varin and S. Kuindersma Table 1. Filter covariance parameters used in the experiments presented here.
Parameter
Falling 2
−2
P0 -position (m )
1 × 10
1 × 10−5 1 × 10−3 −3
1 × 10−3
1 × 10
1 × 10−2 1 × 10−2 1 × 10−3
1 × 10−3
Q-position (m2 )
1 × 10−3 1 × 10−6 1 × 10−3
1 × 10−3
−3
1 × 10
1 × 10−3
−3
P0 -orientation 2 2 P0 -velocity ms2 , rad s
−4
−3
1 × 10−3
1 × 10
1 × 10−3 1 × 10−4 1 × 10−3
1 × 10−3
R-position (m2 )
1 × 10−3 1 × 10−5 1 × 10−2
N/A
R-orientation
1 × 10−3 1 × 10−3 1 × 10−2
1 × 10−2
1 × 10−3 N/A
1 × 10−6
1 × 10−6
1 × 10−3 N/A
N/A
N/A
R-acceleration
rad s2
2
2
m s4 2
R-joint data (rad ),
rad s
2
N/A
1 × 10
1 × 10
Q-orientation 2 2 Q-velocity ms2 , rad s2
R-angular rate
6
−2
Pushing Bipedal walking Bipedal slipping
N/A
1 × 10
−6
1 × 10
1 × 10−6
Conclusion
We developed a constrained Kalman filter as an approximate MAP estimator for rigid body systems with frictional contact and evaluated its performance in several simulated and physical estimation tasks. In addition to addressing some fundamental problems that arise from sample-based estimators, our results suggest that the filter performs well through sliding and collision events. We also demonstrated scalability of the algorithm, despite its non-convexity, by estimating the state of a 20-DoF bipedal robot in realtime. Although the filter demonstrates good empirical performance, we provide no theoretical guarantees on the convergence. In fact, it is not difficult to generate examples in which the discontinuous contact events produce multimodal distributions that may cause the filter to diverge. In such a scenario it may be desirable to represent multimodal belief distributions or to estimate the distribution over contact modes in a similar fashion to the Contact Particle Filter [15]. In practice, it may be possible to solve the problem of multimodality by having sufficient measurement power. For example, even if a contact event causes the prior distribution to become multimodal, the right information from the sensors during the likelihood update can allow the filter to choose the correct mode. It is also worth noting that although we show good empirical performance when solving the optimization problem described by (23) and (24), the number of contact modes grows exponentially with the number of contact pairs. This suggests that for a larger number of contacts the problem may become intractable, and different solution methods for solving the QPCC should be considered. Another limitation of the CCKF is that it assumes knowledge of the geometry and frictional properties of contact with the environment. Even in cases where local contact geometry can be accurately measured, it may be necessary
to estimate the coefficient of friction online. In fact, recent work has demonstrated that even in controlled environments it may not be valid to assume that the coefficient of friction is constant, but rather that it should be treated as a random variable that follows some distribution [27]. An alternative approach would be to use a dual estimator to estimate the contact parameters, such as friction and geometry, simultaneously with the state. Recent work has shown that many popular contact models lack the descriptive power to capture phenomena such as back-spin [28], and that data-driven contact models can significantly outperform purely theoretical approaches [29]. Acknowledgements. This work was supported by a Draper Internal Research and Development grant and the National Science Foundation (Grant Number IIS-1657186). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References
1. Wahba, G.: A least squares estimate of satellite attitude. SIAM Rev. 7, 409 (1965)
2. Simon, D.: Kalman filtering with state constraints: a survey of linear and nonlinear algorithms. IET Control Theory Appl. 4, 1303–1318 (2010)
3. Stewart, D.E., Trinkle, J.C.: An implicit time-stepping scheme for rigid body dynamics with inelastic collisions and Coulomb friction. Int. J. Numer. Meth. Eng. 39(15), 2673–2691 (1996)
4. Lowrey, K., Dao, J., Todorov, E.: Real-time state estimation with whole-body multi-contact dynamics: a modified UKF approach. In: Proceedings of the IEEE-RAS International Conference on Humanoid Robots (2016)
5. Kajita, S., Kanehiro, F., Kaneko, K., Fujiwara, K., Yokoi, K., Hirukawa, H.: Biped walking pattern generation by a simple three-dimensional inverted pendulum model. Adv. Robot. 17, 131–147 (2003)
6. Stephens, B.J.: State estimation for force-controlled humanoid balance using simple models in the presence of modeling error. In: 2011 IEEE International Conference on Robotics and Automation, pp. 3994–3999, May 2011
7. Wang, S., Shi, Y., Wang, X., Jiang, Z., Yu, B.: State estimation for quadrupedal using linear inverted pendulum model. In: 2017 2nd International Conference on Advanced Robotics and Mechatronics (ICARM), pp. 13–18, August 2017
8. Bloesch, M., Hutter, M., Hoepflinger, M.A., Leutenegger, S., Gehring, C., Remy, C.D., Siegwart, R.: State estimation for legged robots - consistent fusion of leg kinematics and IMU (2012)
9. Rotella, N., Bloesch, M., Righetti, L., Schaal, S.: State estimation for a humanoid robot. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 952–958, September 2014
10. Xinjilefu, X., Feng, S., Atkeson, C.G.: Dynamic state estimation using quadratic programming. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 989–994, September 2014
11. Kuindersma, S., Deits, R., Fallon, M., Valenzuela, A., Dai, H., Permenter, F., Koolen, T., Marion, P., Tedrake, R.: Optimization-based locomotion planning, estimation, and control design for Atlas. Auton. Robots 40(3), 429–455 (2016)
12. Hartley, R., Mangelson, J., Gan, L., Jadidi, M.G., Walls, J.M., Eustice, R.M., Grizzle, J.W.: Legged robot state-estimation through combined forward kinematic and preintegrated contact factors. arXiv:1712.05873 [cs], December 2017
13. Singh, S.P.N., Waldron, K.J.: A hybrid motion model for aiding state estimation in dynamic quadrupedal locomotion. In: Proceedings of the 2007 IEEE International Conference on Robotics and Automation, pp. 4337–4342, April 2007
14. Nori, F., Kuppuswamy, N., Traversaro, S.: Simultaneous state and dynamics estimation in articulated structures. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3380–3386, September 2015
15. Koval, M.C., Pollard, N.S., Srinivasa, S.S.: Pose estimation for planar contact manipulation with manifold particle filters. Int. J. Robot. Res. 34, 922–945 (2015)
16. Koval, M.C., Klingensmith, M., Srinivasa, S.S., Pollard, N.S., Kaess, M.: The manifold particle filter for state estimation on high-dimensional implicit manifolds, pp. 4673–4680. IEEE, May 2017
17. Lowrey, K., Kolev, S., Tassa, Y., Erez, T., Todorov, E.: Physically-consistent sensor fusion in contact-rich behaviors. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1656–1662, September 2014
18. Baraff, D.: Fast contact force computation for nonpenetrating rigid bodies, pp. 23–34. ACM Press (1994)
19. Lemke, C.E., Howson Jr., J.T.: Equilibrium points of bimatrix games. J. Soc. Ind. Appl. Math. 12(2), 413–423 (1964)
20. Ferris, M.C., Munson, T.S.: Interfaces to PATH 3.0: design, implementation and usage. Comput. Optim. Appl. 12, 207–227 (1999)
21. Anitescu, M., Potra, F.A.: Formulating dynamic multi-rigid-body contact problems with friction as solvable linear complementarity problems. Nonlinear Dyn. 14(3), 231–247 (1997)
22. Bullet Collision Detection & Physics Library. http://www.bulletphysics.org/
23. Smith, R.: Open Dynamics Engine (2007). http://www.ode.org/
24. DART: Dynamic Animation and Robotics Toolkit. https://dartsim.github.io/
25. Gurobi Optimization, Inc.: Gurobi optimizer reference manual. Technical report (2014)
26. Tedrake, R., The Drake Development Team: Drake: a planning, control, and analysis toolbox for nonlinear dynamical systems (2016)
27. Yu, K.-T., Bauza, M., Fazeli, N., Rodriguez, A.: More than a million ways to be pushed: a high-fidelity experimental dataset of planar pushing. arXiv:1604.04038 [cs], April 2016
28. Fazeli, N., Zapolsky, S., Drumwright, E., Rodriguez, A.: Fundamental limitations in performance and interpretability of common planar rigid-body contact models. arXiv:1710.04979 [cs], October 2017
29. Fazeli, N., Kolbert, R., Tedrake, R., Rodriguez, A.: Parameter and contact force estimation of planar rigid-bodies undergoing frictional contact. Int. J. Robot. Res. 36, 1437–1454 (2017)
A Quasi-static Model and Simulation Approach for Pushing, Grasping, and Jamming

Mathew Halm and Michael Posa

GRASP Laboratory, University of Pennsylvania, Philadelphia, PA 19104, USA
{mhalm,posa}@seas.upenn.edu
Abstract. Quasi-static models of robotic motion with frictional contact provide a computationally efficient framework for analysis and have been widely used for planning and control of non-prehensile manipulation. In this work, we present a novel quasi-static model of planar manipulation that directly maps commanded manipulator velocities to object motion. While quasi-static models have traditionally been unable to capture grasping and jamming behaviors, our approach solves this issue by explicitly modeling the limiting behavior of a velocity-controlled manipulator. We retain the precise modeling of surface contact pressure distributions and efficient computation of contact-rich behaviors of previous methods and additionally prove existence of solutions for any desired manipulator motion. We derive continuous and time-stepping formulations, both posed as tractable Linear Complementarity Problems (LCPs). Keywords: Quasi-static motion · Manipulation and grasping · Rigid body motion · Dynamics · Linear complementarity problems · Simulation
1 Introduction
As frictional contact is the fundamental driving process by which many robots are able to interact with their surroundings, it is unsurprising that its behavior is central to a large body of robotic locomotion and manipulation research (e.g. [7,11,19,20,23,27]). However, dynamical models of these systems are inherently complex and challenging to simulate and analyze. Impacts between rigid bodies induce instantaneous jumps in velocity states and a combinatorial explosion of hybrid modes that in conjunction render application of common tools from control theory and trajectory optimization difficult. While there has been notable progress in planning through unknown contact sequences with full dynamics [13,15,19], model complexity has thus far still inhibited real-time usage. Many applications in robotics involving frictional contact exhibit structure that permits partial or full circumvention of these difficulties. Particularly, we examine planar tabletop manipulation, where a manipulator effects motion of
an object that rests upon a flat, frictional surface. Several results in simulation, control, and planning for such systems have been enabled by quasistatic assumptions—that if manipulator accelerations and velocities are low, a force-balance equation can approximate Newton's second law. This assumption enables reduced-order modeling of the object's movement in response to contact with a manipulator driven at a particular velocity; such models also often circumvent the complexity associated with state discontinuities in dynamical approaches. Furthermore, quasi-static models often eliminate numerical sensitivity induced by the stiff dynamics of manipulators with high feedback gain controllers. The tractability of these models has enabled impressive results in formal control analysis, task planning, and learning (e.g. [5,10,12]). However, the range of motion that these methods are currently able to model is limited. They are often restricted to pushing and non-prehensile motions and are completely unable to usefully express grasping or jamming; in these cases, their associated mathematical programs often yield no solutions or ambiguous behavior. Grasping and jamming objects is crucial for a wide range of robot tasks, and much work has been devoted to planning and controlling action before and after grasping events (e.g. [9,18,21,22,30]). However, much of this work can only describe static grasp configurations, and is unable to depict grasp-like behavior with sliding contacts. The process of acquiring such a grasp itself often involves jamming (e.g. when reorienting an object by pushing it up against a wall), which neither the prehensile nor static grasping models can capture. We therefore find great value in the formulation of a unified quasi-static model that can smoothly capture a complete task involving the acquisition and use of a grasp. The ambiguity in traditional approaches arises from the inconsistent assumptions of rigid bodies and perfect control of manipulator velocity. Our key insight is that by appropriately representing the manipulator's internal controller, physically-grounded motion of the object-manipulator system is guaranteed to exist. We contribute both instantaneous velocity and time-stepping position models that formulate this behavior as linear complementarity problems, and prove existence of solutions for each.
2 Related Work
There is a significant body of research that examines manipulation from a quasistatic perspective [11,14,17,23,27]. For systems in which the object experiences frictional support, relevant research typically examines the pressure distribution supporting the object. Some earlier works provide guaranteed properties of the object’s motion without full knowledge of this distribution. Mason [14] derived the voting theorem to construct a mapping from the center of pressure to the direction of the object’s angular velocity. Lynch and Mason [12] later performed stability and controllability analysis for a manipulator pushing an object with multiple fingers. Other works alternatively contribute models that directly map manipulator joint velocities to object motion. Trinkle [27] characterized vertical planar manipulation using a nonlinear mathematical program that explicitly
solved for the contact forces between the object and the manipulator. A similar, more general model for arbitrary 3D rigid multibody systems was proposed in Trinkle et al. [28]. Neither formulation can model detailed pressure distributions between surfaces, as they model friction as acting at a finite set of points. Efficient and expressive modeling of the complex behavior of these pressure distributions was enabled by Goyal et al. [8], who define the limit surface, the bounded convex set of friction loads that the surfaces in contact might exert on each other. Zhou et al. [31] approximate this set in high detail as a semialgebraic set fitted to experimental data, and from it derive a model for the motion of the manipulator-object system. This method, however, assumes that the manipulator follows a commanded velocity exactly, and therefore does not return a solution for infeasible commands. As such commands often result in grasping in a full dynamical model, valuable behaviors are not captured. Some planning methods (e.g. [6]) introduce manipulator compliance through the generalized dampers of Whitney [29], though they have not been incorporated into models capable of initiating and releasing multiple contacts. Pang and Tedrake [17] also devise a resolution for non-existence in velocity-controlled 3D rigid quasi-static systems. They model deviations from desired velocity as a result of local elastic deformation at point contacts, and preserve realism by minimizing them in a mixed-integer quadratic program (MIQP) formulation. While their work applies to a more general class of systems, our method has three key advantages for planar manipulation: we draw model behavior from a problem class that is far more tractable than MIQPs; our proof of existence makes possible formal guarantees for controller performance as well as simulation reliability; and our inclusion of a limit-surface model allows for more realistic modeling without introducing significant complexity.
3 Background
Fig. 1. An example of the type of system described in this section. The blue pentagon represents the object, while the red shapes represent the manipulator.
We now describe the behavior of a frictionally-supported object with center of mass p in contact with a manipulator, depicted in Fig. 1. For a more detailed treatment of rigid-body contact dynamics, we refer the reader to [1]. Coordinates of the object and manipulator configurations are denoted $q_o = [p_W; q_\theta] \in \mathbb{R}^3$ and $q_m \in \mathbb{R}^n$, respectively. The manipulator is assumed to be controlled to track commanded velocity $u$. Frictional contact with the manipulator causes the object to move with body-axis velocity $V = [v_{x,b}\ v_{y,b}\ \dot{q}_\theta]^T$, which can be converted to the world frame velocity $\dot{q}_o$ with the full-rank transformation $\dot{q}_o = \tilde{R}(q_\theta)V$. Contact is modeled as normal and frictional forces $\lambda_N \in \mathbb{R}^k$ and $\lambda_T \in \mathbb{R}^{2k}$ at $k$ pairs of points on the boundaries of the object and manipulator. The coefficients of friction at these points are given by the diagonal matrix $\mu \in \mathbb{R}^{k \times k}$. Each contact's tangential force is split into two non-negative opposing forces $\lambda_{T_i}^+$ and $\lambda_{T_i}^-$ (as in [1]), such that the net tangential force is equal to $\lambda_{T_i}^+ - \lambda_{T_i}^-$. The distances between each pair of points are represented by the vector $\phi(q_o, q_m) \in \mathbb{R}^k$. When the points are in contact, their separating velocity can be calculated from the generalized velocities as $\dot{\phi} = J_{N,o}\dot{q}_o + J_{N,m}\dot{q}_m$ using Jacobian matrices $J_{N,o} = \frac{\partial \phi}{\partial q_o}$ and $J_{N,m} = \frac{\partial \phi}{\partial q_m}$. Similarly, there exist Jacobian matrices $J_{T,o}$ and $J_{T,m}$ such that the velocity at which the bodies slide against each other is given by $J_{T,o}\dot{q}_o + J_{T,m}\dot{q}_m$. For notational convenience, we will often group the normal and tangential terms as $J_o = [J_{N,o}; J_{T,o}]$, $J_m = [J_{N,m}; J_{T,m}]$, and $\lambda = [\lambda_N; \lambda_T]$. Using the relationship between the contact velocities and the generalized velocities, the contact forces can be used to determine the forces acting on the object and manipulator as $F_o = J_o^T\lambda$ and $F_m = J_m^T\lambda$, respectively. We also make use of the constant $e = [1\ 1]^T$ and the block-diagonal matrix

$$E = \begin{bmatrix} e & 0 & \cdots \\ 0 & e & \\ \vdots & & \ddots \end{bmatrix} \in \mathbb{R}^{2k \times k}. \qquad (1)$$
3.1 Linear Complementarity Problems
Throughout this work, we will make regular use of linear complementarity problems (LCPs). An LCP is a particular type of mathematical program for which solutions can be efficiently computed. LCPs have been widely used by the dynamics and robotics communities for describing the effects of contact (e.g. [1,25,31]). Here, we briefly introduce the problem formulation and some useful properties, and we refer the reader to [3] for a more complete description.

Definition 1. The linear complementarity problem LCP(M, w) for the matrix $M \in \mathbb{R}^{n \times n}$ and vector $w \in \mathbb{R}^n$ describes the mathematical program

$$\text{find } z \in \mathbb{R}^n \qquad (2)$$
$$\text{subject to } z^T(Mz + w) = 0, \qquad (3)$$
$$z \geq 0, \qquad (4)$$
$$Mz + w \geq 0, \qquad (5)$$
for which the set of solutions is denoted SOL(M, w). Constraints (3)–(5), called complementarity constraints, are often abbreviated as $0 \leq z \perp Mz + w \geq 0$. Note that vector inequalities in the above definition, as well as elsewhere in this work, are taken element-wise. We will find that for LCPs related to frictional behavior, the matrix parameter M is often copositive (i.e. $x^T M x \geq 0$ for all $x \geq 0$). This property is often theoretically useful, as Corollary 4.4.12 of [3], reproduced below, gives a sufficient condition for copositive LCP feasibility.

Proposition 1. Let $w \in \mathbb{R}^n$, and let $M \in \mathbb{R}^{n \times n}$ be copositive. Suppose that for every $z \in \mathrm{SOL}(M, 0)$, we have $z^T w \geq 0$. It follows that $\mathrm{SOL}(M, w) \neq \emptyset$, and an element of SOL(M, w) can be discovered by Lemke's Algorithm in finite time.
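While solutions are found with pivoting methods such as Lemke's Algorithm rather than by search, the complementarity conditions (3)–(5) are cheap to verify numerically for a candidate z. The following minimal numpy sketch is illustrative only; the function name and tolerance are our own assumptions, not part of [3].

```python
import numpy as np

def in_sol(M, w, z, tol=1e-8):
    """Check z in SOL(M, w): z >= 0, Mz + w >= 0, and z complementary to Mz + w."""
    s = M @ z + w                    # slack vector Mz + w
    return (z.min() >= -tol and      # (4): z >= 0
            s.min() >= -tol and      # (5): Mz + w >= 0
            abs(z @ s) <= tol)       # (3): z^T (Mz + w) = 0

# Tiny example: M copositive (here PSD) and w >= 0, so z = 0 is a solution.
M = np.array([[2.0, 0.0], [0.0, 1.0]])
w = np.array([1.0, 3.0])
print(in_sol(M, w, np.zeros(2)))     # True
```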
3.2 Friction at Point Contacts Between Object and Manipulator
Common to many models of friction is the maximum dissipation principle, which states that if F is the set of all possible forces acting at a contact, then the force at any given moment is one such that the power dissipated at the contact is maximized. For contact at a single point between two rigid bodies with relative velocity v, this condition is realized as

$$f \in \arg\min_{\bar{f} \in F} \bar{f} \cdot v. \qquad (6)$$
For point contacts, F is often modeled as a cone by Coulomb friction, which enforces the following:

– For each contact, either the normal velocity is zero and the normal force is non-negative, or vice versa:
$$0 \leq J_{N,o}\dot{q}_o + J_{N,m}\dot{q}_m \perp \lambda_N \geq 0. \qquad (7)$$

– The magnitude of the frictional force at the jth contact is bounded above by $\mu_{j,j}\lambda_{N_j}$. For sliding contacts, the frictional force has magnitude $\mu_{j,j}\lambda_{N_j}$ and is antiparallel to the sliding velocity. This behavior can be captured as complementarity constraints with the addition of a slack variable $\gamma$:
$$0 \leq J_{T,o}\dot{q}_o + J_{T,m}\dot{q}_m + E\gamma \perp \lambda_T \geq 0, \qquad (8)$$
$$0 \leq \mu\lambda_N - E^T\lambda_T \perp \gamma \geq 0. \qquad (9)$$
3.3 Friction at Contact Between Object and Surface
Coulomb friction behavior cannot readily be applied to contacts where the normal force is not concentrated at a finite set of points. Zhou et al. [31] therefore devise and experimentally validate a model that directly approximates the limit surface assuming a constant pressure distribution. They parameterize this behavior with a symmetric, scale-invariant, and strictly convex function $H(F) : \mathbb{R}^3 \rightarrow [0, \infty)$, defined over the space of body-axis friction wrenches $F = \tilde{R}(q_\theta)^T F_o$. The physical meaning of this function is as follows:
– The set of possible static friction wrenches is $F = \{H(F) \leq 1\}$.
– If the contact is sliding, then the maximum dissipation principle requires that F be on the boundary of $F$, $\{H(F) = 1\}$. Furthermore, V must lie in the normal cone of $F$ at F. As $F$ is strictly convex, the latter condition is exactly satisfied by
$$\exists k \geq 0,\ V = k\nabla H(F). \qquad (10)$$

We note that (10) will also hold in the case of static friction with k = 0.
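As a concrete illustration of (10) under the ellipsoidal form $H(F) = F^T\tilde{A}F$ used in the next subsection, $\nabla H(F) = 2\tilde{A}F$, so the direction of a sliding object's body twist is determined by $\tilde{A}F$ while the non-negative scale k remains free. A minimal numpy sketch, with the wrench and ellipsoid parameters as hypothetical inputs:

```python
import numpy as np

def sliding_twist_direction(A_tilde, F):
    """Direction of the body twist V = k * grad H(F) for H(F) = F^T A_tilde F.

    grad H(F) = 2 * A_tilde @ F; only the direction of V is fixed by the
    limit-surface model, since the scale k >= 0 is left free."""
    g = 2.0 * A_tilde @ F
    n = np.linalg.norm(g)
    return g / n if n > 0 else g     # zero wrench -> static contact (k = 0)

A_tilde = np.diag([1.0, 1.0, 0.5])   # hypothetical ellipsoid parameters
F = np.array([1.0, 0.0, 0.2])        # hypothetical friction wrench on the boundary
print(sliding_twist_direction(A_tilde, F))
```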
3.4 Friction Behavior as an LCP
We examine the quasi-static model of Zhou et al. [31], which composes both point and surface contacts. From (10), if H(F) has the ellipsoid form $F^T\tilde{A}F$, $\tilde{A} \succ 0$,

$$V = k\tilde{A}F = k\tilde{A}\tilde{R}(q_\theta)^T F_o, \qquad (11)$$
$$\dot{q}_o = \tilde{R}(q_\theta)V = k\tilde{R}(q_\theta)\tilde{A}\tilde{R}(q_\theta)^T F_o = kAF_o, \qquad (12)$$

where $A = \tilde{R}(q_\theta)\tilde{A}\tilde{R}(q_\theta)^T \succ 0$. Assuming perfect velocity control (i.e. $\dot{q}_m = u$), (7)–(9) reduce to LCP(M, w(u)), a generalization of (27) in [31], where

$$M = \begin{bmatrix} J_{N,o}AJ_{N,o}^T & J_{N,o}AJ_{T,o}^T & 0 \\ J_{T,o}AJ_{N,o}^T & J_{T,o}AJ_{T,o}^T & E \\ \mu & -E^T & 0 \end{bmatrix}, \qquad (13)$$

$$w(u) = \begin{bmatrix} J_{N,m}u \\ J_{T,m}u \\ 0 \end{bmatrix}. \qquad (14)$$

Each $z \in \mathrm{SOL}(M, w(u))$ is a choice for $[k\lambda; \gamma]$ that complies with both the point and surface contact models. While solutions for k and λ are not computed separately, their product is sufficient to calculate object velocity. For simplicity, we will denote z as $[\lambda; \gamma]$ for the rest of this paper. We also define $F_u = \{\lambda : \exists\gamma, [\lambda; \gamma] \in \mathrm{SOL}(M, w(u))\}$, the set of feasible point contact forces for command velocity u. For u = 0, $F_u = F_0$ is the set of admissible internal forces [16], i.e. forces that map to zero net force and torque on the object ($F_o = 0$). For non-quadratic descriptions of H(F), low accuracy solutions may be computed quickly by approximating H as an ellipsoid. If higher accuracy solutions are required, one may solve a sequence of programs such that A in the jth program is equal to the Hessian of H evaluated at a solution of the (j − 1)th program.
4 Finite Velocity Feedback Quasi-statics
While the above formulation has been successful at simulating pushes [31] and planning grasps under stochasticity [30], the range of applications of this method is significantly limited due to undefined and ambiguous behaviors as displayed
in Fig. 2. Grasping and jamming commands may result in manipulator velocities that cannot be realized without penetrating (see Lemma 1, [31]). Additionally, when the manipulator is commanded to graze the object (that is, the commanded velocities of all the contact points are parallel to the object boundary), there can be an infinite set of possible solutions which result in wildly different motions. The source of this ambiguity is a critical assumption of the perfect velocity control model—that there exists a feedback controller internal to the manipulator that has high enough (essentially, infinite) gain to overcome any external disturbance. From this perspective, grasping and jamming are undefined as they prescribe two unstoppable objects to oppose each other (“∞ − ∞”-like behavior). Additionally, while exact execution of a grazing maneuver may produce zero external disturbance for the manipulator controller to balance, a small perturbation of the commanded velocities towards the inward normal of the object surface could transform the command to an infeasible grasp or jam.
Fig. 2. Left: example of a configuration for which a quasi-static model assuming perfect velocity control does not yield any feasible solutions for q˙ o . Here, we have a square object (blue squares) with which a four-fingered manipulator (red circles) makes full contact. There are no generalized velocities for the object that would allow the fingers to move along their commanded trajectory (yellow arrows). Right: a configuration that yields ambiguous behavior. Depending on the normal force on the individual fingers, the object may remain stationary, slide upwards, or slide downwards.
In order to resolve these issues, we explicitly interpret the quasi-statics of perfect velocity control as the limiting case of the quasi-statics of finite, stable, linear feedback. That is, we assume some relative feedback gain matrix $B \succ 0$ and scaling factor c such that the generalized force due to contact exerted on the manipulator is balanced by a feedback torque

$$-F_m = \frac{1}{c}B^{-1}(u - \dot{q}_m), \qquad (15)$$

where we note that the gain is inversely proportional to c and the scaling of B. While (13) and (14) assumed that the manipulator velocity directly tracked the desired velocity, $\dot{q}_m = u$, we instead solve (15) for $\dot{q}_m$ and construct a new program LCP(M(cB), w(u)) that accounts for the modified velocity:

$$\dot{q}_m = u + cBF_m, \qquad (16)$$

$$M(cB) = \begin{bmatrix} J_{N,o}AJ_{N,o}^T & J_{N,o}AJ_{T,o}^T & 0 \\ J_{T,o}AJ_{N,o}^T & J_{T,o}AJ_{T,o}^T & E \\ \mu & -E^T & 0 \end{bmatrix} + \begin{bmatrix} J_m cBJ_m^T & 0 \\ 0 & 0 \end{bmatrix}, \qquad (17)$$

$$w(u) = \begin{bmatrix} J_{N,m}u \\ J_{T,m}u \\ 0 \end{bmatrix}. \qquad (18)$$
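Relative to (13), the only change in (17) is the additive compliance block $J_m cBJ_m^T$ in the upper-left $3k \times 3k$ corner, while w(u) is unchanged. A minimal sketch, reusing the assembly function sketched in Sect. 3.4 (names are again our assumptions):

```python
import numpy as np

def add_finite_gain_feedback(M, JNm, JTm, c, B):
    """Form M(cB) of (17) from the perfect-velocity-control matrix M of (13).

    Only the upper-left (3k x 3k) block, pairing contact velocities with
    contact forces, gains the manipulator-compliance term Jm (cB) Jm^T."""
    Jm = np.vstack([JNm, JTm])       # stacked (3k, n) manipulator Jacobian
    k3 = Jm.shape[0]
    Mc = M.copy()
    Mc[:k3, :k3] += Jm @ (c * B) @ Jm.T
    return Mc
```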
While we have eliminated the perfect velocity control assumption, our formulation introduces no additional computational complexity, as our LCP is of equal dimension. We now prove that our model is well-behaved in the sense that solutions are guaranteed to exist (Theorem 1); the manipulator velocity remains bounded as c → 0 (Theorem 2); and solutions typically converge to the perfect velocity control case when it is feasible (Theorem 4).

Theorem 1. For all (c, B, u), if $c > 0$ and $B \succ 0$, $\mathrm{SOL}(M(cB), w(u)) \neq \emptyset$.

Proof. Let $z \geq 0$. As all the entries of μ are non-negative, $\gamma^T\mu\lambda_N \geq 0$. We also have that $F_o^TAF_o \geq 0$ and $F_m^TcBF_m \geq 0$ as $A \succ 0$ and $cB \succ 0$. Therefore,

$$z^TM(cB)z = F_o^TAF_o + F_m^TcBF_m + \gamma^T\mu\lambda_N \geq 0, \qquad (19)$$

and M(cB) is copositive. Now, suppose $z \in \mathrm{SOL}(M(cB), 0)$. We must have

$$F_o^TAF_o + F_m^TcBF_m + \gamma^T\mu\lambda_N = 0, \qquad (20)$$

which implies $F_m = 0$. Therefore, $z^Tw(u) = F_m^Tu = 0$. The result follows from Proposition 1.
If the desired velocity u lies within a neighborhood of infeasible velocities (symptomatic of grasping or jamming behavior), then as we take c → 0, the force balance in (15) implies that $F_m$, and therefore some of the individual finger forces, will grow unboundedly. However, the net manipulator velocity error $\dot{q}_m - u = cBF_m = BJ_m^T(c\lambda)$ remains bounded due to the boundedness of $cF_m$:

Theorem 2. Let $B \succ 0$. There exists $r(B) \in \mathbb{R}$ such that for all $c > 0$, for all $z \in \mathrm{SOL}(M(cB), w(u))$, $c\|F_m\|_2 \leq r(B)\|u\|_2$.

Proof. Let $z \in \mathrm{SOL}(M(cB), w(u))$. From Definition 1, we have that

$$z \geq 0, \qquad (21)$$
$$F_o^TAF_o + F_m^TcBF_m + \gamma^T\mu\lambda_N + F_m^Tu = 0. \qquad (22)$$

Let $\lambda_{min} > 0$ be the minimum eigenvalue of B. Noting $A \succ 0$, we have that

$$cF_m^TBF_m + F_m^Tu \leq 0, \qquad (23)$$
$$\lambda_{min}c\|F_m\|_2^2 + F_m^Tu \leq 0, \qquad (24)$$
$$\lambda_{min}c\|F_m\|_2^2 \leq \|F_m\|_2\|u\|_2, \qquad (25)$$
$$c\|F_m\|_2 \leq \frac{1}{\lambda_{min}}\|u\|_2. \qquad (26)$$
Intuitively, if individual finger forces grow without bound, yet the net force remains bounded, then there must be a "canceling out" effect. More precisely, the portion of $\lambda_i$ that experiences this growth must be an internal force:

Theorem 3. For all sequences $(z_i)_{i\in\mathbb{N}}$ and $(c_i)_{i\in\mathbb{N}}$ such that $c_i \to 0$ and $z_i \in \mathrm{SOL}(M(c_iB), w(u))$, $c_i\lambda_i \to F_0$.

Proof. Let $\tilde{F}_0 = \{\lambda \geq 0 : J_o^T\lambda = 0 \wedge \mu\lambda_N - E^T\lambda_T \geq 0\}$. $\tilde{F}_0 \subseteq F_0$ by constructing $\tilde{z} = [\tilde{\lambda}; 0] \in \mathrm{SOL}(M(0), 0)$ from $\tilde{\lambda} \in \tilde{F}_0$. It is therefore sufficient to show $c_i\lambda_i \to \tilde{F}_0$. We first note that $c_iz_i \geq 0$, and $\mu c_i\lambda_{N_i} - E^Tc_i\lambda_{T_i} \geq 0$ follow directly from $z_i \in \mathrm{SOL}(M(c_iB), w(u))$. Multiplying (22) by $c_i^2$, we have

$$c_iF_{o,i}^TAF_{o,i}c_i + c_i^3F_{m,i}^TBF_{m,i} + c_i\gamma_i^T\mu\lambda_{N_i}c_i + c_i^2F_{m,i}^Tu = 0. \qquad (27)$$

In the limit, the second and fourth terms in this equation vanish due to the boundedness of $c_iF_{m,i}$, rendering

$$c_iF_{o,i}^TAF_{o,i}c_i + c_i\gamma_i^T\mu\lambda_{N_i}c_i \to 0. \qquad (28)$$

As $A \succ 0$, $c_iF_{o,i} = J_o^Tc_i\lambda_i \to 0$, and therefore $c_i\lambda_i \to \tilde{F}_0$.
$F_0$ can therefore be considered a basis from which one can generate the errors in the manipulator velocity; for c sufficiently close to 0, there exists a $\delta \in F_0$ such that $\dot{q}_m - u \approx BJ_m^T\delta$. When there are no non-zero internal forces, the manipulator displacement approaches zero, and the solution of the finite linear feedback model approaches the behavior for perfect velocity control:

Theorem 4. Suppose that M(0) is chosen such that $F_0 = \{0\}$. For all sequences $(z_i)_{i\in\mathbb{N}}$, $(c_i)_{i\in\mathbb{N}}$ such that $c_i \to 0$ and $z_i \in \mathrm{SOL}(M(c_iB), w(u))$, then $\lambda_i \to F_u$.

Proof. Assume the contrary, so that there exists $(c_i)_{i\in\mathbb{N}} \to 0$ and $(z_i)_{i\in\mathbb{N}}$ bounded away from $\mathrm{SOL}(M(0), w(u))$ such that $z_i \in \mathrm{SOL}(M(c_iB), w(u))$. Letting $\dot{q}_i = u + c_iBF_{m,i}$, we have that $z_i \in \mathrm{SOL}(M(0), w(\dot{q}_i))$. $c_i\lambda_i \to 0$, given by Theorem 3, implies $\dot{q}_i \to u$. For all i, we observe the complementarity condition

$$F_{o,i}^TAF_{o,i} + \gamma_i^T\mu\lambda_{N_i} + F_{m,i}^T\dot{q}_i = 0, \qquad (29)$$

which implies that $\gamma_i^T\mu\lambda_{N_i}$ is bounded for bounded $\|\lambda_i\|_2$. As $F_0 = \{0\}$, for all $z \in \mathrm{SOL}(M(0), 0)$, we have $z^Tw(u) = 0$. Therefore by Proposition 1, $F_u$ and $\mathrm{SOL}(M(0), w(u))$ are non-empty. If there were a subsequence $z_{i_j}$ such that $\lambda_{i_j}$ were bounded, then from (29) and the non-emptiness of $F_u$, $\lambda_{i_j}$ would have a limit point in $F_u$, violating our assumptions. Therefore, we must have $\|\lambda_i\|_2 \to \infty$. Similar to Theorem 3, by dividing (29) by $\|\lambda_i\|_2^2$,

$$\frac{1}{\|\lambda_i\|_2}F_{o,i}^TAF_{o,i}\frac{1}{\|\lambda_i\|_2} \to 0, \qquad (30)$$

and thus $\tilde{\lambda}_i = \frac{\lambda_i}{\|\lambda_i\|_2} \to \tilde{F}_0 = \{0\}$. But $\|\tilde{\lambda}_i\|_2 = 1$, a contradiction.
We note that $F_0 = \{0\}$ implies the current configuration is not in force closure [16]. In practice, we expect the set of non-force-closure configurations that have $F_0 \neq \{0\}$ to be small. We also expect perfect velocity control models to behave poorly during force closure, as they exhibit grasping behavior. Therefore, this model will behave more realistically in a variety of scenarios, with minimal accuracy loss for the non-prehensile commands that perfect velocity control models handle well already.
5 Time-Stepping Scheme
Despite the guarantee (Theorem 1) that (17) and (18) provide feasible instantaneous velocity solutions, embedding the LCP into common ODE schemes is an incomplete approach, as the resulting velocities will be discontinuous whenever contact is initiated between the manipulator and objects. Anitescu and Potra [1] resolved a similar issue in their formulation for 3D multibody simulation by root-finding the first sub-time-step impact. While a similar modification could be applied to our LCP, the ability to resolve sub-time-step impacts in a single LCP would be beneficial. To that end, we take inspiration from Stewart and Trinkle [25], and instead formulate an alternative LCP that explicitly models the positions of the manipulator and object at the end of the time-step. To construct this new program, instead of solving for the force at a particular time t, we solve for the net normal and tangential impulse between times t and t + h at each contact, $\Lambda_N$ and $\Lambda_T$. Using the superscripts − and + to denote values calculated at the beginning and end of this interval, we linearize φ as

$$\Delta\phi = \phi^+ - \phi^- \approx \frac{\partial\phi}{\partial q_o}\Delta q_o + \frac{\partial\phi}{\partial q_m}\Delta q_m = J_{N,o}^-\Delta q_o + J_{N,m}^-\Delta q_m, \qquad (31)$$

and we make a first-order approximation of $\Delta q_o$ and $\Delta q_m$:

$$\Delta q_o = q_o^+ - q_o^- \approx h\dot{q}_o^+ \approx A\left(J_{N,o}^{-T}\Lambda_N + J_{T,o}^{-T}\Lambda_T\right), \qquad (32)$$
$$\Delta q_m = q_m^+ - q_m^- \approx h\dot{q}_m^+ \approx cB\left(J_{N,m}^{-T}\Lambda_N + J_{T,m}^{-T}\Lambda_T\right) + hu^-. \qquad (33)$$

$\Lambda_N$ and $\Lambda_T$ are subject to the complementarity constraints

$$0 \leq \phi^+ \perp \Lambda_N \geq 0, \qquad (34)$$
$$0 \leq J_{T,o}\Delta q_o + J_{T,m}\Delta q_m + E\gamma \perp \Lambda_T \geq 0, \qquad (35)$$
$$0 \leq \mu\Lambda_N - E^T\Lambda_T \perp \gamma \geq 0. \qquad (36)$$

Arranging into the standard format, we arrive at $\mathrm{LCP}(\tilde{M}(cB), \tilde{w}(u^-))$, where

$$\tilde{M}(cB) = M(cB)^-, \qquad (37)$$
$$\tilde{w}(u^-) = w(hu^-)^- + \begin{bmatrix} \phi^- \\ 0 \end{bmatrix}. \qquad (38)$$
With the exception of the added $\phi^-$ in $\tilde{w}$, this program is identical to an instantiation of the velocity formulation defined in (17) and (18). Noting that $\phi^- \geq 0$ for any feasible initial condition, a trivial extension of Theorem 1 guarantees existence of solutions for all feasible initial conditions. This is a significant improvement over existing time-stepping schemes for the full dynamic behavior, as it circumvents both the expensive root-finding subroutine of [1] and the non-existence issues in [25]. However, it is important to note that if φ(q) is non-linear, the linearization in (31) does not guarantee that the true value of $\phi^+$ will be feasible (positive), even if $\phi^-$ is feasible. In these cases, similar to [25], one may rectify this issue by solving a sequence of problems in a fixed-point iteration scheme, linearizing the problem about the best current estimate for $(q_o^+, q_m^+)$. This iteration can be conducted at the same time as the iteration for a non-ellipsoidal H. We additionally note that setting B = 0 gives a time-stepping reformulation of the program in [31].
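A single step of this scheme can then be sketched as follows. This is a hedged outline rather than the authors' implementation: it reuses the hypothetical helpers sketched in Sects. 3.4 and 4, folds the scaling c into B, assumes a user-supplied LCP solver (e.g. Lemke's algorithm or PATH), and omits the fixed-point re-linearization needed when φ is non-linear.

```python
import numpy as np

def time_step(q_o, q_m, u, h, phi, jacobians, A, B, mu, solve_lcp):
    """One quasi-static step via the LCP of (37)-(38).

    phi(q_o, q_m) -> (k,) signed distances; jacobians(q_o, q_m) ->
    (JNo, JTo, JNm, JTm) evaluated at the start of the interval;
    solve_lcp(M, w) -> z is an external complementarity solver."""
    JNo, JTo, JNm, JTm = jacobians(q_o, q_m)
    M, w = lcp_matrices_perfect_velocity(JNo, JTo, JNm, JTm, A, mu, h * u)
    M = add_finite_gain_feedback(M, JNm, JTm, 1.0, B)  # c folded into B here
    k = JNo.shape[0]
    w[:k] += phi(q_o, q_m)              # the added phi^- term of (38)
    z = solve_lcp(M, w)
    Lam = z[:3 * k]                     # impulses [Lambda_N; Lambda_T]
    Jo = np.vstack([JNo, JTo])
    Jm = np.vstack([JNm, JTm])
    dq_o = A @ Jo.T @ Lam               # (32)
    dq_m = B @ Jm.T @ Lam + h * u       # (33), again with c folded into B
    return q_o + dq_o, q_m + dq_m
```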
6 Examples
We provide a few examples to illustrate the capabilities and accuracy of our approach. Three examples from our open source MATLAB library (https://github.com/mshalm/quasistaticGrasping) are provided in conjunction with a video depiction (https://www.youtube.com/watch?v=1wAH5o3OLck). Additionally, we compare the output of our model to a fully dynamic, compliant simulation using Drake [26]. All LCPs associated with our model are solved with PATH [4].
6.1 Pushing with Two Fingers
We first consider a flat disk of radius 1 m pushed by two fully-actuated point fingers. The coefficient of friction between the fingers and the disk is set at 1, and the force-motion map is set as $H(F_o) = F_o^TF_o$ such that $\nabla^2 H = A = I_3$. We consider the configuration $q_m = [q_{x,1}\ q_{y,1}\ q_{x,2}\ q_{y,2}]^T$, which directly represents the x–y coordinates of each finger in a fixed frame. We assume each coordinate is controlled independently with equal gain, such that $B = I_4$ m s⁻¹ N⁻¹. We now empirically evaluate the validity of Theorem 4. Analytically, we expect that as we take $c_i \to 0$, finite feedback simulation should converge to perfect velocity command tracking. We simulate three motions over $t \in (0, 10)$ s at 40 Hz for various feedback scaling terms $c_i$, and plot the corresponding object trajectories $q_{o,i}(t)$ in Fig. 3. We compare the final configuration of each finite feedback trajectory with $q_{o,\infty}(t)$, the trajectory resultant from executing the same manipulator command with perfect velocity control. For sufficiently high gains, we see a linear relationship between the log of the feedback scaling term and that of $e_i = \|q_{o,i}(10) - q_{o,\infty}(10)\|_2$, the error in the final pose. However, the low-gain performance, particularly on the semicircular command, showcases an important nuance in the convergent behavior described in Theorem 4.
Fig. 3. Top: three maneuvers are executed at varying feedback intensities: a symmetric push, an asymmetric push, and a semicircular push. The black dotted lines and circles represent the center of mass trajectory and final pose for $q_{o,\infty}$. The finite feedback trajectories are represented similarly, with the color shifting from yellow to blue as the feedback gain $c_i^{-1}$ increases. Bottom: log-log plot comparing $c_i$ and $e_i$ for the corresponding commands.
While the method may converge over a single time-step, differences in individual time-steps may accumulate such that there is a change in the contact mode during the motion. In these low-gain cases, the fingers tend to slide off the sides of the object, leaving the object behind its intended goal. We now show our formulation's ability to simulate through jamming motions that perfect velocity control models cannot capture. We attempt to push the object into and roll it along a wall. We use identical simulation parameters, and plot the results in Fig. 4. We can see that the perfect velocity control simulation terminates at the collision with the wall, while the finite gain formulation not only captures the collision, but also still exhibits convergent behavior thereafter.
6.2 Polygonal Peg-in-hole
We apply our method to a peg-in-hole problem, a more complex task requiring initiation and release of several contacts and gripping motions. We simulate a new system consisting of a thin rectangular peg, two triangular manipulator fingers, and a slot into which the peg is fit. The slot has a 1% tolerance on the width of the peg, the relative feedback gains are set to $B = I_6$ (in m s⁻¹ N⁻¹ units for linear terms and m⁻¹ s⁻¹ N⁻¹ for rotational terms), and the feedback scaling is set to c = 0.01. In a hand-designed trajectory, the peg catches the corner of the slot in the initial insertion attempt, after which the manipulator reorients and successfully inserts it.
Fig. 4. Top: motion on the same system displayed in Fig. 3, except with the addition of a wall located at yW = 0, is shown in a similar manner. As the commanded motion involves squeezing the object against the wall, the perfect velocity tracking program (black) terminates at the first time-step at which the squeezing occurs. Bottom: a single trajectory for the polygonal peg-in-hole system segmented into four phases. Far left: grasping. Center left: jamming against the right side of the slot. Center right: twisting into the slot. Far right: final insertion.
For each time step, only bodies close enough to make contact are considered, allowing the LCPs to be kept small. We set our time step to 50 ms, for which PATH is able to compute solutions in 1.51 ms on average. The trajectory is displayed in Fig. 4.
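A minimal sketch of this distance-based contact culling, with the threshold value as our own illustrative assumption (it must upper-bound how far bodies can approach each other within one step):

```python
import numpy as np

def active_pairs(phi_all, dist_threshold=0.05):
    """Indices of candidate contact pairs kept in this step's LCP.

    phi_all: (k_total,) signed distances for every candidate pair; pairs
    separated by more than dist_threshold are dropped from the program."""
    return np.flatnonzero(phi_all < dist_threshold)
```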
6.3 Comparison to Full Dynamics
We now evaluate the similarity of our model to Drake [26], a framework for simulation of rigid-body dynamics with contact. We simulate a square object of mass 0.01 kg and side length 0.4 m in contact with four manipulators, each consisting of two 1 m long links with a circular contact at the end of the second link. The manipulators pinch the object, which is particularly numerically challenging to simulate. While the velocity commands are symmetric, each manipulator is driven with different feedback gains, causing the object to move in the $+y_W$ direction and spin in response to the pinch. The dynamic and quasi-static simulations exhibit qualitatively similar behavior, shown in Fig. 5.
Fig. 5. Left: the initial condition for both object pinching trajectories. Center: an overlay of the final conditions for the dynamic (yellow wireframe) and quasi-static (solid blue) trajectories. Right: a comparison of the quasi-static and dynamic trajectories of the center of mass of the object. The object moves slightly more in the $y_W$ direction in the dynamic trajectory than it does under quasi-statics.
7 Discussion and Future Work
We have presented a method for generating motion using a quasi-static model for planar manipulation. By explicitly modeling manipulator feedback behavior, our method is able to synthesize physically-grounded motion in the face of infeasible velocity commands. Despite this added capability, our method is as computationally efficient as previous methods. We have validated our model theoretically and empirically by proving convergence to an established model. An interesting property of many quasi-static manipulation models is that they allow decoupling of manipulator dynamics from object motion. However, we assert that Theorem 3 implies that controller choice inherently affects grasping equilibrium and is therefore an unavoidable component of a physically-grounded grasping model. We also view the decoupling of the relative feedback gains from the overall feedback intensity as essential for the accuracy of our model. As the force scaling term k of (10) is not explicitly determined, even given true manipulator gains, one cannot synthesize M(cB) with complete accuracy. However, in light of Theorems 3 and 4, if the controller has high enough gain, a small enough choice for c will produce accurate results. In these cases, c is a numerical term rather than something physically meaningful; one should choose c as small as possible without degrading numerical precision in the construction of M(cB). One might suspect that embedding a high-gain controller into the model induces stiff behaviors, as the corresponding full dynamics are stiff. However, our quasi-static assumptions happen to eliminate this behavior. As Theorem 4 proves convergence to the perfect velocity control model in most cases and Theorem 2 proves that velocities are bounded in terms of u otherwise, our feedback term does not add significant stiffness. Some numerical precision may however be lost for small c due to round-off error or poor conditioning of the associated LCPs. In future work, we will conduct a quantitative analysis of this behavior.
We do not expect solutions to our LCPs to be unique, as non-uniqueness is pervasive in complementarity-based contact models [2,24]. While our model does construct a unique mapping between $F_m$ and $\dot{q}_m$, the assumptions required for the mapping between $\dot{q}_o$ and u to be unique are not met. However, it does disambiguate some grazing cases (such as in Fig. 2). Future extension of this result to 3D motion poses significant challenges. In the 2D case, the contact between the object and the surface below generates a unique map from contact forces to object motion. In 3D motion, the manipulator must instead counteract gravity. Furthermore, quasi-static modeling cannot realistically capture certain actions; for instance, dropping the object may either result in a lack of solution, or in a $\Delta q_o$ large enough to make linearization of φ inaccurate. Possible applications of this model include controller synthesis through sums-of-squares based Lyapunov analysis and model predictive control, as well as planning via trajectory optimization methods.

Acknowledgements. This material is based upon work supported by the National Science Foundation under Grant No. CMMI-1830218. The authors thank Joury Abdeljaber for her work on numerical experiments.
References
1. Anitescu, M., Potra, F.A.: A time-stepping method for stiff multibody dynamics with contact and friction. Int. J. Numer. Methods Eng. 55(7), 753–784 (2002)
2. Brogliato, B.: Nonsmooth Mechanics. Springer, London (1999)
3. Cottle, R.W., Pang, J.S., Stone, R.E.: The Linear Complementarity Problem. Society for Industrial and Applied Mathematics, Philadelphia (2009)
4. Dirkse, S.P., Ferris, M.C.: The PATH solver: a nonmonotone stabilization scheme for mixed complementarity problems. Optim. Methods Softw. 5(2), 123–156 (1995)
5. Dogar, M.R., Srinivasa, S.S.: A planning framework for non-prehensile manipulation under clutter and uncertainty. Auton. Robots 33, 217–236 (2012)
6. Erdmann, M.: Using backprojections for fine motion planning with uncertainty. Int. J. Robot. Res. 5(1), 19–45 (1986)
7. Hogan, F.R., Grau, E.R., Rodriguez, A.: Reactive planar manipulation with convex hybrid MPC. In: Proceedings of the 2018 IEEE International Conference on Robotics and Automation (2018)
8. Goyal, S., Ruina, A., Papadopoulos, J.: Planar sliding with dry friction part 1. Limit surface and moment function. Wear 143(2), 307–330 (1991)
9. Haas-Heger, M., Iyengar, G., Ciocarlie, M.: Passive reaction analysis for grasp stability. IEEE Trans. Autom. Sci. Eng. 15(3), 955–966 (2018)
10. Kloss, A., Schaal, S., Bohg, J.: Combining learned and analytical models for predicting action effects. CoRR abs/1710.04102 (2017). http://arxiv.org/abs/1710.04102
11. Lynch, K.M.: The mechanics of fine manipulation by pushing. In: Proceedings of the 1992 IEEE International Conference on Robotics and Automation, vol. 3, pp. 2269–2276, May 1992
12. Lynch, K.M., Mason, M.T.: Stable pushing: mechanics, controllability, and planning. Int. J. Robot. Res. 15(6), 533–556 (1996)
13. Manchester, Z., Kuindersma, S.: Variational contact-implicit trajectory optimization. In: International Symposium on Robotics Research (ISRR), Puerto Varas, Chile (2017)
14. Mason, M.T.: Mechanics and planning of manipulator pushing operations. Int. J. Robot. Res. 5(3), 53–71 (1986)
15. Mordatch, I., Lowrey, K., Todorov, E.: Ensemble-CIO: full-body dynamic motion planning that transfers to physical humanoids. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5307–5314, September 2015
16. Murray, R.M., Sastry, S.S., Zexiang, L.: A Mathematical Introduction to Robotic Manipulation, 1st edn. CRC Press Inc., Boca Raton (1994)
17. Pang, T., Tedrake, R.: A robust time-stepping scheme for quasistatic rigid multibody systems (2018)
18. Paolini, R., Rodriguez, A., Srinivasa, S., Mason, M.T.: A data-driven statistical framework for post-grasp manipulation. Int. J. Robot. Res. (IJRR) 33(4), 600–615 (2014)
19. Posa, M., Cantu, C., Tedrake, R.: A direct method for trajectory optimization of rigid bodies through contact. Int. J. Robot. Res. 33(1), 69–81 (2014)
20. Posa, M., Tobenkin, M., Tedrake, R.: Stability analysis and control of rigid-body systems with impacts and friction. IEEE Trans. Autom. Control (TAC) 61(6), 1423–1437 (2016)
21. Rimon, E., Burdick, J.: On force and form closure for multiple finger grasps. In: Proceedings of IEEE International Conference on Robotics and Automation, vol. 2, pp. 1795–1800, April 1996
22. Rodriguez, A., Mason, M.T., Ferry, S.: From caging to grasping. Int. J. Robot. Res. 31(7), 886–900 (2012)
23. Song, P., Kumar, V., Trinkle, J.C., Pang, J.S.: A family of models for manipulation planning. In: 2005 6th IEEE International Symposium on Assembly and Task Planning: From Nano to Macro Assembly and Manufacturing, ISATP 2005, pp. 236–241, July 2005
24. Stewart, D.E.: Rigid-body dynamics with friction and impact. SIAM Rev. 42(1), 3–39 (2000)
25. Stewart, D., Trinkle, J.: An implicit time-stepping scheme for rigid body dynamics with inelastic collisions and Coulomb friction. Int. J. Numer. Methods Eng. 39, 2673–2691 (1996)
26. Tedrake, R., The Drake Development Team: Drake: a planning, control, and analysis toolbox for nonlinear dynamical systems (2016). https://drake.mit.edu
27. Trinkle, J.C.: A quasi-static analysis of dextrous manipulation with sliding and rolling contacts. In: Proceedings of 1989 International Conference on Robotics and Automation, vol. 2, pp. 788–793 (1989)
28. Trinkle, J.C., Berard, S., Pang, J.S.: A time-stepping scheme for quasistatic multibody systems. In: 2005 6th IEEE International Symposium on Assembly and Task Planning: From Nano to Macro Assembly and Manufacturing, ISATP 2005, pp. 174–181, July 2005
29. Whitney, D.E.: Force feedback control of manipulator fine motions. J. Dyn. Syst. Meas. Contr. 99(2), 91 (1977)
30. Zhou, J., Paolini, R., Johnson, A.M., Bagnell, J.A., Mason, M.T.: A probabilistic planning framework for planar grasping under uncertainty. IEEE Robot. Autom. Lett. 2(4), 2111–2118 (2017)
31. Zhou, J., Mason, M.T., Paolini, R., Bagnell, D.: A convex polynomial model for planar sliding mechanics: theory, application, and experimental validation. Int. J. Robot. Res. 37(2–3), 249–265 (2018)
GP-SUM. Gaussian Process Filtering of non-Gaussian Beliefs

Maria Bauza and Alberto Rodriguez

Mechanical Engineering Department, Massachusetts Institute of Technology, Cambridge, USA
{bauza,albertor}@mit.edu
Abstract. This work studies the problem of stochastic dynamic filtering and state propagation with complex beliefs. The main contribution is GP-SUM, a filtering algorithm tailored to dynamic systems and observation models expressed as Gaussian Processes (GP), and to states represented as a weighted Sum of Gaussians. The key attribute of GP-SUM is that it does not rely on linearizations of the dynamic or observation models, or on unimodal Gaussian approximations of the belief, hence enables tracking complex state distributions. The algorithm can be seen as a combination of a sampling-based filter with a probabilistic Bayes filter. On the one hand, GP-SUM operates by sampling the state distribution and propagating each sample through the dynamic system and observation models. On the other hand, it achieves effective sampling and accurate probabilistic propagation by relying on the GP form of the system, and the sum-of-Gaussian form of the belief. We show that GP-SUM outperforms several GP-Bayes and Particle Filters on a standard benchmark. We also demonstrate its use in a pushing task, predicting with experimental accuracy the naturally occurring non-Gaussian distributions.
1 Introduction
Robotics and uncertainty come hand in hand. One of the defining challenges of robotics research is to design uncertainty-resilient behavior to overcome noise in sensing, actuation and/or dynamics. This paper studies the problems of simulation and filtering in systems with stochastic dynamics and noisy observations, with a particular interest in cases where the state belief cannot be realistically approximated by a single Gaussian distribution. Complex multimodal beliefs naturally arise in manipulation tasks where state or action noise can make the difference between contact/separation or between sticking/sliding. For example, the ordinary task of push-grasping a cup of coffee into your hand in Fig. 1 illustrates the naturally occurring multimodality, where the cup will slide to either side of the hand and the handle will rotate accordingly clockwise or anti-clockwise. Dogar and Srinivasa [1] used the observation that a push-grasped object tends to cluster into two clearly distinct outcomes—inside and outside the hand—to plan robust grasp strategies. Similarly, as illustrated in
Fig. 1. Multimodality during the push-grasp of a coffee cup. A small change in the initial contact between the hand and the cup, or a small change in the hand’s motion can produce a very distinct—multimodal—outcome either exposing (left) or hiding (right) the cup’s handle from the hand’s palm.
Fig. 6, a push on an object might make it rotate to either side of the pusher, generating complex—clustered and ring-shaped—state distributions. Multimodality and complex belief distributions have been experimentally observed in a variety of manipulation actions such as planar pushing [2,3], ground impacts [4], and bin-picking [5]. The main contribution of this paper is a new algorithm GP-SUM to track complex state beliefs. GP-SUM is specifically tailored to:

– Dynamic systems expressed as a GP (Gaussian Process).
– States represented as a weighted Sum of Gaussians.

We will show that the algorithm is capable of performing Bayes updates without the need to either linearize the dynamic or observation models or to rely on unimodal Gaussian approximations of the belief. This will be key to enable efficient tracking of complex state beliefs. In Sect. 4 we describe how GP-SUM operates by sampling the state distribution, given by a sum of Gaussians, so it can be viewed as a sampling-based filter. GP-SUM also maintains the basic structure of a Bayes filter by exploiting the GP form of the dynamic and observation models, which allows a probabilistically sound interpretation of each sample, so it can also be viewed as a GP-Bayes filter. In Sect. 5.1, we compare GP-SUM's performance to other existing GP-filtering algorithms such as GP-UKF, GP-ADF and GP-PF in a standard synthetic benchmark [6,7]. GP-SUM yields better filtering results both after a single and multiple filtering steps with a variety of metrics, and requires significantly fewer samples than standard particle filtering techniques. Finally, we also demonstrate that GP-SUM can predict the expected distribution of outcomes when pushing an object. Prior experimental work [3] shows that planar pushing produces heteroscedastic and non-Gaussian distributions after pushes of a few centimeters, i.e., some actions are more deterministic
than others and state distributions can break down into components or become ring-shaped. GP-SUM successfully recovers both when applied to a GP-learned model of planar pushing. We compare the results to the distributions from real trajectories estimated by performing the same pushes 100 times. Both actions and sensed information determine the shape of a belief distribution, and non-linearities in either, when integrated over a finite time, can easily lead to non-Gaussian beliefs. This paper provides an efficient algorithm for tracking complex distributions tailored to the case where the observation and transition models are expressed as GPs, and the state is represented as a weighted sum of Gaussians.
2 Related Work
Gaussian Processes (GPs) are a powerful tool to model the dynamics of complex systems [5,8,9], and have been applied to different aspects of robotics including planning and control [10–12], system identification [8,13,14], or filtering [6,7,15]. In this work we study the problem of accurate propagation and filtering of the state of a stochastic dynamic system. In particular, we address filtering in systems whose dynamics and measurement models are learned through GP regression, which we commonly refer to as GP-Bayes filters. Among these, the most frequently considered are GP-EKF [15], GP-UKF [15] and GP-ADF [6]. Most GP-filters rely on the assumption that at all instants, the state distribution is well captured by a single Gaussian and exploit a variety of approximations to maintain that Gaussianity. For example, GP-EKF is based on the extended Kalman filter (EKF) and linearizes the GP models to guarantee that the final distributions are indeed Gaussian. GP-UKF is based on the unscented Kalman filter (UKF) and predicts a Gaussian distribution for the state using an appropriate set of samples that captures the moments of the state. Finally, GP-ADF computes the first two moments of the state distribution by exploiting the structure of GPs and thus returns a Gaussian distribution for the state. GP-SUM instead is based on sampling from the state distributions and using Gaussian mixtures to represent these probabilities. This links our algorithm to the classical problem of particle filtering where each element of the mixture can be seen as a sample with an associated weight and a Gaussian. As a result, GP-SUM can be understood as a type of sampling algorithm that is tailored to exploit the synergies between a GP-based dynamic model and a Gaussian mixture state to enable efficient and probabilistically-sound Bayes updates. Ko and Fox [15] provide another GP-based sampling filter, GP-PF, based on the classical particle filter. However, when compared to GP-UKF or GP-EKF, GP-PF is less reliable and more prone to inconsistent results [15]. In the broader context of Bayes filtering with non-linear algebraic dynamic and observation models, multiple algorithms have been proposed to recover non-Gaussian state distributions. For instance, there is some resemblance between GP-SUM and the algorithms Gaussian Mixture Filter (GMF) [16], Gaussian Sum Filter (GSF) [17], and Gaussian Sum Particle Filtering (GSPM) [18]; all using
different techniques to propagate the state distributions as a sum of Gaussians. GMF considers a Gaussian mixture model to represent the state distribution, but the covariances of all Gaussians are equal and come from sampling the previous state distribution and computing the covariance of the resulting samples; GP-SUM instead computes the covariance of the mixture directly from the GP system dynamics. GSF is a set of weighted EKFs running in parallel. As a consequence it requires linearization of the system dynamics and observation models while GP-SUM does not. Finally GSPM, which has proved to outperform GSF, is based on the sequential importance sampling filter (SIS) [18]. GSPM takes samples from the importance function which is defined as the likelihood of a state x given an observation z, p(x|z). GP-SUM instead does not need to learn this extra mapping, p(x|z), to effectively propagate the state distributions. Other more task-specific algorithms also relevant to GP-SUM are the multi-hypothesis tracking filter (MHT) [19] and the manifold particle filter (MPF) [20]. MHT is designed to solve a data association problem for multiple target tracking by representing the joint distribution of the targets as a Gaussian mixture. MPF is a sample-based algorithm tailored to dynamic systems involving unilateral contact constraints which induce a decomposition of the state space into subsets of different dimension, e.g., free space versus contact space. MPF exploits an explicit model of the contact manifolds of the system to project the distributions defined by the samples into that manifold. An advantage of GP-SUM is that it can be viewed as both a sampling technique and a parametric filter. Therefore most of the techniques employed for particle filtering are applicable. Similarly, GP-SUM can also be adapted to special types of GPs such as heteroscedastic or sparse GPs. For instance, GP-SUM can be easily combined with the sparse spectrum Gaussian processes (SSGPs) of Pan et al. [7]. Consequently, the learning, propagation and filtering of the dynamics can be made significantly faster.
3 Background on Gaussian Process Filtering
This work focuses on the classical problem of Bayes filtering where the dynamics and observation models are learned through Gaussian process regression. In this section, we introduce the reader to the concepts of Bayes filtering and Gaussian processes.
3.1 Bayes Filters
The goal of a Bayes filter is to track the state of a system, $x_t$, in a probabilistic setting. At time t, we consider that an action $u_{t-1}$ is applied to the system making its state evolve from $x_{t-1}$ to $x_t$. This is followed by an observation of the state, $z_t$. As a result, a Bayes filter computes the state distribution, $p(x_t)$, conditioned on the history of previous actions and observations: $p(x_t|u_{1:t-1}, z_{1:t})$. This probability is often referred to as the belief of the state at time t.
In general, a Bayes filter is composed of two steps: the prediction update and the measurement (or filter) update, following the terminology from [21].

Prediction Update. Given a model of the system dynamics, p(x_t|x_{t−1}, u_{t−1}), the prediction update computes the prediction belief, p(x_t|u_{1:t−1}, z_{1:t−1}), as:

$$p(x_t|u_{1:t-1}, z_{1:t-1}) = \int p(x_t|x_{t-1}, u_{t-1})\, p(x_{t-1}|u_{1:t-2}, z_{1:t-1})\, \mathrm{d}x_{t-1} \qquad (1)$$

where p(x_{t−1}|u_{1:t−2}, z_{1:t−1}) is the belief of the system before action u_{t−1}. Thus the prediction belief can be understood as the pre-observation distribution of the state, while the belief is the post-observation distribution. In general, the integral (1) cannot be solved analytically, and different approximations are used to simplify its computation. The most common simplifications are to linearize the dynamics of the system, as classically done in the Extended Kalman Filter, or to directly assume that the prediction belief, i.e., the result of the integral in (1), is Gaussian distributed [21].

Measurement Update. Given a new measurement of the state, z_t, the belief at time t comes from filtering the prediction belief. The belief is recovered by using Bayes' rule and the observation model of the system, p(z_t|x_t):

$$p(x_t|u_{1:t-1}, z_{1:t}) = \frac{p(z_t|x_t)\, p(x_t|u_{1:t-1}, z_{1:t-1})}{p(z_t|u_{1:t-1}, z_{1:t-1})} \qquad (2)$$
Again, this expression cannot usually be computed in closed form, and we rely on approximations to estimate the new belief. Linearizing the observation model or assuming Gaussianity of the belief are again common approaches [21]. Combining Eqs. (1) and (2), we can express the belief in a recursive manner as a function of the previous belief, the dynamics model, and the observation model:

$$p(x_t|u_{1:t-1}, z_{1:t}) \propto p(z_t|x_t) \int p(x_t|x_{t-1}, u_{t-1})\, p(x_{t-1}|u_{1:t-2}, z_{1:t-1})\, \mathrm{d}x_{t-1} \qquad (3)$$

We will show in Sect. 4 an equivalent recursion for the prediction belief, which is key to GP-SUM. For known systems, we might have algebraic expressions for their dynamics and observation models. In real systems, however, these models are often unknown or inaccurate, and Gaussian Processes are a powerful framework to learn them. The following subsection provides a basic introduction.

3.2 Gaussian Processes
Gaussian Processes (GPs) are a flexible non-parametric framework for function approximation [22]. In this paper, we use GPs to model the dynamics and observation models of a stochastic system. There are several advantages to using GPs over traditional parametric models. First, GPs can learn high-fidelity models from noisy data while estimating the intrinsic noise of the system. Second,
GPs estimate the uncertainty of their predictions given the available data, hence measuring the quality of the regression. GPs provide the value of the expected output together with its variance.

In classical GPs [22], the noise in the output is assumed to be Gaussian and constant over the input: y(x) = f(x) + ε, where f(x) is the latent or unobserved function to regress, y(x) is a noisy observation of this function, and ε ∼ N(0, σ²) represents zero-mean Gaussian noise with constant variance σ². The assumption of constant Gaussian noise together with a GP prior on the latent function f(x) makes analytical inference possible for GPs. In practice, to learn a GP model over f(x) one only needs a set of training points, D = {(x_i, y_i)}_{i=1}^n, and a kernel function, k(x, x′). Given a new input x_*, a GP assigns a Gaussian distribution to the output y_* = y(x_*) expressed as:

$$p(y_*|x_*, D, \alpha) = \mathcal{N}(y_*\,|\,a_*,\, c_*^2 + \sigma^2) \qquad (4)$$
$$a_* = k_*^T (K + \sigma^2 I)^{-1} y$$
$$c_*^2 = k_{**} - k_*^T (K + \sigma^2 I)^{-1} k_*$$
where K is the matrix that evaluates the kernel at the training points, [K]_{ij} = k(x_i, x_j), k_* is a vector with [k_*]_i = k(x_i, x_*), and k_{**} is the value of the kernel at x_*, k_{**} = k(x_*, x_*). Finally, y represents the vector of observations from the training set, and α is the set of hyperparameters, which includes σ² together with the kernel parameters; these are optimized during the training process. In this work we consider the ARD-SE kernel [22], which provides smooth representations of f(x) during GP regression and is the most common kernel employed in the GP literature. However, it is possible to extend our algorithm to other kernel functions, as is done in [7].
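To make Eq. (4) concrete, the following is a minimal sketch of GP prediction with an ARD-SE kernel, assuming fixed hyperparameters (no training-time optimization is shown); the function names are our own, not from the paper.

```python
import numpy as np

def ard_se_kernel(X1, X2, lengthscales, signal_var):
    """ARD-SE kernel: k(x, x') = s^2 exp(-0.5 * sum_d (x_d - x'_d)^2 / l_d^2)."""
    diff = X1[:, None, :] - X2[None, :, :]            # shape (n1, n2, d)
    sq = np.sum((diff / lengthscales) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq)

def gp_predict(X, y, x_star, lengthscales, signal_var, noise_var):
    """Mean a_* and variance c_*^2 + sigma^2 of the predictive distribution, Eq. (4)."""
    K = ard_se_kernel(X, X, lengthscales, signal_var)
    k_star = ard_se_kernel(X, x_star[None, :], lengthscales, signal_var)[:, 0]
    k_ss = ard_se_kernel(x_star[None, :], x_star[None, :], lengthscales, signal_var)[0, 0]
    # Solve (K + sigma^2 I) alpha = y via Cholesky for numerical stability.
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    a_star = k_star @ alpha                            # predictive mean
    v = np.linalg.solve(L, k_star)
    c2_star = k_ss - v @ v                             # latent-function variance
    return a_star, c2_star + noise_var
```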
4 GP-SUM Bayes Filter
In this section we present GP-SUM, discuss its main assumptions, and describe its computational complexity. Given that GP-SUM is a GP-Bayes filter, our main assumption is that both the dynamics and the measurement models are represented by GPs. This implies that for any state x_{t−1} and action u_{t−1} the probabilities p(x_t|x_{t−1}, u_{t−1}) and p(z_t|x_t) are modeled as Gaussians. To keep track of complex beliefs, GP-SUM does not approximate them by single Gaussians, but makes the weaker assumption that they are well approximated by a sum of Gaussians. Given this assumption, in Sect. 4.1 we exploit the fact that the transition and observation models are GPs to correctly propagate the prediction belief, i.e., the pre-observation state distribution. In Sect. 4.2 we obtain a closed-form solution for the belief expressed as a Gaussian mixture.

4.1 Updating the Prediction Belief
The main idea behind GP-SUM is described in Algorithm 1. Considering (1) and (3), the belief at time t in terms of the prediction belief is:

$$p(x_t|u_{1:t-1}, z_{1:t}) \propto p(z_t|x_t) \cdot p(x_t|u_{1:t-1}, z_{1:t-1}) \qquad (5)$$
If the prediction belief at time t − 1 is approximated by a sum of Gaussians:

$$p(x_{t-1}|u_{1:t-2}, z_{1:t-2}) = \sum_{i=1}^{M_{t-1}} \omega_{t-1,i} \cdot \mathcal{N}(x_{t-1}\,|\,\mu_{t-1,i}, \Sigma_{t-1,i}) \qquad (6)$$
where M_{t−1} is the number of components of the Gaussian mixture and ω_{t−1,i} is the weight associated with the i-th Gaussian of the mixture, N(x_{t−1}|μ_{t−1,i}, Σ_{t−1,i}), then we compute the prediction belief at time t by combining (1) and (5) as:

$$p(x_t|u_{1:t-1}, z_{1:t-1}) = \int p(x_t|x_{t-1}, u_{t-1})\, p(x_{t-1}|u_{1:t-2}, z_{1:t-1})\, \mathrm{d}x_{t-1} \propto \int p(x_t|x_{t-1}, u_{t-1})\, p(z_{t-1}|x_{t-1})\, p(x_{t-1}|u_{1:t-2}, z_{1:t-2})\, \mathrm{d}x_{t-1} \qquad (7)$$

Given the previous observation z_{t−1} and the action u_{t−1}, the prediction belief at time t can be recursively computed using the prediction belief at time t − 1 together with the transition and observation models. If p(x_{t−1}|u_{1:t−2}, z_{1:t−2}) has the form of a sum of Gaussians, then we can take M_t samples from it, {x_{t−1,j}}_{j=1}^{M_t}, and approximate (7) by:

$$p(x_t|u_{1:t-1}, z_{1:t-1}) \propto \sum_{j=1}^{M_t} p(x_t|x_{t-1,j}, u_{t-1})\, p(z_{t-1}|x_{t-1,j}) \qquad (8)$$
Because the dynamics model is a GP, p(x_t|x_{t−1,j}, u_{t−1}) is the Gaussian N(x_t|μ_{t,j}, Σ_{t,j}), and p(z_{t−1}|x_{t−1,j}) is a constant value. As a result, we can take:

$$\omega_{t,j} = \frac{p(z_{t-1}|x_{t-1,j})}{\sum_{k=1}^{M_t} p(z_{t-1}|x_{t-1,k})} \qquad (9)$$
and express the updated prediction belief again as a Gaussian mixture:

$$p(x_t|u_{1:t-1}, z_{1:t-1}) = \sum_{j=1}^{M_t} \omega_{t,j} \cdot \mathcal{N}(x_t\,|\,\mu_{t,j}, \Sigma_{t,j}) \qquad (10)$$
In the ideal case where Mt tends to infinity, the sum of Gaussians approximation of the prediction belief converges to the real distribution and the propagation over time of the prediction belief remains correct. This property of GP-SUM contrasts with most other GP-Bayes filters where the prediction belief is approximated as a single Gaussian. In those cases, errors from previous approximations inevitably accumulate over time. Note that the weights in (9) are directly related to the likelihood of the observations. As in most sample-based algorithms, if the weights are too small before normalization, it becomes a good strategy to re-sample or modify the number of samples considered. In Sect. 5 we address this issue by re-sampling again from the distributions while keeping the number of samples constant.
Algorithm 1. Prediction belief recursion

GP-SUM({μ_{t−1,i}, Σ_{t−1,i}, ω_{t−1,i}}_{i=1}^{M_{t−1}}, u_{t−1}, z_{t−1}, M_t):
  {x_{t−1,j}}_{j=1}^{M_t} = sample({μ_{t−1,i}, Σ_{t−1,i}, ω_{t−1,i}}_{i=1}^{M_{t−1}}, M_t)
  for j ∈ {1, . . . , M_t} do
    μ_{t,j} = GP_μ(x_{t−1,j}, u_{t−1})
    Σ_{t,j} = GP_Σ(x_{t−1,j}, u_{t−1})
    ω_{t,j} = p(z_{t−1}|x_{t−1,j})
  end for
  {ω_{t,j}}_{j=1}^{M_t} = normalize_weights({ω_{t,j}}_{j=1}^{M_t})
  return {μ_{t,j}, Σ_{t,j}, ω_{t,j}}_{j=1}^{M_t}
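As a companion to Algorithm 1, here is a minimal sketch of one prediction-belief update in Python. The interfaces are assumptions, not the authors' code: gp_dynamics stands in for the learned GP transition model returning a mean and covariance, and obs_likelihood for the observation model p(z|x).

```python
import numpy as np

def gp_sum_prediction_update(means, covs, weights, u, z_prev, M,
                             gp_dynamics, obs_likelihood,
                             rng=np.random.default_rng()):
    """One prediction-belief recursion step (Algorithm 1).

    means, covs, weights: Gaussian-mixture parameters of the prediction belief at t-1.
    gp_dynamics(x, u) -> (mu, Sigma): GP dynamics model prediction.
    obs_likelihood(z, x) -> float: observation model p(z | x).
    """
    # Sample M states from the previous prediction belief (a Gaussian mixture).
    comps = rng.choice(len(weights), size=M, p=weights)
    samples = [rng.multivariate_normal(means[i], covs[i]) for i in comps]

    new_means, new_covs, new_weights = [], [], []
    for x in samples:
        mu, Sigma = gp_dynamics(x, u)                  # Gaussian component from the GP
        new_means.append(mu)
        new_covs.append(Sigma)
        new_weights.append(obs_likelihood(z_prev, x))  # weight by p(z_{t-1} | x_{t-1,j})

    w = np.asarray(new_weights)
    w = w / w.sum()                                    # normalize mixture weights, Eq. (9)
    return new_means, new_covs, w
```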
4.2 Recovering the Belief from the Prediction Belief
After computing the prediction belief, we take the observation z_t and compute the belief as another sum of Gaussians using (5):

$$p(x_t|u_{1:t-1}, z_{1:t}) \propto p(z_t|x_t) \sum_{j=1}^{M_t} \omega_{t,j} \cdot \mathcal{N}(x_t|\mu_{t,j}, \Sigma_{t,j}) = \sum_{j=1}^{M_t} \omega_{t,j} \cdot p(z_t|x_t)\, \mathcal{N}(x_t|\mu_{t,j}, \Sigma_{t,j}) \qquad (11)$$
Note that if p(z_t|x_t) N(x_t|μ_{t,j}, Σ_{t,j}) could be normalized and expressed as a Gaussian distribution, then the belief at time t would directly be a Gaussian mixture. In most cases, however, p(z_t|x_t) N(x_t|μ_{t,j}, Σ_{t,j}) is not proportional to a Gaussian. For those cases, we use standard approximations from the literature (Algorithm 2). For instance, the algorithm GP-EKF [15] linearizes the observation model to express the previous distribution as a Gaussian. In this work, we exploit the technique proposed by Deisenroth et al. [6] as it preserves the first two moments of p(z_t|x_t) N(x_t|μ_{t,j}, Σ_{t,j}) and has been shown to outperform GP-EKF [6]. This approximation assumes that

$$p(x_t, z_t|u_{1:t-1}, z_{1:t-1}) = p(z_t|x_t)\, p(x_t|u_{1:t-1}, z_{1:t-1})$$

and

$$p(z_t|u_{1:t-1}, z_{1:t-1}) = \int p(x_t, z_t|u_{1:t-1}, z_{1:t-1})\, \mathrm{d}x_t$$

are both Gaussians. Note that this is an approximation, and it is only exact when x_t and z_t are linearly related. Using this assumption and the fact that p(z_t|x_t) is a GP, p(z_t|x_t) N(x_t|μ_{t,j}, Σ_{t,j}) can be approximated as a Gaussian by analytically computing its first two moments [6]. As a result, we recover the belief as a sum of Gaussians. It is key to note that this approximation is only necessary to recover the belief, and does not incur iterative filtering error. GP-SUM directly keeps track of the prediction belief, for which the approximation is not required.

Algorithm 2. Belief recovery

belief_computation({μ_{t,j}, Σ_{t,j}, ω_{t,j}}_{j=1}^{M_t}, z_t, M_t):
  for j ∈ {1, . . . , M_t} do
    μ̂_{t,j}, Σ̂_{t,j} = Gaussian_approx( p(z_t|x_t) N(x_t|μ_{t,j}, Σ_{t,j}) )
  end for
  {ω̂_{t,j}}_{j=1}^{M_t} = {ω_{t,j}}_{j=1}^{M_t}
  return {μ̂_{t,j}, Σ̂_{t,j}, ω̂_{t,j}}_{j=1}^{M_t}
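For intuition only, here is a Monte Carlo stand-in for the Gaussian_approx step of Algorithm 2. The paper computes these moments analytically following [6]; this self-normalized importance-sampling version merely illustrates the operation, and its names are our own.

```python
import numpy as np

def gaussian_approx_mc(mu, Sigma, obs_loglik, z, n=2000,
                       rng=np.random.default_rng()):
    """Moment-match p(z|x) N(x|mu, Sigma) with a Gaussian via importance sampling."""
    xs = rng.multivariate_normal(mu, Sigma, size=n)      # proposal: the prior component
    logw = np.array([obs_loglik(z, x) for x in xs])      # log p(z | x) per sample
    w = np.exp(logw - logw.max())
    w /= w.sum()
    mu_hat = w @ xs                                      # first moment
    diff = xs - mu_hat
    Sigma_hat = (w[:, None] * diff).T @ diff             # second (central) moment
    return mu_hat, Sigma_hat
```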
4.3 Computational Complexity
The computational complexity of GP-SUM directly depends on the number of Gaussians used at each step. For simplicity, we will now assume that at each time step the number of components is constant, M. Note that the number of samples taken from the prediction belief corresponds to the number of components of the next distribution. Propagating the prediction belief one step then requires taking M samples from the previous prediction belief and evaluating the dynamics and measurement models M times. The cost of sampling once from a weighted sum of Gaussians is constant, O(1), while evaluating each model implies computing the output of a GP with cost O(n²), where n is the size of the data used to train the GPs [22]. Therefore, the overall cost of propagating the prediction belief is O(Mn² + M), where n is the largest size of the training sets considered. Approximating the belief does not increase the asymptotic complexity, as it also requires O(Mn²) operations [6]. Consequently, GP-SUM's time complexity increases linearly with the size of the Gaussian mixture. When necessary, we can reduce the cost of GP-SUM using sparse GPs [7], which choose a sparser training set, reducing the cost from O(Mn²) to O(Mk²) with k ≪ n.
5 Results
We evaluate the performance of our algorithm on two different dynamic systems. The first is a 1D synthetic benchmark for nonlinear state space models used in [6,23], where our algorithm proves superior to previous GP-Bayes filters¹. The second case studies how uncertainty propagates in a real robotic system. We learn the stochastic dynamics of planar pushing from real data with GP regression and then use GP-SUM to simulate the system uncertainty over time to capture the expected distribution of the object state.
¹ The implementations of GP-ADF and GP-UKF are based on [6] and can be found at https://github.com/ICL-SML/gp-adf.
5.1 Synthetic Task: Algorithm Evaluation and Comparison
We evaluate GP-SUM on the synthetic system proposed by Kitagawa [23], with dynamics model:

$$x_{t+1} = \frac{1}{2}\,x_t + \frac{25\,x_t}{1 + x_t^2} + w, \qquad w \sim \mathcal{N}(0,\, 0.2^2) \qquad (12)$$

and measurement model:

$$z_{t+1} = 5 \sin 2x_t + v, \qquad v \sim \mathcal{N}(0,\, 0.01^2) \qquad (13)$$
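For reproducibility, a minimal simulation of this benchmark (Eqs. (12)–(13)) might look as follows; the function name, the seeding, and the choice to measure the new state are our own.

```python
import numpy as np

def simulate_kitagawa(x0, T, rng=np.random.default_rng(0)):
    """Simulate the benchmark system of Eqs. (12)-(13) for T steps."""
    xs, zs = [x0], []
    x = x0
    for _ in range(T):
        x = 0.5 * x + 25.0 * x / (1.0 + x**2) + rng.normal(0.0, 0.2)   # Eq. (12)
        z = 5.0 * np.sin(2.0 * x) + rng.normal(0.0, 0.01)              # Eq. (13)
        xs.append(x)
        zs.append(z)
    return np.array(xs), np.array(zs)
```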
The system was previously used to benchmark the performance of GP-ADF [6]. Figure 2 illustrates the models and Fig. 3 illustrates the filtering process.
Fig. 2. Synthetic benchmark task. Dynamics model and observation model of the synthetic task in Eqs. (12) and (13). Notice that the dynamics are especially sensitive and nonlinear around zero. Just like in the example of the push-grasp in Fig. 1, this leads to unstable behavior and multi-modal state distributions.
The GP models for prediction and measurement are trained using 1000 samples uniformly distributed over the interval [−20, 20]. GP-SUM uses the same number of Gaussian components, M = M_t = 1000, during the entire filtering process. The initial prior distribution of x_0 is Gaussian with variance σ_0² = 0.5² and mean μ_0 ∈ [−10, 10]. We randomly pick μ_0 200 times in the interval to assess the filters in multiple scenarios. Their behavior becomes especially interesting around x = 0, where the dynamics are highly nonlinear. For each value of μ_0, the filters take 10 time-steps. This procedure is repeated 300 times to average the performance of GP-SUM, GP-ADF, GP-UKF, and GP-PF, described in Sect. 2. For GP-PF, the number of particles is the same as the number of GP-SUM components, M = 1000. We evaluate the error in the final state distribution of the system using three metrics. The most relevant is the negative log-likelihood, NLL, which measures the likelihood of the true state according to the predicted belief. We also report the root-mean-square error, RMSE, even though it only evaluates the mean of the belief instead of its whole distribution. Similarly, the Mahalanobis distance, Maha, only considers the first two moments of the belief, for which we approximate the belief from GP-SUM by a Gaussian. For the GP-PF we only compute the RMSE, given that particle filters do not provide closed-form distributions. Note that for all proposed metrics, lower values imply better performance.
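As a reference for how the NLL metric can be evaluated against a Gaussian-mixture belief, here is a small sketch (our own, not the authors' evaluation code):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_nll(x_true, means, covs, weights):
    """Negative log-likelihood of the true state under a Gaussian-mixture belief."""
    density = sum(w * multivariate_normal.pdf(x_true, mean=m, cov=c)
                  for m, c, w in zip(means, covs, weights))
    return -np.log(density)
```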
[Fig. 3: four panels showing, left to right, the prior (t = 0), the prediction belief after a prediction update (t = 1), the belief after a measurement update (t = 1), and the prediction belief after another prediction update (t = 2); curves compare GP-ADF, Gauss GP-SUM, GP-SUM, the real distribution, and the real state.]
Fig. 3. Belief propagation on benchmark synthetic task. The figure illustrates how GP-SUM, GP-ADF and Gauss GP-SUM propagate the state belief through three steps of dynamic-observation-dynamic updates. All three algorithms start from a prior belief centered at zero, precisely where the benchmark dynamic system is most sensitive to initial conditions, as illustrated in Fig. 2. As a result, the belief and prediction belief quickly become multimodal. GP-SUM handles these complex distributions properly and its predictions are more accurate. After only one dynamic step, the belief at t = 1 predicted by GP-SUM shows three possible modes for the state of the system, while the other algorithms output a single Gaussian that encloses them all.

Table 1. Comparison between GP-filters after 1 time step.

Error   GP-ADF (μ ± σ)   GP-UKF (μ ± σ)   GP-SUM (μ ± σ)   GP-PF (μ ± σ)
NLL     0.49 ± 0.17      95.0 ± 97.0      −0.55 ± 0.34     –
Maha    0.69 ± 0.06      2.80 ± 0.72      0.67 ± 0.04      –
RMSE    2.18 ± 0.39      34.5 ± 23.1      2.18 ± 0.38      2.27 ± 0.35
Table 2. Comparison between GP-filters after 10 time steps.

Error   GP-ADF (μ ± σ)   GP-UKF (μ ± σ)   GP-SUM (μ ± σ)   GP-PF (μ ± σ)
NLL     9.58 ± 15.68     1517 ± 7600      −0.24 ± 0.11     –
Maha    0.99 ± 0.31      8.25 ± 3.82      0.77 ± 0.06      –
RMSE    2.27 ± 0.16      13.0 ± 16.7      0.19 ± 0.02      N/A
From Tables 1 and 2, it is clear that GP-SUM outperforms the other algorithms in all metrics and is more stable, as it obtains the lowest variance in most of the metrics. In the first time step, GP-PF is already outperformed by GP-SUM and GP-ADF, and after a few more time steps, particle starvation becomes a major issue for GP-PF as the likelihood of the observations becomes extremely low. For this reason, we did not report an RMSE value for the GP-PF
after 10 time steps. GP-UKF's performance is clearly surpassed by GP-SUM and GP-ADF after both 1 and 10 time steps. In Fig. 3 we compare the true state distributions (computed numerically) to the distributions obtained by GP-ADF, GP-SUM, and a simplified version of GP-SUM, Gauss GP-SUM, that takes a Gaussian approximation of GP-SUM at each time step. It becomes clear that by allowing non-Gaussian beliefs, GP-SUM assigns higher likelihood to the actual state while better approximating the true belief. Instead, GP-ADF's performance is limited by assigning a single Gaussian wide enough to cover all the high-density regions.
Fig. 4. Filter comparison on benchmark task. Evaluation of the negative log-likelihood (NLL) of the distributions predicted by the filters GP-ADF, Gauss GP-SUM, and GP-SUM. In the first time steps, with very few observations, the belief is non-Gaussian or multi-modal. GP-SUM handles this and outperforms both GP-ADF and Gauss GP-SUM, where at each step the belief is approximated by a Gaussian. As time evolves, both GP-SUM and Gauss GP-SUM converge, since the shape of the real belief becomes uni-modal. GP-ADF worsens with time, since in cases where the dynamics are highly nonlinear, its predicted variance increases, lowering the likelihood of the true state.
Figure 4 shows the temporal evolution of the negative log-likelihood (NLL) metric for GP-ADF, GP-SUM and Gauss GP-SUM. As the number of steps increases, GP-SUM and Gauss GP-SUM converge because, as GP-SUM becomes more confident in the location of the true state, its belief becomes unimodal and more Gaussian. The plot in Fig. 4 also shows that GP-ADF worsens its performance over time. This is due to the system crossing states with highly nonlinear dynamics, i.e., around zero, where the variance of GP-ADF increases over time. As a result, GP-SUM is a good fit for those systems where multimodality and complex behaviors cannot be neglected, at the cost of a larger computational effort.

5.2 Real Task: Propagating Uncertainty in Pushing
Planar pushing is an under-determined and sometimes undecidable physical interaction [24]. Only under many assumptions and simplifications can it be simulated efficiently [25,26]. It has been shown experimentally that due to spatial
and directional friction variability, the uncertainty in frictional interactions yields stochastic pushing behavior [2,3,27]. An important observation is that the type of push has a strong influence on the amount of expected uncertainty, i.e., the level of "noise" in the pushing dynamics is action dependent, a.k.a. heteroscedastic. Figure 5 illustrates this effect, where two different pushes repeated multiple times lead to very different object state distributions.

A stochastic pushing simulation framework that captures heteroscedasticity and non-Gaussian distributions could be relevant for planning and control. We could generate robust plans by preferring those pushes that lead to lower uncertainty and tighter distributions for the object's position. To this end, we propose to use GP-SUM to propagate the uncertainty of planar pushing by learning the dynamics of the system using heteroscedastic GPs as proposed in [3]. In this case, since the simulation is only concerned with the forward propagation of uncertainty, we do not consider a measurement model. As a result, when using GP-SUM the prediction belief and the belief coincide and all the components in the Gaussian mixtures have the same weights. In the absence of an observation model, the simulated state distributions naturally become wider at each step. For the following results we use M = 10000 Gaussians to capture with accuracy the complex distributions that originate in long pushes.

We train the GP model of the dynamics with real data and take as inputs the contact point between the pusher and the object, the pusher's velocity, and its direction [3]. The outputs of the dynamics model are the displacement (position and orientation) of the object relative to the pusher's motion. Each real push for validation is repeated 100 times at a velocity of 20 mm/s. The intrinsic noise of the system, combining the positional accuracy of the robot and the positional accuracy of the object tracking system, has a standard deviation lower than 1 mm for the object location and lower than 0.01 rad for the object orientation.
Fig. 5. Examples of stable and unstable pushes. Two different pushes whose outcome after 100 executions yields very different distributions: (left) convergent, (right) divergent and multi-modal. Green lines are the trajectories of the geometric center of the object and the orange ellipse approximates the final distribution of the object pose.
Fig. 6. Outcome of an unstable long push. We repeatedly execute a long push at the center of the side of a square-shaped object. Similar to the push-grasp in Fig. 1, the real (green) trajectories show that the block can rotate to either side of the pushing trajectory—it is naturally unstable and undecidable [24]. The stochastic GP model from [3] can capture that uncertainty in the form of a probabilistic distribution. The (orange) dots show the outcome of 1000 Monte Carlo simulations of the learned GP dynamics model. GP-SUM accurately predicts the ring-shaped distribution in a way that is not possible with standard GP-filters that assume a uni-modal Gaussian form for the state belief.
Since the distribution of the object position easily becomes non-Gaussian, GP-SUM obtains more accurate results than other algorithms. Figure 6 shows an example of a 350 mm long push at the center of one of the sides of a square object. We compare the real pushing trajectories (green) with the outcome of running a Monte Carlo particle simulation on the learned GP model and with GP-SUM's prediction. The distribution becomes ring-shaped and multi-modal, which GP-SUM has no trouble recovering. This property cannot be captured by standard GP-Bayes filters that assume single-Gaussian state distributions. Being able to propagate the uncertainty of the object position over time exposes interesting properties of the planar pushing system. For instance, in Fig. 7 we observe different pushes repeated many times and how GP-SUM obtains a reasonable approximation of the true distribution and recovers the different amounts of noise produced by each type of push, i.e., the heteroscedasticity of the system. Figure 7 also shows how GP-SUM can take into account different initial noise distributions and properly propagate the uncertainty in the object's position. Being able to recover these behaviors is especially useful when our goal is to push an object to a specific region of the space, as it allows us to distinguish between pushes that lead to narrower (low-variance) distributions and those that involve multimodal or wider (high-variance) distributions.
Fig. 7. GP-SUM capturing variations in output uncertainty. [Left] Four different straight pushes of 5 cm, and the resulting distributions of the location of the geometric center of the object. We compare the outcome of 100 repeated pushes with GP-SUM’s prediction. [Right] Similar analysis, but changing the amount of uncertainty in the initial location of the geometric center of the object (Gaussian distribution with σ = {0, 0.5, 1, 2} mm added on top of the noise of the Vicon tracking system). In both cases, GP-SUM successfully approximates the shape of the output distributions.
6 Discussion and Future Work
GP-Bayes filters are a powerful tool to model and track systems with complex and noisy dynamics. Most approaches rely on the assumption of a Gaussian belief. This assumption is an effective simplification: it enables filtering with high-frequency updates or in high-dimensional systems. It is most reasonable in systems where the local dynamics are simple, i.e., linearizable, and when accurate observation models are readily available to continuously correct for complex or un-modelled dynamics. In this paper we look at situations where a Gaussian belief is less reasonable. That is, for example, the case of contact behavior with non-smooth local dynamics due to sudden changes in stick/slip or contact/separation, and the case of stochastic simulation where, without the benefit of sensor feedback, uncertainty distributions naturally grow over time. We propose the GP-SUM
algorithm, which exploits the synergies between dynamics models expressed as GPs and complex state distributions expressed as weighted sums of Gaussians. Our approach is sample-based in nature, but has the advantage of using a minimal number of assumptions compared to other GP-filters based on single Gaussian distributions or on the linearization of the GP models. Since GP-SUM preserves the probabilistic nature of a Bayes filter, it also makes more effective use of sampling than particle filters. When considering GP-SUM, several aspects must be taken into account:

Number of Samples. Choosing the appropriate number of samples determines the number of Gaussians in the prediction belief and hence its expressiveness. Adjusting the number of Gaussians over time is likely beneficial in order to properly cover the state space. Similarly, high-dimensional states might require higher values of M_t to ensure proper sampling of the prediction belief. Because of the sample-based nature of GP-SUM, many techniques from sample-based algorithms can be effectively applied, such as resampling or adding randomly generated components to avoid particle deprivation.

Likelihood of the Observations. There is a direct relation between the weights of the beliefs and the likelihood of the observations. We can exploit this relationship to detect when the weights of the samples degenerate and correct for it by re-sampling or modifying the number of samples.

Computational Cost. Unlike non-sampling GP-filters, the cost of GP-SUM scales linearly with the number of samples. Nevertheless, for non-linear systems we showed that our algorithm can recover the true state distributions more accurately and thus obtain better results when compared to faster algorithms such as GP-ADF, GP-UKF or GP-PF.

GP Extensions. The structure of GP-SUM is not restricted to classical GPs for the dynamics and observation models. Other types of GPs, such as HGPs or sparse GPs, can be considered. For instance, combining GP-SUM with SSGPs [7] makes the computation more efficient.

Future research will focus on combining GP-SUM with planning and control techniques. Simulating multimodality and noisy actions can provide guidance to choose actions that better deal with the dynamics of complex stochastic systems.
References

1. Dogar, M.R., Srinivasa, S.S.: Push-grasping with dexterous hands: mechanics and a method. In: IEEE International Conference on Intelligent Robots and Systems (IROS) (2010)
2. Yu, K.T., Bauza, M., Fazeli, N., Rodriguez, A.: More than a million ways to be pushed: a high-fidelity experimental data set of planar pushing. In: IROS (2016)
3. Bauza, M., Rodriguez, A.: A probabilistic data-driven model for planar pushing. In: IEEE International Conference on Robotics and Automation (ICRA) (2017)
4. Fazeli, N., Donlon, E., Drumwright, E., Rodriguez, A.: Empirical evaluation of common impact models on a planar impact task. In: IEEE International Conference on Robotics and Automation (ICRA) (2017)
5. Paolini, R., Rodriguez, A., Srinivasa, S.S., Mason, M.T.: A data-driven statistical framework for post-grasp manipulation. IJRR 33(4), 600–615 (2014)
6. Deisenroth, M., Huber, M., Hanebeck, U.: Analytic moment-based Gaussian process filtering. In: ICML, pp. 225–232 (2009)
7. Pan, Y., Yan, X.: Prediction under uncertainty in sparse spectrum Gaussian processes with applications to filtering and control. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70 (2017)
8. Kocijan, J., Girard, A., Banko, B., Murray-Smith, R.: Dynamic systems identification with Gaussian processes. Math. Comp. Modell. Dyn. Syst. 11(4), 411–424 (2005)
9. Nguyen-Tuong, D., Seeger, M., Peters, J.: Model learning with local Gaussian process regression. Adv. Robot. 23(15), 2015–2034 (2009)
10. Murray-Smith, R., Sbarbaro, D.: Nonlinear adaptive control using nonparametric Gaussian process prior models. IFAC Proc. Vol. 35(1), 325–330 (2002)
11. Deisenroth, M., Rasmussen, C., Fox, D.: Learning to control a low-cost manipulator using data-efficient reinforcement learning. In: Robotics: Science and Systems, vol. 7, pp. 57–64 (2012)
12. Mukadam, M., Yan, X., Boots, B.: Gaussian process motion planning. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 9–15. IEEE (2016)
13. Gregorčič, G., Lightbody, G.: Nonlinear system identification: from multiple-model networks to Gaussian processes. Eng. Appl. Artif. Intell. 21(7), 1035–1055 (2008)
14. Ažman, K., Kocijan, J.: Dynamical systems identification using Gaussian process models with incorporated local models. Eng. Appl. Artif. Intell. 24(2), 398–408 (2011)
15. Ko, J., Fox, D.: GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Auton. Robots 27(1), 75–90 (2009)
16. Stordal, A.S., Karlsen, H.A., Nævdal, G., Skaug, H.J., Vallès, B.: Bridging the ensemble Kalman filter and particle filters: the adaptive Gaussian mixture filter. Comput. Geosci. 15(2), 293–305 (2011)
17. Alspach, D., Sorenson, H.: Nonlinear Bayesian estimation using Gaussian sum approximations. IEEE Trans. Autom. Control 17(4), 439–448 (1972)
18. Kotecha, J., Djuric, P.: Gaussian sum particle filtering. IEEE Trans. Signal Process. 51(10), 2602–2612 (2003)
19. Blackman, S.S.: Multiple hypothesis tracking for multiple target tracking. IEEE Aerosp. Electron. Syst. Mag. 19(1), 5–18 (2004)
20. Koval, M.C., Klingensmith, M., Srinivasa, S.S., Pollard, N.S., Kaess, M.: The manifold particle filter for state estimation on high-dimensional implicit manifolds. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017, pp. 4673–4680 (2017)
21. Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press, Cambridge (2005)
22. Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2006)
23. Kitagawa, G.: Monte Carlo filter and smoother for non-Gaussian nonlinear state-space models. J. Comput. Graph. Stat. 5(1), 1–25 (1996)
24. Mason, M.T.: Mechanics and planning of manipulator pushing operations. IJRR 5(3), 53–71 (1986)
25. Goyal, S., Ruina, A., Papadopoulos, J.: Planar sliding with dry friction. Part 1. Limit surface and moment function. Wear 143(2), 307–330 (1991)
26. Lynch, K.M., Mason, M.T.: Stable pushing: mechanics, controllability, and planning. IJRR 15(6), 533–556 (1996)
27. Ma, D., Rodriguez, A.: Friction variability in planar pushing data: anisotropic friction and data-collection bias. IEEE Robot. Autom. Lett. 3(4), 3232–3239 (2018)
Optimal Motion Generation
Time-Optimal Motion of Spatial Dubins Systems

Weifu Wang¹ and Devin Balkcom²

¹ University at Albany, Albany, USA, [email protected]
² Dartmouth College, Hanover, NH, USA
Abstract. This paper presents a generic numerical approach to find the kinematic time-optimal trajectories for 3D rigid bodies with a finite set of translation, rotation, and screw controls, in an obstacle-free space. First, geometric necessary conditions for time-optimality are derived. Second, a method is presented for sampling trajectories satisfying the necessary conditions sufficiently densely to guarantee that for any start configuration and goal location, a trajectory can be found that provably approximately reaches the goal, approximately optimally.
1 Introduction

This paper studies time-optimal motion of 3D rigid bodies that are permitted actions from a finite set of constant-velocity rotations, translations, and screws; we call these rigid bodies spatial Dubins systems, as the known optimal motion of the planar Dubins car follows from the 3D geometry derived. We admit that spatial Dubins systems are not quite a direct generalization, since the set of controls is finite; however, for the planar Dubins car, a finite number of controls does turn out to be sufficient for optimality. We have recently presented the derivation of basic necessary conditions in [1]; the current paper summarizes these results briefly. The paper then presents the new analysis of the geometry of these conditions applied to spatial rigid bodies. This geometry is used to further analyze a few interesting systems. The paper also presents an algorithm that searches for approximately optimal trajectories across all points in a bounded region of the space. Figure 1 shows a few trajectories for each system found using this algorithm.

Optimal trajectories may be used directly for robot control or as components for planning in more complex environments, such as the steering methods and local planners used in sampling-based planning algorithms [2–4]. Geometric understanding of the optimal trajectories may also influence the design of mechanical systems or planning algorithms, such as work on steerable medical needles [5] that uses planar Dubins curves as motion primitives.

Trajectory time is an obvious quantity to minimize, and for a bounded-velocity system, the time-optimal trajectories are a simple generalization of shortest paths for points on the robot. A typical approach to discovering the structure of time-optimal trajectories for a system or class of systems is the use of Pontryagin's Maximum Principle. System equations are manipulated to write certain adjoint differential equations; if those equations may be integrated, they yield geometric conditions that optimal trajectories must satisfy. Further work with the geometry may eliminate some trajectory structures, or
(a) Trajectory found to reach (−0.4, −0.3, −0.7) using the Dubins airplane following controls yaw(left)-pitch(down)-yaw(left). Euclidean distance 0.86; the duration of the trajectory is 5.47.
(b) Trajectory found to reach (−0.4, −0.3, −0.7) using the 3D Reeds-Shepp system following controls pitch(forward-up)-yaw(back-right)-pitch(back-down). Euclidean distance 0.86; the duration of the trajectory is 1.64.
Fig. 1. Sample trajectories found by the proposed algorithm, one for Dubins airplane, and one for 3D Reeds-Shepp model.
allow a mapping from the configuration space to trajectory structure – the optimal control synthesis. Generalizing time-optimal motion beyond the plane has proven challenging. Our own prior approaches to discovering optimal trajectories for 3D rigid bodies were halted by our inability to integrate the adjoint equations; representations of rotations in 3D as matrices, quaternions, or Euler angles each caused some difficulty. Work by Chitsaz et al. [6] presents some partial results and necessary conditions for a 3D Dubins airplane that includes bounded-velocity altitude changes; optimal motion primitives are derived, but the configuration space is not that of a true 6DOF rigid body, as the plane effectively remains flat, sidestepping the issue of rotation representation.

We avoid the issue of finding analytical solutions to complicated differential equations in a different manner in this paper, by choosing a finite set of control actions for the 3D rigid body. These actions are taken to be rigidly attached to the local frame of the robot. For example, for an airplane, one might take an action to be a pitch up, corresponding to rotating the airplane about a rotation axis directly above the robot and perpendicular to the direction of travel. We further assume that the optimal trajectory will be a finite sequence of such actions – potentially, a very problematic assumption, which we will discuss shortly. Restricting the search to finding durations and a sequence of actions from a finite set allows the problem to be phrased as a constrained optimization problem. Karush-Kuhn-Tucker (KKT) conditions allow simple necessary conditions to be written directly, without the need for integration of the adjoint.

Analysis in [1] reveals an interesting result quite parallel to those known for the planar systems: time-optimal trajectories can be viewed as virtual robot arms in special configurations that balance some external force described by a vector λ; revolute arm joints correspond to rotation actions, and prismatic joints correspond to translation actions. Choosing different λ vectors gives different trajectory structures, including different sequences of actions and action durations.
Unfortunately, the KKT-derived conditions do not provide some of the information about trajectory structure (in particular, the sequence of actions) that the PMP conditions do for the planar case, necessitating the use of additional discrete search in the described algorithm. The trajectory search algorithm we present explores the trajectory space in a branch-and-bound fashion, using constraints provided by λ to prune the search. As the algorithm runs, a numerical representation of the synthesis of approximately optimal trajectories is discovered. This algorithm is an extension to 3D of an algorithm for approximating planar optimal trajectories presented in [7]. In addition to the extension to 3D, the current algorithm works with the weaker conditions derived using KKT.

We hope that the numerical techniques and geometric insights presented in this paper prove useful, but must point out several weaknesses in the approach. Chief among the weaknesses is the possibility that for any given system, optimal trajectories do not exist. Typically, for each new system, existence must be considered carefully. One potential issue is that for certain goal configurations, for any given trajectory, there exists another trajectory that is faster, but which requires a greater number of switches between controls: the phenomenon of chattering. While it is common to discretize the action space before using a planner such as RRT* [8], doing so may well introduce chattering. For example, choosing only the left and right turning actions for a Dubins car as discrete actions (and not including pure forward translations) leads to a system that moves forwards most quickly by rapidly alternating between the two rotation actions.

If there is some fixed cost to switching between actions, then optimal trajectories do exist for any reachable configuration [9, 10]. We consider the assignment of some small switching cost to be at least as reasonable a model as a cost of zero for switches; effectively, a zero-cost switch requires infinite accelerations of the physical system. The algorithm in the present paper works without modification for this model of costly switches. As long as the described algorithm is applied to a system with (possibly small) switching costs, we can expect that the trajectory found is close to optimal, subject to the resolution of the search.

Further weaknesses include the fact that although the rigid bodies we consider have orientation, we do not restrict the orientation at the goal. We also point out that the kinematic model of rigid bodies is not a good model for airplanes, which have dynamics and thrust. Nonetheless, we hope that short or fast trajectories for this model are useful for the basic geometry they provide, as they have proven to be for the planar Dubins particle. Finally, this paper does not provide any particular insight about how to select a finite control set; for the systems studied, we choose simple extreme controls that are analogous to those that have turned out to be sufficient for planar systems.
2 Related Work

Analytical time-optimal trajectories are known for several planar mobile robots, including the Dubins car [11], the Reeds-Shepp car [12–15], bounded-velocity differential-drive robots [16, 17], and omnidirectional robots [18, 19], all in the unobstructed plane. These systems are all planar rigid bodies with different velocity constraints; Furtuna [20, 21] showed that this similarity allows derivation of necessary conditions on
the time-optimal trajectories that apply to all of these models: a unified theory of time-optimal motion for simple models of planar mobile robots. Few analytical solutions have been derived for dynamic models of systems. In fact, some systems with bounded accelerations do not even have time-optimal trajectories [22, 23]. Beyond land and aerial vehicles, time-optimal trajectories for ships under constant current [24] and for underwater vehicles [25] have also been studied. To simulate the effect of acceleration, a cost can be introduced at each switch of control. Using Blatt's Indifference Principle (BIP) [26], Lyu et al. [9, 10] studied and found approximation algorithms for optimal trajectories under a costly switch model for planar Dubins-like systems. Most of the described work on time-optimal strategies does not consider the existence of obstacles; notable exceptions include work by Vendittelli et al. on how to measure the distance between a car-like robot and obstacles [27, 28]. Planning among simple obstacles has also been studied for simple car-like systems [29, 30].
3 Model of Constant-Control Rigid Bodies

A Dubins car in the plane with a minimum turning radius of 1 may be considered to have forward velocity v of 1 and an angular velocity ω in the range [−1, 1]. As Dubins proved, only three discrete controls out of this continuum are required for optimality: ω ∈ {−1, 0, 1}. Six discrete controls are required for the Reeds-Shepp car, with v ∈ {−1, 1}.

We may view controls for the planar systems as actions attached to the local frame of the robot. For example, the left-turn action for a Dubins car has associated with it a rotation center to the left of the robot; as the robot moves, the rotation center moves with the local frame. Furtuna's work [21] showed that many other mobile robot models are also equivalent to rigid bodies in the plane with body-fixed rotation and translation controls, only a finite subset of which are required to find an optimal trajectory to each reachable configuration.

We extend this idea to spatial rigid bodies. We attach a small number of constant controls to the local frame of the robot. Each control is either a translation, a rotation about an axis fixed to the local frame, or a screw of constant pitch around an axis fixed to the local frame. We call such a system with constant controls attached to the local frame a spatial Dubins system.

We consider an example spatial Dubins system that might be called a Dubins airplane or submarine. We attach the following frame to the robot: the x-axis is along the nose of the vehicle; the y-axis is along the left wing of the vehicle and is perpendicular to the x-axis; the z-axis is perpendicular to both the x and y axes and follows the right-hand rule. The vehicle can translate forward along the x-axis of the robot frame at speed 1. In addition to translation, the vehicle can perform the following six controls (a sketch of this control set as rigid-body twists follows the list):

– pitch up or down: unit-speed rotation around one of two axes, each parallel to the y-axis and offset by 1 along the z-axis;
– yaw left or right: unit-speed rotation around one of two axes, each parallel to the z-axis and offset by 1 along the y-axis;
– roll left or right: unit-speed rotation around the x-axis of the robot frame while translating along the x-axis at speed 1 (a screw).
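The control set above can be written as constant body-frame twists in se(3) and integrated with the matrix exponential. The following is a minimal sketch under stated assumptions: the exact rotation directions assigned to "up"/"down"/"left"/"right" are our own sign conventions, not fixed by the paper.

```python
import numpy as np
from scipy.linalg import expm

def twist_matrix(omega, v):
    """4x4 se(3) matrix for body-frame angular velocity omega and linear velocity v."""
    wx, wy, wz = omega
    X = np.zeros((4, 4))
    X[:3, :3] = np.array([[0, -wz, wy], [wz, 0, -wx], [-wy, wx, 0]])
    X[:3, 3] = v
    return X

def rotation_about(w, p):
    """Unit-speed rotation about an axis with direction w through body-frame point p.
    The linear part is v = -w x p, so points on the axis have zero velocity."""
    w, p = np.asarray(w, float), np.asarray(p, float)
    return twist_matrix(w, -np.cross(w, p))

CONTROLS = {  # signs chosen so each rotation moves the body forward (+x); assumed conventions
    "forward":    twist_matrix([0, 0, 0], [1, 0, 0]),
    "pitch_up":   rotation_about([0, -1, 0], [0, 0, 1]),   # axis above, offset +1 along z
    "pitch_down": rotation_about([0, 1, 0], [0, 0, -1]),   # axis below, offset -1 along z
    "yaw_left":   rotation_about([0, 0, 1], [0, 1, 0]),    # axis to the left, +1 along y
    "yaw_right":  rotation_about([0, 0, -1], [0, -1, 0]),  # axis to the right, -1 along y
    "roll_left":  twist_matrix([1, 0, 0], [1, 0, 0]),      # screw: rotate + translate on x
    "roll_right": twist_matrix([-1, 0, 0], [1, 0, 0]),
}

def apply_control(g, name, t):
    """Flow the pose g in SE(3) under a constant body-frame control for duration t."""
    return g @ expm(t * CONTROLS[name])
```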
One might question whether this model is reasonable for a jet airplane. It is not. Nonetheless, constant actions selected from the tangent space to SE(3) may generate geometrically interesting paths that are in some sense short and may be useful for systems such as steerable needles or submarines.
4 Necessary Conditions for Time-Optimal Trajectories

In [1], we derived the necessary conditions for this model of controls for a spatial rigid body. The model may be termed a spatial Dubins model, but is also quite similar to the kinematic model of a robot arm. Specifically, an arm is a series of joints offset by links; the joints typically provide translations along or rotations about axes fixed to the prior link. The primary difference for the current system is that, unlike for the arm, the particular sequence of joints must be selected as well as the durations.

4.1 The Constraint Jacobian

We require that the robot reach a specified goal location in minimum time T, such that at t = T we have three constraint equations: x(t) = x_g, y(t) = y_g, and z(t) = z_g. Let the duration of the i-th control applied in the trajectory be t_i. Then, for any given sequence of controls, x(t), etc. are functions of t_1 . . . t_k. As shown in [1], this problem is thus a constrained minimization problem, for which necessary conditions may be found using Lagrange multipliers. For any time-optimal trajectory, there exist constants λ_1, λ_2, and λ_3 such that:

$$
\begin{pmatrix}
\partial x/\partial t_1 & \partial y/\partial t_1 & \partial z/\partial t_1 \\
\partial x/\partial t_2 & \partial y/\partial t_2 & \partial z/\partial t_2 \\
\vdots & \vdots & \vdots \\
\partial x/\partial t_n & \partial y/\partial t_n & \partial z/\partial t_n
\end{pmatrix}
\begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \lambda_3 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}
\qquad (1)
$$

4.2 Geometric Interpretation of Necessary Conditions

Since for a given sequence of control actions, x(t), y(t), and z(t) are equivalent to the kinematics equations for an arm of a particular design, the Jacobian in Eq. 1 is equivalent to the Jacobian for the same arm. In fact, as pointed out by [1], Eq. 1 is equivalent to a force-torque balance equation for the arm.

The similarity of the trajectory to the kinematics of an arm motivates an observation: for each rotation action, the corresponding column of the Jacobian for an arm, or row of the matrix in Eq. 1, may be found by taking the cross product of a vector pointing along the rotation axis with the vector from the current joint to the end effector. If the i-th control is a rotation, let the axis be ω_i, and let the vector pointing from the axis ω_i to the end-effector g be $\overrightarrow{o_i g}$. Then:

$$\begin{pmatrix} \partial x/\partial t_i \\ \partial y/\partial t_i \\ \partial z/\partial t_i \end{pmatrix} = \omega_i \times \overrightarrow{o_i g} \qquad (2)$$
at any instant during control i. For translations, denote the velocity vector of the i-th control as $\vec{v}_i$; then $\vec{v}_i \cdot \lambda = 1$. We will use this geometry to place constraints on the locations of rotation axes or lines of translation for controls; time-optimal trajectories must satisfy these constraints.

4.3 Geometric Relationship Between λ and Trajectory Structure
The main algorithm in this paper will iterate over durations and structures of the first few controls in the trajectory. Each iteration generates a particular geometry for the first few segments of the trajectory, with which only a single value for λ is consistent. That value is computed and then used to constrain the structure of the rest of the trajectory. Searching over consistent trajectory structures provides candidate optimal trajectories to various configurations. In the two steps of this process, the relationship between the potential λ values and the duration and structure of the trajectory will be critical; we now discuss this relationship.

Knowing λ informs us about potential trajectory geometries. Since the differential motion of the endpoint due to varying a control duration must make a unit dot product with λ, that motion vector must lie on a plane perpendicular to λ. For a translation control, this indicates that the translational velocity must lie on this plane. For a rotation control, the differential motion of the endpoint is generated by a cross product; specifying λ constrains the orientation and the distance to the goal of the rotation axis. Figure 2 illustrates some possible rotation axes (green arrows) for a given example λ (blue vector).

On the other hand, the geometry of a trajectory, i.e., its control sequence, constrains the λ values. Knowing the location and orientation of one control limits λ to a plane, knowing two controls limits λ to a line, and a third control limits λ to a single vector value.
(a) A few rotation control axes (green arrows) that satisfy the necessary condition for a given λ (blue vector) and a fixed distance between the rotation axis and the goal.
(b) Several rotation control axes (green arrows) that can satisfy the necessary condition for a given λ and a fixed distance between the rotation axis and the goal.
Fig. 2. Relations between rotation controls and λ . λ is in blue, and a red vector is the cross product vector that satisfies the necessary condition; rotation axes are drawn in green.
Figure 2 shows a few rotation axes consistent with a particular λ vector; to simplify the figure, all of these axes are at the same distance from the goal, yielding a sphere such that the rotation axes are tangent to the sphere. As the distance to the goal varies, the radius of the sphere changes; we omit some details. There are some further constraints on the placement of rotation axes. Specifically, there is a cylinder with axis parallel to λ within which no rotation axis will occur. This is because any particular control switch that is followed by a fixed trajectory geometry cannot reach certain goals that are too close to that switch.
5 Finding Time-Optimal Trajectories

The basic strategy we will pursue to find a near-optimal trajectory between a given start and goal is the following:

1. Loop over all possible triplets of actions for the first three controls in the trajectory.
2. For each triplet of actions, iterate over densely sampled durations for the first two actions.
3. Compute the unique λ vector consistent with the current trajectory structure and first two durations.
4. For the current λ vector, compute the fastest trajectory that reaches the goal.
5. Choose the fastest trajectory among all computed λ vectors.

In this section, we will discuss the details of each step. Why iterate over sampled trajectory durations and partial trajectory structures in steps 1 and 2? Why not just densely sample λ and search over λ-space? As we will discuss below, λ is not bounded. Further, it was shown in [7] that by carefully selecting the sampling rate for the durations of a few of the controls, any final goal can be approximately reached with approximately optimal cost.

5.1 Finding λ Given Partial Trajectory Structure and Durations

Since λ has three elements, we need three constraint equations to compute a λ vector, and thus we need to know three controls. Let the portion of the trajectory corresponding to the i-th control be its i-th segment. Given an endpoint g, define the tangential velocity of a given control as the sum of the translation velocity and the cross product between the rotation axis and the vector pointing from the rotation axis to g. Depending on the control type, a tangential velocity may have zero rotational or translational component. Two controls are considered co-linear if they have the same tangential velocity.

Lemma 1. Given an endpoint g, three trajectory segments and their configurations in space, if the three corresponding controls are not co-linear, then either these three segments cannot be on the same time-optimal trajectory, or a λ vector can be uniquely computed.
Proof. Given the i-th control, the tangential velocity must have dot product 1 with the λ vector in order to satisfy the necessary condition. Let A be a 3 × 3 matrix with three different rows, where each row is the tangential velocity for one of the controls corresponding to the three segments. If the controls are not co-linear, then matrix A is of full rank. Therefore, the equation A · λ = b, where b = [1, 1, 1]^T, has one or zero solutions. (A code sketch of this computation appears at the end of this subsection.)

Since three arbitrary segments can be used to compute the λ vector as long as they are not co-linear, let us consider the first three segments for simplicity. The reason to choose the first three segments is that only two durations need to be known to generate the configurations of the three segments. When some of the first three controls are co-linear, we need to consider later segments. For a rotation control, the duration is bounded in the range [0, 2π), but no such natural bound exists for translation. However, if some (non-optimal) method exists to compute some trajectory to any goal, then the cost of reaching that goal can be used as an upper bound for the duration of a single translation segment, yielding the following lemma:

Lemma 2. If there exists an arbitrary trajectory that can reach a given endpoint g, then it is sufficient to search a bounded set of durations to compute the unique λ vector corresponding to any given trajectory structure of length no less than three.

Given Lemma 2, the goal is to be able to iterate over the first few trajectory segments to compute the corresponding λ. Therefore, the key is to find a trajectory to serve as the upper bound. However, there do exist systems for which it is not easy to find any trajectory to reach a given endpoint. We will not focus on such systems.
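Here is a minimal sketch of the λ computation from Lemma 1, assuming each control's axis and velocity have already been expressed in the world frame at its segment's configuration; the dictionary representation of a control is our own.

```python
import numpy as np

def tangential_velocity(control, g):
    """Tangential velocity of a control at the endpoint g (Sect. 5.1).

    control: dict with optional keys 'omega' (rotation axis direction),
    'origin' (a point on the axis), and 'v' (translation velocity),
    all expressed in the world frame.
    """
    tv = np.zeros(3)
    if control.get("omega") is not None:
        tv += np.cross(np.asarray(control["omega"], float),
                       np.asarray(g, float) - np.asarray(control["origin"], float))
    if control.get("v") is not None:
        tv += np.asarray(control["v"], float)
    return tv

def compute_lambda(controls, g, tol=1e-9):
    """Solve A @ lam = [1, 1, 1] for three non-co-linear controls (Lemma 1).

    Returns None when A is singular, i.e., the controls are co-linear."""
    A = np.array([tangential_velocity(c, g) for c in controls])
    if abs(np.linalg.det(A)) < tol:
        return None
    return np.linalg.solve(A, np.ones(3))
```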
5.2 Screw Actions
A screw action for the Dubins airplane (roll) is a combination of rotation and translation. In order for such a screw action to satisfy the necessary condition, the following equation must hold, given the goal g, the translation velocity $\vec{v}$, and the rotational component of the tangential velocity $\omega \times \overrightarrow{og}$:

$$(\omega \times \overrightarrow{og} + \vec{v}) \cdot \lambda = 1 \qquad (3)$$
Along the trajectory, λ will not change, and v will not change. Therefore, in order to have a constant dot product with λ, the rotation axis of a roll must go through the goal. Because the translation direction of the roll shares the same direction vector with ω, this means that the translation velocity also goes through the goal. For the spiral type of screw actions, because there is just a single rotation axis and no translation velocity, the necessary condition can be satisfied if the tangential velocity remains constant. Therefore, such screw actions are almost like rotations, except that the duration is not bounded by 2π.
5.3 Fastest Trajectory for a Given λ

Given each feasible λ, we need to find the fastest trajectory that satisfies the necessary conditions. At each switch of controls, there may exist multiple controls that can satisfy the necessary condition. Then, all the possible trajectory structures corresponding to a specified λ form a tree with unknown depth and a potentially large branching factor. However, one interesting observation is that many of the trajectory structures cannot lead the robot to the endpoint. Therefore, only a portion of the tree needs to be searched to find the fastest trajectory for a given λ. What is more, if we have already found a trajectory with duration T that can reach the goal, the candidate fastest trajectories cannot be longer than T, which further reduces the portion of the tree that needs to be searched.

For a Dubins airplane, it is easy to find a control sequence that can reach the goal. Let us consider a drive-turn-drive sequence. We first execute a roll control so that g is either on the x-z plane or the x-y plane, then find a trajectory within that plane that will lead the robot to the goal following a drive-turn-drive strategy. The same strategy can also be used to find a trajectory to reach any goal in 3D for the Reeds-Shepp system. The resulting trajectory with duration T will serve as the upper bound. We therefore can use a depth-first search (DFS) to find the fastest trajectory for a given λ and terminate if any trajectory exceeds duration T. Whenever a new trajectory is found that can reach the goal and is shorter than the current solution, the upper bound will be updated, so that fewer and fewer branches need to be explored. What is more, this upper bound also reduces the region of durations we need to explore to find candidate λ vectors.

5.4 Complete Procedure

Below is the complete procedure that we use to find the time-optimal trajectory for a given pair of start and goal.

Algorithm 1. Find time-optimal trajectory given a goal position

Input: g(oal)
T ← Find a trajectory that can reach g;
for all possible one-segment or two-segment trajectories that can reach g do
  Update the upper bound T to the fastest trajectory;
k ← 3;
for all possible sequences of the first k controls do
  for all possible durations for the first k − 1 controls that do not exceed T do
    if some of the first k controls are co-linear then
      k ← k + 1;
      continue;
    Compute the corresponding λ;
    Starting from the third control, compute the next possible control;
    Run DFS until reaching the goal, or exceeding the upper bound;
    if the found trajectory can reach the goal and is shorter than T then
      Update the upper bound T;
return the upper bound T as the time-optimal trajectory;
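The "Run DFS" step of Algorithm 1 can be pictured as a branch-and-bound recursion over control switches. The following is a minimal sketch under assumed interfaces; none of these function names or the depth cap come from the paper.

```python
import numpy as np

def dfs_fastest(state, elapsed, lam, goal, best, next_controls, advance,
                eps=1e-3, depth=12):
    """Depth-first search for the fastest trajectory consistent with lam (Sect. 5.3).

    Assumed interfaces: state is an array whose first three entries are position;
    next_controls(state, lam) yields the finitely many (control, duration) pairs
    satisfying the necessary conditions at a switch; advance(state, control,
    duration) integrates the motion. best is a one-element list holding the
    incumbent upper-bound duration T.
    """
    if elapsed >= best[0] or depth == 0:
        return None                        # prune: cannot beat the incumbent
    if np.linalg.norm(state[:3] - goal) < eps:
        best[0] = elapsed                  # tighter upper bound found
        return []
    plan = None
    for control, duration in next_controls(state, lam):
        tail = dfs_fastest(advance(state, control, duration), elapsed + duration,
                           lam, goal, best, next_controls, advance, eps, depth - 1)
        if tail is not None:               # only improving branches return a plan
            plan = [(control, duration)] + tail
    return plan
```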
One step we have not yet discussed in Algorithm 1 is the need to loop over all possible one or two segment trajectories. This is because in the main search, we need three controls to compute a corresponding λ , but we cannot rule out the situations where one or two controls can reach the goal. This procedure can be computed easily in closed form. In fact, we can compute in closed form whether any trajectory containing no more than three segments can reach the goal. Algorithm 1 involves looping over the durations for the first k −1 linear independent controls. Given an arbitrary resolution, the returned trajectory may not even share the same structure with the time-optimal trajectory. So, what is the sufficient resolution? In the next section, we will show that given arbitrary tolerance ε , there exist a sampling resolution so that the resulting trajectory will reach a location g where |g − g| < ε . 5.5
5.5 Approximation Theorems
We will show that given any two trajectories with the same structure, i.e., control sequence, if the duration of each segment is similar, then at any time the two trajectories will reach locations that are close (measured by Euclidean distance).

Lemma 3. Consider two spatial trajectories Y and Y′ with identical structure, equal durations for all segments but one translation segment k, and a small distance κ. The corresponding end points are G and G′. For the translation control k in Y with duration t_k, and the same control k in Y′ with duration t′_k, if |t_k − t′_k| < κ/v_k, then |G − G′| < κ.
Proof. Translation is commutative.
Lemma 4. Consider a spatial trajectory Y with end point G, a small angle σ, and two points P and Q on the trajectory, with |P − G| > |Q − G|. Form a new trajectory Y1 with endpoint G1 by rotating the trajectory from P to G around P by σ, and form another trajectory Y2 with endpoint G2 by rotating from Q to G by σ. Then, |G2 − G| < |G1 − G|. What is more, for any trajectory Yk with end point Gk achieved by rotating the trajectory Y around a series of points along the trajectory by σ angle in total, |Gk − G| is upper bounded by rotating only σ around the point on Y furthest from G.

Proof. If the trajectory only moves in a plane, the proof from [7] directly applies. When the trajectory moves beyond a plane, we will show that the same conclusions still hold. First, consider the trajectories Y1 and Y2, formed by rotating around P and Q on trajectory Y. Since each rotation is still within a single plane, even though the two rotations may lie in two different planes, the result holds that if P is further from G than Q, then G1 is further from G than G2. Now, let us consider a trajectory Yk formed by a sequence of rotations around points on Y by a total amount no larger than σ, where the rotations may not be in the same plane. Denote the distance the endpoint is moved by the rotation around a point X as d(GX). No matter in which plane the rotation lies, d(GX) is the same for the same rotation angle. Therefore, for two given rotations around points A and B, the endpoint is moved by d(GA) and d(GB) respectively. If the two rotations are not in the same plane, the total movement of the endpoint G can be computed as d(GA) · d(GB) · cos(θ), where θ is the
angle between the two movement vectors. The angle θ reaches its maximum when the two movements are in the same plane. Therefore, given the trajectory Yk formed by a sequence of rotations around points on Y, |Gk − G| is upper bounded by the case where all rotations are in the same plane, which is in turn upper bounded by rotating σ around the point on Y furthest from G. Screw controls are bounded by their rotational components, so Lemma 4 applies. We can therefore directly extend the following theorem to 3D.

Theorem 1 (Theorem 3 from [7]). Consider two spatial trajectories Y and Y′ with the same control sequence, both starting at S and having time costs t = ∑_{i=1}^n t_i and t′ = ∑_{i=1}^n t′_i respectively. Denote the point that Y′ passes through at time t′ as G′, and the point that Y passes at time t as G. Then, for any ε > 0, there exists δ > 0 such that if ∑_{i=1}^n |t_i − t′_i| < δ, then |G′ − G| < ε. Specifically, let δ be the minimum of ε/(v_max T) and

( √( (ω_max t_max v^C_max/2 + v^C_max)² + 2 ω_max v^C_max ε ) − (ω_max t_max/2 + 1) v^C_max ) / (ω_max v^C_max).
Last but not least, we need to show that by sampling sufficiently finely, the resulting trajectory can be arbitrarily close to the desired goal.

Theorem 2. Consider two spatial trajectories Y and Y′ with the same control sequence, where every control on both trajectories maintains a constant dot product with a vector λ and λ′ respectively. Let the durations of the first two linearly independent controls be t1 (t′1) and t2 (t′2) on Y (Y′). There exists a most sensitive segment k on both trajectories, with durations tk and t′k respectively, such that if |t1 − t′1| < Δt, |t2 − t′2| < Δt, and |tk − t′k| < Δt, then ∑_{i=1}^n |ti − t′i| < nΔt.

Proof. First, we can show that because |t1 − t′1| < Δt and |t2 − t′2| < Δt, λ and λ′ are close both in magnitude and in orientation. Given the ith control, if the duration of this segment changes from ti to t′i, we can use the following equations to compute the location and orientation of the next control if control i is a rotation:

(ω_i × \overrightarrow{c_i g}) · λ = 1    (4)
((|ω_i| R(t_i) ω_{i+1}) × (\overrightarrow{c_i g} − |ω_i| R(t_i) \overrightarrow{c_i c_{i+1}})) · λ = 1    (5)

If the ith control is a translation, then the location of the next control can also be easily computed using the following equations:

v_i · λ = 1    (6)
(ω_j × (\overrightarrow{c_j g} − v_i t_i)) · λ = 1    (7)
Therefore, given the first three linearly independent segments, which are computed using the durations of the first two linearly independent controls, a λ can be computed. When the durations change from t1 (t2) to t′1 (t′2), the tangential velocities change by a small amount both in magnitude and in direction, so the resulting λ and λ′ = λ + Δλ are similar both in scale and in direction. Let us first consider the case where the ith control is a translation and the (i+1)th control is a rotation. Extending Eq. 7 by replacing ti and λ with ti + Δti and λ + Δλ respectively,
one can derive that Δt is a function of Δλ scaled by 1/|vi|; we omit the derivation steps in this proof. Geometrically, because rotation axes lie on the surfaces of concentric spheres, a translation connecting two rotation controls can be viewed as a displacement that forms the correct distance between the rotation axis and the goal. This displacement can only change by a small amount when λ changes to λ + Δλ, so the slower the translation, the larger the Δti will need to be. If the (i+1)th control is also a translation, the switch can happen at any time and Δt can be arbitrarily small. Therefore, the slowest translation may need to change its duration the most given a small change in λ, and thus is the most sensitive translation control.

(|ω_i| R(t_i) v_j) · λ = 1    (8)
Let us then consider rotation controls. Extending Eq. 5, one can similarly derive that Δt is a function of Δλ scaled by 1/|ωi| if the next control is also a rotation. If the next control is a translation, one can extend Eq. 8 to derive that Δt is again a function of Δλ scaled by 1/|ωi|. Geometrically, recall that all rotation axes lie on the surfaces of spheres, but at each point only one direction for the rotation axis satisfies the necessary condition. Therefore, to form the appropriate tangential velocity for the next control, a similar distance needs to be covered by the current control when λ becomes λ + Δλ. Thus the rotation control with the smallest angular velocity is the most sensitive rotation action. Given a screw action, either the slowest translation component or the slowest rotation component dominates the sensitivity. Over the entire trajectory, the comparison among the most sensitive controls of each type yields the most sensitive control. Therefore, given that the durations of the most sensitive segment and the first two linearly independent segments do not change by more than Δt, all other segment durations will not change by more than Δt, and the total duration difference is no larger than nΔt.
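As a plain illustration of the comparison rule at the end of this proof, the following sketch picks the most sensitive segment from per-segment sensitivity scales. The 1/|v| and 1/|ω| factors are the ones derived above; the segment data is hypothetical.

```python
# Sensitivity scales from the proof: a translation segment is sensitive in
# proportion to 1/|v|, a rotation segment in proportion to 1/|omega|; a screw
# takes the larger (more sensitive) of its two components.
# The segment list below is hypothetical example data.
segments = [
    ("translation", {"v": 1.0}),
    ("rotation", {"omega": 0.5}),
    ("screw", {"v": 0.8, "omega": 2.0}),
]

def sensitivity(kind, params):
    if kind == "translation":
        return 1.0 / abs(params["v"])
    if kind == "rotation":
        return 1.0 / abs(params["omega"])
    # screw: dominated by its slowest (most sensitive) component
    return max(1.0 / abs(params["v"]), 1.0 / abs(params["omega"]))

most_sensitive = max(range(len(segments)), key=lambda i: sensitivity(*segments[i]))
print("most sensitive segment index:", most_sensitive)  # -> 1 (slowest rotation)
```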
5.6 Synthesis of Time-Optimal Control Sequences
The proposed algorithm cannot find a closed-form solution for the time-optimal trajectory between an arbitrary pair of start and goal. However, using the geometric interpretations of the necessary conditions, we can build a synthesis of all possible time-optimal control sequences for all goals, and store them for future queries. We can employ Algorithm 1 to find the time-optimal trajectory for each given goal position, but the reader
Algorithm 2. Find synthesis of time-optimal control sequences to all goals
for All possible first controls do
  for Sampled first durations at resolution ε do
    Compare all reachable points and the current reaching duration;
    if Found a new shorter path then
      Update the shortest time and control sequence to this point;
    for All possible second controls do
      for . . . do
        . . . ;
can easily tell that there are many repeated computations for the same or similar λ, especially for nearby goals. We can therefore simplify the procedure needed to find the synthesis.
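The following sketch outlines this breadth-first synthesis as a Python loop over control sequences and sampled durations; the dynamics helper and discretization are hypothetical stand-ins for the paper's implementation.

```python
# A minimal sketch of the synthesis loop in Algorithm 2: enumerate control
# sequences, sample durations at resolution eps, and keep the shortest arrival
# time per discretized endpoint. `CONTROLS`, `integrate`, and `discretize`
# are hypothetical placeholders.
from itertools import product

def synthesize(start, depth, eps, T_max, CONTROLS, integrate, discretize):
    best = {}  # discretized endpoint -> (arrival time, control sequence)
    n_steps = int(T_max / eps)
    for seq in product(CONTROLS, repeat=depth):
        for durations in product([eps * k for k in range(1, n_steps + 1)],
                                 repeat=depth):
            if sum(durations) > T_max:
                continue
            state, t = start, 0.0
            for control, dt in zip(seq, durations):
                state = integrate(state, control, dt)
                t += dt
            key = discretize(state)
            if key not in best or t < best[key][0]:
                best[key] = (t, list(zip(seq, durations)))
    return best
```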
6 Planning Algorithm Examples and Results

Using the Dubins airplane as an example, we conducted experiments to show how our algorithm finds time-optimal trajectories for a given resolution. Figure 3 shows the computed time-optimal trajectories for the given goals using Algorithm 1. The airplane is represented by the gray tetrahedron, the yellow frame is the world frame, and the green arrows are rotation axes. We implemented the procedure in Python. With a resolution of 0.03 for the durations, each search for a goal takes on the order of 300 s on a 2018 commodity desktop computer.
(a) Trajectory found to reach (2, 0.57, 0.66) following controls pitch(up)-yaw(left)-translation. Euclidean distance 2.182, and the duration of the trajectory is 2.24.
(b) Trajectory found to reach (1, 1, 2) following controls pitch(up)-yaw(left) repeatedly. Euclidean distance 2.44, and the duration of the trajectory is 2.915.
(c) Trajectory found to reach (−2, 0, 1) following controls pitch(up)-translation-pitch(down). Euclidean distance 2.23, and the duration of the trajectory is 5.297.
Fig. 3. A few trajectories found using Algorithm 1 for Dubins airplane to reach given endpoints.
In addition to the Dubins airplane, we also conducted experiments with a 3D Reeds-Shepp airplane, obtained by adding symmetric reverse controls. To show the similarities and differences between the Dubins airplane and the Reeds-Shepp system, we ran the algorithm to find trajectories to reach the same set of goals; a few examples are shown in Figs. 3 and 4. When the goal is along the positive x-axis, the resulting trajectories for the Dubins and Reeds-Shepp airplanes are the same; when the endpoint is along the negative direction of the x-axis, the trajectory for the Reeds-Shepp airplane uses some reverse actions.

6.1 Trajectories with Many Switches

The proposed algorithm can also find trajectories with many segments if no switching cost is specified, as shown in Figs. 3b and 4b. What is more, given different search
resolutions for the trajectory durations, the number of segments may change, as shown in Fig. 5a. At a resolution of 0.05, the fastest trajectory reaching (1, 1, 2) has 10 segments. The number of segments increases to 16 when the resolution is refined to 0.03. We have not proven the existence of optimal trajectories for Dubins systems, so the existence of trajectories with many switches is concerning. Fortunately, by adding a small switching cost, the existence of optimal trajectories is guaranteed, and we expect
(a) Trajectory found to reach (2, 0.57, 0.66) following controls pitch(up)-yaw(left)-translation. Euclidean distance 2.182, and the duration of the trajectory is 2.24.
(b) Trajectory found to reach (1, 1, 2) following controls pitch(up)-yaw(left) repeatedly. Euclidean distance 2.44, and the duration of the trajectory is 2.915.
(c) Trajectory found to reach (−2, 0, 1) following controls pitch(back-up)-translation(back)-pitch(back-down). Euclidean distance 2.23, and the duration of the trajectory is 2.36.
Fig. 4. A few trajectories found using Algorithm 1 for the Reeds-Shepp airplane to reach given endpoints.
(a) Trajectory found to reach goal (1, 1, 2) at resolution 0.03 with no switching cost. Euclidean distance 2.44, total duration of trajectory is 2.96 with 16 segments.
(b) Trajectory found to reach goal (1, 1, 2) at resolution 0.03 with switching cost 0.01. Euclidean distance 2.44, total duration of trajectory is 3.02 with 4 segments.
(c) Trajectory found to reach goal (1, 1, 2) at resolution 0.03 with switching cost 0.1. Euclidean distance 2.44, total duration of trajectory is 3.23 with 3 segments.
(d) Trajectory found to reach goal (1, 1, 2) at resolution 0.03 with switching cost 0.3. Euclidean distance 2.44, total duration of trajectory is 3.24 with 2 segments.
Fig. 5. A few trajectories found using Algorithm 1 for Dubins airplane to reach (1, 1, 2) with different switching costs.
much greater numerical stability for the search. For example, as shown in Fig. 5b, adding a 0.01 switching cost decreases the number of segments from 16 to four at the same resolution of 0.03. With higher switching costs, the number of segments on the fastest trajectory found by Algorithm 1 decreases even further.
7 Conclusions

In this work, we presented a generic numerical approach to find kinematic time-optimal trajectories for 3D rigid bodies. We showed that for a trajectory to be time-optimal, there must exist a λ vector such that certain necessary conditions are satisfied. The proposed approach starts by finding valid λ vectors and then compares the fastest trajectories for each λ. We showed that for any given endpoint, the proposed approach can find a trajectory that reaches the endpoint within a given resolution and is approximately optimal. For future work, we would like to extend the current approach to find trajectories that can reach any given endpoint configuration rather than just a location. We would also like to further generalize the algorithms to more generic systems. Finally, we would like to explore how the proposed approaches can be integrated with existing planners to find efficient trajectories even in the presence of obstacles.
References

1. Balkcom, D.J., Furtuna, A.A., Wang, W.: The Dubins car and other arm-like mobile robots. In: International Conference on Robotics and Automation (ICRA) (2018)
2. LaValle, S.M., Kuffner Jr., J.J.: Randomized kinodynamic planning. Int. J. Robot. Res. 20(5), 378–400 (2001)
3. Karaman, S., Frazzoli, E.: Sampling-based algorithms for optimal motion planning. Int. J. Robot. Res. 30(7), 846–894 (2011)
4. Balkcom, D.J., Kannan, A., Lyu, Y., Wang, W., Zhang, Y.: Metric cells: towards complete search for optimal trajectories. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2015, Hamburg, Germany, 28 September–2 October 2015, pp. 4941–4948 (2015)
5. Alterovitz, R., Goldberg, K.Y., Okamura, A.M.: Planning for steerable bevel-tip needle insertion through 2D soft tissue with obstacles. In: Proceedings of the 2005 IEEE International Conference on Robotics and Automation, ICRA 2005, Barcelona, Spain, 18–22 April 2005, pp. 1640–1645 (2005)
6. Chitsaz, H., LaValle, S.M.: Time-optimal paths for a Dubins airplane. In: 2007 46th IEEE Conference on Decision and Control, pp. 2379–2384, December 2007
7. Wang, W., Balkcom, D.: Sampling extremal trajectories for planar rigid bodies. In: Algorithmic Foundations of Robotics (WAFR), pp. 331–347 (2012)
8. Karaman, S., Frazzoli, E.: Sampling-based algorithms for optimal motion planning. Int. J. Robot. Res. 30(7), 846–894 (2011)
9. Lyu, Y., Furtuna, A.A., Wang, W., Balkcom, D.: The bench mover's problem: minimum-time trajectories, with cost for switching between controls. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 106–112 (2014)
10. Lyu, Y.-H., Balkcom, D.J.: Optimal trajectories for kinematic planar rigid bodies with switching costs. Int. J. Robot. Res. 35(5), 454–475 (2016)
11. Dubins, L.E.: On curves of minimal length with a constraint on average curvature, and with prescribed initial and terminal positions and tangents. Am. J. Math. 79(3), 497–516 (1957)
12. Reeds, J.A., Shepp, L.A.: Optimal paths for a car that goes both forwards and backwards. Pac. J. Math. 145, 367–393 (1990)
13. Boissonnat, J., Cérézo, A., Leblond, J.: Shortest paths of bounded curvature in the plane. J. Intell. Robot. Syst. 11(1–2), 5–20 (1994)
14. Sussmann, H.J., Tang, G.: Shortest paths for the Reeds-Shepp car: a worked out example of the use of geometric techniques in nonlinear optimal control. Technical report, Rutgers University (1991)
15. Souères, P., Laumond, J.P.: Shortest paths synthesis for a car-like robot. IEEE Trans. Autom. Control 41(5), 672–688 (1996)
16. Balkcom, D., Mason, M.T.: Time optimal trajectories for bounded velocity differential drive vehicles. Int. J. Robot. Res. 21(3), 199–218 (2002)
17. Chitsaz, H.R., LaValle, S.M., Balkcom, D.J., Mason, M.T.: Minimum wheel-rotation paths for differential-drive mobile robots. Int. J. Robot. Res. 28(1), 66–80 (2009)
18. Balkcom, D., Kavathekar, P.A., Mason, M.T.: Time-optimal trajectories for an omnidirectional vehicle. Int. J. Robot. Res. 25(10), 985–999 (2006)
19. Wang, W., Balkcom, D.: Analytical time-optimal trajectories for an omni-directional vehicle. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 4519–4524 (2012)
20. Furtuna, A.A., Balkcom, D.: Generalizing Dubins curves: minimum-time sequences of body-fixed rotations and translations in the plane. Int. J. Robot. Res. 29(6), 703–726 (2010)
21. Furtuna, A.: Minimum time kinematic trajectories for self-propelled rigid bodies in the unobstructed plane. Ph.D. thesis, Dartmouth College (2011)
22. Souères, P., Boissonnat, J.D.: Optimal trajectories for nonholonomic mobile robots. In: Robot Motion Planning and Control, pp. 93–170. Springer (1998)
23. Sussmann, H.J.: The Markov-Dubins problem with angular acceleration control. In: Proceedings of the 36th IEEE Conference on Decision and Control, vol. 3, pp. 2639–2643. IEEE (1997)
24. Dolinskaya, I.S., Maggiar, A.: Time-optimal trajectories with bounded curvature in anisotropic media. Int. J. Robot. Res. 31(14), 1761–1793 (2012)
25. Chyba, M., Haberkorn, T.: Designing efficient trajectories for underwater vehicles using geometric control theory. In: 24th International Conference on Offshore Mechanics and Arctic Engineering (2005)
26. Blatt, J.M.: Optimal control with a cost of switching control. ANZIAM J. 19(3), 316–332 (1976)
27. Vendittelli, M., Laumond, J., Nissoux, C.: Obstacle distance for car-like robots. IEEE Trans. Robot. Autom. 15(4), 678–691 (1999)
28. Giordano, P.R., Vendittelli, M.: Shortest paths to obstacles for a polygonal Dubins car. IEEE Trans. Robot. 25(5), 1184–1191 (2009)
29. Agarwal, P.K., Latombe, J., Motwani, R., Raghavan, P.: Nonholonomic path planning for pushing a disk among obstacles. In: Proceedings of the 1997 IEEE International Conference on Robotics and Automation, Albuquerque, New Mexico, USA, 20–25 April 1997, pp. 3124–3129 (1997)
30. Desaulniers, G.: On shortest paths for a car-like robot maneuvering around obstacles. Robot. Auton. Syst. 17(3), 139–148 (1996)
Robust Tracking with Model Mismatch for Fast and Safe Planning: An SOS Optimization Approach

Sumeet Singh1(B), Mo Chen1, Sylvia L. Herbert2, Claire J. Tomlin2, and Marco Pavone1

1 Department of Aeronautics and Astronautics, Stanford University, Stanford, USA
[email protected], {mochen72,pavone}@stanford.edu
2 Department of Electrical Engineering and Computer Science, University of California, Berkeley, USA
{sylvia.herbert,tomlin}@berkeley.edu
Abstract. In the pursuit of real-time motion planning, a commonly adopted practice is to compute trajectories by running a planning algorithm on a simplified, low-dimensional dynamical model, and then employ a feedback tracking controller that tracks such a trajectory by accounting for the full, high-dimensional system dynamics. While this strategy of planning with model mismatch generally yields fast computation times, there are no guarantees of dynamic feasibility, which hampers application to safety-critical systems. Building upon recent work that addressed this problem through the lens of Hamilton-Jacobi (HJ) reachability, we devise an algorithmic framework whereby one computes, offline, for a pair of “planner” (i.e., low-dimensional) and “tracking” (i.e., high-dimensional) models, a feedback tracking controller and associated tracking bound. This bound is then used as a safety margin when generating motion plans via the low-dimensional model. Specifically, we harness the computational tool of sum-of-squares (SOS) programming to design a bilinear optimization algorithm for the computation of the feedback tracking controller and associated tracking bound. The algorithm is demonstrated via numerical experiments, with an emphasis on investigating the trade-off between the increased computational scalability afforded by SOS and its intrinsic conservativeness. Collectively, our results enable scaling the appealing strategy of planning with model mismatch to systems that are beyond the reach of HJ analysis, while maintaining safety guarantees.
Singh, Chen, and Pavone were supported by NASA under the Space Technology Research Grants Program, Grant NNX12AQ43G, and by the King Abdulaziz City for Science and Technology (KACST). Herbert and Tomlin were supported by SRC under the CONIX Center and by ONR under the BRC program in Multibody Systems.
© Springer Nature Switzerland AG 2020
M. Morales et al. (Eds.): WAFR 2018, SPAR 14, pp. 545–564, 2020. https://doi.org/10.1007/978-3-030-44051-0_32
1 Introduction
As robotic systems become more pervasive, real-time motion planning becomes increasingly important. In general, motion planning algorithms must find a trade-off among three key challenges: (1) achieving fast computational speed to enable online re-planning, (2) accounting for complex system dynamics, and (3) ensuring formal safety and robustness guarantees. In particular, to enable high-frequency re-planning, a commonly adopted practice is to compute a trajectory by running a planning algorithm on a simplified, low-dimensional dynamical model, oftentimes just a single-integrator model. A feedback tracking controller that accounts for the full, high-dimensional dynamics of the system is then used to track such a trajectory (we refer to this approach as planning with model mismatch, see Fig. 1). This approach usually guarantees fast planning speeds, but guaranteeing safety becomes difficult, due to the mismatch between the model used for planning and the actual dynamics of the robotic system. This approximation can introduce tracking error between the planned path and the actual trajectory of the system, potentially resulting in safety violations. Conversely, one could compute a collision-free motion plan by directly accounting for the full set of differential constraints of a robotic system (e.g., by using a generalpurpose kinodynamic motion planning algorithm [16]). However, despite significant recent progress [13,14,17,18], kinodynamic motion planning is still computationally intensive [16]. For robots characterized by “slow” or differentially flat dynamics, gradient-based trajectory optimization techniques (e.g., [27,31]) may provide quick convergence to a feasible solution, but extending such methods to systems with complex dynamics is challenging.
Fig. 1. Left: planning with model mismatch. The tracking system (car, red trajectory) follows a motion plan computed via a low-fidelity dynamical model (blue trajectory). Tracking with model mismatch while maintaining safety requires keeping the maximum tracking error bound (TEB) below a certifiable value. Right: TEB certification. The TEB is characterized by the property that on its boundary, all possible error dynamics resulting from the nominal trajectory point inwards under some tracking controller, so that the TEB is never exceeded.
In general, providing formal guarantees on safety and robustness typically requires reasoning on the full dynamical model of a robotic system. One of the most powerful tools in this regard is the concept of “funnels” [5]. The key idea is to ensure that the trajectory of a system remains within precomputed funnels [19,20,36]; this is often achieved via Lyapunov analysis techniques.
Robust Tracking with Model Mismatch for Fast and Safe Planning
547
However, while able to provide formal guarantees around nominal trajectories, these methods are less suitable for a priori unknown environments in which safe trajectories must be found online. The work in [21] tackles this challenge by considering a library of funnels, from which maneuvers are composed in sequence to form guaranteed safe trajectories. To obtain trajectory-independent tracking error bounds, contraction theory and nonlinear optimization techniques can be used to construct tubes around dynamically-feasible trajectories obtained from any planning algorithm [33]. However, as mentioned above, rapidly computing dynamically-feasible nominal trajectories poses a significant computational challenge. To tackle real-time motion planning problems, recent works such as [6,11,15] combine low- and high-fidelity models to strike a balance among the three aforementioned challenges of computational speed, model accuracy, and robustness. In [15], a forward reachable set for the high-fidelity model is computed offline and then used to prune trajectories generated online using the low-fidelity model. The approach relies on an assumed model mismatch bound, expressed as the maximum distance between the low- and high-fidelity models under a certain metric. FaSTrack [6,11] considers a slightly different definition of model mismatch, and casts a motion planning problem as a differential game wherein a low-fidelity "planning" system is being pursued (i.e., tracked) by a high-fidelity "tracking" system. Hamilton-Jacobi (HJ) reachability analysis is used to precompute a worst-case tracking error bound (TEB) as well as the optimal feedback tracking controller. Online, FaSTrack plans in real time using the low-fidelity planning system, while simultaneously tracking the plan and remaining within guaranteed tracking error bounds – despite the model mismatch. FaSTrack's ability to ensure real-time planning of dynamic systems with safety guarantees makes it a desirable framework, but it requires a significant precomputation step in computing the TEB and controller. Executing this precomputation using HJ reachability is reliable and straightforward for nonlinear, highly-coupled dynamical systems with up to five states, or higher-dimensional systems of a specific form [8]. However, the computations required for HJ reachability analysis scale exponentially with the number of states, thus making FaSTrack difficult to apply to systems with a large number of states and/or highly-coupled dynamics. Building upon FaSTrack, we present an approach to designing a feedback tracking controller along with guaranteed tracking error bounds via sum-of-squares (SOS) programming. SOS programs can be cast as semidefinite programs (SDPs), which may be solved using standard convex optimization toolboxes. SOS programming has been used extensively to provide convex relaxations to fundamentally non-convex or even NP-hard problems arising in robust control [19–21,24,36]. In the context of this work, we leverage SOS to derive sufficient conditions for bounding tracking control error. The price paid for such a convex relaxation is in the tightness of the error bound guarantee (which corresponds to the size/computational tractability of the SOS SDP), but critically not in the existence of the guarantee itself, provided the SOS program is feasible. SOS programming thus represents an attractive alternative to exact HJ reachability
for those robotic systems where HJ is not directly applicable due to computational intractability. In addition, this paper, as well as FaSTrack [6,11], has potential to complement works such as [15] that assume a model-mismatch error, by providing the TEB as well as a feedback tracking controller to achieve such a TEB. Statement of Contributions: The contribution of this paper is threefold. First, we provide an SOS-based formulation for tracking a nominal motion plan computed on a simplified low-dimensional model. Second, leveraging the SOS formulation, we devise a bilinear optimization algorithm to jointly compute, offline, a trajectory-independent feedback tracking controller and the corresponding TEB. Finally, we demonstrate our approach on a 5D example for comparison with FaSTrack [11], and on an 8D example which is beyond the reach of HJ analysis. Collectively, our three contributions enable scaling the appealing strategy of planning with model mismatch to systems that are beyond the reach of HJ analysis, while maintaining safety guarantees. Organization: The remainder of the paper is organized as follows. In Sect. 2 we provide the problem statement and give necessary background on SOS programming. Section 3 formalizes the problem as a bilinear SOS program, while Sect. 4 develops an algorithmic framework for its solution. In Sect. 5 we present numerical experiments. Finally, in Sect. 6, we draw our conclusions and suggest directions for future work.
2 Preliminaries

In this section, we present the problem formulation, and provide the necessary background material related to sum-of-squares programming.

2.1 Problem Formulation
Consider a relatively high-fidelity model of a robotic system, referred to as the "tracking model," and a lower-fidelity model, referred to as the "planning model," described, respectively, by the ordinary differential equations (ODEs)

ṡ = f^s(s, u^s), u^s ∈ U^s,    ṗ = f^p(p, u^p), u^p ∈ U^p,    (1)
where s ∈ Rns is the tracking model’s state (or “tracking state”), and us ∈ Rms is its control input (or “tracking control”), constrained to lie within the compact set U s . Similarly, p ∈ Rnp is the planning state and up ∈ Rmp is the planning control input, constrained to lie within the compact set U p . The planning model can be thought of as a coarser or lower-fidelity model for which motion planning can be done effectively in real time. For both ODEs in Eq. (1), we assume that the dynamics are Lipschitz continuous in the state for fixed controls, so that given any measurable control function, a unique trajectory exists for each system [10]. Furthermore, we assume that the planning state p is a strict subset of the tracking
state s, in particular np < ns (this assumption allows one to clearly relate the tracking and planning state, as discussed in Sect. 3.1), and that the dynamics for the tracking model are control affine (a rather mild condition that significantly simplifies the SOS formulation, as discussed in Sect. 3.3). Crucially, trajectories generated by using the planning model may not be dynamically feasible with respect to the tracking model. Thus, a tracking controller that accounts for the high-fidelity dynamics of the robotic system, as represented by the tracking model, may not be able to track these trajectories exactly and some (possibly large) tracking error may be incurred. Figure 1 illustrates this behavior, for the case of a 4D car model used to track a 2D single integrator model. Specifically, a planning algorithm is used to "quickly" find a nominal motion plan for the planning model, depicted by the dashed blue line. The robot, represented more accurately by the tracking model, strives to track the nominal motion plan as closely as possible, giving rise to the red trajectory. The presence of a minimum turning radius constraint in the tracking model prevents the robot from tracking the sharp corners exactly, thereby yielding a tracking error. Such a decoupling approach is quite common in the field of motion planning [16], yet very few results exist on how to account for the resulting tracking error. In particular, we advocate that the computation of the nominal motion plan via the planning model should account for the tracking error in order to ensure safe execution (i.e., tracking) by the real system, described by the tracking model. Accordingly, the objective of this paper is to devise an efficient and scalable algorithm to compute (1) a trajectory-independent TEB (represented by the light blue circle in Fig. 1, left), which the planner can use as a "safety buffer" to ensure that the actual trajectory is collision free, and (2) a feedback tracking controller to track such a nominal motion plan while respecting the TEB. Intuitively, the tracking controller should have the property that on the boundary of the TEB, the error dynamics are driven "inwards," regardless of the nominal motion plan (depicted in Fig. 1, right).

2.2 Sum-of-Squares (SOS) Programming Background
Our approach is rooted in SOS programming, for which we provide a brief review here (for a more detailed review of SOS programming and its applications, please refer to [2,21,23]). We begin by discussing semidefinite programs (SDPs), a class of convex optimization problems formulated over the space of symmetric positive semidefinite (psd) matrices. Denote the set of n × n symmetric matrices as S^n. A symmetric matrix X ∈ S^n is positive semidefinite (psd) if xᵀXx ≥ 0 for all x, and this is denoted as X ⪰ 0. An SDP in standard form is written as

minimize_{X ∈ S^n}  Tr(CᵀX)
subject to  Tr(A_iᵀX) = b_i, i = 1, 2, ..., m,
            X ⪰ 0,    (2)
where C and {A_i}_i are elements of S^n, and "Tr" denotes the trace operator. SOS programs provide a means of certifying nonnegativity of polynomials either
globally or over basic semialgebraic sets. A basic semialgebraic set S is a subset of the Euclidean space characterized by a finite set of polynomial inequalities and equalities, that is,

S := {x ∈ R^n : φ_i(x) ≤ 0, ψ_j(x) = 0},    (3)
where {φ_i}, {ψ_j} are multivariate polynomials in x. The simplest task within this class of problems is verifying the nonnegativity of a given polynomial p(x) over S, which is already NP-hard [23]. SOS programming provides a convex relaxation approach to accomplish this task. Specifically, a polynomial p is SOS if it can be written in the form p = ∑_k z_k² for some other polynomials z_k(x). While such a decomposition is not necessary, it is sufficient to guarantee (global) nonnegativity of p(x). If one can find a set of SOS polynomials L_i(x) and ordinary polynomials q_j(x) such that

p + ∑_i L_i φ_i + ∑_j q_j ψ_j is SOS,    (4)
then one obtains a certificate of nonnegativity of p(x) over S. Indeed, in the above equation, when φ_i(x) ≤ 0 and ψ_j(x) = 0, one has that p(x) ≥ −∑_i L_i(x) φ_i(x) ≥ 0, as required. Such a certificate is the extension of the generalized S-procedure [12] to the setting of real-valued polynomials [23], and additionally constitutes a necessary condition for a subclass of semialgebraic sets [25]. The computational advantage of SOS programming stems from its intrinsic link to SDPs. Specifically, a polynomial p of degree 2d is SOS if and only if p(x) = z(x)ᵀQz(x), where Q ⪰ 0 and z(x) is a vector of monomials up to order d. Thus, certifying that a polynomial is SOS reduces to the task of finding a psd matrix Q subject to a finite set of linear equalities, thus taking the form in (2). Certificates of the form in (4) will form the building block for our approach.
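To make the SOS-to-SDP reduction concrete, the following minimal sketch certifies that p(x) = x⁴ + 2x² + 1 is SOS by searching for the Gram matrix Q directly. It uses the generic solver interface cvxpy purely for illustration; the implementation in this paper relies on the Spotless toolbox and MOSEK (see Sect. 5).

```python
# Minimal SOS-to-SDP sketch: certify p(x) = x^4 + 2x^2 + 1 is SOS by finding
# Q >= 0 with p(x) = z(x)^T Q z(x) for the monomial vector z = [1, x, x^2].
# Matching coefficients of 1, x, x^2, x^3, x^4 gives the linear equalities below.
import cvxpy as cp

Q = cp.Variable((3, 3), symmetric=True)
constraints = [
    Q >> 0,                      # Gram matrix must be psd
    Q[0, 0] == 1,                # constant term
    2 * Q[0, 1] == 0,            # x term
    2 * Q[0, 2] + Q[1, 1] == 2,  # x^2 term
    2 * Q[1, 2] == 0,            # x^3 term
    Q[2, 2] == 1,                # x^4 term
]
prob = cp.Problem(cp.Minimize(0), constraints)  # pure feasibility SDP
prob.solve()
print(prob.status)  # "optimal" means an SOS certificate was found
```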
3 SOS-Based Robust Tracking

3.1 Relative System
Given the tracking and planning models as in Eq. (1), we define the relative system state r ∈ R^{n_r} as r := φ(s, p)(s − Qp), where φ(·, ·) : R^{n_s} × R^{n_p} → R^{n_r × n_s}, with n_r ≤ n_s, is a matrix-valued map with bounded derivatives, and Q = [I_{n_p}, 0_{n_p × (n_s − n_p)}] is a projection matrix from R^{n_p} to R^{n_s} (I and 0 denote the identity and zero matrices, respectively). In the definition of Q, we leveraged the assumption that the planning state p is a strict subset of the tracking state s. We make the following assumption on the dynamics of the relative system state, which is central for our approach:

Assumption 1 (Relative System Dynamics). The relative system dynamics can be written as a Lipschitz continuous function f^r of (r, u^s, u^p), that is, ṙ = f^r(r, u^s, u^p).
Lipschitz continuity is necessary for the existence and uniqueness of solutions to the ODEs. The structural property in Assumption 1 is satisfied for a number of dynamical models of mobile robotic systems. For example, for ubiquitous cases where the planning model has integrator dynamics, Assumption 1 is satisfied by simply selecting φ(·, ·) as the identity map. However, in general, it is difficult to characterize the conditions on the planning/tracking models such that Assumption 1 is satisfied. Therefore, in Table 1 in the extended version of this work [32], we provide a "catalog" of tracking and planning models fulfilling Assumption 1 for a representative set of robotic systems, along with the map φ that guarantees fulfillment of Assumption 1 and the resulting relative system dynamics. Specifically, the relative system states in the 6th column are obtained by combining the tracking state in the 2nd column, planning state in the 4th column, and the transformation in the 5th column via r = φ(s, p)(s − Qp). Henceforth, we will consider planning/tracking models that fulfill Assumption 1, and refer to the system ṙ = f^r(r, u^s, u^p) as the relative system. Finally, we define the error state e as the relative system state excluding the absolute states of the tracking model, and the auxiliary state η as the relative system state excluding the error state. Hence, r = (e, η). For instance, for the case of a 5D car model used to track a 3D Dubins car model (second example in Table 1 in [32]), e = (xr, yr, θr) and η = (v, ω). Roughly speaking, considering relative system dynamics allows one to reformulate the tracking problem as a stabilization problem to the origin.
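As a concrete instance of the identity-map case mentioned above, the sketch below forms the relative state for a hypothetical 4D car tracking a 2D single-integrator planner (the pairing shown in Fig. 1); the state layout is an illustrative assumption, not the paper's exact Table 1 entry.

```python
# Relative state r = phi(s, p) (s - Q p) with phi = identity, for a 4D car
# tracking state s = (x, y, theta, v) and a 2D single-integrator planning
# state p = (x, y). The state layout here is an illustrative assumption.
import numpy as np

n_s, n_p = 4, 2
Q = np.vstack([np.eye(n_p), np.zeros((n_s - n_p, n_p))])  # lifts p into R^{n_s}

s = np.array([1.0, 2.0, 0.3, 0.9])  # tracking state
p = np.array([0.8, 2.1])            # planning state

r = s - Q @ p                        # relative state: position error + (theta, v)
print(r)                             # -> [ 0.2 -0.1  0.3  0.9]
```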
3.2 Optimization Problem
In order to define a notion of tracking error bound (TEB), let c(r) : R^{n_r} → R^{n_r̃} be a smooth mapping representing a vector of quantities we wish to remain bounded, with n_r̃ ≥ n_r; henceforth, we will refer to c(r) as the bounded state. The simplest example for c(·) is the identity map. However, for more complex planning/tracking models, the mapping c(·) can be more sophisticated to better capture the structure of the planner/tracker dynamics pair – some examples are provided in Table 1. The goal then is to find a closed and bounded set B in the bounded state such that c(r(t)) ∈ B for all t ≥ 0 (as an illustration, see the light blue ball in Fig. 1). Such an invariance condition is captured by the implication

c(r(0)) ∈ B ⇒ c(r(t)) ∈ B, ∀t ≥ 0.    (5)
To parameterize this set, let ρ be a positive constant and Ṽ(r̃) : R^{n_r̃} → R be a smooth function with a bounded ρ-sublevel set. Also, define the function V(r) : R^{n_r} → R as simply the composition Ṽ(c(·)). Suppose that the following implication is true:

V(r) = ρ ⇒ V̇ = (∂Ṽ(r̃)/∂r̃)ᵀ|_{r̃=c(r)} (∂c(r)/∂r) f^r(r, u^s, u^p) < 0.
That is, there exists some tracking control us such that the time-derivative of V on the boundary V (r) = ρ is negative. It follows that the set {r : V (r) ≤ ρ}
is invariant. Then, the set {c(r) : Ṽ(c(r)) ≤ ρ} is a valid error bound B in the sense of Eq. (5). An illustration of this idea is provided in Fig. 1 (right).

Remark 1. Note that the set {c(r) : Ṽ(c(r)) ≤ ρ} is equivalent, up to a translation and scaling factor, to the set {c(r) : α(Ṽ(c(r)) + β) ≤ α(ρ + β)}, where β ∈ R and α > 0. To eliminate this redundancy, we impose, without loss of generality, the conditions Ṽ(0) = 0 and ρ = 1. The choice Ṽ(0) = 0 essentially corresponds to a translational adjustment that "centers" the set B on a zero bounded state.

In the context of this paper, set B is used as a "buffer" that a planner should consider along each candidate nominal motion plan, to ensure that the realized tracking trajectory is safe, that is, collision free. Clearly, the search for set B is intertwined with the search for a tracking controller that can keep the realized tracking trajectory within it. We parameterize the tracking controller u^s as a function of the relative system state and planning control signal, that is, u^s = K(r, u^p). Denote the closed-loop dynamics as f^r_K(r, u^p) := f^r(r, K(r, u^p), u^p), and define

V̇_K(r, u^p) := (∂Ṽ(r̃)/∂r̃)ᵀ|_{r̃=c(r)} (∂c(r)/∂r) f^r(r, K(r, u^p), u^p).
The search for a suitable function V and controller K(·, ·) is then formalized by the optimization problem:

minimize_{K(·,·), B}  volume(B)    (6a)
subject to  V(r) = 1, u^p ∈ U^p ⇒ V̇_K(r, u^p) < 0    (6b)
            V(r) ≤ 1, u^p ∈ U^p ⇒ K(r, u^p) ∈ U^s    (6c)
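Constraint (6b) can be sanity-checked numerically for a candidate pair (V, K) by sampling the boundary of the sublevel set, as in the hedged sketch below; V_fn, Vdot_K, and the samplers are hypothetical placeholders rather than part of the formulation.

```python
# Monte-Carlo sanity check of the invariance constraint (6b): on sampled points
# with V(r) = 1, verify Vdot_K(r, u^p) < 0 for sampled planning controls.
# `V_fn`, `Vdot_K`, `sample_level_set`, and `sample_planning_controls` are
# hypothetical placeholders for a concrete candidate (V, K) pair.
def check_invariance(V_fn, Vdot_K, sample_level_set, sample_planning_controls,
                     n_r=100, n_u=20, tol=1e-6):
    worst = -float("inf")
    for r in sample_level_set(V_fn, level=1.0, n=n_r):
        for up in sample_planning_controls(n=n_u):
            worst = max(worst, Vdot_K(r, up))
    # A positive `worst` falsifies the certificate; a negative value is only
    # evidence in favor -- the SOS program provides the actual proof.
    return worst < -tol, worst
```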
We will next encode the constraints and objective of problem (6) as SOS certificates.
3.3 Reformulating the Constraints as SOS Certificates
We consider the constraints first. Under the assumption that the dynamics for the tracking model are control affine (common for most robotic systems), the function f^r is control affine in u^s. That is, f^r takes the form f^r(r, u^p, u^s) = h(r, u^p) + B(r)u^s. Thus,

V̇ = (∂V(r)/∂r)ᵀ h(r, u^p) + (∂V(r)/∂r)ᵀ B(r) u^s.

Hence, the invariance condition requires the existence of a tracking controller K(r, u^p) such that

V(r) = 1, u^p ∈ U^p ⇒ (∂V(r)/∂r)ᵀ h(r, u^p) + (∂V(r)/∂r)ᵀ B(r) K(r, u^p) < 0.    (7)
553
We can equivalently state inequality (7) as:
T T ∂V (r) p T ∂V (r) h(r, u ) + min K(r, up ) < 0. ⇒ B(r) ∂r ∂r K(r,up )∈U s (8) The equivalence between inequalities (7) and (8) follows from the observation that inequality (7) holds for some us ∈ U s if and only if it holds for a us ∈ U s minimizing the left hand side in inequality (8). Importantly, the minimization in inequality (8) is independent of up , as the constraint set U s and objective coefficient vector B T ∂V /∂r do not depend on up . Thus, we can simplify K by making it a function of r only (with a slight abuse of notation, we still refer to such a simplified tracking controller with K). We can now define SOS certificates for the constraints in problem (6). Suppose set U p can be written as the semialgebraic set {up ∈ Rmp : gip (up ) ≤ 0, i = 1, . . . , Np } and let the controller K be a polynomial function in r. Expanding V˙ K , one can encode constraint (6b) using the multiplier polynomials Np LLya (r, up ), {Lpi (r, up )}i=1 such that V (r) = 1 up ∈ U p
−V˙ K + LLya · (V − 1) +
Np
Lpi · gip is SOS,
{Lpi } are SOS,
i = 1, . . . , Np .
i=1
(9) (In the following we drop the summation notation and write the collective sum as a “dot-product:” Lp · g p ). To capture the control constraint in (6c), one may leverage two different techniques. The first one is to compose the tracking controller K with a saturation function and write V˙ in case form corresponding to the unsaturated and saturated regimes, respectively. However, this approach scales exponentially in the dimension of the control input, as one needs to capture all combinations of saturation regimes [21]. Instead, we assume that the tracking control set U s is polytopic, i.e., U s = {us ∈ Rms : g s (us ) ≤ 0}, where g s (us ) ≤ 0 is a set of linear inequalities in us of the form gis = aTsi us − bsi ≤ 0, i = 1, . . . , Ns . Note that this is not an overly restrictive assumption as the control constraints are often taken to be box constraints. Then, we encode constraint (6c) using the multipliers {Lsi (r, up )}: −gis (K) + Lsi · (V − 1) is SOS,
Lsi are SOS,
i = 1, . . . , Ns .
(10)
Crucially, this approach scales linearly with respect to the number of inputs. 3.4
Reformulating the Objective as SOS Certificates
Minimizing the volume of B, which itself is a non-linear mapping of the 1−sublevel set of V , is a difficult task if one reasons in terms of arbitrary functions V . One approach is to approximate the volume by the integral of V over an Euclidean ball of radius R such that the ball is contained within the 1−sublevel
554
S. Singh et al.
set of V – an approach taken, for example, in [24]. Alternatively, one can minimize the volume of an encapsulating ellipsoid, easily captured using an additional constraint and a convex pseudo-objective. Toward this end, we define the ellipsoid E := {r : c(r) Ec(r) ≤ 1} for some positive definite matrix E δE I, where δE > 0 is a small tolerance parameter. Then, the inclusion constraint B ⊆ E is captured using the SOS multiplier LE (r): 1 − c Ec + LE (V − 1) is SOS,
LE is SOS,
E δE I.
(11)
As the volume of E is inversely proportional to the square root of its determinant, the minimum volume objective reduces to maximizing the determinant of E, and can be written in the form of (2) [4, Chapter 4]. For non-identity mappings c(·), the first SOS constraint in equation (11) can quickly become computationally challenging due to the quadratic terms in c(r) and the complexity of the relation V (·) = V˜ (c(·)). In this case, one may consider the following simplification. Since r ∈ Rnr˜ : V˜ (˜ r) ≤ 1}, B = {c(r) ∈ Rnr˜ : V˜ (c(r)) ≤ 1} ⊆ {˜
(12)
the simplification is to outer bound the set on the right by using an ellipsoid. Specifically, let the ellipsoid E be given by E = {˜ r : r˜ E r˜ ≤ 1} for some positive r) ≤ 1 ⇒ r˜ E r˜ ≤ 1, is definite matrix E δE I. The inclusion condition V˜ (˜ E r): captured using the SOS multiplier L (˜ 1 − r˜ E r˜ + LE (V˜ − 1) is SOS,
LE is SOS,
E δE I.
(13)
Crucially, the constraints above are deliberately written with respect to the variable r˜, which is treated as an independent indeterminate from r. This is beneficial when using a mapping c(·) other than the trivial identity map, as it allows one to approximate the otherwise complex set B with potentially simpler functions in r˜. 3.5
SOS Formulation of Optimization Problem
Collecting all the results so far, our approach entails conservatively (as we rely on sufficient SOS certificates) solving the optimization problem (6) as a SOS program: maximize ˜ ,E,L K,V
subject to
log det(E)
(14a)
Eqs. (9), (10), (13)
(14b) (14c)
E δE I, ρ ≥ 0 V˜ (0) = 0
(14d)
where L := {LLya , Lp , Ls , LE }, and 1 is the vector of 1s. Reinforcing the ideas in Sect. 3.4, we iterate that V˜ is considered as the optimization variable in the independent indeterminate r˜, with constraints (9), (10) encoded using the indeterminate r.
Robust Tracking with Model Mismatch for Fast and Safe Planning
555
Constraints in (14b) are bilinear in the decision variables. Consequently, similar to the bilinear solution algorithms in [21] and [24], one must alternate between the decision variable sets {E, K, L} and {V˜ , E, Lp }, each time holding the other variable set fixed. Specifically, we refer to the sub-problem where K(·) is part of the decision variable set as the K sub-problem, and the sub-problem where V˜ is part of the decision variable set as the V sub-problem. Direct implementation of alternations is hindered by two challenges. First, one requires a feasible initial guess for V˜ to begin the alternations. In [21] and [24], the authors leverage locally linearized models and the solution to the Riccati equation to generate such a guess. In this paper we address a fundamentally more challenging problem as we consider the controllability of a relative dynamical system between two different dynamical models. Consequently, the relative system is often not linearly controllable at r = 0 (even if the individual models are) and thus one requires an additional procedure to generate a feasible function V˜ for problem (14) from an initially infeasible guess. Second, alternating optimization is prone to numerical instabilities as the solutions of the individual convex problems frequently lie on boundaries of the respective feasible sets. We address these challenges next.
4
Solving the Bilinear Optimization Problem
The general idea to tackle both of the aforementioned challenges entails using iterative slack minimization. We first discuss the solutions to the K and V subproblems (Sects. 4.1 and 4.2, respectively) and then present the general solution algorithm in Sect. 4.3. 4.1
The K Sub-problem
A na¨ıve implementation of the K sub-problem would entail solving problem (14) with respect to {E, K, L} for a given V˜ . However, constraint (9) as written may generate controllers that are numerically unstable when held fixed in the V subproblem. Given the polytopic constraint set U s and a fixed V˜ , one can exactly characterize the “most stabilizing controller” as the function K that minimizes the left hand side in inequality (8) for all r such that V (r) = 1. Thus, we propose the following two-step method for solving the K sub-problem, for a given function V˜ : 1. Find the “most stabilizing controller” K(·) as the solution to: minimize
K,LLya ,Lp ,Ls ,γ
subject to
γ
(15a)
− V˙ K + γ + LLya · (V − 1) + Lp · g p s
s
− g (K) + L · (V − 1) is SOS,
is SOS {Lpi }, {Lsi }
(15b) are SOS, (15c)
where γ is a slack variable for the invariance constraint. Notice that when V(r) = 1 and g^p(u^p) ≤ 0, then V̇_K(r, u^p) ≤ γ. Denote the optimal slack as γ*_c.
2. Compute the tightest bounding ellipsoid E:

maximize_{L^E, E}  log det(E)    (16a)
subject to  1 − r̃ᵀ E r̃ + L^E (Ṽ − 1) is SOS    (16b)
            E ⪰ δ_E I,  L^E is SOS.    (16c)
The two steps above comprise the K sub-problem. The benefit of decomposing the K sub-problem in this fashion is that we independently search for the most stabilizing controller while simultaneously relaxing the invariance constraint by using the slack variable γ; in particular, γ accounts for the suboptimality of the computed controller with respect to the left-hand side of inequality (9).
4.2 The V Sub-problem
Given a controller K(·), and multiplier polynomials {L^s_i}, L^E, and L^{Lya} from the K sub-problem, the V sub-problem is defined as:

maximize_{Ṽ, E, L^p, γ, ε}  λ log det(E) − γ − 1ᵀε    (17a)
subject to  −V̇_K + γ + L^{Lya} · (V − 1) + L^p · g^p is SOS    (17b)
            −g^s(K) + ε + L^s · (V − 1) is SOS    (17c)
            1 − r̃ᵀ E r̃ + L^E (Ṽ − 1) is SOS    (17d)
            E ⪰ δ_E I,  Ṽ(0) = 0    (17e)
            {L^p_i} are SOS    (17f)
where ε ∈ R^{N_s}_{≥0} is a slack vector for the control constraints and λ is a Pareto trade-off parameter. Notice that the control slack ε is only necessary in the V sub-problem since the controller is held fixed within this problem. In contrast, in problem (15a), (15b) and (15c), we directly optimize over the set of strictly feasible controllers. Given these slack-based relaxed sub-problems, we now provide a solution algorithm for problem (14).
4.3 Solution Algorithm
Before providing the full solution algorithm, we describe an initialization procedure that generates a numerically stable initial guess for V˜ . The procedure is detailed in Algorithm 1. In particular, the algorithm may be initialized by using any polynomial guess for V˜ such that V˜ (0) = 0 (e.g., r˜T r˜). Notice that in line 6, when problem (17) is solved, the Pareto parameter λ is set to 0 since the objective for the initialization algorithm is to simply generate a feasible initial guess for problem (14). Furthermore, we impose the additional constraint γ ≤ γc∗ to ensure monotonic improvement in solution quality between sub-problems. As the original problem is still non-convex, the convergence of the slack variables below the tolerance threshold is not guaranteed for every initial guess of V˜ . However,
Algorithm 1. Generating Feasible Guess
1: Input: Initial guess Ṽ satisfying Ṽ(0) = 0; slack tolerance δ ∈ [0, 1); max # of iterations N.
2: i ← 0.
3: while i < N do
4:   {K, L^{Lya}, L^s, L^E, E, γ*_c} ← Solve (15)–(16).
5:   if γ*_c ≤ δ then return (Ṽ, E, K)
6:   {Ṽ, E, γ*, ε*} ← Solve (17) with λ = 0 and subject to additional constraint γ ≤ γ*_c.
7:   if max{γ*, ‖ε*‖∞} ≤ δ then return (Ṽ, E, K)
8:   i ← i + 1.
9: end while
10: return Failure.
typical guesses such as r̃ᵀQr̃ for a positive diagonal matrix Q appear to work quite well, as the experiment section will illustrate. Given an initial guess generated by Algorithm 1, we are now ready to solve problem (14). To do so, we will make use of a slightly modified version of the V sub-problem, given below:

maximize_{Ṽ, γ, E, L^p}  λ log det(E) − γ    (18a)
subject to  Eqs. (17b) − (17f)    (18b)
            αE* ⪯ E ⪯ (1 + α)E*    (18c)
where E* is the previous numerically stable solution (i.e., with optimal slack values γ*, ‖ε*‖∞ ≤ δ), ᾱ ∈ (0, 1) is a fixed parameter, and α ∈ (0, 1) is a backtracking search parameter that is adjusted in an iterative fashion. Specifically, we iteratively solve problem (18a), (18b) and (18c), each time checking the slack tolerances to ensure numerical stability, while using constraint (18c) to enforce a trust region around the current numerically stable solution. The full solution algorithm is summarized in Algorithm 2. The backtrack search procedure in line 8 is summarized in Algorithm 3. Within this procedure, we first iteratively maximize log det(E) within the trust region (18c) centered on the previous stable numerical solution, using the slack values as a check on solution quality (lines 3–6). Once solution quality degrades, we backtrack (shrink) the trust region in lines 7–11 until we again fall below the slack tolerances. Note that Algorithm 3 will either return an updated {Ṽ, E} that is numerically stable (with respect to slack tolerances), or it will simply return the numerically stable solution from line 5 in Algorithm 2 if unable to make progress. Thus, Algorithm 2 terminates if either (1) improvement in log det(E) stalls, or (2) the function Ṽ* returned by the backtrack procedure is not strictly feasible (i.e., γ ≤ 0) with respect to line 5 in Algorithm 2, but is still acceptable according to the slack tolerances. The key difference between the alternating method described here and a similar procedure in [24] is that the authors in [24] minimize the slack variable in both sub-problems and use binary search to iteratively refine an upper bound
Algorithm 2. Solving Problem (14)
1: Input: Output tuple (Ṽ, E, K) from Algorithm 1; slack tolerance δ; termination tolerance θ1 ∈ (0, 1).
2: i ← 1, converged ← false.
3: Initialize: (Ṽ*, E*, K*, c*_0) ← (Ṽ, E, K, log det(E)).
4: while i < N ∧ ¬converged do
5:   {K, L^{Lya}, L^s, L^E, E} ← Solve (15)–(16) subject to additional constraint: γ ≤ 0.
6:   if Line 5 successfully solved then
7:     (E*, K*) ← (E, K).
8:     (Ṽ*, E*, c*_i) ← Backtrack({K*, L^{Lya}, L^s, L^E}, E*, Ṽ*).
9:     if |c*_i − c*_{i−1}| ≤ θ1 |c*_{i−1}| then converged ← true. else i ← i + 1.
10:  else
11:    converged ← true.
12:  end if
13: end while
14: return (Ṽ*, ρ*, E*, K*)
Algorithm 3. Backtrack
1: Input: Solution {K*, L^{Lya}, L^s, L^E} and E* from line 5 in Algorithm 2 and current best solution Ṽ*; slack tolerance δ; backtrack parameters ᾱ, β, θ2 ∈ (0, 1).
2: Initialize: α ← ᾱ, bt ← false, c* ← log det(E*).
3: while ¬bt do
4:   {Ṽ, E, γ*, ε*} ← Solve (18).
5:   if max{γ*, ‖ε*‖∞} ≤ δ then (Ṽ*, E*, c*) ← (Ṽ, E, log det(E)). else bt ← true.
6: end while
7: while bt ∧ α > θ2 ᾱ do
8:   α ← βα.
9:   {Ṽ, E, γ*, ε*} ← Solve (18).
10:  if max{γ*, ‖ε*‖∞} ≤ δ then (Ṽ*, E*, c*) ← (Ṽ, E, log det(E)), bt ← false.
11: end while
12: return (Ṽ*, E*, c*)
on the cost function within the V sub-problem. On the other hand, our algorithm maintains numerical stability by using the slack variables solely as a check on the allowable change in the solution, while taking advantage of minimizing the true objective (i.e., log det(E)) within both sub-problems. This is especially useful in situations where the second phase of the algorithm struggles to make notable improvements on the objective (e.g., due to a smaller set of decision variables), thereby allowing the K sub-problem to take over.
5 Numerical Examples
In this section we numerically validate our proposed approach. Specifically, in Sect. 5.1, we compare our approach with the HJ-based approach in [11], while in Sect. 5.2, we study a high-dimensional system which is beyond the reach of HJ analysis. Additional numerical examples are provided in [32]. The code is available at https://github.com/StanfordASL/Model-Mismatch.
5.1 Comparison with the HJ Method
In this section, we compare our method with the HJ-based approach in [11] for the case of a 5D car model used to track a 3D Dubins car model (that is, the second example in Table 1 in [32]). This example was chosen because five dimensions is the current limit for standard grid-based HJ reachability methods when techniques such as decomposition cannot be applied. Specifically, the system dynamics and bounded state definition are provided in the second row of Table 1. The model parameters are chosen as |a| ≤ 1 m/s², |α| ≤ 3 rad/s², v̂ = 1.0 m/s, |ω̂| ≤ 0.1 rad/s. For the SOS method, we parameterize K and Ṽ as 2nd-order polynomials in r and r̃, respectively. The trigonometric terms cos θ_r and sin θ_r are approximated with Chebyshev polynomials up to degree 2, over the range θ_r ∈ [−π/6, π/6] rad. To ensure the validity of these approximations, an additional constraint is appended to problems (16a), (16b), (16c) and (17), namely:

−g + L_g(Ṽ − 1) is SOS,  L_g is SOS,  (19)
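As a quick sanity check on the approximation quality assumed here, the following sketch (NumPy only; the degree and range are taken from the text, and the grid resolution is illustrative) fits degree-2 Chebyshev polynomials to cos θ_r and sin θ_r over [−π/6, π/6] and reports the worst-case error.

```python
import numpy as np

theta = np.linspace(-np.pi / 6, np.pi / 6, 400)
cos_fit = np.polynomial.Chebyshev.fit(theta, np.cos(theta), deg=2)
sin_fit = np.polynomial.Chebyshev.fit(theta, np.sin(theta), deg=2)

# Worst-case approximation error over the validity range
print("max |cos - fit|:", np.max(np.abs(np.cos(theta) - cos_fit(theta))))
print("max |sin - fit|:", np.max(np.abs(np.sin(theta) - sin_fit(theta))))
```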
where g = θ_r² − (π/6)² and L_g is a SOS polynomial in r̃. The initial guess for Ṽ is simply r̃ᵀQr̃, where Q is a diagonal matrix with randomly sampled positive entries. In order to ensure a fair comparison, the cost function in the min-max game for the HJ method is x_r² + y_r² + θ_r² + (v cos θ_r − v̂)² + v² sin² θ_r + ω², which is the closest analogue to the SOS objective of minimizing the entire volume of B and not simply its projection onto the position coordinates. Figure 2 plots the projections of the boundary of the set B onto the (x_r, y_r, θ_r) components (i.e., the dimensions most relevant for collision checking) for the HJ and SOS solutions, respectively. The right-most panel in this figure provides a top-down view onto the (x_r, y_r) plane. As expected, the HJ solution provides a better position error bound, specifically 42% of the SOS position error bound, or about 0.2 m in absolute terms. The main reason is that a grid-based approach to solving the min-max HJ game exactly computes the non-smooth “optimal controller” in (8), whereas the SOS approach attempts to find the best polynomial approximation. On the other hand, the computation time difference for the two solutions is substantial: approximately 25 h for the HJ solution versus 5 min for the SOS solution. Specifically, the HJ solution was obtained by solving a corresponding HJ PDE [11] using a C++ implementation of the Lax–Friedrichs numerical scheme [35]. The SOS programs were solved using the Spotless polynomial optimization toolbox [37] and the MOSEK SDP solver [3]. Computations were done on a desktop computer with an Intel Core i7-2600K CPU and 16 GB of RAM. At least for this example, the SOS approach appears to provide a high-quality approximation to the optimal (HJ-based) solution, in a fraction of the time required by the HJ-based approach.

Fig. 2. Projection of the boundary of B onto (x_r, y_r, θ_r). Left: SOS. Middle: HJ. Right: top-down view (i.e., onto (x_r, y_r)) for both solutions. The HJ positional error bound is smaller (42% of the SOS bound), but the SOS solution requires only 0.3% of the computation time required by HJ analysis.

In [32] we provide an additional comparison between the HJ and SOS methods for a 4D dynamically extended Dubins car tracking a 2D single-integrator planning model (row 1 in Table 1). This example highlights a case in which the HJ method is efficient enough to warrant its use instead of the SOS method (i.e., when the dynamics are less than five dimensions and/or can be decomposed into subsystems that are each less than five dimensions). In this example, an optimal bound of 0.1 m between the two models was computed using HJ, while SOS yielded a bound of 0.3 m. The computation times were on the order of 2 min for SOS vs. 5 min for HJ. The projection of the optimal B onto the translational plane from the HJ method, as shown in the appendix in [32], clearly illustrates the limitation of SOS in its ability to approximate non-smooth quantities. Indeed, herein lies the primary weakness of the SOS method: tight approximations require high-degree polynomials, which can rapidly increase the size of the SOS problems. Despite this limitation, the SOS approach allows us to establish a conservative over-approximation of the tracking error bound (when one exists) for a given planner/tracker pair, and to gauge the sensitivity of the expected tracking error to changing model parameters such as inertial properties and control authorities. Furthermore, as the next section illustrates, the SOS method allows us to perform this analysis for higher-dimensional systems that are severely challenging (albeit not impossible) for HJ. As such, the HJ and SOS methods should really be viewed as complementary to one another.

5.2 High-Dimensional Example
In this section, we present numerical results for a high-dimensional system for which the HJ approach is intractable. Specifically, we consider the case where an 8D plane model is used to track a 4D decoupled Dubins plane model (that is, the fourth example in Table 1 in [32]). To ensure that this example cannot be solved using standard grid-based HJ reachability, we must first determine that decomposition techniques are not applicable. The 8D relative system dynamics are highly coupled, which means that approximate decomposition techniques would lead to results that are overly conservative [8]. The dynamics also do not form self-contained subsystems, making exact decomposition impossible [7]. This example well illustrates the usefulness of our approach. The planning model (i.e., the decoupled Dubins plane) is a benchmark planning model with well-characterized optimal trajectories [9]. The full 8D dynamics of the plane, however, are considerably more difficult to plan trajectories for online. Specifically, the system dynamics and bounded state definition are provided in the fourth row of Table 1. The planner model parameters are as follows. The constant speed v̂ is set to be the nominal speed for the plane in straight level flight conditions (lift equals gravitational force) at a nominal angle of attack α₀ = 5°, which corresponds to v̂ = 6.5 m/s (see [30] for the relevant constants used to compute this quantity). The maximum magnitude of the turning rate for the planner, |ω̂|, is set to be 20% of the plane’s maximum turning rate in a horizontal coordinated turn at nominal speed v̂, and is computed to be 0.21 rad/s (equivalently, a minimum turning radius of 30 m for the horizontal Dubins model). Finally, the planner’s vertical velocity is limited to the range [−0.1, 0.1]v̂. Taking advantage of the structure of the dynamics, the normalized acceleration control u_a is chosen to exactly cancel drag, plus an additional 2nd-degree polynomial in r as determined by the SOS program. The rest of the controller components and Ṽ are also parameterized as 2nd-order polynomials in r and r̃, respectively. The plane’s (tracking) control limits are chosen to be u_a ∈ [−8, 8] m/s² (≈10 times the acceleration needed to cancel drag at level trim conditions), u_φ̇ ∈ [−120, 120]°/s, and u_α̇ ∈ [−60, 60]°/s. To enforce the validity of the Chebyshev approximation for the trigonometric and 1/v terms in the dynamics, we enforce the additional constraints φ ∈ [−π/4, π/4] rad, γ ∈ [−π/6, π/6] rad, ψ_r ∈ [−π/6, π/6] rad, and v ∈ [3, 10] m/s. The overall (initialization plus main optimization) computation time was under 2 h. The projection of the set B onto (x_r, y_r, z_r) (the position errors in the frame of the Dubins car) had span [−5.6, 5.6] × [−3.8, 3.8] × [−4.5, 4.5] m. The bound is naturally looser in the (x_r, z_r) dimensions since the planning model’s horizontal velocity is equal to the plane’s trim conditions at level flight, limiting the plane’s ability to use its remaining velocity range for both tracking and ascending/descending. Thus, for such a choice of planning and tracking models, the bound appears reasonable. To obtain tighter bounds, one could, for instance, use as planning model the coupled kinematic plane model proposed in [22]. For an application of this bound to online motion planning, we set up a cluttered obstacle environment with trees and buildings (see Fig. 3a). The goal region is the blue shaded box in the corner, and the plane starts at position (1, 1, 5). The nominal motion plan is computed using kinodynamic FMT* [29] (using the locally optimal trajectories in [9] as steering connections). Having computed all steering connections offline

Given a function g_μ representing an atomic proposition μ, define the set S_μ = {x : g_μ(x) > 0}. In addition, given an STL formula, for example one denoted ϕ or ψ, we define S_ϕ and S_ψ to denote the set of states that satisfy ϕ or ψ, respectively.
(ξ_{x,t}^{u(·),d(·)}(·), s) ⊨ μ ⇔ ξ_{x,t}^{u(·),d(·)}(s) ∈ S_μ  (12a)
(ξ_{x,t}^{u(·),d(·)}(·), s) ⊨ ϕ ⇔ ξ_{x,t}^{u(·),d(·)}(s) ∈ S_ϕ  (12b)
(ξ_{x,t}^{u(·),d(·)}(·), s) ⊨ ψ ⇔ ξ_{x,t}^{u(·),d(·)}(s) ∈ S_ψ  (12c)
Then, we have the following correspondence between functions that represent satisfaction of STL formulas and functions that represent sets used in HJ reachability:

g_μ(x) > 0 ⇔ h_{S_μ}(x) < 0,  g_ϕ(x) > 0 ⇔ h_{S_ϕ}(x) < 0,  g_ψ(x) > 0 ⇔ h_{S_ψ}(x) < 0.  (13)

Based on the set definitions above, logical conjunction, disjunction, and negation are equivalent to set intersection, union, and complement, respectively:
(ξ_{x,t}^{u(·),d(·)}(·), s) ⊨ ϕ ∧ ψ ⇔ ξ_{x,t}^{u(·),d(·)}(s) ∈ S_ϕ ∩ S_ψ  (14a)
(ξ_{x,t}^{u(·),d(·)}(·), s) ⊨ ϕ ∨ ψ ⇔ ξ_{x,t}^{u(·),d(·)}(s) ∈ S_ϕ ∪ S_ψ  (14b)
(ξ_{x,t}^{u(·),d(·)}(·), s) ⊨ ¬ϕ ⇔ ξ_{x,t}^{u(·),d(·)}(s) ∈ S_ϕᶜ  (14c)
In terms of the functional representation of formulas and sets, we have

g_{ϕ∧ψ}(x) = min(g_ϕ(x), g_ψ(x)),  h_{S_{ϕ∧ψ}}(x) = max(h_{S_ϕ}(x), h_{S_ψ}(x))  (15a)
g_{ϕ∨ψ}(x) = max(g_ϕ(x), g_ψ(x)),  h_{S_{ϕ∨ψ}}(x) = min(h_{S_ϕ}(x), h_{S_ψ}(x))  (15b)
g_{¬ϕ}(x) = −g_ϕ(x),  h_{S_{¬ϕ}}(x) = −h_{S_ϕ}(x)  (15c)
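On a discretized state space, the compositions (15a)–(15c) reduce to element-wise operations on arrays. The following is a minimal sketch, assuming g_ϕ and g_ψ are NumPy arrays sampled on a common state grid (the random arrays at the bottom are purely illustrative).

```python
import numpy as np

def g_and(g_phi, g_psi):   # conjunction (15a): point-wise minimum
    return np.minimum(g_phi, g_psi)

def g_or(g_phi, g_psi):    # disjunction (15b): point-wise maximum
    return np.maximum(g_phi, g_psi)

def g_not(g_phi):          # negation (15c)
    return -g_phi

# Example: the states satisfying (phi AND psi) are those with value > 0.
g_phi = np.random.randn(50, 50)
g_psi = np.random.randn(50, 50)
satisfying = g_and(g_phi, g_psi) > 0
```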
3.2 “Until” and “Always” as Reachability Operators
We now look at the important connections between the until and always operators and reachability. The until operator, defined in Eq. (3e), can be interpreted as a constrained reachability operator. For example, in ϕ U_I ψ, one is interested in reaching states that satisfy ψ within some time horizon, while satisfying the constraint ϕ. We now formally state this reachability interpretation.

Proposition 1 (The until operator and constrained reachability). Define T = t + s₂ and the sets T_ψ(·) and C_ϕ:

T_ψ(s) = {x : (x(·), s) ⊨ ψ} if s ∈ [t + s₁, t + s₂], and T_ψ(s) = ∅ otherwise;  C_ϕ = {x : (x(·), s) ⊨ ϕ}.  (16)
Then, we have

∃u(·) ∈ U, ∀d(·) ∈ D, (ξ_{x,t}^{u(·),d(·)}(·), t) ⊨ ϕ U_{[s₁,s₂]} ψ ⇔ x ∈ R_AM(t, T; T_ψ(·), C_ϕ).  (17)

In addition, define g_{ϕU[s₁,s₂]ψ}(t, x) = −h_{RAM}(t, x), where h_{RAM}(t, x) is such that R_AM(t, t + τ₂; T_ψ(·), C_ϕ) = {x : h_{RAM}(t, x) < 0}. Then, we have

∃u(·) ∈ U, ∀d(·) ∈ D, (ξ_{x,t}^{u(·),d(·)}(·), t) ⊨ ϕ U_{[s₁,s₂]} ψ ⇔ g_{ϕU[s₁,s₂]ψ}(t, x) > 0.  (18)
Note that −g_{ϕU[s₁,s₂]ψ}(t, x) satisfies the HJ VI (8), so g_{ϕU[s₁,s₂]ψ} is a value function associated with ϕ U_{[s₁,s₂]} ψ. Proposition 1 follows from the definition of the until operator and the maximal RS, and establishes an equivalence between the until operator and the maximal reachability operator. It also provides a single function g_{ϕU[s₁,s₂]ψ}(s, x) that captures the set of states from which there exists a controller that guarantees the satisfaction of ϕ U_{[s₁,s₂]} ψ regardless of disturbances. Controller synthesis will be discussed in Sect. 4 in Lemma 1. Note that the eventually operator corresponds to an unconstrained reachability problem, which is the one presented in Proposition 1 with ϕ = True, or equivalently, C_ϕ = ∅. The always operator can be indirectly interpreted as an unconstrained reachability operator. For example, in □_{[s₁,s₂]} ϕ, one is interested in staying in states that satisfy ϕ. In the language of reachability, one would equivalently stay out of states that may lead to a violation of ϕ. We now formally state this reachability interpretation.

Proposition 2 (Reachability interpretation of the always operator). Define T = s + s₂ and

T_ϕ(s′) = {x : (x(·), s′) ⊭ ϕ} if s′ ∈ [s + s₁, s + s₂], and T_ϕ(s′) = ∅ otherwise.  (19)
Then, we have

∃u(·) ∈ U, ∀d(·) ∈ D, (ξ_{x,t}^{u(·),d(·)}(·), s) ⊨ □_{[s₁,s₂]} ϕ ⇔ x ∉ R^m(s, T; T_ϕ(·)).  (20)

In addition, define g_{□[s₁,s₂]ϕ}(s, x) = h_{R^m}(s, x), where h_{R^m}(s, x) is such that R^m(s, T; T_ϕ(·)) = {x : h_{R^m}(s, x) < 0}. Then, we have

∃u(·) ∈ U, ∀d(·) ∈ D, (ξ_{x,t}^{u(·),d(·)}(·), s) ⊨ □_{[s₁,s₂]} ϕ ⇔ g_{□[s₁,s₂]ϕ}(s, x) > 0.  (21)
Note that g_{□[s₁,s₂]ϕ}(s, x) satisfies the HJ VI (8), so g_{□[s₁,s₂]ϕ} is a value function associated with □_{[s₁,s₂]} ϕ. Proposition 2 follows from the definition of the always operator and the minimal RS. Controller synthesis will be discussed in Sect. 4 in Lemma 1.
4 Controller Synthesis
In this section, we provide a controller synthesis technique that is different from the optimal controller synthesis typically found in works related to HJ reachability. Instead of computing the optimal control, we propose to consider the set of controllers that satisfy a given STL formula by viewing the value function as a Lyapunov-like function. We call this set of controllers the least-restrictive feasible controller set (LRFCS), which ensures the value function does not decrease along trajectories, an idea that is core to many reachability methods other than HJ [21,23]. In the context of STL, the key benefit of considering the LRFCS is that a single controller synthesis procedure can be repeatedly used for satisfying recursively defined STL formulas. Thus, the LRFCS is what allows HJ reachability to leverage the complexity of specifications in the STL framework. The following lemma formalizes the LRFCS:

Lemma 1 (Least-restrictive feasible controller set (LRFCS) for satisfying an STL formula). Let ϕ be any STL formula, and g_ϕ be a value function associated with ϕ. Suppose at time t, the system (1) is at a state x such that g_ϕ(t, x) ≥ c. Define

U_ϕ := {u(·) : ∀s ≥ t, u(s) ∈ Ũ_ϕ(s, x)},  (22a)

Ũ_ϕ(s, x) = U_ϕ(s, x) if g_ϕ(s, x) ≤ c, and Ũ_ϕ(s, x) = U otherwise,  (22b)

U_ϕ(s, x) = {u : (∂/∂s) g_ϕ(s, x) + min_{d∈D} ∇g_ϕ(s, x) · f(s, x, u, d) ≥ 0}.  (22c)
Then, u(·) ∈ U_ϕ implies that ∀s ≥ t, ∀d(·) ∈ D, g_ϕ(s, ξ_{x,t}^{u(·),d(·)}(s)) ≥ c.
Proof. We start with the expression in (22c):

0 ≤ (∂/∂s) g_ϕ(s, x) + min_{d∈D} ∇g_ϕ(s, x) · f(s, x, u, d)  (23a)
  = min_{d∈D} [(∂/∂s) g_ϕ(s, x) + ∇g_ϕ(s, x) · f(s, x, u, d)] = min_{d∈D} ġ_ϕ(s, x)  (23b)
The minimization over d corresponds to the disturbance behaving adversarially to drive the system away from the satisfaction of ϕ. Therefore, for all d, we have ġ_ϕ(s, x) ≥ 0. In addition, since for all s ≥ t we have u(s) ∈ U_ϕ(s, x) whenever g_ϕ(s, x) ≤ c, and in particular when g_ϕ(s, x) = c, we must have that ∀s ≥ t, ∀d(·), g_ϕ(s, ξ_{x,t}^{u(·),d(·)}(s)) ≥ c.

Corollary 1. If c > 0, then Lemma 1 is equivalent to
u(·) ∈ U_ϕ ⇒ ∀d(·), (ξ_{x,t}^{u(·),d(·)}(·), s) ⊨ ϕ.  (24)
Remark 2. Lemma 1 provides a set of controllers for satisfying ϕ U_{[s₁,s₂]} ψ using the value function g_{ϕU[s₁,s₂]ψ} in Proposition 1, and for □_{[s₁,s₂]} ϕ using g_{□[s₁,s₂]ϕ} in Proposition 2. Note that since g_ϕ is a value function associated with ϕ, the set in (22c) is always non-empty. This is true by construction of the value function in either (8) or (11), in which the optimal controller is selected in a dynamic programming framework. The associated optimal controller, which we have omitted in this paper, guarantees the existence of a feasible controller; for a more thorough discussion of the optimal controller, please see one of [4,10,16].

We now continue our discussion of repeated controller synthesis via the LRFCS for recursively defined STL formulas. We first consider cases that do not involve control conflicts in Sect. 4.1; here, we provide a simple additional control logic to tie together multiple controllers from STL formulas that make up a more complex STL formula. Next, in Sect. 4.2, we address the cases involving control conflicts by describing why control conflicts arise and how they can be resolved.
4.1 Negation, Logical Disjunction, and “Always”: No Control Conflicts
In this section we consider ¬, ∨, and □_{[s₁,s₂]}, for which the discussion of controller synthesis is relatively straightforward. We first note that to synthesize a controller that guarantees the satisfaction of ¬ϕ, one would simply first negate ϕ, and then perform controller synthesis recursively. This can be considered a “pre-processing step” when STL formulas are in Negation Normal Form (cf. Sect. 2.2). Next, we provide a joint controller for satisfying ϕ ∨ ψ, given two controllers that respectively guarantee the satisfaction of ϕ and ψ. Intuitively, Proposition 3 states that applying a controller (in U_ϕ) that satisfies ϕ also implies satisfaction of ϕ ∨ ψ. On the other hand, if ϕ cannot be satisfied, then by assumption ψ can be satisfied using a controller drawn from U_ψ, thereby also satisfying ϕ ∨ ψ.

Proposition 3 (Joint controller for logical disjunction). Suppose (24) holds, and u(·) ∈ U_ψ ⇒ ∀d(·), (ξ_{x,t}^{u(·),d(·)}(·), s) ⊨ ψ. Then (ξ_{x,t}^{u_{ϕ∨ψ}(·),d(·)}(·), s) ⊨ ϕ ∨ ψ, where

u_{ϕ∨ψ}(·) ∈ U_ϕ if (24) holds, and u_{ϕ∨ψ}(·) ∈ U_ψ otherwise.  (25)

Finally, given a set of controllers U_ϕ that guarantees satisfaction of ϕ, and a set of controllers U_{□[s₁,s₂]ϕ} that guarantees satisfaction of □_{[s₁,s₂]} ϕ, we provide the set of controllers that first drives the system into a state from which ϕ can be satisfied using a controller in U_{□[s₁,s₂]ϕ}, and then drives the system to satisfy ϕ using a controller in U_ϕ.
Proposition 4 (Compound controller for the always operator). Define some s′ ∈ [s + s₁, s + s₂]. Suppose (24) holds, and u(·) ∈ U_{□[s₁,s₂]ϕ} ⇒ ∀d(·), (ξ_{x,t}^{u(·),d(·)}(·), s) ⊨ □_{[s₁,s₂]} ϕ. Then (ξ_{x,t}^{ū(·),d(·)}(·), s′) ⊨ ϕ, where

ū(·) ∈ U_{□[s₁,s₂]ϕ} if s < s′, and ū(·) ∈ U_ϕ if s ≥ s′.
The reasoning behind Proposition 4 is as follows: if one applies the controller drawn from U_{□[s₁,s₂]ϕ} until some time s′ between s + s₁ and s + s₂, the system is then in a position to satisfy ϕ. The satisfaction of ϕ is guaranteed by applying a controller drawn from U_ϕ at and after time s′.
4.2 Logical Conjunction and “Until”: Avoiding Control Conflicts
In this section, we consider the operators ∧ and U, which require more careful treatment due to the possibility of control conflicts. For example, even if controllers drawn from U_ϕ and U_ψ allow the system to independently satisfy ϕ and ψ, respectively, there may not exist a controller that leads to the satisfaction of ϕ ∧ ψ. This is because the system may only use a single control at any given time, and if U_ϕ and U_ψ do not intersect, then there is no controller that guarantees simultaneous satisfaction of ϕ and ψ. An analogous argument applies to ϕ U_{[s₁,s₂]} ψ. Therefore, controller synthesis for satisfaction of ψ in the expressions ϕ ∧ ψ or ϕ U_{[s₁,s₂]} ψ must use the LRFCS with respect to ϕ so that the satisfaction of ϕ is guaranteed. This requirement can be formalized by restating Propositions 1 and 2 with the modified control constraint u(·) ∈ U_ϕ instead of u(·) ∈ U. The full restatements can be found in the appendix. Here, we highlight that optimizing over a restricted function set u(·) ∈ U_ϕ, or equivalently, over a restricted control signal set Ũ_ϕ, does not significantly increase the computational burden for control- and disturbance-affine systems. To see this, let the dynamics (1) be ẋ = f_x(t, x) + f_u(t, x)u + f_d(t, x)d. Then, the expression in (22c) becomes

(∂/∂s) g_ϕ(s, x) + ∇g_ϕ(s, x) · f_x(t, x) + min_{d∈D} ∇g_ϕ(s, x) f_d(t, x) d + ∇g_ϕ(s, x) f_u(t, x) u ≥ 0,  (26)

which is a single affine control constraint. When solving the HJ VIs (8) and (11), the optimization u ∈ U becomes u ∈ Ũ_ϕ, which, according to (26), at most involves adding an affine constraint. In typical scenarios in which u ∈ U is a box constraint, adding (26) results in a polytopic constraint, and therefore the optimization u ∈ Ũ_ϕ does not involve significantly more computation than the optimization u ∈ U.
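For the control-affine case, selecting a control inside this polytope (box bounds plus the affine half-space (26)) that maximizes ġ_ϕ is a small linear program. The following is a sketch using SciPy; all inputs are illustrative quantities evaluated at a single (s, x), not an interface from the paper's toolchain.

```python
import numpy as np
from scipy.optimize import linprog

def max_gdot_control(grad_g, f_u, affine_offset, u_lo, u_hi):
    """Maximize (grad_g . f_u) u subject to box bounds and (26) >= 0.

    affine_offset: dg/ds + grad_g . f_x + min_d grad_g . f_d d  (scalar)
    f_u:           control Jacobian f_u(t, x), shape (n, m)
    """
    c = -(grad_g @ f_u)                       # linprog minimizes, so negate
    # Constraint (26) >= 0  <=>  -(grad_g . f_u) u <= affine_offset
    A_ub = np.atleast_2d(-(grad_g @ f_u))
    b_ub = np.array([affine_offset])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=list(zip(u_lo, u_hi)), method="highs")
    return res.x if res.success else None     # None: (26) infeasible here
```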
5 Practical Summary of Theoretical Results
In this section, we compile all the theory in this paper into a list of corresponding STL and reachability operators. Given this list, one would be able to recursively compute the set of states that satisfies an arbitrary STL formula, and to synthesize the corresponding feedback controller. We also provide an illustrative example and a brief discussion of minimum violation.
5.1 List of Corresponding STL, Set, and Reachability Operators
Using Table 1, a value function representing any STL formula ϕ can be computed through a sequence of reachability computations as well as point-wise negation, minimization, and maximization of functions. The number of reachability computations required is the number of until and always operators in the STL formula ϕ, and each individual reachability computation scales according to the reachability method. In the case of the HJ method, the computational complexity of a single reachability computation is exponential in the number of system dimensions in (1). Practically speaking, this means that systems with up to five state space dimensions can be tractably analyzed. In theory, any controller in the LRFCS can be used to satisfy ϕ, as long as the state x(s) is such that g_ϕ(s, x) > 0. In practice, additional criteria can be used for choosing the controller within the LRFCS. One simple choice, which is typically used in HJ reachability, is to choose the controller that maximizes ġ_ϕ in Eq. (22c).

Table 1. Summary of theoretical results. Given any STL formula ϕ, this table can be used to identify the sequence of reachability computations, and point-wise negation, minimization, and maximization of functions, that produces a value function g_ϕ and the corresponding LRFCS U_ϕ.

Formula | Value function computation | Controller | Details
ϕ ∨ ψ | Set g_{ϕ∨ψ} = max(g_ϕ, g_ψ) | u(·) ∈ U_{ϕ∨ψ} in (25) | Eq. (15b), Proposition 3
ϕ ∧ ψ | Given U_ϕ: 1. compute g_ψ with constraint u(·) ∈ U_ϕ; 2. set g_{ϕ∧ψ} = min(g_ϕ, g_ψ) | u(·) ∈ U_ϕ in (22a)–(22c) with ϕ = ψ, g_ϕ = g_ψ, c ≥ 0 | Eq. (15a), Sect. 4.2
¬ϕ | Set g_{¬ϕ}(x) = −g_ϕ(x) | Synthesized recursively from the negated formula | Eq. (15c), Sect. 4
ϕ U_{[s₁,s₂]} ψ | Given U_ϕ, g_ψ: 1. solve the HJ VI (8) with T_ψ, C_ϕ given in (16) and constraint u(·) ∈ U_ϕ to obtain the RS represented by h_{RAM}; 2. set g_{ϕU[s₁,s₂]ψ} = −h_{RAM} | u(·) ∈ U_ϕ in (22a)–(22c) with ϕ = ϕ U_{[s₁,s₂]} ψ, g_ϕ = g_{ϕU[s₁,s₂]ψ}, c ≥ 0 | Proposition 1, Sect. 4.2
□_{[s₁,s₂]} ϕ | 1. solve the HJ VI (11) with T_ϕ given in (19) and constraint u(·) ∈ U_ϕ to obtain the RS represented by h_{R^m}; 2. set g_{□[s₁,s₂]ϕ} = h_{R^m} | Use u(·) ∈ U_ϕ in (22a)–(22c) with ϕ = □_{[s₁,s₂]} ϕ, g_ϕ = g_{□[s₁,s₂]ϕ}, c ≥ 0 to satisfy □_{[s₁,s₂]} ϕ; use u(·) ∈ U_ϕ when s ∈ [s + s₁, s + s₂] to satisfy ϕ | Proposition 2, Sect. 4.1

5.2 Illustrative Example
As a concrete example, consider the formula ϕ = □_{I₁} ♦_{I₂} μ₁ ∧ (μ₂ U_{I₃} μ₃), for functions μ₁, μ₂, μ₃ representing atomic propositions, and time intervals I₁, I₂, I₃.
Using Table 1, one could perform the following steps, which involve three reachability computations in total, to obtain a value function representing the set of states from which ϕ can be satisfied, and the corresponding LRFCS (a schematic implementation is sketched after this list):

1. Compute g_{♦I₂μ₁} with the control constraint u(·) ∈ U (no control constraints other than those given by the dynamical system) according to the fifth row of Table 1.
2. Compute g_{□I₁♦I₂μ₁} and U_{□I₁♦I₂μ₁} with the control constraint u(·) ∈ U (no control constraints other than those given by the dynamical system) according to the sixth row of Table 1.
3. Compute g_{μ₂UI₃μ₃} and U_{μ₂UI₃μ₃} with the control constraint u(·) ∈ U_{□I₁♦I₂μ₁} (an extra control constraint to ensure satisfaction of □_{I₁}♦_{I₂}μ₁) according to the fifth row of Table 1. Note that this is also step 1 in the third row of Table 1.
4. Set g_ϕ = min(g_{□I₁♦I₂μ₁}, g_{μ₂UI₃μ₃}). The set of states from which ϕ can be satisfied is then given by {x : g_ϕ > 0} (step 2 in the third row of Table 1).
5. The LRFCS for satisfying ϕ is given by U_{μ₂UI₃μ₃} (which already accounts for the satisfaction of □_{I₁}♦_{I₂}μ₁).
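The control flow of these five steps can be summarized in code. The sketch below is purely schematic: `solve_hj_vi` is a hypothetical stand-in for a reachability solver (stubbed so the flow can be read end to end), not an interface of any toolbox used in this paper.

```python
import numpy as np

def solve_hj_vi(target, horizon, control_set, constraint=None, mode="until"):
    """Stub: a real implementation would solve the HJ VI (8) or (11) and
    return a gridded value function and the associated LRFCS."""
    return np.zeros((50, 50)), control_set       # placeholder outputs

def synthesize(mu1, mu2, mu3, I1, I2, I3, U_free):
    # 1. eventually: unconstrained reachability (until with phi = True)
    g_ev, _ = solve_hj_vi(mu1, I2, U_free)
    # 2. always-eventually: use the step-1 value function as the target
    g_alw, U_alw = solve_hj_vi(g_ev, I1, U_free, mode="always")
    # 3. until, restricted to the step-2 LRFCS (avoids control conflicts)
    g_until, U_until = solve_hj_vi(mu3, I3, U_alw, constraint=mu2)
    # 4. conjunction: point-wise minimum of the two value functions
    g_phi = np.minimum(g_alw, g_until)
    # 5. the LRFCS for the full formula is the step-3 controller set
    return g_phi, U_until
```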
5.3 Minimum Violation Interpretation
Lemma 1 and Corollary 1 establish the condition that guarantees the continued satisfaction of an STL formula ϕ using the LRFCS in (22a)–(22c), as long as the formula is satisfied at some state x and time s. In some situations, however, a desired STL formula ϕ cannot be satisfied. Mathematically, this is characterized by g_ϕ(s, x) ≤ 0. Such a situation could occur if the controller in (22a)–(22c) is not used, or if there is a sudden change in the STL formula the system needs to satisfy, perhaps as a result of a sudden change in the objective of the system. Since Lemma 1 holds for any value of c, the controller in (22a)–(22c) is also the “minimum violation controller” in the situation in which g_ϕ(s, x) ≤ 0. By using the controller in (22a)–(22c), one can guarantee that g_ϕ(s, x) never decreases, which is interpreted as “the situation is guaranteed to not become worse, even under the worst-case disturbance”. On the other hand, if the disturbance does not behave in a worst-case fashion, ġ_ϕ(s, x) may be positive, and g_ϕ(s, x) may evolve to become positive so that ϕ is satisfied at a later time. An example of a system “recovering” from a state that initially does not satisfy a desired STL formula is shown in the appendix.
6 Simulations and Experiments
In this section, we demonstrate our methodology on an example representative of a common autonomous driving scenario: a car overtaking maneuver on a two-lane highway. As this example involves the logical conjunction of compound STL formulas, we use the notion of the LRFCS from Sect. 4.2 to avoid control conflicts.
In addition, a toy numerical example and a second example involving a robotic experiment related to autonomous driving are presented in the appendix. The toy example validates the notion of the LRFCS. In the autonomous driving example, we considered an autonomous car attempting to make a left turn at a four-way intersection just after the traffic light has transitioned to yellow. We performed the controller synthesis described in Sect. 5. Here, the desired outcome is either making a successful left turn or stopping in the initial lane, depending on the behavior of another car travelling in the opposite direction. We also demonstrated the notion of minimum violation introduced in Sect. 5.3 when the desired outcome is not initially guaranteed. For all examples, after obtaining the LRFCS that guarantees the satisfaction of the desired STL formula, we simply chose the controller that maximizes ġ_ϕ in (22c).
6.1 Hardware Setup
We performed all computations³ on a laptop with an Intel Core i5-6300HQ CPU and 8 GB of RAM. Reachability computations, which produced the value functions, were done using the beacls toolbox⁴. Numerical simulations for both examples were performed in MATLAB using the helperOC⁵ and Level Set Methods⁶ [27] toolboxes. For the hardware experiments⁷, we used TurtleBots in place of cars and tracked their positions using a Vicon motion capture system. With the exception of using MATLAB for the real-time evaluation of the optimal controller, all other processes, including message-passing between devices and low-level controls, were implemented in ROS using Python.
6.2 Highway Example
We now consider a highway overtaking scenario. Initially, the autonomous car is behind a “slow car” which moves at a constant speed v₃. The autonomous car attempts to overtake the slow car while avoiding a collision with an “adjacent car” which travels in the left lane with a possible range of velocities. In addition to avoiding collisions with both cars, the autonomous car must also be able to re-enter the right lane within a short duration of time. This additional safety constraint plans for situations such as when emergency vehicles require passage through the left lane. The adjacent car, which acts as the disturbance as described in Sect. 2.3, has its position represented by y₂. Since the slow car travels at a constant speed, we express the joint dynamics of the three vehicles in the reference frame of the slow car:
³ The code used in this paper is available at https://github.com/StanfordASL/stlhj.
⁴ https://github.com/HJReachability/beacls.
⁵ https://github.com/HJReachability/helperOC.
⁶ https://www.cs.ubc.ca/~mitchell/ToolboxLS/.
⁷ The videos can be viewed at https://www.youtube.com/playlist?list=PL8-2mtIlFIJoNkhcGI7slWX2W3kEW-9Pb.
ẋ_R = v_R cos θ_R,  ẏ_R = v_R sin θ_R − v₃,  θ̇_R = ω_R,  v̇_R = a_R,  ẏ₂ = v₂ − v₃.  (27)
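A forward-Euler rollout of (27) makes the relative-frame convention explicit. The following sketch uses illustrative parameter values, not the configuration used in the experiments.

```python
import numpy as np

def step(state, u, v2, v3=0.3, dt=0.05):
    """One Euler step of (27).
    state = (x_R, y_R, theta_R, v_R, y_2); u = (omega_R, a_R);
    v2 is the adjacent car's speed (the disturbance), v3 the slow car's."""
    x, y, th, v, y2 = state
    omega, a = u
    return np.array([
        x + dt * v * np.cos(th),
        y + dt * (v * np.sin(th) - v3),
        th + dt * omega,
        v + dt * a,
        y2 + dt * (v2 - v3),
    ])

state = np.array([0.0, -0.5, 0.0, 0.3, -1.0])   # illustrative initial state
for _ in range(100):
    state = step(state, u=(0.0, 0.1), v2=0.5)
```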
The autonomous car’s primary objective is to overtake the slow car, which is captured by the proposition ψ^pass. It needs to adhere to the traffic laws and avoid colliding with either the slow car or the adjacent car; this constraint is represented by ϕ. If such a collision is possible or the overtaking maneuver cannot be achieved within the specified time frame, the autonomous car then has the secondary objective of remaining behind the slow car, ψ^stay. We also include an additional recurrence requirement for the autonomous car to always be within 5 s of re-entering the right lane, represented by ψ^lane. The function representations of the STL propositions can be found in the appendix. Overall, the robot must satisfy the following STL formula:

□_{[0,25]} ♦_{[0,5]} ψ^lane ∧ (ϕ U_{[0,25]} ψ^pass ∨ ϕ U_{[0,25]} ψ^stay),  ϕ = ϕ^off-road ∧ ϕ^on-road ∧ ϕ^avoid  (28)
The results from numerical simulations and experiments with TurtleBots are shown in Figs. 2 and 3. After obtaining the value function and LRFCS for □_{[0,25]} ♦_{[0,5]} ψ^lane, the LRFCS is used as an additional constraint for the reachability computations associated with ϕ U_{[0,25]} ψ^pass ∨ ϕ U_{[0,25]} ψ^stay to avoid control conflicts, as discussed in Sect. 4.2. This process is described in greater detail in the appendix. The white circle represents the autonomous car, while the red and green cars represent the adjacent car and the slow car, respectively. The colors represent the values of max(g_{ϕU[0,25]ψ^pass}, g_{ϕU[0,25]ψ^stay}). In Fig. 2, the autonomous car has sufficient time to pass the slow car as the adjacent car passes by quickly. It waits until its state reaches the value function’s zero level set to begin the overtaking maneuver. This takes into account the possibility that the adjacent car may slow down drastically at any time. We can observe that the value at the autonomous car’s position increases from t = 0 s to 8 s just as the car begins the overtaking maneuver; this is a result of the adjacent car not behaving in the worst-case manner.
Fig. 2. The autonomous car (white) successfully overtakes the slow car (green) as the adjacent car (red) moves quickly to “make room” for the passing maneuver. Left: Contour plot of the value function. Right: Time-lapse of the experiment.
Fig. 3. The autonomous car (white) remains behind the slow car (green); the adjacent car (red) moves too slowly. Left: Contour plot of the value function. Right: Time-lapse of the experiment.
In Fig. 3, the autonomous car is unable to pass the slow car within the time horizon and stops behind the slow car, because the adjacent car moves too slowly for the autonomous car to be guaranteed to pass. The value at the autonomous car’s position remains low. Computation of the value function took approximately 8 h. The erratic movement seen in Fig. 3 is caused by the coarse state discretization coupled with the bang-bang controller; this numerical artifact can potentially be alleviated by performing controller smoothing through an optimization framework, as in, for example, [22].
7 Conclusions and Future Work
In this paper, we presented the combination of Signal Temporal Logic (STL) and Hamilton-Jacobi (HJ) reachability as a versatile verification technique that inherits the strengths of both. From the perspective of STL, our method moves beyond controller synthesis methods that generate single trajectories; instead, we provide guarantees in terms of sets of states, and produce state feedback controllers. From the perspective of HJ reachability, our method looks beyond single reachability problems, and takes advantage of the expressiveness of STL to define sequences of reachability problems. In terms of computational complexity, the number of reachability computations required for set computation and controller synthesis is the same as the number of temporal operators in the STL formula; for any individual reachability computation, our method inherits the computational complexity of the reachability method. Our approach has been validated in three numerical examples and two robotic experiments. Our work spawns a number of future research directions. For example, one could attempt to alleviate the effects of bang-bang control by using the LRFCS in conjunction with a different controller synthesis framework such as model predictive control. Also, in general, there may be many different orderings of reachability operators that produce an LRFCS for an STL formula; thus, it is important to investigate the effect of different prioritizations of STL formulas when using the LRFCS. Another goal of future work is to incorporate computationally scalable methods, such as [2,19,21], that maintain STL formula satisfaction while balancing computational burden, conservatism, and nonlinearity of system dynamics. Lastly, the ideas in this paper, although presented in the context of STL and HJ reachability, are transferable to other temporal logics and reachability methods; exploring other formulations of temporal logic and reachability would help make trade-offs among computational scalability, conservatism, generality of system dynamics, and expressiveness of logical formulas. In particular, there are several suggested ways to alleviate the “curse of dimensionality” in formal verification. First, preliminary studies on GPU parallelization have been promising, allowing value functions to be computed on a coarse grid for a 7D system [22]. Second, several system decomposition methods [7,8] may be used to reduce dimensionality for specific types of systems. Finally, reachability methods other than the HJ method can be used [2,17,21,23].
References

1. Synthesis of nonlinear continuous controllers for verifiably correct high-level, reactive behaviors. Int. J. Robot. Res.
2. Althoff, M., Grebenyuk, D., Kochdumper, N.: Implementation of Taylor models in CORA 2018. In: Proceedings of the International Workshop on Applied Verification for Continuous and Hybrid Systems (2018)
3. Baier, C., Katoen, J.P.: Principles of Model Checking. MIT Press, Cambridge (2008)
4. Bansal, S., Chen, M., Fisac, J.F., Tomlin, C.J.: Safe sequential path planning of multi-vehicle systems under presence of disturbances and imperfect information. In: American Control Conference (2017)
5. Bansal, S., Chen, M., Herbert, S., Tomlin, C.J.: Hamilton-Jacobi reachability: a brief overview and recent advances. In: Proceedings of the IEEE Conference on Decision and Control (2017)
6. Belta, C., Bicchi, A., Egerstedt, M., Frazzoli, E., Klavins, E., Pappas, G.J.: Symbolic planning and control of robot motion [grand challenges of robotics]. IEEE Robot. Autom. Mag. 14(1), 61–70 (2007)
7. Chen, M., Herbert, S., Tomlin, C.J.: Fast reachable set approximations via state decoupling disturbances. In: Proceedings of the IEEE Conference on Decision and Control (2016)
8. Chen, M., Herbert, S.L., Vashishtha, M.S., Bansal, S., Tomlin, C.J.: Decomposition of reachable sets and tubes for a class of nonlinear systems. IEEE Trans. Autom. Control (2018, in press)
9. Chen, M., Hu, Q., Fisac, J., Akametalu, K., Mackin, C., Tomlin, C.: Reachability-based safety and goal satisfaction of unmanned aerial platoons on air highways. AIAA J. Guidance Control Dyn. 40(6), 1360–1373 (2017)
10. Chen, M., Tomlin, C.J.: Hamilton-Jacobi reachability: some recent theoretical advances and applications in unmanned airspace management. Annu. Rev. Control Robot. Auton. Syst. 1(1), 333–358 (2018)
11. Coddington, E.A., Levinson, N.: Theory of Ordinary Differential Equations. McGraw-Hill, New York (1955)
12. Donzé, A., Ferrère, T., Maler, O.: Efficient robust monitoring for STL. In: Proceedings of the International Conference on Computer Aided Verification (2013)
13. Dreossi, T., Dang, T., Piazza, C.: Parallelotope bundles for polynomial reachability. In: Hybrid Systems: Computation and Control (2016)
14. Fainekos, G.E., Pappas, G.J.: Robustness of temporal logic specifications for continuous-time signals. Theor. Comput. Sci. 410(42), 4262–4291 (2009)
15. Feng, X., Villanueva, M.E., Chachuat, B., Houska, B.: Branch-and-lift algorithm for obstacle avoidance control. In: Proceedings of the IEEE Conference on Decision and Control (2017)
16. Fisac, J.F., Chen, M., Tomlin, C.J., Sastry, S.S.: Reach-avoid problems with time-varying dynamics, targets and constraints. In: Hybrid Systems: Computation and Control (2015)
17. Frehse, G., Le Guernic, C., Donzé, A., Cotton, S., Ray, R., Lebeltel, O., Ripado, R., Girard, A., Dang, T., Maler, O.: SpaceEx: scalable verification of hybrid systems. In: Proceedings of the International Conference on Computer Aided Verification (2011)
18. Gattami, A., Al Alam, A., Johansson, K.H., Tomlin, C.J.: Establishing safety for heavy duty vehicle platooning: a game theoretical approach. IFAC World Congr. 44(1), 3818–3823 (2011)
19. Herbert, S.L., Chen, M., Han, S., Bansal, S., Fisac, J.F., Tomlin, C.J.: FaSTrack: a modular framework for fast and guaranteed safe motion planning. In: Proceedings of the IEEE Conference on Decision and Control (2017)
20. Kress-Gazit, H., Lahijanian, M., Raman, V.: Synthesis for robots: guarantees and feedback for robot behavior. Annu. Rev. Control Robot. Auton. Syst. 1, 211–236 (2018)
21. Landry, B., Chen, M., Hemley, S., Pavone, M.: Reach-avoid problems via sum-of-squares optimization and dynamic programming. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2018)
22. Leung, K., Schmerling, E., Chen, M., Talbot, J., Gerdes, J.C., Pavone, M.: On infusing reachability-based safety assurance within probabilistic planning frameworks for human-robot vehicle interactions. In: International Symposium on Experimental Robotics (2018)
23. Majumdar, A., Tedrake, R.: Funnel libraries for real-time robust feedback motion planning (2016). https://arxiv.org/abs/1601.04037
24. Majumdar, A., Tedrake, R.: Funnel libraries for real-time robust feedback motion planning. Int. J. Robot. Res. 36(8), 947–982 (2017)
25. Majumdar, A., Vasudevan, R., Tobenkin, M.M., Tedrake, R.: Convex optimization of nonlinear feedback controllers via occupation measures. Int. J. Robot. Res. 33(9), 1209–1230 (2014)
26. Maler, O., Nickovic, D.: Monitoring temporal properties of continuous signals. In: Proceedings of the International Symposium on Formal Techniques in Real-Time and Fault-Tolerant Systems, Formal Modeling and Analysis of Timed Systems (2004)
27. Mitchell, I.M.: The flexible, extensible and efficient toolbox of level set methods. SIAM J. Sci. Comput. 35(2–3), 300–329 (2008)
28. Mitchell, I.M., Bayen, A.M., Tomlin, C.J.: A time-dependent Hamilton-Jacobi formulation of reachable sets for continuous dynamic games. IEEE Trans. Autom. Control 50(7), 947–957 (2005)
29. Papusha, I., Fu, J., Topcu, U., Murray, R.M.: Automata theory meets approximate dynamic programming: optimal control with temporal logic constraints. In: Proceedings of the IEEE Conference on Decision and Control (2016)
30. Raman, V., Donzé, A., Maasoumy, M., Murray, R.M., Sangiovanni-Vincentelli, A., Seshia, S.A.: Model predictive control with signal temporal logic specifications. In: Proceedings of the IEEE Conference on Decision and Control (2014)
31. Raman, V., Donzé, A., Sadigh, D., Murray, R.M., Seshia, S.A.: Reactive synthesis from signal temporal logic specifications. In: Hybrid Systems: Computation and Control (2015)
32. Castro, L.I.R., Chaudhari, P., Tumova, J., Karaman, S., Frazzoli, E., Rus, D.: Incremental sampling-based algorithm for minimum-violation motion planning. In: Proceedings of the IEEE Conference on Decision and Control (2013)
33. Singh, S., Majumdar, A., Slotine, J.J.E., Pavone, M.: Robust online motion planning via contraction theory and convex optimization. In: Proceedings of the IEEE Conference on Robotics and Automation (2017). http://asl.stanford.edu/wp-content/papercite-data/pdf/Singh.Majumdar.Slotine.Pavone.ICRA17.pdf
34. Wang, L., Ames, A.D., Egerstedt, M.: Safety barrier certificates for collisions-free multirobot systems. IEEE Trans. Robot. 33(3), 661–674 (2017). http://ieeexplore.ieee.org/document/7857061/
35. Wieland, P., Allgöwer, F.: Constructive safety using control barrier functions. In: IFAC Symposium on Nonlinear Control Systems (2007)
Low Complexity Control Policy Synthesis for Embodied Computation in Synthetic Cells

Ana Pervan and Todd D. Murphey

Department of Mechanical Engineering, Northwestern University, Evanston, IL 60208, USA
[email protected], [email protected]
Abstract. As robots become more capable, they also become more complicated—either in terms of their physical bodies or their control architecture, or both. An iterative algorithm is introduced to compute feasible control policies that achieve a desired objective while maintaining a low level of design complexity (quantified using a measure of graph entropy) and a high level of task embodiment (evaluated by analyzing the Kullback-Leibler divergence between physical executions of the robot and those of an idealized system). When the resulting control policy is sufficiently capable, it is projected onto a set of sensor states. The result is a simple, physically-realizable design that is representative of both the control policy and the physical body. This method is demonstrated by computationally optimizing a simulated synthetic cell.
1 Introduction
Many roboticists seek to create robots that are as capable as possible, assuming that optimal design is achieved when a robot can perfectly perform a task. But as robots become more capable, they often become more complicated. While there are some cases where this is acceptable, there are many where complexity is an undesired consequence of optimization. A conflict exists between equipping a robot with what is sufficient and what is necessary—what can enable a robot to succeed (and is likely complex) and what is minimally required for it to achieve its goal (and is necessarily simple). This is especially evident when designing for embodied computation—robots without any on-board, CPU-based, traditional computational capabilities. Robots traditionally require three elements: sensors, computation, and actuators. Some robotic systems employ embodied computation to reduce weight and energy—for example fully mechanical devices, like passive dynamic walkers [9,28] or those used in prosthetic limbs [22]—while others must necessarily resort to embodied computation because of scale, for example robots on the micro- or nano-scale [13,16,17]. We focus on the latter case here. So although robot design traditionally consists of identifying which sensory, computation, and actuation elements can be combined to best achieve a goal, here we will examine a framework for designing the sensory and actuation elements of robots so that no traditional, CPU-based computation is necessary to accomplish a goal.
Fig. 1. A simple example of a control policy. Here the state space is a two dimensional 5 × 5 grid and the desired state is in the upper right of the state space. At each state, the suitable control is indicated by its assigned color. (a) A simple control policy with only three different states in the finite state machine. (b) A complex control policy, with fifteen different states in the finite state machine. Both achieve the task with different implementations.
We first examine the relationship between control policy design and physical robot body design. A control policy assigns an action to a system at each time or state. Symbolic control policies have been useful in robot control and motion planning [3,21] for systems with limited computational power [25]. An example is shown in Fig. 1, where the system consists of three possible control modes—move right (blue), move up (red), and stay still (white)—and its goal is to navigate to the upper right corner of the grid world. A simple robot placed in this environment with one of these policies in its memory would be able to achieve its task without any traditional computation, replacing logical operators such as inequalities with physical comparators to relate sensor states to control actions. Figure 1 illustrates the difference in complexity between these two policies. The policy in Fig. 1(a) requires a combination of sensors capable of differentiating between three different regions of the state space, and the policy in Fig. 1(b) requires sensors that are able to discern between fifteen different regions. It is simpler to physically implement the robot design implied by the control policy in Fig. 1(a) than the policy in Fig. 1(b). We pose the physical design problem as the projection of a policy onto an admissible set of physical sensor-actuator interconnections, the complexity of which must be managed during policy iteration. This complexity is a measure of logical interconnections between sensor states and control modes (which correspond to arrows on the graphs in Fig. 1). Methods from the switched systems literature are used to organize these sensor-actuator connections so that they change minimally as a function of state. This is challenging because [4] shows that optimal solutions for discrete switched systems will chatter with probability 1, and chattering implies a complex policy. That is, when choosing among a finite number of sensor-actuator pairings, optimization of an objective function will necessarily lead to arbitrarily complex dependencies on state, and the physical implementation of such a policy would consequently be very complex (and often not physically realizable). Results from [7] show that solutions with slow mode switching are “almost” as good as chattering solutions. Properties from [7] are used here to design control policies with minimal mode switching and then project them onto a feasible sensor space, as illustrated in Fig. 2, resulting in a methodology for designing robots with embodied computation.
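A policy like that of Fig. 1(a) can be written as a small lookup table. The sketch below is illustrative (not from the paper): it encodes the three regions of the simple policy and verifies that a rollout reaches the goal.

```python
import numpy as np

# Modes: 0 = stay still, 1 = move right, 2 = move up.
# Goal: the upper-right cell of a 5x5 grid (row 4, column 4).
policy = np.ones((5, 5), dtype=int)   # default region: move right
policy[:, -1] = 2                     # right-most column: move up
policy[-1, -1] = 0                    # goal cell: stay still

def rollout(start, steps=20):
    moves = {0: (0, 0), 1: (0, 1), 2: (1, 0)}   # (row, col) increments
    r, c = start
    for _ in range(steps):
        dr, dc = moves[policy[r, c]]
        r, c = r + dr, c + dc
    return r, c

print(rollout((0, 0)))   # -> (4, 4): the upper-right corner
```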
Fig. 2. (a) The state space and controls for the synthetic cell example system introduced in Sect. 3. At any location in the state space, the robot is able to choose one of seven different control modes: attraction to chemical potentials at the six different sources, or zero control. (b) A control policy for this system projected onto a feasible set of sensor states. (These figures will be explained in detail in Sects. 5 and 6.)
After reviewing related work in Sect. 2, this methodology will be explored in terms of an extended example. The example system, called a synthetic cell, will be introduced in Sect. 3. The primary contributions of this work can be summarized as follows:

1. Quantitative definitions are given for design complexity and task embodiment, described in Sect. 4.
2. An iterative algorithm is developed in Sect. 5, which creates control policies with low design complexity while increasing task information.
3. A physically realizable robot design is generated by dividing the state space into regions to create a set of sensor states, illustrated in Sect. 6, and projecting the low complexity control policy onto this set.

These are supported by simulations of synthetic cells.
2 Related Work
This work is substantially inspired by [8], which defines design problems as relations between functionality, resources, and implementation, and shows that although such problems are non-convex, non-differentiable, and non-continuous, it is possible to create languages and optimization tools to define and automatically solve them. The optimal solution to a design problem is defined as the solution that is minimal in resource usage but provides maximum functionality. We apply this definition by proposing a min-max problem in which the goal is to minimize design complexity (representative of the number of sensors and actuators required, i.e., the resources) and maximize task embodiment (i.e., the functionality of the design).
3
Motivating Example: Synthetic Cells
How can we use control principles to organize sensor components, actuator components, and their interconnections to create desired autonomous behavior, without relying on traditional computation? To answer this question we will consider the extended example of a synthetic cell—a small robot that only has a finite number of possible sensor and actuator states and potential pairings between them [19]. The purpose of this example system is to show a concrete implementation of the methods in Sect. 5, and to better illustrate the relationship between control policy design and physical robot body design.
606
A. Pervan and T. D. Murphey
A synthetic cell is a mechanically designed microscopic device with limited sensing, control, and computational abilities [19]; it is essentially an engineered cell. A synthetic cell is in a chemical bath and generates movement by interacting with its environment using chemical inhibitors, and it contains simple circuits that include minimal sensors and very limited nonvolatile memory [20]. Such a device is 100 µm in size or less, rendering classical computation using a CPU impossible. But these simple movement, sensory, and memory elements can potentially be combined with a series of physically realizable logical operators to enable a specific task. How can these discrete structures be algorithmically organized to combine sensing and control to accomplish an objective?
4
Design Complexity and Task Embodiment
Graph entropy [1,10] will be used as a measure of design complexity for comparing robot designs. The complexity of a control policy is equated with the measure of entropy of its resulting finite state machine. A finite state machine consists of a finite set of states (nodes), a finite set of inputs (edges), and a transition function that defines which combinations of nodes and edges lead to which subsequent nodes [26]. The finite set of nodes that the system switches between are the control modes, and the edges—inputs to the system which cause the control modes to change—are the state observations (Fig. 1(a)). Finite state machines and their corresponding adjacency matrices are generated numerically, by simulating a synthetic cell forward for one time step, and recording control modes assigned at the first and second states. These control mode transitions are counted and normalized into probabilities, and the resulting adjacency matrix A is used in the entropy calculation, A(i) log (A(i)) (1) h=− i
which results in a complexity measure h for each robot design. This measure of complexity is more informative than other metrics (e.g., simply counting states) because it is a function of the interconnections between states—which is what we want to minimize in the physical design. We define task embodiment as the amount of information about a task encoded in a robot’s motion (not to be confused with embodiment found in human-robot interaction [14,29]). The focus is on how much information a body encodes about a task, so that the design update can be characterized in terms of moving task information from the centralized computations in the control calculations to embedded computation in the physical body. This can be evaluated using the K-L divergence (2) between a distribution representing the task, P , and a distribution representing trajectories of the robot design, Q [5], Q(i) P (i) log . (2) DKL (P Q) = − P (i) i
Embodied Computation in Synthetic Cells
607
To define the goal task distribution P , a model predictive controller (MPC) is used to simulate the task execution of an idealized robot (e.g., with a computer on board and perfect state measurements). The same method is used to generate a distribution Q that represents the robot design—this time implementing the generated control policy in place of the optimal controller. We use (2) to compare these two distributions: a low measure of K-L divergence indicates that the distributions are similar, and implies a high level of task embodiment, and therefore a better robot design. In other words, if a task is well-embodied by a robot, only a simple control policy is necessary to execute it. Otherwise, more information, in the form of a more complex control policy, is required.
5
Control Modes Based on State
We split the design synthesis problem into two stages. First, in this section, the control policy is optimized as a function of state, keeping the number of transitions between control modes low. Then, in the next section, those control modes are projected onto sensor states to generate a design. The min-max optimization of minimizing implementation complexity while maximizing task embodiment is challenging, with many reasonable approaches. We use techniques from hybrid optimal control because of properties described next. It was proven in [4] that optimal control of switched systems will result in a chattering solution with probability 1. Chattering is equivalent to switching control modes very quickly in time. In the case of these control policies, this translates to switching between control modes very quickly in state. As a result, an implementation of the optimal control policy would be highly complex. Instead of an optimal solution, we are looking for a “good enough,” near optimal solution that results in a minimal amount of mode switching. It was shown in [7] that the mode insertion gradient (MIG), which will be discussed in Sect. 5.2, has useful properties in this regard, including that when the MIG is negative at a point, it is also negative for a region surrounding that point, and that a solution that switches modes slowly can be nearly as optimal as a chattering solution. This section will first review the topics of switched systems [4,7,12,30] and the use of needle variations for optimization [11,27,30], then develop an algorithm for building low complexity control policies. The algorithm creates a simple control policy under the assumption that the system has perfect knowledge of its state. Mapping this policy to physically realizable sensors is the subject of Sect. 6. 5.1
Switched Systems
A switched-mode dynamical system is typically described by state equations of the form (3) x(t) ˙ = {fσ (x(t))}σ∈Σ
608
A. Pervan and T. D. Murphey
with n states x : R → X ⊆ Rn , m control modes1 σ : R → Σ ⊂ [0, m], and continuously differentiable functions {fσ : Rn+m → Rn }σ∈Σ [11]. Such a system will switch between modes a finite number of times N in the time interval [0, T ]. The control policy for this type of switched system often consists of a mode schedule containing a sequence of the switching control modes S = {σ1 , . . . , σN } and a sequence of increasing switching times T = {τ1 , . . . , τN } [12,30]. In this paper, we will consider a similar switched-mode system in that the system has n states x = [x1 , x2 , . . . , xn ]T ∈ Rn and m control modes σ = [σ1 , σ2 , . . . , σm ]T ∈ Rm , but instead of implementing an algorithm to optimize transition times between modes, we optimize transition states, such that a robot can directly map sensory measurements of state to one of a finite number of control outputs. 5.2
Hybrid Optimal Control
Let L : Rn → R be a continuously differentiable cost function, and consider the total cost J, defined by T J= L(x(t))dt. (4) 0
We use the Mode Insertion Gradient (MIG) [11,27,30] to optimize over the choice of control mode at every state. The MIG measures the first-order sensitivity of the cost function (4) to the application of a control mode σi for an infinitesimal duration λ → 0+ . The MIG d(x) is evaluated using dJ = ρ(t)T (fσi (x(t)) − fσ0 (x(t))), (5) d(x) = dλ+ t where the adjoint variable ρ is the sensitivity of the cost function ρ˙ = −(
∂L ∂fσ0 (x(t)))T ρ − ( (x(t)))T , ∂x ∂x
ρ(T ) = 0.
(6)
The derivation of these equations is discussed in [11,27,30], but the key point is that d(x) measures how much inserting a mode σi locally impacts the cost J. When d(x) < 0, inserting a mode at state x will decrease the cost throughout a volume around x, meaning a descent direction has been found for that state. The MIG can be calculated for each mode so that d(x) is a vector of m mode insertion gradients: d(x) = [d1 (x), ..., dm (x)]T . Therefore the best actuation mode (i.e., the mode with the direction of maximum descent) for each state x has the most negative value in the vector d(x). As long as the dynamics f (x(t)) are real, bounded, differentiable with respect to state, and continuous in control and time and the incremental cost, L(x(t)), is 1
Typically u is denoted as a control variable, but in this case the control value u could be anything in the greater context of the control mode σ. The control mode may, in fact, consist of many different values of u, which is why u does not appear in a significant way in the posing of this problem.
Embodied Computation in Synthetic Cells
609
real, bounded, and differentiable with respect to state, the MIG is continuous [7]. Sufficient descent of the mode insertion gradient is proven in [7], where the second derivative of the mode insertion gradient is shown to be Lipschitz continuous under assumptions guaranteeing the existence and uniqueness of both x, the solution to the state equation Eq. (3), and ρ, the solution of the adjoint equation Eq. (6). Combining this with the results of [15], one can conclude that any sufficiently dense finite packing will also satisfy the descent direction throughout the volume of packing. As a result, although chattering policies may be the actual optimizers, finite coverings will generate descent throughout the state space, resulting in a non-optimal but “good enough” solution. This property provides the required guarantee that we can locally control the complexity of the policy as a function of state. This will be discussed further in Sect. 5.3. Figure 3 illustrates differences in complexity as a result of optimizing using the mode insertion gradient. The magnitude of the curves is the default control (the control we are comparing to) minus the step size (a scaling factor, and also the line search parameter) multiplied by the MIG. Therefore the magnitude of these plots correspond to the amount of reduction in cost that can be achieved by locally employing each control mode at the state x. The complex policy illustrated in Fig. 3 (c), occurs in simulation in the chattering policy of Fig. 4. This happens when there is similar utility in employing more than one mode in a region—there is only marginal benefit in choosing one control mode over another, which results in increased complexity.
x (a)
High Complexity σk+1 = σk-γdk
Low Complexity σk+1 = σk-γdk
σk+1 = σk-γdk
Minimum Complexity
x (b)
x (c)
Fig. 3. The curves show σ k+1 = σ k − γdk for three different control modes σ1 (blue), σ2 (green), and σ3 (red), where γ is the line search parameter and the background colors indicate which mode is assigned to state x in the control policy. As step size γ increases from left to right, the magnitude of γdk (x) surpasses that of the default control σ k (x) (here the default control is σ1 = 0). (a) Minimum Complexity: Only one control mode is assigned throughout the entire state space. (b) Low Complexity: A few control modes are employed, indicating that the cost function can be reduced by including these extra control modes. (c) High Complexity: The control mode switches often but the magnitudes of the values are very similar, indicating that there is no significant decrease in cost between these modes—despite the large increase in complexity (i.e., chattering).
610
5.3
A. Pervan and T. D. Murphey
Iterative Algorithm
An algorithm is introduced that can reduce the complexity of a control policy in as little as one iteration, based on the work in [6,7].
Algorithm 1. Iterative Optimization k=0 Choose default policy σ 0 (x) Calculate initial cost J(σ 0 (x)) Calculate initial complexity h0 Calculate initial descent direction d0 (x) hk−1 = ∞ while hk < hk−1 + h while J(σ k (x)) < J(σ k+1 (x)) + J Re-simulate σ k+1 (x) = σ k (x) − γdk (x) Compute new cost J(σ k+1 (x)) Increment step size γ end while Calculate new complexity hk+1 Calculate dk+1 (x) k =k+1 end while
Note that the superscripts in this algorithm are not exponents, but indices of loop increments.
In this line search algorithm σ = [σ1 , σ2 , . . . , σm ]T is a vector of all control modes and descent direction d(x) = [d1 (x), d2 (x), . . . , dm (x)]T is the gradient of inserting any mode with respect to a default mode σ 0 (x) at a particular state. The default control may be chosen arbitrarily, but for simplicity we will show an example using a null default policy (in Fig. 5). The cost J(σ k (x)) of the entire policy is approximated by simulating random initial conditions forward in time and evaluating the total cost function for time T . We use cost J rather than task embodiment DKL as the objective function because the line search in the algorithm is a function of d(x) = dJ dλ . This way, the algorithm decreases the objective function (plus tolerance J ) each iteration. After choosing a default policy σ 0 (x) (which can be anything from a constant mode throughout the state space to a highly complex policy), computing the initial cost J(σ 0 (x)), and calculating the initial entropy h0 (using Eq. (1)) the initial descent direction d0 (x) is calculated for the set of points in S, as described in Sect. 5.2. A line search [2] is performed to find the maximum step size γ that generates a reduction in cost in the descent direction dk (x), and then the policy σ k (x) is updated to the policy σ k+1 (x). The new design complexity hk+1 and descent directions dk+1 (x) are calculated, and this is repeated until the cost can
Embodied Computation in Synthetic Cells
611
no longer be reduced without increasing the complexity beyond the threshold defined by h . The tolerances h and J are design choices based on how much one is willing to compromise between complexity and performance. In the example illustrated in Sect. 5.4, the value for h is significant because it represents the allowable increase in complexity—how much complexity the designer is willing to accept for improved task embodiment. This algorithm enforces low design complexity, meaning it will not result in chattering outputs. The work in [7] showed that if d(x(τ )) < 0 then there exists an > 0 such that d(x(t)) < 0 ∀ t ∈ [τ − , τ + ]. Since d(x) is continuous in x (as discussed in Sect. 5.2), d(x0 ) < 0 implies that there exists an > 0 such that d(x) < 0 ∀ x ∈ B (x0 ). Note that each point in B (x0 ) does not necessarily have the same mode of maximum descent, but they do each have a common mode of descent. The MIG serves as a descent direction for a volume in the state space, rather than just at a point. This property allows us to assign one control mode throughy out a neighborhood so that instead of choosing the optimal control mode (the direction of maximum descent) at each point and causing chattering, we select a x good control mode (a direction of descent) throughout a volume and maintain relFig. 4. A chattering control policy. The ative simplicity in the policy. Figure 4 corresponding graph has entropy h = shows a control policy that is the result 7.6035. of assigning the optimal control mode at each point, which results in chattering. 5.4
Examples
For this example, the configuration space of the synthetic cell is defined by its position in two dimensional space (x, y). We define the control authority as the ability to be attracted toward a specific chemical potential. So at any location (x, y) the robot may choose a control mode σ ∈ {σ0 , σ1 , σ2 , σ3 , σ4 , σ5 , σ6 }, where ¨= σ0 is zero control, and the other six modes are a potential (with dynamics x 1 sn −x 2 ||s −x|| , where rn is the distance from the synthetic cell to source n and sn rn n is the location of source n). The control synthesis problem is to schedule σ in space, based on an objective (in this case, to approach a point P )2 . The state space, desired point, and chemical sources are all shown in Fig. 2 (a). Figure 5 begins with an initial control policy of zero control throughout the state space, and increases design complexity and task embodiment until the line 2
Point P is slightly off center, to avoid adverse effects of symmetry. This is reflected in the asymmetry of the resulting control policies (e.g. red and orange not being perfectly even).
612
(a)
A. Pervan and T. D. Murphey
Control Policy
y
y
y
x
x
x 7
Finite State Machine Graph Entropy
(d)
Monte Carlo Simulations initial final
(f)
Average Final Distance
1
3
y
1
2
6
2
0.9199
y
x
7
8
5
y
5
3
1.0177
0
x
Kullback-Leibler (e) Divergence
5
6 4
3
(c)
2
1
4
0
x 4
7
6
(b)
y
10 9
1.1438
y
x
x
0.4470
0.4490
0.3320
0.0471
1.8166
1.6573
0.8010
0.0202
Fig. 5. Control policies for the system in Fig. 2, starting from a null initial policy. (a) The control policies, mapping state to control for various iterations of the line search. (b) A finite state machine representation of each policy, representative of the complexity of the system. (c) The design complexity value calculated from Eq. (1). (d) 1000 Monte Carlo simulations illustrate the results of random initial conditions using the associated control policies. (e) The Kullback-Leibler divergence between the goal task distribution and the distribution generated by the control policy, from Eq. (2). (f) The average final distance of the final states of (d) from the desired point P .
search algorithm converges to a new control policy. Monte Carlo simulations were performed with 1000 random initial conditions, shown in row (d) and the average distance of the final points from the desired point is shown in row (f). Most interesting are the trends in rows (c) and (e). These correspond to the minmax problem posed earlier, in which we attempt to minimize design complexity, computed using graph entropy (c), and maximize task embodiment, calculated using K-L divergence (e). The graph entropy in row (c) increases as the K-L divergence in row (e) increases. This shows the entropy must increase (from 0) to ensure some amount of task embodiment. Synthetic cells can encode these simplified control policies by physically combining their movement, sensory, and memory elements with a series of logical operators, as discussed next in Sect. 6.
6
Projecting Policies onto Discrete Sensors
Section 5 described synthetic cells with perfect state measurement. In this section, implementations using discrete sensors will be explored. In some cases,
Embodied Computation in Synthetic Cells
613
it may be possible to create sensors that are able to detect exactly where a robot should switch between control modes (e.g. a sensor that can perfectly sense the boundary between the green and orange regions of the control policy). It is also possible that a designer may start with a fixed library of sensors, in which case the state space should first be divided into sensed regions, and then control modes should be assigned. Probably the most likely case, and the one we will examine in the section, is that a designer has a variety of sensors to choose from, and will want to use some subset of them. For the synthetic cell example, we will assume discrete sensing provided by a chemical comparator—a device that compares the relative strength of two chemical concentrations. From a given library of sensors, how should the combination of sensors, actuators, and logical operators be chosen so that the task is best achieved? Figure 6 (a) shows five different individual sensors: each comparing the strength of chemical source 1 to another of the chemical sources in the environment, and how each of these sensors is able to divide the state space into two distinct regions, while Fig. 6 (b) illustrates which regions of the state space are able to be discerned using these 5 sensors combined. Figure 6 (c) shows all possible combinations of comparators: all 6 chemical sources compared to each of their 5 counterparts, and therefore the maximum granularity of sensed regions in the state space using this sensor library3 .
(a)
(b)
(c)
Fig. 6. (a) Illustration of five individual chemical comparators, each comparing the chemical potential of Source 1 with one of the other sources. Each sensor can tell whether the robot is on one side of an equipotential—the dotted lines—or the other. (b) Combination of the five sensors. The robot is able to sense which of these 9 regions it is in. (c) Sensor regions resulting from the combination of all possible synthetic cell comparators in this environment. 3
Note that some comparators divide the state space in the exact same way, i.e. comparing chemical sources 1 and 3 results in the same sensed regions as comparing sources 2 and 4 (this is true for three other sets of comparators: 1/2 = 3/4 = 5/6, 1/5 = 2/6, and 3/5 = 4/6).
614
A. Pervan and T. D. Murphey
The optimal scenario would be that these sensor regions correspond perfectly to the control regions found using the iterative algorithm in Fig. 5. Since this will almost never be the case, we must attempt to approximate our control policy using the library of sensors. Figures 7 and 8 demonstrate how synthesized control policies combined with a library of sensors create a physical design. Figure 7 (a) shows two comparators chosen from the sensor library and how they each divide the state space into sensed regions and Fig. 7 (b) is the policy that results from projecting the final control policy found in the algorithm onto the feasible sensor space. This projection is done by dividing the state space into k different sensed regions R = [R1 , ..., Ri , ..., Rk ] and finding the volumes v = [v1 , ..., vj , ..., vm ] occupied by each of the m control modes in the policy. In the cases of Figs. 7 and 8, k is equal to 4 and 32, respectively, and v is the n-dimensional volume corresponding to each of the 7 control modes. The control mode σ with the maximum volume in each region Ri is assigned to that region using Eq. (7). σRi = argmax(vj ∩ Ri )
(7)
j
Logical operators can be combined with sensory observations to represent the state space with more fidelity than sensors alone (e.g. a single sensor in Fig. 6(a))—so that actions can be associated with combinations of sensory observations (e.g. Fig. 6(b)). Figure 7 (c) illustrates the logical diagram that would be physically encoded in circuitry onto a synthetic cell so that the policy in Fig. 7 (b) could be executed. Figure 8 is similar to Fig. 7, but illustrates the physical design corresponding to the highest fidelity control policy from the library of comparator sensors. Figure 8 (a) shows each of the sensors in the library, including the ones that repeat sensed regions due to the symmetry in this environment. The projected control policy is shown in Fig. 8 (b) and (c) illustrates the logic of the physical circuitry. 1
AX
2
Comparator y
x
1 Compare 1 to 2
y
Memristor
5
2
1 1
1
1 Compare 5 1 to 5
y
1 Compare 5 1 to 5
2 5 1
5
x
x
(a)
(b)
A1
A3
A2
(c)
A4
(d)
Fig. 7. Low-fidelity Design. (a) Two sensors. Top: comparing chemical Sources 1 and 2 divides the state space into left and right. Bottom: comparing Sources 1 and 5 divides the space into top and bottom. (b) Control policy from Fig. 5 projected onto the sensed regions. (c) Logical decision diagram for this system. (d) Circuit diagram for physical synthetic cell design.
Embodied Computation in Synthetic Cells 1
2
y
615
1
y
6
3
1 4
AX
1 x
x
1
5 1 Compare 1 to 2
1
y
4
2
1
5
3
4
6
y
5
3 Compare 1 1 to 3
1 Compare 3 1 to 3
2 5
5
x
x
2
1
y
y
5 Compare 1 1 to 5
1 Compare 5 1 to 5
y
3
3
6
1
3
1 4 Compare 5 4 to 5
6 Compare 3 3 to 6
3 Compare 5 3 to 5
2 3
3
1 1 Compare 4 1 to 4
6
x
2 Compare 5 2 to 5
A3
A3
1
x
A4
5 Compare 3 3 to 5
A1
2
y
y
3
x
1 Compare 6 1 to 6
5 Compare 2 2 to 5
2 Compare 3 2 to 3
A1
x
A1 y
3
6
x
A3
A4
1 Compare 4 1 to 4
A6
3 Compare 6 3 to 6
A3
A3
A3
A3
A2
A4
6
3
1 Compare 6 1 to 6
4 Compare 5 4 to 5
A4
3 Compare 6 3 to 6
6
5
3
A2
5 4
5
5
5
y
A4
4
1
A3
1
5
A5
2
4
3 2
x
5
x
(a)
(b)
(c)
(d)
Fig. 8. High-fidelity Design. (a) Ten sensors. The equipotential lines demonstrate how the device can use chemical comparators to estimate its location in the environment. (b) Control policy from Fig. 5 projected onto the sensed regions. (c) Logical decision diagram for this system. (d) Circuit diagram for physical synthetic cell design.
It is notable that the designs in Figs. 7 and 8 are quite dissimilar. Figure 9 shows how each of the physically feasible designs compare to each other, to another physically feasible design, and to the control policy with perfect knowledge of state. The high-fidelity design in the middle of Fig. 9 captures much of
(a)
Control Policy
y
y
y
y
x
x
x
x
4 3
(b)
6
7
8
Graph Entropy
(d)
Monte Carlo Simulations initial final
(f)
Average Final Distance
11
2 5
7
1
8 6
9
1
2
1
2
3
4
3
5
7
8
10
4
9
6
9
0.5682
1.1438
y
y
x
Kullback-Leibler (e) Divergence
10
5
3
2
(c)
4
1
Finite State Machine
y
x
0.0379
0.3087
y
x
x
0.0471
0.2657
0.2940
0.6311
0.0202
0.3454
0.6153
1.4193
Fig. 9. Left: The policy generated when perfect knowledge of state was assumed. Right: A high-fidelity design using all of the (10 distinct) sensors in the sensor library, as shown in Fig. 8. A medium-fidelity design, using five sensors. A low-fidelity design, using only two of the sensors from the sensor library, as shown in Fig. 7.
616
A. Pervan and T. D. Murphey
the structure of the sensor-agnostic policy, and the results are evident in the relatively low K-L divergence. The medium-fidelity design uses fewer sensors than the high-fidelity one and therefore does not embody the task quite as well. The low-fidelity design has a higher K-L divergence than even the null policy in Fig. 5. This is because during the projection, the control mode assigned to each sensor region corresponded to a chemical potential that was inside that same region, ensuring that no simulated synthetic cell was ever able to escape its quadrant. The fact that all three of these designs were projected from the same original control policy demonstrates that, as expected, the success of the resulting robot design is highly sensitive to choice of sensors.
7
Discussion
In this work we addressed the question of designing robots while minimizing complexity and maximizing task embodiment. We demonstrated our method of solving this min-max problem, which included an iterative algorithm resulting in a control policy assuming perfect sensing, and then projecting that policy onto a discrete space of sensed regions resulting from a library of sensors. This is not necessarily an optimal design pipeline for all robot design problems. In some cases it may be possible to find a simple control policy assuming perfect sensing, and then create sensors that best align with that policy. Conversely, there may be some instances where there is a fixed library of sensors, in which case one would first divide the state space into discrete regions and then find the optimal control mode in each of those regions. In future work, this algorithm will be implemented for a wider range of dynamical systems, specifically higher order systems. The algorithm will be tested with different modifications, including using DKL in the objective function so that task embodiment is the actual object of the optimization, rather than a correlated consequence. Finally, this methodology will be validated by using control policies computed by this method to design and create actual, physical synthetic cells.
References 1. Anand, K., Bianconi, G.: Entropy measures for networks: toward an information theory of complex topologies. Phys. Rev. E 80, 045102 (2009) 2. Armijo, L.: Minimization of functions having Lipschitz continuous first partial derivatives. Pac. J. Math. 16, 1–3 (1966) 3. Belta, C., Bicchi, A., Egerstedt, M., Frazzoli, E., Klavins, E., Pappas, G.J.: Symbolic planning and control of robot motion [grand challenges of robotics]. IEEE Robot. Autom. Mag. 14, 61–70 (2007) 4. Bengea, S.C., DeCarlo, R.A.: Optimal control of switching systems. Automatica 41, 11–27 (2005) 5. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006)
Embodied Computation in Synthetic Cells
617
6. Caldwell, T.M., Murphey, T.D.: Projection-based optimal mode scheduling. In: IEEE Conference on Decision and Control, December 2013 7. Caldwell, T.M., Murphey, T.D.: Projection-based iterative mode scheduling for switched systems. Nonlinear Anal. Hybrid Syst. 21, 59–83 (2016) 8. Censi, A.: A mathematical theory of co-design (2015). ArXiv e-prints. https:// arxiv.org/abs/1512.08055 9. Collins, S., Ruina, A., Tedrake, R., Wisse, M.: Efficient bipedal robots based on passive-dynamic walkers. Science 307, 1082–1085 (2005) 10. Dehmer, M., Mowshowitz, A.: A history of graph entropy measures. J. Inf. Sci. 181, 57–78 (2011) 11. Egerstedt, M., Wardi, Y., Axelsson, H.: Transition-time optimization for switchedmode dynamical systems. IEEE Trans. Autom. Control 51, 110–115 (2006) 12. Egerstedt, M., Wardi, Y., Axelsson, H.: Optimal control of switching times in hybrid systems. In: International Conference on Methods and Models in Automation and Robotics, Miedzyzdroje (2003) ´ 13. Esteban-Fern´ andez de Avila, B., Angsantikul, P., Ram´ırez-Herrera, D.E., Soto, F., Teymourian, H., Dehaini, D., Chen, Y., Zhang, L., Wang, J.: Hybrid biomembrane functionalized nanorobots for concurrent removal of pathogenic bacteria and toxins. Sci. Robot. 3, eaat0485 (2018) 14. Huzaifa, U., Bernier, C., Calhoun, Z., Heddy, G., Kohout, C., Libowitz, B., Moenning, A., Ye, J., Maguire, C., Laviers, A.: Embodied movement strategies for development of a core-located actuation walker. In: 2016 6th IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob), pp. 176–181, June 2016 15. Karaman, S., Frazzoli, E.: Sampling-based algorithms for optimal motion planning. Int. J. Robot. Res. 30, 846–894 (2011) 16. Li, J., Shklyaev, O.E., Li, T., Liu, W., Shum, H., Rozen, I., Balazs, A.C., Wang, J.: Self-propelled nanomotors autonomously seek and repair cracks. Nano Lett. 15, 7077–7085 (2015) 17. Li, S., Jiang, Q., Liu, S., Zhang, Y., Tian, Y., Song, C., Wang, J., Zou, Y., Anderson, G.J., Han, J., Chang, Y., Liu, Y., Zhang, C., Chen, L., Zhou, G., Nie, G., Yan, H., Ding, B., Zhao, Y.: A DNA nanorobot functions as a cancer therapeutic in response to a molecular trigger in vivo. Nature Biotechnol. 36, 258– 264 (2018) 18. Liebenwein, L., Baykal, C., Gilitschenski, I., Karaman, S., Rus, D.: Sampling-based approximation algorithms for reachability analysis with provable guarantees. In: Proceedings of Robotics: Science and Systems (RSS), June 2018 19. Liu, P., Liu, A.T., Kozawa, D., Dong, J., Yang, J.F., Koman, V.B., Saccone, M., Wang, S., Son, Y., Wong, M.H., Strano, M.S.: Autoperforation of 2D materials for generating two-terminal memristive Janus particles. Nature Mater. 17, 1005–1012 (2018) 20. Liu, P., Cottrill, A.L., Kozawa, D., Koman, V.B., Parviz, D., Liu, A.T., Yang, J.F., Tran, T.Q., Wong, M.H., Wang, S., Strano, M.S.: Emerging trends in 2D nanotechnology that are redefining our understanding of “Nanocomposites”. Nano Today 21, 18–40 (2018) 21. Mavrommati, A., Murphey, T.D.: Automatic synthesis of control alphabet policies. In: IEEE International Conference on Automation Science and Engineering (CASE), August 2016 22. Murthy Arelekatti, V.N., Winter, A.G.: Design and preliminary field validation of a fully passive prosthetic knee mechanism for users with transfemoral amputation in India. J. Mech. Robot. pp. 350–356 (2015)
618
A. Pervan and T. D. Murphey
23. O’Kane, J.M., LaValle, S.M.: Comparing the power of robots. Int. J. Robot. Res. 27, 5–23 (2008) 24. O’Kane, J.M., Shell, D.A.: Concise planning and filtering: hardness and algorithms. IEEE Trans. Autom. Sci. Eng. 14, 1666–1681 (2017) 25. Pola, G., Girard, A., Tabuada, P.: Approximately bisimilar symbolic models for nonlinear control systems. Automatica 44, 2508–2516 (2008) 26. Ramadge, P.J., Wonham, W.M.: Supervisory control of a class of discrete event processes. SIAM J. Control Optim. 25, 206–230 (1987) 27. Shaikh, M.S., Caines, P.E.: On the optimal control of hybrid systems: optimization of trajectories, switching times, and location schedules. In: Hybrid Systems: Computation and Control, Springer, Heidelberg (2003) 28. Tedrake, R., Zhang, T.W., Fong, M., Seung, H.S.: Actuating a simple 3D passive dynamic walker. In: IEEE International Conference on Robotics and Automation, April 2004 29. Wainer, J., Feil-Seifer, D.J., Shell, D.A., Mataric, M.J.: Embodiment and humanrobot interaction: a task-based perspective. In: IEEE International Symposium on Robot and Human Interactive Communication, August 2007 30. Xu, X., Antsaklis, P.J.: Optimal control of switched systems via non-linear optimization based on direct differentiations of value functions. Int. J. Control 75, 1406–1426 (2002)
Tools for Perception, Kinematics and Dynamics
Parallax Bundle Adjustment on Manifold with Improved Global Initialization Liyang Liu1(B) , Teng Zhang1 , Yi Liu2 , Brenton Leighton1 , Liang Zhao1 , Shoudong Huang1 , and Gamini Dissanayake1 1
2
University of Technology Sydney, Sydney, NSW 2007, Australia [email protected] Huazhong University of Science and Technology, Wuhan 430074, China
Abstract. In this paper we present a novel extension to the parallax feature based bundle adjustment (BA). We take parallax BA into a manifold form (PMBA) along with an observation-ray based objective function. This formulation faithfully mimics the projective nature in a camera’s image formation, resulting in a stable optimization configuration robust to low-parallax features. Hence it allows use of fast Dogleg optimization algorithm, instead of the usual Levenberg Marquardt. This is particularly useful in urban SLAM in which diverse outdoor environments and collinear motion modes are prevalent. Capitalizing on these properties, we propose a global initialization scheme in which PMBA is simplified into a pose-graph problem. We show that near-optimal solution can be achieved under low-noise conditions. With simulation and a series of challenging publicly available real datasets, we demonstrate PMBA’s superior convergence performance in comparison to other BA methods. We also demonstrate, with the “Bundle Adjustment in the Large” datasets, that our global initialization process successfully bootstrap the full BA in mapping many sequential or out-of-order urban scenes.
Keywords: Bundle adjustment
1
· Global SfM · Monocular SLAM
Introduction
Visual SLAM as well as Structure from Motion (SfM) estimate camera poses and 3D scene geometry simultaneously from 2D images. Bundle adjustment is the gold standard back-end method in these activities that it finds the optimal pose and map in the least squares sense [1] to best explain the data. Solving such a non-linear least squares problem typically requires iterative Newton-based methods [2]: start with an initial guess, repetitively add increments by solving a normal equation until convergence. As shown in Table 1, this approach comes in three forms: original Gauss-Newton (GN) when the equation is easy to solve (the Hessian matrix H has a small condition number), Levenberg Macquardt (LM) as a damped GN if Hessian is near singular, and DogLeg (DL) as a combination of GN and the steepest descent method for fast convergence [6,7]. Although all c Springer Nature Switzerland AG 2020 M. Morales et al. (Eds.): WAFR 2018, SPAR 14, pp. 621–638, 2020. https://doi.org/10.1007/978-3-030-44051-0_36
622
L. Liu et al. Table 1. Three types of Newton-based methods GN x = H
LM −1
DL −1
e(x) x = (H + λI)
e(x) x = (λ1 H−1 + λ2 I)e(x)
three methods can benefit from a sparse implementation, the overall convergence performances are different. Within each iteration step, LM tries to repetitively solve an augmented equation with newly adjusted damping term λ until error is reduced. In comparison, DL only performs Hessian inversion once for all retries, it is therefore more efficient than LM. However, GN and DL are both considered risky due to the large step size and are often avoided in practice. LM is a favourite of the robotics and computer vision communities for its safe handling despite its slowness [6]. Problematic Features. In many modern BA systems [4,11,24], a 3D feature point is parameterized either as Euclidean coordinates (XYZ) or by the direction and inverse of depth from the first observing camera (IDP) [9]. A well-known problem for these representations is that existence of low parallax features during motion causes singularity in the Hessian matrix, a main contribution to GN divergence and numerical instability [9,10]. A small change in error function leads to a large jump in the state variable, making it difficult to specify a consistent stop criterion. To avoid singularity, slow LM is commonly used for safe increment [7,10] in place of GN or DL, efficiency is compromised for stability but could easily result in local minimum. These problematic features manifest in outdoor scenes as far away features and in street view scenes as features collinear to the observing camera motion. IDP can elegantly handle far features [9] but fail to cope with collinear ones [10]. Figure 1 illustrates the failure modes in possible BA forms.
Fig. 1. Compare BA for “Malaga dataset”: existence of collinear features (yellow dots) cause IDP (brown) and XYZ (green) to differ significantly from Ground Truth (red); PMBA (blue) and PBA (the original parallax-based BA [10], orange) do not encounter this issue. PMBA has fastest convergence, see Fig. 8(a).
PMBA with Improved Global Initialization
623
Robustness to problematic features is a major issue in urban SLAM. Several remedies are adopted to address it, with the common principle of separate treatment for problematic features and good ones. In ORB-SLAM [25], a prudent feature selection strategy is applied where features with in-sufficient parallax angles are discarded although they do contain certain information. A hybrid method was proposed in [23], that first estimates camera orientations with remote features then optimises with poses and near features. The vision smart factor proposed in [26] (implemented in GTSAM [24]) shares the same approach of [23]. It avoids degeneracy by using a flexible-sized error function. Recently [27] proposed a solution in which less weighting is given to the error terms for problematic features. Our proposed solution in this paper takes a totally different viewpoints. After rethinking the difference between state and state uncertainty, we argue that the root cause for degeneracy is that the uncertainty of conventional feature forms is not uniformly bounded. In our previous work [10], we presented the parallax angle based bundle adjustment (PBA) algorithm where a feature is represented by 3 highly observable angles without involving depth: a direction confining azimuth and elevation angles and a depth related parallax angle. [10] demonstrated that PBA is more robust and efficient compared to XYZ or IDP form BA’s. Our proposed manifold formulation – PMBA is a continuation along this approach and leads to even better convergence properties. Initialization Methods. BA due to its highly non-convex nature [2], requires good initial estimates to converge to global minimum. The common initialization methods are incremental or global. In incremental methods, with a simple start, many mid-level BAs are performed on each new pose insertion. This strategy draws the criticism that it is slow and relies heavily on picking good initial image pairs to progress. Example systems are VisualSFM [22], Bundler [12] and ORB-SLAM [25]. The alternative is the global strategy where all camera poses are initialised simultaneously. Global SfM thus bootstrapped shows higher efficiency and accuracy. The global strategy exposes many research challenges, and has been studied carefully in [17–19]. Our previous work [10] involved a simple initialization method that unfortunately is vulnerable to complicated camera motions and is only targeted at sequential inputs. The proposed initialization scheme in this paper addresses this issue. Contributions and Paper Structure. This paper provides a novel BA formulation and initialization method robust to problematic features. First, we present (in Sect. 2) a novel BA formulation using parallax feature paramterization on manifold, its retraction method and an observation ray based error function. Next we show that the underlying optimization exhibits non-singular Hessians and bounded error functions, fully suppressing degeneracy due to problematic features. These good convergence properties allow fast DL optimization and is robust to urban scenes. In Sect. 3, we propose a global initialization strategy in which the PMBA problem is simplified into an easy-to-solve position registration problem. We show that the simplification leads to near-optimal solution under
624
L. Liu et al.
low-noise condition. We develop theorems and analysis for both contributions. In Sect. 4, we verify our claims through simulation and a series of large-scale publicly available datasets, all including low-parallax features. We present theorem proofs and reconstruction results in the supplementary material [3]. Notations. Throughout this paper, x× denotes a skew symmetric matrix x gives vector from vector x ∈ R3 . ξ(·) is the normalization operator: ξ(x) = x T x’s direction. π(·) is the homogeneous normalization operator: π( x, y, z ) = T x/z, y/z, 1 . We use the calligraphic font and roman font to represent manifold and Euclidean variable respectively. For example, the i’th camera pose is represented as Ti = (Ri , Pi ) ∈ SE(3). For the j’th feature, Fj and Fj denote its PMBA parameter and 3D coordinates respectively. We use subscript (l) to indicate the reference frame is local.
2
Parallax Bundle Adjustment on Manifold
In this section, we introduce the PMBA formulation. We first define its feature parameterization in manifold domain, then show a retraction method and its compatible error function. Next we give a thorough theoretical analysis on the boundedness of its information matrix, hence proves its smooth convergence without singularity. All these factors lead to the possibility of using faster DL optimization method, which is a significant improvement over previous work [10]. 2.1
Feature Parameterization
A feature’s depth information can be computed from the parallax between observations from different viewpoints. For a 3D feature point Fj , amongst the set of observing cameras Tj , we choose a main Fj Rm anchor Tmj and an associate anchor Taj that j n θj j form the best parallax angle from their obserξ (Nj,i) vation rays. This geometric relationship for Pm − Pa α θ − feature j is illustrated in Fig. 2. α Pmj In manifold domain, we denote the feaPaj ture as Fj and over-parameterize it by its pi unit observation ray vector nj in main-anchor Fig. 2. Feature point Fj anchored frame, and the parallax angle θj , j
j
j
j
j
by Pmj and Paj with parallax angle θj . An arbitrary observing camera is shown at position Pi . Directions of ray from Pmj and Pi are labeled as Rmj nj and ξ(Nj,i ) respectively, all in global frame
.
Fj = (cos θj , sin θj , nj )
(1)
This parameterization only defines the relative structure of the feature with respect to its two anchors. The scale of point Fj
PMBA with Improved Global Initialization
625
is implicitly defined by the relative translation of the two anchors, and is computed as sin(αj − θj ) # » Pmj − Paj Rmj nj + Pmj Fj (Fj ) = Paj Pmj + Pmj = sin(θj )
(2)
where • nj ∈ R3 is the direction of observation ray from Pmj to Fj , local in Tmj frame. sin(αj −θj ) • sin(θ Pmj − Paj is the distance between the two, from sine rule. j) • Rmj is the rotation for main anchor frame Tmj . • αj is the angle between vector (Pmj −Paj ) and Rmj nj , see (11) for derivation Remark 1. In the original PBA parameterization [10], ray direction nj was defined by an elevation and azimuth angle in the global frame, camera’s orientation {Ri } in Euler angles. When a feature’s elevation angle is π2 , its azimuth angle becomes irrelevant. Expressing directions in sinusoids of angles is a potential source of singularity. In PMBA, both nj and Ri are in the manifold domain. Moreover, nj is newly defined to be in Tmj ’s local frame, for ease of multi-camera system application. 2.2
State Variable Retraction in Manifold
Optimization in manifold follows a three step procedure [8]: lift a manifold variable to n ⊕ δn its tangent space, solve a normal equation to n obtain the Euclidean increment, and retract back to manifold. We use the pose retracpm tion method described in [20]. For feature A2 Ta A1 ng ray direction, we give a natural definition of ent Aδn spa ce uncertainty as a normally distributed rotational perturbation to the directional vector Fig. 3. Retraction of ray n in main as shown in Fig. 3. The rotation’s axis conanchor stitutes a plane normal to the ray passing through the observing camera, the plane is the tangent space. We can express the perturbed ray direction as: ˜ j = Exp(Anj δnj )nj , n
δnj ∈ N (0, Σ).
(3)
where δnj ∈ R2 , Anj ∈ R3×2 is the left null-space of nj and [Anj nj ] ∈ SO(3), Exp() is the exponential map for SO(3). The optimal perturbation is the increment for retraction: Fj δFj = (cos(θj + δθj ), sin(θj + δθj ), Exp(Anj δnj )nj ). (4) where the total increment δFj = δθj , δnj ∈ R3 has same dimensionality as conventional parameterization.
626
2.3
L. Liu et al.
Error Function and Optimization Formulation
In PMBA, we estimate camera poses T = {(Ri , Pi )}i=1,··· ,M and feature parameters F = {Fj ∈ M3 }j=1,··· ,N from a set of images {Ii }. When the feature j is observed from the pose Ti , the monocular sensor intercepts the light ray Nj,i that passes through its centre to the feature point at the image pixel umj,i . In conventional bundle adjustment, pixel imprints are used directly as measurements to estimate feature set F = {Fj ∈ R3 }j=1,··· ,N together with poses. As a maximum a posterior (MAP) problem, the conventional BA is formulated as: eij (Ti , Fj )2 = min K ◦ π(Ri (Fj − Pi )) − umj,i 2 , (5) min T ,F
T ,F
i∈Tj ,j
i∈Tj ,j
where K represents the camera calibration matrix and π is the homogeneous normalization operator. The error function eij (·) ∈ R2 does not give clue to cheirality property (lies in front of or behind a given camera). On the other hand, the observation ray Nj,i ∈ R3 includes directional information and should provide a better measurement for feature source than its 2D pixel counterpart. From (2), we express Nj,i (in global frame) as: Nj,i = sin(θj )(Fj − Pi ) = sin(αj − θj )Pmj − Paj Rmj nj + sin(θj )(Pmj − Pi )
(6)
Note that we have applied a factor of sin(θj ) for convenience of mathematical manipulation. The variable forms the observation ray appears in this paper is in Table 2. Table 2. Various forms of observation ray in this paper Global ray
Global ray direction Local ray
Nj,i = Fj − Pi ξ(Nj,i ) =
Fj −Pi Fj −Fi
(l) Nj,i
=
Local ray direction
Ri (Fj
(l)
− Pi ) ξ(Nj,i ) =
Ri (Fj −Pi ) Ri (Fj −Pi )
We now define the ray direction based error function, essentially a “chordal distance” of bearing vectors (on the sphere): (l)
(l)
eij (Nj,i ) := vj,i − ξ(Nj,i ) ∈ R3 ,
(7)
where vj,i = ξ(K−1 umj,i ) ∈ R3 is the observation ray’s measured direction. Moving this measurement to the global frame, expressing all states in manifold, we come to the final non-linear least squares problem for PMBA: (l) ξ(Nj,i ) − Ri vj,i 2 , X = (T , F ) (8) min f (X )2 = min X
T ,F
i∈Tj ,j
PMBA with Improved Global Initialization Fmj Ray direction error:
Fj
eij (ξ(Nj,i )) ≈
β 2
627
The difference between pixel and raydirection error forms are shown in Fig. 4.
R i v j,i
)
Remark 2. The conventional 2D error function (5) shows ambiguity for frontal or hinu u der feature positions, the non-uniqueness of function values brings many local minimums and saddle points. In comparison, the proConventional error e (F ) = u − u posed 3D error function ( 7) leads to improved Pi monotonicity with fewer local minima. Its magnitude can be modelled with the angle Fig. 4. Camera measurement error. β between the estimated and measured ray direction: eij = 2 sin( β2 ) ∈ [0..2]. This covers the entire range of small error (β → 0) to behind-scene error (β → π). Further, (5) employs homogeneous normalization π(·) which prevents feature’s local Z-ordinate from approaching zero, causing BA discontinuity. Eq. ( 7) employs vector normalization ξ(·) instead, is almost totally continuous. i N j,
Image
ξ(
m
β
ij
j
m
With the above discussion, the new optimization (8) should exhibit significantly increased convergence region and the ability to correct feature state from many erroneous estimates through successive iterations, including behind the camera case. 2.4
Theoretical Analysis on Convergence Properties
We now give an analysis on PMBA’s convergence properties. Consider the Hessian matrix of the problem (8) HTT HTF , H=J J= HTF HFF
(9)
ΔX ) |ΔX =0 and X ΔX := (T ΔT , F ΔF ). Like the where J := ∂f (X ∂ΔX Hessian matrix in conventional BA, HFF is block diagonal. Applying the Schur’s complement method, the dominant computation in each Newton method’s iteration boils down to solving the following normal equation: −1 (10) (HTT − HTF H−1 FF HTF )ΔT = − I HTF HFF f (X ),
In conventional BA, existence of problematic features causes the matrix HTT − HTF H−1 FF HTF and the block matrix HFF (with slight abuse of notation) to be ill-conditioned at the neighborhood of global minimum. The global minimum locates at a “long flat valley” [10] such that solvers fail or require large number of iterations to converge, see Fig. 8(a) for illustration. In comparison, PMBA’s formulation (8), thanks to the re-defined retraction (4) and the compatible error function (7), faithfully complies with projective geometry in image formation, is therefore well-posed with significantly improved local observability despite of “problematic” features.
628
L. Liu et al.
Theorem 1. Under the formulation (8), HFF is consistently non-singular for any X and HFF ≥ I. Proof. See [3] for proof. Theorem 1 completely suppresses all ill-conditioned HFF so normal equations (GN) are always solvable. With increased convergence region, the DL trust-region method can be safely applied to increase iteration step size (a sum of GN step plus steepest descent), and minimize iteration time (cheap retries). Theorem 1 can also be appreciated from an Information Theory perspective: the Hessian matrix at global minimum is the inverse of the covariance matrix (up to a scale) and thus the uncertainty of the parallax angle θj and the direction nj is uniformly bounded. Remark 3. The original PBA [10] cannot guarantee non-singularity in HFF due to use of standard addition retraction for feature, Euler angles for orientation and the error function (5). Remark 4. Although the matrices HTT and HTF are denser, compared to those in XYZ or IDP, HTT − HTF H−1 FF HTF shows same sparsity. Thus the computational time for each iteration in PMBA is comparable to conventional BA, see [10].
3
Global Initialization
In this section, we derive a initialization strategy compatible to PMBA. The goal is to derive camera poses from a set of essential matrices. We do this in three steps. We first identify well-matched image pairs to obtain their epipolar geometry’s (EG). From the EG-pairs we then initialize rotations and features. Finally we simplify the original PMBA problem into a pose-graph problem and estimate camera positions with two optimization stages: constrained least square (CLS) and non-linear (NLS) optimisation. We show that near-optimal solution can be obtained under low-noise conditions. This pipeline of global initialization and PMBA are illustrated in Fig. 5.
2. Preprocessing 1. Identify EG-pairs
• Initialise {Ri } • Initialise {fj } • Improve EG-pairs
3. Initialize Position • CLS pose-graph
4. Full PMBA
• NLS pose-graph
Fig. 5. Full global initialization + PMBA pipeline.
PMBA with Improved Global Initialization
3.1
629
Orientation and Feature Initialization
Following the approach in [16–18], we first obtain an initial guess for orientations. This requires a maximal set of EG-pairs (relative rotation and translation) be formed from two-view matches. We use Kneip’s 5-point algorithm [14] to calculate each EG’s essential matrix. Next we build a maximum spanning tree from EG-pairs and discard bad pairs thus establishing accurate image connectivity. We choose the tree root as our reference frame, and use tree branches to help form prior rotation estimates. This is especially useful with out-of-order image inputs. We use the state-of-art chordal initialization [13] for rotation averaging. We found the output rotations very reliable and can feed them back to the tree for outlier-pruning and relative translation fine-tuning [15]. With accurate rotation estimates, we are ready for feature initialization. We adopt the default anchor selection strategy in [10], with the small change that co-visible poses that participate in anchor selection have to be part of an EGpair. This step ensures best as-can-be parallax angle be given to each feature point. We stress that any problematic features corresponding to low parallax angles do stay in the state and do not affect convergence under PMBA. Good features together with problem ones work together to shape the final solution. Remark 5. In PMBA, feature paramterization does not require scale information. Their initialization therefore relies only on camera rotations [10] and are very accurate. We thus completely avoid unreliable/expensive linear triangulation. 3.2
Position Initialization
Rotations and features initialized above are highly accurate, they can be assumed fixed thus do not participate in the subsequent optimization. This way PMBA is transformed into a pose-graph problem where the unknown variables are positions only. This pose-graph problem it not convex but can be further simplified. We do so first by approximating the ray Nj,i from a non-linear function of poses to a linear combination of positions. Specifically, the non-linear term Pmj − Paj Rmj nj in (6) can be seen as a rotation of (Paj − Pmj ) to Rmj nj about axis nzj by angle π − αj , as illustrated in Fig. 6(a). Both nzj and αj are locally observable and are computed as, αj = arccos(ξ(Pmj − Paj ) (Rmj nj )),
nzj = ξ(Paj − Pmj ) × (Rmj nj ) (11)
¯ j,i : We now give the linearized expression for the observation ray, denoted N ¯ j,i (Pm , Pa , Pi ) = N j j sin(¯ αj − θ¯j ) exp(¯ nzj (π − α ¯ j ))(Paj − Pmj ) + sin(θ¯j )(Pmj − Pi )
(12)
¯ j,i into (8), we establish a “position only” optimization, After substituting N ¯ F) ¯ := min ¯ j,i (Pm , Pa , Pi )) − R ¯ i vj,i 2 . ξ(N (13) min h(P, R, j j P
{Pi }
i∈Tj ,j
630
L. Liu et al.
nzj Rmj nj
θj
αj
Pmj − Pa j
Pmj
αj
−
Fj
R iv j,i i
N j,
θj
Paj
pi
Nj,i × (Ri vj,i )
(b) PMBA reformatted as a constrained (a) Linearize ray function: rotate (Paj − Pmj ) by (π − αj ) about nzj becomes Pmj − Paj Rmj nj . -LS problem: minimize cross product
Fig. 6. Simplification of PMBA into position only problem
This approach of position registration from unitized direction vectors is inspired by the non-linear method from [18]. The cost-function in [18] is essentially an algebraic difference of inter-pose directions. Whereas ours is based on posefeature directions without solving for the features, and can directly handle collinear motions by virtue of parallax structure (see Sect. 3.3). Remark 6. Considering (13) is still a nonlinear problem, an initial guess is needed by its iterative solver. We obtain the initial values by computing the optimum of a constrained least squares problem. The objective is to minimize the crossproduct between ray Nj,i (linear to positions) and Ri vj,i as shown in Fig. 6(b). Due to sign ambiguity in cross-products, we add the linear constraint to ensure cheirality condition. The overall CLS problem is: ¯ i vj,i × N ¯ j,i (Pm , Pa , Pi )2 , z(R ¯ j,i ) >= 0. ¯iN R (14) min j j {Pi }
3.3
i∈Tj ,j
Theoretical Analysis
Theorem 2. With accurate initial estimate for orientation, the formulation (13) can always converge to a near-optimal solution for the BA problem (8). Proof. See [3] for proof. Theorem 2 proves the correctness and robustness of the proposed initialization. Moreover, from the viewpoint of computational complexity the pose-graph problem (13) exhibits much reduced variable size than (8) and the expensive feature retraction operation is also avoided.
PMBA with Improved Global Initialization
631
It is interesting that Theorem 2 provides a theoretical assurance that partitioning the BA problem into a pose-graph initialization step and a full BA step is a sound approach. In fact, the separation strategy is well known in SLAM systems. In [29] a SLAM problem with range and bearing observations is shown to exhibit a separable structure: given orientation, robot and feature positions are linear in the corresponding error function. The separable structure is further exploited in [30] to form an efficient iterative solver with better convergence. Undoubtedly, visual SLAM is far more complex where depth information is not readily observable. Our proposed initialization as well as other algorithms [17–19] all intrinsically apply the separation strategy to simplify the complex BA problem. Remark 7. Here we do not claim the proposed global initialization is the best one but it is very compatible to PMBA. The pose-graph problem includes all feature observations in its objective function, hence contain sufficient information to handle collinear motions. Further, it does not require strong triplet image association as in [17]. Remark 8. Note that the proposed method is friendly to robust methods such as pseudo Huber, L1 -norm or outlier detection technique. Further, the non-linear model is still formulated in a probabilistic framework, different from the “Linear Global Translation Estimation” reported in [19].
4 4.1
Evaluation on PMBA Performance Simulation
We demonstrate PMBA’s ability to handle problematic features with a simple simulation test. The scene consists of 4 poses and 10 features, two of which are problematic, as shown in Fig. 7(a). One problematic feature is a far feature, the other initialized with values that would cause singularity in the original PBA algorithm. The BAs under comparison are: XYZ-BA, PBA and PMBA. We run 4 iterations for each BA and collect their Hessian’s. At the end we gather BA estimates deviation from ground truth. The results are listed in Table 3. One can see that PMBA has good Hessian condition numbers and accurate optimized estimates, PBA and XYZ-BA show consistently large condition numbers and high errors. This confirms Theorem 1 that This can be explained by our Theorem 1 that the Hessian in PMBA does not exhibit Hessian singularity yet other BAs can.
632
L. Liu et al.
(a) Simulation with problem features
(b) Comparison of the error from PMBA, PBA and XYZ-BA
Fig. 7. Compare three BA forms in a simulated scene with problem features Table 3. Comparison of HFF ’s condition number during optimization and final state error for PMBA, PBA and XYZ-BA Convergence properties PMBA
PBA
XYZ-BA
Iter-0 Iter-1 Iter-2 Iter-3
1.46E+4 1.46E+4 1.46E+4 1.46E+4
1.22E+94 1.22E+94 1.22E+94 3.53E+95
cond(HFF ) cond(HFF ) cond(HFF ) cond(HFF )
Final χ2error
4.2
9.74 5.68 8.80 5.74
2.58E−3 5.37E−2 3.43E−2
Large Dataset Test
We conducted a series of real datasets to compare performance of the proposed PMBA (8) and original PBA, IDP and XYZ, aiming to address following questions: • Robustness. With degeneracy scenario disappearing, can DL be safely used in PMBA implementation? • Efficiency. If DL were applied for PMBA, how fast can the optimization be? • Accuracy. Since the PMBA formulation employs a different error function (7). Is the global minimum accurate? All methods are tested against six very challenging datasets, which are also accessible from OpenSLAM1 . In particular, • Fake-pile is collected by the Google tango tablet in normal lab environment with a fake bridge pile in the middle, showing close and far features. • Malaga [21] is a classic street view dataset. It is collected using an electric car equipped camera facing the road, consisting of many collinear features. • Village and College are aerial photogrammetric datasets. The low feature to observation ratio implies existence of many small parallax features. • Usyd-Mainquad-2 and Victoria-cottage are collected at University of Sydney campus, full of far features. See [3] for reconstruction results. 1
https://svn.openslam.org/data/svn/ParallaxBA/.
PMBA with Improved Global Initialization
633
Table 4. Comparison of convergence performance for PMBA, PBA, XYZ-BA, IDP-BA Dataset
Fake-pile
Malaga
Village
College
Type
# Pose/# Feat/# Obsv
# Equation solve/# Final Chi2 Time [sec] Iteration
PMBA 135
9/9
1.7E+2
0.7
PBA
/12,741
23/23
1.7E+2
1.9
IDP
/53,878
6.0
104/102
1.7E+2
XYZ
116/108
1.2E+3
4.7
PMBA 170
44/31
9.1E+3
21.6
PBA
/305,719
64/47
9.1E+3
35.4
IDP
/779,268
230/170
5.8E+5
93.8
XYZ
110/85
3.3E+5
39.0
PMBA 90
12/12
3.3E+4
31.8
PBA
/305,719
13/13
3.3E+4
36.0
IDP
/779,268
19/19
3.3E+4
35.2
XYZ
18/18
3.3E+4
26.3
PMBA 468
33/33
1.1E+6
334.4
PBA
/1,236,502 31/31
1.1E+6
370.5
IDP
/3,107,524 34/34
1.1E+6
255.3
295/193
1.0E+7
1361.0
19/16
1.1E+6
70.5
88/66
1.2E+6
301.4
XYZ Victoria cottage PMBA 400 PBA
/153,632
IDP
/890,057
XYZ Usyd-Mainquad PMBA 424
49/48
1.1E+6
157.9
47/44
1.2E+6
124.3
25/25
2.4E+6
214.5
101/57
3.6E+6
642.6
4.6E+6
1994.7
2.8E+6
423.7
PBA
/227,615
IDP
/1,607,082 301/191
XYZ
76/58
We set all BAs from the same starting point use the imperfect initialization method from [10] to observe iteration behaviour. We find that PBA, IDP and XYZ show unstable behaviour under DL. PMBA, in comparison, has always worked well with DL. This can be explained by our prediction that PMBA has a large convergence region and is consistently well-posed. We therefore list DL results for PMBA and LM for all other BA’s. We implement all BAs in C++ and use Ceres-solver [4] as the optimization engine. All BAs are tested on an Intel-i7 CPU running one thread. We use ray direction cost function for PMBA, and compute its corresponding uv-based Chi2 error at each iteration step with current estimate, to compare with other BAs on a common error metric. This scheme is not fair for PMBA, yet is the only convincing way to evaluate performance amongst all methods. Despite of this treatment, we found PMBA the best performer in all tests, consistent with our expectation.
634
L. Liu et al.
(a) Malaga
(b) Usyd-Mainquad-2
Fig. 8. Convergence plots for PMBA, PBA, IDP and XYZ
Selected convergence plots are shown in Fig. 8, more can be found in [3]. All collected metrics are summarized in Table 4. Further, for the Malaga dataset which is full of problematic features (Fig. 1), we observe that the PMBA estimates and Ground Truth are very close, yet conventional BA gives significant error. This is also seen in Table 4, conventional BAs converge to a local minimum, whereas both PBA and PMBA can converge to their respective global minimums. Figure 8 and Table 4 confirm that the error function (7) is practical, consistent with the claim in [28]. In conclusion, these experiments all give positive answers to the raised questions. 4.3
Evaluation of PMBA Global Initialization
Finally, we conduct tests to verify our initialization strategy. We use datasets from the “Bundle Adjustment in the Large” (BAL) database2 [5] and the datasets in Sect. 4.2. We implement a PMBA-based SfM pipeline complying to the procedure in Fig. 5 in C++, using Ceres [4] as the optimization engine. For comparison, we run same tests on an incremental pipeline similar to Bundler, also written in C++ using Ceres [4]. These datasets are selected for showing street scene (Ladybug-1370), diverse proximity scene (Trafalgar-126, Venice-427) or photometric aerial scene (College), all exposing challenges for conventional BA. Since camera calibration is beyond the scope of this activity, we apply the reported optimal camera settings from BAL and PBA websites and only test undistorted versions of these data. We stress that our initial pose and feature values are purely generated from the rotation averaging and translation registration methods described in Sect. 3, without using the initial values provided by [5]. The performance comparison results are shown in Table 5. Here the column labeled “Ours” is the proposed CLS-NLS-PMBA pipeline, the column labeled 2
http://grail.cs.washington.edu/projects/bal/.
PMBA with Improved Global Initialization
635
Table 5. Complete pipleline comparison Number of BA’s Dataset
Order
Ladybug-1370 Sequential
Time [min]
Num poses Ours Incremental Ours Incremental 1370
1
394
3.33
65
Trafalgar-126
Out-of-order 126
1
25
1.01
1.1
Venice-427
Out-of-order 427
1
49
6.62
17.4
College
Sequential
1
238
9.63
85.43
468
“Incremental” refers to the incremental pipleline. Both incremental and our pipeline give similar outputs, we therefore only tabulate the timing information. Figure 9 illustrates our pipeline results in blue, red color shows BAL results or PBA results from [10] for the College dataset. The red and blue data are almost identical. We also give detailed reconstruction results at various stages of our pipeline in [3]. The results in Table 5 shows that our global SfM pipeline uses less computation time and BA invocations than the incremental method in all tests. This result together with the pipeline output plots in [3] confirm our proposed initialization strategy is viable.
Fig. 9. Reconstruction results of full PMBA pipeline on test datasets
636
5
L. Liu et al.
Conclusion
In this work, we proposed a new bundle adjustment formulation – PMBA which utilizes parallax angle based on-manifold feature parametrization and observation-ray based objective function. We proved that under the new formulation the ill-conditioned cases due to problematic features can be avoided without any manual intervention, which results in much better convergence and robustness properties. Furthermore, motivated by the separable structure in the visual SLAM problem and ease of parallax feature initialization, we derived a novel global initialization process for PMBA. We use a simplified pose-graph model that can guarantee a near-optimal solution to bootstrap the original BA problem. Experimental results show that the proposed initialization can provide efficient and accurate estimates and is a viable global initialization strategy for many challenging situations including sequential and out-of-order images. The promising results of the global initialization plus PMBA pipeline using publicly available datasets demonstrate that the proposed technique can deal with different challenging data. In the future, we are planning to integrate the proposed pipeline with efficient visual SLAM front-end to develop a robust and efficient SfM system.
References 1. Triggs, B., McLauchlan, P., Hartley, R., Fitzgibbon, A.: Bundle adjustment – a modern synthesis. In: Triggs, W., Zisserman, A., Szeliski, R. (eds.) Vision Algorithms: Theory and Practices, vol. 1883, pp. 298–375. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-44480-7 21 2. Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., Reid, I., Leonard, J.: Past, present, and future of simultaneous localization and mapping: towards the robust-perception age. IEEE Trans. Rob. 32(6), 1309–1332 (2016). https://doi.org/10.1109/TRO.2016.2624754 3. Liu, L., Zhang, T., Liu, Y., Leighton, B., Zhao, L., Huang, S., Dissanayake, G.: Supplementary Material to: Parallax bundle adjustment on manifold with improved global initialization. University of Technology Sydney (2018) 4. Agarwal, S., Mierle, K.: Ceres Solver. http://ceres-solver.org 5. Agarwal, S., Snavely, N., Seitz, S.M., Szeliski, R.: Bundle adjustment in the large. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) Computer Vision – ECCV 2010, vol. 6312, pp. 29–42. Springer, Heidelberg (2010). http://grail.cs.washington.edu/ projects/bal 6. Rosen, D.M., Kaess, M., Leonard, J.J.: RISE: an incremental trust-region method for Robust online sparse least-squares estimation. IEEE Trans. Rob. 30(5), 1091– 1108 (2014). https://doi.org/10.1109/TRO.2014.2321852 7. Lourakis, M.L., Argyros, A.A.: Is Levenberg-Marquardt the most efficient optimization algorithm for implementing bundle adjustment? In: Tenth IEEE International Conference on Computer Vision (ICCV 2005), Beijing. IEEE Press (2005). https://doi.org/10.1109/ICCV.2005.128 8. Smith, S.T.: Trust-region methods on Riemannian manifolds. Found. Comput. Math. 7(3), 303–330 (2007). https://doi.org/10.1007/s10208-005-0179-9
PMBA with Improved Global Initialization
637
9. Civera, J., Davison, A., Martinez Montiel, J.: Inverse depth parametrization for monocular slam. IEEE Trans. Rob. 24(5), 932–945 (2008). https://doi.org/10. 1109/TRO.2008.2003276 10. Zhao, L., Huang, S., Sun, Y., Yan, L., Dissanayake, G.: ParallaxBA: bundle adjustment using parallax angle feature parametrization. Int. J. Robot. Res. 34(4–5), 493–516 (2015). https://doi.org/10.1177/0278364914551583 11. Kuemmerle, R., Grisetti, G., Strasdat, H., Konolige, K., Burgard, W.: G2o: a general framework for graph optimization. In: 2011 IEEE International Conference on Robotics and Automation (ICRA). IEEE Press (2011). https://doi.org/10.1109/ ICRA.2011.5979949 12. Snavely, N., Seitz, S., Szeliski, R.: Photo tourism: exploring image collections in 3D. ACM Trans. Graph. (TOG) 25(3), 835–846 (2006). https://doi.org/10.1145/ 1141911.1141964 13. Carlone, L., Tron, R., Daniilidis, K., Dellaert, F.: Initialization techniques for 3D SLAM: a survey on rotation estimation and its use in pose graph optimization. In: 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle. IEEE Press (2015). https://doi.org/10.1109/ICRA.2015.7139836 14. Kneip, L., Siegwart, R., Pollefeys, M.: Finding the exact rotation between two images independently of the translation. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) Computer Vision – ECCV 2012, vol. 7577, pp. 696– 709. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33783-3 50 15. Kneip, L., Furgale, P.: OpenGV: a unified and generalized approach to real-time calibrated geometric vision. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong. IEEE Press (2014). https://doi.org/10. 1109/ICRA.2014.6906582 16. Crandall, D., Owens, A., Snavely, N., Huttenlocher, D.: Discrete-continuous optimization for large-scale structure from motion. In: International Conference on Computer Vision and Pattern Recognition (CVPR), Colorado. IEEE Press (2011). https://doi.org/10.1109/CVPR.2011.5995626 17. Jiang, N., Cui, Z., Tan, P.: A global linear method for camera pose registration. In: 2013 IEEE International Conference on Computer Vision (ICCV), Sydney. IEEE Press (2013). https://doi.org/10.1109/ICCV.2013.66 18. Wilson, K., Snavely, N.: Robust global translations with 1DSfM. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014, vol. 8691, pp. 61–75. Springer, Cham (2014). https://doi.org/10.1007/978-3-31910578-9 5 19. Cui, Z., Jiang, N., Ping, T.: Linear global translation estimation from feature tracks. In: Proceedings of the British Machine Vision Conference (BMVC), pp. 46.1–46.13. BMVA Press (2015). https://doi.org/10.5244/C.29.46 20. Zhang, T., Wu, K., Song, J., Huang, S., Dissanayake, G.: Convergence and consistency analysis for a 3-D invariant-EKF SLAM. IEEE Robot. Autom. Lett. 2(2), 733–740 (2017). https://doi.org/10.1109/LRA.2017.2651376 21. Blanco, J., Moreno, F., Gonzalez, J.: A collection of outdoor robotic datasets with centimeter-accuracy ground truth. J. Auton. Robot. 27(4), 327 (2009). https:// doi.org/10.1007/s10514-009-9138-7 22. Wu, C.: VisualSFM: A visual structure from motion system. http://ccwu.me/ vsfm/ 23. Zhang, H., Hasith, K., Wang, H.: A hybrid feature parametrization for improving stereo-SLAM consistency. In: 2017 13th IEEE International Conference on Control Automation (ICCA), Ohrid. IEEE Press (2017). https://doi.org/10.1109/ICCA. 2017.8003201
638
L. Liu et al.
24. Dellaert, F.: Factor graphs and GTSAM: A hands-on introduction. Georgia Institute of Technology. https://bitbucket.org/gtborg/gtsam/ 25. Mur-Artal, R., Montiel, M., Tard´ os, J.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Rob. 31(5), 1147–1163 (2015). https:// doi.org/10.1109/TRO.2015.2463671 26. Carlone, L., Kira, Z., Beall, C.: Eliminating conditionally independent sets in factor graphs: a unifying perspective based on smart factors. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong. IEEE Press (2014). https://doi.org/10.1109/ICRA.2014.6907483 27. Lee, C., Yoon, K.: Exploiting feature confidence for forward motion estimation. CoRR, abs/1704.07145 (2017). http://arxiv.org/abs/1704.07145 28. Im, S., Ha, H., Rameau, F., Jeon, H., Choe, G., Kweon, I.: All-around depth from small motion with a spherical panoramic camera. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016, vol. 9907, pp. 156–172 (2016). Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9 10 29. Huang, S., Lai, Y., Frese, U., Dissanayake, G.: How far is SLAM from a linear least squares problem? In: 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Taipei. IEEE Press (2010). https://doi.org/10.1109/ IROS.2010.5652603 30. Khosoussi, K., Huang, S., Dissanayake, G.: A sparse separable SLAM back-end. IEEE Trans. Rob. 32(6), 1536–1549 (2016). https://doi.org/10.1109/TRO.2016. 2609394
Practical Exponential Coordinates Using Implicit Dual Quaternions Neil T. Dantam(B) Colorado School of Mines, Golden, CO, USA [email protected]
Abstract. Modern approaches for robot kinematics employ the product of exponentials formulation, represented using homogeneous transformation matrices. Quaternions over dual numbers are an established alternative representation; however, their use presents certain challenges: the dual quaternion exponential and logarithm contain a zero-angle singularity, and many common operations are less efficient using dual quaternions than with matrices. We present a new derivation of the dual quaternion exponential and logarithm that removes the singularity, and we show an implicit representation of dual quaternions offers analytical and empirical efficiency advantages compared to both matrices and explicit dual quaternions. Analytically, implicit dual quaternions are more compact and require fewer arithmetic instructions for common operations, including chaining and exponentials. Empirically, we demonstrate a 25%–40% speedup to compute the forward kinematics of multiple robots. This work offers a practical connection between dual quaternions and modern exponential coordinates, demonstrating that a quaternion-based approach provides a more efficient alternative to matrices for robot kinematics.
1
Introduction
Efficient geometric computations are important for robot manipulation, 3D simulation, and other areas that must represent the physical world. The product of exponentials formulation, represented as homogeneous transformation matrices, has emerged as the conventional method for robot kinematics [2,14,16]. For pure rotations, the unit quaternion has recently resurged in popularity, particularly for applications in graphics and estimation where the efficient interpolation and normalization of quaternions is especially useful. It is also possible to represent both rotation and translation by extending the ordinary unit quaternion to quaternions over dual numbers [18,22]. Such dual quaternions retain the unit quaternions’ advantages of compactness and efficient normalization; however, they also present challenges. Common kinematics operations—constructing and chaining transforms—require more arithmetic instructions using dual quaternions than the equivalent transformation matrix computation. Critically, the dual quaternion exponential contains a small-angle singularity which must be handled for numerical robustness. We address these challenges and present a dual-quaternion-based representation with advantages for robot kinematics. c Springer Nature Switzerland AG 2020 M. Morales et al. (Eds.): WAFR 2018, SPAR 14, pp. 639–655, 2020. https://doi.org/10.1007/978-3-030-44051-0_37
640
N. T. Dantam
We present a new derivation of the dual number quaternion exponential and logarithm that removes the small-angle singularity, and we show that the implicit representation of a dual quaternion is more computationally-efficient for robot kinematics than homogeneous transformation matrices. The conventional representation of exponential coordinates using the homogeneous transformation matrix provides a baseline for comparison (see Sect. 3). We begin with the known forms of the ordinary quaternion exponential and logarithm (see Sect. 4.1). Then, based on dual number arithmetic and quaternion geometry, we derive the exponential and logarithm for the dual quaternions and rewrite factors to identify Taylor series that remove the singularities (see Sect. 4.2). We extend this dual quaternion exponential and logarithm to the implicit representation of a dual quaternion as an ordinary (rotation) quaternion and a translation vector, which is more compact and computationally efficient than explicit dual quaternions (see Sect. 4.3). Next, we present the application of these quaternion forms to robot kinematics, demonstrating a 25%–40% empirical performance gain over transformation matrices. Finally, we discuss issues of equivalence and efficiency between matrix and quaternion representations (see Sect. 6). All the derived forms presented in this paper are available as open source software.1 Quaternion-based forms present both challenges and advantages. A common challenge raised with quaternions is the difficulty of men= v + w tally visualizing the four-dimensi| | |v| = | | sin φ onal space of ordinary quaternions— φ or the eight-dimensional space of R dual quaternions—whereas vector w = | | cos φ and matrix approaches have a direct, 3-dimensional interpretation. Still, the planar projection of quaternions (see Fig. 1) provides some insight into the relationship between quaternion components and angles. More importantly, a growing body of work Fig. 1. The quaternion-imaginary-plane, containing axes for the scalar w and vector continues to demonstrate that ordinary and dual quaternions offer commagnitude |v|. putational advantages in a variety of domains [11,15,20]. The results of this paper are in the same vein. We demonstrate a dual-quaternion-based approach for the product of exponentials that offers computational advantages over matrices. We mitigate the challenge of visualizing quaternions by using the relations of Fig. 1 in an algebraic derivation.
I
1
Software available at http://amino.dyalab.org.
Practical Exponential Coordinates Using Implicit Dual Quaternions
641
A key technique in our derivation is to rewrite factors with singularities into forms with well-defined Taylor series which we can use for evaluation near the singularity. Grassia applies this idea to ordinary quaternions [8]. For example, the ordinary quaternion exponential contains the factor sinθ θ , which has a singularity at θ = 0. However, we can use a Taylor series to remove the singularity: sin θ 1 1 4 1 6 = 1 − θ2 + θ − θ + ... θ 6 120 5040
=⇒
sin θ =1. θ→0 θ lim
(1)
Near the singularity, we need only the initial terms of the Taylor series to evaluate the factor to within machine precision because the final terms will be smaller than the precision we can represent. For (1), we have alternating positive and negative terms of decreasing magnitude, so the error after evaluating the first i terms is no greater than the magnitude of term i + 1. We need not evaluate any additional terms with when this error is less than machine precision. For example, when θ4 is less than machine precision, we may can achieve minimum possible numerical error using only the first two terms 1 − 16 θ2 . We extend this Taylor series conTable 1. Taylor series for θ → 0 struction to the dual quaternions, which 1 have similar—though more complicated— sin θ = θ − 16 θ3 + 120 θ5 + . . . 1 4 factors containing singularities. We use cos θ = 1 − 12 θ2 + 24 θ + ... quaternion trigonometry (see Fig. 1) to sin θ 1 2 1 = 1 − 6 θ + 120 θ4 + . . . θ rewrite these factors into forms that are θ 7 = 1 + 16 θ2 + 360 θ4 + . . . defined in the limit via Taylor series. sin θ 1−cos θ 1 1 2 1 Table 1 lists several common Taylor series. = 2 − 24 θ + 720 θ4 + . . . θ2 We use the following notation. Bold uppercase R denotes a matrix. Bold lowercase v denotes a vector. An over-arrow v denotes a length-three vector over ˆ An over-hat u the basis units ˆ ı, jˆ, k. ˆ denotes a unit vector (|u| = 1). An overtilde n ˜ denotes a dual number (˜ n = nreal + ndual ε). The lowercase script h denotes an ordinary quaternion. The uppercase script S denotes a dual quaternion (S = sreal + sdual ε). We abbreviate sin and cos with s and c.
2
Related Work
Brockett connected robot kinematics with Lie groups expressed as matrix exponentials [2]. This product of exponentials formulation has become the conventional approach for robot kinematics [14,16]. Our work presents a practical connection between such exponential coordinates and quaternion-based representations, and we show that a quaternion-based representation can offer efficiency advantages compared to matrix-based representations. Quaternions provide an alternative to matrix-based geometric representations. Unit quaternions represent rotations [9] with four elements: a 3-element vector and a scalar that together encode the rotational axis, and the sine and
642
N. T. Dantam
cosine of the rotational angle. Though vector analysis became the preferred notation in many areas [1,7], quaternions have seen renewed use in recent years as a practical representation for rotation, interpolation, and estimation [8,13,15,20]. The computational advantages of quaternions in such applications suggest that a quaternion-based approach could merit investigation in other areas typically addressed using vector or matrix representations. Quaternions over dual numbers—the dual quaternion—can represent both rotation and translation [22,23]. Selig presents a modern context for dual quaternions and more broadly Clifford algebras in relation to Lie algebras [18]. Yang and Freudenstein applied dual quaternions to the analysis of spatial mechanisms (closed chains) [27]. Several recent authors have applied dual quaternions to robot kinematics [4,5,10,12,17,21,25]. Wang et al. compare dual quaternion and homogeneous matrix approaches, showing that dual quaternions are often more efficient [26]. We continue this application of dual quaternions to robot kinematics by addressing issues of singularities and numerical robustness in the dual quaternion exponential and logarithm. Though the form of the dual quaternion exponential is well established [6,19], there is, to our knowledge, no prior work that addresses the zero-angle singularity in the dual quaternion exponential and logarithm, which is necessary to practically use dual quaternions in the product of exponentials formulation. Han et al. observe, though do not address, the zero-angle discontinuity [10]. Wang et al. provide an approximation of the logarithm [25]. In this work, we present new, exact derivations of the dual quaternion exponential and logarithm that remove the zero-angle singularity, enabling the practical use of dual quaternions as exponential coordinates. Furthermore, we show that implicitly representing a dual quaternion as an ordinary quaternion and a translation vector is both more compact and more computationally efficient for common kinematics operations than either explicit dual quaternions or homogeneous transformation matrices.
3
Rotation and Transformation Matrix Maps
We briefly restate the key matrix operations for robot kinematics to compare against the quaternion forms and illustrate the Taylor series construction. 3.1
Rotation Matrix
We define the rotation matrix exponential and logarithm using the rotation vector, i.e., the rotation axis scaled by the rotation angle, because separating the axis and angle results in an undefined axis when the angle is zero and poor numerical stability when attempting to construct the unit axis for small angles [8]. The rotation matrix exponential [14] is: ⎤ ⎡ 0 −ωz ωy 1 − cos |ω| sin |ω| [ω] + [ω]2 , where [ω] = ⎣ ωz 0 −ωx ⎦ . (2) e[ω] = I + 2 |ω| |ω| −ωy ωx 0
Practical Exponential Coordinates Using Implicit Dual Quaternions
We remove the singularity at |ω| = 0 via the Taylor series in Table 1 for
643 sin|ω| |ω|
and 1−cos|ω| . |ω|2 The rotation matrix logarithm [14] is: ⎡ ⎤ r32 − r23 r11 + r22 + r33 − 1 θ ⎣ r13 − r30 ⎦ , ω = where θ = cos−1 . (3) 2 sin θ 2 r21 − r12 We remove the singularity at θ = 0 via the Taylor series in Table 1 for 3.2
θ sin θ .
Transformation Matrix
The transformation matrix exponential [14] is: ⎡ [ω] ω e [ω] + I + 1−cos|ω| |ω| exp =⎣ ν 0 1
1−
sin|ω| |ω| 2
|ω|
2
[ω]
⎤ ν ⎦
.
(4)
We remove the singularity at |ω| = 0 via the Taylor series in Table 1 for and the following:
1−cos|ω| |ω|2
1−
sin|ω| |ω| 2
2
=
|ω|
2
|ω| 1 |ω| − + + ... 6 120 5040
(5)
The transformation matrix logarithm [14] is given by:
Rv ln = 0 1 I−
ln R [ω] 2
+
2s−|ω|(1+c) 2s|ω|2
2
[ω]
v
.
(6)
We remove the singularity at |ω| = 0 via the following Taylor series: 2 sin |ω| − |ω| (1 + cos |ω|) 2
2 (sin |ω|) |ω|
4
2
=
4
|ω| |ω| 1 + + + ... 12 720 30240
(7)
Ordinary, Dual, and Implicit Quaternion Maps
Now, we present the key contribution of this work: new, singularity free forms of the dual quaternion exponential and logarithm and their corresponding forms for the implicit, quaternion-translation representation. Our derivation starts with the established ordinary quaterion exponential and logarithm (see Sect. 4.1). Then, we derive the dual quaternion forms (see Sect. 4.2) by using the quaternion trigonometry (see Fig. 1) to construct Taylor series that remove the singularities. Finally, we derive the equivalent exponential and logarithm for the more compact and efficient quaternion-translation representation (see Sect. 4.3).
644
4.1
N. T. Dantam
Ordinary Quaternions
Ordinary quaternions extend complex numbers (ˆ ı2 = −1) to three units: ˆ2 = ˆ ˆ = −1 . ˆ ı2 = jˆ2 = k ıjˆk
(8)
A quaternion, therefore, has four elements: the real term (scalar) and the ˆ (vector). We use the following coefficients of each quaternion unit ˆ ı, jˆ, and k notations for the quaternion elements: ˆ + w = v + w . h = xˆ ı + yˆ j + zk
(9)
scalar
vector v
The dot (·) and cross (×) products, though actually introduced as an alternative to quaternions [7], allow a compact definition of the quaternion multiplication (⊗): q ⊗ p = qv × pv + qw pv + pw qv + qw pw − qv · pv .
(10)
scalar
vector v
The quaternion conjugate and rotation operation are: h ∗ = −hv + hw
p = ahb ⊗ bp ⊗ ahb∗ .
a
and
(11)
The quaternion exponential is: sin |v| ev+w = ew v + cos |v| . |v|
(12)
When |v| approaches zero, we can use the Taylor series for sin|v||v| in Table 1. To compute the logarithm, we first find the angle between the vector v and scalar w parts of the quaternion. Then the logarithm is as follows: φ = atan2 (|v| , w)
and
ln h =
φ v + ln |h| . |v|
When |v| approaches zero, we handle the singularity in follows: φ = |v| 4.2
φ |h| |v| |h|
=
φ |h|
sin φ
=
φ sin φ
|h|
=
1+
θ2 6
+
7θ 4 360
+ |h|
31θ 6 15120
φ |v|
(13) by rewriting as
+ ...
.
(14)
Dual Quaternion
Dual quaternions are a compact representation that offers useful analytic properties. We briefly review the use of dual quaternions for kinematics before introducing our new derivations of the exponential and logarithm to handle the smallangle singularity. For a more thorough overview of dual quaternions for kinematics, please see [18].
Practical Exponential Coordinates Using Implicit Dual Quaternions
645
Dual quaternions combine ordinary quaternions with the dual numbers, ε, defined as: and ε = 0 . (15) ε2 = 0 A dual quaternion will thus have eight coefficients covering all combinations of the quaternion elements, dual element, and scalars. We write a dual quaternion as: ˆ + hw + dxˆ ˆ + dw ε . (16) S = h + d ε = hxˆ ı + hy jˆ + hz k ı + dy jˆ + dz k The Euclidean transformation consisting of rotation h and translation v corresponds to the following dual quaternion. 1 S = h + d ε = h + v ⊗ hε and v = 2d ⊗ h ∗ . 2 Multiplication of dual quaternions will chain successive transforms. a Sc = aSb ⊗ bSc = (ahb + adb ε) ⊗ bhc + bdc ε = ahb ⊗ ahb + ahb ⊗ bdc + adb ⊗ bhc ε .
(17)
(18)
Rewriting (18) in terms of a transformation and a point yields the dual quaternion form to transform a point. An equivalent derivation extends (11) to the dual numbers. a a Sc = aSb ⊗ 1 + bpε p = 2d + h ⊗ bp ⊗ h ∗ (19) The Taylor series for functions of dual numbers yields a useful property: all higher-order terms containing ε2 cancel to zero. 0 f (r) 0 0 f (r) f (r) * * 2 3 * (dε) (dε) . . . (dε) + + + 1! 2! 3! = f (r) + εdf (r) . (20)
f (r + dε) = f (r) +
The dual number Taylor series (20) enables evaluation of dual number functions using only the value and derivative of the real function. We summarize several relevant dual functions in Table 2.
Table 2. Dual numbers functions f (r + d) cos (r + dε) sin (r + dε) tan−1 (r + dε) exp (r + dε) ln √ (r + dε) r + dε
= = = = = = =
f (r) + εd(f (r) cos r − εd sin r sin r + εd cos r tan−1 r + r2εd +1 er + εer d ln r + dr ε √ d r + ε 2√ r
Singularity-Free Dual Quaternion Exponential. To derive a suitable form of the dual quaternion exponential, we begin by rewriting the ordinary quaternion exponential (12) over dual numbers. φ˜ = (hx + dx ε)2 + (hy + dz ε)2 + (hy + dz ε)2
˜ sin φ ˆ + cos φ˜ (21) (hx + dx ε) ˆ eS = ehw +dw ε ı + (hy + dy ε) jˆ + (hy + dz ε) k φ˜
646
N. T. Dantam
Direct evaluation of (21) must contend with the singularity (zero denomina˜ tor) in the factor sinφ˜ φ . To handle the singularity, we will algebraically expand the dual arithmetic and rewrite factors based on quaternion trigonometry into forms where we can find suitable Taylor series. ˜ First, we expand the dual quaternion angle φ. φ˜ = =
(hx + dx ε)2 + (hy + dz ε)2 + (hy + dz ε)2 hx2 + hy2 + hy2 +
hx dx + hy dy + hy dz γ ε=φ+ ε φ hx2 + hy2 + hy2
(22)
where φ is the same as the ordinary quaternion angle and γ = hv · dv . The dual sin and cos are then cos φ˜ = c −
γ sε φ
sin φ˜ = s +
and
γ cε φ
(23)
where s = sin φ and c = cos φ. ˜ Next, we expand the dual sinc function sinφ˜ φ and rearrange terms to find a suitable Taylor series to handle the singularity at φ = 0:
cos(φ) − sin(φ) sin(φ) + φγ cos(φ)ε sin φ˜ sin(φ) φ = +γ = ε φ + φγ ε φ φ2 φ˜ φ2 φ4 φ4 1 φ2 = 1− + + ... + γ − + − + ... ε . (24) 6 120 3 30 840 (cos φ−(sin φ)/φ)/φ2
(sin φ)/φ
Finally, we expand the original form of the exponential in (21):
S
e = e
hw
+ dw e ε hw
s hv + c φ
+
c − φs s s dv + γ hv − γ ε φ φ2 φ
(25)
By applying the Taylor series in (24), we can stably evaluate (25) in the neighborhood of φ = 0. Singularity-Free Dual Quaternion Logarithm. We derive the dual quaternion logarithm by expanding the ordinary form (13) with dual arithmetic.
ln S =
φ˜ (hv + dv ε) + ln m ˜ n ˜
(26)
Practical Exponential Coordinates Using Implicit Dual Quaternions
647
˜ n where φ, ˜ , and m ˜ are the dual number forms of φ, |hv |, and |h|, (respectively) from (13). The dual arithmetic expands as follows: hv · dv ε = |hv | + nd ε, n ˜ = (hx + dx ε)2 + (hy + dy ε)2 + (hy + dz ε)2 = |hv | + |hv | h ·d |hv | + nd ε ε = |h| + md ε, . (27) m ˜ = |h| + and φ˜ = tan−1 |h| hw + dw ε Further expanding φ˜ via the dual Taylor series for tan−1 (see Table 2):
hw nd − |hv | dw hw nd − |hv | dw ˜ ε=φ+ ε . (28) φ = atan2 (|hv | , hw ) + 2 2 |h| |h| ˜
Next, we consider the dual nφ˜ . For this step we take guidance from the quaternion trigonometry (see Fig. 1) to rewrite factors as trigonometric functions for which we can find well-defined Taylor series. Specifically, we know that: φ = atan2 (hw , |hv |)
and
hw = cos φ |h|
and
|hv | = sin φ . |h|
(29)
˜
We expand the dual arithmetic and reorder nφ˜ :
hw nd −|hv |dw ε φ + ˜ 2 φ φ hw nd φnd dw |h| = = + ε. 2 − 2 − 2 n ˜ |hv | + nd ε |hv | |hv | |h| |hv | |h|
(30)
Equation (30) contains a singularity where |h| = 0. We evaluate the term |hφv | as in (14). We rewrite the larger term in the dual coefficient as follows:
2 3 hw nd φnd hw φ φ |h| γ hw |h| − = 3 . (31) 2 − 2 =γ 2 2 − 3 3 |h| |hv |2 |hv | |h| |hv | |hv | |h| |hv | |h| |hv | where γ = hv · dv . Then, we substitute the trigonometric functions and produce the corresponding Taylor series: cos φ γ 17 4 2 1 2 φ γ − φ φ − + . . . . (32) − − = 3 3 3 5 420 sin2 φ sin3 φ |h| |h| Now that we have identified Taylor series to handle the singularities, we have the full dual quaternion logarithm: ⎞ ⎛ hv · dv α − dw φ hv + φ dv + h · d ⎠ ε ln S = hv + ln |h| + ⎝ 2 2 |hv | |hv | |h| |h| φ2 φ4 2 2 φ − − − + . . . hw − |hv | |h| 3 5 420 where α = = 2 |h| |hv |
(33)
648
4.3
N. T. Dantam
Quaternion-Translations as Implicit Dual Quaternions
Just as we may represent transformations with a rotation matrix and translation vector—i.e., the homogeneous transformation matrix—we can also represent transformations with a rotation quaternion and translation vector. The quaternion-translation form offers computational advantages: it consists of only seven elements and chaining requires fewer operations than both the dual quaternion and matrix forms. However, because chaining is no longer a multiplication, as with dual quaternions or matrices, analysis of quaternion-translation kinematics is more complicated, particularly for differential cases involving finding derivatives or in integrating transforms. We address the analytic challenge of the quaternion-translation form by reinterpreting quaternion-translations as implicit dual quaternions, or alternately stated, by adopting an in-memory representation of dual quaternions as a quaternion-translation. The implicit dual quaternion combines the analytic convenience of dual quaternions and the computational efficiency the quaternion-translation representation. The quaternion-translation form stores separately the rotation quaterion h and translation vector v , eliminating the coupling of rotation and translation in the dual part of the dual quaternion: ⎧ ⎫ 1 rewrite ⎪ ⎪h ⎪ ⎪ (34) h + v ⊗ hε ⎩ ⎭ v 2 explicit dual quaternion
implicit dual quaternion
To transform a point, we first apply the rotation, then add the translation— the same operations performed by the homogenous transformation matrix: ∗
p = ahb ⊗ bp ⊗ (ahb ) + avb
a
(35)
Implicit Exponential. We derive the exponential for the implicit dual quteration starting with (25), extracting the translation, and finally identifying Taylor series. First, we simplify (25) to the pure case, i.e., zero scalar part: c − φs s s s exp ( ω + ν ε) = ω +c + ν + γ ω− γ ε φ φ φ2 φ √ where γ = ω · ν and φ = ω ·ω (36) Next, we extract the translation from the dual part. ⎧ ⎫ ⎧ ⎫ ⎪ s ⎪ ω + c ⎪ ⎪ φ ⎪ ⎪h ⎪ ⎪ ∗ ⎪ exp ˘ ( ω + ν ε) = ⎪ s ⎪ ⎪ ⎩ ⎪ ⎭=⎪ c− ⎪ ⎪ s s v ⎩2 s ν + 2φ γ ⎭ ω − γ ⊗ ω + c φ φ φ φ
(37)
In (37), we may evaluate the rotation part h as in the ordinary quaternion case. For the translation part v , we first algebraically simplify: ∗ c − φs s s s v = 2 ν + γ ⊗ ω + c γ ω − (38) φ φ2 φ φ
Practical Exponential Coordinates Using Implicit Dual Quaternions
c(c − φs ) + s2 s2 cs = 2 − 2 ν × ω + ν + γ ω φ φ φ2
649
(39)
Then, we simplify the trigonometric factors, and we identify the common subexpressions.
2 − c 2s s 2s 2s φ ω × ν + c ν + v = γ ω (40) φ φ φ φ2 Using the Taylor series from Table 1 and for the new factor, we obtain: ⎫ ⎧ ⎧ ⎫ ⎪ s ⎪ ω +c ⎪ ⎪ h φ ⎪ ⎪ ⎪ ⎪ ⎪ 2−c 2s exp ˘ ( ω + ν ε) = ⎪ ⎪ ⎪ ⎩ ⎪ ⎭=⎪ ⎪ ⎪ φ v ⎭ ⎩ 2s hv × ν + c 2s ν + ( ω · ν ) ω φ φ φ2 s φ2 φ4 =1− + + ... φ 6 120
and
2 − c 2s φ φ2
=
8φ4 4 4φ2 − + + ... 3 15 315
(41)
Implicit Logarithm. We derive the implicit logarithm starting with (33), substituting the translation vector, and finally identifying suitable Taylor series. We begin with the dual quaternion logarithm (33): ⎧ ⎫ 1 h⎪ ⎪ ˘ ⎪ ⎪ (42) ln ⎩ ⎭ = ω + ν ε = ln h + v ⊗ hε v 2 The real part ω of the implicit logarithm is identical to the dual quaternion case (33). We assume a unit quaternion |h| = 1, so the scalar part of the logarithm is zero. ⎧ ⎫ φ φ h⎪ ˘⎪ ⎪ hv = hv ln =ω = (43) ⎩ ⎪ ⎭ v |h | sin φ v real For the dual part ν , we expand (33), simplifying for the unit case |h| = 1: ⎧ ⎫ φ h⎪ ˘⎪ ⎪ dv + γ + hw dw = ν = (γα − dw ) hv + ln ⎩ ⎪ ⎭ v |h v| dual 1 1 1 where dv = v × hv + hwv and dw = − v · hv (44) 2 2 2 Substituting for dual part d in terms of translation v , we simplify to:
hw φ − 1 1 s hv + hw φ v + φ v × hv ν = − v · hv 2 s2 2s 2s Noting that hw = cos φ and ω = sinφ φ hv , we further simplify to:
1 − c φs v v φ v ν = ω +c ·ω + ×ω 2 φ2 s 2 2
(45)
(46)
650
N. T. Dantam
Finally, we identify the Taylor series to obtain the implicit logarithm as follows: ⎧ ⎫ h⎪ ˘⎪ ⎪ ln + ν ε ⎩ ⎪ ⎭=ω v
1 − c φs cφ v v v φ ·ω + ×ω ε ω + = hv + sin φ 2 φ2 s 2 2 cφ φ2 φ4 = 1− − − ... s 3 45
5
and
1 − cφ φ4 1 φ2 s + + ... = + 2 φ 3 45 945
(47)
Application to Kinematics
Both matrix and quaterion representations may be used to compute the forward kinematics of robot manipulators. We compare the different representations and show that the quaternion-translation offers the best forward kinematics performance. Mathematically, quaternion-translations require the fewest arithmetic operations, and in our empirical evaluation, quaternion-translations require the shortest execution time. Tables 3 and 4 compare operations for quaternion and matrix forms. Table 5 summarizes the construction of transformations for single degreeof-freedom joints using matrix and quaternion forms. We use the known axis of joints to simplify construction over the general-case exponential. The result shows that quaternion-translations require the fewest arithmetic operations. Fig. 2 presents an empirical comparison of forward kinematics performance in terms of speedup over the baseline matrix representation. The quaterniontranslation shows the best empirical performance, consistent with the operation counts in Tables 3, 4, and 5. Additionally, the explicit dual quaternion also offers slightly better performance than matrices in our tests. Even though matrices require fewer arithmetic operations to construct and chain, several other advantages of the dual quaternions lead to the improved performance. Dual quaternions are more compact than matrices, which reduces necessary data shuffling, and quaternions require fewer operations for the exponential and rotation chaining, which are heavily used in robots with many revolute frames.
6
Discussion
We often have the choice of a matrix or quaternion form for any particular application; both will produce a mathematically-equivalent result, but the computational efficiency will differ. For example, interpolation is commonly regarded as a key application area for quaternions; however, we can achieve the same result—at
Practical Exponential Coordinates Using Implicit Dual Quaternions
651
Table 3. Requirements for storage, chaining, and point transformation. Quaternionbased representations are more compact than matrices. Ordinary quaternions and quaternion-translations are most efficient for chaining rotations and transformations, respectively. Matrices are most efficient for rotating and transforming points. Chain Rot. Rotation Matrix Quaternion Tf.
Rot./Tf.
Normalize
Storage Mul. Add Mul. Add Mul. Add Other
Representation
Transformation Matrix
9
27
18
9
6
27
15
4
16
12
15
15
8
3
sqrt(3) sqrt
12
36
27
9
9
27
15
Dual Quaternion
8
48
40
28
28
12
3
sqrt
sqrt(3)
Quaternion-Translation
7
31
30
15
18
8
3
sqrt
Table 4. Exponential and logarithm operation counts. Ordinary and dual quaternions are more efficient than their matrix equivalents. The quaternion-translation costs are between the matrix and dual-quaternion. Exponential
Logarithm
Representation
Mul. Add Other
Mul. Add Other
Rot. Rot. Matrix
17
15
Quaternion
9
Unit Q. Tf.
sqrt, sincos
5
7
2
sqrt, sincos, exp
8
3
sqrt, atan2 sqrt(2), atan2, ln
7
2
sqrt, sincos
7
2
sqrt, atan2
Tf. Matrix
39
34
sqrt, sincos
31
32
sqrt, atan2
Dual Quat.
sqrt, sincos, exp 22
11
31
12
Unit Dual Q. 19
8
sqrt, sincos
18
9
sqrt, atan2
sqrt(2), atan2, ln
Quat.-Trans. 28
15
sqrt, sincos
28
16
sqrt, atan2
greater computational cost—using rotation matrices. Spherical linear interpolation (SLERP) [20] interpolates from an initial to final orientation with constant rotational axis and linearly-varying angle. The algebraic form of SLERP [3] has a direct matrix equivalent: ∗ h(τ ) = h(0) ⊗ exp τ ln (h(0)) ⊗ h(1)
(48)
−1 R(τ ) = R(0) exp τ ln (R(0)) R(1)
(49)
where h(0), R(0) is the initial orientation and h(1), R(1) is the final orientation. Both (48) and (49) equivalently interpolate orientation. However, the quaternion form (48) is more efficient to compute than the matrix form, and the more commonly used geometric simplification of (48) is even more efficient [20]. Similarly, ordinary and dual quaternions provide computational advantages for the blending or averaging of rotations and transformations [11,15], which was described by Wahba [24] as optimal rotation based on a set of weighted observations.
652
N. T. Dantam
Table 5. Single degree-of-freedom joint transforms and operation counts. The quaternion-translation representation is most efficient to construct.
Frame Counts Robot Fixed Revolute
1
1 UR10
Jaco
Baxter UR10 Jaco
1.13 1.4
Baxter
1.09 1.26 1 1.05 1.29
73 7 7
15 6 12
Baxter Tf. Matrix Dual Quat. Quat.-Trans.
UR10 Jaco
Fig. 2. Forward kinematics speedup (higher is better), demonstrating a 25%–40% performance improvement using quaternion forms. We compare the execution time to compute the forward kinematics for the Rethink Baxter, Universal Robots UR10, and Kinova Jaco manipulators using transformation matrices, dual quaternions, and ) over the transforquaternion-translations. The results are shown as speedup ( tbaseline tnew R Xeon E3-1275 v6 using mation matrix case. The tests were conducted on an Intel the kinematics implementations in the Amino (http://amino.dyalab.org) library.
Practical Exponential Coordinates Using Implicit Dual Quaternions
653
Generally, the matrix and quaternion representations of rotation and Euclidean transformation share group structure. Just as the rotation and transformation matrices form Lie groups with associated Lie algebras based on the exponential, so too do the ordinary and dual quaternions form Lie groups and associated algebras. We can map every quaternion representation to matrix equivalent. Specifically, there is a surjective homomorphism (double-cover) from the ordinary unit quaternions to the special orthogonal group SO(3) of rotation matrices. Similarly, we have a surjective homomorphism from the dual unit quaternions to the special Euclidean group SE(3) of homogeneous transformation matrices. The results we have presented continue the broader developments of methods based on ordinary and dual quaternions which offer computational advantages over their matrix counterparts. The quaternion methods we have presented achieve mathematically-equivalent results, but are more compact and efficient, than the matrices.
7
Conclusion
We have presented new derivations of the dual quaternion exponential and logarithm which handle the small-angle singularity and enable the use of dual quaternions within the product of exponentials formulation of robot kinematics. By extending our singularity-robust exponential and logarithm to the implicit representation of dual quaternions as an ordinary quaternion and translation vector, we demonstrate a 25%–40% performance improvement in kinematics computation over the conventional homogeneous transformation matrices. Our implementation is available as open source code2 . These results show that dual quaternion representations provide the same capabilities as transformation matrices and offer computational advantages which may be especially useful for resourceconstrained systems. Even though matrices are a widely-used representation for Euclidean transformations, the quaternion forms are both more compact and—for most cases— require fewer arithmetic operations. In the one case where matrices have an efficiency advantage—transforming large numbers of points—it may still be more efficient to chain transformations via quaternions and then convert the final transform to a matrix to apply to the point set. We hope these derivations of singularity-free exponentials and logarithms for the quaternion forms of transformations will enable widespread use of these more efficient representations.
References 1. Altmann, S.L.: Hamilton, Rodrigues, and the quaternion scandal. Math. Mag. 62(5), 291–308 (1989) 2. Brockett, R.W.: Robotic manipulators and the product of exponentials formula. In: Mathematical Theory of Networks and Systems, pp. 120–129. Springer (1984) 2
Software available at http://amino.dyalab.org.
654
N. T. Dantam
3. Dam, E.B., Koch, M., Lillholm, M.: Quaternions, Interpolation and Animation. Technical report DIKU-TR-98/5, Datalogisk Institut, Københavns Universitet Copenhagen, July 1998 4. Dantam, N.T., Amor, H.B., Christensen, H., Stilman, M.: Online camera registration for robot manipulation. In: International Symposium on Experimental Robotics, pp. 179–194. Springer (2014) 5. Dantam, N.T., Amor, H.B., Christensen, H., Stilman, M.: Online multi-camera registration for bimanual workspace trajectories. In: International Conference on Humanoid Robots, pp. 588–593. IEEE (2014) 6. Funda, J., Paul, R.P.: A computational analysis of screw transformations in robotics. IEEE Trans. Robot. Autom. 6(3), 348–356 (1990) 7. Gibbs, J.W.: Elements of Vector Analysis: Arranged for the Use of Students in Physics. Tuttle, Morehouse & Taylor, New Haven (1884) 8. Grassia, F.S.: Practical parameterization of rotations using the exponential map. J. Graph. Tools 3(3), 29–48 (1998) 9. Hamilton, W.R.: Elements of Quaternions. Longmans, Green, & Company, Harlow (1866) 10. Han, D.-P., Wei, Q., Li, Z.-X.: Kinematic control of free rigid bodies using dual quaternions. Int. J. Autom. Comput. 5(3), 319–324 (2008) ˇ ara, J., O’Sullivan, C.: Geometric skinning with approxi11. Kavan, L., Collins, S., Z´ mate dual quaternion blending. ACM Trans. Graph. (TOG) 27(4), 105 (2008) 12. Kenwright, B.: A beginners guide to dual-quaternions: what they are, how they work, and how to use them for 3D character hierarchies. In: WSCG International Conference on Computer Graphics, Visualization and Computer Vision, October 2012 13. LaViola, J.J.: A comparison of unscented and extended Kalman filtering for estimating quaternion motion. In: Proceedings of the 2003 American Control Conference, vol. 3, pp. 2435–2440. IEEE (2003) 14. Lynch, K.M., Park, F.C.: Modern Robotics: Mechanics, Planning, and Control. Cambridge University Press, Cambridge (2017) 15. Markley, F.L., Mortari, D.: Quaternion attitude estimation using vector observations. J. Astronaut. Sci. 48(2), 359–380 (2000) 16. Murray, R.M.: A Mathematical Introduction to Robotic Manipulation. CRC Press, Boca Raton (1994) ¨ ur, E., Mezouar, Y.: Kinematic modeling and control of a robot arm using unit 17. Ozg¨ dual quaternions. Robot. Auton. Syst. 77, 66–73 (2016) 18. Selig, J.M.: Geometric Fundamentals of Robotics. Springer, New York (2004) 19. Selig, J.M.: Exponential and Cayley maps for dual quaternions. Adv. Appl. Clifford Algebras 20(3–4), 923–936 (2010) 20. Shoemake, K.: Animating rotation with quaternion curves. In: ACM SIGGRAPH Computer Graphics, vol. 19, pp. 245–254. ACM (1985) 21. Srivatsan, R.A., Rosen, G.T., Mohamed, D.F.N., Choset, H.: Estimating SE(3) elements using a dual quaternion based linear Kalman filter. In: Robotics: Science and Systems (2016) 22. Study, E.: Geometrie der Dynamen. Druck und Verlag von B. G. Teubner, Leipzig (1903) 23. Study, E.: Foundations and goals of analytical kinematics. In: Berlin Mathematical Society (1913). Translated by Delphenich, D.H 24. Wahba, G.: A least squares estimate of satellite attitude. SIAM Rev. 7(3), 409 (1965)
Practical Exponential Coordinates Using Implicit Dual Quaternions
655
25. Wang, X., Han, D., Yu, C., Zheng, Z.: The geometric structure of unit dual quaternion with application in kinematic control. J. Math. Anal. Appl. 389(2), 1352–1364 (2012) 26. Wang, X., Zhu, H.: On the comparisons of unit dual quaternion and homogeneous transformation matrix. Adv. Appl. Clifford Algebras 24(1), 213–229 (2014) 27. Yang, A.T., Freudenstein, F.: Application of dual-number quaternion algebra to the analysis of spatial mechanisms. J. Appl. Mech. 31(2), 300–308 (1964)
A Performance Analysis of Parallel Differential Dynamic Programming on a GPU Brian Plancher(B) and Scott Kuindersma Harvard University, Cambridge, MA 02138, USA brian [email protected], [email protected]
Abstract. Parallelism can be used to significantly increase the throughput of computationally expensive algorithms. With the widespread adoption of parallel computing platforms such as GPUs, it is natural to consider whether these architectures can benefit robotics researchers interested in solving trajectory optimization problems online. Differential Dynamic Programming (DDP) algorithms have been shown to achieve some of the best timing performance in robotics tasks by making use of optimized dynamics methods and CPU multi-threading. This paper aims to analyze the benefits and tradeoffs of higher degrees of parallelization using a multiple-shooting variant of DDP implemented on a GPU. We describe our implementation strategy and present results demonstrating its performance compared to an equivalent multi-threaded CPU implementation using several benchmark control tasks. Our results suggest that GPU-based solvers can offer increased per-iteration computation time and faster convergence in some cases, but in general tradeoffs exist between convergence behavior and degree of algorithm-level parallelism. Keywords: Optimization and optimal control · Motion and path planning · Differential Dynamic Programming · Parallel computing GPU
1
·
Introduction
The impending end of Moore’s Law and the rise of GPU architectures has led to a flurry of research focused on designing algorithms that take advantage of massive parallelism. This trend is prominent in the machine learning literature, and recent work in robotics has demonstrated real-time collision checking for a manipulator using custom voxel grids on an FPGA [1] and fast sample-based planning on a GPU [2–4]. For dynamic trajectory optimization, there has been a historical interest in parallel strategies [5] and several more recent efforts lend support to the hypothesis that significant computational benefits are possible [6–9]. However, there remains much to learn about the algorithmic principles that lead to improvements in dynamic robot control tasks and how large-scale parallel c Springer Nature Switzerland AG 2020 M. Morales et al. (Eds.): WAFR 2018, SPAR 14, pp. 656–672, 2020. https://doi.org/10.1007/978-3-030-44051-0_38
A Performance Analysis of Parallel DDP on a GPU
657
execution of existing algorithms compares to state-of-the-art CPU implementations. This paper aims to add to our understanding of the benefits and limitations of instruction-level and algorithm-level parallelization for a particular family of trajectory optimization algorithms based on Differential Dynamic Programming (DDP) [10]. We focus on DDP and its variants, particularly the iterative Linear Quadratic Regulator (iLQR) [11], as they have recently received increased attention due to growing evidence that online planning for model predictive control (MPC) is possible for high-dimensional robots [9,12–16]. We describe our parallel implementation strategy on a modern NVIDIA GPU and present results demonstrating its performance compared to an equivalent multi-threaded CPU implementation using benchmark trajectory optimization tasks on a quadrotor and a 7-DoF manipulator. Our results suggest that GPU-based solvers can offer faster convergence than equivalent parallelized CPU implementations in some cases, which could be important for realtime modelpredictive control applications, but performance tradeoffs exist between convergence behavior and time per iteration as the degree of algorithm-level parallelism increases. 1.1
Related Work
Prior work on parallel nonlinear optimization has broadly focused on exploiting the natural separability of operations performed by the algorithm to achieve instruction-level parallelism. For example, if a series of gradients needs to be computed for a list of static variables, that operation can be shifted from a serial loop over them to a parallel computation across them. Alternatively, if block diagonal matrices must be inverted many times by the solver, each of these instructions can be broken down into a parallel solve of several smaller linear systems. These parallelizations do not change the theoretical properties of the algorithm and therefore can and should be used whenever possible. This research has led to a variety of optimized QP solvers targeting CPUs [17,18], GPUs [19,20], and FGPAs [21,22]. These approaches have also been used to specifically improve the performance of a subclass of QPs that frequently arise in trajectory optimization problems on multi-threaded CPUs [23–26] and GPUs [27,28]. Additionally, Antony and Grant [8] used GPUs to exploit the inherent parallelism in the “next iteration setup” step of DDP. In contrast to instruction-level parallelism, algorithm-level parallelism changes the underlying algorithm to create more opportunities for simultaneous execution of instructions. In the field of trajectory optimization, this approach was first explored by Bock and Plitt [29] and then Betts and Huffman [5], and has inspired a variety of “multiple shooting” methods [30,31]. Recently, this approach has been used to parallelize both an SQP algorithm [6] and the forward [7,32] and backward passes [9] of the iLQR algorithm. Experimental results from using these parallel iLQR variants on multi-core CPUs represent the current state of the art for real-time robotic motion planning. Our work aims to add to this literature by systematically comparing the performance of an identical parallel implementation of iLQR on a modern CPU and
658
B. Plancher and S. Kuindersma
GPU to (1) better understand the performance implications of various implementation decisions that must be made on parallel architectures and (2) to evaluate the benefits and trade-offs of higher degrees of parallelization (GPU) versus a higher clock rate (CPU).
2
DDP Background
The classical DDP algorithm begins by assuming a discrete-time nonlinear dynamical system, xk+1 = f (xk , uk ), where x ∈ Rn is a state and u ∈ Rm is a control input. The goal is to find an input trajectory, U = {u0 , . . . , uN −1 }, that minimizes an additive cost function, J(x0 , U ) = f (xN ) +
N −1
(xk , uk ),
(1)
k=0
where x1 , . . . , xN are computed by integrating the dynamics from x0 . Using Bellman’s principle of optimality, the optimal cost-to-go (CTG), Vk (x), can be defined by the recurrence relation: VN (x) = f (xN )
Vk (x) = min (x, u) + Vk+1 (f (x, u)). u
(2)
DDP avoids the curse of dimensionality by optimizing over Q(δx, δu), the second order Taylor expansion of the local change in the minimization argument in (2) under perturbations, δx, δu: ⎡ ⎤T ⎡ ⎤⎡ ⎤ 1 0 QTx QTu 1 1⎣ ⎦ ⎣ Qx Qxx Qxu ⎦ ⎣δx⎦ , δx Q(δx, δu) = 2 δu δu Qu QTxu Quu Qxx = xx + fxT Vxx fx + Vx · fxx , fu + Vx · fxu , Qxu = xu + fxT Vxx
Quu = uu + fuT Vxx fu + Vx · fuu ,
Qx = x + fxT Vx ,
Qu = u + fuT Vx . (3) Following the notation used elsewhere [12], a prime is used to indicate the next timestep, and derivatives are denoted with subscripts. The rightmost terms in the equations for Qxx , Quu , and Qxu involve second derivatives of the dynamics, which are rank-three tensors. These tensor calculations are relatively expensive and are often omitted, resulting in the iLQR algorithm [11]. Minimizing Eq. (3) with respect to δu results in the following correction: δu = −Q−1 uu (Qux δx + Qu ) ≡ Kδx + κ,
(4)
which consists of an affine term κ and a linear feedback term Kδx. Substituting these terms into Eq. (3) leads to an updated quadratic model of V : Vx = Qx − K T Quu κ − Qxu κ − K T Qu Vxx = Qxx − K T Quu K − Qxu K − K T Qux .
(5)
A Performance Analysis of Parallel DDP on a GPU
659
Therefore, a backward update pass can be performed starting at the final state and iteratively applying the above computations. A forward simulation pass, using the full nonlinear dynamics, is then performed to compute a new state trajectory using the updated controls. This forward-backward process is repeated until the algorithm converges.
3 3.1
Parallelizing DDP Instruction-Level Parallelization
Since the cost (1) is additive and depends on each state and control independently, it can be computed in parallel following forward integration and summed in log(N ) operations using a parallel reduction operator. In addition, instead of computing the line search during the forward pass by sequentially reducing α, we can compute all forward simulations for a set of possible α values in parallel. Furthermore, if all simulations are computed in parallel, then the algorithm could select the “best” trajectory across all α values, rather than the first value that results in an improvement. We employ the line search criteria proposed by Tassa [33] and accept an iterate if the ratio of the expected to actual reduction, z = J − J˜ /δV ∗ (α)
where
δV ∗ (α) = −ακT Hu +
α2 T κ Huu κ, 2
(6)
falls in the range 0 < c1 < z < c2 < ∞.
(7)
Finally, since the dynamics are also defined independently on each state and control pair, the Taylor expansions of the dynamics and cost (the “next iteration setup” step) can occur in parallel following the forward pass. Since this is one of the more expensive steps, this is almost always parallelized in CPU implementations of DDP currently being used for online robotic motion planning (though the number of knot points often exceeds the number of processor cores). 3.2
Algorithm-Level Parallelization
Backward Pass. We break the N time steps into Mb equally spaced parallel blocks of size Nb = N/Mb . We compute the CTG within each block by passing information serially backwards in time, as is done in standard DDP. After each iteration we pass the information from the beginning of one block to the end of the adjacent block, ensuring that CTG information is at worst case (Mb − 1) iterations stale between the first and last block as shown in Fig. 1. Farshidian et al. [9] note that this approach may fail if the trajectory in the next iterate is far enough away from the previous iterate as the stale CTG approximations are defined in relative coordinates and are only valid locally. Therefore, they propose a linear coordinate transformation of the quadratic CTG approximation at iterate i to re-center it about the current iterate: i+1 i = Vxx , Vxx
i Vxi+1 = Vxi + Vxx (xi+1 − xi ).
(8)
660
B. Plancher and S. Kuindersma
Fig. 1. Graphical representation of the backward pass algorithm-level parallelizations.
While this process will still generally converge [34,35], in practice, some forward passes will fail to find a solution because either the CTG information was “too stale,” or the new trajectory moved too far from the previous trajectory, rendering the controls at later time steps sub-optimal. Therefore, on the next pass we use the failed iterate’s CTG approximation, as it is a less stale estimate, in and again follow Tassa [33] and add a state regularization term ρIn to Vxx the computation of Quu and Qxu to stay closer to the last successful trajectory: + ρIn ) fu + Vx · fuu Quu = uu + fuT (Vxx Qxu = xu + fxT (Vxx + ρIn ) fu + Vx · fxu .
(9)
Forward Pass. Giftthaler et al. [7] introduced Gauss-Newton Multiple Shooting, which adapts iLQR for multiple shooting through a fast consensus sweep with linearized dynamics followed by a multiple shooting forward simulation from Mf equally spaced states of Nf = N/Mf time steps as shown in Fig. 2.
Fig. 2. Graphical representation of the forward pass algorithm-level parallelizations.
This parallel simulation leads to defects d between the edges of each block and changes the one step dynamics to xk+1 = f (xk , uk ) − dk . Incorporating this change into Eqs. 2–3 results in a modified version of Q (δx, δu): Qx = Qx + fxT Vxx d
Qu = Qu + fuT Vxx d.
(10)
We also update our line search criteria to include the following: 0 < max ||dk ||1 < c3 < ∞, k
(11)
excluding any trajectories that have large defects which represent an artificial mathematical reduction in cost that is infeasible in practice.
A Performance Analysis of Parallel DDP on a GPU
661
Finally, in order to update the start states of each block, a serial consensus sweep is performed by integrating the state trajectory using the previously computed linearized dynamics (A = f_x, B = f_u) and feedback controls (K, κ), updating state k + 1 for iterate i + 1 (using a line search parameter α):

x_{k+1}^{i+1} = x_{k+1}^i + \left( A_k^i + B_k^i K_k^i \right) \left( x_k^{i+1} - x_k^i \right) + \alpha B_k^i \kappa_k^i + d_k^i.  (12)

Thus, the forward pass now becomes the serial consensus sweep followed by a parallel forward simulation on each of the M_f blocks. Parallel DDP (Algorithm 1) combines the instruction-level parallelizations, forward sweep, M_f multiple shooting intervals, and M_b backward pass blocks into a single algorithm with parametrizable levels of parallelism.
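A minimal sketch of the serial consensus sweep in Eq. (12), assuming flat row-major arrays; all names and layouts here are our illustrative choices rather than the authors' actual data structures.

// Serial consensus sweep, Eq. (12). x holds the previous iterate's (N+1)
// states of dimension n; x_next[0..n) must already hold the fixed start state.
// A (n x n), B (n x m), K (m x n), kappa (m), and d (n) are stored per step.
void consensusSweep(float* x_next, const float* x, const float* A,
                    const float* B, const float* K, const float* kappa,
                    const float* d, float alpha, int N, int n, int m) {
    for (int k = 0; k < N; ++k) {
        const float* Ak = A + k * n * n;
        const float* Bk = B + k * n * m;
        const float* Kk = K + k * m * n;
        for (int r = 0; r < n; ++r) {
            float acc = x[(k + 1) * n + r] + d[k * n + r];
            // (A_k + B_k K_k)(x_k^{i+1} - x_k^i)
            for (int c = 0; c < n; ++c) {
                float ABK = Ak[r * n + c];
                for (int j = 0; j < m; ++j)
                    ABK += Bk[r * m + j] * Kk[j * n + c];
                acc += ABK * (x_next[k * n + c] - x[k * n + c]);
            }
            // alpha * B_k * kappa_k
            for (int j = 0; j < m; ++j)
                acc += alpha * Bk[r * m + j] * kappa[k * m + j];
            x_next[(k + 1) * n + r] = acc;
        }
    }
}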
4 Implementation
Our implementations were designed to target modern multi-core CPUs and GPUs in order to take advantage of their respective parallel processing power. A multi-core CPU can be roughly viewed as a handful of modern CPUs designed to work together, often on different tasks, leveraging the multiple-instruction-multiple-data (MIMD) computing model. In contrast, a GPU is a much larger set of very simple processors, optimized for parallel computations of the same task, leveraging the single-instruction-multiple-data (SIMD) computing model. Therefore, as compared to a CPU processor, each GPU processor has many more arithmetic logic units (ALUs), but reduced control logic and a smaller cache memory (see Fig. 3).
Fig. 3. High level architecture of a single CPU and GPU processor [36].
In this work we specifically targeted NVIDIA GPUs by using the CUDA extensions to C++ and the NVCC compiler. CUDA is built around functions called kernels, which can be launched using parallel blocks of threads on the GPU. Each block is guaranteed to run all of its threads on the same processor, but the order of the blocks is not guaranteed. Each processor's limited cache is split into standard L1 cache memory and shared memory, which is managed by the programmer and accessible by all threads in the same block. Each kernel launch naturally runs sequentially, but kernels can also be placed in parallel streams.
Algorithm 1. Parallel DDP
1:  Initialize the algorithm and load in initial trajectories
2:  while cost not converged do
3:    for all M_b blocks b do in parallel                      ▷ Backward Pass
4:      for k = b_{N_b} : b_0 do
5:        d_k, (3), (9), (10) → Q_k
6:        if Q_{uu}^k is invertible then
7:          (4) → K_k, κ_k
8:          (5) → V_k and derivatives
9:        else
10:         Increase ρ, go to line 3
11:       end if
12:     end for
13:   end for
14:   for all α[i] do in parallel                              ▷ Forward Pass
15:     x̃_0[i] = x_0
16:     if M_f > 1 then                                        ▷ Consensus Sweep
17:       for k = 0 : N − 1 do
18:         x̃_k[i], (12) → x̃_{k+1}[i]
19:       end for
20:     end if
21:     for all M_f blocks b do in parallel                    ▷ Forward Simulation
22:       for k = b_0 : b_{N_f} − 1 do
23:         ũ_k[i] = u_k + α[i]κ_k + K_k(x̃_k[i] − x_k)
24:         x̃_{k+1}[i] = f(x̃_k[i], ũ_k[i])
25:         d̃_k[i] = 0
26:       end for
27:       k = b_{N_f}
28:       if k < N then
29:         ũ_k[i] = u_k + α[i]κ_k + K_k(x̃_k[i] − x_k)
30:         d̃_k[i] = x_{k+1} − f(x̃_k[i], ũ_k[i])
31:       end if
32:     end for
33:     X̃[i], Ũ[i], (1), (6) → J̃[i], z̃[i]
34:   end for
35:   i* ← arg min_i J̃[i] s.t. z̃[i], d̃[i] satisfy (7), (11)
36:   if i* ≠ ∅ then
37:     X, U, d ← X̃[i*], Ũ[i*], d̃[i*]
38:   else
39:     Increase ρ, go to line 3
40:   end if
41:   Taylor approximate the cost at X, U                      ▷ Next Iteration Setup
42:   Taylor approximate the dynamics at X, U
43: end while
We minimized memory bandwidth delays by doing most computations on the GPU. The CPU is instead in charge of the high-level serial control flow for kernel launches. We also condense as many computations onto as few kernels as possible, a process known as kernel fusion [37], to minimize kernel launch overhead. Finally, we make heavy use of streams and asynchronous memory transfers to increase throughput wherever possible. For example, by computing the Taylor approximations of the dynamics and cost in separate streams, the throughput of the next iteration setup step was much closer to the maximum of the running times for those calculations than to their sum.

We also found that the general purpose GPU matrix math libraries (e.g., cuBLAS) are optimized for very large matrix operations, while DDP algorithms require many sets of serial small matrix operations. We implemented simpler custom fused kernels which provide a large speedup by keeping the data in shared memory throughout the computations. We further optimized our code by precomputing serial terms during parallel operations. For example, during the backward pass, A, B, K, and κ were loaded into shared memory, so computing A + BK and Bκ only added a small overhead to the parallelizable backward pass, while greatly reducing the time for the serial consensus sweep.

Our multi-threaded CPU implementation leveraged the C++ thread library, which supports the launching of parallel threads across CPU cores. Threads run with their own set of registers and stack memory. As with GPU blocks, the allocation of CPU threads to processor cores, and context switches between threads running on the same core, are scheduled by the operating system. Also, since these threads run on standard CPUs, there is no explicit management of cache memory. The CPU implementation reused the same baseline code to leverage the optimizations made during the GPU implementation and to provide an equivalent implementation for comparison. However, for optimal performance, we had to introduce serial loops within threads to limit the number of threads to a small multiple of the number of CPU cores.

We made use of hand-derived analytical dynamics and cost functions for the quadrotor system to maximize performance. For the manipulator, we implemented a custom GPU-optimized forward dynamics kernel based on the Joint Space Inertia Inversion Algorithm, the fastest parallel forward dynamics algorithm for open kinematic chain robots with a very small number of rigid bodies [38]. For better direct comparisons, we used a looped version of that code for the CPU implementation.¹ Finally, following the state of the art, we implemented the iLQR variant of DDP.
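A rough sketch of the stream-based overlap described above, with trivial placeholder kernels standing in for the actual Taylor-approximation code; all names here are ours.

#include <cuda_runtime.h>

// Placeholder expansion kernels; each writes to its own output buffer,
// so the two launches may legitimately run concurrently.
__global__ void expandDynamics(const float* x, float* AB, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) AB[i] = 2.0f * x[i];   // stand-in computation
}
__global__ void expandCost(const float* x, float* gH, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) gH[i] = x[i] * x[i];   // stand-in computation
}

void nextIterationSetup(const float* d_x, float* d_AB, float* d_gH, int n) {
    cudaStream_t sDyn, sCost;
    cudaStreamCreate(&sDyn);
    cudaStreamCreate(&sCost);
    int threads = 256, blocks = (n + threads - 1) / threads;
    // Independent outputs let the two expansions overlap, so the step's
    // latency approaches max(t_dyn, t_cost) rather than their sum.
    expandDynamics<<<blocks, threads, 0, sDyn>>>(d_x, d_AB, n);
    expandCost<<<blocks, threads, 0, sCost>>>(d_x, d_gH, n);
    cudaStreamSynchronize(sDyn);
    cudaStreamSynchronize(sCost);
    cudaStreamDestroy(sDyn);
    cudaStreamDestroy(sCost);
}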
¹ We note that while the Joint Space Inertia Inversion Algorithm is not the fastest serial forward dynamics algorithm, the difference is not large for a 7-DoF manipulator.
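The custom fused small-matrix kernels mentioned above can be illustrated with a sketch like the following, which keeps a chain of small matrix products resident in shared memory rather than issuing one general BLAS call per product. The dimension (12) and all names are our illustrative assumptions.

#define DIM 12
// Compute the product of a chain of DIM x DIM matrices entirely in shared
// memory. Launch with one block of exactly DIM*DIM threads, e.g.:
//   fusedChainMultiply<<<1, DIM * DIM>>>(d_Ms, d_out, numMats);
__global__ void fusedChainMultiply(const float* Ms, float* out, int numMats) {
    __shared__ float acc[DIM * DIM], tmp[DIM * DIM];
    int tid = threadIdx.x;
    int r = tid / DIM, c = tid % DIM;
    acc[tid] = (r == c) ? 1.0f : 0.0f;   // start from the identity
    __syncthreads();
    for (int m = 0; m < numMats; ++m) {
        const float* M = Ms + m * DIM * DIM;
        float v = 0.0f;                  // one output entry per thread
        for (int k = 0; k < DIM; ++k)
            v += acc[r * DIM + k] * M[k * DIM + c];
        tmp[tid] = v;
        __syncthreads();                 // all of tmp written before copy-back
        acc[tid] = tmp[tid];
        __syncthreads();                 // acc fully updated before next product
    }
    out[tid] = acc[tid];
}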
5 Results
Three sets of experiments were conducted to evaluate the performance of parallel iLQR across different problems, computing architectures, and levels of parallelism. We ran our experiments on a laptop with a 2.8 GHz quad-core Intel Core i7-7700HQ CPU, an NVIDIA GeForce GTX 1060 GPU, and 16 GB of RAM. In our experiments we initialized the algorithm with a gravity compensating input. Both the GPU and CPU implementations used the same scheme for updating ρ and the same set of options for α. We report convergence results (cost as a function of time and iteration), the time per iteration, and tracking error where appropriate. Total cost reduction as a function of time is a particularly useful metric when deploying algorithms in MPC scenarios where there is typically a fixed control time budget. To ensure our results were representative, for each experiment we ran 100 trials with noise ∼ N(0, σ²) applied to the velocities of the initial trajectory. Our solver implementations and these examples can be found at http://bit.ly/ParallelDDP.
5.1 Quadrotor
We first considered a quadrotor system with 4 inputs corresponding to the thrust of each rotor and 12 states corresponding to the position and Euler angles, along with their time derivatives. We solved a simple flight task from a stable hover 0.5 m above the origin to a stable hover at the same height, 7 m away in the x and 10 m away in the y direction. We used a quadratic cost function of the form:

J = \frac{1}{2}(x_N - x_g)^T Q_N (x_N - x_g) + \sum_{k=0}^{N-1} \left[ \frac{1}{2}(x_k - x_g)^T Q (x_k - x_g) + \frac{1}{2} u_k^T R u_k \right],  (13)

setting Q = blkdiag(0.01 × I_{3×3}, 0.001 × I_{3×3}, 2.0 × I_{6×6}), R = 5.0 × I_{4×4}, Q_N = 1000 × I_{12×12}. We solved over a 4 s trajectory with N = 128, M_F = M_B = M = 1, 2, 4, 8, 16, 32, 64, a 3rd-order Runge-Kutta integrator, and σ = 0.001.

Figure 4 reveals that the delayed flow of information due to the algorithm-level parallelizations (stale CTG information, fixed starting state of each simulation block) generally leads to smaller steps and therefore slower cost reduction per iteration. For example, for the CPU implementation, the median line search depth for M = 1 was between 0 and 1, while for M = 4 it was 5. It also shows that the GPU's ability to run a fully parallel line search, as compared to the CPU's partially parallel approach (due to its limited number of hardware cores), allows the GPU to select a better "best line search" option and descend faster while avoiding local minima. For example, for the GPU implementation, the median line search depth for M = 1 was also between 0 and 1, while for M = 4 it was only 3.²
² These trends were also mirrored in the success rate of the algorithm. While 0% of CPU and GPU runs failed for M = 1, 2, 4, and on the GPU only 5% failed for M ≥ 8, on the CPU over 30% failed for M ≥ 8.
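For reference, a minimal sketch of evaluating the running cost in Eq. (13) at one knot point; Q and R are diagonal here, matching the blkdiag weights in the text, and the function and buffer names are our illustrative assumptions.

// Running cost for one knot point, Eq. (13), with diagonal Q and R.
__host__ __device__ float runningCost(const float* x, const float* u,
                                      const float* xg, const float* Qdiag,
                                      const float* Rdiag, int nx, int nu) {
    float c = 0.0f;
    for (int i = 0; i < nx; ++i) {
        float e = x[i] - xg[i];
        c += 0.5f * Qdiag[i] * e * e;   // 0.5 (x - xg)^T Q (x - xg)
    }
    for (int i = 0; i < nu; ++i)
        c += 0.5f * Rdiag[i] * u[i] * u[i];  // 0.5 u^T R u
    return c;
}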
Fig. 4. Median cost vs iteration for the quadrotor experiment.
Fig. 5. Median time per iteration for the quadrotor experiment.
The median time per iteration for each of the parallelization options for both implementations is shown in Fig. 5. Our observed per-iteration times of under 3 ms for all GPU and CPU cases are comparable to state-of-the-art reported rates of 5–25 ms [15] on a similar UAV system. These results also match our expectations that the higher clock rate would allow the CPU to compute the serial consensus sweep faster, while the GPU is able to leverage its increased number of cores to compute the parallel next iteration setup step faster.
For the CPU implementation, we also observed that parallelization can improve the speed of the backward pass until the number of available cores (4) is saturated, at which point performance stagnates. For the forward simulation, we found that slower paths to convergence, and thus deeper line searches, quickly
outweighed the running time gains due to parallelism. This is most evident in the increase in the time for the forward simulation as M grows from 8 to 64. By contrast, the GPU implementation is able to run the line search fully in parallel, which led to reductions in running time for both the backward pass and the forward simulation as M increases. However, there are diminishing returns. First, kernel launch overhead begins to dominate the running time as parallelization is increased. Second, since the next iteration setup is always fully parallelized, and for each line search option the consensus sweep cannot be parallelized, the running times for both steps remain constant. In fact, by the M = 64 case, the consensus sweep took almost a third of the total computational time, as compared to only 17% in the M = 2 case.

This increased speed per iteration and decreased convergence rate lead to a level of parallelism that optimizes the time to convergence, as shown in Fig. 6. There we find that in the M = 1 and M = 2 cases, the CPU is able to leverage its higher clock rate to outperform the GPU. However, the GPU is able to better exploit the algorithm-level parallelism and outperforms the CPU for M > 2. For this experiment the dynamics computations required are simple enough, and the problem size is small enough, that the fastest approach is CPU M = 1, indicating that on simple problems the overhead from parallelism outweighs the gains.
Fig. 6. Median cost for the first 20 ms of the quadrotor experiment.
5.2 Manipulator
We then consider the Kuka LBR IIWA-14 manipulator, which has 7 inputs corresponding to torques on the 7 joints. The nominal configuration is defined as the manipulator pointing straight up in the air. We solved a trajectory optimization task from a start state to a goal state across the workspace depicted in Fig. 7. We set Q = blkdiag(0.01 × I_{7×7}, 0.001 × I_{7×7}), R = 0.001 × I_{7×7}, Q_N = 1000 × I_{14×14}. We solved the problem over a 0.5 s trajectory with N = 64, M_F = M_B = M = 1, 2, 4, 8, 16, 32, a 1st-order Euler integrator, and σ = 0.001.

Fig. 7. Start (left) and goal (right) states for the manipulator experiment.

Figure 8 shows that parallelism leads to speedups in time per iteration on the GPU and CPU. We see that again on the GPU both the forward simulation and backward pass decrease in time as M increases. On the CPU we again see that the backward pass decreases in time until the CPU runs out of cores at M = 4, while the forward simulation time varies non-monotonically for different values of M depending on the parallelization speedup and the depth of line search slowdown. With all cases having a median time per iteration under 5 ms, both our GPU and CPU implementations are able to perform at speeds reported as state-of-the-art [9, 12–16].
Fig. 8. Median time per iteration for the manipulator experiment.
Unlike in the quadrotor example, the more computationally expensive forward dynamics and increased problem size in this example led to performance gains from parallelism as shown in Fig. 9. We find that the GPU is able to
successfully exploit the algorithm-level parallelism, with faster convergence for M = 2, 4, 8, 16 than for M = 1, and that M = 32 on the GPU converges in about the same time as the CPU's fastest standard option, M = 2. We also tested various combinations of the number of blocks for the forward and backward passes until we found the best possible CPU variant for this problem (M_f = 2 and M_b = 4) and found that GPU M = 2, 4, 8, 16 still converge faster.
Fig. 9. Median cost for the first 70 ms of the manipulator experiment.
Taken together, the results from the quadrotor and manipulator suggest that practical performance improvements can be obtained through large-scale parallelization and GPU implementations, but that the tradeoffs between the degree of parallelism and convergence speed are strongly dependent on system dynamics and problem specification.
5.3 Model-Predictive Control for the Manipulator
To better understand these results and the applicability of our GPU implementation for online model-predictive control, we conducted a goal tracking experiment with the manipulator in simulation. At each control step we ran our fastest solver, the GPU M = 4 implementation, with a maximum time budget of 10 ms. We warm started the iLQR algorithm by shifting all variables from the previous solve by the control duration and then rolling out a new initial state trajectory starting from the current measured state (with a gravity compensating input in the trailing time steps). During optimization periods, we simulated the system forward in real time using the previously computed solution. We considered an end effector pose tracking task where the goal moved continuously along a figure eight path. We modified our cost function to include the end effector error:
J = \frac{1}{2}(ee(q_N) - ee_{goal})^T Q_N (ee(q_N) - ee_{goal}) + \frac{1}{2}\dot{q}_N^T \dot{Q}_N \dot{q}_N
  + \sum_{k=0}^{N-1} \left[ \frac{1}{2}(ee(q_k) - ee_{goal})^T Q (ee(q_k) - ee_{goal}) + \frac{1}{2}\dot{q}_k^T \dot{Q} \dot{q}_k + \frac{1}{2}u_k^T R u_k \right],  (14)
where we include the quadratic penalty on q̇ to encourage a stable final position. We set Q = blkdiag(0.01 × I_{3×3}, 0 × I_{3×3}), R = 0.0001 × I_{7×7}, Q_N = blkdiag(1000 × I_{3×3}, 0 × I_{3×3}), Q̇ = 0.1 × I_{7×7}, Q̇_N = 10 × I_{7×7}. At each control step we solved the problem using a first-order Euler integrator and N = 64 knot points over a 0.5 s trajectory horizon. The full figure eight trajectory we were tracking had a period of 10 s. To initialize the experiment, we held the first goal pose constant until both ‖ee(q) − ee_goal‖₂² and ‖q̇‖₂² were less than 0.05, at which point the goal began moving along the figure eight path.

Figure 10 shows the trajectory computed by running the MPC experiment starting from an initial vertical state. Aside from confirming that good tracking performance is possible, we observed that the bookkeeping needed to implement MPC on a GPU does add delays in the control loop. In particular, shifting the previous variables and rolling out from the updated starting state takes almost 1.4 ms on average. Since GPU M = 4 is able to compute iterations in about 1.2 ms, this MPC initialization step is quite expensive. We expect there are potential avenues for improving this overhead through better software engineering (e.g., by using circular buffers to reduce memory copy operations). We plan to investigate this further in future experiments.
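A minimal sketch of the warm-start shift described above: the previous control trajectory is shifted left by the control duration and the tail is padded with a nominal (e.g., gravity-compensating) input. The function and names are our illustrative assumptions; copies of this kind dominate the ~1.4 ms bookkeeping cost reported above.

#include <cstring>

// Shift an N x m row-major control trajectory left by `shift` steps and pad
// the freed tail entries with a nominal input of length m.
void warmStartShift(float* U, int N, int m, int shift, const float* uNominal) {
    std::memmove(U, U + shift * m, sizeof(float) * (N - shift) * m);
    for (int k = N - shift; k < N; ++k)
        std::memcpy(U + k * m, uNominal, sizeof(float) * m);
}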
Fig. 10. Executed trajectory (red) vs. goal trajectory (blue) for the MPC experiment.
6 Conclusion and Future Work
We presented an analysis of a parallel multiple-shooting iLQR algorithm that achieves state-of-the-art performance on example trajectory optimization and model-predictive control tasks. Our results show how parallelism can be used to increase the convergence speed of DDP algorithms in some situations, but that tradeoffs between per-iteration speed and convergence behavior manifest in problem-specific ways.

Several directions for future research remain. First, we used analytical and numerical dynamics methods specific to the systems we considered to ensure good performance on the GPU. More general robot dynamics packages designed for GPUs that can achieve a satisfactory level of physical realism would be a valuable tool for the development and evaluation of parallel trajectory optimization methods. Second, it would be interesting to consider the performance impact of adding nonlinear constraints to parallel iLQR using augmented Lagrangian [39] or QP-based methods [40]. Other types of trajectory optimization formulations and algorithms may be more suitable for large-scale parallelization on GPUs, such as direct transcription and solvers based on the alternating direction method of multipliers [41]. We intend to broaden our investigation beyond DDP algorithms in future work. Finally, we would like to evaluate parallel trajectory optimization algorithms for MPC on hardware using low-power mobile parallel compute platforms such as the NVIDIA Jetson.

Acknowledgments. This work was supported by a Draper Internal Research and Development grant and by the National Science Foundation Graduate Research Fellowship (under grant DGE1745303). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the funding organizations. The authors would like to thank the members of the Harvard Agile Robotics Lab for their useful feedback and insights.
References

1. Murray, S., Floyd-Jones, W., Qi, Y., Sorin, D.J., Konidaris, G.: Robot motion planning on a chip. In: Robotics: Science and Systems (2016)
2. Park, C., Pan, J., Manocha, D.: Real-time optimization-based planning in dynamic environments using GPUs. In: 2013 IEEE International Conference on Robotics and Automation, pp. 4090–4097 (2013)
3. Heinrich, S., Zoufahl, A., Rojas, R.: Real-time trajectory optimization under motion uncertainty using a GPU. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3572–3577 (2015)
4. Ichter, B., Schmerling, E., Agha-mohammadi, A.A., Pavone, M.: Real-time stochastic kinodynamic motion planning via multiobjective search on GPUs. In: 2017 IEEE International Conference on Robotics and Automation (2017)
5. Betts, J.T., Huffman, W.P.: Trajectory optimization on a parallel processor. J. Guid. Control 14(2), 431–439 (1991)
6. Kouzoupis, D., Quirynen, R., Houska, B., Diehl, M.: A block based ALADIN scheme for highly parallelizable direct optimal control. In: Proceedings of the American Control Conference (2016)
7. Giftthaler, M., Neunert, M., Stäuble, M., Buchli, J., Diehl, M.: A family of iterative Gauss-Newton shooting methods for nonlinear optimal control (2018)
8. Antony, T., Grant, M.J.: Rapid indirect trajectory optimization on highly parallel computing architectures. J. Spacecraft Rockets 54(5), 1081–1091 (2017)
9. Farshidian, F., Jelavic, E., Satapathy, A., Giftthaler, M., Buchli, J.: Real-time motion planning of legged robots: a model predictive control approach. In: 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (2017)
10. Jacobson, D.H., Mayne, D.Q.: Differential dynamic programming. Math. Gaz. 56, 78 (1970)
11. Li, W., Todorov, E.: Iterative linear quadratic regulator design for nonlinear biological movement systems. In: Proceedings of the 1st International Conference on Informatics in Control, Automation and Robotics (2004)
12. Tassa, Y., Erez, T., Todorov, E.: Synthesis and stabilization of complex behaviors through online trajectory optimization. In: IEEE/RSJ International Conference on Intelligent Robots and Systems
13. Erez, T., Lowrey, K., Tassa, Y., Kumar, V., Kolev, S., Todorov, E.: An integrated system for real-time model predictive control of humanoid robots. In: 2013 13th IEEE-RAS International Conference on Humanoid Robots
14. Koenemann, J., Del Prete, A., Tassa, Y., Todorov, E., Stasse, O., Bennewitz, M., Mansard, N.: Whole-body model-predictive control applied to the HRP-2 humanoid. In: Proceedings of the IEEE-RAS Conference on Intelligent Robots
15. Neunert, M., de Crousaz, C., Furrer, F., Kamel, M., Farshidian, F., Siegwart, R., Buchli, J.: Fast nonlinear Model Predictive Control for unified trajectory optimization and tracking. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1398–1404 (2016)
16. Neunert, M., Farshidian, F., Winkler, A.W., Buchli, J.: Trajectory optimization through contacts and automatic gait discovery for quadrupeds. IEEE Robot. Autom. Lett. 2(3), 1502–1509 (2017)
17. Leineweber, D.B.: Efficient Reduced SQP Methods for the Optimization of Chemical Processes Described by Large Sparse DAE Models. VDI-Verlag, Düsseldorf (1999)
18. Word, D.P., Kang, J., Akesson, J., Laird, C.D.: Efficient parallel solution of large-scale nonlinear dynamic optimization problems. Comput. Optim. Appl. 59(3), 667–688 (2014)
19. Bolz, J., Farmer, I., Grinspun, E., Schröder, P.: Sparse matrix solvers on the GPU: conjugate gradients and multigrid. In: ACM SIGGRAPH 2003 Papers, SIGGRAPH 2003, pp. 917–924. ACM (2003)
20. Yu, L., Goldsmith, A., Di Cairano, S.: Efficient convex optimization on GPUs for embedded model predictive control. In: Proceedings of the General Purpose GPUs, GPGPU-10, pp. 12–21. ACM (2017)
21. Ling, K.V., Yue, S.P., Maciejowski, J.M.: A FPGA implementation of model predictive control. In: 2006 American Control Conference (2006)
22. Lau, M.S.K., Yue, S.P., Ling, K.V., Maciejowski, J.M.: A comparison of interior point and active set methods for FPGA implementation of model predictive control. In: 2009 European Control Conference (ECC), pp. 156–161 (2009)
23. Ferreau, H.J., Kozma, A., Diehl, M.: A parallel active-set strategy to solve sparse parametric quadratic programs arising in MPC. IFAC Proc. 45(17), 74–79 (2012)
24. Frasch, J.V., Sager, S., Diehl, M.: A parallel quadratic programming method for dynamic optimization problems. Math. Program. Comput. 7(3), 289–329 (2015)
25. Frison, G.: Algorithms and methods for fast model predictive control (2015)
26. Quirynen, R.: Numerical simulation methods for embedded optimization (2017)
27. Huang, Y., Ling, K.V., See, S.: Solving quadratic programming problems on graphics processing unit. ASEAN Eng. J. 1, 76–86 (2011)
28. Phung, D.-K., Hérissé, B., Marzat, J., Bertrand, S.: Model predictive control for autonomous navigation using embedded graphics processing unit. IFAC-PapersOnLine 50(1), 11883–11888 (2017)
29. Bock, H.G., Plitt, K.-J.: A multiple shooting algorithm for direct solution of optimal control problems. IFAC Proc. 17(2), 1603–1608 (1984)
30. Garza, D.M.: Application of automatic differentiation to trajectory optimization via direct multiple shooting (2003)
31. Diehl, M., Bock, H.G., Diedam, H., Wieber, P.-B.: Fast direct multiple shooting algorithms for optimal robot control. In: Fast Motions in Biomechanics and Robotics, pp. 65–93. Springer, Heidelberg (2006)
32. Pellegrini, E., Russell, R.P.: A multiple-shooting differential dynamic programming algorithm. In: AAS/AIAA Space Flight Mechanics Meeting (2017)
33. Tassa, Y.: Theory and Implementation of Biomimetic Motor Controllers. The Hebrew University of Jerusalem, Jerusalem (2011)
34. Zinkevich, M., Langford, J., Smola, A.J.: Slow learners are fast. In: Advances in Neural Information Processing Systems, vol. 22, pp. 2331–2339. Curran Associates, Inc. (2009)
35. Ho, Q., Cipar, J., Cui, H., Kim, J.K., Lee, S., Gibbons, P.B., Gibson, G.A., Ganger, G.R., Xing, E.P.: More effective distributed ML via a stale synchronous parallel parameter server (2013)
36. NVIDIA: NVIDIA CUDA C Programming Guide, Version 9.1 edition (2011)
37. Filipovič, J., Madzin, M., Fousek, J., Matyska, L.: Optimizing CUDA code by kernel fusion—application on BLAS. J. Supercomput. 71(10), 3934–3957 (2015)
38. Yang, Y., Wu, Y., Pan, J.: Parallel dynamics computation using prefix sum operations. IEEE Robot. Autom. Lett. 2(3), 1296–1303 (2017)
39. Plancher, B., Manchester, Z., Kuindersma, S.: Constrained unscented dynamic programming. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2017)
40. Xie, Z., Liu, C.K., Hauser, K.: Differential dynamic programming with nonlinear constraints. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 695–702 (2017)
41. Boyd, S.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Time Integrating Articulated Body Dynamics Using Position-Based Collocation Methods

Zherong Pan¹ and Dinesh Manocha²

¹ University of North Carolina, Chapel Hill, NC 27514, USA
[email protected]
² Department of Computer Science and Electrical & Computer Engineering, University of Maryland, College Park, MD 20740, USA
[email protected]
Abstract. We present a new time integrator for articulated body dynamics. We formulate the governing equations of the dynamics using only the position variables and then recast the position-based articulated dynamics as an optimization problem. Our reformulation allows us to integrate the dynamics in a fully implicit manner without computing high-order derivatives. Therefore, under arbitrarily large timestep sizes, we observe highly stable behaviors using an off-the-shelf numerical optimizer. Moreover, we show that the accuracy of our time integrator can be increased by using a high-order collocation method. We show that each iteration of the optimization has a complexity of O(N) using the Quasi-Newton method or O(N²) using Newton's method, where N is the number of links of the articulated model. Finally, our method is highly parallelizable and can be accelerated using a Graphics Processing Unit (GPU). We highlight the efficiency and stability of our method on different benchmarks and compare the performance with prior articulated body dynamics simulation methods based on the Newton-Euler equation. Using a larger timestep size, our method achieves up to 4 times speedup on a single-core CPU. With GPU acceleration, we observe an additional 3–6 times speedup over a 4-core CPU.
1 Introduction

Numerical modeling of articulated bodies is a fundamental problem in robotics. It is important in the design and evaluation of mechanisms, robot arms, and humanoid robots. Furthermore, articulated body simulators are increasingly used to evaluate a controller during reinforcement learning [6, 23], to predict the future state of a robot during online control [29], and to satisfy the dynamics constraints for motion planners [27]. In all these applications, the underlying algorithms are implemented on top of dynamic simulators and may invoke these simulators thousands of times for different parameters and settings [29]. As a result, the performance of these applications is easily affected by these simulators' performances. Many widely-used articulated body simulation packages [25, 29] are based on implicit time-stepping schemes [26]. These methods model the articulated body's governing equation as an ordinary differential equation (ODE) and then integrate the ODE using high-order numerical schemes. These methods can be arbitrarily accurate but require small timestep sizes. One simple strategy to improve the runtime performance
is to use a large timestep size [24]. This strategy has proven successful in some applications, such as controlling humanoid robots [28], where the timestep size used in a controller can be larger than that used in the underlying simulator. A key issue in using a large timestep size is ensuring that the time integrator is still stable. For example, the stable region of a semi-implicit Euler integrator shrinks as the timestep size increases [4]. To time integrate an articulated body under a large timestep size, a simple and widely-used method is to use an unconditionally stable, fully implicit Euler integrator [4]. However, in a conventional articulated body's governing dynamics equation, the use of a fully implicit Euler integrator involves a costly O(N³) computation of high-order derivatives, where N is the number of links in an articulated body.

Main Results: We present position-based articulated dynamics (PBAD), a novel optimization-based algorithm for articulated body dynamics simulation. Unlike prior methods [26], which represent the velocity as a time derivative and evaluate this derivative analytically, our PBAD formulation represents this velocity using finite differences in the Euclidean space. This Euclidean space discretization allows us to represent all the physical variables as functions of positions. As a result, we can integrate the system implicitly without high-order derivatives. In addition, we show that numerical simulation under our PBAD framework can be recast as a numerical optimization. Therefore, our time integrator is stable under an arbitrarily large timestep size because a numerical optimizer can ensure that the energy value decreases during each iteration through line-search [19] or trust region limitation [18]. Solving these unconstrained minimization problems requires evaluating the energy gradient and/or Hessian and solving a linear system of size O(N). To this end, we use techniques similar to well-known forward- and inverse-dynamics algorithms [8] and show that the necessary energy gradient and Hessian information can be computed within O(N) and O(N²). Finally, we show that the accuracy of the PBAD time integrator can be improved by approximating the velocity using high-order polynomials, leading to a high-order collocation method [12].

We have implemented our algorithm and evaluated the performance on many articulated models with 10–200 DOFs. Compared with a conventional semi-implicit Euler integrator, our PBAD simulator achieves up to 4 times overall speedup with a serial implementation running on a single-core CPU. Finally, all the operations in our unconstrained energy minimization are inherently parallel and we accelerate the simulation on a GPU to obtain 3–6 times additional speedup over a 4-core CPU, as shown in Sect. 5.2.

The rest of the paper is organized as follows. We first review conventional Lagrangian articulated body dynamics in Sect. 3 and then introduce our PBAD formulation in Sect. 4. Next, in Sect. 5, we present some algorithmic and numeric analysis of our method. Finally, we compare our method with an earlier method [26] on a set of classic benchmarks used by [25, 29] in Sect. 6. We also show some applications in online/offline control algorithms in Sect. 6.
2 Related Work

We give a brief overview of previous work in articulated body dynamics, time-integration schemes, and position-based dynamics.
2.1 Articulated Body Dynamics

Articulated body dynamic simulation is a classic, well-studied problem in robotics. Some methods [5, 30, 31] focus on articulated bodies with general constraints, where the configurations of articulated bodies are represented using maximal coordinates. However, tree-structured articulated bodies represented using minimal coordinates have received the most attention. Very efficient algorithms [8, 20] have been developed for forward/inverse-dynamics, and these are key steps in a dynamics simulator. These algorithms have been further accelerated using divide-and-conquer [7], adaptivity [11], and GPU-parallelism [32, 33].

2.2 Time Integration Schemes

A time integrator predicts the future configuration of an articulated body given its current configuration. Time integrators vary in their computational cost, stability, and accuracy (see [4, 16] for a review). Widely-used integrators in articulated body simulators [25, 29], such as explicit high-order Runge-Kutta schemes, are linear multistep methods for ODEs, which require small timestep sizes. Compared with explicit schemes, implicit Runge-Kutta schemes have better stability; some of these are also known as collocation methods [1]. Collocation methods approximate the locus of configuration using high-order polynomials. Unlike these general-purpose integrators, special integrators such as Lie-group integrators [15] and variational integrators [17] can be developed to respect the Lie group structure of articulated bodies, resulting in desirable conservative properties in linear/angular momentum and the Hamiltonian.

2.3 Position-Based Dynamics (PBD)

Our method is inspired by the recent advances in PBD in computer graphics (see [2] for a survey). PBD has been shown to be stable under arbitrarily large timestep sizes and is preferred for interactive applications such as game engines. PBD algorithms have been developed for various dynamics systems such as fluid bodies, deformable bodies, and rigid bodies [5]. In computer graphics, however, rigid bodies are represented using maximal coordinates, while in our PBAD formulation we use minimal coordinates. We have also extended conventional second-order PBD discretizations to arbitrarily high-order collocation methods. The connection between PBD and optimization-based integrators is revealed in [3] and later refined in [10, 21].
3 Background: Lagrangian Articulated Body Dynamics

We briefly review the conventional articulated body dynamics formulation under generalized coordinates (see [20] for more details). Throughout our derivation, we assume that there is only one rigid body. The more general case of multiple rigid bodies can be derived by a concatenation of equations for each rigid body. The configuration of a rigid body B is parameterized by generalized coordinates, q. |q| is the number of DOFs and
is proportional to the number of links, N. For an arbitrary point p ∈ B in the body-fixed frame of reference, its corresponding position in a global frame of reference is: P(q) = R(q)p + t(q), where R is a global rotation and t is a global translation. The dynamics of B is governed by the following equation:

\int_{p \in B} \left( \frac{\partial P(q)}{\partial q} \right)^T \left( \rho \ddot{P}(q) - f \right) dp = 0,  (1)
where f are the internal/external forces on p and ρ is the mass density. If we analytically evaluate the second derivative in Eq. 1, we arrive at the following well-known equation:

J^T M J \ddot{q} + \left( J^T M \dot{J} + J^T \begin{pmatrix} 0 \\ [\omega] \end{pmatrix} M J \right) \dot{q} - J^T f = 0,  (2)

where we have \dot{R} = [\omega] R, J = \left( \partial\omega/\partial q^T \;\; \partial t/\partial q^T \right)^T, and M is the 6 × 6 mass matrix. From Eq. 2, we can formulate a discrete version to predict the next configuration (q_{k+1}, \dot{q}_{k+1}) from the current configuration (q_k, \dot{q}_k). Here we use a subscript to denote the timestep index, i.e. q_k is q at time instance kΔt. To this end, several widely-used articulated body simulators [25, 29] use a semi-implicit Euler scheme:

\frac{\dot{q}_{k+1} - \dot{q}_k}{\Delta t} = \left( J_k^T M_k J_k \right)^{-1} \left( J_k^T f_k - \left( J_k^T M_k \dot{J}_k + J_k^T \begin{pmatrix} 0 \\ [\omega_k] \end{pmatrix} M_k J_k \right) \dot{q}_k \right),  (3)
where [ω_k] is the 3 × 3 skew-symmetric cross-product matrix. The above scheme usually works well for a small timestep size (usually smaller than 0.01 s), but its stability under a large timestep size is not guaranteed. This is due to the explicit velocity update in Eq. 3, i.e. the right-hand side of Eq. 3 is evaluated at timestep k. One common method for achieving better stability under a large timestep size is to use the fully implicit Euler scheme by replacing (q_k, \dot{q}_k) on the right-hand side of Eq. 3 with (q_{k+1}, \dot{q}_{k+1}) and solving for q_{k+1} using an iterative algorithm. A widely-used iterative algorithm is the (Quasi)-Newton method, which has been used to stably simulate deformable and fluid bodies [24]. However, there are two difficulties in using the (Quasi)-Newton method for fully implicit integration:

– The (Quasi)-Newton method requires the derivatives of the right-hand side of Eq. 3 with respect to q_{k+1}, which involves third-order derivatives, ∂³R/∂q³ and ∂³t/∂q³, the evaluation complexity of which is O(N³).
– The implicit integrator solves a system of nonlinear equations for which even the (Quasi)-Newton method could fail to converge under large timestep sizes [10].
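For concreteness, a minimal sketch of the explicit update in Eq. 3, assuming the generalized acceleration q̈ has already been obtained from a forward-dynamics solve; the function name is ours.

// One semi-implicit Euler step (Eq. 3): the velocity update uses forces
// evaluated at step k (explicit right-hand side), while the position update
// uses the *new* velocity.
void semiImplicitEulerStep(float* q, float* qdot, const float* qddot,
                           float dt, int dof) {
    for (int i = 0; i < dof; ++i) {
        qdot[i] += dt * qddot[i];  // explicit velocity update
        q[i]    += dt * qdot[i];   // position uses the updated velocity
    }
}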
4 Position-Based Articulated Body Dynamics

In this section, we present our PBAD formulation, which overcomes some of the problems found in prior time integrators. We notice that, from Eq. 1, the acceleration of P
is evaluated analytically to derive Eq. 2, which involves up to second-order derivatives. However, if we use a finite difference approximation of P̈ directly in Eq. 1, the analytic derivatives can be eliminated, allowing us to perform a (Quasi)-Newton method without evaluating ∂³R/∂q³ and ∂³t/∂q³. For example, if we use a second-order finite difference approximation, Eq. 1 becomes:

\int_{p \in B} \left( \frac{\partial P(q_{k+1})}{\partial q_{k+1}} \right)^T \left( \rho \frac{P(q_{k+1}) - 2P(q_k) + P(q_{k-1})}{\Delta t^2} - f(P(q_{k+1})) \right) dp = 0.  (4)
Corresponding to Eq. 1 under the conventional formulation, Eq. 4 is the governing equation under our PBAD formulation. Note that Eq. 4 converges to Eq. 1 as Δt → 0. Equation 4 takes a similar form to the governing equations in previous PBD methods [13, 21] for simulating deformable bodies but is expressed for articulated bodies under minimal coordinates. We can now argue that Eq. 4 overcomes the two difficulties. First, if we use Newton's method to solve Eq. 4, we only need to evaluate derivatives up to the second order, i.e. ∂²R/∂q² and ∂²t/∂q². Moreover, we will show in Sect. 5 that, if we use the Quasi-Newton method, only first-order derivatives are needed without modifying the final solutions. Second, the convergence difficulty of the (Quasi)-Newton method under a very large timestep size can be fixed by reformulating Eq. 4 as an energy minimization problem:

E(q) \triangleq \int_{p \in B} \frac{\rho}{2\Delta t^2} \left\| P(q) - 2P(q_k) + P(q_{k-1}) \right\|^2 + Q(P(q)) \, dp,  (5)

where Q is the potential energy for a position-dependent conservative force f. Such a reformulation allows us to use an off-the-shelf, gradient-based optimizer to solve for q_{k+1} = argmin E(q). These optimizers use line-search [19] or trust region limitations [18] to ensure that each iteration gets the solution closer to a local minimum of E(q), i.e. the correct q_{k+1}. Although E(q) in Eq. 5 still involves an integral over B, we can derive its analytic form.

4.1 High-Order Position-Based Collocation Method

One advantage of using Eq. 1 is that one could use a general linear multistep method (see [4]) to achieve a variable order of accuracy. We show that our PBAD formulation can also have such flexibility by modifying a high-order collocation method [1]. A collocation method approximates the locus of the configuration of B using high-order polynomials. Note that, in Eq. 4, we assume that, for any p ∈ B, its trajectory in the period of time [(k − 1)Δt, (k + 1)Δt] is determined by the three collocation points P(q_{k−1}), P(q_k), P(q_{k+1}), and a collocation method assumes that p follows a polynomial curve passing through all the collocation points. For example, in Eq. 5, we can fit a quadratic curve from the three points so that it is a second-order collocation method. To develop higher-order methods, we introduce additional collocation points in between timesteps (P(q_{k+α_1}), · · · , P(q_{k+α_{K−2}})) for a Kth-order method, where 0 < α_1 < · · · < α_{K−1} = 1. We fit a Kth-order polynomial for any p ∈ B from the
K + 1 collocation points P^* \triangleq \left( P(q_{k-1+\alpha_{K-2}}) \;\cdots\; P(q_{k+\alpha_{K-1}}) \right). The Kth-order polynomial takes the following form:

P(t) \triangleq P^* H \left( 1 \;\; t \;\; \cdots \;\; t^K \right)^T, \quad \ddot{P}(t) \triangleq P^* \bar{H} \left( 1 \;\; t \;\; \cdots \;\; t^K \right)^T,

where H, \bar{H} are the polynomial basis matrices. We call this a position-based collocation method. A key difference between a position-based collocation method and a conventional collocation method [1] is that we fit polynomials for P instead of q. In other words, we assume that any p ∈ B follows a polynomial curve in the Cartesian workspace instead of the configuration space. By plugging P(t) into Eq. 1, we obtain:
\int_{p \in B} \left( \frac{\partial P(q_i)}{\partial q_i} \right)^T \left( \rho \ddot{P}(i\Delta t) - f(P(q_i)) \right) dp = 0 \quad \forall i = k+\alpha_1, \cdots, k+\alpha_{K-1},  (6)

from which we can solve for q^* = \left( q_{k+\alpha_1}, \cdots, q_{k+\alpha_{K-1}} \right) simultaneously. Given a set of collocation points, we have completed our high-order formulation of PBAD. In practice, we follow [12] and use the roots of the Legendre polynomials as our collocation points. In other words, suppose L_{K−2}(x) is the (K − 2)th-order Legendre polynomial of the first kind; then L_{K−2}(2α_i − 1) = 0 for i = 1, · · · , K − 2. Note that, although Eq. 6 allows fully implicit integration without high-order derivatives, it does not have a corresponding energy form like Eq. 5. However, we can still govern the convergence of a gradient-based optimizer using the following energy form:
E(q^*) = \sum_{i=k+\alpha_1}^{k+\alpha_{K-1}} \left\| \int_{p \in B} \left( \frac{\partial P(q_i)}{\partial q_i} \right)^T \left( \rho \ddot{P}(i\Delta t) - f(P(q_i)) \right) dp \right\|^2,  (7)
where we solve for all the q^* from q^* = argmin E(q^*). The high-order position-based collocation method (Eq. 7) is more general than its second-order counterpart (Eq. 5) because f is not integrated to get Q, allowing f to be non-conservative. Further, Eq. 7 still allows simulation in a fully implicit manner without computing third-order derivatives.
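As a minimal illustration of the finite-difference inertial term in Eq. 5, the following sketch evaluates the contribution of a single sample point p; the potential Q is omitted, and rho_dp stands for the mass ρ dp associated with p. The function and names are our illustrative assumptions.

// Second-order PBAD inertial energy (Eq. 5) for one sample point of B,
// given its 3D positions at three consecutive timesteps.
__host__ __device__ float inertialEnergy(const float* Pk1 /* P(q_{k+1}) */,
                                         const float* Pk, const float* Pkm1,
                                         float rho_dp, float dt) {
    float e = 0.0f;
    for (int i = 0; i < 3; ++i) {
        float fd = Pk1[i] - 2.0f * Pk[i] + Pkm1[i];  // finite-difference term
        e += fd * fd;
    }
    return rho_dp / (2.0f * dt * dt) * e;  // rho/(2 dt^2) ||P_{k+1}-2P_k+P_{k-1}||^2
}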
5 Optimization Algorithm

In this section, we introduce the algorithm that performs numerical simulations under our PBAD formulation. During timestep k, an implementation of our PBAD articulated body simulator calls a gradient-based optimizer to solve q^* = argmin E(q^*), where E takes the form of Eq. 5 for second-order collocation methods and conservative force models, and E takes the form of Eq. 7 for high-order collocation methods or non-conservative force models. Each timestep is an iterative algorithm whose complexity is not a constant. However, we can analyze the complexity of each iteration and profile the number of iterations empirically. Our objective functions involve both inertial and potential energy terms. Since the concrete form of the potential energy Q is application-dependent, we focus on the inertial term. Values and derivatives of most widely-used potential energies, such as the gravitational energy, can be evaluated in O(N) or O(N²), and the complexity of the algorithm is dominated by the inertial term.
Algorithm 1. Compute I(q_a, q_b) and ∂I(q_a, q_b)/∂q_b using the adjoint method within O(N). Here A is a 4 × 4 matrix. Note that Line 12 is O(1) because ∂T^i_{i−1}(q_b)/∂q_b is non-zero only at entries corresponding to the ith link.

1:  T_0(q_a) ← I
2:  T_0(q_b) ← I
3:  I(q_a, q_b) ← 0
4:  for i = 1, · · · , N do                                    ▷ O(N) forward pass
5:    T_i(q_a) ← T_{i−1}(q_a) T^i_{i−1}(q_a)
6:    T_i(q_b) ← T_{i−1}(q_b) T^i_{i−1}(q_b)
7:    I(q_a, q_b) ← I(q_a, q_b) + I^i(q_a, q_b)
8:  end for
9:  A ← 0, ∂I(q_a, q_b)/∂q_b ← 0                               ▷ A is a 4 × 4 matrix
10: for i = N, · · · , 1 do                                    ▷ O(N) backward pass
11:   A ← A + ∂I^i(q_a, q_b)/∂T^i(q_b)
12:   ∂I(q_a, q_b)/∂q_b ← ∂I(q_a, q_b)/∂q_b + (T_{i−1}(q_b) ∂T^i_{i−1}(q_b)/∂q_b) : A   ▷ O(1)
13:   A ← A T^i_{i−1}(q_b)^T
14: end for
During each iteration, we evaluate the value and the partial derivatives of E, which involve an integral over B. We can evaluate this integral analytically. Note that E in Eq. 5 is a linear combination of the following term:

I(q_a, q_b) = \int_{p \in B} P(q_a)^T P(q_b) \, dp,  (8)

with different (a, b)-pairs, as shown in our extended report [22]. Similarly, E in Eq. 7 is a linear combination of Eq. 8's partial derivatives. Equation 8 can be evaluated analytically as:

I(q_a, q_b) = T(q_a)^T T(q_b) : \tilde{M}, \quad \tilde{M} \triangleq \int_{p \in B} \begin{pmatrix} p \\ 1 \end{pmatrix} \begin{pmatrix} p \\ 1 \end{pmatrix}^T dp,
where the integral on the right-hand side (M̃) is related to the mass and inertia tensor of B (though not exactly the same). This matrix does not depend on q_{a,b} and can be precomputed. We have also used the contraction symbol such that A : B = tr(A^T B), and we have used homogeneous coordinates:

T(q) = \begin{pmatrix} R(q) & t(q) \\ & 1 \end{pmatrix}.
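Under the row-major 4 × 4 layout assumed here, evaluating Eq. 8 via the precomputed M̃ is a small double contraction; the function name evalI is ours.

// Evaluate I(q_a, q_b) = T(q_a)^T T(q_b) : M~ for row-major 4x4 matrices,
// using A : B = tr(A^T B) = sum_ij A_ij B_ij.
__host__ __device__ float evalI(const float* Ta, const float* Tb,
                                const float* Mtilde) {
    float I = 0.0f;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float c = 0.0f;  // (Ta^T Tb)_{ij} = sum_k Ta[k][i] * Tb[k][j]
            for (int k = 0; k < 4; ++k)
                c += Ta[k * 4 + i] * Tb[k * 4 + j];
            I += c * Mtilde[i * 4 + j];
        }
    return I;
}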
To solve q^*, we consider two optimizers, LBFGS [19] and LM [18]. Given an objective function E(q^*), each iteration of LBFGS computes a gradient, ∂E(q^*)/∂q^*, and updates q^* using a line-search along the gradient direction to ensure the decrease of E(q^*). The cost of an LBFGS iteration is dominated by the computation of the gradient, which takes O(N²) in the case of Eq. 7 and O(N) in the case of Eq. 5.
Algorithm 2. Compute ∂²I(q_a, q_b)/∂q_b² using the adjoint method within O(N²). Here A, B are 4 × 4 matrices.

1:  Same forward pass as Algorithm 1.
2:  A ← 0, ∂²I(q_a, q_b)/∂q_b² ← 0
3:  for i = N, · · · , 1 do                                    ▷ O(N²) backward pass
4:    A ← A + ∂I^i(q_a, q_b)/∂T^i(q_b)
5:    ∂²I(q_a, q_b)/∂q_b² ← ∂²I(q_a, q_b)/∂q_b² + (T_{i−1}(q_b) ∂²T^i_{i−1}(q_b)/∂q_b²) : A   ▷ O(1)
6:    B ← (∂T^i_{i−1}(q_b)/∂q_b)^T A
7:    for j = i − 1, · · · , 1 do
8:      ∂²I(q_a, q_b)/∂q_b² ← ∂²I(q_a, q_b)/∂q_b² + (T_{j−1}(q_b) ∂T^j_{j−1}(q_b)/∂q_b) : B   ▷ O(1)
9:      B ← B T^j_{j−1}(q_b)^T
10:   end for
11:   A ← A T^i_{i−1}(q_b)^T
12: end for
Table 1. The variables required by different optimizers using different objective functions. Since high-order methods are more frequently used, we use Eq. 7 as our objective function in most cases.

Optimizer | Objective: Equation 5                                  | Objective: Equation 7
LM        | I(q_a,q_b), ∂I(q_a,q_b)/∂q_b, ∂²I(q_a,q_b)/∂q_a∂q_b    | ∂I(q_a,q_b)/∂q_b, ∂²I(q_a,q_b)/∂q_b², ∂²I(q_a,q_b)/∂q_a∂q_b
LBFGS     | I(q_a,q_b), ∂I(q_a,q_b)/∂q_b                           | ∂I(q_a,q_b)/∂q_b, ∂²I(q_a,q_b)/∂q_b², ∂²I(q_a,q_b)/∂q_a∂q_b
Unlike LBFGS, each iteration of LM computes a gradient, ∂E(q^*)/∂q^*, and a J^T J-approximate Hessian, J^T J(E(q^*)), and updates q^* using Newton's method:

q^* \leftarrow q^* - \left( J^T J(E(q^*)) + \lambda I \right)^{-1} \frac{\partial E(q^*)}{\partial q^*},

where λ is tuned to ensure the decrease of E(q^*). To compute the J^T J-approximate Hessian, our objective function must be a sum-of-squares, as is the case with Eq. 7, or an integral-of-squares, as is the case with Eq. 5. The cost of an LM iteration is dominated by solving a linear system of size |q| × |q|, and is O(N³) assuming a general linear solver. The two optimization algorithms require different partial derivatives of I(q_a, q_b) (up to second order) during each iteration, as illustrated in Table 1. The values and derivatives of I(q_a, q_b) can be computed efficiently using the adjoint method, which results in algorithms similar to the forward/inverse dynamics algorithms in [8]. To introduce these algorithms, we need notations for multiple rigid bodies. We assume that we have N rigid bodies B¹, · · · , B^N, where the parent of B^i is B^{i−1}. We use superscripts to denote body indices. For each B^i, we denote its transformation as T^i, and we have T^i = T^{i−1} T^i_{i−1}. With these notations, I(q_a, q_b) = \sum_i I^i(q_a, q_b) becomes the summation over all the bodies. We compute I(q_a, q_b) and ∂I(q_a, q_b)/∂q_b within O(N) using Algorithm 1.
Algorithm 3. Compute ∂²I(q_a, q_b)/∂q_a∂q_b using the adjoint method within O(N²). Here E, F, G are 4 × 4 × 4 × 4 tensors, A, B, C, D are 4 × 4 matrices, and we use double contraction such that A : E : B = \sum_{xyzw} [A_{xy} E_{xyzw} B_{wz}] and A : CED : B = AC : E : DB. Finally, we define E_{xyzw} = ∂²I(q_a, q_b)/∂T_{xy}(q_a)∂T_{zw}(q_b).

1:  Same forward pass as Algorithm 1.
2:  E ← 0, ∂²I(q_a, q_b)/∂q_a∂q_b ← 0
3:  for i = N, · · · , 1 do                                    ▷ O(N²) backward pass
4:    E ← E + ∂²I^i(q_a, q_b)/∂T^i(q_a)∂T^i(q_b), F ← E, G ← E
5:    for j = i, · · · , 1 do
6:      ∂²I(q_a, q_b)/∂q_a∂q_b ← ∂²I(q_a, q_b)/∂q_a∂q_b +
7:        (T_{i−1}(q_a) ∂T^i_{i−1}(q_a)/∂q_a) : F : (T_{j−1}(q_b) ∂T^j_{j−1}(q_b)/∂q_b)^T +
8:        (T_{j−1}(q_a) ∂T^j_{j−1}(q_a)/∂q_a) : G : (T_{i−1}(q_b) ∂T^i_{i−1}(q_b)/∂q_b)^T   ▷ O(1)
9:      F ← F T^j_{j−1}(q_b)^T, G ← T^j_{j−1}(q_a) G            ▷ O(1)
10:   end for
11:   E ← T^i_{i−1}(q_a) E T^i_{i−1}(q_b)^T
12: end for
We compute ∂²I(q_a, q_b)/∂q_b² within O(N²) using Algorithm 2, and we compute ∂²I(q_a, q_b)/∂q_a∂q_b within O(N²) using Algorithm 3.

5.1 Algorithm Complexity of High-Order Collocation Methods

Compared with the second-order collocation method, which only optimizes q_{k+1}, high-order collocation methods optimize multiple q in q^*. In addition, we can only use Eq. 7 as the objective function. The cost of each iteration of the optimization algorithm is dominated by computing the matrix ∂²I(q_a, q_b)/∂q_a∂q_b. This matrix has size |q^*| × |q^*| and can be decomposed into (K − 2) × (K − 2) blocks of size |q| × |q|. Each block is computed using Algorithm 3 and takes O(N²), so that the computation of the entire |q^*| × |q^*| matrix takes O((K − 2)²N²).

5.2 GPU Parallelization

Our PBAD formulation is designed to be GPU-friendly. Simulating rigid bodies on a GPU has been previously studied [32, 33]. These methods formulate forward/inverse dynamics algorithms as GPU-scan operations. Our GPU implementation deviates from [32, 33] in two ways. First, our implementation is intended to be used for model-predictive control [28] and reinforcement learning [6], where we need to generate multiple trajectories at once. This fact provides more opportunities for parallelism. Second, our algorithm is iterative and the number of iterations performed during each timestep tends to be different. In practice, an implementation that runs each timestep in a separate thread could result in starvation, where threads finishing early are waiting for other threads. As a result, we parallelize each iteration of an optimization instead of each timestep. This mechanism is illustrated in Fig. 1.
Fig. 1. An illustration of our GPU implementation. The GPU has M cores, each illustrated as a gray box on the left. We use a workgroup of N cores (black arrow) to simulate one trajectory. In this illustration, we compute 3 trajectories that each have 4 timesteps (q2 , · · · , q5 ). During each call to the GPU, instead of finishing the entire LM optimization, we compute just one iteration of the LM optimization (colored block on the right) so that all the workgroups are running the same computation and no starvation will happen. Different timesteps are illustrated using blocks of different colors. For example, it takes 2 iterations to compute q2 in the first trajectory and 7 iterations to compute q2 in the second trajectory (red block).
We choose the LM algorithm in our GPU implementation. Each iteration of LM involves computing ∂I(q_a, q_b)/∂q_b, ∂²I(q_a, q_b)/∂q_b², and ∂²I(q_a, q_b)/∂q_a∂q_b according to Table 1 and then using a linear system solver. The serial computation of ∂²I(q_a, q_b)/∂q_b² and ∂²I(q_a, q_b)/∂q_a∂q_b takes O(N²), which can be costly. We introduce additional fine-grain parallelism by using a GPU workgroup of N cores to reduce the complexity of computing the partial derivatives to O(N), using algorithms in our extended report [22]. With the same workgroup of N cores, the complexity of the GPU linear solver is reduced to O(N²) using parallel Cholesky factorization [9]. As a result, a GPU with M cores can simulate M/N trajectories in parallel, and the complexity of each iteration is dominated by the linear solver, i.e. is O(N²). This method is suitable for modern commodity GPUs where the number of cores M ≫ N. Finally, in Sect. 6, we will show that widely used external force models such as frictional contact forces and fluid drag forces can be formulated as integrable energies, Q, whose values and derivatives can be computed in a manner similar to the inertial terms computed in this section. Putting them together, our method can be used to model the complex locomotion tasks in [6], such as swimming, walking, and jumping.
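A structural sketch of the launch scheme in Fig. 1, under the assumption that one CUDA block of N threads serves as one workgroup; the per-trajectory LM update shown is a stand-in, not the actual LM math.

#include <cuda_runtime.h>

// One trajectory ("workgroup") per block; threads split its variables.
__global__ void lmIteration(float* states, int perTraj) {
    float* mine = states + blockIdx.x * perTraj;
    for (int i = threadIdx.x; i < perTraj; i += blockDim.x)
        mine[i] *= 0.99f;   // stand-in for one real LM update on this trajectory
}

// Host side: launching one LM iteration at a time for all trajectories keeps
// every block doing the same amount of work per launch, avoiding the
// starvation a thread-per-timestep scheme would cause when iteration counts
// differ across trajectories. Assumes N <= 1024 threads per block.
void solveAll(float* d_states, int numTraj, int N, int perTraj, int iters) {
    for (int it = 0; it < iters; ++it)
        lmIteration<<<numTraj, N>>>(d_states, perTraj);
    cudaDeviceSynchronize();
}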
6 Results and Applications

6.1 Comparison
Throughout this section, we compare our formulation with conventional formulations based on Eq. 2 and integrated using the Runge-Kutta method [4]. The same algorithm is implemented in [25, 29]. Note that the definition of the order of integration is different for the Runge-Kutta method and the position-based collocation method. The position-based collocation method of order K has accuracy similar to that of the Runge-Kutta method of order K − 1.
Fig. 2. (a): We plot the total kinetic+potential energy over time during a standard simulation of a 10-link chain that swings downward. Each joint of this chain is a 2-DOF ball joint so that this chain has 20-DOF. Forward Euler integrator for the Newton-Euler equation and semi-implicit Euler integrator are not stable. Being fully implicit, our second-order PBAD solver is stable but quickly loses energy. By increasing the order by one, both the second-order Runge-Kutta and our third-order PBAD solver preserve energy very well. (b): For the more challenging task of a 100-link chain (200-DOF) that swings downward, even the second-order Runge-Kutta method is not stable and we have to use the third-order Runge-Kutta method for better energy preservation. Our second-order PBAD solver is stable but quickly loses energy. Our third-order PBAD solver preserves energy very well. (c): We compare the total computational time for generating a 10 s trajectory of a 10-link chain swinging down using a second-order collocation method for PBAD and a semi-implicit Euler integrator for a conventional formulation. PBAD is 1.5–2.1 times slower at a small timestep size and up to 4 times faster at a large timestep size, such as 0.05 s.
All experiments are performed on a single desktop machine with a 4-core CPU (Intel i7-4790, 3.6 GHz) and a 3584-core GPU (Nvidia Titan-X), i.e. M = 3584.

Energy Preservation: We compare the accuracy of time integrators for our PBAD formulation and the conventional formulation. In Fig. 2(a), we plot the total kinetic+potential energy over time during a standard simulation of a 10-link chain (20-DOF) that swings downward (the same benchmark was used in [11]). The timestep size is 0.0025 s. We can see that PBAD is very stable and continuously loses energy (Fig. 2(a), purple). In contrast, low-order explicit integrators such as forward Euler and semi-implicit Euler are not stable. For better accuracy, we can increase the order of integration by one, resulting in much better performance in terms of energy preservation. In Fig. 2(b), we redo the experiment for a 100-link chain (200-DOF). This is more challenging and low-order explicit integrators are more unstable. The Runge-Kutta method for the Newton-Euler equation is stable at the third order. Although our second-order PBAD solver suffers a fast energy loss, increasing the order by one can significantly improve accuracy.

Timestep Size: In Fig. 2(c), we compare the total computational time for generating a 10 s trajectory of a 10-link chain that swings downward, using a second-order collocation method for PBAD and a semi-implicit Euler integrator for a conventional formulation. Each timestep of PBAD integration is costlier because multiple iterations of computations are needed to ensure the optimizer converges. For example, when we use timestep sizes of 0.001 s and 0.0025 s, the total computational time of the PBAD integrator is 1.5–2.1 times that of the semi-implicit integrator.
Fig. 3. We compare the performance of the two optimization algorithms (LM and LBFGS) during the simulation of a 10-link (20-DOF) (a) and a 40-link (80-DOF) chain (b) with a large timestep size of 0.05 s. The number of iterations used by LBFGS is much larger than that used by LM, although each iteration of LBFGS is cheaper. In addition, the number of iterations is almost independent of the number of links, N . (c): We plot the average time to finish one step of the simulation against the number of links, N . LBFGS is comparable to LM in terms of computational time and the computational time grows almost linearly with N in the range of N = 10–40. (d): We plot the average time to finish one step of the simulation against the timestep size, Δt. PBAD can be used with very large timestep sizes and we tested Δt = 0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.128 s. The computation time for each timestep is almost invariant to Δt.
(a)
(b)
(c)
(d)
Fig. 4. We compare the performance of CPU and GPU in simulating a chain swinging benchmark. (a): We plot the speedup against the number of links, N . The speedup increases with N and the maximal speedup over a 4-core CPU is 6 times. (b): When N = 10, we plot the speedup against the number of trajectories. The speedup also increases with the number of trajectories and the maximal speedup is 4 times. (c): We plot the total computational time against the number of links, N , for generating 100 trajectories of 10 timesteps each. When N = 40, the 100 trajectories can be generated in less than 1 s on GPU. (d): We plot the total computational time against the number of trajectories.
integrator can be more efficient under a larger timestep size, while 0.0025 s is the largest timestep size that works for the semi-implicit Euler integrator. At a timestep size of 0.05 s, the total computational time of the PBAD integrator is 0.21 times that of the semi-implicit integrator, leading to a 4 times speedup. Optimization Algorithm: We compare the performance of the two optimization algorithms (LM and LBFGS) on CPU. Figure 3(a, b) shows that, LBFGS generally takes 10 times more iterations than LM. In addition, PBAD integration performed using Eq. 7 as the objective function will require more iterations to converge than when using Eq. 5. Moreover, the numbers of iterations used by both algorithms are independent of the number of links, N . Considering the number of iterations as an invariant, the cost of LM grows as O(N 3 ) and the cost of LBFGS grows as O(N ) on CPU. However, Fig. 3(c)
Time Integrating Articulated Body Dynamics
685
Fig. 5. (a): A 4-linked swimmer is trained to swim forward using CMA-ES. Some optimized swimming gaits and the locus of the center-of-mass are shown. (b): A 4-linked spider is trained to walk forward. Some optimized walking gaits are shown. The notations used by our frictional contact force model, Eq. 9, are shown in (c). Our model is penalty-based. Both normal and tangential forces are related to the penetration depth d (P(qk+1 )) (blue). The tangential forces (red) are modelled as a velocity damping term.
shows that, when the number of links N < 40, the total computational time grows almost linearly. In particular, using LM to optimize Eq. 7 is costlier than other choices. Figure 3(c) also shows that the computation times of LBFGS and LM are comparable. Finally, PBAD can be used with very large timestep sizes, such as Δt = 0.128 s, shown in Fig. 3(d), and the average time to compute each timestep is almost invariant to the timestep size. Therefore, large timestep sizes lead to a reduction in total computation time but they also lead to a higher rate of numerical dissipation. GPU Acceleration: We compare the performance of our PBAD formulation on CPU and GPU. Our GPU implementation only provides acceleration when multiple trajectories are simulated simultaneously for different initial conditions, which is the case with many online/offline control algorithms such as [6, 28]. In Fig. 4(a, b), we show the speedup of our GPU implementation over a 4-core CPU. The speedup increases with both the number of links and the number of trajectories to be computed. The speedup is between 3–6 times. The total computational time for generating 100 trajectories of 10 timesteps each is plotted in Fig. 4(c). On GPU, generating these trajectories takes less than 1 s for N ≤ 40. Finally, in Fig. 4(d), we plot the total computational time against the number of trajectories to be computed when N = 10. Note that our GPU has 3584 cores and we can compute M/N = 358 trajectories in parallel. Therefore, when the number of trajectories increases from 100–300, more GPU cores are used and the total computational time does not increase. Therefore, the green curve in Fig. 4(d) is almost flat. 6.2 External Force Models and Applications in Controller Optimization To build a complete robot simulation system, it is essential to model external forces. In this last benchmark, we propose two force models that are compatible with PBAD formulations and parallel GPU implementations. Our first force model considers a 4-linked chain (9-DOF, including a 6-DOF rigid transformation and 3 hinge joints) immersed underwater, which is under constant fluid
686
Z. Pan and D. Manocha
drag forces. To model these drag forces, we use the following formulation of potential energy in Eq. 5: Qdrag (P(qk+1 ), P(qk )) = D
P(qk+1 ) − P(qk ) 2 , Δt
where D is the drag force coefficient. This term minimizes the velocity of p and can be considered as a damping force model. An integral of Qdrag over B can be written as a linear combination of Eq. 8 with different (a,b)-pairs as shown in our extended report [22] so that its value and derivatives can be computed using the techniques discussed in Sect. 5. We use CMA-ES [14] to optimize a controller for the swimmer to move forward and the results are shown in Fig. 5(a). Our second force model considers a 4-legged spider (18-DOF, including a 6-DOF rigid transformation, 4 ball joints, and 4 hinge joints) trying to move forward on the ground, which is under frictional contact forces. A previous method [26] handles frictional contact forces using complementary conditions, which requires a sequential algorithm. Instead, we use a penalty-based frictional contact model by using the following potential energy in Eq. 5: Qcontact (P(qk+1 ), P(qk )) = D1 d (P(qk+1 )) 2 2
+D2 d (P(qk+1 )) Proj
P(qk+1 ) − P(qk ) Δt
2 .
(9) Here D1 is the normal force penalty and d is the penetration depth, which is positive when P is inside obstacles and zero otherwise, as illustrated in Fig. 5(c). D2 is the frictional force penalty and Proj is the projection matrix to the tangential directions. The integral of Qcontact over B is replaced by a summation of a set of discrete contact points. The second term on the right-hand side of Eq. 9 approximates frictional forces by requiring tangential velocities to be small when a point P is inside any of the obstacles. We use policy gradient method [23] to optimize a controller for the spider to move forward; the results are shown in Fig. 5(b).
7 Conclusion, Limitations and Future Work In this paper, we present the PBAD reformulation of articulated body dynamics. Our reformulation casts the simulation as an energy minimization problem. As a result, offthe-shelf optimizers can be used to stably simulate articulated bodies under very large timestep sizes. Although each timestep of our algorithm requires more iterations than conventional methods, the overall speedup of our PBAD over conventional methods in various benchmarks is up to 4 times under very large timestep sizes, e.g., Δt = 0.1 s. Furthermore, our approach is GPU friendly and can be easily parallelized. We observe an additional 3–6 times speedup on a commodity GPU over a 4-core CPU. The parallel version of our PBAD solver can accelerate control algorithms such as model predictive control and reinforcement learning by simulating multiple trajectories simultaneously.
Time Integrating Articulated Body Dynamics
687
Our current formulation still has some limitations. First, numerical dissipation cannot totally be avoided, although we can reduce it using smaller timestep sizes or highorder collocation methods. Second, to recast the articulated body dynamics as an optimization problem and avoid high-order derivatives, we discretize the velocities in a Euclidean workspace, instead of using a Lie-Group structure [17]. As a result, our PBAD method can be less accurate compared with Lie-Group integrators. As part of future work, we would like to study various external force models that are compatible with the our PBAD formulation. A compatible force model should be stable under large timestep sizes. To this end, one method is to formulate the external force implicitly as a function of qk+1 Eq. 9. However, the accuracy of these force models have not been well studied. Acknowledgement. This research is supported in part by ARO grant W911NF16-1-0085, QNRF grant NPRP-5-995-2-415, and Intel.
References 1. Ascher, U.M., Petzold, L.R.: Computer Methods for Ordinary Differential Equations and Dfferential-Algebraic Equations, vol. 61. SIAM, Philadelphia (1998) 2. Bender, J., M¨uller, M., Macklin, M.: Position-based simulation methods in computer graphics. In: Zwicker, M., Soler, C. (eds.) EG 2015 - Tutorials. The Eurographics Association (2015) 3. Bouaziz, S., Martin, S., Liu, T., Kavan, L., Pauly, M.: Projective dynamics: fusing constraint projections for fast simulation. ACM Trans. Graph. 33(4), 154:1–154:11 (2014) 4. Butcher, J.: Numerical Methods for Ordinary Differential Equations, 2nd edn. Wiley, Chichester (2008) 5. Deul, C., Charrier, P., Bender, J.: Position-based rigid body dynamics. Comput. Anim. Virtual Worlds 27(2), 103–112 (2014) 6. Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: Proceedings of the 33rd International Conference on International Conference on Machine Learning, vol. 48, pp. 1329–1338. ICML 2016, JMLR.org (2016) 7. Featherstone, R.: A divide-and-conquer articulated-body algorithm for parallel o(log(n)) calculation of rigid-body dynamics. part 1: Basic algorithm. Int. J. Robot. Res. 18(9), 867–875 (1999) 8. Featherstone, R.: Rigid Body Dynamics Algorithms. Springer-Verlag, New York (2007) 9. Galoppo, N., Govindaraju, N.K., Henson, M., Manocha, D.: LU-GPU: efficient algorithms for solving dense linear systems on graphics hardware. In: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, p. 3. IEEE Computer Society (2005) 10. Gast, T.F., Schroeder, C., Stomakhin, A., Jiang, C., Teran, J.M.: Optimization integrator for large time steps. IEEE Trans. Vis. Comput. Graph. 21(10), 1103–1115 (2015) 11. Gayle, R., Lin, M.C., Manocha, D.: Adaptive dynamics with efficient contact handling for articulated robots. In: Robotics: Science and Systems, pp. 231–238 (2006) 12. Guo, B.Y., Wang, Z.Q.: Legendre-Gauss collocation methods for ordinary differential equations. Adv. Comput. Math. 30(3), 249–280 (2009) 13. Hahn, F., Martin, S., Thomaszewski, B., Sumner, R., Coros, S., Gross, M.: Rig-space physics. ACM Trans. Graph. 31(4), 72:1–72:8 (2012)
688
Z. Pan and D. Manocha
14. Hansen, N., Ostermeier, A.: Adapting arbitrary normal mutation distributions in evolution strategies: the covariance matrix adaptation. In: 1996 Proceedings of IEEE International Conference on Evolutionary Computation, pp. 312–317. IEEE (1996) 15. Kobilarov, M., Crane, K., Desbrun, M.: Lie group integrators for animation and control of vehicles. ACM Trans. Graph. 28(2), 16:1–16:14 (2009) 16. Krysl, P.: Dynamically equivalent implicit algorithms for the integration of rigid body rotations. Commun. Numer. Meth. Eng. 24(2), 141–156 (2008) 17. Lee, J., Liu, C.K., Park, F.C., Srinivasa, S.S.: A linear-time variational integrator for multibody systems. arXiv preprint arXiv:1609.02898 (2016) 18. Levenberg, K.: A method for the solution of certain non-linear problems in least squares. Q. Appl. Math. 2(2), 164–168 (1944) 19. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(1), 503–528 (1989) 20. Murray, R.M., Li, Z., Sastry, S.S.: A Mathematical Introduction to Robotic Manipulation. CRC Press, Boca Raton (1994) 21. Narain, R., Overby, M., Brown, G.E.: ADMM ⊇ projective dynamics: fast simulation of general constitutive models. In: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA 2016, pp. 21–28. Eurographics Association, Aire-laVille (2016) 22. Pan, Z., Manocha, D.: Time integrating articulated body dynamics using position-based collocation method. arXiv:1709.04145 (2018) 23. Peters, J., Schaal, S.: Policy gradient methods for robotics. In: 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2219–2225, October 2006 24. Schroeder, C.A.: Coupled Simulation of Deformable Solids, Rigid Bodies, and Fluids with Surface Tension. Stanford University (2011) 25. Smith, R.: Open dynamics engine (2008). http://www.ode.org/ 26. Stewart, D., Trinkle, J.C.: An implicit time-stepping scheme for rigid body dynamics with coulomb friction. In: 2000 Proceedings of IEEE International Conference on Robotics and Automation, ICRA 2000, vol. 1, pp. 162–169. IEEE (2000) 27. Stilman, M.: Task constrained motion planning in robot joint space. In: 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3074–3081, October 2007 28. Tassa, Y., Erez, T., Todorov, E.: Synthesis and stabilization of complex behaviors through online trajectory optimization. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4906–4913, October 2012 29. Todorov, E.: Convex and analytically-invertible dynamics with contacts and constraints: theory and implementation in MuJoCo. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 6054–6061, May 2014 30. Tomcin, R., Sibbing, D., Kobbelt, L.: Efficient enforcement of hard articulation constraints in the presence of closed loops and contacts. Comput. Graph. Forum 33(2), 235–244 (2014) 31. Weinstein, R., Teran, J., Fedkiw, R.: Dynamic simulation of articulated rigid bodies with contact and collision. IEEE Trans. Vis. Comput. Graph. 12(3), 365–374 (2006) 32. Yang, Y., Wu, Y., Pan, J.: Parallel dynamics computation using prefix sum operations. IEEE Robot. Autom. Lett. 2(3), 1296–1303 (2017) 33. Yang, Y., Wu, Y., Pan, J.: Unified GPU-parallelizable robot forward dynamics computation using band sparsity. IEEE Robot. Autom. Lett. 3(1), 203–209 (2018)
Efficient Computation of Higher-Order Variational Integrators in Robotic Simulation and Trajectory Optimization Taosha Fan(B) , Jarvis Schultz, and Todd Murphey Department of Mechanical Engineering, Northwestern University, 2145 Sheridan Road, Evanston, IL 60208, USA [email protected], {jschultz,t-murphey}@northwestern.edu
Abstract. This paper addresses the problem of efficiently computing higher-order variational integrators in simulation and trajectory optimization of mechanical systems as those often found in robotic applications. We develop O(n) algorithms to evaluate the discrete EulerLagrange (DEL) equations and compute the Newton direction for solving the DEL equations, which results in linear-time variational integrators of arbitrarily high order. To our knowledge, no linear-time higher-order variational or even implicit integrators have been developed before. Moreover, an O(n2 ) algorithm to linearize the DEL equations is presented, which is useful for trajectory optimization. These proposed algorithms eliminate the bottleneck of implementing higher-order variational integrators in simulation and trajectory optimization of complex robotic systems. The efficacy of this paper is validated through comparison with existing methods, and implementation on various robotic systems— including trajectory optimization of the Spring Flamingo robot, the LittleDog robot and the Atlas robot. The results illustrate that the same integrator can be used for simulation and trajectory optimization in robotics, preserving mechanical properties while achieving good scalability and accuracy.
1
Introduction
Variational integrators conserve symplectic form, constraints and energetic quantities [1–6]. As a result, variational integrators generally outperform the other types of integrators with respect to numerical accuracy and stability, thus permitting large time steps in simulation and trajectory optimization, which is useful for complex robotic systems [1–6]. Moreover, variational integrators can also be regularized for collisions and friction by leveraging the linear complementarity problem (LCP) formulation [7,8]. This material is based upon work supported by the National Science Foundation under award DCSD-1662233. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. c Springer Nature Switzerland AG 2020 M. Morales et al. (Eds.): WAFR 2018, SPAR 14, pp. 689–706, 2020. https://doi.org/10.1007/978-3-030-44051-0_40
690
T. Fan et al.
The computation of variational integrators is comprised of the discrete EulerLagrange equation (DEL) evaluation, the descent direction computation for solving the DEL equations and the DEL equation linearization. The computation of these three phases of variational integrators can be accomplished with automatic differentiation and our prior methods [2,4], both of which are O(n2 ) to evaluate the DEL equations and O(n3 ) to compute the Newton direction and linearize the DEL equations for an n-degree-of-freedom mechanical system. Recently, a linear-time second-order variational integrator was developed in [9], which uses the quasi-Newton method and works for small time steps and comparatively simple mechanical systems. Higher-order variational integrators are needed for greater accuracy in predicting the dynamic motion of robots [10,11]. However, the computation of higher-order variational integrators has rarely been addressed. The quasi-Newton method in [9] only applies to second-order variational integrators, and while automatic differentiation and our prior methods [2,4] are implementable for higher-order variational integrators, the complexity increases superlinearly as the integrator order increases. In this paper, we address the computation efficiency of higher-order variational integrators and develop: (i) an O(n) method for the evaluation of the DEL equations, (ii) an O(n) method for the computation of the Newton direction, and (iii) an O(n2 ) method for the linearization of the DEL equations. The proposed characteristics (i)–(iii) eliminate the bottleneck of implementing higher-order variational integrators in simulation and trajectory optimization of complex robotic systems, and to the best of our knowledge, no similar work has been presented before. In particular, we believe that the resulting variational integrator from (i) and (ii) is the first exactly linear-time implicit integrator of third or higher order for mechanical systems. The rest of this paper is organized as follows. Section 2 reviews higher-order variational integrators, the Lie group formulation of rigid body motion and the tree representation of mechanical systems. Sections 3 and 4 respectively detail the linear-time higher-order variational integrator and the quadratic-time linearization, which are the main contributions of this paper. Section 5 compares our work with existing methods, and Sect. 6 presents examples of trajectory optimization for the Spring Flamingo robot, the LittleDog robot and the Atlas robot. The conclusions are made in Sect. 7.
2
Preliminaries and Notation
In this section, we review higher-order variational integrators, the Lie group formulation of rigid body motion, and the tree representation of mechanical systems. In addition, notation used throughout this paper is introduced accordingly. 2.1
Higher-Order Variational Integrators
In this paper, higher-order variational integrators are derived with the methods in [1,12,13].
Efficient Computation of Higher-Order Variational Integrators
691
A trajectory (q(t), q(t)) ˙ where 0 ≤ t ≤ T of a forced mechanical system should satisfy the Lagrange-d’Alembert principle: T T L(q, q)dt ˙ + F(t) · δqdt = 0 (1) δS = δ 0
0
in which L(q, q) ˙ is the system’s Lagrangian and F(t) is the generalized force. Provided that the time interval [0, T ] is evenly divided into N sub-intervals with Δt = T /N , and each q(t) over [kΔt, (k + 1)Δt] is interpolated with s + 1 control points q k,α = q(tk,α ) in which α = 0, 1, · · · , s and kΔt = tk,0 < tk,1 < · · · < tk,s = (k + 1)Δt, then there are coefficients bαβ (0 ≤ α, β ≤ s) such that q(t ˙ k,α ) ≈ q˙k,α =
s 1 αβ k,β b q . Δt
(2)
β=0
In this paper, we assume that the quadrature points of the quadrature rule are also tk,α though our algorithms in Sects. 3 and 4 can be generalized for any quadrature rules. Then the Lagrange-d’Alembert principle Eq. (1) is approximated as δS ≈
N −1
s
wα δL(q k,α , q˙k,α ) + F(tk,α ) · δq k,α · Δt = 0,
(3)
k=0 α=0
in which wα are weights of the quadrature rule used for integration. In variational integrators, the discrete Lagrangian and the discrete generalized force are defined to be s Ld (q k,0 , q k,1 , · · · , q k,s ) = wα L(q k,α , q˙k,α )Δt (4) α=0
and Fdk,α (tk,α ) k,s k+1,0
t
=t
= w F(t )Δt, respectively. Note that by definition we have and q k,s = q k+1,0 , and as a result of Eq. (3), we obtain α
k,α
pk + D1 Ld (q k ) + Fdk,0 = 0,
(5a)
Dα+1 Ld (q k ) + Fdk,α = 0 ∀α = 1, · · · , s − 1,
(5b)
pk+1 = Ds+1 Ld (q k ) + Fdk,s
(5c)
in which pk is the discrete momentum, q k stands for the tuple (q k,0 , q k,1 , · · · , q k,α ), and Dα+1 Ld is the derivative with respect to q k,α . Note that Eq. (5) is known as the discrete Euler-Lagrangian (DEL) equations, which implicitly define an update rule (q k,0 , pk ) → (q k+1,0 , pk+1 ) by solving sn nonlinear equations from Eqs. (5a) and (5b). In a similar way, for mechanical systems with constraints h(q, q) ˙ = 0, we have pk + D1 Ld (q k ) + Fdk,0 + Ak,0 (q k,0 ) · λk,0 = 0, Dα+1 Ld (q k ) + Fdk,α + Ak,α (q k,α ) · λk,α = 0 ∀α = 1, · · · , s − 1,
(6a) (6b)
692
T. Fan et al.
pk+1 = Ds+1 Ld (q k ) + Fdk,s , hk,α (q k+1,α , q˙k+1,α ) = 0 k,α
k,α
(6c)
∀α = 1, · · · , s
(6d) k,α
in which A (q ) is the discrete constraint force matrix and λ is the discrete constraint force. The resulting higher-order variational integrator is referred as the Galerkin integrator [1,12,13], the accuracy of which depends on the number of control points as well as the numerical quadrature of the discrete Lagrangian. If there are s + 1 control points and the Lobatto quadrature is employed, then the resulting variational integrator has an accuracy of order 2s [12,13]. The Galerkin integrator includes the trapezoidal variational integrator and the Simpson variational integrator as shown in Examples 1 and 2, the DEL equations of which are given by Eqs. (5) and (6). Example 1. The trapezoidal variational integrator is a second-order integrator with two control points q k = (q k,0 , q k,1 ) such that q k,0 = q kΔt and q k,1 = q (k+ k,1 −q k,0 k,0 k,0 , and Ld (q k ) = Δt , q˙ ) + L(q k,1 , q˙k,1 ) . 1)Δt , q˙k,0 = q˙k,1 = q Δt 2 L(q Example 2. The Simpson variational integrator is a fourth-order integratorwith three control points q k = (q k,0 , q k,1 , q k,2 ) in which q k,0 = q kΔt , q k,0 = q (k + k,1 k,0 k,2 −q k,2 −q k,0 1 k,2 = q (k+1)Δt , q˙k,0 = 4q −3q , q˙k,1 = q Δt and q˙k,2 = 2 )Δt and q Δt k,0 k,2 k,1 q +3q −4q Δt k k,0 k,0 k,1 k,1 k,2 k,2 , and Ld (q ) = 6 L(q , q˙ ) + 4L(q , q˙ ) + L(q , q˙ ) . Δt 2.2
The Lie Group Formulation of Rigid Body Motion
The configurationof a rigid body g = (R, p) ∈ SE(3) can be represented as a Rp 4 × 4 matrix g = in which R ∈ SO(3) is a rotation matrix and p ∈ R3 is 0 1 a position vector. The body velocity of the rigid body v = (ω, vO ) ∈ Te SE(3) is an element of the Lie algebra and can be represented either as a 6 × 1 vector ω ˆ v O T or a 4 × 4 matrix vˆ = g −1 g˙ = v = (g −1 g) ˙ ∨ = ω T vO in which ω = 0 0 (ωx , ωy , ωz ) ∈ T⎤e SO(3) is the angular velocity, vO is the linear velocity, ω ˆ = ⎡ 0 −ωz ωy ⎣ ωz 0 −ωx ⎦ ∈ R3×3 , and the hat “∧” and unhat“∨” are linear operators −ωy ωx 0 that relate the vector and matrix representations. The same representation and operators also apply to the spatial velocity v = (ω, v O ) ∈ Te SE(3), whose 6 × 1 ˙ −1 )∨ and vˆ = vector and 4 × 4 matrix representations are respectively v = (gg −1 . gg ˙ In the rest of this paper, if not specified, vector representation is used for Te SE(3), such as v, v, etc., and the adjoint operators Adg and adv :Te SE(3) → R 0 and Te SE(3) can be accordingly represented as 6 × 6 matrices Adg = pˆR R ω ˆ 0 adv = v1 vˆ2 − vˆ2 vˆ1 )∨ . For consistence, such that v = Adg v and adv1 v2 = (ˆ vˆO ω ˆ
Efficient Computation of Higher-Order Variational Integrators
693
the dual Lie algebra Te∗ SE(3) uses the 6 × 1 vector representation as well. As a result, the body wrench F = (τ, fO ) ∈ Te∗ SE(3) is represented as a 6 × T T 1 vector F = τ T fO in which τ ∈ Te∗ SO(3) is the torque and fO is the linear force so that F, v = F T v. Moreover, we define the linear operator adD F : ˆ τˆ fO Te SE(3) → Te∗ SE(3) which is represented as a 6 × 6 matrix adD F = fˆO 0 D T so that F T adv1 v2 = v2T adD F v1 = −v1 adF v2 for v1 , v2 ∈ Te SE(3). The same representation and operators also apply to the spatial wrench F = Ad−T g F = (τ , f O ) which is paired with the spatial velocity v = Adg v. 2.3
The Tree Representation of Mechanical Systems
In general, a mechanical system with n inter-connected rigid bodies indexed as 1, 2, · · · , n can be represented through a tree structure so that each rigid body has a single parent and zero or more children [2,14], and such a representation is termed as tree representation. In this paper, the spatial frame is denoted as {0}, which is the root of the tree representation, and we denote the body frame of rigid body i as {i}, and the parent, ancestors, children and descendants of rigid body i as par(i), anc(i), chd(i) and des(i), respectively. Since all joints can be modeled using a combination of revolute joints and prismatic joints, we assume that each rigid body i is connected to its parent by a one-degree-of-freedom joint i which is either a revolute or a prismatic joint and parameterized by a real scalar qi ∈ R. As a result, the tree representation is parameterized with n generalized T coordinates q = q1 q2 · · · qn ∈ Rn . For each joint i, the joint twist with respect T to frame {0} and {i} are respectively denoted as 6 × 1 vectors S i = sTi nTi T and Si = sTi nTi in which si , si are 3 × 1 vectors corresponding to rotation and ni , ni are 3 × 1 vectors corresponding to translation. Note that Si , si and ni are constant by definition. Moreover, S i and Si are related as S i = Adgi Si where g ∈ SE(3) is the configuration of rigid body i, and S˙ = ad S , where i
i
vi
i
v i ∈ Te SE(3) is the spatial velocity of rigid body i. It is assumed without loss of generality in this paper that the origin of frame {i} is the mass center of rigid body i, and j ∈ des(i) only if i < j, or equivalently j ∈ anc(i) only if i > j. The rigid body dynamics can be computed through the tree representation. The configuration gi = (Ri , pi ) ∈ SE(3) of rigid body i is gi = gpar(i) gpar(i),i (qi ) in which gpar(i),i (qi ) = gpar(i),i (0) exp(Sˆi qi ) is the rigid body transformation from frame {i} to its parent frame {par(i)}, and the spatial velocity v i of rigid body i is v i = v par(i) +S i · q˙i . In addition, the spatial inertia matrix M i of rigid body {i} −1 with respect to frame {0} is M i = Ad−T gi Mi Adgi in which Mi = diag{Ii , mi I} ∈ 6×6 is the constant body inertia matrix of rigid body i, Ii ∈ R3×3 is the body R rotational inertia matrix, mi ∈ R is the mass and I ∈ R3×3 is the identity matrix. In rigid body dynamics, an important notion is the articulated body [14]. In terms of the tree representation, articulated rigid body i consists of rigid body i and all its descendants j ∈ des(i), and the interactions with articulated body
694
T. Fan et al.
i can only be made through rigid body i, which is known as the handle of the articulated body i. In the last thirty years, a number of algorithms for efficiently computing the rigid body dynamics have been developed based on tree representations and articulated bodies [14–16], making explicit integrators have O(n) complexity for an n-degree-of-freedom mechanical system. Even though the same algorithms might be used for the evaluation of implicit integrators, none of them can be used for the computation of the Newton direction for solving implicit integrators. If the residue is rk , the Newton direction of an implicit integrator is computed as δq k = −J (q k )−1 rk ; however, the Jacobian matrix J (q k ) is usually asymmetric and indefinite, and has a size greater than n × n for higher-order implicit integrators, which means that the computation of implicit integrators is distinct from explicit integrators whose computation is simply a combination of the algorithms in [14– 16] with an appropriate integration scheme. Furthermore, the computation of implicit integrators is much more complicated than the computation of forward and inverse dynamics and out of the scope of those algorithms in [14–16].
3
The Linear-Time Higher-Order Variational Integrator
In this and next section, we present the propositions and algorithms efficiently computing higher-order variational integrators, whose derivations are omitted due to space limitations. Though not required for implementation, we refer the reader to the supplementary appendix of this paper [17] for detailed proofs.1 In the rest of this paper, if not specified, we assume that the mechanical system has n degrees of freedom and the higher-order variational integrator has s + 1 control points q k,α = q(tk,α ) in which 0 ≤ α ≤ s. Note that the notation (·)k,α is used to denote quantities (·) associated with q k,α and tk,α , such as qik,α , gik,α , v k,α i , etc. 3.1
The DEL Equation Evaluation
To evaluate the DEL equations, the discrete articulated body momentum and discrete articulated body impulse are defined from the perspective of articulated bodies as follows. ∈ R6 for articuDefinition 1. The discrete articulated body momentum μk,α i k,α k,α k,α k,α k,α lated body i is defined to be μi = M i v i + j∈chd(i) μj in which M i and v k,α are respectively the spatial inertia matrix and spatial velocity of rigid body i i.
1
In addition to the proofs, the supplementary appendix [17] also contains the complete O(n) algorithms to compute the Newton direction for higher-order variational integrators.
Efficient Computation of Higher-Order Variational Integrators
695
Definition 2. Suppose F i (t) ∈ R6 is the sum of all the wrenches directly acting on rigid body i, which does not include those applied or transmitted through the joints that are connected to rigid body i. The discrete articulated body impulse k,α k,α k,α k,α Γ i ∈ R6 for articulated body i is defined to be Γ i = F i + j∈chd(i) Γ j k,α
in which F i = ω α F i (tk,α )Δt ∈ R6 is the discrete impulse acting on rigid body k,α k,α i. Note that F i (t), F i and Γ i are expressed in frame {0}. Remark 1. As for wrenches exerted on rigid body i, in addition to F i (t) which includes gravity as well as the external wrenches that directly act on rigid body i, there are also wrenches applied through joints, e.g., from actuators, and wrenches transmitted through joints, e.g., from the parent and children of rigid body i in the tree representation. k,α
and Γ i make it possible to evaluate It can be seen in Proposition 1 that μk,α i the DEL equations without explicitly calculating Dα+1 Ld (q k ) and Fdk,α in Eqs. (5) and (6). Proposition 1. If Qi (t) ∈ R is the sum of all joint forces applied to joint i and T ∈ Rn is the discrete momentum, the DEL equations Eq. pk = pk1 pk2 · · · pkn (5) can be evaluated as k,0 T
rik,0 = pki + S i
k,0
· Ωi
+
s
k,β T
a0β S i
· μk,β + Qk,0 i i ,
(7a)
β=0
rik,α
=
k,α T Si k,s T
pk+1 = Si i
·
k,α Ωi
k,s
· Ωi
+
+
s β=0 s
k,β T
aαβ S i
k,β T
asβ S i
· μk,β + Qk,α i i · μk,β + Qk,s i i
∀α = 1, · · · , s − 1,
(7b)
(7c)
β=0
in which rik,α is the residue of the DEL equations Eqs. (5a) and (5b), aαβ = k,α
k,α
wβ bβα , Ω i = wα Δt · adTvk,α · μk,α + Γ i , and Qk,α = ω α Qi (tk,α )Δt is the i i i discrete joint force applied to joint i. Proof. See [17, Section D.1].
In Eqs. (7a) and (7b), if all rik,α are equal to zero, a solution to the variational integrator as well as the DEL equations is obtained. All the quantities used in Proposition 1 can be recursively computed in the tree representation, therefore, we have Algorithm 1 that evaluates the DEL equations, which essentially consists of s + 1 forward passes from root to leaf nodes and s + 1 backward passes in the reverse order, thus totally takes O(sn) time. In contrast, automatic differentiation and our prior methods [2,4] take O(sn2 ) time to evaluate the DEL equations.
696
T. Fan et al.
Algorithm 1. Recursive Evaluation of the DEL Equations 1: initialize g0k,α = I and v k,α =0 0 2: for i = 1 → n do 3: for α = 0 → s do k,α k,α 4: gik,α = gpar(i) gpar(i),i (qik,α ) k,α
5:
Si
6:
q˙ik,α
k,α
= Adgk,α Si ,
Mi
i
=
1 Δt
s β=0
12: 13:
k,α Ωi
14:
end for
15:
rik,0 = pki + S i
16: 17:
gi
bαβ qik,β ,
7: end for 8: end for 9: for i = n → 1 do 10: for α = 0 → s do k,α μk,α = M i v k,α + 11: i i
−1 = Ad−T k,α Mi Ad k,α
v k,α i
j∈chd(i)
=
μk,α j ,
i
k,0
Ωi
+
k,α T
18:
end for
19:
= Si pk+1 i
k,s T
k,α
Ωi
k,s
Ωi
+
k,β T
β=0
for α = 1 → s − 1 do rik,α = S i
s
+
s β=0
s β=0
a0β S i
k,β T
aαβ S i k,β T
asβ S i
k,α
Γi
k,α
+ Si
· q˙ik,α
k,α
= Fi
+
j∈chd(i)
k,α
Γj
k,α Γi
= wα Δt · adTvk,α · μk,α + i k,0 T
gi
v k,α par(i)
· μk,β + Qk,0 i i · μk,β + Qk,α i i
· μk,β + Qk,s i i
20: end for
3.2
Exact Newton Direction Computation T From Eq. (5), the Newton direction δq k = δq k,1 T , · · · , δq k,s T ∈ Rsn is com−1
puted as δq k = −J k (q k )·rk in which J k (q k ) ∈ Rsn×sn is the Jacobian of Eqs. (5a) and (5b) with respect to control points q k,1 , · · · , q k,s , and rk ∈ Rsn is the residue of evaluating the DEL equations Eqs. (5a) and (5b) by Proposition 1. k,α In this section, we make the following assumption on F i and Qk,α i , which is general and applies to a large number of mechanical systems in robotics. Assumption 1. Let u(t) be the control inputs of the mechanical system, we k,α assume that the discrete impulse F i and discrete joint force Qk,α can be respeci k,α
k,α
k,α k,α k,α tively formulated as F i = F i (gik,α , v k,α ) and Qk,α = Qk,α i ,u i i (qi , q˙i , k,α k,α k,α = u(t ). u ) in which u −1
If Assumption 1 holds and J k (q k ) exists, it can be shown that [17, Algorithm B.1] computes the Newton direction for variational integrators in O(s3 n) time.
Efficient Computation of Higher-Order Variational Integrators
697
Proposition 2. For higher-order variational integrators of unconstrained −1 mechanical systems, if Assumption 1 holds and J k (q k ) exists, the Newton −1 direction δq k = −J k (q k ) · rk can be computed with [17, Algorithm B.1] in 3 O(s n) time.
Proof. See [17, Section D.2].
In [17, Algorithm B.1], the forward and backward passes of the tree structure take O(s2 n) time, and the n computations of the s×s matrix inverse takes O(s3 n) time, thus the overall complexity of [17, Algorithm B.1] is O(s3 n + s2 n). In contrast, automatic differentiation and our prior methods in [2,4] take O(s2 n3 ) time to compute J k (q k ) and another O(s3 n3 ) time to compute the sn × sn −1 matrix inverse J k (q k ), and the overall complexity is O(s3 n3 + s2 n3 ). Though the quasi-Newton method [9] is O(n) time for second-order variational integrator in which s = 1, it requires small time steps and can not be used for third- or higher-order variational integrators. Therefore, both Algorithm 1 and [17, Algorithm B.1] have O(n) complexity for a given s, which results in a linear-time variational integrator. Furthermore, Algorithm 1 and [17, Algorithm B.1] have no restrictions on the number of control points, which indicates that the resulting linear-time variational integrator can be arbitrarily high order. To our knowledge, this is the first exactly lineartime third- or higher-order implicit integrator for mechanical systems. 3.3
Extension to Constrained Mechanical Systems
Thus far all our discussions of linear-time variational integrators have been restricted to unconstrained mechanical systems. However, Algorithm 1 and [17, Algorithm B.1] can be extended to constrained mechanical systems as well. In terms of the DEL equation evaluation, the extension to constrained mechanical systems is immediate. From Eq. (6), we only need to add the constraint term Ak,α (q k,α ) · λk,α to the results of using Algorithm 1. If the variational integrator is second-order and the mechanical system has m constraints, it is possible to compute the Newton direction δq k+1 and δλk in O(mn)+O(m3 ) time using [17, Algorithm B.1]. In accordance with Eq. (6), δq k+1 and δλk should satisfy J k (q k ) · δq k+1 + Ak (q k ) · δλk = −rqk and Dhk (q k+1 , q˙k+1 ) · δq k+1 = −rck in which rqk and rck are equation residues. Then δq k+1 and δλk can be computed as follows: (i) compute δqrk+1 = −J k k −1
−1
·rqk with [17, Algorithm B.1]
which takes O(n) time; (ii) compute J · Ak by using [17, Algorithm B.1] m −1 k times which takes O(mn) time; (iii) compute δλk = Dhk · J k · Ak (rc + Dhk · −1
δqrk+1 ) which takes O(m3 ) time; iv) compute δq k+1 = δqrk+1 − J k · Ak · δλk . In regard to third- or higher-order variational integrators, if the constraints k k,α k,α are of hki (gik,α , v k,α i ) = 0 or hi (qi , q˙i ) = 0 or both for each i = 1, 2, · · · , n, [17, Algorithm B.1] can be used to compute the Newton direction δq k and δλk in a similar procedure to the second-order variational integrator. In next section, we will discuss the linearization of higher-order variational integrators in O(n2 ) time.
698
4
T. Fan et al.
The Linearization of Higher-Order Variational Integrators
The linearization of discrete time systems is useful for trajectory optimization, stability analysis, controller design, etc., which are import tools in robotics. From Eqs. (5) and (6), the linearization of variational integrators is comprised of the computation of D2 Ld (q k ), DFdk,α (tk,α ) and DAk,α (q k,α ). In most cases, DFdk,α (tk,α ) and DAk,α (q k,α ) can be efficiently computed in O(n2 ) time, therefore, the linearization efficiency is mostly affected by D2 Ld (q k ). It is by definition that the Lagrangian of a mechanical system is L(q, q) ˙ = K(q, q) ˙ − V (q) in which K(q, q) ˙ is the kinetic energy and V (q) is the potential energy, and from Eq. (4), the computation of D2 Ld (q k ) is actually to compute ∂K ∂ 2 K ∂ 2 K ∂ 2 K ∂V ∂ q˙2 , ∂ q∂q ˙ , ∂q∂ q˙ , ∂q 2 and ∂q 2 , for which we have Propositions 3 and 4 as follows. Proposition 3. For the kinetic energy K(q, q) ˙ of a mechanical system, ∂2K ∂2K ∂2K ∂ q∂q ˙ , ∂q∂ q˙ , ∂q 2
∂2K ∂ q˙2 ,
can be recursively computed with Algorithm 2 in O(n2 ) time.
Proof. See [17, Section D.3].
In the matter of potential energy V (q), we only consider the gravitational potential energy Vg (q), and the other types of potential energy can be computed in a similar way. Proposition 4. If g ∈ R3 is gravity, then for the gravitational potential energy ∂2V Vg (q), ∂q2g can be recursively computed with Algorithm 3 in O(n2 ) time. Proof. See [17, Section D.4].
In regard to Proposition 4 and Algorithm 3, we remind the reader of the notation introduced in Sects. 2.2 and 2.3 that mi ∈ R is the mass of rigid body i, pi ∈ R3 is the mass center of rigid body i as well as the origin of frame {i}, T ∈ R6 is the spatial Jacobian of joint i with respect to frame and S i = sTi nTi {0}. ∂2K ∂2K ∂2K ∂V 2 If ∂∂K q˙2 , ∂ q∂q ˙ , ∂q∂ q˙ , ∂q 2 and ∂q 2 are computed in O(n ) time, then according to Eqs. (2) and (4), the remaining computation of D2 Ld (q k,α ) is simply the application of the chain rule. Therefore, if the variational integrator has s + 1 control points, the complexity of the linearization is O(s2 n2 ). In contrast, automatic differentiation and our prior methods [2,4] take O(s2 n3 ) time to linearize the variational integrators.
5
Comparison with Existing Methods
The variational integrators using Algorithms 1 to 3 and [17, Algorithm B.1] are compared with the linear-time quasi-Newton method [9], automatic differentiation and the Hermite-Simpson direct collocation method, which verifies the accuracy, efficiency and scalability of our work. All the tests are run in C++ on a 3.1 GHz Intel Core Xeon Thinkpad P51 laptop.
Efficient Computation of Higher-Order Variational Integrators
699
∂2K ∂2K ∂2K ∂2K ∂ q˙2 , ∂ q∂q ˙ , ∂q∂ q˙ , ∂q 2
Algorithm 2. Recursive Computation of 1: initialize g0 = I and v 0 = 0 2: for i = 1 → n do 3: gi = gpar(i) gpar(i),i (qi ) −1 4: M i = Ad−T gi Mi Adgi , S i = Adgi Si 5: v =v + S · q˙ , S˙ = ad S i
par(i)
i
i
i
vi
i
6: end for 2 ∂2 K ∂2 K = 0, ∂q∂ = 0, 7: initialize ∂∂ q˙K 2 = 0, ∂ q∂q ˙ q˙ 8: for i = n → 1 do μi = M i v i + μj , Mi = M i + 9: 10: 11: 12: 13:
j∈chd(i) B Mi S i , Mi
∂2 K ∂q 2
=0
j∈chd(i)
Mj
A Mi = = Mi S˙ i − adD μi S i for j ∈ anc(i) ∪ {i} do 2 T A ∂2 K = ∂∂q˙j ∂Kq˙i = S j Mi ∂ q˙i ∂ q˙j ∂2 K ∂ q˙i ∂qj
=
2
∂ K 14: = ∂qi ∂qj 15: end for 16: end for
∂2 K ∂qj ∂ q˙i ∂2 K ∂qj ∂qi
T A = S˙ j Mi , T B = S˙ j Mi
∂2 K ∂qi ∂ q˙j
Algorithm 3. Recursive Computation of
=
∂2 K ∂ q˙j ∂qi
T
B
= S j Mi
∂ 2 Vg ∂q 2
initialize g0 = I for i = 1 → n do gi = gpar(i) gpar(i),i (qi ), S i = Adgi Si end for ∂2 V initialize ∂q2g = 0 for i = n → 1 do σ mi = mi + σ mj , σ pi = mi pi + σ pj j∈chd(i) j∈chd(i) ˆ pi · si ˆ σ mi · ni − σ 8: σA i = g 9: for j ∈ anc(i) ∪ {i} do 1: 2: 3: 4: 5: 6: 7:
10:
∂ 2 Vg ∂qi ∂qj
=
∂ 2 Vg ∂qj ∂qi
= sTj · σ A i
11: end for 12: end for
5.1
Comparison with the Linear-Time Quasi-Newton Method
In this subsection, we compare the O(n) Newton method using Algorithm 1 and [17, Algorithm B.1] with the O(n) quasi-Newton method in [9] on the trapezoidal variational integrator (Example 1) of a 32-link pendulum with different time steps. In the comparison, 1000 initial joint angles q 0 and joint velocities q˙0 are uniformly sampled from [− π2 , π2 ] for each of the selected time steps, which are
700
T. Fan et al.
Fig. 1. The comparison of the O(n) Newton method with the O(n) quasi-Newton method [9] for the trapezoidal variational integrator of a 32-link pendulum with different time steps. The results of computational time are in (a), number of iterations in (b) and success rates in (c). Each result is calculated over 1000 initial conditions.
0.001 s, 0.002 s, 0.005 s, 0.01 s, 0.02 s, 0.03 s, 0.04 s, 0.05 s and 0.06 s, and the Newton and quasi-Newton methods are used to solve the DEL equations for one time step. The results are in Fig. 1, in which the computational time and the number of iterations are calculated only over initial conditions that the DEL equations are successfully solved. It can be seen that the Newton method using Algorithm 1 and [17, Algorithm B.1] outperforms the quasi-Newton method in [9] in all aspects, especially for relatively large time steps. 5.2
Comparison with Automatic Differentiation
In this subsection, we compare Algorithms 1 to 3 and [17, Algorithm B.1] with automatic differentiation for evaluating the DEL equations, computing the Newton direction and linearizing the DEL equations. The variational integrator used is the Simpson variational integrator (Example 2). In the comparison, we use pendulums with different numbers of links as benchmark systems. For each pendulum, 100 initial joint angles q 0 and joint velocities q˙0 are uniformly sampled from [− π2 , π2 ]. The results are in Fig. 2 and it can be seen that our recursive algorithms are much more efficient, which is consistent with the fact that Algorithms 1 to 3 and [17, Algorithm B.1] are O(n) for evaluating the DEL equations, O(n) for computing the Newton direction, and O(n2 ) for linearizing the DEL equations, whereas automatic differentiation are O(n2 ), O(n3 ) and O(n3 ), respectively. 5.3
Comparison with the Hermite-Simpson Direct Collocation Method
In this subsection, we compare the fourth-order Simpson variational integrator (Example 2) with the Hermite-Simpson direct collocation method, which is a third-order implicit integrator commonly used in robotics for trajectory
Efficient Computation of Higher-Order Variational Integrators
701
Fig. 2. The comparison of our recursive algorithms with automatic differentiation for pendulums with different numbers of links. The variational integrator used is the Simpson variational integrator. The results of evaluating the DEL equations are in (a), computing the Newton direction in (b) and linearizing the DEL equations in (c). Each result is calculated over 100 initial conditions.
optimization [10,11].2 Note that both integrators use three control points for integration. The strict comparison of the two integrators for trajectory optimization is usually difficult since it depends on a number of factors, such as the target problem, the optimizers used, the optimality and feasibility tolerances, etc. Therefore, we compare the Simpson variational integrator and the Hermite-Simpson direct collocation method by listing the order of accuracy, the number of variables and the number of constraints for trajectory optimization. In general, the computational loads of optimization depends on the problem size that is directly related with the number of variables and the number of constraints. The higher-order accuracy suggests the possibility of large time steps in trajectory optimization, which reduces not only the problem size but the computational loads of optimization as well. The results are in Table 1.3 It can be concluded that the Simpson variational integrator is more accurate and has less variables and constraints in trajectory optimization, especially for constrained mechanical systems. The accuracy comparison in Table 1 of the Simpson variational integrator with the Hermite-Simpson direct collocation method is further numerically validated on a 12-link pendulum. In the comparison, different time steps are used to simulate 100 trajectories with the final time T = 10 s, and the initial joint 2
3
The Hermite-Simpson direct collocation methods used in [10, 11] are actually implicit integrators that integrate the trajectory as a second-order system in the (q, q) ˙ space, whereas the variational integrators integrate the trajectory in the (q, p) space. The explicit and implicit formulations of the Hermite-Simpson direct collocation methods differ in whether the joint acceleration q¨ is explicitly computed or implicitly involved as extra variables. Even though the explicit formulation of the HermiteSimpson direct collocation has less variables and constraints than the implicit formulation, it is usually more complicated for the evaluation and linearization, therefore, the implicit formulation is usually more efficient and more commonly used in trajectory optimization [11].
702
T. Fan et al.
Table 1. The comparison of the Simpson variational integrator with the HermiteSimpson direct collocation method for trajectory optimization. The trajectory optimization problem has N stages and the mechanical system has n degrees of freedom, m holonomic constraints and is fully actuated with n control inputs. Note that both integrators use three control points for integration. Integrator
Accuracy # of variables
Variational integrator
4th-order (4N + 3)n + (2N + 1)m 3N n + (2N + 1)m
# of constraints
Direct collocation (explicit) 3rd-order (6N + 3)n + (2N + 1)m 4N n + (6N + 3)m Direct collocation (implicit) 3rd-order (8N + 4)n + (2N + 1)m (6N + 1)n + (6N + 3)m
Fig. 3. The comparison of the Simpson variational integrator with the HermiteSimpson direction collocation method on a 12-link pendulum with different time steps. The results of the integrator error are in (a), the computational time in (b) and the integration error v.s. computational time in (c). Each result is calculated over 100 initial conditions. π π angles q 0 are uniformly sampled from [− 12 , 12 ] and the initial joint velocities q˙0 are zero. Moreover, the Simpson variational integrator uses Algorithm 1 and [17, Algorithm B.1] which has O(n) complexity for the integrator evaluation and the Newton direction computation, whereas the Hermite-Simpson direct collocation method uses [14,18] which is O(n) for the integrator evaluation and O(n3 ) for the Newton direction computation. For each initial condition, the benchmark solution qd (t) is created from the Hermite-Simpson direct collocation method with a time step of 5 × 10−4 s and the simulation error in q(t) is evaluated as 1 T T 0 q(t) − qd (t) dt. The running time of the simulation is also recorded. The results are in Fig. 3, which indicates that the Simpson variational integrator is more accurate and more efficient in simulation, and more importantly, a better alternative to the Hermite-Simpson direction collocation method for trajectory optimization. In regard to the integrator evaluation and linearization, for unconstrained mechanical systems, experiments (not shown) suggest that the Simpson variational integrator using Algorithms 1 to 3 is usually faster than the HermiteSimpson direct collocation method using [14,18] even though theoretically both integrators have the same order of complexity. However, for constrained
Efficient Computation of Higher-Order Variational Integrators
703
mechanical systems, if there are m holonomic constraints, the Simpson variational integrator is O(mn) for the evaluation and O(mn2 ) for the linearization while the Hermite-Simpson direct collocation method in [10,11] is respectively O(mn2 ) and O(mn3 ), the difference of which results from that the HermiteSimpson direct collocation method is more complicated to model the constrained dynamics.
6
Implementation for Trajectory Optimization
In this section, we implement the fourth-order Simpson variational integrator (Example 2) with Algorithms 1 to 3 on the Spring Flamingo robot [19], the LittleDog robot [20] and the Atlas robot [21] for trajectory optimization, the results of which are included in our supplementary videos. It should be noted that the variational integrators used in [2–4,6,8] for trajectory optimization are second order. In Sects. 6.1 and 6.2, a LCP formulation similar to [8] is used to model the discontinuous frictional contacts with which no contact mode needs to be prespecified. These examples indicate that higher-order variational integrators are good alternatives to the direct collocation methods [10,11]. The trajectory optimization problems are solved with SNOPT [22]. 6.1
Spring Flamingo
The Spring Flamingo robot is a 9-DoF flat-footed biped robot with actuated hips and knees and passive springs at ankles [19]. In this example, the Spring Flamingo robot is commanded to jump over an obstacle that is 0.16 m high while walking horizontally from one position to another. The results are in Fig. 4, in which the initial walking velocity is 0.26 m/s and the average walking velocity is around 0.9 m/s.
(a) t = 0 s
(b) t = 0.13 s
(c) t = 0.33 s
(d) t = 0.44 s
(e) t = 0.57 s
(f) t = 0.68 s
(g) t = 0.88 s
(h) t = 1.1 s
Fig. 4. The Spring Flamingo robot jumps over a obstacle of 0.16 m high.
704
T. Fan et al.
(a) t = 0 s
(b) t = 0.48 s
(c) t = 0.56 s
(d) t = 1.04 s
(e) t = 1.84 s
(f) t = 2.16 s
(g) t = 2.72 s
(h) t = 3.2 s
Fig. 5. The LittleDog robot walks over terrain with gaps.
(a) t = 0s
(b) t = 0.4 s
(c) t = 0.6 s
(d) t = 1.3 s
Fig. 6. The Atlas robot picks a red ball while keeping balanced with a single foot.
6.2
LittleDog
The LittleDog robot is 18-DoF quadruped robot used in research of robot walking [20]. In this example, the LittleDog robot is required to walk over terrain with two gaps. The results are in Fig. 5, in which the average walking velocity is 0.25 m/s. 6.3
Atlas
The Atlas robot is a 30-DoF humanoid robot used in the DARPA Robotics Challenge [21]. In this example, the Atlas robot is required to pick a red ball with its left hand while keeping balanced only with its right foot. Moreover, the contact wrenches applied to the supporting foot should satisfy contact constraints of a flat foot [11]. The results are in Fig. 6 and it takes around 1.3 s for the Atlas robot to pick the ball.
Efficient Computation of Higher-Order Variational Integrators
7
705
Conclusion
In this paper, we present O(n) algorithms for the linear-time higher-order variational integrators and O(n2 ) algorithms to linearize the DEL equations for use in trajectory optimization. The proposed algorithms are validated through comparison with existing methods and implementation on robotic systems for trajectory optimization. The results illustrate that the same integrator can be used for simulation and trajectory optimization in robotics, preserving mechanical properties while achieving good scalability and accuracy. Furthermore, thought not presented in this paper, these O(n) algorithms can be regularized for parallel computation, which results in O(log(n)) algorithms with enough processors.
References 1. Marsden, J.E., West, M.: Discrete mechanics and variational integrators. Acta Numer. 10, 357–514 (2001) 2. Johnson, E.R., Murphey, T.D.: Scalable variational integrators for constrained mechanical systems in generalized coordinates. IEEE Trans. Robot. 25(6), 1249– 1261 (2009) 3. Kobilarov, M., Crane, K., Desbrun, M.: Lie group integrators for animation and control of vehicles. ACM Trans. Graph. (TOG) 28(2), 16 (2009) 4. Johnson, E., Schultz, J., Murphey, T.: Structured linearization of discrete mechanical systems for analysis and optimal control. IEEE Trans. Autom. Sci. Eng. 12, 140–152 (2015) 5. Fan, T., Murphey, T.: Structured linearization of discrete mechanical systems on lie groups: a synthesis of analysis and control. In: IEEE Conference on Decision and Control (CDC), pp. 1092–1099 (2015) 6. Junge, O., Marsden, J.E., Ober-Bl¨ obaum, S.: Discrete mechanics and optimal control. IFAC Proc. Vol. 38(1), 538–543 (2005) 7. Lacoursiere, C.: Ghosts and machines: regularized variational methods for interactive simulations of multibodies with dry frictional contacts. Ph.D. thesis, Datavetenskap (2007) 8. Manchester, Z., Kuindersma, S.: Variational contact-implicit trajectory optimization. In: International Symposium on Robotics Research (ISRR) (2017) 9. Lee, J., Karen Liu, C., Park, F.C., Srinivasa, S.S.: A linear-time variational integrator for multibody systems. In: International Workshop on the Algorithmic Foundations of Robotics (WAFR) (2016) 10. Posa, M., Kuindersma, S., Tedrake, R.: Optimization and stabilization of trajectories for constrained dynamical systems. In: IEEE International Conference on Robotics and Automation (ICRA) (2016) 11. Hereid, A., Hubicki, C.M., Cousineau, E.A., Ames, A.D.: Dynamic humanoid locomotion: a scalable formulation for HZD gait optimization. IEEE Trans. Robot. 34, 370–387 (2018) 12. Ober-Bl¨ obaum, S., Saake, N.: Construction and analysis of higher order Galerkin variational integrators. Adv. Comput. Math. 41(6), 955–986 (2015) 13. Ober-Bl¨ obaum, S.: Galerkin variational integrators and modified symplectic Runge-Kutta methods. IMA J. Numer. Anal. 37(1), 375–406 (2017) 14. Featherstone, R.: Rigid Body Dynamics Algorithms. Springer, Boston (2014)
706
T. Fan et al.
15. Yamane, K., Nakamura, Y.: Efficient parallel dynamics computation of human figures. In: IEEE International Conference on Robotics and Automation (ICRA). IEEE (2002) 16. Fijany, A., Sharf, I., D’Eleuterio, G.M.T.: Parallel O (log n) algorithms for computation of manipulator forward dynamics. IEEE Trans. Robot. Autom. 11(3), 389–400 (1995) 17. Fan, T., Schultz, J., Murphey, T.D.: Efficient computation of variational integrators in robotic simulation and trajectory optimization. Appendix. https://arxiv.org/ pdf/1904.12756.pdf 18. Carpentier, J.: Analytical derivatives of rigid body dynamics algorithms. In: Robotics: Science and Systems (RSS) (2018) 19. Pratt, J., Pratt, G.: Intuitive control of a planar bipedal walking robot. In: IEEE International Conference on Robotics and Automation (ICRA) (1998) 20. Shkolnik, A., Levashov, M., Manchester, I.R., Tedrake, R.: Bounding on rough terrain with the LittleDog robot. Int. J. Robot. Res. 30(2), 192–215 (2011) 21. Nelson, G., Saunders, A., Neville, N., Swilling, B., Bondaryk, J., Billings, D., Lee, C., Playter, R., Raibert, M.: Petman: a humanoid robot for testing chemical protective clothing. J. Robot. Soc. Jpn. 30(4), 372–377 (2012) 22. Gill, P.E., Murray, W., Saunders, M.A.: SNOPT: an SQP algorithm for large-scale constrained optimization. SIAM Rev. 47(1), 99–131 (2005)
Multi-Robot and Multi-Body Systems
Interlocking Block Assembly Yinan Zhang(B) and Devin Balkcom Dartmouth College, Hanover, USA [email protected]
Abstract. This paper presents a design for interlocking blocks and an algorithm that allows these blocks to be assembled into desired shapes. During and after assembly, the structure is kinematically interlocked if a small number of blocks are immobilized relative to other blocks. There are two types of blocks: cubes and double-height posts, each with a particular set of male and female joints. Layouts for shapes involving thousands of blocks have been planned automatically, and shapes with several hundred blocks have been built by hand. As a proof of concept, a robot was used to assemble sixteen blocks. The paper also describes a method for assembling blocks in parallel.
1 Introduction
The goal of the work described in this paper is to enable robotic assembly of large structures from blocks that interlock without the need for glue, cement, screws, or other connectors. Figure 1 shows two models for which layouts and assembly plans were generated automatically by the presented algorithm. The motion of blocks is constrained by joints; later blocks reinforce and immobilize prior blocks. Each structure has a few blocks that can move, called keys; the bunny has two keys, in the ears, and the chair has one, at the top of the back of the chair. If the keys are immobilized, the structure is rigidly interlocked.

Kinematic interlock presents some advantages over traditional connection methods such as glue, cement, screws, nails, or friction locks. The interlocks may be structurally strong, allow simple assembly by robots, allow disassembly and re-use of the components, and may be suitable for underwater or other environments where adhesives are ineffective. Relative to 3D printing, fabricating in parts may present some advantages. Individual components may be fabricated efficiently, packed for storage and transport, repaired or replaced as needed, and design changes remain possible.

The algorithm described in this paper takes a voxelized 3D model as input, and finds an assembly plan such that the interlocked structure covers the specified voxels. There are two types of block: 1 × 1 × 1 cubes, and 1 × 1 × 2 posts, with connectors arranged in a particular way. The assembly requires only translation motions. The concept of interlocking block assembly was previously presented by the authors in [28]. However, the technical work in the current paper is effectively
entirely new. New block designs and layout algorithms enabled the reduction of the types of blocks needed from nine to two, and have allowed structures that appear to be more robust and easier to assemble. The paper also explores construction of physical structures much larger than previously built (406 pieces compared to 64), as well as a more convincing demonstration of robotic assembly. New theoretical contributions include an analysis of how blocks may be assembled in parallel, speeding up assembly.
Fig. 1. Models assembled. (a) Stanford Bunny assembled in simulation. (b) Chair model after automatic layout. (c) Chair assembled by hand.
2 Limitations
This paper focuses almost entirely on the geometry of a particular design of blocks, and associated layout algorithms. As such, many critical issues that would need to be solved for a practical system have been neglected. Chief among these is the need for analysis of the rigidity and robustness of the final structures. The physical experiments conducted use 3D printed blocks, and are no more than a proof of concept, using a very small number of blocks. Although the algorithm presented can build essentially arbitrary voxelized structures, overhanging components of layers are not interlocked until a second identical layer is placed above. This means that some external (though temporary) means of support is needed during construction, just as in 3D printing. We believe that modifications to the algorithm and block designs would allow overhanging layers to be built so that they are stably interlocked during construction, but we have not made these modifications. We would also like to gain a better understanding of how to design and analyze block types and layout algorithms. Effectively, the block types designed are the result of trial-and-error and creative thought; we do not present significant mathematical or mental tools to find other block designs.
3 Related Work
Interlocking structures have a long history. Wood joints such as the dovetail and the mortise and tenon are used in carpentry around the world; in China and Japan, complex interlocking designs have permitted the construction of wooden buildings with no screws or nails [29]. In the paleontology community, evidence has recently been presented that supports a hypothesis that the backbones of theropod dinosaurs interlocked to provide support for the extremely large body mass [23].

The present work is closest in spirit to Song et al. [6,15–17,19], which consider the problem of designing reusable components to be assembled into different forms relying on geometric constraints; the primary contribution of the current work is a universal block design and layout algorithm that allows construction of arbitrary geometries. Yao et al. [24] proposed a method for interactively designing joints for structures and analyzing their stability. Kong et al. applied curve matching techniques for finding solutions for assembly of 2D and 3D interlocking puzzles; the layout algorithms considered in the current paper generate assembly motions together with the design.

Robotic construction research dates back to the 1990s [1], when Andres et al. created a prototype, ROCCO, capable of gripping and laying bricks. The same robotic system was later applied to site assembly operations by Balaguer et al. [4]. More recent works include DimRob, a system with an industrial robot arm mounted on a mobile platform (Helm et al. [8]) used for construction tasks. This prototype was later developed into a mobile robot, In situ Fabricator, for construction at 1:1 scale (Giftthaler et al. [7]). Willmann et al. [22], for example, used autonomous flying vehicles to lift and position small building elements. Augugliaro et al. [3] demonstrated a system of multiple quadrocopters precisely laying out foam blocks forming a desired shape. Lindsey et al. [10] built cubic structures using quadrotors. Augugliaro et al. [2] explored another approach to construction: quadcopters assembled a rope bridge capable of supporting people. Keating et al. [9] built a large mobile 3D printer using a robot arm to extrude adhesive materials.

Instead of focusing on the robot control system that carries building elements, some researchers designed new building elements. Rus and Vona [13] developed Crystalline, a modular robot with a 3-DoF actuation mechanism allowing it to make and break connections with other identical units; a set of such robots forms a self-reconfigurable robot system. White et al. [21] introduced two three-dimensional stochastic modular robot systems that are self-reconfigurable and self-assembling by successive bonding and release of free-floating units. Romanishin et al. [11] proposed a momentum-driven modular robot. SamBot, a cube-shaped modular robot with a rotation mechanism, was introduced by Wei et al. [20]. Daudelin et al. [5] present a self-reconfigurable system that integrates perception, mission planning, and modular robot hardware. Tosun et al. [18] created a design framework for rapid creation and verification of modular robots. Inspired by LEGO, Schweikardt et al. [14] proposed a robotic construction kit, roBlocks, with programmable cubic blocks for educational purposes.
Kilobot, a swarm of 1000 crawling mobile robots, was introduced by Rubenstein et al. [12], along with planning algorithms that allow Kilobots to form 2D shapes.
4 Interlocking Blocks and Constraint Graph
A block is a rigid body that has male and female joints allowing assembly with other blocks, typically by sliding one block against the other using a simple translation. Figure 2 shows three different joint pairs that may connect blocks in the system we describe in the paper. A mortise and tenon joint pair (Fig. 2a) allows blocks to be disassembled only in the non-penetrating normal direction of the contact surface. A dovetail joint pair (Fig. 2b) allows block motion only in a particular tangential direction. The third joint pair we use in the current design is a two-way joint (Fig. 2c), which allows motions of associated blocks in both normal and tangential directions. The dovetail and mortise and tenon joints fully constrain rotational motion, but the two-way joint permits one rotational degree of freedom. We manufactured the joints so that the front part in the insertion direction is thinner, reducing surface contact and providing better error tolerance. See Fig. 2d and e.
Fig. 2. Three different joint pairs and detailed design. (a) Mortise and tenon joints. (b) Dovetail joints. (c) Two-way joints. (d) Tenon joint design. (e) Dovetail joint design.
Blocks are assembled into a structure in order, and the last block assembled can be removed by reversing the most recent translational assembly motion. Therefore, the last block assembled must be attached to the structure using glue, friction, a screw, or some other external method; we call such a block a key. Figure 3 shows a 2D projection of an interlocking structure assembled using blocks with dovetail and mortise and tenon joints. First, block 2 is assembled to block 1, using a tenon joint on the top of the blocks, and moving block 2 in the positive y direction, assuming a coordinate frame aligned with the page. Block 3 then slides in and connects with block 2 through another tenon joint. The final block is assembled from the top down, connecting blocks 1 and 3 using two dovetail joints and constraining the motion of blocks 2 and 3 in the y negative and x positive directions.
Fig. 3. Assembling an interlocking 4-block square. (a) Tenon joint. (b) Tenon joint. (c) Dovetail joint assembled from top down. Arrows indicate the assembly direction. Block 4 is the key.
4.1 The Constraint Graph
To better understand how joints constrain motions, we represent a structure using a directed graph. Each vertex in the graph represents a block. A pair of directed edges is added between vertices corresponding to blocks that are in contact; w(ei,j) denotes the set of permitted motions of block j relative to block i.
Fig. 4. A four-block interlocking structure and its graph representation. (a) A 2 × 2 interlocking structure. Numbers indicate the order. Block 4 is the key. (b) Graph representation. Each edge allows some motions for associated blocks.
Fig. 5. Any interlocking substructure can be viewed as two nodes in the graph, which simplifies the graph representation. (a) Two interlocking substructures connected by a dovetail joint and a mortise and tenon joint. (b) Forming a larger interlocked structure from two interlocked structures.
Consider a partition of the graph into some non-overlapping subsets of vertices. A partition is separable if there exists a motion that satisfies constraints by all in-edges along the boundary of the subset of vertices. In this particular work, the block and joint design limit motions of every block to translations
directly along axes, simplifying the analysis. (The two-way joint is used only as an auxiliary connection for blocks whose motions are already constrained to pure axis-aligned translations by other joints.) Figure 4b shows an example of a constraint graph for the previous example of a four-block interlocking structure. Consider partitioning the structure into two parts {1, 2} and {3, 4}. These parts are inseparable, since w(e2,3) ∩ w(e1,4) = {x+} ∩ {z+} = ∅. By checking more partitions of the structure, we find that only {1, 2, 3} and {4} are separable. If block 4 is attached to either of its neighbors, the structure is rigid. We call a structure k-interlocked if, when k keys are attached to their neighbors, no partition is separable. The example structure is 1-interlocked with block 4 as the key.

Analyzing a large structure gets difficult when there are a large number of blocks, because the number of possible partitions of the graph increases exponentially. Fortunately, a proof that a complete structure is interlocked can be accomplished in a hierarchical fashion, by first showing that smaller components are interlocked, and then using those components to build larger interlocking structures. Figure 5 shows an example of how a larger interlocking structure can be built from smaller interlocking substructures. Interlocking substructures A and B are similar to those shown in Fig. 4a; the careful eye may note some additional geometry on each block representing dovetail joints attached from the side; these joints provide some redundant constraints that add rigidity to the final structure. The keys of the substructures are KA and KB, and are not considered to be part of A and B. To show that the entire structure is 1-interlocked by KB, it is sufficient to consider only partitions that separate KA from A or KB from B, since A and B act as rigid bodies if their keys are not separated. Figure 5b shows the graph representation.
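To make the separability test concrete, here is a minimal sketch in Python (ours, not the authors' implementation): it intersects the permitted-motion sets over all in-edges crossing into a candidate subset, and an empty intersection means the subset cannot be translated away as a rigid body. Only w(e2,3) and w(e1,4) below are taken from the text; the other edge labels are illustrative assumptions.

# Sketch: separability test on the constraint graph.
# motions maps a directed edge (i, j) to the set of translations
# that block j may perform relative to block i, e.g. {"x+", "z-"}.
def separable(subset, motions):
    allowed = None
    for (i, j), w in motions.items():
        if i not in subset and j in subset:   # in-edge crossing the boundary
            allowed = set(w) if allowed is None else allowed & set(w)
            if not allowed:
                return set()                  # no common motion: inseparable
    return allowed or set()

# Four-block example of Fig. 4, checking the partition {1, 2} | {3, 4}.
motions = {(2, 3): {"x+"}, (3, 2): {"x-"},    # from the text
           (1, 4): {"z+"}, (4, 1): {"z-"},    # from the text
           (1, 2): {"y+"}, (2, 1): {"y-"}}    # illustrative
print(separable({3, 4}, motions))             # set(): {3, 4} is inseparable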
4.2 Overview of the Layout Algorithm
The example above suggests an approach to constructing large interlocked structures. We can build 4-block interlocked squares, and use a second interlocked square to build an 8-block rectangle (Fig. 5a). Inductively, we can extend the rectangle as far as we like by adding additional squares to the end; we call such a structure a segment. Intuitively, some additional connections might be added to connect segments to form a flat structure that we will call a layer. 3D volumes may then be constructed from stacks of layers. Figure 6 shows a conceptual picture, with the single key block of each new larger structure shown in red. The remainder of the paper addresses the details needed to allow implementation of this process. How should segments interconnect to form a layer? How should layers interconnect? How should joints be arranged on blocks to allow
creating segments and layers from only a few types of block? How should layers be automatically shaped to allow construction of geometries more interesting than large cubical volumes?

Algorithm 1 presents an overview of the layout algorithm; the details will be discussed in later sections of the paper, as indicated by the section numbers given in the algorithm; the reader may wish to only skim the algorithm on first reading. For now, it is worth noting that the input to the algorithm is a voxelized model describing the desired output shape. Each voxel is further subdivided into eight subvoxels; each subvoxel will effectively be instantiated by a block. The blocks are labelled by layer and segment. Layer and segment labels allow assignment of joint types that must connect adjacent blocks. Once the joint types have been assigned, blocks providing these joint types can be selected. The output is a sequence of block assembly orders that constructs an interlocking structure shaped as the input model. Since our model is built layer-by-layer, the final structure will have k keys, where k is the number of layers that do not have another layer on their top.

One critical observation is that joint types for a pair of blocks are selected by the layout algorithm based on the location of those blocks in the segment and layer. Since there are three joint types, each male or female, and six faces on a cube, this suggests that there might be 6^6 = 46656 different types of block to construct. Fortunately, patterns in the segments and layers mean that not all of these block types occur. Further tricks allow reduction of the number of block types to two. As an example, consider the blocks in Fig. 5a. Adding a mortise joint on the right side of block KB makes it a copy of KA, reducing the number of types of blocks.

It is also worth pointing out the approach we have taken to connecting adjacent layers. To provide a firm connection, we use a block of height two as a connector between layers; we call this block a post.

Algorithm 1. Algorithm overview
 1: function ConstructVoxelModel(M)
 2:   M ← split every voxel into eight dimension-1 cubes.
 3:   for each layer Li of M from bottom to top do
 4:     Lay out any missing posts.
 5:     if Li is an even layer then
 6:       Set all segment types to Xl−Y+. (Section 6)
 7:     else
 8:       Determine the key to each layer component. (Section 7.1)
 9:       Order segments in each layer component. (Section 7.2)
10:      Determine the key(s) to each segment. (Section 7.2)
11:      Determine the type of each segment. (Section 7.2)
12:      Find special cases. (Section 7.2)
13:    Modify Li and Li+1 if necessary. (Section 7.3)
14:    Assemble blocks, potentially in parallel. (Section 8.1)
Fig. 6. Building an interlocking 3D structure. Blocks interlock to form a square; squares form an interlocking segment; segments form a layer; layers interlock to form the structure. The keys are marked in red.
5 Blocks and Squares
The layout algorithm makes use of two types of blocks: a cube is a unit-cube-sized block for filling empty space in a layer and locking with existing blocks, and a post is a two-unit-high block. See Fig. 7. The lower half of a post block connects with cube blocks in the same layer, while the upper half of the block connects with cube blocks in the upper layer. The post blocks also act as key blocks of substructures.
Fig. 7. Different views of cube and post blocks.
We carefully analyzed segments and layers to determine how joints might be arranged on cubes and posts. A cube block has two dovetail male joints on two opposite sides that can connect, in a tangential direction, with female joints in post blocks, allowing motion only in the assembly direction. Figure 9a and b show how a cube's side male joints connect with a post's female joints in two different directions. A cube also has a male joint on the bottom that connects, in the normal direction, either with the top of a cube or with a post. The female joints on the opposite sides of the cube block allow a post block's male joint to drop and slide to connect, which allows the post block to disassemble in only two directions.

To describe a layout and an assembly process, some notation is helpful. For each block, we use a triplet of characters indicating block type, orientation of the block, and assembly direction; Figure 8 shows all of the notation triplets used in the current approach. For example, C1D means "Cube in orientation 1, assembled by moving down". Not all axis-aligned orientations of cubes and posts
Fig. 8. Different ways to assemble cube and post blocks and the corresponding notations. (a) Cube orientations and assembly directions: C1D, C2D, C3D, C4D, C5N, C6E, C7S, C8W. (b) Post orientations and assembly directions: P1W, P1N, P2N, P2E, P3S, P3E, P4W, P4S.
Fig. 9. Assembly process of a square, and two designs for a square. (a) Slide in a C8W block. (b) Drop in a C3D block. (c) Assemble a P1W block. (d) Top view of the square. (e) Sa square. (f) Sb square.
are needed to construct structures; for example, posts only occur in the four orientations generated by rotating the post in Fig. 7 around the z axis in ninety-degree increments. Block designs are crafted to allow design of squares, segments, and layers. A square is the smallest interlocking structure we consider, composed of four blocks: two posts and two cubes. By using posts and cubes in different orientations, different squares may be constructed, shown in Fig. 9e and f as Sa and Sb. Different squares will be used in Sect. 7 to constrain key block motions of other adjacent segments in the same layer, allowing interlock of the layer. Figure 9 shows the process of assembling one kind of square. The first piece, a post, may connect to a layer below the current one. Two cubes are added to the top of the post. The second post acts as a key, and the top half of this post extends above the square to provide a connection to a square that may later be built above the current one.
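The notation triplets above are easy to capture in a small data type; the following sketch (hypothetical Python, ours) parses strings such as C1D or P1W into their three components.

# Sketch: parsing block notation triplets (assumed encoding).
from typing import NamedTuple

class BlockOp(NamedTuple):
    kind: str         # "C" = cube, "P" = post
    orientation: int  # 1..8 for cubes, 1..4 for posts
    direction: str    # D(own), N(orth), E(ast), S(outh), W(est)

def parse(triplet: str) -> BlockOp:
    kind, orientation, direction = triplet[0], int(triplet[1]), triplet[2]
    assert kind in "CP" and direction in "DNESW"
    return BlockOp(kind, orientation, direction)

print(parse("C1D"))  # cube, orientation 1, assembled by moving down
print(parse("P1W"))  # post, orientation 1, assembled by moving west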
6 Segments
We now introduce a method to link squares into a longer interlocking structure, a segment. A segment is composed of n squares in a 1 × n pattern. To build a segment, we assume n posts have already been pre-placed in the prior layer, such that the top of each post appears in the same position in each square; these posts allow the segment to interlock with the prior layer.
We will discuss how to assemble a simple segment built from left to right in the direction of the x axis, assuming posts are in the upper (or y+) half of each line; other segments are symmetric and will not be discussed in detail. We denote a segment as Yl+X+ if the posts are in the left position of the y positive half and the segment is built towards the x positive direction with the key block at the end.
Fig. 10. A simple segment and an example layer built by connecting two segments. (a) Two Yl+X+ segments. (b) A layer built by two Yl+X+ segments.
Figure 10a visualizes the process of assembling a Yl+X+ segment of 3 squares. We connect, from left to right, n − 1 Sa squares. The final square of a sub-segment can be of type Sa or Sb. The key piece of the segment is the last assembled post block. A sub-segment with an Sb final square is not interlocking, but when connected with previous segment(s), the Sb square prevents the adjacent block in the y positive direction from moving and interlocks the structure. Building another Yl+X+ segment on the y negative side will create an interlocking layer (Fig. 10b). In Sect. 7, we will discuss how to constrain the motion of the key in different types of segments.
6.1 Structure Mirrors
Knowing how to assemble Yl+X+ segments, one can lay out an array of segments one-by-one and create interlocking planar structures as in Fig. 10b. However, these structures require the key of every segment to be at the x positive end, constrained by the next adjacent segment. In order to build more complicated planar structures, we introduce the concept of mirrors.

Definition 1 (x-mirror). Object A is an x-mirror, mx(B), of another object B if one is a reflection of the other with the reflection plane perpendicular to the x-axis.
We define an analogous y-mirror operation. Cube and post designs are symmetric in such a way that x- and y-mirror operations can be accomplished by simple rotation of the block. Construction of a mirrored structure follows the same order as the original structure with opposite directions along the same axis; for example, we may build a Yr+X− segment by x-mirroring a Yl+X+ segment. Two other types of segments we will need for layer construction are Yr+X+ (Fig. 11a) and its x-mirror Yl+X− (Fig. 11b). To build a Yr+X+ segment with n squares, where n ≥ 2, we first assemble two blocks (C3D and P1W) in the left two positions, then assemble a Yl+X+ segment of n − 1 squares. When all pre-existing posts are prevented from moving along the z axis, the segment is interlocked. For many input geometries, it may turn out that neither end of a segment is adjacent to the next segment, causing the key to be exposed. In this case, we may replace a single segment by two segments grown from the ends, effectively allowing placement of a pair of keys at an arbitrary position in the middle, as shown in Fig. 11c. These keys may then be immobilized by later segments.
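As a concrete illustration, a minimal sketch of the x-mirror on a placement list follows (ours, not from the paper): positions are reflected about a vertical plane and assembly directions are flipped east/west. The remapping of the orientation digit, which the block symmetry reduces to a rotation, is omitted here, so the direction flip alone can produce non-canonical triplets such as P1E; a full implementation would remap orientations as well.

# Sketch: x-mirror of a layout. Placements are ((x, y), triplet) pairs
# in assembly order; only the direction letter is remapped here.
MIRROR_X_DIR = {"E": "W", "W": "E", "N": "N", "S": "S", "D": "D"}

def x_mirror(layout, axis_x=0.0):
    mirrored = []
    for (x, y), triplet in layout:
        direction = MIRROR_X_DIR[triplet[2]]
        mirrored.append(((2 * axis_x - x, y), triplet[:2] + direction))
    return mirrored  # same assembly order, opposite x directions

segment = [((0, 0), "P1W"), ((1, 0), "C3D"), ((2, 0), "C8W")]
print(x_mirror(segment, axis_x=1.0))
# [((2.0, 0), 'P1E'), ((1.0, 0), 'C3D'), ((0.0, 0), 'C8E')]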
Fig. 11. Construction of two x-mirrored segments. Arrows indicate construction direction. Numbers indicate assembly order. (a) A Yr+X+ segment. (b) A Yl+X− segment. (c) A pair of segments with keys in the middle.
7 Layers
Now that we know how to build different kinds of segments, we can connect a set of segments in the same plane to create complicated interlocking 3D structures, by careful assignment of subvoxels from the original model into layers, segments, squares, and blocks. A layer is a set of squares with the same z-coordinate. A set of connected squares with the same z-coordinate is a layer component. We assume all layer posts are provided by the prior layer. This is a fundamental limitation of our approach: it does not allow overhanging structures to be generated without building additional supports. This section first introduces the ordering of segments in a component. Once ordered, segments are ready to be assigned square types and assembled. We then discuss some special cases caused by the nature of our block design and square structure, and techniques to ensure interlock.
7.1 Layer Key(s)
As the first step of building any interlocking structure, we determine the key(s) of the layer. A layer is immobilized if the key(s) is fixed with respect to its neighbors. Since every even layer has an upper layer with the exact same shape, based on the division of voxels into subvoxels, post blocks that connect to the upper layer will be immobilized as long as the upper layer is interlocked, preventing the horizontal motion of any posts. Therefore, we only consider the odd layers in this section. For any odd layer component without adjacent upper layer blocks, we select a post block at the x negative end of a boundary segment as the key, where a boundary segment is a segment with adjacent neighbors on only one side. If the odd layer component has an adjacent upper layer, the key can be any post block covered by an upper layer square. Under this rule, every layer component constrains the key of its lower component. Any layer component that does not have an immediate upper layer introduces a new key that will not be covered. The number of key pieces of the whole structure is thus the number of layer components without an immediate upper layer. This introduces an interesting effect of the orientation of the object to be constructed. For example, the chair in Fig. 1b has a single key, but if the chair were built upside-down, then there would be four keys: one in each leg.
7.2 Segment Construction Order
Once a layer's key square and all starting posts of squares are known, the second step of assembling a layer is to determine the order and type of each segment. In the preprocessing step, every voxel is broken into two squares, making every layer of voxels two layers in the assembly. The bottom layer has an even z-coordinate value, while the upper layer has an odd z-coordinate. Every segment in an even layer is constructed along the y-axis directions. We simply assemble every segment as Xl−Y+, a 90° clockwise rotation of a Yr+X− segment, from left to right. An even layer component is not necessarily interlocked, because there can be many segment keys that are unconstrained and able to move in the x positive direction. However, all square keys are posts in the upper layer, and as long as the upper layer is interlocked, i.e., all posts are prevented from moving in the x positive direction, the two-layer structure is interlocked. Each square in an odd layer component is initially assumed to have a post in the bottom-right position. This, however, could change after the segment types have been assigned. We first build a set of post lists, where each list contains posts with the same y-coordinate and two adjacent posts are 2 units apart. Each list will be built into a segment. Two posts are considered adjacent if their x or y-coordinates have a difference of 2. Two lists are considered adjacent if they have adjacent posts. Lists are ordered by their shortest distances to the final list, the one containing the post of the key square, where the distance between two adjacent lists is 1. Given a list l and the next-built adjacent list ln, the type of the segment Sl associated with l is determined as described below:
– If ln is on the y negative side of l and the left end post of l is adjacent to ln, Sl is Yr+X−.
– If ln is on the y negative side of l and the right end post of l is adjacent to ln, Sl is Yr+X+.
– If ln is on the y negative side of l and neither end of l is adjacent to ln, Sl is broken into a Yr+X− and a Yr+X+ segment.
– If ln is on the y positive side of l and the left end post of l is adjacent to ln, Sl is Yr−X−.
– If ln is on the y positive side of l and the right end post of l is adjacent to ln, Sl is Yr−X+.
– If ln is on the y positive side of l and neither end of l is adjacent to ln, Sl is broken into a Yr−X− and a Yr−X+ segment.

The segment associated with the last-built list has already been assigned a key (line 8 of Algorithm 1); its type is thus determined.
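The six rules reduce to a small case analysis on which end of l has a post adjacent to ln; a sketch (ours) follows, representing each list by the sorted x-coordinates of its posts and taking "adjacent to ln" to mean that ln contains a post with the same x-coordinate.

# Sketch of the segment-type rule above. side is "y-" or "y+",
# the side of l on which the next-built list ln lies.
def segment_type(l, ln, side):
    prefix = "Yr+" if side == "y-" else "Yr-"
    if l[0] in ln:                           # left end post adjacent to ln
        return [prefix + "X-"]
    if l[-1] in ln:                          # right end post adjacent to ln
        return [prefix + "X+"]
    return [prefix + "X-", prefix + "X+"]    # broken into two segments

print(segment_type([0, 2, 4], [4, 6], side="y-"))   # ['Yr+X+']
print(segment_type([0, 2, 4], [2], side="y+"))      # ['Yr-X-', 'Yr-X+']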
7.3 Special Cases
At this point, the type of each segment and the order of construction in each layer have been selected. Many interlocking layer structures can be assembled by directly following the construction of each segment as specified in Sect. 6. However, depending on the successor segments, some small modifications may be needed to ensure the interlocking of adjacent segments. Consider a segment with key(s) on the y negative side, for example Yl+X+. Its successor can be (1) a segment whose key will be constrained by further segments on the y negative side, (2) a segment whose key is constrained on the y positive side, or (3) a segment whose key will be constrained by the upper layer. We now list all possible cases that need modifications.
Fig. 12. Two special cases of building adjacent segments. Green blocks are posts, and red blocks are keys of each segment. Numbers indicate the assembly order. (a) A Yl−X+ segment built after a Yl+X+ segment. Some positions are left empty. (b) A longer lower segment. The lower segment is broken into two segments.
Case (1): a Yl+X+ segment followed by another Yl+X+ segment. We use an Sb square in the later-built segment to prevent the previous segment's key from moving; see Fig. 10b. Otherwise, a Yl+X+ segment always uses an Sa end square. Case (2) contains four subcases, in which the current segment has one or both ends adjacent to its successor, whose key is on the x positive or negative side. Figure 12a shows one subcase. The first Yl+X+ segment is still assembled as
Fig. 13. Special cases where two segments with posts on different sides are finished before the segment in the middle. Red blocks are keys of the two segments (Yr+X and Yr−X types). Numbers indicate the assembly order.
usual. We leave some positions adjacent to the first segment unfilled, and assemble the remaining part. Figure 12b shows a similar subcase, where both ends of the segment are adjacent to the successor. We divide the lower segment into two segments: one, containing no posts adjacent to the upper segment, is built first; the other, containing the remaining posts, is built after the upper segment. In the other subcases, where the successor has a key in the x negative direction, we change the upper segment to Yl+X− and create an x-mirror of the previous case. Case (3) is shown in Fig. 13, where a Yr+X segment and a Yr−X segment are assembled before the segment in the middle. We require the upper and lower segments' keys to be in different x positions. To ensure interlocking, we first finish the upper segment, then assemble two C5N blocks in the middle segment. After the lower segment is assembled, we put in C3D block(s) in the middle to constrain the motion of the lower segment's key(s). The last assembled blocks (keys) in the middle will be constrained by the upper layer. If the upper layer is not wide enough to cover the keys, we must expand the upper layer (Line 13 in Algorithm 1).
8 Automatic Assembly and Parallel Construction
Algorithm 1 gives an overview of the construction process. Our construction proceeds from the bottom layer to the top. For each layer, we first check whether all required posts exist. If not, we lay out these posts before starting the assembly (Line 4). Even layers are constructed using Xl−Y+ segments (Lines 5, 6). For odd layers, we need to find the keys first (Line 8). Based on the key of each layer component, we order the segments (Line 9), then determine segment keys and segment types (Lines 10, 11). Before assembling, we check whether any of the special cases mentioned in Sect. 7.3 exist (Line 12). Since Yr+X and Yr−X segments require at least two adjacent squares, we need to modify the current layer if this condition is not satisfied. The special case of Fig. 13 can also require the upper layer to expand and cover lower-layer keys (Line 13). We then finally assemble blocks based on block types and special cases.
8.1 Parallel Construction
Laying out blocks one-by-one is time-consuming when a structure has a large number of blocks. This section provides an algorithm that generates a parallel construction order to accelerate the process. We first determine, for each new block, which blocks must be assembled before it, and build a graph between blocks. By querying the graph for blocks whose prerequisites are satisfied, we can have multiple agents lay out blocks at the same time.

Consider a block b to be assembled in a layer. Any adjacent block(s) to be assembled later should not be obstructed by the male joint(s) of b; that is, the joints of a block connect only to pre-existing blocks. Along the assembly direction of b, the male joints of b should not be able to touch any blocks. The blocks that must be assembled before a new block to prevent collision are called predecessors of the new block. Every block has a predecessor below it if an adjacent block exists in the lower layer. Consider a block at position (x, y) in any layer. Table 1 lists the predecessors of each type of block in the same layer.

Table 1. Predecessors of each type of block.
Block type   Predecessors
C1D, C3D     (x − 1, y), (x + 1, y)
C2D, C4D     (x, y − 1), (x, y + 1)
C5N          (x − 1, y), (x + 1, y), (x, y + 1)
C6E          (x, y + 1), (x, y − 1), (x + 1, y)
C7S          (x − 1, y), (x + 1, y), (x, y − 1)
C8W          (x, y + 1), (x, y − 1), (x − 1, y)
P1W, P1N     (x − 1, y), (x, y + 1)
P2N, P2E     (x, y + 1), (x + 1, y)
P3S, P3E     (x + 1, y), (x, y − 1)
P4W, P4S     (x − 1, y), (x, y − 1)
Besides the predecessors listed above, inside each square, cube blocks with mortise joints connecting blocks in the same layer (C5N, C6E, C7S or C8W blocks) must be assembled before the others (C1D, C2D, C3D or C4D blocks). With the predecessors of each block, we then construct a directed graph G = (V, E), where V is the set of blocks, and a directed edge ei,j ∈ E indicates that block i is a predecessor of block j. The construction follows the order of removing nodes with in-degree 0. Each construction agent/thread will take a block whose predecessors have been placed, and remove the corresponding node from the graph when the block assembly is finished. A simple observation about parallel construction is that, after the construction of one square s, all the squares adjacent to s that follow s in the sequential order are ready to assemble. We therefore have the following theorem:

Theorem 1. Parallel construction of a solid cube of N squares takes O(∛N) time.

Proof. First consider constructing a solid layer of n × n squares. For simplicity, we scale the width of each square to one. After assembling the square at the corner
Fig. 14. Assembling 16 blocks using a 4-DoF robot arm. No sensors are used. Blocks are initially placed in the correct orientation for the arm to pick up. (a) Robotic assembly of a layer with 4 squares using a 4-DoF robot arm. (b) Overhead view of the assembled structure. Numbers indicate assembly order.
(0, 0), two adjacent squares in the x and y positive directions will be assembled at the next time step, then three, four, and so on. It takes k steps to construct k(k + 1)/2 squares. When k = n, over n²/2 squares are constructed. So constructing a layer takes at most 2n steps. In a cube, finishing every square allows all adjacent squares in the x, y and z positive directions to be assembled. When the last square of the bottom layer is done, it takes one more step to finish the upper layer. So 2n − 1 more steps will finish all upper layers. Therefore a solid cube of n × n × n squares takes O(n) = O(∛N) time to assemble.
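The in-degree-0 construction order described above is a Kahn-style traversal of the predecessor graph; the following sketch (ours) groups blocks into rounds, where all blocks in one round can be placed by different agents simultaneously.

# Sketch: parallel construction rounds from the predecessor relation.
from collections import defaultdict

def parallel_rounds(blocks, predecessors):
    """predecessors: dict mapping a block to the set of blocks that
    must be assembled before it."""
    indeg = {b: len(predecessors.get(b, ())) for b in blocks}
    succ = defaultdict(list)
    for b, preds in predecessors.items():
        for p in preds:
            succ[p].append(b)
    rounds = []
    ready = [b for b in blocks if indeg[b] == 0]
    while ready:
        rounds.append(ready)
        nxt = []
        for b in ready:                  # all of `ready` placed in parallel
            for s in succ[b]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    nxt.append(s)
        ready = nxt
    return rounds                        # rounds[k]: blocks placed at step k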
9 Results and Robotic Assembly
We algorithmically designed plans to assemble several models, including a Stanford bunny and a chair model, and animated the results in software. Figure 1 shows these examples. The Stanford bunny model has 7337 blocks, while the chair model has 472 blocks. The assemblies of both models are done in sequential order. The rendered animation of the chair assembly can be found in [26]. We also 3D printed 406 blocks and assembled them by hand into a similar chair based on the rendered animation. The assembled chair is a simplified version of the chair in the simulation; two layers were omitted to save material and assembly time. The four legs of the chair are relatively loose compared to other parts, because each pair of layers in the legs is connected by only one post and a mortise and tenon joint. For robotic assembly, we used a 4-DoF Adept robot arm to assemble both a one-layer structure with 4 squares (Fig. 14a) and a two-layer structure with 6 squares. A closer look at the one-layer structure is shown in Fig. 14b. This interlocking layer has two Yl−X+ segments. Assembly orders are marked with numbers. The recorded assemblies can be found in [25] and [27]. The assembly shown in Fig. 14a was recorded without human interruption. The joint designs show good tolerance during construction.
Acknowledgements. This project received seed funding from NSF IIS-1813043 and the Dubai Future Foundation. The authors are grateful to Haopeng Zhang and Geoffrey Hsuan-Chieh Huang, who helped build 3D models of blocks, built robot grippers and recorded videos. The authors also thank Jeremy Betz for useful insights on the geometry of joints. Thanks also to Emily Whiting, as well as members of the Dartmouth robotics lab, for useful feedback and insights throughout.
References

1. Andres, J., Bock, T., Gebhart, F., Steck, W.: First results of the development of the masonry robot system ROCCO: a fault tolerant assembly tool. In: Automation and Robotics in Construction XI, pp. 87–93. Elsevier (1994)
2. Augugliaro, F., Mirjan, A., Gramazio, F., Kohler, M., D'Andrea, R.: Building tensile structures with flying machines. In: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3487–3492. IEEE (2013)
3. Augugliaro, F., Lupashin, S., Hamer, M., Male, C., Hehn, M., Mueller, M.W., Willmann, J.S., Gramazio, F., Kohler, M., D'Andrea, R.: The flight assembled architecture installation: cooperative construction with flying machines. IEEE Control Syst. 34(4), 46–64 (2014)
4. Balaguer, C., Gambao, E., Barrientos, A., Puente, E.A., Aracil, R.: Site assembly in construction industry by means of a large range advanced robot. In: Proceedings of the 13th International Symposium on Automation and Robotics in Construction (ISARC 1996), pp. 65–72 (1996)
5. Daudelin, J., Jing, G., Tosun, T., Yim, M., Kress-Gazit, H., Campbell, M.: An integrated system for perception-driven autonomy with modular robots. arXiv preprint arXiv:1709.05435 (2017)
6. Fu, C.-W., Song, P., Yan, X., Yang, L.W., Jayaraman, P.K., Cohen-Or, D.: Computational interlocking furniture assembly. ACM Trans. Graph. (TOG) 34(4), 91 (2015)
7. Giftthaler, M., Sandy, T., Dörfler, K., Brooks, I., Buckingham, M., Rey, G., Kohler, M., Gramazio, F., Buchli, J.: Mobile robotic fabrication at 1:1 scale: the In situ Fabricator. Constr. Robot. 1(1–4), 3–14 (2017)
8. Helm, V., Ercan, S., Gramazio, F., Kohler, M.: Mobile robotic fabrication on construction sites: DimRob. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4335–4341. IEEE (2012)
9. Keating, S.J., Leland, J.C., Cai, L., Oxman, N.: Toward site-specific and self-sufficient robotic fabrication on architectural scales. Sci. Robot. 2(5), eaam8986 (2017)
10. Lindsey, Q., Mellinger, D., Kumar, V.: Construction of cubic structures with quadrotor teams. In: Proceedings of Robotics: Science and Systems VII (2011)
11. Romanishin, J.W., Gilpin, K., Rus, D.: M-blocks: momentum-driven, magnetic modular robots. In: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4288–4295. IEEE (2013)
12. Rubenstein, M., Cornejo, A., Nagpal, R.: Programmable self-assembly in a thousand-robot swarm. Science 345(6198), 795–799 (2014)
13. Rus, D., Vona, M.: Crystalline robots: self-reconfiguration with compressible unit modules. Auton. Robots 10(1), 107–124 (2001)
14. Schweikardt, E., Gross, M.D.: roBlocks: a robotic construction kit for mathematics and science education. In: Proceedings of the 8th International Conference on Multimodal Interfaces, pp. 72–75. ACM (2006)
15. Song, P., Deng, B., Wang, Z., Dong, Z., Li, W., Fu, C.-W., Liu, L.: CofiFab: coarse-to-fine fabrication of large 3D objects. ACM Trans. Graph. (TOG) 35(4), 45 (2016)
16. Song, P., Fu, C.-W., Cohen-Or, D.: Recursive interlocking puzzles. ACM Trans. Graph. 31(6), 128:1–128:10 (2012). (SIGGRAPH Asia 2012)
17. Song, P., Fu, C.-W., Jin, Y., Xu, H., Liu, L., Heng, P.-A., Cohen-Or, D.: Reconfigurable interlocking furniture. ACM Trans. Graph. (TOG) 36(6), 174 (2017)
18. Tosun, T., Jing, G., Kress-Gazit, H., Yim, M.: Computer-aided compositional design and verification for modular robots. In: Robotics Research, pp. 237–252. Springer (2018)
19. Wang, Z., Song, P., Pauly, M.: DESIA: a general framework for designing interlocking assemblies. ACM Trans. Graph. 37(6), Article No. 191 (2018). (SIGGRAPH Asia)
20. Wei, H., Chen, Y., Tan, J., Wang, T.: SamBot: a self-assembly modular robot system. IEEE/ASME Trans. Mechatron. 16(4), 745–757 (2011)
21. White, P., Zykov, V., Bongard, J.C., Lipson, H.: Three dimensional stochastic reconfiguration of modular robots. In: Robotics: Science and Systems, Cambridge, pp. 161–168 (2005)
22. Willmann, J., Augugliaro, F., Cadalbert, T., D'Andrea, R., Gramazio, F., Kohler, M.: Aerial robotic construction towards a new field of architectural research. Int. J. Architectural Comput. 10(3), 439–459 (2012)
23. Wilson, J.P., Woodruff, D.C., Gardner, J.D., Flora, H.M., Horner, J.R., Organ, C.L.: Vertebral adaptations to large body size in theropod dinosaurs. PLoS One 11(7), e0158962 (2016)
24. Yao, J., Kaufman, D.M., Gingold, Y., Agrawala, M.: Interactive design and stability analysis of decorative joinery for furniture. ACM Trans. Graph. 36(2), 20:1–20:16 (2017)
25. Zhang, Y.: One-layer structure robotic assembly experiment with two kinds of blocks (2.5x speed). https://youtu.be/1_lbVyPcLOI
26. Zhang, Y.: Robotic assembly of interlocking blocks - WAFR (2018). https://youtu.be/lV2xIA_Q8SI
27. Zhang, Y.: Two-layer structure robotic assembly experiment with two kinds of blocks (2.5x speed). https://youtu.be/ZjFFZzrl69s
28. Zhang, Y., Balkcom, D.: Interlocking structure assembly with voxels. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE (2016)
29. Zwerger, K., Olgiati, V.: Wood and Wood Joints: Building Traditions of Europe, Japan and China. Birkhäuser (2012)
CADbots: Algorithmic Aspects of Manipulating Programmable Matter with Finite Automata

Sándor P. Fekete1, Robert Gmyr2, Sabrina Hugo1, Phillip Keldenich1, Christian Scheffer1, and Arne Schmidt1(B)

1 Department of Computer Science, TU Braunschweig, Braunschweig, Germany
{s.fekete,s.hugo,keldenich,scheffer,arne.schmidt}@tu-bs.de
2 Department of Computer Science, University of Houston, Houston, USA
[email protected]
Abstract. We contribute results for a set of fundamental problems in the context of programmable matter by presenting algorithmic methods for evaluating and manipulating a collective of particles by a finite automaton that can neither store significant amounts of data, nor perform complex computations, and is limited to a handful of possible physical operations. We provide a toolbox for carrying out fundamental tasks on a given arrangement of tiles, using the arrangement itself as a storage device, similar to a higher-dimensional Turing machine with geometric properties. Specific results include time- and space-efficient procedures for bounding, counting, copying, reflecting, rotating or scaling a complex given shape.
1 Introduction
When dealing with classic challenges of robotics, such as exploration, evaluation and manipulation of objects, traditional robot models are based on relatively powerful capabilities, such as the ability (1) to collect and store significant amounts of data, (2) to perform intricate computations, and (3) to execute complex physical operations. With the ongoing progress in miniaturization of devices, new possibilities emerge for exploration, evaluation and manipulation. However, dealing with micro- and even nano-scale dimensions (as present in the context of programmable matter) introduces a vast spectrum of new difficulties and constraints. These include significant limitations to all three of the mentioned capabilities; challenges get even more pronounced in the context of complex nanoscale systems, where there is a significant threshold between "internal" objects and sensors and "external" control entities that can evaluate gathered data, extract important information and provide guidance.

A full version has been made available on arXiv [7].

In this paper, we present algorithmic methods for evaluating and manipulating a collective of particles by agents of the most basic possible type: finite
automata that can neither store significant amounts of data, nor perform complex computations, and are limited to a handful of possible physical operations. The objective is to provide a toolbox for carrying out fundamental tasks on a given arrangement of particles, such as bounding, counting, copying, reflecting, rotating or scaling a large given shape. The key idea is to use the arrangement itself as a storage device, similar to a higher-dimensional Turing machine with geometric properties.

1.1 Our Results
We consider an arrangement P of N pixel-shaped particles, on which a single finite-state robot can perform a limited set of operations; see Sect. 2 for a precise model description. Our goal is to develop strategies for evaluating and modifying the arrangement by defining sequences of transformations and providing metrics for such sequences. In particular, we present the following; a full technical overview is given in Table 1.

– We give a time- and space-efficient method for determining the bounding box of P.
– We show that we can simulate a Turing machine with our model.
– We provide a counter for evaluating the number N of tiles forming P, as well as the number of corner pixels of P.
– We develop time- and space-efficient algorithms for higher-level operations, such as copying, reflecting, rotating or scaling P.

Table 1. Results of this paper. N is the number of tiles in the given shape P, w and h its width and height. (∗) is the number of auxiliary tiles after constructing the bounding box.
Problem        Tile complexity    Time complexity                         Space complexity
Bounding Box   O(|∂P|)            O(wh max(w, h))                         O(w + h)
Counting:
N Tiles        O(log N)∗          O(max(w, h) log N + N min(w, h))        O(max(w, h))
k Corners      O(log k)∗          O(max(w, h) log k + k min(w, h) + wh)   O(max(w, h))
Function:
Copy           O(N)∗              O(wh²)                                  O(wh)
Reflect        O(max(w, h))∗      O((w + h)wh)                            O(w + h)
Rotate         O(w + h)∗          O((w + h)wh)                            O(|w − h| max(w, h))
Scaling by c   O(cN)              O((w² + h²)c²N)                         O(cwh)

1.2 Related Work
There is a wide range of related work; due to space limitations, we can only present a small selection. A very successful model considers self-assembling DNA tiles (e.g., [5,15]) that form complex shapes based on the interaction of different glues along their
edges; however, no active agents are involved, and composition is the result of chemical and physical diffusion.

The setting of a finite-automaton robot operating on a set of tiles in a grid was introduced in [10], where the objective is to arrange a given set of tiles into an equilateral triangle. An extension studies the problem of recognizing certain shapes [9]. We use a simplified variant of the model underlying this line of research that exhibits three main differences: First, for ease of presentation we consider a square grid instead of a triangular grid. Second, our model is less restrictive in that the robot can create and destroy tiles at will instead of only being able to transport tiles from one position to another. Finally, we allow the robot to move freely throughout the grid instead of restricting it to move along the tile structure.

Other previous related work includes shape formation in a number of different models: in the context of agents walking DNA-based shapes [17,20,22]; in the Amoebot model [6]; in modular robotics [11]; in a variant of population protocols [13]; in the nubot model [23]. Work focusing on a setting of robots on graphs includes network exploration [3], maze exploration [1], rendezvous search [16], intruder capture and graph searching [2,8]. For a connection to other approaches to agents moving tiles, see, e.g., [4,19].

Although the complexity of our model is very restricted, actually realizing such a system, for example using complex DNA nanomachines, is currently still a challenging task. However, on the practical side, recent years have seen significant progress towards realizing systems with the capabilities of our model. For example, it has been shown that nanomachines have the ability to act like the head of a finite automaton on an input tape [17], to walk on a one- or two-dimensional surface [12,14,22], and to transport cargo [18,20,21].
2 Preliminaries
We consider a single robot that acts as a deterministic finite automaton. The robot moves on the (infinite) grid G = (Z², E) with edges between all pairs of nodes that are within unit distance. The nodes of G are called pixels. Each pixel is in one of two states: it is either empty or occupied by a particle called a tile. A diagonal pair is a pair of two pixels (x1, y1), (x2, y2) with |x1 − x2| = |y1 − y2| = 1 (see Fig. 1, left and middle). A polyomino is a connected set of tiles P ⊂ Z² such that for all diagonal pairs p1, p2 ∈ P there is another tile p ∈ P that is adjacent to p1 and adjacent to p2 (see Fig. 1, middle and right). We say that P is simple if it has no holes, i.e., if there is no empty pixel p ∈ G that lies inside a cycle formed by occupied tiles. Otherwise, P is non-simple. The a-row of P is the set of all pixels (x, a) ∈ P. We say that P is y-monotone if the a-row of P is connected in G for each a ∈ Z (see Fig. 1, right). Analogously, the a-column of P is the set of all pixels (a, y) ∈ P, and P is called x-monotone if the a-column of P is connected in G for each a ∈ Z. The boundary ∂P of P is the set of all tiles of P that are adjacent to an empty pixel or build a diagonal pair with an empty pixel (see Fig. 1, right).
Fig. 1. Left: An illegal diagonal pair (p, q). Middle: An allowed diagonal pair (p, q). Right: A polyomino P with its boundary ∂P (tiles with bold outline). P is x-monotone but not y-monotone, because the 1-row is not connected.
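The definitions above translate directly into predicates on a set of occupied pixels; a short sketch (ours, in Python):

# Sketch: checking the polyomino conditions on a set P of (x, y) pixels.
from collections import deque

def connected(P):
    P = set(P)
    if not P:
        return True
    seen, queue = set(), deque([next(iter(P))])
    while queue:
        x, y = queue.popleft()
        if (x, y) in seen:
            continue
        seen.add((x, y))
        for n in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if n in P:
                queue.append(n)
    return seen == P

def diagonal_pairs_ok(P):
    P = set(P)
    for (x, y) in P:
        for dx, dy in ((1, 1), (1, -1)):
            if (x + dx, y + dy) in P:                    # a diagonal pair
                if (x + dx, y) not in P and (x, y + dy) not in P:
                    return False                         # illegal pair
    return True

def is_polyomino(P):
    return connected(P) and diagonal_pairs_ok(P)

def y_monotone(P):
    rows = {}
    for (x, y) in P:
        rows.setdefault(y, set()).add(x)
    return all(max(xs) - min(xs) + 1 == len(xs) for xs in rows.values())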
A configuration consists of the states of all pixels and the robot's location and state. The robot can transform a configuration into another configuration using a sequence of look-compute-move steps as follows. In each step, the robot acts according to a transition function δ. This transition function maps a pair (p, b) containing the state of the robot and the state of the current pixel to a triple (q, c, d), where q is the new state of the robot, c is the new state of the current pixel, and d ∈ {up, down, left, right} is the direction the robot moves in. In other words, in each step, the robot checks its state p and the state of the current pixel b, computes (q, c, d) = δ(p, b), changes into state q, changes the state of the current pixel to c if c ≠ b, and moves one step in direction d. Our goal is to develop robots transforming an unknown initial configuration S into a target configuration T(S). We assess the efficiency of a robot by several metrics: (1) time complexity, i.e., the total number of steps performed by the robot until termination; (2) space complexity, i.e., the total number of visited pixels outside the bounding box of the input polyomino P; and (3) tile complexity, i.e., the maximum number of tiles on the grid minus the number of tiles in the input polyomino P.
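The look-compute-move loop can be simulated in a few lines; the halting convention ("halt" state) and the toy transition function below are assumptions made for the sketch, not part of the model definition.

# Sketch: simulating the look-compute-move model. delta maps
# (robot_state, pixel_occupied) to (new_state, pixel_occupied', move).
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def run(delta, tiles, pos, state, max_steps=10_000):
    tiles = set(tiles)                        # occupied pixels
    for _ in range(max_steps):
        state, write, move = delta(state, pos in tiles)
        (tiles.add if write else tiles.discard)(pos)
        if state == "halt":                   # assumed halting convention
            break
        dx, dy = MOVES[move]
        pos = (pos[0] + dx, pos[1] + dy)
    return tiles, pos, state

# Toy transition function: lay a row of three tiles to the right, then halt.
def delta(state, occupied):
    return ("halt", True, "right") if state == 2 else (state + 1, True, "right")

print(run(delta, set(), (0, 0), 0))  # tiles {(0,0), (1,0), (2,0)}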
3 Basic Tools
A robot can check the states of all pixels within a constant distance by performing a tour of constant length. Thus, from now on we assume that a robot can check within a single step all eight pixels that are adjacent to the current position r or build a diagonal pair with r.

3.1 Bounding Box Construction
We describe how to construct a bounding box of a given polyomino P in the form of a zig-zag as shown in Fig. 2. We can split the bounding box into an outer lane and an inner lane (see Fig. 2). This allows distinguishing tiles of P and tiles of the bounding box. Our construction proceeds in three phases. (i) We search for an appropriate starting position. (ii) We wrap a zig-zag path around P . (iii) We finalize this path to obtain the bounding box.
Fig. 2. A Polyomino P (white) surrounded by a bounding box of tiles (gray) on the inner lane (solid line) and tiles on the outer lane (dashed line).
For phase (i), we simply search for a local minimum in the y-direction. We break ties by taking the tile with minimal x-coordinate, i.e., the leftmost tile. From the local minimum, we go two steps further to the left. If we land on a tile, this tile belongs to P and we restart phase (i) from this tile. Otherwise, we have found a possible start position.

In phase (ii), we start constructing a path that will wrap around P. We start placing tiles in a zig-zag manner in the upwards direction. While constructing the path, three cases may occur.

1. At some point we lose contact with P, i.e., there is no tile at distance two from the inner lane. In this case, we make a right turn and continue our construction in the new direction.
2. Placing a new tile produces a conflict, i.e., the tile shares a corner or a side with a tile of P. In this case, we shift the currently constructed side of the bounding box outwards until no more conflict occurs. This shifting process may lead to further conflicts, i.e., we may not be able to further shift the current side outwards. If this happens, we deconstruct the path until we can shift it outwards again (see Fig. 3). In this process, we may be forced to deconstruct the entire bounding box we have built so far, i.e., we remove all tiles of the bounding box including the tile at the starting position. In this case we know that there must be a tile of P to the left of the start position; we move to the left until we reach such a tile and restart phase (i).
3. Placing a new tile closes the path, i.e., we reach a tile of the bounding box we have created so far. We proceed with phase (iii).

Phase (iii): Let t be the tile that we reached at the end of phase (ii). We can distinguish two cases: (i) at t, the bounding box splits into three different directions (shown as the tile with black-and-white diagonal stripes in Fig. 4), and (ii) the bounding box splits into two different directions. In the latter case we are done, because we can only have reached the bottom or left side of the bounding box. In the first case we have reached a right or top side of the bounding box. From t we can move to the tile where we started the bounding box construction, and remove the bounding box parts until we reach t. Now the bounding box splits in two directions at t. Because we have reached a left or top side of the bounding box, we may not have built a convex shape around P (see Fig. 4, left). This can
be dealt with by straightening the bounding box in an analogous fashion, e.g., by pushing the line to the right of t down.
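Phase (i) can be prototyped as a simple walk on the tile set: descend to a local y-minimum, slide to the leftmost tile, and test the pixel two steps to the left. The following sketch (ours) is a centralized stand-in for the robot's local moves.

# Sketch of phase (i): find a start position two pixels to the left of a
# leftmost local y-minimum of P; restart if that pixel is occupied.
def start_position(P):
    P = set(P)
    x, y = next(iter(P))
    while True:
        while (x, y - 1) in P:        # descend to a local minimum in y
            y -= 1
        while (x - 1, y) in P:        # tie-break: move to the leftmost tile
            x -= 1
        if (x - 2, y) in P:           # landed on P again: restart from there
            x -= 2
        else:
            return (x - 2, y)         # candidate start position

print(start_position({(0, 0), (1, 0), (1, 1), (2, 1)}))  # (-2, 0)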
Fig. 3. Left: Further construction is not possible. Therefore, we shift the line upwards (see light gray tiles). Right: A further shift produces a conflict with the polyomino. Thus, we remove tiles (diagonal stripes) and proceed with the shift (light gray) when there is no more conflict.
Fig. 4. Left: During construction, starting in s, we reach a tile t (diagonal stripes) at which the bounding box is split into three directions. The part between s and t (dark gray tiles) can be removed, followed by straightening the bounding box along the dashed line. Right: The final bounding box.
Theorem 1. The described strategy builds a bounding box that encloses the given polyomino P of width w and height h. Moreover, the strategy terminates after O(max(w, h) · wh) steps, using at most O(|∂P|) auxiliary tiles and only O(w + h) additional space. The running time in the best case is O(wh).

Proof. Correctness: We show that (a) the bounding box will enclose P; and (b) whenever we make a turn or shift a side of the bounding box, we find a tile with distance two from the bounding box, i.e., we will not build an infinite line. First note that we only make a turn after encountering a tile with distance two to the inner lane. This means that we will "gift wrap" the polyomino until we reach the start. When there is a conflict (i.e., we would hit a tile of P with the current line), we shift the current line. Thus, we again find a tile with distance two to the inner lane. This also happens when we remove the current line and extend the previous one. After a short extension we make a turn and find a
Fig. 5. A worst-case example for building a bounding box. The positions p1 to p4 denote the first, second, third and fourth starting position of the robot during bounding box construction. With each restart, the pixel f is visited at least once. Therefore, f can be visited Ω(w) times.
tile with distance two to the inner lane, meaning that we make another turn at some point. Therefore, we do not construct an infinite line, and the construction remains finite.

Time: To establish a runtime of O(wh), we show that each pixel lying in the final bounding box is visited only a constant number of times. Consider a tile t that is visited for the first time. We know that t gets revisited if the line through t gets shifted or removed. When t is no longer on the bounding box, we can visit t again while searching for the start tile. Thus, t is visited at most four times by the robot, implying a running time of O(wh) unit steps. However, it may happen that we have to remove the bounding box completely and have to restart the local minimum search. In this case, there may be tiles that are visited up to max(w, h) times (see Fig. 5). Therefore, the running time is O(max(w, h) · wh) in the worst case.

Auxiliary Tiles: We now show that we need at most O(|∂P|) auxiliary tiles at any time. Consider a tile t of ∂P from which we can shoot a ray to a tile of the bounding box (or the intermediate construction), such that no other tile of P is hit. For each such tile t of ∂P, there are at most four tiles t1, . . . , t4 of the bounding box hit in this way. We can charge the cost of t1, . . . , t4 to t, which is constant. Thus, each tile in ∂P has been charged O(1). However, there are still tiles on the bounding box that have not been charged to a tile of ∂P, i.e., tiles that lie in a curve of the bounding box. Consider a locally convex tile t, i.e., a tile at which we can place a 2 × 2 square solely containing t. Each turn of the bounding box can be charged to a locally convex tile. Note that there are at most four turns that can be charged to a locally convex tile, and for each turn there are at most four tiles that are charged to it. Thus, each tile of ∂P has constant cost, i.e., we need at most O(|∂P|) auxiliary tiles. It is straightforward to see that we need O(w + h) additional space.
Fig. 6. Left: A 4-bit counter with decimal value 10 (1010 in binary). Middle: The binary counter increased by one. Right: The binary counter decreased by one.
Fig. 7. An 8 × 8 square P in a bounding box, being counted row-by-row (left) or column-by-column (right). The robot has already counted 14 tiles of P and stored this information in the binary counter. The arrow shows how the robot moves to count the next tile t.
3.2 Binary Counter
For counting problems, a binary counter is indispensable, because we are not able to store non-constant numbers in the robot's state. The binary counter for storing an n-bit number consists of a base-line of n tiles. Above each tile of the base-line there is either a tile denoting a 1, or an empty pixel denoting a 0. Given an n-bit counter, we can increase and decrease the counter by 1 (or by any constant c), and extend the counter by one bit. The latter operation will only be used in an increase operation. In order to perform an increase or decrease operation, we move to the least significant bit and start flipping bits (i.e., removing or placing tiles) until the carry or borrow is resolved (Fig. 6).
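To make these operations concrete, the following is a minimal grid-level sketch of the counter, assuming a Python list of bits stands in for the column of tiles (index 0 is the least significant bit); it models the effect of the robot's tile placements and removals, not its movement.

```python
# Minimal sketch of the tile-based binary counter; bits[i] == 1 means a
# tile is placed above the ith base-line tile.

class BinaryCounter:
    def __init__(self):
        self.bits = [0]                # base-line of one bit, initially 0

    def increment(self):
        i = 0
        # Flip 1-bits to 0 (remove tiles) until a 0-bit absorbs the carry.
        while i < len(self.bits) and self.bits[i] == 1:
            self.bits[i] = 0
            i += 1
        if i == len(self.bits):
            self.bits.append(0)        # extend the counter by one bit
        self.bits[i] = 1               # place the tile carrying the 1

    def decrement(self):
        # Assumes the stored value is positive.
        i = 0
        # Flip 0-bits to 1 (place tiles) until a 1-bit pays the borrow.
        while self.bits[i] == 0:
            self.bits[i] = 1
            i += 1
        self.bits[i] = 0

    def value(self):
        return sum(b << i for i, b in enumerate(self.bits))
```

Both operations flip an amortized-constant number of bits per call, which is why incrementing once per counted tile costs only O(1) amortized time in the analyses below.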
4 Counting Problems
The constant memory of our robot poses several challenges when we want to count certain elements of our polyomino, such as tiles or corners. Because these counts can be arbitrarily large, we cannot store them in the state space of our robot. Instead, we have to make use of a binary counter. This requires us to move back and forth between the polyomino and the counter. Therefore, we must be able to find and identify the counter coming from the polyomino and to find the way back to the position where we stopped counting. This motivates the following strategy for counting problems. We start by constructing a slightly extended bounding box. We perform the actual counting by shifting the polyomino two units downwards or to the left, one tile at a time. After moving each tile, we return to the counter and increase it if necessary. We describe further details in the next two sections, where we present algorithms for counting the number of tiles or corners in a polyomino.
4.1 Counting Tiles
The total number of tiles in our polyomino can be counted using the strategy outlined above, increasing the counter by one after each moved tile.

Theorem 2. Let P be a polyomino of width w and height h with N tiles for which the bounding box has already been created. Counting the number of tiles in P can be done in O(max(w, h) log N + N min(w, h)) steps, using O(max(w, h)) of additional space and O(log N) auxiliary tiles.

Proof. In a first step, we determine whether the polyomino's width is greater than its height or vice versa. We can do this by moving to the lower left corner of the bounding box and then alternatingly moving up and right one step until we meet the bounding box again. We can recognize the bounding box by a local lookup, based on its zig-zag shape that contains tiles only connected by a corner, which cannot occur in a polyomino. The height is at least as large as the width if we end up on the right side of the bounding box. In the following, we describe our counting procedure for the case that the polyomino is higher than wide; the other case is analogous.

We start by extending the bounding box by shifting its bottom line down by two units. Afterwards we create a vertical binary counter to the left of the bounding box. We begin counting tiles in the bottom row of the polyomino. We keep our counter in a column to the left of the bounding box such that the least significant bit at the bottom of the counter is in the row in which we are currently counting. We move to the right into the current row until we find the first tile. If this tile is part of the bounding box, the current row is done. In this case, we move back to the counter, shift it upwards and continue with the next row until all rows are done. Otherwise, we simply move the current tile down two units, return to the counter and increment it. For an example of this procedure, refer to Fig. 7.

For each tile in the polyomino, we use O(min(w, h)) steps to move to the tile, shift it and return to the counter; incrementing the counter itself only takes O(1) amortized time per tile. For each empty pixel we have cost O(1). In addition, we have to shift the counter max(w, h) times. Thus, we need O(max(w, h) log N + min(w, h)N + wh) = O(max(w, h) log N + min(w, h)N) unit steps. We only use O(log N) auxiliary tiles in the counter, in addition to the bounding box. In order to achieve O(1) additional space, we modify the procedure as follows. Whenever we move our counter, we check whether it extends into the space above the bounding box. If it does, we reflect the counter vertically, such that the least significant bit is at the top in the row in which we are currently counting. This requires O(log² N) extra steps and avoids using O(log N) additional space above the polyomino.
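As a sanity check of the procedure, here is a hypothetical grid-level walkthrough that reuses the BinaryCounter sketch above: it performs one increment per tile, row by row from the bottom, mirroring the robot's tile shifts without simulating its walk.

```python
def count_tiles(tiles):
    """Count the tiles of a polyomino (a set of (x, y) cells) row by row."""
    counter = BinaryCounter()
    for y in sorted({cy for (_, cy) in tiles}):       # rows, bottom-up
        for _x in sorted(cx for (cx, cy) in tiles if cy == y):
            counter.increment()                       # one shifted tile
    return counter.value()

# Example: a 3 x 2 rectangle contains 6 tiles.
assert count_tiles({(x, y) for x in range(3) for y in range(2)}) == 6
```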
4.2 Counting Corners
In this section, we present an algorithm to count the reflex or convex corners of a given polyomino.

Theorem 3. Let P be a polyomino of width w and height h with k convex (reflex) corners for which the bounding box has already been created. Counting the number k of convex (reflex) corners in P can be done in O(max(w, h) log k + k min(w, h) + wh) steps, using O(max(w, h)) of additional space and O(log k) auxiliary tiles.
Fig. 8. Left: A polyomino and its convex corners (◦) and reflex corners (×). Right: After reaching a new tile t, we have to know which of the pixels x1 to x6 are occupied to decide how many convex (reflex) corners t has.
Proof. We use the same strategy as for counting tiles in Theorem 2 and retain the same asymptotic bounds on running time, auxiliary tiles and additional space. It remains to describe how we recognize convex and reflex corners. Whenever we reach a tile t that we have not yet considered so far, we examine its neighbors x1, . . . , x6, as shown in Fig. 8. The tile t has one convex corner for each pair (x1, x2), (x2, x3), (x3, x5), (x5, x1) that does not contain a tile, and no convex corner is contained in more than one tile of the polyomino. As there are at most four convex corners per tile, we can simply store this number in the state space of our robot, return to the counter, and increment it accordingly. A reflex corner is shared by exactly three tiles of the polyomino, so we have to ensure that each corner is counted exactly once. We achieve this by only counting a reflex corner if it is on top of our current tile t and was not already counted with the previous tile x1. In other words, we count the upper right corner of t as a reflex corner if exactly two of x3, x4, x5 are occupied; we count the upper left corner of t as a reflex corner if x6 and x5 are present and x1 is not. In this way, all reflex corners are counted exactly once.
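These local tests translate directly into a predicate on the six neighbor pixels. The sketch below assumes the neighbor layout of Fig. 8 (x1 left, x2 below, x3 right, x4 upper-right, x5 above, x6 upper-left of t), which we inferred from the stated corner rules; it is an illustration, not the robot's constant-memory implementation.

```python
def corner_counts(t, tiles):
    """Return (convex, reflex) corner counts charged to tile t of the
    polyomino 'tiles', following the rules in the proof of Theorem 3."""
    x, y = t
    x1 = (x - 1, y) in tiles          # left
    x2 = (x, y - 1) in tiles          # below
    x3 = (x + 1, y) in tiles          # right
    x4 = (x + 1, y + 1) in tiles      # upper-right
    x5 = (x, y + 1) in tiles          # above
    x6 = (x - 1, y + 1) in tiles      # upper-left

    # A corner of t is convex iff both incident edge-neighbors are empty.
    convex = sum((not a) and (not b)
                 for a, b in [(x1, x2), (x2, x3), (x3, x5), (x5, x1)])

    reflex = 0
    if x3 + x4 + x5 == 2:             # upper-right corner of t
        reflex += 1
    if x6 and x5 and not x1:          # upper-left corner of t
        reflex += 1
    return convex, reflex
```

For instance, a single isolated tile yields four convex and no reflex corners, and an L-tromino yields exactly one reflex corner, charged to its bottom-left tile.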
5 Transformations with Turing Machines
In this section we develop a robot that transforms a polyomino P1 into a required target polyomino P2. In particular, we encode P1 and P2 by strings S(P1) and S(P2) whose characters are from {0, 1, ␣} (see Fig. 9 left and Definition 1). If there is a Turing machine transforming S(P1) into S(P2), we can give a strategy that transforms P1 into P2 (see Theorem 4). We start with the definition of the encodings S(P1) and S(P2).
Fig. 9. Left: An example showing how to encode a polyomino of height h = 3 (S1 = 11 in binary) and width w = 6 (S2 = 110 in binary) by S(P) = S1 S2 S3 = 11 110 100100101111111011, where S3 is the string of 18 bits that represent tiles (black, 1 in binary) and empty pixels (white, 0 in binary), proceeding from high bits to low bits. Right: Phase (2), writing S(P) (currently the 10th bit) on a horizontal auxiliary line.
Definition 1. Let R := R(P) be the polyomino forming the smallest rectangle containing a given polyomino P (see Fig. 9 left). We represent P by the concatenation S(P) := S1 S2 S3 of three bit strings S1, S2, and S3, separated by blank symbols ␣. In particular, S1 and S2 are the height and width of R. Furthermore, we label each tile of R by its rank in the lexicographically decreasing order w.r.t. y- and x-coordinates, with higher priority to y-coordinates (see Fig. 9 left). Finally, S3 is an |R|-bit string b1, . . . , b|R| with bi = 1 if and only if the ith tile in R also belongs to P.

Theorem 4. Let P1 and P2 be two polyominos with |P1| = |P2| = N. There is a strategy transforming P1 into P2 if there is a Turing machine transforming S(P1) into S(P2). The robot needs O(|∂P1| + |∂P2| + S_TM) auxiliary tiles, O(N^4 + T_TM) steps, and Θ(N^2 + S_TM) of additional space, where T_TM and S_TM are the number of steps and the additional space needed by the Turing machine.

Proof. Our robot works in five phases: Phase (1) constructs a slightly modified bounding box of P1 (see Fig. 9). Phase (2) constructs a shape representing S(P1). Bit by bit, the robot writes S(P1) onto a horizontal auxiliary line (see Fig. 9). In order to remember the position at which the previous bit was written, we use a position tile placed on another horizontal auxiliary line. Phase (3) simulates the Turing machine that transforms S(P1) into S(P2). Phase (4) is the reversed version of Phase (2), and Phase (5) is the reversed version of Phase (1). As Phases (4) and (5) are the reversed versions of Phases (2) and (1), respectively, it remains to discuss how to realize Phases (1), (2), and (3), for which we refer to the full version [7]. This concludes the proof of Theorem 4.
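The encoding of Definition 1 is easy to state at grid level. The following sketch, assuming a polyomino given as a set of (x, y) cells, reproduces the string S(P); only its construction by the robot (Phase (2)) requires the auxiliary-line machinery described above.

```python
def encode(tiles):
    """Return S(P) = S1 S2 S3 for a polyomino given as a set of (x, y) cells."""
    xs = [x for x, _ in tiles]
    ys = [y for _, y in tiles]
    w, h = max(xs) - min(xs) + 1, max(ys) - min(ys) + 1
    s1, s2 = format(h, "b"), format(w, "b")  # height and width of R in binary
    # Rank the tiles of R in lexicographically decreasing (y, x) order,
    # y taking priority, and emit one bit per tile of R.
    s3 = "".join("1" if (x, y) in tiles else "0"
                 for y in range(max(ys), min(ys) - 1, -1)
                 for x in range(max(xs), min(xs) - 1, -1))
    return f"{s1} {s2} {s3}"
```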
6 CAD Functions
As already seen, we can transform a given polyomino by any computable function by simulating a Turing machine. However, this requires a considerable amount of space. In this section, we present realizations of common functions in a more space-efficient manner.
6.1 Copying a Polyomino
Copying a polyomino P has many applications. For example, we can apply algorithms that may destroy P after copying P into free space. In the following, we describe the copy function that copies each column below the bounding box, as seen in Fig. 10 (a row-wise copy can be described analogously). Our strategy to copy a polyomino is as follows: after constructing the bounding box, we copy each column of the bounding box area into free space. This, however, comes with a problem. As the intersection of the polyomino with a column may consist of more than one component, and these components may be several units apart, we have to be sure that the right number of empty pixels is copied. This problem can be solved by using bridges, i.e., auxiliary tiles denoting empty pixels. To distinguish between tiles of P and bridges, we use two lanes: (i) a left lane containing tiles of P, and (ii) a right lane containing bridge tiles (see Fig. 10). To copy an empty pixel, we place a tile two unit steps to the left. This marks a part of a bridge. Then we move down and search for the first position at which there is neither a copied tile of P to the left, nor a bridge tile. When this position is found, we place a tile, denoting an empty position. To copy a tile t of P, we move this tile three unit steps to the left. Afterwards, we move downwards following the tiles and bridges until we reach a free position and place a tile, denoting a copy of t. When we reach the bottom of a column, we remove any tile representing a bridge tile and proceed with the next column (see Fig. 10 right).
Fig. 10. Left: Intermediate step while copying a column of a polyomino (gray tiles) with bridges (black tiles). Tiles in the left box (dotted) are already copied; the tiles of P in the right box (dash-dotted) are not copied yet. The robot moves to the next pixel. Right: When the column is copied, the bridges are removed and the robot proceeds with the next column.
Theorem 5. Copying a polyomino P column-wise can be done within O(wh²) unit steps, using O(N) auxiliary tiles and O(h) additional space.

Proof. Consider the strategy described above. It is straightforward to see that, by maintaining the bridges, we can ensure that each tile of P is copied to the correct position. As there are O(wh) pixels that we have to copy, with a cost of O(h) per pixel, the strategy needs O(wh²) unit steps to terminate. Now consider the number of auxiliary tiles. We need N tiles for the copy and O(h) tiles for bridges, which are reused for each column. Thus, we need O(N) auxiliary tiles in total. Because we place the copied version of P beneath P, we need O(h) additional space in the vertical direction.
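At grid level, the net effect of the column-wise copy is a uniform vertical displacement of every tile; the sketch below captures only that end result. The bridge tiles are a robot-level device for telling copied empty pixels apart and vanish in this global view; the clearance of three units below the bounding box is an illustrative choice.

```python
def copy_below(tiles, gap=3):
    """Return the copy of a polyomino placed h + gap units below the
    original, i.e., the end state of the procedure of Theorem 5."""
    ymin = min(y for _, y in tiles)
    ymax = max(y for _, y in tiles)
    h = ymax - ymin + 1
    return {(x, y - h - gap) for (x, y) in tiles}
```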
6.2 Reflecting a Polyomino
In this section we show how to perform a vertical reflection of a polyomino P (a horizontal reflection is done analogously). Assume that we have already built the bounding box. Then we shift the bottom side of the bounding box one unit down, such that we have two units of space between the bounding box and P. We start with the bottom-most row r and copy r in reversed order beneath the bounding box, using bridges as in the previous section. After the copy process, we delete r from P and shift the bottom side of the bounding box one unit up. Note that we still have two units of space between the bounding box and P. We can therefore repeat the process until no tile is left.

Theorem 6. Reflecting a polyomino P vertically can be done in O(w²h) unit steps, using O(w) additional space and O(w) auxiliary tiles.

Proof. For each pixel within the bounding box we have to copy this pixel to the desired position. This costs O(w) steps: we have to move to the right side of the boundary, move a constant number of steps down, and move O(w) steps to the desired position. These are O(w) steps per pixel, and O(w²h) unit steps in total. It can be seen from the described strategy that we only need a constant amount of additional space in one dimension, because we are overwriting the space that was occupied by P. This implies a space complexity of O(w). Following the same argumentation as in Theorem 5, we can see that we need O(w) auxiliary tiles.

Corollary 1. Reflecting a polyomino P both vertically and horizontally can be done in O(wh(w + h)) unit steps, using O(w + h) additional space and O(w) auxiliary tiles.
6.3 Rotating a Polyomino
Rotating a polyomino presents some additional difficulties, because the resulting polyomino has dimensions h × w instead of w × h. Thus, we may need non-constant additional space in one dimension, e.g., if one dimension is large compared to the other. A simple approach is to copy the
rows of P bottom-up to the right of P. This allows us to rotate P with O(wh) additional space. For now, we assume that h ≥ w. We now propose a strategy that is more compact. The strategy consists of two phases: first build a reflected version of our desired polyomino, then reflect it to obtain the correct polyomino. After constructing the bounding box, we place a tile t1 in the bottom-left corner of the bounding box, which marks the bottom-left corner of P (see Fig. 11). We also extend the bounding box at the left side and the bottom side by six units. This gives us a width of seven units between the polyomino and the bounding box at the left and bottom side. Now we can move the first column, rotated in clockwise direction, five units below P (which is two units above the bounding box), as follows. We use two more tiles: tc, denoting the column at which we have to enter the bounding box, and t2, marking the pixel that has to be copied next (see Fig. 11). We can maintain these tiles as in a copy process to always copy the next pixel: if t2 reaches t1, i.e., we would place t2 on t1, then we know that we have finished the column and proceed with the row by moving t2 to the right of t1. We can copy this pixel to the desired position again by following the bridges and tiles, which now make a turn at some point (see Fig. 11 for an example). Note that we cannot place the row directly on top of the column, or else we may get conflicts with bridges. We therefore build the row one unit to the left of and above the column. Also note that during construction of the first column or during the shifting we may hit the bounding box. In this case we extend the bounding box by one unit. After constructing a column and a row, we move the constructed row one unit to the left and two units down, and we move the column one unit down and two units left. This gives us enough space to construct the next column and row in the same way. When all columns and rows are constructed, we obtain a polyomino that is a reflected version of our desired polyomino. It remains to reflect horizontally to obtain a polyomino rotated in counter-clockwise direction, or vertically to obtain
Fig. 11. Left: Constructing the first column and row (light gray in the polyomino). Middle: First row and column have been moved away by a constant number of steps to make space for the next row and column. Right: Merging the first two columns and rows.
a polyomino that is rotated in clockwise direction. This can be done with the strategy described above.

Theorem 7. There is a strategy to rotate a polyomino P by 90° within O((w + h)wh) unit steps, using O(|w − h|h) additional space and O(w + h) auxiliary tiles.

Proof. As in our other algorithms, the number of steps needed to copy the state of a pixel to the desired position is bounded by O(w + h). Shifting the constructed row and column also takes the same number of steps. Therefore, constructing the reflected version of our desired polyomino needs O((w + h)wh) unit steps. Additionally, we may have to extend one side of the bounding box. This can happen at most O(|w − h|) times, with each event needing O(h) unit steps. Because O(|w − h|) can be bounded by O(w + h), this does not change the total time complexity. Because the width of the working space increases by O(|w − h|) and the height only increases by O(1), we need O(|w − h|h) of additional space. It is straightforward to see that we need a total of O(w + h) auxiliary tiles.
6.4 Scaling a Polyomino
Scaling a polyomino P by a factor c replaces each tile by a c × c square. This can easily be handled by our robot.

Theorem 8. Given a constant c, the robot can scale the polyomino by c within O((w² + h²)c²N) unit steps, using O(c²wh) additional tiles and O(c²N) additional space.

Proof. After constructing the bounding box, we place a tile denoting the current column. Suppose we are in the ith column Ci. Then we shift all columns that lie to the right of Ci by c units to the right, with a cost of $O((w-i) \cdot \sum_{j=i+1}^{w} c N_{C_j}) \subseteq O(wcN)$, where $N_{C_j}$ is the number of tiles in the jth column. Because c is a constant, we can always find the next column that is to be shifted. Afterwards, we copy Ci exactly c − 1 times to the right of Ci, which costs $O(N_{C_i} c)$ unit steps. Thus, extending a single column costs $O(c \cdot (wN + N_{C_i}))$, and hence extending all columns costs O(cw²N) unit steps in total. Extending each row is done analogously. However, because each tile has already been copied c times, we obtain a running time of O(c²h²N). This implies a total running time of O((w² + h²)c²N). The proof for the tile and space complexity is straightforward.
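The end result of Theorem 8 is simple to state: every tile becomes a c × c block. The sketch below reproduces that result at grid level; the robot achieves it in place through the column- and row-extension passes analyzed above.

```python
def scale(tiles, c):
    """Scale a polyomino (set of (x, y) cells) by an integer factor c."""
    return {(c * x + dx, c * y + dy)
            for (x, y) in tiles
            for dx in range(c)
            for dy in range(c)}

# Example: scaling a single tile by c = 2 yields a 2 x 2 square.
assert scale({(0, 0)}, 2) == {(0, 0), (0, 1), (1, 0), (1, 1)}
```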
7 Conclusion
We have given a number of tools and functions for manipulating an arrangement of tiles by a single robotic automaton. There are various further questions, including more complex operations, as well as sharing the work between multiple robots. These are left for future work.
References

1. Blum, M., Kozen, D.: On the power of the compass (or, why mazes are easier to search than graphs). In: Proceedings of 19th Annual Symposium on Foundations of Computer Science, pp. 132–142 (1978)
2. Bonato, A., Nowakowski, R.J.: The Game of Cops and Robbers on Graphs. AMS (2011)
3. Das, S.: Mobile agents in distributed computing: network exploration. Bull. Eur. Assoc. Theor. Comput. Sci. 109, 54–69 (2013)
4. Demaine, E., Demaine, M., Hoffmann, M., O'Rourke, J.: Pushing blocks is hard. Comput. Geom. 26(1), 21–36 (2003)
5. Demaine, E.D., Fekete, S.P., Scheffer, C., Schmidt, A.: New geometric algorithms for fully connected staged self-assembly. Theor. Comput. Sci. 671, 4–18 (2017)
6. Derakhshandeh, Z., Gmyr, R., Richa, A.W., Scheideler, C., Strothmann, T.: Universal shape formation for programmable matter. In: Proceedings of 28th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 289–299 (2016)
7. Fekete, S.P., Gmyr, R., Hugo, S., Keldenich, P., Scheffer, C., Schmidt, A.: CADbots: Algorithmic Aspects of Finite Automata for Manipulating Programmable Matter. arXiv preprint arXiv:1810.06360 (2018)
8. Fomin, F.V., Thilikos, D.M.: An annotated bibliography on guaranteed graph searching. Theor. Comput. Sci. 399(3), 236–245 (2008)
9. Gmyr, R., Hinnenthal, K., Kostitsyna, I., Kuhn, F., Rudolph, D., Scheideler, C.: Shape recognition by a finite automaton robot. In: 43rd International Symposium on Mathematical Foundations of Computer Science (2018, to appear)
10. Gmyr, R., Hinnenthal, K., Kostitsyna, I., Kuhn, F., Rudolph, D., Scheideler, C., Strothmann, T.: Forming tile shapes with simple robots. In: 23rd International Conference on DNA Computing and Molecular Programming (2018, to appear)
11. Hurtado, F., Molina, E., Ramaswami, S., Sacristán, V.: Distributed reconfiguration of 2D lattice-based modular robotic systems. Auton. Robots 38(4), 383–413 (2015)
12. Lund, K., Manzo, A., Dabby, N., Michelotti, N., Johnson-Buck, A., Nangreave, J., Taylor, S., Pei, R., Stojanovic, M., Walter, N., Winfree, E.: Molecular robots guided by prescriptive landscapes. Nature 465(7295), 206–210 (2010)
13. Michail, O., Spirakis, P.G.: Simple and efficient local codes for distributed stable network construction. Distrib. Comput. 29(3), 207–237 (2016)
14. Omabegho, T., Sha, R., Seeman, N.: A bipedal DNA Brownian motor with coordinated legs. Science 324(5923), 67–71 (2009)
15. Patitz, M.J.: An introduction to tile-based self-assembly and a survey of recent results. Nat. Comput. 13(2), 195–224 (2014)
16. Pelc, A.: Deterministic rendezvous in networks: a comprehensive survey. Networks 59(3), 331–347 (2012)
17. Reif, J.H., Sahu, S.: Autonomous programmable DNA nanorobotic devices using DNAzymes. Theor. Comput. Sci. 410, 1428–1439 (2009)
18. Shin, J., Pierce, N.: A synthetic DNA walker for molecular transport. J. Am. Chem. Soc. 126, 4903–4911 (2004)
19. Terada, Y., Murata, S.: Automatic modular assembly system and its distributed control. Int. J. Robot. Res. 27(3–4), 445–462 (2008)
20. Thubagere, A., Li, W., Johnson, R., Chen, Z., Doroudi, S., Lee, Y., Izatt, G., Wittman, S., Srinivas, N., Woods, D., Winfree, E., Qian, L.: A cargo-sorting DNA robot. Science 357(6356), eaan6558 (2017)
21. Wang, Z., Elbaz, J., Willner, I.: A dynamically programmed DNA transporter. Angewandte Chemie Int. Edn. 51(48), 4322–4326 (2012)
22. Wickham, S., Bath, J., Katsuda, Y., Endo, M., Hidaka, K., Sugiyama, H., Turberfield, A.: A DNA-based molecular motor that can navigate a network of tracks. Nat. Nanotechnol. 7(3), 169–173 (2012)
23. Woods, D., Chen, H., Goodfriend, S., Dabby, N., Winfree, E., Yin, P.: Active self-assembly of algorithmic shapes and patterns in polylogarithmic time. In: Proceedings of 4th Conference of Innovations in Theoretical Computer Science, pp. 353–354 (2013)
Multi-agent Trajectory Prediction and Generation with Topological Invariants Enforced by Hamiltonian Dynamics

Christoforos I. Mavrogiannis and Ross A. Knepper
Sibley School of Mechanical and Aerospace Engineering, Cornell University, Ithaca, USA
Department of Computer Science, Cornell University, Ithaca, USA
Abstract. We present a planning framework for decentralized navigation in dynamic multi-agent environments where no explicit communication takes place among agents. Our framework is based on a novel technique for computationally efficient multi-agent trajectory generation from symbolic topological specifications. At planning time, this technique allows an agent to generate a diverse set of potential future scene evolutions in the form of Cartesian, multi-agent trajectory representations. The planning agent selects and executes the next action assigned to it from the evolution of minimum cost. The proposed strategy enables an agent to maintain a smooth and consistent navigation behavior that is robust to behaviors generated by heterogeneous agents and to unexpected events such as agents with changing intentions. Simulation results demonstrate the efficacy of our approach in responding to such situations and its potential for real-world applications in crowded environments.
1 Introduction
Several real-world, multi-agent navigation domains, such as pedestrian or street environments, prohibit the use of explicit communication among agents. Thus, rational agents need to employ mechanisms for predicting the behaviors of others in order to plan collision-free paths. Although predicting the behaviors of others in a detailed way is challenging, the assumption of rationality, combined with the constraints imposed on agents by the environment bounds, results in the definition of a set of qualitatively distinct, global planning alternatives in the joint trajectory space of all agents, corresponding to different strategies of joint collision avoidance. Being cognizant of this structure, agents could anticipate different classes of unfolding multi-agent dynamics. This could enable them to execute actions of global outlook and consistency, which could allow for consistently expressive, smooth and efficient motion, even in the face of unexpected events, such as the appearance of agents with heterogeneous policies or agents with changing intentions. This ability is of particular importance for operation in human-populated
domains, where agents continuously read and react to the rapidly changing environment. Important domains with these properties include automobile traffic in parking lots and pedestrian traffic in communal spaces like hallways. Extending an earlier workshop paper [11], we approach this vision by contributing (1) a symbolic abstraction that allows us to classify and enumerate distinct multi-agent trajectory alternatives; (2) a multi-agent navigation planner that can drive a set of agents from a starting configuration to a goal configuration, while satisfying topological trajectory constraints; (3) an online, reactive navigation planning algorithm that makes use of the aforementioned multi-agent planner as a predictive mechanism; (4) simulation results, demonstrating the ability of the multi-agent planner to generate a broad range of topologically distinct, multi-agent trajectories; (5) simulation results, demonstrating the ability of the online algorithm to generate smooth, collision-free behaviors, even under challenging conditions such as the emergence of agents running different policies or agents with changing intentions.
2 Related Work
In the area of decentralized multi-agent navigation, a significant amount of research has been devoted to the design of planning algorithms, based on mechanisms for predicting the joint behavior of multiple agents. The majority of the literature typically assumes that navigating agents employ policies with shared architecture, with notable examples including the Social Force [5] and Reciprocal Velocity Obstacle [20] frameworks as well as more recent algorithms of similar basis (e.g. Moussaid et al. [16], Karamouzas et al. [6]). These works do not make explicit predictions of other agents’ trajectories but make decisions under strong assumptions on their behaviors. The problem of predicting the trajectories of multiple navigating agents in real time and for an adequate horizon to allow for motion planning is challenging, as suggested by the literature on tracking (e.g. [10]). This has motivated roboticists to look for more practical alternatives for multi-agent motion prediction. In particular, several approaches have leveraged the coupling of agents’ decision making in multi-agent navigation as a way to guide the motion planning process. Some of them have employed learning techniques [4,7,9,19] to develop models for prediction and generation of humanlike trajectories whereas others have employed heuristics to directly exploit the topological structure of the problem [8,14] as a more tractable alternative for explicit trajectory prediction. This paper is closer to the latter class of work. We leverage the topological properties of the spatiotemporal structure of the multi-agent navigation planning problem to generate diverse multi-agent trajectory predictions. In contrast to our past work [12–15], in which we used this topological structure at an abstract level to design an action selection mechanism, this paper introduces a framework for trajectory generation from symbolic, topological specifications which allows for smoother and more consistent behaviors. Our framework is based on the method of Berger [2], which allows for braiding multi-particle trajectories into desired
topological patterns. Our prediction mechanism grows trajectories from agents’ initial configurations to agents’ predicted destinations in topologically distinct ways. This allows us to introduce desired global properties to the trajectory, in contrast to typical trajectory optimization methods (e.g. [21]), which act on a trajectory locally. Based on this mechanism, we present an online algorithm that generates a diverse set of distinctly entangled multi-agent trajectories and evaluates them with respect to their quality and likelihood, to select an action that best adapts to other agents’ impending behaviors and preferences. This allows for rapid adjustment to the changing environment and facilitates robustness to unexpected events such as the emergence of heterogeneous agents or agents with changing intentions.
3 Foundations
We consider the planning problem for an autonomous agent that navigates towards its destination in a known environment where other agents are also navigating towards their respective destinations. The agent aims at reaching its destination by following an efficient, smooth, collision-free trajectory and assumes that others share similar objectives, although it has no knowledge of their specific policies. The agent is not explicitly communicating with others and thus has no knowledge of their intended destinations or planned paths or policies but is able to perfectly observe their motion. Assuming that others share similar abilities and objectives, the agent may form a belief about how they are going to move in the future so that it can plan a safe and efficient path towards its destination. In this paper, we present an approach inspired by the point vortex problem [1] from fluid dynamics. We design a planning framework built around the observation that the collision avoidance process for agents navigating on a plane resembles the dynamics of interacting point vortices in two dimensions. Treating agents as point masses subjected to vortex dynamics allows us to synthesize multi-agent trajectories with desired topological properties. At planning time, this technique enables a planning agent to construct several qualitatively distinct predictions about the future behavior of the system. This allows for an informed action selection towards facilitating a rapid and robust adjustment to the changing environment. Since the predictions are made with a global outlook, this strategy results in a consistently smooth and intent-expressive behavior, even in the face of unexpected events such as agents changing intentions or violating the assumption of rationality. Our approach is based on the method of Berger [2,3] for generating braided trajectories of multi-particle systems from topological invariants [3]. In this section, we introduce some preliminaries about point vortex flows, review the method of Berger [2] and present the key components of our approach.
3.1 Hamiltonian Motion for Multi-particle Trajectory Braiding
A dynamical system whose evolution is described by Hamilton's equations is called a Hamiltonian system. Under the Hamiltonian formalism, the state of a system is completely described by a set of variables corresponding to the generalized coordinates of the system's degrees of freedom and their conjugate momenta. Hamilton's equations relate the evolution of an energy function, called the Hamiltonian, to the evolution of the coordinates and momenta for all degrees of freedom of the system. In particular, denoting by q_j and p_j, j ∈ M = {1, . . . , m}, the generalized coordinate and conjugate momentum of the jth degree of freedom of a Hamiltonian system respectively, its evolution is given by:

$$\dot{q}_j = \frac{\partial H}{\partial p_j}, \qquad \dot{p}_j = -\frac{\partial H}{\partial q_j}, \qquad \frac{dH}{dt} = 0, \qquad j = 1, \dots, m, \tag{1}$$
where the dot notation indicates time derivatives and H denotes the Hamiltonian of the system (which is preserved), defined as its total energy, i.e., the sum of the total kinetic and potential energy of all degrees of freedom of the system. Let us now combine the coordinates and momenta for each degree of freedom into a complex coordinate z_j = q_j + ip_j, j ∈ M. We define an analytic function

$$F(z_1, \dots, z_m) = \Psi(z_1, \dots, z_m) + iH(z_1, \dots, z_m), \tag{2}$$
where Ψ : ℂ^m → ℝ and H : ℂ^m → ℝ. Berger [2] showed that the Hamiltonian flow (1) results in motion ż_j, j ∈ M, that follows the Wirtinger derivative of Ψ with respect to z_j. Therefore, the collective Hamiltonian motion of all degrees of freedom follows the gradient of Ψ and points towards its direction of maximum increase. Berger [2] used this finding to generate braided trajectory patterns for systems of two and three particles. In particular, he replaced Ψ with topological invariants [3], forcing the system to evolve along the growth of the topological invariant.

3.2 Topological Invariants of Particle Trajectories
Consider a set of n particles following trajectories ξ_i : [0, T] → ℝ², i ∈ N, from time t = 0 to time t = T, and let us collect these trajectories into a system trajectory Ξ = (ξ_1, . . . , ξ_n). A topological invariant over Ξ may be defined as a function Ψ : Ξ → ℝ that maps the system trajectory to a real number that characterizes the spatiotemporal topology of the system dynamics. For any distorted, topology-preserving trajectory Ξ̃ ≠ Ξ with the same endpoints Ξ̃(0) = Ξ(0), Ξ̃(T) = Ξ(T), for which ξ̃_i(t) ≠ ξ̃_j(t), ∀t ∈ (0, T) and i ≠ j ∈ N, a topological invariant is preserved, i.e., Ψ(Ξ̃) = Ψ(Ξ).

3.2.1 The Winding Number
The so-called Winding Number is a topological invariant of particular interest for our problem. Consider a curve γ : [0, t] → ℂ\{0}. The complex winding number
of the curve γ, from time 0 to time t, is defined as:

$$\lambda(t) = \frac{1}{2\pi i} \oint_{\gamma} \frac{dz}{z}, \tag{3}$$

where z ∈ ℂ. Let us express γ in polar coordinates as γ(t) = r(t)e^{iθ(t)}, where r(t) = ‖γ(t)‖ and θ(t) = ∠γ(t). Then, through the use of the Cauchy integral formula, (3) may be decomposed into:

$$\lambda(t) = \frac{1}{2\pi i}\int_{0}^{t} \frac{\dot{r}}{r}\, dt + \frac{1}{2\pi}\int_{0}^{t} \dot{\theta}\, dt \tag{4}$$

and computing the integrals yields:

$$\lambda(t) = \frac{1}{2\pi i}\log\frac{r(t)}{r(0)} + \frac{1}{2\pi}\left(\theta(t) - \theta(0)\right). \tag{5}$$
The real part of this integral, w = Re(λ), is a topological invariant, counting the number of times the curve γ encircled the origin in the time interval [0, t]. In other words, fixing the endpoints of the curve, any topology-preserving deformations are mapped to the same value of the winding number. For closed curves, the imaginary part of the winding number is zero. In the following section, considering open curves (evolving trajectories), we describe how it can be used to enforce Hamiltonian motion on interacting particles.
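For sampled trajectories, the real part of Eq. (5) can be approximated by accumulating unwrapped relative angles. A minimal sketch, assuming two synchronized arrays of 2D samples:

```python
import numpy as np

def winding_number(xi_a, xi_b):
    """Approximate the pairwise winding number of two sampled trajectories
    (arrays of shape (T, 2)), i.e., the real part of Eq. (5)."""
    rel = np.asarray(xi_a) - np.asarray(xi_b)
    theta = np.unwrap(np.arctan2(rel[:, 1], rel[:, 0]))  # winding angle
    return (theta[-1] - theta[0]) / (2.0 * np.pi)
```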
Fig. 1. Spacetime plot of the trajectories of two agents, navigating in a circular workspace (left) and projection of their trajectories until time t1 , onto the xy plane, along with the definition of their pairwise winding angle and winding number (right).
3.3 Two-Particle Vortex Motion
In this section we put the pieces together to demonstrate a motivating example from fluid dynamics that constitutes the computational basis of our approach. Consider a system of two particles, placed initially at positions a = (a_x, a_y) ∈ ℝ² and b = (b_x, b_y) ∈ ℝ² with respect to a fixed coordinate system, and assume that a vortex¹ lies between them. Point vortex motion prescribes that the x and y coordinates are conjugate to each other (e.g. the conjugate momentum to a_x is a_y) [1]. Let us define the function γ_ab from the previous section to track the quantity a − b, i.e., let us set γ_ab(t) = r_ab(t)e^{iθ_ab(t)}, where r_ab = ‖a − b‖ and θ_ab(t) = ∠γ_ab(t). Assuming unit vorticity, the Hamiltonian for this system may be written as:

$$H = -\frac{1}{2\pi}\log r_{ab}. \tag{6}$$
Similarly to Sect. 3.2.1, we may define the complex winding number of γ_ab as:

$$\lambda_{ab}(t) = \frac{1}{2\pi i}\log(r_{ab}) + \frac{1}{2\pi}\left(\theta_{ab}(t) - \theta_{ab}(0)\right), \tag{7}$$
and let us set its real part to a dedicated variable

$$w_{ab}(t) = \frac{1}{2\pi}\left(\theta_{ab}(t) - \theta_{ab}(0)\right), \tag{8}$$
denoting the pairwise winding number of the two curves (see Fig. 1 for a graphic representation of the pairwise winding number). We may notice that Im(λ_ab) = H. Thus, according to Sect. 3.1, the Hamiltonian flow for this system maximizes the growth of the real part Re(λ_ab) = w_ab. This motion corresponds to the two points rotating about each other at a constant radius, in a counterclockwise direction. Hamilton's equations for this system may be derived as:

$$(\dot{a}_x, \dot{a}_y) = \left(\frac{\partial H}{\partial a_y}, -\frac{\partial H}{\partial a_x}\right) = \frac{1}{2\pi}\left(-\frac{a_y - b_y}{r_{ab}^2}, \frac{a_x - b_x}{r_{ab}^2}\right), \tag{9}$$

$$(\dot{b}_x, \dot{b}_y) = \left(\frac{\partial H}{\partial b_y}, -\frac{\partial H}{\partial b_x}\right) = \frac{1}{2\pi}\left(-\frac{b_y - a_y}{r_{ab}^2}, \frac{b_x - a_x}{r_{ab}^2}\right). \tag{10}$$
We may control the directionality of the rotation by switching the signs on the right-hand sides of Eqs. (9) and (10).
3.4 Two-Agent Collision Avoidance as Vortex Motion
Treating agents as particles, we may use the outlined method of Berger [2] as a mechanism for generating two-agent collision-avoidance maneuvers of desired topological specification. Given a winding number w_ab, by multiplying the right-hand sides of Eqs. (9) and (10) with sign(w_ab), we have a planning rule that
¹ A vortex is a region in a fluid in which the flow revolves around an axis line.
allows us to grow trajectories for a and b that follow the direction indicated by w_ab, with sign(w_ab) > 0 and sign(w_ab) < 0 corresponding to right- and left-hand side collision avoidance respectively. In a two-agent scene, this may serve as a prediction of the emerging joint behavior. In a scene with high uncertainty, where no agent has committed to a passing side, this mechanism allows a planning agent to anticipate both outcomes. This is useful as it allows the agent to either enforce its own preference or adapt to the preference of the other agent. In the following section, we show how we use this method for synthesizing trajectories of complex topological specifications in environments with multiple agents.
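A minimal sketch of this planning rule, integrating Eqs. (9) and (10) with an explicit Euler step (the step size and the integration scheme are our choices, not the paper's):

```python
import numpy as np

def vortex_step(a, b, w_ab, dt=0.01):
    """One Euler step of Eqs. (9)-(10); sign(w_ab) selects the passing side."""
    d = a - b                              # separation vector a - b
    r2 = float(d @ d)                      # squared distance r_ab^2
    s = np.sign(w_ab) / (2.0 * np.pi * r2)
    a_dot = s * np.array([-d[1], d[0]])    # Eq. (9): perpendicular to d
    b_dot = -a_dot                         # Eq. (10): the pair counter-rotates
    return a + dt * a_dot, b + dt * b_dot
```

Iterating this step with w_ab > 0 makes the two agents revolve counterclockwise about their midpoint (a right-hand side avoidance), while w_ab < 0 reverses the direction.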
3.5 Multi-agent Trajectory Generation from Topological Specifications
Consider the problem of centralized trajectory planning for driving n agents from their initial positions S = (s_1, . . . , s_n) ∈ ℝ^{2n} to their destinations D = (d_1, . . . , d_n) ∈ ℝ^{2n} in a collision-free fashion, while following a global topological specification w prescribing passing sides to agents. In this paper, we model w as a tuple of pairwise winding numbers w = (w_12, w_13, . . .) from the set of such tuples W. Assuming that each agent passes each other agent exactly once on its way to its destination (agents do not loop around others), the magnitude of w_ij, i ≠ j ∈ N, is not important, so we will be using w_ij to refer to sign(w_ij). The cardinality of the set of possible specifications is $|W| = 2^{\binom{n}{2}}$, corresponding to all possible combinations of passing sides for all agents. It should be noted that although all combinations in W are topologically possible, in practice only a subset of them are meaningful and likely, given agents' state history and under the assumption of rationality. Section 3.6 addresses the problem of evaluating the likelihood and the feasibility of a topological specification.

We now describe a policy π : ℝ^{2n} × W → ℝ^{2n} that can be sequentially iterated to produce a multi-agent trajectory that satisfies a topological specification w. The policy (referred to from now on as HTTG, standing for Hamiltonian Topological Trajectory Generation) prescribes an action u_i ∈ ℝ² to every agent i ∈ N, synthesized from a weighted consideration of all pairwise collision avoidance reactions between the agent and all others, towards meeting the pairwise specifications contained in w. The policy is executed repeatedly until all agents reach their destinations. It may be formulated for agent i as follows:

$$u_i = \nu_i \cdot k\left(u_i^{att} + u_i^{rep}\right), \tag{11}$$
where ν_i ∈ ℝ is an agent's desired speed, u_i^att and u_i^rep are potentials attracting the agent towards its destination and repulsing it from others respectively, and k ∈ ℝ is a normalization constant. The potential

$$u_i^{att} = k_{att}(d_i - q_i) \tag{12}$$
attracts the agent from its current state q_i towards its destination d_i, with k_att being an importance weight. The potential

$$u_i^{rep} = k_{rep}\sum_{j \neq i}^{N} c_{ij}\, w_{ij}\, v_j^i, \tag{13}$$
repulses agent i from each other agent j ∈ N, j ≠ i, through the velocity v_j^i derived from Eqs. (9) and (10), with a degree of consideration equivalent to the criticality of their pairwise collision avoidance, expressed by c_ij ∈ ℝ (the closer two agents are, the more critical their avoidance becomes), and along the direction indicated by w_ij, whereas k_rep is an importance weight. The choice of the weighting factors k_att, k_rep expresses the relative significance between goal attraction and collision avoidance. The criticality term is designed to be a polynomial function of the distance between two agents, activated when the distance becomes lower than a threshold. By sequentially executing the outlined policy, in parallel for all agents, in equal time steps of length dt, the system of agents is forced to follow the specification w. Note that this method does not guarantee the attainment of the desired topology. Depending on the number of agents, their initial configurations and intended destinations, and the parameter tuning, the method has varying control over the topological properties of agents' trajectories. Section 4 explores the performance of the method on scenarios with different numbers of agents.
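For illustration, here is a compact sketch of one synchronous HTTG iteration over all agents; the criticality model c_ij, the gains, and the normalization are stand-ins for the paper's tuned choices.

```python
import numpy as np

def httg_step(q, d, w, nu=1.0, k_att=1.0, k_rep=1.0, dt=0.05):
    """One HTTG iteration (Eqs. (11)-(13)). q, d: (n, 2) arrays of agent
    states and destinations; w[i][j] = -w[j][i]: pairwise winding signs."""
    n = len(q)
    q_next = np.array(q, dtype=float)
    for i in range(n):
        u_att = k_att * (d[i] - q[i])                 # Eq. (12)
        u_rep = np.zeros(2)
        for j in range(n):
            if j == i:
                continue
            diff = q[i] - q[j]
            r2 = float(diff @ diff)
            v = np.array([-diff[1], diff[0]]) / (2.0 * np.pi * r2)  # v_j^i
            c = 1.0 / (1.0 + r2)                      # placeholder criticality
            u_rep += c * np.sign(w[i][j]) * v         # summand of Eq. (13)
        u = u_att + k_rep * u_rep
        u = nu * u / (np.linalg.norm(u) + 1e-9)       # Eq. (11), k as normalizer
        q_next[i] = q[i] + dt * u
    return q_next
```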
3.6 TANP: Topologically Adaptive Navigation Planning
In this section, we present an online, decentralized navigation planning algorithm that makes use of the described method for generating online a set of topologically distinct, multi-agent trajectory predictions. The algorithm comprises the following sequence of actions: (1) predict the destinations of other agents; (2) generate a set of candidate multi-agent trajectories that drive agents from their current locations to their predicted destinations; (3) evaluate candidates with respect to a cost function; (4) execute the next action assigned to the planning agent from the lowest-cost candidate. In the following subsections, we describe the main components of the algorithm and provide a detailed presentation of it in pseudocode format (see Algorithm 1).

3.6.1 Destination Prediction
In Sect. 3.5, it was assumed that the planning policy has access to the destinations of other agents. In the settings we are considering (no explicit communication among agents), this is not the case. Thus, the planning agent needs to make a prediction about the destinations of others in order to use the policy. However, in practice, an agent only interacts with others for as long as they lie within its sensing range, which for current robotic systems is quite limited. During this amount of time, other agents' observed behaviors may or may not be revealing about their specific destination. And in fact, detailed predictions of
agents’ destinations may not be sufficiently informative regarding agents’ future behaviors; in crowded environments, the collision avoidance process is a more significant influence over agents’ behaviors. For this reason, we take a more practical approach, focusing on coarse predictions of agents’ future locations. In particular, we assume that an agent’s sensing range has the shape of a disk of radius R, centered at the agent’s position, qi . Any agent lying outside of this disk is not perceived by the agent whereas any agents lying behind the robot are ignored at the planning stage. For each one of the perceived and actively considered agents, we approximate their intended direction of motion by fitting a line to their recent, observed trajectory and projecting their current velocity on it. We then propagate their current speed Fig. 2. The destination prediction mechaalong this direction until it intersects nism. The red agent makes destination prethe boundary of the sensing disk. For dictions for all agents, lying within its cirour planning algorithm, that point of cular sensing disk and in front of it. intersection is considered to be that agent’s destination (see Fig. 2). This prediction is expected to be a coarse approximation of where an agent is heading. However, since our algorithm runs in replanning cycles, this approximation provides a sufficient amount of detail for the trajectory prediction mechanism of Sect. 3.5. This mechanism makes use of the assumption that agents act rationally, i.e., agents’ behaviors are driven by an incentive of making progress towards their destinations. Finally, alternative methods of filtering could be employed to provide more accurate destination prediction; however, this is not our focus and as our will be shown in Sect. 4, this simplistic model may yield the desired performance. 3.6.2 Outcome Evaluation The set W contains symbolic representations of topologically distinct outcomes for the system of all considered agents. Naturally, a significant question that arises is: which outcome should the planning agent trust and follow? We approach this problem with the following sequence of computations: (1) we first evaluate an outcome with respect to its likelihood; (2) we then generate trajectory representations for the set of the K most likely outcomes WK ⊂ W, using the policy presented in Sect. 3.5; (3) finally, we evaluate these K best outcomes with respect to the physical properties of their trajectory representations. Probability of an Outcome: An outcome is initially encoded symbolically as a tuple w that prescribes how agents avoid each other throughout the course of the scene. From a topological perspective, these symbols are independent of each other; any motion is allowed even if it is not efficient. However, from a real-world
point of view, the collision-avoidance strategies that agents employ to avoid one another are coupled, and modeling the complex probabilistic relationships among them is a challenging problem. For our purposes in this paper, we are interested in finding a way to bias our search towards the outcomes that are more likely to occur. We do so by using the following expression:

$$P(w\,|\,\Xi_{past}) = P(w_{12}, w_{13}, \dots\,|\,\Xi_{past}) \propto \frac{1}{Z}\prod_{ij} P(w_{ij}\,|\,\Xi_{past}), \tag{14}$$

where Ξ_past denotes agents' past trajectories and Z is a normalization constant across all w ∈ W. This expression was derived by factorizing P(w_12, . . . | Ξ_past) using the product rule and then substituting each factor with its Bayes' rule expression. Similarly to our past work [14], we model P(w_ij | Ξ_past) by employing the physical quantity of angular momentum. For two agents i, j, navigating on a plane, their angular momentum L^ij lies along the z axis. Notice that the sign of the z-component of the momentum, L_z^ij, is an indicator of agents' passing side and thus of the winding number of their trajectories w_ij, with L_z^ij > 0 indicating the emergence of w_ij > 0 (right-hand side collision avoidance) and L_z^ij < 0 indicating the emergence of w_ij < 0 (left-hand side collision avoidance). We incorporate the momentum as a heuristic in a sigmoid model as follows:

$$P(w_{ij}\,|\,\Xi_{past}) = \frac{1}{1 + \exp(-w_{ij} L_z^{ij})}. \tag{15}$$
The greater |L_z^ij| is, the greater the mutual intention or preference of agents i and j for a collision avoidance along the direction of L_z^ij.

Trajectory Evaluation: We evaluate a trajectory representation Ξ_w of an outcome w by computing its total energy E, its required immediate acceleration A, and its safety cost S (each a map from trajectories to ℝ). The energy measure (sum of squared speeds throughout the trajectory) gives an estimate of the efficiency of an outcome, whereas the acceleration measure is indicative of the aggressiveness of the maneuvers required to comply with an outcome. We model the safety cost as S(Ξ) = exp(−d_min), where d_min ∈ ℝ is the minimum distance between any pair of agents in a trajectory Ξ. Note that other cost functions could be used to incorporate different considerations such as social comfort (e.g. [18]).

3.6.3 Decision Making
We first rank outcomes at a symbolic level through the use of the probability distribution presented in Sect. 3.6.2 and determine the set of the K most likely outcomes W_K. Then, we determine the outcome of lowest cost:

$$C(\Xi) = \alpha_e E + \alpha_a A + \alpha_s S, \tag{16}$$
where α_e, α_a, α_s are importance weights, and finally extract the optimal outcome through the following optimization scheme:

$$w^* = \arg\min_{w \in W_K} C(\Xi_w). \tag{17}$$
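Once the momentum heuristics and a rollout routine (e.g. an HTTG iterator like the sketch above) are available, the evaluation pipeline of Sects. 3.6.2–3.6.3 fits in a few lines. Everything below, from the enumeration to K, is an illustrative sketch rather than the authors' implementation.

```python
import itertools
import numpy as np

def outcome_probabilities(L_z, pairs):
    """Eqs. (14)-(15): map each outcome tuple w (one sign per agent pair)
    to its probability, given momentum heuristics L_z[(i, j)]."""
    probs = {}
    for w in itertools.product([-1.0, 1.0], repeat=len(pairs)):
        p = 1.0
        for w_ij, pair in zip(w, pairs):
            p *= 1.0 / (1.0 + np.exp(-w_ij * L_z[pair]))   # Eq. (15)
        probs[w] = p
    z = sum(probs.values())                                # normalization Z
    return {w: p / z for w, p in probs.items()}

def select_outcome(probs, rollout, cost, K=5):
    """Eqs. (16)-(17): roll out the K most likely outcomes, keep the cheapest."""
    best_k = sorted(probs, key=probs.get, reverse=True)[:K]
    return min(best_k, key=lambda w: cost(rollout(w)))
```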
Fig. 3. Illustration of the planning scheme. At every replanning cycle, the planning agent generates a set of diverse (topologically distinct) predictions about the joint future behavior of all agents, evaluates them with respect to a cost function C and executes the action assigned to it from the prediction of lowest cost.
The planning agent executes the next action assigned to it from the trajectory of lowest cost Ξ_{w*}. Figure 3 depicts a graphic representation of the planning scheme.

3.6.4 Pseudocode
Algorithm 1 summarizes the described algorithm (Topologically Adaptive Navigation Planning – TANP) in pseudocode format. The algorithm runs in replanning cycles for as long as the boolean variable AtGoal is set to False, indicating that the agent has not reached its destination yet. At every cycle, the agent first determines a set of reactive agents, i.e., agents that lie within the robot's sensing disk and to the front of the robot's heading (function Get_Reactive_Agents). Then, function Predict_Destinations outputs predictions for the destinations of the reactive agents, and Get_Outcomes returns a set of topological representations for outcomes that could emerge in the remainder of the execution. Function Get_Outcome_Probability returns the probability for each of the outcomes considered, and function Get_Best_Outcomes returns the K best outcomes. Function HTTG executes the HTTG policy and generates trajectory representations for these outcomes, and function Score_Trajectories evaluates them with respect to the cost function considered. Finally, function Get_Best_Next_Action returns the next action for the planning agent from the trajectory of lowest cost, and function Execute_Action executes that action. The distance between the resulting agent state and its destination is compared to the predefined threshold, and the flag AtGoal is updated to True in case the agent is sufficiently close to its destination.
3.7 Complexity and Practical Considerations
The most computationally intense component of our algorithm is the estimation of the outcome probabilities. For n agents, this computation runs in time $O(2^{n^2})$ – the rest of the computations run in polynomial time. In practice, a replanning cycle of TANP on a scenario involving 4 agents, and thus the evaluation of 64
Algorithm 1. TANP(q, d, Ξ_past)
Input: q − agent's current state; d − agent's intended destination; Ξ_past − state history of all agents; K − number of outcomes to consider; ε − desired distance-to-goal threshold.
1:  AtGoal ← False
2:  while ¬AtGoal do
3:    R ← Get_Reactive_Agents(Ξ_past)
4:    D ← Predict_Destinations(Ξ_past, R)
5:    W ← Get_Outcomes(R)
6:    P ← Get_Outcome_Probability(W, Ξ_past)
7:    W_K ← Get_Best_Outcomes(P, W, K)
8:    Z ← ∅
9:    for all w ∈ W_K do
10:     Ξ_pred ← HTTG(Ξ_past, w, D)
11:     Z ← {Z, Ξ_pred}
12:   C ← Score_Trajectories(Z)
13:   u ← Get_Best_Next_Action(Z, C)
14:   q ← Execute_Action(u)
15:   if ||q − d|| < ε then
16:     AtGoal ← True
17: return None
The most computationally intense component of our algorithm is the estimation of the outcome probabilities. For n agents, this computation runs in time O(2^{n^2}); the rest of the computations run in polynomial time. In practice, a replanning cycle of TANP on a scenario involving 4 agents, and thus the evaluation of 64 topological classes, with K = 5 runs at an average of 42 ms, with a worst case of 203 ms, in a non-optimized MATLAB implementation on a 2015 MacBook Pro with a 2.5 GHz Intel Core i7 processor, running macOS High Sierra. Porting to a faster language and optimizing parts of the code could vastly improve performance. Under the current design, scaling to large n is not possible. However, for a mobile robot application, we argue that such scaling is also not needed in practice: sensing limitations would prohibit the emergence of a large number of agents, and even if more agents are sensed, pruning them to the subset of directly reactive agents is a motivated, human-inspired way of reducing the load. Future work involves the design of an online, data-driven topology-classification mechanism that would enable agents to directly estimate the most likely candidates, without brute-forcing their evaluation.
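The counts above follow from assigning one binary label, such as a passing side (the sign of L_z^{ij}), to each unordered pair of agents, which yields 2^{n(n−1)/2} joint outcomes. A minimal sketch of this enumeration follows; the ±1 pair labels are our own illustrative encoding of the topological classes.

from itertools import combinations, product

def enumerate_outcomes(n):
    # One binary label per unordered agent pair: 2^(n(n-1)/2) joint outcomes.
    pairs = list(combinations(range(n), 2))
    labelings = product((+1, -1), repeat=len(pairs))
    return [dict(zip(pairs, lab)) for lab in labelings]

for n in (2, 3, 4, 5):
    print(n, len(enumerate_outcomes(n)))  # prints 2, 8, 64, 1024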
4 Evaluation
In this section, we present the performance of the offline planner in generating trajectories of desired topological properties, as well as the behaviors generated by the online algorithm in different types of scenarios.
4.1 Offline Performance
We demonstrate the performance of the offline motion planner in generating topologically distinct, multi-agent navigation trajectories. We consider 4 different conditions, corresponding to different numbers of agents (2, 3, 4 and 5 agents), navigating in a circular workspace of radius 2.5 m (agents are represented as disks of radius 0.3 m). For each condition n ∈ {2, 3, 4, 5}, we randomly generate 100 distinct scenarios by assigning agents initial and final locations that lead to challenging multi-agent encounters, requiring competent collision avoidance maneuvers. We execute each scenario 2^{n(n−1)/2} times, each time with a distinct topological specification. We measure the success rate of the planner in generating the desired topology under all conditions considered and report it in Table 1 (a trial is considered successful if the planner was able to produce all of the distinct topologies). The planner parameters were kept constant across conditions and scenarios. It can be observed that the planner performance drops as the number of agents n increases: the method becomes more sensitive to parameter tuning, as the effects of the chaotic nature of the vortex problem become more significant. Finally, Fig. 5 shows examples of how the outlined trajectory generation mechanism may be used for online prediction in scenarios involving two, three and four agents.

Table 1. Success rate of HTTG in generating the desired, topologically distinct executions for each of the 100 scenarios considered per condition.

Condition            2 agents   3 agents   4 agents   5 agents
Number of outcomes   2          8          64         1024
Success (%)          99.75      89.70      65.48      1
4.2 Comparison with Trajectory Optimization
To the best of our knowledge, this is the first work that addresses the problem of generating trajectories for multiple agents in distinct topologies, specified a priori. Conceptually similar is the work of Rösmann et al. [17], which considered the problem of generating multiple, topologically distinct trajectories for a single agent in a static environment with obstacles. However, optimizing multiple trajectories together while accounting for topological constraints and ensuring trajectory smoothness is a challenging problem. Typical gradient-based methods (e.g., [21]) act on the trajectory locally, with costs comprising several objectives; thus the gradient action could lose sight of the global, topological specification in favor of a different, local cost improvement.

Table 2. Success rates and computation times of HTTG and CHOMP over 500 randomly generated 2-agent scenarios.

Planner   Success (%)   No. of iterations   Time (s)
CHOMP     78.80         80.3325             0.1291
HTTG      98.40         86.8862             0.0048

Furthermore, a differentiable cost
function that would quantify the progress towards a desired topological outcome is hard to hand-design, and we were not able to find any functions of that form in the literature. Our method constitutes a principled alternative to trajectory optimization for this problem. Instead of locally reshaping a set of trajectories according to the gradients on the trajectory waypoints to attain local optima, our method grows the trajectories from their initial conditions with a policy that has global knowledge of the desired trajectory topology. Similar to gradient-based optimization techniques, our method cannot guarantee the attainment of global optima. However, the physical encoding of the topological specification into the planning mechanism results in satisfactory performance for a class of problems. To illustrate the difficulty of automatically synthesizing multi-agent trajectories of desired topological specifications through trajectory optimization techniques, we consider a simple case study in which we compare the performance of HTTG with that of CHOMP [21]. We randomly generate 500 different scenarios involving 2 agents navigating towards opposing sides of a circular workspace (the workspace has a 5 m diameter, starting positions are uniformly distributed along the circumference, and speeds are normally distributed between 0.3 m/s and 1.5 m/s for each agent). For each scenario, we randomly sample a side on which the agents should pass one another. To encode the objective of respecting a passing side in CHOMP, we construct a cost functional F_top = (1/2)(w_ab − w_des)^2, which approaches zero as the winding number w_ab of the agents' trajectories approaches the winding number w_des corresponding to the desired passing side. Table 2 illustrates the performance of the two approaches, measured with respect to success rate and computation time (non-optimized MATLAB implementation on a 2015 MacBook Pro with a 2.5 GHz Intel Core i7 processor, running macOS High Sierra). For CHOMP, a trial is considered successful if it generates trajectories of the desired topology within 500 iterations, whereas for HTTG, a trial is considered successful if the desired topology is achieved once the agents reach their destinations. It can be observed that HTTG dominates with a success rate of 98.40% (corresponding to 492/500 successful trials). The two planners are comparable in terms of the number of iterations, but HTTG requires almost two orders of magnitude less time. The benefits provided by HTTG in terms of success rate and computation time make the consideration of multiple trajectory topologies at planning time a more practical strategy.
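For reference, the winding number of a 2-agent execution can be estimated from sampled trajectories by accumulating the angle of the relative position vector. The sketch below is our own reconstruction (the discretization and sign convention are assumptions); it also evaluates the cost functional F_top used to encode the passing side for CHOMP.

import numpy as np

def winding_number(xa, xb):
    # Approximate winding number of the relative vector from agent a to b.
    # xa, xb: (T, 2) arrays of planar positions sampled over time.
    rel = xb - xa
    ang = np.unwrap(np.arctan2(rel[:, 1], rel[:, 0]))
    return (ang[-1] - ang[0]) / (2.0 * np.pi)

def f_top(xa, xb, w_des):
    # Topology cost F_top = 0.5 * (w_ab - w_des)^2.
    return 0.5 * (winding_number(xa, xb) - w_des) ** 2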
4.3 Online Performance
In order to demonstrate the benefits of our online Algorithm 1, we perform a simulation study comprising a series of experiments in the circular workspace considered in the previous sections (diameter 5 m). We consider 9 different experiment configurations, each corresponding to a different group of navigating agents. In particular, we consider groups of 2, 3 and 4 agents, navigating under three different conditions: (a) a homogeneous condition, in which all agents run the same planner; (b) a heterogeneous condition, in which one agent runs our planner and the others move straight to their goals without avoiding collisions; (c) a heterogeneous condition, in which one agent runs our planner and the others are
Fig. 4. Trajectory quality for all experiment configurations considered. For each group size, the same 200 randomly generated scenarios are executed under each of the conditions considered with both planners. For each condition and measure, we perform a paired Student's t-test to compare the populations yielded by TANP and ORCA. Points with black circular boundaries indicate rejection of the null hypothesis with p-value < 0.001, whereas points with star boundaries indicate rejection with p-value < 0.05.
changing their intended destination twice, without avoiding collisions. Note that the two latter cases are particularly challenging for decentralized planners, as they typically rely heavily on the assumption of homogeneity. For reference, we perform the same experiments with Optimal Reciprocal Collision Avoidance (ORCA) [20] (clearance and speed parameters tuned similarly to ensure a fair
Fig. 5. Overlaid predictions made by a TANP agent (red color) as it navigates towards the red landmark in environments with 2, 3 and 4 agents.
comparison). We quantify the performance of the planners with respect to four aspects of trajectory quality: (1) experiment time, measured as the time taken by the last agent to reach its destination; (2) safety, measured as the minimum distance between any two agents for the homogeneous condition, and as the minimum distance between a TANP/ORCA agent and any other agent for the heterogeneous conditions; (3) path efficiency, measured as the ratio between the length of the optimal path to the goal and the length of the path a TANP/ORCA agent followed (averaged over agents in the homogeneous case); (4) trajectory acceleration, measured as the average acceleration per time step per TANP/ORCA agent throughout the experiment. Figure 4 depicts the performance of TANP and ORCA under each of the configurations considered. For each configuration, each planner executed the same set of 200 randomly generated scenarios. Overall, TANP exhibits the best time-efficiency for almost all configurations (Fig. 4a). When executed under homogeneous settings, TANP establishes a consistently high clearance from others, which results in a drop in path efficiency (Fig. 4c) and a high acceleration per time step (Fig. 4d). The increased time efficiency of TANP could be attributed to the implicit consensus that is reached through the consideration of joint collision-avoidance strategies; the price TANP agents pay is increased accelerations and generally lower path efficiency. On the other hand, ORCA is consistently slower but stably safe across all conditions. Under the homogeneous condition, it achieves the highest path efficiency and lowest acceleration, which is expected given its optimality-driven design. This efficiency advantage fades under the heterogeneous conditions, in contrast to TANP, which exhibits a more balanced behavior.
5 Discussion and Future Work
We presented an online planning framework for the generation of adaptive robot motion in dynamic multi-agent environments. Our framework is based on an
offline planner that generates a diverse set of multi-agent trajectory predictions. Each prediction prescribes a different, collision-avoiding behavior to the planning agent. The agent selects the prediction of lowest cost and executes the first action from it. This architecture enables an agent to make local decisions with a global outlook, and allows for the anticipation of upcoming agent interactions, enabling rapid adaptation to unexpected events such as the emergence of heterogeneous agents. Simulated examples demonstrate the performance of the offline and online parts of our framework. Future work involves (a) evaluating our algorithm in environments of more complex geometries, (b) reducing its computational load by designing a mechanism that efficiently reuses past trajectories and (c) performing real-world experiments on a robot platform.

Acknowledgment. This material is based upon work supported by the National Science Foundation under Grants IIS-1526035 and IIS-1563705. We are grateful for this support.
References

1. Aref, H.: Point vortex dynamics: a classical mathematics playground. J. Math. Phys. 48(6), 065401 (2007)
2. Berger, M.A.: Hamiltonian dynamics generated by Vassiliev invariants. J. Phys. A: Math. Gen. 34(7), 1363 (2001)
3. Berger, M.A.: Topological invariants in braid theory. Lett. Math. Phys. 55(3), 181–192 (2001)
4. Chen, Y.F., Everett, M., Liu, M., How, J.P.: Socially aware motion planning with deep reinforcement learning. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1343–1350 (2017)
5. Helbing, D., Molnár, P.: Social force model for pedestrian dynamics. Phys. Rev. E 51, 4282–4286 (1995)
6. Karamouzas, I., Skinner, B., Guy, S.J.: Universal power law governing pedestrian interactions. Phys. Rev. Lett. 113, 238701 (2014)
7. Kim, B., Pineau, J.: Socially adaptive path planning in human environments using inverse reinforcement learning. Int. J. Soc. Robot. 8(1), 51–66 (2016)
8. Knepper, R.A., Rus, D.: Pedestrian-inspired sampling-based multi-robot collision avoidance. In: Proceedings of the 2012 IEEE International Symposium on Robot and Human Interactive Communication, pp. 94–100 (2012)
9. Kretzschmar, H., Spies, M., Sprunk, C., Burgard, W.: Socially compliant mobile robot navigation via inverse reinforcement learning. Int. J. Robot. Res. 35(11), 1289–1307 (2016)
10. Ma, W.C., Huang, D.A., Lee, N., Kitani, K.M.: Forecasting interactive dynamics of pedestrians with fictitious play. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4636–4644 (2017)
11. Mavrogiannis, C., Knepper, R.A.: Decentralized navigation planning using multi-agent trajectory prediction governed by Hamiltonian dynamics. In: Workshop on Multi-robot Perception-Driven Control and Planning, IEEE/RSJ International Conference on Intelligent Robots and Systems (2018)
12. Mavrogiannis, C.I., Blukis, V., Knepper, R.A.: Socially competent navigation planning by deep learning of multi-agent path topologies. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 6817–6824 (2017)
13. Mavrogiannis, C.I., Knepper, R.A.: Decentralized multi-agent navigation planning with braids. In: Proceedings of the International Workshop on the Algorithmic Foundations of Robotics (2016)
14. Mavrogiannis, C.I., Knepper, R.A.: Multi-agent path topology in support of socially competent navigation planning. Int. J. Robot. Res. 38, 338–356 (2018)
15. Mavrogiannis, C.I., Thomason, W.B., Knepper, R.A.: Social momentum: a framework for legible navigation in dynamic multi-agent environments. In: Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, pp. 361–369 (2018)
16. Moussaïd, M., Helbing, D., Theraulaz, G.: How simple rules determine pedestrian behavior and crowd disasters. Proc. Nat. Acad. Sci. 108(17), 6884–6888 (2011)
17. Rösmann, C., Hoffmann, F., Bertram, T.: Integrated online trajectory planning and optimization in distinctive topologies. Robot. Auton. Syst. 88, 142–153 (2017)
18. Sisbot, E.A., Marin-Urias, L.F., Alami, R., Siméon, T.: A human aware mobile robot motion planner. IEEE Trans. Robot. 23(5), 874–883 (2007)
19. Trautman, P., Ma, J., Murray, R.M., Krause, A.: Robot navigation in dense human crowds: statistical models and experimental studies of human-robot cooperation. Int. J. Robot. Res. 34(3), 335–356 (2015)
20. van den Berg, J., Guy, S.J., Lin, M.C., Manocha, D.: Reciprocal n-body collision avoidance. In: Proceedings of the International Symposium on Robotics Research, pp. 3–19 (2009)
21. Zucker, M., Ratliff, N., Dragan, A., Pivtoraiko, M., Klingensmith, M., Dellin, C., Bagnell, J.A.D., Srinivasa, S.: CHOMP: covariant Hamiltonian optimization for motion planning. Int. J. Robot. Res. 32(9–10), 1164–1193 (2013)
Balancing Unpredictability and Coverage in Adversarial Patrolling Settings

Nicola Basilico¹ and Stefano Carpin²
¹ Department of Computer Science, University of Milan, Milan, Italy
² Department of Computer Science and Engineering, University of California, Merced, CA, USA
[email protected]

Abstract. We present a novel strategy for a patroller defending a set of heterogeneous assets from attacks carried out by an attacker that, through repeated observations, attempts to learn the strategy followed by the patroller. Implemented through a Markov chain whose stationary distribution is a function of the values of the assets being defended and the topology of the environment, the strategy is biased towards providing more protection to valuable assets, yet is provably hard to learn for an opponent. After studying its properties, we show that our proposed method outperforms strategies commonly used for this type of problem.

Keywords: Security games · Learning · Patrolling

1 Introduction
Intelligent autonomous systems are increasingly utilized for providing surveillance of valuable or sensitive assets, such as harbors, banks, airports and the like [18]. Such systems may be implemented by deploying robots in the environment, or as decision support systems informing one or more human operators about how to complete the task. In the following, we use the terms agents or patrollers to indicate the entities protecting the assets. In non-trivial scenarios assets are spatially distributed, and when the number of assets to guard is larger than the number of agents used to protect them, patrollers need to move between different assets to ensure coverage. Central to this problem, then, is a scheduling algorithm deciding the route an agent should follow. In competitive situations where patrollers face adversarial entities trying to compromise the assets being protected, deterministic patrolling routes are normally avoided, because through repeated observations an intelligent adversary could easily determine the deterministic route and then attack an asset while being sure it is not protected. Randomization therefore plays an essential role in these problems. However, purely random strategies are not ideal either, in particular if the assets are heterogeneous and of different value. In such cases the patroller should devote more effort to defending the more valuable assets, i.e., introduce some bias in the route towards the more valued locations. This approach, however, is also prone to being learned
and forecast by an intelligent opponent that can build a model of the process used to generate patrolling routes. In this paper, we present a novel patrolling strategy for a single patroller that is provably hard to model or learn for an adversarial opponent. Notwithstanding, the strategy is not purely random, but rather biased to provide more coverage to the most important assets. Central to our approach is the idea of allowing the patroller to slow down when traveling between vertices, thus introducing unpredictability in the time it takes to move from asset to asset. As we will show, an opponent trying to determine how often a patroller visits a certain region will essentially be faced with the problem of forecasting a white noise time series. To ensure that more valuable assets are visited more often, the next vertex to visit is selected using a Markov chain whose stationary distribution considers the structure of the underlying graph and ensures that expected losses are bounded in a worst-case scenario. The original contributions of this paper are the following:

– we consider an adversarial patrolling setting where we assume realistic observation capabilities for the attacker;
– we devise an approach to approximate a strategy that is optimal against a fully informed attacker and that makes information gathering through observation a difficult process, independently of how the attacker formulates its beliefs;
– we empirically evaluate the proposed approach in realistic settings and compare it with a mainstream approach for robotic surveillance applications.

The rest of this paper is organized as follows. Related work is discussed in Sect. 2, where we also outline how our work differs, while the adversarial patrolling problem we consider is formally introduced in Sect. 3. Next, in Sect. 4 we recap a few facts about Markov chains necessary to prove the properties of the patrolling algorithm we then present and study in Sect. 5. Building upon our theoretical findings, in Sect. 6 we show how our algorithm defeats commonly used alternatives. Finally, conclusions are given in Sect. 7.
2 Related Work
The use of mobile robots in surveillance settings has received considerable attention from the Robotics and AI communities in recent years. The classic problem formulation, which is also the one we adopt in this work, describes the environment by means of a graph whose vertices represent areas to protect. To safeguard these assets, one or more patrolling agents travel on this graph, trying to detect and stop ongoing intrusions at each currently visited vertex. A patrolling strategy is used to determine the movements of the patrolling agents. Its general objective is to schedule the patrolling activities so as to optimize the protection enforced in the environment and the execution costs. A first theoretical analysis that recognized the importance of this problem was presented in [8]. Therein, the author studies the optimality of particular classes of patrolling
strategies adopting a criterion based on the idleness of a vertex, defined as the time elapsed since its last visit by a patroller. Idleness-based criteria naturally encode the intuitive rationale that the more frequently a vertex is patrolled, the more protected it will be. Approaches of this type are today among the mainstream solutions for a large number of applications deployed in the real world [13]. Most idleness-based settings entail hard optimization problems, making the study of heuristic and approximate solutions an important goal. For example, the work proposed in [2] studies a setting where idleness is combined with the importance of the different vertices and provides approximation algorithms able to scale up to very large instances. A parallel line of research focuses on stochastic surveillance strategies based on Markov chains and Monte Carlo techniques. The general idea in these works is to devise strategies that exhibit desired properties at convergence. For example, in [9] the authors define the patrolling strategy as a Markov chain and seek to minimize the mixing rate to a desired steady-state uniform distribution for a homogeneous setting where all vertices have equal importance. A more recent example is presented in [1], where the objective to be minimized is the mean first passage time. All the above approaches, either deterministic or stochastic, do not consider the adversarial nature of the patrolling task. Along this dimension, they are complemented by the wide literature on security games [16], which addresses the general problem of resource allocation in the presence of rational adversaries and includes models specifically tailored for patrolling on graphs [3,5]. The working assumption of these approaches is that the patrolling task is carried out in the presence of a rational attacker that, thanks to unlimited observation capabilities, has exact knowledge of the patrolling strategy. This assumption results in a leader-follower interaction where the attacker substantially best responds to the patrolling strategy it observes. Some model refinements considered security games where the attacker has limited or constrained observation capabilities. One notable example is [4], where the attacker constructs a belief over the patrolling strategy by performing costly observations. In [6] the authors consider a similar belief-based observing attacker and show that planning against the strongest observer induces limited losses. In a departure from former literature in this area, in this work we consider a patrolling setting characterized by an observing attacker, and we try to overcome the predictability of the widely-adopted deterministic idleness-based patrolling strategies. To this end we seek a stochastic approach based on Markov chains that, instead of focusing on a particular convergence property, seeks a steady-state distribution that complies with the area coverage requirements while, at the same time, being difficult to predict for an observer. Differently from the classical security games literature, we do not assume a setting where the attacker has exact knowledge of the patrolling strategy being executed. With respect to security games that considered refinements of the observer, we adopt a much more limited (and realistic) attacker model, i.e., one that cannot obtain full knowledge of the patrolling strategy even with unbounded time or observations.
We maintain that such a model is more suitable for capturing realistic interactions between a robotic patrolling system and a malicious attacker in a physical environment represented by a graph.
3 Problem Definition
Our patrolling problem is defined as follows. We consider a domain D within which K target locations are found. Let the target locations be l_1, ..., l_K. An undirected, weighted graph G = (V, E) is used to model this environment. Each target location is associated with a vertex in G, and the weight of an edge (l_i, l_j), denoted as d_{i,j} ∈ R+, represents the temporal traveling cost when moving from l_i to l_j (or vice versa). We assume the graph is connected and complete¹, and we furthermore set d_{i,i} = 0 for all vertices, even though self-loop edges do not necessarily need to be assumed. Edge costs are assumed to be symmetric, and therefore d_{i,j} = d_{j,i}. Each target is associated with a value v_i, which encodes a measure of its importance, and a time to attack a_i, which will be explained later. The task we consider is that of controlling one patrolling agent moving over the graph G with the objective of protecting its targets. Differently from what is typically done in graph exploration settings, we do not assume non-preemptive edge traversals. The only constraint we enforce is that when leaving a vertex l_i the patroller cannot arrive at another vertex l_j before at least d_{i,j} time units have passed. Moreover, the patroller is assumed to have attack detection capabilities localized to the currently occupied vertex. We assume this task is performed under a threat modeled as an attacker agent whose objective is to compromise one of the target locations. To successfully carry out an attack on target l_i, the attacker has to position itself at l_i for a time greater than or equal to the attack time a_i. During such time, if the patroller visits l_i before the attack time expires, the attacker is neutralized and the attack fails. Figure 1 illustrates this interaction, where, as customary in the literature, we assume that payoffs reflect a constant-sum setting. Specifically, defining ν = Σ_{l_i∈V} v_i, we assume that when the attacker successfully carries out an attack on target l_i it receives a payoff of v_i while the patroller gets ν − v_i. If instead the attack is detected, the attacker receives 0 while the patroller gets ν. The patroller's sequence of visits to the targets is defined by a patrolling strategy that is not accessible to the attacker. However, the attacker can observe a specific single target l_i and, at any time, it can determine whether the patroller is currently at that vertex or not. Formally, this means that the attacker is capable of recording the timestamps of the sequence of visits that the patroller performs at target l_i and, from such a list of timestamps, it can compute the inter-arrival times at l_i. This sequence represents the only source of knowledge available to the attacker. Based on this setup, the following two facts easily follow:
¹ The assumption is w.l.o.g. since any non-complete graph can be turned into a complete one by taking its closure on shortest paths, i.e., adding an edge (l_i, l_j) and setting d_{i,j} to the cost of the shortest path between l_i and l_j in the original graph.
Fig. 1. Two scenarios where the attacker attacks location l_j at time t′. In the first case (top) the next visit to l_j by the patroller happens before t′ + a_j; the attacker is detected and the attack fails. In the second case (bottom) the next visit happens after t′ + a_j; the attacker is therefore successful and receives reward v_j.
– the attacker will attack the observed vertex, but the patroller does not know which vertex is observed by the attacker;
– the patroller cannot increase the chances of detecting an attack by spending time at a vertex, since a rational attacker will never strike a target that is under surveillance; in practice, as soon as it visits a vertex, the patroller leaves it.

The question we aim to answer is therefore the following: how should the patroller schedule its visits to the locations so that the expected reward gathered by the attacker is minimized? Notice that, for now, we formulate our question without making any assumption on the algorithm used by the attacker to translate the gathered information into an attack decision.
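A problem instance is thus fully described by the travel-cost matrix d, the value vector v and the attack-time vector a. The following sketch (illustrative; the random instance is our own) builds such an instance and applies the shortest-path closure of footnote 1, via the Floyd–Warshall recursion, so that the completeness assumption holds.

import numpy as np

def shortest_path_closure(d):
    # Complete the graph on shortest paths (footnote 1): afterwards,
    # d[i, j] is the cost of the cheapest route from l_i to l_j.
    d = d.copy()
    K = d.shape[0]
    for k in range(K):
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d

rng = np.random.default_rng(0)
K = 6
d = rng.uniform(1.0, 10.0, size=(K, K))
d = (d + d.T) / 2.0          # symmetric costs: d_ij = d_ji
np.fill_diagonal(d, 0.0)     # d_ii = 0
d = shortest_path_closure(d)
v = rng.uniform(1.0, 5.0, size=K)    # target values v_i
a = rng.uniform(5.0, 20.0, size=K)   # attack times a_i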
4 Background in Markov Chains
We briefly recap some basic facts about Markov chains that will be useful to establish the properties of the algorithm we propose in the following section. For the sake of brevity, the discussion remains informal. For more details and a rigorous treatment, the reader is referred to textbooks on stochastic processes, such as [11,12]. A Markov chain with finite state space is defined by a finite set of states {s_1, ..., s_n} and an n × n transition matrix P whose (i, j)-th entry p_{i,j} defines the one-step transition probability, i.e., p_{i,j} = Pr[X_{t+1} = s_j | X_t = s_i], where X_t is the state of the Markov chain at time t. The matrix P is stochastic, i.e., its rows add to one. A Markov chain is irreducible if from each state it is possible to move to any other state in one or more steps. A state s in a Markov chain is periodic if, starting from s, the chain can return to s only after a number of steps that is a multiple of an integer larger than one. A state is aperiodic if it is not periodic. In an irreducible Markov chain, if one state is aperiodic, then all states
are aperiodic. Based on these definitions, it is immediate to observe that if all entries in P are strictly positive, then the Markov chain is irreducible and aperiodic. If a Markov chain is irreducible and aperiodic, there always exists a unique stationary distribution, i.e., an n-dimensional stochastic vector π such that π = πP. The stationary distribution π describes the probability that, irrespective of the initial state, the Markov chain is in a given state once it has converged, i.e., π_i is the probability that the Markov chain is in s_i after a large number of transitions. For a given Markov chain, the recurrent visit time for a state s_i is the number of transitions until the chain returns to state s_i given that it started in state s_i. The recurrent visit time for a state s_i is a discrete random variable whose expectation is 1/π_i, where π_i is the i-th component of the stationary distribution π.
4.1 Relationships Between π and P
As stated above, if a Markov chain is irreducible and aperiodic, then the stationary distribution exists and is unique. This means that π is a function of P. Often we are interested in the opposite problem: for a given stationary distribution π, we want a transition matrix P having π as its stationary distribution. In general there exist multiple transition matrices solving this problem. The Metropolis–Hastings algorithm can be used to determine one such matrix. In simplified form² the method to determine P can be formulated as follows. For i ≠ j define

α_{i,j} = min(1, π_j/π_i),    p_{i,j} = α_{i,j} · (1/n).

For i = j, p_{i,i} is instead set so that all rows add up to one. Therefore, for a given π this method provides a unique transition matrix P, although this is not the only matrix such that π = πP.

² This simplified version is obtained assuming that the proposal distribution is 1/n for all i, j. See [14] for more details.
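The construction takes only a few lines. The sketch below (our own) builds P from a target π using the uniform 1/n proposal and checks π = πP numerically.

import numpy as np

def metropolis_hastings(pi):
    # Transition matrix with stationary distribution pi (uniform proposal).
    n = len(pi)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                P[i, j] = min(1.0, pi[j] / pi[i]) / n  # alpha_ij * (1/n)
        P[i, i] = 1.0 - P[i].sum()  # each row must sum to one
    return P

pi = np.array([0.5, 0.3, 0.2])
P = metropolis_hastings(pi)
assert np.allclose(pi @ P, pi)  # pi is indeed stationary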
5 An Unforecastable Patrolling Strategy
In this section we develop a randomized patrolling strategy aimed at being “difficult” to observe for the attacker introduced in Sect. 3 and, at the same time, complying with the environment coverage requirements (visiting more frequently the vertices with higher values) and biased to bound its worst-case losses. To ease the discussion of the final strategy, we start with a generic approach that will then be instantiated into an algorithm with an accurate performance characterization. The framework is schematized in Algorithm 1, where we outline
a generic patrolling strategy whose core idea is that of decoupling the spatial decision from the temporal one. 1 2 3 4 5 6
Randomly select a start vertex li ∈ V ; while true do Randomly select the next vertex lj ∈ V ; Generate a random time t ≥ di,j with finite expectation; Move to lj spending time t; li ← lj ; Algorithm 1: General patrolling strategy
The patroller operates in an infinite loop where in line 3 it selects the next vertex to visit, and in line 4 it determines how much time to spend moving there. The random variables describing these choices are subject to the constraints discussed in the following. First, when selecting the new vertex l_j, a strictly positive probability must be assigned to each vertex in V. This does not necessarily imply a uniform distribution, and indeed in the final formulation a uniform distribution will not be used. We assume these probabilities are constant and therefore history independent; in particular, they are not influenced by the idleness of a vertex, i.e., the time since its last visit. Second, when moving towards the selected vertex l_j, the patroller does not necessarily spend the minimum time d_{i,j}, but rather elects to spend a time t that is larger than or equal to d_{i,j}. This is modeled by a continuous random variable with zero density for values smaller than d_{i,j} and finite expectation.³ This ability to slow down is essential to introduce unpredictability in the patrolling schedule and therefore make it difficult for an observer to reconstruct the sequence of inter-arrival times at a particular target. In particular, it is critical to observe that if in step 3 the selected next vertex l_j is equal to the current vertex l_i, the time spent to move from l_i to l_i can be strictly positive even though d_{i,i} = 0. This, for example, corresponds to the situation where the patroller first leaves l_i, but then comes back to l_i again before having visited another vertex.

Remark: the importance of this last aspect should not be overlooked. Consider the classic case where from l_i the patroller would always pick a next vertex different from l_i (or, if deciding to remain at l_i, it would not leave it, i.e., it would move from l_i to l_i with time d_{i,i} = 0). Through repeated observations, an opponent could easily figure out that when the patroller leaves l_i, the next visit will happen no earlier than m_i = 2 min_{j≠i} d_{i,j} (this is the time to move forward and backward along the outgoing edge of minimum cost). Based on this information, the attacker could strategically determine that if a_i < m_i, then an attack on l_i placed immediately after the patroller leaves l_i would succeed with probability 1. Instead, with the proposed patrolling strategy the time to the next visit to l_i can be smaller than m_i, and therefore the attacker cannot produce an attack that is certain to succeed.
³ The density must be 0 for values smaller than d_{i,j} because the patroller must generate a time larger than the minimum time necessary to complete the move.
If we next consider the temporal sequence of vertices visited by the patroller and the assumptions we made about how the next vertex is selected, this process can be modeled as a finite Markov chain whose state space is V. Let P be the associated K × K transition matrix. Because of the assumptions we made about how the next vertex is selected in line 3, all entries in P are strictly positive and therefore the Markov chain is irreducible and aperiodic. Let us now consider the time elapsed between two successive visits to vertex l_j by the patroller. As shown in Fig. 2, this is a continuous random variable R_j obtained by adding a finite number of continuous random variables, each of which is generated in step 4 of Algorithm 1. This continuous random variable is in the following called the return time to vertex l_j, to distinguish it from the discrete recurrent visit time introduced in Sect. 4.
Fig. 2. The return time to vertex lj (Rj ) is the overall time spent by the patroller to return in lj given that it started in lj . Rj is the sum of M random variables t1 . . . tM , where M is the recurrent visit time for vertex lj .
To gain some insight about the random variable R_j, it is necessary to make some assumptions about how the random times are generated in line 4 of Algorithm 1. To this end, we assume that the time is distributed as a uniform random variable between the values d_{i,j} and d_{i,j} + Δ, where Δ > 0 is a positive constant. In the following, with a slight abuse of notation, we indicate as U[d_{i,j}, d_{i,j} + Δ] both this random variable and its density. This choice respects the constraints assumed earlier, i.e., its density is 0 for values smaller than d_{i,j} and it has finite expectation. Let us now consider the sequence of travel times generated by the patroller while moving from vertex to vertex. Because of the assumption we made about how the travel time is generated when moving between l_i and l_j, and since the Markov chain is irreducible and aperiodic, it follows that travel times are distributed according to a mixture of random variables with the following density:
Σ_{i=1}^{K} π_i Σ_{j=1}^{K} p_{i,j} U[d_{i,j}, d_{i,j} + Δ]   (1)
where π_i is the i-th entry in the unique stationary distribution π associated with P. This random variable has finite expectation equal to

Σ_{i=1}^{K} π_i Σ_{j=1}^{K} p_{i,j} (2d_{i,j} + Δ)/2.   (2)
Importantly, all travel times are independent and identically distributed (iid) according to this mixture of uniforms. This allows us to determine the expectation of R_j for each location l_j, as detailed by the following theorem.

Theorem 1. Consider the Markov chain induced by Algorithm 1 and let R_j be the return time to state l_j. Its expectation is

E[R_j] = (1/π_j) Σ_{i=1}^{K} Σ_{k=1}^{K} π_i p_{i,k} (2d_{i,k} + Δ)/2,
where π is the stationary distribution and p_{i,k} is the (i, k) entry of the transition matrix P.

Proof. First recall that since all entries in P are strictly positive, the Markov chain is irreducible and aperiodic, and therefore there exists a unique stationary distribution π. With reference to Fig. 2, R_j can be written as

R_j = Σ_{i=1}^{M_j} t_i,
where each t_i is a random variable with density given by Eq. (1) and M_j is the random variable for the recurrence time of state l_j. R_j is therefore the sum of a random, finite number of iid random variables. The claim then follows by invoking Wald's equation (see, e.g., [14], Theorem 7.2):

E[R_j] = E[Σ_{i=1}^{M_j} t_i] = E[M_j] E[t_i] = (1/π_j) Σ_{i=1}^{K} Σ_{k=1}^{K} π_i p_{i,k} (2d_{i,k} + Δ)/2,

where we used the aforementioned fact that E[M_j] = 1/π_j and Eq. (2) for E[t_i].
Remark: the attentive reader could correctly object that π_i is the probability that the Markov chain is in state l_i only after the chain has converged, and therefore the t_i's will be iid only in the limit. However, as we will show later on, the algorithm bootstraps its operations by ensuring that the Markov chain has already converged from the first step, and therefore the t_i's are always iid, as required in the proof. To determine E[R_j] one needs to know P (from which π can be derived). P is determined by how the patroller selects the next vertex in Algorithm 1, line 3. Alternatively, we could fix a desired stationary distribution π and from it reconstruct a transition matrix P using the Metropolis–Hastings method discussed in Sect. 4.1. Such a matrix could then be used to implement the selection of the next vertex to visit. In the following section we show that indeed the transition matrix will be computed starting from a target stationary distribution.
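Theorem 1 turns into a direct computation, which can also be checked by simulation. The sketch below (ours; variable names and conventions are assumptions) evaluates E[R_j] from π, P, the cost matrix d and Δ, and estimates it empirically as the mean gap between successive visits to l_j.

import numpy as np

def expected_return_time(j, pi, P, d, Delta):
    # E[R_j] per Theorem 1: (1/pi_j) * sum_i sum_k pi_i P_ik (2 d_ik + Delta)/2.
    zeta = (2.0 * d + Delta) / 2.0
    return float((pi[:, None] * P * zeta).sum() / pi[j])

def simulated_return_time(j, pi, P, d, Delta, n_steps=200000, seed=1):
    # Empirical mean time between successive visits to vertex j.
    rng = np.random.default_rng(seed)
    i = rng.choice(len(pi), p=pi)  # start from the stationary distribution
    t, last, gaps = 0.0, None, []
    for _ in range(n_steps):
        k = rng.choice(len(pi), p=P[i])
        t += rng.uniform(d[i, k], d[i, k] + Delta)
        if k == j:
            if last is not None:
                gaps.append(t - last)
            last = t
        i = k
    return float(np.mean(gaps))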
5.1 Optimizing Against an Observing Attacker
As is customary in the security games literature, we take a worst-case stance and assume that the attacker is in the position of observing a vertex l_j for an indefinite amount of time. This, however, does not amount to assuming that the attacker has knowledge of the patrolling strategy but, rather, that the attacker can build a correct belief about the random variable R_j. The expected payoff obtained by attacking l_j immediately after the patroller leaves l_j is then v_j Pr[R_j > a_j], where Pr[R_j > a_j] is the probability that the attack will be successful, since R_j > a_j corresponds to the event in which the patroller returns to l_j only after the attack time a_j has elapsed (recall Figs. 1 and 2). This probability is in general difficult to compute analytically, although it could be approximated numerically. However, since R_j is a non-negative random variable, we can exploit Markov's inequality [10] to bound the probability of a successful attack:

Pr[R_j ≥ a_j] ≤ E[R_j]/a_j.
It follows that the attacker's expected payoff obtained on l_j is upper bounded by

L_j = (v_j/a_j) (1/π_j) Σ_{i=1}^{K} Σ_{k=1}^{K} π_i p_{i,k} ζ_{i,k},   (3)

where we used Theorem 1 for E[R_j] and we wrote ζ_{i,k} for (2d_{i,k} + Δ)/2. Note that ζ_{i,k} is a function of the graph and Δ only, and does not depend on the strategy. Since we assumed a constant-sum setting, the expression for L_j defines an upper bound on the patroller's expected loss, which we seek to minimize. Specifically, let S be the set of all possible strategies obtained by varying the vertex selection step in Algorithm 1. Under the operational assumption that we do not know in advance the target observed by the attacker, we define the patrolling strategy as the solution of the following optimization problem:

min_{s∈S} max_{j=1...K} L_j.
This can be rewritten as follows:

min_π  max_{j=1...K} (v_j/a_j) (1/π_j) Σ_{i=1}^{K} Σ_{k=1}^{K} π_i p_{i,k} ζ_{i,k}   (4)
s.t.  π_1 + π_2 + · · · + π_K = 1,
      π_j > 0,  j = 1...K,
where the optimization variable is the K-dimensional stochastic vector π and we assume that the p_{i,k} values are obtained from π using the method described in Sect. 4.1.
Theorem 2. Assuming P is computed from π using the Metropolis–Hastings method given in Sect. 4.1, the solution to the optimization problem (4) is obtained by setting

π_i = (v_i/a_i) / Σ_{j=1}^{K} (v_j/a_j),   1 ≤ i ≤ K.   (5)
Proof. Let π* be the vector obtained by setting all components according to Eq. (5). The j-th function to be optimized in problem (4) is (v_j/a_j) E[R_j], where E[R_j] is the expected return time for vertex l_j induced by the strategy defined by π*. By substitution, it is immediate to verify that for this specific choice of π* all j functions to be minimized assume the same value, i.e.,

(v_1/a_1) E[R_1] = (v_2/a_2) E[R_2] = · · · = (v_K/a_K) E[R_K].

By contradiction, assume π* is not optimal. This means there exists a vector π′ ≠ π* giving a lower value for problem (4). Let E[R′_j] be the expected return time induced by π′. For π′ to be optimal, it must then be
(v_j/a_j) E[R′_j] < (v_j/a_j) E[R_j],   j = 1...K,

i.e., E[R′_j] < E[R_j] for all vertices. However, this cannot be simultaneously achieved for all vertices, because if the expected return time for one vertex decreases, then there must be an increase in the return time for a different vertex.

At this point we have all the elements to write the detailed patrolling strategy implemented by the patroller. The input is the graph G = (V, E), together with the values and attack times for each vertex, and the parameter Δ.
Algorithm 2: Adopted patrolling strategy
1: Input: graph G = (V, E), vector of values v, vector of attack times a, Δ
2: Solve optimization problem (4) and determine π
3: Compute transition matrix P using the Metropolis–Hastings method
4: Select start vertex l_i ∼ π
5: while true do
6:   Select next vertex l_j with probability P_{ij}
7:   Generate random time t ∼ U[d_{i,j}, d_{i,j} + Δ]
8:   Move to l_j spending time t; l_i ← l_j
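A compact, self-contained rendering of Algorithm 2 follows; it is an illustrative sketch, not the authors' implementation, and the function and variable names are ours. It uses the closed form of Eq. (5) for π, so no numerical solver is needed.

import numpy as np

def delta_strategy(v, a, d, Delta, n_visits, seed=0):
    # Run Algorithm 2; returns the list of (arrival time, vertex) visits.
    K = len(v)
    pi = (v / a) / np.sum(v / a)      # stationary distribution, Eq. (5)
    P = np.zeros((K, K))              # Metropolis-Hastings matrix (Sect. 4.1)
    for i in range(K):
        for j in range(K):
            if i != j:
                P[i, j] = min(1.0, pi[j] / pi[i]) / K
        P[i, i] = 1.0 - P[i].sum()
    rng = np.random.default_rng(seed)
    i = rng.choice(K, p=pi)           # line 4: start vertex drawn from pi
    t, visits = 0.0, []
    for _ in range(n_visits):
        j = rng.choice(K, p=P[i])     # line 6: next vertex
        t += rng.uniform(d[i, j], d[i, j] + Delta)  # line 7: t ~ U[d_ij, d_ij + Delta]
        visits.append((t, j))
        i = j
    return visits

Setting Δ = max_{(i,j)} d_{i,j}/2 reproduces the configuration used in the experiments of Sect. 6.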
Remark: line 4 in Algorithm 2 uses the stationary distribution to select the initial vertex. This ensures that the Markov chain has converged from the first step, i.e., π gives the probability of being in each state at each time step, which in turn ensures that all travel times t_i are identically distributed according to the mixture of uniforms. If the start vertex were chosen differently, this would be true only in the limit, i.e., after the chain has converged.
Let us now consider what an attacker observing location l_j and trying to construct a belief over R_j would "see". The return times R_j are distributed as per Eq. (1). This entails that the attacker would observe a sequence of iid inter-arrival times with finite mean and variance. It is useful to recall that independent variables with finite mean and variance are also uncorrelated, and therefore the sequence of return times constitutes a white noise stochastic process. It is worth noting that some authors require zero mean in the definition of white noise, while others do not and simply require uncorrelation between all values (see, e.g., [11], page 385). In our case, the return times do not have zero mean, but this is inconsequential to the (in)ability of using past values to make predictions about future ones. The strategy summarized in Algorithm 2 is then consistent with the adversarial setting, since it seeks optimality against a fully informed and rational attacker (as defined in Sect. 3). At the same time, however, it makes it difficult for this type of attacker to reach such a fully informed condition by observing a target. It is also worth noting that the solution to problem (4) depends on the ratios v_j/a_j, i.e., the ratio between the value of a vertex and its time to attack. This quantity is the amount of value per unit of time that an attacker will receive if the attack succeeds.
6 Experimental Evaluation
In this section we provide an empirical assessment of the patrolling strategies obtained with Algorithm 2, which we refer to as "Delta". As already discussed, Delta is the optimal strategy as per Problem (4). The experimental results we present here are focused on evaluating its advantages in terms of the difficulty of being observed by an attacker characterized by the local-target visibility model introduced in Sect. 3. We compare Delta with a heuristic strategy that well represents an established class of approaches for robotic surveillance (see our discussion in Sect. 2). We created a dataset of random graphs scaling up to 50 vertices. For each problem instance we execute the patrolling strategy in simulation for 5000 vertex visits. For the Delta strategy we set Δ = max_{(i,j)} d_{i,j}/2.
6.1 Comparing Against a Heuristic Strategy
We compare against a heuristic strategy based on vertex idleness. Formally, let I(l_i) be the current idleness of vertex l_i, defined as the amount of time since its last visit by the patroller. In the beginning, I(l_i) = 0 for all l_i ∈ V. To understand how I(·) evolves during the patrolling mission, assume that at time t1 the patroller leaves a vertex l_1 and arrives at vertex l_2 at time t2 ≥ t1 + d_{1,2}. Upon its arrival at l_2, the idleness of l_2 is set to I(l_2) = 0, while for i ≠ 2 it is set to I(l_i) = I(l_i) + (t2 − t1). Once at a vertex l_i, the patroller decides where to go next by computing the following utility function for each vertex l_j ≠ l_i:
u(l_j) = (v_j/a_j) (1 + I(l_j)/d_{i,j})

Maximizing the above utility would result in an informed choice, where important targets that have been left unvisited would be chosen. However, the resulting sequence of inter-arrival times that could be observed at a target would be easily forecastable, since the strategy is ultimately deterministic and, in the long run, will produce a periodic behavior. For this reason we add to the fully informed decision a random component, inspired by ε-soft exploration strategies [15]. The objective of the random component is to mislead an observer. We do this by introducing a random selection with a parameter ε ∈ [0, 1] and define the selection of l_next (the next location to patrol) as follows (we assume that l_i is the current vertex occupied by the patroller):

l_next = argmax_{l_j ∈ V \ {l_i}} u(l_j)    with probability 1 − ε
l_next = chosen uniformly in V \ {l_i}      with probability ε
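An illustrative implementation of this Idleness-ε baseline follows (our own sketch; the function name and tie-handling are assumptions).

import numpy as np

def idleness_next_vertex(i, I, v, a, d, eps, rng):
    # epsilon-soft idleness heuristic: greedy on u(l_j) w.p. 1 - eps,
    # uniform over the other vertices w.p. eps.
    candidates = [j for j in range(len(v)) if j != i]
    if rng.random() < eps:
        return int(rng.choice(candidates))
    utility = lambda j: (v[j] / a[j]) * (1.0 + I[j] / d[i, j])
    return max(candidates, key=utility)

After each move from l_i to l_j (leaving at t1, arriving at t2), the idleness vector is updated as I[j] = 0 and I[k] += t2 − t1 for every k ≠ j, as described above.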
Fig. 3. Partial autocorrelation analysis for the time series of inter-arrival times at a target.
One of the standard approaches to assess whether a time series can be forecast from its past values is to perform a partial autocorrelation analysis [17]. Given a lag k, the partial autocorrelation function measures the likelihood that a sample observed k steps in the past is a predictor for the last observed value of the series (the correlation between a sample and a previous one is computed assuming a linear relationship). Figure 3 shows such an analysis for the inter-arrival times observed at a particular target on a randomly generated graph with 30 vertices. The partial autocorrelation is shown for different time lags, where the blue lines define a 95% confidence interval. If at lag k the partial autocorrelation is above or below these lines, then we have confidence that samples observed k steps in the past can predict the present. As shown in the figure, the Delta strategy is the one achieving the smallest autocorrelations. With ε = 0 we clearly get high autocorrelations at different lag points, and as we move towards ε = 1 we obtain the same profile as Delta. However, the lower predictability we obtain by increasing ε comes at the cost of a less informed strategy, which tends to distribute coverage equally among all the targets, without accounting for their importance.
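This analysis is straightforward to reproduce. A sketch using statsmodels follows; the library choice and the simple 1.96/√N white-noise confidence band are our assumptions.

import numpy as np
from statsmodels.tsa.stattools import pacf

def significant_lags(inter_arrivals, max_lag=20):
    # Lags whose partial autocorrelation exceeds the 95% white-noise band.
    r = np.asarray(inter_arrivals, dtype=float)
    coeffs = pacf(r, nlags=max_lag)
    band = 1.96 / np.sqrt(len(r))
    return [k for k in range(1, max_lag + 1) if abs(coeffs[k]) > band]

For a sequence generated by the Delta strategy, this list should almost always be empty, consistent with Fig. 3.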
6.2 Performance Against an Observer
The second experimental evaluation we present shows the performance of Delta and Idleness-ε against an attacker different from the worst-case one, that is, an attacker that does not have access to the exact characterization of the inter-arrival times. We model it using a simple local learning method for time series prediction, as outlined in [7], namely a nearest neighbor predictor. Formally, the attacker at location l_j gets a sequence of observations defined as O = (R_j^0, R_j^1, ..., R_j^t), where O(i) = R_j^i. The observer considers sub-sequences of finite length m (this parameter can be interpreted as the observer's finite memory). Thus, for m − 1 ≤ i < t, call O_i = (O(i − m + 1), ..., O(i)). The observer then computes

i* = argmin_{i∈{m,...,t−1}} d(O_i, O_t)

and makes a prediction as O(i* + 1). If O(i* + 1) ≥ a_j, then an attack is attempted.
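The predictor admits a direct implementation. Below is an illustrative sketch (ours); the paper leaves the distance d between sub-sequences unspecified, and the Euclidean choice here is an assumption.

def nn_attack_decision(O, m, a_j):
    # Nearest-neighbor attacker: predict the next inter-arrival time at l_j
    # and attack iff the prediction is at least the attack time a_j.
    # O: list of observed inter-arrival times; m: observer memory length.
    t = len(O) - 1
    if t < m:                       # not enough history yet
        return False
    target = O[t - m + 1:t + 1]     # the most recent window O_t
    best_i, best_d = None, float("inf")
    for i in range(m - 1, t):       # candidate windows O_i with i < t
        window = O[i - m + 1:i + 1]
        dist = sum((x - y) ** 2 for x, y in zip(window, target)) ** 0.5
        if dist < best_d:
            best_d, best_i = dist, i
    return O[best_i + 1] >= a_j     # predicted next gap is long enough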
In Fig. 4, we depict the performance trend of the various strategies observed in a representative instance. For each vertex, we report the protection ratio, defined as the number of unsuccessful attacks over the number of attack attempts. The ratio is then multiplied by the value of the vertex it refers to, to account for the fact that different vertices have different values. Each point in the curve can then be interpreted as an amount of protected value. As can be seen, in this setting the ability to generate a sequence of inter-arrival times that is difficult to predict plays a critical role. However, increasing ε in the Idleness strategy introduces only marginal improvements, while the Delta strategy achieves good performance in protecting the environment while, at the same time, misleading the observer. This performance gap is also summarized in Fig. 5(a), showing the cumulative protected value (the sum over all the vertices of the weighted protection ratio), and Fig. 5(b), where we report the ratio between
Fig. 4. Weighted protection ratio for different patrolling strategies.
Fig. 5. (a) Expected protected value for different ε values. (b) Ratio between the protected value of the Delta and Idleness strategies.
the cumulative protected value of Delta and that achieved with different ε. To put these numbers into context, the protected value obtained by Delta is 56.19.
7 Conclusions
In this paper we have studied a novel approach to implement patrolling strategies that are provably hard to learn for an observer. Our problem departs from the classic literature in security games inasmuch as the attacker does not have complete knowledge of the patrolling strategy, but rather builds a model through observations limited to a fixed subset of the environment. We maintain that this approach is more relevant to practical applications than those based on worst-case assumptions with an omniscient attacker. Our patrolling strategy essentially produces a white noise time series of inter-arrival times. Nevertheless, since the patroller does not know which vertex is being observed by the attacker, the strategy is biased towards visiting more often the vertices with high values and low attack times, i.e., those more likely to create large losses for the patroller. The ability of the patroller to introduce random bounded delays (as governed
by the parameter Δ) when moving from vertex to vertex increases its unpredictability. It is important to stress that this patrolling approach is suited to settings where the attacker learns through limited observations; it does not necessarily apply when the attacker has global visibility of the patroller's routes.
References

1. Agharkar, P., Patel, R., Bullo, F.: Robotic surveillance and Markov chains with minimal first passage time. In: Proceedings of the IEEE Conference on Decision and Control, pp. 6603–6608 (2014)
2. Alamdari, S., Fata, E., Smith, S.L.: Persistent monitoring in discrete environments: minimizing the maximum weighted latency between observations. Int. J. Robot. Res. 33(1), 138–154 (2014)
3. Alpern, S., Morton, A., Papadaki, K.: Patrolling games. Oper. Res. 59(5), 1246–1257 (2011)
4. An, B., Brown, M., Vorobeychik, Y., Tambe, M.: Security games with surveillance cost and optimal timing of attack execution. In: International Conference on Autonomous Agents and Multi-agent Systems, pp. 223–230 (2013)
5. Basilico, N., Gatti, N., Amigoni, F.: Patrolling security games: definition and algorithms for solving large instances with single patroller and single intruder. Artif. Intell. 184, 78–123 (2012)
6. Blum, A., Haghtalab, N., Procaccia, A.D.: Lazy defenders are almost optimal against diligent attackers. In: AAAI, pp. 573–579 (2014)
7. Bontempi, G., Ben Taieb, S., Le Borgne, Y.: Machine learning strategies for time series forecasting. In: Business Intelligence, pp. 62–77. Springer (2013)
8. Chevaleyre, Y.: Theoretical analysis of the multi-agent patrolling problem. In: Proceedings of the IEEE/WIC/ACM International Conference on Intelligent Agent Technology, pp. 302–308 (2004)
9. Grace, J., Baillieul, J.: Stochastic strategies for autonomous robotic surveillance. In: Proceedings of the IEEE Conference on Decision and Control, pp. 2200–2205 (2005)
10. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press, Cambridge (1995)
11. Papoulis, A., Pillai, S.U.: Probability, Random Variables, and Stochastic Processes, 4th edn. McGraw-Hill, New York (2002)
12. Privault, N.: Understanding Markov Chains. Springer, Heidelberg (2013)
13. Robin, C., Lacroix, S.: Multi-robot target detection and tracking: taxonomy and survey. Auton. Robot. 40(4), 729–760 (2016)
14. Ross, S.M.: Introduction to Probability Models. Elsevier, Amsterdam (2014)
15. Sutton, R.S., Barto, A.G.: Reinforcement Learning. MIT Press, Cambridge (1998)
16. Tambe, M.: Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned. Cambridge University Press, Cambridge (2011)
17. Wei, W.W.S.: Time series analysis. In: The Oxford Handbook of Quantitative Methods in Psychology, vol. 2 (2006)
18. Zhang, C., Bucarey, V., Mukhopadhyay, A., Sinha, A., Qian, Y., Vorobeychik, Y., Tambe, M.: Using abstractions to solve opportunistic crime security games at scale. In: International Conference on Autonomous Agents and Multiagent Systems, pp. 196–204 (2016)
Fast, High-Quality Dual-Arm Rearrangement in Synchronous, Monotone Tabletop Setups

Rahul Shome¹, Kiril Solovey³, Jingjin Yu¹, Kostas Bekris¹, and Dan Halperin²
¹ Rutgers University, New Brunswick, NJ, USA
[email protected]
² Tel Aviv University, Tel Aviv-Yafo, Israel
³ Stanford University, Stanford, CA, USA
Abstract. Rearranging objects on a planar surface arises in a variety of robotic applications, such as product packaging. Using two arms can improve efficiency but introduces new computational challenges. This paper studies the structure of dual-arm rearrangement for synchronous, monotone tabletop setups and develops an optimal mixed integer model. It then describes an efficient and scalable algorithm, which first minimizes the cost of object transfers and then of moves between objects. This is motivated by the fact that, asymptotically, object transfers dominate the cost of solutions. Moreover, a lazy strategy minimizes the number of motion planning calls and results in significant speedups. Theoretical arguments support the benefits of using two arms and indicate that synchronous execution, in which the two arms perform together either transfers or moves, introduces only a small overhead. Experiments support these points and show that the scalable method can quickly compute solutions close to the optimal for the considered setup.
1 Introduction
Automation tasks in industrial and service robotics, such as product packing or sorting, often require sets of objects to be arranged in specific poses on a planar surface. Efficient and high-quality single-arm solutions have been proposed for such setups [18]. The proliferation of robot arms, however, including dual-arm setups, implies that industrial settings can utilize multiple robots in the same workspace (Fig. 1). This work explores (a) the benefits of coordinated dual-arm rearrangement versus single-arm, (b) the combinatorial challenges involved, and (c) computationally efficient, high-quality and scalable methods. A motivating point is that the coordinated use of multiple arms can result in significant improvements in efficiency. This arises from the following argument.

Work by RS, JY and KB was supported by NSF CNS-1330789, IIS-1617744 and IIS-1734419, and DH by the ISF grants 825/15, 1736/19, the Blavatnik Computer Science Research Fund, the Blavatnik Interdisciplinary Cyber Research Center, Yandex and Facebook.
Fig. 1. Example of dual-arm setups that can utilize algorithms proposed in this work.
Lemma 1. There are classes of tabletop rearrangement problems, where a k-arm (k ≥ 2) solution can be arbitrarily better than the optimal single-arm one.

For instance, assume two arms that have full (overhand) access to a unit-square planar tabletop. There are n objects on the table, divided into two groups of n/2 each. Objects in each group are ε-close to each other and to their goals. Let the distance between the two groups be on the order of 1, i.e., the two groups are at opposite ends of the table. The initial position of each end-effector is ε-close to a group of objects. Let the cost of each pick and drop action be c_pd, while moving the end-effector costs c_t per unit distance. Then, the 2-arm cost is no more than 2n·c_pd + 2nε·c_t. A single-arm solution costs at least 2n·c_pd + (2n − 1)ε·c_t + c_t, since the single arm must cross between the two groups at least once. If c_pd and ε are sufficiently small, the 2-arm cost can be arbitrarily better than the single-arm one. The argument also extends to k arms relative to (k − 1) arms.

In most practical setups, the expectation is that a 2-arm solution will be close to half the cost (e.g., time-wise) of the single-arm case, which is a desirable improvement. While there is coordination overhead, the best 2-arm solution cannot do worse; simply let one of the arms carry out the best single-arm solution while the other remains outside the workspace. More generally:

Lemma 2. For any rearrangement problem, the best k-arm (k ≥ 2) solution cannot be worse than an optimal single-arm solution.

The above points motivate the development of scalable algorithmic tools for such dual-arm rearrangement instances. This work considers certain relaxations to achieve this objective. In particular, monotone tabletop instances are considered, where the start and goal object poses do not overlap. Furthermore, the focus is on synchronized execution of pick-and-place actions by the two arms, i.e., where the two arms simultaneously transfer (different) objects, or simultaneously move towards picking the next objects. Theoretical arguments and experimental evaluation indicate that this does not significantly degrade solution quality.

The first contribution is the study of the combinatorial structure of synchronous, monotone dual-arm rearrangement. Then, a mixed integer linear programming (MILP) model is proposed that achieves optimal coordinated solutions in this domain. The proposed efficient algorithm, Tour Over Matching (TOM), significantly improves scalability. TOM first optimizes the cost of object transfers and assigns objects to the two arms by solving an optimal matching problem. As a secondary objective, it minimizes, per arm, the cost of moves from an object's goal to another object's start pose by employing a TSP solution. Most of the
computation time is spent on the many calls to a lower-level motion planner that coordinates the two arms. A lazy evaluation strategy is proposed, where motion plans are computed only for candidate solution sequences. This results in significant computational improvement and reduces the calls to the motion planner. An analysis studies the expected improvement in solution quality versus the single-arm case, as well as the expected cost overhead of a synchronous solution. Finally, experiments for (i) a simple planar picker setting and (ii) two 7-DOF Kuka arms demonstrate a nearly two-fold improvement over the single-arm case for the proposed approach in practice. The algorithm yields close-to-optimal solutions and good scalability. The lazy evaluation strategy significantly improves computation costs for both the optimal and the proposed methods.
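The cost gap claimed in Lemma 1 is easy to sanity-check numerically. The following toy computation (a sketch with illustrative constants, not part of the paper's analysis) evaluates the 2-arm upper bound and the single-arm lower bound from the lemma's two-group argument:

def two_arm_upper(n, c_pd, c_t, eps):
    # n objects require 2n pick/drop actions in total; each arm only
    # travels eps-scale distances within its own group
    return 2 * n * c_pd + 2 * n * eps * c_t

def single_arm_lower(n, c_pd, c_t, eps):
    # the single arm must additionally cross the table (distance ~1) once
    return 2 * n * c_pd + (2 * n - 1) * eps * c_t + c_t

for eps in (1e-1, 1e-3, 1e-5):
    n, c_pd, c_t = 10, 1e-6, 1.0
    print(eps, single_arm_lower(n, c_pd, c_t, eps) / two_arm_upper(n, c_pd, c_t, eps))

As ε and c_pd shrink, the printed ratio grows without bound, matching the lemma.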
2 Related Work
The current work on dual-arm object rearrangement touches upon the challenging intersection of a variety of rich bodies of prior work. It is closely related to multi-robot planning and coordination, where a challenge is the high dimensionality of the configuration space. Optimal strategies were developed for simpler instances of the problem [35], although in general the problem is known to be computationally hard [33]. Decentralized approaches [40] also used velocity tuning [24]. General multi-robot planning tries to plan for multiple high-dimensional platforms [14,42] using sampling-based techniques. Recent advances provide scalable [34] and asymptotically optimal [9] sampling-based frameworks.

In some cases, by restricting the input of the problem to a certain type, it is possible to cast known hard instances of a problem as related algorithmic problems which have efficient solvers. For instance, unlabeled multi-robot motion planning can be reduced to pebble motion on graphs [1]; pebble motion can be reduced to network flow [46]; and single-arm object rearrangement can be cast as a traveling salesman problem [18]. These provide the inspiration to closely inspect the structure of the problem to derive efficient solutions. In this work we leverage a connection between dual-arm rearrangement and two combinatorial problems: (1) optimal matching [10] and (2) TSP.

On the surface the problem seems closely related to multi-agent TSP. Prior work has formulated the k-TSP solution in terms of splitting a single tour [11] or as an optimization task [27]. Some work [12] deals with asymmetric edge weights, which are more relevant to the problems of our interest. The problem can also be posed as an instance of multi-vehicle pickup and delivery (PDP) [26]. Prior work [7] has applied the PDP problem to robots, taking into account time windows and robot-robot transfers. Some seminal work [23,29] has explored its complexity and established the hardness of the problem. Typically this line of work ignores coordination costs; some methods [5] reason about them on candidate solutions.

Navigation among movable objects deals with the combinatorial challenges of multiple objects [39,43]; it has been shown to be a hard problem and has been extended to manipulation applications [37]. Despite a lot of interesting work on the challenges of manipulation and grasp planning, the current work makes assumptions
that avoid the complexities arising from them. Manipulators opened up applications of rearrangement planning [3,25], including monotone instances, where each object can be grasped only once [37], as well as non-monotone instances [19,36]. Efficient solutions to assembly planning problems [17,44] typically assume monotonicity, as without it the problem becomes much more difficult. Recent work has dealt with hard instances of task planning [4,6] and rearrangement planning [18,21,22]. Sampling-based task planning has made a recent push towards guarantees of optimality [30,41]. These are broader approaches that are invariant to the combinatorial structure of the application domain. General task planning methods are unaware of the underlying structure studied in this work. The current work tries to bridge this gap and provide insights regarding the structure of dual-arm rearrangement. Under assumptions that enable this study, an efficient solution emerges for this problem.
3 Problem Setup and Notation
Consider a planar surface and a set of n rigid objects O = {o_1, o_2, ..., o_n} that can rest on the surface in stable poses p_i ∈ P_i ⊂ SE(3). The arrangement space A = P_1 × P_2 × ... × P_n is the Cartesian product of all P_i, where A = (p_1, ..., p_n) ∈ A. In valid arrangements A_val ⊂ A, the objects are pairwise disjoint (Fig. 2). Two robot arms m^1 and m^2 can pick and place the objects from and to the surface. C^k_free is the set of collision-free configurations for arm m^k (while ignoring possible collisions with objects or the other arm), and is assumed here to be a subset of R^d. A path for m^k is denoted as π^k : [0, 1] → C^k_free and includes picking and placing actions. Let C^i_free(π^j) represent the collision-free C-space for m^i, given that m^j moves along π^j. This is a function of the paths' parametrization. The space of dual-arm paths D denotes pairs of paths for the two arms: D = (π^1, π^2) ∈ D. Then, A(A_init, D) is the resulting arrangement when the objects are at A_init and moved according to D. Let cost(D) : D → R be a cost function over dual-arm paths.

Fig. 2. An example of object rearrangement involving two robotic arms. Initial (left) and final (right) object configuration.

Definition 1 (Optimal Dual-arm Rearrangement). Given arms m^1, m^2 and objects O to be moved from initial arrangement A_init ∈ A_val to target arrangement A_goal ∈ A_val, the optimal dual-arm rearrangement problem asks for a dual-arm path D* ∈ D, which satisfies the constraint

D* = (π*^1, π*^2) | A(A_init, D*) = A_goal and π*^i ∈ C^i_free(π*^j),

and optimizes the cost function

D* = argmin_{D∈D} cost(D).    (1)
A set of assumptions is introduced to deal with the problem's combinatorics. The reachable task-space T^k ⊂ SE(3) of arm m^k is the set of SE(3) poses that objects attached to the arm's end-effector can acquire. Let the ordered set of objects moved during the arm path π^k be denoted as Ō(π^k). In general, an object can appear many times in Ō(π^k). The current work, however, focuses on monotone instances, where each object is moved once.

Assumption 1 (Monotonicity). There are dual-arm paths D = (π^1, π^2) that satisfy Eq. 1, where each object o_i ∈ O appears once in Ō(π^1) or Ō(π^2).

For the problem to be solvable, all objects are reachable by at least one arm at both A_init and A_goal: ∀ p_i ∈ A_init and ∀ p_j ∈ A_goal, ∃ k ∈ [1, 2] : p_i, p_j ⊂ T^k. The focus will be on simultaneous execution of transfer and move paths.

Transfers: Dual-arm paths T(π^1_i, π^2_i) ∈ D, where Ō(π^k_i) = o^k_i and each m^k:
– starts the path in contact with an object o^k_i at its initial pose in A_init,
– and completes it in contact with object o^k_i at its final pose in A_goal.

Moves or Transits: Paths M(π^1_{i→i'}, π^2_{i→i'}) ∈ D, where Ō(π^k_{i→i'}) = ∅, and each m^k:
– starts in contact with object o^k_i at its final pose in A_goal,
– and completes it in contact with object o^k_{i'} at its initial pose in A_init.

Assumption 2 (Synchronicity). Consider dual-arm paths, which can be decomposed into a sequence of simultaneous transfers and moves for both arms:

D = (T(π^1_1, π^2_1), M(π^1_{1→2}, π^2_{1→2}), ..., M(π^1_{ℓ−1→ℓ}, π^2_{ℓ−1→ℓ}), T(π^1_ℓ, π^2_ℓ)).    (2)

For simplicity, Eq. 2 does not include an initial move from q^k_safe ∈ C^k_free and a final move back to q^k_safe. An odd number of objects can also be easily handled. Then, the sequence of object pairs moved during a dual-arm path as in Eq. 2 is:
Ω(D) = ( ω_i = (o^1_i, o^2_i) | i, j ∈ [1 ... ℓ], ∪_i (o^1_i ∪ o^2_i) = O, ∀ i ≠ j and k, k' ∈ [1, 2] : o^k_i ≠ o^{k'}_j ).
Given the pairs of objects ω_i, it is possible to express a transfer as T(ω_i) and a move as M(ω_{i→j}). Then, D(Ω) is the synchronous, monotone dual-arm path due to Ω = (ω_1, ..., ω_ℓ), i.e., D(Ω) = (T(ω_1), M(ω_{1→2}), ..., M(ω_{ℓ−1→ℓ}), T(ω_ℓ)).

Assumption 3 (Object Non-Interactivity). There are collision-free transfers T(ω_i) and moves M(ω_{i→i'}) regardless of the object poses in A_init and A_goal.

This entails that there is no interaction between the n resting objects and the arms during transits. Similarly, there are no interactions between the arm-object system and the n − 2 resting objects during the transfers. Collisions involving the arms, static obstacles and picked objects are always considered. The metric this work focuses on relates to makespan and minimizes the sum of the longest distances traveled by the arms in each synchronized operation.
Let ‖π^k‖ denote the Euclidean arc length in C^k_free ⊂ R^d of path π^k. Then, for transfers, cost(T(π^1_i, π^2_i)) = max{‖π^1_i‖, ‖π^2_i‖}. Similarly, cost(M(π^1_{i→i'}, π^2_{i→i'})) = max{‖π^1_{i→i'}‖, ‖π^2_{i→i'}‖}. Then, over the entire dual-arm path D define:

cost(D) = Σ_{i=1}^{ℓ} cost(T(ω_i)) + Σ_{i=1}^{ℓ−1} cost(M(ω_{i→i+1})).    (3)
Note that the transfer costs do not depend on the order in which the objects are moved but only on the assignment of objects to arms. The transit costs arise out of the specific ordering in Ω(D). Then, for the setup of Definition 1 and under Assumptions 1–3, the problem is to compute the optimal sequence of object pairs Ω* so that D(Ω*) satisfies Eq. 1 and minimizes the cost of Eq. 3.

Note on Assumptions: This work restricts the study to a class of monotone problems that relate to a range of industrial packing and stowing applications. The monotonicity assumption is often used in manageable variants of well-studied problems [17,44]. A monotone solution also implies that every object's start and target is reachable by at least one arm. Objects do not need to be in the commonly reachable workspace as long as the problem is monotone and solvable. Section 6 discusses a NO ACT task that can deal with unbalanced object assignments between two arms. The synchronicity assumption makes it possible to study the combinatorial challenges of the problem, which do not relate to the choice of time synchronization of different picks and placements. Section 6 describes the use of dRRT* [9] as the underlying motion planner, which can discover solutions that synchronize arm motions for simultaneous picks and simultaneous placements. The synchronicity assumption is relaxed through smoothing (Sect. 6). The non-interactivity assumption comes up naturally in planar tabletop scenarios with top-down picks or delta robots. Such scenarios are popular in industrial settings. Once the object is raised from the resting surface, transporting it to its target does not introduce interactions with the other resting objects. This assumption is also relaxed in Sect. 6, with a lazy variant of the proposed method. Once a complete candidate solution is obtained, collision checking can be done with everything in the scene. Overall, under the assumptions, the current work identifies a problem structure, which allows arguments pertaining to the search space, completeness, and optimality. Nevertheless, the smoothed, lazy variant of the proposed method will still return effective solutions in practice, even if these assumptions do not hold.
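To make the cost model concrete, the following is a minimal sketch of Eq. 3 in Python, assuming each arm path is represented as a polyline of configuration-space points, and that transfers and moves are given as the synchronized pairs (π^1, π^2) of the decomposition in Eq. 2:

import math

def arc_length(path):
    # Euclidean arc length of a C-space polyline given as a list of tuples
    return sum(math.dist(p, q) for p, q in zip(path, path[1:]))

def sync_cost(pair):
    # cost of one synchronized operation: max of the two arm path lengths
    path1, path2 = pair
    return max(arc_length(path1), arc_length(path2))

def dual_arm_cost(transfers, moves):
    # cost(D) of Eq. 3: sum over synchronized transfers and moves
    return sum(sync_cost(t) for t in transfers) + sum(sync_cost(m) for m in moves)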
4 Baseline Approaches and Size of Search Space
This section highlights two optimal strategies to discover D(Ω*): (a) exhaustive search, which reveals the search space of all possible sequences of object pairs Ω, and (b) an MILP model. Both alternatives, however, suffer from scalability issues, since for each possible assignment it is necessary to solve a coordinated motion planning problem for the arms. This motivates minimizing the number of
assignments ω considered, and the number of motion planning queries required for discovering D(Ω*), while still aiming for high-quality solutions.

Exhaustive Search: The exhaustive search approach shown in Fig. 3 is a brute-force expansion of all possible sequences of object pairs Ω. Nodes correspond to transfers T(ω_i) and edges are moves M(ω_{i→i'}). The approach evaluates the cost of all possible branches to return the best sequence Ω. The total number of nodes is in the order of O(n!), expressed over the levels L of the search tree as:

^nP_2 + ^nP_2 × ^{n−2}P_2 + ... + ^nP_2 × ^{n−2}P_2 × ^{n−4}P_2 × ... × ^2P_2 = Σ_{L=1}^{n/2} Π_{k=0}^{2L−1} (n − k),
where ^nP_k is the number of k-permutations of n. Motion plans can be reused, however, and repeated occurrences of T(ω_i) and M(ω_{i→i'}) should be counted only once, for a total of:
– ^nP_2 transfers of objects, and
– ^nP_2 × ^{n−2}P_2 transits between all possible valid ordered pairs of ω.

Fig. 3. Search tree for 4 objects.

Additional motion plans are needed for the initial move from q^k_safe and the return to it at the end of the process, introducing 2 × ^nP_2 transits:

# of Transfers + # of Moves = ^nP_2 + (^nP_2 × ^{n−2}P_2) + (2 × ^nP_2)    (4)

This returns the optimal synchronized solution, but performs an exhaustive search and requires exponentially many calls to a motion planner.

MILP Formulation: Mixed Integer Linear Programming (MILP) formulations can utilize highly optimized solvers [16]. Prior work has applied these techniques for solving m-TSP [12,27] and pickup-and-delivery problems [7,29], but viewed these problems in a decoupled manner. This work outlines an MILP formulation for the synchronized dual-arm rearrangement problem that reasons about coordination costs arising from arm interactions in a shared workspace.

Graph Representation: The problem can be represented as a directed graph where each vertex v = ω_v = (o^1_v, o^2_v) corresponds to a transfer T(ω_v) and edges e(u, v) are valid moves M(ω_{u→v}). A valid edge e(u, v) is one where an object does not appear more than once in the transfers of nodes u and v. The cost of a directed edge e(u, v) encodes both the cost of the transfer T(ω_u) and the cost of the move M(ω_{u→v}). There is also a vertex S, which connects moves from and to the safe arm configurations q^k_safe. The directed graph Ĝ(V̂, Ê) is defined as:

V̂ = {v = ω_v = (o^1_v, o^2_v) | ∀ o^1_v, o^2_v ∈ O, o^1_v ≠ o^2_v} ∪ {S}
Ê = {e(u, v) | ∀ u, v ∈ V̂ so that u ≠ v, o^k_v ≠ o^{k'}_u ∀ k, k' ∈ [1, 2]} ∪ {e(S, v) ∀ v ∈ V̂ \ S} ∪ {e(v, S) ∀ v ∈ V̂ \ S}
cost_{e(u,v)} = cost(u) + cost(u, v) = cost(T(ω_u)) + cost(M(ω_{u→v}))
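For concreteness, a sketch of constructing Ĝ follows, assuming a black-box coordinated_cost routine (a hypothetical helper standing in for the coordinated dual-arm motion planner) that returns the max-of-arc-lengths cost of a synchronized transfer or move:

from itertools import permutations

def build_milp_graph(objects, coordinated_cost):
    S = "S"  # virtual vertex for the safe start/end configurations
    V = list(permutations(objects, 2)) + [S]  # ordered pairs (o^1_v, o^2_v)
    E = {}
    for u in V:
        for v in V:
            if u == v or (u != S and v != S and set(u) & set(v)):
                continue  # an object may not repeat across the two transfers
            transfer = 0.0 if u == S else coordinated_cost(("transfer", u))
            move = coordinated_cost(("move", u, v))
            E[(u, v)] = transfer + move  # cost(T(w_u)) + cost(M(w_{u->v}))
    return V, E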
Let cost(S) = 0. The total number of motion planning queries that need to be answered to define the edge costs is expressed in Eq. 4. The formulation proposed in this section tries to ensure the discovery of Ω* on Ĝ as a tour that starts and ends at S, while traversing each vertex corresponding to Ω*. To state the MILP formulation, define δ_in(v) as the in-edge set of v, and δ_out(v) as the out-edge set. Then, γ(o) is the object coverage set γ(o) = {e(u, v) | e ∈ Ê, o ∈ ω_u}, i.e., all the edges that transfer o.

Model: Set the optimization objective as:

min Σ_{e∈Ê} cost_e · x_e    [A]

Equation [B] below defines indicator variables. Equations [C–E] ensure edge-flow-conserving tours. Equations [F–G] force S to be part of the tour. Equation [H] transfers every object exactly once. Equation [I] lazily enforces the tour to be of length n/2 + 1. While the number of motion-planning queries to be solved is the same as in exhaustive search, efficient MILP solvers [16] provide a more scalable search process.

x_e ∈ {0, 1}  ∀ e ∈ Ê    [B]
Σ_{e∈δ_in(v)} x_e ≤ 1  ∀ v ∈ V̂    [C]
Σ_{e∈δ_out(v)} x_e ≤ 1  ∀ v ∈ V̂    [D]
Σ_{e∈δ_in(v)} x_e = Σ_{e∈δ_out(v)} x_e  ∀ v ∈ V̂    [E]
Σ_{e∈δ_out(S)} x_e = 1    [F]
Σ_{e∈δ_in(S)} x_e = 1    [G]
Σ_{e∈γ(o)} x_e = 1  ∀ o ∈ O    [H]
Σ_{e(u,v)∈T} x_e < |T|  ∀ T ⊂ V̂, |T| ≤ n/2    [I]
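The following is a hedged sketch of constraints [A]–[H] in Gurobi's Python API, assuming V and E come from a graph builder such as the one sketched earlier; the subtour-elimination cuts [I] are noted but left to a lazy callback:

import gurobipy as gp
from gurobipy import GRB

def solve_milp(V, E, objects, S="S"):
    model = gp.Model("dual_arm_rearrangement")
    x = {e: model.addVar(vtype=GRB.BINARY) for e in E}               # [B]
    model.setObjective(gp.quicksum(E[e] * x[e] for e in E), GRB.MINIMIZE)  # [A]
    for v in V:
        ins = [x[e] for e in E if e[1] == v]
        outs = [x[e] for e in E if e[0] == v]
        model.addConstr(gp.quicksum(ins) <= 1)                       # [C]
        model.addConstr(gp.quicksum(outs) <= 1)                      # [D]
        model.addConstr(gp.quicksum(ins) == gp.quicksum(outs))       # [E]
    model.addConstr(gp.quicksum(x[e] for e in E if e[0] == S) == 1)  # [F]
    model.addConstr(gp.quicksum(x[e] for e in E if e[1] == S) == 1)  # [G]
    for o in objects:                                                # [H]
        model.addConstr(gp.quicksum(
            x[e] for e in E if e[0] != S and o in e[0]) == 1)
    # [I] would be added lazily (e.g., in a MIPSOL callback) to forbid
    # subtours that do not pass through S.
    model.optimize()
    return [e for e in E if x[e].X > 0.5]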
5 Efficient Solution via Tour over Matching
The optimal baseline methods described above highlight the problem's complexity. Both methods suffer from the large number of motion-planning queries they have to perform to compute the cost measures on the corresponding search structures. It is therefore worth asking whether the problem can be decomposed into efficiently solvable sub-problems.

Importance of Transfers: In order to draw some insight, consider again the tabletop setup with a general cost measure of c_t per unit distance. Lemma 1 suggests that under certain conditions, there may not be a meaningful bound on the performance ratio between a k-arm solution and a single-arm solution. This motivates the examination of another often-used setting: randomly chosen non-overlapping start and goal locations for n objects (within a unit square). In order to derive a meaningful bound on the benefit of a 2-arm solution over a single-arm solution, we first derive a conservative cost of a single-arm solution.
A single-arm optimal cost has three main parts: (1) the portion of the transfer cost involving the pickup and drop-off of the n objects, with a cost of C_pd = n·c_pd; (2) the remaining transfer cost from start to goal for all objects, C_sg; and (3) the transit cost of going from the goals to the starts, C_gs. The single-arm cost is

C_single = n·c_pd + C_sg + C_gs.    (5)
To estimate Eq. 5, first note that the randomized setup allows us to obtain the expected C_sg [28] as ((2 + √2 + 5 ln(1 + √2))/15)·n·c_t ≈ 0.52n·c_t. Approximate C_gs by simply computing an optimal assignment of the goals to the starts of the objects, excluding the matching of the same start and goal. Denote the distance cost of this matching as C^M_gs; clearly C_gs > C^M_gs, because the paths produced by the matching may form multiple closed loops instead of the desired single loop that connects all starts and goals. However, the number of loops produced by the matching procedure is on the order of ln n and therefore C_gs < C^M_gs + c_t ln n, by [38]. By [2], C^M_gs = Θ(√(n ln n)). Putting these together, we have

Θ(c_t √(n ln n)) = C^M_gs < C_gs < C^M_gs + c_t ln n = Θ(c_t √(n ln n)) + c_t ln n,

which implies that C_gs ≈ C^M_gs, because for large n, ln n ≪ √(n ln n). It is estimated in [45] that C^M_gs ≈ 0.44 √(n ln n)·c_t for large n. Therefore,
C_single ≈ n·c_pd + (0.52n + 0.44 √(n ln n))·c_t ≈ (c_pd + 0.52c_t)·n,    (6)

noting that the 0.44 √(n ln n)·c_t term may also be ignored for large n. The cost of dual-arm solutions will be analyzed in Sect. 7.

Lemma 3. For large n, the transfers dominate the cost of the solution.

We formulate a strategy to reflect this importance of transfers. The proposed approach gives up on the optimality of the complete problem, instead focusing on a high-quality solution, which:
– first optimizes transfers and selects an assignment of object pairs to arms,
– and then considers move costs and optimizes the schedule of assignments.

This ends up scaling better by effectively reducing the size of the search space and performing fewer motion planning queries. It does so by optimizing a related cost objective and taking advantage of efficient polynomial-time algorithms.

G(V, E):  V = {v = o, ∀ o ∈ O},
          E = {e(u, v) : ω = (u, v), ∀ u, v ∈ V},
          cost(e(u, v)) = cost(T(ω = (u, v)))    (7)

G_Γ(V_Γ, E_Γ):  V_Γ = {v = ω, ∀ ω ∈ {Ω}*} ∪ {ω_S},
                E_Γ = {e(u, v) : ω_{u→v}, ∀ u, v ∈ V_Γ},
                cost(e(u, v)) = cost(M(ω_{u→v}))    (8)
Foundations: Consider a complete weighted directed graph G(V, E) (Eq. 7), where v ∈ V corresponds to a single object o. Each directed edge e = (o_i, o_j) ∈ E is an ordered pair of objects o_i and o_j, where the order determines the assignment of an object to an arm m^1 or m^2. The cost of an edge cost(e) is the coordinated motion planning cost of performing the transfer corresponding to ω = (o_i, o_j). For instance, as shown in Fig. 4 (middle), e(A, B) corresponds to m^1 transferring 'A' while m^2 transfers 'B', and cost(e(A, B)) = cost(T(ω = (A, B))) is the cost of such a concurrent motion. It should also be noted that since the arms are different, in general, cost(e(A, B)) ≠ cost(e(B, A)).

Fig. 4. (left) A dual-arm rearrangement problem. (middle) The same problem as minimum weight edge matching on a fully connected directed graph of transfers. (right) The ordering of the object-arm assignments from an optimal tour over a transit graph.

Following the observation made in Eq. 3, the cost of the transfers can be reasoned about independently of their order. Define {Ω} as the unordered set of ω ∈ Ω; then the unordered transfer cost component corresponds to Σ_{ω∈{Ω}} cost(T(ω)). Since G is a complete graph where every edge corresponds to every possible valid ω, by construction {Ω} ⊂ E. A candidate solution of a monotone dual-arm rearrangement problem must transfer every object exactly once. In terms of the graph, this means that in the subset of edges {Ω} every vertex appears in only one edge. We arrive at the following crucial observation.

Lemma 4 (Perfect Matching). Every candidate solution to a monotone dual-arm rearrangement problem comprises a set of unordered object-to-arm assignments {Ω} that is also a perfect matching solution on G.

As per the decomposition of the costs in Eq. 3, it follows that cost({Ω}) is the cost of the transfers in the solution. The solution to the minimum-weight perfect edge matching problem on such a graph corresponds to a {Ω} that optimizes the cost of transfers of all the objects.

Lemma 5 (Optimal Matching). The set of object-to-arm assignments {Ω}* that minimizes the cost of object transfers is a solution to the minimum-weight perfect matching problem on G.

Once such an optimal assignment {Ω}* is obtained, the missing part of the complete solution is the set of transits between the object transfers and their ordering. Construct another directed graph G_Γ (Eq. 8), where the vertices comprise the ω ∈ {Ω}*. An edge e(ω_{u→v}) corresponds to the coordinated transit motion between them. For instance, as in Fig. 4 (right), for an edge between ω(A, B) and ω(C, D), m^1 moves from the target pose of object 'A' to the starting pose of object 'C', and similarly m^2 moves from the target of 'B' to the start
of ‘D’. An additional vertex corresponding to the starting (and ending) configuration of the two arms (S) is added to the graph. The graph is fully connected again to represent all possible transits or moves. A complete candidate solution to the problem now requires the sequence of ω, which is a complete tour over GΓ , that visits all the vertices i.e., an ordered sequence of vertices Γ = (S, ΩΓ , S). Lemma 6 (Tour). Any complete tour Γ over the graph GΓ , corresponds to a sequence of object-to-arm assignments ΩΓ and is a candidate solution to the synchronous dual-arm rearrangement problem. Let P{Ω} represent the set of all possible ordering of the elements in {Ω}. This means, any candidate tour corresponds to a ΩΓ ∈ P{Ω} . An optimal tour on GΓ minimizes the transit costs over the all possible candidate solutions in ∗ P{Ω} . cost(M (ωu→v ))) (9) Ω + = argmin Ω∈P{Ω}∗ e(u,v)∈Γ
This differs from the true optimal Ω*, since the second step of finding the optimal transit tour only operates over all possible solutions that include the optimally matched transfers obtained in the first step. The insight here is that, even though Ω^+ reports a solution to a hierarchical optimization objective, the search space is much smaller, and the sub-problems are more efficient to solve.

Algorithm: This section describes the algorithm Tour Over Matching (TOM) outlined above. The steps correspond to Algorithm 1.

Algorithm 1: TOM(O, S, A_init, A_goal)
1  G ← transfer_graph(O, A_init, A_goal);
2  {Ω}* ← optimal_matching(G);
3  G_Γ ← transit_graph({Ω}*, S);
4  Ω^+ ← optimal_tour(G_Γ);
5  return D(Ω^+);

transfer_graph: This function constructs a directed graph G defined by Eq. 7. This step creates a graph with n vertices and ^nP_2 edges.

optimal_matching: This function takes the constructed graph G as an argument and returns the unordered set of edges corresponding to the set of optimal transfers over G. Optimal matching over an undirected graph can be solved using Edmonds' blossom algorithm [8,10,13]. The directed graph G is converted into an equivalent undirected graph G_U. Since G is complete, every pair of vertices shares two directed edges; G_U only preserves the minimally weighted connection for every vertex pair. The result of matching is a subset of edges on G_U which corresponds to a set of directed edges on G, i.e., {Ω}*. The runtime complexity of this step is O(|E||V| log |V|) = O(^nP_2 · n log n) = O(n^3 log n).

transit_graph: This function constructs the directed transit graph G_Γ as per Eq. 8. This constructs n/2 + 1 vertices and ^{(n/2+1)}P_2 edges.

optimal_tour: Standard TSP solvers like Gurobi [16] can be used to find the optimal tour over G_Γ corresponding to Ω^+.
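The two combinatorial stages of TOM can be sketched compactly, assuming an even number of objects (so a perfect matching exists), a precomputed transfer_cost for ordered object pairs, and a transit_cost oracle that also accepts the virtual start vertex "S". The tour step below uses a simple nearest-neighbor placeholder where the paper employs an exact TSP solver:

import networkx as nx

def tom(objects, transfer_cost, transit_cost):
    # Stage 1: min-weight perfect matching on the undirected version of G,
    # keeping the cheaper of the two arm assignments for each object pair.
    G = nx.Graph()
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            c_ab, c_ba = transfer_cost(a, b), transfer_cost(b, a)
            w, pair = (c_ab, (a, b)) if c_ab <= c_ba else (c_ba, (b, a))
            G.add_edge(a, b, weight=w, assignment=pair)
    matching = nx.min_weight_matching(G)  # blossom-based matching
    omegas = [G.edges[u, v]["assignment"] for u, v in matching]

    # Stage 2: order the assignments by a tour over the transit graph,
    # starting and ending at the safe configuration "S".
    order, current, remaining = [], "S", set(omegas)
    while remaining:
        nxt = min(remaining, key=lambda w: transit_cost(current, w))
        order.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return order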
The costs are deduced from coordinated motion plans over edges. The total number of such calls, compared to the count from the baseline in Eq. 4, shows a saving on the order of O(n^2) queries (# of Transfers + # of Moves):

Baseline # / TOM # = (^nP_2 + (^nP_2 × ^{n−2}P_2) + 2 × ^nP_2) / (^nP_2 + ^{(n/2+1)}P_2) = 4(n − 1)((n − 5)n + 9) / (5n − 2).

The evaluation performed here focuses on the maximum of distances (Eq. 3) for a fair comparison with the other methods. The prioritization of the optimization objective, however, is also amenable to other cost functions, where carrying objects is often more expensive than object-free motions.
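The saving can be checked directly, assuming n is even (a quick numerical sketch):

def perm(n, k):
    # number of k-permutations of n
    out = 1
    for i in range(k):
        out *= n - i
    return out

def baseline_queries(n):
    return perm(n, 2) + perm(n, 2) * perm(n - 2, 2) + 2 * perm(n, 2)  # Eq. 4

def tom_queries(n):
    return perm(n, 2) + perm(n // 2 + 1, 2)  # edges of G and G_Gamma

for n in (6, 10, 20):
    closed_form = 4 * (n - 1) * ((n - 5) * n + 9) / (5 * n - 2)
    print(n, baseline_queries(n) / tom_queries(n), closed_form)  # ratios agree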
6 Integration with Motion Planning
The algorithms described so far are agnostic to the underlying motion planner. Depending upon the model of the application domain, different motion planning primitives might be appropriate. For planar environments with disk robot pickers (similar to delta robots), recent work [20] characterizes the optimal two-disk coordinated motions. The current implementation uses a general multi-robot motion planning framework, dRRT* [9], for dual-arm coordinated planning. In practice, the cost of generating and evaluating two-arm motions can dominate the overall running time of the algorithm, when compared to the combinatorial ingredients that discover the high-level plan, i.e., execution order and arm assignment. Even though TOM reduces this cost, further improvements can be made with lazy evaluation.

Lazy Evaluation: Recent work [31] introduces heuristics for dRRT*, which preprocess estimates of the shortest path costs for the arms. Dual-arm rearrangement can be significantly sped up if the motion planning queries are replaced with look-ups of such heuristics. Once a candidate Ω is obtained, motion planning can evaluate the solution D(Ω). If this fails, the algorithm tries other sequences.

Algorithm 2: Lazy_Evaluation(ALGO, H, MP)
1   E_b ← ∅; D ← ∅;
2   while D = ∅ ∧ time not exceeded do
3       Ω ← ALGO(H, E_b);
4       for ω_i, ω_{i→i+1} ∈ Ω do
5           D_i, D_{i→i+1} ← MP(ω_i), MP(ω_{i→i+1});
6           D ← (D, D_i, D_{i→i+1});
7           if D_i = ∅ then
8               E_b ← E_b ∪ {ω_i}; D ← ∅;
9           if D_{i→i+1} = ∅ then
10              E_b ← E_b ∪ {ω_{i→i+1}}; D ← ∅;
11          if D = ∅ then break
12  return D

Algorithm 2 takes as input the algorithm ALGO, a heuristic H, and a motion planner MP. E_b keeps track of the blocked edges. The process keeps generating candidate solutions using ALGO (Line 3). Line 5 motion-plans over the candidate solution and appends the result (Line 6). Any failures are recorded in E_b (Lines 8, 10) and inform subsequent runs of ALGO.
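A direct Python rendering of Algorithm 2 follows, assuming algo proposes a candidate task sequence given the blocked set, and motion_plan returns a coordinated plan or None on failure (both names are illustrative stand-ins):

import time

def lazy_evaluation(algo, heuristic, motion_plan, timeout=300.0):
    blocked = set()  # E_b: edges whose motion plans have failed
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        candidate = algo(heuristic, blocked)   # high-level plan from look-ups
        if candidate is None:
            return None                        # ALGO itself cannot succeed
        solution, failed = [], False
        for task in candidate:                 # transfers and moves, in order
            plan = motion_plan(task)
            if plan is None:                   # record failure and re-plan
                blocked.add(task)
                failed = True
                break
            solution.append(plan)
        if not failed:
            return solution
    return None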
Completeness: The lazy variants give up on optimality for the sake of efficiency, but given enough retries the algorithms will eventually solve every problem that ALGO can. Since the motion planning happens in the order of execution, the object non-interactivity assumption is relaxed.

Smoothing: Applying velocity tuning over the solution trajectories for the individual arms relaxes the synchronization assumption by minimizing any waits that might be a by-product of the synchronization. Smoothing does not change the maximum of distances; it only improves execution time. Indications that the smoothed variants of the synchronous solutions do not provide significant savings in execution time are included in the Appendix [32] for the interested reader.

It should be noted that in an iterative version of TOM, in order to explore variations in {Ω}* when failures occur in finding Γ, some edges of {Ω}* need to be temporarily blocked. The search structures can also be augmented with NO ACT tasks, for possible ω where one of the arms does not move.
7 Bounding Costs Under Planar Disc Manipulator Model
Following from Lemma 3, the current analysis studies the dual-arm costs in the randomized unit tabletop setting, with c_t as the cost measure per unit distance. For the 2-arm setting, assume for simplicity that each arm's volume is represented as a disc of some radius r. For obtaining a 2-arm solution, we partition the n objects randomly into two piles of n/2 objects each; then we obtain two initial solutions similar to the single-arm case. It is expected (Eq. 6) that these two halves add up to approximately (c_pd + 0.52c_t)·n. From the initial 2-arm solution, we construct an asynchronous 2-arm solution that is collision-free. Assume that pickups and drop-offs can be achieved without collisions between the two arms, which can be achieved with properly designed end-effectors. The main overhead is then the potential collision between the two (disc) arms during transfer and move operations. Because there are n/2 objects for each arm to work with, an arm may travel a path formed by n + 1 straight-line segments. Therefore, there are up to (n + 1)^2 intersections between the two end-effector trajectories where potential collisions may happen. However, because the transfers and transits associated with one pair of objects (one for each arm) can have at most four intersections, there are at most 2n potential collisions to handle. For each intersection, let one arm wait while the other circles around it, which incurs a cost that is bounded by 2π·r·c_t. Adding up all the potential extra cost, a cumulative cost is obtained as

C_dual = C_single + 2n(2πr·c_t) ≈ (c_pd + 0.52c_t + 4πr·c_t)·n.

For small r, C_dual is almost the same as C_single, when c_t is a distance (e.g., energy) cost. Upon considering the maximum of the two arc lengths, or makespan (Eq. 3), the 2-arm cost becomes C^t_dual ≈ (c_pd + 0.52c_t)·n/2 + 4nπr·c_t. The cost ratio is

C^t_dual / C_single = ((c_pd + 0.52c_t)·n/2 + 4nπr·c_t) / ((c_pd + 0.52c_t)·n) ≈ 1/2 + 4πr·c_t / (c_pd + 0.52c_t).    (10)
When r is small or when c_t/c_pd is small, the 2-arm solution is roughly half as costly as the single-arm solution. On the other hand, in this model a 2-arm solution does not do better than 1/2 of the single-arm solution. This argument can be extended to k arms as well [32].
Theorem 1. For rearranging objects with non-overlapping starts and goals that are uniformly distributed in a unit square, a 2-arm solution can have an asymptotic improvement of 1/2 over the single-arm solution.

The synchronization assumption changes the expected cost of the solution. The random partitioning of the n objects into two sets of n/2 objects with a random ordering of the objects yields n/2 pairs of object transfers, which dominate the total cost for large n. The cost (Eq. 3) of the n/2 synchronized transfers (ω_i) includes (n/2)·c_pd and C^sync_sg ≈ E(max(l_1, l_2))·c_t·n/2, where E(max(l_1, l_2)) is the expected measure of the maximum of the lengths l_1, l_2 of two randomly paired transfers. Using the pdf [15] of lengths of random lines in a unit square and integrating over the setup [32] yields a value of E(max(l_1, l_2)) of ≈ 0.66. This means C^sync_dual ≈ (c_pd + 0.66c_t)·n/2 + 4nπr·c_t. The synchronized cost ratio is

C^sync_dual / C_single = ((c_pd + 0.66c_t)·n/2 + 4nπr·c_t) / ((c_pd + 0.52c_t)·n) ≈ 1/2 + (0.07 + 4πr)·c_t / (c_pd + 0.52c_t).    (11)
When c_t/c_pd is small, even the synchronized 2-arm solution provides an improvement of 1/2. For the case when both r and c_pd are small, we observe that the ratio approaches 0.636.
Theorem 2. For rearranging objects with non-overlapping starts and goals that are uniformly distributed in a unit square, a randomized 2-arm synchronized solution can have an asymptotic improvement of 1/2 over the single-arm solution if c_t/c_pd is small, and an improvement of ≈ 0.64 when both c_pd and r are small.
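The two constants used in this analysis can be reproduced with a quick Monte Carlo sketch (a numerical check, not a proof): the expected length of a random start-goal segment in the unit square is ≈ 0.52, and the expected maximum of two independent such lengths is ≈ 0.66:

import random, math

def seg():
    # length of a segment between two uniform random points in the unit square
    return math.dist((random.random(), random.random()),
                     (random.random(), random.random()))

N = 200_000
print(sum(seg() for _ in range(N)) / N)              # ~0.5214, the 0.52 term
print(sum(max(seg(), seg()) for _ in range(N)) / N)  # ~0.66, E(max(l1, l2))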
8 Evaluation
This section describes the experiments performed to evaluate the algorithms in the two domains shown in Fig. 5: (a) simple picker and (b) general manipulators. In order to ensure monotonicity, the object starts and goals do not overlap. Uniform cuboidal objects simplify the grasping problem, though this is not a limitation of the methods. 50 random experiments were limited to 300 s of computation time. The underlying dRRT* motion planner is restricted to a maximum of 3 s per plan. A comparison point includes a random split method, which splits O at random into two subsets and chooses an arbitrary ordering. The maximum-of-distances cost is compared to the single-arm solution [18].

Fig. 5. Picker and Manipulator trials.

Computation times
and success rates are reported. The trends in both experiments show that in the single-shot versions, exhaustive and MILP tend to time out for larger n. Lazy variants scale much better for all the algorithms and in some cases increase the success ratio due to retries. TOM has a much better running time than exhaustive and MILP, and produces better and more solutions than random split. Overall, the results show that (a) our MILP succeeds more often within the time limit than exhaustive, (b) TOM scales the best among all the methods, and (c) the cost of solutions from TOM is close to the optimal baseline, which is around half of the single-arm cost.

Simple Picker: This benchmark evaluates two disk robots hovering over a planar surface scattered with objects. The robots are only free to move around in a plane parallel to the resting plane of the objects, and they can pick up objects when directly above them. In Fig. 6 (top), all runs up to 24 objects succeeded for TOM. MILP scales better than exhaustive. Lazy random split succeeds in all cases (bottom). In terms of solution costs (middle), exhaustive finds the true optimal; MILP matches exhaustive, and TOM is competitive. In all experiments, TOM enjoys a success rate of 100%.

General Dexterous Manipulator: The second benchmark sets up two Kuka arms across a table with objects on it. The objects are placed in the part of the workspace commonly reachable by both arms, and only one top-down grasping configuration is allowed for each object pose. Here (Fig. 7), a larger number of motion plans tend to fail, so the single-shot variants show artifacts of the randomness of dRRT* in their success rates. Random split performs the worst, since it is unlikely to chance upon valid motion plans. Single-shot exhaustive and MILP scale poorly because of expensive motion planning. Interestingly, motion planning infeasibility reduces the size of the exhaustive search tree. The solution costs (middle) substantiate the benefits of using two arms. The computation times (bottom) again show the scalability of TOM, even compared to random split.
Fig. 6. Simple Picker results with success (top), solution costs (middle), and computation (bottom) reported for single-shot (left) and lazy (right) versions of the methods.
Fig. 7. Kuka results with success (top), solution costs (middle), and computation (bottom) reported for single-shot (left) and lazy (right) versions of the methods.
9 Discussion
The current work demonstrates the underlying structure of synchronized dual-arm rearrangement and proposes an MILP formulation, as well as a scalable algorithm, TOM, that provides fast, high-quality solutions. Existing efficient solvers for reductions of the dual-arm problem make TOM effective. Future work will explore the k-arm case. The matching sub-problem ceases to have effective solvers for k > 2, and heuristics might need to be considered for feasible solutions. Regrasp reasoning can extend this work to non-monotone problem instances. The incorporation of manipulation and grasp reasoning can apply such solvers to more cluttered environments.
References

1. Adler, A., de Berg, M., Halperin, D., Solovey, K.: Efficient multi-robot motion planning for unlabeled discs in simple polygons. IEEE T-ASE 12(4) (2015)
2. Ajtai, M., Komlós, J., Tusnády, G.: On optimal matchings. Combinatorica 4(4) (1984). https://doi.org/10.1007/BF02579135
3. Ben-Shahar, O., Rivlin, E.: Practical pushing planning for rearrangement tasks. IEEE Trans. Robot. Autom. 14(4), 549–565 (1998)
4. Berenson, D., Srinivasa, S., Kuffner, J.: Task space regions: a framework for pose-constrained manipulation planning. IJRR 30(12), 1435–1460 (2011)
5. Caricato, P., Ghiani, G., Grieco, A., Guerriero, E.: Parallel tabu search for a pickup and delivery problem under track contention. Parallel Comput. 29(5), 631–639 (2003)
6. Cohen, B., Chitta, S., Likhachev, M.: Single- and dual-arm motion planning with heuristic search. IJRR 33(2), 305–320 (2014)
7. Coltin, B.: Multi-agent pickup and delivery planning with transfers. Ph.D. thesis, Carnegie Mellon University, CMU-RI-TR-14-05 (2014)
8. Dezső, B., Jüttner, A., Kovács, P.: LEMON: an open source C++ graph template library. Electron. Notes Theor. Comput. Sci. 264(5), 23–45 (2011)
9. Shome, R., Solovey, K., Dobson, A., Halperin, D., Bekris, K.E.: Scalable and informed asymptotically-optimal multi-robot motion planning. Auton. Robot. 44(3), 443–467 (2020). https://doi.org/10.1007/s10514-019-09832-9. ISSN 1573-7527
10. Edmonds, J.: Maximum matching and a polyhedron with 0,1-vertices. J. Res. Natl. Bureau Stand. B 69(125–130), 55–56 (1965)
11. Frederickson, G.N., Hecht, M.S., Kim, C.E.: Approximation algorithms for some routing problems. In: FOCS (1976)
12. Friggstad, Z.: Multiple traveling salesmen in asymmetric metrics. In: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer (2013)
13. Galil, Z., Micali, S., Gabow, H.: An O(EV log V) algorithm for finding a maximal weighted matching in general graphs. SIAM J. Comput. 15(1), 120–130 (1986)
14. Gharbi, M., Cortés, J., Siméon, T.: Roadmap composition for multi-arm systems path planning. In: IROS (2009)
15. Ghosh, B.: Random distances within a rectangle and between two rectangles. Bull. Calcutta Math. Soc. 43, 17–24 (1951)
16. Gurobi Optimization, LLC: Gurobi optimizer reference manual (2016)
17. Halperin, D., Latombe, J.C., Wilson, R.H.: A general framework for assembly planning: the motion space approach. Algorithmica 26(3–4), 577–601 (2000)
18. Han, S.D., Stiffler, N., Krontiris, A., Bekris, K., Yu, J.: Complexity results and fast methods for optimal tabletop rearrangement with overhand grasps. IJRR 37, 1775–1795 (2018)
19. Havur, G., Ozbilgin, G., Erdem, E., Patoglu, V.: Geometric rearrangement of multiple movable objects on cluttered surfaces. In: ICRA (2014)
20. Kirkpatrick, D., Liu, P.: Characterizing minimum-length coordinated motions for two discs. arXiv preprint arXiv:1607.04005 (2016)
21. Krontiris, A., Bekris, K.E.: Dealing with difficult instances of object rearrangement. In: Robotics: Science and Systems (2015)
22. Krontiris, A., Bekris, K.E.: Efficiently solving general rearrangement tasks: a fast extension primitive for an incremental sampling-based planner. In: ICRA (2016)
23. Lenstra, J.K., Kan, A.: Complexity of vehicle routing and scheduling problems. Networks 11(2), 221–227 (1981)
24. Leroy, S., Laumond, J.P., Siméon, T.: Multiple path coordination for mobile robots: a geometric algorithm. In: IJCAI, vol. 99 (1999)
25. Ota, J.: Rearrangement of multiple movable objects: integration of global and local planning methodology. In: ICRA, vol. 2 (2004)
26. Parragh, S.N., Doerner, K.F., Hartl, R.F.: A survey on pickup and delivery models part II: transportation between pickup and delivery locations. J. Betriebswirtschaft 58(2), 81–117 (2008)
27. Rathinam, S., Sengupta, R.: Matroid intersection and its application to a multiple depot, multiple TSP (2006)
28. Santaló, L.A.: Integral Geometry and Geometric Probability. Cambridge University Press, Cambridge (2004)
29. Savelsbergh, M.W., Sol, M.: The general pickup and delivery problem. Transp. Sci. 29(1), 17–29 (1995)
30. Schmitt, P.S., Neubauer, W., Feiten, W., Wurm, K.M., Wichert, G.V., Burgard, W.: Optimal, sampling-based manipulation planning. In: ICRA. IEEE (2017)
31. Shome, R., Bekris, K.E.: Improving the scalability of asymptotically optimal motion planning for humanoid dual-arm manipulators. In: Humanoids (2017)
32. Shome, R., Solovey, K., Yu, J., Bekris, K., Halperin, D.: Fast, high-quality dual-arm rearrangement in synchronous, monotone tabletop setups. arXiv:1810.12202 [cs.RO] (2018)
33. Solovey, K., Halperin, D.: On the hardness of unlabeled multi-robot motion planning. IJRR 35(14), 1750–1759 (2016)
34. Solovey, K., Salzman, O., Halperin, D.: Finding a needle in an exponential haystack: discrete RRT for exploration of implicit roadmaps in multi-robot motion planning. IJRR 35(5), 501–513 (2016)
35. Solovey, K., Yu, J., Zamir, O., Halperin, D.: Motion planning for unlabeled discs with optimality guarantees. arXiv preprint arXiv:1504.05218 (2015)
36. Srivastava, S., Fang, E., Riano, L., Chitnis, R., Russell, S., Abbeel, P.: Combined task and motion planning through an extensible planner-independent interface layer. In: ICRA (2014)
37. Stilman, M., Schamburek, J.U., Kuffner, J., Asfour, T.: Manipulation planning among movable obstacles. In: ICRA (2007)
38. Treleaven, K., Pavone, M., Frazzoli, E.: Asymptotically optimal algorithms for one-to-one pickup and delivery problems with applications to transportation systems. IEEE Trans. Autom. Control 59(9), 2261–2276 (2013)
39. Van Den Berg, J., Stilman, M., Kuffner, J., Lin, M., Manocha, D.: Path planning among movable obstacles: a probabilistically complete approach. In: WAFR (2008)
40. Van Den Berg, J.P., Overmars, M.H.: Prioritized motion planning for multiple robots. In: IROS (2005)
41. Vega-Brown, W., Roy, N.: Asymptotically optimal planning under piecewise-analytic constraints. In: WAFR (2016)
42. Wagner, G., Kang, M., Choset, H.: Probabilistic path planning for multiple robots with subdimensional expansion. In: ICRA (2012)
43. Wilfong, G.: Motion planning in the presence of movable obstacles. Ann. Math. Artif. Intell. 3(1), 131–150 (1991)
44. Wilson, R.H., Latombe, J.C.: Geometric reasoning about mechanical assembly. Artif. Intell. J. 71(2), 371–396 (1994)
45. Yu, J., Chung, S.J., Voulgaris, P.G.: Target assignment in robotic networks: distance optimality guarantees and hierarchical strategies. IEEE Trans. Autom. Control 60(2), 327–341 (2015)
46. Yu, J., LaValle, S.M.: Optimal multirobot path planning on graphs: complete algorithms and effective heuristics. IEEE Trans. Robot. 32(5), 1163–1177 (2016)
Optimality, Completeness, and Complexity in Planning
Motion Planning for Multiple Unit-Ball Robots in R^d

Israela Solomon and Dan Halperin

Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv-Yafo, Israel
[email protected]

Abstract. We present a decoupled algorithm for motion planning for a collection of unit-balls moving among polyhedral obstacles in R^d, for any d ≥ 2. We assume that the robots have revolving areas in the vicinity of their start and target positions: revolving areas are regions where robots can maneuver in order to give way to another moving robot. Given that this assumption is fulfilled, the algorithm is complete, namely it is guaranteed to find a solution or report that none exists. A key goal in our design is to make the revolving areas as economical as possible and in particular to allow different revolving areas to overlap. This makes the analysis rather involved but in return makes the algorithm conceptually fairly simple. We show that for the case of m unit-discs moving among polygonal obstacles of total complexity n in R^2, the algorithm can be executed in O(n^2 m + m(m + n) log(m + n)) time. We implemented the algorithm for this case and tested it on several scenarios, for which we show experimental results for up to 1000 robots. Finally, we address the problem of choosing the order of execution of the paths in decoupled algorithms that locally solve interferences and show that finding the optimal order of execution is NP-hard. This motivated us to develop a heuristic for choosing the order; we describe the heuristic and demonstrate its effectiveness in certain scenarios.

Work by D. Halperin has been supported in part by the Israel Science Foundation (grant nos. 825/15, 1736/19), by the Blavatnik Computer Science Research Fund, by the Blavatnik Interdisciplinary Cyber Research Center at Tel Aviv University, and by grants from Yandex and from Facebook.
1 Introduction

The basic motion planning problem in robotics is to produce a continuous path for a robot that connects a given start position and a given target position, while avoiding collisions with given obstacles. The geometric description of the robot and the obstacles is known, and the execution of the path is assumed to be exact. A natural extension of this problem is multi-robot motion planning, in which multiple robots share a common workspace. In this variant, each robot has its own start and target positions, and the goal is to produce a path for each robot with the additional requirement that the paths do not incur collisions between robots.

1.1 Previous Work

The first algorithms for multi-robot motion planning were complete algorithms, guaranteed to find a solution when one exists or report that none exists otherwise.
In the seminal "piano movers" series of papers, Schwartz and Sharir gave the first algorithm for moving multiple general bodies in the plane [13], and another specialized algorithm for moving discs in the plane [14]. Their algorithms are polynomial in the combinatorial complexity of the obstacles, which we denote by n, but exponential in the number of robots (in particular, their running time is O(n^3) for two discs and O(n^13) for three discs). Later on, Yap [28] gave algorithms for two and three discs whose running times are O(n^2) and O(n^3), respectively. Shortly after, Sifrony and Sharir [16] gave another O(n^2) algorithm for two convex translating robots, each of constant descriptive complexity, which is extensible to m robots in O(n^m) time. It has been shown that several elementary variants of the multi-robot problem are computationally intractable [7,9,22], even when we assume that the robots are indistinguishable and any target position can be covered by any robot [19]. We call the latter variant of multi-robot motion planning unlabeled; several positive algorithmic results have been obtained for unlabeled disc robots under simplifying assumptions that the start and target positions have some minimal distance from each other and from the obstacles [1,21,24].

Due to the hardness results, most of the subsequent research was focused on developing heuristic and approximation algorithms. Specifically, the most common approach since then has been sampling-based algorithms [3, Chapter 7], [11]. Sampling-based algorithms try to capture the structure of the configuration space without explicitly constructing it, by sampling random configurations and constructing a roadmap based on these configurations, in which a path can be looked for. These algorithms often provide asymptotic guarantees of completeness and optimality. However, they can only find paths in reasonable time if the paths do not pass through narrow or tight passages (since the probability of sampling points along such passages may become minuscule). Furthermore, without special tailoring to the multi-robot case, their running time is prohibitively large; indeed there are several more efficient variants designed particularly for this case [20,25]. A special and intensively studied variant of multi-robot motion planning is when the robots move on grids; see, e.g., the recent papers [5,29] and the references therein.

Approaches for multi-robot motion planning can be generally classified as coupled or decoupled. Coupled approaches consider all the robots as a single composite robot and compute a path in the combined configuration space, whose dimension is the sum of the numbers of degrees of freedom of the individual robots. Most of the motion planning techniques for a single robot can be applied as-is to the composite robot. When dealing with a large number of robots, the combined configuration space is of high dimension, thus the coupled techniques tend to be slow. In contrast, decoupled techniques find a path for each robot individually (or for small subsets of the robots) and then combine the paths into a joint solution while avoiding collisions between the robots. Approaches for avoiding collisions include adjusting the velocities of the robots (e.g., [4]), resolving collisions by local modifications of the paths (e.g., [6]), and prioritizing the robots and treating robots with higher priority as obstacles (e.g., [2]).
Decoupled algorithms are usually faster (though some of them are heuristics whose running time is not theoretically bounded). However, unless the input is restricted, they are usually not complete, as coupling is sometimes necessary, for example when the target position of each robot is the start position of some other robot.
1.2 Contribution

We present a decoupled algorithm for motion planning for a collection of unit-balls moving among polyhedral obstacles in R^d, for any d ≥ 2. Our algorithm makes a simplifying assumption regarding the input. It requires that for each start or target position z there exists a ball of radius 2, termed a revolving area, that contains the unit-ball centered at z and does not intersect any obstacle or any unit-ball centered at any other start or target position. Given that this assumption is fulfilled, our algorithm is complete, namely, it is guaranteed to find a solution if one exists, or report that none exists otherwise.

Following the methodology of decoupled algorithms, our algorithm first finds a path for each robot, disregarding other robots, and then modifies the paths to avoid collision between the robots. The paths are planned to be executed one after the other. Our simplifying assumption allows the algorithm to temporarily move the robots in the vicinities of their start and target positions, thus avoiding collisions between the robots while making only local modifications to the paths. The algorithm uses several operations as black boxes, for which we suggest a few possible implementations. The running time of the algorithm, as well as the total length and the total combinatorial complexity of the resulting paths, depend on the specific methods used to perform these operations. We show that for the case of m unit-discs moving among polygonal obstacles of total combinatorial complexity n in R^2, the algorithm can be executed in O(n^2 m + m(m + n) log(m + n)) time, when using shortest paths as the initial paths.

The algorithm has been implemented in code for unit-discs moving among polygonal obstacles in the plane. We tested our implementation on several scenarios, for which we show experimental results for up to 1000 robots. The results attest to the efficiency of the algorithm and suggest that the excess total length of the paths over the original paths is small in practice (although we show a theoretical example in which the excess length tends to infinity, as the number of obstacles tends to infinity and their size tends to zero).

Finally, we address the problem of choosing the order of execution of the paths. Since our algorithm first finds an initial path for each robot and then locally handles interferences, the order in which the paths are executed might have a substantial effect on the running time and on the properties of the paths that the algorithm finds. Thus, it is desirable to choose an order that minimizes the number of interferences that the algorithm has to handle. We formalize this problem more generally, for any decoupled algorithm that locally handles interferences, and prove that it is NP-hard to find the optimal order. We also describe a heuristic for choosing the order and demonstrate its effectiveness in certain scenarios.

Remark. Without any simplifying assumptions, the multi-robot motion-planning problem is intractable (see the Previous Work section above). The best known general algorithms have exponential dependence on the total number of degrees of freedom of all the robots. To obtain efficient solutions we need to make some assumptions. The success of sampling-based planners has taught us that clearance in the workspace (the availability of sufficient distance to the obstacles and distance between the moving robots) is a key factor for efficiency in practice. We argue that our revolving-areas assumption keeps the
problem non-trivial. Notice that we do not assume anything about the clearance along the paths of the robots; we only assume that the robots at their start and goal positions have the required revolving areas. Furthermore, our assumptions do not preclude a pair of robots being osculating at their start or target positions, as long as otherwise each of them has a revolving area not containing the other robot. Similarly, we do not preclude start or goal positions with zero clearance from the obstacles. Additionally, we allow revolving areas to overlap. All these together make the problem challenging to analyze and solve.

More Details. Many technical details, proofs, and experiments are omitted from this extended abstract due to lack of space. The details can be found in the full version of the paper [18]. In particular, the reader can find there directions for future work [18, Section 8].
2 Algorithmic Framework

2.1 Preliminaries and Assumptions
We consider the problem of m unit-ball robots moving in a polyhedral workspace W ⊆ R^d. We assume that the dimension d is fixed. We define O := R^d \ W to be the complement of the workspace, and we call O the obstacle space. We denote by n the complexity of the workspace, which is the total number of faces of any dimension of the obstacles. For every x ∈ R^d, let Dx be the open unit-ball centered at x. The set of positions in which a unit-ball robot does not collide with the obstacle space, termed the free space, is F := {x ∈ W : Dx ∩ O = ∅}. Notice that we denote a position of a ball robot by the coordinates of its center. A path is a continuous function γ : [0, 1] → R^d. We say that a path is collision-free if its image is contained in F. For a set X ⊆ R^d, we denote its boundary by ∂X.

Given a set of start positions S = {s1, ..., sm} and a set of target positions T = {t1, ..., tm}, our goal is to plan a collision-free motion for the m unit-ball robots, such that for every 1 ≤ i ≤ m there is one path that starts at si and ends at ti. Formally, for each 1 ≤ i ≤ m we wish to find a path γi : [0, 1] → F, such that γi(0) = si and γi(1) = ti. For every i ≠ j and t ∈ [0, 1], we require that Dγi(t) ∩ Dγj(t) = ∅, so that the robots do not collide with each other.

We assume that each start or target position z has a revolving area in which a robot placed at z can move in order to avoid collision with any other robot. Formally, we assume that for every z, Dz can be placed inside a (not necessarily concentric) open ball Az of radius 2 such that:

– Az ∩ O = ∅; that is, Az does not intersect any obstacle.
– For every start or target position y ≠ z, Az ∩ Dy = ∅.
We call Az the revolving area of z, and denote its center by cz . Note that the revolving areas may intersect each other. We also let Cz := Dcz and denote by Bz the ball of radius 3 centered at cz . See Fig. 1. Allowing the revolving areas to intersect makes the method less restrictive and applicable to a larger set of scenarios than if we had assumed that they do not intersect. Much of the effort in the forthcoming analysis goes to accommodating this requirement. Throughout the paper we use the following useful lemma.
Fig. 1. Illustration of definitions.
Lemma 1. For any two revolving areas, the distance between their centers is at least 2.

The proof appears in the full version of the paper [18]. Lemma 1 implies that any ball of constant radius contains only O(1) start and target positions.

2.2 Algorithm Overview

Our algorithm operates as follows. First, it finds a set of free paths {γ1, . . . , γm}, where γi : [0, 1] → F is the path of robot ri from si to ti, ignoring possible collisions with other robots. Then, it modifies these paths in order to avoid collisions between the robots. The algorithm iterates over the paths in an arbitrary order and, based on each path γi, it builds a new path γ′i from si to ti that does not pass “too close” to centers of revolving areas. When a robot follows the new path, the other robots can be moved inside their revolving areas according to retraction paths, thereby avoiding collisions. In Sect. 2.3 we define the retraction paths and show that they can be used to avoid collisions. In Sect. 2.4 we show how to build the new paths. In Sect. 2.5 we describe the algorithm in detail. In Sect. 3 we suggest possible concrete implementations of the operations the algorithm uses as “black boxes”. In Sect. 4 we analyze the algorithm.

2.3 Retraction

In this section we are concerned with moving a single robot ri along its path γi. We assume that the other robots are positioned at their start or target positions. Without loss of generality, assume that every rj, for j ≠ i, is positioned at its start position sj. During the motion of ri along γi it might interfere (collide) with other robots. We wish to show that such interferences can be avoided by slightly modifying γi and temporarily moving the interfering robots inside the respective revolving areas Asj. More formally, suppose there exist a point x on γi and 1 ≤ j ≠ i ≤ m such that Dx ∩ Asj ≠ ∅ and x ∉ Csj. We show that there exists a point ρsj(x) ∈ Asj to which rj can be moved so that ri does not collide with rj when ri is at x and rj is at ρsj(x). The retraction point ρsj(x) is defined so that several robots that are retracted with respect to x do not collide with each other.

Definition 1. Let x be a point in W and assume that x ≠ csj, for some sj. The retraction point from sj with respect to x is the unique point ρsj(x) on the line through csj and x such that (1) Dρsj(x) ⊆ Asj; (2) Dρsj(x) touches ∂Asj; and (3) csj is between x and ρsj(x) (see Fig. 2).
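Definition 1 pins the retraction point down to a simple closed form: since Asj has radius 2 and the retracted robot is a unit ball, conditions (1)–(3) place ρsj(x) at distance exactly 1 from csj, on the ray from x through csj. A minimal Python sketch (ours, not the authors' implementation):

```python
import math

def retraction_point(c, x):
    """Retraction point rho_{s_j}(x) of Definition 1.

    c : center c_{s_j} of the revolving area A_{s_j} (open ball of radius 2).
    x : current position of the interfering robot r_i, with x != c.

    The unit ball placed at the retraction point must fit inside A_{s_j}
    and touch its boundary, so the point lies at distance exactly 1 from
    c; condition (3) puts c between x and the retraction point, i.e., the
    point is on the ray from x through c.  Works in any dimension d.
    """
    d = math.dist(c, x)
    return tuple(ci + (ci - xi) / d for ci, xi in zip(c, x))

# Example: a revolving area centered at the origin and a robot at (3, 0).
# The retracted robot is parked at (-1, 0), so the two unit balls are
# 4 units apart center-to-center, hence disjoint (cf. Lemma 2 below).
print(retraction_point((0.0, 0.0), (3.0, 0.0)))  # -> (-1.0, 0.0)
```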
In the full version of this paper we prove the following lemmas. First, we show that for any i ≠ j, robot ri placed at x does not collide with robot rj placed at the retraction point from sj with respect to x.
Lemma 2. If x ∉ Csj, then Dx ∩ Dρsj(x) = ∅.

Next, we show that if multiple robots are retracted with respect to x, then, even though their respective revolving areas may intersect, they do not collide with each other.
Fig. 2. The retraction point from sj with respect to x.
Lemma 3. For any j ≠ k and any point x ≠ csj, csk, it holds that Dρsj(x) ∩ Dρsk(x) = ∅.

We extend the definition of a retraction from a single point to a path.

Definition 2. Let γ : I ⊆ [0, 1] → F be a path that does not pass through csj for some sj. The retraction from sj with respect to γ is the path Pγ,sj : I → Asj defined by Pγ,sj(t) := ρsj(γ(t)) for all t ∈ I.

Applying Lemmas 2 and 3 to each point of a retraction path, we conclude that if ri follows γ while rj and rk follow their retraction paths with respect to γ, then they collide neither with each other nor with ri.

Corollary 1. Let γ : I ⊆ [0, 1] → F be a path for robot ri that does not pass through Csj for any j ≠ i. Then for any t ∈ I and any 1 ≤ j ≠ k ≤ m, it holds that ‖γ(t) − Pγ,sj(t)‖ ≥ 2 and that ‖Pγ,sj(t) − Pγ,sk(t)‖ ≥ 2.

Next, we show that a robot rj can be moved from its start position to the beginning of its retraction path with respect to x along the segment between them, where Dx touches but does not intersect the revolving area containing rj. Since during this motion rj is contained in its revolving area, it cannot collide with ri or with other robots that are positioned at their start or target positions. We show that this motion also does not cause a collision with another robot that is retracted with respect to x. The following proposition states that if Dx touches the revolving area of a robot rj and touches or intersects the revolving area of another robot rk, which is retracted with respect to x, then rj can be moved inside its revolving area to the retraction point from sj with respect to x without colliding with rk.

Proposition 1. Let x ∈ W be such that ‖x − csj‖ = 3 and 1 ≤ ‖x − csk‖ ≤ 3. Then for any point u in the segment [ρsj(x), sj], it holds that ‖ρsk(x) − u‖ ≥ 2.

The proof appears in the full version of the paper [18].
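The closed form above also makes Lemmas 2 and 3 easy to probe numerically: ρsj(x) lies at distance ‖x − csj‖ + 1 from x, which is at least 2 whenever x ∉ Csj. A small randomized sanity check (our sketch; it complements, and certainly does not replace, the proofs in [18]):

```python
import math
import random

def retraction(c, x):
    # Distance 1 from c, on the far side of c from x (Definition 1).
    d = math.dist(c, x)
    return tuple(ci + (ci - xi) / d for ci, xi in zip(c, x))

random.seed(0)
c1, c2 = (0.0, 0.0), (2.0, 0.0)  # revolving-area centers, >= 2 apart (Lemma 1)
for _ in range(100_000):
    x = (random.uniform(-8, 8), random.uniform(-8, 8))
    if math.dist(x, c1) >= 1.0:  # x outside C_{s_1}
        assert math.dist(x, retraction(c1, x)) >= 2 - 1e-9   # Lemma 2
    if x != c1 and x != c2:
        r1, r2 = retraction(c1, x), retraction(c2, x)
        assert math.dist(r1, r2) >= 2 - 1e-9                 # Lemma 3
```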
2.4 Constructing a New Path

In this section we show how to modify a path so that a robot following the modified path does not collide with any robot that moves along the retraction paths defined above. Assume we are given a path γ : [0, 1] → F from si to ti for a robot ri. We assume that all the other robots are at their start or target positions, and denote by Z ⊆ S ∪ T the set of start and target positions that are currently occupied by robots other than ri. We show how to construct a new path γ′ : [0, 1] → F from si to ti such that γ′ does not pass through Cz for any z ∈ Z. Notice that this path may still collide with other robots, but this will be resolved later by moving the robots along retraction paths.

For any z ∈ Z and an intersection point p = γ(t) of γ and ∂Cz, we call p an entrance point to Cz if γ(t + ε) ∈ Cz and γ(t − ε) ∉ Cz, or an exit point from Cz if γ(t − ε) ∈ Cz and γ(t + ε) ∉ Cz, for infinitesimally small ε > 0.

First the algorithm finds the intersection points of γ with every ∂Cz for z ∈ Z that are either entrance points or exit points, and sorts all these points according to their order of appearance along γ. The algorithm starts constructing a new path γ′ from s := si. Note that s ∉ Cz for all z ∈ Z, and this invariant holds during the algorithm for every value that is assigned to s. Then the algorithm repeats the following recursively. If the whole sub-path of γ from s to ti does not intersect any Cz for z ∈ Z, then this sub-path is added to γ′, and the algorithm terminates. Otherwise, the algorithm finds the first entrance point to some Cz. Denote this entrance point by pz, and denote the last exit point from Cz by qz. Then the algorithm adds to γ′ the sub-path of γ from s to pz and a geodesic arc of ∂Cz from pz to qz. Afterwards, the algorithm sets s := qz and recursively repeats the process. See Fig. 3.

Fig. 3. Modification of the original path such that the robot moving along the new path does not collide with any of the robots along its way when they move along their respective retraction paths.

Lemma 4. The new path γ′ is free from collision with the obstacles and does not intersect Cz for any z ∈ Z.

The proof appears in the full version of the paper [18].

2.5 The Algorithm

We now describe our algorithm in detail. First, we find a set of paths {γ1, . . . , γm}, where γi : [0, 1] → F is the path of robot ri from si to ti, avoiding the polyhedral obstacles but ignoring the other robots. The algorithm iterates over the paths, and for each i it builds a new path γ′i for robot ri while assuming that rj is at its start position
for all j > i, or at its target position, for all j < i. We denote by Zi := {tj | 1 ≤ j < i} ∪ {sj | i < j ≤ m} the set of start and target positions that are occupied during the motion of ri. The new path γ′i from si to ti is built as described in Sect. 2.4. By Lemma 4, γ′i is a path that is free from collision with the obstacles and does not intersect Cz for any z ∈ Zi.

We find a reparametrization of γ′i such that whenever ri follows the path and is about to enter (or leave) a revolving area containing another robot, ri stays at its position while the other robot moves to (or from) its retraction path. Formally, for each path γ′i, the algorithm finds a reparametrization γ̄i = γ′i ◦ fi, where fi : [0, 1] → [0, 1] is a continuous non-decreasing function which satisfies the following. Consider any ζ ∈ [0, 1] and any revolving area Az, z ∈ Zi, such that ‖γ′i(ζ) − cz‖ = 3, and ‖γ′i(ζ + ε) − cz‖ < 3 or ‖γ′i(ζ − ε) − cz‖ < 3 for infinitesimally small ε > 0. Then fi⁻¹(ζ) is a segment of the form [ζ′, ζ′ + δ], where δ > 0 is a fixed small constant. This means that as ri follows γ̄i, whenever ri touches ∂Az and is about to intersect Az, or has just stopped intersecting it, there is a time frame of length δ in which ri stays at the same position.

We now define the final paths π1, . . . , πm : [0, m] → F. At first, for each i we set πi([0, i − 1]) := si, πi([i − 1, i]) := γ̄i and πi([i, m]) := ti. Next, retraction paths are added for sub-paths that might incur collision. Roughly, whenever a robot following the path πi is about to enter a revolving area Az that contains a robot rj, the robot rj is moved to the retraction point from z with respect to the position of the robot ri. Then rj follows the retraction path Pπi,z as long as ri intersects Az. When ri leaves Az, rj is moved back to z, its start or target position. Formally, consider any 1 ≤ i ≠ j ≤ m and a maximal time segment (a, b) ⊂ [i − 1, i] in which for all ζ ∈ (a, b) it holds that Dπi(ζ) intersects Az, where Az is the revolving area that contains the robot rj during the time interval [i − 1, i]. Because of the reparametrization, it holds that πi(ζ) = πi(a) for any ζ ∈ [a − δ, a] and that πi(ζ) = πi(b) for any ζ ∈ [b, b + δ]. We change πj to follow the retraction path Pπi,z in this time frame. We set πj((a, b)) := Pπi,z((a, b)), and we set πj([a − δ, a]) and πj([b, b + δ]) to be the segments from z to Pπi,z(a) and from Pπi,z(b) to z, respectively.

In case that, as ri follows γ̄i, it touches and is about to intersect (or stop intersecting) several occupied revolving areas at the same time ζ, the reparametrization is defined such that fi⁻¹(ζ) is a time segment δ̂ whose length is δ times the number of these revolving areas. Then we add retraction paths for all those revolving areas at consecutive time segments of length δ during the time segment δ̂. This means that the robots are moved to the beginning (or from the end) of their retraction paths one after the other, while ri stays at its position.

Putting it all together, by Corollary 1 and Proposition 1 the paths π1, . . . , πm are free and do not incur collision between the robots.
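To make the timeline concrete, the following small sketch (ours, not the paper's code) evaluates the basic schedule before the retraction detours are spliced in: robot i waits at si during [0, i − 1], traverses its path during [i − 1, i], and waits at ti during [i, m].

```python
def position(i, paths, tau):
    """Position of robot i (1-based) at global time tau in [0, m].

    paths[i-1] is the (reparametrized) path of robot i, a callable on
    [0, 1] with paths[i-1](0) = s_i and paths[i-1](1) = t_i.  The
    retraction detours of Sect. 2.3, which replace parts of the waiting
    intervals, are omitted in this sketch.
    """
    gamma = paths[i - 1]
    if tau <= i - 1:              # pi_i([0, i-1]) := s_i
        return gamma(0.0)
    if tau >= i:                  # pi_i([i, m]) := t_i
        return gamma(1.0)
    return gamma(tau - (i - 1))   # pi_i([i-1, i]) := the path itself

# Toy usage with straight-line paths in an obstacle-free instance:
line = lambda s, t: (lambda u: tuple(a + u * (b - a) for a, b in zip(s, t)))
paths = [line((0.0, 0.0), (4.0, 0.0)), line((0.0, 3.0), (4.0, 3.0))]
print(position(1, paths, 0.5))  # robot 1 is halfway: (2.0, 0.0)
print(position(2, paths, 0.5))  # robot 2 still waits at (0.0, 3.0)
```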
3 Concrete Methods

There are several operations that we use in the algorithm as “black boxes”: finding revolving areas, finding the original paths for the robots, and computing intersections between paths and spheres. These operations can be implemented in different ways, resulting in different running times and different paths. We suggest a few methods to perform these steps.
3.1 Finding Revolving Areas

General Method. A point can serve as the center of the revolving area of a start or target position z if and only if its distance from z is at most 1 and its distance from O ∪ {Dy}y∈S∪T, y≠z is at least 2. Therefore, for any start or target position z we can find a revolving area (or conclude that no such ball exists) in the following manner. Compute the union of the Minkowski sums of a ball of radius 2 with each obstacle and with each unit-ball Dy centered at a start or target position y ≠ z. Denote the complement of this union by N. Find a point in the intersection of Dz and N; this point can serve as the center of the revolving area of z. If Dz ∩ N is empty, then no such ball exists.

We work out the details for moving discs in R² and moving balls in R³. Recall that n is the complexity of the polyhedral workspace. In the following lemma we use λq(n) to denote the near-linear function related to Davenport-Schinzel sequences, as defined in [15]. The proof appears in the full version of the paper [18].

Lemma 5. After a preprocessing phase whose time complexity is O(m + n log² n) in R² and O(m + n² λq(n) log n) in R³, finding a center for a single revolving area (or reporting that none exists) can be done in O(n) time in R² and in O(n² λq(n)) time in R³, where q is a small constant. Therefore, finding revolving areas for all the 2m start and target positions can be done in O(n log² n + mn) time in R² and in O(n² λq(n) log n + mn² λq(n)) time in R³.

Stronger Assumptions. In case it is known that each start or target position is at distance at least 2 from any obstacle and at distance at least 3 from any other start or target position, then each start or target position can serve as the center of its own revolving area.

3.2 Finding Original Paths

Any method for finding a single path for a ball robot moving among polyhedral obstacles can be used for producing the initial paths. We do not assume anything in particular about the original paths; they may as well be computed by sampling-based algorithms, especially in high dimensions. Notice, however, that the running time of the algorithm and the properties of the resulting paths depend on the properties of the initial paths, as explained in Sect. 4. In Sect. 5 we analyze the case of unit-disc robots in R² where the initial paths are each a shortest path for each robot, ignoring the other robots. In this case, a shortest path can be found in O(n²) time [8].

3.3 Computing Intersections

For each 1 ≤ i ≤ m, we need to compute the intersection points of γi with ∂Cz for all z ∈ Zi, and the intersection points of γ′i with ∂Bz for all z ∈ Zi. Recall that we are interested only in intersection points that are either entrance or exit points (as explained in Sect. 2.4), so even if an arc is contained in a sphere, we should consider only its endpoints.
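As a brief aside on the validity condition of Sect. 3.1: a candidate center c for the revolving area of z must be within distance 1 of z and at distance at least 2 from the obstacle space and from every unit-ball Dy, y ≠ z. A brute-force planar sketch of this check (ours; it represents obstacles by their boundary edges and assumes the candidate lies outside all obstacles):

```python
import math

def dist_point_segment(p, a, b):
    """Euclidean distance from point p to segment ab (2D tuples)."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def is_valid_center(c, z, obstacle_edges, other_positions):
    """Can c serve as the center c_z of the revolving area A_z?

    A_z is the open ball of radius 2 around c: it must contain D_z
    (so ||c - z|| <= 1), stay at distance >= 2 from the obstacle
    space, and avoid the unit disc D_y of every other start/target
    position y (so ||c - y|| >= 2 + 1 = 3).  O(n + m) per candidate,
    unlike the Minkowski-sum machinery of Lemma 5.
    """
    if math.dist(c, z) > 1.0:
        return False
    if any(dist_point_segment(c, a, b) < 2.0 for a, b in obstacle_edges):
        return False
    return all(math.dist(c, y) >= 3.0 for y in other_positions)
```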
Let γ be a simple path and let C be a set of m spheres of constant radius, each concentric with a different revolving area. Denote by k(γ) the combinatorial complexity of γ, namely the number of arcs that γ comprises, and by I(γ, C) the number of intersection points between γ and the spheres in C.

Naive Approach. Compute the intersection points of each arc of γ and each sphere in C. The running time is O(m · k(γ)).

Sweep-Line Algorithm in R². We utilize the sweep-line algorithm by Mairson and Stolfi [12] for detecting red-blue intersections between a set of red arcs and a set of blue arcs. The red arcs are the O(k(γ)) arcs of the path γ, whose relative interiors are pairwise disjoint, since the path is simple. The blue arcs are the arcs of the arrangement of the circles C. Note that the complexity of the arrangement is O(m), since the distance between the centers of any two revolving areas is at least 2 (Lemma 1). Thus, there are O(m) blue arcs, whose relative interiors are also pairwise disjoint. We split each arc (blue or red) into a finite number of x-monotone arcs. Therefore, we can find all the intersections of γ and the circles in C in O((k(γ) + m) log(k(γ) + m) + I(γ, C)) time.

In Sect. 5 we show that for a shortest path γ for a disc robot in R² it holds that I(γ, C) = O(k(γ) + m); since the complexity of a shortest path for a disc robot in R² among polygonal obstacles with a total of n vertices is k(γ) = O(n), the running time of the method in this case is O((m + n) log(m + n)).
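For intuition, in the plane the naive approach boils down to intersecting each segment of a polygonal path with each circle. A minimal sketch (ours; the arcs of γ are taken to be line segments, as is the case for the shortest paths of Sect. 5):

```python
import math

def segment_circle_intersections(a, b, c, r):
    """Parameters t in [0, 1] where segment a + t(b - a) meets the
    circle of center c and radius r: solve ||a + t(b-a) - c||^2 = r^2,
    a quadratic in t.  Transversal crossings correspond to the
    entrance/exit points of Sect. 2.4.
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    fx, fy = a[0] - c[0], a[1] - c[1]
    A = dx * dx + dy * dy
    B = 2 * (fx * dx + fy * dy)
    C = fx * fx + fy * fy - r * r
    disc = B * B - 4 * A * C
    if A == 0 or disc < 0:
        return []
    sq = math.sqrt(disc)
    return [t for t in ((-B - sq) / (2 * A), (-B + sq) / (2 * A)) if 0 <= t <= 1]

def all_intersections(polyline, circles):
    """Naive O(m * k(gamma)) loop over all arcs and all circles."""
    hits = []
    for i in range(len(polyline) - 1):
        for center, radius in circles:
            for t in segment_circle_intersections(polyline[i], polyline[i + 1],
                                                  center, radius):
                hits.append((i, t, center))
    return hits
```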
4 General Analysis

As explained in Sect. 3, there are several steps in the algorithm that can be implemented by different methods, resulting in different running times and different paths. In this section we give a general analysis, in which we use some upper bounds that depend on the properties of the original paths and on the algorithms that are used to perform these operations. Afterwards, in Sect. 5, we analyze the special case of shortest paths in R². As before, we denote by k(γ) the complexity of γ. In addition, we use the following upper bounds:

– F(n, m): the total time it takes to find, for all 1 ≤ i ≤ m, a free path γi from si to ti, disregarding other robots.
– M(n, m): the total time it takes to find revolving areas for all 2m start and target positions.
– K(n, m): an upper bound on the total complexity of all the m paths γi.
– L(n, m): an upper bound on the total length of all the m paths γi.
– Q(n, m): the total time it takes to compute, for all 1 ≤ i ≤ m, the intersection points of γi with ∂Cz and of γ′i with ∂Bz for all z ∈ Zi.
– I(n, m): an upper bound on the total number of intersection points of γi with ∂Cz and of γ′i with ∂Bz, summed over all 1 ≤ i ≤ m and all z ∈ Zi.

Theorem 1. Given a polyhedral workspace of complexity n in R^d, and for each of m unit-ball robots a start position and a target position, the algorithm described above computes free paths for the robots in O(M(n, m) + F(n, m) + Q(n, m) + I(n, m) log(I(n, m)) + K(n, m) + m) time. The total length of the paths is bounded
by O(L(n, m) + I(n, m)), and their total combinatorial complexity is bounded by O(K(n, m) + I(n, m)). The full analysis appears in the full version of the paper [18]. We point out that our framework is “embarrassingly parallel” [26], since each step can be executed for all robots in parallel. See more details in the full version of the paper.
5 The Case of Shortest Paths in R²

We analyze the special case of shortest paths in R². We show upper bounds on the parameters used in the general analysis and obtain a more refined analysis for this case. As mentioned in Sect. 3, in R² we can find a shortest path for a single unit-disc robot moving among polygonal obstacles of complexity n in O(n²) time, and it is well known that the complexity of such a path is O(n). Thus F(n, m) = O(n²m) and K(n, m) = O(nm). By the methods suggested in Sect. 3, all revolving areas can be computed in M(n, m) = O(n log² n + mn) time, and the intersection points between a single shortest path γ and a set of circles C ⊆ {∂Cz, ∂Bz | z ∈ S ∪ T} can be computed in O((n + m) log(n + m) + I(γ, C)) time, where I(γ, C) is the number of intersections.

We now give upper bounds on the number of intersections and the time it takes to compute them. Recall that for each 1 ≤ i ≤ m we compute the intersections between the shortest path γi and ∂Cz for all z ∈ Zi, and the intersections between the modified path γ′i and ∂Bz for all z ∈ Zi. Each γi intersects each ∂Cz at most twice, because revolving areas do not intersect obstacles. Each modified path γ′i is composed of arcs of γi and additionally O(m) circular arcs of the circles ∂Cz. Each such circular arc intersects at most O(1) circles ∂Bz, by Lemma 1. In the next lemma we show that the number of intersections between γi and all circles ∂Bz, z ∈ Zi, is O(n + m). Therefore, the total number of intersections is I(n, m) = O(nm + m²), and they can be computed in Q(n, m) = O(m(n + m) log(n + m)) time.

Lemma 6. Let γ be a shortest path of a unit-disc robot moving among polygonal obstacles, and let Z be a set of m centers of revolving areas. Then the number of intersections between γ and all ∂Bz, z ∈ Z, is O(k(γ) + m).

Proof. Let z ∈ Z. Denote by kz(γ) the number of breakpoints of γ in B6(z), where B6(z) is the disc of radius 6 centered at z. We show that γ intersects ∂Bz at most 4kz(γ) + O(1) times. Each arc of γ can intersect ∂Bz at most twice, so it is sufficient to show that no more than 2kz(γ) + O(1) arcs intersect ∂Bz. At most 2kz(γ) of these arcs have an endpoint in B6(z). Consider the arcs that intersect ∂Bz and both of whose endpoints are not in B6(z). Notice that these arcs must be line segments; denote the set of these segments by U. For any u ∈ U we define Nu := {x ∈ Bz : ‖x − u‖ < 1/2}. Let u1, u2 ∈ U, and assume towards a contradiction that Nu1 ∩ Nu2 ≠ ∅. Let x ∈ Nu1 ∩ Nu2. There exist x1 ∈ u1 and x2 ∈ u2 such that ‖x1 − x‖ < 1/2 and ‖x2 − x‖ < 1/2. Thus ‖x1 − x2‖ < 1, and in particular dist(x1, u2) < 1. Since x1 ∈ F and u2 ⊆ F,
the segment from x1 to u2 that is perpendicular to u2 is contained in F. See Fig. 4. Therefore γ can be shortened using this segment, which is a contradiction. We conclude that Nu1 ∩ Nu2 = ∅ for any u1, u2 ∈ U, and thus |U| ≤ vol(Bz) / min_{u∈U} vol(Nu) ≤ const. Therefore there are at most O(1) such segments, and indeed the overall number of intersections between γ and ∂Bz is at most 4kz(γ) + O(1).

Each breakpoint of γ is at distance smaller than 6 from O(1) centers of revolving areas, since the distance between every two centers of revolving areas is at least 2 (Lemma 1). Thus Σ_{z∈Z} kz(γ) = O(k(γ)), and the total number of intersections between γ and ∂Bz, summed over all z ∈ Z, is Σ_{z∈Z} O(kz(γ) + 1) = O(k(γ) + m).
Fig. 4. Illustration of the proof of Lemma 6. The purple area does not contain obstacles, since x1 ∈ F and u2 ⊆ F. Therefore the segment from x1 to u2 that is perpendicular to u2 (dashed) is free.
We conclude the following theorem.

Theorem 2. The algorithm can compute free paths for m unit-disc robots in R² moving among polygonal obstacles of complexity n in O(n²m + m(m + n) log(m + n)) time. The total length of the paths is bounded by O(L(n, m) + nm + m²), where L(n, m) is the total length of the initial m shortest paths, and their total combinatorial complexity is bounded by O(nm + m²).

5.1 Construction of a Bad Input for Shortest Paths in R²
In the full version of the paper we show an example input for which, when using shortest paths as the original paths, the algorithm computes paths whose total length is Θ(L(n, m) + nm + m²). This matches the upper bound proven in Theorem 2. However, in order to construct this example we had to place many tiny obstacles, whose size tends to zero, and we do not expect this to happen in practical scenarios. See the full version [18, Section 5.1].
6 Finding Optimal Order of Paths Is NP-Hard

Recall that our algorithm first finds a path for each robot, disregarding possible interferences with other robots, and then modifies these paths to avoid collisions. In particular, a path is locally modified when it passes near a start or target position that is occupied by another robot during the motion. Since each such interference has to be handled, the
number of these interferences directly affects the running time, as well as the length and complexity of the modified paths. The order in which the paths are executed might have a major effect on the number of interferences, as a path might pass near the start position of some robot but not near its target position, or vice versa. Therefore, a natural extension of our algorithm would be to find an order of execution of the paths in which the number of interferences is minimal. However, we show that finding such an order is NP-hard.

The hardness result that we now prove applies more generally to any algorithm for motion coordination of many robots that first finds a path for each robot (disregarding other robots) and then handles possible interferences. Therefore we first state the problem in its general form. To simplify notation, we define the problem for unit-balls, though its extension to any type of bounded robots is straightforward.

6.1 Problem Definition and Hardness Proof

We assume that each start or target position z has a neighborhood, which is a bounded subset of R^d that contains the unit-ball Dz, such that a path interferes with z if the path passes through the neighborhood of z. We assume that each neighborhood is bounded by a ball of radius r, for some r that does not depend on the locations of the start and target positions. The specific subset that serves as a neighborhood depends on the algorithm being used. For example, in our algorithm, the neighborhood of a start or target position z is Bz, the ball of radius 3 centered at the center of its revolving area, and r = 3.

Assume that we have a set Γ of m paths, γi from si to ti, for 1 ≤ i ≤ m. Each path might interfere with any start or target position (perhaps several times), and any such interference has to be locally resolved. We wish to find a permutation of the paths such that, if the paths are executed one after the other according to this order, the total number of interferences is minimal.

We define the interference graph of Γ as follows. Let GΓ be a directed graph on m vertices. For any i ≠ j, for each interference of γi with tj, and for each interference of γj with si, we add an edge (i, j) to GΓ. Note that since any path might interfere with any start or target position multiple times, the resulting graph may be a multi-graph. Let σ : {1, . . . , m} → {1, . . . , m} be a permutation. Then the number of interferences that have to be handled when the order of execution is γσ(1), . . . , γσ(m) equals the number of edges (σ(i), σ(j)) such that σ(i) > σ(j). Therefore, finding an order of execution that minimizes the number of interferences is equivalent to finding a minimum feedback arc set [27], namely finding a permutation of the nodes in which the number of edges that have to be removed in order to get a topological sort is minimal. The “minimum feedback arc set” problem is known to be NP-hard [10].

Fig. 5. An example of applying the reduction from the proof of Theorem 3 on a graph. For any vertex of the graph, its path passes through the targets associated with its outgoing neighbors.
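As a small illustration (ours), counting the interferences that a given execution order incurs amounts to counting the edges of GΓ that point backward with respect to that order:

```python
def count_interferences(edges, order):
    """Interferences handled for a given execution order of the paths.

    edges : directed edges (i, j) of the interference graph G_Gamma,
            one per interference of gamma_i with t_j or of gamma_j
            with s_i (1-based path indices; repeats allowed, since
            G_Gamma may be a multi-graph).
    order : a permutation of the path indices, in execution order.

    An edge contributes when it points backward with respect to the
    order, so minimizing this count over all orders is exactly the
    minimum feedback arc set problem, which is NP-hard [10].
    """
    slot = {path: k for k, path in enumerate(order)}
    return sum(1 for i, j in edges if slot[i] > slot[j])

# Toy example: a 2-cycle between paths 1 and 2 makes one interference
# unavoidable under any order.
edges = [(1, 2), (2, 1), (3, 1)]
print(count_interferences(edges, [3, 1, 2]))  # -> 1 (optimal here)
print(count_interferences(edges, [1, 2, 3]))  # -> 2
```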
Theorem 3. Finding an order of execution of the paths in which the number of interferences is minimal is NP-hard.

The complete proof appears in the full version of the paper [18]. The general idea is that, given a simple, directed, unweighted graph, we construct a path for each vertex such that the path corresponding to vertex v passes through the neighborhoods of the target positions corresponding to the outgoing neighbors of v, and does not pass through any other neighborhood. See Fig. 5.
6.2 Heuristic for Determining the Order
In the experiments (Sect. 7), we test a heuristic for determining the order of the paths, which chooses the order according to the topological order of the strongly connected components (SCC) of the interference graph (the details are given in the full version of the paper [18]).
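The full details are in [18], but one plausible reading of ordering by the topological order of the SCCs can be sketched as follows (our illustration; the within-component tie-breaking is an assumption):

```python
from collections import defaultdict

def scc_order(m, edges):
    """Order paths 1..m by the topological order of the strongly
    connected components of the interference graph, where an edge
    (i, j) reads "i should be executed before j".  Kosaraju's
    algorithm; recursion depth is fine for sketch-sized inputs.
    """
    g, gr = defaultdict(list), defaultdict(list)
    for i, j in edges:
        g[i].append(j)
        gr[j].append(i)

    seen, finish = set(), []
    def dfs1(v):
        seen.add(v)
        for w in g[v]:
            if w not in seen:
                dfs1(w)
        finish.append(v)
    for v in range(1, m + 1):
        if v not in seen:
            dfs1(v)

    comp = {}
    def dfs2(v, c):
        comp[v] = c
        for w in gr[v]:
            if w not in comp:
                dfs2(w, c)
    c = 0
    for v in reversed(finish):   # components are discovered in
        if v not in comp:        # topological order of the condensation
            dfs2(v, c)
            c += 1
    return sorted(range(1, m + 1), key=lambda v: comp[v])

print(scc_order(3, [(1, 2), (2, 1), (3, 1)]))  # -> [3, 1, 2]
```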
Fig. 6. Test scenarios: (a) Grid; (b) Triangles; (c) Tunnel - Version I; (d) Tunnel - Version II.
We expect this heuristic to be helpful in scenarios in which the interference graph is “close” to having a topological order, namely, when the number of SCCs is relatively large. Indeed, the heuristic considerably improves the experimental results of the Tunnel scenario (Version II).
7 Experiments

We implemented our algorithm for the case of unit-disc robots in the plane moving amidst polygonal obstacles and tested it on several scenarios.

7.1 Implementation Details
The algorithm was implemented in C++ using CGAL [23] and Boost [17]. The code was tested on a PC with Intel i7-6700HQ 2.60GHz processor with 12 GB of memory, running a Windows 10 64-bit OS.
We chose to implement the operations from Sect. 3 as follows. For finding the original paths, we find for each robot its shortest path (disregarding other robots) using a visibility graph. We use the general method described in Sect. 3.1 to find revolving areas.¹ For each robot, we compute the intersection points of its path with the circles ∂Cz and ∂Bz for every center z of an occupied revolving area, construct the modified path, and construct the retraction paths it incurs by traversing the arrangement of the arcs of the path and the relevant circles. The details are given in the full version of the paper [18]. Excluding the heuristic for choosing the order of the paths, the implementation is deterministic and parameter-free.

7.2 Experiments

We tested our code on the scenarios described below. We report the running times for various numbers of robots. We also report the ratio between the total length of the paths and the total length of the original shortest paths, where the latter is a (usually not tight) lower bound on the total length of the paths in an optimal solution. We call this ratio dist ratio in the tables below. The results (without the heuristic) appear in Tables 1, 2 and 3. The results with the heuristic appear in the full version of the paper [18].

Grid: (See Fig. 6a.) The start and target positions are organized in a grid in a square room, bounded by walls, where the start positions are in the upper half of the square and the target positions are in the lower half. The start and target positions are assigned to the robots by a random permutation. The distance between any two adjacent start or target positions is 3, which is the minimal distance required for the existence of revolving areas.

Triangles: (See Fig. 6b.) All obstacles are triangles, which, as well as the start and target positions, are chosen at random.

Tunnel - Version I: (See Fig. 6c.) In this scenario the robots have to move from the upper part of a narrow winding tunnel to its lower part. For any i < j, the start position of i is closer than the start position of j to the upper end of the tunnel (top-left corner), and the target of i is closer than the target of j to the lower end of the tunnel (bottom-left corner). Therefore i has to bypass j, and the tunnel is designed in a manner that, for any path of i from start to target, it is impossible to move j to a position in which it does not block the path of i. Therefore m(m − 1)/2 interactions between the robots are required, so the total length of the paths has to be greater than the total length of all original paths by Ω(m²).

Tunnel - Version II: (See Fig. 6d.) This scenario is the same as Version I, but the order of the target positions is reversed. Namely, for any i < j, the start position of robot i is closer than the start position of robot j to the upper end of the tunnel, but the target of i is farther than the target of j from the lower end. Therefore, if the algorithm computes paths using the given order of the robots, then m(m − 1)/2 interactions are required, just as in Version I. However, there exists a different order, the reverse order, in which the original paths are free paths, so if the algorithm chooses this order, it finds the optimal solution.
¹ We emphasize that this method is used for all scenarios, even where it is clear that each start or target position can serve as a center for its revolving area.
Table 1. Experimental results for the Grid scenario.

    m      Time (sec)   Dist ratio
    20          1         1.809
    30          1         2.047
    50          3         2.491
    100        10         2.385
    200        37         2.530
    300        75         2.546
    400       134         2.453
    500       203         2.591
    1000      830         2.589

Table 2. Experimental results for the Triangles scenario (in a 100 × 100 square).

           10 triangles             30 triangles             50 triangles
    m      Time (sec)  Dist ratio   Time (sec)  Dist ratio   Time (sec)  Dist ratio
    20          3        1.009          23        1.019          61        1.005
    30          4        1.016          24        1.010          66        1.018
    50          4        1.028          27        1.048          73        1.039
    100         8        1.061          33        1.070          89        1.070
    200        18        1.142          53        1.135         110        1.122
    300        29        1.216          76        1.194         144        1.201
    400        49        1.276         104        1.280         185        1.279
    500        78        1.360         144        1.365         228        1.348
    600       115        1.430         190        1.451         293        1.426
    700       166        1.505         248        1.525         357        1.503
    800       227        1.566         314        1.590         445        1.595
    900       308        1.673         426        1.674         529        1.664
    1000      392        1.755         505        1.761         652        1.739

Table 3. Experimental results for the Tunnel Version I and Tunnel Version II scenarios.

                    Tunnel Version I          Tunnel Version II
    m      n      Time (sec)  Dist ratio   Time (sec)  Dist ratio
    20     242        36        1.243          37        1.487
    30     342        74        1.248          75        1.497
    50     542       197        1.252         203        1.505
    100   1042       783        1.255        1065        1.511
    200   2042      3324        1.257        3373        1.514
7.3 Conclusions
Our experiments show that our algorithm is efficient in practice. All the scenarios that we tested are very dense and of high combinatorial complexity, and we managed to solve them for hundreds of robots in a short amount of time. The Grid and Triangles scenarios are modified versions of scenarios that were tested by Solovey et al. [21]. In their work, they solved the unlabeled version under separation assumptions that are stronger than our assumptions (that is, any input that satisfies their assumptions also satisfies ours). In [21] they reported that they solved the Grid scenario for 40 robots in 311 seconds, and the Triangles scenario for 32 robots and 38 triangles in 6394 seconds. In our experiments, we decreased the distances between the start and target positions and the obstacles to match the minimal requirements of our algorithm, making the scenes as dense as possible, and assigned random labels to the start and
target positions.² For a similar number of robots and triangles as in [21], we solve both the Grid and Triangles scenarios significantly faster, in only a few seconds each.

We provide further analysis of the experimental results in the full version of this paper. Here we mention the highlights: (i) In all the scenarios that we tested, the ratio between the total length of the final paths and the total length of the original shortest paths is a small number, specifically smaller than 3. Although we have shown in Sect. 5.1 an example where the length of the paths increases by a multiplicative factor of Θ(n), it requires the addition of tiny obstacles, whose size tends to zero as n tends to infinity, and it seems highly unlikely to occur in practice. (ii) The heuristic we tested for determining the order of the paths has, as expected, the most effect in Version II of the Tunnel scenario.

² Note that our algorithm can always be applied to unlabeled scenarios by assigning random labels. We refer the reader to the full version of the paper, where we discuss a possible modification of our algorithm toward the unlabeled case [18, Section 8].
References

1. Adler, A., de Berg, M., Halperin, D., Solovey, K.: Efficient multi-robot motion planning for unlabeled discs in simple polygons. TASE 12(4), 1309–1317 (2015)
2. Čáp, M., Novák, P., Selecký, M., Faigl, J., Vokřínek, J.: Asynchronous decentralized prioritized planning for coordination in multi-robot system. In: IROS, pp. 3822–3829 (2013)
3. Choset, H., et al.: Principles of Robot Motion: Theory, Algorithms, and Implementation. MIT Press, Cambridge (2005)
4. Cui, R., Gao, B., Guo, J.: Pareto-optimal coordination of multiple robots with safety guarantees. Auton. Rob. 32(3), 189–205 (2012)
5. Demaine, E.D., Fekete, S.P., Keldenich, P., Scheffer, C., Meijer, H.: Coordinated motion planning: reconfiguring a swarm of labeled robots with bounded stretch. In: SoCG, pp. 29:1–29:15 (2018)
6. Freund, E., Hoyer, H.: Real-time pathfinding in multirobot systems including obstacle avoidance. IJRR 7(1), 42–70 (1988)
7. Hearn, R.A., Demaine, E.D.: PSPACE-completeness of sliding-block puzzles and other problems through the nondeterministic constraint logic model of computation. Theoret. Comput. Sci. 343(1–2), 72–96 (2005)
8. Hershberger, J., Guibas, L.J.: An O(n²) shortest path algorithm for a non-rotating convex body. J. Algorithms 9(1), 18–46 (1988)
9. Hopcroft, J.E., Schwartz, J.T., Sharir, M.: On the complexity of motion planning for multiple independent objects; PSPACE-hardness of the “warehouseman’s problem”. IJRR 3(4), 76–88 (1984)
10. Karp, R.M.: Reducibility among combinatorial problems. In: Miller, R.E., Thatcher, J.W., Bohlinger, J.D. (eds.) Complexity of Computer Computations, pp. 85–103. Springer, Boston (1972)
11. LaValle, S.M.: Planning Algorithms. Cambridge University Press, Cambridge (2006)
12. Mairson, H.G., Stolfi, J.: Reporting and counting intersections between two sets of line segments. In: Earnshaw, R.A. (ed.) Theoretical Foundations of Computer Graphics and CAD, pp. 307–325. Springer, Heidelberg (1988)
13. Schwartz, J.T., Sharir, M.: On the “piano movers” problem. II. General techniques for computing topological properties of real algebraic manifolds. Adv. Appl. Math. 4(3), 298–351 (1983)
14. Schwartz, J.T., Sharir, M.: On the “piano movers” problem: III. Coordinating the motion of several independent bodies: the special case of circular bodies moving amidst polygonal barriers. IJRR 2(3), 46–75 (1983)
15. Sharir, M., Agarwal, P.K.: Davenport-Schinzel Sequences and Their Geometric Applications. Cambridge University Press, Cambridge (1995)
16. Sharir, M., Sifrony, S.: Coordinated motion planning for two independent robots. Ann. Math. Artif. Intell. 3(1), 107–130 (1991)
17. Siek, J.G., Lee, L.Q., Lumsdaine, A.: The Boost Graph Library User Guide and Reference Manual. Addison-Wesley, Boston (2002)
18. Solomon, I., Halperin, D.: Motion planning for multiple unit-ball robots in R^d. ArXiv e-prints (2018). http://arxiv.org/abs/1807.05428
19. Solovey, K., Halperin, D.: On the hardness of unlabeled multi-robot motion planning. IJRR 35(14), 1750–1759 (2016)
20. Solovey, K., Salzman, O., Halperin, D.: Finding a needle in an exponential haystack: discrete RRT for exploration of implicit roadmaps in multi-robot motion planning. IJRR 35(5), 501–513 (2016)
21. Solovey, K., Yu, J., Zamir, O., Halperin, D.: Motion planning for unlabeled discs with optimality guarantees. In: RSS, Rome, Italy (2015)
22. Spirakis, P., Yap, C.K.: Strong NP-hardness of moving many discs. Inf. Process. Lett. 19(1), 55–59 (1984)
23. The CGAL Project: CGAL User and Reference Manual. CGAL Editorial Board (2018)
24. Turpin, M., Michael, N., Kumar, V.: Concurrent assignment and planning of trajectories for large teams of interchangeable robots. In: ICRA, pp. 842–848 (2013)
25. Wagner, G., Choset, H.: Subdimensional expansion for multirobot path planning. Artif. Intell. 219, 1–24 (2015)
26. Wikipedia: Embarrassingly parallel — Wikipedia, the free encyclopedia
27. Wikipedia: Feedback arc set — Wikipedia, the free encyclopedia
28. Yap, C.K.: Coordinating the motion of several discs. Robotics Report no. 16, Department of Computer Science, New York University (1984)
29. Yu, J.: Constant factor time optimal multi-robot routing on high-dimensional grids. In: RSS (2018)
Coordinating the Motion of Labeled Discs with Optimality Guarantees under Extreme Density

Rupesh Chinta¹, Shuai D. Han², and Jingjin Yu²

¹ Department of Electrical and Computer Engineering, Rutgers University, New Brunswick, USA
[email protected]
² Department of Computer Science, Rutgers University, New Brunswick, USA
{shuai.han,jingjin.yu}@rutgers.edu
Abstract. We push the limit in planning collision-free motions for routing uniform labeled discs in two dimensions. First, from a theoretical perspective, we show that the constant-factor time-optimal routing of labeled discs can be achieved using a polynomial-time algorithm with robot density over 50% in the limit (i.e., over half of the workspace may be occupied by the discs). Second, from a more practical standpoint, we provide a high performance algorithm that computes near-optimal (e.g., 1.x) solutions under the same density setting.
1 Introduction
The routing of rigid bodies (e.g., mobile robots) to desired destinations under dense settings (i.e., many rigid bodies in a confined workspace) is a challenging yet high-utility task. On the side of computational complexity, when it comes to feasibility (i.e., finding collision-free paths for moving the bodies without considering path optimality), it is well known that coordinating the motion of translating rectangles is PSPACE-hard [8], whereas planning for moving labeled discs of variable sizes is strongly NP-hard in simple polygons [15]. More recently, it was further established that PSPACE-hardness extends to the unlabeled case as well [13]. Since computing an arbitrary solution is already difficult under these circumstances, finding optimal paths (e.g., minimizing the task completion time or the distances traveled by the bodies) is at least equally hard. Taking a closer look at the proof constructions in [8,13,15], one readily observes that the computational difficulty increases as the bodies are packed more tightly in the workspace. On the other hand, in many multi-robot applications, it is desirable to have the capacity for many robots to efficiently and (near-)optimally navigate closely among each other, e.g., in automated warehouses [3,19]. Provided that per-robot efficiency and safety are not compromised, having higher robot density directly results in space and energy¹ savings, thus enhancing productivity.
¹ With higher robot density, a fixed number of robots can fit in a smaller workspace, reducing the distance traveled by the robots.
As a difficult but intriguing geometric problem, the optimal routing of rigid bodies has received much attention in many research fields, particularly robotics. While earlier research in the area tends to focus on structural properties and complete (though not necessarily scalable) algorithmic strategies [4,5,10–12], more recent studies have generally attempted to provide efficient and scalable algorithms with either provable optimality guarantees or impressive empirical results, or both. For the unlabeled case, a polynomial-time algorithm from [17] computes trajectories for uniform discs which minimize the maximal path length traveled by any disc. The completeness of the algorithm depends on some clearance assumptions between the discs and between a disc and the environment. In [14], a polynomial-time complete algorithm is also proposed for unlabeled discs that optimizes the total travel distance, with a more natural clearance assumption compared to [17]. The clearance assumption (among others, the distance between two unit discs is at least 4) translates to a maximum density of about 23%, i.e., the discs may occupy at most 23% of the available free space. For the labeled case, under similar clearance settings, an integer linear programming (ILP) based method is provided in [22] for minimizing solution makespan. Though without a polynomial running-time guarantee, the algorithm is complete and appears to perform well in practice. Complete polynomial-time algorithms also exist that do not require any clearance in the start and goal configurations [7]. However, the supported density is actually lower in this case, as the algorithm needs to expand the start and goal configurations so that the clearance conditions in [22] are satisfied.

In this work, we study the problem of optimally routing labeled uniform unit discs in a bounded continuous two-dimensional workspace. As the main result, we provide a complete, deterministic, and polynomial-time algorithm that allows more than half of the workspace to be occupied by the discs while simultaneously ensuring O(1) (i.e., constant-factor) time optimality of the computed paths. We also provide a practical and fast algorithm for the same setting, without the polynomial running-time guarantee. More concretely, our study brings the following contributions: (i) We show, using a computer-based proof, that when the distance between the centers of any two labeled unit discs is more than 8/3, the continuous problem can be transformed into a multi-robot routing problem on a triangular grid graph with minimal optimality loss. A separation of 8/3 implies a maximum density of over 50%. (ii) We develop a low polynomial-time constant-factor time-optimal algorithm for routing discs on a triangular grid, with the constraint that no two discs may travel on the same triangle concurrently. The algorithm has a similar computation time as [7]: it can solve problem instances with thousands of robots. (iii) We develop a fast and novel integer linear programming (ILP) based algorithm that computes time-optimal routing plans for the triangular grid-based multi-robot routing problem. Combining (i) and (ii) yields the O(1) time-optimal algorithm, while combining (i) and (iii) results in the more practical and highly optimal algorithm. In addition, the 8/3 separation proof employs both geometric arguments and computation-based verification, which may be of independent interest.
Our work leans on graph-theoretic methods for multi-robot routing, e.g., [1,2,9,16,18,20,21]. In particular, our constant-factor time-optimal routing algorithm for triangular grids adapts a powerful routing method for rectangular grids from [20] that actually works in arbitrary dimensions. However, while the method from [20] comes with strong theoretical guarantees and runs in low polynomial time, the produced paths are not ideal due to the large constant factor. This prompts us to also look at more practical algorithms, and we choose to build on the fast ILP-based method from [21], which allows us to properly encode the additional constraints induced by the triangular grid, i.e., that no two discs may simultaneously travel along any triangle.

Organization. The rest of the paper is organized as follows. We provide a formal statement of the routing problem and its initial treatment in Sect. 2. In Sect. 3, we show how the problem may be transformed into a discrete one on a special triangular grid. Then, in Sects. 4 and 5, we present a polynomial-time algorithm with O(1)-optimality guarantee and a fast algorithm that computes highly optimal solutions, respectively. We conclude in Sect. 6.
2 Preliminaries

2.1 Labeled Disc Routing: Problem Statement
Let W denote a closed and bounded w × h rectangular region. For technical convenience, we assume w = 4n1 + 2 and h = (4/√3)n2 + 2 for integers n1 ≥ 2 and n2 ≥ 3. There are n labeled unit discs residing in W. Also for technical reasons, we assume that the discs are open, i.e., two discs are not in collision when their centers are exactly distance two apart. These discs may move in any direction with an instantaneous velocity v satisfying |v| ∈ [0, 1]. Let Cf ⊂ R² denote the free configuration space for a single robot in W. The centers of the n discs are initially located at S = {s1, . . . , sn} ⊂ Cf, with goals G = {g1, . . . , gn} ⊂ Cf. For all 1 ≤ i ≤ n, a disc labeled i initially located at si must move to gi. Besides planning collision-free paths, we want to optimize the resulting path quality by minimizing the global task completion time, also commonly known as the makespan. Let P = {p1, . . . , pn} denote a set of feasible paths, with each pi a continuous function defined as

    pi : [0, tf] → Cf,   pi(0) = si,   pi(tf) = gi;    (1)

the makespan objective seeks a solution that minimizes tf. That is, denoting P as the set of all feasible solution path sets, the task is to find a set P with tf(P) approaching the optimal solution

    tmin := min_{P ∈ P} tf(P).    (2)
Positive separation between the labeled discs is necessary to render the problem feasible (regardless of optimality). In this work, we require the following
clearance condition between a pair of si and sj and a pair of gi and gj:

    ∀ 1 ≤ i, j ≤ n, i ≠ j:   ‖si − sj‖ > 8/3,   ‖gi − gj‖ > 8/3.    (3)
For notational convenience, we denote the problem addressed in this work as the Optimal Labeled Disc Routing problem (OLDR). By assumption (3), and assuming that the unit discs occupy the vertices of a regular triangular grid, the discs may occupy ((1/2)·π·1²) / ((1/2)·(8/3)·(4/√3)) ≈ 51% of the free space in the limit (to see that this is the case, note that each equilateral triangle with side length 8/3 contains half of a unit disc: each corner of the triangle contains 1/6 of a disc).
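A quick numerical verification of the 51% figure (our sketch):

```python
import math

# Each equilateral triangle of side 8/3 carries half a unit disc:
# one sixth of a disc at each of its three corners.
triangle_area = 0.5 * (8 / 3) * (4 / math.sqrt(3))  # the triangle's height is 4/sqrt(3)
half_disc_area = math.pi / 2
print(half_disc_area / triangle_area)  # -> 0.5101..., i.e., just over 51%
```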
2.2 Workspace Discretization
Similar to [4,11,12,22], we approach OLDR through first discretizing the problem, starting by embedding a discrete graph within W. The assumption of w = 4n1 + 2 and h = (4/√3)n2 + 2 on the workspace dimensions allows the embedding of a triangular grid with side length 4/√3 in W, such that the grid has 2n1 columns and about n2 (zigzagging) rows of equilateral triangles, and a clearance of 1 from ∂W. An example is provided in Fig. 1.
Fig. 1. An example of a workspace W with w = 14 and h = 3·(4/√3) + 2, i.e., n1 = 3 and n2 = 3. The embedded triangular grid is at least distance 1 from the boundary of W. The grid has 6 columns and 2+ zigzagging rows.
Throughout the paper, we denote the underlying graph of the triangular grid as G. Henceforth, we assume such a triangular grid G for a given workspace W. The choice of the side length 4/√3 for the triangular grid ensures that two unit discs located on adjacent vertices of G may move simultaneously on G without collision when the angle formed by the two traveled edges is not sharp (Fig. 2). We note that w ≥ 10 and h ≥ 3·(4/√3) + 2 are needed for our algorithm to have completeness and optimality guarantees. For smaller w or h, an instance may not be solvable. On the other hand, the discrete-increment assumptions on w and h are for technical convenience and are not strictly necessary.
Fig. 2. On a triangular grid with a side length of 4/√3, two unit discs, initially residing on two adjacent vertices of the grid, may travel concurrently on the grid without collision when the two trajectories do not form a sharp angle. In the figure, the two cyan discs may travel as indicated without incurring collision. On the other hand, the red discs will collide if they follow the indicated travel directions.
Without these discrete-increment assumptions, we would need additional (and more complex) clearance assumptions between the discs and ∂W, which does not affect the 51% density bound, since ∂W contributes Θ(w + h) to the area of W, which is wh; the ratio Θ((w + h)/(wh)) goes to zero as both w and h increase. We also mention that although this study only considers a bounded rectangular workspace without static obstacles, our results can be directly combined with [22] to support static obstacles.
3 Translating Continuous Problems to Discrete Problems with Minimal Penalty on Optimality
A key insight enabling this work is that, under the separation condition (3), a continuous OLDR can be translated into a discrete one with little optimality penalty. The algorithm for achieving this is relatively simple. For a given W and the corresponding G = (V, E) embedded in W, for each si ∈ S, let vis ∈ V be a vertex of G that is closest to si (if there is more than one such vertex, pick an arbitrary candidate). After all vis's (let VS = {vis}) are identified for 1 ≤ i ≤ n, let dmax = maxi ‖vis − si‖. Note that dmax ≤ 4/3. We then let the labeled disc at si move in a straight line to the corresponding vis at the constant velocity (vis − si)/dmax, which means that for all 1 ≤ i ≤ n, disc i reaches vis in exactly one unit of time. The same procedure is then applied to G to obtain VG = {vig}. The discrete OLDR is fully defined by (G, VS, VG). We denote the algorithm as DiscretizeOLDR. Figure 3 illustrates the assignment of a few unit discs to vertices of the triangular grid.

Because it takes a constant amount of computational effort to deal with one disc, DiscretizeOLDR runs in linear time, i.e.,

Proposition 1. DiscretizeOLDR has a running time of O(n).

The rest of this section is devoted to showing that DiscretizeOLDR is collision-free and incurs little penalty on time optimality. We only need to show this for translating S to VS; translating G to VG is a symmetric operation.
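For concreteness, here is a small sketch (ours) of the snapping step; the grid-generation offsets are illustrative rather than the paper's exact embedding:

```python
import math

SIDE = 4 / math.sqrt(3)  # side length of the triangular grid

def grid_vertices(cols, rows):
    """Vertices of a triangular grid: consecutive rows are 2 apart
    (= SIDE * sqrt(3) / 2) and odd rows are shifted by half a side."""
    return [((c + 0.5 * (r % 2)) * SIDE, 2.0 * r)
            for r in range(rows + 1) for c in range(cols + 1)]

def discretize(starts, vertices):
    """Map each s_i to a closest grid vertex v_i^s and report
    d_max = max_i ||v_i^s - s_i|| (at most 4/3; see Lemma 1's proof).
    Each disc then reaches its vertex in exactly one unit of time.
    """
    assigned = [min(vertices, key=lambda v: math.dist(v, s)) for s in starts]
    d_max = max(math.dist(v, s) for v, s in zip(assigned, starts))
    return assigned, d_max
```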
Fig. 3. An illustration of assigning a few unit discs to vertices of the triangular grid.
We first make the straightforward observation that DiscretizeOLDR adds a makespan penalty of up to four: translating S to VS takes exactly one unit of time, and the same holds for translating G to VG.

Proposition 2. DiscretizeOLDR incurs a makespan penalty of up to four.

We then show that DiscretizeOLDR assigns a unique vis ∈ V for a given si ∈ S.

Lemma 1. DiscretizeOLDR assigns a unique vis ∈ V for an si ∈ S.

Proof. Each equilateral triangle in G has a side length of 4/√3, which means that the distance from the center of a triangle to its vertices is 4/3. Therefore, any si ∈ S is at distance at most 4/3 from at least one vertex of G. Let this vertex be vis. Now, given any other sj ∈ S, assume DiscretizeOLDR assigns to it vjs. We argue that vis ≠ vjs, because otherwise

    4/3 + 4/3 ≥ ‖si − vis‖ + ‖sj − vjs‖ = ‖si − vis‖ + ‖sj − vis‖ ≥ ‖si − sj‖ > 8/3,

which is a contradiction. Here, the first ≥ holds because ‖si − vis‖ ≤ 4/3 and ‖sj − vjs‖ ≤ 4/3 by DiscretizeOLDR; the second ≥ is due to the triangle inequality. The > is due to assumption (3).
Next, we establish that DiscretizeOLDR is collision-free. For the proof, we use geometric arguments assisted with computation-based case analysis.

Theorem 1. DiscretizeOLDR guarantees collision-free motion of the discs.

Proof. We fix a vertex v ∈ V of the triangular grid G. By Lemma 1, at most one si ∈ S may be matched with v, in which case v becomes vis. If this is the case, then si must be located within one of the six equilateral triangles surrounding v. Assume without loss of generality that si belongs to an equilateral triangle △uvw, as shown in Fig. 4(a). The rules of DiscretizeOLDR further imply that si must fall within one (e.g., the orange shaded triangle in Fig. 4(a)) of the six triangles belonging to △uvw that are formed by the three bisectors of △uvw. Let this triangle be △vox. Now, let sj ≠ si be the center of a labeled disc j, and assume that disc j goes to some vjs ∈ V.
w
u
o si
x
si vis
vjs
sj
v(vis ) (a)
(b)
Fig. 4. (a) By symmetry, for an si ∈ S to be moved to some v = vis, we only need to consider the region vox, which is 1/12-th of all possible places where si may appear. (b) For a fixed si, we only need to consider sj at exactly distance 8/3 from it.
i with si ∈ vox and an arbitrary disc j with ‖si − sj‖ > 8/3 will not collide with each other as disc i and disc j move along si vis and sj vjs, respectively, then DiscretizeOLDR is a collision-free procedure. We then observe that if disc i and disc j collide as we align their centers to vertices of the triangular grid, then the distance between their centers must be exactly 8/3 at some point before they collide (when their centers are at distance less than 2). Following this reasoning, instead of showing that a disc j with ‖si − sj‖ > 8/3 will not collide with disc i, it suffices to show the same only for ‖si − sj‖ = 8/3. That is, for any si ∈ vox and any sj on a circle of radius 8/3 centered at si, disc i and disc j will not collide as si and sj move to vis and vjs, respectively, according to the rules specified by DiscretizeOLDR (see Fig. 4(b) for an illustration).

To proceed from here, one may attempt a direct case-by-case geometric analysis, which appears to be quite tedious. We instead opt for a more direct computer-assisted proof, as follows. We first partition vox using axis-aligned square grids with side length ε; ε is some parameter to be determined through computation. For each of the resulting ε × ε square regions (a small green square in Fig. 5), we assume that si is at its center. For each fixed si, an annulus centered at si with inner radius 8/3 − √2ε/2 and outer radius 8/3 + √2ε/2 is obtained (part of which is illustrated in Fig. 5). Given this construction, for any potential si in a fixed ε × ε square, a circle of radius 8/3 around it falls within the annulus. We then divide the outer perimeter of the annulus into arcs of length no more than √2ε. For each piece, we obtain a roughly square region with side length √2ε on the annulus, one of which is shown as the red square in Fig. 5. We take the center of the square as sj. We then fix vjs accordingly (note that it may be the case that vjs is not unique for the square that bounds a piece of arc, in which case we attempt all potentially valid vjs's). For each fixed set of si, vis, sj, and vjs, following the rules of DiscretizeOLDR, we may (analytically) compute the shortest distance between the centers of disc i and disc j as disc i is moved from si to vis while disc j is
Fig. 5. Illustration of picking a pair of si and sj for the computer-based proof.
moved from sj to vjs. Let the trajectories followed by the two centers in this case be τi(t) and τj(t), respectively, with 0 ≤ t ≤ 1 (as guaranteed by DiscretizeOLDR). We may express the clearance (for fixed ε, si, vis, sj, and vjs) as δε(si, sj) = mint ‖τi(t) − τj(t)‖ − 2. For any s'i that falls in the same ε × ε box as si, if disc i is initially located at s'i, let it follow a trajectory τ'i(t) to vis. We observe that ‖τ'i(t) − τi(t)‖ ≤ ε. This holds because, as the center of disc i moves from anywhere within the ε × ε box to vis, ‖τ'i(t) − τi(t)‖ continuously decreases until it becomes zero at vis, which is the same for both si and s'i. Therefore, the uncertainty is largest initially, and it is no more than ε because ‖s'i − si‖ ≤ √2ε/2. The same argument applies to disc j, i.e., ‖τ'j(t) − τj(t)‖ ≤ ε. Therefore, we have

δε(si, sj) + 2 = mint ‖τi(t) − τj(t)‖
             = mint ‖(τi(t) − τ'i(t)) + (τ'j(t) − τj(t)) + (τ'i(t) − τ'j(t))‖
             ≤ mint (‖τi(t) − τ'i(t)‖ + ‖τ'j(t) − τj(t)‖ + ‖τ'i(t) − τ'j(t)‖)
             ≤ 2ε + mint ‖τ'i(t) − τ'j(t)‖
             = 2ε + δε(s'i, s'j) + 2.

If δε(si, sj) > 2ε, then we may conclude that δε(s'i, s'j) > 0. We verify this using a Python program that computes the minimum δε(si, sj) over all possible choices of si and sj given a small ε. When ε = 0.025, we obtain that δε(si, sj) is lower bounded at approximately 0.076, which is larger than 2ε = 0.05. Therefore, DiscretizeOLDR is a collision-free algorithm.
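A compact sketch of such a verification loop is shown below. All names (min_clearance, verify, closest_vertex) are illustrative, not the authors' code, and the paper's analytic distance computation is replaced here by dense time sampling under the same constant-speed straight-line motion model.

```python
import math

EPS = 0.025
SEP = 8.0 / 3.0

def min_clearance(si, vi, sj, vj, steps=2000):
    """Approximate min_t ||tau_i(t) - tau_j(t)|| - 2 for two discs moving
    in straight lines, both parameterized on t in [0, 1]."""
    best = float("inf")
    for k in range(steps + 1):
        t = k / steps
        ix, iy = si[0] + t * (vi[0] - si[0]), si[1] + t * (vi[1] - si[1])
        jx, jy = sj[0] + t * (vj[0] - sj[0]), sj[1] + t * (vj[1] - sj[1])
        best = min(best, math.hypot(ix - jx, iy - jy) - 2.0)
    return best

def verify(cells_in_vox, closest_vertex):
    """cells_in_vox: centers of the eps x eps cells covering triangle vox.
    closest_vertex: hypothetical helper mapping a point to its grid vertex."""
    lower_bound = float("inf")
    for si in cells_in_vox:
        vi = closest_vertex(si)
        # sample sj around si at separation 8/3, arc step <= sqrt(2)*EPS
        n = math.ceil(2 * math.pi * (SEP + EPS) / (math.sqrt(2) * EPS))
        for k in range(n):
            a = 2 * math.pi * k / n
            sj = (si[0] + SEP * math.cos(a), si[1] + SEP * math.sin(a))
            lower_bound = min(lower_bound,
                              min_clearance(si, vi, sj, closest_vertex(sj)))
    return lower_bound  # the proof requires this to exceed 2 * EPS
```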
With DiscretizeOLDR in place, in Sect. 4 and Sect. 5 we assume a discrete multi-robot routing problem is given as a 3-tuple (G, VS, VG), in which G is the unique triangular grid embedded in W. Also, VS, VG ⊂ V and |VS| = |VG| = n.
4 Constant-Factor Time-Optimal Multi-robot Routing on Triangular Grids
In [20], it is established that constant-factor makespan time-optimal solutions can be computed in quadratic running time on a k-dimensional orthogonal grid G for an arbitrary fixed k. It is a surprising result that applies even when n = |V|, i.e., when there is a robot or disc on every vertex of grid G. The functioning of the algorithm, PaF (standing for partition and flow), requires putting together many algorithmic techniques. However, the key requirements of PaF hinge on three basic operations, which we summarize here for the case of k = 2. Due to limited space, only limited details are provided.

First, to support the case of n = |V| while ensuring the desired optimality, it must be possible to "swap" two adjacent discs in a constant-sized neighborhood in a constant number of steps (i.e., makespan), as illustrated in Fig. 6. This operation is essential in ensuring makespan time optimality, as the locality of the operation allows many such operations to be carried out concurrently. Note that a random problem on G is always feasible if G satisfies this requirement, because any two robots can exchange their locations without affecting the other robots by using multiple "swap" actions.
Fig. 6. Discs 2 and 3 may be “swapped” in three steps on a 3 × 2 grid, implying that any two discs can be swapped in O(1) steps without net effect on other discs.
Second, it must be possible to iteratively split the initial problem into smaller sub-problems. This is achieved using a grouping operation that in turn depends on the swap operation. We illustrate the idea using an example. In Fig. 7(a), an 8 × 4 grid is split in the middle into two smaller grids. Each vertex is occupied by a disc; we omit the individual labels. The lightly (cyan) shaded discs have goals on the right 4 × 4 grid. The grouping operation moves the 7 lightly shaded discs to the right, which also forces the 7 darker shaded discs on the right to the left side. This is achieved through multiple rounds of concurrent swap operations, either along horizontal lines or vertical lines. The result is Fig. 7(b). This effectively reduces the initial problem (G, VS, VG) to two disjoint sub-problems. Repeating the iterative process can actually solve the problem completely, but it does not always guarantee constant-factor makespan time optimality in the worst case. This is referred to as the iSaG algorithm in [20].

Lastly, PaF achieves guaranteed constant-factor optimality using iSaG as a subroutine. It begins by computing the maximum distance between any pair of vis ∈ VS and vig ∈ VG over all 1 ≤ i ≤ n. Let this distance be dg. G is then partitioned into square grid cells of size roughly 5dg × 5dg each. With this
Fig. 7. Illustration of an iteration of the iSaG algorithm.
partition, a disc must have its goal in the same cell it is in or in a neighboring cell. After some pre-processing using iSaG, the discs that need to cross cell boundaries can be arranged to be near the destination cell boundary. At this point, multiple global circulations (a circulation may be interpreted as discs rotating synchronously on a cycle on G) are arranged so that every disc ends up in a 5dg × 5dg cell partition where its goal also resides. A rough illustration of the global circulation concept is provided in Fig. 8. Then, a last round of iSaG is invoked at the cell level to solve the problem, which yields a constant-factor time-optimal solution even in the worst case.
Fig. 8. Illustration of a single global circulation constructed and executed by PaF (the discs and the underlying grid cells are not fully drawn). (a) In six partitioned (5dg × 5dg ) cells numbered 1 − 6 in G, there are six labeled discs with goals in the correspondingly numbered cells, e.g., disc 1 should be in cell 1. (b) Using iSaG in each cell, discs 1 − 6 are moved to boundary areas and a cycle is formed on G for robot routing. (c) Moving all discs on the cycle by one edge synchronously, all discs are now in the desired cell; no other discs (not shown) have crossed any cell boundary.
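To make the circulation step of Fig. 8(c) concrete, the toy sketch below (illustrative names only, not the authors' implementation) advances every disc on a closed cycle by one edge in a single synchronous step; because all discs move simultaneously, no vertex on the cycle is ever doubly occupied.

```python
def circulate(cycle, occupant):
    """Synchronously rotate discs one edge along a closed cycle.

    cycle: vertices of a closed path on G, in traversal order.
    occupant: dict mapping a vertex on the cycle to the disc label there.
    Returns the occupancy after one synchronous circulation step.
    """
    new_occ = {}
    for idx, v in enumerate(cycle):
        if v in occupant:
            # every disc advances to the next vertex at the same time
            new_occ[cycle[(idx + 1) % len(cycle)]] = occupant[v]
    return new_occ

# Example: discs 1-3 on a hexagonal cycle a-b-c-d-e-f
print(circulate(["a", "b", "c", "d", "e", "f"], {"a": 1, "b": 2, "d": 3}))
# -> {'b': 1, 'c': 2, 'e': 3}
```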
To adapt PaF to the special triangular grid graph G, we need to: (i) identify a constant sized local neighborhood for the swapping operation to work, (ii) identify two “orthogonal” directions that cover G for the iSaG algorithm to work, and (iii) ensure that the constructed global circulation can be executed. Because of the limitation imposed by the triangular grid, i.e., any two edges of a triangle cannot be used at the same time (see Fig. 2), achieving these conditions simultaneously becomes non-trivial. In what follows, we will show how we may simulate PaF on a triangular grid G under the assumption that all vertices of
G are occupied by labeled discs, i.e., n = |V|. For the case of n < |V|, we may treat empty vertices as having "virtual discs" placed on them.

Because two edges of a triangle cannot be simultaneously used, we use two adjacent hexagons on G (e.g., the two red full hexagons in Fig. 9(a)) to simulate the two square cells in Fig. 6. It is straightforward to verify that the swap operation can be carried out using two adjacent hexagons. There is an issue, however, as not all vertices of G can be covered with a single hexagonal grid. For example, the two red hexagons in Fig. 9(a) leave many vertices uncovered. This can be resolved using up to three sets of interweaving hexagonal grids, as illustrated in Fig. 9(a) (here we use the assumption that W has dimensions w ≥ 10 and h ≥ 3 · 4/√3 + 2, which limits the possible embeddings of the triangular grid G). We note that for the particular graph G in Fig. 9(a), we only need the red and the green hexagons to cover all vertices.
Fig. 9. (a) We may use the red, green, and cyan hexagon grids on G to perform the swap operation. (b) The red and green paths may serve as orthogonal paths for carrying out the split and group operations as required by iSaG.
To realize requirement (ii), i.e., locating two sets of "orthogonal" paths for carrying out iSaG iterations, we may use the red and green paths as illustrated in Fig. 9(b). The remaining issue is that the red waving paths do not cover the few vertices at the bottom of G (the green paths, on the other hand, cover all vertices of G). This issue can be addressed with some additional swaps (e.g., with a second pass), which still take only constant makespan during each iteration of iSaG and do not impact the time optimality or running time of iSaG. The realization of requirement (iii) is straightforward, as the only restriction here is that the closed paths for carrying out circulations on G cannot contain sharp turns. We can readily realize this using any one of the three interweaving hexagonal grids on G that we use for the swap operation, e.g., the red one in Fig. 9(a). Clearly, any cycle on a hexagonal grid can only have angles of 2π/3, which are obtuse. We note that there is no need to cover all vertices for this global circulation-based routing operation, because only a fraction (< 1/2, see [20] for details) of the discs need to cross the 5dg × 5dg cell boundaries. On the other hand, any one of the three hexagonal grids covers about 2/3 of the vertices on a large G. Calling the adapted PaF algorithm on the special triangular grid PaFT, we summarize the discussion in this section in the following result.
Lemma 2. PaFT computes constant-factor makespan time-optimal solutions for multi-robot routing on triangular grids in O(|V|²) time.

Combining DiscretizeOLDR with PaFT then gives us the following. In deriving the running time result, we use the fact that wh = Θ(|V|) = Ω(n).

Theorem 2. In a rectangular workspace W with w ≥ 10 and h ≥ 3 · 4/√3 + 2, for n labeled unit discs having start and goal configurations of separation no less than 8/3, constant-factor makespan time-optimal collision-free paths connecting the two configurations may be computed in O(w²h²) time.

We conclude this section with the additional remark that PaFT should mainly be viewed as providing a theoretical guarantee rather than a practical algorithm, due to the fairly large constant in the optimality guarantee.
5 Fast Computation of Near-Optimal Solutions via Integer Linear Programming
From a practical standpoint, the DiscretizeOLDR algorithm introduces the possibility of plugging in any discrete algorithm for multi-robot routing. Indeed, algorithms including those from [1,2,16,18,21] may be modified to serve this purpose. In this paper, we develop a new integer linear programming (ILP) approach based on the time-expanded network structure proposed in [21]. The benefits of using an ILP model are its flexibility and its computational performance when combined with appropriate solvers, e.g., Gurobi [6].

5.1 Integer Linear Programming Model for Multi-robot Routing on Triangular Grids
The essential idea behind an ILP-based approach, e.g., [21], is the construction of a directed time-expanded network graph representing the possible flow of the robots over time. Given a discrete problem instance (G, VS, VG), the network is constructed by taking the vertex set V of G and making T + 1 copies of it. Each copy represents an integer time instance from 0 to T. Then, a directed edge is added between two vertices when they are adjacent on G and consecutive in time, in the direction from time step t to time step t + 1. To build the ILP model, we create a binary variable for each robot and each edge in the time-expanded graph, representing whether the given robot uses that edge as part of its trajectory. We then add constraints to ensure that robots do not collide. The basic model from [21] only ensures that no two robots use the same edge or vertex at the same time; in our case, we need to consider more complex interactions, which are detailed as follows. Denoting N(i) as the set containing vertex i ∈ V and its neighbors, the ILP model contains two sets of binary variables: (i) {x_{r,i,j,t} | 1 ≤ r ≤ n, i ∈ V, j ∈ N(i), 0 ≤ t < T}, where x_{r,i,j,t} indicates whether robot r moves from vertex i to j between time steps t and t + 1. Note that by a reachability test, some variables here are
fixed to 0. (ii) {x_{r,v_r^g,v_r^s,T} | 1 ≤ r ≤ n}, which represent virtual edges between the goal vertex of each robot at time step T and its start vertex at time 0. x_{r,v_r^g,v_r^s,T} is set to 1 iff r reaches its goal at T. The objective of this ILP formulation is to maximize the number of robots that reach their goal vertices at T, i.e.,

maximize Σ_{1≤r≤n} x_{r,v_r^g,v_r^s,T},

under the constraints

∀1 ≤ r ≤ n, 0 ≤ t < T, j ∈ V:  Σ_{i∈N(j)} x_{r,i,j,t} = Σ_{k∈N(j)} x_{r,j,k,t+1},   (4)

∀1 ≤ r ≤ n:  Σ_{i∈N(v_r^s)} x_{r,v_r^s,i,0} = Σ_{i∈N(v_r^g)} x_{r,i,v_r^g,T−1} = x_{r,v_r^g,v_r^s,T},   (5)

∀0 ≤ t < T, i ∈ V:  Σ_{1≤r≤n} Σ_{j∈N(i)} x_{r,i,j,t} ≤ 1,   (6)

∀0 ≤ t < T, i ∈ V, j ∈ N(i):  Σ_{1≤r≤n} (x_{r,i,j,t} + x_{r,j,i,t}) ≤ 1.   (7)

Here, constraints (4) and (5) ensure that a robot always starts from its start vertex and can only stay at its current vertex or move to an adjacent vertex at each time step. Moreover, constraint (5) is essential for calculating the objective value. Constraint (6) prevents robots from simultaneously occupying the same vertex, while constraint (7) eliminates head-to-head collisions on edges.

For a triangular grid, we must impose one extra set of constraints so that two robots cannot simultaneously move on the same triangle. Reasoning about each triangle formed by mutually adjacent vertices i, j, k ∈ V, for all 0 ≤ t < T, the constraint can be expressed as

Σ_{1≤r≤n} (x_{r,i,j,t} + x_{r,j,i,t} + x_{r,i,k,t} + x_{r,k,i,t} + x_{r,j,k,t} + x_{r,k,j,t}) ≤ 1.   (8)
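As a rough illustration, the model might be assembled with a solver API as in the sketch below (a minimal gurobipy sketch under several assumptions: it is not the authors' implementation, the data structures V, N, triangles, starts, goals are illustrative, the reachability-based variable fixing is omitted, and flow conservation (4) is applied for t < T − 1 with the loopback handled by (5)).

```python
import gurobipy as gp
from gurobipy import GRB

def build_model(V, N, triangles, starts, goals, T):
    """V: vertex list; N: dict v -> neighbors of v, including v itself;
    triangles: (i, j, k) mutually adjacent vertex triples;
    starts/goals: per-robot start and goal vertices; T: time horizon."""
    n = len(starts)
    m = gp.Model("TriILP")
    # x[r, i, j, t] = 1 iff robot r traverses (i, t) -> (j, t + 1)
    x = m.addVars(((r, i, j, t) for r in range(n) for i in V
                   for j in N[i] for t in range(T)), vtype=GRB.BINARY)
    # loopback variables: robot r reaches its goal at time T
    loop = m.addVars(range(n), vtype=GRB.BINARY)
    m.setObjective(loop.sum(), GRB.MAXIMIZE)
    for r in range(n):
        # (5): start outflow = goal inflow at T-1 = loopback value
        m.addConstr(gp.quicksum(x[r, starts[r], i, 0]
                                for i in N[starts[r]]) == loop[r])
        m.addConstr(gp.quicksum(x[r, i, goals[r], T - 1]
                                for i in N[goals[r]]) == loop[r])
        # (4): flow conservation through every vertex and time step
        for t in range(T - 1):
            for j in V:
                m.addConstr(gp.quicksum(x[r, i, j, t] for i in N[j]) ==
                            gp.quicksum(x[r, j, k, t + 1] for k in N[j]))
    for t in range(T):
        for i in V:
            # (6): at most one robot occupies/leaves vertex i at time t
            m.addConstr(gp.quicksum(x[r, i, j, t] for r in range(n)
                                    for j in N[i]) <= 1)
            for j in N[i]:
                if j == i:
                    continue  # waiting in place is not a head-on conflict
                # (7): no head-to-head swaps on an edge (added from both
                # endpoints; the duplicate constraint is harmless)
                m.addConstr(gp.quicksum(x[r, i, j, t] + x[r, j, i, t]
                                        for r in range(n)) <= 1)
        for (i, j, k) in triangles:
            # (8): at most one robot moves on any triangle at a time
            m.addConstr(gp.quicksum(
                x[r, a, b, t] for r in range(n)
                for (a, b) in [(i, j), (j, i), (i, k), (k, i), (j, k), (k, j)]
            ) <= 1)
    return m, x, loop
```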
Building on the ILP model, the overall route planning algorithm for triangular grids, TriILP, is outlined in Algorithm 1. In line 1, an underestimated makespan T is computed by routing robots to goal vertices while ignoring mutual collisions. Then, as T gradually increases (line 6), ILP models are iteratively constructed and solved (lines 3–4) until the resulting objective value objval equals n. In line 5, time-optimal paths are extracted and returned. Derived from [21], TriILP has completeness and optimality guarantees.

To improve the scalability of the ILP-based algorithm, a k-way split heuristic is introduced in [21], which adds intermediate robot configurations (somewhere between the start and goal configurations) to split the problem into sub-problems. These sub-problems require fewer steps to solve, which means that the corresponding ILP models are much smaller and can be solved much faster. A detailed description and evaluation of the split heuristic are available in [21]. This heuristic is directly applicable to TriILP.
Algorithm 1: TriILP
1: T ← UnderestimatedMakespan(G, VS, VG)
2: while True do
3:   model ← PrepareModel(G, VS, VG, T)
4:   objval ← Optimize(model)
5:   if objval equals n then return ExtractSolution(model)
6:   else T ← T + 1
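In code, the outer loop might look like the following sketch, where prepare_model stands for the constraint construction of Sect. 5.1 (e.g., the build_model sketch above) and bfs_distance and extract_solution are hypothetical helpers.

```python
def tri_ilp(G, VS, VG):
    # lower bound: longest single-robot shortest path, ignoring collisions
    T = max(bfs_distance(G, s, g) for s, g in zip(VS, VG))
    while True:
        model, x, loop = prepare_model(G, VS, VG, T)
        model.optimize()
        if model.objVal == len(VS):        # every robot reached its goal
            return extract_solution(x, T)
        T += 1  # grow the time horizon and retry
```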
5.2 Performance Evaluation
We evaluate the performance of TriILP based on two standard measures: computation time and optimality ratio. To compute the optimality ratio, we first obtain the underestimated makespan t̂i, which is the number of steps needed to move robots to their goals for a given problem instance i while ignoring potential robot-robot collisions. Denoting ti as the makespan produced by TriILP for the i-th problem instance, the optimality ratio is defined as (Σi ti)/(Σi t̂i). For each set of problem parameters, ten random instances are generated and the average is taken. All experiments are executed on an Intel Core i7-6900K CPU with 32 GB RAM at 2133 MHz. For the ILP solver, Gurobi 8 is used [6].

We begin with TriILP on purely discrete multi-robot routing problems. On a densely occupied minimum triangular grid (n1 = 2, n2 = 3, |V| = 22, n = 16) as allowed by our formulation, a randomly generated problem can be solved optimally within 5 s on average. For a much larger environment (n1 = 7, n2 = 16, |V| = 232), we evaluate TriILP with k-way split heuristics, gradually increasing the number of robots. As shown in Fig. 10, TriILP can solve problems with 50 robots optimally in 60 s. Performance is significantly improved with the k-way split heuristic: with 4-way split, TriILP can solve problems with 110 robots in 55 s to 1.65-optimal. With 8-way split, we can further push to 140 robots with a reasonable optimality ratio.
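For reference, the batch optimality ratio is then simply the following (hypothetical names):

```python
def optimality_ratio(makespans, underestimates):
    """(sum_i t_i) / (sum_i t_hat_i) over a batch of solved instances."""
    return sum(makespans) / sum(underestimates)
```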
Fig. 10. Performance of TriILP with k-way split heuristics on a triangular grid with 232 vertices and varying numbers of robots.
Solving (continuous) OLDR requires both DiscretizeOLDR and TriILP. We first attempted a scenario where the density approaches the theoretical limit, by placing the robots just 8/3 apart from each other in a regular (triangular) pattern for both start and goal configurations (see Fig. 11(a) for an illustration; we omit the labels of the robots, which are different for the start and goal configurations). After running DiscretizeOLDR, we get a discrete arrangement as illustrated in Fig. 11(b). For this particular problem, we can compute a 1.5-optimal solution in 2.1 s without using splitting heuristics.
Fig. 11. Illustration of a compact OLDR instance with 20 densely packed robots, and the configuration of robots after DiscretizeOLDR.
To test the effectiveness of combining DiscretizeOLDR and TriILP, we constructed many instances similar to Fig. 11 but with different environment sizes, always packing as many robots as possible with a separation of exactly 8/3. The computational performance for this case is compiled in Fig. 12. With the 8-way split heuristic, our method can solve tightly packed problems of 120 robots in 21.93 s with a 3.88 optimality ratio. We note that the (underestimated) optimality ratio in this case actually decreases as the number of robots increases. This is expected because when the number of robots is small, the corresponding environment is also small, and the optimality loss due to discretization is more pronounced in a smaller environment.
Fig. 12. Performance of TriILP (plus DiscretizeOLDR) on dense OLDR instances.
A second evaluation of OLDR carries out a comparison between TriILP (plus DiscretizeOLDR) and HexILP (the main algorithm from [22], which is based on a hexagonal grid discretization). We fix W with w = 42 and h = 43.57; the numbers of vertices in the triangular grid and the hexagonal grid are 312 and 252, respectively. For each fixed number of robots n, S and G are randomly generated within W at least 8/3 apart. Note that this means that collisions may potentially happen for HexILP during the discretization phase; these are ignored (to our disadvantage). The evaluation result is provided in Fig. 13. Since discretization based on a triangular grid produces larger models, the running time is generally a bit higher when compared with discretization based on hexagonal grids. However, TriILP can solve problems with many more robots and also produces solutions with much better optimality guarantees. The video at https://youtu.be/2-QHESZ p3E illustrates a few typical runs of TriILP.
Fig. 13. Performance comparison between TriILP and HexILP (with and without 4-way split heuristic) on randomly generated OLDR instances with a fixed W.
6 Conclusion and Future Work
In this work, we have developed a complete, polynomial-time algorithm for multi-robot routing in a bounded environment under extremely high robot density. The algorithm produces plans that are constant-factor time-optimal. We also provide a fast and more practical ILP-based algorithm capable of generating near-optimal solutions. We mention here that extensions to 3D settings, which may be more applicable to drones and other airborne robot vehicles, can be readily realized under the same framework with only minor adjustments.

Given the theoretical and practical importance of multi-robot (and, more generally, multi-agent) routing in crowded settings, in future work we would like to push robot density significantly higher than 50%. To achieve this while retaining optimality assurance, we believe the computation-based method developed in this work can be leveraged, perhaps in conjunction with a more sophisticated version of the DiscretizeOLDR algorithm. On the other hand,
a triangular grid supports a maximum density of 66%; it may be of interest to explore alternative structures for accommodating denser robot configurations.

Acknowledgement. This work is supported by NSF awards IIS-1617744 and IIS-1734419. Opinions or findings expressed here do not reflect the views of the sponsor.
References
1. Boyarski, E., Felner, A., Stern, R., Sharon, G., Betzalel, O., Tolpin, D., Shimony, E.: ICBS: the improved conflict-based search algorithm for multi-agent pathfinding. In: Eighth Annual Symposium on Combinatorial Search (2015)
2. Cohen, L., Uras, T., Kumar, T., Xu, H., Ayanian, N., Koenig, S.: Improved bounded-suboptimal multi-agent path finding solvers. In: International Joint Conference on Artificial Intelligence (2016)
3. Enright, J., Wurman, P.R.: Optimization and coordinated autonomy in mobile fulfillment systems. In: Automated Action Planning for Autonomous Mobile Robots, pp. 33–38 (2011)
4. Erdmann, M.A., Lozano-Pérez, T.: On multiple moving objects. In: Proceedings IEEE International Conference on Robotics & Automation, pp. 1419–1424 (1986)
5. Ghrist, R., O'Kane, J.M., LaValle, S.M.: Computing Pareto optimal coordinations on roadmaps. Int. J. Robot. Res. 24(11), 997–1010 (2005)
6. Gurobi Optimization, Inc.: Gurobi optimizer reference manual (2018)
7. Han, S.D., Rodriguez, E.J., Yu, J.: SEAR: a polynomial-time expected constant-factor optimal algorithmic framework for multi-robot path planning. In: Proceedings IEEE/RSJ International Conference on Intelligent Robots & Systems (2018, to appear)
8. Hopcroft, J.E., Schwartz, J.T., Sharir, M.: On the complexity of motion planning for multiple independent objects; PSPACE-hardness of the "warehouseman's problem". Int. J. Robot. Res. 3(4), 76–88 (1984)
9. Kornhauser, D., Miller, G., Spirakis, P.: Coordinating pebble motion on graphs, the diameter of permutation groups, and applications. In: Proceedings IEEE Symposium on Foundations of Computer Science, pp. 241–250 (1984)
10. LaValle, S.M., Hutchinson, S.A.: Optimal motion planning for multiple robots having independent goals. IEEE Trans. Robot. Autom. 14(6), 912–925 (1998)
11. Peasgood, M., Clark, C., McPhee, J.: A complete and scalable strategy for coordinating multiple robots within roadmaps. IEEE Trans. Robot. 24(2), 283–292 (2008)
12. Solovey, K., Halperin, D.: k-color multi-robot motion planning. In: Proceedings Workshop on Algorithmic Foundations of Robotics (2012)
13. Solovey, K., Halperin, D.: On the hardness of unlabeled multi-robot motion planning. In: Robotics: Science and Systems (2015)
14. Solovey, K., Yu, J., Zamir, O., Halperin, D.: Motion planning for unlabeled discs with optimality guarantees. In: Robotics: Science and Systems (2015)
15. Spirakis, P., Yap, C.K.: Strong NP-hardness of moving many discs. Inf. Process. Lett. 19(1), 55–59 (1984)
16. Standley, T., Korf, R.: Complete algorithms for cooperative pathfinding problems. In: Proceedings International Joint Conference on Artificial Intelligence (2011)
17. Turpin, M., Mohta, K., Michael, N., Kumar, V.: CAPT: concurrent assignment and planning of trajectories for multiple robots. Int. J. Robot. Res. 33(1), 98–112 (2014)
18. Wagner, G., Choset, H.: M*: a complete multirobot path planning algorithm with performance bounds. In: Proceedings IEEE/RSJ International Conference on Intelligent Robots & Systems, pp. 3260–3267 (2011)
19. Wurman, P.R., D'Andrea, R., Mountz, M.: Coordinating hundreds of cooperative, autonomous vehicles in warehouses. AI Mag. 29(1), 9–19 (2008)
20. Yu, J.: Constant factor time optimal multi-robot routing on high-dimensional grid. In: Robotics: Science and Systems (2018)
21. Yu, J., LaValle, S.M.: Optimal multi-robot path planning on graphs: complete algorithms and effective heuristics. IEEE Trans. Robot. 32(5), 1163–1177 (2016)
22. Yu, J., Rus, D.: An effective algorithmic framework for near optimal multi-robot path planning. In: Bicchi, A., Burgard, W. (eds.) Robotics Research, pp. 495–511. Springer, Cham (2018)
An Experimental Analysis on the Necessary and Sufficient Conditions for the RRT* Applied to Dynamical Systems

Israel Becerra, Heikel Yervilla-Herrera, and Rafael Murrieta-Cid

Centro de Investigación en Matemáticas (CIMAT), Guanajuato, Mexico
{israelb,yervilla.herrera,murrieta}@cimat.mx
Abstract. In this paper, we study some properties of several local planners for nonholonomic dynamical systems used to achieve asymptotic global optimality through the RRT*. More specifically, we study the conditions that a local steering method must satisfy to produce globally optimal trajectories in an environment with obstacles. The main properties we analyse in the steering methods are the following: (1) whether or not the steering method produces locally optimal motion primitives (optimal letters); (2) whether or not the steering method concatenates the local optimal primitives in such a way that the resulting concatenation is also optimal (optimal words); (3) whether or not the steering method produces trajectories that respect the topological property. We experimentally study how those properties affect the speed of convergence to the globally optimal solution, and we also validate their sufficiency and necessity, all making use of the problem of finding time-optimal trajectories for a differential drive robot in the presence of obstacles. We also discard conditions that prove not to be necessary, and we give some insight on the necessary and sufficient conditions for the RRT* to asymptotically converge to optimal trajectories, which is indeed the sought research target.
1
· Optimality · Nonholonomy ·
Introduction
Motion planning has seen great advances since the early algorithms, where the generation of exact solutions was the preferred approach [3], passing through cellular decompositions, graph searches, potential fields, and sampling-based methods [10]. In particular, the sampling-based methods have been successful in finding collision-free trajectories in high-dimensional spaces and have seen a constant evolution, from the early PRM [9], which is one of the first tools to construct roadmaps using a sampling-based approach, to the

The authors would like to acknowledge the financial support of Intel Corporation.
RRT [13], which introduces interesting concepts such as sampling the control space to address the kinodynamic planning problem. A lot of work has been done regarding kinodynamic and nonholonomic problems [2,5,11–13], particularly in sampling-based planning methods. In the present work we study this particular type of planning method.

Since the introduction of the PRM and the RRT, one of the major breakthroughs was the introduction of the PRM*, RRT* and related approaches [7], which provided tools that are capable of achieving asymptotic optimality. Especially, the RRT-based algorithms naturally extend to deal with nonholonomic constraints, which has earned them a large amount of attention in recent years. A recent survey of such algorithms can be found in [15]. Nonetheless, the RRT-based algorithms rely on the availability of a local steering method, which corresponds to solving a two-point boundary value problem (BVP), by no means an easy task for many interesting dynamical systems. Some efforts have been made to avoid this issue, such as in [14], where the authors propose an incremental sampling-based planner, the Stable Sparse RRT (SST), which does not rely on the availability of a local steering method. However, when a local steering method is available, for certain scenarios the RRT* can produce higher convergence rates than the SST, as shown in the experiments of [14]. Hence, the interest in RRT*-based algorithms is still present.

Effort has been put into extending the RRT* to deal with a variety of kinodynamic motion planning problems. For instance, in [18] the authors extend the RRT* to consider linear differential constraints by using a fixed-final-state, free-final-time controller that connects any pair of states, optimizing a cost function that includes a trade-off between the duration of a trajectory and the expended control effort. The non-linear dynamics are considered using first-order Taylor approximations. Similarly, in [16], a linearisation of the system dynamics is performed to later apply a linear-quadratic regulator (LQR) controller within the RRT*, finding in that manner optimal plans for several complex or underactuated systems. Nonetheless, a richer generalization extending the RRT* method to a broader range of problems with differential constraints is still of interest; moreover, a crucial question that remains unanswered is related to the properties that the local planners must possess.

An attempt to achieve such a generalization is presented in [6,8]. In contrast to other works, the authors of [6,8] tried to present a set of conditions under which the RRT* can be applied to nonholonomic planning problems. The conditions presented in [6] are proved to be sufficient, while the conditions presented in [8] are merely suggested, without a formal proof of whether they are necessary or sufficient. The main requirement presented in [6] is the availability of a local optimal planner that induces an optimal distance function, and which directly impacts the steering and rewire procedures. Apart from the required local planner, the authors proposed sufficient conditions in the form of a local controllability property, referred to as weak local controllability, and a second sufficient condition related to the existence of a solution to the problem. According to the authors, fulfilling the aforementioned conditions guarantees the RRT* asymptotic
optimality. In [8], the authors relax some requirements, such as the optimality of the local planner, but introduce some new conditions, namely the topological property [17] related to the local planner, and that the system be small time locally attainable. The main contributions of the present work can be summarized as follows:

– We present a concise summary of the conditions presented in [6,8] to achieve asymptotic optimality while using the RRT* in the context of kinodynamic planning problems, and we experimentally evaluate the validity of the conditions that were merely suggested without formal proof.
– A compilation of different methodologies based on the summarized conditions, which combine local steering methods with the RRT*, is presented.
– An experimental analysis to compare the performance of the considered methods. For this comparison, the case of time-optimal trajectories for a differential drive robot (DDR) in the presence of obstacles is addressed.
– We exhibit the existence of cases where a local planner that does not generate optimal motion primitives (optimal letters) does not converge to the optimal cost; indeed, it does not even converge to a finite cost.
– Experimentally, we found that the topological property is not a necessary condition in the current context, noting that the problem of achieving asymptotic global optimality with the RRT* is different from the problem of approximating a geometric path with paths computed by a local planner that respects the nonholonomic constraints [12].
– Making use of a counterexample approach, we give some insight on the necessary and sufficient conditions for the RRT* to asymptotically converge to optimal trajectories, which sets possible directions for future research.

The remainder of this paper is organized as follows. Section 2 presents the problem definition. Section 3 summarizes the conditions in [6,8]. Section 4 shows the compilation of methodologies based on the summarized conditions. Section 5 describes the methodologies in the context of the problem of time-optimal trajectories for a DDR. Section 6 shows the experimental performance analysis. Section 7 presents a discussion with our analysis on the sufficiency and necessity of the studied conditions.
2 Problem Definition
Let X and U be smooth manifolds that represent the state and control spaces, respectively, and consider the following dynamical system:

ẋ(t) = f(x(t), u(t)), x(0) = x₀,   (1)

where x(t) ∈ X, u(t) ∈ U for all t, f is Lipschitz continuous, and x₀ ∈ X. Let Xfree ⊂ X be the set of collision-free states, Xgoal ⊂ X the goal set, and c : X → R≥0 the cost function. In the present work we consider the optimal kinodynamic motion planning problem, which is stated as follows:
Definition 1 (Optimal Kinodynamic Motion Planning Problem). Find a dynamically-feasible trajectory x : [0, T] → X, with x(0) = x₀, that (i) is collision-free, i.e., x(t) ∈ Xfree ∀t, (ii) reaches the goal region, i.e., x(T) ∈ Xgoal, and (iii) minimizes the cost functional J(x) = ∫₀ᵀ c(x(t)) dt.

We will also consider that, to solve the optimal kinodynamic motion planning problem, the RRT* methodology is being used. See [7] for further details.
3 Conditions for Asymptotic Convergence of the RRT*
In this section we present the conditions introduced in [6,8]. We summarize them, keeping only the conditions that are not redundant (nor dominated by any other condition), unless it is convenient to preserve them to gain flexibility in the available catalogue of conditions to be fulfilled. The first condition in [6] is the availability of an optimal local planner (local steering method), which returns the optimal trajectory connecting any two states z1, z2 ∈ X in the absence of obstacles, and which is formally defined as follows:

Definition 2. A local planner is an Optimal Local Planner if there exists an ε > 0 such that, for all ‖z1 − z2‖ ≤ ε, it returns a trajectory x* : [0, T] → X, with x*(0) = z1 and x*(T) = z2, driven by the input u* : [0, T] → U fulfilling ẋ* = f(x*(t), u*(t)) ∀t ∈ [0, T], s.t. J(x*) = min_{T∈R≥0, u} J(x).

The second sufficient condition in [6] is that the considered dynamical system respects the so-called weak local controllability (WLC) property. For stating that condition, consider the following concepts. Denote by Bε(z) the closed ball of radius ε centred at state z, and let Xz,z' denote the set of all trajectories that start from state z and reach state z' respecting the state transition equation of the dynamical system. Given a state z ∈ X and a constant ε > 0, let Rε(z) = {z' ∈ X | ∃x ∈ Xz,z' s.t. x(t) ∈ Bε(z) ∀t ∈ [0, T]}; see [6] for more details. Then, weak local controllability (WLC) is given by the next definition:

Definition 3 (Weak Local Controllability (WLC) [6]). There exist constants α, ε̄ ∈ R>0 and p ∈ N such that, for any ε ∈ (0, ε̄) and any state z ∈ X, the set Rε(z) of all states that can be reached from z with a path that lies entirely inside the ε-ball centred at z contains a ball of radius αεᵖ.

The third sufficient condition in [6] is an assumption on the obstacle region to ensure that there exists an optimal trajectory with enough free space around it to allow almost-sure convergence. For the sake of brevity, we refer the reader to [6] for details.

In [8], the authors present an enhancement of the computational effectiveness of the RRT* algorithm. This is achieved by modifying the NearVertices procedure such that the neighbours are computed considering a weighted Euclidean box, which resembles the shape of the sub-Riemannian ball that respects the nonholonomic constraints of the dynamical system. This modification requires the dynamical system to be small time locally attainable (STLA), which gives rise to the first
condition presented in [8]. Consider a given state z ∈ X and a real number t > 0; the small-time attainable set A(z, t) is the set of all states reachable by the considered dynamical system within time t, starting from state z. Then, the small time locally attainable (STLA) property is given by the next definition:

Definition 4. A system is said to be Small Time Locally Attainable (STLA) at state z ∈ X if A(z, t) has a non-empty interior for all t > 0.

The second condition shown in [8] is related to the local planner used in the RRT*. Such condition is referred to as the topological property, which was first introduced in [17]; broadly speaking, it states that the trajectory joining any two states within a ball of radius η will not leave a ball of radius ε. The topological property is formally stated as follows:

Definition 5. A local planner that drives the system from a state z to a state z' with a trajectory x ∈ Xz,z', with z = x(0) and z' = x(T), respects the Topological Property (TP) if ∀ε > 0, ∃η > 0 s.t. ∀z ∈ X, ∀z' ∈ Bη(z), ∀t ∈ [0, T], x(t) ∈ Bε(z).

In [17], the topological property (TP) was introduced under the assumption that the dynamical system is small time locally controllable (STLC). This implicitly suggests that property as a third condition assumed in [8], which would override the STLA property, since it is well known that a system that is STLC is STLA, but not the contrary. Such property is given by the next definition:

Definition 6. A system is said to be Small Time Locally Controllable (STLC) at state z ∈ X if A(z, t) contains a neighbourhood of z for all t > 0.

As stated by [6], a system that is locally controllable in the sense of [4], which is referred to in [17] as small-space locally controllable¹ (SSLC), also fulfils the WLC property. However, since SSLC can be interpreted as the STLC condition, a system that is STLC would also respect the WLC condition. Therefore, combining the facts that a system that is STLC is WLC, and that the STLA property is not sufficient to guarantee the local planner's topological property, we will keep the STLC condition over the WLC and STLA properties. Summarizing, the conditions presented in [6,8] are:
I. The considered dynamical system is STLC.
II. The local planner used in the RRT* is an optimal local planner.
III. The local planner used in the RRT* respects the topological property.
IV. There exists an optimal path which has enough obstacle-free space around it to allow almost-sure convergence.
¹ A system is small-space locally controllable at z ∈ X if, for any neighbourhood Ω of z, there exists a neighbourhood AΩ(z) whose points are all accessible by the system without departing from Ω. The system is small-space locally controllable if it is SSLC at any z ∈ X.
840
I. Becerra et al.
Notice that an optimal local planner satisfies the topological property [17], but a local planner that satisfies the topological property is not necessarily optimal; indeed, in Sect. 5.3 we present such a planner. As stated in [17], it is not an easy task to obtain an optimal local planner, nor to design a local planner that respects the topological property; hence, we keep both properties to preserve the alternative of satisfying either one condition or the other, depending on the available local planner.
4 Methods Overview for Using the RRT* for Nonholonomic Dynamical Systems
The different methodologies we present in this section are derived based on the characteristics of the local planner used in the RRT*. We generate different combinations depending on the optimal letters or optimal words used by the planner, or on whether it fulfils the topological property. We refer to the motion primitives returned by an optimal planner as the optimal letters, and to the concatenations of motion primitives that compose an optimal trajectory as the optimal words. For instance, for the time-optimal trajectories for a DDR in the absence of obstacles [1], the optimal letters (motion primitives) are either rotations in site (clockwise or counter-clockwise rotations) or straight-line motions (forward motion ⇑ or backward motion ⇓), and the structure of the optimal words is one of those shown in Table 1, or a subsection of them.

Table 1. Structure of optimal words

Tangent:   ⇑ | ⇓ | ⇑ | ⇓
Tangentπ:  ⇑π⇓ | ⇓π⇑ | ⇑π⇓ | ⇓π⇑
ZigZag:    ⇑⇓⇑ | ⇓⇑⇓ | ⇑⇓⇑ | ⇓⇑⇓
Table 2 shows the different local planners to be considered in the rest of the paper. We use the superscript (+) to denote that a planner respects the respective property, (−) that it does not, and (*) that the planner partially respects the property.

Table 2. Local planners

Planner | Optimal Letters | Optimal Words | Topological Property
L+W+T+  | YES             | YES           | YES
L+W*T+  | YES             | Not all       | YES
L+W−T+  | YES             | NO            | YES
L+W−T−  | YES             | NO            | NO
L−W−T−  | NO              | NO            | NO
5 Description of Local Planners: The Case of Time-Optimal Planning for a DDR
In this section, we further describe the local planners to be used in the context of the problem of obtaining time-optimal trajectories for a DDR in the presence of obstacles. We describe each of the planners and prove their claimed properties. These planners are the ones used in the experimental analysis shown in Sect. 6. The kinematics of a DDR are given by the next set of equations:

(ẋ, ẏ, θ̇)ᵀ = (cos θ, sin θ, 0)ᵀ v + (0, 0, 1)ᵀ w,   (2)

where v = (u1 + u2)/2 is the robot's linear speed, w = (u1 − u2)/(2b) is the robot's angular speed, and 2b is the width of the DDR. The left and right wheel angular speeds are u1 and u2, respectively, which are the system controls. In the following sections a DDR will be used as our dynamical system, which is STLC [12], fulfilling in that way Condition I from Sect. 3. We will also assume in the next sections that Condition IV is respected.
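For reference, Eq. (2) can be integrated numerically as in the following sketch (illustrative code, not part of the original paper; wheel-speed saturation and the step size are assumptions):

```python
import math

def ddr_step(state, u1, u2, b=1.0, dt=1e-3):
    """One Euler-integration step of the DDR kinematics in Eq. (2).

    state: (x, y, theta); u1, u2: left and right wheel angular speeds
    (as in Eq. (2)); 2b: the width of the DDR.
    """
    x, y, theta = state
    v = (u1 + u2) / 2.0          # linear speed
    w = (u1 - u2) / (2.0 * b)    # angular speed
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + w * dt)

# e.g., u1 = u2 = 1 produces straight forward motion,
# while u1 = 1, u2 = -1 produces a rotation in site.
```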
5.1 Local Planner L+W+T+
This planner corresponds to the planner from [1], which solves the time-optimal planning problem for a DDR in the absence of obstacles. The authors of [1] present an algorithm that returns an optimal trajectory (and its cost) within nine possible symmetry classes of optimal trajectories (indexed as A, B, C, ..., H, I), connecting in that way any two states z1 ∈ X and z2 ∈ X. Note that this planner is not local; on the contrary, it contains the whole dictionary of optimal words to cover the whole state space X. Such a planner not only respects optimality but also respects the topological property, as stated in Proposition 1.
Proposition 1. Local planner L+W+T+ respects the topological property.
Proof. The local planner L+W+T+ returns optimal trajectories for the whole state space X, according to [1]. Local planners that are based on families of optimal trajectories satisfy the topological property [17].

Corollary 1. A local planner that considers a subset of the optimal words of the dictionary from [1], for a neighbourhood of any z ∈ X, respects the topological property.
5.2 Local Planner L+W*T+
Planner L+W*T+ is also based on the optimal planner presented in [1]. From the nine possible families of optimal trajectories, planner L+W*T+ restricts itself to trajectories from symmetry classes D and G, which are able to reach a neighbourhood of any state z ∈ X. Class D corresponds to a structure of the form ⇑, which encloses seven different symmetric trajectories. Class G is of the structure ⇓⇑, containing its respective seven symmetric trajectories. Planner L+W*T+ is a local version of planner L+W+T+: it contains some of the optimal words of the whole dictionary and, moreover, it respects the topological property by Corollary 1. If planner L+W*T+ is used in the RRT* construction, only samples that can be optimally connected to the tree are included in the tree; that is, if the optimal trajectory to connect a given sample to the tree does not correspond to class D or class G, then the sample is discarded.
5.3 Local Planner L+W−T+
This local planner alternates straight-line motions and rotations in site, yielding a sort of "zigzag" behaviour; hence, note that it makes use of optimal letters. Given two states zs = (xs, ys, θs) ∈ X and zg = (xg, yg, θg) ∈ X, the planner proceeds as follows (see Fig. 1). First, we generate an intermediate goal z'g = (xg, yg, θs); that is, we build z'g such that it has the x and y coordinates of zg, but the orientation of zs. Second, consider the line l between states zs and z'g, projected on R². The DDR first moves backward (motion primitive ⇓) a distance r, then rotates in site to align with a direction β, then moves forward a distance d = 2r/sin β, then rotates in site, moves backward, etc., and keeps repeating this. Following such behaviour, the robot will visit m intermediate points over l until z'g is reached. Third, the planner makes the robot rotate in site from z'g to finally align with zg.
Fig. 1. Trajectories (dashed lines) computed by L+W−T+ to reach a state zg from state zs. (a) Trajectory with one intermediate point (m = 1). (b) Trajectory with three intermediate points (m = 3).
To compute the r, β, and m parameters, we make use of Algorithm 1. This algorithm makes sure that planner L+W−T+ respects the topological property, keeping the trajectories inside a ball B2r(zs) by iteratively increasing the number of reference intermediate points over l until the containment is achieved. Through Proposition 2 we formally prove that the local planner L+W−T+ respects the topological property.
Fig. 2. The starting state zs is shown at the center of balls Bε(zs) and Bη(zs). Different trajectories are shown as dashed lines, corresponding to trajectories that planner L+W−T+ might produce to take the robot from zs to the respective goals shown as coloured points over the inner circle. The goal z'g is the goal state whose trajectory contains the farthest point x'(t).
−
+
Proposition 2. Local planner L W T respects the topological property. Proof. Given a start state zs = (xs , ys , θs ) ∈ X and a goal state zg = (xg , yg , θg ) ∈ X, Algorithm 1 computes the parameter r = (xs − xg )2 + (ys − yg )2 + (θs − θg )2 , which can be used to define a ball Br (zs ) around the start state zs . Later, algorithm 1 computes the parameters β and m, such that by construction, the trajectory x delivered by the local planner + − + L W T , with x(0) = zs and x(T ) = zg , does not leave a ball B2r (zs ), that is, x(t) ∈ B2r (zs ), ∀t ∈ [0, T ]. Moreover, given any potential goal state zg ∈ Br (zs ), + − + the trajectory x generated by L W T , with x (0) = zs and x (T ) = zg , respects x (t) ∈ B2r (zs ), ∀t ∈ [0, T ]. The goal state zg whose trajectory contains the farthest state x (t), is zg = (xs + r cos(θs ), ys + r sin(θs ), θs ), which actually touches the periphery of B2r (zs ) at (xs + 2r cos(θs ), ys + 2r sin(θs ), θs ) (see Fig. 2). Setting = 2r, and η = r, yields = 2η, meaning that to any + − + corresponds a unique η, independently of zs , hence, the local planner L W T respects the topological property.
844
I. Becerra et al. +
−
+
Algorithm 1. Local planner L W T ( β, m computation) Input: zs = (xs , ys , θs ) and zg = (xg , yg , θg ) Output: β and m 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16:
5.4
r ← (xs − xg )2 + (ys − yg )2 + (θs − θg )2 u ˆ ← (cos(θs ), sin(θs ), 0) //Unit vector in direction θs qu ← zs − r ∗ u ˆ //Intermediate state (see Fig. 1a), qu = (xu , yu , θu ) x +x y +y qm ← ( s 2 g , s 2 g , θs ) //Intermediate state (see Fig. 1a), qm = (xm , ym , θm ) m←1 //Total number of intermediate points qv ← zg + r ∗ u ˆ //Intermediate state (see Fig. 1a), qv = (xv , yv , θv ) Loop: β← angle to align the DDR from qu toward qm if (xv − xs )2 + (yv − ys )2 + (β − θs )2 > 2 ∗ r OR (xu − xs )2 + (yu − ys )2 + (β − θs )2 > 2 ∗ r then qu ← qm − r ∗ u ˆ x +x y +y qm ← ( m2 g , m2 g , θs ) m←2∗m+1 goto Loop. else return β, m end if
+
−
Local Planner L W T
−
The present local planner consists of a rotation in site (/), a straight line motion (⇑/⇓), and a rotation in site (/). This corresponds to the symmetry classes D and F from [1], hence, this planner makes use of optimal letters. The + − − trajectories that the L W T planner throws are used to connect any pair of given states, zs and zg (given that the trajectory is collision free), even if such trajectory is not the optimal one, therefore, the planner will be considered to not be using optimal words. It is worth to mention that despite the fact that this planner is able to connect any pair of states, the planner does not respect the topological property, as it is proved in Proposition 3. Such statement is proved by giving a family of pathological goal states such that, no manner how close those states are from a start state zs , there is no arbitrary small whose associated ball B (zs ) will contain their related trajectories. +
−
−
Proposition 3. Local planner L W T does not respect the topological property. x, y¯, π4 ), with x ¯ > xs Proof. Consider a pair of states zs = (xs , ys , π4 ) and z¯ = (¯ and y¯ = ys . Also note that both have the same orientation θ = π4 . In order to + − − move from zs towards z¯, planner L W T would return the next sequence of motion primitives: ⇑. Now consider the ball B (zs ), and consider the ray ρ that goes from zs and infinitely extends toward z¯. The trajectory delivered + − − by L W T to move from zs to any state z ∈ ρ would require to rotate a total angle of π2 radians, no matter how close z is to zs . Any ball Bη (zs ) inside B (zs ) would still contain states over ρ, therefore, as → 0, the trajectories corresponding to goal states z ∈ ρ inside Bη (zs ) would eventually be out of B (zs ), due to the always required total rotation of π2 radians. Therefore, ∃ an with no η s.t. ∀z ∈ X, ∀z ∈ Bη (z), ∀t ∈ [0, T ], x(t) ∈ B (z). The result follows.
On the Necessary and Sufficient Conditions for the RRT*
5.5
−
−
Local Planner L W T
845
−
The proposed local planner consists of a concatenation of circular arcs, so it clearly applies controls that correspond neither to optimal words nor to optimal letters. The planner proceeds as follows. Consider a pair of states zs ∈ X and zg ∈ X, and the line segment l in R² that joins zs and zg. Planner L−W−T− tries to connect the extreme points of l with an arc centred at the midpoint of l. The trajectory consists in rotating in site until the DDR's orientation is aligned with the tangent to the arc, then following the arc, and finally rotating in site to align with θg, all applying saturated controls. If following such a trajectory leads to a collision, then a refinement is attempted, introducing intermediate bias points and connecting them with smaller arcs, as shown in Fig. 3. This is repeated until N refinements have been attempted. Notice that for any trajectory x, its corresponding projection x̂ in R² could be approximated by this procedure, and as the number of intermediate bias points tends to ∞, the arcs approximation resembles the curve x̂ more and more. Nonetheless, it is important to mention that as the number of bias points increases, the cost of the arcs approximation might not tend to J(x). This is proved in Proposition 4.
Fig. 3. The green segment is the projection x̂ of a trajectory x onto R², which is the trajectory sought to be approximated by the arc-type trajectories. As the number of intermediate bias points increases, the arc-type trajectories tend to x̂.
Proposition 4. Given a pair of states zs ∈ X and zg ∈ X, and a trajectory x with x(0) = zs and x(T) = zg, let Δn(x) denote an arcs-approximation trajectory that considers n intermediate bias points over trajectory x. There exist cases where limn→∞ J(Δn(x)) ≠ J(x).

Proof. Consider two states zs = (0, 0, π/2) ∈ X and zg = (2, 0, π/2) ∈ X, and consider a trajectory x, with zs = x(0) and zg = x(T), which rotates in site, moves in a straight line, and rotates in site, with saturated wheel speeds, just as shown in Fig. 3. Assuming the robot's radius b = 1, the cost (elapsed time) of such a trajectory would be J(x) = 2 + π (assuming a unit-valued maximum speed).² Considering an arcs-approximation Δ0(x), the time it takes the robot to travel the arc would be π + ((2π·1)/2)·2 = 3π, yielding J(Δ0(x)) = 3π. For a trajectory Δ1(x), J(Δ1(x)) = ((2π·0.5)/2 + (2π·0.5)/2)·3 = 3π, which is
² The related cost is computed as t = s(t) + bσ(t), where s(t) is the rectified path length in R², the plane of robot positions, and σ(t) is the rectified arc length in S¹, the circle of robot orientations; see [1] for details.
computed multiplying the sum of arc lengths by 1/v. For a trajectory Δ3(x), J(Δ3(x)) = ((2π·0.25)/2 + (2π·0.25)/2 + (2π·0.25)/2 + (2π·0.25)/2)·5 = 5π, etc. Then, J(Δn(x)) = (n + 2)π for n = 2^i − 1, i = 1, 2, ...; therefore, limn→∞ J(Δn(x)) = ∞, which is clearly different from J(x) = 2 + π. The result follows.

Remark 1. In the scenario described in the proof of Proposition 4, the different trajectories Δn(x) all have the same rectified length, but the time it takes the robot to travel them tends to infinity, because the DDR linear velocity v → 0 as n → ∞.

Considering the pair of states zs = (xs, ys, 0) and z̄ = (x̄, ȳ, 0), with x̄ > xs and ȳ = ys, and using arguments similar to the ones presented in Proposition 3, it can also be proven that L−W−T− does not respect the topological property. Finally, since local planner L−W−T− might yield trajectories Δn(x) whose cost does not tend to the cost of the related trajectory x, that is, limn→∞ J(Δn(x)) ≠ J(x), we will not consider L−W−T− in the experimental analysis presented in the next section; however, we further developed L−W−T− in this section because it will be relevant in Sect. 7, where we present our discussion and conclusions.
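As a quick sanity check of the cost growth, the following illustrative snippet evaluates the arcs-approximation cost for semicircular arcs traversed at the saturated-wheel speed v = R/(R + b) (a modeling assumption consistent with the 1/v factors above):

```python
import math

def arcs_cost(n, length=2.0, b=1.0):
    """Cost of Delta_n over a straight segment: n + 1 semicircular arcs of
    radius R = length / (2 (n + 1)), each traversed at v = R / (R + b)."""
    arcs = n + 1
    R = length / (2.0 * arcs)
    arc_len = math.pi * R
    v = R / (R + b)
    return arcs * arc_len / v  # equals (n + 2) * pi for length = 2, b = 1

for n in (1, 3, 7):
    print(n, arcs_cost(n) / math.pi)  # -> 3.0, 5.0, 9.0
```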
6 Experimental Analysis of the Convergence: The Case of Time-Optimal Planning for a DDR
In this section, we present two experiments where we compute time-optimal trajectories for a DDR in the two environments shown in Fig. 4. The local planners used in the comparison are L+W+T+ (the global planner proposed in [1]), L+W*T+ (a local version of the planner proposed in [1]), L+W−T+ (the zigzag planner that respects the topological property), and L+W−T− (the planner that rotates in site, moves in a straight line, and rotates in site).
Experiment 1
This experiment considers several pairs of start and goal states within the environment. For statistical purposes, the results presented in this section correspond to averages over trajectories resulting from 10 pairs of start-goal states. More precisely, for each of the start-goal pairs, we compute a trajectory using each of the four local planners, and then we run statistics over the 40 resulting trajectories (4 local planners and 10 trajectories each). In each case the trees' generation was stopped when 20000 nodes had been successfully generated. Table 3 summarizes the results. Planner L+W+T+ achieved the smallest average cost in both environments, followed by L+W*+T+ in Environment 1 and by L+W−T− in Environment 2. Planner L+W−T+ presented an average cost far above those of the other planners. Regarding average planning time, planner L+W−T− showed the fastest times, followed by L+W*+T+, then L+W+T+, and finally L+W−T+.
Table 3. Average total planning time and total cost for Experiment 1.

             Environment 1                  Environment 2
Planner      Plan. Time (sec)   Tot. Cost   Plan. Time (sec)   Tot. Cost
L+W+T+       324.1677           147.9497    322.3515           153.1534
L+W*+T+      262.3257           151.2736    262.1216           158.8366
L+W−T+       631.305            819.5697    605.4679           813.8562
L+W−T−       193.3317           151.98      193.6325           154.4702
6.2 Experiment 2
In this experiment we set a single pair of start-goal states and compute 10 trajectories using each of the local planners. The presented results correspond to averages over those sets of 10 trajectories. Again, in each case the trees' generation was stopped when 20000 nodes had been successfully generated. Figure 4 shows four sample trajectories that were generated with each of the local planners. It can be seen that the resulting trajectories are similar, with the exception of the one obtained with planner L+W−T+; nonetheless, the four trajectories belong to the same homotopy class.
Fig. 4. Trajectories generated using the four different local planners for given start and goal states. Panels (a)-(d) and (e)-(h) show the trajectories obtained with L+W+T+, L+W*+T+, L+W−T+, and L+W−T−, respectively.
As expected (see Table 4), for each local planner the planning times are quite similar to those reported in Experiment 1. This is because the planning time is mainly determined by the number of nodes, which was set to 20000 per tree in both experiments. Regarding the trajectory costs, planners L+W+T+, L+W*+T+ and L+W−T− obtained similar costs, in contrast to planner L+W−T+, whose cost is considerably larger. The best cost comes from L+W+T+, followed by L+W−T− and L+W*+T+. Figure 5 shows the evolution of the average
Table 4. Average total planning time and total cost for Experiment 2.

             Environment 1                  Environment 2
Planner      Plan. Time (sec)   Tot. Cost   Plan. Time (sec)   Tot. Cost
L+W+T+       325.3999           194.4113    326.4784           165.4334
L+W*+T+      265.6703           198.1895    261.3032           171.299
L+W−T+       637.5478           990.7503    600.3666           867.1575
L+W−T−       194.9445           196.5391    194.37             166.6844
accumulated planning times and running costs as a function of the number of nodes for Environment 2. Similar tendencies were observed in Environment 1. Costs of planner L+W−T+ are not shown, as they are huge compared to those of the other three.
7 Discussion and Conclusions
Based on the statistics obtained in Sect. 6, differences in performance among the planners are evident. The local planner that achieved the best cost for a fixed number of nodes was L+W+T+, which considered the whole dictionary of optimal words. This is reasonable, as that planner delivers optimal trajectories that connect any two states z ∈ X and z′ ∈ X, given that there are no obstacles. Nonetheless, note that the difference in performance was not large compared against planners L+W*+T+ and L+W−T−. Regarding computation time, the best results came from planner L+W−T−. This is because the same concatenation of motion primitives is always executed, without the need for a procedure that first discerns which type of trajectory is optimal and then computes the proper parameters for the motion primitives. Surprisingly, L+W−T− outperformed L+W*+T+ in most of the trials in terms of total cost.
Fig. 5. Evolution of the average planning time and total cost as a function of the number of nodes for given start and goal states in Environment 2: (a) average total time; (b) average total cost. L+W+T+ is shown in blue, L+W*+T+ in orange, L+W−T+ in gray, and L+W−T− in yellow.
Apart from the performance comparison, the above results give insight into the necessary and sufficient conditions for the RRT* to achieve asymptotic optimality in the context of the kinodynamic problem. Table 5 summarizes a series of inferences that we deduce from the experimental observations of whether a given local planner converged to the minimal cost or not. For this analysis, the costs and trajectories yielded by local planner L+W+T+ are taken as a baseline. We believe this planner to be the most reliable, since it fulfils all the properties presented in [6,8]. In Table 5 each column represents a property to be satisfied by the local planner, and the rows refer to whether that property is a necessary or a sufficient condition. We mark a table entry with * when there is evidence from the previous section suggesting that the shown label (YES or NO) is correct, but further analysis is required to confirm the conjecture. The difference between the Optimal Words, Local Optimal Words and Subset Optimal Words properties (all three use optimal letters) is the following. The first one considers a planner with a complete dictionary of optimal words that optimally connects (given no obstacles) any pair of states z ∈ X and z′ ∈ X. The second one considers a planner with a subset of the optimal-words dictionary such that, given a state z ∈ X, there exists a neighbourhood ξ(z) around z where the planner yields an optimal trajectory to move from z to any z′ ∈ ξ(z). The third one has a subset of the optimal words that allows the robot to optimally traverse from a state z ∈ X to only some z′ ∈ ξ(z).
Table 5. Necessary and sufficient conditions.

             Optimal   Optimal   Local Optimal   Subset Optimal   Topological
             Letters   Words     Words           Words            Property
Necessary    YES*      NO        NO              YES*             NO
Sufficient   YES*      YES       YES             YES*             YES*
Starting with the Optimal Letters property, all four planners contained optimal letters, but since the behaviour of planner L+W−T+ can result either from slow convergence or from no convergence at all, the necessity of this property is set as a conjecture. Using the same reasoning, we set the sufficiency of this condition as a conjecture. Considering the Optimal Words property, both planners L+W*+T+ and L+W−T− were able to converge to the optimal cost, so this property is not necessary. On the other hand, planner L+W+T+ tells us that the property is sufficient; moreover, planners L+W*+T+ and L+W−T− converged with only a subset of the optimal words. Regarding the Local Optimal Words property, planner L+W−T− was able to converge to the optimal cost, so this property is not necessary. Planner L+W*+T+ confirms that this property is sufficient; furthermore, this planner coincides with L+W+T+ when two samples z ∈ X and z′ ∈ X are close enough. Considering the Subset Optimal Words property, planner L+W−T− seems to converge to the optimal cost; however, we did not test the
set of all possible planners, so the necessity of this property remains a conjecture. Again, planner L+W−T− did converge; however, this planner does not match the optimality conditions of a local steering method mentioned in [6], so we leave it as a conjecture that this property is sufficient.

Concerning the topological property, it was shown through planner L+W−T− that convergence is achievable even if the planner does not satisfy this property; hence, there is evidence that the topological property is not necessary. Two of the three planners that fulfil the topological property are evidently converging to the optimal cost; however, the behaviour of planner L+W−T+ makes us label the sufficiency of this property as a conjecture.

It is important to notice that L+W−T− does achieve convergence despite fulfilling neither the topological property nor optimality within a local neighbourhood. However, this planner respects the system dynamics and is able to approximate any path of the DDR up to a resolution. It is then important to recall that planner L−W−T−, the fifth planner in Sect. 5, is also able to approximate any path of the DDR subject to its dynamics, but it is not able to converge to the optimal cost as the resolution of the approximation increases. Moreover, that planner does not contain optimal letters, which can be considered evidence in favour of the necessity of the planner containing optimal letters. Thus, the conditions that a planner respects the system dynamics, that it is able to approximate any path of the system, and that the cost associated with the approximation converges to the actual cost of the optimal path x*, might be relevant to study and might be part of the sought necessary and sufficient conditions for the RRT* to achieve asymptotic optimality in the kinodynamic problem. Also keep in mind that approximating a geometric path that does not respect the system dynamics is different from approximating a path that does respect the system dynamics; the latter is the relevant case for the optimal kinodynamic problem.

Let us conclude by emphasizing that the main objective of this paper was to analyse local planners in the context of RRT* for dynamical systems. We have narrowed down the conditions that such planners must satisfy and also provided candidate properties that may be necessary (such as optimal letters) to achieve asymptotic optimality, yielding insight for further formal mathematical analysis.
References

1. Balkcom, D.J., Mason, M.T.: Time optimal trajectories for bounded velocity differential drive vehicles. Int. J. Rob. Res. 21(3), 199–217 (2002)
2. Barraquand, J., Latombe, J.: Nonholonomic multibody mobile robots: controllability and motion planning in the presence of obstacles. Algorithmica 10(2–4), 121–155 (1993)
3. Canny, J.: The Complexity of Robot Motion Planning. ACM Doctoral Dissertation Award. MIT Press, Cambridge (1988)
4. Hermann, R., Krener, A.: Nonlinear controllability and observability. IEEE Trans. Autom. Control 22(5), 728–740 (1977)
5. Hsu, D., Kindel, R., Latombe, J.C., Rock, S.: Randomized kinodynamic motion planning with moving obstacles. Int. J. Rob. Res. 21(3), 233–255 (2002). https://doi.org/10.1177/027836402320556421
6. Karaman, S., Frazzoli, E.: Optimal kinodynamic motion planning using incremental sampling-based methods. In: IEEE Conference on Decision and Control (2010)
7. Karaman, S., Frazzoli, E.: Sampling-based algorithms for optimal motion planning. Int. J. Rob. Res. 30(7), 846–894 (2011)
8. Karaman, S., Frazzoli, E.: Sampling-based optimal motion planning for nonholonomic dynamical systems. In: 2013 IEEE International Conference on Robotics and Automation, pp. 5041–5047 (2013)
9. Kavraki, L.E., Svestka, P., Latombe, J.C., Overmars, M.H.: Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Trans. Robot. Autom. 12(4), 566–580 (1996). https://doi.org/10.1109/70.508439
10. Latombe, J.C.: Robot Motion Planning. Kluwer Academic Publishers, Norwell (1991)
11. Laumond, J.P., Jacobs, P.E., Taix, M., Murray, R.M.: A motion planner for nonholonomic mobile robots. IEEE Trans. Robot. Autom. 10(5), 577–593 (1994)
12. Laumond, J.P., Sekhavat, S., Lamiraux, F.: Guidelines in nonholonomic motion planning for mobile robots. In: Laumond, J.P. (ed.) Robot Motion Planning and Control, pp. 1–53. Springer, Heidelberg (1998)
13. LaValle, S.M., Kuffner, J.J.: Randomized kinodynamic planning. Int. J. Rob. Res. 20(5), 378–400 (2001)
14. Li, Y., Littlefield, Z., Bekris, K.E.: Asymptotically optimal sampling-based kinodynamic planning. Int. J. Rob. Res. 35(5), 528–564 (2016)
15. Noreen, I., Khan, A., Habib, Z.: Optimal path planning using RRT* based approaches: a survey and future directions. Int. J. Adv. Comput. Sci. Appl. 7(11) (2016). https://doi.org/10.14569/IJACSA.2016.071114
16. Perez, A., Platt, R., Konidaris, G., Kaelbling, L., Lozano-Perez, T.: LQR-RRT*: optimal sampling-based motion planning with automatically derived extension heuristics. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp. 2537–2542 (2012)
17. Sekhavat, S., Laumond, J.P.: Topological property for collision-free nonholonomic motion planning: the case of sinusoidal inputs for chained form systems. IEEE Trans. Robot. Autom. 14(5), 671–680 (1998)
18. Webb, D.J., van den Berg, J.: Kinodynamic RRT*: asymptotically optimal motion planning for robots with linear dynamics. In: 2013 IEEE International Conference on Robotics and Automation, pp. 5054–5061 (2013)

Hardness of 3D Motion Planning Under Obstacle Uncertainty

Luke Shimanuki¹ and Brian Axelrod²
¹ Massachusetts Institute of Technology, Cambridge, MA 02139, USA
[email protected]
² Stanford University, Stanford, CA 94305, USA
[email protected]
Abstract. We consider the problem of motion planning in the presence of uncertain obstacles, modeled as polytopes with Gaussian-distributed faces (PGDF). A number of practical algorithms exist for motion planning in the presence of known obstacles by constructing a graph in configuration space, then efficiently searching the graph to find a collision-free path. We show that such a class of algorithms is unlikely to be efficient in the domain with uncertain obstacles. In particular, we show that safe 3D motion planning among PGDF obstacles is NP-hard with respect to the number of obstacles, and remains NP-hard after being restricted to a graph. Our reduction is based on a path encoding of 3SAT and uses the risk of collision with an obstacle to encode the variable assignment. This implies that, unlike in the known case, planning under uncertainty is hard, even when given a graph containing the solution.

Keywords: Collision avoidance · Completeness and complexity · Motion and path planning

1 Introduction
Navigation under uncertainty is one of the most basic problems in robotics. While there are many methods to plan a trajectory between two points in a known environment with strong theoretical guarantees, few of them generalize to obstacles whose locations are estimated by noisy sensors. It has proven much harder to provide strong completeness, runtime, and optimality guarantees in this setting. While some of the original work addressing planning under uncertainty was able to capture the additional richness of this problem by modeling it as a partially observable Markov decision process (POMDP) [8], it has proven difficult to solve POMDPs for complicated real-world problems despite large advances in POMDP solvers [17,29]. In fact, solving POMDPs is PSPACE-complete in the finite horizon and undecidable otherwise, suggesting that it is likely not possible to find a general, efficient algorithm for solving POMDPs [21]. Luckily, navigating among uncertain obstacles is a significantly more restricted problem class than POMDPs, giving us hope that we might find an
algorithm that is efficient in practice and gives strong theoretical guarantees such as completeness and safety. Axelrod et al. proposed solving an approximation of the navigation-under-uncertainty problem [3,4]. Instead of trying to compute a path that minimizes the true probability of collision under any distribution of obstacles, they propose solving a restricted problem where the obstacles are limited to a structured class of distributions and the collision probability is approximated using a shadow (the geometric equivalent of a confidence interval). While shadow bounds are inherently loose (they overestimate the probability of collision when the obstacle is likely to be far away from the estimated location), they greatly decrease the computational complexity of bounding the probability of collision, since only the space visited by the robot close to the obstacle affects the probability bound. Axelrod et al. posed the following question: is there an efficient algorithm that, given a graph embedded in R^n and a set of obstacles, can find the path with minimal risk as computed via a shadow bound [4]? Unlike in the original problem without the shadow approximation, the cost function is only influenced by the portion of the trajectory close to the obstacle and has submodular structure with respect to the graph. The fact that similar approximations have worked well for motion planning, and the existence of efficient algorithms for certain classes of submodular minimization problems, gave hope that it might be possible to find an efficient algorithm for this problem as well. While motion planning is hard in general, practical and efficient algorithms have proven very successful under some assumptions [18]. One such body of work is the sampling-based motion-planning methods. These algorithms often rely on the assumption that the problem can be split into two pieces: first use a practically (though often not worst-case) efficient method to generate a small graph that contains a solution; then use a standard, efficient graph algorithm to find the solution in this graph. Algorithms based on this scheme have been successful even for high-dimensional planning problems for robots with many degrees of freedom. There are several other classes of practically efficient algorithms (including grid-based and optimization-based planners) that rely on the assumption that part of the problem may be solved much more efficiently in the average case than in the worst case. We discuss this further in the background section. This paper answers the question posed by Axelrod et al. in the negative [4].

Theorem 1. Safe path planning in the presence of uncertain obstacles in 3 dimensions is NP-hard.

A more formal statement of this result is presented in Sect. 3. The proofs presented in this paper illuminate what makes this problem more difficult than the standard motion-planning problem with known obstacles. Searching for the minimum-risk path does not have a Markov-like structure. Unlike in the shortest-path problem on a graph, the risk of the second half of a trajectory is very much affected by the first half. This means that the problem lacks the Bellman property, as identified by Salzman et al. [26].
The absence of a Markov-like property for the risk over the path seems almost necessary to capture important parts of the original problem. The probabilities that one collides with an obstacle are most certainly correlated across the trajectory since collision is inherently a local property. We further discuss sufficient properties of the random obstacle model to guarantee hardness in the conclusion.
2 Background

Motion planning for robotics has been extensively studied in many different settings. One high-level distinction is between motion planning in a known environment and planning in an environment that is not known.

2.1 Complexity in Motion Planning
The story of motion-planning algorithms in robotics has been one of walking the fine boundaries of complexity classes. On one hand, motion planning is PSPACE-hard in R^3 [24] and R^2 [10,13] with respect to the number of arms of a robot (and thus the dimension of its configuration space). However, while Canny's work on singly-exponential-time (with respect to the number of arms) roadmaps leads to a polynomial-time algorithm when the number of arms is fixed [6], a different set of algorithms is used in practice. The robotics community has been able to find practically efficient methods that provide meaningful theoretical guarantees weaker than completeness (finding a solution if one exists). Sampling-based planners such as Rapidly-exploring Random Trees (RRTs) [18,19] and Probabilistic Roadmaps (PRMs) [16] are both practically efficient and probabilistically complete under some regularity conditions. Given effective heuristics, graph-based planners have also proved efficient and provide resolution completeness [18]. Searching for optimal plans, as opposed to simply feasible plans, further increases the difficulty. In a classic result, Canny shows that the 3D shortest-path problem is NP-hard for a simple robot in terms of the number of obstacles [7]. This ruled out results in the vein of Canny's roadmap algorithm, which showed fixed-parameter tractability in the feasible motion-planning case. However, the community has been able to find practically efficient algorithms regardless of these worst-case results. A modified version of the original sampling-based algorithms allows them to return nearly optimal solutions in the limit [15], and graph-based planning algorithms are able to provide bounds on the suboptimality of their solutions [1]. Another motion-planning problem that lacks a Markov property is the minimum constraints removal problem, where the objective is to find a path that collides with the fewest obstacles. This problem was shown to be NP-hard in Cartesian spaces of dimension 3 [12].
2.2 Planning Under Uncertainty
While planning under uncertainty has been broadly studied in robotics, few methods have formal guarantees on solution quality and efficient runtime. We survey some of the related work below.
Many works assume some sort of uncertainty about the environment, but do not propose a model in which to rigorously quantify the uncertainty in the environment and provide guarantees about the success probability of the trajectory. Instead they often rely on heuristics that seem to provide the desired behavior. One line of work focuses on uncertainty in the robot's position. Here the model of the robot itself is "inflated" before collision checking, ensuring that any slight inaccuracy in the position estimate or tracking of the trajectory does not result in a collision. Work that focuses on uncertainty in the environment sometimes does the exact opposite: it inflates the occupied volume of the obstacle with a "shadow" and ensures that any planned trajectory avoids the shadow [14,20]. A more general approach that handles either or both of localization and obstacle uncertainty is belief-space planning. Belief space is the set of all possible beliefs about, or probability distributions over, the current state. Belief-space planning converts the uncertain domain in state space to belief space, then plans in belief space using trees [5,23] or control systems [22]. Another line of work uses synthesis techniques to construct a trajectory intended to be safe by construction. If the system is modeled as a Markov decision process with discrete states, a safe plan can be found using techniques from formal verification [9,11]. Other authors have used techniques from Signal Temporal Logic combined with an explicitly modeled uncertainty to generate plans that are heuristically safe [25]. Recent work by Hauser on the minimum constraints removal problem randomly samples many draws for each obstacle and finds the path that intersects the fewest samples, demonstrating low runtime and error on average although with poor worst-case performance [12].

Fig. 1. The orange set is a shadow of the obstacle. The blue set is the obstacle represented by the mean parameters.
Shadows. In previous work, Axelrod et al. formalized the notion of a shadow in a way that allowed the construction of an efficient algorithm to bound the probability that a trajectory will collide with the estimated obstacles [3,4]. We can now define a shadow rigorously:

Definition 1 (ϵ-shadow). A set S ⊆ R^d is an ϵ-shadow of a random obstacle O ⊆ R^d if Pr[O ⊆ S] ≥ 1 − ϵ.

Shadows are important because they allow for an efficient method to upper-bound the probability of collision. If there exists an ϵ-shadow of an obstacle that does not intersect a given trajectory's swept volume, then the probability of the obstacle intersecting the trajectory is at most ϵ. An example of a shadow for an obstacle is shown in Fig. 1.
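For intuition, the probability that a shadow upper-bounds can also be estimated by brute-force sampling. The sketch below (ours, not the authors') estimates the collision probability of a discretized swept volume against a single-face obstacle {x : αᵀx ≤ 1}, the face convention used in Sect. 4; unlike a shadow bound, such a Monte Carlo estimate carries no worst-case guarantee, and Sect. 4 notes that Monte Carlo shadows may violate the assumptions used there.

    import numpy as np

    def mc_collision_prob(mu, Sigma, swept_points, n_trials=20000, seed=0):
        # Estimate Pr[swept volume intersects the obstacle] for a single
        # uncertain face {x : a^T x <= 1} with a ~ N(mu, Sigma), by sampling
        # face realizations and testing the discretized swept volume
        # (one point per row of swept_points).
        rng = np.random.default_rng(seed)
        A = rng.multivariate_normal(mu, Sigma, size=n_trials)
        hits = (A @ swept_points.T <= 1.0).any(axis=1)
        return hits.mean()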
3 Problem Formulation

3.1 Notation
In this section we cover definitions and notation conventions used in this paper.

– A vector is marked in bold, as in u, in contrast to a scalar u.
– ∧, ∨, and ¬ are the logical AND, OR, and NOT operators, respectively.
– The power set (set of all subsets) of S is denoted by P(S). A function mapping into the power set of R^n outputs subsets of R^n.

Definition 2 (Standard Basis Vector). A standard basis vector in d dimensions is a vector e_i ∈ R^d that is 1 in the ith dimension and 0 in the remaining dimensions.
3.2 Random Obstacle Model
In order to provide formal non-collision guarantees, one must first model the uncertainty in the environment. At a high level we assume that each episode of the robot's interaction happens in the following sequence:

1. A set of obstacles is drawn from a known distribution (the conditional distribution of the obstacles given the sensor observations). These obstacles remain static for the duration of the episode.
2. The robot computes, commits to, and executes a trajectory.
3. The probability of collision in question is exactly the probability that this trajectory collides with at least one of the obstacles.

In this work we restrict ourselves to polytopes with Gaussian-distributed faces (PGDFs). A PGDF is the intersection of halfspaces with parameters drawn from a multivariate normal distribution. More formally, a PGDF obstacle O ⊂ R^n is O = ⋂_i {x : α_i^T x ≤ 0}, where α_i ∼ N(μ_i, Σ_i). We can use homogeneous coordinates to create obstacles not centered about the origin. One reason that PGDF obstacles are important is that we have methods of computing shadows for PGDF obstacles efficiently [3]. We note that this formulation differs from the notion of "risk zones" evaluated by Salzman et al., where the cost of a trajectory is proportional to the amount of time spent within a risk zone [26,27]. These problems share the lack of an optimal substructure: the subpaths of an optimal path are not necessarily optimal. Salzman et al. provide a generalization of Dijkstra's algorithm that finds minimum-risk plans in their domain efficiently [26], but as we will show, there are no such techniques for our problem.
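As a concrete illustration of the model (our sketch; the two-face example and all numbers are invented), one can draw a PGDF realization face by face and test membership of a point, using the homogeneous-coordinate trick mentioned above to place faces away from the origin.

    import numpy as np

    def sample_pgdf(faces, rng):
        # One realization of a PGDF obstacle: one normal per face,
        # alpha_i ~ N(mu_i, Sigma_i); `faces` is a list of (mu, Sigma).
        return [rng.multivariate_normal(mu, Sigma) for mu, Sigma in faces]

    def contains(alphas, p):
        # p lies inside the realization iff alpha_i^T x <= 0 for every face,
        # with x = (p, 1) in homogeneous coordinates.
        x = np.append(p, 1.0)
        return all(a @ x <= 0.0 for a in alphas)

    # A 1D "slab" roughly covering 1 <= p <= 3, built from two noisy faces.
    rng = np.random.default_rng(0)
    faces = [(np.array([ 1.0, -3.0]), 0.01 * np.eye(2)),   #  p - 3 <= 0
             (np.array([-1.0,  1.0]), 0.01 * np.eye(2))]   # -p + 1 <= 0
    print(contains(sample_pgdf(faces, rng), np.array([2.0])))  # usually True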
3.3 Algorithmic Question
Now that we have defined shadows and PGDF obstacles, we can define what it means for a path to be safe. Suppose the robot operates, and the obstacles live, in R^d (d is usually 3). This is commonly referred to as the task space. Furthermore, suppose the configuration space of the robot is parametrized in R^k (usually corresponding to the k degrees of freedom of the robot). Since planning usually happens in the robot's configuration space, but the obstacles are in task space, we need to be able to convert between the two.

Definition 3 (Embedding Map). A function f : R^k → P(R^d) is an embedding map if it maps robot configurations into the subset of R^d that is occupied by the robot at that configuration.

The embedding map can usually be constructed by combining the forward kinematics and the robot model.

Definition 4 (Configuration-Space Trajectory). A configuration-space trajectory τ : [0, 1] → R^k is a map from a "time" index into the trajectory to the robot configuration at that point in the trajectory.

Definition 5 (Task-Space Trajectory). A task-space trajectory τ̄ : [0, 1] → P(R^d) is defined as the map between an index into the trajectory and the space occupied by the robot at that point in the trajectory. Alternatively, given a configuration-space trajectory τ, τ̄(t) = f(τ(t)), where f is the embedding map.

For the rest of the paper we will only concern ourselves with task-space trajectories, noting that it is easy to go from a configuration-space trajectory to a task-space trajectory using the embedding map.

Definition 6 (Swept Volume). The swept volume X of a task-space trajectory τ̄ is the set of task-space points touched by the robot while executing the trajectory. Said differently, X = ⋃_{t∈[0,1]} τ̄(t).
This allows us to formally define what it means for a trajectory to be safe.

Definition 7 (ϵ-safe trajectory). Given a joint distribution over random obstacles, a task-space trajectory is ϵ-safe if the corresponding swept volume has probability at most ϵ of intersecting at least one obstacle.

This leads to the following algorithmic question of finding safe plans for a known distribution of PGDF obstacles.

Problem 1 (ϵ-safe Planning Problem). Given the parameters of PGDF distributions for each obstacle and initial and end points s, t in configuration space, find an ϵ-safe trajectory from s to t.
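The sketch below (ours) makes Definitions 3–6 concrete for a disc robot translating in the plane, discretizing both the time index and the robot footprint; the exact swept volume is the limit of this point cloud as both discretizations are refined.

    import numpy as np

    def disc_embedding(q, rho=0.1, m=32):
        # Embedding map f for a disc robot of radius rho at position q in R^2:
        # a boundary discretization stands in for the occupied set f(q).
        ang = np.linspace(0.0, 2.0 * np.pi, m, endpoint=False)
        return q[None, :] + rho * np.stack([np.cos(ang), np.sin(ang)], axis=1)

    def swept_volume_points(tau, embed, n_steps=100):
        # Discretized swept volume: union of f(tau(t)) over sampled t in [0,1].
        return np.vstack([embed(tau(t)) for t in np.linspace(0.0, 1.0, n_steps)])

    # e.g. a straight-line configuration-space trajectory from (0,0) to (1,0):
    pts = swept_volume_points(lambda t: np.array([t, 0.0]), disc_embedding)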
Note that there exist reductions between the ϵ-safe planning problem and the problem of finding a path that minimizes the risk of collision. Since the probability is confined to [0, 1], a binary search over ϵ yields an efficient algorithm that can approximately compute the minimum risk given an ϵ-safe planner. For convenience, our proofs will consider the approximate minimum-risk planning problem, though the construction applies directly to ϵ-safe planning as well.

Problem 2 ((1 + α)-approximate minimum-risk planning problem). Given the parameters of PGDF distributions for each obstacle and initial and end points s, t in configuration space, return a ((1 + α)ϵ*)-safe trajectory from s to t, where ϵ* is the minimum ϵ for which an ϵ-safe trajectory exists.
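The reduction described above is simple enough to state as code. A sketch (ours), assuming a hypothetical oracle safe_planner(eps) that returns an ϵ-safe trajectory when one exists and None otherwise:

    def approx_min_risk(safe_planner, tol=1e-3):
        # Binary search on eps in [0, 1]: shrink eps while an eps-safe plan
        # exists, so `hi` converges to (roughly) the minimum achievable risk.
        lo, hi, best = 0.0, 1.0, None
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            traj = safe_planner(mid)
            if traj is not None:
                best, hi = traj, mid     # a mid-safe plan exists; tighten
            else:
                lo = mid                 # none exists; relax the risk budget
        return hi, best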
3.4 Graph Restriction
We start by considering the class of motion-planning algorithms that first construct a graph embedded in the robot's configuration space, and then run a graph-search algorithm to find a path within the graph. This class of algorithms has been shown to be practical in the known environment, with sampling-based planners such as RRT and RRG. Conditioned on there being a nonzero probability of sampling a solution, these algorithms are guaranteed to find a collision-free path with probability approaching 1 as the number of iterations approaches infinity [15,19]. More formally, this condition can be articulated as the existence of a path in the δ-interior of the free space X_free.

Definition 8 (δ-interior [15]). A state x ∈ X_free is in the δ-interior of X_free if the closed ball of radius δ around x lies entirely in X_free.

This condition is necessary because it guarantees that finding a plan does not require waiting for a zero-probability event. However, this formulation does not extend well to the domain with uncertain obstacles; there is no concept of "free space" because the locations of the obstacles are not known. Instead we will use the equivalent view of inflating the path instead of shrinking the free space.

Definition 9 (δ-inflation). The δ-inflation of the set X is the set Y = ⋃_{x∈X} {y | d(x, y) ≤ δ}.
We note that in the deterministic setting, if a trajectory is in the δ-interior of X_free, then the δ-inflation of the trajectory is entirely in X_free. This allows us to consider problems with the following regularity condition: there exists a δ-inflated task-space trajectory that has a low risk of collision.

Definition 10 (ϵ-safe δ-inflated task-space trajectory). A task-space trajectory is an ϵ-safe δ-inflated trajectory if its δ-inflation intersects an obstacle with probability at most ϵ.

We want to find an algorithm that satisfies the completeness and safety guarantees defined below.
Definition 11 (Probabilistically Complete (1 + α)-approximate Safe Planning Algorithm). A planning algorithm takes a set of PGDF obstacles, a start state s, and a goal state t as input and generates a path as output. A planning algorithm is probabilistically complete and (1 + α)-approximate safe if, with n samples, the probability that it finds a ((1 + α)ϵ*)-safe trajectory approaches 1 as n approaches ∞, where ϵ* is the minimum ϵ for which an ϵ-safe trajectory exists.

Axelrod et al. provide an extension of the RRT algorithm to the probabilistic domain using the shadow approximation [3]. The uniqueness of paths between any two vertices in a tree makes finding the optimal (restricted to the tree) path trivial. However, while the paths it generates are indeed safe, the algorithm is not probabilistically complete. The following extension of the RRG algorithm is probabilistically complete [2].

Algorithm 1. SAFE RRG
Input: End points s, t ∈ R^d, set of PGDF obstacles O, and number of samples n.
Output: A ((1 + α)ϵ*)-safe trajectory from s to t, where ϵ* is the minimum ϵ for which an ϵ-safe trajectory exists.
1: G = CONSTRUCT RRG(s, t, n)
2: return GRAPH SEARCH(G, O, s, t)
We note that as n increases, the probability that there is a sample near any given point x in the space approaches 1. Here, GRAPH SEARCH is a (1 + α)-approximate safe graph-search algorithm as defined below.

Definition 12 ((1 + α)-approximate safe graph-search algorithm). A (1 + α)-approximate safe graph-search algorithm is a procedure φ(G, O, s, t), where G is a graph, O is a set of PGDF obstacles, and s and t are the start and end nodes in G, respectively. It returns a ((1 + α)ϵ*)-safe trajectory in G, where ϵ* is the minimum ϵ for which an ϵ-safe trajectory exists.

Theorem 2 ([2]). SAFE RRG is probabilistically complete and (1 + α)-approximate safe as long as GRAPH SEARCH is complete and (1 + α)-approximate safe.

However, no graph-search procedure beyond the naïve, exponential-time search procedure is provided [2]. Sampling-based motion-planning algorithms succeed in the known environment because efficient graph-search algorithms can quickly find collision-free paths within a graph. In order for this class of motion-planning algorithms to be practical, we would need a corresponding graph-search algorithm in the probabilistic domain. Because the cost of a path depends on what set of shadows it intersects, the state space of the graph search is not just the current node but also includes the accumulated risk incurred by each obstacle.
This means that the typical approaches for searching graphs with known obstacles, which make use of dynamic programming, cannot be applied in the same manner to graphs with unknown obstacles. Unfortunately, as will be shown in the remainder of this paper, this problem is NP-hard with respect to n = Θ(|G| + |O|), the size of the input.

Theorem 3. Unless P = NP, there is no (1 + α)-approximate safe graph-search algorithm that, given a graph with n nodes, runs in POLY(n) time, when restricted to O(n) obstacles in R^{O(n)}, where α = Θ(1/n).

And we can strengthen this to show that the minimum-risk planning problem in constant dimension is hard in general, that is, even when not restricted to a graph.

Theorem 4. The (1 + α)-approximate minimum-risk planning problem is NP-hard. That is, unless P = NP, there is no (1 + α)-approximate Safe Planning Algorithm for R^3 that runs in POLY(n) when α = Θ(1/n), even when provided a graph containing the solution.
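The naïve search alluded to above can be written down directly: because the risk bound depends on the set of shadow regions the path has already entered, the search state must carry that set, and the reachable state space can be exponential in |O|. A sketch (ours), assuming the bound risk_of is nondecreasing under set inclusion, as shadow-based bounds are:

    import heapq
    import itertools

    def exhaustive_min_risk(graph, edge_regions, risk_of, s, t):
        # Best-first search over states (node, regions-entered-so-far).
        # `edge_regions[(u, v)]` lists the shadow regions edge (u, v) enters;
        # `risk_of` maps a frozenset of regions to a risk upper bound.
        # Correct when risk_of is monotone, but worst-case exponential in the
        # number of regions, which is what Theorem 3 says cannot be avoided.
        tie = itertools.count()                  # tiebreaker for the heap
        pq = [(risk_of(frozenset()), next(tie), (s, frozenset()), [s])]
        seen = set()
        while pq:
            risk, _, (u, regions), path = heapq.heappop(pq)
            if u == t:
                return risk, path
            if (u, regions) in seen:
                continue
            seen.add((u, regions))
            for v in graph[u]:
                nr = regions | frozenset(edge_regions.get((u, v), ()))
                heapq.heappush(pq, (risk_of(nr), next(tie), (v, nr), path + [v]))
        return None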
4 Basic Hardness Result

4.1 3SAT
3SAT is an NP-complete problem that is commonly used to prove the hardness of other problems [28]. The problem input is a Boolean formula given in conjunctive normal form, where each clause consists of three literals; in other words, it is of the form ((x_0 ∨ ¬x_1 ∨ x_2) ∧ (x_1 ∨ ¬x_3 ∨ ¬x_4) ∧ ...). The algorithm must then decide whether there exists any variable assignment that satisfies the formula. We will consider a 3SAT problem with k variables x_0, x_1, ... and m clauses, where each clause j is of the form (x_{j_u} ∨ ¬x_{j_v} ∨ x_{j_w}).
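For reference, a brute-force decision procedure for 3SAT (exponential in k, of course); the literal encoding below is our own convention:

    from itertools import product

    def satisfiable(clauses, k):
        # A clause is a triple of (variable index, negated?) pairs; e.g.
        # ((0, False), (1, True), (2, False)) encodes (x0 or not x1 or x2).
        for bits in product((False, True), repeat=k):
            if all(any(bits[i] != neg for i, neg in c) for c in clauses):
                return True
        return False

    # (x0 or not x1 or x2) and (x1 or not x3 or not x4) is satisfiable:
    print(satisfiable([((0, False), (1, True), (2, False)),
                       ((1, False), (3, True), (4, True))], k=5))  # True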
4.2 Proof Outline
We prove Theorem 3 using a reduction from 3SAT. Given a 3SAT instance, we construct a (1 + α)-approximate safe graph-search problem as follows.

1. Construct a pair of obstacles for each variable that will encode whether the variable is set to true or false.
2. Construct a portion of the graph to force the algorithm to assign every variable by going near either the true or the false obstacle for each variable.
3. Construct the remainder of the graph to force the algorithm to satisfy every clause. There will be additional collision risk for a path that goes near both the true and false obstacles (i.e., uses both x and ¬x).

The solution to the planning problem can then be transformed into a solution to the 3SAT instance in polynomial time, demonstrating that the (1 + α)-approximate safe graph-search problem is at least as hard as 3SAT. The proof for R^3 will use a similar technique with a more complicated construction that folds the graph into R^3.
4.3 Proof
Our graph and obstacle set will live in R^d with d = k + 1. Each of the first k dimensions corresponds to a variable, and dimension t = k + 1 can be thought of as "time", or some monotonically increasing value, as we progress along a path.

Variable Assignment. First, for each variable i in the 3SAT problem, we will construct two halfspace PGDF obstacles. Note that we define a PGDF obstacle as the intersection of halfspaces of the form αᵀx ≤ 0 for α normally distributed and x represented in homogeneous coordinates. Here we will work with just one face and standard coordinates for convenience. That is, each obstacle i will be defined as α_iᵀx ≤ 1 for α_i ∼ N(μ_i, Σ_i). For each variable i we define a true obstacle and a false obstacle with the following parameters; true: α ∼ N(2e_i, e_i e_iᵀ), false: α ∼ N(−2e_i, e_i e_iᵀ). Intuitively, the covariance e_i e_iᵀ means that α has variance 1 in the direction of the normal of the face. This is important because it means that there is no variance in the orientation of the face. Then we will construct a graph that forces any path to select a true or false assignment for each variable, as illustrated in Fig. 2. Said formally, indexing over the variables with index i, we embed nodes at locations −(i + 1)e_t, −(i + 1/2)e_t ± e_i, and −i e_t. We then draw edges from −(i + 1)e_t to both −(i + 1/2)e_t ± e_i, and from both −(i + 1/2)e_t ± e_i to −i e_t.

Fig. 2. A path through this gadget must select one of the obstacles to go near: either the positive assignment x (obs_{2i}) or the negative assignment ¬x (obs_{2i+1}).
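The gadget is mechanical enough to generate programmatically. A sketch (ours) that emits the 2k single-face obstacles and the variable-assignment chain just described; node coordinates follow the text, with the last axis playing the role of the "time" dimension t:

    import numpy as np

    def variable_gadget(k):
        # Obstacles: for each variable i, a "true" face alpha ~ N(+2 e_i, e_i e_i^T)
        # and a "false" face alpha ~ N(-2 e_i, e_i e_i^T), each used as alpha^T x <= 1.
        d = k + 1
        e = np.eye(d)
        t = d - 1                                   # index of the time dimension
        obstacles = []
        for i in range(k):
            cov = np.outer(e[i], e[i])              # variance only along the normal
            obstacles.append((+2.0 * e[i], cov))    # true obstacle for x_i
            obstacles.append((-2.0 * e[i], cov))    # false obstacle for x_i
        nodes, edges = set(), []
        for i in range(k):                          # the chain forcing a choice
            a, b = tuple(-(i + 1) * e[t]), tuple(-i * e[t])
            for sgn in (+1.0, -1.0):
                v = tuple(-(i + 0.5) * e[t] + sgn * e[i])
                nodes |= {a, v, b}
                edges += [(a, v), (v, b)]
        return obstacles, sorted(nodes), edges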
Fig. 3. A path through this gadget must take one of three paths, each extending in a different dimension. Each path goes near an obstacle corresponding to the respective literal.
Clause Gadget. For each clause j, we will construct a subgraph that lets the algorithm choose which variable satisfies the clause, as shown in Fig. 3. Recall that each clause j is of the form (x_{j_u} ∨ ¬x_{j_v} ∨ x_{j_w}). First, indexing over j, construct "via" nodes at j e_t and (j + 1)e_t, respectively. Then, if x_{j_u} appears positively, construct a node connected to the "via" nodes at (j + 1/2)e_t + e_{j_u}, or at (j + 1/2)e_t − e_{j_u} if the variable appears negatively. Repeat for x_{j_v} and x_{j_w}. This is illustrated in Fig. 3. A path through this gadget must pick one of the literals in the clause to satisfy and pass near the obstacle that corresponds to that variable and the
value the literal requires it to have. In doing so, it may incur risk of intersecting that obstacle. If the variable was assigned the value the literal specifies, then the path would already have gone near this obstacle, so no further risk is incurred. However, if the literal contradicts the variable assignment, the path incurs additional risk for going near this obstacle.

Path Risk Encoding 3SAT. The above graph was constructed such that there exists a gap between the risk of a satisfying assignment and that of a non-satisfying assignment. First we note that for any reasonable shadow approximation (including the ones presented in [3]), there exists a gap between the induced risk close to the obstacle and far away from the obstacle. For shadow approximation schemes where this is the case, there is some r_close that lower-bounds the risk computed from the shadow approximation for the closer points, and some r_far that upper-bounds the computed risk for the farther points. This holds true for all shadows derived from the methods in [4] but may not hold for shadows that are computed via Monte Carlo algorithms.

A path through the variable assignment portion of the graph goes near k obstacles for the k variable assignments it makes. It is then "close" to k obstacles and "far" from the other k obstacles; therefore, it incurs risk k·r_close + k·r_far. If a path through the variable assignment portion encodes a satisfying assignment of the 3SAT problem, there exists a path through the remainder of the graph that incurs no additional cost. If there is no satisfying assignment, then any path through the remaining portion must go near an obstacle that it did not go near in the variable assignment portion, so for some variable i, the optimal path must go close to both the true and false obstacles, incurring cost at least (k + 1)·r_close + (k − 1)·r_far. This allows us to compute a lower bound on the ratio between the two risks:

risk ratio = ((k + 1)·r_close + (k − 1)·r_far) / (k·r_close + k·r_far)
           = (k·r_close + k·r_far + r_close − r_far) / (k·r_close + k·r_far)
           = 1 + (r_close − r_far) / (k·r_close + k·r_far)
           = 1 + Θ(1/k).

Each gadget can be constructed in polynomial time, and the number of gadgets is polynomial, so this reduction can be constructed in polynomial time. Thus, any algorithm that can approximate the minimum-risk planning problem on a graph to a factor better than 1 + Θ(1/k) can also solve 3SAT with polynomial overhead.
5 Hardness Result in R^3
We prove Theorem 4 by reducing from 3SAT, using the same outline as for Theorem 3 but with more complicated gadgets. We also add deterministic obstacles to force any procedure that computes a safe plan to decide the 3SAT problem, showing that the problem is hard both with and without the graph restriction.
5.1 Proof
Variable Gadgets. As before, we will construct two PGDF obstacles for each variable, but this time each will have three faces. We will continue to use standard coordinates for the remainder of this proof. For obstacle i, the true obstacle is defined as the intersection of α_iᵀx ≤ 1 and i ≤ e_2ᵀx ≤ i + 1, where α_i ∼ N(2e_1, e_1 e_1ᵀ). The "negative" obstacle is similarly defined by β_iᵀx ≤ 1 and i ≤ e_2ᵀx ≤ i + 1, where β_i ∼ N(−2e_1, e_1 e_1ᵀ). Then we construct the variable assignment graph, as illustrated in Fig. 4. Said formally, indexing over the variables with index i, we embed nodes at locations i e_2, (i + 1/2)e_2 ± e_1, and (i + 1)e_2. We then draw edges from i e_2 to both of (i + 1/2)e_2 ± e_1, and from both of (i + 1/2)e_2 ± e_1 to (i + 1)e_2.

Fig. 4. A path through this gadget must go near either the true or the false obstacle for each variable, thereby selecting a variable assignment.

Because for this proof we allow any path, not just those restricted to the graph, we need an additional obstacle to block paths from crossing in between the variable obstacles. This obstacle is deterministic and is defined as the intersection of −1/8 ≤ e_1ᵀx ≤ 1/8, i + 1/8 ≤ e_2ᵀx ≤ i + 7/8, and e_3ᵀx ≤ 1/2. We also need an obstacle above the gadget to prevent a path from going straight up without first passing through the gadget. This obstacle is defined as the intersection of e_2ᵀx ≤ k + 1/2 and 1/4 ≤ e_3ᵀx ≤ 3/4.

Clause Gadgets. For each clause j we construct an additional graph "layer", as illustrated in Fig. 5. Recall that each clause j is of the form (x_{j_u} ∨ ¬x_{j_v} ∨ x_{j_w}). Without loss of generality, let j_u < j_v < j_w. Indexing over j, construct nodes at (j + 1)e_3, (j + 1)e_3 + (j_u + 1/2)e_2, (j + 4/3)e_3 + (j_u + 1/2)e_2 ± e_1, (j + 5/3)e_3 + (j_u + 1/2)e_2, and (j + 5/3)e_3, drawing edges between consecutive nodes and letting '±' represent '−' if x_{j_u} appears in negated form and '+' otherwise. Then construct nodes at (j + 1)e_3 + (j_u + 1/2)e_2, (j + 1)e_3 + (j_v + 1/2)e_2, (j + 4/3)e_3 + (j_v + 1/2)e_2 ± e_1, (j + 5/3)e_3 + (j_v + 1/2)e_2, and (j + 5/3)e_3 + (j_u + 1/2)e_2 (the first and last were already constructed previously), drawing edges between consecutive nodes and similarly setting '±' based on the negation of literal x_{j_v}. Then construct nodes at (j + 1)e_3 + (j_v + 1/2)e_2, (j + 1)e_3 + (j_w + 1/2)e_2, (j + 4/3)e_3 + (j_w + 1/2)e_2 ± e_1, (j + 5/3)e_3 + (j_w + 1/2)e_2, and (j + 5/3)e_3 + (j_v + 1/2)e_2 (the first and
last were already constructed previously), drawing edges between consecutive nodes and similarly setting '±' based on the negation of literal x_{j_w}. Intuitively, this creates three possible routes through the graph, each going near the obstacle corresponding to a particular value assigned to a variable.

Fig. 5. A path through this gadget must select one of three paths to go through, each going near the obstacle for the corresponding literal.
As with the variable gadget, we need a few additional deterministic obstacles in each layer to force paths to follow the graph, in the sense that a path must still go near the same obstacles as a path restricted to the graph would. First, we put boundaries below and above each layer to prevent paths from going straight up without passing through the gadgets. These obstacles are defined as the intersection of 1/2 ≤ e_2ᵀx and j + 9/12 ≤ e_3ᵀx ≤ j + 11/12, and as the intersection of 1/2 ≤ e_2ᵀx and j + 21/12 ≤ e_3ᵀx ≤ j + 23/12. Then we construct an obstacle in the middle of the layer that a path must go around, defined as the intersection of −1/4 ≤ e_1ᵀx ≤ 1/4 and j + 7/6 ≤ e_3ᵀx ≤ j + 3/2. We construct obstacles blocking a path from going around the middle obstacle except via one of the paths in the graph. For each variable i and value +/−, if the literal (¬)x_i (where the '¬' depends on whether the value is '+' or '−') appears in clause j, construct the obstacle defined as the intersection of ±e_1ᵀx ≥ 1/8, i ≤ e_2ᵀx ≤ i + 1, and j + 11/12 ≤ e_3ᵀx ≤ j + 23/12. Finally, we construct an obstacle after the last variable obstacle to prevent the path from bypassing all the obstacles. This obstacle is defined as the intersection of e_2ᵀx ≤ k + 3/4 and j + 11/12 ≤ e_3ᵀx ≤ j + 23/12.

Fig. 6. The bottom layer is the variable assignment gadget. The top layer is a single clause gadget. There would usually be many more clause gadgets stacked on top.

Now we combine the variable and clause gadgets, as seen in Fig. 6. We see that, as in the proof of Theorem 3, if a path through the variable assignment portion
encodes a satisfying assignment, there exists a path through the remainder of the graph that incurs no additional cost, whereas a non-satisfying assignment must incur some additional cost in the clause gadgets. Therefore, we see the same gap between a satisfying assignment and a non-satisfying assignment: ((k + 1)·r_close + (k − 1)·r_far) / (k·r_close + k·r_far) = 1 + Θ(1/k). Each gadget can be constructed in polynomial time, and the number of gadgets is polynomial, so this reduction can be constructed in polynomial time. Thus, any (1 + α)-approximate Safe Planning Algorithm can also solve 3SAT with polynomial overhead.
6 Conclusions and Future Work
We have shown that the minimum-risk planning problem on graphs is NP-hard, even in dimension 3. Furthermore, the fact that it remains hard after restriction to a small graph indicates that algorithms reducing to a graph search are likely to be impractical in the uncertain domain. This suggests that the field should pursue other directions towards solving motion planning under uncertainty. Trajectory-optimization-based methods, for example, might lead to efficient practical solutions. Furthermore, barring stronger hardness-of-approximation results, it is possible that there is a practical approximation algorithm for solving the minimum-risk planning problem on graphs. There is also the related direction of investigating models of uncertainty over obstacles. We focus on the PGDF model in this work because it captures certain desirable characteristics and has been used in prior work. However, the PGDF model has certain surprising characteristics, particularly near the tails of the distribution [3]. Perhaps there is a model that is a better fit for obstacle estimates in practice and that also permits efficient algorithms. In exploring this direction, it is important to note that we do not strongly invoke the structure of PGDF obstacles. Interesting directions for future work also include finding a good minimal condition on the obstacle distribution that makes the problem NP-hard. Another direction of future work is finding upper bounds on the safe motion-planning problem. While, when there is some "slack" in the shadows for the optimal solution, there is a trivial algorithm for finding an approximate solution by exhaustively iterating through an ϵ-net of shadow configurations (each one reduces to a motion-planning instance that can be solved by Canny's roadmap algorithm), no exact algorithm is known.

Funding. We gratefully acknowledge support from the Thomas and Stacey Siebel Foundation; from NSF Fellowship grant DGE-1656518; from NSF grants CCF-1763299, 1420316, 1523767 and 1723381; from AFOSR grant FA9550-17-1-0165; from ONR grant N00014-18-1-2847; from Honda Research; and from Draper Laboratory. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.
References

1. Aine, S., Swaminathan, S., Narayanan, V., Hwang, V., Likhachev, M.: Multi-heuristic A*. Int. J. Robot. Res. 35(1–3), 224–243 (2016)
2. Axelrod, B.: Algorithms for safe robot navigation. Master's thesis, Massachusetts Institute of Technology (2017)
3. Axelrod, B., Kaelbling, L., Lozano-Perez, T.: Provably safe robot navigation with obstacle uncertainty. Robot. Sci. Syst. 13, 1760–1774 (2017)
4. Axelrod, B., Kaelbling, L.P., Lozano-Pérez, T.: Provably safe robot navigation with obstacle uncertainty. Int. J. Robot. Res. 37, 1760–1774 (2018)
5. Bry, A., Roy, N.: Rapidly-exploring random belief trees for motion planning under uncertainty. In: 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 723–730. IEEE (2011)
6. Canny, J.: Some algebraic and geometric computations in PSPACE. In: Proceedings of the Annual ACM Symposium on Theory of Computing, pp. 460–467 (1988)
7. Canny, J., Reif, J.: New lower bound techniques for robot motion planning problems. In: Foundations of Computer Science, pp. 49–60 (1987)
8. Cassandra, A., Kaelbling, L., Kurien, J.: Acting under uncertainty: discrete Bayesian models for mobile-robot navigation. In: International Conference on Intelligent Robots and Systems, vol. 12, pp. 963–972 (1996)
9. Ding, X.C., Pinto, A., Surana, A.: Strategic planning under uncertainties via constrained Markov decision processes. In: 2013 IEEE International Conference on Robotics and Automation (ICRA), pp. 4568–4575. IEEE (2013)
10. Hopcroft, J., Joseph, D., Whitesides, S.: Movement problems for 2-dimensional linkages. SIAM J. Comput. 13, 610–629 (1984)
11. Feyzabadi, S., Carpin, S.: Multi-objective planning with multiple high level task specifications. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 5483–5490. IEEE (2016)
12. Hauser, K.: The minimum constraint removal problem with three robotics applications. Int. J. Robot. Res. 33, 5–17 (2014)
13. Hopcroft, J., Joseph, D., Whitesides, S.: On the movement of robot arms in 2-dimensional bounded regions. SIAM J. Comput. 14, 280–289 (1982)
14. Kaelbling, L.P., Lozano-Pérez, T.: Integrated task and motion planning in belief space. Int. J. Robot. Res. 32, 1194–1227 (2013)
15. Karaman, S., Frazzoli, E.: Sampling-based algorithms for optimal motion planning. Int. J. Robot. Res. 30(7), 846–894 (2011)
16. Kavraki, L.E., Svestka, P., Latombe, J.C., Overmars, M.H.: Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Trans. Robot. Autom. 12(4), 566–580 (1996)
17. Kurniawati, H., Hsu, D., Lee, W.S.: SARSOP: efficient point-based POMDP planning by approximating optimally reachable belief spaces. In: Robotics: Science and Systems (2008)
18. LaValle, S.M.: Planning Algorithms. Cambridge University Press, Cambridge (2006). http://planning.cs.uiuc.edu/
19. LaValle, S.M., Kuffner, J.J.: Randomized kinodynamic planning. In: ICRA (1999)
20. Lee, A., Duan, Y., Patil, S., Schulman, J., McCarthy, Z., van den Berg, J., Goldberg, K., Abbeel, P.: Sigma hulls for Gaussian belief space planning for imprecise articulated robots amid obstacles. In: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5660–5667. IEEE (2013)
21. Papadimitriou, C.H., Tsitsiklis, J.N.: The complexity of Markov decision processes. Math. Oper. Res. 12(3), 441–450 (1987)
22. Platt, R., Tedrake, R., Kaelbling, L., Lozano-Perez, T.: Belief space planning assuming maximum likelihood observations. In: Proceedings of Robotics: Science and Systems, Zaragoza, Spain (2010)
23. Prentice, S., Roy, N.: The belief roadmap: efficient planning in linear POMDPs by factoring the covariance. In: Proceedings of the 13th International Symposium of Robotics Research (ISRR), Hiroshima, Japan (2007)
24. Reif, J.: Complexity of the mover's problem and generalizations. In: Proceedings of the 20th IEEE Symposium on Foundations of Computer Science, pp. 421–427 (1979)
25. Sadigh, D., Kapoor, A.: Safe control under uncertainty with probabilistic signal temporal logic. In: Proceedings of Robotics: Science and Systems, Ann Arbor, Michigan (2016)
26. Salzman, O., Hou, B., Srinivasa, S.: Efficient motion planning for problems lacking optimal substructure. arXiv preprint arXiv:1703.02582 (2017)
27. Salzman, O., Srinivasa, S.: Open problem on risk-aware planning in the plane. arXiv preprint arXiv:1612.05101 (2016)
28. Sipser, M.: Introduction to the Theory of Computation, 1st edn. International Thomson Publishing, Stamford (1996)
29. Somani, A., Ye, N., Hsu, D., Sun Lee, W.: DESPOT: online POMDP planning with regularization. In: Advances in Neural Information Processing Systems, vol. 58 (2013)
The Hardness of Minimizing Design Cost Subject to Planning Problems

Fatemeh Zahra Saberifar¹, Jason M. O'Kane², and Dylan A. Shell³
¹ Amirkabir University of Technology, Tehran, Iran
² University of South Carolina, Columbia, SC, USA
³ Texas A&M University, College Station, TX, USA
[email protected]
Abstract. Assuming one wants to design the most cost-effective robot for some task, how difficult is it to choose the robot's actuators? This paper addresses that question in algorithmic terms, considering the problem of identifying optimal sets of actuation capabilities to allow a robot to complete a given task. We consider various cost functions that model the cost needed to equip a robot with some capabilities, and show that the general form of this problem is NP-hard, confirming what many perhaps have suspected about this sort of design-time optimization. As a result, for several questions of interest, having both optimality and efficiency of solution is unlikely. However, we also show that, for some specific types of cost functions, the problem is either polynomial-time solvable or fixed-parameter tractable.
1 Introduction
Research on autonomous robots is entering a phase of maturation: already there is general agreement on the centrality of estimation and planning problems, and there is broad consensus on basic representations and algorithms to address the underlying problems; the last decade has seen the emergence of (open source) software infrastructure, and adoption is increasingly widespread; and an inchoate industry is pursuing profitable applications. Some academic researchers have begun to move away from questions concerning how to program a given robot, turning to questions of the form "Given resource constraints c, d, and e, what is the ideal robot, considering that design choices influence feasible behavior?" Evidence for this growing interest can be seen in recently held workshops with titles such as 'Minimality & Design Automation' (RSS'16), 'Minimality and Trade-offs in Automated Robot Design' (RSS'17), and 'Workshop on Autonomous Robot Design' (ICRA'18).¹ The research is, naturally, focused on the development of algorithmic tools to help answer such design questions. Several ideas have been proffered as useful ways to tackle robot design problems. They display great and refreshing variety, including angles on the problem that emphasize fabrication, prototyping and manufacturability [5,7,11,14,15]; formal
1 For a summative report of the first two workshops, presented at the third, see [18].
Fig. 1. A motivating example that illustrates how, for an otherwise simple planning problem, straightforward design constraints can quickly lead to complex considerations. [top] Suppose we want to design a cleaning robot that is able to navigate to any room in the house. [bottom] The robot's ability to navigate spills, puddles, and toys depends on the particular design choices we make about which resources the robot is to be equipped with.
methods for (controller and hardware) synthesis of robots [12,15,16,23,25]; simulators and methods for interactive design [9]; compositional frameworks along with catalogs of components [1,2,22]; software for fault tracking and component-based identification [27]; and so on. This paper deals with design problems that we believe many may already suspect to be difficult problems, but for which actual hardness results have not appeared in the literature. (At least to the authors, somewhat surprisingly!) Though many aspects of the hardness of planning have been examined, prior work has tended to consider costs associated with plan execution, which often bear little relation to the cost of realizing the particular robot design. It is surprising that even rather immediate questions about robot design lack formal analysis from the complexity-theoretic point of view. Thus, we begin to remedy this fact. To help make matters concrete, consider the example illustrated in Fig. 1. We are interested in designing a home cleaning robot and, to be effective, the robot must be able to navigate from region to region within the environment. The ability to navigate depends on the actuators that the robot is equipped with, and their adequacy depends on the particular assortment of clutter that is encountered: 'spills' and 'puddles' and 'toys'. The cheapest wheels are w_0 which, though inexpensive, can't surmount any of these three, while wheel w_1 operates fine in wet conditions, but still fails with toys. The third option, w_2, can convey the
robot over all three, but requires a heavier-duty motor (m_1) than the standard one (m_0). If any robot is equipped with a scoop s_0, it is no longer hindered by dry detritus, though the scoop is ineffective with liquids. And, alas, the chassis we have available can't support both s_0 and m_1 simultaneously. Our task requirements dictate that the cleaning robot ought to be able to move from any room to any other, the morning after a party, that is, under worst-case conditions. Here, multiple designs satisfy these task requirements while also respecting the component limitations (namely the chassis weight requirement). Even if identifying that set of designs doesn't seem too onerous, the fact that the solutions in our example seem to need, at a glance, to satisfy distinct discrete constraints (overcome wet or dry obstacles, or both), and may do so in multiple ways, suggests that the solution set has a non-trivial combinatorial structure. One might imagine, for more complex problems with greater varieties of available components, that the complexity of this structure may scale exponentially. We are interested in minimizing some cost, typically over the set of all useful designs. Suppose we are given some reasonable cost function that assigns a cost to every set of resources; reasonableness involves properties such as nonnegativity. Perhaps, in a mood of generosity, we opt to simplify things (ignoring bulk purchase pricing breaks and economies of scale) and assume monotonicity in costs. Can one find the cheapest design efficiently then? This paper formalizes questions about certain robot design choices and their effect on the resulting robot's ability to plan and achieve goals. The work contributes hardness results primarily, but the purpose and import of such analyses is not merely to underscore that we expect to have to forgo overall/global optimality, but also to help understand whether thinking about such problems is fundamentally impractical or whether there is hope for good approximation algorithms.
2 Related Work
Having already provided broad context for robot design questions, here we draw attention to particular threads bearing an especially close relationship to this paper. Detailed connections to the authors' prior work are postponed until Sect. 6, after the presentation of the technical results. Censi [1,2] introduced a theory of co-design, which adopts a poset-based optimization point of view in order to relate functionality, implementations, and resources to one another. He demonstrates his methodology by applying it to questions about the minimal resources required to realize some functionality. This question, in various forms, has a rich history (cf. [4,19,20,26]). An especially noteworthy aspect of Censi's theory is how monotone properties enable efficient optimization; his work contrasts with the present work, in which monotonicity offers only a limited salve. Indeed, at least when choosing the sets of actions with which to equip a robot, planning problems induce a multi-stage interaction between the various options, making it difficult to ensure optimality.
It is also worth pointing out the influential work of Hauser [10] who examined a variant on motion planning where, starting with an infeasible problem, he seeks the minimal set of constraints to be removed to make the problem solvable. In a sense, this is the inverse of the problem we examine: we ask not about removing challenging elements from the problem, but rather about the addition of capabilities to the robot.
3 Definitions and Problem Formulation
This section presents some definitions necessary to introduce the algorithmic problem addressed in the balance of the paper.
3.1 Planning Problems and Plans
We are interested in reasoning about the ability of robots equipped with varying action capabilities to complete certain tasks. Following our earlier work [8], we model both planning problems and the plans that solve them using procrustean graphs.

Definition 1. A procrustean graph (or p-graph) is a bipartite directed graph G = (V_0, V, E, l_u, l_y, U, Y) in which:
1. The finite set of states V contains two disjoint kinds of states, called action states V_u and observation states V_y, with V = V_u ∪ V_y.
2. The multiset of edges E = E_u ∪ E_y is composed of action edges E_u and observation edges E_y.
3. Each action edge e ∈ E_u goes from an action state to an observation state, and is labeled with a finite nonempty set of actions l_u(e) drawn from the action space U.
4. Likewise, each observation edge e ∈ E_y goes from an observation state to an action state, and is labeled with a finite nonempty set of observations l_y(e) drawn from an observation space Y.
5. The set V_0 ⊆ V represents a nonempty set of initial states. All states in V_0 should be of the same kind: either V_0 ⊆ V_u or V_0 ⊆ V_y.

The interpretation is that a p-graph describes a set of event sequences that alternate between actions and observations. We say that an event sequence, that is, a sequence of actions and observations, is an execution on a p-graph if there exists some start state in the p-graph from which we can trace the sequence, following edges whose labels include each successive event in the sequence. For two p-graphs G_1 and G_2, we say that an event sequence is a joint execution if it is an execution for both G_1 and G_2. For simplicity, this paper focuses on state-determined p-graphs, which are those in which, for any given state, the labels on the edges departing that state are mutually disjoint, and in which V_0 is a singleton. Any execution can be traced in no more than one way in such p-graphs. We can use p-graphs to model both planning problems and plans.
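Since a p-graph is just a labeled bipartite graph, it admits a direct representation in code. The sketch below is illustrative only (the class and method names are ours, not from the paper); it stores a label set on each edge and traces an event sequence on a state-determined p-graph, where tracing is unambiguous.

```python
# A minimal sketch of a state-determined p-graph. States are opaque keys;
# each outgoing edge carries a frozen set of event labels.

class PGraph:
    def __init__(self, initial_state):
        self.initial = initial_state           # single start state (state-determined)
        self.edges = {}                        # state -> list of (label_set, successor)

    def add_edge(self, src, labels, dst):
        self.edges.setdefault(src, []).append((frozenset(labels), dst))

    def trace(self, events):
        """Trace an event sequence; return the final state reached, or None
        if the sequence is not an execution on this p-graph."""
        state = self.initial
        for e in events:
            successors = [d for (labels, d) in self.edges.get(state, []) if e in labels]
            if not successors:
                return None
            # Departing label sets are mutually disjoint in a state-determined
            # p-graph, so at most one successor can match.
            state = successors[0]
        return state
```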
Definition 2. A planning problem is a p-graph G with a goal region V_goal ⊆ V(G). A plan is a p-graph P with termination region P_term ⊆ V(P).

We direct the reader to look again at the example in Fig. 1. It is clear, certainly, how a planning problem involving motion from one particular room to another, when both rooms are given, can be posed as a p-graph with a goal region. A relatively straightforward extension, leveraging the ability to designate multiple states as start states, can express the problem of being able to transit from any room to any other. Specifically, one would combine the individual p-graphs for specific start and goal pairs, and then perform an expansion of this combined graph to a state-determined presentation. Informally, we say that a plan is safe on a planning problem if the plan has observation edges for any observation that might be generated by the planning problem at any reachable state pair, and likewise the planning problem has action edges for any action that might be generated by the plan. We say that a plan is finite on a planning problem if there is some upper bound on the length of all joint executions. (We omit the complete definitions of safe and of finite, which are somewhat tedious, referring the reader instead to [8].) Now we can define the notion of a plan solving a planning problem.

Definition 3. A plan (P, P_term) with at least one execution solves a planning problem (W, V_goal) if P is finite and safe on W, and every joint execution e_1 ⋯ e_k of P on W either reaches a vertex in P_term or is a prefix of some execution that does; moreover, all the e_1 ⋯ e_k that reach a vertex v ∈ V(P) with v ∈ P_term reach a vertex w ∈ V(W) with w ∈ V_goal.

The intuition here is that a plan solves a planning problem if every possible joint execution eventually terminates in the goal region.
3.2 Design Costs
In this paper, we are interested specifically in plans that minimize a design cost function, which depends on which actions are utilized in a plan. Note that we are concerned here only with each action's presence in (or absence from) the plan in question; we are not concerned with how many times an action is carried out on any particular execution of a plan. The next two definitions formalize this idea.

Definition 4. For an action set U, a cost function c : 2^U → ℝ assigns a real number cost to each subset of U.

Definition 5. For any plan (P, P_term) that solves planning problem (W, V_goal), we write A(P, W) ⊆ U to denote the set of actions that appear in any joint execution of P on W. We then define the design cost of P on W as c(A(P, W)).

When the planning problem W is clear from the context, we overload the notation slightly by writing A(P) for A(P, W) and likewise c(P) instead of c(A(P, W)).
The essential idea entailed by Definitions 4 and 5 is that c is a measure of the cost of a plan that depends only upon which actions are used by the plan, rather than upon how frequently those actions are used when the plan is executed. The intent is to establish a dependence between c(P) and the cost of constructing a robot that is capable of executing each action in A(P). Finding a plan that minimizes this design cost can give some insight into the simplest robots, in the sense of actuator complexity, that can solve the planning problem. Some example cost functions, intended to illustrate the expressive flexibility of the definitions, follow.

Example 6 (counter design cost). Given a plan P, consider the design cost c(P) = |A(P)|. This cost function simply counts the number of actions utilized by P. By minimizing this counter design cost, we minimize the number of distinct actions used in the plan.

Example 7 (weighted sum design cost). We can generalize the counter design cost by defining a weight function w : U → ℝ that assigns a specific cost to each action, and then defining c(P) = Σ_{u ∈ A(P)} w(u).

Example 8 (binary design cost). Suppose we are given a set of actions A′ that a robot designer would prefer to use, if possible. Define c_{A′} : 2^U → {0, 1} as

    c_{A′}(P) = 0 if A(P) ⊆ A′, and 1 otherwise.

For a given set of actions A′ and a plan P, the design cost c_{A′}, called a binary design cost, gives a value of 0 if all actions used to carry out P are from the preferred set A′, and a value of 1 if some additional action, not in A′, is used.

Example 9 (ordered actions). Suppose we have a choice of options for which actuators to include, each of which subsumes its predecessors, both in ability and in expense. We would like to identify which of these options is the simplest that suffices to solve a particular planning problem. We can model this kind of situation by assuming that U is a finite ordered set U = {u_1, . . . , u_n}, and defining c(P) = max A(P), the largest index of any action used.

Example 10 (monotone cost functions). Another natural class of cost functions are those that are monotone, in the following sense: a cost function is monotone if, for any sets U_1 ⊆ U and U_2 ⊆ U, we have U_1 ⊆ U_2 ⟹ c(U_1) ≤ c(U_2). Monotone cost functions are interesting because they capture the eminently sensible idea that adding additional abilities to the robot should not decrease the cost. Notice in particular that the counter design cost (Example 6), the binary design cost (Example 8), and the ordered action design cost (Example 9) are all monotone.
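These example cost functions are simple enough to state directly as code. The encoding below is a hypothetical sketch (the function names are ours); each function takes the action set A(P) as a Python set.

```python
# Illustrative encodings of the example design cost functions.
# A is the set of actions A(P) used by a plan.

def counter_cost(A):                        # Example 6
    return len(A)

def weighted_sum_cost(A, w):                # Example 7: w maps action -> weight
    return sum(w[u] for u in A)

def binary_cost(A, preferred):              # Example 8: 0 iff A(P) is within A'
    return 0 if A <= preferred else 1

def ordered_cost(A, order):                 # Example 9: order maps action -> index
    return max(order[u] for u in A) if A else 0

# With actions identified by their indices:
assert counter_cost({1, 3}) == 2
assert binary_cost({1}, preferred={1, 2}) == 0
```

All of these except the weighted sum with negative weights are monotone in the sense of Example 10: enlarging the action set can never lower the cost.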
We can now state the main algorithmic problem. Following the standard pattern, we consider both optimization and decision versions of the problem.

Decision Problem: Design minimization (DecDM)
Input: A planning problem (G, V_goal), where G is state-determined, a cost function c, and a real number k.
Output: Yes if there is a plan (P, P_term) that solves (G, V_goal) with design cost c(A(P, W)) ≤ k. No otherwise.

Optimization Problem: Design minimization (OptDM)
Input: A planning problem (G, V_goal) with state-determined G, and a cost function c.
Output: A plan (P, P_term) that solves (G, V_goal) such that the design cost of P is minimal.

We can also form specialized versions of each of these problems by placing restrictions on the design cost function c. Our objective, in the following sections, is to classify the types of design cost functions for which this problem can be solved efficiently, and the types for which these problems are hard.
4 Hardness of Design Cost Minimization
In this section, we prove that the decision version of the design cost minimization problem is NP-complete.
4.1 The General Case
Our proof proceeds by reduction from the standard set cover problem, which is known to be NP-complete [13]:

Decision Problem: SetCover
Input: A universe set R with n elements, a collection T of m sets T_1, . . . , T_m such that T_1 ∪ ⋯ ∪ T_m = R, and an integer k.
Output: Yes if there is some set I ⊆ T such that I covers all elements of R and the size of I is at most k. No otherwise.

Given an instance (R, T, k) of the SetCover problem, we construct an instance (G, V_goal, c, k′) of DecDM as follows:
1. Begin with an empty p-graph G. Choose U = {u_1, . . . , u_m}, with one action for each of the sets in T, for its action space. Choose Y = {⋆}, a singleton set containing a dummy observation ⋆, for its observation space.
2. For each element x_i ∈ R of the universe R, add to G an action state q_i and an observation state o_i. In addition, insert an extra action state q_{n+1} into G.
3. From each action state q_i, except q_{n+1}, connect the corresponding observation state o_i by a directed edge e_i. Determine the label l_u(e_i) as follows: the label for edge e_i includes action u_j if and only if the set T_j contains x_i. That is, we set l_u(e_i) = {u_j | x_i ∈ T_j}.
4. Connect each observation state o_i to the subsequent action state q_{i+1} with a directed edge e′_i, labeled with the sole observation ⋆, so that l_y(e′_i) = {⋆}.
5. Designate V_0 = {q_1} as the only initial state of G.
6. Designate V_goal = {q_{n+1}} as the goal region of the planning problem.
7. For the design cost function, choose the counter design cost function, c(P) = |A(P)|.
8. Select k′ = k.

Figure 2 shows an example of this construction. Note that the time needed for this construction is polynomial in the input size. Next, we prove that this construction is indeed a reduction from SetCover to DecDM. The intuition is that each action state in the constructed planning problem acts as a sort of 'gate' to check whether a certain element has been covered. If enough elements have been selected from T to fully cover R, then the corresponding plan will be able to transition through each of these gates to the goal. If not, not. The next two lemmata make this idea more precise.

Lemma 1. For any instance (R, T, k) of SetCover, consider the DecDM instance (G, V_goal, c, k′) constructed as described above. If there exists a subset of T of size at most k that covers R, then there exists a plan (P, P_term) that solves (G, V_goal), for which c(P) ≤ k.

Proof. Let I ⊆ T denote a cover of R with |I| ≤ k. To produce a plan (P, P_term), we start with a copy of the constructed planning problem (G, V_goal), and remove from G all action labels that do not correspond to elements of I. That is, for any i for which T_i ∉ I, we remove u_i from P. Note that V_goal = P_term = {q_{n+1}}. Clearly c(P) = |I| ≤ k. So it remains only to show that (P, P_term) solves (G, V_goal). First, we prove that P is finite and safe on G. Since the construction yields a linear chain of events in both G and P, the joint executions have lengths from 0 to at most 2n. Thus, P is finite on G. Note also that, according to the construction of G and P, for every joint execution e_1 ⋯ e_k on P and G that leads to v ∈ P and w ∈ G, if v is an action state then the label set of the action edge originating at v is a subset of the label set of the action edge originating at w. We also know that P and G have the same single observation label for each of their observation states. Therefore, P is safe on G. Because of the shared linear-chain form of both G and P and the existence of only one initial state and one goal state, there is one unique joint execution that reaches P_term in P, which by construction also reaches V_goal in G. The linear chain structure also ensures that every other joint execution is a prefix of this one, which implies that every joint execution e_1 ⋯ e_k on P and G either leads to the goal state q_{n+1} or is a prefix of some execution that leads to q_{n+1}, as
required by Definition 3. Therefore, (P, P_term) solves (G, V_goal).
Fig. 2. An example of the construction of a DecDM instance from a SetCover instance. Given a set cover instance with R = {1, 2, 3, 4}, T = {{2, 3, 4}, {1, 3}, {2, 4}, {1, 2}} and k = 2, we construct the planning problem shown on the left. Every subset of T that covers R corresponds to a plan that can reach Vgoal from the initial state. For this example, we can cover R in the SetCover instance by choosing {1, 3} and {2, 4}; likewise one can solve the planning problem using only the actions u2 and u3 .
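The construction of Sect. 4.1 is mechanical enough to sketch as code. The representation below is hypothetical (edge lists of (source, label set, target) triples, with the dummy observation written as '*'); it mirrors steps 1 through 8 for the counter design cost.

```python
# Sketch of the SetCover -> DecDM reduction: a chain of 'gate' states.

def build_decdm_instance(R, T, k):
    """R: list of universe elements; T: list of Python sets T_1..T_m.
    Returns (edges, V0, Vgoal, k_prime) describing the chain p-graph."""
    n = len(R)
    edges = []
    for i, x in enumerate(R, start=1):
        # Action edge q_i -> o_i, labeled with every u_j whose set T_j covers x.
        labels = {f"u{j}" for j, Tj in enumerate(T, start=1) if x in Tj}
        edges.append((f"q{i}", labels, f"o{i}"))
        # Observation edge o_i -> q_{i+1}, labeled with the dummy observation.
        edges.append((f"o{i}", {"*"}, f"q{i+1}"))
    V0, Vgoal = {"q1"}, {f"q{n+1}"}
    return edges, V0, Vgoal, k          # counter design cost; k' = k

# For the instance of Fig. 2:
edges, V0, Vgoal, kp = build_decdm_instance(
    [1, 2, 3, 4], [{2, 3, 4}, {1, 3}, {2, 4}, {1, 2}], 2)
```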
Lemma 2. For any instance (R, T, k) of SetCover, consider the DecDM instance (G, V_goal, c, k′) constructed as described above. If there exists a plan (P, P_term) that solves (G, V_goal), for which c(P) ≤ k, then there exists a subset of T of size at most k that covers R.

Proof. Let (P, P_term) be some plan with c(P) ≤ k that solves (G, V_goal). Consider some execution e_1 ⋯ e_m on P. We may assume it reaches a vertex in P_term without sacrificing generality for, if it does not, it is certainly the prefix of some execution which does, according to Definition 3. Thus, when e_m arrives at a
vertex in P_term ⊆ V(P), it must be that e_1 ⋯ e_m reaches a vertex in V_goal ⊆ V(W). But, by construction of G, that means e_1 ⋯ e_m = u_{i_1} ⋆ u_{i_2} ⋆ ⋯ u_{i_n} ⋆, and we see that n = |R|. Define I = {T_j ∈ T | j ∈ {i_1, i_2, . . . , i_n}}, where j is taken over the set simply collecting the indices of the actions in the execution. Clearly I ⊆ T and, because the correspondence T_x ↦ u_x pairs elements of I with A(P) in a one-to-one fashion, |I| = |A(P)| = c(P) ≤ k. All that remains is to show that I covers R. Reaching the goal state requires transiting, linearly, through q_1 o_1 q_2 . . . o_n q_{n+1}. So, for any w ∈ R, the action u_{i_w} was used to transition from action vertex q_{i_w} to observation vertex o_{i_w} en route to q_{n+1}. That means T_{i_w} ∈ I and, since u_{i_w} is a feasible action from q_{i_w}, the construction ensures that w ∈ T_{i_w}.

These two lemmas lead directly to the following result.

Theorem 1. DecDM is NP-complete.

Proof. We need to show that DecDM is both in NP and NP-hard. For the former, we must be able to verify efficiently that a given instance of DecDM is a Yes instance. For any positive instance, there is a solution no larger than W via Theorem 27 of [8], the argument therein carrying over when considering plans subject to some design cost c(P) ≤ k. Such a plan, which is itself state-determined, can be used as a certificate. Confirming that this plan does indeed solve the planning problem is straightforward via backchaining and, since both the plan and W are state-determined, this takes polynomial time. For the latter, we must show a polynomial-time reduction from a known NP-complete problem to DecDM. Part 6 of Karp's 'Main Theorem' [13] establishes that SetCover is NP-complete, and Lemmas 1 and 2 establish that the construction described above is indeed a reduction.
4.2 Special Cases that Are Also Hard
Given the kind of hardness result expressed in Theorem 1, one reasonable follow-up question is to consider various kinds of restrictions to the problem, in hope that some natural or interesting special cases may yet be efficiently solvable. However, notice that the cost function c used in the reduction is the counter design cost (recall Example 6). This leads immediately to several stronger results.

Corollary 1. DecDM, restricted to weighted sum cost functions (Example 7), is NP-complete.

Proof. Use the same reduction as in Theorem 1, but replace the counter design cost c with a weighted sum cost function in which each action has weight 1.
Corollary 2. DecDM, restricted to monotone design cost functions (Example 10), is NP-complete. Proof. The reduction in Theorem 1 uses counter design cost, which happens to be monotone.
(The astute reader will note that we have not yet referred back to Examples 8 and 9; we revisit these in Sect. 5.)
4.3 Hardness of Approximation
Another avenue of attack for a hard problem is to try to find efficient approximation algorithms that can guarantee to provide solutions close to the optimal. Unfortunately, we can show that design cost minimization is hard even to approximate.

Theorem 2. For every ε > 0, OptDM is NP-hard to approximate to within ratio (1 − ε) ln n.

Proof. Let ε > 0. Suppose, a contrario, that there exists a polynomial-time approximation algorithm A that solves OptDM with ratio (1 − ε) ln n. For a given instance (G, V_goal, c) of OptDM, let OPT(G, V_goal, c) denote the smallest cost, according to c, for a plan that solves (G, V_goal). Similarly, let A(G, V_goal, c) denote the cost of the output plan generated by algorithm A. By assumption, we have A(G, V_goal, c) ≤ (1 − ε) ln n · OPT(G, V_goal, c). Under this assumption, we introduce the following polynomial-time approximation algorithm, called B, for SetCover.
1. For a given instance (R, T) of SetCover, we construct a planning problem using the construction in Sect. 4.1.
2. We choose the counter design cost for c, and execute algorithm A to find a plan whose design cost is within a (1 − ε) ln n factor of optimal.
3. Then, using Lemma 2, we recover a set cover for R from the extracted plan.

We write B(R, T) for the size of the set cover generated by algorithm B and OPT(R, T) for the minimum coverage set size. Note that the size of this set cover is equal to the design cost of the plan, so that B(R, T) = A(G, V_goal, c). We also know, from Lemmas 1 and 2, that OPT(G, V_goal, c) = OPT(R, T). Thus, for sufficiently large n, we have B(R, T) = A(G, V_goal, c) ≤ (1 − ε) ln n · OPT(G, V_goal, c) = (1 − ε) ln n · OPT(R, T). Therefore, we have a polynomial-time approximation algorithm B for SetCover with approximation ratio (1 − ε) ln n. Unless P = NP, this contradicts the known inapproximability result for SetCover due to Dinur and Steurer [3].
This proof also carries over to narrower classes of cost functions, just as the basic hardness proof in Sect. 4.2 does, thus: Corollary 3. For every ε > 0, OptDM is NP-hard to approximate to within ratio (1 − ε) ln n, even when the design cost function is restricted to weighted sums (Example 7), or to monotone functions (Example 10).
4.4 Fixed Parameter Hardness
Another general way to cope with NP-hard problems is the fixed-parameter tractability (fpt) approach. The intuition of the approach is to try to identify features of an input instance, called parameters, other than the problem size, that govern the hardness of a problem. Specifically, an NP-hard problem is fpt if there exists an algorithm to solve it in time f(k)·n^{O(1)}, in which f(·) is some computable function and k is some parameter of the input instance [6,17]. There is bad news on this front as well.

Lemma 3. DecDM, restricted to the counter design cost, and parameterized by the cost of the output plan, is not FPT under commonly held complexity-theoretic assumptions.2

Proof. Consider the construction in Sect. 4.1, denoted by Γ. We show that Γ is an fpt-reduction from SetCover, parameterized by the size of the cover set, to DecDM, parameterized by the cost of the output plan. The definition of fpt-reduction [6] has three conditions:
1. For all x, x is a positive instance of parameterized SetCover if and only if Γ(x) is a positive instance of parameterized DecDM.
2. The construction Γ is computable by an fpt algorithm.
3. There exists a computable function bounding the value of the parameter k′ of the instance output by Γ in terms of the value of the parameter k of the input instance x.

Lemmas 1 and 2 confirm that the first condition is satisfied. In Sect. 4.1, we mentioned that the construction Γ takes polynomial time with respect to the input size, which is time f(c)·n^{O(1)} for some constant-valued function f and some constant c. Thus, the second condition is satisfied. Finally, according to Lemmas 1 and 2, the value of parameter k′ is equal to the value of parameter k. The identity function is obviously computable, so the third condition is satisfied. Thus, the construction Γ is an fpt-reduction. Now, suppose that DecDM parameterized by the counter design cost of the output plan is FPT. We know that FPT is closed under fpt-reductions; that is, if a parameterized problem Q with parameter k, denoted by (Q, k), is reducible to a parameterized problem (Q′, k′) and (Q′, k′) ∈ FPT, then (Q, k) ∈ FPT [6]. So, our supposition implies that SetCover parameterized by the size of the coverage set is in FPT, which is a contradiction, unless the entire fixed-parameter hierarchy collapses [6].
Corollary 4. DecDM, parameterized by the design cost, is not FPT even when restricted to weighted sum cost functions (Example 7) or to monotone cost functions (Example 10), under common assumptions (see Footnote 2).
2 The particular assumption being that W[2] ≠ FPT.
5 Design Cost Minimization in Polynomial Time
Section 4 presented a variety of hardness results of various kinds, for several variations of the design cost minimization problem. Now we present a modicum of good news, in the form of results that show certain versions of the problem can indeed be solved in polynomial time, or are fixed-parameter tractable.
5.1 Binary Design and Ordered Action Costs Are Efficiently Solvable
It is useful to identify a class of cost functions which are amenable to a particular sort of decomposition.

Definition 11. A cost function c is n-partition orderable if there exists a partition of the action set U into mutually disjoint sets U_1, U_2, . . . , U_n which form an increasing sequence of costs, where c(U_1 ∪ ⋯ ∪ U_m) < c(U_1 ∪ ⋯ ∪ U_{m+1}) for 1 ≤ m < n, and ∀x ∈ U_{i+1}, c(U_i) < c(U_i ∪ {x}).

The intuition is that one must be able to split U into ordered level-sets with respect to costs. These allow easy solution.

Lemma 4. DecDM, restricted to n-partition orderable cost functions, can be solved in time O(n |U| |V(G)|).

Proof. For a planning problem (G, V_goal) with an n-partition orderable cost function c there are n + 1 possible outcomes. A straightforward procedure determines which: apply standard backchaining to (G, V_goal), but restrict consideration only to actions within U_1; if a solution is found, it has cost c(U_1), and this minimizes the cost. If no plan has been found at this point, backchaining can be continued, but now permitting actions from U_2 as well. A solution found at this juncture has cost c(U_1 ∪ U_2), which must minimize the cost. Otherwise, this procedure is repeated, adding U_{i+1} only after the search with U_i fails. If after U_n has been added no solution has been found, then no plan solves (G, V_goal). Since (G, V_goal) is state-determined, there are no more than O(|U| |V(G)|) edges that one need examine.
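The layered search from the proof of Lemma 4 can be sketched as a simple loop. Here `solvable` stands for a standard backchaining subroutine restricted to a given action set, which we assume is available (it is not spelled out in the paper); for state-determined inputs it runs in polynomial time.

```python
# Sketch of the procedure in Lemma 4: grow the allowed action set one
# partition level at a time; the first level at which a plan exists
# minimizes an n-partition orderable cost.

def min_cost_partition_orderable(G, Vgoal, partition, cost, solvable):
    """partition: list of action sets U_1, ..., U_n in increasing cost order.
    Returns the minimal design cost, or None if no plan solves (G, Vgoal)."""
    allowed = set()
    for Ui in partition:
        allowed |= Ui                     # now permit actions from U_1 ∪ ... ∪ U_i
        if solvable(G, Vgoal, allowed):
            return cost(allowed)          # cost of this union is minimal
    return None
```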
We now turn to the particular cost functions mentioned in Examples 8 and 9.

Corollary 5. Any plan minimization problem with a binary design cost function is polynomial-time solvable.

Proof. A binary design cost function c_{A′}(·) is a 2-partition orderable cost function with U_1 = A′ and U_2 = U \ A′.

Corollary 6. Any plan minimization problem with ordered action costs is polynomial-time solvable.

Proof. An ordered cost function is a |U|-partition orderable cost function with U_i = {u_i}, i ∈ {1, . . . , n}.
5.2 Counter Design Cost Is Fixed Parameter Tractable
Though counter design cost (the cost of Example 6) is not n-partition orderable, we need not be inconsolably bleak.

Lemma 5. DecDM, with counter design cost, parameterized by the size of the action space, is in FPT.

Proof. Let (W, V_goal) be the given planning problem, and let λ = |U| denote the size of the action space of the problem. Consider the following simple algorithm: enumerate 2^U; then, for each subset U_i ∈ 2^U, construct a planning problem with only actions from U_i, and check whether it can be solved with c(P_{U_i}) ≤ k. Each check takes time polynomial in |V(W)| because W is state-determined. Thus, this algorithm is FPT, because its running time is 2^λ · n^{O(1)}.
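A sketch of the enumeration argument in the proof of Lemma 5, reusing the same assumed `solvable` subroutine as above; the outer loop over subsets contributes the 2^λ factor.

```python
from itertools import combinations

# Sketch of the FPT algorithm in Lemma 5: try every subset of the action
# space U; each per-subset check is polynomial for state-determined W.

def decdm_counter_fpt(W, Vgoal, U, k, solvable):
    """Decide DecDM under counter design cost: does some plan use at most
    k distinct actions? Enumerating by size lets us stop at |A(P)| <= k."""
    for size in range(min(k, len(U)) + 1):
        for subset in combinations(sorted(U), size):
            if solvable(W, Vgoal, set(subset)):
                return True
    return False
```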
6 Discussion
This paper complements the authors' previous papers; it is worth discussing why those papers, some of which also describe the hardness of aspects related to plans, do not quite capture an appropriate notion of design cost, the subject of this work.

States: In [21], motivated by memory constraints, we examine the hardness of a problem termed 'concise planning', which minimizes the number of states in a plan that solves some planning problem. The cardinality of the set of states is distinct from the total number of actions or, indeed, any function of A(P, W), though minimization turns out to be difficult nevertheless.

Observations: A paper more obviously connected with design problems is [24], where we introduce representations and algorithms for examining whether some specific deterioration of an idealized sensor is destructive to task achievement. Determining the most aggressive non-destructive modification is hard.

In contrast, the present paper's focus is actions. While the number of states is proportional to memory requirements, this is seldom the constraining factor in a robot design. Choice of sensors is an important design consideration, but the question there is usually whether they will be too noisy to be effective or not. In contrast, the mere inclusion of an action in a robot's repertoire usually demands some change in actuation. The connection between A(P, W) and the cost of the robot is direct.

The idea of a state-determined form was developed in [8,24]. Its importance, prior to the present paper, had been mainly abstract and conceptual. The requirement of a state-determined planning problem, directly in the definition of DecDM, is important for the proof of the completeness result in Theorem 1. Note, though, that without that requirement, the proof of NP-hardness is preserved.

Future work could examine whether there are meaningful costs for observations that are "dual" to the design costs for actions. Imagine a design problem
where the robot wishes to never receive some sensor reading: one might think that a robot should just omit any sensor which may receive such a signal, or that the robot could just forget the unwanted information. But these options may be impossible in particular contexts.

Acknowledgements. This material is based upon work supported by the National Science Foundation under Awards IIS-1526862 and IIS-1527436.
References

1. Censi, A.: A class of co-design problems with cyclic constraints and their solution. IEEE Robot. Autom. Lett. 2(1), 96–103 (2017)
2. Censi, A.: Uncertainty in monotone co-design problems. IEEE Robot. Autom. Lett. 2(3), 1556–1563 (2017)
3. Dinur, I., Steurer, D.: Analytical approach to parallel repetition. In: Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, STOC 2014. ACM (2014)
4. Donald, B.R.: On information invariants in robotics. Artif. Intell. 72(1–2), 217–304 (1995). Special Volume on Computational Research on Interaction and Agency, Part 1
5. Fitzner, I., Sun, Y., Sachdeva, V., Revzen, S.: Rapidly prototyping robots: using plates and reinforced flexures. IEEE Robot. Autom. Mag. 24(1), 41–47 (2017)
6. Flum, J., Grohe, M.: Parameterized Complexity Theory. Springer, Heidelberg (1998)
7. Fuller, S., Wilhelm, E., Jacobson, J.: Ink-jet printed nanoparticle microelectromechanical systems. J. Microelectromech. Syst. 11(1), 54–60 (2002)
8. Ghasemlou, S., Saberifar, F.Z., O'Kane, J.M., Shell, D.: Beyond the planning potpourri: reasoning about label transformations on procrustean graphs. In: Proceedings of Workshop on the Algorithmic Foundations of Robotics (2016)
9. Ha, S., Coros, S., Alspach, A., Kim, J., Yamane, K.: Joint optimization of robot design and motion parameters using the implicit function theorem. In: Proceedings of Robotics: Science and Systems (2017)
10. Hauser, K.: The minimum constraint removal problem with three robotics applications. Int. J. Robot. Res. 33(1), 5–17 (2014)
11. Hoover, A.M., Fearing, R.S.: Fast scale prototyping for folded millirobots. In: Proceedings of International Conference on Robotics and Automation (2008)
12. Jing, G., Tosun, T., Yim, M., Kress-Gazit, H.: An end-to-end system for accomplishing tasks with modular robots. In: Proceedings of Robotics: Science and Systems. Ann Arbor, Michigan, June 2016
13. Karp, R.M.: Reducibility among combinatorial problems. In: Complexity of Computer Computations, pp. 85–103. Springer (1972)
14. Luck, K.S., Campbell, J., Jansen, M., Aukes, D., Amor, H.B.: From the lab to the desert: fast prototyping and learning of robot locomotion. In: Proceedings of Robotics: Science and Systems. Cambridge, Massachusetts, July 2017
15. Mehta, A., Bezzo, N., Gebhard, P., An, B., Kumar, V., Lee, I., Rus, D.: A design environment for the rapid specification and fabrication of printable robots. In: Springer Tracts in Advanced Robotics, pp. 435–449 (2015)
16. Mehta, A.M., DelPreto, J., Wong, K.W., Hamill, S., Kress-Gazit, H., Rus, D.: Robot creation from functional specifications. In: Springer Proceedings in Advanced Robotics, pp. 631–648 (2018)
17. Niedermeier, R.: Invitation to Fixed-Parameter Algorithms. Oxford University Press, New York (2006)
18. Nilles, A.Q., Shell, D.A., O'Kane, J.M.: Robot design: formalisms, representations, and the role of the designer. In: IEEE ICRA Workshop on Autonomous Robot Design. Brisbane, Australia, May 2018. https://arxiv.org/abs/1806.05157
19. O'Kane, J.M.: A theory for comparing robot systems. Ph.D. thesis, University of Illinois, October 2007
20. O'Kane, J.M., LaValle, S.M.: On comparing the power of robots. Int. J. Robot. Res. 27(1), 5–23 (2008)
21. O'Kane, J.M., Shell, D.: Concise planning and filtering: hardness and algorithms. IEEE Trans. Autom. Sci. Eng. 14(4), 1666–1681 (2017)
22. Paull, L., Severac, G., Raffo, G., Angel, J., Boley, H., Durst, P., Gray, W., Habib, M., Nguyen, B., Ragavan, S.V., Saeedi, S., Sanz, R., Seto, M., Stefanovski, A., Trentini, M., Li, H.: Towards an ontology for autonomous robots. In: Proceedings of International Conference on Intelligent Robots and Systems, pp. 1359–1364 (2012)
23. Raman, V., Kress-Gazit, H.: Towards minimal explanations of unsynthesizability for high-level robot behaviors. In: Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (2013)
24. Saberifar, F.Z., Ghasemlou, S., O'Kane, J.M., Shell, D.: Set-labelled filters and sensor transformations. In: Proceedings of Robotics: Science and Systems (2016)
25. Schulz, A., Sung, C., Spielberg, A., Zhao, W., Cheng, R., Grinspun, E., Rus, D., Matusik, W.: Interactive robogami: an end-to-end system for design of robots with ground locomotion. Int. J. Robot. Res. 36(10), 1131–1147 (2017)
26. Tovar, B.: Minimalist Models and Methods for Visibility-based Tasks. Ph.D. thesis, University of Illinois at Urbana-Champaign (2009)
27. Ziglar, J., Williams, R., Wicks, A.: Context-aware system synthesis, task assignment, and routing (2017). arXiv:1706.04580
Human-Robot Interaction
Altruistic Autonomy: Beating Congestion on Shared Roads

Erdem Bıyık¹, Daniel A. Lazar², Ramtin Pedarsani², and Dorsa Sadigh³

1 Electrical Engineering, Stanford University, Stanford, USA ([email protected])
2 Electrical and Computer Engineering, University of California, Santa Barbara, Santa Barbara, USA ({dlazar,ramtin}@ece.ucsb.edu)
3 Computer Science and Electrical Engineering, Stanford University, Stanford, USA ([email protected])
Abstract. Traffic congestion has large economic and social costs. The introduction of autonomous vehicles can potentially reduce this congestion, both by increasing network throughput and by enabling a social planner to incentivize users of autonomous vehicles to take longer routes that can alleviate congestion on more direct roads. We formalize these effects of altruistic autonomy on roads shared between human drivers and autonomous vehicles. In this work, we develop a formal model of road congestion on shared roads based on the fundamental diagram of traffic. We consider a network of parallel roads and provide algorithms that compute optimal equilibria that are robust to additional unforeseen demand. We further plan for optimal routings when users have varying degrees of altruism. We find that even with arbitrarily small altruism, total latency can be unboundedly better than without altruism, and that the best selfish equilibrium can be unboundedly better than the worst selfish equilibrium. We validate our theoretical results through microscopic traffic simulations and show an average latency decrease by a factor of 4 from the worst-case selfish equilibrium to the optimal equilibrium when autonomous vehicles are altruistic.
Keywords: Autonomy-enabled transportation systems · Multi-robot systems · Stackelberg routing game · Optimization

1 Introduction
Autonomous and connected vehicles are soon becoming a significant part of roads normally used by human drivers. Such vehicles hold the promise of safer streets, better fuel efficiency, more flexibility in tailoring to specific drivers' needs, and time savings. In addition to benefits impacting individual users, autonomous cars

E. Bıyık and D. A. Lazar: First two authors contributed equally and are listed in alphabetical order.
can significantly influence the flow of traffic networks. A coalition of autonomous cars can potentially decrease road congestion through a number of creative techniques. Autonomous vehicles can increase road capacity and throughput, as well as decrease fuel consumption, by forming platoons and maintaining shorter headways to the vehicles they follow [1–5]. They can smooth shockwaves and instabilities that form in congested flows of human-driven vehicles [6–9]. Finally, autonomous vehicles can incentivize people to take unselfish routes, i.e., routes leading to a lower overall traffic delay for all drivers on the network. Previous work has studied methods such as tolling and incentives to influence drivers' routing choices by, e.g., shifting their commute times or incentivizing them to use less congested roads or less congesting modes of transportation [10–12]. Our key insight is that people who are incentivized to behave altruistically, together with vehicles that have autonomous capabilities, can be leveraged to positively influence traffic networks. Our key contribution is to formally study and plan for altruistic autonomous cars that can influence traffic flows in order to achieve lower latencies. The setting we are interested in is as follows: assume there is a population of selfish human drivers and a population of autonomous car users, some of whom are willing to take routes with varying degrees of inefficiency. How can we best utilize this altruistic autonomy to minimize average user delay? To answer this, we first develop a model of vehicle flow that reflects changes in road capacity due to efficient platooning of autonomous cars on shared roads. The latency experienced by users of a road is a function of the total vehicle flow as well as the autonomy level, the fraction of vehicles that are autonomous. We use the terminology latency to refer to the delay experienced by a driver from taking a road, flow to refer to the vehicle flux on a road (vehicles per second), and density to describe the number of cars per meter on a road. Our mixed-autonomy model is developed based on the commonly used fundamental diagram of traffic [13], which relates vehicle density to vehicle flow on roads with only human-driven cars. Using our mixed-autonomy traffic flow model for a network of parallel roads, we study selfish equilibria in which human drivers and users of autonomous vehicles all pick their routes to minimize their individual delays. As it may be difficult to exactly gauge flow demands, we define a notion of robustness that quantifies how resilient an equilibrium is to additional unforeseen demand. We establish properties of these equilibria and provide a polynomial-time algorithm for finding a robust equilibrium that minimizes overall delay. We then address how to best use autonomous vehicles that are altruistic. We develop the notion of an altruism profile, which reflects the varying degrees to which users of autonomous vehicles are willing to endure routes longer than the quickest route. We establish properties of optimal altruistic equilibria and provide a polynomial-time algorithm for finding such routings. We validate our theoretical results through experiments in a microscopic traffic simulator. We show that an optimal selfish equilibrium can improve latency by a factor of two, and utilizing altruism can improve latency by another
factor of two in a realistic driving scenario. Our contributions in this work are as follows:

– We develop a formal model of traffic flow on mixed-autonomy roads.
– We design an optimization-based algorithm to find a best-case Nash equilibrium.
– We define a robustness measure for the ability to allocate additional flow demand and develop an algorithm to optimize for this measure.
– We develop an algorithm to minimize total experienced latency in the presence of altruistic autonomous vehicles.1

Our development considers how to find optimal equilibria. This does not directly inform a planner on how to achieve such equilibria, but it is the first step towards doing so.

Related Work. Our work is closely related to work in optimal routing, traffic equilibria, and Stackelberg strategies in routing games. In routing games, populations of users travel from source graph nodes to destination nodes via edges, and the cost incurred due to traveling on an edge is a function of the volume of traffic flow on that edge. Some works have found optimal routing strategies for these games [15,16]. If all users choose their routes selfishly and minimize their own travel time, this results in a Nash or Wardrop Equilibrium [17,18]. Stackelberg routing games are games in which a leader controls some fraction of traffic, and the remainder of the traffic reaches a selfish equilibrium [19–21]. Many works bound the inefficiency due to selfish routing in these games, including in the presence of traffic composed of multiple vehicle types [22–25]. However, these works on routing games consider a model of road latency in which latency increases as vehicle flow increases. This latency model does not capture congestion, in which jammed traffic has low flow and high latency. While not in the routing game setting, other works focus specifically on characterizing these congestion effects on traffic flow with multiple classes of users [26–29]. Krichene et al. combine these two fields by considering optimal Stackelberg strategies on networks of parallel roads, with latency functions that model this notion of congestion [30]. Their work considers a single traffic type (no autonomous vehicles), and the controlled traffic can be made to take routes with delay arbitrarily worse than the best available route. In contrast, we consider a routing game with two types of traffic, and there is a distribution of the altruism level of the controlled autonomous traffic.
2 Model and Problem Statement
Traffic Flow on Pure Roads. We assume every road has a maximum flow z_i^max. This occurs when traffic is in free-flow, that is, when all vehicles travel at the nominal road speed.
1 Due to space constraints, we defer all proofs to this paper's extended version [14].
Fig. 1. (a) Fundamental diagram of traffic for a pure road, relating vehicle density and flow. (b) Relationship between vehicle flow and latency. Blue and brown lines correspond to free-flow and congested regimes; arrowheads show increasing density.
Traffic Density vs Traffic Flow: Adding more cars to a road that is already at maximum flow makes the traffic switch from free-flow to a congested regime, which decreases the vehicle flow. In the extreme case, at a certain density, cars are bumper-to-bumper and vehicle flow stops. Removing vehicles also decreases the flow, since doing so does not speed up cars that are already traveling in free-flow at their nominal speed. Figure 1(a), the Fundamental Diagram of Traffic [13], illustrates this phenomenon: flow increases linearly with respect to density, with slope v_i^f, until it hits the critical density. The slope corresponds to the free-flow velocity on road i. Beyond the critical point, flow decreases linearly until it is zero at the maximum density. This linear model matches our simulations (Sect. 4).

Traffic Flow vs Road Latency: The relationship between vehicle flow and road latency reflects the same free-flow/congested divide. In free-flow the road has constant latency, as all vehicles are traveling at the nominal velocity. In the congested regime, however, vehicle flow decreases as latency increases: in congestion, a high density of vehicles is required to achieve a low traffic flow, resulting in high latency. This is represented in Fig. 1(b), where in free-flow, latency is constant at the free-flow latency a_i for any amount of flow up to the maximum flow on a road. The brown curve above corresponds to the congested regime, in which latency increases as vehicle flow decreases.2
2 Though congested flow may be unstable, it can be stabilized with a small number of autonomous vehicles [7], adding an additional constraint to later optimizations.
Fig. 2. (a) The fundamental diagram of traffic for roads with all human-driven (solid) and all autonomous (dashed) vehicles. Congestion begins at a higher vehicle density as autonomous vehicles require a shorter headway when following other vehicles. Superscripts -h and -r denote parameters corresponding to purely human and purely autonomous traffic, respectively. (b) The relationship between vehicle flow and latency also changes in the presence of autonomous vehicles. Free-flow speed remains the same but maximum flow on a road increases.
To formalize the relationship between autonomy level and critical density on road i, we assume that, at nominal velocity, autonomous vehicles and humans require headways h̄_i and h_i, respectively, with h̄_i ≤ h_i. This inequality reflects the assumption that autonomous vehicles can maintain a short headway, regardless of the type of vehicle they are following. Then, if x_i and y_i denote the flows of human-driven and autonomous vehicles respectively, the critical density is

    ρ_i^crit(x_i, y_i) = 1 / ( (y_i/(x_i + y_i)) h̄_i + (x_i/(x_i + y_i)) h_i + L ),
where all vehicles have length L. Here the denominator represents the average length from one car's rear bumper to the preceding car's rear bumper when all cars follow the vehicle in front of them with nominal headway. Note that critical density is expressed here as a function of the flows x_i and y_i. This quantity can also be expressed as a function of the autonomy level α_i(x_i, y_i) = y_i/(x_i + y_i), i.e., the fraction of vehicles on road i that are autonomous. Since flow increases linearly with density until hitting ρ_i^crit, the maximum flow can also be expressed as a function of autonomy level: z_i^max(x_i, y_i) = v_i^f ρ_i^crit(x_i, y_i). The latency on a road is a function of the vehicle flows on the road as well as a binary argument s_i, which indicates whether the road is congested:

    ℓ_i(x_i, y_i, s_i) = d_i / v_i^f                                                            if s_i = 0,
    ℓ_i(x_i, y_i, s_i) = d_i ( ρ_i^max/(x_i + y_i) + (ρ_i^crit(x_i, y_i) − ρ_i^max)/z_i^max(x_i, y_i) )   if s_i = 1.    (1)
We define the free-flow latency on road i as a_i := d_i/v_i^f, where d_i is the length of road i. Figure 2(b) illustrates the effect of mixed autonomy on latency.
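The road model above translates directly into code. The sketch below computes the critical density, the maximum flow, and the latency of Eq. (1); parameter names follow the text, and the numbers in the usage line are illustrative only.

```python
# Sketch of the mixed-autonomy road model of Sect. 2.

def rho_crit(x, y, h, h_bar, L):
    """Critical density for human flow x and autonomous flow y, with
    headways h (human) >= h_bar (autonomous) and car length L, in meters."""
    alpha = y / (x + y)                            # autonomy level
    return 1.0 / (alpha * h_bar + (1.0 - alpha) * h + L)

def z_max(x, y, v_f, h, h_bar, L):
    """Maximum flow: free-flow speed times critical density."""
    return v_f * rho_crit(x, y, h, h_bar, L)

def latency(x, y, s, d, v_f, rho_max, h, h_bar, L):
    """Eq. (1): constant free-flow latency if s == 0, congested otherwise."""
    if s == 0:
        return d / v_f
    rc = rho_crit(x, y, h, h_bar, L)
    return d * (rho_max / (x + y) + (rc - rho_max) / z_max(x, y, v_f, h, h_bar, L))

# Illustrative numbers only: a 1 km congested road with 14 m/s free-flow speed.
t = latency(x=0.25, y=0.15, s=1, d=1000.0, v_f=14.0,
            rho_max=0.2, h=30.0, h_bar=15.0, L=4.5)
```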
Demonstrative Example. We use the following running example to intuitively describe the concepts formalized in Main Results (Sect. 3). Imagine driving on a Friday afternoon in Los Angeles, where you plan to drive from the Beverly Hills library to the Valley (Fig. 3). The most direct route, Coldwater Canyon, takes 25 min without traffic. Taking the 405 freeway would be 30 min, and Laurel Canyon would be 35 min in free-flow. Unfortunately, it is Friday afternoon and the flow is anything but free! Let us assume that everyone is in the same predicament, meaning that all traffic on these roads is from people with the same start and end points. People would only take Laurel Canyon if Coldwater and the 405 were congested to the point that each takes at least 35 min. Further, any route that people use will have the same latency, otherwise they would switch to the quicker route. Figure 4 illustrates three such equilibria with varying delays.

Fig. 3. The map with three possible routes for the example.
Fig. 4. Illustration of equilibria in the network. Equilibria can have one road in free-flow and others congested (a) or all used roads can be congested (b, c).
Krichene et al. have shown that for any given volume of traffic composed of only human drivers, there exists one best equilibrium: an equilibrium in which one road is in free-flow, all other used roads are congested, and the number of roads used is minimized [30]. However, when there are multiple types of vehicles that affect congestion differently, as we see in mixed autonomy, there can be multiple equilibria with the same aggregate delay. Concretely, autonomous vehicles can form platoons to increase the free-flow capacity of a road, an effect that is accentuated on freeways. This effect can be used not only to find equilibria that minimize aggregate delay, but also to find equilibria, within the set of delay-minimizing equilibria, that can accommodate extra unforeseen flow demand. As an example, Fig. 5(a) and (b) have the same delay, whether all autonomous flow (green) is on the 405 or is split between the 405 and Coldwater. However, (b) has a higher autonomy level on the 405 than (a) and can therefore accommodate more additional flow on the 405, since it matches its autonomous vehicles with the roads that benefit most from them. If a social
planner can dictate equilibrium routing but does not have perfect information about flow demands, using a routing such as in (b) can make the routing more robust to unforeseen demand. The social cost can further be improved if users of autonomous vehicles are altruistic and are willing to endure a delay longer than the quickest route available in order to improve traffic. Assume the autonomous cars are owned by entities concerned for the public good, and consider a population of autonomous car users who, in exchange for cheaper rides, are willing to accept a delay 20% longer than the quickest available route. In this case, a social planner can send some autonomous users on the 405 so that Coldwater is in an uncongested state, yielding a social cost lower than that of the best-case selfish equilibria (Fig. 5(c)). Before presenting our main results, we also define our network and objective.

Network Objectives. We consider a network of n parallel roads, with indices {1, . . . , n}. For convenience we assume no two roads have identical free-flow latencies and, accordingly, order our roads by increasing free-flow latency. We use [m] = {1, . . . , m} to denote the set of roads with indices 1 through m.
Fig. 5. (a) and (b) represent two selfish equilibria with the same social cost. The equilibrium with all autonomous flow on the 405 freeway (b) is more robust to additional unforeseen flow demand. Solid (resp. dashed) lines characterize the road with only human drivers (resp. autonomous vehicles). The 405 has maximum flow that benefits more from autonomy than the canyons – by routing all autonomous traffic onto the 405, a social planner can make the routing more robust to unforeseen demand. If some autonomous users are altruistic, a social planner can send them on the 405 while Coldwater is uncongested (c), leading to a lower overall travel time.
We consider an inelastic demand of x̄ and ȳ (human-driven and autonomous vehicle flow, respectively), meaning that the total flow of vehicles wishing to travel across the network is independent of the delays on the roads. We describe
the network state by (x, y, s), where x, y ∈ ℝ^n_{≥0} and s ∈ {0, 1}^n. A feasible routing is one for which x_i + y_i ≤ z_i^max(x_i, y_i) for all roads, where z_i^max denotes the maximum flow on road i. We denote the total flow on a road by z_i := x_i + y_i. We are interested in finding a routing, i.e. an allocation of vehicles onto the roads, that minimizes the total latency experienced by all vehicles, C(x, y, s) = Σ_{i∈[n]} (x_i + y_i) ℓ_i(x_i, y_i, s_i), while satisfying the demand, i.e. Σ_{i∈[n]} x_i = x̄ and Σ_{i∈[n]} y_i = ȳ. Further, we constrain this optimization based on the selfishness or altruism of the vehicles, which we will define in Sect. 3.
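Under assumed per-road parameters, the objective is a one-line sum; the helper below reuses the `latency` sketch above (the names are ours, not the paper's).

```python
def social_cost(x, y, s, road_params):
    """Total latency C(x, y, s) = sum_i (x_i + y_i) * l_i(x_i, y_i, s_i).
    road_params: one dict per road with keys d, v_f, rho_max, h, h_bar, L."""
    return sum((xi + yi) * latency(xi, yi, si, **p)
               for xi, yi, si, p in zip(x, y, s, road_params))
```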
3 Main Results
We now make precise the aforementioned notions of selfishness, robustness, and altruism. We develop properties of the resulting equilibria and, using those, provide polynomial-time algorithms for computing optimal vehicle flows.

Selfishness. Human drivers are often perceived as selfish, i.e., they will not take a route with long delay if a quicker route is available to them. If all drivers are selfish, this leads to a Nash, or Wardrop, Equilibrium, in which no driver can achieve a lower travel time by unilaterally switching routes [17,18]. In the case of parallel roads, this means that all selfish users experience the same travel time.

Definition 1. The longest equilibrium road is the road with maximum free-flow latency that has delay equal to the delay experienced by selfish users. Let $m_{EQ}$ denote the index of this road. We then use $\mathrm{NE}(\bar{x}, \bar{y}, m_{EQ})$ to denote the set of Nash Equilibria with longest equilibrium road having index $m_{EQ}$.

Definition 2. The longest used road is the road with maximum free-flow latency that has positive vehicle flow of any type on it. We use $m_{ALL}$ to denote the index of this road; if all vehicles in a network are selfish then $m_{EQ} = m_{ALL}$.

We define the set of Best-case Nash Equilibria (BNE) as the set of feasible routings in equilibrium that minimize the total latency for flow demand $(\bar{x}, \bar{y})$, denoted $\mathrm{BNE}(\bar{x}, \bar{y})$. The following theorem provides properties of the set of BNE for mixed-autonomy roads (for pure roads see [30]).

Theorem 1. There exists a road index $m^*_{EQ}$ such that all routings in the set of BNE have the below properties. Further, this index $m^*_{EQ}$ is the minimum index such that a feasible routing can satisfy the properties:
1. road $m^*_{EQ}$ is in free-flow,
2. roads with index less than $m^*_{EQ}$ are congested with latency $a_{m^*_{EQ}}$, and
3. all roads with index greater than $m^*_{EQ}$ have zero flow.

The running example in Fig. 5 shows that there does not necessarily exist a unique BNE. However, as estimating flow demand can be difficult, we prefer to choose a BNE that can best incorporate additional unforeseen demand. We thus develop the notion of robustness.
Definition 3. The robustness of a routing in the set of Nash Equilibria is how much additional traffic (as a multiple of the original demand flow), at the overall autonomy level, can be routed onto the free-flow road. Formally, if $(x, y) \in \mathrm{NE}(\bar{x}, \bar{y}, m)$, then it has robustness:

$$\beta(x, y, \bar{x}, \bar{y}, s_m) := \begin{cases} \max\, \gamma \ \text{ s.t. } \ x_m + y_m + \gamma(\bar{x} + \bar{y}) \le z_m^{max}(x_m + \gamma\bar{x},\, y_m + \gamma\bar{y}) & s_m = 0 \\ 0 & s_m = 1\,. \end{cases} \quad (2)$$
We use Robust Best Nash Equilibria (RBNE) to refer to the subset of BNE that maximize robustness. The RBNE may not be unique.

Altruism. It is possible that some passengers, especially those in autonomous vehicles, can be incentivized to use routes that are not quickest for the individual, but instead lead to a lower social cost (Fig. 5). We use the term altruism profile to refer to the distribution of the degree to which autonomous vehicles are willing to endure longer routes. For computational reasons, we consider altruism profiles with a finite number of altruism levels.
Fig. 6. Altruism profiles. A fraction ϕ(κ) of autonomous users will not accept delay greater than κ times that of the quickest available route. (a) Users will tolerate delay of up to κ0 times that of the quickest route. (b) Users have multiple altruism levels.
Formally, we define $\phi : \mathbb{R}_{\ge 0} \to [0, 1]$ to represent the altruism profile as a nondecreasing function of a delay multiple, mapped to $[0, 1]$. A volume of $\phi(\kappa)\bar{y}$ of autonomous flow will reject a route incurring delay $\kappa$ times the minimum route delay available, which we denote by $\hat{\ell}_0$. If autonomous users have a uniform altruism level as in Fig. 6(a), we call them $\kappa_0$-altruistic users, where $\kappa_0$ is the maximum multiple of the minimum delay that autonomous users will accept. Users may have differing altruism levels, as in Fig. 6(b). We use $K$ to denote the set of altruism levels, with cardinality $|K|$. Accordingly, a feasible routing $(x, y, s)$ is in the set of Altruistic Nash Equilibria (ANE) if
1. all routes with human traffic have latency $\hat{\ell}_0$, with $\hat{\ell}_0 \le \ell_i(x_i, y_i, s_i)\ \forall i \in [n]$, and
2. for any $\ell \ge 0$, a volume of at least $\phi(\ell/\hat{\ell}_0)\,\bar{y}$ of autonomous traffic experiences a delay less than or equal to $\ell$. Note that it is enough to check this condition for $\ell = \ell_i(x_i, y_i, s_i)$ for all $i$.³

We denote the set of routings at Altruistic Nash Equilibria with demand $(\bar{x}, \bar{y})$, equilibrium latency $\hat{\ell}_0$, and altruism profile $\phi$ as $\mathrm{ANE}(\bar{x}, \bar{y}, \hat{\ell}_0, \phi)$. The set of Best-case Altruistic Nash Equilibria (BANE) is the subset of ANE with routings that minimize total latency.

Theorem 2. There exist a longest equilibrium road $m^*_{EQ}$ and a longest used road $m^*_{ALL}$ with $m^*_{EQ} \le m^*_{ALL}$, such that all routings in BANE have the following properties:
1. roads with index less than $m^*_{EQ}$ are congested,
2. roads with index greater than $m^*_{EQ}$ are in free-flow, and
3. roads with index greater than $m^*_{EQ}$ and less than $m^*_{ALL}$ have maximum flow.

Remark 1. One may be tempted to think that, as in BNE, road $m^*_{EQ}$ will be in free-flow. Further, one might think that $m^*_{EQ}$ is the minimum index such that all selfish traffic can be feasibly routed at Nash Equilibrium. We provide counterexamples to these conjectures in this paper's extended version [14].

Finding the Best-case Nash Equilibria. In general, the Nash Equilibrium constraint is a difficult combinatorial constraint. Theorem 1, however, states that we can characterize the congestion profile of the roads by finding the minimum free-flow road such that Nash Equilibrium can be feasibly achieved. Once this is found, we select a feasible routing that maximizes robustness. This is formalized as follows: find the minimum $m_{EQ}$ such that $\mathrm{NE}(\bar{x}, \bar{y}, m_{EQ})$ is nonempty. Then the RBNE is the set of routings that maximize the robustness of road $m_{EQ}$:

$$m^*_{EQ} = \underset{m_{EQ} \in [n]}{\arg\min}\ a_{m_{EQ}} \ \text{ s.t. } \ \mathrm{NE}(\bar{x}, \bar{y}, m_{EQ}) \ne \emptyset\,, \quad \text{then} \quad \mathrm{BNE}(\bar{x}, \bar{y}, m^*_{EQ}) = \underset{(x, y, s) \in \mathrm{NE}(\bar{x}, \bar{y}, m^*_{EQ})}{\arg\max}\ \beta(x, y, \bar{x}, \bar{y}, s_{m^*_{EQ}})\,. \quad (3)$$
Theorem 3. Finding a solution to (3) is equivalent to finding a routing in the set of RBNE, if any exist. Further, (3) can be solved in $O(N^4)$ time.

We provide a sketch of the algorithm for finding an element of the RBNE:
1. Search over $m_{EQ}$, the longest equilibrium road, to find the minimum $m_{EQ}$ that is feasible for the required flow demands at Nash Equilibrium.
2. For this $m_{EQ}$, find the routing that maximizes the robustness of road $m_{EQ}$. This can be formulated as a linear program, which has cubic computational complexity in the number of roads [32].
³ This is similar to the notion of epsilon-approximate Nash Equilibrium [31], but with populations playing strategies that bring them within some factor of the best strategy available to them, with each population having a different factor.
The following optimization (which can be converted to a linear program) maximizes robustness at a feasible free-flow equilibrium:

$$\max_{x, y \in \mathbb{R}^N_{\ge 0},\ \gamma \ge 0}\ \gamma \quad \text{s.t.} \quad \sum_{i \in [m^*_{EQ}]} x_i = \bar{x}\,, \quad \sum_{i \in [m^*_{EQ}]} y_i = \bar{y}\,, \quad \ell_i(x_i, y_i, 1) = a_{m^*_{EQ}} \ \forall i \in [m^*_{EQ} - 1]\,,$$
$$x_{m^*_{EQ}} + y_{m^*_{EQ}} + \gamma(\bar{x} + \bar{y}) \le z^{max}_{m^*_{EQ}}\big(x_{m^*_{EQ}} + \gamma\bar{x},\ y_{m^*_{EQ}} + \gamma\bar{y}\big)\,.$$
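To make this step concrete, below is a minimal Python sketch of the robustness-maximizing linear program, under the simplifying assumption (ours, not the paper's exact model) that the congested-road latency constraints and the free-flow capacity $z^{max}$ of road $m^*_{EQ}$ are affine in $(x_i, y_i)$; `lat_coef` and `cap_coef` are hypothetical stand-ins for the actual latency and capacity models.

```python
import numpy as np
from scipy.optimize import linprog

def max_robustness_routing(m, a_m, x_bar, y_bar, lat_coef, cap_coef):
    """Maximize gamma over (x_1..x_m, y_1..y_m, gamma) for candidate m_EQ = m.

    lat_coef[i] = (b, c, d): assumed affine latency b + c*x_i + d*y_i on the
    congested roads i = 1..m-1.  cap_coef = (p0, p1, p2): assumed affine
    capacity z_m^max(x, y) = p0 + p1*x + p2*y of the free-flow road m.
    """
    n_var = 2 * m + 1                        # x's, y's, gamma
    c = np.zeros(n_var); c[-1] = -1.0        # linprog minimizes, so -gamma

    A_eq, b_eq = [], []
    row = np.zeros(n_var); row[:m] = 1.0
    A_eq.append(row); b_eq.append(x_bar)     # sum_i x_i = x_bar
    row = np.zeros(n_var); row[m:2 * m] = 1.0
    A_eq.append(row); b_eq.append(y_bar)     # sum_i y_i = y_bar
    for i in range(m - 1):                   # congested roads pinned at a_m
        b, cc, d = lat_coef[i]
        row = np.zeros(n_var); row[i] = cc; row[m + i] = d
        A_eq.append(row); b_eq.append(a_m - b)

    # x_m + y_m + g*(x_bar+y_bar) <= p0 + p1*(x_m+g*x_bar) + p2*(y_m+g*y_bar)
    p0, p1, p2 = cap_coef
    row = np.zeros(n_var)
    row[m - 1] = 1.0 - p1; row[2 * m - 1] = 1.0 - p2
    row[-1] = (x_bar + y_bar) - p1 * x_bar - p2 * y_bar
    res = linprog(c, A_ub=[row], b_ub=[p0], A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * n_var)
    return res  # res.success is False when no feasible equilibrium exists
```

The outer search of Theorem 3 then calls this routine for $m = 1, 2, \ldots, N$ and keeps the first feasible $m$ as $m^*_{EQ}$.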
Finding the Best-case Altruistic Nash Equilibria. To find an element of the BANE, we need to solve:

$$\underset{\substack{m_{EQ} \in [n],\ \hat{\ell}_0 \in [a_{m_{EQ}},\, a_{m_{EQ}+1}),\\ (x, y, s) \in \mathrm{ANE}(\bar{x}, \bar{y}, \hat{\ell}_0, \phi)}}{\arg\min}\ C(x, y)\,. \quad (4)$$
As demonstrated in [14], the BANE no longer has monotonicity in $m_{EQ}$, the index of the longest equilibrium road. Further, road $m_{EQ}$ may not be in free-flow in the set of BANE. We do, however, have a degree of monotonicity in the latency on road $m_{EQ}$: for a fixed $m_{EQ}$, optimal social cost is minimized at the minimum feasible latency on road $m_{EQ}$.

Theorem 4. Finding a solution to (4) is equivalent to finding a routing in the set of BANE, if any exist. Further, (4) can be solved in $O(|K|N^5)$ time, where $|K|$ is the number of altruism levels of autonomous vehicle users.

In the full proof we show that for any given $m_{EQ}$, there are at most $|K|N$ critical points to check for the equilibrium latency $\hat{\ell}_0$. We sketch the algorithm for finding a routing in the set of BANE (a code skeleton follows the list):
1. Enumerate through all options for $m_{EQ}$.
2. For each $m_{EQ}$, enumerate through the $|K|N$ critical points of $\hat{\ell}_0$.
3. For each of these combinations, find the routing that maximizes the autonomous flow on roads $[m_{EQ}]$ while incorporating all human flow on these roads. This can be formulated as a linear program. Then find the optimal autonomous routing on the remaining roads (order $N$ calculations).
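A minimal enumeration skeleton for this procedure is sketched below; `critical_latencies`, `solve_routing_lp`, and `total_latency` are hypothetical helper names we introduce for illustration (the critical-point computation and the per-combination linear program are detailed in [14]), not functions from the paper.

```python
def best_altruistic_ne(n_roads, x_bar, y_bar, phi,
                       critical_latencies, solve_routing_lp, total_latency):
    """Enumerate (m_EQ, latency) candidates; keep the cheapest feasible routing."""
    best_cost, best_routing = float("inf"), None
    for m_eq in range(1, n_roads + 1):                 # step 1
        for ell0 in critical_latencies(m_eq):          # step 2: <= |K|*N points
            routing = solve_routing_lp(m_eq, ell0, x_bar, y_bar, phi)  # step 3
            if routing is not None:
                cost = total_latency(routing)
                if cost < best_cost:
                    best_cost, best_routing = cost, routing
    return best_routing
```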
Fig. 7. Illustration showing unbounded improvement in (a) altruistic NE vs. best selfish NE, and (b) best selfish NE vs. worst NE. Red and green represent regular and autonomous vehicle flow, +’s are worst-case Nash Equilibrium routing, x’s are best-case NE routing, and circles are best-case altruistic NE routings.
Improvement Can Be Unbounded. We motivate the schemes described above by demonstrating that without them, aggregate latency can be arbitrarily worse than with them (Fig. 7). First we show the cost at BNE can be arbitrarily worse than that at BANE, even when autonomous users have arbitrarily low altruism. Consider three roads with free-flow latencies $a_3 > a_2 > a_1$, and human-driven and autonomous flow demands $\bar{x}$ and $\bar{y}$ relative to maximum flows $z_1^{max}(\bar{x}, 0) = \bar{x}$, $z_2^{max}(0, \bar{y}) = \bar{y}$, and $z_3^{max}(\bar{x}, \bar{y}) \ge \bar{x} + \bar{y}$. This means human-driven vehicles can fit on road 1 with autonomous vehicles on road 2, or both vehicle types can fit on roads 1, 2, and 3 if all are selfish, as all vehicles cannot fit on roads 1 and 2 at the same latency. Let autonomous vehicles have uniform altruism level $\kappa_0 = a_2/a_1$, which can be arbitrarily close to 1. In BANE, autonomous vehicles will be on road 2 and human-driven vehicles on road 1. However, in BNE, all flow will be on road 3, as the others are fully congested. Then

$$\frac{C_{BNE}}{C_{BANE}} = \frac{a_3(\bar{x} + \bar{y})}{a_1\bar{x} + a_2\bar{y}} \ge \frac{a_3}{a_2}\,,$$

which can be arbitrarily large. Now consider one road with very large critical density and a small demand $\bar{x}$. This road can serve flow $\bar{x}$ either at free-flow latency at low density or in a highly congested state at high density. Then $\lim_{\bar{x} \to 0^+} \ell_i(\bar{x}, 0, 1)/\ell_i(\bar{x}, 0, 0) \to \infty$, which shows a NE can be arbitrarily worse than a BNE.
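For intuition, a quick numeric check of the ratio above, with illustrative values of our own choosing:

```python
# Worked example of C_BNE / C_BANE = a3*(x+y) / (a1*x + a2*y) >= a3/a2.
a1, a2, a3 = 1.0, 1.2, 100.0      # free-flow latencies, a3 >> a2 > a1
x_bar = y_bar = 1.0               # human and autonomous demands
ratio = a3 * (x_bar + y_bar) / (a1 * x_bar + a2 * y_bar)
print(ratio, ">=", a3 / a2)       # 90.9... >= 83.3..., grows without bound in a3
```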
4 Experiments
Numerical Example of Gained Utility by BNE and BANE. Next, in order to numerically show the utility gained by RBNE (or BNE) against another NE, and by altruistic autonomy, in a more realistic setting, we create a scenario with two parallel residential roads of length 400π and 1000π meters. The speed limit is 13.9 m/s (∼50 kph). All vehicles are 5 m long and keep a headway of max(2, τv) meters to the vehicle in front, where v is the speed of the vehicle and τ is the reaction time (2 s for regular and 1 s for autonomous vehicles). We want to allocate flows of 0.3 regular and 0.3 autonomous vehicles per second.

Fig. 8. The simulation visualization for Road 1. Regular vehicles (red) have larger headway than autonomous ones (green).
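As a rough sanity check of these numbers, one can compute the maximum flow a single lane supports if per-vehicle spacing is car length plus headway and flow equals speed divided by spacing; this flow relation is our reading of the setup, not stated explicitly in the text.

```python
def max_flow(v, tau, car_len=5.0):
    """Single-lane max flow (veh/s) at speed v (m/s) with reaction time tau (s)."""
    spacing = car_len + max(2.0, tau * v)   # 5 m car + max(2, tau*v) m headway
    return v / spacing

print(max_flow(13.9, tau=2.0))  # regular vehicles: ~0.42 veh/s per road
print(max_flow(13.9, tau=1.0))  # autonomous vehicles: ~0.74 veh/s per road
```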
We simulated this scenario using the traffic simulator SUMO [33]. To conveniently simulate the congestion effect, we designed circular, single-lane roads (Fig. 8). We employed the original Krauss car-following model for driver behavior [34]. We solved the optimizations using the CVX framework [35,36]. For comparison, we also computed and simulated a congested NE. In both calculations and simulations, the BNE approximately halves the total cost compared to the congested NE, and the BANE halves it again (Table 1). Figure 9 visualizes the solutions on the flow-latency graph.

Table 1. Effect of altruistic autonomy. The altruism level is κ0 = 2.5. Subscripts T and S denote theoretical and simulated results, respectively. (Gray cells in the original mark congested roads.)

| | NE | RBNE | BANE (κ0 = 2.5) |
|---|---|---|---|
| (x1, y1) | (0.006, 0.252) | (0.3, 0.031) | (0.3, 0.215) |
| (x2, y2) | (0.294, 0.048) | (0, 0.269) | (0, 0.085) |
| C_T(x, y) | 324 | 135.608 | 61.5 |
| C_S(x, y) | 297.052 | 135.87 | 68.64 |

Fig. 9. Three solutions to the scenario with two residential roads are visualized. Blue lines represent the shorter road, whereas brown is the longer one. Solid and dashed lines correspond to fully regular and fully autonomous configurations, respectively.

Simulation of a 4-Parallel-Road Network. To show the properties of all the equilibria we defined, we simulated the following comprehensive scenario: we assume there are 4 parallel roads with different lengths from one point to another, two of which are residential roads where the speed limit is 13.9 m/s, and the other two undivided highways where the limit is 25.0 m/s (∼90 kph). The residential roads and highways have lengths 400π, 600π and 800π, 1000π meters, respectively. Sorting by free-flow latency makes the order of lengths (400π, 800π, 1000π, 600π), and we use this order from this point on. We keep all other parameters the same as in the previous simulation.

With this configuration, we first simulate Road 4 (residential road, 600π meters long) alone. The resulting fundamental diagram and the relationship between vehicle flow and latency are shown in Fig. 10 along with the theoretical curves. There exists a small mismatch between theory and simulation, which is apparent only for $z^{max}$. We conjecture that this stems from: (1) Shape: we approximated circular roads using pentacontagons (regular polygons with 50 edges), so the actual road lengths are slightly different. (2) Initiation: we cannot initiate the vehicles homogeneously over the polygon; instead, we initiate from the corners, which hinders our ability to allocate the maximum possible number of vehicles. (3) Discretization: as the number of vehicles must be an integer, density values are heavily discretized. (4) Randomness: Given
the total number of vehicles for each road, our simulation initiates regular and autonomous vehicles drawn from a Bernoulli process with the given probability distribution. To show the effect of changing κ0 in the altruistic case, we consider human flow demand of 0.4 and autonomous flow demand of 1.2 vehicles per second. For BANE, we simulated both κ0 = 1.25 and κ0 = 1.5, where we assume all autonomous vehicles are uniformly altruistic for simplicity.
Fig. 10. Simulation results of Road 4 for three different α values, with theoretical values (black curves). The noisy appearance of the curves for α = 0.5 is due to randomness, while mismatches on ρ^crit are due to the other imperfections. (a) The fundamental diagram of traffic. Due to simulation limitations regarding initiation, we cannot simulate heavily congested regions; we used more lightly congested traffic measurements to project the heavily congested cases (dashed lines). (b) The flow and latency relationship.
We then simulate the computed flow values using SUMO. We found that often the full calculated flow cannot be allocated in the simulation because of the mismatch between the theoretical maximum flow on a road and the simulated maximum flow (Fig. 10(b)). In these cases, we kept the α-value for that road fixed and allocated flow with respect to the Nash equilibrium. Hence, for BNE, RBNE and BANE, we added the remaining flow onto road $m^*_{EQ}$ without breaking the Nash equilibrium. For NE, we again kept α and the number of utilized roads fixed and swept latency values to satisfy flow conservation. These simulation results, as well as the theoretically computed values of $x$, $y$, $\beta$, and $C(x, y)$, are given in Table 2. Note that BNE and RBNE lead to lower latencies, with a slight difference in the simulation, compared to the presented NE configuration. By introducing altruistic autonomous vehicles, the overall cost can be further decreased. While the theoretical values confirm that RBNE maximizes robustness at the same cost as the other BNEs, the simulated robustness for the RBNE case is smaller than that of the presented BNE solution. To understand this, we first note that the RBNE solution makes the free-flow road with the lowest latency (road 3) fully autonomous in order to allocate more flow. However, as can be seen from Fig. 10, the mismatch between theory and simulation grows larger with increasing autonomy. This prevents the RBNE from allocating its full prospective flow; a better maximum-flow model would resolve this.
Table 2. Theoretical and simulated results.

| | NE | BNE | RBNE | BANE (κ0 = 1.25) | BANE (κ0 = 1.5) |
|---|---|---|---|---|---|
| (x1, y1) | (0.036, 0.277) | (0.075, 0.52) | (0.391, 0) | (0.4, 0.024) | (0.4, 0.041) |
| (x2, y2) | (0.121, 0.311) | (0.2, 0.43) | (0.009, 0.772) | (0, 0.833) | (0, 0.833) |
| (x3, y3) | (0.161, 0.303) | (0.126, 0.25) | (0, 0.428) | (0, 0.343) | (0, 0.325) |
| (x4, y4) | (0.083, 0.309) | (0, 0) | (0, 0) | (0, 0) | (0, 0) |
| C_T(x, y) | 640 | 201.062 | 201.062 | 169.469 | 164.56 |
| C_S(x, y) | 599.072 | 199.985 | 199.575 | 172 | 167.035 |
| β_T | 0 | 0.183 | 0.210 | 0 | 0 |
| β_S | 0 | 0.154 | 0.135 | 0 | 0 |
Fig. 11. Roads are visualized with respect to the vehicle density values obtained from each of the solutions. Red and green dots represent regular and autonomous vehicles, respectively. We do not show headways for simplicity. Black roads are congested, while gray roads are in free-flow. On the right, average vehicle speeds (m/s) are shown.
Figure 11 shows the average speeds of vehicles on the roads and visualizes the vehicle densities for each of the theoretically presented solutions. The congested roads are black and the free-flow roads are gray. This makes clear that the NE solution leads to high latency due to heavy traffic congestion. Other solutions support the same flows with lower vehicle densities thanks to maintaining higher speeds, yielding lower latencies. Further, the RBNE makes the free-flow road fully autonomous to enable the allocation of more additional flow. Lastly, increasing the altruism level in BANE reduces the overall traffic congestion.⁴
⁴ An animated version can be found at http://youtu.be/Hy2S6zbL6Z0 with realistic numerical values for densities, headways, car lengths, speeds, etc.
Effect of Different Altruism Levels. We also analyzed the effect of altruism level under different flow constraints. Figure 12 shows the average latency experienced by each vehicle for three different altruism levels and under different total flow requirements. We solved all cases using the BANE optimization. The first heat map shows the non-altruistic case, where the average latencies are exactly equal to one of the free-flow latencies; this is a direct result of Theorem 1. In the altruistic cases, the average latency experienced by vehicles is lower, and increasing the altruism level helps decrease the latency further.

Fig. 12. Average latency (seconds) obtained from BANE optimization for $\bar{x}, \bar{y} \in [0, 1.5]$ cars per second and $\kappa_0 \in \{1, 1.25, 1.5\}$. Black regions represent infeasible demand.
5 Discussion
Summary. In this work, we develop a new model to incorporate the effect of autonomy in parallel road networks based on the fundamental diagram of traffic. We further define a notion of robustness indicating the capability of a Nash Equilibrium to be resilient to additional unforeseen flow demand. We design an optimization-based polynomial-time algorithm to find a robust Nash Equilibrium that minimizes overall latency. We then define the concept of altruistic autonomy to model autonomous users' willingness to accept higher latencies, and we provide a polynomial-time algorithm that utilizes altruism to minimize overall latency. We demonstrate the improvements our algorithms provide using simulations.

Limitations and Future Directions. In this work we consider parallel networks; we wish to also study more complex transportation networks that include other modes of transportation, with congestion and altruism modeled for each mode. Further, in this work we use the original Krauss model [34] for car following; more complex and expressive models, such as the Intelligent Driver Model [37], could yield other interesting outcomes. This work also considers only isolated equilibria; investigating how to move the network from one equilibrium to another is a valuable direction. Finally, our work is a first step in formalizing altruistic autonomy. We would like to perform more realistic case studies to understand emergent phenomena in mixed-autonomy roads.

Acknowledgments. Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. This work was also supported by NSF grant CCF-1755808 and UC Office of the President grant LFR-18-548175.
References
1. Askari, A., Farias, D.A., et al.: Effect of adaptive and cooperative adaptive cruise control on throughput of signalized arterials. In: IEEE Intelligent Vehicles Symposium (2017)
2. Lioris, J., Pedarsani, R., et al.: Platoons of connected vehicles can double throughput in urban roads. Transp. Res. Part C Emerging Technol. 77, 292–305 (2017)
3. Motie, M., Savla, K.: Throughput analysis of a horizontal traffic queue under safe car following models. In: IEEE Conference on Decision and Control (2016)
4. Adler, A., Miculescu, D., Karaman, S.: Optimal policies for platooning and ride sharing in autonomy-enabled transportation. In: Workshop on the Algorithmic Foundations of Robotics (WAFR) (2016)
5. Lazar, D., Chandrasekher, K., Pedarsani, R., Sadigh, D.: Maximizing road capacity using cars that influence people. In: IEEE Conference on Decision and Control. IEEE (2018, accepted)
6. Wu, C., Kreidieh, A., Vinitsky, E., Bayen, A.M.: Emergent behaviors in mixed-autonomy traffic. In: Conference on Robot Learning, pp. 398–407 (2017)
7. Wu, C., Bayen, A.M., Mehta, A.: Stabilizing traffic with autonomous vehicles. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–7. IEEE (2018)
8. Cui, S., Seibold, B., Stern, R., Work, D.B.: Stabilizing traffic flow via a single autonomous vehicle: possibilities and limitations. In: IEEE Intelligent Vehicles Symposium (2017)
9. Stern, R.E., et al.: Dissipation of stop-and-go waves via control of autonomous vehicles: field experiments. Transp. Res. Part C Emerging Technol. 89, 205–221 (2018)
10. Merugu, D., Prabhakar, B.S., Rama, N.: An incentive mechanism for decongesting the roads: a pilot program in Bangalore. In: Proceedings of ACM NetEcon Workshop. Citeseer (2009)
11. Pluntke, C., Prabhakar, B.: INSINC: a platform for managing peak demand in public transit. JOURNEYS, Land Transp. Authority Acad., Singapore (2013)
12. Nie, Y.M., Yin, Y.: Managing rush hour travel choices with tradable credit scheme. Transp. Res. Part B Methodol. 50, 1–19 (2013)
13. Daganzo, C.F.: The cell transmission model: a dynamic representation of highway traffic consistent with the hydrodynamic theory. Transp. Res. Part B Methodol. 28, 269–287 (1994)
14. Bıyık, E., Lazar, D.A., Pedarsani, R., Sadigh, D.: Altruistic autonomy: beating congestion on shared roads. arXiv preprint arXiv:1810.11978 (2018)
15. Hearn, D.W., Lawphongpanich, S., Nguyen, S.: Convex programming formulations of the asymmetric traffic assignment problem. Transp. Res. Part B Methodol. 18, 357–365 (1984)
16. Dafermos, S.C.: The traffic assignment problem for multiclass-user transportation networks. Transp. Sci. 6(1), 73–87 (1972)
17. Florian, M., Hearn, D.: Network equilibrium models and algorithms. In: Handbooks in Operations Research and Management Science, vol. 8, pp. 485–550 (1995)
18. Dafermos, S.: Traffic equilibrium and variational inequalities. Transp. Sci. 14(1), 42–54 (1980)
19. Roughgarden, T.: Stackelberg scheduling strategies. SIAM J. Comput. 33(2), 332–350 (2004)
20. Bonifaci, V., Harks, T., Schäfer, G.: Stackelberg routing in arbitrary networks. Math. Oper. Res. 35(2), 330–346 (2010)
21. Swamy, C.: The effectiveness of Stackelberg strategies and tolls for network congestion games. ACM Trans. Algorithms (TALG) 8, 1–19 (2012)
22. Roughgarden, T., Tardos, É.: How bad is selfish routing? J. ACM (JACM) 49(2), 236–259 (2002)
23. Correa, J.R., Schulz, A.S., Stier-Moses, N.E.: A geometric approach to the price of anarchy in nonatomic congestion games. Games Econ. Behav. 64, 457–469 (2008)
24. Perakis, G.: The "price of anarchy" under nonlinear and asymmetric costs. Math. Oper. Res. 32(3), 614–628 (2007)
25. Lazar, D.A., Coogan, S., Pedarsani, R.: Routing for traffic networks with mixed autonomy. arXiv preprint arXiv:1809.01283 (2018)
26. Daganzo, C.F.: A continuum theory of traffic dynamics for freeways with special lanes. Transp. Res. Part B Methodol. 31, 83–102 (1997)
27. Zhang, H., Jin, W.: Kinematic wave traffic flow model for mixed traffic. Transp. Res. Rec. J. Transp. Res. Board 1802, 197–204 (2002)
28. Logghe, S., Immers, L.: Heterogeneous traffic flow modelling with the LWR model using passenger-car equivalents. In: 10th World Congress on ITS (2003)
29. Fan, S., Work, D.B.: A heterogeneous multiclass traffic flow model with creeping. SIAM J. Appl. Math. 75(2), 813–835 (2015)
30. Krichene, W., Reilly, J.D., Amin, S., Bayen, A.M.: Stackelberg routing on parallel transportation networks. In: Handbook of Dynamic Game Theory (2017)
31. Nisan, N., Roughgarden, T., Tardos, É., Vazirani, V.V.: Algorithmic Game Theory. Cambridge University Press, New York (2007)
32. Gonzaga, C.C.: Path-following methods for linear programming. SIAM Rev. 34(2), 167–224 (1992)
33. Krajzewicz, D., Erdmann, J., Behrisch, M., Bieker, L.: Recent development and applications of SUMO - Simulation of Urban MObility. Int. J. Adv. Syst. Meas. 5, 128–132 (2012)
34. Krauß, S., Wagner, P., Gawron, C.: Metastable states in a microscopic model of traffic flow. Phys. Rev. E 55(5), 5597 (1997)
35. Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming, version 2.1, March 2014. http://cvxr.com/cvx
36. Grant, M., Boyd, S.: Graph implementations for nonsmooth convex programs. In: Blondel, V.D., Boyd, S.P., Kimura, H. (eds.) Recent Advances in Learning and Control. Springer, London (2008)
37. Treiber, M., Hennecke, A., Helbing, D.: Congested traffic states in empirical observations and microscopic simulations. Phys. Rev. E 62(2), 1805 (2000)
Operation and Imitation Under Safety-Aware Shared Control

Alexander Broad¹,²(B), Todd Murphey¹, and Brenna Argall¹,²

¹ Northwestern University, Evanston, IL 60208, USA
[email protected]
² Shirley Ryan AbilityLab, Chicago, IL 60611, USA
Abstract. We describe a shared control methodology that can, without knowledge of the task, be used to improve a human’s control of a dynamic system, be used as a training mechanism, and be used in conjunction with Imitation Learning to generate autonomous policies that recreate novel behaviors. Our algorithm introduces autonomy that assists the human partner by enforcing safety and stability constraints. The autonomous agent has no a priori knowledge of the desired task and therefore only adds control information when there is concern for the safety of the system. We evaluate the efficacy of our approach with a human subjects study consisting of 20 participants. We find that our shared control algorithm significantly improves the rate at which users are able to successfully execute novel behaviors. Experimental results suggest that the benefits of our safety-aware shared control algorithm also extend to the human partner’s understanding of the system and their control skill. Finally, we demonstrate how a combination of our safety-aware shared control algorithm and Imitation Learning can be used to autonomously recreate the demonstrated behaviors.
Keywords: Human-robot interaction · Machine learning · Optimization and optimal control

1 Introduction
Mechanical devices can be used to extend the abilities of a human operator in many domains, from travel to manufacturing to surgery. In this work, we are interested in developing a shared control methodology that further enhances the ability of a human user to operate dynamic systems in scenarios that would otherwise prove challenging due to the complexity of the control problem (e.g., modern aircraft), the complexity of the environment (e.g., navigation in a densely populated area), or the skill of the user (e.g., due to physical injury). A particularly motivating domain is assistive and rehabilitation medicine. Consider, for example, the use of an exoskeleton in rehabilitating the leg muscles of a spinal cord injured subject [1]. While these devices are designed explicitly to aid a user in recovering from trauma by rebuilding lost muscular control,
the complexity of the machine itself often requires that one or more physical therapists assist the subject in operating the device during therapy (e.g., to provide stabilization). Artificial intelligence can further improve the efficacy of these devices by incorporating autonomy into the control loop to reduce the burden on the human user. That is, if the autonomous agent accounts for subpar (and potentially dangerous) control input, the human operator and therapist(s) are freed to focus on important therapeutic skills.

In this work, we improve the effectiveness of joint human-machine systems by developing a safety-aware shared control (SaSC) algorithm that assists a human operator in controlling a dynamic machine without a priori knowledge of the human's desired objective. In general, shared control is a paradigm that can be used to produce human-machine systems that are more capable than either partner on their own [2]. However, in practice, shared control systems often require the autonomous agent to know the goal (or a set of discrete, potential goals). While a priori knowledge of a desired set of goals may be a valid assumption in some domains, it can also be a severely limiting assumption in many other scenarios. Therefore, instead of allocating control based on whether the human operator's input will improve the likelihood of achieving a goal, we aim to allocate control based on whether the user's control commands will lead to dangerous states and actions in the future. Under this paradigm, the autonomous partner develops a control strategy that is only concerned with the safety of the system, and is otherwise indifferent to the control of the human operator. Our safety-aware shared control algorithm can be used to improve the efficacy of human-machine systems in two main ways:

G1. Improve a human operator's control, and understanding, of a dynamic system in complex and potentially unsafe environments.
G2. Improve the value of Imitation Learning (IL) in the same domains, both by facilitating demonstration and addressing the covariate shift problem [3].

Item G1 is important because control challenges can stem from a variety of issues including the inherent complexity of the system, the required fidelity in the control signal, or the physical limitations of the human partner. For this reason, there is often an explicit need for assistance from an autonomous partner in controlling the mechanical device. Item G2 is important because Imitation Learning can be used to further extend the capabilities of human-machine systems; however, demonstration may not always be feasible for the human partner given the aforementioned control challenges.

The main contribution of this work is a safety-aware shared control algorithm that improves the efficacy of human-machine collaboration in complex environments, with the key feature that it is possible for the user's desired objective to remain unknown. In this algorithm, an autonomous partner accounts for system- and environment-based safety concerns, thereby freeing the human operator to focus their mental and physical capacities on achieving high-level goals. Our algorithm (Sect. 3) describes a novel interaction paradigm that extends the viability of complex human-machine systems in various scenarios including those
in which the human’s skill is limited or impaired. We also provide an analysis of our algorithm (Sect. 4) with a human subjects study consisting of 20 participants conducted in a series of challenging simulated environments. Finally, we show how the same algorithm can be used to improve the human operator’s control skill and the power of Imitation Learning (Sect. 5).
2 Background and Related Work
The majority of related work in the shared control literature focuses on the development and analysis of techniques that statically and/or dynamically allocate control between human and robot partners. The main objective in these works is to improve system performance while reducing the control burden on the human operator [2]. In some applications, the autonomous partner is allocated the majority of the low-level control while the human operator acts in a supervisory role [4], while in others, the roles are reversed and the autonomous partner takes on the supervisory role [5]. There is also related work in which neither partner acts as a supervisor and instead control is shared via a mixed-initiative paradigm. For example, researchers have developed techniques in which the autonomous partner is explicitly aware of the human operator's intention [6] as well as techniques in which the autonomous partner has an implicit understanding of the human operator's control policy [7]. While these examples aim to improve task performance, the main motivation for our work is to extend the human operator's control ability in domains where safety is a primary concern. For this reason, the most closely related shared control paradigms are safety, or context, aware. In this area, researchers have explored the impact of autonomous obstacle avoidance on teleoperated robots in search and rescue [5]. Additionally, safety is a particular concern when the human and robot partner are co-located, such as with autonomous vehicles [8]. Co-located partners are also common in assistive robotics, where researchers have developed environment-aware smart walkers [9] to help visually impaired people avoid dynamic obstacles. Related to our goal of generating autonomous policies that recreate the behavior demonstrated during system operation, there is prior work in the field of Imitation Learning (IL). Most commonly, the demonstration data is provided by a human partner [10], though it can come from a variety of sources including trajectory optimization and simulation [11]. Example data is commonly assumed to come from an expert (or optimal) controller [12]; however, researchers also have explored techniques that use demonstrations provided by novice (or suboptimal) sources [13]. In this work we describe how Imitation Learning can be used even when the human operator is not able to provide demonstration data on their own. We further describe how our safety-aware shared control algorithm can be used to address the covariate shift problem [3], a common issue in Imitation Learning that stems from the fact that the data used to train the policy may come from a different distribution than the data observed at runtime. Lastly, there is also related work in the subfield of safety-aware Reinforcement Learning (RL). In this domain, safe autonomous control policies are learned from
Fig. 1. Flow chart of our Safety-aware Shared Control (SaSC) algorithm. The operator focuses on task objectives, while the autonomy accounts for safety.
exploration instead of demonstration. Examples include techniques that enforce safety during the exploration process through linear temporal logic [14] and formal methods [15]. Researchers also have explored model-free RL as a paradigm to integrate the feedback of a human operator through the reward structure during the learning process, and to share control with a human operator at run-time [16]. Finally, researchers have considered learning safe policies from demonstration by computing probabilistic performance bounds that allow an autonomous agent to adaptively sample demonstration data to ensure safety in the learned policy [17].
3 Safety-Aware Shared Control
In this work, we contribute a specific implementation of a class of algorithms that we refer to as Safety-aware Shared Control (SaSC). SaSC can help users operate dynamic systems in challenging and potentially unsafe environments that would normally require expert-level control. An analogous engineering solution is fly-by-wire control of modern aircraft [18]. In these systems, the onboard computer accounts for many of the intricacies of the control problem, allowing the pilot to focus on high-level tasks. Our SaSC algorithm takes this idea a step further to account for unsafe actions related to both the system dynamics and the environment. Safety-aware shared control consists of two components:
1. A safe policy generation method for the autonomous agent.
2. A control allocation algorithm to safely integrate input from both partners.
The high-level algorithm is depicted in Fig. 1. Points 1 and 2 are described in more detail in Subsects. 3.1 and 3.2. Relevant policy notation is below:

• $\pi_h$ : Human operator's policy
• $\pi_{sa\text{-}a}$ : Safety-aware autonomous policy (has no task information)
• $\pi_{sa\text{-}sc}$ : Safety-aware shared control policy
• $\pi_{il}$ : Imitation Learning policy
• $\pi_{il\text{-}sc}$ : Imitation Learning policy under safety-aware shared control

3.1 Safety-Aware Autonomous Policy Generation
To implement our safety-aware shared control algorithm, we must first develop an autonomous control policy that is capable of safely controlling the dynamic system in question (policy $\pi_{sa\text{-}a}$ in Sect. 3). In this work, we utilize Model-based Optimal Control (MbOC) [19]. MbOC learns a model of the dynamic system directly from data, which is then incorporated into an optimal control algorithm to produce autonomous policies. Here, we learn a model of the system and control dynamics through an approximation to the Koopman operator [20]. Further details of this modeling technique are presented in Sect. 4.2. Given a learned model of the system dynamics, we can then compute a control policy by solving the finite-horizon nonlinear Model Predictive Control (MPC) [21] problem defined by a cost function of the form

$$J(x(t), u(t)) = \int_{t=0}^{t_f} l(x(t), u(t))\,dt + l_{t_f}(x(t_f)) \quad (1)$$

where

$$\dot{x}(t) = f(x(t), u(t)), \quad x(0) = x_0 \quad (2)$$
and f defines the nonlinear system dynamics, x(t) and u(t) are the state and control trajectories, and $l$ and $l_{t_f}$ are the running and terminal costs, respectively. To solve the optimal control problem, we use Sequential Action Control (SAC) [22], an algorithm designed to iteratively find a single action (and the time to act) that maximally improves performance. Data-driven model-based, and model-free, optimal control algorithms have been experimentally validated on numerous dynamic systems [23], including joint human-machine systems [7,16]. To address the main focus of this work, we specify a control objective that relates only to system safety. Specifically, we use quadratic costs to deter unstable states and higher-order polynomial penalty functions to keep the system away from dangerous locations. Notably, the cost function does not incorporate any task information. Therefore, if the autonomous partner's policy is applied directly, the system will work to maintain a safe state but will not move towards any specific goal. The cost function used in our work is described in Sect. 4.2.

3.2 Safety-Aware Dynamic Control Allocation
To assist the human partner in safely controlling the dynamic system, we define an outer-loop control allocation algorithm (policy $\pi_{sa\text{-}sc}$ in Sect. 3) that incorporates input signals from the human and autonomous partners.
Algorithm 1. Safety-Aware Maxwell's Demon Algorithm
1: if unsafe(system, environment) then
2:   u = u_sa-a
3: else
4:   if ⟨u_h, u_sa-a⟩ ≥ 0 then    ▷ lines 4–8: the original Maxwell's Demon Algorithm
5:     u = u_h
6:   else
7:     u = 0
8:   end if
9: end if
When used with an IL policy, u_h is replaced by u_il.
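The algorithm is small enough to transcribe directly; a minimal Python sketch, assuming vector-valued control inputs and a boolean safety check supplied by the caller:

```python
import numpy as np

def sa_mda(u_h, u_sa, is_unsafe):
    """Safety-Aware Maxwell's Demon Algorithm (Algorithm 1).

    u_h: human (or imitation-learned) input; u_sa: safety-aware autonomous
    input; is_unsafe: result of the unsafe(system, environment) check.
    """
    if is_unsafe:
        return u_sa                     # autonomy takes over near danger
    if np.inner(u_h, u_sa) >= 0:        # original MDA: accept aligned input
        return u_h
    return np.zeros_like(u_h)           # reject opposing input
```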
There is, of course, a balance to strike between the control authority given to each partner. If the outer-loop controller is too permissive and accepts a significant portion of the human operator's input, it may do a poor job enforcing the necessary safety requirements. However, if the outer-loop controller is too stringent, it can negatively impact the ability of the human operator to produce their desired motion. In this work, we balance the control authority between the human and autonomous partners so as to increase the authority of the human operator when the system is deemed to be in a safe state, and increase the authority of the autonomy when the system is deemed to be in a dangerous state. Here, the autonomy adds information into the system only when it is necessary to ensure safety. Specifically, we allocate control using a variant of Maxwell's Demon Algorithm (MDA) [24]. MDA uses information from an optimal control algorithm as a guide by which to evaluate the input from another source. In this work, we contribute a safety-aware variant that we call Safety-Aware MDA (Sa-MDA), described in full in Algorithm 1. Here, unsafe(system, environment) describes whether the system is in an unstable or dangerous configuration with respect to the environment (e.g., through barrier functions, see Sect. 4.2), $u_h$ is the input from the human partner, $u_a$ is the input produced by the autonomy, $\langle \cdot, \cdot \rangle$ is the inner product, and $u$ is the applied control. When a learned policy is used to mimic a demonstration, $u_h$ is replaced with $u_{il}$ (see Sect. 3.3).

3.3 Safety-Aware Shared Control Imitation Learning
We now describe how we use the data collected under shared control to produce autonomous policies through Imitation Learning [10]. The goal here is to learn a policy πil that mimics the behavior demonstrated by the human operator. To achieve this goal, we treat the data collected under shared control πsa−sc as a supervisor in the policy learning process. Notably, this data, and the associated learned policy, now contain task-relevant information, as provided by the human operator during demonstration. Our goal, then, is to learn an autonomous policy that minimizes the following objective
$$J(\pi_{il}, \pi_h) = \min_\theta \sum_{s \in \xi \in D} \|\pi_{il}(s) - \pi_h(s)\|_2^2 \quad (3)$$
where J is the cost to be minimized and s is the state. By minimizing J we learn a policy that closely matches the policy demonstrated by the human partner. π(s) : s → u defines a control policy which is parameterized by θ, πh represents the (supervisor) human’s policy and πil represents the Imitation Learning policy. The autonomous policy is learned from a set D of trajectory data ξ recorded during the demonstration phase. To generate πil , we use behavior cloning [25], a standard offline imitation learning algorithm. Details of our specific implementation are provided in Sect. 4.2. As described in Sect. 2, behavior cloning can fail to reproduce the desired behavior due to the covariate shift problem [3,25]. We address this issue with an autonomous policy πil−sc that combines the learned policy πil with the safety-aware autonomous policy πsa−a (Algorithm 1). By incorporating the same shared control algorithm used during data collection, we encourage the system to operate in a similar distribution of the state space to what was observed during demonstration. One can view this solution as a shared control paradigm in which the control is shared between two autonomous agents: the autonomy mimicking the human control and the autonomy enforcing safety constraints.
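Under this view, the deployed policy $\pi_{il\text{-}sc}$ is simply the cloned policy filtered through the same allocation rule; a one-line sketch using the `sa_mda` function from Sect. 3.2 (the function names here are ours):

```python
def pi_il_sc(state, pi_il, pi_sa, unsafe):
    """Imitation Learning policy under safety-aware shared control."""
    return sa_mda(pi_il(state), pi_sa(state), unsafe(state))
```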
4 Empirical Evaluation

To evaluate the efficacy of our algorithm, we perform a human subjects study on a simulated system exhibiting nonlinear dynamics in complex environments. In this evaluation, we compare the efficacy of each policy presented in Sect. 3 except $\pi_{sa\text{-}a}$, which is never executed and is only used to keep $\pi_h$ and $\pi_{il}$ safe.

4.1 Experimental System
The experimental system consists of a simulated “lunar lander” (Fig. 2), chosen to demonstrate the impact that shared control can have on the safety of a joint human-machine system when the control problem and environment are
Fig. 2. Visualization of lunar lander (enlarged) and experimental environments. A trial is complete when the lander moves across the green line boundary.
complex. The lunar lander exhibits nonlinear dynamics and can easily become unstable as it rotates away from its point of equilibrium. Additionally, two of the experimental environments contain obstacles that must be avoided to stay safe (see Fig. 2). The system and environment are implemented in the Box2D physics engine based on the environment defined in OpenAI's Gym [26]. The lunar lander is defined by a six-dimensional vector which includes the 2D position and heading ($x_{1-3}$) and their rates of change ($x_{4-6}$). The control input is a continuous two-dimensional vector which represents the throttle of the main ($u_1$) and rotational ($u_2$) thrusters. The first environment includes only the lunar lander and the ground surface (Fig. 2, left). This environment illuminates the challenges associated with maintaining the stability of a complex dynamic system while simultaneously executing unspecified behaviors. The second environment incorporates dynamic obstacles that obstruct the motion of the system (Fig. 2, middle). In this environment, a series of circular obstacles move across the screen at the same height as the lander (one at a time). The third environment includes two static obstacles that force the operator to navigate through a narrow passageway, increasing the required control fidelity (Fig. 2, right).

4.2 Implementation Details
Model Learning. We learn a model of the system and control dynamics using an approximation to the Koopman operator [27], which has been validated on numerous systems [23], including human-machine systems [7]. The Koopman operator is a linear operator that can model all relevant features of nonlinear dynamical systems by operating on a nominally infinite-dimensional representation of the state [20]. To approximate the true Koopman operator one must define a basis. In this work, we define $\phi = [1, x_1, x_2, x_3, x_4, x_5, x_6, u_1, u_2, u_1 x_1, u_1 x_2, u_1 x_3, u_1 x_4, u_1 x_5, u_1 x_6, u_2 x_1, u_2 x_2, u_2 x_3, u_2 x_4, u_2 x_5, u_2 x_6, u_1 \cos(x_3), u_1 \sin(x_3), u_2 \cos(x_3), u_2 \sin(x_3)]$, where $x_{1-6}$ represent the system state variables and $u_{1-2}$ represent the control input variables. The specific basis elements used in this work are chosen empirically and represent a reduced set of features that describe the full state and control space, as well as the interaction between the user's input and the state. Data-driven methods (e.g., sparsity-promoting DMD [28]) can be used to automatically choose a proper set of basis functions.
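The basis itself is straightforward to construct; a sketch matching the listing above, with the ordering made explicit:

```python
import numpy as np

def koopman_basis(x, u):
    """Basis phi(x, u) for the Koopman approximation (Sect. 4.2).

    x: 6-D state (x1..x6); u: 2-D control (u1, u2), ordered as in the text.
    """
    x, u = np.asarray(x, float), np.asarray(u, float)
    cross = [ui * xi for ui in u for xi in x]          # u1*x1..u1*x6, u2*x1..u2*x6
    trig = [u[0] * np.cos(x[2]), u[0] * np.sin(x[2]),
            u[1] * np.cos(x[2]), u[1] * np.sin(x[2])]
    return np.concatenate([[1.0], x, u, cross, trig])  # 25 features total
```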
Safety-Aware Autonomous Policy Generation. To compute an autonomous policy that is solely concerned with the safety of the system, we define a cost function (Eq. (1)) that considers two notions of safety: stability around points of equilibrium and collision avoidance,

$$l(x) = \underbrace{\mathrm{Diag}[0,\, 0,\, 15 \cdot x_3,\, 1 \cdot x_4,\, 1 \cdot x_5,\, 10 \cdot x_6]^2}_{\text{stabilization}} + \underbrace{\mathrm{Diag}[(x_1 - o_1),\, (x_2 - o_2),\, 0,\, 0,\, 0,\, 0]^8}_{\text{obstacle avoidance}}$$

where $(o_1, o_2)$ is the position of the nearest obstacle in 2D space. This cost function (i) penalizes states that are far from points of equilibrium using a quadratic
cost and (ii) prevents the system from entering dangerous portions of the state space using polynomial barrier functions [29]. The stabilization term ensures that the lunar lander does not rotate too far away from upright; if this happens, the lander's main thruster can no longer be used to counteract gravity, a situation that commonly leads to catastrophic failure in rockets. We therefore penalize both the position and velocity of the heading term. We additionally penalize the x and y velocity terms, as momentum can significantly reduce the time a controller has to avoid collision. Finally, the obstacle avoidance term simply acts to repel the system from the nearest obstacle. Importantly, if this policy is applied on its own (without any input from the human partner), the lander will simply attempt to hover in a safe region and will not advance towards any goal state.

Notably, the safety-aware autonomous policy only adds information into the control loop when the system is deemed to be unsafe (see Algorithm 1). Otherwise the user's commands are, at most, simply blocked by the autonomy. This can be thought of as an accept-reject-replace shared control paradigm [30]. We define the unsafe function based on an empirically chosen distance to the nearest obstacle. Therefore, if the system gets too close to an obstacle, the autonomy's signal is sent to the system; otherwise the human's input is accepted or rejected, according to the MDA filter. Here we note that the structure and weights in the defined cost function, as well as the pre-defined distance metric, are specific to the experimental system; however, there are generalizable principles that can be used to develop similar cost functions for other systems. For example, system stability can generally be improved by defining costs that help reduce dynamic features to kinematic features, while obstacle avoidance terms can be defined using additional information from the learned system model.

Imitation Learning. A neural network is used to learn a control policy that mimics successful trials demonstrated by the human partner. The input is the current state ($x_{1-6}$) and the output is the control signal ($u_{1-2}$) sent to the system. The control signal is discretized (−1.0 to 1.0 in increments of 0.5) and the problem is therefore cast as classification instead of regression. There are three hidden layers in the neural network; the first has 32 nodes, and the following two layers have 64 nodes each. Each hidden layer uses ReLU as its activation function, and the final layer uses a softmax activation. We use categorical cross entropy to compute the loss and RMSProp as the optimizer.
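For concreteness, one way to realize this network in Keras; the framework choice and the joint 25-way output encoding (five discrete values per thruster) are our assumptions, since the text specifies only the layer sizes, activations, loss, and optimizer.

```python
import tensorflow as tf

n_actions = 5 * 5  # hypothetical: one class per joint (u1, u2) discretization

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(6,)),  # state x1..x6
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(n_actions, activation="softmax"),
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```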
4.3 Study Protocol
The human subjects study consisted of 20 participants (16 female, 4 male). All subjects gave their informed consent and the experiment was approved by Northwestern University’s Institutional Review Board. Each participant provided demonstrations of novel behaviors in all three of the environments, under both a user-only control paradigm and our safety-aware shared control paradigm. There was no goal location specified to the participants; instead a trial was considered complete when the human operator navigated the lunar lander across a
barrier defined by the green line in the environment (see Fig. 2). The specific trajectory taken by the lander during a demonstration was up to the participant. The operator used a PS3 controller to interact with the system. The joystick controlled by the participant’s dominant hand fired the main thruster, and the opposing joystick fired the side thrusters. Subjects were asked to provide 10 demonstrations per environment (3 total) and per control paradigm (2 total), resulting in a total of 60 demonstrations per participant. The environments were presented in a randomized and counterbalanced order. Participants were assigned to one of two groups, where Group A (10 subjects) provided demonstration data in each environment under user-only control first, and Group B (10 subjects) provided demonstration data under shared control first. Group assignment was random and balanced across subjects.
5 Experimental Results
We find that our SaSC algorithm significantly improves the user's skill with respect to a no-assistance baseline (Fig. 3). Additionally, we find that our SaSC algorithm can be used as a training mechanism to improve a subject's understanding and control of the dynamic system (Table 2). Finally, we show how the same shared control technique can be used to extend the Imitation Learning paradigm (Figs. 4 and 5). These findings are discussed in detail in the following subsections.

5.1 Safety-Aware Shared Control Enables Successful Demonstration
To address item G1, the primary metric we evaluate is a binary indicator of control competency: the occurrence of safe, successful demonstrations, indicated by navigating the lunar lander beyond the green border. We first analyze data collected from all three experimental environments together. We then segment the data based on the specific environment in which it was collected and reperform our analysis (Fig. 3). We use the non-parametric Wilcoxon signed-rank test to statistically analyze the data. The results of the statistical tests revealed that our described shared control paradigm significantly improved the human partner’s control skill (p < 0.005). In particular, we find that participants provided safe demonstrations of the desired behavior in 96.0% of the trials produced under the shared control paradigm versus 38.5% of the trials produced under the user-only control paradigm. Additionally, the statistically significant result holds when we compare the control paradigms in each experimental environment separately (p < 0.005 in all cases). Recall that the shared control paradigm does not provide any task -related assistance, but rather only safety-related assistance. We therefore interpret the increase in task success as evidence that our SaSC algorithm helps subjects exhibit greater control skill, and an associated increased ability to provide demonstrations of novel behaviors. Additionally, we note that in some experiments conducted under user-only control, subjects were not able to provide any
Fig. 3. Average fraction of successful trials under each control paradigm in each environment. The plots represent data collected in all environments (left), and broken down by each individual environment (right). In all cases, participants under the shared control paradigm provide safe demonstrations significantly more often than under the user-only control paradigm (***: p < 0.005).
successful demonstrations in a given environment. This suggests that safety-aware shared control may be a requirement for functional usage of dynamic systems in some of the more challenging domains that motivate this work. This finding is also important when considering our ability to train autonomous policies that mimic behaviors demonstrated by the human operator (see Sect. 5.4).

5.2 Impact of Safety-Aware Shared Control on Trajectory Features
As discussed in Sect. 3.2, safety-aware shared control impacts features of the trajectories produced by the human operator. For example, by rejecting a majority of the user’s inputs the SaSC algorithm can ensure system safety, but it will Table 1. Mean and standard deviation of trajectory metrics computed from successful demonstrations. Path length is not impacted significantly by SaSC, but shared control does result in trajectories that take longer to execute, have slower final speeds, and are more upright (stable) in their final configurations. Metric
Control Env 1.
Env 2.
Env 3.
Path length (m)
User Shared
30.0 ± 2.4 27.7 ± 1.1
33.8 ± 4.8 34.4 ± 2.3
28.3 ± 0.8 30.5 ± 2.2
Trial time (s)
User Shared
15.5 ± 3.5 27.0 ± 7.8
18.6 ± 5.6 30.0 ± 7.4
22.4 ± 5.9 30.4 ± 6.8
Final speed (m/s)
User Shared
23.6 ± 6.8 7.9 ± 2.8
23.8 ± 6.2 9.6 ± 4.5
14.5 ± 5.0 7.9 ± 1.9
Final heading (deg) User Shared
45.5 ± 73.5 79.5 ± 79.8 39.6 ± 46.0 2.7 ± 10.4 6.4 ± 22.0 3.5 ± 5.9
916
A. Broad et al.
not allow the user to execute desired behaviors. To evaluate how our SaSC algorithm impacts the user’s abilities to demonstrate a novel behavior, we compare a number of quantitative metrics that go beyond safety and relate specifically to features of the trajectories (Table 1). Here, we analyze only the successful demonstrations provided under each control paradigm. Our analysis shows that the safety-aware shared control paradigm impacts not only the ability of a user to provide demonstrations, but also how they provide demonstrations. In all environments, we find that participants produced trajectories of nearly equal length under both control paradigms. However, under the user-only control paradigm participants produced successful demonstrations in less time, with a greater final speed and in a state that is rotated further away from the point of equilibrium than under the shared control paradigm. Moreover, in Environments 1 and 2 (the less constrained environments), participants were able to produce demonstrations that were safe over the course of the demonstrated trajectory, but unstable in the final configuration. While these were counted as successful demonstrations, they require less control skill than trajectories that end in a stable configuration. This suggests that SaSC improves control skill in ways not fully captured by the binary success metric. 5.3
5.3 Safety-Aware Shared Control as a Training Mechanism
As a final piece of analysis addressing item G1, we examine whether experience under a safety-aware shared control algorithm improves human skill learning. To evaluate this idea, we compare the user-only control trials of Group A (user-only condition first) with those of Group B (user-only condition second). We segment the data based on the specific environment in which it was collected and use the non-parametric Mann-Whitney U test to statistically analyze the results. We display the data and the results of the described statistical tests in Table 2.

The important take-away from these results is that shared control allows users to operate human-machine systems safely during their own skill learning, and that this practice then translates to skill retention when the assistance is removed. In the most challenging environment, Environment 3, we see the largest raw difference (+18%) in success rate between the two cohorts and a statistically significant result (p < 0.005). Notably, zero participants provided any successful demonstrations in this environment when the data was provided under the user-only control paradigm first. We also see a relatively large, but not statistically significant, difference in raw percentage points in Environments 1 and 2 (+10% and +9%), which might become statistically significant with more data.

Table 2. Average success rate under user-only control over time.

        User-only first  User-only second  Difference  Stat. significance
Env 1.  67%              77%               +10%        p > 0.05
Env 2.  35%              44%               +9%         p > 0.05
Env 3.  0%               18%               +18%        p < 0.005
All     34%              46%               +12%        p = 0.06

Under this paradigm, we can allow users to learn naturally while simultaneously ensuring the safety of both partners. For systems where failure during learning is not acceptable (e.g., exoskeleton balancing), safety-aware shared control becomes a requirement in enabling human skill acquisition: without some sort of safety assistance, operators are simply not able to control the system.
5.4 Safety-Aware Shared Control Improves Imitation Learning
To address item G2, we examine whether learned policies are capable of reproducing the behavior demonstrated by the human operator. Our evaluation is based on a comparison of trajectories generated by Imitation Learning with (π_il-sc) and without (π_il) safety-aware shared control at runtime. We provide visualizations of 10 successful reproductions of the demonstrated behaviors in Environments 2 and 3 in Figs. 4 and 5, respectively. Notably, all trajectories produced by π_il-sc safely avoid both the static and dynamic obstacles. In Fig. 4, we see that the learned policy π_il-sc is able to mimic the behavior demonstrated by the human operator. In Fig. 5, we include visualizations of the trajectories provided under user-only control (π_h) and Imitation Learning without shared control (π_il) in Environment 3. Here, we also see that the user was unable to provide any successful demonstrations without safety assistance. Similarly, the learned control policy (π_il) was unable to avoid obstacles in the environment without the safety assistance. These two final points elucidate the need for our safety-aware shared control system in both the demonstration and imitation phases. Without assistance from the SaSC algorithm, not only is the human operator unable to demonstrate desired behaviors, but the learned neural network policy fails to generalize.
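The imitation-learning step can be sketched as plain behavior cloning on the state-action pairs collected under shared control. The snippet below is a minimal illustration using scikit-learn's MLPRegressor; the paper's actual network architecture and training details are not reproduced here, and the arrays `states` and `actions` are placeholders for the recorded demonstration data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Placeholder demonstration data: lander states and the human control
# inputs recorded under the safety-aware shared control paradigm.
rng = np.random.default_rng(0)
states = rng.normal(size=(5000, 6))    # hypothetical 6-D state
actions = rng.normal(size=(5000, 2))   # hypothetical 2-D control

# Behavior cloning: regress actions on states to obtain the policy pi_il.
policy = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
policy.fit(states, actions)

# At runtime the learned policy proposes an action for the current state;
# under pi_il-sc, this proposal would additionally be filtered by the
# safety-aware shared control layer before being applied.
proposed_action = policy.predict(states[:1])
```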
Fig. 4. Environment 2. A visualization of data provided by the human partner under safety-aware shared control (π_sa-sc: blue) and trajectories produced autonomously (π_il-sc: pink) using the learned control policy. The vertical and horizontal positions of the dynamic obstacles (black) over time are also displayed.
Fig. 5. Environment 3. Visualization of (1) demonstrations provided under the user-only control paradigm (π_h: yellow), (2) demonstrations provided under our safety-aware shared control paradigm (π_sa-sc: blue), (3) trajectories produced autonomously using solely the Imitation Learning policy (π_il: purple, learned from π_sa-sc demonstrations), and (4) trajectories produced autonomously using the safety-aware shared control imitation learning policy (π_il-sc: pink).
6 Discussion and Conclusion
In this work, we contribute a shared control paradigm that allows users to provide demonstrations of desired actions in scenarios that would otherwise prove too difficult due to the complexity of the control problem, the complexity of the environment, or the skill of the user. We solve this problem through the application of shared control, allowing a human operator to provide demonstrations while an autonomous partner ensures the safety of the system. We validate our approach with a human subjects study comprising 20 participants. The results of our human subjects study show that our safety-aware shared control paradigm is able to help human partners provide demonstrations of novel behaviors in situations in which they would otherwise not be able to (Fig. 3). Additionally, we find that our SaSC algorithm can be used as a training mechanism to improve a human operator's control skill by ensuring safety during training (Table 2). Furthermore, we find that a combination of Imitation Learning with our safety-aware shared control paradigm produces autonomous policies that are capable of safely reproducing the demonstrated behaviors. In this work, the autonomous policies are learned from data provided under shared control, and for this reason one must also consider how the autonomy affects the demonstration data. We find that the shared control paradigm slows the average speed of the system, but generally increases the stability (Table 1). Of course, it is also possible to learn autonomous policies from the data provided under user-only control.
However, when system safety is a requirement (e.g., co-located human-machine systems), shared control can be thought of as fundamental for allowing users to provide demonstrations of new behaviors. Allowing a system to fail during demonstration is often an unrealistic assumption with real-world systems. In future work, we plan to explore additional uses of data collected under a shared control paradigm in learning autonomous policies that do not rely on continued safety assistance (e.g., as seeds in Guided Policy Search [11]). We also plan to explore a bootstrapped notion of shared control, in which the outer-loop autonomous controller originally considers only the safety of the joint system and then dynamically updates to consider both the safety of the system and task-level metrics that describe the desired behavior of the human operator.

Acknowledgements. This material is based upon work supported by the National Science Foundation under Grants CNS 1329891 & 1837515. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the aforementioned institutions.
References

1. Colombo, G., Joerg, M., Schreier, R., Dietz, V.: Treadmill training of paraplegic patients using a robotic orthosis. J. Rehabil. Res. Dev. 37(6), 693–700 (2000)
2. Musić, S., Hirche, S.: Control sharing in human-robot team interaction. Annu. Rev. Control 44, 342–354 (2017)
3. Ross, S., Gordon, G.J., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: International Conference on Artificial Intelligence and Statistics, pp. 627–635 (2011)
4. Aigner, P., McCarragher, B.J.: Modeling and constraining human interactions in shared control utilizing a discrete event framework. Trans. Syst. Man Cybern. 30(3), 369–379 (2000)
5. Shen, J., Ibanez-Guzman, J., Ng, T.C., Chew, B.S.: A collaborative-shared control system with safe obstacle avoidance capability. In: IEEE International Conference on Robotics, Automation and Mechatronics, pp. 119–123 (2004)
6. Chipalkatty, R., Droge, G., Egerstedt, M.B.: Less is more: mixed-initiative model-predictive control with human inputs. IEEE Trans. Rob. 29(3), 695–703 (2013)
7. Broad, A., Murphey, T., Argall, B.: Learning models for shared control of human-machine systems with unknown dynamics. In: Robotics: Science and Systems (2017)
8. Anderson, S.J., Peters, S.C., Pilutti, T.E., Iagnemma, K.: An optimal-control-based framework for trajectory planning, threat assessment, and semi-autonomous control of passenger vehicles in hazard avoidance scenarios. Int. J. Veh. Auton. Syst. 8(2–4), 190–216 (2010)
9. Lacey, G., MacNamara, S.: Context-aware shared control of a robot mobility aid for the elderly blind. Int. J. Robot. Res. 19(11), 1054–1065 (2000)
10. Argall, B.D., Chernova, S., Veloso, M., Browning, B.: A survey of robot learning from demonstration. Robot. Auton. Syst. 57(5), 469–483 (2009)
11. Levine, S., Koltun, V.: Guided policy search. In: International Conference on Machine Learning, pp. 1–9 (2013)
12. Abbeel, P., Ng, A.Y.: Apprenticeship learning via inverse reinforcement learning. In: International Conference on Machine Learning, p. 1. ACM (2004)
13. Ziebart, B.D., Maas, A.L., Bagnell, J.A., Dey, A.K.: Maximum entropy inverse reinforcement learning. In: AAAI Conference on Artificial Intelligence, vol. 8, pp. 1433–1438 (2008)
14. Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding. In: AAAI Conference on Artificial Intelligence (2018)
15. Fulton, N., Platzer, A.: Safe reinforcement learning via formal methods. In: AAAI Conference on Artificial Intelligence (2018)
16. Reddy, S., Levine, S., Dragan, A.: Shared autonomy via deep reinforcement learning. In: Robotics: Science and Systems (2018)
17. Brown, D.S., Niekum, S.: Efficient probabilistic performance bounds for inverse reinforcement learning. In: AAAI Conference on Artificial Intelligence (2018)
18. Sutherland, J.: Fly-by-wire flight control systems. Technical report, Air Force Flight Dynamics Lab, Wright-Patterson AFB, OH (1968)
19. Atkeson, C.G., Santamaria, J.C.: A comparison of direct and model-based reinforcement learning. In: IEEE International Conference on Robotics and Automation, vol. 4, pp. 3557–3564 (1997)
20. Koopman, B.O.: Hamiltonian systems and transformation in Hilbert space. Proc. Natl. Acad. Sci. 17(5), 315–318 (1931)
21. Rawlings, J., Mayne, D., Diehl, M.: Model Predictive Control: Theory, Computation, and Design. Nob Hill Publishing, Madison (2017)
22. Ansari, A.R., Murphey, T.D.: Sequential action control: closed-form optimal control for nonlinear and nonsmooth systems. IEEE Trans. Rob. 32(5), 1196–1214 (2016)
23. Abraham, I., Torre, G.D.L., Murphey, T.D.: Model-based control using Koopman operators. In: Robotics: Science and Systems (2017)
24. Tzorakoleftherakis, E., Murphey, T.D.: Controllers as filters: noise-driven swing-up control based on Maxwell's demon. In: IEEE Conference on Decision and Control (2015)
25. Laskey, M., Lee, J., Fox, R., Dragan, A., Goldberg, K.: DART: noise injection for robust imitation learning. In: Conference on Robot Learning, pp. 143–156 (2017)
26. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., Zaremba, W.: OpenAI Gym. arXiv, abs/1606.01540 (2016)
27. Williams, M.O., Kevrekidis, I.G., Rowley, C.W.: A data-driven approximation of the Koopman operator: extending dynamic mode decomposition. J. Nonlinear Sci. 25(6), 1307–1346 (2015)
28. Jovanović, M.R., Schmid, P.J., Nichols, J.W.: Sparsity-promoting dynamic mode decomposition. Phys. Fluids 26(2), 024103 (2014)
29. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, New York (2004)
30. Kalinowska, A., Fitzsimons, K., Dewald, J., Murphey, T.D.: Online user assessment for minimal intervention during task-based robotic assistance. In: Robotics: Science and Systems (2018)
Guided Exploration of Human Intentions for Human-Robot Interaction

Min Chen, David Hsu, and Wee Sun Lee

National University of Singapore, Singapore 117417, Singapore
[email protected]
Abstract. Robot understanding of human intentions is essential for fluid human-robot interaction. Intentions, however, cannot be directly observed and must be inferred from behaviors. We learn a model of adaptive human behavior conditioned on the intention as a latent variable. We then embed the human behavior model into a principled probabilistic decision model, which enables the robot to (i) explore actively in order to infer human intentions and (ii) choose actions that maximize its performance. Furthermore, the robot learns from the demonstrated actions of human experts to further improve exploration. Preliminary experiments in simulation indicate that our approach, when applied to autonomous driving, improves the efficiency and safety of driving in common interactive driving scenarios.
1 Introduction
Understanding human intentions is essential for fluid human-robot interaction. It helps the robot to interpret humans' behaviors and predict their future actions. Earlier work often treats humans as passive moving obstacles in the environment: humans do not react to robot actions [3,4,11]. As a result, the robot is overly conservative. It waits and observes until humans' intentions become clear. However, in reality, humans are not merely moving obstacles, and they respond to robot actions. For example, in our study of lane-switch for autonomous driving, a robot car tries to switch to a lane in which a human-driven car drives (Fig. 1). A conservative human driver will slow down, while an aggressive driver will accelerate and refuse to let the robot car in. If the robot car chooses to wait, the human driver will maintain the speed, and the robot car will not learn the human driver's intention. Human intentions, human actions, and robot actions are interconnected. The robot must take advantage of the connections and actively explore in order to understand human intentions. Exploration is, however, not always appropriate. Consider the lane-switch example again (Fig. 1d). If the robot car switches lanes when the two cars are very close, the human will not have enough time to react. In our study, the participants indicated that they felt unsafe in such situations. So the robot must not only explore, but also explore safely, for effective human-robot interaction.
Fig. 1. A robot car explores human intentions during the lane-switch. (a,b) If the robot car tries to explore, a conservative human driver will slow down, while an aggressive human driver may accelerate, revealing the human intentions. (c) If the robot car chooses to wait, the human driver will maintain the speed. The robot car will not learn the human driver's intention and initiate the lane switch suitably. (d) Exploration may be dangerous when the two cars are too close.
To this end, we propose two ideas. The first is an intention-driven human behavior model, integrated into a principled probabilistic robot decision-making framework for human-robot interaction. Unlike earlier work on intentional behavior modeling [3], our model conditions human actions explicitly on robot actions and captures adaptive human behaviors. Since the human intention is not directly observable, we model it as a latent variable in a partially observable Markov decision process (POMDP) [14]. We further assume that the human intention remains static during a single interaction, thus reducing the POMDP model to a computationally more efficient variant, POMDP-lite [6]. Despite the simplifying assumption, the resulting intention POMDP-lite model successfully captures many interesting human behaviors. See Sect. 5 for examples. Handcrafting accurate POMDP models is a major challenge. Here we take a data-driven approach and learn the intention POMDP-lite model from data. Solving an intention POMDP-lite model produces a policy that enables the robot to explore actively and infer human intentions in order to improve robot performance. Aggressive exploration is sometimes unsafe in applications such as driving. Our second idea is to leverage human expert demonstrations for improved robot exploration. We learn from human demonstrations a probability distribution over state-action pairs. The learned distribution captures the actions of human experts when exploration is needed. It is then used as a heuristic to guide robot exploration and favor the frequently demonstrated state-action pairs.

We evaluated our approach in simulation on common driving tasks, including lane-switch and intersection navigation. Compared with a myopic policy that does not actively explore human intentions, intention POMDP-lite substantially improved robot driving efficiency. Combined with guided exploration, it also improved driving safety. While our experiments are specific to autonomous driving, the approach is general and applicable to other human-robot interaction tasks that require exploration of human intentions.

In the following, Sect. 2 briefly surveys related work. Section 3 presents the intention POMDP-lite model and guided exploration. Section 4 describes how we learn an intention-driven human driving policy and a distribution for guided exploration. Section 5 compares our approach with common alternatives in simulation. Section 6 discusses the limitations of the current approach and directions for further investigation.
2 Related Work
Intention has been studied extensively in the field of psychology [2,5], where intention is characterized as a mental state that represents a human's commitment to carrying out a sequence of actions in the future. Understanding intentions is crucial in understanding various social contexts, e.g., it helps to interpret other people's behaviors and predict their future actions [10]. In human-robot interaction, intention has been used as a means to model human behaviors [3,4,17,19,25]. For example, Bai et al. [3] modeled pedestrians' behaviors with a set of intentions, and they enabled an autonomous car to drive successfully in a crowd. However, they assumed that the human does not react to the robot, which made the robot act conservatively most of the time. Most recently, Sadigh et al. [23] showed that the robot's actions directly affect human actions, which can be used to actively infer human intentions [24]. However, inferring human intentions is not the end goal. Instead, our work embeds an intention-driven human behavior model into a principled robot decision model to maximize the robot performance. In other words, the robot may choose to ignore the human if he/she does not affect robot performance. To maximize performance, the robot needs to actively infer human intentions (exploration) and achieve its own goal (exploitation). In addition, the robot needs to explore gently, as the human might not be willing to be probed in certain scenarios. The partially observable Markov decision process (POMDP) [14] trades off exploration and exploitation optimally. However, the POMDP itself does not model the human's willingness to be probed, thus it may generate explorative actions that are too aggressive. Imitation learning derives a robot policy directly from human demonstrations [1], but the robot policy cannot be generalized to unseen parts of the state space. Instead of learning a robot policy directly, Garcia et al. [12] used human demonstrations to guide robot exploration, and they significantly reduced the damage incurred from exploring unknown state-action space. Our work draws insight from [12], and we explicitly guide robot exploration in a POMDP model with human demonstrations.
3 Intention POMDP-lite with Guided Exploration

3.1 Mathematical Formulation of Human-Robot Interaction
Mathematically, we formulate the human-robot interaction problem as a Markov decision process (X, A^R, A^H, T, R^R, R^H, γ), where x ∈ X is the world state, A^R is the set of actions that the robot can take, and A^H is the set of actions that the human can take. The system evolves according to a stochastic state transition function T(x, a^R, a^H, x′) = P(x′ | x, a^R, a^H), which captures the probability of transitioning to state x′ when joint actions (a^R, a^H) are applied at state x. At each time step, the robot receives a reward of r^R(x, a^R, a^H) and the human receives a reward of r^H(x, a^R, a^H). The discount factor γ is a constant scalar that favors immediate rewards over future ones. Given a human behavior policy, i.e., a^H ∼ π^H, the optimal value function of the robot is given by Bellman's equation:

V^*(x | π^H) = max_{a^R} E_{a^H ∼ π^H} [ r^R(x, a^R, a^H) + Σ_{x′} γ P(x′ | x, a^R, a^H) V^*(x′ | π^H) ]    (1)

The optimal robot policy π^R_* is the action that maximizes the right-hand side of Eq. 1, and the key to solving it is to have a model of human behaviors.
3.2 Intention-Driven Human Behavior Modeling
Our insight in modeling human behaviors is that people cannot be treated as obstacles that move, i.e., people select actions based on their intentions, and they adapt to what the robot does. Following previous works on intention modeling [4,23], we assume that the human intention can be represented as a single discrete random variable θ, and we explicitly condition human actions on their intention, i.e., a^H_t ∼ π^H(a^H_t | x_t, θ_t). Apart from their own intention, people also adapt to the robot. In the most general case, people condition their actions on the entire human-robot interaction history, i.e., h_t = {a^R_0, a^H_0, ..., a^R_{t−1}, a^H_{t−1}}. However, the history h_t may grow arbitrarily long and make human actions extremely difficult to compute. Even if we could compute it, this would still not be a good model for how people make decisions in day-to-day tasks, since people are known to be not fully rational, i.e., bounded rationality [15,22]. In human-robot interaction, bounded rationality has been modeled by assuming that people have "bounded memory" and base their decisions only on the most recent observations [18]. Our human behavior model connects the human intention model with the bounded memory model, i.e., people condition their actions on their intention and the last k steps of the history, h^k_t = {a^R_{t−k}, a^H_{t−k}, ..., a^R_{t−1}, a^H_{t−1}}. Thus, the human behavior policy can be rewritten as

a^H_t ∼ π^H(a^H_t | x_t, h^k_t, θ_t)    (2)
3.3 Intention POMDP-lite
A key challenge is that the human intention cannot be directly observed by the robot, and therefore has to be inferred from human behaviors. To achieve that, we model the human intention as a latent state variable in a partially observable Markov decision process (POMDP). In this paper, we assume the human intention remains static during a single interaction. Thus, we can reduce the POMDP model to a POMDP-lite model [6], which can be solved more efficiently.
Fig. 2. The intention POMDP-lite graphical model and the human-robot interaction flowchart. The robot action a^R_t depends on the world state x_t and the belief over the human intention θ_t. The dashed arrow indicates the static assumption on the intention dynamics, i.e., θ_t = θ_{t+1}.
To build the intention POMDP-lite model, we first create a factored state s = (x, θ) that contains the fully observable world state x and the partially observable human intention θ. We maintain a belief b over the human intention. The human behavior policy is embedded in the POMDP-lite transition dynamics, and we describe in Sect. 4 how we learn it from data. Figure 2 shows the intention POMDP-lite graphical model and the human-robot interaction flowchart.

The solution to an intention POMDP-lite is a policy that maps belief states to robot actions, i.e., a^R_t = π^R(b_t, x_t), and it has two distinct objectives: (1) maximizing rewards based on current information (exploitation); (2) gathering information over the human intention (exploration). The Bayes-optimal robot trades off exploration and exploitation by incorporating belief updates into its plans [9]. It acts to maximize the following value function:

V^*(b_t, x_t) = max_{a^R_t} [ r^R(b_t, x_t, a^R_t) + Σ_{x_{t+1}} γ P(x_{t+1} | b_t, x_t, a^R_t) V^*(b_{t+1}, x_{t+1}) ]    (3)

where r^R(b_t, x_t, a^R_t) = Σ_{θ_t} b_t(θ_t) P(a^H_t | θ_t) r^R(x_t, a^R_t, a^H_t) is the mean reward function, and P(x_{t+1} | b_t, x_t, a^R_t) = Σ_{θ_t} b_t(θ_t) P(a^H_t | θ_t) P(x_{t+1} | x_t, a^R_t, a^H_t) is the mean transition function, where P(a^H_t | θ_t) depends on the human behavior policy in Eq. 2. Note that b_{t+1} = τ(b_t, a^R_t, x_{t+1}) is the updated belief after arriving at a new state x_{t+1}, where τ represents Bayes' rule.

However, Bayes-optimal planning is intractable in general. An alternative approach to trade off exploration and exploitation is the explicit modification of the reward function, i.e., adding an extra reward bonus for exploration [6]. In this case, the robot acts to maximize the following value function:

Ṽ^*(b_t, x_t) = max_{a^R_t} [ β r^B(b_t, x_t, a^R_t) + r^R(b_t, x_t, a^R_t) + Σ_{x_{t+1}} γ P(x_{t+1} | b_t, x_t, a^R_t) Ṽ^*(b_t, x_{t+1}) ]    (4)

where r^B(b_t, x_t, a^R_t) is the reward bonus term that encourages the robot to explore, and β is a constant scalar that explicitly trades off exploration and exploitation. Note that the belief b_t is not updated in this equation. In other words, solving Eq. 4 is equivalent to solving a mean MDP of the current belief with an additional reward bonus, which is computationally efficient. More importantly, the computed robot policy is near Bayes-optimal [6], given that the reward bonus is defined as the expected L1 distance between two consecutive beliefs, E_{b_{t+1}} ||b_{t+1} − b_t||_1.
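To make the construction concrete, the following sketch computes the mean reward and mean transition of the current belief and performs one value backup with an exploration bonus; all arrays are small hypothetical stand-ins for the quantities defined above (setting p_G = 1 recovers the unguided bonus of Eq. 4, while a learned p_G gives the guided variant introduced below).

```python
import numpy as np

# Hypothetical sizes: |X| states, |A_R| robot actions, |A_H| human actions,
# |Theta| intentions. P_h[theta, a_h] = P(a_h | theta), the behavior policy.
nX, nAR, nAH, nTheta = 4, 2, 2, 2
rng = np.random.default_rng(1)
b = np.array([0.7, 0.3])                                # belief over intentions
P_h = np.array([[0.9, 0.1], [0.2, 0.8]])                # P(a_h | theta)
R = rng.uniform(size=(nX, nAR, nAH))                    # r^R(x, a_r, a_h)
T = rng.dirichlet(np.ones(nX), size=(nX, nAR, nAH))     # P(x' | x, a_r, a_h)
V = np.zeros(nX)                                        # current value estimate
gamma, beta = 0.95, 1.0

# Mean reward and mean transition under the current belief (Eq. 3).
w = b @ P_h                                             # weight of each a_h
R_mean = np.einsum("h,xrh->xr", w, R)
T_mean = np.einsum("h,xrhy->xry", w, T)

# Reward bonus weighted by a guided-exploration prior p_G (Eq. 5);
# the bonus values here are placeholders for E ||b_{t+1} - b_t||_1.
p_G = np.ones((nX, nAR))
r_B = np.full((nX, nAR), 0.1)

Q = beta * p_G * r_B + R_mean + gamma * np.einsum("xry,y->xr", T_mean, V)
best_action = Q.argmax(axis=1)          # greedy policy of the mean MDP
```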
3.4 Guided Exploration
The policy computed by Eq. 4 enables the robot to actively infer human intentions; however, the human might not like to be probed in certain scenarios (e.g., Fig. 1d). Thus, guidance needs to be provided for more effective robot exploration. We achieve guided exploration via human expert demonstrations. More specifically, we maintain a probability distribution over each state-action pair, where p^G(x, a^R) ∈ [0, 1] measures how likely the human expert is to take action a^R at state x. We will describe in Sect. 4 how we learn p^G(x, a^R) from data. The learned probability distribution is then embedded into the reward bonus term as prior knowledge, and our algorithm acts to maximize the following value function:

Ṽ^{G*}(b_t, x_t) = max_{a^R_t} [ β p^G(x_t, a^R_t) r^B(b_t, x_t, a^R_t) + r^R(b_t, x_t, a^R_t) + Σ_{x_{t+1}} γ P(x_{t+1} | b_t, x_t, a^R_t) Ṽ^{G*}(b_t, x_{t+1}) ]    (5)
Compared with Eq. 4, the reward bonus term is multiplied by p^G(x_t, a^R_t), which discourages robot exploration when p^G(x_t, a^R_t) is small. Although we discourage the robot from exploring certain state-action pairs, the theoretical results of the POMDP-lite paper [6] are retained, i.e., the robot policy remains near Bayes-optimal. Due to space constraints, the proof of the theorem is deferred to the appendix, which is available in the full version of the paper [7].

Theorem 1. Let A_t denote the policy followed by our algorithm at time step t. Let x_t, b_t denote the state and belief at time step t. Let |X|, |A^R| denote the sizes of the state space and robot action space. Let φ denote the minimal value of the probability distribution p^G(x, a^R), i.e., φ = min{p^G(x, a^R)}. Suppose φ > 0 and β = O(|X|^2 |A^R| / (φ(1−γ)^2)). Then, for any inputs ε > 0, δ > 0, with probability at least 1 − δ,

V^{A_t}(b_t, x_t) ≥ V^*(b_t, x_t) − 4ε.

In other words, our algorithm is 4ε-close to the Bayes-optimal policy for all but m = O(poly(|X|, |A^R|, 1/ε, 1/δ, 1/(1−γ))) time steps.
4 Learning Human Behavior Policies and Guided Robot Exploration
Nested within the intention POMDP-lite model are the human behavior policy π^H(a^H_t | x_t, h^k_t, θ_t) and the guided exploration distribution p^G(x_t, a^R_t). We adopt a data-driven approach and learn those two models from data for the interactive driving tasks. Note that suitable probabilistic models derived from alternative approaches can be substituted for these learned models.

4.1 Data Collection
Interactive Driving Tasks. Figure 3 shows the simulation setup and three typical interactive driving tasks. For simplicity, we assume the human-driven car and the robot car each follow a fixed path. For example, in the intersection scenario, each of them will follow a path that goes straight forward. With this assumption, the human driver and the robot only need to control the speed of the vehicles. The steering angle is controlled by a path tracking algorithm [8]. One exception is the lane-switch scenario, where the robot can decide when to switch lanes, and has two additional actions, i.e., {switch-left, switch-right}. Once the robot decides to switch lanes, a new path will be generated online and the path tracking algorithm will start following the new path. In general, human intention in driving scenarios has multiple dimensions, e.g., turn-left/turn-right, aggressive/conservative, etc. In this paper, we focus on the last dimension, i.e., the human driver can be either aggressive or conservative.
Fig. 3. Simulation setup. (a) A human driver interacts with a robot car in the driving simulator powered by the Unity 3D engine. (b,c,d) Three interactive driving tasks: lane-switch, intersection, and lane-merge.
Participants. We recruited 10 participants (3 female, 7 male) in the age range of 22–30 years old. All participants owned a valid driving license.

Design. The human-driven car was controlled by one of the participants, and the robot car was controlled by a human expert. Note that the human expert was not recruited from the general public but was one of the experiment conductors. Intuitively, one can treat the participants as the human drivers that the robot will interact with, and the human expert as the owner of the robot car who teaches the robot how to act in different scenarios. We learn a human driver behavior model from the controls recorded from the participants. To capture different human intentions, we asked each participant to perform the task as an aggressive human driver and as a conservative human driver. Note that the notion of "aggressive/conservative" is subjective and may vary among participants; thus, we expect a certain amount of variance in our human behavior prediction model. Similarly, we learn a guided exploration distribution from the controls recorded from the human expert. Since safety is our primary concern in the driving tasks, the human expert was told to drive carefully.

Procedure. Before the simulation started, the participant was asked to follow one of the driver intentions, i.e., aggressive or conservative. Once the simulation started, the participant and the human expert could control the speed of their vehicles via an accelerator pedal that provides continuous input. In the lane-switch scenario, the human expert could also decide when to switch to the other lane. The simulation ends once the robot car has achieved its goal, i.e., crossed the intersection or switched to the other lane.

Data Format. The world state in the driving task is depicted in Fig. 4, x = {d^H, d^R, v^H, v^R}, where d^H, d^R are the vehicles' distances to the potential colliding point, and v^H, v^R are the vehicles' current speeds.

Fig. 4. The world state for an interactive driving task.

For each simulation sequence, we recorded two sets of data, D^H_i and D^G_i, for learning the human behavior model and the guided exploration distribution, respectively. The data recorded can be written as follows:

D^H_i = θ_i ∪ {(x_0, a^R_0, a^H_0), ..., (x_{K_i}, a^R_{K_i}, a^H_{K_i})},    D^G_i = {(x_0, a^R_0), ..., (x_{K_i}, a^R_{K_i})}

where θ_i is the human intention in the ith interaction, K_i is the number of steps in the ith interaction, and x_t is the world state at time step t. v^R_t, a^R_t are the speed and acceleration of the robot car at time step t, and v^H_t, a^H_t are the speed and acceleration of the human-driven car at time step t.
Each participant was asked to perform the driving task 8 times as an aggressive driver and 8 times as a conservative driver. With 10 participants in total, this gives us 160 sequences of interactions for the learning purpose, i.e., 80 for the aggressive human driver, 80 for the conservative human driver, and 160 for the guided exploration distribution.

4.2 Human Behavior Policy
A human behavior policy is a function that maps the current world state x_t, bounded history h^k_t, and human intention θ_t to human actions (Eq. 2), where human actions are accelerations in our driving tasks. The Gaussian process (GP) places a distribution over functions. It serves as a non-parametric form of interpolation, and it is extremely robust to unaligned noisy measurements [21]. In this paper, we use a GP as the human behavior prediction model. We learn a GP model for each human driver type, i.e., aggressive or conservative. Our GP model is specified by a set of mean and covariance functions, whose input includes the world state x_t and bounded history h^k_t, i.e., x = (x_t, h^k_t). To fix the dimension of the input, we pad the history with zeros if t < k.

Mean Function. The mean function is a convenient way of incorporating prior knowledge. We denote the mean function as μ(x), and set it initially to be the mean of the training data. This encodes the prior knowledge that, without any additional information, we expect the human driver to behave similarly to the average of what we have observed before.

Covariance Function. The covariance function describes the correlation between two inputs, x and x′, and we use the standard radial-basis function kernel [26].
k(x, x′) = exp( −1/2 [ ((d^H − d′^H)/σ_d)^2 + ((d^R − d′^R)/σ_d)^2 + ((v^H − v′^H)/σ_v)^2 + ((v^R − v′^R)/σ_v)^2 + Σ_{a^R, a^H ∈ h^k_t} ( ((a^R − a′^R)/σ_a)^2 + ((a^H − a′^H)/σ_a)^2 ) ] )    (6)

where the exponential term encodes that similar states and histories should make similar predictions. The length-scales σ_d, σ_v, and σ_a normalize the scale of the data, i.e., distance, speed, and acceleration.
Fig. 5. The Mean Squared Error (MSE) of GP predictions with increasing history length k. The MSE stabilizes for k ≥ 2, confirming that humans have short, bounded memory in driving tasks.
Training. We train the GP model with the scikit-learn package [20], where 80% of the data is used for training and 20% is used for testing. We evaluate different values of k. Figure 5 shows the mean squared train/test error with respect to k. The errors are large when k = 0 (people ignore the robot), but converge quickly within 2 steps. This supports our intuition that people adapt to the robot but have a bounded memory. In the remainder of the paper, GP models with k = 2 are used to predict human actions. Figure 6 shows GP predictions for some example scenarios. The plots are generated by varying one of the input variables, i.e., d or v, while fixing the others. Due to space constraints, we only show selected plots where the aggressive human driver and the conservative human driver are most distinguishable. The key message in those plots is: the conservative human driver slows down if there is a potential collision in the near future, while the aggressive human driver keeps going and ignores the danger. The robot can take advantage of this difference to actively infer human intentions.
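A minimal version of this training procedure, using the scikit-learn package mentioned above, might look as follows; the feature construction is simplified and the arrays are placeholders for the recorded data, not the study's actual pipeline.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split

# Placeholder training data for one driver type (e.g., conservative):
# each input stacks the world state x_t = (d_H, d_R, v_H, v_R) with the
# k = 2 most recent action pairs; the target is the human acceleration.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4 + 2 * 2))   # state + bounded history
y = rng.normal(size=500)                # human acceleration a_H

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# RBF kernel with per-dimension length-scales, analogous to Eq. 6;
# normalize_y centers the targets, playing the role of the prior mean.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=np.ones(X.shape[1])),
                              normalize_y=True)
gp.fit(X_train, y_train)

mean_pred, std_pred = gp.predict(X_test, return_std=True)
print("test MSE:", np.mean((mean_pred - y_test) ** 2))
```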
Fig. 6. GP predictions of human accelerations, mean and standard error (y axis), for some example driving scenarios. (a) Predicted human acceleration w.r.t the distance between two vehicles (lane-switch). (b) Predicted human acceleration w.r.t the velocity of the robot car (intersection). (c) Predicted human acceleration w.r.t the distance between two vehicles (lane-merge).
Human Behavior Policy Table. The human behavior policy will be queried frequently during the robot planning phase, and GP prediction is too slow for online POMDP planning. Instead, we build a policy table offline for each human driver type, where we store all the GP predictions. During the online phase, the POMDP planning algorithm only needs to query the table, which is much faster. To build a policy table, we discretize the state space and action space as follows:

– Distance (m): near [0, 5), middle [5, 20), far [20, +∞).
– Speed (m/s): low [0, 1), middle [1, 5), high [5, +∞).
– Acceleration (m/s^2): decelerate (−∞, −0.2), keep [−0.2, 0.2], accelerate (0.2, +∞).

With this level of discretization, the policy table has 3^8 = 6561 entries when the history length is k = 2 (four state dimensions plus four history dimensions, each with three bins).
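A sketch of the corresponding lookup: the helper below maps a continuous query onto a discrete table key, using the bin edges listed above; everything else (names, the example query) is illustrative.

```python
import numpy as np

DIST_EDGES = [5.0, 20.0]    # near / middle / far (m)
SPEED_EDGES = [1.0, 5.0]    # low / middle / high (m/s)
ACCEL_EDGES = [-0.2, 0.2]   # decelerate / keep / accelerate (m/s^2)

def discretize(value, edges):
    """Return 0, 1, or 2 depending on which interval `value` falls in."""
    return int(np.searchsorted(edges, value, side="right"))

def table_key(d_H, d_R, v_H, v_R, history):
    """Map a continuous query (state + k=2 action pairs) to a table key."""
    key = (discretize(d_H, DIST_EDGES), discretize(d_R, DIST_EDGES),
           discretize(v_H, SPEED_EDGES), discretize(v_R, SPEED_EDGES))
    # history = [(a_R, a_H), (a_R, a_H)] for the last two steps
    for a_R, a_H in history:
        key += (discretize(a_R, ACCEL_EDGES), discretize(a_H, ACCEL_EDGES))
    return key  # one of 3^8 = 6561 possible keys

# Example lookup against a precomputed table of GP predictions:
# a_H_pred = policy_table[table_key(12.0, 8.0, 4.2, 3.1, [(0.0, -0.3), (0.1, 0.0)])]
```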
4.3 Probability Distribution for Guided Exploration
For the autonomous driving task, we are mostly concerned with a state-action pair being safe or unsafe. Consequently, the probability p^G(x, a^R) measures how safe it is for the robot to explore (x, a^R), i.e., a safe probability.

Prior. Since a state-action pair is either safe or unsafe, a natural choice is to use the Beta distribution as a prior, i.e., Beta(α(x, a^R), β(x, a^R)). The initial safe probability can be computed as

p^G(x, a^R) = α(x, a^R) / (α(x, a^R) + β(x, a^R))

Initially, we set α(x, a^R) = 0.05 and β(x, a^R) = 5 for all x ∈ X, a^R ∈ A^R. This implies that all state-action pairs are considered close to unsafe before seeing any human demonstrations.

Posterior. Given a set of human demonstration data D^G = {(x_0, a^R_0), ..., (x_N, a^R_N)}, the posterior of the safe probability can be computed as

p̃^G(x, a^R) = (α(x, a^R) + n(x, a^R)) / (α(x, a^R) + β(x, a^R) + n(x, a^R))

where n(x, a^R) is the number of times that (x, a^R) appeared in the human demonstration data.

Safe Probability Table. Similar to the human behavior policy, we store all the safe probabilities in a table, and we follow the same discretization intervals as for the human behavior policy table.
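The posterior update is simple enough to state directly in code. In the sketch below, `demo_counts` is a hypothetical mapping from discretized state-action keys to the count n(x, a^R); the prior values follow the text.

```python
ALPHA0, BETA0 = 0.05, 5.0   # prior pseudo-counts: all pairs start near-unsafe

def safe_probability(demo_counts, key):
    """Posterior probability that the state-action pair `key` is safe,
    given how often it appeared in the expert demonstrations."""
    n = demo_counts.get(key, 0)
    return (ALPHA0 + n) / (ALPHA0 + BETA0 + n)

# Example: a pair never demonstrated stays close to unsafe (~0.01),
# while a frequently demonstrated pair approaches 1.
counts = {("near", "decelerate"): 40}
print(safe_probability(counts, ("near", "decelerate")))  # ~0.89
print(safe_probability(counts, ("near", "accelerate")))  # ~0.0099
```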
5 Simulation Experiments
Our approach, intention POMDP-lite with guided exploration (IPL-G), enables the robot to actively infer human intentions and choose explorative actions that are similar to the human expert's. In this section, we present simulation results on several interactive driving tasks, where the robot car interacts with a simulated human driver. We sought to answer the following two questions:

– Question A. Does active exploration improve robot efficiency?
– Question B. Does guided exploration improve robot safety?

Comparison. To answer question A, we compared IPL-G with a myopic robot policy that does not actively explore, i.e., the reward bonus in Eq. 4 was set to 0. To answer question B, we compared IPL-G with the original intention POMDP-lite model (IPL) without guided exploration. Since the driving scenarios considered in this paper are relatively simple, one may argue that a simple heuristic policy might work just as well. To show that this is not the case, we designed an additional baseline, heuristic-k, which works as follows: the robot explores for the first k steps, e.g., accelerates or switches lane; then the robot proceeds to go if the human slows down; otherwise, the robot waits until the human has crossed.

Parameter Settings. Both IPL and IPL-G need a constant scalar β that trades off exploration and exploitation (Eqs. 4 and 5). Similar to previous works [16], we evaluated a wide range of values for β and chose the one that had the best performance on IPL. In this way, β favors IPL more than IPL-G. For the heuristic policy, we need to choose the parameter k, i.e., the number of steps that the robot explores at the beginning. We evaluated heuristic policies with different values of k ∈ {1, 2, 3, 4}, and chose the one that had the best performance.

Performance Measures. To measure the efficiency of a policy, we used the time taken for the robot to achieve its goal, i.e., T(goal), the less the better. To measure the safety of a policy, we used the near-miss rate, i.e., P(near-miss), since we did not observe any accidents in the simulations. We adopted the definition of near-miss from a seminal paper in the field of traffic safety control [13], where a near-miss was defined based on the time-measured-to-collision (TMTC). TMTC is the time required for the two vehicles to collide if they continue at their current velocities. Intuitively, TMTC is a measurement of danger: a lower TMTC value implies that the scenario is more dangerous. According to [13], a near-miss happens if the value of TMTC is lower than 1 s, and an analysis of the films taken with the surveillance system at an urban intersection suggested a near-miss rate of 0.35 in daily traffic.
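For concreteness, a minimal TMTC computation and near-miss check might look as follows; the kinematics are deliberately simplified to a same-lane, constant-speed instance, and the 1 s threshold is the one adopted from [13].

```python
def tmtc(gap, v_follower, v_leader):
    """Time-measured-to-collision for two vehicles in the same lane:
    the time for the follower to close the current gap if both continue
    at their present speeds. Returns +inf when the gap is opening."""
    closing_speed = v_follower - v_leader
    if closing_speed <= 0:
        return float("inf")
    return gap / closing_speed

def is_near_miss(gap, v_follower, v_leader, threshold=1.0):
    """Near-miss per [13]: TMTC below 1 s."""
    return tmtc(gap, v_follower, v_leader) < threshold

# Example: a 5 m gap closing at 6 m/s gives TMTC ~0.83 s -> near-miss.
print(is_near_miss(5.0, 10.0, 4.0))  # True
```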
Fig. 7. Top row: comparison of the active exploration policy and the myopic policy, when exploration is safe. Bottom row: comparison of policies with and without guided exploration, when exploration is unsafe.
Simulation Setup. For all the simulations performed, the robot car was controlled by one of the algorithms above. The human-driven car was controlled by the learned human behavior policy in Sect. 4.2. For each simulation run, the human driver was set to be aggressive or conservative with 0.5 probability. The state space X was continuous. The robot action space was discrete, i.e., {Accelerate, Keep, Decelerate}, which controls the speed of the car. The steering angle of the car was controlled independently by a path tracking algorithm [8]. All planning algorithms were given 0.33 s per step to compute the robot actions online.

5.1 Driving Scenarios
To test the effectiveness of active exploration, we selected three interactive driving scenarios where robot exploration is safe (Fig. 7, top row). Initially, the robot car either had a safe distance to the human-driven car or had low speed. This gave the robot car enough space/time to explore without too much danger of colliding with the human-driven car. The myopic robot waited until the human-driven car had crossed or slowed down. However, the active exploration robot started to switch to the other lane or accelerate to test whether the human was willing to yield. If the human was conservative and slowed down, the robot proceeded to go and improved its efficiency.
Fig. 8. Performance results. Top row: when it was safe to explore, the active exploration policies (IPL and IPL-G) achieved better efficiency with nearly zero near-misses. Bottom row: when it was unsafe to explore, the IPL-G robot achieved a significantly lower near-miss rate compared with the robot that explores without guided exploration (IPL).
To test the effectiveness of guided exploration, we selected another three interactive driving scenarios where robot exploration is unsafe (Fig. 7, bottom row). Initially, the robot car was near the human-driven car and had high speed. Without guided exploration, the IPL robot chose to switch lane or cross the intersection, which might cause near-misses. On the other hand, with guided exploration, the IPL-G robot chose not to explore and waited for the human to cross first.

5.2 Quantitative Results
We performed 200 simulation runs for each scenario. The performance results are shown in Fig. 8. When robot exploration was safe, both IPL and IPL-G actively explored. This significantly reduced the time taken for the robot to achieve its goal, compared with the robot that followed a myopic policy or a heuristic policy (Fig. 8a). Very few near-misses (

0. Otherwise, the specification is falsified (i.e. x ⊭ ϕ).
Our goal is, for a given scenario S and safety condition ϕ, to find a set of counterexamples to ϕ as bounded-time trajectories for all of the traffic participants. For each counterexample, we then seek a contract C ⊂ X^0 that can be applied as a rule for the ego vehicle to follow in order to guard against the counterexample and thereby locally satisfy ϕ. We impose the following requirements:

1. C yields certain constraints on the ego vehicle's trajectories that prevent violating a given ruleset (e.g. rules of the road),
2. C yields additional constraints on the ego vehicle's trajectories that prevent violation of ϕ with respect to the counterexamples associated with C,
3. C generalizes to protect the ego vehicle against a continuum of possible traffic vehicle behaviors under w^i, in addition to those in the finite set of counterexamples, and
4. the counterexamples associated with C satisfy a chance constraint describing reasonable driver behaviors, i.e. p(w_1, ..., w_T) ≥ α^T, where the tolerance α > 0 bounds the likelihood of the behavior.

If a particular counterexample satisfies such a chance constraint, then we know that it is reasonably well explained by the underlying model of actual driver behaviors. On the other hand, if this check fails, then the counterexample can be considered "uncanny" behavior that does not resemble true driving behaviors, and the ego vehicle need not have a contract for it. Contracts under different road rules can be compared to examine the affordances or compromises to safety.

For the sake of simplicity of the contracts, the approach in this paper seeks to attain a convex contract representation that asserts that, under the assumptions of the scenario S, the ego vehicle is guaranteed to remain safe with respect to a finite, but diverse, set of counterexamples associated with C. Our focus in this paper is pairwise collisions, as this keeps the problem within the continuous domain without sacrificing the variety in discovered counterexamples.
3 Constructing Safety Contracts
Contracts are created by an alternation between falsification and reachability analysis under the scenario model (2). The overall approach is shown in Algorithm 1. Starting with a set of initial contracts that enforce a certain ruleset, the falsification step (GenerateCounterexamples) generates counterexamples to these contracts (if any exist) by solving for possible ego-car and traffic behaviors that result in failure of ϕ. In the reachability step (GenerateContract), a reach-avoid problem is then solved to find the set of time-indexed states for the ego car, computed to over-approximation, for which the ego car is able to steer away from the generated counterexample. The failure case is indicative of an undecidable result, where it is inconclusive whether the ego vehicle can take any action to remain safe under the given scenario.

Algorithm 1. SynthesizeContracts
Input: S = (R, P, X_0, X^0_F): scenario, ψ(·): safety condition function, α: chance constraint, T: time horizon.
Output: C: safety contract for each timestep k ∈ {0, ..., T}.
1: C ← InitializeContract(S)
2: repeat
3:   x̄, p(w̄) ← GenerateCounterexamples(S, C, ψ(·), T)
4:   C ← GenerateContract(S, C, x̄)
5: until (p(w̄) < α^T) ∨ (C = ∅)
6: if (C = ∅) ∨ ¬IsReachable(X^0_F, C_T) then
7:   return failure
8: return C

Figure 2 depicts two iterations of the overall procedure. The left-hand side depicts the reachability step, in which a ruleset and any existing contracts are considered as constraints in the reachable set computation. The right-hand side illustrates how we use falsification to find counterexamples with respect to the contracts. The counterexample is treated as an obstacle to avoid in the subsequent iteration, at which point a set of constraints is created that separates the set difference between the reachable set at the previous step and the one at the current step. Solving the reach-avoid problem, versus constructing contracts based only on counterexamples, serves two purposes. First, it allows selection of new contracts that minimize the volume of the reachable set treated as unsafe in the next iteration under the contract. Second, we can verify whether it is feasible for the ego car to reach the final set from within the initial set under the computed contracts (the check IsReachable in Algorithm 1).

Fig. 2. Two iterations of the overall approach.
3.1 Gradient-Based Probabilistic Falsification
In general, we are interested in finding counterexamples to the safety specification: dynamically feasible trajectories that falsify our safety condition ϕ within a given chance constraint on the underlying probabilistic behavior model. We do so using a formulation that takes into consideration the analytical structure of these behavior models in order to perform gradient-based optimization and solve for the most likely counterexample. We solve a direct-collocation trajectory optimization problem [19] by discretizing time as k ∈ {0, ..., T}, with time step h, where t = hk. We let ū, w̄, x̄ denote, respectively, the sequences ū = {u_k}_{k=0}^T, w̄ = {w_k}_{k=0}^T, and x̄ = {x_k}_{k=0}^T. The counterexample trajectory is summarized by a collection of decision variables {h, ū, w̄, x̄} that falsify the condition x̄ |= ϕ but satisfy, at a minimum, the system dynamics (2), the initial conditions, and some threshold on the likelihood of selecting the random perturbations w̄, denoted by p(w̄). We aim to find the most likely explanations of the failure under the given model, motivating the following problem:

max_{h, ū, w̄, x̄}  p(w̄)
s.t.  x_{k+1} − x_k = h f_collocation,  ∀k = 0, ..., T−1     (dynamics)
      x_k ∈ X,  ∀k = 0, ..., T
      u_k ∈ U,  ∀k = 0, ..., T−1
      x_0 ∈ X_0,  u_0 ∈ U                                   (initial conditions)
      ψ(x_T) ≤ 0                                            (safety specification)
      κ^j_k(x_k) ≤ 0,  ∀j = 1, ..., Q,  ∀k = 0, ..., T      (contracts)
      p(w̄) ≥ α^T                                            (chance constraint)    (3)

where f_collocation = (1/6)(f_k + 4 f̃ + f_{k+1}), f_k = f_ρ(x_k, u_k, w_k), and

f̃ = f_ρ( (x_k + x_{k+1})/2 + (h/8)(f_k − f_{k+1}),  (u_k + u_{k+1})/2,  (w_k + w_{k+1})/2 ).

The function κ^j_k(·) ∈ C represents constraints of the form (a^j_k)^T x_k ≤ b^j_k, a^j_k ∈ R^{n_ego}, b^j_k ∈ R at time step k due to a contract j; taken together, these are the safety-preserving contracts the ego vehicle must adhere to. When (3) is solvable, we end up with corner cases to the hypothesis for κ^j(·) found thus far. Every time a new constraint is added to C, the condition ϕ becomes harder and harder to falsify. We revisit the computation of κ^j(·) in Sect. 3.3. Notice that we can choose to leave out the last constraint in (3): since the optimal choice of w̄ is a maximizer for p(w̄), a check of the optimal values is sufficient to verify the chance constraint.

The task now is to find the representation of p(w̄) and express J(w̄) as a convex cost such that arg max_w̄ p(w̄) = arg min_w̄ J(w̄). Taking w_k ∼ N(0, Σ) (where Σ is block-diagonal in the Σ^i) and noting that the probability of action w_k is drawn from
the distribution p(w_k) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2) w_k^T Σ^{−1} w_k), we can easily obtain the log-likelihood representation of the probability as

log p(w̄) = Σ_{k=0}^{N} log p(w_k) = −(nN/2) log 2π − (N/2) log |Σ| − (1/2) Σ_{k=0}^{N} w_k^T Σ^{−1} w_k.

Due to the monotonicity of the log operator, the cost function may be more simply written as J(w̄) = Σ_{k=0}^{N} w_k^T Σ^{−1} w_k, and the chance constraint as log p(w̄) ≥ T log α. By maximizing p(w̄), we ensure that we are always searching for the most realistic counterexample.

Note that the problem in (3) is a non-convex one to solve in general. Hence, we cannot guarantee a solution will be found, and hence cannot hope to exhaust all possible counterexamples (i.e. achieve completeness). However, it is important to note that we can achieve soundness of our solutions to (3).
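Both quantities are straightforward to evaluate numerically. A sketch, assuming a fixed covariance Σ shared across time steps:

```python
import numpy as np

def log_likelihood(w, Sigma):
    """log p(w_bar) for i.i.d. Gaussian disturbances w_k ~ N(0, Sigma).
    w has shape (num_steps, n); Sigma has shape (n, n)."""
    num_steps, n = w.shape
    Sigma_inv = np.linalg.inv(Sigma)
    quad = np.einsum("ki,ij,kj->", w, Sigma_inv, w)   # sum_k w_k^T Sigma^-1 w_k
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (num_steps * n * np.log(2 * np.pi) + num_steps * logdet + quad)

def satisfies_chance_constraint(w, Sigma, alpha, T):
    """Check log p(w_bar) >= T log(alpha), i.e. p(w_bar) >= alpha^T."""
    return log_likelihood(w, Sigma) >= T * np.log(alpha)

def cost_J(w, Sigma):
    """Convex surrogate J(w_bar) = sum_k w_k^T Sigma^-1 w_k (to be minimized)."""
    return np.einsum("ki,ij,kj->", w, np.linalg.inv(Sigma), w)
```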
3.2 Collision-Free Safety Conditions
Let B(x^i_k) ⊂ R^2 be the orientation-dependent footprint of the ith vehicle at time k, that is, the Cartesian space occupied at state x^i. Our safety criterion is one in which we wish to avoid crashes with other vehicles, ϕ = ⋀_{i,k} (B(x^0_k) ∩ B(x^i_k) = ∅). We apply the mild assumption that we only search for conditions in which the ego vehicle is in collision with one other vehicle at a time, which makes it easy to reduce a potentially combinatorial problem into one in which we solve (3) sequentially. Unfortunately, finding analytical conditions for the collision of two rectangular objects involves Minkowski operations, which are difficult to express analytically. We instead express collision in terms of two inscribing ellipses using the following result.

Lemma 1. Let a_i and b_i be the length and width of vehicle i, z_i = [x_i y_i]^T its Cartesian coordinates, and θ_i its angle. Let C_i = R(θ_i) diag(a_i, b_i), where R(θ_i) is a rotation matrix, and let z̃ = C_0^{−1}(z_1 − z_0). Then,

B(x_0) ∩ B(x_1) ≠ ∅ ⇒ z̃^T (C_1 + R_1)^{−1} (C_1 + R_1)^{−T} z̃ ≤ 1.    (4)

Proof. (sketch) Condition (4) can be obtained directly by transforming one of the ellipses to the unit disc, then applying the same transform to the other ellipse and writing out the expression for containment of the origin.

We note that the constraint (4) preserves soundness of the falsification problem: when a trajectory is found that satisfies this condition, that trajectory falsifies ϕ.
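A direct numerical evaluation of the condition in Lemma 1 — a sketch, with R(θ) the standard 2-D rotation and the matrix expression taken verbatim from Eq. (4):

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def ellipse_shape(a, b, theta):
    """C_i = R(theta_i) * diag(a_i, b_i) for the inscribing ellipse."""
    return rot(theta) @ np.diag([a, b])

def may_collide(z0, theta0, z1, theta1, a, b):
    """Evaluate the necessary collision condition (4) between two
    equally sized vehicles with centers z0, z1 and headings theta0, theta1."""
    C0 = ellipse_shape(a, b, theta0)
    C1 = ellipse_shape(a, b, theta1)
    z_tilde = np.linalg.solve(C0, np.asarray(z1) - np.asarray(z0))
    M = np.linalg.inv(C1 + rot(theta1))
    return z_tilde @ (M @ M.T) @ z_tilde <= 1.0
```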
Reachability with Contracts
Let F (tk ; X00 ) denote an over-approximation to the reachable set at time tk at iteration j of the main loop in Algorithm 1, i.e. the time-indexed set of states
948
J. DeCastro et al.
Algorithm 2. GenerateContract (⊕ denotes the Minkowski sum) Input: S = (R, X0 , P): scenario, C : previous contract, x ¯: a trajectory for the system of traffic cars, T : time horizon. Output: C: safety contract for each timestep k ∈ {0, . . . , T }. 1: Fsaf e (t0 ; X00 ) ← X00 ∩ C0 2: Ck ← X 0 , ∀k ∈ {0, . . . , T } 3: for all k ∈ {0, . . . , T } do 4: F (tk ; X00 ) ← F (h; Fsaf e (tk−1 ; X00 )) ∩ Ck Compute the reachable set 5: Oki ← B (xik , R) ⊕ B(F (tk ; X00 )) Compute the traffic constraints 6: for all i ∈ {1, . . . , N } such that Oki = ∅ do 7: Ck ← ComputeContract(F (tk ; X00 ) ∩ Ck , Oki ) ∩ Ck 8:
Fsaf e (tk ; X00 ) ← F (tk ; X00 ) ∩ Ck
Compute the safe reachable set
9: return C
x0k ∈ X 0 for which there exists a control u : R≥0 → U containing the trajectories satisfying x˙ 0 = fego (x0 , u) when starting in the initial set X00 . Our objective is essentially the converse of the falsification problem: to compute a safe reachable set for the ego vehicle Fsaf e (tk ; X00 ) ⊆ F (tk ; x00 ) such that it preserves the ruleset R and is not in collision with any other traffic vehicle at all timesteps. An overview of the approach may be found in Algorithm 2. For simplicity, we only present the computation of forward reachable sets, but this can be extended to backward reachable sets with the modifications explained in [27]. Once a contract is created, we extend the reachable set to verify that it intersects the goal region. Similar [27], we consider overapproximation to the reachable set, which is in line with the goal of safety since the resulting counterexample might not even be dynamically feasible, but the safety contract can still guard against it. Given a safe reachable set Fsaf e (tk ; X00 ) computed at some iteration j of the main loop in Algorithm 1, we want to select a hyperplane for each of the i vehicles of the form (aik )T x0k ≤ bik such that our new safe set Fsaf e (tk ; X00 ) at time step k is a valid reachable set that is safe with respect to the counterexample. Precisely, in line 4, values for aik , bik are selected such that (aik )T χi > bik for all vertices χi of Oki , the ith footprint of the counterexample, so that we obtain Fsaf e (tk ; X00 ) = F (tk ; X00 ) ∩ {x | (aik )T x0k ≤ bik , ∀i = 1, . . . , N }.
(5)
That is, we select aik and bik such that we may treat it as an obstacle in computing the reachable set at future times. In line 5, we let B(X) denote an orientationdependent Cartesian expansion of some set X ⊂ Rnego , and let B (x, R) denote a state-dependent inflation of B(x) according to the ruleset R, as explained in Sect. 3.4. Within ComputeContract, we select one hyperplane, i.e. aik and bik , in such a way as to maximize the volume of the resulting safe reachable set Fsaf e (·). If we assume F (·) is a union of polytopes, we can easily choose one from among
Counterexample-Guided Safety Contracts
949
the facets that maximizes the union of the remaining polytopes in (5) and satisfies (aik )T χi > bik . Ck is returned as the intersection of Ck and the new contract. 3.4
Rules of the Road
In the following we consider the subset of rules from the Vienna Convention on Traffic Rules [15], see Table 1. We select these rules as they form a subset of engagement rules for highway scenarios, and exclude rules involving traffic signals and other discrete conditionals. For simplicity, we show the constraint sets for straight road segments, and equally sized cars. We assume that the centerline of carriageway is along the x axis of the egocar for straight road segments. The length of the road segment is denoted by L, the width of a lane by W , and the number of left and right lanes by nlef t and nright , respectively. A sequence 0 ≤ ξx1 < ζx1 < ξx2 < . . . < ζxnsolid ≤ L defines the solid line segments (ξx , ζx ) along the centerline of the road. The pose and longitudinal speed of the vehicles are denoted by (xic , yci , θi ) and v i , respectively, where 0 ≤ i ≤ N , and i = 0 represents the ego-car. The average speed of vehicles around the ego-car and in the same lane is denoted by v¯. The safe e > 0, distances to other vehicles ahead and behind the ego is expressed as saf x saf e > 0. while the lateral safe distance to oncoming vehicles is expressed as y Overtaking maneuvers are performed within a stretch of the road segment of > 0 centered on the car that is being overtaken. Overtaking length 2 overtake x is safe if there are no other cars in the left lane where the ego-car performs the e−overtake around the car being overtaken. maneuver within a distance of saf x > 0. Lastly, the legal speed limit for a lane is given by legal v Table 1. Rules of the road for highway scenarios. No Rule
Constraint set
1
{0 ≤ x0c ≤ L, −nright · W ≤ yc0 ≤ 0}
2
Don’t drive in the left lanes
e 0 If driving behind another car, keep a reasonable {xic − x0c ≥ saf v | ∀i . xic − x0c ≥ x distance away to avoid collision if it suddenly stops 0 ∧ |yci − yc0 | < W } e 0 {x0c − xic ≥ saf v | ∀i . x0c − xic ≥ x 0 ∧ |yci − yc0 | < W }
3
If you want to slow down, give clear warning and do not inconvenience drivers behind you
4
Don’t cross solid lines
5
Overtake on the left when it is safe
{yc0 − yci > W ∧ v 0 > v i | ∧ ∀i . v i > 0 ∧ |x0c − xic | ≤ overtake x e−overtake ∧ yc0 − j .( |xjc − xic | ≤ saf x ycj ≤ W )}
6
If another vehicle is trying to overtake you keep right and don’t accelerate. If necessary, slow down and pull over
{u0a ≤ 0 ∧ yci − yc0 ≥ W ∧ yc0 ≤ 0 | ∀i . yci − yc0 ≤ 1.5W ∧ v i > } 0 ∧ |xic − x0c | ≤ overtake x
7
If passing oncoming traffic, leave sufficient lateral space to not get hit. If obstructed, slow down
8
Don’t drive abnormally slowly such that you impede the progress of other vehicles. Don’t drive above the speed limit or abnormally fast
{ξx ≤ x0c ≤ ζx ∧ −nright · W ≤ yc0 ≤ 0 | 1 ≤ ≤ nsolid }
e {yci − yc0 ≥ saf | yci ≥ 0 ∧ v i ≤ 0} y
{|v 0 − v ¯| ≤ v , |v 0 | ≤ legal } v
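As an illustration of how such constraint sets translate into code, the sketch below encodes rule 2 (keep a safe following distance) as a predicate over the joint state. The state layout and the numeric values of $W$ and $\epsilon_x^{safe}$ are placeholders for this sketch, not values from the paper.

```python
W = 3.7            # lane width (m), assumed
EPS_X_SAFE = 0.5   # safe time-gap coefficient eps_x^safe, assumed

def rule2_satisfied(ego, traffic):
    """ego = (x_c, y_c, theta, v); traffic = iterable of such tuples.
    Rule 2: for every car i ahead of the ego in the same lane,
    x_c^i - x_c^0 >= eps_x^safe * v^0 must hold."""
    x0, y0, _, v0 = ego
    for (xi, yi, _, _) in traffic:
        ahead_same_lane = (xi - x0 >= 0) and (abs(yi - y0) < W)
        if ahead_same_lane and not (xi - x0 >= EPS_X_SAFE * v0):
            return False
    return True
```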
Those rules that are only a function of the ego-car (rules 1, 4, and 8) are included in InitializeContract, while those that are functions of the joint state space are included only once a counterexample is obtained from the falsification step. Hence, the latter rules are included in ComputeContract as a modification to the traffic car footprint, i.e. $B(\cdot)$.
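A schematic rendering of this split is sketched below; the Rule objects with a `number` field and the `inflate_footprint` helper are hypothetical stand-ins for the actual data structures, not the paper's API.

```python
EGO_ONLY_RULES = {1, 4, 8}  # rules that are functions of the ego state only

def initialize_contract(ruleset):
    # Ego-only rules constrain the ego's reachable set from the start.
    return [r for r in ruleset if r.number in EGO_ONLY_RULES]

def compute_contract_rules(ruleset, counterexample):
    # Joint-state rules enter only once a counterexample exists, realized
    # as an inflation B(.) of that counterexample's traffic-car footprints.
    joint = [r for r in ruleset if r.number not in EGO_ONLY_RULES]
    return [r.inflate_footprint(counterexample) for r in joint]
```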
4 Results
We implemented the falsification algorithm, the scenario, and the system models using the Drake toolbox [35]. We use the SNOPT optimization package [18] to solve the sequential quadratic program (SQP) in (3), and we parallelize the constraint evaluation across time steps in order to speed up the solve time. To generate new contracts, we compute the reachable sets using a Taylor expansion of the nonlinear dynamics, with sets expressed as zonotopes; we do this with the aid of the CORA reachability tool [1]. Set operations are carried out using the MPT toolbox, which is based on a polytopic representation of sets. We consider the following model for both the ego vehicle and the traffic vehicles:

$$\dot{x} = \begin{bmatrix} \dot{x}_c \\ \dot{y}_c \\ \dot{\theta} \\ \dot{v} \end{bmatrix} = \begin{bmatrix} v \cos\theta \\ v \sin\theta \\ \frac{v}{L}\tan u_\delta \\ u_a \end{bmatrix}, \qquad u = \begin{bmatrix} u_\delta \\ u_a \end{bmatrix}.$$

As previously, $x_c$ and $y_c$ are the Cartesian positions of the center of the vehicle, $\theta$ is the heading angle, and $v$ is the forward speed, while $u_\delta$ and $u_a$ denote the steering angle and acceleration inputs, respectively.
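As a minimal sketch (ours, standing in for the Drake system used in the implementation), this kinematic bicycle model transcribes directly to a Python function; the wheelbase value is a placeholder.

```python
import numpy as np

L_WHEELBASE = 2.7  # wheelbase L (m), assumed for this sketch

def vehicle_dynamics(x, u):
    """Kinematic bicycle model: x = (x_c, y_c, theta, v), u = (u_delta, u_a)."""
    x_c, y_c, theta, v = x
    u_delta, u_a = u
    return np.array([
        v * np.cos(theta),                    # x_c_dot
        v * np.sin(theta),                    # y_c_dot
        (v / L_WHEELBASE) * np.tan(u_delta),  # theta_dot
        u_a,                                  # v_dot
    ])
```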
Table 2. Parameters used to model driver behaviors for the traffic cars.

              Description                          Symbol          Normal   Aggressive
IDM           Reference speed (m/s)                v_ref           10       1.5
              Maximum acceleration (m/s²)          a               1        4
              Comfortable deceleration (m/s²)      b               3        6
              Minimum-desired net distance (m)     s_0             1        0.5
              Time headway to lead vehicle (s)     t_h             0.1      0.05
              Free-road exponent                   δ               4        4
Pure-pursuit  Lookahead distance (m)               s_look          15       10
Perception    Range (m)                            s_perception    100      100
Disturbances  Steering angle variance (rad²)       σ_δ             0.1      5
              Acceleration variance (m²/s⁴)        σ_a             0.1      2.5
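Read as code, Table 2 amounts to two parameter sets. The dictionary layout below is illustrative (the values are those reported in the table) and is reused by the driver-model sketch later in this section.

```python
DRIVER_PARAMS = {
    "normal":     dict(v_ref=10, a=1, b=3, s0=1, t_h=0.1, delta=4,
                       s_look=15, s_perception=100,
                       sigma_delta=0.1, sigma_a=0.1),
    "aggressive": dict(v_ref=1.5, a=4, b=6, s0=0.5, t_h=0.05, delta=4,
                       s_look=10, s_perception=100,
                       sigma_delta=5, sigma_a=2.5),
}
```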
Fig. 3. The contract for timestep t = 4.8 s at iteration 4 for each set of parameters.
To represent naturalistic behaviors for the traffic cars, we consider the intelligent driver model (IDM) [36], a model whose parameters are typically fit to driver data and which is used to represent the longitudinal actions (acceleration) of actual drivers. We use a pure-pursuit controller [13] to model the lateral actions (steering) of drivers. Essentially, the IDM model allows cars to react to one another while adapting to a driver's preferences for speed, acceleration, and time headway between vehicles. The pure-pursuit controller allows steering to be adjusted smoothly so that the vehicle converges to a desired curve. In our experiments, we fix the desired curve to be the centerline of a target lane to which the car must be steered. We randomize these traffic behaviors by defining $\Sigma^i = \mathrm{diag}\{\sigma_\delta, \sigma_a\}$ and by treating the disturbance signal as additive noise on the nominal acceleration and steering commands provided by IDM and pure pursuit. We adapt to different driving styles using the complete list of parameters in Table 2. We furthermore augment the implementation by ignoring vehicles that lie beyond a limited perception range. The road geometry is configurable for highway scenarios: any number of lanes and lane sizes can be chosen for the scenario. Our aggregate model of the entire traffic system accounts for all of the traffic participants in the context of their positions on the road. The models are encoded such that partial derivatives can be easily obtained via automatic differentiation.
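A minimal sketch of this driver model follows, assuming the standard IDM acceleration law [36] and a straight-road simplification of pure pursuit [13]; the helper names and the wheelbase value are ours, and disturbances are drawn as additive Gaussian noise with the variances from Table 2 (via DRIVER_PARAMS above).

```python
import numpy as np

def idm_acceleration(v, gap, v_lead, p):
    """Standard IDM law: free-road term plus an interaction term based on
    the desired gap s* to the lead vehicle (gap assumed finite here)."""
    dv = v - v_lead
    s_star = p["s0"] + v * p["t_h"] + v * dv / (2 * np.sqrt(p["a"] * p["b"]))
    return p["a"] * (1 - (v / p["v_ref"]) ** p["delta"] - (s_star / gap) ** 2)

def pure_pursuit_steering(pose, target_lane_y, p, wheelbase=2.7):
    """Steer toward a lookahead point on the target lane centerline
    (straight-road simplification of pure pursuit)."""
    x, y, theta = pose
    alpha = np.arctan2(target_lane_y - y, p["s_look"]) - theta
    return np.arctan(2 * wheelbase * np.sin(alpha) / p["s_look"])

def traffic_inputs(pose, v, gap, v_lead, target_lane_y, p, rng):
    """Nominal IDM/pure-pursuit commands with additive Gaussian noise."""
    u_a = idm_acceleration(v, gap, v_lead, p) \
        + rng.normal(0, np.sqrt(p["sigma_a"]))
    u_d = pure_pursuit_steering(pose, target_lane_y, p) \
        + rng.normal(0, np.sqrt(p["sigma_delta"]))
    return u_d, u_a
```

For instance, calling traffic_inputs with DRIVER_PARAMS["aggressive"] and rng = np.random.default_rng() yields the noisy commands of an aggressive driver.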
4.1 Leading/Trailing Car Scenarios
In this scenario, we consider the ego-car sandwiched between two traffic cars in the right lane of a two-lane highway with an opposing traffic lane, which may be used for overtaking when free. We synthesized contracts using both the full ruleset in Table 1 and a relaxed ruleset in which we disable rules 1 and 3 to enable evasive maneuvers onto the other lane. For both rulesets, we explore traffic models having two levels of aggressiveness (normal and aggressive), using the parameters in Table 2. The overall computation time for the synthesis ranged roughly from 8 to 10 h. In Fig. 1, we depict different iterations of Algorithm 1 for the relaxed rules and normal driving style. We compare the contracts obtained
Fig. 4. The log-likelihood for each test case across all iterations. The red × marks iterations where the contract terminated with an empty set, and the green dashed line indicates the chance constraint α.
at a fixed iteration of the algorithm for each case in Fig. 3 and, for each case, report the log-likelihood of the counterexample normalized on $|\Sigma|$ in Fig. 4. We observe that with more iterations (and more unlikely behaviors of the traffic cars), more contracts are added, making the contract more restrictive but also harder to falsify, as indicated by the log-likelihood. With a greater number of rules and more aggressive traffic, the contract becomes smaller and more prohibitive (see Fig. 3). We also note that relaxing the ruleset (e.g. allowing lane switches) enables more behaviors for the ego-car, demonstrating that safety can be preserved at the expense of rule-breaking in some scenarios. Moreover, the ego-car can readily estimate the cost of violating rules of the road by observing how the contracts vary with the set of actively enforced rules. We note that for the normal driving style, the log-likelihood quickly decreases, whereas for the aggressive driving styles, the log-likelihood remains high as contracts are added, indicating that aggressive traffic can induce failure regardless of the ego-car's behavior. In both of the aggressive-driving cases, empty contracts were returned before exhausting possible counterexamples. Of the normal-driving cases, the relaxed ruleset yields a contract with 14 counterexamples, whereas the strict ruleset yields 5, indicating that changing lanes presents more possible failure events to guard against.
5 Conclusion
In this paper, we presented a novel framework for the synthesis of safety constraints for autonomous decision systems that can be applied to a wide variety of real-world systems. The framework allows for incorporating a large variety of scenarios, with a diverse set of probabilistic traffic behaviors, and for subsequently generating the appropriate safety constraints. We overcome issues of computational tractability by iteratively generating a set of safety constraints, based on reachability analysis, and generating counterexamples, i.e.,
traffic scenarios, using gradient-based probabilistic falsification. We judiciously account for the rules of the road in terms of state-space constraints enforced during reachability analysis. The empirical results on a variety of real-world-inspired scenarios validate the favorable performance of our approach and reaffirm its practical applicability. We envision that our method can be used to inform the decision-making and planning system of an autonomous agent about the appropriate safety constraints applicable in a particular traffic scenario. In future work, we plan to extend our method to allow for behavior sequencing (such as collisions leading to further collisions), to synthesize safety constraints that are simultaneously applicable across a wide variety of traffic scenarios, and to show its effectiveness in real-world experiments.

Acknowledgements. Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors, and not TRI or any other Toyota entity.
References

1. Althoff, M.: An introduction to CORA 2015. In: Proceedings of the Workshop on Applied Verification for Continuous and Hybrid Systems, pp. 120–151 (2015)
2. Althoff, M., Dolan, J.: Online verification of automated road vehicles using reachability analysis. IEEE Trans. Rob. 30, 903–918 (2014)
3. Alur, R., Courcoubetis, C., Halbwachs, N., Henzinger, T., Ho, P.H., Nicollin, X., Olivero, A., Sifakis, J., Yovine, S.: The algorithmic analysis of hybrid systems. Theoret. Comput. Sci. 138(1), 3–34 (1995)
4. Alur, R., Dang, T., Ivančić, F.: Predicate abstraction for reachability analysis of hybrid systems. ACM Trans. Embedded Comput. Syst. (TECS) 5(1), 152–199 (2006)
5. Baram, N., Anschel, O., Caspi, I., Mannor, S.: End-to-end differentiable adversarial imitation learning. In: ICML (2017)
6. Bhatia, A., Frazzoli, E.: Incremental search methods for reachability analysis of continuous and hybrid systems. In: Alur, R., Pappas, G.J. (eds.) Hybrid Systems: Computation and Control. Springer, Heidelberg (2004)
7. Chen, M., Hu, Q., Fisac, J.F., Akametalu, K., Mackin, C., Tomlin, C.J.: Reachability-based safety and goal satisfaction of unmanned aerial platoons on air highways. J. Guid. Control Dynam. 40(6), 1360–1373 (2017)
8. Chen, X., Ábrahám, E., Sankaranarayanan, S.: Flow*: an analyzer for non-linear hybrid systems. In: International Conference on Computer Aided Verification (2013)
9. Chen, X., Schupp, S., Makhlouf, I., Ábrahám, E., Frehse, G., Kowalewski, S.: A benchmark suite for hybrid systems reachability analysis. In: NASA Formal Methods Symposium (2015)
10. Cheng, P., Kumar, V.: Sampling-based falsification and verification of controllers for continuous dynamic systems. Int. J. Robot. Res. 27(11–12), 1232–1245 (2008)
11. Clarke, E., Grumberg, O., Jha, S., Lu, Y., Veith, H.: Counterexample-guided abstraction refinement. In: International Conference on Computer Aided Verification (2000)
12. Clarke, E., Grumberg, O., Long, D.: Verification tools for finite-state concurrent systems. In: Workshop/School/Symposium of the REX Project (Research and Education in Concurrent Systems) (1993)
13. Coulter, R.C.: Implementation of the pure pursuit path tracking algorithm. Technical report CMU-RI-TR-92-01, Carnegie Mellon University, Pittsburgh, PA (1992)
14. DeCastro, J.A., Kress-Gazit, H.: Nonlinear controller synthesis and automatic workspace partitioning for reactive high-level behaviors. In: ACM International Conference on Hybrid Systems: Computation and Control (HSCC), Vienna, Austria (2016)
15. Economic Commission for Europe – Inland Transport Committee, Vienna, Austria: Convention on Road Traffic, E/CONF.56/16/Rev.1/Amend.1 edn. (1968)
16. Erlien, S.M., Fujita, S., Gerdes, J.C.: Shared steering control using safe envelopes for obstacle avoidance and vehicle stability. IEEE Trans. Intell. Transp. Syst. 17, 441–451 (2016)
17. Fisac, J., Bajcsy, A., Herbert, S., Fridovich-Keil, D., Wang, S., Tomlin, C., Dragan, A.: Probabilistically safe robot planning with confidence-based human predictions. In: Proceedings of Robotics: Science and Systems, Pittsburgh, PA, June 2018
18. Gill, P.E., Murray, W., Saunders, M.A.: SNOPT: an SQP algorithm for large-scale constrained optimization. SIAM Rev. 47(1), 99–131 (2005)
19. Hargraves, C.R., Paris, S.W.: Direct trajectory optimization using nonlinear programming and collocation. AIAA J. Guid. 10, 338–342 (1987)
20. Ivanovic, B., Harrison, J., Sharma, A., Chen, M., Pavone, M.: BaRC: backward reachability curriculum for robotic reinforcement learning. arXiv:1806.06161 (2018)
21. Kapinski, J., Deshmukh, J., Sankaranarayanan, S., Arechiga, N.: Simulation-guided Lyapunov analysis for hybrid dynamical systems. In: Proceedings of the International Conference on Hybrid Systems: Computation and Control (2014)
22. Karlsson, J., Vasile, C.I., Tumova, J., Karaman, S., Rus, D.: Multi-vehicle motion planning for social optimal mobility-on-demand. In: IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia (2018)
23. Kim, E.S., Arcak, M., Seshia, S.A.: Compositional controller synthesis for vehicular traffic networks. In: IEEE Conference on Decision and Control (CDC), pp. 6165–6171 (2015)
24. Kim, E.S., Sadraddini, S., Belta, C., Arcak, M., Seshia, S.A.: Dynamic contracts for distributed temporal logic control of traffic networks. In: IEEE Conference on Decision and Control (CDC), pp. 3640–3645 (2017)
25. Kuefler, A., Morton, J., Wheeler, T.A., Kochenderfer, M.J.: Imitating driver behavior with generative adversarial networks. In: IEEE Intelligent Vehicles Symposium (IV), pp. 204–211 (2017)
26. Liebenwein, L., Baykal, C., Gilitschenski, I., Karaman, S., Rus, D.: Sampling-based approximation algorithms for reachability analysis with provable guarantees. In: Proceedings of Robotics: Science and Systems, Pittsburgh, PA, June 2018
27. Liebenwein, L., Schwarting, W., Vasile, C.I., DeCastro, J., Alonso-Mora, J., Karaman, S., Rus, D.: Compositional and contract-based verification for autonomous driving on road networks. In: International Symposium on Robotics Research (ISRR) (2017)
28. Mitchell, I.M., Bayen, A.M., Tomlin, C.J.: A time-dependent Hamilton-Jacobi formulation of reachable sets for continuous dynamic games. IEEE Trans. Autom. Control 50(7), 947–957 (2005)
29. Morton, J., Kochenderfer, M.J.: Simultaneous policy learning and latent state inference for imitating driver behavior. In: IEEE International Conference on Intelligent Transportation Systems (ITSC), pp. 1–6 (2017)
30. Plaku, E., Kavraki, L., Vardi, M.: Falsification of LTL safety properties in hybrid systems. In: International Conference on Tools and Algorithms for the Construction and Analysis of Systems (2009)
31. Sangiovanni-Vincentelli, A., Damm, W., Passerone, R.: Taming Dr. Frankenstein: contract-based design for cyber-physical systems. In: 2011 Control and Decision Conference and European Control Conference (2012)
32. Sankaranarayanan, S., Fainekos, G.: Falsification of temporal properties of hybrid systems using the cross-entropy method. In: ACM International Conference on Hybrid Systems: Computation and Control, pp. 125–134 (2012)
33. Schwarting, W., Alonso-Mora, J., Paull, L., Karaman, S., Rus, D.: Parallel autonomy in automated vehicles: safe motion generation with minimal intervention. In: IEEE International Conference on Robotics and Automation (ICRA) (2017)
34. Shalev-Shwartz, S., Shammah, S., Shashua, A.: Safe, multi-agent, reinforcement learning for autonomous driving. CoRR abs/1610.03295 (2016)
35. Tedrake, R., and the Drake Development Team: Drake: a planning, control, and analysis toolbox for nonlinear dynamical systems (2016). http://drake.mit.edu
36. Treiber, M., Kesting, A.: Traffic Flow Dynamics. Springer, Heidelberg (2013)
37. Vasile, C.I., Tumova, J., Karaman, S., Belta, C., Rus, D.: Minimum-violation scLTL motion planning for mobility-on-demand. In: IEEE International Conference on Robotics and Automation (ICRA), Singapore, pp. 1481–1488 (2017)
38. Wongpiromsarn, T., Topcu, U., Murray, R.M.: Receding horizon temporal logic planning for dynamical systems. In: IEEE Conference on Decision and Control (CDC) and Chinese Control Conference, pp. 5997–6004 (2009)
Author Index
A Abraham, Ian, 284 Agboh, Wisdom C., 160 Alempijevic, Alen, 357 Alterovitz, Ron, 69, 246 Amato, Nancy M., 36 Argall, Brenna, 905 Aswani, Anil, 212 Axelrod, Brian, 852 B Balkcom, Devin, 529, 709 Basilico, Nicola, 762 Bauza, Maria, 508 Becerra, Israel, 89, 835 Bekris, Kostas, 778 Berenson, Dmitry, 228 Birchfield, Stan, 441 Bıyık, Erdem, 887 Boots, Byron, 441 Bowen, Chris, 246 Broad, Alexander, 905 C Calinon, Sylvain, 196 Carpin, Stefano, 762 Carvalho, J. Frederico, 19 Chakraborty, Nilanjan, 458 Chaudhuri, Swarat, 127 Chen, Min, 921 Chen, Mo, 545, 581 Cheng, Ching-An, 441 Chiang, Hao-Tien Lewis, 52 Chinta, Rupesh, 817
Chirikjian, Gregory S., 3 Chou, Glen, 228 D Dantam, Neil T., 639 DeCastro, Jonathan, 939 Dissanayake, Gamini, 621 Dogar, Mehmet R., 160 F Fan, Taosha, 689 Faust, Aleksandra, 52 Fekete, Sándor P., 727 Folkerts, Michael, 338 Fox, Dieter, 441 Fox, Roy, 196 G Gans, Nicholas, 338 Ghosh, Mukulika, 36 Gmyr, Robert, 727 Goldberg, Ken, 196, 212 Goretkin, Gustavo, 373 H Halm, Mathew, 491 Halperin, Dan, 778, 799 Han, Shuai D., 817 Harrison, James, 318 Hauser, Kris, 565 Herbert, Sylvia L., 545 How, Jonathan P., 422 Hsu, David, 921
Huang, Shoudong, 621 Hugo, Sabrina, 727 I Ichnowski, Jeffrey, 69 Issac, Jan, 441 J Jiang, Steve, 338 K Kaelbling, Leslie Pack, 373 Karaman, Sertac, 939 Kavraki, Lydia E., 127 Keldenich, Phillip, 727 Khosoussi, Kasra, 422 Knepper, Ross A., 744 Koditschek, Daniel E., 406 Kragic, Danica, 19 Krishnan, Sanjay, 196 Kuindersma, Scott, 474, 656 L Laskey, Michael, 196, 212 LaValle, Steven M., 89 Lazar, Daniel A., 887 Lee, Jonathan, 196 Lee, Jonathan N., 212 Lee, Wee Sun, 921 Leighton, Brenton, 621 Liebenwein, Lucas, 939 Liu, Liyang, 621 Liu, Yi, 621 Livingston, Scott C., 581 Lozano-Pérez, Tomás, 373 M Ma, Qianli, 3 Maleki, Behnam, 357 Manocha, Dinesh, 673 Mavrogiannis, Christoforos I., 744 Mukadam, Mustafa, 441 Murphey, Todd, 689, 905 Murphey, Todd D., 284, 602 Murrieta-Cid, Rafael, 835 N Nguyen, Dan, 338 Nilles, Alexandra Q., 89 Nishimura, Haruki, 267
O O’Kane, Jason M., 106, 868 Ogunmolu, Olalekan, 338 Otte, Michael, 389 Ozay, Necmiye, 228 P Pan, Zherong, 673 Pavone, Marco, 179, 318, 545, 581 Pedarsani, Ramtin, 887 Pervan, Ana, 602 Plancher, Brian, 656 Poblete, Karen L., 3 Pokorny, Florian T., 19 Posa, Michael, 491 Prabhakar, Ahalya, 284 R Ratliff, Nathan, 441 Ren, Yingying, 89 Rodriguez, Alberto, 508 Ruan, Sipu, 3 Rus, Daniela, 939 S Saberifar, Fatemeh Zahra, 868 Sadigh, Dorsa, 887 Scheffer, Christian, 727 Schmidt, Arne, 727 Schultz, Jarvis, 689 Schwager, Mac, 267 Sharma, Apoorva, 318 Shell, Dylan A., 106, 868 Shimanuki, Luke, 852 Shome, Rahul, 778 Sindhwani, Vikas, 179 Singh, Sumeet, 179, 545 Slotine, Jean-Jacques E., 179 Sofge, Donald, 389 Solomon, Israela, 799 Solovey, Kiril, 778 Sugaya, Satomi, 52 Suryan, Varun, 301 T Tam, Qizhan, 581 Tanwani, Ajay Kumar, 196, 212 Tapia, Lydia, 52 Tedrake, Russ, 939 Thananjeyan, Brijen, 196 Thomas, Shawna, 36
Tian, Yulun, 422 Tokekar, Pratap, 144, 301 Tomlin, Claire J., 545 V Varava, Anastasiia, 19 Varin, Patrick, 474 Vasile, Cristian-Ioan, 939 Vasilopoulos, Vasileios, 406 Vidal-Calleja, Teresa, 357 W Wang, Weifu, 529 Wang, Yue, 127
X Xie, Jiayin, 458 Y Yan, Yan, 3 Yervilla-Herrera, Heikel, 835 Yu, Jingjin, 778, 817 Z Zhang, Teng, 621 Zhang, Yinan, 709 Zhang, Yulin, 106 Zhao, Liang, 621 Zhou, Lifeng, 144